Hello,
Is this a one-off ingest, or continuous streaming data?
BigTable is fairly opaque, in that it hides the database
configuration from you, so optimizations are limited. There is no
way to, for example, write database files directly, so whatever ingest
mechanism you use will end up using the same client writers. The
bottleneck will likely be your BigTable instance - any client
bottlenecks can be overcome by parallelizing your ingestion clients.
Client connections are configured through the hbase-site.xml file -
I haven't played around with it too much, but there might be some
optimizations possible there. An issue you might run into is
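For reference, a minimal hbase-site.xml for the Bigtable HBase client
looks roughly like the following. The property names come from the
bigtable-hbase client; the project/instance values and the exact
connection class are placeholders that depend on your client version:

```xml
<configuration>
  <!-- GCP project and Bigtable instance to connect to (placeholders) -->
  <property>
    <name>google.bigtable.project.id</name>
    <value>my-project</value>
  </property>
  <property>
    <name>google.bigtable.instance.id</name>
    <value>my-instance</value>
  </property>
  <!-- Route HBase client connections through the Bigtable adapter;
       the class name varies with the bigtable-hbase artifact you use -->
  <property>
    <name>hbase.client.connection.impl</name>
    <value>com.google.cloud.bigtable.hbase1_x.BigtableConnection</value>
  </property>
</configuration>
```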
BigTable node parallelism - GeoMesa creates some initial split
points in the table structure, but my understanding is that BigTable
will eventually collapse those back down if your data isn't large
enough (on the order of terabytes). Thus, you might end up utilizing only a single node
for writing.
In general, you want to have your clients 'close' to your back end -
so in this case running your ingestion in GCE. To get started, you
can pretty easily use the GeoMesa command line tools for a local
ingestion of flat files (you will have to define a GeoMesa converter
that maps your data into SimpleFeatures). You can specify multiple
local threads, up to the number of files you are processing. If you
find that you need more ingest throughput, you can use the same
converter to run a distributed map/reduce ingest. For BigTable,
there may be some classpath issues to be sorted out with the GeoMesa
map/reduce ingest - in particular getting your hbase-site.xml on the
distributed classpath. If you go this route and hit any issues, let
us know.
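As a rough sketch, a converter definition for a CSV file might look
like the following. The type name, attribute names, and transforms
here are invented for illustration - check the converter docs for the
exact transform function names supported by your version:

```hocon
# SimpleFeatureType definition (names are examples only)
geomesa.sfts.sensor = {
  attributes = [
    { name = "name", type = "String" }
    { name = "dtg",  type = "Date" }
    { name = "geom", type = "Point", srid = 4326, default = true }
  ]
}

# Converter mapping CSV columns to the attributes above
geomesa.converters.sensor-csv = {
  type     = "delimited-text"
  format   = "CSV"
  id-field = "$id"   # use a column as the feature ID instead of a generated UUID
  fields = [
    { name = "id",   transform = "$1" }
    { name = "name", transform = "$2" }
    { name = "dtg",  transform = "date('yyyy-MM-dd', $3)" }
    { name = "lon",  transform = "$4::double" }
    { name = "lat",  transform = "$5::double" }
    { name = "geom", transform = "point($lon, $lat)" }
  ]
}
```

You would then invoke something along the lines of
`geomesa-hbase ingest -c mycatalog -s sensor -C sensor-csv -t 8 data/*.csv`
(exact command and flag names may differ between tool versions).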
We don't currently have any tools for ingesting directly from
another database - you could pretty easily write something custom,
or just export to files and ingest those.
One minor GeoTools optimization is to use the PROVIDED_FID hint, if
you already have unique IDs. If not, GeoMesa will generate UUIDs for
each feature. (The converter framework I mentioned earlier supports
this by default.)
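If you are writing features through the GeoTools API directly, setting
the hint is a one-liner per feature. A minimal sketch, assuming a
SimpleFeature built elsewhere (e.g. via SimpleFeatureBuilder); the
Hints import path shown is from GeoTools of that era:

```
import org.geotools.factory.Hints;
import org.opengis.feature.simple.SimpleFeature;

// Tell the datastore to keep the feature's existing ID
// instead of generating a UUID for it on write
feature.getUserData().put(Hints.USE_PROVIDED_FID, Boolean.TRUE);
```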
Thanks,
Emilio
On 01/23/2017 09:39 AM, Damiano Albani wrote:
Hello,
Would someone have any particular advice to provide in the
context of ingesting a lot of data into GeoMesa?
The target backend is HBase in my use case -- on Google
BigTable to be more precise.
And the source data is stored in flat files and/or
databases.
How should I architect the loading workflow in order to
get the best performance and loading time?
I was thinking in terms of parallel jobs, Java tweaking or
even GeoTools settings.
So if you have some experience with filling a GeoMesa
instance on HBase, I'd be glad to hear it.
Thanks!
--
_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://www.locationtech.org/mailman/listinfo/geomesa-users