Re: [geomesa-users] Tips on ingesting a lot of data into GeoMesa

Hello,

Is this a one-off ingest, or continuous streaming data?

BigTable is fairly opaque, in that it hides the database configuration from you, so optimizations are limited. There is no way to, for example, write database files directly, so whatever ingest mechanism you use will end up going through the same client writers. The bottleneck will likely be your BigTable instance itself; any client-side bottlenecks can be overcome by parallelizing your ingest clients. Client connections are configured through the hbase-site.xml file - I haven't experimented with it much, but there may be some optimizations possible there. One issue you might run into is BigTable node parallelism: GeoMesa creates some initial split points in the table structure, but my understanding is that BigTable will eventually collapse those back down if your data isn't large enough (in the terabytes). In that case you might end up writing through only a single node.
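As a rough illustration, an hbase-site.xml for the bigtable-hbase client might look like the following. The property and class names here are from the bigtable-hbase 1.x client and may differ in your version, and the write-buffer size is just an example knob, not a recommendation:

```xml
<configuration>
  <!-- Route the HBase client API to Cloud Bigtable
       (class name depends on your bigtable-hbase client version) -->
  <property>
    <name>hbase.client.connection.impl</name>
    <value>com.google.cloud.bigtable.hbase1_x.BigtableConnection</value>
  </property>
  <property>
    <name>google.bigtable.project.id</name>
    <value>my-gcp-project</value>
  </property>
  <property>
    <name>google.bigtable.instance.id</name>
    <value>my-bigtable-instance</value>
  </property>
  <!-- Standard HBase client setting: a larger buffer means fewer, bigger RPCs -->
  <property>
    <name>hbase.client.write.buffer</name>
    <value>8388608</value>
  </property>
</configuration>
```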

In general, you want to have your clients 'close' to your back end - so in this case running your ingestion in GCE. To get started, you can pretty easily use the GeoMesa command line tools for a local ingestion of flat files (you will have to define a GeoMesa converter that maps your data into SimpleFeatures). You can specify multiple local threads, up to the number of files you are processing. If you find that you need more ingest throughput, you can use the same converter to run a distributed map/reduce ingest. For BigTable, there may be some classpath issues to be sorted out with the GeoMesa map/reduce ingest - in particular getting your hbase-site.xml on the distributed classpath. If you go this route and hit any issues, let us know.
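A converter definition is just a TypeSafe-config (HOCON) file pairing a SimpleFeatureType with per-field transforms. A minimal sketch for a hypothetical CSV with columns id, name, ISO-8601 timestamp, longitude, latitude - the type name, attribute names, and column layout are all invented for illustration - might look like:

```hocon
geomesa.sfts.observations = {
  attributes = [
    { name = "name", type = "String" }
    { name = "dtg",  type = "Date" }
    { name = "geom", type = "Point", srid = 4326, default = true }
  ]
}
geomesa.converters.observations-csv = {
  type     = "delimited-text"
  format   = "CSV"
  id-field = "$id"  # use the CSV's own unique ID as the feature ID
  fields = [
    { name = "id",   transform = "$1" }
    { name = "name", transform = "$2" }
    { name = "dtg",  transform = "dateTime($3)" }
    { name = "geom", transform = "point($4::double, $5::double)" }
  ]
}
```

You would then point the command line tools at this definition by name when ingesting your files; check the tools' help output for the exact flags in your version.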

We don't currently have any tools for ingesting directly from another database - you could pretty easily write something custom, or just export to files and ingest those.

One minor GeoTools optimization is to use the PROVIDED_FID hint, if you already have unique IDs. Otherwise, GeoMesa will generate a UUID for each feature. (The converter framework I mentioned earlier supports this by default.)
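If you write SimpleFeatures yourself rather than going through the converter framework, the idea is to derive a stable ID from a natural key instead of letting a random UUID be assigned, so that re-ingesting a record overwrites rather than duplicates it. A minimal, GeoTools-free sketch of just the stable-ID part - the helper name and natural-key format are made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class FidDemo {
    // Hypothetical helper: derive a deterministic feature ID from a natural
    // key such as "sensorId|timestamp". The same input always yields the
    // same ID, unlike UUID.randomUUID().
    static String stableFid(String naturalKey) {
        byte[] bytes = naturalKey.getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(bytes).toString();
    }

    public static void main(String[] args) {
        // Printed twice to show the ID is stable across calls
        System.out.println(stableFid("sensor-42|2017-01-23T09:39:00Z"));
        System.out.println(stableFid("sensor-42|2017-01-23T09:39:00Z"));
    }
}
```

With an actual SimpleFeature you would then flag the feature so GeoTools keeps your ID, e.g. `feature.getUserData().put(Hints.USE_PROVIDED_FID, Boolean.TRUE)` before writing (the Hints class's package varies across GeoTools versions).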

Thanks,

Emilio

On 01/23/2017 09:39 AM, Damiano Albani wrote:
Hello,

Would someone have any particular advice to provide in the context of ingesting a lot of data into GeoMesa?
The target backend is HBase in my use case -- on Google BigTable to be more precise.
And the source data is stored in flat files and/or databases.

How should I architect the loading workflow in order to get the best performance and loading time?
I was thinking in terms of parallel jobs, Java tweaking, or even GeoTools settings.
So if you have some experience with filling a GeoMesa instance on HBase, I'd be glad to hear it.

Thanks!

--
Damiano Albani
Geodan


_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://www.locationtech.org/mailman/listinfo/geomesa-users
