
Re: [geomesa-users] Tips on ingesting a lot of data into GeoMesa

Hi Damiano,

No problem, more replies inline.

Thanks,

Emilio

On 01/23/2017 02:42 PM, Damiano Albani wrote:
Hello Emilio,

First, thanks for taking the time to write such a detailed answer!

On Mon, Jan 23, 2017 at 5:56 PM, Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx> wrote:

Is this a one-off ingest, or continuous streaming data?

My use case currently deals with a one-off ingest -- at least that's where performance is critical, given the amount of data to process.
 
BigTable is fairly opaque, in that it hides any database configuration from you.

It's true that there's not much to configure apart from the instance ID and the project ID. But I suppose that's the beauty of the thing 😊
 
Thus, optimizations are limited. There is no way to e.g. write database files directly, so whatever ingest mechanism you use will end up using the same client writers. The bottleneck will likely be your BigTable instance - any client bottlenecks can be overcome by parallelizing your ingestion clients. Client connections are configured through the hbase-site.xml file - I haven't played around with it too much, but there might be some optimizations possible there.

I'll have a look to see if there are any settings to "tweak" in the HBase client talking to BigTable.
(I remember having to set some timeout value, by the way.)
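For what it's worth, here's roughly what I have in mind -- a Java sketch, with the google.bigtable.* property names recalled from the bigtable-hbase client documentation (so worth double-checking), and "my-project" / "my-instance" being placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class BigtableClientSettings {
        public static Configuration create() {
            // picks up hbase-site.xml from the classpath if present; the same
            // keys can be set in that file instead of programmatically
            Configuration conf = HBaseConfiguration.create();
            conf.set("google.bigtable.project.id", "my-project");   // placeholder GCP project ID
            conf.set("google.bigtable.instance.id", "my-instance"); // placeholder instance ID
            // the timeout I remember having to raise -- property name from memory, verify it:
            conf.set("google.bigtable.rpc.timeout.ms", "60000");
            return conf;
        }
    }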
 
An issue you might run into is BigTable node parallelism - GeoMesa creates some initial split points in the table structure, but my understanding is that BigTable will eventually collapse those back down if your data isn't large enough (i.e. in the terabytes). Thus, you might only be utilizing a single node for writing.
 
I suppose what you're saying relates to the point in BigTable's documentation titled "The workload isn't appropriate for Cloud Bigtable"?
 
In general, you want to have your clients 'close' to your back end - so in this case running your ingestion in GCE.

Yes, I've indeed been using GCE so far, for performance and cost reasons.
 
To get started, you can pretty easily use the GeoMesa command line tools for a local ingestion of flat files (you will have to define a GeoMesa converter that maps your data into SimpleFeatures). You can specify multiple local threads, up to the number of files you are processing.
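For example, an invocation would look something like this (a sketch - the exact flag names can differ between versions, so check 'geomesa help ingest'; my_catalog, my_sft and my_converter are placeholders for your catalog table, SimpleFeatureType and converter definition):

    geomesa ingest -c my_catalog -s my_sft -C my_converter -t 8 /data/*.csv

where -t is the number of local threads.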

Speaking of the command line tools, it would be nice if the binaries for the BigTable backend were built by default and provided in the Maven repository.

Because we're an Eclipse project, anything we host has to be blessed by Eclipse for provenance and license. As we haven't gotten this sign-off on all the BigTable dependencies yet, we unfortunately can't bundle it - we can still use it as a plugin, but you have to build it yourself. Hopefully we will be able to get it approved soon.
If you find that you need more ingest throughput, you can use the same converter to run a distributed map/reduce ingest. For BigTable, there may be some classpath issues to be sorted out with the GeoMesa map/reduce ingest - in particular getting your hbase-site.xml on the distributed classpath. If you go this route and hit any issues, let us know.

Using map/reduce is an interesting idea. I personally don't have much experience in that subject, but I'll definitely have a look.

Since you're doing a one-time bulk ingest, map/reduce could be a good fit. Depending on your inputs, our tools should make it fairly easy (with the classpath caveat I mentioned). If you have a cluster to run on, and your inputs are flat files, it will handle all the multi-threading and load-balancing for you.
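One thing that might help with the classpath caveat: I believe the tools scripts pick up extra entries from the GEOMESA_EXTRA_CLASSPATHS environment variable, so something like the following might be enough to get your hbase-site.xml included - that's an assumption on my part, so verify it against the tools docs:

    # assumption: /path/to/conf is the directory containing hbase-site.xml
    export GEOMESA_EXTRA_CLASSPATHS=/path/to/conf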


We don't currently have any tools for ingesting directly from another database - you could pretty easily write something custom, or just export to files and ingest those.

If I'm not mistaken, GeoMesa relies on GeoTools in order to support the various file formats to read from, right?
While super useful, this abstraction makes it quite difficult for an "outside observer" like me to understand what happens precisely when, say, I call something like:

      bigTableFeatureStore.addFeatures(inputFeatureCollection);

Where inputFeatureCollection comes from, say, a huge Shapefile or CSV file.
Between this very simple call and the source data finally arriving in BigTable perfectly organised, quite a lot of things happen 😁
For example, during this pipe-like operation, is there any kind of buffering or "batch" commit? Can I expect GeoTools to process large amounts of features and not leak any memory or such?
That's the kind of topic where my understanding is a bit limited, given that everything just works automagically!
GeoTools does have a lot of different ways to accomplish the same thing. The main underlying abstraction for writing is the FeatureWriter (either Append or Modify) - if you look at the addFeatures method, we just use a feature writer:

https://github.com/locationtech/geomesa/blob/master/geomesa-index-api/src/main/scala/org/locationtech/geomesa/index/geotools/GeoMesaFeatureStore.scala#L31-L53

Buffering is implementation dependent - for GeoMesa HBase/BigTable, we use an underlying org.apache.hadoop.hbase.client.BufferedMutator. You can control the batch size through the system property 'geomesa.hbase.write.batch'. If you want finer control, you can also cast a FeatureWriter to org.locationtech.geomesa.hbase.data.HBaseAppendFeatureWriter, which includes a 'flush' method (you can get a feature writer through dataStore.getFeatureWriterAppend).
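To make that concrete, here's a minimal Java sketch of writing through the feature writer directly - 'dataStore', "mySchema" and 'inputFeatures' are placeholders for your own data store, feature type name and input:

    import java.io.IOException;
    import java.util.List;
    import org.geotools.data.DataStore;
    import org.geotools.data.FeatureWriter;
    import org.geotools.data.Transaction;
    import org.opengis.feature.simple.SimpleFeature;
    import org.opengis.feature.simple.SimpleFeatureType;

    public class BulkWriter {
        // copies pre-built features into GeoMesa through an append feature writer
        public static void write(DataStore dataStore, List<SimpleFeature> inputFeatures)
                throws IOException {
            // optional: tune the underlying BufferedMutator batch size
            System.setProperty("geomesa.hbase.write.batch", "10000");
            try (FeatureWriter<SimpleFeatureType, SimpleFeature> writer =
                     dataStore.getFeatureWriterAppend("mySchema", Transaction.AUTO_COMMIT)) {
                for (SimpleFeature feature : inputFeatures) {
                    SimpleFeature toWrite = writer.next();          // next 'blank' feature to fill in
                    toWrite.setAttributes(feature.getAttributes()); // copy attributes across
                    writer.write();                                 // buffered - not an immediate RPC
                }
            } // close() flushes any remaining buffered mutations
        }
    }

If you need the explicit flush I mentioned, the cast to HBaseAppendFeatureWriter would go inside that try block.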

As I mentioned before, all methods of writing through GeoMesa will end up funneling through that feature writer class, so this applies across the board.

As for the input side, we use a combination of GeoTools data stores and custom code. Our converter framework is designed to convert flat files into simple features in a streaming fashion, and I can attest that it handles memory well. The other GeoTools data stores may work differently (loading the entire file at once, etc) - I'm not entirely sure there.


Going back to your idea of splitting the input files, could this also be done dynamically, based on an org.geotools.data.{Query + Filter} combination which would sort of "shard" the data? For example, with 5 ingester threads, each one of them would process a fifth of the FeatureCollection (via a modulo 5).
Does that make any sense, or would it not be very reliable / efficient in practice?

That may be reasonable depending on your input data (using a GeoTools query implies that you already have your data in a GeoTools data store). Feature writers and readers are all single-threaded though, so you would want to load 5 different feature collections by splitting your data on some queryable attribute (e.g. by month if your data has timestamps).
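As a rough Java sketch of that idea - 'sourceStore', 'sinkStore', "mySchema" and the 'month' attribute are all hypothetical:

    import org.geotools.data.DataStore;
    import org.geotools.data.FeatureReader;
    import org.geotools.data.FeatureWriter;
    import org.geotools.data.Query;
    import org.geotools.data.Transaction;
    import org.geotools.filter.text.ecql.ECQL;
    import org.opengis.feature.simple.SimpleFeature;
    import org.opengis.feature.simple.SimpleFeatureType;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelCopy {
        // partitions the source by a 'month' attribute and copies each slice on its own thread
        public static void copy(DataStore sourceStore, DataStore sinkStore) {
            ExecutorService pool = Executors.newFixedThreadPool(5); // 5 concurrent ingest threads
            for (int month = 1; month <= 12; month++) {
                final int m = month;
                pool.submit(() -> {
                    // readers/writers are single-threaded, so each task creates its own pair
                    Query query = new Query("mySchema", ECQL.toFilter("month = " + m));
                    try (FeatureReader<SimpleFeatureType, SimpleFeature> reader =
                             sourceStore.getFeatureReader(query, Transaction.AUTO_COMMIT);
                         FeatureWriter<SimpleFeatureType, SimpleFeature> writer =
                             sinkStore.getFeatureWriterAppend("mySchema", Transaction.AUTO_COMMIT)) {
                        while (reader.hasNext()) {
                            SimpleFeature next = reader.next();
                            SimpleFeature toWrite = writer.next();
                            toWrite.setAttributes(next.getAttributes());
                            writer.write();
                        }
                    }
                    return null; // the lambda is a Callable, so checked exceptions can propagate
                });
            }
            pool.shutdown();
        }
    }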

One minor GeoTools optimization is to use the PROVIDED_FID hint, if you already have unique IDs. If not, GeoMesa will generate UUIDs for each feature. (The converter framework I mentioned earlier supports this by default.)
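A quick Java sketch of what that looks like - 'feature' is any SimpleFeature you're about to write, and "my-unique-id" is a placeholder:

    import org.geotools.factory.Hints;

    // tell the data store to keep this feature's existing ID rather than generating a UUID
    feature.getUserData().put(Hints.USE_PROVIDED_FID, Boolean.TRUE);
    // or supply an explicit ID for this feature:
    feature.getUserData().put(Hints.PROVIDED_FID, "my-unique-id");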

That's good to know, thanks for the tip!

Best regards,

--
Damiano Albani
Geodan



