Re: [geomesa-users] Tips on ingesting a lot of data into GeoMesa

Yeah, in general non-splittable files don't work great with map/reduce. We have some code that prevents files from being split up, so that a single mapper will see the entire file - we use that for xml and I believe SHP files.
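For reference, the standard Hadoop way to do that is to override isSplitable on the input format so that each file becomes a single split - roughly like this (a minimal sketch only; the class name is made up and this is not the actual GeoMesa code):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Hypothetical input format: every input file goes to exactly one mapper,
    // which is what you want for formats like XML or SHP that can't be cut
    // at arbitrary byte offsets.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // never split, regardless of file size or block boundaries
        }
    }

You would then register it on the job with job.setInputFormatClass(WholeFileTextInputFormat.class).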

The tools are usually a quick and relatively easy way for people to get started, but feel free to use whatever works best for you (obviously). We have lots of users with custom ingest setups.

For large data sets, you might try the NYC taxi data - they have 1.2B records available for download. I'm not sure how much that translates to in size on disk, though. We already have a converter for it, with more details here:

https://github.com/locationtech/geomesa/tree/master/geomesa-tools/conf/sfts/nyctaxi

Thanks,

Emilio

On 01/26/2017 05:37 PM, Damiano Albani wrote:
Hi Emilio,

On Thu, Jan 26, 2017 at 5:23 PM, Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx> wrote:

Also, performance might increase as you get more data into the system - as I mentioned before, sometimes you will only end up utilizing a single node of your BigTable cluster.

Yes, that is in line with what I read on https://cloud.google.com/bigtable/docs/performance:
  • The workload isn't appropriate for Cloud Bigtable. If you test with a small amount (< 300 GB) of data, or if you test for a very short period of time (seconds rather than minutes or hours), Cloud Bigtable won't be able to balance your data in a way that gives you good performance. It needs time to learn your access patterns, and it needs large enough shards of data to make use of all of the nodes in your cluster. 
Who would have thought that adding more data to a database makes it go faster 😁
Joking aside, I couldn't really find any truly huge public dataset that would meet the size requirement mentioned above. Do you have any ideas, by any chance?
 
So you were able to run a map/reduce ingest, but it performed horribly? In order to compare directly to the local ingest, you can try using the same command line tools you've been using, but put the files into HDFS - this will cause it to launch a map/reduce job (that tutorial is more of a proof-of-concept). You will need to have the appropriate HADOOP_HOME environment variable set, or manually copy the Hadoop configuration files onto the GeoMesa classpath. In addition, you will need to have your hbase-site.xml on the distributed Hadoop classpath - the easiest way to do this might be to copy it onto each node of your Hadoop cluster.
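To illustrate the hbase-site.xml part with the stock Hadoop API only (a rough sketch, not the actual GeoMesa tools code - the paths are placeholders for wherever the file lives on your cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class IngestJobSetup {
        public static Job newJob() throws Exception {
            Configuration conf = new Configuration();
            // local copy, read by the submitting JVM when it connects to HBase/BigTable
            conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
            Job job = Job.getInstance(conf, "geomesa-ingest");
            // the same file on a shared filesystem (e.g. HDFS), shipped to every
            // mapper's classpath via the distributed cache
            job.addFileToClassPath(new Path("hdfs:///config/hbase-site.xml"));
            return job;
        }
    }

Shipping the file with the job like this is one alternative to copying it onto every node by hand.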

I did put the input files into HDFS -- well, actually on Google Cloud Storage, but thanks to the Google connector, it behaves the same, only with a gs:// URI.
The deployment of the map/reduce job went fine with Google Dataproc (including shipping hbase-site.xml into the shaded JAR), but for some unknown reason, the performance was really slow.
But, as I said, it probably has to do with how I used / set up the whole thing. I'll investigate further and try to see where the issue lies.
 
The question is which part of the process is the bottleneck - if it's the GeoMesa ingest, then using map/reduce or more threads/processes will increase your throughput - but if you are maxing out your BigTable connection, then you will not see any increase (or possibly a decrease due to resource contention).

Yes indeed, I have to find what limits the performance as a whole. That's why I was wondering if a 20-30 MB/s write throughput is in the expected range for BigTable, especially once the database has some data in it.

But I'll also have a look to see whether Hadoop is the best (and easiest) tool for what I'd like to do.
Maybe a more common distributed task queue (à la Google Pub/Sub) would work just as well and be more resource-efficient.
Because I see, for example, that GDELTIngestMapper is built on TextInputFormat, since CSV files can be split by lines.
But how would that translate to, say, the processing of Shapefiles or XML files?

Best regards,

--
Damiano Albani
Geodan


_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.locationtech.org/mailman/listinfo/geomesa-users

