Re: [geomesa-users] Tips on ingesting a lot of data into GeoMesa

Hi Emilio,

On Thu, Jan 26, 2017 at 5:23 PM, Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx> wrote:

Also, performance might increase as you get more data into the system - as I mentioned before sometimes you will only end up utilizing a single node of your BigTable cluster.

Yes, that is in line with what I read on https://cloud.google.com/bigtable/docs/performance:
  • The workload isn't appropriate for Cloud Bigtable. If you test with a small amount (< 300 GB) of data, or if you test for a very short period of time (seconds rather than minutes or hours), Cloud Bigtable won't be able to balance your data in a way that gives you good performance. It needs time to learn your access patterns, and it needs large enough shards of data to make use of all of the nodes in your cluster. 
Who would have thought that adding more data to a database makes it go faster 😁
Joking aside, I couldn't really find any truly huge public dataset that would fit the size requirement mentioned above. Do you have any ideas, by any chance?
 
So you were able to run a map/reduce ingest, but it performed horribly? In order to compare directly to the local ingest, you can try using the same command line tools you've been using, but put the files into hdfs - this will cause it to launch a map/reduce job (that tutorial is more a proof-of-concept). You will need to have the appropriate HADOOP_HOME environment variable set, or manually copy the hadoop configuration files onto the GeoMesa classpath. In addition, you will need to have your hbase-site.xml on the distributed hadoop classpath - the easiest way to do this might be to copy it onto each node of your hadoop cluster.

I did put the input files into HDFS -- well, actually on Google Cloud Storage, but thanks to the Google connector, it behaves the same, only with a gs:// URI.
The deployment of the map/reduce job went fine with Google Dataproc (including shipping hbase-site.xml inside the shaded JAR), but for some unknown reason the performance was really slow.
But, as I said, it probably has to do with how I set up and used the whole thing. I'll investigate further and try to see where the issue lies.
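
One thing I plan to check is whether the hbase-site.xml shipped in the shaded JAR is actually the copy the tasks end up seeing. Something like this throwaway diagnostic might tell (the google.bigtable.* property names are the Cloud Bigtable HBase client's settings as I understand them; the class itself is just my sketch, not anything from GeoMesa):

    import java.net.URL;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class HBaseConfigCheck {
        public static void main(String[] args) {
            // HBaseConfiguration.create() layers hbase-site.xml (if it is
            // on the classpath) over the stock Hadoop configuration
            Configuration conf = HBaseConfiguration.create();

            // Which copy of hbase-site.xml actually won? null, or an
            // unexpected path, would mean the shaded copy was not picked up
            URL source = conf.getResource("hbase-site.xml");
            System.out.println("hbase-site.xml resolved from: " + source);

            // Cloud Bigtable HBase client settings; if these come back
            // null, the client is not talking to Bigtable as intended
            System.out.println("google.bigtable.project.id = "
                    + conf.get("google.bigtable.project.id"));
            System.out.println("google.bigtable.instance.id = "
                    + conf.get("google.bigtable.instance.id"));
        }
    }

Running the same lines from a mapper's setup() would show whether the task-side classpath differs from what I see locally.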
 
The question is which part of the process is the bottleneck - if it's the GeoMesa ingest, then using map/reduce or more threads/processes will increase your throughput - but if you are maxing out your BigTable connection, then you will not see any increase (or possibly a decrease due to resource contention).

Yes indeed, I have to find what limits the performance as a whole. That's why I was wondering whether a write throughput of 20-30 MB/s is within the expected range for BigTable, especially when the database is only lightly filled.
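
As a rough sanity check on my side: the performance page quoted above gives a per-node planning figure on the order of 10,000 writes per second for ~1 KB rows (that's my reading of the docs, so treat the numbers below as assumptions, not measurements):

    public class BigtableEnvelope {
        public static void main(String[] args) {
            int nodes = 3;                    // hypothetical cluster size
            int writesPerSecPerNode = 10_000; // order-of-magnitude planning figure
            int rowBytes = 1024;              // the docs assume ~1 KB rows
            double mbPerSec = nodes * writesPerSecPerNode
                    * (double) rowBytes / (1024 * 1024);
            System.out.printf("~%.0f MB/s with all %d nodes utilized%n",
                    mbPerSec, nodes);
            // => ~29 MB/s, so 20-30 MB/s would only be plausible when
            // writes are spread across the nodes rather than hitting a
            // single tablet
        }
    }

If that's roughly right, my 20-30 MB/s would already be close to what a small cluster can absorb, which would point at BigTable rather than the ingest side.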

But I'll also have a look to see whether Hadoop is the best (and easiest) tool for what I'd like to do.
Maybe a more common distributed task queue (à la Google Pub/Sub) would work just as well and be more resource-efficient.
I see, for example, that GDELTIngestMapper is built on TextInputFormat, which works because CSV files can be split line by line.
But how would that translate to processing, say, Shapefiles or XML files?
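
One pattern I've seen for that case (the classic Hadoop "whole file" recipe, not anything GeoMesa-specific; the class names below are mine) is to make the InputFormat non-splittable, so each mapper receives an entire file as a single record and parallelism comes from the number of files rather than from splitting one:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class WholeFileInputFormat
            extends FileInputFormat<NullWritable, BytesWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // a Shapefile or XML document only makes sense whole
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new WholeFileRecordReader();
        }

        // Emits exactly one record per file: (null, file contents)
        static class WholeFileRecordReader
                extends RecordReader<NullWritable, BytesWritable> {

            private FileSplit split;
            private Configuration conf;
            private final BytesWritable value = new BytesWritable();
            private boolean processed = false;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) {
                this.split = (FileSplit) split;
                this.conf = context.getConfiguration();
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) {
                    return false;
                }
                // Read the whole file into memory; fine for typical
                // Shapefile sizes, not for multi-GB inputs
                byte[] contents = new byte[(int) split.getLength()];
                Path file = split.getPath();
                FileSystem fs = file.getFileSystem(conf);
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.readFully(in, contents, 0, contents.length);
                }
                value.set(contents, 0, contents.length);
                processed = true;
                return true;
            }

            @Override
            public NullWritable getCurrentKey() { return NullWritable.get(); }

            @Override
            public BytesWritable getCurrentValue() { return value; }

            @Override
            public float getProgress() { return processed ? 1.0f : 0.0f; }

            @Override
            public void close() { }
        }
    }

For Shapefiles the mapper would additionally need the .dbf/.shx sidecar files, so in practice it might be easier to pass file paths around and let each worker fetch what it needs.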

Best regards,

--
Damiano Albani
Geodan
