Re: [geomesa-users] Tips on ingesting a lot of data into GeoMesa

Hi Damiano,

GDELT does have a fair number of invalid records, so that is normal (though the discarded records may be slowing down the ingest somewhat). Regarding threading, one caveat is that each input file is processed by a single thread - so if you specify more threads than files, the excess threads will sit idle. You might also want to play around with JAVA_OPTS - increase memory, etc. Finally, performance may improve as you get more data into the system - as I mentioned before, sometimes you end up utilizing only a single node of your BigTable cluster.
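To make that concrete, a local ingest tuned along those lines might look something like the sketch below. The catalog/spec/converter names, file paths, and heap sizes are placeholders - adjust them to your own setup:

```shell
# Give the ingest JVM more headroom (values are illustrative):
export JAVA_OPTS="-Xms2g -Xmx8g"

# Local ingest with 8 threads. Note each input file is handled by a
# single thread, so a thread count beyond the number of input files
# buys nothing:
geomesa-bigtable ingest \
  --catalog gdelt \
  --spec gdelt \
  --converter gdelt \
  --threads 8 \
  /data/gdelt/201701*.csv
```

If you have one huge file, splitting it into several smaller ones lets the thread pool actually parallelize.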

I can't recall now the exact throughputs we've seen, but in general those numbers seem reasonable for a first pass...

So you were able to run a map/reduce ingest, but it performed horribly? In order to compare directly against the local ingest, you can try using the same command-line tools you've been using, but put the files into HDFS - this will cause it to launch a map/reduce job (that tutorial is more of a proof-of-concept). You will need to have the appropriate HADOOP_HOME environment variable set, or manually copy the hadoop configuration files onto the GeoMesa classpath. In addition, you will need to have your hbase-site.xml on the distributed hadoop classpath - the easiest way to do this might be to copy it onto each node of your hadoop cluster.
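As a rough sketch of those steps (node names, paths, and the namenode address are all placeholders for whatever your cluster uses):

```shell
# GeoMesa picks up the hadoop configuration via HADOOP_HOME:
export HADOOP_HOME=/usr/lib/hadoop

# Make hbase-site.xml visible to the distributed tasks, e.g. by
# copying it into the hadoop conf directory on every worker node:
for node in worker-0 worker-1 worker-2; do
  scp /etc/hbase/conf/hbase-site.xml "$node":/etc/hadoop/conf/
done

# Stage the input files in HDFS; pointing the ingest command at
# hdfs:// paths is what triggers the map/reduce job instead of a
# local one:
hdfs dfs -mkdir -p /gdelt
hdfs dfs -put /data/gdelt/*.csv /gdelt/

geomesa-bigtable ingest \
  --catalog gdelt \
  --spec gdelt \
  --converter gdelt \
  'hdfs://namenode:8020/gdelt/*.csv'
```

That keeps everything else identical to the local runs, so any throughput difference should be attributable to the distributed execution itself.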

The question is which part of the process is the bottleneck - if it's the GeoMesa ingest, then using map/reduce or more threads/processes will increase your throughput; but if you are maxing out your BigTable connection, then you will not see any increase (and possibly a decrease due to resource contention).

Thanks,

Emilio

On 01/26/2017 10:38 AM, Damiano Albani wrote:
Hi Emilio (and everyone else),

On Wed, Jan 25, 2017 at 3:19 PM, Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx> wrote:
Awesome, let us know how it goes!

I've had a try at the ingestion of the GDELT (1.0 Event) dataset, following the nice tutorial that is provided.
Apart from a significant number of rows being rejected by the ingest tool (data format issue?), it worked as expected.
I have one remark though regarding the threading functionality: it seems that setting a high value didn't make any difference to the performance.
Running a second process of geomesa-bigtable ingest did increase the speed of the data ingestion.
Speaking of performance, I happened to reach ~ 20 MB/s write throughput on a "default" BigTable instance (i.e. 3 nodes on SSD). And ~ 30 MB/s with the second process I mentioned above.
I don't know if you used BigTable in that context, but does it seem to match the average expected performance?

The second step of my testing of GeoMesa is now to try using Hadoop and MapReduce jobs to further improve the performance. Are my expectations correct in that regard, by the way?
I have actually followed the tutorial found on Github which, after adaptation to the Google Cloud environment, kind of worked: the "only" thing being that the performance was atrocious?!
I suppose it has to do with my (limited) knowledge of Hadoop, MapReduce and the Google Dataproc product.
In case you have some experience on that subject, I'd be happy to hear any advice or things I need to pay attention to.

Thanks,

--
Damiano Albani
Geodan


_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.locationtech.org/mailman/listinfo/geomesa-users