Hi Damiano,
GDELT does have a fair number of invalid records, so that is normal
(however that might be slowing down the ingest due to the discarded
records). About the threading, one caveat is that each input file
will only be processed by a single thread - so if you specify more
threads than files, the excess will not be used. Also, you might
want to play around with JAVA_OPTS - increase memory, etc. In
addition, performance might increase as you get more data into the
system - as I mentioned before, sometimes you will only end up
utilizing a single node of your BigTable cluster.
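As a rough sketch, a local ingest invocation with those tweaks might look like the following - the binary name, catalog, and converter names here are placeholders, and the exact flags may differ in your GeoMesa version, so check "geomesa help ingest" for the real options:

```shell
# Give the ingest JVM more heap before launching (values are examples)
export JAVA_OPTS="-Xms2g -Xmx4g"

# Local ingest: each input file is handled by a single thread, so
# requesting more threads than files leaves the excess idle
geomesa-hbase ingest \
  --catalog gdelt_catalog \
  --converter gdelt \
  --spec gdelt \
  --threads 8 \
  /data/gdelt/*.csv
```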
I can't recall now the exact throughputs we've seen, but in general
those numbers seem reasonable for a first pass...
So you were able to run a map/reduce ingest, but it performed
horribly? In order to compare directly to the local ingest, you can
try using the same command-line tools you've been using, but put the
files into HDFS - this will cause GeoMesa to launch a map/reduce job
(that tutorial is more a proof-of-concept). You will need to have
the appropriate HADOOP_HOME environment variable set, or manually
copy the hadoop configuration files onto the GeoMesa classpath. In
addition, you will need to have your hbase-site.xml on the
distributed hadoop classpath - the easiest way to do this might be
to copy it onto each node of your hadoop cluster.
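The setup described above might look something like this - all paths and hostnames here are examples for illustration, so substitute your own cluster layout:

```shell
# Point the GeoMesa tools at your Hadoop installation (or manually
# copy the Hadoop config files onto the GeoMesa classpath instead)
export HADOOP_HOME=/usr/lib/hadoop

# Put the input files into HDFS so the ingest runs as a map/reduce job
hdfs dfs -mkdir -p /data/gdelt
hdfs dfs -put /local/gdelt/*.csv /data/gdelt/

# hbase-site.xml needs to be on the distributed Hadoop classpath -
# the simplest approach may be copying it into the Hadoop conf
# directory on each node of the cluster
scp /etc/hbase/conf/hbase-site.xml worker1:/etc/hadoop/conf/
```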
The question is which part of the process is the bottleneck - if
it's the GeoMesa ingest, then using map/reduce or more
threads/processes will increase your throughput - but if you are
maxing out your BigTable connection, then you will not see any
increase (or possibly a decrease due to resource contention).
Thanks,
Emilio
On 01/26/2017 10:38 AM, Damiano Albani wrote:
Hi Emilio (and everyone else),
_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.locationtech.org/mailman/listinfo/geomesa-users