Hi Mario,
10k records/s seems reasonable. We were just doing some
single-threaded ingestion (without any optimizations) on a
10-node cluster and were seeing speeds of about 14k/s. That
includes parsing data off disk, of course. Distributing the
ingestion over map/reduce we got about 50k/s. This was all just
using geotools feature writers.
That said, there are a lot of things you could do to improve
ingest performance:
* If the ID field is unique, then you can use it as the
simple feature ID instead, and then you don't need to store or
index it as a separate attribute. That will decrease the amount
of data written and speed up your ingest. You can still query
for ID by using CQL: IN('myid'). Also, if the feature ID is not
set, we generate a semi-random UUID, which can be (relatively)
slow.
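As a sketch, using the GeoTools feature writers mentioned above (not compiled; package names and writer setup vary by version, and recordId/values are illustrative):

```java
import org.geotools.factory.Hints;
import org.geotools.filter.text.ecql.ECQL;
import org.opengis.feature.simple.SimpleFeature;

// While writing: reuse the record's unique ID as the feature ID, so it
// is not stored or indexed as a separate attribute and no UUID is generated
SimpleFeature sf = featureWriter.next();
sf.setAttributes(values);
sf.getUserData().put(Hints.USE_PROVIDED_FID, Boolean.TRUE);
sf.getUserData().put(Hints.PROVIDED_FID, recordId);
featureWriter.write();

// Later, to look a record up by its feature ID:
featureSource.getFeatures(ECQL.toFilter("IN('myid')"));
```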
* Depending on your query requirements, you can turn off the
writing of various indices. Note that this may make certain
queries much slower. You can do this by setting the user data in
your simple feature type before calling createSchema; the value
lists the indices that will still be written:
"table.indexes.enabled" -> "records,z3"
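A minimal sketch of that (DataUtilities used for brevity; the type name and attributes are made up):

```java
import org.geotools.data.DataUtilities;
import org.opengis.feature.simple.SimpleFeatureType;

SimpleFeatureType sft = DataUtilities.createType("mytype",
    "dtg:Date,*geom:Point:srid=4326,name:String");
// write only the record and z3 indices; any others are skipped
sft.getUserData().put("table.indexes.enabled", "records,z3");
dataStore.createSchema(sft);
```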
* Tweaking the Accumulo batch writer buffer size. Buffering
more entries in memory usually increases throughput, at the
expense of heap space. You can control the batch writer settings
through system properties:
"geomesa.batchwriter.latency.millis"
"geomesa.batchwriter.memory" // Measured in bytes
"geomesa.batchwriter.maxthreads"
"geomesa.batchwriter.timeout.millis"
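For example (the property names are the ones above; the values are illustrative and should be set before the GeoMesa data store is created):

```java
// Batch writer tuning via system properties - larger buffers trade
// heap space for write throughput
long bufferBytes = 100L * 1024 * 1024; // 100 MB of buffered entries, in bytes
System.setProperty("geomesa.batchwriter.memory", Long.toString(bufferBytes));
System.setProperty("geomesa.batchwriter.latency.millis", "60000");  // flush at least once a minute
System.setProperty("geomesa.batchwriter.maxthreads", "8");          // writer threads
System.setProperty("geomesa.batchwriter.timeout.millis", "600000"); // give up after 10 minutes
```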
* Aggressive pre-splitting of tables in Accumulo. GeoMesa
adds some splits on table creation, but if you know your data
distribution and size (or can extrapolate it), adding splits to
the table will result in data being written across different
tservers, which will increase write speeds.
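If your row keys are roughly uniform, one simple way to come up with split points is to spread them over the leading byte - a sketch only, since the actual GeoMesa key layout depends on the index, so inspect your tables before splitting for real:

```java
import java.util.ArrayList;
import java.util.List;

// Evenly spaced one-byte hex split points, assuming row keys are
// roughly uniform over their leading byte
int numSplits = 16;
List<String> splits = new ArrayList<>();
for (int i = 1; i < numSplits; i++) {
    splits.add(String.format("%02x", i * 256 / numSplits));
}
// pass these to the shell's addsplits command, or to
// TableOperations.addSplits(), to spread writes across tservers
System.out.println(splits);
```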
* Ensuring Accumulo is using native memory maps instead of
Java maps
* Turning off the Accumulo write-ahead logs - this usually
speeds things up a lot, at the expense of losing data if a
tserver crashes.
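Both of those are plain Accumulo settings rather than GeoMesa ones - something like the following, though double-check the property names against your Accumulo version:

```shell
# in the accumulo shell: disable write-ahead logs for one table
# (data may be lost if a tserver crashes)
config -t mytable -s table.walog.enabled=false

# in accumulo-site.xml on the tservers: use native memory maps
# tserver.memory.maps.native.enabled=true
```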
If you're using map/reduce, you might also try out the
GeoMesaOutputFormat - it delegates to the AccumuloOutputFormat.
Although we haven't implemented it for GeoMesa, using the
AccumuloFileOutputFormat and bulk importing the resulting files
is generally the fastest way to get data into Accumulo.
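For reference, the bulk-import side of that is plain Accumulo (1.x client API; the table name and paths are illustrative, and the failure directory must exist and be empty):

```java
// after a map/reduce job writes RFiles via AccumuloFileOutputFormat:
connector.tableOperations().importDirectory(
    "mytable",           // target table
    "/tmp/ingest/files", // directory of RFiles produced by the job
    "/tmp/ingest/fail",  // failure directory
    false);              // setTime: keep the timestamps from the files
```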
Hope that helps. If you try any of this, please circle back
and let us know the outcome. Also, pull requests are always
appreciated :)
Thanks,
Emilio
On Thu, 2015-11-12 at 17:09 +0100, Mario Pastorelli wrote:
Hello,
I'm testing GeoMesa for our data and I noticed that the
maximum speed achievable in my tests is around 10k
records/second. I have 4 servers and this means 40k
records/second which is low for the kind of data that I have
to ingest. I can't find good benchmarks of GeoMesa, so I was
wondering if 10k/s per server is what should be expected from
GeoMesa.
The data is straightforward: it has a date (time-index), a
location (space-index), an id (index=true) and seven other
fields that shouldn't be indexed. The data is read from HDFS
and written using the GeoMesa library without any other logic.
Thanks,
Mario Pastorelli
_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
http://www.locationtech.org/mailman/listinfo/geomesa-users