Re: [geomesa-users] Ingestion speed

Hi Mario,

10k records/s seems reasonable. We were just doing some single-threaded ingestion (without any optimizations) on a 10-node cluster and were seeing speeds of about 14k/s. That includes parsing data off disk, of course. Distributing the ingestion over map/reduce we got about 50k/s. This was all just using geotools feature writers.

That said, there are a lot of things you could do to improve ingest performance:

* If the ID field is unique, you can use it as the simple feature ID instead; then you don't need to store or index it as a separate attribute. That decreases the amount of data written and speeds up your ingest. You can still query by ID using CQL: IN('myid'). Also, if the feature ID is not set, we generate a semi-random UUID, which can be (relatively) slow.
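In case it's useful, here is a minimal, stdlib-only sketch of the idea. The real mechanism in GeoTools is putting Hints.USE_PROVIDED_FID into the feature's user data before writing; the string key and helper method below are stand-ins for illustration, not GeoMesa API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class FeatureIdSketch {
    // Stand-in for SimpleFeature user data; with GeoTools you would call
    // feature.getUserData().put(Hints.USE_PROVIDED_FID, Boolean.TRUE)
    static String resolveFeatureId(Map<String, Object> userData, String providedId) {
        boolean useProvided = Boolean.TRUE.equals(userData.get("USE_PROVIDED_FID"));
        if (useProvided && providedId != null) {
            return providedId;               // reuse the record's unique ID: no extra attribute to store or index
        }
        return UUID.randomUUID().toString(); // otherwise a (relatively slow) semi-random UUID is generated
    }

    public static void main(String[] args) {
        Map<String, Object> userData = new HashMap<>();
        userData.put("USE_PROVIDED_FID", Boolean.TRUE);
        System.out.println(resolveFeatureId(userData, "record-0001")); // prints record-0001
    }
}
```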

* Depending on your query requirements, you can turn off the writing of various indices. Note that this may make certain queries much slower. You can do this by setting the user data in your simple feature type before calling createSchema:
	"table.indexes.enabled" -> "records,z3"
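For illustration, SimpleFeatureType user data is just a map, so setting this option before createSchema looks roughly like the sketch below (the plain HashMap stands in for sft.getUserData(); the createSchema call shown in the comment is the real GeoTools/GeoMesa entry point):

```java
import java.util.HashMap;
import java.util.Map;

public class IndexConfigSketch {
    // Stand-in for SimpleFeatureType.getUserData(); with GeoTools you would call
    // sft.getUserData().put("table.indexes.enabled", "records,z3") before dataStore.createSchema(sft)
    static Map<Object, Object> enableIndexes(Map<Object, Object> userData, String indexes) {
        userData.put("table.indexes.enabled", indexes);
        return userData;
    }

    public static void main(String[] args) {
        Map<Object, Object> userData = enableIndexes(new HashMap<>(), "records,z3");
        System.out.println(userData.get("table.indexes.enabled")); // prints records,z3
        // createSchema would then write only the record and Z3 index tables
    }
}
```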

* Tuning the accumulo batch writer. Buffering more entries in memory usually increases throughput, at the expense of heap space. You can control the batch writer settings through system properties:
	"geomesa.batchwriter.latency.millis"
	"geomesa.batchwriter.memory" // measured in bytes
	"geomesa.batchwriter.maxthreads"
	"geomesa.batchwriter.timeout.millis"
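Since these are plain JVM system properties, you can set them in code before creating the data store (or pass them with -D on the command line). The values below are only examples, not recommendations:

```java
public class BatchWriterConfigSketch {
    public static void main(String[] args) {
        // GeoMesa reads these system properties when it constructs its Accumulo batch writer
        System.setProperty("geomesa.batchwriter.latency.millis", "60000");
        System.setProperty("geomesa.batchwriter.memory", String.valueOf(50 * 1024 * 1024)); // 50 MB, in bytes
        System.setProperty("geomesa.batchwriter.maxthreads", "8");
        System.setProperty("geomesa.batchwriter.timeout.millis", "600000");

        System.out.println(System.getProperty("geomesa.batchwriter.memory")); // prints 52428800
    }
}
```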

* Aggressive pre-splitting of tables in accumulo. GeoMesa adds some splits on table creation, but if you know your data distribution and size (or can extrapolate it), adding splits to the table will result in data being written across different tservers, which will increase write speeds.
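For example, splits can be added from the accumulo shell. The table name and split points below are placeholders; real split points should be derived from your data's key distribution:

```
addsplits -t geomesa.mySchema_z3 01 02 03 04
# or load many split points from a file, one per line
addsplits -t geomesa.mySchema_z3 -sf /tmp/splits.txt
```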

* Ensuring accumulo is using native memory maps instead of java maps
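Native maps are controlled by a tserver property in accumulo-site.xml (the accumulo native libraries must be built and installed for your platform first):

```xml
<!-- accumulo-site.xml -->
<property>
  <name>tserver.memory.maps.native.enabled</name>
  <value>true</value>
</property>
```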

* Turning off the accumulo write-ahead logs - this usually speeds things up a lot, at the expense of losing data in the case of a crash.
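The write-ahead log can be disabled per table from the accumulo shell (the table name below is a placeholder):

```
config -t geomesa.mySchema_records -s table.walog.enabled=false
```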

If you're using map/reduce, you might also try out the GeoMesaOutputFormat - it delegates to the AccumuloOutputFormat. Although we haven't implemented it for GeoMesa, using the AccumuloFileOutputFormat and bulk importing the resulting files is generally the fastest way to get data into accumulo.
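For reference, RFiles written by the AccumuloFileOutputFormat can be loaded with the accumulo shell's importdirectory command (the paths below are placeholders; the second argument is a directory for any files that fail to import, and the final flag controls whether timestamps are set on import):

```
importdirectory /data/rfiles /data/rfiles-failed true
```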

Hope that helps. If you try any of this, please circle back and let us know the outcome. Also, pull requests are always appreciated :)

Thanks,

Emilio

On Thu, 2015-11-12 at 17:09 +0100, Mario Pastorelli wrote:
Hello,

I'm testing GeoMesa with our data, and I noticed that the maximum speed achievable in my tests is around 10k records/second. I have 4 servers, so that means 40k records/second, which is low for the kind of data I have to ingest. I can't find good benchmarks for GeoMesa, so I was wondering if 10k/s per server is what should be expected.
The data is straightforward: it has a date (time-indexed), a location (space-indexed), an id (index=true), and seven other fields that shouldn't be indexed. The data is read from HDFS and written using the GeoMesa library without any other logic.

Thanks,
Mario Pastorelli

_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
http://www.locationtech.org/mailman/listinfo/geomesa-users
