Re: [geomesa-users] GeoMesa range query performance

Hi Emilio,

Thanks so much for your answer. I will try to use the geotools API programmatically to see how it works. I'll keep you posted.


On Tue, May 21, 2019 at 10:15 AM Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx> wrote:
Ah, I would not use the command-line tools for performance testing. There is a substantial overhead involved in spinning up the JVM for each query, which is likely dominating the time for smaller queries.
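As a rough illustration of the point above, the sketch below times queries inside one long-lived process, with a few untimed warm-up calls first, so that process startup is not counted against each query. It is written in Python purely to show the measurement pattern; a real test would go through the GeoTools API on the JVM. The `run_query` callable is hypothetical: it stands in for whatever client code runs a filter and reads all of the results.

```python
import time

def time_queries(run_query, filters, warmup=3):
    """Time each filter with `run_query`, after a few untimed warm-up calls.

    `run_query` is a hypothetical callable (an assumption of this sketch):
    it takes a filter string, runs the query through a long-lived client,
    reads every result, and returns the number of results.
    """
    for f in filters[:warmup]:
        run_query(f)  # untimed: lets connections and caches settle

    timings = []
    for f in filters:
        start = time.perf_counter()
        count = run_query(f)
        elapsed = time.perf_counter() - start
        timings.append((f, count, elapsed))
    return timings
```

Reporting the per-query times, rather than one total for the batch, also makes it easier to see whether small queries are dominated by fixed overhead.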

You can use the Accumulo monitor page to look at the index tables associated with your data and see how many splits there are, and where they are located. It is usually available on port 9995.



On 5/21/19 12:49 PM, Tin Vu wrote:
Hi Emilio,

Thanks for your enthusiasm. I did not use the GeoTools API programmatically. Instead, I use the GeoMesa-Accumulo command-line tools to submit queries. In particular, a query looks like this:

geomesa-accumulo export -u root -p *password* -c *dataset* -f *data_model* -q "bbox(geom,x1,y1,x2,y2)" -F csv

How can I check that my data is distributed across the cluster? I store it in Accumulo, with HDFS as the file system.



On Mon, May 20, 2019 at 6:48 AM Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx> wrote:
Are you using the geotools API programmatically then? There are a lot of things that can affect the query performance, a few things I would look at:

* Check if your data is distributed across the cluster. By default, GeoMesa will create 4 splits on ingestion. If your data doesn't reach the split threshold, then you will only be querying 4 regions on at most 4 servers.
* Check that the client can handle the number of threads being used. GeoMesa spawns multiple client threads per query (based on the data store configuration), so by default you'd be running 8 threads per query.
* Try to determine the bottleneck - you may be saturating your network, or your client may not be reading results as fast as they are being delivered.
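
To make the thread-count point concrete, here is a minimal sketch that caps how many queries run at once and reports a rough upper bound on client scan threads (concurrent queries times threads per query). It is Python for illustration only; `run_query` is a hypothetical callable standing in for the real client code, and the default of 8 threads per query is the figure mentioned above, assuming every query uses all of its configured threads.

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(run_query, filters, max_concurrent=4, threads_per_query=8):
    """Run `filters` with at most `max_concurrent` queries in flight.

    Returns the per-query results in input order, plus a rough upper bound
    on the number of scan threads the client would be running at peak.
    `run_query` is a hypothetical callable (an assumption of this sketch).
    """
    in_flight = min(max_concurrent, len(filters))
    peak_scan_threads = in_flight * threads_per_query

    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        results = list(pool.map(run_query, filters))
    return results, peak_scan_threads
```

With one client thread per query and, say, 100 queries submitted at once, the same arithmetic gives on the order of 800 scan threads on a single client, which may be more than it can service.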

I'm not familiar with how SpatialHadoop works, so those things may or may not be affecting it as well.

At any rate, I don't think anyone has compared the two before. I'd be interested to see some more detailed results (code samples, timings, etc), if you'd share them.



On 5/20/19 9:10 AM, Tin Vu wrote:
I used concurrent threads. 1 thread for 1 query.

On Mon, May 20, 2019, 6:00 AM Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx> wrote:

How are you submitting queries to GeoMesa?



On 5/19/19 3:25 PM, Tin Vu wrote:
Hi Emilio,

Thanks for your response. I executed my experiments as follows:
1. Cluster: 1 master node, 12 slave nodes, 64 GB memory in each node.
2. Dataset: OpenStreetMap All Nodes (size 96 GB, 2.7 billion records).
3. Queries: I created 10 batches of queries with different sizes (for example, query area / whole space area = 10^-12, 10^-11, ..., 10^-2). Each batch contains 100 square queries of the same size. The queries are randomly distributed over the whole space.
4. I submit those batches of queries to SpatialHadoop and GeoMesa, wait until they finish, then record the running time.
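
For reference, the query batches described in step 3 could be generated along these lines. This is a sketch under stated assumptions, not the script actually used: it assumes the "whole space" is the full WGS84 extent in degrees, that squares are measured in degrees, and that positions are drawn uniformly. The function names and the fixed seed are illustrative.

```python
import math
import random

# Assumption: the whole space is the full longitude/latitude extent.
WORLD = (-180.0, -90.0, 180.0, 90.0)

def square_side(selectivity, world=WORLD):
    """Side of a square whose area is `selectivity` times the world area."""
    min_x, min_y, max_x, max_y = world
    world_area = (max_x - min_x) * (max_y - min_y)
    return math.sqrt(selectivity * world_area)

def random_bbox_filters(selectivity, count, seed=0, world=WORLD):
    """Return `count` bbox filter strings placed at random inside `world`."""
    rng = random.Random(seed)  # fixed seed so a batch can be regenerated
    min_x, min_y, max_x, max_y = world
    side = square_side(selectivity, world)
    filters = []
    for _ in range(count):
        # Pick the lower-left corner so the square stays inside the world.
        x1 = rng.uniform(min_x, max_x - side)
        y1 = rng.uniform(min_y, max_y - side)
        filters.append(
            f"bbox(geom,{x1:.7f},{y1:.7f},{x1 + side:.7f},{y1 + side:.7f})"
        )
    return filters

# One batch of 100 queries at a selectivity of 10^-6.
batch = random_bbox_filters(1e-6, 100)
```

Each string has the same form as the bbox filter passed to the export command, so the same batch can be reused against both systems.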



On Thu, May 16, 2019 at 2:16 PM Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx> wrote:

Could you say more about how you're querying? SpatialHadoop uses map/reduce jobs, if I understand - it seems like there would be a lot of overhead to spin up the job. How long are your queries taking? How big is your cluster?



On 5/16/19 3:20 PM, Tin Vu wrote:
Hi all,

I just wanted to ask you a question about the performance of GeoMesa range queries. This is my experimental setup:
1. Systems: GeoMesa on Accumulo, SpatialHadoop (
2. Dataset: All node dataset from, with 96 GB and 2.7 billion points.
3. Query: range queries with different selectivity: 10^-12, 10^-11, 10^-10, where selectivity is the ratio of the query area to the total area of the dataset space.

I saw that GeoMesa does not perform better than SpatialHadoop, which I did not expect, since I think that GeoMesa (which organizes data at the record level) should do better than SpatialHadoop (which organizes data at the block level) on highly selective queries. Could you suggest how to tune GeoMesa so that it performs better?



geomesa-users mailing list
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
