Hi Sandeep,
In general, we still have to scan where there *might* be data, even
if there isn't actually any data there. Opening a scan, even if it
returns no data, takes some time. For temporal queries, the number
of ranges tends to be even larger, hence the slower performance.
I believe that Accumulo handles this a bit better than HBase, as it
has a concept of a batch scanner that accepts multiple ranges, and
it has some knowledge of the data start/end. In HBase, we have to
run multiple scans using a thread pool [1], so it's not as
efficient. We could possibly leverage HBase metadata to improve
things a bit for that scenario (as future work).
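
As a rough illustration of why the wider query costs more, here is a toy Python sketch. It is not GeoMesa or HBase code: `scan_range`, `batch_scan`, and the in-memory `table` are hypothetical stand-ins, and the range boundaries are made up. It only shows that one scan is opened per range, whether or not that range holds any rows.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_range(table, start, stop):
    """Toy stand-in for opening one range scan over (key, value) rows."""
    return [(k, v) for k, v in table if start <= k < stop]

def batch_scan(table, ranges, threads):
    """Run one scan per range on a thread pool and merge the results in order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        parts = pool.map(lambda r: scan_range(table, *r), ranges)
        return [row for part in parts for row in part]

# Hypothetical data: all rows live in keys 100..109.
table = [(k, f"row-{k}") for k in range(100, 110)]

narrow = [(100, 105)]                          # 1 range, contains data
wide = [(i, i + 5) for i in range(0, 105, 5)]  # 21 ranges, 20 of them empty

# Both calls return the same 5 rows, but the wide query opens 21 scans
# instead of 1, and each scan has a fixed setup cost in a real store.
same = batch_scan(table, narrow, 4) == batch_scan(table, wide, 4)
```

In a real cluster each of those extra scans involves at least one round trip, so the fixed cost per range dominates when most ranges are empty.
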
We also have the concept of data statistics, which we could leverage
to only scan the ranges that have data. However, it hasn't been
implemented for HBase yet, and our current query planning doesn't
use it since it's an optional feature. As further future work, it would
be nice to leverage those stats in query planning.
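
Until something like that exists in query planning, the same idea can be approximated on the client side by clipping the query box to bounds you already know. The sketch below is plain Python and assumes you track the data bounds yourself; `clip_to_data_bounds` is a hypothetical helper, not a GeoMesa API, and boxes use the same coordinate order as the example in the original question.

```python
def clip_to_data_bounds(query, data):
    """Intersect a query box with known data bounds.

    Boxes are (min_a, min_b, max_a, max_b). Returns None if they are disjoint,
    in which case the query can be skipped entirely.
    """
    lo_a, lo_b = max(query[0], data[0]), max(query[1], data[1])
    hi_a, hi_b = min(query[2], data[2]), min(query[3], data[3])
    if lo_a > hi_a or lo_b > hi_b:
        return None
    return (lo_a, lo_b, hi_a, hi_b)

# Values taken from the question: data spans (30,60) to (35,65).
data_bounds = (30, 60, 35, 65)
wide_query = (10, 10, 30.1, 60.1)

# The wide query shrinks to the narrow one: (30, 60, 30.1, 60.1)
clipped = clip_to_data_bounds(wide_query, data_bounds)
```

The clipped box covers far less empty space, so fewer ranges need to be scanned for the same result.
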
To mitigate the issue, you can try increasing the "queryThreads"
data store parameter, in order to use more threads during queries.
You can also enable "looseBoundingBox", if you have currently
disabled it. For temporal queries, increasing the temporal binning
period may cause fewer ranges to be scanned [2]. However, this may
result in slower queries for very small temporal ranges, so it
should be tailored to your use case.
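
To get a feel for the binning trade-off, the following Python sketch counts how many time bins a query interval touches for different periods. It assumes bins are aligned to the Unix epoch and that the period names are day, week, month, and year; `bins_touched` is a hypothetical helper for illustration, not part of GeoMesa, and the real number of scanned ranges also depends on the spatial part of the query.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def bins_touched(start, end, period):
    """Count epoch-aligned time bins overlapped by the interval [start, end]."""
    def index(t):
        if period == "day":
            return (t - EPOCH).days
        if period == "week":
            return (t - EPOCH).days // 7
        if period == "month":
            return (t.year - 1970) * 12 + (t.month - 1)
        if period == "year":
            return t.year - 1970
        raise ValueError(f"unknown period: {period}")
    return index(end) - index(start) + 1

# A query covering all of 2017.
start = datetime(2017, 1, 1, tzinfo=timezone.utc)
end = datetime(2017, 12, 31, tzinfo=timezone.utc)

weekly = bins_touched(start, end, "week")    # 53 bins
monthly = bins_touched(start, end, "month")  # 12 bins
```

Fewer bins generally means fewer ranges for long intervals, while a query covering only a few hours still falls into one whole bin, which is why a coarse period can hurt very short time ranges.
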
As a final note, make sure that you have the distributed
coprocessors installed and enabled [3], especially if you are not
using loose bounding boxes.
Thanks,
Emilio
[1] https://github.com/locationtech/geomesa/blob/master/geomesa-index-api/src/main/scala/org/locationtech/geomesa/index/utils/AbstractBatchScan.scala
[2] http://www.geomesa.org/documentation/user/datastores/index_config.html#configuring-z-index-time-interval
[3] http://www.geomesa.org/documentation/user/hbase/install.html#register-the-coprocessors
On 08/31/2017 04:59 PM, Sandeep Singh wrote:
I have inserted data with a lat, lng range of
(30,60) to (35,65).
With this setup, I am running queries on my local
machine:
a) In my first query, the location bounding box
is (30,60) to (30.1,60.1). It runs in less than a
second on average and returns correct results.
b) In my second query, I changed the location bounding
box to (10,10) to (30.1,60.1). This query returns
the same results as query (a), which is expected,
but it takes around 3-4 seconds per query on
average.
Both queries should give me the same results, but
one runs much faster than the other. I see
similar behavior in time-domain queries too, where
the performance is even worse (10x slower
or more) if the time ranges do not match the
data inserted. Below are some of my questions:
1) Is this expected behavior?
2) I know one solution is to rewrite the query
to match the actual spatial and temporal ranges of the data
inserted into GeoMesa, which would require me to maintain
additional metadata about the data. But I think a better
solution could be designed at the GeoMesa layer?
Do let me know if there are any settings that
can affect this behavior. I have seen the same behavior on
multiple other local machines and on cloud VMs running
GeoMesa.
regards,
Sandeep Singh.