Hello,
We don't currently use stats for this type of thing, as they are
somewhat unreliable and we don't want to exclude data that might be
there. You have three fairly easy options:
* if you're using GeoServer, you can configure a default layer filter with your 'scope' filter
* at the GeoMesa level, you can write and configure a query interceptor[1] to add the 'scope' filter to each query (a minimal sketch follows this list)
* you can modify your client code to always add the 'scope' filter to any user queries before passing them to GeoMesa (see the second sketch below)
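For the interceptor route, here is a minimal Java sketch. It assumes the QueryInterceptor interface at org.locationtech.geomesa.index.planning.QueryInterceptor as described in [1] (check the method signatures against your GeoMesa version) and hypothetical attribute names 'geom' and 'dtg':

    import java.io.IOException;
    import org.geotools.data.DataStore;
    import org.geotools.data.Query;
    import org.geotools.factory.CommonFactoryFinder;
    import org.geotools.filter.text.ecql.ECQL;
    import org.locationtech.geomesa.index.planning.QueryInterceptor;
    import org.opengis.feature.simple.SimpleFeatureType;
    import org.opengis.filter.Filter;
    import org.opengis.filter.FilterFactory2;

    public class ScopeInterceptor implements QueryInterceptor {

        private Filter scope;

        @Override
        public void init(DataStore ds, SimpleFeatureType sft) {
            try {
                // hypothetical scope: a fixed bounding box plus 'the last month from now'
                scope = ECQL.toFilter("BBOX(geom, -10, 35, 30, 60) AND dtg > currentDate('-P1M')");
            } catch (Exception e) {
                throw new RuntimeException("invalid scope filter", e);
            }
        }

        @Override
        public void rewrite(Query query) {
            // AND the scope onto whatever filter the user supplied
            FilterFactory2 ff = CommonFactoryFinder.getFilterFactory2();
            query.setFilter(ff.and(scope, query.getFilter()));
        }

        @Override
        public void close() throws IOException {}
    }

The interceptor is registered through the feature type user data; see [1] for the exact key and configuration details.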
I would suggest you specify your scope filter as simply as possible,
as complex polygon predicates can take some time to process.
FYI, the 'currentDate' function[2] might be useful to define your
scope filter, as it takes an offset so you can specify something
like 'the last month from now'.
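If you go the client-side route instead, the same filter can be merged into user queries directly. A minimal sketch, assuming a hypothetical schema name 'observations' and date attribute 'dtg' (see [2] for the exact offset syntax that currentDate accepts):

    // wrap every user filter in the fixed scope before querying GeoMesa
    Filter scope = ECQL.toFilter("dtg > currentDate('-P1M')"); // roughly 'the last month'
    FilterFactory2 ff = CommonFactoryFinder.getFilterFactory2();
    Query query = new Query("observations", ff.and(scope, userFilter));
    // pass 'query' to your feature source as usual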
Thanks,
Emilio
[1]: https://www.geomesa.org/documentation/user/datastores/index_config.html#query-interceptors
[2]: https://www.geomesa.org/documentation/user/datastores/filter_functions.html#currentdate
On 11/7/19 11:18 AM, Gorham, Kent wrote:
Hello Emilio,
This will sound like an odd question.

We have a case where the implementation will be used within a certain ‘scope’, meaning that the database will only reference a limited time range and a limited geographical area (e.g. the latest three months and two or three countries). Is it possible to optimize the system so that it does NOT look for data outside of these limits? (A ‘mask’, if you will?)

I thought the stats table would manage this, but it doesn’t seem to behave that way.
Thanks,
Kent
Generally, when trying to debug a query result, you can get a lot of insight from enabling explain query logging[1]. In your case, by default GeoMesa creates two indices: a spatial z2 index and a spatio-temporal z3 index. Because space is a constrained value, we can represent the entire world as a single range. However, because time is open-ended, we have to 'bin' the index by time periods (weeks by default). When scanning a large time period, you end up having to scan each time bin. This can lead to significant overhead in query times, even if there is no data, as we still have to construct the query ranges and send them to Accumulo. There are a few things you can do to mitigate this (the first three are sketched after this list):
* You can create an attribute index[2] on your date field, at
the cost of increasing your size on disk and decreasing
overall write speeds. The attribute index key is optimized for
that type of query.
* You can increase the time period[3], which will reduce the
number of time bins scanned. Generally, you want to align your
time period with the range of data you expect to query.
* You can reduce the range decomposition[4] for a query.
Having fewer, broader ranges can slow down scans due to more
false positives being filtered, but will reduce the overhead
involved with sending many ranges to Accumulo.
* If your data is well-known, you can apply date filters to
each query that define your data bounds. In GeoServer, you can
do this through configuring default layer filters.
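To make the first three concrete, here is a minimal Java sketch. It assumes a hypothetical schema with a 'dtg' date field; the ':index=true' spec option, the 'geomesa.z3.interval' user-data key, and the 'geomesa.scan.ranges.target' system property are described in [2], [3] and [4] respectively:

    import org.locationtech.geomesa.utils.interop.SimpleFeatureTypes;
    import org.opengis.feature.simple.SimpleFeatureType;

    // attribute index on the date field, plus monthly (rather than weekly) z3 bins
    SimpleFeatureType sft = SimpleFeatureTypes.createType("observations",
        "name:String,dtg:Date:index=true,*geom:Point:srid=4326;geomesa.z3.interval='month'");
    // ds.createSchema(sft); // 'ds' is your existing GeoMesa DataStore

    // ask the query planner for fewer, broader scan ranges (see [4] for the default)
    System.setProperty("geomesa.scan.ranges.target", "1000");

Note that the z3 interval applies at schema creation; see [3] before changing it on an existing schema.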
Thanks,
Emilio
[1]: https://www.geomesa.org/documentation/user/datastores/query_planning.html#explaining-query-plans
[2]: https://www.geomesa.org/documentation/user/datastores/index_basics.html#attribute-index
[3]: https://www.geomesa.org/documentation/user/datastores/index_config.html#configuring-z-index-time-interval
[4]: https://www.geomesa.org/documentation/user/datastores/runtime_config.html#geomesa-scan-ranges-target
On 10/25/19 12:34 PM, Gorham, Kent wrote:
Hello Emilio,
Thank you for the information. I’ve investigated some of those avenues, but I have also been performing additional tests and do not understand the results.
We have populated a GeoMesa/Accumulo database (with 8 nodes) with 3,110,440 records (360*180*48). There are 48 datapoints recorded at every point on the planet; at each location, the 48 points increment in time by 1 millisecond each. We have both point and time as part of our feature type.

For clarity, location -180,-90 will contain a datapoint with a time value of 0. Location 180,90 will contain a time value of 3,110,439.
We wrote a test to retrieve the data in various ways (by location only, by time only, and by time and location).

For location, as we approach a zero-sized search area, the time for the search approaches zero. See ‘location only.png’. The blue line is the number of records returned; the orange line is the time in milliseconds to do the search. The search area starts at (-180,-90,180,90) and decreases by one longitude degree for each search. We only performed 300 searches. This graph makes sense.

NOTE: The iterator to read the records returned simply counts the records, so it is not a factor.
However, when we search by time only (see ‘time-only.png’), there seems to be some significant overhead associated with performing the search. In this search, we reach zero because we have calculated the appropriate increment for each search (e.g. 0 to 3110440 milliseconds for the first search, where milliseconds are converted directly to Date, and 10368 to 3110400 for the second search). We have still only performed 300 iterations in the test loop.

Also, if we perform a time search for a range of time that does not contain any data (e.g. from 3110441 to NOW), the system still takes a couple of seconds to return zero results (~2.4 seconds).

We realize that there is a significant amount of time between 3110440 (12-31-1969 18:51:50) and NOW, even though there’s no data there (but possibly indexes exist?). We are wondering if that is part of the problem.
We would like to understand this overhead that occurs with a temporal search. Would you be able to explain it, or is there a good way to diagnose it?
Thanks,
Kent
Hello,
To answer some of your questions:
* Accumulo doesn't really have any concept of a trigger.
There are certain 'hacky' ways to do so (i.e. constraints),
but they aren't recommended.
* GeoMesa has a concept of query interceptors[1], which let
you rewrite a query with custom code. This may not be
sufficient for your needs as it doesn't let you directly
change the return values, but may be a useful integration
point.
* MapReduce jobs can be initiated in a variety of ways, but
that is not really within the scope of GeoMesa. I'd refer
you to the Hadoop documentation here.
In general, I would suggest that you first consider
returning data in the Apache Arrow[2] or custom GeoMesa
'binary'[3] formats. Either one can greatly reduce the
bandwidth required to return a given result set, while still
returning the same number of features. You may also want to
consider feature sampling[4], which will reduce the total
number of features returned.
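As an illustration, here is a sketch of requesting sampling (and, alternatively, Arrow encoding) through query hints. The hint keys live on the Scala object org.locationtech.geomesa.index.conf.QueryHints and are accessed as methods from Java; the names and package are assumptions that may vary by GeoMesa version, so check [2] and [4]:

    import org.geotools.data.Query;
    import org.locationtech.geomesa.index.conf.QueryHints;

    Query query = new Query("observations", filter); // hypothetical schema and filter
    // return roughly 10% of the matching features
    query.getHints().put(QueryHints.SAMPLING(), 0.1f);
    // or: request Arrow-encoded results instead of individual features
    // query.getHints().put(QueryHints.ARROW_ENCODE(), Boolean.TRUE);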
If those options are insufficient, then I would suggest writing your data reduction as an Accumulo iterator or combiner, which will let you do map/reduce style programming directly in Accumulo. It sounds like your data reduction depends on each query - if so, you'd need to modify the GeoMesa query planner in order to configure and invoke your iterators. If the data reduction can be done globally, then you can simply configure the iterators on your table directly, and they will be run for each query and compaction (see the sketch below).
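For the global case, here is a sketch of attaching an iterator using the standard Accumulo client API. 'MyReducingIterator', its option, and the table name are all hypothetical, and note that a real iterator must preserve GeoMesa's key/value encoding or GeoMesa will no longer be able to read the results:

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;

    // 'connector' is an existing Accumulo Connector; priority 30 runs after the system iterators
    IteratorSetting setting = new IteratorSetting(30, "reduce", MyReducingIterator.class);
    setting.addOption("window", "PT5M"); // hypothetical iterator option
    connector.tableOperations().attachIterator("myCatalog_mySft_z3", setting);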
GeoMesa doesn't currently have an integration point for
adding new iterators, but if you'd like to contribute
something to that effect, it may make your solution more
robust as the API would be well defined and 'officially'
supported.
Hope that helps,
Emilio
[1]: https://www.geomesa.org/documentation/user/datastores/index_config.html#configuring-query-interceptors
[2]: https://www.geomesa.org/documentation/user/datastores/analytic_queries.html#arrow-encoding
[3]: https://www.geomesa.org/documentation/user/datastores/analytic_queries.html#binary-encoding
[4]: https://www.geomesa.org/documentation/user/datastores/analytic_queries.html#feature-sampling
On 10/22/19 7:16 PM, Udstrand, Will M wrote:
Hey Emilio,

In our current setup we are using Accumulo as the backend database, and we are querying GeoMesa with the open source API via org.geotools.* and org.opengis.*.
Can you say more about your setup? What back-end database are you using? Are you using GeoServer for querying, or something else?

Thanks,
Emilio
On 10/22/19 11:23 AM, Udstrand, Will M wrote:
Problem Description:
Currently in our platform we are using GeoMesa to store large amounts of geographical and time-sensitive metadata, and we are experiencing very poor performance (i.e. high latency) with our system’s current configuration. The primary bottleneck is the large amount of data returned by GeoMesa, so we are actively pursuing avenues to reduce the size of the responses. We have been investigating the use of MapReduce within the system, but have run into some knowledge gaps due to the lack of documentation. The idea behind our MapReduce use case is to either intercept queries coming into our cluster, or periodically run jobs to combine and reduce the primary dataset and place the results into a separate table. Ideally we would intercept the queries, due to the complications of the data reduction, since the reduction is dependent on the parameters of a query.
MapReduce Options
· When intercepting queries coming into our cluster, we’d have them trigger jobs that combine and reduce the query’s raw metadata into a smaller set of formatted/processed data points, which is then returned to our backend services as the result of the query.
· Periodically, or on events such as a write to a table, trigger a job that processes and reduces the primary data set and writes the result to our new “query” table.
Questions
· Can MapReduce jobs be triggered by events in the database?
· Can one intercept the queries written to a GeoMesa instance?
· How are MapReduce jobs initiated, and can they be triggered programmatically?
· Can we send back the results of a MapReduce job as the result of a query?
· Are there any other options to reduce the latency incurred by large responses from the database?
We were hoping that you'd be able to give us some insight into our problems, and additional help in terms of the feasibility of our MapReduce and GeoMesa use case.
_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.locationtech.org/mailman/listinfo/geomesa-users