Re: [geomesa-users] Getting Results from Geomesa

From: Jim Hughes <jnh5y@xxxxxxxx>
Date: Sun, 03 May 2015 15:46:03 -0400

Joel,

This is a great question. To reframe the question, it sounds like you'd like to be able to sort a query by a column (ascending and descending) and page through the results.

In full generality, this is a tall order for a database layer living on top of a distributed key-value store. GeoMesa uses sharding for our spatial index to distribute data evenly across the cloud. To be as efficient as possible, queries use multiple threads to read from several tablet servers at a time. This means that two subsequent queries will very likely get back results in different orders (hence paging is hard).

I think you are on the right track with caching/storing queries to serve up. Assuming that users are going to interact with the same query for a few minutes, could you cache the query results in memory with a timeout of a minute or two? A load request would hit GeoMesa, but the subsequent sort and page requests could work against the data in memory. If the user leaves and comes back, their query may have to be re-requested.
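
To make the caching idea concrete, here is a minimal sketch in Python of a result cache with a timeout. It is not GeoMesa code: the class name `QueryResultCache`, the `loader` callback (standing in for whatever actually runs the query against GeoMesa), and the `total`/`rows` response keys are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


class QueryResultCache:
    """Keeps full query results in memory for a short time so that sort and
    page requests do not have to go back to the data store."""

    def __init__(self, ttl_seconds: float = 120.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the timeout can be tested
        self._entries: Dict[str, Tuple[float, List[Row]]] = {}

    def _evict_expired(self) -> None:
        now = self._clock()
        expired = [k for k, (expires, _) in self._entries.items() if expires <= now]
        for key in expired:
            del self._entries[key]

    def get_rows(self, query_id: str, loader: Callable[[], List[Row]]) -> List[Row]:
        """Return cached rows, calling `loader` only on a miss or after expiry."""
        self._evict_expired()
        entry = self._entries.get(query_id)
        if entry is None:
            rows = list(loader())  # the only path that touches the data store
            self._entries[query_id] = (self._clock() + self._ttl, rows)
            return rows
        return entry[1]

    def page(self, query_id: str, loader: Callable[[], List[Row]],
             sort_column: str, descending: bool = False,
             start: int = 0, length: int = 10) -> Dict[str, object]:
        """Sort the cached rows by one column and return a single page."""
        rows = self.get_rows(query_id, loader)
        ordered = sorted(rows, key=lambda r: r[sort_column], reverse=descending)
        return {"total": len(rows), "rows": ordered[start:start + length]}
```

A REST endpoint could call `page(...)` with the offset and page size sent by the table widget. This sketch re-sorts on every request, which is probably acceptable for a few thousand records; it also has no memory bound, so a real version would need a cap on the number of cached queries.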

For GeoMesa, we have worked a little bit with caching in the GeoTools layer, but we haven't ironed out all the issues. To give it a spin, add 'caching -> true' in the DataStore params. As I experimented with caching just now, I noticed that we don't look at the sorting part of the query. This should be an incredibly easy fix.* If in-memory caching is a suitable solution, I can help add a few lines to get sorting to work with caching. Other than that, it might be good to think through what cache settings we could expose to the user to make caching viable.

The obvious downside is that if there are too many users relative to available memory, this plan will fail. As a more complex possibility, one could imagine writing a user's query results to a 'temporary' Accumulo table.** Records in this table could be indexed by session id / user / query id. During the first write, one would be able to pick a column and sort order. From there, paging might make sense. Reversing the sort order or sorting on another column would require sorting in memory or creating another temporary copy of the data.***
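
As a rough sketch of the temporary-table idea, the Python below builds row keys of the form query id, then the encoded sort value, then the feature id, and pages through them with range scans. No Accumulo client is involved: `SortedTable` is a made-up in-memory stand-in for a table that keeps rows in sorted key order, and the key layout, `row_key`, and `next_page` are assumptions for illustration. It only handles integer sort values and assumes query ids contain no zero bytes.

```python
import bisect
import struct
from typing import Dict, List, Optional, Tuple

SEP = b"\x00"


def encode_int(value: int) -> bytes:
    """Encode a signed 64-bit integer so that byte order matches numeric order."""
    # Adding 2**63 shifts the range to unsigned, so negatives sort first.
    return struct.pack(">Q", value + (1 << 63))


def row_key(query_id: str, sort_value: int, feature_id: str) -> bytes:
    """Hypothetical key layout: query id, sort value, feature id."""
    return query_id.encode() + SEP + encode_int(sort_value) + SEP + feature_id.encode()


class SortedTable:
    """In-memory stand-in for a table whose rows are kept sorted by key."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []
        self._values: Dict[bytes, str] = {}

    def put(self, key: bytes, value: str) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start: bytes, end: bytes, limit: int) -> List[Tuple[bytes, str]]:
        """Return up to `limit` entries with start <= key < end, in key order."""
        i = bisect.bisect_left(self._keys, start)
        out: List[Tuple[bytes, str]] = []
        while i < len(self._keys) and self._keys[i] < end and len(out) < limit:
            out.append((self._keys[i], self._values[self._keys[i]]))
            i += 1
        return out


def next_page(table: SortedTable, query_id: str, page_size: int,
              after_key: Optional[bytes] = None) -> List[Tuple[bytes, str]]:
    """Fetch one page; pass the last key of the previous page to continue."""
    prefix = query_id.encode() + SEP
    # Appending a zero byte gives the smallest key strictly after `after_key`.
    start = prefix if after_key is None else after_key + b"\x00"
    end = query_id.encode() + b"\x01"  # just past every key with this prefix
    return table.scan(start, end, page_size)
```

Because the order is fixed when the rows are written, each page is a bounded scan that resumes from the last key seen, so two requests for the same page return the same records.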

Thanks,

Jim

* The code for the Caching Feature Collection is here: https://github.com/locationtech/geomesa/blob/accumulo1.5.x/1.x/geomesa-core/src/main/scala/org/locationtech/geomesa/core/data/AccumuloFeatureSource.scala#L111-154

** Rather than actually trying to figure out separate tables for each user and when it is safe to delete them, one could configure Accumulo's AgeOffFilter for the table. Copies of queries would be deleted after a configurable time.

*** Now that I'm thinking of it, assuming that query results are small-ish (5k records), if there are only a few columns (say under 10), one could write entries which would be sorted (forwards and backwards) on each column to the temporary table. It would require a bit of custom Accumulo work, but it'd be relatively straightforward.
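
One way the forwards-and-backwards entries might look, sketched in Python under some assumptions: every column holds a 64-bit integer, descending order is obtained by inverting the bytes of the encoded value, and the key layout (query id, column, direction, value, feature id) is invented for this example. Sorting a plain list of keys stands in for the ordered scan a real table would provide.

```python
import struct
from typing import Dict, Iterable, List

SEP = b"\x00"


def encode_int(value: int) -> bytes:
    """Encode a signed 64-bit integer so that byte order matches numeric order."""
    return struct.pack(">Q", value + (1 << 63))


def invert(encoded: bytes) -> bytes:
    """Flip every byte so that ascending byte order becomes descending value order."""
    return bytes(0xFF - b for b in encoded)


def key_prefix(query_id: str, column: str, descending: bool) -> bytes:
    direction = b"desc" if descending else b"asc"
    return SEP.join([query_id.encode(), column.encode(), direction]) + SEP


def index_keys(query_id: str, feature_id: str, attrs: Dict[str, int]) -> List[bytes]:
    """Build two keys per column for one record: one ascending, one descending."""
    keys = []
    for column, value in attrs.items():
        encoded = encode_int(value)
        for descending, payload in ((False, encoded), (True, invert(encoded))):
            keys.append(key_prefix(query_id, column, descending)
                        + payload + SEP + feature_id.encode())
    return keys


def ordered_ids(all_keys: Iterable[bytes], query_id: str,
                column: str, descending: bool) -> List[str]:
    """Read feature ids back in the requested order for one column."""
    prefix = key_prefix(query_id, column, descending)
    matching = sorted(k for k in all_keys if k.startswith(prefix))
    # Skip the prefix, the 8-byte value, and the separator to reach the id.
    return [k[len(prefix) + 9:].decode() for k in matching]
```

With 5k records and 10 columns this writes 100k small entries per query, which trades storage for the ability to change sort column or direction without re-sorting anything.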

On 05/01/2015 04:42 PM, Joel Folkerts wrote:

Good afternoon. I am working on a project that is serving Geomesa results to users through a web interface by means of a REST API. Currently, the users construct a geospatial query, the API in turn sends this query to Geomesa, which then returns all of the records back through the API to the user. We run into problems when the returning dataset is over 5,000 records (which it normally is) and we end up crashing the user's browser.

We also serve Spark-based analytic results through Impala, which allows us to easily serve a subset of the result set by limiting the results. We expose an API endpoint specifically for DataTables and it works very nicely (https://www.datatables.net/examples/server_side/simple.html).

What we're trying to avoid is writing Geomesa search results to HDFS and then layering Impala on top of it. While this would solve the problem, we risk wasting a tremendous amount of HDFS space.

Our ultimate goal is to connect a DataTables UI to Accumulo/Geomesa and be able to retrieve only the data that we want, i.e. 10 records out of 100,000 records.

Any ideas, design patterns, or code samples would be very much appreciated. Thank you in advance!

-Joel