Re: [geomesa-users] Attribute Indexing

From: Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx>
Date: Fri, 23 Oct 2015 10:04:34 -0400
Delivered-to: geomesa-users@xxxxxxxxxxxxxxxx
List-archive: <https://www.locationtech.org/mhonarc/lists/geomesa-users>

Hi Marcel,

The date in the index is an optimization used in certain attribute queries that contain a date clause. If there is no date clause in the query, it doesn't 'cost' anything in terms of rows scanned, because the ranges we are scanning would be the same. However, if you are doing an attribute equals query with a specific date range, you can greatly reduce the number of rows you have to scan, which results in a much faster response.

For example, say that you want to index Twitter data by author. There could be thousands (millions?) of entries under a single author. If you want to retrieve all tweets by that author, then you will have to scan all the rows. But if you only want tweets from the last week, having the date in the row allows you to scan only the last week's worth of tweets and skip all the years prior.
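
A minimal sketch of that idea, not GeoMesa's actual row encoding: sorted Python tuples stand in for Accumulo's sorted row keys, and the authors, dates and feature ids are made up for illustration. The hypothetical `scan` helper returns the rows that one contiguous range would touch.

```python
from bisect import bisect_left, bisect_right

# Hypothetical, simplified index rows: (author, date as yyyymmdd, feature id).
# Sorted tuples stand in for Accumulo's lexicographically sorted row keys.
rows = sorted([
    ("alice", 20130105, "f1"),
    ("alice", 20140720, "f2"),
    ("alice", 20151019, "f3"),
    ("alice", 20151022, "f4"),
    ("bob", 20151020, "f5"),
])

def scan(author, start=None, end=None):
    """Return the rows touched by one contiguous range scan."""
    lo = (author, start if start is not None else 0, "")
    hi = (author, end if end is not None else float("inf"), "\uffff")
    return rows[bisect_left(rows, lo):bisect_right(rows, hi)]

# No date clause: every row for the author is scanned.
all_tweets = scan("alice")

# Date clause: the range is narrowed to the matching slice of rows.
last_week = scan("alice", 20151016, 20151023)
```

Without a date clause the range is the same as it would be if the date were not part of the row at all, which is why the extra key part costs nothing in that case.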

Similarly, the feature id is not used when we set up ranges to scan, but we include it in the row to ensure that we don't end up with extremely 'long' rows, if a particular attribute has many features with the same value. Having thousands of column families/column qualifiers in a single row causes problems for Accumulo.
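
To illustrate the row-width point, here is a rough sketch with made-up data. It does not reproduce GeoMesa's key layout; it only counts how many entries end up under each row key, with and without the feature id as part of the row.

```python
from collections import defaultdict

# Hypothetical entries: (attribute value, feature id). Not real GeoMesa data.
entries = [("alice", f"f{i}") for i in range(1000)] + [("bob", "f1000")]

def entries_per_row(row_key):
    """Group entries by row key and count how many land in each row."""
    counts = defaultdict(int)
    for value, fid in entries:
        counts[row_key(value, fid)] += 1
    return counts

# Row = attribute value only: all of alice's features pile into a single row.
without_fid = entries_per_row(lambda value, fid: value)

# Row = attribute value + feature id: every row holds exactly one entry.
with_fid = entries_per_row(lambda value, fid: (value, fid))

widest_without = max(without_fid.values())
widest_with = max(with_fid.values())
```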

Thanks,

Emilio

On 10/23/2015 06:32 AM, Marcel wrote:

One more question about the AttributeTable.
There are five parts:
1. table sharing prefix (my prefix is empty, because I don't need it)
2. index of sft
3. attribute-value
4. dtg (which seems to mean date-time group; until recently I thought it meant default geometry) - I have set a dtg index
5. featureID

Please tell me what this dtg is used for when querying. Wouldn't it be enough to store parts 1, 2, 3 and 5? Is this a sort of additional indexing?

Best regards,
Marcel Jacob.

On 19.10.2015 19:22, Emilio Lahr-Vivaz wrote:
Hi Marcel,

The logic for encoding rows is mostly contained in the following package. Each class corresponds to an index type:

https://github.com/locationtech/geomesa/tree/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/data/tables

As you say, the rows are not human-readable. There are bits of logic for decoding the rows scattered around, but in general we don't deal with decoding rows (we encode the data, and then encode ranges for scanning them). We don't currently have any formatters, but if you create any, please put up a pull request and we'll get them merged into the codebase!

Thanks,

Emilio

On 10/19/2015 12:22 PM, Marcel wrote:
Okay,
I scanned my tables to get the raw format. Unfortunately this is not human-readable, so I wrote a method which outputs the row ID, family, qualifier and visibility.
This is fine for the _st_idx table. When scanning records from the _attr_idx table, the output is still not readable.
Is there a formatter which decodes the records for me?
If not: when indexing an attribute like a date, what particular values are stored in the table? I assume row ID = date as a long. But what about the column family and qualifier? There has to be a mapping between the _attr_idx table and my record table (maybe the row ID from the record table?).

Best regards,
Marcel Jacob.

On 09.10.2015 18:40, Emilio Lahr-Vivaz wrote:
Hi Marcel,

We use the mango lexicoders library:

https://github.com/calrissian/mango/tree/master/mango-core/src/main/java/org/calrissian/mango/types

We treat dates as longs, based on the standard Java milliseconds since the epoch, so they get sorted based on that. This makes for very efficient range searches, as you say.
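
As a rough sketch of what "sorted as longs" can look like at the byte level: the code below is not taken from the mango library. It uses one common order-preserving technique (big-endian bytes with the sign bit flipped), and the helper names `to_millis` and `encode_long` are made up for illustration.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_millis(dt):
    """Milliseconds since the Unix epoch, as an integer."""
    return (dt - EPOCH) // timedelta(milliseconds=1)

def encode_long(value):
    """Encode a signed 64-bit integer so that byte order matches numeric order.

    Adding 2**63 flips the sign bit, which moves negative values below
    positive ones when the big-endian bytes are compared lexicographically.
    """
    return struct.pack(">Q", (value + (1 << 63)) & 0xFFFFFFFFFFFFFFFF)

dates = [
    datetime(2015, 10, 9, tzinfo=timezone.utc),
    datetime(1969, 12, 31, tzinfo=timezone.utc),  # before the epoch: negative millis
    datetime(2015, 10, 7, tzinfo=timezone.utc),
    datetime(2015, 10, 8, tzinfo=timezone.utc),
]

# Sorting by encoded bytes gives the same order as sorting chronologically.
by_bytes = sorted(dates, key=lambda d: encode_long(to_millis(d)))
```

Because the encoded keys sort chronologically, a date range maps onto a single contiguous key range.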

Thanks,

Emilio

On 10/09/2015 12:14 PM, Marcel Jacob wrote:
Okay, so there is no special strategy for indexing a date?
I'm asking because I can imagine different date formats:

1) yyyy-MM-dd
lexicographic sorting order:
...
2015-10-07
2015-10-08
2015-10-09

With this format it's very efficient to query a date range because only one table scan is needed.

2) dd-MM-yyyy
lexicographic sorting order:
...
07-10-2015
07-11-2015
07-12-2015
...
08-10-2015
...
09-10-2015

When querying for [07-10-2015 to 09-10-2015] we need multiple table scans, and building an additional index would be less efficient. (On the other hand, we can now efficiently query for a particular day of the month, e.g. the 7th, although I believe this type of query is very rare.)

So I will store my date in format 1).
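
The comparison of the two formats can be checked with a small sketch. It assumes plain string keys and a made-up three-month data set, and the hypothetical `keys_in_range` helper lists the keys that a single lexicographic range from the start key to the end key would cover.

```python
from datetime import date, timedelta

# Hypothetical data set: one key per day from 2015-10-01 to 2015-12-31.
days = [date(2015, 10, 1) + timedelta(days=i) for i in range(92)]

def keys_in_range(fmt, start, end):
    """Keys covered by one lexicographic range scan from start to end."""
    keys = sorted(d.strftime(fmt) for d in days)
    lo, hi = start.strftime(fmt), end.strftime(fmt)
    return [k for k in keys if lo <= k <= hi]

start, end = date(2015, 10, 7), date(2015, 10, 9)

# Format 1: the range covers exactly the three requested days.
iso = keys_in_range("%Y-%m-%d", start, end)

# Format 2: the same range also covers the 7th and 8th of later months,
# so one scan reads extra keys (or the query must be split into several scans).
dmy = keys_in_range("%d-%m-%Y", start, end)
```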

Please correct me if my thoughts are wrong.

Thanks in advance,
Marcel Jacob.

> From: cne1x@xxxxxxxx
> To: geomesa-users@xxxxxxxxxxxxxxxx
> Date: Fri, 9 Oct 2015 07:23:57 -0400
> Subject: Re: [geomesa-users] Attribute Indexing
>
> Marcel,
>
> Roughly in order that you asked...
>
> 1. Yes, it is always possible to get the raw key-value pairs out of
> Accumulo. The easiest way is via the Accumulo shell:
>
>
> http://accumulo.apache.org/1.6/accumulo_user_manual.html#_accumulo_shell
>
> Login, and then scan the "_attr_idx" table with a command somewhat
> similar to this:
>
> scan -t geomesa_attr_idx
>
> 2. There is nothing particularly novel about the way GeoMesa stores
> secondary attribute indexes in the "_attr_idx" table. This is a
> straight lexicographically-encoded-value storage.
>
> 3. There are two parts to using secondary indexes effectively:
> encoding and querying. The best references are to the GeoMesa source
> where these occur:
>
> encoding:
> https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/data/tables/AttributeTable.scala
>
> querying:
> https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/index/AttributeIdxStrategy.scala
>
> Enjoy!
>
> Sincerely,
> -- Chris
>
>
> On Fri, 2015-10-09 at 11:27 +0200, Marcel wrote:
> > Hello,
> >
> > is there a possibility to get the complete "raw" key-value pair of a
> > table as it is saved (a sample would be enough)? I want to look in the
> > "_attr_idx" table and understand how the index is built, e.g. when
> > indexing additional attributes like another Date, an Integer or a
> > String. Is this a special or a common strategy (adapted for Accumulo)
> > for indexing? What fields available in the Accumulo table design did you
> > use (RowId, ColumnFamily, etc.)?
> >
> > Thanks,
> > Marcel Jacob.
> >
> > _______________________________________________
> > geomesa-users mailing list
> > geomesa-users@xxxxxxxxxxxxxxxx
> > To change your delivery options, retrieve your password, or unsubscribe from this list, visit
> > http://www.locationtech.org/mailman/listinfo/geomesa-users
>
>
