Hi Marcel,
The date in the index is an optimization used in certain attribute
queries that contain a date clause. If there is no date clause in
the query, it doesn't 'cost' anything in terms of rows scanned,
because the ranges we are scanning would be the same. However, if
you are doing an attribute equals query with a specific date range,
you can greatly reduce the number of rows you have to scan, which
results in a much faster response.
For example, say that you want to index Twitter data by author.
There could be thousands (or millions) of entries under a single
author. If you want to retrieve all tweets by that author, then you
will have to scan all the rows. But if you want to only get tweets
from the last week, having the date in the row allows you to only
scan the last week's worth of tweets, and skip all the years prior.
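To make the effect of the date concrete, here is a minimal Python sketch of the idea. It is not GeoMesa's actual encoding: the row layout (attribute value, a NUL separator, an 8-byte big-endian timestamp, then the feature id), the `row_key` and `scan` helpers, and the sample data are all assumptions made for illustration, and a sorted Python list stands in for the Accumulo table.

```python
import struct
from bisect import bisect_left

DAY = 86_400_000  # milliseconds per day


def row_key(value: str, millis: int, feature_id: str) -> bytes:
    """Hypothetical row layout: value, NUL, big-endian date, feature id.

    Only non-negative timestamps are handled in this sketch.
    """
    return (value.encode("utf-8") + b"\x00"
            + struct.pack(">Q", millis)
            + feature_id.encode("utf-8"))


def scan(rows, start: bytes, end: bytes):
    """Return the rows with start <= row < end, like a range scan on a sorted table."""
    return rows[bisect_left(rows, start):bisect_left(rows, end)]


# One tweet per day for 1000 days by "alice", plus some rows for another author.
rows = sorted(
    [row_key("alice", d * DAY, f"tweet-{d}") for d in range(1000)]
    + [row_key("bob", d * DAY, f"tweet-b{d}") for d in range(10)]
)

# No date clause: the range covers every row for the author.
all_alice = scan(rows, b"alice\x00", b"alice\x01")

# Date clause: the range is narrowed to the last seven days only.
prefix = b"alice\x00"
last_week = scan(rows,
                 prefix + struct.pack(">Q", 993 * DAY),
                 prefix + struct.pack(">Q", 1000 * DAY))

print(len(all_alice), len(last_week))
```

Without a date clause the scan range is the same as it would be if the date were not in the row at all, which is why the date costs nothing in that case.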
Similarly, the feature id is not used when we set up ranges to scan,
but we include it in the row to ensure that we don't end up with
extremely 'long' rows, if a particular attribute has many features
with the same value. Having thousands of column families/column
qualifiers in a single row causes problems for Accumulo.
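As a rough illustration of the row-length point, the following Python sketch counts how many entries land in a single row under two hypothetical layouts. The layouts, the separator, and the sample data are assumptions for illustration and are not taken from the GeoMesa source.

```python
from collections import Counter

# 5000 features that all share the same attribute value.
entries = [("alice", f"tweet-{i}") for i in range(5000)]

# Layout A (hypothetical): row = attribute value only, with the feature id
# stored in the column qualifier. Every entry lands in the same row.
per_row_a = Counter(value for value, _fid in entries)

# Layout B (hypothetical): row = attribute value + separator + feature id.
# Every entry gets its own row.
per_row_b = Counter(f"{value}\x00{fid}" for value, fid in entries)

print(max(per_row_a.values()), max(per_row_b.values()))
```

In layout A a single row accumulates thousands of columns, while in layout B no row holds more than one entry. Rows with the same value still sort next to each other, so a range over the value prefix finds them all.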
Thanks,
Emilio
On 10/23/2015 06:32 AM, Marcel wrote:
One more question about the AttributeTable.
There are five parts:
1. table sharing prefix (my prefix is empty, because I don't need
it)
2. index of sft
3. attribute-value
4. dtg (which seems to mean date-time-group... until recently I thought
this meant default geometry) - I have set a dtg-index
5. featureID
Please tell me what this dtg is used for when querying. Wouldn't
it be enough to store parts 1, 2, 3 and 5? Is this a sort of
additional indexing?
Best regards,
Marcel Jacob.
On 19.10.2015 19:22, Emilio Lahr-Vivaz wrote:
Hi Marcel,
The logic for encoding rows is mostly contained in the following
package. Each class corresponds to an index type:
https://github.com/locationtech/geomesa/tree/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/data/tables
As you say, the rows are not human readable. There are bits of
logic for decoding the rows scattered around, but in general we
don't deal with decoding rows (we encode the data, and then
encode ranges for scanning them). We don't currently have any
formatters, but if you create any please put up a pull-request
and we'll get them merged into the codebase!
Thanks,
Emilio
On 10/19/2015 12:22 PM, Marcel wrote:
Okay,
I scanned my tables to get the raw format. Unfortunately this
is not human readable. So I wrote a method which outputs rowID,
family, qualifier and visibility.
This is fine for the _st_idx table. When scanning records for
the _attr_idx table it's still not readable.
Is there a formatter which converts the records for me?
If not: When indexing an attribute like a date what particular
values are stored in the table? I assume RowID = date as long.
But what about Column family and qualifier? There has to be a
mapping between the _attr_idx table and my record table (maybe
rowId from record table?).
Best regards,
Marcel Jacob.
On 09.10.2015 18:40, Emilio Lahr-Vivaz wrote:
Hi Marcel,
We use the mango lexicoders library:
https://github.com/calrissian/mango/tree/master/mango-core/src/main/java/org/calrissian/mango/types
We treat dates as longs, based on the standard java millis
since epoch, so they get sorted based on that. Makes for
very efficient range searches, as you say.
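As a sketch of why longs work well here, the Python snippet below shows one common way to encode a signed long so that byte-wise (lexicographic) order matches numeric order: offset the value so the sign bit is flipped, then write it big-endian. This is an assumption for illustration and not necessarily the exact byte layout the mango library produces; `encode_long` and `to_millis` are hypothetical helper names.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_millis(dt: datetime) -> int:
    """Milliseconds since the epoch, computed with exact integer arithmetic."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def encode_long(value: int) -> bytes:
    """Encode a signed 64-bit value so that byte order equals numeric order.

    Adding 2**63 flips the sign bit, so negative values (dates before 1970)
    sort before positive ones when the bytes are compared as unsigned.
    """
    return struct.pack(">Q", value + (1 << 63))


dates = [
    datetime(2015, 10, 9, tzinfo=timezone.utc),
    datetime(1969, 7, 20, tzinfo=timezone.utc),
    datetime(2015, 10, 7, tzinfo=timezone.utc),
    datetime(2015, 11, 7, tzinfo=timezone.utc),
]

# Sorting by the encoded bytes gives the same order as sorting by date.
by_bytes = sorted(dates, key=lambda d: encode_long(to_millis(d)))
print(by_bytes == sorted(dates))
```

Because the encoded values sort chronologically, any date range maps to a single contiguous range of rows.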
Thanks,
Emilio
On 10/09/2015 12:14 PM, Marcel Jacob wrote:
Okay, so there is no special strategy for indexing a date?
I'm asking because I can imagine different date formats:
1) yyyy-MM-dd
lexicographic sorting order:
...
2015-10-07
2015-10-08
2015-10-09
With this format it's very efficient to query a date
range because only one table scan is needed.
2) dd-MM-yyyy
lexicographic sorting order:
...
07-10-2015
07-11-2015
07-12-2015
...
08-10-2015
...
09-10-2015
When querying for [07-10-2015 to 09-10-2015] we need
multiple table scans, and building an additional index
would be less efficient. (But now we can efficiently run
queries for a specific day of the month, e.g. the 7th. Although I
believe this type of query is very rare.)
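The difference between the two formats can be checked with a short Python sketch. It sorts the same three months of dates under each format and counts how many separate contiguous runs (that is, separate scans) are needed to cover 7 to 9 October 2015. The `count_runs` helper and the chosen date span are assumptions for illustration.

```python
from datetime import date, timedelta

# Every day from 1 October to 31 December 2015.
days = [date(2015, 10, 1) + timedelta(days=i) for i in range(92)]
wanted = [date(2015, 10, 7), date(2015, 10, 8), date(2015, 10, 9)]


def count_runs(sorted_keys, wanted_keys):
    """Count the contiguous runs of wanted keys in a sorted key list."""
    runs, inside = 0, False
    for key in sorted_keys:
        hit = key in wanted_keys
        if hit and not inside:
            runs += 1
        inside = hit
    return runs


iso_keys = sorted(d.strftime("%Y-%m-%d") for d in days)
dmy_keys = sorted(d.strftime("%d-%m-%Y") for d in days)

iso_runs = count_runs(iso_keys, {d.strftime("%Y-%m-%d") for d in wanted})
dmy_runs = count_runs(dmy_keys, {d.strftime("%d-%m-%Y") for d in wanted})

print(iso_runs, dmy_runs)
```

With format 1) the three days sit next to each other, so one scan covers them. With format 2) each day is separated from the next by the same day of other months, so each needs its own scan.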
So I will store my date in format 1).
Please correct me if my thoughts are wrong.
Thanks in advance,
Marcel Jacob.
> From: cne1x@xxxxxxxx
> To: geomesa-users@xxxxxxxxxxxxxxxx
> Date: Fri, 9 Oct 2015 07:23:57 -0400
> Subject: Re: [geomesa-users] Attribute Indexing
>
> Marcel,
>
> Roughly in order that you asked...
>
> 1. Yes, it is always possible to get the raw key-value pairs out of
> Accumulo. The easiest way is via the Accumulo shell:
>
>
> http://accumulo.apache.org/1.6/accumulo_user_manual.html#_accumulo_shell
>
> Login, and then scan the "_attr_idx" table with a command somewhat
> similar to this:
>
> scan -t geomesa_attr_idx
>
> 2. There is nothing particularly novel about the way GeoMesa stores
> secondary attribute indexes in the "_attr_idx" table. This is a
> straight lexicographically-encoded-value storage.
>
> 3. There are two parts to using secondary indexes effectively:
> encoding and querying. The best references are to the GeoMesa source
> where these occur:
>
> encoding:
> https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/data/tables/AttributeTable.scala
>
> querying:
> https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/index/AttributeIdxStrategy.scala
>
> Enjoy!
>
> Sincerely,
> -- Chris
>
>
> On Fri, 2015-10-09 at 11:27 +0200, Marcel wrote:
> > Hello,
> >
> > is there a possibility to get the complete "raw" key-value pair of a
> > table as it is saved (a sample would be enough)? I want to look in the
> > "_attr_idx" table and understand how the index is built, e.g. when
> > indexing additional attributes like another Date, an Integer or a
> > String. Is this a special or a common strategy (adapted for Accumulo)
> > for indexing? What fields available in the Accumulo table design did you
> > use (RowId, ColumnFamily, etc.)?
> >
> > Thanks,
> > Marcel Jacob.
> >
> >
> > _______________________________________________
> > geomesa-users mailing list
> > geomesa-users@xxxxxxxxxxxxxxxx
> > To change your delivery options, retrieve your password, or unsubscribe from this list, visit
> > http://www.locationtech.org/mailman/listinfo/geomesa-users