Re: [geomesa-users] Key/Index construction question.
Moises,
There isn't (so far as I know) a general-purpose walk-through of how to
extend the grammar in this way.
If you're certain you want to do this -- and we don't really recommend
it -- you could take the "%9#r" command for inserting a shard number as
an example.
Here's where there's a formatter (that can insert the contribution into
the key):
https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/index/Formatters.scala#L79
Here is where it contributes to the key plan (identifying ranges where
data might be found):
https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/index/QueryPlanners.scala#L399
You may also need to add a decoder if the information that will
participate in the key is not otherwise already in the feature:
https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/index/Decoders.scala
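
To make the formatter's role concrete, here is a minimal sketch of the kind of contribution a shard-number command makes to a key. It is not GeoMesa's formatter: the object name, the method name, and the choice to derive the shard from a hash of the feature ID are all assumptions made for illustration.

```scala
object ShardSketch {
  /** Map a feature ID onto a shard in 0..maxShard and zero-pad it to a fixed width.
    * Hypothetical helper: hashing the feature ID is an assumption, not GeoMesa's code. */
  def shard(featureId: String, maxShard: Int): String = {
    require(maxShard >= 0, "maxShard must be non-negative")
    val width = maxShard.toString.length
    // floorMod keeps the result non-negative even when hashCode is negative
    val n = java.lang.Math.floorMod(featureId.hashCode, maxShard + 1)
    ("%0" + width + "d").format(n)
  }
}
```

The fixed width matters because the key planner has to enumerate every possible shard prefix when it builds ranges, and equal-length prefixes keep those ranges sortable.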
Best of luck!
Sincerely,
-- Chris
On Wed, 2015-09-23 at 15:19 -0400, Moises Baly wrote:
> Hi Chris,
>
>
> Thank you for your answer. We are looking into the performance of using
> secondary indexes, and in parallel we'll start looking into extending
> the grammar and query planners. For the query planners, this is the
> first time we're looking through the code (you're right, it's going to
> be challenging). Would it be too hard to provide us with a high-level
> path for tackling the query planner modification?
>
>
> Thank you again for your time,
>
>
> Moises
>
> On Wed, Sep 23, 2015 at 12:19 PM, Chris Eichelberger <cne1x@xxxxxxxx>
> wrote:
> Moises,
>
> Apologies in advance, but this turned out to be (another) longer
> response than I had expected.
>
> There are two entire sections to this note: 1) directly responding to
> the question about the index-schema format; 2) suggestions for how you
> might avoid changing anything in the index-schema format, but use the
> existing indexing mechanisms.
>
>
> Part 1: Concerning the index-schema format
>
> The best reference for the index-schema format syntax is the code
> itself (fortunately, Scala makes this sort of DSL grammar mostly
> readable from source):
>
> https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/index/IndexSchema.scala#L146-L160
>
> As you can see, the RowID and the ColF re-use the same syntactic
> requirement ("keypart" inside the DSL), meaning that anything that is
> valid in the RowID is -- so far as the parser is concerned -- valid in
> the ColF. You are correct that the ID-substitution token is only valid
> at the end of the ColQ; it is not allowed in either the RowID or ColF.
>
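
As a reading aid for the schema strings in this thread, here is a rough sketch that splits a schema on "::" into its RowID, ColF, and ColQ parts and lists the "%argument#code" directives in each. It is an illustration only, not GeoMesa's parser: the object name, the regular expression, and the assumption that every directive has the shape "%...#letters" are mine.

```scala
object SchemaTokens {
  // One "%<argument>#<code>" directive; the argument may be empty (as in "%#id").
  // This pattern is an assumption for illustration, not the grammar in IndexSchema.scala.
  private val Directive = "%([^%#]*)#([a-z]+)".r

  /** Split a schema into its "::"-separated parts and list each part's (argument, code) pairs. */
  def tokenize(schema: String): List[List[(String, String)]] =
    schema.split("::").toList.map { part =>
      Directive.findAllMatchIn(part).map(m => (m.group(1), m.group(2))).toList
    }
}
```

Applied to the schema string from the unit test quoted in this thread, `SchemaTokens.tokenize("%~#s%feature#cstr%99#r::%~#s%0,4#gh::%~#s%4,3#gh%#id")` yields three lists, and the `id` directive appears only as the last element of the third (ColQ) list, which matches the restriction described above.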
> In order to do exactly what you described, there are multiple
> considerations:
>
> A. You would need to extend that grammar. This is not particularly
> difficult.
>
> B. You would need to extend the key-planners. This can be more
> challenging, especially if this is the first time you've looked
> through this code.
>
> As you might guess, the aggregate recommendation from the GeoMesa team
> would probably be, "It's probably easier to find another way to do
> what you want." Fortunately, there may be just such a way...
>
>
> Part 2: How to use the existing index structures
>
> There are multiple query strategies. For example, there is one
> strategy that is geo-time oriented, and one strategy that is
> (secondary) attribute-oriented. The difference among these is which
> index (or combination of indexes) they use in what order.
>
> This is the way the Accumulo filters are applied, roughly in order
> (for a geo-time strategy):
>
> 1. coarse geo-time filtering on the "_st_idx" (or "_z3") tables
> 2. fine-grained geo-time filtering
> 3. feature-based filtering (using ECQL expressions)
>
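
The three stages listed above all draw on a single query predicate. As a small sketch, the helper below assembles an ECQL string with a bounding box, a time interval, and an attribute test. The helper itself and the attribute names used in the example ("geom", "dtg", "kind") are hypothetical; only the BBOX, DURING, and equality syntax are standard ECQL.

```scala
object FilterLayers {
  /** Build one ECQL predicate combining a bounding box, a time interval, and an attribute test.
    * Hypothetical helper; attribute names are supplied by the caller. */
  def ecql(geom: String, dtg: String, bbox: (Double, Double, Double, Double),
           start: String, end: String, attr: String, value: String): String = {
    val (minX, minY, maxX, maxY) = bbox
    val escaped = value.replace("'", "''") // a single quote inside a literal is doubled
    s"BBOX($geom, $minX, $minY, $maxX, $maxY) AND " +
      s"$dtg DURING $start/$end AND $attr = '$escaped'"
  }
}
```

In a geo-time strategy, the BBOX and DURING clauses would drive the first two stages, and the attribute clause would be left for the feature-based stage.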
> This suggests that, if your geo-time constraints are highly selective,
> then storing the filter attribute inside your simple feature may be
> adequate to get the performance you're looking for, because that
> filtering happens only on the entries with qualifying geo-time data,
> and is distributed (uniformly) across tablet servers.
>
> If the geo-time constraints are not very selective, but your attribute
> constraints are highly selective, then using an attribute strategy in
> your query will invert that filter order, essentially performing the
> attribute-selection first, and then filtering down to the geo-time
> constraints.
>
> If neither the geo-time nor the attribute constraints are singly
> selective, then you may be able to get some lift out of creating a
> synthetic field that is jointly selective, and then use a (secondary
> attribute) index on that value.
>
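
One possible reading of the "synthetic field" idea, sketched under assumptions: join the attribute value with a coarse Geohash prefix into one string, store that string as its own attribute, and put the secondary index on it. The object name, the ":" separator, and the choice of a Geohash prefix as the spatial component are illustrative guesses, not something the GeoMesa team prescribed.

```scala
object SyntheticField {
  /** Join an attribute value with a coarse Geohash prefix into one indexable string.
    * Hypothetical helper for illustration only. */
  def jointKey(category: String, geohash: String, prefixChars: Int): String =
    s"$category:${geohash.take(prefixChars)}"
}
```

A query would then compute the same joint value for the category and area of interest and test the synthetic attribute for equality, so a single index lookup narrows on both dimensions at once.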
> I hope that helps. If not, please just let us know.
>
> Thanks!
>
> Sincerely,
> -- Chris
>
>
> On Wed, 2015-09-23 at 10:45 -0400, Moises Baly wrote:
> > Hi there:
> >
> > On the same subject of keys, I have a couple of questions when
> > building them:
> >
> > 1- I only have one way to store non-constant "strings" within the key
> > - using the #id - correct? For example, I have a point and want to
> > store something of the sort -> gh :: some_string_ie_HOUSE :: #cstr,
> > changing that string on insertion into Accumulo. The way I would do
> > this would be with a schema such as
> > "%~#s%99#r%0,11#gh::%~#s%#id::%~#s%TEST#cstr". However, this gives me
> > a parser error, I think because there is a restriction on the id()
> > position - it has to be at the end.
> >
> > The idea is that I want to be able to filter first by location (gh),
> > then by a particular string in the column family.
> >
> > 2- When building the key schema, '%#i' allows you to index what comes
> > after, right?
> >
> > Thanks for your time,
> >
> > Moises
> >
> >
> >
> >
> >
> >
> >
> > On Fri, Sep 18, 2015 at 3:29 PM, Moises Baly
> <moises@xxxxxxxxxxxxx>
> > wrote:
> > Perfect.
> >
> > Thank you again for your answers, we are looking forward to going
> > into production with GM.
> >
> > Kind regards,
> >
> > Moises
> >
> > On Fri, Sep 18, 2015 at 3:22 PM, Chris Eichelberger
> > <cne1x@xxxxxxxx> wrote:
> > Moises,
> >
> > These are reasonable questions. I'll re-use your numbering.
> >
> > 1. We right-pad lower-precision (larger) Geohashes with periods, so a
> > 10-bit Geohash for Charlottesville might be "dq..." when padded to 35
> > bits. This becomes a minor bit of hassle for the query planner, which
> > has to accommodate the (possible) presence of these characters in
> > addition to valid Geohash characters, but it's not too bad.
> >
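
The padding rule in point 1 can be sketched in a few lines. This is an illustration rather than GeoMesa's code, and the object and method names are made up. It relies on the fact that a base-32 Geohash character carries 5 bits, so 35 bits correspond to 7 characters.

```scala
object GeohashPadding {
  /** Right-pad a Geohash string with '.' up to `width` characters.
    * Hypothetical helper; not taken from the GeoMesa source. */
  def pad(gh: String, width: Int): String = {
    require(gh.length <= width, s"Geohash '$gh' is longer than $width characters")
    gh + "." * (width - gh.length)
  }
}
```

For a 10-bit (two-character) hash and a 7-character slot, `GeohashPadding.pad("dq", 7)` gives `"dq....."`, while a full-precision hash passes through unchanged.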
> > 2. You are correct that each index key encodes a disjoint subset of
> > the entire geometry's covering. Fortunately, the entire geometry is
> > stored elsewhere in the value of the Accumulo entry, so no
> > reconstruction is required on the client side.
> >
> > Sincerely,
> > -- Chris
> >
> >
> >
> > On Fri, 2015-09-18 at 15:14 -0400, Moises
> Baly wrote:
> > > This is an amazing explanation!! Thank you very much for taking the
> > > time to be so clear.
> > >
> > > Two additional questions:
> > > 1- If we are deconstructing non-point geometries into geohashes of
> > > different precisions, and, say, I specified my key schema as being
> > > "%~#s%foo#cstr%0,7#gh%99#r::_::_" (don't mind cf and cq, just an
> > > example) - in which I want to have a length 7 geohash in the row id -
> > > how do you fit the different precisions you obtain into my 7
> > > specification? Or am I not making sense here?
> > >
> > > 2- In the index schema builder, the index or data flag (%#i) builds
> > > an "index" over a particular portion of the entire key?
> > >
> > > @Emilio: so if I understood you correctly you have 6 "entire" rows,
> > > but if you look at the cf or cq portions you might have many more
> > > distinct values, correct?
> > >
> > > For example, I store a polygon, and then I want to retrieve that
> > > particular polygon. How do you go about putting it together again?
> > > It has to depend on some sort of identifier, no?
> > >
> > > Thank you both again for your time,
> > >
> > > Moises
> > >
> > >
> > >
> > > On Fri, Sep 18, 2015 at 2:47 PM, Chris
> Eichelberger
> > <cne1x@xxxxxxxx>
> > > wrote:
> > > Moises,
> > >
> > > Good question! The good news is that there is nothing special about
> > > how the keys are being constructed; the interesting part is in how
> > > GeoMesa decides which keys should be constructed...
> > >
> > > (Apologies in advance if, in the course of lecturing, I tell you
> > > things you already know.)
> > >
> > > The first point to remember is that each Geohash index-entry
> > > represents a cell. For 35-bit Geohashes, each cell is no more than
> > > ~150 meters square. A 0-bit (degenerate) Geohash is the entire
> > > surface of the (flat) Earth. Each bit of precision you add to a
> > > Geohash halves exactly one of its dimensions (when zero-based, even
> > > bits halve longitude; odd bits halve latitude).
> > >
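
The bit-by-bit halving described above can be sketched as follows. This is the textbook Geohash construction written out for illustration, not code from GeoMesa; the object and method names are mine. The `>=` comparisons put a point that sits exactly on a midpoint into the upper half, which corresponds to cells that include their minimum edges and exclude their maximum edges.

```scala
object GeohashSketch {
  private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  /** The first `bits` bits of the Geohash for a lon/lat point. */
  def encodeBits(lon: Double, lat: Double, bits: Int): Seq[Boolean] = {
    var (minLon, maxLon) = (-180.0, 180.0)
    var (minLat, maxLat) = (-90.0, 90.0)
    (0 until bits).map { i =>
      if (i % 2 == 0) { // even bit (zero-based): halve the longitude range
        val mid = (minLon + maxLon) / 2
        if (lon >= mid) { minLon = mid; true } else { maxLon = mid; false }
      } else {          // odd bit: halve the latitude range
        val mid = (minLat + maxLat) / 2
        if (lat >= mid) { minLat = mid; true } else { maxLat = mid; false }
      }
    }
  }

  /** Pack bits five at a time into base-32 characters. */
  def toBase32(bits: Seq[Boolean]): String = {
    require(bits.length % 5 == 0, "need a multiple of 5 bits")
    bits.grouped(5).map { g =>
      Base32.charAt(g.foldLeft(0)((acc, b) => (acc << 1) | (if (b) 1 else 0)))
    }.mkString
  }

  /** Geohash string with `chars` base-32 characters (5 bits each). */
  def encode(lon: Double, lat: Double, chars: Int): String =
    toBase32(encodeBits(lon, lat, chars * 5))
}
```

For a point near Charlottesville, `GeohashSketch.encode(-78.5, 38.03, 2)` produces the 10-bit hash `"dq"`.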
> > > Whenever you are indexing data that contain only single-point
> > > geometries, there will be one index-key per record, because every
> > > point will fall inside exactly one Geohash cell. (Each Geohash cell
> > > in GeoMesa includes its minimum X and minimum Y values, but excludes
> > > its maximum X and maximum Y extents.)
> > >
> > > Whenever you are indexing non-point geometries -- line strings;
> > > polygons; etc. -- you have a problem: How do you create a single
> > > index-entry for a geometry that can cross multiple cell boundaries?
> > > If you only index the vertices, you lose information about the fact
> > > that the geometry covers the space between them. There are
> > > typically two approaches to solving this problem:
> > >
> > > 1. You can encode a single entry that represents the
> > > minimum-bounding cell description that contains your geometry; or
> > >
> > > 2. you can decompose your geometry into covering cells, at
> > > potentially heterogeneous resolutions (different sizes), and index
> > > each of those separately (and then de-duplicate results at query
> > > time so that each feature appears no more than once in any given
> > > results set).
> > >
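
Approach 2 can be illustrated with a simplified sketch that covers an axis-aligned rectangle with Geohash-style cells of mixed sizes. It assumes a rectangular target rather than an arbitrary geometry, and it is not the decomposition GeoMesa performs; all names here are hypothetical. A cell is kept whole once it lies entirely inside the target or reaches the bit limit; otherwise it is split in two and each half is examined in turn.

```scala
object CoveringSketch {
  final case class Box(minX: Double, minY: Double, maxX: Double, maxY: Double) {
    // Strict comparisons: boxes that only touch along an edge do not intersect.
    def intersects(o: Box): Boolean =
      minX < o.maxX && o.minX < maxX && minY < o.maxY && o.minY < maxY
    def within(o: Box): Boolean =
      minX >= o.minX && maxX <= o.maxX && minY >= o.minY && maxY <= o.maxY
  }

  /** A cell is a box plus the number of Geohash bits needed to reach it. */
  final case class Cell(box: Box, bits: Int) {
    // Even bit counts split along longitude next, odd ones along latitude.
    def split: Seq[Cell] =
      if (bits % 2 == 0) {
        val mid = (box.minX + box.maxX) / 2
        Seq(Cell(box.copy(maxX = mid), bits + 1), Cell(box.copy(minX = mid), bits + 1))
      } else {
        val mid = (box.minY + box.maxY) / 2
        Seq(Cell(box.copy(maxY = mid), bits + 1), Cell(box.copy(minY = mid), bits + 1))
      }
  }

  val World: Cell = Cell(Box(-180, -90, 180, 90), 0)

  /** Cells of mixed resolution that together cover the rectangle `target`. */
  def cover(target: Box, maxBits: Int, cell: Cell = World): Seq[Cell] =
    if (!cell.box.intersects(target)) Seq.empty
    else if (cell.box.within(target) || cell.bits >= maxBits) Seq(cell)
    else cell.split.flatMap(c => cover(target, maxBits, c))
}
```

Covering `Box(-90, 0, 0, 67.5)` with a 10-bit limit produces one 4-bit cell and two 6-bit cells, which is the "heterogeneous resolutions" effect: each of those cells would receive its own index key, and a query touching more than one of them would need to de-duplicate the feature.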
> > > GeoMesa takes approach #2 (for now; we're experimenting with other
> > > ways to do this). This is how the polygon you quote, with a large
> > > number of points, can be decomposed into just a few covering cells;
> > > each of those covering cells receives its own index key. I've
> > > attached an image to this email that shows how a polygon and a
> > > line-string can be decomposed. In practice, we do not allow
> > > non-point geometries to be decomposed into so many covering
> > > Geohashes. Here is the reference to the code in GeoMesa where this
> > > decomposition is called:
> > >
> > > https://github.com/locationtech/geomesa/blob/master/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/index/STIndexEntry.scala#L49
> > >
> > > Please note that, with the advent of the new Z3 index, we will be
> > > revisiting this scheme. The Z3 index is much faster than the old
> > > Geohash-based index, but does not yet support non-point geometries,
> > > so it's a great opportunity for us to improve that feature.
> > >
> > > I hope this addressed some of your questions; if not, or if you
> > > think of new ones, please just let us know.
> > >
> > > Thanks!
> > >
> > > Sincerely,
> > > -- Chris
> > >
> > >
> > > On Fri, 2015-09-18 at 14:14 -0400,
> Moises
> > Baly wrote:
> > > > Hi there:
> > > >
> > > > I've come across some tests in the project in my quest to
> > > > understand how indexes work and how the index is partitioned in
> > > > Accumulo's Key (what goes where, and how it is constructed).
> > > >
> > > >
> > > > val dummyType = SimpleFeatureTypes.createType("DummyType",s"foo:String,bar:Geometry,baz:Date,$DEFAULT_GEOMETRY_PROPERTY_NAME:Geometry,$DEFAULT_DTG_PROPERTY_NAME:Date,$DEFAULT_DTG_END_PROPERTY_NAME:Date")
> > > > val customType = SimpleFeatureTypes.createType("DummyType",s"foo:String,bar:Geometry,baz:Date,*the_geom:Geometry,dt_start:Date,$DEFAULT_DTG_END_PROPERTY_NAME:Date")
> > > > customType.setDtgField("dt_start")
> > > > val dummyEncoder = SimpleFeatureSerializers(dummyType, SerializationType.AVRO)
> > > > val customEncoder = SimpleFeatureSerializers(customType, SerializationType.AVRO)
> > > > val dummyIndexValueEncoder = IndexValueEncoder(dummyType)
> > > > val geometryFactory = new GeometryFactory(new PrecisionModel, 4326)
> > > > val now = new DateTime().toDate
> > > >
> > > > val Apr_23_2001 = new DateTime(2001, 4, 23, 12, 5, 0, DateTimeZone.forID("UTC")).toDate
> > > >
> > > > val schemaEncoding = "%~#s%feature#cstr%99#r::%~#s%0,4#gh::%~#s%4,3#gh%#id"
> > > >
> > > > val index = IndexSchema.buildKeyEncoder(dummyType, schemaEncoding)
> > > > val line : Geometry = WKTUtils.read("LINESTRING(-78.5000092574703 38.0272986617359,-78.5000196719491 38.0272519798381,-78.5000300864205 38.0272190279085,-78.5000370293904 38.0271853867342,-78.5000439723542 38.027151748305,-78.5000509153117 38.027118112621,-78.5000578582629 38.0270844741902,-78.5000648011924 38.0270329867966,-78.5000648011781 38.0270165108316,-78.5000682379314 38.026999348366,-78.5000752155953 38.026982185898,-78.5000786870602 38.0269657099304,-78.5000856300045 38.0269492339602,-78.5000891014656 38.0269327579921,-78.5000960444045 38.0269162820211,-78.5001064588197 38.0269004925451,-78.5001134017528 38.0268847030715,-78.50012381616 38.0268689135938,-78.5001307590877 38.0268538106175,-78.5001411734882 38.0268387076367,-78.5001550593595 38.0268236046505,-78.5001654737524 38.0268091881659,-78.5001758881429 38.0267954581791,-78.5001897740009 38.0267810416871,-78.50059593303 38.0263663951609,-78.5007972751677 38.0261625038609)")
> > > > val item = AvroSimpleFeatureFactory.buildAvroFeature(dummyType, List("TEST_LINE", line, now, line, now, now), "TEST_LINE")
> > > > val toWrite = new FeatureToWrite(item, "", dummyEncoder, dummyIndexValueEncoder)
> > > > val indexEntries = index.encode(toWrite).toList
> > > > indexEntries.size must equalTo(1)
> > > > indexEntries.head.size() mustEqual(6)
> > > > val cf = new Text(indexEntries.head.getUpdates.get(0).getColumnFamily)
> > > > val cq = new Text(indexEntries.head.getUpdates.get(0).getColumnQualifier)
> > > > val keyStr = cf + "::" + cq
> > > >
> > > >
> > > > How do all those points in the Linestring translate to encoding
> > > > only 6 rows in Accumulo? As far as I understand, the Key
> > > > definition (string :: gh :: gh + ID) should encode a single point,
> > > > correct? What am I missing in the process here?
> > > >
> > > > If somebody could walk me through this example with special
> > > > attention to how the key is being constructed it would be very
> > > > much appreciated.
> > > >
> > > > Thank you for your time
> > > >
> > > > Moises
> > > >
> > > > _______________________________________________
> > > > geomesa-users mailing list
> > > > geomesa-users@xxxxxxxxxxxxxxxx
> > > > To change your delivery options, retrieve your password, or
> > > > unsubscribe from this list, visit
> > > > http://www.locationtech.org/mailman/listinfo/geomesa-users