Skip to main content

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [List Home]
Re: [geomesa-users] Geohashing vs R-tree (Luca Morandini) (Chris Eichelberger)

Thanks for the reply Chris-  

My takeaway is that the tradeoffs in GeoMesa consider the "Big" (capital B!) part of big geospatio-temporal to be the high-volume point-time geometry data, and that spatial queries run against lower-volume non-point geometries scale-out "well enough for most applications" when implemented as distributed Accumulo Iterators. Did I follow that right?

I'm probably far too presently lay to have new or better ideas on how to handle high volume non-point geometries, but appreciate the challenges at hand. I'm obviously very interested and so are many others; I'm sure it will be well-addressed eventually. 

-Mike

On Wed, May 20, 2015 at 12:00 PM, <geomesa-users-request@xxxxxxxxxxxxxxxx> wrote:
Send geomesa-users mailing list submissions to
        geomesa-users@xxxxxxxxxxxxxxxx

To subscribe or unsubscribe via the World Wide Web, visit
        http://www.locationtech.org/mailman/listinfo/geomesa-users
or, via email, send a message with subject or body 'help' to
        geomesa-users-request@xxxxxxxxxxxxxxxx

You can reach the person managing the list at
        geomesa-users-owner@xxxxxxxxxxxxxxxx

When replying, please edit your Subject line so it is more specific
than "Re: Contents of geomesa-users digest..."


Today's Topics:

   1. Re: Geohashing vs R-tree (Luca Morandini) (Mike Atlas)
   2. Re: Geohashing vs R-tree (Luca Morandini) (Chris Eichelberger)
   3. Re: Geohashing vs R-tree (Luca Morandini)
   4. Re: Geohashing vs R-tree (Jim Hughes)


----------------------------------------------------------------------

Message: 1
Date: Tue, 19 May 2015 12:14:39 -0400
From: Mike Atlas <mike@xxxxxxx>
To: Geomesa User discussions <geomesa-users@xxxxxxxxxxxxxxxx>
Subject: Re: [geomesa-users] Geohashing vs R-tree (Luca Morandini)
Message-ID:
        <CALmMYa6Fp9nRp-sBgMo1_2kzGJohjW-kuUqVSoDUAPLYUsFswg@xxxxxxxxxxxxxx>
Content-Type: text/plain; charset="utf-8"

All,

Just a general question regarding GeoMesa's geohashing,
big-data-spatio-temporal, etc, one very influential post I've read recently
was this one:

http://www.jandrewrogers.com/2015/03/02/geospatial-databases-are-hard/

The harsh criticisms of R-trees and space-filling curves as an
indexing/sharding strategy is intriguing. I'm supposed to track down his
"SpaceCurve" DB product demo to find out more eventually.

The question to GeoMesa list/authors: Can anyone address some of the
arguments in that blog post against the design considerations of the
tablet/hash strategy of GeoMesa?

Cheers,
Mike / Weft.io


On Tue, May 19, 2015 at 12:00 PM, <geomesa-users-request@xxxxxxxxxxxxxxxx>
wrote:

> Send geomesa-users mailing list submissions to
>         geomesa-users@xxxxxxxxxxxxxxxx
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://www.locationtech.org/mailman/listinfo/geomesa-users
> or, via email, send a message with subject or body 'help' to
>         geomesa-users-request@xxxxxxxxxxxxxxxx
>
> You can reach the person managing the list at
>         geomesa-users-owner@xxxxxxxxxxxxxxxx
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of geomesa-users digest..."
>
>
> Today's Topics:
>
>    1. Geohashing vs R-tree (Luca Morandini)
>    2. Re: Geohashing vs R-tree (Jim Hughes)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 19 May 2015 08:00:46 +1000
> From: Luca Morandini <luca.morandini1@xxxxxxxxx>
> To: Geomesa User discussions <geomesa-users@xxxxxxxxxxxxxxxx>
> Subject: [geomesa-users] Geohashing vs R-tree
> Message-ID: <555A610E.5010402@xxxxxxxxx>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
> Folks,
>
> I am drafting a presentation on Big Data spatial DBMSes, including
> GeoMesa. I read
> some material, including the "Spatio-temporal Indexing in Non-relational
> Distributed Databases" paper, but one question still lingers in my mind:
> isn't the
> main advantage of geohashing over R-tree the absence of index rebalancing?
>
> Regards,
>
> Luca Morandini
> Data Architect - AURIN project
> Melbourne eResearch Group
> Department of Computing and Information Systems
> University of Melbourne
> Tel. +61 03 903 58 380
> Skype: lmorandini
>
>
> ------------------------------
>
> Message: 2
> Date: Mon, 18 May 2015 18:49:16 -0400
> From: Jim Hughes <jnh5y@xxxxxxxx>
> To: geomesa-users@xxxxxxxxxxxxxxxx
> Subject: Re: [geomesa-users] Geohashing vs R-tree
> Message-ID: <555A6C6C.3010600@xxxxxxxx>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
> Luca,
>
> I'd agree more or less.  When Accumulo splits a tablet, that is
> basically rebalancing in terms of the B^+ tree structure being
> used/presented.  Splitting/rebalancing for R-trees is generally more
> interesting and computationally intensive.
>
> To unpack it a little bit, one could consider the time to insert a
> record and/or build the index from scratch.  For a B^+-tree, inserts are
> O(log n).  For an R-tree, the worst case of O(n) can be realized with
> hotspotting.  As a real world example cited in that paper, it is noted
> that millions of GDELT data are located a handful of points.  (Figure 10
> shows that over 6 million points are associated with (0,0) and another 6
> million with Washington, DC.)  That can give an R-tree some headaches.*
>
> For my money, I'd love to a distributed database toolkit which would
> help one use more complex data structures.  It'd be a quite challenging
> project though...
>
> Great question; let us know what else we can expand on,
>
> Jim
>
> * While working on that paper, we also tried out PostGIS with the same
> dataset.  We didn't report on that since we were rather unsure of our
> results.  Building the GIST index took days.  We did try to jitter the
> data to avoid hotspotting, and I think that may have helped some.
>
> On 05/18/2015 06:00 PM, Luca Morandini wrote:
> > Folks,
> >
> > I am drafting a presentation on Big Data spatial DBMSes, including
> > GeoMesa. I read some material, including the "Spatio-temporal Indexing
> > in Non-relational Distributed Databases" paper, but one question still
> > lingers in my mind: isn't the main advantage of geohashing over R-tree
> > the absence of index rebalancing?
> >
> > Regards,
> >
> > Luca Morandini
> > Data Architect - AURIN project
> > Melbourne eResearch Group
> > Department of Computing and Information Systems
> > University of Melbourne
> > Tel. +61 03 903 58 380
> > Skype: lmorandini
> > _______________________________________________
> > geomesa-users mailing list
> > geomesa-users@xxxxxxxxxxxxxxxx
> > To change your delivery options, retrieve your password, or
> > unsubscribe from this list, visit
> > http://www.locationtech.org/mailman/listinfo/geomesa-users
>
>
>
> ------------------------------
>
> _______________________________________________
> geomesa-users mailing list
> geomesa-users@xxxxxxxxxxxxxxxx
> To change your delivery options, retrieve your password, or unsubscribe
> from this list, visit
> http://www.locationtech.org/mailman/listinfo/geomesa-users
>
> End of geomesa-users Digest, Vol 15, Issue 13
> *********************************************
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.locationtech.org/mhonarc/lists/geomesa-users/attachments/20150519/a8be2233/attachment.html>

------------------------------

Message: 2
Date: Tue, 19 May 2015 12:38:28 -0400
From: Chris Eichelberger <cne1x@xxxxxxxx>
To: Geomesa User discussions <geomesa-users@xxxxxxxxxxxxxxxx>
Subject: Re: [geomesa-users] Geohashing vs R-tree (Luca Morandini)
Message-ID: <1432053508.31166.17.camel@xxxxxxxxxxxxxx>
Content-Type: text/plain; charset="UTF-8"

Mike,

We've seen that article before; it's a good read.

For what it's (not) worth, I agree with some of the concerns raised
about R-trees and their variants:  Any index that whose structure is
dependent upon the data, and which must be centrally maintained, is
probably going to be slow, complicated, fragile, or all of the above.

I also agree with the concern raised about space-specific indexes (such
as space-filling curves):  non-point data get tricky.  We have wrestled
with this in GeoMesa a few different ways.  The two obvious approaches
are:  1) encode a non-point geometry by its minimum-bounding hash
rectangle; 2) decompose the non-point geometry into covering index
rectangles, and de-duplicate later in the query.  There are more exotic
variants such as space-filling curve composition that we are exploring
now.

Another (often overlooked) advantage to using space-filling curves is
the dramatic speed improvement they can lend to server-side filtering.
For example, assume we used a 60-bit Hilbert curve to index our (x, y,
t) features, but only planned row ranges using the first 30 bits.  This
would keep the number of scanned ranges relatively low, but would allow
each of the server-side processes to decode high-precision versions of
the (x, y, t) values to filter against the query in a distributed
fashion.  The take-home message for me has been that a single
space-filling curve can be used for more than just query planning.

Many of the applications for which we originally created GeoMesa
continue to use point geometries, so that's where much of our focus
remains.  If you have interesting ideas for how to handle non-point
geometries better, you would find a willing and eager audience on this
mailing list.  :)

Thanks!

Sincerely,
  -- Chris


On Tue, 2015-05-19 at 12:14 -0400, Mike Atlas wrote:
> All,
>
>
> Just a general question regarding GeoMesa's geohashing,
> big-data-spatio-temporal, etc, one very influential post I've read
> recently was this one:
>
>
> http://www.jandrewrogers.com/2015/03/02/geospatial-databases-are-hard/
>
>
> The harsh criticisms of R-trees and space-filling curves as an
> indexing/sharding strategy is intriguing. I'm supposed to track down
> his "SpaceCurve" DB product demo to find out more eventually.
>
>
> The question to GeoMesa list/authors: Can anyone address some of the
> arguments in that blog post against the design considerations of the
> tablet/hash strategy of GeoMesa?
>
>
> Cheers,
> Mike / Weft.io
>
>
>
> On Tue, May 19, 2015 at 12:00 PM,
> <geomesa-users-request@xxxxxxxxxxxxxxxx> wrote:
>         Send geomesa-users mailing list submissions to
>                 geomesa-users@xxxxxxxxxxxxxxxx
>
>         To subscribe or unsubscribe via the World Wide Web, visit
>
>         http://www.locationtech.org/mailman/listinfo/geomesa-users
>         or, via email, send a message with subject or body 'help' to
>                 geomesa-users-request@xxxxxxxxxxxxxxxx
>
>         You can reach the person managing the list at
>                 geomesa-users-owner@xxxxxxxxxxxxxxxx
>
>         When replying, please edit your Subject line so it is more
>         specific
>         than "Re: Contents of geomesa-users digest..."
>
>
>         Today's Topics:
>
>            1. Geohashing vs R-tree (Luca Morandini)
>            2. Re: Geohashing vs R-tree (Jim Hughes)
>
>
>         ----------------------------------------------------------------------
>
>         Message: 1
>         Date: Tue, 19 May 2015 08:00:46 +1000
>         From: Luca Morandini <luca.morandini1@xxxxxxxxx>
>         To: Geomesa User discussions <geomesa-users@xxxxxxxxxxxxxxxx>
>         Subject: [geomesa-users] Geohashing vs R-tree
>         Message-ID: <555A610E.5010402@xxxxxxxxx>
>         Content-Type: text/plain; charset=utf-8; format=flowed
>
>         Folks,
>
>         I am drafting a presentation on Big Data spatial DBMSes,
>         including GeoMesa. I read
>         some material, including the "Spatio-temporal Indexing in
>         Non-relational
>         Distributed Databases" paper, but one question still lingers
>         in my mind: isn't the
>         main advantage of geohashing over R-tree the absence of index
>         rebalancing?
>
>         Regards,
>
>         Luca Morandini
>         Data Architect - AURIN project
>         Melbourne eResearch Group
>         Department of Computing and Information Systems
>         University of Melbourne
>         Tel. +61 03 903 58 380
>         Skype: lmorandini
>
>
>         ------------------------------
>
>         Message: 2
>         Date: Mon, 18 May 2015 18:49:16 -0400
>         From: Jim Hughes <jnh5y@xxxxxxxx>
>         To: geomesa-users@xxxxxxxxxxxxxxxx
>         Subject: Re: [geomesa-users] Geohashing vs R-tree
>         Message-ID: <555A6C6C.3010600@xxxxxxxx>
>         Content-Type: text/plain; charset=utf-8; format=flowed
>
>         Luca,
>
>         I'd agree more or less.  When Accumulo splits a tablet, that
>         is
>         basically rebalancing in terms of the B^+ tree structure being
>         used/presented.  Splitting/rebalancing for R-trees is
>         generally more
>         interesting and computationally intensive.
>
>         To unpack it a little bit, one could consider the time to
>         insert a
>         record and/or build the index from scratch.  For a B^+-tree,
>         inserts are
>         O(log n).  For an R-tree, the worst case of O(n) can be
>         realized with
>         hotspotting.  As a real world example cited in that paper, it
>         is noted
>         that millions of GDELT data are located a handful of points.
>         (Figure 10
>         shows that over 6 million points are associated with (0,0) and
>         another 6
>         million with Washington, DC.)  That can give an R-tree some
>         headaches.*
>
>         For my money, I'd love to a distributed database toolkit which
>         would
>         help one use more complex data structures.  It'd be a quite
>         challenging
>         project though...
>
>         Great question; let us know what else we can expand on,
>
>         Jim
>
>         * While working on that paper, we also tried out PostGIS with
>         the same
>         dataset.  We didn't report on that since we were rather unsure
>         of our
>         results.  Building the GIST index took days.  We did try to
>         jitter the
>         data to avoid hotspotting, and I think that may have helped
>         some.
>
>         On 05/18/2015 06:00 PM, Luca Morandini wrote:
>         > Folks,
>         >
>         > I am drafting a presentation on Big Data spatial DBMSes,
>         including
>         > GeoMesa. I read some material, including the
>         "Spatio-temporal Indexing
>         > in Non-relational Distributed Databases" paper, but one
>         question still
>         > lingers in my mind: isn't the main advantage of geohashing
>         over R-tree
>         > the absence of index rebalancing?
>         >
>         > Regards,
>         >
>         > Luca Morandini
>         > Data Architect - AURIN project
>         > Melbourne eResearch Group
>         > Department of Computing and Information Systems
>         > University of Melbourne
>         > Tel. +61 03 903 58 380
>         > Skype: lmorandini
>         > _______________________________________________
>         > geomesa-users mailing list
>         > geomesa-users@xxxxxxxxxxxxxxxx
>         > To change your delivery options, retrieve your password, or
>         > unsubscribe from this list, visit
>         > http://www.locationtech.org/mailman/listinfo/geomesa-users
>
>
>
>         ------------------------------
>
>         _______________________________________________
>         geomesa-users mailing list
>         geomesa-users@xxxxxxxxxxxxxxxx
>         To change your delivery options, retrieve your password, or
>         unsubscribe from this list, visit
>         http://www.locationtech.org/mailman/listinfo/geomesa-users
>
>         End of geomesa-users Digest, Vol 15, Issue 13
>         *********************************************
>
>
> _______________________________________________
> geomesa-users mailing list
> geomesa-users@xxxxxxxxxxxxxxxx
> To change your delivery options, retrieve your password, or unsubscribe from this list, visit
> http://www.locationtech.org/mailman/listinfo/geomesa-users




------------------------------

Message: 3
Date: Wed, 20 May 2015 08:31:14 +1000
From: Luca Morandini <luca.morandini1@xxxxxxxxx>
To: Geomesa User discussions <geomesa-users@xxxxxxxxxxxxxxxx>
Subject: Re: [geomesa-users] Geohashing vs R-tree
Message-ID: <555BB9B2.9050705@xxxxxxxxx>
Content-Type: text/plain; charset=windows-1252; format=flowed

On 19/05/15 08:49, Jim Hughes wrote:
>
> I'd agree more or less.  When Accumulo splits a tablet, that is basically
> rebalancing in terms of the B^+ tree structure being used/presented.

What if you were to shard by prefixing an evenly-distributed string to the
geohash? This would avoid rebalancing, would it not?

Regards,

Luca Morandini
Data Architect - AURIN project
Melbourne eResearch Group
Department of Computing and Information Systems
University of Melbourne
Tel. +61 03 903 58 380
Skype: lmorandini


------------------------------

Message: 4
Date: Tue, 19 May 2015 18:48:02 -0400
From: Jim Hughes <jnh5y@xxxxxxxx>
To: lmorandini@xxxxxxxx,        Geomesa User discussions
        <geomesa-users@xxxxxxxxxxxxxxxx>
Subject: Re: [geomesa-users] Geohashing vs R-tree
Message-ID: <555BBDA2.9080102@xxxxxxxx>
Content-Type: text/plain; charset=windows-1252; format=flowed

Luca,

Ah!  We do have shard prefixes in GeoMesa.  I was mainly trying to point
out that there's still some kind of tree balancing/splitting happening
in Accumulo.

The plus/minus of that sharding is that every server is involved a
spatio-temporal query.  At the minute, we are experimenting with new
table designs which may help with things.

Arguably, if there are lots of users, hitting every server is going to
be undesirable.  Lowering or removing the sharding will head toward
hotspotting.  We do have a configurable index schema to help adjust for
that.  Sadly, we have not documented it well.

Cheers,

Jim

On 05/19/2015 06:31 PM, Luca Morandini wrote:
> On 19/05/15 08:49, Jim Hughes wrote:
>>
>> I'd agree more or less.  When Accumulo splits a tablet, that is
>> basically
>> rebalancing in terms of the B^+ tree structure being used/presented.
>
> What if you were to shard by prefixing an evenly-distributed string to
> the geohash? This would avoid rebalancing, would it not?
>
> Regards,
>
> Luca Morandini
> Data Architect - AURIN project
> Melbourne eResearch Group
> Department of Computing and Information Systems
> University of Melbourne
> Tel. +61 03 903 58 380
> Skype: lmorandini
> _______________________________________________
> geomesa-users mailing list
> geomesa-users@xxxxxxxxxxxxxxxx
> To change your delivery options, retrieve your password, or
> unsubscribe from this list, visit
> http://www.locationtech.org/mailman/listinfo/geomesa-users



------------------------------

_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
http://www.locationtech.org/mailman/listinfo/geomesa-users

End of geomesa-users Digest, Vol 15, Issue 14
*********************************************


Back to the top