[geomesa-users] duplicate data in geomesa 1.2.1--how? and why?


From: Benjamin Weaver <Benjamin.Weaver@xxxxxxxxxxxxxxxxxx>
Date: Sun, 12 Feb 2017 20:28:33 +0000

Hi all,

If we ingest, say, the same line of text data twice (by mistake) in GeoMesa 1.2.1, we end up with duplicate data in our Accumulo (1.7.2) database. We are ingesting with GeoMesa-generated feature IDs (setting our featureBuilder.setFeatureID to null, without using Hints).

A colleague asked me why duplicates are generated in this case, and I realized I did not know.

1. How, exactly, in our configuration of GeoMesa + Accumulo, is a GeoMesa row or record made unique? I know the importance of Accumulo logical rows, but in this case of identical data we would want to ensure that only one GeoMesa record is inserted, namely, one instance of our GeoMesa SimpleFeature.

1a. Are duplicate GeoMesa rows added because the time at insertion differs, or because a different feature ID is randomly generated on each insertion?

Potentially related questions:

2. How are feature IDs generated by GeoMesa? I thought randomly, but I read a comment somewhere suggesting that feature IDs are created from an MD5 hash of all the values in the feature. A colleague points out that even if this is so, a feature ID does not resemble an MD5 hash, so it must be composed at least partly by other means.

3. Can we create a Z3 index using a data-derived timestamp (not the insertion timestamp) as the time dimension?
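
To make the distinction behind questions 1a and 2 concrete, here is a minimal Python sketch of the two ID strategies being contrasted. It is not GeoMesa code and makes no claim about how GeoMesa actually builds its IDs; the function names, the separator character, and the sample row are all hypothetical. It only shows that a random ID lets the same row be stored twice, while an ID derived purely from the attribute values would make the second insert overwrite the first.

```python
import hashlib
import uuid


def random_id() -> str:
    """Hypothetical strategy A: a fresh random ID on every call."""
    return str(uuid.uuid4())


def hashed_id(values) -> str:
    """Hypothetical strategy B: MD5 over the attribute values.

    The unit-separator character is an arbitrary choice to keep
    adjacent values from running together before hashing.
    """
    joined = "\x1f".join(str(v) for v in values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Made-up sample row standing in for one line of ingested text data.
row = ["vessel-42", "2017-02-12T20:28:33Z", "POINT (-1.3 51.57)"]

# Simulate ingesting the same row twice into a store keyed by feature ID.
store_random = {}
store_hashed = {}
for _ in range(2):
    store_random[random_id()] = row
    store_hashed[hashed_id(row)] = row

print(len(store_random))  # 2: each insert got a different key
print(len(store_hashed))  # 1: the second insert reused the same key
```

If IDs were generated the second way, re-ingesting identical data would not create a second record, which is why the answer to question 2 bears directly on question 1.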

All comments and perspectives are appreciated and welcome!

Ben Weaver

