Re: [geomesa-users] Transformations, Edge effects?

Hi Marcel,

The KNN query works by repeatedly querying small spiralling regions around the center point until it has found K neighbors. Theoretically, each region should be exclusive, so there shouldn't be any duplicates (assuming Point geometries in your features). However, there isn't any explicit de-duplicating, so it's entirely possible there is a subtle bug. It was also written to work against our older data index, whereas now the queries are likely going to a different index, which might introduce bugs.
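
Until that is sorted out, one client-side workaround is to drop repeated hits yourself. Here is a minimal sketch in plain Java; the class and method names are made up for illustration and are not part of GeoMesa. It uses "id:distance" strings as stand-ins for features, whereas with real results the key would be something like SimpleFeature.getID() or your globaleventid attribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not a GeoMesa class.
class KnnDedup {

    /** Keeps the first occurrence of each key and preserves the input order. */
    static <T, K> List<T> distinctBy(Iterable<T> items, Function<T, K> keyOf) {
        Set<K> seen = new HashSet<>();
        List<T> out = new ArrayList<>();
        for (T item : items) {
            // Set.add returns false when the key was already present.
            if (seen.add(keyOf.apply(item))) {
                out.add(item);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for KNN results: "id:distance" strings instead of real features.
        List<String> hits = Arrays.asList(
                "253015471:12.5", "253015471:12.5", "100:20.0", "253015471:12.5");
        System.out.println(distinctBy(hits, s -> s.split(":")[0]));
    }
}
```

Note that this only hides the symptom: if a duplicate pushes a genuine neighbor out of the top K before you de-duplicate, you would need to ask for more than K results first.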

I've filed a defect to track the issue here:

Unfortunately, the developer who wrote the KNN query has moved on to another project. However, he mentioned that you might want to set your search distance to a smaller value, especially if you are only trying to retrieve 10 features. I believe that the search distance should be a best guess as to the radius that contains your K neighbors.

Additionally, you are likely running into the same issues I mentioned previously with large date ranges. Increasing your memory or chunking up your time frames might alleviate that.
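
As a rough illustration of the chunking idea, the sketch below splits a date range into consecutive windows that you could query one at a time and then merge. It is plain Java with no GeoMesa calls; the class name, method name and window size are assumptions for the example, and you would still have to build one time filter per window yourself.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a GeoMesa class.
class TimeChunks {

    /** Splits [start, end] (both inclusive) into consecutive windows of at most `months` months. */
    static List<LocalDate[]> split(LocalDate start, LocalDate end, int months) {
        if (months < 1) {
            throw new IllegalArgumentException("months must be >= 1");
        }
        List<LocalDate[]> windows = new ArrayList<>();
        LocalDate from = start;
        while (!from.isAfter(end)) {
            LocalDate to = from.plusMonths(months).minusDays(1);
            if (to.isAfter(end)) {
                to = end; // clamp the last window to the requested end date
            }
            windows.add(new LocalDate[] { from, to });
            from = to.plusDays(1);
        }
        return windows;
    }

    public static void main(String[] args) {
        // The 2004 to 2014 range from the query, in one-year windows.
        for (LocalDate[] w : split(LocalDate.of(2004, 1, 1), LocalDate.of(2014, 12, 31), 12)) {
            System.out.println(w[0] + " .. " + w[1]);
        }
    }
}
```

For a KNN query you would run each window separately, keep the K best from each, and take the overall K nearest from the merged candidates.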



On 09/09/2015 12:20 PM, Marcel wrote:
Okay, I checked these functions and found that distance() calculates the Euclidean distance, which is not suitable for my purposes. So I wrote a workaround: I build a circle with a given radius and number of edge points, compute the positions of those points using GeodeticCalculator, and create a polygon from them. While iterating over my data I can then calculate the distance.

The duplicate entry occurs for a KNN query with a temporal constraint. If you want to reproduce the issue, I have included my query below. I think it's a bug within the KNN query. The globaleventid that occurs four times in the results is 253015471.

/**
 * Find the top 10 events where the USA investigated anything (eventrootcode = 09)
 * in the years from 2004 to 2014, with Washington as origin.
 */
private static Iterator<SimpleFeatureWithDistance> getResultsForQuery18(Map<String, String> dsConf) {

SimpleFeatureSource featureSource = SimpleFeatureSourceFactory.getSimpleFeatureSource(dsConf);

        GeometryFactory geomFactory = new GeometryFactory();
        // coordinates for washington
        double[] coordinates = { -77.0145665, 38.8993488 };
Coordinate coord = new Coordinate(coordinates[0], coordinates[1]);
        Point point = geomFactory.createPoint(coord);

        DateFormat df = new SimpleDateFormat("yyyy-MM-dd");
        Date start = null;
        Date end = null;
        try {
            start = df.parse("2004-01-01");
            end = df.parse("2014-12-31");
        } catch (java.text.ParseException e) {

        FilterFactory2 ff = CommonFactoryFinder.getFilterFactory2();
Filter timeFilter = ff.between(, ff.literal(start), ff.literal(end));

        ArrayList<Filter> attributeFilters = new ArrayList<Filter>();

Filter attributeFilter1 = ff.equal(, ff.literal("09"), false);
Filter attributeFilter2 = ff.equal(, ff.literal("USA"), false);
Filter attributeFilter3 = ff.not(ff.isNull(;
Filter attributeFilter4 = ff.not(ff.isNull(;

        Filter attributeFilterCombined = ff.and(attributeFilters);

Filter completeFilter = ff.and(timeFilter, attributeFilterCombined);

        int numberOfResults = 10;
NearestNeighbors neighborsPrepare = NearestNeighbors.apply(point, numberOfResults); // initial guess for getting k points - assuming that one day will not result k points
        double searchDistanceInMeters = 21000000;
        //maximum distance between two points on earth
        double maximumdistanceInMeters = 40075160;
        GeoHashSpiral spiral = GeoHashSpiral.apply(point, searchDistanceInMeters, maximumdistanceInMeters);
        Query q = new Query(dsConf.get(AccumuloDataStoreConfiguration.FEATURE_NAME), completeFilter,
                new String[] { GDELTConstants.GLOBAL_EVENTID, GDELTConstants.DATE, GDELTConstants.GEOM });
        NearestNeighbors neighbors = KNNQuery.runKNNQuery(featureSource, q, spiral, neighborsPrepare);

return JavaConversions.asJavaIterator(neighbors.getK().iterator());

I wrote another query that just returns the event record with globaleventid = 253015471, and only one record was returned. The KNN query is also very slow. Do you have any ideas for speeding it up by changing parameters such as searchDistanceInMeters or maximumdistanceInMeters?

Marcel Jacob.

On 04.09.2015 20:24, Jim Hughes wrote:
Hi Marcel,

The functions you are calling are actually GeoTools methods. To see a list of the available functions, you can check the WFS GetCapabilities document from your GeoServer (1) under the ogc:Function_Names tag.

Distance is a tricky thing: when one asks for a distance calculation, the units are determined by the Coordinate Reference System (CRS). GeoMesa assumes that all your data is longitude / latitude, which is EPSG:4326. In that CRS, the unit of measurement is degrees.

In order to get a 'more helpful' answer, libraries like the GeoTools GeodeticCalculator or our GeoMesa wrapper (2) can take two points specified in lon-lat and return the distance in meters. Those libraries use the Haversine formula or Vincenty's formulae (3).
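
To make the degrees-versus-meters difference concrete, here is a minimal sketch of the Haversine formula in plain Java. It is not the GeoTools or GeoMesa implementation: it treats the Earth as a sphere with an assumed mean radius, so it will differ slightly from an ellipsoidal calculation, and the class and method names are made up for the example.

```java
// Hypothetical helper, not a GeoTools or GeoMesa class.
class Haversine {

    // Assumed mean Earth radius in meters (spherical approximation).
    static final double EARTH_RADIUS_M = 6371008.8;

    /** Great-circle distance in meters between two lon/lat points given in degrees. */
    static double distanceMeters(double lon1, double lat1, double lon2, double lat2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = phi2 - phi1;
        double dLambda = Math.toRadians(lon2 - lon1);

        double sinHalfPhi = Math.sin(dPhi / 2);
        double sinHalfLambda = Math.sin(dLambda / 2);
        double a = sinHalfPhi * sinHalfPhi
                + Math.cos(phi1) * Math.cos(phi2) * sinHalfLambda * sinHalfLambda;

        // Clamp to 1.0 to guard against rounding just above 1 for antipodal points.
        return 2 * EARTH_RADIUS_M * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    public static void main(String[] args) {
        // Distance from Point(0, 0) to the Washington coordinates used in this thread.
        System.out.println(distanceMeters(0.0, 0.0, -77.0145665, 38.8993488));
    }
}
```

Note the argument order is longitude first, then latitude, matching the x/y order of the points in your data.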

The Point(0,0) lies on a corner of a GeoHash. In our implementation, GeoHashes contain their bottom and left edges, so this point is in the 's' 5-bit GeoHash.
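
If it helps to see why the point belongs to exactly one cell, below is a small sketch of the standard GeoHash bit-interleaving in plain Java. It is not GeoMesa's own code; the class name is made up, and the ">= midpoint" comparison is my way of expressing the "cells contain their bottom and left edge" convention.

```java
// Hypothetical helper, not GeoMesa's GeoHash implementation.
class GeoHashSketch {

    static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encodes a lon/lat point (degrees) into a GeoHash string of `chars` characters. */
    static String encode(double lon, double lat, int chars) {
        double lonMin = -180, lonMax = 180, latMin = -90, latMax = 90;
        StringBuilder sb = new StringBuilder();
        boolean lonBit = true; // bits alternate, starting with longitude
        int bits = 0, value = 0;
        while (sb.length() < chars) {
            if (lonBit) {
                double mid = (lonMin + lonMax) / 2;
                // A point exactly on the midpoint goes to the upper half (left edge inclusive).
                if (lon >= mid) { value = (value << 1) | 1; lonMin = mid; }
                else { value <<= 1; lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                // Same rule for latitude (bottom edge inclusive).
                if (lat >= mid) { value = (value << 1) | 1; latMin = mid; }
                else { value <<= 1; latMax = mid; }
            }
            lonBit = !lonBit;
            if (++bits == 5) { // every 5 bits form one base-32 character
                sb.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(0.0, 0.0, 1));          // the origin itself
        System.out.println(encode(-0.0001, 0.0, 1));      // just west of it
        System.out.println(encode(0.0, -0.0001, 1));      // just south of it
        System.out.println(encode(-0.0001, -0.0001, 1));  // just south-west of it
    }
}
```

The four neighboring cells get different characters, but the origin itself is assigned only to the north-east one, so an index lookup alone should not return it four times.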

One easy way to see duplicate data in GeoMesa is if you have entered the same data multiple times without specifying the feature id. If that's not what has happened, you may have found a bug. If you can write up some steps to reproduce, I'm happy to check things out.



1. For example: http://your-server/geoserver/ows?service=wfs&version=1.0.0&request=GetCapabilities

2. GeoMesa's Scala wrapper around the GeoTools GeodeticCalculator:
Unit tests/examples of use:


On 09/04/2015 12:54 PM, Marcel wrote:

I played around with some GeoMesa transformations like strConcat() and distance(). The latter returns the distance in degrees, which looks unfamiliar to me. Is there a transformation that returns the distance in meters or kilometers (given two points)? Which distance do you calculate (Euclidean distance, distance using the haversine formula, or distance based on an ellipsoid)?

Looking at the results of another query I noticed that one record occurs 4 times (Point(0, 0)). I could imagine that there is the boundary of a geohash and this point intersects with all of the four geohashes around. Do I have to remove these duplicates afterwards?

Thanks again,
