Re: [geowave-dev] Split query results in chunks

From: Eric Robertson <rwgdrummer@xxxxxxxxx>
Date: Tue, 10 Nov 2015 16:32:49 -0500

Marcel,

I would tackle this problem in one of two ways:

(1) An Accumulo Iterator/Combiner. GeoWave uses this concept with Statistics such as Count.

(2) An RDD. I have been remiss in finishing the Spark offering; so far it is more a set of examples than concrete classes. I will try to wrap that up this week. In the meantime you can use it by inspecting the kNN branch, which adds a new analytics/spark subproject. It uses the HadoopRDD and the GeoWaveInputFormat. The InputFormat still uses a Hadoop JobContext, so there are some extra functions to configure the query and data store parameters (e.g. ZooKeeper, user, password, instance and namespace).
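
To illustrate the idea behind option (1) without touching any Accumulo or GeoWave API, here is a minimal plain-Java sketch of combiner-style aggregation: each record is folded into a running aggregate as it is read, so memory grows with the number of distinct groups rather than the number of rows. The class name RunningPairCount, the "|" key separator, and the choice of a count as the aggregate are assumptions made for illustration; the thread does not say which aggregate the GROUP BY computes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical illustration only; this is not GeoWave or Accumulo code.
class RunningPairCount {

    // Folds each (actor1, actor2) country-code pair into a running count.
    // Only one map entry per distinct pair is retained; the rows themselves
    // are never collected into a list.
    static Map<String, Long> countPairs(Iterator<String[]> rows) {
        Map<String, Long> counts = new HashMap<>();
        while (rows.hasNext()) {
            String[] row = rows.next();
            String key = row[0] + "|" + row[1]; // assumed key format
            counts.merge(key, 1L, Long::sum);   // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        Iterator<String[]> rows = Arrays.asList(
                new String[] {"USA", "RUS"},
                new String[] {"DEU", "FRA"},
                new String[] {"USA", "RUS"}).iterator();
        System.out.println(countPairs(rows));
    }
}
```

A real Accumulo combiner would run this kind of fold on the tablet servers rather than on the client, which is the point of option (1); the sketch only shows the shape of the aggregation.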

On Tue, Nov 10, 2015 at 6:22 AM, Marcel Jacob <m.jacob@xxxxxxxxxxx> wrote:

Hello,
I wrote a query which needs a GROUP BY statement. Since this keyword is
not supported in GeoWave, I use Spark.
This is fine for small data sizes such as 1 to 3 GB. However, if I change to
10 GB, there is not enough heap space to answer the query, and I can't
give more heap space to my mini cluster.

Iterator<SimpleFeature> intermediateResults;

This is the iterator for my intermediate results. Unfortunately the
.remove() method is not supported. So I thought chunking up the results
should save me space. A SimpleFeature is not serializable so I have to
encapsulate it in a custom object for use with Spark. Like so:

// Copy each feature into a serializable CountryNames object for Spark.
while (intermediateResults.hasNext()) {
    sf = intermediateResults.next();
    countryCode1 = String.valueOf(sf.getAttribute("Actor1CountryCode"));
    countryCode2 = String.valueOf(sf.getAttribute("Actor2CountryCode"));
    actorCountryList.add(new CountryNames(countryCode1, countryCode2));
}

CountryNames is serializable. This loop is my bottleneck and causes
the error, because it runs on the client node. I added a counter, and every
1 million results I process the list with Spark and clear it. Afterwards
I merge the partial results into the final one. But this causes the same
error, so the memory apparently is not released. So I think the iterator is
the main problem here. Is there another way of chunking? Or do you have an
idea what else I could try?

Best regards,
Marcel Jacob.
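
The chunking described in the quoted message can be sketched in plain Java as follows. Every name here (ChunkedGroupBy, nextChunk, countChunk, countInChunks) is hypothetical, and countChunk is only a stand-in for the per-batch Spark job. Each batch gets a fresh list instead of a cleared shared one, so a finished batch is unreachable once its partial result has been merged. This cannot help if the underlying iterator itself buffers results, which the thread does not establish either way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; no Spark or GeoWave classes are used.
class ChunkedGroupBy {

    // Drains up to chunkSize elements from the iterator into a NEW list.
    static <T> List<T> nextChunk(Iterator<T> source, int chunkSize) {
        List<T> chunk = new ArrayList<>();
        while (chunk.size() < chunkSize && source.hasNext()) {
            chunk.add(source.next());
        }
        return chunk;
    }

    // Stand-in for the per-chunk Spark job: counts each key within one chunk.
    static Map<String, Long> countChunk(List<String> chunk) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : chunk) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    // Processes the iterator chunk by chunk and merges the partial counts.
    static Map<String, Long> countInChunks(Iterator<String> source, int chunkSize) {
        Map<String, Long> total = new HashMap<>();
        while (source.hasNext()) {
            List<String> chunk = nextChunk(source, chunkSize);
            countChunk(chunk).forEach((k, v) -> total.merge(k, v, Long::sum));
            // The chunk goes out of scope at the end of this iteration,
            // so only the merged totals stay reachable.
        }
        return total;
    }

    public static void main(String[] args) {
        Iterator<String> keys = Arrays.asList(
                "USA|RUS", "DEU|FRA", "USA|RUS", "USA|RUS", "DEU|FRA").iterator();
        System.out.println(countInChunks(keys, 2));
    }
}
```

In the real setting the keys would come from the CountryNames objects built in the loop above, and chunkSize would be the 1 million mentioned in the message.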