Hi list,
I’m trying to run polygonalSum for a variety of polygons on a 10x10 degree float raster. I’ve forked the geotrellis-landsat-tutorial and put together some code here:
https://github.com/wri/geotrellis-zonal-stats. I’m very new to Scala and even newer to GeoTrellis, so any help on code/style/convention is appreciated.
My code in ZonalStats.scala does the following:
- Reads a bunch of 256 x 256 raster tiles from S3 using S3GeoTiffRDD
- Converts this to a tiledRDD and then to a layerRDD
- Reads a GeoJSON file to get the geometries from it
- Maps over the geometries (115 in total), calculating polygonalSumDouble for each one
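For anyone who would rather not open the repo, here is a simplified sketch of those four steps. It is not the exact code in ZonalStats.scala: the bucket, prefix, and GeoJSON path are placeholders, the object name is made up, and the calls assume the GeoTrellis 1.x API used by the landsat tutorial (names differ in later releases). It also assumes the features are plain Polygons.

```scala
import geotrellis.raster._
import geotrellis.raster.resample._
import geotrellis.spark._
import geotrellis.spark.io.s3.S3GeoTiffRDD
import geotrellis.spark.tiling._
import geotrellis.vector._
import geotrellis.vector.io._
import geotrellis.vector.io.json._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical object name; a sketch assuming GeoTrellis 1.x.
object ZonalStatsSketch {
  def main(args: Array[String]): Unit = {
    implicit val sc: SparkContext =
      new SparkContext(new SparkConf().setAppName("ZonalStatsSketch"))

    try {
      // 1. Read the GeoTIFF tiles from S3 (placeholder bucket and prefix).
      val source: RDD[(ProjectedExtent, Tile)] =
        S3GeoTiffRDD.spatial("my-bucket", "path/to/tiles")

      // 2. Collect layer metadata, cut the rasters to a 256 x 256 layout,
      //    and attach the metadata to get a tile layer RDD.
      val (_, metadata) =
        TileLayerMetadata.fromRdd(source, FloatingLayoutScheme(256))
      val tiled: RDD[(SpatialKey, Tile)] =
        source.tileToLayout(metadata.cellType, metadata.layout, NearestNeighbor)
      val layer: TileLayerRDD[SpatialKey] = ContextRDD(tiled, metadata)

      // 3. Read the geometries from a GeoJSON file (placeholder path).
      val polygons: Vector[Polygon] =
        GeoJson.fromFile[JsonFeatureCollection]("/path/to/zones.geojson")
          .getAllPolygons()

      // 4. Compute one polygonal sum per geometry; each call runs
      //    its own Spark job over the layer.
      polygons.foreach { polygon =>
        println(layer.polygonalSumDouble(polygon))
      }
    } finally {
      sc.stop()
    }
  }
}
```

Written this way, the layer is rebuilt from S3 for every polygon unless it is cached, which may be part of why the run is slow.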
I’m running this on EMR using a YARN-managed cluster of m3.xlarge machines, and the job takes about half an hour. To package the code, I execute:
./sbt "project geotrellis-zonal-stats" assembly
And then to run it:
spark-submit --class tutorial.ZonalStats target/scala-2.11/demo-assembly-0.2.0.jar --master yarn --executor-memory 15g
Most of the GeoTrellis examples deal with making web services for tiled maps or on-the-fly geoprocessing. The workflow outlined above is for one-off analysis. While it works (the polygonalSum values are correct), I’d like to speed it up if possible. In particular, I’m wondering:
- Would following the ETL process to ingest and write GeoTrellis layers to S3 speed things up?
- Are there any shortcuts I can take regarding GeoTiffs > tiledRDD > layerRDD?
- Is it possible to get GeoJSON properties (not just geometry) when mapping over a JsonFeatureCollection?
- What colossal n00b mistakes am I making?
My ultimate goal is to store a global coverage of 0.00025 degree GeoTIFFs on S3, then tabulate polygonalSums for all GADM admin level 2 boundaries.
Thanks to all you folks for your help, and for developing such a cool tool set! Looking forward to building this into our regular workflow!
Charlie