Hi Lukasz,
If you can get all the features in GeoJSON, is the amount of data actually too big to fit into memory and to process on one machine? If the machine has enough memory, GeoTrellis would be able to help with that without Spark: read in the feature collections, spatially partition
one collection, and do a bounds query using the features of the other collection to pull out intersecting geometries.
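To make that single-machine idea concrete, here is a rough sketch in plain Python. It is not GeoTrellis API: the uniform grid index, the cell size, the (id, bounding box) feature representation, and all function names are assumptions made up for illustration. It only does the bounds query; the candidate pairs it returns would still need an exact geometry intersection test.

```python
from collections import defaultdict
from math import floor

# Hypothetical representation: a feature is (id, bbox),
# where bbox is (xmin, ymin, xmax, ymax).

def cells_for(bbox, cell_size):
    """Yield every grid cell (col, row) that the bounding box touches."""
    xmin, ymin, xmax, ymax = bbox
    for col in range(floor(xmin / cell_size), floor(xmax / cell_size) + 1):
        for row in range(floor(ymin / cell_size), floor(ymax / cell_size) + 1):
            yield (col, row)

def build_index(features, cell_size):
    """Spatially partition one collection: map each grid cell to its features."""
    index = defaultdict(list)
    for fid, bbox in features:
        for cell in cells_for(bbox, cell_size):
            index[cell].append((fid, bbox))
    return index

def bboxes_intersect(a, b):
    """True if the two bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def bounds_join(left, right, cell_size=1.0):
    """Index `left`, then query it with the bounds of each feature in `right`."""
    index = build_index(left, cell_size)
    pairs = set()  # a pair can show up in several cells, so de-duplicate
    for rid, rbox in right:
        for cell in cells_for(rbox, cell_size):
            for lid, lbox in index.get(cell, ()):
                if bboxes_intersect(lbox, rbox):
                    pairs.add((lid, rid))
    return sorted(pairs)

left = [("a", (0.0, 0.0, 1.5, 1.5)), ("b", (5.0, 5.0, 6.0, 6.0))]
right = [("x", (1.0, 1.0, 2.0, 2.0)), ("y", (10.0, 10.0, 11.0, 11.0))]
result = bounds_join(left, right)
print(result)
```

Only features that share a grid cell are ever compared, which is what keeps the join from being a full all-pairs comparison. A tree-based index would play the same role as the grid here.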
We have some functionality contained in an object called VectorJoin, which will efficiently join two vector datasets, with the caveat that the RDDs of vector data must be spatially partitioned before the join. We actually don't have a good way to do this
type of spatial partitioning yet. I've written up an issue to track this, so that the next release will have a good solution to that problem
(https://github.com/locationtech/geotrellis/issues/2116).
It's possible to lean on GeoTrellis for components of a not-out-of-the-box solution to this problem, so let me know if you are interested in diving deeper into this.
Thanks,
Rob