
Re: [geomesa-users] GeoMesa FSDS on S3 - very slow response times


Thanks, Emilio!

 

The number of partitions per day is rather small for our use case, up to four. But a single partition easily ends up containing thousands of small files, because we collect events from a (small number of) vehicles and write them directly to the GeoMesa FSDS without any pre-aggregation. Maybe the FSDS is just not a good fit for our use case and we should use a different data store? We decided on the FSDS because it was the simplest and cheapest approach we could think of, and query performance was of secondary importance to us. But as the amount of data, and with it the number of files, grows, performance has become an issue.
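For illustration, the kind of pre-aggregation we could add in our microservice is roughly the following. This is a minimal sketch against the plain GeoTools API; the type name, the attribute handling and the batch size are made up for the example, and how exactly the FSDS maps one writer to data files is something we would still have to verify:

import org.geotools.data.DataStore;
import org.geotools.data.FeatureWriter;
import org.geotools.data.Transaction;
import org.opengis.feature.simple.SimpleFeature;
import org.opengis.feature.simple.SimpleFeatureType;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Buffers incoming vehicle events and writes them in batches, so that each
 * flush appends a larger chunk of data instead of one tiny file per event.
 * Type name ("vehicle-events") and batch size are illustrative only.
 */
class BatchingWriter {
    private final DataStore store;   // FSDS data store, created once and reused
    private final String typeName;
    private final int batchSize;
    private final List<SimpleFeature> buffer = new ArrayList<>();

    BatchingWriter(DataStore store, String typeName, int batchSize) {
        this.store = store;
        this.typeName = typeName;
        this.batchSize = batchSize;
    }

    synchronized void add(SimpleFeature event) throws IOException {
        buffer.add(event);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    synchronized void flush() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        // one appending writer per flush, so a whole batch is written in one go
        FeatureWriter<SimpleFeatureType, SimpleFeature> writer =
                store.getFeatureWriterAppend(typeName, Transaction.AUTO_COMMIT);
        try {
            for (SimpleFeature event : buffer) {
                SimpleFeature next = writer.next();
                next.setAttributes(event.getAttributes());
                writer.write();
            }
        } finally {
            writer.close();
        }
        buffer.clear();
    }
}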

 

We compact by partition already. Maybe we'll give JDBC metadata persistence a try. Thanks for that hint.

 

One more thing we've noticed: using the FSDS with a local file system (i.e. a file://... URL) seems to be considerably faster than using an S3-compatible object store (i.e. an s3a://… URL). Is that simply due to an object store being slower than a local file system, or might it be an issue with the underlying org.apache.hadoop.fs.FileSystem implementation? We are using hadoop-2.8.5.
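For reference, the s3a connector in hadoop-2.8.x does expose a few tuning knobs that are relevant when scanning many small objects. A sketch of what could be tried in core-site.xml (the values are illustrative only, not recommendations):

<configuration>
  <!-- more parallel connections to S3 when opening many small files -->
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>100</value>
  </property>
  <!-- random fadvise avoids streaming whole objects for columnar formats like ORC -->
  <property>
    <name>fs.s3a.experimental.input.fadvise</name>
    <value>random</value>
  </property>
  <!-- how much data to read ahead on each seek -->
  <property>
    <name>fs.s3a.readahead.range</name>
    <value>256K</value>
  </property>
</configuration>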

 

Best,

Christian

 

From: geomesa-users-bounces@xxxxxxxxxxxxxxxx <geomesa-users-bounces@xxxxxxxxxxxxxxxx> On Behalf Of Emilio Lahr-Vivaz
Sent: Tuesday, July 30, 2019 5:58 PM
To: geomesa-users@xxxxxxxxxxxxxxxx
Subject: Re: [geomesa-users] GeoMesa FSDS on S3 - very slow response times

 

Hello,

The FSDS is going to work best when you only have to query a few large files. The metadata will be cached, so if you keep a data store around (e.g. in geoserver), it shouldn't be doing repeated reads of the metadata files. That leads me to believe that you are seeing slowness from scanning a large number of files, where the overhead of opening the file is dominating the query time.
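For example, "keeping a data store around" is nothing more than creating it once and reusing the same instance across requests. A minimal sketch; the FSDS parameter keys used here ("fs.path", "fs.encoding") and the bucket path are assumptions to verify against the FSDS documentation:

import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;

import java.io.IOException;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

/**
 * Creates the FSDS data store once and hands out the same instance, so the
 * partition metadata is read from S3 a single time and then served from cache.
 * The parameter keys and the path are assumptions for illustration.
 */
class SharedFsdsStore {
    private static volatile DataStore instance;

    static DataStore get() throws IOException {
        if (instance == null) {
            synchronized (SharedFsdsStore.class) {
                if (instance == null) {
                    Map<String, Serializable> params = new HashMap<>();
                    params.put("fs.path", "s3a://my-bucket/geomesa/"); // hypothetical bucket/path
                    params.put("fs.encoding", "orc");
                    instance = DataStoreFinder.getDataStore(params);
                }
            }
        }
        return instance;
    }
}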

A few suggestions:

You're creating a lot of partitions - up to 256 per day. How much data ends up in a typical partition with your current setup? I would suggest trying with 2 or 4 bits of precision in your partition scheme.
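A coarser scheme would look something like this when creating the feature type. Sketch only: the attribute spec and the user-data keys are illustrative assumptions, and the FSDS documentation lists the exact configuration keys and scheme names:

import org.geotools.data.DataStore;
import org.locationtech.geomesa.utils.interop.SimpleFeatureTypes;
import org.opengis.feature.simple.SimpleFeatureType;

import java.io.IOException;

/**
 * Sketch of creating a feature type with 2 xz2 bits instead of 8, i.e. at most
 * 4 spatial buckets per day instead of 256. Attribute spec and user-data keys
 * are assumptions for illustration only.
 */
class CoarserSchemaExample {
    static void createSchema(DataStore fsds) throws IOException {
        SimpleFeatureType sft = SimpleFeatureTypes.createType("vehicle-events",
                "vehicleId:String,dtg:Date,*geom:Point:srid=4326");
        sft.getUserData().put("geomesa.fs.encoding", "orc");
        sft.getUserData().put("geomesa.fs.scheme", "daily,xz2-2bits"); // 2 bits instead of 8
        sft.getUserData().put("geomesa.fs.leaf-storage", "true");
        fsds.createSchema(sft);
    }
}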

How are you ingesting data? You should try to avoid creating lots of small data files, as that requires a lot of overhead to scan.

If you aren't already, make sure that you compact by partition. Assuming your data is coming in semi-live, there won't be any writes going to older partitions. Compacting them again will not improve performance, but may generate considerable work.

Finally, you may want to switch to JDBC for metadata persistence, which should alleviate most of the issues around metadata operations:
https://www.geomesa.org/documentation/user/filesystem/metadata.html#relational-database-persistence
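As a rough sketch of what that might look like when creating the schema; the user-data key, option names and JDBC URL below are hypothetical, so please take the real ones from the page above rather than from this example:

import org.opengis.feature.simple.SimpleFeatureType;

/**
 * Hypothetical sketch only: the actual key and option names for JDBC metadata
 * persistence are documented at the link above.
 */
class JdbcMetadataExample {
    static void configureJdbcMetadata(SimpleFeatureType sft) {
        sft.getUserData().put("geomesa.fs.metadata",
            "{ name = \"jdbc\", options = { \"jdbc.url\" = \"jdbc:postgresql://db-host/geomesa\" } }");
    }
}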

Re: getTypeNames, that could probably be improved, although the metadata is read once and then cached, so you will likely pay that penalty the first time you access each feature type anyway. I've opened a ticket to track the issue here:
https://geomesa.atlassian.net/browse/GEOMESA-2678

Thanks,

Emilio

On 7/30/19 11:23 AM, christian.sickert@xxxxxxxxxxx wrote:

Hi GeoMesa Users,

 

we are using GeoMesa with an S3 file system datastore and are experiencing extremely slow response times when we access our data - even with a “moderate” number of files stored in it (let’s say 10,000).

 

Our setup:

* GeoMesa 2.3.0

* Filesystem datastore pointing to an S3 URL

** encoding: orc

** partition scheme: daily,xz2-8bits

** leaf-storage: true

 

We’re accessing that data store using different “clients”:

* a Java microservice which uses the GeoTools GeoMesa API and is running in the same AWS region as the S3 bucket

* GeoServer (2.14) running in the same AWS region as the S3 bucket

* geomesa-fs CLI running in the same AWS region as the S3 bucket

 

All of them are really slow (it takes minutes, sometimes hours, until we get a response). Doing some debugging in our microservice, we found that even operations like org.geotools.data.DataStore.getTypeNames() take really long, because all of the metadata files seem to be scanned (which does not seem necessary, since reading the per-feature-type top-level storage.json files should be sufficient). Is that “works-as-designed”, or might it be a bug in the GeoMesa FSDS implementation?
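The measurement itself is nothing fancy, just timing the plain GeoTools call, e.g. (sketch; the data store is created as usual via DataStoreFinder):

import org.geotools.data.DataStore;

import java.io.IOException;

class TypeNamesTiming {
    static void time(DataStore store) throws IOException {
        long start = System.nanoTime();
        String[] typeNames = store.getTypeNames(); // triggers metadata reads on first access
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("getTypeNames returned " + typeNames.length + " types in " + millis + " ms");
    }
}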

 

Is there anything (besides switching the actual data store) we can do to improve the performance?

 

We’re doing a “geomesa-fs compact …” from time to time which gives us a fairly acceptable performance (but also takes hours, sometimes even days, to complete).

 

Thanks,

Christian

 

 

 

Mit freundlichen Grüßen / Kind regards

Christian Sickert

Crowd Data & Analytics for Automated Driving
Daimler AG - Mercedes-Benz Cars Development - RD/AFC

+49 176 309 71612
christian.sickert@xxxxxxxxxxx

 





_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.locationtech.org/mailman/listinfo/geomesa-users

 

Default Disclaimer Daimler AG
If you are not the addressee, please inform us immediately that you have received this e-mail by mistake, and delete it. We thank you for your support.

