[geomesa-users] Ingesting Avro files into GeoMesa using Hadoop on Google Dataproc

Hello,

I've been successfully ingesting Avro-formatted data into Bigtable using the command-line tools.
The ingestion ran as a MapReduce job over Avro files located on GCS, thanks to the
Google Cloud Storage Connector for Spark and Hadoop.

By the way, don't you think it would be appropriate to include a dependency on this connector in the geomesa-bigtable-tools module by default?
A related change would be to add "gs://" to the list of distPrefixes in AbstractIngest.
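Roughly what I have in mind (a sketch only, since I'm guessing at the exact existing prefixes and the surrounding code in AbstractIngest):

    // hypothetical sketch: the existing entries and the helper are my guess at
    // what AbstractIngest roughly looks like, with "gs://" appended to the list
    val distPrefixes = Seq("hdfs://", "s3n://", "s3a://", "gs://")

    def isDistributedPath(path: String): Boolean =
      distPrefixes.exists(path.toLowerCase.startsWith)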

I've used Google Cloud Dataproc (i.e. a hosted Hadoop environment) to run the MapReduce job.
The issue I ran into was that Dataproc requires a JAR file (or several JARs) to run the job.
So I couldn't simply tell it to call "geomesa-bigtable convert ...".
The solution I came up with was to build a shaded JAR of geomesa-bigtable-tools.
Do you think it would be a good idea to provide such a JAR by default for Hadoop usage?
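For the record, the submission looks roughly like this (cluster, catalog, converter and bucket names are placeholders, and the exact ingest options of course depend on the converter and schema):

    gcloud dataproc jobs submit hadoop \
      --cluster=my-cluster \
      --jar=geomesa-bigtable-tools-shaded.jar \
      -- ingest -c my_catalog -s my_sft -C my_converter gs://my-bucket/data/*.avro

This assumes the shaded JAR declares the GeoMesa tools runner as its manifest main class; otherwise --class plus --jars would be needed.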

Last point I wanted to mention: it looks like the input of the MapReduce job was not split, even though I was using Avro files on purpose (they are splittable by design).
I suppose it has to do with AvroFileInputFormat extending FileStreamInputFormat, which explicitly returns false from isSplitable.
Should AvroFileInputFormat then simply override it to return true? (I haven't tested how GeoMesa would react.)
I suppose the TSV and CSV input formats should also be marked as splittable, by the way, shouldn't they?
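If it were to be overridden, I imagine something like this inside AvroFileInputFormat; untested, and the record reader would also have to honour split boundaries by seeking to the next Avro sync marker:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.JobContext

    // untested sketch: Avro container files carry sync markers, so splits are
    // safe in principle, provided the reader realigns to the next sync point
    override protected def isSplitable(context: JobContext, filename: Path): Boolean = true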

Thanks,

--
Damiano Albani
Geodan
