[geomesa-users] Ingesting Avro files into GeoMesa using Hadoop on Google Dataproc

Hello,

I've been successfully ingesting Avro-formatted data into Bigtable using the command-line tools.
The ingestion ran as a MapReduce job over Avro files located on GCS, thanks to the
Google Cloud Storage Connector for Spark and Hadoop.

By the way, don't you think it would be appropriate to include a dependency on this connector in the geomesa-bigtable-tools module by default?
A related change would be to add "gs://" to the list of distPrefixes in AbstractIngest.
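Roughly what I have in mind (a sketch only, since I'm guessing at the exact existing prefixes and the surrounding code in AbstractIngest):

    // hypothetical sketch: the existing entries and the helper are my guess at
    // what AbstractIngest roughly looks like, with "gs://" appended to the list
    val distPrefixes = Seq("hdfs://", "s3n://", "s3a://", "gs://")

    def isDistributedPath(path: String): Boolean =
      distPrefixes.exists(path.toLowerCase.startsWith)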

I've used Google Cloud Dataproc (i.e. a hosted Hadoop environment) to run the MapReduce job.
The issue I ran into was that Dataproc requires a JAR file (or several JARs) to run the job.
So I couldn't simply tell it to call "geomesa-bigtable convert ...".
The solution I came up with was to build a shaded JAR of geomesa-bigtable-tools.
Do you think it would be a good idea to provide such a JAR by default for Hadoop usage?
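For the record, the submission looks roughly like this (cluster, catalog, converter and bucket names are placeholders, and the exact ingest options of course depend on the converter and schema):

    gcloud dataproc jobs submit hadoop \
      --cluster=my-cluster \
      --jar=geomesa-bigtable-tools-shaded.jar \
      -- ingest -c my_catalog -s my_sft -C my_converter gs://my-bucket/data/*.avro

This assumes the shaded JAR declares the GeoMesa tools runner as its manifest main class; otherwise --class plus --jars would be needed.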

Last point I wanted to mention: it looks like the input of the MapReduce job was not split, even though I was using Avro files on purpose (they are splittable by design).
I suppose it has to do with AvroFileInputFormat extending FileStreamInputFormat, which explicitly returns false from isSplitable.
Should AvroFileInputFormat then simply override it to return true? (I haven't tested how GeoMesa would react.)
I suppose the TSV and CSV input formats should also be marked as splittable, by the way, shouldn't they?
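If it were to be overridden, I imagine something like this inside AvroFileInputFormat; untested, and the record reader would also have to honour split boundaries by seeking to the next Avro sync marker:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.JobContext

    // untested sketch: Avro container files carry sync markers, so splits are
    // safe in principle, provided the reader realigns to the next sync point
    override protected def isSplitable(context: JobContext, filename: Path): Boolean = true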

Thanks,

--
Damiano Albani
Geodan
