Re: [geomesa-users] Ingesting Avro files into GeoMesa using Hadoop on Google Dataproc

Dan,

This is great!  Any chance you could submit a PR?  I'll merge ASAP as I
need it for some work I'm doing now.  I just haven't gotten around to
enabling the distributed ingest on GCP - it currently works on S3 and
HDFS.  And regarding the shaded jars, definitely open to suggestions.  I
struggled with this recently when running some Spark jobs on a GCP
Dataproc cluster.  Basically, getting the hdfs-site.xml file with the
INSTANCE and PROJECT set properly into the jar that gets distributed
should happen as part of the deployment.  What do you think?
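
In the meantime, something along these lines might tide things over: set the
values on the job configuration at submit time instead of baking them into
the jar.  A minimal sketch (the property keys are what I believe the Cloud
Bigtable HBase client reads, so treat them as an assumption; the values are
placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    val conf = new Configuration()
    // the same two keys that would otherwise live in the site XML file
    conf.set("google.bigtable.project.id", "my-project")    // PROJECT
    conf.set("google.bigtable.instance.id", "my-instance")  // INSTANCE
    val job = Job.getInstance(conf, "geomesa-bigtable-ingest")

The deployment step would then just need to write those same two keys into
the XML that ships inside the distributed jar.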

Thanks,
Anthony


Damiano Albani <damiano.albani@xxxxxxxxx> writes:

> Hello,
>
> I've been successfully ingesting Avro-formatted data into Bigtable using
> the command line program.
> This was done via a MapReduce job targeting Avro files located on GCS,
> thanks to the
> Google Cloud Storage Connector for Spark and Hadoop
> <https://cloud.google.com/hadoop/google-cloud-storage-connector>.
>
> By the way, don't you think it would be appropriate to include a dependency
> on this connector in the *geomesa-bigtable-tools* module by default?
> A related change would be to add *"gs://"* to the list of *distPrefixes* in
> *AbstractIngest*
> <https://github.com/locationtech/geomesa/blob/master/geomesa-tools/src/main/scala/org/locationtech/geomesa/tools/ingest/AbstractIngest.scala#L91>.
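>
> Concretely, the change would be a one-liner in that list (a sketch; I'm
> quoting the existing entries from memory, so only the added *"gs://"* is
> the point):
>
>     // in AbstractIngest: recognize GCS paths as distributed-ingest inputs
>     val distPrefixes = Seq("hdfs://", "s3n://", "s3a://", "gs://")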
>
> I've used Google Cloud Dataproc (i.e. a hosted Hadoop environment) to run
> the MapReduce job.
> The issue I ran into was that Dataproc requires a JAR file (or several
> JARs) to run the job.
> So I couldn't simply tell it to call *"geomesa-bigtable convert ..."*.
> The solution I came up with was to build a shaded JAR of
> *geomesa-bigtable-tools*.
> Do you think it would be a good idea to provide such a JAR by default for
> Hadoop usage?
>
> Last point I wanted to mention: it looks like the input of the MapReduce
> job was *not* split, even though I chose Avro files precisely so that it
> could be.
> I suppose it has to do with *AvroFileInputFormat*
> <https://github.com/locationtech/geomesa/blob/master/geomesa-jobs/src/main/scala/org/locationtech/geomesa/jobs/mapreduce/AvroFileInputFormat.scala>
> extending *FileStreamInputFormat*
> <https://github.com/locationtech/geomesa/blob/master/geomesa-jobs/src/main/scala/org/locationtech/geomesa/jobs/mapreduce/FileStreamInputFormat.scala>,
> whose *isSplitable* explicitly returns *false*.
> Should *AvroFileInputFormat* thus simply override *isSplitable* to return
> *true*? (I haven't tested how GeoMesa would react.)
> By the way, I suppose the TSV and CSV input formats should also be marked
> as splitable, shouldn't they?
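>
> For concreteness, the override I'm imagining (untested, as I said; note
> also that an Avro container file can only be split safely at its sync
> markers, so the record reader would need to handle that as well):
>
>     import org.apache.hadoop.fs.Path
>     import org.apache.hadoop.mapreduce.JobContext
>
>     class AvroFileInputFormat extends FileStreamInputFormat {
>       // Hadoop's FileInputFormat hook (spelled "isSplitable" upstream)
>       override protected def isSplitable(context: JobContext, filename: Path): Boolean = true
>       // ... createRecordReader etc. unchanged
>     }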
>
> Thanks,

