Re: [geomesa-users] Ingesting Avro files into GeoMesa using Hadoop on Google Dataproc

Hello,

On Mon, Feb 20, 2017 at 3:23 PM, Emilio Lahr-Vivaz <elahrvivaz@xxxxxxxx> wrote:
Awesome! I think that Avro files are not splittable in our input format because they have a defined header and format that must be read by a single mapper. My understanding is that it's like XML: if you arbitrarily split an XML document, each piece will no longer be valid. I could be wrong, though, and there may be better workarounds as well.

Indeed, I agree that the reason Avro files aren't split lies in GeoMesa's input format, or at least the current one.
I applied what I mentioned previously: overriding AvroFileInputFormat so that isSplitable returns true.
And I can report that the input was indeed split (e.g. 58 splits for a 3+ GB Avro file):
17/02/21 11:01:13 INFO mapreduce.JobSubmitter: number of splits:58
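
For reference, the override itself is tiny. Here is a rough sketch, using Avro's stock AvroKeyInputFormat as a stand-in base class (the actual class to subclass depends on how your job is wired up):

import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext

// Sketch only: AvroKeyInputFormat is a stand-in; swap in the input
// format your job actually uses. FileInputFormat consults this hook
// per file, and returning true lets Hadoop cut the file into
// block-sized splits instead of handing it whole to a single mapper.
class SplittableAvroInputFormat[T] extends AvroKeyInputFormat[T] {
  override protected def isSplitable(context: JobContext, filename: Path): Boolean =
    true
}

This is safe despite the file-level header because Avro container files carry sync markers, so a record reader can resynchronize at any split boundary.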

Now the remaining issue is that I don't understand the overall behavior of the MapReduce job on Google Dataproc: only one worker node (out of the two) gets tasks (albeit, correctly, one task per vCPU) and, even more surprisingly, I don't see any improvement in Bigtable write throughput.
That's not particularly GeoMesa-specific, I suppose, but if you have any idea about what's going on, I'm interested!

Regards,

--
Damiano Albani
Geodan
