For some reason, I am not seeing the latest threads from the group, so I will respond to your request here.
In fact, you have stumbled on two concerns:
(1) The code does not handle the empty case well. I created issue #513 and addressed it in branch GEOWAVE-513. You are welcome to opine on the fix, as I plan to have it reviewed and merged tomorrow. We are about to make another minor release (which includes a DBScan refactor).
Ok, thanks. That works for me. With my schedule this week, I can wait for the merge into master and get the latest from GitHub.
(2) The batch ingest only updates statistics at the end. The DBScan refactor branch does have an adjustment to support periodic flushes.
Thanks for confirming about the statistics. Should the -stats command work to fix that after the ingest is stopped? I tried running that command but it didn't work. If it _should_ work, I can bring that up in a separate thread with the behavior I'm seeing.
Fixed-width histograms suffer greater distortion with frequent 'merges' of independent records. The intent was to minimize the number of 'writes' and 'merges' by flushing at the end. I ran into the same problem you did, hence the fix. One question I have: the fix flushes after a fixed number of records, and the only control given to the developer is a 'system property' that turns off flushing. It may make sense to also expose a system property to alter the flush rate (at the developer's own risk of distortion and performance degradation); a rough sketch of what that could look like follows below. Thoughts?
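For illustration only, a minimal sketch of a tunable periodic flush. The StatisticsWriter interface and the property names (stats.flush.disable, stats.flush.count) are placeholders I made up for this example, not GeoWave's actual API:

    // Hypothetical sketch: flush statistics every N ingested records,
    // with N controlled by a system property. Identifiers are illustrative.
    public class PeriodicStatsFlusher {

        /** Stand-in for whatever component actually persists/merges statistics. */
        public interface StatisticsWriter {
            void flush();
        }

        // Assumed system properties: one to disable periodic flushing entirely,
        // one to tune how many records are ingested between flushes.
        private static final boolean FLUSH_DISABLED =
                Boolean.getBoolean("stats.flush.disable");
        private static final int FLUSH_EVERY =
                Integer.getInteger("stats.flush.count", 10000);

        private final StatisticsWriter writer;
        private int sinceLastFlush = 0;

        public PeriodicStatsFlusher(final StatisticsWriter writer) {
            this.writer = writer;
        }

        /** Call once per ingested record. */
        public void recordIngested() {
            if (FLUSH_DISABLED) {
                return;
            }
            if (++sinceLastFlush >= FLUSH_EVERY) {
                // Each flush merges the partial histogram into the stored statistics.
                // A smaller FLUSH_EVERY means more merges, hence more writes and more
                // distortion of fixed-width histograms -- the trade-off described above.
                writer.flush();
                sinceLastFlush = 0;
            }
        }
    }

A lower flush count gives fresher statistics at the cost of more merge distortion; the default here is arbitrary.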
I'm new to this whole stack (HDFS on up), so I'm not sure of the implications of altering the flush rate. If there are a couple of example flush rates that should work, I would try them.
The DBScan refactor addresses some critical issues. There are some bugs in the current code that lead to indeterminate results. Furthermore, performance is directly affected by the partitioning in the Mapper. Originally, I thought the best performance would come from a cell size equal to twice the maximum distance. For a heavy load of data distributed over a large map, a small cluster cannot handle that number of keys. Since the partition size is independently configurable, I can choose a large cell size to reduce the number of keys (or buy a bigger cluster). This increases the workload (and memory requirements) of the reducer.

To compensate, the reducer performs a secondary partitioning with the cell size equal to twice the maximum distance (see the sketch below). The reducer tosses cells that contain fewer than the minimum number of neighbors. I realize this may toss some critical geometries, but it reduces the overall workload considerably and only affects less dense areas. The reducer then pre-processes the data looking for geometries with a large number of neighbors, compressing them into single convex polygons. I found there is nothing more telling than processing large amounts of data on a small, under-nourished cluster.
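To picture the cell-key idea, here is a toy sketch assuming simple planar coordinates and hypothetical class names; the real code is far more involved (geodesic distance, neighbor lists, convex-hull compression), so treat this purely as illustration:

    // Toy grid partitioner: points within maxDistance of each other always land
    // in the same cell or an immediately adjacent cell when the cell size is
    // twice the maximum distance. Names and types are illustrative only.
    public final class GridPartitioner {

        private final double cellSize;

        public GridPartitioner(final double maxDistance) {
            this.cellSize = 2.0 * maxDistance;
        }

        /** Map a coordinate to a cell key; points in the same cell share a key. */
        public String cellKey(final double x, final double y) {
            final long col = (long) Math.floor(x / cellSize);
            final long row = (long) Math.floor(y / cellSize);
            return col + ":" + row;
        }
    }

The same arithmetic applies to the Mapper's coarser partitioning: a larger configured cell size simply divides by a bigger number, yielding fewer distinct keys at the cost of more work (and memory) per reducer.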
Ours is definitely undernourished for the geolife ingest. For a reference point, I'm running Hortonworks on two CentOS VMs: one has the NameNode, Secondary NameNode, Accumulo Master, Ambari, etc., and the other is the DataNode / Tablet Server. The first has 18GB of RAM and maxed out on the last run, so it definitely needs more (that's why I killed the ingest). The second machine has 24GB and seems to run fine. In hindsight, I think the allocation is backwards, so we may give 22GB to the NameNode machine and 20GB to the DataNode. When we get more hardware, we intend to scale out so that it's a more respectable instance.
Thank you for the feedback.
Scott