For some reason, I am not seeing the latest threads from the group, so I will respond to your request here.
In fact, you have stumbled on two concerns:
(1) The code does not handle the empty case well. I created issue #513 and addressed it in branch GEOWAVE-513. You are welcome to opine on the fix, as I plan to have it reviewed and merged tomorrow. We are about to make another minor release (which includes a DBScan refactor).
Ok, thanks. That works for me. With my schedule this week, I can wait for the merge into master and get the latest from GitHub.
(2) The batch ingest only updates statistics at the end. The DBScan refactor branch does have an adjustment to support periodic flushes.
Thanks for confirming about the statistics. Should the -stats command work to fix that after the ingest is stopped? I tried running that command but it didn't work. If it _should_ work, I can bring that up in a separate thread with the behavior I'm seeing.
Fixed-width histograms suffer greater distortion with frequent 'merges' of independent records. The intent was to minimize the number of 'writes' and 'merges' by flushing at the end. I ran into the same problem you did, hence the fix. One question I have: the fix flushes after a fixed number of records, and the only control given to the developer is a 'system property' that turns off flushing. It may make sense to also expose a system property to alter the flush rate (at the developer's own risk of distortion and performance degradation); a rough sketch of what that could look like follows below. Thoughts?
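For illustration only, a minimal sketch of a tunable periodic flush. The StatisticsWriter interface and the property names (stats.flush.disable, stats.flush.count) are placeholders I made up for this example, not GeoWave's actual API:

    // Hypothetical sketch: flush statistics every N ingested records,
    // with N controlled by a system property. Identifiers are illustrative.
    public class PeriodicStatsFlusher {

        /** Stand-in for whatever component actually persists/merges statistics. */
        public interface StatisticsWriter {
            void flush();
        }

        // Assumed system properties: one to disable periodic flushing entirely,
        // one to tune how many records are ingested between flushes.
        private static final boolean FLUSH_DISABLED =
                Boolean.getBoolean("stats.flush.disable");
        private static final int FLUSH_EVERY =
                Integer.getInteger("stats.flush.count", 10000);

        private final StatisticsWriter writer;
        private int sinceLastFlush = 0;

        public PeriodicStatsFlusher(final StatisticsWriter writer) {
            this.writer = writer;
        }

        /** Call once per ingested record. */
        public void recordIngested() {
            if (FLUSH_DISABLED) {
                return;
            }
            if (++sinceLastFlush >= FLUSH_EVERY) {
                // Each flush merges the partial histogram into the stored statistics.
                // A smaller FLUSH_EVERY means more merges, hence more writes and more
                // distortion of fixed-width histograms -- the trade-off described above.
                writer.flush();
                sinceLastFlush = 0;
            }
        }
    }

A lower flush count gives fresher statistics at the cost of more merge distortion; the default here is arbitrary.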
I'm new to this whole stack (HDFS on up), so I'm not sure of the implications of altering the flush rate. If there are a couple of example flush rates that should work, I would try them.
The DBScan refactor addresses some critical issues. There are some bugs in the current code that lead to indeterminate results. Furthermore, performance is directly affected by the partitioning in the Mapper. Originally, I thought the best performance would come from a cell size equal to twice the maximum distance. For a heavy load of data distributed over a large map, a small cluster cannot handle that number of keys. Since the partition size is independently configurable, I can choose a large cell size to reduce the number of keys (or buy a bigger cluster). This increases the workload (and memory requirements) of the reducer.

To compensate, the reducer performs a secondary partitioning with the cell size equal to twice the maximum distance (see the sketch below). The reducer tosses cells that contain fewer than the minimum number of neighbors. I realize this may toss some critical geometries, but it reduces the overall workload considerably and only affects less dense areas. The reducer then pre-processes the data looking for geometries with a large number of neighbors, compressing them into single convex polygons. I found there is nothing more telling than processing large amounts of data on a small, under-nourished cluster.
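To picture the cell-key idea, here is a toy sketch assuming simple planar coordinates and hypothetical class names; the real code is far more involved (geodesic distance, neighbor lists, convex-hull compression), so treat this purely as illustration:

    // Toy grid partitioner: points within maxDistance of each other always land
    // in the same cell or an immediately adjacent cell when the cell size is
    // twice the maximum distance. Names and types are illustrative only.
    public final class GridPartitioner {

        private final double cellSize;

        public GridPartitioner(final double maxDistance) {
            this.cellSize = 2.0 * maxDistance;
        }

        /** Map a coordinate to a cell key; points in the same cell share a key. */
        public String cellKey(final double x, final double y) {
            final long col = (long) Math.floor(x / cellSize);
            final long row = (long) Math.floor(y / cellSize);
            return col + ":" + row;
        }
    }

The same arithmetic applies to the Mapper's coarser partitioning: a larger configured cell size simply divides by a bigger number, yielding fewer distinct keys at the cost of more work (and memory) per reducer.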
Ours is definitely undernourished for the geolife ingest. For a reference point, I'm running Hortonworks on two CentOS VMs: one has the NameNode, Secondary NameNode, Accumulo Master, Ambari, etc., and the other is the DataNode / Tablet Server. The first has 18GB of RAM and maxed out on the last run, so it definitely needs more (that's why I killed the ingest). The second machine has 24GB and seems to run fine. In hindsight, I think the allocation is backwards, so we may give 22GB to the NameNode machine and 20GB to the DataNode. When we get more hardware, we intend to scale out so that it's a more respectable instance.
Thank you for the feedback.
Scott