Re: [geomesa-users] How to avoid Tablet Server crashing, and server cost problem...

Hi Takashi,

In geomesa-1.2.3 there are 2 different parameters you might be referencing: generateStats and collectQueryStats. collectQueryStats was named collectStats in older versions, but was renamed to make it clearer. generateStats will store summary statistics about your data, which are then used for query planning. Since you are deleting the data every few minutes, you probably want to disable this, as it will introduce needless overhead. collectQueryStats is a simple form of auditing that will log every query to Accumulo, in the '<catalog>_queries' table.
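
For example, to disable both when you connect (a minimal sketch, assuming the 1.2.3 parameter keys; the connection values are placeholders):

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;
    import org.geotools.data.DataStore;
    import org.geotools.data.DataStoreFinder;

    Map<String, Serializable> params = new HashMap<>();
    params.put("instanceId", "myInstance"); // placeholder connection values
    params.put("zookeepers", "zoo1,zoo2,zoo3");
    params.put("user", "myUser");
    params.put("password", "myPassword");
    params.put("tableName", "myCatalog");
    // skip summary statistics - needless overhead for short-lived data
    params.put("generateStats", Boolean.FALSE);
    // skip per-query audit logging to the '<catalog>_queries' table
    params.put("collectQueryStats", Boolean.FALSE);
    DataStore ds = DataStoreFinder.getDataStore(params);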

When you delete your data, how are you doing it? It might be the rapid deleting and re-creating that is causing Accumulo problems - table state is managed by a single master process, and that often seems to cause contention. Depending on your use case, you might consider the Kafka data store instead - it excels at real-time data, and the hardware costs are considerably lower. It doesn't provide some of the more advanced features of the Accumulo data store, but that might not be a problem for you.
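
If you are currently dropping and re-creating the tables (or the whole schema), deleting just the features avoids most of that master contention. A minimal sketch with the GeoTools API, reusing 'ds' from the sketch above ('mySft' is a placeholder type name):

    import org.geotools.data.simple.SimpleFeatureStore;
    import org.opengis.filter.Filter;

    SimpleFeatureStore store = (SimpleFeatureStore) ds.getFeatureSource("mySft");
    // removes all features but keeps the tables and schema in place
    store.removeFeatures(Filter.INCLUDE);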

Another tip to reduce your hardware requirements is to disable any indices that you aren't using. It sounds like all your queries are against the z2 index (i.e. they have a spatial component) - if so, you could disable the other indices. See here for instructions: http://www.geomesa.org/documentation/1.2.3/user/data_management.html#customizing-index-creation
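
Roughly like this, sketched from memory - please verify the exact user-data key and index names against the linked docs for your version:

    import org.locationtech.geomesa.utils.interop.SimpleFeatureTypes;
    import org.opengis.feature.simple.SimpleFeatureType;

    // set before calling createSchema; 'table.indexes.enabled' is my
    // recollection of the 1.2.x key - check the documentation link above
    SimpleFeatureType sft = SimpleFeatureTypes.createType("mySft",
        "name:String,*geom:Point:srid=4326");
    sft.getUserData().put("table.indexes.enabled", "records,z2");
    ds.createSchema(sft);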

Thanks,

Emilio

On 12/14/2016 01:47 AM, Takashi Sasaki wrote:
Oops, I forgot to mention something important.

I'm actually not ingesting the data with mapreduce; I'm using
multithreaded programming.
So I delete and re-ingest seven tables in parallel (roughly as sketched
below).
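
Something along these lines (a simplified illustration, not my exact code):

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // one task per feature type; the names are placeholders
    List<String> typeNames = Arrays.asList("type1", "type2", "type3",
        "type4", "type5", "type6", "type7");
    ExecutorService pool = Executors.newFixedThreadPool(typeNames.size());
    for (String typeName : typeNames) {
        pool.submit(() -> {
            // delete the existing features for typeName, then re-ingest the new batch
        });
    }
    pool.shutdown();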

2016-12-14 15:23 GMT+09:00 Takashi Sasaki <tsasaki609@xxxxxxxxx>:
Hi Andrew,

Answer:
I'm ingesting the data from an Apache Spark program, which is mapreduce,
but the ingest happens after collecting the data to the driver node, so
it is effectively a single Java program.

Note:
I'm not using a timestamp attribute, so GeoMesa creates a Z2 index table.
I delete and re-ingest the data every few minutes for near-realtime
image rendering in GeoServer.

Additional question:
The Accumulo (GeoMesa) data store has a "collectStats" parameter.
What is the benefit of using it?
I tried to find documentation for it, but could not.

Thank you for the reply,
Takashi

2016-12-14 13:22 GMT+09:00 Andrew <ahulbert@xxxxxxxx>:
Trying again...

Hi Takashi,

I'll try to look into question #1, but note that I generally use EC2
instead, so I'll have to see how it handles zookeepers ... Question: how
are you ingesting the data? Is it mapreduce or a single Java program?

But in the meantime, for #2 you should certainly be able to use r3.xlarge
instances with 4 CPUs and 30 GB of RAM. We have built many clusters on EC2
this way. I would try to use at least 5 or 10 nodes if you can. Just make
sure to spread the Accumulo master, the Hadoop namenode, and the 3
zookeepers across separate nodes. I'll update that wiki page with some
more examples.

Also note that you can use EBS storage which can be cheaper.
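
As a very rough starting point on a 30 gig box (illustrative numbers only - tune against your workload, and size the tserver JVM heap to match in accumulo-env.sh), something like this in accumulo-site.xml:

    <property>
      <name>tserver.memory.maps.max</name>
      <value>4G</value>
    </property>
    <property>
      <name>tserver.cache.data.size</name>
      <value>2G</value>
    </property>
    <property>
      <name>tserver.cache.index.size</name>
      <value>1G</value>
    </property>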

Andrew




-------- Original message --------
From: Takashi Sasaki <tsasaki609@xxxxxxxxx>
Date: 12/13/16 20:19 (GMT-05:00)
To: geomesa-users@xxxxxxxxxxxxxxxx
Subject: [geomesa-users] How to avoid Tablet Server crashing, and server cost problem...

Hello,

I use GeoMesa 1.2.3 with Accumulo 1.7.2 on AWS EMR release emr-4.7.0,
with Hadoop 2.7.2 and ZooKeeper-Sandbox 3.4.8.
I have two problems with system management. One is that the Accumulo
Tablet Server suddenly crashes, and the other is that the GeoMesa
(Accumulo) hardware requirements are too expensive for my company.

1.
Writing features to the GeoMesa (Accumulo) data store became very slow
when an Accumulo Tablet Server crashed, and GeoServer image rendering
response also slowed down.
When the Tablet Server is alive, writing features takes one or two
minutes, but in the above state it takes eight minutes or more.

The cause is probably waiting for the Accumulo Master server to
rebalance tablets.

[Accumulo Master Server log]
2016-12-13 03:34:26,008 [master.Master] WARN : Lost servers [ip-10-24-83-37:9997[358b52962d101f8]]

[Accumulo Tablet Server log]
2016-12-13 03:34:47,943 [hdfs.DFSClient] WARN : DFSOutputStream ResponseProcessor exception for block BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2282)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:734)
2016-12-13 03:34:47,943 [zookeeper.ClientCnxn] WARN : Client session timed out, have not heard from server in 53491ms for sessionid 0x358b52962d101f8
2016-12-13 03:34:47,944 [hdfs.DFSClient] WARN : Error Recovery for block BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460 in pipeline DatanodeInfoWithStorage[10.24.83.37:50010,DS-43dbba0e-2bbd-4f8e-8a07-3f869123925c,DISK], DatanodeInfoWithStorage[10.24.83.39:50010,DS-f66dcecb-1b44-431c-83fc-ab343339c485,DISK]: bad datanode DatanodeInfoWithStorage[10.24.83.37:50010,DS-43dbba0e-2bbd-4f8e-8a07-3f869123925c,DISK]
2016-12-13 03:34:48,444 [hdfs.DFSClient] WARN : DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460 does not exist or is not under Constructionblk_1074003283_264798
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:6239)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6306)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:805)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:955)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
    at org.apache.hadoop.ipc.Client.call(Client.java:1475)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy15.updateBlockForPipeline(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:901)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy16.updateBlockForPipeline(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1173)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
2016-12-13 03:34:48,445 [hdfs.DFSClient] WARN : Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460 does not exist or is not under Constructionblk_1074003283_264798
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:6239)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6306)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:805)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:955)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
    at org.apache.hadoop.ipc.Client.call(Client.java:1475)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy15.updateBlockForPipeline(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:901)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy16.updateBlockForPipeline(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1173)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
2016-12-13 03:34:48,445 [log.DfsLogger] WARN : Exception syncing
java.lang.reflect.InvocationTargetException
2016-12-13 03:34:48,463 [zookeeper.ClientCnxn] WARN : Unable to reconnect to ZooKeeper service, session 0x358b52962d101f8 has expired
2016-12-13 03:34:48,546 [log.DfsLogger] WARN : Exception syncing
java.lang.reflect.InvocationTargetException
2016-12-13 03:34:48,546 [log.DfsLogger] ERROR: Failed to close log file
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460 does not exist or is not under Constructionblk_1074003283_264798
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:6239)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6306)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:805)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:955)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
    at org.apache.hadoop.ipc.Client.call(Client.java:1475)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy15.updateBlockForPipeline(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:901)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy16.updateBlockForPipeline(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1173)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
2016-12-13 03:34:48,564 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception communicating with ZooKeeper
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/3a6635fa-2c60-4860-b8aa-56a2d654b419/tservers/ip-10-24-83-37:9997
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102)
    at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:132)
    at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:383)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
2016-12-13 03:34:48,565 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/3a6635fa-2c60-4860-b8aa-56a2d654b419/tables/1n/namespace
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102)
    at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:295)
    at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:272)
    at org.apache.accumulo.fate.zookeeper.ZooCache$ZooRunnable.retry(ZooCache.java:171)
    at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:323)
    at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:259)
    at org.apache.accumulo.core.client.impl.Tables.getNamespaceId(Tables.java:235)
    at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.startMultiScan(TabletServer.java:600)
    at sun.reflect.GeneratedMethodAccessor59.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.accumulo.core.trace.wrappers.RpcServerInvocationHandler.invoke(RpcServerInvocationHandler.java:46)
    at org.apache.accumulo.server.rpc.RpcWrapper$1.invoke(RpcWrapper.java:74)
    at com.sun.proxy.$Proxy19.startMultiScan(Unknown Source)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:2330)
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:2314)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.accumulo.server.rpc.TimedProcessor.process(TimedProcessor.java:63)
    at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:516)
    at org.apache.accumulo.server.rpc.CustomNonBlockingServer$1.run(CustomNonBlockingServer.java:78)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
---
(retry log many times)
---
2016-12-13 03:34:48,844 [tserver.TabletServer] ERROR: Lost tablet server lock (reason = SESSION_EXPIRED), exiting.

According to this log, the Tablet Server was accessing HDFS when its
ZooKeeper session expired, and the server shut itself down.

How can I avoid the Tablet Server crashing?


2.
I use AWS EMR with r3.8xlarge instances for the Accumulo Tablet Servers,
on 5 core nodes.
They are high performance (32 vCPU / 244 GB memory / 2 x 320 GB SSD), but
too expensive.

In the GeoMesa "Tuning Accumulo" wiki page,
https://geomesa.atlassian.net/wiki/display/GEOMESA/Tuning+Accumulo#TuningAccumulo-SmallCluster,LargeServers
the example hardware requirements fit the r3.8xlarge almost exactly.

But I want to reduce the cost.
Is there a "Small Cluster, Small Servers" example hardware requirement
somewhere?

If possible, I want to use r3.xlarge (4 vCPU / 30.5 GB memory / 1 x 80 GB
SSD) for the Accumulo Tablet Servers.
What do you think?


Thank you,
Takashi
_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from
this list, visit
https://www.locationtech.org/mailman/listinfo/geomesa-users