[geomesa-users] How to avoid Tablet Server crashing, and server cost problem...

Hello,

I am using GeoMesa 1.2.3 with Accumulo 1.7.2 on AWS EMR (release
emr-4.7.0, Hadoop 2.7.2, ZooKeeper-Sandbox 3.4.8).
I have two system-management problems. One is that an Accumulo Tablet
Server suddenly crashes; the other is that the hardware required for
GeoMesa (Accumulo) is too expensive for my company.

1.
Writing features to the GeoMesa (Accumulo) DataStore becomes very slow
when an Accumulo Tablet Server crashes, and GeoServer image rendering
also slows down.
While all Tablet Servers are alive, writing features takes one or two
minutes, but in the state above it takes eight minutes or more.

The cause is probably that clients are waiting for the Accumulo Master server to rebalance tablets.

[Accumulo Master Server log]
2016-12-13 03:34:26,008 [master.Master] WARN : Lost servers
[ip-10-24-83-37:9997[358b52962d101f8]]

[Accumulo Tablet Server log]
2016-12-13 03:34:47,943 [hdfs.DFSClient] WARN : DFSOutputStream
ResponseProcessor exception for block
BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460
java.io.EOFException: Premature EOF: no length prefix available
 at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2282)
 at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:734)
2016-12-13 03:34:47,943 [zookeeper.ClientCnxn] WARN : Client session
timed out, have not heard from server in 53491ms for sessionid
0x358b52962d101f8
2016-12-13 03:34:47,944 [hdfs.DFSClient] WARN : Error Recovery for
block BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460
in pipeline DatanodeInfoWithStorage[10.24.83.37:50010,DS-43dbba0e-2bbd-4f8e-8a07-3f869123925c,DISK],
DatanodeInfoWithStorage[10.24.83.39:50010,DS-f66dcecb-1b44-431c-83fc-ab343339c485,DISK]:
bad datanode DatanodeInfoWithStorage[10.24.83.37:50010,DS-43dbba0e-2bbd-4f8e-8a07-3f869123925c,DISK]
2016-12-13 03:34:48,444 [hdfs.DFSClient] WARN : DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460 does
not exist or is not under Constructionblk_1074003283_264798
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:6239)
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6306)
 at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:805)
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:955)
 at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
 at org.apache.hadoop.ipc.Client.call(Client.java:1475)
 at org.apache.hadoop.ipc.Client.call(Client.java:1412)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
 at com.sun.proxy.$Proxy15.updateBlockForPipeline(Unknown Source)
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:901)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
 at com.sun.proxy.$Proxy16.updateBlockForPipeline(Unknown Source)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1173)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
2016-12-13 03:34:48,445 [hdfs.DFSClient] WARN : Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460 does
not exist or is not under Constructionblk_1074003283_264798
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:6239)
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6306)
 at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:805)
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:955)
 at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
 at org.apache.hadoop.ipc.Client.call(Client.java:1475)
 at org.apache.hadoop.ipc.Client.call(Client.java:1412)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
 at com.sun.proxy.$Proxy15.updateBlockForPipeline(Unknown Source)
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:901)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
 at com.sun.proxy.$Proxy16.updateBlockForPipeline(Unknown Source)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1173)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
2016-12-13 03:34:48,445 [log.DfsLogger] WARN : Exception syncing
java.lang.reflect.InvocationTargetException
2016-12-13 03:34:48,463 [zookeeper.ClientCnxn] WARN : Unable to
reconnect to ZooKeeper service, session 0x358b52962d101f8 has expired
2016-12-13 03:34:48,546 [log.DfsLogger] WARN : Exception syncing
java.lang.reflect.InvocationTargetException
2016-12-13 03:34:48,546 [log.DfsLogger] ERROR: Failed to close log file
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460 does
not exist or is not under Constructionblk_1074003283_264798
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:6239)
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6306)
 at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:805)
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:955)
 at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
 at org.apache.hadoop.ipc.Client.call(Client.java:1475)
 at org.apache.hadoop.ipc.Client.call(Client.java:1412)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
 at com.sun.proxy.$Proxy15.updateBlockForPipeline(Unknown Source)
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:901)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
 at com.sun.proxy.$Proxy16.updateBlockForPipeline(Unknown Source)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1173)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
2016-12-13 03:34:48,564 [zookeeper.ZooReader] WARN : Saw (possibly)
transient exception communicating with ZooKeeper
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for
/accumulo/3a6635fa-2c60-4860-b8aa-56a2d654b419/tservers/ip-10-24-83-37:9997
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102)
 at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:132)
 at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:383)
 at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
 at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
2016-12-13 03:34:48,565 [zookeeper.ZooCache] WARN : Saw (possibly)
transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for
/accumulo/3a6635fa-2c60-4860-b8aa-56a2d654b419/tables/1n/namespace
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102)
 at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:295)
 at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:272)
 at org.apache.accumulo.fate.zookeeper.ZooCache$ZooRunnable.retry(ZooCache.java:171)
 at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:323)
 at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:259)
 at org.apache.accumulo.core.client.impl.Tables.getNamespaceId(Tables.java:235)
 at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.startMultiScan(TabletServer.java:600)
 at sun.reflect.GeneratedMethodAccessor59.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.accumulo.core.trace.wrappers.RpcServerInvocationHandler.invoke(RpcServerInvocationHandler.java:46)
 at org.apache.accumulo.server.rpc.RpcWrapper$1.invoke(RpcWrapper.java:74)
 at com.sun.proxy.$Proxy19.startMultiScan(Unknown Source)
 at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:2330)
 at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:2314)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at org.apache.accumulo.server.rpc.TimedProcessor.process(TimedProcessor.java:63)
 at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:516)
 at org.apache.accumulo.server.rpc.CustomNonBlockingServer$1.run(CustomNonBlockingServer.java:78)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
 at java.lang.Thread.run(Thread.java:745)
---
(this retry is logged many more times)
---
2016-12-13 03:34:48,844 [tserver.TabletServer] ERROR: Lost tablet
server lock (reason = SESSION_EXPIRED), exiting.

According to this log, the Tablet Server was writing to HDFS when its
ZooKeeper session expired, so the server shut itself down.

How can I avoid Tablet Server crashing?
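For example, would raising the ZooKeeper session timeout help? The log shows the client had not heard from ZooKeeper for 53491 ms, which is well past the default 30 s timeout. I am thinking of something like the following in accumulo-site.xml (the value is just my guess, not tested; I understand ZooKeeper's own maxSessionTimeout in zoo.cfg would also have to permit it):

```xml
<!-- accumulo-site.xml: raise the tserver's ZooKeeper session timeout
     from the 30s default (sketch; 60s is an untested guess) -->
<property>
  <name>instance.zookeeper.timeout</name>
  <value>60s</value>
</property>
```

I am not sure whether this just hides the underlying problem (e.g. GC pauses or I/O stalls keeping the tserver from heartbeating), so advice on the root cause would also be appreciated.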


2.
I use AWS EMR with r3.8xlarge instances for the Accumulo Tablet
Servers, on 5 core nodes.
They are high performance (32 vCPU / 244 GB memory / 2 x 320 GB SSD), but too expensive.

The GeoMesa "Tuning Accumulo" guide,
https://geomesa.atlassian.net/wiki/display/GEOMESA/Tuning+Accumulo#TuningAccumulo-SmallCluster,LargeServers
gives an example hardware spec that almost exactly matches r3.8xlarge.

But I want to reduce cost.
Is there an example hardware spec somewhere for a "Small Cluster,
Small Servers" configuration, rather than Small Cluster, Large Servers?

If possible, I would like to use r3.xlarge (4 vCPU / 30.5 GB memory / 1 x 80 GB SSD)
for the Accumulo Tablet Servers.
What do you think?
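If r3.xlarge is feasible, I assume the Tablet Server memory settings would need to be scaled down from the large-server example as well. Something like this is what I have in mind (the property names are from the Accumulo documentation; the values are my guesses for a ~30 GB node, not tested):

```xml
<!-- accumulo-site.xml sketch for a ~30 GB tserver node;
     values are untested guesses, leaving headroom for the JVM heap and OS -->
<property>
  <name>tserver.memory.maps.max</name>  <!-- in-memory map for writes -->
  <value>2G</value>
</property>
<property>
  <name>tserver.cache.data.size</name>  <!-- block cache for data blocks -->
  <value>1G</value>
</property>
<property>
  <name>tserver.cache.index.size</name> <!-- block cache for index blocks -->
  <value>1G</value>
</property>
```

If anyone has run GeoMesa on instances this small, I would appreciate hearing what settings worked.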


Thank you,
Takashi
