Re: [geomesa-users] How to avoid Tablet Server crashing, and server cost problem...

Trying again...

Hi Takashi,

I'll try to look into question #1, but note that I generally use EC2 directly instead, so I'll have to see how EMR handles the ZooKeepers. Question: how are you ingesting the data? Is it MapReduce or a single Java program?
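
By "a single Java program" I mean something along the lines of this minimal sketch, which writes features through the GeoTools DataStore API. The connection parameter keys, the "Observation" schema, and all host names and credentials below are illustrative assumptions, not your actual setup:

import java.util.HashMap;
import java.util.Map;

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.GeometryFactory;
import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;
import org.geotools.data.DataUtilities;
import org.geotools.data.FeatureWriter;
import org.geotools.data.Transaction;
import org.opengis.feature.simple.SimpleFeature;
import org.opengis.feature.simple.SimpleFeatureType;

public class SingleProcessIngest {
    public static void main(String[] args) throws Exception {
        // Illustrative connection parameters for the GeoMesa Accumulo data store.
        Map<String, String> params = new HashMap<String, String>();
        params.put("instanceId", "accumulo");
        params.put("zookeepers", "zk1:2181,zk2:2181,zk3:2181");
        params.put("user", "root");
        params.put("password", "secret");
        params.put("tableName", "geomesa.catalog");

        DataStore ds = DataStoreFinder.getDataStore(params);

        // Hypothetical feature type: a name, a timestamp, and a default point geometry.
        SimpleFeatureType sft =
            DataUtilities.createType("Observation", "name:String,dtg:Date,*geom:Point:srid=4326");
        ds.createSchema(sft);

        // Append features one at a time; GeoMesa batches the underlying Accumulo writes.
        FeatureWriter<SimpleFeatureType, SimpleFeature> writer =
            ds.getFeatureWriterAppend(sft.getTypeName(), Transaction.AUTO_COMMIT);
        try {
            SimpleFeature f = writer.next();
            f.setAttribute("name", "example");
            f.setAttribute("dtg", new java.util.Date());
            f.setAttribute("geom", new GeometryFactory().createPoint(new Coordinate(139.7, 35.6)));
            writer.write();
        } finally {
            writer.close();
        }
        ds.dispose();
    }
}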

But in the meantime, for #2 you should certainly be able to use r3.xlarge instances with 4 vCPUs and 30 GB of RAM. We have built many clusters on EC2 this way. I would try to use at least 5 or 10 nodes if you can. Just make sure to spread the Accumulo master, the Hadoop namenode, and the 3 ZooKeepers across separate nodes. I'll update that wiki page with some more examples.

Also note that you can use EBS storage, which can be cheaper.
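
To make the shape concrete, here is a minimal sketch of requesting that kind of cluster with the AWS CLI; the key name, bucket, and the Accumulo bootstrap script path are placeholders, and the application names are assumed to match what this EMR release reports:

aws emr create-cluster \
  --name "geomesa-accumulo" \
  --release-label emr-4.7.0 \
  --applications Name=Hadoop Name=ZooKeeper-Sandbox \
  --ec2-attributes KeyName=my-key \
  --instance-groups \
      InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.xlarge \
      InstanceGroupType=CORE,InstanceCount=5,InstanceType=r3.xlarge \
  --bootstrap-actions Path=s3://my-bucket/install-accumulo.sh

Note that with this sandbox layout ZooKeeper runs on the MASTER node, so actually spreading the three ZooKeepers onto separate nodes as suggested above still takes extra setup, and EBS volumes can be attached per instance group to keep storage costs down.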

Andrew




-------- Original message --------
From: Takashi Sasaki <tsasaki609@xxxxxxxxx>
Date: 12/13/16 20:19 (GMT-05:00)
To: geomesa-users@xxxxxxxxxxxxxxxx
Subject: [geomesa-users] How to avoid Tablet Server crashing, and server cost problem...

Hello,

I use GeoMesa 1.2.3 with Accumulo 1.7.2 on AWS EMR release emr-4.7.0,
with Hadoop 2.7.2 and ZooKeeper-Sandbox 3.4.8.
I have two problems managing this system. One is that the Accumulo
Tablet Server suddenly crashes, and the other is that the hardware
requirements of GeoMesa (Accumulo) are too expensive for my company.

1.
Writing features to the GeoMesa (Accumulo) DataStore becomes very slow
when an Accumulo Tablet Server crashes, and GeoServer image rendering
also slows down.
While the Tablet Servers are alive, writing features takes one or two
minutes, but in the state above it takes eight minutes or more.

The cause is probably that writes wait for the Accumulo Master to rebalance the tablets.

[Accumulo Master Server log]
2016-12-13 03:34:26,008 [master.Master] WARN : Lost servers
[ip-10-24-83-37:9997[358b52962d101f8]]

[Accumulo Tablet Server log]
2016-12-13 03:34:47,943 [hdfs.DFSClient] WARN : DFSOutputStream
ResponseProcessor exception for block
BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2282)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:734)
2016-12-13 03:34:47,943 [zookeeper.ClientCnxn] WARN : Client session
timed out, have not heard from server in 53491ms for sessionid
0x358b52962d101f8
2016-12-13 03:34:47,944 [hdfs.DFSClient] WARN : Error Recovery for
block BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460
in pipeline DatanodeInfoWithStorage[10.24.83.37:50010,DS-43dbba0e-2bbd-4f8e-8a07-3f869123925c,DISK],
DatanodeInfoWithStorage[10.24.83.39:50010,DS-f66dcecb-1b44-431c-83fc-ab343339c485,DISK]:
bad datanode DatanodeInfoWithStorage[10.24.83.37:50010,DS-43dbba0e-2bbd-4f8e-8a07-3f869123925c,DISK]
2016-12-13 03:34:48,444 [hdfs.DFSClient] WARN : DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460 does
not exist or is not under Constructionblk_1074003283_264798
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:6239)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6306)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:805)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:955)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy15.updateBlockForPipeline(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:901)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy16.updateBlockForPipeline(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1173)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
2016-12-13 03:34:48,445 [hdfs.DFSClient] WARN : Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460 does
not exist or is not under Constructionblk_1074003283_264798
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:6239)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6306)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:805)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:955)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy15.updateBlockForPipeline(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:901)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy16.updateBlockForPipeline(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1173)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
2016-12-13 03:34:48,445 [log.DfsLogger] WARN : Exception syncing
java.lang.reflect.InvocationTargetException
2016-12-13 03:34:48,463 [zookeeper.ClientCnxn] WARN : Unable to
reconnect to ZooKeeper service, session 0x358b52962d101f8 has expired
2016-12-13 03:34:48,546 [log.DfsLogger] WARN : Exception syncing
java.lang.reflect.InvocationTargetException
2016-12-13 03:34:48,546 [log.DfsLogger] ERROR: Failed to close log file
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
BP-1424542533-10.24.83.115-1481002292587:blk_1074003283_262460 does
not exist or is not under Constructionblk_1074003283_264798
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:6239)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6306)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:805)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:955)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy15.updateBlockForPipeline(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:901)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy16.updateBlockForPipeline(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1173)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
2016-12-13 03:34:48,564 [zookeeper.ZooReader] WARN : Saw (possibly)
transient exception communicating with ZooKeeper
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for
/accumulo/3a6635fa-2c60-4860-b8aa-56a2d654b419/tservers/ip-10-24-83-37:9997
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102)
at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:132)
at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:383)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
2016-12-13 03:34:48,565 [zookeeper.ZooCache] WARN : Saw (possibly)
transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for
/accumulo/3a6635fa-2c60-4860-b8aa-56a2d654b419/tables/1n/namespace
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102)
at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:295)
at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:272)
at org.apache.accumulo.fate.zookeeper.ZooCache$ZooRunnable.retry(ZooCache.java:171)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:323)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:259)
at org.apache.accumulo.core.client.impl.Tables.getNamespaceId(Tables.java:235)
at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.startMultiScan(TabletServer.java:600)
at sun.reflect.GeneratedMethodAccessor59.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.accumulo.core.trace.wrappers.RpcServerInvocationHandler.invoke(RpcServerInvocationHandler.java:46)
at org.apache.accumulo.server.rpc.RpcWrapper$1.invoke(RpcWrapper.java:74)
at com.sun.proxy.$Proxy19.startMultiScan(Unknown Source)
at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:2330)
at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:2314)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.accumulo.server.rpc.TimedProcessor.process(TimedProcessor.java:63)
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:516)
at org.apache.accumulo.server.rpc.CustomNonBlockingServer$1.run(CustomNonBlockingServer.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at java.lang.Thread.run(Thread.java:745)
---
(retry log many times)
---
2016-12-13 03:34:48,844 [tserver.TabletServer] ERROR: Lost tablet
server lock (reason = SESSION_EXPIRED), exiting.

According to this log, the Tablet Server was accessing HDFS when its
ZooKeeper session expired, and it then shut itself down.

How can I avoid the Tablet Server crashing?
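
For context, I understand the session that expired here is governed by the Accumulo
property instance.zookeeper.timeout (default 30s in Accumulo 1.7), set in
accumulo-site.xml. A minimal sketch of raising it is below (the 60s value is only an
illustration); I assume this just buys headroom for long pauses (GC, swapping, an
overloaded node) rather than fixing their cause:

<!-- accumulo-site.xml: illustrative value, not a recommendation -->
<property>
  <name>instance.zookeeper.timeout</name>
  <value>60s</value>
</property>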


2.
I use AWS EMR with r3.8xlarge instances for the Accumulo Tablet
Servers on 5 core nodes.
They are high performance (32 vCPU / 244 GB memory / 2 x 320 GB SSD), but too expensive.

In the GeoMesa "Tuning Accumulo" guide,
https://geomesa.atlassian.net/wiki/display/GEOMESA/Tuning+Accumulo#TuningAccumulo-SmallCluster,LargeServers
the example hardware requirements almost match the r3.8xlarge.

But I want to reduce the cost.
Is there a "Small Cluster, Small Servers" example hardware
requirement somewhere?

If possible, I would like to use r3.xlarge (4 vCPU / 30.5 GB memory / 1 x 80 GB SSD)
for the Accumulo Tablet Servers.
What do you think?


Thank you,
Takashi
_______________________________________________
geomesa-users mailing list
geomesa-users@xxxxxxxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://www.locationtech.org/mailman/listinfo/geomesa-users
