My application is designed so that whenever someone starts it, it closes any other running instances of itself (note that this is working as expected and should not be changed).
I am using Akka, and one of my actors is a PersistentActor that uses leveldb.
When the application starts it locks <path-to-lock>/LOCK; when it is started again, leveldb cannot lock the file, so the PersistentActor cannot start.
After some inspection I found that the leveldb actor class is LeveldbJournal and that it starts under the system guardian with the path akka://actor-system/system/akka.persistence.journal.leveldb.
I would like the leveldb journal to keep restarting itself until the file can be locked or a max-retries limit has been reached.
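(For illustration only, this is roughly the kind of retry behavior I have in mind, shown with Akka's generic backoff supervisor on an ordinary child actor; I realize the journal actor is created under /system by the persistence extension, so this probably cannot be applied to it directly.)

// Illustration only: backoff-based retry for an ordinary child actor (akka.pattern, Akka 2.4).
import scala.concurrent.duration._
import akka.actor.{Actor, OneForOneStrategy, Props, SupervisorStrategy}
import akka.pattern.{Backoff, BackoffSupervisor}

class FlakyResource extends Actor {
  // preStart could throw here if a shared resource (e.g. a file lock) is unavailable
  override def receive: Receive = { case msg => sender() ! msg }
}

val supervisorProps: Props = BackoffSupervisor.props(
  Backoff.onFailure(
    Props[FlakyResource], // child that may fail in preStart
    "flaky-resource",     // child name
    3.seconds,            // minBackoff
    30.seconds,           // maxBackoff
    0.2                   // randomFactor
  ).withSupervisorStrategy(
    OneForOneStrategy(maxNrOfRetries = 10) {
      case _: Exception => SupervisorStrategy.Restart // give up after 10 failed restarts
    }
  )
)

// system.actorOf(supervisorProps, "flaky-resource-supervisor")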
Logs:
[ERROR] [02/04/2019 15:39:25.731] [operator-actor-system-akka.actor.default-dispatcher-5] [akka://operator-actor-system/system/akka.persistence.journal.leveldb] Unable to acquire lock on '<path-to-lock>/LOCK'
akka.actor.ActorInitializationException: akka://operator-actor-system/system/akka.persistence.journal.leveldb: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:180)
at akka.actor.ActorCell.create(ActorCell.scala:607)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:461)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)
at akka.dispatch.Mailbox.run(Mailbox.scala:223)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Unable to acquire lock on '<path-to-lock>/LOCK'
at org.iq80.leveldb.impl.DbLock.<init>(DbLock.java:55)
at org.iq80.leveldb.impl.DbImpl.<init>(DbImpl.java:167)
at org.iq80.leveldb.impl.Iq80DBFactory.open(Iq80DBFactory.java:59)
at akka.persistence.journal.leveldb.LeveldbStore$class.preStart(LeveldbStore.scala:178)
at akka.persistence.journal.leveldb.LeveldbJournal.preStart(LeveldbJournal.scala:23)
at akka.actor.Actor$class.aroundPreStart(Actor.scala:510)
at akka.persistence.journal.leveldb.LeveldbJournal.aroundPreStart(LeveldbJournal.scala:23)
at akka.actor.ActorCell.create(ActorCell.scala:590)
... 7 more
When the PersistentActor restarts:
[ERROR] [02/04/2019 15:39:57.168] [operator-actor-system-akka.actor.default-dispatcher-16] [akka://operator-actor-system/user/controller/view-manager/records-service-api-supervisor/records-service-api] Persistence failure when replaying events for persistenceId [record-service-persistence-actor]. Last known sequence number [0] (akka.persistence.RecoveryTimedOut)
Thanks,
Ido Sorozon
P.S. Versions:
Scala: 2.11.8
Akka: 2.4.19
Related
I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command.
17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed
17/12/27 18:33:19 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:726)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:755)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:755)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection from /XXX.XX.XXX.XX:36245 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
The reason for adding this configuration was the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o171.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Therefore, I increased maxResultSize to 2.5 Gb, but the Spark job fails anyway (the error shown above).
How can I solve this issue?
It seems like the problem is that the amount of data you are trying to pull back to your driver is too large. Most likely you are using the collect method to retrieve all values from a DataFrame/RDD. The driver is a single process, and by collecting a DataFrame you pull all of the data you had distributed across the cluster back to one node. This defeats the purpose of distributing it! It only makes sense to do this after you have reduced the data down to a manageable amount.
You have two options:
If you really need to work with all that data, then you should keep it out on the executors. Use HDFS and Parquet to save the data in a distributed manner and use Spark methods to work with the data on the cluster instead of trying to collect it all back to one place.
If you really need to get the data back to the driver, you should examine whether you really need ALL of it. If you only need summary statistics, compute them on the executors before calling collect; or if you only need the top 100 results, collect only the top 100.
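For example, here is a minimal sketch of both options in Scala (the asker appears to be on PySpark, where the same calls exist; all DataFrame names, columns and paths below are hypothetical):

// Hypothetical example: reduce on the executors, then collect only the small result.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("reduce-before-collect").getOrCreate()
val df = spark.read.parquet("hdfs:///data/events") // hypothetical input

// Bad: pulls every row back to the single driver process.
// val allRows = df.collect()

// Better: compute the summary on the cluster, collect only a handful of rows.
val perCountry = df.groupBy("country").agg(count("*").as("events")).collect()

// Or, if only the top N matter, collect just those:
val top100 = df.orderBy(desc("score")).limit(100).collect()

// If the full data set is really needed later, keep it distributed instead of collecting it:
df.write.mode("overwrite").parquet("hdfs:///data/events_processed")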
Update:
There is another, less obvious reason you can run into this error. Spark will try to send data back to the driver beyond just when you explicitly call collect. It will also send back accumulator results for each task if you are using accumulators, data for broadcast joins, and some small status data about each task. If you have LOTS of partitions (20k+ in my experience) you can sometimes see this error. This is a known issue with some improvements made, and more in the works.
The options for getting past it, if this is your issue, are:
Increase spark.driver.maxResultSize or set it to 0 for unlimited
If broadcast joins are the culprit, you can reduce spark.sql.autoBroadcastJoinThreshold to limit the size of broadcast join data
Reduce the number of partitions
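A hedged sketch of setting those knobs when building the session (the values are illustrative, not recommendations):

// Illustrative values only; tune them for your own job.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("tuning-sketch")
  // Raise the cap on serialized results sent back to the driver ("0" means unlimited).
  .config("spark.driver.maxResultSize", "4g")
  // Shrink automatic broadcast joins (or disable them entirely with -1) if they are the culprit.
  .config("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
  .getOrCreate()

// Reduce an excessive partition count before collect-like actions.
val df = spark.read.parquet("hdfs:///data/wide_table") // hypothetical path
val fewerPartitions = df.coalesce(2000)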
Cause: actions like RDD's collect() that send a big chunk of data back to the driver.
Solution:
set by SparkConf: conf.set("spark.driver.maxResultSize", "4g")
OR
set by spark-defaults.conf: spark.driver.maxResultSize 4g
OR
set when calling spark-submit: --conf spark.driver.maxResultSize=4g
I have this error in AmazonDynamoDBLockClient. I'm using org.springframework.cloud:spring-cloud-stream-binder-kinesis:2.0.1.RELEASE
#Spring Cloud Stream Kinesis Binder properties
spring.cloud.stream.bindings.cdcInput.group=listener
spring.cloud.stream.bindings.cdcInput.destination=my_stream
spring.cloud.stream.bindings.cdcInput.content-type=application/json
spring.cloud.stream.kinesis.binder.auto-create-stream=false
spring.cloud.stream.kinesis.binder.locks.table=LockRegistry
spring.cloud.stream.kinesis.binder.checkpoint.table=MetadataStore
spring.cloud.stream.kinesis.binder.locks.leaseDuration=10
spring.cloud.stream.kinesis.binder.locks.heartbeat-period=3
The properties above are from my application.properties, and here is the error:
com.amazonaws.services.dynamodbv2.AmazonDynamoDBLockClient - Heartbeat thread recieved interrupt, exiting run() (possibly exiting thread)
java.lang.InterruptedException: sleep interrupted
java.lang.Thread.sleep(Native Method)
com.amazonaws.services.dynamodbv2.AmazonDynamoDBLockClient.run(AmazonDynamoDBLockClient.java:1184)
java.lang.Thread.run(Thread.java:748)
[SpringContextShutdownHook] INFO org.springframework.integration.aws.inbound.kinesis.KinesisMessageDrivenChannelAdapter - stopped KinesisMessageDrivenChannelAdapter{shardOffsets=[KinesisShardOffset{iteratorType=TRIM_HORIZON, sequenceNumber='null', timestamp=null, stream='binlog_updates', shard='shardId-000000000000', reset=false}], consumerGroup='cdc-listener'}
[-kinesis-shard-locks-1] ERROR org.springframework.integration.aws.inbound.kinesis.KinesisMessageDrivenChannelAdapter - Error during unlocking: DynamoDbLock [lockKey=cdc-listener:rds_binlog_updates:shardId-000000000000,lockedAt=2021-01-21#15:54:36.735, lockItem=null]
org.springframework.dao.DataAccessResourceFailureException: Failed to release lock at cdc-listener:binlog_updates:shardId-000000000000; nested exception is java.util.concurrent.RejectedExecutionException: Task org.springframework.integration.aws.lock.DynamoDbLockRegistry$DynamoDbLock$ rejected from java.util.concurrent.ThreadPoolExecutor#[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
org.springframework.integration.aws.lock.DynamoDbLockRegistry$DynamoDbLock.unlock(DynamoDbLockRegistry.java:526)
org.springframework.integration.aws.inbound.kinesis.KinesisMessageDrivenChannelAdapter$ShardConsumerManager.run(KinesisMessageDrivenChannelAdapter.java:1294)
We are getting this error even if the stream is empty and no data is read by the consumer. The service starts the application context normally without any errors, but within 1-2 minutes this error appears and the application goes down.
I have a Dataflow job that has been running stably for several months.
For the last 3 days or so I've had problems with the job: it gets stuck after a certain amount of time, and the only thing I can do is stop the job and start a new one. This has happened after 2, 6 and 24 hours of processing. Here is the latest exception:
java.lang.ExceptionInInitializerError
at org.apache.beam.runners.dataflow.worker.options.StreamingDataflowWorkerOptions$WindmillServerStubFactory.create (StreamingDataflowWorkerOptions.java:183)
at org.apache.beam.runners.dataflow.worker.options.StreamingDataflowWorkerOptions$WindmillServerStubFactory.create (StreamingDataflowWorkerOptions.java:169)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper (ProxyInvocationHandler.java:592)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault (ProxyInvocationHandler.java:533)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke (ProxyInvocationHandler.java:158)
at com.sun.proxy.$Proxy54.getWindmillServerStub (Unknown Source)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.<init> (StreamingDataflowWorker.java:677)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.fromDataflowWorkerHarnessOptions (StreamingDataflowWorker.java:562)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.main (StreamingDataflowWorker.java:274)
Caused by: java.lang.RuntimeException: Loading windmill_service failed:
at org.apache.beam.runners.dataflow.worker.windmill.WindmillServer.<clinit> (WindmillServer.java:42)
Caused by: java.io.IOException: No space left on device
at sun.nio.ch.FileDispatcherImpl.write0 (Native Method)
at sun.nio.ch.FileDispatcherImpl.write (FileDispatcherImpl.java:60)
at sun.nio.ch.IOUtil.writeFromNativeBuffer (IOUtil.java:93)
at sun.nio.ch.IOUtil.write (IOUtil.java:65)
at sun.nio.ch.FileChannelImpl.write (FileChannelImpl.java:211)
at java.nio.channels.Channels.writeFullyImpl (Channels.java:78)
at java.nio.channels.Channels.writeFully (Channels.java:101)
at java.nio.channels.Channels.access$000 (Channels.java:61)
at java.nio.channels.Channels$1.write (Channels.java:174)
at java.nio.file.Files.copy (Files.java:2909)
at java.nio.file.Files.copy (Files.java:3027)
at org.apache.beam.runners.dataflow.worker.windmill.WindmillServer.<clinit> (WindmillServer.java:39)
It seems like there is no space left on the device, but shouldn't this be managed by Google? Or is this an error in my job somehow?
UPDATE:
The workflow is as follows:
Reading mass data from PubSub (up to 1500/s)
Filter some messages
Keeping session window on key and grouping by it
Sort the data and do calculations
Output the data to another PubSub
You can increase the storage capacity via a parameter of your pipeline. Look at diskSizeGb on this page.
In addition, the more data you keep in memory, the more memory you need. This is the case for windows: if you never close them, or if you allow late data for too long, you need a lot of memory to keep all that data around.
Tune either your pipeline or your machine type. Or both!
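For reference, diskSizeGb is a per-worker option of the Dataflow runner; a minimal sketch of setting it (shown in Scala against the Beam Java SDK, value illustrative):

// Illustrative only: enlarge each worker's persistent disk via the pipeline options.
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions
import org.apache.beam.sdk.options.PipelineOptionsFactory

val options = PipelineOptionsFactory
  .fromArgs("--runner=DataflowRunner", "--diskSizeGb=100")
  .as(classOf[DataflowPipelineOptions])

// Equivalent programmatic form:
options.setDiskSizeGb(100)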
In my application I use Infinispan as a distributed cache.
I work with 3 application servers running WildFly 9.2. On each of them a job is executed whose only work is to validate some cache items. If the validation fails, the job removes the cache, as it is not valid any more.
The removal code is quite simple:
if (somecondition) {
    cacheManager.removeCache(sessionCacheName);
}
I realized that when all three servers are running (so there are 3 jobs that concurrently execute the remove operation) I systematically get this exception:
19:43:00,005 WARN [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-20,ws-7-aor-57542) ISPN000220: Problems un-marshalling remote command from byte buffer
java.lang.NullPointerException
at org.infinispan.commands.RemoteCommandsFactory.fromStream(RemoteCommandsFactory.java:219)
at org.infinispan.marshall.exts.ReplicableCommandExternalizer.fromStream(ReplicableCommandExternalizer.java:107)
at org.infinispan.marshall.exts.CacheRpcCommandExternalizer.readObject(CacheRpcCommandExternalizer.java:155)
at org.infinispan.marshall.exts.CacheRpcCommandExternalizer.readObject(CacheRpcCommandExternalizer.java:65)
at org.infinispan.marshall.core.ExternalizerTable$ExternalizerAdapter.readObject(ExternalizerTable.java:436)
at org.infinispan.marshall.core.ExternalizerTable.readObject(ExternalizerTable.java:227)
at org.infinispan.marshall.core.JBossMarshaller$ExternalizerTableProxy.readObject(JBossMarshaller.java:153)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:354)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:209)
at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:41)
at org.infinispan.commons.marshall.jboss.AbstractJBossMarshaller.objectFromObjectStream(AbstractJBossMarshaller.java:134)
at org.infinispan.marshall.core.VersionAwareMarshaller.objectFromByteBuffer(VersionAwareMarshaller.java:101)
at org.infinispan.commons.marshall.AbstractDelegatingMarshaller.objectFromByteBuffer(AbstractDelegatingMarshaller.java:80)
at org.infinispan.remoting.transport.jgroups.MarshallerAdapter.objectFromBuffer(MarshallerAdapter.java:28)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:298)
at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:460)
at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:377)
at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:250)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:675)
at org.jgroups.JChannel.up(JChannel.java:739)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1029)
at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:383)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:394)
at org.jgroups.protocols.pbcast.GMS.up(GMS.java:1042)
at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:435)
at org.jgroups.protocols.pbcast.NAKACK2.deliver(NAKACK2.java:961)
at org.jgroups.protocols.pbcast.NAKACK2.handleMessage(NAKACK2.java:843)
at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:618)
at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:155)
at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:200)
at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:297)
at org.jgroups.protocols.MERGE3.up(MERGE3.java:288)
at org.jgroups.protocols.Discovery.up(Discovery.java:291)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1577)
at org.jgroups.protocols.TP$MyHandler.run(TP.java:1796)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
on one server, while this one appears:
19:44:00,199 ERROR [stderr] (DefaultQuartzScheduler_Worker-1) Caused by: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from ws-7-aor-36158, see cause for remote stack trace
19:44:00,200 ERROR [stderr] (DefaultQuartzScheduler_Worker-1) at org.infinispan.remoting.transport.AbstractTransport.checkResponse(AbstractTransport.java:46)
19:44:00,211 ERROR [stderr] (DefaultQuartzScheduler_Worker-1) at org.infinispan.remoting.transport.AbstractTransport.parseResponseAndAddToResponseList(AbstractTransport.java:71)
19:44:00,211 ERROR [stderr] (DefaultQuartzScheduler_Worker-1) at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:586)
19:44:00,212 ERROR [stderr] (DefaultQuartzScheduler_Worker-1) at org.infinispan.manager.DefaultCacheManager.removeCache(DefaultCacheManager.java:492)
on the other 2.
This error disappears when only 1 application server instance is running.
So it's clearly related to concurrency.
What am I missing?
The removeCache() method was only intended as an admin operation to be called from a JMX/RHQ console, so concurrent calls weren't much of a concern.
The good news is that concurrent calls will work in Infinispan 8.1+/WildFly 10, which include the fix for ISPN-5756.
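Until an upgrade is possible, one hedged workaround (my own sketch, not something from the ISPN ticket) is to let a single node, for example the cluster coordinator, issue the admin-style removal so the three jobs no longer race:

// Hypothetical workaround sketch, shown in Scala (the same calls work from Java):
// only the cluster coordinator performs the removal.
import org.infinispan.manager.EmbeddedCacheManager

def removeIfInvalid(cacheManager: EmbeddedCacheManager,
                    sessionCacheName: String,
                    somecondition: Boolean): Unit = {
  if (somecondition && cacheManager.isCoordinator) {
    // removeCache() is an admin-style operation, so serialize it on one node
    cacheManager.removeCache(sessionCacheName)
  }
}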
I am trying to load a 38 GB file from a local server to HDFS, and I am getting the error below during the file transfer:
hadoop fs -copyFromLocal filename.csv /
Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: Input/output error
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:161)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:436)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:252)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:276)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:228)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:84)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:456)
at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:382)
at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:319)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:254)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:239)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:306)
at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:278)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:234)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244)
at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:211)
at org.apache.hadoop.fs.shell.CopyCommands$Put.processArguments(CopyCommands.java:263)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
Caused by: java.io.IOException: Input/output error
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:272)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:154)
... 31 more
After issuing the copyFromLocal command I get the above exception.
Kindly help.
The issue got resolved; it was a disk issue.