storm.kafka.UpdateOffsetException - Issue with Opaque Trident Kafka Spout

I'm using a Trident topology with OpaqueTridentKafkaSpout.
Here is the snippet of the TridentKafkaConfig I'm using:
OpaqueTridentKafkaSpout kafkaSpout = null;
TridentKafkaConfig spoutConfig = new TridentKafkaConfig(new ZkHosts("xxx.x.x.9:2181,xxx.x.x.1:2181,xxx.x.x.2:2181"), "topic_name");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
spoutConfig.fetchSizeBytes = 147483600;
kafkaSpout = new OpaqueTridentKafkaSpout(spoutConfig);
I get this runtime exception from one of the workers:
java.lang.RuntimeException: storm.kafka.UpdateOffsetException
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:135)
    at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:106)
    at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
    at backtype.storm.daemon.executor$fn__5694$fn__5707$fn__5758.invoke(executor.clj:819)
    at backtype.storm.util$async_loop$fn__545.invoke(util.clj:479)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:745)
Caused by: storm.kafka.UpdateOffsetException
    at storm.kafka.KafkaUtils.fetchMessages(KafkaUtils.java:186)
    at storm.kafka.trident.TridentKafkaEmitter.fetchMessages(TridentKafkaEmitter.java:132)
    at storm.kafka.trident.TridentKafkaEmitter.doEmitNewPartitionBatch(TridentKafkaEmitter.java:113)
    at storm.kafka.trident.TridentKafkaEmitter.failFastEmitNewPartitionBatch(TridentKafkaEmitter.java:72)
    at storm.kafka.trident.TridentKafkaEmitter.emitNewPartitionBatch(TridentKafkaEmitter.java:79)
    at storm.kafka.trident.TridentKafkaEmitter.access$000(TridentKafkaEmitter.java:46)
    at storm.kafka.trident.TridentKafkaEmitter$1.emitPartitionBatch(TridentKafkaEmitter.java:204)
    at storm.kafka.trident.TridentKafkaEmitter$1.emitPartitionBatch(TridentKafkaEmitter.java:194)
    at storm.trident.spout.OpaquePartitionedTridentSpoutExecutor$Emitter.emitBatch(OpaquePartitionedTridentSpoutExecutor.java:127)
    at storm.trident.spout.TridentSpoutExecutor.execute(TridentSpoutExecutor.java:82)
    at storm.trident.topology.TridentBoltExecutor.execute(TridentBoltExecutor.java:370)
    at backtype.storm.daemon.executor$fn__5694$tuple_action_fn__5696.invoke(executor.clj:690)
    at backtype.storm.daemon.executor$mk_task_receiver$fn__5615.invoke(executor.clj:436)
    at backtype.storm.disruptor$clojure_handler$reify__5189.onEvent(disruptor.clj:58)
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:127)
    ... 6 more
As suggested in some posts, I have tried setting these on spoutConfig:
spoutConfig.maxOffsetBehind = Long.MAX_VALUE;
spoutConfig.startOffsetTime = kafka.api.OffsetRequest.EarliestTime();
My Kafka retention time is the default, 168 hours i.e. 7 days, and the Kafka producer is sending 6800 messages/second to the Storm/Trident topology. I have gone through most of the related posts, but none of them seem to solve this issue. What's the best way to handle it?
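For diagnosis, it can help to verify that the offsets the spout requests still fall inside the range the brokers retain, since UpdateOffsetException is thrown by KafkaUtils.fetchMessages when a fetch comes back with an offset-out-of-range error. Below is a rough sketch of such a check using kafka-python (an assumption on my part; it needs kafka-python >= 1.3, and any offset-inspection tool would do). Adjust the broker list, which must point at the Kafka port rather than the ZooKeeper port, and the topic name:

from kafka import KafkaConsumer, TopicPartition

# Connect to the Kafka brokers directly (port 9092, not ZooKeeper's 2181).
consumer = KafkaConsumer(bootstrap_servers=['xxx.x.x.9:9092'])

partitions = [TopicPartition('topic_name', p)
              for p in consumer.partitions_for_topic('topic_name')]

# Retained offset range per partition; a stored offset outside
# [earliest, latest) produces an offset-out-of-range response.
earliest = consumer.beginning_offsets(partitions)
latest = consumer.end_offsets(partitions)
for tp in partitions:
    print('partition %d: earliest=%d latest=%d'
          % (tp.partition, earliest[tp], latest[tp]))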

I still don't know what caused this issue, but basically we did not shut down Storm, ZooKeeper, and Kafka properly. This resulted in Storm topologies failing, and we had to tear down the entire cluster and rebuild it. Updating to Storm 0.10.0 helped fix some of the other issues.

Related

kafka-python: Closing the kafka producer with 0 vs inf secs timeout

I am trying to produce messages to a Kafka topic using kafka-python 2.0.1 on Python 2.7 (I can't use Python 3 due to some workplace-related limitations).
I created the class below in a separate package, compiled the package, and installed it in a virtual environment:
import json
from kafka import KafkaProducer

class KafkaSender(object):
    def __init__(self):
        self.producer = self.get_kafka_producer()

    def get_kafka_producer(self):
        return KafkaProducer(
            bootstrap_servers=['localhost:9092'],
            value_serializer=lambda x: json.dumps(x),
            request_timeout_ms=2000,
        )

    def send(self, data):
        self.producer.send("topicname", value=data)
My driver code is something like this:
from mypackage import KafkaSender
# driver code
data = {"a":"b"}
kafka_sender = KafkaSender()
kafka_sender.send(data)
Scenario 1:
I run this code and it runs just fine, with no errors, but the message is not pushed to the topic. I have confirmed this because neither the offset nor the lag increases on the topic. Also, nothing is getting logged at the consumer end.
Scenario 2:
I commented out/removed the initialization of the Kafka producer from the __init__ method.
I changed the sending line from
self.producer.send("topicname", value=data)
to
self.get_kafka_producer().send("topicname", value=data)
i.e. creating the Kafka producer not in advance (during class initialization) but right before sending the message to the topic. When I ran the code this way, it worked perfectly: the message got published to the topic.
My intention in scenario 1 is to create a Kafka producer once and use it multiple times, not to create a producer every time I want to send a message; otherwise I could end up creating millions of producer objects to send millions of messages.
Can you please help me understand why the Kafka producer is behaving this way?
NOTE: If I write the Kafka code and the driver code in the same file, it works fine. It fails only when I write the Kafka code in a separate package, compile it, and import it into my other project.
LOGS: https://www.diffchecker.com/dTtm3u2a
Update 1: 9th May 2020, 17:20:
I removed the INFO logs from the question description. I enabled the DEBUG level, and here is the difference between the debug logs of the first and second scenarios:
https://www.diffchecker.com/dTtm3u2a
Update 2: 9th May 2020, 21:28:
Upon further debugging and looking at the kafka-python source code, I was able to deduce that in scenario 1 the Kafka sender was force-closed, while in scenario 2 it was closed gracefully.
def initiate_close(self):
    """Start closing the sender (won't complete until all data is sent)."""
    self._running = False
    self._accumulator.close()
    self.wakeup()

def force_close(self):
    """Closes the sender without sending out any pending messages."""
    self._force_close = True
    self.initiate_close()
Which of these runs depends on whether the Kafka producer's close() method is called with timeout 0 (forced close of the sender) or without a timeout (in which case the timeout takes the value float('inf') and the sender is closed gracefully).
The producer's close() method is called from the __del__ method, which runs at garbage collection time, while close(0) is called from a method registered with atexit, which runs when the interpreter terminates.
The question is: why is the interpreter terminating in scenario 1?
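For what it's worth, one way to make scenario 1 robust regardless of how the interpreter shuts down is to block on the future returned by send(), or to flush explicitly before exiting, so the message reaches the broker before the process can die. A minimal sketch keeping the original class shape (the blocking get() and the explicit close() are additions of mine, not from the question):

import json
from kafka import KafkaProducer

class KafkaSender(object):
    def __init__(self):
        self.producer = KafkaProducer(
            bootstrap_servers=['localhost:9092'],
            value_serializer=lambda x: json.dumps(x),
            request_timeout_ms=2000,
        )

    def send(self, data):
        # send() is asynchronous: it only appends to an in-memory buffer.
        future = self.producer.send("topicname", value=data)
        # Block until the broker acknowledges, so a terminating
        # interpreter cannot discard the buffered message.
        future.get(timeout=10)

    def close(self):
        # flush() waits for pending messages, i.e. the graceful path
        # rather than the forced close(0) registered with atexit.
        self.producer.flush()
        self.producer.close()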

libPusher pod issues - Disconnection during channel subscription

I'm using the libPusher pod in a RubyMotion project, but I'm running into an issue where my code works when used in the REPL but not in the app itself.
When I try this code in a viewDidAppear method, it connects successfully and then disconnects during the channel subscription call.
When I try the same code in the console, it connects and subscribes perfectly.
I'm trying to figure out:
Why is this happening?
What should I change to alleviate the issue?
I'm using v1.5 of the pod and v2.31 of RubyMotion.
For reference, I'm also using the ProMotion framework, but I doubt that has anything to do with the issue.
Here's my code:
client = PTPusher.pusherWithKey("my_pusher_key_here", delegate:self, encrypted:true)
client.connect
channel = client.subscribeToChannelNamed("test_channel_1")
channel.bindToEventNamed('status', target: self, action: 'test_method:')
Well I got it working by separating the connection and subscription calls into separate lifecycle methods.
I put:
client = PTPusher.pusherWithKey("my_pusher_key_here", delegate:self, encrypted:true)
client.connect
into the viewDidLoad method
and:
channel = client.subscribeToChannelNamed("test_channel_1")
channel.bindToEventNamed('status', target: self, action: 'test_method:')
into the viewDidAppear method.
I can't say I know exactly why this worked, but I assume it has to do with the time between the calls: the connection process must need a little time to complete before a subscription can succeed.

Wso2 BPS: No such channel Error

I'm using WSO2 BPS 2.1.2 to run a simple BPEL process with three invokes called one by one in a loop. The loop runs around one hundred times. The problem is that the process sometimes hangs in the running state. In the logs I get this error:
[2013-03-25 14:44:17,897] ERROR - BpelEngineImpl - Scheduled job failed; jobDetail=JobDetails( instanceId: 14109433 mexId: null processId: null type: TIMER channel: 11513 correlatorId: null correlationKeySet: null retryCount: null inMem: false detailsExt: {})
java.lang.IllegalArgumentException: No such channel; id=11513
at org.apache.ode.jacob.vpu.ExecutionQueueImpl.findChannelFrame(ExecutionQueueImpl.java:205)
at org.apache.ode.jacob.vpu.ExecutionQueueImpl.consumeExport(ExecutionQueueImpl.java:232)
at org.apache.ode.jacob.vpu.JacobVPU$JacobThreadImpl.importChannel(JacobVPU.java:369)
at org.apache.ode.jacob.JacobObject.importChannel(JacobObject.java:47)
at org.apache.ode.bpel.engine.BpelRuntimeContextImpl$5.run(BpelRuntimeContextImpl.java:964)
at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at org.apache.ode.jacob.vpu.JacobVPU$JacobThreadImpl.run(JacobVPU.java:451)
at org.apache.ode.jacob.vpu.JacobVPU.execute(JacobVPU.java:139)
at org.apache.ode.bpel.engine.BpelRuntimeContextImpl.execute(BpelRuntimeContextImpl.java:879)
at org.apache.ode.bpel.engine.BpelRuntimeContextImpl.timerEvent(BpelRuntimeContextImpl.java:968)
at org.apache.ode.bpel.engine.BpelProcess.handleJobDetails(BpelProcess.java:478)
at org.apache.ode.bpel.engine.BpelEngineImpl.onScheduledJob(BpelEngineImpl.java:560)
at org.apache.ode.bpel.engine.BpelServerImpl.onScheduledJob(BpelServerImpl.java:445)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob$1.call(SimpleScheduler.java:537)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob$1.call(SimpleScheduler.java:531)
at org.apache.ode.scheduler.simple.SimpleScheduler.execTransaction(SimpleScheduler.java:284)
at org.apache.ode.scheduler.simple.SimpleScheduler.execTransaction(SimpleScheduler.java:239)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob.call(SimpleScheduler.java:531)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob.call(SimpleScheduler.java:515)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:314)
at java.util.concurrent.FutureTask.run(FutureTask.java:149)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:897)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
at java.lang.Thread.run(Thread.java:738)
I can't find any useful information about this error. I'm using an Oracle database. I tried modifying bps.xml with:
<tns:OpenJPAConfig>
<tns:property name="openjpa.FlushBeforeQueries" value="true"/>
<!-- added this line as for https://wso2.org/jira/browse/CARBON-7500 (use also Oracle 11g Driver!!) -->
<tns:property name="openjpa.jdbc.DBDictionary" value="oracle(batchLimit=0)"/>
</tns:OpenJPAConfig>
But this didn't help.
The process is really simple; it looks like this:
<forEach counterName="count" parallel="no">
  <doXslTransform …>
  <wait 1s>
  <invoke ...>
  <doXslTransform …>
  <wait 1s>
  <invoke ...>
  <doXslTransform …>
  <wait 1s>
  <invoke ...>
</forEach>
How can I solve “No such channel” errors?
Thanks Tomek
We identified the issue that was causing this problem: a missing process instance lock in the ODE runtime embedded within BPS. It has been fixed.
https://issues.apache.org/jira/browse/ODE-989
https://wso2.org/jira/browse/BPS-218
If you can attach your sample scenario to the jira, it would help us add another test case. The fix is already available in the trunk and will be available in the next release.
Regards
Nandika
I removed the waits from the process and everything started working without problems. It seems there is some bug in the <wait> activity in WSO2 BPS 2.1.2. In BPS 3.0.0, waits seem to work.

Celery Storing unrecoverable task failures for later resubmission

I'm using the djkombu transport for my local development, but I will probably be using amqp (rabbit) in production.
I'd like to be able to iterate over failures of a particular type and resubmit them. This would be for cases where something fails on a server, or an edge-case bug is triggered by some new variation in the data.
So I could be resubmitting jobs up to 12 hours later, after some bug is fixed or a third-party site is back up.
My question is: is there a way to access old failed jobs via the result backend and simply resubmit them with the same params, etc.?
You can probably access old jobs using:
CELERY_RESULT_BACKEND = "database"
and in your code:
from djcelery.models import TaskMeta
task = TaskMeta.objects.filter(task_id='af3185c9-4174-4bca-0101-860ce6621234')[0]
but I'm not sure you can find the arguments the task was started with... Maybe something with TaskState...
I've never used it this way. But you might want to consider the task.retry feature?
An example from celery docs:
@task()
def task(*args):
    try:
        some_work()
    except SomeException, exc:
        # Retry in 24 hours.
        raise task.retry(*args, countdown=60 * 60 * 24, exc=exc)
From IRC
<asksol> dpn`: task args and kwargs are not stored with the result
<asksol> dpn`: but you can create your own model and store it there
(for example using the task_sent signal)
<asksol> we don't store anything when the task is sent, only send a
message. but it's very easy to do yourself
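A minimal sketch of the approach asksol describes, persisting task arguments via the task_sent signal so failures can be resubmitted later. SentTask is a hypothetical Django model (fields: task_id, name, args, kwargs) that you would define yourself, and the imports assume the djcelery-era (celery 2.x) APIs used above:

import json

from celery.execute import send_task
from celery.signals import task_sent

from myapp.models import SentTask  # hypothetical model, define it yourself

def record_sent_task(sender=None, task_id=None, task=None,
                     args=None, kwargs=None, **extra):
    # The result backend does not keep args/kwargs, so store them
    # ourselves every time a task message is sent.
    SentTask.objects.create(
        task_id=task_id,
        name=task,
        args=json.dumps(args or []),
        kwargs=json.dumps(kwargs or {}),
    )

task_sent.connect(record_sent_task)

def resubmit(task_id):
    # Look up the stored arguments and send a fresh task with them.
    rec = SentTask.objects.get(task_id=task_id)
    return send_task(rec.name, args=json.loads(rec.args),
                     kwargs=json.loads(rec.kwargs))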
This was what I was expecting, but hoped to avoid.
At least I have an answer now :)

Getting error while running EIM job

My EIM job is erroring out when I run it. Below is my IFB file:
"[Siebel Interface Manager]
USER NAME = 'SADMIN'
PASSWORD = 'SADMIN'
PROCESS = "PROCESS UPDATE"
[PROCESS UPDATE]
TYPE = IMPORT
BATCH = 30032012 - 30032015
TABLE = EIM_FN_ASSET5
INSERT ROWS = S_ASSET_CON, FALSE
UPDATE ROWS = S_ASSET_CON, TRUE
ONLY BASE TABLES = S_ASSET_CON
ONLY BASE COLUMNS = S_ASSET_CON.ATTRIB_37,S_ASSET_CON.ATTRIB_38,S_ASSET_CON.ATTRIB_50,S_ASSET_CON.ASSET_ID,S_ASSET_CON.CONTACT_ID,\
S_ASSET_CON.RELATION_TYPE_CD"
In the application, it shows this error:
"SBL-EIM-00426: All batches in run failed."
I have placed the IFB file in the admin folder itself, and below is the log file:
2021 2012-04-03 05:35:25 2012-04-03 05:35:25 -0500 00000002 001 003f 0001 09 srvrmgr 16187618 1 /004fs02/siebel/siebsrvr/log/srvrmgr.log 8.1.1.4 [21225] ENU
SisnapiLayerLog Error 1 0000000c4f7a00e2:0 2012-04-03 05:35:25 258: [SISNAPI] Async Thread: connection (0x204ec5b0), error (1180682) while reading message
Kindly help.
Async Thread: connection (0x204ec5b0), error (1180682) while reading message
This happens when an object manager loses its connection to the gateway. There can be many reasons for this: the gateway was restarted without bouncing the app server, network issues, etc.
But this is the error in your Server Manager session, not in the EIM session (batch component). For each EIM job that you start (via server manager) you should see a corresponding EIM task. The best place to see the error is the EIMxxxx.log file. You can also debug your EIM task by setting event log levels:
change evtloglvl %=3 for comp EIM
(set detailed logging)
start task ......
(run your EIM job)
list active tasks for comp EIM
(you should see the job running)
list tasks for comp EIM
(or you can see the list of jobs)
change evtloglvl %=1 for comp EIM
(use this line to set the log levels back to "normal")
This will give you some detailed info on what the EIM component is doing. Note: use a small batch, or your log will be too big to manage.
If you have connection errors and you recently lost your DB connection, the best approach is to completely restart the Siebel servers and gateway in the correct order.
Have you tried re-running the EIM job?
If the scenario continues even after the second run, please check that the batch numbers given in the IFB file match the batch numbers in the input data file for the EIM component, as the error suggests the EIM component is not able to fetch the data.
SBL-SVR-01042 is a generic error encountered while attempting to instantiate a new instance of a given component. To see why it occurred, review the accompanying error messages, which provide context and more detailed information.
You can ignore the SisnapiLayerLog error. It is generic and has no significance here.
You should concentrate on SBL-EIM-00426. Before running the task, check whether there are any records in your EIM table; this error comes when you have zero records in the interface table. Increase the log level to high and try to trace the error. There is also a fix released by Oracle; refer to Oracle Support for it.
https://support.oracle.com/epmos/faces/BugDisplay?parent=DOCUMENT&sourceId=498041.1&id=10469733
I edited the IFB file a little and it worked for me.
Can you please try the code below and let me know?
[Siebel Interface Manager]
USER NAME = 'SADMIN'
PASSWORD = 'SADMIN'
PROCESS = "PROCESS UPDATE"
[PROCESS UPDATE]
TYPE = SHELL
INCLUDE = "Update Records"
[Update Records]
TYPE = IMPORT
BATCH = 30032012 - 30032015
TABLE = EIM_FN_ASSET5
INSERT ROWS = S_ASSET_CON, FALSE
UPDATE ROWS = S_ASSET_CON, TRUE
ONLY BASE TABLES = S_ASSET_CON
ONLY BASE COLUMNS = S_ASSET_CON.ATTRIB_37 \
,S_ASSET_CON.ATTRIB_38 \
,S_ASSET_CON.ATTRIB_50 \
,S_ASSET_CON.ASSET_ID \
,S_ASSET_CON.CONTACT_ID \
,S_ASSET_CON.RELATION_TYPE_CD
Hope this helps!