I just want to use the Prometheus JMX agent to monitor some parameters.
The application is launched with the agent:
-javaagent:lib/boot/jmx_prometheus_javaagent-0.9.jar=$PROMETHEUS_PORT:etc/jmx.prometheus.yaml
but unfortunately I get a lot of warnings like this:
2017-05-09 16:39:11.585:WARN:ipjsoeji.nio:Dispatched Failed!
SCEP#24e5b398{l(/172.20.26.126:55958)<->r(/172.20.26.100:9255),d=false,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=1r}-{AsyncHttpConnection#6471d314,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0}
to
io.prometheus.jmx.shaded.org.eclipse.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager#6d2459c4
It's a flood of log messages; within a few seconds the log file grows to over 1 GB.
I've read threads about this issue (such as "Jetty 8.1 flooding the log file with 'Dispatched Failed' messages"), but they did not solve my problem.
Has anyone run into the same issue, and do you know why it occurs and how to fix it?
Looks like https://github.com/prometheus/jmx_exporter/issues/122, assuming you have many cores.
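If it is that issue, a reasonable first step is trying a newer agent build; a minimal sketch of the changed launch flag, where the 0.10 version number is only an illustration (substitute whatever the latest release is):

# hypothetical: replace the bundled 0.9 agent with a newer jmx_prometheus_javaagent release
-javaagent:lib/boot/jmx_prometheus_javaagent-0.10.jar=$PROMETHEUS_PORT:etc/jmx.prometheus.yaml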
I have a Fargate task which I have scheduled to run with CloudWatch Events rules; it writes a timestamp to a database on a successful run. It also outputs a log file to CloudWatch every time it runs.
However, there was one time when the log file was not created and the database was not updated. I suspect the task was never even started, or had failed to start.
In CloudWatch, the event rule shows trigger and invocation at the time I expected the task to run, so I assume the task at least attempted to start.
My question is: is there any way I can debug or log information about the cluster failing to start a task?
Please let me know if I need to provide more information.
Edit: I should specify that I'm looking for a way to read this information from a log file somewhere. I know I can see the failed task reason in the web console, but that's only available for relatively recent tasks.
I have posted the same question on Reddit: https://www.reddit.com/r/aws/comments/adtqvt/debugging_failed_fargate_task_initialization/ and on the AWS forums: https://forums.aws.amazon.com/thread.jspa?messageID=884638
Go to the cluster and choose the Tasks tab
In the lower pane, choose Stopped for the Desired Task Status value
Locate the desired task and click its GUID
Scroll down to the Containers section and expand the relevant containers that are experiencing errors
You'll see some kind of Status reason for the error. In my case it was:
CannotStartContainerError: API error (500): failed to initialize logging driver: Cannot determine region for awslogs driver
Edit: I can't really take credit for figuring this out - found it here:
https://github.com/aws/amazon-ecs-agent/issues/1654#issuecomment-437178282
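If you need to read the stopped reason outside the web console, a hedged sketch with the AWS CLI (the cluster name and task ARN are placeholders):

# list recently stopped tasks in the cluster
aws ecs list-tasks --cluster my-cluster --desired-status STOPPED

# show why a given task stopped (task-level and container-level reasons)
aws ecs describe-tasks --cluster my-cluster --tasks <task-arn> \
  --query 'tasks[].{arn:taskArn,stoppedReason:stoppedReason,containerReasons:containers[].reason}'

Note that ECS only keeps stopped tasks around for a short time, so for a durable record you would typically forward ECS task state change events to a log group or other target via a CloudWatch Events rule.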
Try going to "CloudWatch -> Logs -> Insights" and clicking "Run Query":
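A generic starting query, assuming the task's log group is selected (the filter pattern is only an example):

fields @timestamp, @message
| filter @message like /Error/
| sort @timestamp desc
| limit 20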
I just faced this problem and the lack of logs did make it quite difficult to resolve.
The problem in my case was that the security group used for the task had been deleted. Hope this helps if anyone has a similar issue.
I am running Dataflow jobs on Google Cloud Platform, and one new error I get is "Workflow failed" without any explanation.
The logs I get are the following:
2017-08-25 (00:06:01) Executing operation ReadNewXXXFromStorage/Read+JsonStringsToXXX+RemoveLanguagesFromXXX...
2017-08-25 (00:06:01) Executing operation ReadOldXYZ_ABC_1234_123_ns_123123123123123/GroupByKey/Create
2017-08-25 (00:06:01) Starting 1 workers in europe-west1-b...
2017-08-25 (00:06:01) Executing operation ReadOldXYZ_ABC_1234_123_ns_123123123123123/ParDo(SplitQuery)+ReadOldXYZ...
2017-08-25 (00:06:48) Workflow failed.
2017-08-25 (00:06:48) Stopping worker pool...
2017-08-25 (00:06:58) Worker pool stopped.
How am I supposed to find out what's going wrong? It should not be a problem with permissions on the object, as similar jobs run successfully.
When I try to rerun the template from Google Cloud Console, I get the message:
No metadata file found for this template
But I am able to start the template, and now it runs successfully. Could this have to do with exceeded quotas? We just increased our CPU and IP quotas for Dataflow, and I increased the number of parallel jobs from 5 to 15 to make use of the quota. When I rerun the template without any other jobs running, everything seems to work fine.
Any input is highly appreciated. Thanks.
EDIT: It seems the jobs failed because the CPU quota was exceeded, but usually we would get an error description saying "could not spawn enough workers". Nevertheless, everything works fine after I reduced the maximum number of workers per job so that our quota cannot be exceeded.
I believe the "No metadata file found for this template" message should be considered a warning, not an error. A template can have a "metadata" file associated with it which allows validation of parameters. If no such file is present, the parameters aren't validated, but everything else works as normal -- the message is just an indicator of this situation.
It sounds like the problem was the job being unable to run for other reasons. Based on your description and the edit, it sounds like this was because of a lack of quota to run the job.
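For what it's worth, the worker cap can be set when launching the template so that a single job cannot exhaust the quota; a hedged example with gcloud (the job name and bucket path are placeholders, and this assumes a gcloud version that supports the --max-workers flag):

# run the template with an explicit cap on workers
gcloud dataflow jobs run my-job \
  --gcs-location gs://my-bucket/templates/my-template \
  --max-workers 5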
I have downloaded WSO2 MB 2.1.0 and run it with the built-in Cassandra server on Windows 7 64-bit.
But the start-up procedure failed with the following error message.
[2013-12-14 11:27:03,371] ERROR {org.apache.cassandra.service.AbstractCassandraDaemon} -
Exception in thread Thread[Thread-21,5,main]
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:713)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
    at org.apache.cassandra.thrift.CustomTThreadPoolServer.serve(CustomTThreadPoolServer.java:103)
    at org.apache.cassandra.thrift.CassandraDaemon$ThriftServer.run(CassandraDaemon.java:213)
[2013-12-14 11:27:03,396] INFO {me.prettyprint.cassandra.service.JmxMonitor} - Registering JMX me.prettyprint.cassandra.service_ClusterOne:ServiceType=hector,MonitorType=hector
I found a related bug issue: https://wso2.org/jira/browse/MB-210
Does anyone know if the next release will really fix this bug?
Or do I have to use a standalone deployment with an external Cassandra server, as in this suggestion?
http://udarakr.blogspot.tw/2013/09/how-to-overcome-wso2-message-broker.html
This issue has to do with Cassandra. On Linux I faced the same issue, and once I increased the max user processes everything went fine; please refer to the article written on this topic, "Unable to create new native thread and max user processes". Since this issue is occurring on Windows, it is better to run Cassandra externally and fine-tune it there.
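For the Linux case, a minimal sketch of raising the limit (the 32768 value and the cassandra user name are illustrative, adjust for your environment):

# check the current per-user process/thread limit
ulimit -u

# raise it for the current shell before starting the server
ulimit -u 32768

# or persist it in /etc/security/limits.conf for the user that runs Cassandra
#   cassandra  soft  nproc  32768
#   cassandra  hard  nproc  32768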
Problem
I have a very basic configuration for a Spring Integration mail adapter (below is the relevant sample):
<int:channel id="emailChannel">
    <int:interceptors>
        <int:wire-tap channel="logger"/>
    </int:interceptors>
</int:channel>

<mail:inbound-channel-adapter id="popChannel"
        store-uri="pop3://user:password@domain.net/INBOX"
        channel="emailChannel"
        should-delete-messages="true"
        auto-startup="true">
    <int:poller max-messages-per-poll="1" fixed-rate="30000"/>
</mail:inbound-channel-adapter>

<int:logging-channel-adapter id="logger" level="DEBUG"/>

<int:service-activator input-channel="emailChannel" ref="mailResultsProcessor" method="onMessage" />
This is working fine the majority of the time and I can see the logs showing the polling (and it works fine hooking into my mailResultsProcessor when a mail is there):
2013-08-13 08:19:29,748 [task-scheduler-3] DEBUG org.springframework.integration.mail.Pop3MailReceiver - opening folder [pop3://user:password#fomain.net/INBOX]
2013-08-13 08:19:29,796 [task-scheduler-3] INFO org.springframework.integration.mail.Pop3MailReceiver - attempting to receive mail from folder [INBOX]
2013-08-13 08:19:29,796 [task-scheduler-3] DEBUG org.springframework.integration.mail.Pop3MailReceiver - found 0 new messages
2013-08-13 08:19:29,796 [task-scheduler-3] DEBUG org.springframework.integration.mail.Pop3MailReceiver - Received 0 messages
2013-08-13 08:19:29,893 [task-scheduler-3] DEBUG org.springframework.integration.endpoint.SourcePollingChannelAdapter - Received no Message during the poll, returning 'false'
The problem I have is that the polling stops during the day, with no indication in the logs of why it has stopped working. The only way I can tell is that the debug output above is no longer present in the logs and e-mails build up in the e-mail account.
Questions
Has anyone seen this before and know how to resolve it?
Is there a change that I can make in my configuration to capture the issue into the log? I thought the logging channel adapter set to debug would have this covered.
Using version 2.2.3.RELEASE of Spring Integration on Tomcat 7; log output goes to the default catalina.out. Deployed on a standard AWS Tomcat 7 instance.
Most likely the poller thread is hung someplace upstream. With your configuration, the next poll won't happen until the current poll completes.
You can use jstack or VisualVM to get a thread dump to find out what the thread is doing.
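For example (looking up the PID with jps is just one way to find it):

# find the JVM's process id, then capture a thread dump
jps -l
jstack <pid> > /tmp/threads.txt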
Another possibility is you are suffering from poller thread starvation - if you have a lot of other polled elements in your application, and depending on their configuration. The default taskScheduler bean has only 10 threads.
You can add a task executor to the <poller/> so each poll is handed off to another thread, but be aware that that can result in concurrent polls if a polled task takes longer to execute than the polling rate.
To resolve this problem specifically I used the configuration below:
<mail:inbound-channel-adapter id="popChannel"
        store-uri="pop3://***/INBOX"
        channel="emailChannel"
        should-delete-messages="true"
        auto-startup="true">
    <int:poller max-messages-per-poll="5" fixed-rate="60000" task-executor="pool"/>
</mail:inbound-channel-adapter>

<task:executor id="pool" pool-size="10" keep-alive="50"/>
Once we moved to this approach we saw no further problems; as with any use of a pool, the advantage is that any threads that become a problem are cleaned up and recreated.
I'm running Celery in a Django app with RabbitMQ as the message broker. However, RabbitMQ keeps breaking down like so. First is the error I get from Django. The trace is mostly unimportant, because I know what is causing the error, as you will see.
Traceback (most recent call last):
...
File "/usr/local/lib/python2.6/dist-packages/amqplib/client_0_8/transport.py", line 85, in __init__
raise socket.error, msg
error: [Errno 111] Connection refused
I know that this is due to a corrupt rabbit_persister.log file, because after I kill all processes tied to RabbitMQ and run "sudo rabbitmq-server start", I get the following crash:
...
starting queue recovery ...done
starting persister ...BOOT ERROR: FAILED
Reason: {{badmatch,{error,{{{badmatch,eof},
[{rabbit_persister,internal_load_snapshot,2},
{rabbit_persister,init,1},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]},
{child,undefined,rabbit_persister,
{rabbit_persister,start_link,[]},
transient,100,worker,
[rabbit_persister]}}}},
[{rabbit_sup,start_child,2},
{rabbit,'-run_boot_step/1-lc$^1/1-1-',1},
{rabbit,run_boot_step,1},
{rabbit,'-start/2-lc$^0/1-0-',1},
{rabbit,start,2},
{application_master,start_it_old,4}]}
Erlang has closed
My current fix: Every time this happens, I rename the corresponding rabbit_persister.log file to something else (rabbit_persister.log.bak) and am able to restart RabbitMQ with success. But the problem keeps occurring, and I can't tell why. Any ideas?
Also, as a disclaimer, I have no experience with Erlang; I'm only using RabbitMQ because it's the broker favored by Celery.
Thanks in advance, this problem is really annoying me because I keep doing the same fix over and over.
The persister is RabbitMQ's internal message database. That "log" is presumably like a database log and deleting it will cause you to lose messages. I guess it's getting corrupted by unclean broker shutdowns, but that's a bit beside the point.
It's interesting that you're getting an error in the rabbit_persister module. The last version of RabbitMQ that has that file is 2.2.0, so I'd strongly advise you to upgrade. The best version is always the latest, which you can get by using the RabbitMQ APT repository. In particular, the persister has seen a fairly large amount of fixes in the versions after 2.2.0, so there's a big chance your problem has already been resolved.
If you still see the problem after upgrading, you should report it on the RabbitMQ Discuss mailing list. The developers (of both Celery and RabbitMQ) make a point of fixing any problems reported there.
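Also, since the corruption is likely coming from unclean shutdowns, it is worth stopping the broker cleanly instead of killing its processes; a minimal sketch:

# stop the broker (and its Erlang node) cleanly so the persister can flush and close its log
sudo rabbitmqctl stop

# start it again in the background
sudo rabbitmq-server -detached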
A. Because you are running an old version of RabbitMQ earlier than 2.7.1
B. Because RabbitMQ doesn't have enough RAM. You need to run RabbitMQ on a server all by itself and give that server enough RAM so that the RAM is 2.5 times the largest possible size of your persisted message log.
You might be able to fix this without any software changes just by adding more RAM and killing other services on the box.
Another approach is to build your own RabbitMQ from source and include the toke extension, which persists messages using Tokyo Cabinet. Make sure you are using a local hard drive and not NFS partitions, because Tokyo Cabinet has corruption issues with NFS. And, of course, use version 2.7.1 for this. Depending on your message content, you might also benefit from Tokyo Cabinet's compression settings to reduce the read/write activity of persisted messages.
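If you want to confirm that memory is actually the bottleneck before resizing the box, a hedged sketch (recent broker versions include a memory section in this output):

# show the broker's current memory usage and limits
sudo rabbitmqctl status

The memory limit itself is controlled by the vm_memory_high_watermark setting in rabbitmq.config; the usual default is 0.4, i.e. the broker starts throttling publishers at roughly 40% of system RAM.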