Jetty 9 Hangs, QueuedThreadPool Growing Large

We recently upgraded our Jetty servers from version 6.1.25 to 9.0.4. They are deployed on Java 1.7.0_11 64-bit on a Windows 2008 server.
Other than the required configuration changes for Jetty (start.ini - very nice), we kept all the JVM flags the same as they were previously. Six days after deploying to the production environment, the server became unresponsive to HTTP requests. Internal 'heartbeat' processing continued to run as normal during this time, but external requests were not being serviced. The service was restarted, and six days later it again became unresponsive.
During my initial review, I thought I was onto something with https://bugs.eclipse.org/bugs/show_bug.cgi?id=357318. However, that JVM issue was backported from Java 1.8_0XX to Java 1.7.0_06. This led me to review the Thread processing.
I also thought it might be related to bugs 400617/410550 on the Eclipse site, although the problem doesn't present itself quite like those write-ups, and the issues had apparently been resolved in Jetty 9.0.3.
Monitoring the application via JMX shows that the count of 'qtp' threads continues to grow over time, and I've been unsuccessful in searching for a resolution. The thread pool is currently configured as:
threads.min=10
threads.max=200
threads.timeout=60000
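For reference, those three start.ini properties map onto the QueuedThreadPool constructor arguments. A minimal embedded-Jetty 9 sketch with the same values (illustrative only, not our actual startup code) looks like this:
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.util.thread.QueuedThreadPool;

public class EmbeddedJettyExample {
    public static void main(String[] args) throws Exception {
        // maxThreads=200, minThreads=10, idleTimeout=60000 ms, matching the values above
        QueuedThreadPool threadPool = new QueuedThreadPool(200, 10, 60000);

        Server server = new Server(threadPool);
        ServerConnector connector = new ServerConnector(server);
        connector.setPort(8080); // illustrative port
        server.addConnector(connector);

        server.start();
        server.join();
    }
}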
All the qtp threads are typically in WAITING state with the following stack trace:
Name: qtp1805176801-285
State: WAITING on java.util.concurrent.Semaphore$NonfairSync@4bf4a3b0
Total blocked: 0 Total waited: 110
Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(Unknown Source)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(Unknown Source)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source)
java.util.concurrent.Semaphore.acquire(Unknown Source)
org.eclipse.jetty.util.BlockingCallback.block(BlockingCallback.java:96)
org.eclipse.jetty.server.HttpConnection$Input.blockForContent(HttpConnection.java:457)
org.eclipse.jetty.server.HttpInput.consumeAll(HttpInput.java:282)
- locked org.eclipse.jetty.util.ArrayQueue@3273ba91
org.eclipse.jetty.server.HttpConnection.completed(HttpConnection.java:360)
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:340)
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:224)
org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
java.lang.Thread.run(Unknown Source)
After a closer look, this appears different from the newest threads that have the following state:
Name: qtp1805176801-734
State: TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@77b83b6e
Total blocked: 5 Total waited: 478
Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:390)
org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:509)
org.eclipse.jetty.util.thread.QueuedThreadPool.access$700(QueuedThreadPool.java:48)
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:563)
java.lang.Thread.run(Unknown Source)
Based on the naming convention, some of the qtp threads are very old (qtp1805176801-206) while some are very new (qtp1805176801-6973). I find it interesting that the older threads aren't timing out based on the 60 second idle timeout. The application services customers during US business hours and is largely idle in the early morning hours at which time I'd expect almost all of the pool to get cleaned up.
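As an illustration of what the JMX view reflects, here is a small in-process sketch (not part of our actual monitoring; the 15-minute interval is arbitrary) that periodically logs the live qtp thread names, whose -NNN suffix hints at their age:
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class QtpThreadLogger {

    public static void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                ThreadMXBean mx = ManagementFactory.getThreadMXBean();
                int count = 0;
                for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
                    if (info != null && info.getThreadName().startsWith("qtp")) {
                        count++;
                        // Name includes the creation-order suffix, e.g. qtp1805176801-206
                        System.out.println(info.getThreadName() + " " + info.getThreadState());
                    }
                }
                System.out.println("qtp thread count: " + count);
            }
        }, 0, 15, TimeUnit.MINUTES);
    }
}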
Hoping someone may be able to point me in the right direction in terms of how to track this issue down. My experience with Jetty leads me to believe their stuff is very solid and that most issues are either programmatic in our implementation (been there) or JVM related (done that). Also open to suggestions if you think I might be chasing a red herring with the threads.
NEW INFORMATION:
Tracing the exceptions a little further, this appears to be caused when GWT RPC calls time out while waiting for a response. The following stack trace shows an exception in the log file that is related to a thread in an invalid state. I'm using this to review and look for other reports of Jetty/GWT interaction issues.
2013-09-03 08:41:49.249:WARN:/webapp:qtp488328684-414: Exception while dispatching incoming RPC call
java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30015/30000 ms
at org.eclipse.jetty.util.BlockingCallback.block(BlockingCallback.java:103)
at org.eclipse.jetty.server.HttpConnection$Input.blockForContent(HttpConnection.java:457)
at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:130)
at java.io.InputStream.read(Unknown Source)
at com.google.gwt.user.server.rpc.RPCServletUtils.readContent(RPCServletUtils.java:175)
at com.google.gwt.user.server.rpc.RPCServletUtils.readContentAsGwtRpc(RPCServletUtils.java:205)
at com.google.gwt.user.server.rpc.AbstractRemoteServiceServlet.readContent(AbstractRemoteServiceServlet.java:182)
at com.google.gwt.user.server.rpc.RemoteServiceServlet.processPost(RemoteServiceServlet.java:239)
at com.google.gwt.user.server.rpc.AbstractRemoteServiceServlet.doPost(AbstractRemoteServiceServlet.java:62)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:755)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:698)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1506)
at c.t.b.servlet.PipelineFilter.doFilter(PipelineFilter.java:56)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1494)
at c.v.servlet.SetRequestEncoding.doFilter(SetRequestEncoding.java:27)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1494)
at c.t.b.servlet.OutOfMemoryFilter.doFilter(OutOfMemoryFilter.java:39)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1094)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1028)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:445)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:267)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:224)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
at java.lang.Thread.run(Unknown Source)
Caused by:
java.util.concurrent.TimeoutException: Idle timeout expired: 30015/30000 ms
at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:153)
at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

I ended up posting the question on the Eclipse/Jetty bug tracker. The following link can be used to track any permanent fix:
https://bugs.eclipse.org/bugs/show_bug.cgi?id=416477
The issue has to do with the Semaphore lock on a QTP thread that has timed out during the request as part of a GWT RPC call. The original request is timed, with a timeout of 30 seconds, and the request times out while waiting on the Semaphore.acquire method to complete. As part of the clean-up of the request, the HttpConnection attempts consumeAll on the request, which again attempts a Semaphore.acquire. This time the request is not timed, and the lock remains in place until the thread is interrupted.
The issue does appear to be very specific to the platform, as Jetty has not been able to reproduce it and I've not been able to find any other reports of it. Furthermore, this only occurs in one of our production environments. My guess is that there is something going on between the GWT RPC code, Jetty and the operating system. We have minor upgrades planned for the JDK, Jetty and the GWT SDK.
Workaround
The initial workaround was to manually interrupt the locked threads a couple of times a day via the JMX console. Our longer-term solution was to build a clean-up mechanism that looks for these locked threads and calls interrupt() on them.
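A minimal sketch of that kind of clean-up mechanism (not our exact code; the thread-name prefix and frame check are assumptions based on the dumps above, and it would be scheduled on a timer, e.g. every few minutes):
import java.util.Map;

public class StuckQtpThreadReaper implements Runnable {

    @Override
    public void run() {
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            // Only look at pool threads parked without a timeout
            if (!thread.getName().startsWith("qtp") || thread.getState() != Thread.State.WAITING) {
                continue;
            }
            for (StackTraceElement frame : entry.getValue()) {
                if ("org.eclipse.jetty.util.BlockingCallback".equals(frame.getClassName())
                        && "block".equals(frame.getMethodName())) {
                    // Same effect as interrupting the thread from the JMX console
                    thread.interrupt();
                    break;
                }
            }
        }
    }
}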

The QueuedThreadPool is a shared pool of threads. The threads in it will be reused for other processing. Yes, chasing the thread pool, assuming threads will be cleaned up, is a red herring. Those threads will fall off the pool, slowly, over a long period of time (think hours). This is a performance decision in the thread pool (thread creation is expensive, so do it as infrequently as possible).
As for the stack trace you pasted, it's incomplete, so the amount of guessing about behavior is extremely high. That being said, those two lines can indicate normal operation, but without the rest of the stack trace there's little to go on.
Also, the versions of Java you are using (1.7.0_06 and 1.7.0_11) are very old, and you are missing hundreds of bug fixes.

I have the same issue with Jetty 9.2.3.v20140905 and Java 1.8.0_20-b26, 64-bit.
Workaround: install monit (http://mmonit.com/monit/) and have it restart Jetty when memory grows too large:
# monit.conf
check process jetty-service with pidfile "/opt/jetty-service/jetty.pid"
start program = "/usr/sbin/service jetty-service start" with timeout 30 seconds
stop program = "/usr/sbin/service jetty-service stop"
if totalmem is greater than 1268 MB for 10 cycles then restart
if 5 restarts within 5 cycles then timeout

Related

Kafka Streams : Stream Thread failed to lock State Directory

I am trying to test my Kafka Streams application. I have built a simple topology where I read from an input topic and store the same data in a state store.
I tried writing unit tests for this topology using TopologyTestDriver. When I run the test, I encounter the following error:
org.apache.kafka.streams.errors.LockException: stream-thread [main] task [0_0] Failed to lock the state directory for task 0_0
at org.apache.kafka.streams.processor.internals.AbstractTask.registerStateStores(AbstractTask.java:197)
at org.apache.kafka.streams.processor.internals.StreamTask.initializeStateStores(StreamTask.java:275)
at org.apache.kafka.streams.TopologyTestDriver.<init>(TopologyTestDriver.java:403)
at org.apache.kafka.streams.TopologyTestDriver.<init>(TopologyTestDriver.java:257)
at org.apache.kafka.streams.TopologyTestDriver.<init>(TopologyTestDriver.java:228)
at streams.checkStreams.checkStreamsTest.setup(checkStreamsTest.java:99)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
....
I can see the state store getting created locally in /tmp/kafka-streams, but somehow the streams thread is unable to get a lock on it. I searched and found that this error can occur because two stream threads are trying to access it: one has the lock, so the other has to wait. But I don't see two stream threads getting created in my code. I am new to Kafka Streams and its testing; am I missing anything here?
The TopologyTestDriver does not create any background threads, so multi-threading (from KafkaStreams itself) should not be an issue. However, as @BartoszWardziński pointed out, if your testing framework executes tests in parallel and you use the same application.id in different tests, it may lead to locking issues.
The recommendation for tests is to generate a random application.id to avoid this issue.
If your tests are not running in parallel, a solution could be to call the close() method on the TopologyTestDriver. This will clean up the resources and remove the locks. This is probably best practice for disposable objects anyway.
If running tests in parallel you can set a random application.id. The problem with this is that if you're connected to a test schema registry, it will potentially create thousands of schemas (one for each test run).
Your two options here are:
Have a unique application.id per test that is hard-coded (e.g. the name of the test) rather than random.
Don't run your tests in parallel and call close() on the TopologyTestDriver, as in the sketch below.
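A minimal JUnit 4 sketch combining a hard-coded, test-specific application.id with close() in teardown (not the asker's actual test; the topic names and trivial topology are placeholders):
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.TopologyTestDriver;
import org.junit.After;
import org.junit.Before;

public class CheckStreamsTest {

    private TopologyTestDriver driver;

    @Before
    public void setup() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic"); // stand-in for the real topology
        Topology topology = builder.build();

        Properties props = new Properties();
        // Unique per test class, but hard-coded, so a schema registry is not flooded with random ids
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "check-streams-test");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234"); // never contacted by the test driver
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        driver = new TopologyTestDriver(topology, props);
    }

    @After
    public void tearDown() {
        // Releases the /tmp/kafka-streams state-directory lock so the next test can start cleanly
        if (driver != null) {
            driver.close();
        }
    }
}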

Threads parked with HTTP-Kit

I have a few threads on the go, each of which make a blocking call to HTTP Kit. My code's been working but has recently taken to freezing after about 30 minutes. All of my threads are stuck at the following point:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
clojure.core$promise$reify__7005.deref(core.clj:6823)
clojure.core$deref.invokeStatic(core.clj:2228)
clojure.core$deref.invoke(core.clj:2214)
my_project.web$fetch.invokeStatic(web.clj:35)
Line my_project.web.clj:35 is something like:
(let [result @(org.httpkit.client/get "http://example.com")]
(I'm using plain Java threads rather than core.async because I'm running in the context of a set of concurrent Apache Kafka clients, each in its own thread. The Kafka client does spin up a lot of its own threads, especially as I'm running it a few times, e.g. 5 in parallel.)
The fact that all of my threads end up parked like this in HTTP Kit suggests a resource leak, some code in HTTP Kit dying before it has a chance to deliver, or perhaps resource starvation.
Another thread seems to be stuck here. It's possible that it's blocking all of the promise deliveries.
sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:850)
sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
org.httpkit.client.HttpsRequest.unwrapRead(HttpsRequest.java:35)
org.httpkit.client.HttpClient.doRead(HttpClient.java:131)
org.httpkit.client.HttpClient.run(HttpClient.java:377)
java.lang.Thread.run(Thread.java:748)
Any ideas what the problem could be, or pointers for how to diagnose it?
A common thing to do is to set up a DefaultUncaughtExceptionHandler.
This will at least give you an indication of whether exceptions are being thrown in your threads.
;; assumes clojure.tools.logging is required as log in this namespace
(defn init-jvm-uncaught-exception-logging []
  (Thread/setDefaultUncaughtExceptionHandler
   (reify Thread$UncaughtExceptionHandler
     (uncaughtException [_ thread ex]
       (log/error ex "Uncaught exception on" (.getName thread))))))
Stuart Sierra has written nicely on this: https://stuartsierra.com/2015/05/27/clojure-uncaught-exceptions

Azure Event Hub ServiceBusException causing skipped messages

We are using the Azure Java event hub library to read messages out of an event hub. Most of the time it works perfectly, but periodically we see exceptions of type "com.microsoft.azure.servicebus.ServiceBusException" that correspond to times when messages that are in the event hub seem to be skipped.
Here are some examples of exception details:
"The message container is being closed (some number here)."
This generally hits multiple partitions at the same time, but not all.
The callstack only includes com.microsoft.azure.servicebus and org.apache.qpid.proton.
"The link 'xxx' is force detached by the broker due to errors occurred in consumer(link#). Detach origin: InnerMessageReceiver was closed."
This is generally tied to com.microsoft.azure.servicebus.amqp.AmqpException exceptions.
The callstack only includes com.microsoft.azure.servicebus and org.apache.qpid.proton.
Example callstack:
at com.microsoft.azure.servicebus.ExceptionUtil.toException(ExceptionUtil.java:93)
at com.microsoft.azure.servicebus.MessageReceiver.onError(MessageReceiver.java:393)
at com.microsoft.azure.servicebus.MessageReceiver.onClose(MessageReceiver.java:646)
at com.microsoft.azure.servicebus.amqp.BaseLinkHandler.processOnClose(BaseLinkHandler.java:83)
at com.microsoft.azure.servicebus.amqp.BaseLinkHandler.onLinkRemoteClose(BaseLinkHandler.java:52)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:176)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:309)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:276)
at com.microsoft.azure.servicebus.MessagingFactory$RunReactor.run(MessagingFactory.java:340)
at java.lang.Thread.run(Thread.java:745)
There doesn't seem to be a way for clients of the library to recognize that a problem has occurred and avoid moving ahead in the event hub past our skipped messages. Has anyone else run into this? Is there some other way to recognize this and avoid skipping, or to retry the missed messages?
This error DOESN'T SKIP any messages - it throws an exception when it shouldn't have, which results in EPH RESTARTING the affected partitions' receivers. If the application using the EventHubs Java client doesn't handle the errors, it may experience loss of messages.
This is a bug in our retry logic in the current version of the EventHubs Java client (up to and including 0.11.0).
Here's the corresponding issue to track progress.
In the EventHubs service, these errors happen if, for any reason, the container hosting your EventHub's code has to close (for the sake of the explanation, imagine we run a set of containers - like Docker containers - for every EventHub namespace). This is a transient error; the container will eventually be opened on another node.
Our Java client retry logic should have handled this error and retried. Will keep this thread posted with the fix.
EDIT
We just released 0.12.0 - which fixes this issue.
Thanks!
Sreeram

Issue with mule starting up (at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:234)

Mule is not starting up: it tries to start, hangs for a while, and after some time tries to start from the beginning again, like a restart. I took a thread dump. While analyzing the thread dump there is a warning that says "3 threads are transitively BLOCKED. It's indicating lock is not getting released," which could be a potential issue, probably something to do with Jetty, but it's not clear what that is. Here is part of the thread dump analysis:
0x00000000e0f43f40
Object
Held by:
qtp383251638-61-acceptor-0-ServerConnector@7d75f858{HTTP/1.1}{0.0.0.0:7777}
Threads waiting to take lock:
qtp383251638-62-acceptor-1-ServerConnector@7d75f858{HTTP/1.1}{0.0.0.0:7777}
qtp383251638-63-acceptor-2-ServerConnector@7d75f858{HTTP/1.1}{0.0.0.0:7777}
qtp383251638-64-acceptor-3-ServerConnector@7d75f858{HTTP/1.1}{0.0.0.0:7777}
"qtp383251638-61-acceptor-0-ServerConnector@7d75f858{HTTP/1.1}{0.0.0.0:7777}": running, holding [0x00000000e0f43f40]
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:321)
at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:460)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
at java.lang.Thread.run(Thread.java:745)
Acceptors are always in a blocked state when they are not actively accepting connections; that is normal for that kind of thread.
Your issue is elsewhere.
You haven't given enough details about it to troubleshoot though. (sorry)
Resolved the issue from the thread dump. There was an issue establishing a connection with the message broker.
nid=0xe128 in Object.wait() [0x00007f41303ef000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.apache.activemq.transport.failover.FailoverTransport.oneway(FailoverTransport.java:613)
- locked <0x00000000ddddecf0> (a java.lang.Object)
at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)

Jetty: interrupt long lasting requests (timeout earlier)

I would like to cancel or stop the thread handling a request that came in X seconds ago, e.g. to avoid overloading the system and improve overall stability. Is that possible with Jetty >= 9?
I tried connector0.setIdleTimeout, but it does not seem to have any effect; e.g. setting it to 1000 ms and delaying my response by 10000 ms should result in a timeout, but does not.
I have found similar questions on the mailing list (here and here) and related SO questions (here, here and here), but all without an inbuilt solution from Jetty.
Can't I set the read timeout of the socket somehow?
Or is this statement from the mailing list correct:
the servlet spec does not allow jetty to interrupt a dispatched thread
Mitigating excessive load is accomplished through other means, not by harshly killing / interrupting threads (something that even the core Java Classes discourage!)
Consider using DoSFilter, QoSFilter, or LowResourceMonitor to mitigate excessive load instead.
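For what it's worth, here is a hedged embedded-Jetty sketch of the DoSFilter route (from the jetty-servlets module); maxRequestMs is the parameter aimed at long-running requests, and the port and values shown are illustrative, not recommendations:
import java.util.EnumSet;
import javax.servlet.DispatcherType;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.FilterHolder;
import org.eclipse.jetty.servlet.ServletContextHandler;
import org.eclipse.jetty.servlets.DoSFilter;

public class LimitedServer {
    public static void main(String[] args) throws Exception {
        Server server = new Server(8080);
        ServletContextHandler context = new ServletContextHandler(ServletContextHandler.SESSIONS);
        context.setContextPath("/");

        // DoSFilter throttles abusive clients and caps how long a single request may run
        FilterHolder dos = new FilterHolder(DoSFilter.class);
        dos.setInitParameter("maxRequestMs", "10000");   // cap a single request at ~10s
        dos.setInitParameter("maxRequestsPerSec", "25"); // per-connection request rate before throttling
        context.addFilter(dos, "/*", EnumSet.of(DispatcherType.REQUEST));

        // application servlets would be registered here
        server.setHandler(context);
        server.start();
        server.join();
    }
}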