Stackdriver Trace with Google Cloud Run - google-cloud-platform

I have been diving into a Stackdriver Trace integration on Google Cloud Run. I can get it to work with the agent, but I am bothered by a few questions.
Given that:
The Stackdriver agent aggregates traces in a small buffer and sends them periodically.
CPU access is restricted when a Cloud Run service is not handling a request.
There is no shutdown hook for Cloud Run services; you can't clear the buffer before shutdown: the container just gets a SIGKILL. This is a signal you can't catch from your application.
Running a background process that sends information outside of the request-response cycle seems to violate the Knative Container Runtime contract.
The collection of logging data is documented and does not require me to run an agent, but there is no such solution for telemetry.
I found one report of someone experiencing lost traces on Cloud Run using the agent-based approach.
How Google does it
I went into the source code of the Cloud Endpoints ESP (the Cloud Run integration is in beta) to see if they solve it in a different way, but the same pattern is used there: there is a buffer with traces (1s) and it is flushed periodically.
Question
While my tracing integration seems to work in my test setup, I am worried about incomplete and missing traces when I run this in a production environment.
Is this a hypothetical problem or a real issue?
It looks like the right way to approach this is to write telemetry to logs, instead of using an agent process. Is that supported with Stackdriver Trace?

Is this a hypothetical problem or a real issue?
If you consider a Cloud Run service receiving a single request, then it is definitely a problem, as the library will not have time to flush the data before the CPU of the container instance gets throttled.
However, in real life use cases:
A Cloud Run service often receives requests continuously or frequently, which means that its container instances are going to either have CPU continuously or have CPU available from time to time.
It is OK to drop traces: if some traces are not collected because the instance is shut down, it is likely that you have collected a diverse enough set of samples before this happens. Also, you might only be interested in the aggregated reports, in which case collecting individual traces does not matter.
Note that trace libraries usually do their own sampling of the requests to trace; they rarely trace 100% of requests.
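To make the sampling point concrete, here is a minimal sketch using the OpenCensus Go library (assuming go.opencensus.io is available); the 10% rate is an arbitrary example value, not a recommendation:

    package main

    import (
        "go.opencensus.io/trace"
    )

    func main() {
        // Trace roughly 10% of requests instead of all of them.
        // Unsampled requests never produce spans, so losing the
        // buffer at shutdown costs less than it may appear.
        trace.ApplyConfig(trace.Config{
            DefaultSampler: trace.ProbabilitySampler(0.1),
        })
    }

With a probability sampler, the worst case of a killed instance is losing the last buffer of an already-sampled subset, which is often acceptable if you mainly look at aggregated reports.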
It looks like the right way to approach this is to write telemetry to logs, instead of using an agent process. Is that supported with Stackdriver Trace?
No, Stackdriver Trace takes its data from the spans sent to its API. Note that to send data to Stackdriver Trace, you can use libraries like OpenCensus and OpenTelemetry; the proprietary Stackdriver Trace libraries are no longer the recommended way.

You're right. This is a fair concern since most tracing libraries tend to sample/upload trace spans in the background.
Since (1) your CPU is scaled nearly to zero when the container isn't handling any requests and (2) the container instance can be killed at any time due to inactivity, you cannot reliably upload the trace spans collected in your app. As you said, it may sometimes work since we don't fully stop the CPU, but it won't always work.
It appears that some of the Stackdriver (and/or OpenTelemetry, f.k.a. OpenCensus) libraries let you control the lifecycle of pushing trace spans.
For example, this Go package for the OpenCensus Stackdriver exporter has a Flush() method that you can call before completing your request, rather than relying on the runtime to periodically upload the trace spans: https://godoc.org/contrib.go.opencensus.io/exporter/stackdriver#Exporter.Flush
I assume tracing libraries in other languages also expose similar Flush() methods; if not, please let me know in the comments, as this would be a valid feature request for those libraries.
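As a rough sketch of that approach in Go (assuming the contrib.go.opencensus.io/exporter/stackdriver package; "my-project" is a placeholder project ID), flushing before the response completes could look like this:

    package main

    import (
        "log"
        "net/http"

        stackdriver "contrib.go.opencensus.io/exporter/stackdriver"
        "go.opencensus.io/trace"
    )

    func main() {
        // Create the Stackdriver exporter; "my-project" is a placeholder.
        exporter, err := stackdriver.NewExporter(stackdriver.Options{ProjectID: "my-project"})
        if err != nil {
            log.Fatalf("failed to create exporter: %v", err)
        }
        trace.RegisterExporter(exporter)

        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            _, span := trace.StartSpan(r.Context(), "handle-request")
            // ... real request handling would happen here ...
            span.End()

            // Flush buffered spans while the instance still has CPU,
            // instead of relying on the background upload interval.
            exporter.Flush()
            w.Write([]byte("ok"))
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

Flushing on every request adds latency, so you might only flush when a span was actually sampled; treat this as a sketch, not a drop-in implementation.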

Cloud Run now supports sending SIGTERM. If your application handles SIGTERM, it will get 10 seconds of grace time before shutdown.
You can use the 10 seconds to:
Flush buffers that have unsent data
Close connections to other systems
Docs: Container runtime contract
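A minimal Go sketch of using that grace period (the exporter.Flush() call refers to the hypothetical exporter from the earlier example) might look like:

    package main

    import (
        "context"
        "log"
        "os"
        "os/signal"
        "syscall"
    )

    func main() {
        // ctx is cancelled when SIGTERM arrives; Cloud Run then waits
        // up to 10 seconds before sending SIGKILL.
        ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
        defer stop()

        // ... start the HTTP server and register exporters here ...

        <-ctx.Done()
        log.Println("SIGTERM received, flushing buffers")
        // exporter.Flush()    // flush unsent trace spans (hypothetical exporter)
        // closeConnections()  // close connections to other systems (hypothetical helper)
        os.Exit(0)
    }

signal.NotifyContext requires Go 1.16+; on older versions, signal.Notify with a channel achieves the same effect.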

Related

WSO2 ESB.5.0.0 Open Parallel Connection Limitation

We have wso2esb-5.0.0 and we can see that the server intermittently hits high CPU usage, which then increases gradually, makes the APIs run slowly, and finally stops responding. To make it work again, we restart the ESB servers, which brings them back to their normal working state. Could anyone please let me know what the issue could be?
Is there a limitation such that the ESB can handle only x number of API calls per second and only x number of open connections per second? Any inputs and suggestions would be helpful!
Configuration -
We have 2 ESBs and 2 MBs running in cluster mode. The issue is seen on both ESBs.
ESB - 16GB RAM, cache 8GB
We can see the ESTABLISHED connection count varying from 100 to 500 based on the number of incoming requests.
Thanks
There are limitations to the number of requests that can be handled by the ESB server. This depends on a number of factors such as backend latency, mediation implementations, request payloads, etc.
For example, consider a scenario where you use a mediator such as a script mediator to process a large payload (which is not recommended). In this scenario, the transformation may take a considerable amount of time, resulting in threads being blocked at the script mediator. By default, the passthrough message processor thread pool is capped at 500 threads (see the snippet below). This can lead to a scenario where there are no threads left to process new requests, resulting in delayed responses and, in the worst case, an out-of-memory situation.
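For reference, the passthrough transport pool is tuned in <ESB_HOME>/repository/conf/passthru-http.properties. The keys and values below reflect the commonly documented defaults and are shown only as an illustration; verify them against your ESB 5.0.0 installation:

    # repository/conf/passthru-http.properties
    worker_pool_size_core=400
    worker_pool_size_max=500
    worker_pool_queue_length=-1

Note that raising the pool size only hides a slow backend or heavy mediation for a while; fixing the blocking work is the sustainable solution.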
Therefore, with the available information, we are unable to determine the exact cause of the issue. But from the above description, we can suspect that there is an issue with the available threads (due to the slow responses). You can capture thread dumps and thread usage in your environment and analyze the possible cause of the issue. Please refer to the documentation [1], [2] for how to capture a thread dump and thread usage, and refer to [3] for guidance on thread dump analysis.
Also, capture and analyze a heap dump in your environment.
[1]-https://docs.wso2.com/display/CLUSTER420/Troubleshooting+in+Production+Environments
[2]-https://gist.github.com/bsenduran/02e8bf024fcaaa7707a6bb2321e097a8
[3]-https://medium.com/@prabushi/analyse-thread-dump-with-process-instructions-c5490b97e2d1

ColdFusion logging vs performance

I am wondering what kind of performance hit I can expect if I enable logging for the following two logs in ColdFusion 8 on an IIS web server connected to a SQL Server 2005 server.
Log slow pages taking longer than
Enable logging for scheduled tasks
It's a very subjective question and depends on your settings, architecture, and load. Generally, logging takes a very small portion of your server's processing in comparison to everything else it does; however, the amount of logging and your log retention policy can affect your server's performance if not tuned properly.
With those caveats in mind, I will attempt to address each setting:
Slow Pages logging: This depends on your threshold for slow pages. If your threshold is reasonable and all your pages are still being logged, then the performance issue is likely in the pages themselves, not in the logging of those pages.
Scheduled Tasks: Depending on the number of scheduled tasks and the execution interval each one is set to run at, the logging of executions takes up very little space in the logs, and the only real issue would be the size and retention policy of the logs.
You can launch the server monitor and it won't have any impact on your server at all.
That would at least let you see what's going on in real time.

ETW how to survive a reboot

Using the C++/Win32 API, I create an event trace session. My application must support NT5, thus I can't use the newer APIs.
I am using the circular-mode and real-time flags.
I have everything working apart from one snag: when I reboot the machine, the ETW session isn't persisted. My service starts up and recreates the ETW session (as the reboot has wiped it), which then causes the log file to be overwritten.
According to MSDN, I must use the "global" logger on NT5, of which there can only be one, or an "AutoLogger" on NT6, of which there can be many. However, MSDN says:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa363687(v=vs.85).aspx
The AutoLogger sessions increase the system boot time and should be used sparingly. Services that want to capture information during the boot process should consider adding controller logic to itself instead of using the AutoLogger session.
Sounds like overkill for what I'm trying to do. Indeed my service does contain the "controller" logic itself.
So how do I get ETW to keep my trace session across the next reboot? Or alternatively, how do I re-create my ETW session on the next reboot without overwriting the ETW file if it's already there?

Failover strategies for stateful servers

In our project, we have a stateful server. The server runs a rule engine (Drools) and exposes its functionality via a REST service. It is a monitoring system, and it is critical to have an uptime of more or less 100%. Therefore, we need strategies for shutting down a server for maintenance and strategies for continuing to monitor an agent when one server is offline.
The first step might be to put a message queue or service bus in front of the Drools servers to hold messages that have not yet been processed, and to have mechanisms for backing up the state of the server to a database or other storage. This makes it possible to shut down the server for a few minutes to deploy a new version. But the question is what to do when one server goes offline unexpectedly. Are there any failover strategies for stateful servers? What is your experience? Any advice is welcome.
There's no 'correct' way that I can think of. It rather depends on things like:
sensitivity to changes over time windows.
how quickly your application needs to be brought back up.
impact if events are missed.
impact if the events it is monitoring are not up to the second.
how the application raises events back to the outside world.
Some ideas for enabling fail-over:
1. Start from a clean slate. Examine the most serious impact of this before spending time developing anything else.
2. Load a list of facts (today's transactions perhaps) from a database. Potentially replay them in order, possibly whilst using a pseudo clock. I'm aware of this being used for some pricing applications in the financial sector, although at the same time, I'm also aware that some of those systems can take a very long time to catch up due to the number of events that need to be replayed.
3. Persist the stateful session periodically (see the sketch after this list). The interval is to be determined based on how far behind the DR application is permitted to be, and how long it takes to persist a session. This way, the DR application can retrieve the same session from the database. However, there will be a gap in received events based on the interval between persists. Of course, if the reason for failure is corruption of the session, then this doesn't work so well.
4. Configure middleware to forward events to 2 queues, and subscribe the primary and DR applications to those queues. This way, both monitors should be in sync and able to make decisions based on the last 1 minute of activity. Note that if one leg is taken out for a period, then it will need to catch up, and your middleware needs the capacity to store multiple hours' worth of events (however long an outage might be) on a queue. Also, your rules need to work off the timestamp on the event itself when queued, rather than the current time. Otherwise, when bringing a leg back after an outage, it could well raise alerts based on events in a time window.
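As an illustration of idea 3, here is a minimal Go sketch of periodically persisting a serialized session snapshot; serializeSession and saveSnapshot are hypothetical stand-ins, and the 30-second interval is an arbitrary example:

    package main

    import (
        "log"
        "time"
    )

    // serializeSession is a hypothetical stand-in for serializing
    // the rule-engine session (e.g. a Drools KieSession) to bytes.
    func serializeSession() []byte {
        return []byte("serialized session state")
    }

    // saveSnapshot is a hypothetical stand-in for writing the
    // snapshot to durable storage, e.g. a database table.
    func saveSnapshot(state []byte) error {
        return nil
    }

    func main() {
        // The interval bounds how far behind the DR copy may fall.
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()

        for range ticker.C {
            if err := saveSnapshot(serializeSession()); err != nil {
                log.Printf("snapshot failed: %v", err)
            }
        }
    }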
An additional point to consider when replaying events is that you probably don't want any alerts to be raised to the outside world until you have completed the replay. For instance you probably don't want 50 alert emails sent to say that ApplicationX is down, up, down, up, down, up, ...
I'll assume that a monitoring application might be pushing alerts to the outside world in some form. If you have a hot-hot configuration as in 4, you also need to control your alerts. I would be tempted to deal with this by configuring each to push alerts to its own queue. Then middleware could forward alerts from the secondary monitor to a dead letter queue. Failover would be to reconfigure middleware so that primary alerts go to the dead letter queue and secondary alerts go to the alert channel. This mechanism could also be used to discard events raised during a replay recovery.
Given the complexity and potential mess that can arise from replaying events, for a monitoring application I would probably prefer starting from a clean slate, or going with persisted sessions. However this may well depend on what you are monitoring.

Is there any way to access information about a ColdFusion server's load from within ColdFusion?

I am writing a scheduled task which I would like to run frequently.
The problem is that I do not want this task to be run if the server is experiencing a high traffic load.
Is there any way, other than getting the free/total/max memory from Java, to figure out whether this task should continue?
GetMetricData() is going to give you a very good indication of how busy your server is, i.e. how many requests are running and how many are queued, as well as other info.
It's the same info that you get from running cfstat from the command line (you'll find that under {cfroot}\bin\cfstat.exe).
However, knowing how busy you are at the very moment might not be very useful to you if you just call that function once. It might be better for you to log performance data to file or to a database table using Windows perfmon. You can then get the average number of running/queued requests over the past 5 minutes (or whatever) and make your decision on whether to run your task.
There's an easy way to retrieve the memory usage information.
http://misterdai.wordpress.com/2009/11/25/retrieving-coldfusion-memory-usage/
For CPU load, I think you can get it from getMetricData(), but there are other methods too. Since this is my first Stack Overflow post I'm only allowed one link :P, but it's on my blog, so just do a search for CPU when you look at the link above.
You might find it useful to dig into getMetricData() for the performance monitoring stats. It's a good way of telling how busy your server is by the number of running and queued requests.
Hope this helps,
Dave (aka Mister Dai)
Use the ColdFusion AdminApi. Call http://servername/CFIDE/adminapi/servermonitor.cfc in your browser to get the cfcdocs of the component. It gives you many methods to get the health of your CF server instance.