I am wondering what kind of performance hit I can expect if I enable logging for the following two logs in ColdFusion 8 on an IIS web server connected to a SQL Server 2005 database:
Log slow pages taking longer than
Enable logging for scheduled tasks
It's a very subjective question that depends on your settings, architecture, and load. Generally, logging takes a very small portion of your server's processing compared to everything else it does; however, the amount of logging and your log retention policy can affect your server's performance if not tuned properly.
With those caveats in mind, I will attempt to address each setting:
Slow pages logging: this depends on your threshold for slow pages. If the threshold is reasonable and many of your pages are being logged, the performance issue most likely lies in the pages themselves, not in the logging of those pages.
Scheduled tasks: regardless of how many scheduled tasks you have and how frequently each one runs, logging their execution takes up very little space in the logs; the only real issue is the size and retention policy of the logs.
You can launch the server monitor and it won't have any impact on your server at all.
That would at least let you see what's going on in real time.
I have a newbie question here; I'm new to clouds and Linux. I'm using Google Cloud now and am wondering, when choosing a machine config:
What if my machine is too slow? Will it make the app crash, or just slow it down?
How fast should my VM be? The image below shows the
last 6 hours of CPU usage for a Python script I'm running. It's obviously using less than 2% of the CPU most of the time, but there's a small spike. Should I care about the spike? Also, how high should my CPU usage get before I upgrade? If a script I'm running uses 50-60% of the CPU most of the time, I assume I'm safe, but what's the max before you upgrade?
What if my machine is too slow? Will it make the app crash, or just slow it down?
It depends.
Some applications will just respond more slowly. Some will fail if they have timeout restrictions. Some applications will begin to thrash, which means that all of a sudden the app becomes very, very slow.
A general rule, which varies among architects, is to never consume more than 80% of any resource. I use a 50% rule so that my service can handle burst traffic or denial-of-service attempts.
Based on your graph, your service is fine. The spike is probably normal system processing. If the spike went to 100%, I would be concerned.
Once your service consumes more than 50% of a resource (CPU, memory, disk I/O, etc.), it is time to upgrade that resource.
Also, consider that there are other services you might want to add. Examples are load balancers, Cloud Storage, CDNs, and firewalls such as Cloud Armor. Those types of services tend to offload requirements from your service and make it more resilient, available, and performant. The biggest plus is that your service is usually faster for the end user. Some of those services are so cheap that I almost always deploy them.
You should choose a machine family based on your needs. Check the link below for details and recommendations.
https://cloud.google.com/compute/docs/machine-types
If CPU is your concern, you should create a managed instance group that automatically scales based on CPU usage. Usually 80-85% is a good value for the maximum CPU utilization. Check the link below for details.
https://cloud.google.com/compute/docs/autoscaler/scaling-cpu
You should also consider the availability needed for your workload to keep costs efficient. See the link below for other useful info.
https://cloud.google.com/compute/docs/choose-compute-deployment-option
I have been diving into a Stackdriver Trace integration on Google Cloud Run. I can get it to work with the agent, but I am bothered by a few questions.
Given that
The Stackdriver agent aggregates traces in a small buffer and sends them periodically.
CPU access is restricted when a Cloud Run service is not handling a request.
There is no shutdown hook for Cloud Run services; you can't clear the buffer before shutdown: the container just gets a SIGKILL. This is a signal you can't catch from your application.
Running a background process that sends information outside of the request-response cycle seems to violate the Knative Container Runtime contract.
The collection of logging data is documented and does not require me to run an agent, but there is no such solution for telemetry.
I found one report of someone experiencing lost traces on Cloud Run using the agent-based approach.
How Google does it
I went into the source code for the Cloud Endpoints ESP (the Cloud Run integration is in beta) to see if they solve it in a different way, but the same pattern is used there: there is a buffer with traces, and it is flushed periodically (every 1 s).
Question
While my tracing integration seems to work in my test setup, I am worried about incomplete and missing traces when I run this in a production environment.
Is this a hypothetical problem or a real issue?
It looks like the right way to approach this is to write telemetry to logs, instead of using an agent process. Is that supported with Stackdriver Trace?
Is this a hypothetical problem or a real issue?
If you consider a Cloud Run service receiving a single request, then it is definitely a problem, as the library will not have time to flush the data before the CPU of the container instance gets throttled.
However, in real-life use cases:
A Cloud Run service often receives requests continuously or frequently, which means that its container instances are going to either have CPU continuously or have CPU available from time to time.
It is OK to drop traces: if some traces are not collected because the instance is shut down, it is likely that you collected a diverse enough set of samples before this happened. Also, you might just be interested in the aggregated reports, in which case collecting individual traces does not matter.
Note that trace libraries usually themselves sample which requests to trace; they rarely trace 100% of the requests.
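As an illustration, here is a minimal sketch of configuring such sampling with the OpenCensus Go library; the 10% rate is an arbitrary choice for the example, not something prescribed by Cloud Run or Stackdriver:

```go
package main

import "go.opencensus.io/trace"

func main() {
	// Trace only ~10% of requests instead of every request.
	// With sampling like this in place, losing an occasional
	// buffered span at instance shutdown matters even less.
	trace.ApplyConfig(trace.Config{
		DefaultSampler: trace.ProbabilitySampler(0.1),
	})
}
```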
It looks like the right way to approach this is to write telemetry to logs, instead of using an agent process. Is that supported with Stackdriver Trace?
No, Stackdriver Trace takes its data from the spans sent to its API. Note that to send data to Stackdriver Trace, you can use libraries like OpenCensus and OpenTelemetry; the proprietary Stackdriver Trace libraries are not the recommended way anymore.
You're right. This is a fair concern since most tracing libraries tend to sample/upload trace spans in the background.
Since (1) your CPU is scaled nearly to zero when the container isn't handling any requests and (2) the container instance can be killed at any time due to inactivity, you cannot reliably upload those trace spans collected in your app. As you said, it may sometimes work since we don't fully stop the CPU, but it won't always work.
It appears that some of the Stackdriver (and/or OpenTelemetry, f.k.a. OpenCensus) libraries let you control the lifecycle of pushing trace spans.
For example, this Go package for the OpenCensus Stackdriver exporter has a Flush() method that you can call before completing your request, rather than relying on the runtime to periodically upload the trace spans: https://godoc.org/contrib.go.opencensus.io/exporter/stackdriver#Exporter.Flush
I assume other tracing libraries in other languages also expose similar Flush() methods; if not, please let me know in the comments, and this would be a valid feature request for those libraries.
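A minimal sketch of that approach, assuming the OpenCensus Go library and the Stackdriver exporter linked above; the project ID, port, and span name are placeholders:

```go
package main

import (
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/trace"
)

func main() {
	// Placeholder project ID; replace with your own.
	exporter, err := stackdriver.NewExporter(stackdriver.Options{ProjectID: "my-project"})
	if err != nil {
		log.Fatal(err)
	}
	trace.RegisterExporter(exporter)
	// Sample everything so the demo span is actually exported.
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		_, span := trace.StartSpan(r.Context(), "handle-request")
		// ... do the actual request work here ...
		span.End()

		// Flush before the response completes, while the container
		// instance is guaranteed to have CPU.
		exporter.Flush()

		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Note that flushing inside the request handler blocks the response and adds latency, so this trades a slower request for not losing spans.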
Cloud Run now supports sending SIGTERM. If your application handles SIGTERM, it'll get 10 seconds of grace time before shutdown.
You can use the 10 seconds to:
Flush buffers that have unsent data
Close connections to other systems
Docs: Container runtime contract
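For example, a sketch of a Go server that catches SIGTERM and uses the grace period to flush and shut down cleanly (the 9-second timeout is a deliberate margin inside the 10-second window):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		// Wait for Cloud Run to send SIGTERM before shutdown.
		sigs := make(chan os.Signal, 1)
		signal.Notify(sigs, syscall.SIGTERM)
		<-sigs

		// We have up to 10 seconds of grace time: flush telemetry
		// buffers, close connections, then stop serving.
		ctx, cancel := context.WithTimeout(context.Background(), 9*time.Second)
		defer cancel()
		// exporter.Flush() // flush any pending trace spans here
		_ = srv.Shutdown(ctx)
	}()

	log.Println(srv.ListenAndServe())
}
```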
I have been doing load testing at my company for a long time, but my transaction rate never passed 500 transactions per minute. I have a more challenging problem right now.
Problem:
My company will start a campaign and ask its customers a question, and the first correct answer will be rewarded. Analysts expect 100,000 requests per second at the global maximum. (That doesn't seem realistic to me, but it's negotiable.)
Resources:
JMeter,
2 different service requests,
5 slaves with 8 GB RAM each,
80 Mbps internet connection,
3.0 GHz CPUs,
Master computer with the same capabilities as the slaves.
Question:
How can I simulate this scenario? Is it possible? What are the limitations? What should the load model be? Are there any alternatives for doing this?
Any comment is appreciated.
Your load test always needs to represent real usage of the application by real users, so first of all carefully implement your test scenario to mimic a real human using a real browser, with all its behavior, like:
cookies
headers
embedded resources (proper handling of images, scripts, styles, fonts, etc.)
cache
think times
etc.
Make sure your test is following JMeter Best Practices, i.e.:
being run in non-GUI mode
all listeners are disabled
JVM settings are optimised for maximum performance
etc.
Once done, you need to set up monitoring of your JMeter engines' health metrics (CPU, RAM, swap usage, network and disk IO, JVM stats, etc.) in order to see whether there is headroom to continue. The JMeter PerfMon Plugin is very handy, as its results can be correlated with the test metrics.
Start your test with 1 virtual user and gradually increase the load until you reach the target throughput, your application under test dies, or the JMeter engine(s) run out of resources, whichever comes first. Depending on the outcome, you will either report success or a defect, or will need to request more hosts to use as JMeter engines / upgrade the existing hardware.
I am trying to load test web services with 1000 users using JMeter. I can see CPU usage is around 30% once 1000 users are injected, but when it comes to the response, the maximum time taken is around 12 seconds.
My question is: if the CPU is not 100% utilized, the maximum time to receive any response should not be more than a few seconds.
It is good that you are monitoring CPU usage on the application server side. However, the devil may live somewhere else: for instance, the application can experience a lack of available RAM, do intensive swapping, or reach the limits of network or disk IO, so you should consider these metrics as well. The aforementioned ones (and more) can be monitored using the JMeter PerfMon Plugin.
Basically the same as point 1, but applied to the JMeter side of things. JMeter tests are very resource intensive, and if JMeter lacks resources it will send requests much more slowly. So make sure you monitor baseline OS health metrics on the JMeter machine(s) as well. Also, 1000 users is quite a high load; double-check that your test follows JMeter Best Practices.
It may be a bottleneck in your application, i.e. it isn't capable of providing a good response time given 1000 concurrent users. Use the relevant profiler tool to detect the longest-running functions and investigate the root cause.
UPDATE
It appears this issue is caused by a bug related specifically to using Axis2 with ColdFusion, and we have been able to replicate the issue in our production environment on two different servers by switching between Axis1 and Axis2. My original tests to compare the two were apparently thwarted by an override in an Application.cfc which forced Axis2.
We ran into a memory leak today which forced us to speed up the resolution of this issue. It resembled the leak discussed here, though we aren't sure if it is the exact same problem: https://www.hass.de/content/coldfusion-10-webservice-leaking-memory-trusted-cache-leaks-memory
Our primary web services are in Axis1, and we only switched to Axis2 for this new set of web services because we needed document-literal style for Salesforce, and with Axis1 an invalid WSDL was being created (it did not properly describe all object types in arrays). So now we have it as Axis1 and use a manually manipulated WSDL. We're not entirely sure it will work out with Salesforce, but as far as a general fix goes, this works.
I am investigating an issue with our ColdFusion-based SOAP web services in our production environment. It appears that the time between the return statement in the web service's method code and actually receiving a response can be significant, and it appears to directly correspond to the size of the response and/or the number of objects.
In development, a particular request that returns 1000 records takes about 6 seconds to return. However, in production that same hit takes 50+ seconds to return. I added some timing code and found that the actual function code takes less than 1 second to run at the start of the request, meaning that generating the response is taking ColdFusion about 50 seconds in production. Hitting the web service with a simple HTTP request does not have the same slowness, so this seems to be SOAP/Axis specific. The resulting XML is about 1 MB, which I have compared and found no differences in. I also copied out settings from cfadmin in both environments to compare and could find no performance-related setting differences.
Both environments are at the same CF 10 update level. The server monitor shows no significant memory usage. I also ran the request from within the server to make sure there weren't slow connection issues or HTTPS slowing things down, but the results are the same.
Any suggestions or solution would be appreciated.
Additional notes...
CPU sits at about 17% for most of the request, which is a lot of work to be doing; something is happening very inefficiently.
I tried switching the instance to Axis1 and back again, followed by an instance restart and additional tests, with no change in results.
One possibility is that you have them throttled: check "request tuning" in your CF administrator. By default, the setting for "number of simultaneous web service requests" is 10. Are you looping and hitting the server? Is there more traffic in production?
In the server monitor, enable profiling and monitoring, then click on "statistics". On the far right there is a little chart icon; click on it and you will see a chart and a counter legend in the top right. Then run your code. Does "web services running" reach a threshold and cross into "web services queued"? If so, you need to increase that threshold.
One more clue: in the server monitor, do NOT run "memory profiling" for more than a few seconds, say 30. If you do, you will have performance problems for sure.