JMeter - Load testing an EC2 instance, only 50% of requests are successful

I am trying to load test Nginx installed on an EC2 instance via JMeter. Every time I try to load test, only 50% of the requests are successful.
For example:
If I try with 10 users, only 5 responses are OK.
If I try with 100 users, only 50 responses are OK.
If I try with 500, only 250 responses are OK.
Any idea regarding this strange behavior?

This sounds weird. I would recommend the following troubleshooting techniques:
First of all, always check the jmeter.log file; it should contain enough information to get to the bottom of your test failure(s).
If the JMeter log file doesn't contain any suspicious entries, the next step is to check response messages using, for example, the View Results in Table and/or View Results Tree listeners. This should give you some high-level information and trends, e.g. you will be able to see whether particular sampler(s) are always failing.
If the above steps don't give you enough clues to resolve the issue, you can temporarily enable saving of request and response data to see what is wrong with the failing sampler(s). Add the following lines to the user.properties file (located in JMeter's "bin" folder):
jmeter.save.saveservice.output_format=xml
jmeter.save.saveservice.response_data=true
jmeter.save.saveservice.samplerData=true
jmeter.save.saveservice.requestHeaders=true
jmeter.save.saveservice.responseHeaders=true
jmeter.save.saveservice.url=true
The next time you run your JMeter test, the .jtl results file will contain all the relevant data, which can be analyzed using the aforementioned View Results Tree listener. Don't forget to revert the change once you have fixed the script: JMeter listeners are resource-intensive per se, and the above settings greatly increase disk I/O, which may ruin your test.
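If you would rather scan the .jtl outside of the JMeter GUI, here is a quick Python sketch that walks the XML results produced with the settings above and prints the failing samplers; the file name "results.jtl" is just an assumption:

import xml.etree.ElementTree as ET
from collections import Counter

# With output_format=xml, every sample is an element such as <httpSample>
# carrying attributes like lb (label), rc (response code) and s (success flag).
failures = Counter()
for _, elem in ET.iterparse("results.jtl"):  # assumed results file name
    if elem.tag in ("httpSample", "sample") and elem.get("s") == "false":
        failures[(elem.get("lb"), elem.get("rc"))] += 1

for (label, code), count in failures.most_common():
    print(f"{count:>6}  {label}  (response code {code})")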
If none of the above helps, check the logs on the application-under-test side; most probably you will find something there.

Related

Troubleshooting error 503 on Google Cloud Run

I am running a container on Google Cloud Run. For each request, a new instance is started. The requests need around 15 minutes to be processed. I modified the default timeout and everything is working fine. But sometimes, for around 10% of the requests, I get an error:
The request failed because either the HTTP response was malformed or
connection to the instance had an error. Additional troubleshooting
documentation can be found at:
https://cloud.google.com/run/docs/troubleshooting#timeout-503
When I re-run the exact same request, I get no errors. I tried to put try/catch everywhere, but I am not able to figure out what is happening. I checked the CPU and memory usage ... everything looks fine, the maximum reached is 50%. Any advice on how I can get more information about the problem?

Request data seemingly dirty in multithreaded flask app

We are seeing a random error that seems to be caused by two requests' data getting mixed up. We receive a request for quoting shipping costs on an Order, but the request fails because the requested Order is not accessible by the requesting account. I'm looking for anyone who can provide an inkling of what might be happening here; I haven't found anything on Google, the official Flask help channels, or SO that looks like what we're experiencing.
We're deployed on AWS, with Apache, mod_wsgi, 1 process, 15 threads, and about 10 instances.
Here's the code that sends the email:
msg = f"Order ID {self.shipping.order.id} is not valid for this Account {self.user.account_id}"
body = f"Error:<br/>{msg}<br/>Request Data:<br/>{request.data}<br/>Headers:<br/>{request.headers}"
send_email(msg, body, "devops#*******.com")
request_data = None
The problem is that in that scenario we email ourselves the error and the request data, and the request data we're getting, in many cases, would never have landed in that particular piece of code. It can be, for example, a request from the frontend to get the current user's settings, which makes no reference to any orders, never mind trying to get a shipping quote for one.
Comparing the application logs with Apache's access_log, we see that, in all cases, we got two requests on the same instance: one requesting the quote, and another which is the request that actually gets logged. We don't know whether these two requests are processed by the same thread in rapid succession or by different threads, but they come so close together that I think the latter is much more probable. So far we have no way of unambiguously tying the access_log entries to the application logging, so we don't know which of the two requests is logging the error; but the fact is that we're getting routed to a view that does not correspond to the request's content (i.e., we're not sure whether the quoting request is getting the wrong request object, or whether the other one is getting routed to the wrong view).
Another fact of interest is that we use GraphQL, so part of the routing is done after Flask/Werkzeug do theirs, but the body we get from flask.request at the moment the error shows up does not correspond to the GraphQL function/mutation that gets executed. But this also happens in views mapped directly through Flask. The user is looked up by the flask-login workflow at the very beginning, and it corresponds to the "bad" request (i.e., the one not for quoting).
The actual issue was a bug in one of python-graphql's libraries (promise), not in Flask, Werkzeug, or Apache. It was not the request data that was "moving" to a different thread, but a different thread trying to resolve the promise for a query that was supposed to be handled elsewhere.
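For anyone hitting a similar class of problem: Flask's request object is a context local, so a thread you spawn yourself does not automatically see the original request. A minimal sketch using Flask's copy_current_request_context shows the supported way to hand request data to another thread (the endpoint and the send_error_email helper are illustrative, not from the question):

import threading

from flask import Flask, copy_current_request_context, request

app = Flask(__name__)

@app.route("/quote", methods=["POST"])  # illustrative endpoint
def quote():
    # copy_current_request_context binds the *current* request to the wrapped
    # function, so the worker thread sees the right request data instead of
    # whatever context (if any) happens to be active when it runs.
    @copy_current_request_context
    def report_error():
        send_error_email(request.data, dict(request.headers))  # hypothetical helper

    threading.Thread(target=report_error).start()
    return {"status": "queued"}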

How to handle file processing request in Django?

I am building a Django REST Framework based server, and in one of the requests I receive an audio file from the front end, on which I need to run an ML-based algorithm (I have a script for this) and respond to the user with the result. The problem is that this request might take 5-10 seconds to execute. I am trying to understand the following things:
Will Celery help me reduce the workload on the server, given that in any case I need to wait for the result of the ML algorithm and respond to the user?
Should I create a different server to handle this type of request? Would that be a better approach?
Also, is my flow correct? First, upload the file to some cloud storage and serialize the instance to get the file's URL. Second, run the script using Celery and wait for the result. Third, respond with the result.
Thanks for helping.
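A minimal sketch of the flow described in the question, assuming Celery with a Redis broker; run_ml_algo and upload_to_storage are hypothetical stand-ins for the ML script and the cloud upload step:

# tasks.py -- the long-running work lives in a Celery task
from celery import Celery

app = Celery("audio_processing",
             broker="redis://localhost:6379/0",    # assumed broker
             backend="redis://localhost:6379/0")   # assumed result backend

@app.task
def process_audio(file_url):
    """Fetch the uploaded file by URL, run the ML script, return the result."""
    return run_ml_algo(file_url)  # hypothetical: your existing ML script

# views.py -- the DRF view uploads the file, enqueues the task and waits
from rest_framework.views import APIView
from rest_framework.response import Response

class AudioUploadView(APIView):
    def post(self, request):
        file_url = upload_to_storage(request.FILES["audio"])  # hypothetical helper
        async_result = process_audio.delay(file_url)
        # Blocking on get() keeps the HTTP request open for the 5-10 s the job
        # needs; returning the task id and letting the client poll is the usual
        # non-blocking alternative.
        return Response({"result": async_result.get(timeout=30)})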

SWF Activity is not completing even though the computation has finished

I'm testing a new SWF workflow, and I've got some activity that makes a RESTful call out to another service. Problem is, I can see through logging that the actual call takes less than a second to complete, but the Activity always times out in SWF (START_TO_CLOSE of 5 mins). Being more specific, the RESTful call is a list call, and when I limit the batch size to a small number, the Activity completes and moves on very quickly. But at some seemingly arbitrary threshold, it chokes completely.
Does anyone have any insight into this? I've read that SWF calls have a size limitation of 1 MB, does anyone know how to find the size of data my workers are trying to pass SWF?
After some remote debugging, it turned out the response from the task was too big and the activity was failing silently. The failure occurs when the framework tries to report the response back to SWF and the SDK calls RespondActivityTaskCompleted. That API has a length restriction on the internal result parameter:
Length Constraints: Maximum length of 32768.
The validation error throws an exception that is swallowed internally, so the activity simply times out.
I wouldn't recommend using activity input and output parameters for passing large data sets. SWF is an orchestration technology, not a data-passing one. The standard workarounds are:
Storing the result in a separate store (S3, for example) and passing a reference to it (see the sketch after this list).
Caching the result locally on a machine and routing all following activities to the same host so that they have access to the cached result. See the fileprocessing sample for the details of the routing approach.
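A rough sketch of the first workaround in Python with boto3, independent of any particular SWF client library; the bucket name and key scheme are illustrative assumptions:

import json
import uuid

import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")
RESULT_BUCKET = "my-swf-activity-results"  # hypothetical bucket name

def complete_with_reference(result):
    """Store the full result in S3 and return only a small reference, keeping
    the activity result well under SWF's 32,768-character limit."""
    key = f"activity-results/{uuid.uuid4()}.json"
    s3.put_object(Bucket=RESULT_BUCKET, Key=key,
                  Body=json.dumps(result).encode("utf-8"))
    # Downstream activities receive only this reference and fetch the payload
    # from S3 themselves when they need it.
    return {"s3_bucket": RESULT_BUCKET, "s3_key": key}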
BTW, have you checked out Cadence, which is an open-source version of SWF with much better client-side libraries?

Pentaho Kettle - check DB connections without stopping the job

I have read blogs and one question close to mine, but have not found a solution to my problem. I have a transformation job set up to extract three tables from 84 DBs to generate one report. My problem is that when a DB connection is not available, the whole job stops.
I would like to be able to check DB connections before initializing the job, log errors for inaccessible DBs, and create a new dynamic list of successful tests from which I will then run my job. I have used the Check DB connections step, but it still stalls when a connection fails. How can I process my list of DBs, running through to the end, without aborting the job?
First of all, you have indeed used the correct step to check the DB connections. Now for your question, I will try to explain it in parts (hope I am correct):
Case I: "My problem is when a DB connection is not available, the whole job stops"
This behavior is expected. Whenever a step hits an error, it throws an exception and stops the entire execution of the job.
But does that mean the "Check DB connections" step stops checking the remaining connections if one of them fails? The answer is no. The step completes testing all the connections even if it gets an error on some connection in the middle. Observe the logs carefully; they give you a final consolidated list of all the checked DB connections (check the image below):
I tried testing with 4 DB connections, out of which I got one error and 3 successes.
Now for the "whole job stops" portion: since the stopping behavior is expected (as mentioned above), what you can do is route the flow using an error hop, so that if the job hits an error it takes the error hop. Check the image below:
Here I have used two hops: one success and one error. If the job fails, it takes the error path (the red hop); otherwise it takes the success path (the green hop).
Case II: "log errors for inaccessible DBs and create a new dynamic list of successful tests"
You can log the errors either to a separate log file or to a table (depending on your requirement) and then read through the log to generate a list of DB connections. Check the image below:
The output is a list of connections along with an error flag:
Y: failure connecting to the database
N: successful connection
Note: I have used a Text file input step since I logged the previous step's output into a text file instead of a database. You can customize this as per your requirements.
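If you would rather pre-filter that logged list outside of Kettle (for example, to drive a scripted run), here is a rough Python sketch; the log file name and its two columns (connection, error_flag) are assumptions based on the output described above:

import csv

# Hypothetical log written by the previous step: one row per connection with
# a header row containing "connection" and "error_flag" (Y = failed, N = OK).
def successful_connections(log_path="db_connection_checks.txt"):
    with open(log_path, newline="") as f:
        return [row["connection"]
                for row in csv.DictReader(f)
                if row["error_flag"] == "N"]

if __name__ == "__main__":
    # The resulting list can then drive the extraction job for reachable DBs only.
    for name in successful_connections():
        print(name)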
I have placed sample code in a gist; you can check it for reference.
Hope it helps :)