AppFabric Cache Perfmon Errors - appfabric

We have a critical system that is highly dependent on AppFabric Caching. The setup is three nodes serving around 2000 simultaneous connections and 150-200 requests/second.
Configurations are the default ones. We receive maybe 5-10 "ErrorCode:SubStatus" errors each day, which is unacceptable.
I have added some performance counters, but I can't see anything unusual, except that we sometimes see values on "Total Failure Exceptions / sec" and "Total Failure Exceptions" increases, but only 2-3 times a day.
I would like to see where these errors come from, but I can't find them in any logs in the Event Viewer (I enabled them all according to the documentation). Does anyone know whether these errors could be logged somewhere, and/or whether it is possible to see them in any other way?

We receive maybe 5-10 "ErrorCode:SubStatus" errors each day, which is unacceptable.
Five to ten errors per day, at 150 requests/sec? That's fairly anecdotal. Your cache client always has to handle caching errors properly; a network failure can always occur.
"5-10 "ErrorCode:SubStatus"" is quite vague. There are more than 50 error codes in AppFabric Caching. Try to find out exactly which error codes you are getting. See the full list here.
I would like to see where these errors come from, but I can't find them in any logs in the Event Viewer (I enabled them all according to the documentation). Does anyone know whether these errors could be logged somewhere, and/or whether it is possible to see them in any other way?
The only documentation available is here. The Event Viewer is useful for regularly monitoring the health of the cache cluster. However, when troubleshooting an error, it is possible to get an even more detailed log of the cache cluster activities. I'm not sure this will help you much, because it's sometimes too specific.

Related

Reduce alert noise in GCP Stackdriver

We have set up alerts in our GCP environments. Basically, GCP Stackdriver raises alerts based on certain parameters which we configured (both at the infrastructure level and the application level).
The issue is that we are getting too many alerts if the problem is not resolved quickly enough. For example, if a compute engine is down and we are still investigating, we keep getting alerts. I'm looking for some help to reduce alert noise so that once we acknowledge an issue, the alert frequency is reduced until we resolve it (maybe one email every three hours rather than one every 10 minutes), or alerts stop after the problem is fixed.
Posting this as an answer for better usability.
When the alert is triggered, you will receive notifications every 10 minutes or so until you acknowledge the incident.
When you do, notifications will stop coming, but the incident will be kept open until you close it.
You can also silence the incident; however, be aware that this will also close any other incidents that were triggered by the same condition that triggered this one.
You may also have a look at the alerting behavior docs since they may prove useful in such cases.

How do I handle Google DLP rate limiting when using the Java library?

At one point when doing some testing with the Google DLP Java library, I got an exception that indicated that I had exceeded the API rate limit. Unfortunately I don't have the stack trace anymore, so I can't give any more detail at this point. However, it made me realize that I'm not handling that situation in the code. What is the recommended way of dealing with this from a Java application? I haven't seen any examples in the GitHub repo that gives any guidance on this. I'm aware of the ability to request quota increases, and I have already put in a request. My question is on how to gracefully handle this in the code, should I run into the quota exceeded situation again. Thanks.
It depends greatly on your design and on where you are making the call from.
Can you afford to retry until it succeeds?
Are users waiting on the response, so that errors are not acceptable?
Is this a batch pipeline working offline where taking longer is okay?
If you never want to hit the error, you'll need to implement your own client-side rate throttling, with accompanying monitoring, so that you know when it's time to request more quota.
If you can wait and retry, use exponential backoff, as in the sketch below.
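For illustration, here is a minimal sketch of the retry-with-exponential-backoff pattern in Python (the question is about the Java client, but the idea is identical). call_dlp and QuotaExceededError are placeholders: substitute your actual DLP call and whatever rate-limit/quota exception your client library raises.

    import random
    import time

    class QuotaExceededError(Exception):
        """Placeholder for the rate-limit/quota exception your DLP client raises."""

    def call_with_backoff(call_dlp, max_attempts=5, base_delay=1.0, max_delay=60.0):
        """Retry call_dlp() with exponential backoff plus jitter on quota errors."""
        for attempt in range(max_attempts):
            try:
                return call_dlp()
            except QuotaExceededError:
                if attempt == max_attempts - 1:
                    raise  # out of attempts, surface the error to the caller
                # Sleep base_delay * 2^attempt, capped at max_delay, plus random jitter.
                delay = min(base_delay * (2 ** attempt), max_delay)
                time.sleep(delay + random.uniform(0, base_delay))

The same wrapper is also the natural place to add client-side throttling if you decide you never want to hit the quota at all.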

Unusual request activity log found in Django server

Following is a screenshot of the server activity log. I can see that many requests are being made automatically against the server. How can I avoid this?
It looks like someone is fuzzing your website, scanning for common file names or extensions that often have security vulnerabilities. One way to limit this behaviour is to implement rate limiting, whereby you limit the number of requests from a client that result in HTTP 404 Not Found during some time period before giving them a temporary ban (see the sketch below). Note: this doesn't stop the scanning from happening, but it buys you time and may deter the attacker or researcher.
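As a rough illustration of that idea, here is a minimal Django middleware sketch that counts 404 responses per client IP in Django's cache and starts returning 429 once a client exceeds a limit. The names and thresholds are made up, and a real deployment would also need to handle proxies/X-Forwarded-For and pick limits that suit your traffic.

    from django.core.cache import cache
    from django.http import HttpResponse

    NOT_FOUND_LIMIT = 20    # max 404s allowed per window (illustrative)
    WINDOW_SECONDS = 600    # counting window
    BAN_SECONDS = 3600      # length of the temporary ban

    class NotFoundRateLimitMiddleware:
        def __init__(self, get_response):
            self.get_response = get_response

        def __call__(self, request):
            ip = request.META.get("REMOTE_ADDR", "unknown")
            if cache.get(f"404ban:{ip}"):
                return HttpResponse("Too many requests", status=429)

            response = self.get_response(request)

            if response.status_code == 404:
                key = f"404count:{ip}"
                count = cache.get(key, 0) + 1
                cache.set(key, count, timeout=WINDOW_SECONDS)
                if count >= NOT_FOUND_LIMIT:
                    cache.set(f"404ban:{ip}", True, timeout=BAN_SECONDS)
            return response

Add the class to MIDDLEWARE in settings.py to activate it.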

Amazon EC2 EBS i/o costs

I am hosting my application on Amazon EC2, on one of their micro Linux instances.
It costs (apart from other costs) $0.11 per 1 million I/O requests. I was wondering how many I/O requests it would take if I have, say, 1000 users using it for 1 hour per day for 1 month?
I guess my main concern is: if a hacker keeps hitting my login page (simple HTML), will it increase the I/O request count? I guess yes, as every time the server needs to do something to serve that page.
There are a lot of factors that will impact your I/O requests; as @datasage says, try it and see how it behaves under your scenario. Micro Linux instances are incredibly cheap to begin with, but if you are really concerned, set up a billing alert that will notify you when your usage passes a predetermined threshold; if it suddenly spikes up, you can take action to shut it down if that is what you want.
https://portal.aws.amazon.com/gp/aws/developer/account?ie=UTF8&action=billing-alerts
Take a look at CloudWatch, and (for free) set up VolumeWriteOps and VolumeReadOps alarms that work with Amazon Simple Notification Service (SNS) to send you a text message and an email notice right away if things get too busy, before the bill gets high! (A billing alert will let you know too late, after the cost has already reached the threshold.)
In general though, from my experience, you will not have the problem you outline. Scan the EC2 Discussion Forum at forums.aws.amazon.com, where you would find evidence of this kind of problem if it were prevalent; it does not seem to be happening.
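If you prefer to script the CloudWatch alarm mentioned above rather than click through the console, a boto3 sketch along these lines would create a VolumeWriteOps alarm that notifies an existing SNS topic. The volume ID, topic ARN, region, and threshold are placeholders to adapt to your account.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="ebs-write-ops-spike",
        Namespace="AWS/EBS",
        MetricName="VolumeWriteOps",
        Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder volume
        Statistic="Sum",
        Period=300,                       # 5-minute buckets
        EvaluationPeriods=1,
        Threshold=100000,                 # tune to your normal workload
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-alerts"],  # placeholder SNS topic
    )

An equivalent alarm for VolumeReadOps only needs the MetricName changed.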
@Dilpa yes, you are right. If a brute-force attack hits your website, e.g. somebody hammering your login page, it will increase the server I/O if you have logging enabled for your web server. The web server writes every event to its log files, and that will increase your I/O. Just check your web server logs for that kind of attack and you can then block it.
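If you want a quick way to check for that pattern, a small script like the one below can count hits on the login page per client IP. It assumes a standard Apache/Nginx combined-format access log (client IP first, request line in quotes); the log path and login URL are placeholders.

    import re
    from collections import Counter

    # Matches the client IP and the request line of a combined-format access log.
    LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)')

    hits = Counter()
    with open("/var/log/apache2/access.log") as f:   # adjust path for your server
        for line in f:
            m = LINE_RE.match(line)
            if m and m.group(3).startswith("/login"):  # your login URL may differ
                hits[m.group(1)] += 1

    for ip, count in hits.most_common(10):
        print(ip, count)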

Django/Postgres performance worsening after repeatedly processing the same query

I am running Django on Apache. I have several client computers which should call urllib2.urlopen() and send over some data which my server will process and immediately send back a reply. However, while testing this I found a very tricky issue. I have one client repeatedly send the same data to be processed. The first time, it takes around ~20 seconds; the second time, it takes about 40 seconds; the third time I get a 504 (gateway timeout) error. If I try to send the data some more, 504 errors randomly pop up. I am pretty sure this is an issue with Postgres, as the function that processes the information makes many database calls, but I do not know why the performance of Postgres would decline so much. I have tried several database optimization tricks, including this one: (http://stackoverflow.com/questions/1125504/django-persistent-database-connection), to no avail.
Thanks in advance.
Edit: The requests are not coming in concurrently. They are coming in back to back, and each query involves a lot of SELECTs and JOINs, with a few INSERTs and UPDATEs as well. The Apache error logs show that it is just a simple timeout, where the function that processes the client-posted data takes over 90 seconds.
If it's really Postgres, then you should turn on logging of slow statements in the Postgres configuration to find out which statement exactly is taking so much time.
This can be done by setting the configuration property log_min_duration_statement.
Details are in the manual:
http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-MIN-DURATION-STATEMENT
You say the function makes "many database calls", so I'd start with a very low number, or even 0 to log the duration of all statements; then you should be able to identify the slow ones (see the sketch below for one way to set this).
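For reference, you can either set log_min_duration_statement in postgresql.conf and reload, or, on PostgreSQL 9.4+ with a superuser account, set it from a session. A psycopg2 sketch of the latter, with a placeholder connection string:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder connection string
    conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
    with conn.cursor() as cur:
        # Log every statement slower than 250 ms (use 0 to log all statements).
        cur.execute("ALTER SYSTEM SET log_min_duration_statement = '250ms'")
        cur.execute("SELECT pg_reload_conf()")  # pick up the change without a restart
    conn.close()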
It could also be a locking issue. Maybe the first call does not end its transaction properly, and subsequent calls run into a timeout while waiting for a resource.
You can verify this by checking the system view pg_locks after the first call.
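A quick way to look for ungranted locks together with the blocked query, again as a psycopg2 sketch with a placeholder connection string (assumes PostgreSQL 9.2+, where pg_stat_activity has pid and query columns):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder connection string
    with conn.cursor() as cur:
        # Locks that have not been granted, joined to the session waiting on them.
        cur.execute("""
            SELECT l.locktype, l.mode, l.pid, a.query
            FROM pg_locks l
            JOIN pg_stat_activity a ON a.pid = l.pid
            WHERE NOT l.granted
        """)
        for row in cur.fetchall():
            print(row)
    conn.close()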
Have you checked the Apache error logs? Have you set Django's DEBUG = True or ADMINS = ('email@addr.com',) so that you can get a detailed error report about the actual cause of the issue? If so, how about pasting some of that information here?
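For reference, those settings live in your Django settings module; a fragment along these lines (the address is a placeholder, and error emails also require your EMAIL_* settings to be configured) will email tracebacks when DEBUG is off:

    # settings.py (relevant fragment)
    DEBUG = True   # only while debugging; never leave this on in production

    # With DEBUG = False, unhandled exceptions are emailed to ADMINS instead.
    # ADMINS is a sequence of (name, email) pairs.
    ADMINS = (("Ops", "ops@example.com"),)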
Why are you certain that it's Postgres? Have you done diagnostics to come to that conclusion? If so, please let us know.
Are you running Apache with mod_wsgi? How many processes and threads have you allocated to your Django application?
Also, 20 seconds to process the first transaction is a huge amount of time. Perhaps you could show us the view code that is causing the timeout. We may be able to help there.
I sincerely doubt that it's going to be Postgres alone that is causing the issue. It probably has something to do with the application code or the server configuration.