How can I filter out errors on sentry to avoid consuming my quota? - django

I'm using Sentry to log my errors, but there are errors I'm not able to fix (or that can't be fixed on my side), like
OSError (write error)
Or errors that come from RQ (each time I deploy my app)
Or client errors (which are client.errors)
I can't just ignore them because they consume all my quota. How can I filter out these errors?
Here are some references for interested people.
uwsgi: OSError: write error during GET request
Fixing broken pipe error in uWSGI with Python
https://github.com/unbit/uwsgi/issues/1623

I created a Gist for rate limiting the number of events that are sent to Sentry:
https://gist.github.com/jurrian/e22f8e724b8499a29c5537e956f0dc7f
It uses ratelimitingfilter, which can be configured with a rate per minute and, additionally, a burst so that rate limiting only starts after a number of events.
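The idea behind such a filter can be sketched with the standard library alone. This is not the code from the Gist or from ratelimitingfilter; it is a minimal token-bucket sketch, and the class name RateLimitFilter and its parameters (rate, per, burst, clock) are assumptions made for illustration.

```python
import logging
import time


class RateLimitFilter(logging.Filter):
    """Token bucket: allow `burst` records at once, then `rate` per `per` seconds."""

    def __init__(self, rate=1, per=60.0, burst=5, clock=time.monotonic):
        super().__init__()
        self.rate = rate
        self.per = per
        self.burst = burst
        self.clock = clock          # injectable so the filter can be tested
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def filter(self, record):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        refill = (now - self.last) * self.rate / self.per
        self.tokens = min(self.burst, self.tokens + refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # let the record through
        return False      # drop the record
```

You would attach an instance with addFilter() to whichever logger or handler forwards records to Sentry in your setup; which one that is depends on the Sentry client you use.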

I get the same errors, but I never had any problems with my quota. If you really want to filter them, you can do it in your SDK:
https://docs.sentry.io/error-reporting/configuration/filtering/?platform=python
But beware, this could hide other errors as mentioned here:
https://github.com/pypa/warehouse/issues/679
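As a rough sketch of SDK-side filtering with the Python SDK: a before_send callback receives the event and a hint, and returning None drops the event. The match on "write error" below is an assumption about how the uWSGI error is worded in your environment; it is kept narrow so that other OSErrors still get reported.

```python
def before_send(event, hint):
    """Return None to drop an event, or the event itself to send it."""
    exc_info = hint.get("exc_info")
    if exc_info:
        exc_type, exc_value, _tb = exc_info
        # Drop only the specific "write error", not every OSError.
        if issubclass(exc_type, OSError) and "write error" in str(exc_value):
            return None
    return event


# Wiring it up (requires the sentry-sdk package; the DSN is a placeholder):
# import sentry_sdk
# sentry_sdk.init(dsn="https://<key>@<host>/<project>", before_send=before_send)
```

Because the check is on both the exception type and its message, an unrelated OSError such as a full disk would still be sent.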

To save yourself some quota, you have two options:
Avoid forwarding events client side, thus preventing events from being sent to Sentry at all. Have a look at the docs for available client-side filters. The drawback of this approach is, of course, that you need a new code deployment for any adjustment of client-side filters, and some clients may not instantly reflect your code changes.
Avoid forwarding events on Sentry's side, via inbound filters ([Project] > Project Settings > Inbound Filters). According to the Sentry documentation on quota usage, events filtered via inbound filters do not count against your quota.
Inbound filters include:
Common browser extension errors
Events coming from localhost
Known legacy browsers errors
Known web crawlers
By their error message
From specific release versions of your code
From certain IP addresses
Business plans and above also allow filtering events by error message.

Related

Should django health-check endpoint /ht/ be accessible from everybody?

From the documentation reported here I read
This project checks for various conditions and provides reports when
anomalous behavior is detected. The following health checks are bundled
with this project: cache, database, storage, disk and memory
utilization (via psutil), AWS S3 storage, Celery task queue, Celery
ping, RabbitMQ, Migrations
and from the use case section
The primary intended use case is to monitor conditions via HTTP(S),
with responses available in HTML and JSON formats. When you get back a
response that includes one or more problems, you can then decide the
appropriate course of action, which could include generating
notifications and/or automating the replacement of a failing node with
a new one
And then
The /ht/ endpoint will respond with a HTTP 200 if all checks passed and a HTTP
500 if any of the tests failed.
From a security point of view: should this URL (https://example.com/ht) be reachable by everybody? It seems to give away various pieces of information.

AWS lambda execution fails only first time I run it with 'customer function error'

I trigger a lambda function via API gateway and everything works perfectly with the one exception that the very first time I trigger it on a given day it fails.
Strangely, the lambda function logs don't show any errors. I get my usual START log statement and then the request and context of the trigger, then after 5s, it ends unexpectedly.
When I look into the API gateway logs this is the error it returns:
Lambda execution failed with status 200 due to customer function error: 2018-12-10T11:00:31.208Z cc233168-fc9n-11fc-a05a-577bb4sd2b2ccc Task timed out after 5.01 seconds.
Has anyone encountered a similar problem? What is customer function error and how may I resolve this?
Without knowing much about the code you are using, I would call this a cold start. A cold start happens on the first request after your function has not been called for a long time. Notice that the error message says "Task timed out after 5.01 seconds", which matches the timeout configured for the function. You can increase the timeout.
Alternatively, you could reduce the impact of cold starts by making them shorter:
by authoring your Lambda functions in a language that doesn’t incur a high cold start time — i.e. Node.js, Python, or Go
choose a higher memory setting for functions on the critical path of handling user requests (i.e. anything that the user would have to wait for a response from, including intermediate APIs)
optimizing your function’s dependencies, and package size
You can also explore setting up a cron job through CloudWatch that calls your API with a ping at a regular interval, so the function stays warm.
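A minimal sketch of the keep-warm idea, assuming the scheduled rule invokes the function directly rather than going through API Gateway. Scheduled CloudWatch Events invocations carry "source": "aws.events" in the event; handle_request is a hypothetical placeholder for your real logic.

```python
def handle_request(event):
    # Hypothetical placeholder for the function's real work.
    return {"statusCode": 200, "body": "ok"}


def handler(event, context):
    # Scheduled invocations are recognizable by their "source" field.
    if isinstance(event, dict) and event.get("source") == "aws.events":
        # Warm-up ping: return immediately and skip the real work.
        return {"warmed": True}
    return handle_request(event)
```

If the ping goes through API Gateway instead, the same short-circuit can key off a dedicated path or header of your choosing.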
Adding to Yash's answer:
I've only seen "Lambda execution failed with status 200" in API Gateway execution logs. In case it can manifest in other ways, make sure you have execution logging enabled for the endpoint. If you didn't already have it enabled, you'll need to wait for the problem to manifest again.
You can verify it's a cold start problem as follows:
In the log entry with the error grab the #logStream value and the timestamp for the event; it'll be a long string of alphanumerics like a4f8115980dc83a511eeedc493a78741
Open the log group for that endpoint's execution log -> find the log stream with the identifier you just grabbed
Narrow the date/time range to a window around the time where the event occurred
If you chose a narrow window and it's a cold start problem, I would expect the offending request to be the first one in the list. Click the "There are older events to load. Load more." link at the top of the list.
You should now see a gap of time between the last request received and the offending request.
In my case the error says "connection reset by peer", which leads me to think it's behaving as though a virtual machine were put to sleep and then woken up, in the sense that it believes TCP connections it previously had open are still valid.
In the short term the solution we're going with is to implement a retry strategy.
Besides the cold-start problem, there's another potential aspect of this problem: your API Gateway access log format.
Do the following:
Find the access log entries that correspond to the offending request in the execution log.
Is the HTTP status == 502?
502s in the API Gateway access log usually (always?) indicate the Lambda responded with malformed JSON.
The most obvious reason for it returning malformed JSON is a bug in your code. One of the less obvious reasons: a mistake in the access log format.
If you suspect that's the case, look for the following:
Quoted fields that shouldn't be; e.g. $context.error.messageString
Un-quoted fields that should be. A common idiom is to leave numeric fields un-quoted because it makes insights queries like this work: | filter #status >= 500. As convenient as that is, if the field isn't guaranteed to produce a numeric result then the JSON response will be malformed.
Trailing commas in {} bodies
Here's the documentation for many of the context variables, though one thing to keep in mind: the context variables that are available differ between the different API Gateway endpoint types (lambda, websocket, etc).
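One way to catch these mistakes before deploying is to render the access log format with sample values and check that the result parses as JSON. This is a local sketch, not an AWS tool: render_log_format and is_valid_json are hypothetical helpers, and the sample values (a status that renders as "-" when absent, a message string that already includes its own quotes) are assumptions about what API Gateway substitutes.

```python
import json
import re


def render_log_format(fmt, values):
    """Substitute $context.* variables with sample values, as plain text."""
    pattern = r"\$context\.[A-Za-z0-9_.]+"
    return re.sub(pattern, lambda m: str(values[m.group(0)]), fmt)


def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


samples = {
    "$context.status": "-",
    "$context.error.messageString": '"Internal server error"',
}
good = '{"status": "$context.status", "error": $context.error.messageString}'
bad = '{"status": $context.status, "error": "$context.error.messageString"}'

print(is_valid_json(render_log_format(good, samples)))
print(is_valid_json(render_log_format(bad, samples)))
```

The second format breaks twice over: the unquoted status becomes a bare "-", and the already-quoted message ends up double-quoted.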

Unusual request activity log found in django server

The following is a screenshot of the server activity log. I can see that many requests are being made to the server automatically. How can I avoid this?
It looks like someone is fuzzing your website, scanning for common file names or extensions that often have security vulnerabilities. One way to limit this behaviour is to implement rate limiting, whereby you limit the number of requests a user can make that result in HTTP 404 Not Found during some time period before giving them a temporary ban. Note: this solution doesn't stop this from happening, but it does buy you time and may deter the attacker or researcher.
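A minimal in-memory sketch of that rate limiting, independent of any framework. The class name NotFoundLimiter and the thresholds are assumptions; in a Django project you would call it from a middleware and would probably keep the counters in a shared cache rather than in process memory.

```python
import time
from collections import defaultdict, deque


class NotFoundLimiter:
    """Temporarily ban an IP that produces too many 404s within a time window."""

    def __init__(self, max_404s=20, window=60.0, ban_seconds=600.0,
                 clock=time.monotonic):
        self.max_404s = max_404s
        self.window = window
        self.ban_seconds = ban_seconds
        self.clock = clock
        self.hits = defaultdict(deque)  # ip -> timestamps of recent 404s
        self.banned_until = {}          # ip -> time at which the ban ends

    def is_banned(self, ip):
        return self.clock() < self.banned_until.get(ip, 0.0)

    def record_404(self, ip):
        now = self.clock()
        recent = self.hits[ip]
        recent.append(now)
        # Forget 404s that fell out of the sliding window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) > self.max_404s:
            self.banned_until[ip] = now + self.ban_seconds
            recent.clear()
```

A middleware would check is_banned() before handling a request and call record_404() whenever the response status is 404.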

How do I receive API Throttling Warnings?

We need to fetch mutual friend data for each one of our new users. (We're currently doing that through the REST API.) In load testing for an upcoming traffic surge, we ran into API throttling, which breaks our production site. Oops!
In the Insights -> Diagnostics pane, it looks like they issue throttling warnings before they actually throttle. Is there some way we can monitor those limits in code so that we back off gracefully?
You will want to watch for the two errors coming back, then put your next call on a wait timer.
API_EC_TOO_MANY_CALLS Application request limit reached
API_EC_USER_TOO_MANY_CALLS User request limit reached
See: http://www.fb-developers.info/tech/fb_dev/faq/general/gen_10.html for more information.
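A sketch of the "wait timer" approach, assuming your API client surfaces the error name on an exception. ApiError and call_with_backoff are hypothetical names, not part of any Facebook SDK; the delay doubles after each throttled attempt.

```python
import time

THROTTLE_ERRORS = {"API_EC_TOO_MANY_CALLS", "API_EC_USER_TOO_MANY_CALLS"}


class ApiError(Exception):
    """Hypothetical exception carrying the API error name."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def call_with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `call` on throttling errors, waiting longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ApiError as exc:
            last_attempt = attempt == max_attempts - 1
            if exc.code not in THROTTLE_ERRORS or last_attempt:
                raise  # not a throttling error, or out of attempts
            sleep(base_delay * 2 ** attempt)
```

Errors other than the two throttling codes are re-raised immediately, so genuine failures are not masked by the retry loop.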

Architecture for robust payment processing

Imagine 3 system components:
1. External ecommerce web service to process credit card transactions
2. Local Database to store processing results
3. Local UI (or win service) to perform payment processing of the customer order document
The external web service is obviously not transactional, so how can I guarantee that:
1. results are eventually persisted to the database once received from the web service, even if the database is not accessible at that moment (network issue, db timeout)
2. clients are prevented from processing the customer order while a payment has been initiated by another client but its results have not yet been successfully persisted to the database (and are waiting in some kind of recovery queue)
The aim is to do the processing with non-transactional system components and guarantee that the transaction won't be repeated by another process in case of failure.
(Please look at it in the context of post-sale payment processing, where multiple operators might attempt manual payment processing; not a web checkout application.)
Ask the payment processor whether they can detect duplicate transactions based on an order ID you supply. Then if you are unable to store the response due to a database failure, you can safely resubmit the request without fear of double-charging (at least one PSP I've used returned the same response/auth code in this scenario, along with a flag to say that this was a duplicate).
Alternatively, just set a flag on your order immediately before attempting payment, and don't attempt payment if the flag was already set. If an error then occurs during payment, you can investigate and fix the data at your leisure.
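The flag only helps if setting it is atomic, so that two operators cannot both see it unset. A minimal sketch with SQLite, assuming a hypothetical orders table with a payment_started column: the conditional UPDATE succeeds for exactly one caller.

```python
import sqlite3


def try_claim_order(conn, order_id):
    """Atomically set the flag; return True only for the caller that set it."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE orders SET payment_started = 1 "
            "WHERE id = ? AND payment_started = 0",
            (order_id,),
        )
    # rowcount is 1 if this call flipped the flag, 0 if it was already set.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, payment_started INTEGER NOT NULL)"
)
conn.execute("INSERT INTO orders (id, payment_started) VALUES (1, 0)")
conn.commit()

first = try_claim_order(conn, 1)   # this caller may attempt payment
second = try_claim_order(conn, 1)  # this caller must not attempt payment
```

Only the caller that gets True proceeds to the payment gateway; everyone else backs off.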
I'd be reluctant to go down the route of trying to automatically cancel the order and resubmitting, as this just gets confusing (e.g. what if cancelling fails - should you retry or not?). Best to keep the logic simple so when something goes wrong you know exactly where you stand.
In any system like this, you need robust error handling and error reporting. This is doubly true when it comes to dealing with payments, where you absolutely do not want to accidentally take someone's money and not deliver the goods.
Because you're outsourcing your payment handling to a 3rd party, you're ultimately very reliant on the gateway having robust error handling and reporting systems.
In general then, you hand off control to the payment gateway and start a task that waits for a response from the gateway, which is either 'payment accepted' or 'payment declined'. When you get that response you move onto the next step in your process and everything is good.
When you don't get a response at all (time out), or the response is invalid, then how you proceed very much depends on the payment gateway:
If the gateway supports it, send a 'cancel payment' style request. If the payment cancels successfully then you probably want to send the user to a 'sorry, please try again' style page.
If the gateway doesn't support canceling, or you have no communications to the gateway, then you will need to manually (in person, such as by telephone) contact the 3rd party to discover what went wrong and how to proceed. To aid this you need to dump as much detail as you have to error logs, such as date/time, customer id, transaction value, product ids etc.
Once you're back on your site (and payment is accepted) then you're much more in control of errors, but in brief, if you can't complete the order, then you should either dump the details to disk (such as a csv file for manual handling) or contact the gateway to cancel the payment.
It's also worth having a system in place to track errors as they occur, and if an excessive number occur then consider what should happen. If it's a high traffic site, for example, you may want to temporarily prevent further customers from placing orders whilst the issue is investigated.
Distributed messaging.
When your payment gateway returns, submit a message to a durable queue that guarantees a handler will eventually get it and process it. The handler would update the database. Should failure occur at that point, the handler can leave the message in the queue, repost it to the queue, or post an alternate message.
Should something occur later that invalidates the transaction, another message could be queued to "undo" the change.
There's a fair amount of buzz lately about eventual consistency and distributed messaging. NServiceBus is the new component hotness. I suggest looking into this, I know we are.
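To make the queue idea concrete, here is a small stand-in that uses an SQLite table as the durable queue. It is not NServiceBus and not production-grade; init_queue, enqueue and process_next are hypothetical names. The point it shows is that a message is deleted only after its handler succeeds, so a database failure leaves it queued for a later retry.

```python
import json
import sqlite3


def init_queue(conn):
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS queue "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )


def enqueue(conn, payload):
    with conn:
        conn.execute("INSERT INTO queue (body) VALUES (?)", (json.dumps(payload),))


def process_next(conn, handler):
    """Run handler on the oldest message; delete it only if handler succeeds."""
    row = conn.execute("SELECT id, body FROM queue ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return False
    msg_id, body = row
    try:
        handler(json.loads(body))
    except Exception:
        return False  # leave the message in the queue for a later retry
    with conn:
        conn.execute("DELETE FROM queue WHERE id = ?", (msg_id,))
    return True
```

Since a message can be delivered more than once (for example, if the process dies after the handler ran but before the delete), the handler that updates the order should be idempotent.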