OpenGraph Debugger reporting bad HTTP response codes - facebook-graph-api

For a number of sites that are functioning normally, when I run them through the OpenGraph debugger at developers.facebook.com/tools/debug, Facebook reports that the server returned a 502 or 503 response code.
These sites are clearly working fine on servers that are not under heavy load. URLs I've tried include but are not limited to:
http://ac.mediatemple.net
http://freespeechforpeople.org
These are in fact all sites hosted by MediaTemple. When I talked to people at MediaTemple, though, they insisted that it must be a bug in the API and not an issue on their end. Is anyone else getting unexpected 500/502/503 HTTP response codes from the Facebook Debug tool, with sites hosted by MediaTemple or anyone else? Is there a fix?
Note that I've reviewed the Apache logs on one of these servers and could find no evidence of Apache receiving the request from Facebook, or of a 502 response being returned.

Got this response from them:
At this time, it would appear that (mt) Media Temple servers are returning 200 response codes to all requests from Facebook, including the debugger. This can be confirmed by searching your access logs for hits from the debugger. For additional information regarding viewing access logs, please review the following KnowledgeBase article:
Where are the access_log and error_log files for my server?
http://kb.mediatemple.net/questions/732/Where+are+the+access_log+and+error_log+files+for+my+server%3F#gs
You can check your access logs for hits from Facebook by using the following command:
cat <name of access log> | grep 'facebook'
This will return all hits from Facebook. In general, the debugger will specify the user-agent 'facebookplatform/1.0 (+http://developers.facebook.com),' while general hits from Facebook will specify 'facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php).'
Using this information, you can perform even further testing by using 'curl' to emulate a request from Facebook, like so:
curl -Iv -A "facebookplatform/1.0 (+http://developers.facebook.com)" http://domain.com
This should return a 200 or 206 response code.
In summary, all indications are that our servers are returning 200 response codes, so it would seem that the issue is with the way that the debugger is interpreting this response code. Bug reports have been filed with Facebook, and we are still working to obtain more information regarding this issue. We will be sure to update you as more information becomes available.
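If you'd rather script that curl check, here is a minimal Python sketch (assuming the third-party requests library; the target URL is one of the affected sites above):

import requests

# Emulate the Facebook debugger by sending its user agent string.
headers = {"User-Agent": "facebookplatform/1.0 (+http://developers.facebook.com)"}
response = requests.head("http://ac.mediatemple.net", headers=headers, allow_redirects=True)

# A healthy origin should answer 200 (or 206 for a partial response).
print(response.status_code)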
So the good news is that they are busy solving it. The bad news: it's out of our control.
There's a forum post on the matter here:
https://forum.mediatemple.net/topic/6759-facebook-503-502-same-html-different-servers-different-results/
With more than 800 views and recent activity, it indicates that they are working hard on it.
I noticed that HTTPS (mt) sites don't even get a response code; the debugger just reports:
Error parsing input URL, no data was scraped.
RESOLUTION
MT admitted it was their fault and fixed it:
During our investigation of the Facebook debugger issue, we have found that multiple IPs used by this tool were being filtered by our firewall due to malformed requests. We have whitelisted the range of IP addresses used by the Facebook debugger tool at this time, as listed on their website, which should prevent this from occurring again.
We believe our auto-banning system has been blocking several Facebook IP addresses. This was not immediately clear upon our initial investigation and we apologize this was not caught earlier.
The reason API requests may intermittently fail is because only a handful of the many Facebook IP addresses were blocked. The API is load-balanced across several IP ranges. When our system picks up abusive patterns, like HTTP requests resulting in 404 responses or invalid PUT requests, a global firewall rule is added to mitigate the behavior. More often than not, this system works wonderfully and protects our customers from constant threats.
So, that being said, we've been in the process of whitelisting the Facebook API ranges today and confirming our system is no longer blocking these requests. We'd still like those affected to confirm if the issue persists. If for any reason you're still having problems, please open up or respond to your existing support request.

Related

How best to manage Cloudfront/Nginx 502 Bad Gateway errors in AWS

We have a website which is served over CloudFront. Sometime this week the origin EC2 (ECS) server crashed and for a short time it started returning 502 errors:
502 Bad Gateway | Nginx
This issue was resolved quickly, but a couple of users are still seeing the errors in their browsers. They are both using Google Chrome, and the problem seems to be constant (as if the browser/CloudFront has cached the error). One user fixed the issue by entering Incognito mode; another sees the issue every time they click a link from our newsletter. Some other users have fixed the issue only by using a different browser.
I am unsure how to start debugging this. I'd also imagine that if the browser receives a 502 error, it wouldn't cache the page content. And I'm unable to replicate the problem from my end.
To add extra information to the question:
I'm not looking for advice on how to stop or manage 502 Bad Gateway errors. We know why these happen(ed); this question is purely about how to fix cached 502 errors after they have been delivered to the user.
From the feedback so far, it looks like we can configure CloudFront to stop caching 502 errors after 10 seconds. This was enabled, but the issue still persists.
My feeling here is that the user's browser has cached the 502 error page and isn't requesting an update from the server. Without getting them to clear their cache, is there a way to set CloudFront or their browser to cache a 502 error only for a short period before requesting an updated page from the server?
Also, thinking about this again: the error is '502 Bad Gateway | Nginx', so is this even coming from CloudFront? Could my server be sending long Cache-Control headers with 502 errors?
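For reference, the 10-second uncache mentioned above maps to CloudFront's Error Caching Minimum TTL on a custom error response. A hedged boto3 sketch of setting it (the distribution ID is a placeholder, and this is one way to do it rather than necessarily what was used here):

import boto3

DISTRIBUTION_ID = "E1EXAMPLE"  # placeholder -- substitute your own

client = boto3.client("cloudfront")

# Fetch the current config; CloudFront requires the matching ETag for updates.
result = client.get_distribution_config(Id=DISTRIBUTION_ID)
config = result["DistributionConfig"]
etag = result["ETag"]

# Cache 502 responses for at most 10 seconds before retrying the origin.
config["CustomErrorResponses"] = {
    "Quantity": 1,
    "Items": [{"ErrorCode": 502, "ErrorCachingMinTTL": 10}],
}

client.update_distribution(Id=DISTRIBUTION_ID, DistributionConfig=config, IfMatch=etag)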
After going down a lot of dead ends, I finally found a solution to this issue. Apologies, the initial question was incorrect in its assumptions, but thanks for everyone's input anyway. My previous experience of 502 errors was limited to instances where the origin server went down. So when a small number of our users started receiving constant 502 errors while the server was functioning correctly, I immediately thought it was a CloudFront caching issue: the origin server had crashed, and the 502 error was being cached for these unfortunate users.
After debugging further, the actual issue turned out to be a large (growing) cookie being set when the user came to the website from our emails. If the user wasn't logged in, the cookie would accumulate more data over time and grow in size. It was capped at the maximum size of a cookie, but that didn't account for Nginx's header limits, so it triggered an 'upstream sent too big header' error; hence the 502. Removing the cookie and increasing the header limits fixed the issue. We will lower the limits again over time, once the cookie has been deleted or expired for our users.
The Nginx error log showed:
upstream sent too big header while reading response header from upstream
and the buffer size was increased from:
fastcgi_buffers 8 16k;
to:
fastcgi_buffers 16 16k;
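One way to catch this class of problem earlier is to watch the size of the Cookie header before it ever hits Nginx's buffers. A rough WSGI middleware sketch (the 4 KB threshold is an assumption, roughly in line with common default buffer sizes):

class CookieSizeLogger:
    """WSGI middleware that warns when the request's Cookie header gets large."""

    def __init__(self, app, limit=4096):  # 4 KB threshold is an assumption
        self.app = app
        self.limit = limit

    def __call__(self, environ, start_response):
        cookie = environ.get("HTTP_COOKIE", "")
        if len(cookie) > self.limit:
            # Surface the problem before it becomes an opaque upstream 502.
            print(f"warning: Cookie header is {len(cookie)} bytes")
        return self.app(environ, start_response)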
If you get a 502 error, do an invalidation... this clears the cache for all your users.
CloudFront -> Distributions -> Your Distribution -> Invalidations tab -> Create Invalidation -> Put "/*" (without quotation marks) in the textbox -> Invalidate
And that's all.
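The same invalidation can be scripted. A sketch using boto3 (the distribution ID is a placeholder):

import time
import boto3

client = boto3.client("cloudfront")

# "/*" invalidates every cached object, exactly like the console steps above.
client.create_invalidation(
    DistributionId="E1EXAMPLE",  # placeholder -- substitute your own
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/*"]},
        "CallerReference": str(time.time()),  # must be unique per request
    },
)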
I suggest you research why you got the Bad Gateway errors (maybe load spikes on a specific day of the week) and schedule more containers for that day and hour. :)

"Not Found: /406.shtml" from django

I'm running Django with Apache FastCGI on a shared host. I've set it up to report 404 errors, and I keep seeing "Not Found: /406.shtml" via emails (I'm guessing the "s" is because it's HTTPS only). However, I have error documents already set up in .htaccess:
ErrorDocument 406 /error/406.html
I was getting a bunch of similar 404 errors from Django before setting up an ErrorDocument for each one, but it's still happening for 406. From grepping for 406 through the Apache error log, I see an occasional 406 (not 404) error for 406.shtml, such as the following, but not nearly as often as Django emails me:
[Fri ...] [error] [client ...]
ModSecurity: Access denied with code 406 (phase 1).
Pattern match "Mozilla ... AhrefsBot ...)" at REQUEST_HEADERS:User-Agent.
[file "/usr/local/apache/conf/mod_sec/mod_sec.hg.conf"] [line "126"]
[id "900165"]
[msg "AhrefsBot BOT Request"]
[hostname "www.myhostname.com"]
[uri "/406.shtml"]
[unique_id "..."]
I'm not even sure if this is Apache redirecting internally to 406.shtml and the request being forwarded on to Django, or if some bot is trying to fetch 406.shtml directly. The former seems to indicate a problem with ErrorDocument. The latter isn't really my problem, but then I should either be seeing a 404 for 406.shtml in the Apache logs or nothing at all, because Django will handle the 404. How can I track it down further?
I haven't been able to reproduce the issue just by visiting my site, but I'd like to know what's going on.
You have ModSecurity installed in your Apache. It is a WAF (web application firewall) that attempts to protect your website from attacks, bots, and the like. These, like email spam, are unfortunately part and parcel of running a website nowadays.
ModSecurity is an add-on module for Apache which allows you to define rules; it then runs each request against those rules and decides whether or not to block the request.
In this case a rule (900165, defined in the file "/usr/local/apache/conf/mod_sec/mod_sec.hg.conf") has decided to block this request with a 406 status based on the user agent (AhrefsBot).
Ahrefs is a service that crawls the web trying to build up a database of links. It's used by SEO people to see who links to your website (backlinks are very important to SEO), since Google (who you would think would be a better provider of this type of information) only gives samples of links rather than full listings.
Is AhrefsBot a danger, and should it be blocked? That's a matter of opinion. Assuming it really is AhrefsBot (some nefarious bots might pretend to be it so as to look legitimate, so check the IP address to see the hostname it came from), it's probably wasting your resources without doing you much good. On the other hand, this is the price of an open web: your website is available to the public, and so also to those who write bots and tools (good or bad).
Why does it return a 406? Well, that's how your ModSecurity and/or your rule is defined. Check your Apache config. A 406 is a little unusual, as you would normally expect a 403 (access denied) or 500 (internal server error).
What's the 406.shtml file? That I don't get. A .shtml file is an HTML file which also allows server-side includes to embed other files and code into the HTML. They are not used much any more, to be honest, as the likes of PHP and other languages are more common. It could be an attack: i.e. someone is attempting to upload a 406.shtml file and then cause it to be called so it "executes" and includes the contents of the file, potentially giving access to files Apache can see which are not available on the webserver. Or the user has requested that file (for some reason), or Apache is configured to show it for 406 errors, or the ModSecurity rule is redirecting to that file.
Hopefully that gives a good bit of background. The best thing I can suggest is to go through your Apache config file, and any other config files it loads (including the mod_sec.hg.conf file, which it must load), to fully understand your setup, and then decide if you need to do anything here.
You could do one of several things:
Leave it as is. ModSecurity is doing what it was told to do and blocking this with a 406.
Turn off this rule and allow AhrefsBot through so you don't get alerted by this.
Alter the ModSecurity config/rule to return an error other than 406 so you can ignore it.
Turn off ModSecurity completely. I think it is a good tool and worthwhile, but it does take some time and effort to get the most out of it.
Set up the 406 error page properly. To do that, you need to understand why it's attempting to return 406.shtml at the moment.
Also, I'm not sure which of these options are available to you, as you are on a shared host and might not have full access. If not, speak to your hosting provider for advice.
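On the Django side, if you leave ModSecurity as is and just want to stop the 404-report emails for these bot probes, Django's broken-link reporting can be told to ignore the path. A sketch for settings.py (this assumes the emails come from Django's broken-link reporting; IGNORABLE_404_URLS is available in Django 1.4+):

import re

# settings.py -- stop Django's broken-link emails for the ModSecurity page.
IGNORABLE_404_URLS = [
    re.compile(r"^/406\.shtml$"),
]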

API Console Issue

I've been using WSO2 API Manager 1.9.1 for the past month on a static IP, and we liked it enough to put it on Azure behind a fully qualified domain name. As we are still only using it for internal purposes, we shut the VM down during off hours to save money. Our Azure setup does not guarantee the same IP address each time the VM restarts; the FQDN allows us to always reach https://api.mydomain.com regardless of what happens with the VM IP.
I updated the appropriate config files to the FQDN and everything seems to be working well. However, the one issue I have and cannot seem to resolve is calling APIs from the API Console. No matter what I do, I get a response like the one below.
Response Body
no content
Response Code
0
Response Headers
{
"error": "no response from server"
}
Mysteriously, I can successfully make the same calls from the command line or SoapUI, so it's something unique to the API Console. I can't find anything useful in the logs or by googling. I do see a recurring error, but it's not very clear or even complete (it seems to be cut off):
[2015-11-17 21:33:21,768] ERROR - AsyncDataPublisher Reconnection failed for
Happy to provide further inputs / info. Any suggestions on root cause or where to look is appreciated. Thanks in advance for your help!
Edit #1 - adding screenshots from Chrome
The API Console may not be giving you a response due to one of the following issues:
If you are using HTTPS, you have to open the gateway URL in the browser and accept the certificate before invoking the API from the API Console (this is the case when there is no signed certificate on the gateway).
A CORS issue, which may be because your domain is not in the Access-Control-Allow-Origin header of the response to the OPTIONS call (see the sketch after this list).
If you created an API with an HTTPS backend, you have to import the endpoint's SSL certificate into client-truststore.jks.
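For the CORS case in particular, you can check what the gateway's OPTIONS response actually allows. A minimal Python sketch, assuming the requests library; the URL, path, and ports are placeholders based on common API Manager defaults:

import requests

# Placeholder gateway endpoint -- substitute your API's URL.
url = "https://api.mydomain.com:8243/myapi/1.0/resource"

# Simulate the browser's preflight: Origin should match where the console runs.
response = requests.options(
    url,
    headers={
        "Origin": "https://api.mydomain.com:9443",
        "Access-Control-Request-Method": "GET",
    },
    verify=False,  # only because the gateway cert is self-signed here
)
print(response.status_code)
print(response.headers.get("Access-Control-Allow-Origin"))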

How to identify GET/POST requests made by a human ignoring any requests following

I'm writing an application that listens to HTTP traffic and tries to recognize which requests were initiated by a human.
For example:
The user types cnn.com in their address bar, which starts a request. Then I want to find CNN's server response while discarding any other requests (such as XHR, etc.).
How can you tell from the header information which request is which?
After doing some research, I've found that relevant responses come with:
Content-Type: text/html
HTML that has a meaningful title
Status: 200 OK
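As a sketch, those heuristics might look like this in code (the attribute names are assumptions about whatever object your traffic listener hands you):

def looks_human_initiated(response):
    """Rough heuristic: top-level page loads, not XHR or static resources.

    `response` is assumed to expose status, headers, and body text --
    adapt the attribute names to whatever your listener provides.
    """
    content_type = response.headers.get("Content-Type", "")
    return (
        response.status == 200
        and content_type.startswith("text/html")
        and "<title>" in response.body.lower()
    )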
There is no way to tell from the bits on the wire. The HTTP protocol has a defined format, which all (non-broken) user agents adhere to.
You are probably thinking that the translation of a user's typing of just 'cnn.com' into 'http://www.cnn.com/' on the wire can be detected from the protocol payload. The answer is no, it can't.
To detect the user agent allowing the user such shorthand, you would have to snoop the user agent application (e.g. a browser) itself.
Actually, detecting non-human agency is the interesting problem (with spam detection as one obvious motivation). This is because HTTP belongs to the family of NVT protocols, where the basic idea, believe it or not, is that a human should be able to run the protocol "by hand" in a network terminal/console program (such as a telnet client.) In other words, the protocol is basically designed as if a human were using it.
I don't think header information can suffice to identify real users from bots, since bots are made to mimic real users and headers are very easy to imitate.
One thing you can do is track the path (sequence of clicks) followed by a user, which is most likely to be different from one made by a bot, and do some analysis on the posted information (e.g. Bayesian filters).
A very easy-to-implement check is based on the source IP. There are databases of blacklisted IP addresses; see Project Honey Pot. And if you are writing your software in Java, here is an example of how to check an IP address: How to query HTTP:BL for spamming IP addresses.
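The same HTTP:BL check is easy in other languages as well. A Python sketch of the DNS lookup (the access key is a placeholder; you get a real one from Project Honey Pot):

import socket

def httpbl_lookup(ip, access_key):
    """Query Project Honey Pot's HTTP:BL for an IPv4 address.

    access_key is a placeholder -- get a real one from projecthoneypot.org.
    Returns None if the IP is not listed.
    """
    reversed_ip = ".".join(reversed(ip.split(".")))
    query = f"{access_key}.{reversed_ip}.dnsbl.httpbl.org"
    try:
        answer = socket.gethostbyname(query)
    except socket.gaierror:
        return None  # NXDOMAIN: not listed
    # Response is 127.<days since last activity>.<threat score>.<visitor type>
    _, days, threat, visitor_type = answer.split(".")
    return {"days": int(days), "threat": int(threat), "type": int(visitor_type)}

# Example call (hypothetical key):
# print(httpbl_lookup("203.0.113.7", "abcdefghijkl"))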
What I do on my blog is this (using WordPress plugins):
Check if an IP address is in the HTTP:BL; if it is, the user is shown an HTML page where they can take action to whitelist their IP address. This is done in WordPress by the Bad Behavior plugin.
When the user submits some content, a Bayesian filter verifies the content of the submission, and if the comment is identified as spam, a captcha is displayed before completing the submission. This is done with Akismet and Conditional CAPTCHA, and the comment is also enqueued for manual approval.
After being approved once, the same user is considered safe and can post without restrictions/checks.
Applying the above rules, I have no more spam on my blog, and I think a similar logic can be used for any website.
The advantage of this approach is that most users don't even notice any security mechanism, since no captcha is displayed and nothing unusual happens 99% of the time. But there are still quite restrictive, and effective, checks going on under the hood.
I can't offer any code to help, but I'd say look at the Referer HTTP header. The initial GET request shouldn't have a Referer, but when you start loading the resources on the page (such as JavaScript, CSS, and so on) the Referer will be set to the URL that requested those resources.
So when I type in "stackoverflow.com" in my browser and hit enter, the browser will send a GET request with no Referer, like this:
GET / HTTP/1.1
Host: stackoverflow.com
# ... other Headers
When the browser loads the supporting static resources on the page, though, each request will have a Referer header, like this:
GET /style.css HTTP/1.1
Host: stackoverflow.com
Referer: http://www.stackoverflow.com
# ... other Headers
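As a rough sketch of that idea in code (the dict-like header access is an assumption about your capture library, and note that links clicked on other pages also carry a Referer):

def is_initial_navigation(request_headers):
    """Guess whether a GET was typed/clicked rather than a subresource fetch.

    request_headers is assumed to be a dict-like of HTTP request headers.
    Heuristic only: navigations from links on other pages also set Referer.
    """
    has_referer = "Referer" in request_headers
    accepts_html = "text/html" in request_headers.get("Accept", "")
    return not has_referer and accepts_html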

OAuthException (#368) The action attempted has been deemed abusive or is otherwise disallowed

I'm trying to post a feed on my wall, or on the wall of some of my friends, using the Graph API. I gave all the permissions that this application needs and allowed them when making the request from my page. I have a valid access token, but this exception still occurs and no feed is posted. My POST request looks fine and the permissions are granted. What do I need to do to show Facebook that I'm not an abusive person? The last thing I did was to dig into my application's Auth Dialog to set all the permissions I need there, and to write why I need those permissions.
I would be very grateful if you could tell me what is going on and point me in the right direction to fix this problem.
Had the same problem. I figured out that Facebook was refusing my shortlinks, which makes me a bit mad... but I get the point, because it's possible that shortlinks can be used to promote malicious content... so if you have shortlinks as part of your test, replace them with the full URL.
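If you want to expand shortlinks automatically before posting, a small Python sketch (assuming the requests library; the example link is hypothetical):

import requests

def expand_shortlink(url):
    """Follow redirects so Facebook sees the final URL, not the shortener.

    A HEAD request keeps it lightweight; response.url is the final URL.
    """
    response = requests.head(url, allow_redirects=True, timeout=10)
    return response.url

# e.g. expand_shortlink("http://bit.ly/example") -> the full destination URL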
I believe this message is encountered for one of two reasons:
Your post contains malicious links.
You are trying to make a POST request over a non-HTTPS connection.
The second one is not confirmed, but I have seen that behavior: while the same code in my Heroku-hosted app worked fine, it gave this #368 error on my 000webhost-hosted .tk domain, which wasn't secured by SSL.
Just in case anyone is still struggling with this: the problem occurs when you put URLs or "action links" that are not in your own app domain. If you really need to post to an external page, you'll have to post to your app first, then redirect from there using a script or something. Hope that helps.
Also, it's better in my opinion to use HTTPS links, as I've sometimes seen behavior where HTTP links are rejected, but that's intermittent.
I started noticing that recently as well when running my unit tests. One of the tests I run submits a link that I know Facebook has blocked, to verify that I handle the error correctly. I used to get this error:
Warning: This Message Contains Blocked Content: Some content in this message has been reported as abusive by Facebook...
But starting on July 4th, I started receiving this error instead:
(#368) The action attempted has been deemed abusive or is otherwise disallowed
Both errors indicate that Facebook doesn't like what you're publishing.