broken links and images in search engine cached pages only - web-services

I have a weird problem: my site shows up perfectly in all browsers, but when I check the cached pages of the site in Google, Bing or even Yahoo, all of them show broken links and images because some links are overridden, such as
Let's say direct url is http://www.expatads.com/47-Thailand/ and it shows perfect.
Here's google cache of the same url.
http://webcache.googleusercontent.com/search?q=cache:qFAzM4VMsJsJ:www.expatads.com/47-Thailand/+&cd=1&hl=en&ct=clnk
What I want to know is the best way to reproduce these errors visibly, instead of waiting for search engines to cache and show the pages. Web browsers do not show any error, but there is actually a path error that causes this.
I'd appreciate it if anyone could give me a way to reproduce these errors using a browser, some software, or anything else.

I believe you are mistaken about this being an error. If you take a look at the screenshot of Google's search result for your page, the images are shown.
It appears that Google's cache does not rewrite relative URLs, which makes some sense because it wouldn't always work and some sites might not allow hotlinking, etc. So, all the resources linked to on your page using relative links won't show up in Google's cached version.
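To reproduce this without waiting for a crawler, one option is to list the relative URLs on a page and resolve them against a different base, the way a browser would if the cached copy is served from another host without rewriting. The following is a minimal sketch using only the Python standard library; the HTML fragment and the shortened cache URL are made-up placeholders, not taken from the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href/src attribute values from a page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)


def relative_urls(html):
    """Return the href/src values that carry no scheme or host of their own."""
    collector = LinkCollector()
    collector.feed(html)
    return [
        u for u in collector.urls
        if not urlparse(u).scheme and not urlparse(u).netloc
    ]


def resolve_all(html, base):
    """Resolve every relative URL in the page against the given base URL."""
    return {u: urljoin(base, u) for u in relative_urls(html)}


# Hypothetical page fragment and placeholder cache URL (assumptions).
SAMPLE = (
    '<img src="images/logo.png">'
    '<a href="/47-Thailand/">listing</a>'
    '<a href="http://example.com/a">external</a>'
)
ORIGINAL = "http://www.expatads.com/47-Thailand/"
CACHE = "http://webcache.googleusercontent.com/search?q=cache:example"

if __name__ == "__main__":
    for rel, resolved in resolve_all(SAMPLE, CACHE).items():
        print(rel, "->", resolved)
```

Resolved against `ORIGINAL`, the image path stays on the site's own host; resolved against `CACHE`, it points at the cache host, where the file does not exist. Absolute URLs are left out of the list because they resolve the same way from anywhere.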
If you would rather see what your site looks like in other browsers you may want to try Browsershots. This will give you screenshots from a huge number of browsers in order to test compatibility.

Related

does favicon 404 affect performance

We have been noticing a lot of 404 errors being thrown in our Coldfusion CFIDE server monitor, and it took us a while to find out that things like missing favicons are causing these errors.
We use a custom 404 template page, which contains some logic to it (more than just basic HTML). So, whenever a 404 for a favicon occurs, these pages are generated and returned to the user.
Since these favicons are requested by default by many browsers (if there is no <link> tag in the header that specifies one, the browser looks in the site root or something like that), it's throwing massive amounts of 404s on our server, which costs processing time as well as bandwidth for delivery. Our server runs fine most of the time, but when it does get some heavy usage, we can sometimes have major performance problems.
I know that this is a performance issue, but would it be enough of one to warrant trying to fix this? If so, is there a way with the Coldfusion server (or our underlying Windows Server 2003 running IIS) to filter what files actually throw a CF 404 error? Ideally, for files like these favicons, CSS, and Javascript (since a visitor never really "sees" the output of these), we would simply want to return an HTTP 404 response with no content, as it is unneeded...
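The filtering idea described above (an empty 404 for missing favicons, CSS and JavaScript, the full custom page for everything else) can be sketched in a language-neutral way. The example below is an illustration written as Python WSGI middleware, not ColdFusion or IIS configuration; the extension set and the `file_exists` callable are assumptions made for the sketch.

```python
import os

# Extensions a visitor never "sees" directly; the exact set is an assumption.
BARE_404_EXTENSIONS = {".ico", ".css", ".js"}


def wants_bare_404(path):
    """True when a missing file at this path should get an empty 404."""
    return os.path.splitext(path)[1].lower() in BARE_404_EXTENSIONS


class Bare404Middleware:
    """WSGI wrapper: empty 404 for missing assets, normal handling otherwise."""

    def __init__(self, app, file_exists):
        self.app = app                  # the normal application, custom 404 included
        self.file_exists = file_exists  # callable(path) -> bool, supplied by the caller

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if wants_bare_404(path) and not self.file_exists(path):
            # Short-circuit: no template logic runs and no body is sent.
            start_response("404 Not Found", [("Content-Length", "0")])
            return [b""]
        return self.app(environ, start_response)
```

The point of the design is that the check happens before the application's own 404 logic runs, so a missing `/favicon.ico` costs one extension lookup and one existence check rather than a full page render.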
Yes, missing favicons do affect performance, as does any missing content when you have a custom 404 page (esp. if the 404 page is handled by a content management system). This is because every time the file (image, video, page, etc) is requested by the browser, it causes load on your server. Let's say the favicon is missing on every page of your website. If your 404 page is part of your content management system, this doubles the load on your server (basically requesting 2 pages every time instead of 1). If your 404 page is different, but still has logic, this increases the load, but only as much as the file requires (less logic = less load, and vice-versa).
I would suggest fixing this issue, but not necessarily by killing custom 404 pages for certain extensions. In my opinion, it would be better (for you, and your visitors) if you simply added a favicon file to all of your sites. Not only would this solve your 404 issue, but it would help your visitors to recognize your website quicker when bookmarking pages, or adding the site as an App Tab (which would apply, even if your site isn't available to the public, as backend sites are a great use case for App Tabs). Aside from your server's performance, having to download the 404 page costs network performance as well, both on your server's end and on the end-user's. The 404 page may also not be cached, and even if it is, probably not for as long as an existing favicon would be, which causes the request to happen far more often than it would if you simply created a favicon.
If you don't want to take the time (or don't have a need) to do advanced branding (such as creating a custom logo for the favicon), a basic image with a letter on it (e.g. "K") will do. Favicons are extremely useful for the public, any staff, and even yourself, so I would say it's definitely worth your time to at least do a basic favicon.
Only you can really judge if it's enough of a problem to warrant fixing... how many times per second are the 404s generated, what additional load do they put on the server, etc.
Regarding a fix... why don't you just deploy a favicon? It would probably be quicker than worrying about the problem.
As someone else touched on, I think the easiest and most sensible way of dealing with this problem is not to try to deal with it at the 404-handling side of things, but rather just make sure the 404 doesn't occur in the first place. If this is all happening because of missing favicons... fix it by not having missing favicons! If you don't have the resources or desire to brand one appropriately, just use a generic one. It's better to treat the actual problem than a symptom of the problem.

Tracking User Actions on Landing Pages in Django

I'm developing a web application. It's months away from completion but I would like to build a landing page to show to potential customers to explain things and gauge their interest--basically collecting their email address and if they feel like it additional information like names + addresses.
Because I'm already using Django to build my site I thought I might use another Django App to serve as this landing page. The features I need are
to display a fairly static page and potentially a series of pages,
collect emails (and additional customer data)
track their actions--e.g., they got through the first two pages but didn't fill out the final page.
Is there any pre-existing Django app that provides any of these features?
If there is not a Django app, then does anyone know of another, faster/better way than building my own app? Perhaps a pre-existing web service that you can skin and make look like your own? Maybe there's the perfect system but it's PHP?--I'm open for whatever.
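For reference, the drop-off tracking in the third requirement comes down to recording the furthest page each visitor reached. The sketch below shows that bookkeeping in plain Python; the step names and the `FunnelTracker` class are hypothetical, and a real Django version would presumably store the same data in a model rather than in memory.

```python
# Hypothetical page names for a three-step landing flow.
STEPS = ["intro", "details", "signup"]


class FunnelTracker:
    """Record the furthest landing-page step each visitor reached."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.furthest = {}  # visitor id -> index of the furthest step seen

    def record(self, visitor_id, step):
        """Note that a visitor viewed (or submitted) the given step."""
        index = self.steps.index(step)
        self.furthest[visitor_id] = max(index, self.furthest.get(visitor_id, -1))

    def reached_counts(self):
        """Number of visitors who reached each step or a later one."""
        counts = {step: 0 for step in self.steps}
        for index in self.furthest.values():
            for step in self.steps[: index + 1]:
                counts[step] += 1
        return counts
```

Comparing the counts for adjacent steps shows where visitors give up, for example how many got through the first two pages but never reached the final one.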
Option 1: Google Sites
You can set it up very very quickly. Though your monitoring wouldn't be as detailed as you're asking for.. Still, easy and fasssst!
Option 2: bbclone
Something else that may be helpful is to set up some PHP based site (wordpress or something) and use bbclone for tracking stuff on it. I've found bbclone to be pretty thorough in reporting what every visitor does - though it's been a while since I used it.
Option 3: Django Flatpages
The flatpages Django contrib app is pretty handy for making static flat pages. I'd probably just embed a Google Docs Form to collect email addresses (as that's super fast and lets you get back to real work). But this suggestion would still leave you needing to figure out how to get the level of detail you want on the stats end.
Perhaps consider Google Analytics anyway?
Regardless, I suggest you use Google Analytics with everything. That'll work with anything you do really, and for all I know, perhaps you can find a way to get the stats you're really looking for out of it.

Is someone trying to hack my Django website

I have a website that I built using Django. Using the settings.py file, I send myself error messages that are generated from the site, partly so that I can see if I made any errors.
From time to time I get rather strange errors, and they seem to mostly be around about the same area of the site (where I wrote a little tutorial trying to explain how I set up a Django Blog Engine).
The errors I'm getting all appear like something I could have done in a typo.
For example, these two errors are very close together. I never had an 'x' or 'post' as a variable on those pages.
'/blog_engine/page/step-10-sub-templates/{{+x.get_absolute_url+}}/'
'/blog_engine/page/step-10-sub-templates/{{+post.get_absolute_url+}}/'
The user agent is:
'HTTP_USER_AGENT': 'Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)',
Which I take it is a scraper bot, but I can't figure out what they would be able to get with this kind of attack.
At the risk of sounding stupid, what should I do? Is it a hack attempt or are they simply trying to copy my site?
Edit: I'll follow the advice already given, but I'm really curious as to why someone would run a script like this. Are they just trying to copy? It isn't hitting admin pages or even any of the forms. It would seem like harmless (aside from potential plagiarism) attempts to dig in and find content?
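One possible reading of the two requested paths, offered as an assumption rather than a diagnosis: if the `+` signs are treated as encoded spaces, the last path segment decodes to an ordinary Django template expression, which would fit a crawler following template code shown in the tutorial text as if it were a real relative link. A small sketch of that decoding:

```python
from urllib.parse import unquote_plus

# One of the paths from the error reports above.
REQUESTED = "/blog_engine/page/step-10-sub-templates/{{+x.get_absolute_url+}}/"


def decode_last_segment(path):
    """Return the final path segment with '+' and %XX escapes decoded."""
    segment = path.rstrip("/").rsplit("/", 1)[-1]
    return unquote_plus(segment)


if __name__ == "__main__":
    print(decode_last_segment(REQUESTED))
```

Under that reading the requests are a side effect of crawling, and the segment is being resolved relative to the tutorial page's own URL, which is why the errors cluster around that part of the site.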
From your USER_AGENT info it looks like this is a web spider from puritysearch.net.
I suggest you put a CAPTCHA on your website. Program it to trigger when something tries to access 10 pages in 10 seconds (most humans would not do this), or work out a more suitable criterion for triggering your CAPTCHA.
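The "10 pages in 10 seconds" trigger amounts to a sliding-window counter per client. Here is a minimal in-memory sketch; the `BurstDetector` name and the choice of the client key (for instance an IP address) are assumptions, and a real deployment would need shared storage across server processes.

```python
from collections import deque


class BurstDetector:
    """Flag a client that makes `limit` or more requests within `window` seconds."""

    def __init__(self, limit=10, window=10.0):
        self.limit = limit
        self.window = window
        self.hits = {}  # client key -> deque of request timestamps

    def should_challenge(self, client, now):
        """Record one request at time `now`; return True if a CAPTCHA is due."""
        times = self.hits.setdefault(client, deque())
        times.append(now)
        # Drop timestamps that have fallen out of the window.
        while times and now - times[0] >= self.window:
            times.popleft()
        return len(times) >= self.limit
```

Passing the timestamp in explicitly (rather than calling the clock inside the method) keeps the logic easy to test; in a request handler it would typically be `time.monotonic()`.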
Also, maintain a robots.txt file, which most crawlers honor, and put your rules in it. You can tell crawlers to keep off certain busy sections of your site, etc.
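Rules like these can be checked before deploying them with Python's standard `urllib.robotparser`. The robots.txt content below is a hypothetical example (one crawler shut out entirely, everyone else kept off one section), not a recommendation for this particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block one crawler entirely, keep others off one section.
ROBOTS_TXT = """\
User-agent: Purebot
Disallow: /

User-agent: *
Disallow: /blog_engine/
"""


def build_parser(text):
    """Parse robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for agent, path in [("Purebot", "/about/"), ("Googlebot", "/about/")]:
        print(agent, path, rules.can_fetch(agent, path))
```

Note that robots.txt is only advisory: a crawler that ignores it is not stopped by it, which is where a ban on the IP or user agent comes in.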
If the problem persists, you might want to contact that particular site's system admin & try to figure out what's going on.
This way you will not be completely blocking crawlers (which are needed for your website to become popular) and at the same time you are making sure that your users get fast experience on your site.
Project HoneyPot has this bot listed as a malicious one http://www.projecthoneypot.org/ip_174.133.177.66 (check the comments there) and what you should probably do is ban that IP and/or Agent.

Is it possible to be attacked with XSS on a static page (i.e. without PHP)?

A client I'm working for has mysteriously ended up with some malicious scripting going on on their site. I'm a little baffled however because the site is static and not dynamically generated - no PHP, Rails, etc. At the bottom of the page though, somebody opened a new tag and a script. When I opened the file on the webserver and stripped the malicious stuff and re-uploaded, it was still there. How is this possible? And more importantly, how can I combat this?
EDIT:
To make it weirder, I just noticed the script only shows up in the source if the page is accessed directly as 'domain.com/index.html' but not as just 'domain.com'.
EDIT2:
At any rate, I found some php file (x76x09.php) sitting on the web server that must have been updating the html file despite my attempts to strip it of the script. I'm currently in the clear but I do have to do some work to make sure rogue files don't just appear again and cause problems. If anyone has any suggestions on this feel free to leave a comment, otherwise thanks for the help everyone! It was very much appreciated!
No it's not possible unless someone has access to your files. So in your case someone has access to your files.
Edit: It's best if you ask in serverfault.com regarding what to do in case the server is compromised, but:
change your shell passwords
have a look at /var/log/messages for login attempts
finger root
have a look at last modification time of those files
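The last check in the list above, looking at modification times, is easy to script. The sketch below walks a directory tree and reports files changed within a given number of seconds; the function name and the example web root are placeholders, and modification times can be forged by an attacker, so treat the result as a hint rather than proof.

```python
import os
import time


def recently_modified(root, max_age_seconds, now=None):
    """Return (path, mtime) pairs under root changed within max_age_seconds, newest first."""
    now = time.time() if now is None else now
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if now - mtime <= max_age_seconds:
                found.append((path, mtime))
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # "/var/www" is a placeholder web root; list files changed in the last 7 days.
    for path, mtime in recently_modified("/var/www", 7 * 24 * 3600):
        print(time.ctime(mtime), path)
```

On a site that is supposed to be static, any unexpected entry here (such as a stray .php file) is worth inspecting.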
There is also a high probability that the files were altered via HTTP by exploiting a vulnerability in a software component you use together with the static files.
To the point about the site not having pages executing on the server, XSS is absolutely still possible using a DOM based attack. Usually this will relate to JavaScript execution outputting content to the page. Just last week WhiteHat Security had an XSS vulnerability identified on a purely “static” page.
It may well be that the attack vector relates to file level access but I suggest it’s also worthwhile taking a look at what’s going on JS wise.
You should probably talk to your hosting company about this. Also, check that your file permissions aren't more lenient than they should be for your particular environment.
That's happened to me before - this happens if they get your ftp details. So, whoever did it, obviously got ahold of your ftp details somehow.
Best thing to do is change your password and contact your webhosting company to figure out a better solution.
Unfortunately, FTP isn't the most secure...

Search engines and migrating a static site to a web app

We're replacing a static web-site with a Django app. All the URIs will change. The current web-site has a substantial presence in the search engine rankings and we don't want to mess that up too much. Is it simply a case of setting up 301 redirects to the new URIs, or is there something more subtle we need to do to ensure the search engines understand what's happened?
Normally when you change your site you will get a hit on your search result page rankings which will last for about 2-4 weeks.
Apex Internet has a good article on setting up the 301 redirects on Apache, IIS, and other variants. Take a look here.
Steven Hargrove also has a good article on it here with a follow up here.
In addition, Webmaster World has a thread on the impact of the 301's updating in Google, Yahoo and others as well as tips and a little more advice. Take a look at that here.
Lastly, here is an article from Google Groups on Dynamic vs. Static URLs that touches on changing structure and how it maps.
I was hoping I would have more information for you and a way to use the robots.txt file to help keep the rankings up when you start the migration. I'll keep looking and see what I can find for you. Cheers and good luck!
301's should in fact cover it.
Search engines are generally pretty good at this :)
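As an illustration of the one-to-one mapping that 301 redirects need, here is a minimal WSGI sketch in plain Python. The paths in `REDIRECTS` are invented placeholders; in a Django project the same table would more likely live in the URL configuration or in the web server's own redirect rules.

```python
# Hypothetical old-to-new URI mapping; real entries would come from the old site's page list.
REDIRECTS = {
    "/about.html": "/about/",
    "/products/widgets.html": "/products/widgets/",
}


def redirect_app(environ, start_response):
    """Minimal WSGI app: 301 for known old URIs, 404 for everything else."""
    target = REDIRECTS.get(environ.get("PATH_INFO", ""))
    if target is None:
        start_response("404 Not Found", [("Content-Length", "0")])
        return [b""]
    # A permanent redirect signals that the old URI has been replaced by the new one.
    start_response(
        "301 Moved Permanently",
        [("Location", target), ("Content-Length", "0")],
    )
    return [b""]
```

Mapping each old page to its closest new equivalent, rather than sending everything to the home page, keeps the redirects meaningful for both visitors and crawlers.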