I was wondering how Google captures all of the websites featured in Google Instant Previews. I'm sure they are not using a thumbnail service (like www.thumbalizr.com, websnapr.com, snapcasa.com, or thumbshots.com) but rather their own software. BUT: given that Google captures A LOT of websites, they must have a very sophisticated system. PLUS: this generates HUGE amounts of data (JPEGs?).
Does somebody have more insight into how Google does this?
Yes, it's something like that. Their webmaster pages hint that they render the page with the same engine Chrome uses, and the preview is based on the result.
It's hard to say, but here's some info from a Google project manager discussing it:
http://googleblog.blogspot.com/2010/11/beyond-instant-results-instant-previews.html
It says in part:
"we match your query with an index of the entire web, identify the
relevant parts of each webpage, stitch them together and serve the
resulting preview completely customized to your search—usually in
under one-tenth of a second"
That, plus looking at the source of a preview page, suggests that they're using their own index (the same webcache.googleusercontent.com that serves the Cached pages) to deliver the screenshots as Base64-encoded JPEG strings.
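For anyone unfamiliar with what a "Base64 image string" looks like in practice, here is a minimal illustration (not Google's pipeline, just the encoding idea, and "preview.jpg" is a placeholder filename): a JPEG screenshot is Base64-encoded and embedded as a data: URI, so the preview image can be served inline rather than as a separate file.

import base64

# Illustration only: turn a JPEG screenshot into a "data:" URI string that can
# be embedded directly in HTML/JSON, which is the general idea behind serving
# previews as Base64 image strings.
with open("preview.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

data_uri = "data:image/jpeg;base64," + encoded
print(data_uri[:50] + "...")  # e.g. data:image/jpeg;base64,/9j/4AAQSkZJRg...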
I am concerned about page ranking on Google in the following situation:
I am looking to convert my existing site, which has 150k+ unique page results, to an Ember app. Currently the URLs are something like domain.com/model/id; with Ember and hash-change routing they will become /#/model/id. I really want the History API, but the lack of IE support doesn't leave that as an option. My sitemap for Google has lots and lots of great results using the old model/id URLs. On the Rails side I will test the browser for compatibility before rendering either the JS-rich app or the plain HTML/CSS. Does anyone have good SEO suggestions for making my current scheme successful?
Linked below is my schema, along with the options I'm looking at:
http://static.allplaces.net/images/EmberTF.pdf
The History API is awesome, but it looks like support is only at around 60% of browsers.
http://caniuse.com/history
Thanks for the suggestions; the Google guide is similar to what I'm going to try. I will roll it out to one client this month and see what Webmaster Tools and Analytics show.
Here is everything you need to make your hash links SEO-friendly: https://developers.google.com/webmasters/ajax-crawling/
Basically, you write your whole app with hash links, but you have to add "!" to them, so you have #!/model/id. Then you must have all pages generated somewhere and, when Google asks for them, return the "plain HTML" snapshot as described here: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
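To make the scheme concrete: when the crawler sees a #! URL, it requests the same URL with the fragment moved into an _escaped_fragment_ query parameter, and your server answers that request with the pre-rendered HTML. Below is a minimal Python sketch of that mapping; it is an illustration only (the asker's server side is Rails), and the helper names are made up.

from urllib.parse import quote, unquote

def to_crawler_url(pretty_url):
    # '#!/model/id' becomes '?_escaped_fragment_=/model/id' in the crawler's request
    base, _, fragment = pretty_url.partition("#!")
    return base + "?_escaped_fragment_=" + quote(fragment)

def route_from_crawler(query_value):
    # Recover the app route so the server knows which HTML snapshot to return
    return unquote(query_value)

print(to_crawler_url("http://domain.com/#!/model/42"))
# -> http://domain.com/?_escaped_fragment_=/model/42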
Use Google Webmaster Tools to check whether your site is crawlable.
I'm not sure if you're aware that you can configure Ember to use the browser history for the location API and keep your URLs the way they are referenced now. All you need to do is configure the Router's location property:
App.Router.reopen({
  location: 'history'
});
See more details about specifying the location API here.
I am building a small side project: a simple news site. I want to use the Django admin for uploading articles and give access to non-coders so that they can publish articles à la WordPress. I have added some functionality to the admin, first trying out the TinyMCE and Dojo rich text editors. However, these do not come with the ability to insert an image into an article from a file (only from URLs).
I really only want some light text formatting in the text area, plus the ability to upload and insert images from the user's hard drive directly into the article. Is there a simple way to achieve this?
If you are already using django-tinymce, you can integrate django-filebrowser with it. See django-tinymce's documentation.
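Roughly, the wiring looks like the sketch below. The setting and field names come from django-tinymce and django-filebrowser, but treat this as a sketch and double-check them against the versions you actually install.

# settings.py -- a minimal sketch; verify the setting names for your versions
INSTALLED_APPS = (
    # ...
    'filebrowser',
    'tinymce',
    'django.contrib.admin',
)

TINYMCE_DEFAULT_CONFIG = {
    'theme': 'advanced',
    'plugins': 'paste,searchreplace,table',
}
TINYMCE_FILEBROWSER = True  # lets TinyMCE's image dialog open django-filebrowser

# models.py -- HTMLField renders the TinyMCE editor in the admin
from django.db import models
from tinymce.models import HTMLField

class Article(models.Model):
    title = models.CharField(max_length=200)
    body = HTMLField()

With that in place, the image button in TinyMCE opens the filebrowser, which handles the upload from the user's hard drive and inserts the resulting URL into the article.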
There is also a commercial option that looks good, but I have never tested it.
I'm developing a web application. It's months away from completion, but I would like to build a landing page to show to potential customers, explain things, and gauge their interest, basically collecting their email addresses and, if they feel like it, additional information like names and addresses.
Because I'm already using Django to build my site, I thought I might use another Django app to serve as this landing page. The features I need are:
to display a fairly static page, and potentially a series of pages,
to collect emails (and additional customer data), and
to track user actions, e.g., they got through the first two pages but didn't fill out the final page.
Is there any pre-existing Django app that provides any of these features?
If there is not a Django app, does anyone know of another, faster/better way than building my own app? Perhaps a pre-existing web service that you can skin and make look like your own? Maybe the perfect system exists but it's in PHP? I'm open to whatever.
Option 1: Google Sites
You can set it up very, very quickly, though the monitoring wouldn't be as detailed as you're asking for. Still, it's easy and fast!
Option 2: bbclone
Something else that may be helpful is to set up a PHP-based site (WordPress or something) and use bbclone for tracking on it. I've found bbclone to be pretty thorough in reporting what every visitor does, though it's been a while since I used it.
Option 3: Django Flatpages
The flatpages Django contrib app is pretty handy for making static flat pages. I'd probably just embed a Google Docs Form to collect email addresses (as that's super fast and lets you get back to real work). But this suggestion would still leave you needing to figure out how to get the level of detail you want on the stats end.
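For reference, enabling flatpages is only a few settings. This is a sketch for a Django version of that era; newer releases rename MIDDLEWARE_CLASSES to MIDDLEWARE, so adjust accordingly.

# settings.py -- minimal flatpages setup (a sketch; adjust to your Django version)
INSTALLED_APPS = (
    # ...
    'django.contrib.sites',
    'django.contrib.flatpages',
)
SITE_ID = 1

MIDDLEWARE_CLASSES = (
    # ...
    # Serves a flatpage whenever no other URL pattern matches the request.
    'django.contrib.flatpages.middleware.FlatpageFallbackMiddleware',
)

After a syncdb you create the pages themselves in the admin, and the embedded Google Docs form takes care of the email capture.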
Perhaps consider Google Analytics anyway?
Regardless, I suggest you use Google Analytics with everything. That'll work with anything you do, and you may well be able to get the stats you're really looking for out of it.
I don't know if it's called data mining or something else.
Let's say I have a worldwide business listing site that lists all the shops. I saw this website ABC that also lists shops, but only in Australia. The listings run page by page, with no ID.
How do I start writing a program that will crawl their pages and put selected information from each page into CSV format, which I can then import into my website?
At the very least, where can I learn how to do this? Thank you.
What you are attempting to do is known as "web scraping". Here's a good starting point for information, including the legal issues:
http://en.wikipedia.org/wiki/Web_scraping
One common framework for writing crawlers like this is Scrapy: http://scrapy.org/
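A minimal Scrapy spider looks something like the sketch below. The start URL, CSS selectors, and field names are made up; adjust them to the pages you are actually scraping.

import scrapy

class ShopSpider(scrapy.Spider):
    name = "shops"
    # Hypothetical listing site; replace with the real paginated URL.
    start_urls = ["http://www.example.com/shops?page=1"]

    def parse(self, response):
        # One item per shop listing on the page.
        for shop in response.css("div.listing"):
            yield {
                "name": shop.css("h2::text").get(),
                "address": shop.css(".address::text").get(),
                "phone": shop.css(".phone::text").get(),
            }
        # Follow the "next page" link until there isn't one.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Running it with scrapy runspider shop_spider.py -o shops.csv writes the scraped items straight to the CSV file the question asks for.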
Yes, this process is called web scraping. If you are familiar with Java, the most useful tools here are HtmlUnit and WebDriver. You should use a headless browser to go through the pages and extract the important information using selectors (mostly XPath, or regular expressions over the HTML).
I want to make my website available offline, even if the user clears the cache and cookies. Is this possible? I am also dealing with a database. Is it possible to handle databases offline?
A user can store a local copy of a single web page using Chrome (right-click, Save As), and it will save all the resources (images, CSS, JS) required to fully load the page offline. Other browsers have similar options.
You can use wget to mirror a whole website for offline browsing.
wget --mirror --convert-links --html-extension -p http://www.example.com/
Of course, neither of these options will handle the database-driven elements of your site/page.
If you want to mock a database or the dynamic elements of a page offline, then Google Gears is probably the closest to what you are looking for, but I think it was deprecated by Google last year.
If your users have modern browsers, try HTML5 Application Cache.
References:
Overview - http://www.html5rocks.com/en/features/offline
Demo - https://jonathanstark.com/labs/app-cache-7/
Tutorial - https://www.html5rocks.com/en/tutorials/appcache/beginner/
Article - http://grinninggecko.com/developing-cross-platform-html5-offline-app-1/
Summary: Click me, I'm the newish thing that browsers now support!
I clicked some of the links found in other answers, and all the tools mentioned are deprecated or soon will be.
Later, when I wasn't connected to the internet, I opened a site operated by Google (either Google Docs or YouTube; I have sadly forgotten which) and viewed the page source, as I was curious to see the other answers in action. I found something called ORIGIN-TRIAL in the manifest file.
After a quick Google search, I found this, which brought me to this, which somehow brought me to the last link:
https://developers.google.com/web/fundamentals/primers/service-workers
In conclusion, use service workers now. If you're wondering whether they work in all browsers, don't worry: all popular browsers should support them, as seen here.
No. If your databases are hosted online, then you need an internet connection for the PHP/ASP (whatever you're using to deal with the DBs) to connect and communicate with them.
For storing data locally and accessing it offline, take a look at Gears and Web Storage.
The main question is what degree of functionality you want to provide with your website. It always requires some work on the client (user) side to "store", i.e. save, your website offline. You would have to pack all your functionality into one page that the user saves (be it a Flash movie or some JavaScript code).
You can use a simple command to download a whole website locally with all links working properly:
wget -rk 'http://www.website.com'
For an HTTPS URL you need to add one more option, as below:
wget -rk --no-check-certificate 'https://www.website.com'