Static site pagination with Google App Engine - regex

I have an octopress/jekyll blog that I am trying to host with Google App Engine. Here's a different SO question that got me started: How to Regex for static webpage on Google App Engine?
However, I would ALSO like to get pretty URLs working (with no index.html required at the end of the URL).
e.g. /blog/post/ instead of /blog/post/index.html
For some of this I can wire up explicit rules (though it's pretty ugly). For the pagination pages (/blog/page2, for example) there is no way to know in advance how many there will be, so I can't wire them up explicitly.
I dimly suspect this will require a Python script, but I'm wondering if there might be some regex magic that would accomplish the same thing? Either way, does anyone have an idea? An example of a script that might work?
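For illustration, here is a minimal sketch of the script approach, assuming the Python 2.7 runtime with webapp2, a catch-all handler in app.yaml routing to it, and the generated site sitting in a public/ directory that is not declared static (files served by App Engine's static handlers are not readable by application code). Names and paths are placeholders:

# main.py: map pretty URLs onto the files Octopress/Jekyll generated.
import mimetypes
import os

import webapp2

class StaticHandler(webapp2.RequestHandler):
    def get(self, path):
        # /blog/post/ and /blog/post both resolve to public/blog/post/index.html
        candidate = os.path.join('public', path.strip('/'), 'index.html')
        if not os.path.isfile(candidate):
            # Fall back to treating the path as a plain file, e.g. /blog/atom.xml
            candidate = os.path.join('public', path.strip('/'))
        if not os.path.isfile(candidate):
            self.abort(404)
        content_type, _ = mimetypes.guess_type(candidate)
        self.response.headers['Content-Type'] = content_type or 'text/html'
        with open(candidate, 'rb') as f:
            self.response.out.write(f.read())

app = webapp2.WSGIApplication([(r'(/.*)', StaticHandler)])

This handles the pagination pages without enumerating them, since /blog/page2/, /blog/page3/ and so on all fall out of the same rule. Explicit static handlers are still cheaper for plain assets, so treat this as a sketch rather than a drop-in answer.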

Related

Ember hash URLs in Google

I am concerned about page ranking on Google with the following situation:
I am looking to convert my existing site, with 150k+ unique page results, to an Ember app. Currently it's something like domain.com/model/id; with Ember and hash-based routing it will be /#/model/id. I really want history state, but the lack of IE support doesn't leave that as an option. So my sitemap for Google has lots and lots of great results using the old model/id scheme. On the Rails side I will test the browser for compatibility before rendering either the JS-rich app or the plain HTML/CSS. Does anyone have good SEO suggestions for succeeding with my current schema?
Linked below is my schema with the options I'm looking at:
http://static.allplaces.net/images/EmberTF.pdf
History state is awesome, but it looks like support is only at around 60% of browsers.
http://caniuse.com/history
Thanks, guys, for the suggestions; the Google guide is similar to what I'm going to try. I will roll it out to one client this month and see what Webmaster Tools and Analytics show.
Here is everything you need to make your hash links SEO-friendly: https://developers.google.com/webmasters/ajax-crawling/
Basically, you write your whole app with hash links, but you have to add "!" to them, so you have #!/model/id. Next, you must have all pages generated somewhere, and when Google asks for them, return "plain html" as described here: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
Use Google Webmaster Tools to check whether your site is crawlable.
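To make the server side of that scheme concrete, here is a rough, framework-neutral sketch in Python (the asker's stack is Rails, but the convention is the same; the snapshot file names are made up for illustration):

# When Google sees a #! URL it re-requests the page with the
# _escaped_fragment_ query parameter; the server answers with
# pre-rendered HTML instead of the JS app shell.
from urlparse import parse_qs
from wsgiref.simple_server import make_server

def app(environ, start_response):
    qs = parse_qs(environ.get('QUERY_STRING', ''), keep_blank_values=True)
    fragment = qs.get('_escaped_fragment_', [None])[0]
    start_response('200 OK', [('Content-Type', 'text/html')])
    if fragment is not None:
        # e.g. /?_escaped_fragment_=/model/42 -> serve the snapshot for /model/42
        name = fragment.strip('/').replace('/', '_') or 'home'
        return [open('snapshots/%s.html' % name).read()]
    return [open('app.html').read()]  # normal visitors get the JS app shell

make_server('', 8000, app).serve_forever()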
I'm not sure if you're aware that you can configure Ember to use the browser history for the location API and keep your URLs the way they are referenced now. All you need to do is configure the Router's location property:
App.Router.reopen({
  location: 'history'
});
See more details about specifying the location API here.

Emulating a web browser to wrap the functionality of several similar web sites

I'm interested in emulating the functionality of a web browser in C++ so that I can create a wrapper for several web sites. Right now, the biggest issue with these sites is that they make heavy use of JavaScript that interacts with the HTML DOM. Thus, the simple solution of using curl to download the page and something like RapidXML to parse its contents is out.
Next, I considered using something like v8 with curl, and that solves the issue of interpreting the JavaScript on the page nicely. However, it doesn't solve the issue of connecting the HTML DOM methods with the JavaScript; in other words, document.getElementById() would fail in v8.
Next, I considered WebKit, which seems like it's perfectly suited to emulate a web browser--after all, Chromium and Safari both utilize it in their web browsers. However, it's a little too complete. I don't need all of the rendering aspects it includes.
So, I'd be looking for some way to:
Make an SSL connection to a web site
Interpret the JavaScript on that web site in connection with the HTML DOM
Set the value of the username/password <input> fields with my username and password
Simulate clicking the "Submit" button by calling the formSubmit() function from <input type="button" onClick="formSubmit()">
Handle the HTTP POST form action and the subsequent HTTP 301 and JavaScript redirects (accomplished using window.location)
Repeat 2-5 as needed
Besides what I've already considered, what other options do I have? Ideally, I'd want this to be extremely lightweight, without requiring linking to many libraries.
I'm primarily concerned with developing for Windows 7 64-bit.
Well, this sounds all too much like a brute-force program. Disregarding that, and since you don't seem to need to render any website, I think you should just fetch the page with cURL or something, parse it, find the form with a regex, retrieve the form action, then make a request using the method taken from the <form> tag and whichever inputs you want.
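In Python for brevity (the same flow maps onto libcurl plus a regex library in C++; the URL and field names are placeholders), that looks roughly like:

import cookielib
import re
import urllib
import urllib2
import urlparse

# Keep cookies between requests so a login session survives
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

login_url = 'https://example.com/login'
page = opener.open(login_url).read()

# Naively pull the form action out of the HTML
action = re.search(r'<form[^>]*action="([^"]*)"', page, re.I).group(1)
action = urlparse.urljoin(login_url, action)  # the action may be relative

# Submit the fields the form would have submitted
data = urllib.urlencode({'username': 'me', 'password': 'secret'})
response = opener.open(action, data)  # supplying a body makes this a POST

Note what the shortcut gives up: this covers the SSL fetch, the form fields, and the POST with redirects, but not running the page's JavaScript, which is exactly the part the question is worried about.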
The problem is that there would be no proper way to know when you've logged in successfully unless you did some kind of per-site check. This comes mainly from the fact that many sites use sessions rather than direct cookies or HTTP auth, and since you can't read sessions directly, it is impossible to tell when the session has changed.
That's the most lightweight solution I can come up with right now.

Tracking User Actions on Landing Pages in Django

I'm developing a web application. It's months away from completion, but I would like to build a landing page to show to potential customers to explain things and gauge their interest--basically collecting their email address and, if they feel like it, additional information like names and addresses.
Because I'm already using Django to build my site, I thought I might use another Django app to serve as this landing page. The features I need are:
to display a fairly static page and potentially a series of pages,
collect emails (and additional customer data)
track their actions--e.g., they got through the first two pages but didn't fill out the final page.
Is there any pre-existing Django app that provides any of these features?
If there is not a Django app, does anyone know of another, faster/better way than building my own? Perhaps a pre-existing web service that you can skin to look like your own? Maybe the perfect system is out there but it's PHP?--I'm open to whatever.
Option 1: Google Sites
You can set it up very, very quickly, though your monitoring wouldn't be as detailed as you're asking for. Still: easy and fast!
Option 2: bbclone
Something else that may be helpful is to set up a PHP-based site (WordPress or something) and use bbclone for tracking on it. I've found bbclone to be pretty thorough in reporting what everyone does, though it's been a while since I used it.
Option 3: Django Flatpages
The flatpages Django contrib app is pretty handy for making static flat pages. I'd probably just embed a Google Docs Form to collect email addresses (as that's super fast and lets you get back to real work). But this suggestion would still leave you needing to figure out how to get the level of detail you want on the stats end.
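For reference, wiring flatpages up is roughly this much work (a sketch against a Django 1.x project; the middleware-fallback setup described in the Django docs works too):

# urls.py: route otherwise-unmatched URLs to flatpages. Assumes
# 'django.contrib.sites' and 'django.contrib.flatpages' are in
# INSTALLED_APPS and SITE_ID is set; keep this pattern last so it
# doesn't shadow your real views.
from django.conf.urls.defaults import patterns, url

urlpatterns = patterns('django.contrib.flatpages.views',
    url(r'^(?P<url>.*)$', 'flatpage'),
)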
Perhaps consider Google Analytics anyway?
Regardless, I suggest you use Google Analytics with everything. That'll work with anything you do really, and for all I know, perhaps you can find a way to get the stats you're really looking for out of it.

Is someone trying to hack my Django website?

I have a website that I built using Django. Via the settings.py file, I have the site email me the error messages it generates, partly so that I can see if I made any errors.
From time to time I get rather strange errors, and they seem mostly to come from the same area of the site (where I wrote a little tutorial trying to explain how I set up a Django Blog Engine).
The errors I'm getting all look like something I could have caused with a typo.
For example, these two errors are very close together. I never had an 'x' or 'post' as a variable on those pages.
'/blog_engine/page/step-10-sub-templates/{{+x.get_absolute_url+}}/'
'/blog_engine/page/step-10-sub-templates/{{+post.get_absolute_url+}}/'
The user agent is:
'HTTP_USER_AGENT': 'Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)',
Which I take to be a scraper bot, but I can't figure out what they would be able to get with this kind of attack.
At the risk of sounding stupid, what should I do? Is it a hack attempt or are they simply trying to copy my site?
Edit: I'll follow the advice already given, but I'm really curious as to why someone would run a script like this. Are they just trying to copy the site? It isn't hitting admin pages or even any of the forms. It seems like a harmless (aside from potential plagiarism) attempt to dig in and find content.
From your USER_AGENT info it looks like this is a web spider from puritysearch.net.
What I suggest you do is put a CAPTCHA on your website. Program it to trigger when something tries to access 10 pages in 10 seconds (hardly any human would do this), or figure out another suitable criterion for triggering your CAPTCHA.
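A crude sketch of that trigger as old-style Django middleware (the 10-in-10 numbers and the /captcha/ URL are placeholders for whatever CAPTCHA view you choose):

# middleware.py
import time

from django.core.cache import cache
from django.http import HttpResponseRedirect

class CaptchaTriggerMiddleware(object):
    def process_request(self, request):
        if request.path.startswith('/captcha/'):
            return None  # never rate-limit the CAPTCHA page itself
        key = 'hits:%s' % request.META.get('REMOTE_ADDR', '')
        now = time.time()
        # Keep only the hits from the last 10 seconds
        hits = [t for t in cache.get(key) or [] if now - t < 10]
        hits.append(now)
        cache.set(key, hits, 10)
        if len(hits) > 10:
            return HttpResponseRedirect('/captcha/')
        return None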
Also, maintain a robots.txt file, which most crawlers honor. State your rules in robots.txt; you can tell crawlers to keep off certain busy sections of your site, etc.
If the problem persists, you might want to contact that particular site's system admin & try to figure out what's going on.
This way you will not be completely blocking crawlers (which your website needs in order to become popular), and at the same time you make sure that your users get a fast experience on your site.
Project HoneyPot has this bot listed as a malicious one: http://www.projecthoneypot.org/ip_174.133.177.66 (check the comments there). What you should probably do is ban that IP and/or user agent.

Some basic questions about Django, Pyjamas and Clean URLs

I am fairly new to the topic, but I am trying to combine Django and Pyjamas. What would be the smart way to combine the two? I am not asking about communication, but rather about the logical part.
Should I just put all the Pyjamas-generated JS at the base of the domain, say http://www.mysite.com/something, and set up Django in a subdirectory, or even a subdomain, so all the JSON calls go to http://something.mysite.com/something ?
As far as I understand, in such a combination there's not much point in creating views in Django?
Is there some solution for clean URLs in Pyjamas, or should that be solved at some other level? How? Is there a standard way to pass arguments as GET parameters in a clean URL while calling Pyjamas-generated JS?
You should take a look at the good Django With Pyjamas Howto.
I've managed to get the following to work, but it's not ideal. Full disclosure: I haven't figured out how to use Django's template system to get stuff into the Pyjamas UI elements, and I have not confirmed that this setup works with Django's authentication system. The only thing I've confirmed is that this gets the Pyjamas-generated page to show up. Here's what I did.
Put the main .html file generated by Pyjamas in Django's "templates" directory and serve it from your project the way you'd serve any other template.
Put everything else in Django's "static" files directory.
Make the following changes to the main .html file generated by Pyjamas: in the head section, find the meta element with name="pygwt:module" and change the content="..." attribute to content="/static/...", where "/static/" is the static URL path you've configured in Django; in the body section, find the script element with src="bootstrap.js" and change the attribute to src="/static/bootstrap.js".
You need to make these edits manually each time you regenerate the files with Pyjamas; there appears to be no way to tell Pyjamas to use a specific URL prefix when generating its output. Oh well, Pyjamas' coolness makes up for a lot.
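If the manual edits get tedious, a small post-build script can apply them (a sketch that assumes the attribute layout described above and the /static/ prefix from the example):

# fix_paths.py: patch the Pyjamas-generated .html after each build.
# Usage: python fix_paths.py output/MyApp.html
import re
import sys

path = sys.argv[1]
html = open(path).read()

# Prefix the pygwt:module content attribute with the static URL path
html = re.sub(r'(name="pygwt:module"[^>]*content=")', r'\g<1>/static/', html)
# Point the bootstrap.js script tag at the static directory
html = html.replace('src="bootstrap.js"', 'src="/static/bootstrap.js"')

open(path, 'w').write(html)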
acid, I'm not sure this is as much of an answer as you were hoping for, but I've been looking for the same answers as you have.
As far as I can see, the most practical way to do it is with an Apache server serving the Pyjamas output and Django used simply as a service API for JSON-RPC calls and such.
On a side note, I am starting to wonder whether Django is even the best option here, considering that using it only for this is not utilizing most of its functionality.
The issue I have found so far with using Django to serve Pyjamas output as Django views/templates is how Pyjamas loads:
The main HTML page loads "bootstrap.js", and depending on the browser used, bootstrap.js will load the appropriate app page. Even if you properly set up the static file links using the Django template language to reference and load "bootstrap.js", I can't seem to do the same for bootstrap.js referencing each individual app page.
This leaves me sad, since I do so love the "cruftless URLs" feature of Django.