Reading a web page that requires login (Clojure)

I need to download some content from a web page daily, and I plan on using Enlive for that. The trouble is that I need to log in with a POST first, and authentication for the page I'm interested in is then handled via the session's cookies. So I can't just use
(html/html-resource (java.net.URL. url))
I haven't found a way to do this in Clojure; doing the reading in Java would be fine as well. In the end it needs to run as a worker on Heroku.
Thanks!

You need an HTTP client to establish a valid session. You can use clj-http's cookie store to simplify maintaining the cookie across requests.
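For example, a minimal sketch using clj-http together with Enlive; the login URL, form-field names, and protected URL below are placeholders for the real site's values:

(require '[clj-http.client :as client]
         '[clj-http.cookies :as cookies]
         '[net.cgrand.enlive-html :as html])

;; One cookie store shared across requests keeps the session alive.
(def session-cookies (cookies/cookie-store))

;; Log in with a POST; the session cookie is captured by the cookie store.
(client/post "https://example.com/login"
             {:form-params {:username "me" :password "secret"}
              :cookie-store session-cookies})

;; Fetch the protected page with the same cookie store and hand the HTML to Enlive
;; (html-resource also accepts a Reader, so the response body can be wrapped directly).
(let [body (:body (client/get "https://example.com/protected"
                              {:cookie-store session-cookies}))]
  (html/html-resource (java.io.StringReader. body)))

Running this daily from a Heroku worker should work as long as the same cookie store (or a fresh login) is used for each run.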

Related

Populate web page using API result directly from API server

Firstly, sorry if this question has been asked before, but I'm a novice, so even if it has, I'm not sure what terms I'd use to search for it.
I'm beginning to learn about REST APIs, and it got me thinking: is it possible to load the JSON response directly from the API server into the user's browser and bypass your own server?
Imagine you have, say, a Django app running on a server that accesses email messages from Outlook.com using the Graph API.
I assume an ordinary flow would go something like:
user request -> your server -> Graph API -> your server -> user's browser.
It seems like a waste for it to hit your server that second time before it goes on to be presented to the user's browser.
Is there a way the Django app can render a template and effectively tell the browser "expect some data from X source, and place it in y location in this template"?
You could do that with JavaScript. You'd have to include either a script tag in your template, or create and include some static JavaScript files with the code.
I'd recommend learning and using the jQuery JavaScript library, as it makes what you're talking about much easier to implement. Research AJAX requests; those are what you'll need to make requests directly to another server, bypassing your own.
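As a rough sketch of the idea (the endpoint, token, and element id are placeholders, and the API server must allow cross-origin/CORS requests from the browser):

<div id="messages"></div>
<script src="https://code.jquery.com/jquery-3.7.1.min.js"></script>
<script>
  // The browser calls the API server directly; your Django server only rendered
  // the template. The URL, token, and response fields here are made up.
  $.ajax({
    url: "https://api.example.com/v1/messages",
    headers: { "Authorization": "Bearer YOUR_TOKEN" },
    dataType: "json"
  }).done(function (data) {
    data.forEach(function (msg) {
      $("#messages").append($("<p>").text(msg.subject));
    });
  });
</script>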

How to run Django views from a script?

I am writing a Django management command that visits a few pages, logged in as a superuser, and saves the results to a set of .html files.
Right now I'm just using the requests library and running the command with the development server running. Is there an easy way to generate the HTML from a view response so I can do without actual HTTP requests?
I could create a request object from scratch but that seems like more overhead than the current solution. I was hoping for something simple.
Django has a RequestFactory which seems to suit your needs.
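For instance, a short sketch with RequestFactory, assuming a hypothetical report_view and an existing superuser named admin:

from django.contrib.auth import get_user_model
from django.test import RequestFactory

from myapp.views import report_view  # hypothetical view

factory = RequestFactory()
request = factory.get("/reports/summary/")
request.user = get_user_model().objects.get(username="admin")  # superuser

response = report_view(request)
# TemplateResponse objects need an explicit render() before .content is available.
if hasattr(response, "render"):
    response.render()
html = response.content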
While it's not exactly meant for this purpose, an option would be to use the testing framework's Client to fake a request to the URL. Be sure to call client.login() before making your requests, to ensure you have superuser capabilities.
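Along those lines, a minimal sketch of such a management command using the test Client; the username, paths, and output filenames are placeholders:

from django.contrib.auth import get_user_model
from django.core.management.base import BaseCommand
from django.test import Client

class Command(BaseCommand):
    help = "Save a few logged-in views as .html files, without a running server"

    def handle(self, *args, **options):
        client = Client()
        # force_login() skips the password check; client.login(username=..., password=...)
        # works too if you'd rather authenticate normally.
        client.force_login(get_user_model().objects.get(username="admin"))

        for path in ["/reports/summary/", "/reports/detail/"]:
            response = client.get(path)
            filename = path.strip("/").replace("/", "_") + ".html"
            with open(filename, "wb") as fh:
                fh.write(response.content)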

How Google crawls Angular-based apps with HTML5 urls

I'm building a web application with the following url structure:
/ is the landing page, not angular based
/choose uses Angular, it basically contains search
/fund/<code> also with Angular, contains specific data for a certain fund
There's no problem indexing /; it's just plain and simple HTML, already SEO-optimized. But I need both /choose and /fund/... to be crawled by Google, and that's the problem.
My app uses HTML5 mode, and we never point to the app URLs using hashbangs like foo.com#!/choose, always foo.com/choose.
Also, according to Google's docs on the matter, I put <meta name="fragment" content="!"> in the head of every Angular page we have. But using "Fetch as Google" to inspect my site, I can't figure out how Google is requesting the pages from my server. I'm using Django on the backend and I built a middleware to catch _escaped_fragment_ and act on it, but Google never sends it.
So, simply put, my questions are:
Why isn't Google fetching my URLs using _escaped_fragment_?
How will Google fetch the pages?
foo.com?_escaped_fragment_=/choose
foo.com/choose?_escaped_fragment_=
According to the Google spec, you should use
foo.com/choose?_escaped_fragment_=hashfragment.
But as mentioned here, you don't seem to need a hashfragment value, since your URL is already mapped on the Django server side; the part after the equals sign is simply left empty.
Your final URL should look like this: foo.com/choose?_escaped_fragment_= (the second of the two options you listed).
Hope it helps!

Emulating a web browser to wrap the functionality of several similar web sites

I'm interested in emulating the functionality of a web browser in C++ so that I can create a wrapper for several web sites. Right now, the biggest issue with these sites is that they make heavy use of JavaScript that interacts with the HTML DOM. Thus, the simple solution of using curl to download the page and something like RapidXML to parse its contents is out.
Next, I considered using something like v8 with curl, and that solves the issue of interpreting the JavaScript on the page nicely. However, it doesn't solve the issue of connecting the HTML DOM methods with the JavaScript; in other words, document.getElementById() would fail in v8.
Next, I considered WebKit, which seems like it's perfectly suited to emulate a web browser--after all, Chromium and Safari both utilize it in their web browsers. However, it's a little too complete. I don't need all of the rendering aspects it includes.
So, I'd be looking for some way to:
Make an SSL connection to a web site
Interpret the JavaScript on that web site in connection with the HTML DOM
Set the value of the username/password <input> fields with my username and password
Simulate clicking the "Submit" button by calling the formSubmit() function, from <input type="button" onClick="formSubmit()">
Handle the HTTP POST form action and the subsequent HTTP 301 and JavaScript redirects (accomplished using window.location)
Repeat 2-5 as needed
Besides what I've already considered, what other options do I have? Ideally, I'd want this to be extremely lightweight, without having to link against many libraries.
I'm primarily concerned with developing for Windows 7 64-bit.
Well, this sounds all too much like a brute-force program. Disregarding that, and since you don't seem to need to render any website, I think you should just fetch the page with cURL or something, parse it, find the form with a regex, retrieve the form action, then make a request using the method taken from the <form> tag and whichever inputs you want.
The problem is that there's no general way to know whether you've actually logged in, unless you add some kind of per-site check. This comes mainly from the fact that many sites use server-side sessions rather than plain cookies or HTTP auth, and since you can't read the session directly, you can't tell when it has changed.
That's the most lightweight solution I can come up with right now.
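As a rough illustration of that approach with libcurl (URLs and form fields are placeholders; the step that actually parses the <form> out of the first response is left as a comment, since it depends on the site):

#include <curl/curl.h>
#include <iostream>
#include <string>

// Append whatever curl receives into a std::string.
static size_t collect(char* data, size_t size, size_t nmemb, void* out) {
    static_cast<std::string*>(out)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    std::string body;

    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    curl_easy_setopt(curl, CURLOPT_COOKIEFILE, "");      // enable in-memory cookie engine
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);  // follow the redirect after login

    // Step 1: GET the login page; parse the <form> action and hidden fields from `body` here.
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/login");
    curl_easy_perform(curl);

    // Step 2: POST the credentials to the form's action URL; the session cookie is kept.
    body.clear();
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/login");
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, "username=me&password=secret");
    curl_easy_perform(curl);

    std::cout << body << std::endl;  // a per-site "am I logged in?" check would go here
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}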

Besides URL rewriting, what options are available for maintaining sessions without using cookies?

I've seen various options for URL rewriting here on Stack Overflow, and other places on the web, but was curious to see if there were other options.
This is speculation, as Cookies and URL Rewriting are the big two, but technologically, I think it'd be possible to:
do some massive hackery with JavaScript that captures all links and submits a form with the information.
track the session on the server based on IP
Both have their downsides and holes obviously.
Session variables? At work, we are not allowed to use non-session cookies without a load of permissions.
You can either maintain state through a cookie or through a query parameter. The browser needs to be able to pass data to the web server somehow and those are the only two options.
I suppose that would depend on what technology you are using. In ColdFusion you can maintain session variables without cookies.
Using client-side database storage, such as Google Gears (SQLite)? HTML5 is expected to include one (WebKit already does).