I have just started web scraping in Python with Requests. This may be a broad question, so I will try to keep it as brief as possible.
I have come across situations where an entire page's source can be downloaded with r.content (where r is the Response object from a requests get call).
Sometimes part of the data is stored in JSON format, in files that can be accessed by closely observing the GET and POST calls the page makes.
However, I have also found websites where the entire content is in the DOM, yet part of it is neither in the page source nor in any JSON file.
I am wondering: in how many such places can a website store its data?
(Just the names; I am not looking for how to get to them.)
For this last type of website, I have observed almost every request made, but couldn't find where the data comes from.
So, are there any other places besides the two mentioned above? Or are those the only two, meaning I am simply not observing the requests carefully enough?
You may answer in brief bullet points and I can take my study from there.
Thanks in advance.
Let's assume we are talking only about HTML data; a web server could serve you data in many other formats (JSON, XML, etc.).
Please note that what I describe is a generalisation, and like most generalisations, you will find exceptions that do not fit it.
Broadly, we can divide the data displayed to the end user into two categories:
Pre-render
Post-render
Pre-render
The entire HTML page is constructed server-side and sent across to the client. Here, the JavaScript is concerned with user interaction, not with the structure of the data.
We are slowly moving away from this type of structure, but currently a large majority of web pages still use it.
Web scraping is relatively easy here, as we can programmatically pull the HTML page and not bother with the JavaScript code that accompanies it.
A combination of requests and BeautifulSoup should work in almost all cases (assuming you can identify the general structure of the document).
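A minimal sketch of that pre-render workflow (the URL and the CSS selector below are hypothetical placeholders; swap in the real ones for the page you are scraping):

```python
# Minimal pre-render scrape: the whole document is already in the HTML
# the server sends, so one GET plus a parser is enough.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")   # hypothetical URL
response.raise_for_status()

soup = BeautifulSoup(response.content, "html.parser")

# Pull the pieces you care about with CSS selectors (or find/find_all).
for title in soup.select("h2.product-title"):              # hypothetical selector
    print(title.get_text(strip=True))
```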
Post-render
Here the HTML page returned from the server is just a "skeleton", with placeholders for the actual data. The data is rendered by the accompanying JS code.
In such cases, if you fetch the source file via, e.g., requests, you will get an empty shell with no data in it.
If you inspect the calls made by a browser while rendering the page (Chrome's Network tab, Firefox's inspector, or the once-popular Firebug), you will most likely see AJAX requests that bring the actual data back from the server.
Depending on how those requests are made, you can often hit the AJAX endpoint directly and get the data as JSON.
You can then use the response.json() method to parse it into Python dicts.
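A sketch of hitting such an AJAX endpoint directly; the endpoint URL, query parameters and JSON keys here are hypothetical stand-ins for whatever you actually see in the browser's network tab:

```python
import requests

response = requests.get(
    "https://example.com/api/items",                     # hypothetical AJAX endpoint
    params={"page": 1},                                  # hypothetical query string
    headers={"X-Requested-With": "XMLHttpRequest"},      # some endpoints expect this header
)
response.raise_for_status()

data = response.json()       # parse the JSON body into Python dicts/lists
for item in data["items"]:   # hypothetical key
    print(item["name"])      # hypothetical key
```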
In certain (rare) cases, there is no AJAX call, but the HTML served is still a shell: the actual data is part of the file served, embedded in the JS code itself. This can be done for a variety of reasons, for example to pass dynamic data to static JS files, or simply to deter naive attempts at scraping the page.
One approach to scraping such pages is to "render" the page in a headless browser, which executes the JS code and returns HTML that can then be parsed with a parser like BeautifulSoup.
BeautifulSoup can work with several parsers, one of which is html5lib; note, however, that a parser alone does not execute JavaScript, so this only helps when the data is already present in the served markup.
You could also look at Selenium or mechanize; a minimal Selenium sketch follows below.
Or you could try parsing the JS code yourself, which might be faster.
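A minimal Selenium sketch, assuming Selenium and a matching chromedriver are installed; the URL and selector are hypothetical, and slow pages may additionally need an explicit wait before reading the source:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")                # run the browser without a window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/spa-page")    # hypothetical URL
    rendered_html = driver.page_source            # HTML *after* the JS has run
finally:
    driver.quit()

soup = BeautifulSoup(rendered_html, "html.parser")
for row in soup.select("div.data-row"):           # hypothetical selector
    print(row.get_text(strip=True))
```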
Arriving at a conclusion about what to use requires careful inspection of how the page is rendered in a browser. Even if you don't see an AJAX request, the HTML served by the server need not be what the browser ends up displaying.
A good way to start is by looking at the bare HTML being served, either by downloading the page via curl or requests.get, or simply by rendering it in your browser with JavaScript disabled.
Good luck.
Related
Firstly, sorry if this question has been asked before; I'm a novice, so even if it has, I'm not sure what terminology I'd use to search for it.
I'm beginning to learn about REST APIs, and it got me thinking: is it possible to load the JSON response from the API server directly into the user's browser and bypass your own server?
Imagine you have, say, a Django app running on a server that accesses email messages from Outlook.com using the Graph API.
I assume an ordinary flow would go something like:
user request -> your server -> Graph API -> your server -> user's browser
It seems like a waste for the data to hit your server that second time before it is presented to the user's browser.
Is there a way the Django app can render a template and effectively tell the browser "expect some data from X source, and place it in y location in this template"?
You could do that with JavaScript. You'd have to include either a script tag in your template, or create and include some static JavaScript files with the code.
I'd recommend learning and using the jQuery JavaScript library, as it makes what you're talking about much easier to implement. Research AJAX requests; those are what you'll need to make requests directly to another server, bypassing your own.
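As a rough illustration of the server side of that approach (everything here is a hypothetical sketch, not a drop-in solution): the Django view only renders the skeleton template and hands the browser the external endpoint; the template's own script then makes the AJAX call directly, so the JSON never passes back through your server.

```python
from django.shortcuts import render

def inbox(request):
    # Hypothetical endpoint; in practice the browser would also need a
    # short-lived access token it can present to the API directly.
    context = {
        "messages_api_url": "https://graph.microsoft.com/v1.0/me/messages",
    }
    # inbox.html (hypothetical) contains a <script> that reads
    # messages_api_url, performs the AJAX request, and injects the
    # results into a placeholder element in the page.
    return render(request, "inbox.html", context)
```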
We've got a WordPress site, and I've built a page that pulls from different sections of our site, which I'd like to use as the content for a bi-weekly MailChimp newsletter. Is there any way to automate pulling a div from our site into the body of a MailChimp template?
All the tools I've found pull the page in as "an article" and just put an image and a headline into the message body, rather than the full page verbatim.
Not averse to doing some coding, but not sure how to start.
Thanks for any suggestions.
I can think of two different routes you might be able to try. The first is to generate an RSS feed for the content you're talking about and then use an RSS Campaign to send the email. Depending on how you have this data stored on your site, WordPress might already be generating an RSS feed for you for that content.
The second option involves more coding. If you create a template with an editable section you can then pass in the content of that section via the API. This is probably harder, since the campaign content APIs are pretty convoluted in v2.0. v3.0 should make that easier, but it's still in beta.
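A rough sketch of that second option against the v3.0 API via Python's requests library. Treat the endpoint and payload shape as an assumption to verify against the current MailChimp docs; the API key, datacenter, campaign ID, template ID and section name below are all placeholders.

```python
import requests

API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-us1"    # placeholder key
DC = "us1"                                           # datacenter suffix from the key
CAMPAIGN_ID = "abc123"                               # placeholder campaign id

# HTML pulled from the WordPress page (e.g. the target div's contents).
div_html = "<h1>This week's roundup</h1><p>...</p>"

resp = requests.put(
    f"https://{DC}.api.mailchimp.com/3.0/campaigns/{CAMPAIGN_ID}/content",
    auth=("anystring", API_KEY),                     # basic auth: any username + API key
    json={
        "template": {
            "id": 12345,                             # placeholder template id
            "sections": {"body": div_html},          # "body" = the editable section's name
        }
    },
)
resp.raise_for_status()
```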
Here's the deal: I have an Excel sheet that feeds a MySQL table. I've already written a server-side procedure that receives the sheet, reads it, and puts it into the database. Sadly, the sheet and the database table don't have the same structure, so I need a PHP object/script on the server side to manipulate it. I have an interface to upload the (Excel) file so that the PHP program can read it...
...but my boss's job isn't to make my life easier, is it? No! He says it's a lot of work to upload every Excel file through the web interface. So he asked me to put a button in the sheet that he can click once his "job" is done. That would replace the web interface.
But the system itself is an interface that may be sold one day (well, that's the plan!), so I can't simply drop the web interface.
WHAT I'M ASKING IS: is there a way I could send a file (the sheet itself) in a POST request straight from the VBA macro, without using XML files, and name each piece of data I'm sending, like a form post?
So far I've found some tutorials and even some SO posts that got me somewhere, but all of them were about XML, and I already have a method that receives an HTTP POST (from a form) and works. I'm aiming to reuse that same method. From my VBA script I'm already able to build the request (not a big deal) and post it. But the server-side script expects a POST coming from a form, so it looks up fields by name, and I don't seem to be able to set that from a VBA post. =/
Here's the answer... the first two functions/methods show how to send a file to a web service. You only need the file path and the service URL. It answered even more than I expected. :D
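For reference, this is the shape of the multipart/form-data request the macro has to reproduce so the existing PHP handler keeps seeing a named file field, sketched in Python rather than VBA purely for brevity; the URL, field names and file path are placeholders.

```python
import requests

with open(r"C:\reports\sheet.xlsx", "rb") as fh:          # placeholder path
    resp = requests.post(
        "https://example.com/upload.php",                 # placeholder endpoint
        files={"sheet_file": ("sheet.xlsx", fh)},         # same field name the web form used
        data={"sent_by": "excel-button"},                 # any extra named form fields
    )
resp.raise_for_status()
```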
I am using the Django template system. What I want is that, when I submit a form or click a URL link, the page does not refresh but is updated with the data returned from the server. Is it possible?
I recommend a combination of jQuery (easy, powerful, popular javascript library) and dajax/dajaxice (http://www.dajaxproject.com/). Dajax is very easy to set up and use, and jQuery is also easy to set up and use. Dajax is strictly for AJAX communications through Django. jQuery is perfect for taking a simple site and making it more fluid, intuitive, and user-friendly.
You need JavaScript to do that. What you are looking for is called AJAX (Asynchronous JavaScript and XML). Essentially, it means you use JavaScript to send a request to the server as soon as the link/button is clicked. The server returns some data to your script, which can then be used to manipulate the HTML page, e.g. by inserting the returned data into the DOM. Since you do everything with JavaScript, no reloading of the whole page is required.
To start, read an AJAX tutorial. There are JavaScript libraries that make these things simpler for you (e.g. jQuery), but you really should understand how this stuff works first, since otherwise you might get into trouble while trying to debug it.
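For the Django side, a minimal sketch (view name and data are hypothetical): the view returns JSON, and the client-side JavaScript, jQuery's $.getJSON for instance, requests it and updates the DOM without a full page reload.

```python
from django.http import JsonResponse

def latest_items(request):
    # In practice you would query your models here.
    items = [
        {"id": 1, "title": "First"},
        {"id": 2, "title": "Second"},
    ]
    return JsonResponse({"items": items})
```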
I'm interested in emulating the functionality of a web browser in C++ so that I can create a wrapper for several web sites. Right now, the biggest issue with these sites is that they make heavy use of JavaScript that interacts with the HTML DOM. Thus, the simple solution of using curl to download the page and something like RapidXML to parse its contents is out.
Next, I considered using something like v8 with curl, and that solves the issue of interpreting the JavaScript on the page nicely. However, it doesn't solve the issue of connecting the HTML DOM methods with the JavaScript; in other words, document.getElementById() would fail in v8.
Next, I considered WebKit, which seems like it's perfectly suited to emulate a web browser--after all, Chromium and Safari both utilize it in their web browsers. However, it's a little too complete. I don't need all of the rendering aspects it includes.
So, I'd be looking for some way to:
Make an SSL connection to a web site
Interpret the JavaScript on that web site in connection with the HTML DOM
Set the value of the username/password <input> fields with my username and password
Simulate clicking the "Submit" button by calling the formSubmit() function, from <input type="button" onClick="formSubmit()">
Handle the HTTP POST form action and the subsequent HTTP 301 and JavaScript redirects (accomplished using window.location)
Repeat 2-5 as needed
Besides what I've already considered, what other options do I have? Ideally, I'd want this to be extremely lightweight, without requiring linking to many libraries.
I'm primarily concerned with developing for Windows 7 64-bit.
Well, this sounds rather like a brute-force program. Disregarding that, and since you don't seem to need to render the website, I think you should just fetch the page through cURL or something similar, then parse it: find the form with a regex, retrieve the form action, and make a request using the method taken from the <form> tag and whichever inputs you want.
The problem is that there is no general way to know when you have logged in successfully, unless you do some kind of per-site check. This comes mainly from the fact that many sites use server-side sessions rather than plain cookies or HTTP auth, and since you can't read those sessions directly, you can't tell when the session has changed.
That's the most lightweight solution I can come up with right now.
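As a rough illustration of that flow, sketched in Python for brevity; each step maps onto libcurl calls (plus your own regex/HTML handling) in C++, and the URL and form field names are hypothetical.

```python
import re
import requests

session = requests.Session()   # keeps cookies across requests, like a cookie jar in libcurl

# 1. Fetch the login page and pull the form's action URL out of the HTML.
login_page = session.get("https://example.com/login").text
action = re.search(r'<form[^>]+action="([^"]+)"', login_page).group(1)

# 2. Submit the form fields with the method the form declares (POST here).
resp = session.post(
    "https://example.com" + action,
    data={"username": "me", "password": "secret"},   # hypothetical field names
    allow_redirects=True,                            # follows the HTTP redirect after login
)

# 3. Per-site success check, as noted above -- there is no generic way to know.
print("logged in" if "Logout" in resp.text else "login failed")
```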