I am trying to access a proprietary website that provides access to a very large database (many billions of entries). Each entry in the database is a link to a webpage that is essentially a flat file containing the information I need.
I have about 2000 entries from the database, along with their corresponding webpages. I have two related issues that I am trying to resolve:
How to get wget (or any other similar program) to read cookie data. I exported my cookies from Google Chrome (using https://chrome.google.com/webstore/detail/cookiestxt/njabckikapfpffapmjgojcnbfjonfjfg?hl=en), but for some reason the HTML downloaded by wget still cannot be rendered as a webpage. Similarly, I have not been able to get Google Chrome to read cookies from the command line. These cookies are needed to access the database, since they contain my credentials.
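For reference, a minimal sketch of how a Netscape-format cookies.txt export can usually be reused from Python, via http.cookiejar and requests; the filename and URL are placeholders, not values from my setup:

# Hypothetical sketch: reuse an exported cookies.txt for an authenticated request.
import http.cookiejar
import requests

jar = http.cookiejar.MozillaCookieJar("cookies.txt")
jar.load(ignore_discard=True, ignore_expires=True)

resp = requests.get("https://example.com/database/entry/12345", cookies=jar)
resp.raise_for_status()
print(resp.text[:200])  # if this looks like a login page, the cookies were not applied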
In my context, it would be OK if the webpage were downloaded as a PDF, but I cannot seem to figure out how to download a webpage as a PDF using wget or similar tools. I tried using automate-save-page-as (https://github.com/abiyani/automate-save-page-as), but I keep getting an error about the browser not being in my PATH.
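As a side note, one common route for the PDF part is the pdfkit wrapper around wkhtmltopdf; a rough sketch, assuming the wkhtmltopdf binary is installed and with placeholder URL and filename:

# Rough sketch: render a URL to PDF with pdfkit (a wrapper around wkhtmltopdf).
# Requires wkhtmltopdf to be installed; URL and output name are placeholders.
import pdfkit

pdfkit.from_url("https://example.com/database/entry/12345", "entry_12345.pdf")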
I solved both of these issues:
Problem 1: I switched away from wget, curl, and Python's requests to simply using the Selenium webdriver in Python. With Selenium, I did not have to deal with issues such as passing cookies, headers, POST and GET, since it actually opens a browser. This also has the plus that, as I was writing the script, I could inspect the page and see what it was doing as it was doing it.
Problem 2: Selenium has an attribute called page_source, which returns the HTML of the webpage. When I tested it, the HTML rendered correctly.
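A rough sketch of what that ends up looking like; the URLs and the choice of Chrome driver are illustrative, not the exact script:

# Illustrative sketch of the Selenium approach; URLs and driver are assumptions.
from selenium import webdriver

driver = webdriver.Chrome()  # opens a real browser, so cookies and headers are handled for you
driver.get("https://example.com/login")
# ... log in (interactively, or by filling the form via driver.find_element) ...

driver.get("https://example.com/database/entry/12345")
html = driver.page_source  # HTML of the rendered page
with open("entry_12345.html", "w", encoding="utf-8") as f:
    f.write(html)
driver.quit()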
Related
I hosted a small Django project on a virtual machine on DigitalOcean. Before I started using DigitalOcean Spaces to serve static files, everything worked fine. But as soon as I created a storage space and pushed my static files there, the static files stopped being served on the webpage, including the Django admin page (the project pages show net::ERR_ABORTED 403 (Forbidden)). I have already installed django_storages and boto3.
I got a hint as to the possible cause of the problem when I clicked on the DigitalOcean Spaces bucket URL, which looks like https://django-space-storage.nyc3.digitaloceanspaces.com. When I clicked on it I got the following error:
This XML file does not appear to have any style information associated with it. The document tree is shown below.
It seems the browser is rendering my Django pages as XML instead of HTML. I might be wrong with this assumption, because I'm having a hard time understanding what is actually going on. My question is: how do I get the browser to render my Django pages as HTML instead of XML?
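For context, this is a minimal sketch of the kind of django-storages settings commonly used with Spaces; the key values are placeholders, and whether AWS_DEFAULT_ACL is set to public-read is one of the usual suspects when every static file returns 403:

# Sketch of typical django-storages + Spaces settings; key/secret are placeholders.
AWS_ACCESS_KEY_ID = "your-spaces-key"
AWS_SECRET_ACCESS_KEY = "your-spaces-secret"
AWS_STORAGE_BUCKET_NAME = "django-space-storage"
AWS_S3_ENDPOINT_URL = "https://nyc3.digitaloceanspaces.com"
AWS_S3_REGION_NAME = "nyc3"
AWS_DEFAULT_ACL = "public-read"   # without this, uploaded files can be private and return 403
AWS_QUERYSTRING_AUTH = False

STATICFILES_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
STATIC_URL = "https://django-space-storage.nyc3.digitaloceanspaces.com/"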
I'm currently trying to check this website for its current list of patches, select the newest zip file, and download it. Manually, the zip file automatically pops up after I accept the EULA, but in the background it redirects elsewhere (somewhere beyond the scope of network tools/Wireshark etc.).
I've been using requests to pull the zip, but instead it returns the HTML of the page that it redirects to (not the zip itself).
So it seems I can accept the EULA, and then it redirects somewhere. But when I save the response as a zip, the content inside is all wrong.
Is it possible to still save that zip (with requests)? If not, what are the alternatives?
Edit: Maybe an add-on issue, or maybe the real issue: how can I tell if I'm still logged in to a site? I previously used a requests session, POSTed my login details, and then checked whether it went through. But now I'm starting to feel like it's not logging me in, because the same HTML I received from the EULA GET request keeps recurring everywhere else. My response looks like the source code behind the LOG IN page, and the same thing shows up when I look at the response from the EULA, instead of the actual list.
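In case it helps frame the question, this is roughly the pattern I mean; the URLs, form field names, and the "LOG IN" check are assumptions about the site, not known values:

# Hedged sketch: log in with a session, verify the login worked, then stream the zip.
import requests

session = requests.Session()
login = session.post(
    "https://example.com/login",
    data={"username": "me", "password": "secret"},
)
login.raise_for_status()

# Sanity check: a page only logged-in users can see should not contain the login form.
check = session.get("https://example.com/patches")
if "LOG IN" in check.text:
    raise RuntimeError("Still on the login page - authentication failed")

# Follow redirects and write the binary response to disk instead of treating it as HTML.
with session.get("https://example.com/patches/latest.zip",
                 allow_redirects=True, stream=True) as resp:
    resp.raise_for_status()
    print(resp.headers.get("Content-Type"))   # should be application/zip, not text/html
    with open("latest.zip", "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)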
I currently have an application which makes a request to a page with a '.data' extension every time it opens. This application is already released to the public and people use it, and I want to see how many people use it. To do this, I am taking the approach of simply counting how many times this page is downloaded. I want to use Google Analytics, but I cannot get JavaScript/HTML to run on this page because it has a '.data' extension (a random choice).
Is there any way I could get some JavaScript to run, or somehow count the number of times this page is downloaded, without having to change my current application and update it to make a request to a PHP page? I've also tried redirecting the page to a PHP script with .htaccess, but my application that is currently public won't follow the redirection.
Any reason you can't put the following in your Apache configs?
AddType text/html .data
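If you only need the count and have access to the server's Apache access log, another option is to tally the requests there with a small script; the log path and request path below are assumptions about your setup:

# Alternative sketch: count hits for the .data page in the Apache access log.
count = 0
with open("/var/log/apache2/access.log") as log:
    for line in log:
        if "GET /path/to/page.data" in line:
            count += 1
print(count)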
I am trying to use HTSQL for one of my Django projects. For that I followed the procedure given HERE for setting up the HTSQL/Django requirements. Then I cloned the HTSQL repository to try the example/demo in it from HERE. I am testing this on Django v1.4. The default db used in the demo example is sqlite3. In the Django Python shell, the queries are working fine now, as per THIS question. But as demonstrated on the HTSQL website, it has a very powerful frontend for communicating with the database and it also generates efficient queries (reference). I am trying to use this particular feature for my Django application, which is also demonstrated in the demo/example Django app from HTSQL. In the demo app, when I started my local Django server and tried to access the following URL:
localhost:8000/htsql/
The page loads, and when I write the following line:
/polls_poll
to see the data from the polls_poll table, the RUN button does nothing and so does the "more" drop-down menu. No error, no response, no data fetched from the polls_poll table. Then I noticed that the page wasn't loading properly, i.e. this trace was generated on the Django server terminal. So basically,
the codemirror .js and .css files were throwing HTTP 500 errors. So I searched for the links to the codemirror .css and .js files and provided those links in the index.html of HTSQL, which resides in the static folder. Its path is:
/usr/local/lib/python2.7/dist-packages/htsql/tweak/shell/static
Now the terminal trace has changed to THIS.
But the RUN button still does nothing and no data is fetched from the polls_poll table.
Am I doing something wrong or missing something?
CodeMirror just changed the download URL for their packages, which broke the HTSQL shell. You need to either apply the following patch manually:
https://bitbucket.org/prometheus/htsql/changeset/f551f8996610bb68f2f8530fc6c0dbf6b5c34d90
or wait for the next bugfix release of HTSQL, which will be out in a day or two.
I am looking for a program which can crawl through a site and get a list of pages which set cookies. Normal site crawls do not parse JavaScript, so they won't pick up cookies set in this way.
It seems there is no suitable off-the-shelf tool. The best approach is to write your own script for Greasemonkey (or another userscript engine for your browser) that searches for the mask "*cookie*", and then analyze the results manually. You could also write something that automatically downloads and parses the external JS files referenced by each HTML page.
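A rough sketch of the "download and parse the external JS" idea in Python; the start URL is a placeholder, and this only catches cookies set via Set-Cookie headers or plain document.cookie references, not dynamically injected scripts:

# Sketch: flag pages whose headers or scripts mention cookies.
import requests
from urllib.parse import urljoin
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # collect src attributes of external <script> tags
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)

def cookie_hits(url):
    resp = requests.get(url)
    hits = []
    if "Set-Cookie" in resp.headers:
        hits.append("Set-Cookie header")
    if "document.cookie" in resp.text:
        hits.append("inline script")
    collector = ScriptCollector()
    collector.feed(resp.text)
    for src in collector.srcs:
        js = requests.get(urljoin(url, src))
        if "document.cookie" in js.text:
            hits.append(src)
    return hits

print(cookie_hits("https://example.com/"))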