I currently have a publicly released application that requests a page with a '.data' extension every time it opens. I want to see how many people use it, so I am taking the approach of counting how many times this page is downloaded. I want to use Google Analytics, but I cannot get JavaScript/HTML to run on this page because of its '.data' extension.
Is there any way I could get some JavaScript to run, or somehow count the number of times this page is downloaded, without having to change my current application and update it to request a PHP page instead? I have also tried redirecting the page to a PHP script via .htaccess, but the application that is currently public won't follow the redirect.
Any reason you can't put the following in your Apache config?
# serve .data responses as text/html so any embedded tracking JavaScript runs
AddType text/html .data
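Alternatively, since Apache already logs every request, you could count downloads server-side without touching the released application at all. A minimal sketch in Python, where the log path and page name are placeholders for your own setup:

# count how many times the .data page appears in the Apache access log
count = 0
with open("/var/log/apache2/access.log") as log:  # placeholder path
    for line in log:
        if "/mypage.data" in line:  # placeholder page name
            count += 1
print("downloads:", count)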
I am creating a forum using Gatsby.
I have developed a form, in a page called create.js, that users can use to create threads for the forum; it sends the data to an external DB.
Once the user has submitted the thread, I want to create a new page from a template. Normally I would use createPage in gatsby-node.js, but according to the Gatsby docs, gatsby-node.js is only run once, at deployment.
Is there another way I can access createPage() outside of gatsby-node.js, or is there another function I am missing?
Ultimately I want the new page to be available in the Gatsby application, without redeploying, after the user has created the necessary content.
The way Gatsby works is that all pages need to be generated at build time. You cannot add new pages without triggering a new build.
Gatsby is not a suitable platform for a forum since content changes hundreds or thousands of times a day. Gatsby is intended for content that changes infrequently such as blogs (which might update a few times a day).
In order to generate pages without using createPage in gatsby-node.js, Gatsby advises using @reach/router and matchPath to extend the client application's router; this functionality is called 'client-only routes'.
More info: see the Gatsby docs on client-only routes.
I was working through the Mozilla Django tutorial, and at one point it has you redirect the URL '' to '/catalog/' with permanent=True.
Now I have created a new project (another project, in a different directory) with django-admin and ran manage.py runserver (note that I haven't made any changes in this project), yet the URL '' is automatically being redirected to '/catalog/' in Chrome. It works fine in Opera Mini, though.
It's probably not a Django issue, but rather Chrome caching certain requests. You could try a hard refresh of the page:
https://www.getfilecloud.com/blog/2015/03/tech-tip-how-to-do-hard-refresh-in-browsers/
Unless the redirect is being handled by some unusual JavaScript (you are doing this in Django, so probably not), you likely just need to force a refresh: hold the Shift key while clicking the refresh button in Chrome.
This happens because Chrome has cached this particular redirect and uses it without checking with the server. You can get rid of it by clearing your browser cache, but you might not want to do that because you'll lose other data too. Instead, right-click on the page in question and go to developer tools (or hit F12). Go to the Network tab and tick "Disable cache". Now refresh the page without the redirected part of the URL, and it should load correctly. Then close developer tools, and it should continue to work as intended.
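If the redirect comes from a RedirectView, as in the MDN tutorial, a minimal sketch of a development-time workaround is to use permanent=False, which sends a 302 instead of a 301 and is not cached this aggressively by Chrome:

# urls.py (sketch; assumes the tutorial's RedirectView setup)
from django.urls import path
from django.views.generic import RedirectView

urlpatterns = [
    # permanent=False issues a 302, which browsers do not cache like a 301
    path('', RedirectView.as_view(url='/catalog/', permanent=False)),
]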
I solved it by deleting the browser history (going to Advanced mode and selecting everything).
I am trying to access a proprietary website which provides access to a large database. The database is quite large (many billions of entries). Each entry in the database is a link to a webpage that is essentially a flat file containing the information that I need.
I have about 2000 entries from the database, along with their corresponding webpages. I have two related issues that I am trying to resolve:
How to get wget (or any other similar program) to read cookie data. I exported my cookies from Google Chrome (using https://chrome.google.com/webstore/detail/cookiestxt/njabckikapfpffapmjgojcnbfjonfjfg?hl=en), but for some reason the HTML downloaded by wget still cannot be rendered as a webpage. Similarly, I have not been able to get Google Chrome to read the cookies from the command line. These cookies are needed to access the database, since they contain my credentials.
In my context it would be OK if the webpage were downloaded as a PDF, but I cannot figure out how to download a webpage as a PDF using wget or similar tools. I tried automate-save-page-as (https://github.com/abiyani/automate-save-page-as), but I kept getting an error that the browser is not in my PATH.
I solved both of these issues:
Problem 1: I switched away from wget, curl, and Python's requests to simply using the Selenium WebDriver in Python. With Selenium I did not have to deal with issues such as passing cookies and headers or choosing between POST and GET, since it actually opens a browser. This also has the plus that, while writing the script, I could inspect the page and watch what it was doing as it ran.
Problem 2: Selenium exposes page_source, which returns the HTML of the rendered webpage. When I tested it, the HTML rendered correctly.
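A minimal sketch of that approach (the URL is a placeholder, and chromedriver is assumed to be installed and on PATH):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/database/entry/12345")  # placeholder URL
html = driver.page_source  # the HTML as rendered by the browser, cookies handled for you
driver.quit()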
I have just started with Python web scraping using Requests. This could be a broad question; I will try to make it as brief as possible.
I have come across situations where an entire page's source can be downloaded with r.content (where r is the response object of a requests get call).
Sometimes part of the data is stored in JSON format, in files that can be found by closely observing the GET and POST calls being made.
However, I have even found websites where the entire content is in the DOM, yet part of it is in neither the page source nor any JSON file.
I am wondering: in how many different places can a website store its data?
(Just the names; I am not asking how to get at them.)
For this last type of website, I have observed almost every request made, but couldn't find where the data is.
So, are there any other places besides the two mentioned above? Or are those the only two, meaning I am simply not observing the request calls carefully enough?
You may answer in brief bullet points, and I can take my study from there.
Thanks in advance.
Let's assume we are talking only about HTML data; a web server could serve data in many other formats (JSON, XML, etc.).
Please note that what follows is a generalisation, and like most generalisations, you can find exceptions that do not fit it.
Broadly, we can divide the type of data displayed (to the end user) into two categories:
Pre-render
Post-render
Pre-render
The entire HTML page is constructed server-side and sent across to the client. Here, the JS side is concerned with user interaction, not with the structure of the data.
We are slowly moving away from this type of structure, but currently a large majority of web pages still use it.
Web scraping is relatively easy here, as we can programmatically pull the HTML page and not bother about the JavaScript code that accompanies it.
A combination of requests and BeautifulSoup should work in almost all cases (assuming you can identify the general structure of the document).
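A minimal sketch of that combination (the URL and the tag being extracted are placeholders):

import requests
from bs4 import BeautifulSoup

r = requests.get("https://example.com/articles")  # placeholder URL
soup = BeautifulSoup(r.content, "html.parser")
# pull out whatever structure you identified, e.g. every h2 heading
for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))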
Post-render
Here the HTML page returned from the server is just a "skeleton", with placeholders for the actual data. The data is rendered by the accompanying JS code.
In such cases, if you fetch the source file via, e.g., requests, you will get an empty shell with no data in it.
If you inspect the calls made by a browser while rendering (Chrome's Network tab, Firefox's inspect tool, or the once-popular Firebug), you will most likely see AJAX requests that bring back the actual data from the server.
Depending on how the requests are made, you can hit that AJAX endpoint directly and get the data as JSON.
You can use the response.json() method to extract it into Python dicts.
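A sketch, assuming you have spotted such an endpoint in the network tab (the URL is hypothetical):

import requests

resp = requests.get("https://example.com/api/items?page=1")  # hypothetical AJAX endpoint
data = resp.json()  # parses the JSON body into Python dicts/lists
print(data)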
In certain (rare) cases there is no AJAX call, but the HTML served from the server is still a shell. The actual data is part of the served file, stored within the JS code itself. This can be done for a variety of reasons, for example to feed dynamic data to static JS files, or simply to deter naive attempts at scraping the page.
One approach to scraping such pages is to 'render' the page in a headless browser, which executes the JS code and returns HTML that can be parsed with parsers like BeautifulSoup.
BeautifulSoup has the ability to work with many parsers, one of which is html5lib, which could help with this.
You could also look at Selenium or mechanize; a Selenium sketch follows below.
Or you could try parsing the JS code yourself, which might be faster.
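As an example of the headless-browser route, a sketch using Selenium with headless Chrome (chromedriver is assumed to be installed; the URL is a placeholder):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/app")  # placeholder URL
# page_source now holds the HTML after the JS has executed
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()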
Arriving at a conclusion as to what to use requires careful inspection of how the page is rendered in a browser. Even if you don't see an AJAX request, the HTML served by the server need not be what the browser displays.
A good way to start is by looking at the bare HTML being served, either by downloading the page via curl or requests.get, or simply by rendering it in your browser with JavaScript disabled.
Good luck.
I have a handful of users on a server. After updating the site, they don't see the new pages. Is there a way to globally force their browsers and providers to display the new page? Maybe from settings.py? I see there are decorators that look like they do this on a function level.
Depends on browser and cache settings.
There may be no way to tell browsers to do so (since the pages are cached, they are not even talking to the server, so there is nothing you can do there).
A good trick is to set the Vary: Cookie header, so you can always invalidate the cache (by changing a cookie somewhere) if needed.
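In Django this can be done per view with a decorator; a minimal sketch (my_view is a placeholder name):

from django.http import HttpResponse
from django.views.decorators.vary import vary_on_cookie

@vary_on_cookie  # adds 'Vary: Cookie' to the response headers
def my_view(request):
    return HttpResponse("...")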
One way to force the browser to load a new page rather than the cached version is to change the file name. You could add a date/time to the file name and use a rewrite rule (assuming an Apache web server here) to serve the new page.
This site gives a quick explanation: http://www.askapache.com/htaccess/mod_rewrite-fix-for-caching-updated-files.html
and Google will show many more.
You may also have to examine your cache-control headers.
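The function-level decorators mentioned in the question are likely the ones in django.views.decorators.cache; a minimal sketch (my_view is a placeholder name):

from django.http import HttpResponse
from django.views.decorators.cache import never_cache

@never_cache  # sends headers telling browsers and proxies not to cache this response
def my_view(request):
    return HttpResponse("...")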