How do I parse my own HTML files in Flask? - flask

I am trying to create a sitemap for my website instead of having to run it through a website that will make it for me. This is because the website changes quite often.
I found code online that achieves part of it:
from flask import render_template, make_response
import datetime

@app.route('/sitemap.xml', methods=['GET'])
def sitemap():
    """Generate sitemap.xml: a list of URLs and their last-modified date."""
    try:
        pages = []
        seven_days_ago = (datetime.datetime.now() - datetime.timedelta(days=7)).date().isoformat()
        for rule in app.url_map.iter_rules():
            # Only include routes that answer GET and take no URL arguments.
            if "GET" in rule.methods and len(rule.arguments) == 0:
                pages.append(["..." + str(rule.rule), seven_days_ago])
        sitemap_xml = render_template('pages/sitemap_template.xml', pages=pages)
        response = make_response(sitemap_xml)
        response.headers["Content-Type"] = "application/xml"
        return response
    except Exception as e:
        return str(e)
It works to create a basic sitemap. Okay, easy enough.
I want to add a priority in the meta tags of each page and then build the sitemap off of that. This SO question/answer covers that, but it uses BeautifulSoup and urllib and is geared toward fetching pages over the web, not a local instance.
So I figure I need to render_template each route (in this case, rule.rule), parse the result, maybe with BeautifulSoup, and read the priority from it. I have no idea how to do this. Is there a way to GET each template for its route so I can parse it?

I ended up using beautifulsoup4 on those routes. It's a shame that Flask doesn't seem to offer a more direct way to do this.
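For anyone landing here, a minimal sketch of that approach: Flask's test client can GET each route locally (no network involved), and BeautifulSoup can then read the priority out of the rendered page. The <meta name="priority" content="..."> tag is an assumed convention here, not anything Flask or the sitemap spec defines.

    from bs4 import BeautifulSoup

    def page_priority(path, default="0.5"):
        """Render the route through Flask's test client and read its priority meta tag."""
        with app.test_client() as client:
            response = client.get(path)
        soup = BeautifulSoup(response.data, "html.parser")
        # <meta name="priority" content="0.8"> is an assumed convention in the templates.
        tag = soup.find("meta", attrs={"name": "priority"})
        return tag["content"] if tag else default

Inside the sitemap() loop you could then append ["..." + str(rule.rule), seven_days_ago, page_priority(rule.rule)] and emit the extra value from the XML template.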

Related

Django syndication framework: prevent appending SITE_ID to the links

According to the documentation here: https://djangobook.com/syndication-feed-framework/
If link doesn’t return the domain, the syndication framework will insert the domain of the current site, according to your SITE_ID setting.
However, I'm trying to generate a feed of magnet: links. The framework doesn't recognize this and attempts to append the SITE_ID, such that the links end up like this (on localhost):
<link>http://localhost:8000magnet:?xt=...</link>
Is there a way to bypass this?
Here's a way to do it with monkey patching, which is much cleaner than overriding get_feed (see the other answer below).
I like to create a separate "django_patches" folder for these kinds of things:
myproject/django_patches/__init__.py
from django.contrib.syndication import views
from django.contrib.syndication.views import add_domain

def add_domain_if_we_should(domain, url, secure=False):
    # Leave magnet: links untouched; everything else gets the normal domain handling.
    if url.startswith('magnet:'):
        return url
    else:
        return add_domain(domain, url, secure=secure)

views.add_domain = add_domain_if_we_should
Next, add the app to your INSTALLED_APPS so the patch is imported (and applied) at startup.
settings.py
INSTALLED_APPS = [
    'django_patches',
    ...
]
This is a bit gnarly, but here's a potential solution if you don't want to give up on the Django framework:
The problem is that add_domain is called from deep inside a huge method in the syndication framework, and I don't see a clean way to override it. Since add_domain is used for both the feed URL and the feed items, a monkey patch would need to account for both.
Django source:
https://github.com/django/django/blob/master/django/contrib/syndication/views.py#L178
Steps:
1: Subclass the Feed class you're using and do a copy-paste override of the huge method get_feed
2: Modify the line:
link = add_domain(
    current_site.domain,
    self._get_dynamic_attr('item_link', item),
    request.is_secure(),
)
To something like:
link = self._get_dynamic_attr('item_link', item)
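In code, those two steps amount to something like the sketch below (the class name is illustrative; the body of get_feed has to be copied from the Django source linked above):

    from django.contrib.syndication.views import Feed

    class MagnetFeed(Feed):  # illustrative name
        def get_feed(self, obj, request):
            # Paste the body of django.contrib.syndication.views.Feed.get_feed
            # here verbatim, changing only the item link line as shown above:
            #     link = self._get_dynamic_attr('item_link', item)
            ...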
I did end up digging through the syndication source code, found no easy way to override it, and did some hacky monkey patching. (Unfortunately I did this before I saw the answers posted here, all of which I assume will work about as well as this one.)
Here's how I did it:
def item_link(self, item):
    # adding http:// means the internal get_feed won't modify it
    return "http://" + item.magnet_link

def get_feed(self, obj, request):
    # hacky way to bypass the domain handling
    feed = super().get_feed(obj, request)
    for item in feed.items:
        # strip that http:// we added above
        item['link'] = item['link'][7:]
    return feed
For future readers, this was as of Django 2.0.1. Hopefully in a future patch they allow support for protocols like magnet.

How to crawl multiple domains using single crawler?

How can I crawl data from multiple domains using a single crawler. I have done crawling of single sites using beautiful soup but couldn't figure out how to create a generic one.
Well, this question is flawed as stated; the sites you want to scrape have to have something in common. For instance:
from bs4 import BeautifulSoup
from urllib import request

for counter in range(0, 10):
    # Takes the website you typed and stores it in the site variable
    # (use raw_input instead of input on Python 2).
    site = input("Type the name of your website: ")
    # Makes a request to the site that we stored in the variable
    make_request_to_site = request.urlopen(site).read()
    # We pass it through the BeautifulSoup parser, in this case html.parser
    soup = BeautifulSoup(make_request_to_site, "html.parser")
    # Next we loop over all links in the page
    for link in soup.findAll('a'):
        print(link['href'])
As mentioned, each site has its own distinct markup and selectors (element classes, IDs, and so on). A single generic crawler won't be able to go to a URL and intuitively understand what to scrape.
BeautifulSoup might not be the best choice for this type of task. Scrapy is another web crawling library that's a bit more robust than BS4.
Similar question here on stackoverflow: Scrapy approach to scraping multiple URLs
Scrapy Documentation:
https://doc.scrapy.org/en/latest/intro/tutorial.html
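For reference, a minimal multi-site Scrapy spider might look like the sketch below; the domains are placeholders, and each site would still need its own extraction logic in parse():

    import scrapy

    class MultiSiteSpider(scrapy.Spider):
        name = "multisite"
        # Placeholder domains; replace with the sites you actually want to crawl.
        start_urls = [
            "https://example.com",
            "https://example.org",
        ]

        def parse(self, response):
            # Collect every link on the page; per-site parsing logic would go here.
            for href in response.css("a::attr(href)").extract():
                yield {"page": response.url, "link": response.urljoin(href)}

You could run it with scrapy runspider multisite_spider.py -o links.json (the filename is illustrative).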

Get a list of all URLs on the Internet that match a regex

I would like to get a list of all URLs on the Internet that match a regular expression.
E.g. I'd like to know all the unique page URLs under site:jobs.lever.co that return a 200 OK. E.g. https://jobs.lever.co/reddit is a good result, https://jobs.lever.co/reddit?utm_source=fff and https://jobs.lever.co/bl4hbl4h are bad results.
Any tips/tricks for getting this data? Thank you!
Have a look at Scrapy if you're looking for a brilliant crawler (in Python) and are willing to invest some time in fine-tuning. It uses a combination of XPath and CSS queries.
A possible spider could look like the following code:
import scrapy

class leverSpider(scrapy.Spider):
    name = 'lever'
    start_urls = (
        'https://jobs.lever.co/reddit',
    )

    def parse(self, response):
        # get all posting links
        for link in response.xpath("//a[@class='posting-title']/@href").extract():
            # do sth. with the link, e.g. parse the item
            yield scrapy.Request(link, self.parse_item)

    def parse_item(self, response):
        # do sth. useful with your link here
        pass
You would start it with scrapy crawl lever; it searches for links with the class posting-title and requests those pages as well.
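To match the constraints in the question (unique page URLs, no query-string variants, only 200 OK), you could normalise the URLs in the callbacks; a sketch, assuming the spider above:

    from urllib.parse import urlsplit, urlunsplit

    def clean_url(url):
        # Drop the query string and fragment so ?utm_source=... variants collapse.
        parts = urlsplit(url)
        return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

    # Replacement for parse_item inside the spider above:
    def parse_item(self, response):
        # Scrapy's HttpError middleware drops non-2xx responses by default,
        # so anything reaching this callback answered 200 OK.
        yield {"url": clean_url(response.url)}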

Scraping data off flipkart using scrapy

I am trying to scrape some information from flipkart.com, and for this purpose I am using Scrapy. The information I need is for every product on Flipkart.
I have used the following code for my spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import FlipkartItem

class WebCrawler(CrawlSpider):
    name = "flipkart"
    allowed_domains = ['flipkart.com']
    start_urls = ['http://www.flipkart.com/store-directory']
    rules = [
        # Product pages: scrape them.
        Rule(LinkExtractor(allow=['/(.*?)/p/(.*?)']), 'parse_flipkart', cb_kwargs=None, follow=True),
        # Category pages: just follow them.
        Rule(LinkExtractor(allow=['/(.*?)/pr?(.*?)']), follow=True)
    ]

    @staticmethod
    def parse_flipkart(response):
        hxs = HtmlXPathSelector(response)
        item = FlipkartItem()
        item['featureKey'] = hxs.select('//td[@class="specsKey"]/text()').extract()
        yield item
My intent is to crawl through every product category page (matched by the second rule) and follow the product pages (matched by the first rule) within each category page to scrape data from them.
One problem is that I cannot find a way to control the crawling and scraping.
Second, Flipkart uses AJAX on its category pages and displays more products when a user scrolls to the bottom.
I have read other answers and gathered that Selenium might help solve the issue, but I cannot find a proper way to fit it into this structure.
Suggestions are welcome..:)
ADDITIONAL DETAILS
I had earlier used a similar approach; the second rule I used was
Rule(LinkExtractor(allow=['/(.*?)/pr?(.*?)']), 'parse_category', follow=True)
@staticmethod
def parse_category(response):
    hxs = HtmlXPathSelector(response)
    # Number of items in the category (extracted text converted to an int).
    count = int(hxs.select('//td[@class="no_of_items"]/text()').extract()[0])
    for num in range(1, count, 15):
        ajax_url = response.url + "&start=" + str(num) + "&ajax=true"
        yield Request(ajax_url, callback="parse_category")
Now I am confused about what to use for the callback: "parse_category" or "parse_flipkart".
Thank you for your patience
Not sure what you mean when you say that you can't find a way to control the crawling and scraping. Creating a spider for this purpose is already taking it under control, isn't it? If you create proper rules and parse the responses properly, that is all you need.
In case you are referring to the actual order in which the pages are scraped, you most likely don't need to do this. You can just parse all the items in whichever order, but gather their location in the category hierarchy by parsing the breadcrumb information above the item title. You can use something like this to get the breadcrumb in a list:
response.css(".clp-breadcrumb").xpath('./ul/li//text()').extract()
You don't actually need Selenium, and I believe it would be overkill for this simple issue. Using your browser (I'm using Chrome currently), press F12 to open the developer tools. Go to one of the category pages and open the Network tab in the developer window; if anything is already listed there, click the Clear button to tidy things up.
Now scroll down until you see that additional items are being loaded, and additional requests will appear in the Network panel. Filter them by Documents and click on one of those requests; you can see the URL being requested and the query parameters that are sent. Note the start parameter, which is the most important one: you will have to issue this request repeatedly while increasing its value to get new items. You can check the response in the Preview pane, and you will see that what the server returns is exactly what you need, more items. The rule you use for the items should pick up those links too.
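A rough sketch of that idea, wired into the spider from the question (the &start/&ajax parameters and the no_of_items XPath come from the question itself and may no longer match Flipkart's markup):

    from scrapy.http import Request

    # Method on the CrawlSpider from the question, used as the category rule's callback.
    def parse_category(self, response):
        # Total number of items in this category, using the selector from the question.
        counts = response.xpath('//td[@class="no_of_items"]/text()').re(r'\d+')
        count = int(counts[0]) if counts else 0
        for start in range(1, count, 15):
            # Same AJAX endpoint the page requests while scrolling. Yielding the
            # request without an explicit callback hands the response to
            # CrawlSpider's default parse, so the product-page rule can pick up
            # the new links from it.
            yield Request("%s&start=%d&ajax=true" % (response.url, start))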
For a more detailed overview of scraping with Firebug, you can check out the official documentation.
Since there is no need to use Selenium for your purpose, I shall not cover this point more than adding a few links that show how to use Selenium with Scrapy, if the need ever occurs:
https://gist.github.com/cheekybastard/4944914
https://gist.github.com/irfani/1045108
http://snipplr.com/view/66998/

Django: creating/modifying the request object

I'm trying to build a URL-alias app which allows the user to create aliases for existing URLs on his website.
I'm trying to do this via middleware, where request.META['PATH_INFO'] is checked against database records of aliases:
try:
    src = request.META['PATH_INFO']
    alias = Alias.objects.get(src=src)
    view = get_view_for_this_path(request)
    return view(request)
except Alias.DoesNotExist:
    pass
return None
However, for this to work correctly it is vitally important that (at least) PATH_INFO is changed to the destination path.
Now there are some snippets which allow the developer to create request objects for testing (http://djangosnippets.org/snippets/963/, http://djangosnippets.org/snippets/2231/), but these state that they are intended for testing purposes.
Of course, it could be that these snippets are fit for use in a live environment, but my knowledge of Django request processing is too undeveloped to assess this.
Instead of the approach you're taking, have you considered the Redirects app?
It won't invisibly alias the path /foo/ to return the view bar(), but it will redirect /foo/ to /bar/
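For reference, wiring up the redirects app only takes a couple of settings; the redirects themselves are then managed as database records through the admin:

    # settings.py
    SITE_ID = 1

    INSTALLED_APPS = [
        # ...
        'django.contrib.sites',
        'django.contrib.redirects',
    ]

    MIDDLEWARE = [  # MIDDLEWARE_CLASSES on older Django versions
        # ...
        'django.contrib.redirects.middleware.RedirectFallbackMiddleware',
    ]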
(posted as answer because comments do not seem to support linebreaks or other markup)
Thanks for the advice; I have the same feeling about modifying request attributes. There must be a reason the Django manual states that they should be considered read-only.
I came up with this middleware:
def process_request(self, request):
    try:
        obj = A.objects.get(src=request.path_info.rstrip('/')) # The alias record.
        view, args, kwargs = resolve_to_func(obj.dst + '/') # Modified http://djangosnippets.org/snippets/2262/
        request.path = request.path.replace(request.path_info, obj.dst)
        request.path_info = obj.dst
        request.META['PATH_INFO'] = obj.dst
        request.META['ROUTED_FROM'] = obj.src
        request.is_routed = True
        return view(request, *args, **kwargs)
    except A.DoesNotExist: # No alias for this path.
        request.is_routed = False
    except TypeError: # View does not exist.
        pass
    return None
But considering the objections to modifying the request's attributes, wouldn't it be a better solution to skip that part and only add the is_routed flag and a ROUTED_TO key (instead of ROUTED_FROM)?
Code that relies on the original path could then keep using request.path, and anything that needs the destination could read that key from META; see the sketch below.
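A sketch of that leaner variant, reusing the names from the middleware above (A and resolve_to_func), which leaves path and path_info untouched:

    def process_request(self, request):
        try:
            obj = A.objects.get(src=request.path_info.rstrip('/'))
        except A.DoesNotExist: # No alias for this path.
            request.is_routed = False
            return None
        try:
            view, args, kwargs = resolve_to_func(obj.dst + '/')
        except TypeError: # View does not exist.
            return None
        # Leave path/path_info alone; only annotate the request.
        request.is_routed = True
        request.META['ROUTED_TO'] = obj.dst
        return view(request, *args, **kwargs)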
Doing this using URLConfs is not possible, because this aliasing is aimed at enabling the end-user to configure his own URLs, with the assumption that the end-user has no access to the codebase or does not know how to write his own URLConf.
Though it would be possible to write a function that converts a user-readable-editable file (XML for example) to valid Django urls, it feels that using database records allows a more dynamic generation of aliases (other objects defining their own aliases).
Sorry to necro-post, but I just found this thread while searching for answers. My solution seems simpler. Maybe a) I'm depending on newer django features or b) I'm missing a pitfall.
I encountered this because there is a bot named "Mediapartners-Google" which asks for pages with URL parameters still percent-encoded, as if from a naive scrape (or double-encoded, depending on how you look at it), i.e. I have 404s in my log from it that look like:
1.2.3.4 - - [12/Nov/2012:21:23:11 -0800] "GET /article/my-slug-name%3Fpage%3D2 HTTP/1.1" 1209 404 "-" "Mediapartners-Google
Normally I'd just ignore a broken bot, but this one I want to appease because it ought to better target our ads (it's Google AdSense's bot), resulting in better revenue, if it can see our content. Rumor is it doesn't follow redirects, so I wanted to find a solution similar to the original question. I do not want regular clients accessing pages by these broken URLs, so I detect the user agent; other applications probably won't need to do that.
I agree a redirect would normally be the right answer.
My (complete?) solution:
from django.http import QueryDict
from django.core.urlresolvers import Resolver404, resolve

class MediapartnersPatch(object):
    def process_request(self, request):
        # short-circuit asap: only rewrite requests from the AdSense bot
        if request.META['HTTP_USER_AGENT'] != 'Mediapartners-Google':
            return None
        idx = request.path.find('?')
        if idx == -1:
            return None
        oldpath = request.path
        newpath = oldpath[0:idx]
        try:
            url = resolve(newpath)
        except Resolver404:
            return None
        request.path = newpath
        request.GET = QueryDict(oldpath[idx+1:])
        response = url.func(request, *url.args, **url.kwargs)
        response['Link'] = '<%s>; rel="canonical"' % (oldpath,)
        return response