CSV parsing program & how to flatten multiple lists into a single list

I've been working on a small program that needs to do the following:
Take a CSV file 'domains_prices.csv' with a column of domains and a price for each, e.g.:
http://www.example1.com,$20
http://www.example2.net,$30
and so on
and then a second file 'orders_list.csv', which is just a single column of blog post URLs from the same domains listed in the first file, e.g.:
http://www.example2.net/blog-post-1
http://www.example1.com/some-article
http://www.example3.net/blog-post-feb-19
and so on
I need to check the full URLs from orders_list against the domains in the first file, look up the price of a blog post on that domain, and then output all the blog post URLs into a new file with the price for each, e.g.:
http://www.example2.net/blog-post-1, $20
and then there would be a total amount at the end of the output file.
My plan was to create a dict for domains_prices with the domain as key and the price as value, put all the URLs from orders_list into a list, and then compare the elements of that list against the keys of the dict.
This is my code. I'm stuck towards the end: parsed_orders_list seems to be returning each URL as its own little list, so I'm thinking I should put all those URLs into a single list?
The commented-out code at the end is the operation I intend to run once I have the correct list of URLs, to compare them against the keys and values of the dict; I'm not sure if that's correct either.
Please note this is also my first ever full Python program created from scratch, so if it's horrendous, that's why :)
import csv
from urlparse import urlparse

# get the csv file with all domains and prices in
reader = csv.reader(open("domains_prices.csv", 'r'))
# get all the completed blog post urls
reader2 = csv.reader(open('orders_list.csv', 'r'))

domains_prices = {}
orders_list = []

for row in reader2:
    # put the blog post urls into a list
    orders_list.append(','.join(row))

for domain, price in reader:
    # strip the domains
    domain = domain.replace('http://', '').replace('/', '')
    # insert the domains and prices into the dictionary
    domains_prices[domain] = price

for i in orders_list:
    # iterate over the blog post urls in orders_list and
    # parse them with urlparse
    data = urlparse(i)
    # use netloc to get just the domain from each blog post url
    parsed_orders = data.netloc
    parsed_orders_list = parsed_orders.split()
    print parsed_orders_list

"""
for k in parsed_orders:
    if k in domains_prices:
        print k, domains_prices[k]
"""

With the help of someone else I've figured it out; I made the following changes to the 'for i in orders_list' section:
parsed_orders = []

for i in orders_list:
    # iterate over the blog post urls in orders_list and
    # parse them with urlparse
    data = urlparse(i)
    # use netloc to get just the domain from each blog post url,
    # then put each netloc into a list
    parsed_orders.append(data.netloc)

# print parsed_orders - to check that I'm getting a list of netloc domains back

# iterate over the list of netlocs and the dict of domains and prices to match them up
for k in parsed_orders:
    if k in domains_prices:
        print k, domains_prices[k]
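To finish the program off, here is a rough sketch of the output step described above (untested; 'orders_with_prices.csv' is just a placeholder name, and it assumes every price in the dict keeps its leading '$'):

total = 0
with open('orders_with_prices.csv', 'w') as out:
    # orders_list and parsed_orders were built in the same order,
    # so zip pairs each full blog post url with its netloc
    for url, domain in zip(orders_list, parsed_orders):
        if domain in domains_prices:
            price = domains_prices[domain]
            out.write(url + ', ' + price + '\n')
            total += float(price.lstrip('$'))
    out.write('Total, $' + str(total) + '\n')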

Related

Having trouble creating a list of urls by slicing

So I've been stuck on this homework problem and don't seem to be getting anywhere :/ ... I'm trying to create a new URL list that contains the proper URLs for the top 10 movies. Here's my code so far:
tree = html.fromstring(response.content)
titles = [tree.xpath("//a/text()")]
urls = [tree.xpath("//td[@class='titleColumn']//a/@href")]
top10_urls = urls[:10]
top10_urls_fixed = []
for t in top10_urls:
    if len(t) > 0:
        t = "https://www.imdb.com" + urls
        top10_urls_fixed.append(t)
My URLs are currently displaying like '/title/tt0111161/', and I'm trying to insert 'https://www.imdb.com' in front of each URL so they look like 'https://www.imdb.com/title/tt0111161/'.
With labs being online now my professor never answers his emails and I'm stuck waiting all day, any help would be amazing TT-TT
Looks like there is a mistake here:
t = "https://www.imdb.com"+ urls
Should be:
t = "https://www.imdb.com"+ t
Here's a working example with IMDb. We extract the URL of each recommendation from a starting page. After getting the @href attributes (stored in a list), we prepend the beginning of the URL with the + operator inside a list comprehension.
from lxml import html
import requests

page = requests.get('https://www.imdb.com/title/tt0111161/')
tree = html.fromstring(page.content)
movie_recs = tree.xpath('//div[@class="rec_overlay"]/following::a[1]/@href')
urls = ["https://www.imdb.com" + el for el in movie_recs]
print(urls)
Output: a list of the full recommendation URLs.
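Applying the same fix to the original snippet (and dropping the extra brackets around tree.xpath, since they wrap the whole result in a one-element list and break the [:10] slice), it would look something like this:

urls = tree.xpath("//td[@class='titleColumn']//a/@href")  # list of relative hrefs
top10_urls = urls[:10]
top10_urls_fixed = []
for t in top10_urls:
    if len(t) > 0:
        # prepend the site root to each relative path
        top10_urls_fixed.append("https://www.imdb.com" + t)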

Soup.find and findAll unable to find table elements on hockey-reference.com

I'm just a beginner at web scraping and Python in general, so I'm sorry if the answer is obvious, but I can't figure out why I'm unable to find any of the table elements on https://www.hockey-reference.com/leagues/NHL_2018.html.
My initial thought was that this was a result of the whole div being commented out, so, following some advice I found here in another similar post, I removed the comment characters and confirmed that they were gone when I saved soup.text to a text file and searched it. I was still unable to find any tables, however.
In trying to search a little further, I took the ID out of my .find and did a findAll, and the table was still coming up empty.
Here's the code I was trying to use, any advice is much appreciated!
import csv
import re
import requests
from bs4 import BeautifulSoup

comm = re.compile("<!--|-->")
url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(comm.sub("", html))
table = soup.find('table', id="stats")
When searching for all of the table elements I was using
table = soup.findAll('table')
I'm also aware that there is a CSV version on the site; I was just eager to practice.
Give a parser along with your markup, for example BeautifulSoup(html, 'lxml'). Try the code below:
import requests
from bs4 import BeautifulSoup

url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
table = soup.findAll('table')
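If some tables are still missing after that, hockey-reference wraps several of them in HTML comments (the issue you suspected), so the comment-stripping regex from your code can be combined with the parser. A rough sketch:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
response = requests.get(url)

# remove the comment markers so the commented-out tables are visible to the parser
comm = re.compile("<!--|-->")
soup = BeautifulSoup(comm.sub("", response.text), 'lxml')

tables = soup.findAll('table')
print(len(tables))
table = soup.find('table', id="stats")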

Web crawling and extracting data using Scrapy

I am new to Python as well as Scrapy.
I am trying to crawl a seed URL, https://www.health.com/patients/status/. This seed URL contains many URLs, but I want to fetch only the URLs that contain Faci/Details/#somenumber. The URLs look like this:
https://www.health.com/patients/status/ -> https://www.health.com/Faci/Details/2
                                        -> https://www.health.com/Faci/Details/3
                                        -> https://www.health.com/Faci/Details/4

https://www.health.com/Faci/Details/2 -> https://www.health.com/provi/details/64
                                      -> https://www.health.com/provi/details/65

https://www.health.com/Faci/Details/3 -> https://www.health.com/provi/details/70
                                      -> https://www.health.com/provi/details/71
Inside each https://www.health.com/Faci/Details/2 page there are links like https://www.health.com/provi/details/64 and https://www.health.com/provi/details/65, and so on. Finally, I want to fetch some data from the https://www.health.com/provi/details/#somenumber URLs. How can I achieve this?
As of now I have tried the code below from the Scrapy tutorial and am able to crawl only the URLs that contain https://www.health.com/Faci/Details/#somenumber. It's not going to https://www.health.com/provi/details/#somenumber. I tried to set a depth limit in the settings.py file, but it didn't work.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from news.items import NewsItem

class MySpider(CrawlSpider):
    name = 'provdetails.com'
    allowed_domains = ['health.com']
    start_urls = ['https://www.health.com/patients/status/']

    rules = (
        Rule(LinkExtractor(allow=('/Faci/Details/\d+', )), follow=True),
        Rule(LinkExtractor(allow=('/provi/details/\d+', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = NewsItem()
        item['id'] = response.xpath("//title/text()").extract()
        item['name'] = response.xpath("//title/text()").extract()
        item['description'] = response.css('p.introduction::text').extract()
        filename = 'details.txt'
        with open(filename, 'wb') as f:
            f.write(item)
        self.log('Saved file %s' % filename)
        return item
Please help me to proceed further.
To be honest, the regex-based and mighty Rule/LinkExtractor has often given me a hard time. For a simple project, one approach is to extract all links on the page and then look at the href attribute. If the href matches your needs, yield a new Request object for it. For instance:
from scrapy.http import Request
from scrapy.selector import Selector

...

# follow links
for href in sel.xpath('//div[@class="contentLeft"]//div[@class="pageNavigation nobr"]//a').extract():
    linktext = Selector(text=href).xpath('//a/text()').extract_first()
    if linktext == "Weiter":
        link = Selector(text=href).xpath('//a/@href').extract()[0]
        url = response.urljoin(link)
        print(url)
        yield Request(url, callback=self.parse)
Some remarks on your code:
response.xpath(...).extract()
This will return a list; maybe you want to have a look at extract_first(), which provides the first item (or None).
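A quick illustration of the difference (the title XPath is taken from your spider; the result strings are just examples):

response.xpath("//title/text()").extract()        # ['Some page title'] - always a list
response.xpath("//title/text()").extract_first()  # 'Some page title', or None if nothing matched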
with open(filename, 'wb') as f:
This will overwrite the file several times, so you will only keep the last item saved. Also, you open the file in binary mode ('b'); from the filename I guess you want to read it as text? Use 'a' to append. See the open() docs.
An alternative is to use the -o flag to use Scrapy's facilities for storing the items as JSON or CSV.
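For example (items.json is just a placeholder filename):

scrapy crawl provdetails.com -o items.json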
return item
It is good style to yield items instead of returning them. At the very least, if you need to create several items from one page, you have to yield them.
Another good approach is to use one parse() function per type/kind of page.
For instance, every page in start_urls will end up in parse(). From there you could extract the links and yield a Request for each /Faci/Details/N page with the callback parse_faci_details(). In parse_faci_details() you again extract the links of interest, create Requests and pass them via callback= to e.g. parse_provi_details().
In this last function you create the items you need.
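A rough sketch of that structure (the substring checks for picking out links are simple placeholders you would adapt; the item fields are copied from your spider):

import scrapy
from news.items import NewsItem

class MySpider(scrapy.Spider):
    name = 'provdetails.com'
    allowed_domains = ['health.com']
    start_urls = ['https://www.health.com/patients/status/']

    def parse(self, response):
        # follow every /Faci/Details/N link found on the seed page
        for href in response.xpath('//a/@href').extract():
            if '/Faci/Details/' in href:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_faci_details)

    def parse_faci_details(self, response):
        # follow every /provi/details/N link found on the facility page
        for href in response.xpath('//a/@href').extract():
            if '/provi/details/' in href:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_provi_details)

    def parse_provi_details(self, response):
        # build the item from the provider detail page
        item = NewsItem()
        item['id'] = response.xpath("//title/text()").extract_first()
        item['name'] = response.xpath("//title/text()").extract_first()
        item['description'] = response.css('p.introduction::text').extract()
        yield item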

How to scrape through search results spanning multiple pages with lxml

I'm using lxml to scrape a site. I want to scrape a search result that contains 194 items, but my scraper is able to scrape only the first page of the results. How can I scrape the rest of the search results?
import requests
from lxml import html
from lxml.cssselect import CSSSelector

url = 'http://www.alotofcars.com/new_car_search.php?pg=1&byshowroomprice=0.5-500&bycity=Gotham'
response_object = requests.get(url)
# Build DOM tree
dom_tree = html.fromstring(response_object.text)
After this there are scraping functions
def enter_mmv_in_database(dom_tree, engine):
    # Getting make, model, variant
    name_selector = CSSSelector('[class="secondary-cell"] p a')
    name_results = name_selector(dom_tree)
    for n in name_results:
        mmv = str(`n.text_content()`).split('\\xa0')
        make, model, variant = mmv[0][2:], mmv[1], mmv[2][:-2]
        # Now push make, model, variant in Database
        print make, model, variant
Looking at the list I receive, I can see that only the first page of search results is parsed. How can I parse the whole search result?
I've tried to navigate through that website but it seems to be offline. Yet, I would like to help with the logic.
What I usually do is:
Make a request to the search URL (with parameters filled)
With lxml, extract the number of the last available page from the pagination div.
Loop from the first page to the last one, making requests and scraping the desired data:
for page_number in range(1, last + 1):
    # make requests replacing 'page_number' in the 'pg' GET variable
    url = "http://www.alotofcars.com/new_car_search.php?pg={}&byshowroomprice=0.5-500&bycity=Gotham".format(page_number)
    response_object = requests.get(url)
    dom_tree = html.fromstring(response_object.text)
    ...
    ...
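For the second step, a minimal sketch of getting the last page number; the "pagination" class name is only an assumption, so check the site's actual pagination div for the real selector:

# assumption: the page links live in a div with class "pagination"
page_numbers = [int(t) for t in dom_tree.xpath('//div[@class="pagination"]//a/text()') if t.strip().isdigit()]
last = max(page_numbers) if page_numbers else 1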
I hope this helps. Let me know if you have any further questions.

New to Python web scraping

I am new to web scraping
from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.text)

# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

print 'Buyers: ', buyers
print 'Prices: ', prices
I got this example, but I want to navigate to another page to grab its content.
Example:
www.example.com/category/report
navigate to
www.example.com/category/report/annual
I have to grab a PDF, XLS, or CSV file and save it on my system.
Please suggest the current trending Python scraping techniques using XPath and regular expressions.
I am waiting for the answer.
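As a starting point for the download part, here is a minimal sketch with lxml and requests that collects and saves any PDF/XLS/CSV links from the example page above (the domain, page path, and link structure are all assumptions):

from lxml import html
import requests

BASE = 'http://www.example.com'  # placeholder domain from the question

page = requests.get(BASE + '/category/report/annual')
tree = html.fromstring(page.content)

# collect every link whose href ends in .pdf, .xls or .csv
for href in tree.xpath('//a/@href'):
    if href.lower().endswith(('.pdf', '.xls', '.csv')):
        file_url = href if href.startswith('http') else BASE + href
        r = requests.get(file_url)
        # save the file under its original name
        with open(file_url.split('/')[-1], 'wb') as f:
            f.write(r.content)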