new to python web scraping - python-2.7

I am new to web scraping
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.text)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')
print 'Buyers: ', buyers
print 'Prices: ', prices
I got this example working, but I want to navigate to another page to grab content.
Example:
www.example.com/category/report
navigate to
www.example.com/category/report/annual
I have to grab a PDF, XLS, or CSV file and save it on my system. Please suggest a current Python scraping approach using XPath and regular expressions.
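A minimal sketch of that idea with requests and lxml, assuming the annual-report page links directly to the files; the URL and the link pattern below are placeholders, not the real site structure:

from lxml import html
from urlparse import urljoin  # Python 2.7; on Python 3 use urllib.parse
import re
import requests

# hypothetical page that links to the downloadable reports
page_url = 'http://www.example.com/category/report/annual'
page = requests.get(page_url)
tree = html.fromstring(page.content)

# collect every link, then keep those whose href ends in .pdf, .xls or .csv
for link in tree.xpath('//a/@href'):
    if not re.search(r'\.(pdf|xls|csv)$', link, re.IGNORECASE):
        continue
    file_url = urljoin(page_url, link)
    filename = file_url.split('/')[-1]
    response = requests.get(file_url, stream=True)
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)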

Related

Issue scraping website with bs4 (beautiful soup) python 2.7

I am attempting to write a simple Python web scraping script for Google Trends and am running into an issue when grabbing elements by class:
from bs4 import BeautifulSoup
import requests
results = requests.get("https://trends.google.com/trends/trendingsearches/daily?geo=US")
soup = BeautifulSoup(results.text, 'lxml')
keyword_list = soup.find_all('.details-top')
for keyword in keyword_list:
    print(keyword)
When printing keyword I receive an empty result, but when I print soup I receive the entire HTML document. My goal is to print out the text of each "Keyword" that was searched on the page https://trends.google.com/trends/trendingsearches/daily?geo=AU, which has a list of results:
1. covid-19
2. Woolworths jobs
If you use Google developer tools, select inspect, and hover over a title, you will see div.details-top.
How would I print just the text of the title of each result?
I can see that data being dynamically retrieved from an API call in the dev tools network tab. You can issue an XHR request to that URL and then use a regex on the response text to parse out the query titles.
import requests, re

# request the dailytrends API endpoint directly
r = requests.get('https://trends.google.com/trends/api/dailytrends?hl=en-GB&tz=0&geo=AU&ns=15').text
# pull out every "query" value from the response text
p = re.compile(r'"query":"(.*?)"')
titles = p.findall(r)
print(titles)  # on Python 2.7 use: print titles
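If the regex ever misses, the same response can also be parsed as JSON. The endpoint prefixes its body with a ")]}'," guard line that has to be stripped first; the key path below is an assumption based on the response shape at the time of writing:

import json
import requests

raw = requests.get('https://trends.google.com/trends/api/dailytrends?hl=en-GB&tz=0&geo=AU&ns=15').text
# the API prepends ")]}'," before the JSON payload, so cut the string at the first '{'
payload = json.loads(raw[raw.index('{'):])
# assumed structure: default -> trendingSearchesDays -> trendingSearches -> title -> query
for day in payload['default']['trendingSearchesDays']:
    for search in day['trendingSearches']:
        print(search['title']['query'])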

Working with Scrapy 'regex definition'

I have been trying to write a script to scrape data from the website https://services.aamc.org/msar/home#null. I wrote a Python 2.7 script to get a piece of text from the website (I am aiming for anything at this point), but cannot seem to get it to work. I suspect this is because I have not written my regex properly to identify the tag I am trying to scrape. Does anyone have any idea what I might be doing wrong and how to fix it?
Much appreciated.
Matt
import urllib
import re
url = "https://services.aamc.org/msar/home#null"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
regex = '<td colspan="2" class="schoolLocation">(.+?)</td>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print "the school location is ",price
First of all, don't use regular expressions to parse HTML. There are specialized tools called HTML parsers, like BeautifulSoup or lxml.html.
Actually, that advice is not all that relevant to this particular problem, since there is no need to parse HTML here. The search results on this page are loaded dynamically from a separate endpoint: the browser sends an XHR request, receives a JSON response, parses it, and displays the search results with the help of JavaScript executed in the browser. urllib is not a browser and gives you only the initial page HTML, in which the search results container is empty.
What you need to do is simulate that XHR request in your code. Let's use the requests package. Complete working code, printing a list of school programs:
import requests

url = "https://services.aamc.org/msar/home#null"
search_url = "https://services.aamc.org/msar/search/resultData"

with requests.Session() as session:
    session.get(url)  # visit main page

    # search
    data = {
        "start": "0",
        "limit": "40",
        "sort": "",
        "dir": "",
        "newSearch": "true",
        "msarYear": ""
    }
    response = session.post(search_url, data=data)

    # extract search results
    results = response.json()["searchResults"]["rows"]
    for result in results:
        print(result["schoolProgramName"])
Prints:
Albany Medical College
Albert Einstein College of Medicine
Baylor College of Medicine
...
Howard University College of Medicine
Howard University College of Medicine Joint Degree Program
Icahn School of Medicine at Mount Sinai
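The request above only asks for the first 40 rows. If the full list is needed, the same "start"/"limit" parameters suggest a way to page through the results; this is only a sketch, under the assumption that the endpoint keeps returning rows until the offset runs past the end:

import requests

url = "https://services.aamc.org/msar/home#null"
search_url = "https://services.aamc.org/msar/search/resultData"

with requests.Session() as session:
    session.get(url)  # visit the main page first to pick up session cookies
    all_programs = []
    start = 0
    while True:
        data = {"start": str(start), "limit": "40", "sort": "", "dir": "",
                "newSearch": "true", "msarYear": ""}
        rows = session.post(search_url, data=data).json()["searchResults"]["rows"]
        if not rows:  # stop once a page comes back empty
            break
        all_programs.extend(row["schoolProgramName"] for row in rows)
        start += 40

print(len(all_programs))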

Csv parsing program & how to flatten multiple lists into single list

I've been working on a small program that needs to do the following:
Take a CSV file 'domains_prices.csv' with a column of domains and a price for each, e.g.:
http://www.example1.com,$20
http://www.example2.net,$30
and so on
and then a second file 'orders_list.csv', which is just a single column of blog post URLs from the same domains listed in the first file, e.g.:
http://www.example2.net/blog-post-1
http://www.example1.com/some-article
http://www.example3.net/blog-post-feb-19
and so on
I need to check the full URLs from orders_list against the domains in the first file, look up the price of a blog post on that domain, and then output all the blog post URLs into a new file with the price for each, e.g.:
http://www.example2.net/blog-post-1, $30
and then there would be a total amount at the end of the output file.
My plan was to create a dict for domains_prices with domain and price as the key/value pairs, put all the URLs from orders_list into a list, and then compare the elements of that list against the prices in the dict.
This is my code. I'm stuck towards the end: parsed_orders_list seems to be returning all the URLs as individual lists, so I'm thinking I should put all those URLs into a single list?
Also, the commented-out code at the end is the operation I intend to do once I have the correct list of URLs, to compare them against the keys and values of the dict; I'm not sure if that's correct either.
Please note this is also the first full Python program I've ever created from scratch, so if it's horrendous then that's why :)
import csv
from urlparse import urlparse

#get the csv file with all domains and prices in
reader = csv.reader(open("domains_prices.csv", 'r'))
#get all the completed blog post urls
reader2 = csv.reader(open('orders_list.csv', 'r'))

domains_prices = {}
orders_list = []

for row in reader2:
    #put the blog post urls into a list
    orders_list.append(','.join(row))

for domain, price in reader:
    #strip the domains
    domain = domain.replace('http://', '').replace('/', '')
    #insert the domains and prices into the dictionary
    domains_prices[domain] = price

for i in orders_list:
    #iterate over the blog post urls orders_list and
    #then parse them with urlparse
    data = urlparse(i)
    #use netloc to get just the domain from each blog post url
    parsed_orders = data.netloc
    parsed_orders_list = parsed_orders.split()
    print parsed_orders_list

"""
for k in parsed_orders:
    if k in domains_prices:
        print k, domains_prices[k]
"""
With the help of someone else I've figured it out. I made the following changes to the 'for i in orders_list' section:
parsed_orders = []

for i in orders_list:
    #iterate over the blog post urls orders_list and
    #then parse them with urlparse
    data = urlparse(i)
    #use netloc to get just the domain from each blog post url then put each netloc url into a list
    parsed_orders.append(data.netloc)

#print parsed_orders - to check that Im getting a list of netloc urls back

#Iterate over the list of urls and dict of domains and prices to match them up
for k in parsed_orders:
    if k in domains_prices:
        print k, domains_prices[k]
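To finish the original goal (an output file with a price next to each blog post URL and a total at the end), here is a rough sketch that continues from the domains_prices and orders_list variables above; the output filename and the assumption that every price looks like "$20" are mine:

#continue from domains_prices and orders_list built above
total = 0.0
with open('orders_with_prices.csv', 'wb') as out:  #hypothetical output filename
    writer = csv.writer(out)
    for url in orders_list:
        domain = urlparse(url).netloc
        if domain in domains_prices:
            price = domains_prices[domain]          #e.g. "$20"
            writer.writerow([url, price])
            total += float(price.replace('$', ''))  #assumes prices are strings like "$20"
    writer.writerow(['Total', '$%.2f' % total])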

How to scrape through search results spanning multiple pages with lxml

I'm using lxml to scrape through a site. I want to scrape through a search result that contains 194 items. My scraper is able to scrape only the first page of the search results. How can I scrape the rest?
url = 'http://www.alotofcars.com/new_car_search.php?pg=1&byshowroomprice=0.5-500&bycity=Gotham'
response_object = requests.get(url)
# Build DOM tree
dom_tree = html.fromstring(response_object.text)
After this there are scraping functions
def enter_mmv_in_database(dom_tree, engine):
    # Getting make, model, variant
    name_selector = CSSSelector('[class="secondary-cell"] p a')
    name_results = name_selector(dom_tree)
    for n in name_results:
        mmv = str(`n.text_content()`).split('\\xa0')
        make, model, variant = mmv[0][2:], mmv[1], mmv[2][:-2]
        # Now push make, model, variant in Database
        print make, model, variant
By looking at the list I receive, I can see that only the first page of search results is parsed. How can I parse the whole of the search results?
I've tried to navigate through that website, but it seems to be offline. Still, I'd like to help with the logic.
What I usually do is:
1. Make a request to the search URL (with parameters filled).
2. With lxml, extract the number of the last available page from the pagination div (a sketch of this step follows after the loop below).
3. Loop from the first page to the last one, making requests and scraping the desired data:
for page_number in range(1, last + 1):
    # make requests, replacing 'page_number' in the 'pg' GET variable
    url = "http://www.alotofcars.com/new_car_search.php?pg={}&byshowroomprice=0.5-500&bycity=Gotham".format(page_number)
    response_object = requests.get(url)
    dom_tree = html.fromstring(response_object.text)
    ...
    ...
I hope this helps. Let me know if you have any further questions.
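For step 2, since the site is down I can't check its markup, so this is only a sketch; the "pagination" class name and the pg= link pattern are assumptions:

import re
import requests
from lxml import html

first_url = 'http://www.alotofcars.com/new_car_search.php?pg=1&byshowroomprice=0.5-500&bycity=Gotham'
dom_tree = html.fromstring(requests.get(first_url).text)

# hypothetical markup: <div class="pagination"><a href="...pg=2">2</a> ...</div>
page_links = dom_tree.xpath('//div[@class="pagination"]//a/@href')
page_numbers = [int(m.group(1))
                for m in (re.search(r'pg=(\d+)', href) for href in page_links)
                if m]
last = max(page_numbers) if page_numbers else 1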

Python Scraping, how to get text from this through BeautifulSoup?

Well, here is my code to scrape text content from a site. It is working, though I am not getting plain text only. How do I handle that?
from bs4 import BeautifulSoup
import mechanize
def getArticle(url):
    br = mechanize.Browser()
    htmltext = br.open(url).read()
    soup = BeautifulSoup(htmltext)
    for tag in soup.findAll('span',{'itemprop':'articleBody'}):
        print tag.contents
For example, when I scrape a site, I get this output:
"[u"\nIn Soviet Russia, it's the banks that pay customers' bills.\xa0Or, at least, one might.", , u'\n', , u'\r\nAn interesting case has surfaced in Voronezh, Russia, where a man is suing a bank for more than 24 million Russian rubles (about $727,000) in compensation over a handcrafted document that was signed and recognized by the bank.\xa0', , u'\n', , u'\r\nA person who goes by name Dmitry Alexeev (his surname was changed ', by the first Russian outlet to publish this story, u') said that in 2008 he received a letter from ', Tinkoff Credit Systems, u'\xa0in his mailbox. It was a credit card application form with an agreement contract enclosed, much like the applications Americans receive daily from various banks working with ', Visa
How do I get plain text only?
Use tag.text instead of tag.contents:
from bs4 import BeautifulSoup
import mechanize
url = "http://www.minyanville.com/business-news/editors-pick/articles/A-Russian-Bank-Is-Sued-for/8/7/2013/id/51205"
br = mechanize.Browser()
htmltext = br.open(url).read()
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('span',{'itemprop':'articleBody'}):
    print tag.text
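If tag.text still runs the pieces together, BeautifulSoup's get_text() lets you control the separator and strip whitespace; a small variation on the loop above, reusing the same soup object:

for tag in soup.findAll('span',{'itemprop':'articleBody'}):
    # join the text of nested tags with spaces and trim surrounding whitespace
    print tag.get_text(separator=' ', strip=True)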