Connecting to URLs using Python's urllib2 - python-2.7

I'm scraping some stock information using urllib2.
Some of my code is as follows:
cap_url = "http://wisefn.stock.daum.net/company/c1010001.aspx?cmp_cd=%s" % code
cap_req = urllib2.Request(cap_url)
cap_data = urllib2.urlopen(cap_req).read()
...
depr_url = "http://wisefn.stock.daum.net/company/cF3002.aspx?cmp_cd=%s&frq=Q&rpt=ISM&finGubun=MAIN" % code
depr_req = urllib2.Request(depr_url)
depr_data = urllib2.urlopen(depr_req).read()
...
transaction_url = "http://www.shinhaninvest.com/goodicyber/mk/1206.jsp?code=%s" % code
transaction_data = urllib2.urlopen(transaction_url).read()
soup = BeautifulSoup(transaction_data, fromEncoding="utf-8")
As you know, %s is the stock code. With a given stock code, I scrape all of that stock's information. The total number of stock codes is over 1,600. I then write the gathered information to Excel with xlwt.
However, I can't connect to some URLs, or can't get any information from them, even though I can connect by typing the same URL into a browser manually.
What's the problem? And how can I speed up scraping the pages?

First, I would check the robots.txt file of the website. It most likely prohibits the default Python user agent, so you may consider changing the user agent urllib2 sends.
Second, the website content might be generated by JavaScript, and if so, urllib2 cannot evaluate it. For that purpose you can use the Selenium driver, the PyQt framework, or similar.
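For the first point, a minimal sketch of overriding the user agent in urllib2 (the user-agent string and stock code below are placeholders, not values the site requires):
import urllib2

code = "005930"  # hypothetical stock code; substitute your own
cap_url = "http://wisefn.stock.daum.net/company/c1010001.aspx?cmp_cd=%s" % code
# urllib2 identifies itself as "Python-urllib/x.y" by default, and some
# sites refuse that user agent, so present a browser-like one instead.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64)"}
cap_req = urllib2.Request(cap_url, headers=headers)
cap_data = urllib2.urlopen(cap_req).read()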

Related

Can only scrape the base price with requests-html

For the product listed below, I fail to scrape the prices of the more expensive options.
Strangely, I can open a direct URL link and see the correct price when the page loads. However, when I scrape the same link with requests-html, I only get the base price of the base product.
from requests_html import HTMLSession
url = 'https://www.staegerag.ch/shop/index.php?id_product=321&controller=product&search_query=level&results=2#/273-farbe-gold_tone_light_oak/233-sprachsteuerung-nein'
session = HTMLSession()
r = session.get(url)
print(r.html.find('span[id="our_price_display"]', first=True).text)
Results in 1'399.00 CHF instead of 1'699.00 CHF
After various Google searches, I came across a simple solution. Since I'm still a relatively new kid on the programming block, I didn't notice that my code didn't call Chromium at all. The price I want to scrape is rendered by JavaScript. I just had to render the page in Chromium via r.html.render().
solution:
from requests_html import HTMLSession
url = 'https://www.staegerag.ch/shop/index.php?id_product=321&controller=product&search_query=level&results=2#/273-farbe-gold_tone_light_oak/233-sprachsteuerung-nein'
session = HTMLSession()
r = session.get(url)
r.html.render()
print(r.html.find('span[id="our_price_display"]', first=True).text)
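Note that the first call to r.html.render() downloads a local Chromium build (requests-html drives it through pyppeteer), so the first run takes noticeably longer; after that, render() executes the page's JavaScript before you query the HTML.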

How to send cookies separately in Python with urllib2

I'm trying to send multiple cookies to a URL until I get the right one, and I don't know why my current code isn't working.
I've looked at the existing answers about sending cookies to a URL, but none of them seem to work in my case.
The comments in the code are the instructions for the task:
# Write a script that can guess cookie values
# and send them to the url http://127.0.0.1:8082/cookiestore
# Read the response from the right cookie value to get the flag.
# The cookie id the aliens are using is alien_id
# the id is a number between 1 and 75
import urllib2
req = urllib2.Request('http://127.0.0.1:8082/cookiestore')
for i in range(75):
    req.add_header('alien_id', i)
    response = urllib2.urlopen(req)
    html = response.read()
    print(html)
I expected that one of the iterations would print something different, but they are all the same
This code is Python 3.8, as Python 2.7 is no longer supported, and your req.add_header line did need modifying. Here is your solution:
import urllib.request

url = "http://127.0.0.1:8082/cookiestore"
for i in range(1, 76):  # the alien_id is a number between 1 and 75
    request = urllib.request.Request(url)
    request.add_header("Cookie", "alien_id=" + str(i))
    response = urllib.request.urlopen(request)
    responseStr = str(response.read().decode("utf-8"))
    print(responseStr)
If you are still using Python 2.7, you can just copy the request.add_header line.
Cheers!
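For reference, here is a minimal sketch of the same loop in Python 2.7, where urllib2 plays the role that urllib.request plays in Python 3:
import urllib2

url = "http://127.0.0.1:8082/cookiestore"
for i in range(1, 76):  # the alien_id is a number between 1 and 75
    request = urllib2.Request(url)
    request.add_header("Cookie", "alien_id=" + str(i))
    print(urllib2.urlopen(request).read())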
Ahaha, Cyber Discovery, I see ;)
You were very close; in fact, I used your code as my base, except for the add_header line.
Not sure if you even still need to know, but I'll leave this here for others.
When you send a cookie as a header, you send it like so:
req.add_header('Cookie', 'cookiename=cookievalue')
Hope this helps :D

Scraping the AliExpress site with Python doesn't give me the correct result

I have a problem scraping the AliExpress site.
https://www.aliexpress.com/item/Free-gift-100-Factory-Original-Unlocked-Apple-iphone-4G-8GB-16GB-32GB-Cell-phone-3-5/32691056589.html
This is one URL.
Here is what I want to get:
import requests
from bs4 import BeautifulSoup
from lxml import html
import lxml.html

r = requests.get('https://www.aliexpress.com/item/Free-gift-100-Factory-Original-Unlocked-Apple-iphone-4G-8GB-16GB-32GB-Cell-phone-3-5/32691056589.html')

# With BeautifulSoup:
soup = BeautifulSoup(r.content, 'lxml')
content = soup.find('div', {'id': 'j-product-tabbed-pane'})

# With lxml parsing:
root = html.fromstring(r.content)
results = root.xpath('//img[@alt="aeProduct.getSubject()"]')
f = open('result.html', 'w')
f.write(lxml.html.tostring(results[0]))
f.close()
This is my code, but it gives me the wrong result.
Inspecting in the browser shows those elements, but the code above doesn't return anything.
I think requests.get doesn't give me the correct contents, but why, and how can I solve this problem? Do they detect me as a bot? Any help is appreciated.
Thank you, everyone.
Try this:
1 - use a user agent
2 - use a proxy
3 - disable JavaScript for this site and refresh it, then see whether the page still has this element or whether it is loaded by JavaScript; if it is loaded by JavaScript, you should find a way to render the JS
A sketch of the first two points follows.
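A minimal sketch of the first two suggestions using requests (the user-agent string and proxy addresses are placeholders; substitute a proxy you actually have access to):
import requests

url = ('https://www.aliexpress.com/item/Free-gift-100-Factory-Original-'
       'Unlocked-Apple-iphone-4G-8GB-16GB-32GB-Cell-phone-3-5/32691056589.html')
# Present a browser-like user agent instead of requests' default.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
# Placeholder proxy addresses; replace with your own proxy.
proxies = {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'}
r = requests.get(url, headers=headers, proxies=proxies)
print(r.status_code)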

Working with Scrapy 'regex definition'

I have been trying to write a script to scrape data from the website https://services.aamc.org/msar/home#null. I wrote a Python 2.7 script to get a piece of text from the website (I am aiming for anything at this point), but cannot seem to get it to work. I suspect this is because I have not configured my regex properly to identify the tag I am trying to scrape from. Does anyone have any idea what I might be doing wrong and how I can fix it?
Much appreciated.
Matt
import urllib
import re
url = "https://services.aamc.org/msar/home#null"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
regex = '<td colspan="2" class="schoolLocation">(.+?)</td>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print "the school location is ",price
First of all, don't use regular expressions to parse HTML. There are specialized tools called HTML parsers, like BeautifulSoup or lxml.html.
Actually, that advice is not even the core issue here, since there is no need to parse HTML at all. The search results on this page are loaded dynamically from a separate endpoint: the browser sends an XHR request, receives a JSON response, parses it, and displays the search results with the help of JavaScript executed in the browser. urllib is not a browser and provides you with only the initial page HTML, with an empty search results container.
What you need to do is simulate that XHR request in your code. Let's use the requests package. Complete working code, printing a list of school programs:
import requests
url = "https://services.aamc.org/msar/home#null"
search_url = "https://services.aamc.org/msar/search/resultData"
with requests.Session() as session:
    session.get(url)  # visit main page
    # search
    data = {
        "start": "0",
        "limit": "40",
        "sort": "",
        "dir": "",
        "newSearch": "true",
        "msarYear": ""
    }
    response = session.post(search_url, data=data)
    # extract search results
    results = response.json()["searchResults"]["rows"]
    for result in results:
        print(result["schoolProgramName"])
Prints:
Albany Medical College
Albert Einstein College of Medicine
Baylor College of Medicine
...
Howard University College of Medicine
Howard University College of Medicine Joint Degree Program
Icahn School of Medicine at Mount Sinai
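If you need more than the first 40 rows, the start and limit fields suggest the endpoint is paged. A hedged sketch of walking the pages, to be placed inside the with requests.Session() block above (the upper bound of 200 is hypothetical; adjust as needed):
    for start in range(0, 200, 40):
        data["start"] = str(start)
        rows = session.post(search_url, data=data).json()["searchResults"]["rows"]
        for row in rows:
            print(row["schoolProgramName"])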

Having issues with Python xpath scraping

I'm back again with a question for the wonderful people here :)
I've recently begun getting back into Python (50% done at Codecademy, lol) and decided to make a quick script for web-scraping the spot price of gold in CAD. This will eventually be part of a much bigger script... but I'm VERY rusty and thought it would be a good project.
My issue:
I have been following the guide over at http://docs.python-guide.org/en/latest/scenarios/scrape/ to accomplish my goal, however my script always returns/prints
<Element html at 0xRANDOM>
with RANDOM being (I assume) a random hex number. This happens no matter what website I use.
My Code:
#!/bin/python
# Scrape current gold spot price in CAD
from lxml import html
import requests

def scraped_price():
    page = requests.get('http://goldprice.org/gold-price-canada.html')
    tree = html.fromstring(page.content)
    print "The full page is: ", tree  # added for debug WHERE ERROR OCCURS
    bid = tree.xpath("//span[@id='gpotickerLeftCAD_price']/text()")
    print "Scraped content: ", bid
    return bid

gold_scraper = scraped_price()
My research:
1) www.w3schools.com/xsl/xpath_syntax.asp
This is where I figured out to use '//span' to find all 'span' objects and then used the @id to narrow it down to the one I need.
2) Scraping web content using xpath won't work
This makes me think I simply have a bad tree.xpath setup. However I cannot seem to figure out where or why.
Any assistance would be greatly appreciated.
<Element html at 0xRANDOM>
What you see printed is the string representation of lxml.html's Element class. If you want to see the actual HTML content, use tostring():
print(html.tostring(tree, pretty_print=True))
You are also getting Scraped content: [] printed, which really means that there were no elements matching the locator. And, if you look at the previously printed HTML, there is actually no element with id="gpotickerLeftCAD_price" in the downloaded source.
The prices on this particular site are retrieved dynamically via continuous JSONP GET requests issued periodically. You can either look into simulating those requests, or stay at a higher level and automate a browser via selenium. Demo (using the PhantomJS headless browser):
>>> import time
>>> from selenium import webdriver
>>>
>>> driver = webdriver.PhantomJS()
>>> driver.get("http://goldprice.org/gold-price-canada.html")
>>> while True:
... print(driver.find_element_by_id("gpotickerLeftCAD_price").text)
... time.sleep(1)
...
1,595.28
1,595.28
1,595.28
1,595.28
1,595.28
1,595.19
...