Scraping the AliExpress site with Python doesn't give me the correct result - python-2.7

I have a problem scraping the AliExpress site.
https://www.aliexpress.com/item/Free-gift-100-Factory-Original-Unlocked-Apple-iphone-4G-8GB-16GB-32GB-Cell-phone-3-5/32691056589.html
This is one URL.
This is what I am trying to do, first with requests and BeautifulSoup:
r = requests.get('https://www.aliexpress.com/item/Free-gift-100-Factory-Original-Unlocked-Apple-iphone-4G-8GB-16GB-32GB-Cell-phone-3-5/32691056589.html')
soup = BeautifulSoup(r.content, 'lxml')
content = soup.find('div', {'id': 'j-product-tabbed-pane'})
Then with lxml parsing:
root = html.fromstring(r.content)
results = root.xpath('//img[@alt="aeProduct.getSubject()"]')
f = open('result.html', 'w')
f.write(lxml.html.tostring(results[0]))
f.close()
This is my code, but it gives me the wrong result.
Inspecting the page in the browser shows those elements,
but the code above doesn't give me anything.
I think requests.get doesn't return the correct contents. But why, and how can I solve this problem? Do they detect me as a bot? How can you help me?
Thank you, everyone.

Try this:
1 - use a user agent (see the sketch just after this list)
2 - use a proxy
3 - disable JavaScript for this site in your browser and refresh the page, then check whether the element is still there or whether it is loaded by JavaScript; if it is loaded by JavaScript, you should find a way to render the JS.
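For points 1 and 2, a minimal sketch using requests might look like the following; the User-Agent string and the proxy address are placeholders rather than values from the question:
import requests

url = ('https://www.aliexpress.com/item/Free-gift-100-Factory-Original-Unlocked-'
       'Apple-iphone-4G-8GB-16GB-32GB-Cell-phone-3-5/32691056589.html')
# Pretend to be a normal browser instead of the default python-requests user agent.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
# Optional: route the request through a proxy (placeholder address).
proxies = {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:3128'}
r = requests.get(url, headers=headers)  # add proxies=proxies if you have one
print(r.status_code)
print('j-product-tabbed-pane' in r.text)  # is the element present in the raw HTML at all?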

Related

How to send cookies separately in Python with urllib2

I'm trying to send multiple cookies to a URL until I get the right one, and I don't know why my current code isn't working.
I've looked at the existing answers for sending cookies to a URL, but none of them seem to work in my case.
The comments in the code are the instructions for the task:
# Write a script that can guess cookie values
# and send them to the url http://127.0.0.1:8082/cookiestore
# Read the response from the right cookie value to get the flag.
# The cookie id the aliens are using is alien_id
# the id is a number between 1 and 75
import urllib2

req = urllib2.Request('http://127.0.0.1:8082/cookiestore')
for i in range(75):
    req.add_header('alien_id', i)
    response = urllib2.urlopen(req)
    html = response.read()
    print(html)
I expected that one of the iterations would print something different, but they are all the same.
This code is Python 3.8, as Python 2.7 is no longer supported, and your req.add_header line needed modifying. Here is your solution:
import urllib.request

url = "http://127.0.0.1:8082/cookiestore"
for i in range(1, 76):  # the id is a number between 1 and 75
    request = urllib.request.Request(url)
    request.add_header("Cookie", "alien_id=" + str(i))
    response = urllib.request.urlopen(request)
    responseStr = str(response.read().decode("utf-8"))
    print(responseStr)
If you are still using Python 2.7, you can just copy the request.add_header line.
Cheers!
Ahaha, Cyber Discovery, I see ;)
You were very close; in fact, I used your code as my base except for the add_header line.
Not sure if you even still need to know, but I'll leave this here for others.
When you send a cookie as a header, you send it like so:
req.add_header('Cookie', 'cookiename=cookievalue')
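Slotted back into the original urllib2 loop, a minimal sketch (Python 2.7, matching the question's tag) would be:
import urllib2

url = 'http://127.0.0.1:8082/cookiestore'
for i in range(1, 76):  # the id is a number between 1 and 75
    # Build a fresh request each time so old Cookie headers don't accumulate.
    req = urllib2.Request(url)
    req.add_header('Cookie', 'alien_id=' + str(i))
    response = urllib2.urlopen(req)
    print(response.read())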
Hope this helps :D

Scraper stopped scraping

I ran my scraping operations this morning:
The scraper runs through the list fine, but just keeps saying "skipped", as per the code.
I have checked a few of the pages and confirmed the information I require is on the website.
I have pulled my code apart piece by piece but cannot find any changes - I've even gone back to a vanilla version of my code to check, and still no luck.
Could someone please run this and see what I am missing, as I am going insane!
Target website https://www.realestate.com.au/property/12-buckingham-dr-werribee-vic-3030
Code:
import requests
import csv
from lxml import html

text2search = '''<p class="property-value__title">
RECENTLY SOLD
</p>'''

quote_page = ["https://www.realestate.com.au/property/12-buckingham-dr-werribee-vic-3030"]

with open('index333.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)
    for index, url in enumerate(quote_page):
        page = requests.get(url)
        if text2search in page.text:
            tree = html.fromstring(page.content)
            (title,) = (x.text_content() for x in tree.xpath('//title'))
            (price,) = (x.text_content() for x in tree.xpath('//div[@class="property-value__price"]'))
            (sold,) = (x.text_content().strip() for x in tree.xpath('//p[@class="property-value__agent"]'))
            writer.writerow([url, title, price, sold])
        else:
            writer.writerow([url, 'skipped'])
There was a change in the HTML code that introduced additional white space.
This stopped the "text2search in page.text" check from matching.
Thanks to @MarcinOrlowski for pointing me in the right direction.
Thanks to advice from @MT, the code has been shortened to lessen the chances of this occurring again.
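One way to make such a check less fragile (a sketch only, not the poster's actual shortened code) is to test for the element with an XPath query instead of matching an exact HTML snippet, so incidental whitespace changes stop mattering:
import csv
import requests
from lxml import html

quote_page = ["https://www.realestate.com.au/property/12-buckingham-dr-werribee-vic-3030"]

with open('index333.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)
    for url in quote_page:
        tree = html.fromstring(requests.get(url).content)
        # Match on the element's normalized text rather than on raw markup.
        sold = tree.xpath('//p[@class="property-value__title"][normalize-space() = "RECENTLY SOLD"]')
        if sold:
            (title,) = (x.text_content() for x in tree.xpath('//title'))
            writer.writerow([url, title])
        else:
            writer.writerow([url, 'skipped'])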

Create List from path expression with Python

I'm currently working on a web scraper without any frameworks, and I'm running into an issue where I test an XPath expression to, say, get the table data on a Wikipedia page, but when I scrape it and print it to the console it only returns an empty list. Can anyone please advise, and perhaps suggest some useful books on XPath for web scraping? (I have SafariBooks, if that helps.)
import requests
from lxml import html
page = requests.get('https://en.wikipedia.org/wiki/L.A.P.D._(band)')
tree = html.fromstring(page.content)
# OK
bandName = tree.xpath('//*[#id="firstHeading"]/text()')
overview = tree.xpath('//*[#id="mw-content-text"]/p[1]//text()')
print(bandName)
print(overview)
#Trouble Code
yearsActive = tree.xpath('//*[#id="mw-content-text"]/table[1]/tbody/tr[6]//text()')
print(yearsActive)
members = tree.xpath('//*[#id="mw-content-text"]/table[1]/tbody/tr[11]/td[1]/ul/li/a//text()')
print(members)
UPDATE: While conducting more testing I discovered that print(len(members)) returns zero, which seems to indicate something is wrong with my XPath expression, yet when I test the members expression in the Chrome console it returns a list of band members.
Your XPath fails because the raw HTML tables don't have tbody elements. The tbody elements in this case are most likely generated by the browser (see the related question below):
>>> yearsActive = tree.xpath('//*[#id="mw-content-text"]/table[1]/tr[6]/td/text()')
>>> print yearsActive
[u'1989\u20131992']
>>> members = tree.xpath('//*[#id="mw-content-text"]/table[1]/tr[10]/td[1]//text()[normalize-space()]')
>>> print members
['James Shaffer', 'Reginald Arvizu', 'David Silveria', '\nRichard Morrill', '\nPete Capra', '\nCorey (surname unknown)', '\nDerek Campbell', '\nTroy Sandoval', '\nJason Torres', '\nKevin Guariglia']
In the future, it is often useful to inspect the HTML that you actually receive from requests.get(), in case your XPath unexpectedly fails when run from code but works fine when run from the browser's developer tools.
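A quick way to do that inspection (a minimal sketch that just dumps the response body to a file so you can check whether elements such as tbody exist in the raw markup):
import requests

page = requests.get('https://en.wikipedia.org/wiki/L.A.P.D._(band)')
# Save exactly what the server sent, then open it in a text editor or browser.
with open('received.html', 'wb') as f:
    f.write(page.content)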
Related: Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?

How to scrape through search results spanning multiple pages with lxml

I'm using lxml to scrape a site. I want to scrape a search result that contains 194 items, but my scraper is able to scrape only the first page of the search results. How can I scrape the rest of the search results?
import requests
from lxml import html

url = 'http://www.alotofcars.com/new_car_search.php?pg=1&byshowroomprice=0.5-500&bycity=Gotham'
response_object = requests.get(url)
# Build DOM tree
dom_tree = html.fromstring(response_object.text)
After this there are the scraping functions:
from lxml.cssselect import CSSSelector

def enter_mmv_in_database(dom_tree, engine):
    # Getting make, model, variant
    name_selector = CSSSelector('[class="secondary-cell"] p a')
    name_results = name_selector(dom_tree)
    for n in name_results:
        mmv = str(`n.text_content()`).split('\\xa0')
        make, model, variant = mmv[0][2:], mmv[1], mmv[2][:-2]
        # Now push make, model, variant into the database
        print make, model, variant
By looking at the list I receive, I can see that only the first page of the search results is parsed. How can I parse the whole search result?
I've tried to navigate to that website but it seems to be offline. Still, I would like to help with the logic.
What I usually do is:
1 - Make a request to the search URL (with the parameters filled in).
2 - With lxml, extract the number of the last available page from the pagination div (a sketch of this step follows after the loop below).
3 - Loop from the first page to the last one, making requests and scraping the desired data:
for page_number in range(1, last + 1):
    # make requests replacing 'page_number' in the 'pg' GET variable
    url = "http://www.alotofcars.com/new_car_search.php?pg={}&byshowroomprice=0.5-500&bycity=Gotham".format(page_number)
    response_object = requests.get(url)
    dom_tree = html.fromstring(response_object.text)
    ...
    ...
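For step 2, a minimal sketch might look like this; the div class "pagination" and the assumption that its last numeric link holds the highest page number are guesses about the site's markup, not something taken from the question:
import requests
from lxml import html

search_url = 'http://www.alotofcars.com/new_car_search.php?pg=1&byshowroomprice=0.5-500&bycity=Gotham'
dom_tree = html.fromstring(requests.get(search_url).text)
# Hypothetical pagination markup: <div class="pagination"><a>1</a> ... <a>13</a></div>
page_links = dom_tree.xpath('//div[@class="pagination"]//a/text()')
numbers = [int(t) for t in page_links if t.strip().isdigit()]
last = max(numbers) if numbers else 1  # fall back to a single page if nothing is found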
I hope this helps. Let me know if you have any further questions.

Problems Scraping a Page With Beautiful Soup

I am using Beautiful Soup to try and scrape a page.
I am trying to follow this tutorial.
I am trying to get the contents of the following page after submitting a Stock Ticker Symbol:
http://www.cboe.com/delayedquote/quotetable.aspx
The tutorial is for a page with a "GET" method; my page is a "POST". I wonder if that is part of the problem?
I want to use the first text box - under where it says:
“Enter a Stock or Index symbol below for delayed quotes.”
Relevant code:
import urllib
import urllib2

user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = {'User-Agent': user_agent}
values = {'ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol': 'IBM'}
data = urllib.urlencode(values)
request = urllib2.Request("http://www.cboe.com/delayedquote/quotetable.aspx", data, headers)
response = urllib2.urlopen(request)
The call does not fail, but I do not get the set of options and prices returned to me like when I run the page interactively; I just get a bunch of garbled HTML.
Thanks in advance!
Ok - I think I figured out the problem (and found another). I decided to switch to 'mechanize' from 'urllib2'. Unfortunately, I kept having problems getting the data. Finally, I realized that there are two 'submit' buttons, so I tried passing the name parameter when submitting the form. That did the trick as far as getting the correct response.
However, the next problem was that I could not get BeautifulSoup to parse the HTML and find the necessary tags. A brief Google search revealed others having similar problems. So, I gave up on BeautifulSoup and just did a basic regex on the HTML. Not as elegant as BeautifulSoup, but effective.
Ok - enough speechifying. Here's what I came up with:
import mechanize
import re

br = mechanize.Browser()
url = 'http://www.cboe.com/delayedquote/quotetable.aspx'
br.open(url)
br.select_form(name='aspnetForm')
br['ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol'] = 'IBM'
# here's the key step that was causing the trouble - pass the name parameter
# for the button when calling submit
response = br.submit(name="ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$btnSubmit")
data = response.read()
match = re.search(r'Bid</font><span> \s*([0-9]{1,4}\.[0-9]{2})', data, re.MULTILINE | re.IGNORECASE)
if match:
    print match.group(1)
else:
    print "There was a problem retrieving the quote"