Scraperwiki scrape query: using lxml to extract links - python-2.7

I suspect this is a trivial question, but I hope someone can help me with a problem I've got using lxml in a scraper I'm trying to build.
https://scraperwiki.com/scrapers/thisisscraper/
I'm working line-by-line through tutorial 3 and have got so far with trying to extract the next-page link. I can use cssselect to identify the link, but I can't work out how to isolate just the href attribute rather than the whole anchor tag.
Can anyone help?
def scrape_and_look_for_next_link(url):
    html = scraperwiki.scrape(url)
    print html
    root = lxml.html.fromstring(html)  # turn the HTML into an lxml object
    scrape_page(root)
    next_link = root.cssselect('ol.pagination li a')[-1]
    attribute = lxml.html.tostring(next_link)
    attribute = lxml.html.fromstring(attribute)
    # works up until this point
    attribute = attribute.xpath('/@href')
    attribute = lxml.etree.tostring(attribute)
    print attribute

CSS selectors can select elements that have an href attribute (e.g. a[href]), but they cannot extract the attribute value by themselves.
Once you have the element from cssselect, you can use next_link.get('href') to get the value of the attribute.
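A minimal, self-contained sketch of this (the pagination markup below is made up for illustration; here the element is fetched with an equivalent XPath, but `.get('href')` works the same on any element, including one returned by cssselect):

```python
import lxml.html

# made-up sample of the kind of pagination markup the question describes
html = ('<ol class="pagination">'
        '<li><a href="?page=2">2</a></li>'
        '<li><a href="?page=3">next</a></li>'
        '</ol>')
root = lxml.html.fromstring(html)

# same element the question's cssselect('ol.pagination li a')[-1] would return
next_link = root.xpath('//ol[@class="pagination"]/li/a')[-1]

# .get() reads an attribute straight off the element -- no tostring round-trip
print(next_link.get('href'))
```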

link = link.attrib['href']
should work

Related

Extract texts until certain patterns on Scrapy

I'm trying to scrape certain contents from a webpage using Scrapy.
The html element looks like below.
'<p>\n 阪急宝塚線\xa0/\xa0石橋駅\xa0徒歩1分\n (<a href="javascript:void(0);" style="cursor:pointer;" onclic
k=\'window.open("http://athome.ekiworld.net/?id=athome&to=asso 302 ワンルーム&to_near_station1=25824&to_near_time1=1&to_near_traffic1=徒歩 1 分");return false;\'>電車ルート案内</a>)\n
</p>'
My goal is to extract only this part "阪急宝塚線\xa0/\xa0石橋駅\xa0徒歩1分\n".
I tried to use .re() with the response, and I thought ^(.+?<a) would work since it parsed successfully on https://regex101.com/. But in the scrapy shell it doesn't match anything (gives me []).
Could someone help me with this?
I use Python3/scrapy1.3.0.
Thanks!
import re
text = '''<p>\n 阪急宝塚線\xa0/\xa0石橋駅\xa0徒歩1分\n (<a href="javascript:void(0);" style="cursor:pointer;" onclic
k=\'window.open("http://athome.ekiworld.net/?id=athome&to=asso 302 ワンルーム&to_near_station1=25824&to_near_time1=1&to_near_traffic1=徒歩 1 分");return false;\'>電車ルート案内</a>)\n
</p>'''
re.search(r'\n.+?\n', text).group()
out:
'\n 阪急宝塚線\xa0/\xa0石橋駅\xa0徒歩1分\n'
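As to why the original anchored pattern returned nothing: by default `.` does not match newlines, and the string begins with `<p>\n`, so `^(.+?<a)` can never get past the first line break (regex101 behaves differently depending on flags). A quick check, using a shortened version of the sample string (the anchor's href is simplified here):

```python
import re

# shortened sample; the real string has a long onclick attribute instead of '#'
text = '<p>\n 阪急宝塚線\xa0/\xa0石橋駅\xa0徒歩1分\n (<a href="#">電車ルート案内</a>)\n</p>'

# '.' does not match '\n' by default, so the anchored pattern finds nothing:
print(re.search(r'^(.+?<a)', text))  # None

# with re.DOTALL, '.' matches newlines too and the pattern succeeds:
print(re.search(r'^(.+?)<a', text, re.DOTALL).group(1))
```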

Python xpath returns an empty list

I need to extract some of the href attributes under the "ARTICLES" section on this page.
I am using the following code
from lxml import html
import requests
page = requests.get('http://www.dlib.org/dlib/november14/11contents.html')
tree = html.fromstring(page.content)
result = tree.xpath('/html/body/form/table[3]/tbody/tr/td/table[5]/tbody/tr/td/table/tbody/tr/td[2]/p[6]/@href')
print result
I know for sure that the XPath is correct but when I run the script it prints
[]
I've tried with some others elements on the page and it works as expected.
Any idea?
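One common cause worth checking (an assumption here, since this XPath was presumably copied from browser dev tools): browsers insert `tbody` elements into tables even when the served HTML has none, while lxml's parser does not, so a copied path containing `/tbody/` matches nothing. A minimal demonstration with made-up markup:

```python
from lxml import html

# served HTML with no <tbody>, as many real pages are written
doc = html.fromstring('<table><tr><td><a href="x.html">x</a></td></tr></table>')

# the dev-tools-style path with /tbody/ finds nothing...
print(doc.xpath('//table/tbody/tr/td/a/@href'))  # []

# ...but dropping the tbody step works
print(doc.xpath('//table/tr/td/a/@href'))  # ['x.html']
```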

Using BeautifulSoup to print specific information that has a <div> tag

I'm still new to using BeautifulSoup to scrape information from a website. For this piece of code I'm specifically trying to grab this value and others like it and display it back to me the user in a more condensed easy to read display. The below is a screenshot i took with the highlighted div and class i am trying to parse:
This is the code I'm using:
import urllib2
from bs4 import BeautifulSoup
a =("http://forecast.weather.gov/MapClick.php?lat=39.32196712788175&lon=-82.10190859830237&site=all&smap=1#.VQM_kOGGP7l")
website = urllib2.urlopen(a)
html = website.read()
soup = BeautifulSoup(html)
x = soup.find_all("div",{"class": "point-forecast-icons-low"})
print x
However, once it runs it returns "[]". I get no errors, but nothing happens. At first I thought maybe it couldn't find anything inside the <div> I told it to search for, but usually I would get back a None saying nothing was found. So my best guess at the moment is that it's not opening the div up to pull the other content from inside it.
You are getting [] because the point-forecast-icons-low class is not an attribute of a div; rather, it's an attribute of the p tag. Try this instead.
x = soup.find_all("p", attrs={"class": "point-forecast-icons-low"})
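A self-contained sketch of that fix (the markup below is made up to mimic the forecast page's structure, since the real page changes over time):

```python
from bs4 import BeautifulSoup

# made-up markup mimicking the forecast page: the class sits on <p>, not <div>
html = '<div><p class="point-forecast-icons-low">Low: 28 F</p></div>'
soup = BeautifulSoup(html, "html.parser")

# searching for a div with this class finds nothing...
print(soup.find_all("div", {"class": "point-forecast-icons-low"}))  # []

# ...but searching for the p tag works
x = soup.find_all("p", attrs={"class": "point-forecast-icons-low"})
print(x[0].get_text())
```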

How to get large amounts of href links of very large contents of website with Beautifulsoup

I am parsing a large HTML website that has over 1000 href links. I am using BeautifulSoup to get all the links, but the second time I run the program, BeautifulSoup cannot handle it (finding all the 'td' tags). How will I overcome this problem? I can load the HTML page with urllib, but not all the links can be printed. When I use it with find for one 'td' tag, it passes.
Tag = self.__Page.find('table', {'class':'RSLTS'}).findAll('td')
print Tag
for a in Tag.find('a', href=True):
    print "found", a['href']
Now working as
Tag = self.__Page.find('table', {'class':'RSLTS'}).find('td')
print Tag
for a in Tag.find('a', href=True):
    print "found", a['href']
You need to iterate over them:
tds = self.__Page.find('table', class_='RSLTS').find_all('td')
for td in tds:
    a = td.find('a', href=True)
    if a:
        print "found", a['href']
Although I'd just use lxml if you have a ton of stuff:
root.xpath('//table[contains(@class, "RSLTS")]//td/a/@href')
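A quick offline check of the lxml approach (the table markup below is made up; note that with the `/@href` step, the XPath returns the attribute strings directly, with no per-row loop needed):

```python
from lxml import html

# made-up sample table in the shape the question describes
doc = html.fromstring(
    '<table class="RSLTS"><tr>'
    '<td><a href="a.html">A</a></td>'
    '<td><a href="b.html">B</a></td>'
    '</tr></table>')

# descendant axis (//td) skips over the intervening <tr> rows
print(doc.xpath('//table[contains(@class, "RSLTS")]//td/a/@href'))
```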

How to properly use xpath & regexp extractor in jmeter?

I have the following text in the HTML response:
<input type="hidden" name="test" value="testValue">
I need to extract the value from the above input tag.
I've tried both regexp and xpath extractor, but neither is working for me:
regexp pattern
input\s*type="hidden"\s*name="test"\s*value="(.+)"\s*>
xpath query
//input[@name="test"]/@value
The above xpath gives an error at the XPath Assertion Listener: "No node matched".
I tried a lot and concluded that the xpath works only if I use it as //input[@name].
The moment I try to add an actual name, it gives the "No node matched" error.
Could anyone please suggest how to resolve this issue?
Please take a look at my previous answer:
https://stackoverflow.com/a/11452267/169277
The relevant part for you would be step 3:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = prev.getResponseDataAsString(); // get response from your sampler
Document doc = Jsoup.parse(html);
Element inputElement = doc.select("input[name=test]").first();
String inputValue = inputElement.attr("value");
vars.put("inputTextValue", inputValue);
Update
So you don't get tangled up with the code, I've created a jMeter post processor called Html Extractor; here is the GitHub URL:
https://github.com/c0mrade/Html-Extractor
Since you are using the XPath Extractor to parse an HTML (not XML) response, ensure that the "Use Tidy (tolerant parser)" option is CHECKED in the XPath Extractor's control panel.
Your xpath query looks fine; check the option mentioned above and try again.
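Incidentally, the regexp pattern from the question does match the sample input, so the regexp extractor route should also work once applied to the right response. A quick offline sanity check of the pattern (sketched in Python purely to exercise the regex, not as jMeter code):

```python
import re

# the sample input tag and the pattern exactly as given in the question
html = '<input type="hidden" name="test" value="testValue">'
pattern = r'input\s*type="hidden"\s*name="test"\s*value="(.+)"\s*>'

m = re.search(pattern, html)
print(m.group(1))  # the captured hidden-field value
```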