Extract text up to a certain pattern in Scrapy - regex

I'm trying to scrape certain contents from a webpage using Scrapy.
The HTML element looks like the one below:
'<p>\n 阪急宝塚線\xa0/\xa0石橋駅\xa0徒歩1分\n (<a href="javascript:void(0);" style="cursor:pointer;" onclick=\'window.open("http://athome.ekiworld.net/?id=athome&to=asso 302 ワンルーム&to_near_station1=25824&to_near_time1=1&to_near_traffic1=徒歩 1 分");return false;\'>電車ルート案内</a>)\n</p>'
My goal is to extract only this part "阪急宝塚線\xa0/\xa0石橋駅\xa0徒歩1分\n".
I tried to use .re() on the response and thought ^(.+?<a) would work, since it matched on https://regex101.com/. But in the scrapy shell it doesn't match anything (it gives me []).
Could someone help me with this?
I use Python 3 / Scrapy 1.3.0.
Thanks!

import re
text = '''<p>\n 阪急宝塚線\xa0/\xa0石橋駅\xa0徒歩1分\n (<a href="javascript:void(0);" style="cursor:pointer;" onclick=\'window.open("http://athome.ekiworld.net/?id=athome&to=asso 302 ワンルーム&to_near_station1=25824&to_near_time1=1&to_near_traffic1=徒歩 1 分");return false;\'>電車ルート案内</a>)\n</p>'''
re.search(r'\n.+?\n', text).group()
out:
'\n 阪急宝塚線\xa0/\xa0石橋駅\xa0徒歩1分\n'
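A note on why ^(.+?<a) returns []: by default . does not match a newline, and the paragraph's text starts with \n before the <a> tag, so the lazy .+? can never bridge it (the regex101 test was presumably run with different flags or against a single line). The re.search above works because it anchors on the literal newlines instead. The same pattern can also be applied directly in the scrapy shell through the selector's .re_first() shortcut; a minimal sketch, assuming the target <p> is matched by the XPath below (adjust it for the real page):
# scrapy shell <page-url>
route = response.xpath('//p/text()').re_first(r'\n(.+?)\n')
# route -> ' 阪急宝塚線\xa0/\xa0石橋駅\xa0徒歩1分'  (add .strip() to drop the surrounding whitespace)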

Related

Regex for extracting only the Youtube Embedment URL in Angular 5

I think it is not very convenient for a user to get this link here:
https://www.youtube.com/embed/GmvM6syadl0
Because YouTube provides an entire code snippet like so:
<iframe width="560" height="315" src="https://www.youtube.com/embed/GmvM6syadl0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
It would be a lot better if the user could take the code snippet above and have my program extract the URL for them.
Any ideas how to go about this? I'm usually not very good at extracting data from elaborate strings. What I would like to end up with is something like this:
let yTLink = extractYoutubeLinkfromIframe(providedInput);
extractYoutubeLinkfromIframe(iframeTag) {
  // do fancy regex stuff
}
If you will always have a format like that iframe, you could use split. I did it using the following code:
extractYoutubeLinkfromIframe(iframeTag) {
  let youtubeUrl = iframeTag.split('src');
  youtubeUrl = youtubeUrl[1].split('"');
  return youtubeUrl[1];
}
First we split on src to separate the iframe string. Then we split on the double quote ", since the link appears as "[link]"; taking index 1 of that second split gives us just the URL we need.
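For comparison, the same extraction can be done with a single regular expression instead of two splits. A small sketch (shown in Python, like most of this page; the iframe_tag value is just the sample snippet from the question):
import re

iframe_tag = '<iframe width="560" height="315" src="https://www.youtube.com/embed/GmvM6syadl0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>'

# Capture whatever sits between src=" and the next double quote.
match = re.search(r'src="([^"]+)"', iframe_tag)
youtube_url = match.group(1) if match else None
print(youtube_url)  # https://www.youtube.com/embed/GmvM6syadl0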

Methods in Python 2.7 that enable text extraction from multiple HTML pages with different element tags?

I primarily work in Python 2.7. I'm trying to extract the written content (body text) of hundreds of articles from their respective URLs. To simplify things, I've started by trying to extract the text from just one website in my list, and I've been able to do so successfully using BeautifulSoup4. My code looks like this:
import urllib2
from bs4 import BeautifulSoup
url = 'http://insertaddresshere'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
response = urllib2.urlopen(request)
soup = BeautifulSoup(response, "html.parser")
texts = soup.find_all("p")
for item in texts:
    print item.text
This gets me the body text of a single article. I know how to iterate through a CSV file and write to a new one, but the sites I need to iterate through are all on different domains, so the HTML varies a lot. Is there any way to find body text from multiple articles that use different element tags (here, "p") for the body text? Is it possible to use BeautifulSoup to do this?
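One way to extend the snippet above to many sites is to wrap it in a loop over the URL list and keep the generic "collect every <p>" approach as a lowest-common-denominator heuristic. A rough sketch in the same Python 2.7 / urllib2 style as the question (the urls list and the length filter are placeholders, not a complete cross-domain solution):
import urllib2
from bs4 import BeautifulSoup

urls = ['http://insertaddresshere1', 'http://insertaddresshere2']  # placeholder list

for url in urls:
    request = urllib2.Request(url)
    request.add_header('Accept-Encoding', 'utf-8')
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response, "html.parser")
    # Crude heuristic: article bodies are usually in <p> tags; drop very short
    # fragments (navigation, captions) by length.
    paragraphs = [p.get_text() for p in soup.find_all("p") if len(p.get_text()) > 40]
    print "\n".join(paragraphs).encode("utf-8")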

Using BeautifulSoup to print specific information that has a <div> tag

I'm still new to using BeautifulSoup to scrape information from a website. With this piece of code I'm specifically trying to grab this value and others like it, and display them back to the user in a more condensed, easy-to-read form. I have a screenshot of the highlighted div and class I am trying to parse.
This is the code I'm using:
import urllib2
from bs4 import BeautifulSoup
a = "http://forecast.weather.gov/MapClick.php?lat=39.32196712788175&lon=-82.10190859830237&site=all&smap=1#.VQM_kOGGP7l"
website = urllib2.urlopen(a)
html = website.read()
soup = BeautifulSoup(html)
x = soup.find_all("div",{"class": "point-forecast-icons-low"})
print x
However, once it runs it returns "[]". I get no errors, but nothing happens. At first I thought maybe it couldn't find anything inside the <div> I told it to search for, but usually I would get back None saying nothing was found. So my best guess at the moment is that, since it is a div, it is not opening the div up to pull the other content from inside it.
You are getting [] because the point-forecast-icons-low class is not an attribute of a div; it's an attribute of a p tag. Try this instead:
x = soup.find_all("p", attrs={"class": "point-forecast-icons-low"})
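Put back into the original script, the corrected lookup looks roughly like this (a sketch based on the code in the question):
import urllib2
from bs4 import BeautifulSoup

url = "http://forecast.weather.gov/MapClick.php?lat=39.32196712788175&lon=-82.10190859830237&site=all&smap=1#.VQM_kOGGP7l"
soup = BeautifulSoup(urllib2.urlopen(url).read(), "html.parser")

# The class lives on <p> elements, not on a <div>.
for p in soup.find_all("p", attrs={"class": "point-forecast-icons-low"}):
    # Encode to avoid a console UnicodeEncodeError on symbols like the degree sign.
    print p.get_text(strip=True).encode("utf-8")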

Can't find table with soup.findAll('table') using BeautifulSoup in Python

I'm using soup.findAll('table') to try to find the table in an HTML file, but it does not appear.
The table definitely exists in the file, and with regex I'm able to locate it this way:
import sys
import urllib2
from bs4 import BeautifulSoup
import re
webpage = open(r'd:\samplefile.html', 'r').read()
soup = BeautifulSoup(webpage)
print re.findall("TABLE",webpage) #works, prints ['TABLE','TABLE']
print soup.findAll("TABLE") # prints an empty list []
I know I am correctly generating the soup since when I do:
print [tag.name for tag in soup.findAll(align=None)]
It correctly prints the tags that it finds. I have also tried different ways of writing "TABLE", like "table", "Table", etc.
Also, if I open the file in a text editor, it contains "TABLE".
Why doesn't BeautifulSoup find the table?
Context
python 2.x
BeautifulSoup HTML parser
Problem
BeautifulSoup's findAll does not return all of the expected tags, or returns none at all, even though the user knows the tag exists in the markup
Solution
Try specifying the exact parser when initializing the BeautifulSoup constructor:
## BEFORE
soup = BeautifulSoup(webpage)
## AFTER
soup = BeautifulSoup(webpage, "html5lib")
Rationale
The target markup may include malformed HTML, and different parsers have varying degrees of success with it.
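Applied to the code in the question, this amounts to passing the parser explicitly and searching for the lowercased tag name (a sketch; html5lib has to be installed separately, e.g. pip install html5lib):
from bs4 import BeautifulSoup

webpage = open(r'd:\samplefile.html', 'r').read()
soup = BeautifulSoup(webpage, "html5lib")
# HTML parsers normalize tag names to lowercase, so search for "table", not "TABLE".
print soup.findAll("table")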
See also
related post by Martijn that addresses the same issue

How to properly use xpath & regexp extractor in jmeter?

I have the following text in the HTML response:
<input type="hidden" name="test" value="testValue">
I need to extract the value from the above input tag.
I've tried both regexp and xpath extractor, but neither is working for me:
regexp pattern
input\s*type="hidden"\s*name="test"\s*value="(.+)"\s*>
xpath query
//input[@name="test"]/@value
The above XPath gives an error in the XPath Assertion listener: "No node matched".
I have tried a lot and concluded that the XPath works only if I use it as //input[@name].
The moment I add an actual name, it gives the error "No node matched".
Could anyone please suggest how to resolve this issue?
Please take a look at my previous answer:
https://stackoverflow.com/a/11452267/169277
The relevant part for you is step 3:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
String html = prev.getResponseDataAsString(); // get response from your sampler
Document doc = Jsoup.parse(html);
Element inputElement = doc.select("input[name=test]").first();
String inputValue = inputElement.attr("value");
vars.put("inputTextValue", inputValue);
Update
So you don't get tangled up in the code, I've created a JMeter post-processor called Html Extractor; here is the GitHub URL:
https://github.com/c0mrade/Html-Extractor
Since you are using the XPath Extractor to parse an HTML (not XML) response, ensure that the "Use Tidy (tolerant parser)" option is CHECKED in the XPath Extractor's control panel.
Your XPath query looks fine; check the option mentioned above and try again.
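As a side note, if you go down the Regular Expression Extractor route instead, it can help to sanity-check the pattern outside JMeter first; a quick sketch in Python (the pattern below is one possible variant, tested only against the sample line from the question):
import re

html = '<input type="hidden" name="test" value="testValue">'
# Non-greedy capture of the value attribute for the input named "test".
match = re.search(r'name="test"\s+value="(.+?)"', html)
print(match.group(1))  # testValue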