Append values in regular expressions - regex

I'm using XPath and regular expressions to obtain data from a web page. I'm using the following XPath to get the portion I'm interested in:
response.xpath('//*[@id="business-detail"]/div/p').extract()
EDIT:
Which provides the following:
[u'<p><span class="business-phone" itemprop="telephone">(415) 287-4225</span><span class="business-address" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress"><span itemprop="streetAddress">2180 Bryant St. STE 203, </span><span itemprop="addressLocality">San Francisco</span>,\xa0<span itemprop="addressRegion">CA</span>\xa0<span itemprop="postalCode">94110</span></span><span class="business-link">www.klopfarchitecture.com</span> <br><br></p>']
I'm interested in
<span itemprop="streetAddress">2180 Bryant St. STE 203, </span>
<span itemprop="addressLocality">San Francisco</span>
<span itemprop="addressRegion">CA</span>
<span itemprop="postalCode">94110</span>
So I'm using these regexes to extract the data:
reg = r'"streetAddress">[0-9]+[^<]*'
reg = r'"addressLocality"[^<]*'
reg = r'"addressRegion"[^<]*'
reg = r'"postalCode"[^<]*'
The problem is that there are four of them, so I get four variables. I need to append the data so that the full address is in one variable that I can assign to an Item. What would be an efficient way to accomplish this?
EDIT2:
You're right, Roshan Jossey, I can use response.xpath('//*[@itemprop="streetAddress"]').extract()
But there are still four labels: addressLocality, addressRegion, and postalCode. How do I merge the results?
I'm looking for this result:
2180 Bryant St. STE 203, San Francisco, CA 94110
And I'm getting this format for each of the four parts:
<span itemprop="streetAddress">2180 Bryant St. STE 203, </span>

I'd recommend using just XPath to solve this problem.
response.xpath('//*[@id="business-detail"]/div/p//span[@itemprop="streetAddress"]/text()').extract()[0]
will give you the street address. You can extract all the other elements in a similar fashion. Then it's just a matter of concatenating them.
Regular expressions look like overkill when such simple XPath solutions exist.
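For example, once the four text() values have been extracted, joining them is plain string work. A minimal sketch, using the values copied from the HTML in the question:

```python
# Assuming the four itemprop text() values have already been extracted;
# the literals below are copied from the question's HTML.
street = "2180 Bryant St. STE 203, "   # streetAddress (already ends with ", ")
locality = "San Francisco"             # addressLocality
region = "CA"                          # addressRegion
postal = "94110"                       # postalCode

full_address = "{}{}, {} {}".format(street, locality, region, postal)
print(full_address)  # 2180 Bryant St. STE 203, San Francisco, CA 94110
```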

Related

Adding a space between paragraphs when extracting text with BeautifulSoup

I need to extract useful text from news articles. I do it with BeautifulSoup but the output sticks together some paragraphs which prevents me from analysing the text further.
My code:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.bbc.co.uk/news/uk-england-39607452")
soup = BeautifulSoup(r.content, "lxml")

# delete unwanted tags:
for s in soup(['figure', 'script', 'style']):
    s.decompose()

article_soup = [s.get_text() for s in soup.find_all(
    'div', {'class': 'story-body__inner'})]

article = ''.join(article_soup)
print(article)
The output looks like this (just first 5 sentences):
The family of British student Hannah Bladon, who was stabbed to death in Jerusalem, have said they are "devastated" by the "senseless
and tragic attack".Ms Bladon, 20, was attacked on a tram in Jerusalem
on Good Friday.She was studying at the Hebrew University of Jerusalem
at the time of her death and had been taking part in an archaeological
dig that morning.Ms Bladon was stabbed several times in the chest and
died in hospital. She was attacked by a man who pulled a knife from
his bag and repeatedly stabbed her on the tram travelling near Old
City, which was busy as Christians marked Good Friday and Jews
celebrated Passover.
I tried adding a space after certain punctuations like ".", "?", and "!".
article = article.replace(".", ". ")
It works with paragraphs (although I believe there should be a smarter way of doing this), but not with the subtitles of the article's sections, which don't have any punctuation at the end. They are structured like this:
</p>
<h2 class="story-body__crosshead">
Subtitle text
</h2>
<p>
I will be grateful for your advice.
PS: adding a space when I 'join' the article_soup doesn't help.
You can use the separator argument of get_text, which will fetch all the strings in the current element separated by the given character:
article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'story-body__inner'})]
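As a quick illustration on a toy snippet (the div class matches the question; the rest of the markup is a made-up miniature of the BBC page), a space separator keeps the subtitle from sticking to the paragraphs:

```python
# Toy demonstration of get_text(separator=..., strip=...);
# the markup is a made-up miniature of the BBC page structure.
from bs4 import BeautifulSoup

html = ('<div class="story-body__inner"><p>First paragraph.</p>'
        '<h2 class="story-body__crosshead">Subtitle</h2>'
        '<p>Second paragraph.</p></div>')
soup = BeautifulSoup(html, "html.parser")
text = soup.find('div', {'class': 'story-body__inner'}).get_text(
    separator=" ", strip=True)
print(text)  # First paragraph. Subtitle Second paragraph.
```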

Beautiful Soup Exact tag data

I am using BeautifulSoup to extract some data from HTML page. What I am doing is:
list = soup.find_all('td', {'align': 'left', 'valign': None})
print list[0]
It gives me
<td align="left">\n<h3>Name XYZ</h3>\n CTS SANSKRUTI LAYOUT, 90 FEET RAOD, THAKUR COMPLEX, <br/>KANDIVALI EAST,<br/>Mumbai MAHARASHTRA-400101</td>
But I want output like:
Name: Name XYZ, Add: CTS SANSKRUTI LAYOUT, 90 FEET RAOD, THAKUR COMPLEX, KANDIVALI EAST, Mumbai MAHARASHTRA-400101
What should I do?
find_all returns a list of tags, so when you access the first item in the list (list[0]) you get the first tag, as in your output.
If you want to extract the text of a tag, you can use tag.text; in your case:
list[0].text
Actually, I think there are two methods for this, depending on what you are looking for.
I'm not sure whether the "Name" and "Add" strings in front of your desired output are typos or not, so here are the two possible ways I see to do it:
If you simply want to extract all the text beneath each tag of your list_tags obtained from the find_all method, without any manipulation such as separating each word, go for the get_text() method.
With it, you can opt for a simple list comprehension like:
>>> simple_uni_text = [tag.get_text() for tag in list_tags]
>>> simple_uni_text
[u'\nName XYZ\n CTS SANSKRUTI LAYOUT, 90 FEET RAOD, THAKUR COMPLEX, KANDIVALI EAST,Mumbai MAHARASHTRA-400101', u'\nName ABC\n DUT WITHOUT LAYIN, 45 FOOT AODR, RUKTHA SIMPLE, BOMBAY WEST,BOMBAY RASHTRAMAHA-400101']
>>> len(simple_uni_text)
2  # I pretended list_tags to have two tags, so it generated a list of length two!
The second method is the stripped_strings generator. It's maybe trickier, but you can gain in precision.
>>> uni_stripped_words = []
>>> for tag in list_tags:
...     for string in tag.stripped_strings:
...         uni_stripped_words.append(string)
>>> uni_stripped_words
[u'Name XYZ', u'CTS SANSKRUTI LAYOUT, 90 FEET RAOD, THAKUR COMPLEX,', u'KANDIVALI EAST,', u'Mumbai MAHARASHTRA-400101', u'Name ABC', u'DUT WITHOUT LAYIN, 45 FOOT AODR, RUKTHA SIMPLE,', u'BOMBAY WEST,', u'BOMBAY RASHTRAMAHA-400101']
>>> len(uni_stripped_words)
8
Here you get each string found beneath each tag of your list_tags separately. Thus, if you do want to add "Name" and "Add" in front of your text, this could better correspond to your needs.
>>> for word in uni_stripped_words:
...     print word
Name XYZ
CTS SANSKRUTI LAYOUT, 90 FEET RAOD, THAKUR COMPLEX,
KANDIVALI EAST,
Mumbai MAHARASHTRA-400101
Name ABC
DUT WITHOUT LAYIN, 45 FOOT AODR, RUKTHA SIMPLE,
BOMBAY WEST,
BOMBAY RASHTRAMAHA-400101 # Sorry for the weird text example haha
However, I find the second method less predictable; there are, for instance, sometimes unexpected characters. Personally, I prefer to concatenate when writing the output out to a file!
Anyway, in both cases, don't forget that the resulting lists will contain extracted text of unicode type.
Cheers
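Putting the stripped_strings idea together into the exact "Name: ..., Add: ..." line the question asks for might look like this (the HTML is the td shown in the question, typos and all):

```python
# A sketch combining stripped_strings into the requested output format;
# the HTML is the <td> from the question (its typos are preserved).
from bs4 import BeautifulSoup

html = ('<td align="left">\n<h3>Name XYZ</h3>\n CTS SANSKRUTI LAYOUT, '
        '90 FEET RAOD, THAKUR COMPLEX, <br/>KANDIVALI EAST,'
        '<br/>Mumbai MAHARASHTRA-400101</td>')
tag = BeautifulSoup(html, "html.parser").td
strings = list(tag.stripped_strings)  # first string is the name, rest is the address
result = "Name: {}, Add: {}".format(strings[0], " ".join(strings[1:]))
print(result)
```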

Regular Expressions | Delete words on multiple lines before a given word

I scraped several articles from a website and now I am trying to make the corpus more readable by deleting the first part of the scraped text.
The text to delete lies between the tag <p>Advertisement and the final </time> tag before the article starts. As you can see, the regular expression has to delete several words across multiple lines. I tried with the DOTALL flag but wasn't successful.
This is my first attempt:
import re
text='''
<p>Advertisement</p>, <p class="byline-dateline"><span class="byline"itemprop="author creator" itemscope="" itemtype="http://schema.org/Person">By <span class="byline-author"
data-byline-name="MILAN SCHREUER" itemprop="name">MILAN SCHREUER</span> and </span><span class="byline"
itemid="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html"
itemprop="author creator" itemscope="" itemtype="http://schema.org/Person"><a href="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html"
title="More Articles by ALISSA J. RUBIN"><span class="byline-author" data-byline-name="ALISSA J. RUBIN" data-twitter-handle="Alissanyt" itemprop="name">ALISSA J. RUBIN</span></a></span><time class="dateline" content="2016-10-06T01:02:19-04:00"
datetime="2016-10-06T01:02:19-04:00" itemprop="dateModified">OCT. 5, 2016</time>
</p>, <p class="story-body-text story-content" data-para-count="163" data-total-count="163">BRUSSELS — A man wounded two police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.”</p>, <p class="story-body-text story-content"
data-para-count="231" data-total-count="394">The two officers were attacked on the Boulevard Lambermont in the Schaerbeek district, just north of the city center. A third police officer, who came to their aid, was also injured. None of the three had life-threatening injuries.</p>
'''
my_pattern = "(.*)</time>"
results = re.sub(my_pattern, " ", text)
print(results)
Try this:
my_pattern = r'[\s\S]+</time>'
([\s\S] matches any character, newlines included, so no DOTALL flag is needed; the HTML characters don't need escaping either.)
If you also want to delete the following </p> tag, the comma, and the space, you can use this:
my_pattern = r'[\s\S]+</time>[\s\S]</p>,\s'
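Alternatively, the original .* attempt works once the DOTALL flag is actually passed to re.sub, so that . also matches newlines. A minimal runnable sketch on a shortened stand-in for the sample text:

```python
# The .* approach works when re.DOTALL is passed so . crosses newlines;
# the text below is a shortened stand-in for the sample in the question.
import re

text = ('<p>Advertisement</p>, <p class="byline-dateline">MILAN SCHREUER\n'
        'OCT. 5, 2016</time>\n</p>, <p>BRUSSELS - A man wounded two '
        'police officers.</p>')
results = re.sub(r'.*</time>\s*</p>,\s*', '', text, flags=re.DOTALL)
print(results)  # <p>BRUSSELS - A man wounded two police officers.</p>
```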

Sublime Text regex to re-order Facebook message export

I've exported my Facebook message history and found that messages are displayed as below. Some are in blocks, some blocks are out of order, some are together, all in reverse.
John Doe Sunday, 24 August 2014, 01:18
Hello!
Jane Doe Sunday, 24 August 2014, 01:17
Hi!
What I'm trying to do is use a regex ([a-z]*, $1, etc.) to search and replace in Sublime Text and re-lay the data out so that the above becomes sortable in Excel (to get everything in the correct order), as below (or in any date-sortable manner):
2014.08.24 01:17 Jane Doe Hi
2014.08.24 01:18 John Doe Hello!
Is this possible? I've managed to select the name, date, and time, but cannot capture the variable-length messages consistently, nor re-order/move the date/time or message into the example below. Does this make sense, or am I wasting my time?
Would it be best in a tab- or comma-separated form, too?
You'd be much better off using your favorite language to parse the HTML in html/messages.htm. Open the file in Sublime, use a code formatter like HTML-CSS-JS Prettify (note: requires Node.js) to format it, then look at the structure. Basically, each conversation is in reverse order, oldest first, and within the conversation each message is sorted most recent first (yay consistency!). Here's an example:
<div class="thread">Fred Smith, Joe Blow
  <div class="message">
    <div class="message_header">
      <span class="user">Fred Smith</span>
      <span class="meta">Monday, June 13, 2011 at 3:42pm EDT</span>
    </div>
  </div>
  <p>Not much Joe. How are you?</p>
  <div class="message">
    <div class="message_header">
      <span class="user">Joe Blow</span>
      <span class="meta">Monday, June 13, 2011 at 11:00am EDT</span>
    </div>
  </div>
  <p>Hey there Fred, what's up?</p>
</div>
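A hedged sketch of what that parsing could look like with BeautifulSoup (the class names come from the export structure above; the strptime format string is an assumption based on the sample timestamps):

```python
# Sketch: parse the export structure shown above and sort by timestamp.
# Class names come from the sample; the date format is an assumption.
from datetime import datetime
from bs4 import BeautifulSoup

html = '''<div class="thread">Fred Smith, Joe Blow
<div class="message">
<div class="message_header">
<span class="user">Fred Smith</span>
<span class="meta">Monday, June 13, 2011 at 3:42pm EDT</span>
</div>
</div>
<p>Not much Joe. How are you?</p>
<div class="message">
<div class="message_header">
<span class="user">Joe Blow</span>
<span class="meta">Monday, June 13, 2011 at 11:00am EDT</span>
</div>
</div>
<p>Hey there Fred, what's up?</p>
</div>'''

soup = BeautifulSoup(html, "html.parser")
rows = []
for msg in soup.select("div.message"):
    user = msg.select_one("span.user").get_text()
    meta = msg.select_one("span.meta").get_text()
    body = msg.find_next_sibling("p").get_text()
    # drop the trailing timezone abbreviation before parsing
    when = datetime.strptime(meta.rsplit(" ", 1)[0],
                             "%A, %B %d, %Y at %I:%M%p")
    rows.append((when, user, body))

rows.sort()  # oldest first
for when, user, body in rows:
    print(when.strftime("%Y.%m.%d %H:%M"), user, body)
```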

Regular Expression for Date format: D, dd M yy "Tue, 25 Oct 2011"

I am looking for a regular expression to validate a date selected from a datepicker with the format D, dd M yy:
$('#expirydate').datepicker({
    constrainInput: true,
    minDate: 0,
    dateFormat: 'D, dd M yy'
});
And the equivalent format in View is like this:
<div class="editor-label">
Expiry Date
</div>
<div class="editor-field">
    @Html.TextBox("ExpiryDate", String.Format("{0:ddd, d MMM yyyy}", DateTime.Now), new { id = "expirydate" })
    @Html.ValidationMessageFor(model => model.ExpiryDate)
</div>
I could not find one in the regex library. Can anyone help?
I appreciate any feedback. Thanks!
If it's coming from a datepicker, and you trust that it won't produce impossible dates, then to validate that format you could use:
(Mon|Tue|Wed|Thu|Fri|Sat|Sun), [0-3]\d (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}
Well, I wrote a regex for all the months, started to validate the days, and realized that you have to validate 28-31 depending on the month and year anyway.
Given that, you should skip the regex and start looking at the MSDN documentation on parsing date and time strings.
If you really want to pre-check it with a regex, (Mon|Tue|Wed|Thu|Fri|Sat|Sun), ([0-3]\d) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{4}) is the basic setup.
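As a quick sanity check, a day-month-year pattern of that shape can be tried in Python (re.fullmatch here just stands in for whatever anchored match your validation framework uses):

```python
# Sanity-checking a weekday, dd month yyyy pattern; re.fullmatch stands
# in for an anchored match in whatever regex engine you end up using.
import re

pattern = (r'(Mon|Tue|Wed|Thu|Fri|Sat|Sun), ([0-3]\d) '
           r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{4})')
print(bool(re.fullmatch(pattern, 'Tue, 25 Oct 2011')))  # True
print(bool(re.fullmatch(pattern, 'Tue, 45 Oct 2011')))  # False
```

Note that this only pre-checks the shape; days like 39 or Feb 30 still need a real date parser to reject.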