Regular Expressions | Delete words on multiple lines before a given word - regex

I scraped several articles from a website, and now I am trying to make the corpus more readable by deleting the first part of the scraped text.
The part to delete runs from the tag <p>Advertisement to the final tag </time> before the article starts. As you can see, the regular expression has to delete several words spread over multiple lines. I tried the DOTALL flag, but without success.
This is my first attempt:
import re
text='''
<p>Advertisement</p>, <p class="byline-dateline"><span class="byline"itemprop="author creator" itemscope="" itemtype="http://schema.org/Person">By <span class="byline-author"
data-byline-name="MILAN SCHREUER" itemprop="name">MILAN SCHREUER</span> and </span><span class="byline"
itemid="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html"
itemprop="author creator" itemscope="" itemtype="http://schema.org/Person"><a href="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html"
title="More Articles by ALISSA J. RUBIN"><span class="byline-author" data-byline-name="ALISSA J. RUBIN" data-twitter-handle="Alissanyt" itemprop="name">ALISSA J. RUBIN</span></a></span><time class="dateline" content="2016-10-06T01:02:19-04:00"
datetime="2016-10-06T01:02:19-04:00" itemprop="dateModified">OCT. 5, 2016</time>
</p>, <p class="story-body-text story-content" data-para-count="163" data-total-count="163">BRUSSELS — A man wounded two police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.”</p>, <p class="story-body-text story-content"
data-para-count="231" data-total-count="394">The two officers were attacked on the Boulevard Lambermont in the Schaerbeek district, just north of the city center. A third police officer, who came to their aid, was also injured. None of the three had life-threatening injuries.</p>
'''
my_pattern=("(.*)</time>")
results= re.sub(my_pattern," ", text)
print(results)

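By default, . does not match a newline, so (.*)</time> can only match within the single line that contains </time>. The class [\s\S] matches any character, including newlines, so try this:
my_pattern = r"[\s\S]+</time>"
If you also want to delete the following </p> tag, the comma, and the space, you can use this:
my_pattern = r"[\s\S]+</time>[\s\S]</p>,\s"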
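The same idea can also be written with the DOTALL flag that the question mentions. The following is a minimal sketch, not the answerer's code: the sample text is a shortened, hypothetical stand-in for the scraped article, and the lazy .*? is an assumption that the match should stop at the first </time> instead of the last one, which matters when one string holds several articles.

```python
import re

# Shortened, hypothetical stand-in for the scraped text in the question.
text = '''<p>Advertisement</p>, <p class="byline-dateline"><span class="byline">By MILAN SCHREUER</span><time class="dateline"
datetime="2016-10-06T01:02:19-04:00">OCT. 5, 2016</time>
</p>, <p class="story-body-text story-content">A man wounded two police officers with a knife in Brussels.</p>'''

# re.DOTALL lets "." match newlines; ".*?" is lazy, so it stops at the first "</time>".
# The tail of the pattern also consumes the closing "</p>", the comma and the whitespace.
header = re.compile(r"<p>Advertisement.*?</time>\s*</p>,\s*", re.DOTALL)

cleaned = header.sub("", text, count=1)
print(cleaned)
```

With count=1 only the first header is removed; dropping that argument removes every header that matches the pattern.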


Adding a space between paragraphs when extracting text with BeautifulSoup

I need to extract useful text from news articles. I do it with BeautifulSoup, but the output runs some paragraphs together, which prevents me from analysing the text further.
My code:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.bbc.co.uk/news/uk-england-39607452")
soup = BeautifulSoup(r.content, "lxml")
# delete unwanted tags:
for s in soup(['figure', 'script', 'style']):
    s.decompose()
article_soup = [s.get_text() for s in soup.find_all(
    'div', {'class': 'story-body__inner'})]
article = ''.join(article_soup)
print(article)
The output looks like this (just first 5 sentences):
The family of British student Hannah Bladon, who was stabbed to death in Jerusalem, have said they are "devastated" by the "senseless
and tragic attack".Ms Bladon, 20, was attacked on a tram in Jerusalem
on Good Friday.She was studying at the Hebrew University of Jerusalem
at the time of her death and had been taking part in an archaeological
dig that morning.Ms Bladon was stabbed several times in the chest and
died in hospital. She was attacked by a man who pulled a knife from
his bag and repeatedly stabbed her on the tram travelling near Old
City, which was busy as Christians marked Good Friday and Jews
celebrated Passover.
I tried adding a space after certain punctuation marks such as ".", "?", and "!".
article = article.replace(".", ". ")
This works for paragraphs (although I believe there should be a smarter way of doing it), but not for the subtitles of the article's sections, which have no punctuation at the end. They are structured like this:
</p>
<h2 class="story-body__crosshead">
Subtitle text
</h2>
<p>
I will be grateful for your advice.
PS: adding a space when I 'join' the article_soup doesn't help.
You can pass a separator to get_text. It then returns all the strings inside the current element joined by the given string, and strip=True strips the whitespace around each of them.
article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all( 'div', {'class': 'story-body__inner'})]
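To see the difference without fetching the page, here is a small sketch on an inline snippet. The HTML is a hypothetical stand-in that only mimics the structure described in the question, and it uses the built-in "html.parser" backend instead of lxml so that it needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the paragraph / crosshead structure from the question.
html = (
    '<div class="story-body__inner">'
    '<p>First paragraph.</p>'
    '<h2 class="story-body__crosshead">Subtitle text</h2>'
    '<p>Second paragraph.</p>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "story-body__inner"})

joined = div.get_text()                               # strings run together
separated = div.get_text(separator="\n", strip=True)  # one string per line

print(repr(joined))
print(repr(separated))
```

Because the separator is inserted between every string in the element, the subtitle ends up on its own line even though it has no closing punctuation.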

Regular expression and csv | Output more readable

I have a text file which contains different news articles about terrorist attacks. Each article starts with an HTML tag (<p>Advertisement), and I would like to extract one specific piece of information from each article: the number of people wounded in the attack.
This is a sample of the text file and how the articles are separated:
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded 2 police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.” , The two officers were attacked on the Boulevard Lambermont.....]
[<p>Advertisement ,, By KAREEM FAHIM and MOHAMAD FAHIM ABED JUNE 30, 2016
, At least 33 people were killed and 25 were injured when the Taliban bombed buses carrying police cadets on the outskirts of Kabul, Afghanistan, on Thursday. , KABUL, Afghanistan — Taliban insurgents bombed a convoy of buses carrying police cadets on the outskirts of Kabul, the Afghan capital, on Thursday, killing at least 33 people, including four civilians, according to government officials and the United Nations. , During a year...]
This is my code so far:
import re
text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
splitted = text_read.split("<p>")
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"
for article in splitted:
    result = re.findall(pattern, article)
    print(result)
The output that I get is:
[]
[]
[]
[('', '40', '')]
[('', '150', '')]
[('94', '', '')]
I would like to make the output more readable and then save it as a CSV file:
article_1,0
article_2,0
article_3,40
article_3,150
article_3,94
Any suggestions on how to make it more readable?
I rewrote your loop like this and merged it with the CSV writing, since you asked for it:
import csv
with open("wounded.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    for i, article in enumerate(splitted):
        result = re.findall(pattern, article)
        nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
        row = ["article_{}".format(i + 1), nb_casualties]
        writer.writerow(row)
Get the index of the article using enumerate.
Take the first match (result[0]) and sum its non-empty groups, converting each to an integer in a generator expression passed to sum; the ternary expression yields 0 when nothing matched.
Create the row.
Print it, or write it as a row (one row per iteration) with the csv.writer object.
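The steps above can be tried without the original file. This is a sketch under stated assumptions: the three sample articles are hypothetical, count_wounded is a helper name introduced only for illustration, and the rows are written to an in-memory buffer instead of wounded.csv.

```python
import csv
import io
import re

pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"

# Hypothetical sample articles standing in for the split text file.
articles = [
    "A man wounded 2 police officers with a knife.",
    "At least 33 people were killed and 25 were injured.",
    "No casualty figures were reported.",
]

def count_wounded(article):
    """Return the number captured by the first match, or 0 if nothing matched."""
    result = re.findall(pattern, article)
    # Each match is a tuple with one non-empty group; empty strings are skipped.
    return sum(int(x) for x in result[0] if x) if result else 0

buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.writer(buffer, delimiter=",")
for i, article in enumerate(articles):
    writer.writerow(["article_{}".format(i + 1), count_wounded(article)])

rows = buffer.getvalue().splitlines()
print(rows)
```

Note that only result[0] is used, so an article that mentions several figures contributes its first match only.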

Combining remove tags regex and remove empty lines in sed - Unix

Given a markup file like this:
<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
<seg id="5">In addition, participants at the event plan to discuss the rules for forming an expert panel, which is responsible for evaluating the work of scientific groups, as well as the criteria for carrying out evaluations.</seg>
<seg id="6">The third Expert Session will be the final meeting in a series of events on the formation of a unified approach for all three academies to the evaluation of the effectiveness of activities of scientific organizations.</seg>
<seg id="7">Over the past five months, we were able to achieve this, and the final version of the regulatory documents is undergoing approval.</seg>
<seg id="8">According to the plans for the upcoming session, we should complete the development of procedures for scientometric and expert analysis, and come to an agreement on the stages and timeframes for the evaluation process”, said the Head of FANO’s Expert-Analytical Department, Elena Aksenova.</seg>
<seg id="9">Representatives from more than one hundred Russian scientific institutes will take part in the event.</seg>
<seg id="10">It is expected that a resolution will be adopted based on its results.</seg>
<seg id="11">The meeting will begin at 10 am, Moscow time, on September 16, 2014, at the following address: 14 Solyanka Street, Moscow.</seg>
</p>
</doc>
</srcset>
I can remove the markup tags with sed (from "Sed remove tags from html file"):
sed -e 's/<[^>]*>//g' file.txt
That leaves empty lines in the output, so I then have to delete them with a second sed (from "Delete empty lines using SED"):
sed -e 's/<[^>]*>//g' file.txt | sed '/^\s*$/d'
How should I combine the tag-removal and empty-line-removal commands into one?
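What about deleting the empty lines right away, in the same sed script? The two commands are separated by a semicolon and run in order on each line, so the delete sees the line after the tags have been removed: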
sed -e 's/<[^>]*>//g;/^\s*$/d' file.txt
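For readers who want the same two-step logic outside sed, here is a rough Python equivalent. It is a sketch rather than a replacement for the command above; the function name strip_tags_and_blank_lines is made up for this example, and the sample is a shortened excerpt of the markup in the question.

```python
import re

TAG = re.compile(r"<[^>]*>")

def strip_tags_and_blank_lines(text):
    """Remove tags from each line, then drop lines that are empty or whitespace-only."""
    kept = []
    for line in text.splitlines():
        line = TAG.sub("", line)  # same job as s/<[^>]*>//g
        if line.strip():          # same job as /^\s*$/d
            kept.append(line)
    return "\n".join(kept)

# Shortened excerpt of the markup from the question.
sample = '''<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
</p>
</doc>'''

result = strip_tags_and_blank_lines(sample)
print(result)
```

As in the sed script, the order matters: the blank-line check has to run on the line after the tags are gone.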

Append values in regular expressions

I'm using XPath and regular expressions to obtain data from a web page.
I'm using the following XPath to get the portion I'm interested in.
response.xpath('//*[@id="business-detail"]/div/p').extract()
EDIT:
Which provides the following:
[u'<p><span class="business-phone" itemprop="telephone">(415) 287-4225</span><span class="business-address" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress"><span itemprop="streetAddress">2180 Bryant St. STE 203, </span><span itemprop="addressLocality">San Francisco</span>,\xa0<span itemprop="addressRegion">CA</span>\xa0<span itemprop="postalCode">94110</span></span><span class="business-link">www.klopfarchitecture.com</span> <br><br></p>']
I'm interested in
<span itemprop="streetAddress">2180 Bryant St. STE 203, </span>
<span itemprop="addressLocality">San Francisco</span>
<span itemprop="addressRegion">CA</span>
<span itemprop="postalCode">94110</span>
So I'm using this regex to extract the data
reg = r'"streetAddress">[0-9]+[^<]*'
reg = r'"addressLocality"[^<]*'
reg = r'"addressRegion"[^<]*'
reg = r'"postalCode"[^<]*'
The problem is that there are four of them, so I get four variables. I need to combine the data so that the full address is in one variable that I can assign to an Item. What would be an efficient way to accomplish this?
EDIT2:
You're right, Roshan Jossey, I can use response.xpath('//*[@itemprop="streetAddress"]').extract()
But there are still four labels, including addressLocality, addressRegion and postalCode. How do I merge the results?
I'm looking for this result:
2180 Bryant St. STE 203, San Francisco, CA 94110
And I'm getting this format for each of the four parts
<span itemprop="streetAddress">2180 Bryant St. STE 203, </span>
I'd recommend using just XPath to solve this problem:
response.xpath('//*[@id="business-detail"]/div/p//span[@itemprop="streetAddress"]/text()').extract()[0]
will give you the street address. You can extract all the other elements in a similar fashion. Then it's just a matter of concatenating them.
Regular expressions look like overkill when such a simple XPath solution exists.
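The concatenation step could look roughly like the sketch below. To keep it runnable without Scrapy, it swaps in the standard library's xml.etree.ElementTree, which supports the [@attribute='value'] predicate, and it runs on a simplified, well-formed copy of the address markup. The helper name field is hypothetical; in a spider the same lookups would go through response.xpath(...) as shown above.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the address markup from the question.
snippet = (
    '<p>'
    '<span itemprop="streetAddress">2180 Bryant St. STE 203, </span>'
    '<span itemprop="addressLocality">San Francisco</span>'
    '<span itemprop="addressRegion">CA</span>'
    '<span itemprop="postalCode">94110</span>'
    '</p>'
)
root = ET.fromstring(snippet)

def field(name):
    """Return the stripped text of the span with the given itemprop, or ''."""
    node = root.find(".//span[@itemprop='{}']".format(name))
    return node.text.strip() if node is not None and node.text else ""

# The street address keeps its trailing comma from the markup.
address = "{} {}, {} {}".format(
    field("streetAddress"),
    field("addressLocality"),
    field("addressRegion"),
    field("postalCode"),
)
print(address)
```

Building the string from named fields keeps the punctuation under your control instead of depending on the non-breaking spaces and commas between the spans.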

Sublime Text regex to re-order Facebook message export

I've exported my Facebook message history and found that messages are displayed as below. Some are in blocks, some blocks are out of order, some are together, all in reverse.
John Doe Sunday, 24 August 2014, 01:18
Hello!
Jane Doe Sunday, 24 August 2014, 01:17
Hi!
What I'm trying to do is to use a regex, [a-z]* and $1, etc., to search and replace in Sublime Text to re-layout the data such that the above becomes sortable in Excel (to get everything in correct order) as below (or any date-sortable manner):
2014.08.24 01:17 Jane Doe Hi
2014.08.24 01:18 John Doe Hello!
Is this possible? I've managed to select the name, date and time, but I cannot consistently capture the variable-length messages, nor move the date/time or the message into the layout shown in the example. Does this make sense, or am I wasting my time?
Would it be best in a tab/comma separated way, too?
You'd be much better off using your favorite language to parse the HTML in html/messages.htm. Open the file in Sublime, use a code formatter like HTML-CSS-JS Prettify (note: requires Node.js) to format it, then look at the structure. Basically, each conversation is in reverse order, oldest first, and within the conversation each message is sorted most recent first (yay consistency!). Here's an example:
<div class="thread">Fred Smith, Joe Blow
<div class="message">
<div class="message_header">
<span class="user">Fred Smith</span>
<span class="meta">Monday, June 13, 2011 at 3:42pm EDT</span>
</div>
</div>
<p>Not much Joe. How are you?</p>
<div class="message">
<div class="message_header">
<span class="user">Joe Blow</span>
<span class="meta">Monday, June 13, 2011 at 11:00am EDT</span>
</div>
</div>
<p>Hey there Fred, what's up?</p>
</div>
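A possible starting point in Python, using only the standard library, is sketched below. It assumes the export follows the structure in the example (a span with class "user", a span with class "meta", then a p holding the message) and that the date strings look like the ones shown. MessageParser and parse_when are names made up for this sketch, and the time zone abbreviation is simply dropped rather than converted.

```python
from datetime import datetime
from html.parser import HTMLParser

class MessageParser(HTMLParser):
    """Collect user, timestamp and text for each message in a thread."""

    def __init__(self):
        super().__init__()
        self.messages = []   # one dict per message, in file order
        self._field = None   # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("user", "meta"):
            self._field = cls
        elif tag == "p":
            self._field = "text"

    def handle_endtag(self, tag):
        if tag in ("span", "p"):
            self._field = None

    def handle_data(self, data):
        if self._field == "user":
            # A user span starts a new message record.
            self.messages.append({"user": data.strip(), "meta": "", "text": ""})
        elif self._field in ("meta", "text") and self.messages:
            self.messages[-1][self._field] += data.strip()

def parse_when(meta):
    """Parse e.g. 'Monday, June 13, 2011 at 3:42pm EDT', ignoring the time zone."""
    stamp = meta.rsplit(" ", 1)[0]
    return datetime.strptime(stamp, "%A, %B %d, %Y at %I:%M%p")

html = '''<div class="thread">Fred Smith, Joe Blow
<div class="message"><div class="message_header">
<span class="user">Fred Smith</span>
<span class="meta">Monday, June 13, 2011 at 3:42pm EDT</span>
</div></div>
<p>Not much Joe. How are you?</p>
<div class="message"><div class="message_header">
<span class="user">Joe Blow</span>
<span class="meta">Monday, June 13, 2011 at 11:00am EDT</span>
</div></div>
<p>Hey there Fred, what's up?</p>
</div>'''

parser = MessageParser()
parser.feed(html)

# Sort oldest first and emit tab-separated rows that a spreadsheet can sort.
ordered = sorted(parser.messages, key=lambda m: parse_when(m["meta"]))
rows = ["{}\t{}\t{}".format(parse_when(m["meta"]).strftime("%Y.%m.%d %H:%M"),
                            m["user"], m["text"]) for m in ordered]
print("\n".join(rows))
```

Tab-separated output avoids clashes with the commas that appear inside dates and messages, which also answers the question about tab versus comma separation.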