Adding a space between paragraphs when extracting text with BeautifulSoup - python-2.7

I need to extract useful text from news articles. I do it with BeautifulSoup but the output sticks together some paragraphs which prevents me from analysing the text further.
My code:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.bbc.co.uk/news/uk-england-39607452")
soup = BeautifulSoup(r.content, "lxml")
# delete unwanted tags:
for s in soup(['figure', 'script', 'style']):
s.decompose()
article_soup = [s.get_text() for s in soup.find_all(
'div', {'class': 'story-body__inner'})]
article = ''.join(article_soup)
print(article)
The output looks like this (just first 5 sentences):
The family of British student Hannah Bladon, who was stabbed to death in Jerusalem, have said they are "devastated" by the "senseless
and tragic attack".Ms Bladon, 20, was attacked on a tram in Jerusalem
on Good Friday.She was studying at the Hebrew University of Jerusalem
at the time of her death and had been taking part in an archaeological
dig that morning.Ms Bladon was stabbed several times in the chest and
died in hospital. She was attacked by a man who pulled a knife from
his bag and repeatedly stabbed her on the tram travelling near Old
City, which was busy as Christians marked Good Friday and Jews
celebrated Passover.
I tried adding a space after certain punctuations like ".", "?", and "!".
article = article.replace(".", ". ")
It works with paragraphs (although I believe there should be a smarter way of doing this) but not with subtitles for different sections of the articles which don't have any punctuation in the end. They are structured like this:
</p>
<h2 class="story-body__crosshead">
Subtitle text
</h2>
<p>
I will be grateful for your advice.
PS: adding a space when I 'join' the article_soup doesn't help.

You can use separator in your get_text, which will fetch all the strings in the current element separated by the given character.
article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all( 'div', {'class': 'story-body__inner'})]

Related

Why is my regex displaying in alphabetical order?

For my assignment, I am trying to scrape information off the following website: https://www.blueroomcinebar.com/movies/now-showing/.
My code needs to find movie names, times and posters. Both the movie times and posters appear to be displayed in the list I have created according to the order they appear in the HTML, however, the names seem to be in alphabetical order.
We are not allowed to use BeautifulSoup
This is my current code for scraping movies:
from re import findall, finditer, MULTILINE, DOTALL
from urllib.request import urlopen
movies_name = []
movies_times = []
movies_image = []
movies_list = []
movies_page = urlopen("https://www.blueroomcinebar.com/movies/now-showing/").read().decode('utf-8')
#Add movies to Movies at Blue Room Screen
find_movie_names = findall(r'<h1>(.*?)</h1>', movies_page)
find_movie_times = findall(r'<p>([0-9]{1,2}:[0-9]{2} AM|PM)</p>', movies_page)
find_movie_image = findall(r'<div class="poster" style="background-image: url\((.*?)\)">', movies_page)
print(find_movie_names)
#Add movies to arrays
for movie in find_movie_names:
movies_name.append(movie)
for movie in find_movie_times:
movies_times.append(movie)
for movie in find_movie_image:
movies_image.append(movie)
print(movies_name)
print(movies_image)
for movie in range(len(movies_name)):
movies_list.append("{};{};{}".format(movies_name[movie], movies_times[movie], movies_image[movie - 1]))
Currently, the names are in the list in the order of
['Aladdin', 'Avengers: Endgame', 'Chandigarh Amritsar Chandigarh', 'John Wick - Parabellum', 'Long Shot', 'Pokemon Detective Pikachu', 'Poms', 'The Hustle', 'Top End Wedding']
They should be in the order:
['Avengers: Endgame', 'Long Shot', 'Pokemon Detective Pikachu', 'The Hustle', 'John Wick - Parabellum', 'Aladdin', 'Chandigarh Amritsar Chandigarh']
N.P.
There may be a movie that comes up a second time with the precursor OCAP. I'm not 100% sure why it has that but it seems to be some kind of special screening that rotates through different movies each day.

Regular expression and csv | Output more readable

I have a text which contains different news articles about terrorist attacks. Each article starts with an html tag (<p>Advertisement) and I would like to extract from each article a specific information: the number of people wounded in the terrorist attacks.
This is a sample of the text file and how the articles are separated:
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded 2 police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.” , The two officers were attacked on the Boulevard Lambermont.....]
[<p>Advertisement ,, By KAREEM FAHIM and MOHAMAD FAHIM ABED JUNE 30, 2016
, At least 33 people were killed and 25 were injured when the Taliban bombed buses carrying police cadets on the outskirts of Kabul, Afghanistan, on Thursday. , KABUL, Afghanistan — Taliban insurgents bombed a convoy of buses carrying police cadets on the outskirts of Kabul, the Afghan capital, on Thursday, killing at least 33 people, including four civilians, according to government officials and the United Nations. , During a year...]
This is my code so far:
text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
splitted = text.read.split("<p>")
pattern= ("wounded (\d+)|(\d+) were wounded|(\d+) were injured")
for article in splitted:
result = re.findall(pattern,article)
The output that I get is:
[]
[]
[]
[('', '40', '')]
[('', '150', '')]
[('94', '', '')]
And I would like to make the output more readable and then save it as csv file:
article_1,0
article_2,0
article_3,40
article_3,150
article_3,94
Any suggestion in how to make it more readable?
I rewrote your loop like this and merged with csv write since you requested it:
import csv
with open ("wounded.csv","w",newline="") as f:
writer = csv.writer(f, delimiter=",")
for i,article in enumerate(splitted):
result = re.findall(pattern,article)
nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
row=["article_{}".format(i+1),nb_casualties]
writer.writerow(row)
get index of the article using enumerate
sum the number of victims (in case more than 1 group matches) using a generator comprehension to convert to integer and pass it to sum, that only if something matched (ternary expression checks that)
create the row
print it, or optionally write it as row (one row per iteration) of a csv.writer object.

Regular Expressions | Delete words on multiple lines before a given word

I scraped several articles from a website and now I am trying to make the corpus more readable by deleting the first part from the text scraped.
The interval that it should be deleted is within the tag <p>Advertisement and the final tag </time> before the article starts. As you can see, the regular expression should delete several words on multiple lines. I tried with the DOTALL sequence but it wasn't successful.
This is my first attempt:
import re
text='''
<p>Advertisement</p>, <p class="byline-dateline"><span class="byline"itemprop="author creator" itemscope="" itemtype="http://schema.org/Person">By <span class="byline-author"
data-byline-name="MILAN SCHREUER" itemprop="name">MILAN SCHREUER</span> and </span><span class="byline"
itemid="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html"
itemprop="author creator" itemscope="" itemtype="http://schema.org/Person"><a href="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html"
title="More Articles by ALISSA J. RUBIN"><span class="byline-author" data-byline-name="ALISSA J. RUBIN" data-twitter-handle="Alissanyt" itemprop="name">ALISSA J. RUBIN</span></a></span><time class="dateline" content="2016-10-06T01:02:19-04:00"
datetime="2016-10-06T01:02:19-04:00" itemprop="dateModified">OCT. 5, 2016</time>
</p>, <p class="story-body-text story-content" data-para-count="163" data-total-count="163">BRUSSELS — A man wounded two police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.”</p>, <p class="story-body-text story-content"
data-para-count="231" data-total-count="394">The two officers were attacked on the Boulevard Lambermont in the Schaerbeek district, just north of the city center. A third police officer, who came to their aid, was also injured. None of the three had life-threatening injuries.</p>
'''
my_pattern=("(.*)</time>")
results= re.sub(my_pattern," ", text)
print(results)
Try this:
my_pattern=("[\s\S]+\<\/time\>")
If you also want to delete also the following tag </p>, the comma , and the space, you can use this:
my_pattern=("[\s\S]+\<\/time\>[\s\S]\<\/p\>\,\s")

BeautifulSoup: pulling a tag preceding another tag

I'm pulling lists on webpages and to give them context, I'm also pulling the text immediately preceding them. Pulling the tag preceding the <ul> or <ol> tag seems to be the best way. So let's say I have this list:
I'd want to pull the bullet and word "Millennials". I use a BeautifulSoup function:
#pull <ul> tags
def pull_ul(tag):
return tag.name == 'ul' and tag.li and not tag.attrs and not tag.li.attrs and not tag.a
ul_tags = webpage.find_all(pull_ul)
#find text immediately preceding any <ul> tag and append to <ul> tag
ul_with_context = [str(ul.previous_sibling) + str(ul) for ul in ul_tags]
When I print ul_with_context, I get the following:
['\n<ul>\n<li>With immigration adding more numbers to its group than any other, the Millennial population is projected to peak in 2036 at 81.1 million. Thereafter the oldest Millennial will be at least 56 years of age and mortality is projected to outweigh net immigration. By 2050 there will be a projected 79.2 million Millennials.</li>\n</ul>']
As you can see, "Millennials" wasn't pulled. The page I'm pulling from is http://www.pewresearch.org/fact-tank/2016/04/25/millennials-overtake-baby-boomers/
Here's the section of code for the bullet:
The <p> and <ul> tags are siblings. Any idea why it's not pulling the tag with the word "Millennials" in it?
Previous_sibling will return elements or strings preceding the tag. In your case, it returns the string '\n'.
Instead, you could use the findPrevious method to get the node preceding what you selected:
doc = """
<h2>test</h2>
<ul>
<li>1</li>
<li>2</li>
</ul>
"""
soup = BeautifulSoup(doc, 'html.parser')
tags = soup.find_all('ul')
print [ul.findPrevious() for ul in tags]
print tags
will output :
[<h2>test</h2>]
[<ul><li>1</li><li>2</li></ul>]

Converting a textfile into a list

I have a text file which contains a series of movie titles, which looks like this once opened.
A Nous la Liberte (1932) About Schmidt (2002) Absence of Malice
(1981) Adam's Rib (1949) Adaptation (2002) The Adjuster (1991) The
Adventures of Robin Hood (1938) Affliction (1998) The African Queen
(1952)
Using the code below:
def movie_text():
moviefile = open("movies.txt", 'r')
yourResult = [line.split('\n') for line in moviefile.readlines()]
movie_text()
I get nothing.
Your code doesn't prints right.
If I understand it well,
moviefile = open("movies.txt", 'r')
lines=moviefile.readlines()
print(len(lines)) # Shows list size
for line in lines:
print(line[:1]) # The [:1] part cuts the \n
The method readlines returns a list, I am not sure why your use split. I mean, if all you want is to remove the '\n', you can do it in many ways, being the one I used just one of them.
Hope it works!