BeautifulSoup: pulling a tag preceding another tag - python-2.7

I'm pulling lists on webpages and to give them context, I'm also pulling the text immediately preceding them. Pulling the tag preceding the <ul> or <ol> tag seems to be the best way. So let's say I have this list:
I'd want to pull the bullet and word "Millennials". I use a BeautifulSoup function:
# pull <ul> tags
def pull_ul(tag):
    return tag.name == 'ul' and tag.li and not tag.attrs and not tag.li.attrs and not tag.a
ul_tags = webpage.find_all(pull_ul)
# find text immediately preceding any <ul> tag and append to <ul> tag
ul_with_context = [str(ul.previous_sibling) + str(ul) for ul in ul_tags]
When I print ul_with_context, I get the following:
['\n<ul>\n<li>With immigration adding more numbers to its group than any other, the Millennial population is projected to peak in 2036 at 81.1 million. Thereafter the oldest Millennial will be at least 56 years of age and mortality is projected to outweigh net immigration. By 2050 there will be a projected 79.2 million Millennials.</li>\n</ul>']
As you can see, "Millennials" wasn't pulled. The page I'm pulling from is http://www.pewresearch.org/fact-tank/2016/04/25/millennials-overtake-baby-boomers/
Here's the section of code for the bullet:
The <p> and <ul> tags are siblings. Any idea why it's not pulling the tag with the word "Millennials" in it?

previous_sibling returns whatever node immediately precedes the tag, which can be an element or a string. In your case, it returns the string '\n', the whitespace between the two tags.
Instead, you could use the find_previous method (spelled findPrevious in older BeautifulSoup code) to get the tag preceding the one you selected:
from bs4 import BeautifulSoup
doc = """
<h2>test</h2>
<ul>
<li>1</li>
<li>2</li>
</ul>
"""
soup = BeautifulSoup(doc, 'html.parser')
tags = soup.find_all('ul')
print([ul.find_previous() for ul in tags])
print(tags)
will output :
[<h2>test</h2>]
[<ul><li>1</li><li>2</li></ul>]
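Since the question says the <p> and <ul> tags are siblings, another option is to look only among the siblings. The sketch below is a minimal illustration, not the answerer's code: the markup is made up, and with_context is a hypothetical helper name. It relies on find_previous_sibling, which skips whitespace strings such as '\n' when asked for a tag. Note that find_previous walks backwards through the whole document, so it can return a tag nested inside the preceding paragraph rather than the paragraph itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
doc = """
<p>Millennials</p>
<ul>
<li>first point</li>
</ul>
"""

soup = BeautifulSoup(doc, "html.parser")


def with_context(ul):
    """Return the text of the nearest preceding sibling tag plus the list itself."""
    # Passing True matches any tag, so whitespace-only strings are skipped.
    heading = ul.find_previous_sibling(True)
    prefix = heading.get_text(strip=True) if heading is not None else ""
    return prefix + "\n" + str(ul)


ul_with_context = [with_context(ul) for ul in soup.find_all("ul")]
print(ul_with_context[0])
```

If the list is the first child of its parent, there is no preceding sibling tag and the helper falls back to an empty prefix.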

Wrapping long text sections in Jinja2

I have the definition of a variable, its name and an associated comment in a YAML file, and I am trying to use Jinja2 to create an appropriate target file; in this case, a proprietary config file:
...
- comment: >
    This is a comment which will almost certainly end up longer than standard eighty characters or at least on the occasion on which it does.
  name: my_useful_variable
  value: /a/long/example/path/to/my/file.txt
I would like this text to be rendered as follows:
# This is a comment which will almost certainly end up
# longer than standard eighty characters or at least on
# the occasion on which it does.
my_useful_variable = "/a/long/example/path/to/my/file.txt"
Does Jinja2 have any way of wrapping text so that the long comment line is limited in length and split over however many lines is necessary?
So far I have:
# {{item.comment}}
{{item.name}} = "{{item.value}}"
But this of course does not deal with the length of the comment.
Solution
Following on from the answer provided by @blhsing below, I came up with the following macro, which works fine for basic variables and simple lists (i.e. not dictionaries or more complex hierarchical data structures):
{% macro set_params(param_list, ind=4, wid=80) -%}
{% for item in param_list %}
{% if item.comment is defined %}{{item.comment|wordwrap(wid - ind - 2)|replace('', ' ' * ind +'# ', 1)|replace('\n', '\n' + ' ' * ind + '# ')}}
{% endif %}
{% if item.value is iterable and item.value is not string %}{{item.name|indent(ind, True)}} = [ {% for item_2 in item.value %}{{item_2}}{{ ", " if not loop.last else " " }}{% endfor %}]{% else %}{{item.name|indent(ind, True)}} = {{item.value}}{% endif %}
{% endfor %}
{%- endmacro %}
To use this, simply pass a list of items similar to the spec given at the top together with the indentation and the page width.
A bit of explanation:
Line 3, If comment is defined then it is word wrapped to the correct length bearing in mind the width and the indent. The first replace deals with indenting the first line and the second indents subsequent lines. All prefixed with '# '
Line 5, depending on whether the variable is simple or iterable, it is rendered in the form name = value or name = [ value1, value2, value3 ]
Of course, it is not fool-proof but it meets my basic requirements.
You can prepend the given string with a newline character, then use the wordwrap filter to wrap the text into multiple lines first, and use the replace filter to replace newline characters with newline plus '# ':
{{ ('\n' ~ item.comment) | wordwrap(78) | replace('\n', '\n# ') }}
The above assumes you want each line to be no more than 80 characters. Change 78 to your desired line width minus 2 to leave room for '# '.
If you're doing this in Ansible, another option is to use the Ansible comment filter:
{{ item.comment | wordwrap(78) | comment }}
or, for more detailed control
{{ item.comment | wordwrap(78) | comment(decoration="# ", prefix="", postfix="") }}
Docs: https://docs.ansible.com/ansible/latest/user_guide/playbooks_filters.html#adding-comments-to-files
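To see what the wrap-then-prefix idea produces without a template engine, here is a rough plain-Python equivalent. It is only a sketch: it uses the standard library's textwrap instead of Jinja2's wordwrap filter, and render_param is a hypothetical helper name. Unlike the filter approach, textwrap counts the '# ' prefix as part of the width, so no "minus 2" is needed.

```python
import textwrap


def render_param(item, width=80, prefix="# "):
    """Render one item as a wrapped comment block followed by name = "value"."""
    comment = textwrap.fill(
        " ".join(item["comment"].split()),  # collapse the folded YAML text to one line
        width=width,                        # total line width, prefix included
        initial_indent=prefix,
        subsequent_indent=prefix,
    )
    return '%s\n%s = "%s"' % (comment, item["name"], item["value"])


item = {
    "comment": "This is a comment which will almost certainly end up longer "
               "than standard eighty characters or at least on the occasion "
               "on which it does.\n",
    "name": "my_useful_variable",
    "value": "/a/long/example/path/to/my/file.txt",
}

rendered = render_param(item, width=60)
print(rendered)
```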

display a parent-child-child with django

I'm new to Django and get lost easily.
I have an app that has items. They are stored as a list with parent-child relations.
Later I want to display tasks attached to items, but for now I can't even figure out how to display the parents and children.
This is my simple model:
class Item(models.Model):
    item_title = models.CharField()
    item_parent = models.ForeignKey('self')
I want to display them as:
Item 1
- item 2
- item 3
-- item 4
-- item 5
Item 6
- item 7
-- item 8
--- item 9
I have tried making a view that takes Item.objects.all().order_by('item_parent')
and a template with a for ... in loop. But I don't know how to separate the items so that each parent is shown first, then its child, then that child's children if any exist.
I only managed to list everything ordered by item_parent, which is not the same thing.
I would appreciate some expert help for a beginner like me.
If performance is not an issue, then a solution based on MaximeK's answer is the simplest. Its performance is not great because it issues an extra database query for each item's children. For a very small number of items this is OK.
A more efficient way, which also supports an arbitrary depth of children, is to fetch all the items at once and then build a tree that you can traverse to print the items in order. It is more code to write, but it might be educational if not directly helpful with your problem.
First step: we generate a tree for each item that does not have a parent (stored in roots). Side note: we can think of these trees as one big tree starting at a single root node that has all the items with no parents as children, but for simplicity we don't do that.
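class Node(object):
    # Node is not defined in the original answer; this minimal version is assumed
    def __init__(self, children, item):
        self.children = children
        self.item = item
references = {}  # maps item pk -> Node
roots = []
items = Item.objects.all()
for item in items:
    # get or make a new node
    if item.pk in references:
        n = references[item.pk]
        n.item = item
    else:
        n = Node(children=[], item=item)
        references[item.pk] = n
    # item_parent_id reads the stored key without fetching the parent object
    if item.item_parent_id is None:
        # if item is root (no parent)
        roots.append(n)
    else:
        # if item has a parent
        if item.item_parent_id in references:
            # item parent already seen
            parent_n = references[item.item_parent_id]
        else:
            # item parent not seen yet: create a placeholder and remember it
            parent_n = Node(children=[], item=None)
            references[item.item_parent_id] = parent_n
        parent_n.children.append(n)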
Second step: we traverse the tree depth-first
def dfs(root, level=0):
    print("-"*level, root.item.item_title)
    for node in root.children:
        dfs(node, level+1)
for root in roots:
    dfs(root)
This is just printing the item_title with - in front to denote the indentation level. I generated some random items and the output looks like this:
python mouse cat
- mouse monitor car
-- blue car cat
green machine computer
- monitor green yellow
yellow pen blue
- mouse cat yellow
yellow blue green
- cat monitor python
-- blue yellow python
-- machine green cat
--- monitor blue python
-- machine computer mouse
-- machine car blue
car pen yellow
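The two steps can also be tried without a database. The following is a minimal sketch under stated assumptions: Row is a made-up stand-in for the Item model, the rows are invented, and build_forest and flatten are hypothetical helper names. One child is deliberately listed before its parent to show that a single pass still links everything.

```python
from collections import namedtuple

# Stand-in for the Django model: (pk, parent_id, title). The rows are made up.
Row = namedtuple("Row", "pk parent_id title")
rows = [
    Row(4, 3, "item 4"),      # child listed before its parent on purpose
    Row(1, None, "Item 1"),
    Row(2, 1, "item 2"),
    Row(3, 1, "item 3"),
    Row(6, None, "Item 6"),
]


class Node(object):
    def __init__(self):
        self.item = None
        self.children = []


def build_forest(rows):
    """Return the root nodes after linking every row to its parent in one pass."""
    nodes = {}
    roots = []
    for row in rows:
        # Reuse the placeholder if a child already created one for this pk.
        node = nodes.setdefault(row.pk, Node())
        node.item = row
        if row.parent_id is None:
            roots.append(node)
        else:
            nodes.setdefault(row.parent_id, Node()).children.append(node)
    return roots


def flatten(root, level=0):
    """Depth-first list of display lines, one '-' per nesting level."""
    title = root.item.title
    lines = ["-" * level + " " + title if level else title]
    for child in root.children:
        lines.extend(flatten(child, level + 1))
    return lines


lines = [line for root in build_forest(rows) for line in flatten(root)]
print("\n".join(lines))
```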
I don't know how to do this in Django templates, but we can generate HTML that looks like this:
<ul>
<li>pen monitor cat
<ul>
<li>computer mouse machine</li>
<li>yellow python car
<ul>
<li>monitor python pen</li>
<li>mouse blue green</li>
<li>python blue cat</li>
</ul>
</li>
</ul>
</li>
<li>mouse computer cat</li>
<li>computer python car
<ul>
<li>pen green python
<ul>
<li>mouse computer machine</li>
</ul>
</li>
<li>machine yellow mouse</li>
</ul>
</li>
<li>yellow python monitor</li>
<li>car cat pen
<ul>
<li>pen machine blue
<ul>
<li>mouse computer machine</li>
</ul>
</li>
</ul>
</li>
</ul>
Depth-first traversal that generates the above HTML. I wrote it as a class to avoid global variables.
class TreeHtmlRender:
    def __init__(self, roots):
        self.roots = roots
    def traverse(self):
        self.html_result = "<ul>"
        for root in self.roots:
            self.dfs(root, 0)
        self.html_result += "</ul>"
        return self.html_result
    def dfs(self, root, level=0):
        self.html_result += ("<li>%s" % root.item.item_title)
        if len(root.children) > 0:
            self.html_result += "<ul>"
            for node in root.children:
                self.dfs(node, level+1)
            self.html_result += "</ul>"
        self.html_result += "</li>"
r = TreeHtmlRender(roots)
print(r.traverse())
To render this on a webpage, you can simply send the HTML to your template via the context and mark it as safe with the safe filter ({{ items_tree|safe }}). Because safe turns off auto-escaping, escape the titles yourself if they come from user input. You can pack everything in this answer into a neat template tag that renders trees, if you need or want to.
Note: A clear limitation of this approach is that it will not function properly if not all items are selected. If you select a subset of all your items and if it happens that you select child nodes and omit their parents, the child nodes will never be displayed.
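Because the generated markup is output with auto-escaping turned off, the titles should be escaped while the HTML is built. Here is a small sketch of that idea, assuming Python 3 (for html.escape) and a simplified tree of (title, children) pairs rather than the Node objects above; render_tree is a hypothetical name.

```python
from html import escape


def render_tree(nodes):
    """nodes: list of (title, children) pairs; returns nested <ul> markup."""
    parts = ["<ul>"]
    for title, children in nodes:
        # Escape the title so user-supplied text cannot inject markup.
        parts.append("<li>" + escape(title))
        if children:
            parts.append(render_tree(children))
        parts.append("</li>")
    parts.append("</ul>")
    return "".join(parts)


tree = [("pen <b>", [("mouse", [])]), ("car", [])]
html_result = render_tree(tree)
print(html_result)
```

Collecting the pieces in a list and joining once also avoids repeated string concatenation on large trees.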
You need to use the reverse relation, which by default is named after the model:
item.item_set.all()
This means that for each item you have the list of its children (_set is for query set).
When you define a ForeignKey, you automatically get a reverse relation.
You can do something simpler by naming it with related_name:
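class Item(models.Model):
    item_title = models.CharField(max_length=200)  # max_length is required; 200 is an arbitrary choice
    # on_delete is required since Django 2.0; CASCADE is an assumption here
    item_parent = models.ForeignKey('self', blank=True, null=True, related_name='children', on_delete=models.CASCADE)
And you retrieve :
for item in Item.objects.filter(item_parent__isnull=True):
    print(item.item_title)
    for child in item.children.all():
        print(child.item_title)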

Adding a space between paragraphs when extracting text with BeautifulSoup

I need to extract useful text from news articles. I do it with BeautifulSoup but the output sticks together some paragraphs which prevents me from analysing the text further.
My code:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.bbc.co.uk/news/uk-england-39607452")
soup = BeautifulSoup(r.content, "lxml")
# delete unwanted tags:
for s in soup(['figure', 'script', 'style']):
    s.decompose()
article_soup = [s.get_text() for s in soup.find_all(
    'div', {'class': 'story-body__inner'})]
article = ''.join(article_soup)
print(article)
The output looks like this (just first 5 sentences):
The family of British student Hannah Bladon, who was stabbed to death in Jerusalem, have said they are "devastated" by the "senseless
and tragic attack".Ms Bladon, 20, was attacked on a tram in Jerusalem
on Good Friday.She was studying at the Hebrew University of Jerusalem
at the time of her death and had been taking part in an archaeological
dig that morning.Ms Bladon was stabbed several times in the chest and
died in hospital. She was attacked by a man who pulled a knife from
his bag and repeatedly stabbed her on the tram travelling near Old
City, which was busy as Christians marked Good Friday and Jews
celebrated Passover.
I tried adding a space after certain punctuation marks such as ".", "?", and "!":
article = article.replace(".", ". ")
It works with paragraphs (although I believe there should be a smarter way of doing this) but not with subtitles for different sections of the articles which don't have any punctuation in the end. They are structured like this:
</p>
<h2 class="story-body__crosshead">
Subtitle text
</h2>
<p>
I will be grateful for your advice.
PS: adding a space when I 'join' the article_soup doesn't help.
You can use the separator argument of get_text, which joins all the strings in the current element with the given string. Adding strip=True also strips whitespace from each string and skips the empty ones:
article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'story-body__inner'})]
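The effect is easy to check on a small inline document. The sketch below uses made-up markup that mimics the structure described in the question (paragraphs around a crosshead without punctuation), and a space as the separator instead of a newline.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described in the question.
html = """<div class="story-body__inner"><p>First paragraph.</p>
<h2 class="story-body__crosshead">
Subtitle text
</h2>
<p>Second paragraph.</p></div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "story-body__inner"})

# Each string is stripped, empty ones are dropped, the rest are joined by a space.
spaced = div.get_text(separator=" ", strip=True)
print(spaced)
```

Because the separator is inserted between every pair of strings, this works for headings that do not end in punctuation, which the replace(".", ". ") approach cannot handle.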

BeautifulSoup: Combining continious NavigableString into single NavigableString

<html>
<body>
<p>A <span>die</span> is thrown \(x = {-b \pm\sqrt{b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from both the throws?</p>
<p> Test </p>
</body>
</html>
I am trying to wrap \(x = {-b \pm\sqrt{b^2-4ac} \over 2a}\) within span tags. I was able to do so when the text "is thrown \(x = {-b \pm\sqrt{b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from both the throws?" is a single NavigableString, but in some cases it is split up into three NavigableStrings: "is thrown \(x = {-b \pm", "\sqrt" and "{b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from both the throws?". Is there any way, using BeautifulSoup, to merge adjacent NavigableStrings into a single NavigableString?
This is the code I used to wrap the expression within a span tag when \(x = {-b \pm\sqrt{b^2-4ac} \over 2a}\) is within a single NavigableString:
import re
from bs4 import NavigableString
mathml_regex = re.compile(r'\\\(.*?\\\)', re.DOTALL)
def mathml_wrap(soup):
    for p_tags in soup.find_all('p'):
        # iterate over a copy, because the loop body modifies the children
        for p_child in list(p_tags.children):
            try:
                match = re.search(mathml_regex, p_child)
                if match:
                    start = match.start()
                    end = match.end()
                    text = p_child
                    new_str = NavigableString(text[:start])
                    p_child.replace_with(new_str)
                    new_str1 = NavigableString(text[end:])
                    span_tag = soup.new_tag("span", **{'class':'math-tex'})
                    span_tag.string = text[start:end]
                    new_str.insert_after(span_tag)
                    span_tag.insert_after(new_str1)
            except TypeError:
                # p_child is a Tag rather than a string, so re.search rejects it
                pass
Edit:
from bs4 import BeautifulSoup
import re
html = r"""<p>
A
<span>die</span>
is thrown \(x = {-b \pm
<span>\sqrt</span>
{b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
both the throws?
</p> <p> Test </p>"""
soup = BeautifulSoup(html, 'html.parser')
mathml_start_regex = re.compile(r'\\\(')
mathml_end_regex = re.compile(r'\\\)')
for p_tags in soup.find_all('p'):
    match = 0 #Flag set to 1 if '\(' is found and again set back to 0 if '\)' is found.
    for p_child in p_tags.children:
        try: #Captures Tags that contains \(
            if re.findall(mathml_start_regex, p_child.text):
                match += 1
        except: #Captures NavigableString that contains \(
            if re.findall(mathml_start_regex, p_child):
                match += 1
        try: #Replaces Tag with Tag's text
            if match == 1:
                p_child.replace_with(p_child.text)
        except: #No point in replacing NavigableString since they are just strings without Tags
            pass
        try: #Captures Tags that contains \)
            if re.findall(mathml_end_regex, p_child.text):
                match = 0
        except: #Captures NavigableString that contains \)
            if re.findall(mathml_end_regex, p_child):
                match = 0
After processing my soup with the above code to remove the span tag between \( and \), the text is split up into 3 NavigableStrings in my soup object:
"is thrown \(x = {-b \pm", "\sqrt" and "{b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from both the throws?"
I don't know if I understood your question properly, but as you said, you want to concatenate the strings you are getting in those <p> tags.
I used this as input:
from bs4 import BeautifulSoup
mystr = r"""<html>
<body>
<p>A <span>die</span> is thrown \(x = {-b \pm\sqrt{b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from both the throws?</p>
<p> Test </p>
</body>
</html>"""
So here is what I did:
soup = BeautifulSoup(mystr, "lxml")
my_p = soup.find_all("p")
for p in my_p:
    print(p.text)
This extracts the whole text of each <p> tag. Tell me if your question was about something else.
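If the goal is to merge the adjacent strings inside the tree itself rather than just read the text, one possible approach is Tag.smooth(), which consolidates adjacent NavigableStrings in place. This is only a sketch and assumes BeautifulSoup 4.8.0 or later, where smooth() was added; the markup is a shortened version of the question's example.

```python
from bs4 import BeautifulSoup

html = r"<p>is thrown \(x = {-b \pm<span>\sqrt</span>{b^2-4ac} \over 2a}\) twice.</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# unwrap() replaces the <span> with its contents, leaving three adjacent strings.
p.span.unwrap()
pieces_before = len(p.contents)

# smooth() merges adjacent NavigableStrings into one, in place.
p.smooth()
pieces_after = len(p.contents)
merged = str(p.contents[0])

print(pieces_before, pieces_after)
print(merged)
```

Once the paragraph holds a single string again, a function that expects the whole expression in one NavigableString can be applied to it.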

Regular Expressions | Delete words on multiple lines before a given word

I scraped several articles from a website and now I am trying to make the corpus more readable by deleting the first part of the scraped text.
The part that should be deleted lies between the tag <p>Advertisement and the final </time> tag before the article starts. As you can see, the regular expression has to delete several words across multiple lines. I tried the DOTALL flag, but it wasn't successful.
This is my first attempt:
import re
text='''
<p>Advertisement</p>, <p class="byline-dateline"><span class="byline"itemprop="author creator" itemscope="" itemtype="http://schema.org/Person">By <span class="byline-author"
data-byline-name="MILAN SCHREUER" itemprop="name">MILAN SCHREUER</span> and </span><span class="byline"
itemid="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html"
itemprop="author creator" itemscope="" itemtype="http://schema.org/Person"><a href="http://topics.nytimes.com/top/reference/timestopics/people/r/alissa_johannsen_rubin/index.html"
title="More Articles by ALISSA J. RUBIN"><span class="byline-author" data-byline-name="ALISSA J. RUBIN" data-twitter-handle="Alissanyt" itemprop="name">ALISSA J. RUBIN</span></a></span><time class="dateline" content="2016-10-06T01:02:19-04:00"
datetime="2016-10-06T01:02:19-04:00" itemprop="dateModified">OCT. 5, 2016</time>
</p>, <p class="story-body-text story-content" data-para-count="163" data-total-count="163">BRUSSELS — A man wounded two police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.”</p>, <p class="story-body-text story-content"
data-para-count="231" data-total-count="394">The two officers were attacked on the Boulevard Lambermont in the Schaerbeek district, just north of the city center. A third police officer, who came to their aid, was also injured. None of the three had life-threatening injuries.</p>
'''
my_pattern=("(.*)</time>")
results= re.sub(my_pattern," ", text)
print(results)
Try this ([\s\S] matches any character, including newlines, which . does not do by default):
my_pattern = r"[\s\S]+</time>"
If you also want to delete the following </p> tag, the comma, and the space, you can use this:
my_pattern = r"[\s\S]+</time>[\s\S]</p>,\s"
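An alternative to [\s\S] is to pass the re.DOTALL flag, which lets . match newlines as well. The sketch below is one way to write it, under assumptions: the sample text is a shortened, made-up version of the scraped markup, and the pattern is anchored on <p>Advertisement with a non-greedy .*? so it stops at the first </time> instead of the last one.

```python
import re

# Shortened, made-up sample with the same shape as the scraped text.
text = """<p>Advertisement</p>, <p class="byline-dateline">By
<span>SOME AUTHOR</span><time class="dateline">OCT. 5, 2016</time>
</p>, <p class="story-body-text">BRUSSELS: first paragraph.</p>"""

# re.DOTALL lets '.' cross line breaks; '*?' stops at the first </time>.
header = re.compile(r"<p>Advertisement.*?</time>\s*</p>,\s*", re.DOTALL)

# count=1 removes only the leading header block.
cleaned = header.sub("", text, count=1)
print(cleaned)
```

The greedy [\s\S]+ version removes everything up to the last </time> in the string, which matters if several articles are concatenated in one text.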