Why does &approx; not get picked up by html.parser?
>>> from bs4 import BeautifulSoup
>>> for html in ['hey &lsquo; 3', 'hey &pi;', 'hey &approx; 3']:
...     print repr(unicode(BeautifulSoup(html, 'html.parser')))
...
u'hey \u2018 3'
u'hey \u03c0'
u'hey &approx 3'
I managed to figure it out myself from looking at the bs4 source code for the htmlparser builder.
BeautifulSoup's builder uses the entity-name-to-character mapping in bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER, so it is easy to patch.
import bs4
from bs4 import BeautifulSoup
rawhtml = '<p>&lsquo; &approx; &pi; &theta; 3.</p>'
soup = BeautifulSoup(rawhtml, 'html.parser')
print('Before: %s' % repr(soup))
# &approx; -> u'\u2248'
# from https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER['approx'] = u'\u2248'
soup = BeautifulSoup(rawhtml, 'html.parser')
print('After: %s' % repr(soup))
which prints out
Before: <p>\u2018 &approx \u03c0 \u03b8 3.</p>
After: <p>\u2018 \u2248 \u03c0 \u03b8 3.</p>
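As an aside, on Python 3 with a recent BeautifulSoup this patch shouldn't be necessary: html.parser resolves the full HTML5 entity set there, so &approx; converts out of the box. A quick check (a sketch, not from the original post):
from bs4 import BeautifulSoup
soup = BeautifulSoup('hey &approx; 3', 'html.parser')
print(repr(str(soup)))  # expected: 'hey ≈ 3' -- the entity resolves without any patching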
I have the following code, and I need it to return absolute links rather than relative ones.
I believe I need to use urlparse and urljoin somewhere in here, but I'm just not sure where to use them.
The .csv from this code is also giving me rows like "/about.html", which is obviously not a link to another web page.
import urllib
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import re
r = urllib.urlopen('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(r, "lxml")
links = []
for link in soup.findAll('a', attrs={'href': re.compile(r'(^http|.html)')}):
    links.append(link.get('href'))
web_links_df = pd.DataFrame(links)
web_links_df.columns = ['web_link']
web_links_df['web_link'] = web_links_df['web_link'].apply(lambda x: x.rstrip('/'))
url_tail = web_links_df['web_link'].apply(lambda x: x[-4:])
web_links = pd.DataFrame(web_links_df['web_link'].unique())
web_links.columns = ['web_link']
print web_links.head()
web_links.to_csv("D:/MLCV/web_links_1.csv")
Any help would be greatly appreciated. I have spent hours going through other examples on Stack but I am just not getting the correct results.
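One way to get absolute links (an untested sketch, Python 2 to match the code above): resolve each href against the page URL with urljoin from the urlparse module. urljoin expands relative paths like "/about.html" against the base URL and leaves already-absolute URLs untouched.
import urllib
import re
from urlparse import urljoin
from bs4 import BeautifulSoup

base_url = 'https://www.census.gov/programs-surveys/popest.html'
soup = BeautifulSoup(urllib.urlopen(base_url), 'lxml')

links = []
for link in soup.findAll('a', attrs={'href': re.compile(r'(^http|.html)')}):
    # urljoin makes "/about.html" absolute against base_url;
    # hrefs that are already absolute pass through unchanged
    links.append(urljoin(base_url, link.get('href')))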
I wanted to create a database of commonly used words. Right now when I run this script it works fine, but my biggest issue is that I need all of the words to be in one column. I feel like what I did was more of a hack than a real fix. Using BeautifulSoup, can you print everything in one column without extra blank lines?
import requests
import re
from bs4 import BeautifulSoup
# Website you want to scrape info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")
# Creating the CSV file
commonFile = open('common_words.csv', 'wb')
# Grabbing the lines you want
for node in soup.findAll("tr"):
    # Getting just the text and removing the html
    words = ''.join(node.findAll(text=True))
    # Removing the extra lines
    ID = re.sub(r'[\t\r\n]', '', words)
    # Needed to add a break in the line to make the rows
    update = ''.join(ID) + '\n'
    # Now we add this to the file
    commonFile.write(update)
commonFile.close()
How about this?
import requests
import csv
from bs4 import BeautifulSoup
f = csv.writer(open("common_words.csv", "w"))
f.writerow(["common_words"])
# Website you want to scrape info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")
words = soup.select('div[class=file] tr')
for i in range(len(words)):
    word = words[i].text
    f.writerow([word.replace('\n', '')])
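A small aside on the snippet above (a suggestion, not part of the original answer): opening the file inline inside csv.writer never closes it explicitly. A with block handles that, and in Python 2 the 'wb' mode avoids the extra blank rows the csv module can produce on Windows:
import csv

# sketch: the with block closes the file even if an exception occurs
with open('common_words.csv', 'wb') as out:  # 'wb' for the Python 2 csv module
    writer = csv.writer(out)
    writer.writerow(['common_words'])
    # ... write the scraped rows here, exactly as in the loop above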
>>> from bs4 import BeautifulSoup
>>> import urllib
>>> url = "http://www.securitytube.net/video/7313"
>>>
>>> page = urllib.urlopen(url)
>>>
>>> pageDom = BeautifulSoup(page)
On running the above code, I receive the DOM object in the 'pageDom' variable. Now I do this (code below) and I get an empty list.
>>> allComments = pageDom.find_all("ul", class_="comments")
>>>
>>> allComments
[]
>>>
>>>
So now I removed 'class_' and am able to fetch all the unordered list tags.
Check the code below.
>>> allComments = pageDom.find_all("ul")
>>> len(allComments)
27
>>>
If I look at the source code of the page, I can clearly see all the <ul> tags with the class "comments". I don't know what I am missing. I also tried changing the parser to "lxml", but no joy.
Any suggestions/improvements will be highly appreciated...
I am not sure if there is a difference between versions, but here is the code and the output that worked fine with Python 3.4:
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.securitytube.net/video/7313"
page = urllib.request.urlopen(url)
pageDom = BeautifulSoup(page)
#print(pageDom)
#On running the above code, I receive the dom object in the 'pageDom' variable. Now I do this (code mentioned below) and I get an empty list.
allComments = pageDom.find_all("ul", class_="comments")
#print(allComments)
print(len(allComments))
#So now I removed 'class_' and am able to fetch all the unordered list tags. Check the code below.
allComments = pageDom.find_all("ul")
#print(allComments)
print(len(allComments))
Output:
C:\Python34\python.exe C:/{path}/testPython.py
2
27
Process finished with exit code 0
You can uncomment the print lines to see the array contents.
I tested (multiple times) in Python 2.7 (32-bit):
from bs4 import BeautifulSoup
import urllib
url = "http://www.securitytube.net/video/7313"
page = urllib.urlopen(url)
page = page.read()
pageDom = BeautifulSoup(page,'lxml')
allComments = pageDom.find_all("ul", class_="comments")
print len(allComments)
allComments = pageDom.find_all("ul")
print len(allComments)
It prints:
2
27
programming newbie here :)
I'd like to print the prices from the website using BeautifulSoup. This is my code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup, SoupStrainer
from urllib2 import urlopen
url = "Some retailer's url"
html = urlopen(url).read()
product = SoupStrainer('span',{'style': 'color:red;'})
soup = BeautifulSoup(html, parse_only=product)
print soup.prettify()
and it prints prices in the following order:
<span style="color:red;">
180
</span>
<span style="color:red;">
1250
</span>
<span style="color:red;">
380
</span>
I tried print soup.text.strip(), but it returned 1801250380.
Please help me to print the prices per single row :)
Many thanks!
>>> print "\n".join([p.get_text(strip=True) for p in soup.find_all(product)])
180
1250
380
This will get you a list of strings converted to integers:
>>> [int(span.text) for span in soup.find_all('span')]
[180, 1250, 380]
I am trying to remove quotes and brackets from a CSV in Python. I tried the following code, but it doesn't produce a proper CSV. The code is:
import json
import urllib2
import re
import os
from BeautifulSoup import BeautifulSoup
import csv
u = urllib2.urlopen("http://timesofindia.indiatimes.com/")
content = u.read()
u.close()
soup2 = BeautifulSoup(content)
blog_posts = []
for e in soup2.findAll("a", attrs={'pg': re.compile('^Head')}):
    for b in soup2.findAll("div", attrs={'style': re.compile('^color:#ffffff;font-size:12px;font-family:arial;padding-top:3px;text-align:center;')}):
        blog_posts.append(("The Times Of India", e.text, b.text))
print blog_posts
out_file = os.path.join('resources', 'ch05-webpages','newspapers','time1.csv')
f = open(out_file, 'wb')
wr = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
#f.write(json.dumps(blog_posts, indent=1))
wr.writerow(blog_posts)
f.close()
print 'Wrote output file to %s' % (f.name, )
The CSV looks like:
"('The Times Of India', u'Missing jet: Air search expands to remote south Indian Ocean', u'Fri, Mar 21, 2014 | Updated 11.53AM IST')",
but I want the CSV like this:
The Times Of India,u'Missing jet: Air search expands to remote south Indian Ocean, u'Fri, Mar 21, 2014 | Updated 11.53AM IST
So what can I do to get this type of CSV?
Writer.writerow() expects a sequence containing strings or numbers. You are passing a sequence of tuples. Use Writer.writerows() instead.
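Concretely, keeping the rest of the code the same, the fix is a one-line change (a sketch based on the code above):
wr = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
# writerows() takes the whole list and emits one CSV row per tuple
wr.writerows(blog_posts)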