Why does &approx; not get picked up by html.parser?
>>> from bs4 import BeautifulSoup
>>> for html in ['hey &lsquo; 3', 'hey &pi;', 'hey &approx; 3']:
...     print repr(unicode(BeautifulSoup(html, 'html.parser')))
...
u'hey \u2018 3'
u'hey \u03c0'
u'hey &approx 3'
I managed to figure it out myself from looking at the bs4 source code for the htmlparser builder.
BeautifulSoup's builder uses the entity-name-to-character mapping in bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER, so it is easy to patch.
import bs4
from bs4 import BeautifulSoup
rawhtml = '<p>&lsquo; &approx; &pi; &theta; 3.</p>'
soup = BeautifulSoup(rawhtml, 'html.parser')
print('Before: %s' % repr(soup))
# &approx; -> u'\u2248'
# from https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER['approx'] = u'\u2248'
soup = BeautifulSoup(rawhtml, 'html.parser')
print('After: %s' % repr(soup))
which prints out
Before: <p>\u2018 &approx \u03c0 \u03b8 3.</p>
After: <p>\u2018 \u2248 \u03c0 \u03b8 3.</p>
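As an aside, on Python 3 with a recent BeautifulSoup this patch shouldn't be necessary: html.parser resolves the full HTML5 entity set there, so &approx; converts out of the box. A quick check (a sketch, not from the original post):
from bs4 import BeautifulSoup
soup = BeautifulSoup('hey &approx; 3', 'html.parser')
print(repr(str(soup)))  # expected: 'hey ≈ 3' -- the entity resolves without any patching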
I have the following code, and I need it to return absolute links rather than relative ones.
I believe I need to use urlparse and urljoin somewhere in here, but I'm just not sure where to use them.
The .csv from this code is also giving me rows like "/about.html", which is obviously not a link to another web page.
import urllib
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import re
r = urllib.urlopen('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(r, "lxml")
links = []
for link in soup.findAll('a', attrs={'href': re.compile(r'(^http|.html)')}):
    links.append(link.get('href'))
web_links_df = pd.DataFrame(links)
web_links_df.columns = ['web_link']
web_links_df['web_link'] = web_links_df['web_link'].apply(lambda x: x.rstrip('/'))
url_tail = web_links_df['web_link'].apply(lambda x: x[-4:])
web_links = pd.DataFrame(web_links_df['web_link'].unique())
web_links.columns = ['web_link']
print web_links.head()
web_links.to_csv("D:/MLCV/web_links_1.csv")
Any help would be greatly appreciated. I have spent hours going through other examples on Stack but I am just not getting the correct results.
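One way to get absolute links (an untested sketch, Python 2 to match the code above): resolve each href against the page URL with urljoin from the urlparse module. urljoin expands relative paths like "/about.html" against the base URL and leaves already-absolute URLs untouched.
import urllib
import re
from urlparse import urljoin
from bs4 import BeautifulSoup

base_url = 'https://www.census.gov/programs-surveys/popest.html'
soup = BeautifulSoup(urllib.urlopen(base_url), 'lxml')

links = []
for link in soup.findAll('a', attrs={'href': re.compile(r'(^http|.html)')}):
    # urljoin makes "/about.html" absolute against base_url;
    # hrefs that are already absolute pass through unchanged
    links.append(urljoin(base_url, link.get('href')))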
I wanted to create a database of commonly used words. Right now when I run this script it works fine, but my biggest issue is that I need all of the words to be in one column. I feel like what I did was more of a hack than a real fix. Using BeautifulSoup, can you print everything in one column without extra blank lines?
import requests
import re
from bs4 import BeautifulSoup
# Website you want to scrape info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")
# Creating the CSV file
commonFile = open('common_words.csv', 'wb')
# Grabbing the lines you want
for node in soup.findAll("tr"):
    # Getting just the text and removing the html
    words = ''.join(node.findAll(text=True))
    # Removing the extra lines
    ID = re.sub(r'[\t\r\n]', '', words)
    # Needed to add a break in the line to make the rows
    update = ''.join(ID) + '\n'
    # Now we add this to the file
    commonFile.write(update)
commonFile.close()
How about this?
import requests
import csv
from bs4 import BeautifulSoup
f = csv.writer(open("common_words.csv", "w"))
f.writerow(["common_words"])
# Website you want to scrape info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")
words = soup.select('div[class=file] tr')
for i in range(len(words)):
    word = words[i].text
    f.writerow([word.replace('\n', '')])
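A small aside on the snippet above (a suggestion, not part of the original answer): opening the file inline inside csv.writer never closes it explicitly. A with block handles that, and in Python 2 the 'wb' mode avoids the extra blank rows the csv module can produce on Windows:
import csv

# sketch: the with block closes the file even if an exception occurs
with open('common_words.csv', 'wb') as out:  # 'wb' for the Python 2 csv module
    writer = csv.writer(out)
    writer.writerow(['common_words'])
    # ... write the scraped rows here, exactly as in the loop above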
>>> from bs4 import BeautifulSoup
>>> import urllib
>>> url = "http://www.securitytube.net/video/7313"
>>>
>>> page = urllib.urlopen(url)
>>>
>>> pageDom = BeautifulSoup(page)
On running the above code, I receive the DOM object in the 'pageDom' variable. Now I do this (code below) and I get an empty list.
>>> allComments = pageDom.find_all("ul", class_="comments")
>>>
>>> allComments
[]
>>>
>>>
So now I removed 'class_' and am able to fetch all the unordered list tags.
Check the code below.
>>> allComments = pageDom.find_all("ul")
>>> len(allComments)
27
>>>
If I look at the source code of the page, I can clearly see all the <ul> tags with the class "comments". I don't know what I am missing. I also tried changing the parser to "lxml", but no joy.
Any suggestions/improvements will be highly appreciated...
I am not sure if there is a difference between versions, but here is the code and the output that worked fine with Python 3.4:
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.securitytube.net/video/7313"
page = urllib.request.urlopen(url)
pageDom = BeautifulSoup(page)
#print(pageDom)
#On running the above code, I receive the dom object in the 'pageDom' variable. Now I do this (code mentioned below) and I get an empty list.
allComments = pageDom.find_all("ul", class_="comments")
#print(allComments)
print(len(allComments))
#So now I removed 'class_' and am able to fetch all the unordered list tags. Check the code below.
allComments = pageDom.find_all("ul")
#print(allComments)
print(len(allComments))
Output:
C:\Python34\python.exe C:/{path}/testPython.py
2
27
Process finished with exit code 0
You can uncomment the print lines to see the array contents.
I tested (multiple times) in Python 2.7 (32-bit):
from bs4 import BeautifulSoup
import urllib
url = "http://www.securitytube.net/video/7313"
page = urllib.urlopen(url)
page = page.read()
pageDom = BeautifulSoup(page,'lxml')
allComments = pageDom.find_all("ul", class_="comments")
print len(allComments)
allComments = pageDom.find_all("ul")
print len(allComments)
It prints:
2
27
programming newbie here :)
I'd like to print the prices from the website using BeautifulSoup. This is my code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup, SoupStrainer
from urllib2 import urlopen
url = "Some retailer's url"
html = urlopen(url).read()
product = SoupStrainer('span',{'style': 'color:red;'})
soup = BeautifulSoup(html, parse_only=product)
print soup.prettify()
and it prints prices in the following order:
<span style="color:red;">
180
</span>
<span style="color:red;">
1250
</span>
<span style="color:red;">
380
</span>
I tried print soup.text.strip(), but it returned 1801250380.
Please help me to print the prices per single row :)
Many thanks!
>>> print "\n".join([p.get_text(strip=True) for p in soup.find_all(product)])
180
1250
380
This will get you a list of strings converted to integers:
>>> [int(span.text) for span in soup.find_all('span')]
[180, 1250, 380]
I am trying to remove quotes and brackets from a CSV in Python. I tried the following code, but it doesn't produce a proper CSV. The code is:
import json
import urllib2
import re
import os
from BeautifulSoup import BeautifulSoup
import csv
u = urllib2.urlopen("http://timesofindia.indiatimes.com/")
content = u.read()
u.close()
soup2 = BeautifulSoup(content)
blog_posts = []
for e in soup2.findAll("a", attrs={'pg': re.compile('^Head')}):
    for b in soup2.findAll("div", attrs={'style': re.compile('^color:#ffffff;font-size:12px;font-family:arial;padding-top:3px;text-align:center;')}):
        blog_posts.append(("The Times Of India", e.text, b.text))
print blog_posts
out_file = os.path.join('resources', 'ch05-webpages','newspapers','time1.csv')
f = open(out_file, 'wb')
wr = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
#f.write(json.dumps(blog_posts, indent=1))
wr.writerow(blog_posts)
f.close()
print 'Wrote output file to %s' % (f.name, )
The CSV looks like:
"('The Times Of India', u'Missing jet: Air search expands to remote south Indian Ocean', u'Fri, Mar 21, 2014 | Updated 11.53AM IST')",
but I want the CSV like this:
The Times Of India,u'Missing jet: Air search expands to remote south Indian Ocean, u'Fri, Mar 21, 2014 | Updated 11.53AM IST
So what can I do to get this type of CSV?
Writer.writerow() expects a sequence containing strings or numbers. You are passing a sequence of tuples. Use Writer.writerows() instead.
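Concretely, keeping the rest of the code the same, the fix is a one-line change (a sketch based on the code above):
wr = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
# writerows() takes the whole list and emits one CSV row per tuple
wr.writerows(blog_posts)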