Can't find table with soup.findAll('table') using BeautifulSoup in Python - python-2.7

I'm using soup.findAll('table') to try to find the table in an HTML file, but it does not appear in the results.
The table does exist in the file, and with a regex I'm able to locate it this way:
import sys
import urllib2
from bs4 import BeautifulSoup
import re
webpage = open(r'd:\samplefile.html', 'r').read()
soup = BeautifulSoup(webpage)
print re.findall("TABLE",webpage) #works, prints ['TABLE','TABLE']
print soup.findAll("TABLE") # prints an empty list []
I know I am correctly generating the soup since when I do:
print [tag.name for tag in soup.findAll(align=None)]
It correctly prints the tags that it finds. I have also tried different ways of writing "TABLE", like "table", "Table", etc.
Also, if I open the file in a text editor, it has "TABLE" in it.
Why doesn't BeautifulSoup find the table?

Context
python 2.x
BeautifulSoup HTML parser
Problem
BeautifulSoup's findAll does not return all the expected tags, or returns none at all, even though the user knows that the tag exists in the markup
Solution
Try specifying the parser explicitly when calling the BeautifulSoup constructor
## BEFORE
soup = BeautifulSoup(webpage)
## AFTER
soup = BeautifulSoup(webpage, "html5lib")
Rationale
The target markup may include malformed HTML, and different parsers recover from it with varying degrees of success.
See also
related post by Martijn that addresses the same issue
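To see how much the parser choice matters for a given file, the same markup can be run through each parser and the results compared. The following is only a sketch: SAMPLE and count_tables are hypothetical names made up for illustration, and lxml and html5lib are optional packages that may not be installed (the sketch records None for any parser that is missing). Note that the HTML parsers lowercase tag names, so the search uses "table" even though the file says "TABLE".

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample: uppercase tag names and unclosed cells, similar in
# spirit to the sloppy markup often found in real files.
SAMPLE = "<html><body><TABLE><TR><TD>a<TD>b</TABLE></body></html>"


def count_tables(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of <table> tags found, or None if missing}."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # The requested parser library is not installed.
            counts[name] = None
            continue
        # HTML parsers lowercase tag names, so search for "table".
        counts[name] = len(soup.find_all("table"))
    return counts


if __name__ == "__main__":
    for name, found in sorted(count_tables(SAMPLE).items()):
        print("%s: %s" % (name, found))
```

If one parser reports fewer tables than another for your own file, that is a hint the markup is malformed in a way that parser does not recover from.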

Related

Remove img tags from xml with BeautifulSoup

It's my first time using Python and BeautifulSoup. I'm migrating all the articles of a blog from one website to another, and to do this I'm extracting certain information from an XML file. The last part of my code extracts only the text between positions 0 and 164 of the meta tag, so that it appears on the Google SERP the way they want it to appear.
The problem is that some articles from the blog have img tags in the first lines inside the tag. I want to remove them, including the src attributes, so the code grabs just the text after those img tags.
I tried to solve it in many ways, but I did not succeed.
Here is my code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv
import sys
import re
reload(sys)
sys.setdefaultencoding('utf8')
base_url = ("http://pimacleanpro.com/blog?rss=true")
soup = BeautifulSoup(urlopen(base_url).read(),"xml")
titles = soup("title")
slugs = soup("link")
bodies = soup("description")
with open("blog-data.csv", "w") as f:
    fieldnames = ("title", "content", "slug", "seo_title", "seo_description", "site_id", "page_path", "category")
    output = csv.writer(f, delimiter=",")
    output.writerow(fieldnames)
    for i in xrange(len(titles)):
        output.writerow([
            titles[i].encode_contents(),
            bodies[i].encode_contents(formatter=None),
            slugs[i].get_text(),
            titles[i].encode_contents(),
            bodies[i].encode_contents(formatter=None)[4:164],
        ])
print "Done writing file"
Any help will be appreciated.
Here's a Python 2.7 example that I think does what you want:
from bs4 import BeautifulSoup
from urllib2 import urlopen
from xml.sax.saxutils import unescape
base_url = ("http://pimacleanpro.com/blog?rss=true")
# Unescape to allow BS to parse the <img> tags
soup = BeautifulSoup(unescape(urlopen(base_url).read()))
titles = soup("title")
slugs = soup("link")
bodies = soup("description")
print bodies[2].encode_contents(formatter=None)[4:164]
# Remove all 'img' tags in all the 'description' tags in bodies
for body in bodies:
    for img in body("img"):
        img.decompose()
print bodies[2].encode_contents(formatter=None)[4:164]
# Proceed to writing to CSV, etc.
The first print statement outputs the following:
<img src='"http://ekblog.s3.amazonaws.com/contentp/wp-content/uploads/2018/09/03082910/decoration-design-detail-691710-300x221.jpg"'><br>
<em>Whether you are up
While the second one after removing the <img> tags is as follows:
<em>Whether you are upgrading just one room or giving your home a complete renovation, it’s likely that your first thought is to choose carpet for all of
Of course you could just remove all image tags in the soup object before creating titles, slugs, or bodies if they're not of interest to you:
for tag in soup("img"):
    tag.decompose()
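If the goal is a plain-text SEO description rather than a slice of raw HTML, the same idea can be packaged as a small helper. This is only a sketch under assumptions: description_snippet, the 160-character limit, and the sample string are made up for illustration, and it uses the built-in html.parser on a single description instead of parsing the whole feed.

```python
from xml.sax.saxutils import unescape

from bs4 import BeautifulSoup


def description_snippet(escaped_html, limit=160):
    """Unescape one description, drop <img> tags, return the first `limit` characters of text."""
    soup = BeautifulSoup(unescape(escaped_html), "html.parser")
    # Remove every image (and its src attribute) from the tree.
    for img in soup.find_all("img"):
        img.decompose()
    # Collapse runs of whitespace so the slice is not wasted on newlines.
    text = " ".join(soup.get_text().split())
    return text[:limit]


# Hypothetical escaped description, shaped like the feed output shown above.
sample = (
    '&lt;p&gt;&lt;img src="http://example.com/a.jpg"&gt;&lt;br&gt;'
    "&lt;em&gt;Whether you are upgrading just one room&lt;/em&gt;"
    " or the whole house.&lt;/p&gt;"
)

if __name__ == "__main__":
    print(description_snippet(sample))
```

Slicing the extracted text instead of the encoded HTML also avoids cutting a tag in half at the character limit.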

Methods in Python 2.7 that enable text extraction from multiple HTML pages with different element tags?

I primarily work in Python 2.7. I'm trying to extract the written content (body text) of hundreds of articles from their respective URLs. To simplify things, I've started by trying to extract the text from just one website in my list, and I've been able to do so successfully using BeautifulSoup4. My code looks like this:
import urllib2
from bs4 import BeautifulSoup
url = 'http://insertaddresshere'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
response = urllib2.urlopen(request)
soup = BeautifulSoup((response),"html.parser")
texts = soup.find_all("p")
for item in texts:
    print item.text
This gets me the body text of a single article. I know how to iterate through a csv file and write to a new one, but the list of sites I need to iterate through are all from different domains, so the HTML code varies a lot. Is there any way to find body text from multiple articles that have different element labels (here, "p") for said body text? Is it possible to use BeautifulSoup to do this?

Want to get all the JavaScript files from HTML using bs4

from bs4 import BeautifulSoup
import re
import HTMLParser
import urllib
url = raw_input('enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
scripts=soup.find_all('script')
for tag in scripts:
    try:
        Script = tag["src"]
        print Script
    except KeyError:
        # The tag has no src attribute, i.e. it is an inline script
        print "No source"
Using this code, I'm not getting all the JavaScript from the HTML document.
I have checked your code, and it does visit every script tag. However, some of the JavaScript may be embedded directly in the HTML, in which case the tag has no src attribute and the code sits between the <script>...</script> tags. You can get that embedded JavaScript using tag.contents in your loop.
Furthermore, I would advise specifying a parser. If you don't, bs4 picks the best parser that happens to be installed (lxml, then html5lib, then Python's built-in html.parser), so results can differ between machines. Check out: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
from bs4 import BeautifulSoup
import urllib2
r = urllib2.urlopen('<your url>').read()
soup = BeautifulSoup(r, 'html.parser')
for s in soup.findAll('script'):
    print s.get('src')
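To collect both kinds of script in one pass, external references and inline code can be kept in separate lists. A minimal sketch, assuming the page is already available as a string; split_scripts and the sample markup are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup


def split_scripts(html):
    """Return (external_srcs, inline_sources) for all <script> tags in html."""
    soup = BeautifulSoup(html, "html.parser")
    external, inline = [], []
    for tag in soup.find_all("script"):
        src = tag.get("src")  # None when the tag has no src attribute
        if src:
            external.append(src)
        else:
            # Inline script: the code is the tag's own text content.
            inline.append(tag.string or "")
    return external, inline


# Hypothetical page with one external and one inline script.
sample_html = (
    "<html><head>"
    '<script src="app.js"></script>'
    "<script>var x = 1;</script>"
    "</head><body></body></html>"
)

if __name__ == "__main__":
    external, inline = split_scripts(sample_html)
    print(external)
    print(inline)
```

Scripts that are injected at runtime by other scripts will not show up either way, because BeautifulSoup only sees the HTML as downloaded.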

Using BeautifulSoup to print specific information that has a <div> tag

I'm still new to using BeautifulSoup to scrape information from a website. For this piece of code I'm specifically trying to grab this value and others like it, and display it back to me, the user, in a more condensed, easy-to-read form. Below is a screenshot I took with the highlighted div and class I am trying to parse:
This is the code I'm using:
import urllib2
from bs4 import BeautifulSoup
a =("http://forecast.weather.gov/MapClick.php?lat=39.32196712788175&lon=-82.10190859830237&site=all&smap=1#.VQM_kOGGP7l")
website = urllib2.urlopen(a)
html = website.read()
soup = BeautifulSoup(html)
x = soup.find_all("div",{"class": "point-forecast-icons-low"})
print x
However, once it runs it returns "[]". I get no errors, but nothing is found. At first I thought maybe it couldn't find anything inside the <div> that I told it to search for, but usually I would get back a None from the code saying nothing was found. So what I believe to be going on now with my code is maybe since it is a that its not opening the div up to pull the other content from inside it, but that is just my best guess at the moment.
You are getting [] because point-forecast-icons-low is not a class of the div; it is a class of the p tag. Try this instead.
x = soup.find_all("p", attrs={"class": "point-forecast-icons-low"})
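The difference between an empty list and None comes from which method is called: find_all returns an empty list when nothing matches, while find returns None. A small sketch on made-up markup illustrates this; the div's class name and the text are assumptions, only point-forecast-icons-low comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the class sits on the <p>, not on the <div>.
html = (
    '<div class="forecast-wrapper">'
    '<p class="point-forecast-icons-low">Low: 27 F</p>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# Wrong tag name: find_all gives an empty list, find gives None.
no_divs = soup.find_all("div", class_="point-forecast-icons-low")
no_div = soup.find("div", class_="point-forecast-icons-low")

# Right tag name, or no tag name at all when only the class matters.
paragraphs = soup.find_all("p", class_="point-forecast-icons-low")
any_tag = soup.find_all(class_="point-forecast-icons-low")

if __name__ == "__main__":
    print(no_divs)                   # []
    print(no_div)                    # None
    print(paragraphs[0].get_text())  # Low: 27 F
    print(len(any_tag))              # 1
```

Searching by class alone is a handy first step when you are not sure which tag carries it.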

How to use lxml to get a message from a website?

At exam.com there is a note about the weather:
Tokyo: 25°C
I want to use Django 1.1 and lxml to get information from the website. I only want the value "25".
The HTML structure of exam.com is as follows:
<p id="resultWeather">
<b>Weather</b>
Tokyo:
<b>25</b>°C
</p>
I'm a student doing a small project with my friends. Please explain it in a way that is easy to understand. Thank you very much!
BeautifulSoup is more suitable for HTML parsing than lxml.
Something like this can be helpful:
def get_weather():
    import urllib
    from BeautifulSoup import BeautifulSoup
    data = urllib.urlopen('http://exam.com/').read()
    soup = BeautifulSoup(data)
    return soup.find('p', {'id': 'resultWeather'}).findAll('b')[-1].string
Get the page contents with urllib, parse them with BeautifulSoup, find the P with id=resultWeather, then find the last B inside that P and get its content.
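The function above imports the older BeautifulSoup 3 package. The same steps can be tried offline with bs4 on the HTML from the question. This is a sketch under assumptions: extract_temperature and PAGE are hypothetical names, the degree sign is written as the &deg; entity to keep the source ASCII-only, and the None check is an addition for pages where the paragraph is missing.

```python
from bs4 import BeautifulSoup

# The structure from the question, with the degree sign as an HTML entity.
PAGE = """<p id="resultWeather">
<b>Weather</b>
Tokyo:
<b>25</b>&deg;C
</p>"""


def extract_temperature(html):
    """Return the text of the last <b> inside <p id="resultWeather">, or None."""
    soup = BeautifulSoup(html, "html.parser")
    paragraph = soup.find("p", id="resultWeather")
    if paragraph is None:
        # The page does not contain the expected paragraph.
        return None
    bold_tags = paragraph.find_all("b")
    if not bold_tags:
        return None
    return bold_tags[-1].string


if __name__ == "__main__":
    print(extract_temperature(PAGE))
```

In a Django view, the returned string could be converted with int() before being passed to a template.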