Read multilanguage strings from HTML via Python 2.7

I am new to Python 2.7 and I am trying to extract some information from HTML files. More specifically, I want to read text that contains characters from multiple languages. Here is my script, which I hope makes things clearer.
import urllib2
import BeautifulSoup
url = 'http://www.bbc.co.uk/zhongwen/simp/'
page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup.BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})
print data[0]['content'].encode("utf-8")
The result I get is:
BBCϊ╕φόΨΘύ╜ΣΎ╝Νϊ╕╗ώκ╡Ύ╝Νbbcchinese.com, email news, newsletter, subscription, full text
The problem is in the first string. Is there any way to print exactly what I am reading? Also, is there any way to find out the exact encoding of the text in each language?
PS: I should mention that the site was selected completely at random; it is simply representative of the problem I am encountering.
Thank you in advance!

The problem is with the terminal you are printing the result to. The script works fine, and if you write the data to a file you will get the correct text.
Example:
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.bbc.co.uk/zhongwen/simp/'
page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})
with open("test.txt", "w") as myfile:
    myfile.write(data[0]['content'].encode("utf-8"))
test.txt:
BBC中文网,主页,bbcchinese.com, email news, newsletter, subscription, full text
Which OS and terminal are you using?
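The garbled characters in the question are consistent with correct UTF-8 bytes being rendered by a console that uses a legacy code page. The sketch below reproduces that effect and shows how to check which encoding Python believes the terminal uses. Code page 737 (Greek DOS) is an assumption inferred from the characters in the garbled output, not something the poster confirmed, and show_as_console is a hypothetical helper name. The code is written to run on Python 3 as well as 2.7.

```python
# -*- coding: utf-8 -*-
import sys


def show_as_console(text, console_encoding):
    """Return what a console using console_encoding would display
    when it receives the UTF-8 bytes of text."""
    raw = text.encode('utf-8')
    return raw.decode(console_encoding, 'replace')


# The encoding Python believes the terminal uses (can be None when piped).
terminal_encoding = sys.stdout.encoding

# cp737 is an assumed code page, chosen because it matches the question's output.
garbled = show_as_console(u'BBC中文网', 'cp737')
```

If terminal_encoding is not a Unicode encoding, printing UTF-8 encoded bytes will look wrong even though the data itself is intact, which is why writing to a file shows the expected text.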

Related

Issue scraping website with bs4 (beautiful soup) python 2.7

What I am attempting to accomplish is a simple Python web scraping script for Google Trends, and I am running into an issue when grabbing the class.
from bs4 import BeautifulSoup
import requests
results = requests.get("https://trends.google.com/trends/trendingsearches/daily?geo=US")
soup = BeautifulSoup(results.text, 'lxml')
keyword_list = soup.find_all('.details-top')
for keyword in keyword_list:
    print(keyword)
When I print each tag I get an empty result, but when I print soup I get the entire HTML document. My goal is to print the text of each "Keyword" that was searched, from the page https://trends.google.com/trends/trendingsearches/daily?geo=AU
This page has a list of results:
1. covid-19
2. Woolworths jobs
If you open the browser developer tools, select Inspect, and hover over a title, you will see div.details-top.
How would I print just the text of each title?
I can see that the data is retrieved dynamically from an API call in the dev tools Network tab. You can issue a request to that URL, then use a regex on the response text to parse out the query titles.
import requests, re
from bs4 import BeautifulSoup as bs
r = requests.get('https://trends.google.com/trends/api/dailytrends?hl=en-GB&tz=0&geo=AU&ns=15').text
p = re.compile(r'"query":"(.*?)"')
titles = p.findall(r)
print(titles) # 2.7 use print titles
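As an alternative to the regex, the response text can be parsed as JSON, which also copes with escaped quotes inside a title. This is only a sketch under assumptions: the sample payload below is made up for illustration, the exact layout of the real response is not shown in this thread, and extract_queries is a hypothetical helper. It skips any non-JSON prefix before the first "{" and collects every "query" value wherever it is nested.

```python
import json


def extract_queries(response_text):
    """Collect every "query" value found anywhere in a JSON response."""
    # Skip anything before the first "{" (for example a guard prefix).
    start = response_text.find('{')
    payload = json.loads(response_text[start:])

    found = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == 'query' and not isinstance(value, (dict, list)):
                    found.append(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(payload)
    return found


# Hypothetical sample standing in for the text returned by requests.get(...).text
sample_response = (
    ")]}',\n"
    '{"days": [{"searches": ['
    '{"title": {"query": "covid-19"}}, '
    '{"title": {"query": "Woolworths jobs"}}]}]}'
)
titles = extract_queries(sample_response)
```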

Soup.find and findAll unable to find table elements on hockey-reference.com

I'm just a beginner at web scraping and Python in general, so I'm sorry if the answer is obvious, but I can't figure out why I'm unable to find any of the table elements on https://www.hockey-reference.com/leagues/NHL_2018.html.
My initial thought was that this was a result of the whole div being commented out, so following some advice I found on here in another similar post, I replaced the comment characters and confirmed that they were removed when I saved the soup.text to a text file and searched. I was still unable to find any tables however.
Searching a little further, I took the ID out of my .find and did a findAll instead, and the result was still coming up empty.
Here's the code I was trying to use, any advice is much appreciated!
import csv
import requests
from BeautifulSoup import BeautifulSoup
import re
comm = re.compile("<!--|-->")
url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(comm.sub("", html))
table = soup.find('table', id="stats")
When searching for all of the table elements I was using
table = soup.findAll('table')
I'm also aware that there is a csv version on the site, I was just eager to practice.
Pass a parser along with your markup, for example BeautifulSoup(html, 'lxml'). Try the code below:
import requests
from bs4 import BeautifulSoup  # the parser argument is part of the bs4 API
url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
table = soup.findAll('table')
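Since the question mentions that some tables sit inside HTML comments, here is a small sketch of how commented-out tables can be located. It deliberately uses the standard library's HTMLParser instead of BeautifulSoup so that it runs without extra packages; the sample markup and the TableIdCollector and find_table_ids names are made up for illustration and are not taken from the site.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class TableIdCollector(HTMLParser):
    """Record the id of every <table>, including tables inside comments."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.table_ids.append(dict(attrs).get('id'))

    def handle_comment(self, data):
        # Re-parse the comment body as HTML to find tables hidden in it.
        nested = TableIdCollector()
        nested.feed(data)
        nested.close()
        self.table_ids.extend(nested.table_ids)


def find_table_ids(html):
    collector = TableIdCollector()
    collector.feed(html)
    collector.close()
    return collector.table_ids


# Hypothetical markup: one visible table and one commented-out table.
sample = ('<div><table id="standings"></table></div>'
          '<div><!-- <table id="stats"><tr><td>1</td></tr></table> --></div>')
table_ids = find_table_ids(sample)
```

The same idea carries over to BeautifulSoup: find the comment nodes, then parse their text as a separate document.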

read text file content with python at zapier

I have problems getting the content of a txt file into a Zapier object using https://zapier.com/help/code-python/. Here is the code I am using:
with open('file', 'r') as content_file:
    content = content_file.read()
I'd be glad if you could help me with this. Thanks for that!
David here, from the Zapier Platform team.
Your code as written doesn't work because the first argument for the open function is the filepath. There's no file at the path 'file', so you'll get an error. You access the input via the input_data dictionary.
That being said, the input is a url, not a file. You need to use urllib to read that url. I found the answer here.
I've got a working copy of the code like so:
import urllib2 # the lib that handles the url stuff
result = []
data = urllib2.urlopen(input_data['file'])
for line in data: # file lines are iterable
    result.append(line) # keep each line, or parse, etc.
return {'lines': result}
The key takeaway is that you need to return a dictionary from the function, so make sure you somehow squish your file into one.
Let me know if you've got any other questions!
#xavid, did you test this in Zapier?
It fails miserably because urllib2 doesn't exist in the Zapier Python environment.
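Given that objection, one possible variant avoids urllib2 altogether. The sketch below assumes that the requests package is importable in the Zapier Python environment and that input_data['file'] holds a URL, as described in the answer above; neither is verified here. fetch_text and build_output are hypothetical helper names.

```python
def fetch_text(url):
    """Download the file behind url and return its text."""
    # Imported lazily so build_output can be tried without requests installed.
    import requests
    response = requests.get(url)
    response.raise_for_status()
    return response.text


def build_output(text):
    """Wrap the file content in a dictionary, one list entry per line."""
    return {'lines': text.splitlines()}


# Inside a Zapier code step, where input_data is supplied by the platform:
# return build_output(fetch_text(input_data['file']))
```

Keeping the download and the dictionary-building separate makes it possible to check the output shape locally with a plain string before pasting the code into Zapier.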

Scraping urls from html, save in csv using BeautifulSoup

I'm trying to save all hyperlinked urls in an online forum in a CSV file, for a research project.
When I 'print' the html scraping result it seems to be working fine, in the sense that it prints all the urls I want, but I'm unable to write these to separate rows in the CSV.
I'm clearly doing something wrong, but I don't know what! So any help will be greatly appreciated.
Here's the code I've written:
import urllib2
from bs4 import BeautifulSoup
import csv
import re
soup = BeautifulSoup(urllib2.urlopen('http://forum.sex141.com/eforum/forumdisplay.php?fid=28&page=5').read())
urls = []
for url in soup.find_all('a', href=re.compile('viewthread.php')):
    print url['href']
csvfile = open('Ss141.csv', 'wb')
writer = csv.writer(csvfile)
for url in zip(urls):
    writer.writerow([url])
csvfile.close()
You need to add your matches to the urls list:
for url in soup.find_all('a', href=re.compile('viewthread.php')):
    print url['href']
    urls.append(url['href'])  # store the link text, not the whole tag
and you don't need to use zip() here.
It is best to just write your URLs as you find them, instead of collecting them in a list first:
soup = BeautifulSoup(urllib2.urlopen('http://forum.sex141.com/eforum/forumdisplay.php?fid=28&page=5').read())
with open('Ss141.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    for url in soup.find_all('a', href=re.compile('viewthread.php')):
        writer.writerow([url['href']])
The with statement will close the file object for you when the block is done.
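To see why zip() gets in the way, and what one URL per row looks like, here is a small self-contained sketch. The href values and the write_urls helper are made up for illustration. It uses an in-memory buffer and Python 3 text handling; when writing to a real file on Python 3 you would open it with open('Ss141.csv', 'w', newline=''), while on 2.7 the 'wb' mode from the answer applies.

```python
import csv
import io


def write_urls(urls, fileobj):
    """Write each URL as its own single-column row."""
    writer = csv.writer(fileobj)
    for url in urls:
        writer.writerow([url])  # one-element list -> one column per row


# Hypothetical links standing in for the scraped href values.
hrefs = ['viewthread.php?tid=1', 'viewthread.php?tid=2']

# zip() over a single list wraps every item in a 1-tuple, so
# writerow([url]) in the question would have written tuples, not strings.
wrapped = list(zip(hrefs))

buffer = io.StringIO()
write_urls(hrefs, buffer)
csv_text = buffer.getvalue()
```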

Scraping messy source page with Beautiful Soup

I am trying to do some web scraping using Python and Beautiful Soup, but the source of the webpage is not the prettiest. The snippet below is a small part of the page source:
...717301758],"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0,...
I want to get the value 2 that follows the string 'birthdayFriends', but I have no idea how to get it. So far I have written the code below, but it only prints an empty list.
import urllib2
from bs4 import BeautifulSoup
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='myWebpage',
                          user='myUsername',
                          passwd='myPassword')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
page = urllib2.urlopen('myWebpage')
soup = BeautifulSoup(page.read())
bf = soup.findAll('birthdayFriends')
print bf
>> []
Suppose somewhere in the HTML there is a script tag like the following:
<script>
var x = {"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0}}
</script>
Then your code might look something like:
import json
script = soup.findAll('script')[0] # or the number it appears in the file
# take the JSON part: everything after the "=" in the script text
j = script.text.split('=')[1]
# load the JSON string into a dictionary
d = json.loads(j, strict=False)
print d["birthdayFriends"]
In case the content of the script tag is more complicated, consider looping over the script lines, or see "How can I parse Javascript variables using python?"
Also, for parsing JavaScript in Python, see pynoceros.
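Splitting on "=" breaks if the JSON itself contains an equals sign or if the statement ends with a semicolon. A slightly more tolerant sketch, assuming the script text contains exactly one object literal that is valid JSON, is to let the json module decode from the first "{" and ignore whatever follows. first_json_object is a hypothetical helper, and script_text reuses the example object from above.

```python
import json


def first_json_object(text):
    """Decode the first JSON object found in text, ignoring what follows it."""
    start = text.index('{')  # raises ValueError if there is no object
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


# Example script content, with a trailing semicolon added for illustration.
script_text = ('var x = {"birthdayFriends":2,'
               '"lastActiveTimes":{"719317510":0,"719435783":0}};')
data = first_json_object(script_text)
birthday_friends = data["birthdayFriends"]
```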