Using diff with beautiful soup objects - python-2.7

I am trying to compare the text of all instances of a particular tag in two XML files. The OCR engine I am using outputs an XML file with all the OCR characters in a tag <OCRCharacters>...</OCRCharacters>.
I am using python 2.7.11 and beautiful soup 4 (bs4). From the terminal, I am calling my python program with two xml file names as arguments.
I want to extract all the strings in the <OCRCharacters> tag for each file, compare them line by line with difflib, and write a new file with the differences.
I use $ python parse_xml_file.py file1.xml file2.xml to call the program from the terminal.
The code below opens each file and prints each string in the tag <OCRCharacters>. How should I convert the objects made with bs4 to strings that I can use with difflib? I am open to better ways (using Python) to do this.
import sys
with open(sys.argv[1], "r") as f1:
    xml_doc_1 = f1.read()
with open(sys.argv[2], "r") as f2:
    xml_doc_2 = f2.read()
from bs4 import BeautifulSoup
soup1 = BeautifulSoup(xml_doc_1, 'xml')
soup2 = BeautifulSoup(xml_doc_2, 'xml')
print("#####################",sys.argv[1],"#####################")
for tag in soup1.find_all('OCRCharacters'):
    print(repr(tag.string))
    temp1 = repr(tag.string)
    print(temp1)
print("#####################",sys.argv[2],"#####################")
for tag in soup2.find_all('OCRCharacters'):
    print(repr(tag.string))
    temp2 = repr(tag.string)

You can try this:
import sys
import difflib
from bs4 import BeautifulSoup
text = [[],[]]
files = []
soups = []
for i, arg in enumerate(sys.argv[1:]):
    files.append(open(arg, "r").read())
    soups.append(BeautifulSoup(files[i], 'xml'))
    for tag_text in soups[i].find_all('OCRCharacters'):
        text[i].append(''.join(tag_text))
for first_string, second_string in zip(text[0], text[1]):
    d = difflib.Differ()
    diff = d.compare(first_string.splitlines(), second_string.splitlines())
    print '\n'.join(diff)
With xml1.xml:
<node>
    <OCRCharacters>text1_1</OCRCharacters>
    <OCRCharacters>text1_2</OCRCharacters>
    <OCRCharacters>Same Value</OCRCharacters>
</node>
and xml2.xml:
<node>
    <OCRCharacters>text2_1</OCRCharacters>
    <OCRCharacters>text2_2</OCRCharacters>
    <OCRCharacters>Same Value</OCRCharacters>
</node>
The output will be:
- text1_1
?     ^
+ text2_1
?     ^
- text1_2
?     ^
+ text2_2
?     ^
  Same Value
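Since the question also asks for a file containing only the differences, here is a minimal sketch of that step. It assumes the tag text has already been extracted into two plain lists of strings (for example with tag.get_text()); the helper name changed_lines and the sample values are hypothetical, and it uses difflib.ndiff rather than the per-string Differ loop above.

```python
import difflib


def changed_lines(first, second):
    """Return only the added/removed lines from an ndiff of two string lists."""
    diff = difflib.ndiff(first, second)
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged lines ("  ") and hint lines ("? ") are skipped here.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical values standing in for the text of each <OCRCharacters> tag
ocr_1 = ["text1_1", "text1_2", "Same Value"]
ocr_2 = ["text2_1", "text2_2", "Same Value"]

report = "\n".join(changed_lines(ocr_1, ocr_2))
print(report)
```

The report string can then be written to a new file with an ordinary open(..., "w") call. Comparing whole lists at once also reports tags that exist in only one file, which zip() would silently drop.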

Related

Django get filename without url

I get a file with this command:
src = request.POST.get('src', '')
But the output is: https://url.com/path/filename.jpg
How can I just get path/filename.jpg?
regards
Christopher
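I ended up with three statements, although a single line would be nicer:
from urllib.parse import urlparse
url = request.POST.get('src', '')
# urlparse(...).path is "/path/filename.jpg" for the URL above
filepath = urlparse(url).path
# Drop the leading slash to get "path/filename.jpg"
src = filepath.lstrip('/')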
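The single-line form mentioned above might look like the sketch below. The function name relative_path is hypothetical, and the example URL is the one from the question; any query string or fragment is discarded because only the path component is kept.

```python
from urllib.parse import urlparse


def relative_path(url):
    """Return the path component of `url` without its leading slash."""
    return urlparse(url).path.lstrip("/")


print(relative_path("https://url.com/path/filename.jpg"))
```

In a Django view this would be called as relative_path(request.POST.get('src', '')); an empty value simply yields an empty string.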

Format string to XML file

I want to reformat a string into an XML structure, but my string is not in XML format (using Python 2.7).
I believe the correct way is to first build a one-line XML version of the input and then pretty-print it into a multi-line, indented XML file (see "Pretty printing XML in Python").
Below is an example of the input, obtained from a History Server REST API call to Hadoop server 1.
Input:
'{"jobAttempts":{"jobAttempt":[{"nodeHttpAddress":"slave2:8042","nodeId":"slave2:39637","id":1,"startTime":1544691730439,"containerId":"container_1544631848492_0013_01_000001","logsLink":"http://23.22.43.90:19888/jobhistory/logs/slave2:39637/container_1544631848492_0013_01_000001/job_1544631848492_0013/hadoop2"}]}}'
Output:
'<jobAttempts><jobAttempt><nodeHttpAddress>slave2:8042</nodeHttpAddress><nodeId>slave2:39637</nodeId><id>1</id><startTime>1544691730439</startTime><containerId>container_1544631848492_0013_01_000001</containerId><logsLink>http://23.22.43.90:19888/jobhistory/logs/slave2:39637/container_1544631848492_0013_01_000001/job_1544631848492_0013/hadoop2</logsLink></jobAttempt></jobAttempts>'
Final Output
<jobAttempts>
    <jobAttempt>
        <nodeHttpAddress>slave2:8042</nodeHttpAddress>
        <nodeId>slave2:39637</nodeId>
        <id>1</id>
        <startTime>1544691730439</startTime>
        <containerId>container_1544631848492_0013_01_000001</containerId>
        <logsLink>http://23.22.43.90:19888/jobhistory/logs/slave2:39637/container_1544631848492_0013_01_000001/job_1544631848492_0013/hadoop2</logsLink>
    </jobAttempt>
</jobAttempts>
*This string is actually an XML file which does not appear to have any style information associated with it.
I found out that the source view of the History Server REST API response is in fact a one-line XML file. So I had to read the source view with Python instead of the old, problematic view.
Before, I used
import urllib2
contents = urllib2.urlopen("http://23.22.43.90:19888/ws/v1/history/mapreduce/jobs/job_1544631848492_0013/jobattempts").read()
Now I download the source view of the page with selenium and BeautifulSoup and save it locally.
from bs4 import BeautifulSoup
from selenium import webdriver
import xml.dom.minidom
driver = webdriver.Firefox()
driver.get("http://23.22.43.90:19888/ws/v1/history/mapreduce/jobs/job_1544631848492_0013/jobattempts")
page_source = driver.page_source
driver.close()
soup = BeautifulSoup(page_source, "html.parser")
print(soup)
# Use a name other than `xml` so the xml module is not shadowed
dom = xml.dom.minidom.parseString(str(soup))
pretty_xml_as_string = dom.toprettyxml()
# The with-block closes the file even if the write fails
with open("./content_new_2.xml", 'w') as out_file:
    out_file.write(pretty_xml_as_string)
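For the conversion the question originally asked about (JSON string to one-line XML, then pretty-printing), a rough standard-library sketch could look like this. It is not the approach used above and it targets Python 3; the helper names append_value and json_to_xml are hypothetical, the sample is a shortened version of the input from the question, and it assumes a single top-level key whose value is an object and key names that are valid XML tag names.

```python
import json
import xml.etree.ElementTree as ET
from xml.dom import minidom


def append_value(parent, tag, value):
    """Append `value` under `parent` as one or more <tag> elements."""
    if isinstance(value, list):
        # A JSON list becomes repeated sibling elements with the same tag
        for item in value:
            append_value(parent, tag, item)
    elif isinstance(value, dict):
        child = ET.SubElement(parent, tag)
        for key, item in value.items():
            append_value(child, key, item)
    else:
        ET.SubElement(parent, tag).text = str(value)


def json_to_xml(json_text):
    """Convert a JSON object with exactly one top-level key to one-line XML."""
    data = json.loads(json_text)
    if len(data) != 1:
        raise ValueError("expected exactly one top-level key")
    wrapper = ET.Element("wrapper")  # temporary holder, not part of the output
    for key, value in data.items():
        append_value(wrapper, key, value)
    return ET.tostring(wrapper[0], encoding="unicode")


# Shortened version of the input shown in the question
sample = '{"jobAttempts":{"jobAttempt":[{"nodeId":"slave2:39637","id":1}]}}'
one_line = json_to_xml(sample)
pretty = minidom.parseString(one_line).toprettyxml(indent="  ")
print(pretty)
```

This avoids starting a browser just to read the response, at the cost of choosing a mapping from JSON lists to repeated elements that may not match what the server itself would emit as XML.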

Python Web scraper using Beautifulsoup 4

I wanted to create a database with commonly used words. Right now when I run this script it works fine but my biggest issue is I need all of the words to be in one column. I feel like what I did was more of a hack than a real fix. Using Beautifulsoup, can you print everything in one column without having extra blank lines?
import requests
import re
from bs4 import BeautifulSoup
# Website you want to scrape info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")
# Creating the CSV file
commonFile = open('common_words.csv', 'wb')
# Grabbing the lines you want
for node in soup.findAll("tr"):
    # Getting just the text and removing the html
    words = ''.join(node.findAll(text=True))
    # Removing the extra lines
    ID = re.sub(r'[\t\r\n]', '', words)
    # Needed to add a break in the line to make the rows
    update = ''.join(ID)+'\n'
    # Now we add this to the file
    commonFile.write(update)
commonFile.close()
How about this?
import requests
import csv
from bs4 import BeautifulSoup
# Website you want to scrape info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")
words = soup.select('div[class=file] tr')
# The with-block closes the file once every row is written
with open("common_words.csv", "w") as csv_file:
    f = csv.writer(csv_file)
    f.writerow(["common_words"])
    for row in words:
        f.writerow([row.text.replace('\n', '')])
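The blank-line cleanup can also be separated from the scraping. The sketch below uses only the standard library; the raw_cells list is a made-up stand-in for the text of each table row, and write_single_column is a hypothetical helper name. It collapses whitespace in each cell and skips cells that end up empty, so every remaining word lands in a single column.

```python
import csv
import io


def write_single_column(raw_cells, out):
    """Write one non-empty, whitespace-collapsed word per CSV row."""
    writer = csv.writer(out)
    writer.writerow(["common_words"])
    for cell in raw_cells:
        # split() with no arguments drops tabs, newlines and repeated spaces
        word = " ".join(cell.split())
        if word:
            writer.writerow([word])


# Made-up row text, including one cell that holds only whitespace
raw_cells = ["\n\nthe\n", "\t\n", "of\r\n", "  and  "]
buf = io.StringIO()
write_single_column(raw_cells, buf)
print(buf.getvalue())
```

When writing to a real file on Python 3, opening it with newline="" keeps the csv module from producing extra blank lines on Windows.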

removing double quotes and brackets from csv in python

I am trying to remove quotes and brackets from a CSV in Python. I tried the following code, but it does not produce a proper CSV. The code is:
import json
import urllib2
import re
import os
from BeautifulSoup import BeautifulSoup
import csv
u = urllib2.urlopen("http://timesofindia.indiatimes.com/")
content = u.read()
u.close()
soup2 = BeautifulSoup(content)
blog_posts = []
for e in soup2.findAll("a", attrs={'pg': re.compile('^Head')}):
    for b in soup2.findAll("div", attrs={'style': re.compile('^color:#ffffff;font-size:12px;font-family:arial;padding-top:3px;text-align:center;')}):
        blog_posts.append(("The Times Of India",e.text,b.text))
print blog_posts
out_file = os.path.join('resources', 'ch05-webpages','newspapers','time1.csv')
f = open(out_file, 'wb')
wr = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
#f.write(json.dumps(blog_posts, indent=1))
wr.writerow(blog_posts)
f.close()
print 'Wrote output file to %s' % (f.name, )
The CSV looks like:
"('The Times Of India', u'Missing jet: Air search expands to remote south Indian Ocean', u'Fri, Mar 21, 2014 | Updated 11.53AM IST')",
but I want the CSV to look like this:
The Times Of India,u'Missing jet: Air search expands to remote south Indian Ocean, u'Fri, Mar 21, 2014 | Updated 11.53AM IST
What can I do to get this type of CSV?
Writer.writerow() expects a sequence containing strings or numbers. You are passing a sequence of tuples. Use Writer.writerows() instead.
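A small standard-library sketch of the difference, written for Python 3 and using an in-memory buffer instead of the file from the question (the helper name to_csv is hypothetical, and the row reuses the headline shown above):

```python
import csv
import io


def to_csv(rows):
    """Serialize a list of tuples with writerows(): one CSV line per tuple."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    return buf.getvalue()


blog_posts = [
    ("The Times Of India",
     "Missing jet: Air search expands to remote south Indian Ocean",
     "Fri, Mar 21, 2014 | Updated 11.53AM IST"),
]
print(to_csv(blog_posts))
```

Each tuple element becomes its own field, so the brackets and the u'' prefixes disappear. The date field is still wrapped in double quotes because it contains commas; with QUOTE_MINIMAL that quoting is what keeps the row at three columns.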

How do you convert the multi-line content scraped into a list?

I was trying to convert the content scraped into a list for data manipulation, but got the following error: TypeError: 'NoneType' object is not callable
#! /usr/bin/python
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import os
import re
# Copy all of the content from the provided web page
webpage = urlopen("http://www.optionstrategist.com/calculators/free-volatility-data").read()
# Grab everything that lies between the title tags using a REGEX
preBegin = webpage.find('<pre>') # Locate the pre provided
preEnd = webpage.find('</pre>') # Locate the /pre provided
# Copy the content between the pre tags
voltable = webpage[preBegin:preEnd]
# Pass the content to the Beautiful Soup Module
raw_data = BeautifulSoup(voltable).splitline()
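The code is very simple. This is the code for BeautifulSoup4:
from bs4 import BeautifulSoup
# Parse the downloaded HTML first; find_all() is a BeautifulSoup method, not a str method
soup = BeautifulSoup(webpage, 'html.parser')
# Find all <pre> tags in the HTML page
preTags = soup.find_all('pre')
for tag in preTags:
    # Get the text inside the tag
    print(tag.get_text())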
Reference:
find_all()
Kinds of filters to put into name field of find()/findall()
get_text()
To get the text from the first pre element:
#!/usr/bin/env python
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = "http://www.optionstrategist.com/calculators/free-volatility-data"
soup = BeautifulSoup(urlopen(url))
print soup.pre.string
To extract lines with data:
from itertools import dropwhile
lines = soup.pre.string.splitlines()
# drop lines before the data table header
lines = dropwhile(lambda line: not line.startswith("Symbol"), lines)
# extract lines with data
lines = (line for line in lines if '%ile' in line)
Now each line contains data in a fixed-column format. You could use slicing and/or regex to parse/validate individual fields in each row.
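As a rough illustration of the regex option, the sketch below parses rows from a made-up table. The column layout, the SAMPLE text, and the names ROW and parse_rows are all hypothetical and do not reflect the real page; only the dropwhile and '%ile' filtering steps mirror the code above.

```python
import re
from itertools import dropwhile

# Made-up table: a symbol, a decimal value and a percentile per row
SAMPLE = """\
intro text
Symbol   value   rank
AAPL     25.10   45%ile
MSFT     18.70   12%ile
end of table
"""

# One named group per hypothetical column
ROW = re.compile(r"^(?P<symbol>\S+)\s+(?P<value>\d+(?:\.\d+)?)\s+(?P<pct>\d+)%ile$")


def parse_rows(text):
    """Return (symbol, value, percentile) tuples for each data row."""
    lines = text.splitlines()
    # Drop everything before the table header
    lines = dropwhile(lambda line: not line.startswith("Symbol"), lines)
    rows = []
    for line in lines:
        if "%ile" not in line:
            continue
        match = ROW.match(line.strip())
        if match:  # rows that fail validation are skipped
            rows.append((match.group("symbol"),
                         float(match.group("value")),
                         int(match.group("pct"))))
    return rows


print(parse_rows(SAMPLE))
```

For the real page the pattern would need one group per actual column, or fixed slices such as line[0:8] if the columns are aligned by position.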