Format string to XML file - regex

I want to reformat a string into an XML structure, but my string is not in XML format (I am using Python 2.7).
I believe the correct way is to first turn the input into one-line XML and then use an XML pretty-printer to make it a multi-line, indented XML file (see
Pretty printing XML in Python).
Below is an example of the input returned by a History Server REST API call to Hadoop server 1.
Input:
'{"jobAttempts":{"jobAttempt":[{"nodeHttpAddress":"slave2:8042","nodeId":"slave2:39637","id":1,"startTime":1544691730439,"containerId":"container_1544631848492_0013_01_000001","logsLink":"http://23.22.43.90:19888/jobhistory/logs/slave2:39637/container_1544631848492_0013_01_000001/job_1544631848492_0013/hadoop2"}]}}'
Output:
'<jobAttempts><jobAttempt><nodeHttpAddress>slave2:8042</nodeHttpAddress><nodeId>slave2:39637</nodeId><id>1</id><startTime>1544691730439</startTime><containerId>container_1544631848492_0013_01_000001</containerId><logsLink>http://23.22.43.90:19888/jobhistory/logs/slave2:39637/container_1544631848492_0013_01_000001/job_1544631848492_0013/hadoop2</logsLink></jobAttempt></jobAttempts>'
Final output:
<jobAttempts>
  <jobAttempt>
    <nodeHttpAddress>slave2:8042</nodeHttpAddress>
    <nodeId>slave2:39637</nodeId>
    <id>1</id>
    <startTime>1544691730439</startTime>
    <containerId>container_1544631848492_0013_01_000001</containerId>
    <logsLink>http://23.22.43.90:19888/jobhistory/logs/slave2:39637/container_1544631848492_0013_01_000001/job_1544631848492_0013/hadoop2</logsLink>
  </jobAttempt>
</jobAttempts>
*This string is actually an XML file which does not appear to have any style information associated with it.
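For reference, here is a minimal sketch of the approach described above, assuming the input stays a JSON-style string: parse it with json, build one-line XML, then pretty-print it with minidom. The to_xml helper is hypothetical and does not escape special characters.
import json
import xml.dom.minidom
def to_xml(obj):
    # Recursively turn parsed JSON into a one-line XML string (values are not escaped)
    parts = []
    for key, value in obj.items():
        if isinstance(value, dict):
            body = to_xml(value)
        elif isinstance(value, list):
            body = ''.join(to_xml(item) for item in value)
        else:
            body = value
        parts.append('<%s>%s</%s>' % (key, body, key))
    return ''.join(parts)
raw = '{"jobAttempts":{"jobAttempt":[{"nodeHttpAddress":"slave2:8042","id":1}]}}'
one_line_xml = to_xml(json.loads(raw))
print(xml.dom.minidom.parseString(one_line_xml.encode('utf-8')).toprettyxml())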

I have found out that the source view of the History Server REST API response is indeed a one-line XML document. So I had to read the source view with Python rather than the old, problematic rendered view.
Previously I used:
import urllib2
contents = urllib2.urlopen("http://23.22.43.90:19888/ws/v1/history/mapreduce/jobs/job_1544631848492_0013/jobattempts").read()
Now I download the source view of the HTML page with Selenium and BeautifulSoup and save it locally:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import xml.dom.minidom
# Load the page with Selenium and grab its source
driver = webdriver.Firefox()
driver.get("http://23.22.43.90:19888/ws/v1/history/mapreduce/jobs/job_1544631848492_0013/jobattempts")
page_source = driver.page_source
driver.close()
soup = BeautifulSoup(page_source, "html.parser")
print(soup)
# Pretty-print the one-line XML; use a name other than "xml" so the module is not shadowed
dom = xml.dom.minidom.parseString(str(soup))
pretty_xml_as_string = dom.toprettyxml()
with open("./content_new_2.xml", 'w') as output_file:
    output_file.write(pretty_xml_as_string)
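As an aside, if the History Server honours HTTP content negotiation (an assumption; many Hadoop REST endpoints do), the browser round-trip may be avoidable by asking for XML directly with urllib2:
import urllib2
import xml.dom.minidom
url = "http://23.22.43.90:19888/ws/v1/history/mapreduce/jobs/job_1544631848492_0013/jobattempts"
# Request the XML representation explicitly via the Accept header
request = urllib2.Request(url, headers={"Accept": "application/xml"})
contents = urllib2.urlopen(request).read()
print(xml.dom.minidom.parseString(contents).toprettyxml())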

Related

How can I combine my bs4 web-scraping process with my Selenium automation in Python into one single process that just asks for a zip code?

I am using Selenium to go to a website, click the search box, and type a zip code that I enter beforehand. For that zip code, I want the link the page produces so I can feed it to my web scraper, built with Beautiful Soup; once the link comes up, I can scrape the required data to get my CSV.
What I want:
I am having trouble passing that link to Beautiful Soup as the URL. I basically want to automate everything so that I just have to enter a zip code and it gives me my CSV.
What I am able to do:
I am able to enter the zip code and search using Selenium, and then manually add that URL to my scraper to produce the CSV.
Code I am using for Selenium:
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome('/Users/akashgupta/Desktop/Courses and Learning/Automating Python and scraping/chromedriver')
driver.get('https://www.weather.gov/')
messageField = driver.find_element_by_xpath('//*[@id="inputstring"]')
messageField.click()
messageField.send_keys('75252')
time.sleep(3)
showMessageButton = driver.find_element_by_xpath('//*[@id="btnSearch"]')
showMessageButton.click()
# Web scraping part:
url = "https://forecast.weather.gov/MapClick.php?lat=32.99802500000004&lon=-96.79775499999994#.Xo5LnFNKgWo"
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
tag = soup.find_all('div', id='seven-day-forecast-body')
weekly = soup.find_all(class_='tombstone-container')
main = soup.find_all(class_='period-name')
description = soup.find_all(class_='short-desc')
temp = soup.find_all(class_='temp')
Period_Name = []
Desc = []
Temp = []
for a in range(0, len(main)):
    Period_Name.append(main[a].get_text())
    Desc.append(description[a].get_text())
    Temp.append(temp[a].get_text())
df = pd.DataFrame(list(zip(Period_Name, Desc, Temp)), columns=['Period_Name', 'Short_Desc', 'Temperature'])
from selenium import webdriver
import time
import requests
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.weather.gov/')
messageField = driver.find_element_by_xpath('//*[@id="inputstring"]')
messageField.click()
messageField.send_keys('75252')
time.sleep(3)
showMessageButton = driver.find_element_by_xpath('//*[@id="btnSearch"]')
showMessageButton.click()
WebDriverWait(driver, 10).until(EC.url_contains("https://forecast.weather.gov/MapClick.php"))  # wait until the URL matches the forecast-page pattern
currentURL = driver.current_url
print(currentURL)
time.sleep(3)
driver.quit()
# Web scraping part:
res = requests.get(currentURL)
....
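For completeness, a hedged sketch of the remaining scraping part, reusing the selectors from the question; the forecast.csv output name is an assumption:
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')
# Same class selectors as in the question's scraping code
main = soup.find_all(class_='period-name')
description = soup.find_all(class_='short-desc')
temp = soup.find_all(class_='temp')
rows = [(m.get_text(), d.get_text(), t.get_text()) for m, d, t in zip(main, description, temp)]
df = pd.DataFrame(rows, columns=['Period_Name', 'Short_Desc', 'Temperature'])
df.to_csv('forecast.csv', index=False)  # assumed output file name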

Python requests: download a file and save it to a specific directory

Hello, sorry if this question has been asked before, but I have tried a lot of the methods that were provided.
Basically, I want to download a file from a website; I will show my code below. The code works perfectly, but the problem is that the file is automatically downloaded into the default download folder.
What I need is to download the file and save it to a specific folder.
I'm aware we could change the browser settings, but this is a server that is accessed remotely by different users, so the file automatically goes to their temporary /users/adam_01/download/ folder.
I want it saved on the server disk instead, at C://ExcelFile/
Below is my script; some of the data has been changed because it is confidential.
import pandas as pd
import html5lib
import time
from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime
import urllib.request
import os
with requests.Session() as c:
    proxies = {"http": "http://:911"}
    url = 'https://......./login.jsp'
    USERNAME = 'mwirzonw'
    PASSWORD = 'Fiqr123'
    c.get(url, verify=False)
    csrftoken = ''
    login_data = dict(proxies, atl_token=csrftoken, os_username=USERNAME, os_password=PASSWORD, next='/')
    c.post(url, data=login_data, headers={"referer": "https://.....com"})
    page = c.get('https://........s...../SearchRequest-96010.csv')
    location = 'C:/Users/..../Downloads/'
    with open('asdsad906010.csv', 'wb') as output:
        output.write(page.content)
    print("Done!")
Thank you; please feel free to ask if any of the information given is confusing.
Regards,
Fiqri
It seems that your script writes the file to asdsad906010.csv. You should be able to change the output directory as follows:
import os
# Set the output directory to your desired location
output_directory = 'C:/ExcelFile/'
# Create a file path by joining the directory name with the desired file name
file_path = os.path.join(output_directory, 'asdsad906010.csv')
# Write the file
with open(file_path, 'wb') as output:
    output.write(page.content)
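If C:/ExcelFile/ might not already exist on the server, one small addition (a sketch using only the standard os module) is to create the directory before writing:
import os
output_directory = 'C:/ExcelFile/'
# Create the target directory tree if it is not already there
if not os.path.isdir(output_directory):
    os.makedirs(output_directory)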

Python Web scraper using Beautifulsoup 4

I wanted to create a database of commonly used words. Right now, when I run this script it works fine, but my biggest issue is that I need all of the words to be in one column. I feel like what I did was more of a hack than a real fix. Using BeautifulSoup, can you print everything in one column without having extra blank lines?
import requests
import re
from bs4 import BeautifulSoup
# Website you want to scrape info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")
# Creating the CSV file
commonFile = open('common_words.csv', 'wb')
# Grabbing the lines you want
for node in soup.findAll("tr"):
    # Getting just the text and removing the html
    words = ''.join(node.findAll(text=True))
    # Removing the extra lines
    ID = re.sub(r'[\t\r\n]', '', words)
    # Needed to add a break in the line to make the rows
    update = ''.join(ID) + '\n'
    # Now we add this to the file
    commonFile.write(update)
commonFile.close()
How about this?
import requests
import csv
from bs4 import BeautifulSoup
f = csv.writer(open("common_words.csv", "w"))
f.writerow(["common_words"])
# Website you want to scrape info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")
words = soup.select('div[class=file] tr')
for i in range(len(words)):
    word = words[i].text
    f.writerow([word.replace('\n', '')])
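As an aside, a simpler hedged alternative is to skip HTML parsing and fetch the raw word list directly; the URL below assumes GitHub's usual raw.githubusercontent.com scheme for the same file and branch:
import csv
import requests
# Assumed raw-file URL for the same repository and branch
raw_url = "https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa.txt"
with open("common_words.csv", "w") as out:
    writer = csv.writer(out)
    writer.writerow(["common_words"])
    # One word per line in the raw file, so each line becomes one row
    for word in requests.get(raw_url).text.splitlines():
        writer.writerow([word])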

How do I embed a csv file into a Word Document with Python docx

I'm opening a docx file and I want to embed a csv file into it.
The csv should be displayed as an icon.
How would I set its exact position in my Word file?
My code so far is:
from Tkinter import Tk
from tkFileDialog import askopenfilename
import csv
from docx import Document
import datetime
today = datetime.date.today()
Tk().withdraw()  # we don't want a full GUI, so keep the root window from appearing
#filename = askopenfilename(title='Specify data csv file', filetypes=[('text files', '.csv')])  # show an "Open" dialog box and return the path to the selected file
filename = 'C:/Documents and Settings/K/My Documents/LiClipse Workspace/WO_templates/WO_templates/data.csv'
with open(filename, 'r') as csvfile:
    data_csv = csv.reader(csvfile, delimiter=',')
    for row in data_csv:
        if row[3] != 'Name':
            document = Document('2G_Template.docx')
            for table in document.tables:
                for _cell in table._cells:
                    for paragraph in _cell.paragraphs:
                        if '%DATE%' in paragraph.text:
                            paragraph.text = str(today.day) + '/' + str(today.month) + '/' + str(today.year)
                        if 'R%RR%' in paragraph.text:
                            paragraph.text = 'R' + row[0]
                        if '%DESC%' in paragraph.text:
                            paragraph.text = row[1]
            for paragraph in document.paragraphs:
                if '%DEL_PLAN%' in paragraph.text:
                    paragraph.text = 'Deletion PLan: ' + row[2]
            document.save(row[3] + '.docx')
The next step in each cycle should be to select a csv file and embed it in the Word file. It would be the equivalent of a Paste Special action in Word, with the Display as Icon option selected.
I was able to embed the csv using the pywin32 library.
import win32com.client as win32
.
.
.
word = win32.gencache.EnsureDispatch('Word.Application')
doc=word.Documents.Open(doc_path)
word.Visible = False
.
.
.
doc.InlineShapes.AddOLEObject(FileName=doc_path3,DisplayAsIcon=1,Range=range1,IconLabel="CSV",IconFileName=doc_path4)
word.ActiveDocument.SaveAs(doc_path2)
doc.Close()
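For context, here is a self-contained hedged sketch of that pywin32 flow; the file paths are hypothetical stand-ins for the elided doc_path variables, and the insertion Range is simply placed at the end of the document:
import win32com.client as win32
template_path = r'C:\temp\2G_Template_filled.docx'  # hypothetical paths
csv_path = r'C:\temp\data.csv'
output_path = r'C:\temp\2G_Template_with_csv.docx'
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(template_path)
# Insert at the very end of the document body
range1 = doc.Range(doc.Content.End - 1, doc.Content.End - 1)
doc.InlineShapes.AddOLEObject(FileName=csv_path, DisplayAsIcon=True, IconLabel="CSV", Range=range1)
word.ActiveDocument.SaveAs(output_path)
doc.Close()
word.Quit()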

How do you convert the multi-line content scraped into a list?

I was trying to convert the scraped content into a list for data manipulation, but I got the following error: TypeError: 'NoneType' object is not callable
#! /usr/bin/python
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import os
import re
# Copy all of the content from the provided web page
webpage = urlopen("http://www.optionstrategist.com/calculators/free-volatility-data").read()
# Grab everything that lies between the title tags using a REGEX
preBegin = webpage.find('<pre>') # Locate the pre provided
preEnd = webpage.find('</pre>') # Locate the /pre provided
# Copy the content between the pre tags
voltable = webpage[preBegin:preEnd]
# Pass the content to the Beautiful Soup Module
raw_data = BeautifulSoup(voltable).splitline()
The code is very simple. This is the code for BeautifulSoup4:
from bs4 import BeautifulSoup
# Parse the downloaded page, then find all <pre> tags in the HTML
soup = BeautifulSoup(webpage, 'html.parser')
preTags = soup.find_all('pre')
for tag in preTags:
    # Get the text inside the tag
    print(tag.get_text())
Reference:
find_all()
Kinds of filters to put into name field of find()/findall()
get_text()
To get the text from the first pre element:
#!/usr/bin/env python
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = "http://www.optionstrategist.com/calculators/free-volatility-data"
soup = BeautifulSoup(urlopen(url))
print soup.pre.string
To extract lines with data:
from itertools import dropwhile
lines = soup.pre.string.splitlines()
# drop lines before the data table header
lines = dropwhile(lambda line: not line.startswith("Symbol"), lines)
# extract lines with data
lines = (line for line in lines if '%ile' in line)
Now each line contains data in a fixed-column format. You could use slicing and/or regex to parse/validate individual fields in each row.
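For instance, a hedged sketch of that parsing step; the assumed layout (a symbol column followed by whitespace-separated numeric fields) is a guess about the table, so the regex and the split may need adjusting:
import re
# Assumed row layout: a ticker symbol, then whitespace-separated columns
row_re = re.compile(r'^(?P<symbol>\S+)\s+(?P<rest>.+)$')
rows = []
for line in lines:
    match = row_re.match(line)
    if match:
        rows.append([match.group('symbol')] + match.group('rest').split())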