Saving data from bs4 and requests in a usable manner - python-2.7

I am very new to Python, so please forgive the basic question. Thanks in advance.
I have the following data (floats) printed out with bs4 and requests, using the code print link.find_all("id"), link.text:
X a
X b
X c
Y a
Y b
Y c
Z a
Z b
Z c
Instead, I would like to save it like:
X a b c
Y a b c
Z a b c
and then save it into a text file so that I can use it afterwards. (I don't even know how to save data into a file with Python.)

Welcome to Python, here's a quick example of creating a dict of lists and writing it to a text file.
from bs4 import BeautifulSoup
# import collections

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>;
<a href="http://example.com/tillie" id="link3">Tillie2</a>;
"""

soup = BeautifulSoup(html_doc, 'html.parser')
anchors = soup.find_all('a')

data = {}  # collections.OrderedDict() if order matters
for item in anchors:
    key = item.get('id')
    if key not in data:
        data[key] = [item.text]
    else:
        data[key].append(item.text)

with open('example.txt', 'w') as f:
    for key, value in data.items():
        line = key + ' ' + ' '.join(value) + '\n'
        f.write(line)

# example.txt
# link1 Elsie
# link3 Tillie Tillie2
# link2 Lacie
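If the explicit if/else branch feels noisy, collections.defaultdict gives the same grouping in one line per item; a minimal sketch of that variant:

from collections import defaultdict

data = defaultdict(list)   # missing keys start out as empty lists
for item in anchors:
    data[item.get('id')].append(item.text)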


python: excel: print in columns then start in next row

I have data from BeautifulSoup in the form of:
a
b
c
d
e
f
I want to get them into Excel in the following format:
a b c d e f g
h i j k l m n
o p q r s t u
etc...
when I print them in Excel.
This is the code I have currently:
import openpyxl
from openpyxl import Workbook
import requests
from bs4 import BeautifulSoup

for i in range(1, 2):
    url = "https:...."
    response = requests.get(url, verify=False)
    soup = BeautifulSoup(response.text)
    g_data = soup.find_all("td", {"class"})
    results = []
    for item in g_data:
        data = item.text
        results.append(data)
    wb = Workbook()
    ws = wb.active
    for row, i in enumerate(results):
        column_cell = 'A'
        ws[column_cell + str(row + 2)] = str(i)
    wb.save("test.xlsx")
Thanks in advance for your help.
UPDATED code:
for i in range(1, 3):
    url = "https:....".format(pagenum=i)
    response = requests.get(url)
    soup = BeautifulSoup(response.text)
    g_data = soup.find_all("td", "class")
    row = []
    wb = Workbook()
    ws = wb.active
    for idx, item in enumerate(g_data):
        row.append(item.text)
        if not idx % 7:
            ws.append(row)
            row = []
    wb.save("test2.xlsx")
Finally this works:
import numpy as np
import pandas as pd

for i in range(1, 2):
    url = "https:... "
    response = requests.get(url)
    soup = BeautifulSoup(response.text)
    g_data = soup.find_all("td", {"class"})
    results = []
    for item in g_data:
        results.append(item.text)
    df = pd.DataFrame(np.array(results).reshape(20, 7), columns=list("abcdefg"))
    writer = pd.ExcelWriter('test4.xlsx', engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Sheet1')
    writer.save()
The problem with this one is that it's overwriting the previous results. Still a bit more work to do, but progress :)
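The overwriting happens because the DataFrame and the Excel file are rebuilt from scratch on every pass through the loop. One way around it, sketched here under the same assumptions as the question (placeholder URL, rows of 7, and the imports already shown above), is to collect all pages first and write once at the end:

results = []
for i in range(1, 3):
    url = "https:... "  # placeholder URL, as in the question
    response = requests.get(url)
    soup = BeautifulSoup(response.text)
    for item in soup.find_all("td", {"class"}):
        results.append(item.text)

# one write after all pages are collected; len(results) must be a multiple of 7
df = pd.DataFrame(np.array(results).reshape(-1, 7), columns=list("abcdefg"))
df.to_excel('test4.xlsx', sheet_name='Sheet1')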
Sounds like you want something a bit like this:
row = []
for idx, item in enumerate(g_data):
    row.append(item.text)
    if idx % 7 == 6:  # every 7th element (idx 6, 13, 20, ...)
        ws.append(row)
        row = []
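An equivalent that avoids the modulo bookkeeping, assuming results already holds the cell text in page order, is to slice it into rows of seven:

for start in range(0, len(results), 7):
    ws.append(results[start:start + 7])  # openpyxl writes one row per append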

Beautifulsoup extraction using for loop into table in Python 2

Platform: Python 2.7.13 on Win 7 with spyder IDE
I'm totally new to both BeautifulSoup and Python, so please bear with me. I am stuck at the last two lines.
Q. I want to import the details from the url below and put them in a table, that is, the information with dd tags:
The first part of the code works well to get the link and all the school details. However, I'm having trouble running the for command to get the remaining elements.
Full code is below:
# coding: utf-8
import urllib2

url = "http://tools.canlearn.ca/cslgs-scpse/cln-cln/rep-fit/p/af.p.clres.do?institution_id=default&searchType=ALL&searchString=&progLang=A&instType=B&prov_1=prov_1&progNameOnly=N&start=0&finish=999&section=1"
#try:
page = urllib2.urlopen(url)
#except (httplib.HTTPException, httplib.IncompleteRead, urllib2.URLError):
#    missing.put(tmpurl)

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

rooturl = "http://tools.canlearn.ca/cslgs-scpse/cln-cln/rep-fit/p/"

from bs4 import BeautifulSoup
soup = BeautifulSoup(page)
info = soup.find_all("div", class_="wb-frm")
names = [x.ol.find_all("li") for x in info][0]

def f(string):
    return str(string[0] + ', ' + string[-1])

names2 = [names[i:i+3] for i in range(0, len(names), 3)]
diploma = [[x[0].findAll("a")[0].find(text=True).strip(),
            x[1].string,
            f(x[2].find(text=True).strip().split())] for x in names2]

links = [x.ol.find_all("a") for x in info][0]
links2 = [y.get('href') for y in links]
links3 = [rooturl + z for z in links2]

for i in xrange(len(links3)):
    url_link = urllib2.urlopen(links3[i])
    link_html = BeautifulSoup(url_link)
    # Changed the code here based on the good answer given by heyiamt.
    # It was:
    # link_html2 = link_html.find_all("div", class_="wb-frm")
    # website = link_html2[0].a.get('href')
    # dd[y] = link2[y].get('dd')
    # diploma[i].append(dd)
    # diploma[i].append(website)
    # Get the whole box for the general info:
    # general_info_html = link_html.find_all("div", class_="panel-body")
    # general_info_html2 = [y.findAll('dd') for y in general_info_html[2:]]
    # general_info = {}
    # for x in general_info_html2:
    #     general_info.update({x[0].find(text='dt'): x[1].find(text='dd')})
    #     general_info.update({x[0].get('dd')})
    # diploma[i].append(general_info)
    for d in link_html.find_all('dd'):
        if d.a is not None:
            diploma[i].append(d.a.string)
            continue
        if d.string is not None:
            diploma[i].append(d.string)
            continue
        diploma[i].append(d.contents[0])
import pandas as pd
col1 = [x[1] for x in diploma]
col2 = [x[0] for x in diploma]
col3 = [x[2] for x in diploma]
col4 = [x[3] for x in diploma]
col5 = [x[4] for x in diploma]
col55 = {'Program Level': [x.get('Program Level:') for x in col5],
         'Credential Type': [x.get('Credential Type:') for x in col5],
         'Joint Program Level': [x.get('Joint Program Level:') for x in col5],
         'Joint Credential Type': [x.get('Joint Credential Type:') for x in col5],
         'Address': [x.get('Address:') for x in col5],
         'Telephone': [x.get('Telephone:') for x in col5],
         'Email': [x.get('Email:') for x in col5],
         'Fax': [x.get('Fax:') for x in col5],
         'Toll Free': [x.get('Toll Free:') for x in col5]}
df = pd.DataFrame(col1, columns = ['University'])
df2 = pd.DataFrame(col55)
df['Type'] = col2
df['City'] = col3
df['Website'] = col4
df['Address'] = df2['Address']
df['Credential Type'] = df2['Credential Type']
df['Email'] = df2['Email']
df['Fax'] = df2['Fax']
df['Joint Credential Type'] = df2['Joint Credential Type']
df['Joint Program Level'] = df2['Joint Program Level']
df['Program Level'] = df2['Program Level']
df['Telephone'] = df2['Telephone']
df['Toll Free'] = df2['Toll Free']
df.to_csv('data1.csv', encoding='utf-8')
Expected result (i.e. with "dd" tags):
http://www.rosewoodcollege.ca/program-information/
Apprenticeship Program Certificate
Not entered
Not entered
Calgary, Alberta T3J 5H3
(403) 798-7447
mail#rosewoodcollege.ca
For this site, you can just use BeautifulSoup to find the tags within the divs without actually scrolling through the divs themselves. These particular dd tags have a bit of fishiness to them, though. Here's a shot at managing the different possibilities.
# Using link_html from your code above.
dd_strs = []
for d in link_html.find_all('dd'):
    if d.a is not None:
        dd_strs.append(d.a.string)
        continue
    if d.string is not None:
        dd_strs.append(d.string)
        continue
    dd_strs.append(d.contents[0])

for dd_str in dd_strs:
    print dd_str
Output is
http://www.rosewoodcollege.ca/program-information/
Apprenticeship Program
Certificate
Not entered
Not entered
Rosewood College
(403) 798-7447
mail#rosewoodcollege.ca
2015-12-30
If you can rely on the dt tags to always be mated, in order, to the dd tags, you can just repeat the above but for dt instead of dd and merge the lists accordingly.
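A rough sketch of that merge, assuming the dt and dd tags really do pair up one-to-one in document order (the general_info name here is just for illustration):

# Pair each dt label with the dd value at the same position; zip stops at the shorter list.
dt_strs = [dt.get_text(strip=True) for dt in link_html.find_all('dt')]
general_info = dict(zip(dt_strs, dd_strs))
for label, value in general_info.items():
    print label, value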

How to use beautifulsoup to save html of a link in a file and do the same with all the links in the html file

I'm trying to write a parser which will take a url and download its html into a .html file. Then it'll go through the html file to find all links and save them as well. I want to repeat this multiple times. Can someone please help a little?
This is the code I have written:
import requests
import urllib2
from bs4 import BeautifulSoup

link_set = set()
count = 1
give_url = raw_input("Enter url:\t")

def magic(give_url):
    page = urllib2.urlopen(give_url)
    page_content = page.read()
    with open('page_content.html', 'w') as fid:
        fid.write(page_content)
    response = requests.get(give_url)
    html_data = response.text
    soup = BeautifulSoup(html_data)
    list_items = soup.find_all('a')
    for each_item in list_items:
        html_link = each_item.get('href')
        link_set.add(give_url + str(html_link))

magic(give_url)

for each_item in link_set:
    print each_item
    print "\n"
It works fine on its own, but when I try to call the magic function in a for loop, I get RuntimeError: Set changed size during iteration.
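That error appears whenever a set gains members while a for loop is walking over it, which is exactly what magic does to link_set. The usual fix, shown as a minimal sketch, is to iterate over a frozen snapshot so additions go into the live set without disturbing the loop:

# iterate over a copy; magic() can then safely add to link_set itself
for each_item in list(link_set):
    magic(each_item)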
I got it working.
The code for recursive URL parsing using BeautifulSoup:
import requests
import urllib2
from bs4 import BeautifulSoup

link_set = set()
give_url = raw_input("Enter url:\t")

def magic(give_url, link_set, count):
    # print "______________________________________________________"
    # print "Count is: " + str(count)
    # count += 1
    # print "THE URL IT IS SCRAPING IS: " + give_url
    page = urllib2.urlopen(give_url)
    page_content = page.read()
    with open('page_content.html', 'w') as fid:
        fid.write(page_content)
    response = requests.get(give_url)
    html_data = response.text
    soup = BeautifulSoup(html_data)
    list_items = soup.find_all('a')
    for each_item in list_items:
        html_link = each_item.get('href')
        if html_link is None:
            pass
        else:
            if not (html_link.startswith('http') or html_link.startswith('https')):
                link_set.add(give_url + html_link)
            else:
                link_set.add(html_link)
    # print "Total links in the given url are: " + str(len(link_set))

magic(give_url, link_set, 0)

link_set2 = set()
link_set3 = set()
for element in link_set:
    link_set2.add(element)

count = 1
for element in link_set:
    magic(element, link_set3, count)
    count += 1
    for each_item in link_set3:
        link_set2.add(each_item)
    link_set3.clear()

count = 1
print "Total links scraped are: " + str(len(link_set2))
for element in link_set2:
    count += 1
    print "Element number " + str(count) + " processing"
    print element
    print "\n"
There are many mistakes, so please tell me where I can improve the code.
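One structural cleanup, offered only as a sketch: replace the per-level set juggling with a to-visit queue and a visited set, which also scales past two levels of links. The crawl and max_pages names are made up for illustration, and requests and BeautifulSoup are assumed imported as above:

from collections import deque

def crawl(start_url, max_pages=50):
    """Breadth-first link crawl; returns every link seen."""
    visited = set()      # pages already fetched
    seen_links = set()   # every link discovered so far
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        soup = BeautifulSoup(requests.get(url).text)
        for anchor in soup.find_all('a'):
            href = anchor.get('href')
            if href is None:
                continue
            link = href if href.startswith('http') else url + href
            if link not in seen_links:
                seen_links.add(link)
                queue.append(link)
    return seen_links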

Using Interval tree to find overlapping regions

I have two files
File 1
chr1:4847593-4847993
TGCCGGAGGGGTTTCGATGGAACTCGTAGCA
File 2
Pbsn|X|75083240|75098962|
TTTACTACTTAGTAACACAGTAAGCTAAACAACCAGTGCCATGGTAGGCTTGAGTCAGCT
CTTTCAGGTTCATGTCCATCAAAGATCTACATCTCTCCCCTGGTAGCTTAAGAGAAGCCA
TGGTGGTTGGTATTTCCTACTGCCAGACAGCTGGTTGTTAAGTGAATATTTTGAAGTCC
File 1 has approximately 8000 more lines, each with a different header and the sequence below it.
I would first like to match the start and end coordinates from file 1 against file 2, or see if they are close to each other, say within +-100; if so, match the sequence in file 2 and print out the header info for file 2 together with the matched sequence.
My approach: use an interval tree (in Python; I am still trying to get the hang of it) to store the coordinates?
I tried using re.match but it's not giving me accurate results.
Any tips would be highly appreciated.
Thanks.
My first try.
However, I have now hit another roadblock: for my second file, if my start and end are 5000 and 8000 respectively, I want to change this by subtracting 2000, so my new start and stop are 3000 and 5000. Here is my code:
from intervaltree import IntervalTree
from collections import defaultdict

binding_factor = 'some.txt'

genome = dict()
with open('file2', 'r') as rows:
    for row in rows:
        #print row
        if row.startswith('>'):
            row = row.strip().split('|')
            chrom_name = row[1]
            start = int(row[2])
            end = int(row[3])
            # one interval tree per chromosome
            if chrom_name not in genome:
                # first time we've encountered this chromosome, create tree
                genome[chrom_name] = IntervalTree()
            # index the feature (gene name is the first field of the header)
            genome[chrom_name].addi(start, end, row[0])

#for key, value in genome.iteritems():
#    print key, ":", value

mast = defaultdict(list)
with open('file1', 'r') as f:
    for row in f:
        row = row.strip().split()
        row[0] = row[0].replace('chr', '') if row[0].startswith('chr') else row[0]
        row[0] = 'MT' if row[0] == 'M' else row[0]
        #print row[0]
        mast[row[0]].append({
            'start': int(row[1]),
            'end': int(row[2])
        })

#for k, v in mast.iteritems():
#    print k, ":", v

with open(binding_factor, 'w') as f:
    for k, v in mast.iteritems():
        for i in v:
            g = genome[k].search(i['start'], i['end'])
            if g:
                print g
                f.write(str(g) + '\n')
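For the new roadblock, the shift can be applied when the interval goes into the tree. A minimal, self-contained sketch using the question's numbers (5000/8000 with a 2000 shift giving 3000/5000, i.e. the window from start - 2000 up to the original start), plus the +-100 tolerance applied at query time; the coordinates here are purely illustrative:

from intervaltree import IntervalTree

FLANK = 2000   # shift from the question
WINDOW = 100   # +/- tolerance when matching file 1 against file 2

tree = IntervalTree()
start, end = 5000, 8000
tree.addi(start - FLANK, start, 'Pbsn')   # indexes [3000, 5000]

# pad the query on both sides by the tolerance
query_start, query_end = 4900, 5100
hits = tree.search(query_start - WINDOW, query_end + WINDOW)
print hits   # set of overlapping Interval objects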

replacing specific lines in a text file using python

First of all, I am pretty new at Python, so bear with me. I am attempting to read from one file, retrieve specific values, and overwrite old values in another file with a similar format. The format is 'text value=xxx' in both files. I have the first half of the program working: I can extract the values I want and have placed them into a dict named 'params'. The part I haven't figured out is how to write a specific value into the target file without it showing up at the end of the file, writing garbage, or writing only half of the file. Here is my source code so far:
import os, os.path, re, fileinput, sys

# set the path to the resource files
#res_files_path = r'C:\Users\n518013\Documents\203-104 WA My MRT Files\CIA Data\pelzer_settings'
tst_res_files_path = r'C:\resource'
# Set path to target files.
#tar_files_path = r'C:\Users\n518013\Documents\203-104 WA My MRT Files\CIA Data\CIA3 Settings-G4'
tst_tar_files_path = r'C:\target'
# test dir.
test_files_path = r'C:\Users\n518013\Documents\MRT Equipment - BY 740-104 WA\CIA - AS\Setting Files\305_70R_22_5 setting files\CIA 1 Standard'

# function 1 to find word index and point to value
def f_index(lst, item):
    ind = lst.index(item)
    val = lst[ind + 3]
    print val
    return val

# function 2 for values only 1 away from search term
def f_index_1(lst, item):
    ind = lst.index(item)
    val = lst[ind + 1]
    return val

# Create file list.
file_lst = os.listdir(tst_res_files_path)

# Traverse the file list and read in dim settings files.
# Set up dict.
params = {}
#print params

for fname in file_lst:
    file_loc = os.path.join(tst_res_files_path, fname)
    with open(file_loc, 'r') as f:
        if re.search('ms\.', fname):
            print fname
            break
        line = f.read()
        word = re.split('\W+', line)
        print word
        for w in word:
            if w == 'Number':
                print w
                params['sectors'] = f_index(word, w)
            elif w == 'Lid':
                params['lid_dly'] = f_index(word, w)
            elif w == 'Head':
                params['rotation_dly'] = f_index(word, w)
            elif w == 'Horizontal':
                tmp = f_index_1(word, w)
                param = int(tmp) + 72
                params['horizontal'] = str(param)
            elif w == 'Vertical':
                tmp = f_index_1(word, w)
                param = int(tmp) - 65
                params['vertical'] = str(param)
            elif w == 'Tilt':
                params['tilt'] = f_index_1(word, w)
            else:
                print 'next...'

print params  # this is just for debugging

file_tar = os.path.join(tst_tar_files_path, fname)
for lines in fileinput.input(file_tar, inplace=True):
    print lines.rstrip()
    if lines.startswith('Number'):
        if lines[-2:-1] != params['sectors']:
            repl = params['sectors']
            lines = lines.replace(lines[-2:-1], repl)
            sys.stdout.write(lines)
    else:
        continue
Sample text files:
[ADMINISTRATIVE SETTINGS]
SettingsType=SingleScan
DimensionCode=
Operator=
Description=rev.1 4sept03
TireDimClass=Crown
TireWidth=400mm
[TEST PARAMETERS]
Number Of Sectors=9
Vacuum=50
[DELAY SETTINGS]
Lid Close Delay=3
Head Rotation Delay=3
[HEAD POSITION]
Horizontal=140
Vertical=460
Tilt=0
[CALIBRATION]
UseConvFactors=0
LengthUnit=0
ConvMMX=1
ConvPixelX=1
CorrFactorX=1
ConvMMY=1
ConvPixelY=1
CorrFactorY=1
end sample txt.
The code I have only writes about half of the file back, and I don't understand why. I am trying to replace the line 'Number Of Sectors=9' with 'Number Of Sectors=8'; if I could get that one replacement to work, the rest could be done with if statements.
Please help! I've spent hours on Google looking for answers, and everything I find gets me close but no cigar!
Thank you all in advance!
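For reference, a minimal sketch of the fileinput approach, with hypothetical target and value names: with inplace=True, whatever the loop prints becomes the file's new contents, so each line must be printed exactly once, modified or not. Printing lines.rstrip() unconditionally and then writing lines a second time is what duplicates and truncates the output above.

import fileinput

target = r'C:\target\settings.txt'   # hypothetical target file
new_sectors = '8'                    # hypothetical replacement value

for line in fileinput.input(target, inplace=True):
    line = line.rstrip('\n')
    if line.startswith('Number Of Sectors='):
        line = 'Number Of Sectors=' + new_sectors
    print line   # printed once per input line; this becomes the file content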
Your file has the '.ini' format. Python supports reading and writing those with the configparser module (spelled ConfigParser in Python 2). You could do this:
# py3: from pathlib import Path
import os.path
import configparser
# py3: IN_PATH = Path(__file__).parent / '../data/sample.ini'
# py3: OUT_PATH = Path(__file__).parent / '../data/sample_out.ini'
HERE = os.path.dirname(__file__)
IN_PATH = os.path.join(HERE, '../data/sample.ini')
OUT_PATH = os.path.join(HERE, '../data/sample_out.ini')
config = configparser.ConfigParser()
# py3: config.read(str(IN_PATH))
config.read(IN_PATH)
print(config['CALIBRATION']['LengthUnit'])
config['CALIBRATION']['LengthUnit'] = '27'
# py3: with OUT_PATH.open('w') as fle:
with open(OUT_PATH, 'w') as fle:
    config.write(fle)
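Since the question targets Python 2.7, here is a rough equivalent with the py2 spelling of the module, assuming the sample file shown above is saved as sample.ini (note that py2's ConfigParser lowercases option names by default):

import ConfigParser  # Python 2 name of the module

config = ConfigParser.ConfigParser()
config.read('sample.ini')

print config.get('TEST PARAMETERS', 'Number Of Sectors')   # -> 9
config.set('TEST PARAMETERS', 'Number Of Sectors', '8')

with open('sample_out.ini', 'w') as fle:
    config.write(fle)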