python: excel: print in columns then start in next row - python-2.7

I have data from BeautifulSoup in the form of:
a
b
c
d
e
f
I want to get them into Excel in the following format:
a b c d e f g
h i j k l m n
o p q r s t u
etc...
when I write them to Excel.
This is the code I have currently:
import openpyxl
from openpyxl import Workbook
import requests
from bs4 import BeautifulSoup
for i in range (1,2):
    url ="https:...."
    response=requests.get(url,verify=False)
    soup=BeautifulSoup(response.text)
    g_data=soup.find_all("td",{"class"})
results=[]
for item in g_data:
    data=(item.text)
    results.append(data)
wb=Workbook()
ws=wb.active
for row, i in enumerate(results):
    column_cell='A'
    ws[column_cell+str(row+2)]=str(i)
wb.save("test.xlsx")
Thanks in advance for your help.
UPDATED code:
for i in range (1,3):
    url="https:....".format(pagenum=i)
    response=requests.get(url)
    soup=BeautifulSoup(response.text)
    g_data=soup.find_all("td","class")
row=[]
wb=Workbook()
ws=wb.active
for idx, item in enumerate(g_data):
    row.append(item.text)
    if not idx % 7:
        ws.append(row)
        row=[]
wb.save("test2.xlsx")
UPDATED RESULTS PICTURE: (image not available)
Finally this works:
import numpy as np
import pandas as pd
for i in range (1,2):
    url="https:... "
    response=requests.get(url)
    soup=BeautifulSoup(response.text)
    g_data=soup.find_all("td",{"class"})
results=[]
for item in g_data:
    results.append(item.text)
# reshape the flat list into 20 rows of 7 columns labelled a..g
df=pd.DataFrame(np.array(results).reshape(20,7),columns=list("abcdefg"))
writer=pd.ExcelWriter('test4.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
writer.save()
The problem with this one is that it's overwriting the previous results. Still a bit more work to do, but progress :)

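It sounds like you want something like this: collect the cells into a row, and append the row to the worksheet once it holds seven items:
row = []
for idx, item in enumerate(g_data):
    row.append(item.text)
    if idx % 7 == 6:  # 7th element of the current row
        ws.append(row)
        row = []
if row:  # write any leftover cells as a final, shorter row
    ws.append(row)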
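
The same grouping can also be done before touching the worksheet, by slicing the flat list into chunks. This is only a sketch: `chunk_rows` and the sample `cells` list are made-up names standing in for the scraped `item.text` values.

```python
def chunk_rows(values, width):
    """Split a flat list into consecutive rows of at most `width` items."""
    return [values[i:i + width] for i in range(0, len(values), width)]

# Stand-in for [item.text for item in g_data]
cells = list("abcdefghijklmnopq")

rows = chunk_rows(cells, 7)
for row in rows:
    print(" ".join(row))
```

Each chunk could then be passed to ws.append(row); the final chunk is simply shorter when the number of cells is not a multiple of seven.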

Related

Curdoc() keeps adding plots, want to replace

I have written a program that creates a graph based on input from a dropdown list. I am using curdoc().add_root() from Bokeh to show my graphs on a server, as show() does not work. However, whenever I choose a new option, it creates a new graph below the current one instead of replacing it. I have tried curdoc().clear(), but that does not work because it deletes the dropdown list as well. How do I replace the graph without deleting the dropdown list? Here's my code:
import csv
import bokeh.plotting
from bokeh.plotting import figure, curdoc
from bokeh.io import output_file, show
from bokeh.layouts import widgetbox
from bokeh.models.widgets import MultiSelect
from bokeh.io import output_file, show, vform
from bokeh.layouts import row
from collections import defaultdict
columns = defaultdict(list) # each value in each column is appended to a list
columns1 = defaultdict(list)
with open('my_data.csv') as f:
    for row in f:
        row = row.strip()# read a row as {column1: value1, column2: value2,...}
        row = row.split(',')
        columns[row[0]].append(row[1])
        columns[row[0]].append(row[2])
        columns[row[0]].append(row[3])
        columns[row[0]].append(row[4])
        columns[row[0]].append(row[5])
with open('my_data1.csv') as f:
    for row in f:
        row = row.strip()# read a row as {column1: value1, column2: value2,...}
        row = row.split(',')
        columns1[row[0]].append(row[1])
        columns1[row[0]].append(row[2])
        columns1[row[0]].append(row[3])
        columns1[row[0]].append(row[4])
        columns1[row[0]].append(row[5])
from bokeh.layouts import widgetbox
from bokeh.models.widgets import Dropdown
from bokeh.plotting import curdoc
menu = [("NY", "New York"), ("California", "California"), ("Ohio", "Ohio")]
dropdown = Dropdown(label="Dropdown button", button_type="warning", menu=menu)
count = 0
#def function_to_call(attr, old, new):
    #print dropdown.value
def myfunc(attr, old, new):
    aaa = dropdown.value
    xy = (columns[aaa])
    xy = [float(i) for i in xy]
    myInt = 10000
    xy = [x / myInt for x in xy]
    print xy
    omega = (columns1[aaa])
    omega = [float(i) for i in omega]
    print omega
    import numpy
    corr123 = numpy.corrcoef(omega,xy)
    print corr123
    a = [2004, 2005, 2006, 2007, 2008]
    p = figure(tools="pan,box_zoom,reset,save", title="Diabetes and Stats",
               x_axis_label='Years', y_axis_label='percents')
    # add some renderers
    per = "Diabetes% " + aaa
    p.line(a, omega, legend=per)
    p.circle(a, omega, legend=per, fill_color="white",line_color="green", size=8)
    p.line(a, xy, legend="Per Capita Income/10000")
    p.circle(a, xy, legend="Per Capita Income/10000", fill_color="red", line_color="red", size=8)
    p.legend.location="top_left"
    #bokeh.plotting.reset_output
    #curdoc().clear()
    curdoc().add_root(p)
    curdoc().add_root(dropdown)
#bokeh.plotting.reset_output
dropdown.on_change('value', myfunc)
curdoc().add_root(dropdown)

Python 2.7 Interactive Visualisation

I'm a new programmer who has been trying for a few days to create a dropdown list whose selection is then used to create a graph.
For my graph, I'm using Bokeh to create an HTML file that plots the per-capita income of a few places as well as their percentage of diabetes. However, I have been trying to get it to work with a dropdown list for 2 weeks now and I simply cannot make it work.
I can create the file, but only when the user enters the input by typing. How can I make this work so that a person selects a place from a dropdown list and the file shows that place's graph as output? Here's my code.
Edit:
I want the selected value from the dropdown list to be passed to the program as the value aaa. I know I should turn the graph-creating part of the program into a function. But how do I get the value of the dropdown list into the variable aaa?
import csv
from bokeh.plotting import figure, curdoc
from bokeh.io import output_file, show
from bokeh.layouts import widgetbox
from bokeh.models.widgets import Dropdown
aaa = raw_input("Write State, not Puerto Rico, Hawaii, or DC: ")
from collections import defaultdict
columns = defaultdict(list) # each value in each column is appended to a list
columns1 = defaultdict(list)
with open('my_data.csv') as f:
    for row in f:
        row = row.strip()# read a row as {column1: value1, column2: value2,...}
        row = row.split(',')
        columns[row[0]].append(row[1])
        columns[row[0]].append(row[2])
        columns[row[0]].append(row[3])
        columns[row[0]].append(row[4])
        columns[row[0]].append(row[5])
xy = (columns[aaa])
xy = [float(i) for i in xy]
myInt = 10000
xy = [x / myInt for x in xy]
print xy
with open('my_data1.csv') as f:
    for row in f:
        row = row.strip()# read a row as {column1: value1, column2: value2,...}
        row = row.split(',')
        columns1[row[0]].append(row[1])
        columns1[row[0]].append(row[2])
        columns1[row[0]].append(row[3])
        columns1[row[0]].append(row[4])
        columns1[row[0]].append(row[5])
omega = (columns1[aaa])
omega = [float(i) for i in omega]
print omega
import numpy
corr123 = numpy.corrcoef(omega,xy)
print corr123
a = [2004, 2005, 2006, 2007, 2008]
output_file("lines.html")
p = figure(tools="pan,box_zoom,reset,save", title="Diabetes and Stats",
x_axis_label='Years', y_axis_label='percents')
# add some renderers
per = "Diabetes% " + aaa
p.line(a, omega, legend=per)
p.circle(a, omega, legend=per, fill_color="white",line_color="green", size=8)
p.line(a, xy, legend="Per Capita Income/10000")
p.circle(a, xy, legend="Per Capita Income/10000", fill_color="red", line_color="red", size=8)
p.legend.location="top_left"
show(p)

Saving data from bs4 and request in a usable manner

I am very new to Python, so please forgive any stupid questions. Thanks in advance.
I have the following data (floats) printed out with bs4 and requests, using the code (print link.find_all("id"), link.text):
X a
X b
X c
Y a
Y b
Y c
Z a
Z b
Z c
Instead, I would like to save it like:
X a b c
Y a b c
Z a b c
and then save it to a text file so that I can use it afterwards. (I don't even know how to save data to a file with Python.)

Welcome to Python. Here's a quick example of creating a dict of lists and writing it to a text file.
from bs4 import BeautifulSoup
# import collections
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="story">Once upon a time there were three little sisters; and their names were
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>;
<a id="link3">Tillie2</a>;
"""
soup = BeautifulSoup(html_doc, 'html.parser')
anchors = soup.find_all('a')
data = {} # collections.OrderedDict() if order matters
for item in anchors:
    key = item.get('id')
    if key not in data.keys():
        # first time this id is seen: start a new list
        data.update({key: [item.text]})
    else:
        # id already seen: add the text to its list
        values = data[key]
        values.append(item.text)
        data.update({key: values})
with open('example.txt', 'w') as f:
    for key, value in data.items():
        line = key + ' ' + ' '.join(value) + '\n'
        f.write(line)
# example.txt
# link1 Elsie
# link3 Tillie Tillie2
# link2 Lacie
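
Applied to the X/Y/Z data from the question, the same dict-of-lists idea might look like the sketch below. The `pairs` list is a hand-written stand-in for the scraped (id, text) values, and OrderedDict is used here only so that the keys keep their first-seen order.

```python
from collections import OrderedDict

# Stand-in for the (id, text) pairs printed in the question
pairs = [("X", "a"), ("X", "b"), ("X", "c"),
         ("Y", "a"), ("Y", "b"), ("Y", "c"),
         ("Z", "a"), ("Z", "b"), ("Z", "c")]

grouped = OrderedDict()
for key, value in pairs:
    # create the list on first sight of a key, then append to it
    grouped.setdefault(key, []).append(value)

# One output line per key: the key followed by all of its values
lines = [key + " " + " ".join(values) for key, values in grouped.items()]
print("\n".join(lines))
```

Each entry of lines could then be written out with f.write(line + '\n') inside a with open(...) block, as in the example above.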

Using Interval tree to find overlapping regions

I have two files
File 1
chr1:4847593-4847993
TGCCGGAGGGGTTTCGATGGAACTCGTAGCA
File 2
Pbsn|X|75083240|75098962|
TTTACTACTTAGTAACACAGTAAGCTAAACAACCAGTGCCATGGTAGGCTTGAGTCAGCT
CTTTCAGGTTCATGTCCATCAAAGATCTACATCTCTCCCCTGGTAGCTTAAGAGAAGCCA
TGGTGGTTGGTATTTCCTACTGCCAGACAGCTGGTTGTTAAGTGAATATTTTGAAGTCC
File 1 has approximately 8000 more lines, each with a different header and the sequence below it.
I would first like to match the start and end coordinates from file 1 against file 2, or check whether they are close to each other, say within ±100. If they are, I want to match the sequence in file 2 and then print the header info from file 2 together with the matched sequence.
My approach is to use an interval tree to store the coordinates (in Python; I am still getting the hang of it).
I tried using re.match, but it is not giving me accurate results.
Any tips would be highly appreciated.
Thanks.
My first try is below.
However, I have now hit another roadblock: for my second file, if my start and end are 5000 and 8000 respectively, I want to change this by subtracting 2000, so that my new start and stop are 3000 and 5000. Here is my code:
from intervaltree import IntervalTree
from collections import defaultdict
binding_factor = 'some.txt'
genome = dict()
with open('file2', 'r') as rows:
    for row in rows:
        #print row
        if row.startswith('>'):
            row = row.strip().split('|')
            chrom_name = row[5]
            start = int(row[3])
            end = int(row[3])  # same field as start in the original; adjust to the end-coordinate field
            # one interval tree per chromosome
            if chrom_name not in genome:
                genome[chrom_name] = IntervalTree()
                # first time we've encountered this chromosome, create tree
            # index the feature
            genome[chrom_name].addi(start,end,row[2])
#for key,value in genome.iteritems():
    #print key, ":", value
mast = defaultdict(list)
with open('file1', 'r') as f:
    for row in f:
        row = row.strip().split()
        row[0] = row[0].replace('chr', '') if row[0].startswith('chr') else row[0]
        row[0] = 'MT' if row[0] == 'M' else row[0]
        #print row[0]
        mast[row[0]].append({
            'start':int(row[1]),
            'end':int(row[2])
        })
#for k,v in mast.iteritems():
    #print k, ":", v
with open(binding_factor, 'w') as f :
    for k,v in mast.iteritems():
        for i in v:
            g = genome[k].search(i['start'],i['end'])
            if g:
                print g
                l = g  # the set of matching intervals found above
                f.write(str(l) + '\n')
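
For a single pair of headers, the "within ±100" check described above can be sketched without an interval tree. This is only an illustration: the helper names (`parse_region`, `parse_gene`, `overlaps`) are made up, and the field layout is assumed from the two sample headers shown in the question, which may not match the real files.

```python
def parse_region(header):
    """Parse 'chr1:4847593-4847993' into ('1', 4847593, 4847993)."""
    chrom, span = header.strip().split(":")
    start, end = span.split("-")
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return chrom, int(start), int(end)

def parse_gene(header):
    """Parse 'Pbsn|X|75083240|75098962|' into ('Pbsn', 'X', 75083240, 75098962)."""
    fields = header.strip().split("|")
    return fields[0], fields[1], int(fields[2]), int(fields[3])

def overlaps(start_a, end_a, start_b, end_b, slack=100):
    """True if the two ranges overlap or lie within `slack` of each other."""
    return start_a - slack <= end_b and start_b <= end_a + slack

chrom, start, end = parse_region("chr1:4847593-4847993")
name, gene_chrom, gene_start, gene_end = parse_gene("Pbsn|X|75083240|75098962|")

# A match needs the same chromosome and (nearly) overlapping coordinates
match = chrom == gene_chrom and overlaps(start, end, gene_start, gene_end)
print("%s %s" % (name, match))
```

With thousands of regions, comparing every pair this way gets slow, which is where one interval tree per chromosome (as in the code above) becomes useful; the slack can then be applied by widening the query range by 100 on each side.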

How to convert a recurring vertical column into rows, then stack them together in Python/Pandas?

I am generating some data vertically at first, but would like to transpose them into row data, then stack them into an array like a Pandas data frame. How do I get a final product of a pandas data frame with 4 columns ('fr', 'en', 'ir', 'ab') and three rows?
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
import re
import random
from random import randint
import csv
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Get csv file into data frame
data = pd.read_csv("FamilySearchData_All_OCT2015_newEthnicity_filledEthnicity_processedName_trimmedCol.csv", header=0, encoding="utf-8")
df = DataFrame(data)
columns = ['fr', 'en', 'ir', 'ab']
classes = ['ethnicity2', 'Ab_group', 'Ab_tribe']
df_count = DataFrame(columns=columns)
for j in classes:
    for i in columns:
        ethnicity_tar = str(i)
        count = 0
        try:
            count = df[str(j)].value_counts()[ethnicity_tar]
        except Exception as e:
            count = ''
        print ethnicity_tar, count
Output:
fr 1554455
en 1196932
ir 941852
ab 95131
fr 1554444
en 16000
ir 940850
ab 9371
fr 1554600
en 2196931
ir 940957
ab 9399
What I would like at the end:
fr en ir ab
1554455 1196932 941852 95131
1554444 16000 940850 9371
1554600 2196931 940957 9399

To implement this, I would create a dictionary (hash) keyed by the column names, each entry holding an array. Then, as I loop through the rows of your output, I'd use the first value to look up the array in the dictionary and append the numerical value to that array.
Once this interim data structure is built, you can loop through the arrays, pulling the same index from each one and printing the values as a row:
# n is the number of rows, e.g. len(hash['fr'])
for i in range(0, n):
    print (str(hash['fr'][i]) + " " +
           str(hash['en'][i]) + " " +
           str(hash['ir'][i]) + " " +
           str(hash['ab'][i]))
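
A minimal sketch of that interim structure, using the numbers from the question's output, could look like this. The `pairs` list is a hand-typed stand-in for the (name, count) values printed by the loop, and the dictionary is called `counts` here rather than `hash` to avoid shadowing the built-in.

```python
columns = ['fr', 'en', 'ir', 'ab']

# Stand-in for the (name, count) pairs printed in the question
pairs = [('fr', 1554455), ('en', 1196932), ('ir', 941852), ('ab', 95131),
         ('fr', 1554444), ('en', 16000), ('ir', 940850), ('ab', 9371),
         ('fr', 1554600), ('en', 2196931), ('ir', 940957), ('ab', 9399)]

# One list per column name, filled in the order the values arrive
counts = dict((name, []) for name in columns)
for name, value in pairs:
    counts[name].append(value)

# Pull the same index out of every list to form each row
n = len(counts['fr'])
rows = [[counts[name][i] for name in columns] for i in range(n)]

print(" ".join(columns))
for row in rows:
    print(" ".join(str(v) for v in row))
```

Since the question asks for a data frame, the same dictionary could also be handed to pandas directly with pd.DataFrame(counts, columns=columns), which should give the four columns and three rows without the manual printing.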