Python - reading a text file delimited by semicolons and plotting a chart using openpyxl - python-2.7

I have copied the text file into an Excel sheet, splitting the cells on the ; delimiter.
I need to plot a chart from that same file, which I achieved. However, since all the copied values are of type str, the chart plots the wrong points.
Please suggest how to overcome this; the plot should be made from int values.
from datetime import date
from openpyxl import Workbook, load_workbook
from openpyxl.chart import (
    LineChart,
    Reference,
    Series,
)
from openpyxl.chart.axis import DateAxis

excelfile = "C:\Users\lenovo\Desktop\how\openpychart.xlsx"
wb = Workbook()
ws = wb.active
f = open("C:\Users\lenovo\Desktop\sample.txt")
data = []
num = f.readlines()
for line in num:
    line = line.split(";")
    ws.append(line)
f.close()
wb.save(excelfile)
wb.close()
wb = load_workbook(excelfile, data_only=True)
ws = wb.active
c1 = LineChart()
c1.title = "Line Chart"
##c1.style = 13
c1.y_axis.title = 'Size'
c1.x_axis.title = 'Test Number'
data = Reference(ws, min_col=6, min_row=2, max_col=6, max_row=31)
series = Series(data, title='4th average')
c1.append(series)
data = Reference(ws, min_col=7, min_row=2, max_col=7, max_row=31)
series = Series(data, title='Defined Capacity')
c1.append(series)
##c1.add_data(data, titles_from_data=True)
# Style the lines
s1 = c1.series[0]
s1.marker.symbol = "triangle"
s1.marker.graphicalProperties.solidFill = "FF0000" # Marker filling
s1.marker.graphicalProperties.line.solidFill = "FF0000" # Marker outline
s1.graphicalProperties.line.noFill = True
s2 = c1.series[1]
s2.graphicalProperties.line.solidFill = "00AAAA"
s2.graphicalProperties.line.dashStyle = "sysDot"
s2.graphicalProperties.line.width = 100050 # width in EMUs
##s2 = c1.series[2]
##s2.smooth = True # Make the line smooth
ws.add_chart(c1, "A10")
##
##from copy import deepcopy
##stacked = deepcopy(c1)
##stacked.grouping = "stacked"
##stacked.title = "Stacked Line Chart"
##ws.add_chart(stacked, "A27")
##
##percent_stacked = deepcopy(c1)
##percent_stacked.grouping = "percentStacked"
##percent_stacked.title = "Percent Stacked Line Chart"
##ws.add_chart(percent_stacked, "A44")
##
### Chart with date axis
##c2 = LineChart()
##c2.title = "Date Axis"
##c2.style = 12
##c2.y_axis.title = "Size"
##c2.y_axis.crossAx = 500
##c2.x_axis = DateAxis(crossAx=100)
##c2.x_axis.number_format = 'd-mmm'
##c2.x_axis.majorTimeUnit = "days"
##c2.x_axis.title = "Date"
##
##c2.add_data(data, titles_from_data=True)
##dates = Reference(ws, min_col=1, min_row=2, max_row=7)
##c2.set_categories(dates)
##
##ws.add_chart(c2, "A61")
### setup and append the first series
##values = Reference(ws, (1, 1), (9, 1))
##series = Series(values, title="First series of values")
##chart.append(series)
##
### setup and append the second series
##values = Reference(ws, (1, 2), (9, 2))
##series = Series(values, title="Second series of values")
##chart.append(series)
##
##ws.add_chart(chart)
wb.save(excelfile)
wb.close()

I have modified the code below inside the for loop and it worked.
f = open("C:\Users\lenovo\Desktop\sample.txt")
data = []
num = f.readlines()
for line in num:
    line = line.split(";")
    new_line = []
    for x in line:
        if x.isdigit():
            x = int(x)
            new_line.append(x)
        else:
            new_line.append(x)
    ws.append(new_line)
f.close()
wb.save(excelfile)
wb.close()
For each list, for each value, check whether it is a digit; if yes, convert it to an integer and store it in another list.
Using x = map(int, x) didn't work since I have character values too.
I felt the above is much easier than using map(int, x) with try and except.
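For comparison, a per-value try/except version is only slightly longer. A sketch, assuming the same sample.txt path and ws worksheet as above (unlike isdigit(), it would also convert negative numbers):

def to_number(value):
    # fall back to the original string when it is not an integer
    try:
        return int(value)
    except ValueError:
        return value

with open(r"C:\Users\lenovo\Desktop\sample.txt") as f:
    for line in f:
        ws.append([to_number(x) for x in line.strip().split(";")])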
Thanks
Basha

Related

How do I get the word-embedding matrix from ft_word2vec (sparklyr-package)?

I have another question in the word2vec universe.
I am using the 'sparklyr' package. Within this package I call the ft_word2vec() function. I have some trouble understanding the output:
For each number of sentences/paragraphs I provide to the ft_word2vec() function, I always get the same number of vectors, even if I have more sentences/paragraphs than words. To me, that looks like I am getting the paragraph vectors. Maybe a code example helps to understand my problem?
# add your spark_connection here as 'spark_connection = '
# create example data frame
FK_data = data.frame(sentences = c("This is my first sentence",
                                   "It is followed by the second sentence",
                                   "At the end there is the last sentence"))
# move the data to spark
sc_FK_data <- copy_to(spark_connection, FK_data, name = "FK_data", overwrite = TRUE)
# prepare data for ft_word2vec (sentences have to be tokenized [= list of words instead of one string in each row])
sc_FK_data <- ft_tokenizer(sc_FK_data, input_col = "sentences", output_col = "tokens")
# split data into training and test sets
partitions <- sc_FK_data %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 123456)
FK_train <- partitions$training
FK_test <- partitions$test
# given a training data set (FK_train) with a column "tokens" (for each row = a list of strings)
mymodel = ft_word2vec(
  FK_train,
  input_col = "tokens",
  output_col = "word2vec",
  vector_size = 15,
  min_count = 1,
  max_sentence_length = 4444,
  num_partitions = 1,
  step_size = 0.1,
  max_iter = 10,
  seed = 123456,
  uid = random_string("word2vec_"))
# I tried to get the data from spark with:
myemb = mymodel %>% sparklyr::collect()
Has somebody had similar experiences? Can someone explain what exactly the ft_word2vec() function returns? Do you have an example on how to get the word embedding vectors with this function? Or does the returned column indeed contain the paragraph vectors?
My colleague found a solution! If you know how to do it, the instructions really begin to make sense!
# add your spark_connection here as 'spark_connection = '
# create example data frame
FK_data = data.frame(sentences = c("This is my first sentence",
                                   "It is followed by the second sentence",
                                   "At the end there is the last sentence"))
# move the data to spark
sc_FK_data <- copy_to(spark_connection, FK_data, name = "FK_data", overwrite = TRUE)
# prepare data for ft_word2vec (sentences have to be tokenized [= list of words instead of one string in each row])
sc_FK_data <- ft_tokenizer(sc_FK_data, input_col = "sentences", output_col = "tokens")
# split data into training and test sets
partitions <- sc_FK_data %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 123456)
FK_train <- partitions$training
FK_test <- partitions$test
# CHANGES FOLLOW HERE:
# We have to use the spark connection instead of the data. For me this was the confusing part, since I thought no data -> no model.
# Maybe we can think of this step as an initialization.
mymodel = ft_word2vec(
  spark_connection,
  input_col = "tokens",
  output_col = "word2vec",
  vector_size = 15,
  min_count = 1,
  max_sentence_length = 4444,
  num_partitions = 1,
  step_size = 0.1,
  max_iter = 10,
  seed = 123456,
  uid = random_string("word2vec_"))
# now that we have our model initialized, we add the word-embeddings by fitting it on the tokenized data
w2v_model = ml_fit(mymodel, sc_FK_data)
# now we can collect the embedding vectors
emb = w2v_model$vectors %>% collect()

Input query for python code

So I have created this code for my research, but I want to use it for plenty of data files, and I do not want to do that manually, i.e. retype some lines in my code for each desired file. How can I use the input command in Python (I work with Python 2.7 on Windows OS) to make this faster, so I can just type the name of the desired data file? My code so far:
import iodata as io
import matplotlib.pyplot as plt
import numpy as np
import time
from scipy.signal import welch
from scipy import signal
testInstance = io.InputConverter()
start = time.time()
conversionError = io.ConversionError()
#data = testInstance.convert(r"S:\Doktorat\Python\", 1", conversionError)
data = testInstance.convert(r"/Users/PycharmProjects/Hugo/20160401", "201604010000", conversionError)
end = time.time()
print("time elapsed " + str(end - start))
if(conversionError.conversionSucces):
    print("Conversion succesful")
if(conversionError.conversionSucces == False):
    print("Conversion failed: " + conversionError.conversionErrorLog)
print "Done!"
# Create a new subplot for two cannals 1 & 3
a = np.amin(data.data)
Bx = data.data[0,]
By = data.data[1,]
dt = float(300)/266350
Fs = 1/dt
t = np.arange(0,300,dt*1e3)
N = len(Bx)
M = len(By)
time = np.linspace(0,300,N)
time2 = np.linspace(0,300,M)
filename = 'C:/Users/PycharmProjects/Hugo/20160401/201604010000.dat'
d = open(filename,'rb')
degree = u"\u00b0"
headersize = 64
header = d.read(headersize)
ax1 = plt.subplot(211)
ax1.set_title(header[:16] + ', ' +                                                  # station name
              'Canals: ' + header[32:33] + ' and ' + header[34:35] + ', '           # canals
              + 'Temp' + header[38:43] + degree + 'C'                               # temperature
              + ', ' + 'Time:' + header[26:32] + ', ' + 'Date' + ' ' + header[16:26])  # date
plt.ylabel('Pico Tesle [pT]')
plt.xlabel('Time [ms]')
plt.grid()
plt.plot(time[51:-14], Bx[51:-14], label='Canal 1', color='r', linewidth=0.1, linestyle="-")
plt.plot(time2[1:-14], By[1:-14], label='Canal 3', color='b', linewidth=0.1, linestyle="-")
plt.legend(loc='upper right', frameon=False, )
# Create a new subplot for FFT
plt.subplot(212)
plt.title('Fast Fourier Transform')
plt.ylabel('Power [a.u.]')
plt.xlabel('Frequency Hz')
xaxis2 = np.arange(0,470,10)
plt.xticks(xaxis2)
fft1 = (Bx[51:-14])
fft2 = (By[1:-14])
plt.grid()
# Loop for FFT data
for dataset in [fft1]:
    dataset = np.asarray(dataset)
    freqs, psd = welch(dataset, fs=266336/300, window='hamming', nperseg=8192)
    plt.semilogy(freqs, psd/dataset.size**0, color='r')
for dataset2 in [fft2]:
    dataset2 = np.asarray(dataset2)
    freqs2, psd2 = welch(dataset2, fs=266336/300, window='hamming', nperseg=8192)
    plt.semilogy(freqs2, psd2/dataset2.size**0, color='b')
plt.show()
As you can see, there are some places where it would be better to take input, so that when I run the code I can type the file names into Python instead of creating a separate Python file with the specific info hard-coded.
By the way, I use PyCharm for my Python work.
If all you are trying to do is get rid of the hardcoded pathname, you should be able to format your filename string with input variables:
name = raw_input("Name: ")
measurement = raw_input("Measurement: ")
filename = "C:/Users/PycharmProjects/{0}/{1}".format(name, measurement)
see raw_input and string formatting
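Applied to the code in the question, a rough sketch might look like the following (the prompts and folder layout are only illustrative; testInstance and conversionError are the objects created earlier):

# ask once per run instead of editing the script (Python 2.7: raw_input)
folder = raw_input("Data folder (e.g. 20160401): ")
stamp = raw_input("File stamp (e.g. 201604010000): ")

data = testInstance.convert("/Users/PycharmProjects/Hugo/" + folder, stamp, conversionError)

# the .dat file opened later can be built from the same answers
filename = "C:/Users/PycharmProjects/Hugo/{0}/{1}.dat".format(folder, stamp)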

ValueError: Tensor Tensor("Const:0", shape=(), dtype=float32) may not be fed with tf.placeholder

I'm trying to make a speech recognition system with TensorFlow.
The input data is a numpy array of size 50000 x 1.
The output data (mapping data) is a numpy array of size 400 x 1.
Input and mapping data are passed in batches of 2 in a list.
I've used this tutorial to design the neural network. Following is the code snippet:
For RNN:
input_data = tf.placeholder(tf.float32, [batch_size, sound_constants.MAX_ROW_SIZE_IN_DATA, sound_constants.MAX_COLUMN_SIZE_IN_DATA], name="train_input")
target = tf.placeholder(tf.float32, [batch_size, sound_constants.MAX_ROW_SIZE_IN_TXT, sound_constants.MAX_COLUMN_SIZE_IN_TXT], name="train_output")
fwd_cell = tf.nn.rnn_cell.BasicLSTMCell(num_hidden, state_is_tuple=True, forget_bias=1.0)
# creating one backward cell
bkwd_cell = tf.nn.rnn_cell.BasicLSTMCell(num_hidden, state_is_tuple=True, forget_bias=1.0)
# creating bidirectional RNN
val, _, _ = tf.nn.static_bidirectional_rnn(fwd_cell, bkwd_cell, tf.unstack(input_data), dtype=tf.float32)
For feeding data:
feed = {g['input_data'] : trb[0], g['target'] : trb[1], g['dropout'] : 0.6}
accuracy_, _ = sess.run([g['accuracy'], g['ts']], feed_dict=feed)
accuracy += accuracy_
When I ran the code, I got this error:
Traceback (most recent call last):
  File "/home/wolborg/PycharmProjects/speech-to-text-rnn/src/rnn_train_1.py", line 205, in <module>
    tr_losses, te_losses = train_network(g)
  File "/home/wolborg/PycharmProjects/speech-to-text-rnn/src/rnn_train_1.py", line 177, in train_network
    accuracy_, _ = sess.run([g['accuracy'], g['ts']], feed_dict=feed)
  File "/home/wolborg/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/wolborg/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1102, in _run
    raise ValueError('Tensor %s may not be fed.' % subfeed_t)
ValueError: Tensor Tensor("Const:0", shape=(), dtype=float32) may not be fed.

Process finished with exit code 1
Earlier I was facing this issue with tf.sparse_placeholder; after some browsing I changed the input type to tf.placeholder and made the related changes. Now I'm clueless about where I'm making the error.
Please suggest how I should feed the data.
Entire code:
import tensorflow as tf
# for taking MFCC and label input
import numpy as np
import rnn_input_data_1
import sound_constants

# input constants
# Training Parameters
num_input = 10  # mfcc data input
training_data_size = 8  # determines number of files in training and testing module
testing_data_size = num_input - training_data_size

# Network Parameters
learning_rate = 0.0001  # for large training set, it can be set 0.001
num_hidden = 200  # number of hidden layers
num_classes = 28  # total alphabet classes (a-z) + extra symbols (', ' ')
epoch = 1  # number of iterations
batch_size = 2  # number of batches

mfcc_coeffs, text_data = rnn_input_data_1.mfcc_and_text_encoding()

class DataGenerator:
    def __init__(self, data_size):
        self.ptr = 0
        self.epochs = 0
        self.data_size = data_size

    def next_batch(self):
        self.ptr += batch_size
        if self.ptr > self.data_size:
            self.epochs += 1
            self.ptr = 0
        return mfcc_coeffs[self.ptr-batch_size : self.ptr], text_data[self.ptr-batch_size : self.ptr]

def reset_graph():
    if 'sess' in globals() and sess:
        sess.close()
    tf.reset_default_graph()

def struct_network():
    print ('Inside struct network !!')
    reset_graph()
    input_data = tf.placeholder(tf.float32, [batch_size, sound_constants.MAX_ROW_SIZE_IN_DATA, sound_constants.MAX_COLUMN_SIZE_IN_DATA], name="train_input")
    target = tf.placeholder(tf.float32, [batch_size, sound_constants.MAX_ROW_SIZE_IN_TXT, sound_constants.MAX_COLUMN_SIZE_IN_TXT], name="train_output")
    keep_prob = tf.constant(1.0)
    fwd_cell = tf.nn.rnn_cell.BasicLSTMCell(num_hidden, state_is_tuple=True, forget_bias=1.0)
    # creating one backward cell
    bkwd_cell = tf.nn.rnn_cell.BasicLSTMCell(num_hidden, state_is_tuple=True, forget_bias=1.0)
    # creating bidirectional RNN
    val, _, _ = tf.nn.static_bidirectional_rnn(fwd_cell, bkwd_cell, tf.unstack(input_data), dtype=tf.float32)
    # adding dropouts
    val = tf.nn.dropout(val, keep_prob)
    val = tf.transpose(val, [1, 0, 2])
    last = tf.gather(val, int(val.get_shape()[0]) - 1)
    # creating bidirectional RNN
    print ('BiRNN created !!')
    print ('Last Size: ', last.get_shape())
    weight = tf.Variable(tf.truncated_normal([num_hidden * 2, sound_constants.MAX_ROW_SIZE_IN_TXT]))
    bias = tf.Variable(tf.constant(0.1, shape=[sound_constants.MAX_ROW_SIZE_IN_TXT]))
    # mapping to 28 output classes
    logits = tf.matmul(last, weight) + bias
    prediction = tf.nn.softmax(logits)
    prediction = tf.reshape(prediction, shape = [batch_size, sound_constants.MAX_ROW_SIZE_IN_TXT, sound_constants.MAX_COLUMN_SIZE_IN_TXT])
    # getting probability distribution
    mat1 = tf.cast(tf.argmax(prediction, 1), tf.float32)
    correct = tf.equal(prediction, target)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    logits = tf.reshape(logits, shape=[batch_size, sound_constants.MAX_ROW_SIZE_IN_TXT, sound_constants.MAX_COLUMN_SIZE_IN_TXT])
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=target))
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)

    # returning components as dictionary elements
    return {'input_data' : input_data,
            'target' : target,
            'dropout': keep_prob,
            'loss': loss,
            'ts': train_step,
            'preds': prediction,
            'accuracy': accuracy
            }

def train_network(graph):
    # initialize tensorflow session and all variables
    # tf_gpu_config = tf.ConfigProto(allow_soft_placement = True, log_device_placement = True)
    # tf_gpu_config.gpu_options.allow_growth = True
    # with tf.Session(config = tf_gpu_config) as sess:
    with tf.Session() as sess:
        train_instance = DataGenerator(training_data_size)
        test_instance = DataGenerator(testing_data_size)
        print ('Training data size: ', train_instance.data_size)
        print ('Testing data size: ', test_instance.data_size)
        sess.run(tf.global_variables_initializer())
        print ('Starting session...')
        step, accuracy = 0, 0
        tr_losses, te_losses = [], []
        current_epoch = 0
        while current_epoch < epoch:
            step += 1
            trb = train_instance.next_batch()
            feed = {g['input_data'] : trb[0], g['target'] : trb[1], g['dropout'] : 0.6}
            accuracy_, _ = sess.run([g['accuracy'], g['ts']], feed_dict=feed)
            accuracy += accuracy_
            if train_instance.epochs > current_epoch:
                current_epoch += 1
                tr_losses.append(accuracy / step)
                step, accuracy = 0, 0
                # eval test set
                te_epoch = test_instance.epochs
                while test_instance.epochs == te_epoch:
                    step += 1
                    print ('Testing round ', step)
                    trc = test_instance.next_batch()
                    feed = {g['input_data']: trc[0], g['target']: trc[1]}
                    accuracy_ = sess.run([g['accuracy']], feed_dict=feed)[0]
                    accuracy += accuracy_
                te_losses.append(accuracy / step)
                step, accuracy = 0, 0
                print("Accuracy after epoch", current_epoch, " - tr:", tr_losses[-1], "- te:", te_losses[-1])
    return tr_losses, te_losses

g = struct_network()
tr_losses, te_losses = train_network(g)
You defined keep_prob as a tf.constant, but then tried to feed a value into it. Replace keep_prob = tf.constant(1.0) with keep_prob = tf.placeholder(tf.float32, []) or keep_prob = tf.placeholder_with_default(1.0, []).
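For illustration, a minimal standalone sketch of the placeholder_with_default variant (TensorFlow 1.x API, toy tensors rather than the question's graph): the training step feeds 0.6, while the evaluation step simply omits the key and gets the default 1.0.

import tensorflow as tf

# keep_prob can be fed, but falls back to 1.0 when absent from feed_dict
keep_prob = tf.placeholder_with_default(1.0, shape=[], name="keep_prob")
x = tf.placeholder(tf.float32, shape=[None, 4], name="x")
dropped = tf.nn.dropout(x, keep_prob)

with tf.Session() as sess:
    batch = [[1.0, 2.0, 3.0, 4.0]]
    train_out = sess.run(dropped, feed_dict={x: batch, keep_prob: 0.6})  # training: override the default
    eval_out = sess.run(dropped, feed_dict={x: batch})                   # evaluation: default 1.0 is used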

Average Multiple Runs (5 Runs) of The Sample & Graph it

I have five CSV files, each containing two columns corresponding to the Diameter and Intensity of a sample. Each file corresponds to a single run of the sample, and the values in those columns might differ slightly between runs, but they will be close to each other.
Sample rows from one of the CSV files:
Diameter,Intensity
3.00e+1,0.00
3.19e+1,0.00
3.39e+1,0.00
3.60e+1,0.00
3.83e+1,0.00
4.08e+1,7.01
My goal is to use Python to read in those five CSV files and draw a single scatter/line plot that averages the five runs into one averaged curve. How can that be done?
Here is my attempt using Plotly. The problem with the following code is that the curve does not draw correctly when averaging the intensities of the five runs:
files = defaultdict(list)
file_start = raw_input("File Starts: ")

def read_data(file):
    with open(file, 'Ur') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            if "d (nm)" in row:
                continue
            else:
                files[file].append(row)

os.chdir(".")
for file in glob.glob("*.csv"):
    if file_start in file:
        read_data(file)

for key, val in files.iteritems():
    # R1 is run 1 from the file being read
    if "R1" in key:
        for v in val:
            l1.append(float(v[0]))
            l2.append(float(v[1]))
    elif "R2" in key:
        for v in val:
            l3.append(float(v[0]))
            l4.append(float(v[1]))
    elif "R3" in key:
        for v in val:
            l5.append(float(v[0]))
            l6.append(float(v[1]))
    elif "R4" in key:
        for v in val:
            l7.append(float(v[0]))
            l8.append(float(v[1]))
    elif "R5" in key:
        for v in val:
            l9.append(float(v[0]))
            l10.append(float(v[1]))

sum_val_0 = 0; avg_l_0 = []; sum_val_1 = 0; avg_l_1 = []
for val in zip(l1, l3, l5, l7, l9, l2, l4, l6, l8, l10):
    sum_val_0 = sum_val_0 + float(val[0]) + float(val[1]) + float(val[2]) + float(val[3]) + float(val[4])
    avg = sum_val_0 / len(val)
    avg_l_0.append(avg)
    sum_val_1 = sum_val_1 + float(val[5]) + float(val[6]) + float(val[7]) + float(val[8]) + float(val[9])
    avg_1 = sum_val_1 / 10
    avg_l_1.append(avg_1)

trace0 = go.Scatter(
    x = avg_l_0,
    y = avg_l_1,
    line = dict(
        color = ('rgb(205, 12, 24)'),
        width = 2)
)
data = [trace0]
layout = go.Layout(
    dict(showlegend=True,
         xaxis = dict(title = 'Diameter', range = [35, 1500]),
         yaxis = dict(title = 'Intensity', range = [-20, 120], showline=True,),
    )
)
fig = dict(data=data, layout=layout)
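A likely reason the averaged curve looks wrong is that sum_val_0 and sum_val_1 are never reset inside the loop, so every point accumulates all previous rows, and the diameter average divides by len(val), which is 10, not 5. A minimal sketch of per-point averaging, assuming the l1...l10 lists are filled exactly as above:

avg_l_0, avg_l_1 = [], []
for vals in zip(l1, l3, l5, l7, l9, l2, l4, l6, l8, l10):
    diam = vals[:5]    # the five diameter values for this row
    inten = vals[5:]   # the five intensity values for this row
    avg_l_0.append(sum(diam) / float(len(diam)))
    avg_l_1.append(sum(inten) / float(len(inten)))

These averaged lists can then be passed to go.Scatter as x and y in place of the original ones.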

Printing Results from Loops

I currently have a piece of code that works in two segments. The first segment opens an existing text file from a specific path on my local drive and then arranges it, based on certain indices, into a list of sub-lists. In the second segment I take the sub-lists I have created and group them on a similar index to simplify them (this starts at def merge_subs). I am getting no error, but I am not receiving a result when I try to print the variable answer. Am I not correctly looping over the original list of sub-lists? Ultimately I would like to have a variable that contains the final product from these loops so that I may write its contents to a new text file. Here is the code I am working with:
from itertools import groupby, chain
from operator import itemgetter

with open("somepathname") as g:
    # reads text from lines and turns them into a list of sub-lists
    lines = g.readlines()
    for line in lines:
        matrix = line.split()
        JD = matrix[2]
        minTime = matrix[5]
        maxTime = matrix[7]
        newLists = [JD, minTime, maxTime]
        L = newLists

def merge_subs(L):
    dates = {}
    for sub in L:
        date = sub[0]
        if date not in dates:
            dates[date] = []
        dates[date].extend(sub[1:])
    answer = []
    for date in sorted(dates):
        answer.append([date] + dates[date])
New code:
def openfile(self):
    filename = askopenfilename(parent=root)
    self.lines = open(filename)

def simplify(self):
    g = self.lines.readlines()
    for line in g:
        matrix = line.split()
        JD = matrix[2]
        minTime = matrix[5]
        maxTime = matrix[7]
        self.newLists = [JD, minTime, maxTime]
        print(self.newLists)
    dates = {}
    for sub in self.newLists:
        date = sub[0]
        if date not in dates:
            dates[date] = []
        dates[date].extend(sub[1:])
    answer = []
    for date in sorted(dates):
        print(answer.append([date] + dates[date]))
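A note that may explain the missing output: list.append() returns None, so print(answer.append(...)) always prints None, and in the first version merge_subs() builds answer but never returns it (and is never called). Also, L is reassigned inside the reading loop, so it only ever holds the last line; collecting every [JD, minTime, maxTime] into a list first is probably what was intended. A minimal sketch along those lines (the output file name is only a placeholder):

def merge_subs(L):
    dates = {}
    for sub in L:
        # group the time values under their date key
        dates.setdefault(sub[0], []).extend(sub[1:])
    return [[date] + dates[date] for date in sorted(dates)]

L = []
with open("somepathname") as g:
    for line in g:
        matrix = line.split()
        L.append([matrix[2], matrix[5], matrix[7]])

answer = merge_subs(L)
print(answer)

# write the merged rows to a new text file
with open("merged_output.txt", "w") as out:
    for row in answer:
        out.write(" ".join(row) + "\n")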