Patsy's dmatrices cannot read my formula - python-2.7

I have a function LogReg, which is as follows: (using justmarkham's code as inspiration)
def LogReg(self):
    formulA = "class ~"
    print self.frame  # dataframe used
    print self.columnNames[:-1]
    for a in self.columnNames[:-1]:
        formulA += " {0} +".format(a)
    formula = formulA[:-2]  # strip the trailing " +" left by the loop
    print "formula = " + formula
    Y, X = dmatrices(formula, self.frame, return_type="dataframe")
    Y = np.ravel(Y)  # flatten Y to a 1D array
    model = LogisticRegression()  # from sklearn.linear_model
    model = model.fit(X, Y)
    print model.score(X, Y)
with the following outcome:
          a0  a1  a2  a3  class
picture1   1   2   3  67      1
picture2   6   7  45  61      3
picture3   8   7   6   5      2
picture4   1   2   4   3      0
['a0', 'a1', 'a2', 'a3']
formula = class ~ a0 + a1 + a2 + a3
Traceback (most recent call last):
File "classification.py", line 80, in <module>
c.LogReg()
File "classification.py", line 61, in LogReg
Y,X = dmatrices(formula, self.frame, return_type="dataframe")
File "/<path>/python2.7/site-packages/patsy/highlevel.py", line 297, in dmatrices
NA_action, return_type)
File "/<path>/python2.7/site-packages/patsy/highlevel.py", line 152, in _do_highlevel_design
NA_action)
File "/<path>/python2.7/site-packages/patsy/highlevel.py", line 57, in _try_incr_builders
NA_action)
File "/<path>/python2.7/site-packages/patsy/build.py", line 660, in design_matrix_builders
NA_action)
File "/<path>/python2.7/site-packages/patsy/build.py", line 424, in _examine_factor_types
value = factor.eval(factor_states[factor], data)
File "/<path>/python2.7/site-packages/patsy/eval.py", line 485, in eval
return self._eval(memorize_state["eval_code"], memorize_state, data)
File "/<path>/python2.7/site-packages/patsy/eval.py", line 468, in _eval
code, inner_namespace=inner_namespace)
File "/<path>/python2.7/site-packages/patsy/compat.py", line 117, in call_and_wrap_exc
return f(*args, **kwargs)
File "/<path>/python2.7/site-packages/patsy/eval.py", line 125, in eval
code = compile(expr, source_name, "eval", self.flags, False)
File "<string>", line 1
class
^
SyntaxError: unexpected EOF while parsing
I do not see what goes wrong here: to my knowledge the string does not contain an EOF character, and the Python code does not seem erroneous. So the question is: where does it go wrong, and preferably, how do I fix it?
P.S.: The software used are all the most recent stable packages as available on 04/09/2015.

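Well, that was quick. While writing the question, the syntax highlighting in the code showed me that 'class' is a reserved keyword in Python and cannot be used as a variable name. As the traceback shows, patsy compiles each term of the formula as a Python expression, so the bare word class fails to parse. Nano doesn't give those colors, leaving me blind.
Lesson learnt: Kids, don't use class as a variable name.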
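For anyone who cannot rename the column, here is a minimal sketch of building the formula string so that a reserved word is quoted instead of breaking the parse. `build_formula` is a hypothetical helper, not part of patsy. Wrapping the name in `Q("...")` assumes patsy's `Q()` quoting helper is available inside formulas, which the patsy documentation describes. The sketch only builds the string and does not call patsy itself.

```python
import keyword


def build_formula(target, predictors):
    """Build a patsy-style formula string, quoting names Python cannot parse.

    Hypothetical helper: reserved keywords (such as "class") and names that
    are not valid identifiers are wrapped in Q("..."), patsy's quoting helper.
    """
    def quote(name):
        if keyword.iskeyword(name) or not name.isidentifier():
            return 'Q("{0}")'.format(name)
        return name

    # " + ".join avoids having to trim a trailing " +" afterwards.
    right_side = " + ".join(quote(p) for p in predictors)
    return "{0} ~ {1}".format(quote(target), right_side)


print(build_formula("class", ["a0", "a1", "a2", "a3"]))
# Q("class") ~ a0 + a1 + a2 + a3
```

Renaming the column to something like "label" before building the formula is the simpler alternative and avoids the quoting altogether.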

Related

Using regex to extract two elements from txt file and rename (python)

I'm trying to rename a bunch of payslip txt files in Python using regex. The elements that I want to use for this are personnummer (social security number) and datum (date). Personnummer is formatted like this \d\d\d\d\d\d-\d\d\d\d and works fine by itself using the code below.
But when I try to add datum as well as personnummer, which is formatted like this GFROM:\d\d\d\d\d\d\d\d (I only want the numbers, not the GFROM part), I run into a syntax error.
Do you have any suggestions? I've looked through the previous posts but haven't really found anything there.
Many thanks in advance.
/Andrew
import os
import re
mydir = 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov'
personnummer = "(\d\d\d\d\d\d\-\d\d\d\d)"
datum = "(GFROM:(\d\d\d\d\d\d\d\d))"
for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s = re.search(personnummer, txt)
    t = re.search(datum, txt)
    name = '19' + s.group() + ' ' + '20' + t.group() + ' Matrikelkort'+ '.txt'
    newpath = os.path.join(mydir, name)
    os.rename(archpath, newpath)
**The input files look like this;**
DATUM: 010122 KUND:20290
XXX KOMMUN SIDA: 23 70677
PERSONS NAME UTB-KOD ANS.DAT: 010206-3008
BOK/ G T ARBETS- ARB ARB L L P B BRUT L FAST
GÄLLER GÄLLER AVG LÖP AV CAK/ BEFATTNINGS R Y ANST TIDS TID TID P G L L AVDR K BLPP BELOPP LÖNE UPP DEL
FR O M T O M KOD FÖR DB NR TAL BSK -BENÄMNING P P FORM VILLKOR % HEL L R G G FROM L FROM FIP*A lÖN TIML OMF PEN
----------------------------------------------------------------------------------------------------------------------------------------
760701 790630 110 83 20 5070LOK HEMSAMARIT 5 1 4 10004000 Ö 7607 000000 800 000000
790701 800108 970 76 21 5017ANA-T HEMSAMARIT 5T1 4 00004000 K 077907 000000000000 000000
KUNDNR:20290 SIDA: 023 70677 GFROM:19760701 GTOM:19800108 PERSONS NAME 010206-3008
000001L 2 000001010122 33399CMT011MATRIKELKORT Matrikelkort 000001CMZ029050330-7118 01-01-22 CMZ02901
120290
**The errors I got**
runfile('C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py', wdir='C:/Users/atutt-wi/Desktop/USB')
Traceback (most recent call last):
File "<ipython-input-21-f7cd01adb9a3>", line 1, in <module>
runfile('C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py', wdir='C:/Users/atutt-wi/Desktop/USB')
File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py",
line 827, in runfile
execfile(filename, namespace)
File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py",
line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py", line 24, in <module>
os.rename(archpath, newpath)
OSError: [WinError 123] Incorrect syntax for file name,
directory name or volume label: 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov\\File17.txt' ->
'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov\\010206-3008 20GFROM:19760701 Matrikelkort.txt'
**Update: When I removed the ':' from GFROM, I got the following error**
File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\atutt-wi\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/atutt-wi/Desktop/USB/regex personnummer och datum matrikelkort tool.py", line 22, in <module>
name = '19' + s.group() + ' ' + '20' + t.group() + ' Matrikelkort'+ '.txt'
AttributeError: 'NoneType' object has no attribute 'group'
Here is a snippet you could try:
import os
import re
mydir = 'C:/Users/atutt-wi/Desktop/USB/Matrikelkort/matrikelkort prov'  # same folder as in the question
rx_num = re.compile(r"\s(\d{6}-\d{4})\s", re.M)
rx_dat = re.compile(r"GFROM:(\d{8})\s", re.M)
for arch in os.listdir(mydir):
    archpath = os.path.join(mydir, arch)
    with open(archpath) as f:
        txt = f.read()
    s_match = rx_num.search(txt)
    # group(1) returns only the captured digits, without the surrounding whitespace
    s = s_match.group(1) if s_match is not None else "[Missing]"
    t_match = rx_dat.search(txt)
    # group(1) also drops the "GFROM:" prefix, whose ':' is not allowed in Windows file names
    t = t_match.group(1) if t_match is not None else "[Missing]"
    name = '19' + s + ' ' + '20' + t + ' Matrikelkort'+ '.txt'
    newpath = os.path.join(mydir, name)
    os.rename(archpath, newpath)
The use of compile is optional, but I find it clearer. I also added re.M, the flag for 'Multiline'; it only changes how ^ and $ match, so it has no effect on these patterns. The \s before and after the groups ensures that a string like 'abd123456-7890def' does not match, and group(1) returns just the captured part, without that whitespace or the GFROM: prefix. Also, keep in mind that you will only get the first match with this code. If you want every match, try using findall instead.
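To check the patterns without touching any files, the extraction can be tried on a string first. The following is only a sketch: `build_name` and `INVALID_CHARS` are hypothetical names, the sample text is cut down from the input shown in the question, and the '20' prefix before the date is dropped on the assumption that the GFROM value already carries a four-digit year.

```python
import re

# Capture groups keep only the digits; the "GFROM:" label stays out of the name.
RX_NUM = re.compile(r"\s(\d{6}-\d{4})\s")
RX_DAT = re.compile(r"GFROM:(\d{8})\s")

# Characters that Windows does not allow in file names.
INVALID_CHARS = set('<>:"/\\|?*')


def build_name(txt):
    """Return the new file name, or None if a field is missing or unusable."""
    num = RX_NUM.search(txt)
    dat = RX_DAT.search(txt)
    if num is None or dat is None:
        return None
    name = "19{0} {1} Matrikelkort.txt".format(num.group(1), dat.group(1))
    # Guard against characters that would trigger WinError 123 on rename.
    if INVALID_CHARS & set(name):
        return None
    return name


sample = (
    "PERSONS NAME UTB-KOD ANS.DAT: 010206-3008\n"
    "KUNDNR:20290 SIDA: 023 70677 GFROM:19760701 GTOM:19800108 "
    "PERSONS NAME 010206-3008\n"
)
print(build_name(sample))  # 19010206-3008 19760701 Matrikelkort.txt
```

Returning None instead of a "[Missing]" placeholder lets the calling loop skip files that lack either field rather than renaming them to a misleading name.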

Invalid literal for float in k nearest neighbor

I am having the hardest time figuring out why I am getting this error. I have searched a lot but have been unable to find a solution.
import numpy as np
import warnings
from collections import Counter
import pandas as pd
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
df = pd.read_csv("data.txt")
df.replace('?', -99999, inplace=True)
df.drop(['id'], 1, inplace=True)
full_data = df.astype(float).values.tolist()
print(full_data)
After running it, I get this error:
Traceback (most recent call last):
File "E:\Jazab\Machine Learning\Lec18(Testing K Neatest Nerighbors
Classifier)\Lec18(Testing K Neatest Nerighbors
Classifier)\Lec18_Testing_K_Neatest_Nerighbors_Classifier_.py", line 25, in
<module>
full_data = df.astype(float).values.tolist()
File "C:\Python27\lib\site-packages\pandas\util\_decorators.py", line 91, in
wrapper
return func(*args, **kwargs)
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 3299, in
astype
**kwargs)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 3224, in
astype
return self.apply('astype', dtype=dtype, **kwargs)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 3091, in
apply
applied = getattr(b, f)(**kwargs)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 471, in
astype
**kwargs)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 521, in
_astype
values = astype_nansafe(values.ravel(), dtype, copy=True)
File "C:\Python27\lib\site-packages\pandas\core\dtypes\cast.py", line 636,
in astype_nansafe
return arr.astype(dtype)
ValueError: invalid literal for float(): 3) <-----Reappears in Group 8 as:
Press any key to continue . . .
If I remove astype(float), the program runs fine.
What do I need to do?
There is bad data in the file (the value 3)), so use to_numeric together with apply, because all columns need to be processed.
Non-numeric values are converted to NaN, which fillna then replaces with a scalar, e.g. 0:
full_data = df.apply(pd.to_numeric, errors='coerce').fillna(0).values.tolist()
Sample:
df = pd.DataFrame({'A':[1,2,7], 'B':['3)',4,5]})
print (df)
   A   B
0  1  3)
1  2   4
2  7   5
full_data = df.apply(pd.to_numeric, errors='coerce').fillna(0).values.tolist()
print (full_data)
[[1.0, 0.0], [2.0, 4.0], [7.0, 5.0]]
It looks like you have 3) as an entry in your CSV file, and Pandas is complaining because it can't cast it to a float because of the ).
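Before coercing everything to NaN, it can help to see exactly which cells are unparseable. Below is a small standard-library sketch of that check; `find_bad_cells`, the column names and the sample data are made up for illustration, and '?' is treated as the known missing-value marker from the question.

```python
import csv
import io


def find_bad_cells(csv_text, missing="?"):
    """Return (line_number, column_name, value) for every cell float() rejects."""
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column, value in row.items():
            if value == missing:
                continue  # known placeholder, handled separately
            try:
                float(value)
            except (TypeError, ValueError):
                bad.append((line_number, column, value))
    return bad


sample = "id,a,b\n1,5,3)\n2,?,4\n3,7,x\n"
print(find_bad_cells(sample))  # [(2, 'b', '3)'), (4, 'b', 'x')]
```

With the offending lines known, the source file can be corrected by hand, which avoids silently turning typos such as 3) into 0.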

Import file and convert integers into int

I'm running into the following problem.
I import a file that looks like [[58, 59, 60]].
Printing it gives ['[[58, 59, 60]]'].
Now I only want to add 58 59 60 to a new list. The problem is:
output gives ['[[58, 59, 60]]']
output[0] gives '[[58, 59, 60]]'
output[0][2] gives '5'
output[0][3] gives '8'.
Is there a way of importing the file in a way that it only loads full integers?
with open('file', 'r') as fobj:
    content = int(fobj.read())
You could use json to do the converting from string to actual list when reading from the file.
For example:
import json
def get_output(filename):  # filename is name of text file
    with open(filename) as fn:
        output = json.loads(fn.read())
    return output
If the file 't.txt' contains this: [[58, 59, 60]], then get_output('t.txt') returns the list [[58, 59, 60]]:
output = get_output('t.txt')
type(output) # ===> list
output[0] # ===> [58, 59, 60]
output[0][2] # ===> 60
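Since the question asks for 58, 59 and 60 in a new list, the nested result can be flattened after parsing. This is a minimal sketch that works on the file's text rather than the file itself; `read_flat_ints` is a hypothetical helper name.

```python
import json


def read_flat_ints(text):
    """Parse text such as '[[58, 59, 60]]' and return a flat list of ints."""
    nested = json.loads(text)
    flat = []
    for inner in nested:
        flat.extend(int(value) for value in inner)
    return flat


print(read_flat_ints("[[58, 59, 60]]"))  # [58, 59, 60]
```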

python - Error with Mariana/Theano neural network

I am facing a problem when I start my trainer and I can't figure out the cause.
My input data is of dimension 42 and my output should be one value out of 4.
This is the shape of my training and test set:
Training set: input = (1152, 42) target = (1152,)
Test set: input = (384, 42) target = (384,)
This is the construction of my network:
ls = MS.GradientDescent(lr=0.01)
cost = MC.CrossEntropy()
i = ML.Input(42, name='inp')
h = ML.Hidden(23, activation=MA.Sigmoid(), initializations=[MI.GlorotTanhInit()], name="hid")
o = ML.SoftmaxClassifier(4, learningScenario=ls, costObject=cost, name="out")
mlp = i > h > o
And this is the construction of the datasets, trainers and recorders:
trainData = MDM.RandomSeries(distances = train_set[0], next_state = train_set[1])
trainMaps = MDM.DatasetMapper()
trainMaps.mapInput(i, trainData.distances)
trainMaps.mapOutput(o, trainData.next_state)
testData = MDM.RandomSeries(distances = test_set[0], next_state = test_set[1])
testMaps = MDM.DatasetMapper()
testMaps.mapInput(i, testData.distances)
testMaps.mapOutput(o, testData.next_state)
earlyStop = MSTOP.GeometricEarlyStopping(testMaps, patience=100, patienceIncreaseFactor=1.1, significantImprovement=0.00001, outputFunction="score", outputLayer=o)
epochWall = MSTOP.EpochWall(1000)
trainer = MT.DefaultTrainer(
    trainMaps=trainMaps,
    testMaps=testMaps,
    validationMaps=None,
    stopCriteria=[earlyStop, epochWall],
    testFunctionName="testAndAccuracy",
    trainMiniBatchSize=MT.DefaultTrainer.ALL_SET,
    saveIfMurdered=False
)
recorder = MREC.GGPlot2("MLP", whenToSave = [MREC.SaveMin("test", o.name, "score")], printRate=1, writeRate=1)
trainer.start("MLP", mlp, recorder = recorder)
But the following error is being produced:
Traceback (most recent call last):
File "nn-mariana.py", line 82, in <module>
trainer.start("MLP", mlp, recorder = recorder)
File "SUPRESSED/Mariana/Mariana/training/trainers.py", line 226, in start
Trainer_ABC.start( self, runName, model, recorder, trainingOrder, moreHyperParameters )
File "SUPRESSED/Mariana/Mariana/training/trainers.py", line 110, in start
return self.run(runName, model, recorder, *args, **kwargs)
File "SUPRESSED/Mariana/Mariana/training/trainers.py", line 410, in run
outputLayers
File "SUPRESSED/Mariana/Mariana/training/trainers.py", line 269, in _trainTest
res = modelFct(output, **kwargs)
File "SUPRESSED/Mariana/Mariana/network.py", line 47, in __call__
return self.callTheanoFct(outputLayer, **kwargs)
File "SUPRESSED/Mariana/Mariana/network.py", line 44, in callTheanoFct
return self.outputFcts[ol](**kwargs)
File "SUPRESSED/Mariana/Mariana/wrappers.py", line 110, in __call__
return self.run(**kwargs)
File "SUPRESSED/Mariana/Mariana/wrappers.py", line 102, in run
fres = iter(self.theano_fct(*self.fctInputs.values()))
File "SUPRESSED/Theano/theano/compile/function_module.py", line 871, in __call__
storage_map=getattr(self.fn, 'storage_map', None))
File "SUPRESSED/Theano/theano/gof/link.py", line 314, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "SUPRESSED/Theano/theano/compile/function_module.py", line 859, in __call__
outputs = self.fn()
ValueError: Input dimension mis-match. (input[0].shape[1] = 1152, input[1].shape[1] = 4)
Apply node that caused the error: Elemwise{Composite{((i0 * i1) + (i2 * log(i3)))}}[(0, 1)](InplaceDimShuffle{x,0}.0, LogSoftmax.0, Elemwise{sub,no_inplace}.0, Elemwise{sub,no_inplace}.0)
Toposort index: 18
Inputs types: [TensorType(int32, row), TensorType(float64, matrix), TensorType(int32, row), TensorType(float64, matrix)]
Inputs shapes: [(1, 1152), (1152, 4), (1, 1152), (1152, 4)]
Inputs strides: [(4608, 4), (32, 8), (4608, 4), (32, 8)]
Inputs values: ['not shown', 'not shown', 'not shown', 'not shown']
Outputs clients: [[Sum{axis=[1], acc_dtype=float64}(Elemwise{Composite{((i0 * i1) + (i2 * log(i3)))}}[(0, 1)].0)]]
Versions:
Mariana (1.0.1rc1, /media/guilhermevrs/Data/Documentos/Academico/TCC-code/Mariana)
Theano (0.8.0.dev0, SUPRESSED/Theano)
This code was produced having as base the tutorial code from the mnist example.
Could you please help me to figure out what's going on?
Thank you in advance
I talked directly to the authors of Mariana; the cause and the solution are explained in this issue

Python Error : TypeError: can only concatenate list (not "str") to list

I am currently working on a backup program, and I have run into errors while trying to generate a unique file name for a given destination. I call this function in my code as: getFileUnique(f,pathtofile(backup+"/"+"../trash/")). f is the file path; the rest of the variables are pretty straightforward.
def getFileUnique(path,destination):
    path = path.replace("\\","/")
    p = path.split("/")[-1]
    if not os.path.exists(join(destination,p)):
        return destination+p
    j = p.split(".")
    counter = 0
    print(j)
    while os.path.exists(join(destination,j[:-1]+str(counter)+"."+j[-1])):
        print(counter)
        print("asdfsdf")
        counter += 1
    return destination+j[:-1]+str(counter)+"."+j[-1]
Error:
Traceback (most recent call last):
File "C:\Users\Owner\Google Drive\Programs\Dev Enviroment\python\backup\backup.py", line 76, in <module>
main("files","backup")
File "C:\Users\Owner\Google Drive\Programs\Dev Enviroment\python\backup\backup.py", line 73, in main
updateBackup(oldf,newf,reg,backup)
File "C:\Users\Owner\Google Drive\Programs\Dev Enviroment\python\backup\backup.py", line 65, in updateBackup
k = getFileUnique(f,pathtofile(backup+"/"+"../trash/"))
File "C:\Users\Owner\Google Drive\Programs\Dev Enviroment\python\backup\backup.py", line 41, in getFileUnique
while os.path.exists(join(destination,j[:-1]+str(counter)+"."+j[-1])):
TypeError: can only concatenate list (not "str") to list
return destination + '.'.join(j[:-1]) + str(counter) + "." + j[-1]
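The reason for the error is that j[:-1] is a slice and therefore a list, so it cannot be concatenated with a string; the same expression also appears in the while condition, which is where the traceback points. As a rough sketch of one way to avoid the manual splitting altogether, os.path.splitext separates the extension in one step. `get_file_unique` below is a hypothetical rewrite, not the original function, and it returns a joined path rather than destination+name.

```python
import os
import tempfile


def get_file_unique(path, destination):
    """Return a path inside destination that does not exist yet."""
    filename = os.path.basename(path.replace("\\", "/"))
    stem, ext = os.path.splitext(filename)  # "a.b.txt" -> ("a.b", ".txt")
    candidate = os.path.join(destination, filename)
    counter = 0
    while os.path.exists(candidate):
        # Insert the counter between the stem and the extension.
        candidate = os.path.join(destination, "{0}{1}{2}".format(stem, counter, ext))
        counter += 1
    return candidate


with tempfile.TemporaryDirectory() as trash:
    open(os.path.join(trash, "notes.txt"), "w").close()
    print(os.path.basename(get_file_unique("C:\\files\\notes.txt", trash)))
    # notes0.txt
```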