I have a code that opens a file, calculates the median value and writes that value to a separate file. Some of the files maybe empty so I wrote the following loop to check it the file is empty and if so skip it, increment the count and go back to the loop. It does what is expected for the first empty file it finds ,but not the second. The loop is below
t = 15.2
while t>=11.4:
if os.stat(r'C:\Users\Khary\Documents\bin%.2f.txt'%t ).st_size > 0:
print("All good")
F= r'C:\Users\Documents\bin%.2f.txt'%t
print(t)
F= np.loadtxt(F,skiprows=0)
LogMass = F[:,0]
LogRed = F[:,1]
value = np.median(LogMass)
filesave(*find_nearest(LogMass,LogRed))
t -=0.2
else:
t -=0.2
print("empty file")
The output is as follows
All good
15.2
All good
15.0
All good
14.8
All good
14.600000000000001
All good
14.400000000000002
All good
14.200000000000003
All good
14.000000000000004
All good
13.800000000000004
All good
13.600000000000005
All good
13.400000000000006
empty file
All good
13.000000000000007
Traceback (most recent call last):
File "C:\Users\Documents\Codes\Calculate Bin Median.py", line 35, in <module>
LogMass = F[:,0]
IndexError: too many indices
A second issue is that t somehow goes from one decimal place to 15 and the last place seems to incrementing whats with that?
Thanks for any and all help
EDIT
The error IndexError: too many indices only seems to apply to files with only one line example...
12.9982324 0.004321374
If I add a second line I no longer get the error can someone explain why this is? Thanks
EDIT
I tried a little experiment and it seems numpy does not like extracting a column if the array only has one row.
In [8]: x = np.array([1,3])
In [9]: y=x[:,0]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-9-50e27cf81d21> in <module>()
----> 1 y=x[:,0]
IndexError: too many indices
In [10]: y=x[:,0].shape
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-10-e8108cf30e9a> in <module>()
----> 1 y=x[:,0].shape
IndexError: too many indices
In [11]:
You should be using try/except blocks. Something like:
t = 15.2
while t >= 11.4:
F= r'C:\Users\Documents\bin%.2f.txt'%t
try:
F = np.loadtxt(F,skiprows=0)
LogMass = F[:,0]
LogRed = F[:,1]
value = np.median(LogMass)
filesave(*find_nearest(LogMass,LogRed))
except IndexError:
print("bad file: {}".format(F))
else:
print("file worked!")
finally:
t -=0.2
Please refer to the official tutorial for more details about exception handling.
The issue with the last digit is due to how floats work they can not represent base10 numbers exactly. This can lead to fun things like:
In [13]: .3 * 3 - .9
Out[13]: -1.1102230246251565e-16
To deal with the one line file case, add the ndmin parameter to np.loadtxt (review its doc):
np.loadtxt('test.npy',ndmin=2)
# array([[ 1., 2.]])
With the help of a user named ajcr, found the problem was that ndim=2 should have been used in numpy.loadtxt() to insure that the array always 2 has dimensions.
Python uses indentation to define if while and for blocks.
It doesn't look like your if else statement is fully indented from the while.
I usually use a full 'tab' keyboard key to indent instead of 'spaces'
Related
I'm working with Python 2.7 and I stumbled on this issue and I cannot even understand the logic of why the following doesn't work.
I need to fit different dataset and I want to do it dynamically in a for loop.
Here is a MWE that duplicates the problem:
import numpy as np
from scipy.optimize import curve_fit
def func_to_fit(x, a):
return x ** a
x = np.arange(0, 10)
dummy_dataset = {1: x ** 2,
2: x ** 3}
param = {}
fitted_func = {}
print "First loop"
for i in [1, 2]:
if i == 1:
idx = "Ex"
else:
idx = "Ez"
# store the results in different items in a dictionary to avoid the usual copying issues in Python
param[idx] = curve_fit(func_to_fit, x, dummy_dataset[i])[0]
fitted_func[idx] = lambda x: func_to_fit(x, *param[idx])
print idx, fitted_func[idx](x)
print param
print fitted_func
print "Ex", fitted_func["Ex"](x)
print "Ez", fitted_func["Ez"](x)
And the output is
First loop
Ex [ 0. 1. 4. 9. 16. 25. 36. 49. 64. 81.]
Ez [ 0. 1. 8. 27. 64. 125. 216. 343. 512. 729.]
{'Ex': array([ 2.]), 'Ez': array([ 3.])}
{'Ex': <function <lambda> at 0x0000000015989EB8>, 'Ez': <function <lambda> at 0x0000000015989F98>}
Ex [ 0. 1. 8. 27. 64. 125. 216. 343. 512. 729.]
Ez [ 0. 1. 8. 27. 64. 125. 216. 343. 512. 729.]
The behaviour before the print "Second loop" is as I expect: outside the loop where I do the fitting, the parameters stored in param are the correct ones ("Ex"=2 and "Ez"=3), the functions in fitted_func are stored in different memory locations, and when I call them in the first loop they give the correct results, namely fitted_func["Ex"](x) = x^2 and fitted_func["Ez"](x) = x^3.
However, if I call each fitted function individually, as I do with the last two lines of code, and I try to calculate the values of the functions, I have the wrong result, i.e. both functions return x^3.
I suppose this behaviour is somewhat connected to the way Python handles memory and with some kind of race condition, i.e., Python calculates every time the lambda function in fitted_func[idx] and remembers the last valid result (in this case, the fitting for x^3).
Anyway, I don't really understand what's the exact problem and how to overcome it.
Hoping I managed to illustrate the issue properly, does anybody have any idea of
1. What causes this issue?
2. How to overcome it?
Thanks a lot!
My code below is supposed to print a graph/network using Networkx, Pandas and data from a CSV file. The code is (networkx3.py) -
import csv
import matplotlib.pyplot as plt
import pandas as pd
import networkx as nx
g = nx.Graph()
csv_dict = pd.read_csv('Book1.csv', index_col=[0])
csv_1 = csv_dict.values.tolist()
ini = 0
for row in csv_1:
for i in row:
if type(row[i]) is str:
g.add_edge(ini, int(i), conn_prob=(float(row[i])))
max_wg_ngs = sorted(g[ini].items(), key=lambda e: e[1]["conn_prob"], reverse=True)[:2]
sarr = [str(a) for a in max_wg_ngs]
print "Neighbours of Node %d are:" % ini
#print(max_wg_ngs)
for item in sarr:
print ''.join(str(item))[1:-1]
ini += 1
pos = nx.spring_layout(g, scale=100.)
nx.draw_networkx_nodes(g, pos)
nx.draw_networkx_edges(g, pos)
nx.draw_networkx_labels(g, pos)
#plt.axis('off')
plt.show()
The data in the CSV file is (Book1.csv) -
,1,2,3,4,5,6,7,8,9,10
1,0,0.257905291,0.775104118,0.239086843,0.002313744,0.416936603,0.194817214,0.163350301,0.252043807,0.251272559
2,0.346100279,0,0.438892758,0.598885794,0.002263231,0.406685237,0.523850975,0.257660167,0.206302228,0.161385794
3,0.753358102,0.222349243,0,0.407830809,0.001714776,0.507573592,0.169905687,0.139611318,0.187910832,0.326950557
4,0.185342928,0.571302688,0.51784403,0,0.003231018,0.295197533,0.216184462,0.153032751,0.216331326,0.317961522
5,0,0,0,0,0,0,0,0,0,0
6,0.478164621,0.418192795,0.646810223,0.410746629,0.002414973,0,0.609176897,0.203461461,0.157576977,0.636747837
7,0.24894327,0.522914349,0.33948832,0.316240267,0.002335929,0.639377086,0,0.410011123,0.540266963,0.587764182
8,0.234017887,0.320967208,0.285193773,0.258198079,0.003146737,0.224412057,0.411725737,0,0.487081815,0.469526333
9,0.302955306,0.080506624,0.261610132,0.22856311,0.001746979,0.014994905,0.63386228,0.486096957,0,0.664434415
10,0.232675407,0.121596312,0.457715027,0.310618067,0.001872929,0.57556548,0.473562887,0.32185564,0.482351246,0
The code however doesn't work. I don't understand where I'm going wrong. The error is -
Traceback (most recent call last):
File "networkx3.py", line 13, in <module>
if type(row[i]) is str:
TypeError: list indices must be integers, not float
I do not want to modify the CSV file or its data. The index column and header are supposed to be ignored.
I have previously asked this question but I did not get satisfactory answers. Can anybody help?
Thanks a lot in advance :) (Using Ubuntu 14.04 32-bit VM. Credits to #Adonis for helping in creating the original code)
A little late in answering my own question, but with some valuable help from #Joel and #Adonis, I finally figured out where I was going wrong.
The problem was in the 2nd for loop where I tried to pass a float value as a string into the Graph which gave me an error. Other minor changes would result in an output but without any edges, just nodes.
Finally, after using an enumerate function to define a connecting node (using its index giving power), I got the required output. The only changes to be made are in the 2nd for loop and the if condition:
for row in csv_1:
for idx, i in enumerate(row):
if type(row[idx]) is float:
g.add_edge(ini, idx, conn_prob=(float(row[idx])))
Thanks to all the selfless guys at SOF for the help, couldn't have done it without you :)
I'm working on a home automation project that uses python-twitter to post update information. When I try to post a shorter-than-140 string with no links, I end up getting a TwitterError: Text must be less than or equal to 140 characters.
Here's some shell output describing the problem:
>>> import twitter
>>> api = twitter.Api(
... consumer_key=os.environ['TW_API_KEY'],
... consumer_secret=os.environ['TW_API_SECRET'],
... access_token_key=os.environ['TW_ACCESS_KEY'],
... access_token_secret=os.environ['TW_ACCESS_SECRET']
... )
>>> m = 'The house is now inside the comfort range. Conditions: 28.4 C (83.1 F) at 50.0% humidity. It feels like 84 F.'
>>> len(m)
109
>>> api.PostUpdate(m)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/home/pi/.virtualenvs/site/local/lib/python2.7/site-packages/twitter/api.py", line 952, in PostUpdate
raise TwitterError("Text must be less than or equal to 140 characters.")
TwitterError: Text must be less than or equal to 140 characters.
I tried converting the string to unicode using api.PostUpdate(u'{0}'.format(m)) but unsurprisingly no help there either.
I assume what's happening is that python-twitter is interpreting the decimal numbers as links, thereby decreasing the amount of characters I have left to use, but I don't know enough about the api to know if that's true or not.
Given that I am not planning on including links in these tweets, is there a way to force python-twitter to ignore the decimals without actually getting rid of them? I would prefer not to settle for that workaround, as I kind of like being able to report in decimal notation. I would also prefer not to have to go with tweepy but I suppose I'd swallow that pill if I needed to.
Update: I've found a way to turn off the api's length check using the verify_status_length=False flag in PostUpdate(), which allows me to tweet the above string m with no errors. It doesn't feel that elegant but I will stick with it for now until I find a better solution.
I am modelling data for a logit model with 34 dependent variables,and it keep throwing in the singular matrix error , as below -:
Traceback (most recent call last):
File "<pyshell#1116>", line 1, in <module>
test_scores = smf.Logit(m['event'], train_cols,missing='drop').fit()
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 1186, in fit
disp=disp, callback=callback, **kwargs)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 164, in fit
disp=disp, callback=callback, **kwargs)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 357, in fit
hess=hess)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 405, in _fit_mle_newton
newparams = oldparams - np.dot(np.linalg.inv(H),
File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 445, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 328, in solve
raise LinAlgError, 'Singular matrix'
LinAlgError: Singular matrix
Which was when I stumpled on this method to reduce the matrix to its independent columns
def independent_columns(A, tol = 0):#1e-05):
"""
Return an array composed of independent columns of A.
Note the answer may not be unique; this function returns one of many
possible answers.
https://stackoverflow.com/q/13312498/190597 (user1812712)
http://math.stackexchange.com/a/199132/1140 (Gerry Myerson)
http://mail.scipy.org/pipermail/numpy-discussion/2008-November/038705.html
(Anne Archibald)
>>> A = np.array([(2,4,1,3),(-1,-2,1,0),(0,0,2,2),(3,6,2,5)])
2 4 1 3
-1 -2 1 0
0 0 2 2
3 6 2 5
# try with checking the rank of matrixs
>>> independent_columns(A)
np.array([[1, 4],
[2, 5],
[3, 6]])
"""
Q, R = linalg.qr(A)
independent = np.where(np.abs(R.diagonal()) > tol)[0]
#print independent
return A[:, independent], independent
A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None))
#train_cols will not be converted back from a df to a matrix object,so doing this explicitly
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])
test_scores = smf.Logit(m['event'],A2,missing='drop').fit()
I still get the LinAlgError , though I was hoping I will have the reduced matrix rank now.
Also, I see np.linalg.matrix_rank(train_cols) returns 33 (ie. before calling on the independent_columns function total "x" columns was 34(ie, len(train_cols.ix[0])=34 ), meaning I don't have a full rank matrix), while np.linalg.matrix_rank(A2) returns 33 (meaning I have dropped a columns, and yet I still see the LinAlgError , when I run test_scores = smf.Logit(m['event'],A2,missing='drop').fit() , what am I missing ?
reference to the code above -
How to find degenerate rows/columns in a covariance matrix
I tried to start building the model forward,by introducing each variable at a time, which doesn't give me the singular matrix error, but I would rather have a method that is deterministic, and lets me know, what am I doing wrong & how to eliminate these columns.
Edit (updated post the suggestions by #
user333700 below)
1. You are right, "A2" doesn't have the reduced rank of 33 . ie. len(A2.ix[0]) =34 -> meaning the possibly collinear columns are not dropped - should I increase the "tol", tolerance to get rank of A2 (and the numbers of columns thereof) , as 33. If I change the tol to "1e-05" above, then I do get len(A2.ix[0]) =33, which suggests to me that tol >0 (strictly) is one indicator.
After this I just did the same, test_scores = smf.Logit(m['event'],A2,missing='drop').fit(), without nm to get the convergence.
2. Errors post trying 'nm' method. Strange thing though is that if I take just 20,000 rows, I do get the results. Since it is not showing up Memory error, but "Inverting hessian failed, no bse or cov_params available" - I am assuming, there are multiple nearly-similar records - what would you say ?
m = smf.Logit(data['event_custom'].ix[0:1000000] , train_cols.ix[0:1000000],missing='drop')
test_scores=m.fit(start_params=None,method='nm',maxiter=200,full_output=1)
Warning: Maximum number of iterations has been exceeded
Warning (from warnings module):
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 374
warn(warndoc, Warning)
Warning: Inverting hessian failed, no bse or cov_params available
test_scores.summary()
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
test_scores.summary()
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2396, in summary
yname_list)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2253, in summary
use_t=False)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 826, in add_table_params
use_t=use_t)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 447, in summary_params
std_err = results.bse
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/tools/decorators.py", line 95, in __get__
_cachedval = self.fget(obj)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1037, in bse
return np.sqrt(np.diag(self.cov_params()))
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1102, in cov_params
raise ValueError('need covariance of parameters for computing '
ValueError: need covariance of parameters for computing (unnormalized) covariances
Edit 2: (updated post the suggestions by #user333700 below)
Reiterating what I am trying to model - less than about 1% of total
users "convert" (success outcomes) - so I took a balanced sample of
35(+ve) /65 (-ve)
I suspect the model is not robust, though it converges. So, will use "start_params" as the params from earlier iteration, from a different dataset.
This edit is about confirming is the "start_params" can feed into the results as below -:
A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None))
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])
m = smf.Logit(data['event_custom'], A2,missing='drop')
#m = smf.Logit(data['event_custom'], train_cols,missing='drop')#,method='nm').fit()#This doesnt work, so tried 'nm' which work, but used lasso, as nm did not converge.
test_scores=m.fit_regularized(start_params=None, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)
a_good_looking_previous_result.params=test_scores.params #storing the parameters of pass1 to feed into pass2
test_scores.params
bidfloor_Quartile_modified_binned_0 0.305765
connectiontype_binned_0 -0.436798
day_custom_binned_Fri -0.040269
day_custom_binned_Mon 0.138599
day_custom_binned_Sat -0.319997
day_custom_binned_Sun -0.236507
day_custom_binned_Thu -0.058922
user_agent_device_family_binned_iPad -10.793270
user_agent_device_family_binned_iPhone -8.483099
user_agent_masterclass_binned_apple 9.038889
user_agent_masterclass_binned_generic -0.760297
user_agent_masterclass_binned_samsung -0.063522
log_height_width 0.593199
log_height_width_ScreenResolution -0.520836
productivity -1.495373
games 0.706340
entertainment -1.806886
IAB24 2.531467
IAB17 0.650327
IAB14 0.414031
utilities 9.968253
IAB1 1.850786
social_networking -2.814148
IAB3 -9.230780
music 0.019584
IAB9 -0.415559
C(time_day_modified)[(6, 12]]:C(country)[AUS] -0.103003
C(time_day_modified)[(0, 6]]:C(country)[HKG] 0.769272
C(time_day_modified)[(6, 12]]:C(country)[HKG] 0.406882
C(time_day_modified)[(0, 6]]:C(country)[IDN] 0.073306
C(time_day_modified)[(6, 12]]:C(country)[IDN] -0.207568
C(time_day_modified)[(0, 6]]:C(country)[IND] 0.033370
... more params here
Now on a different dataset(pass2, for indexing), I model the same as below -:
ie. I read a new dataframe, do all the variable transformation and then model via Logit as earlier .
m_pass2 = smf.Logit(data['event_custom'], A2_pass2,missing='drop')
test_scores_pass2=m_pass2.fit_regularized(start_params=a_good_looking_previous_result.params, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)
and, possibly keep iterating by picking up "start_params" from earlier passes.
Several points to this:
You need tol > 0 to detect near perfect collinearity, which might also cause numerical problems in later calculations.
Check the number of columns of A2 to see whether a column has really be dropped.
Logit needs to do some non-linear calculations with the exog, so even if the design matrix is not very close to perfect collinearity, the transformed variables for the log-likelihood, derivative or Hessian calculations might still end up being with numerical problems, like singular Hessian.
(All these are floating point problems when we work near floating point precision, 1e-15, 1e-16. There are sometimes differences in the default thresholds for matrix_rank and similar linalg functions which can imply that in some edge cases one function identifies it as singular and another one doesn't.)
The default optimization method for the discrete models including Logit is a simple Newton method, which is fast in reasonably nice cases, but can fail in cases that are badly conditioned. You could try one of the other optimizers which will be one of those in scipy.optimize, method='nm' is usually very robust but slow, method='bfgs' works well in many cases but also can run into convergence problems.
Nevertheless, even when one of the other optimization methods succeeds, it is still necessary to inspect the results. More often than not, a failure with one method means that the model or estimation problem might not be well defined.
A good way to check whether it is just a problem with bad starting values or a specification problem is to run method='nm' first and then run one of the more accurate methods like newton or bfgs using the nm estimate as starting value, and see whether it succeeds from good starting values.
I am writing a program to print the sum of multiple of 3 or 5 less than 1000. I am using an arithmetic progression to do it. My code is:
def multiple(x,y):
a=(1000-(1000%x) - x)/x +1
b=(995-y)/y +1
c=(1000-(1000%x*y)-x*y)/x*y +1
Sa=int(a/2(2*x+(a-1)*a))
Sb=b/2(2*y+(b-1)*b)
Sc=c/2(2*x*y+(c-1)*x*y)
Sd=Sa+Sb-Sc
print Sd
When I call the function I get the error:
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python27\swampy-2.1.7\MULTIPLE.py", line 23, in multiple
Sa=int(a/2(2*x+(a-1)*a))
TypeError: 'int' object is not callable
Please point out the mistake in my code. Thanks.
P.S. Please forgive my "art" of question asking. I am new to Python and StackOverflow, so please bear with me. Thanks!
In Sa=int(a/2(2*x+(a-1)*a)) you forgot a * to multiply between a/2 and (2*x+(a-1)*a) You should have Sa=int(a/2*(2*x+(a-1)*a)).
Also, the same on Sb and Sc.