line graph from loop in Python - python-2.7

Hello everyone,
I need some help. I wrote this script:
import matplotlib.pyplot as plt
import scipy
import pyfits
import numpy as np
import re
import os
import glob
import time
global numbers
numbers=re.compile(r'(\d+)')
def numericalSort(value):
    parts = numbers.split(value)
    parts[1::2] = map(int, parts[1::2])
    return parts
image_list=sorted(glob.glob('*.fit'), key=numericalSort)
for i in range(len(image_list)):
    hdulist=pyfits.open(image_list[i])
    data=hdulist[0].data
    dimension=hdulist[0].header['NAXIS1']
    time=hdulist[0].header['TIME']
    hours=float(time[:2])*3600
    minutes=float(time[3:5])*60
    sec=float(time[6:])
    cas=hours+minutes+sec
    y=[]
    for n in range(0,dimension):
        y.append(data.flat[n])
    maxy= max(y)
    print image_list[i],cas,maxy
    plt.plot([cas],[maxy],'bo')
    plt.ion()
    plt.draw()
This script reads FITS data files. From each file it finds the max value, which is the y value, and takes TIME from the header, which is the x value.
And now my problem: when I run this script I get a graph, but only with separate points. How do I get a graph with a line joining them (point to point)?
Thanks for any answers and help.

Your problem may well be here:
plt.plot([cas],[maxy],'bo')
At the point where this statement is reached, cas is a single value and maxy is also a single value: you have only one point to plot and therefore nothing to join. Next time round the loop you plot another single point, unconnected to the previous one, and so on.
I can't be sure, but perhaps you mean to do something like:
x = []
ymax = []  # one maximum per file
for i in range(len(image_list)):
    hdulist=pyfits.open(image_list[i])
    data=hdulist[0].data
    dimension=hdulist[0].header['NAXIS1']
    time=hdulist[0].header['TIME']
    hours=float(time[:2])*3600
    minutes=float(time[3:5])*60
    sec=float(time[6:])
    cas=hours+minutes+sec
    x.append(cas)
    y=[]
    for n in range(0,dimension):
        y.append(data.flat[n])
    maxy= max(y)
    ymax.append(maxy)  # collect it so that x and ymax have the same length
    print image_list[i],cas,maxy
plt.plot(x, ymax, 'bo-')
plt.ion()
plt.draw()
i.e. plot a single line once you've collected all the x and y values. The format string 'bo-' draws blue circle markers and the trailing '-' provides the connecting line.
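To see the idea without any FITS files, here is a minimal sketch. The (TIME, maximum) pairs are made-up stand-ins for the header value and pixel maximum, the helper name time_to_seconds is hypothetical, and the 'HH:MM:SS' layout is an assumption based on the slicing in the question.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for an interactive window
import matplotlib.pyplot as plt

def time_to_seconds(stamp):
    """Convert an 'HH:MM:SS' string to seconds since midnight."""
    hours, minutes, seconds = stamp.split(":")
    return float(hours) * 3600 + float(minutes) * 60 + float(seconds)

# Hypothetical (TIME header, maximum value) pairs, one per file.
samples = [("21:00:05", 1200), ("21:00:35", 1350), ("21:01:05", 1280)]

xs, ys = [], []
for stamp, peak in samples:
    xs.append(time_to_seconds(stamp))  # collect first ...
    ys.append(peak)

fig, ax = plt.subplots()
(line,) = ax.plot(xs, ys, "bo-")       # ... then draw one connected line
```

Because plot() is called once with both lists, matplotlib has all the points and can join them. Calling it once per point gives it nothing to connect.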

OK, here is the solution:
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import scipy
import pyfits
import numpy as np
import re
import os
import glob
import time
global numbers
numbers=re.compile(r'(\d+)')
def numericalSort(value):
    parts = numbers.split(value)
    parts[1::2] = map(int, parts[1::2])
    return parts
fig=plt.figure()
ax1=fig.add_subplot(1,1,1)
def animate(i):
    image_list=sorted(glob.glob('*.fit'), key=numericalSort)
    cas,maxy=[],[]
    files=open("data.dat","w")
    for n in range(len(image_list)):
        hdulist=pyfits.open(image_list[n])
        data=hdulist[0].data
        maxy=data.max()
        time=hdulist[0].header['TIME']
        hours=int(float(time[:2])*3600)
        minutes=int(float(time[3:5])*60)
        sec=int(float(time[6:]))
        cas=hours+minutes+sec
        files.write("\n{},{}".format(cas,maxy))
    files.close()
    pool=open('data.dat','r')
    data=pool.read()
    dataA=data.split('\n')
    xar=[]
    yar=[]
    pool.close()
    for line in dataA:
        if len(line)>1:
            x,y=line.split(',')
            xar.append(int(x))
            yar.append(int(y))
    print xar,yar
    ax1.clear()
    ax1.plot(xar,yar,'b-')
    ax1.plot(xar,yar,'ro')
    plt.title('Light curve')
    plt.xlabel('TIME')
    plt.ylabel('Max intensity')
    plt.grid()
# Assumption: the post does not show the call that drives animate();
# the 1000 ms interval is only an illustrative value.
ani=animation.FuncAnimation(fig, animate, interval=1000)
plt.show()
This script reads the values from the files and plots them.
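The numericalSort key used in both scripts is what keeps a file such as img2.fit ahead of img10.fit. Here is a small standalone sketch of that key; the file names are hypothetical and the function is renamed numerical_sort.

```python
import re

numbers = re.compile(r'(\d+)')

def numerical_sort(value):
    """Split a name into text and digit runs, turning the digit runs into ints."""
    parts = numbers.split(value)
    parts[1::2] = map(int, parts[1::2])
    return parts

# Hypothetical file names; a plain sorted() would put img10.fit before img2.fit.
names = ['img10.fit', 'img2.fit', 'img1.fit']
ordered = sorted(names, key=numerical_sort)
print(ordered)
```

Sorting by this key compares the digit runs as integers, so the frames are read in acquisition order rather than in character-by-character order.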

Related

Modules similar to matplotlib's `Gridspec` and `GridSpecFromSubplotSpec` in pyqtgraph

I have identified that the most time-consuming step in the execution of code is saving plots to memory. I'm using matplotlib for plotting and saving the plots. The issue is that I'm running several simulations and saving the plots resulting from these simulations; this effort is consuming an insane number of compute hours. I have verified that it is indeed the plotting that is doing the damage.
It seems that pyqtgraph renders and saves images comparatively faster than matplotlib. I want to know if something similar to the following lines of code could be implemented in pyqtgraph?
import matplotlib
matplotlib.use('Agg')
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from matplotlib.figure import Figure
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
fig = plt.figure(figsize=(6, 2.5))
rows = 1
columns = 2
gs = gridspec.GridSpec(rows, columns, hspace=0.0,wspace=0.0)
aj=0
for specie in lines:
    for transition in species[specie].items():
        gss = gridspec.GridSpecFromSubplotSpec(2, 1, subplot_spec=gs[aj],hspace=0.0,height_ratios=[1, 3])
        ax0 = fig.add_subplot(gss[0])
        ax1 = fig.add_subplot(gss[1], sharex=ax0)
        ax0.plot(fitregs[specie+transition[0]+'_Vel'],fitregs['N_residue'],color='black',linewidth=0.85)
        ax1.plot(fitregs[specie+transition[0]+'_Vel'],fitregs['N_flux'],color='black',linewidth=0.85)
        ax1.plot(fitregs[specie+transition[0]+'_Vel'],fitregs['Best_profile'],color='red',linewidth=0.85)
        ax1.xaxis.set_minor_locator(AutoMinorLocator())
        ax1.tick_params(which='both', width=1)
        ax1.tick_params(which='major', length=5)
        ax1.tick_params(which='minor', length=2.5)
        ax1.text(0.70,0.70,r'$\chi^{2}_{\nu}$'+'= {}'.format(round(red_chisqr[aj],2)),transform=ax1.transAxes)
        ax1.text(0.10,0.10,'{}'.format(specie+' '+transition[0]),transform=ax1.transAxes)
        ak=ak+1
    aj=aj+1
canvas = FigureCanvas(fig)
canvas.print_figure('fits.png')
Example output from the above code is

Combining line plots (with data from DataFrames)

I can't figure out how to combine these two plots.
Here is the relevant code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%pylab inline #Jupyter Notebook
df_state.head()
lm_original_plot.head()
outputs for .head()
df_state.loc['Alabama'][['month','total purchases',
                         'permit','permit_recheck']].plot(x='month',figsize=(6,3),linestyle='-', marker='o')
lm_original_plot.plot(x='month',figsize=(6,3),linestyle=':');
outputs for plots
This is how I would do this (not saying it is the best method or anything):
1) Merge the two DataFrames on month:
merged = df_state.merge(lm_original_plot, on='month', how='left')
2) Create the figure. The total is now a column just like the other variables in the first chart, so you can just add 'total' to your first chart code.
Not my work, just what a peer showed me.
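A minimal sketch of both ways to get the lines onto one chart. The two small DataFrames and their column names are hypothetical stand-ins for df_state and lm_original_plot.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-ins for the two DataFrames in the question.
purchases = pd.DataFrame({"month": [1, 2, 3], "permit": [10, 12, 9]})
trend = pd.DataFrame({"month": [1, 2, 3], "total": [30, 33, 31]})

# Option 1: merge on the shared column, then plot every series in one call.
merged = purchases.merge(trend, on="month", how="left")
ax = merged.plot(x="month", figsize=(6, 3), marker="o")

# Option 2: keep the frames separate and draw both on the same Axes.
fig, ax2 = plt.subplots(figsize=(6, 3))
purchases.plot(x="month", ax=ax2, linestyle="-", marker="o")
trend.plot(x="month", ax=ax2, linestyle=":")
```

Option 2 avoids the merge and lets each DataFrame keep its own line style, which is closer to the two plot calls in the question.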

Column order reversed in step histogram plot

Passing a 2D array to Matplotlib's histogram function with histtype='step' seems to plot the columns in reverse order (at least from my biased, Western perspective of left-to-right).
Here's an illustration:
import matplotlib.pyplot as plt
import numpy as np
X = np.array([
    np.random.normal(size=5000),
    np.random.uniform(size=5000)*2.0 - 1.0,
    np.random.beta(2.0,1.0,size=5000)*3.0,
]).T
trash = plt.hist(X,bins=50,histtype='step')
plt.legend(['Normal','2*Uniform-1','3*Beta(2,1)'],loc='upper left')
Produces this:
Running matplotlib version 2.0.2, python 2.7
From the documentation for legend:
in order to keep the "label" and the legend element instance together, it is preferable to specify the label either at artist creation, or by calling the set_label method on the artist
I recommend using the label keyword argument of hist, which the documentation describes as:
String, or sequence of strings to match multiple datasets
The result is:
import matplotlib.pyplot as plt
import numpy as np
X = np.array([
    np.random.normal(size=5000),
    np.random.uniform(size=5000)*2.0 - 1.0,
    np.random.beta(2.0,1.0,size=5000)*3.0,
]).T
trash = plt.hist(X,bins=50,histtype='step',
                 label=['Normal','2*Uniform-1','3*Beta(2,1)'])
plt.legend(loc='upper left')
plt.show()
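As a quick check that label ties each name to its own column regardless of drawing order, here is a small sketch with two made-up datasets and a fixed seed (both are assumptions for illustration). It reads the label back from the artists that hist returns.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable
X = np.array([
    rng.normal(size=500),
    rng.uniform(size=500) * 2.0 - 1.0,
]).T
names = ['Normal', '2*Uniform-1']

fig, ax = plt.subplots()
counts, bins, patches = ax.hist(X, bins=20, histtype='step', label=names)
ax.legend(loc='upper left')

# patches holds one group of artists per column; the first artist of each
# group carries the label that was passed for that column.
attached = [group[0].get_label() for group in patches]
print(attached)
```

The legend entries may still be listed in a different order from the names list, but each entry is drawn with the style of the dataset it names.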

Scikit-Learn One-hot-encode before or after train/test split

I am looking at two scenarios building a model using scikit-learn and I can not figure out why one of them is returning a result that is so fundamentally different than the other. The only thing different between the two cases (that I know of) is that in one case I am one-hot-encoding the categorical variables all at once (on the whole data) and then splitting between training and test. In the second case I am splitting between training and test and then one-hot-encoding both sets based off of the training data.
The latter case is technically better for judging the generalization error of the process but this case is returning a normalized gini that is dramatically different (and bad - essentially no model) compared to the first case. I know the first case gini (~0.33) is in line with a model built on this data.
Why is the second case returning such a different gini? FYI The data set contains a mix of numeric and categorical variables.
Method 1 (one-hot encode the entire data and then split). This returns: Validation Sample Score: 0.3454355044 (normalized gini).
from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit,train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor , ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston
def gini(solution, submission):
    df = zip(solution, submission, range(len(solution)))
    df = sorted(df, key=lambda x: (x[1],-x[2]), reverse=True)
    rand = [float(i+1)/float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1,len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound)-1] + df[i][0])
    Lorentz = [float(x)/totalPos for x in cumPosFound]
    Gini = [Lorentz[i]-rand[i] for i in range(len(df))]
    return sum(Gini)
def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission)/gini(solution, solution)
    return normalized_gini
# Normalized Gini Scorer
gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better = True)
if __name__ == '__main__':
    dat=pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv',sep=",")
    y=dat[['Hazard']].values.ravel()
    dat=dat.drop(['Hazard','Id'],axis=1)
    folds=train_test_split(range(len(y)),test_size=0.30, random_state=15) #30% test
    #First one hot and make a pandas df
    dat_dict=dat.T.to_dict().values()
    vectorizer = DV( sparse = False )
    vectorizer.fit( dat_dict )
    dat= vectorizer.transform( dat_dict )
    dat=pd.DataFrame(dat)
    train_X=dat.iloc[folds[0],:]
    train_y=y[folds[0]]
    test_X=dat.iloc[folds[1],:]
    test_y=y[folds[1]]
    rf=RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)
    print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y,y_submission)))
Method 2 (first split and then one-hot encode). This returns: Validation Sample Score: 0.0055124452 (normalized gini).
from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit,train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor , ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston
def gini(solution, submission):
    df = zip(solution, submission, range(len(solution)))
    df = sorted(df, key=lambda x: (x[1],-x[2]), reverse=True)
    rand = [float(i+1)/float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1,len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound)-1] + df[i][0])
    Lorentz = [float(x)/totalPos for x in cumPosFound]
    Gini = [Lorentz[i]-rand[i] for i in range(len(df))]
    return sum(Gini)
def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission)/gini(solution, solution)
    return normalized_gini
# Normalized Gini Scorer
gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better = True)
if __name__ == '__main__':
    dat=pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv',sep=",")
    y=dat[['Hazard']].values.ravel()
    dat=dat.drop(['Hazard','Id'],axis=1)
    folds=train_test_split(range(len(y)),test_size=0.3, random_state=15) #30% test
    #first split
    train_X=dat.iloc[folds[0],:]
    train_y=y[folds[0]]
    test_X=dat.iloc[folds[1],:]
    test_y=y[folds[1]]
    #One hot encode the training X and transform the test X
    dat_dict=train_X.T.to_dict().values()
    vectorizer = DV( sparse = False )
    vectorizer.fit( dat_dict )
    train_X= vectorizer.transform( dat_dict )
    train_X=pd.DataFrame(train_X)
    dat_dict=test_X.T.to_dict().values()
    test_X= vectorizer.transform( dat_dict )
    test_X=pd.DataFrame(test_X)
    rf=RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)
    print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y,y_submission)))
While the previous comments correctly suggest it is best to map over your entire feature space first, in your case both the Train and Test contain all of the feature values in all of the columns.
If you compare the vectorizer.vocabulary_ between the two versions, they are exactly the same, so there is no difference in mapping. Hence, it cannot be causing the problem.
Method 2 fails because your dat_dict gets re-sorted by the original index when you execute this command:
dat_dict=train_X.T.to_dict().values()
In other words, train_X has a shuffled index going into this line of code. When you turn it into a dict, the dict order re-sorts into the numerical order of the original index. This causes your Train and Test data to become completely de-correlated with y.
Method 1 doesn't suffer from this problem, because you shuffle the data after the mapping.
You can fix the issue by adding a .reset_index(drop=True) both times you assign dat_dict in Method 2, e.g.,
dat_dict=train_X.reset_index(drop=True).T.to_dict().values()
This ensures the data order is preserved when converting to a dict.
When I add that bit of code, I get the following results:
- Method 1: Validation Sample Score: 0.3454355044 (normalized gini)
- Method 2: Validation Sample Score: 0.3438430991 (normalized gini)
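A related sketch, not taken from the answer: converting with to_dict(orient="records") yields one dict per row in row order, so the result does not depend on how a dict keyed by index labels happens to be ordered. The column names and values below are made up.

```python
import pandas as pd

# Hypothetical data: a numeric and a categorical feature plus a target.
dat = pd.DataFrame({"age": [30, 41, 25, 57], "region": ["N", "S", "S", "E"]})
y = [1, 4, 2, 8]

# Simulate a shuffled split: the index is no longer in ascending order.
shuffled_index = [2, 0, 3, 1]
train_X = dat.iloc[shuffled_index, :]
train_y = [y[i] for i in shuffled_index]

# One dict per row, in the same order as the rows of train_X.
records = train_X.to_dict(orient="records")
for row, target in zip(records, train_y):
    print(row, target)
```

A list of dicts like this is the input shape DictVectorizer expects, so it could stand in for dat_dict while keeping every row aligned with its target.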
I can't get your code to run, but my guess is that in the test dataset either you're not seeing all the levels of some of the categorical variables, and hence if you calculate your dummy variables just on this data, you'll actually have different columns.
Otherwise, maybe you have the same columns but they're in a different order?

Use a loop to plot n charts Python

I have a set of data that I load into python using a pandas dataframe. What I would like to do is create a loop that will print a plot for all the elements in their own frame, not all on one. My data is in an excel file structured in this fashion:
Index | DATE | AMB CO 1 | AMB CO 2 |...|AMB CO_n | TOTAL
1 | 1/1/12| 14 | 33 |...| 236 | 1600
. | ... | ... | ... |...| ... | ...
. | ... | ... | ... |...| ... | ...
. | ... | ... | ... |...| ... | ...
n
This is what I have for code so far:
import pandas as pd
import matplotlib.pyplot as plt
ambdf = pd.read_excel('Ambulance.xlsx',
                      sheetname='Sheet2', index_col=0, na_values=['NA'])
print type(ambdf)
print ambdf
print ambdf['EAS']
amb_plot = plt.plot(ambdf['EAS'], linewidth=2)
plt.title('EAS Ambulance Numbers')
plt.xlabel('Month')
plt.ylabel('Count of Deliveries')
print amb_plot
for i in ambdf:
    print plt.plot(ambdf[i], linewidth = 2)
I am thinking of doing something like this:
for i in ambdf:
    ambdf_plot = plt.plot(ambdf, linewidth = 2)
The above was not remotely what I wanted, and it stems from my unfamiliarity with pandas, matplotlib, etc. Looking at some documentation, though, it seems to me that matplotlib is not even needed (question B).
So A) how can I produce a plot of the data for every column in my df,
and B) do I need to use matplotlib, or should I just use pandas to do it all?
Thank you,
Ok, so the easiest method to create several plots is this:
import matplotlib.pyplot as plt
x=[[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]]
y=[[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]]
for i in range(len(x)):
    plt.figure()
    plt.plot(x[i],y[i])
    # Show/save figure as desired.
    plt.show()
# Can show all four figures at once by calling plt.show() here, outside the loop.
#plt.show()
Note that you need to create a figure every time or pyplot will plot in the first one created.
If you want to create several data series all you need to do is:
import matplotlib.pyplot as plt
plt.figure()
x=[[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]]
y=[[1,2,3,4],[2,3,4,5],[3,4,5,6],[7,8,9,10]]
plt.plot(x[0],y[0],'r',x[1],y[1],'g',x[2],y[2],'b',x[3],y[3],'k')
You could automate this by keeping a list of colours such as ['r','g','b','k'] and looping over it together with the corresponding data. If you just want to add data series to one plot programmatically, something like this will do it (no new figure is created each time, so everything is plotted in the same figure):
import matplotlib.pyplot as plt
x=[[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]]
y=[[1,2,3,4],[2,3,4,5],[3,4,5,6],[7,8,9,10]]
colours=['r','g','b','k']
plt.figure() # In this example, all the plots will be in one figure.
for i in range(len(x)):
    plt.plot(x[i],y[i],colours[i])
plt.show()
If anything matplotlib has a very good documentation page with plenty of examples.
17 Dec 2019: added plt.show() and plt.figure() calls to clarify this part of the story.
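Applied to a DataFrame like the one in the question, the same pattern could look like the sketch below. The small ambdf built here is a hypothetical stand-in for the spreadsheet, and the axis labels are copied from the question's code.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove to see windows via plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the spreadsheet in the question.
ambdf = pd.DataFrame(
    {"AMB CO 1": [14, 18, 11], "AMB CO 2": [33, 29, 35], "TOTAL": [47, 47, 46]},
    index=pd.Index(["1/1/12", "2/1/12", "3/1/12"], name="DATE"),
)

figures = {}
for column in ambdf.columns:
    fig, ax = plt.subplots()  # a new figure for every column
    ax.plot(range(len(ambdf)), ambdf[column].values, linewidth=2)
    ax.set_xticks(range(len(ambdf)))
    ax.set_xticklabels(ambdf.index)
    ax.set_title(column)
    ax.set_xlabel("Month")
    ax.set_ylabel("Count of Deliveries")
    figures[column] = fig
```

On question B: pandas plotting is built on matplotlib, so calling ambdf[column].plot() inside the loop would also work. Using matplotlib directly mainly gives finer control over each figure.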
Use a dictionary!!
You can also use dictionaries, which allow you to have more control over the plots:
import matplotlib.pyplot as plt
# plot 0 plot 1 plot 2 plot 3
x=[[1,2,3,4],[1,4,3,4],[1,2,3,4],[9,8,7,4]]
y=[[3,2,3,4],[3,6,3,4],[6,7,8,9],[3,2,2,4]]
plots = zip(x,y)
def loop_plot(plots):
    figs={}
    axs={}
    for idx,plot in enumerate(plots):
        figs[idx]=plt.figure()
        axs[idx]=figs[idx].add_subplot(111)
        axs[idx].plot(plot[0],plot[1])
    return figs, axs
figs, axs = loop_plot(plots)
Now you can select the plot that you want to modify easily:
axs[0].set_title("Now I can control it!")
Of course, it is up to you to decide what to do with the plots. You can either save them to disk with figs[idx].savefig("plot_%s.png" %idx) or show them with plt.show(). Use the argument block=False only if you want to pop up all the plots together (this could be quite messy if you have a lot of plots). You can do this inside the loop_plot function or in a separate loop using the dictionaries that the function returns.
Just to add: returning figs and axs is not required in order to call plt.show().
Here are two examples of how to generate graphs in separate windows (frames), and, an example of how to generate graphs and save them into separate graphics files.
Okay, first the on-screen example. Notice that we use a separate instance of plt.figure(), for each graph, with plt.plot(). At the end, we have to call plt.show() to put it all on the screen.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace( 0,10 )
for n in range(3):
    y = np.sin( x+n )
    plt.figure()
    plt.plot( x, y )
plt.show()
Another way to do this, is to use plt.show(block=False) inside the loop:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace( 0,10 )
for n in range(3):
    y = np.sin( x+n )
    plt.figure()
    plt.plot( x, y )
    plt.show( block=False )
Now, let's generate the graphs and write each of them to a file instead. Here we replace plt.show() with plt.savefig( filename ). The difference from the previous example is that we don't have to account for "blocking" at each graph. Note also that we number the file names, using %03d so that they sort conveniently in numerical order afterwards.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace( 0,10 )
for n in range(3):
    y = np.sin( x+n )
    plt.figure()
    plt.plot( x, y )
    plt.savefig('myfilename%03d.png'%(n))
If your requirement is to plot against one column, then feel free to use this (first import the data into a pandas DataFrame). It plots a grid of plots with 5 columns and as many rows as required:
import math
import matplotlib.pyplot as plt
PLOTS_PER_ROW = 5
# float() avoids integer division under Python 2
n_rows = int(math.ceil(len(df.columns)/float(PLOTS_PER_ROW)))
# squeeze=False keeps axs two-dimensional even when there is only one row
fig, axs = plt.subplots(n_rows, PLOTS_PER_ROW, figsize=(20, 60), squeeze=False)
i,j=0,0
for col in df.columns:
    axs[i][j].scatter(df['target_col'], df[col], s=3)
    axs[i][j].set_ylabel(col)
    j+=1
    if j%PLOTS_PER_ROW==0:
        i+=1
        j=0
plt.show()
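The row and column bookkeeping in that loop can also be computed directly with divmod. This is a small sketch; the helper name grid_positions is hypothetical.

```python
import math

def grid_positions(n_plots, plots_per_row):
    """Return the number of rows and the (row, column) cell of each plot."""
    n_rows = int(math.ceil(n_plots / float(plots_per_row)))
    cells = [divmod(k, plots_per_row) for k in range(n_plots)]
    return n_rows, cells

n_rows, cells = grid_positions(n_plots=7, plots_per_row=5)
print(n_rows, cells)
```

With the cells precomputed, the plotting loop could iterate over zip(df.columns, cells) instead of updating i and j by hand.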
A simple way of plotting on different frames would be like:
import matplotlib.pyplot as plt
for grp in list_groups:
    plt.figure()
    plt.plot(grp)
plt.show()
Then Python will open a separate frame for each iteration.
We can create a for loop and pass all the numeric columns into it.
The loop will plot the graphs one by one in separate panes because plt.figure() is called inside it.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
numeric_features=[x for x in data.columns if data[x].dtype!="object"]
#taking only the numeric columns from the dataframe.
for i in data[numeric_features].columns:
    plt.figure(figsize=(12,5))
    plt.title(i)
    sns.boxplot(data=data[i])