Python matplotlib insert index in plot - python-2.7

So I am trying to save multiple plots which are generated after every iteration of a for loop and I want to insert a name tag on those plots like a header with the number of iterations done. code looks like this. I tried suptitle but it does not work.
for i in range(steps):
nor_m = matplotlib.colors.Normalize(vmin = 0, vmax = 1)
plt.hexbin(xxx,yyy,C, gridsize=13, cmap=matplotlib.cm.rainbow, norm=nor_m, edgecolors= 'k', extent=[-1,12,-1,12])
plt.draw()
plt.suptitle('frame'%i, fontsize=12)
savefig("flie%d.png"%i)

What about plt.title?
for i in range(steps):
nor_m = matplotlib.colors.Normalize(vmin=0, vmax=1)
plt.hexbin(xxx, yyy, C, gridsize=13, cmap=matplotlib.cm.rainbow, norm=nor_m, edgecolors= 'k', extent=[-1,12,-1,12])
plt.title('frame %d'%i, fontsize=12)
plt.savefig("flie%d.png"%i)
You also had an error in the string formatting of the title call. Actually 'frame'%i should have failed with an TypeError: not all arguments converted during string formatting-error.
Note also, that you don't need the plt.draw, since this will be called by plt.savefig.

Related

Why does pandas Round method explodes my data frame?

I try to round all values in this dataframe. However, the pandas round() method explodes my dataframe by a factor 5! From 150 rows to 7518 rows.
It may be that there is something odd with the data in the dataframe, but then again, one would not expect a simple rounding function to do this.
Below, I replicate the error using 1) simulated data and 2) the data that leads to the said error.
This results in 150 rows, which is the correct number:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random([150, 4]), columns=['A', 'B', 'C', 'D'])
df["cat"] = "MID"
df.loc[:399,["cat"]] = "LOW"
df.iloc[-400:,-1] = "HI"
df.cat.value_counts()
df.set_index("cat", inplace=True)
df.round(3)
Using the data from my dropbox folder, the round-function produces a whopping 7518 rows:
dfb = pd.read_pickle('dfna.pkl')
dfb.round(3)
This is strange. I solved it for now using this rather ugly line:
dfb = dfb.reset_index().round({'A': 3, 'B': 3, 'C': 3, 'D': 3}).set_index('tricile')
However, this is not ideal, given that pandas' round method acts in mysterious ways and may affect future programs.
I think it is bug - round with duplicated CategoricalIndex, so created pandas issue 21809 and pandas issue 21810.
Similar solution like your is:
print (dfb.reset_index().round(3).set_index('tricile'))
Or remove CategoricalIndex:
dfb.index = dfb.index.astype(str)
print (dfb.round(3))

Viz LDA model with Bokeh and T-sne

I have tried to follow this tutorial (https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html) of visualizing LDA with t-sne and bokeh.
But i run into a bit of problem.
When i tried to run the following code:
plot_lda.scatter(x=tsne_lda[:, 0], y=tsne_lda[:, 1],
color=colormap[_lda_keys][:num_example],
source=bp.ColumnDataSource({
"content": text[:num_example],
"topic_key": _lda_keys[:num_example]
}))
NB: In the tutorial the content is called news, in mine it is called text
i get this error:
Supplying a user-defined data source AND iterable values to glyph methods is
not possibe. Either:
Pass all data directly as literals:
p.circe(x=a_list, y=an_array, ...)
Or, put all data in a ColumnDataSource and pass column names:
source = ColumnDataSource(data=dict(x=a_list, y=an_array))
p.circe(x='x', y='x', source=source, ...)
To me this do not make so much sense and i have not succeded in finding any annswer to it ethier here, github or else where. Hope that some on can help. best Niels
I've been also battling with that piece of code and I've found two problems with it.
First, when you pass a source to the scatter function, like the error states, you must include all data in the dictionary, i.e., x and y axes, colors, labels, and any other information that you want to include in the tooltip.
Second, the x and y axes have a different shape than the information passed to the tooltip, so you also have to slice both arrays in the axes with the num_example variable.
The following code got me running:
# create the dictionary with all the information
plot_dict = {
'x': tsne_lda[:num_example, 0],
'y': tsne_lda[:num_example, 1],
'colors': colormap[_lda_keys][:num_example],
'content': text[:num_example],
'topic_key': _lda_keys[:num_example]
}
# create the dataframe from the dictionary
plot_df = pd.DataFrame.from_dict(plot_dict)
# declare the source
source = bp.ColumnDataSource(data=plot_df)
title = 'LDA viz'
# initialize bokeh plot
plot_lda = bp.figure(plot_width=1400, plot_height=1100,
title=title,
tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
x_axis_type=None, y_axis_type=None, min_border=1)
# build scatter function from the columns of the dataframe
plot_lda.scatter('x', 'y', color='colors', source=source)

use python to write to a specific column is a .csv file

I have a .csv file where I need to overwrite a certain column with new values from a list.
Let's say I have the list L1 = ['La', 'Lb', 'Lc'] that I want to write in column no. 5 of the .csv file.
If I run:
L1 = ['La', 'Lb', 'Lc']
import csv
with open(r'C:\LIST.csv','wb') as f:
w = csv.writer(f)
for i in L1:
w.writerow(i)
This will write the L1 values to the first and second column.
First column will be 'L', 'L', 'L' and second column 'a', 'b', 'c'
I could not find the syntax to write to a specific column each element from the list. (this is in Python 2.7). Thank you for your help!
(for this script I must use IronPython, and just the built in Libraries that comes with IronPython)
Although you could certainly use Python's built-in csv module to read the data, modify it, and write it out, I'd recommend the excellent tablib module:
from tablib import Dataset
csv = '''Col1,Col2,Col3,Col4,Col5,Col6,Col7
a1,b1,c1,d1,e1,f1,g1
a2,b2,c2,d2,e2,f2,g2
a3,b3,c3,d3,e3,f3,g3
'''
# Read a hard-coded string just for test purposes.
# In your code, you would use open('...', 'rt').read() to read from a file.
imported_data = Dataset().load(csv, format='csv')
L1 = ['La', 'Lb', 'Lc']
for i in range(len(L1)):
# Each row is a tuple, and tuples don't support assignment.
# Convert to a list first so we can modify it.
row = list(imported_data[i])
# Put our value in the 5th column (index 4).
row[4] = L1[i]
# Store the row back into the Dataset.
imported_data[i] = row
# Export to CSV. (Of course, you could write this to a file instead.)
print imported_data.export('csv')
# Output:
# Col1,Col2,Col3,Col4,Col5,Col6,Col7
# a1,b1,c1,d1,La,f1,g1
# a2,b2,c2,d2,Lb,f2,g2
# a3,b3,c3,d3,Lc,f3,g3

deleting semicolons in a column of csv in python

I have a column of different times and I want to find the values in between 2 different times but can't find out how? For example: 09:04:00 threw 09:25:00. And just use the values in between those different times.
I was gonna just delete the semicolons separating hours:minutes:seconds and do it that way. But really don't know how to do that. But I know how to find a value in a column so I figured that way would be easier idk.
Here is the csv I'm working with.
DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOLUME
02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505
02/03/1997,09:05:00,3047.00,3048.00,3046.00,3047.00,162
02/03/1997,09:06:00,3047.50,3048.00,3047.00,3047.50,98
02/03/1997,09:07:00,3047.50,3047.50,3047.00,3047.50,228
02/03/1997,09:08:00,3048.00,3048.00,3047.50,3048.00,136
02/03/1997,09:09:00,3048.00,3048.00,3046.50,3046.50,174
02/03/1997,09:10:00,3046.50,3046.50,3045.00,3045.00,134
02/03/1997,09:11:00,3045.50,3046.00,3044.00,3045.00,43
02/03/1997,09:12:00,3045.00,3045.50,3045.00,3045.00,214
02/03/1997,09:13:00,3045.50,3045.50,3045.50,3045.50,8
02/03/1997,09:14:00,3045.50,3046.00,3044.50,3044.50,152
02/03/1997,09:15:00,3044.00,3044.00,3042.50,3042.50,126
02/03/1997,09:16:00,3043.50,3043.50,3043.00,3043.00,128
02/03/1997,09:17:00,3042.50,3043.50,3042.50,3043.50,23
02/03/1997,09:18:00,3043.50,3044.50,3043.00,3044.00,51
02/03/1997,09:19:00,3044.50,3044.50,3043.00,3043.00,18
02/03/1997,09:20:00,3043.00,3045.00,3043.00,3045.00,23
02/03/1997,09:21:00,3045.00,3045.00,3044.50,3045.00,51
02/03/1997,09:22:00,3045.00,3045.00,3045.00,3045.00,47
02/03/1997,09:23:00,3045.50,3046.00,3045.00,3045.00,77
02/03/1997,09:24:00,3045.00,3045.00,3045.00,3045.00,131
02/03/1997,09:25:00,3044.50,3044.50,3043.50,3043.50,138
02/03/1997,09:26:00,3043.50,3043.50,3043.50,3043.50,6
02/03/1997,09:27:00,3043.50,3043.50,3043.00,3043.00,56
02/03/1997,09:28:00,3043.00,3044.00,3043.00,3044.00,32
02/03/1997,09:29:00,3044.50,3044.50,3044.50,3044.50,63
02/03/1997,09:30:00,3045.00,3045.00,3045.00,3045.00,28
02/03/1997,09:31:00,3045.00,3045.50,3045.00,3045.50,75
02/03/1997,09:32:00,3045.50,3045.50,3044.00,3044.00,54
02/03/1997,09:33:00,3043.50,3044.50,3043.50,3044.00,96
02/03/1997,09:34:00,3044.00,3044.50,3044.00,3044.50,27
02/03/1997,09:35:00,3044.50,3044.50,3043.50,3044.50,44
02/03/1997,09:36:00,3044.00,3044.00,3043.00,3043.00,61
02/03/1997,09:37:00,3043.50,3043.50,3043.50,3043.50,18
Thanks for the time
If you just want to replace semicolons with commas you can use the built in string replace function.
line = '02/03/1997,09:24:00,3045.00,3045.00,3045.00,3045.00,131'
line = line.replace(':',',')
print(line)
Output
02/03/1997,09,04,00,3046.00,3048.50,3046.00,3047.50,505
Then split on commas to separate the data.
line.split(',')
If you only want the numerical values you could also do the following (using a regular expression):
import re
line = '02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505'
values = [float(x) for x in re.sub(r'[^\w.]+', ',', line).split(',')]
print values
Which gives you a list of numerical values that you can process.
[2.0, 3.0, 1997.0, 9.0, 4.0, 0.0, 3046.0, 3048.5, 3046.0, 3047.5, 505.0]
Use the csv module! :)
>>>import csv
>>> with open('myFile.csv', newline='') as csvfile:
... myCsvreader = csv.reader(csvfile, delimiter=',', quotechar='|')
... for row in myCsvreader:
... for item in row:
... item.spit(':') # Returns hours without semicolons
Once you extracted different time stamps, you can use the datetime module, such as:
from datetime import datetime, date, time
x = time(hour=9, minute=30, second=30)
y = time(hour=9, minute=30, second=42)
diff = datetime.combine(date.today(), y) - datetime.combine(date.today(), x)
print diff.total_seconds()

add_edges_from three tuples networkx

I am trying to use networkx to create a DiGraph. I want to use add_edges_from(), and I want the edges and their data to be generated from three tuples.
I am importing the data from a CSV file. I have three columns: one for ids (first set of nodes), one for a set of names (second set of nodes), and another for capacities (no headers in the file). So, I created a dictionary for the ids and capacities.
dictionary = dict(zip(id, capacity))
then I zipped the tuples containing the edges data:
List = zip(id, name, capacity)
but when I execute the next line, it gives me an assertion error.
G.add_edges_from(List, 'weight': 1)
Can someone help me with this problem? I have been trying for a week with no luck.
P.S. I'm a newbie in programming.
EDIT:
so, i found the following solution. I am honestly not sure how it works, but it did the job!
Here is the code:
import networkx as nx
import csv
G = nx.DiGraph()
capacity_dict = dict(zip(zip(id, name),capacity))
List = zip(id, name, capacity)
G.add_edges_from(capacity_dict, weight=1)
for u,v,d in List:
G[u][v]['capacity']=d
Now when I run:
G.edges(data=True)
The result will be:
[(2.0, 'First', {'capacity': 1.0, 'weight': 1}), (3.0, 'Second', {'capacity': 2.0, 'weight': 1})]
I am using the network simplex. Now, I am trying to find a way to make the output of the flowDict more understandable, because it is only showing the ids of the flow. (Maybe i'll try to input them in a database and return the whole row of data instead of using the ids only).
A few improvements on your version. (1) NetworkX algorithms assume that weight is 1 unless you specifically set it differently. Hence there is no need to set it explicitly in your case. (2) Using the generator allows the capacity attribute to be set explicitly and other attributes to also be set once per record. (3) The use of a generator to process each record as it comes through saves you having to iterate through the whole list twice. The performance improvement is probably negligible on small datasets but still it feels more elegant. Having said that -- your method clearly works!
import networkx as nx
import csv
# simulate a csv file.
# This makes a multi-line string behave as a file.
from StringIO import StringIO
filehandle = StringIO('''a,b,30
b,c,40
d,a,20
''')
# process each row in the file
# and generate an edge from each
def edge_generator(fh):
reader = csv.reader(fh)
for row in reader:
row[-1] = float(row[-1]) # convert capacity to float
# add other attributes to the dict() below as needed...
# e.g. you might add weights here as well.
yield (row[0],
row[1],
dict(capacity=row[2]))
# create the graph
G = nx.DiGraph()
G.add_edges_from(edge_generator(filehandle))
print G.edges(data=True)
Returns this:
[('a', 'b', {'capacity': 30.0}),
('b', 'c', {'capacity': 40.0}),
('d', 'a', {'capacity': 20.0})]