Viz LDA model with Bokeh and T-sne

Viz LDA model with Bokeh and T-sne - python-2.7

I have tried to follow this tutorial (https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html) on visualizing LDA with t-SNE and Bokeh, but I ran into a problem.
When I try to run the following code:
plot_lda.scatter(x=tsne_lda[:, 0], y=tsne_lda[:, 1],
                 color=colormap[_lda_keys][:num_example],
                 source=bp.ColumnDataSource({
                     "content": text[:num_example],
                     "topic_key": _lda_keys[:num_example]
                 }))
NB: In the tutorial the content is called news; in mine it is called text.
I get this error:
Supplying a user-defined data source AND iterable values to glyph methods is
not possibe. Either:
Pass all data directly as literals:
p.circe(x=a_list, y=an_array, ...)
Or, put all data in a ColumnDataSource and pass column names:
source = ColumnDataSource(data=dict(x=a_list, y=an_array))
p.circe(x='x', y='x', source=source, ...)
To me this does not make much sense, and I have not succeeded in finding an answer to it here, on GitHub, or elsewhere. I hope that someone can help. Best, Niels

I've also been battling with that piece of code, and I found two problems with it.
First, when you pass a source to the scatter function, as the error states, you must include all the data in the dictionary, i.e., the x and y coordinates, colors, labels, and any other information that you want to show in the tooltip.
Second, the x and y arrays have a different shape than the information passed to the tooltip, so you also have to slice both coordinate arrays with the num_example variable.
The following code got me running:
# assumed aliases, as in the tutorial
import pandas as pd
import bokeh.plotting as bp

# create the dictionary with all the information
plot_dict = {
    'x': tsne_lda[:num_example, 0],
    'y': tsne_lda[:num_example, 1],
    'colors': colormap[_lda_keys][:num_example],
    'content': text[:num_example],
    'topic_key': _lda_keys[:num_example]
}
# create the dataframe from the dictionary
plot_df = pd.DataFrame.from_dict(plot_dict)
# declare the source
source = bp.ColumnDataSource(data=plot_df)
title = 'LDA viz'
# initialize bokeh plot
plot_lda = bp.figure(plot_width=1400, plot_height=1100,
                     title=title,
                     tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
                     x_axis_type=None, y_axis_type=None, min_border=1)
# draw the scatter glyphs from the columns of the source
plot_lda.scatter('x', 'y', color='colors', source=source)
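
The length mismatch described in the second point can be caught before anything is handed to Bokeh. The following is only a minimal sketch in plain Python: build_columns is a hypothetical helper (not part of Bokeh or the tutorial), and the short lists are toy stand-ins for tsne_lda, text and _lda_keys.

```python
def build_columns(num_example, **columns):
    """Slice every column to num_example rows and check that the lengths agree."""
    sliced = {name: list(values[:num_example]) for name, values in columns.items()}
    lengths = {name: len(values) for name, values in sliced.items()}
    if len(set(lengths.values())) > 1:
        # A data source needs columns of equal length, so fail early and loudly.
        raise ValueError("column lengths differ: %r" % lengths)
    return sliced


# Toy stand-ins (assumptions) for tsne_lda[:, 0], tsne_lda[:, 1], text and _lda_keys.
xs = [0.1, 0.4, 0.9, 1.3]
ys = [1.0, 0.2, 0.5, 0.7]
text = ["doc a", "doc b", "doc c"]
topic_keys = [2, 0, 1]

columns = build_columns(3, x=xs, y=ys, content=text, topic_key=topic_keys)
print(sorted(columns))
```

A dictionary built this way should be usable as the data of a ColumnDataSource, with the glyph referring to the columns by name as in the answer above.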

Related

Combine multiple (>2) survival curves (null models) in same plot

I am trying to combine multiple survfit objects in the same plot, using the function ggsurvplot_combine from the package survminer. When I make a list of 2 survfit objects, it works perfectly. But when I combine 3 survfit objects in different ways, I receive the error:
Error in `levels<-`(`*tmp*`, value = as.character(levels)) : factor level [3] is duplicated
I've read similar posts on combining survival plots (https://cran.r-project.org/web/packages/survminer/survminer.pdf, https://github.com/kassambara/survminer/issues/195, R plotting multiple survival curves in the same plot, https://rpkgs.datanovia.com/survminer/reference/ggsurvplot_combine.html) and on this specific error, for which solutions using 'unique' have been provided. However, I do not even understand which factor variable this error refers to. I do not have the right to share my data or figures, so I'll try to describe the setup:
Data:
time: follow-up time until event or end of follow-up
endpoints: 1 = event, 0 = no event or censored
Null models:
KM1 <- survfit(Surv(data$time1, data$endpoint1) ~ 1,
               type = "kaplan-meier", conf.type = "log", data = data)
KM2 <- survfit(Surv(data$time2, data$endpoint2) ~ 1,
               type = "kaplan-meier", conf.type = "log", data = data)
KM3 <- survfit(Surv(data$time3, data$endpoint3) ~ 1,
               type = "kaplan-meier", conf.type = "log", data = data)
List null models:
list_that_works <- list(KM1,KM3)
list_that_fails <- list(KM1,KM2,KM3)
It seems as if the list consists of just two arguments: list(PFS=, OS=)
Combine >2 null models in one plot:
ggsurvplot_combine(list_that_works, data=data, conf.int=TRUE, fun="event", combine=TRUE)
This gives the plot I'm looking for, but with 2 cumulative incidence curves.
ggsurvplot_combine(list_that_fails, data=data, conf.int=TRUE, fun="event", combine=TRUE)
This gives the error 'Error in `levels<-`(`*tmp*`, value = as.character(levels)) : factor level [3] is duplicated'.
When I try combining the 3 fits using
ggsurvplot(c(KM1,KM2,KM3), data=data, conf.int=TRUE, fun="event", combine=TRUE)
it gives the error:
Error: Problem with `mutate()` column `survsummary`.
i `survsummary = purrr::map2(grouped.d$fit, grouped.d$name, .surv_summary, data = data)`.
x $ operator is invalid for atomic vectors
Any help is highly appreciated!
Another way to combine survival fits is also very welcome!
My best guess is that it has something to do with the 'list' function consisting of only two arguments: list(PFS=, OS=)

I fixed it! Instead of removing the post, I'll share my solution, it may be of help for others:
I made a list of the formulas instead of the null models, so:
formulas <- list(
  KM1 = Surv(time1, endpoint1) ~ 1,
  KM2 = Surv(time2, endpoint2) ~ 1,
  KM3 = Surv(time3, endpoint3) ~ 1)
I made a null model of the 3 formulas at once:
fit <- surv_fit(formulas, data=data)
Then I made a plot with this survival fit:
ggsurvplot_combine(fit, data=data)

Decision tree classifier,multilabel output

Decision trees support multi-label classification, right? My y labels are of the form [['brufen','amoxil'],['brufen'],['xanex']]. According to the sklearn documentation, y can be a list of lists of labels, so why does it give me an "unknown label type" error?
The error goes away when all the lists have the same length, but how else should I handle this problem apart from one-hot encoding the labels?

You need to convert the labels to label-indicator format first. Then you can use them with decision trees.
For converting, you can use MultiLabelBinarizer.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_converted = mlb.fit_transform([['brufen','amoxil'], ['brufen'], ['xanex']])
# Output: array([[1, 1, 0],
#                [0, 1, 0],
#                [0, 0, 1]])
mlb.classes_
# Output: array(['amoxil', 'brufen', 'xanex'], dtype=object)
Now use this y_converted instead of the original y in the decision tree.
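
To make the label-indicator format concrete, here is a plain-Python sketch of the same conversion. It is not scikit-learn's implementation, and binarize_labels is a hypothetical name used only for illustration; it simply assigns one column per distinct label (in sorted order) and marks which labels each sample carries.

```python
def binarize_labels(label_lists):
    """Turn lists of labels into (classes, rows of 0/1 indicators)."""
    # One column per distinct label, in sorted order.
    classes = sorted({label for labels in label_lists for label in labels})
    column_of = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column_of[label]] = 1
        rows.append(row)
    return classes, rows


classes, y_indicator = binarize_labels([['brufen', 'amoxil'], ['brufen'], ['xanex']])
print(classes)      # ['amoxil', 'brufen', 'xanex']
print(y_indicator)  # [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
```

Every row has the same width regardless of how many labels the sample has, which is why the inconsistent-length problem from the question disappears.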

Based on the information here: https://scikit-learn.org/stable/modules/multiclass.html#multioutputclassifier
You can use sklearn.multioutput.MultiOutputClassifier with a decision tree to get multi-label behavior. If I understand correctly, it works by internally creating a separate tree for each label.

Graphing multiple data sets using function to extract data from dictionary (matplotlib)

I would like to plot select data from a dictionary of the following format:
dictdata = {"key_A": [(1,2),(1,3)], "key_B": [(3,2),(2,3)], "key_C": [(4,2),(1,4)]}
I am using the following function to extract data corresponding to a specific key and then separate the x and y values into two lists which can be plotted.
def plot_dictdata(ax1, key):
    data = list()
    data.append(dictdata[key])
    for list_of_points in data:
        for point in list_of_points:
            x = point[0]
            y = point[1]
    ax1.scatter(x,y)
I'd like to be able to call this function multiple times (see code below) and have all relevant sets of data appear on the same graph. However, the final plot only shows the last set of data. How can I graph all sets of data on the same graph without clearing the previous set of data?
fig, ax1 = plt.subplots()
plot_dictdata(ax1, "key_A")
plot_dictdata(ax1, "key_B")
plot_dictdata(ax1, "key_C")
plt.show()
I have only just started using matplotlib, and wasn't able to figure out a solution using the following examples discussing related problems. Thank you in advance.
how to add a plot on top of another plot in matplotlib?
How to draw multiple line graph by using matplotlib in Python
Plotting a continuous stream of data with MatPlotLib

The problem may be at a different point than you think. The reason you only get the last point plotted is that in each loop step x and y are reassigned, so that at the end of the loop each of them contains a single value.
As a solution, you might want to append the values to a list, like
import matplotlib.pyplot as plt
dictdata = {"key_A": [(1,2),(1,3)], "key_B": [(3,2),(2,3)], "key_C": [(4,2),(1,4)]}
def plot_dictdata(ax1, key):
    data = list()
    data.append(dictdata[key])
    x=[];y=[]
    for list_of_points in data:
        for point in list_of_points:
            x.append(point[0])
            y.append(point[1])
    ax1.scatter(x,y)
fig, ax1 = plt.subplots()
plot_dictdata(ax1, "key_A")
plot_dictdata(ax1, "key_B")
plot_dictdata(ax1, "key_C")
plt.show()
resulting in a single plot that shows the points of all three data sets.
It would be worth noting that the plot_dictdata function could be simplified a lot, giving the same result as the above:
def plot_dictdata(ax1, key):
    x,y = zip(*dictdata[key])
    ax1.scatter(x,y)
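
The zip(*...) idiom used in the simplified function can be tried on its own, without matplotlib. A small sketch, where split_points is a hypothetical helper name: unpacking the list of (x, y) pairs into zip regroups them into one tuple of x values and one tuple of y values.

```python
dictdata = {"key_A": [(1, 2), (1, 3)], "key_B": [(3, 2), (2, 3)], "key_C": [(4, 2), (1, 4)]}


def split_points(points):
    """Split a list of (x, y) pairs into a list of x values and a list of y values."""
    x, y = zip(*points)  # zip(*[(1, 2), (1, 3)]) yields (1, 1) and (2, 3)
    return list(x), list(y)


for key in sorted(dictdata):
    print(key, split_points(dictdata[key]))
```

Note that zip(*points) needs at least one point; for an empty list the unpacking into x and y would fail.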

Python matplotlib insert index in plot

I am trying to save multiple plots, one generated after every iteration of a for loop, and I want to put a label on each plot, like a header, with the number of iterations done. The code looks like this. I tried suptitle, but it does not work.
for i in range(steps):
    nor_m = matplotlib.colors.Normalize(vmin = 0, vmax = 1)
    plt.hexbin(xxx,yyy,C, gridsize=13, cmap=matplotlib.cm.rainbow, norm=nor_m, edgecolors= 'k', extent=[-1,12,-1,12])
    plt.draw()
    plt.suptitle('frame'%i, fontsize=12)
    savefig("flie%d.png"%i)

What about plt.title?
for i in range(steps):
    nor_m = matplotlib.colors.Normalize(vmin=0, vmax=1)
    plt.hexbin(xxx, yyy, C, gridsize=13, cmap=matplotlib.cm.rainbow, norm=nor_m, edgecolors= 'k', extent=[-1,12,-1,12])
    plt.title('frame %d'%i, fontsize=12)
    plt.savefig("flie%d.png"%i)
You also had an error in the string formatting of the title call. In fact, 'frame'%i should have failed with a "TypeError: not all arguments converted during string formatting" error.
Note also that you don't need plt.draw, since drawing is triggered by plt.savefig.
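
The formatting error mentioned above can be reproduced without matplotlib. A minimal sketch, with frame_title as a hypothetical helper name: a format string without a placeholder has nowhere to put the value, while one with %d does.

```python
def frame_title(i):
    """Build the per-iteration title; the %d placeholder receives the frame index."""
    return 'frame %d' % i


# Without a placeholder the argument cannot be consumed, so Python raises TypeError.
try:
    'frame' % 3
except TypeError as err:
    print('TypeError:', err)

print([frame_title(i) for i in range(3)])
```

The same string could then be passed to plt.title inside the loop.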

add_edges_from three tuples networkx

I am trying to use networkx to create a DiGraph. I want to use add_edges_from(), and I want the edges and their data to be generated from three tuples.
I am importing the data from a CSV file. I have three columns: one for ids (first set of nodes), one for a set of names (second set of nodes), and another for capacities (no headers in the file). So, I created a dictionary for the ids and capacities.
dictionary = dict(zip(id, capacity))
Then I zipped the tuples containing the edge data:
List = zip(id, name, capacity)
But when I execute the next line, it gives me an assertion error.
G.add_edges_from(List, 'weight': 1)
Can someone help me with this problem? I have been trying for a week with no luck.
P.S. I'm a newbie in programming.
EDIT:
So, I found the following solution. I am honestly not sure how it works, but it did the job!
Here is the code:
import networkx as nx
import csv
G = nx.DiGraph()
capacity_dict = dict(zip(zip(id, name),capacity))
List = zip(id, name, capacity)
G.add_edges_from(capacity_dict, weight=1)
for u,v,d in List:
    G[u][v]['capacity']=d
Now when I run:
G.edges(data=True)
The result will be:
[(2.0, 'First', {'capacity': 1.0, 'weight': 1}), (3.0, 'Second', {'capacity': 2.0, 'weight': 1})]
I am using the network simplex algorithm. Now I am trying to find a way to make the output of the flowDict more understandable, because it only shows the ids of the flow. (Maybe I'll try to put them in a database and return the whole row of data instead of using only the ids.)

A few improvements on your version. (1) NetworkX algorithms assume that weight is 1 unless you specifically set it differently. Hence there is no need to set it explicitly in your case. (2) Using the generator allows the capacity attribute to be set explicitly and other attributes to also be set once per record. (3) The use of a generator to process each record as it comes through saves you having to iterate through the whole list twice. The performance improvement is probably negligible on small datasets but still it feels more elegant. Having said that -- your method clearly works!
import networkx as nx
import csv
# simulate a csv file.
# This makes a multi-line string behave as a file.
# (Python 2 import; on Python 3 use `from io import StringIO`.)
from StringIO import StringIO
filehandle = StringIO('''a,b,30
b,c,40
d,a,20
''')
# process each row in the file
# and generate an edge from each
def edge_generator(fh):
    reader = csv.reader(fh)
    for row in reader:
        row[-1] = float(row[-1]) # convert capacity to float
        # add other attributes to the dict() below as needed...
        # e.g. you might add weights here as well.
        yield (row[0],
               row[1],
               dict(capacity=row[2]))
# create the graph
G = nx.DiGraph()
G.add_edges_from(edge_generator(filehandle))
print G.edges(data=True)
Returns this:
[('a', 'b', {'capacity': 30.0}),
('b', 'c', {'capacity': 40.0}),
('d', 'a', {'capacity': 20.0})]
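
For readers on Python 3, the generator part can be tried without NetworkX installed. This is only a sketch under that assumption, and edge_triples is a hypothetical name: it yields (source, target, attribute dict) 3-tuples, which is the shape add_edges_from accepts for edges with data. A bare number in the third position, as produced by zip(id, name, capacity) in the question, is not an attribute dictionary.

```python
import csv
from io import StringIO


def edge_triples(fh):
    """Yield (source, target, attribute_dict) for each CSV row of the form u,v,capacity."""
    for source, target, capacity in csv.reader(fh):
        # The third element must be a dict of edge attributes, not a bare number.
        yield source, target, {"capacity": float(capacity)}


# A multi-line string standing in for the CSV file.
rows = StringIO("a,b,30\nb,c,40\nd,a,20\n")
edges = list(edge_triples(rows))
print(edges)
```

The resulting iterable could be handed to G.add_edges_from in place of edge_generator(filehandle).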
