Word Labels for Document Matrix in Gensim - python-2.7

My ultimate goal is to produce a *.csv file containing labeled binary term vectors for each document. In essence, a term document matrix.
Using gensim, I can produce a file with an unlabeled term matrix.
I do this by essentially copying and pasting code from here: http://radimrehurek.com/gensim/tut1.html
Given a list of documents called "texts":
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
To convert the above vectors into a SciPy sparse matrix, I use:
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
I then convert the sparse matrix to a full (dense) NumPy array:
full_matrix = csc_matrix(scipy_csc_matrix).toarray()
Finally, I output this to a file:
with open('file.csv','wb') as f:
    writer = csv.writer(f)
    writer.writerows(full_matrix)
This produces a matrix of term vectors, but I do not know which vector represents which word. Is there an accurate way of matching words to vectors?
I've tried parsing the dictionary to create a list of words, which I would then glue to the above full_matrix.
# Retrieve the token -> id mapping from the dictionary
tokenIDs = dictionary.token2id
# Retrieve keys from the dictionary and concatenate them to full_matrix
dictlist = []
for key, value in tokenIDs.iteritems():
    temp1 = unicodedata.normalize('NFKD', key).encode('ascii','ignore')
    temp = [temp1]
    dictlist.append(temp)
Keys = np.asarray(dictlist)
# Combine Keys and Matrix
labeled_full_matrix = np.concatenate((Keys, full_matrix), axis=1)
However, this does not work. The word ids (Keys) are not matched to the appropriate vectors.
I am under the assumption a much simpler and more elegant approach is possible. But after some time, I haven't been able to find it. Maybe someone here can help, or point me to something fundamental I've missed.

Is this what you want?
%time lda1 = models.LdaModel(corpus1, num_topics=20, id2word=dictionary1, update_every=5, chunksize=10000, passes=100)
import pandas
mixture = [dict(lda1[x]) for x in corpus1]
pandas.DataFrame(mixture).to_csv("output.csv")
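For the labelling itself, one possible approach is to order the words by their integer id and use them as a CSV header, so that each column lines up with the id used in the bag-of-words vectors. Iterating over token2id yields words in arbitrary order rather than id order, which is probably why the labels in the question did not match. The sketch below is plain Python and does not call gensim: it assumes token2id looks like dictionary.token2id (word to id, with contiguous ids starting at 0) and that corpus is a list of (id, count) lists as printed in the question. The helper name labeled_rows and the sample data are made up for illustration.

```python
def labeled_rows(corpus, token2id):
    """Build a header row of words followed by one dense count row per document."""
    # Order words by their integer id so that column k corresponds to id k.
    # Assumption: ids are contiguous and start at 0.
    words = [word for word, _ in sorted(token2id.items(), key=lambda item: item[1])]
    rows = [["doc"] + words]
    for doc_no, bow in enumerate(corpus):
        counts = dict(bow)  # id -> count for this document
        rows.append([doc_no] + [counts.get(i, 0) for i in range(len(words))])
    return rows


# Tiny made-up example in the same shape as the question's corpus.
token2id = {"interface": 1, "human": 0, "computer": 2}
corpus = [[(0, 1), (1, 1), (2, 1)], [(1, 2)]]
rows = labeled_rows(corpus, token2id)
```

The resulting rows can be handed to csv.writer(f).writerows(rows). Note that this layout has one row per document, whereas corpus2csc returns a matrix with one column per document.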

Related

how to sort list in python which has two numbers per index value?

My code
b=[((1,1)),((1,2)),((2,1)),((2,2)),((1,3))]
for i in range(len(b)):
    print b[i]
Obtained output:
(1, 1)
(1, 2)
(2, 1)
(2, 2)
(1, 3)
How do I sort this list by the first and/or second element of each tuple to get this output:
(1, 1)
(1, 2)
(1, 3)
(2, 1)
(2, 2)
OR
(1, 1)
(2, 1)
(1, 2)
(2, 2)
(1, 3)
It would be nice if both columns were sorted as shown in the desired output; however, if either of the output columns is sorted, it will suffice.
Try this: b = sorted(b, key = lambda i: (i[0], i[1]))
The sorted builtin does this.
>>> sorted (b)
[(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]
This sorts by the first element and then by the second, since tuples compare element by element. To sort on the second element only:
>>> sorted(b, key=lambda i: i[1])
[(1, 1), (2, 1), (1, 2), (2, 2), (1, 3)]
Also notice that the extra parentheses do not create a nested tuple: ((1,1)) is just (1,1), because redundant parentheses only group an expression.
>>> b=[((1,1)),((1,2)),((2,1)),((2,2)),((1,3))]
>>> b
[(1, 1), (1, 2), (2, 1), (2, 2), (1, 3)]
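To get both desired orderings with ties broken by the other column, a small sketch using operator.itemgetter could look like this (the variable names are only for illustration):

```python
from operator import itemgetter

b = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3)]

# Default tuple comparison: first element, then second.
by_first_then_second = sorted(b)

# Reverse the priority: second element, then first.
by_second_then_first = sorted(b, key=itemgetter(1, 0))
```

The second call reproduces the alternative output from the question, with the second column sorted and the first column used as a tie-breaker.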

Replacing the values of `edgelist` with those of a `labels` dictionary

I am new to both Python and NetworkX. I have a square, regular graph G with NxN nodes (a lattice). Such nodes are labelled by means of a dict (see code below). Now I want the edgelist to return the start and endpoint of each edge not by referring to the node coordinates but to the label the node has been given.
Example:
N = 3
G=nx.grid_2d_graph(N,N)
labels = dict( ((i, j), i + (N-1-j) * N ) for i, j in G.nodes() )
#This gives nodes an attribute ID that is identical to their labels
for (i,j) in labels:
    G.node[(i,j)]['ID'] = labels[(i,j)]
edgelist=G.edges() #This gives the list of all edges in the format (Start XY, End XY)
If I run it with N=3 I get:
In [14]: labels
Out[14]: {(0, 0): 6, (0, 1): 3, (0, 2): 0, (1, 0): 7, (1, 1): 4, (1, 2): 1, (2, 0): 8, (2, 1): 5, (2, 2): 2}
This scheme labels the upper left node as 0, with the last node (N*N-1) placed in the lower right corner. And this is what I want. Now the problem with edgelist:
In [15]: edgelist
Out [15]: [((0, 1), (0, 0)), ((0, 1), (1, 1)), ((0, 1), (0, 2)), ((1, 2), (1, 1)), ((1, 2), (0, 2)), ((1, 2), (2, 2)), ((0, 0), (1, 0)), ((2, 1), (2, 0)), ((2, 1), (1, 1)), ((2, 1), (2, 2)), ((1, 1), (1, 0)), ((2, 0), (1, 0))]
I tried to solve the problem with these lines (inspiration from here: Replace items in a list using a dictionary):
allKeys = {}
for subdict in (labels):
    allKeys.update(subdict)
new_edgelist = [allKeys[edge] for edge in edgelist]
but I get this wonderful thing which enlightens my Monday:
TypeError: cannot convert dictionary update sequence element #0 to a sequence
To sum up, I want to be able to replace the elements of the edgelist list with the values of the labels dictionary so that, say, the edge ((2,0),(1,0)) (which corresponds to nodes 8 and 7) is returned as (8,7). Endless thanks!
I believe what you are looking for is simply nx.relabel_nodes(G,labels,False). The third argument is copy; passing False relabels the nodes of G in place.
Here is the output when I printed the nodes of G before and after calling the relabel nodes function.
# Before relabel_nodes
[(0, 1), (1, 0), (0, 0), (1, 1)]
# After relabel_nodes
[0, 1, 2, 3]
After doing this, the edges automatically use the new labels, which is what you expect.
# Edges before relabelling nodes
[((0, 1), (0, 0)), ((0, 1), (1, 1)), ((1, 0), (0, 0)), ((1, 0), (1, 1))]
# Edges after relabelling nodes
[(0, 1), (0, 2), (1, 3), (2, 3)]
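If the graph itself should keep its coordinate node names and only the edge list needs translating, a plain lookup per endpoint may be enough. Since labels is already a flat dict, no merging step is needed. The sketch below rebuilds the question's labels without NetworkX (using itertools.product in place of G.nodes(), which is an assumption made to keep it self-contained) and translates a few of the edges shown above:

```python
from itertools import product

N = 3
# Same labelling scheme as in the question: (i, j) -> i + (N-1-j) * N
labels = {(i, j): i + (N - 1 - j) * N for i, j in product(range(N), repeat=2)}

# A few edges in the (start XY, end XY) format returned by G.edges()
edgelist = [((0, 1), (0, 0)), ((2, 1), (2, 0)), ((2, 0), (1, 0))]

# Look up the label of each endpoint.
new_edgelist = [(labels[u], labels[v]) for u, v in edgelist]
```

With these inputs the edge ((2, 0), (1, 0)) becomes (8, 7), as requested in the question.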
Also, I have replied to this question in the chat that you created but it seems you were not notified.

convert a list of x and y coordinates into multistring

I have a set of x and y coordinates as follows:
x = (1,1,2,2,3,4)
y= (0,1,2,3,4,5)
What is the best way to transform these into a multi-line-string format, i.e. a list of consecutive point pairs, e.g.:
x_y = [((1,0),(1,1)),((1,1),(2,2)),((2,2),(2,3)),((2,3),(3,4)),((3,4),(4,5))]
You can pair up the elements of x and y with zip():
>>> x = (1,1,2,2,3,4)
>>> y = (0,1,2,3,4,5)
>>> xy = zip(x, y)
>>> xy
[(1, 0), (1, 1), (2, 2), (2, 3), (3, 4), (4, 5)]
Then you can rearrange this into the kind of list in your example with a list comprehension:
>>> x_y = [(xy[i], xy[i+1]) for i in xrange(len(xy)-1)]
>>> x_y
[((1, 0), (1, 1)), ((1, 1), (2, 2)), ((2, 2), (2, 3)), ((2, 3), (3, 4)), ((3, 4), (4, 5))]
If you don't care about efficiency, the second part could also be written as:
>>> x_y = zip(xy, xy[1:])
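The session above relies on Python 2, where zip() returns a list and xrange exists. A version that should behave the same under Python 3 wraps zip() in list() so the points can be sliced; the names points and segments are only illustrative:

```python
x = (1, 1, 2, 2, 3, 4)
y = (0, 1, 2, 3, 4, 5)

# Materialize the pairs so they can be sliced.
points = list(zip(x, y))

# Pair each point with its successor to get the line segments.
segments = list(zip(points, points[1:]))
```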

how to iterate through lists vertically?

I have multiple lists to work with. What I'm trying to do is take the elements at a given position from every list (in this case the 1st, 2nd, and 3rd positions), as a vertical column, and add those numbers to an empty list.
line1=[1,2,3,4,5,5,6]
line2=[3,5,7,8,9,6,4]
line3=[5,6,3,7,8,3,7]
vlist1=[]
vlist2=[]
vlist3=[]
Expected output:
vlist1=[1,3,5]
vlist2=[2,5,6]
vlist3=[3,7,3]
Having variables with numbers in them is often a design mistake. Instead, you should probably have a nested data structure. If you do that with your line1, line2 and line3 lists, you'd get a nested list:
lines = [[1,2,3,4,5,5,6],
         [3,5,7,8,9,6,4],
         [5,6,3,7,8,3,7]]
You can then "transpose" this list of lists with zip:
vlist = list(zip(*lines)) # note the list call is not needed in Python 2
Now you can access the inner lists (which are actually tuples now) by indexing or slicing into the transposed list.
first_three_vlists = vlist[:3]
In Python 3, zip returns an iterator, so you need to treat it like one:
from itertools import islice
vlist1,vlist2,vlist3 = islice(zip(line1,line2,line3),3)
But really you should keep your data out of your variable names. Use a list-of-lists data structure, and if you need to transpose it just do:
list(zip(*nested_list))
Out[13]: [(1, 3, 5), (2, 5, 6), (3, 7, 3), (4, 8, 7), (5, 9, 8), (5, 6, 3), (6, 4, 7)]
Use Python's zip() function and index accordingly.
>>> line1=[1,2,3,4,5,5,6]
>>> line2=[3,5,7,8,9,6,4]
>>> line3=[5,6,3,7,8,3,7]
>>> zip(line1,line2,line3)
[(1, 3, 5), (2, 5, 6), (3, 7, 3), (4, 8, 7), (5, 9, 8), (5, 6, 3), (6, 4, 7)]
Put your input lists into a list. Then to create the ith vlist, do something like this:
vlist[i] = []
for l in list_of_lists:
    vlist[i].append(l[i])
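Putting the suggestions together, a short sketch that yields real lists (rather than tuples) for the first three columns might look like this; the name columns is just for illustration:

```python
lines = [[1, 2, 3, 4, 5, 5, 6],
         [3, 5, 7, 8, 9, 6, 4],
         [5, 6, 3, 7, 8, 3, 7]]

# zip(*lines) walks the lists "vertically"; convert each column tuple to a list.
columns = [list(col) for col in zip(*lines)]

# Take the first three columns, as in the expected output.
vlist1, vlist2, vlist3 = columns[:3]
```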

What's the most Pythonic way to identify consecutive duplicates in a list?

I've got a list of integers and I want to be able to identify contiguous blocks of duplicates: that is, I want to produce an order-preserving list of pairs, where each pair contains (int_in_question, number of occurrences).
For example, if I have a list like:
[0, 0, 0, 3, 3, 2, 5, 2, 6, 6]
I want the result to be:
[(0, 3), (3, 2), (2, 1), (5, 1), (2, 1), (6, 2)]
I have a fairly simple way of doing this with a for-loop, a temp, and a counter:
result_list = []
current = source_list[0]
count = 0
for value in source_list:
    if value == current:
        count += 1
    else:
        result_list.append((current, count))
        current = value
        count = 1
result_list.append((current, count))
But I really like python's functional programming idioms, and I'd like to be able to do this with a simple generator expression. However I find it difficult to keep sub-counts when working with generators. I have a feeling a two-step process might get me there, but for now I'm stumped.
Is there a particularly elegant/pythonic way to do this, especially with generators?
>>> from itertools import groupby
>>> L = [0, 0, 0, 3, 3, 2, 5, 2, 6, 6]
>>> grouped_L = [(k, sum(1 for i in g)) for k,g in groupby(L)]
>>> # Or (k, len(list(g))), but that creates an intermediate list
>>> grouped_L
[(0, 3), (3, 2), (2, 1), (5, 1), (2, 1), (6, 2)]
Batteries included, as they say.
Suggestion for using sum and generator expression from JBernardo; see comment.
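Since the question asks about generators specifically, the same groupby idea can be wrapped in a generator function that yields one pair per run lazily. This is only a sketch, and the name run_lengths is made up here:

```python
from itertools import groupby


def run_lengths(values):
    """Yield (value, run_length) for each block of consecutive equal values."""
    for key, group in groupby(values):
        # Count the items in the run without building an intermediate list.
        yield key, sum(1 for _ in group)


result = list(run_lengths([0, 0, 0, 3, 3, 2, 5, 2, 6, 6]))
```

Unlike the for-loop version in the question, this also handles an empty input, for which it simply yields nothing.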