What's the most Pythonic way to identify consecutive duplicates in a list?

I've got a list of integers and I want to be able to identify contiguous blocks of duplicates: that is, I want to produce an order-preserving list of tuples where each tuple contains (int_in_question, number of occurrences).
For example, if I have a list like:
[0, 0, 0, 3, 3, 2, 5, 2, 6, 6]
I want the result to be:
[(0, 3), (3, 2), (2, 1), (5, 1), (2, 1), (6, 2)]
I have a fairly simple way of doing this with a for-loop, a temp, and a counter:
result_list = []
current = source_list[0]
count = 0
for value in source_list:
    if value == current:
        count += 1
    else:
        result_list.append((current, count))
        current = value
        count = 1
result_list.append((current, count))
But I really like Python's functional programming idioms, and I'd like to be able to do this with a simple generator expression. However, I find it difficult to keep sub-counts when working with generators. I have a feeling a two-step process might get me there, but for now I'm stumped.
Is there a particularly elegant/pythonic way to do this, especially with generators?

>>> from itertools import groupby
>>> L = [0, 0, 0, 3, 3, 2, 5, 2, 6, 6]
>>> grouped_L = [(k, sum(1 for i in g)) for k,g in groupby(L)]
>>> # Or (k, len(list(g))), but that creates an intermediate list
>>> grouped_L
[(0, 3), (3, 2), (2, 1), (5, 1), (2, 1), (6, 2)]
Batteries included, as they say.
The suggestion to use sum with a generator expression comes from JBernardo; see the comments.
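If you want to stay fully lazy, a small generator wrapper around the same groupby idea works too. This is just a sketch, assuming you only need to iterate over the (value, count) pairs once:

from itertools import groupby

def run_lengths(iterable):
    """Yield (value, run_length) for each block of consecutive duplicates."""
    for key, group in groupby(iterable):
        yield key, sum(1 for _ in group)

# list(run_lengths([0, 0, 0, 3, 3, 2, 5, 2, 6, 6]))
# -> [(0, 3), (3, 2), (2, 1), (5, 1), (2, 1), (6, 2)]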

Related

extract "columns" from a deque of lists

I have a long deque of lists of 4 elements.
How do I efficiently extract columns from it?
I am using a list comprehension now, as follows:
S=[s[0] for s in sample_D]
R=[s[2] for s in sample_D]
I am not sure if this is the most efficient way to do it.
Let's take an example:
>>> sample_D = [(i, i+1, i+2, i+3) for i in range(0, 1000, 4)]
>>> sample_D
[(0, 1, 2, 3), (4, 5, 6, 7), ..., (996, 997, 998, 999)]
The zip function is useful for transposing a matrix:
Returns an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables.
>>> list(zip(*sample_D))
[(0, 4, 8, ..., 988, 992, 996), (1, 5, ..., 993, 997), (2, 6, ..., 994, 998), (3, 7, ..., 995, 999)]
The list comprehension returns lists, while the zip approach returns tuples, but the content is the same:
>>> def using_list_comp(sample, indices):
...     return tuple([t[i] for t in sample] for i in indices)
>>> def using_zip(sample, indices):
...     z = list(zip(*sample))
...     return tuple(z[i] for i in indices)
>>> assert using_list_comp(sample_D, [0, 1, 2, 3]) == tuple(list(t) for t in using_zip(sample_D, [0, 1, 2, 3]))
If you need only one column, then the list comprehension is faster:
>>> import timeit
>>> timeit.timeit(lambda: using_list_comp(sample_D,[0]))
6.561095703000319
>>> timeit.timeit(lambda: using_zip(sample_D,[0]))
10.13769362000312
But if you need multiple columns, the zip method is faster:
>>> timeit.timeit(lambda: using_list_comp(sample_D,[0, 1, 2, 3]))
25.433326307000243
>>> timeit.timeit(lambda: using_zip(sample_D,[0, 1, 2, 3]))
10.10265000200161

Word Labels for Document Matrix in Gensim

My ultimate goal is to produce a *.csv file containing labeled binary term vectors for each document. In essence, a term document matrix.
Using gensim, I can produce a file with an unlabeled term matrix.
I do this by essentially copying and pasting code from here: http://radimrehurek.com/gensim/tut1.html
Given a list of documents called "texts".
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
To convert the above vectors into a sparse matrix, I use:
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
I then convert the sparse matrix to a dense array:
full_matrix = scipy_csc_matrix.toarray()
Finally, I output this to a file:
with open('file.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(full_matrix)
This produces a matrix of binary vectors, but I do not know which vector corresponds to which word. Is there a reliable way of matching words to vectors?
I've tried parsing the dictionary to create a list of words which I would glue onto the above full_matrix.
# Retrieve dictionary
tokenIDs = dictionary.token2id

# Retrieve keys from dictionary and concatenate those to full_matrix
dictlist = []
for key, value in tokenIDs.iteritems():
    temp1 = unicodedata.normalize('NFKD', key).encode('ascii', 'ignore')
    temp = [temp1]
    dictlist.append(temp)
Keys = np.asarray(dictlist)

# Combine Keys and Matrix
labeled_full_matrix = np.concatenate((Keys, full_matrix), axis=1)
However, this does not work. The word ids (Keys) are not matched to the appropriate vectors.
I am under the assumption a much simpler and more elegant approach is possible. But after some time, I haven't been able to find it. Maybe someone here can help, or point me to something fundamental I've missed.
Is this what you want?
%time lda1 = models.LdaModel(corpus1, num_topics=20, id2word=dictionary1, update_every=5, chunksize=10000, passes=100)
import pandas
mixture = [dict(lda1[x]) for x in corpus1]
pandas.DataFrame(mixture).to_csv("output.csv")
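If the goal is still the labeled term-document matrix from the question (rather than topic mixtures), one possible sketch, assuming the dictionary and full_matrix defined above and that every dictionary id actually occurs in the corpus, is to sort the tokens by their integer ids so they line up with the rows produced by corpus2csc:

import pandas as pd

# corpus2csc puts term id i in row i, so order the tokens by id to label the rows
terms = [token for token, idx in sorted(dictionary.token2id.items(), key=lambda kv: kv[1])]
labeled = pd.DataFrame(full_matrix, index=terms)  # rows: terms, columns: documents
labeled.to_csv("labeled_term_document_matrix.csv")  # hypothetical output file name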

Can somebody give me an example for the zip() function in Python?

In Python's documentation, it says the following about the zip function:
"The left-to-right evaluation order of the iterables is guaranteed. This makes possible an idiom for clustering a data series into n-length groups using zip(*[iter(s)]*n)."
I have difficulty understanding the zip(*[iter(s)]*n) idiom. Can anybody give me an example of when we should use that idiom?
Thank you very much!
I don't know what documentation you're using, but this version of the zip() documentation has this example:
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> zipped = zip(x, y)
>>> zipped
[(1, 4), (2, 5), (3, 6)]
>>> x2, y2 = zip(*zipped)
>>> x == list(x2) and y == list(y2)
True
It pairs the two lists up element by element, in order, and it also has an "unzip" feature.
And since you asked, here's a slightly more understandable example:
>>> friends = ["Amy", "Bob", "Cathy"]
>>> orders = ["Burger", "Pizza", "Hot dog"]
>>> friend_order_pairs = zip(friends, orders)
>>> friend_order_pairs
[("Amy", "Burger"), ("Bob", "Pizza"), ("Cathy", "Hot dog")]
It's 2020, but let me leave this here for reference.
The zip(*[iter(s)]*n) idiom is used to split a flat list into chunks.
For example:
>>> mylist = [1, 2, 3, 'a', 'b', 'c', 'first', 'second', 'third']
>>> list(zip(*[iter(mylist)]*3))
[(1, 2, 3), ('a', 'b', 'c'), ('first', 'second', 'third')]
The idiom is analyzed here.
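To see why the idiom works, here is a small sketch with the iterator written out explicitly; because zip receives the same iterator object three times, each output tuple consumes three consecutive items:

s = [1, 2, 3, 'a', 'b', 'c']
it = iter(s)                     # a single iterator over s
chunks = list(zip(it, it, it))   # zip pulls from the same iterator three times per tuple
# chunks == [(1, 2, 3), ('a', 'b', 'c')]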
zip() is for sticking two or more lists together.
names=['bob','tim','larry']
ages=[15,36,50]
zip(names,ages)
Out: [('bob', 15), ('tim', 36), ('larry', 50)]
I use it to create dictionaries when I have separate lists of keys and values:
>>> keys = ('pi', 'c', 'e')
>>> values = (3.14, 3*10**8, 1.6*10**-19)
>>> dict(zip(keys, values))
{'c': 300000000, 'pi': 3.14, 'e': 1.6000000000000002e-19}
Here is how to iterate over two lists and their indices using enumerate() together with zip():
alist = ['a1', 'a2', 'a3']
blist = ['b1', 'b2', 'b3']
for i, (a, b) in enumerate(zip(alist, blist)):
    print i, a, b
zip() basically combines two or more lists element-wise to form a list of tuples:
>>> alist = ['a1', 'a2', 'a3']
>>> blist = ['b1', 'b2', 'b3']
>>>
>>> zip(alist, blist)
[('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]
>>>
Use izip instead (Python 2).
When working with very large data sets, you can use itertools.izip, which evaluates results lazily, only when requested; this makes it great for memory management and often much faster. (In Python 3, the built-in zip is already lazy.) I usually use the generator-based variants of Python modules when possible.
imagine an example like this:
from itertools import islice,izip
w = xrange(9000000000000000000)
x = xrange(2000000000000000000)
y = xrange(9000000000000000000)
z = xrange(9000000000000000000)
# The following only returns a generator that holds an iterator for the first 100 items
# without loading that large mess of numbers into memory
first_100_items_generator = islice(izip(w,x,y,z), 100)
# Iterate through the generator and return only what you need - first 100 items
first_100_items = list(first_100_items_generator)
print(first_100_items)
Output:
[ (0, 0, 0, 0),
(1, 1, 1, 1),
(2, 2, 2, 2),
(3, 3, 3, 3),
(4, 4, 4, 4),
(5, 5, 5, 5),
(6, 6, 6, 6),
(7, 7, 7, 7),
(8, 8, 8, 8),
(9, 9, 9, 9),
(10, 10, 10, 10),
(11, 11, 11, 11)
...
...
]
So here I have four large ranges of numbers; I used izip to zip the values, then used islice to pick out the first 100 items.
The nice thing about xrange, izip and islice is that they are all lazy: nothing is evaluated until the final list() call consumes the iterator.
It's a bit of a digression into generators, but it's good to know when you start doing large data processing in Python.
Info on generators:
youtube
Generator intro

how to iterate through lists vertically?

I have multiple lists to work with. What I'm trying to do is take a certain index from every list (in this case indices 1, 2, and 3), forming a vertical column, and add those vertical numbers to an empty list.
line1=[1,2,3,4,5,5,6]
line2=[3,5,7,8,9,6,4]
line3=[5,6,3,7,8,3,7]
vlist1=[]
vlist2=[]
vlist3=[]
expected output
Vlist1=[1,3,5]
Vlist2=[2,5,6]
Vlist3=[3,7,3]
Having variables with numbers in them is often a design mistake. Instead, you should probably have a nested data structure. If you do that with your line1, line2 and line3 lists, you'd get a nested list:
lines = [[1,2,3,4,5,5,6],
[3,5,7,8,9,6,4],
[5,6,3,7,8,3,7]]
You can then "transpose" this list of lists with zip:
vlist = list(zip(*lines)) # note the list call is not needed in Python 2
Now you can access the inner lists (which are actually tuples now) by indexing or slicing into the transposed list.
first_three_vlists = vlist[:3]
In Python 3, zip returns an iterator, so you need to treat it like one:
from itertools import islice
vlist1,vlist2,vlist3 = islice(zip(line1,line2,line3),3)
But really you should keep your data out of your variable names. Use a list-of-lists data structure, and if you need to transpose it just do:
list(zip(*nested_list))
Out[13]: [(1, 3, 5), (2, 5, 6), (3, 7, 3), (4, 8, 7), (5, 9, 8), (5, 6, 3), (6, 4, 7)]
Use Python's zip() function and index accordingly.
>>> line1=[1,2,3,4,5,5,6]
>>> line2=[3,5,7,8,9,6,4]
>>> line3=[5,6,3,7,8,3,7]
>>> zip(line1,line2,line3)
[(1, 3, 5), (2, 5, 6), (3, 7, 3), (4, 8, 7), (5, 9, 8), (5, 6, 3), (6, 4, 7)]
Put your input lists into a list. Then to create the ith vlist, do something like this:
vlist[i] = []
for l in list_of_lists:
    vlist[i].append(l[i])
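As a minimal, runnable sketch of that loop with the lists from the question (the names list_of_lists, vlists and column are just for illustration):

line1 = [1, 2, 3, 4, 5, 5, 6]
line2 = [3, 5, 7, 8, 9, 6, 4]
line3 = [5, 6, 3, 7, 8, 3, 7]
list_of_lists = [line1, line2, line3]

vlists = []
for i in range(3):                            # the first three "vertical columns"
    column = [l[i] for l in list_of_lists]
    vlists.append(column)

print(vlists)  # [[1, 3, 5], [2, 5, 6], [3, 7, 3]]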

Slicing a list into a list of sub-lists [duplicate]

This question already has answers here:
How do I split a list into equally-sized chunks?
(66 answers)
Closed 6 years ago.
What is the simplest and reasonably efficient way to slice a list into a list of sub-list sections, for sub-lists of arbitrary length?
For example, if our source list is:
input = [1, 2, 3, 4, 5, 6, 7, 8, 9, ... ]
And our sub list length is 3 then we seek:
output = [ [1, 2, 3], [4, 5, 6], [7, 8, 9], ... ]
Likewise if our sub list length is 4 then we seek:
output = [ [1, 2, 3, 4], [5, 6, 7, 8], ... ]
[input[i:i+n] for i in range(0, len(input), n)] # Use xrange in py2k
where n is the length of a chunk.
Since you don't define what should happen to the final element of the new list when the number of elements in input is not divisible by n, I assumed it doesn't matter: with this approach the last chunk will have only 2 elements if n equals 7, for example.
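For instance, a quick sketch of that remainder behaviour with a 9-element list and n = 7:

input = [1, 2, 3, 4, 5, 6, 7, 8, 9]
n = 7
[input[i:i+n] for i in range(0, len(input), n)]
# -> [[1, 2, 3, 4, 5, 6, 7], [8, 9]]   # the trailing chunk has only 2 elements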
The documentation of the itertools module contains the following recipe:
import itertools

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)
This function returns an iterator of tuples of the desired length:
>>> list(grouper(2, [1,2,3,4,5,6,7]))
[(1, 2), (3, 4), (5, 6), (7, None)]
A really Pythonic variant (Python 3):
list(zip(*(iter([1,2,3,4,5,6,7,8,9]),)*3))
A list iterator is created and repeated three times in a tuple, which is then unpacked into zip and the result cast back to a list. zip pulls one value from each of its arguments, but since all three are the same iterator object, its position advances for all three at once.
I like SilentGhost's solution.
My solution uses functional programming in python:
group = lambda t, n: zip(*[t[i::n] for i in range(n)])
group([1, 2, 3, 4], 2)
gives:
[(1, 2), (3, 4)]
This assumes that the input list size is divisible by the group size. If not, unpaired elements will not be included.
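A short sketch of that edge case (wrapped in list() for Python 3, where zip is lazy):

group = lambda t, n: zip(*[t[i::n] for i in range(n)])
list(group([1, 2, 3, 4, 5], 2))
# -> [(1, 2), (3, 4)]   # the unpaired 5 is silently dropped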