I am looking for an algorithm to efficiently generate all three-value combinations of a dataset by picking 6 values at a time.
I am looking for an algorithm to efficiently generate a small set of 6-tuples that cumulatively express all possible 3-tuple combinations of a dataset.
For instance, computing playing-card hands of 6 cards that express all possible 3 card combinations.
For example, given a dataset:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
The first "pick" of 6 values might be:
['a','b','c','d','e','f']
And this covers the three-value combinations:
('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'b', 'e'), ('a', 'b', 'f'), ('a', 'c', 'd'),
('a', 'c', 'e'), ('a', 'c', 'f'), ('a', 'd', 'e'), ('a', 'd', 'f'), ('a', 'e', 'f'),
('b', 'c', 'd'), ('b', 'c', 'e'), ('b', 'c', 'f'), ('b', 'd', 'e'), ('b', 'd', 'f'),
('b', 'e', 'f'), ('c', 'd', 'e'), ('c', 'd', 'f'), ('c', 'e', 'f'), ('d', 'e', 'f')
It is obviously possible by:
computing all triplet combinations
picking 6 values
computing all triplet combinations for those 6 values
removing these combinations from the first computation
repeating until all have been accounted for
In this example there are (26*25*24)/(3*2*1) == 2600 possible triplet combinations, and using the "brute-force" method above, all of them can be represented in around 301 6-value groups.
However, it feels like there ought to be a more efficient way of achieving this.
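As a rough yardstick for "more efficient", a simple counting argument gives a floor on the number of groups: each 6-value group covers at most 20 triplets, so at least 2600 / 20 = 130 groups are needed. This is only a counting bound, not a claim that 130 groups are achievable. A minimal Python 3.8+ sketch of the arithmetic (the variable names are illustrative):

```python
from math import comb

n_items, group_size, subset_size = 26, 6, 3

total_triplets = comb(n_items, subset_size)          # 26 choose 3
triplets_per_group = comb(group_size, subset_size)   # 6 choose 3
# Ceiling division: no cover can use fewer groups than this.
lower_bound = -(-total_triplets // triplets_per_group)

print(total_triplets, triplets_per_group, lower_bound)  # 2600 20 130
```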
My preferred language is Python, but I'm planning on implementing this in C++.
Update
Here's my python code to "brute-force" it:
from itertools import combinations
data_set = list('abcdefghijklmnopqrstuvwxyz')
def calculate(data_set):
    # every triplet as a frozenset, so element order does not matter
    all_triplets = list(frozenset(x) for x in combinations(data_set, 3))
    data = set(all_triplets)
    sextuples = []
    while data:
        # greedily merge uncovered triplets until the group holds 6 values
        sxt = set()
        for item in data:
            nxt = sxt | item
            if len(nxt) > 6:
                continue
            sxt = nxt
            if len(nxt) == 6:
                break
        sextuples.append(list(sxt))
        # drop every triplet that this group covers
        covers = set(frozenset(x) for x in combinations(list(sxt), 3))
        data = data - covers
        print("%r\t%s" % (list(sxt), len(data)))
    print("Completed %s triplets in %s sextuples" % (len(all_triplets), len(sextuples)))
calculate(data_set)
Completed 2600 triplets in 301 sextuples
I'm looking for something more computationally efficient than this.
Update
Senderle has provided an interesting solution: to divide the dataset into pairs then generate all possible triplets of the pairs. This is definitely better than anything I'd come up with.
Here's a quick function to check whether all triplets are covered and assess the redundancy of triplet coverage:
from itertools import combinations
def check_coverage(data_set,sextuplets):
    # count how often each triplet is covered; this assumes every
    # sextuplet lists its items in the same order as data_set
    all_triplets = dict.fromkeys(combinations(data_set,3),0)
    sxt_count = 0
    for sxt in sextuplets:
        sxt_count += 1
        for triplet in combinations(sxt,3):
            all_triplets[triplet] += 1
    total = len(all_triplets)
    biggest_overlap = overlap = nohits = onehits = morehits = 0
    for k,v in all_triplets.items():
        if v == 0:
            nohits += 1
        elif v == 1:
            onehits += 1
        else:
            morehits += 1
            overlap += v - 1
        if v > biggest_overlap:
            biggest_overlap = v
    print("All Triplets in dataset: %6d" % (total,))
    print("Total triplets from sxt: %6d" % (total + overlap,))
    print("Number of sextuples: %6d\n" % (sxt_count,))
    print("Missed %6d of %6d: %6.1f%%" % (nohits,total,100.0*nohits/total))
    print("HitOnce %6d of %6d: %6.1f%%" % (onehits,total,100.0*onehits/total))
    print("HitMore %6d of %6d: %6.1f%%" % (morehits,total,100.0*morehits/total))
    print("Overlap %6d of %6d: %6.1f%%" % (overlap,total,100.0*overlap/total))
    print("Biggest Overlap: %3d" % (biggest_overlap,))
Using Senderle's sextuplets generator I'm fascinated to observe that the repeated triplets are localised and as the datasets increase in size, the repeats become proportionally more localised and the peak repeat larger.
>>> check_coverage(range(26),sextuplets(range(26)))
All Triplets in dataset: 2600
Total triplets from sxt: 5720
Number of sextuples: 286
Missed 0 of 2600: 0.0%
HitOnce 2288 of 2600: 88.0%
HitMore 312 of 2600: 12.0%
Overlap 3120 of 2600: 120.0%
Biggest Overlap: 11
>>> check_coverage(range(40),sextuplets(range(40)))
All Triplets in dataset: 9880
Total triplets from sxt: 22800
Number of sextuples: 1140
Missed 0 of 9880: 0.0%
HitOnce 9120 of 9880: 92.3%
HitMore 760 of 9880: 7.7%
Overlap 12920 of 9880: 130.8%
Biggest Overlap: 18
>>> check_coverage(range(80),sextuplets(range(80)))
All Triplets in dataset: 82160
Total triplets from sxt: 197600
Number of sextuples: 9880
Missed 0 of 82160: 0.0%
HitOnce 79040 of 82160: 96.2%
HitMore 3120 of 82160: 3.8%
Overlap 115440 of 82160: 140.5%
Biggest Overlap: 38
I believe the following produces correct results. It relies on the intuition that to generate all necessary sextuplets, all that is necessary is to generate all possible combinations of arbitrary pairs of items. This "mixes" values together well enough that all possible triplets are represented.
There's a slight wrinkle. For an odd number of items, one pair isn't a pair at all, so you can't generate a sextuplet from it, but the value still needs to be represented. This does some gymnastics to sidestep that problem; there might be a better way, but I'm not sure what it is.
try:
    from itertools import izip_longest  # Python 2
except ImportError:
    from itertools import zip_longest as izip_longest  # Python 3
from itertools import combinations
def sextuplets(seq, _fillvalue=object()):
    if len(seq) < 6:
        # too few items for a full sextuplet: yield them as one short group
        yield tuple(seq)
        return
    it = iter(seq)
    # consecutive pairs; an odd-length seq pads the last pair with _fillvalue
    pairs = izip_longest(it, it, fillvalue=_fillvalue)
    sextuplets = (a + b + c for a, b, c in combinations(pairs, 3))
    for st in sextuplets:
        if st[-1] is _fillvalue:
            # replace fill value with valid item not in sextuplet
            # while maintaining original order
            for i, (x, y) in enumerate(zip(st, seq)):
                if x != y:
                    st = st[0:i] + (y,) + st[i:-1]
                    break
        yield st
I tested it on sequences of items of length 10 to 80, and it generates correct results in all cases. I don't have a proof that this will give correct results for all sequences though. I also don't have a proof that this is a minimal set of sextuplets. But I'd love to hear a proof of either, if anyone can come up with one.
>>> def gen_triplets_from_sextuplets(st):
...     triplets = [combinations(s, 3) for s in st]
...     return set(t for trip in triplets for t in trip)
...
>>> test_items = [xrange(n) for n in range(10, 80)]
>>> triplets = [set(combinations(i, 3)) for i in test_items]
>>> st_triplets = [gen_triplets_from_sextuplets(sextuplets(i))
...                for i in test_items]
>>> all(t == s for t, s in zip(triplets, st_triplets))
True
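For an even number of items, one way to see why this should work: every item sits in exactly one pair, so any three items touch at most three pairs, and (given at least three pairs) some generated combination of three pairs includes all of them. The following Python 3 sketch checks that argument for even-length input only. `pair_sextuplets` and `covers_all_triplets` are illustrative names, not part of the code above, and the odd-length case is deliberately left out.

```python
from itertools import combinations

def pair_sextuplets(seq):
    """All 3-combinations of consecutive pairs (even-length input only)."""
    items = list(seq)
    if len(items) < 6 or len(items) % 2:
        raise ValueError("this sketch assumes an even length of at least 6")
    pairs = [tuple(items[i:i + 2]) for i in range(0, len(items), 2)]
    return [a + b + c for a, b, c in combinations(pairs, 3)]

def covers_all_triplets(seq, groups):
    """True if the groups' triplets are exactly the triplets of seq."""
    wanted = set(combinations(seq, 3))
    found = {t for g in groups for t in combinations(g, 3)}
    return wanted == found

groups = pair_sextuplets(range(26))
print(len(groups))                             # 286
print(covers_all_triplets(range(26), groups))  # True
```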
Although I already said so, I'll point out again that this is an inefficient way to actually generate triplets, as it produces duplicates.
>>> def gen_triplet_list_from_sextuplets(st):
...     triplets = [combinations(s, 3) for s in st]
...     return list(t for trip in triplets for t in trip)
...
>>> tlist = gen_triplet_list_from_sextuplets(sextuplets(range(10)))
>>> len(tlist)
200
>>> len(set(tlist))
120
>>> tlist = gen_triplet_list_from_sextuplets(sextuplets(range(80)))
>>> len(tlist)
197600
>>> len(set(tlist))
82160
Indeed, although theoretically you should get a speedup...
>>> len(list(sextuplets(range(80))))
9880
... itertools.combinations still outperforms sextuplets for small sequences:
>>> %timeit list(sextuplets(range(20)))
10000 loops, best of 3: 68.4 us per loop
>>> %timeit list(combinations(range(20), 3))
10000 loops, best of 3: 55.1 us per loop
And it's still competitive with sextuplets for medium-sized sequences:
>>> %timeit list(sextuplets(range(200)))
10 loops, best of 3: 96.6 ms per loop
>>> %timeit list(combinations(range(200), 3))
10 loops, best of 3: 167 ms per loop
Unless you're working with very large sequences, I'm not sure this is worth the trouble. (Still, it was an interesting problem.)
Related
So I have this (nicely sorted) array.
And sometimes I need all of the elements from the array. But other times I need all of the even-indexed members together and all of the odd-indexed members together. And then again, sometimes I need it split into three groups with indices 0,3,6 etc. in one group, then 1,4,7 in the next and finally 2,5,8 in the last.
This can be done with group_by and taking the modulus of the index. See for yourself:
https://play.crystal-lang.org/#/r/4kzj
arr = ['a', 'b', 'c', 'd', 'e']
puts arr.group_by { |x| arr.index(x).not_nil! % 1 } # {0 => ['a', 'b', 'c', 'd', 'e']}
puts arr.group_by { |x| arr.index(x).not_nil! % 2 } # {0 => ['a', 'c', 'e'], 1 => ['b', 'd']}
puts arr.group_by { |x| arr.index(x).not_nil! % 3 } # {0 => ['a', 'd'], 1 => ['b', 'e'], 2 => ['c']}
But that not_nil! in there feels like a code-smell / warning that there's a better way.
Can I get the index of the elements without needing to look it up and handle the Nil type?
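You can also keep a running counter instead of looking up each index:
arr = ['a', 'b', 'c', 'd', 'e']
i = -1 # start at -1 so that the first element gets index 0
puts arr.group_by { |x| i += 1; i % 1 }
i = -1
puts arr.group_by { |x| i += 1; i % 2 }
i = -1
puts arr.group_by { |x| i += 1; i % 3 }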
Besides the nilable return type, it's also very inefficient to call Array#index for each element. This means a runtime of O(N²).
#group_by is used for grouping by value, but you don't need the value for grouping as you just want to group by index. That can be done a lot easier than wrapping around #group_by and #index
A more efficient solution is to loop over the indices and group the values based on the index:
groups = [[] of Char, [] of Char]
arr.each_index do |i|
  groups[i % 2] << arr[i]
end
There is no special method for this, but it's fairly simple to implement yourself.
If you don't need all groups, but only one of them, you can also use Int32#step to visit only that group's indices (here every third index, starting at 2):
group = [] of Char
2.step(to: arr.size - 1, by: 3) do |i|
  group << arr[i]
end
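For comparison only, here is the same index-based grouping sketched in Python rather than Crystal. Extended slicing takes every n-th element from a starting offset, so no index lookup is needed. `split_by_index` is an illustrative helper name.

```python
def split_by_index(items, n):
    """Return n groups; group k holds the items at indices k, k+n, k+2n, ..."""
    return [items[k::n] for k in range(n)]

arr = ['a', 'b', 'c', 'd', 'e']
print(split_by_index(arr, 2))  # [['a', 'c', 'e'], ['b', 'd']]
print(split_by_index(arr, 3))  # [['a', 'd'], ['b', 'e'], ['c']]
```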
Given:
obj = {}
obj['a'] = ['x', 'y', 'z']
obj['b'] = ['x', 'y', 'z', 'u', 't']
obj['c'] = ['x']
obj['d'] = ['y', 'u']
How do you select (e.g. print) the top 2 entries in this dictionary, sorted by the length of each list?
"the top 2 entries in this dictionary, sorted by the length of each list"
print(sorted(obj.values(), key=len)[:2])
The output:
[['x'], ['y', 'u']]
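Note that sorted() is ascending by default, so the slice above returns the two shortest lists. If "top 2" means the longest lists, or if the keys are needed as well, a small Python 3 variation (sorting obj.items() and passing reverse=True) might look like this. `by_length` is an illustrative helper name.

```python
obj = {
    'a': ['x', 'y', 'z'],
    'b': ['x', 'y', 'z', 'u', 't'],
    'c': ['x'],
    'd': ['y', 'u'],
}

def by_length(entry):
    """Sort key: the length of the list stored under a (key, list) entry."""
    key, values = entry
    return len(values)

shortest_two = sorted(obj.items(), key=by_length)[:2]
longest_two = sorted(obj.items(), key=by_length, reverse=True)[:2]

print(shortest_two)  # [('c', ['x']), ('d', ['y', 'u'])]
print(longest_two)   # [('b', ['x', 'y', 'z', 'u', 't']), ('a', ['x', 'y', 'z'])]
```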
I have two lists of unequal length such as
list1 = ['G','T','C','A','G']
list2 = ['AAAAA','TTTT','GGGG','CCCCCCCC']
I want to compare these two lists index by index, only at corresponding positions, i.e. list2[0] against list1[0], list2[1] against list1[1], and so on, up to the length of list1.
I then want two new lists: one holding the mismatches and one holding the positions of the mismatches. In pseudocode, it could be stated as:
if 'G' == 'GGG' or 'G' # where 'G' is from list1[1] and 'GGG' is from list2[2]
elif 'G' == 'AAA'
{
outlist1 == list1[index] # postion of mismatch
outlist2 == 'G/A'
}
OK, this works. There are definitely ways to do it in less code, but I think this is pretty clear:
#Function to process the lists
def get_mismatches(list1,list2):
    #Prepare the output lists
    mismatch_list = []
    mismatch_pos = []
    #Figure out which list is smaller
    smaller_list_len = min(len(list1),len(list2))
    #Loop through the lists checking element by element
    for ind in range(smaller_list_len):
        elem1 = list1[ind][0] #First char of string 1, such as 'G'
        elem2 = list2[ind][0] #First char of string 2, such as 'A'
        #If they match just continue
        if elem1 == elem2:
            continue
        #If they don't match update the output lists
        else:
            mismatch_pos.append(ind)
            mismatch_list.append(elem1+'/'+elem2)
    #Return the output lists
    return mismatch_list,mismatch_pos
#Make input lists
list1 = ['G','T','C','A','G']
list2 = ['AAAAA','TTTT','GGGG','CCCCCCCC']
#Call the function to get the output lists
outlist1,outlist2 = get_mismatches(list1,list2)
#Print the output lists:
print(outlist1)
print(outlist2)
Output:
['G/A', 'C/G', 'A/C']
[0, 2, 3]
And just to see how short I could get the code, I made this function, which I think is equivalent as long as there is at least one mismatch:
def short_get_mismatches(l1,l2):
    #Build (mismatch, position) pairs, then unzip them into two sequences
    o1,o2 = zip(*[(x[0]+'/'+y[0],i) for i,(x,y) in enumerate(zip(l1,l2)) if x[0] != y[0]])
    return list(o1),list(o2)
#Make input lists
list1 = ['G','T','C','A','G']
list2 = ['AAAAA','TTTT','GGGG','CCCCCCCC']
#Call the function to get the output lists
outlist1,outlist2 = short_get_mismatches(list1,list2)
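One edge case worth knowing about the one-liner: when the lists contain no mismatches, zip(*[]) produces nothing to unpack and the assignment raises ValueError. A Python 3 sketch that avoids this by building the two output lists separately (`mismatches_safe` is an illustrative name; it returns the mismatch labels first and the positions second, like get_mismatches):

```python
def mismatches_safe(l1, l2):
    """Compare first characters position by position; safe when nothing differs."""
    found = [
        (x[0] + '/' + y[0], i)
        for i, (x, y) in enumerate(zip(l1, l2))
        if x[0] != y[0]
    ]
    labels = [label for label, _ in found]
    positions = [pos for _, pos in found]
    return labels, positions

list1 = ['G', 'T', 'C', 'A', 'G']
list2 = ['AAAAA', 'TTTT', 'GGGG', 'CCCCCCCC']
print(mismatches_safe(list1, list2))    # (['G/A', 'C/G', 'A/C'], [0, 2, 3])
print(mismatches_safe(['A'], ['AAA']))  # ([], [])
```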
EDIT:
I'm not sure if I'm cleaning the sequence as you want w/ the N's and -'s. Is this the answer to the example in your comment?
Unclean list1 ['A', 'T', 'G', 'C', 'A', 'C', 'G', 'T', 'C', 'G']
Clean list1 ['A', 'T', 'G', 'C', 'A', 'C', 'G', 'T', 'C', 'G']
Unclean list2 ['GGG', 'TTTN', '-', 'NNN', 'AAA', 'CCC', 'GCCC', 'TTT', 'CCCTN']
Clean list2 ['GGG', 'TTT', 'AAA', 'CCC', 'GCCC', 'TTT', 'CCCT']
0 A GGG
1 T TTT
2 G AAA
3 C CCC
4 A GCCC
5 C TTT
6 G CCCT
['A/G', 'G/A', 'A/G', 'C/T', 'G/C']
[0, 2, 4, 5, 6]
This works fine for my question:
#!/usr/bin/env python
list1=['A', 'T', 'G', 'C', 'A' ,'C', 'G' , 'T' , 'C', 'G']
list2=[ 'GGG' , 'TTTN' , ' - ' , 'NNN' , 'AAA' , 'CCC' , 'GCCC' , 'TTT' ,'CCCATN' ]
notifications = []
indexes = []
for i in range(min(len(list1), len(list2))):
    item1 = list1[i]
    item2 = list2[i]
    # Skip ' - '
    if item2 == ' - ':
        continue
    # Remove N since it's a wildcard
    item2 = item2.replace('N', '')
    # Remove item1
    item2 = item2.replace(item1, '')
    chars = set(item2)
    # All matched
    if len(chars) == 0:
        continue
    # Sort the leftover characters so the output order is stable
    notifications.append('{}/{}'.format(item1, '/'.join(sorted(chars))))
    indexes.append(i)
print(notifications)
print(indexes)
It gives the output as
['A/G', 'G/C', 'C/A/T']
[0, 6, 8]
New to Python, trying to understand how this loop, which is intended to remove all items from the list, handles the indexes in the list and why it stops where it does...
Why does this loop:
foo = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
for i in foo:
    foo.remove(i)
print(foo)
Stop here?
['b', 'd', 'f', 'h']
Instead of here?
['H']
Also, what's happening "under the hood" with the indexes here?
With every iteration, is Python keeping track of which index is next while at the same time, once an item is removed the item to its right moves one index to the left (and that's why it's skipping every other item)?
With a four-item list such as ['A', 'C', 'D', 'E'], the loop starts at index zero, removing the "A" there. It then moves to index one, removing the "D" there (not "C", because that's at index zero at this point). Then there are only two items left in the list, so it can't move on to index two, and the loop ends.
Perhaps instead of a for loop, you could use a while loop that continues until the list is empty.
foo = ['A', 'C', 'D', 'E']
while foo:
    foo.pop(0)
print(foo)
... Or you could iterate over a copy of the list, which won't change from underneath you as you modify foo. Of course, this uses a little extra memory.
foo = ['A', 'C', 'D', 'E']
for i in foo[:]:
    foo.remove(i)
print(foo)
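If the goal is simply to empty the list, or to drop only some items, two further options avoid removing during iteration altogether. A short Python 3 sketch (the variable names `letters` and `unwanted` are illustrative):

```python
foo = ['A', 'C', 'D', 'E']
foo.clear()          # empties the list in place (Python 3.3+); del foo[:] also works
print(foo)           # []

letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
unwanted = {'a', 'c', 'e'}
# Build the filtered list first, then assign it back into the same list object.
letters[:] = [x for x in letters if x not in unwanted]
print(letters)       # ['b', 'd', 'f', 'g', 'h']
```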
To understand why this is happening, let us look step-by-step what is happening internally.
Step 1:
>>> foo = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
Here, a new list object is created and is assigned to foo.
Step 2:
>>> for i in foo:
Now, the iteration starts. i loop variable is assigned the value of item at index 0 which is 'a'.
Step 3:
>>> foo.remove(i)
>>> print foo
['b', 'c', 'd', 'e', 'f', 'g', 'h']
Now, .remove(i) removes the first item equal to i, which is 'a', the item just read from index 0. The list now has 'b' at index 0, 'c' at index 1 and so on.
Step 4:
>>> for i in foo:
For the next iteration, i loop variable is assigned the value of item at index 1 which is currently 'c'.
Step 5:
>>> foo.remove(i)
>>> print foo
['b', 'd', 'e', 'f', 'g', 'h']
Now this time, .remove(i) performs .remove(foo[1]) which removes 'c' from the list. The current list now has 'b' at index 0, 'd' at index 1 and so on.
Step 6:
>>> for i in foo:
For the next iteration, i loop variable is assigned the value of item at index 2 which is currently 'e'.
Step 7:
>>> foo.remove(i)
>>> print foo
['b', 'd', 'f', 'g', 'h']
Now this time, .remove(i) performs .remove(foo[2]) which removes 'e' from the list. Similarly, the indices of the items gets changed as in step 5 above.
Step 8:
>>> for i in foo:
For the next iteration, i loop variable is assigned the value of item at index 3 which is currently 'g'.
Step 9:
>>> foo.remove(i)
>>> print foo
['b', 'd', 'f', 'h']
Now this time, .remove(i) performs .remove(foo[3]) which removes 'g' from the list.
Step 10:
>>> for i in foo:
Now, i should point to item at index 4 but since the original list has been reduced to 4 elements, the execution will stop here.
>>> foo
['b', 'd', 'f', 'h']
Above is the final list after execution.
SOME CONCLUSIONS:
NEVER CHANGE THE LENGTH OF LISTS WHILE ITERATING ON THEM. In simple words, don't modify the original list while performing iteration on it.
While iterating, the for loop tracks its position in the list by index, not by item. Each .remove() shifts the remaining items one position to the left, so the next index skips over an item.
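The walkthrough above can be reproduced with an explicit index, which is roughly how the list iterator behaves: it advances a position counter and stops once that counter reaches the current length. A Python 3 sketch, with `remove_while_iterating` as an illustrative name:

```python
def remove_while_iterating(items):
    """Mimic `for i in items: items.remove(i)` with an explicit position counter."""
    items = list(items)
    index = 0
    trace = []
    while index < len(items):      # the length is re-checked on every step
        current = items[index]     # what the loop variable would be bound to
        items.remove(current)      # remaining items shift one place to the left
        trace.append((index, current, list(items)))
        index += 1
    return items, trace

result, trace = remove_while_iterating('abcdefgh')
for index, removed, remaining in trace:
    print(index, removed, remaining)
print(result)  # ['b', 'd', 'f', 'h']

# The real loop ends in the same state.
foo = list('abcdefgh')
for i in foo:
    foo.remove(i)
print(foo == result)  # True
```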
I am a Python newbie and I've been trying to find the way to generate each possible combination of members from two lists:
left = ['a', 'b', 'c', 'd', 'e']
right = ['f', 'g', 'h', 'i', 'j']
The resulting list should be something like:
af ag ah ai aj bf bg bh bi bj cf cg ch ci cj etc...
I made several experiments with loops but I can't get it right:
I tried the zip function, but it wasn't useful since it just pairs members one-to-one:
for x in zip(left,right):
    print(x)
and looping one list for the other just returns the members of one list repeated as many times as the number of members of the second list :(
Any help will be appreciated. Thanks in advance.
You can use, for example, a list comprehension:
left = ['a', 'b', 'c', 'd', 'e']
right = ['f', 'g', 'h', 'i', 'j']
result = [lc + rc for lc in left for rc in right]
print(result)
The result will look like:
['af', 'ag', 'ah', 'ai', 'aj', 'bf', 'bg', 'bh', 'bi', 'bj', 'cf', 'cg', 'ch', 'ci', 'cj', 'df', 'dg', 'dh', 'di', 'dj', 'ef', 'eg', 'eh', 'ei', 'ej']
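The same pairs can also be produced with itertools.product from the standard library, which may be convenient if a third list is added later. A Python 3 sketch:

```python
from itertools import product

left = ['a', 'b', 'c', 'd', 'e']
right = ['f', 'g', 'h', 'i', 'j']

# product yields (left_item, right_item) tuples in the same order as nested loops
result = [lc + rc for lc, rc in product(left, right)]
print(len(result))   # 25
print(result[:6])    # ['af', 'ag', 'ah', 'ai', 'aj', 'bf']
```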