Is there a way to group_by with_index in Crystal? - crystal-lang

So I have this (nicely sorted) array.
And sometimes I need all of the elements from the array. But other times I need all of the even-indexed members together and all of the odd-indexed members together. And then again, sometimes I need it split into three groups with indices 0,3,6 etc. in one group, then 1,4,7 in the next and finally 2,5,8 in the last.
This can be done with group_by and taking the modulus of the index. See for yourself:
https://play.crystal-lang.org/#/r/4kzj
arr = ['a', 'b', 'c', 'd', 'e']
puts arr.group_by { |x| arr.index(x).not_nil! % 1 } # {0 => ['a', 'b', 'c', 'd', 'e']}
puts arr.group_by { |x| arr.index(x).not_nil! % 2 } # {0 => ['a', 'c', 'e'], 1 => ['b', 'd']}
puts arr.group_by { |x| arr.index(x).not_nil! % 3 } # {0 => ['a', 'd'], 1 => ['b', 'e'], 2 => ['c']}
But that not_nil! in there feels like a code-smell / warning that there's a better way.
Can I get the index of the elements without needing to look it up and handle the Nil type?

You can also keep a running counter yourself. Start it at -1 so that the first element gets index 0:
arr = ['a', 'b', 'c', 'd', 'e']
i = -1
puts arr.group_by { |x| i += 1; i % 1 }
i = -1
puts arr.group_by { |x| i += 1; i % 2 }
i = -1
puts arr.group_by { |x| i += 1; i % 3 }

Besides the nilable return type, it's also very inefficient to call Array#index for each element: every call scans the array again, so the total runtime is O(N²). Array#index also returns the first matching position, so duplicate values would all end up in the same group.
#group_by is meant for grouping by value, but you don't need the value here because you only want to group by index. That is much easier to do directly than by wrapping #group_by around #index.
A more efficient solution is to loop over the indices and group the values based on the index:
groups = [[] of Char, [] of Char]
arr.each_index do |i|
  groups[i % 2] << arr[i]
end
There is no special method for this, but it's fairly simple to implement yourself.
If you don't need all groups, but only one of them, you can also use Int32#step to visit only the indices that belong to that group. For example, this collects every third element, starting at index 2:
group = [] of Char
2.step(to: arr.size - 1, by: 3) do |i|
  group << arr[i]
end

Related

Function to sort a list in a specific order, while also counting the amount of times each value appears

So I need to define a function that returns a list that is arranged in a specific order, and also gives the amount of times each value appears.
For example, let's say i have this input:
["s", "w", "h", "s", "h"]
I'll need my function to return this:
[2, 2, 1]
The 2 is the amount of times s appears, the following 2 is the amount of times h appears, and the 1 is the amount of times w appears.
I have been stuck on this for quite a while now, this is how far I came:
def item_order(list):
    sort_order = {"s": 0, "h": 1, "w": 2}
    list.sort(key=lambda val: sort_order[val[1]])
But I'm not sure if this is the right way to go.
Any help would be greatly appreciated!
You can use collections.Counter to count number of items. For example:
from collections import Counter
def item_order(lst):
    weights = {"s": 0, "h": 1, "w": 2}
    rv = sorted(lst, key=weights.get)
    return rv, Counter(rv)
lst = ["s", "w", "h", "s", "h"]
sorted_list, cnt = item_order(lst)
print(sorted_list)
print(cnt) # or list(cnt.values())
Prints:
['s', 's', 'h', 'h', 'w']
Counter({'s': 2, 'h': 2, 'w': 1})
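If only the counts are needed, in the fixed s, h, w order from the question, a minimal sketch could look like the following. The function name counts_in_order and its order parameter are made up for illustration; they are not part of the answer above.

```python
from collections import Counter

def counts_in_order(items, order=("s", "h", "w")):
    """Return how often each value in `order` occurs in `items`, in that order."""
    tally = Counter(items)
    # A Counter returns 0 for keys it has never seen, so missing values are safe.
    return [tally[key] for key in order]

print(counts_in_order(["s", "w", "h", "s", "h"]))  # [2, 2, 1]
```

This counts in a single pass and never sorts the input, so the original list is left untouched.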
# Function sorts a list of characters, calculates count of unique characters
# Parameter: list -> of characters
# Returns: list -> of numbers
def item_order(character_list):
    # A set for values already seen; kept local so repeated calls start fresh
    seen_characters = set()
    result = []
    # Note: sort() orders alphabetically (h, s, w), not in the custom s, h, w order
    character_list.sort()
    for character in character_list:
        if character not in seen_characters:
            result.append(character_list.count(character))
            seen_characters.add(character)
    return result
given_list = ['s', 'w', 'h', 's', 'h']
print(item_order(given_list))

Creating a dictionary from list of lists

I have a list of lists in the following format:
[['a'],['1'],['2'],['3'], ['b'],['4'],['5'],['6']]
My desired output is:
[['a', '1'], ['a', '2'], ['a','3'],['b', '4'],['b', '5'],['b', '6']]
or even better would be:
{'a':['1','2','3'], 'b':['4','5','6']}
Essentially, the "number values" are never the same size (think that a could include 1 2 and 3, and b could include 4 5 6 7 and 8, etc)
What would be the easiest way of doing this? Using regex?
Thanks
You can use a for loop and check if the element is a digit or not:
lst = [['a'],['1'],['2'],['3'], ['b'],['4'],['5'],['6']]
d = {}
for i in lst:
    if not i[0].isdigit(): # Not a digit: start a new key with the current value of i[0]
        d[i[0]] = []
        current = i[0]
    else:
        d[current].append(i[0])
Output:
>>> d
{'a': ['1', '2', '3'], 'b': ['4', '5', '6']}
This assumes that everything in the list is a string and that the list starts with a non-digit key.
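For the first desired output (a flat list of [key, value] pairs), the same loop can be adapted. This is only a sketch under the same assumptions as above; the name to_pairs is hypothetical.

```python
def to_pairs(lst):
    """Turn [['a'], ['1'], ['2'], ['b'], ['4']] into [['a', '1'], ['a', '2'], ['b', '4']]."""
    pairs = []
    current = None  # most recent non-digit key seen so far
    for (value,) in lst:  # each sublist holds exactly one string
        if value.isdigit():
            pairs.append([current, value])
        else:
            current = value
    return pairs

lst = [['a'], ['1'], ['2'], ['3'], ['b'], ['4'], ['5'], ['6']]
print(to_pairs(lst))
```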

comparing two lists of unequal length at each index

I have two lists of unequal length such as
list1 = ['G','T','C','A','G']
list2 = ['AAAAA','TTTT','GGGG','CCCCCCCC']
I want to compare these two lists index by index, only against the corresponding positions, i.e. list2[0] against list1[0], list2[1] against list1[1], and so on up to the length of list1.
I then want two new lists: one holding the mismatches and a second holding the positions of the mismatches. In pseudocode, it could be stated as:
if 'G' == 'GGG' or 'G' # where 'G' is from list1[1] and 'GGG' is from list2[2]
elif 'G' == 'AAA'
{
outlist1 == list1[index] # postion of mismatch
outlist2 == 'G/A'
}
ok this works. There are definitely ways to do it in less code, but I think this is pretty clear:
#Function to process the lists
def get_mismatches(list1,list2):
    #Prepare the output lists
    mismatch_list = []
    mismatch_pos = []
    #Figure out which list is smaller
    smaller_list_len = min(len(list1),len(list2))
    #Loop through the lists checking element by element
    for ind in range(smaller_list_len):
        elem1 = list1[ind][0] #First char of string 1, such as 'G'
        elem2 = list2[ind][0] #First char of string 2, such as 'A'
        #If they match just continue
        if elem1 == elem2:
            continue
        #If they don't match update the output lists
        else:
            mismatch_pos.append(ind)
            mismatch_list.append(elem1+'/'+elem2)
    #Return the output lists
    return mismatch_list,mismatch_pos
#Make input lists
list1 = ['G','T','C','A','G']
list2 = ['AAAAA','TTTT','GGGG','CCCCCCCC']
#Call the function to get the output lists
outlist1,outlist2 = get_mismatches(list1,list2)
#Print the output lists:
print(outlist1)
print(outlist2)
Output:
['G/A', 'C/G', 'A/C']
[0, 2, 3]
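And just to see how short I could get the code, I made this function, which returns the same two lists:
def short_get_mismatches(l1,l2):
    #Pair each mismatch string with its index; zip() stops at the shorter list
    pairs = [(x[0]+'/'+y[0],i) for i,(x,y) in enumerate(zip(l1,l2)) if x[0] != y[0]]
    #Split the pairs into two lists (both are empty if nothing mismatched)
    return [p[0] for p in pairs],[p[1] for p in pairs]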
#Make input lists
list1 = ['G','T','C','A','G']
list2 = ['AAAAA','TTTT','GGGG','CCCCCCCC']
#Call the function to get the output lists
outlist1,outlist2 = short_get_mismatches(list1,list2)
EDIT:
I'm not sure if I'm cleaning the sequence as you want w/ the N's and -'s. Is this the answer to the example in your comment?
Unclean list1 ['A', 'T', 'G', 'C', 'A', 'C', 'G', 'T', 'C', 'G']
Clean list1 ['A', 'T', 'G', 'C', 'A', 'C', 'G', 'T', 'C', 'G']
Unclean list2 ['GGG', 'TTTN', '-', 'NNN', 'AAA', 'CCC', 'GCCC', 'TTT', 'CCCTN']
Clean list2 ['GGG', 'TTT', 'AAA', 'CCC', 'GCCC', 'TTT', 'CCCT']
0 A GGG
1 T TTT
2 G AAA
3 C CCC
4 A GCCC
5 C TTT
6 G CCCT
['A/G', 'G/A', 'A/G', 'C/T', 'G/C']
[0, 2, 4, 5, 6]
this works fine for my question:
#!/usr/bin/env python
list1=['A', 'T', 'G', 'C', 'A' ,'C', 'G' , 'T' , 'C', 'G']
list2=[ 'GGG' , 'TTTN' , ' - ' , 'NNN' , 'AAA' , 'CCC' , 'GCCC' , 'TTT' ,'CCCATN' ]
notifications = []
indexes = []
for i in range(min(len(list1), len(list2))):
    item1 = list1[i]
    item2 = list2[i]
    # Skip ' - '
    if item2 == ' - ':
        continue
    # Remove N since it's a wildcard
    item2 = item2.replace('N', '')
    # Remove item1
    item2 = item2.replace(item1, '')
    chars = set(item2)
    # All matched
    if len(chars) == 0:
        continue
    # Sort the leftover characters so the output order is deterministic
    notifications.append('{}/{}'.format(item1, '/'.join(sorted(chars))))
    indexes.append(i)
print(notifications)
print(indexes)
It gives the output as
['A/G', 'G/C', 'C/A/T']
[0, 6, 8]

Edit and append items to nested list - Python 2.7

I'm struggling folks. I have searched this forum and Google but can't find a simple answer that I can understand.
I have a nested list "plot" that would have hundreds of sublists, all in the format of this sample:
plot = [['A', 21.09], ['A', 10.00], ['A', 20.99], ['B', 58.50], ['B', 17.69]]
I need to change the items in the sublists and store them in a new list "plotlists". These are the changes I need to make:
[?][0] (all first sublist items) if they are 'A' change to 0 if they are 'B' change to 1
[?][1] (2nd items) no change
[?][2] (3rd - new items) if [?][0] is 'A' then this item = -1 else it is [?][1] * 1.2
I have tried many ways to achieve this, but the best I can get is a right mess of code that produces 3 new lists.
Here is a minimal sample:
plot = [['A', 21.09], ['A', 10.00], ['A', 20.99], ['B', 58.50], ['B', 17.69]]
plot0 = []
plot1 = []
plot2 = []
labels = []
for i in plot:
    labels.append(i[0])
    plot1.append(i[1])
# Iterate over a separate list of labels, not over the list being appended to
for i in labels:
    if i == 'A':
        plot0.append(0)
    elif i == 'B':
        plot0.append(1)
for i in labels:
    if i == 'A':
        plot2.append(-1)
    elif i == 'B':
        plot2.append(1.2)
Result:
plot0 = [0, 0, 0, 1, 1]
plot1 = [21.09, 10.00, 20.99, 58.50, 17.69]
plot2 = [-1, -1, -1, 1.2, 1.2]
Please can anyone show me how to write this as a list comprehension that produces a result like this:
plotlists = [[0, 21.09, -1], [0, 10.00, -1], [0, 20.99, -1], [1, 58.50, 70.20], [1, 17.69, 21.23]]
This is a rather long list comprehension, but it'll work:
new_list = [[0 if sublist[0] == 'A' else 1, sublist[1], -1 if sublist[0] == 'A' else 1.2*sublist[1]] for sublist in plot]
Update: Auto increment counter
new_list = [[i, 0 if sublist[0] == 'A' else 1, sublist[1], -1 if sublist[0] == 'A' else 1.2*sublist[1]] for i, sublist in enumerate(plot)]
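If the one-liner gets hard to read, the per-row logic can be moved into a small helper and the comprehension kept short. This is just a sketch: the name transform is hypothetical, and rounding to two decimals is an assumption based on the sample output in the question.

```python
def transform(row):
    """Map ['A', x] to [0, x, -1] and any other label to [1, x, 1.2 * x]."""
    label, value = row
    if label == 'A':
        return [0, value, -1]
    # Rounding is an assumption; drop round() to keep full float precision.
    return [1, value, round(1.2 * value, 2)]

plot = [['A', 21.09], ['A', 10.00], ['A', 20.99], ['B', 58.50], ['B', 17.69]]
plotlists = [transform(row) for row in plot]
print(plotlists)
```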

All triplet combinations, 6 values at a time

I am looking for an algorithm to efficiently generate a small set of 6-tuples that cumulatively express all possible 3-tuple combinations of a dataset.
For instance, computing playing-card hands of 6 cards that express all possible 3 card combinations.
For example, given a dataset:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
The first "pick" of 6 values might be:
['a','b','c','d','e','f']
And this covers the three-value combinations:
('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'b', 'e'), ('a', 'b', 'f'), ('a', 'c', 'd'),
('a', 'c', 'e'), ('a', 'c', 'f'), ('a', 'd', 'e'), ('a', 'd', 'f'), ('a', 'e', 'f'),
('b', 'c', 'd'), ('b', 'c', 'e'), ('b', 'c', 'f'), ('b', 'd', 'e'), ('b', 'd', 'f'),
('b', 'e', 'f'), ('c', 'd', 'e'), ('c', 'd', 'f'), ('c', 'e', 'f'), ('d', 'e', 'f')
It is obviously possible by:
1. computing all triplet combinations
2. picking 6 values
3. computing all triplet combinations for those 6 values
4. removing these combinations from the first computation
5. repeating steps 2-4 until all have been accounted for
In this example there are 2600 possible triplet combinations (26*25*24)/(3*2*1) == 2600 and using the "brute-force" method above, all triplet combinations can be represented in around 301 6-value groups.
However, it feels like there ought to be a more efficient way of achieving this.
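For a sense of how much room for improvement there is, a simple counting bound helps: each group of 6 values contains 20 triplets, so no scheme can cover 2600 triplets with fewer than 2600 / 20 = 130 groups. The sketch below only computes that bound; it does not show that 130 is actually reachable. The name min_sextuples is made up for illustration, and math.comb needs Python 3.8 or later.

```python
from math import comb

def min_sextuples(n):
    """Counting lower bound on the number of 6-value groups needed for n items."""
    total_triplets = comb(n, 3)   # all 3-combinations of the dataset
    per_sextuple = comb(6, 3)     # triplets contained in one 6-value group (20)
    # Ceiling division: even a perfectly overlap-free cover needs this many groups.
    return -(-total_triplets // per_sextuple)

print(comb(26, 3), min_sextuples(26))  # 2600 130
```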
My preferred language is python, but I'm planning on implementing this in C++.
Update
Here's my python code to "brute-force" it:
from itertools import combinations
data_set = list('abcdefghijklmnopqrstuvwxyz')
def calculate(data_set):
    all_triplets = list(frozenset(x) for x in combinations(data_set,3))
    data = set(all_triplets)
    sextuples = []
    while data:
        # Greedily merge uncovered triplets until 6 distinct values are picked
        sxt = set()
        for item in data:
            nxt = sxt | item
            if len(nxt) > 6:
                continue
            sxt = nxt
            if len(nxt) == 6:
                break
        sextuples.append(list(sxt))
        # Drop every triplet that this pick covers
        covers = set(frozenset(x) for x in combinations(list(sxt),3))
        data = data - covers
        print("%r\t%s" % (list(sxt),len(data)))
    print("Completed %s triplets in %s sextuples" % (len(all_triplets),len(sextuples),))
calculate(data_set)
Completed 2600 triplets in 301 sextuples
I'm looking for something more computationally efficient than this.
Update
Senderle has provided an interesting solution: to divide the dataset into pairs then generate all possible triplets of the pairs. This is definitely better than anything I'd come up with.
Here's a quick function to check whether all triplets are covered and assess the redundancy of triplet coverage:
from itertools import combinations
def check_coverage(data_set,sextuplets):
    # Count how often each triplet of the dataset appears in the sextuplets
    all_triplets = dict.fromkeys(combinations(data_set,3),0)
    sxt_count = 0
    for sxt in sextuplets:
        sxt_count += 1
        for triplet in combinations(sxt,3):
            all_triplets[triplet] += 1
    total = len(all_triplets)
    biggest_overlap = overlap = nohits = onehits = morehits = 0
    for k,v in all_triplets.items():
        if v == 0:
            nohits += 1
        elif v == 1:
            onehits += 1
        else:
            morehits += 1
            overlap += v - 1
        if v > biggest_overlap:
            biggest_overlap = v
    print("All Triplets in dataset: %6d" % (total,))
    print("Total triplets from sxt: %6d" % (total + overlap,))
    print("Number of sextuples: %6d\n" % (sxt_count,))
    print("Missed %6d of %6d: %6.1f%%" % (nohits,total,100.0*nohits/total))
    print("HitOnce %6d of %6d: %6.1f%%" % (onehits,total,100.0*onehits/total))
    print("HitMore %6d of %6d: %6.1f%%" % (morehits,total,100.0*morehits/total))
    print("Overlap %6d of %6d: %6.1f%%" % (overlap,total,100.0*overlap/total))
    print("Biggest Overlap: %3d" % (biggest_overlap,))
Using Senderle's sextuplets generator I'm fascinated to observe that the repeated triplets are localised and as the datasets increase in size, the repeats become proportionally more localised and the peak repeat larger.
>>> check_coverage(range(26),sextuplets(range(26)))
All Triplets in dataset: 2600
Total triplets from sxt: 5720
Number of sextuples: 286
Missed 0 of 2600: 0.0%
HitOnce 2288 of 2600: 88.0%
HitMore 312 of 2600: 12.0%
Overlap 3120 of 2600: 120.0%
Biggest Overlap: 11
>>> check_coverage(range(40),sextuplets(range(40)))
All Triplets in dataset: 9880
Total triplets from sxt: 22800
Number of sextuples: 1140
Missed 0 of 9880: 0.0%
HitOnce 9120 of 9880: 92.3%
HitMore 760 of 9880: 7.7%
Overlap 12920 of 9880: 130.8%
Biggest Overlap: 18
>>> check_coverage(range(80),sextuplets(range(80)))
All Triplets in dataset: 82160
Total triplets from sxt: 197600
Number of sextuples: 9880
Missed 0 of 82160: 0.0%
HitOnce 79040 of 82160: 96.2%
HitMore 3120 of 82160: 3.8%
Overlap 115440 of 82160: 140.5%
Biggest Overlap: 38
I believe the following produces correct results. It relies on the intuition that to generate all necessary sextuplets, all that is necessary is to generate all possible combinations of arbitrary pairs of items. This "mixes" values together well enough that all possible triplets are represented.
There's a slight wrinkle. For an odd number of items, one pair isn't a pair at all, so you can't generate a sextuplet from it, but the value still needs to be represented. This does some gymnastics to sidestep that problem; there might be a better way, but I'm not sure what it is.
from itertools import izip_longest, islice, combinations
def sextuplets(seq, _fillvalue=object()):
    if len(seq) < 6:
        # Too few items for a full sextuplet: yield them all as one tuple
        yield tuple(seq)
        return
    it = iter(seq)
    # Consecutive pairs; an odd-length seq leaves the last pair padded with _fillvalue
    pairs = izip_longest(it, it, fillvalue=_fillvalue)
    sextuplets = (a + b + c for a, b, c in combinations(pairs, 3))
    for st in sextuplets:
        if st[-1] is _fillvalue:
            # replace fill value with valid item not in sextuplet
            # while maintaining original order
            for i, (x, y) in enumerate(zip(st, seq)):
                if x != y:
                    st = st[0:i] + (y,) + st[i:-1]
                    break
        yield st
I tested it on sequences of items of length 10 to 80, and it generates correct results in all cases. I don't have a proof that this will give correct results for all sequences though. I also don't have a proof that this is a minimal set of sextuplets. But I'd love to hear a proof of either, if anyone can come up with one.
>>> def gen_triplets_from_sextuplets(st):
...     triplets = [combinations(s, 3) for s in st]
...     return set(t for trip in triplets for t in trip)
...
>>> test_items = [xrange(n) for n in range(10, 80)]
>>> triplets = [set(combinations(i, 3)) for i in test_items]
>>> st_triplets = [gen_triplets_from_sextuplets(sextuplets(i))
...                for i in test_items]
>>> all(t == s for t, s in zip(triplets, st_triplets))
True
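The code above targets Python 2 (izip_longest, xrange). For anyone on Python 3, here is a rough port together with the same coverage check. The names sextuplets_py3 and covers_all_triplets are made up for this sketch, and it assumes the items in the sequence are distinct.

```python
from itertools import combinations, zip_longest

_FILL = object()  # sentinel for the empty slot next to an unpaired last item

def sextuplets_py3(seq):
    """Yield 6-tuples built from every combination of three consecutive pairs."""
    seq = list(seq)
    if len(seq) < 6:
        yield tuple(seq)
        return
    it = iter(seq)
    pairs = zip_longest(it, it, fillvalue=_FILL)
    for a, b, c in combinations(pairs, 3):
        st = a + b + c
        if st[-1] is _FILL:
            # Swap the filler for the first item of seq missing from st,
            # inserting it so the original order is preserved.
            for i, (x, y) in enumerate(zip(st, seq)):
                if x != y:
                    st = st[:i] + (y,) + st[i:-1]
                    break
        yield st

def covers_all_triplets(items):
    """Check that the generated sextuplets contain every 3-combination."""
    items = list(items)
    wanted = {frozenset(t) for t in combinations(items, 3)}
    got = {frozenset(t) for s in sextuplets_py3(items) for t in combinations(s, 3)}
    return wanted == got

print(all(covers_all_triplets(range(n)) for n in range(10, 80)))
```

Like the original test, this only checks specific lengths; it is not a proof that the approach works for every input.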
Although I already said so, I'll point out again that this is an inefficient way to actually generate triplets, as it produces duplicates.
>>> def gen_triplet_list_from_sextuplets(st):
...     triplets = [combinations(s, 3) for s in st]
...     return list(t for trip in triplets for t in trip)
...
>>> tlist = gen_triplet_list_from_sextuplets(sextuplets(range(10)))
>>> len(tlist)
200
>>> len(set(tlist))
120
>>> tlist = gen_triplet_list_from_sextuplets(sextuplets(range(80)))
>>> len(tlist)
197600
>>> len(set(tlist))
82160
Indeed, although theoretically you should get a speedup...
>>> len(list(sextuplets(range(80))))
9880
... itertools.combinations still outperforms sextuplets for small sequences:
>>> %timeit list(sextuplets(range(20)))
10000 loops, best of 3: 68.4 us per loop
>>> %timeit list(combinations(range(20), 3))
10000 loops, best of 3: 55.1 us per loop
And it's still competitive with sextuplets for medium-sized sequences:
>>> %timeit list(sextuplets(range(200)))
10 loops, best of 3: 96.6 ms per loop
>>> %timeit list(combinations(range(200), 3))
10 loops, best of 3: 167 ms per loop
Unless you're working with very large sequences, I'm not sure this is worth the trouble. (Still, it was an interesting problem.)