Organize the sequence of tuples in a Haskell list comprehension

Hello dear Community,
I'm trying to organize the sequence of tuples in a Haskell list comprehension.
For example, I have the following list comprehension:
[ (a,b,c,d) | a <- [0, 50, 100, 150, 200]
, b <- ['a', 'b', 'c']
, c <- [True, False]
, d <- ['A', 'B']
]
and get:
[ (0, 'a', True, 'A'), (0, 'a', True, 'B'), (0, 'a', False, 'A')
, (0, 'a', False, 'B'), (0, 'b', True, 'A'), (0, 'b', True, 'B')
, (0, 'b', False, 'A'), (0, 'b', False, 'B'), (0, 'c', True, 'A')
,(0, 'c', True, 'B'), (0, 'c', False, 'A')..
Now I want the sequence to look like the following:
[ (0, 'a', True, 'A'), (0, 'a', True, 'B'), (0, 'b', True, 'A')
, (0, 'b', True, 'B'), (0, 'c' ,True, 'A'), (0, 'c', True, 'B')
, (0, 'a', False, 'A'), (0, 'a', False, 'B')..
That means:
First alternate between the capital letters 'A' and 'B', then between the lowercase letters 'a', 'b', 'c', then (second to last) between the Boolean values True and False, and finally between the numbers.
Unfortunately I have absolutely no idea how to achieve this, and I want to know how to control the order of a list of tuples such as [(a,b,c)].

The order of the x <- list generators in a list comprehension is important. If you write:
[expr | x <- list1, y <- list2]
this is equivalent to a nested for loop with y being the inner loop. So a Python equivalent with loops would be:
for x in list1:
    for y in list2:
        expr
and thus the inner loop is entirely exhausted before the outer loop picks the next value.
So we need to reorder the generators such that d varies fastest, then b, then c and finally a (that is, from the innermost loop outwards: d, b, c, a). So that means that we turn:
[(a,b,c,d)| a <- [0,50..200], b <- "abc", c <- [True,False], d <-"AB"]
(I made the lists shorter in notation)
into:
-- before: [(a,b,c,d)| a <- [0,50..200], b <- "abc", c <- [True,False], d <- "AB"]
-- the generators for b and c swap places; a and d stay where they are
[(a,b,c,d)| a <- [0,50..200], c <- [True,False], b <- "abc", d <- "AB"]
(the comment is only to visualize the difference)
which generates:
Prelude> [(a,b,c,d)| a <- [0,50..200], c <- [True,False], b <- "abc", d <- "AB"]
[(0,'a',True,'A'),
(0,'a',True,'B'),
(0,'b',True,'A'),
(0,'b',True,'B'),
(0,'c',True,'A'),
(0,'c',True,'B'),
(0,'a',False,'A'),
(0,'a',False,'B'),
(0,'b',False,'A'),
(0,'b',False,'B'),
(0,'c',False,'A'),
(0,'c',False,'B'),
(50,'a',True,'A'),
(50,'a',True,'B'),
(50,'b',True,'A'),
(50,'b',True,'B'),
(50,'c',True,'A'),
(50,'c',True,'B'),
(50,'a',False,'A'),
(50,'a',False,'B'),
(50,'b',False,'A'),
(50,'b',False,'B'),
(50,'c',False,'A'),
(50,'c',False,'B'),
(100,'a',True,'A'),
(100,'a',True,'B'),
(100,'b',True,'A'),
(100,'b',True,'B'),
(100,'c',True,'A'),
(100,'c',True,'B'),
(100,'a',False,'A'),
(100,'a',False,'B'),
(100,'b',False,'A'),
(100,'b',False,'B'),
(100,'c',False,'A'),
(100,'c',False,'B'),
(150,'a',True,'A'),
(150,'a',True,'B'),
(150,'b',True,'A'),
(150,'b',True,'B'),
(150,'c',True,'A'),
(150,'c',True,'B'),
(150,'a',False,'A'),
(150,'a',False,'B'),
(150,'b',False,'A'),
(150,'b',False,'B'),
(150,'c',False,'A'),
(150,'c',False,'B'),
(200,'a',True,'A'),
(200,'a',True,'B'),
(200,'b',True,'A'),
(200,'b',True,'B'),
(200,'c',True,'A'),
(200,'c',True,'B'),
(200,'a',False,'A'),
(200,'a',False,'B'),
(200,'b',False,'A'),
(200,'b',False,'B'),
(200,'c',False,'A'),
(200,'c',False,'B')]
(new lines added to make it easier to verify)
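The loop analogy can also be checked in Python, where itertools.product iterates its rightmost argument fastest, much like the last generator in a comprehension. This is only a minimal sketch and not part of the original answer; the function name reordered is made up for illustration.

```python
from itertools import product


def reordered():
    """Yield (a, b, c, d) with d varying fastest, then b, then c, then a."""
    numbers = [0, 50, 100, 150, 200]
    # product() iterates its rightmost argument fastest, so the iterables
    # are listed from slowest to fastest: a, c, b, d.
    for a, c, b, d in product(numbers, [True, False], "abc", "AB"):
        # The tuple itself keeps the original (a, b, c, d) layout.
        yield (a, b, c, d)


result = list(reordered())
print(result[:7])
```

Only the iteration order changes here; the position of each value inside the tuple is untouched, which is the same idea as swapping the generators for b and c in the Haskell comprehension.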

PySpark - Truncate lists in a column according to first and last occurrences of specific elements

I have a pyspark-dataframe where one column is a list with a given order:
df_in = spark.createDataFrame(
    [
        (1, ['A', 'B', 'A', 'F', 'C', 'D']),
        (2, ['F', 'C', 'B', 'X', 'A', 'D']),
        (3, ['L', 'A', 'B', 'M', 'C'])])
I want to specify two elements e_1 and e_2 (e.g. 'A' and 'C'), such that the resulting column consists of a list with all the elements from the first occurrence of either e_1 or e_2 up to and including the last occurrence of either e_1 or e_2. Hence, the resulting df should look like this:
df_out = spark.createDataFrame(
    [
        (1, ['A', 'B', 'A', 'F', 'C']),
        (2, ['C', 'B', 'X', 'A']),
        (3, ['A', 'B', 'M', 'C'])])
How do I achieve this? Thanks in advance!
Best regards
Define a function that does what you want with a single element of the dataframe.
Extend it to a "user defined function" (UDF) that is able to operate on a column of the dataframe.
In this example, the function might look like
def trim(start_stop, l):
    first = None
    for i, e in enumerate(l):
        if e in start_stop:
            if first is None:
                first = i
            last = i
    return l[first:last+1] if first is not None else []
trim({'B','C'}, ['A', 'B', 'A', 'F', 'C', 'D'])
# ['B', 'A', 'F', 'C']
For the UDF, we have to decide whether we want start_stop to be a fixed value or flexible.
from pyspark.sql import functions, types

# {'A','C'} as hard-coded value for start_stop
trimUDF = functions.udf(
    lambda x: trim({'A','C'}, x),
    types.ArrayType(types.StringType()))
# use as trimUDF(COLNAME)

# flexible start_stop
def trimUDF(start_stop):
    return functions.udf(
        lambda l: trim(start_stop, l),
        types.ArrayType(types.StringType()))
# use as trimUDF(start_stop)(COLNAME)
For your example and the second version of trimUDF, we obtain:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    (1, ['A', 'B', 'A', 'F', 'C', 'D']),
    (2, ['F', 'C', 'B', 'X', 'A', 'D']),
    (3, ['L', 'A', 'B', 'M', 'C'])])
df.show()
df = df.withColumn("_2", trimUDF({'A','C'})("_2"))
df.show()
This code results in the following output.
+---+------------------+
| _1|                _2|
+---+------------------+
|  1|[A, B, A, F, C, D]|
|  2|[F, C, B, X, A, D]|
|  3|   [L, A, B, M, C]|
+---+------------------+
+---+---------------+
| _1|             _2|
+---+---------------+
|  1|[A, B, A, F, C]|
|  2|   [C, B, X, A]|
|  3|   [A, B, M, C]|
+---+---------------+
For the record, here is the complete code.
# python3 -m venv venv
# . venv/bin/activate
# pip install wheel pyspark[sql]
from pyspark.sql import SparkSession, functions, types

def trim(start_stop, l):
    first = None
    for i, e in enumerate(l):
        if e in start_stop:
            if first is None:
                first = i
            last = i
    return l[first:last+1] if first is not None else []

#trimUDF = functions.udf(
#    lambda x: trim({'A','C'}, x),
#    types.ArrayType(types.StringType()))

def trimUDF(start_stop):
    return functions.udf(
        lambda l: trim(start_stop, l),
        types.ArrayType(types.StringType()))

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    (1, ['A', 'B', 'A', 'F', 'C', 'D']),
    (2, ['F', 'C', 'B', 'X', 'A', 'D']),
    (3, ['L', 'A', 'B', 'M', 'C'])])
df.show()
df = df.withColumn("_2", trimUDF({'A','C'})("_2"))
df.show()
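The column is an array, so Spark's built-in array functions can be used instead of a UDF.
Find the positions of the elements A and C using array_position.
Collect those positions into an array and sort it using sort_array.
Now slice the original column, starting from the lowest position. Subtract the lowest position from the highest and add 1 to get the length to pass to slice.
Note that array_position returns the 1-based index of the first occurrence only, so this relies on the last relevant element being a first occurrence of A or C, as is the case in the sample data.
Code below. The intermediate column x is left in so you can follow along.
from pyspark.sql.functions import array, array_position, col, slice, sort_array

(df_in
 .withColumn('x', sort_array(array(*[array_position(col('_2'), x) for x in ['A', 'C']])))
 .withColumn('y', slice(col('_2'), col('x')[0], col('x')[1] - col('x')[0] + 1))
 .show())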
+---+------------------+------+---------------+
| _1|                _2|     x|              y|
+---+------------------+------+---------------+
|  1|[A, B, A, F, C, D]|[1, 5]|[A, B, A, F, C]|
|  2|[F, C, B, X, A, D]|[2, 5]|   [C, B, X, A]|
|  3|   [L, A, B, M, C]|[2, 5]|   [A, B, M, C]|
+---+------------------+------+---------------+
One way to do this is:
Join array into a string with elements separated by ,.
Use regex to extract required sub-string.
Split string by , back into original array.
Cast, if required.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    data=[
        (1, ["X A", "X B", "X A", "X F", "X C", "X D"]),
        (2, ["X F", "X C", "X B", "X X", "X A", "X D"]),
        (3, ["X L", "X A", "X B", "X M", "X C"]),
    ],
    schema=["id", "arr"])
match_list = ["X A", "X C", "X D"]
match_any = "|".join(match_list)
regex = rf"((?:{match_any}).*(?:{match_any}))"
df = df.withColumn("arr", F.concat_ws(",", "arr")) \
       .withColumn("arr", F.regexp_extract("arr", regex, 1)) \
       .withColumn("arr", F.split("arr", ","))
Output:
+---+------------------------------+
|id |arr                           |
+---+------------------------------+
|1  |[X A, X B, X A, X F, X C, X D]|
|2  |[X C, X B, X X, X A, X D]     |
|3  |[X A, X B, X M, X C]          |
+---+------------------------------+
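To see what the join, regex and split steps do without a Spark session, here is a rough plain-Python sketch of the same idea using the standard re module. The function name trim_by_regex is hypothetical, and the sketch assumes that no element contains the separator and that no match value is a prefix or substring of another element.

```python
import re


def trim_by_regex(items, match_list, sep=","):
    """Keep the span from the first to the last element found in match_list."""
    # Step 1: join the list into one string.
    joined = sep.join(items)
    # Step 2: greedy match from the first listed value to the last one.
    alternatives = "|".join(re.escape(m) for m in match_list)
    found = re.search(rf"(?:{alternatives}).*(?:{alternatives})", joined)
    # Step 3: split back into a list (empty if fewer than two matches exist).
    return found.group(0).split(sep) if found else []


rows = [
    ["X A", "X B", "X A", "X F", "X C", "X D"],
    ["X F", "X C", "X B", "X X", "X A", "X D"],
    ["X L", "X A", "X B", "X M", "X C"],
]
trimmed = [trim_by_regex(row, ["X A", "X C", "X D"]) for row in rows]
print(trimmed)
```

Because the pattern needs a start and an end match, a list containing only one of the listed values yields an empty result in this sketch.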

Generating K-multisets when order matters and repetition is allowed

Sorry if my question is basic, but I have not coded in the past 15 years and I am trying to learn to code again for a research project. I have a set of 12 objects [A B C D E F G H I J K L] and I want to create a list of every possible K-multiset for any K between 1 and 6. (I do have a list of selection probabilities for them, but at this stage of the project I can assume an equal probability of selection for all.) Order matters and repetition is allowed (so they are just k-tuples). For example: [A], [A A], [B A A], [A B A], [B A A A A A] etc. I tried to use weighted-n-of-with-repeats in the RND extension of NetLogo, but it seems that order does not matter in it, so [B A] and [A B] are the same thing and are reported as [A B]. Can you please help with the NetLogo code?
This is what I have tried so far but unfortunately it does not recognize order:
to k-multiset
  let n 0
  let pairs [["A" 0.1] ["B" 0.1] ["C" 0.1] ["D" 0.1] ["E" 0.1] ["F" 0.1] ["G" 0.1] ["H" 0.1] ["I" 0.1] ["J" 0.1] ["K" 0.1]]
  while [n < 7] [print map first rnd:weighted-n-of-list-with-repeats n pairs [[p] -> last p]]
end
Note that every ordered selection of K items (with repetition) can be represented as a K-digit integer in the base-12 numeral system, where A corresponds to 0, B corresponds to 1 and so on until L=11.
So you can just walk through all integers in the range 0..12^K-1 (about 3 million for K=6) and use the base-12 digits as indexes of items.
Python code for a smaller item list and output range:
List = list('ABC')
L = len(List)
for K in range(1,4):
    for i in range(L**K):
        j = i
        output = []
        for n in range(K):  # K digits
            d = j % L       # integer modulo for n-th digit
            j = j // L      # integer division
            output.insert(0, List[d])
        print(output)
part of output:
['A']
['B']
['C']
['A', 'A']
['A', 'B']
['A', 'C']
...
['C', 'C']
['A', 'A', 'A']
['A', 'A', 'B']
['A', 'A', 'C']
...
['C', 'B', 'C']
['C', 'C', 'A']
['C', 'C', 'B']
['C', 'C', 'C']
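For comparison, the same enumeration can be sketched with itertools.product from the Python standard library, which produces all ordered K-tuples with repetition directly. This is an alternative to the digit arithmetic above, not the answerer's code, and the function name k_tuples is made up for illustration.

```python
from itertools import product


def k_tuples(items, max_k):
    """Yield every ordered selection with repetition of length 1..max_k."""
    for k in range(1, max_k + 1):
        # product(items, repeat=k) is the k-fold Cartesian product,
        # with the rightmost position varying fastest.
        for combo in product(items, repeat=k):
            yield list(combo)


tuples = list(k_tuples("ABC", 3))
print(len(tuples))
print(tuples[:5])
```

For the full problem, k_tuples("ABCDEFGHIJKL", 6) would be iterated lazily rather than stored in a list, since the K=6 level alone has about 3 million entries.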

How to match elements from arrays and only print matches?

I have two lists, and I am trying to match one item from the first list to another from the second list under a certain condition (for example, if they share the same number in the same location). I wrote my code to match the first set ['A','B','C',4,'D'] and only print the set from list2 that has 4 in the same location. So basically my output would be:
['A','B','C',4,'D']
[1, 2, 3, 4, 5]
However, I can't figure out how to print only the match.
Here is my code:
list1 = [['A','B','C',4,'D'],['A','B','C',9,'D'],['A','B','C',5,'D'],['A','B','C',6,'D'],['A','B','C',7,'D']]
list2 = [[1,2,3,2,5],[1,2,3,5,5],[1,2,3,3,5],[1,2,3,4,5],[1,2,3,1,5],[1,2,3,2,5]]
for var in list1:
    print var
    for i in range(0,len(list2)):
        for var1 in list2:
            if list1[0][3] == list2[i][3]:
                print var1
Your program would become simpler if you used izip from itertools (Python 2). Assuming you just need to print the matching elements:
from itertools import izip
list1 = [['A','B','C',4,'D'],['A','B','C',9,'D'],['A','B','C',5,'D'],['A','B','C',6,'D'],['A','B','C',7,'D']]
list2 = [[1,2,3,2,5],[1,2,3,5,5],[1,2,3,3,5],[1,2,3,4,5],[1,2,3,1,5],[1,2,3,2,5]]
for item1 in list1:
    for item2 in list2:
        for i,j in izip(item1, item2):
            if i==j:
                print i
Using izip twice makes it even shorter. Note that this version only compares the sublists that sit at the same index in list1 and list2:
from itertools import izip
list1 = [['A','B','C',4,'D'],['A','B','C',9,'D'],['A','B','C',5,'D'],['A','B','C',6,'D'],['A','B','C',7,'D']]
list2 = [[1,2,3,2,5],[1,2,3,5,5],[1,2,3,3,5],[1,2,3,4,5],[1,2,3,1,5],[1,2,3,2,5]]
for i in izip(list1,list2):
    for item1, item2 in izip(i[0],i[1]):
        if item1 == item2:
            print item1
Almost. I am not sure if that is what you wanted but the following code prints all pairs which have the same number in the 4th location of the array:
list1 = [['A','B','C',4,'D'],['A','B','C',9,'D'],['A','B','C',5,'D'],
         ['A','B','C',6,'D'],['A','B','C',7,'D']]
list2 = [[1,2,3,2,5],[1,2,3,5,5],[1,2,3,3,5],[1,2,3,4,5],[1,2,3,1,5],
         [1,2,3,2,5]]
for t in list1:
    print t
    for b in list2:
        if t[3] == b[3]:
            print b
Output is:
['A', 'B', 'C', 4, 'D']
[1, 2, 3, 4, 5]
['A', 'B', 'C', 9, 'D']
['A', 'B', 'C', 5, 'D']
[1, 2, 3, 5, 5]
['A', 'B', 'C', 6, 'D']
['A', 'B', 'C', 7, 'D']
Is that what you were looking for?
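If only the matching pairs should be printed, as the question asks, the rows from list1 that have no partner can be skipped. The following is a minimal Python 3 sketch of that variation (the answers above use Python 2 print statements); the function name matching_pairs and the position parameter are assumptions made for illustration.

```python
def matching_pairs(rows1, rows2, position=3):
    """Return (row1, row2) pairs whose values at `position` are equal."""
    return [
        (r1, r2)
        for r1 in rows1
        for r2 in rows2
        if r1[position] == r2[position]
    ]


list1 = [['A', 'B', 'C', 4, 'D'], ['A', 'B', 'C', 9, 'D'], ['A', 'B', 'C', 5, 'D'],
         ['A', 'B', 'C', 6, 'D'], ['A', 'B', 'C', 7, 'D']]
list2 = [[1, 2, 3, 2, 5], [1, 2, 3, 5, 5], [1, 2, 3, 3, 5], [1, 2, 3, 4, 5],
         [1, 2, 3, 1, 5], [1, 2, 3, 2, 5]]

pairs = matching_pairs(list1, list2)
for row1, row2 in pairs:
    print(row1)
    print(row2)
```

With the sample data this prints only the rows containing 4 and 5 in the fourth position, each followed by its match from list2.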

Function that takes a variable number of arguments and returns a tuple

I have written code as below:
def convTup(*args):
    t = set([])
    for i in args:
        t.add(i)
    return tuple(t)

print convTup('a','b','c','d')
print convTup(1,2,3,4,5,6)
print convTup('a','b')
Expected output :
('a', 'b', 'c', 'd')
(1, 2, 3, 4, 5, 6)
('a', 'b')
But I got output as below:
('a', 'c', 'b', 'd')
(1, 2, 3, 4, 5, 6)
('a', 'b')
Why has the order of the elements changed only for ('a','b','c','d')? How can I print the tuple in the same order as the given input?
You can use this, and you'll get a tuple in the same order as your input:
>>> def f(*args):
...     p = []
...     [p.append(x) for x in args if x not in p]
...     return tuple(p)
...
>>> f(1, 1, 2, 3)
(1, 2, 3)
>>> f('a', 'b', 'c', 'd')
('a', 'b', 'c', 'd')
This function creates a list of unique elements in the order they were first seen, then returns it as a tuple.
You can get the same de-duplication using a set instead of a list, but a set doesn't keep track of the order in which the elements were entered:
>>> def tup1(*args):
...     l = {x for x in args}
...     return tuple(l)
...
>>> tup1('a', 'b', 'c', 'd')
('a', 'c', 'b', 'd')
You can implement your own SortedSet collection if you use it in multiple places.
This should do what you need:
def convTup(*args):
    return tuple(sorted(set(args), key=args.index))
So you convert args into a set, which removes duplicates but does not preserve order, then sort the unique elements by their position in the original args, and finally turn the result into a tuple (sorted returns a list).
So, to be precise, this function orders elements by their first appearance in args.
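In Python 3.7 and later, where dicts are guaranteed to keep insertion order, the same de-duplication in order of first appearance can be sketched with dict.fromkeys. This is an alternative to the answers above rather than a correction of them; the name conv_tup is just a snake_case variant of the asker's function, and the arguments must be hashable.

```python
def conv_tup(*args):
    """Return the unique arguments as a tuple, in order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so the keys are exactly the de-duplicated arguments.
    return tuple(dict.fromkeys(args))


print(conv_tup('a', 'b', 'c', 'd'))
print(conv_tup(1, 1, 2, 3))
```

This avoids the repeated args.index lookups and the "x not in p" list scans, both of which grow quadratically with the number of arguments.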

List to a different output

I am doing a minor structure manipulation using python, and have a few issues.
Currently my output is the data below.
[['a', ['b', 'c'], ['d', 'e']], ['h', ['i'], ['j']]]
I want to get into this structure below, but my data structure comes out a bit wrong. There could be multiple lists with different entry per list.
(a, b, a, d), (a, c, a, e), (h, i, h, j)
What would be the best approach?
Here's a quick one:
from itertools import product, izip
data = [['a', ['b', 'c'], ['d', 'e']], ['h', ['i'], ['j']]]
result = []
for d in data:
    first = d[0]
    for v in izip(*d[1:]):
        tmp = []
        for p in product(*[first, v]):
            tmp.extend(p)
        result.append(tuple(tmp))
print result
Output:
[('a', 'b', 'a', 'd'), ('a', 'c', 'a', 'e'), ('h', 'i', 'h', 'j')]
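The answer above targets Python 2 (izip and the print statement). A rough Python 3 sketch of the same transformation, using the built-in zip and no product call, might look like the following; the function name flatten_pairs is hypothetical, and the sketch assumes every entry has the shape [head, list, list, ...] as in the sample data.

```python
def flatten_pairs(data):
    """Pair each entry's head with the values taken column-wise from its lists."""
    result = []
    for head, *columns in data:
        # zip(*columns) walks the sublists in parallel:
        # (['b', 'c'], ['d', 'e']) -> ('b', 'd'), ('c', 'e')
        for values in zip(*columns):
            row = []
            for value in values:
                # Repeat the head before every value: a, b, a, d
                row.extend((head, value))
            result.append(tuple(row))
    return result


data = [['a', ['b', 'c'], ['d', 'e']], ['h', ['i'], ['j']]]
result = flatten_pairs(data)
print(result)
```

Unlike product(*[first, v]), this version does not iterate over the characters of the head, so heads longer than one character stay intact.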