Scala Spark: reduce a list in groupBy

I have a Spark DataFrame with two columns:

colA colB
1    3
1    2
2    4
2    5
2    1
I want to groupBy colA and iterate over the colB list for each group such that:

res = 0
for i in collect_list(col("colB")):
    res = i * (3 + res)

The returned value should be res, so I get:

colA colB
1    24
2    78
How can I do this in Scala?

You can achieve the result you want with the following:
import org.apache.spark.sql.functions._

val df = Seq((1,3), (1,2), (2,4), (2,5), (2,1)).toDF("colA", "colB")

val retDf = df
  .groupBy("colA")
  .agg(
    aggregate(
      collect_list("colB"), lit(0), (acc, nxt) => nxt * (acc + 3)
    ) as "colB"
  )
You need to be very careful with this, however, as data in Spark is distributed. If the data has been shuffled since being read into Spark, there is no guarantee that it will be collected in the same order. In the toy example, collect_list("colB") will return Seq(3, 2) where colA is 1. If there had been a shuffle at an earlier phase, however, collect_list can just as well return Seq(2, 3), which would give you 27 instead of the desired 24. You need to attach some metadata to your data which you can use to ensure you're processing it in the order you expect, for example with the monotonically_increasing_id function.
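A minimal sketch of that idea, assuming Spark 3's Scala transform/aggregate column functions and that the id is assigned before any shuffle happens:

import org.apache.spark.sql.functions._

// Tag each row with an id as early as possible, before any shuffle.
val dfWithId = df.withColumn("rowId", monotonically_increasing_id())

val retDf = dfWithId
  .groupBy("colA")
  .agg(
    aggregate(
      // array_sort orders the (rowId, colB) structs by rowId, restoring
      // the original order; transform then keeps only the colB values.
      transform(
        array_sort(collect_list(struct(col("rowId"), col("colB")))),
        s => s.getField("colB")
      ),
      lit(0),
      (acc, nxt) => nxt * (acc + 3)
    ) as "colB"
  )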

RDD approach with no loss of ordering.
%scala
// Pair each element with its original index so ordering can be restored later.
val rdd1 = spark.sparkContext
  .parallelize(Seq((1,3), (1,2), (2,4), (2,5), (2,1)))
  .zipWithIndex()
  .map(x => (x._1._1, (x._1._2, x._2)))
val rdd2 = rdd1.groupByKey
// Convert to Array.
val rdd3 = rdd2.map(x => (x._1, x._2.toArray))
// Sort each group by the original index to restore the input order.
val rdd4 = rdd3.map(x => (x._1, x._2.sortBy(_._2)))
val rdd5 = rdd4.mapValues(v => v.map(_._1))
rdd5.collect()
// foldLeft guarantees left-to-right evaluation over each group.
val res = rdd5.map(x => (x._1, x._2.foldLeft(0)((acc, nxt) => nxt * (acc + 3))))
res.collect()
returns:
res201: Array[(Int, Int)] = Array((1,24), (2,78))
Convert from and to a DataFrame as required; a minimal sketch follows.
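For example, converting the RDD result back to a DataFrame (assuming the spark.implicits._ conversions are in scope):

import spark.implicits._

val resDf = res.toDF("colA", "colB")
resDf.show()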

Related

How do I find the most frequent element in a list in pyspark?

I have a pyspark dataframe with two columns, ID and Elements. Column "Elements" holds a list in each row. It looks like this:

ID | Elements
---+----------------------------------------------
X  | [Element5, Element1, Element5]
Y  | [Element Unknown, Element Unknown, Element_Z]

I want to form a column with the most frequent element of the column "Elements". The output should look like:

ID | Elements                                      | Output_column
---+-----------------------------------------------+----------------
X  | [Element5, Element1, Element5]                | Element5
Y  | [Element Unknown, Element Unknown, Element_Z] | Element Unknown
How can I do that using pyspark?
Thanks in advance.
We can use higher-order functions here (available from Spark 2.4+).
First use transform and aggregate to get counts for each distinct value in the array.
Then sort the array of structs in descending order of count and take the first element.
from pyspark.sql import functions as F

temp = (df.withColumn("Dist", F.array_distinct("Elements"))
          .withColumn("Counts", F.expr("""transform(Dist, x ->
                          aggregate(Elements, 0, (acc, y) -> IF(y = x, acc + 1, acc)))"""))
          .withColumn("Map", F.arrays_zip("Dist", "Counts"))
       ).drop("Dist", "Counts")

out = temp.withColumn("Output_column",
          F.expr("""element_at(array_sort(Map, (first, second) ->
                    CASE WHEN first['Counts'] > second['Counts'] THEN -1 ELSE 1 END), 1)['Dist']"""))
Output:
Note that I have added an empty array for ID Z to test. You can also drop the column Map by adding .drop("Map") to the output.
out.show(truncate=False)
+---+---------------------------------------------+--------------------------------------+---------------+
|ID |Elements |Map |Output_column |
+---+---------------------------------------------+--------------------------------------+---------------+
|X |[Element5, Element1, Element5] |[{Element5, 2}, {Element1, 1}] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|[{Element Unknown, 2}, {Element_Z, 1}]|Element Unknown|
|Z |[] |[] |null |
+---+---------------------------------------------+--------------------------------------+---------------+
For lower versions, you can use a UDF with statistics.mode:
from pyspark.sql import functions as F, types as T
from statistics import mode

u = F.udf(lambda x: mode(x) if len(x) > 0 else None, T.StringType())
df.withColumn("Output", u("Elements")).show(truncate=False)
+---+---------------------------------------------+---------------+
|ID |Elements |Output |
+---+---------------------------------------------+---------------+
|X |[Element5, Element1, Element5] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|Element Unknown|
|Z |[] |null |
+---+---------------------------------------------+---------------+
You can use PySpark SQL functions to achieve that (note: the pyspark.sql.functions wrappers transform and aggregate used below were added in Spark 3.1; on 2.4–3.0 the same higher-order functions are only reachable via F.expr).
Here is a generic function that adds a new column containing the most common element of another array column:
import pyspark.sql.functions as sf

def add_most_common_val_in_array(df, arraycol, drop=False):
    """Takes a spark df column of ArrayType() and returns the most common element
    in the array in a new column of the df called f"MostCommon_{arraycol}"

    Args:
        df (spark.DataFrame): dataframe
        arraycol (ArrayType()): array column in which you want to find the most common element
        drop (bool, optional): Drop the arraycol after finding most common element. Defaults to False.

    Returns:
        spark.DataFrame: df with additional column containing most common element in arraycol
    """
    dvals = f"distinct_{arraycol}"
    dvalscount = f"distinct_{arraycol}_count"
    startcols = df.columns
    df = df.withColumn(dvals, sf.array_distinct(arraycol))
    # For each distinct value, count its occurrences in the original array.
    df = df.withColumn(
        dvalscount,
        sf.transform(
            dvals,
            lambda uval: sf.aggregate(
                arraycol,
                sf.lit(0),
                lambda acc, entry: sf.when(entry == uval, acc + 1).otherwise(acc),
            ),
        ),
    )
    # Map each count back to its value, then look up the value with the max count.
    countercol = f"ReverseCounter{arraycol}"
    df = df.withColumn(countercol, sf.map_from_arrays(dvalscount, dvals))
    mccol = f"MostCommon_{arraycol}"
    df = df.withColumn(mccol, sf.element_at(countercol, sf.array_max(dvalscount)))
    df = df.select(*startcols, mccol)
    if drop:
        df = df.drop(arraycol)
    return df
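A quick usage sketch, assuming an active SparkSession named spark and the example data from the question:

df = spark.createDataFrame(
    [("X", ["Element5", "Element1", "Element5"]),
     ("Y", ["Element Unknown", "Element Unknown", "Element_Z"])],
    ["ID", "Elements"],
)
add_most_common_val_in_array(df, "Elements").show(truncate=False)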

Calculating the value of a field based on the difference between the values of another field in two adjacent positions using Haskell

I have a list of custom data objects which track an increasing total on a daily basis via the field total. Another field in the custom data type is new. From a CSV file I have read in the values for date and total, and I am trying to calculate and set the values of new from them.
data Item = Item
  { date  :: Day
  , total :: Int
  , new   :: Int
  }
Before:

date        total  new
01/01/2021  0      0
02/01/2021  2      0
03/01/2021  6      0
04/01/2021  15     0

After:

date        total  new
01/01/2021  0      0
02/01/2021  2      2
03/01/2021  6      4
04/01/2021  15     9
My understanding is that in Haskell I should avoid explicit loops which iterate over a list until the final row is reached, for example a loop control that terminates upon reaching a value equal to the length of the list.
Instead I have tried to create a function which assigns the value of new and which can be used with map to update each item in the list. My problem is that such a function requires access both to the item being updated and to the previous item's total, and I'm unsure how to implement this in Haskell.
-- Set daily values by mapping a single update function across the list
calcNew :: [Item] -> Int -> [Item]
calcNew items = map updateOneItem items

-- takes an item and a value to fill the new field
updateOneItem :: Item -> Int -> Item
updateOneItem item x = Item (date item) (total item) x
Is it possible to populate that value while using map? If not, is a recursive solution required?
We can do this by zipping the input list with itself, shifted by one step.
Assuming you have a list of items already populated with total values, which you want to update to contain the correct new values (building an updated copy of course),
type Day = Int

data Item = Item  -- data Item, NB: the type name must be capitalized
  { date  :: Day
  , total :: Int
  , new   :: Int
  } deriving Show

calcNews :: [Item] -> [Item]
calcNews [] = []
calcNews totalsOK@(t:ts) = t : zipWith f ts totalsOK
  where
    f this prev = this{ new = total this - total prev }
This gives us
> calcNews [Item 1 0 0, Item 2 2 0, Item 3 5 0, Item 4 10 0]
[Item {date = 1, total = 0, new = 0},Item {date = 2, total = 2, new = 2},
 Item {date = 3, total = 5, new = 3},Item {date = 4, total = 10, new = 5}]
Of course zipWith f x y == map (\(a,b) -> f a b) $ zip x y, as we saw in your previous question, so zipWith is like a binary map.
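For instance, in GHCi both spellings compute the same pairwise sums:

> zipWith (+) [1,2,3] [10,20,30]
[11,22,33]
> map (\(a,b) -> a + b) $ zip [1,2,3] [10,20,30]
[11,22,33]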
Sometimes (though not here) we might need access to the previously calculated value as well, to calculate the next value. To arrange for that we can create the result by zipping the input with the shifted version of the result itself:
calcNews2 :: [Item] -> [Item]
calcNews2 [] = []
calcNews2 (t:totalsOK) = newsOK
  where
    newsOK = t : zipWith f totalsOK newsOK
    f tot nw = tot{ new = total tot - total nw }
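A quick check on the same sample list (the result is identical to calcNews, since each new value here depends only on the totals):

> calcNews2 [Item 1 0 0, Item 2 2 0, Item 3 5 0, Item 4 10 0]
[Item {date = 1, total = 0, new = 0},Item {date = 2, total = 2, new = 2},
 Item {date = 3, total = 5, new = 3},Item {date = 4, total = 10, new = 5}]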

How to extract the where clause as an array in Spark SQL?

I am trying to extract the where clause from a SQL query. The multiple conditions in the where clause should end up as elements of an array. Please help me.
Sample Input String:
select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null
Output Expected:
Array("col1=1", "(col2 between 1 and 10 or col2 between 190 and 200)", "col2 is not null")
Thanks in advance.
EDIT:
My question here is like... I would like to split all the conditions as separate items... let's say my query is like
select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null
The output I'm expecting is like
List("col1=1", "col2 between 1 and 10", "col2 between 190 and 200", "col2 is not null")
The thing is the query may have multiple levels of conditions like
select * from table where col1=1 and (col2=2 or (col3 between 1 and 10 or col3 between 190 and 200)) and col4='xyz'
In the output, each condition should be a separate item:
List("col1=1","col2=2", "col3 between 1 and 10", "col3 between 190 and 200", "col4='xyz'")
I wouldn't use regex for this. Here's an alternative way to extract your conditions, based on Catalyst's logical plan:
import org.apache.spark.sql.catalyst.expressions.{And, Expression, Predicate}
import org.apache.spark.sql.catalyst.plans.logical.Filter

val plan = df.queryExecution.logical
val predicates: Seq[Expression] = plan.children.collect { case f: Filter =>
  f.condition.productIterator.flatMap {
    case And(l, r)    => Seq(l, r)
    case o: Predicate => Seq(o)
  }
}.toList.flatten
println(predicates)
Output:
List(('col1 = 1), ((('col2 >= 1) && ('col2 <= 10)) || (('col2 >= 190) && ('col2 <= 200))), isnotnull('col2))
Here the predicates are still Expressions and hold information (tree representation).
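For instance, to inspect that tree representation, you can print each predicate with treeString (a method from Catalyst's TreeNode API):

predicates.foreach(p => println(p.treeString))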
EDIT:
As asked in a comment, here's a String (user-friendly, I hope) representation of the predicates :)
val plan = df.queryExecution.logical
val predicates: Seq[Expression] = plan.children.collect { case f: Filter =>
  f.condition.productIterator.flatMap {
    case o: Predicate => Seq(o)
  }
}.toList.flatten
import org.apache.spark.sql.catalyst.expressions.{EqualTo, GreaterThanOrEqual, IsNotNull, LessThanOrEqual, Or}

def stringifyExpressions(expression: Expression): Seq[String] = {
  expression match {
    case And(l, r) => (l, r) match {
      case (gte: GreaterThanOrEqual, lte: LessThanOrEqual) =>
        Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
      case (_, _) => Seq(l, r).flatMap(stringifyExpressions)
    }
    case Or(l, r) => Seq(Seq(l, r).flatMap(stringifyExpressions).mkString("(", ") OR (", ")"))
    case eq: EqualTo => Seq(s"${eq.left.toString} = ${eq.right.toString}")
    case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
    case p: Predicate => Seq(p.toString)
  }
}
val stringRepresentation = predicates.flatMap{stringifyExpressions}
println(stringRepresentation)
New output:
List('col1 = 1, ('col2 between 1 and 10) OR ('col2 between 190 and 200), 'col2 is not null)
You can keep playing with the recursive stringifyExpressions method if you want to customize the output.
EDIT 2: In response to your edit:
You can change the Or / EqualTo cases to the following
import org.apache.spark.sql.types.StringType

def stringifyExpressions(expression: Expression): Seq[String] = {
  expression match {
    case And(l, r) => (l, r) match {
      case (gte: GreaterThanOrEqual, lte: LessThanOrEqual) =>
        Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
      case (_, _) => Seq(l, r).flatMap(stringifyExpressions)
    }
    case Or(l, r) => Seq(l, r).flatMap(stringifyExpressions)
    case EqualTo(l, r) =>
      val prettyLeft  = if (l.resolved && l.dataType == StringType) s"'${l.toString}'" else l.toString
      val prettyRight = if (r.resolved && r.dataType == StringType) s"'${r.toString}'" else r.toString
      Seq(s"$prettyLeft=$prettyRight")
    case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
    case p: Predicate => Seq(p.toString)
  }
}
This gives the 4-element List:
List('col1=1, 'col2 between 1 and 10, 'col2 between 190 and 200, 'col2 is not null)
For the second example:
select * from table where col1=1 and (col2 =2 or (col3 between 1 and 10 or col3 between 190 and 200)) and col4='xyz'
You'd get this output (a List[String] with 5 elements):
List('col1=1, 'col2=2, 'col3 between 1 and 10, 'col3 between 190 and 200, 'col4='xyz')
Additional note: if you want to print the attribute names without the leading quote, you can print this instead of toString:
node.asInstanceOf[UnresolvedAttribute].name
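As a self-contained sketch (a hypothetical helper, not part of the answer's code), that cast could be wrapped safely like this:

import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.Expression

// Render an unresolved attribute without the leading quote; fall back to toString.
def prettyName(e: Expression): String = e match {
  case a: UnresolvedAttribute => a.name
  case other                  => other.toString
}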

How to make a list of lists from a pandas dataframe, skipping NaN values

I have a pandas dataframe that looks roughly like:

  foo foo2 foo3 foo4
a  NY   WA   AZ  NaN
b  DC  NaN  NaN  NaN
c  MA   CA  NaN  NaN

I'd like to make a nested list of the observations of this dataframe, but omit the NaN values, so I have something like [['NY','WA','AZ'], ['DC'], ['MA','CA']].
There is a pattern in this dataframe, if that makes a difference, such that if fooX is empty, the subsequent column fooY will also be empty.
I originally had something like the code below. I'm sure there's a nicer way to do this.
A = [[i] for i in subset_label['label'].tolist()]
B = [i for i in subset_label['label2'].tolist()]
C = [i for i in subset_label['label3'].tolist()]
D = [i for i in subset_label['label4'].tolist()]
out_list = []
for index, row in subset_label.iterrows():
    out_list.append([row.label, row.label2, row.label3, row.label4])
out_list
Option 1
pd.DataFrame.stack drops NaN values by default.
df.stack().groupby(level=0).apply(list).tolist()
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Option 2
Fun alternative, because I think summing lists within pandas objects is fun.
df.applymap(lambda x: [x] if pd.notnull(x) else []).sum(1).tolist()
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Option 3
numpy experiment
nn = df.notnull().values
sliced = df.values.ravel()[nn.ravel()]
splits = nn.sum(1)[:-1].cumsum()
[s.tolist() for s in np.split(sliced, splits)]
[['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Try this:
In [77]: df.T.apply(lambda x: x.dropna().tolist()).tolist()
Out[77]: [['NY', 'WA', 'AZ'], ['DC'], ['MA', 'CA']]
Here's a vectorized version!

import numpy as np
import pandas as pd

original = pd.DataFrame(data={
    'foo': ['NY', 'DC', 'MA'],
    'foo2': ['WA', np.nan, 'CA'],
    'foo3': ['AZ', np.nan, np.nan],
    'foo4': [np.nan] * 3,
})

out = original.copy().fillna('NAN')

# Build up a mapping such that each non-nan entry is mapped to [entry]
# and nan entries are mapped to []
unique_entries = np.unique(out.values)
mapping = {e: [e] for e in unique_entries}
mapping['NAN'] = []

# Apply the mapping column by column
for c in original.columns:
    out[c] = out[c].map(mapping)

# Concatenate the lists along axis 1
out.sum(axis=1)
You should get something like
0 [NY, WA, AZ]
1 [DC]
2 [MA, CA]
dtype: object

Using Pandas to subset data from a dataframe based on multiple columns?

I am new to Python. I have to extract a subset from a pandas DataFrame based on two lists corresponding to two columns of that DataFrame. Both values in the lists should match the DataFrame row-wise, as pairs. I have tried the isin function, but obviously it doesn't work with combinations.
import pandas as pd

d = {'A': ['a', 'a', 'c', 'a', 'b'], 'B': [1, 2, 1, 4, 1]}
df = pd.DataFrame(d)
list1 = ['a', 'b']
list2 = [1, 2]
print(df)

   A  B
0  a  1
1  a  2
2  c  1
3  a  4
4  b  1
### Using the isin function
df[df.A.isin(list1) & df.B.isin(list2)]

   A  B
0  a  1
1  a  2
4  b  1
### Desired outcome
d2 = {'A': ['a'], 'B': [1]}
pd.DataFrame(d2)

   A  B
0  a  1
Please let me know if this can be done without using loops and if there is a way to do it in a single step.
A quick and dirty way to do this is using zip:

# In Python 3, zip returns an iterator, so materialize it with list()
df['C'] = list(zip(df['A'], df['B']))
list3 = list(zip(list1, list2))
d2 = df[df['C'].isin(list3)]
print(d2)

   A  B       C
0  a  1  (a, 1)
You can of course drop the newly created column after you're done filtering on it, as in the sketch below.
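Putting it together, a minimal end-to-end sketch:

import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'c', 'a', 'b'], 'B': [1, 2, 1, 4, 1]})
list1 = ['a', 'b']
list2 = [1, 2]

# Build the helper column of (A, B) pairs, filter on it, then drop it
df['C'] = list(zip(df['A'], df['B']))
d2 = df[df['C'].isin(list(zip(list1, list2)))].drop(columns='C')
print(d2)
#    A  B
# 0  a  1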