Finding max/min value of a list in pyspark - amazon-web-services

I know this is a very trivial question, and I am quite surprised I could not find an answer on the internet, but can one find the max or min value of a list in pyspark?
In Python it is easily done by
max(list)
However, when I try the same in pyspark I get the following error:
An error was encountered:
An error occurred while calling z:org.apache.spark.sql.functions.max. Trace:
py4j.Py4JException: Method max([class java.util.ArrayList]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
at py4j.Gateway.invoke(Gateway.java:276)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Any ideas as to what I am doing wrong?
UPDATE: Adding what exactly I did:
This is my list:
cur_datelist
Output:
['2020-06-10', '2020-06-11', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-15', '2020-06-16', '2020-06-17', '2020-06-18', '2020-06-19', '2020-06-20', '2020-06-21', '2020-06-22', '2020-06-23', '2020-06-24', '2020-06-25', '2020-06-26', '2020-06-27', '2020-06-28', '2020-06-29', '2020-06-30', '2020-07-01', '2020-07-02', '2020-07-03', '2020-07-04', '2020-07-05', '2020-07-06', '2020-07-07', '2020-07-08', '2020-07-09', '2020-07-10', '2020-07-11', '2020-07-12', '2020-07-13', '2020-07-14', '2020-07-15', '2020-07-16', '2020-07-17', '2020-07-18', '2020-07-19', '2020-07-20', '2020-07-21', '2020-07-22', '2020-07-23', '2020-07-24', '2020-07-25', '2020-07-26', '2020-07-27', '2020-07-28', '2020-07-29', '2020-07-30', '2020-07-31', '2020-08-01', '2020-08-02', '2020-08-03', '2020-08-04', '2020-08-05', '2020-08-06', '2020-08-07', '2020-08-08', '2020-08-09', '2020-08-10', '2020-08-11', '2020-08-12', '2020-08-13', '2020-08-14', '2020-08-15', '2020-08-16', '2020-08-17', '2020-08-18', '2020-08-19', '2020-08-20', '2020-08-21', '2020-08-22', '2020-08-23', '2020-08-24', '2020-08-25', '2020-08-26', '2020-08-27', '2020-08-28', '2020-08-29', '2020-08-30', '2020-08-31']
The class is 'list':
type(cur_datelist)
<class 'list'>
I assumed that to be a normal pythonic list.
So when I tried max(cur_datelist), I got the above-mentioned error.

For a plain Python list there is no difference between PySpark and plain Python; the difference appears with a DataFrame array column. This is the result in my PySpark session:
# just a list
l = [1, 2, 3]
print(max(l))
# 3
# dataframe with the array column
df = spark.createDataFrame([(1, [1, 2, 3]), (2, [4, 5, 6])]).toDF('id', 'list')
import pyspark.sql.functions as f
df.withColumn('max', f.array_max(f.col('list'))).show()
#+---+---------+---+
#| id| list|max|
#+---+---------+---+
#| 1|[1, 2, 3]| 3|
#| 2|[4, 5, 6]| 6|
#+---+---------+---+
Your error comes from a name clash between the Python built-in max and the Spark column function max (typically the result of a wildcard import from pyspark.sql.functions). To avoid it, import the PySpark functions under an alias; the bare name max then keeps referring to the Python built-in.
import pyspark.sql.functions as f
l = ['2020-06-10', '2020-06-11', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-15', '2020-06-16', '2020-06-17', '2020-06-18', '2020-06-19', '2020-06-20', '2020-06-21', '2020-06-22', '2020-06-23', '2020-06-24', '2020-06-25', '2020-06-26', '2020-06-27', '2020-06-28', '2020-06-29', '2020-06-30', '2020-07-01', '2020-07-02', '2020-07-03', '2020-07-04', '2020-07-05', '2020-07-06', '2020-07-07', '2020-07-08', '2020-07-09', '2020-07-10', '2020-07-11', '2020-07-12', '2020-07-13', '2020-07-14', '2020-07-15', '2020-07-16', '2020-07-17', '2020-07-18', '2020-07-19', '2020-07-20', '2020-07-21', '2020-07-22', '2020-07-23', '2020-07-24', '2020-07-25', '2020-07-26', '2020-07-27', '2020-07-28', '2020-07-29', '2020-07-30', '2020-07-31', '2020-08-01', '2020-08-02', '2020-08-03', '2020-08-04', '2020-08-05', '2020-08-06', '2020-08-07', '2020-08-08', '2020-08-09', '2020-08-10', '2020-08-11', '2020-08-12', '2020-08-13', '2020-08-14', '2020-08-15', '2020-08-16', '2020-08-17', '2020-08-18', '2020-08-19', '2020-08-20', '2020-08-21', '2020-08-22', '2020-08-23', '2020-08-24', '2020-08-25', '2020-08-26', '2020-08-27', '2020-08-28', '2020-08-29', '2020-08-30', '2020-08-31']
print(max(l))
# 2020-08-31
Or,
import builtins as p
print(p.max(l))
# 2020-08-31
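For context, a minimal sketch of how the shadowing usually happens (assuming a wildcard import somewhere in the session; adjust to however max got rebound in your notebook):
from pyspark.sql.functions import *   # rebinds max/min to the Spark column functions
# max(cur_datelist)  # would now raise: Method max([class java.util.ArrayList]) does not exist
import builtins
print(builtins.max(['2020-06-10', '2020-08-31']))
# 2020-08-31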

Related

How do I find the most frequent element in a list in pyspark?

I have a pyspark dataframe with two columns, ID and Elements. The column "Elements" holds a list of elements. It looks like this:
ID | Elements
_______________________________________
X |[Element5, Element1, Element5]
Y |[Element Unknown, Element Unknown, Element_Z]
I want to form a new column containing the most frequent element of the column 'Elements'. The output should look like this:
ID | Elements | Output_column
__________________________________________________________________________
X |[Element5, Element1, Element5] | Element5
Y |[Element Unknown, Element Unknown, Element_Z] | Element Unknown
How can I do that using pyspark?
Thanks in advance.
We can use higher-order functions here (available from Spark 2.4+).
First use transform and aggregate to get counts for each distinct value in the array.
Then sort the array of structs in descending order and take the first element.
from pyspark.sql import functions as F

temp = (df.withColumn("Dist", F.array_distinct("Elements"))
          .withColumn("Counts", F.expr("""transform(Dist, x ->
                          aggregate(Elements, 0, (acc, y) -> IF(y = x, acc + 1, acc))
                      )"""))
          .withColumn("Map", F.arrays_zip("Dist", "Counts"))
        ).drop("Dist", "Counts")

out = temp.withColumn("Output_column",
          F.expr("""element_at(array_sort(Map, (first, second) ->
                      CASE WHEN first['Counts'] > second['Counts'] THEN -1 ELSE 1 END), 1)['Dist']"""))
Output:
Note that I have added a blank array for ID Z to test. Also, you can drop the column Map by adding .drop("Map") to the output.
out.show(truncate=False)
+---+---------------------------------------------+--------------------------------------+---------------+
|ID |Elements |Map |Output_column |
+---+---------------------------------------------+--------------------------------------+---------------+
|X |[Element5, Element1, Element5] |[{Element5, 2}, {Element1, 1}] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|[{Element Unknown, 2}, {Element_Z, 1}]|Element Unknown|
|Z |[] |[] |null |
+---+---------------------------------------------+--------------------------------------+---------------+
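For reference, a sketch of how the sample DataFrame above could be built (column names and values taken from the question; the explicit schema string and a SparkSession named spark are assumptions):
df = spark.createDataFrame(
    [("X", ["Element5", "Element1", "Element5"]),
     ("Y", ["Element Unknown", "Element Unknown", "Element_Z"]),
     ("Z", [])],
    "ID string, Elements array<string>",
)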
For lower Spark versions, you can use a UDF with statistics.mode:
from pyspark.sql import functions as F,types as T
from statistics import mode
u = F.udf(lambda x: mode(x) if len(x)>0 else None,T.StringType())
df.withColumn("Output",u("Elements")).show(truncate=False)
+---+---------------------------------------------+---------------+
|ID |Elements |Output |
+---+---------------------------------------------+---------------+
|X |[Element5, Element1, Element5] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|Element Unknown|
|Z |[] |null |
+---+---------------------------------------------+---------------+
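One caveat, assuming your runtime is Python older than 3.8: statistics.mode raises StatisticsError when the list has no unique mode. A more defensive sketch counts values manually with collections.Counter (ties resolve to whichever value Counter encounters first):
from collections import Counter
from pyspark.sql import functions as F, types as T

def safe_mode(xs):
    # most frequent element, or None for an empty array
    return Counter(xs).most_common(1)[0][0] if xs else None

u_safe = F.udf(safe_mode, T.StringType())
df.withColumn("Output", u_safe("Elements")).show(truncate=False)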
You can use PySpark SQL functions to achieve that (the sf.transform/sf.aggregate Python-lambda API used below requires Spark 3.1+).
Here is a generic function that adds a new column containing the most common element of another array column:
import pyspark.sql.functions as sf
def add_most_common_val_in_array(df, arraycol, drop=False):
    """Takes a spark df column of ArrayType() and returns the most common element
    in the array in a new column of the df called f"MostCommon_{arraycol}"

    Args:
        df (spark.DataFrame): dataframe
        arraycol (ArrayType()): array column in which you want to find the most common element
        drop (bool, optional): Drop the arraycol after finding most common element. Defaults to False.

    Returns:
        spark.DataFrame: df with additional column containing most common element in arraycol
    """
    dvals = f"distinct_{arraycol}"
    dvalscount = f"distinct_{arraycol}_count"
    startcols = df.columns
    df = df.withColumn(dvals, sf.array_distinct(arraycol))
    df = df.withColumn(
        dvalscount,
        sf.transform(
            dvals,
            lambda uval: sf.aggregate(
                arraycol,
                sf.lit(0),
                lambda acc, entry: sf.when(entry == uval, acc + 1).otherwise(acc),
            ),
        ),
    )
    countercol = f"ReverseCounter{arraycol}"
    df = df.withColumn(countercol, sf.map_from_arrays(dvalscount, dvals))
    mccol = f"MostCommon_{arraycol}"
    df = df.withColumn(mccol, sf.element_at(countercol, sf.array_max(dvalscount)))
    df = df.select(*startcols, mccol)
    if drop:
        df = df.drop(arraycol)
    return df
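A hedged usage sketch (sample data mirrored from the question; a SparkSession named spark is assumed, and ties between equally frequent values are not handled specially):
df = spark.createDataFrame(
    [("X", ["Element5", "Element1", "Element5"]),
     ("Y", ["Element Unknown", "Element Unknown", "Element_Z"])],
    "ID string, Elements array<string>",
)
add_most_common_val_in_array(df, "Elements").show(truncate=False)
# adds a MostCommon_Elements column holding Element5 and Element Unknown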

evaluate mixed type with eval()

I have two date/time variables that contain lists of date/time values, and another variable containing a list of operators to apply to the date/time variables. The format can be expressed as follows:
column1 = np.array([date1, date2,.......,dateN])
column2 = np.array([date1, date2,.......,dateN])
Both of the above variables are of date/time type. Then I have the following variable, operator, which has the same length as column1 and column2:
operator = np.array(['>=','<=','==','!=',......])
I am getting "Invalid Token" with the following operation:
np.array([eval('{}{}{}'.format(v1,op,v2)) for v1,op,v2 in zip(column1,operator,column2)])
Any hint to get around this issue?
-------------------EDIT----------------------
With some sample data and without eval I get the following output:
np.array(['{} {} {}'.format(v1,op,v2) for v1,op,v2 in zip(datelist1,operator,datelist2)])
array(['2017-03-30 10:30:22.928000 <= 2012-05-23 00:00:00',
'2011-01-07 00:00:00 == 2017-03-30 10:31:14.477000'],
dtype='|S49')
Once I bring in eval(), I get the following error:
eval('2011-01-07 00:00:00 == 2017-03-30 10:31:14.477000')
File "<string>", line 1
2011-01-07 00:00:00 == 2017-03-30 10:31:14.477000
^
SyntaxError: invalid syntax
----------------------EDIT & CORRECTIONS ----------------------------
The date/time variables I mentioned before are actually of numpy datetime64 type, and I am now getting the following issue while trying to compare two dates with eval:
np.array([(repr(d1)+op+repr(d2)) for d1,op,d2 in zip(${Column Name1},${Operator},${Column Name2})])
The above snippet is tried over a table with three columns, where ${Column Name1} and ${Column Name2} are of numpy.datetime64 type and ${Operator} is of string type. The result is as follows for one of the rows:
numpy.datetime64('2014-08-13T02:00:00.000000+0200')>=numpy.datetime64('2014-08-13T02:00:00.000000+0200')
Now I want to evaluate the above expression with function eval as follows:
np.array([eval(repr(d1)+op+repr(d2)) for d1,op,d2 in zip(${Column Name1},${Operator},${Column Name2})])
Eventually I get the following error:
NameError: name 'numpy' is not defined
I can guess the problem: the open source tool I am using imports numpy as np, whereas repr() produces expressions that reference numpy, a name it does not recognize. If this is the problem, how do I fix it?
datetime objects can be compared:
In [506]: datetime.datetime.today()
Out[506]: datetime.datetime(2017, 3, 30, 10, 43, 18, 363747)
In [507]: t1=datetime.datetime.today()
In [508]: t2=datetime.datetime.today()
In [509]: t1 < t2
Out[509]: True
In [510]: t1 == t2
Out[510]: False
NumPy's own datetime64 objects can also be compared:
In [516]: nt1 = np.datetime64('2017-03-30 10:30:22.928000')
In [517]: nt2 = np.datetime64('2017-03-30 10:31:14.477000')
In [518]: nt1 < nt2
Out[518]: True
In [519]: nt3 = np.datetime64('2012-05-23 00:00:00')
In [520]: [nt1 <= nt2, nt2==nt3]
Out[520]: [True, False]
Using the repr string version of a datetime object works:
In [524]: repr(t1)+'<'+repr(t2)
Out[524]: 'datetime.datetime(2017, 3, 30, 10, 47, 29, 69324)<datetime.datetime(2017, 3, 30, 10, 47, 33, 669494)'
In [525]: eval(repr(t1)+'<'+repr(t2))
Out[525]: True
Not that I recommend that sort of construction. I like the dictionary mapping to an operator better.
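On the NameError from the update: eval resolves names in the globals mapping you give it, so one fix (a sketch, not specific to the tool mentioned) is to bind the name numpy explicitly:
import numpy as np

d1 = np.datetime64('2014-08-13T02:00:00')
d2 = np.datetime64('2014-08-14T02:00:00')
expr = repr(d1) + '>=' + repr(d2)
# repr() spells the module as "numpy", so provide that name in the eval namespace
print(eval(expr, {'numpy': np}))
# False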
You might want to use Python's operator module for this:
# import operators used
import operator
from operator import ge, eq, le, ne
# build a look up table from string to operators
ops = {">=": ge, "==": eq, "<=": le, "!=": ne}
import numpy as np
# used some numbers to simplify the case, should work on datetime as well
a = np.array([1, 3, 5, 3])
b = np.array([2, 3, 2, 1])
operator = np.array(['>=','<=','==','!='])
# evaluate the operation
[ops[op](x, y) for op, x, y in zip(operator, a, b)]
# [False, True, False, True]
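The same lookup works on datetime64 values; a sketch reusing the ops table above, with timestamps taken from the question's edit (rounded to seconds):
dates1 = np.array(['2017-03-30T10:30:22', '2011-01-07T00:00:00'], dtype='datetime64[s]')
dates2 = np.array(['2012-05-23T00:00:00', '2017-03-30T10:31:14'], dtype='datetime64[s]')
date_ops = np.array(['<=', '=='])
print([ops[op](x, y) for op, x, y in zip(date_ops, dates1, dates2)])
# [False, False]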

PySpark add a column to a DataFrame from a list of strings

I'm looking for a way to add a new column to a Spark DataFrame from a list of strings, in just one simple line.
Given :
rdd = sc.parallelize([((u'2016-10-19', u'2016-293'), 40020),
                      ((u'2016-10-19', u'2016-293'), 143938),
                      ((u'2016-10-19', u'2016-293'), 135891225.0)])
This is my code to structure my rdd and get a DataFrame:
def structurate_CohortPeriod_metrics_by_OrderPeriod(line):
    ((OrderDate, OrderPeriod), metrics) = line
    metrics = str(metrics)
    return OrderDate, OrderPeriod, metrics

(rdd
 .map(structurate_CohortPeriod_metrics_by_OrderPeriod)
 .toDF(['OrderDate', 'OrderPeriod', 'MetricValue'])
 .show())
Result:
+----------+-----------+-----------+
| OrderDate|OrderPeriod|MetricValue|
+----------+-----------+-----------+
|2016-10-19| 2016-293| 40020|
|2016-10-19| 2016-293| 143938|
|2016-10-19| 2016-293|135891225.0|
+----------+-----------+-----------+
I want to add a new column specifying the metric's name. This is what I've done:
def structurate_CohortPeriod_metrics_by_OrderPeriod(line):
    (((OrderDate, OrderPeriod), metrics), index) = line
    metrics = str(metrics)
    return OrderDate, OrderPeriod, metrics, index

df1 = (rdd
       .zipWithIndex()
       .map(structurate_CohortPeriod_metrics_by_OrderPeriod)
       .toDF(['OrderDate', 'OrderPeriod', 'MetricValue', 'index']))
Then
from pyspark.sql.types import StructType, StructField, StringType

df2 = sqlContext.createDataFrame(sc.parallelize([('0', 'UsersNb'),
                                                 ('1', 'VideosNb'),
                                                 ('2', 'VideosDuration')]),
                                 StructType([StructField('index', StringType()),
                                             StructField('MetricName', StringType())]))
df2.show()
+-----+--------------+
|index| MetricName|
+-----+--------------+
| 0| UsersNb|
| 1| VideosNb|
| 2|VideosDuration|
+-----+--------------+
And finally:
(df1
.join(df2, df1.index == df2.index)
.drop(df2.index)
.select('index', 'OrderDate', 'OrderPeriod', 'MetricName', 'MetricValue')
.show())
+-----+----------+-----------+--------------+-----------+
|index| OrderDate|OrderPeriod| MetricName|MetricValue|
+-----+----------+-----------+--------------+-----------+
| 0|2016-10-19| 2016-293| VideosNb| 143938|
| 1|2016-10-19| 2016-293| UsersNb| 40020|
| 2|2016-10-19| 2016-293|VideosDuration|135891225.0|
+-----+----------+-----------+--------------+-----------+
This is my expected output, but this method takes considerably longer. I want to do this in just one or two lines, something like, for example, the lit method:
from pyspark.sql.functions import lit
df1.withColumn('MetricName', lit('my_string'))
But of course I need to put in 3 different strings: 'VideosNb', 'UsersNb' and 'VideosDuration'.
Ideas? Thank you very much!
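One possible single-expression sketch, not a tested answer: it assumes the metric order in the rdd really is UsersNb, VideosNb, VideosDuration, and relies on the SQL element_at function (available from Spark 2.4) indexed by the zipWithIndex column:
from pyspark.sql import functions as F

metric_names = ['UsersNb', 'VideosNb', 'VideosDuration']
names_sql = ', '.join("'{}'".format(n) for n in metric_names)
df1.withColumn('MetricName',
               F.expr("element_at(array({}), cast(index as int) + 1)".format(names_sql))).show()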

Slicing a pandas column based on the position of a matching substring

I am trying to slice a pandas column called PATH from a DataFrame called dframe so that I get the ad1 container's filename (with the extension) in a new column called AD1position.
PATH
0 \
1 \abc.ad1\xaxaxa
2 \defghij.ad1\wbcbcb
3 \tuvwxyz.ad1\ydeded
In other words, here's what I want to see:
PATH AD1position
0 \
1 \abc.ad1\xaxaxa abc.ad1
2 \defghij.ad1\wbcbcb defghij.ad1
3 \tuvwxyz.ad1\ydeded tuvwxyz.ad1
If I was to do this in Excel, I would write:
=if(iserror(search(".ad1",[PATH])),"",mid([PATH],2,search(".ad1",[PATH]) + 3))
In Python, I seem to be stuck. Here's what I wrote thus far:
dframe['AD1position'] = dframe['PATH'].apply(lambda x: x['PATH'].str[1:(x['PATH'].str.find('.ad1')) \
+ 3] if x['PATH'].str.find('.ad1') != -1 else "")
Doing this returns the following error:
TypeError: string indices must be integers
I suspect that the problem is caused by the function in the slicer, but I'd appreciate any help with figuring out how to resolve this.
Use the .str.extract() function:
In [17]: df['AD1position'] = df.PATH.str.extract(r'.*?([^\\]*\.ad1)', expand=True)
In [18]: df
Out[18]:
PATH AD1position
0 \ NaN
1 \aaa\bbb NaN
2 \byz.ad1 byz.ad1
3 \abc.ad1\xaxaxa abc.ad1
4 \defghij.ad1\wbcbcb defghij.ad1
5 \tuvwxyz.ad1\ydeded tuvwxyz.ad1
This will get you the first element of the split.
df['AD1position'] = df.PATH.str.split('\\').str.get(1)
Thank you Root.
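For completeness, a small reproduction sketch of both approaches (sample frame rebuilt from the question; expand=False is used here so extract returns a Series):
import pandas as pd

df = pd.DataFrame({'PATH': ['\\', '\\abc.ad1\\xaxaxa',
                            '\\defghij.ad1\\wbcbcb', '\\tuvwxyz.ad1\\ydeded']})

# regex approach: only fills rows whose component really ends in .ad1
df['AD1position'] = df.PATH.str.extract(r'.*?([^\\]*\.ad1)', expand=False)

# split approach: returns the first path component whether or not it is an .ad1 container
first_component = df.PATH.str.split('\\').str.get(1)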

Pandas Series - print columns and rows

For now I am not so worried about the most performant way to get at my data in a series. Let's say that my series is as follows:
A 1
B 2
C 3
D 4
If I use a for loop to iterate over this, for example:
for row in seriesObj:
    print row
The code above will print the values down the right-hand side, but let's say I want to get at the left column (the indexes); how might I do that?
All help greatly appreciated, I am very new to pandas and am having some teething problems.
Thanks.
Try Series.iteritems.
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=iter('ABCD'))
for ind, val in s.iteritems():
    print ind, val
Prints:
A 1
B 2
C 3
D 4
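A small follow-up sketch, assuming a recent pandas/Python 3 environment where Series.iteritems has been removed: Series.items is the drop-in replacement and yields the same (index, value) pairs.
for ind, val in s.items():
    print(ind, val)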