Pivoting with missing values - python-2.7

I have a DataFrame with the following simple schema:
root
|-- amount: double (nullable = true)
|-- Date: timestamp (nullable = true)
I was trying to see the sum of amounts per day and per hour, something like:
+---+--------+--------+ ... +--------+
|day| 0| 1| | 23|
+---+--------+--------+ ... +--------+
|148| 306.0| 106.0| | 0.0|
|243| 1906.0| 50.0| | 1.0|
| 31| 866.0| 100.0| | 0.0|
+---+--------+--------+ ... +--------+
Well, first I added an hour column, then I grouped by day and pivoted by hour. However, I got an exception, which is perhaps related to missing sales for some hours. This is what I'm trying to fix, but I haven't figured out how.
from pyspark.sql.functions import hour, dayofyear

(df.withColumn("hour", hour("date"))
    .groupBy(dayofyear("date").alias("day"))
    .pivot("hour")
    .sum("amount")
    .show())
An excerpt of the exception.
AnalysisException: u'resolved attribute(s) date#3972 missing from
day#5367,hour#5354,sum(amount)#5437 in operator !Aggregate
[dayofyear(cast(date#3972 as date))], [dayofyear(cast(date#3972 as
date)) AS day#5367, pivotfirst(hour#5354, sum(amount)#5437, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 0, 0) AS __pivot_sum(amount) AS sum(amount)#5487];'

The problem is the unresolved day column. You can create it outside the groupBy clause to address that:
from pyspark.sql.functions import col, hour, dayofyear

df = (sc
    .parallelize([
        (1.0, "2016-03-30 01:00:00"), (30.2, "2015-01-02 03:00:02")])
    .toDF(["amount", "Date"])
    .withColumn("Date", col("Date").cast("timestamp"))
    .withColumn("hour", hour("Date")))

with_day = df.withColumn("day", dayofyear("Date"))
with_day.groupBy("day").pivot("hour", range(0, 24)).sum("amount")
The values argument for pivot is optional but advisable.
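Missing day/hour combinations come back as null after the pivot; if you want the 0.0 cells shown in the desired output, you can fill them afterwards. A minimal sketch, assuming the with_day DataFrame from above:
result = (with_day
    .groupBy("day")
    .pivot("hour", range(0, 24))  # fixing the values list avoids an extra pass to detect them
    .sum("amount")
    .fillna(0.0))                 # hours with no sales become 0.0 instead of null
result.show()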

Related

Is it possible to compose ranges with minmax and join using |

Given the following example using ranges-v3:
using namespace ranges::views;
std::vector<std::vector<int>> v1 = {
    {1, 2, 3, 17, 5},
    {11, 12, -1, 14, 15}
};
auto [min_val, max_val] = ranges::minmax(join(v1));
I can get the correct minmax value. But I am wondering if it is possible to use pipe composition with things like join and minmax?
For example, why can I not do this:
auto [min_val, max_val] = v1 | join() | ranges::minmax();
The compiler error is long and unwieldy, because I am compiling with C++17 and do not have concepts. But I was hoping to understand from the documentation here when I can and cannot use the | (compose) style.

Nested list within a dataframe column, extracting the values of a list within a dataframe column (PySpark)

Please, I would like to transform the tags column in the match_event DataFrame below
+-------+------------+------------------+--------+-------+-----------+--------+--------------------+----------+--------------------+--------------------+------+
|eventId| eventName| eventSec| id|matchId|matchPeriod|playerId| positions|subEventId| subEventName| tags|teamId|
+-------+------------+------------------+--------+-------+-----------+--------+--------------------+----------+--------------------+--------------------+------+
| 8| Pass| 1.255989999999997|88178642|1694390| 1H| 26010|[[50, 48], [47, 50]]| 85| Simple pass| [[1801]]| 4418|
| 8| Pass|2.3519079999999803|88178643|1694390| 1H| 3682|[[47, 50], [41, 48]]| 85| Simple pass| [[1801]]| 4418|
| 8| Pass|3.2410280000000284|88178644|1694390| 1H| 31528|[[41, 48], [32, 35]]| 85| Simple pass| [[1801]]| 4418|
| 8| Pass| 6.033681000000001|88178645|1694390| 1H| 7855| [[32, 35], [89, 6]]| 83| High pass| [[1802]]| 4418|
| 1| Duel|13.143591000000015|88178646|1694390| 1H| 25437| [[89, 6], [85, 0]]| 12|Ground defending ...| [[702], [1801]]| 4418|
| 1| Duel|14.138041000000044|88178663|1694390| 1H| 83575|[[11, 94], [15, 1...| 11|Ground attacking ...| [[702], [1801]]| 11944|
| 3| Free Kick|27.053005999999982|88178648|1694390| 1H| 7915| [[85, 0], [93, 16]]| 36| Throw in| [[1802]]| 4418|
| 8| Pass| 28.97515999999996|88178667|1694390| 1H| 70090| [[7, 84], [9, 71]]| 82| Head pass| [[1401], [1802]]| 11944|
| 10| Shot| 31.22621700000002|88178649|1694390| 1H| 25437| [[91, 29], [0, 0]]| 100| Shot|[[402], [1401], [...| 4418|
| 9|Save attempt| 32.66416000000004|88178674|1694390| 1H| 83574|[[100, 100], [15,...| 91| Save attempt| [[1203], [1801]]| 11944|
+-------+------------+------------------+--------+-------+-----------+--------+--------------------+----------+--------------------+--------------------+------+
to something like this, that is, extracting the last item in each list into a column as seen below
+----+
|tags|
+----+
|1801|
|1801|
|1801|
|1802|
|1801|
|1801|
+----+
The column would then be re-attached to the match_event DataFrame, maybe using withColumn.
I tried the code below:
u = match_event[['tags']].rdd
t = u.map(lambda xs: [n for x in xs[-1:] for n in x[-1:]])
tag = spark.createDataFrame(t, ['tag'])
I got the following. It was difficult to take this further using withColumn:
+------+
| tag|
+------+
|[1801]|
|[1801]|
|[1801]|
|[1802]|
|[1801]|
|[1801]|
|[1802]|
|[1802]|
|[1801]|
|[1801]|
|[1801]|
|[1801]|
|[1302]|
|[1802]|
|[1801]|
|[1802]|
|[1801]|
|[1801]|
|[1801]|
|[1801]|
+------+
Please help. Thanks in advance
For Spark 2.4+, use element_at:
import pyspark.sql.functions as F

df.withColumn("lastItem", F.element_at("tags", -1)[0]).show()
#+---------------+--------+
#| tags|lastItem|
#+---------------+--------+
#|[[1], [2], [3]]| 3|
#|[[1], [2], [3]]| 3|
#+---------------+--------+
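Applied to the question's match_event DataFrame (assuming tags really is an array of arrays of integers, and re-using the F import from above), re-attaching the extracted value with withColumn could look like:
# element_at(..., -1) takes the last inner array; [0] then takes its first element
match_event = match_event.withColumn("tag", F.element_at("tags", -1)[0])
match_event.select("subEventName", "tags", "tag").show(5, truncate=False)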
Try this:
from pyspark.sql.functions import udf

columns = ['eventId', 'eventName', 'eventSec', 'id', 'matchId', 'matchPeriod', 'playerId',
           'positions', 'subEventId', 'subEventName', 'tags', 'teamId']
vals = [(8, "Pass", 1.255989999999997, 88178642, 1694390, "1H", 26010,
         [[50, 48], [47, 50]], 85, "Simple pass", [[1801]], 4418),
        (1, "Duel", 13.143591000000015, 88178646, 1694390, "1H", 25437,
         [[89, 6], [85, 0]], 12, "Ground defending", [[702], [1801]], 4418)]

# returns a one-element list holding the last value of the last inner list
udf1 = spark.udf.register("Lastcol", lambda xs: [n for x in xs[-1:] for n in x[-1:]])

df = spark.createDataFrame(vals, columns)
df2 = df.withColumn('created_col', udf1('tags'))
df2.show()
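If you want to stay in the DataFrame API without a Python UDF (this also works on Spark 2.x), plain SQL array indexing via expr is a possible alternative; a sketch using the same df:
from pyspark.sql import functions as F

# last inner array, then its first element, evaluated in the JVM (no Python round-trip)
df3 = df.withColumn('created_col', F.expr("tags[size(tags) - 1][0]"))
df3.select('tags', 'created_col').show()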

PySpark Combine Rows to Columns StackOverFlow Error

What I want (very simplified):
Input Dataset to Output dataset
Some of the code I tried:
def add_columns(cur_typ, target, value):
    if cur_typ == target:
        return value
    return None

schema = T.StructType([T.StructField("name", T.StringType(), True),
                       T.StructField("typeT", T.StringType(), True),
                       T.StructField("value", T.IntegerType(), True)])
data = [("x", "a", 3), ("x", "b", 5), ("x", "c", 7), ("y", "a", 1), ("y", "b", 2),
        ("y", "c", 4), ("z", "a", 6), ("z", "b", 2), ("z", "c", 3)]
df = ctx.spark_session.createDataFrame(ctx.spark_session.sparkContext.parallelize(data), schema)

targets = [i.typeT for i in df.select("typeT").distinct().collect()]

add_columns = F.udf(add_columns)
w = Window.partitionBy('name')
for target in targets:
    df = df.withColumn(target, F.max(F.lit(add_columns(df["typeT"], F.lit(target), df["value"]))).over(w))
df = df.drop("typeT", "value").dropDuplicates()
Another version:
targets = df.select(F.collect_set("typeT").alias("typeT")).first()["typeT"]
w = Window.partitionBy('name')
for target in targets:
    df = df.withColumn(target, F.max(F.lit(F.when(df["typeT"] == F.lit(target), df["value"])
                                           .otherwise(None)).over(w)))
df = df.drop("typeT", "value").dropDuplicates()
For small datasets both work, but I have a DataFrame with 1 million rows and 5000 different typeTs, so the result should be a table of about 500 x 5000 (some names do not have certain typeTs).
Now I get stack overflow errors (py4j.protocol.Py4JJavaError: An error occurred while calling o7624.withColumn.
: java.lang.StackOverflowError) trying to create this DataFrame. Besides increasing the stack size, what can I do? Is there a better way to get the same result?
Using withColumn in a loop is not good when there are many columns to be added.
Create a list of columns and select them, which will result in better performance:
cols = [F.col("name")]
for target in targets:
    cols.append(F.max(F.lit(add_columns(df["typeT"], F.lit(target), df["value"]))).over(w).alias(target))
df = df.select(cols)
which gives the same output:
+----+---+---+---+
|name| c| b| a|
+----+---+---+---+
| x| 7| 5| 3|
| z| 3| 2| 6|
| y| 4| 2| 1|
+----+---+---+---+
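Since this is essentially a pivot (one row per name, one column per typeT holding the max value), groupBy/pivot may also be worth a try instead of building thousands of window expressions; a sketch, assuming the df and targets from the question:
# one wide aggregation instead of 5000 withColumn/window expressions;
# passing the precomputed targets avoids an extra job to detect the pivot values
result = (df
    .groupBy("name")
    .pivot("typeT", targets)
    .max("value"))
result.show()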

AWS EMR Hive: Not yet supported place for UDAF 'COUNT'

I have a pretty complicated query I am trying to convert over to use with Hive.
Specifically, I am running it as a Hive "step" in an AWS EMR cluster.
I have tried to clean up the query a bit for this post and just leave the essence of the thing.
The full error message is:
FAILED: SemanticException [Error 10128]: Line XX:XX Not yet supported place for UDAF 'COUNT'
The line number is pointing to the COUNT at the bottom of the select statement:
INSERT INTO db.new_table (
new_column1,
new_column2,
new_column3,
... ,
new_column20
)
SELECT MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(new_column5," ")||"_"||...) AS new_col1,
TBL1.col2,
TBL1.col3,
TBL1.col3 AS new_column3,
TBL1.col4,
CASE
WHEN TBL1.col5 = …
ELSE "some value"
END AS new_column5,
TBL1.col6,
TBL1.col7,
TBL1.col8,
CASE
WHEN TBL1.col9 = …
ELSE "some value"
END AS new_column9,
CASE
WHEN TBL1.col10 = …
ELSE "value"
END AS new_column10,
TBL1.col11,
"value" AS new_column12,
TBL2.col1,
TBL2.col2,
from_unixtime(…) AS new_column13,
CAST(…) AS new_column14,
CAST(…) AS new_column15,
CAST(…) AS new_column16,
COUNT(DISTINCT TBL1.col17) AS new_column17
FROM db.table1 TBL1
LEFT JOIN
db.table2 TBL2
ON TBL1.col311 = TBL2.col311
WHERE TBL1.col14 BETWEEN "low" AND "high"
AND TBL1.col44 = "Y"
AND TBL1.col55 = "N"
GROUP BY 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20;
If I have left out too much, please let me know.
Thanks for your help!
Updates
It turns out, I did in fact leave out way too much info. Sorry to those who have already tried to help...
I made the updates above.
Removing the 20th GROUP BY column, e.g.:
GROUP BY 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19;
Produced: Expression not in GROUP BY key '' ''
LATEST
Removing the 20th GROUP BY column and adding the first one, e.g.:
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19;
Produced:
Line XX:XX Invalid table alias or column reference 'new_column5':(possible column
names are: TBL1.col1, TBL1.col2, (looks like all columns of TBL1),
TBL2.col1, TBL2.col2, TBL2.col311)
The line number refers to the line with the SELECT statement. Just those three columns from TBL2 are listed in the error output.
The error seems to be pointing to COALESCE(new_column5). Note that I have a CASE statement within the TBL1 select which I alias AS new_column5.
You are referencing the calculated column new_column5 at the same subquery level where it is being calculated. This is not possible in Hive. Replace the reference with the calculation itself, or use an upper-level subquery.
Use this:
MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(CASE WHEN TBL1.col5 = … ELSE "some value" END," ")||"_"||...) AS new_col1,
instead of this:
MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(new_column5," ")||"_"||...) AS new_col1,

Python printing lists with column headers

So I have a nested list containing these values
#[[Mark, 10, Orange],
#[Fred, 15, Red],
#[Gary, 12, Blue],
#[Ned, 21, Yellow]]
You can see that the file is laid out so you have (name, age, favcolour)
I want to make it so I can display each column with its corresponding header, e.g.:
Name|Age|Favourite colour
Mark|10 |Orange
Fred|15 |Red
Gary|12 |Blue
Ned |21 |Yellow
Thank You!
A simple solution using the str.format() function:
l = [['Mark', 10, 'Orange'], ['Fred', 15, 'Red'], ['Gary', 12, 'Blue'], ['Ned', 21, 'Yellow']]
f = '{:<10}|{:<3}|{:<15}'  # row format: name, age, favourite colour
# header (the Name column gets extra room since there could be longer names, like "Christopher")
print('Name      |Age|Favourite colour')
for i in l:
    print(f.format(*i))
The output:
Name      |Age|Favourite colour
Mark      |10 |Orange
Fred      |15 |Red
Gary      |12 |Blue
Ned       |21 |Yellow
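If you would rather not hard-code the column widths, a small variation that derives them from the data (same nested list, header names assumed from the question):
l = [['Mark', 10, 'Orange'], ['Fred', 15, 'Red'], ['Gary', 12, 'Blue'], ['Ned', 21, 'Yellow']]
headers = ['Name', 'Age', 'Favourite colour']

# width of each column = longest value in it, header included
rows = [headers] + [[str(c) for c in row] for row in l]
widths = [max(len(row[i]) for row in rows) for i in range(len(headers))]
fmt = '|'.join('{:<%d}' % w for w in widths)

for row in rows:
    print(fmt.format(*row))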