I have the following DFs:
+--------------+---+----+
|Date |Id |Cond|
+--------------+---+----+
| 2022-01-08| 1| 0|
| 2022-01-10| 1| 0|
| 2022-01-11| 1| 0|
| 2022-01-12| 1| 0|
| 2022-01-13| 1| 0|
| 2022-01-15| 1| 0|
| 2022-01-18| 1| 0|
| 2022-01-19| 1| 0|
| 2022-01-08| 2| 0|
| 2022-01-11| 2| 0|
| 2022-01-12| 2| 0|
| 2022-01-15| 2| 0|
| 2022-01-16| 2| 0|
| 2022-01-17| 2| 0|
| 2022-01-19| 2| 0|
| 2022-01-20| 2| 0|
+--------------+---+----+
+--------------+---+----+
|Date |Id |Cond|
+--------------+---+----+
| 2022-01-09| 1| 1|
| 2022-01-14| 1| 1|
| 2022-01-16| 1| 1|
| 2022-01-17| 1| 1|
| 2022-01-20| 1| 1|
| 2022-01-09| 2| 1|
| 2022-01-10| 2| 1|
| 2022-01-13| 2| 1|
| 2022-01-14| 2| 1|
| 2022-01-18| 2| 1|
+--------------+---+----+
For each date in DF1, I want to collect the two most recent earlier dates from DF2 (for the same Id).
Example:
For date "2022-01-15" and Id = 1 in DF1 I need to collect dates "2022-01-14" and "2022-01-09" from DF2.
My expected output:
+--------------+---+------------------------------+
|Date |Id |List |
+--------------+---+------------------------------+
| 2022-01-08| 1| [] |
| 2022-01-10| 1| ['2022-01-09'] |
| 2022-01-11| 1| ['2022-01-09'] |
| 2022-01-12| 1| ['2022-01-09'] |
| 2022-01-13| 1| ['2022-01-09'] |
| 2022-01-15| 1| ['2022-01-14', '2022-01-09']|
| 2022-01-18| 1| ['2022-01-17', '2022-01-16']|
| 2022-01-19| 1| ['2022-01-17', '2022-01-16']|
| 2022-01-08| 2| [] |
| 2022-01-11| 2| ['2022-01-10', '2022-01-09']|
| 2022-01-12| 2| ['2022-01-10', '2022-01-09']|
| 2022-01-15| 2| ['2022-01-14', '2022-01-13']|
| 2022-01-16| 2| ['2022-01-14', '2022-01-13']|
| 2022-01-17| 2| ['2022-01-14', '2022-01-13']|
| 2022-01-19| 2| ['2022-01-18', '2022-01-14']|
| 2022-01-20| 2| ['2022-01-18', '2022-01-14']|
+--------------+---+------------------------------+
I know that I can use collect_list to get the dates as a list, but how can I collect by range?
MVCE:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data_1 = [
("2022-01-08", 1, 0),
("2022-01-10", 1, 0),
("2022-01-11", 1, 0),
("2022-01-12", 1, 0),
("2022-01-13", 1, 0),
("2022-01-15", 1, 0),
("2022-01-18", 1, 0),
("2022-01-19", 1, 0),
("2022-01-08", 2, 0),
("2022-01-11", 2, 0),
("2022-01-12", 2, 0),
("2022-01-15", 2, 0),
("2022-01-16", 2, 0),
("2022-01-17", 2, 0),
("2022-01-19", 2, 0),
("2022-01-20", 2, 0)
]
schema_1 = StructType([
StructField("Date", StringType(), True),
StructField("Id", IntegerType(), True),
StructField("Cond", IntegerType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
data_2 = [
("2022-01-09", 1, 1),
("2022-01-14", 1, 1),
("2022-01-16", 1, 1),
("2022-01-17", 1, 1),
("2022-01-20", 1, 1),
("2022-01-09", 2, 1),
("2022-01-10", 2, 1),
("2022-01-13", 2, 1),
("2022-01-14", 2, 1),
("2022-01-18", 2, 1)
]
schema_2 = StructType([
StructField("Date", StringType(), True),
StructField("Id", IntegerType(), True),
StructField("Cond", IntegerType(), True)
])
df_2 = spark.createDataFrame(data=data_2, schema=schema_2)
You can accomplish this by:
joining the two tables on Id;
conditionally collecting dates from df_2 when they are earlier than the target date from df_1 (collect_list ignores null values by default); and
using a combination of slice and sort_array to keep only the two most recent dates.
import pyspark.sql.functions as F
df_out = df_1 \
.join(df_2.select(F.col("Date").alias("Date_RHS"), "Id"), on="Id", how="inner") \
.groupBy("Date", "Id") \
.agg(F.collect_list(F.when(F.col("Date_RHS") < F.col("Date"), F.col("Date_RHS")).otherwise(F.lit(None))).alias("List")) \
.select("Date", "Id", F.slice(F.sort_array(F.col("List"), asc=False), start=1, length=2).alias("List"))
# +----------+---+------------------------+
# |Date |Id |List |
# +----------+---+------------------------+
# |2022-01-08|1 |[] |
# |2022-01-10|1 |[2022-01-09] |
# |2022-01-11|1 |[2022-01-09] |
# |2022-01-12|1 |[2022-01-09] |
# |2022-01-13|1 |[2022-01-09] |
# |2022-01-15|1 |[2022-01-14, 2022-01-09]|
# |2022-01-18|1 |[2022-01-17, 2022-01-16]|
# |2022-01-19|1 |[2022-01-17, 2022-01-16]|
# |2022-01-08|2 |[] |
# |2022-01-11|2 |[2022-01-10, 2022-01-09]|
# |2022-01-12|2 |[2022-01-10, 2022-01-09]|
# |2022-01-15|2 |[2022-01-14, 2022-01-13]|
# |2022-01-16|2 |[2022-01-14, 2022-01-13]|
# |2022-01-17|2 |[2022-01-14, 2022-01-13]|
# |2022-01-19|2 |[2022-01-18, 2022-01-14]|
# |2022-01-20|2 |[2022-01-18, 2022-01-14]|
# +----------+---+------------------------+
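One thing worth noting: the comparison Date_RHS < Date works here only because the dates are ISO-8601 strings, which sort lexicographically in the same order as calendar dates. If your real columns are not in that format, you can cast them first, e.g. (a small sketch, assuming the same df_1/df_2 as above):
import pyspark.sql.functions as F

# Cast the string dates to DateType so the "<" comparison is by calendar date
# rather than by string ordering.
df_1_dated = df_1.withColumn("Date", F.to_date("Date", "yyyy-MM-dd"))
df_2_dated = df_2.withColumn("Date", F.to_date("Date", "yyyy-MM-dd"))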
The following approach first aggregates df_2 and then does a left join. It then uses the higher-order function filter to keep only the dates that are earlier than the value in column "Date", and slice to take just the 2 largest (most recent) values from the sorted array.
from pyspark.sql import functions as F
df = df_1.join(df_2.groupBy('Id').agg(F.collect_set('Date').alias('d2')), 'Id', 'left')
df = df.select(
'Date', 'Id',
F.slice(F.sort_array(F.filter('d2', lambda x: x < F.col('Date')), False), 1, 2).alias('List')
)
df.show(truncate=0)
# +----------+---+------------------------+
# |Date |Id |List |
# +----------+---+------------------------+
# |2022-01-08|1 |[] |
# |2022-01-10|1 |[2022-01-09] |
# |2022-01-11|1 |[2022-01-09] |
# |2022-01-12|1 |[2022-01-09] |
# |2022-01-13|1 |[2022-01-09] |
# |2022-01-15|1 |[2022-01-14, 2022-01-09]|
# |2022-01-18|1 |[2022-01-17, 2022-01-16]|
# |2022-01-19|1 |[2022-01-17, 2022-01-16]|
# |2022-01-08|2 |[] |
# |2022-01-11|2 |[2022-01-10, 2022-01-09]|
# |2022-01-12|2 |[2022-01-10, 2022-01-09]|
# |2022-01-15|2 |[2022-01-14, 2022-01-13]|
# |2022-01-16|2 |[2022-01-14, 2022-01-13]|
# |2022-01-17|2 |[2022-01-14, 2022-01-13]|
# |2022-01-19|2 |[2022-01-18, 2022-01-14]|
# |2022-01-20|2 |[2022-01-18, 2022-01-14]|
# +----------+---+------------------------+
For lower Spark versions, where pyspark.sql.functions.filter with a Python lambda is not available, use the SQL filter function via expr instead:
from pyspark.sql import functions as F
df = df_1.join(df_2.groupBy('Id').agg(F.collect_set('Date').alias('d2')), 'Id', 'left')
df = df.select(
'Date', 'Id',
F.slice(F.sort_array(F.expr("filter(d2, x -> x < Date)"), False), 1, 2).alias('List')
)
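As with the lambda version, showing the result should produce the same output as above (assuming the same df_1/df_2):
df.show(truncate=0)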
I am trying to split a string into different columns using regular expressions.
Below is my data
from pyspark.sql.functions import regexp_extract, col

decodeData = [('M|C705|Exx05','2'),
('M|Exx05','4'),
('M|C705 P|Exx05','6'),
('M|C705 P|8960 L|Exx05','7'),('M|C705 P|78|8960','9')]
df = sc.parallelize(decodeData).toDF(['Decode',''])
dfNew = df.withColumn('Exx05', regexp_extract(col('Decode'), '(M|P|M)(\\|Exx05)', 1)) \
          .withColumn('C705', regexp_extract(col('Decode'), '(M|P|M)(\\|C705)', 1)) \
          .withColumn('8960', regexp_extract(col('Decode'), '(M|P|M)(\\|8960)', 1))
dfNew.show()
Result
+--------------------+---+-----+----+-----+
| Decode| |Exx05|C705| 8960|
+--------------------+---+-----+----+-----+
| M|C705|Exx05 | 2 | | M| |
| M|Exx05 | 4 | M| | |
| M|C705 P|Exx05 | 6 | P| M| |
|M|C705 P|8960 L|Exx05| 7 | M| M| P|
| M|C705 P|78|8960 | 9 | | M| |
+--------------------+---+-----+----+-----+
Here I am trying to extract the code for the strings Exx05, C705 and 8960; each of these can fall under the M, P, or L codes.
e.g.: while decoding 'M|C705 P|8960 L|Exx05' I expect the results L, M, P in the respective columns. However, I am missing some logic here, which I am finding difficult to crack.
Expected results
+--------------------+---+-----+----+-----+
| Decode| |Exx05|C705| 8960|
+--------------------+---+-----+----+-----+
| M|C705|Exx05 | | M| M| |
| M|Exx05 | | M| | |
| M|C705 P|Exx05 | | P| M| |
|M|C705 P|8960 L|Exx05| | L| M| P|
| M|C705 P|78|8960 | | | M| P|
+--------------------+---+-----+----+-----+
When I try to change the regular expression accordingly, it works for some cases but not for others, and this is just a subset of the actual data I am working on.
e.g.: Exx05 can fall under any code (M, L, or P) and can appear at any position: beginning, middle, end, etc.
A decode string can only belong to one code (M, L, or P) per entry/ID, i.e. M|Exx05 P|8960 L|Exx05 (where Exx05 falls under both M and L) will not occur.
You can add ([^ ])* in the regex to extend it so that it matches any consecutive patterns that are not separated by a space:
from pyspark.sql.functions import regexp_extract, col

dfNew = df.withColumn(
'Exx05',
regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|Exx05)', 1)
).withColumn(
'C705',
regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|C705)', 1)
).withColumn(
'8960',
regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|8960)', 1)
)
dfNew.show(truncate=False)
+---------------------+---+-----+----+----+
|Decode | |Exx05|C705|8960|
+---------------------+---+-----+----+----+
|M|C705|Exx05 |2 |M |M | |
|M|Exx05 |4 |M | | |
|M|C705 P|Exx05 |6 |P |M | |
|M|C705 P|8960 L|Exx05|7 |L |M |P |
|M|C705 P|78|8960 |9 | |M |P |
+---------------------+---+-----+----+----+
What about using X(?=Y), also known as a lookahead assertion? This ensures we match X only if it is followed by Y.
from pyspark.sql.functions import *

dfNew = df.withColumn('Exx05', regexp_extract(col('Decode'), '([A-Z](?=\|Exx05))', 1)) \
          .withColumn('C705', regexp_extract(col('Decode'), '([A-Z](?=\|C705))', 1)) \
          .withColumn('8960', regexp_extract(col('Decode'), '([A-Z]+(?=\|[0-9]|8960))', 1))
dfNew.show()
+--------------------+---+-----+----+----+
| Decode| t|Exx05|C705|8960|
+--------------------+---+-----+----+----+
| M|C705|Exx05| 2| | M| |
| M|Exx05| 4| M| | |
| M|C705 P|Exx05| 6| P| M| |
|M|C705 P|8960 L|E...| 7| L| M| P|
| M|C705 P|78|8960| 9| | M| P|
+--------------------+---+-----+----+----+
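To see what the lookahead is doing, here is a quick standalone check with Python's re module (just an illustration, separate from the Spark code):
import re

# ([A-Z])(?=\|Exx05) captures a single uppercase letter only when it is
# immediately followed by "|Exx05", without consuming the "|Exx05" part.
pattern = re.compile(r'([A-Z])(?=\|Exx05)')

for s in ['M|C705|Exx05', 'M|C705 P|Exx05', 'M|C705 P|8960 L|Exx05']:
    m = pattern.search(s)
    print(s, '->', m.group(1) if m else '')
# M|C705|Exx05 ->
# M|C705 P|Exx05 -> P
# M|C705 P|8960 L|Exx05 -> L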
I have a dataframe with the below structure
+------+-------------+--------+
|region| key| val|
+------+-------------+--------+
|Sample|row1 | 6|
|Sample|row1_category| Cat 1|
|Sample|row1_Unit | Kg|
|Sample|row2 | 4|
|Sample|row2_category| Cat 2|
|Sample|row2_Unit | ltr|
+------+-------------+--------+
I tried to add a column and push the values from rows to columns, but I could not get the category and unit columns to line up.
I want to convert it into the below structure
+------+-------------+--------+--------+--------+
|region| key| val|Category| Unit |
+------+-------------+--------+--------+--------+
|Sample|row1 | 6| Cat 1| Kg|
|Sample|row2 | 4| Cat 2| ltr|
+------+-------------+--------+--------+--------+
I need to do this for multiple keys; I will have row2, row3, etc.
scala> df.show
+------+-------------+----+
|region| key| val|
+------+-------------+----+
|Sample| row1| 6|
|Sample|row1_category|Cat1|
|Sample| row1_Unit| Kg|
|Sample| row2| 4|
|Sample|row2_category|Cat2|
|Sample| row2_Unit| ltr|
+------+-------------+----+
scala> val df1 = df.withColumn("_temp", split( $"key" , "_")).select(col("region"), $"_temp".getItem(0) as "key",$"_temp".getItem(1) as "colType",col("val"))
scala> df1.show(false)
+------+----+--------+----+
|region|key |colType |val |
+------+----+--------+----+
|Sample|row1|null |6 |
|Sample|row1|category|Cat1|
|Sample|row1|Unit |Kg |
|Sample|row2|null |4 |
|Sample|row2|category|Cat2|
|Sample|row2|Unit |ltr |
+------+----+--------+----+
scala> val df2 = df1.withColumn("Category", when(col("colType") === "category", col("val"))).withColumn("Unit", when(col("colType") === "Unit", col("val"))).withColumn("val", when(col("colType").isNull, col("val")))
scala> df2.show(false)
+------+----+--------+----+--------+----+
|region|key |colType |val |Category|Unit|
+------+----+--------+----+--------+----+
|Sample|row1|null |6 |null |null|
|Sample|row1|category|null|Cat1 |null|
|Sample|row1|Unit |null|null |Kg |
|Sample|row2|null |4 |null |null|
|Sample|row2|category|null|Cat2 |null|
|Sample|row2|Unit |null|null |ltr |
+------+----+--------+----+--------+----+
scala> val df3 = df2.groupBy("region", "key").agg(concat_ws("",collect_set(when($"val".isNotNull, $"val"))).as("val"),concat_ws("",collect_set(when($"Category".isNotNull, $"Category"))).as("Category"), concat_ws("",collect_set(when($"Unit".isNotNull, $"Unit"))).as("Unit"))
scala> df3.show()
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1| 6| Cat1| Kg|
|Sample|row2| 4| Cat2| ltr|
+------+----+---+--------+----+
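For reference, the same reshape can also be written more compactly with pivot. Here is a rough PySpark sketch (my own illustration, not part of the answer above, assuming a DataFrame df with the region/key/val columns from the question):
from pyspark.sql import functions as F

# Split "row1_category" into the base key ("row1") and the column type ("category");
# plain keys like "row1" have no suffix, so label them "val".
tmp = (df
       .withColumn("base", F.split("key", "_")[0])
       .withColumn("colType", F.coalesce(F.split("key", "_")[1], F.lit("val"))))

# Pivot the colType values into columns, taking the single value per cell.
result = (tmp.groupBy("region", "base")
          .pivot("colType", ["val", "category", "Unit"])
          .agg(F.first("val"))
          .withColumnRenamed("base", "key"))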
You can achieve it by grouping by your key (and region) and aggregating with collect_list; using the regex ^[^_]+ you get all the characters up to the first _ character.
UPDATE: You can use the regex (\\d{1,}) to find all the numbers in a string (capturing groups). For example, if you have row_123_456_unit and your function looks like regexp_extract('val, "(\\d{1,})", 0) you will get 123; if you change the last parameter to 1, you will get 456. Hope it helps.
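To illustrate what ^[^_]+ extracts, here is a minimal PySpark check (an aside in Python, not part of the Scala code below):
from pyspark.sql import functions as F

keys = spark.createDataFrame([("row1",), ("row1_category",), ("row2_Unit",)], ["key"])
# ^[^_]+ matches everything from the start of the string up to the first underscore.
keys.select("key", F.regexp_extract("key", "^[^_]+", 0).alias("base")).show()
# +-------------+----+
# |          key|base|
# +-------------+----+
# |         row1|row1|
# |row1_category|row1|
# |    row2_Unit|row2|
# +-------------+----+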
df.printSchema()
df.show()
val regex1 = "^[^_]+" // until '_' character
val regex2 = "(\\d{1,})" // capture group of numbers
df.groupBy('region, regexp_extract('key, regex1, 0))
.agg('region, collect_list('key).as("key"), collect_list('val).as("val"))
.select('region,
'key.getItem(0).as("key"),
'val.getItem(0).as("val"),
'val.getItem(1).as("Category"),
'val.getItem(2).as("Unit")
).show()
output:
root
|-- region: string (nullable = true)
|-- key: string (nullable = true)
|-- val: string (nullable = true)
+------+-------------+-----+
|region| key| val|
+------+-------------+-----+
|Sample| row1| 6|
|Sample|row1_category|Cat 1|
|Sample| row1_Unit| Kg|
|Sample| row2| 4|
|Sample|row2_category|Cat 2|
|Sample| row2_Unit| ltr|
+------+-------------+-----+
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1| 6| Cat 1| Kg|
|Sample|row2| 4| Cat 2| ltr|
+------+----+---+--------+----+
PySpark 2.4.0
How to train a model which has multiple target columns?
Here is a sample dataset,
+---+----+-------+--------+--------+--------+
| id|days|product|target_1|target_2|target_3|
+---+----+-------+--------+--------+--------+
| 1| 6| 55| 1| 0| 1|
| 2| 3| 52| 0| 1| 0|
| 3| 4| 53| 1| 1| 1|
| 1| 5| 53| 1| 0| 0|
| 2| 2| 53| 1| 0| 0|
| 3| 1| 54| 0| 1| 0|
+---+----+-------+--------+--------+--------+
id, days and product are the feature columns. In order to train using PySpark ML's MultilayerPerceptronClassifier (MLPC), I've converted the features into a feature vector.
Here is the code,
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
inputCols=['id', 'days', 'product'],
outputCol="features")
output = assembler.transform(data)
and I have the features column as below,
+---+----+-------+--------+--------+--------+--------------+
| id|days|product|target_1|target_2|target_3| features|
+---+----+-------+--------+--------+--------+--------------+
| 1| 6| 55| 1| 0| 1|[1.0,6.0,55.0]|
| 2| 3| 52| 0| 1| 0|[2.0,3.0,52.0]|
| 3| 4| 53| 1| 1| 1|[3.0,4.0,53.0]|
| 1| 5| 53| 1| 0| 0|[1.0,5.0,53.0]|
| 2| 2| 53| 1| 0| 0|[2.0,2.0,53.0]|
| 3| 1| 54| 0| 1| 0|[3.0,1.0,54.0]|
+---+----+-------+--------+--------+--------+--------------+
Now if I take each target column as a single label, I'll end up creating 3 models. But is there a way to convert all 3 targets (they are binary, 0 or 1) into a single label?
For example, if I take each target column separately, then my MLPC layers will be like:
target_1 >> layers = [3, 5, 4, 2]
target_2 >> layers = [3, 5, 4, 2]
target_3 >> layers = [3, 5, 4, 2]
Since the target columns contain only 0 or 1, can I create layers like the below?
layers = [3, 5, 4, 3]
That is, 3 outputs, one per target column, with each output neuron giving 0 or 1.
from pyspark.ml.classification import MultilayerPerceptronClassifier
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers,blockSize=128, seed=1234)
I tried to combine all targets into a single label,
assembler_label = VectorAssembler(
inputCols=['target_1', 'target_2', 'target_3'],
outputCol="label")
output_with_label = assembler_label.transform(output)
And the resulting data looks like,
+---+----+-------+--------+--------+--------+--------------+-------------+
| id|days|product|target_1|target_2|target_3| features| label|
+---+----+-------+--------+--------+--------+--------------+-------------+
| 1| 6| 55| 1| 0| 1|[1.0,6.0,55.0]|[1.0,0.0,1.0]|
| 2| 3| 52| 0| 1| 0|[2.0,3.0,52.0]|[0.0,1.0,0.0]|
| 3| 4| 53| 1| 1| 1|[3.0,4.0,53.0]|[1.0,1.0,1.0]|
| 1| 5| 53| 1| 0| 0|[1.0,5.0,53.0]|[1.0,0.0,0.0]|
| 2| 2| 53| 1| 0| 0|[2.0,2.0,53.0]|[1.0,0.0,0.0]|
| 3| 1| 54| 0| 1| 0|[3.0,1.0,54.0]|[0.0,1.0,0.0]|
+---+----+-------+--------+--------+--------+--------------+-------------+
When I tried to fit the data,
model = trainer.fit(output_with_label)
I got an error:
IllegalArgumentException: u'requirement failed: Column label must be of type numeric but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'
So, is there a way to handle data like this?
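One common workaround (a sketch of my own, not from the original post): MLPC expects a single numeric label column, so the three binary targets can be packed into one class index between 0 and 7 and the output layer sized to 8:
from pyspark.ml.classification import MultilayerPerceptronClassifier
import pyspark.sql.functions as F

# Encode the three binary targets as one class index in [0, 7],
# e.g. target_1=1, target_2=0, target_3=1  ->  1*4 + 0*2 + 1 = 5.
labeled = output.withColumn(
    "label",
    (F.col("target_1") * 4 + F.col("target_2") * 2 + F.col("target_3")).cast("double"))

# 3 input features, 8 output classes (one per combination of the three targets).
layers = [3, 5, 4, 8]
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers,
                                         blockSize=128, seed=1234)
model = trainer.fit(labeled)
The predicted class index can then be unpacked back into the three individual targets.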
I have a database schema like this.
# periode
+------+--------------+--------------+
| id | from | to |
+------+--------------+--------------+
| 1 | 2018-04-12 | 2018-05-11 |
| 2 | 2018-05-12 | 2018-06-11 |
+------+--------------+--------------+
# foo
+------+---------+
| id | name |
+------+---------+
| 1 | John |
| 2 | Doe |
| 3 | Trodi |
| 4 | son |
| 5 | Alex |
+------+---------+
#bar
+------+---------------+--------------+
| id | employee_id | periode_id |
+------+---------------+--------------+
| 1 | 1 |1 |
| 2 | 2 |1 |
| 3 | 1 |2 |
| 4 | 3 |1 |
+------+---------------+--------------+
I need to show the employees that do not have a salary (bar) entry for a given periode.
For now I do it like this:
queryset=Bar.objects.all().filter(periode_id=1)
result=Foo.objects.exclude(id=queryset)
but it fails. How do I filter the employee list to those not in the salary table?
Well, here you basically want the Foo objects for which there is no Bar record with periode_id=1.
We can let this work with:
ex = Bar.objects.all().filter(periode_id=1).values_list('employee_id', flat=True)
result = Foo.objects.exclude(id__in=ex)
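If Bar.employee is a ForeignKey to Foo (an assumption, since the models are not shown), the same thing can be written as a single queryset through the reverse relation:
# Assumes Bar declares employee = models.ForeignKey(Foo, ...) with the default
# reverse name "bar"; adjust the lookup if a related_name is set.
result = Foo.objects.exclude(bar__periode_id=1)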
I am trying some simple code to collapse the categorical variables in my dataframe into binary classes after indexing.
Currently my column has 3 classes: "A", "B", "C".
I am writing a simple if-else statement to collapse the classes, like this:
def condition(r):
if (r.wo_flag=="SLM" or r.wo_flag=="NON-SLM"):
r.wo_flag="dispatch"
else:
r.wo_flag="non_dispatch"
return r.wo_flag
df_final=df_new.map(lambda x: condition(x))
It's not working; it doesn't seem to apply the else condition.
+-----+--------+------+-------------------+----------------+-------+-------------+----+
|MData|Recode12|Status|DayOfWeekOfDispatch|MannerOfDispatch|Wo_flag|PlaceOfInjury|Race|
+-----+--------+------+-------------------+----------------+-------+-------------+----+
|    M|      11|     M|                  4|               7|      C|           99|   1|
|    M|       8|     D|                  3|               7|      A|           99|   1|
|    F|      10|     W|                  2|               7|      C|           99|   1|
|    M|       9|     D|                  1|               7|      B|           99|   1|
|    M|       8|     D|                  2|               7|      C|           99|   1|
+-----+--------+------+-------------------+----------------+-------+-------------+----+
This is the Sample Data
The accepted answer is not very efficient due to the use of a user defined function (UDF).
I think most people are looking for when.
from pyspark.sql.functions import when
matches = df["wo_flag"].isin("SLM", "NON-SLM")
new_df = df.withColumn("wo_flag", when(matches, "dispatch").otherwise("non-dispatch"))
Try this:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
def modify_values(r):
if r == "A" or r =="B":
return "dispatch"
else:
return "non-dispatch"
ol_val = udf(modify_values, StringType())
new_df = df.withColumn("wo_flag",ol_val(df.wo_flag))
Things you are doing wrong:
You are trying to modify Rows (Rows are immutable).
When a map operation is done on a DataFrame, the resulting data structure is a PipelinedRDD, not a DataFrame. You have to apply .toDF() to get a DataFrame back, as in the sketch below.
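A minimal sketch of that pattern (assuming df is the DataFrame from the question and that the column is named wo_flag, as in the answers above):
from pyspark.sql import Row

def to_dispatch(r):
    # Build a new Row instead of mutating the (immutable) input Row.
    flag = "dispatch" if r.wo_flag in ("SLM", "NON-SLM") else "non_dispatch"
    d = r.asDict()
    d["wo_flag"] = flag
    return Row(**d)

# df.rdd.map(...) returns an RDD of Rows; .toDF() turns it back into a DataFrame.
df_final = df.rdd.map(to_dispatch).toDF()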