I have the following DFs:
+----------+---+----+
|Date      |Id |Cond|
+----------+---+----+
|2022-01-08|1  |0   |
|2022-01-10|1  |0   |
|2022-01-11|1  |0   |
|2022-01-12|1  |0   |
|2022-01-13|1  |0   |
|2022-01-15|1  |0   |
|2022-01-18|1  |0   |
|2022-01-19|1  |0   |
|2022-01-08|2  |0   |
|2022-01-11|2  |0   |
|2022-01-12|2  |0   |
|2022-01-15|2  |0   |
|2022-01-16|2  |0   |
|2022-01-17|2  |0   |
|2022-01-19|2  |0   |
|2022-01-20|2  |0   |
+----------+---+----+
+----------+---+----+
|Date      |Id |Cond|
+----------+---+----+
|2022-01-09|1  |1   |
|2022-01-14|1  |1   |
|2022-01-16|1  |1   |
|2022-01-17|1  |1   |
|2022-01-20|1  |1   |
|2022-01-09|2  |1   |
|2022-01-10|2  |1   |
|2022-01-13|2  |1   |
|2022-01-14|2  |1   |
|2022-01-18|2  |1   |
+----------+---+----+
For each date in DF1, I want to get the two most recent preceding dates from DF2 (for the same Id).
Example:
For date "2022-01-15" and Id = 1 in DF1 I need to collect dates "2022-01-14" and "2022-01-09" from DF2.
My expected output:
+----------+---+----------------------------+
|Date      |Id |List                        |
+----------+---+----------------------------+
|2022-01-08|1  |[]                          |
|2022-01-10|1  |['2022-01-09']              |
|2022-01-11|1  |['2022-01-09']              |
|2022-01-12|1  |['2022-01-09']              |
|2022-01-13|1  |['2022-01-09']              |
|2022-01-15|1  |['2022-01-14', '2022-01-09']|
|2022-01-18|1  |['2022-01-17', '2022-01-16']|
|2022-01-19|1  |['2022-01-17', '2022-01-16']|
|2022-01-08|2  |[]                          |
|2022-01-11|2  |['2022-01-10', '2022-01-09']|
|2022-01-12|2  |['2022-01-10', '2022-01-09']|
|2022-01-15|2  |['2022-01-14', '2022-01-13']|
|2022-01-16|2  |['2022-01-14', '2022-01-13']|
|2022-01-17|2  |['2022-01-14', '2022-01-13']|
|2022-01-19|2  |['2022-01-18', '2022-01-14']|
|2022-01-20|2  |['2022-01-18', '2022-01-14']|
+----------+---+----------------------------+
I know that I can use collect_list to get the dates as a list, but how can I collect by range?
MVCE:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data_1 = [
    ("2022-01-08", 1, 0),
    ("2022-01-10", 1, 0),
    ("2022-01-11", 1, 0),
    ("2022-01-12", 1, 0),
    ("2022-01-13", 1, 0),
    ("2022-01-15", 1, 0),
    ("2022-01-18", 1, 0),
    ("2022-01-19", 1, 0),
    ("2022-01-08", 2, 0),
    ("2022-01-11", 2, 0),
    ("2022-01-12", 2, 0),
    ("2022-01-15", 2, 0),
    ("2022-01-16", 2, 0),
    ("2022-01-17", 2, 0),
    ("2022-01-19", 2, 0),
    ("2022-01-20", 2, 0)
]

schema_1 = StructType([
    StructField("Date", StringType(), True),
    StructField("Id", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])

df_1 = spark.createDataFrame(data=data_1, schema=schema_1)

data_2 = [
    ("2022-01-09", 1, 1),
    ("2022-01-14", 1, 1),
    ("2022-01-16", 1, 1),
    ("2022-01-17", 1, 1),
    ("2022-01-20", 1, 1),
    ("2022-01-09", 2, 1),
    ("2022-01-10", 2, 1),
    ("2022-01-13", 2, 1),
    ("2022-01-14", 2, 1),
    ("2022-01-18", 2, 1)
]

schema_2 = StructType([
    StructField("Date", StringType(), True),
    StructField("Id", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])

df_2 = spark.createDataFrame(data=data_2, schema=schema_2)
You can accomplish this by:
joining the two tables on Id;
conditionally collecting dates from df_2 when they are earlier than the target date from df_1 (collect_list ignores null values by default); and
using a combination of slice and sort_array to keep only the two most recent dates.
import pyspark.sql.functions as F

df_out = df_1 \
    .join(df_2.select(F.col("Date").alias("Date_RHS"), "Id"), on="Id", how="inner") \
    .groupBy("Date", "Id") \
    .agg(F.collect_list(F.when(F.col("Date_RHS") < F.col("Date"), F.col("Date_RHS")).otherwise(F.lit(None))).alias("List")) \
    .select("Date", "Id", F.slice(F.sort_array(F.col("List"), asc=False), start=1, length=2).alias("List"))
# +----------+---+------------------------+
# |Date |Id |List |
# +----------+---+------------------------+
# |2022-01-08|1 |[] |
# |2022-01-10|1 |[2022-01-09] |
# |2022-01-11|1 |[2022-01-09] |
# |2022-01-12|1 |[2022-01-09] |
# |2022-01-13|1 |[2022-01-09] |
# |2022-01-15|1 |[2022-01-14, 2022-01-09]|
# |2022-01-18|1 |[2022-01-17, 2022-01-16]|
# |2022-01-19|1 |[2022-01-17, 2022-01-16]|
# |2022-01-08|2 |[] |
# |2022-01-11|2 |[2022-01-10, 2022-01-09]|
# |2022-01-12|2 |[2022-01-10, 2022-01-09]|
# |2022-01-15|2 |[2022-01-14, 2022-01-13]|
# |2022-01-16|2 |[2022-01-14, 2022-01-13]|
# |2022-01-17|2 |[2022-01-14, 2022-01-13]|
# |2022-01-19|2 |[2022-01-18, 2022-01-14]|
# |2022-01-20|2 |[2022-01-18, 2022-01-14]|
# +----------+---+------------------------+
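Note that Date is a StringType in the MVCE, so the comparison Date_RHS < Date only works because the ISO-8601 yyyy-MM-dd format sorts lexicographically. If your real dates come in a different format, casting to DateType keeps the same logic valid. A minimal sketch, assuming the same df_1/df_2 as above:
import pyspark.sql.functions as F

# Parse the string column into a proper DateType before joining/comparing
df_1 = df_1.withColumn("Date", F.to_date("Date", "yyyy-MM-dd"))
df_2 = df_2.withColumn("Date", F.to_date("Date", "yyyy-MM-dd"))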
The following approach first aggregates df_2 and then does a left join. It then uses the higher-order function filter to drop dates that are not earlier than the value in column "Date", and slice to keep only the two largest values from the array.
from pyspark.sql import functions as F
df = df_1.join(df_2.groupBy('Id').agg(F.collect_set('Date').alias('d2')), 'Id', 'left')
df = df.select(
    'Date', 'Id',
    F.slice(F.sort_array(F.filter('d2', lambda x: x < F.col('Date')), False), 1, 2).alias('List')
)
df.show(truncate=0)
# +----------+---+------------------------+
# |Date |Id |List |
# +----------+---+------------------------+
# |2022-01-08|1 |[] |
# |2022-01-10|1 |[2022-01-09] |
# |2022-01-11|1 |[2022-01-09] |
# |2022-01-12|1 |[2022-01-09] |
# |2022-01-13|1 |[2022-01-09] |
# |2022-01-15|1 |[2022-01-14, 2022-01-09]|
# |2022-01-18|1 |[2022-01-17, 2022-01-16]|
# |2022-01-19|1 |[2022-01-17, 2022-01-16]|
# |2022-01-08|2 |[] |
# |2022-01-11|2 |[2022-01-10, 2022-01-09]|
# |2022-01-12|2 |[2022-01-10, 2022-01-09]|
# |2022-01-15|2 |[2022-01-14, 2022-01-13]|
# |2022-01-16|2 |[2022-01-14, 2022-01-13]|
# |2022-01-17|2 |[2022-01-14, 2022-01-13]|
# |2022-01-19|2 |[2022-01-18, 2022-01-14]|
# |2022-01-20|2 |[2022-01-18, 2022-01-14]|
# +----------+---+------------------------+
For lower Spark versions (F.filter with a Python lambda needs Spark 3.1+, while the SQL filter function is available from Spark 2.4), use expr instead:
from pyspark.sql import functions as F
df = df_1.join(df_2.groupBy('Id').agg(F.collect_set('Date').alias('d2')), 'Id', 'left')
df = df.select(
    'Date', 'Id',
    F.slice(F.sort_array(F.expr("filter(d2, x -> x < Date)"), False), 1, 2).alias('List')
)
I have a dataframe with the below structure
+------+-------------+-----+
|region|          key|  val|
+------+-------------+-----+
|Sample|         row1|    6|
|Sample|row1_category|Cat 1|
|Sample|    row1_Unit|   Kg|
|Sample|         row2|    4|
|Sample|row2_category|Cat 2|
|Sample|    row2_Unit|  ltr|
+------+-------------+-----+
I tried to add a column and push the values from rows to columns, but I could not get the category and unit columns filled in correctly.
I want to convert it into the below structure
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1|  6|   Cat 1|  Kg|
|Sample|row2|  4|   Cat 2| ltr|
+------+----+---+--------+----+
I need to do this for multiple keys; I will have row2, row3, etc.
scala> df.show
+------+-------------+----+
|region| key| val|
+------+-------------+----+
|Sample| row1| 6|
|Sample|row1_category|Cat1|
|Sample| row1_Unit| Kg|
|Sample| row2| 4|
|Sample|row2_category|Cat2|
|Sample| row2_Unit| ltr|
+------+-------------+----+
scala> val df1 = df.withColumn("_temp", split( $"key" , "_")).select(col("region"), $"_temp".getItem(0) as "key",$"_temp".getItem(1) as "colType",col("val"))
scala> df1.show(false)
+------+----+--------+----+
|region|key |colType |val |
+------+----+--------+----+
|Sample|row1|null |6 |
|Sample|row1|category|Cat1|
|Sample|row1|Unit |Kg |
|Sample|row2|null |4 |
|Sample|row2|category|Cat2|
|Sample|row2|Unit |ltr |
+------+----+--------+----+
scala> val df2 = df1.withColumn("Category", when(col("colType") === "category", col("val"))).withColumn("Unit", when(col("colType") === "Unit", col("val"))).withColumn("val", when(col("colType").isNull, col("val")))
scala> df2.show(false)
+------+----+--------+----+--------+----+
|region|key |colType |val |Category|Unit|
+------+----+--------+----+--------+----+
|Sample|row1|null |6 |null |null|
|Sample|row1|category|null|Cat1 |null|
|Sample|row1|Unit |null|null |Kg |
|Sample|row2|null |4 |null |null|
|Sample|row2|category|null|Cat2 |null|
|Sample|row2|Unit |null|null |ltr |
+------+----+--------+----+--------+----+
scala> val df3 = df2.groupBy("region", "key").agg(concat_ws("",collect_set(when($"val".isNotNull, $"val"))).as("val"),concat_ws("",collect_set(when($"Category".isNotNull, $"Category"))).as("Category"), concat_ws("",collect_set(when($"Unit".isNotNull, $"Unit"))).as("Unit"))
scala> df3.show()
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1| 6| Cat1| Kg|
|Sample|row2| 4| Cat2| ltr|
+------+----+---+--------+----+
You can achieve it by grouping by your key (and possibly region) and aggregating with collect_list; using the regex ^[^_]+ you get all characters up to the _ character.
UPDATE: You can use the regex (\\d{1,}) to capture numbers from a string. For example, for row_123_456_unit, regexp_extract('val, "(\\d{1,})", 1) returns the first group of digits, 123; to also get 456 you need a pattern with two capture groups, such as (\\d+)_(\\d+) with group index 2 (see the short PySpark sketch after the output below).
df.printSchema()
df.show()
val regex1 = "^[^_]+" // until '_' character
val regex2 = "(\\d{1,})" // capture group of numbers
df.groupBy('region, regexp_extract('key, regex1, 0))
  .agg('region, collect_list('key).as("key"), collect_list('val).as("val"))
  .select('region,
    'key.getItem(0).as("key"),
    'val.getItem(0).as("val"),
    'val.getItem(1).as("Category"),
    'val.getItem(2).as("Unit")
  ).show()
output:
root
|-- region: string (nullable = true)
|-- key: string (nullable = true)
|-- val: string (nullable = true)
+------+-------------+-----+
|region| key| val|
+------+-------------+-----+
|Sample| row1| 6|
|Sample|row1_category|Cat 1|
|Sample| row1_Unit| Kg|
|Sample| row2| 4|
|Sample|row2_category|Cat 2|
|Sample| row2_Unit| ltr|
+------+-------------+-----+
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1| 6| Cat 1| Kg|
|Sample|row2| 4| Cat 2| ltr|
+------+----+---+--------+----+
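To illustrate the capture-group note above, here is a sketch in PySpark syntax (it is not part of the original answer; the literal string and pattern are only for demonstration):
import pyspark.sql.functions as F

# Group index 1 returns the first captured group, group index 2 the second one
spark.range(1).select(
    F.regexp_extract(F.lit("row_123_456_unit"), r"(\d+)_(\d+)", 1).alias("g1"),  # 123
    F.regexp_extract(F.lit("row_123_456_unit"), r"(\d+)_(\d+)", 2).alias("g2")   # 456
).show()

+---+---+
| g1| g2|
+---+---+
|123|456|
+---+---+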
How can I use existing pySpark sql functions to find non-consuming regular expression patterns in a string column?
The following is reproducible, but does not give the desired results.
import pyspark
from pyspark.sql import (
    SparkSession,
    functions as F)

spark = (SparkSession.builder
         .master('yarn')
         .appName("regex")
         .getOrCreate()
         )
sc = spark.sparkContext
sc.version  # u'2.2.0'

testdf = spark.createDataFrame([
    (1, "Julie", "CEO"),
    (2, "Janice", "CFO"),
    (3, "Jake", "CTO")],
    ["ID", "Name", "Title"])
ptrn = '(?=Ja)(?=ke)'
testdf.withColumn('contns_ptrn', testdf.Name.rlike(ptrn) ).show()
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| false|
| 2|Janice| CFO| false|
| 3| Jake| CTO| false|
+---+------+-----+-----------+
testdf.withColumn('contns_ptrn', F.regexp_extract(F.col('Name'), ptrn, 1)).show()
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| |
| 2|Janice| CFO| |
| 3| Jake| CTO| |
+---+------+-----+-----------+
testdf.withColumn('contns_ptrn', F.regexp_replace(F.col('Name'), ptrn, '')).show()
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| Julie|
| 2|Janice| CFO| Janice|
| 3| Jake| CTO| Jake|
+---+------+-----+-----------+
The desired results would be:
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| false|
| 2|Janice| CFO| false|
| 3| Jake| CTO| true|
+---+------+-----+-----------+
This is because the Name in the third row contains both 'Ja' and 'ke'.
If regexp_extract or regexp_replace are able to extract or replace non-consuming regular expression patterns, then I could also use them together with length to get a Boolean column.
Found a quick solution; hopefully this can help someone else.
Change ptrn from '(?=Ja)(?=ke)' to '(?=.*Ja)(?=.*ke)' and rlike works.
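For reference, a minimal sketch of that change applied to the testdf from the question (the output matches the desired result above):
ptrn = '(?=.*Ja)(?=.*ke)'
testdf.withColumn('contns_ptrn', testdf.Name.rlike(ptrn)).show()

+---+------+-----+-----------+
| ID|  Name|Title|contns_ptrn|
+---+------+-----+-----------+
|  1| Julie|  CEO|      false|
|  2|Janice|  CFO|      false|
|  3|  Jake|  CTO|       true|
+---+------+-----+-----------+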
This answer got me close, but it was also what led to my problem.
https://stackoverflow.com/a/469951/5060792
These answers solved my problem.
https://stackoverflow.com/a/3041326
https://stackoverflow.com/a/470602/5060792
By the way, with nothing but the change to ptrn, regexp_extract throws a java.lang.IndexOutOfBoundsException: No group 1 exception. After wrapping the entire pattern in parentheses, ptrn = '((?=.*Ja)(?=.*ke))', it returns nulls.
Again, regexp_replace replaces nothing and the original values are returned.
PySpark 2.4.0
How to train a model which has multiple target columns?
Here is a sample dataset,
+---+----+-------+--------+--------+--------+
| id|days|product|target_1|target_2|target_3|
+---+----+-------+--------+--------+--------+
| 1| 6| 55| 1| 0| 1|
| 2| 3| 52| 0| 1| 0|
| 3| 4| 53| 1| 1| 1|
| 1| 5| 53| 1| 0| 0|
| 2| 2| 53| 1| 0| 0|
| 3| 1| 54| 0| 1| 0|
+---+----+-------+--------+--------+--------+
id, days and product are the feature columns. In order to train using PySpark ML's MLPC (MultilayerPerceptronClassifier), I've converted the features into feature vectors.
Here is the code,
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
    inputCols=['id', 'days', 'product'],
    outputCol="features")
output = assembler.transform(data)
and I have the features column as below:
+---+----+-------+--------+--------+--------+--------------+
| id|days|product|target_1|target_2|target_3| features|
+---+----+-------+--------+--------+--------+--------------+
| 1| 6| 55| 1| 0| 1|[1.0,6.0,55.0]|
| 2| 3| 52| 0| 1| 0|[2.0,3.0,52.0]|
| 3| 4| 53| 1| 1| 1|[3.0,4.0,53.0]|
| 1| 5| 53| 1| 0| 0|[1.0,5.0,53.0]|
| 2| 2| 53| 1| 0| 0|[2.0,2.0,53.0]|
| 3| 1| 54| 0| 1| 0|[3.0,1.0,54.0]|
+---+----+-------+--------+--------+--------+--------------+
Now if I take each target column as a single label, I'll end up creating 3 models. But is there a way to convert all 3 targets (they are binary: 0 or 1) into labels?
For example, if I take each target column separately, then my MLPC layers will look like:
target_1 >> layers = [3, 5, 4, 2]
target_2 >> layers = [3, 5, 4, 2]
target_3 >> layers = [3, 5, 4, 2]
Since the target columns contain only 0 or 1, can I create layers like the below?
layers = [3, 5, 4, 3]
That is, 3 outputs, one per target column, where every output neuron gives 0 or 1.
from pyspark.ml.classification import MultilayerPerceptronClassifier
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers,blockSize=128, seed=1234)
I tried to combine all targets into a single label,
assembler_label = VectorAssembler(
    inputCols=['target_1', 'target_2', 'target_3'],
    outputCol="label")
output_with_label = assembler_label.transform(output)
And the resulting data looks like,
+---+----+-------+--------+--------+--------+--------------+-------------+
| id|days|product|target_1|target_2|target_3| features| label|
+---+----+-------+--------+--------+--------+--------------+-------------+
| 1| 6| 55| 1| 0| 1|[1.0,6.0,55.0]|[1.0,0.0,1.0]|
| 2| 3| 52| 0| 1| 0|[2.0,3.0,52.0]|[0.0,1.0,0.0]|
| 3| 4| 53| 1| 1| 1|[3.0,4.0,53.0]|[1.0,1.0,1.0]|
| 1| 5| 53| 1| 0| 0|[1.0,5.0,53.0]|[1.0,0.0,0.0]|
| 2| 2| 53| 1| 0| 0|[2.0,2.0,53.0]|[1.0,0.0,0.0]|
| 3| 1| 54| 0| 1| 0|[3.0,1.0,54.0]|[0.0,1.0,0.0]|
+---+----+-------+--------+--------+--------+--------------+-------------+
When I tried to fit the data,
model = trainer.fit(output_with_label)
I got an error:
IllegalArgumentException: u'requirement failed: Column label must be of type numeric but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'
So, is there a way to handle data like this?
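One possible direction, sketched here purely as an illustration and not taken from the original post: since the three targets are binary, they can be packed into a single class index from 0 to 7, which satisfies MLPC's requirement of a numeric label column, at the cost of using 8 output neurons instead of 3.
import pyspark.sql.functions as F
from pyspark.ml.classification import MultilayerPerceptronClassifier

# Pack the three binary targets into one class index in [0, 7]
# (assumes `output` from the VectorAssembler step above).
encoded = output.withColumn(
    "label",
    (F.col("target_1") * 4 + F.col("target_2") * 2 + F.col("target_3")).cast("double"))

# 3 inputs, two hidden layers, 8 output neurons (one per combination of the targets)
layers = [3, 5, 4, 8]
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
model = trainer.fit(encoded)

# A predicted class c can be unpacked back into the three targets:
# target_1 = c // 4, target_2 = (c // 2) % 2, target_3 = c % 2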
I have a CSV file with a column named id and another named genre, which can contain any number of genres.
1,Action|Horror|Adventure
2,Action|Adventure
Is it possible to take each row and, for each genre, insert the current id and genre into another dataframe, like this:
1,Action
1,Horror
1,Adventure
2,Action
2,Adventure
You can use a UDF to split the genre data and then use the explode function.
from pyspark.sql import functions as f
from pyspark.sql.functions import explode
from pyspark.sql.types import ArrayType, StringType

s = [('1','Action|Adventure'),('2','Comdey|Action')]
rdd = sc.parallelize(s)
df = sqlContext.createDataFrame(rdd,['id','Col'])
df.show()
+---+----------------+
| id| Col|
+---+----------------+
| 1|Action|Adventure|
| 2| Comdey|Action|
+---+----------------+
newcol = f.udf(lambda x : x.split('|'),ArrayType(StringType()))
df1 = df.withColumn('Genre',explode(newcol('col'))).drop('col')
df1.show()
+---+---------+
| id| Genre|
+---+---------+
| 1| Action|
| 1|Adventure|
| 2| Comdey|
| 2| Action|
+---+---------+
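As a side note, the UDF is not strictly required here; Spark's built-in split function can produce the array directly. A minimal sketch reusing the df defined above:
from pyspark.sql.functions import explode, split

# split takes a regex, so the pipe must be escaped (here via a character class)
df.withColumn('Genre', explode(split('Col', '[|]'))).drop('Col').show()
This produces the same output as above.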
In addition to Suresh's solution, you can also use flatMap after splitting your string to achieve the same:
# Read csv from file (works in Spark 2.x and onwards)
df_csv = sqlContext.read.csv("genre.csv")
# Split the Genre (second field) on the character |, but leave the id (first field) as is
rdd_split = df_csv.rdd.map(lambda row: (row[0], row[1].split('|')))
# Use a list comprehension to pair the id with each Genre
rdd_explode = rdd_split.flatMap(lambda kv: [(kv[0], k) for k in kv[1]])
# Convert the resulting RDD back to a dataframe
df_final = rdd_explode.toDF(['id', 'Genre'])
df_final.show() returns this as output:
+---+---------+
| id| Genre|
+---+---------+
| 1| Action|
| 1| Horror|
| 1|Adventure|
| 2| Action|
| 2|Adventure|
+---+---------+