I have the following DFs:
+--------------+---+----+
|Date |Id |Cond|
+--------------+---+----+
| 2022-01-08| 1| 0|
| 2022-01-10| 1| 0|
| 2022-01-11| 1| 0|
| 2022-01-12| 1| 0|
| 2022-01-13| 1| 0|
| 2022-01-15| 1| 0|
| 2022-01-18| 1| 0|
| 2022-01-19| 1| 0|
| 2022-01-08| 2| 0|
| 2022-01-11| 2| 0|
| 2022-01-12| 2| 0|
| 2022-01-15| 2| 0|
| 2022-01-16| 2| 0|
| 2022-01-17| 2| 0|
| 2022-01-19| 2| 0|
| 2022-01-20| 2| 0|
+--------------+---+----+
+--------------+---+----+
|Date |Id |Cond|
+--------------+---+----+
| 2022-01-09| 1| 1|
| 2022-01-14| 1| 1|
| 2022-01-16| 1| 1|
| 2022-01-17| 1| 1|
| 2022-01-20| 1| 1|
| 2022-01-09| 2| 1|
| 2022-01-10| 2| 1|
| 2022-01-13| 2| 1|
| 2022-01-14| 2| 1|
| 2022-01-18| 2| 1|
+--------------+---+----+
For each date in DF1 I want to get the 2 most recent dates from DF2 that come before it (per Id).
Example:
For date "2022-01-15" and Id = 1 in DF1 I need to collect dates "2022-01-14" and "2022-01-09" from DF2.
My expected output:
+--------------+---+------------------------------+
|Date |Id |List |
+--------------+---+------------------------------+
| 2022-01-08| 1| [] |
| 2022-01-10| 1| ['2022-01-09'] |
| 2022-01-11| 1| ['2022-01-09'] |
| 2022-01-12| 1| ['2022-01-09'] |
| 2022-01-13| 1| ['2022-01-09'] |
| 2022-01-15| 1| ['2022-01-14', '2022-01-09']|
| 2022-01-18| 1| ['2022-01-17', '2022-01-16']|
| 2022-01-19| 1| ['2022-01-17', '2022-01-16']|
| 2022-01-08| 2| [] |
| 2022-01-11| 2| ['2022-01-10', '2022-01-09']|
| 2022-01-12| 2| ['2022-01-10', '2022-01-09']|
| 2022-01-15| 2| ['2022-01-14', '2022-01-13']|
| 2022-01-16| 2| ['2022-01-14', '2022-01-13']|
| 2022-01-17| 2| ['2022-01-14', '2022-01-13']|
| 2022-01-19| 2| ['2022-01-18', '2022-01-14']|
| 2022-01-20| 2| ['2022-01-18', '2022-01-14']|
+--------------+---+------------------------------+
I know that I can use collect_list to get the dates as a list, but how can I collect by range?
MVCE:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data_1 = [
    ("2022-01-08", 1, 0),
    ("2022-01-10", 1, 0),
    ("2022-01-11", 1, 0),
    ("2022-01-12", 1, 0),
    ("2022-01-13", 1, 0),
    ("2022-01-15", 1, 0),
    ("2022-01-18", 1, 0),
    ("2022-01-19", 1, 0),
    ("2022-01-08", 2, 0),
    ("2022-01-11", 2, 0),
    ("2022-01-12", 2, 0),
    ("2022-01-15", 2, 0),
    ("2022-01-16", 2, 0),
    ("2022-01-17", 2, 0),
    ("2022-01-19", 2, 0),
    ("2022-01-20", 2, 0)
]
schema_1 = StructType([
    StructField("Date", StringType(), True),
    StructField("Id", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
data_2 = [
    ("2022-01-09", 1, 1),
    ("2022-01-14", 1, 1),
    ("2022-01-16", 1, 1),
    ("2022-01-17", 1, 1),
    ("2022-01-20", 1, 1),
    ("2022-01-09", 2, 1),
    ("2022-01-10", 2, 1),
    ("2022-01-13", 2, 1),
    ("2022-01-14", 2, 1),
    ("2022-01-18", 2, 1)
]
schema_2 = StructType([
    StructField("Date", StringType(), True),
    StructField("Id", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])
df_2 = spark.createDataFrame(data=data_2, schema=schema_2)
You can accomplish this by:
joining the two tables on Id;
conditionally collecting dates from df_2 when they are earlier than the target date from df_1 (collect_list ignores null values); and
using a combination of slice and sort_array to keep only the two most recent dates.
import pyspark.sql.functions as F
df_out = df_1 \
    .join(df_2.select(F.col("Date").alias("Date_RHS"), "Id"), on="Id", how="inner") \
    .groupBy("Date", "Id") \
    .agg(F.collect_list(F.when(F.col("Date_RHS") < F.col("Date"), F.col("Date_RHS")).otherwise(F.lit(None))).alias("List")) \
    .select("Date", "Id", F.slice(F.sort_array(F.col("List"), asc=False), start=1, length=2).alias("List"))
# +----------+---+------------------------+
# |Date |Id |List |
# +----------+---+------------------------+
# |2022-01-08|1 |[] |
# |2022-01-10|1 |[2022-01-09] |
# |2022-01-11|1 |[2022-01-09] |
# |2022-01-12|1 |[2022-01-09] |
# |2022-01-13|1 |[2022-01-09] |
# |2022-01-15|1 |[2022-01-14, 2022-01-09]|
# |2022-01-18|1 |[2022-01-17, 2022-01-16]|
# |2022-01-19|1 |[2022-01-17, 2022-01-16]|
# |2022-01-08|2 |[] |
# |2022-01-11|2 |[2022-01-10, 2022-01-09]|
# |2022-01-12|2 |[2022-01-10, 2022-01-09]|
# |2022-01-15|2 |[2022-01-14, 2022-01-13]|
# |2022-01-16|2 |[2022-01-14, 2022-01-13]|
# |2022-01-17|2 |[2022-01-14, 2022-01-13]|
# |2022-01-19|2 |[2022-01-18, 2022-01-14]|
# |2022-01-20|2 |[2022-01-18, 2022-01-14]|
# +----------+---+------------------------+
The following approach first aggregates df_2 and then does a left join. It then uses the higher-order function filter to drop dates that are not earlier than the "Date" column, and slice to keep only the 2 largest values from the sorted array.
from pyspark.sql import functions as F
df = df_1.join(df_2.groupBy('Id').agg(F.collect_set('Date').alias('d2')), 'Id', 'left')
df = df.select(
    'Date', 'Id',
    F.slice(F.sort_array(F.filter('d2', lambda x: x < F.col('Date')), False), 1, 2).alias('List')
)
df.show(truncate=0)
# +----------+---+------------------------+
# |Date |Id |List |
# +----------+---+------------------------+
# |2022-01-08|1 |[] |
# |2022-01-10|1 |[2022-01-09] |
# |2022-01-11|1 |[2022-01-09] |
# |2022-01-12|1 |[2022-01-09] |
# |2022-01-13|1 |[2022-01-09] |
# |2022-01-15|1 |[2022-01-14, 2022-01-09]|
# |2022-01-18|1 |[2022-01-17, 2022-01-16]|
# |2022-01-19|1 |[2022-01-17, 2022-01-16]|
# |2022-01-08|2 |[] |
# |2022-01-11|2 |[2022-01-10, 2022-01-09]|
# |2022-01-12|2 |[2022-01-10, 2022-01-09]|
# |2022-01-15|2 |[2022-01-14, 2022-01-13]|
# |2022-01-16|2 |[2022-01-14, 2022-01-13]|
# |2022-01-17|2 |[2022-01-14, 2022-01-13]|
# |2022-01-19|2 |[2022-01-18, 2022-01-14]|
# |2022-01-20|2 |[2022-01-18, 2022-01-14]|
# +----------+---+------------------------+
For lower Spark versions (before 3.1, where F.filter is not available in the Python API), use expr instead:
from pyspark.sql import functions as F
df = df_1.join(df_2.groupBy('Id').agg(F.collect_set('Date').alias('d2')), 'Id', 'left')
df = df.select(
    'Date', 'Id',
    F.slice(F.sort_array(F.expr("filter(d2, x -> x < Date)"), False), 1, 2).alias('List')
)
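Note that both solutions compare the dates as plain strings; this only works because the yyyy-MM-dd format happens to sort chronologically. If your dates arrive in a different format, cast them to dates first. A minimal sketch, assuming the df_1 and df_2 from the MVCE above:
from pyspark.sql import functions as F
# Cast the string column to DateType so that comparisons and sort_array
# follow calendar order regardless of the original string format.
df_1 = df_1.withColumn("Date", F.to_date("Date", "yyyy-MM-dd"))
df_2 = df_2.withColumn("Date", F.to_date("Date", "yyyy-MM-dd"))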
I have a dataframe with the below structure
+------+-------------+--------+
|region|          key|     val|
+------+-------------+--------+
|Sample|         row1|       6|
|Sample|row1_category|   Cat 1|
|Sample|    row1_Unit|      Kg|
|Sample|         row2|       4|
|Sample|row2_category|   Cat 2|
|Sample|    row2_Unit|     ltr|
+------+-------------+--------+
I tried to add a column and push the values from rows to columns, but I could not get the category and unit values onto the same row as their key.
I want to convert it into the below structure
+------+-------------+--------+--------+--------+
|region|          key|     val|Category|    Unit|
+------+-------------+--------+--------+--------+
|Sample|         row1|       6|   Cat 1|      Kg|
|Sample|         row2|       4|   Cat 2|     ltr|
+------+-------------+--------+--------+--------+
I need to do this for multiple keys; I will have row2, row3, etc.
scala> df.show
+------+-------------+----+
|region| key| val|
+------+-------------+----+
|Sample| row1| 6|
|Sample|row1_category|Cat1|
|Sample| row1_Unit| Kg|
|Sample| row2| 4|
|Sample|row2_category|Cat2|
|Sample| row2_Unit| ltr|
+------+-------------+----+
scala> val df1 = df.withColumn("_temp", split( $"key" , "_")).select(col("region"), $"_temp".getItem(0) as "key",$"_temp".getItem(1) as "colType",col("val"))
scala> df1.show(false)
+------+----+--------+----+
|region|key |colType |val |
+------+----+--------+----+
|Sample|row1|null |6 |
|Sample|row1|category|Cat1|
|Sample|row1|Unit |Kg |
|Sample|row2|null |4 |
|Sample|row2|category|Cat2|
|Sample|row2|Unit |ltr |
+------+----+--------+----+
scala> val df2 = df1.withColumn("Category", when(col("colType") === "category", col("val"))).withColumn("Unit", when(col("colType") === "Unit", col("val"))).withColumn("val", when(col("colType").isNull, col("val")))
scala> df2.show(false)
+------+----+--------+----+--------+----+
|region|key |colType |val |Category|Unit|
+------+----+--------+----+--------+----+
|Sample|row1|null |6 |null |null|
|Sample|row1|category|null|Cat1 |null|
|Sample|row1|Unit |null|null |Kg |
|Sample|row2|null |4 |null |null|
|Sample|row2|category|null|Cat2 |null|
|Sample|row2|Unit |null|null |ltr |
+------+----+--------+----+--------+----+
scala> val df3 = df2.groupBy("region", "key").agg(concat_ws("",collect_set(when($"val".isNotNull, $"val"))).as("val"),concat_ws("",collect_set(when($"Category".isNotNull, $"Category"))).as("Category"), concat_ws("",collect_set(when($"Unit".isNotNull, $"Unit"))).as("Unit"))
scala> df3.show()
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1| 6| Cat1| Kg|
|Sample|row2| 4| Cat2| ltr|
+------+----+---+--------+----+
You can achieve it by grouping by your key (and maybe region as well) and aggregating with collect_list. Using the regex ^[^_]+ you will get all characters up to the first _ character.
UPDATE: You can use the regex (\\d{1,}) to find numbers in a string (capturing groups). For example, if you have row_123_456_unit and your call looks like regexp_extract('key, "(\\d{1,})", 1), you will get 123, i.e. the first capturing group of the first match. To get 456 as well, you need a pattern with a second group, e.g. regexp_extract('key, "(\\d+)_(\\d+)", 2). Hope it helps.
df.printSchema()
df.show()
val regex1 = "^[^_]+" // until '_' character
val regex2 = "(\\d{1,})" // capture group of numbers
df.groupBy('region, regexp_extract('key, regex1, 0))
  .agg('region, collect_list('key).as("key"), collect_list('val).as("val"))
  .select('region,
    'key.getItem(0).as("key"),
    'val.getItem(0).as("val"),
    'val.getItem(1).as("Category"),
    'val.getItem(2).as("Unit")
  ).show()
output:
root
|-- region: string (nullable = true)
|-- key: string (nullable = true)
|-- val: string (nullable = true)
+------+-------------+-----+
|region| key| val|
+------+-------------+-----+
|Sample| row1| 6|
|Sample|row1_category|Cat 1|
|Sample| row1_Unit| Kg|
|Sample| row2| 4|
|Sample|row2_category|Cat 2|
|Sample| row2_Unit| ltr|
+------+-------------+-----+
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1| 6| Cat 1| Kg|
|Sample|row2| 4| Cat 2| ltr|
+------+----+---+--------+----+
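For reference, the same reshaping can also be done with pivot, which avoids relying on the ordering of collect_list. Below is a sketch in PySpark (column names taken from the question; the Scala API is analogous), not the answer's original code:
from pyspark.sql import functions as F
# Split the key into its base ("row1") and its suffix ("category", "Unit");
# a missing suffix means the plain value. Then pivot the suffix into columns.
split_key = F.split(F.col("key"), "_")
result = (
    df.withColumn("base_key", split_key.getItem(0))
      .withColumn("colType", F.coalesce(split_key.getItem(1), F.lit("val")))
      .groupBy("region", "base_key")
      .pivot("colType", ["val", "category", "Unit"])
      .agg(F.first("val"))
      .withColumnRenamed("base_key", "key")
      .withColumnRenamed("category", "Category")
)
result.show()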
I am trying to write some simple code to collapse the categorical variables in my dataframe into binary classes after indexing.
Currently my column has 3 classes: "A", "B", "C".
I am writing a simple if-else statement to collapse the classes, like:
def condition(r):
    if (r.wo_flag=="SLM" or r.wo_flag=="NON-SLM"):
        r.wo_flag="dispatch"
    else:
        r.wo_flag="non_dispatch"
    return r.wo_flag

df_final=df_new.map(lambda x: condition(x))
It's not working; it doesn't seem to handle the else condition.
|MData|Recode12|Status|DayOfWeekOfDispatch|MannerOfDispatch|Wo_flag|PlaceOfInjury|Race|
|    M|      11|     M|                  4|               7|      C|           99|   1|
|    M|       8|     D|                  3|               7|      A|           99|   1|
|    F|      10|     W|                  2|               7|      C|           99|   1|
|    M|       9|     D|                  1|               7|      B|           99|   1|
|    M|       8|     D|                  2|               7|      C|           99|   1|
This is the Sample Data
The accepted answer is not very efficient due to the use of a user defined function (UDF).
I think most people are looking for when.
from pyspark.sql.functions import when
matches = df["wo_flag"].isin("SLM", "NON-SLM")
new_df = df.withColumn("wo_flag", when(matches, "dispatch").otherwise("non-dispatch"))
Try this:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
def modify_values(r):
    if r == "A" or r == "B":
        return "dispatch"
    else:
        return "non-dispatch"
ol_val = udf(modify_values, StringType())
new_df = df.withColumn("wo_flag",ol_val(df.wo_flag))
Things you are doing wrong:
You are trying to modify Rows (Rows are immutable).
When a map operation is done on a dataframe, the resulting data structure is a PipelinedRDD and not a dataframe. You have to apply .toDF() to get a dataframe back (see the sketch below).
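A minimal sketch of how the map-based version could be fixed along those lines, assuming df_new has the wo_flag column from the sample data (the when-based answer above remains the simpler option):
from pyspark.sql import Row
def condition(r):
    # Rows are immutable, so build a new Row instead of assigning to r.wo_flag.
    flag = "dispatch" if r.wo_flag in ("SLM", "NON-SLM") else "non_dispatch"
    return Row(**{**r.asDict(), "wo_flag": flag})
# map over the underlying RDD, then convert back to a DataFrame with .toDF()
df_final = df_new.rdd.map(condition).toDF()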
Okay, so I'm trying to write code that reads in a positive odd integer and outputs an inverse pyramid that starts with that number and descends to 1, cutting off the first and last digit on each following line, and so on. So if I entered 7 it would display:
7654321
 65432
  543
   4
The i'th row contains n-(2i-2) digits, but I'm not sure how to use that.
Thanks for your help.
This is what I have so far:
#include <iostream>
using namespace std;
int main()
{
    int n, i, j;
    cout << "Enter a positive odd number: " << endl;
    cin >> n;
    i = n;
    while (n % 2 == 0)
    {
        cout << "Invalid number." << endl;
        cout << "Enter a positive odd number: " << endl;
        cin >> n;
    }
    for (i = n; i <= n && i > 0; i--)
    {
        for (j = i; j <= i; j--)
        {
            cout << i % 10;
        }
        cout << endl;
    }
    return(0);
}
Number the character positions on screen like this:
+----+----+----+----+----+----+----+
| 0 0| 0 1| 0 2| 0 3| 0 4| 0 5| 0 6|
+----+----+----+----+----+----+----+
| 1 0| 1 1| 1 2| 1 3| 1 4| 1 5| 1 6|
+----+----+----+----+----+----+----+
| 2 0| 2 1| 2 2| 2 3| 2 4| 2 5| 2 6|
+----+----+----+----+----+----+----+
| 3 0| 3 1| 3 2| 3 3| 3 4| 3 5| 3 6|
+----+----+----+----+----+----+----+
and check what goes in there
+----+----+----+----+----+----+----+
| 7 | 6 | 5 | 4 | 3 | 2 | 1 |
+----+----+----+----+----+----+----+
| | 6 | 5 | 4 | 3 | 2 | |
+----+----+----+----+----+----+----+
| | | 5 | 4 | 3 | | |
+----+----+----+----+----+----+----+
| | | | 4 | | | |
+----+----+----+----+----+----+----+
Now find the relation between x, y, the value to print, and the initial number.
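If it helps to see that relation spelled out, here is a quick sketch (in Python for brevity, not the C++ you need to write): at row y, column x shows the digit (n - x) % 10 whenever y <= x <= n - 1 - y, and a blank otherwise.
def pyramid(n):
    # Row y keeps columns y .. n-1-y; column x always holds (n - x) % 10.
    for y in range((n + 1) // 2):
        row = "".join(
            str((n - x) % 10) if y <= x <= n - 1 - y else " "
            for x in range(n)
        )
        print(row.rstrip())
pyramid(7)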
I have this source file:
// ConstPointer.cpp
const short * const const_short_p_const = 0;
const short * const_short_p = 0;
and compiled it with and without debug info (Sun C++ Compiler 5.10):
# CC ConstPointer.cpp -c -o ConstPointer.o
# CC -g ConstPointer.cpp -c -o ConstPointer-debug.o
Here are the symbol names of the object file without debug information:
# nm -C ConstPointer.o
ConstPointer.o:
[Index] Value Size Type Bind Other Shndx Name
[2] | 0| 0|SECT |LOCL |0 |10 |
[3] | 0| 0|SECT |LOCL |0 |9 |
[4] | 0| 0|OBJT |LOCL |0 |6 |Bbss.bss
[1] | 0| 0|FILE |LOCL |0 |ABS |ConstPointer.cpp
[5] | 0| 0|OBJT |LOCL |0 |3 |Ddata.data
[6] | 0| 0|OBJT |LOCL |0 |5 |Dpicdata.picdata
[7] | 0| 0|OBJT |LOCL |0 |4 |Drodata.rodata
[9] | 4| 4|OBJT |GLOB |0 |3 |const_short_p
[8] | 0| 4|OBJT |LOCL |0 |3 |const_short_p_const
Here are the symbol names of the object file with debug information:
# nm -C ConstPointer-debug.o
ConstPointer-debug.o:
[Index] Value Size Type Bind Other Shndx Name
[4] | 0| 0|SECT |LOCL |0 |9 |
[2] | 0| 0|SECT |LOCL |0 |8 |
[3] | 0| 0|SECT |LOCL |0 |10 |
[10] | 0| 4|OBJT |GLOB |0 |3 |$XAHMCqApZlqO37H.const_short_p_const
[5] | 0| 0|NOTY |LOCL |0 |6 |Bbss.bss
[1] | 0| 0|FILE |LOCL |0 |ABS |ConstPointer.cpp
[6] | 0| 0|NOTY |LOCL |0 |3 |Ddata.data
[7] | 0| 0|NOTY |LOCL |0 |5 |Dpicdata.picdata
[8] | 0| 0|NOTY |LOCL |0 |4 |Drodata.rodata
[9] | 4| 4|OBJT |GLOB |0 |3 |const_short_p
Why does the variable const_short_p_const get a different symbol name? g++ does not change it when compiling with debug information. It looks like a compiler bug to me. What do you think? The second const (the pointer itself being const) seems to be what triggers it.
EDIT for Drew Hall's comment:
For example you have two files:
// ConstPointer.cpp
const short * const const_short_p_const = 0;
void foo();
int main(int argc, const char *argv[]) {
    foo();
    return 0;
}
and
// ConstPointer2.cpp
extern const short * const const_short_p_const;
void foo() {
    short x = *const_short_p_const;
}
Compiling is fine:
# CC ConstPointer2.cpp -g -c -o ConstPointer2.o
# CC ConstPointer.cpp -g -c -o ConstPointer.o
but linking does not work because the symbols differ! The symbol name in ConstPointer2.o is const_short_p_const, but the symbol name in ConstPointer.o is $XAHMCqApZlqO37H.const_short_p_const.
# CC ConstPointer.o ConstPointer2.o -o ConstPointer
Undefined                       first referenced
 symbol                             in file
const_short_p_const                 ConstPointer2.o
Maybe this is linked to the fact that a global const variable is implicitly static (has internal linkage) in C++?