I have the following DFs:
+--------------+---+----+
|Date |Id |Cond|
+--------------+---+----+
| 2022-01-08| 1| 0|
| 2022-01-10| 1| 0|
| 2022-01-11| 1| 0|
| 2022-01-12| 1| 0|
| 2022-01-13| 1| 0|
| 2022-01-15| 1| 0|
| 2022-01-18| 1| 0|
| 2022-01-19| 1| 0|
| 2022-01-08| 2| 0|
| 2022-01-11| 2| 0|
| 2022-01-12| 2| 0|
| 2022-01-15| 2| 0|
| 2022-01-16| 2| 0|
| 2022-01-17| 2| 0|
| 2022-01-19| 2| 0|
| 2022-01-20| 2| 0|
+--------------+---+----+
+--------------+---+----+
|Date |Id |Cond|
+--------------+---+----+
| 2022-01-09| 1| 1|
| 2022-01-14| 1| 1|
| 2022-01-16| 1| 1|
| 2022-01-17| 1| 1|
| 2022-01-20| 1| 1|
| 2022-01-09| 2| 1|
| 2022-01-10| 2| 1|
| 2022-01-13| 2| 1|
| 2022-01-14| 2| 1|
| 2022-01-18| 2| 1|
+--------------+---+----+
For each date in DF1 I want to get the 2 most recent dates from DF2 that come before it (per Id).
Example:
For date "2022-01-15" and Id = 1 in DF1 I need to collect dates "2022-01-14" and "2022-01-09" from DF2.
My expected output:
+--------------+---+------------------------------+
|Date |Id |List |
+--------------+---+------------------------------+
| 2022-01-08| 1| [] |
| 2022-01-10| 1| ['2022-01-09'] |
| 2022-01-11| 1| ['2022-01-09'] |
| 2022-01-12| 1| ['2022-01-09'] |
| 2022-01-13| 1| ['2022-01-09'] |
| 2022-01-15| 1| ['2022-01-14', '2022-01-09']|
| 2022-01-18| 1| ['2022-01-17', '2022-01-16']|
| 2022-01-19| 1| ['2022-01-17', '2022-01-16']|
| 2022-01-08| 2| [] |
| 2022-01-11| 2| ['2022-01-10', '2022-01-09']|
| 2022-01-12| 2| ['2022-01-10', '2022-01-09']|
| 2022-01-15| 2| ['2022-01-14', '2022-01-13']|
| 2022-01-16| 2| ['2022-01-14', '2022-01-13']|
| 2022-01-17| 2| ['2022-01-14', '2022-01-13']|
| 2022-01-19| 2| ['2022-01-18', '2022-01-14']|
| 2022-01-20| 2| ['2022-01-18', '2022-01-14']|
+--------------+---+------------------------------+
I know that I can use collect_list to get the dates as a list, but how can I collect by range?
MVCE:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data_1 = [
    ("2022-01-08", 1, 0),
    ("2022-01-10", 1, 0),
    ("2022-01-11", 1, 0),
    ("2022-01-12", 1, 0),
    ("2022-01-13", 1, 0),
    ("2022-01-15", 1, 0),
    ("2022-01-18", 1, 0),
    ("2022-01-19", 1, 0),
    ("2022-01-08", 2, 0),
    ("2022-01-11", 2, 0),
    ("2022-01-12", 2, 0),
    ("2022-01-15", 2, 0),
    ("2022-01-16", 2, 0),
    ("2022-01-17", 2, 0),
    ("2022-01-19", 2, 0),
    ("2022-01-20", 2, 0)
]
schema_1 = StructType([
    StructField("Date", StringType(), True),
    StructField("Id", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
data_2 = [
    ("2022-01-09", 1, 1),
    ("2022-01-14", 1, 1),
    ("2022-01-16", 1, 1),
    ("2022-01-17", 1, 1),
    ("2022-01-20", 1, 1),
    ("2022-01-09", 2, 1),
    ("2022-01-10", 2, 1),
    ("2022-01-13", 2, 1),
    ("2022-01-14", 2, 1),
    ("2022-01-18", 2, 1)
]
schema_2 = StructType([
    StructField("Date", StringType(), True),
    StructField("Id", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])
df_2 = spark.createDataFrame(data=data_2, schema=schema_2)
You can accomplish this by:
joining the two tables on Id;
conditionally collecting dates from df_2 when they are earlier than the target date from df_1 (collect_list ignores null values); and
using a combination of slice and sort_array to keep only the two most recent dates.
import pyspark.sql.functions as F
df_out = df_1 \
    .join(df_2.select(F.col("Date").alias("Date_RHS"), "Id"), on="Id", how="inner") \
    .groupBy("Date", "Id") \
    .agg(F.collect_list(F.when(F.col("Date_RHS") < F.col("Date"), F.col("Date_RHS")).otherwise(F.lit(None))).alias("List")) \
    .select("Date", "Id", F.slice(F.sort_array(F.col("List"), asc=False), start=1, length=2).alias("List"))
# +----------+---+------------------------+
# |Date |Id |List |
# +----------+---+------------------------+
# |2022-01-08|1 |[] |
# |2022-01-10|1 |[2022-01-09] |
# |2022-01-11|1 |[2022-01-09] |
# |2022-01-12|1 |[2022-01-09] |
# |2022-01-13|1 |[2022-01-09] |
# |2022-01-15|1 |[2022-01-14, 2022-01-09]|
# |2022-01-18|1 |[2022-01-17, 2022-01-16]|
# |2022-01-19|1 |[2022-01-17, 2022-01-16]|
# |2022-01-08|2 |[] |
# |2022-01-11|2 |[2022-01-10, 2022-01-09]|
# |2022-01-12|2 |[2022-01-10, 2022-01-09]|
# |2022-01-15|2 |[2022-01-14, 2022-01-13]|
# |2022-01-16|2 |[2022-01-14, 2022-01-13]|
# |2022-01-17|2 |[2022-01-14, 2022-01-13]|
# |2022-01-19|2 |[2022-01-18, 2022-01-14]|
# |2022-01-20|2 |[2022-01-18, 2022-01-14]|
# +----------+---+------------------------+
The following approach first aggregates df_2 and then does a left join. It then uses the higher-order function filter to drop dates that are not earlier than the "Date" column, and slice to keep only the 2 largest values from the sorted array.
from pyspark.sql import functions as F
df = df_1.join(df_2.groupBy('Id').agg(F.collect_set('Date').alias('d2')), 'Id', 'left')
df = df.select(
    'Date', 'Id',
    F.slice(F.sort_array(F.filter('d2', lambda x: x < F.col('Date')), False), 1, 2).alias('List')
)
df.show(truncate=0)
# +----------+---+------------------------+
# |Date |Id |List |
# +----------+---+------------------------+
# |2022-01-08|1 |[] |
# |2022-01-10|1 |[2022-01-09] |
# |2022-01-11|1 |[2022-01-09] |
# |2022-01-12|1 |[2022-01-09] |
# |2022-01-13|1 |[2022-01-09] |
# |2022-01-15|1 |[2022-01-14, 2022-01-09]|
# |2022-01-18|1 |[2022-01-17, 2022-01-16]|
# |2022-01-19|1 |[2022-01-17, 2022-01-16]|
# |2022-01-08|2 |[] |
# |2022-01-11|2 |[2022-01-10, 2022-01-09]|
# |2022-01-12|2 |[2022-01-10, 2022-01-09]|
# |2022-01-15|2 |[2022-01-14, 2022-01-13]|
# |2022-01-16|2 |[2022-01-14, 2022-01-13]|
# |2022-01-17|2 |[2022-01-14, 2022-01-13]|
# |2022-01-19|2 |[2022-01-18, 2022-01-14]|
# |2022-01-20|2 |[2022-01-18, 2022-01-14]|
# +----------+---+------------------------+
For lower Spark versions (before 3.1, where F.filter is not available in the Python API), use expr instead:
from pyspark.sql import functions as F
df = df_1.join(df_2.groupBy('Id').agg(F.collect_set('Date').alias('d2')), 'Id', 'left')
df = df.select(
    'Date', 'Id',
    F.slice(F.sort_array(F.expr("filter(d2, x -> x < Date)"), False), 1, 2).alias('List')
)
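Note that both solutions compare the dates as plain strings; this only works because the yyyy-MM-dd format happens to sort chronologically. If your dates arrive in a different format, cast them to dates first. A minimal sketch, assuming the df_1 and df_2 from the MVCE above:
from pyspark.sql import functions as F
# Cast the string column to DateType so that comparisons and sort_array
# follow calendar order regardless of the original string format.
df_1 = df_1.withColumn("Date", F.to_date("Date", "yyyy-MM-dd"))
df_2 = df_2.withColumn("Date", F.to_date("Date", "yyyy-MM-dd"))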
I have a dataframe with the below structure
+------+-------------+--------+
|region|          key|     val|
+------+-------------+--------+
|Sample|         row1|       6|
|Sample|row1_category|   Cat 1|
|Sample|    row1_Unit|      Kg|
|Sample|         row2|       4|
|Sample|row2_category|   Cat 2|
|Sample|    row2_Unit|     ltr|
+------+-------------+--------+
I tried to add a column and push the values from rows to columns, but I could not get the category and unit values onto the same row as their key.
I want to convert it into the below structure
+------+-------------+--------+--------+--------+
|region|          key|     val|Category|    Unit|
+------+-------------+--------+--------+--------+
|Sample|         row1|       6|   Cat 1|      Kg|
|Sample|         row2|       4|   Cat 2|     ltr|
+------+-------------+--------+--------+--------+
I need to do this for multiple keys; I will have row2, row3, etc.
scala> df.show
+------+-------------+----+
|region| key| val|
+------+-------------+----+
|Sample| row1| 6|
|Sample|row1_category|Cat1|
|Sample| row1_Unit| Kg|
|Sample| row2| 4|
|Sample|row2_category|Cat2|
|Sample| row2_Unit| ltr|
+------+-------------+----+
scala> val df1 = df.withColumn("_temp", split( $"key" , "_")).select(col("region"), $"_temp".getItem(0) as "key",$"_temp".getItem(1) as "colType",col("val"))
scala> df1.show(false)
+------+----+--------+----+
|region|key |colType |val |
+------+----+--------+----+
|Sample|row1|null |6 |
|Sample|row1|category|Cat1|
|Sample|row1|Unit |Kg |
|Sample|row2|null |4 |
|Sample|row2|category|Cat2|
|Sample|row2|Unit |ltr |
+------+----+--------+----+
scala> val df2 = df1.withColumn("Category", when(col("colType") === "category", col("val"))).withColumn("Unit", when(col("colType") === "Unit", col("val"))).withColumn("val", when(col("colType").isNull, col("val")))
scala> df2.show(false)
+------+----+--------+----+--------+----+
|region|key |colType |val |Category|Unit|
+------+----+--------+----+--------+----+
|Sample|row1|null |6 |null |null|
|Sample|row1|category|null|Cat1 |null|
|Sample|row1|Unit |null|null |Kg |
|Sample|row2|null |4 |null |null|
|Sample|row2|category|null|Cat2 |null|
|Sample|row2|Unit |null|null |ltr |
+------+----+--------+----+--------+----+
scala> val df3 = df2.groupBy("region", "key").agg(concat_ws("",collect_set(when($"val".isNotNull, $"val"))).as("val"),concat_ws("",collect_set(when($"Category".isNotNull, $"Category"))).as("Category"), concat_ws("",collect_set(when($"Unit".isNotNull, $"Unit"))).as("Unit"))
scala> df3.show()
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1| 6| Cat1| Kg|
|Sample|row2| 4| Cat2| ltr|
+------+----+---+--------+----+
You can achieve it by grouping by your key (and maybe region as well) and aggregating with collect_list. Using the regex ^[^_]+ you will get all characters up to the first _ character.
UPDATE: You can use the regex (\\d{1,}) to find numbers in a string (capturing groups). For example, if you have row_123_456_unit and your call looks like regexp_extract('key, "(\\d{1,})", 1), you will get 123, i.e. the first capturing group of the first match. To get 456 as well, you need a pattern with a second group, e.g. regexp_extract('key, "(\\d+)_(\\d+)", 2). Hope it helps.
df.printSchema()
df.show()
val regex1 = "^[^_]+" // until '_' character
val regex2 = "(\\d{1,})" // capture group of numbers
df.groupBy('region, regexp_extract('key, regex1, 0))
  .agg('region, collect_list('key).as("key"), collect_list('val).as("val"))
  .select('region,
    'key.getItem(0).as("key"),
    'val.getItem(0).as("val"),
    'val.getItem(1).as("Category"),
    'val.getItem(2).as("Unit")
  ).show()
output:
root
|-- region: string (nullable = true)
|-- key: string (nullable = true)
|-- val: string (nullable = true)
+------+-------------+-----+
|region| key| val|
+------+-------------+-----+
|Sample| row1| 6|
|Sample|row1_category|Cat 1|
|Sample| row1_Unit| Kg|
|Sample| row2| 4|
|Sample|row2_category|Cat 2|
|Sample| row2_Unit| ltr|
+------+-------------+-----+
+------+----+---+--------+----+
|region| key|val|Category|Unit|
+------+----+---+--------+----+
|Sample|row1| 6| Cat 1| Kg|
|Sample|row2| 4| Cat 2| ltr|
+------+----+---+--------+----+
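For reference, the same reshaping can also be done with pivot, which avoids relying on the ordering of collect_list. Below is a sketch in PySpark (column names taken from the question; the Scala API is analogous), not the answer's original code:
from pyspark.sql import functions as F
# Split the key into its base ("row1") and its suffix ("category", "Unit");
# a missing suffix means the plain value. Then pivot the suffix into columns.
split_key = F.split(F.col("key"), "_")
result = (
    df.withColumn("base_key", split_key.getItem(0))
      .withColumn("colType", F.coalesce(split_key.getItem(1), F.lit("val")))
      .groupBy("region", "base_key")
      .pivot("colType", ["val", "category", "Unit"])
      .agg(F.first("val"))
      .withColumnRenamed("base_key", "key")
      .withColumnRenamed("category", "Category")
)
result.show()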
I am trying to write some simple code to collapse the categorical variables in my dataframe into binary classes after indexing.
Currently my column has 3 classes: "A", "B", "C".
I am writing a simple if-else statement to collapse the classes, like:
def condition(r):
    if (r.wo_flag=="SLM" or r.wo_flag=="NON-SLM"):
        r.wo_flag="dispatch"
    else:
        r.wo_flag="non_dispatch"
    return r.wo_flag

df_final=df_new.map(lambda x: condition(x))
It's not working; it doesn't seem to handle the else condition.
|MData|Recode12|Status|DayOfWeekOfDispatch|MannerOfDispatch|Wo_flag|PlaceOfInjury|Race|
|    M|      11|     M|                  4|               7|      C|           99|   1|
|    M|       8|     D|                  3|               7|      A|           99|   1|
|    F|      10|     W|                  2|               7|      C|           99|   1|
|    M|       9|     D|                  1|               7|      B|           99|   1|
|    M|       8|     D|                  2|               7|      C|           99|   1|
This is the Sample Data
The accepted answer is not very efficient due to the use of a user defined function (UDF).
I think most people are looking for when.
from pyspark.sql.functions import when
matches = df["wo_flag"].isin("SLM", "NON-SLM")
new_df = df.withColumn("wo_flag", when(matches, "dispatch").otherwise("non-dispatch"))
Try this:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
def modify_values(r):
    if r == "A" or r == "B":
        return "dispatch"
    else:
        return "non-dispatch"
ol_val = udf(modify_values, StringType())
new_df = df.withColumn("wo_flag",ol_val(df.wo_flag))
Things you are doing wrong:
You are trying to modify Rows (Rows are immutable).
When a map operation is done on a dataframe, the resulting data structure is a PipelinedRDD and not a dataframe. You have to apply .toDF() to get a dataframe back (see the sketch below).
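A minimal sketch of how the map-based version could be fixed along those lines, assuming df_new has the wo_flag column from the sample data (the when-based answer above remains the simpler option):
from pyspark.sql import Row
def condition(r):
    # Rows are immutable, so build a new Row instead of assigning to r.wo_flag.
    flag = "dispatch" if r.wo_flag in ("SLM", "NON-SLM") else "non_dispatch"
    return Row(**{**r.asDict(), "wo_flag": flag})
# map over the underlying RDD, then convert back to a DataFrame with .toDF()
df_final = df_new.rdd.map(condition).toDF()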
Okay, so I'm trying to write code that reads in a positive odd integer and outputs an inverse pyramid that starts with that number and descends to 1, cutting off the first and last digit on each following line, and so on. So if I entered 7 it would display:
7654321
 65432
  543
   4
The i'th row contains n-(2i-2) digits, but I'm not sure how to use that.
Thanks for your help.
This is what I have so far:
#include <iostream>
using namespace std;
int main()
{
    int n, i, j;
    cout << "Enter a positive odd number: " << endl;
    cin >> n;
    i = n;
    while (n % 2 == 0)
    {
        cout << "Invalid number." << endl;
        cout << "Enter a positive odd number: " << endl;
        cin >> n;
    }
    for (i = n; i <= n && i > 0; i--)
    {
        for (j = i; j <= i; j--)
        {
            cout << i % 10;
        }
        cout << endl;
    }
    return(0);
}
Number the character positions on screen like this:
+----+----+----+----+----+----+----+
| 0 0| 0 1| 0 2| 0 3| 0 4| 0 5| 0 6|
+----+----+----+----+----+----+----+
| 1 0| 1 1| 1 2| 1 3| 1 4| 1 5| 1 6|
+----+----+----+----+----+----+----+
| 2 0| 2 1| 2 2| 2 3| 2 4| 2 5| 2 6|
+----+----+----+----+----+----+----+
| 3 0| 3 1| 3 2| 3 3| 3 4| 3 5| 3 6|
+----+----+----+----+----+----+----+
and check what goes in there
+----+----+----+----+----+----+----+
| 7 | 6 | 5 | 4 | 3 | 2 | 1 |
+----+----+----+----+----+----+----+
| | 6 | 5 | 4 | 3 | 2 | |
+----+----+----+----+----+----+----+
| | | 5 | 4 | 3 | | |
+----+----+----+----+----+----+----+
| | | | 4 | | | |
+----+----+----+----+----+----+----+
Now find the relation between x, y, the value to print, and the initial number.
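If it helps to see that relation spelled out, here is a quick sketch (in Python for brevity, not the C++ you need to write): at row y, column x shows the digit (n - x) % 10 whenever y <= x <= n - 1 - y, and a blank otherwise.
def pyramid(n):
    # Row y keeps columns y .. n-1-y; column x always holds (n - x) % 10.
    for y in range((n + 1) // 2):
        row = "".join(
            str((n - x) % 10) if y <= x <= n - 1 - y else " "
            for x in range(n)
        )
        print(row.rstrip())
pyramid(7)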
I have this source file:
// ConstPointer.cpp
const short * const const_short_p_const = 0;
const short * const_short_p = 0;
and compiled it with and without debug info (Sun C++ Compiler 5.10):
# CC ConstPointer.cpp -c -o ConstPointer.o
# CC -g ConstPointer.cpp -c -o ConstPointer-debug.o
Here are the symbol names of the object file without debug information:
# nm -C ConstPointer.o
ConstPointer.o:
[Index] Value Size Type Bind Other Shndx Name
[2] | 0| 0|SECT |LOCL |0 |10 |
[3] | 0| 0|SECT |LOCL |0 |9 |
[4] | 0| 0|OBJT |LOCL |0 |6 |Bbss.bss
[1] | 0| 0|FILE |LOCL |0 |ABS |ConstPointer.cpp
[5] | 0| 0|OBJT |LOCL |0 |3 |Ddata.data
[6] | 0| 0|OBJT |LOCL |0 |5 |Dpicdata.picdata
[7] | 0| 0|OBJT |LOCL |0 |4 |Drodata.rodata
[9] | 4| 4|OBJT |GLOB |0 |3 |const_short_p
[8] | 0| 4|OBJT |LOCL |0 |3 |const_short_p_const
Here are the symbol names of the object file with debug information:
# nm -C ConstPointer-debug.o
ConstPointer-debug.o:
[Index] Value Size Type Bind Other Shndx Name
[4] | 0| 0|SECT |LOCL |0 |9 |
[2] | 0| 0|SECT |LOCL |0 |8 |
[3] | 0| 0|SECT |LOCL |0 |10 |
[10] | 0| 4|OBJT |GLOB |0 |3 |$XAHMCqApZlqO37H.const_short_p_const
[5] | 0| 0|NOTY |LOCL |0 |6 |Bbss.bss
[1] | 0| 0|FILE |LOCL |0 |ABS |ConstPointer.cpp
[6] | 0| 0|NOTY |LOCL |0 |3 |Ddata.data
[7] | 0| 0|NOTY |LOCL |0 |5 |Dpicdata.picdata
[8] | 0| 0|NOTY |LOCL |0 |4 |Drodata.rodata
[9] | 4| 4|OBJT |GLOB |0 |3 |const_short_p
Why does the variable const_short_p_const get a different symbol name? g++ does not change it when compiling with debug information. It looks like a compiler bug to me. What do you think? The second const (the pointer itself being const) seems to be what triggers it.
EDIT for Drew Hall's comment:
For example you have two files:
// ConstPointer.cpp
const short * const const_short_p_const = 0;
void foo();
int main(int argc, const char *argv[]) {
    foo();
    return 0;
}
and
// ConstPointer2.cpp
extern const short * const const_short_p_const;
void foo() {
    short x = *const_short_p_const;
}
Compiling is fine:
# CC ConstPointer2.cpp -g -c -o ConstPointer2.o
# CC ConstPointer.cpp -g -c -o ConstPointer.o
but linking does not work because the symbols differ! The symbol name in ConstPointer2.o is const_short_p_const, but the symbol name in ConstPointer.o is $XAHMCqApZlqO37H.const_short_p_const.
# CC ConstPointer.o ConstPointer2.o -o ConstPointer
Undefined                       first referenced
 symbol                             in file
const_short_p_const                 ConstPointer2.o
Maybe this is linked to the fact that a global const variable is implicitly static (has internal linkage) in C++?