I have an Apache Spark dataframe with a set of computed columns. For each row in the dataframe (approx. 2000 rows), I wish to take the values of 10 columns and locate, among them, the value closest to that of an 11th column.
I imagine I would take those row values, turn them into a list, then use an absolute-difference calculation to determine the closest one.
But I am stuck on how to turn the row values into a list. I've turned a column's values into a list using collect_list, but I'm not sure how to handle the case where the list comes from a single row across multiple columns.
You should explode your columns into rows so that the comparison becomes a simple row-wise computation.
Let's create a sample dataframe:
import numpy as np

np.random.seed(0)
# assumes an active SparkContext `sc` (e.g. a pyspark shell): 20 rows of 11 random ints
df = sc.parallelize([np.random.randint(0, 10, 11).tolist() for _ in range(20)])\
    .toDF(["col" + str(i) for i in range(1, 12)])
df.show()
+----+----+----+----+----+----+----+----+----+-----+-----+
|col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11|
+----+----+----+----+----+----+----+----+----+-----+-----+
| 5| 0| 3| 3| 7| 9| 3| 5| 2| 4| 7|
| 6| 8| 8| 1| 6| 7| 7| 8| 1| 5| 9|
| 8| 9| 4| 3| 0| 3| 5| 0| 2| 3| 8|
| 1| 3| 3| 3| 7| 0| 1| 9| 9| 0| 4|
| 7| 3| 2| 7| 2| 0| 0| 4| 5| 5| 6|
| 8| 4| 1| 4| 9| 8| 1| 1| 7| 9| 9|
| 3| 6| 7| 2| 0| 3| 5| 9| 4| 4| 6|
| 4| 4| 3| 4| 4| 8| 4| 3| 7| 5| 5|
| 0| 1| 5| 9| 3| 0| 5| 0| 1| 2| 4|
| 2| 0| 3| 2| 0| 7| 5| 9| 0| 2| 7|
| 2| 9| 2| 3| 3| 2| 3| 4| 1| 2| 9|
| 1| 4| 6| 8| 2| 3| 0| 0| 6| 0| 6|
| 3| 3| 8| 8| 8| 2| 3| 2| 0| 8| 8|
| 3| 8| 2| 8| 4| 3| 0| 4| 3| 6| 9|
| 8| 0| 8| 5| 9| 0| 9| 6| 5| 3| 1|
| 8| 0| 4| 9| 6| 5| 7| 8| 8| 9| 2|
| 8| 6| 6| 9| 1| 6| 8| 8| 3| 2| 3|
| 6| 3| 6| 5| 7| 0| 8| 4| 6| 5| 8|
| 2| 3| 9| 7| 5| 3| 4| 5| 3| 3| 7|
| 9| 9| 9| 7| 3| 2| 3| 9| 7| 7| 5|
+----+----+----+----+----+----+----+----+----+-----+-----+
There are several ways to turn row values into a list:
Creating a map with keys equal to the column names and values equal to the corresponding row values:
import pyspark.sql.functions as psf
from itertools import chain
df = df\
    .withColumn("id", psf.monotonically_increasing_id())\
    .select(
        "id",
        psf.posexplode(
            psf.create_map(list(chain(*[(psf.lit(c), psf.col(c)) for c in df.columns if c != "col11"])))
        ).alias("pos", "col_name", "value"),
        "col11")
df.show()
+---+---+--------+-----+-----+
| id|pos|col_name|value|col11|
+---+---+--------+-----+-----+
| 0| 0| col1| 5| 7|
| 0| 1| col2| 0| 7|
| 0| 2| col3| 3| 7|
| 0| 3| col4| 3| 7|
| 0| 4| col5| 7| 7|
| 0| 5| col6| 9| 7|
| 0| 6| col7| 3| 7|
| 0| 7| col8| 5| 7|
| 0| 8| col9| 2| 7|
| 0| 9| col10| 4| 7|
| 1| 0| col1| 6| 9|
| 1| 1| col2| 8| 9|
| 1| 2| col3| 8| 9|
| 1| 3| col4| 1| 9|
| 1| 4| col5| 6| 9|
| 1| 5| col6| 7| 9|
| 1| 6| col7| 7| 9|
| 1| 7| col8| 8| 9|
| 1| 8| col9| 1| 9|
| 1| 9| col10| 5| 9|
+---+---+--------+-----+-----+
Using a StructType inside an ArrayType:
df = df\
    .withColumn("id", psf.monotonically_increasing_id())\
    .select(
        "id",
        psf.explode(
            psf.array([psf.struct(psf.lit(c).alias("col_name"), psf.col(c).alias("value"))
                       for c in df.columns if c != "col11"])
        ).alias("cols"),
        "col11")\
    .select("cols.*", "col11", "id")
df.show()
+--------+-----+-----+---+
|col_name|value|col11| id|
+--------+-----+-----+---+
| col1| 5| 7| 0|
| col2| 0| 7| 0|
| col3| 3| 7| 0|
| col4| 3| 7| 0|
| col5| 7| 7| 0|
| col6| 9| 7| 0|
| col7| 3| 7| 0|
| col8| 5| 7| 0|
| col9| 2| 7| 0|
| col10| 4| 7| 0|
| col1| 6| 9| 1|
| col2| 8| 9| 1|
| col3| 8| 9| 1|
| col4| 1| 9| 1|
| col5| 6| 9| 1|
| col6| 7| 9| 1|
| col7| 7| 9| 1|
| col8| 8| 9| 1|
| col9| 1| 9| 1|
| col10| 5| 9| 1|
+--------+-----+-----+---+
Using a plain ArrayType, which keeps each value's position but drops its column name:
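A minimal sketch of this variant (assuming we start again from the original 11-column df, with the same psf import as above):

df = df\
    .withColumn("id", psf.monotonically_increasing_id())\
    .select(
        "id",
        psf.posexplode(psf.array([psf.col(c) for c in df.columns if c != "col11"]))
            .alias("pos", "value"),
        "col11")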
Once you have an exploded list, you can look for the minimum value of |col11 - value|:
from pyspark.sql import Window
w = Window.partitionBy("id").orderBy(psf.abs(psf.col("col11") - psf.col("value")))
res = df.withColumn("rn", psf.row_number().over(w)).filter("rn = 1")
res.sort("id").show()
+--------+-----+-----+----------+---+
|col_name|value|col11| id| rn|
+--------+-----+-----+----------+---+
| col5| 7| 7| 0| 1|
| col2| 8| 9| 1| 1|
| col1| 8| 8| 2| 1|
| col2| 3| 4| 3| 1|
| col1| 7| 6| 4| 1|
| col5| 9| 9| 5| 1|
| col2| 6| 6| 6| 1|
| col10| 5| 5| 7| 1|
| col3| 5| 4| 8| 1|
| col6| 7| 7| 9| 1|
| col2| 9| 9|8589934592| 1|
| col3| 6| 6|8589934593| 1|
| col3| 8| 8|8589934594| 1|
| col2| 8| 9|8589934595| 1|
| col2| 0| 1|8589934596| 1|
| col2| 0| 2|8589934597| 1|
| col9| 3| 3|8589934598| 1|
| col7| 8| 8|8589934599| 1|
| col4| 7| 7|8589934600| 1|
| col4| 7| 5|8589934601| 1|
+--------+-----+-----+----------+---+
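Note that the id values are not contiguous: monotonically_increasing_id() packs the partition ID into the upper bits and the record number into the lower 33 bits, which is why the ids jump from 9 to 8589934592 (2^33, i.e. the first record of partition 1). They are still unique per row, which is all the window needs.

If you would rather avoid a window function, here is a sketch of an equivalent aggregation (res2 is just an illustrative name): Spark compares structs field by field, so taking the min of a struct whose first field is the absolute difference returns the closest column per id:

res2 = df.groupBy("id").agg(
    psf.min(
        psf.struct(
            psf.abs(psf.col("col11") - psf.col("value")).alias("dist"),
            psf.col("col_name"),
            psf.col("value"))
    ).alias("closest"))\
    .select("id", "closest.col_name", "closest.value")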
I'm trying to write a simple program which uses the GPIO pins of the Raspberry Pi 3B.
When I run the following program, only the LED on pin 17 flashes.
Pin 27 doesn't blink and doesn't get set to OUT.
I don't get any error messages.
#include <wiringPi.h>
#include <iostream>
#include <cstdlib>

int main() {
    if (wiringPiSetupSys() == -1) {
        std::cout << "wiringPiSetupSys failed\n";
        exit(1);
    }
    auto pin = 17;
    auto pin1 = 27;
    pinMode(pin, OUTPUT);
    pinMode(pin1, OUTPUT);
    for (auto i = 0; i < 8; ++i)
    {
        digitalWrite(pin, HIGH);
        digitalWrite(pin1, HIGH);
        delay(500);
        std::cout << "high\n";
        digitalWrite(pin, LOW);
        digitalWrite(pin1, LOW);
        delay(500);
        std::cout << "low\n";
    }
}
However, I can make the LED on pin 27 blink by executing the following commands in the terminal:
gpio -g mode 27 out
gpio -g write 27 0
gpio -g write 27 1
gpio -g write 27 0
Consequently, the LED is properly connected and not broken.
A little additional information:
pi@raspberrypi:~ $ gpio -g readall
+-----+-----+---------+------+---+---Pi 3B--+---+------+---------+-----+-----+
| BCM | wPi | Name | Mode | V | Physical | V | Mode | Name | wPi | BCM |
+-----+-----+---------+------+---+----++----+---+------+---------+-----+-----+
| | | 3.3v | | | 1 || 2 | | | 5v | | |
| 2 | 8 | SDA.1 | IN | 1 | 3 || 4 | | | 5v | | |
| 3 | 9 | SCL.1 | IN | 1 | 5 || 6 | | | 0v | | |
| 4 | 7 | GPIO. 7 | IN | 1 | 7 || 8 | 0 | IN | TxD | 15 | 14 |
| | | 0v | | | 9 || 10 | 1 | IN | RxD | 16 | 15 |
| 17 | 0 | GPIO. 0 | OUT | 0 | 11 || 12 | 0 | IN | GPIO. 1 | 1 | 18 |
| 27 | 2 | GPIO. 2 | OUT | 1 | 13 || 14 | | | 0v | | |
| 22 | 3 | GPIO. 3 | IN | 0 | 15 || 16 | 0 | IN | GPIO. 4 | 4 | 23 |
| | | 3.3v | | | 17 || 18 | 0 | OUT | GPIO. 5 | 5 | 24 |
| 10 | 12 | MOSI | IN | 0 | 19 || 20 | | | 0v | | |
| 9 | 13 | MISO | IN | 0 | 21 || 22 | 1 | OUT | GPIO. 6 | 6 | 25 |
| 11 | 14 | SCLK | IN | 0 | 23 || 24 | 1 | IN | CE0 | 10 | 8 |
| | | 0v | | | 25 || 26 | 1 | IN | CE1 | 11 | 7 |
| 0 | 30 | SDA.0 | IN | 1 | 27 || 28 | 1 | IN | SCL.0 | 31 | 1 |
| 5 | 21 | GPIO.21 | IN | 1 | 29 || 30 | | | 0v | | |
| 6 | 22 | GPIO.22 | IN | 1 | 31 || 32 | 0 | IN | GPIO.26 | 26 | 12 |
| 13 | 23 | GPIO.23 | IN | 0 | 33 || 34 | | | 0v | | |
| 19 | 24 | GPIO.24 | IN | 0 | 35 || 36 | 0 | IN | GPIO.27 | 27 | 16 |
| 26 | 25 | GPIO.25 | IN | 0 | 37 || 38 | 0 | IN | GPIO.28 | 28 | 20 |
| | | 0v | | | 39 || 40 | 0 | IN | GPIO.29 | 29 | 21 |
+-----+-----+---------+------+---+----++----+---+------+---------+-----+-----+
| BCM | wPi | Name | Mode | V | Physical | V | Mode | Name | wPi | BCM |
+-----+-----+---------+------+---+---Pi 3B--+---+------+---------+-----+-----+
I still don't know why it doesn't work in the example, but I found another solution:
Instead of wiringPiSetupSys(), I now use wiringPiSetupPhys() with the corresponding physical pin numbers (13 and 15).
If someone knows a solution with wiringPiSetupSys(), I would still be glad about an answer...
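One likely culprit (an educated guess, not verified on this board): wiringPiSetupSys() drives the pins through the sysfs interface (/sys/class/gpio), which can only see pins that were exported beforehand, and pin 17 may simply have been exported by an earlier experiment. Exporting both pins before starting the program should make the wiringPiSetupSys() version work (the gpio export command always uses BCM numbering):

gpio export 17 out
gpio export 27 out

Alternatively, wiringPiSetupGpio() uses BCM numbering with direct register access and needs no prior export, though it typically requires root.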
I have a table that I want to use as headers for another table that contains only data. I used Append as New in Power BI, with the headers table as primary and the data table as secondary. All the columns from the primary table end up with null values, and the data table is appended next to the header columns.
Eg:
Table 1 ( Headers)
+-----+-----+-----+-----+
| ABC | DEF | IGH | KLM |
+-----+-----+-----+-----+
Table 2 ( Data )
+----+----+----+----+
| 1 | 2 | 3 | 4 |
| 6 | 7 | 8 | 9 |
| 11 | 12 | 13 | 14 |
| 16 | 17 | 18 | 19 |
| 21 | 22 | 23 | 24 |
| 26 | 27 | 28 | 29 |
| 31 | 32 | 33 | 34 |
+----+----+----+----+
Table I am getting after append:
+------+------+------+------+------+------+------+------+
| ABC | DEF | IGH | KLM | null | null | null | null |
+------+------+------+------+------+------+------+------+
| null | null | null | null | 1 | 2 | 3 | 4 |
| null | null | null | null | 6 | 7 | 8 | 9 |
| null | null | null | null | 11 | 12 | 13 | 14 |
| null | null | null | null | 16 | 17 | 18 | 19 |
| null | null | null | null | 21 | 22 | 23 | 24 |
| null | null | null | null | 26 | 27 | 28 | 29 |
| null | null | null | null | 31 | 32 | 33 | 34 |
+------+------+------+------+------+------+------+------+
Table I need:
+-----+-----+-----+-----+
| ABC | DEF | IGH | KLM |
+-----+-----+-----+-----+
| 1 | 2 | 3 | 4 |
| 6 | 7 | 8 | 9 |
| 11 | 12 | 13 | 14 |
| 16 | 17 | 18 | 19 |
| 21 | 22 | 23 | 24 |
| 26 | 27 | 28 | 29 |
| 31 | 32 | 33 | 34 |
+-----+-----+-----+-----+
I used Append as New in Power BI, with the headers table (Table 1) as primary, and appended Table 2 to it.
The formula bar shows:
= Table.Combine({Table 1, Table 2})
And the Advanced Editor shows:
let
Source = Table.Combine({Sheet1, InterviewQn})
in
Source
Expected result:
+-----+-----+-----+-----+
| ABC | DEF | IGH | KLM |
+-----+-----+-----+-----+
| 1 | 2 | 3 | 4 |
| 6 | 7 | 8 | 9 |
| 11 | 12 | 13 | 14 |
| 16 | 17 | 18 | 19 |
| 21 | 22 | 23 | 24 |
| 26 | 27 | 28 | 29 |
| 31 | 32 | 33 | 34 |
+-----+-----+-----+-----+
OR
+-----+-----+-----+-----+
| ABC | DEF | IGH | KLM |
| 1 | 2 | 3 | 4 |
| 6 | 7 | 8 | 9 |
| 11 | 12 | 13 | 14 |
| 16 | 17 | 18 | 19 |
| 21 | 22 | 23 | 24 |
| 26 | 27 | 28 | 29 |
| 31 | 32 | 33 | 34 |
+-----+-----+-----+-----+
Append (Table.Combine) matches columns by name, and your two tables share no column names, which is why each side is padded with nulls. If you're only trying to rename the columns of Table 2, using the column names of Table 1, then it's simply:
= Table.RenameColumns(#"Table 2", List.Zip({Table.ColumnNames(#"Table 2"), Table.ColumnNames(#"Table 1")}))
See https://pwrbi.com/so_55529969/ for a worked-example PBIX file.
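For reference, the whole query in the Advanced Editor could then look something like this (a sketch reusing the query names Sheet1 and InterviewQn from your own script, where Sheet1 holds the headers and InterviewQn the data):

let
    Source = Table.RenameColumns(
        InterviewQn,
        List.Zip({Table.ColumnNames(InterviewQn), Table.ColumnNames(Sheet1)})
    )
in
    Source

List.Zip pairs each existing column name with the corresponding header name, producing exactly the {old, new} pairs that Table.RenameColumns expects.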
Sample data:
| Vendor | Size Group | Model | Quantity | Cost | TAT | Posting Date |
|--------|------------|-------|----------|-------|-----|-------------------|
| A | S | A150 | 150 | 450 | 67 | July 7, 2018 |
| A | M | A200 | 250 | 1500 | 75 | June 22, 2018 |
| A | M | A150 | 25 | 8500 | 85 | July 9, 2018 |
| C | L | A200 | 350 | 1250 | 125 | March 5, 2018 |
| C | XL | A500 | 150 | 6500 | 45 | February 20, 2018 |
| A | M | A900 | 385 | 475 | 40 | January 29, 2018 |
| A | M | A150 | 650 | 45 | 45 | August 31, 2018 |
| D | M | A150 | 65 | 7500 | 15 | April 10, 2018 |
| D | M | A300 | 140 | 3420 | 10 | April 3, 2018 |
| E | S | A150 | 20 | 10525 | 85 | January 3, 2018 |
| B | S | A150 | 30 | 10500 | 40 | June 3, 2018 |
| B | S | A150 | 450 | 450 | 64 | April 3, 2018 |
| E | XS | A900 | 45 | 75 | 60 | January 3, 2018 |
| F | M | A900 | 95 | 655 | 175 | January 3, 2018 |
| D | XL | A300 | 15 | 21500 | 25 | January 3, 2018 |
| D | S | A500 | 450 | 65 | 25 | May 3, 2018 |
| A | M | A350 | 250 | 450 | 22 | January 3, 2018 |
| B | S | A150 | 45 | 8500 | 28 | January 3, 2018 |
| A | S | A300 | 550 | 650 | 128 | January 3, 2018 |
| C | M | A150 | 1500 | 855 | 190 | January 3, 2018 |
| B | M | A150 | 65 | 1750 | 41 | January 3, 2018 |
| A | L | A500 | 75 | 1700 | 24 | January 3, 2018 |
| B | S | A900 | 55 | 9800 | 37 | May 29, 2018 |
| B | M | A500 | 150 | 850 | 83 | April 18, 2018 |
In the provided sample, the Size Groups that vendors A and B share are S and M. So I was hoping to display those shared Size Groups as the legend and average Cost as the value in a clustered column chart.
Can anyone please advise how I can go about this?
Thank you!!!
I'm trying to sort by ID and then by Date.
What I have:
| ID  | Date       |
|-----|------------|
| 112 | 2013-01-01 |
| 112 | 2013-01-15 |
| 113 | 2012-01-01 |
| 112 | 2014-02-13 |
| 112 | 2013-01-02 |
| 113 | 2011-01-11 |
What I need:
| ID  | Date       |
|-----|------------|
| 112 | 2013-01-01 |
| 112 | 2013-01-02 |
| 112 | 2013-01-15 |
| 112 | 2014-02-13 |
| 113 | 2011-01-11 |
| 113 | 2012-01-01 |
My problem is that I only know how to sort by ID or Date.
More generally:
clear
input id foo
1 56
1 34
2 13
1 67
1 22
2 89
2 61
2 76
end
sort id (foo)
list, sepby(id)
+----------+
| id foo |
|----------|
1. | 1 22 |
2. | 1 34 |
3. | 1 56 |
4. | 1 67 |
|----------|
5. | 2 13 |
6. | 2 61 |
7. | 2 76 |
8. | 2 89 |
+----------+
In a more advanced programming context you can use the same syntax with bysort.
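For example, a minimal sketch (the new variable name order is just for illustration):

bysort id (foo): gen order = _n

This sorts by id and foo in one step and numbers the observations within each id group.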
Now I am training my own classifier using opencv_traincascade. But when I run the command 'opencv_traincascade -data facedet -vec vecfile.vec -bg negative.txt -npos 2650 -nneg 581 -nstages 20 -w 20 -h 20', it shows an error like this:
PARAMETERS:
cascadeDirName: facedet
vecFileName: vecfile.vec
bgFileName: negative.txt
numPos: 2000
numNeg: 1000
numStages: 20
precalcValBufSize[Mb] : 256
precalcIdxBufSize[Mb] : 256
stageType: BOOST
featureType: HAAR
sampleWidth: 20
sampleHeight: 20
boostType: GAB
minHitRate: 0.995
maxFalseAlarmRate: 0.5
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
mode: BASIC
===== TRAINING 0-stage =====
<BEGIN
POS count : consumed 2000 : 2000
NEG count : acceptanceRatio 1000 : 1
Precalculation time: 3
+----+---------+---------+
| N | HR | FA |
+----+---------+---------+
| 1| 1| 1|
+----+---------+---------+
| 2| 1| 1|
+----+---------+---------+
| 3| 1| 1|
+----+---------+---------+
| 4| 1| 1|
+----+---------+---------+
| 5| 1| 1|
+----+---------+---------+
| 6| 0.9955| 0.391|
+----+---------+---------+
END>
Parameters can not be written, because file facedet/params.xml can not be opened.
What is this error? I don't understand it. Can anyone help me solve it?
Positive samples:
/home/arya/myown/Positive/images18413.jpeg 1 1 1 113 33
/home/arya/myown/Positive/images1392.jpeg 1 113 33 107 133
/home/arya/myown/Positive/face841.jpeg 1 185 93 35 73
/home/arya/myown/Positive/images866.jpeg 2 121 26 64 68 121 26 88 123
/home/arya/myown/Positive/images83.jpeg 1 102 13 107 136
/home/arya/myown/Positive/images355.jpeg 2 92 16 224 25 92 16 117 130
/home/arya/myown/Positive/images888.jpeg 1 108 29 116 71
/home/arya/myown/Positive/images2535.jpeg 1 108 29 111 129
/home/arya/myown/Positive/images18221.jpeg 1 110 34 109 124
/home/arya/myown/Positive/images1127.jpeg 1 110 34 92 104
/home/arya/myown/Positive/images18357.jpeg 1 103 27 142 133
/home/arya/myown/Positive/images889.jpeg 1 86 25 134 124
Negative samples:
./Negative/face150.jpeg
./Negative/face1051.jpeg
./Negative/Pictures174.jpeg
./Negative/Pictures160.jpeg
./Negative/Pictures34.jpeg
./Negative/face130.jpeg
./Negative/face1.jpeg
./Negative/Pictures319.jpeg
./Negative/face1120.jpeg
./Negative/Pictures317.jpeg
./Negative/face1077.jpeg
./Negative/Pictures93.jpeg
./Negative/Pictures145.jpeg
./Negative/face1094.jpeg
./Negative/Pictures7.jpeg
Please be sure that you have already created the folder "facedet" before training your classifier, as opencv_traincascade does not create it by itself.
It needs this folder to create the params.xml file inside it.
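For example, from the directory where you run the training command:

mkdir facedet

opencv_traincascade then writes params.xml and the intermediate stage files into that folder as training proceeds.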