How to replace null values from left join table in pyspark

I have two tables. Table 1 has 5 million rows and table 2 has 3 million. When I do table1.join(table2, ..., 'left_outer'), the columns from table 2 are null in the new table for every row of table 1 that has no match. It looks like the following (var3 and var4 from table 2 are arrays of strings of varying length):
t1.id var1 var2 table2.id table2.var3 table2.var4
1 1.3 4 1 ['a','b','d'] ['x','y','z']
2 3.0 5 2 ['a','c','m','n'] ['x','z']
3 2.3 5
I plan to use CountVectorizer after the join, and it can't handle null values. So I want to replace the null values with empty arrays of string type.
It's a similar issue to the one discussed in "PySpark replace Null with Array".
But I have over 10 variables from table 2, and each has a different dimension.
Any suggestions on what I can do? Can I run CountVectorizer before the join?

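DataFrames have a .na.fill() method, but it only fills numeric, string, and boolean columns, so array columns stay null. For array columns, coalesce each one with an empty string array instead:
from pyspark.sql import functions as F
array_cols = ['var3', 'var4']  # the array columns that came from table 2
for c in array_cols:
    df = df.withColumn(c, F.coalesce(F.col(c), F.array().cast('array<string>')))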

Related

Use DAX to get data between 2 tables

I have a table 'tblA' with only one column, named 'Value':
Value
1
2
The second table, 'tblB', has several columns:
Col1 Col2
Test A
Dump B
How can I join them so that I get a new table like this, where each value in tblA is paired with every row in tblB?
Col1 Col2 Value
Test A 1
Dump B 1
Test A 2
Dump B 2
I also tried to use a for loop to take the values in tblA one by one, but it seems that DAX doesn't support loops.
Please advise.

Use a CROSSJOIN expression for a calculated table:
tblC = CROSSJOIN ( tblA, tblB )

Merging Tables Correctly in SAS

Hi, I am trying to merge two tables: the FormA scores table I built (now named CalculatingScores) and DomainsFormA, which holds the domain number. I need to merge them by QuestionNum. Here is my code:
proc sql;
create table combined as
select *
from CalculatingScores inner join DomainsFormA
on CalculatingScores.Scores=DomainsFormA.QuestionNum;
quit;
proc print data=combined (obs=15);
run;
This table is what I am trying to get my merged tables to look like, but for 15 observations.
Form Student QuestionNum Scores DomainNum
A 1 1 0 5
A 1 2 1 4
A 1 3 0 5
But my table looks more like this:
Form Student QuestionNum Scores DomainNum
A 1 2 1 5
A 1 4 1 5
A 1 5 1 5
The entire Scores column for these 15 observations has a value of 1, and the DomainNum column only has values of 5. The Student and Form columns are correct, but I need varied scores and varied domain numbers. Any ideas for how to solve my problem? Maybe I need an ORDER BY clause?

You appear to be joining on the incorrect columns.
You coded
on CalculatingScores.Scores=DomainsFormA.QuestionNum
which joins a score to a question number.
Perhaps you should be coding
on CalculatingScores.QuestionNum=DomainsFormA.QuestionNum
                     ^^^^^^^^^^^              ^^^^^^^^^^^

Concatenating row values in Athena Aws

I have two columns, say id and values. I want to concatenate the values grouped by the id column.
For example, I have
ID Values
1 a
1 b
2 a
2 b
I need the output as
ID Values
1 a,b
2 a,b

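You can use array_agg followed by array_join:
select id, array_join(array_agg("values"), ',') as "values" from my_table group by 1
array_agg gives you an array of all values with the same id, and array_join concatenates them into a string. VALUES and TABLE are reserved words in Athena, so the column name is double-quoted here and my_table stands in for your actual table name. See the docs.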

pandas keep rows based on column values for repeated values

I have a pandas data frame and a list of values. I want to keep all the rows from my original DF whose value in a certain column belongs to my list. However, the list I am selecting rows from has repeated values. Each time the same value appears again in the list, I want to add the rows with that column value to my new data frame again.
Let's say my frame's name is with_prot_choice_df and my list is with_prot_choices.
If I issue the following command:
with_prot_choice_df = with_df[with_df[0].isin(with_prot_choices)]
then this keeps each matching row only once (as if the list contained only unique values).
I don't want to do this with for loops, since I will repeat the process many times and it would be extremely time-consuming.
Any advice will be appreciated. Thanks.
I'm adding an example here.
Let's say my data frame is:
col1 col2
a 1
a 6
b 2
c 3
d 4
and my list is:
lst = ['a', 'b', 'a', 'a']
I want my new data frame, new_df to be:
new_df
col1 col2
a 1
a 6
b 2
a 1
a 6
a 1
a 6

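Seems like you need reindex (this works only when the values in col1 are unique; reindex raises an error on duplicate index labels):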
df.set_index('col1').reindex(lst).reset_index()
Out[224]:
col1 col2
0 a 1
1 b 2
2 a 1
3 a 1
Updated (handles repeated values in col1 by merging against the list and restoring the list order):
df.merge(pd.DataFrame({'col1':lst}).reset_index()).sort_values('index').drop(columns='index')
Out[236]:
col1 col2
0 a 1
3 a 6
6 b 2
1 a 1
4 a 6
2 a 1
5 a 6
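The merge approach above can be wrapped into a small helper. This is only a sketch: the function name repeat_rows_by_list and the temporary columns _pos and _row are made up for illustration, and the data is the example from the question. Sorting on both the list position and the original row position keeps the output order deterministic.

```python
import pandas as pd

def repeat_rows_by_list(df, key, values):
    """Return the rows of df matching each entry of values, in list order.

    Hypothetical helper: "_pos" and "_row" are temporary column names
    and must not already exist in df.
    """
    # One row per list entry, remembering its position in the list.
    order = pd.DataFrame({key: values, "_pos": range(len(values))})
    # Remember the original row order of df as a column.
    rows = df.reset_index(drop=True).rename_axis("_row").reset_index()
    # Inner merge repeats matching rows once per list entry.
    out = order.merge(rows, on=key, how="inner")
    out = out.sort_values(["_pos", "_row"])
    return out.drop(columns=["_pos", "_row"]).reset_index(drop=True)

df = pd.DataFrame({"col1": ["a", "a", "b", "c", "d"],
                   "col2": [1, 6, 2, 3, 4]})
lst = ["a", "b", "a", "a"]
new_df = repeat_rows_by_list(df, "col1", lst)
print(new_df)
```

Values in the list that do not occur in col1 are silently dropped because the merge is an inner join.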

Hierarchical index in data frame missing columns

I'm trying to learn pandas by doing different exercises. I created a dataframe that looks like the example below. I'm trying to create a unique id by concatenating the fields, but when I get the data frame's columns I only have fpd as a column. Could someone explain why I don't see all the columns?
monthID pollutantID processID roadTypeID avgSpeedBinID Fpd
1 1 1 4 1 1.749101
2 0.935300
3 0.529701
4 0.393052
5 0.306381
6 0.261649
7 0.235040
I get the data frame by executing this:
fpd = data['fpd'].groupby([data['monthID'],data['pollutantID'],
data['processID'],data['roadTypeID'],data['avgSpeedBinID']]).sum()
fp = pd.DataFrame(fpd)

The grouping keys became the levels of a MultiIndex rather than columns. You could reset the MultiIndex back to columns with:
fp.reset_index(inplace=True)
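As an alternative sketch, passing as_index=False to groupby keeps the grouping keys as ordinary columns from the start, after which the unique id can be built by joining them as strings. The sample values, the keys list, and the uid column name are assumptions made for illustration; only the column names come from the question.

```python
import pandas as pd

# Made-up sample rows using the column names from the question.
data = pd.DataFrame({
    "monthID": [1, 1, 1],
    "pollutantID": [1, 1, 1],
    "processID": [1, 1, 1],
    "roadTypeID": [4, 4, 4],
    "avgSpeedBinID": [1, 1, 2],
    "fpd": [1.0, 0.5, 0.25],
})

keys = ["monthID", "pollutantID", "processID", "roadTypeID", "avgSpeedBinID"]

# as_index=False keeps the keys as columns instead of a MultiIndex.
fp = data.groupby(keys, as_index=False)["fpd"].sum()

# Build a unique id by joining the key columns as strings.
fp["uid"] = fp[keys].astype(str).agg("_".join, axis=1)
print(fp.columns.tolist())
```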
