How to replace null values from a left join table in PySpark

I have two tables. Table 1 has 5 million rows and table 2 has 3 million. When I do table1.join(table2, ..., 'left_outer'), the columns from table 2 come back null in the new table wherever table 1 has no match. It looks like the following (var3 and var4 from table 2 are arrays of strings of varied length):
t1.id  var1  var2  table2.id  table2.var3        table2.var4
1      1.3   4     1          ['a','b','d']      ['x','y','z']
2      3.0   5     2          ['a','c','m','n']  ['x','z']
3      2.3   5     null       null               null
I plan to use CountVectorizer after the join, which can't handle null values. So I want to replace the nulls with empty arrays of string type.
It's a similar issue to the one discussed in PySpark replace Null with Array.
But I have over 10 variables from table 2, and each has a different dimension.
Any suggestions as to what I can do? Can I run CountVectorizer before the join?

DataFrames have a .na.fill() method:
replace_cols = {col: '' for col in df.columns}
df.na.fill(replace_cols)
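Note that na.fill() only applies to numeric, string, and boolean columns, so it will leave the array-typed columns from table 2 untouched. A minimal sketch for those, assuming the joined frame is called joined and the array columns are var3 and var4 (both names are assumptions), replaces each null with an empty string array via coalesce:

from pyspark.sql import functions as F

array_cols = ['var3', 'var4']  # hypothetical list of array-typed columns from table 2
for c in array_cols:
    # coalesce keeps the existing array and substitutes an empty
    # array<string> wherever the left join produced a null
    joined = joined.withColumn(c, F.coalesce(F.col(c), F.array().cast('array<string>')))

This scales to the 10+ table 2 variables, since the same empty array works regardless of each column's dimension.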

Related

Use DAX to get data between 2 tables

I have table 'tblA' with only one column, named 'Value':
Value
1
2
The second table, 'tblB', has several columns:
Col1  Col2
Test  A
Dump  B
How can I join them so that I get a new table with a result like this (each value in tblA fills in against all rows in tblB):
Col1  Col2  Value
Test  A     1
Dump  B     1
Test  A     2
Dump  B     2
I also tried using a for loop to get the values in tblA one by one, but it seems that DAX doesn't support loops.
Please advise.
Use an expression for a calculated table:
tblC = CROSSJOIN ( tblA, tblB )
CROSSJOIN pairs every row of tblA with every row of tblB, which gives exactly the "each value fills all rows" result asked for.

Merging Tables Correctly in SAS

Hi, I am trying to merge two tables: the FormA scores table that I made, now called CalculatingScores, with the domain numbers found in DomainsFormA. I need to merge them by QuestionNum. Here is my code.
proc sql;
    create table combined as
    select *
    from CalculatingScores inner join DomainsFormA
        on CalculatingScores.Scores = DomainsFormA.QuestionNum;
quit;
proc print data=combined (obs=15);
run;
This table is what I am trying to get my merged table to look like, but for 15 observations:
Form  Student  QuestionNum  Scores  DomainNum
A     1        1            0       5
A     1        2            1       4
A     1        3            0       5
But my table looks more like this:
Form  Student  QuestionNum  Scores  DomainNum
A     1        2            1       5
A     1        4            1       5
A     1        5            1       5
My entire Scores column for these 15 observations has a value of 1, and my DomainNum column only has values of 5. My Student and Form columns are correct, but I need varied scores and varied domain numbers. Any ideas for how to solve my problem? Maybe I need an ORDER BY statement?
You appear to be joining on the incorrect columns. You coded
on CalculatingScores.Scores=DomainsFormA.QuestionNum
which joins a score to a question number. Perhaps you should be coding
on CalculatingScores.QuestionNum=DomainsFormA.QuestionNum
                     ^^^^^^^^^^^              ^^^^^^^^^^^

Concatenating row values in Athena Aws

I have two columns, let's say id and values. I want to concatenate the values grouped by the id column.
For example, I have:
ID  Values
1   a
1   b
2   a
2   b
I need the output to be:
ID  Values
1   a,b
2   a,b
You can use an array_agg followed by an array_join:
select id, array_join(array_agg("values"), ',') from "table" group by 1
array_agg will give you an array of all the values with the same id, and array_join will concatenate them into a string (values and table are reserved words in Athena, hence the double quotes). See the docs.

pandas keep rows based on column values for repeated values

I have a pandas DataFrame and a list of values. I want to keep all the rows from my original DF whose value in a certain column belongs to my list of values. However, the list I want to choose my rows from has repeated values, and each time I encounter the same value again I want to add the rows with that column value to my new data frame again.
Let's say my frame's name is with_prot_choice_df and my list is with_prot_choices.
If I issue the following command:
with_prot_choice_df = with_df[with_df[0].isin(with_prot_choices)]
then this will only keep the rows once (as if only for unique values in the list).
I don't want to do this with for loops, since I will repeat the process many times and it would be extremely time consuming.
Any advice will be appreciated. Thanks.
I'm adding an example here:
Let's say my data frame is:
col1  col2
a     1
a     6
b     2
c     3
d     4
and my list is:
lst = ['a', 'b', 'a', 'a']
I want my new data frame, new_df, to be:
col1  col2
a     1
a     6
b     2
a     1
a     6
a     1
a     6
Seems like you need reindex:
df.set_index('col1').reindex(lst).reset_index()
Out[224]:
col1 col2
0 a 1
1 b 2
2 a 1
3 a 1
Updated: reindex only works when the index labels are unique (with the duplicate 'a' rows above it raises a ValueError), so use merge instead:
df.merge(pd.DataFrame({'col1': lst}).reset_index()).sort_values('index').drop(columns='index')
Out[236]:
col1 col2
0 a 1
3 a 6
6 b 2
1 a 1
4 a 6
2 a 1
5 a 6
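Spelled out with comments, a minimal sketch of the merge trick above, assuming the example frame and list from the question:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'c', 'd'],
                   'col2': [1, 6, 2, 3, 4]})
lst = ['a', 'b', 'a', 'a']

# Build a one-column frame from the list; reset_index() saves each
# occurrence's position in the list as an 'index' column.
choices = pd.DataFrame({'col1': lst}).reset_index()

# The inner merge repeats the matching df rows once per list occurrence;
# sorting by the saved position restores the list order before the
# helper column is dropped.
new_df = (df.merge(choices)
            .sort_values('index')
            .drop(columns='index')
            .reset_index(drop=True))

Because merge is vectorized, this avoids the per-element for loop the question wants to rule out.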

Hierarchical index in data frame missing columns

I'm trying to learn pandas by doing different exercises. I created a DataFrame that looks like the example below. I'm trying to create a unique id by concatenating the fields; however, when I get the data frame columns I only have fpd as a column. Could someone explain why I don't see all the columns?
monthID  pollutantID  processID  roadTypeID  avgSpeedBinID  fpd
1        1            1          4           1              1.749101
                                             2              0.935300
                                             3              0.529701
                                             4              0.393052
                                             5              0.306381
                                             6              0.261649
                                             7              0.235040
I get the data frame by executing this:
fpd = data['fpd'].groupby([data['monthID'], data['pollutantID'],
                           data['processID'], data['roadTypeID'],
                           data['avgSpeedBinID']]).sum()
fp = pd.DataFrame(fpd)
The groupby puts the five key fields into a MultiIndex rather than into columns, so fpd is the only actual column. You could reset the MultiIndex back to columns with:
fp.reset_index(inplace=True)
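Once the keys are ordinary columns again, the unique id the question asks about can be built by concatenating them. A minimal sketch, where the column name uid and the underscore separator are assumptions:

fp.reset_index(inplace=True)

# Hypothetical 'uid' column: cast each key to string and join with '_'
fp['uid'] = fp[['monthID', 'pollutantID', 'processID',
                'roadTypeID', 'avgSpeedBinID']].astype(str).agg('_'.join, axis=1)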