pandas keep rows based on column values for repeated values - python-2.7

I have a pandas DataFrame and a list of values. I want to keep all the rows of my original DataFrame whose value in a certain column appears in my list. However, the list I am selecting rows from contains repeated values. Each time the same value occurs again in the list, I want the rows with that column value to be added to my new DataFrame again.
Let's say my frame's name is with_prot_choice_df and my list is with_prot_choices.
If I issue the following command:
with_prot_choice_df = with_df[with_df[0].isin(with_prot_choices)]
then each matching row is kept only once (as if the list contained only unique values).
I don't want to do this with for loops, since I will repeat the process many times and it would be extremely time-consuming.
Any advice will be appreciated. Thanks.
I'm adding an example here:
let's say my data frame is:
col1 col2
a 1
a 6
b 2
c 3
d 4
and my list is:
lst = ['a', 'b', 'a', 'a']
I want my new data frame, new_df to be:
new_df
col1 col2
a 1
a 6
b 2
a 1
a 6
a 1
a 6

Seems like you need reindex
df.set_index('col1').reindex(lst).reset_index()
Out[224]:
col1 col2
0 a 1
1 b 2
2 a 1
3 a 1
Updated, since reindex cannot handle the duplicate values now in col1:
df.merge(pd.DataFrame({'col1': lst}).reset_index()).sort_values('index').drop(columns='index')
Out[236]:
col1 col2
0 a 1
3 a 6
6 b 2
1 a 1
4 a 6
2 a 1
5 a 6
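The merge approach above can be sketched end to end as follows. This is a minimal sketch using the example data from the question; the helper column name pos is made up for illustration, and a stable sort is used so that rows sharing a list position keep their original relative order.

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({"col1": ["a", "a", "b", "c", "d"],
                   "col2": [1, 6, 2, 3, 4]})
lst = ["a", "b", "a", "a"]

# One row per list entry; "pos" (a hypothetical helper column) records list order
keys = pd.DataFrame({"col1": lst, "pos": range(len(lst))})

new_df = (
    keys.merge(df, on="col1", how="inner")   # repeats df rows once per list entry
        .sort_values("pos", kind="mergesort")  # stable sort restores list order
        .drop(columns="pos")
        .reset_index(drop=True)
)
print(new_df)
```

Values in the list that do not occur in col1 are simply dropped by the inner join, which matches the behaviour of isin.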

Related

Generate new variable in one dataset using observation from another dataset in SAS

I have two datasets, one with one observation and two variables. Other dataset with 10 observations, four variables.
Dataset 1
Final Result
X Fail
Dataset 2
A B C D Output
1 1 2 Pass
2 1 2 Pass
3 1 2 Pass
4 1 2 Fail
5 1 2 Pass
6 1 2 Fail
7 1 2 Pass
8 1 2 Fail
9 1 2 Pass
10 1 2 Pass
I would like to generate a fifth variable (Output) in the second dataset depending on the value of the second variable (Result) in the first dataset.
If Result in the first dataset equals Fail, Output in the second dataset should be Fail. If Result in the first dataset equals Pass, Output should equal the value in column D of the second dataset.
Just use some simple IF/THEN logic. Since you know DATASET1 has only one observation, read just that one observation from it.
data want;
if _n_=1 then set dataset1 ;
set dataset2 ;
length OUTPUT $4 ;
/* compare case-insensitively: the sample data stores 'Fail' */
if upcase(RESULT)='FAIL' then OUTPUT=RESULT;
else OUTPUT=D ;
run;

Merging Tables Correctly in SAS

Hi, I am trying to merge two tables: the FormA scores table I made, which is now called CalculatingScores, and DomainsFormA, which contains the domain number. I need to merge them by QuestionNum. Here is my code:
proc sql;
create table combined as
select *
from CalculatingScores inner join DomainsFormA
on CalculatingScores.Scores=DomainsFormA.QuestionNum;
quit;
proc print data=combined (obs=15);
run;
This table is what I am trying to get my merged tables to look like, but for 15 observations:
Form Student QuestionNum Scores DomainNum
A 1 1 0 5
A 1 2 1 4
A 1 3 0 5
But my table looks more like this:
Form Student QuestionNum Scores DomainNum
A 1 2 1 5
A 1 4 1 5
A 1 5 1 5
My entire Scores column for these 15 observations has a value of 1. Also, my DomainNum column only has values of 5. My Student and Form columns are correct, but I need to have varied scores and varied domain numbers. Any ideas for how to solve my problem? Maybe I need an ORDER BY clause?
You appear to be joining on the wrong columns. You coded
on CalculatingScores.Scores=DomainsFormA.QuestionNum
which joins a score to a question number. Perhaps you should be coding
on CalculatingScores.QuestionNum=DomainsFormA.QuestionNum
                     ^^^^^^^^^^^              ^^^^^^^^^^^

Sorting between groups based on a variable other than the one grouped on

I would like to use Pandas groupby to sort groups according to a value within each group. This value is not the one used for the grouping.
I am working with public transport data which tells me the stops and arrival times of different bus trips. Here is a sample of the dataframe (called stopTimes):
trip_id stop_sequence arrival_time
1 3 15:08:00
2 2 16:01:00
1 1 09:00:40
2 3 16:45:00
2 1 07:05:30
1 2 12:03:00
I would like to sort the trips according to the arrival time at the first stop. So the result of the sorting for the above dataframe would be:
trip_id stop_sequence arrival_time
2 1 07:05:30
2 2 16:01:00
2 3 16:45:00
1 1 09:00:40
1 2 12:03:00
1 3 15:08:00
I have been able to achieve this result already by:
timeSortedTrips = stopTimes.loc[stopTimes['stop_sequence']==1].sort_values('arrival_time')['trip_id']
stopTimes['trip_id'] = pd.Categorical(stopTimes['trip_id'],timeSortedTrips)
stopTimes = stopTimes.sort_values(['trip_id','arrival_time'])
However, I am curious: can I achieve this using groupby? If so, would it be more efficient? Additionally, I am new to Python, so if you have even better ideas to do this sorting please point me in that direction.
You can group by trip_id and, within each group, sort by arrival_time:
stopTimes['arrival_time'] = pd.to_datetime(stopTimes['arrival_time'])
# DataFrame.sort was removed from pandas; sort_values is its replacement
stopTimes = stopTimes.groupby("trip_id", as_index=False).apply(lambda x: x.sort_values("arrival_time"))
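Sorting within each group does not by itself order the trips by their first arrival. One possible way to get the ordering asked for is to broadcast each trip's earliest arrival time back onto its rows with groupby(...).transform and sort on that. This is only a sketch: it assumes the earliest arrival_time of a trip is the arrival at its first stop, and the helper column name first_arrival is made up for illustration.

```python
import pandas as pd

# Sample data from the question
stopTimes = pd.DataFrame({
    "trip_id": [1, 2, 1, 2, 2, 1],
    "stop_sequence": [3, 2, 1, 3, 1, 2],
    "arrival_time": ["15:08:00", "16:01:00", "09:00:40",
                     "16:45:00", "07:05:30", "12:03:00"],
})

# Parse HH:MM:SS strings as durations since midnight
stopTimes["arrival_time"] = pd.to_timedelta(stopTimes["arrival_time"])

# Earliest arrival of each trip, repeated on every row of that trip
first_arrival = stopTimes.groupby("trip_id")["arrival_time"].transform("min")

result = (
    stopTimes.assign(first_arrival=first_arrival)  # hypothetical helper column
             .sort_values(["first_arrival", "trip_id", "stop_sequence"])
             .drop(columns="first_arrival")
             .reset_index(drop=True)
)
print(result)
```

Including trip_id in the sort keys keeps the rows of two trips apart even if they happen to share the same first arrival time.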

How to replace null values from left join table in pyspark

I have two tables. Table 1 has 5 million rows and table 2 has 3 million. When I do table1.join(table2, ..., 'left_outer'), the columns from table 2 have null values in the new table for the rows without a match. It looks like the following (var3 and var4 from table 2 are arrays of strings of varying length):
t1.id var1 var2 table2.id table2.var3 table2.var4
1 1.3 4 1 ['a','b','d'] ['x','y','z']
2 3.0 5 2 ['a','c','m','n'] ['x','z']
3 2.3 5
I plan to use CountVectorizer after the join, which can't handle null values. So I want to replace the null values with empty arrays of string type.
It's a similar issue to the one discussed in "PySpark replace Null with Array".
But I have over 10 variables from table 2, and each has a different dimension.
Any suggestions as to what I can do? Can I run CountVectorizer before the join?
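DataFrames have an .na.fill() method, which returns a new DataFrame. Note that it only accepts int, float, bool, or string replacement values, so it cannot supply an empty array for array columns such as var3 and var4.
replace_cols = {col: '' for col in df.columns}
df = df.na.fill(replace_cols)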

Generating a List in M Query

I am trying to generate a list using M Query, but instead of generating a list starting from a specific number, I want the number to start at 1 and then add 1 for every row in another table.
So for example, if I have:
Tbl1
Col1
A
B
C
D
D
I want to generate
Tbl2
Col1
1
2
3
4
5
If I understand you correctly, here it is:
= {1..Table.RowCount(Tbl1)}