I have a SAS data set, which I have sorted according to my needs. I want to split it into BY groups and, for each group, output each observation until the first occurrence of a particular value in a particular column.
ID No C1 Year2 C3 Date (DD/MM/YYYY)
---------------------------------------------------------
AB123 4 B4 2008E OC 09/04/2008
AB123 3 B4 2008E EL 09/04/2008
AB123 2 B4 2008E ZZ 09/04/2008
AB123 1 B4 2008E OC 09/04/2008
AB123 0 B4 2008E ZZ 09/04/2008
AB123 1 B4 2008E OC 06/02/2008
AB123 0 B4 2008E ZZ 06/02/2008
This is one BY group: the data set is grouped by ID, C1, Year2 and sorted by ID, C1, Year2, Date(desc), No(desc). Further instances of each of ID, C1 and Year2 could occur anywhere in the data set, but the 3 variables define each BY group.
I want to output all observations per BY group up to and including the first occurrence of ZZ in C3. So above I would want the first 3 observations output (or flagged) and then move on to the next BY group.
Any help would be greatly appreciated. Please let me know if you need any more details of the problem. Thanks.
Here's one way that should work.
data have;
input ID $ No C1 $ Year2 $ C3 $ Date :DDMMYY10.;
format date DDMMYY10.;
cards;
AB123 4 B4 2008E OC 09/04/2008
AB123 3 B4 2008E EL 09/04/2008
AB123 2 B4 2008E ZZ 09/04/2008
AB123 1 B4 2008E OC 09/04/2008
AB123 0 B4 2008E ZZ 09/04/2008
AB123 1 B4 2008E OC 06/02/2008
AB123 0 B4 2008E ZZ 06/02/2008
;
run;
data want (drop=stopflag);
  set have;
  by id c1 year2;
  retain stopflag;
  /* reset the flag whenever a new BY group starts */
  if max(first.id,first.c1,first.year2)=1 then stopflag=0;
  /* output the first ZZ row of the group, then stop outputting */
  if c3='ZZ' and stopflag=0 then do;
    output;
    stopflag=1;
  end;
  /* output everything before the first ZZ */
  if stopflag=0 then output;
run;
I am trying to join several string variables (c1, c2 etc.):
AKJ OFE ETH AKJ AKJ
345 952 319 123 345
I can join them with the following command:
generate c = c1 + c2 + c3 + c4 + c5
How can I join only their unique entries?
AKJ OFE ETH
345 952 319 123
An alternative solution is the following:
clear
input str3(c1 c2 c3 c4 c5)
AKJ OFE ETH AKJ AKJ
345 952 319 123 345
end
local vars c2 c3 c4 c5
local dvars c1

* copy each variable, blanking out any value already seen in an earlier column
generate tempc1 = c1
foreach var of local vars {
    generate temp`var' = `var'
    foreach dvar of local dvars {
        replace temp`var' = "" if `var' == `dvar'
    }
    local dvars `dvars' `var'
}

* concatenate the de-duplicated copies and clean up
egen c = concat(temp*), punct(" ")
drop temp*
list
+-----------------------------------------------+
| c1 c2 c3 c4 c5 c |
|-----------------------------------------------|
1. | AKJ OFE ETH AKJ AKJ AKJ OFE ETH |
2. | 345 952 319 123 345 345 952 319 123 |
+-----------------------------------------------+
I have two groups of arrays:
a1 a2 a3 a4 a5 a6 a7 a8 <= name it as key1
b1 b2 b3 b4 b5 b6 b7 b8 <= val1
c1 c2 c3 c4 c5 c6 c7 c8
and
d1 d2 d3 d4 d5 d6 d7 d8 <= key2
e1 e2 e3 e4 e5 e6 e7 e8 <= val2
f1 f2 f3 f4 f5 f6 f7 f8
The arrays a1,...,an and d1,...,dn are sorted and may contain repeated values, i.e. their values might be something like 1 1 2 3 4 6 7 7 7 ... For each tuple (di, ei) I want to check whether it equals any tuple (ai, bi); if it does (di == ai and ei == bi), I have to combine fi and ci using some function, e.g. addition, and store the result in fi.
Firstly, is it possible to solve this efficiently using zip iterators and a transformation in the Thrust library?
Secondly, the simplest method I can imagine is to count the occurrences of each key value in ai, do a prefix sum, and use both to get the start and end index of each key's run; then, for each di, iterate over that range, check whether ei equals the corresponding value of bi, and perform the transformation.
i.e. If I have
1 1 2 3 5 6 7
2 3 4 5 2 4 6
2 4 5 6 7 8 5
as the first group of arrays, I count the occurrences of 1, 2, 3, 4, 5, 6, 7, ...:
2 1 1 0 1 1 1 <=name it as count
and then do prefix sum to get:
2 3 4 4 5 6 7 <= name it as cumsum
and use this to do:
for each element di:
    for i in (cumsum[di] - count[di]) to cumsum[di]:
        if ei == val1[i] then performAddition;
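To make the indexing concrete, here is that lookup written out in plain Python/NumPy (0-based indices; the second group of arrays d, e, f is made up just to have something to match against):
import numpy as np

# first group of arrays, sorted by the key row (values from the example above)
a = np.array([1, 1, 2, 3, 5, 6, 7])   # key1
b = np.array([2, 3, 4, 5, 2, 4, 6])   # val1
c = np.array([2, 4, 5, 6, 7, 8, 5])

# second group of arrays (made-up sample)
d = np.array([1, 2, 4, 7])            # key2
e = np.array([3, 4, 9, 6])            # val2
f = np.array([10, 20, 30, 40])

# count[k] = how often k occurs in a; cumsum[k] = exclusive end of k's run
count = np.bincount(a, minlength=max(a.max(), d.max()) + 1)
cumsum = np.cumsum(count)

for j in range(len(d)):
    start = cumsum[d[j]] - count[d[j]]      # first row whose key equals d[j]
    for i in range(start, cumsum[d[j]]):
        if b[i] == e[j]:
            f[j] += c[i]                    # performAddition

print(f)   # -> [14 25 30 45]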
What I fear is that, since not all threads do the same amount of work, this will lead to warp divergence and I may not get efficient performance.
You could treat your data as two key-value tables: Table1: (a,b) -> c and Table2: (d,e) -> f, where the pairs (a,b) and (d,e) are the keys and c, f are the values.
Then your problem simplifies to
foreach key in Table2
    if key in Table1
        Table2[key] += Table1[key]
Suppose a and b are positive and have limited ranges, such as unsigned char; then a simple way to combine a and b into one key is
unsigned short key = (unsigned short)(a) * 256 + b;
If the range of key is still not too large as in the above example, you could create your Table1 as
int Table1[65536];
Checking if key in Table1 becomes
if (Table1[key] != INVALID_VALUE)
....
With all these restrictions, an implementation with Thrust should be very simple.
A similar combining method could still be used if a and b have a larger range, like int.
But if the range of key is too large, you have to go to the method suggested by Robert Crovella.
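To make the packed-key table concrete, here is a small host-side sketch in Python/NumPy rather than Thrust (a, b, c are taken from the example arrays above; the d, e, f sample values and the INVALID sentinel are made up, and a, b are assumed to stay below 256):
import numpy as np

INVALID = -1   # sentinel meaning "no entry stored for this key"

# Table1: keys (a, b) -> values c
a = np.array([1, 1, 2, 3, 5, 6, 7])
b = np.array([2, 3, 4, 5, 2, 4, 6])
c = np.array([2, 4, 5, 6, 7, 8, 5])

# Table2: keys (d, e) -> values f
d = np.array([1, 2, 7, 9])
e = np.array([3, 4, 6, 9])
f = np.array([10, 20, 30, 40])

# pack each (a, b) pair into a single 16-bit key and fill the dense table
table1 = np.full(65536, INVALID, dtype=np.int64)
table1[a * 256 + b] = c

# for every (d, e) pair, look the packed key up; add c into f where the key exists
keys2 = d * 256 + e
hit = table1[keys2] != INVALID
f[hit] += table1[keys2[hit]]

print(f)   # -> [14 25 35 40]; the last key (9, 9) has no match, so f stays 40
On the GPU, the same lookup maps naturally onto the zip-iterator + transform pattern asked about above, with the dense table kept in device memory.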
How can I write pandas or Python code to obtain a matrix from my data? I have the following table:
Item Route Order
R124 A1 1
R124 A2 2
R124 A3 3
R124 A4 4
R124 A4 4
R126 A5 1
R126 A6 2
R126 A7 3
R126 A7 3
My required output is:
A1 A2 A3 A4 A5 A6 A7
R124 1 1 1 2 0 0 0
R126 0 0 0 0 1 1 2
To obtain the matrix, each unique 'Item' value becomes a row name. For example, R124 has 1 entry each in the 'Order' column mapping to A1, A2 and A3 in the 'Route' column, and 2 entries mapping to A4, so those counts are recorded accordingly. Since R124 has no 'Order' entry mapping to A5, A6 or A7 in the 'Route' column, zeros are recorded, as shown in the output matrix.
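For reference, the table can be rebuilt as a DataFrame like this (column names and values copied from the table above):
import pandas as pd

df = pd.DataFrame({'Item': ['R124'] * 5 + ['R126'] * 4,
                   'Route': ['A1', 'A2', 'A3', 'A4', 'A4', 'A5', 'A6', 'A7', 'A7'],
                   'Order': [1, 2, 3, 4, 4, 1, 2, 3, 3]})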
You just need pivot_table.
If your data frame is df:
df.pivot_table(index="Item",columns="Route",values="Order",aggfunc='count')
gives:
Route A1 A2 A3 A4 A5 A6 A7
Item
R124 1.0 1.0 1.0 2.0 NaN NaN NaN
R126 NaN NaN NaN NaN 1.0 1.0 2.0
and to completely match your desired output, just add fillna and astype:
df.pivot_table(index="Item",columns="Route",values="Order",aggfunc='count').fillna(0).astype(int)
gives
Route A1 A2 A3 A4 A5 A6 A7
Item
R124 1 1 1 2 0 0 0
R126 0 0 0 0 1 1 2
I have two data frames:
df1 =
Id ColA ColB ColC
1 aa bb cc
3 11 ww 55
5 11 bb cc
df2 =
Id ColD ColE ColF
1 ff ee rr
2 ww rr 55
3 hh 11 22
4 11 11 cc
5 cc bb aa
I need to merge these two data frames to get the following result:
result =
Id ColA ColB ColC ColD ColE ColF
1 aa bb cc ff ee rr
2 NaN NaN NaN ww rr 55
3 11 ww 55 hh 11 22
4 NaN NaN NaN 11 11 cc
5 11 bb cc cc bb aa
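For reference, the two frames can be built like this (values copied from the tables above):
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 3, 5],
                    'ColA': ['aa', '11', '11'],
                    'ColB': ['bb', 'ww', 'bb'],
                    'ColC': ['cc', '55', 'cc']})
df2 = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'ColD': ['ff', 'ww', 'hh', '11', 'cc'],
                    'ColE': ['ee', 'rr', '11', '11', 'bb'],
                    'ColF': ['rr', '55', '22', 'cc', 'aa']})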
I do the merging this way:
import pandas as pd
result = pd.merge(df1,df2,on='Id')
However, my result looks as follows instead of the expected result shown above:
result =
Id ColA ColB ColC ColD ColE ColF
1 aa bb cc ff ee rr
3 11 ww 55 hh 11 22
5 11 bb cc cc bb aa
According to the documentation of merge, you need to specify the 'how' parameter as outer (the default is inner, which is consistent with what you're getting):
outer: use union of keys from both frames (SQL: full outer join)
inner: use intersection of keys from both frames (SQL: inner join)
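So, to get the expected result shown in the question, it is enough to pass how='outer' explicitly:
import pandas as pd

# keep the union of Ids from both frames; cells missing on either side become NaN
result = pd.merge(df1, df2, on='Id', how='outer')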
I often end up in the following situation: I have a dataframe with two ID columns
import pandas as pd
A = pd.DataFrame([[1,'a', 'a1'], [2, None, 'a2'], [3,'c', 'a3'], [4,'None', 'a3'], [None, 'e', 'a3'], ['None', 'None', 'None']], columns = ['id1', 'id2', 'colA'])
id1 id2 colA
0 1 a a1
1 2 None a2
2 3 c a3
3 4 None a3
4 None e a3
5 None None None
and I have another dataframe with additional info I want to add to the first dataframe
B = pd.DataFrame([[1,'a', 'b1', 'c1'], [2, 'b', 'b2', 'c2'], [3,'c', 'b3', 'c3'], [4, 'd', 'b4', 'c4'], [5, 'e', 'b5', 'c5'], [6, 'e', 'b5', 'c5']], columns = ['id1', 'id2', 'colB', 'colC'])
id1 id2 colB colC
0 1 a b1 c1
1 2 b b2 c2
2 3 c b3 c3
3 4 d b4 c4
4 5 e b5 c5
5 6 e b5 c5
I want to merge on id1, like this
A.merge(B, how='left', on='id1')
id1 id2_x colA id2_y colB colC
0 1 a a1 a b1 c1
1 2 None a2 b b2 c2
2 3 c a3 c b3 c3
3 4 None a3 d b4 c4
4 None e a3 NaN NaN NaN
5 None None None NaN NaN NaN
This is close to what I want. However, for the failed lookups (that is, when id1 is not available) I would like to merge on id2, so that the result looks like
id1 id2_x colA id2_y colB colC
0 1 a a1 a b1 c1
1 2 None a2 b b2 c2
2 3 c a3 c b3 c3
3 4 None a3 d b4 c4
4 None e a3 NaN b5 c5
5 None None None NaN NaN NaN
What's the best way to achieve this? Note that I don't really want two id2 columns in the result, and id2 may have duplicates.
IIUC you can use fillna. But note that it fills the last row too.
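Here df is the left-merge result from the question:
df = A.merge(B, how='left', on='id1')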
print df
id1 id2_x colA id2_y colB colC
0 1 a a1 a b1 c1
1 2 None a2 b b2 c2
2 3 c a3 c b3 c3
3 4 None a3 d b4 c4
4 None e a3 NaN NaN NaN
5 None None None NaN NaN NaN
df = df.fillna(B)
print df
id1 id2_x colA id2_y colB colC
0 1 a a1 a b1 c1
1 2 None a2 b b2 c2
2 3 c a3 c b3 c3
3 4 None a3 d b4 c4
4 None e a3 NaN b5 c5
5 None None None NaN b5 c5
As EdChum mentioned in the comments, another solution is to use combine_first, but the output is different:
print A.combine_first(B)
colA colB colC id1 id2
0 a1 b1 c1 1 a
1 a2 b2 c2 2 b
2 a3 b3 c3 3 c
3 a3 b4 c4 4 None
4 a3 b5 c5 5 e
5 None b5 c5 None None
The difference in timing is:
In [142]: %timeit A.combine_first(B)
100 loops, best of 3: 3.44 ms per loop
In [143]: %timeit A.merge(B, how='left', on='id1').fillna(B)
100 loops, best of 3: 2.89 ms per loop