Is there a way to replace a row's value with one from another row, within each group?
Below are the before and after datasets. The Product of each type C row needs to be changed to the Product of the type L row that, for the same customer and the same LINK_ID, has the highest Amount.
Before
Obs Cust LINK_ID Type Product Amount
1 1 12432 L A 23
2 1 12432 C B 0
3 2 23213 L C 234
4 2 23145 L D 25
5 2 23145 C E 0
6 3 21311 L F 34
7 3 21324 L G 45
8 3 21324 L H 35
9 3 21324 C I 0
After
Cust LINK_ID Type Product Amount
1 12432 L A 23
1 12432 C A -
2 23213 L C 234
2 23145 L D 25
2 23145 C D -
3 21311 L F 34
3 21324 L G 45
3 21324 L H 35
3 21324 C G -
Thank you!
If I understand correctly, you want the Product value for each type C row to be the Product associated with the highest Amount among the type L rows. If that is correct, one possible way is the following. First, the product with the highest amount for L-type rows within each group of customer and LINK_ID is calculated:
Note that the original dataset is assumed to be named "example".
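For reference, here is a minimal sketch of that input dataset, built from the Before table above (column names are assumed to match the question):

data example;
   input cust link_id type $ product $ amount;
   datalines;
1 12432 L A 23
1 12432 C B 0
2 23213 L C 234
2 23145 L D 25
2 23145 C E 0
3 21311 L F 34
3 21324 L G 45
3 21324 L H 35
3 21324 C I 0
;
run;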
proc sql;
   create table L_Type as
   select cust, LINK_ID, product, amount
   from example
   where type = 'L'
   group by cust, LINK_ID
   /* summary functions belong in HAVING, not WHERE */
   having amount = max(amount)
   ;
quit;
Then the product calculated above is applied to the C-type rows of the original example:
proc sql;
   select
      e.cust
      , e.LINK_ID
      , e.type
      , case when e.type = 'C' then b.product
             else e.product
        end as product
      , e.amount
   from example e left join L_Type b
      on e.cust = b.cust and e.LINK_ID = b.LINK_ID
   ;
quit;
So you have a couple of processing tasks to do. Have you considered all the edge cases?

For each customer, find the row(s) with the maximum amount.
- Is one of them type L?
  - No: do nothing.
  - Yes: track the Product and LinkId as follows.
    - Is there more than one 'maximal' row?
      - No: track the Product and LinkId from that one row.
      - Yes: is there more than one Product among those rows?
        - No: track the Product value, then ask: is there more than one LinkId?
          - No: track the LinkId.
          - Yes: which LinkIds? Track all the different LinkIds, or track just one of them: the first, lowest, highest, or last LinkId.
        - Yes: now what? Log an error? Track one of the Product values, because only one can be used; but which one? The first occurring, the lowest value, the highest value, or the last occurring?

For the tracked LinkIds (there might not be any), apply the tracked Product to the rows that are type C (or perhaps type not L). One way these choices might be resolved is sketched below.
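To make that concrete, here is one hedged sketch in SAS that picks a single policy: within each cust/LINK_ID group, keep the first L row at the maximum amount (ties broken by sort position), and apply its product to the C rows. The names come from the question; the tie-breaking choices are assumptions:

proc sort data=example out=sorted;
   by cust link_id descending amount;
run;

/* one row per cust/LINK_ID: the highest-amount L row, ties broken by position */
data l_max;
   set sorted;
   by cust link_id;
   where type = 'L';
   if first.link_id;
   keep cust link_id product;
   rename product = best_product;
run;

/* apply the tracked product to the type C rows */
data want;
   merge sorted l_max;
   by cust link_id;
   if type = 'C' and not missing(best_product) then product = best_product;
   drop best_product;
run;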
I have a dataset with the structure that looks something like this:
Group ID Value
1 A 10
1 B 15
1 C 20
2 D 10
2 E 25
Within each Group, I want to obtain the sum of all possible combinations of two or more IDs. For instance, within group 1, I can have the following combinations: AB, AC, BC, ABC. So, in total I have four possible combinations for group 1, of which I'd like to get the sum of the variable value.
I am using the formula for combinations of N elements in groups of size R to identify how many observations I need to add to the dataset to have enough observations.
For Group 1, the number of observations I need are:
3!/((3-2)!*2!)*2 = 6 for the two-ID combinations
3!/((3-3)!*3!)*3 = 3 for the three-ID combination.
So a total of 9 observations. Since I already have three, I can use the command: expand 6 if Group==1. For Group 1 I would get something like
Group ID Value
1 A 10
1 B 15
1 C 20
1 A 10
1 B 15
1 C 20
1 A 10
1 B 15
1 C 20
Now, I am stuck here on how to proceed to tell Stata to identify the combinations and create the summation. Ideally, I want to create two new variables, to identify the tuples and get the summation, so something that looks like:
Group ID Value Tuple Sum
1 A 10 AB 25
1 B 15 AB 25
1 A 10 AC 30
1 C 20 AC 30
1 B 15 BC 35
1 C 20 BC 35
1 A 10 ABC 45
1 B 15 ABC 45
1 C 20 ABC 45
In this way, I could then just drop the duplicates in terms of Group and Tuple. Once I have the Tuple variable, getting the sum is straightforward, but I can't get my head around creating the Tuple variable.
Any advice on how to do this?
I did this with nested loops and the user-written tuples command (available from SSC).
First I create and save a tempfile to store results:
clear
tempfile group_results
save `group_results', replace emptyok
Then I input and save data, along with a local for the number of groups:
clear
input Group str1 ID Value
1 A 10
1 B 15
1 C 20
2 D 10
2 E 25
2 F 13 // added to test
2 G 2 // added to test
end
sum Group
local num_groups = r(max)
tempfile base
save `base', replace
Here's the core of the code. The outer loop here iterates over Groups. Then it makes a list of the IDs in that group, and uses the tuples command to make a list of the unique combinations of those IDs, with a minimum size of 2. The k loop iterates through the number of tuples and the m loop makes an indicator for tuple membership.
forvalues i = 1/`num_groups' {
    display "Starting Group `i'"
    use `base' if Group==`i', clear
    * Make list of IDs to get unique combos of
    forvalues j = 1/`=_N' {
        local tuple_list`i' = "`tuple_list`i'' " + ID[`j']
    }
    * Get all unique combos in list using tuples command
    tuples `tuple_list`i'', display min(2)
    forvalues k = 1/`ntuples' {
        display "Tuple `k': `tuple`k''"
        local length = wordcount("`tuple`k''")
        gen intuple=0
        gen tuple`k'="`tuple`k''"
        forvalues m = 1/`length' {
            replace intuple=1 if ID==word("`tuple`k''",`m')
        }
        * Calculate sum of values in that tuple
        egen group_sum`k' = total(Value) if intuple==1
        drop intuple
        list
    }
    * Reshape into desired format
    reshape long tuple group_sum, i(Group ID Value) j(tuple_num)
    drop if missing(group_sum)
    sort tuple_num
    list
    append using `group_results'
    save `group_results', replace
}
* Full results
use `group_results', clear
sort Group tuple_num
list
I hope this helps. The list commands will give you a busy Results window, but they show everything that's happening. Here's the output at the end of the i loop for Group 1:
+--------------------------------------------------+
| Group ID Value tuple_~m tuple group_~m |
|--------------------------------------------------|
1. | 1 C 20 1 B C 35 |
2. | 1 B 15 1 B C 35 |
3. | 1 A 10 2 A C 30 |
4. | 1 C 20 2 A C 30 |
5. | 1 A 10 3 A B 25 |
|--------------------------------------------------|
6. | 1 B 15 3 A B 25 |
7. | 1 C 20 4 A B C 45 |
8. | 1 A 10 4 A B C 45 |
9. | 1 B 15 4 A B C 45 |
+--------------------------------------------------+
This could be inefficient if your data is actually much larger!
I have a dataset that looks basically like this:
LOCID  Name  Addtl Loc 1  Addtl Loc 2  Addtl Loc 3
1      A     2            3            5
1      B     2
1      C     2            4
And I would like to make it look like this:
LOCID  Name  Gender
1      A     F
2      A     F
3      A     F
5      A     F
1      B     M
2      B     M
1      C     F
2      C     F
4      C     F
So, I'd like to keep the attributes for each person but have a row for each of their locations. I also don't currently have a unique ID or any variable to identify each of the people but I could make one. I'm working in SAS. Does anyone have suggestions on how to do this?
I have been looking up wide to long methods but am having trouble understanding them.
It looks to me like you could just use a DO LOOP to transpose the data.
So assuming your input data set has LOCID and ADD_LOCID1 to ADD_LOCID3 plus any other variables, such as NAME and GENDER, you could just do the following to add an extra observation for every non-missing value found in the extra locid variables.
data want;
   set have;
   array list add_locid1 - add_locid3;
   output;  /* keep the row with the original LOCID */
   do index=1 to dim(list);
      locid = list[index];
      /* one extra observation per non-missing additional location */
      if not missing(locid) then output;
   end;
   drop index add_locid1-add_locid3;
run;
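To test it, a small version of the question's input might be created like this; the variable names are assumed to match the code above, with Gender taken from the desired output:

data have;
   input locid name $ add_locid1 add_locid2 add_locid3 gender $;
   datalines;
1 A 2 3 5 F
1 B 2 . . M
1 C 2 4 . F
;
run;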
I'm working in SAS as a novice. I have two datasets:
Dataset1
Unique ID  ColumnA
1          15
1          39
2          20
3          10
Dataset2
Unique ID  ColumnB
1          40
2          55
2          10
For each UniqueID, I want to subtract each value of ColumnA from all values of ColumnB, and I would like to create a NewColumn that is 1 any time 1 < ColumnB - ColumnA < 30. For the first row of Dataset1, where UniqueID = 1, I would want SAS to go through all the rows in Dataset2 that also have UniqueID = 1 and determine whether there are any rows where the difference between ColumnB and ColumnA is greater than 1 and less than 30. For the first row of Dataset1 the NewColumn should be assigned a value of 1 because 40 - 15 = 25. For the second row of Dataset1 the NewColumn should be assigned a value of 0 because 40 - 39 = 1 (which is not greater than 1). For the third row of Dataset1, I again want SAS to go through every row of ColumnB in Dataset2 that has the same UniqueID as in Dataset1: 55 - 20 = 35 is greater than 30, but NewColumn would still be assigned a value of 1 because (moving to row 3 of Dataset2, which has UniqueID = 2) 20 - 10 = 10, which satisfies the condition.
So I want my output to be:
Unique ID  ColumnA  NewColumn
1          15       1
1          39       0
2          20       1
I have tried concatenating Dataset1 and Dataset2 into a FullDataset. Then I tried using a do loop statement but I can't figure out how to do the loop for each value of UniqueID. I tried using BY but that of course produces an error because that is only used for increments.
DATA FullDataset;
   set Dataset1 Dataset2; /* concatenate datasets */
   do i=ColumnB-ColumnA by UniqueID;
      if 1<ColumnB-ColumnA<30 then NewColumn=1;
      output;
   end;
RUN;
I know I'm probably way off but any help would be appreciated. Thank you!
So, the way that answers your question most directly is the keyed set. This isn't necessarily how I'd do this, but it is fairly simple to understand (as opposed to a hash table, which is what I'd use, or a SQL join, probably what most people would use). This does exactly what you say: grabs a row of A, says for each matching row of B check a condition. It requires having an index on the datasets (well, at least on the B dataset).
data colA(index=(id));
input ID ColumnA;
datalines;
1 15
1 39
2 20
3 10
;;;;
data colB(index=(id));
input ID ColumnB;
datalines;
1 40
2 55
2 30
;;;;
run;
data want;
   * base: the colA dataset - you want to iterate through it once per row;
   set colA;
   * default result when no qualifying match is found;
   result = 0;
   * now, loop while the check variable shows 0 (match found);
   do while (_iorc_ = 0);
      * bring in the other dataset using ID as the key;
      set colB key=ID;
      * check to see if it matches your requirement, and also only check when _IORC_ is 0;
      if _IORC_ eq 0 and 1 lt ColumnB-ColumnA lt 30 then result=1;
      * this is just to show you what is going on, can remove;
      put _all_;
   end;
   * reset things for the next pass;
   _ERROR_=0;
   _IORC_=0;
run;
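For comparison, here is a rough sketch of the SQL-join alternative mentioned above. It is not from the original answer; it assumes the same colA/colB test datasets and that (ID, ColumnA) pairs are distinct enough to group by:

proc sql;
   create table want_sql as
   select a.ID,
          a.ColumnA,
          /* max() over the 0/1 condition: 1 if any matching colB row qualifies */
          max( b.ColumnB - a.ColumnA gt 1 and b.ColumnB - a.ColumnA lt 30 ) as NewColumn
   from colA as a left join colB as b
      on a.ID = b.ID
   group by a.ID, a.ColumnA;
quit;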
Would you kindly be able to assist me with writing a SAS script for a specific type of left join, as described below?
I'm looking to do a left join of Table A to Table B [given below]. A full match on all identifying fields counts as a match, and a partial match [at least one field] also counts as a match when the remaining fields in Table B are missing/null; however, any partial/full match where at least one field is populated in Table B while being null/missing in Table A will be treated as a non-match.
Here's an example of the input tables [A and B] and the output matching analysis/results below:
TABLE - A
S/N COL_1 COL_2 COL_3 COL_4
-----------------------------------
1 A p ii
2 A
3 B r
TABLE - B
S/N COL_1 COL_2 COL_3 COL_4
-----------------------------------
1 A p ii
2 A q
3 A
4 A p 7 ii
5 B
6 B r n
OUTPUT/ MATCHING ANALYSIS
TABLE - A TABLE - B MATCH NO MATCH
----------------------------------------
1 1 Y
1 2 N
1 3 Y
1 4 N
2 1 N
2 2 N
2 3 Y
2 4 N
3 5 Y
3 6 N
I've decided not to use join as there could be more than 4 columns to join...
First, let's find the equals:
proc sql;
create table Equals as
select a.*,'Y' as Match, '' as No_Match from table_a as a
intersect
select b.*,'Y' as Match, '' as No_Match from table_b as b ;
quit;
Now, let's find the not-equals:
proc sql;
create table Not_Equals as
select a.*,'' as Match, 'N' as No_Match from table_a as a
except
select b.*,'' as Match, 'N' as No_Match from table_b as b
union
select b.*,'' as Match, 'N' as No_Match from table_b as b
except
select a.*,'' as Match, 'N' as No_Match from table_a as a ;
quit;
And finally, let's combine the two data sets:
data All;
set Equals Not_Equals;
run;
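As an aside, for a fixed column count the pairwise rule from the question can also be written out directly. This is only a sketch, separate from the approach above; it assumes each table has a numeric S/N variable named s_n and character columns COL_1-COL_4, and that candidate pairs share COL_1 (as in the example output). A field populated in B but missing in A fails the equality test, so such pairs come out as N:

proc sql;
   create table match_analysis as
   select a.s_n as table_a_sn,
          b.s_n as table_b_sn,
          case when (a.col_2 = b.col_2 or missing(b.col_2))
                and (a.col_3 = b.col_3 or missing(b.col_3))
                and (a.col_4 = b.col_4 or missing(b.col_4))
               then 'Y' else 'N'
          end as match
   from table_a as a inner join table_b as b
      on a.col_1 = b.col_1;
quit;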
I am running a program where, stepwise, certain criteria eliminate records from tables. However, a record could be eliminated after the 7th table or the 8th, so the way I've been doing it (merging table 1 with 2, then table 2 with 3, and so on) isn't very convenient. Is there a way I can "track" an observation with flags? Say, if A is in table 1 then match = 1, else match = 0, and from there identify which table A was eliminated from (in this case Table 3). I would need to track possibly multiple observations at a time (not too many, maybe 5 or 10), and they may be eliminated at different points (one in Table 3, one in Table 8).
Example:
Table 1:
Pat_ID
A
B
C
D
E
F
Table 2:
A
B
D
E
F
Table 3:
B
D
E
F
I think this is what you are looking for. This merges the tables together and records in which tables each ID exists.
EDIT: In response to the question in the comments, I realize the naming can be confusing. I'm changing the table names to make things more clear.
data mick;
input PAT_ID $ ;
datalines;
A
B
C
D
E
F
;
run;
data keith;
input PAT_ID $ ;
datalines;
A
B
D
E
F
;
run;
data ron;
input PAT_ID $ ;
datalines;
B
D
E
F
;
run;
/* merge */
data want(drop=i);
   merge mick  (in=t1)
         keith (in=t2)
         ron   (in=t3);
   by PAT_ID;
   array table[3] Mick Keith Ron;
   /* with no variable list, array t[3] defaults to t1-t3, the IN= flags above */
   array t[3];
   do i=1 to 3;
      if t[i] then table[i]=1;
      else table[i]=0;
   end;
run;
This produces
PAT_ID Mick Keith Ron
A 1 1 0
B 1 1 1
C 1 0 0
D 1 1 1
E 1 1 1
F 1 1 1
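If you also want a single variable naming the first table an ID disappears from, a small follow-up sketch (reusing the flags from the output above) might look like this:

data tracked;
   set want;
   array table[3] Mick Keith Ron;
   length dropped_from $32;
   do i = 1 to 3;
      /* record the first table whose flag is 0, i.e. where the ID no longer appears */
      if table[i] = 0 and missing(dropped_from) then dropped_from = vname(table[i]);
   end;
   drop i;
run;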
When you eliminated these records, where did they go? I co-wrote a paper a few years back that argued that any time you delete records, you should write them to a dataset of deleted records. This makes it easier to track which records were deleted, at what step, and why.
For example:
data table2
drop_missingScore
;
set table1;
if missing(score) then output drop_missingScore;
else output table2;
run;
The full paper is available here: http://www.lexjansen.com/nesug/nesug11/ds/ds06.pdf
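Extending that idea to the multi-step situation in the question: each elimination step writes its own audit dataset, and the audit datasets are stacked afterwards with a step label. A sketch, with the step criteria invented purely for illustration:

/* step 1: assumed criterion - drop records with a missing score */
data table2 drop_step1;
   set table1;
   if missing(score) then output drop_step1;
   else output table2;
run;

/* step 2: another assumed criterion */
data table3 drop_step2;
   set table2;
   if age < 18 then output drop_step2;
   else output table3;
run;

/* stack the audit datasets, recording the step at which each record was deleted */
data dropped_all;
   length step $40;
   set drop_step1 (in=s1) drop_step2 (in=s2);
   if s1 then step = 'step 1: missing score';
   else if s2 then step = 'step 2: age under 18';
run;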