Bidirectional VLOOKUP - flag in the same table - SAS

I need to do this:
table 1:
ID Cod.
1 20
2 102
4 30
7 10
9 201
10 305
table 2:
ID Cod.
1 20
2 50
3 15
4 30
5 25
7 10
10 300
Now I have a table like this from an outer join:
ID Cod. ID1 Cod1.
1 20 1 20
2 50 . .
. . 2 102
3 15 . .
4 30 4 30
5 25 . .
7 10 7 10
. . 9 201
10 300 . .
. . 10 305
Now I want to add flags that tell me whether the IDs and codes have common values, so:
ID Cod. ID1 Cod1. Flag_ID Flag_cod
1 20 1 20 0 0
2 50 . . 0 1
. . 2 102 0 1
3 15 . . 1 1
4 30 4 30 0 0
5 25 . . 1 1
7 10 7 10 0 0
. . 9 201 1 1
10 300 . . 0 1
. . 10 305 0 1
I would like to know how I can get Flag_ID, specifically to cover the cases of ID = 2 or ID = 10.
Thank you

You can group by a COALESCE of the two id columns in order to count and compare details.
Example
data table1;
input id code @@; datalines;
1 20 2 102 4 30 7 10 9 201 10 305
;
data table2;
input id code @@; datalines;
1 20 2 50 3 15 4 30 5 25 7 10 10 300
;
proc sql;
create table got as
select
  table2.id, table2.code
, table1.id as id1, table1.code as code1
  /* counts are computed per coalesced id group and remerged onto the
     detail rows: 0 means the id occurs in both tables */
, case
    when count(table1.id) = 1 and count(table2.id) = 1 then 0 else 1
  end as flag_id
  /* a missing difference (row absent on one side) is ne 0, so it flags 1 */
, case
    when table1.code - table2.code ne 0 then 1 else 0
  end as flag_code
from table1
full join table2
  on table2.id = table1.id and table2.code = table1.code
group by coalesce(table2.id, table1.id)
;
quit;
You might also want to look into PROC COMPARE with a BY statement.
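For reference, a minimal PROC COMPARE sketch using the table1 and table2 datasets from the example above; the ID statement matches observations by id, and a BY statement could be added to run the comparison within groups:
proc sort data=table1; by id; run;
proc sort data=table2; by id; run;
/* LISTALL reports variables and observations found in only one data set */
proc compare base=table2 compare=table1 listall;
id id;
run;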

Related

SAS: How to calculate moving average in SAS using current observation?

I am trying to calculate a moving average for a test data set in SAS, where I want to feed each calculated moving average into the next calculation. I have added a sample calculation below.
I have data something like this
data have;
input category $ week value ;
datalines;
a 1 10
a 2 5
a 3 .
a 4 30
a 5 50
b 1 30
b 2 5
b 3 .
b 4 0
b 5 50
;
I want to calculate 4 weeks of moving average at category level
here is below expected output
data want;
input category $ week value moving_average;
datalines;
a 1 10 .
a 2 5 .
a 3 . .
a 4 30 .
a 5 50 .
a 6 . 28.33
a 7 . 36.11
a 8 . 34.86
b 1 30 .
b 2 5 .
b 3 . .
b 4 0 .
b 5 50 .
b 6 . 18.33
b 7 . 22.77
b 8 . 22.775
b 9 . 28.46
So here is the logic for b:
For week 6: (50 + 0 + 5) / 3 = 18.33
For week 7: (18.33 + 50 + 0) / 3 = 22.77
For week 8: (22.77 + 18.33 + 50 + 0) / 4 = 22.775
A similar calculation can be done for a.
One can consider everything up to week 5 as training data; after that it is test data.
I hope I have made my problem statement clear this time.
So you want to create new observations? You will need an explicit OUTPUT statement.
You can use a "circular array" to make it easier to calculate the average.
data have;
input category $ week value ;
datalines;
a 1 10
a 2 5
a 3 .
a 4 30
a 5 50
b 1 30
b 2 5
b 3 .
b 4 0
b 5 50
;
data want;
set have;
by category ;
/* circular buffer holding the four most recent weekly values */
array c_array [0:3] _temporary_ ;
if first.category then call missing(of c_array[*]);
if week <= 5 then c_array[mod(week,4)]=value;
output;
/* after the last observed week, forecast weeks 6 to 9,
   feeding each forecast back into the buffer */
if week=5 then do week=6 to 9;
  value=.;
  average=mean(of c_array[*]);
  output;
  c_array[mod(week,4)]=average;
end;
run;
Results
Obs category week value average
1 a 1 10 .
2 a 2 5 .
3 a 3 . .
4 a 4 30 .
5 a 5 50 .
6 a 6 . 28.3333
7 a 7 . 36.1111
8 a 8 . 36.1111
9 a 9 . 37.6389
10 b 1 30 .
11 b 2 5 .
12 b 3 . .
13 b 4 0 .
14 b 5 50 .
15 b 6 . 18.3333
16 b 7 . 22.7778
17 b 8 . 22.7778
18 b 9 . 28.4722

How to Capture previous row value and perform subtraction

Refer to Table 1 as the main data and Table 2 as the desired output. Let me explain in detail: Closing_Bal is derived as (Opening_Bal - EMI), e.g. (20 - 2) = 18. I want that 18 in the second row under the Opening_Bal column, then (Opening_Bal - EMI) again, and so on until a new LAN; when a new LAN appears, the loop starts over.
I have created a lag function but am not able to run the loop.
Try this
data A;
infile datalines dlm = '|' dsd;
input Month $ LAN Opening_Bal EMI Closing_Bal;
datalines;
1_Nov|1|20|2|18
2_Dec|1| |3|
3_Jan|1| |5|
4_Feb|1| |3|
1_Nov|2|30|4|26
2_Dec|2| |3|
3_Jan|2| |2|
4_Feb|2| |5|
5_Mar|2| |6|
;
data B(drop = c);
set A;
by LAN;
retain c; /* carries the running closing balance across observations */
if first.LAN then c = Closing_Bal;
if Opening_Bal = . then do;
  Opening_Bal = c;
  Closing_Bal = Opening_Bal - EMI;
  c = Closing_Bal;
end;
run;
Result:
Month LAN Opening_Bal EMI Closing_Bal
1_Nov 1 20 2 18
2_Dec 1 18 3 15
3_Jan 1 15 5 10
4_Feb 1 10 3 7
1_Nov 2 30 4 26
2_Dec 2 26 3 23
3_Jan 2 23 2 21
4_Feb 2 21 5 16
5_Mar 2 16 6 10
The problem is that you already have CLOSING_BAL on the input dataset, so when the SET statement reads a new observation it will overwrite the value calculated on the previous observation. Either drop or rename the variable in the source dataset.
Example:
data have;
input Month $ LAN Opening_Bal EMI Closing_Bal;
datalines;
1_Nov 1 20 2 18
2_Dec 1 . 3 .
3_Jan 1 . 5 .
4_Feb 1 . 3 .
1_Nov 2 30 4 26
2_Dec 2 . 3 .
3_Jan 2 . 2 .
4_Feb 2 . 5 .
5_Mar 2 . 6 .
;
data want;
set have (drop=closing_bal);
retain Closing_Bal;
Opening_Bal=coalesce(Opening_Bal,Closing_Bal);
Closing_bal=Opening_bal - EMI ;
run;
Results:
Opening_ Closing_
Obs Month LAN Bal EMI Bal
1 1_Nov 1 20 2 18
2 2_Dec 1 18 3 15
3 3_Jan 1 15 5 10
4 4_Feb 1 10 3 7
5 1_Nov 2 30 4 26
6 2_Dec 2 26 3 23
7 3_Jan 2 23 2 21
8 4_Feb 2 21 5 16
9 5_Mar 2 16 6 10
I am not sure this works:
data B;
set A;
by lan;
if not first.lan then do;
opening_bal = lag(closing_bal);
closing_bal = opening_bal - EMI;
end;
run;
because you don't execute LAG for each observation, so its queue falls out of step with the data.
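To see why: LAG keeps a queue that is updated only when the function actually executes, so inside conditional logic it returns the value from the last time that statement ran, not from the previous observation. A minimal sketch with made-up data (unrelated to the question's dataset):
data demo;
input x @@;
/* executes on every row: returns the previous row's value */
prev_always = lag(x);
/* executes only when x > 2: returns the value from the last time
   THIS statement ran, which need not be the previous row */
if x > 2 then prev_skipped = lag(x);
datalines;
1 5 2 6
;
On the last row prev_always is 2, the true previous value, but prev_skipped is 5, because the row with x = 2 never executed the second LAG.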

Creating unique order count for duplicate orders by account id and order id

I'm trying to create a data set that will show me the duplicate transactions. The trouble I'm running into is when there are multiple orders on one order_id. The records that get assigned a 2 are the ones I consider duplicate orders.
data have;
input acct_id order_id;
datalines;
1 121
1 122
2 123
2 124
3 125
3 125
3 125
3 126
3 126
3 126
;
data want;
set have;
by acct_id order_id;
if first.acct_id then order_count = 1;
else order_count =2;
run;
My desired output is below.
acct_id | order_id | order_count
1 121 1
1 122 2
2 123 1
2 124 2
3 125 1
3 125 1
3 125 1
3 126 2
3 126 2
3 126 2
I feel like what I have coded out already is close, but I can't get it figured out.
data want;
set have;
by acct_id order_id notsorted;
if first.acct_id then order_count=0;
/* sum statement: order_count is automatically retained across rows */
if first.order_id then order_count+1;
put acct_id order_id order_count;
run;
acct_id order_id order_count
1 121 1
1 122 2
2 123 1
2 124 2
3 125 1
3 125 1
3 125 1
3 126 2
3 126 2
3 126 2

Find Lagged Average of Group

I am trying to create instruments from a three-dimensional panel dataset, as included below:
input firm year market price comp_avg
1 2000 10 1 .
3 2000 10 2 .
3 2001 10 3 .
1 2002 10 4 .
3 2002 10 5 .
1 2000 20 6 .
3 2000 20 7 .
1 2001 20 8 .
2 2001 20 9 .
3 2001 20 10 .
1 2002 20 20 .
2 2002 20 30 .
3 2002 20 40 .
2 2000 30 50 .
1 2001 30 60 .
2 2001 30 70 .
1 2002 30 80 .
2 2002 30 90 .
end
The instrument I am trying to create is the lagged (year-1) average price of a firm's competitors (those in the same market) in each market the firm operates in, in a given year.
At the moment I have code that does the job, but I am hoping I am missing something and this can be done in a clearer or more efficient way.
Here is the code:
// for each firm
qui levelsof firm, local(firms)
qui foreach f in `firms' {
// find all years for that firm
levelsof year if firm == `f', local(years)
foreach y in `years' {
// skip first year (because there is no lagged data)
if `y' == 2000 {
continue
}
// find all markets in that year
levelsof market if firm == `f' & year == `y', local(mkts)
local L1 = `y'-1
foreach m in `mkts' {
// get average of all competitors in that market in the year prior
gen temp = firm != `f' & year == `L1' & market == `m'
su price if temp
replace comp_avg = r(mean) if firm == `f' & market == `m' & year == `y'
drop temp
}
}
}
The data I am working with are reasonably large (~1 million obs) so the faster the better.
clear
input firm year market price
1 2000 10 1
3 2000 10 2
3 2001 10 3
1 2002 10 4
3 2002 10 5
1 2000 20 6
3 2000 20 7
1 2001 20 8
2 2001 20 9
3 2001 20 10
1 2002 20 20
2 2002 20 30
3 2002 20 40
2 2000 30 50
1 2001 30 60
2 2001 30 70
1 2002 30 80
2 2002 30 90
end
bysort firm market (year) : gen Lprice = price[_n-1] if year - year[_n-1] == 1
bysort market year : egen total = total(Lprice)
bysort market year : egen count = count(Lprice)
gen mean_others = (total - cond(missing(Lprice), 0, Lprice)) ///
/ (count - cond(missing(Lprice), 0, 1))
sort market year
list market year firm price Lprice mean_others total count, sepby(market year)
     +------------------------------------------------------------------+
     | market   year   firm   price   Lprice   mean_o~s   total   count |
     |------------------------------------------------------------------|
  1. |     10   2000      1       1        .          .       0       0 |
  2. |     10   2000      3       2        .          .       0       0 |
     |------------------------------------------------------------------|
  3. |     10   2001      3       3        2          .       2       1 |
     |------------------------------------------------------------------|
  4. |     10   2002      1       4        .          3       3       1 |
  5. |     10   2002      3       5        3          .       3       1 |
     |------------------------------------------------------------------|
  6. |     20   2000      3       7        .          .       0       0 |
  7. |     20   2000      1       6        .          .       0       0 |
     |------------------------------------------------------------------|
  8. |     20   2001      2       9        .        6.5      13       2 |
  9. |     20   2001      3      10        7          6      13       2 |
 10. |     20   2001      1       8        6          7      13       2 |
     |------------------------------------------------------------------|
 11. |     20   2002      1      20        8        9.5      27       3 |
 12. |     20   2002      3      40       10        8.5      27       3 |
 13. |     20   2002      2      30        9          9      27       3 |
     |------------------------------------------------------------------|
 14. |     30   2000      2      50        .          .       0       0 |
     |------------------------------------------------------------------|
 15. |     30   2001      2      70       50          .      50       1 |
 16. |     30   2001      1      60        .         50      50       1 |
     |------------------------------------------------------------------|
 17. |     30   2002      2      90       70         60     130       2 |
 18. |     30   2002      1      80       60         70     130       2 |
     +------------------------------------------------------------------+
My approach breaks the problem down:
1. Calculate the previous price for the same firm and market. (This could also be done by declaring each (firm, market) pair a panel.)
2. The mean of the other values (here previous prices) in the same market and year is (the group total MINUS this firm's previous price) divided by (the group count MINUS 1); a worked example follows below.
Step 2 needs a modification: if this firm's previous price is missing, you must subtract 0 from both numerator and denominator. Stata's normal rules would render total MINUS missing as missing, but this firm's previous price might be unknown while others in the same market have known prices.
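For instance, take market 20 in 2002 from the listing above: total = 8 + 10 + 9 = 27 and count = 3, so for firm 1 (previous price 8) the competitors' mean is (27 - 8) / (3 - 1) = 9.5, which is the mean_others value shown.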
Note: There are small ways of speeding up your code, but this should be faster (so long as it is correct).
EDIT: Another solution (two lines) using rangestat (must be installed with ssc install rangestat):
bysort firm market (year) : gen Lprice = price[_n-1] if year - year[_n-1] == 1
rangestat Lprice, interval(year 0 0) by(market) excludeself

Looking up data within a file versus merging

I have a file that looks at the ratings teacher X gives to teacher Y and the date each rating occurs:
clear
input rating_id RatingTeacher RatedTeacher Rating str8 Date
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
5 11 13 1 "1/7/2010"
end
I want to look in the history to see how many times the RatingTeacher had been rated at the time they made the rating, and the cumulative score. The result would look like this:
rating_id RatingTeacher RatedTeacher Rating Date TimesRated CumulativeRating
1 15 12 1 "1/1/2010" 0 0
2 12 11 2 "1/2/2010" 1 1
3 14 11 3 "1/2/2010" 0 0
4 14 13 2 "1/5/2010" 0 0
5 19 11 4 "1/6/2010" 0 0
5 11 13 1 "1/7/2010" 3 9
end
I have been merging the dataset with itself to get this to work, and it is fine. I was wondering if there is a more efficient way to do this within the file.
In your input data, I guess that the last rating_id should be 6 and that dates are MDY. Statalist members are asked to use dataex (SSC) to set up data examples. This isn't Statalist but there is no reason for lower standards to apply. See the Statalist FAQ
I rarely see even programmers be precise about what they mean by "efficient", whether it means fewer lines of code, less use of memory, more speed, something else or is just some all-purpose term of praise. This code loops over observations, which can certainly be slow for large datasets. More in this paper
We can't compare with your merge solution because you don't give the code.
clear
input rating_id RatingTeacher RatedTeacher Rating str8 SDate
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
6 11 13 1 "1/7/2010"
end
gen Date = daily(SDate, "MDY")
sort Date
gen Wanted = .
quietly forval i = 1/`=_N' {
    count if Date < Date[`i'] & RatedTeacher == RatingTeacher[`i']
    replace Wanted = r(N) in `i'
}
list, sep(0)
+---------------------------------------------------------------------+
| rating~d Rating~r RatedT~r Rating SDate Date Wanted |
|---------------------------------------------------------------------|
1. | 1 15 12 1 1/1/2010 18263 0 |
2. | 2 12 11 2 1/2/2010 18264 1 |
3. | 3 14 11 3 1/2/2010 18264 0 |
4. | 4 14 13 2 1/5/2010 18267 0 |
5. | 5 19 11 4 1/6/2010 18268 0 |
6. | 6 11 13 1 1/7/2010 18269 3 |
+---------------------------------------------------------------------+
The building block is that the rater and the ratee form a pair. You can use egen's group() to give a unique ID to each (rater, ratee) pair.
egen pair = group(rater ratee)
bysort pair (date): gen timesRated = _n