Insert more rows based on conditions from other rows - sas

I have the following dataset:
data have;
input year firm_id location_id action action_amount operate new_entry
;
cards;
2013 28013 6085 1 10000 0 0
2015 28013 6085 1 12000 0 0
2015 28013 29189 1 10000 0 0
2016 28013 34019 1 5000 1 1
2017 28013 34019 0 0 1 2
2011 120609 9003 1 7000 0 0
2012 120609 9003 0 0 1 1
2013 120609 9003 1 5000 1 2
2012 247908 23001 1 9000 0 0
2013 247908 23001 1 8000 0 0
2014 247908 23001 1 8500 1 1
2015 247908 23001 0 0 1 2
2003 356123 1001 0 0 0 0
2004 356123 1001 0 0 0 0
2009 356123 1001 1 9800 1 1
;
run;
I want to add additional rows and two new variables called "pre_action" and "pre_action_amount" to obtain the following dataset:
data want;
input year firm_id location_id action action_amount operate new_entry pre_action pre_action_amount
;
cards;
2013 28013 6085 1 10000 0 0 . .
2014 28013 6085 0 0 0 0 1 10000
2015 28013 6085 1 12000 0 0 . .
2016 28013 6085 0 0 0 0 1 12000
2015 28013 29189 1 6500 0 0 . .
2016 28013 29189 0 0 0 0 1 6500
2016 28013 34019 1 5000 1 1 0 0
2017 28013 34019 0 0 1 2 . .
2011 120609 9003 1 7000 0 0 . .
2012 120609 9003 0 0 1 1 1 7000
2013 120609 9003 1 5000 1 2 . .
2012 247908 23001 1 9000 0 0 . .
2013 247908 23001 1 8000 0 0 1 9000
2014 247908 23001 1 8500 1 1 1 8000
2015 247908 23001 0 0 1 2 . .
2003 356123 1001 0 0 0 0 . .
2004 356123 1001 0 0 0 0 0 0
2005 356123 1001 0 0 0 0 0 0
2009 356123 1001 1 9800 1 1 0 0
;
run;
The rules are as follows:
1) First, consider only the rows with operate = 0.
For each firm_id and location_id pair, if in the following year there is no row with the same firm_id and location_id, then create a new row with the following year and same firm_id and location_id pair. The variables action, action_amount, operate, and new_entry are all set to 0, while pre_action and pre_action_amount are set to the values of action and action_amount in the previous year. Example: In year 2013, for the firm_id/location_id pair 28013/6085, we have operate = 0. But in 2014, there are no observations for this firm_id/location_id pair. So we create a 2014 row, set action, action_amount, operate, and new_entry to 0, and set pre_action=1 and pre_action_amount=10000, which are the values of action and action_amount in 2013.
For each firm_id and location_id pair, if in the following year there is a row with the same firm_id and location_id, then simply set pre_action and pre_action_amount to be the value of action and action_amount in the previous year. Example: In year 2011 for firm_id/location_id 120609/9003, we have operate=0. But in the next year 2012, there is a row with this firm_id/location_id pair. So we set pre_action=1 and pre_action_amount=7000 which are the values for action and action_amount in 2011. Another example is in year 2003, for the firm_id/location_id 356123/1001.
2) Now consider the rows with new_entry=1 that do not yet have a value of pre_action and pre_action_amount. Set both pre_action and pre_action_amount to be 0.
3) All other values of pre_action and pre_action_amount are empty.
I am unsure how to create these new rows given the complicated rules above. Any help would be appreciated.

If your main problem is how to access information from the next record to determine whether there is a gap in year (there is no lead() function in SAS), the following example shows how to use the POINT= option of the SET statement to do precisely that.
The example is based on a slightly modified version of your data since:
it reduces the number of observations and variables in the dataset.
it uses just one variable to define the observation ID.
Example code:
/* Example that shows how to use the POINT= option of the SET statement
to look ahead at the next observation and decide whether there is a gap in YEAR. */
data example;
input year id action_amount;
cards;
2013 28013 10000
2015 28013 12000
2016 28013 5000
2017 28013 0
2003 356123 250
2004 356123 320
2009 356123 9800
;
run;
* Observations should be sorted by ID and YEAR for the next data step;
proc sort data=example; by id year; run;
* Process the data, filling one-year gaps within each ID;
data example_filled;
set example;
by id year;
* Create the variable that will be used in the POINT= option of the SET statement
* to retrieve information from the next observation;
_next_obs = _N_ + 1;
* Compute variables storing values from the previous observation;
retain pre_action_amount;
pre_action_amount = lag(action_amount);
if first.id then
call missing(pre_action_amount);
* Analyze possible YEAR gaps in a BY group and partly fill them;
if last.id then do;
*** No gaps to check => simply output the last observation of the BY group;
output;
end;
else do;
*** We should check for gaps in YEAR;
* 1) Send the current observation to the output dataset;
output;
* 2) Check if there is a gap > 1 based on the YEAR value from the next observation,
* and if so, fill in (just) ONE observation
* Note the use of the POINT= option of the SET statement to access info from next obs;
set example(keep=year rename=(year=next_year)) POINT=_next_obs;
if next_year > year + 1 then do;
* Compute pre action values as the action values of the obs just output above;
* All the other variables still maintain their values and will be carried over
* this newly created obs;
pre_action_amount = action_amount;
year = year + 1;
* Output the filled observation;
output;
end;
end;
drop next_year;
run;
Expected output dataset example_filled (the new filled observations are Obs = 2 and Obs = 8):
Obs    year        id    action_amount    pre_action_amount
  1    2013     28013            10000                    .
  2    2014     28013            10000                10000
  3    2015     28013            12000                10000
  4    2016     28013             5000                12000
  5    2017     28013                0                 5000
  6    2003    356123              250                    .
  7    2004    356123              320                  250
  8    2005    356123              320                  320
  9    2009    356123             9800                  320
You may find it handy to use this example as a basis for implementing the logic you described to compute pre_action and pre_action_amount.
Note: here are other ways to access information from the next observation in a data step:
https://blogs.sas.com/content/sgf/2015/06/19/can-you-lag-and-lead-at-the-same-time-if-using-the-sas-data-step-yes-you-can/
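For readers who want to check the look-ahead logic outside SAS, here is a minimal sketch in Python (not SAS) of the same idea: emit each row, peek at the next row of the same id, and add exactly one filler row when the next year is more than one year away. The function name fill_one_year_gaps and the tuple layout are made up for this illustration; None stands in for a SAS missing value.

```python
# Rows are (year, id, action_amount), already sorted by id and year,
# mirroring the small "example" dataset above.
rows = [
    (2013, 28013, 10000),
    (2015, 28013, 12000),
    (2016, 28013, 5000),
    (2017, 28013, 0),
    (2003, 356123, 250),
    (2004, 356123, 320),
    (2009, 356123, 9800),
]


def fill_one_year_gaps(rows):
    """Return (year, id, action_amount, pre_action_amount) tuples.

    Hypothetical helper for illustration; None plays the role of '.'.
    """
    out = []
    for i, (year, id_, amount) in enumerate(rows):
        # "Lag": value from the previous input row, only within the same id.
        prev = rows[i - 1] if i > 0 else None
        same_id_as_prev = prev is not None and prev[1] == id_
        pre_amount = prev[2] if same_id_as_prev else None
        out.append((year, id_, amount, pre_amount))

        # "Lead": peek at the next input row to detect a gap in year.
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        if nxt is not None and nxt[1] == id_ and nxt[0] > year + 1:
            # Fill just one row; it carries over the current values.
            out.append((year + 1, id_, amount, amount))
    return out


filled = fill_one_year_gaps(rows)
for row in filled:
    print(row)
```

The result has the same nine rows as the example_filled listing above, with the filler rows in positions 2 and 8.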

Related

Lag function in SAS for checking previous value

In SAS, I would like to create a label that checks the previous sell indicator: if the sell indicator of the previous time period is 1/0 and in the current is 0/1 (meaning that it has changed) then I assign a value 1 to the ind variable.
The dataset looks like:
Customer Time Sell_Ind
1 2 1
1 3 0
1 4 0
2 23 0
2 24 0
2 30 0
5 12 1
5 11 0
And so on.
My expected output would be
Customer Time Sell_Ind Ind
1 2 1 0
1 3 0 1
1 4 0 0
2 23 0 0
2 24 0 0
2 30 0 0
5 12 1 0
5 11 0 1
The previous/current check is meant by customer.
I have tried as follows
data mydata;
set original;
By customer;
Lag_sell_ind=lag(sell_ind);
If first.customer then Lag_sell_ind=.;
Run;
But it does not return the expected output.
In sql I would probably use partition by customer over time but I do not know how to do the same in SAS.
You were halfway there; you only need to add one IF statement to get the desired output.
data want;
set have;
by customer;
lag=lag(sell_ind);
if first.customer then lag=.;
if sell_ind ne lag and lag ne . then ind = 1;
else ind = 0;
drop lag;
run;
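You can simplify the lag step using the IFN function, as below. IFN evaluates all of its arguments on every row, so LAG is still called for each observation and its queue stays in step with the data.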
data have;
input Customer Time Sell_Ind;
datalines;
1 2 1
1 3 0
1 4 0
2 23 0
2 24 0
2 30 0
5 12 1
5 11 0
;
data want;
set have;
by customer;
Lag_sell_ind = ifn(first.customer, 0, lag(sell_ind));
Run;
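As a cross-check of the by-customer comparison discussed above, here is a small sketch in Python (not SAS). It assumes the rows are already in the order shown in the question and flags a row when its sell indicator differs from the previous row of the same customer. The name change_flags is made up for this illustration.

```python
# Rows are (customer, time, sell_ind) in the order given in the question.
rows = [
    (1, 2, 1),
    (1, 3, 0),
    (1, 4, 0),
    (2, 23, 0),
    (2, 24, 0),
    (2, 30, 0),
    (5, 12, 1),
    (5, 11, 0),
]


def change_flags(rows):
    """Return 1 where sell_ind changed versus the previous row of the same customer."""
    flags = []
    prev_customer = None
    prev_sell = None
    for customer, _time, sell in rows:
        # The first row of each customer never counts as a change.
        changed = customer == prev_customer and sell != prev_sell
        flags.append(1 if changed else 0)
        prev_customer, prev_sell = customer, sell
    return flags


ind = change_flags(rows)
print(ind)
```

This matches the Ind column of the expected output in the question.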

Subsampling years before and after an event of an unbalanced panel dataset

I'm trying to figure out a concise way to keep only the two years before and after the year in which an event takes place using daily panel data in Stata. The panel is unbalanced. Ultimately, I'm trying to conduct an event study but I experienced issues because the unique groups report inconsistent years.
The data looks something like this:
ID year month day event
1 1999 1 1 0
1 1999 1 2 0
1 1999 1 3 0
1 1999 1 4 0
1 1999 1 5 0
1 1999 1 6 0
1 1999 1 7 0
1 1999 1 8 0
1 1999 1 9 0
1 1999 1 10 0
1 1999 1 11 0
1 1999 1 12 0
1 1999 1 13 0
1 1999 1 14 0
1 1999 1 15 0
1 1999 1 16 0
1 1999 1 17 0
1 1999 1 18 0
1 1999 1 19 0
1 1999 1 20 0
1 1999 1 21 0
1 1999 1 22 0
1 1999 1 23 0
1 1999 1 24 0
1 1999 1 25 0
1 1999 1 26 0
1 1999 1 27 0
1 1999 1 28 0
1 1999 1 29 0
1 1999 1 30 0
1 1999 1 31 0
1 1999 2 1 1
1 1999 2 2 1
In this case, the event takes place in February 1999. The event is monthly, but I need the daily data for a later part of the analysis. I want to somehow tag the 24 months before February 1999 and the 24 months after February 1999. However, I need to do this in a way that won't codify any months in 2002 if group 1 reported no data in 2000.
I got the following to work on a similar set of monthly data but I can't figure out a way to do it with daily data. Furthermore, if anyone has suggestions for a less clunky solution, I would be very appreciative.
bys ID year (month) : egen year_change = max(event)
bys ID (year month) : replace year_change = 2 if ///
(year_change[_n+24] == 1 & year[_n] == year[_n+24] - 2) | ///
(year_change[_n+12] == 1 & year[_n] == year[_n+12] - 1) | ///
(year_change[_n-12] == 1 & year[_n] == year[_n-12] + 1) | ///
(year_change[_n-24] == 1 & year[_n] == year[_n-24] + 2)
keep if year_change >= 1
It seems that your event date is the first date with event 1. So,
gen dailydate = mdy(month, day, year)
bysort ID : egen key = min(cond(event == 1, dailydate, .))
gen wanted = inrange(dailydate, key - 730, key + 730)
Check that wanted gives the dates you want and then modify the rule or keep accordingly.
This code doesn't assume that the event date is the same for each panel, but that would not be a problem.
See this paper for a review of related technique.
For your task, I suggest you work with actual Stata dates instead of relying on year + month + day variables. This way, it is easier to add or subtract 24 months without relying on data sorting (the "_n+24" part in your code), and the coding does not suffer from the missing-data issue that you outline in the question.
I see a straightforward solution, which relies on an assumption about your setting (one you did not specify, but which is the general form of event studies): the event date is the same for all IDs, hence there is no group-specific "treatment" date.
g stata_date = mdy(month, day, year) // generate variable with Stata date
/* Unique event on Feb 1, 1999 */
bys ID: egen treat_group = max(event) // indicator for an ID to ever be "treated"
g event_window = (stata_date >= td(01Feb1997) & stata_date < td(01Feb2001)) // indicator for event window - 2 years before and after Feb 1, 1999
g event_treatment = treat_group * event_window // indicator for a treated ID during the event window
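To make the date-window idea concrete, here is a hedged sketch in Python (not Stata) of the per-ID variant: find each ID's first event date and tag observations within 730 days on either side. The function name tag_window and the sample records are made up for this illustration, and IDs without any event are left untagged here, which is a choice made for this sketch rather than something either answer specifies.

```python
from datetime import date, timedelta

# Records are (id, date, event); the values are invented for illustration.
records = [
    (1, date(1997, 1, 1), 0),
    (1, date(1999, 1, 31), 0),
    (1, date(1999, 2, 1), 1),
    (1, date(1999, 2, 2), 1),
    (1, date(2001, 3, 1), 0),
    (2, date(2000, 1, 1), 0),
]


def tag_window(records, days=730):
    """Return True for rows within +/- `days` of their ID's first event date."""
    first_event = {}
    for pid, day, event in records:
        if event == 1 and (pid not in first_event or day < first_event[pid]):
            first_event[pid] = day

    span = timedelta(days=days)
    tagged = []
    for pid, day, _event in records:
        key = first_event.get(pid)
        # Assumption of this sketch: an ID with no event is never tagged.
        tagged.append(key is not None and key - span <= day <= key + span)
    return tagged


wanted = tag_window(records)
print(wanted)
```

Because the window is measured on real calendar dates, a missing year in the panel simply contributes no rows; it cannot shift the window the way row offsets can.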

Stata: Reducing observations based on yearly data

I want to create a variable that is one or zero if a company (companyid below) is "multicolor" in each year. Below is my data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 companyid int year float(red blue green)
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2017 1 0 0
"001045" 2017 1 0 0
"001049" 2019 0 1 0
"001049" 2019 0 0 1
"001055" 2018 1 0 0
"001055" 2018 0 1 0
"001055" 2018 0 0 1
So for example, company #001055 is red, blue, and green for 2018 so this 'multicolor' variable should equal to one.
Additionally, I also want to create variables for the different combinations. I.e. a red-blue var = 1 if a company is red and blue = 1 in each year.
I was trying to do something with bysort companyid year: gen multicolor = 1 if red == 1 & blue == 1 & green == 1 but I realize that has a lot missing in what I want to accomplish.
The overall goal is to reduce multiple year observations so I have one observation per year per company.
This single year/company record would have the info if that company was red, green, blue, or the exact mix of these colors if it is mixed. Below would be the example of data that I want to create from the data above.
input str6 companyid int year float(red blue green r-b-g red-blue blue-green ...more...)
"001045" 2015 0 1 0 0 0 0 ...
"001045" 2017 1 0 0 0 0 0 ...
"001049" 2019 0 0 0 0 0 1 ...
"001055" 2018 0 0 0 1 0 0 ...
I think this is a lot easier than you are fearing. First, collapse to maximum values by company and year. Then you have the individual values of red blue green. Second, concatenate the values, so that "110" is red and blue but not green, and so on.
tabulate with its generate() option would then create indicator variables for every combination found in the data.
In effect, the 3 colors and 2 possibilities permit binary encoding, and the string is a binary number too.
With true coded as 1 and false as 0, the maximum over a group means "any" and the minimum means "all"; this is obvious once understood, but worth spelling out otherwise. For a Stata context, see this FAQ.
clear
input str6 companyid int year float(red blue green)
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2017 1 0 0
"001045" 2017 1 0 0
"001049" 2019 0 1 0
"001049" 2019 0 0 1
"001055" 2018 1 0 0
"001055" 2018 0 1 0
"001055" 2018 0 0 1
end
collapse (max) red blue green, by(companyid year)
egen colors = concat(red blue green)
list
     +-----------------------------------------------+
     | compan~d   year   red   blue   green   colors |
     |-----------------------------------------------|
  1. |   001045   2015     0      1       0      010 |
  2. |   001045   2017     1      0       0      100 |
  3. |   001049   2019     0      1       1      011 |
  4. |   001055   2018     1      1       1      111 |
     +-----------------------------------------------+
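The same two steps (group maximum, then concatenation) can be sketched in Python (not Stata) to show why the maximum acts as "any" on 0/1 data. The function name collapse_colors is made up for this illustration.

```python
# Rows are (companyid, year, red, blue, green), copied from the question.
rows = [
    ("001045", 2015, 0, 1, 0),
    ("001045", 2015, 0, 1, 0),
    ("001045", 2015, 0, 1, 0),
    ("001045", 2015, 0, 1, 0),
    ("001045", 2017, 1, 0, 0),
    ("001045", 2017, 1, 0, 0),
    ("001049", 2019, 0, 1, 0),
    ("001049", 2019, 0, 0, 1),
    ("001055", 2018, 1, 0, 0),
    ("001055", 2018, 0, 1, 0),
    ("001055", 2018, 0, 0, 1),
]


def collapse_colors(rows):
    """Collapse to one entry per (companyid, year) and build the color code."""
    groups = {}
    for companyid, year, red, blue, green in rows:
        key = (companyid, year)
        r, b, g = groups.get(key, (0, 0, 0))
        # max over 0/1 values is 1 if ANY row in the group has the color.
        groups[key] = (max(r, red), max(b, blue), max(g, green))
    return {key: "".join(str(v) for v in vals) for key, vals in groups.items()}


colors = collapse_colors(rows)
for key in sorted(colors):
    print(key, colors[key])
```

Each code is a three-digit binary string, so "011" reads as blue and green but not red, matching the colors column in the listing above.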

Management of spell data: months spent in given state in the past 24 months

I am working with a spell dataset that has the following form:
clear all
input persid start end t_start t_end spell_type year spell_number event
1 8 9 44 45 1 1999 1 0
1 12 12 60 60 1 2000 1 0
1 1 1 61 61 1 2001 1 0
1 7 11 67 71 1 2001 2 0
1 1 4 85 88 2 2003 1 0
1 5 7 89 91 1 2003 2 1
1 8 11 92 95 2 2003 3 0
1 1 1 97 97 2 2004 1 0
1 1 3 121 123 1 2006 1 1
1 4 5 124 125 2 2006 2 0
1 6 9 126 129 1 2006 3 1
1 10 11 130 131 2 2006 4 0
1 12 12 132 132 1 2006 5 1
1 1 12 157 168 1 2009 1 0
1 1 12 169 180 1 2010 1 0
1 1 12 181 192 1 2011 1 0
1 1 12 193 204 1 2012 1 0
1 1 12 205 216 1 2013 1 0
end
lab define lab_spelltype 1 "unemployment spell" 2 "employment spell"
lab val spell_type lab_spelltype
where persid is the id of the person; start and end are the months when the yearly unemployment/employment spell starts and ends, respectively; t_start and t_end are the same measures but starting to count from 1st January 1996; event is equal to 1 for the employment entries for which the previous row was an unemployment spell.
The data is such that there are no overlapping spells during a given year, and each year contiguous spells of the same type have been merged together.
My goal is, for each row such that event is 1, to compute the number of months spent as employed in the last 6 months and 24 months.
In this specific example, what I would like to get is:
clear all
input persid start end t_start t_end spell_type year spell_number event empl_6 empl_24
1 8 9 44 45 1 1999 1 0 . .
1 12 12 60 60 1 2000 1 0 . .
1 1 1 61 61 1 2001 1 0 . .
1 7 11 67 71 1 2001 2 0 . .
1 1 4 85 88 2 2003 1 0 . .
1 5 7 89 91 1 2003 2 1 0 5
1 8 11 92 95 2 2003 3 0 . .
1 1 1 97 97 2 2004 1 0 . .
1 1 3 121 123 1 2006 1 1 0 0
1 4 5 124 125 2 2006 2 0 . .
1 6 9 126 129 1 2006 3 1 3 3
1 10 11 130 131 2 2006 4 0 . .
1 12 12 132 132 1 2006 5 1 4 7
1 1 12 157 168 1 2009 1 0 . .
1 1 12 169 180 1 2010 1 0 . .
1 1 12 181 192 1 2011 1 0 . .
1 1 12 193 204 1 2012 1 0 . .
1 1 12 205 216 1 2013 1 0 . .
end
So the idea is that I have to go back to rows preceding each event==1 entry and count how many months the individual was employed.
Can you suggest a way to obtain this final result?
Some suggested to expand the dataset, but perhaps there are better ways to tackle the problem (especially because the dataset is quite large).
EDIT
The correct labeling of the employment status is:
lab define lab_spelltype 1 "employment spell" 2 "unemployment spell"
The number of past months spent in employment (empl_6 and empl_24) and the definition of event are now correct with this label.
A solution to the problem is to:
expand the data so as to have it monthly,
fill in the gap months with tsfill and finally,
use sum() and lag operators to get the running sum for the last 6 and 24 months.
See also Robert's solution for some ideas I borrowed.
Important: this is almost surely not an efficient way to solve the issue, especially if the data is large (as in my case).
However, the plus is that one actually "sees" what happens in the background, which helps to make sure the final result is the desired one.
Also, importantly, this solution takes into account cases where 2 (or more) events happen within 6 (or 24) months from each other.
clear all
input persid start end t_start t_end spell_type year spell_number event
1 8 9 44 45 1 1999 1 0
1 12 12 60 60 1 2000 1 0
1 1 1 61 61 1 2001 1 0
1 7 11 67 71 1 2001 2 0
1 1 4 85 88 2 2003 1 0
1 5 7 89 91 1 2003 2 1
1 8 11 92 95 2 2003 3 0
1 1 1 97 97 2 2004 1 0
1 1 3 121 123 1 2006 1 1
1 4 5 124 125 2 2006 2 0
1 6 9 126 129 1 2006 3 1
1 10 11 130 131 2 2006 4 0
1 12 12 132 132 1 2006 5 1
1 1 12 157 168 1 2009 1 0
1 1 12 169 180 1 2010 1 0
1 1 12 181 192 1 2011 1 0
1 1 12 193 204 1 2012 1 0
1 1 12 205 216 1 2013 1 0
end
lab define lab_spelltype 1 "employment" 2 "unemployment"
lab val spell_type lab_spelltype
list
* generate Stata monthly dates
gen spell_start = ym(year,start)
gen spell_end = ym(year,end)
format %tm spell_start spell_end
list
* expand to monthly data
gen n = spell_end - spell_start + 1
expand n, gen(expanded)
sort persid year spell_number (expanded)
bysort persid year spell_number: gen month = spell_start + _n - 1
by persid year spell_number: replace event = 0 if _n > 1
format %tm month
* xtset, fill months gaps with "empty" rows, use lags and cumsum to count past months in employment
xtset persid month, monthly // %tm format
tsfill
bysort persid (month): gen cumsum = sum(spell_type) if spell_type==1
bysort persid (month): replace cumsum = cumsum[_n-1] if cumsum==.
bysort persid (month): gen m6 = cumsum-1 - L7.cumsum if event==1 // "-1" otherwise it sums also current empl month
bysort persid (month): gen m24 = cumsum-1 - L25.cumsum if event==1
drop if event==.
list persid start end year m* if event
The posted example is of little utility in developing and testing a solution so I made up fake data that has the same properties. It's bad practice to use 1 and 2 as values for an indicator so I replaced the employed indicator with 1 meaning employed, 0 otherwise. Using month and year separately is also useless so Stata monthly dates are used.
The first solution uses tsegen (from SSC) after expanding each spell to one observation per month. With panel data, all you need to do is to sum the employment indicator for the desired time window.
The second solution uses rangestat (also from SSC) and does the same computations without expanding the data at all. The idea is simple: just add the duration of previous employment spells if the end of the spell falls within the desired window. Of course, if the end of the spell falls within the window but not the start, the months outside the window must be subtracted.
* fake data for 100 persons, up to 10 spells with no overlap
clear
set seed 123423
set obs 100
gen long persid = _n
gen spell_start = ym(runiformint(1990,2013),1)
expand runiformint(1,10)
bysort persid: gen spellid = _n
by persid: gen employed = runiformint(0,1)
by persid: gen spell_avg = int((ym(2015,12) - spell_start) / _N) + 1
by persid: replace spell_start = spell_start[_n-1] + ///
runiformint(1,spell_avg) if _n > 1
by persid: gen spell_end = runiformint(spell_start, spell_start[_n+1]-1)
replace spell_end = spell_start + runiformint(1,12) if mi(spell_end)
format %tm spell_start spell_end
* an event is an employment spell that immediately follows an unemployment spell
by persid: gen event = employed & employed[_n-1] == 0
* expand to one obs per month and declare as panel data
expand spell_end - spell_start + 1
bysort persid spellid: gen ym = spell_start + _n - 1
format %tm ym
tsset persid ym
* only count employment months; limit results to first month event obs
tsegen m6 = rowtotal(L(1/6).employed)
tsegen m24 = rowtotal(L(1/24).employed)
bysort persid spellid (ym): replace m6 = . if _n > 1 | !event
bysort persid spellid (ym): replace m24 = . if _n > 1 | !event
* --------- redo using rangestat, without any monthly expansion ----------------
* return to original obs but keep first month results
bysort persid spellid: keep if _n == 1
* employment end and duration for employed observations only
gen e_end = spell_end if employed
gen e_len = spell_end - spell_start + 1 if employed
foreach target in 6 24 {
// define interval bounds but only for event observations
// an out-of-sample [0,0] interval will yield no results for non-events
gen low`target' = cond(event, spell_start-`target', 0)
gen high`target' = cond(event, spell_start-1, 0)
// sum employment lengths and save earliest employment spell info
rangestat (sum) empl`target'=e_len ///
(firstnm) firste`target'=e_end firste`target'len=e_len, ///
by(persid) interval(spell_end low`target' high`target')
// remove from the count months that occur before lower bound
gen e_start = firste`target' - firste`target'len + 1
gen outside = low`target' - e_start
gen empl`target'final = cond(outside > 0, empl`target'-outside, empl`target')
replace empl`target'final = 0 if mi(empl`target'final) & event
drop e_start outside
}
* confirm that we match the -tsegen- results
assert m24 == empl24final
assert m6 == empl6final
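The interval idea behind the no-expansion approach can also be checked by hand. Below is a minimal sketch in Python (not Stata, and not rangestat itself): for an event starting in month t, the window of the past k months is [t-k, t-1], and each employment spell contributes only the months where it overlaps that window. The function name months_employed_before is made up for this illustration; the spells are the t_start/t_end values from the question, with spell_type 1 read as employment as stated in the question's EDIT.

```python
# Spells are (t_start, t_end, employed), taken from the question's data.
spells = [
    (44, 45, True), (60, 60, True), (61, 61, True), (67, 71, True),
    (85, 88, False), (89, 91, True), (92, 95, False), (97, 97, False),
    (121, 123, True), (124, 125, False), (126, 129, True),
    (130, 131, False), (132, 132, True), (157, 168, True),
    (169, 180, True), (181, 192, True), (193, 204, True), (205, 216, True),
]


def months_employed_before(spells, event_start, window):
    """Count employed months in the `window` months before `event_start`."""
    lo, hi = event_start - window, event_start - 1
    total = 0
    for start, end, employed in spells:
        if employed:
            # Length of the intersection of [start, end] and [lo, hi];
            # a negative value means no overlap.
            overlap = min(end, hi) - max(start, lo) + 1
            total += max(0, overlap)
    return total


# t_start of the four rows with event == 1 in the question.
event_starts = [89, 121, 126, 132]
results = {
    t: (months_employed_before(spells, t, 6), months_employed_before(spells, t, 24))
    for t in event_starts
}
print(results)
```

For the four event rows this gives the empl_6 and empl_24 values listed in the question. Clipping each spell to the window does in one step what the subtraction of the "outside" months does in the rangestat code.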

How to create dummy variables to indicate two values are the same using SAS

My data looks like:
ID YEAR A B
1078 1989 1 0
1078 1999 1 1
1161 1969 0 0
1161 2002 1 1
1230 1995 0 0
1230 2002 0 1
1279 1996 0 0
1279 2003 0 1
1447 1993 1 0
1447 2001 1 1
1487 1967 0 0
1487 2008 1 1
1487 2008 1 0
1487 2009 0 1
1678 1979 1 0
1678 2002 1 1
1690 1989 1 0
1690 1993 0 1
1690 1993 0 0
1690 1996 0 1
1690 1996 0 0
1690 1997 1 1
I'd like to create two dummy variables, new and X, the scenarios are as follows:
within each ID-B pair (a pair is 2 observations, one with B=0 and the other with B=1, whose YEAR values are closest together in sequence)
if the observation with B=1 has a value of 1 for A then new=1 for both observations in that pair, otherwise it is 0 for both observations in that pair, and
if the pair has the same value in A then X=0 and if they have different values then X=1.
Therefore, the output would be:
ID YEAR A B new X
1078 1989 1 0 1 0
1078 1999 1 1 1 0
1161 1969 0 0 1 1
1161 2002 1 1 1 1
1230 1995 0 0 0 0
1230 2002 0 1 0 0
1279 1996 0 0 0 0
1279 2003 0 1 0 0
1447 1993 1 0 1 1
1447 2001 1 1 1 1
1487 1967 0 0 1 1
1487 2008 1 1 1 1
1487 2008 1 0 0 1
1487 2009 0 1 0 1
1678 1979 1 0 1 0
1678 2002 1 1 1 0
1690 1989 1 0 0 1
1690 1993 0 1 0 1
1690 1993 0 0 0 0
1690 1996 0 1 0 0
1690 1996 0 0 1 1
1690 1997 1 1 1 1
My code is
data want;
set have;
by ID;
if B=1 and A=1 then new=1;
else new=0;
run;
proc sql;
create table out as
select a.*,max(a.B=a.A & a.B=1) as new,^(min(A)=max(A)) as X
from have a
group by ID;quit;
The first one doesn't work and the second one reordered variable B. I am stuck here. Any help will be greatly appreciated.
You need to do some research into first./last. processing and the lag function.
The helpful guys here have already gotten you to this point, maybe take this as an opportunity to read the documentation at SAS' Support Site.
At a high level:
You need a conditional statement to step through each observation in an ID group
Find out how many observations are in that group (let's say N obs)
Flag up if any obs match the logic you mentioned
Lag back N obs and set your new to 1 or 0 respectively
Very manual solution, I just used the retain statement to identify the pairs (dataset already in the required order).
data start;
set start;
retain pair 0;
if B=0 then pair=pair+1;
run;
data ForNew;
set start(where=(B=1));
New=(A=B); /*Boolean variable=1 if the condition in brackets is true*/
keep pair New;
run;
/*if A has equal values mean will be 0 or 1*/
proc means data=start NWAY NOPRINT;
class pair;
var A;
output out=ForX(drop=_: where=(media in (0,1)) keep=pair media) mean(A)=media;
run;
data end;
merge start ForNew ForX(in=INX drop=media);
by pair;
X=(^INX);
run;
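The pairing logic of the last answer can be sketched in Python (not SAS) for a quick sanity check. It assumes, as that answer does, that the data are already ordered so that every pair is a B=0 row followed by its B=1 row. The function name pair_dummies is made up for this illustration, and only a subset of the question's rows is used.

```python
# Rows are (id, year, a, b), a subset of the question's data in its original order.
rows = [
    (1078, 1989, 1, 0),
    (1078, 1999, 1, 1),
    (1161, 1969, 0, 0),
    (1161, 2002, 1, 1),
    (1487, 1967, 0, 0),
    (1487, 2008, 1, 1),
    (1487, 2008, 1, 0),
    (1487, 2009, 0, 1),
]


def pair_dummies(rows):
    """Return (id, year, a, b, new, x) with new and x constant within each pair."""
    # A row with B=0 opens a new pair, like the retained counter in the SAS code.
    pairs = []
    for row in rows:
        if row[3] == 0:
            pairs.append([])
        pairs[-1].append(row)

    out = []
    for pair in pairs:
        # new = 1 if the B=1 row of the pair has A=1.
        new = 1 if any(a == 1 and b == 1 for _, _, a, b in pair) else 0
        # x = 1 if the A values within the pair differ.
        x = 1 if len({a for _, _, a, _ in pair}) > 1 else 0
        out.extend((id_, year, a, b, new, x) for id_, year, a, b in pair)
    return out


result = pair_dummies(rows)
for row in result:
    print(row)
```

Note that the sketch raises an IndexError if the first row has B=1, so it relies on the stated ordering. Applied literally, the rule gives X=0 for ID 1447 (both rows have A=1), whereas the expected output in the question lists X=1 for that ID.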