I'm trying to detect specific values of one variable and populate a new variable when those conditions are met.
Here's part of my data (I have many more rows):
id time result
1 1 normal
1 2 normal
1 3 abnormal
2 1 normal
2 2
3 3 normal
4 1 normal
4 2 normal
4 3 abnormal
5 1 normal
5 2 normal
5 3
What I want
id time result base
1 1 normal
1 2 normal x
1 3 abnormal
2 1 normal x
2 2
2 3 normal
3 3 normal
4 1 normal
4 2 normal x
4 3 abnormal
5 1 normal
5 2 normal x
5 3
My baseline value (base) should be populated when a result exists at timepoint (time) 2. If there's no result at time 2, then the baseline should be at time=1.
if result="" and time=2 then do;
if time=10 and result ne "" then base=X; end;
if result ne "" and time=2 then base=X;
It works correctly when time=2 and a result exists. But if the result is missing, something goes wrong.
The question seems a bit off. "Else if time="" and time=1" There seems to be a typo there somewhere.
However, your syntax looks sound. I've worked an example with your data. The first condition works, but the second (else if) is an assumption. I'll update this as the question is updated.
options missing='';
data begin;
input id time result $ 5-20 ;
datalines;
1 1 normal
1 2 normal
1 3 abnormal
2 1 normal
2 2
3 3 normal
4 1 normal
4 2 normal
4 3 abnormal
;
run;
data flagged;
set begin;
if time=2 and result NE "" then base='X';
else if time=1 and id=2 then base='X';
run;
Edit based on the revised question.
This assumes that time point 1 is always adjacent to time point 2 (if not, add more lags). To simulate a LEAD function, we sort the data in descending time order and use LAG.
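proc sort data=begin; by id descending time; run;
data flagged;
set begin;
by id descending time;
/* Call LAG unconditionally so each queue is updated on every row */
prev_time = lag(time);
prev_result = lag(result);
if time=2 and result NE "" then base='X';
/* not first.id keeps the lagged row from belonging to the previous id */
else if time=1 and not first.id and prev_time=2 and prev_result EQ "" then base='X';
drop prev_time prev_result;
run;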
More about opposite of lag: https://communities.sas.com/t5/SAS-Communities-Library/How-to-simulate-the-LEAD-function-opposite-of-LAG/ta-p/232151
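An alternative that avoids LAG entirely is to record which ids have a non-missing result at time 2 and merge that flag back onto every row. This is only a minimal sketch: the dataset names sorted, t2 and flagged2 and the variable has_t2 are made up for illustration, and it assumes at most one time=2 row per id.

```sas
/* Sketch only: SORTED, T2, FLAGGED2 and HAS_T2 are hypothetical names */
proc sort data=begin out=sorted; by id time; run;

/* One row per id that has a non-missing result at time 2 */
data t2(keep=id has_t2);
  set sorted;
  where time=2 and result ne "";
  has_t2 = 1;
run;

data flagged2;
  merge sorted t2;
  by id;
  length base $1;
  if time=2 and has_t2 then base='X';           /* result exists at time 2 */
  else if time=1 and not has_t2 then base='X';  /* otherwise fall back to time 1 */
  drop has_t2;
run;
```

Because the flag is attached per id, this version does not depend on time 1 and time 2 being adjacent rows.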
I want to change data of the form
id value
1 1
1 1
1 2
2 7
2 7
2 7
2 5
. .
. .
. .
to
id value
1 1
1 1
1 1
2 7
2 7
2 7
2 7
. .
. .
. .
That is, the last value by group should be the first value by group. I have tried the following code
data want;
set have;
by id;
last.value=first.value;
run;
But that didn't work. Could someone help me out?
You should save the value from the first.id row in a variable and retain it.
data want(drop=tValue);
set have;
by id;
retain tValue;
if first.id then tValue=value;
if last.id then value=tValue;
run;
The problem here is that first.value and last.value:
Do not hold the actual value, they just tell you if an observation is the first or last in a BY-group
Cannot be assigned - last.value = is not valid syntax
Secondly, first.value and last.value only get set if the value variable is stated in the by statement. You should use first.id and last.id instead.
What we need to do here is:
1. Check if we are looking at an observation that is the first in the BY-group based on id
2. Keep the value of the value variable until the last id value is reached
3. When we are looking at the last id value then set the value from step 1.
Alexey's answer covers the actual syntax required to do this. Here's what the first.id/last.id values look like. (You can always view them by adding put _all_; into your datastep):
id value first.id last.id tValue
1 1 1 0 1
1 1 0 0 1
1 2 0 1 1
2 7 1 0 7
2 7 0 0 7
2 7 0 0 7
2 5 0 1 7
. .
. .
. .
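To reproduce a listing like the one above yourself, the put _all_; statement mentioned earlier can be dropped into the same data step. A minimal sketch, assuming have is already sorted by id:

```sas
data want(drop=tValue);
  set have;
  by id;                          /* creates FIRST.id and LAST.id */
  retain tValue;
  if first.id then tValue=value;  /* remember the first value of the group */
  if last.id then value=tValue;   /* overwrite the last value with it */
  put _all_;                      /* write every variable, including FIRST.id and LAST.id, to the log */
run;
```

The log will also show the automatic variables _ERROR_ and _N_, which can be ignored here.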
I've spent quite a lot of time on Stack Overflow looking for answers to other questions, but I'm really stuck on this one, so I'm finally asking a question!
I have a dataset of fish in SAS, with:
a unique ID for each angler
three different variables with number of fish released in each category by that angler: over legal size, under legal size, and released dead
a sequential number (fishno) based on the number of rows for each ID; 1 to the last row of that ID.
Variable to be created: Disposition--could be either character variable with "legal" "under" "dead" options or even numeric values of 1-3.
It was originally set up with one row per unique ID, but I set it so that now there is one row per fish discarded (i.e. if there were 3 legal size and 2 undersize fish, I now have 5 rows).
I need to assign, by unique ID, whether each row/fish was released legal, undersize or dead. In the previous example, for a unique ID, I'd need 3 rows assigned to a Disposition of "legal" and 2 rows assigned to a Disposition of "under".
I've tried first.var statements along with if-then-do statements; played around with macros; nothing worked quite right and I'm pretty stuck here. Is there some sort of random assignment I should try? Is there a much easier way that I'm missing?
Example of the data below...
THANK YOU!!
Data in Excel format
Assuming you already have the FISHNO variable, there needs to be some method for assigning each fish as legal, dead, or undersize. The following code will assign the disposition in that order:
data have;
input ID LEGAL DEAD UNDERSIZE FISHNO;
datalines;
15 1 0 1 1
15 1 0 1 2
29 2 0 2 1
29 2 0 2 2
29 2 0 2 3
29 2 0 2 4
38 1 0 1 1
38 1 0 1 2
53 1 0 1 1
53 1 0 1 2
55 1 0 1 1
55 1 0 1 2
;
run;
data want;
set have;
if legal>0 and legal>=fishno then disposition = 'legal';
else if dead>0 and legal+dead>=fishno then disposition = 'dead';
else if undersize>0 and legal+dead+undersize>=fishno then disposition = 'under';
run;
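If the original one-row-per-angler data is still available, the rows and the disposition can also be generated in a single step with DO loops and OUTPUT. The sketch below is only an illustration: the dataset names one_row and expanded are hypothetical, and the two sample rows are taken from the ids used above.

```sas
/* Hypothetical one-row-per-angler input */
data one_row;
  input ID LEGAL DEAD UNDERSIZE;
  datalines;
15 1 0 1
29 2 0 2
;
run;

data expanded;
  set one_row;
  length disposition $5;
  fishno = 0;
  /* Write one row per fish in each category; a loop with an upper bound of 0 writes nothing */
  do i = 1 to legal;
    disposition = 'legal'; fishno = fishno + 1; output;
  end;
  do i = 1 to dead;
    disposition = 'dead'; fishno = fishno + 1; output;
  end;
  do i = 1 to undersize;
    disposition = 'under'; fishno = fishno + 1; output;
  end;
  drop i;
run;
```

The order of the three loops decides which fish numbers get which disposition, just as the order of the IF/ELSE branches does above.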
I need to create a new variable that repeats the earliest date for an ID visit. If the date is missing, the new variable should be missing, and after a missing value it should keep the earliest date since the missing one (like in the example). I've tried the LAG function and it didn't work; I also tried the keep function, but it just repeats 25NOV2015 for all records. The final result ("what I need") is in the last column.
Thanks
Example
You need to use the retain statement. Retain means the variable is not reset to missing at the start of each observation, so on the next iteration of the data step it still holds its previous value.
Sample data
data a;
input date;
format date ddmmyy10.;
datalines;
.
5
6
7
.
1
2
.
9
;
run;
Solution
data b;
set a;
retain new_date;
format new_date ddmmyy10.;
if date = . then
new_date = .;
if new_date = . then
new_date = date;
run;
Since you didn't post any data I will make up some. Also since the fact that your variable is a date doesn't really impact the answer I will just use some integers as they are easier to type.
data have ;
input id value @@ ;
cards;
1 . 1 2 1 3 1 . 1 5 1 6 1 . 1 8
2 1 2 2 2 3 2 . 2 5 2 6
;;;;
Basically your algorithm says that you want to store the value when either the current value is missing or stored value is missing. With multiple BY groups you would also want to set it when you start a new group.
data want ;
set have ;
by id ;
retain new_value ;
if first.id or missing(new_value) or missing(value)
then new_value=value;
run;
Results:
Obs id value new_value
1 1 . .
2 1 2 2
3 1 3 2
4 1 . .
5 1 5 5
6 1 6 5
7 1 . .
8 1 8 8
9 2 1 1
10 2 2 1
11 2 3 1
12 2 . .
13 2 5 5
14 2 6 5
I'm having trouble using the L1. operator in Stata 14 to create lag variables.
The resulting lag variable is 100% missing values!
gen d = L1.equity
Thanks in advance
There is hardly enough information given in the question to know for certain, but as @Dimitriy V. Masterov suggested by asking how your data is tsset, you likely have an issue there.
As a quick example, imagine a panel with two countries, country 1 and country 3, with gdp by country measured over five years:
clear
input float(id year gdp)
1 1 5
1 2 2
1 3 7
1 4 9
1 5 6
3 1 3
3 2 4
3 3 5
3 4 3
3 5 4
end
Now, if you improperly tsset this data, you can easily generate the missing values you describe:
tsset year id
gen lag_gdp = L1.gdp
Notice that this generates 10 missing values. In this example, it happens because the panel and time variables are in the wrong order (tsset expects the panel variable first and the time variable second, so the correct call here is tsset id year), and the incorrectly specified time variable has gaps (period 1 and period 3, but no period 2).
Something else I have witnessed is someone trying to tsset by their time variable and their analysis variable, which is also incorrect:
clear
input float(year gdp)
1 5
2 3
3 2
4 4
5 7
end
tsset year gdp
gen d = L1.gdp
I suspect you are having a similar issue.
Without knowing what your data looks like or how it is tsset there is no possible way to diagnose this, but it is very likely an issue with how the data is tsset.
I have a dataset that is unique by 5 variables, with two dependent variables. My goal is for this dataset to have appended to it additional rows with TOTAL as the value of independent variables, with the values of the dependent variables changing accordingly.
To do this for a single independent variable is not a problem, I would do something along the lines of:
proc sql;
create table want as
select "TOTAL" as independent_var1,
independent_var2,
...
independent_var5,
sum(dependent_1) as dependent_1,
sum(dependent_2) as dependent_2
from have
group by independent_var2,...,independent_var5;
quit;
Followed by appending the original dataset in whatever fashion you choose. However, I want the above, yet x5 (for each independent variable), and then again for each possible combination of TOTAL/nontotal across the 5 independent variables. Not sure just how many datasets that is off the top of my head...but it's a decent amount.
So best strategy I've come up with so far is to use the above with some mildly creative macro code to generate all possible table combinations of total/non-total, but it seems like SAS just might have a better way, maybe tucked away in an esoteric proc step I've never heard of...
--
Attempt to show example, using three independent variables and 1 dependent variable:
Ind1|2|3|Dependent1
0 0 0 1
0 0 1 3
0 1 0 5
0 1 1 7
Desired output would be:
0 0 ALL 4
0 1 ALL 12
0 ALL 0 6
0 ALL 1 10
ALL 0 0 1
ALL 0 1 3
ALL 1 0 5
ALL 1 1 7
0 ALL ALL 16
ALL 0 ALL 4
ALL 1 ALL 12
ALL ALL 0 6
ALL ALL 1 10
ALL ALL ALL 16
0 0 0 1
0 0 1 3
0 1 0 5
0 1 1 7
I may have forgotten some combinations, but that should serve to get the point across.
PROC MEANS should do this for you trivially. You need to clean up the output in order to get it to perfectly match what you want (missing for INDx = "ALL" in your example) but otherwise it gets the calculations done properly.
data have;
input Ind1 Ind2 Ind3 Dependent1;
datalines;
0 0 0 1
0 0 1 3
0 1 0 5
0 1 1 7
;;;;
run;
proc means data=have;
class ind1 ind2 ind3;
var dependent1;
output out=want sum=;
run;
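The cleanup step mentioned above could look something like the following. It is a sketch under assumptions: the dataset name want_labeled and the variables ind1_c to ind3_c are made up, the class values are assumed to fit in three characters, and have is assumed to contain no genuinely missing class values, so a missing class variable in the PROC MEANS output always means "summed over this variable".

```sas
data want_labeled;
  set want;
  array ind{3} ind1-ind3;
  array lab{3} $3 ind1_c ind2_c ind3_c;  /* hypothetical character copies */
  do i = 1 to 3;
    if missing(ind{i}) then lab{i} = 'ALL';  /* collapsed by PROC MEANS */
    else lab{i} = cats(ind{i});              /* numeric value as text */
  end;
  drop i;
run;
```

The automatic _TYPE_ variable in want identifies which combination of class variables each row belongs to, with _TYPE_=0 being the grand total, so it can be used to filter or order the rows.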