I am currently restructuring my package from SAS Base to SAS Enterprise Guide as part of a knowledge transfer to a client. Unfortunately, one thing I have to sacrifice is using compress in my proc sql left joins; I have to switch to strip instead. For example, the following code doesn't work:
data have;
input ID VarA;
datalines;
1 2
2 3
3 4
4 5
;
run;
data have1;
input ID Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9;
datalines;
1 3 4 6 7 3 6 6 7 8
2 2 2 2 2 5 6 7 2 1
3 5 6 7 8 4 5 3 4 3
4 3 4 6 7 4 6 8 3 6
;
run;
proc sql;
create table Want as
select a.*
,b.Var1
,b.Var2
,b.Var3
,b.Var4
,b.Var5
,b.Var6
,b.Var7
,b.Var8
,b.Var9
from Have as a
left join Have1 as b
on compress(a.ID) = compress(b.ID);
quit;
I can use the strip function at times, but it is safer to deliver a package with compress because there are often misplaced spaces in observations. Any ideas?
Edit: to avoid further confusion, I usually use the compress function to look up reference rates of bonds such as EURIBOR 006m. That makes my generic example incorrect, but the left join typically uses character variables.
You need a character variable to use the compress function. Your ID variables are numeric.
Try converting to character:
on compress(put(a.ID,8.)) = compress(put(b.ID,8.));
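For the character keys mentioned in the question's edit, a minimal sketch of the same join might look like the following. The table names, variable names and values here (bonds, rates, ref, notional, rate) are made up for illustration and are not from the original post. Because compress removes every blank, both spellings of the reference rate match; strip would only trim leading and trailing blanks, so the first row would not find its rate.

```sas
/* Hypothetical data: the same reference rate keyed with and without an embedded blank */
data bonds;
  infile datalines dsd truncover;   /* dsd: comma-delimited, so embedded blanks are kept */
  length ref $20;
  input ref $ notional;
datalines;
EURIBOR 006m,100
EURIBOR006m,250
;
run;

data rates;
  infile datalines dsd truncover;
  length ref $20;
  input ref $ rate;
datalines;
EURIBOR006m,0.5
;
run;

proc sql;
  create table want_char as
  select a.ref
        ,a.notional
        ,b.rate
  from bonds as a
  left join rates as b
    /* compress() strips all blanks from both keys before comparing */
    on compress(a.ref) = compress(b.ref);
quit;
```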
I want to keep only the row with the highest rank1 for each team. If there is a tie, I want the row with the higher rank2. And then the higher rank3.
For example,
data test;
input name $ team $ rank1 rank2 rank3 country $;
datalines;
Bob A 5 6 5 US
Joe A 8 2 6 UK
Dav B 9 7 2 GER
Jim B 9 4 4 FRA
Bob C 3 4 1 FRA
Dan D 5 2 7 GER
Ike D 5 2 7 US
Jay D 5 2 8 UK
run;
I want:
Joe A 8 2 6 UK
Dav B 9 7 2 GER
Bob C 3 4 1 FRA
Jay D 5 2 8 UK
What is the most efficient way to do this? The dataset I'm working with is pretty big and is not sorted. I tried the below code but the sorts take forever to run. And the second sort sorts already sorted data. What if most teams only appear once in the dataset? Is it faster to split into duplicates and non-duplicates, sort only the duplicates and then append?
proc sort data=test;
by team descending rank1 descending rank2 descending rank3;
run;
proc sort data=test nodupkey;
by team;
run;
You can do that with PROC SUMMARY. Not sure about performance compared to what you are already doing.
proc summary data=test nway;
class team;
output out=ranked(drop=_:) idgroup(max(rank:) out(name rank: country)=);
run;
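If the sort-based approach is kept, the second PROC SORT from the question can probably be replaced by a DATA step that keeps the first row of each BY group, so the data is sorted only once. This is a sketch, not a benchmarked solution; the output names sorted and want are assumptions.

```sas
/* Sort once so the best row of each team comes first */
proc sort data=test out=sorted;
  by team descending rank1 descending rank2 descending rank3;
run;

/* Keep only the first row per team; no second sort is needed */
data want;
  set sorted;
  by team;
  if first.team;
run;
```

The DATA step reads the sorted data in a single sequential pass, which avoids re-sorting data that is already in order.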
I'm trying to detect specific values of one variable and create a new one if those conditions are fulfilled.
Here's part of my data (I have many more rows):
id time result
1 1 normal
1 2 normal
1 3 abnormal
2 1 normal
2 2
3 3 normal
4 1 normal
4 2 normal
4 3 abnormal
5 1 normal
5 2 normal
5 3
What I want
id time result base
1 1 normal
1 2 normal x
1 3 abnormal
2 1 normal x
2 2
2 3 normal
3 3 normal
4 1 normal
4 2 normal x
4 3 abnormal
5 1 normal
5 2 normal x
5 3
My baseline value (base) should be populated when result exists at timepoint (time) 2. If there's no result then baseline should be at time=1.
if result="" and time=2 then do;
if time=10 and result ne "" then base=X; end;
if result ne "" and time=2 then base=X;
It works correctly when time=2 and result exists. But when result is missing, something goes wrong.
The question seems a bit off. "Else if time="" and time=1": there seems to be a typo in there somewhere.
However, your syntax seems solid. I've worked through an example with your given data. The first condition works, but the second (else if) is an assumption on my part. I will update this as the question is updated.
options missing='';
data begin;
input id time result $ 5-20 ;
datalines;
1 1 normal
1 2 normal
1 3 abnormal
2 1 normal
2 2
3 3 normal
4 1 normal
4 2 normal
4 3 abnormal
;
run;
data flagged;
set begin;
if time=2 and result NE "" then base='X';
else if time=1 and id=2 then base='X';
run;
Edit based on revisited question.
This assumes that time-point 1 is always adjacent to time-point 2 (if not, add more lags). To simulate a LEAD function, we sort the data in reverse order and use LAG.
proc sort data=begin; by id descending time; run;
data flagged;
set begin;
if lag(time)=2 and lag(result) EQ "" then base='X';
if time=2 and result NE "" then base='X';
run;
More about opposite of lag: https://communities.sas.com/t5/SAS-Communities-Library/How-to-simulate-the-LEAD-function-opposite-of-LAG/ta-p/232151
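One caveat with the reverse-sort approach is that LAG also looks across id boundaries, so the first row of one id sees the last row of the previous id. A hedged variant that guards against this is sketched below; the names begin_desc, flagged_safe and the prev_ variables are assumptions introduced for illustration.

```sas
proc sort data=begin out=begin_desc;
  by id descending time;
run;

data flagged_safe;
  set begin_desc;
  length base $1;
  /* Call LAG unconditionally so its queue is updated on every row */
  prev_id     = lag(id);
  prev_time   = lag(time);
  prev_result = lag(result);
  /* Result present at time 2: flag that row */
  if time = 2 and result ne "" then base = 'X';
  /* Otherwise flag time 1, but only if the time-2 row of the same id was empty */
  else if time = 1 and prev_id = id and prev_time = 2 and prev_result = "" then base = 'X';
  drop prev_:;
run;
```

Sort the result by id and time again if the original order is needed.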
I need to create a new variable that repeats the earliest date for an ID visit. If the date is missing, the new variable should be missing too; after a missing value, it should keep the earliest date since that missing value (like in the example). I've tried the LAG function and it didn't work; I also tried the keep function, but it just repeats 25NOV2015 for all records. The final result ("what I need") is in the last column.
Thanks
Example
You need to use the retain statement. Retain means the variable's value is not reset to missing at the start of each observation, so in the next iteration of the data step the variable remembers its value.
Sample data
data a;
input date;
format date ddmmyy10.;
datalines;
.
5
6
7
.
1
2
.
9
;
run;
Solution
data b;
set a;
retain new_date;
format new_date ddmmyy10.;
if date = . then
new_date = .;
if new_date = . then
new_date = date;
run;
Since you didn't post any data I will make up some. Also since the fact that your variable is a date doesn't really impact the answer I will just use some integers as they are easier to type.
data have ;
input id value @@ ;
cards;
1 . 1 2 1 3 1 . 1 5 1 6 1 . 1 8
2 1 2 2 2 3 2 . 2 5 2 6
;;;;
Basically your algorithm says that you want to store the value when either the current value is missing or stored value is missing. With multiple BY groups you would also want to set it when you start a new group.
data want ;
set have ;
by id ;
retain new_value ;
if first.id or missing(new_value) or missing(value)
then new_value=value;
run;
Results:
                       new_
Obs    id    value    value
  1     1        .        .
  2     1        2        2
  3     1        3        2
  4     1        .        .
  5     1        5        5
  6     1        6        5
  7     1        .        .
  8     1        8        8
  9     2        1        1
 10     2        2        1
 11     2        3        1
 12     2        .        .
 13     2        5        5
 14     2        6        5
Say I have two rows of data that I am trying to read in.
cody: 10 9 20 18
john: 4 5 1 2
and I want to read them in a two row style in datalines, like such:
input cody john @@;
datalines;
10 9 20 18
4 5 1 2
run;
But this reads it in like cody: 10 20 4 1 john: 9 18 5 2
How do I fix this?
You'd need to read in the CODY lines all at once, then the JOHN lines all at once. It's unclear what the final data structure should look like, but this is one possibility, and then you can restructure this how you wish, perhaps with PROC TRANSPOSE.
Basically, I assign name to the proper name (using an array here, but you can do this in better ways, data-driven ways, depending on your data). Then I loop and tell SAS to keep reading in data until it is unable to read any more, using the truncover option (or missover is also fine) to make sure it doesn't skip to the next line, and output a new row for each value.
data want;
array names[2] $ _temporary_ ("Cody","John") ;
infile datalines truncover;
do _name = 1 to 2;
name = names[_name];
do _i = 1 by 1 until (missing(value));
input value @;
if not missing(value) then output;
end;
input;
end;
drop _:;
datalines;
10 9 20 18
4 5 1 2
run;
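The PROC TRANSPOSE restructuring mentioned above could be sketched as follows, assuming the goal is one column per name. It builds on the want data set from the step above; the names long, wide and seq are assumptions added for illustration.

```sas
/* Number the values within each name: 1, 2, 3, ... */
data long;
  set want;
  by name notsorted;
  if first.name then seq = 0;
  seq + 1;
run;

proc sort data=long;
  by seq;
run;

/* One row per position, one column per name (Cody, John) */
proc transpose data=long out=wide(drop=_name_);
  by seq;
  id name;
  var value;
run;
```

With the sample data this should give four rows, with Cody holding 10, 9, 20, 18 and John holding 4, 5, 1, 2.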
I think that the solution to your problem is to use the names as another column, not as variables, like this:
data foo;
input var1 $ var2 var3 var4 var5;
datalines;
cody 10 9 20 18
john 4 5 1 2
;
run;
I am having trouble using the L1 operator in Stata 14 to create lag variables.
The resulting lag variable is 100% missing values!
gen d = L1.equity
Thanks in advance.
There is hardly enough information given in the question to know for certain, but as @Dimitriy V. Masterov suggested by questioning how your data is tsset, you likely have an issue there.
As a quick example, imagine a panel with two countries, country 1 and country 3, with gdp by country measured over five years:
clear
input float(id year gdp)
1 1 5
1 2 2
1 3 7
1 4 9
1 5 6
3 1 3
3 2 4
3 3 5
3 4 3
3 5 4
end
Now, if you improperly tsset this data, you can easily generate the missing values you describe:
tsset year id
gen lag_gdp = L1.gdp
And notice now how you have 10 missing values generated. In this example, it happens because the panel and time variables are out of order and the (incorrectly specified) time variable has gaps (period 1 and period 3, but no period 2).
Something else I have witnessed is someone trying to tsset by their time variable and their analysis variable, which is also incorrect:
clear
input float(year gdp)
1 5
2 3
3 2
4 4
5 7
end
tsset year gdp
gen d = L1.gdp
I suspect you are having a similar issue.
Without knowing what your data looks like or how it is tsset there is no possible way to diagnose this, but it is very likely an issue with how the data is tsset.