I have a dataset, say "dataset", and it is composed by two variables, say "id" and "text".
in the variable "text" I may have something like "A" or "A, B" or "A, B, C" or "D" (or any possible combination of the 4 letters, see the image). what I want at the end of the day is a dataset with dummy variables, so something like text_A where text_A contains 1 if A is in text and so on.
Now I have written some code that actually doesn't work:
this creates the "dummified" dataset, and it works
DATA dummified_dataset;
SET starting_dataset;
text_A = 0;
text_B = 0;
text_C = 0;
text_D = 0;
RUN;
here it is my code to do the job:
DATA want;
SET dummified_dataset;
DO i = 1 TO COUNTW(text, ",");
%LET var_text = SCAN(text, i, ",");
IF &var_text = "A" OR &var_text = "a" THEN text_A = 1;
IF &var_text = "B" OR &var_text = "b" THEN text_B = 1;
IF &var_text = "C" OR &var_text = "c" THEN text_C = 1;
IF &var_text = "D" OR &var_text = "d" THEN text_D = 1;
END;
RUN;
with the code above it seems to process only the first element: if a row contains text = "A,B,C,D" only text_A is put to 1 and the others remain to 0
Good start but not sure why you switched to macro logic halfway through.
First take your data set and create a new row for each letter attached to the ID instead of trying to deal with it in the current structure.
Then use PROC FREQ which will generate that table above but not in a data set. The data set will need to be restructured if you want it in that output. For that step, use PROC TRANSPOSE to get your final structure.
DATA long;
SET dummified_dataset;
DO i = 1 TO COUNTW(text, ",");
var_text = SCAN(text, i, ",");
output;
END;
RUN;
proc freq data=long;
table id*var_text / out=long_freq sparse nopercent nocol norow;
run;
proc transpose data=long out=long_freq;
by id;
id var_text;
var count;
run;
Related
libname Prob 'Y:\alsdkjf\alksjdfl';
the geoid here is char I wanna convert to num to be able to merge by id;
data Problem2_1;
set Prob.geocode;
id = substr(GEOID, 8, 2);
id = input(id, best5.);
output;
run;
geoid here is numeric;
data Problem2_2; c
set Prob.households;
id = GEOID;
output;
run;
data Problem2_3;
merge Problem2_1
Problem2_2 ;
by ID;
run;
proc print data = Problem2_3;
*ERROR: Variable geoid has been defined as both character and numeric.
*ERROR: Variable id has been defined as both character and numeric.
It looks like you could replace these two lines:
id = substr(GEOID, 8, 2);
id = input(id, best5.);
With:
id = input(substr(GEOID, 8, 2), best.);
This would mean that both merge datasets contain numeric id variables.
SAS requires the linking id to be same data type. This means that you have to convert the int to string or vice versa. Personally, I prefer to convert to numeric when ever it is possible.
A is a worked out example:
/*Create some dummy data for testing purposes:*/
data int_id;
length id 3 dummy $3;
input id dummy;
cards;
1 a
2 b
3 c
4 d
;
run;
data str_id;
length id $1 dummy2 $3;
input id dummy2;
cards;
1 aa
2 bb
3 cc
4 dd
;
run;
/*Convert string to numeric. Int in this case.*/
data str_id_to_int;
set str_id;
id2 =id+0; /* or you could use something like input(id, 8.)*/
/*The variable must be new. id=id+0 does _not_ work.*/
drop id; /*move id2->id */
rename id2=id;
run;
/*Same, but other way around. Imho, trickier.*/
data int_id_to_str;
set int_id;
id2=put(id, 1.); /*note that '1.' refers to lenght of 1 */
/*There are other ways to convert int to string as well.*/
drop id;
rename id2=id;
run;
/*Testing. Results should be equivalent */
data merged_by_str;
merge str_id(in=a) int_id_to_str(in=b);
by id;
if a and b;
run;
data merged_by_int;
merge int_id(in=a) str_id_to_int(in=b);
by id;
if a and b;
run;
For Problem2_1, if your substring contains only numbers you can coerce it to numeric by adding zero. Something like this should make ID numeric and then you could merge with Problem2_2.
data Problem2_1;
set Prob.geocode;
temp = substr(GEOID, 8, 2);
id = temp + 0;
drop temp;
run;
EDIT:
Your original code originally defines ID as the output of substr, which is character. This should work as well:
data Problem2_1;
set Prob.geocode;
temp = substr(GEOID, 8, 2);
id = input(temp, 8.0);
drop temp;
run;
Got the following example
I'm trying to know if any part of string in the column nomvar in table tata does exist in col1 in table toto and if yes, give me the definition using col2.
For I2010,RT,IS-IPI,F_CC11_X_CCXBA, I would have in the column intitule "yes,toto,tata,well"
I thought about using a proc sql with an insert and a select but I have two tables and I would need to do a join.
In the same time, I thought to have everything in one table but I'm unsure if it is a good idea.
Any suggestions are welcomed as I'm deeply stuck.
The SAS data step hash object is a nice way to do this. It allows you to read the Toto table into memory and it becomes a lookup table for you. Then you just walk the string from the Tata table using the scan function, tokenize, and lookup the col2 value. Here is the code.
By the way, turning table Tata into a structure like Toto and performing join is a perfectly rational way to do this, too.
/*Create sample data*/
data toto;
length col1 col2 $ 100;
col1='I2010';
col2='yes';
output;
col1='RT';
col2='toto';
output;
col1='IS-IPI';
col2='tata';
output;
col1='F_CC11_X_CCXBA';
col2='well';
output;
run;
data tata;
length nomvar intitule $ 100;
nomvar='I2010,RT,IS-IPI,F_CC11_X_CCXBA';
run;
/*Now for the solution*/
/*You can do this lookup easily with a data step hash object*/
data tata;
set tata;
length col1 col2 token $ 100;
drop col1 col2 token i sepchar rc;
/*slurp the data in from the Toto data set into the hash*/
if (_n_ = 1) then do;
declare hash toto_hash(dataset: 'work.toto');
rc = toto_hash.definekey('col1');
rc = toto_hash.definedata('col2');
toto_hash.definedone();
end;
/*now walk the tokens in data set tata and perform the lookup to get each value*/
i = 1;
sepchar = ''; /*this will be a comma after the first iteration of the loop*/
intitule = '';
do until (token = '');
/*grab nth item in the comma-separated list*/
token = scan(nomvar, i, ',');
/*lookup the col2 value from the toto data set*/
rc = toto_hash.find(key:token);
if (rc = 0) then do;
/*lookup successful so tack the value on*/
intitule = strip(intitule) || sepchar || col2;
sepchar = ',';
end;
i = i + 1;
end;
run;
Assuming your data is all structured like this (you're looking at the different strings in between . characters) I would think the easiest way is to normalize TATA (splitting by .) and then doing a straight join, then (if you need to) transposing back. (It might be better to leave it vertical - very likely you would find this more useful structure for analysis.)
data tata_v;
set tata;
call scan(nomvar,1,position,length,'.');
do _i = 1 by 1 while position le 0);
nomvar_out = substr(nomvar,position,length);
output;
call scan(nomvar,_i+1,position,length,'.');
end;
run;
Now you can join on nomvar_out and then (if needed) recombine things.
I have a SAS dataset as follow :
Key A B C D E
001 1 . 1 . 1
002 . 1 . 1 .
Other than keeping the existing varaibales, I want to replace variable value with the variable name if variable A has value 1 then new variable should have value A else blank.
Currently I am hardcoding the values, does anyone has a better solution?
The following should do the trick (the first dstep sets up the example):-
data test_data;
length key A B C D E 3;
format key z3.; ** Force leading zeroes for KEY;
key=001; A=1; B=.; C=1; D=.; E=1; output;
key=002; A=.; B=1; C=.; D=1; E=.; output;
proc sort;
by key;
run;
data results(drop = _: i);
set test_data(rename=(A=_A B=_B C=_C D=_D E=_E));
array from_vars[*] _:;
array to_vars[*] $1 A B C D E;
do i=1 to dim(from_vars);
to_vars[i] = ifc( from_vars[i], substr(vname(from_vars[i]),2), '');
end;
run;
It all looks a little awkward as we have to rename the original (assumed numeric) variables to then create same-named character variables that can hold values 'A', 'B', etc.
If your 'real' data has many more variables, the renaming can be laborious so you might find a double proc transpose more useful:-
proc transpose data = test_data out = test_data_tran;
by key;
proc transpose data = test_data_tran out = results2(drop = _:);
by key;
var _name_;
id _name_;
where col1;
run;
However, your variables will be in the wrong order on the output dataset and will be of length $8 rather than $1 which can be a waste of space. If either points are important (they rsldom are) and both can be remedied by following up with a length statement in a subsequent datastep:-
option varlenchk = nowarn;
data results2;
length A B C D E $1;
set results2;
run;
option varlenchk = warn;
This organises the variables in the right order and minimises their length. Still, you're now hard-coding your variable names which means you might as well have just stuck with the original array approach.
I have a dataset like this(sp is an indicator):
datetime sp
ddmmyy:10:30:00 N
ddmmyy:10:31:00 N
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
ddmmyy:10:34:00 N
And I would like to extract observations with "Y" and also the previous and next one:
ID sp
ddmmyy:10:31:00 N
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
I tired to use "lag" and successfully extract the observations with "Y" and the next one, but still have no idea about how to extract the previous one.
Here is my try:
data surprise_6_step3; set surprise_6_step2;
length lag_sp $1;
lag_sp=lag(sp);
if sp='N' and lag(sp)='N' then delete;
run;
and the result is:
ID sp
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
Any methods to extract the previous observation also?
Thx for any help.
Try using the point option in set statement in data step.
Like this:
data extract;
set surprise_6_step2 nobs=nobs;
if sp = 'Y' then do;
current = _N_;
prev = current - 1;
next = current + 1;
if prev > 0 then do;
set x point = prev;
output;
end;
set x point = current;
output;
if next <= nobs then do;
set x point = next;
output;
end;
end;
run;
There is an implicite loop through dataset when you use it in set statement.
_N_ is an automatic variable that contains information about what observation is implicite loop on (starts from 1). When you find your value, you store the value of _N_ into variable current so you know on which row you have found it. nobs is total number of observations in a dataset.
Checking if prev is greater then 0 and if next is less then nobs avoids an error if your row is first in a dataset (then there is no previous row) and if your row is last in a dataset (then there is no next row).
/* generate test data */
data test;
do dt = 1 to 100;
sp = ifc( rand("uniform") > 0.75, "Y", "N" );
output;
end;
run;
proc sql;
create table test2 as
select *,
monotonic() as _n
from test
;
create table test3 ( drop= _n ) as
select a.*
from test2 as a
full join test2 as b
on a._n = b._n + 1
full join test2 as c
on a._n = c._n - 1
where a.sp = "Y"
or b.sp = "Y"
or c.sp = "Y"
;
quit;
Consider this simple example:
data abc;
length a $2 b $1;
a = "aa";
b= "b";
run;
data def;
length a $1 b $2;
a = "a";
b= "bb";
run;
data ghi;
set abc def;
run;
In this example the dataset ghi has two variables but their length is determined by what's in dataset abc. Is there a way (without writing macros) to append two datasets so that if the variable names are the same the longer length takes precedence? That is in this example both a and b in dataset ghi is of length 2.
If you don't have very many variables, you can manually generate the length statement for your combined dataset like below. Notice that there is the 32k char limit on the length of the macro variable value. This may also mess up the existing order of the variables.
/* test data */
data abc;
length a $2 b $1;
a = "aa";
b= "b";
run;
data def;
length a $1 b $2;
a = "a";
b= "bb";
run;
/* max lenghs for each and every char type var */
proc sql;
create view abcview as
select name, length
from dictionary.columns
where libname="WORK" and memname="ABC" and type="char";
create view defview as
select name, length
from dictionary.columns
where libname="WORK" and memname="DEF" and type="char";
select catx(" $", a.name, max(a.length, d.length))
into :lengths separated by " "
from abcview as a, defview as d
where a.name = d.name;
quit;
data ghi;
length &lengths;
set abc def;
run;
proc contents data=ghi;
run;
/* on lst - in part
# Variable Type Len
1 a Char 2
2 b Char 2
*/