libname Prob 'Y:\alsdkjf\alksjdfl';
the geoid here is char I wanna convert to num to be able to merge by id;
data Problem2_1;
set Prob.geocode;
id = substr(GEOID, 8, 2);
id = input(id, best5.);
output;
run;
geoid here is numeric;
data Problem2_2; c
set Prob.households;
id = GEOID;
output;
run;
data Problem2_3;
merge Problem2_1
Problem2_2 ;
by ID;
run;
proc print data = Problem2_3;
*ERROR: Variable geoid has been defined as both character and numeric.
*ERROR: Variable id has been defined as both character and numeric.
It looks like you could replace these two lines:
id = substr(GEOID, 8, 2);
id = input(id, best5.);
With:
id = input(substr(GEOID, 8, 2), best.);
This would mean that both merge datasets contain numeric id variables.
SAS requires the linking id to be same data type. This means that you have to convert the int to string or vice versa. Personally, I prefer to convert to numeric when ever it is possible.
A is a worked out example:
/*Create some dummy data for testing purposes:*/
data int_id;
length id 3 dummy $3;
input id dummy;
cards;
1 a
2 b
3 c
4 d
;
run;
data str_id;
length id $1 dummy2 $3;
input id dummy2;
cards;
1 aa
2 bb
3 cc
4 dd
;
run;
/*Convert string to numeric. Int in this case.*/
data str_id_to_int;
set str_id;
id2 =id+0; /* or you could use something like input(id, 8.)*/
/*The variable must be new. id=id+0 does _not_ work.*/
drop id; /*move id2->id */
rename id2=id;
run;
/*Same, but other way around. Imho, trickier.*/
data int_id_to_str;
set int_id;
id2=put(id, 1.); /*note that '1.' refers to lenght of 1 */
/*There are other ways to convert int to string as well.*/
drop id;
rename id2=id;
run;
/*Testing. Results should be equivalent */
data merged_by_str;
merge str_id(in=a) int_id_to_str(in=b);
by id;
if a and b;
run;
data merged_by_int;
merge int_id(in=a) str_id_to_int(in=b);
by id;
if a and b;
run;
For Problem2_1, if your substring contains only numbers you can coerce it to numeric by adding zero. Something like this should make ID numeric and then you could merge with Problem2_2.
data Problem2_1;
set Prob.geocode;
temp = substr(GEOID, 8, 2);
id = temp + 0;
drop temp;
run;
EDIT:
Your original code originally defines ID as the output of substr, which is character. This should work as well:
data Problem2_1;
set Prob.geocode;
temp = substr(GEOID, 8, 2);
id = input(temp, 8.0);
drop temp;
run;
Related
I'm working with some SAS data, and am trying to figure out how to find a record's sort position in a datastep while using as few steps as possible.
Here's an example --
data Places;
infile datalines delimiter=',';
input state $ city $40. ;
datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
Proc Sort data=WORK.PLACES;
by STATE CITY;
run;
data WORK.PLACES;
set WORK.PLACES;
by STATE CITY;
ST_CITY_RNK = _N_;
run;
Proc Sort data=WORK.PLACES;
by CITY;
run;
data WORK.PLACES;
set WORK.PLACES;
by CITY;
CITY_RNK = _N_;
run;
In this example, is there a way to calculate ST_CITY_RNK and CITY_RNK without sorting multiple times? It feels like this should be possible with ordered hash tables, but I'm not sure how to go about doing it.
Thank you!
Hash table would be doable. Temporary arrays would have roughly the same effect and might be a bit easier.
The major limitation of either is what do you do with non-unique city names? Salem, Oregon and Salem, Massachusetts? Obviously in state-city rank that's fine, though you may find states with more than one Lincoln or similar, who knows; but in just City you'll certainly find several Columbias, Lincolns, Charlestons, etc. My solution gives the same sort rank to all of them (but would then skip forward 6 or whatever to the next one). The data step solution you post above would give them unique ranks. The hash iterator could probably do either one. You could tweak this with some effort to give unique ranks, but it would be work.
data Places;
infile datalines delimiter=',';
input state $ city $40. ;
datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
run;
data sortrank;
*Init pair of arrays - the one that stores the original values, and one to mangle by sorting;
array states[32767] $ _temporary_;
array states_cities_sorted[32767] $40. _temporary_ (32767*'ZZZZZ');
array cities[32767] $40. _temporary_;
array cities_sorted[32767] $40. _temporary_ (32767*'ZZZZZ');
*Iterate over the dataset, load into arrays;
do _n_ = 1 by 1 until (Eof);
set places end=eof;
states[_n_] = state;;
states_cities_sorted[_n_] = catx(',',state,city);
cities[_n_] = city;
cities_sorted[_n_] = city;
end;
*Sort the to-be-sorted arrays;
call sortc(of states_cities_sorted[*]);
call sortc(of cities_sorted[*]);
do _i = 1 to _n_;
*For each array element, look up the rank using `whichc`, looking for the value of the unsorted element in the sorted list;
city_rank = whichc(cities[_i],of cities_sorted[*]);
state_cities_rank = whichc(catx(',',states[_i],cities[_i]),of states_cities_sorted[*]);
*And put the array elements back in their proper variables;
city = cities[_i];
state= states[_i];
*And finally make a row output;
output;
end;
run;
For reference, here's a hash approach:
data Places;
infile datalines delimiter=',';
input state $ city $40. ;
datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
run;
data places;
set places;
if _n_ = 1 then do;
declare hash h1(ordered:'a',dataset:'places');
rc = h1.definekey('city');
rc = h1.definedata('city');
rc = h1.definedone();
declare hiter hi1('h1');
declare hash h2(ordered:'a',dataset:'places');
rc = h2.definekey('state','city');
rc = h2.definedata('state','city');
rc = h2.definedone();
declare hiter hi2('h2');
end;
t_city = city;
t_state = state;
rc = hi1.first();
do city_rank = 1 by 1 until(t_city = city);
rc = hi1.next();
end;
rc = hi2.first();
do state_city_rank = 1 by 1 until(t_city = city and t_state = state);
rc = hi2.next();
end;
state = t_state;
city = t_city;
drop t_:;
run;
I have data that looks like this:
ID Sequence
---------------------------------
101 E6S,K11T,Q174K,D177E
102 K11T,V245EKQ
I need to add:
A new column with column heading for each sequence, add prefix 'RT', drop the letters following the numeric part of the sequence
Fill the new column with the letters that follow the numeric part
of the sequence
I need to create this:
ID Sequence RTE6 RTK11 RTQ174 RTD177 RTV245
-----------------------------------------------------------------------
101 E6S,K11T,Q174K,D177E S T K E
102 K11T,V245EKQ T EKQ
I assume you want a SAS data set and not a report. ANYDIGIT makes it pretty easy to find the last non-digit sub-string.
data seq;
infile cards firstobs=3;
input id:$3. sequence :$50.;
cards;
ID Sequence
---------------------------------
101 E6S,K11T,Q174K,D177E
102 K11T,V245EKQ
;;;;
run;
proc print;
run;
data seq2V / View=seq2V;
set seq;
length w name sub $32 subl 8;
do i = 1 by 1;
w = scan(sequence,i,',');
if missing(w) then leave;
subl = anydigit(w,-99);
name = substrn(w,1,subl);
sub = substrn(w,subl+1);
output;
end;
run;
proc transpose data=seq2V out=seq3(drop=_name_) prefix=RT;
by id sequence;
var sub;
id name;
run;
proc print;
run;
I had a similar problem a while ago. The code is adapted to your problem.
If found this solution to work faster than anything I tried with proc transpose.
Still overall performance on huge datasets (espc. using many different sequences) is not great at all, as we loop 2*2 over all strings and also the final variables.
Can anyone offer a faster solution?
(Caution: MacroVar is limited to 65534 Characters.)
data var_name ;
set in_data;
length var string $30.;
do i = 1 to countw(Sequence, ',');
string = scan(Sequence,i,',');
var = substr(string,1,anydigit(string,-99));
output;
keep var;
end;
run;
proc sql noprint;
select distinct compress("RT"||var) into :var_list separated by ' '
from var_name;
quit;
%put &var_list.;
data out_data;
set in_data;
length string &var_list. $30. n 8. ;
array a_var [*] &var_list.;
do i = 1 to countw(Sequence, ',');
string = scan(Sequence,i,',');
do j = 1 to dim(a_var);
n = anydigit(string,-99) ;
if substr(vname(a_var[j]),3) eq substr(string,1,n) then a_var[j] = substr(string,n+1);
end;
end;
drop string i j n;
run;
I have a SAS dataset as follow :
Key A B C D E
001 1 . 1 . 1
002 . 1 . 1 .
Other than keeping the existing varaibales, I want to replace variable value with the variable name if variable A has value 1 then new variable should have value A else blank.
Currently I am hardcoding the values, does anyone has a better solution?
The following should do the trick (the first dstep sets up the example):-
data test_data;
length key A B C D E 3;
format key z3.; ** Force leading zeroes for KEY;
key=001; A=1; B=.; C=1; D=.; E=1; output;
key=002; A=.; B=1; C=.; D=1; E=.; output;
proc sort;
by key;
run;
data results(drop = _: i);
set test_data(rename=(A=_A B=_B C=_C D=_D E=_E));
array from_vars[*] _:;
array to_vars[*] $1 A B C D E;
do i=1 to dim(from_vars);
to_vars[i] = ifc( from_vars[i], substr(vname(from_vars[i]),2), '');
end;
run;
It all looks a little awkward as we have to rename the original (assumed numeric) variables to then create same-named character variables that can hold values 'A', 'B', etc.
If your 'real' data has many more variables, the renaming can be laborious so you might find a double proc transpose more useful:-
proc transpose data = test_data out = test_data_tran;
by key;
proc transpose data = test_data_tran out = results2(drop = _:);
by key;
var _name_;
id _name_;
where col1;
run;
However, your variables will be in the wrong order on the output dataset and will be of length $8 rather than $1 which can be a waste of space. If either points are important (they rsldom are) and both can be remedied by following up with a length statement in a subsequent datastep:-
option varlenchk = nowarn;
data results2;
length A B C D E $1;
set results2;
run;
option varlenchk = warn;
This organises the variables in the right order and minimises their length. Still, you're now hard-coding your variable names which means you might as well have just stuck with the original array approach.
I am reading a .txt file into SAS, that uses "|" as the delimiter. The issue is there is one column that is using "|" as a word separator as well instead of acting like delimiter, this needs to be in one column.
For example the txt file looks like:
apple|fruit|Healthy|choices|of|food|12|2012|chart
needs to look like this in the SAS dataset:
apple | fruit | Healthy choices of Food | 12 | 2012 | chart
How do I eliminate "|" between "Healthy choices of Food"?
I think this will do what you want:
data tmp1;
length tmp $100;
input tmp $;
cards;
apple|fruit|Healthy|choices|of|food|12|2012|chart
apple|fruit|Healthy|choices|of|food|and|lots|of|other|stuff|12|2012|chart
;
run;
data tmp2;
set tmp1;
num_delims=length(tmp)-length(compress(tmp,"|"));
expected_delims=5;
extra_delims=num_delims-expected_delims;
length new_var $100;
i=1;
do while(scan(tmp,i,"|") ne "");
if i<=2 or (extra_delims+2)<i<=num_delims then new_var=trim(new_var)||scan(tmp,i,"|")||"|";
else new_var=trim(new_var)||scan(tmp,i,"|")||"#";
i+1;
end;
new_var=left(tranwrd(new_var,"#"," "));
run;
This isn't particularly elegant, but it will work:
data tmp;
input tmp $50.;
cards;
apple|fruit|Healthy|choices|of|food|12|2012|chart
;
run;
data tmp;
set tmp;
var1 = scan(tmp,1,'|');
var2 = scan(tmp,2,'|');
var4 = scan(tmp,-3,'|');
var5 = scan(tmp,-2,'|');
var6 = scan(tmp,-1,'|');
var3 = tranwrd(tmp,trim(var1)||"|"||trim(var2),"");
var3 = tranwrd(var3,trim(var4)||"|"||trim(var5)||"|"||trim(var6),"");
var3 = tranwrd(var3,"|"," ");
run;
Expanding a little on Itzy's answer, here is another possible solution:
data want;
/* Define variables */
attrib item length=$10 label='Item';
attrib class length=$10 label='Family';
attrib desc length=$80 label='Item Description';
attrib count length=8 label='Some number';
attrib year length=$4 label='Year';
attrib somevar length=$10 label='Some variable';
length countc $8; /* A temp variable */
infile 'c:\temp\delimited_temp.txt' lrecl=1000 truncover;
input;
item = scan(_infile_,1,'|','mo');
class = scan(_infile_,2,'|','mo');
countc = scan(_infile_,-3,'|','mo'); /* Temp var for numeric field */
count = inputn(countc,'8.'); /* Re-read the numeric field */
year = scan(_infile_,-2,'|','mo');
somevar = scan(_infile_,-1,'|','mo');
desc = tranwrd(
substr(_infile_
,length(item)+length(class)+3
,length(_infile_)
- ( length(item)+length(class)+length(countc)
+length(year)+length(somevar)+5))
,'|',' ');
drop countc;
run;
The key in this case it to read your file directly and handle the delimiters yourself. This can be tricky and requires that your data file is exactly as described. A much better solution would be to go back to whoever gave this this data and ask them to deliver it to you in a more appropriate form. Good luck!
Another possible workaround.
data tmp;
infile '/path/to/textfile';
input tmp :$100.;
array varlst (*) $30 v1-v6;
a=countw(tmp,'|');
do i=1 to dim(varlst);
if i<=2 then
varlst(i) = scan(tmp,i,'|');
else if i>=4 then
varlst(i) = scan(tmp,a-(dim(varlst)-i),'|');
else do j=3 to a-(dim(varlst)-i)-1;
varlst(i)=catx(' ', varlst(i),scan(tmp,j,'|'));
end;
end;
drop tmp a i j;
run;
I have a variable, textvar, that looks like this:
type=1&name=bob
type=2&name=sue
I want to create a new table that looks like this:
type name
1 bob
2 sue
My approach is to use scan to split the variables on & so for the first observation I have
var1 var2
type=1 name=bob
So now I can use scan again to split on =:
vname = scan(var1, 1, '=');
value = scan(var1, 2, '=');
But how can I now assign value to the variable named vname?
PROC TRANPSOSE is the quickest way. You need an ID variable (dummy or real).
data test;
informat testvar $50.;
input testvar $;
datalines;
type=1&name=bob
type=2&name=sue
;;;;
run;
data test_vert;
set test;
id+1;
length scanner $20 vname vvalue $20;
scanner=scan(testvar,1,"&");
do _t=2 by 1 until (scanner=' ');
vname=scan(scanner,1,"=");
vvalue=scan(scanner,2,"=");
output;
scanner=scan(testvar,_t,"&");
end;
run;
proc transpose data=test_vert out=test_T;
by id;
id vname;
var vvalue;
run;
Does this help? Dynamic variable names in SAS
I think I have some code to address this, but left it at my workplace.
Obviously you haven't included your real data, but can't you just hard code some of the values if the format of the raw data is the same in each row? My code converts the "=" and "&" to "," to make the scan function easier to use.
data want (keep=type name);
set test;
_newvar=translate(testvar,",,","&=");
type=input(scan(_newvar,2),best12.);
length name $20;
name=scan(_newvar,4);
run;