I have these two datasets here:
data ONE;
input ID LastName $ FirstInit $ 1.;
datalines;
509182793 Smith C
319861601 Williams J
345121778 Connor F
480863211 King L
907636280 Franklin D
729082859 Monroe T
835688938 Hall K
;
run;
data TWO;
input ID $ 11. State $ 2.;
datalines;
334-99-5246 TX
480-86-3211 MD
449-55-9407 VA
345-12-1778 GA
907-63-6280 NY
790-09-9813 WY
319-86-1601 FL
;
run;
I have two questions:
1) How would you use COMPRESS to create a new character variable, "ncv" and set the value of ncv to be the value of the character variable ID with the hyphens removed? Here's my attempt:
data TWO_NUMERIC;
set TWO;
ncv=COMPRESS(TWO, "+-", "d");
run;
2) How would you use PUT/INPUT to convert ncv to a numerical value to create a numeric variable, "newncv"
data TWO_NUMERIC;
set TWO;
put(TWO,z6.);
run;
To start off with these two questions, I start off with the DATA step and SET statements:
data TWO_NUMERIC;
set TWO;
run;
I looked at SAS 9.2's help page, but the way these two functions are used in the example code there confuses me.
Ok, I was going to say RTM, but in this case it's not clear, at least not in my opinion.
Your mistake with COMPRESS is that the first argument should be the variable, in this case ID, not the dataset TWO. In addition, you only need to specify - in your list, not +, unless you think there might be a + in the variable as well. The D modifier adds digits to the list of characters to remove, which is the opposite of what you want.
The same idea applies to PUT/INPUT: reference the variable and make sure you're using the correct function, in this case INPUT, to convert it to numeric.
Data two_numeric;
set two;
ncv=COMPRESS(ID, "-");
ncv_num=input(ncv, 12.);
run;
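As a rough illustration of which function goes in which direction (a sketch, not part of the original answer): INPUT reads a character value with an informat and returns a number, while PUT writes a value with a format and returns a character string. The variable name back and the z9. format are assumptions chosen only for this example.

```sas
data _null_;
  length ncv $9;
  /* strip the hyphens from a character ID */
  ncv = compress('480-86-3211', '-');
  /* character -> numeric: INPUT with a numeric informat */
  newncv = input(ncv, 9.);
  /* numeric -> character: PUT with a format (z9. pads with leading zeros) */
  back = put(newncv, z9.);
  put ncv= newncv= back=;
run;
```

The z9. format matters when an ID can start with 0, since the numeric value drops leading zeros.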
COMPRESS can be used in multiple ways. One way is described by @Reeza above, and the other uses the "k" modifier, which means "keep", as shown below:
data TWO_NUMERIC;
set TWO;
ncv_d=COMPRESS(ID," ", "kd"); * kd means keep digits, your code had TWO which is a dataset name;
ncv_n=COMPRESS(ID," ", "kn"); * kn means keep digits, letters and underscores;
/* The INPUT function is used to convert CHAR to NUM. *
* Used as an informat, best. reads standard numeric data (default width 12). */
newncv=input(ncv_d,best.);
run;
The link I found useful to explain the K modifier is http://www.amadeus.co.uk/sas-training/tips/1/1/11/the-enhanced-compress-function.php
Related
I am looking for a specific employer in a SAS data set. The data set has not been reviewed for spelling so if I am looking for Univ it could be entered as Unversity, University, Univercity ...
I've tried scanning, counting the matching letters, and CONTAINS. These work, but I am still missing some.
proc sql;
create table SpecificEmployers as
select *
, case when employer contains 'Univ' then 'Y'
else 'N' end as Emp
from AllEmployers
;quit;
In this case, rather than searching for a substring, I would suggest searching for the individual characters that occur in most spellings, such as U, N and V, and keeping only the values that contain all of them. For example, I have used the FINDC function to find the strings that contain U, N and V:
data have;
input string $15.;
datalines;
uNiverstY
UNVERSTy
college
univercity
school
schools
UNIVERSITY
Uversity
unvarcity
school123
;
run;
proc sql;
create table want as
select string from have
where findc(upcase(string),'U')>=1
and findc(upcase(string),'N')>=1
and findc(upcase(string),'V')>=1;
quit;
proc print data=want; run;
Using UPCASE also makes the task easier, because you don't have to worry about case. You can add as many conditions as you need, depending on the values you are looking for.
You should investigate some of the edit distance functions:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206137.htm
One approach would be to loop through each word in the employer name and see if any of the individual words has an edit distance below a certain threshold when compared to the string university.
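A minimal sketch of that word-by-word loop, using COMPLEV (Levenshtein edit distance). The dataset and variable names (employers, employer, emp) and the threshold of 2 are assumptions made for illustration; the threshold in particular would need tuning against real data.

```sas
data employers;
  input employer $40.;
  datalines;
State Unversity
Univercity Hospital
City College
;
run;

data flagged;
  set employers;
  length word $40 emp $1;
  emp = 'N';
  /* check every blank-delimited word in the employer name */
  do i = 1 to countw(employer, ' ');
    word = upcase(scan(employer, i, ' '));
    /* flag the row if any word is within 2 edits of UNIVERSITY */
    if complev(strip(word), 'UNIVERSITY') <= 2 then emp = 'Y';
  end;
  drop i word;
run;
```

A small threshold will not catch abbreviations such as Univ, so a separate test (for example, a prefix check) may still be needed for those.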
I have something similar to the code below. I want to create every 2-character combination within my strings, count the occurrences of each, and store the counts in a table. I will be changing the substr statement to a do loop to iterate through the whole string, but for now I just want to get the first character pair to work:
data temp;
input cat $50.;
call symput ('regex', substr(cat,1,2));
&regex = count(cat,substr(cat,1,2));
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;
Expected results:
cat   bv  dv  cd  ud  kd
####   6
####       4
####           8
####               1
####       3
####                   9
####       1
I'd prefer not to use PROC TRANSPOSE, as I can't loop through the string to create all the character pairs. I'd have to create them manually, and I have up to 500 characters per string. I would also like to search for 3- and 4-character patterns.
You can't do what you're asking to directly. You will either have to use the macro language, or use PROC TRANSPOSE. SAS doesn't let you reference data in the way you're trying to, because it has to have already constructed the variable names and such before it reads anything in.
I'll post a different solution that uses the macro language, but I suspect TRANSPOSE is the ultimate solution here; there's no practical reason this shouldn't work with your actual problem, and if you're having trouble with that it should be possible to help - post the do loop and what you're wanting, and we can of course help. Likely you just need to put the OUTPUT in the do loop.
data temp;
input cat $50.;
cat_val = substr(cat,1,2);
_var_ = count(cat,substr(cat,1,2));
output;
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;
proc transpose data=temp out=temp_T(drop=_name_);
by cat notsorted; *or by some ID variable more likely;
id cat_val;
var _var_;
run;
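To illustrate the point about putting the OUTPUT inside the DO loop, here is one possible sketch that generates every 2-character pair in each string. The dataset names (strings, pairs, pairs_wide) and variable names (pair, n, pos) are assumptions, and only two of the sample strings are used.

```sas
data strings;
  input cat $50.;
  datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
;
run;

data pairs;
  set strings;
  length pair $2;
  /* one output row per starting position in the string */
  do pos = 1 to length(cat) - 1;
    pair = substr(cat, pos, 2);
    n = count(cat, pair);
    output;
  end;
  keep cat pair n;
run;

/* keep a single row per string and pair */
proc sort data=pairs nodupkey;
  by cat pair;
run;

proc transpose data=pairs out=pairs_wide(drop=_name_);
  by cat;
  id pair;
  var n;
run;
```

Changing the 2 in the LENGTH statement and SUBSTR call (and the loop bound) would extend this to 3- or 4-character patterns. Pairs that are not valid SAS names, such as ones starting with a digit, would need extra handling in the ID statement.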
Here's a solution that uses CALL EXECUTE rather than the macro language, as I decided that was actually a better solution. I wouldn't use this in production, but it hopefully shows the concept (in particular, I would not run a PROC DATASETS for each variable separately - I would concat all the renames into one string then run that at the end. I thought this better for showing how the process might work.)
This takes advantage of timing - namely, CALL EXECUTE happens after the data step terminates, so by that point you do know what variable maps to what data point. It does have to pass the data twice in order to drop the spurious variables, though if you either know the actual number of variables you want to have, or if you're okay with the excess variables hanging around, it would be okay to skip that, and PROC DATASETS doesn't actually open the whole dataset, so it would be quite fast (even the above with five calls is quite fast).
data temp;
input cat $50.;
array _catvars[50]; *arbitrary 50 chosen here - pick one big enough for your data;
array _catvarnames[50] $ _temporary_;
cat_val = substr(cat,1,2);
_iternum = whichc(cat_val, of _catvarnames[*]);
if _iternum=0 then do;
_iternum = whichc(' ',of _catvarnames[*]);
_catvarnames[_iternum]=cat_val;
call execute('proc datasets lib=work; modify temp; rename '||vname(_catvars[_iternum])||' = '||cat_val||'; quit;');
end;
_catvars[_iternum]= count(cat,substr(cat,1,2));
if _n_=7 then do; *this needs to actually be a test for end-of-file (so add `end=eof` to the set statement or infile), but you cannot do that in DATALINES so I hardcode the example.;
call execute('data temp; set temp; drop _catvars'||put(whichc(' ',of _catvarnames[*]),2. -l)||'-_catvars50;run;');
end;
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;
*the title may be misleading
I have (column) cells values as follows:
d="M200,170L149,385"
d="M200,170L150,387"
d="M200,170L275,384"
d="M200,170L49,317"
d="M200,170L92,347"
The values 200 & 170 in each cell represent the x and y origins respectively, while the second set of values (i.e. 149 and 385) represent the x and y values.
I want to separate the x-origin, y-origin, x and y values into four columns. (I'm relatively new to SAS... I think these are Cartesian coordinates.)
How would I go about doing this?
Use the scan function. It is used to select the nth word of a string. First argument is the string you want parsed, second is the word (1st, 2nd, etc), and third lists your delimiters (characters that separate the words). That should be all you need.
data want;
set have;
origx = scan(d,1,'M,L');
origy = scan(d,2,'M,L');
x = scan(d,3,'M,L');
y = scan(d,4,'M,L');
run;
Do you have a SAS dataset with a variable named d in it, or do you have a text file? My first read was that you have a SAS dataset already, in which case you need to parse the variable. You could use SCAN() function, or plenty of other methods, e.g.:
data have;
input d $16.;
cards;
M200,170L149,385
M200,170L150,387
M200,170L275,384
M200,170L49,317
M200,170L92,347
;
run;
data want;
set have;
x_origin=scan(d,1,"M,L");
y_origin=scan(d,2,"M,L");
x=scan(d,3,"M,L");
y=scan(d,4,"M,L");
run;
proc print data=want;
run;
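SCAN returns character values. If numeric columns are needed, each piece can be wrapped in INPUT. The sketch below also assumes the stored value includes the d="..." wrapper shown in the question, so the extra characters d, = and " are added to the delimiter list; if the wrapper is not actually in the data, the original "M,L" delimiters are enough.

```sas
data have;
  input d $25.;
  datalines;
d="M200,170L149,385"
d="M200,170L49,317"
;
run;

data want;
  set have;
  /* delimiters: the letters d, M, L plus = " and , */
  x_origin = input(scan(d, 1, 'dML=",'), 8.);
  y_origin = input(scan(d, 2, 'dML=",'), 8.);
  x        = input(scan(d, 3, 'dML=",'), 8.);
  y        = input(scan(d, 4, 'dML=",'), 8.);
run;

proc print data=want;
run;
```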
I want to convert a code like "13232C" to a numeric value, perhaps by assigning the values 1 to 26 to A to Z. The new code would then be "132323".
This code will work if there is just 1 letter in the code. If there are more then you will need to scan through each one to get the value. I've calculated the letter value (1-26) by subtracting 64 from the ASCII value (A=65), making sure to convert the letter to upper case if necessary. I've also assumed that the letter always appears at the end of the string
data have;
input code $;
datalines;
132323C
24578D
5147896G
;
run;
data want;
set have;
new_code=input(cats(compress(code,,'dk'),rank(compress(upcase(code),,'ak'))-64),best12.);
run;
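For codes with more than one letter, or letters that are not at the end, one possible approach is to walk through the string one character at a time, as mentioned above. This is only a sketch: the helper variables ch and new_char are assumptions, and the numeric result is limited to about 15 significant digits.

```sas
data have;
  input code $;
  datalines;
132323C
24578D
12AB7Z
;
run;

data want;
  set have;
  length new_char $32 ch $1;
  new_char = '';
  do i = 1 to length(code);
    ch = upcase(char(code, i));
    /* digits are copied as they are */
    if anydigit(ch) then new_char = cats(new_char, ch);
    /* letters become 1-26 (A=65 in ASCII) */
    else if anyalpha(ch) then new_char = cats(new_char, rank(ch) - 64);
  end;
  new_code = input(new_char, 32.);
  drop i ch new_char;
run;
```

Because letters from J onward produce two digits, different codes can map to the same number (for example, "AA" and "K" both give 11), so this is only safe if that ambiguity does not matter.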
Keith's solution is probably better for most uses, but I can't help seeing this as a good chance to play with PROC FCMP (Function Compile). This works nicely in the case where you have A-I only; starting with J it won't work since I'm only allowing a single character's space. If it can have 2 digits, the FCMP would need to be changed to do what Keith's solution does.
proc fcmp outlib=work.funcs.trial;
function cton(charvar $) $;
do n = 1 to length(charvar);
if 48 le rank(char(charvar,n)) le 57 then ;
else substr(charvar,n,1) = put(rank(upcase(char(charvar,n)))-64,1.);
put charvar;
end;
return (charvar);
endsub;
quit;
options cmplib=work.funcs;
data test;
x="23456CAB";
y = cton(x);
put x= y=;
run;
I also return it as a character, but that's not important - you could return it as a numeric if you prefer (I saw the " " in the original question).
Is there a simple way in SAS to convert a string to a SAS-safe name that would be used as a column name?
ie.
Rob Penridge ----> Rob_Penridge
$*#'Blah#* ----> ____Blah__
I'm using a proc transpose and then want to work with the renamed columns after the transpose.
EDIT:
8 year follow-up... is there now a better way to do this? I feel like I saw a better method sometime back but I'm struggling to find any documentation/examples now that I need to do it.
proc transpose will take those names without any modification, as long as you set options validvarname=any;
If you want to work with the columns afterwards, you can use the NLITERAL function to construct named literals that can be used to refer to them:
options validvarname=any;
/* Create dataset and transpose it */
data zz;
var1 = "Rob Penridge";
var2 = 5;
output;
var1 = "$*#'Blah#*";
var2 = 100;
output;
run;
proc transpose
data = zz
out = zz_t;
id var1;
run;
/* Refer to the transposed columns in the dataset using NLITERAL */
data _null_;
set zz;
call symput(cats("name", _n_), nliteral(var1));
run;
data blah;
set zz_t;
&name1. = &name1. + 5;
&name2. = &name2. + 200;
run;
You could try the Perl regular expression functions.
Because the first character of a column name cannot be a digit, it is slightly more complicated and needs two passes.
data _null_;
name1 = "1$*#' Blah1#*";
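/* inner call: replace every invalid character; outer call: replace a leading digit (anchored with ^) */
name2 = prxchange("s/^[^A-Za-z_]/_/",1,prxchange("s/[^A-Za-z_0-9]/_/",-1,name1));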
put name2;
run;
Take a look at the VALIDVARNAME System Option. It might allow you to accept non-valid SAS names.
The NOTNAME function can also help you find invalid characters.
How about using SAS's regular expression functionality? For example:
data names;
set name;
name_cleaned = prxchange('s/[^a-z0-9 ]/_/i', -1, name);
run;
This will convert anything that isn't a letter, number, or space into a _. You can add other characters that you want to allow to the list after the 9. Just be aware that some characters are "special" and must be preceded by a \.
You could also use the IDLABEL statement in the transpose to add labels that match the original values. Then use the VARLABEL function to retrieve the labels and work with them that way.
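A minimal sketch of the IDLABEL idea, reusing the sample values from the earlier answer. Reading the labels back from dictionary.columns is a choice made here for illustration rather than the VARLABEL function mentioned above; the dataset names zz and zz_t are carried over from that answer.

```sas
options validvarname=v7;

data zz;
  length var1 $20;
  var1 = "Rob Penridge"; var2 = 5;   output;
  var1 = "$*#'Blah#*";   var2 = 100; output;
run;

/* ID values become valid names, IDLABEL keeps the original text as the label */
proc transpose data=zz out=zz_t;
  id var1;
  idlabel var1;
run;

/* list each transposed column name next to its original value */
proc sql;
  select name, label
    from dictionary.columns
    where libname = 'WORK' and memname = 'ZZ_T';
quit;
```

This gives a lookup from the cleaned column name to the original string without turning on VALIDVARNAME=ANY.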