Is there an elegant way to check in SAS Base if a numeric value is made of only one kind of digit?
Example:
1 -> Yes
11 -> Yes
111 -> Yes
1111 -> Yes
1121 -> No
9999999 -> Yes
9999990 -> No
I would go with something like this.
Keep in mind that SAS does not store leading zeros in numbers, so a value such as 09999999 in the data below is stored as 9999999 and will pass; the 0 never shows up.
This converts the numbers to strings and then compares adjacent characters in the string. Alter the format in the put statement as needed.
Also note that a decimal value will fail, because the '.' is compared with the digits. If such values should pass, remove the '.' from the string first.
data have;
input x;
datalines;
1
11
12
111
1111
1121
99999999
09999999
1.11
;
run;
data test;
set have;
pass = 1;
format temp $32.;
temp = strip(put(x,best32.));
do i=1 to length(temp)-1;
pass = pass and (substr(temp,i,1) = substr(temp,i+1,1));
if ^pass then leave;
end;
drop temp i;
run;
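The same string comparison can be sketched without an explicit loop. This is only an alternative under the same assumptions as above (the value is rendered with best32., and the dataset have exists as created earlier); the dataset name test_verify is made up for illustration.

```sas
data test_verify;
    set have;
    length temp $32;
    temp = strip(put(x, best32.));
    /* VERIFY returns the position of the first character in its first
       argument that does not appear in the second argument; 0 means
       every character equals the first character of the string. */
    pass = (verify(trim(temp), char(temp, 1)) = 0);
    drop temp;
run;
```

As with the loop, a decimal such as 1.11 gets pass = 0 because the '.' differs from the first digit.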
Just want to share an additional solution with regex:
data have;
input x;
datalines;
1
11
12
111
1111
1121
99999999
9999990
;
run;
data want;
set have;
if PRXMATCH("/\b1+\b|\b2+\b|\b3+\b|\b4+\b|\b5+\b|\b6+\b|\b7+\b|\b8+\b|\b9+\b|\b0+\b/",x);
run;
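A shorter pattern is possible with a backreference. The following is a sketch rather than a drop-in replacement: it converts x to text explicitly (assuming best32. is an acceptable rendering) and sets a flag instead of subsetting; the names want_backref and same_digit are hypothetical.

```sas
data want_backref;
    set have;
    /* (\d) captures the first digit and \1* requires every remaining
       character to repeat it; \s* absorbs the leading blanks that
       PUT leaves in the right-aligned string. */
    same_digit = (prxmatch('/^\s*(\d)\1*\s*$/', put(x, best32.)) > 0);
run;
```

Because the pattern is anchored at both ends, values such as 1.11 or 9999990 get same_digit = 0.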
I have a SAS column as below
-10
20
-30
40
I want to make the column like
10
20
30
40
I need to remove the sign and keep the same number. I don't know how to do this.
You can use the ABS function.
A small sample code:
data begin;
input var @@;
cards;
1 1 -1 -1 2 -2 -3 3
; run;
data wanted;
set begin;
var2= abs(var);
run;
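For more on ABS, see the SAS documentation.
EDIT: If you are dealing with character values, you can just remove the minus sign. Note that TRANWRD substitutes a single blank rather than nothing, so values that had a sign end up with a leading space; COMPRESS(var, '-') deletes the character outright.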
data begin;
input var $ @@;
cards;
1 1 -1 -1 2 -2 -3 3
; run;
data wanted;
set begin;
var2= tranwrd(var, '-', '');
run;
See also the documentation on TRANWRD.
Two ways without creating additional variables:
data begin;
input var @@;
cards;
1 1 -1 -1 2 -2 -3 3
; run;
data wanted;
set begin;
var= abs(var);
run;
proc sql noprint;
create table wanted2 as
select abs(var) as var from begin;
quit;
Another way would be to create a new variable where var2=sqrt(var**2)
I have the following dataset:
DATA survey;
INPUT zip_code number;
DATALINES;
1212 12
1213 23
1214 23
;
PROC PRINT; RUN;
I want to link this data to another table but the thing is that the numbers in the other table are stored in the following format: 0012, 0023, 0023.
So I am looking for a way to do the following:
Check how long the number is
If length = 1, add 3 0 values to the beginning
If length = 2, add 2 0 values to the beginning
Any thoughts on how I can get this working?
Numbers are numbers so if the other table has the field as a number then you don't need to do anything. 13 = 0013 = 13.00 = ....
If the other table actually has a character variable then you need to convert one or the other.
char_number = put(number, Z4.);
number = input(char_number, 4.);
You can use the Zw. format (here Z4.) to accomplish this:
DATA survey;
INPUT zip_code number;
DATALINES;
1212 12
1213 23
1214 23
9999 999
8888 8
;
data survey2;
set survey;
number_long = put(number, z4.);
run;
If number is stored as a character variable and you need it to be four characters long, convert it to numeric first and then apply the format:
want = put(input(number,best32.),z4.);
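To show how the conversion fits into the actual link, here is a minimal sketch that joins the numeric column to a character key. The table other, its columns code and descr, and the output name linked are all assumptions made for illustration; survey is the dataset from the question.

```sas
/* Hypothetical lookup table keyed by a zero-padded character code */
data other;
    input code $ descr $;
    datalines;
0012 twelve
0023 twenty3
;
run;

proc sql;
    create table linked as
    select s.zip_code, s.number, o.descr
    from survey as s
    left join other as o
        /* Z4. renders 12 as '0012', matching the character key */
        on put(s.number, z4.) = o.code;
quit;
```

Converting inside the join condition avoids adding a permanent helper column; if the join is run often, storing the padded value once may be the better choice.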
I have a data set with 400 observations of 4 digit codes which I would like to pad with a space on both sides
ex. Dataset
obs code
1 1111
2 1112
3 3333
.
.
.
400 5999
How can I go through another large data set and replace every occurrence of any of the padded 400 codes with a " ".
ex. Large Dataset
obs text
1 abcdef 1111 abcdef
2 abcdef 1111 abcdef 1112 8888
3 abcdef 1111 abcdef 11128888
...
Data set that I want
ex. New Data set
obs text
1 abcdef abcdef
2 abcdef abcdef 8888
3 abcdef abcdef 11128888
...
Note: I'm only looking to replace 4 digit codes that are padded on both sides by a space. So in obs 3, 1112 won't be replaced.
I've tried doing the following proc sql statement, but it only finds and replaces the first match, instead of all the matches.
proc sql;
select
*,
tranwrd(large_dataset.text, trim(small_dataset.code), ' ') as new_text
from large_dataset
left join small_dataset
on findw(large_dataset.text, trim(small_dataset.code))
;
quit;
You could just use a DO loop to scan through the small dataset of codes for each record in the large dataset. If you want to use TRANWRD() function then you will need to add extra space characters.
data want ;
set have ;
length code $4 ;
do i=1 to nobs while (text ne ' ');
set codes(keep=code) nobs=nobs point=i ;
text = substr(tranwrd(' '||text,' '||code||' ',' '),2);
end;
drop code;
run;
The DO loop will read the records from your CODES list. Using the POINT= option on the SET statement lets you read the file multiple times. The WHILE clause will stop if the TEXT string is empty since there is no need to keep looking for codes to replace at that point.
If your list of codes is small enough and you can get the right regular expression then you might try using PRXCHANGE() function instead. You can use an SQL step to generate the codes as a list that you can use in the regular expression.
proc sql noprint ;
select code into :codelist separated by '|'
from codes
;
quit;
data want ;
set have ;
text=prxchange("s/\b(&codelist)\b/ /",-1,text);
run;
There might be more efficient ways of doing this, but this seems to work fairly well:
/*Create test datasets*/
data codes;
input code;
cards;
1111
1112
3333
5999
;
run;
data big_dataset;
infile cards truncover;
input text $100.;
cards;
abcdef 1111 abcdef
abcdef 1111 abcdef 1112 8888
abcdef 1111 abcdef 11128888
;
run;
/*Get the number of codes to use for array definition*/
data _null_;
set codes(obs = 1) nobs = nobs;
call symput('ncodes',nobs);
run;
%put ncodes = &ncodes;
data want;
set big_dataset;
/*Define and populate array with padded codes*/
array codes{&ncodes} $6 _temporary_;
if _n_ = 1 then do i = 1 to &ncodes;
set codes;
codes[i] = cat(' ',put(code,4.),' ');
end;
do i = 1 to &ncodes;
text = tranwrd(text,codes[i],' ');
end;
drop i code;
run;
I expect a solution using prxchange is also possible, but I'm not sure how much work it is to construct a regex that matches all of your codes compared to just substituting them one by one.
Taking Tom's solution and putting the code-lookup into a hash-table. Thereby the dataset will only be read once and the actual lookup is quite fast. If the Large Dataset is really large this will make a huge difference.
data want ;
if _n_ = 1 then do;
length code $4 ;
declare hash h(dataset:"codes (keep=code)") ;
h.defineKey("code") ;
h.defineDone() ;
call missing (code);
declare hiter hiter('h') ;
end;
set big_dataset ;
rc = hiter.first() ;
do while (rc = 0 and text ne ' ') ;
text = substr(tranwrd(' '||text,' '||code||' ',' '),2) ;
rc = hiter.next() ;
end ;
drop code rc ;
run;
Use an array and a regular expression:
proc transpose data=codes out=temp;
var code;
run;
data want;
if _n_=1 then set temp;
array var col:;
set big_dataset;
do over var;
text = prxchange(cats('s/\b',var,'\b//'),-1,text);
end;
drop col:;
run;
The dataset looks like this:
Code  Type  Rating
0001  NULL  1
0002  NULL  1
0003  NULL  1
0003  PA 1  3
0004  NULL  1
0004  PB 1  2
0005  AC 1  3
0005  NULL  6
0006  AC 1  2
I want the output dataset looks like
Code  Type  Rating
0001  NULL  1
0002  NULL  1
0003  PA 1  4
0004  PB 1  3
0005  AC 1  9
0006  AC 1  2
For each Code, Type has at most two values. I want one row per Code, with Rating summed. The problem is Type: if it has only one value, that value should pass to the output dataset. If it has two values (one of which has to be NULL), the one that is not NULL should pass to the output dataset.
The total number of observation N>100,000,000. So is there any tricky way to achieve this?
If the data is sorted as per your example, then you can achieve this in a single data step. I've assumed that the NULL values are actually missing, however if not then change [if missing(type)] to [if type='NULL']. All this does is sum the Rating values for each Code, then output the last record, keeping the non-null Type. If your data isn't sorted or indexed on Code then you'll need to do a sort first, which will obviously add quite a bit to the execution time.
/* create input file */
data have;
infile datalines dsd; /* INFILE must come before INPUT so DSD applies */
input Code Type $ Rating;
datalines;
0001,,1
0002,,1
0003,,1
0003,PA 1,3
0004,,1
0004,PB 1,2
0005,AC 1,3
0005,,6
0006,AC 1,2
;
run;
/* create summarised dataset */
data want;
set have;
by code;
retain _type; /* temporary variable */
if first.code then do;
_type = type;
_rating_sum = 0; /* reset sum */
end;
_rating_sum + rating; /* sum rating per Code */
if last.code then do;
if missing(type) then type = _type; /* pick non-null value */
rating = _rating_sum; /* insert sum */
output;
end;
drop _type _rating_sum; /* remove the temporary variables */
run;
Given the comments, another possibility presents, the hash solution. This is memory-constrained, so it may or may not be able to work with the actual data (the hash table isn't very big, but 100M rows might imply 60 or 70M rows in the hash table, times 40 or 50 bytes would still be pretty big).
This is almost certainly inferior to the plain data step method if the dataset is sorted by code, so this should only be used on unsorted data.
The concepts:
Create hash table keyed on code
If incoming record is new, add to hash table
If incoming record is not a new code, take the retrieved value and sum the rating. Check to see if type needs to be replaced.
Output to dataset.
Code:
data _null_;
if _n_=1 then do;
if 0 then set have;
declare hash h(ordered:'a');
h.defineKey('code');
h.defineData('code','type','rating');
h.defineDone();
end;
set have(rename=(type=type_in rating=rating_in)) end=eof;
rc_1 = h.find();
if rc_1 eq 0 then do;
if type ne type_in and type='NULL' then type=type_in;
rating=sum(rating,rating_in);
h.replace();
end;
else do;
type=type_in;
rating=rating_in;
h.add();
end;
if eof then do;
h.output(dataset:'want');
end;
run;
It's pretty easy to do in one SQL step as well. Just use a CASE...WHEN...END to remove the NULLs and a MAX to then get the non-null value.
data have;
input @1 Code 4.
@9 Type $4.
@19 Rating 1.;
datalines;
0001    NULL      1
0002    NULL      1
0003    NULL      1
0003    PA 1      3
0004    NULL      1
0004    PB 1      2
0005    AC 1      3
0005    NULL      6
0006    AC 1      2
;;;;
run;
proc sql;
create table want as
select code,
max(case type when 'NULL' then '' else type end) as type,
sum(Rating) as rating
from have
group by code;
quit;
If you want the NULLs back, then you need to wrap the select in a select code, case type when ' ' then 'NULL' else type end as type, rating from ( ... );, though I would suggest leaving them blank.
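Spelled out, the wrapped query might look like the following sketch. It reuses the have dataset from above; the output name want_null is made up here.

```sas
proc sql;
    create table want_null as
    select code,
           /* turn the blank produced by the inner query back into 'NULL' */
           case type when ' ' then 'NULL' else type end as type,
           rating
    from (select code,
                 max(case type when 'NULL' then '' else type end) as type,
                 sum(Rating) as rating
          from have
          group by code);
quit;
```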
I'm trying to convert a character string to a numeric variable and then sum the values of each character to use as a unique identifier for that field.
So for example, I would like A=1, B=2, C=3.....X=24 Y=25 Z=26.
Say my string is "CAB". After running the code I would like an intermediate column of numbers, where the value for CAB is 3 1 2, and a result column derived by summing the values in the intermediate column: 3+1+2 = 6, so the final value would be 6.
Here is the sas code I used to convert the characters to numbers, but I need help with the result column.
DATA CHAR_VALUE;
SET WORK.XYZ;
CHAR_2_NUM=TRANSLATE(MY_VAR_CHAR, '1 2 3 ...24 25 26', 'A B C ...X Y Z');
NUM_CHAR=INPUT(CHAR_2_NUM,32.);
RUN;
Thanks in advance...I appreciate any help or suggestions.
-rachel
RANK will give the ASCII numeric value underlying a character; so A=65, B=66, Z=90, a=97, z=122.
So this should work (if you want only the uppercase values - not a different value for a than A):
data test;
charval='CAB';
do _t=1 to length(Charval);
numval=sum(numval,rank(char(upcase(charval),_t))-64);
end;
put _all_;
run;
Another option (Based on the comments below), is to build an informat with the relationships between letter and value. My loop iterates over each character A to Z, you can then put whatever value you want for each letter as label (I just put 1,2,3,4... but label= will change that).
data fmts;
retain fmtname 'CHARNUM' type 'i';
do _t=65 to 90;
start=byte(_t); *the character, so byte(65)='A';
label=_t-64; *the resulting number;
output;
end;
run;
proc format cntlin=fmts;
quit;
data test;
charval='CAB';
do _t=1 to length(Charval);
numval=sum(numval,input(char(upcase(charval),_t),CHARNUM.));
end;
put _all_;
run;
Finally, if you want to be able to construct this in the same datastep, you could construct the relationships in a hash table and look up the result. I can explain that if desired, though I'd like to see a more detailed example of what you want to do in terms of defining the relationship between a letter and its code.
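For reference, a minimal sketch of that hash approach, assuming the simple A=1 ... Z=26 mapping from the question; the variable names letter and value are invented for this example, and the loop that fills the table is where a different mapping would go.

```sas
data test;
    length letter $1 value 8;
    if _n_ = 1 then do;
        /* Build the letter -> value lookup once, in the same data step */
        declare hash h();
        h.defineKey('letter');
        h.defineData('value');
        h.defineDone();
        do value = 1 to 26;
            letter = byte(64 + value); /* byte(65)='A' */
            h.add();
        end;
    end;
    charval = 'CAB';
    numval = 0;
    do _t = 1 to length(charval);
        letter = char(upcase(charval), _t);
        /* FIND returns 0 and fills VALUE when the letter is in the table */
        if h.find() = 0 then numval = sum(numval, value);
    end;
    put charval= numval=;
    drop letter value _t;
run;
```

Characters that are not in the table (digits, punctuation) are simply skipped, which may or may not be what you want.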
If you need to see the intermediate values, you can do that by inserting a CAT function in the loop. I recommend CATX:
data test;
charval='CAB';
format intermed $100.;
do _t=1 to length(Charval);
numval=sum(numval,input(char(upcase(charval),_t),CHARNUM.));
intermed=catx('|',intermed,input(char(upcase(charval),_t),CHARNUM.)); *or the RANK portion from earlier;
end;
put _all_;
run;
That would give you 3|1|2, which you could then do math on via SCAN:
do _t = 1 to countc(intermed,'|')+1;
numval2 = sum(numval2,scan(intermed,_t,'|'));
end;
Your method to try and translate is a good attempt, but it will not really work. Here is a simple solution:
DATA CHAR_VALUE;
retain all_chars 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
set XYZ;
length CHAR_2_NUM $200;
CHAR_2_NUM = ' ';
NUM_CHAR = 0;
do i=1 to length(MY_VAR_CHAR);
pos = index(all_chars,substr(MY_VAR_CHAR,i,1)); /* 1-26 for A-Z, 0 if not found */
CHAR_2_NUM = catx(' ',CHAR_2_NUM,pos); /* builds e.g. "3 1 2" for CAB */
NUM_CHAR + pos;
end;
drop i pos all_chars;
RUN;
This takes advantage of the fact that the indexed position of each character of your source variable in the all_chars variable corresponds to the mapping you desired.
UPDATED to also create your CHAR_2_NUM variable, which I overlooked in the original question.
Another simple solution is based on the collate function:
To convert a variable called MyNumbers (in the range of 1 to 26) to English upper-case characters, one can use:
collate(64 + MyNumbers, 64 + MyNumbers)
To obtain lower-case characters, one can use:
collate(96 + MyNumbers, 96 + MyNumbers)
Here's a quick example:
data _null_;
do MyNumbers = 1 to 26;
MyLettersUpper = collate(64 + MyNumbers, 64 + MyNumbers);
MyLettersLower = collate(96 + MyNumbers, 96 + MyNumbers);
put MyNumbers MyLettersUpper MyLettersLower;
end;
run;
1 A a
2 B b
3 C c
4 D d
5 E e
6 F f
7 G g
8 H h
9 I i
10 J j
11 K k
12 L l
13 M m
14 N n
15 O o
16 P p
17 Q q
18 R r
19 S s
20 T t
21 U u
22 V v
23 W w
24 X x
25 Y y
26 Z z
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds