Why does the code below work (&ds is 12345678910), but when I change cats to cat, &ds is just blank? I would expect changing cats to cat would mean &ds is 1 2 3 4 5 6 7 8 9 10.
data new;
length ds $500;
ds = "";
do i = 1 to 10;
ds = cats(ds, i, " ");
end;
call symputx('ds', ds);
run;
%put &ds;
The cat() function does not trim its arguments, so when you concatenate anything onto DS (which is already padded with trailing blanks out to its full length of 500), the result no longer fits back into DS. When the result of cat() does not fit in the target variable, SAS returns a blank value and writes a warning to the log, which is why &ds comes out empty.
It appears you actually want the catx() function.
ds = catx(' ',ds, i);
SAS tends to leave leading and trailing spaces on values when you read from the input buffer or do text manipulation. You can use either the strip() or the catx() function to remove leading and trailing spaces; with catx() you have the extra option of specifying a delimiter.
ds = cat(strip(ds), i, " ");
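For example, a minimal rework of the posted step using catx() (this should make &ds resolve to 1 2 3 4 5 6 7 8 9 10):
data new;
length ds $500;
do i = 1 to 10;
ds = catx(' ', ds, i); /* catx() strips blanks from ds and inserts the delimiter */
end;
call symputx('ds', ds);
run;
%put &ds;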
I have a sample data set like below.
data d01;
infile datalines dlm='#';
input Name & $15. IdNumber & $4. Salary & $5. Site & $3.;
datalines;
アイ# 2355# 21163# BR1
アイウエオ# 5889# 20976# BR1
カキクケ# 3878# 19571# BR2
;
data _null_ ;
set d01 ;
file "/folders/myfolders/test.csv" lrecl=1000 ;
length filler $3;
filler = ' ';
w_out = ksubstr(Name, 1, 5) || IdNumber || Salary || Site || filler;
put w_out;
run ;
I want to export this data set to a csv file in fixed-width format, where every line has a length of 20 bytes (20 single-byte characters).
But SAS automatically removes my trailing spaces, so the result is only 17 bytes per line (the filler is truncated).
I know I can insert the filler like this.
put w_out filler $3.;
But this won't work when the Site column is empty: SAS truncates that column too, and again the result is not 20 bytes per line.
I didn't quite understand what you are trying to do with ksubstr, but if you want to add padding to get the total length to 20 characters, you may have to write some extra logic:
data _null_ ;
set d01 ;
file "/folders/myfolders/test.csv" lrecl=1000 ;
length filler $20;
w_out = ksubstr(Name,1,5) || IdNumber || Salary || Site;
len = 20 - klength(w_out) - 1;
put w_out @;
if len > 0 then do;
filler = repeat(" ", len);
put filler $varying20. len;
end;
else put;
run ;
You probably do not want to write a fixed-column file using a multi-byte character set. Instead, see if you can adjust your process to use a delimited file, like the one in your example input data.
If you want the PUT statement to write a specific number of bytes, just use a formatted PUT statement. To have the number of bytes written vary based on the string's value, you can use the $VARYING. format. The syntax when using $VARYING. is slightly different than for normal formats: you add a variable reference after the format specification that contains the actual number of bytes to write.
You can use the LENGTH() function to calculate how many bytes your name values take. Since it normally ignores trailing spaces, just add another character to the end and subtract one from the overall length.
To pad the end with three blanks, you could just add three to the width used in the format for the last variable.
data d01;
infile datalines dlm='#';
length Name $15 IdNumber $4 Salary $5 Site $3 ;
input Name -- Site;
datalines;
アイ# 2355# 21163# BR1
アイウエオ# 5889# 20976# BR1
カキクケ# 3878# 19571# BR2
Sam#1#2#3
;
filename out temp;
data _null_;
set d01;
file out;
nbytes=length(ksubstr(name,1,5)||'#')-1;
put name $varying15. nbytes IdNumber $4. Salary $5. Site $6. ;
run;
Results:
67 data _null_ ;
68 infile out;
69 input ;
70 list;
71 run;
NOTE: The infile OUT is:
Filename=...\#LN00059,
RECFM=V,LRECL=32767,File Size (bytes)=110,
Last Modified=15Aug2019:09:01:44,
Create Time=15Aug2019:09:01:44
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1 アイ 235521163BR1 24
2 アイウエオ588920976BR1 30
3 カキクケ 387819571BR2 28
4 Sam 1 2 3 20
NOTE: 4 records were read from the infile OUT.
The minimum record length was 20.
The maximum record length was 30.
By default SAS sets the NOPAD option on a FILE statement and writes each line in 'variable' record format, which means line lengths can vary according to the data written. To explicitly ask SAS to pad your records out with spaces, don't use a filler variable; instead:
Set the LRECL to the width of file you need (20)
Set the PAD option, or set RECFM=F
Sample code:
data _null_ ;
set d01 ;
file "/folders/myfolders/test.csv" lrecl=20 PAD;
w_out = Name || IdNumber || Salary || Site;
put w_out;
run ;
More info here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000171874.htm#a000220987
I have two datasets, both with the same variable names. In one of the datasets two variables are character, while in the other dataset all variables are numeric. I use the following code to convert the numeric variables to character, but the numbers are changing, e.g. 490.6 -> 491.
How can I do the conversion so that the numbers don't change?
data tst ;
set data (rename=(Day14=Day14_Character Day2=Day2_Character)) ;
Day14 = put(Day14_Character, 8.) ;
Day2 = put(Day2_Character, 8.) ;
drop Day14_Character Day2_Character ;
run;
Your posted code is confused. Half of it looks like code to convert from character to numeric and half looks like it is for the other direction.
To convert to character use the PUT() function. Normally you will want to left align the resulting string. You can use the -L modifier on the end of the format specification to left align the value.
So to convert numeric variables DAY14 and DAY2 to character variables of length $8 you could use code like this:
data want ;
set have (rename=(Day14=Day14_Numeric Day2=Day2_Numeric)) ;
Day14 = put(Day14_Numeric, best8.-L) ;
Day2 = put(Day2_Numeric, best8.-L) ;
drop Day14_Numeric Day2_Numeric ;
run;
Remember you use PUT statement or PUT() function with formats to convert values to text. And you use the INPUT statement or INPUT() function with informats to convert text to values.
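As a quick illustration of the two directions (a throw-away sketch with made-up values):
data _null_;
num = 490.6;
txt = put(num, best8.-L); /* numeric -> character, via a format */
back = input(txt, 32.); /* character -> numeric, via an informat */
put txt= back=;
run;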
Change the format to something like Best8.2:
data tst ;
set data (rename=(Day14=Day14_Character Day2=Day2_Character)) ;
Day14 = put(Day14_Character, best8.2) ;
Day2 = put(Day2_Character, best8.2) ;
drop Day14_Character Day2_Character ;
run;
Here is an example:
data test;
input r ;
datalines;
500.04
490.6
;
run;
data test1;
set test;
num1 = put(r, 8.2);
run;
If you do not want to specify the width and number of decimal places, you can just use the BEST. format and SAS will automatically choose the width and decimals based on the value. However, the length of the resulting character variable may be large unless you specify it explicitly. This will still retain your numbers as in the original variable.
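A minimal sketch of that idea, reusing the TEST data above (the variable name NUM2 and the $32 length are just illustrative choices):
data test2;
set test;
length num2 $32; /* set the length explicitly so it is not left at a default */
num2 = left(put(r, best32.)); /* BEST. chooses the width and decimals automatically */
run;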
I have a data set with 400 observations of 4-digit codes which I would like to pad with a space on both sides.
ex. Dataset
obs code
1 1111
2 1112
3 3333
.
.
.
400 5999
How can I go through another large data set and replace every occurrence of any of the 400 padded codes with a " "?
ex. Large Dataset
obs text
1 abcdef 1111 abcdef
2 abcdef 1111 abcdef 1112 8888
3 abcdef 1111 abcdef 11128888
...
Data set that I want
ex. New Data set
obs text
1 abcdef abcdef
2 abcdef abcdef 8888
3 abcdef abcdef 11128888
...
Note: I'm only looking to replace 4 digit codes that are padded on both sides by a space. So in obs 3, 1112 won't be replaced.
I've tried doing the following proc sql statement, but it only finds and replaces the first match, instead of all the matches.
proc sql;
select
*,
tranwrd(large_dataset.text, trim(small_dataset.code), ' ') as new_text
from large_dataset
left join small_dataset
on findw(large_dataset.text, trim(small_dataset.code))
;
quit;
You could just use a DO loop to scan through the small dataset of codes for each record in the large dataset. If you want to use the TRANWRD() function, then you will need to add the extra space characters yourself.
data want ;
set have ;
length code $4 ;
do i=1 to nobs while (text ne ' ');
set codes(keep=code) nobs=nobs point=i ;
text = substr(tranwrd(' '||text,' '||code||' ',' '),2);
end;
drop code;
run;
The DO loop will read the records from your CODES list. Using the POINT= option on the SET statement lets you read the file multiple times. The WHILE clause will stop if the TEXT string is empty since there is no need to keep looking for codes to replace at that point.
If your list of codes is small enough and you can get the right regular expression, then you might try using the PRXCHANGE() function instead. You can use an SQL step to generate the codes as a list that you can use in the regular expression.
proc sql noprint ;
select code into :codelist separated by '|'
from codes
;
quit;
data want ;
set have ;
text=prxchange("s/\b(&codelist)\b/ /",-1,text);
run;
There might be more efficient ways of doing this, but this seems to work fairly well:
/*Create test datasets*/
data codes;
input code;
cards;
1111
1112
3333
5999
;
run;
data big_dataset;
infile cards truncover;
input text $100.;
cards;
abcdef 1111 abcdef
abcdef 1111 abcdef 1112 8888
abcdef 1111 abcdef 11128888
;
run;
/*Get the number of codes to use for array definition*/
data _null_;
set codes(obs = 1) nobs = nobs;
call symput('ncodes',nobs);
run;
%put ncodes = &ncodes;
data want;
set big_dataset;
/*Define and populate array with padded codes*/
array codes{&ncodes} $6 _temporary_;
if _n_ = 1 then do i = 1 to &ncodes;
set codes;
codes[i] = cat(' ',put(code,4.),' ');
end;
do i = 1 to &ncodes;
text = tranwrd(text,codes[i],' ');
end;
drop i code;
run;
I expect a solution using prxchange is also possible, but I'm not sure how much work it is to construct a regex that matches all of your codes compared to just substituting them one by one.
This takes Tom's solution and puts the code lookup into a hash table. That way the codes dataset is only read once, and the actual lookup is quite fast. If the large dataset really is large, this will make a huge difference.
data want ;
if _n_ = 1 then do;
length code $4 ;
declare hash h(dataset:"codes (keep=code)") ;
h.defineKey("code") ;
h.defineDone() ;
call missing (code);
declare hiter hiter('h') ;
end;
set big_dataset ;
rc = hiter.first() ;
do while (rc = 0 and text ne ' ') ;
text = substr(tranwrd(' '||text,' '||code||' ',' '),2) ;
rc = hiter.next() ;
end ;
drop code rc ;
run;
Use an array and a regular expression:
proc transpose data=codes out=temp;
var code;
run;
data want;
if _n_=1 then set temp;
array var col:;
set big_dataset;
do over var;
text = prxchange(cats('s/\b',var,'\b//'),-1,text);
end;
drop col:;
run;
I need some advice from a SAS guru :).
Suppose I have two big data sets. The first one is huge (about 50-100 GB!) and contains phone numbers. The second one contains prefixes (20-40 thousand observations).
I need to add the most appropriate prefix to the first table for each phone number.
For example, if I have a phone number +71230000 and prefixes
+7
+71230
+7123
The most appropriate prefix is +71230.
My idea: first, sort the prefix table. Then, in a data step, process the phone numbers table:
data OutputTable;
set PhoneNumbersTable end=_last;
if _N_ = 1 then do;
dsid = open('PrefixTable');
end;
/* for each observation in PhoneNumbersTable:
1. Take the first digit of phone number (`+7`).
Look it up in PrefixTable. Store a number of observation of
this prefix (`n_obs`).
2. Take the first TWO digits of the phone number (`+71`).
Look it up in PrefixTable, starting with `n_obs + 1` observation.
Stop when we will find this prefix
(then store a number of observation of this prefix) or
when the first digit will change (then previous one was the
most appropriate prefix).
etc....
*/
if _last then do;
rc = close(dsid);
end;
run;
I hope my idea is clear enough, but if it's not, I'm sorry.
So what do you suggest?
Thank you for your help.
P.S. Of course, the phone numbers in the first table are not unique (they may repeat), and my algorithm, unfortunately, doesn't take advantage of that.
There are a couple of ways you could do this: you could use a format or a hash table.
Example using format :
/* Build a simple format of all prefixes, and determine max prefix length */
data prefix_fmt ;
set prefixtable end=eof ;
retain fmtname 'PREFIX' type 'C' maxlen . ;
maxlen = max(maxlen,length(prefix)) ; /* Store maximum prefix length */
start = prefix ;
label = 'Y' ;
output ;
if eof then do ;
hlo = 'O' ;
label = 'N' ;
output ;
call symputx('MAXPL',maxlen) ;
end ;
drop maxlen ;
run ;
proc format cntlin=prefix_fmt ; run ;
/* For each phone number, start with full number and reduce by 1 digit until prefix match found */
/* For efficiency, initially reduce phone number to length of max prefix */
data match_prefix ;
set phonenumberstable ;
length prefix $&MAXPL.. ;
prefix = '' ;
pnum = substr(phonenumber,1,&MAXPL) ;
do until (not missing(prefix) or length(pnum) = 1) ;
if put(pnum,$PREFIX.) = 'Y' then prefix = pnum ;
pnum = substr(pnum,1,length(pnum)-1) ; /* Drop last digit */
end ;
drop pnum ;
run ;
Here's another solution which works very well, speed-wise, as long as you can work under one major (maybe okay) restriction: phone numbers can't start with a 0, and have to be either numeric or convertible to numeric (ie, that "+" needs to be unnecessary to look up).
What I'm doing is building an array of 1/null flags, one flag per possible prefix. This doesn't work with a leading 0, though, since '9512' and '09512' would map to the same array element. That could be worked around, for example by adding a '1' at the start (so if prefixes can have up to 6 digits, every prefix becomes 1000000 + prefix), but it would require adjusting the code below (and might have performance implications, though I think it wouldn't be all that bad). If "+" is also needed, that might need to be converted to a digit; here you could say anything with a "+" gets 2000000 added to the beginning, or something like that.
The nice thing is this only takes 6 queries (or so) of an array at most per row - quite a bit faster than any of the other search options (since temporary arrays are contiguous blocks of memory, it's just "go check 6 memory addresses that are pre-computed"). Hash and format will be a decent chunk slower, since they have to look up each one anew.
One major performance suggestion: pay attention to which way your prefixes will likely fail to match. Checking 6 then 5 then 4 then ... might be faster, or checking 1 then 2 then 3 then ... might be faster. It all depends on the actual prefixes themselves, and the actual phone numbers. If most of your prefixes are "+11" and things like that, you almost certainly want to start from the left if that means a number starting with "94" will quickly be found not to match.
With that, the solution.
data prefix_match;
if _n_=1 then do;
array prefixes[1000000] _temporary_;
do _i = 1 to nobs_prefix;
set prefixes point=_i nobs=nobs_prefix;
prefixes[prefix]=1;
end;
call missing(prefix);
end;
set phone_numbers;
do _j = 6 to 1 by -1;
prefix = input(substr(phone_no,1,_j),6.);
if prefix ne 0 and prefixes[prefix]=1 then leave;
prefix=.;
end;
drop _:;
run;
Against a test set, which had 40k prefixes and 100m phone numbers (and no other variables), this ran in a bit over 1 minute on my (good) laptop, versus 6 and change with the format solution and 4 and change with the hash solution (modifying it to output all rows, since the other two solutions do). That seems about right to me performance-wise.
Here's a hashtable example.
Generate some dummy data.
data phone_numbers(keep=phone) prefixes(keep=prefix);
length phone $10 prefix $4;
do i=1 to 10000000;
phone = cats(int(ranuni(0) * 9999999999 + 1));
len = int(ranuni(0) * 4 + 1);
prefix = substr(phone,1,len);
if input(phone,best.) ge 1000000000 then do;
output;
end;
end;
run;
Assuming the longest prefix is 4 chars, try finding a match with the longest first, then continue until the shortest prefix has been tried. If a match is found, output the record and move on to the next observation.
data ht;
attrib prefix length=$4;
set phone_numbers;
if _n_ eq 1 then do;
declare hash ht(dataset:"prefixes");
ht.defineKey('prefix');
ht.defineDone();
end;
do len=4 to 1 by -1;
prefix = substr(phone,1,len);
if ht.find() eq 0 then do;
output;
leave;
end;
end;
drop len;
run;
You may need to add logic for when a match isn't found, to output the record with the prefix field left blank? Not sure how you want to handle that scenario.
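One possible tweak along those lines (an untested sketch of the same step) is to move the OUTPUT outside the loop and clear the prefix when nothing matches:
data ht;
attrib prefix length=$4;
set phone_numbers;
if _n_ eq 1 then do;
declare hash ht(dataset:"prefixes");
ht.defineKey('prefix');
ht.defineDone();
end;
do len=4 to 1 by -1;
prefix = substr(phone,1,len);
if ht.find() eq 0 then leave; /* match found: keep this prefix and stop */
call missing(prefix); /* no match at this length, keep trying shorter ones */
end;
output; /* write every record, matched or not */
drop len;
run;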
I'm fairly new to SAS. I've used it a bit in the past but am really rusty.
I've got a table that looks like this:
Key Group1 Metric1 Group2 Metric2 Group3 Metric3
1 . r 20 .
1 . . t 3
For several unique keys.
I want everything to appear on one row so it looks like.
Key Group1 Metric1 Group2 Metric2 Group3 Metric3
1 . r 20 t 3
Another wrinkle is I don't know how many group and metric columns I'll have (although I'll always have the same number).
I'm not sure how to approach this. I'm able to get a list of column names and use them in a macro; I'm just not sure which proc or data step function I need to use to collapse everything down. I would be extremely grateful for any suggestions.
There's a very simple way to do this using a nice trick. I've answered similar questions on this before; see here for one of them. This should achieve exactly what you're after.
You can use 2 temporary arrays (one for the character variables, and another for the numeric), and fill them with the non-blank values accordingly. When you reach last.key, you can load the temporary arrays back into the source variables.
If you know the maximum length of the character variables in advance, you can hard code it, but if not you can determine it dynamically.
This assumes that for each key, each variable is only populated once. Otherwise it will take the last value it sees for a particular variable within each key.
%LET LIB = work ;
%LET DSN = mydata ;
%LET KEYVAR = key ;
/* Get column name/type/max length */
proc sql ;
/* Numerics */
select name, count(name) into :NVARNAMES separated by ' ', :NVARNUM
from dictionary.columns
where libname = upcase("&LIB")
and memname = upcase("&DSN")
and name ^= upcase("&KEYVAR")
and type = 'num' ;
/* Characters */
select name, count(name), max(length) into :CVARNAMES separated by ' ', :CVARNUM, :CVARLEN
from dictionary.columns
where libname = upcase("&LIB")
and memname = upcase("&DSN")
and name ^= upcase("&KEYVAR")
and type = 'char' ;
quit ;
data flatten ;
set &LIB..&DSN ;
by &KEYVAR ;
array n{&NVARNUM} &NVARNAMES ;
array nt{&NVARNUM} _TEMPORARY_ ;
array c{&CVARNUM} &CVARNAMES ;
array ct{&CVARNUM} $&CVARLEN.. _TEMPORARY_ ;
retain nt ct ;
if first.&KEYVAR then do ;
call missing(of nt{*}, of ct{*}) ;
end ;
/* Load non-missing numeric values into temporary array */
do i = 1 to dim(n) ;
if not missing(n{i}) then nt{i} = n{i} ;
end ;
/* Load non-missing character values into temporary array */
do i = 1 to dim(c) ;
if not missing(c{i}) then ct{i} = c{i} ;
end ;
if last.&KEYVAR then do ;
/* Load numeric back into original variables */
call missing(of n{*}) ;
do i = 1 to dim(n) ;
n{i} = nt{i} ;
end ;
/* Load character back into original variables */
call missing(of c{*}) ;
do i = 1 to dim(c) ;
c{i} = ct{i} ;
end ;
output ;
end ;
drop i ;
run ;