Prevent SAS to automatically remove trailing blank in string - sas

I have a sample data set like below.
data d01;
infile datalines dlm='#';
input Name & $15. IdNumber & $4. Salary & $5. Site & $3.;
datalines;
アイ# 2355# 21163# BR1
アイウエオ# 5889# 20976# BR1
カキクケ# 3878# 19571# BR2
;
data _null_ ;
set d01 ;
file "/folders/myfolders/test.csv" lrecl=1000 ;
length filler $3;
filler = ' ';
w_out = ksubstr(Name, 1, 5) || IdNumber || Salary || Site || filler;
put w_out;
run ;
I want to export this data set to csv (fixed-width format) and every line will has the length of 20 byte (20 1-byte-character).
But SAS auto remove my trailing spaces. So the result would be 17 byte for each line. (the filler is truncated)
I know I can insert the filler like this.
put w_out filler $3.;
But this won't work in case the `site' column is empty, SAS will truncate its column and the result also not be 20 byte for each line.

I didn't quite understand what you are trying to do with ksubstr, but if you want to add padding to get the total length to 20 characters, you may have to write some extra logic:
data _null_ ;
set d01 ;
file "/folders/myfolders/test.csv" lrecl=1000 ;
length filler $20;
w_out = ksubstr(Name,1,5) || IdNumber || Salary || Site;
len = 20 - klength(w_out) - 1;
put w_out #;
if len > 0 then do;
filler = repeat(" ", len);
put filler $varying20. len;
end;
else put;
run ;

You probably do not want to write a fixed column file using a multi-byte character set. Instead look into seeing if your can adjust your process to use a delimited file instead. Like you did in your example input data.
If you want the PUT function to write a specific number of bytes just use formatted PUT statement. To have the number of bytes written vary based on the strings value you can use the $VARYING format. The syntax when using $VARYING is slightly different than when using normal formats. You add a second variable reference after the format specification that contains the actual number of bytes to write.
You can use the LENGTH() function to calculate how many bytes your name values take. Since it normally ignores the trailing space just add another character to the end and subtract one from the overall length.
To pad the end with three blanks you could just add three to the width used in the format for the last variable.
data d01;
infile datalines dlm='#';
length Name $15 IdNumber $4 Salary $5 Site $3 ;
input Name -- Site;
datalines;
アイ# 2355# 21163# BR1
アイウエオ# 5889# 20976# BR1
カキクケ# 3878# 19571# BR2
Sam#1#2#3
;
filename out temp;
data _null_;
set d01;
file out;
nbytes=length(ksubstr(name,1,5)||'#')-1;
put name $varying15. nbytes IdNumber $4. Salary $5. Site $6. ;
run;
Results:
67 data _null_ ;
68 infile out;
69 input ;
70 list;
71 run;
NOTE: The infile OUT is:
Filename=...\#LN00059,
RECFM=V,LRECL=32767,File Size (bytes)=110,
Last Modified=15Aug2019:09:01:44,
Create Time=15Aug2019:09:01:44
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1 アイ 235521163BR1 24
2 アイウエオ588920976BR1 30
3 カキクケ 387819571BR2 28
4 Sam 1 2 3 20
NOTE: 4 records were read from the infile OUT.
The minimum record length was 20.
The maximum record length was 30.

By default SAS sets an option of NOPAD on a FILE statement, it also sets each line to 'variable format', which means lengths of lines can vary according to the data written. To explicitly ask SAS to pad your records out with spaces, don't use a filler variable, just:
Set the LRECL to the width of file you need (20)
Set the PAD option, or set RECFM=F
Sample code:
data _null_ ;
set d01 ;
file "/folders/myfolders/test.csv" lrecl=20 PAD;
w_out = Name || IdNumber || Salary || Site;
put w_out;
run ;
More info here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000171874.htm#a000220987

Related

SAS Writing text to a text file results in extra spaces not written to the LOG file

I have a dataset of address standardizations -- take STREET and replace it with ST -- and wish to write code that does the substitution. When testing out the code, it appears as intended in the LOG file but extra spaces are added when I write to the text file. I don't want the extra spaces.
-=-=-=-=-=- SAS CODE
data std ;
length pre $16 post $8 ;
infile datalines delimiter=',' ;
input pre $ post $ ;
pre = strip(pre);
post = strip(post);
datalines;
AVENUES , AVE
AVENUE , AVE
BOULEVARD , BLVD
CIRCLE , CIR
;
run;
data _null_ ;
file "&test.txt";
set std ;
p1 = trim(pre) ;
p2 = trim(post);
put '&var = strip( prxchange("s/(^|\s)' p1 +(-1) '\s/ ' p2 +(-1) ' /i",-1,&var) );' ;
run;
-=-=-=-=-=-=-=- END OF CODE
The SAS code produces the following ...
&var = strip( prxchange("s/(^|\s)AVENUES\s/ AVE /i",-1,&var) );
&var = strip( prxchange("s/(^|\s)AVENUE\s/ AVE /i",-1,&var) );
&var = strip( prxchange("s/(^|\s)BOULEVARD\s/ BLVD /i",-1,&var) );
... in the LOG file when I remove the file statement, but writes ...
&var = strip( prxchange("s/(^|\s)AVENUES \s/ AVE /i",-1,&var) );
&var = strip( prxchange("s/(^|\s)AVENUE \s/ AVE /i",-1,&var) );
&var = strip( prxchange("s/(^|\s)BOULEVARD \s/ BLVD /i",-1,&var) );
... with extra spaces inside the REGEX function in the file test.txt.
This is SAS 9.4 which I'm using through a web-based SAS Studio.
So, your problem is based on how SAS stores character variables.
A character variable is always equal to the characters stored in that variable, followed by as many space ('20'x) characters as needed to fill the length of the data storage. This differs from (mostly newer) languages that have a string terminator character or similar; SAS has no such character, it just fills the space with spaces. So if the variable is 8 bytes long, and contains Avenue, then it actually contains Avenue .
You cannot change that in code, outside of a single line of code. So, your lines:
p1 = trim(pre) ;
p2 = trim(post);
Are meaningless - they do nothing except waste CPU time (sadly, not optimized away from what I can tell).
You need to trim in the line you use the value, as there it can be trimmed away. Now, you can't put a trim(...), so you need to compose your line to be written elsewhere, or else use the $varying. format.
Here's one example:
filename tempfile temp;
data _null_ ;
file tempfile;
set std ;
result = cats('&var = strip( prxchange("s/(^|\s)',pre,catx(' ','\s/',post,'/i",-1,&var) );'));
put result ;
run;
data _null_;
infile tempfile;
input #;
put _infile_;
run;
Here's an example using $varying.:
data _null_ ;
file tempfile;
set std ;
varlen_pre = length(pre);
varlen_post = length(post);
put '&var = strip( prxchange("s/(^|\s)' pre $varying16. varlen_pre '\s/ ' post $varying8. varlen_post ' /i",-1,&var) );' ;
run;
As to why your log doesn't match the file, that's because SAS has slightly different rules for when it writes to logs than when it writes to files. It's much more exact about what it writes to a file; you say it, it writes it. For logs it has a few places where it removes spaces for you, presumably to make logs more readable, as it's not as necessary to be precise. This can be a pain when you DO want the precision in the log, and of course in your case where you want the log to match what you're seeing...
Finally, a note on what you're doing. I don't highly recommend using a regex the way you're using it. It's very slow. Unless you're only doing a handful of replacements, or only have a small dataset size, or really don't care how long this takes...
If it's just 1:1 replacements, I'd recommend tranwrd, which is much faster. See just this small comparison:
data in_data;
do _n_ = 1 to 1e5;
address = catx(' ',rand('Integer',1,9999),'Avenue');
output;
address = catx(' ',rand('Integer',1,9999),'Street');
output;
address = catx(' ',rand('Integer',1,9999),'Boulevard');
output;
address = catx(' ',rand('Integer',1,9999),'Circle');
output;
address = catx(' ',rand('Integer',1,9999),'Route');
output;
end;
run;
data want;
set in_data;
rx_ave = prxparse('s/(^|\s)Avenue\s/ Ave /ios');
rx_st = prxparse('s/(^|\s)Street\s/ St/ios');
rx_blvd = prxparse('s/(^|\s)Boulevard\s/ Blvd /ios');
rx_cir = prxparse('s/(^|\s)Circle\s/ Cir /ios');
do i = 1 to 4;
address = prxchange(i,-1,address);
end;
run;
data want;
set in_data;
address = tranwrd(Address,'Avenue','Ave');
address = tranwrd(address,'Street','St');
address = tranwrd(address,'Boulevard','Blvd');
address = tranwrd(address,'Circle','Cir');
run;
Both work the same - given you're already using 'words' anyway - but the second works in basically the time it takes to write out the dataset (0.15s for me), while the first takes 20s of CPU time on my SAS server. Loading the regex library is really slow.

SAS: How can I pad a character variable with zeroes while reading in from csv

Most of my data is read in in a fixed width format, such as fixedwidth.txt:
00012000ABC
0044500DEFG
345340000HI
00234000JKL
06453MNOPQR
Where the first 5 characters are colA and the next six are colB. The code to read this in looks something like:
infile "&path.fixedwidth.txt" lrecl = 397 missover;
input colA $5.
colB $6.
;
label colA = 'column A '
colB = 'column B '
;
run;
However some of my data is coming from elsewhere and is formatted as a csv without the leading zeroes, i.e. example.csv:
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
As the csv data is being added to the existing data read in from the fixed width file, I want to match the formatting exactly.
The code I've got so far for reading in example.csv is:
data work.example;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile "&path./example.csv" delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat colA $5.;
informat colB $6.;
format colA z5.; *;
format colB z6.; *;
input
colA $
colB $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
But the formats z5. & z6. only work on columns formatted as numeric so this isn't working and gives this output:
ColA colB
12 ABC
445 DEFG
34534 HI
234 JKL
6453 MNOPQR
When I want:
ColA colB
00012 000ABC
00445 00DEFG
34534 0000HI
00234 000JKL
06453 MNOPQR
With both columns formatted as characters.
Ideally I'd like to find a way to get the output I need using only formats & informats to keep the code easy to follow (I have a lot of columns to keep track of!).
Grateful for any suggestions!
You can use cats to force the csv columns to character, without knowing what types the csv import determined they were. Right justify the resultant to the expected or needed variable length and translate the filled in spaces to zeroes.
For example
data have;
length a 8 b $7; * dang csv data, someone entered 7 chars for colB;
a = 12; b = "MNQ"; output;
a = 123456; b = "ABCDEFG"; output;
run;
data want;
set have (rename=(a=csvA b=csvB));
length a $5 b $6;
* may transfer, truncate or convert, based on length and type of csv variables;
* substr used to prevent blank results when cats (number) is too long;
* instead, the number will be truncated;
a = substr(cats(csvA),1);
b = substr(cats(csvB),1);
a = translate(right(a),'0',' ');
b = translate(right(b),'0',' ');
run;
SUBSTR on the left.
data test;
infile cards firstobs=2 dsd;
length cola $5 colb $6;
cola = '00000';
colb = '000000';
input (a b)($);
substr(cola,vlength(cola)-length(a)+1)=a;
substr(colb,vlength(colb)-length(b)+1)=b;
cards;
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
;;;;
run;
proc print;
run;

How can I find and replace specific text in a SAS data set?

I have a data set with 400 observations of 4 digit codes which I would like to pad with a space on both sides
ex. Dataset
obs code
1 1111
2 1112
3 3333
.
.
.
400 5999
How can I go through another large data set and replace every occurrence of any of the padded 400 codes with a " ".
ex. Large Dataset
obs text
1 abcdef 1111 abcdef
2 abcdef 1111 abcdef 1112 8888
3 abcdef 1111 abcdef 11128888
...
Data set that I want
ex. New Data set
obs text
1 abcdef abcdef
2 abcdef abcdef 8888
3 abcdef abcdef 11128888
...
Note: I'm only looking to replace 4 digit codes that are padded on both sides by a space. So in obs 3, 1112 won't be replaced.
I've tried doing the following proc sql statement, but it only finds and replaces the first match, instead of all the matches.
proc sql;
select
*,
tranwrd(large_dataset.text, trim(small_dataset.code), ' ') as new_text
from large_dataset
left join small_dataset
on findw(large_dataset.text, trim(small_dataset.code))
;
quit;
You could just use a DO loop to scan through the small dataset of codes for each record in the large dataset. If you want to use TRANWRD() function then you will need to add extra space characters.
data want ;
set have ;
length code $4 ;
do i=1 to nobs while (text ne ' ');
set codes(keep=code) nobs=nobs point=i ;
text = substr(tranwrd(' '||text,' '||code||' ',' '),2);
end;
drop code;
run;
The DO loop will read the records from your CODES list. Using the POINT= option on the SET statement lets you read the file multiple times. The WHILE clause will stop if the TEXT string is empty since there is no need to keep looking for codes to replace at that point.
If your list of codes is small enough and you can get the right regular expression then you might try using PRXCHANGE() function instead. You can use an SQL step to generate the codes as a list that you can use in the regular expression.
proc sql noprint ;
select code into :codelist separated by '|'
from codes
;
quit;
data want ;
set have ;
text=prxchange("s/\b(&codelist)\b/ /",-1,text);
run;
There might be more efficient ways of doing this, but this seems to work fairly well:
/*Create test datasets*/
data codes;
input code;
cards;
1111
1112
3333
5999
;
run;
data big_dataset;
infile cards truncover;
input text $100.;
cards;
abcdef 1111 abcdef
abcdef 1111 abcdef 1112 8888
abcdef 1111 abcdef 11128888
;
run;
/*Get the number of codes to use for array definition*/
data _null_;
set codes(obs = 1) nobs = nobs;
call symput('ncodes',nobs);
run;
%put ncodes = &ncodes;
data want;
set big_dataset;
/*Define and populate array with padded codes*/
array codes{&ncodes} $6 _temporary_;
if _n_ = 1 then do i = 1 to &ncodes;
set codes;
codes[i] = cat(' ',put(code,4.),' ');
end;
do i = 1 to &ncodes;
text = tranwrd(text,codes[i],' ');
end;
drop i code;
run;
I expect a solution using prxchange is also possible, but I'm not sure how much work it is to construct a regex that matches all of your codes compared to just substituting them one by one.
Taking Tom's solution and putting the code-lookup into a hash-table. Thereby the dataset will only be read once and the actual lookup is quite fast. If the Large Dataset is really large this will make a huge difference.
data want ;
if _n_ = 1 then do;
length code $4 ;
declare hash h(dataset:"codes (keep=code)") ;
h.defineKey("code") ;
h.defineDone() ;
call missing (code);
declare hiter hiter('h') ;
end;
set big_dataset ;
rc = hiter.first() ;
do while (rc = 0 and text ne ' ') ;
text = substr(tranwrd(' '||text,' '||code||' ',' '),2) ;
rc = hiter.next() ;
end ;
drop code rc ;
run;
Use array and regular express:
proc transpose data=codes out=temp;
var code;
run;
data want;
if _n_=1 then set temp;
array var col:;
set big_dataset;
do over var;
text = prxchange(cats('s/\b',var,'\b//'),-1,text);
end;
drop col:;
run;

Read in specific data without FIRSTOBS=

Is there a way to read in specific parts of my data without using FIRSTOBS=? For example, I have 5 different files, all of which have a few rows of unwanted characters. I want my data to read in starting with the first row that is numeric. But each of these 5 files have that first numeric row starting in different rows. Rather than going into each file to find where FIRSTOBS should be, is there a way I can instead check this? Perhaps by using an IF statement with ANYDIGIT?
Have you tried something like this from the SAS docs? Example 5: Positioning the Pointer with a Numeric Variable
data office (drop=x);
infile file-specification;
input x #;
if 1<=x<=10 then
input #x City $9.;
else do;
put 'Invalid input at line ' _n_;
delete;
end;
run;
This assumes that you don't know how many lines are to be skipped at the beginning of each file. My filerefs are UNIX to run the example on another OS they will need to be changed;
*Create two example input data files;
filename FT15F001 '~/file1.txt';
parmcards;
char
char and 103
10 10 10
10 10 10.1
;;;;
run;
filename FT15F001 '~/file2.txt';
parmcards;
char
char and 103
char
char
char
10 10 10.5
10 10 10
;;;;
run;
*Read them starting from the first line that has all numbers;
filename FT77F001 '~/file*.txt';
data both;
infile FT77F001 eov=eov;
input #;
/*Reset the flag at the start of each new file*/
if _n_ eq 1 or eov then do;
eov=0;
flag=1;
end;
if flag then do;
if anyalpha(_infile_) then delete;
else flag=0;
end;
input v1-v3;
drop flag;
retain flag;
run;
proc print;
run;
I ended up doing:
INPUT City $#;
StateAvg = input(substr(City,1,4),COMMA4.);
IF 5000<= StateAvg <= 7000 THEN
INPUT City 1-7 State ZIP;
ELSE DO;
Delete;
END;
And this worked. Thanks for the suggestions, I went back and looked at example 5 and it helped.

junk values in a row( sas)

I have my data in a .txt file which is comma separated. I am writing a regular infile statements to import that file into a sas dataset. The data is some 2.5 million rows. However in the 37314th row and many more rows I have junk values. SAS is importing rows only a row above the junk value rows and therefore I am not getting a dataset with all 2.5 million rows but with 37314 rows. I want to write a code which while writing this infile takes care of these junk rows and either doesnt take them or deletes them. All in all, I need all the 2.5 million rows which I am not able to get because of in between junk rows.
any help would be appreciated.
You can read the whole line into the input buffer using just an
Input;
statement. Then you can parse the fields individually using the
_infile_
variable.
Example:
data _null_;
infile datalines firstobs=2;
input;
city = scan(_infile_, 1, ' ');
char_min = scan(_infile_, 3, ' ');
char_min = substr(char_min, 2, length(char_min)-2);
minutes = input(char_min, BEST12.);
put city= minutes=;
datalines;
City Number Minutes Charge
Jackson 415-555-2384 <25> <2.45>
Jefferson 813-555-2356 <15> <1.62>
Joliet 913-555-3223 <65> <10.32>
;
run;
Working with Data in the Input Buffer.
You can also use the ? and ?? modifiers for the input statement to 'ignore' any problem rows.
Here's the link to the doc. Look under the heading "Format Modifiers for Error Reporting".
An example:
data x;
format my_num best.;
input my_num ?? ;
**
** POSSIBLE ERROR HANDLING HERE:
*;
if my_num ne . then do;
output;
end;
datalines;
a
;
run;