Default behavior of Input Buffer in SAS while reading data from external file

Contents of a.txt
22
333
4444
55555
But when I run this code:
data numbers;
infile 'c:\a.txt';
input var 5.;
/* list */ ;
run;
the data in numbers.sas is saved as:
333
55555
** Note the format of the data in numbers.sas and the format in a.txt.
But when I use LIST, the input buffer looks something like this:
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7
2 333 3
4 55555 5
Why doesn't SAS show 1 and 3? And how is the input buffer being read?
Please explain.

Try adding TRUNCOVER to your INFILE statement, or remove the 5. after var in your INPUT statement. As written, SAS expects a field 5 characters wide. With the default FLOWOVER behavior, it moves on to the next line whenever a line in your source file is less than 5 characters long.
data numbers;
infile 'c:\a.txt' truncover;
input var 5.;
run;
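To see the difference side by side, here is a minimal sketch that rebuilds the four-line file from the question as a temporary file and reads it both ways. The fileref a and the dataset names flow and trunc are made up for this example. With the default FLOWOVER, flow should reproduce the two rows shown in the question; with TRUNCOVER, trunc should get one row per line.

```sas
/* Recreate a.txt as a temporary file (fileref "a" is hypothetical). */
filename a temp;
data _null_;
  file a;
  put '22' / '333' / '4444' / '55555';
run;

/* Default FLOWOVER: a line shorter than 5 characters makes INPUT
   move to the next line and read the value from there. */
data flow;
  infile a;
  input var 5.;
run;

/* TRUNCOVER: a short line is read as it is, one observation per line. */
data trunc;
  infile a truncover;
  input var 5.;
run;

proc print data=flow;  run;
proc print data=trunc; run;
```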
For more information, read this paper on INFILE statement options.

Related

Importing data to SAS from text file with multiple delimiters and line breaks in character variables

I want to read into SAS a text file data set that uses two different delimiters, "|" and the string "[end text]". It is arranged as below:
var1|var2|var3
4657|366|text that
has some line
breaks [end text]
45|264| more text that has
line breaks [end text]
I am trying to figure out how to recognize both of these delimiters. I tried to use the DLMSTR option, but this didn't work:
data new ;
infile 'file.txt' dlmstr='|'||'[report_end]' DSD firstobs=2 ;
input var1 var2 var3 $;
run;
Is there any way to use these two delimiters at the same time? Or am I using the wrong input style to import my data?
SAS can read delimited files that have embedded line breaks as long as the embedded line breaks use a different character than the normal end of line. So if your real observations end with CRLF (normal for a Windows text file) and the embedded line breaks are just a single LF character, then those extra breaks will be treated as just another character in that field.
var1|var2|var3<CR><LF>
4657|366|text that<LF>
has some line<LF>
breaks [end text]<CR><LF>
45|264| more text that has<LF>
line breaks [end text]<CR><LF>
For example here is a data step that could convert your original file.
data _null_;
infile original lrecl=32767 ;
file copy lrecl=1000000 termstr=lf ;
input ;
_infile_ = tranwrd(_infile_,'[end text]','0d'x);
if _n_=1 then _infile_=trim(_infile_)||'0d'x;
len = length(_infile_);
put _infile_ $varying32767. len ;
run;
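Once the file has that layout, a read step along these lines should work. This is only a sketch: it assumes the fileref copy points at the converted file, and the $200 length for var3 is a guess you would adjust to your data.

```sas
data new;
  /* $200 is an assumed width for the text field. */
  length var1 var2 8 var3 $200;
  /* TERMSTR=CRLF: only CR+LF ends a record, so a lone LF stays inside var3. */
  infile copy dsd dlm='|' termstr=crlf firstobs=2 truncover lrecl=32767;
  input var1 var2 var3;
run;
```

The embedded LF characters are still in var3 after this step, which is one reason the alternative of swapping them for a visible character can be easier to work with.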
But it might be better to replace the embedded line breaks with some other character, like ^, instead.
data _null_;
infile original truncover ;
file copy lrecl=1000000 ;
input line $char32767.;
len = length(line);
put line $varying32767. len @;
if _n_=1 or index(_infile_,'[end text]') then put ;
else put '^' @;
run;
Result:
var1|var2|var3
4657|366|text that^has some line^breaks [end text]
45|264| more text that has^line breaks [end text]
Which is easy to read.
Obs var1 var2 var3
1 4657 366 text that^has some line^breaks [end text]
2 45 264 more text that has^line breaks [end text]

SAS: Specify newline character when input a text file

I'm quite new to SAS and have a very simple problem. I have a text file which is saved like this:
123,123,345,457,4.55~123,123,345,457,4.55~123,123,345,457,4.55
So all the data is written in a single line. The ~ character denotes the newline character. My final goal is to load the text file into sas and create a SAS dataset which should look like this:
V1 V2 V3 V4 V5
123 123 345 457 4.55
123 123 345 457 4.55
123 123 345 457 4.55
So ',' is the delimiter and '~' is the new line character.
How can I achieve this?
Thank you very much for your response.
Kind Regards
Consti
Just tell SAS to use both characters as delimiters and add @@ to the input statement to prevent it from going to a new line.
data want ;
infile cards dsd dlm=',~';
input v1-v5 @@ ;
cards;
123,123,345,457,4.55~123,123,345,457,4.55~123,123,345,457,4.55
;;;;
Result
Obs v1 v2 v3 v4 v5
1 123 123 345 457 4.55
2 123 123 345 457 4.55
3 123 123 345 457 4.55
If you are reading from a file then you might also be able to use the RECFM=N option on the INFILE statement instead of the @@ on the INPUT statement, although if the one line actually has LF or CR/LF at the end then you might want to include them in the delimiter list also.
Tom's answer is correct for files that are regular and you don't have issues with inconsistent rows.
If you do need to do exactly what you say though, it's possible; you'd convert ~ to a newline through a pre-processing step. Here's one way to do that.
First, in a data step go through the file with dlm of ~; input the fields until you run to the end of the line, and for each field, output it to a temp file (so now the line has just one data row on it).
Now you have a temp file you can read in like normal, with no ~ characters in it.
You could do this in a number of other ways, for example by literally finding and replacing ~ with '0D0A'x or whatever your preferred EOL character is. That is probably easier and faster in another language: if you have the file on Unix and have access to perl or awk, for example, you could probably do this more easily than in SAS.
filename test_in "c:\temp\test_dlm.txt";
filename test_t temp;
data _null_;
infile test_in dlm='~';
file test_t;
length test_field $32767;
do _n_= 1 by 1 until (_n_ > countc(_infile_,'~'));
input test_field :$32767. @@;
putlog test_field;
put test_field $;
end;
stop;
run;
data want;
infile test_t dlm=',';
input v1 v2 v3 v4 v5;
run;

sas infile dataline trunctuates at 8 characters

Having not worked with SAS for a couple of years, I am trying to get back into it...
I am trying to read data with comma-delimited datalines. While there are plenty of examples, I can't quite get the following to import my data correctly:
data h0;
infile datalines delimiter=',';
input
kst
kst_bez $
hx $
hx_bez $
hxx $
hxx_bez $
hxxx $
hxxx_bez $
;
datalines;
10000,Team 1 South,H0,Group,H10,Retail,H112,Retail Germany
10001,Team 2 North & West,H0,H10,Retail Division 2,H112,Retail Germany
10003,Human Res,H0,Group,H20,HR,H112,HR Germany
;
I would have thought that delimiter=',' tells SAS to simply read the data between my ,-Characters into something like a VARCHAR-variable... however, any alphanumeric data is truncated at 8 characters.
I vaguely remember I have to use something like $varying40., which is in line with the examples I found - however, if I add this to my variables, the variable doesn't stop at the ,, but instead reads the whole, say, 40 characters.
Any hints?
Thanks a ton!
If you don't define them otherwise, SAS will default all character variables to length 8. It is probably clearer for you and the SAS compiler if you explicitly define the variables using a LENGTH or ATTRIB statement before using them. Otherwise SAS has to guess at how you wanted them defined based on how they are first used.
data h0;
length kst 8 kst_bez $20 hx $20 hx_bez $20 hxx $20 hxx_bez $20
hxxx $20 hxxx_bez $20
;
infile datalines dsd truncover ;
input kst -- hxxx_bez ;
datalines;
...
You could add in-line informat specifications to the INPUT statement as the first use of the variable and SAS will default to the width of the informat used, but make sure to add the colon prefix to prevent SAS from reading past the delimiters.
data h0;
infile datalines dsd truncover ;
input kst kst_bez :$20. hx :$20. hx_bez :$20. hxx :$20. hxx_bez :$20.
hxxx :$20. hxxx_bez :$20.
;
datalines;
...
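As a small self-contained illustration of the point about default lengths, here is a sketch with one made-up data line. The dataset name demo and the variables id, name and code are invented for this example. PROC CONTENTS should report name with length 20 (taken from the :$20. informat) and code with the default length of 8.

```sas
data demo;
  infile datalines dsd truncover;
  /* The colon keeps $20. from reading past the comma.       */
  /* code has no informat, so it falls back to the default $8. */
  input id name :$20. code $;
datalines;
10001,Team 2 North & West,H0
;

proc contents data=demo; run;
```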

Is it possible to replicate SAS md5 function output via GNU coreutils?

I expected this to be fairly straightforward, but I've run out of ideas this time. I'm working with GNU coreutils on Windows 7 (not that it should make any difference). I've found another command line utility that does what I want, but I'd prefer to find a way of doing this via GNU md5sum if possible.
Here's what I'm trying to reproduce:
data _null_;
length a $32;
a = put(md5("Hello"), $hex32.);
put a=;
run;
/*Output to replicate: 8B1A9953C4611296A827ABF8C47804D7*/
Here's what I've tried so far:
%macro wincmd /parmbuff;
filename cmd pipe "&SYSPBUFF" lrecl = 32767;
data _null_;
infile cmd lrecl = 32767;
input;
put _infile_;
run;
filename cmd clear;
%mend wincmd;
%let MD5SUM = C:\Program Files (x86)\coreutils\bin\md5sum.exe;
%wincmd(echo Hello | ""&MD5SUM"");
/*Output: f0d07a42adce73f0e4bc2d5e1cdb71e5 *- */
%wincmd(echo Hello | ""&MD5SUM"" -t);
/*Output: adb3f07f896745a101145fc3c1c7b2ea *- */
%wincmd(echo ""Hello"" | ""&MD5SUM"");
/*Output: 2c3a70806465ad43c09fd387e659fbce *- */
%let MD5 = C:\Program Files (x86)\md5\md5.exe;
%wincmd(echo Hello | ""&MD5"");
/*Output: F0D07A42ADCE73F0E4BC2D5E1CDB71E5 (matches md5sum)*/
%wincmd(echo ""Hello"" | ""&MD5"");
/*Output: 2C3A70806465AD43C09FD387E659FBCE (matches md5sum)*/
%wincmd(""&MD5"" -d""Hello"");
/*Output: 8B1A9953C4611296A827ABF8C47804D7 (matches SAS!)*/
Is there some form of syntax I can use with md5sum that will result in the same output (except possibly for upper/lower case differences) as SAS and md5 -d ? And why does the same string produce a different MD5 hash when read from stdin rather than as a command line parameter?
Update: fix, as suggested by DomPazz and Rob:
I thought I might as well go all in with coreutils at this point and match the SAS output exactly:
%let GNUPATH = C:\Program Files (x86)\coreutils\bin;
%let ECHO = &GNUPATH\echo.exe;
%let TR = &GNUPATH\tr.exe;
%let CUT = &GNUPATH\cut.exe;
%wincmd(""&ECHO"" -n ""Hello"" | ""&MD5SUM"" | ""&TR"" '[a-f]' '[A-F]' | ""&CUT"" -f 1 -d "" "");
/*Output: 8B1A9953C4611296A827ABF8C47804D7*/
Your problem is not in md5sum, but in echo. It is adding white space to the "Hello" string.
Verify
C:\>echo Hello > c:\temp\test.txt
C:\>md5sum c:\temp\test.txt
-- I get: f0d07a42adce73f0e4bc2d5e1cdb71e5
Now open the file and notice the white space and a newline. Delete those.
Run
C:\>md5sum c:\temp\test.txt
-- I get 8b1a9953c4611296a827abf8c47804d7, which matches SAS.
EDIT:
As mentioned in the comments below, GNU echo has the -n option, which suppresses the trailing newline.
C:\Cygwin\bin>echo.exe -n Hello | md5sum.exe
returns: 8b1a9953c4611296a827abf8c47804d7
which matches the SAS value.
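You can see the same effect from the SAS side. The sketch below hashes the bare string and then the string with extra trailing bytes. The trailing blank plus CR+LF is only an assumption about what the Windows echo sends down the pipe, so no particular value is claimed for the second hash; the point is just that it differs from the first. The variable names plain and padded are made up for this example.

```sas
data _null_;
  length plain padded $32;
  /* Bare string: this is the value the question wants to replicate. */
  plain  = put(md5('Hello'), $hex32.);
  /* Assumed shell output: the text, one blank, then CR+LF. */
  padded = put(md5('Hello ' || '0D0A'x), $hex32.);
  put plain= / padded=;
run;
```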
The MD5 algorithm output is only affected by 2 things as far as I'm aware:
case of source string
length of source string (includes leading/trailing blanks, length of empty string, etc.)
My guess is that the reason you are getting different outputs is because the different approaches pass in the string to hash with different (perhaps default) lengths and/or leading/trailing blanks, or perhaps your quotes are being included in the hash.

Comparing datasets

I have 2 datasets. 1 containing the columns origin_zip(number) and destination_zip(char) and tracking_number(char) and the other containing zip.
I would like to compare these 2 datasets so I can see all the tracking numbers and destination_zips that are not in the zip column of the second dataset.
Additionally I would like to see all of the tracking_numbers and origin_zips where the origin_zips = the destination_zips.
How would I accomplish this?
origin_zip destination_zip tracking_number
12345 23456 11111
34567 45678 22222
12345 12345 33333
zip
12345
34567
23456
results_tracking_number
22222
33333
Let's start with this...I don't think this completely answers your question, but follow up with comments and I will help if I can...
data zips;
input origin_zip $ destination_zip $ tracking_number $;
datalines;
12345 23456 11111
34567 45678 22222
56789 12345 33333
;
data zip;
input zip $;
datalines;
12345
54321
34567
76543
56789
;
Proc sort data=zips;
by origin_zip;
run;
Proc sort data=zip;
by zip;
run;
Data contained not_contained;
merge zip(in=a) zips(in=b rename=(origin_zip=zip));
by zip;
if a and b then output contained;
if a and not b then output not_contained;
run;
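For the two lists asked for in the question, a PROC SQL sketch might be simpler than a merge. It assumes the zips and zip datasets created above, where every column was read as character; if origin_zip is numeric in your real data, you would need to convert it (for example with PUT) before comparing. The output table names dest_not_listed and same_zip are invented for this example.

```sas
proc sql;
  /* Tracking numbers whose destination_zip is not in the zip list. */
  create table dest_not_listed as
  select tracking_number, destination_zip
  from zips
  where destination_zip not in (select zip from zip);

  /* Tracking numbers where origin and destination are the same. */
  create table same_zip as
  select tracking_number, origin_zip
  from zips
  where origin_zip = destination_zip;
quit;
```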