SAS: Specify newline character when inputting a text file

I'm quite new to SAS and have a very simple problem. I have a text file which is saved like this:
123,123,345,457,4.55~123,123,345,457,4.55~123,123,345,457,4.55
So all the data is written in a single line. The ~ character denotes the newline character. My final goal is to load the text file into sas and create a SAS dataset which should look like this:
V1 V2 V3 V4 V5
123 123 345 457 4.55
123 123 345 457 4.55
123 123 345 457 4.55
So ',' is the delimiter and '~' is the new line character.
How can I achieve this?
Thank you very much for your response.
Kind Regards
Consti

Just tell SAS to use both characters as delimiters and add a double trailing @@ to the INPUT statement to prevent it from going to a new line.
data want ;
infile cards dsd dlm=',~';
input v1-v5 @@ ;
cards;
123,123,345,457,4.55~123,123,345,457,4.55~123,123,345,457,4.55
;;;;
Result
Obs v1 v2 v3 v4 v5
1 123 123 345 457 4.55
2 123 123 345 457 4.55
3 123 123 345 457 4.55
If you are reading from a file, you might also be able to use the RECFM=N option on the INFILE statement instead of the @@ on the INPUT statement. If the single line actually ends with LF or CR/LF, you might want to include those characters in the delimiter list as well.

Tom's answer is correct for files that are regular, where you don't have issues with inconsistent rows.
If you do need to do exactly what you describe, though, it's possible: convert ~ to a newline in a pre-processing step. Here's one way to do that.
First, in a data step, go through the file with a dlm of ~; input the fields until you reach the end of the line, and output each field to a temp file (so each line now holds just one data row).
Now you have a temp file you can read in as normal, with no ~ characters in it.
You could do this in a number of other ways, such as literally finding and replacing ~ with '0D0A'x or whatever your preferred EOL character is. That is probably easier and faster in another language; if you have the file on Unix and have access to perl or awk, for example, you could likely do it more easily than in SAS.
filename test_in "c:\temp\test_dlm.txt";
filename test_t temp;
data _null_;
infile test_in dlm='~';
file test_t;
length test_field $32767;
do _n_= 1 by 1 until (_n_ > countc(_infile_,'~'));
input test_field :$32767. @@;
putlog test_field;
put test_field $;
end;
stop;
run;
data want;
infile test_t dlm=',';
input v1 v2 v3 v4 v5;
run;
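The find/replace route mentioned in this answer can be sketched outside SAS with tr. This is only a minimal sketch: the file names test_dlm.txt and test_rows.txt are assumptions for illustration, and the sample line is copied from the question.

```shell
# Sample input copied from the question: one physical line, '~' between records.
printf '%s' '123,123,345,457,4.55~123,123,345,457,4.55~123,123,345,457,4.55' > test_dlm.txt

# Replace every '~' with a newline so each record sits on its own line.
tr '~' '\n' < test_dlm.txt > test_rows.txt

# Show the converted file.
cat test_rows.txt
```

The converted file should then be readable with an ordinary INFILE statement using dlm=','.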

Related

Importing data to SAS from text file with multiple delimiters and line breaks in character variables

I want to read into SAS a text file data set that uses two different delimiters, "|" and the string "[end text]". It is arranged as below:
var1|var2|var3
4657|366|text that
has some line
breaks [end text]
45|264| more text that has
line breaks [end text]
I am trying to figure out how to recognize both of these delimiters. I tried to use the DLMSTR option, but this didn't work:
data new ;
infile 'file.txt' dlmstr='|'||'[report_end]' DSD firstobs=2 ;
input var1 var2 var3 $;
run;
Is there any way to use these two delimiters at the same time? Or am I using the wrong input style to import my data?
SAS can read delimited files that have embedded line breaks as long as the embedded line breaks use a different character than the normal end of line. So if your real observations end with CRLF (normal for a Windows text file) and the embedded line breaks are just a single LF character, then those extra breaks will be treated as just another character in that field.
var1|var2|var3<CR><LF>
4657|366|text that<LF>
has some line<LF>
breaks [end text]<CR><LF>
45|264| more text that has<LF>
line breaks [end text]<CR><LF>
For example here is a data step that could convert your original file.
data _null_;
infile original lrecl=32767 ;
file copy lrecl=1000000 termstr=lf ;
input ;
_infile_ = tranwrd(_infile_,'[end text]','0d'x);
if _n_=1 then _infile_=trim(_infile_)||'0d'x;
len = length(_infile_);
put _infile_ $varying32767. len ;
run;
But it might be better to replace the embedded line breaks with some other character, like ^, instead.
data _null_;
infile original truncover ;
file copy lrecl=1000000 ;
input line $char32767.;
len = length(line);
put line $varying32767. len @;
if _n_=1 or index(_infile_,'[end text]') then put ;
else put '^' @;
run;
Result:
var1|var2|var3
4657|366|text that^has some line^breaks [end text]
45|264| more text that has^line breaks [end text]
Which is easy to read.
Obs var1 var2 var3
1 4657 366 text that^has some line^breaks [end text]
2 45 264 more text that has^line breaks [end text]
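For comparison, the same line-joining idea can be sketched in awk. This is not part of the answer above, only an illustration under assumptions: the file names original.txt and copy.txt are made up, and the sample data is copied from the question.

```shell
# Recreate the sample file from the question.
cat > original.txt <<'EOF'
var1|var2|var3
4657|366|text that
has some line
breaks [end text]
45|264| more text that has
line breaks [end text]
EOF

# Print each input line without a newline. End the output record after the
# header line or after a line containing "[end text]"; otherwise emit '^'
# in place of the embedded line break.
awk '{
  printf "%s", $0
  if (NR == 1 || index($0, "[end text]")) print ""
  else printf "^"
}' original.txt > copy.txt

cat copy.txt
```

As in the SAS version, the "[end text]" marker is left in the last field and would still need to be stripped after reading.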

sas infile dataline truncates at 8 characters

Having not worked with SAS for a couple of years, I am trying to get back into it...
I am trying to read data with comma-delimited datalines. While there are plenty of examples, I can't quite get the following to import my data correctly:
data h0;
infile datalines delimiter=',';
input
kst
kst_bez $
hx $
hx_bez $
hxx $
hxx_bez $
hxxx $
hxxx_bez $
;
datalines;
10000,Team 1 South,H0,Group,H10,Retail,H112,Retail Germany
10001,Team 2 North & West,H0,H10,Retail Division 2,H112,Retail Germany
10003,Human Res,H0,Group,H20,HR,H112,HR Germany
;
I would have thought that delimiter=',' tells SAS to simply read the data between my ,-Characters into something like a VARCHAR-variable... however, any alphanumeric data is truncated at 8 characters.
I vaguely remember I have to use something like $varying40., which is in line with the examples I found - however, if I add this to my variables, the variable doesn't stop at the ,, but instead reads the whole, say, 40 characters.
Any hints?
Thanks a ton!
If you don't define them otherwise, SAS will default all character variables to length 8. It is probably clearer for you and the SAS compiler if you explicitly define the variables using a LENGTH or ATTRIB statement before using them. Otherwise SAS has to guess at how you wanted them defined based on how they are first used.
data h0;
length kst 8 kst_bez $20 hx $20 hx_bez $20 hxx $20 hxx_bez $20
hxxx $20 hxxx_bez $20
;
infile datalines dsd truncover ;
input kst -- hxxx_bez ;
datalines;
...
You could add in-line informat specifications to the INPUT statement as the first use of the variable and SAS will default to the width of the informat used, but make sure to add the colon prefix to prevent SAS from reading past the delimiters.
data h0;
infile datalines dsd truncover ;
input kst kst_bez :$20. hx :$20. hx_bez :$20. hxx :$20. hxx_bez :$20.
hxxx :$20. hxxx_bez :$20.
;
datalines;
...

simply pass a variable into a regex OR string search in awk

This is driving me nuts. Here's what I want to do, and I've made it simple as possible:
This is written into an awk script:
#!/bin/bash/awk
# pass /^CHEM/, /^BIO/, /^ENG/ into someVariable and search file.txt
/someVariable/ {print NR, $0}
OR I would be fine with (but like less)
#!/bin/bash/awk
# pass "CHEM", "BIO", "ENG" into someVariable and search file.txt
$1=="someVariable" {print NR, $0}
I find all kinds of stuff on BASH/SHELL variables being passed but I don't want to learn BASH programming to simply pass a value to a variable.
Bonus: I actually have to search 125 values in each document, with 40 documents needing to be evaluated. It can't hurt to ask a bit more, but how would I take a separate file of these 125 values, pass them individually to someVariable?
I have all sorts of ways to do this in BASH but I don't understand them and there has got to be a way to simply cycle through a set of search terms dynamically in awk (perhaps by an array since I do not believe a list exists yet)
Thank you as I am tired of beating my head into a wall.
"I actually have to search 125 values in each document, with 40 documents needing to be evaluated."
Let's put the strings that we want to search for in file1:
$ cat file1
apple
banana
pear
Let's call the file that we want to search file2:
$ cat file2
ear of corn
apple blossom
peas in a pod
banana republic
pear tree
To search file2 for any of the words in file1, use:
$ awk 'FNR==NR{a[$1]=1;next;} ($1 in a){print FNR,$0;}' file1 file2
2 apple blossom
4 banana republic
5 pear tree
How it works
FNR==NR{a[$1]=1;next;}
This stores every word that we are looking for as a key in array a.
In more detail, NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: file1. For every line in file1, we set a[$1] to 1.
next tells awk to skip the rest of the commands and start over with the next line.
($1 in a){print FNR,$0;}
If we get to this command, we are on file2.
If the first field is a key in array a, then we print the line number and the line.
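The single-pattern case from the question is not covered above. In awk, /someVariable/ matches the literal text "someVariable", so a variable has to be matched with the ~ operator instead. A minimal sketch, assuming a made-up file.txt (the question does not show its contents) and the hypothetical variable names pat and word:

```shell
# Hypothetical sample data; the question's file.txt is not shown.
printf '%s\n' 'CHEM 101' 'BIO 110' 'MATH 200' 'ENG 105' > file.txt

# -v assigns an awk variable before the program runs; "$0 ~ pat" treats
# the variable's contents as a regular expression.
awk -v pat='^CHEM' '$0 ~ pat {print NR, $0}' file.txt

# Exact string comparison on the first field instead of a regex match.
awk -v word='ENG' '$1 == word {print NR, $0}' file.txt
```

With this sample data the first command prints line 1 and the second prints line 4.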
"...For example I wanted the text between two regexp from file2. Let's say /apple/, /pear/. How would I substitute and extract the text between those two regexp?..."
while read b e; do awk "/^$b$/,/^$e$/" <(seq 1 100); done << !
> 1 5
> 2 8
> 90 95
> !
1
2
3
4
5
2
3
4
5
6
7
8
90
91
92
93
94
95
Here between the two exclamation points is the input for ranges and as the data file I used 1..100. Notice the double quotes instead of single quotes in the awk script.
If the start and end values are in a file named ranges and your data is in a file named data:
while read b e; do awk "/^$b$/,/^$e$/" data; done < ranges
If you want to print the various ranges to different files, you can do something like this
while read b e; do awk "/^$b$/,/^$e$/ {print > \"$b$e\"}" data; done < ranges
A slight variation that you may or may not like... I sometimes use the BEGIN section to read the contents of a file into an array...
BEGIN {
  count = 1
  # getline returns 1 per line read, 0 at end of input, -1 on error
  while (("cat file1" | getline) > 0)
  {
    a[count] = $3
    count++
  }
}
The rest continues in much the same way. Anyway, maybe that works for you as well.

Is it possible to replicate SAS md5 function output via GNU coreutils?

I expected this to be fairly straightforward, but I've run out of ideas this time. I'm working with GNU coreutils on Windows 7 (not that it should make any difference). I've found another command line utility that does what I want, but I'd prefer to find a way of doing this via GNU md5sum if possible.
Here's what I'm trying to reproduce:
data _null_;
length a $32;
a = put(md5("Hello"), $hex32.);
put a=;
run;
/*Output to replicate: 8B1A9953C4611296A827ABF8C47804D7*/
Here's what I've tried so far:
%macro wincmd /parmbuff;
filename cmd pipe "&SYSPBUFF" lrecl = 32767;
data _null_;
infile cmd lrecl = 32767;
input;
put _infile_;
run;
filename cmd clear;
%mend wincmd;
%let MD5SUM = C:\Program Files (x86)\coreutils\bin\md5sum.exe;
%wincmd(echo Hello | ""&MD5SUM"");
/*Output: f0d07a42adce73f0e4bc2d5e1cdb71e5 *- */
%wincmd(echo Hello | ""&MD5SUM"" -t);
/*Output: adb3f07f896745a101145fc3c1c7b2ea *- */
%wincmd(echo ""Hello"" | ""&MD5SUM"");
/*Output: 2c3a70806465ad43c09fd387e659fbce *- */
%let MD5 = C:\Program Files (x86)\md5\md5.exe;
%wincmd(echo Hello | ""&MD5"");
/*Output: F0D07A42ADCE73F0E4BC2D5E1CDB71E5 (matches md5sum)*/
%wincmd(echo ""Hello"" | ""&MD5"");
/*Output: 2C3A70806465AD43C09FD387E659FBCE (matches md5sum)*/
%wincmd(""&MD5"" -d""Hello"");
/*Output: 8B1A9953C4611296A827ABF8C47804D7 (matches SAS!)*/
Is there some form of syntax I can use with md5sum that will result in the same output (except possibly for upper/lower case differences) as SAS and md5 -d ? And why does the same string produce a different MD5 hash when read from stdin rather than as a command line parameter?
Update: fix, as suggested by DomPazz and Rob:
I thought I might as well go all in with coreutils at this point and match the SAS output exactly:
%let GNUPATH = C:\Program Files (x86)\coreutils\bin;
%let ECHO = &GNUPATH\echo.exe;
%let TR = &GNUPATH\tr.exe;
%let CUT = &GNUPATH\cut.exe;
%wincmd(""&ECHO"" -n ""Hello"" | ""&MD5SUM"" | ""&TR"" '[a-f]' '[A-F]' | ""&CUT"" -f 1 -d "" "");
/*Output: 8B1A9953C4611296A827ABF8C47804D7*/
Your problem is not in md5sum, but in echo. It is adding white space to the "Hello" string.
Verify
C:\>echo Hello > c:\temp\test.txt
C:\>md5sum c:\temp\test.txt
-- I get: f0d07a42adce73f0e4bc2d5e1cdb71e5
Now open the file and notice the white space and a newline. Delete those
Run
C:\>md5sum c:\temp\test.txt
-- I get 8b1a9953c4611296a827abf8c47804d7, which matches SAS.
EDIT:
As mentioned in the comments below, GNU echo has the -n option to suppress the trailing newline.
C:\Cygwin\bin>echo.exe -n Hello | md5sum.exe
returns: 8b1a9953c4611296a827abf8c47804d7
which matches the SAS value.
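The effect of the trailing newline can be reproduced on a system with GNU coreutils. A minimal sketch, assuming a POSIX shell rather than the Windows cmd prompt used above; printf is used here instead of echo -n, which is a substitution and not what the answer ran:

```shell
# printf writes exactly the five bytes of Hello, with no trailing newline.
printf '%s' 'Hello' | md5sum | cut -d ' ' -f 1

# echo appends a newline, so a different byte sequence gets hashed.
echo 'Hello' | md5sum | cut -d ' ' -f 1

# od -c makes the extra byte visible.
echo 'Hello' | od -c
```

The first command should print 8b1a9953c4611296a827abf8c47804d7, the lowercase form of the SAS result; the second prints a different digest.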
The MD5 output depends on every byte of the input. As far as I'm aware, the things most likely to differ here are:
case of source string
length of source string (includes leading/trailing blanks, length of empty string, etc.)
My guess is that you are getting different outputs because the different approaches pass in the string to hash with different (perhaps default) lengths and/or leading/trailing blanks, or perhaps your quotes are being included in the hash.

Default behavior of Input Buffer in SAS while reading data from external file

Contents of a.txt
22
333
4444
55555
When I run this code:
data numbers;
infile 'c:\a.txt';
input var 5.;
/* list */ ;
run;
the data in numbers.sas is saved as :
333
55555
** Note the format of the data in numbers.sas and the format in a.txt
But when I use the LIST statement, the input buffer looks something like this:
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7
2 333 3
4 55555 5
Why doesn't SAS show 1 and 3? And how is the input buffer being read?
Please explain.
Try adding TRUNCOVER to your INFILE statement or removing the 5. from your INPUT statement. As written, SAS expects a 5-digit number, so if a line in your source file is shorter than 5 characters it will continue reading on the next line.
data numbers;
infile 'c:\a.txt' truncover;
input var 5.;
run;
For more info, read this paper on INFILE statement options.