SAS infile datalines truncates at 8 characters

Having not worked with SAS for a couple of years, I am trying to get back into it...
I am trying to read data with comma-delimited datalines. While there are plenty of examples, I can't quite get the following to import my data correctly:
data h0;
infile datalines delimiter=',';
input
kst
kst_bez $
hx $
hx_bez $
hxx $
hxx_bez $
hxxx $
hxxx_bez $
;
datalines;
10000,Team 1 South,H0,Group,H10,Retail,H112,Retail Germany
10001,Team 2 North & West,H0,H10,Retail Division 2,H112,Retail Germany
10003,Human Res,H0,Group,H20,HR,H112,HR Germany
;
I would have thought that delimiter=',' tells SAS to simply read the data between my ,-Characters into something like a VARCHAR-variable... however, any alphanumeric data is truncated at 8 characters.
I vaguely remember I have to use something like $varying40., which is in line with the examples I found - however, if I add this to my variables, the variable doesn't stop at the ,, but instead reads the whole, say, 40 characters.
Any hints?
Thanks a ton!

If you don't define them otherwise, SAS will default all character variables to length 8. It is probably clearer for you and the SAS compiler if you explicitly define the variables using a LENGTH or ATTRIB statement before using them. Otherwise SAS has to guess how you wanted them defined based on how they are first used.
data h0;
length kst 8 kst_bez $20 hx $20 hx_bez $20 hxx $20 hxx_bez $20
hxxx $20 hxxx_bez $20
;
infile datalines dsd truncover ;
input kst -- hxxx_bez ;
datalines;
...
You could add in-line informat specifications to the INPUT statement as the first use of the variable and SAS will default to the width of the informat used, but make sure to add the colon prefix to prevent SAS from reading past the delimiters.
data h0;
infile datalines dsd truncover ;
input kst kst_bez :$20. hx :$20. hx_bez :$20. hxx :$20. hxx_bez :$20.
hxxx :$20. hxxx_bez :$20.
;
datalines;
...
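For reference, a complete version of the first approach might look like the sketch below. It uses two of the rows from the question (the second row there has only seven fields, so it is left out here), and the $20 length is an assumption that happens to fit this sample. PROC CONTENTS is added only to confirm the lengths SAS actually assigned.

```sas
data h0;
  /* Define every variable up front: kst is numeric, the rest are $20 (assumed width) */
  length kst 8 kst_bez hx hx_bez hxx hxx_bez hxxx hxxx_bez $20;
  /* DSD uses comma as the delimiter; TRUNCOVER keeps short lines from spilling into the next one */
  infile datalines dsd truncover;
  /* Positional list: read every variable from kst through hxxx_bez */
  input kst -- hxxx_bez;
datalines;
10000,Team 1 South,H0,Group,H10,Retail,H112,Retail Germany
10003,Human Res,H0,Group,H20,HR,H112,HR Germany
;

/* List the variables in creation order with their types and lengths */
proc contents data=h0 varnum;
run;
```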

Related

SAS proc import then proc format: ERROR: For format $xxxxx, this range is repeated, or values overlap: C4311-C4311

I am a SAS novice and I have encountered this issue. I already referred to several posts including this: [SAS Formats]ERROR: For format COUNTRIES, this range is repeated, or values overlap: .-.
I used the following code block to export a particular entry (Ias1012324y22y23mc) in my SAS catalog:
libname perm '<path>';
filename tempfile '<filename>.csv' ;
proc FORMAT FMTLIB LIB=formats.formats cntlout=sasuser.fmtdata;
select $Ias1012324y22y23mc;
run;
proc export data=sasuser.fmtdata outfile=tempfile dbms=csv replace;
run;
quit;
My intention is to make a few changes and import the result into a different catalog. To verify the round trip first, I uploaded the exact same CSV file, but I still ran into this issue:
ERROR: For format $IAS1012324Y22Y23MC, this range is repeated, or values overlap: C4311-C4311
Here is my import script:
libname perm '<path>';
filename tempfile '<filename>.csv' ;
PROC IMPORT
datafile=tempfile OUT=updated DBMS=CSV REPLACE;
GETNAMES=YES;
RUN;
proc format library=perm.library fmtlib cntlin=updated;
select $IAS1012324Y22Y23MC;
run;
quit;
I also tried to add a controlset with no luck as mentioned here: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/n03qskwoints2an1ispy57plwrn9.htm
PROC IMPORT
datafile=tempfile OUT=updated DBMS=CSV REPLACE;
GETNAMES=YES;
RUN;
data ctrl;
retain fmtname '$IAS1012324Y22Y23MC';
length FMTNAME $32. START $9. END $9. LABEL $23. PREFIX $2. FILL $1. TYPE $1. SEXCL $1. EEXCL $1. HLO $13. DECSEP $1. DIG3SEP $1. DATATYPE $8. LANGUAGE $8.;
set updated;
* proc print;
run;
proc format library=perm.library fmtlib cntlin=ctrl;
select $IAS1012324Y22Y23MC;
run;
quit;
Here is my dataset where the overlap is happening
FMTNAME,START,END,LABEL,MIN,MAX,DEFAULT,LENGTH,FUZZ,PREFIX,MULT,FILL,NOEDIT,TYPE,SEXCL,EEXCL,HLO,DECSEP,DIG3SEP,DATATYPE,LANGUAGE
IAS1012324Y22Y23MC,C4310,C4310,23,1,40,4,4,0,,0,,0,C,N,N,,,,,
IAS1012324Y22Y23MC,C43111,C43111,23,1,40,4,4,0,,0,,0,C,N,N,,,,,
IAS1012324Y22Y23MC,C43112,C43112,23,1,40,4,4,0,,0,,0,C,N,N,,,,,
IAS1012324Y22Y23MC,C43121,C43121,23,1,40,4,4,0,,0,,0,C,N,N,,,,,
IAS1012324Y22Y23MC,C43122,C43122,23,1,40,4,4,0,,0,,0,C,N,N,,,,,
IAS1012324Y22Y23MC,C4320,C4320,23,1,40,4,4,0,,0,,0,C,N,N,,,,,
The issue seems to be the default length of START: when I imported the data, it came in as 4. I edited the CSV file and changed the DEFAULT column to 9, but I still get the same issue.
Update
Here is the generated data step after adding GUESSINGROWS=MAX; still the same issue.
data WORK.UPDATED ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'F:\SAS Programs\RAF2024InitialModel\import-model-sg\data-in-ascii.txt' delimiter = ',' MISSOVER DSD lrecl=13106 firstobs=2 ;
informat FMTNAME $18. ;
informat START $9. ;
informat END $9. ;
informat LABEL best32. ;
informat MIN best32. ;
informat MAX best32. ;
informat DEFAULT best32. ;
informat LENGTH best32. ;
informat FUZZ best32. ;
informat PREFIX $1. ;
informat MULT best32. ;
informat FILL $1. ;
informat NOEDIT best32. ;
informat TYPE $1. ;
informat SEXCL $1. ;
informat EEXCL $1. ;
informat HLO $1. ;
informat DECSEP $1. ;
informat DIG3SEP $1. ;
informat DATATYPE $1. ;
informat LANGUAGE $1. ;
format FMTNAME $18. ;
format START $9. ;
format END $9. ;
format LABEL best12. ;
format MIN best12. ;
format MAX best12. ;
format DEFAULT best12. ;
format LENGTH best12. ;
format FUZZ best12. ;
format PREFIX $1. ;
format MULT best12. ;
format FILL $1. ;
format NOEDIT best12. ;
format TYPE $1. ;
format SEXCL $1. ;
format EEXCL $1. ;
format HLO $1. ;
format DECSEP $1. ;
format DIG3SEP $1. ;
format DATATYPE $1. ;
format LANGUAGE $1. ;
input
FMTNAME $
START $
END $
LABEL
MIN
MAX
DEFAULT
LENGTH
FUZZ
PREFIX $
MULT
FILL $
NOEDIT
TYPE $
SEXCL $
EEXCL $
HLO $
DECSEP $
DIG3SEP $
DATATYPE $
LANGUAGE $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
Don't let PROC IMPORT GUESS how to read your text file. Write your own code instead. If you do have to use PROC IMPORT to read a text file make sure to always use the GUESSINGROWS=MAX; statement so that it checks the whole file before deciding the type and length to use for each variable.
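As a rough sketch of what "write your own code" could look like for this file: the data step below reads the CSV directly, using the `tempfile` fileref from the question and the character widths from the asker's own LENGTH statement. It has not been run against the asker's data, so treat the widths as assumptions. Note that LABEL is read as character here, whereas the generated PROC IMPORT step read it as a number.

```sas
data ctrl;
  /* Skip the header row; DSD handles the empty fields between consecutive commas */
  infile tempfile dsd truncover firstobs=2;
  /* Colon-modified informats set each length and still stop at the delimiter */
  input fmtname :$32. start :$9. end :$9. label :$23.
        min max default length fuzz
        prefix :$2. mult fill :$1. noedit
        type :$1. sexcl :$1. eexcl :$1. hlo :$13.
        decsep :$1. dig3sep :$1. datatype :$8. language :$8.;
run;
```

The resulting `ctrl` data set could then be passed to PROC FORMAT through CNTLIN= in place of the PROC IMPORT output.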

Importing data to SAS from text file with multiple delimiters and line breaks in character variables

I want to read into SAS a text file data set that uses two different delimiters, "|" and the string "[end text]". It is arranged as below:
var1|var2|var3
4657|366|text that
has some line
breaks [end text]
45|264| more text that has
line breaks [end text]
I am trying to figure out how to recognize both of these delimiters. I tried to use the DLMSTR option, but this didn't work:
data new ;
infile 'file.txt' dlmstr='|'||'[report_end]' DSD firstobs=2 ;
input var1 var2 var3 $;
run;
Is there any way to use these two delimiters at the same time? Or am I using the wrong input style to import my data?
SAS can read delimited files that have embedded line breaks as long as the embedded line breaks use a different character than the normal end of line. So if your real observations end with CRLF (normal for a Windows text file) and the embedded line breaks are just a single LF character, then those extra breaks will be treated as just another character in that field.
var1|var2|var3<CR><LF>
4657|366|text that<LF>
has some line<LF>
breaks [end text]<CR><LF>
45|264| more text that has<LF>
line breaks [end text]<CR><LF>
For example here is a data step that could convert your original file.
data _null_;
infile original lrecl=32767 ;
file copy lrecl=1000000 termstr=lf ;
input ;
_infile_ = tranwrd(_infile_,'[end text]','0d'x);
if _n_=1 then _infile_=trim(_infile_)||'0d'x;
len = length(_infile_);
put _infile_ $varying32767. len ;
run;
But it might be better to replace the embedded line breaks with some other character, like ^, instead.
data _null_;
infile original truncover ;
file copy lrecl=1000000 ;
input line $char32767.;
len = length(line);
put line $varying32767. len @;
if _n_=1 or index(_infile_,'[end text]') then put ;
else put '^' @;
run;
Result:
var1|var2|var3
4657|366|text that^has some line^breaks [end text]
45|264| more text that has^line breaks [end text]
Which is easy to read.
Obs var1 var2 var3
1 4657 366 text that^has some line^breaks [end text]
2 45 264 more text that has^line breaks [end text]
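A minimal sketch of the step that reads the converted file, assuming the `copy` fileref from the step above and an arbitrary width of 200 for var3:

```sas
data want;
  /* One observation per line now; skip the header row */
  infile copy dsd dlm='|' truncover firstobs=2;
  /* var1 and var2 are numeric; var3 is character with an assumed maximum of 200 */
  input var1 var2 var3 :$200.;
run;
```

If the original line breaks are needed later, the ^ placeholders in var3 can be translated back after the read.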

SAS: Specify newline character when input a text file

I'm quite new to SAS and have a very simple problem. I have a text file which is saved like this:
123,123,345,457,4.55~123,123,345,457,4.55~123,123,345,457,4.55
So all the data is written in a single line. The ~ character denotes the newline character. My final goal is to load the text file into sas and create a SAS dataset which should look like this:
V1 V2 V3 V4 V5
123 123 345 457 4.55
123 123 345 457 4.55
123 123 345 457 4.55
So ',' is the delimiter and '~' is the new line character.
How can I achieve this?
Thank you very much for your response.
Kind Regards
Consti
Just tell SAS to use both characters as delimiters and add @@ to the input statement to prevent it from going to a new line.
data want ;
infile cards dsd dlm=',~';
input v1-v5 @@ ;
cards;
123,123,345,457,4.55~123,123,345,457,4.55~123,123,345,457,4.55
;;;;
Result
Obs v1 v2 v3 v4 v5
1 123 123 345 457 4.55
2 123 123 345 457 4.55
3 123 123 345 457 4.55
If you are reading from a file then you might also be able to use the RECFM=N option on the INFILE statement instead of the @@ on the INPUT statement, although if the one line actually has LF or CR/LF at the end then you might want to include them in the delimiter list also.
Tom's answer is correct for files that are regular and you don't have issues with inconsistent rows.
If you do need to do exactly what you say though, it's possible; you'd convert ~ to a newline through a pre-processing step. Here's one way to do that.
First, in a data step go through the file with dlm of ~; input the fields until you run to the end of the line, and for each field, output it to a temp file (so now the line has just one data row on it).
Now you have a temp file you can read in like normal, with no ~ characters in it.
You could do this in a number of other ways, for example by literally finding and replacing ~ with '0D0A'x or whatever your preferred EOL character is. That is probably easier and faster in another language: if you are on Unix and have access to Perl or awk, for example, you could probably do this more easily than in SAS.
filename test_in "c:\temp\test_dlm.txt";
filename test_t temp;
data _null_;
infile test_in dlm='~';
file test_t;
length test_field $32767;
do _n_= 1 by 1 until (_n_ > countc(_infile_,'~'));
input
test_field :$32767. @@;
putlog test_field;
put test_field $;
end;
stop;
run;
data want;
infile test_t dlm=',';
input v1 v2 v3 v4 v5;
run;

Is it possible to replicate SAS md5 function output via GNU coreutils?

I expected this to be fairly straightforward, but I've run out of ideas this time. I'm working with GNU coreutils on Windows 7 (not that it should make any difference). I've found another command line utility that does what I want, but I'd prefer to find a way of doing this via GNU md5sum if possible.
Here's what I'm trying to reproduce:
data _null_;
length a $32;
a = put(md5("Hello"), $hex32.);
put a=;
run;
/*Output to replicate: 8B1A9953C4611296A827ABF8C47804D7*/
Here's what I've tried so far:
%macro wincmd /parmbuff;
filename cmd pipe "&SYSPBUFF" lrecl = 32767;
data _null_;
infile cmd lrecl = 32767;
input;
put _infile_;
run;
filename cmd clear;
%mend wincmd;
%let MD5SUM = C:\Program Files (x86)\coreutils\bin\md5sum.exe;
%wincmd(echo Hello | ""&MD5SUM"");
/*Output: f0d07a42adce73f0e4bc2d5e1cdb71e5 *- */
%wincmd(echo Hello | ""&MD5SUM"" -t);
/*Output: adb3f07f896745a101145fc3c1c7b2ea *- */
%wincmd(echo ""Hello"" | ""&MD5SUM"");
/*Output: 2c3a70806465ad43c09fd387e659fbce *- */
%let MD5 = C:\Program Files (x86)\md5\md5.exe;
%wincmd(echo Hello | ""&MD5"");
/*Output: F0D07A42ADCE73F0E4BC2D5E1CDB71E5 (matches md5sum)*/
%wincmd(echo ""Hello"" | ""&MD5"");
/*Output: 2C3A70806465AD43C09FD387E659FBCE (matches md5sum)*/
%wincmd(""&MD5"" -d""Hello"");
/*Output: 8B1A9953C4611296A827ABF8C47804D7 (matches SAS!)*/
Is there some form of syntax I can use with md5sum that will result in the same output (except possibly for upper/lower case differences) as SAS and md5 -d ? And why does the same string produce a different MD5 hash when read from stdin rather than as a command line parameter?
Update: fix, as suggested by DomPazz and Rob:
I thought I might as well go all in with coreutils at this point and match the SAS output exactly:
%let GNUPATH = C:\Program Files (x86)\coreutils\bin;
%let ECHO = &GNUPATH\echo.exe;
%let TR = &GNUPATH\tr.exe;
%let CUT = &GNUPATH\cut.exe;
%wincmd(""&ECHO"" -n ""Hello"" | ""&MD5SUM"" | ""&TR"" '[a-f]' '[A-F]' | ""&CUT"" -f 1 -d "" "");
/*Output: 8B1A9953C4611296A827ABF8C47804D7*/
Your problem is not in md5sum, but in echo. It is adding white space to the "Hello" string.
Verify
C:\>echo Hello > c:\temp\test.txt
C:\>md5sum c:\temp\test.txt
-- I get: f0d07a42adce73f0e4bc2d5e1cdb71e5
Now open the file and notice the white space and a newline. Delete those.
Run
C:\>md5sum c:\temp\test.txt
-- I get 8b1a9953c4611296a827abf8c47804d7, which matches SAS.
EDIT:
As mentioned in the comments below, GNU echo has the -n option, which suppresses the trailing newline.
C:\Cygwin\bin>echo.exe -n Hello | md5sum.exe
returns: 8b1a9953c4611296a827abf8c47804d7
which matches the SAS value.
As far as I'm aware, the MD5 output depends only on the exact bytes of the source string, so two things matter in practice:
case of the source string
length of the source string (including leading/trailing blanks, the length of an empty string, etc.)
My guess is that you are getting different outputs because the different approaches pass in the string to hash with different (perhaps default) lengths and/or leading/trailing blanks, or perhaps your quotes are being included in the hash.
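The effect of extra bytes can be checked from the SAS side as well. The sketch below hashes the bare string and a second string with a trailing blank and CRLF appended. That second string is an assumption about what the Windows `echo Hello |` pipeline sends, not something confirmed in the thread.

```sas
data _null_;
  length a b $32;
  /* Bare string: the question reports 8B1A9953C4611296A827ABF8C47804D7 for this */
  a = put(md5('Hello'), $hex32.);
  /* Assumed echo output: trailing blank plus CR and LF ('0D0A'x) */
  b = put(md5('Hello ' || '0D0A'x), $hex32.);
  /* Print both digests to the log for comparison */
  put a= / b=;
run;
```

Any difference between the two values comes purely from the appended bytes, since MD5 does no trimming of its input.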

Default behavior of Input Buffer in SAS while reading data from external file

Contents of a.txt
22
333
4444
55555
When I run this code:
data numbers;
infile 'c:\a.txt';
input var 5.;
/* list */ ;
run;
the data in the numbers data set is saved as:
333
55555
Note the difference between the data in the numbers data set and the contents of a.txt.
When I use the LIST statement, the input buffer looks something like this:
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7
2 333 3
4 55555 5
Why doesn't SAS show 1 and 3? And how is the input buffer being read?
Please explain.
Try adding TRUNCOVER to your INFILE statement, or remove the 5. from your INPUT statement. As written, SAS expects a 5-character field. With the default FLOWOVER behavior, if a line in your source file is shorter than 5 characters, SAS moves on to the next line to find the value.
data numbers;
infile 'c:\a.txt' truncover;
input var 5.;
run;
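The other option mentioned above, dropping the 5. informat, might look like this minimal sketch. With list input SAS scans for the next non-blank value on the line rather than reading a fixed 5 columns, so short lines do not cause it to skip ahead.

```sas
data numbers;
  infile 'c:\a.txt';
  /* List input: no fixed width, so each line yields one observation */
  input var;
run;
```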
For more information, read this paper on INFILE statement options.