Read DATA INPUT column statement conditionally in SAS

I'm relatively new to SAS (using SAS EG if it matters).
I have several previously written programs that read new data files. For each type of data file there's a separate program with its own specific INPUT column statement.
For example, one program would have:
DATA data1;
  INFILE 'D:\file.txt' missover;
  INPUT
    ID 1 - 8
    NAME $ 9 - 20;
run;
whereas another program would have different definitions, for example:
INPUT
  ID 1 - 5
  NAME $ 6 - 20;
Each data file contains hundreds of variables, so the INPUT column statement in each program is very long. However, the rest of these programs are completely identical.
My intention is to combine these programs into one. I have two questions:
Is it possible to combine these programs with a conditional INPUT column statement?
Is it possible to read the definition of each file type columns from a variable? (Thus enabling me to define it elsewhere in the workflow or even to read it from an external file)

It seems like you are using text files with fixed-width records. For each file type you can specify a format file of the form
column, type, start, end
and then read that file first in order to build the INPUT statement: column is the column name, type is either n (numeric) or c (character), and start and end are the start and end positions of that column.
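For instance, for the first program above such a format file might contain:
ID,n,1,8
NAME,c,9,20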
You would wrap this into a MACRO like this:
%macro readFile(file, output);
  %local input_statement;

  /* First, read the format file that contains the column details
     and build the INPUT statement from it. */
  data _null_;
    infile "&file..fmt" dlm="," end=eof;
    length column $ 32 type $ 1        /* allow column names longer than the default 8 */
           input_statement $ 32767;
    retain input_statement "input";
    input column $ type $ start end;
    if type = "c" then type = "$";
    else type = "";
    input_statement = catx(" ", input_statement, column, type, start, "-", end);
    if eof then call symputx("input_statement", input_statement);
  run;

  /* Read the actual file. TRUNCOVER stops SAS from jumping to the next
     record when a line is shorter than the columns being read. */
  data &output.;
    infile "&file." truncover;
    &input_statement.;
  run;
%mend;
For a file file.txt the macro needs the format file to be named file.txt.fmt in the same path. Call the macro as
%readFile(%str(D:\file.txt), data1);
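With a format file like the one sketched above, the generated statement resolves to:
input ID 1 - 8 NAME $ 9 - 20;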

SAS Export Issue as it is giving additional double quote

I am trying to export SAS data into a CSV file. The SAS dataset is named abc and the data looks like this:
LINE_NUMBER DESCRIPTION
524JG 24PC AMEFA VINTAGE CUTLERY SET "DUBARRY"
I am using the following code:
filename exprt "C:/abc.csv" encoding="utf-8";
proc export data=abc
  outfile=exprt
  dbms=tab;
run;
The output is:
LINE_NUMBER DESCRIPTION
524JG "24PC AMEFA VINTAGE CUTLERY SET ""DUBARRY"""
So there is a double quote before and after the description, and an additional double quote appears before and after the word DUBARRY. I have no clue what's happening. Can someone help me resolve this and understand what exactly is happening here?
Expected result:
LINE_NUMBER DESCRIPTION
524JG 24PC AMEFA VINTAGE CUTLERY SET "DUBARRY"
There is no need to use PROC EXPORT to create a delimited file. You can write it with a simple DATA step. If you want to create your example file, just do not use the DSD option on the FILE statement. But note that, depending on the data you are writing, you could create a file that cannot be properly parsed because of extra unprotected delimiters. You will also have trouble representing missing values.
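For the question's example, a minimal sketch (reusing the abc dataset and exprt fileref from the question; the header-row line is an assumption about the desired layout):
data _null_;
  set abc;
  file exprt dlm='09'x;
  if _n_ = 1 then put 'LINE_NUMBER' '09'x 'DESCRIPTION'; /* header row */
  put (_all_) (+0);
run;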
Let's make a sample dataset we can use to test.
data have ;
  input id value cvalue $ name $20. ;
cards;
1 123 A Normal
2 345 B Embedded|delimiter
3 678 C Embedded "quotes"
4 . D Missing value
5 901 . Missing cvalue
;
Essentially PROC EXPORT is writing the data using the DSD option. Like this:
data _null_;
  set have ;
  file 'myfile.txt' dsd dlm='09'x ;
  put (_all_) (+0);
run;
Which will yield a file like this (with pipes replacing the tabs so you can see them).
1|123|A|Normal
2|345|B|"Embedded|delimiter"
3|678|C|"Embedded ""quotes"""
4||D|Missing value
5|901||Missing cvalue
If you just remove the DSD option then you get a file like this instead.
1|123|A|Normal
2|345|B|Embedded|delimiter
3|678|C|Embedded "quotes"
4|.|D|Missing value
5|901| |Missing cvalue
Notice how the second line looks like it has 5 values instead of 4, making it impossible to know how to split it into 4 values. Also notice how the missing values now occupy at least one character each.
Another way would be to run a data step to convert the normal file that PROC EXPORT generates into the variant format that you want. This might also give you a place to add escape characters to protect special characters if your target format requires them.
data _null_;
  infile normal dsd dlm='|' truncover ; /* fileref for the file PROC EXPORT created */
  file abnormal dlm='|';                /* fileref for the converted file */
  do i=1 to 4 ;
    if i>1 then put '|' @;
    input field :$32767. @;
    field = tranwrd(field,'\','\\');    /* escape literal backslashes */
    field = tranwrd(field,'|','\|');    /* escape embedded delimiters */
    len = lengthn(field);
    put field $varying32767. len @;
  end;
  put;
run;
You could even make this datastep smart enough to count the number of fields on the first row and use that to control the loop so that you wouldn't have to hard code it.
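For example, a sketch of that idea under the same assumptions (COUNTW's 'm' modifier counts empty fields and 'q' ignores delimiters inside quotes):
data _null_;
  infile normal dsd dlm='|' truncover;
  file abnormal dlm='|';
  length field $32767;
  retain nfields;
  if _n_ = 1 then do;
    input @;                               /* load the first row and hold the line */
    nfields = countw(_infile_, '|', 'mq'); /* count the fields on the first row */
  end;
  do i = 1 to nfields;
    if i > 1 then put '|' @;
    input field :$32767. @;
    field = tranwrd(field,'\','\\');
    field = tranwrd(field,'|','\|');
    len = lengthn(field);
    put field $varying32767. len @;
  end;
  put;
run;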

Read a file line by line for every observation in a dataset

I'm trying to create a program that takes a text file, replaces any macro references within it, and appends it to a single output file. The macro references are generated as I iterate over the observations in a dataset.
I'm having trouble trying to get it to read the entire text file for each observation in my source table. I think there's an implicit stop instruction related to my use of the end= option on the infile statement that is preventing my set statement from iterating over each record.
I've simplified the template and code, examples below:
Here is the template that I'm trying to populate:
INSERT INTO some_table (name,age)
VALUES (&name,&age);
Here is the SAS code:
filename dest "%sysfunc(pathname(work))\backfill.sql";
data _null_;
  attrib line length=$1000;
  set sashelp.class;
  file dest;
  infile "sql_template.sas" end=template_eof;
  call symput('name', quote(cats(name)));
  call symput('age' , cats(age));
  do while (not template_eof);
    input;
    line = resolve(_infile_);
    put line;
  end;
run;
Running the above code produces the desired output file but only for the first observation in the dataset.
You cannot do it that way since after the first observation you are already at the end of the input text file. So your DO WHILE loop only runs for the first observation.
Here is a trick that I learned a long time ago on SAS-L. Toggle between two input files so that you can start at the top of the input file again.
First let's create your example template program and an empty dummy file.
filename template temp;
filename dummy temp;
data _null_;
  file template;
  put 'INSERT INTO some_table (name,age)'
    / ' VALUES (&name,&age)'
    / ';'
  ;
  file dummy ;
run;
Now let's write a data step to read the input data and use the RESOLVE() function to convert the text.
filename result temp;
data _null_;
  length filename $256 ;
  file result ;
  set sashelp.class;
  call symputx('name', catq('1at',name));
  call symputx('age' , age);
  do filename=pathname('template'),pathname('dummy');
    infile in filevar=filename end=eof ;
    do while (not eof);
      input;
      _infile_ = resolve(_infile_);
      put _infile_;
    end;
  end;
run;
Because the FILEVAR= option closes the current input file whenever the value of the variable changes, toggling to the dummy file and back reopens the template from the top for each observation. The resulting file will look like this:
INSERT INTO some_table (name,age)
VALUES ('Alfred',14)
;
INSERT INTO some_table (name,age)
VALUES ('Alice',13)
;
...

How do I stop SAS from adding an extra empty byte to every string variable when I use PROC EXPORT?

When I export a dataset to Stata format using PROC EXPORT, SAS 9.4 automatically adds an extra (empty) byte to every observation of every string variable. For example, in this data set:
data test1;
  input cust_id $ 1
        month 3-8
        category $ 10-12
        status $ 14-14
  ;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 ASD X
B 199912 ASD C
;
run;
proc export data = test1
  file = "test1.dta"
  dbms = stata replace;
run;
the variables cust_id, category, and status should be str1, str3, and str1 in the final Stata file, and thus take up 1 byte, 3 bytes, and 1 byte, respectively, for every observation. However, SAS automatically adds an extra empty byte to each observation, which expands them to str2, str4, and str2 in the output Stata file.
This is extremely problematic because that's an extra byte added to every observation of every string variable. For large datasets (I have some with ~530 million observations and numerous string variables), this can add several gigabytes to the exported file.
Once the file is loaded into Stata, the compress command in Stata can automatically remove these empty bytes and shrink the file, but for large datasets, PROC EXPORT adds so many extra bytes to the file that I don't always have enough memory to load the dataset into Stata in the first place.
Is there a way to stop SAS from padding the string variables in the first place? When I export a file with a one character string variable (for example), I want that variable stored as a one character string variable in the output file.
This is how you can do it using existing functions.
filename FT41F001 temp;
data _null_;
  file FT41F001;
  set test1;
  put 256*' ' @;                          /* pre-fill the output buffer with blanks */
  __s=1;
  do while(1);
    length __name $32;
    call vnext(__name);                   /* step through the variables in the PDV */
    if missing(__name) or __name eq: '__' then leave;
    substr(_FILE_,__s) = vvaluex(__name); /* place the formatted value in the buffer */
    putlog _all_;                         /* debug: dump the PDV to the log */
    __s = sum(__s,vformatwx(__name));     /* advance by the variable's format width */
  end;
  _file_ = trim(_file_);                  /* trim trailing blanks so nothing is padded */
  put;
  format month f6.;
run;
To avoid the use of _FILE_:
data _null_;
  file FT41F001;
  set test1;
  __s=1;
  do while(1);
    length __name $32 __value $128 __w 8;
    call vnext(__name);                 /* step through the variables in the PDV */
    if missing(__name) or __name eq: '__' then leave;
    __value = vvaluex(__name);          /* the formatted value of the variable */
    __w = vformatwx(__name);            /* the width of its format */
    put __value $varying128. __w @;     /* write exactly __w characters, no padding */
  end;
  put;
  format month f6.;
run;
If you are willing to accept a flat file answer, I've come up with a fairly simple way of generating one that I think has the properties you require:
data test1;
  input cust_id $ 1
        month 3-8
        category $ 10-12
        status $ 14-14
  ;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 SD X
B 199912 D C
;
run;
data _null_;
  file "/folders/myfolders/test.txt";
  set test1;
  put @;                     /* open the output buffer and hold the line */
  _FILE_ = cat(of _all_);    /* concatenate every variable at its assigned length */
  put;
run;
/* Print contents of the file to the log (for debugging only) */
data _null_;
  infile "/folders/myfolders/test.txt";
  input;
  put _infile_;
run;
This should work as-is, provided that the total assigned length of all variables in your dataset is less than 32767 (the limit of the CAT function in the DATA step environment; the lower 200-character limit doesn't apply, as that only affects using CAT to create a variable that hasn't been assigned a length). Beyond that you may start to run into truncation issues. A workaround when that happens is to CAT together a limited number of variables at a time. That is a manual process, but much less laborious than writing out PUT statements based on the lengths of all the variables, and depending on your data it may never actually come up.
Alternatively, you could go down a more complex macro route, getting variable lengths from either the vlength function or dictionary.columns and using those plus the variable names to construct the required put statement(s).
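For illustration, a rough sketch of the dictionary.columns route (hypothetical: it assumes the test1 dataset above sits in WORK, and it simply uses best12. for numeric variables, which would reintroduce padding unless you derive the real format widths):
proc sql noprint;
  select case when type = 'char'
              then catx(' ', name, cats('$', length, '.'))
              else catx(' ', name, 'best12.')   /* simplification for numerics */
         end
    into :putstmt separated by ' '
    from dictionary.columns
    where libname = 'WORK' and memname = 'TEST1'
    order by varnum;
quit;

data _null_;
  file "/folders/myfolders/test.txt";
  set test1;
  put &putstmt;
run;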

Construct SAS dataset based on file containing metadata

I have two text files, one containing raw data with no headers and another containing the associated column names and lengths. I'd like to use these two files to construct a single SAS dataset containing the data from one file with the column names and lengths from the other.
The file containing the data is a fixed-width text file. That is, each column of data is aligned to a particular column of the text file, padded with spaces to ensure alignment.
datafile.txt:
John 45 Has two kids
Marge 37 Likes books
Sally 29 Is an astronaut
Bill 60 Drinks coffee
The file containing the metadata is tab-delimited with two columns: one with the name of the column in the data file and one with the character length of that column. The names are listed in the order in which they appear in the data file.
metadata.txt:
Name 7
Age 5
Comments 15
My goal is to have a SAS dataset that looks like this:
Name | Age | Comments
-------+------+-----------------
John | 45 | Has two kids
Marge | 37 | Likes books
Sally | 29 | Is an astronaut
Bill | 60 | Drinks coffee
I want every column to be character with the length specified in the metadata file.
There has to be a better way than my naive approach, which is to construct a length statement and an input statement using the imported metadata, like so:
/* Import metadata */
data meta;
  length colname $ 50 collen 8;
  infile 'C:\metadata.txt' dsd dlm='09'x;
  input colname $ collen;
run;
/* Construct LENGTH and INPUT statements */
data _null_;
  length lenstmt inptstmt $ 1000;
  retain lenstmt inptstmt '' colstart 1;
  set meta end=eof;
  call catx(' ', lenstmt, colname, '$', collen);
  call catx(' ', inptstmt, cats('@', colstart), colname, '$ &');
  colstart + collen;
  if eof then do;
    call symputx('lenstmt', lenstmt);
    call symputx('inptstmt', inptstmt);
  end;
run;
/* Import data file */
data datafile;
  length &lenstmt;
  infile 'C:\datafile.txt' dsd dlm='09'x;
  input &inptstmt;
run;
This gets me what I need, but there has to be a cleaner way. One could run into trouble with this approach if insufficient space is allocated to the variables storing the length and input statements, or if the statement lengths exceed the maximum macro variable length.
Any ideas?
What you're doing is a fairly standard method of doing this. Yes, you could check things a bit more carefully; I would allocate $32767 for the two statements, for example, just to be cautious.
There are some ways you can improve this, though, that may take some of your worries away.
First off, a common solution is to build this at the row level (as you do) and then use PROC SQL to create the macro variable. This has a larger maximum length limitation than the data step method (the data step method's maximum is $32767 if you don't use multiple variables; SQL's is roughly double that at 64 KiB).
proc sql;
  select catx(' ',colname,'$',collen)
    into :lenstmt separated by ' '
    from meta; *and similar for inputstmt;
quit;
Second, you can surpass the 64k limit by writing to a file instead of to a macro variable. Take your data step, and instead of accumulating and then using call symput, write each line out to a temp file (or two). Then %include those files instead of using the macro variable in the input datastep - yes, you can %include in the middle of a datastep.
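A minimal sketch of that approach (hypothetical filerefs; it reuses the meta dataset from the question and writes a LENGTH and an INPUT statement to temp files, one fragment per line):
filename lenstmt temp;
filename inptstmt temp;

data _null_;
  set meta end=eof;
  retain colstart 1;
  file lenstmt;                 /* build: length Name $7 Age $5 ... ; */
  if _n_ = 1 then put 'length';
  put colname '$' collen;
  if eof then put ';';
  file inptstmt;                /* build: input @1 Name $7. @8 Age $5. ... ; */
  if _n_ = 1 then put 'input';
  put '@' colstart colname '$' collen +(-1) '.';
  if eof then put ';';
  colstart + collen;
run;

data datafile;
  infile 'C:\datafile.txt' truncover;
  %include lenstmt;
  %include inptstmt;
run;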
There are other methods, but these two are the most common and should work for most use cases. Some other methods include call execute, run_macro, or using file open commands to work with the file directly. In general, those are either more complicated or less useful than the most common two, although certainly they are also acceptable solutions and not uncommon to see in practice.
call execute should be able to help.
data _null_;
  retain start 0;
  infile 'c:\metadata.txt' missover end=eof;
  if _n_=1 then do;
    start=1;
    call execute('data final_output; infile "c:\datafile.txt" truncover; input ');
  end;
  input colname :$8.
        collen :8.
  ;
  call execute( '@'|| put(start,8. -l) || ' ' || colname || ' $'|| put(collen,8. -l) ||'. ' );
  start=sum(start,collen);
  if eof then do;
    call execute(';run;');
  end;
run;
proc contents data=final_output;run;
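For the metadata file above, the generated code that call execute pushes resolves to roughly:
data final_output;
  infile "c:\datafile.txt" truncover;
  input @1 Name $7. @8 Age $5. @13 Comments $15. ;
run;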

Get the current observation count in SAS

I have a file in which the first line is a header line containing some metadata information.
How can I get the current observation number (say, 1 for the first observation) that the SAS processor is dealing with, so that I can put in an IF clause to handle such a special data line?
Follow-up: I want to process the first line and keep one of the column values in a local variable for further processing. I don't want to keep this line in my final output. Is this possible?
The automatic variable _N_ returns the current iteration number of the SAS data step loop. For a traditional data step, i.e.:
data something;
  set something;
  (code);
run;
_N_ is equivalent to the row number (since one row is retrieved for each iteration of the data step loop).
So if you wanted to only do something once, on the first iteration, this would accomplish that:
data something;
  set something;
  if _n_ = 1 then do;
    (code);
  end;
  (more code);
run;
For your follow up, you want something like this:
data want;
set have;
retain _temp;
if _n_ = 1 then do;
_temp = x;
end;
... more code ...
drop _temp;
run;
DROP and RETAIN statements can appear anywhere in the code and have the same effect; I placed them in their human-logical locations. RETAIN says not to reset the variable to missing each time through the data step loop, so you can access it further down. The DELETE statement stops processing of the first record so that it does not appear in the output.
If you are reading a particularly large text file, you may want to avoid executing the if _n_=1 then condition on every iteration. You can do this by reading the file twice: once to extract the header row, and again to read in the rest of the file, as follows:
data _null_; /* create dummy file for demo purposes */
  file "c:\myfile.txt";
  put 'blah';
  put 'blah blah blah 666';
run;

data _null_; /* read in header info */
  infile "c:\myfile.txt";
  input myvar :$10.; /* or wherever the info is that you need */
  call symput('myvar',myvar); /* create macro variable with relevant info */
  stop; /* no further processing at this point */
run;

data test; /* read in data FROM SECOND LINE */
  infile "c:\myfile.txt" firstobs=2 ; /* note the FIRSTOBS option */
  input my $ regular $ input $ statement ;
  remember="&myvar";
run;
For short/simple stuff though, Joe's answer is better as it's more readable (and may be more efficient for small files).