I need to use the INFILE statement to read a file called np_traffic.csv, name the table traffic2, and import only a column called ReportingDate as a character variable.
My current code gives me this error:
"The data set WORK.TRAFFIC2 may be incomplete. When this step was
stopped there were 0 observations and 1 variables."
DATA traffic2;
  INFILE "E:/Documents/Week 2/np_traffic.csv" dsd firstobs=2;
  INPUT ReportingDate $;
RUN;
Let's assume that you really have a delimited text file, which is what a CSV file is, instead of the spreadsheet you pictured in the photograph in your post. To read the 6th field in a line you need to first read the first 5 fields. That does not mean you need to use the values read from those fields.
data traffic2;
  infile "E:/Documents/Week 2/np_traffic.csv" dsd firstobs=2;
  length dummy $1 ReportingDate $12;
  input 5*dummy ReportingDate;
  drop dummy;
run;
I would suggest trying it this way:
data traffic2;
  drop a b c d e g;
  length a b c d e f g $40;   /* read every field as character */
  infile 'E:\Documents\Week 2\np_traffic.csv' dsd dlm='<Insert your delimiter>' firstobs=2;
  input a b c d e f g;
run;
https://documentation.sas.com/?docsetId=lestmtsref&docsetTarget=n1rill4udj0tfun1fvce3j401plo.htm&docsetVersion=9.4&locale=en
This is a follow-up of my previous question:
How to import a txt file with single quote mark in a variable and another in another variable.
The solution there works perfectly unless there is a variable whose values can be null.
In that case, the file is not read correctly. Here is a sample file and the step that reads it:
filename sample 'c:\temp\sample.txt';
data _null_;
file sample;
input;
put _infile_;
datalines;
001|This variable could be null|PROVA|MILANO|1000
002||'80S WERE GREAT|FORLI'|1100
003||'80S WERE GREAT|ROMA|1110
;
data prova;
infile sample dlm='|' lrecl=50 truncover;
format
codice $3.
could_be_null $20.
nome $20.
luogo $20.
importo 4.
;
input
codice
could_be_null
nome
luogo
importo
;
putlog _infile_;
run;
proc print;
run;
Is it possible to correctly load a file like the one in the example directly in SAS, without manually modifying the original .txt?
You will need to pre-process the file to fix the issue.
If you add quotes around the values, then you will not have the problem, for example:
002||"'80S WERE GREAT"|"FORLI'"|1100
If you know that none of the values contain the delimiter, then adding a space before every delimiter, like this:
002 | |'80S WERE GREAT |FORLI' |1100
will let you read the file without the DSD option.
If the lines are shorter than 32K bytes, the conversion can be done in the same step that reads the data:
data test2;
  infile sample dlm='|' truncover;
  input @;                                 /* hold the line in _infile_ */
  _infile_ = tranwrd(_infile_,'|',' |');   /* add a space before every delimiter */
  input (var1-var5) (:$40.);
run;
proc print;
run;
Results:
Obs var1 var2 var3 var4 var5
1 001 This variable could be null PROVA MILANO 1000
2 002 '80S WERE GREAT FORLI' 1100
3 003 '80S WERE GREAT ROMA 1110
One way to test if you have the issue is to make sure each line has the right number of fields.
filename sample temp;
options parmcards=sample;
parmcards;
001|This variable could be null|PROVA|MILANO|1000
002||'80S WERE GREAT|FORLI'|1100
003||'80S WERE GREAT|ROMA|1110
;
data _null_;
infile sample dsd end=eof;
if eof then do;
call symputx('nfound',nfound);
putlog / 'Found ' nfound :comma11.
'problem lines out of ' _n_ :comma11. 'lines.'
;
end;
input;
retain expect nfound;
words=countw(_infile_,'|','qm');
if _n_=1 then expect=words;
else if expect ne words then do;
nfound+1;
if nfound <= 10 then do;
putlog (_n_ expect words) (=) ;
list;
end;
end;
run;
Example Results:
_N_=2 expect=5 words=4
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
2 002||'80S WERE GREAT|FORLI'|1100 32
_N_=3 expect=5 words=3
3 003||'80S WERE GREAT|ROMA|1110 30
Found 2 problem lines out of 4 lines.
PS Go tell SAS to enhance their delimited file processing: https://communities.sas.com/t5/SASware-Ballot-Ideas/Enhancements-to-INFILE-FILE-to-handle-delimited-file-variations/idi-p/435977
You need to add the DSD option to your INFILE statement.
https://support.sas.com/techsup/technote/ts673.pdf
DSD (delimiter-sensitive data) option—Specifies that SAS should treat
delimiters within a data value as character data when the delimiters
and the data value are enclosed in quotation marks. As a result, SAS
does not split the string into multiple variables and the quotation
marks are removed before the variable is stored. When the DSD option
is specified and SAS encounters consecutive delimiters, the software
treats those delimiters as missing values. You can change the default
delimiter for the DSD option with the DELIMITER= option.
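As a small illustration of both behaviors, here is a sketch with made-up in-stream data (the variable names and lengths are just examples):
data dsd_demo;
  infile datalines dsd;   /* with DSD the default delimiter is the comma */
  input name :$20. city :$10. amount;
  datalines;
"Smith, John",ROME,100
Jones,,200
;
run;
On the first line the comma stays inside NAME and the quotes are removed; on the second line the consecutive commas make CITY missing.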
I'm trying to read a pipe delimited text file in SAS with following code :
Data MyData;
Infile MyFile Dsd Dlm= '|' Firstobs= 2 Termstr = CRLF Truncover;
Input A: $30.
B: 2.
C: $30.
D: $30.
E: 2.;
Run;
Columns A to C are definitely present for each record, but columns D and E may or may not be present. The file is delimited in a way that there is a pipe between two values but not at the end of a record.
An example is shown below.
A1|4|C1|D1|5A2|7|C2A3|3|C3|D3|1A4 ...
How do I read this file where the last two inputs are optional? I don't want to use Proc Import because it's a large file and the columns A, B and C have a range of values that Proc Import isn't able to handle very well (as per my experience).
My current code causes some of the values from column A to be pulled into column D when there are missing values.
Usually, there's some indication of when E ends. Some EOL character (maybe one you can't see). If so, then you can use that as a delimiter.
If there's no way to tell when E ends, then you will need to figure it out from business logic (what kind of value exists in E and in A). If E is only 2 long, then you can process the field using the _INFILE_ variable. Something like this might work, if the total line length is <= 32767:
data want;
  infile 'h:\test.txt' dlm='|'; *infile with dlm option as usual;
  input @@; *hold the input pointer across iterations;
  call scan(_infile_, 5*_N_, pos, len, '|'); *find where the (5*N)th field is;
  _infile_ = cat(substr(_infile_,1,pos+1), '|', substr(_infile_,pos+2));
  *insert a | there;
  input a: $30.
        b: 11.
        c: $5.
        d: $5.
        e: 2.
        @@
  ; *note the trailing @@ holding the input pointer;
run;
I am reading in a .csv file in SAS where some of the fields are populated in the main by null values (.) and a handful are populated by 5-digit SAS dates. I need SAS to recognise the field as a date field (or at the very least a numeric field), instead of reading it in as text as it is doing at the moment.
A simplified version of my code is as so:
data test;
informat mydate date9.;
infile myfile dsd dlm=',' missover;
input
myfirstval
mydate
;
run;
With this code all values are read in as . and the field data type is text. Can anyone tell me what I need to change in the above code to get the output I need?
Thanks
If you write a data step to read a CSV file, SAS will create the variable as the data type that you specify. If you tell it that MYDATE is a number, it will NOT convert it to a character variable.
data test;
infile cards dsd dlm=',' TRUNCOVER ;
length myfirstval 8 mydate 8 mythirdval 8;
input myfirstval mydate mythirdval;
format mydate date9.;
cards;
1,1234,5
2,.,6
;
Note that the data step compiler will define the type of the variable at the first chance that it can. For example if the first reference is in a statement like IF MYDATE='.' ... then MYDATE will be defined as character length one to match the type of the value that it is being compared to. That is why it is best to start with a LENGTH or ATTRIB statement to clearly define your variables.
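Here is a contrived sketch of that pitfall (the data and names are made up): because the comparison to the character literal '.' is the first reference to MYDATE, the compiler makes it a character variable of length 1, and list input then stores only the first character of 1234.
data pitfall;
  infile cards dsd dlm=',' truncover;
  if mydate='.' then delete;   /* first reference: MYDATE becomes character, length 1 */
  input myfirstval mydate;     /* MYDATE is now read as "1" instead of 1234 */
  cards;
1,1234
;
run;
Placing a LENGTH MYFIRSTVAL 8 MYDATE 8; statement before the IF keeps MYDATE numeric.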
This seems like it should be straightforward, but I can't find how to do this in the documentation. I want to read in a comma-delimited file, but it's very wide, and I just want to read a few columns.
I thought I could do this, but the @ pointer seems to point to character columns of the text rather than the column numbers defined by the delimiter:
data tmp;
  infile 'results.csv' delimiter=',' MISSOVER DSD lrecl=32767 firstobs=2;
  input
    @1 id
    @5 name $
  ;
run;
In this example, I want to read just what is in the 1st and 5th columns based on the delimiter, but SAS is reading what is in position 1 and position 5 of the text file. So if the first line of the input file starts like this
1234567, "x", "y", "asdf", "bubba", ... more variables ...
I want id=1234567 and name=bubba, but I'm getting name=567, ".
I realize that I could read in every column and drop the ones I don't want, but there must be a better way.
Indeed, @ does point to a character column of the text, not the delimited column. The only method using standard input I've ever found is to read into a throwaway variable (here called blank), i.e.
input
id
blank $
blank $
blank $
name $
;
and then drop blank.
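Put together as a complete step for the file in the question (a sketch; the lengths and the MISSOVER/DSD options are assumptions based on the original INFILE statement):
data tmp;
  infile 'results.csv' dlm=',' dsd missover lrecl=32767 firstobs=2;
  length blank $1 name $40;          /* $40 for NAME is a guess */
  input id blank blank blank name;   /* read and discard the 2nd-4th columns */
  drop blank;
run;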
However, there is a better solution if you don't mind writing your input differently.
data tmp;
  infile datalines;
  input @;   /* load the line into _infile_ and hold it */
  id = scan(_infile_, 1, ',');
  name = scan(_infile_, 5, ',');
  put _all_;
  datalines;
12345,x,y,z,Joe
12346,x,y,z,Bob
;;;;
run;
It makes formatting slightly messier, as you need a PUT or INPUT function call for each variable that you do not want in plain character format, but it might be easier depending on your needs.
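For example, if ID should end up numeric rather than character, convert the scanned text with the INPUT function (the 8. informat is an assumption about the data):
data tmp;
  infile datalines;
  input @;                                    /* hold the line in _infile_ */
  id   = input(scan(_infile_, 1, ','), 8.);   /* numeric via the INPUT function */
  name = scan(_infile_, 5, ',');
  datalines;
12345,x,y,z,Joe
12346,x,y,z,Bob
;;;;
run;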
You can skip fields fairly efficiently if you know a bit of INPUT statement syntax, note the use of (3*dummy)(:$1.). Reading just one byte should also improve performance slightly.
data tmp;
infile cards DSD firstobs=2;
input id $ (3*dummy)(:$1.) name $;
drop dummy;
cards;
id,x,y,z,name
1234567, "x", "y", "asdf", "bubba", ... more variables
1234567, "x", "y", "asdf", "bubba", ... more variables
run;
proc print;
run;
One more option that I thought of when answering a related question from another user.
filename tempfile temp;
data _null_;
set sashelp.cars;
file tempfile dlm=',' dsd lrecl=32767;
put (Make--Wheelbase) ($);
run;
data mydata;
infile tempfile dlm=',' dsd truncover lrecl=32767;
length _tempvars1-_tempvars100 $32;
array _tempvars[100] $;
input (_tempvars[*]) ($);
make=_tempvars[1];
type=_tempvars[3];
MSRP=input(_tempvars[6],dollar8.);
keep make type msrp;
run;
Here we use an array of effectively temporary (can't actually BE temporary, unfortunately) variables, and then grab just what we want specifying the columns. This is probably overkill for a small file - just read in all the variables and deal with it - but for 100 or 200 variables where you want just 15, 18, and 25, this might be easier, as long as you know which column you want exactly. (I could see using this in dealing with census data, for example, if you have it in CSV form. It's very common to just want a few columns most of which are way down 100 or 200 columns from the starting column.)
You have to take some care with your lengths for the temporary array (has to be as long as your longest column that you care about!), and you have to make sure not to mess up the columns since you won't get to know if you mess up unless it's obvious from the data.
I have the below raw data
1,,35,000
2,100,45,000
and need the below in a dataset
1 . 35000
2 100 45000
This would require both the DSD option and the COMMA. informat.
How do I carry this out?
DSD has nothing to do with this - DSD involves input like
1,,"35,000"
2,100,"45,000"
If that is what you have, then you can use the : (colon) format modifier to read it in with the COMMA informat.
data test;
infile datalines dlm=',' dsd;
input id
num
dollar :comma8.;
datalines;
1,,"35,000"
2,100,"45,000"
;;;;
run;
If you do not have the quotes around the field, then you will need to parse this somehow. One solution is below, which will work as long as the field with commas is the final field.
data test;
  infile datalines dlm=',' dsd;
  input @;   /* hold the line in _infile_ */
  if countc(_infile_,',') = 3 then do;   /* one comma too many */
    _commapos = findc(_infile_,',',-1*length(_infile_));   /* last comma */
    _infile_ = substr(_infile_,1,_commapos-1)||substr(_infile_,_commapos+1);
  end;
  input id
        num
        dollar;
  put _all_;
  datalines;
1,,35,000
2,100,45,000
;;;;
run;
If the field that can contain a comma is in a consistent position but is NOT the final field, you can modify the above solution to correct it. If the comma can appear in more than one field, you have a much more difficult problem to solve.
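For instance, if the comma-bearing field were the 2nd of 3 fields (purely an assumption for illustration), the spurious comma is the 2nd comma on any line that has one comma too many, so the fix targets that comma instead of the last one:
data test;
  infile datalines dlm=',' dsd;
  input @;
  if countc(_infile_,',') = 3 then do;   /* one comma too many */
    _commapos = findc(_infile_,',',findc(_infile_,',')+1);   /* the 2nd comma */
    _infile_ = substr(_infile_,1,_commapos-1)||substr(_infile_,_commapos+1);
  end;
  input id dollar num;
  put _all_;
  datalines;
1,35,000,7
2,45,000,8
;;;;
run;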