I have my data in a .txt file which is comma separated. I am writing a regular infile statement to import that file into a SAS dataset. The data has some 2.5 million rows. However, in the 37314th row, and in many more rows, I have junk values. SAS stops importing at the row just above the junk rows, so I am not getting a dataset with all 2.5 million rows but one with only 37314 rows. I want to write code so that the infile step handles these junk rows and either skips them or deletes them. All in all, I need all 2.5 million rows, which I am not able to get because of the junk rows in between.
Any help would be appreciated.
You can read the whole line into the input buffer using just an input; statement. Then you can parse the fields individually using the _infile_ variable.
Example:
data _null_;
infile datalines firstobs=2;
input;
city = scan(_infile_, 1, ' ');
char_min = scan(_infile_, 3, ' ');
char_min = substr(char_min, 2, length(char_min)-2);
minutes = input(char_min, BEST12.);
put city= minutes=;
datalines;
City Number Minutes Charge
Jackson 415-555-2384 <25> <2.45>
Jefferson 813-555-2356 <15> <1.62>
Joliet 913-555-3223 <65> <10.32>
;
run;
Working with Data in the Input Buffer.
You can also use the ? and ?? modifiers for the input statement to 'ignore' any problem rows.
Here's the link to the doc. Look under the heading "Format Modifiers for Error Reporting".
An example:
data x;
format my_num best.;
input my_num ?? ;
**
** POSSIBLE ERROR HANDLING HERE:
*;
if my_num ne . then do;
output;
end;
datalines;
a
;
run;
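Applied to the original comma-separated file, a minimal sketch might look like the following (the path, the two-variable layout, and the variable names are placeholders; substitute your real column list):
data want;
infile 'c:\data\myfile.txt' dlm=',' dsd truncover; /* placeholder path; TRUNCOVER keeps short lines from pulling in the next record */
input id ?? name :$50. amount ?? ; /* ?? quietly sets unreadable numeric fields to missing instead of flagging errors */
if missing(id) and missing(amount) then delete; /* treat rows where no numeric field could be read as junk and drop them */
run;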
I'm importing a parameter file in txt form with no title row such as the following:
byvars TEST16X GIO
log_transform N
y_intercept Y
exclude_outliers N
exclude_einmos N
Because it is a parameter file, the length of the two columns will not be fixed. The following is the problematic code I created to import the txt file; the two columns end up concatenated instead of being split into individual columns:
data test1;
infile "files/parameters.txt" DELIMITER='09'x col=Colpoint
length=linelen;
length pname $30 pvalue $10;
input @1 pname $ @;
varlen=linelen - colpoint + 1;
input pvalue $varying1024. varlen;
call symputx('pname', STRIP(pvalue));
run;
This parameter file defines global macro variables and their values; for example, log_transform is a macro variable with the value 'N'.
You seem to be working way too hard. Just use TRUNCOVER and formatted input for the PVALUE field. Use list mode input for the parameter name field.
data parameters;
infile "files/parameters.txt" truncover ;
input pname :$32. pvalue $200. ;
call symputx(pname,pvalue);
run;
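Once that step runs, each parameter from the file becomes a macro variable named after the first column. For instance, with the sample file shown above you could check one of them like this:
%put &=log_transform; /* should print LOG_TRANSFORM=N if the file matches the sample shown */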
I tried formatting the data so that U contains only numeric values while all the descriptions fall under T. May I know what I could possibly do to fix it?
DATA data;
infile '....csv'
dlm=',' firstobs=2 dsd;
format A$ B$ C$ D$ E$ F$ G$ H$ I$ J$ K$ L$ M$ N$ O$ P$ Q$ R$ S$ T$ U V W$ X$ Y$ Z$ AA$ AB$ AC$ AD$ AE$ AF$ AG$ AH$ AI$ AJ$ AK$ AL$ AM$ AN$ AO$ AP$ AQ$ AR$ AS;
input A B C D E F G H I J K L M N O P Q R S T @;
do _n_=1 to 24;
input U @;
description=catx(', ',T, U);
end;
input U V W X Y Z AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS;
RUN;
If you are talking about the data file in this Kaggle project then I would use a divide and conquer approach. Check each line in the file to see how many columns it contains. Then split the problem lines into separate file(s) and figure out how to handle whatever issue it is that causes them to be poorly formatted/parsed.
So get a list of the rows and number of columns in each row.
data maxcol;
infile "C:\downloads\archive.zip" zip member='Datafiniti_Mens_Shoe_Prices.csv'
dsd truncover column=cc length=ll lrecl=2000000
;
row+1;
input @;
do maxcol=1 by 1 while(cc<=ll); input dummy :$1. @ +(-1) dummy $char1. @; end;
if dummy ne ',' then maxcol=maxcol-1;
keep row maxcol ;
run;
proc freq ;
tables maxcol;
run;
For example you could get the list of bad row numbers into a macro variable.
proc sql noprint;
select row into :rowlist separated by ' ' from maxcol where maxcol > 48 ;
quit;
Then use that macro variable in your code that reads the datafile.
data want;
infile "C:\downloads\archive.zip" zip member='Datafiniti_Mens_Shoe_Prices.csv' dsd
truncover lrecl=2000000
;
input @;
if _n_ in (1 &rowlist) then delete;
... rest of data step to read the "good" rows ...
For the other rows take a look at them and figure out where they are getting extra columns inserted. Possibly just fix them by hand. Or craft separate data steps to read each set separately using the same &ROWLIST trick.
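As a sketch of that last idea (the dataset name WANT_BAD is just a placeholder), a companion step could keep only the flagged rows so you can inspect them or process them separately:
data want_bad;
infile "C:\downloads\archive.zip" zip member='Datafiniti_Mens_Shoe_Prices.csv' dsd
truncover lrecl=2000000;
length line $32767;
input @;
/* keep only the rows flagged in &ROWLIST; _N_=1 is the header line */
if _n_ = 1 or not (_n_ in (&rowlist)) then delete;
line = _infile_; /* hold the raw line so you can see where the extra columns appear */
run;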
If you are positive that
the extra columns are inserted between columns 20 and 21,
column 21 always has a valid numeric string, and
none of the extra values are valid numeric strings,
then you could use logic like this to generate a new delimited file (why not use | as the delimiter this time?).
data _null_;
infile "C:\downloads\archive.zip" zip member='Datafiniti_Mens_Shoe_Prices.csv' dsd
truncover lrecl=2000000
;
row+1;
length col1-col48 $32767;
input col1-col21 @;
if _N_>1 then do while(missing(input(col21,??32.)));
col20=catx(',',col20,col21);
input col21 @;
end;
input col22-col48;
file "c:\downloads\shoes.txt" dsd dlm='|' lrecl=2000000 ;
put col1-col48 ;
run;
Which you could even then try to read using PROC IMPORT to guess how to define the variables. (But watch out as PROC IMPORT might truncate some of the records by using LRECL=32767)
proc import datafile="c:\downloads\shoes.txt" dbms=csv out=want replace ;
delimiter='|';
guessingrows=max;
run;
Checking column 21:
The MEANS Procedure
Analysis Variable : prices_amountMin
N Mean Std Dev Minimum Maximum
---------------------------------------------------------------------
19387 111.8138820 276.7080893 0 16949.00
---------------------------------------------------------------------
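For reference, a check along those lines can be produced with a step like the following (the dataset name WANT and the column name prices_amountMin are assumptions based on the listing above):
proc means data=want n mean std min max;
var prices_amountMin;
run;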
I have a SAS dataset, let us say it has 4 columns A, B, C, D and the values
A = x
B = x
C = x
D = x,y
Here column D has two values inside a single column; while converting it into CSV format it generates a new column with the value y. How can I avoid this and convert the SAS dataset into a CSV file?
* get some test records in a file;
Data _null_;
file 'c:\tmp\test.txt' lrecl=80;
put '1,22,Hans Olsen,Denmark,333,4';
put '1111,2,Turner, Alfred,England,3333,4';
put '1,222,Horst Mayer,Germany,3,4444';
run;
* Read the file as a delimited file;
data test; infile 'c:\tmp\test.txt' dsd dlm=',' missover;
length v1 v2 8 v3 v4 $40 v5 v6 8;
input
'V1'n : ?? BEST5.
'V2'n : ?? BEST5.
'V3'n : $CHAR40.
'V4'n : $CHAR40.
'V5'n : ?? BEST5.
'V6'n : ?? BEST5.;
run;
* Read the file and write another file.
* If there are 6 delimiters and not 5, change the third comma to #;
data test2;
infile 'c:\tmp\test.txt' lrecl=80 truncover;
file 'c:\tmp\test2.txt' lrecl=80;
length rec $80;
drop pos len;
input rec $char80.;
if count(rec,',') = 6 then do;
call scan(rec,4,pos,len,',');
substr(rec,pos-1,1) = '#';
end;
put rec;
run;
* Read the new file as a delimited file;
data test2; infile 'c:\tmp\test2.txt' dsd dlm=',' missover;
length v1 v2 8 v3 v4 $40 v5 v6 8;
input
'V1'n : ?? BEST5.
'V2'n : ?? BEST5.
'V3'n : $CHAR40.
'V4'n : $CHAR40.
'V5'n : ?? BEST5.
'V6'n : ?? BEST5.;
run;
In this code, it adds '#' but I want ',' itself in the output.
Could anyone please guide me to do that?
Thanks in advance!!
It sounds like you are starting with an improperly created CSV file.
1,22,Hans Olsen,Denmark,333,4
1111,2,Turner, Alfred,England,3333,4
1,222,Horst Mayer,Germany,3,4444
That should have been made like this:
1,22,Hans Olsen,Denmark,333,4
1111,2,"Turner, Alfred",England,3333,4
1,222,Horst Mayer,Germany,3,4444
If you are positive that the only field with embedded commas is the third, then you can use a data step to read it in and generate a valid file.
data _null_;
infile bad dsd truncover ;
file good dsd ;
length v1-v6 dummy $200;
input v1-v2 @;
do i=1 to countw(_infile_,',','q')-5;
input dummy @;
v3=catx(', ',v3,dummy);
end;
input v4-v6 ;
put v1-v6 ;
run;
Once you have a properly formatted CSV file then it is easy to read.
data want;
infile good dsd truncover ;
length v1-v2 8 v3-v4 $40 v5-v6 8;
input v1-v6 ;
run;
But if the extra comma could be in any field then you will probably need to have a human fix those lines.
If your field value contains the field delimiter, you will want to double-quote the field value. Proc EXPORT will do such double quoting when the database type is specified as CSV.
Example:
data have;
A = 1;
B = 2;
C = 3;
D = 'x,y';
run;
filename csv temp;
proc export data=have outfile=csv dbms=csv;
run;
data _null_;
infile csv;
input;
put _infile_;
run;
The log will show that the exported CSV file contains double-quoted values where needed.
Log
A,B,C,D
1,2,3,"x,y"
Most of my data is read in a fixed-width format, such as fixedwidth.txt:
00012000ABC
0044500DEFG
345340000HI
00234000JKL
06453MNOPQR
Where the first 5 characters are colA and the next six are colB. The code to read this in looks something like:
infile "&path.fixedwidth.txt" lrecl = 397 missover;
input colA $5.
colB $6.
;
label colA = 'column A '
colB = 'column B '
;
run;
However some of my data is coming from elsewhere and is formatted as a csv without the leading zeroes, i.e. example.csv:
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
As the csv data is being added to the existing data read in from the fixed width file, I want to match the formatting exactly.
The code I've got so far for reading in example.csv is:
data work.example;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile "&path./example.csv" delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat colA $5.;
informat colB $6.;
format colA z5.; *;
format colB z6.; *;
input
colA $
colB $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
But the formats z5. & z6. only work on columns formatted as numeric so this isn't working and gives this output:
ColA colB
12 ABC
445 DEFG
34534 HI
234 JKL
6453 MNOPQR
When I want:
ColA colB
00012 000ABC
00445 00DEFG
34534 0000HI
00234 000JKL
06453 MNOPQR
With both columns formatted as characters.
Ideally I'd like to find a way to get the output I need using only formats & informats to keep the code easy to follow (I have a lot of columns to keep track of!).
Grateful for any suggestions!
You can use CATS to force the CSV columns to character, without knowing what types the CSV import determined they were. Right-justify the result to the expected or needed variable length and translate the leading spaces to zeroes.
For example
data have;
length a 8 b $7; * dang csv data, someone entered 7 chars for colB;
a = 12; b = "MNQ"; output;
a = 123456; b = "ABCDEFG"; output;
run;
data want;
set have (rename=(a=csvA b=csvB));
length a $5 b $6;
* may transfer, truncate or convert, based on length and type of csv variables;
* substr used to prevent blank results when cats (number) is too long;
* instead, the number will be truncated;
a = substr(cats(csvA),1);
b = substr(cats(csvB),1);
a = translate(right(a),'0',' ');
b = translate(right(b),'0',' ');
run;
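For what it's worth, with the HAVE data above this should produce a='00012' and b='000MNQ' for the first row, and, because of the deliberate truncation noted in the comments, a='12345' and b='ABCDEF' for the second.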
SUBSTR on the left.
data test;
infile cards firstobs=2 dsd;
length cola $5 colb $6;
cola = '00000';
colb = '000000';
input (a b)($);
substr(cola,vlength(cola)-length(a)+1)=a;
substr(colb,vlength(colb)-length(b)+1)=b;
cards;
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
;;;;
run;
proc print;
run;
Is there a way to read in specific parts of my data without using FIRSTOBS=? For example, I have 5 different files, all of which have a few rows of unwanted characters. I want my data to read in starting with the first row that is numeric. But each of these 5 files have that first numeric row starting in different rows. Rather than going into each file to find where FIRSTOBS should be, is there a way I can instead check this? Perhaps by using an IF statement with ANYDIGIT?
Have you tried something like this from the SAS docs? Example 5: Positioning the Pointer with a Numeric Variable
data office (drop=x);
infile file-specification;
input x @;
if 1<=x<=10 then
input @x City $9.;
else do;
put 'Invalid input at line ' _n_;
delete;
end;
run;
This assumes that you don't know how many lines are to be skipped at the beginning of each file. My filerefs are UNIX; to run the example on another OS they will need to be changed.
*Create two example input data files;
filename FT15F001 '~/file1.txt';
parmcards;
char
char and 103
10 10 10
10 10 10.1
;;;;
run;
filename FT15F001 '~/file2.txt';
parmcards;
char
char and 103
char
char
char
10 10 10.5
10 10 10
;;;;
run;
*Read them starting from the first line that has all numbers;
filename FT77F001 '~/file*.txt';
data both;
infile FT77F001 eov=eov;
input @;
/*Reset the flag at the start of each new file*/
if _n_ eq 1 or eov then do;
eov=0;
flag=1;
end;
if flag then do;
if anyalpha(_infile_) then delete;
else flag=0;
end;
input v1-v3;
drop flag;
retain flag;
run;
proc print;
run;
I ended up doing:
INPUT City $ @;
StateAvg = input(substr(City,1,4),COMMA4.);
IF 5000<= StateAvg <= 7000 THEN
INPUT City 1-7 State ZIP;
ELSE DO;
Delete;
END;
And this worked. Thanks for the suggestions, I went back and looked at example 5 and it helped.