How to read the first n columns into a SAS dataset? - sas

My raw datafile is a simple comma-delimited file that contains 10 columns. I know there must be a way of importing only, say, 3 columns using MISSOVER or LS?
Sure, I can import the whole file and drop the unneeded variables, but how can I use
infile missover=3
OR
infile ls=3
?

No, I guess not. Since SAS has to read all the data from a delimited file anyway - it has to read every byte from beginning to end to 1) process the column delimiters and 2) determine the observation delimiters (the CR/LF characters) - DROPping the unneeded columns wouldn't be much of an overhead.

If you need to read only the first n fields, you can just list those in your INPUT statement.
If the columns you need are further to the right, you will need to list in your INPUT statement all the columns before them, up to the last column you need.
MISSOVER does not take =n; it tells SAS that, if a record does not contain enough columns, it should set the remaining variables in the INPUT statement to missing instead of looking for their values on the following record.
e.g. if the file looks like this:
a1,b1,c1
a2,b2,c2,d2
data test;
infile "myfile" dlm="," dsd /* 2 consecutive delimiters are not considered as one */ missover;
input
a $
b $
c $
d $
;
run;
will result in a dataset:
a b c d
a1 b1 c1
a2 b2 c2 d2
data test;
infile "myfile" dlm="," dsd /* 2 consecutive delimiters are not considered as one */ ;
input
a $
b $
c $
d $
;
run;
will result in a dataset:
a b c d
a1 b1 c1 a2
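To illustrate the original question - reading only the first few of many columns - it really is just a matter of listing only those variables on the INPUT statement. A minimal sketch, assuming a 10-column comma-delimited file named myfile.csv (the filename is a placeholder):

```sas
/* Read only the first 3 of the 10 comma-delimited columns.      */
/* The 7 trailing fields on each record are simply never read.   */
/* TRUNCOVER (or MISSOVER) keeps a short record from making SAS  */
/* jump to the next line to finish the INPUT statement.          */
data first3;
    infile "myfile.csv" dlm="," dsd truncover;
    input a $ b $ c $;
run;
```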

Related

SAS importing messy data

I tried formatting the data so that U is only numeric while all the descriptions should be under T. May I know what I could possibly do to fix it?
DATA data;
infile '....csv'
dlm=',' firstobs=2 dsd;
format A$ B$ C$ D$ E$ F$ G$ H$ I$ J$ K$ L$ M$ N$ O$ P$ Q$ R$ S$ T$ U V W$ X$ Y$ Z$ AA$ AB$ AC$ AD$ AE$ AF$ AG$ AH$ AI$ AJ$ AK$ AL$ AM$ AN$ AO$ AP$ AQ$ AR$ AS;
input A B C D E F G H I J K L M N O P Q R S T#;
do _n_=1 to 24;
input U #;
description=catx(', ',T, U);
end;
input U V W X Y Z AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS;
RUN;
If you are talking about the data file in this Kaggle project then I would use a divide and conquer approach. Check each line in the file to see how many columns it contains. Then split the problem lines into separate file(s) and figure out how to handle whatever issue it is that causes them to be poorly formatted/parsed.
So get a list of the rows and number of columns in each row.
data maxcol;
infile "C:\downloads\archive.zip" zip member='Datafiniti_Mens_Shoe_Prices.csv'
dsd truncover column=cc length=ll lrecl=2000000
;
row+1;
input #;
do maxcol=1 by 1 while(cc<=ll); input dummy :$1. # +(-1) dummy $char1. #; end;
if dummy ne ',' then maxcol=maxcol-1;
keep row maxcol ;
run;
proc freq ;
tables maxcol;
run;
For example you could get the list of bad row numbers into a macro variable.
proc sql noprint;
select row into :rowlist separated by ' ' from maxcol where maxcol > 48 ;
quit;
Then use that macro variable in your code that reads the datafile.
data want;
infile "C:\downloads\archive.zip" zip member='Datafiniti_Mens_Shoe_Prices.csv' dsd
truncover lrecl=2000000
;
input #;
if _n_ in (1 &rowlist) then delete;
... rest of data step to read the "good" rows ...
For the other rows take a look at them and figure out where they are getting extra columns inserted. Possibly just fix them by hand. Or craft separate data steps to read each set separately using the same &ROWLIST trick.
If you are positive that:
- the extra columns are inserted between columns 20 and 21,
- column 21 always has a valid numeric string, and
- none of the extra values are valid numeric strings,
then you could use logic like this to generate a new delimited file (why not use | as the delimiter this time?).
data _null_;
infile "C:\downloads\archive.zip" zip member='Datafiniti_Mens_Shoe_Prices.csv' dsd
truncover lrecl=2000000
;
row+1;
length col1-col48 $32767;
input col1-col21 #;
if _N_>1 then do while(missing(input(col21,??32.)));
col20=catx(',',col20,col21);
input col21 #;
end;
input col22-col48;
file "c:\downloads\shoes.txt" dsd dlm='|' lrecl=2000000 ;
put col1-col48 ;
run;
Which you could even then try to read using PROC IMPORT to guess how to define the variables. (But watch out as PROC IMPORT might truncate some of the records by using LRECL=32767)
proc import datafile="c:\downloads\shoes.txt" dbms=csv out=want replace ;
delimiter='|';
guessingrows=max;
run;
Checking column 21:
The MEANS Procedure
Analysis Variable : prices_amountMin
N Mean Std Dev Minimum Maximum
---------------------------------------------------------------------
19387 111.8138820 276.7080893 0 16949.00
---------------------------------------------------------------------

values of common column in A replaced by those in B with MERGE in SAS

I want to merge two tables, but they have 2 columns in common, and I do not want the value of var1 in A replaced by that in B. Without using DROP or RENAME, does anyone know how?
I can fix it with SQL, but I'm just curious about MERGE!
data a;
infile datalines;
input id1 $ id2 $ var1;
datalines;
1 a 10
1 b 10
2 a 10
2 b 10
;
run;
/* create table B */
data b;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 30 50
2 b 30 50
;
run;
/* Merge A and B */
data c;
merge a (in=N) b(in=M);
if N;
by id1;
run;
but what I would like is:
data C;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 10 50
1 b 10 50
2 a 10 50
2 b 10 50
;
run;
Use rename
data c;
merge a (in=N) b(in=M rename=(var1=var1_2));
by id1;
if N;
run;
If you don't want to use RENAME= / DROP= etc., then you could just flip the merge order so that the dataset whose var1 should be retained overwrites the other:
data c;
merge b (in=M) a(in=N);
by id1;
if N;
run;
When the data step loads data from the datasets mentioned, it does so in the order they appear on the MERGE (or SET or UPDATE) statement. So if you are merging two datasets and the BY variable values match, the record from the first is loaded and then the record from the second is loaded, overwriting the values read from the first.
For 1-to-1 matching you can just change the order in which the datasets are mentioned.
merge b(in=M) a(in=N) ;
If you really want the variables defined in the output dataset in the order they appear in A, then add, before your MERGE statement, a SET statement that the compiler will process but that can never execute:
if 0 then set a b ;
If you are doing 1-to-many matching then you might have other trouble: once a dataset stops contributing values to the current BY group, SAS does not re-read its last observation, so the values last read from it are retained. In that case you will have to use some combination of the RENAME=, DROP= or KEEP= dataset options.
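For the 1-to-many case, a sketch of the RENAME= approach, with hypothetical datasets one (a single row per id1) and many (several rows per id1); renaming keeps both values visible in the output so nothing is silently overwritten:

```sas
/* Rename the clashing variable coming from the "one" side so    */
/* the "many" side's var1 is never overwritten within a BY group.*/
/* Each output row then carries both var1 (from MANY) and        */
/* var1_one (from ONE), and you can pick which to keep.          */
data combined;
    merge one(rename=(var1=var1_one)) many;
    by id1;
run;
```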
In PROC SQL, when you have duplicate names for selected columns (and are trying to create an output dataset instead of a report), SAS ignores the second copy of the named variable. So in a sense it is the reverse of what happens with the MERGE statement.
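A sketch of that PROC SQL behavior, reusing the a and b tables from the question; the log will warn that var1 already exists, and the output table keeps the first copy - the one from a:

```sas
/* Both a.var1 and b.var1 are selected; SAS drops the second     */
/* (b.var1) from the created table, the opposite of MERGE.       */
proc sql;
    create table c as
    select a.id1, a.id2, a.var1, b.var1, b.var2
    from a left join b
    on a.id1 = b.id1 and a.id2 = b.id2;
quit;
```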

SAS: How can I pad a character variable with zeroes while reading in from csv

Most of my data is read in a fixed-width format, such as fixedwidth.txt:
00012000ABC
0044500DEFG
345340000HI
00234000JKL
06453MNOPQR
Where the first 5 characters are colA and the next six are colB. The code to read this in looks something like:
infile "&path.fixedwidth.txt" lrecl = 397 missover;
input colA $5.
colB $6.
;
label colA = 'column A '
colB = 'column B '
;
run;
However some of my data is coming from elsewhere and is formatted as a csv without the leading zeroes, i.e. example.csv:
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
As the csv data is being added to the existing data read in from the fixed width file, I want to match the formatting exactly.
The code I've got so far for reading in example.csv is:
data work.example;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile "&path./example.csv" delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat colA $5.;
informat colB $6.;
format colA z5.; *;
format colB z6.; *;
input
colA $
colB $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
But the formats z5. and z6. only work on numeric columns, so this isn't working and gives this output:
ColA colB
12 ABC
445 DEFG
34534 HI
234 JKL
6453 MNOPQR
When I want:
ColA colB
00012 000ABC
00445 00DEFG
34534 0000HI
00234 000JKL
06453 MNOPQR
With both columns formatted as characters.
Ideally I'd like to find a way to get the output I need using only formats & informats to keep the code easy to follow (I have a lot of columns to keep track of!).
Grateful for any suggestions!
You can use CATS to force the csv columns to character, without knowing what types the csv import determined they were. Right-justify the result to the expected variable length and translate the filled-in spaces to zeroes.
For example
data have;
length a 8 b $7; * dang csv data, someone entered 7 chars for colB;
a = 12; b = "MNQ"; output;
a = 123456; b = "ABCDEFG"; output;
run;
data want;
set have (rename=(a=csvA b=csvB));
length a $5 b $6;
* may transfer, truncate or convert, based on length and type of csv variables;
* substr used to prevent blank results when cats (number) is too long;
* instead, the number will be truncated;
a = substr(cats(csvA),1);
b = substr(cats(csvB),1);
a = translate(right(a),'0',' ');
b = translate(right(b),'0',' ');
run;
Use SUBSTR on the left-hand side of the assignment.
data test;
infile cards firstobs=2 dsd;
length cola $5 colb $6;
cola = '00000';
colb = '000000';
input (a b)($);
substr(cola,vlength(cola)-length(a)+1)=a;
substr(colb,vlength(colb)-length(b)+1)=b;
cards;
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
;;;;
run;
proc print;
run;

usage of missover statement in SAS

I have below dataset
data ab;
infile cards missover;
input m p c;
cards;
1,2,3
4,5,
6,7,
run;
The output of this step is
m p c
. . .
. . .
. . .
Why did I get this output instead of an error?
I haven't specified any delimiter either.
Please explain.
Thanks in Advance,
Nikhila
You do get INVALID DATA messages. SAS defaults to space-delimited fields; you need to specify the DSD INFILE statement option and/or DLM=','. You don't actually need MISSOVER, as you have the proper number of delimiters for three comma-delimited fields, but I would probably go ahead and keep it.
24 data ab;
25 infile cards missover;
26 input m p c;
27 cards;
NOTE: Invalid data for m in line 28 1-5.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
28 1,2,3
m=. p=. c=. _ERROR_=1 _N_=1
NOTE: Invalid data for m in line 29 1-4.
29 4,5,
m=. p=. c=. _ERROR_=1 _N_=2
NOTE: Invalid data for m in line 30 1-4.
30 6,7,
m=. p=. c=. _ERROR_=1 _N_=3
NOTE: The data set WORK.AB has 3 observations and 3 variables.
24 data ab;
25 infile cards dsd missover;
26 input m p c;
27 cards;
NOTE: The data set WORK.AB has 3 observations and 3 variables.
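Written out in full, the corrected step from that second log - DSD makes consecutive commas read as missing values and, together with DLM=',', treats the commas as field delimiters:

```sas
data ab;
    infile cards dsd dlm=',' missover;
    input m p c;
    cards;
1,2,3
4,5,
6,7,
;
run;
```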
The MISSOVER is what makes you have three observations instead of just one. Without MISSOVER, SAS will read each whole line as a single value and you will end up with one observation of all missing values. It is easier to see if you change your variables to character instead of numeric, since you can see where the values end up.
data ab;
infile cards missover;
input m $ p $ c $;
put (m p c) (=);
cards;
1,2,3
4,5,
6,7,
;
m=1,2,3 p= c=
m=4,5, p= c=
m=6,7, p= c=
data ab;
infile cards /*missover */;
input m $ p $ c $;
put (m p c) (=);
cards;
1,2,3
4,5,
6,7,
;
m=1,2,3 p=4,5, c=6,7,
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

junk values in a row (SAS)

I have my data in a .txt file which is comma separated. I am writing regular INFILE statements to import that file into a SAS dataset. The data is some 2.5 million rows. However, in the 37314th row and many more rows I have junk values. SAS stops importing one row above the first junk-value row, and therefore I am not getting a dataset with all 2.5 million rows but one with only 37314 rows. I want to write code which, while reading the file, either skips these junk rows or deletes them. All in all, I need all 2.5 million rows, which I am not able to get because of the junk rows in between.
any help would be appreciated.
You can read the whole line into the input buffer using just an
input;
statement. Then you can parse the fields individually using the
_infile_
automatic variable.
Example:
data _null_;
infile datalines firstobs=2;
input;
city = scan(_infile_, 1, ' ');
char_min = scan(_infile_, 3, ' ');
char_min = substr(char_min, 2, length(char_min)-2);
minutes = input(char_min, BEST12.);
put city= minutes=;
datalines;
City Number Minutes Charge
Jackson 415-555-2384 <25> <2.45>
Jefferson 813-555-2356 <15> <1.62>
Joliet 913-555-3223 <65> <10.32>
;
run;
See "Working with Data in the Input Buffer" in the SAS documentation.
You can also use the ? and ?? modifiers on the INPUT statement to 'ignore' any problem rows.
See the SAS documentation under the heading "Format Modifiers for Error Reporting".
An example:
data x;
format my_num best.;
input my_num ?? ;
/* POSSIBLE ERROR HANDLING HERE: */
if my_num ne . then do;
output;
end;
datalines;
a
;
run;