Usage of the MISSOVER statement in SAS

I have the dataset below:
data ab;
infile cards missover;
input m p c;
cards;
1,2,3
4,5,
6,7,
run;
The output of this program is:
m p c
. . .
. . .
. . .
Why did I get this output instead of an error? I haven't specified any delimiter either.
Please explain.
Thanks in Advance,
Nikhila

You do get INVALID DATA messages. SAS is defaulting to space-delimited fields; you need to specify the DSD INFILE statement option and/or DLM=','. You don't actually need MISSOVER, as you have the proper number of delimiters for three comma-delimited fields, but I would probably go ahead and keep it.
24 data ab;
25 infile cards missover;
26 input m p c;
27 cards;
NOTE: Invalid data for m in line 28 1-5.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
28 1,2,3
m=. p=. c=. _ERROR_=1 _N_=1
NOTE: Invalid data for m in line 29 1-4.
29 4,5,
m=. p=. c=. _ERROR_=1 _N_=2
NOTE: Invalid data for m in line 30 1-4.
30 6,7,
m=. p=. c=. _ERROR_=1 _N_=3
NOTE: The data set WORK.AB has 3 observations and 3 variables.
24 data ab;
25 infile cards dsd missover;
26 input m p c;
27 cards;
NOTE: The data set WORK.AB has 3 observations and 3 variables.
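Putting the suggestions together, a cleaned-up version of the step might look like this (note that DSD already implies a comma delimiter, so DLM=',' is belt-and-braces):
data ab;
  infile cards dsd dlm=',' missover;
  input m p c;
cards;
1,2,3
4,5,
6,7,
;
run;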

The MISSOVER is what gives you three observations instead of just one. Without MISSOVER, SAS will try to read each line as one value and you will end up with one observation of all missing values. It is easier to see if you change your variables to character instead of numeric, since you can see where the values end up.
data ab;
infile cards missover;
input m $ p $ c $;
put (m p c) (=);
cards;
1,2,3
4,5,
6,7,
;
m=1,2,3 p= c=
m=4,5, p= c=
m=6,7, p= c=
data ab;
infile cards /*missover */;
input m $ p $ c $;
put (m p c) (=);
cards;
1,2,3
4,5,
6,7,
;
m=1,2,3 p=4,5, c=6,7,
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.


SAS importing messy data

I tried formatting the data so that U is only numeric while all descriptions should be under T. May I know what I could possibly do to fix it?
DATA data;
infile '....csv'
dlm=',' firstobs=2 dsd;
format A$ B$ C$ D$ E$ F$ G$ H$ I$ J$ K$ L$ M$ N$ O$ P$ Q$ R$ S$ T$ U V W$ X$ Y$ Z$ AA$ AB$ AC$ AD$ AE$ AF$ AG$ AH$ AI$ AJ$ AK$ AL$ AM$ AN$ AO$ AP$ AQ$ AR$ AS;
input A B C D E F G H I J K L M N O P Q R S T @;
do _n_=1 to 24;
input U @;
description=catx(', ',T, U);
end;
input U V W X Y Z AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS;
RUN;
If you are talking about the data file in this Kaggle project, then I would use a divide-and-conquer approach. Check each line in the file to see how many columns it contains. Then split the problem lines into separate file(s) and figure out how to handle whatever issue it is that causes them to be poorly formatted/parsed.
So get a list of the rows and number of columns in each row.
data maxcol;
infile "C:\downloads\archive.zip" zip member='Datafiniti_Mens_Shoe_Prices.csv'
dsd truncover column=cc length=ll lrecl=2000000
;
row+1;
input @;
do maxcol=1 by 1 while(cc<=ll); input dummy :$1. @ +(-1) dummy $char1. @; end;
if dummy ne ',' then maxcol=maxcol-1;
keep row maxcol ;
run;
proc freq ;
tables maxcol;
run;
For example you could get the list of bad row numbers into a macro variable.
proc sql noprint;
select row into :rowlist separated by ' ' from maxcol where maxcol > 48 ;
quit;
Then use that macro variable in your code that reads the datafile.
data want;
infile "C:\downloads\archive.zip" zip member='Datafiniti_Mens_Shoe_Prices.csv' dsd
truncover lrecl=2000000
;
input @;
if _n_ in (1 &rowlist) then delete;
... rest of data step to read the "good" rows ...
For the other rows, take a look at them and figure out where they are getting extra columns inserted. Possibly just fix them by hand, or craft separate data steps to read each set separately using the same &ROWLIST trick.
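For example, a sketch of the same &ROWLIST trick inverted, pulling out just the problem rows so you can eyeball them (the dataset and variable names here are illustrative):
data badrows;
  infile "C:\downloads\archive.zip" zip member='Datafiniti_Mens_Shoe_Prices.csv' dsd
    truncover lrecl=2000000;
  input @;                /* load the line and hold it */
  if _n_ in (&rowlist);   /* keep only the rows flagged as having extra columns */
  row = _n_;
  length line $32767;
  line = _infile_;        /* copy the raw record for inspection */
  keep row line;
run;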
If you are positive that
- the extra columns are inserted between column 20 and 21,
- column 21 always has a valid numeric string, and
- none of the extra values are valid numeric strings,
then you could use logic like this to generate a new delimited file (why not use | as the delimiter this time?):
data _null_;
infile "C:\downloads\archive.zip" zip member='Datafiniti_Mens_Shoe_Prices.csv' dsd
truncover lrecl=2000000
;
row+1;
length col1-col48 $32767;
input col1-col21 @;
if _N_>1 then do while(missing(input(col21,??32.)));
col20=catx(',',col20,col21);
input col21 @;
end;
input col22-col48;
file "c:\downloads\shoes.txt" dsd dlm='|' lrecl=2000000 ;
put col1-col48 ;
run;
You could even then try to read the new file with PROC IMPORT, letting it guess how to define the variables. (But watch out, as PROC IMPORT might truncate some of the records by using LRECL=32767.)
proc import datafile="c:\downloads\shoes.txt" dbms=csv out=want replace ;
delimiter='|';
guessingrows=max;
run;
Checking column 21:
The MEANS Procedure
Analysis Variable : prices_amountMin
N Mean Std Dev Minimum Maximum
---------------------------------------------------------------------
19387 111.8138820 276.7080893 0 16949.00
---------------------------------------------------------------------

SAS replace character in ALL columns

I have a SAS dataset that I have to export to a .csv file. I have the following two conflicting requirements.
I have to use the semicolon as the delimiter in the .csv-file.
Some of the character variables are manually inputted strings from formulas, hence they may contain semicolons.
My solution to the above is to either escape the semicolon or to replace it with a comma.
How can I, in a nice, clean and efficient way use e.g. tranwrd on an entire dataset?
My attempt:
For each variable, use the tranwrd(.., ";", ",") function on that variable, update the dataset, and loop through all variables. This, however, is naturally a very inefficient way of doing it for even semi-large datasets, since I have to run a data step for each variable. The code for it is quite ugly, since I have to get the variable names in a few steps, but the inefficiency definitely takes the cake.
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
* Get the variable names;
proc contents data=test out=vars(keep=name type varnum) order=varnum noprint;
run;
* Sort by variable number;
proc sort data=vars;
by varnum;
run;
* Put variable names into a space-separated string;
proc sql noprint;
select compress(name)
into :name_list separated by ' '
from vars;
quit;
%let len = %sysfunc(countw(&name_list));
*Initialize loop dataset;
data a;
set test;
run;
%macro loop;
%do i = 1 %to &len;
%let j = %scan(&name_list,&i);
data a(rename=(v_&j = &j) drop=&j);
set a;
v_&j.=compress(tranwrd(&j,";",","));
run;
%end;
%mend;
%loop;
I think I may have a more elegant solution to your problem:
data class;
set sashelp.class;
array vars [*] _character_;
do i = 1 to dim(vars);
vars[i] = compress(tranwrd(vars[i],"a","X"));
end;
drop i;
run;
You can use an array to reference all character columns in your data set and then loop through them.
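Applied to the semicolon problem from the question, a sketch using the same pattern on the test data would be (I've left out the COMPRESS wrapper, since COMPRESS with one argument would also strip blanks):
data clean;
  set test;
  array vars [*] _character_;             /* every character variable in the dataset */
  do i = 1 to dim(vars);
    vars[i] = tranwrd(vars[i], ";", ","); /* swap each semicolon for a comma */
  end;
  drop i;
run;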
The most widely used standard for csv files whose fields can contain delimiters is to quote the fields that contain them and double up any embedded quotes. In SAS you can do this automatically using the DLM= and DSD options on the FILE statement:
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb" 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
data _null_;
set test;
file "c:\temp\test.csv" dsd dlm=';';
put (_ALL_) (&);
run;
This results in the following semicolon-delimited csv (minus a header row, but that's a separate issue):
"Aaa;;";50;11;1;222;"a;s"
"Bbb""";35;12;2;250;qw
Comma,;75;13;3;foo;zx
Sorry, didn't notice your comment about the workaround until after I posted this. I'll leave it here in case anyone finds it helpful.
Fields that contain the delimiter are quoted in a properly formatted delimited file. PROC EXPORT will do that. There is no need to change the data.
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
filename FT45F001 temp;
proc export data=test outfile=FT45F001 dbms=csv;
delimiter=';';
run;
data _null_;
infile FT45F001;
input;
list;
run;
proc import replace datafile=FT45F001 dbms=csv out=test2;
delimiter=';';
run;
proc print;
run;
proc compare base=test compare=test2;
run;

Why are so many observations created in the following program?

data test;
infile cards dsd dlm=', .';
input stmt : $ @@;
cards;
T
;run;
/*-----------------------------------------------*/
data test;
infile cards dsd dlm=', .';
input stmt : $ @@;
cards;
Th
;run;
/*-----------------------------------------------*/
data test;
infile cards dsd dlm=', .';
input stmt : $ @@;
cards;
This is SAS.
;run;
When the first program is run, 80 observations are created.
When the second program is run, 79 observations are created.
When the third program is run, 72 observations are created.
I know these programs have terrible style: the wrong options are set for the wrong technique. The DSD option is set, the double trailing operator @@ (line holder) and the colon modifier (:) are used, and more than one delimiter is specified, which is the worst SAS programming ever.
Aside from this, I want to know why so many observations are created. Why 80? Why 79? How is the program executed? I think the DSD option and the two delimiters have a major impact. Can anyone explain?
The reason you get more records than you expect is that CARDS data are fixed-length (80-byte) records. The reason you get a different number of records is that a different number of null fields is left over after reading the non-null field(s). You can see this by adding the COL= option to the INFILE statement to show where the column pointer is after reading each field: Col=3, 4, 13.
data test;
infile cards dsd dlm=', .' col=c;
input stmt : $ @@;
col=c;
cards;
T
;run;
proc print data=test(obs=5);
/*-----------------------------------------------*/
data test;
infile cards dsd dlm=', .' col=c;
input stmt : $ ##;
col=c;
cards;
Th
;run;
proc print data=test(obs=5);
/*-----------------------------------------------*/
data test;
infile cards dsd dlm=', .' col=c;
input stmt : $ ##;
col=c;
cards;
This is SAS.
;run;
proc print data=test(obs=5);
run;
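To see the fixed-length records directly, here is a minimal sketch (assuming default CARDS behavior) that prints the record length reported by the LENGTH= option:
data _null_;
  infile cards length=len;
  input @;
  put len=;    /* expect len=80: in-stream data lines are padded to 80 bytes */
cards;
T
;
run;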

Read in specific data without FIRSTOBS=

Is there a way to read in specific parts of my data without using FIRSTOBS=? For example, I have 5 different files, all of which have a few rows of unwanted characters. I want my data to be read in starting with the first row that is numeric. But each of these 5 files has that first numeric row starting in a different place. Rather than going into each file to find where FIRSTOBS should be, is there a way I can check this instead? Perhaps by using an IF statement with ANYDIGIT?
Have you tried something like this from the SAS docs? Example 5: Positioning the Pointer with a Numeric Variable
data office (drop=x);
infile file-specification;
input x @;
if 1<=x<=10 then
input @x City $9.;
else do;
put 'Invalid input at line ' _n_;
delete;
end;
run;
This assumes that you don't know how many lines are to be skipped at the beginning of each file. My filerefs are UNIX; to run the example on another OS they will need to be changed.
*Create two example input data files;
filename FT15F001 '~/file1.txt';
parmcards;
char
char and 103
10 10 10
10 10 10.1
;;;;
run;
filename FT15F001 '~/file2.txt';
parmcards;
char
char and 103
char
char
char
10 10 10.5
10 10 10
;;;;
run;
*Read them starting from the first line that has all numbers;
filename FT77F001 '~/file*.txt';
data both;
infile FT77F001 eov=eov;
input @;
/*Reset the flag at the start of each new file*/
if _n_ eq 1 or eov then do;
eov=0;
flag=1;
end;
if flag then do;
if anyalpha(_infile_) then delete;
else flag=0;
end;
input v1-v3;
drop flag;
retain flag;
run;
proc print;
run;
I ended up doing:
INPUT City $ @;
StateAvg = input(substr(City,1,4),COMMA4.);
IF 5000<= StateAvg <= 7000 THEN
INPUT City 1-7 State ZIP;
ELSE DO;
Delete;
END;
And this worked. Thanks for the suggestions, I went back and looked at example 5 and it helped.

How to read the first n columns into a SAS dataset?

My raw data file is a simple comma-delimited file that contains 10 columns. I know there must be a way of importing only, say, 3 columns using MISSOVER or LS?
Sure, I can import the whole file and drop unneeded variables. However, how can I use
infile missover=3
OR
infile ls=3
?
No. SAS has to read all the data from a delimited file anyway - it has to read every byte from beginning to end to 1) process the column delimiters and 2) determine the observation delimiters (CR/LF characters) - so DROPping the unneeded columns wouldn't be much of an overhead.
If you need to read only the first n fields you can just list those in your input statement.
If the columns you need are further to the right, you will need to list in your INPUT statement all the columns before them, up to the last column you need.
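For instance, a minimal sketch reading just the first 3 of 10 comma-delimited columns (the file name and variable names are placeholders):
data first3;
  infile "myfile.csv" dlm=',' dsd truncover;
  input a $ b $ c $;   /* only the listed variables are read; the rest of each record is ignored */
run;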
MISSOVER does not take =n. It tells SAS, if it doesn't find enough columns on the current record, to set the remaining variables to missing instead of looking for them on the following record; reading then resumes at the next record with the first variable of the INPUT statement.
E.g., if the file looks like this:
a1,b1,c1
a2,b2,c2,d2
data test;
infile "myfile" dlm="," dsd /*2 cosecutive delimiters are not considered as one*/ missover;
input
a $
b $
c $
d $
;
run;
will result in a dataset:
a b c d
a1 b1 c1
a2 b2 c2 d2
data test;
infile "myfile" dlm="," dsd /*2 cosecutive delimiters are not considered as one*/ ;
input
a $
b $
c $
d $
;
run;
will result in a dataset with a single observation, because without MISSOVER SAS moves to the next record to find values for the remaining variables:
a b c d
a1 b1 c1 a2