SAS: dlm and dsd? - sas

I am confused about what DSD actually does in terms of "moving the pointer" and reading in data. To better explain, look at the following code:
data one;
infile cards dlm=',' TRUNCOVER ; /*using dlm','*/
input cust_id date ddmmyy10. A $ B $ C $;
cards;
1,10/01/2015,5000,dr
;
run;
data two;
infile cards dsd TRUNCOVER ;
input cust_id date ddmmyy10. A $ B $ C $;
cards;
1,10/01/2015,5000,dr
;
run;
The dataset one contains values for A and B of 5000 and dr but the dataset two contains values of A as missing whereas B and C are 5000 and dr. I don't get why the dsd sets A to missing.
Thanks!

Your problem is not DLM or DSD it is "DATE DDMMYY10." that is inFORMATTED input which is not compatible with delimited input in any form DSD or NO.
You need INFORMAT statement or : informat modified.
date :DDMMYY10.

Related

specifying data informat using do loops in sas

I have a large data file with data in the following format: country, datatype, year1month1 to year2018month7.
Reading the data using proc import did not work for all data fields. I ended up modifying the SAS datastep code to ensure data format was correct.
However, I am having trouble simplifying the code, namely I would like a do loop to go through all the years and month. This way, I could use current date to figure out the range of dates for the file and the code to create Year/Month variable does not have to repeat 100 times in the file.
data test;
infile 'abc.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat Country_Name $34. ;
do i = 1940 to 2018;
do j = 1 to 12;
informat _(i)M(j) best32.;
end;
end;
informat Base_Year $1. ;
format Country_Name $34. ;
do i = 1940 to 2018;
do j = 1 to 12;
format _(i)M(j) best12.;
end;
end;
format Base_Year $1. ;
input
Country_Name $
do i = 1940 to 2018;
do j = 1 to 12;
_(i)M(j) $;
end;
end;
Base_Year $;
run;
There are a few approaches here that could work. The most directly translatable to your approach is to use the macro language.
You need to translate those two loops to something like this:
%do i = 1940 %to 2018;
%do j = 1 %to 12;
informat _&i.M&j. best32.;
%end;
%end;
Notice the % there. This also has to be in a macro; you can't do this in normal datastep code.
I would rewrite it to use a macro like so:
%macro make_ym(startyear=, endyear=, separator=);
%local i j;
%do i = &startyear. %to &endyear.;
%do j = 1 %to 12;
_&i.&separator.&j.
%end;
%end;
%mend make_ym;
data test;
infile 'abc.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat Country_Name $34. ;
informat %make_ym(startyear=1940,endyear=2018,separator=M) best32.;
informat Base_Year $1. ;
format %make_ym(startyear=1940,endyear=2018,separator=M) best12.;
format Base_Year $1. ;
input
Country_Name $
%make_ym(startyear=1940,endyear=2018,separator=M)
Base_Year $;
run;
I took out the $ after the yMm bits in the input since you declared them as numeric.
Don't model your data step after the code generated by PROC IMPORT. It does a lot of useless things, like attaching formats and informats to variables that don't need them.
For your problem you just need a simple program like this:
data test;
infile 'abc.csv' dsd dlm= ',' truncover firstobs=2 ;
input Country_Name :$34. Y1940M01 .... Y2018M08 Base_Year :$1. ;
run;
Now the only tricky part is building that list of numerical variables. If the list is small enough you could just put it into a macro variable. Fortunately that is not a problem in this case since using 8 character names (YyyyyMmm) there is room for over 300 years worth in a data step character variable. A variable of length 10,800 bytes should have room for 100 years of month names.
So just run this data step first.
data _null_;
length names $10800 ;
basedate = mdy(1,1,1940);
lastdate = today();
do i=0 to intck('month',basedate,lastdate);
date=intnx('month',basedate,i);
names=catx(' ',names,cats('Y',year(date),'M',put(month(date),Z2.)));
end;
call symputx('names',names);
run;
Now you can use the macro variable in your INPUT statement.
data test;
infile 'abc.csv' dsd dlm= ',' truncover firstobs=2 ;
input Country_Name :$34. &names Base_Year :$1. ;
run;

SAS group by first digits

I have a variable in SAS with a lot of numbers, for example 11000, 30129, 11111, 30999. I want to group this by the first two digits so "11000 and 11111" and "30129 and 30999" will be in a own table.
It's quite simple,
You have to create a second column and extract the 2 first digit.
Then sort the dataset by this second columns.
data test;
infile datalines dsd ;
input a : 15. ;
datalines;
11000,
30129,
11111,
309999,
;
run;
data test_a;
length val_a $2;
set test;
val_a= SUBSTRN(a,1,2);
run;
proc sort data=test_a out=test_b;
by val_a;
run;
Result will be :
val_a a
11 11000
11 11111
30 30129
30 309999
And then you can create 2 dataset with selection on the val_a like this :
data want data_11 data_30;
set test_b;
if val_a = 11 then output data_11;
if val_a = 30 then output data_30;
run;
Regards,
I think I did like you, but my new column only shows with ".". But I think your answer can give me some help anyways, thank you!
data books;
infile "&path\Boken.csv" dlm=';' missover dsd firstobs=2;
input ISBN: $12.
Book: $quote150.;
run;
data test_a;
format val_ISBN 15.;
set books;
val_ISBN= SUBSTRN(ISBN,1,2);
run;
proc sort data=test_a out=test_b;
by val_ISBN;
run;
proc print data=test_b (obs=10) noobs ;
run;

SAS - datalines equivalent of MISSOVER / TRUNCOVER statement?

I know for the infile statement, you can add a missover and truncover statement for missing values in the input, but if I'm using datalines for data instead of infile, is there an equivalent statement I can use?
You can still use infile by following it with datalines and then missover or truncover etc.
data _null_;
infile datalines missover;
input a b c;
put _all_;
datalines;
1 2
3 4
;
run;

Why these many observations are created in the following program?

data test;
infile cards dsd dlm=', .';
input stmt : $ ##;
cards;
T
;run;
/*-----------------------------------------------*/
data test;
infile cards dsd dlm=', .';
input stmt : $ ##;
cards;
Th
;run;
/*-----------------------------------------------*/
data test;
infile cards dsd dlm=', .';
input stmt : $ ##;
cards;
This is SAS.
;run;
When first program is run, 80 observations are created
When second program is run, 79 observations are created
When third program is run, 72 observations are created
I know these program has worst programming style. Wrong options are set for wrong technique. DSD option is set, double trailing operator ## (line holder), Colon modifier (:) are used and more than 1 delimeter is used which is worst SAS programming ever.
Aside from this I want to know why so many observations are created, why 80? 79? how program is executed? I think DSD option & 2 delimeters have major impact. Can anyone explain?
The reason you get more records than you expect is because CARDS are fixed length records. The reason you get a difference number of records is because there is a different number of null fields left after reading the non-null field(s). You can see this by adding the COL option to the INFILE statement to show you where the column pointer is after reading each field. Col=3, 4 , 13
data test;
infile cards dsd dlm=', .' col=c;
input stmt : $ ##;
col=c;
cards;
T
;run;
proc print data=test(obs=5);
/*-----------------------------------------------*/
data test;
infile cards dsd dlm=', .' col=c;
input stmt : $ ##;
col=c;
cards;
Th
;run;
proc print data=test(obs=5);
/*-----------------------------------------------*/
data test;
infile cards dsd dlm=', .' col=c;
input stmt : $ ##;
col=c;
cards;
This is SAS.
;run;
proc print data=test(obs=5);
run;

Issue inputing sas date as datalines

I've the following code. Though I've entered 30jun1983 it gets saved as 30/jun/2020. And it is reading only when there's two spaces between the date values in the cards and if there's only one space it reads the second value as missing.
DATA DIFFERENCE;
infile cards dlm=',' dsd;
INPUT DATE1 DATE9. Dt2 DATE9.;
FORMAT DATE1 DDMMYY10. Dt2 DDMMYY10.;
DIFFERENCE=YRDIF(DATE1,Dt2,'ACT/ACT');
DIFFERENCE=ROUND(DIFFERENCE);
CARDS;
11MAY2009 30jun1983
;
RUN;
You need colons on your input statement (to denote INformats), and also a comma in your datalines (you specified a comma as your DLM - delimiter):
DATA DIFFERENCE;
infile cards dlm=',' dsd;
INPUT DATE1 :DATE9. Dt2 :DATE9.;
FORMAT DATE1 DDMMYY10. Dt2 DDMMYY10.;
DIFFERENCE=YRDIF(DATE1,Dt2,'ACT/ACT');
DIFFERENCE=ROUND(DIFFERENCE);
CARDS;
11MAY2009,30jun1983
;
RUN;