Let me sum up what I have got from this website. https://communities.sas.com/t5/General-SAS-Programming/Please-explain-DSD-and-DLM-differences/td-p/146773
(1): Without DSD, the cursor skips past all consecutive delimiters before reading the next field, whereas with DSD the cursor passes only one delimiter.
(2): If you use dsd, the informat should use a colon somehow?
Do you know any differences between the two? Many thanks for your time and attention.
The most obvious difference is how DSD treats consecutive delimiters. From the docs:
When you specify DSD, SAS treats two consecutive delimiters as a
missing value and removes quotation marks from character values.
Whereas the default functionality of DLM=',' is to treat consecutive commas as a single comma, DSD will assign missing values between consecutive commas. Here's an example:
data work.dlm_test;
infile datalines dlm=','; /* using dlm */
input var1 var2 var3;
datalines; /* note how the consecutive commas are treated! */
1,2,3
1,,3
,2,3
;
data work.dsd_test;
infile datalines dsd; /* using dsd */
input var1 var2 var3;
datalines;
1,2,3
1,,3
,2,3
;
proc print data=dlm_test;
/* this will print something like:
OBS | var1 | var2 | var3
-----+------+------+------ Note only 2 observations b/c of
1 | 1 | 2 | 3 default FLOWOVER functionality.
2 | 1 | 3 | 2 <--- Also, final '3' is ignored because there
is no variable to store it.
*/
run;
proc print data=dsd_test;
/* this will print something like:
OBS | var1 | var2 | var3
-----+------+------+------
1 | 1 | 2 | 3
2 | 1 | . | 3 <-- note the missing value in var2
3 | . | 2 | 3 <-- note the third observation, with missing val
*/
run;
Also, DSD will be able to tell that a comma found inside quotation marks is actually not a delimiter, but part of a character string. In contrast, if you use only DLM=',', then it will ignore quotation marks and treat every comma-cluster as a delimiter.
TIP: By default, DSD drops the quotes around character strings, but you can keep the quotes by using the ~ (tilde) format modifier in the INPUT statement.
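A minimal sketch of the quote handling described above. The dataset and variable names (quote_test, keep_quotes, name, city, amount) and the data lines are made up for illustration. The first step lets DSD strip the quotation marks; the second applies the ~ modifier with an informat, which should keep them as part of the value.

```sas
/* DSD: a comma inside quotation marks is data, not a delimiter */
data work.quote_test;
  infile datalines dsd truncover;
  length name $20 city $20;
  input name city amount;
datalines;
"Smith, John",Boston,100
"Lee, Ann",Denver,250
;

/* Same data, but ~ retains the surrounding quotation marks in NAME */
data work.keep_quotes;
  infile datalines dsd truncover;
  length city $20;
  input name ~ $20. city amount;
datalines;
"Smith, John",Boston,100
"Lee, Ann",Denver,250
;
```

Running the same lines with only dlm=',' (no DSD) is a quick way to see the difference: the comma inside the quotes then splits the name into two fields.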
DSD and DLM can also be used together to get the behavior of DSD while changing the delimiter from the default comma to something else, such as a semicolon (;). Example:
infile (filename) dsd dlm=';';
I found this documentation page to be the most instructive.
Remember: DSD stands for "delimiter-sensitive data" because it is more deliberate about processing delimiters!
The real issue is how the INPUT statement behaves when it sees a delimiter as it starts to read a variable. With the DSD option it will set the value to missing and move the pointer past the delimiter. Without the DSD option it will skip over the delimiter (or multiple adjacent delimiters) before reading the value. You can confirm this by reading a line that starts with a delimiter.
The colon modifier helps when the actual value is shorter than the informat's width, but it also helps by moving the pointer PAST the delimiter so that the NEXT variable is read correctly. This is what makes it important when using formatted input statements with the DSD infile option.
You can avoid the need to worry about the : modifiers by using an INFORMAT statement instead of listing the informats in the input statement.
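As a rough illustration of the two styles, the sketch below reads the same delimited lines twice: once with in-line informats that each carry the : modifier, and once with an INFORMAT statement and plain list input. The dataset names, variable names, and data values are invented for the example.

```sas
/* Option 1: in-line informats, each prefixed with the : modifier */
data work.colon_demo;
  infile datalines dsd truncover;
  input name :$20. joined :date9. score;
  format joined date9.;
datalines;
Ann,01JAN2020,7
Bartholomew,15MAR2021,12
;

/* Option 2: INFORMAT statement up front, plain list input afterwards */
data work.informat_demo;
  infile datalines dsd truncover;
  informat name $20. joined date9.;
  input name joined score;
  format joined date9.;
datalines;
Ann,01JAN2020,7
Bartholomew,15MAR2021,12
;
```

Both steps are expected to produce the same values; the second keeps the INPUT statement free of modifiers.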
I have two lines of observations to read in SAS.
It is a comma-delimited data set.
My code is as below:
DATA SASweek1.industry;
INFILE "&Dirdata.Assignment1_Q6_data.txt" DLM="," DSD termstr=crlf TRUNCOVER;
LENGTH Company $ 15;
INPUT Company $ State $ Expense COMMA9. ;
FORMAT Expense DOLLAR9.;
*INFORMAT Expense DOLLAR10.;
RUN; * not ready;
The raw data set looks like this:
I can print out the first line of observations well,
but the last "0" will go to the first position of the second
line, becoming "0Lee's..".
Any suggestions would be highly appreciated!!
It is just doing what you told it to do. You told it to read exactly 9 characters.
Normally you should not use formatted input mode with delimited data. You prevent that by either adding the : (colon) prefix in front of the informat specification in the INPUT statement or removing the informat specification completely and using an INFORMAT statement to let SAS know what informat to use.
But your data is NOT properly delimited because the last field contains the delimiter, but the value is not enclosed in quotes. So the commas make it look like two values instead of one. The real solution is to fix the process that created the file to create a valid delimited file. It needs to quote the values with commas in them, or remove the commas from the numbers, or use a delimiter character that does not appear in the data.
Fortunately since it is the last field on the line you CAN use formatted input to read just that field. Since you are using the TRUNCOVER option just set the width of the informat in the INPUT statement to the maximum.
DATA SASweek1.industry;
INFILE "&Dirdata.Assignment1_Q6_data.txt" DLM="," DSD termstr=crlf TRUNCOVER;
LENGTH Company $15 State $15 Expense 8;
INPUT Company State Expense COMMA32. ;
FORMAT Expense DOLLAR9.;
RUN;
I have been trying to read from a text file which has rows like below and has delimiter as semi-colon:
Sun rises
in the east
and;
sets in the
west
;
I am trying to read the data from delimiter to delimiter in single separate records like below
variable_name
1 Sun rises in the east and
2 sets in the west
I have tried almost all the options available on the INFILE statement, to no avail. Is it possible to read the data like above? How can I do it? Any clue or help would be appreciated.
recfm=n is the way to tell SAS to not have 'line breaks'. So:
data want;
infile "c:\temp\test.txt" recfm=n dsd dlm=';';
length text $1024;
input text $;
run;
Note that the line break will be read as just another two characters, so if you want to remove those characters you can use COMPRESS with the 'c' modifier, which removes control characters (including CR and LF).
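One possible way to do that cleanup is sketched below. It assumes the line breaks are CR+LF pairs. Because COMPRESS with the 'c' modifier deletes the characters outright (which would join "rises" and "in" into one word), this version uses TRANSLATE to turn them into blanks first. The final IF, which drops an all-blank trailing record, is an addition for illustration and not part of the answer above.

```sas
data want;
  infile "c:\temp\test.txt" recfm=n dsd dlm=';';
  length text $1024;
  input text $;
  /* replace CR and LF with blanks, then collapse runs of blanks */
  text = strip(compbl(translate(text, '  ', '0D0A'x)));
  /* alternative: text = compress(text, , 'c'); deletes control characters */
  if text ne ' ' then output;  /* skip a blank record after the last ; */
run;
```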
Read it word by word and concatenate into longer lines.
data want ;
infile mydat end=eof ;
length word $200 line $2000 ;
drop word;
do while (^eof and ^index(word,';'));
input word @ ;
line = catx(' ',line,compress(word,';'));
end;
if _n_ > 1 or line ne ' ' then output;
run;
Can you please tell me if we can use single trailing (@) for delimited data
rather than fixed width.
Thanks,
Nikhila
From the comments it looks like the question is really how to skip columns in delimited data. A simple way is to read the value into a variable that you later drop. Or even read it into a variable that you want and then overwrite it with the value from the column you do want to keep.
data want ;
infile cards dsd truncover ;
length var1 var2 $20;
input 3*var1 var2 ;
cards;
nikhila,26,hyd,btech
akhila,24,blr,btech
nitesh,20,blr,bmm
;
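The other approach mentioned above, reading the unwanted columns into a variable that is later dropped, could look like the sketch below. The variable names name, skip, and degree are placeholders chosen for this example.

```sas
data want;
  infile datalines dsd truncover;
  length name $20 skip $20 degree $20;
  input name skip skip degree;  /* columns 2 and 3 both land in SKIP */
  drop skip;                    /* SKIP never reaches the output data set */
datalines;
nikhila,26,hyd,btech
akhila,24,blr,btech
nitesh,20,blr,bmm
;
```

Only NAME and DEGREE are kept, so each observation holds the first and fourth columns of its line.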
In the following program all data is read correctly
data test ;
infile datalines ;
input make 10$ mpg @@ ; /* should I use make : 10$ . . */
datalines ;
Ford 20 Honda 29 Oldsmobile 20 Cadillac 17
Toyota 24 Chevrolet 17
;
run ;
proc print ;
run ;
The above code works fine, however my teacher says that I must use a colon (:) and that the correct answer is input make : 10$ mpg @@ ;
I don't understand why. As far as I know, : is useful if we have trailing spaces at the beginning of a record line; otherwise, why should we use it here?
The colon tells SAS to use the following informat. Without the colon SAS would ignore that part (it doesn't do anything). SAS by default uses an informat (and resultant length) of $8. if you don't specify it otherwise.
You are always better off specifying the informat, as a character of 2 length stored in the default 8 length character variable would be wasting storage space and processing time, but it won't alter the value (assuming you know to be aware of the trailing spaces).
You can also specify the informat ahead of time:
data test;
infile datalines;
informat make $10.;
input
make $ mpg @@;
datalines;
Ford 20 Honda 29 Oldsmobile 20 Cadillac 17
Toyota 24 Chevrolet 17
;;;;
run;
proc print data=test;
run ;
I find that usually easier to read, although using :$10. in stream is acceptable as well.
The : modifier on the INPUT statement says to read the text using LIST MODE even when there is an in-line informat specification. In list mode the INPUT statement reads the next word on the line into the variable. Without the : modifier the INPUT statement uses FORMATTED MODE, which reads exactly the number of characters specified by the in-line informat, even if that causes it to stop before the end of the current word on the line or to read past the delimiter into the next word.
But there are other problems with your INPUT statement. As currently written it will generate an error.
2112 input make 10$ mpg @@ ;
-
22
ERROR 22-322: Expecting a name.
The number 10 in the INPUT statement is taken to mean you want to read MAKE using COLUMN MODE input, that is, to read a single-digit number from the 10th character on the line. The $ modifier after the column number then generates an error because there is no variable directly in front of it to modify. If you want to specify an informat you need to include the period as part of the specification. If you want to specify a character informat instead of a numeric informat then the name of the informat should start with the $ character.
So your INPUT statement with an in-line informat specification for MAKE would look like:
input make :$10. mpg @@ ;
The other way to make sure the MAKE is defined long enough to hold 10 characters is to define the variable before referencing it in the INPUT statement. Then SAS does not have to guess how you want it defined by how you are using it in the INPUT statement. Once the variable is known there is no need to include any extra characters in the INPUT statement.
data test ;
length make $10 mpg 8;
input make mpg @@ ;
datalines ;
Ford 20 Honda 29 Oldsmobile 20 Cadillac 17
Toyota 24 Chevrolet 17
;
I am reading a period '.' as a character variable's value but it is reading it as a blank value.
data output1;
input @1 a $1. @2 b $1. @3 c $1.;
datalines;
!..
1.3
;
run;
Output      Required
------      --------
A B C       A B C
!           ! . .
1   3       1 . 3
Please help me in reading a period as such.
The output is determined by the informat used (the $w. informat in your case, requested by $1. in your code; $1. is first of all an informat specification, and the variable's length is a side effect of it).
Use the $CHARw. informat to get the desired result.
data output1;
input @1 a $char1. @2 b $char1. @3 c $char1.;
datalines;
!..
1.3
;
run;
From documentation:
$w Informat
The $w. informat trims leading blanks and left aligns the values before storing the text. In addition, if a field contains only blanks and a single period, $w. converts the period to a blank because it interprets the period as a missing value. The $w. informat treats two or more periods in a field as character data.
$CHARw. informat
The $CHARw. informat does not trim leading and trailing blanks or convert a single period in the input data field to a blank before storing values.
I don't immediately see why it does not work.
But if you are not interested in figuring out why it does not work, but just want something that does: read it in as 1 variable of length $3. Then in a next step; split it using substr.
E.g.,
data output1;
length tmp $3;
input tmp;
datalines;
!..
1.3
;
run;
data output2 (drop=tmp);
length a $1;
length b $1;
length c $1;
set output1;
a=substr(tmp,1,1);
b=substr(tmp,2,1);
c=substr(tmp,3,1);
run;