Read from multiple lines into single row until delimiter is encountered - sas

I have been trying to read from a text file that has rows like the ones below and uses a semicolon as the delimiter:
Sun rises
in the east
and;
sets in the
west
;
I am trying to read the data from delimiter to delimiter in single separate records like below
variable_name
1 Sun rises in the east and
2 sets in the west
I have tried almost all the options available with the infile statement, to no avail. Is it possible to read the data like above? How can I do it? Any clue/help would be appreciated.

recfm=n is the way to tell SAS to not have 'line breaks'. So:
data want;
infile "c:\temp\test.txt" recfm=n dsd dlm=';';
length text $1024;
input text $;
run;
Note that the line break will be read as just another two characters (CR and LF), so if you want to remove those characters you can use compress with the c modifier to remove control characters (including CR and LF).
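As a minimal sketch of that cleanup step, assuming the file uses CR/LF line endings and that each line break should become a word separator: compress(text,,'c') deletes the control characters outright, which would join "rises" and "in" into one word, so this sketch swaps in translate to turn them into blanks first. The dataset name want_clean is only illustrative.

```sas
data want_clean;
  infile "c:\temp\test.txt" recfm=n dsd dlm=';';
  length text $1024;
  input text $;
  /* turn CR ('0D'x) and LF ('0A'x) into blanks, then collapse runs of blanks */
  text = left(compbl(translate(text, '  ', '0D0A'x)));
  /* drop any piece that is empty after cleanup, such as the line break after the final delimiter */
  if not missing(text);
run;
```

If the line breaks should simply disappear instead, replacing the translate call with compress(text,,'c') would do that.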

Read it word by word and concatenate into longer lines.
data want ;
infile mydat end=eof ;
length word $200 line $2000 ;
drop word;
do while (^eof and ^index(word,';'));
input word @ ; /* trailing @ holds the line; use @@ if text can follow ';' on the same line */
line = catx(' ',line,compress(word,';'));
end;
if _n_ > 1 or line ne ' ' then output;
run;

Related

Read in SAS with two lines end and start at different positions

I have two lines of observations to read in SAS.
It is a comma-delimited data set.
My code is as below:
DATA SASweek1.industry;
INFILE "&Dirdata.Assignment1_Q6_data.txt" DLM="," DSD termstr=crlf TRUNCOVER;
LENGTH Company $ 15;
INPUT Company $ State $ Expense COMMA9. ;
FORMAT Expense DOLLAR9.;
*INFORMAT Expense DOLLAR10.;
RUN; * not ready;
The raw data set looks like this:
I can print out the first line of observations well,
but the last "0" will go to the first position of the second
line, becoming "0Lee's..".
Any suggestions would be highly appreciated!!
It is just doing what you told it to do. You told it to read exactly 9 characters.
Normally you should not use formatted input mode with delimited data. You prevent that by either adding the : (colon) prefix in front of the informat specification in the INPUT statement or removing the informat specification completely and using an INFORMAT statement to let SAS know what informat to use.
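A small sketch of the colon-modifier approach, assuming a file that is properly delimited (values containing commas are quoted). The dataset name and the inline data are made up for illustration and are not the asker's file.

```sas
data work.expense_demo;
  infile datalines dlm="," dsd truncover;
  length Company $15 State $2;
  /* the colon makes COMMA9. read up to the next delimiter instead of a fixed 9 characters */
  input Company State Expense :comma9.;
  format Expense dollar9.;
datalines;
Acme Corp,NY,"1,200"
Globex,CA,350
;
```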
But your data is NOT properly delimited because the last field contains the delimiter, but the value is not enclosed in quotes. So the commas make it look like two values instead of one. The real solution is to fix the process that created the file to create a valid delimited file. It needs to quote the values with commas in them, or remove the commas from the numbers, or use a delimiter character that does not appear in the data.
Fortunately since it is the last field on the line you CAN use formatted input to read just that field. Since you are using the TRUNCOVER option just set the width of the informat in the INPUT statement to the maximum.
DATA SASweek1.industry;
INFILE "&Dirdata.Assignment1_Q6_data.txt" DLM="," DSD termstr=crlf TRUNCOVER;
LENGTH Company $15 State $15 Expense 8;
INPUT Company State Expense COMMA32. ;
FORMAT Expense DOLLAR9.;
RUN;

Removing Characters from SAS String Starting on Left

I have a SAS string that always starts with a date. I want to remove the date from the substring.
Example of data is below (data does not have bullets, included bullets to increase readability)
10/01/2016|test_num15
11/15/2016|recom_1_test1
03/04/2017|test_0_8_i0|vacc_previous0
I want the data to look like this (data does not have bullets, included bullets to increase readability)
test_num15
recom_1_test1
test_0_8_i0|vacc_previous0
Use index to find the position of the first '|' in the string, then substr to take everything after it; or use a regular expression.
data have;
input x $50.;
x1=substr(x,index(x,'|')+1);
x2=prxchange('s/([^_]+\|)(?=\w+)//',1,x);
cards;
10/01/2016|test_num15
11/15/2016|recom_1_test1
03/04/2017|test_0_8_i0|vacc_previous0
;
run;
This is a great use case for call scan. If the date always has the same length (10 characters), you don't actually need it: start would always be 12 and you could skip straight to the substr, as user667489 noted in the comments. If the length varies, call scan is helpful.
data have;
length textstr $100;
input textstr $;
datalines;
10/01/2016|test_num15
11/15/2016|recom_1_test1
03/04/2017|test_0_8_i0|vacc_previous0
;;;;
run;
data want;
set have;
call scan(textstr,2,start,length,'|');
new_textstr = substr(textstr,start);
run;
It would also let you grab the second word only if that's useful (using length third argument for substr).
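A sketch of that variation, reusing the have dataset from above; want_second and second_word are illustrative names. It uses substrn rather than substr as a precaution, since call scan returns a position and length of 0 when there is no second word.

```sas
data want_second;
  set have;
  call scan(textstr, 2, start, length, '|');
  /* SUBSTRN tolerates the zero position and length returned when there is no second word */
  second_word = substrn(textstr, start, length);
run;
```

For the third sample row this should give test_0_8_i0. If the position is not needed, scan(textstr, 2, '|') returns the same word directly.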

SAS PRXCHANGE a number between words into similar number of spaces

Is it possible to use the number in this string:
'xx8xx'
by replacing the number with 8 spaces to get this string:
'xx        xx'
I can identify the number between the xx but the replacement syntax does not work as intended:
PRXCHANGE(s/xx([\d]*)xx/' ' x $1/io, -1, 'xx8xx')
Is there a way to use the number being held in $1 to repeat the space character by that number i.e. something like ' ' x $1?
Any help much appreciated!
Tiaan
Suppose you need to replace the number with three blanks.
data _null_;
x=prxchange('s/(xx)\d+(xx)/$1   $2/', -1, 'xx8xx');
_x=prxchange('s/(?=\w+)(\d+)/   /',1,'xx8xx');
put _all_;
run;
Edit:
I missed important information. Tranwrd and repeat could be used to get it.
data _null_;
/* repeat(s,n) returns n+1 copies of s, hence the -1 */
x=tranwrd('xx8xx', prxchange('s/.*(\d+).*/$1/',1,'xx8xx'), repeat(' ',input(prxchange('s/.*(\d+).*/$1/',1,'xx8xx'),8.)-1));
put _all_;
run;
You'll need to extract first, then compile a new regex. This will be expensive since you have to compile once per line.
data have;
input xstr $;
datalines;
xx8xx
xx3xx
xx4xx
;;;;
run;
data want;
set have;
rx1 = prxparse('/xx(\d+)xx/io'); /* capture the whole number, not just its last digit */
rc1 = prxmatch(Rx1,xstr);
num_x = prxposn(rx1,1,xstr);
rx2 = prxparse(cat('s/(xx)[\d]*(xx)/$1',repeat(" ",num_x-1),'$2/i'));
newstr = prxchange(rx2,-1,xstr);
run;
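If the per-row compile cost matters, one alternative (not what the answer above does) is to compile a single pattern once, locate the digits with call prxposn, and rebuild the string with substrn and repeat. This is only a sketch: it assumes one number per string and a value of at least 1, and the names want_once, rx, pos, len and n are illustrative.

```sas
data want_once;
  set have;
  length newstr $200;
  retain rx;
  /* compile the pattern once, on the first row only */
  if _n_ = 1 then rx = prxparse('/xx(\d+)xx/i');
  newstr = xstr;
  if prxmatch(rx, xstr) then do;
    /* start position and length of the captured digits */
    call prxposn(rx, 1, pos, len);
    n = input(substrn(xstr, pos, len), 8.);
    /* repeat(' ', n-1) yields n blanks; cat keeps them, unlike cats or catx */
    if n > 0 then
      newstr = cat(substrn(xstr, 1, pos - 1), repeat(' ', n - 1), substrn(xstr, pos + len));
  end;
  drop rx pos len n;
run;
```

Unlike prxchange with -1, this handles only the first match in each string.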

usage of single trailing (@) in SAS for delimited data

Can you please tell me if we can use single trailing (@) for delimited data
rather than fixed width.
Thanks,
Nikhila
From the comments it looks like the question is really how to skip columns in delimited data. A simple way is to read the value into a variable that you later drop. Or even read it into a variable that you want and then overwrite it with the value from the column you do want to keep.
data want ;
infile cards dsd truncover ;
length var1 var2 $20;
input var1 var1 var1 var2 ; /* columns 1-3 are read into var1 in turn, so only column 3 is kept */
cards;
nikhila,26,hyd,btech
akhila,24,blr,btech
nitesh,20,blr,bmm
;

what is the difference between dsd and dlm="," in SAS?

Let me sum up what I have got from this website. https://communities.sas.com/t5/General-SAS-Programming/Please-explain-DSD-and-DLM-differences/td-p/146773
(1): Without dsd, the cursor passes all the delimiters before reading the next field, whereas with dsd the cursor only passes one delimiter.
(2): If you use dsd, the informat should use a colon somehow?
Do you know any differences between the two? Many thanks for your time and attention.
The most obvious difference is how DSD treats consecutive delimiters. From the docs:
When you specify DSD, SAS treats two consecutive delimiters as a
missing value and removes quotation marks from character values.
Whereas the default functionality of DLM=',' is to treat consecutive commas as a single comma, DSD will assign missing values between consecutive commas. Here's an example:
data work.dlm_test;
infile datalines dlm=','; /* using dlm */
input var1 var2 var3;
datalines; /* note how the consecutive commas are treated! */
1,2,3
1,,3
,2,3
;
data work.dsd_test;
infile datalines dsd; /* using dsd */
input var1 var2 var3;
datalines;
1,2,3
1,,3
,2,3
;
proc print data=dlm_test;
/* this will print something like:
OBS | var1 | var2 | var3
-----+------+------+------
1 | 1 | 2 | 3
2 | 1 | 3 | 2
Note only 2 observations b/c of default FLOWOVER functionality.
In observation 2, var3 = 2 is read from the third line; the final '3'
is ignored because there is no variable to store it.
*/
run;
proc print data=dsd_test;
/* this will print something like:
OBS | var1 | var2 | var3
-----+------+------+------
1 | 1 | 2 | 3
2 | 1 | . | 3 <-- note the missing value in var2
3 | . | 2 | 3 <-- note the third observation, with missing val
*/
run;
Also, DSD will be able to tell that a comma found inside quotation marks is actually not a delimiter, but part of a character string. In contrast, if you use only DLM=',', then it will ignore quotation marks and treat every comma-cluster as a delimiter.
TIP: By default, DSD drops the quotes around character strings, but you can keep the quotes by using the tilde (~) format modifier in the INPUT statement.
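A short sketch of the tilde modifier, modeled on the kind of example the SAS documentation gives for list input; the dataset name and data lines here are illustrative.

```sas
data work.tilde_demo;
  infile datalines dsd;
  /* the ~ modifier keeps the quotation marks (and the embedded comma) in Team */
  input Name : $9. Team ~ $25. Div $;
datalines;
Smith,"Green Hornets, Atlanta",AAA
Mitchel,"High Volts, Portland",AAA
;
```

Without the tilde, DSD would still treat the embedded comma as data but would strip the surrounding quotes from Team.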
It is useful to note that DSD and DLM can also be used together to get the behavior of DSD while changing the default delimiter from a comma to something else, like a semicolon (;). Example:
infile (filename) dsd dlm=';';
I found this documentation page to be the most instructive.
Remember: DSD stands for "delimiter-sensitive data" because it is more deliberate about processing delimiters!
The real issue is how the input statement behaves when it sees a delimiter as it starts to read a variable. With the DSD option it will set the value to missing and move the pointer past the delimiter. Without the DSD option it will skip over the delimiter (or multiple adjacent delimiters) before reading the value. You can confirm this by reading a line that starts with a delimiter.
The colon modifier helps when the actual value is shorter than the informat's width, but it also helps by moving the pointer PAST the delimiter so that the NEXT variable is read correctly. This is what makes it important when using formatted input statements with the DSD infile option.
You can avoid the need to worry about the : modifiers by using an INFORMAT statement instead of listing the informats in the input statement.
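A minimal sketch of the INFORMAT-statement approach; the dataset name, variable names and data lines are assumptions for illustration only.

```sas
data work.informat_demo;
  infile datalines dsd truncover;
  /* informats are declared up front, so INPUT can stay in plain list mode */
  informat Name $20. Amount comma12. SaleDate mmddyy10.;
  format SaleDate date9.;
  input Name Amount SaleDate;
datalines;
Widget,"1,234",01/15/2020
,,02/01/2020
;
```

Because the INPUT statement lists only variable names, each value is read up to the next delimiter, and the second data line should yield missing values for Name and Amount.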