How to skip certain lines with random character sequences in file - sas

I have a file in following format:
Name Salary Age
bob 10000 18
sally 5555 20
#not found 4fjfjhdfjfnvndf
#not found 4fjfjhdfjfnvndf
9/2-10/2
but then I have random points in the file where there are 4-6 lines of random characters. The file has 2 million rows. I was wondering whether the infile statement automatically skips these random bursts of lines, or whether I have to go into the file and delete them manually.

You probably have to deal with them in some fashion. If you have truncover or missover on the infile statement, the garbage lines won't do any harm (you must have one of the two, though, or SAS may read ahead into the next line to fill the missing values and shift your data). But you'll still get garbage rows in your data set that you need to deal with.
The quick and dirty method would be something like this:
data have;
infile "blah.txt" dlm=' ' dsd lrecl=32767 truncover;
input name $ salary age;
if missing(salary) and missing(age) then delete;
run;
If the garbage was likely to generate missing values for the numerics, that would work. However, your log probably has some warnings in it that aren't great, and this isn't perfect in what it finds, either, if the garbage might be numeric values. (If it's entirely numeric values, you could test if name is a number.)
The better method is to preprocess _infile_ - which is a bit more 'advanced' but certainly a good approach.
data have;
infile "blah.txt" dlm=' ' dsd lrecl=32767 truncover;
input @;
if countw(_infile_) ne 3 then delete; *if there are not exactly 3 "words" then delete it;
if notdigit(scan(_infile_,2)) or notdigit(scan(_infile_,3)) then delete; *if the 2nd or 3rd word contain non-digit values then delete;
input name $ salary age;
run;
Both approaches require some consistency with data to work, and probably require some tweaking - for example if salary and age are acceptable to be missing, both of these would delete rows you don't want deleted.
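If the rule for a "good" line is easier to state as a pattern, the same _infile_ pre-check could be written with a single regular expression. This is only a sketch under the assumption that a valid record is exactly one word followed by two all-digit fields; the sample rows are taken from the question, and the pattern would need loosening if salary or age may be missing.

```sas
/* Sketch: keep only records shaped like: word, digits, digits. */
data have;
  infile datalines truncover;
  input @;                      /* load the record into _infile_ and hold it */
  if not prxmatch('/^\s*\w+\s+\d+\s+\d+\s*$/', _infile_) then delete;
  input name $ salary age;      /* only reached for records that passed the check */
datalines;
Name Salary Age
bob 10000 18
sally 5555 20
#not found 4fjfjhdfjfnvndf
9/2-10/2
;
run;
```

With these sample rows only the bob and sally records should survive; the header line is rejected as well because its second and third words are not digits.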

Related

Read in SAS with two lines end and start at different positions

I have two lines of observations to read in SAS.
It is a comma-delimited data set.
My code is as below:
DATA SASweek1.industry;
INFILE "&Dirdata.Assignment1_Q6_data.txt" DLM="," DSD termstr=crlf TRUNCOVER;
LENGTH Company $ 15;
INPUT Company $ State $ Expense COMMA9. ;
FORMAT Expense DOLLAR9.;
*INFORMAT Expense DOLLAR10.;
RUN; * not ready;
The raw data set looks like this:
I can print out the first line of observations well,
but the last "0" will go to the first position of the second
line, becoming "0Lee's..".
Any suggestions would be highly appreciated!!
It is just doing what you told it to do. You told it to read exactly 9 characters.
Normally you should not use formatted input mode with delimited data. You prevent that by either adding the : (colon) prefix in front of the informat specification in the INPUT statement or removing the informat specification completely and using an INFORMAT statement to let SAS know what informat to use.
But your data is NOT properly delimited because the last field contains the delimiter, but the value is not enclosed in quotes. So the commas make it look like two values instead of one. The real solution is to fix the process that created the file to create a valid delimited file. It needs to quote the values with commas in them, or remove the commas from the numbers, or use a delimiter character that does not appear in the data.
Fortunately since it is the last field on the line you CAN use formatted input to read just that field. Since you are using the TRUNCOVER option just set the width of the informat in the INPUT statement to the maximum.
DATA SASweek1.industry;
INFILE "&Dirdata.Assignment1_Q6_data.txt" DLM="," DSD termstr=crlf TRUNCOVER;
LENGTH Company $15 State $15 Expense 8;
INPUT Company State Expense COMMA32. ;
FORMAT Expense DOLLAR9.;
RUN;
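For comparison, if the file did quote the values that contain commas, the colon modifier mentioned earlier would be enough. The following is a minimal sketch with made-up company names and amounts (the real raw data is not shown in the question), read from inline datalines rather than the original file.

```sas
/* Sketch with hypothetical data: quoted values plus DSD and the colon modifier. */
data industry;
  infile datalines dlm="," dsd truncover;
  length Company $15 State $15 Expense 8;
  /* The colon makes SAS stop at the delimiter instead of reading a fixed width. */
  input Company State Expense :comma32.;
  format Expense dollar12.;
datalines;
Acme Foods,TX,"1,250,000"
Birch Labs,CA,980000
;
run;
```

DSD removes the quotes around the last field, and the COMMA informat then strips the embedded commas.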

Removing Characters from SAS String Starting on Left

I have a SAS string that always starts with a date. I want to remove the date from the substring.
Example of data is below (data does not have bullets, included bullets to increase readability)
10/01/2016|test_num15
11/15/2016|recom_1_test1
03/04/2017|test_0_8_i0|vacc_previous0
I want the data to look like this (data does not have bullets, included bullets to increase readability)
test_num15
recom_1_test1
test_0_8_i0|vacc_previous0
Use index to find the position of the first '|' in the string, then use substr to take everything after it; or use a regular expression.
data have;
input x $50.;
x1=substr(x,index(x,'|')+1);
x2=prxchange('s/([^_]+\|)(?=\w+)//',1,x);
cards;
10/01/2016|test_num15
11/15/2016|recom_1_test1
03/04/2017|test_0_8_i0|vacc_previous0
;
run;
This is a great use case for call scan. If the date always has the same length (10 characters), you don't actually need it: start would always be 12, so you could skip straight to the substr, as user667489 noted in the comments. If the length varies, call scan is helpful.
data have;
length textstr $100;
input textstr $;
datalines;
10/01/2016|test_num15
11/15/2016|recom_1_test1
03/04/2017|test_0_8_i0|vacc_previous0
;;;;
run;
data want;
set have;
call scan(textstr,2,start,length,'|');
new_textstr = substr(textstr,start);
run;
It would also let you grab only the second word, if that's useful (by passing length as the third argument to substr).
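A minimal sketch of that variant, using the sample rows from the question. The variable names pos, len and second are illustrative and not from the original answer.

```sas
/* Sketch: extract only the second pipe-delimited word. */
data want;
  length textstr $100 second $100;
  input textstr $;
  /* pos and len receive the start position and length of word 2 */
  call scan(textstr, 2, pos, len, '|');
  /* call scan sets pos to 0 when there is no second word */
  if pos > 0 then second = substr(textstr, pos, len);
  drop pos len;
datalines;
10/01/2016|test_num15
11/15/2016|recom_1_test1
03/04/2017|test_0_8_i0|vacc_previous0
;
run;
```

For the third row this keeps test_0_8_i0 and leaves out the trailing |vacc_previous0 part.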

usage of single trailing (@) in SAS for delimited data

Can you please tell me if we can use the single trailing (@) for delimited data
rather than fixed width?
Thanks,
Nikhila
From the comments it looks like the question is really how to skip columns in delimited data. A simple way is to read the value into a variable that you later drop. Or even read it into a variable that you want and then overwrite it with the value from the column you do want to keep.
data want ;
infile cards dsd truncover ;
length var1 var2 $20;
input var1 var1 var1 var2 ; *var1 is overwritten twice and ends up holding the 3rd column;
cards;
nikhila,26,hyd,btech
akhila,24,blr,btech
nitesh,20,blr,bmm
;
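The other option mentioned above, reading the unwanted columns into a variable that is later dropped, might look like the sketch below. The names name, skip and degree are illustrative only.

```sas
/* Sketch: skip the 2nd and 3rd columns by reading them into a throwaway variable. */
data want;
  infile datalines dsd truncover;
  length name $20 skip $1 degree $20;
  input name skip skip degree;   /* age and city land in skip and are discarded */
  drop skip;
datalines;
nikhila,26,hyd,btech
akhila,24,blr,btech
nitesh,20,blr,bmm
;
run;
```

Giving skip a length of 1 keeps the throwaway variable cheap; list input still advances past the whole field up to the next delimiter.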

Reading in Inconsistent Data

I am having trouble reading in inconsistent comma separated data. Here is a sample of what the data looks like:
JefferyThomas,"200","2,500","12,344",100,"999","865,100",800
GeorgeMontgomery,"50","700",200,"2,500","2,500","8,000","950"
I have never dealt with data that has both numbers within quotes and numbers not in quotes. If it were just one or the other, it would not be difficult to read in. But because some numbers are in quotes and others are not, I am having trouble reading in all of the data. This is what I have tried so far:
Data test;
INFILE ......"data.csv" dlm="," dsd missover;
length Name $16;
input Name $ Score1 Score2 Score3 Score4 Score5 Score6 Score7;
All this returns is missing values except for the numbers that aren't within quotes.
You also need to tell SAS to read numbers with embedded commas by using the COMMA informat.
Data test;
INFILE cards dlm="," dsd missover;
length Name $16;
informat score1-score7 comma16.;
input (_all_)(:);
cards;
JefferyThomas,"200","2,500","12,344",100,"999","865,100",800
GeorgeMontgomery,"50","700",200,"2,500","2,500","8,000","950"
;;;;
run;
proc print;
run;

Character value with embedded blanks with list input

I would like to read following instream datalines
datalines;
Smith,12,22,46,Green Hornets,AAA
FriedmanLi,23,19,25,High Volts,AAA
Jones,09,17,54,Las Vegas,AA
;
I used the code below, but it reads the AAA items into the team variable rather than into div. Also, how should I place the & (ampersand) to read character values with embedded blanks?
data scores2;
infile datalines dlm=",";
input name : $10. score1-score3 team $20. div $;
datalines;
Smith,12,22,46,Green Hornets,AAA
FriedmanLi,23,19,25,High Volts,AAA
Jones,09,17,54,Las Vegas,AA
;
run;
Notice that I have used : before team as well (you already used the colon operator elsewhere, so I'm not sure why you left it out here). As I mentioned in your other question, use the : colon operator (see: tilde, dlm and colon format modifier in list input), which tells SAS to use the informat supplied but to stop reading the value for this variable when a delimiter is encountered. Because you had not used this operator, SAS was trying to read 20 characters even though there was a delimiter in between.
Tested
data scores2;
infile datalines dlm=",";
input name : $10.
score1-score3
team : $20.
div : $3.;
datalines;
Smith,12,22,46,Green Hornets,AAA
FriedmanLi,23,19,25,High Volts,AAA
Jones,09,17,54,Las Vegas,AA
;
run;
Another way to do this that's often a bit easier to read is to use the informat statement.
data scores2;
infile datalines dlm=",";
informat name $10.
team $20.
div $4.;
input name $ score1-score3 team $ div $;
datalines;
Smith,12,22,46,Green Hornets,AAA
FriedmanLi,23,19,25,High Volts,AAA
Jones,09,17,54,Las Vegas,AA
;
run;
That accomplishes the same thing as using the colon (input name :$10.) but organizes it a bit more cleanly.
And just to be clear, embedded blanks are irrelevant in comma delimited input; '20'x (i.e., space) is just another character when it's not the delimiter. What the ampersand does is addressed in this article; more specifically, if space is the delimiter, it lets you require two consecutive delimiters to end a field. Example:
data scores2;
infile datalines dlm=" ";
informat name $10.
team $20.
div $4.;
input name $ score1-score3 team & $ div $;
datalines;
Smith 12 22 46 Green Hornets  AAA
FriedmanLi 23 19 25 High Volts  AAA
Jones 09 17 54 Las Vegas  AA
;
run;
Note the double space after each of the team names - that's required by the &. But this is only because the delimiter is a space (which is the default, so the double space would also be needed if you removed the dlm=" ").