Reading in Inconsistent Data - sas

I am having trouble reading in inconsistent comma separated data. Here is a sample of what the data looks like:
JefferyThomas,"200","2,500","12,344",100,"999","865,100",800
GeorgeMontgomery,"50","700",200,"2,500","2,500","8,000","950"
I have never dealt with data that mixes numbers in quotes with numbers not in quotes. If it were just one or the other, it would not be difficult to read in. But because some numbers are quoted and others are not, I am having trouble reading in all of the data. This is what I have tried so far:
Data test;
INFILE ......"data.csv" dlm="," dsd missover;
length Name $16;
input Name $ Score1 Score2 Score3 Score4 Score5 Score6 Score7;
All this returns is missing values except for the numbers that aren't within quotes.

You also need to tell SAS to read numbers that contain commas by using the COMMA informat.
Data test;
INFILE cards dlm="," dsd missover;
length Name $16;
informat score1-score7 comma16.;
input (_all_)(:);
cards;
JefferyThomas,"200","2,500","12,344",100,"999","865,100",800
GeorgeMontgomery,"50","700",200,"2,500","2,500","8,000","950"
;;;;
run;
proc print;
run;
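As an alternative sketch, the informat can be attached in the INPUT statement itself with the colon modifier instead of a separate INFORMAT statement. This assumes the same eight columns as the sample data and is meant as an illustration, not the only way to write it.

```sas
data test;
  infile cards dlm="," dsd missover;
  length Name $16;
  /* the colon keeps list-input behavior: each value still ends at the delimiter */
  input Name (Score1-Score7) (:comma16.);
cards;
JefferyThomas,"200","2,500","12,344",100,"999","865,100",800
GeorgeMontgomery,"50","700",200,"2,500","2,500","8,000","950"
;
run;
```

The DSD option strips the quotes, and the COMMA informat then removes the embedded commas, so quoted and unquoted values are read the same way.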

Related

Why does my regex only change my first entry in SAS?

I have a number of text entries (municipalities) from which I need to remove the s at the end.
Data test;
input city $;
datalines;
arjepogs
askers
Londons
;
run;
data cities;
set test;
if prxmatch("/^(.*?)s$/",city)
then city=prxchange("s/^(.*?)s$/$1/",-1,city);
run;
Strangely enough, my s's are only removed from my first entry.
What am I doing wrong?
You defined CITY as length $8, so shorter values are padded with trailing spaces. The s in Londons is in the 7th position of the string, not the LAST position. Use the TRIM() function to remove the trailing spaces from the value of the variable.
data have;
input city $20.;
datalines;
arjepogs
Kent
askers
Londons
;
data want;
set have;
length new_city $20 ;
new_city=prxchange("s/^(.*?)s$/$1/",-1,trim(city));
run;
Result
Obs city new_city
1 arjepogs arjepog
2 Kent Kent
3 askers asker
4 Londons London
You could also just change the REGEX to account for the trailing spaces.
new_city=prxchange("s/^(.*?)s\ *$/$1/",-1,city);
Here is another solution using only SAS string functions and no regex. Note that in this case there is no need to trim the variable:
data cities;
set test;
if substr(city,length(city)) eq "s" then
city=substr(city,1,length(city)-1);
run;
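The trailing-blank explanation above can be checked directly. The following is a minimal sketch, assuming a variable of length $8 as in the question; the variable names other than city are made up for illustration.

```sas
data _null_;
  length city $8;
  city = "Londons";
  stored_len  = lengthc(city);                /* 8: storage length, padding included */
  trimmed_len = length(city);                 /* 7: position of the last non-blank */
  hit_raw     = prxmatch("/s$/", city);       /* 0: a blank, not s, sits before the end */
  hit_trim    = prxmatch("/s$/", trim(city)); /* 7: s is now the last character */
  put stored_len= trimmed_len= hit_raw= hit_trim=;
run;
```

Only arjepogs fills all 8 positions, which is why it was the only value the original regex changed.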

Read in SAS with two lines end and start at different positions

I have two lines of observations to read in SAS.
It is a comma-delimited data set.
My code is as below:
DATA SASweek1.industry;
INFILE "&Dirdata.Assignment1_Q6_data.txt" DLM="," DSD termstr=crlf TRUNCOVER;
LENGTH Company $ 15;
INPUT Company $ State $ Expense COMMA9. ;
FORMAT Expense DOLLAR9.;
*INFORMAT Expense DOLLAR10.;
RUN; * not ready;
The raw data set looks like this:
I can print out the first line of observations well,
but the last "0" will go to the first position of the second
line, becoming "0Lee's..".
Any suggestions would be highly appreciated!!
It is just doing what you told it to do. You told it to read exactly 9 characters.
Normally you should not use formatted input mode with delimited data. You prevent that by either adding the : (colon) prefix in front of the informat specification in the INPUT statement or removing the informat specification completely and using an INFORMAT statement to let SAS know what informat to use.
But your data is NOT properly delimited because the last field contains the delimiter, but the value is not enclosed in quotes. So the commas make it look like two values instead of one. The real solution is to fix the process that created the file to create a valid delimited file. It needs to quote the values with commas in them, or remove the commas from the numbers, or use a delimiter character that does not appear in the data.
Fortunately since it is the last field on the line you CAN use formatted input to read just that field. Since you are using the TRUNCOVER option just set the width of the informat in the INPUT statement to the maximum.
DATA SASweek1.industry;
INFILE "&Dirdata.Assignment1_Q6_data.txt" DLM="," DSD termstr=crlf TRUNCOVER;
LENGTH Company $15 State $15 Expense 8;
INPUT Company State Expense COMMA32. ;
FORMAT Expense DOLLAR9.;
RUN;
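For comparison, here is a sketch of what the colon modifier looks like when the file is properly delimited, that is, when values containing commas are quoted. The company names and amounts are invented for illustration and are not taken from the assignment file.

```sas
data industry;
  infile cards dlm="," dsd truncover;
  length Company $15 State $15;
  /* the colon makes SAS stop at the delimiter instead of reading a fixed width */
  input Company State Expense :comma12.;
  format Expense dollar12.;
cards;
Acme Corp,TX,"1,250,000"
Beta LLC,CA,980
;
run;
```

With quoted values the last field no longer has to be read in formatted mode, so the same pattern also works when the field is not the last one on the line.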

Usage of single trailing (@) in SAS for delimited data

Can you please tell me if we can use the single trailing (@) for delimited data
rather than fixed-width data?
Thanks,
Nikhila
From the comments it looks like the question is really how to skip columns in delimited data. A simple way is to read the value into a variable that you later drop. Or even read it into a variable that you want and then overwrite it with the value from the column you do want to keep.
data want ;
infile cards dsd truncover ;
length var1 var2 $20;
input 3*var1 var2 ;
cards;
nikhila,26,hyd,btech
akhila,24,blr,btech
nitesh,20,blr,bmm
;
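The first option mentioned above, reading the unwanted columns into a variable that is later dropped, might look like the following sketch. The variable names name, degree and _skip are assumptions chosen for illustration.

```sas
data want;
  infile cards dsd truncover;
  length name degree $20 _skip $1;
  /* columns 2 and 3 are read into _skip and thrown away */
  input name _skip _skip degree;
  drop _skip;
cards;
nikhila,26,hyd,btech
akhila,24,blr,btech
nitesh,20,blr,bmm
;
run;
```

List input still has to pass over every field in order, so the skipped columns are read and discarded rather than jumped over.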

Character value with embedded blanks with list input

I would like to read following instream datalines
datalines;
Smith,12,22,46,Green Hornets,AAA
FriedmanLi,23,19,25,High Volts,AAA
Jones,09,17,54,Las Vegas,AA
;
I used the code below, but it reads the AAA values into the team variable rather than into div. Also, where should I place the & (ampersand) to read character values with embedded blanks?
data scores2;
infile datalines dlm=",";
input name : $10. score1-score3 team $20. div $;
datalines;
Smith,12,22,46,Green Hornets,AAA
FriedmanLi,23,19,25,High Volts,AAA
Jones,09,17,54,Las Vegas,AA
;
run;
Notice that I have used : before team as well. (You already used the colon modifier for name, so I am not sure why you left it out here.) As I mentioned in your other question, the : (colon) format modifier in list input tells SAS to use the informat supplied but to stop reading the value for this variable when a delimiter is encountered. Because you had not used this modifier, SAS was trying to read 20 characters, even though there was a delimiter in between.
Tested
data scores2;
infile datalines dlm=",";
input name : $10.
score1-score3
team : $20.
div : $3.;
datalines;
Smith,12,22,46,Green Hornets,AAA
FriedmanLi,23,19,25,High Volts,AAA
Jones,09,17,54,Las Vegas,AA
;
run;
Another way to do this that's often a bit easier to read is to use the informat statement.
data scores2;
infile datalines dlm=",";
informat name $10.
team $20.
div $4.;
input name $ score1-score3 team $ div $;
datalines;
Smith,12,22,46,Green Hornets,AAA
FriedmanLi,23,19,25,High Volts,AAA
Jones,09,17,54,Las Vegas,AA
;
run;
That accomplishes the same thing as using the colon (input name :$10.) but organizes it a bit more cleanly.
And just to be clear, embedded blanks are irrelevant in comma-delimited input; '20'x (i.e., space) is just another character when it's not the delimiter. What the ampersand does is addressed in this article; more specifically, if space is the delimiter, it lets you require two consecutive delimiters to end a field. Example:
data scores2;
infile datalines dlm=" ";
informat name $10.
team $20.
div $4.;
input name $ score1-score3 team & $ div $;
datalines;
Smith 12 22 46 Green Hornets  AAA
FriedmanLi 23 19 25 High Volts  AAA
Jones 09 17 54 Las Vegas  AA
;
run;
Note the double space after all of the team names - that's required by the &. But this is only because the delimiter is a space (which is the default, so the double space would also be needed if you removed the dlm=' ').

How to skip certain lines with random character sequences in file

I have a file in following format:
Name Salary Age
bob 10000 18
sally 5555 20
#not found 4fjfjhdfjfnvndf
#not found 4fjfjhdfjfnvndf
9/2-10/2
but then at random points in the file there are 4-6 lines of random characters. The file has 2 million rows. I was wondering whether the infile statement automatically skips these random spurts of lines, or whether I have to go into the file and delete these lines manually.
You probably have to deal with them in some fashion. If you have truncover or missover on the infile statement, reading them won't do any harm (you must have one of the two, though, or the following lines might get shifted over). But you'll have garbage rows in your data set that you need to deal with.
The quick and dirty method would be something like this:
data have;
infile "blah.txt" dlm=' ' dsd lrecl=32767 truncover;
input name $ salary age;
if missing(salary) and missing(age) then delete;
run;
If the garbage was likely to generate missing values for the numerics, that would work. However, your log probably has some warnings in it that aren't great, and this isn't perfect in what it finds, either, if the garbage might be numeric values. (If it's entirely numeric values, you could test if name is a number.)
The better method is to preprocess _infile_ - which is a bit more 'advanced' but certainly a good approach.
data have;
infile "blah.txt" dlm=' ' dsd lrecl=32767 truncover;
input @; *hold the current line so _infile_ can be inspected;
if countw(_infile_) ne 3 then delete; *if there are not exactly 3 "words" then delete it;
if notdigit(scan(_infile_,2)) or notdigit(scan(_infile_,3)) then delete; *if the 2nd or 3rd word contain non-digit values then delete;
input name $ salary age;
run;
Both approaches require some consistency with data to work, and probably require some tweaking - for example if salary and age are acceptable to be missing, both of these would delete rows you don't want deleted.
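If the log notes about invalid data are a concern, one possible variation is to parse _infile_ with SCAN and the INPUT function, using the ?? modifier to suppress the invalid-data messages. This is only a sketch: it assumes the same blah.txt layout of three space-separated words per valid row, and that salary and age are never legitimately missing.

```sas
data have;
  infile "blah.txt" lrecl=32767 truncover;
  input;                                         /* load the record into _infile_ only */
  if countw(_infile_, ' ') ne 3 then delete;     /* not exactly three words: skip */
  length name $32;
  name   = scan(_infile_, 1, ' ');
  salary = input(scan(_infile_, 2, ' '), ?? 32.); /* ?? keeps invalid values out of the log */
  age    = input(scan(_infile_, 3, ' '), ?? 32.);
  if missing(salary) or missing(age) then delete; /* non-numeric garbage and the header row */
run;
```

Because the header row Name Salary Age fails the numeric conversion, it is dropped by the same test as the garbage lines.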