What do the numbers in the grey box represent? And what's a simple way of understanding how the colon modifier affects the way sas reads in values?
The answer depends on information not provided. The answer B is the best choice in the sense that you should use the colon modifier when using informats in the INPUT statement to prevent the use of the formatted input mode instead of list input mode. Otherwise the formatted input could read too many or too few characters and also might leave the cursor in the wrong place for reading the next field.
But if you try to read that data from in-line cards it works fine for those two lines. That is because in-line data lines are padded to next multiple of 80 bytes.
If you put those lines into a file without any trailing spaces on the lines then the second line fails because there are not 10 characters to read for the last field. But if you add the TRUNCOVER option (or PAD) to the INFILE statement then it will work.
Try it yourself. TEST1 and TEST3 work. TEST2 gets a LOST CARD note.
data test1;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
cards;
Donny 5MAR2008 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
options parmcards=test;
filename test temp ;
parmcards;
Donny 5MAR2008 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
data test2;
infile test;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
run;
data test3;
infile test truncover;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
run;
With different data the first formatted input can cause trouble also. For example if the date values used only 2 digits for the year it would throw things off. So it tries to read FL as the age and then reads the first 8 characters of the salary as the STATE and just blanks as the SALARY.
data test1;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
cards;
Donny 5MAR08 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
Results:
Obs name hired age state salary
1 Donny 05MAR2008 . $43,123. .
2 Margaret 20FEB2008 43 NC 65150
Related
In the code below, I'm wondering why is the last observation(=carlo) lost when using the column pointer control?
data work.toExercise ;
infile "/home/u61425323/BASE_DATA/exercise.txt" ; /* my direction */
input Name $7. +3 Nation $7. +2 Code $5. ;
title "Why is the last observation(=carlo) lost?" ;
run;
proc print ; run ;
Below are the exercise.txt.
natasha korea a1111
kelly america b2222
carlo mexico c333
Below are the output results.
enter image description here
Please forgive my poor English.
To stop SAS from going to a new line for input when the line is too short to satisfy the INPUT statement use the TRUNCOVER option on the INFILE statement.
Let's create a text file with your variable length records.
filename text temp;
options parmcards=text;
parmcards;
natasha korea a1111
kelly america b2222
carlo mexico c333
;
If you read it with your data step we get this message:
NOTE: LOST CARD.
Name=carlo Nation=mexico Code= _ERROR_=1 _N_=3
NOTE: 3 records were read from the infile TEXT.
The minimum record length was 23.
The maximum record length was 24.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.ORGINAL has 2 observations and 3 variables.
But when we add the TRUNCOVER option it reads all three observations.
data want ;
infile text truncover ;
input Name $7. +3 Nation $7. +2 Code $5. ;
run;
Result
Do not use the ancient MISSOVER option. That option will discard text at the end of lines that are not long enough for the format that is reading them. It can work if you only use LIST MODE input style where SAS adjusts the width of the informat to match the length of the next word on the line, but then you are just getting the TRUNCOVER behavior anyway so why not be specific.
data wrong ;
infile text missover ;
input Name $7. +3 Nation $7. +2 Code $5. ;
run;
Use the TRUNCOVER option with the INFILE statement.
From the INPUT documentation
TRUNCOVER
overrides the default behavior of the INPUT statement when an input data record is shorter than the INPUT statement expects. By default, the INPUT statement automatically reads the next input data record. TRUNCOVER enables you to read variable-length records when some records are shorter than the INPUT statement expects. Variables without any values assigned are set to missing.
I think that happens because you have the last record shorter than the code expects.
You can try one of the infile options to control the processing in this case, for example:
infile "/home/u61425323/BASE_DATA/exercise.txt" MISSOVER;
I also do not know your task requirements but probably this version of the code would work more stable:
data work.toExercise ;
length Name $7 Nation $7 Code $5;
infile "/home/u61425323/BASE_DATA/exercise.txt" dlm=' ';
input Name Nation Code;
title "Why is the last observation(=carlo) lost?" ;
run;
I'm trying to create variables Cap1 through Cap6. I'm not sure how to have read them as character data. My code is:
DATA Capture;
INFILE '/folders/myfolders/sasuser.v94/Capture.txt' DLM='09'x DSD MISSOVER FIRSTOBS=2;
INPUT Sex $ AgeGroup $ Weight Cap1 - Cap6 $;
RUN;
And my issue is Cap1 through Cap5 are interpreted as numerical data. How do I solve this?
Your issue is simple: you are using a variable list, but you aren't applying the $ to the whole variable list! You need ( ) around the list and the modifier to apply it to the whole list.
See:
DATA Capture;
INFILE datalines DLM=' ' DSD;
INPUT Sex $ AgeGroup $ Weight (Cap1 - Cap6) ($);
datalines;
M 18-34 135 A B C D E F
F 35-54 115 G H I J K L
;;;;
RUN;
Indeed,
I would also expect this input statement to work as you did, but it does not. Putting a $ after Cap1 does not resolve it either, as this log shows.
26 INPUT Sex $ AgeGroup $ Weight Cap1 $ - Cap6 $;
_
22
ERROR 22-322: Expecting a name.
You can solve it
by assigning a format to your variables before reading them, for instance format Cap1 - Cap6 $2.;
To test it,
I included the data in the source file, i.e. using datalines
DATA Capture;
INFILE datalines DLM='09'x DSD missover FIRSTOBS=1;
format Sex $1. AgeGroup $9. Weight 8.2 Cap1 - Cap6 $2.;
INPUT Sex AgeGroup Weight Cap1 - Cap6;
datalines;
M 1-5 24.5 11 12 13 14 15 16
M 6-10 34.2 21 22 23 24 25 26
;
proc print;
proc contents;
RUN;
How to understand this:
SAS was originally created as a programming language for non-developers (i.c. statisticians) who rather don't care about data formats, so SAS does a lot of guess work for you (just like VBA if you don't use option explicit).
So, the first time you mention a variable name in a data step, SAS ads a variable to the Program Data Vector (PDV) with an apropriate type (numeric or charater) and length, but this is guess work.
For instance: as the first student in the test dataset CLASS included in the standard instalation of SAS is male,
data WORK.CLASS;
set sasHelp.CLASS;
select (sex);
when ('M') gender = 'male';
when ('F') gender = 'female';
otherwise gender = 'unknown';
end;
run;
results in truncating 'female' to four positions:
You can correct that by instructing sas to add the variable to the PDV beforehand.
For a character variable,
format myName $20.; and
length myName $20.; are equivalent and
informat myName $20.; is also about the same.
(The storry becomes more complex with user defined formats, though.)
For numerics, there is a huge difference:
length mySize 8.; preserves 8 bytes in the PDV for mySize
format mySize 8.; tells SAS to print or display mySize with up to 8 digits and no decimals
informat mySize $20.; tells SAS a expect 8 digits without decimals when reading mySize.
Numericals can only have certain lengths, depending on the operatin system. On windowns
8. is the default and corresponds to a double on most databases
4. corresponds to a float
3. is the minimum, which I use for booleans
Formats can be very different
format mySize 8.3; tells SAS tot print mySize with 8 characters, including 3 decimals for the fraction (which leaves room for up to 4 decimals before the decimal dot if it has a positive value. Less decimals will be printed to display larger numbers)
format mySize 8.3; tells SAS tot read mySize assuming the last 3 decimals are the fraction, so 12345678 will be interpreted as 12345.678
Then there are special formats to read and write dates, times and so on and user defined value and picture formats, but that lead me too far.
This is the story
This is the input file
mukesh,04/04/15,04/06/15,125.00,333.23
vishant,04/05/15,04/07/15,200.00,200
achal,04/06/15,04/08/15,275.00,55.43
this is the import statement that I am using
data datetimedata;
infile fileref dlm=',';
input lastname$ datechkin mmddyy10. datechkout mmddyy10. room_rate equip_cost;
run;
the below is the log which shows success
NOTE: The infile FILEREF is:
Filename=\\VBOXSVR\win_7\SAS\DATA\datetime\datetimedata.csv,
RECFM=V,LRECL=256,File Size (bytes)=688,
Last Modified=13Jun2015:12:08:36,
Create Time=13Jun2015:09:13:09
NOTE: 17 records were read from the infile FILEREF.
The minimum record length was 34.
The maximum record length was 40.
NOTE: The data set WORK.DATETIMEDATA has 17 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
I have published only 3 observation here.
Now when I print the sas dataset everything works fine except the room_rate variable.
THe output should be 3 digit numbers , but i am getting only the last digit .
Where Am i going wrong !!!
You're mixing input types. When you use list input, you can't specify informats. You either need to specify them using modified list input (add a colon to the informat) or use an informat statement earlier. The following works.
data datetimedata;
infile datalines dlm=',';
input lastname$ datechkin :mmddyy10. datechkout :mmddyy10. room_rate equip_cost;
datalines;
mukesh,04/04/15,04/06/15,125.00,333.23
vishant,04/05/15,04/07/15,200.00,200
achal,04/06/15,04/08/15,275.00,55.43
;;;;
run;
proc print data=datetimedata;
run;
I am currently using SAS version 9 to try and read in a flat file in .txt format of a HTML table that I have taken from the following page (entitled Wayne Rooney's Match History):
http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney
I've got the data into a .txt file using a Python webscraper using Scrapy. The format of my .txt file is like thus:
17-08-2013,1 : 4,Swansea,Manchester United,28',7.26,Assist Assist,26-08-2013,0 : 0,Manchester United,Chelsea,90',7.03,None,14-09-2013,2 : 0,Manchester United,Crystal Palace,90',8.44,Man of the Match Goal,17-09-2013,4 : 2,Manchester United,Bayer Leverkusen,84',9.18,Goal Goal Assist,22-09-2013,4 : 1,Manchester City,Manchester United,90',7.17,Goal Yellow Card,25-09-2013,1 : 0,Manchester United,Liverpool,90',None,Man of the Match Assist,28-09-2013,1 : 2,Manchester United,West Bromwich Albion,90'...
...and so on. What I want is a dataset that has the same format as the original table. I know my way round SAS fairly well, but tend not to use infile statements all that much. I have tried a few variations on a theme, but this syntax has got me the nearest to what I want so far:
filename myfile "C:\Python27\Football Data\test.txt";
data test;
length date $10.
score $6.
home_team $40.
away_team $40.
mins_played $3.
rating $4.
incidents $40.;
infile myfile DSD;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ ;
run;
This returns a dataset with only the first row of the table included. I have tried using fixed widths and pointers to set the dataset dimensions, but because the length of things like team names can change so much, this is causing the data to be reassembled from the flat file incorrectly.
I think I'm most of the way there, but can't quite crack the last bit. If anyone knows the exact syntax I need that would be great.
Thanks
I would read it straight from the web. Something like this; this works about 50% but took a whole ten minutes to write, i'm sure it could be easily improved.
Basic approach is you use #'string' to read in text following a string. You might be better off reading this in as a bytestream and doing a regular expression match on <tr> ... </tr> and then parsing that rather than taking the sort of more brute force method here.
filename rooney url "http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney" lrecl=32767;
data rooney;
infile rooney scanover;
retain are_reading;
input #;
if find(_infile_,'<table id="player-fixture" class="grid fixture">')
then are_reading=1;
if find(_infile_,'</table>') then are_reading=0;
if are_reading then do;
input #'<td class="date">' date ddmmyy10.
#'class="team-link">' home_team $20.
#'class="result-1 rc">' score $10.
#'class="team-link">' away_team $20.
#'title="Minutes played in this match">' mins_played $10.
#'title="Rating in this match">' rating $6.
;
output;
end;
run;
As far as reading the scrapy output, you should change at least two things:
Add the delimiter. Not truly necessary, but I'd consider the code incorrect without it, unless delimiter is space.
Add a trailing "##" to get SAS to hold the line pointer, since you don't have line feeds in your data.
data want;
infile myfile flowover dlm=',' dsd lrecl=32767;
length date $10.
score $6.
home_team $40.
away_team $40.
mins_played $3.
rating $4.
incidents $40.;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ ##;
run;
Flowover is default, but I like to include it to make it clear.
You also probably want to input the date as a date value (not char), so informat date ddmmyy10.;. The rating is also easily input as numeric if you want to, and both mins played and score could be input as numeric if you're doing analysis on those by adding ' and : to the delimiter list.
Finally, your . on length is incorrect; SAS is nice enough to ignore it, but . is only placed like so for formats.
Here's my final code:
data want;
infile "c:\temp\test2.txt" flowover dlm="',:" lrecl=32767;
informat date ddmmyy10.
score_1 score_2 2.
home_team $40.
away_team $40.
mins_played 3.
rating 4.2
incidents $40.;
input date
score_1
score_2
home_team $
away_team $
mins_played
rating ??
incidents $ ##;
run;
I remove the dsd as that's incompatible with the ' delimiter; if DSD is actually needed then you can add it back, remove that delimiter, and read minutes in as char. I add ?? for rating as it sometimes is "None" so ?? ignores the warnings about that.
I have a dataset as below:
country
United States, Seattle
United Kingdom, London
How can I split country into a data in SAS like:
country city
United States Seattle
United Kingdom London
Use function SCAN() with comma as separator.
data test;
set test;
city=scan(country,2,',');
country=scan(country,1,',');
run;
Another option, INFILE magic (google the term for papers on the topic); useful for parsing many variables from one string and/or dealing with quoted fields and such that would be more work with scan.
filename tempfile "c:\temp\test.txt";
data have;
input #1 country $50.;
datalines;
United States, Seattle
United Kingdom, London
;;;;
run;
data want;
set have;
infile tempfile dlm=',' dsd;
input #1 ##;
_infile_=country;
format newcountry city $50.;
input newcountry $ city $ ##;
run;
tempfile can be any file (or one you create on the fly with any character in it to avoid premature EOF).
Response to:
data test;
set test;
city=scan(country,2,',');
country=scan(country,1,',');
run;
What if I want to split the last comma in the string only, keeping 7410 City?
Example: "Junior 18, Plays Piano, 7410 City