I'm trying to read in some raw data using datalines...
data Exp_data;
INPUT a: 2. b: 2. DATE1: MMDDYY10. DATE2: MMDDYY10.;
FORMAT DATE1 DATE9. DATE2 DATE9.;
datalines;
27 93 03/16/2008 03/17/2008
27 93 03/17/2009 03/19/2009
68 68
55 55
46 68
34 34
45 67
56 75
34 34
34 34
;RUN;
But this code is reading data until 6 th row. I couldn't figure out where I'm doing mistake.
Thanks in advance!
Add this line before your input statement.
infile datalines missover;
As of the third row you don't have 4 values, so SAS needs to know what to do with the missing values. Missover tells sas to set the remaining values to missing.
Related
I would like to retain the data of the first weight given by each SUBJID. How do I do that?
sample data
DATA Have;
Input SUBJID WEIGHT;
01 88
01 86
01 86
02 .
02 101
02 100
;
run;
expected data:
SUBJID WEIGHT
01 88
02 101
Use FIRST. in SAS.
data want;
set have;
where not missing weight;
by subjid;
if first.subjid;
run;
When I run the following code the third observation is not output. Why does SAS omit the third observation?
data info;
input Gender $ Age Height Weight;
datalines;
M 45 72 149
F 64 62
M 61 72 271
F 29 73 125
M 16 65 178
;
Run;
title "Listing of Dataset Demographics";
proc print data=info;
run;
Defaults will get you, the default in SAS is FLOWOVER, so if a record is missing it looks for it on the next line. You want MISSOVER or TRUNCOVER instead.
Your log tells you this happened with the following note:
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
This works:
data info;
infile cards truncover;
input Gender $ Age Height Weight;
datalines;
M 45 72 149
F 64 62
M 61 72 271
F 29 73 125
M 16 65 178
;
Run;
More details are available in the Example 2 in the documentation here.
Specifically:
When you omit the MISSOVER option or use FLOWOVER (which is the default), SAS moves the
input pointer to line 2 and reads values for TEMP4 and TEMP5 (variables it cannot find). The next
time the DATA step executes, SAS reads a new line which, in this case,
is line 3. This message appears in the SAS log:
NOTE: SAS went to a new line when INPUT statement
reached past the end of a line.
Lines of text do not have "observations". They just have lines.
It didn't skip any of the lines of data. It just used two lines for the second observation because the first of the lines only had values for 3 of the 4 variables the INPUT statement requested.
This behavior is what SAS calls the flowover option of the INFILE statement. This allows you to have more than one line of text to represent the data for a single observation without having to be too persnickety about which fields you insert the line breaks between across the different observations of data.
If you don't want it to have to go hunt for the next field on the next line of text then make sure every variable has a value in the text lines. You can represent missing values by using a period for either numeric or character variables.
So use something like this:
data info;
input Gender $ Age Height Weight;
datalines;
M 45 72 149
F 64 62 .
M 61 72 271
. 29 73 125
M 16 65 178
;
When using flowover you can insert as many extra line breaks as you want as long as each new observation starts on a new line. Like this
data info;
input Gender $ Age Height Weight;
datalines;
M 45 72
149
F 64
62 .
M
61 72 271
F 29 73 125
M 16 65 178
;
If you want SAS to just give up when a there are no more values on the line use the flowover option on the infile statement.
data info;
infile datalines flowover;
input Gender $ Age Height Weight;
datalines;
M 45 72 149
F 64 62
M 61 72 271
F 29 73 125
M 16 65 178
;
There is also the older missover option, but you would normally never want that as it will set values at the end of the line that too short for an explicit INFORMAT width to missing instead of just use the number of characters that are available.
PS Don't indent lines of data. That will just make the code harder to read and the diagnostic messages about invalid data values harder to interpret. To make it easier don't intend the DATALINES (aka CARDS) statement line either. That will also make it clearer the data step definition ends where the lines of data starts and prevent you from accidentally inserting other statements for the data step after the data.
I´m trying to combine and sum certain observations of a dataset with different values for their common variables, in this case, I am trying to combine the deaths of three age intervals (85-90), (91-95), (95+) in one only (85+) age interval. Our teacher told us it is better if we do not create a new variable and use proc means, tabulate etc.
I have read every google page and all I can find is a proc means combining and summing by variable, but I don´t need the whole group summed, just some observations of the group.
Having the dataset like:
.
.
.
71 to 75 3
76 to 80 4
81 to 85 2
86 to 90 3
91 to 95 1
95+ 3
I would like to have it like
.
.
.
71 to 75 3
76 to 80 4
81 to 85 2
85+ 7
Thanks!
Create a custom format to map the existing literal categorizations into a new ones.
* A format to map literal agecat strings to broader categories;
proc format ;
value $age_cat_want (default=20)
'86 to 90' = '86+'
'91 to 95' = '86+'
'95+' = '86+'
;
This only works for concatenating categories, creating a coarser aggregation.
Example:
* A format to get you into the pickle you are in;
proc format;
value age_cat_have
71-75 = '71 to 75'
76-80 = '76 to 80'
81-84 = '81 to 85'
86-90 = '86 to 90'
91-95 = '91 to 95'
95-high = '95+'
;
data have;
input age ##;
agecat = put (age, age_cat_have.);
datalines;
71 72 73
76 77 78 79
82 83
87 86 86
94
99 101 113
;
proc freq data=have;
title "Original categories are character literals";
table agecat;
run;
* A format to map literal agecat strings to broader categories;
proc format ;
value $age_cat_want (default=20)
'86 to 90' = '86+'
'91 to 95' = '86+'
'95+' = '86+'
;
proc freq data=have;
title "New age categories via custom format $age_cat_want";
table agecat;
format agecat $age_cat_want.;
run;
Note: An existing literal categorization cannot be explicitly split. You would have to make presumptions about the age value distribution within each category and impute a specific age that could be applied to a different age mapping format.
Problem Statement: I have a text file and I want to read it using SAS INFILE function. But SAS is not giving me the proper output.
Text File:
Akash 18 19 20
Tejas 20 16
Shashank 16 20
Meera 18 20
The Code that I have tried:
DATA Arr;
INFILE "/folders/myfolders/personal/SAS_Array .txt" missover;
INPUT Name$ SAS DS R;
RUN;
PROC PRINT DATA=arr;
RUN;
While the result i got is :
Table of Contents
Obs Name SAS DS R
1 Akash 18 19 20
2 Tejas 20 16 .
3 Shashank16 20 .
4 Meera 18 20 .
Which is improper. So what is wrong with the code? I need to read the file in SAS with the same sequence of marks as in text file. Please help.
Expected result:
Table of Contents
Obs Name SAS DS R
1 Akash 18 19 20
2 Tejas . 20 16
3 Shashank16 20 .
4 Meera 18 . 20
Thanks in advance.
If that text file is tab-delimited, you should specify the delimiter in the infile statement and use the dsd option to account for missing values:
DATA Arr;
INFILE "/folders/myfolders/personal/SAS_Array .txt" missover dlm='09'x dsd;
INPUT Name $ SAS DS R;
RUN;
PROC PRINT DATA=arr;
RUN;
EDIT: after editing, your sample text file now looks fixed-width rather than space-delimited. In that case you should be using column input:
DATA Arr;
INFILE "/folders/myfolders/personal/SAS_Array .txt" missover;
INPUT Name $1-9 SAS 10-12 DS 13-15 R 16-18;
RUN;
example with datalines:
DATA Arr;
INFILE datalines missover;
INPUT Name $1-9 SAS 10-12 DS 13-15 R 16-18;
datalines;
Akash 18 19 20
Tejas 20 16
Shashank 16 20
Meera 18 20
RUN;
I have the following data set:
Date jobboardid Sales
Jan05 3 256
Jan05 6 70
Jan05 54 90
Feb05 32 456
Feb05 11 89
Feb05 16 876
March05
April05
.
.
.
Jan06 6 678
Jan06 54 87
Jan06 13 56
Feb06 McDonald 67
Feb06 11 281
Feb06 16 876
March06
April06
.
.
.
Jan07 6 567
Jan07 54 76
Jan07 34 87
Feb07 10 678
Feb07 11 765
Feb07 16 67
March07
April06
I am trying to calculate a 12 month growth rate for Sales column when jobboardid column has the same value 12 months apart. I have the following code:
data Want;
set Have;
by Date jobboardid;
format From Till monyy7.;
from = lag12(Date);
oldsales = lag12(sales);
if lag12 (jobboardid) EQ jobboardid
and INTCK('month', from, Date) EQ 12 then do;
till = Date;
rate = (sales - oldsales) / oldsales;
output;
end;
run;
However I keep getting the following error message:
Note: Missing values were created as a result of performing operation on missing values.
But when I checked my dataset, there aren't any missing values. What's the problem?
Note: My date column is in monyy7. format. jobboardid is numeric value and so does the Sales.
The NOTE is being thrown by the INTCK() function. When you say from=lag12(date) the first 12 records will have a missing value for from. And then INTCK('month', from, Date) will throw the NOTE. Even though INTCK is not used in an assignment statement, it still throws the NOTE because one of its arguments has a missing value. Below is an example. The log reports that missing values were created 12 times, because I used lag12.
77 data have;
78 do Date=1 to 20;
79 output;
80 end;
81 run;
NOTE: The data set WORK.HAVE has 20 observations and 1 variables.
82 data want;
83 set have;
84 from=lag12(Date);
85 if intck('month',from,today())=. then put 'Missing: ' (_n_ Date)(=);
86 else put 'Not Missing: ' (_n_ Date)(=);
87 run;
Missing: _N_=1 Date=1
Missing: _N_=2 Date=2
Missing: _N_=3 Date=3
Missing: _N_=4 Date=4
Missing: _N_=5 Date=5
Missing: _N_=6 Date=6
Missing: _N_=7 Date=7
Missing: _N_=8 Date=8
Missing: _N_=9 Date=9
Missing: _N_=10 Date=10
Missing: _N_=11 Date=11
Missing: _N_=12 Date=12
Not Missing: _N_=13 Date=13
Not Missing: _N_=14 Date=14
Not Missing: _N_=15 Date=15
Not Missing: _N_=16 Date=16
Not Missing: _N_=17 Date=17
Not Missing: _N_=18 Date=18
Not Missing: _N_=19 Date=19
Not Missing: _N_=20 Date=20
NOTE: Missing values were generated as a result of performing an operation on missing values.
Each place is given by: (Number of times) at (Line):(Column).
12 at 85:6
NOTE: There were 20 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 20 observations and 2 variables.
One way to avoid the problem would be to add another do block something like (untested):
if lag12 (jobboardid) EQ jobboardid and _n_> 12 then do;
if INTCK('month', from, Date) EQ 12 then do;
till = Date;
rate = (sales - oldsales) / oldsales;
output;
end;
end;