How do I read a data file one row at a time in SAS?
Say, I have 3 lines of data
1.0 3.0 5.6 7.8
2.3 4.9
3.2 5.3 6.8 7.5 3.9 4.1
I have to read each line into a different group. I want the data to look like this:
A 1.0
A 3.0
A 5.6
A 7.8
B 2.3
B 4.9
C 3.2
C 5.3
C 6.8
C 7.5
C 3.9
C 4.1
I tried a bunch of things.
If there is a variable name before every data point, the following code works fine:
INPUT group $ x @@;
I can't figure out how to go about this. Can someone please guide me on this?
Thanks
I think this will produce almost exactly the result you want. You could apply a format to the Group variable to display 1, 2, 3 as A, B, C.
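data orig;
  infile datalines missover pad;
  format Group 4. Value 4.1;
  Group = _n_;               /* one group per input line */
  do until (Value eq .);
    input Value @;           /* trailing @ holds the line for the next read */
    if Value ne . then output;
    else return;             /* line exhausted: move on to the next one */
  end;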
datalines;
1.0 3.0 5.6 7.8
2.3 4.9
3.2 5.3 6.8 7.5 3.9 4.1
run;
proc print; run;
/*
Obs Group Value
1 1 1.0
2 1 3.0
3 1 5.6
4 1 7.8
5 2 2.3
6 2 4.9
7 3 3.2
8 3 5.3
9 3 6.8
10 3 7.5
11 3 3.9
12 3 4.1 */
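For readers who want to see the same line-by-line logic outside SAS, here is a minimal Python sketch. It assumes the data is available as a list of strings and that letters A, B, C, ... are acceptable group labels; the function name `label_rows` is made up for this illustration and is not part of the answer above.

```python
from string import ascii_uppercase

# Sample lines copied from the question.
lines = [
    "1.0 3.0 5.6 7.8",
    "2.3 4.9",
    "3.2 5.3 6.8 7.5 3.9 4.1",
]

def label_rows(lines):
    """Return (group, value) pairs, one per number, grouped by input line."""
    pairs = []
    for line_no, line in enumerate(lines):
        group = ascii_uppercase[line_no]  # assumes at most 26 lines
        for token in line.split():
            pairs.append((group, float(token)))
    return pairs

rows = label_rows(lines)
for group, value in rows:
    print(group, value)
```

Each input line plays the role of `_n_` in the DATA step, and the inner loop corresponds to reading values with the trailing `@` until the line runs out.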
I have two data sets having the same content but one is in tab-delimited format, and the other is in space-delimited format.
Space-Delimited
Tab_Delimited
I have three questions that I could not figure out and would like to ask for help with. Any suggestions would be highly appreciated.
First, I used TextWrangler to open these two data sets. My impression is that in the space-delimited data set the values are separated by spaces and the observations in each row are in the same position.
On the other hand, my understanding of the tab-delimited data set is that the values are separated by blanks, and that these blanks do not necessarily have the same width in every row. Is my understanding correct? I am having trouble telling the two formats apart.
Second, I was printing the snowfall dataset mentioned above from row 5 to row 122, and the "T" values in the dataset have to be converted to 0.
My code for the space-delimited file of the snowfall data is below,
and my question is about its LOG. There were many warnings about "T", but I did not receive any errors.
LOG
Should I be concerned about the messages in the log that say "invalid data for month(i) in line..."?
* Trying Space-Delimited data set;
OPTIONS Errors=200;
DATA SASWEEK.SnowSpace;
DROP i MyTot diff;
INFILE "&dirLSB.RochesterSnowfallSpace.txt" FIRSTOBS= 2 OBS= 122;
INPUT Season $ Sep Oct Nov Dec Jan Feb Mar Apr May Total ;
ARRAY Month(10) Sep -- Total;
DO i = 1 TO 10 ;
IF Month(i) = . THEN Month(i) = 0 ;
MyTot = sum (of Sep -- May);
diff = round (MyTot-Total, 3);
IF diff ne 0 THEN PUT "**ERROR" MyTot= Total= diff= ;
END;
PROC PRINT DATA=sasweek.snowspace;
TITLE "Rochester Snowfall in Space-Delimited format";
RUN;
One of my professors suggested that I read the monthly snowfall values as character, so that the "T"s would not cause a warning in the LOG. I am not sure whether I should try it this way.
Lastly, I tried to use PROC IMPORT for the same data set, but from an xls file.
The data set is at the link.
My code is as follows:
* Trying Excel file ;
OPTIONS ERRORS=200;
OPTIONS MSGLEVEL=i;
PROC IMPORT OUT=SASWEEK.SNOWxls
DATAFILE= "&dirLSB.RochesterSnowfall.xls" DBMS=xls;
GETNAMES= no;
RANGE= "Sheet1$a5:k122" ;
PROC PRINT DATA= SASWEEK.SNOWxls;
TITLE "Rochester Snowfall in xls format";
RUN;
I received an error in the LOG (saved as HTML).
Part of the dataset was still printed, but the variable names were messed up and the output was incomplete.
Any ideas?
Thank you all for reading, and thanks for any help :)
The DATA step with an INPUT statement might be the best place to start.
WARNINGs are fine, unless the goal is to have no warnings.
The data file can be read cleanly by creating an input environment built for it:
Custom informat zeroT converts T (text) to 0 (number), which prevents the warnings.
INFILE
DLM='0920'x specifies that either a tab ('09'x) or a space ('20'x) may delimit the values in the data file.
INPUT
Wrap fields Sep to Total in parentheses ( ) to indicate grouped input.
Wrap the informat specifiers that are applied over the grouped variables in parentheses ( ).
: is the list input modifier; it advances input parsing to the next non-blank character and reads until the next delimiter.
Sample Code
proc format;
invalue zeroT 'T'=0 other=[best12.];
run;
data have;
infile snowdata firstobs=2 dlm='0920'x;
INPUT Season $ (Sep Oct Nov Dec Jan Feb Mar Apr May Total) (10 * :zeroT.) ;
run;
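To make the idea behind the zeroT informat and the two-character delimiter concrete, here is a small Python sketch of the same parsing rule. The function names `zero_t` and `parse_snow_line` are invented for this illustration, and the second sample line is written with tabs on the assumption that the tab-delimited file looks like that.

```python
import re

def zero_t(token):
    """Mimic the zeroT informat: 'T' (trace) becomes 0, anything else is parsed as a number."""
    return 0.0 if token == "T" else float(token)

def parse_snow_line(line):
    """Split on tabs or spaces and return (season, months, total)."""
    fields = re.split(r"[\t ]+", line.strip())
    season = fields[0]
    values = [zero_t(tok) for tok in fields[1:]]
    return season, values[:-1], values[-1]

sample = [
    "1884-85 0 T 1 27.1 22.2 17 3.5 19.5 T 90.3",
    "1885-86\t0\t1.7\t8.2\t8.4\t16.9\t16\t6.5\t7\t0\t64.7",
]
parsed = [parse_snow_line(line) for line in sample]
for season, months, total in parsed:
    # Compare the monthly sum with the reported total, allowing for float rounding.
    print(season, round(sum(months), 1), total)
```

Because the conversion happens while the value is read, no invalid-data message is ever generated, which is the same reason the informat keeps the SAS log clean.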
Sample Data (from SP text viewer)
filename snowdata "%TEMP%\roc_snowfalls.txt";
* create local sample data file, text copied from sharepoint viewer;
data _null_;
file snowdata;
input;
put _infile_;
datalines;
Season Sep Oct Nov Dec Jan Feb Mar Apr May Total
1884-85 0 T 1 27.1 22.2 17 3.5 19.5 T 90.3
1885-86 0 1.7 8.2 8.4 16.9 16 6.5 7 0 64.7
1886-87 0 T 22.2 12.5 12 18.4 6.3 1.2 0 72.6
1887-88 0 0.2 2.2 9.3 21.3 4.1 13.2 0.4 0 50.7
1888-89 0 T 4 15.5 17.8 22 17.5 5.4 0 82.2
1889-90 0 T 5.7 6.1 20.2 14.8 19 T 0 65.8
1890-91 0 0 2.1 29.2 16.1 24.6 12.2 0.3 0.1 84.6
1891-92 0 0.1 9.7 4.7 26.4 10.3 25.1 0.8 T 77.1
1892-93 0 T 14 19.2 15.9 29.8 8.1 9.6 0 96.6
1893-94 0 0.5 6.1 27.6 20 29.5 5.4 13.3 0 102.4
1894-95 0 T 11.1 22.1 26.5 23.6 9.5 0.6 0 93.4
1895-96 0 1.5 5.9 8.7 22.5 39.1 45.1 1 0 123.8
1896-97 0 T 5.5 13.9 20.1 13.7 8.1 5.2 0 66.5
1897-98 0 0 10.1 18.4 32.1 26.8 1.2 2.4 0 91
1898-99 0 T 10.6 27 16.6 16.3 21.2 4.3 T 96
1899-00 T T 1.3 21.5 24.7 28.5 54 1.3 0 131.3
1900-01 0 0 17 20.3 29.8 36.9 13.7 23.8 T 141.5
1901-02 0 0.1 14.1 14.5 23.8 23 1.2 2.3 T 79
1902-03 0 0.1 4.1 27.7 18.1 15.6 2.4 0.3 0 68.3
1903-04 0 0.6 4.4 16.1 27.2 17.2 10.7 19.5 T 95.7
1904-05 0 0.2 2.1 15.8 27.5 15.2 7 0.5 0 68.3
1905-06 0 T 4 8.4 7.6 8 15.2 1.1 0 44.3
1906-07 0 5 5.7 18.7 11.7 15.7 3.1 2.5 1.3 63.7
1907-08 0 0 2.2 11.6 16.5 19.8 7.9 6.3 3 67.3
1908-09 0 0.5 4.6 10 22.5 6.1 9.7 9.8 3.3 66.5
1909-10 0 T 1.7 14.6 22 42.7 3.4 0.5 0 84.9
1910-11 0 2.2 15.7 29.8 9.5 30 13.5 4.7 2 107.4
1911-12 0 0 6.5 7.5 21.5 10.8 8.8 6.9 T 62
1912-13 0 0 7.2 6.9 10 18.6 15.2 1.3 0 59.2
1913-14 0 0.2 0.3 14.4 15.1 21.6 27.9 7.2 0 86.7
1914-15 0 0.8 4.7 16.1 22.9 9.8 6 0.5 0 60.8
1915-16 0 0 3.4 14.8 8.5 35.7 43.8 0.7 0 106.9
1916-17 0 0 11.7 24.9 22.7 16.7 14.6 2.3 T 92.9
1917-18 0 T 7.9 29.7 17.2 12.7 10.5 1.3 0 79.3
run;
My dataset is like this.....
Pizzas Hamburgers Type
10.7 5.6 1
9.6 6.7 2
13.4 4.1 3
7.2 3.7 4
Here is what I need to do (this is essentially calculating a Wald estimator in econometrics, if you are familiar, if not, no biggie)
1. Create new categories so that an observation of type 1 is 'first' and an observation of type 2, 3, or 4 is 'other'
2. Calculate the averages of pizzas and hamburgers by first and other
3. Subtract the means between first and other
4. Divide the differences
There must be more structure than this to the problem; otherwise it's school arithmetic. This may get you started, but I think you need to show more substance about your data structure and larger goals. In a larger dataset, collapse may be a good idea, depending on what you want to do with the results.
clear
input Pizzas Hamburgers Type
10.7 5.6 1
9.6 6.7 2
13.4 4.1 3
7.2 3.7 4
end
gen First = Type == 1
egen MeanPizzas = mean(Pizzas), by(First)
egen MeanHamb = mean(Hamburgers), by(First)
sort First
gen DiffMeanPizzas = MeanPizzas[1] - MeanPizzas[_N]
gen DiffMeanHamb = MeanHamb[1] - MeanHamb[_N]
tabdisp First, c(Mean* Diff*)
--------------------------------------------------------------------------
First | MeanPizzas MeanHamb DiffMeanPizzas DiffMeanHamb
----------+---------------------------------------------------------------
0 | 10.06667 4.833333 -.6333332 -.7666669
1 | 10.7 5.6 -.6333332 -.7666669
--------------------------------------------------------------------------
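The Stata code above stops at the two differences. As a hedged sketch of the last step in the question ("divide the differences"), the following Python snippet computes the ratio. The question does not say which difference goes on top, so putting pizzas in the numerator is an assumption made here; the differences are also taken as first minus other, the opposite sign from the table above, which leaves the ratio unchanged.

```python
from statistics import mean

# (Pizzas, Hamburgers, Type) rows copied from the question.
data = [
    (10.7, 5.6, 1),
    (9.6, 6.7, 2),
    (13.4, 4.1, 3),
    (7.2, 3.7, 4),
]

first = [row for row in data if row[2] == 1]
other = [row for row in data if row[2] != 1]

# Difference in group means, first minus other.
diff_pizzas = mean(r[0] for r in first) - mean(r[0] for r in other)
diff_hamb = mean(r[1] for r in first) - mean(r[1] for r in other)

# Assumption: pizzas difference in the numerator, hamburgers in the denominator.
wald = diff_pizzas / diff_hamb
print(round(diff_pizzas, 4), round(diff_hamb, 4), round(wald, 4))
```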
I have this data frame:
x1 x2 x3
1 2.5 2.8 1.4
2 2.1 1.9 2.3
3 1.7 2.2 4.4
4 2.4 3.8 3.7
5 4.3 4.4 4.1
6 4.2 4.9 2.4
7 2.7 1.5 2.5
8 2.8 3.3 4.9
9 3.5 2.3 2.9
10 4.1 2.8 2.2
I need to check a condition for every row and apply a function to that row, storing the result in a fourth column or in an external vector, i.e. if min_value_of_row < threshold then min(row) else mean(row).
How would one do that?
A bit late, but I was looking for something similar. First I would create two columns with the min and mean values of each row, computing both from the original columns only (otherwise the new min column would be included in the mean):
cols = ['x1', 'x2', 'x3']
df['min'] = df[cols].min(axis=1)
and
df['mean'] = df[cols].mean(axis=1)
then build a function:
def f(x):
    thr = 2  # threshold; strict comparison as in the question
    if x['min'] < thr:
        x = x['min']
    else:
        x = x['mean']
    return x
and apply it to the dataframe row-wise (axis=1):
df['value'] = df.apply(f, axis=1)
this returns (helper columns omitted, values rounded to three decimals):
x1 x2 x3 value
1 2.5 2.8 1.4 1.400
2 2.1 1.9 2.3 1.900
3 1.7 2.2 4.4 1.700
4 2.4 3.8 3.7 3.300
5 4.3 4.4 4.1 4.267
6 4.2 4.9 2.4 3.833
7 2.7 1.5 2.5 1.500
8 2.8 3.3 4.9 3.667
9 3.5 2.3 2.9 2.900
10 4.1 2.8 2.2 3.033
I want to append one column below another column. My dataset looks like the following:
date xy ab cd
1 1.5 3.1 4.8
2 4.3 8.5 1.0
3 7.7 9.1 7.7
I want to create a dataset which looks like this:
date id price
1 xy 1.5
2 xy 4.3
3 xy 7.7
1 ab 3.1
2 ab 8.5
3 ab 9.1
1 cd 4.8
2 cd 1.0
3 cd 7.7
Do you have an idea how I can handle this?
Like this:
proc transpose data=indataname out=outdataname(rename=(_NAME_=id col1 = price));
by date;
run;
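To show what the reshaping does, here is a minimal pure-Python sketch of the same wide-to-long operation. The helper `wide_to_long` is a made-up name for this illustration. Note that PROC TRANSPOSE with BY date writes its rows grouped by date, whereas this sketch loops over the columns first so that the rows come out in the order shown in the question.

```python
# Wide table copied from the question: one row per date, one column per id.
wide = [
    {"date": 1, "xy": 1.5, "ab": 3.1, "cd": 4.8},
    {"date": 2, "xy": 4.3, "ab": 8.5, "cd": 1.0},
    {"date": 3, "xy": 7.7, "ab": 9.1, "cd": 7.7},
]

def wide_to_long(rows, key, value_columns):
    """Stack value_columns under each other, keeping the key column."""
    return [
        {key: row[key], "id": column, "price": row[column]}
        for column in value_columns   # outer loop: one block per column
        for row in rows               # inner loop: every date in that block
    ]

long_rows = wide_to_long(wide, "date", ["xy", "ab", "cd"])
for r in long_rows:
    print(r["date"], r["id"], r["price"])
```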
One:
data have;
input x1 x2;
diff=x1-x2;
a_diff= round(abs(diff), .01);
* a_diff=abs(diff);
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
proc rank data =have out =have_r;
where diff;
var a_diff ;
ranks a_diff_r;
run;
proc print data =have_r;run;
Results:
Obs x1 x2 diff a_diff a_diff_r
1 50.7 60.0 -9.3 9.3 9.0
2 28.8 30.0 -1.2 1.2 4.0
3 46.2 43.2 3.0 3.0 7.5
4 1.2 2.2 -1.0 1.0 3.0
5 25.5 27.5 -2.0 2.0 5.5
6 2.9 4.9 -2.0 2.0 5.5
7 5.4 5.0 0.4 0.4 1.0
8 3.8 3.2 0.6 0.6 2.0
9 1.0 4.0 -3.0 3.0 7.5
Two:
data have;
input x1 x2;
diff=x1-x2;
a_diff=abs(diff);
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
proc rank data =have out =have_r;
where diff;
var a_diff ;
ranks a_diff_r;
run;
proc print data =have_r;run;
results:
Obs x1 x2 diff a_diff a_diff_r
1 50.7 60.0 -9.3 9.3 9.0
2 28.8 30.0 -1.2 1.2 4.0
3 46.2 43.2 3.0 3.0 7.5
4 1.2 2.2 -1.0 1.0 3.0
5 25.5 27.5 -2.0 2.0 5.0
6 2.9 4.9 -2.0 2.0 6.0
7 5.4 5.0 0.4 0.4 1.0
8 3.8 3.2 0.6 0.6 2.0
9 1.0 4.0 -3.0 3.0 7.5
Please look at obs 3, 9, 5 and 6: why are the ranks different between the two runs? Thank you!
Run the code below and you'll see that the values are actually different. That's because of inaccuracies in numeric storage. It is similar to how 1/3 is not representable in decimal notation (0.333333333333333 etc.): 1-(1/3)-(1/3)-(1/3) is not equal to zero if you use, say, ten digits to store each result as you go (it is then equal to 0.0000000001). In the same way, any computer system will have issues with certain numbers that appear to store nicely in decimal (base 10) but do not in binary.
The solution here is basically to round, as you already do in the first version, or to fuzz the result, which amounts to the same thing (FUZZ ignores differences of less than 1x10^-12 from an integer).
data have;
input x1 x2;
diff=x1-x2;
a_diff=abs(diff);
put a_diff= hex16.;
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
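The same effect can be reproduced outside SAS, since Python floats are also IEEE double precision. The sketch below is an illustration, not the PROC RANK implementation: `average_ranks` is a hypothetical helper that assigns tied values the mean of their ranks, which matches the tie behaviour seen in the output above.

```python
# (x1, x2) pairs copied from the question.
pairs = [
    (50.7, 60), (3.3, 3.3), (28.8, 30), (46.2, 43.2), (1.2, 2.2),
    (25.5, 27.5), (2.9, 4.9), (5.4, 5), (3.8, 3.2), (1, 4),
]

def average_ranks(values):
    """Rank values ascending; exactly equal values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is exactly equal.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

# Drop zero differences, as "where diff;" does in the SAS code.
raw = [abs(a - b) for a, b in pairs if a != b]
rounded = [round(d, 2) for d in raw]

print((27.5 - 25.5).hex(), (4.9 - 2.9).hex())  # the two "2.0" values differ in the last bit
print(average_ranks(raw))
print(average_ranks(rounded))
```

With the raw differences, 27.5 - 25.5 and 4.9 - 2.9 are not equal, so they get ranks 5 and 6; after rounding they tie and share rank 5.5. The two differences of 3.0 are stored identically, which is why obs 3 and 9 tie in both runs.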