SAS minimum of all rows

I have a file with 25 rows like:
Model Cena (zl) Nagrywanie fimow HD Optyka - krotnosc zoomu swiatlo obiektywu przy najkrotszej ogniskowej Wielkosc LCD (cale)
Lumix DMC-LX3 1699 tak 2.5 2 3
Lumix DMC-GH1 + LUMIX G VARIO HD 14-140mm/F4.0-5.8 ASPH./MEGA O.I.S 5199 tak 10 4 3
And I wrote:
DATA lab_1;
INFILE 'X:\aparaty.txt' delimiter='09'X;
INPUT Model $ Cena Nagrywanie $ Optyka Wielkosc_LCD Nagr_film;
f_skal = MAX(Cena - 1500, Optyka - 10, Wielkosc_LCD - 1, Nagr_film - 1) + 1/1000*(Cena - 1500 + Optyka - 10 + Wielkosc_LCD - 1 + Nagr_film - 1);
*rozw = MIN(f_skal);
*rozw = f_skal[,<:>];
PROC SORT;
BY DESCENDING f_skal;
PROC PRINT DATA = lab_1;
data _null_;
set lab_1;
FILE 'X:\aparatyNOWE.txt' DLM='09'x;
PUT Model= $ Cena Nagrywanie $ Optyka Wielkosc_LCD Nagr_film f_skal;
RUN;
I need to find the lowest value of f_skal and I don't know how because min(f_skal) doesn't work.

In a data step, the min function only looks at one row at a time - if you feed it several variables, it will give you the minimum value out of all of those variables for that row, but you can't use it to look at values across multiple rows (unless you get data from multiple rows into 1 row first, e.g. via use of retain / lag).
One way of calculating statistics in SAS across a whole dataset is to use proc means / proc summary, e.g.:
proc summary data = lab_1;
var f_skal;
output out = min_val min=;
run;
This will create a dataset called min_val in your work library, and the value of f_skal in that dataset will be the minimum from anywhere in the dataset lab_1.
If you would rather create a macro variable containing the minimum value, so that you can use it in subsequent code, one way of doing that is to use proc sql instead:
proc sql noprint;
select min(f_skal) into :min_value from lab_1;
quit;
%put Minimum value = &min_value;
In proc sql the behaviour of min is different - here it compares values across rows, the way you were trying to use it.
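For completeness, the retain approach mentioned at the start of this answer could look roughly like the sketch below. It assumes the dataset lab_1 and the variable f_skal from the question; the names min_by_retain and min_f_skal are made up for illustration.

```sas
/* Sketch: carry the smallest f_skal seen so far from row to row */
data min_by_retain;                        /* hypothetical output dataset name */
    set lab_1 end=last;
    retain min_f_skal;                     /* keeps its value across rows; starts missing */
    min_f_skal = min(min_f_skal, f_skal);  /* min() ignores missing arguments */
    if last then output;                   /* write only the final row */
    keep min_f_skal;
run;
```

This reads the data once and needs no extra procedure, but proc summary or proc sql is usually the simpler choice when only the statistic is needed.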

Related

Way to change direction of risk difference in proc freq, without changing data in SAS

Suppose I have a dataset called example with variable x and y, where both are binary {0,1}. I want to find the risk difference stratified by a variable strata.
proc freq data = example;
table strata*x*y / commonriskdiff(CL=NEWCOMBEMR);
run;
However, suppose I want the risk difference in the opposite direction, i.e., I want 0.004 (-0.017, 0.028).
I can do
data example2;
set example;
y_2 = 1 - y;
run;
proc freq data = example2;
table strata*x*y_2 / commonriskdiff(CL=NEWCOMBEMR);
run;
but is there a way to do it directly on proc freq, without the extra step of creating example2 dataset?
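Yes, add the column= option so that the risk difference is computed for column 2 of the table rather than column 1:
commonriskdiff(CL=NEWCOMBEMR column = 2)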

Produce custom table in SAS with a subsetted data set

I want to use SAS and e.g. proc report to produce a custom table within my workflow.
Why: Previously, I used proc export (dbms=excel), did some very basic stats by hand, and copied and pasted them into an Excel sheet to complete the report. Recently, I've started to use ODS excel to print all the relevant data to Excel sheets, but since ODS excel always overwrites the whole workbook (and hence also the handcrafted stats), I now want to streamline the process.
The task itself is actually very straightforward. We have some information about IDs, age, and registration, so something like this:
data test;
input ID $ AGE CENTER $;
datalines;
111 23 A
. 27 B
311 40 C
131 18 A
. 64 A
;
run;
The goal is to produce a table report which should look like this structure-wise:
                  ID    NO-ID   Total
Count             3     2       5
Age (mean)        27    45.5    34.4
Count by Center:
  A               2     1       3
  B               0     1       1
  C               1     0       1
It seems, proc report only takes variables as columns but not a subsetted data set (ID NE .; ID =''). Of course I could just produce three reports with three subsetted data sets and print them all separately but I hope there is a way to put this in one table.
Is proc report the right tool for this and if so how should I proceed? Or is it better to use proc tabulate or proc template or...?
I found a way to get a close match to what I wanted. First of all, I had to introduce a new variable vID (valid ID: 0 = not valid, 1 = valid) in the data set, like so:
data test;
input ID $ AGE CENTER $;
if ID = '' then vID = 0;
else vID = 1;
datalines;
111 23 A
. 27 B
311 40 C
131 18 A
. 64 A
;
run;
After this I was able to use proc tabulate, as suggested by @Reeza in the comments, to build a table which pretty much resembles what I initially aimed for:
proc tabulate data = test;
class vID Center;
var age;
keylabel N = 'Count';
table N age*mean Center*N, vID ALL;
run;
Still, I wonder if there is a way without introducing the new variable at all and just use the SAS counters for missing and non-missing observations.
UPDATE:
@Reeza suggested using proc format to assign a label to missing/non-missing ID values. In combination with the missing option in proc tabulate (which keeps observations with missing class values), this delivers the output without introducing a new variable:
proc format;
value $id_fmt
' ' = 'No-ID'
other = 'ID'
;
run;
proc tabulate data = test missing;
format ID $id_fmt.;
class ID Center;
var age;
keylabel N = 'Count';
table N age*(mean median) Center*N, (ID=' ') ALL;
run;

PROC SQL - Counting distinct values across variables

Looking for ways of counting distinct entries across multiple columns / variables with PROC SQL, all I am coming across is how to count combinations of values.
However, I would like to search through 2 (character) columns (within rows that meet a certain condition) and count the number of distinct values that appear in any of the two.
Consider a dataset that looks like this:
DATA have;
INPUT A_ID C C_ID1 $ C_ID2 $;
DATALINES;
1 1 abc .
2 0 . .
3 1 efg abc
4 0 . .
5 1 abc kli
6 1 hij .
;
RUN;
I now want to have a table containing the count of the nr. of unique values within C_ID1 and C_ID2 in rows where C = 1.
The result should be 4 (abc, efg, hij, kli):
nr_distinct_C_IDs
4
So far, I only have been able to process one column (C_ID1):
PROC SQL;
CREATE TABLE try AS
SELECT
COUNT (DISTINCT
(CASE WHEN C=1 THEN C_ID1 ELSE ' ' END)) AS nr_distinct_C_IDs
FROM have;
QUIT;
(Note that I use CASE processing instead of a WHERE clause since my actual PROC SQL also processes other cases within the same query).
This gives me:
nr_distinct_C_IDs
3
How can I extend this to two variables (C_ID1 and C_ID2 in my example)?
It is hard to extend your method to two or more variables. Try stacking the variables first and then counting the distinct values. UNION already removes duplicates, so a plain COUNT is enough; note the C = 1 filter on each branch:
proc sql;
create table want as
select count(ID) as nr_distinct_C_IDs from
(select C_ID1 as ID from have where C = 1
union
select C_ID2 as ID from have where C = 1)
where not missing(ID);
quit;
I think in this case a data step may be a better fit if your priority is to come up with something that extends easily to a large number of variables. E.g.
data _null_;
length ID $3;
declare hash h();
rc = h.definekey('ID');
rc = h.definedone();
array IDs $ C_ID1-C_ID2;
do until(eof);
set have(where = (C = 1)) end = eof;
do i = 1 to dim(IDs);
if not(missing(IDs[i])) then do;
ID = IDs[i];
rc = h.add();
if rc = 0 then COUNT + 1;
end;
end;
end;
put "Total distinct values found: " COUNT;
run;
All that needs to be done here to accommodate a further variable is to add it to the array.
N.B. as this uses a hash object, you will need sufficient memory to hold all of the distinct values you expect to find. On the other hand, it only reads the input dataset once, with no sorting required, so it might be faster than SQL approaches that require multiple internal reads and sorts.

Nearest neighbour in SAS

I have a point data set containing latitude, longitude and elevation data. I would like to identify the nearest neighbour of a given point by using the distance between any two given points (2d or 3d). Could anybody suggest the different methods available in SAS for such geo-spatial data analysis and an example SAS code? Thanks.
Your best bet is to look into the clustering procedures, as KNN style clustering is pretty close to what you want (and at minimum cluster analysis can get you to a 'set' of neighbors to check). PROC MODECLUS, PROC FASTCLUS, PROC CLUSTER all give you some value here, as does PROC DISTANCE which is used as input in some cases to the above. Exactly what you want to use depends on what you need and your speed/size constraints (PROC CLUSTER is very slow with large datasets, but gives more useful results oftentimes).
Here is an example of nearest-neighbour calculation via the use of SQL (given in the SAS help file somewhere):
options ls=80 ps=60 nodate pageno=1 ;
data stores;
input Store $ x y;
datalines;
store1 5 1
store2 5 3
store3 3 5
store4 7 5
;
data houses;
input House $ x y;
datalines;
house1 1 1
house2 3 3
house3 2 3
house4 7 7
;
options nodate pageno=1 linesize=80 pagesize=60;
proc sql;
title 'Each House and the Closest Store';
select house, store label='Closest Store',
sqrt((abs(s.x-h.x)**2)+(abs(h.y-s.y)**2)) as dist
label='Distance' format=4.2
from stores s, houses h
group by house
having dist=min(dist);
quit;
I wrote two macros to accomplish this.
The first macro takes one input GPS location and adds its latitude and longitude as constant new variables to the "neighbor" location dataset. It then computes all the distances, selects the minimum, and stores it in a temporary dataset.
The second (calling) macro loops through the input dataset, passes in each individual GPS location, calls the first macro to do the work, and appends each minimum distance to the output dataset.
/*** first concatenate your input lat, lon and an id into |-separated strings so that %scan can later split them into individual inputs ***/
%macro min_distance;
data compute_all_dis;
set all_neighbor_gps_locations;
/** here create a new variable to this big dataset with the one point gps value***/
first_lat = &latitude;
first_lon = &longitude;
ID = &PERIOD;
/** compute all **/
distance = geodist(lat, long, first_lat, first_lon, 'dm');
run;
/** get the shortest distance ***/
proc sql;
create table closest_neighbor as
select milepost,OFF_PERIOD_ID, lat, long, first_lat, first_lon, distance
from compute_all_dis
having distance = min ( distance);
quit;
%mend min_distance;
%macro find_all_closest_neighbors;
data _null_;
runno=countw("&ID",'|');
call symputx('runno',put(runno,8.));
run;
%put &runno;
%do i=1 %to &runno;
%let PERIOD = %SCAN(&OFF_ID, &i, "|");
%let latitude = %SCAN (&LAT_I, &i, "|");
%let longitude = %SCAN (&LONG_I, &i, "|");
%min_distance;
proc datasets nowarn;
append base= pout.all_close_neighbors data=closest_neighbor;
run;
%end;
%mend find_all_closest_neighbors;
%find_all_closest_neighbors;
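As an alternative to looping with macros, the SQL pattern from the stores/houses example above can be combined with geodist in a single self-join. This is only a sketch: it assumes a dataset named points with columns id, lat and long in degrees (all hypothetical names). The cross join compares every pair of points, so the work grows quadratically with the number of rows.

```sas
/* Sketch: nearest neighbour of every point via a self-join.           */
/* Assumes a dataset POINTS with columns id, lat, long (hypothetical). */
proc sql;
    create table nearest as
    select a.id,
           b.id as neighbor_id,
           geodist(a.lat, a.long, b.lat, b.long, 'DM') as distance /* degrees in, miles out */
    from points as a, points as b
    where a.id ne b.id               /* do not match a point with itself */
    group by a.id
    having distance = min(distance); /* keep the closest candidate(s) per point */
quit;
```

If two neighbours are exactly equally close, both rows are kept, so a point can appear more than once in the result.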

several regressions on a single dataset in SAS

I have a dataset of the following format:
a table of M rows and 2K columns.
My columns are pairs of variables: X_i, Y_i and the rows are observations.
I would like to perform many linear regressions: one for each pair of columns (Y_i ~ X_i)
and obtain the results.
I know how to access specific columns using arrays, like so:
data Xs_Ys_data (drop=i);
array Xs[60] X1-X60;
array Ys[60] Y1-Y60;
I also know how to fit a single linear regression model, like so:
proc reg data=some_data;
model y = x;
output out=out_lin_reg;
run;
And I am familiar with the concept of loops:
do i=1 to 60;
Xs[i] .......;
end;
How do I combine these three to get what I need?
Thanks!
P.S - I asked a similar question on a different format here:
SAS reading a file in long format
Update:
I have managed to create the regressions using a macro like so:
%macro mylogit();
%do i = 1 %to 60;
proc reg data=Xs_Ys_data;
model Y&i = X&i;
run;
%end;
%mend;
%mylogit()
Now I am not sure how to export the results into a single table...
You have this in your macro:
proc reg data=Xs_Ys_data;
model Y&i = X&i;
run;
So instead create:
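data x_y_data;
set Xs_Ys_data;
array xs x1-x60;
array ys y1-y60;
/* write one (iter, x, y) row per variable pair */
do iter = 1 to dim(xs);
x=xs[iter];
y=ys[iter];
output;
end;
keep iter x y;
run;
/* BY-group processing requires the data to be sorted by iter */
proc sort data=x_y_data;
by iter;
run;
proc reg data=x_y_data;
by iter;
model Y = X;
run;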
And then add an output statement however you normally would to get your resulting dataset. Now you get 1 output table with all 60 iterations (still 60 printed outputs), and if you want to create one printed output you can construct that from the output dataset.
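One possible way to collect the fitted coefficients in a single table is PROC REG's OUTEST= option, which writes one row of parameter estimates per BY group. The sketch below assumes the long-format dataset x_y_data with columns iter, x and y; the name reg_estimates is made up for illustration.

```sas
/* Sketch: assumes x_y_data has columns iter, x, y (one row per pair and observation) */
proc sort data=x_y_data;
    by iter;
run;

/* OUTEST= collects the estimates; NOPRINT suppresses the 60 printed reports */
proc reg data=x_y_data outest=reg_estimates noprint;
    by iter;
    model y = x;
run;
quit;

/* reg_estimates (hypothetical name) holds one row per iter with Intercept and slope x */
proc print data=reg_estimates;
    var iter Intercept x _RMSE_;
run;
```

Observation-level results such as predicted values or residuals would instead come from an OUTPUT statement inside PROC REG.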