How to create new columns holding the names of columns ordered by descending value in SAS Enterprise Guide / PROC SQL?

I have a table in SAS Enterprise Guide like the one below:
ID | COL_A | COL_B | COL_C
-----|-------|-------|------
111 | 10 | 20 | 30
222 | 15 | 80 | 10
333 | 11 | 10 | 20
444 | 20 | 5 | 20
Requirements:
I need to create new columns TOP_1, TOP_2 and TOP_3 that hold the names of the columns COL_A, COL_B and COL_C, ordered from the highest value to the lowest for each ID.
If two or more columns share the same value, take the one whose name comes first alphabetically.
In TOP_1 - the name of the column with the highest value per ID
In TOP_2 - the name of the column with the second highest value per ID
In TOP_3 - the name of the column with the third highest value per ID
Desired output:
ID | COL_A | COL_B | COL_C | TOP_1 | TOP_2 | TOP_3
-----|-------|-------|--------|--------|---------|---------
111 | 10 | 20 | 30 | COL_C | COL_B | COL_A
222 | 15 | 80 | 10 | COL_B | COL_A | COL_C
333 | 11 | 10 | 20 | COL_C | COL_A | COL_B
444 | 20 | 5 | 20 | COL_A | COL_C | COL_B
Because:
for ID = 111 the highest value is in COL_C, so the name "COL_C" goes into column "TOP_1"; the second highest value is in COL_B, so the name "COL_B" goes into column "TOP_2", and so on...
for ID = 444 two columns share the highest value, so we have to use the alphabetical criterion: the name "COL_A" goes into column "TOP_1" and the name "COL_C" goes into column "TOP_2"
How can I do that in SAS Enterprise Guide or in PROC SQL?

First let's convert your listing into an actual dataset.
data have;
input ID COL_A COL_B COL_C ;
cards;
111 10 20 30
222 15 80 10
333 11 10 20
444 20 5 20
;
Then use PROC TRANSPOSE to convert your COL_: variables into observations.
proc transpose data=have out=tall;
by id col_a col_b col_c;
var col_a col_b col_c;
run;
You can then sort by descending values (and ascending variable name):
proc sort data=tall;
by id col_a col_b col_c descending col1 _name_;
run;
And use another PROC TRANSPOSE to make your new variables:
proc transpose data=tall out=want(drop=_name_ _label_) prefix=TOP_;
by id col_a col_b col_c;
var _name_;
run;
If the data is really large (or you have many more than 3 columns to check), you might want to leave COL_A, COL_B and COL_C out of the BY group and instead merge the resulting TOP_: variables back onto the original dataset by ID.
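For anyone who wants to check the ranking logic outside SAS, here is a minimal Python sketch of the same idea (sort the column names by descending value, then by name). It is not SAS code and not part of the answer above; the helper names `rank_columns` and `add_top_columns` are made up for illustration, and missing values are assumed not to occur.

```python
# Rows copied from the question's table.
rows = [
    {"ID": 111, "COL_A": 10, "COL_B": 20, "COL_C": 30},
    {"ID": 222, "COL_A": 15, "COL_B": 80, "COL_C": 10},
    {"ID": 333, "COL_A": 11, "COL_B": 10, "COL_C": 20},
    {"ID": 444, "COL_A": 20, "COL_B": 5, "COL_C": 20},
]
VALUE_COLS = ["COL_A", "COL_B", "COL_C"]


def rank_columns(row, cols=VALUE_COLS):
    """Column names ordered by descending value; ties broken alphabetically."""
    # Same sort key as "descending col1 _name_" in the PROC SORT step.
    return sorted(cols, key=lambda c: (-row[c], c))


def add_top_columns(row):
    """Return a copy of the row with TOP_1, TOP_2, ... appended."""
    out = dict(row)
    for position, name in enumerate(rank_columns(row), start=1):
        out[f"TOP_{position}"] = name
    return out


want = [add_top_columns(r) for r in rows]
for r in want:
    print(r["ID"], r["TOP_1"], r["TOP_2"], r["TOP_3"])
```

The tuple key `(-value, name)` plays the role of the two sort keys in the SAS version, so ID 444 comes out as COL_A, COL_C, COL_B.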

Related

How to create column with name of column with the highest value per each ID in SAS Enterprise Guide / PROC SQL?

I have a table in SAS Enterprise Guide like the one below:
ID | COL_A | COL_B | COL_C
-----|-------|-------|------
111 | 10 | 20 | 30
222 | 15 | 80 | 10
333 | 11 | 10 | 20
444 | 20 | 5 | 20
Requirements:
I need to create a new column "TOP" that holds the name of the column with the highest value for each ID.
If two or more columns share the same highest value, take the one whose name comes first alphabetically.
Desired output:
ID | COL_A | COL_B | COL_C | TOP
-----|-------|-------|--------|-------
111 | 10 | 20 | 30 | COL_C
222 | 15 | 80 | 10 | COL_B
333 | 11 | 10 | 20 | COL_C
444 | 20 | 5 | 20 | COL_A
Because:
for ID = 111 the highest value is in COL_C, so the name "COL_C" is in column "TOP"
for ID = 444 two columns share the highest value, so based on the alphabetical criterion the name "COL_A" is in column "TOP"
How can I do that in SAS Enterprise Guide or in PROC SQL?
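You can do this with functions. Use MAX() to find the largest value. Use WHICHN() to find the index number of the first variable with that value. Use the VNAME() function to get the name of the variable with that index. Because the array lists the variables in alphabetical order, the first match also satisfies the alphabetical tie-break.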
data want;
set have;
length TOP $32;
array list col_a col_b col_c;
top = vname(list[whichn(max(of list[*]),of list[*])]);
run;
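The three function calls can be mirrored step by step in Python, which may help when checking the result by hand. This is only a sketch of the same logic, not SAS; `top_column` is a hypothetical helper name, and the column list is assumed to be in alphabetical order, as in the array above.

```python
# Rows copied from the question's table.
rows = [
    {"ID": 111, "COL_A": 10, "COL_B": 20, "COL_C": 30},
    {"ID": 222, "COL_A": 15, "COL_B": 80, "COL_C": 10},
    {"ID": 333, "COL_A": 11, "COL_B": 10, "COL_C": 20},
    {"ID": 444, "COL_A": 20, "COL_B": 5, "COL_C": 20},
]
COLS = ["COL_A", "COL_B", "COL_C"]  # alphabetical, so first match wins ties


def top_column(row, cols=COLS):
    """Name of the first column that holds the row's largest value."""
    values = [row[c] for c in cols]
    largest = max(values)                # counterpart of MAX(of list[*])
    first_index = values.index(largest)  # counterpart of WHICHN (0-based here)
    return cols[first_index]             # counterpart of VNAME(list[i])


tops = {row["ID"]: top_column(row) for row in rows}
print(tops)
```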

PROC FORMAT does not work with BY statement in other procedures

I want to get the distribution of a variable that is categorized using PROC FORMAT. However, I do not get the frequency distribution based on the new groups when I use a BY statement. I discovered this while using PHREG on a larger dataset. Sample code is given below.
data p;
input v1 $ v2;
datalines;
A 1
A 2
A 1
A 2
B 3
B 2
C 1
D 1
;
RUN;
proc format;invalue $ v1f 'A','C'='Grp-1' 'B','D'='Grp-2'; run;
proc freq;tables v1; format v1 $v1f.;run;
proc sort;by v1; run;
proc freq;tables v2; by v1;format v1 $v1f.;run;
Not sure why the last PROC FREQ is not working as expected.
I need to keep changing these categories for iterative analysis and so I find PROC FORMAT easy to code but I am very confused as to why it is not working.
Any tips would be appreciated.
To FORMAT a variable you need to use a FORMAT. The INVALUE statement is for defining an INFORMAT. To define a FORMAT you need to use the VALUE statement instead.
FORMATs are used to convert values to text. INFORMATs are used to convert text to values. You use a FORMAT with the FORMAT and PUT statements and the PUT() function. You use an INFORMAT with the INFORMAT and INPUT statements and the INPUT() function.
BY groups are done by the actual values, not the formatted values. If you want the frequencies of V1 crossed with V2 specify that in the TABLES statement.
proc freq;
tables v1*v2;
format v1 $v1f.;
run;
Results
The FREQ Procedure
Table of v1 by v2
v1 v2
Frequency|
Percent |
Row Pct |
Col Pct | 1| 2| 3| Total
---------+--------+--------+--------+
Grp-1 | 3 | 2 | 0 | 5
| 37.50 | 25.00 | 0.00 | 62.50
| 60.00 | 40.00 | 0.00 |
| 75.00 | 66.67 | 0.00 |
---------+--------+--------+--------+
Grp-2 | 1 | 1 | 1 | 3
| 12.50 | 12.50 | 12.50 | 37.50
| 33.33 | 33.33 | 33.33 |
| 25.00 | 33.33 | 100.00 |
---------+--------+--------+--------+
Total 4 3 1 8
50.00 37.50 12.50 100.00
If you want to sort by the formatted value then use the PUT() function to make a new variable.
data by_group;
set p ;
group = put(v1,$v1f.);
run;
proc sort data=by_group;
by group;
run;
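To make the difference between stored and formatted values concrete, here is a small Python sketch that applies the same A,C / B,D mapping to the sample data and counts V2 within each formatted group. It is an illustration of the idea only, not SAS; the dictionary `V1F` and the function `formatted` are stand-ins I am assuming for the $V1F. format.

```python
from collections import Counter

# Stand-in for the $V1F. format defined with a VALUE statement.
V1F = {"A": "Grp-1", "C": "Grp-1", "B": "Grp-2", "D": "Grp-2"}

# Sample data from the question: (v1, v2) pairs.
data = [("A", 1), ("A", 2), ("A", 1), ("A", 2),
        ("B", 3), ("B", 2), ("C", 1), ("D", 1)]


def formatted(v1):
    """Counterpart of put(v1, $v1f.); unmatched values pass through unchanged."""
    return V1F.get(v1, v1)


# New variable holding the formatted value, then sort by it.
by_group = sorted((formatted(v1), v2) for v1, v2 in data)

# Frequency of v2 within each formatted group.
counts = Counter(by_group)
for (group, v2), n in sorted(counts.items()):
    print(group, v2, n)
```

The counts agree with the crosstab above: Grp-1 has 3 and 2 observations for V2 = 1 and 2, and Grp-2 has one observation for each of 1, 2 and 3.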
Use the Proc FORMAT VALUE Statement to define a custom format.
Proc SQL and PUT() can be used to sort data in formatted order.
Proc FREQ BY processing will honor a formatted value when the contiguous underlying values in the data map to the same formatted value.
proc format;
value $v1f
'A','C'='Grp-1'
'B','D'='Grp-2';
run;
proc sql;
create table two as
select *
from p
order by put(v1,$v1f.), v1 /* order by formatted value, then by underlying value within it (for good measure, in case the raw data is viewed) */
;
quit;
proc freq;
tables v2;
by v1;
format v1 $v1f.;
run;

Dates in columns instead of rows in SAS - advantages?

A few of my clients who use SAS store dates in columns.
e.g:
| Id | Variable1_201101 | Variable1_201102 | ... | Variable1_201909 | Variable2_201101 | Variable2_201102 | ... |
etc.
Instead of storing dates in rows:
| Id | Date | Variable1 | Variable2 |
As a result, they have a huge number of cells: even if an ID does not exist on a particular date, the first structure still holds an empty cell for it, whereas in the second structure the row is simply omitted.
I have never come across such storage structures in SQL, where they would not be considered a good solution. Are there any advantages to such structures in SAS?
There is never a perfect storage structure. There are superior structures for solutions to problems at hand. Sometimes you have to reshape data for a particular solution, sometimes a procedure has grammar or mechanisms for reshaping within the procedure itself.
For example, examining a variable in different time frames in the TTEST procedure might use a PAIRED statement, which requires the values to be in different variables. Thus, to compare Jan-2011 values with Jan-2012 values it makes sense to have a structure with Variable1_201101 and Variable1_201201.
Disk space for sparse wide data can be reduced effectively with the COMPRESS= options, at the cost of CPU cycles for decompression. Depending on the data, disk use can be significantly lower, but the wide layout is then hard to work with in other categorical analyses.
A traditional RDBMS treats the categorical (vertical) form as a very common best practice, with indexing and foreign keys. If this is the original layout, you might need to pivot or reshape the data for a particular TTEST analysis.
When dealing with data from a NoSQL data store you may encounter the horizontal form more often (because the underlying storage handles sparseness better).
Prepare the data:
data have;
id=786;
Variable1_201101 = 78;
Variable1_201102 =67;
Variable1_201909 = 23;
Variable2_201101 = 34 ;
Variable2_201102 = 12;
run;
Now, we have :
+-----+------------------+------------------+------------------+------------------+------------------+
| id | Variable1_201101 | Variable1_201102 | Variable1_201909 | Variable2_201101 | Variable2_201102 |
+-----+------------------+------------------+------------------+------------------+------------------+
| 786 | 78 | 67 | 23 | 34 | 12 |
+-----+------------------+------------------+------------------+------------------+------------------+
Use transpose with wildcards:
PROC TRANSPOSE DATA=have
OUT=have2
PREFIX=Column
NAME=Source
LABEL=Label
;
BY id;
VAR Variable1_: Variable2_:;
RUN;
Result:
+-----+------------------+---------+
| id | Source | Column1 |
+-----+------------------+---------+
| 786 | Variable1_201101 | 78 |
| 786 | Variable1_201102 | 67 |
| 786 | Variable1_201909 | 23 |
| 786 | Variable2_201101 | 34 |
| 786 | Variable2_201102 | 12 |
+-----+------------------+---------+
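Now parse the variable names into a name part and a date part:
data have3;
set have2;
format date ddmmyyp10.;
/* text after the first underscore is the period, e.g. 201101 */
date_str=substr(Source,find(source,"_")+1);
/* YYMMN6. reads yyyymm and returns the first day of that month */
date=input(date_str,yymmn6.);
variable_name=substr(Source,1,find(source,"_")-1);
/* Optional*/
drop date_str source ;
run;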
PROC SORT DATA=have3;
BY date id;
RUN;
And transpose again:
PROC TRANSPOSE DATA=have3
OUT=want (drop=source)
PREFIX=Column
NAME=Source
LABEL=Label
;
BY date id;
ID variable_name;
VAR Column1;
RUN;
Result:
+------------+-----+-----------------+-----------------+
| date | id | ColumnVariable1 | ColumnVariable2 |
+------------+-----+-----------------+-----------------+
| 01.01.2011 | 786 | 78 | 34 |
| 01.02.2011 | 786 | 67 | 12 |
| 01.09.2019 | 786 | 23 | . |
+------------+-----+-----------------+-----------------+
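The parse-and-pivot step can also be checked with a few lines of Python. This is a rough sketch of the same wide-to-long idea, not a translation of the SAS steps; `wide_to_long` is a hypothetical helper, it assumes every non-key column is named `<variable>_<yyyymm>`, and it keeps the plain variable name instead of adding the Column prefix.

```python
from datetime import date

# The single wide record built in the "have" data step above.
have = {
    "id": 786,
    "Variable1_201101": 78,
    "Variable1_201102": 67,
    "Variable1_201909": 23,
    "Variable2_201101": 34,
    "Variable2_201102": 12,
}


def wide_to_long(record, key="id"):
    """Turn <variable>_<yyyymm> columns into one row per (date, id)."""
    rows = {}
    for name, value in record.items():
        if name == key:
            continue
        # Split "Variable1_201101" into "Variable1" and "201101".
        variable, yyyymm = name.rsplit("_", 1)
        first_of_month = date(int(yyyymm[:4]), int(yyyymm[4:6]), 1)
        rows.setdefault((first_of_month, record[key]), {})[variable] = value
    # One output row per (date, id), sorted like BY date id.
    return [{"date": d, key: k, **values} for (d, k), values in sorted(rows.items())]


want = wide_to_long(have)
for row in want:
    print(row)
```

A combination that does not exist in the wide record (Variable2 for 2019-09 here) is simply absent from the row, which corresponds to the missing value shown in the result table.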

Update results in a column from multiple columns with different names

Based on the image, I would like to loop through the columns to find where the text mo appears, and then fill a variable mo with the corresponding result rather than with the text mo itself. The challenge has been how to select the result, which sits in the column next to the one where mo is found.
Your answer to my comment above suggests to me that the question you ask reflects the wrong approach to the larger problem. Your description suggests that you have observations with a varying number of testname/testvalue pairs, such as
+----------------------------------------+
| id day test1 val1 test2 val2 |
|----------------------------------------|
| A 1 mo 11 . |
| A 2 mo 12 df 98.2 |
|----------------------------------------|
| B 1 df 98.3 mo 23 |
| B 2 mo 14 . |
+----------------------------------------+
and your objective is to produce observations that look like this
+----------------------+
| id day df mo |
|----------------------|
| A 1 . 11 |
| A 2 98.2 12 |
|----------------------|
| B 1 98.3 23 |
| B 2 . 14 |
+----------------------+
If that is the case, here is a reproducible example that you can copy, paste into Stata's Do-file Editor window and execute; examine the output to see how the technique avoids all the complexity you introduce by trying to accomplish the task with loops. The reshape command is one of Stata's most powerful data management tools, and it will benefit you to learn how to use it.
clear
input str8 id int day str8 test1 float val1 str8 test2 float val2
A 1 "mo" 11 "" .
A 2 "mo" 12 "df" 98.2
B 1 "df" 98.3 "mo" 23
B 2 "mo" 14 "" .
end
list, sepby(id) noobs
reshape long test val, i(id day) j(num)
drop if missing(test)
drop num
list, sepby(id) noobs
reshape wide val, i(id day) j(test) str
rename val* *
list, sepby(id) noobs
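If it helps to see what the two reshape calls do to the data, the same long-then-wide idea can be written out by hand in Python. This is only an illustrative sketch under the assumption that the data look like the example above; `to_long` and `to_wide` are made-up helper names, not Stata features.

```python
# The four observations from the example; "" and None stand for missing.
rows = [
    {"id": "A", "day": 1, "test1": "mo", "val1": 11, "test2": "", "val2": None},
    {"id": "A", "day": 2, "test1": "mo", "val1": 12, "test2": "df", "val2": 98.2},
    {"id": "B", "day": 1, "test1": "df", "val1": 98.3, "test2": "mo", "val2": 23},
    {"id": "B", "day": 2, "test1": "mo", "val1": 14, "test2": "", "val2": None},
]


def to_long(rows, pairs=(1, 2)):
    """One (id, day, test, val) tuple per non-missing test/val pair."""
    long_rows = []
    for r in rows:
        for n in pairs:
            test = r[f"test{n}"]
            if test:  # counterpart of: drop if missing(test)
                long_rows.append((r["id"], r["day"], test, r[f"val{n}"]))
    return long_rows


def to_wide(long_rows):
    """One record per (id, day) with a column per test name."""
    names = sorted({test for _, _, test, _ in long_rows})
    wide = {}
    for id_, day, test, val in long_rows:
        blank = {"id": id_, "day": day, **{n: None for n in names}}
        wide.setdefault((id_, day), blank)[test] = val
    return [wide[k] for k in sorted(wide)]


result = to_wide(to_long(rows))
for rec in result:
    print(rec)
```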

Column totals by group (Stata)

Here are the relevant commands:
sysuse auto
table foreign, c(max mpg max rep78) row
Reading through the documentation (row: add row totals), I expected it to turn out like this:
----------------------------------
Car type | max(mpg) max(rep78)
----------+-----------------------
Domestic | 34 5
Foreign | 41 5
|
Total | 75 10
----------------------------------
However, the Total row is actually just the max of the column:
----------------------------------
Car type | max(mpg) max(rep78)
----------+-----------------------
Domestic | 34 5
Foreign | 41 5
|
Total | 41 5
----------------------------------
I was wondering if there is a similar command (without me having to collapse) that would allow me to construct a table like this (within the Stata window) but actually have the Total SUM at the bottom. Thanks for your time.
Stata's answer in table is arguably what would be expected. Given an instruction to calculate maximums, it does that by group and for the total dataset.
You want the maximums by group, but also to see their total or sum. That seems puzzling, but it can be done indirectly:
. sysuse auto , clear
(1978 Automobile Data)
. egen mpg_max = max(mpg), by(foreign)
. egen rep_max = max(rep78), by(foreign)
. egen tag = tag(foreign)
. table foreign if tag, c(sum mpg_max sum rep_max) row
--------------------------------------
Car type | sum(mpg_max) sum(rep_max)
----------+---------------------------
Domestic | 34 5
Foreign | 41 5
|
Total | 75 10
--------------------------------------
The trick here is that taking the maximums is done outside table. Then we feed just one observation in each category to table and the total is what is needed.
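The two-stage logic (maximums per group first, then a sum over one value per group) can be sketched in Python as follows. The rows below are hypothetical toy values, not the auto dataset; they were chosen only so that the group maximums match the table above. `group_max` is an illustrative helper name.

```python
# Hypothetical toy rows: (car type, mpg, rep78). NOT the real auto data.
cars = [
    ("Domestic", 34, 5),
    ("Domestic", 20, 3),
    ("Foreign", 41, 5),
    ("Foreign", 25, 4),
]


def group_max(rows):
    """Maximum mpg and rep78 per group, like egen ... = max(), by(foreign)."""
    best = {}
    for group, mpg, rep in rows:
        current = best.get(group)
        if current is None:
            best[group] = (mpg, rep)
        else:
            best[group] = (max(current[0], mpg), max(current[1], rep))
    return best


maxima = group_max(cars)

# One value per group goes into the total, which is what "if tag" achieves.
total = tuple(sum(column) for column in zip(*maxima.values()))

for group, (mpg, rep) in maxima.items():
    print(group, mpg, rep)
print("Total", *total)
```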