How to create a column holding the name of the column with the highest value for each ID in SAS Enterprise Guide / PROC SQL? - sas

I have a table in SAS Enterprise Guide like the one below:
ID | COL_A | COL_B | COL_C
-----|-------|-------|------
111 | 10 | 20 | 30
222 | 15 | 80 | 10
333 | 11 | 10 | 20
444 | 20 | 5 | 20
Requirements:
I need to create a new column "TOP" that holds the name of the column with the highest value for each ID.
If 2 or more columns share the highest value, take the first one alphabetically.
Desired output:
ID | COL_A | COL_B | COL_C | TOP
-----|-------|-------|--------|-------
111 | 10 | 20 | 30 | COL_C
222 | 15 | 80 | 10 | COL_B
333 | 11 | 10 | 20 | COL_C
444 | 20 | 5 | 20 | COL_A
Because:
for ID = 111 the highest value is in COL_C, so the name "COL_C" is in column "TOP"
for ID = 444 two columns share the highest value, so based on the alphabetical criterion the name "COL_A" is in column "TOP"
How can I do that in SAS Enterprise Guide or in PROC SQL?

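You can do this with functions. Use MAX() to find the largest value. Use WHICHN() to find the index number of the first variable with that value. Use the VNAME() function to get the name of the variable with that index. Because WHICHN() returns the first match and the array lists the variables in alphabetical order, ties resolve to the alphabetically first name, as required.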
data want;
set have;
length TOP $32;
array list col_a col_b col_c;
top = vname(list[whichn(max(of list[*]),of list[*])]);
run;
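As a language-neutral cross-check of the logic above (not SAS code), here is a minimal Python sketch. The helper name top_column and the dictionary layout are illustrative assumptions; the data is the sample table from the question.

```python
# Sketch of the MAX / WHICHN / VNAME idea: find the largest value,
# then return the name of the first column (in listed order) that holds it.
COLUMNS = ["COL_A", "COL_B", "COL_C"]  # listed alphabetically, like the SAS array

rows = [
    {"ID": 111, "COL_A": 10, "COL_B": 20, "COL_C": 30},
    {"ID": 222, "COL_A": 15, "COL_B": 80, "COL_C": 10},
    {"ID": 333, "COL_A": 11, "COL_B": 10, "COL_C": 20},
    {"ID": 444, "COL_A": 20, "COL_B": 5, "COL_C": 20},
]


def top_column(row, columns):
    """Return the first column name whose value equals the row maximum."""
    best = max(row[c] for c in columns)
    for c in columns:  # first match wins, so ties go to the earliest name
        if row[c] == best:
            return c


tops = {row["ID"]: top_column(row, COLUMNS) for row in rows}
```

The tie for ID 444 resolves to COL_A only because the columns are listed in alphabetical order; listing them differently would change which name wins.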

Related

How to create new columns with names of columns with values in descending order in SAS Enterprise Guide / PROC SQL?

I have a table in SAS Enterprise Guide like the one below:
ID | COL_A | COL_B | COL_C
-----|-------|-------|------
111 | 10 | 20 | 30
222 | 15 | 80 | 10
333 | 11 | 10 | 20
444 | 20 | 5 | 20
Requirements:
I need to create new columns TOP_1, TOP_2, TOP_3 that hold the names of the columns COL_A, COL_B, COL_C ordered from the highest value to the lowest for each ID.
If 2 or more columns share the same value, take the first one alphabetically.
In TOP_1 - name of the column with the highest value per ID
In TOP_2 - name of the column with the second highest value per ID
In TOP_3 - name of the column with the third highest value per ID
Desired output:
ID | COL_A | COL_B | COL_C | TOP_1 | TOP_2 | TOP_3
-----|-------|-------|--------|--------|---------|---------
111 | 10 | 20 | 30 | COL_C | COL_B | COL_A
222 | 15 | 80 | 10 | COL_B | COL_A | COL_C
333 | 11 | 10 | 20 | COL_C | COL_A | COL_B
444 | 20 | 5 | 20 | COL_A | COL_C | COL_B
Because:
for ID = 111 the highest value is in COL_C, so the name "COL_C" goes to column "TOP_1"; the second highest value is in COL_B, so the name "COL_B" goes to column "TOP_2", and so on.
for ID = 444 two columns share the highest value, so we apply the alphabetical criterion: the name "COL_A" goes to column "TOP_1" and the name "COL_C" goes to column "TOP_2"
How can I do that in SAS Enterprise Guide or in PROC SQL?
First let's convert your listing into an actual dataset.
data have;
input ID COL_A COL_B COL_C ;
cards;
111 10 20 30
222 15 80 10
333 11 10 20
444 20 5 20
;
Use PROC TRANSPOSE to convert your COL_: variables into observations.
proc transpose data=have out=tall;
by id col_a col_b col_c;
var col_a col_b col_c;
run;
You can then sort by descending values (and ascending variable name):
proc sort;
by id col_a col_b col_c descending col1 _name_;
run;
And use another PROC TRANSPOSE to make your new variables:
proc transpose data=tall out=want(drop=_name_ _label_) prefix=TOP_;
by id col_a col_b col_c;
var _name_;
run;
If the data is really large (or you have a lot more than 3 columns to check) you might want to eliminate COL_A COL_B and COL_C from the BY group and instead just merge the resulting TOP_: variable back onto the original dataset.
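The transpose, sort, transpose sequence boils down to ordering the column names by descending value and then by ascending name. A minimal Python sketch of that ordering rule (not SAS; the function name rank_columns is an illustrative assumption) using the sample data:

```python
# Order column names by descending value, breaking ties by ascending name,
# which mirrors "by ... descending col1 _name_" in the PROC SORT step.
COLUMNS = ["COL_A", "COL_B", "COL_C"]

rows = [
    {"ID": 111, "COL_A": 10, "COL_B": 20, "COL_C": 30},
    {"ID": 222, "COL_A": 15, "COL_B": 80, "COL_C": 10},
    {"ID": 333, "COL_A": 11, "COL_B": 10, "COL_C": 20},
    {"ID": 444, "COL_A": 20, "COL_B": 5, "COL_C": 20},
]


def rank_columns(row, columns):
    """Return column names sorted by (-value, name)."""
    return sorted(columns, key=lambda c: (-row[c], c))


ranked = {row["ID"]: rank_columns(row, COLUMNS) for row in rows}
# ranked[444] lists COL_A before COL_C because both hold 20 and "COL_A" sorts first.
```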

How do I repeat row labels in a matrix?

I have data showing me the dates grouped like this:
For security reasons, I had to remove the Customer Description detail, due to confidentiality.
How do I repeat the date column the same way you repeat the Row Labels in an Excel Pivot?
I've looked, but couldn't find a solution to this - this option should be available.
EDIT
When you have the following source data in Excel:
Date | Customer | Item Description | Qty Out | Unit Price | Sales
--------------------------------------------------------------------------------------------------------------------------------------------
14/08/2020 | Customer 1 | Item 11 | 4.00 | 65.00 | 260.00
14/08/2020 | Customer 2 | Item 12 | 56.00 | 12.00 | 672.00
14/08/2020 | Customer 3 | Item 13 | 64.00 | 35.00 | 2,240.00
14/08/2020 | Customer 4 | Item 14 | 29.00 | 65.00 | 1,885.00
15/08/2020 | Customer 2 | Item 15 | 746.00 | 12.00 | 8,952.00
15/08/2020 | Customer 3 | Item 16 | 14.00 | 75.00 | 1,050.00
15/08/2020 | Customer 4 | Item 17 | 45.00 | 741.00 | 33,345.00
15/08/2020 | Customer 5 | Item 18 | 456.00 | 125.00 | 57,000.00
15/08/2020 | Customer 6 | Item 19 | 925.00 | 17.00 | 15,725.00
16/08/2020 | Customer 4 | Item 20 | 6.00 | 532.00 | 3,192.00
16/08/2020 | Customer 5 | Item 21 | 56.00 | 94.00 | 5,264.00
16/08/2020 | Customer 6 | Item 22 | 546.00 | 37.00 | 20,202.00
You then pivot this data using Microsoft Excel, where you get the following:
You then choose the option to Repeat Item Labels as can be seen below:
After selecting this, you get my expected results I require in Power BI:
Is there not a function available like this in Power BI?
I am adding this for your reference as a workaround. Create a custom column in the Power Query Editor:
date_customer = Date.ToText([Date]) &" : "& [Customer]
Then add both Date and date_customer to the Matrix row level. The output is as below (using your sample data):
ANOTHER OPTION: add Date and Customer to the Matrix rows. The output will be as below (using your sample data):
This is also a meaningful output, as the dates are shown as group headers. But if you are required to show the date repeated on every row, consider the first option.

Dates in columns instead of rows in SAS - advantages?

A few of my clients using SAS store dates in columns.
e.g:
| Id | Variable1_201101 | Variable1_201102 | ... | Variable1_201909 | Variable2_201101 | Variable2_201102 | ... |
etc.
Instead of storing dates in rows:
| Id | Date | Variable1 | Variable2 |
As a result, they have a huge number of cells: even if an ID does not exist for a particular date, the first structure still holds an empty cell, whereas the second structure simply omits the row.
I have never come across such storage structures in SQL, where they would be far from ideal. Are there any advantages of such structures in SAS?
There is never a perfect storage structure. There are superior structures for solutions to problems at hand. Sometimes you have to reshape data for a particular solution, sometimes a procedure has grammar or mechanisms for reshaping within the procedure itself.
For example, examining a variable in different time frames in the TTEST procedure might use a PAIRED statement, which requires the values to be in different variables. Comparing Jan-2011 values to Jan-2012 values therefore calls for a structure with Variable1_201101 and Variable1_201201.
Disk space for sparse wide data can be reduced effectively with the COMPRESS= option, at the cost of CPU cycles for decompression. Depending on the data this can use significantly less disk, but the wide layout remains hard to work with in categorical analyses.
Traditional RDBMS has the categorical form (vertical) as a very common best practice, with indexing and foreign keys. If this is the original layout, you might need to pivot or reshape the data for a particular TTEST analysis.
When dealing with data from a NoSQL data store, you may encounter the horizontal form more often (because the underlying storage handles sparseness better).
Prepare sample data:
data have;
id=786;
Variable1_201101 = 78;
Variable1_201102 =67;
Variable1_201909 = 23;
Variable2_201101 = 34 ;
Variable2_201102 = 12;
run;
Now we have:
+-----+------------------+------------------+------------------+------------------+------------------+
| id | Variable1_201101 | Variable1_201102 | Variable1_201909 | Variable2_201101 | Variable2_201102 |
+-----+------------------+------------------+------------------+------------------+------------------+
| 786 | 78 | 67 | 23 | 34 | 12 |
+-----+------------------+------------------+------------------+------------------+------------------+
Use transpose with wildcards:
PROC TRANSPOSE DATA=have
OUT=have2
PREFIX=Column
NAME=Source
LABEL=Label
;
BY id;
VAR Variable1_: Variable2_:;
RUN;
Result:
+-----+------------------+---------+
| id | Source | Column1 |
+-----+------------------+---------+
| 786 | Variable1_201101 | 78 |
| 786 | Variable1_201102 | 67 |
| 786 | Variable1_201909 | 23 |
| 786 | Variable2_201101 | 34 |
| 786 | Variable2_201102 | 12 |
+-----+------------------+---------+
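Now we parse the date and the variable name out of Source:
data have3;
set have2;
format date ddmmyyp10.;
/* text after the first underscore, e.g. 201101 */
date_str=substr(Source,find(source,"_")+1);
/* YYMMN6. reads yyyymm and returns the first day of that month */
date=input(date_str,yymmn6.);
variable_name=substr(Source,1,find(source,"_")-1);
/* Optional*/
drop date_str source ;
run;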
PROC SORT
;
BY date id;
RUN;
And transpose again:
PROC TRANSPOSE DATA=have3
OUT=want (drop=source)
PREFIX=Column
NAME=Source
LABEL=Label
;
BY date id;
ID variable_name;
VAR Column1;
RUN;
Result:
+------------+-----+-----------------+-----------------+
| date | id | ColumnVariable1 | ColumnVariable2 |
+------------+-----+-----------------+-----------------+
| 01.01.2011 | 786 | 78 | 34 |
| 01.02.2011 | 786 | 67 | 12 |
| 01.09.2019 | 786 | 23 | . |
+------------+-----+-----------------+-----------------+
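The reshaping above can be summarised as: split each wide column name at the first underscore into a variable name and a yyyymm stamp, then group the values by date and id. A minimal Python sketch of that idea (not SAS; the function name wide_to_long and the dictionary layout are illustrative assumptions) using the same sample record:

```python
from datetime import date

# One wide record, matching the sample data set "have".
wide = {
    "id": 786,
    "Variable1_201101": 78,
    "Variable1_201102": 67,
    "Variable1_201909": 23,
    "Variable2_201101": 34,
    "Variable2_201102": 12,
}


def wide_to_long(record, key="id"):
    """Turn Variable_yyyymm columns into one row per (date, id)."""
    grouped = {}
    for name, value in record.items():
        if name == key:
            continue
        # Split at the first underscore, like find(source, "_") in the DATA step.
        variable, _, yyyymm = name.partition("_")
        month_start = date(int(yyyymm[:4]), int(yyyymm[4:6]), 1)
        grouped.setdefault((month_start, record[key]), {})[variable] = value
    return [
        {"date": d, key: k, **values}
        for (d, k), values in sorted(grouped.items())
    ]


long_rows = wide_to_long(wide)
```

A combination that does not exist in the wide record (Variable2 for 2019-09) is simply absent from the long row, which corresponds to the missing value shown in the result table.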

Transpose/Add new colums in SAS

I have this data in SAS:
+-----------+-------+
| PartnerNo | SAPNo |
+-----------+-------+
| P1 | 123 |
| P1 | 124 |
| P1 | 125 |
| P2 | 126 |
| P2 | 127 |
| P3 | 128 |
+-----------+-------+
Now I want a row per partner and a new column for each SAPNo.
Like this:
+-----------+------+------+------+
| PartnerNo | SAP1 | SAP2 | SAP3 |
+-----------+------+------+------+
| P1 | 123 | 124 | 125 |
| P2 | 126 | 127 | |
| P3 | 128 | | |
+-----------+------+------+------+
This needs to be dynamic. There could be up to 8 SAPNo per PartnerNo.
I'm using SAS Enterprise Guide 5.1
Proc transpose is powerful, but not that intuitive.
This is the solution:
proc transpose prefix=SAP
data=iHave
out=iNeed (drop=_name_);
by PartnerNo;
var SAPNo;
run;
Of course data specifies the input and out the output.
var SAPNo specifies that you want the values of that variable to be listed in columns.
SAS procedures often generate documentary variables, enclosed in underscores. Proc transpose, for instance, creates a _name_ field. If you specify multiple variables in the var statement, the procedure will generate multiple lines per by group, and _name_ then indicates where the values come from. We do not need it, so we drop it. Alternatively, we could have removed all variables starting with an underscore with drop=_:. This comes in handy for procedures generating multiple documentary variables if you need none of them.
By default, these columns are named col1, col2 etc. I changed this with the prefix option. Often this is not what you want, because the destination column is named in another variable. Then use the id statement.
by PartnerNo specifies that you want a new observation (or row, if you prefer database vocabulary over SAS vocabulary) for each value of that variable. Like any BY statement, it expects the input to be sorted by PartnerNo.
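To make the effect of by plus prefix concrete, here is a minimal Python sketch (not SAS) that groups the sample rows by partner and numbers the values SAP1, SAP2, and so on. The function name spread and the dictionary layout are illustrative assumptions.

```python
# Sample rows from the question: (PartnerNo, SAPNo)
pairs = [
    ("P1", 123), ("P1", 124), ("P1", 125),
    ("P2", 126), ("P2", 127),
    ("P3", 128),
]


def spread(pairs, prefix="SAP"):
    """Build one row per partner with numbered columns SAP1, SAP2, ..."""
    rows = {}
    for partner, sap in pairs:
        row = rows.setdefault(partner, {"PartnerNo": partner})
        # len(row) is 1 for a fresh row (only PartnerNo), so numbering starts at 1.
        row[f"{prefix}{len(row)}"] = sap
    return list(rows.values())


wide_rows = spread(pairs)
```

Partners with fewer values just end up with fewer keys, which corresponds to the blank cells in the desired output; nothing limits the count to 8, so the sketch is as dynamic as the PROC TRANSPOSE call.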

Stata table: how to compute difference column without adding a new variable?

In a panel data set, I'm using
table Region TIME if TIME==2014 | TIME==2020 | TIME==2030 | TIME==2040, contents(sum BF ) row
to create the following table:
------------------------------------------
| TIME
Region | 2014 2020 2030 2040
----------+-------------------------------
701 | 26751 27941 29944 31477
702 | 10456 11354 12723 13788
704 | 41550 44481 49340 53273
706 | 44976 47535 51940 55573
709 | 43258 44398 46612 48191
711 | 6580 7011 7539 7856
713 | 9036 10139 11776 13194
714 | 3091 3284 3563 3750
716 | 9144 9730 10724 11543
719 | 5719 6292 7258 8036
720 | 11509 12161 13188 13919
722 | 21403 22344 23839 25006
723 | 4927 5094 5345 5447
728 | 2460 2576 2761 2906
|
Total | 240860 254340 276552 293959
------------------------------------------
I'd like to add a fifth column, which displays the difference between the year 2014 and 2040 in %.
Question: is this possible WITHOUT adding a new variable to the dataset? For instance, by deriving the fifth column from a formula?
If not, how do I easily compute a new variable, taking account of the long format of the panel data set?
This isn't possible within table.
Your variable could be something like
egen total2014 = total(BF / (TIME == 2014)), by(Region)
egen total2040 = total(BF / (TIME == 2040)), by(Region)
gen pcdiff = 100 * (total2040 - total2014)/total2014
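after which you can tabulate its (mean) value for each region. The division works because dividing by a false condition (0) yields missing, which total() ignores, so only the observations for the chosen year are summed. See Section 10 in http://www.stata-journal.com/sjpdf.html?articlenum=dm0055 for this trick.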
You may need to go outside table for the tabulation, but if all else fails, collapse to a new dataset of totals and means.
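For checking the arithmetic outside Stata, here is a minimal Python sketch of the same calculation: sum BF per region for each of the two years, then take the percentage change. The function name pct_diff is an illustrative assumption, and the records below reuse two regional totals from the table above rather than raw panel rows.

```python
# (Region, TIME, BF) records; values are the 2014 and 2040 totals
# for regions 701 and 702 as shown in the table.
records = [
    (701, 2014, 26751), (701, 2040, 31477),
    (702, 2014, 10456), (702, 2040, 13788),
]


def pct_diff(records, start, end):
    """Percentage change in summed BF between two years, per region."""
    totals = {}
    for region, time, bf in records:
        if time in (start, end):  # other years are ignored, like missing in total()
            by_year = totals.setdefault(region, {start: 0, end: 0})
            by_year[time] += bf
    return {
        region: 100 * (t[end] - t[start]) / t[start]
        for region, t in totals.items()
    }


changes = pct_diff(records, 2014, 2040)
```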