Change column and row using SAS - sas

My data set is:
|ID | SNP | GEN |
|:--:|:---:|:---:|
|1 | A | AG |
|2 | A | GG |
|3 | A | AG |
|4 | A | AG |
I need to do this:
|SNP | ID1 | ID2 | ID3 | ID4 |
|:---|:---:|:---:|:---:|:---:|
|A | AG | GG | AG | AG |
I've tried to use the following command, but it did not work:
proc transpose data=data1; run;
Does anyone know how to do this in SAS?

You just need to add more options to your PROC TRANSPOSE call to get it to do what you want.
proc transpose data=data1 out=WANT prefix=ID ;
by SNP;
id ID;
var GEN;
run;
The BY statement processes each value of SNP as a separate group to transpose. Make sure the data is sorted by SNP if it contains more values than your example does. The ID statement tells it which variable to use to generate the new variable names. The VAR statement tells it which variable to transpose; to transpose character variables you must use a VAR statement. The PREFIX= option lets you specify characters to use as a prefix for the generated variable names.
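If DATA1 holds more than one SNP, the sorting step mentioned above might look like the following. This is a minimal sketch: the second SNP value B and its genotypes are invented for illustration and are not part of the question's data.

```sas
/* Illustrative data: SNP "B" and its genotypes are invented for this sketch */
data data1;
  input ID SNP $ GEN $;
  datalines;
1 A AG
1 B CC
2 A GG
2 B CT
3 A AG
4 A AG
;
run;

/* BY-group processing expects the rows to be grouped by SNP */
proc sort data=data1;
  by SNP ID;
run;

proc transpose data=data1 out=want(drop=_name_) prefix=ID;
  by SNP;   /* one output row per SNP */
  id ID;    /* ID values become the suffix of the new names: ID1, ID2, ... */
  var GEN;  /* character variables must be listed explicitly */
run;
```

For SNP B, which has no rows for IDs 3 and 4 in this made-up data, ID3 and ID4 come out blank.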

Related

How to create column with name of column with the highest value per each ID in SAS Enterprise Guide / PROC SQL?

I have table in SAS Enterprise Guide like below:
ID | COL_A | COL_B | COL_C
-----|-------|-------|------
111 | 10 | 20 | 30
222 | 15 | 80 | 10
333 | 11 | 10 | 20
444 | 20 | 5 | 20
Requirements:
I need to create a new column "TOP" that holds the name of the column with the highest value for each ID.
If two or more columns share the same highest value, take the first one in alphabetical order.
Desired output:
ID | COL_A | COL_B | COL_C | TOP
-----|-------|-------|--------|-------
111 | 10 | 20 | 30 | COL_C
222 | 15 | 80 | 10 | COL_B
333 | 11 | 10 | 20 | COL_C
444 | 20 | 5 | 20 | COL_A
Because:
for ID = 111 the highest value is in COL_C, so the name "COL_C" is in column "TOP"
for ID = 444 two columns have the highest value, so based on the alphabetical criterion the name "COL_A" is in column "TOP"
How can I do that in SAS Enterprise Guide or in PROC SQL?
You can do this with functions. Use MAX() to find the largest value. Use WHICHN() to find the index number of the first variable with that value. Use the VNAME() function to get the name of the variable with that index.
data want;
set have;
length TOP $32;
array list col_a col_b col_c;
top = vname(list[whichn(max(of list[*]),of list[*])]);
run;
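Since the question also asks about PROC SQL, the same rule could be written as a CASE expression. This is only a sketch, assuming the columns are exactly COL_A, COL_B and COL_C as in the example. With two or more arguments, MAX() in PROC SQL compares the columns within a row rather than acting as a summary function, and testing the columns in alphabetical order reproduces the tie-break.

```sas
proc sql;
  create table want as
  select *,
         case
           /* the first matching branch wins, so ties go to the earlier name */
           when col_a = max(col_a, col_b, col_c) then 'COL_A'
           when col_b = max(col_a, col_b, col_c) then 'COL_B'
           else 'COL_C'
         end as TOP length=32
  from have;
quit;
```

Unlike the array version, this has to be edited by hand whenever a column is added.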

PROC FORMAT does not work with BY statement in other procedures

I want to get the distribution of a variable that is categorized using PROC FORMAT. However, I do not get the frequency distribution based on the new groups when I use a BY statement. I discovered this while using PHREG on a larger data set. Sample code is below.
data p;
input v1 $ v2;
datalines;
A 1
A 2
A 1
A 2
B 3
B 2
C 1
D 1
;
RUN;
proc format;invalue $ v1f 'A','C'='Grp-1' 'B','D'='Grp-2'; run;
proc freq;tables v1; format v1 $v1f.;run;
proc sort;by v1; run;
proc freq;tables v2; by v1;format v1 $v1f.;run;
Not sure why the last PROC FREQ is not working as expected.
I need to keep changing these categories for iterative analysis and so I find PROC FORMAT easy to code but I am very confused as to why it is not working.
Any tips would be appreciated.
To FORMAT a variable you need to use a FORMAT. The INVALUE statement is for defining an INFORMAT. To define a FORMAT you need to use the VALUE statement instead.
FORMATs are used to convert values to text. INFORMATs are used to convert text to values. You use a FORMAT with the FORMAT and PUT statements and the PUT() function. You use an INFORMAT with the INFORMAT and INPUT statements and the INPUT() function.
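To make the distinction concrete, here is a small sketch that defines one of each and uses them with PUT() and INPUT(). The informat name GRPNUM and its numeric codes are made up for illustration.

```sas
proc format;
  /* VALUE defines a format: stored value -> display text */
  value $v1f 'A','C'='Grp-1'
             'B','D'='Grp-2';
  /* INVALUE defines an informat: text -> stored value (GRPNUM is a made-up name) */
  invalue grpnum 'Grp-1'=1
                 'Grp-2'=2;
run;

data _null_;
  length grp $5;
  grp  = put('A', $v1f.);      /* a format is used with PUT() */
  code = input(grp, grpnum.);  /* an informat is used with INPUT() */
  put grp= code=;              /* writes both values to the log */
run;
```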
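BY groups are formed from consecutive observations, and PROC SORT orders by the actual values, not the formatted values, so sorting by V1 does not bring together the rows that share a formatted value. If you want the frequencies of V1 crossed with V2, specify that in the TABLES statement.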
proc freq;
tables v1*v2;
format v1 $v1f.;
run;
Results
The FREQ Procedure
Table of v1 by v2
v1        v2
Frequency|
Percent  |
Row Pct  |
Col Pct  |       1|       2|       3|  Total
---------+--------+--------+--------+
Grp-1    |      3 |      2 |      0 |      5
         |  37.50 |  25.00 |   0.00 |  62.50
         |  60.00 |  40.00 |   0.00 |
         |  75.00 |  66.67 |   0.00 |
---------+--------+--------+--------+
Grp-2    |      1 |      1 |      1 |      3
         |  12.50 |  12.50 |  12.50 |  37.50
         |  33.33 |  33.33 |  33.33 |
         |  25.00 |  33.33 | 100.00 |
---------+--------+--------+--------+
Total           4        3        1        8
            50.00    37.50    12.50   100.00
If you want to sort by the formatted value then use the PUT() function to make a new variable.
data by_group;
set p ;
group = put(v1,$v1f.);
run;
proc sort data=by_group;
by group;
run;
Use the Proc FORMAT VALUE Statement to define a custom format.
Proc SQL and PUT() can be used to sort data in formatted order.
Proc FREQ BY processing will honor a formatted value when the contiguous underlying values in the data map to the same formatted value.
proc format;
value $v1f
'A','C'='Grp-1'
'B','D'='Grp-2';
run;
proc sql;
create table two as
select *
from p
order by put(v1,$v1f.), v1 /* order by formatted value, then by underlying value within it (for good measure, in case the raw data is viewed) */
;
quit;
proc freq data=two;
tables v2;
by v1;
format v1 $v1f.;
run;

Dates in columns instead of rows in SAS - advantages?

A few of my clients who use SAS store dates in columns,
e.g.:
| Id | Variable1_201101 | Variable1_201102 | ... | Variable1_201909 | Variable2_201101 | Variable2_201102 | ... |
etc.
Instead of storing dates in rows:
| Id | Date | Variable1 | Variable2 |
As a result, they have a huge number of cells: even if an ID does not exist on a particular date, the first structure still holds an empty cell for it, whereas in the second structure the row is simply omitted.
I have never seen such storage structures in SQL, where they would be far from ideal. Are there any advantages to such structures in SAS?
There is never a perfect storage structure. There are superior structures for solutions to problems at hand. Sometimes you have to reshape data for a particular solution, sometimes a procedure has grammar or mechanisms for reshaping within the procedure itself.
For example, examining a variable in different time frames in the TTEST procedure might use a PAIRED statement, which requires the values to be in different variables. Comparing Jan-2011 values to Jan-2012 values would therefore call for a structure with Variable1_201101 and Variable1_201201.
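A paired comparison on the wide layout could look roughly like this. The data set name WIDE and all values are made up; only the shape of the table matters.

```sas
/* Made-up values for four IDs; only the layout matters here */
data wide;
  input id Variable1_201101 Variable1_201201;
  datalines;
1 78 81
2 67 64
3 70 75
4 59 66
;
run;

proc ttest data=wide;
  /* each pair consists of two columns taken from the same row */
  paired Variable1_201101*Variable1_201201;
run;
```

With the vertical layout, the two periods would first have to be brought side by side before PAIRED could be used.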
Disk space for sparse wide data can be reduced effectively using the COMPRESS= options, at the cost of CPU cycles for decompression. Depending on the data this can use significantly less disk, but the wide layout is then hard to work with in alternative categorical analyses.
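As a rough illustration of the COMPRESS= data set option, the sketch below writes a deliberately sparse wide table. The data set name, the variable names and the sizes are all invented; the actual saving depends entirely on the data.

```sas
/* Made-up wide table: 200 columns per row, only one of them filled in */
data wide_sparse(compress=binary);
  array v[200] Variable1_m1-Variable1_m200;
  do id = 1 to 10000;
    call missing(of v[*]);      /* start each row empty */
    v[1 + mod(id, 200)] = id;   /* fill a single cell */
    output;
  end;
run;
```

When compression is requested, the SAS log should include a note on how much the data set size changed, which is an easy way to check whether it pays off for a given table.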
Traditional RDBMS has the categorical form (vertical) as a very common best practice, with indexing and foreign keys. If this is the original layout, you might need to pivot or reshape the data for a particular TTEST analysis.
When dealing with data from a NoSQL data store you might encounter the horizontal form more often (because the underlying storage handles sparseness better).
Prepare the data:
data have;
id=786;
Variable1_201101 = 78;
Variable1_201102 =67;
Variable1_201909 = 23;
Variable2_201101 = 34 ;
Variable2_201102 = 12;
run;
Now we have:
+-----+------------------+------------------+------------------+------------------+------------------+
| id | Variable1_201101 | Variable1_201102 | Variable1_201909 | Variable2_201101 | Variable2_201102 |
+-----+------------------+------------------+------------------+------------------+------------------+
| 786 | 78 | 67 | 23 | 34 | 12 |
+-----+------------------+------------------+------------------+------------------+------------------+
Use transpose with wildcards:
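PROC TRANSPOSE DATA=have
OUT=have2
PREFIX=Column
NAME=Source
LABEL=Label
;
BY id;
/* the colon is a name-prefix wildcard: every variable starting with Variable1_ or Variable2_ */
VAR Variable1_: Variable2_:;
RUN;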
Result:
+-----+------------------+---------+
| id | Source | Column1 |
+-----+------------------+---------+
| 786 | Variable1_201101 | 78 |
| 786 | Variable1_201102 | 67 |
| 786 | Variable1_201909 | 23 |
| 786 | Variable2_201101 | 34 |
| 786 | Variable2_201102 | 12 |
+-----+------------------+---------+
Now parse the date and the variable name out of Source:
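data have3;
set have2;
format date ddmmyyp10.;
/* the text after the underscore is the period, e.g. 201101 */
date_str=substr(Source,find(Source,"_")+1);
/* YYMMN6. reads yyyymm and returns the first day of that month */
date=input(date_str,yymmn6.);
variable_name=substr(Source,1,find(Source,"_")-1);
/* Optional */
drop date_str Source;
run;
PROC SORT DATA=have3;
BY date id;
RUN;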
And transpose again:
PROC TRANSPOSE DATA=have3
OUT=want (drop=source)
PREFIX=Column
NAME=Source
LABEL=Label
;
BY date id;
ID variable_name;
VAR Column1;
RUN;
Result:
+------------+-----+-----------------+-----------------+
| date | id | ColumnVariable1 | ColumnVariable2 |
+------------+-----+-----------------+-----------------+
| 01.01.2011 | 786 | 78 | 34 |
| 01.02.2011 | 786 | 67 | 12 |
| 01.09.2019 | 786 | 23 | . |
+------------+-----+-----------------+-----------------+

Transpose/Add new columns in SAS

I have this data in SAS:
+-----------+-------+
| PartnerNo | SAPNo |
+-----------+-------+
| P1 | 123 |
| P1 | 124 |
| P1 | 125 |
| P2 | 126 |
| P2 | 127 |
| P3 | 128 |
+-----------+-------+
Now I want a row per partner and a new column for each SAPNo.
Like this:
+-----------+------+------+------+
| PartnerNo | SAP1 | SAP2 | SAP3 |
+-----------+------+------+------+
| P1 | 123 | 124 | 125 |
| P2 | 126 | 127 | |
| P3 | 128 | | |
+-----------+------+------+------+
This needs to be dynamic. There could be up to 8 SAPNo per PartnerNo.
I'm using SAS Enterprise Guide 5.1.
PROC TRANSPOSE is powerful, but not that intuitive.
This is the solution:
proc transpose prefix=SAP
data=iHave
out=iNeed (drop=_name_);
by PartnerNo;
var SAPNo;
run;
Of course, data specifies the input and out the output.
var SAPNo specifies that you want the values of that variable to be listed in columns.
SAS procedures often generate documentary variables, enclosed in underscores. PROC TRANSPOSE, for instance, creates a _name_ field. If you specify multiple variables in the var statement, the procedure will generate multiple lines per id variable. This _name_ then indicates where the values come from. We do not need it, so we drop it. Alternatively, we could have removed all variables starting with an underscore with drop=_:. This comes in handy for procedures that generate multiple documentary variables if you need none of them.
By default, these columns are named col1, col2 etc. I changed this with the prefix option. Often this is not what you want, because the destination column is named in another variable. Then use the id statement.
by PartnerNo specifies that you want a new observation (or row, if you prefer database vocabulary over SAS vocabulary) for each value of that variable.
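The remark about the id statement can be sketched as follows, assuming a hypothetical extra variable Slot that says which output column each SAPNo belongs in. Slot, its values and the data set names iHave2 and iNeed2 are invented for this example.

```sas
/* Slot is a made-up variable for this sketch; it is not in the question's data */
data iHave2;
  input PartnerNo $ Slot $ SAPNo;
  datalines;
P1 Main 123
P1 Backup 124
P2 Main 126
P3 Backup 128
;
run;

proc transpose data=iHave2 out=iNeed2(drop=_name_) prefix=SAP_;
  by PartnerNo;   /* the input must be sorted (or at least grouped) by PartnerNo */
  id Slot;        /* column names come from Slot: SAP_Main, SAP_Backup */
  var SAPNo;
run;
```

Here the column a value lands in no longer depends on the row order within each partner, only on the value of Slot.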

Rescale Dataset using Power BI

I'm trying to rescale a dataset using Power BI Desktop. I've imported a dataset full of raw data, but I can't use row context together with an aggregate. I'm trying to accomplish this:
Data:
+---------+-----+
| Name | Bar |
+---------+-----+
| Alfred | 0 |
| Alfred | -1 |
| Alfred | 1 |
| Burt | 1 |
| Burt | 0 |
| Charlie | 1 |
| Charlie | 1 |
| Charlie | 0 |
+---------+-----+
Calculations:
Foo: = SUM(Bar) / COUNT(Bar) GROUP BY Name
Which would Generate this dataset:
+---------+-----+
| Name | Foo |
+---------+-----+
| Alfred | 0 |
| Burt | .5 |
| Charlie | .67 |
+---------+-----+
Final Calculation:
Score: = (#Foo - MIN(Foo)) / (MAX(Foo)-MIN(Foo))
The goal is to grade on a curve with a set of data. I can do it in Excel, but was hoping that Power BI could handle all the heavy lifting.
At this point it might be easier to do it all in SQL before bringing it into PowerBI, but that would make it significantly less dynamic (with date filters and the like). Thanks for any insight you might have!
I think you're looking for the GROUPBY DAX function. https://support.office.com/en-us/article/GROUPBY-Function-DAX-d6d064b2-fd8b-4c1b-97f8-c6d03cdf8ad0
You would then GROUPBY on the Name field and proceed from there. If you need to use the measure outside of a visual that groups by each Name (for example, to show the average score after applying the curve), you'll need to wrap that in a calculated table that includes the names and your measure projected as a column, and then do your aggregates (min/max/average) over that calculated table.