Transpose/Add new columns in SAS - sas

I have this data in SAS:
+-----------+-------+
| PartnerNo | SAPNo |
+-----------+-------+
| P1 | 123 |
| P1 | 124 |
| P1 | 125 |
| P2 | 126 |
| P2 | 127 |
| P3 | 128 |
+-----------+-------+
Now I want a row per partner and a new column for each SAPNo.
Like this:
+-----------+------+------+------+
| PartnerNo | SAP1 | SAP2 | SAP3 |
+-----------+------+------+------+
| P1 | 123 | 124 | 125 |
| P2 | 126 | 127 | |
| P3 | 128 | | |
+-----------+------+------+------+
This needs to be dynamic. There could be up to 8 SAPNo per PartnerNo.
I'm using SAS Enterprise Guide 5.1.

Proc transpose is powerful, but not that intuitive.
This is the solution:
proc transpose prefix=SAP
data=iHave
out=iNeed (drop=_name_);
by PartnerNo;
var SAPNo;
run;
Of course, data specifies the input and out the output.
var SAPNo specifies that you want the values of that variable to be listed in columns.
SAS procedures often generate documentary variables, whose names are enclosed in underscores. Proc transpose, for instance, creates a _name_ variable. If you specify multiple variables in the var statement, the procedure generates multiple rows per by group, and _name_ then indicates which variable the values come from. We do not need it here, so we drop it. Alternatively, we could have removed all variables starting with an underscore with drop=_:. This comes in handy for procedures that generate several documentary variables when you need none of them.
By default, the new columns are named col1, col2, etc. I changed this with the prefix option. Often this is not what you want, because the name of the destination column is held in another variable. In that case, use the id statement.
by PartnerNo specifies that you want a new observation (or row, if you prefer database vocabulary over SAS vocabulary) for each value of that variable. The input must be sorted by PartnerNo for this to work.
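For readers who think more easily in a general-purpose language, here is a minimal sketch of what the by / var / prefix combination does, written in plain Python rather than SAS. The helper name transpose_by is hypothetical and is not part of any SAS interface; the rows are the sample data from the question.

```python
from collections import defaultdict

# Sample rows from the question: (PartnerNo, SAPNo), already sorted by PartnerNo.
rows = [("P1", 123), ("P1", 124), ("P1", 125),
        ("P2", 126), ("P2", 127), ("P3", 128)]

def transpose_by(rows, prefix="SAP"):
    """Hypothetical helper: one output row per key, one numbered column per value."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # The widest group decides how many columns exist; shorter groups are padded.
    width = max(len(values) for values in groups.values())
    header = ["PartnerNo"] + [f"{prefix}{i}" for i in range(1, width + 1)]
    body = [[key] + values + [None] * (width - len(values))
            for key, values in groups.items()]
    return header, body

header, wide = transpose_by(rows)
print(header)
for line in wide:
    print(line)
```

The number of columns is driven by the data, which is the "dynamic" behaviour the question asks for: a partner with 8 values would simply widen the header to SAP8.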

Related

Dates in columns instead of rows in SAS - advantages?

A few of my clients who use SAS store dates in columns.
e.g:
| Id | Variable1_201101 | Variable1_201102 | ... | Variable1_201909 | Variable2_201101 | Variable2_201102 | ... |
etc.
Instead of storing dates in rows:
| Id | Date | Variable1 | Variable2 |
As a result, they have a huge number of cells: even if an ID does not exist for a particular date, the first structure still holds an empty cell, whereas in the second structure the row is simply omitted.
I have never come across such a storage structure in SQL, where it would be far from an ideal solution. Are there any advantages to such structures in SAS?
There is never a perfect storage structure. There are superior structures for solutions to problems at hand. Sometimes you have to reshape data for a particular solution, sometimes a procedure has grammar or mechanisms for reshaping within the procedure itself.
For example, examining a variable in different time frames with the TTEST procedure might use a PAIRED statement, which requires the values to be in different variables. Comparing Jan-2011 values to Jan-2012 values would therefore call for a structure with Variable1_201101 and Variable1_201201.
Disk space for sparse wide data can be reduced effectively using the COMPRESS= options, at the cost of CPU cycles for decompression. Depending on the data, this can use significantly less disk space, but the wide layout remains hard to work with in other, categorical analyses.
Traditional RDBMS has the categorical form (vertical) as a very common best practice, with indexing and foreign keys. If this is the original layout, you might need to pivot or reshape the data for a particular TTEST analysis.
When dealing with data from a NoSQL data store, you might encounter the horizontal form more often (because the underlying store handles sparseness better).
Prepare code:
data have;
    id = 786;
    Variable1_201101 = 78;
    Variable1_201102 = 67;
    Variable1_201909 = 23;
    Variable2_201101 = 34;
    Variable2_201102 = 12;
run;
Now, we have :
+-----+------------------+------------------+------------------+------------------+------------------+
| id | Variable1_201101 | Variable1_201102 | Variable1_201909 | Variable2_201101 | Variable2_201102 |
+-----+------------------+------------------+------------------+------------------+------------------+
| 786 | 78 | 67 | 23 | 34 | 12 |
+-----+------------------+------------------+------------------+------------------+------------------+
Use transpose with wildcards:
PROC TRANSPOSE DATA=have
    OUT=have2
    PREFIX=Column
    NAME=Source
    LABEL=Label
    ;
    BY id;
    VAR Variable1_: Variable2_:;
RUN;
Result:
+-----+------------------+---------+
| id | Source | Column1 |
+-----+------------------+---------+
| 786 | Variable1_201101 | 78 |
| 786 | Variable1_201102 | 67 |
| 786 | Variable1_201909 | 23 |
| 786 | Variable2_201101 | 34 |
| 786 | Variable2_201102 | 12 |
+-----+------------------+---------+
Now we parse the variable name and the date out of Source:
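data have3;
    set have2;
    format date ddmmyyp10.;
    /* Text after the underscore, e.g. 201101 */
    date_str = substr(Source, find(Source, "_") + 1);
    /* YYMMN6. reads yyyymm and returns the first day of that month */
    date = input(date_str, yymmn6.);
    /* Text before the underscore, e.g. Variable1 */
    variable_name = substr(Source, 1, find(Source, "_") - 1);
    /* Optional */
    drop date_str Source;
run;
PROC SORT DATA=have3;
    BY date id;
RUN;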
And transpose again:
PROC TRANSPOSE DATA=have3
    OUT=want (drop=source)
    PREFIX=Column
    NAME=Source
    LABEL=Label
    ;
    BY date id;
    ID variable_name;
    VAR Column1;
RUN;
Result:
+------------+-----+-----------------+-----------------+
| date | id | ColumnVariable1 | ColumnVariable2 |
+------------+-----+-----------------+-----------------+
| 01.01.2011 | 786 | 78 | 34 |
| 01.02.2011 | 786 | 67 | 12 |
| 01.09.2019 | 786 | 23 | . |
+------------+-----+-----------------+-----------------+
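The whole wide-to-long reshaping (transpose, split the name at the underscore, transpose back by date) can also be sketched outside SAS. The following is an illustrative plain-Python version under the same naming assumption, namely that every column other than id looks like name_yyyymm. The function wide_to_long is a hypothetical helper, not a SAS or library routine.

```python
from datetime import date

# The single wide record from the example above.
record = {"id": 786,
          "Variable1_201101": 78, "Variable1_201102": 67, "Variable1_201909": 23,
          "Variable2_201101": 34, "Variable2_201102": 12}

def wide_to_long(record, id_field="id"):
    """Hypothetical helper: one output row per (date, id), one key per variable."""
    out = {}
    for name, value in record.items():
        if name == id_field:
            continue
        # Split "Variable1_201101" into the variable name and the yyyymm part.
        variable, yyyymm = name.rsplit("_", 1)
        month_start = date(int(yyyymm[:4]), int(yyyymm[4:6]), 1)
        out.setdefault((month_start, record[id_field]), {})[variable] = value
    # Sort by (date, id), mirroring the BY date id step.
    return [{"date": d, id_field: i, **values}
            for (d, i), values in sorted(out.items())]

long_rows = wide_to_long(record)
for row in long_rows:
    print(row)
```

Where a variable has no value for a month (Variable2 in 2019-09 here), the key is simply absent from that row, which corresponds to the missing value (.) in the SAS result.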

Preserving data more than once

I am writing some code in Stata and I have already used preserve once. However, now I would like to preserve again, without using restore.
I know this will give an error message, but does it save up to the new preserve area?
No, preserving twice without restoring in-between simply throws an error:
sysuse auto, clear
preserve
drop mpg
preserve
already preserved
r(621);
However, you can do something similar using temporary files. From help macro:
"...tempfile assigns names to the specified local macro names that may be used as names for temporary files. When the program or do-file concludes, any
datasets created with these assigned names are erased..."
Consider the following toy example:
tempfile one two three
sysuse auto, clear
save `one'
drop mpg
save `two'
drop price
save `three'
use `two'
list price in 1/5
+-------+
| price |
|-------|
1. | 4,099 |
2. | 4,749 |
3. | 3,799 |
4. | 4,816 |
5. | 7,827 |
+-------+
use `one'
list mpg in 1/5
+-----+
| mpg |
|-----|
1. | 22 |
2. | 17 |
3. | 22 |
4. | 20 |
5. | 15 |
+-----+

Change column and row using SAS

My data set is:
|ID | SNP | GEN |
|:--:|:---:|:---:|
|1 | A | AG |
|2 | A | GG |
|3 | A | AG |
|4 | A | AG |
I need to do this:
|SNP | ID1 | ID2 | ID3 | ID4 |
|:---|:---:|:---:|:---:|:---:|
|A | AG | GG | AG | AG |
I've tried to use the following command, but it did not work:
proc transpose data=data1; run;
Does anyone know how to do this in SAS?
You just need to add more options to your PROC TRANSPOSE call to get it to do what you want.
proc transpose data=data1 out=WANT prefix=ID ;
by SNP;
id ID;
var GEN;
run;
The BY statement will process each value of SNP as a separate set to transpose. Make sure the data is sorted by SNP if there are more values than in your example. The ID statement tells it which variable to use to generate the new variable names. The VAR statement tells it which variable to transpose; to transpose character variables you must use a VAR statement. The PREFIX= option lets you specify characters to use as a prefix for the generated variable names.
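To make the roles of BY, ID, VAR and PREFIX= concrete, here is a small plain-Python sketch of the same reshaping (not SAS). The helper pivot_by_id is hypothetical; the rows are the sample data from the question.

```python
# Sample rows from the question: (ID, SNP, GEN).
snp_rows = [(1, "A", "AG"), (2, "A", "GG"), (3, "A", "AG"), (4, "A", "AG")]

def pivot_by_id(rows, prefix="ID"):
    """Hypothetical helper: one row per SNP, one column per ID value."""
    wide = {}
    for id_, snp, gen in rows:
        # BY SNP       -> one output row per distinct SNP
        # ID ID        -> the column name is built from the ID value
        # PREFIX=ID    -> ...with "ID" in front of it
        # VAR GEN      -> the cell holds the GEN value
        wide.setdefault(snp, {"SNP": snp})[f"{prefix}{id_}"] = gen
    return list(wide.values())

snp_wide = pivot_by_id(snp_rows)
print(snp_wide)
```

Unlike the numbered-column case, the column names here come from the data itself, so an ID that is missing for one SNP leaves that column empty for that row rather than shifting the other values left.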

Create new variable by dividing column by observation in last row

I want to create a new variable, say cheese2, that takes cheese and divides every value by the last observation (2921333).
+----------+
| cheese |
|----------|
1. | 3060000 |
2. | 840333.3 |
3. | 1839667 |
4. | 1.17e+07 |
5. | 1374000 |
|----------|
6. | 2092333 |
7. | 341000 |
8. | 3149000 |
9. | 3557667 |
10. | 590666.7 |
|----------|
11. | 8937000 |
12. | 4142000 |
13. | 2624000 |
14. | 1973667 |
15. | 2921333 |
I would also like to do this for multiple columns at once i.e. divide multiple columns by the last row of my data set.
In Stata terminology,
create a new variable by dividing a column by the observation in the last row
becomes
create a new variable by dividing a variable by the value in the last observation.
Such a question suggests that you are storing totals in your last observation, spreadsheet style. Such a practice is undoubtedly convenient for what you are asking, but it creates obligations to exclude the last observation from almost every other manipulation and to maintain precisely the same sort order, and would generally be considered a bad idea therefore.
All that said,
gen cheese2 = cheese/cheese[_N]
is what you ask and a loop over several variables could be
foreach v of var frog newt toad lizard dragon {
gen `v'2 = `v'/`v'[_N]
}
See also the help for foreach.
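As a cross-check of the arithmetic, the same calculation can be sketched in plain Python (not Stata). The cheese values are the last five displayed in the question; the second column, frog, holds made-up numbers that are purely illustrative.

```python
# Each column is a list; the last element plays the role of v[_N] in Stata.
columns = {
    "cheese": [8937000, 4142000, 2624000, 1973667, 2921333],
    "frog": [2.0, 4.0, 6.0, 8.0, 10.0],  # hypothetical values, not from the question
}

# Build a new "<name>2" column for every input column, as the foreach loop does.
scaled = {f"{name}2": [value / values[-1] for value in values]
          for name, values in columns.items()}

for name, values in scaled.items():
    print(name, values)
```

The last element of every new column is 1 by construction, which is a quick way to confirm that the divisor really was the final observation.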

Rescale Dataset using Power BI

I'm trying to rescale a dataset using Power BI Desktop. I've imported a dataset full of raw data, but I can't use row context together with an aggregate. I'm trying to accomplish this:
Data:
+---------+-----+
| Name | Bar |
+---------+-----+
| Alfred | 0 |
| Alfred | -1 |
| Alfred | 1 |
| Burt | 1 |
| Burt | 0 |
| Charlie | 1 |
| Charlie | 1 |
| Charlie | 0 |
+---------+-----+
Calculations:
Foo: = SUM(Bar) / COUNT(Bar) GROUP BY Name
Which would Generate this dataset:
+---------+-----+
| Name | Foo |
+---------+-----+
| Alfred | 0 |
| Burt | .5 |
| Charlie | .67 |
+---------+-----+
Final Calculation:
Score: = (#Foo - MIN(Foo)) / (MAX(Foo)-MIN(Foo))
The goal is to grade on a curve with a set of data. I can do it in excel, but was hoping that Power BI could handle all the heavy lifting.
At this point it might be easier to do it all in SQL before bringing it into PowerBI, but that would make it significantly less dynamic (with date filters and the like). Thanks for any insight you might have!
I think you're looking for the GROUPBY DAX function. https://support.office.com/en-us/article/GROUPBY-Function-DAX-d6d064b2-fd8b-4c1b-97f8-c6d03cdf8ad0
You would then GROUPBY on the Name field and proceed from there. If you need to use the measure outside of a visual that groups by each Name (for example, to show the average score after applying the curve), then you'll need to wrap that in a calculated table where you include the names and your measure projected as a column, and then do your aggregates (min/max/average) over that calculated table.
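The two-stage calculation described in the question (average per Name, then min-max rescaling over those averages) can be checked with a short plain-Python sketch. This is not DAX and does not use any Power BI API; curve_scores is a hypothetical helper that only illustrates the arithmetic the calculated table would have to reproduce.

```python
from collections import defaultdict

# Sample data from the question: (Name, Bar).
data = [("Alfred", 0), ("Alfred", -1), ("Alfred", 1),
        ("Burt", 1), ("Burt", 0),
        ("Charlie", 1), ("Charlie", 1), ("Charlie", 0)]

def curve_scores(data):
    """Hypothetical helper: group averages (Foo), then min-max rescaling (Score)."""
    totals = defaultdict(lambda: [0, 0])  # name -> [sum, count]
    for name, bar in data:
        totals[name][0] += bar
        totals[name][1] += 1
    foo = {name: s / n for name, (s, n) in totals.items()}
    lo, hi = min(foo.values()), max(foo.values())
    # Assumes at least two distinct averages; hi == lo would divide by zero.
    return {name: (value - lo) / (hi - lo) for name, value in foo.items()}

scores = curve_scores(data)
print(scores)
```

With the sample data the averages are 0, 0.5 and about 0.67, so the rescaled scores come out as 0 for Alfred, 0.75 for Burt and 1 for Charlie. The key point is that MIN and MAX are taken over the grouped averages, not over the raw rows.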