I want to get the frequency distribution of a variable that is categorized using PROC FORMAT. However, I do not get the distribution based on the new groups when I use a BY statement. I discovered this while using PHREG on a larger dataset. Sample code is below.
data p;
input v1 $ v2;
datalines;
A 1
A 2
A 1
A 2
B 3
B 2
C 1
D 1
;
RUN;
proc format;invalue $ v1f 'A','C'='Grp-1' 'B','D'='Grp-2'; run;
proc freq;tables v1; format v1 $v1f.;run;
proc sort;by v1; run;
proc freq;tables v2; by v1;format v1 $v1f.;run;
Not sure why the last PROC FREQ is not working as expected.
I need to keep changing these categories for iterative analysis and so I find PROC FORMAT easy to code but I am very confused as to why it is not working.
Any tips would be appreciated.
To FORMAT a variable you need to use a FORMAT. The INVALUE statement is for defining an INFORMAT. To define a FORMAT you need to use the VALUE statement instead.
FORMATs are used to convert values to text. INFORMATs are used to convert text to values. You use a FORMAT with the FORMAT and PUT statements and the PUT() function. You use an INFORMAT with the INFORMAT and INPUT statements and the INPUT() function.
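As a minimal sketch of the difference, the step below defines one of each and uses them with PUT() and INPUT(). The names $grpf, grpinf, grp_text and grp_code are made up for illustration and are not part of the question.

```sas
proc format;
  /* FORMAT: stored value -> display text */
  value $grpf 'A','C'='Grp-1'
              'B','D'='Grp-2';
  /* INFORMAT: text -> stored (here numeric) value */
  invalue grpinf 'A','C'=1
                 'B','D'=2;
run;

data _null_;
  length grp_text $5;
  grp_text = put('A', $grpf.);     /* a format is applied with PUT() */
  grp_code = input('B', grpinf.);  /* an informat is applied with INPUT() */
  put grp_text= grp_code=;         /* writes both values to the log */
run;
```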
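In a procedure, a BY group is a run of consecutive observations. Because the data are sorted by the raw values of V1 (A, B, C, D), observations that share a formatted value are not all adjacent, so you still get a separate BY group for each run rather than one per new category. If you want the frequencies of V1 crossed with V2 specify that in the TABLES statement.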
proc freq;
tables v1*v2;
format v1 $v1f.;
run;
Results
The FREQ Procedure
Table of v1 by v2
v1 v2
Frequency|
Percent |
Row Pct |
Col Pct | 1| 2| 3| Total
---------+--------+--------+--------+
Grp-1 | 3 | 2 | 0 | 5
| 37.50 | 25.00 | 0.00 | 62.50
| 60.00 | 40.00 | 0.00 |
| 75.00 | 66.67 | 0.00 |
---------+--------+--------+--------+
Grp-2 | 1 | 1 | 1 | 3
| 12.50 | 12.50 | 12.50 | 37.50
| 33.33 | 33.33 | 33.33 |
| 25.00 | 33.33 | 100.00 |
---------+--------+--------+--------+
Total 4 3 1 8
50.00 37.50 12.50 100.00
If you want to sort by the formatted value then use the PUT() function to make a new variable.
data by_group;
set p ;
group = put(v1,$v1f.);
run;
proc sort data=by_group;
by group;
run;
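With the data sorted by the new variable, the BY statement can then use it directly. This is a small usage sketch that assumes $v1f. was created with a VALUE statement and that by_group was built and sorted as above.

```sas
proc freq data=by_group;
  by group;      /* one table per formatted category, e.g. Grp-1 and Grp-2 */
  tables v2;
run;
```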
Use the Proc FORMAT VALUE Statement to define a custom format.
Proc SQL and PUT() can be used to sort data in formatted order.
Proc FREQ BY processing will honor a formatted value when the contiguous underlying values in the data map to the same formatted value.
proc format;
value $v1f
'A','C'='Grp-1'
'B','D'='Grp-2';
run;
proc sql;
create table two as
select *
from p
/* order by formatted value, then by underlying value within each group */
order by put(v1,$v1f.), v1
;
quit;
proc freq data=two;
tables v2;
by v1;
format v1 $v1f.;
run;
I have table in SAS Enterprise Guide like below:
ID | COL_A | COL_B | COL_C
-----|-------|-------|------
111 | 10 | 20 | 30
222 | 15 | 80 | 10
333 | 11 | 10 | 20
444 | 20 | 5 | 20
Requirements:
I need to create new columns TOP_1, TOP_2, TOP_3 that hold the names of the columns COL_A, COL_B, COL_C ordered from the highest value to the lowest per ID.
If two or more columns have the same value, take the one that comes first alphabetically.
In TOP_1 - name of column with the highest value per ID
In TOP_2 - name of column with the second highest value per ID
In TOP_3 - name of column with the third highest value per ID
Desired output:
ID | COL_A | COL_B | COL_C | TOP_1 | TOP_2 | TOP_3
-----|-------|-------|--------|--------|---------|---------
111 | 10 | 20 | 30 | COL_C | COL_B | COL_A
222 | 15 | 80 | 10 | COL_B | COL_A | COL_C
333 | 11 | 10 | 20 | COL_C | COL_A | COL_B
444 | 20 | 5 | 20 | COL_A | COL_C | COL_B
Because:
for ID = 111 the highest value is in COL_C, so the name "COL_C" goes into column "TOP_1"; the second highest value is in COL_B, so the name "COL_B" goes into column "TOP_2", and so on.
for ID = 444 two columns share the highest value, so we apply the alphabetical rule: the name "COL_A" goes into column "TOP_1" and the name "COL_C" goes into column "TOP_2".
How can I do that in SAS Enterprise Guide or in PROC SQL?
First let's convert your listing into an actual dataset.
data have;
input ID COL_A COL_B COL_C ;
cards;
111 10 20 30
222 15 80 10
333 11 10 20
444 20 5 20
;
Use PROC TRANSPOSE to convert your COL_: variables into observations.
proc transpose data=have out=tall;
by id col_a col_b col_c;
var col_a col_b col_c;
run;
You can then sort by descending values (and ascending variable name):
proc sort data=tall;
by id col_a col_b col_c descending col1 _name_;
run;
And use another PROC TRANSPOSE to make your new variables:
proc transpose data=tall out=want(drop=_name_ _label_) prefix=TOP_;
by id col_a col_b col_c;
var _name_;
run;
If the data is really large (or you have a lot more than 3 columns to check) you might want to eliminate COL_A COL_B and COL_C from the BY group and instead just merge the resulting TOP_: variable back onto the original dataset.
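A rough sketch of that variant follows. It assumes HAVE is already sorted by ID and that ID is unique; the intermediate dataset name TOP is made up for illustration.

```sas
/* Tall version keyed by ID only */
proc transpose data=have out=tall;
  by id;
  var col_a col_b col_c;
run;

/* Highest value first; ties broken by variable name */
proc sort data=tall;
  by id descending col1 _name_;
run;

/* One row per ID with TOP_1, TOP_2, TOP_3 */
proc transpose data=tall out=top(drop=_name_ _label_) prefix=TOP_;
  by id;
  var _name_;
run;

/* Attach the TOP_: variables to the original columns */
data want;
  merge have top;
  by id;
run;
```

Keeping only ID in the BY statements means the sort and the transposes carry one key instead of every original column.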
I’m trying to categorize customers, and one of the columns I need to make is a customer’s primary product with ties broken by the amount the customer spent. Here's an example of what the ties in my raw data look like:
|Customer #|Product|Spend|
|Customer 1| A |800 |
|Customer 1| B |900 |
|Customer 2| C |100 |
|Customer 2| C |550 |
|Customer 2| B |200 |
|Customer 2| B |300 |
Customers 1 and 2 both have ties in the number of times they purchased a product, so for the Primary Product field I need to make, I need to use the amount the customers spent on each product to break the ties. Customer 1 spent more on ProdB than on ProdA, so Customer 1's Primary Product should be B.
So I need to make something like this:
| Customer #|Prod A |Prod B|Prod C|Primary Product|
| Customer 1| 1 | 1 | 0 | B |
| Customer 2| 0 | 2 | 2 | C |
I have entries for each customer’s transactions, showing which products they bought and how much they paid, as well as a Customer ID number unique to each customer.
Summing up the number of times a customer buys a product is no problem, but I don't know how to write the code to break the tie.
How do I write code to choose a customer’s primary product that will break ties based on the amount the customer spent?
Broken down into smaller steps. In the future please include whatever you've tried in your questions.
data have;
infile cards delimiter='|' dsd truncover;
length customerID $10.;
input customerID $ product $ Spend;
cards;
Customer 1| A |800 |
Customer 1| B |900 |
Customer 2| C |100 |
Customer 2| C |550 |
Customer 2| B |200 |
Customer 2| B |300 |
;;;;
run;
*summarize to totals first;
proc means data=have nway noprint ;
class customerID product;
var SPEND;
output out=product_total sum=total_spend N=COUNT;
run;
*sort as needed;
proc sort data=product_total;
by customerID descending COUNT descending total_spend;
run;
*identify the primary product;
data want_long;
set product_total;
by customerID;
retain primary_product;
if first.customerID then primary_product=product;
run;
*sort for order after transpose;
proc sort data=want_long;
by customerID primary_PRODUCT PRODUCT;
run;
*transpose to desired structure;
proc transpose data=want_long out=want_wide prefix=PROD_;
by customerID primary_Product;
id product;
var Count;
run;
A few of my clients who use SAS store dates in columns.
e.g:
| Id | Variable1_201101 | Variable1_201102 | ... | Variable1_201909 | Variable2_201101 | Variable2_201102 | ... |
etc.
Instead of storing dates in rows:
| Id | Date | Variable1 | Variable2 |
As a result, they have a huge number of cells: if an ID does not exist for a particular date, the first structure still holds an empty cell, whereas in the second structure the row is simply omitted.
I have never come across such storage structures in SQL, where they would not be a good solution. Are there any advantages to such structures in SAS?
There is never a perfect storage structure. There are superior structures for solutions to problems at hand. Sometimes you have to reshape data for a particular solution, sometimes a procedure has grammar or mechanisms for reshaping within the procedure itself.
For example, examining a variable in different time frames with the TTEST procedure might use a PAIRED statement, which requires the values to be in different variables. Thus, to compare Jan-2011 values with Jan-2012 values it makes sense to have a structure with Variable1_201101 and Variable1_201201.
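A minimal sketch of such a paired comparison is shown here. The dataset WIDE and its values are invented purely for illustration.

```sas
/* Hypothetical wide data: one row per id, one column per period */
data wide;
  input id Variable1_201101 Variable1_201201;
  datalines;
1 78 81
2 67 70
3 23 30
4 45 44
;
run;

proc ttest data=wide;
  /* each pair is formed within a row, so both periods must be columns */
  paired Variable1_201101*Variable1_201201;
run;
```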
Disk space for sparse wide data can be reduced effectively using COMPRESS= options, at the cost of decompression CPU cycles. Depending on the data it can be significantly less disk use, but then is hard to deal with in alternate categorical analysis.
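As an illustration, compression can be requested per dataset with the COMPRESS= dataset option. The dataset names WIDE and WIDE_COMPRESSED are hypothetical, and the actual saving depends on the data.

```sas
/* Write a compressed copy of a sparse wide dataset */
data wide_compressed(compress=yes);
  set wide;
run;
```

COMPRESS=BINARY is another accepted value, and the COMPRESS= system option applies a setting to all datasets created afterwards in the session.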
Traditional RDBMS has the categorical form (vertical) as a very common best practice, with indexing and foreign keys. If this is the original layout, you might need to pivot or reshape the data for a particular TTEST analysis.
When dealing with data from a NoSQL data store, you may more often encounter the horizontal form (because the underlying storage handles sparseness better).
Prepare code:
data have;
id=786;
Variable1_201101 = 78;
Variable1_201102 =67;
Variable1_201909 = 23;
Variable2_201101 = 34 ;
Variable2_201102 = 12;
run;
Now, we have :
+-----+------------------+------------------+------------------+------------------+------------------+
| id | Variable1_201101 | Variable1_201102 | Variable1_201909 | Variable2_201101 | Variable2_201102 |
+-----+------------------+------------------+------------------+------------------+------------------+
| 786 | 78 | 67 | 23 | 34 | 12 |
+-----+------------------+------------------+------------------+------------------+------------------+
Use transpose with wildcards:
PROC TRANSPOSE DATA=have
OUT=have2
PREFIX=Column
NAME=Source
LABEL=Label
;
BY id;
VAR Variable1_: Variable2_:;
RUN;
Result:
+-----+------------------+---------+
| id | Source | Column1 |
+-----+------------------+---------+
| 786 | Variable1_201101 | 78 |
| 786 | Variable1_201102 | 67 |
| 786 | Variable1_201909 | 23 |
| 786 | Variable2_201101 | 34 |
| 786 | Variable2_201102 | 12 |
+-----+------------------+---------+
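Now parse each source name into a date and a variable name:
data have3;
set have2;
format date ddmmyyp10.;
/* text after the underscore, e.g. 201101 */
date_str=substr(Source,find(Source,"_")+1);
/* YYMMN6. reads yyyymm and returns the first day of that month */
date=input(date_str,yymmn6.);
/* text before the underscore, e.g. Variable1 */
variable_name=substr(Source,1,find(Source,"_")-1);
/* Optional*/
drop date_str source ;
run;
PROC SORT DATA=have3;
BY date id;
RUN;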
And transpose again:
PROC TRANSPOSE DATA=have3
OUT=want (drop=source)
PREFIX=Column
NAME=Source
LABEL=Label
;
BY date id;
ID variable_name;
VAR Column1;
RUN;
Result:
+------------+-----+-----------------+-----------------+
| date | id | ColumnVariable1 | ColumnVariable2 |
+------------+-----+-----------------+-----------------+
| 01.01.2011 | 786 | 78 | 34 |
| 01.02.2011 | 786 | 67 | 12 |
| 01.09.2019 | 786 | 23 | . |
+------------+-----+-----------------+-----------------+
My data set is:
|ID | SNP | GEN |
|:--:|:---:|:---:|
|1 | A | AG |
|2 | A | GG |
|3 | A | AG |
|4 | A | AG |
I need to do this:
|SNP | ID1 | ID2 | ID3 | ID4 |
|:---|:---:|:---:|:---:|:---:|
|A | AG | GG | AG | AG |
I've tried to use the following command, but it did not work:
proc transpose data=data1; run;
Does anyone know how to do this in SAS?
You just need to add more options to your PROC TRANSPOSE call to get it to do what you want.
proc transpose data=data1 out=WANT prefix=ID ;
by SNP;
id ID;
var GEN;
run;
The BY statement will process each value of SNP as a separate set to transpose. Make sure the data is sorted by SNP if there are more values than in your example. The ID statement tells it which variable to use to generate the new variable names. The VAR statement tells it which variable to transpose; to transpose character variables you must use a VAR statement. The PREFIX= option lets you specify characters to use as a prefix for the generated variable names.
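If the real data has several SNP values, a sort step before the transpose could look like the sketch below. Dropping _name_ from the output is an optional addition, not something the question asked for.

```sas
/* BY processing needs the data ordered by SNP */
proc sort data=data1;
  by SNP ID;
run;

proc transpose data=data1 out=WANT(drop=_name_) prefix=ID;
  by SNP;
  id ID;
  var GEN;
run;
```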
I have this data in SAS:
+-----------+-------+
| PartnerNo | SAPNo |
+-----------+-------+
| P1 | 123 |
| P1 | 124 |
| P1 | 125 |
| P2 | 126 |
| P2 | 127 |
| P3 | 128 |
+-----------+-------+
Now I want a row per partner and a new column for each SAPNo.
Like this:
+-----------+------+------+------+
| PartnerNo | SAP1 | SAP2 | SAP3 |
+-----------+------+------+------+
| P1 | 123 | 124 | 125 |
| P2 | 126 | 127 | |
| P3 | 128 | | |
+-----------+------+------+------+
This needs to be dynamic. There could be up to 8 SAPNo per PartnerNo.
I'm using SAS Enterprise Guide 5.1
Proc transpose is powerful, but not that intuitive.
This is the solution:
proc transpose prefix=SAP
data=iHave
out=iNeed (drop=_name_);
by PartnerNo;
var SAPNo;
run;
Of course, data= specifies the input and out= the output.
var SAPNo specifies that you want the values of that variable to be listed in columns.
SAS procedures often generate documentary variables whose names are enclosed in underscores. Proc transpose, for instance, creates a _name_ variable. If you specify multiple variables in the var statement, the procedure will generate multiple rows per value of the by variable, and _name_ then indicates where the values come from. We do not need it, so we drop it. Alternatively, we could have removed all variables starting with an underscore with drop=_:. This comes in handy for procedures that generate multiple documentary variables when you need none of them.
By default, the new columns are named col1, col2, etc. I changed this with the prefix option. Often this is not what you want, because the destination column is named in another variable. Then use the id statement.
by PartnerNo specifies you want a new observation (or row, if you prefer database vocabulary over SAS vocabulary) for each value that variable