Proc freq for values comparison - sas

I have the following table
ID POINTS1 POINTS2 time Table1 Table2
12 10 9 2011 A B
13 12 8 2010 A B
22 9 11 2011 A C
24 8 8 2012 A C
I would need to visually compare these results focusing on the change of points of IDs between the two tables.
I am thinking about a correlation matrix or transition matrix where I have on x points1 and on y points2, possibly differently coloured based on the following conditions
A -> B
A -> C
I know that transition matrices can be created by frequency tables in SAS.
Proc freq data=my_table;
Tables points1*points2;
Run;
But I am wondering how to include information on the transition between table1 and table2, using colours in a two-way table with proc freq. I know that it could be possible using proc tabulate.

I have found a way using proc format and tabulate, instead of proc freq.
proc format;
value cell_col low-50="green"
50<-high="red";
run;
proc tabulate data=my_table s=[foreground=cell_col.];
class points1 points2 table2;
table points1 points2 table2;
run;
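For the two-way layout asked about in the question, a minimal sketch along the same lines might look like the following. It assumes the variable names from the sample table (points1, points2, table2) and reuses the cell_col format. Putting table2 in the page dimension and colouring the cell counts at a threshold of 50 are only illustrative choices, not something the answer above specifies.

```sas
/* Same format as above: maps a cell value to a colour name */
proc format;
  value cell_col low-50   = "green"
                 50<-high = "red";
run;

proc tabulate data=my_table;
  class points1 points2 table2;
  /* page = table2 (one crosstab per target table, e.g. A -> B and A -> C),
     rows = points1, columns = points2, cells = counts coloured by cell_col. */
  table table2,
        points1,
        points2*n*[style=[background=cell_col.]] / misstext='0';
run;
```

Because table1 is always A in the sample, one page per table2 value gives one transition matrix per A -> B or A -> C transition.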

Related

PROC Transpose by YearMonth and ID for duplicate values

I'm trying to transpose this table:
YearMonth ID Purchase Purchase Value
201912 1 Laptop 1000
202012 1 Computer 2000
202112 1 Phone 1000
201912 2 Stereo 500
202012 2 Headset 200
To look like this using PROC Transpose:
ID Purchase_201912 Purchase_202012 Purchase_202112 PV_201912 PV_202012 PV_202112
1 Laptop Computer Phone 1000 2000 1000
2 Stereo Headset - 500 200 -
I think I'll have to transpose multiple times to achieve this. The first transpose I've tried doing is this:
proc transpose data=query_final out=transpose_1 let;
by yearmonth agent_number;
run;
but I keep getting the error
ERROR: Data set WORK.QUERY_FINAL is not sorted in ascending sequence. The current BY group has YearMonth = 202112
and the next BY group has YearMonth = 201912.
I've checked that the data from the table I'm pulling from is indeed sorted in ascending order by YearMonth and then grouped by agent number, so I'm not sure what this error is referring to. Could it be that not all IDs have the same YearMonths associated with them (i.e. in the example above, ID 2 did not purchase anything in 2021)?
Assuming agent_number is equivalent to id in your example, I reproduced the data:
data have;
infile datalines4 delimiter="|";
input yearmonth id purchase :$8. PV;
datalines4;
201912|1|Laptop|1000
202012|1|Computer|2000
202112|1|Phone|1000
201912|2|Stereo|500
202012|2|Headset|200
;;;;
run;
You can use two PROC TRANSPOSE steps, one per variable, and then merge the two results by id:
proc transpose data=have out=stage1(drop=_name_) prefix=Purchase_;
by id;
var purchase;
id yearmonth;
run;
proc transpose data=have out=stage2(drop=_name_) prefix=PV_;
by id;
var PV;
id yearmonth;
run;
data want;
merge stage1 stage2;
by id;
run;
Resulting in the desired output
id purchase_201912 purchase_202012 purchase_202112 PV_201912 PV_202012 PV_202112
1 Laptop Computer Phone 1000 2000 1000
2 Stereo Headset 500 200 .
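PS: To avoid the error you report, sort the data first by the same variables, in the same order, as in the BY statement of PROC TRANSPOSE. Your data is sorted by ID and then YearMonth, while your BY statement asks for YearMonth and then agent number. Sorting is not needed in the code above because the data is already sorted by id.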
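As a sketch of that PS applied to the dataset from the question: query_final, agent_number and yearmonth come from the question's code, while the output name query_sorted and the variable name purchase are assumptions made for illustration.

```sas
/* Sort by the same variables, in the same order, as the BY statement below */
proc sort data=query_final out=query_sorted;
  by agent_number yearmonth;
run;

/* One row per agent; yearmonth values become the column suffixes */
proc transpose data=query_sorted out=transpose_1(drop=_name_) prefix=Purchase_;
  by agent_number;
  id yearmonth;
  var purchase;   /* assumed variable name */
run;
```

Sorting with OUT= leaves the original dataset untouched, which is convenient if other steps rely on its current order.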

How to standardize all numeric columns in SAS datasets?

Problem:
I have a dataset with hundreds of variables (columns) and I want to standardize all numeric variables. But instead of centering and dividing by just one standard deviation, I need to center and divide all variables by two standard deviations.
This is an example of the dataset I have
data have;
INPUT year $1-4 program_id $6-8 program_name $10-31 enrollments 33-36 admissions 38-41 graduates 43-46;
datalines;
2010 002 Electrical Engineering 1563 0321 0156
2010 001 Civil Engineering      2356 0739 0236
2010 003 Mechanical Engineering 0982 0234 0069
2010 021 English                3945 1034 0269
2010 031 Physics                0459 0134 0069
2010 041 Arts                   0234 0072 0045
2019 004 Engineering            4745 1202 0597
2019 022 English Teaching       2788 0887 0201
2019 023 English and Spanish    0751 0345 0092
2019 031 Physics                0589 0126 0039
2019 032 Astronomy              0093 0035 0021
2019 041 Arts                   0359 0097 0062
2019 044 Cinema                 0293 0100 0039
;
run;
I want two different datasets. In the first, standardization applies for all variables across the whole dataset.
proc sql;
create table want1 as
select *,
(enrollments - mean(enrollments))/(2*STD(enrollments)) as z_enrollments,
(admissions - mean(admissions))/(2*STD(admissions)) as z_admissions,
(graduates - mean(graduates))/(2*STD(graduates)) as z_graduates
from have;
quit;
In the second, standardization is grouped by year:
proc sql;
create table want2 as
select *,
(enrollments - mean(enrollments))/(2*STD(enrollments)) as z_enrollments,
(admissions - mean(admissions))/(2*STD(admissions)) as z_admissions,
(graduates - mean(graduates))/(2*STD(graduates)) as z_graduates
from have
group by year;
quit;
Question: How to do this for all the hundreds of numeric variables of my dataset, without needing to write down the name of each one of them?
What I tried:
As I want this code to be replicable to different datasets, I was trying to follow the reasoning of this other question. That is, first identify all numeric variables, then save all variable names into an array, and then do the computations. I thought that perhaps I also need to save the resulting parameters of each column (mean and std) in an array as well. But I still have not worked out how to make arrays, DATA steps and loops work together.
I started trying to set an array for calculating the number of numerical variables. This runs fine.
data _null_;
set have;
array x[*] _numeric_;
call symput("nVar",dim(x));
stop;
run;
%put Number Variables = &nVar;
Then I tried to adapt the following code, which is a combination of @DomPazz's answer with @Tom's suggestion in the comments, but it did not work:
data want;
set have nobs=nobs;
array x[&nVar] _numeric_;
array N[&nVar];
n(1)=x(1);
do i=2 to dim(n);
n(i)=(x(i) - mean(x(i)))/(2*STD(x(i)));
end;
keep N:;
run;
I don't know if the above code would get the right result. But I get an error saying that I have the incorrect number of arguments for the STD function. I looked it up: apparently, STD() in datastep runs row-wise, not column-wise.
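To illustrate the row-wise behaviour mentioned above, here is a minimal sketch. The literal values are made up for the demonstration; the second step uses the have dataset from the question to show where column-wise statistics come from instead.

```sas
/* In a DATA step, STD() takes two or more arguments and computes the
   standard deviation across them, within a single row. */
data _null_;
  s = std(2, 4, 6);
  put s=;          /* writes s=2 to the log */
run;

/* Column-wise means and standard deviations come from a procedure */
proc means data=have mean std;
  var _numeric_;
run;
```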
I also tried PROC STANDARD. I get some results, but they don't match my calculations. I probably did not set the parameters right:
proc standard data=have mean=0 std=2
out=want;
run;
You can use METHOD=STD on PROC STDIZE to center on the mean and scale by one standard deviation.
Then add the MULT= option with a value of 0.5, which has the same effect as dividing by two standard deviations.
proc stdize data=have method=STD mult=0.5 out=want;
run;
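For the second dataset in the question (standardization within each year), a sketch of the same idea with a BY statement follows. The names have_sorted and want2 are illustrative. Note that PROC STDIZE replaces the original values rather than adding z_ columns, so this is not an exact copy of the SQL output in the question.

```sas
/* BY-group processing needs the data sorted by the BY variable */
proc sort data=have out=have_sorted;
  by year;
run;

/* Means and standard deviations are computed separately within each year.
   With no VAR statement, all numeric variables are standardized. */
proc stdize data=have_sorted method=std mult=0.5 out=want2;
  by year;
run;
```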
Answering last comment:
@Tom I was reading the PROC STDIZE documentation, but I could not figure out whether I can customize the LOCATION and SCALE measures. For example, instead of dividing by 2 std, I want to subtract the mean and divide by the range for all variables. Would that be possible?
Quick solution:
* Output Mean;
proc stdize data=have method=mean out=out1 outstat=mean1;
var _numeric_;
run;
* Output Range;
proc stdize data=have method=range out=out1 outstat=range1;
var _numeric_;
run;
* LOCATION and SCALE;
data scale_location;
set mean1 (where=(_type_='LOCATION')) range1 (where=(_type_='SCALE'));
run;
* Target;
proc stdize data=have method=in(scale_location) out=want;
var _numeric_;
run;

proc transpose with variable name

I have a dataset as following
AGE GENDER
11 F
12 M
13
15
now I want to create a dataset as following
Basically I want to have the variable names in another column.
or may be in one column like
VAR Value
AGE 11
AGE 12
AGE 13
AGE 15
GENDER F
GENDER M
I have tried a normal PROC TRANSPOSE, but it doesn't give the desired result.
This is not, strictly speaking, a transpose. A transpose turns columns into rows or vice versa, which is not the case here. That sample data transposed would look like:
VAR VALUE1 VALUE2 VALUE3 VALUE4
----------------------------------
AGE 11 12 13 15
GENDER F M
What you're trying to do here instead is have all your variables in the same column and add a 'label' column.
You could have your desired result with a data step:
data have;
infile datalines missover
;
input age $ gender $;
datalines;
11 F
12 M
13
15
;
run;
data want;
length var $6;
set have(keep=age rename=(age=value) in=a)
have(keep=gender rename=(gender=value) where=(value is not missing) in=b);
if b then var='GENDER';
else if a then var='AGE';
run;
Note the where= dataset option on the second part of the set statement since your desired result does not include the missing values that you have for gender in your sample data.
Alternatively, you could do it with two proc transpose:
proc transpose data=have out=temp name=VAR;
var age gender;
run;
proc transpose data=temp out=want(drop=_name_ rename=(col1=VALUE) where=(VALUE is not missing));
var col1 col2 col3 col4;
by var;
run;
One solution is to introduce a new unique row identifier and use that in a BY statement. This will let TRANSPOSE pivot the data values in each row.
data have;
rownum + 1; * new variable for pivoting by row via BY statement;
input AGE GENDER $;
datalines;
11 F
12 M
13 .
15 .
run;
proc transpose data=have out=want(drop=_name_ rename=(col1=value) where=(value ne ''));
by rownum;
var age gender;
run;
In PROC TRANSPOSE the default new column names are prefixed with COL and numbered 1..n by the position of each incoming row within its BY group. The artificial rownum and the BY statement ensure the pivoted data has only one data column. Note: the prefix can be specified with the PREFIX= option, and the pivoted column names can also come from the data itself if you use the ID statement.
Mixed data types can be a problem because the new column will use character representation of underlying data values. So dates will come out as numbers and numeric that were initially formatted will lose their format.
If you are trying to make a JSON transmission I would recommend researching the JSON library engine or the JSON package of Proc DS2.
If you are looking to create a report with the data in this transposed shape I would recommend Proc TABULATE.

Tracking ID in SAS

I have a SAS question. I have a large dataset containing unique ID's and a bunch of variables for each year in a time series. Some ID's are present throughout the entire timeseries, some new ID's are added and some old ID's are removed.
ID Year Var3 Var4
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
As can be seen from the table above, ID 1 is present in all three years (2015-2017), ID 2 is present from 2016 and onwards, ID 3 is removed in 2017 and ID 4 is present in 2015, removed in 2016 and then present again in 2017.
I would like to know which IDs are new and which are removed in any given year, whilst keeping all the data, e.g. a new table with indicators for which IDs are new and which are removed. Furthermore, it would be nice to get a count of how many IDs are added/removed in a given year and the sum of their "Var3" and "Var4". Do you have any suggestions how to do that?
************* UPDATE ******************
Okay, so I tried the following program:
**** Addition to suggested code ****;
options validvarname=any;
proc sql noprint;
create table years as
select distinct year
from have;
create table ids as
select distinct id
from have;
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
Now this will provide me with a table that only contains the ID's that are new in 2017:
data new_in_17;
set indicators;
where ('2016'n=0) and ('2017'n=1);
run;
I can now merge this table to add var3 and var4:
data new17;
merge new_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
Now I can find the frequency of new IDs in 2017 and the sum of var3 and var4:
proc means data=new17 noprint;
var var3 var4;
where year in (2017);
output out=sum_var_freq_new sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
This gives me the output I need. However, I would like the equivalent output for the ID's that are "gone" between 2016 and 2017 which can be made from:
data gone_in_17;
set indicators;
where ('2016'n=1) and ('2017'n=0);
run;
data gone17;
merge gone_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
proc means data=gone17 noprint;
var var3 var4;
where year in (2016);
output out=sum_var_freq_gone sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
The end result should be a combination of the two tables "sum_var_freq_new" and "sum_var_freq_gone" into one table. Furthermore, I need this table for every new year, so my current approach is very inefficient. Do you guys have any suggestions how to achieve this efficiently?
Aside from a different sample, you didn't provide much extra info from your previous question in order to understand what was lacking in the previous answer.
To build on the latter though, you could use a macro do loop to dynamically account for the distinct year values present in your dataset.
data have;
infile datalines;
input ID year var3 var4;
datalines;
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
;
run;
proc sql noprint;
select distinct year
into :year1-
from have
;
quit;
%macro doWant;
proc sql;
create table want as
select distinct ID
%let i=1;
%do %while(%symexist(year&i.));
,exists(select * from have b where year=&&year&i.. and a.id=b.id) as "&&year&i.."n
%let i=%eval(&i.+1);
%end;
from have a
;
quit;
%mend;
%doWant;
This will produce the following result:
ID 2015 2016 2017
-----------------
1 1 1 1
2 0 1 1
3 1 1 0
4 1 0 1
Here is a more efficient way of doing this and also giving you the summary values.
First, a little SQL magic. Create the cross product of years and IDs, then join that to the table you have to create an indicator:
proc sql noprint;
/*All Years*/
create table years as
select distinct year
from have;
/*All IDS*/
create table ids as
select distinct id
from have;
/*All combinations of ID/year*/
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
/*Original data with rows added for missing years. Indicator=1 if it*/
/*existed prior, 0 if not.*/
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
Now transpose that.
proc transpose data=indicators out=indicators(drop=_name_);
by id;
id year;
var indicator;
run;
Create the sums. You could also add other summary stats if you wanted here:
proc summary data=have;
by id;
var var3 var4;
output out=summary sum=;
run;
Merge the indicators and the summary values:
data want;
merge indicators summary(keep=id var3 var4);
by id;
run;
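The question also asks how many IDs are added or removed in each year. A minimal sketch of one way to get that is shown below, assuming the have dataset from the sample. The names indicators_long, flags, changes, is_new and is_removed are hypothetical, and "new" here means absent in the previous year of the series and present in the current one.

```sas
/* Long 0/1 table: one row per id/year combination */
proc sql;
  create table indicators_long as
  select a.id, b.year,
         exists(select 1 from have h
                where h.id = a.id and h.year = b.year) as indicator
  from (select distinct id from have) as a,
       (select distinct year from have) as b
  order by a.id, b.year;
quit;

/* Compare each year with the previous year of the same id */
data flags;
  set indicators_long;
  by id year;
  prev = lag(indicator);
  if first.id then prev = .;                    /* no previous year */
  is_new     = (prev = 0 and indicator = 1);
  is_removed = (prev = 1 and indicator = 0);
run;

/* Counts of new and removed IDs per year */
proc sql;
  create table changes as
  select year, sum(is_new) as n_new, sum(is_removed) as n_removed
  from flags
  group by year;
quit;
```

The sums of var3 and var4 for those IDs could then be obtained by joining flags back to have on id and year, which is left out here to keep the sketch short.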

Rolling up data in SAS

Here is my data :
data example;
input id sports_name :$16.;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
This is just a sample. The variable sports_name is categorical with 56 types.
I am trying to transpose the data to wide form where each row would have a user_id and the names of sports as the variables with values being 1/0 indicating Presence or absence.
So far, I have used PROC FREQ to get the cross-tabulated frequency table, put that in a different data set, and then transposed that data. Now I have missing values in some cases and the count of the sports in the rest.
Is there any better way to do this?
Thanks!!
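You need a way to create something from nothing, that is, rows for the id/sport combinations that do not occur in the data. PROC SUMMARY with COMPLETETYPES does that below; you could also have used the SPARSE option in PROC FREQ. Keep in mind that SAS names cannot be longer than 32 characters, which matters because the sport names become variable names.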
data example;
input id sports_name :$16.;
retain y 1;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;;;;
run;
proc print;
run;
proc summary data=example nway completetypes;
class id sports_name;
output out=freq(drop=_type_);
run;
proc print;
run;
proc transpose data=freq out=wide(drop=_name_);
by id;
var _freq_;
id sports_name;
run;
proc print;
run;
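The SPARSE alternative mentioned above could be sketched as follows, using the example dataset from this answer. The dataset names freq_sparse, flags and wide_flags and the variable present are illustrative. The extra DATA step turns counts into the 1/0 presence flags the question asks for.

```sas
/* SPARSE writes every id/sports_name combination to OUT=, including
   combinations with a count of zero. */
proc freq data=example noprint;
  tables id*sports_name / sparse out=freq_sparse(keep=id sports_name count);
run;

/* Convert counts to presence/absence flags */
data flags;
  set freq_sparse;
  present = (count > 0);
run;

/* One row per id, one column per sport */
proc transpose data=flags out=wide_flags(drop=_name_);
  by id;
  id sports_name;
  var present;
run;
```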
Same theory here: generate a list of all possible combinations using SQL instead of PROC SUMMARY, then transpose the results.
data example;
informat sports_name $20.;
input id sports_name $;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
proc sql;
create table complete as
select a.id, a_x.sports_name, case when not missing(e.sports_name) then 1 else 0 end as Present
from (select distinct ID from example) a
cross join (select distinct sports_name from example) a_x
full join example as e
on e.id=a.id
and e.sports_name=a_x.sports_name
order by a.id, a_x.sports_name;
quit;
proc transpose data=complete out=want;
by id;
id sports_name;
var Present;
run;