I am looking for a way to produce a frequency table displaying the number of unique values of variable ID per unique value of variable Subclass.
I would like to order the results by variable Class.
Preferably I would also like to display the number of unique IDs per Subclass as a fraction of the total number of unique IDs. In the want-example below this value is displayed under %totalID.
In addition I would like to display the number of unique IDs per Subclass as a fraction of the sum of unique IDs found within each Class. In the want-example below this value is displayed under %withinclassID.
Have:
ID Class Subclass
-------------------------------
ID1 1 1a
ID1 1 1b
ID1 1 1c
ID1 2 2a
ID2 1 1a
ID2 1 1b
ID2 2 2a
ID2 2 2b
ID2 3 3a
ID3 1 1a
ID3 1 1d
ID3 2 2a
ID3 3 3a
ID3 3 3b
Want:
Unique number
Class Subclass of IDs %totalID %withinclassID
--------------------------------------------------------------------
1
1a 3 100.0 50.00
1b 2 66.67 33.33
1c 1 33.33 16.67
SUM 6
2
2a 3 100.0 75.00
2b 1 33.33 25.00
SUM 4
3
3a 2 66.67 66.67
3b 1 33.33 33.33
SUM 3
My initial approach was to perform a PROC FREQ with the NLEVELS option, producing a frequency table for the number of unique IDs per Subclass. Here, however, I lose the information on Class, so I cannot order the results by it.
My second approach involved PROC TABULATE, but I cannot produce percentage calculations based on unique counts in such a table.
Is there a direct way to tabulate the frequencies of one variable according to a second variable, grouped by a third variable, displaying overall and within-group percentages?
You can do a double PROC FREQ or use PROC SQL.
/*This demonstrates how to count the number of unique occurrences of a variable
across groups. It uses the SASHELP.CARS dataset, which is available with any SAS installation.
The objective is to determine the number of unique car makers by origin.
Note: the SQL solution can run into trouble on a very large data set, and these are not
the only two ways to calculate distinct counts; with large data, other methods may be more appropriate.*/
*Count distinct IDs;
proc sql;
create table distinct_sql as
select origin, count(distinct make) as n_make
from sashelp.cars
group by origin;
quit;
*Double PROC FREQ;
proc freq data=sashelp.cars noprint;
table origin * make / out=origin_make;
run;
proc freq data=origin_make noprint;
table origin / out= distinct_freq outpct;
run;
title 'PROC FREQ';
proc print data=distinct_freq;
run;
title 'PROC SQL';
proc print data=distinct_sql;
run;
The NLEVELS option in PROC FREQ can produce the unique count you're after without losing information, provided you include the Class and Subclass variables in the BY statement. That also means you'll have to presort the data by the same variables.
You can then compute the rest of your requirement from the NLevels output (or try PROC TABULATE), as sketched after the code below.
data have;
input ID $ Class Subclass $;
datalines;
ID1 1 1a
ID1 1 1b
ID1 1 1c
ID1 2 2a
ID2 1 1a
ID2 1 1b
ID2 2 2a
ID2 2 2b
ID2 3 3a
ID3 1 1a
ID3 1 1d
ID3 2 2a
ID3 3 3a
ID3 3 3b
;
run;
proc sort data=have;
by class subclass;
run;
ods output nlevels = unique_id_count;
proc freq data=have nlevels;
by class subclass;
tables id / noprint;
run;
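From the NLevels output you can then compute the counts and both percentages. A minimal sketch with PROC SQL, assuming the ODS dataset unique_id_count has the default columns TableVar and NLevels, and relying on PROC SQL's remerging of the per-class sum (SAS names cannot contain %, hence pct_):
proc sql;
create table want as
select class,
       subclass,
       nlevels as unique_ids,
       100 * nlevels / (select count(distinct id) from have) as pct_totalID format=8.2,
       100 * nlevels / sum(nlevels) as pct_withinclassID format=8.2
from unique_id_count
where tablevar = 'ID'
group by class
order by class, subclass;
quit;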
I have 1500+ observations in a dataset with about 350 unique subject IDs. I am trying to create a weighted average of a variable across the observations for each unique subject ID. I'm not sure how to code a procedure that automates this so I don't need to compute the value one subject at a time for all 350 IDs. Thoughts?
Assuming you have something like the following:
data have;
input SubjID $ varMeasure varWeight;
cards;
01 25 1
01 30 2
01 40 3
02 25 2
02 30 1
02 40 3
;
run;
You can use PROC MEANS and specify your weight:
proc means data= have nway noprint;
class subjid;
var varMeasure;
weight varWeight;
output out=want mean = weighted_mean;
run;
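For reference, the WEIGHT statement makes the MEAN statistic the weighted mean, sum(weight*value)/sum(weight). You can cross-check the PROC MEANS result with a quick PROC SQL query on the sample data above:
proc sql;
select subjid,
       sum(varMeasure * varWeight) / sum(varWeight) as weighted_mean
from have
group by subjid;
quit;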
Hi, I have two tables with different column orders, and the column names are not capitalized the same way. How can I check whether the contents of these two tables are the same?
For example, I have two tables of students' grades
table A:
Name     Math   English   History
-------+-------+---------+---------
Tim      98     95        90
Helen    100    92        85
table B:
Name     history   MATH   english
-------+---------+-------+---------
Tim      90        98     95
Helen    85        100    92
You may use either of the following two approaches to compare, regardless of column order or capitalization of the column names:
/*1. Proc compare*/
proc sort data=A; by name; run;
proc sort data=B; by name; run;
proc compare base=A compare=B;
id name;
run;
/*2. Proc SQL*/
proc sql;
select Math, English, History from A
<union/ intersect/ Except>
select MATH, english, history from B;
quit;
Use EXCEPT CORR (CORR means "corresponding": columns are matched by name). If everything matches, you will get zero records.
data have1;
input Math English History;
datalines;
1 2 3
;
run;
data have2;
input English math History;
datalines;
2 1 3
;
run;
proc sql;
select * from have1
except corr
select * from have2;
quit;
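Note that EXCEPT in one direction only catches rows of have1 that are missing from have2; rows that exist only in have2 would go unnoticed. To be thorough, you can run the check both ways, for example:
proc sql;
title 'In have1 but not in have2';
select * from have1 except corr select * from have2;
title 'In have2 but not in have1';
select * from have2 except corr select * from have1;
quit;
title;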
Edit 1: If you want to check in which particular column the tables differ, you may have to transpose and compare, as in the example below.
data have1;
input name $ Math English psychology History;
datalines;
Tim 98 95 76 90
Helen 100 92 55 85
;
run;
data have2;
input name $ English Math psychology History;
datalines;
Tim 95 98 76 90
Helen 92 100 99 85
;
run;
proc sort data = have1 out =hav1;
by name;
run;
proc sort data = have2 out =hav2;
by name;
run;
proc transpose data =hav1 out=newhave1 (rename = (_name_= subject
col1=marks));
by name;
run;
proc transpose data =hav2 out=newhave2 (rename = (_name_= subject
col1=marks));
by name;
run;
proc sql;
create table want(drop=mark_dif) as
select
a.name as name
,a.subject as subject
,a.marks as have1_marks
,b.marks as have2_marks
,a.marks -b.marks as mark_dif
from newhave1 a inner join newhave2 b
on upcase(a.name) = upcase(b.name)
and upcase(a.subject) =upcase(b.subject)
where calculated mark_dif ne 0;
quit;
For a project, I have a large dataset of 1.5m entries. I am looking to aggregate some car loan data by some constraint variables such as:
Country, Currency, ID, Fixed or Floating, Performing, Initial Loan Value, Car Type, Car Make
I am wondering if it is possible to aggregate the data by summing the initial loan value and condensing rows that agree on all the character variables into one row, such that I turn the first dataset into the second:
Country Currency ID Fixed_or_Floating Performing Initial_Value Current_Value
data have;
input country $ currency $ ID Fixed $ performing $ initial current;
datalines;
UK GBP 1 Fixed Performing 100 50
UK GBP 1 Fixed Performing 150 30
UK GBP 1 Fixed Performing 160 70
UK GBP 1 Floating Performing 150 30
UK GBP 1 Floating Performing 115 80
UK GBP 1 Floating Performing 110 60
UK GBP 1 Fixed Non-Performing 100 50
UK GBP 1 Fixed Non-Performing 120 30
;
run;
data want;
input country $ currency $ ID Fixed $ performing $ initial current;
datalines;
UK GBP 1 Fixed Performing 410 150
UK GBP 1 Floating Performing 275 170
UK GBP 1 Fixed Non-performing 220 80
;
run;
Essentially I'm looking for a way to sum the numeric values while concatenating the character variables.
I've tried this code:
proc means data=have sum;
var initial current;
by country currency id fixed performing;
run;
I'm unsure if I'll have to use PROC SQL (which I suspect would be too slow for such a large dataset) or possibly a data step.
Any help in concatenating would be appreciated.
Create an output data set from Proc MEANS and concatenate the variables in the result. MEANS with a BY statement requires sorted data, and your have is not sorted.
Concatenation of the aggregation key (those lovely categorical variables) into a single space-separated key (not sure why you need to do that) can be done with the CATX function.
data have_unsorted;
length country $2 currency $3 id 8 type $8 evaluation $20 initial current 8;
input country currency ID type evaluation initial current;
datalines;
UK GBP 1 Fixed Performing 100 50
UK GBP 1 Fixed Performing 150 30
UK GBP 1 Fixed Performing 160 70
UK GBP 1 Floating Performing 150 30
UK GBP 1 Floating Performing 115 80
UK GBP 1 Floating Performing 110 60
UK GBP 1 Fixed Non-Performing 100 50
UK GBP 1 Fixed Non-Performing 120 30
;
run;
Way 1 - MEANS with CLASS/WAYS/OUTPUT, post process with data step
The cardinality of the class variables may cause problems.
proc means data=have_unsorted noprint;
class country currency ID type evaluation ;
ways 5;
output out=sums sum(initial current)= / autoname;
run;
data want;
set sums;
key = catx(' ',country,currency,ID,type,evaluation);
keep key initial_sum current_sum;
run;
Way 2 - SORT followed by MEANS with BY/OUTPUT, post process with data step
BY statement requires sorted data.
proc sort data=have_unsorted out=have;
by country currency ID type evaluation ;
run;
proc means data=have noprint;
by country currency ID type evaluation ;
output out=sums sum(initial current)= / autoname;
run;
data want;
set sums;
key = catx(' ',country,currency,ID,type,evaluation);
keep key initial_sum current_sum;
run;
Way 3 - MEANS, given data that is grouped but unsorted, with BY NOTSORTED/OUTPUT, post process with data step
The have rows will be processed in clumps of the BY variables. A clump is a sequence of contiguous rows that have the same by group.
proc means data=have_unsorted noprint;
by country currency ID type evaluation NOTSORTED;
output out=sums sum(initial current)= / autoname;
run;
data want;
set sums;
key = catx(' ',country,currency,ID,type,evaluation);
keep key initial_sum current_sum;
run;
Way 4 - DATA Step, DOW loop, BY NOTSORTED and key construction
The have rows will be processed in clumps of the BY variables. A clump is a sequence of contiguous rows that have the same by group.
data want_way4;
do until (last.evaluation);
set have;
by country currency ID type evaluation NOTSORTED;
initial_sum = SUM(initial_sum, initial);
current_sum = SUM(current_sum, current);
end;
key = catx(' ',country,currency,ID,type,evaluation);
keep key initial_sum current_sum;
run;
Way 5 - Data Step hash
Data can be processed without a presort or clumping. In other words, the data can be totally disordered.
data _null_;
length key $50 initial_sum current_sum 8;
if _n_ = 1 then do;
call missing (key, initial_sum, current_sum);
declare hash sums();
sums.defineKey('key');
sums.defineData('key','initial_sum','current_sum');
sums.defineDone();
end;
set have_unsorted end=end;
key = catx(' ',country,currency,ID,type,evaluation);
rc = sums.find();
initial_sum = SUM(initial_sum, initial);
current_sum = SUM(current_sum, current);
sums.replace();
if end then
sums.output(dataset:'want_way5');
run;
1.5m rows is not a very big dataset. Sort the data first, then run PROC MEANS with a BY statement:
proc sort data=have;
by country currency id fixed performing;
run;
proc means data=have sum;
var initial current;
by country currency id fixed performing;
output out=sum(drop=_:) sum(initial)=Initial sum(current)=Current;
run;
Props to Paige Miller:
proc summary data=testa nway;
var net_balance;
class ID fixed_or_floating performing_status initial country currency ;
output out=sumtest sum=sum_initial;
run;
I am trying to create a two-way transposed table. The original table I have looks like
id cc
1 2
1 5
1 40
2 55
2 2
2 130
2 177
3 20
3 55
3 40
4 30
4 100
I am trying to create a table that looks like
CC CC1 CC2… …CC177
1 264 5 0
2 0 132 6
…
…
177 2 1 692
In other words: how many IDs that have CC1 also have CC2, ..., CC177, etc.?
The numbers under id are not counts; an ID could be anything from 3 to 5 digits, or a value such as 122345ab78.
Is it possible to have percentages displayed next to the counts?
CC CC1 % CC2 %… …CC177
1 264 100% 5 1.9% 0
2 0 132 6
…
…
177 2 1 692
If I want to change CC1, CC2, ... to character names, how do I modify the arrays?
Eventually, I would like my table to look like
CC Dell Lenovo HP Sony
Dell
Lenovo
HP
Sony
The order of the names must match the CC numbers I provided above: CC1=Dell, CC2=Lenovo, etc. I would also want to add percentages to the matrix. If Dell x Dell = 100 and Dell x Lenovo = 25, then Dell x Lenovo = 25%.
This changes your data structure to a wide format with an indicator for each value of CC and then uses PROC CORR to create the summary table.
PROC CORR will generate the SSCP, which is the uncorrected sum of squares and crossproducts. It's something that's related to correlation, but the gist is that it creates the table you're looking for. The table is printed in the SAS results window, and the ODS OUTPUT statement captures it in a dataset called coocs. Note that the data must be sorted by ID for the BY statement in the step below.
data temp;
set have;
by ID;
retain CC1-CC177;
array CC_List(177) CC1-CC177;
if first.ID then do i=1 to 177;
CC_LIST(i)=0;
end;
CC_List(CC)=1;
if last.ID then output;
run;
ods output sscp=coocs;
ods select sscp;
proc corr data=temp sscp;
var CC1-CC177;
run;
proc print data=coocs;
run;
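To answer the follow-up about character names: list named variables in the array instead of CC1-CC177; position i in the array then stands for CC=i. A hypothetical sketch with just the four makers from your mapping (CC1=Dell, CC2=Lenovo, CC3=HP, CC4=Sony), assuming CC only takes the values 1-4:
data temp2;
set have;
by ID;
retain Dell Lenovo HP Sony;
array CC_List(4) Dell Lenovo HP Sony; /* position 1=Dell, ..., 4=Sony */
if first.ID then do i=1 to 4;
CC_List(i)=0;
end;
CC_List(CC)=1; /* assumes CC is in 1-4 here */
if last.ID then output;
run;
ods output sscp=coocs_named;
ods select sscp;
proc corr data=temp2 sscp;
var Dell Lenovo HP Sony;
run;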
Here's another answer, but it's inefficient and has its issues. For one, if a value does not appear anywhere in the data it will not show up in the results, i.e. if there is no 20 in the dataset there will be no 20 in the final data. Also, the variables are out of order in the final dataset; see the note after the code.
proc sql;
create table bigger as
select a.id, catt("CC", a.cc) as cc1, catt("CC", b.cc) as cc2
from have as a
cross join have as b
where a.id=b.id;
quit;
proc freq data=bigger noprint;
table cc1*cc2/ list out=bigger2;
run;
proc transpose data=bigger2 out=want2;
by cc1;
var count;
id cc2;
run;
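The column-order issue comes from catt("CC", a.cc) producing names like CC2 and CC10 that sort as character strings. One workaround (a sketch, assuming cc never exceeds three digits) is to zero-pad the code so the generated names sort in numeric order:
proc sql;
create table bigger as
select a.id,
       catt("CC", put(a.cc, z3.)) as cc1, /* CC002, CC005, ..., CC177 */
       catt("CC", put(b.cc, z3.)) as cc2
from have as a
cross join have as b
where a.id=b.id;
quit;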
I inherited a poorly documented person-month dataset that does not have a matching person-level dataset. I want to determine which of the variables in the person-month dataset are actually person-level variables (constant for all observations with a particular id), such as you would expect for date of birth. Simplistic example:
id month dob race tx weight
1 1 4058 1 1 105
1 2 4058 1 1 107
1 3 4058 1 2 108
2 1 1622 2 1 153
2 2 1622 2 3 153
2 3 1622 2 2 153
In this example, dob and race are fixed within an individual but tx and weight vary by month within an individual.
I have come up with a clumsy solution: use proc means to calculate the standard deviation of all numeric variables BY id, and then take the maximum of those standard deviations. If the maximum of the std of a variable is 0, there is no variance of that column within any individual, and I can flag that variable as being fixed (or person-level).
I feel like I'm missing a simpler statistical test to determine which of my hundreds of variables are fixed within each individual and which vary across an individual's observations. Any suggestions?
I would use the NLEVELS option in PROC FREQ. This gives you the number of unique values of each variable within each id, so you're looking for variables whose level count (NLevels) is 1 for every id.
Here's the code; you'll need to sort the data by id beforehand if that's not already done. A summary step over the output is sketched after the code.
data have;
input id month dob race tx weight;
cards;
1 1 4058 1 1 105
1 2 4058 1 1 107
1 3 4058 1 2 108
2 1 1622 2 1 153
2 2 1622 2 3 153
2 3 1622 2 2 153
;
run;
ods select nlevels;
ods output nlevels=want;
ods noresults;
proc freq data=have nlevels;
by id;
run;
ods results;
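From there, a variable is person-level when its level count is 1 within every id. A sketch of the summary step, assuming the ODS dataset want has the default columns TableVar and NLevels:
proc sql;
select tablevar
from want
where lowcase(tablevar) ne 'id' /* id itself is the grouping key */
group by tablevar
having max(nlevels) = 1;
quit;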
I don't think there's a 'simple statistical test' beyond what you have worked out - standard deviation, or even MIN/MAX (which amounts to the same check). I'd probably just do it in PROC SQL, unless there are a huge number of variables; this approach also handles character variables.
/* 1 if the variable is constant within the group, 0 otherwise */
%macro comparetype(var);
max(&var.) = min(&var.) as &var.
%mend comparetype;
proc sql;
/* outer min() is 1 only if the variable is constant within every make */
select min(origin) as origin, min(type) as type, min(drivetrain) as drivetrain,
min(msrp) as msrp, min(invoice) as invoice, min(enginesize) as enginesize from (
select make,
%comparetype(origin),
%comparetype(type),
%comparetype(drivetrain),
%comparetype(msrp),
%comparetype(invoice),
%comparetype(enginesize)
from sashelp.cars
group by make
);
quit;