Building a table that counts the frequency that a value appears within the same by group as another value of the same variable - sas

I have build a table where the headers each column except the first column is a unique value, and every cell in the first column is a unique value - just the headers but vertical.
values
value1
value2
value3
value1
value2
value3
I also have a dataset group. Every unique group can have multiple unique values.
group
value
group1
value1
group1
value2
group1
value3
group2
value2
group2
value3
group2
value4
group2
value5
I want to see the amount of times every unique value appears alongside any other unique value that is not itself within the same group by populating my first table.
Desired output, given the above example group dataset:
values
value1
value2
value3
value1
0
1
1
value2
1
0
2
value3
1
2
0
I have difficulties finding answers as I am not sure what the type of output I want is called. How would I go about achieving my desired results?
Thank you for your help.

Create a cross-join that counts the number of times each value matches with a different value:
proc sql noprint;
create table count_values as
select t1.value as value_t1
, t2.value as value_t2
, count(*) as total
from have as t1, have as t2
where t1.group = t2.group
AND t1.value NE t2.value
group by t1.value, t2.value
;
quit;
Which gives you this:
value_t1 value_t2 total
value1 value2 1
value1 value3 1
value2 value1 1
value2 value3 2
value3 value1 1
value3 value2 2
Now we can do a few different things. You can use proc tabulate and generate your table:
options missing='0';
proc tabulate data = count_values
out = totals
format = 8.0;
class value_t1 value_t2;
var total;
/* ' ' eliminates row headings */
table value_t1 = ' ',
value_t2 = ' '*total = ' '*sum=' '
;
run;
Or, you can turn it into a dataset:
proc transpose data = count_values
out = count_values_tpose(drop=_NAME_);
by value_t1;
id value_t2;
var total;
run;
data want;
/* Set var order */
length value_t1 $8.
value1-value3 8.
;
set count_values_tpose;
array value[*] value1-value3;
/* Set missing to 0 */
do i = 1 to dim(value);
if(value[i] = .) then value[i] = 0;
end;
rename value_t1 = values;
drop i;
run;
Output:
values value1 value2 value3
value1 0 1 1
value2 1 0 2
value3 1 2 0

Related

Proc tabulate,in my column i have more than 50 variable how can i reduce it to only 5?

proc tabulate data=D.Arena out=work.Arena ;
class Row1 Column1/ order=freq ;
table Row1,Column1 ;
run;
after running this i received these results and now i want to restrict the columns to only 5 variables
Use a 'where' statement to restrict the col1 values being tabulated.
You can restrict based on a value property such as starts with the letter A
where col1 =: 'A';
You can restrict based on a value list:
where col1 in ('Apples', 'Lentils', 'Oranges', 'Sardines', 'Cucumber');
Sample data:
data have;
call streaminit(123);
array col1s[50] $20 _temporary_ (
"Apples" "Avocados" "Bananas" "Blueberries" "Oranges" "Strawberries" "Eggs" "Lean beef" "Chicken breasts" "Lamb" "Almonds" "Chia seeds" "Coconuts" "Macadamia nuts" "Walnuts" "Asparagus" "Bell peppers" "Broccoli" "Carrots" "Cauliflower" "Cucumber" "Garlic" "Kale" "Onions" "Tomatoes" "Salmon" "Sardines" "Shellfish" "Shrimp" "Trout" "Tuna" "Brown rice" "Oats" "Quinoa" "Ezekiel bread" "Green beans" "Kidney beans" "Lentils" "Peanuts" "Cheese" "Whole milk" "Yogurt" "Butter" "Coconut oil" "Olive oil" "Potatoes" "Sweet potatoes" "Vinegar" "Dark chocolate"
);
do row1 = 1 to 20;
do _n_ = 1 to 1000;
col1 = col1s[ceil(rand('uniform',50))];
x = ceil(rand('uniform',250));
output;
end;
end;
run;
Frequency tabulation, also showing ALL counts
* col1 values shown in order by value;
proc tabulate data=have;
class row1 col1;
table ALL row1,col1;
run;
* col1 values shown in order by ALL frequency;
proc tabulate data=have;
class row1;
class col1 / order=freq;
table ALL row1,col1;
run;
* Letter T col1 values shown in order by ALL frequency;
proc tabulate data=have;
where col1 =: 'T';
class row1;
class col1 / order=freq;
table ALL row1,col1;
run;
A top 5 only list of Col1s would require a step that determines which col1s meet that criteria. A list of those col1s can be used as part of a where in clause.
* determine the 5 col1s with highest frequency count;
proc sql noprint outobs=5;
select
quote(col1) into :top5_col1_list separated by ' '
from
( select col1, count(*) as N from have
group by col1
)
order by N descending;
quit;
proc tabulate data=have;
where col1 in (&top5_col1_list);
class row1;
class col1 / order=freq;
table ALL row1,col1;
run;
Col1s in order of value
Col1s in order of frequency
T Col1s
Top 5 Col1s

Combine information into a department vector

I want to summarize a dataset by creating a vector that gives information on what departments the id is found in. For example,
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
I want
id dept_vect
1 1111
2 0010
3 0001
4 1000
5 1001
The position of the elements of the dept_vect is organized alphabetically. So a '1' in the first position means that the id is found in deptartment A and a '1' in the second position means that the id is found in department B. A '0' means the id is not found in the department.
I can solve this problem using a brute force approach
proc transpose data = test out = test1(drop = _NAME_);
by id;
var dept;
run;
data test2;
set test1;
array x[4] $ col1-col4;
array d[4] $ d1-d4;
do i = 1 to 4;
if not missing(x[i]) then do;
if x[i] = 'A' then d[1] = 1;
else if x[i] = 'B' then d[2] = 1;
else if x[i] = 'C' then d[3] = 1;
else if x[i] = 'D' then d[4] = 1;
end;
else leave;
end;
do i = 1 to 4;
if missing(d[i]) then d[i] = 0;
end;
dept_id = compress(d1) || compress(d2) || compress(d3) || compress(d4);
keep id dept_id;
run;
This works but there are a couple of problems. For col4 to appear, I need at least one id to be found on all departments but that could be fixed by creating a dummy id so that id is found on all departments. But the main problem is that this code is not robust. Is there a way to code this so that it would work for any number of departments?
Add a 1 to get a count variable
Transpose using PROC TRANSPOSE
Replace missing with 0
Use CATT() to create desired results.
data have;
input id dept $;
count = 1;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transpose data=test out=wide prefix=dept;
by id;
id dept;
var count;
run;
data want;
set wide;
array _d(*) dept:;
do i=1 to dim(_d);
if missing(_d(i)) then _d(i) = 0;
end;
want = catt(of _d(*));
run;
Maybe TRANSREG can help with this.
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transreg;
id id;
model class(dept / zero=none);
output design out=dummy(drop=dept);
run;
proc print;
run;
proc summary nway;
class id;
output out=want(drop=_type_) max(dept:)=;
run;
proc print;
run;

SAS: merge/ join datasets by dynamic columns in the look up table

I want to join sas dataset with the look up table but the column/key for joining is a value in the look up table
Dataset: table4
ID lev1 lev2 lev3 lev4 lev5
1 12548 14589 85652 45896 45889
2 12548 14589 85652 45896 45890
3 12548 14547 85685 45845 45825
4 66588 24647 55255 30895 15764
Look up table:
context table_name column operator value
extract table1 col1 equals xyd
asset table2 var1 equals 11111
asset table2 var2 equals 25858
prod table3 x1 equals 87999
unprod table4 lev2 equals 14589
unprod table4 lev2 equals 14589
unprod table4 lev3 equals 55255
Now I want to join table4 with lookup table but it is only possible with fields lev2 and lev3(it is dynamic so could be changed in the future, so don't want to hardcode in it).
I have tried below code but doesn't want to hard code as the fields are dynamic( someone might add lev4 as well in future).
proc sql ;
create table want as
select ID
from table4 as a
inner join lookup as b
on a.lev2 = input(value,12.) or a.lev3=input(value,12.)
where Context="unprod";
quit;
Thanks heaps in advance.
That does not look like a lookup table. It appears to be a set of rules. You could use it to generate code. Let's simplify the process by making the table contain actual code instead of three columns. But you could easily write the code to convert from your current format into code strings.
data rules ;
infile cards truncover ;
input context $ table_name $ rule $100. ;
cards;
extract table1 col1 = xyd
asset table2 var1 = 11111
asset table2 var2 = 25858
prod table3 x1 = 87999
unprod table4 lev2 = 14589
unprod table4 lev2 = 14589
unprod table4 lev3 = 55255
;
So now it looks like you want to take the rules that have a specific value of CONTEXT and use that to generate a new dataset from the dataset named in TABLE_NAME. Not sure what name you want to use for the generated table or what you want to do when more than one table is mentioned in the same "context".
%let context=unprod ;
filename code temp;
data _null_;
set rules ;
where context=symget('context');
by table_name ;
file code ;
if first.table_name then table_no+1;
if first.table_name then put
'data want' table_no ';'
/ ' set ' table_name ';'
/ ' where 1=0'
;
put ' or (' rule ')' ;
if last.table_name then put
';'
/ 'run;'
;
run;
%include code / source2 ;
Which results in code like this:
130 +data want1 ;
131 + set table4 ;
132 + where 1=0
133 + or (lev2 = 14589 )
134 + or (lev2 = 14589 )
135 + or (lev3 = 55255 )
136 +;
137 +run;
NOTE: There were 3 observations read from the data set WORK.TABLE4.
WHERE (lev2=14589) or (lev3=55255);
NOTE: The data set WORK.WANT1 has 3 observations and 6 variables.
Here is a sample code that would get what I understood you are trying to do. This code is based on the comment by #Reeza. If this is not what you are trying to do, please send a sample output file.
data table4;
input ID $ lev1 $ lev2 $ lev3 $ lev4 $ lev5 $;
datalines;
1 12548 14589 85652 45896 45889
2 12548 14589 85652 45896 45890
3 12548 14547 85685 45845 45825
4 66588 24647 55255 30895 15764
;
run;
data look_up;
input context $ table_name $ column $ operator $ value $;
datalines;
extract table1 col1 equals xyd
asset table2 var1 equals 11111
asset table2 var2 equals 25858
prod table3 x1 equals 87999
unprod table4 lev2 equals 14589
unprod table4 lev2 equals 14589
unprod table4 lev3 equals 55255
;
run;
PROC transpose DATA=work.table4 out=temp1 prefix=value;
by ID;
VAR lev1-lev5;
run;
proc sql;
create table want as
select a.*, b.ID
from look_up as a
inner join temp1 as b
on a.value=b.value1 and a.column=_Name_;
quit;

Count rows number by group and subgroup when some subgroup factor is 0

I know how to count group and subgroup numbers through proc freq or sql. My question is when some factor in the subgroup is missing, and I still want to show missing factor as 0. How can I do that? For example,
the data set is:
group1 group2
1 A
1 A
1 A
1 A
2 A
2 B
2 B
I want a result as:
group1 group2 N
1 A 4
1 B 0
2 A 1
2 B 2
If I only use the default SAS setting, it will usually show as
group1 group2 N
1 A 4
2 A 1
2 B 2
But I still want to the second line in the result tell to me that there are 0 observations in that category.
Use the SPARSE option within proc freq. Consider it a cross join between all options from GROUP1 and GROUP2.
data have;
input group1 group2 $;
cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;
proc freq data=have;
table group1*group2/out=want sparse;
run;
proc print data=want;
run;
Reeza's sparse option works as long as each group is represented in your data at least once. Suppose there were a group1 3 that is not represented in your data, and you would still want them to show up in the frequency table. If that is the case, the solution is to create a reference table with all of your categories then right join your frequency table to it.
Create a reference table:
data ref;
do group1 = 1 to 3;
group2 = 'A';
output;
group2 = 'B';
output;
end;
run;
Create the frequency table with proc sql, right joining to the reference table:
proc sql;
select
r.group1,
r.group2,
count(h.group1) as freq
from
have h
right join ref r
on h.group1 = r.group1
and h.group2 = r.group2
group by
r.group1,
r.group2
order by
r.group1,
r.group2
;
quit;
Another option that's a cross between DWal's issue of "what if the data isn't in the data" and Reeza's One Proc, One Solution, is proc tabulate. If the format contains all possible values, even if the values don't appear, it works, with printmiss.
proc format;
value groupformat
1='Group 1'
2='Group 2'
3='Group 3'
;
quit;
data have;
input group1 group2 $;
cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;
proc tabulate data=have;
class group1 group2/preloadfmt;
format group1 groupformat.;
tables group1*group2,n/printmiss misstext='0';
run;
How to do this via proc summary, using DWal's reference table to specify which combinations of values to use:
data ref;
do group1 = 1 to 3;
group2 = 'A';
output;
group2 = 'B';
output;
end;
run;
data have;
input group1 group2 $1.;
cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;
proc summary nway data = have classdata=ref;
class group1 group2;
output out = summary (drop = _TYPE_);
run;
N.B. I had to tweak the have dataset slightly to make sure that group2 has length 1 in both datasets. If you use variables with the same name but different lengths in your classdata= and data= datasets, SAS will complain.

How to refer a variable from another file in SAS?

Suppose I have two files named "test" and "lookup".
The file "test" contains the following information:
COL1 COL2
az ab
fc ll
gc ms
cc ds
And the file "lookup" has:
VAR
ll
dd
cc
ab
ds
I want to find those observations, which are in "test" but not in "lookup" and to replace them with missing values. Here is my code:
data want; set test;
array COL[2] COL1 COL2;
do n=1 to 2;
if COL[n] in lookup.VAR then COL[n]=COL[n];
else COL[n]=.;
end;
run;
I tried the above code. But ERROR shows that "Expecting an relational or arithmetic operator".
My question is how to refer a variable from another file?
First, grab the %create_hash() macro from this post.
You need to use a hash object to achieve what you are looking for.
The return code from a hash lookup is zero when found and non-zero when not found.
Character missing values are not . but "".
data want;
set have;
if _n_ = 1 then do;
%create_hash(lu,var,var,"lookup");
end;
array COL[2] COL1 COL2;
do n=1 to 2;
var = col[n];
rc = lu.find();
if rc then
col[n] = "";
end;
drop rc var n;
run;
Here is an alternative approach using proc sql:
proc sql;
create table want as
select case when col1 in (select var from lookup) then '' else col1 end as col1,
case when col2 in (select var from lookup) then '' else col2 end as col2
from test;
quit;