Counting matching character values - sas

I'm new to SAS and I have a basic question about counting the number of matching values in a column.
For example, if I have a variable called hair_color, and the different values are "brown", "black", "blonde", and "red" - I want to be able to produce a table that shows my database has 45 people with brown hair, 43 with black hair, 23 with blonde, etc.
I've written:
proc freq data=fake_dataset;
tables hair_color;
where hair_color="brown" "black" "blonde" "red";
run;
This code only runs if I have one value in the 'where' statement, for example, only "brown". How can I create a table of counts for all four hair colors?

PROC FREQ creates a bin for each distinct value of hair_color, so no WHERE statement is needed to count every color. If the data contain more than the four colors you listed and you want counts for only those four, use the IN operator in the WHERE statement:
proc freq data=fake_dataset;
tables hair_color;
where hair_color in ("brown", "black", "blonde", "red");
run;
The commas separating the values list in the IN clause are optional.
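If the counts are needed as a dataset rather than only as printed output, one option is the OUT= option of the TABLES statement. This is a minimal sketch: the rows in fake_dataset are made up for illustration, and hair_counts is a hypothetical output name.

```sas
/* Made-up sample data, for illustration only */
data fake_dataset;
  input hair_color $;
  datalines;
brown
black
brown
red
blonde
grey
;
run;

/* OUT= writes one row per hair color with COUNT and PERCENT columns */
proc freq data=fake_dataset;
  tables hair_color / out=hair_counts;
  where hair_color in ("brown", "black", "blonde", "red");
run;
```

Dropping the WHERE statement would also count the values outside the list, such as "grey" here.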


add a new column based on other columns in sas

I'm new to SAS and would like to get help with the question as follows:
1: Sample table shows as below
Time Color Food label
2020 red Apple A
2019 red Orange A,B
2018 blue Apple A,B
2017 blue Orange B
Logic to return label is:
when color = 'red' then 'A'
when color = 'blue' then 'B'
when food = 'orange' then 'B'
when food = 'apple' then 'A',
Since row 2 has both red and orange, its label should contain both, 'A,B'; the same goes for row 3.
The requirement is to print the label for each combination. I know that a CASE WHEN expression can define the label based on color and food. Here we have only 2 colors and 2 foods, but with, say, 7 colors and 10 foods there would be 7*10 combinations, and I don't want to list all of them in a CASE WHEN expression.
Is there a convenient way to return the label? Thanks for any ideas! (I'd prefer to do it in PROC SQL, but other SAS code is also welcome.)
This looks like a simple application of formats. So define a format that converts COLOR to a code letter and a second one that converts FOOD to a code letter.
proc format ;
value $color 'red'='A' 'blue'='B';
value $food 'Apple'='A' 'Orange'='B' ;
run;
Then use those to convert the actual values of COLOR and FOOD variables into the labels. Either in a data step:
data want;
set have ;
length label $5 ;
label=catx(',',put(color,$color.),put(food,$food.));
run;
Or an SQL query:
proc sql ;
create table want as
select *
, catx(',',put(color,$color.),put(food,$food.)) as label length=5
from have
;
quit;
You do not need to re-create the format if the data changes, only if the list of possible values changes.
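Note that when both formats return the same letter (red and Apple both map to 'A'), CATX will produce 'A,A', and blue with Apple comes out as 'B,A'. If the single-letter and alphabetical labels from the question are wanted, a possible variation is sketched below. The helper variables c and f are hypothetical names, and the sample rows are copied from the question.

```sas
proc format;
  value $color 'red'='A' 'blue'='B';
  value $food  'Apple'='A' 'Orange'='B';
run;

/* Sample rows from the question, without the desired label column */
data have;
  input Time Color $ Food $;
  datalines;
2020 red Apple
2019 red Orange
2018 blue Apple
2017 blue Orange
;
run;

data want;
  set have;
  length label $5 c f $1;
  c = put(color, $color.);     /* code letter for the color */
  f = put(food, $food.);       /* code letter for the food */
  call sortc(c, f);            /* put the two letters in alphabetical order */
  if c = f then label = c;     /* same letter twice: keep one copy */
  else label = catx(',', c, f);
  drop c f;
run;
```

This only handles two code variables; with more sources a different de-duplication approach would be needed.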

SAS join on a flag to indicate percentile

I am looking to join two tables together
Table 1 - The baseball dataset
DATA baseball;
SET sashelp.baseball
(KEEP = crhits);
RUN;
Table 2 - A table containing the percentiles of CRhits
PROC STDIZE
DATA = baseball
OUT=_NULL_
PCTLMTD=ORD_STAT
PCTLDEF=5
OUTSTAT=STDLONGPCTLS
(WHERE = (SUBSTR(_TYPE_,1,1) = "P"))
pctlpts = 1 TO 99 BY 1;
RUN;
I would like to join these tables together to create a table that contains the values for crhits and then a column identifying which percentile that value belongs to like below
crhits percentile percentile_value
54 p3 54
66 p5 66
825 p63 825
1134 p76 1133
The last column indicates the percentile value given by stdlongpctls
I currently use the following code to calculate the percentiles and a loop to count the number of "Events" per percentile, per factor
I have tried a cross-join but I am having trouble visualising how to join these two tables without an explicit key
PROC SQL;
CREATE TABLE cross_join_table AS
SELECT
a.crhits
, b._TYPE_
, CASE WHEN
a.crhits < b.type THEN b._TYPE_ END AS percentile
FROM
baseball a
CROSS JOIN
stdlongpctls b;
QUIT;
Also, if there is an easier or more efficient way to find the number of observations and the total of the dependent variable per percentile group (in my actual dataset I am modelling a default flag event, so the sum of 1's per percentile group), I would appreciate it.
Use PROC RANK instead to group it into the percentiles.
proc rank data=sashelp.baseball out=baseball_ranks groups=100;
var crhits;
ranks rank_crhits;
run;
You can then summarize it using PROC MEANS.
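A minimal sketch of that summary step follows. With GROUPS=100 the rank variable takes values 0 to 99. The output dataset name pctl_summary and the statistic names n_obs, min_crhits and max_crhits are hypothetical; for a 0/1 event flag, a SUM= statistic on that flag could be added in the same way.

```sas
proc rank data=sashelp.baseball out=baseball_ranks groups=100;
  var crhits;
  ranks rank_crhits;
run;

/* One output row per percentile group: count plus the crhits range */
proc means data=baseball_ranks noprint nway;
  class rank_crhits;
  var crhits;
  output out=pctl_summary n=n_obs min=min_crhits max=max_crhits;
run;
```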

Proc compare - comparing variables in two datasets that have different sizes and different variable placement

I have a significant problem with PROC COMPARE. I have two datasets, each with two columns: one lists table names and the other lists the names of the variables that belong to those tables. I want to compare the values of the second column based on the values of the first. I have it partly working, but the datasets have different sizes because one of them contains additional values, meaning that a new variable was added to a table and so appears in the middle of the dataset. Unfortunately, PROC COMPARE compares the two datasets row by row, so in my case it looks like this:
ds 1 | ds 2
cost | box_nr
other | cost_total
As you can see, a new value box_nr was added to the second dataset that appears above the value that I want it to compare variable cost to (cost_total). So I would like to know if it's possible to compare values (check for differences in character sequence) that have at least minimal similarity - for example 3 letters (cos) or if it's possible to just put values like box_nr at the end suggesting that they don't appear in a certain dataset.
My code:
PROC Compare base=USERSPDS.MIzew compare=USERSPDS.MIwew
out=USERSPDS.result outbase outcomp outdif noprint;
id 'TABLE HD'n;
where ;
run;
proc print data=USERSPDS.result noobs;
by 'TABLE HD'n;
id 'TABLE HD'n;
title 'COMPARISON:';
run;
Untested, but this should get you some of the way.
proc sql;
create table compare as
select
coalesce(a.cola, b.cola) as cola,
a.colb as acolb,
b.colb as bcolb
from dataa as a
full outer join datab as b
on
a.cola = b.cola and
compged(a.colb, b.colb) <= 100;
quit;
Have a look at the compged documentation for further information.
Sounds like you could make a new variable in both datasets, VAR3chars=substr(var,1,3) and then add that variable to your ID statement. I think that should work unless there are duplicate values.
So if one dataset had var="cost" and the other had var="cost_total", they would match on the id so they would be compared and found to be different.
If one dataset had var="box_nr" and the other did not have any values starting with "box", they would not match on the id so compare would find that a record exists for that id in one dataset but not the other.
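A rough sketch of that idea, under assumptions: the datasets base_ds and comp_ds, and the column names table_name and var, are hypothetical stand-ins for the real ones, and the rows are made up from the example in the question.

```sas
/* Made-up stand-ins for the two datasets */
data base_ds;
  input table_name $ var :$20.;
  datalines;
t1 cost
t1 other
;
run;

data comp_ds;
  input table_name $ var :$20.;
  datalines;
t1 box_nr
t1 cost_total
t1 other
;
run;

/* Add the three-character matching key to each dataset */
data base_key;
  set base_ds;
  length var3chars $3;
  var3chars = substr(var, 1, 3);
run;

data comp_key;
  set comp_ds;
  length var3chars $3;
  var3chars = substr(var, 1, 3);
run;

/* PROC COMPARE with an ID statement expects both inputs sorted by the ID variables */
proc sort data=base_key; by table_name var3chars; run;
proc sort data=comp_key; by table_name var3chars; run;

/* LISTALL also reports observations found in only one of the datasets */
proc compare base=base_key compare=comp_key listall;
  id table_name var3chars;
  var var;
run;
```

As noted above, this breaks down if two variables in the same table share their first three characters.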

How to merge two columns in SAS keeping format

I have two columns. The first column contains values (count) and the second column the percentage (from the total population). So I have:
count percentage
12 (48%)
14 (29%)
89 (50%)
I would like one column with both parts:
count
12(48%)
14(29%)
89(50%)
I have tried:
data mydata1;
set mydata;
countper=catx(' ',count, percent);
run;
and
data mydata1;
set mydata;
counter=count || percent;
run;
Both of these sort of combine, the 'catx' being more successful, but I lose the brackets and percentage and the number of decimal places in the percentage originally used.
How can I combine these columns as I wish?
Use the VVALUE function, which returns each variable's formatted value as text, so nothing has to be re-formatted:
countper=catx(' ',vvalue(count), vvalue(percent));
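A self-contained sketch of the same idea follows. It assumes, purely for illustration, that percent is a numeric fraction carrying the PERCENT8. format and that the parentheses are not part of the stored value; the real data may differ. CATS is used here instead of CATX so that no space is inserted between the parts.

```sas
/* Made-up data: percent stored as a fraction, displayed with PERCENT8. */
data mydata;
  input count percent;
  format percent percent8.;
  datalines;
12 0.48
14 0.29
89 0.50
;
run;

data mydata1;
  set mydata;
  length countper $20;
  /* VVALUE returns the formatted text; CATS strips blanks and joins the pieces */
  countper = cats(vvalue(count), '(', vvalue(percent), ')');
run;
```

The result is intended to look like 12(48%), with the percentage shown exactly as its format displays it.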

SAS: concatenate different datasets while keeping the individual data table names

I'm trying to concatenate multiple datasets in SAS, and I'm looking for a way to store information about individual dataset names in the final stacked dataset.
For example, the initial datasets are "my_data_1", "abc" and "xyz", each with columns 'var_1' and 'var_2'.
I want to end up with a "final" dataset with columns 'var_1', 'var_2' and 'var_3', where 'var_3' contains "my_data_1", "abc" or "xyz" depending on which dataset a particular row came from.
(I have a kludgy solution, i.e. adding the table name as an extra variable in each individual dataset. But I have around 100 tables to stack and I'm looking for an efficient way to do this.)
If you have SAS 9.2 or newer you have the INDSNAME option
http://support.sas.com/kb/34/513.html
So:
data final;
format dsname datasetname $20.; *something equal to or longer than the longest dataset name including the library and dot;
set my_data_1 abc xyz indsname=dsname;
datasetname=dsname;
run;
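The INDSNAME= variable holds the two-level name in upper case, for example WORK.MY_DATA_1. If only the member name is wanted in var_3, as in the question, a possible sketch is shown below; the one-row input datasets are made up for illustration.

```sas
/* Made-up one-row versions of the three input datasets */
data my_data_1 abc xyz;
  var_1 = 1;
  var_2 = 2;
run;

data final;
  length var_3 $32;
  set my_data_1 abc xyz indsname=dsname;
  /* Keep the part after the libref and convert it to lower case */
  var_3 = lowcase(scan(dsname, 2, '.'));
run;
```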
Use the IN= data set option when you set each data set:
data final;
set my_data_1(in=a) abc(in=b) xyz(in=c);
if a then var_3='my_data_1';
else if b then var_3='abc';
else if c then var_3='xyz';
run;
run;