I'd appreciate some guidance. I think I should use retain in a data step but I am not too sure how it works yet.
I have a table with 3 columns:
ID, Date, amount (numeric).
The table is already sorted by ID and Date.
I simply want to keep the rows in which the amount changed from the previous row and drop the rows in which it did not. Example below:
id | Date |amount |
A | 01JAN| 1 |
A | 02JAN| 1 | <- Drop this row
A | 03JAN| 2 |
B | 01JAN| 0 |
B | 02JAN| 1 |
You can use the NOTSORTED keyword on the BY statement. Although the data is sorted by ID and DATE, have the BY statement create the FIRST./LAST. flags based on ID and AMOUNT instead.
data want ;
set have ;
by id amount notsorted ;
if first.amount;
run;
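To see why NOTSORTED matters here, the following is a minimal, self-contained sketch. It uses the question's rows plus one hypothetical extra row (A, 04JAN, amount 1) in which the amount returns to an earlier value. The data set names have and want follow the answer above, and storing the date as a plain character string is an assumption made only to keep the example short.

```sas
/* Sample rows from the question; the A 04JAN row is a hypothetical addition */
data have;
  input id $ date $ amount;
  datalines;
A 01JAN 1
A 02JAN 1
A 03JAN 2
A 04JAN 1
B 01JAN 0
B 02JAN 1
;
run;

/* FIRST.amount is set whenever id or amount differs from the previous row */
data want;
  set have;
  by id amount notsorted;
  if first.amount;
run;

proc print data=want noobs;
run;
```

Only the A / 02JAN row should be dropped. Without NOTSORTED the step would stop with a "BY variables are not properly sorted" error at the A / 04JAN row, because within ID A the amounts 1, 1, 2, 1 are not in ascending order.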
The following solution uses the RETAIN statement to remember the values from the previous record, compares them with the current record, and deletes the row if the amount is the same. It only compares rows with the same ID; your question does not specify any conditions on the date, so if you need some, this is where to add them.
data want;
  set have;
  by id;
  /* prev_id gets length 1 from this initial value; add a LENGTH statement before it if id is longer */
  retain prev_id ' ';
  retain prev_amt;
  if _N_ = 1 then call missing(prev_id, prev_amt);
  /* same ID and same amount as the previous kept row: drop the row */
  if prev_id = id and prev_amt = amount then delete;
  /* remember the current row for the next comparison */
  prev_id = id;
  prev_amt = amount;
  keep id amount date;
run;
Related
I have a file documenting changes in marital status - ID, type of change (marriage, divorce, being widowed) and year (and month) of change. I want to calculate each person's marital status (married, divorced, widow(er), never been married) for any given year. Since a person can go through many changes and my file is around 20 million rows I'd like to skip to the next person when I find the answer and not continue through all of that person's other records.
I thought to sort by ID and descending date of change and then set by ID. For each ID, if the year I'm interested in is greater than (or equal to) the year of change then calculate marital status and output the ID and marital status. If not, continue to the next record until the condition is met. If no record meets the condition then marital status=never been married.
data a;
length type_change $10;
input ID type_change yr_change mnth_change;
cards;
1 marriage 2006 9
1 divorce 2010 5
10 marriage 2005 2
10 divorce 2012 10
10 marriage 2016 8
23 marriage 2017 6
35 marriage 2002 7
35 widow 2013 12
;
run;
For 2015 I'd like to get:
- ID marital_status
- 1 divorced
- 10 divorced
- 23 never been married
- 35 widowed
Thanks in advance!
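/* year of interest */
%let y = 2015;
/* do this sort only once and save sorted */
proc sort data = have out = sorted;
  by id yr_change mnth_change;
run;
/* one row per ID, so that IDs with no qualifying change are kept */
proc sort data = have (keep = id) out = ids nodupkey;
  by id;
run;
/* latest change on or before the year of interest, per ID */
data step1;
  set sorted;
  where yr_change <= &y;
  by id;
  if last.id;
run;
data want;
  merge step1 (in = a) ids (in = b);
  by id;
  if b and not a then status = "never married";
  else status = type_change;
run;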
If by skip you mean not reading them then you cannot "skip" observations. But you can ignore them by using IF statement (or other conditional logic).
Using RETAIN and BY-group processing should get you the answer.
%let year=2015;
data want ;
set a ;
by id yr_change mnth_change ;
length status $20;
retain status ;
if first.id then status='never been married ';
if yr_change <= &year then status=type_change ;
if last.id;
keep id status;
run;
Result:
Obs ID status
1 1 divorce
2 10 divorce
3 23 never been married
4 35 widow
If you have access to a master list of IDs, you could convert to using a WHERE statement, which MIGHT reduce the I/O needed to process all of the records. For example, merge the list of IDs with a subset of the marital status change records.
data want;
merge id_list a(in=in2 where=(yr_change <= &year));
by id;
length status $20;
retain status ;
if first.id then status='never been married ';
if in2 then status=type_change ;
if last.id;
keep id status;
run;
A DOW loop will let you compute a result over a group. An implicit output will save the result computed for the group. Because the result is dependent on your year of interest, you will want to track that also in any created data sets.
%let YEAR_CUTOFF = 2015;
data want (keep=id status year_cutoff);
  attrib
    id          length = 8
    status      length = $20 label = "Status at year end &YEAR_CUTOFF"
    year_cutoff length = 8
  ;
  retain year_cutoff &YEAR_CUTOFF;
  status = 'never been married';
  do until (last.ID); /* The DOW loop */
    set a; /* the question's data set, sorted by id and date of change */
    by id;
    if yr_change <= &YEAR_CUTOFF then status = type_change;
  end;
  /* No explicit OUTPUT in the step, so,
   * an implicit OUTPUT occurs here at the bottom of the step
   */
run;
Then use retain statement.
Extract all IDs:
proc sort data=a out=ids(keep= id) nodupkey ;
by id;
run;
Generate all the years that you need for all IDs:
data years;
set ids;
must_be_date=2000;
do i = 1 to 20;
must_be_date+1;
output;
end;
drop i;
run;
Join by condition:
proc sql;
create table res as
select *
from years left join a on years.must_be_date = a.yr_change and a.id = years.id
;
quit;
proc sort data=res;
by id must_be_date;
run;
Use retain:
data res;
retain temp "never been married";
set res;
by id must_be_date;
if first.id then temp="never been married";
if type_change="" then type_change = temp;
else temp=type_change;
run;
To check:
data res_2015;
set res;
where must_be_date=2015;
run;
Result table:
+--------------------+----+--------------+-------------+-----------+-------------+
| temp | ID | must_be_date | type_change | yr_change | mnth_change |
+--------------------+----+--------------+-------------+-----------+-------------+
| divorce | 1 | 2015 | divorce | . | . |
| divorce | 10 | 2015 | divorce | . | . |
| never been married | 23 | 2015 | never been | . | . |
| widow | 35 | 2015 | widow | . | . |
+--------------------+----+--------------+-------------+-----------+-------------+
I have some data that is structured as below. I need to create a table with subtotals, a total column that's TypeA + TypeB and a header that spans the columns as a table title. Also, it would be ideal to show different names in the column headings rather than the variable name from the dataset.
I cobbled together some preliminary code to get the subtotals and total, but not the rest.
data tabletest;
informat referral_total $50. referral_source $20.;
infile datalines delimiter='|';
input referral_total referral_source TypeA TypeB ;
datalines;
Long Org Name | SubA | 12 | 5
Long Org Name | SubB | 14 | 3
Longer Org Name | SubC | 0 | 1
Longer Org Name | SubD | 4 | 12
Very Long Org | SubE | 3 | 11
Very Long Org | SubF | 9 | 19
Very Long Org | SubG | 1 | 22
;
run;
Code that I wrote:
proc report data=tabletest nofs headline headskip;
column referral_total referral_source TypeA TypeB;
define referral_total / group ;
define referral_source / group;
define TypeA / sum ' ';
define TypeB / sum ' ';
break after referral_total / summarize style={background=lightblue font_weight=bold };
rbreak after /summarize;
compute referral_total;
if _break_ = 'referral_total' then
do;
referral_total = catx(' ', referral_total, 'Total');
end;
else if _break_ in ('_RBREAK_') then
do;
referral_total='Total';
end;
endcomp;
run;
This is the desired output:
The DEFINE statement has an option NOPRINT that causes the column to not be rendered, however, the variables for it are still available (in a left to right manner) for use in a compute block.
Stacking in the column statement allows you to customize the column headers and spans. In a compute block for non-group columns, the Proc REPORT data vector only allows access to the aggregate values at the detail or total line, so you need to specify the statistic as well, in the form variable.statistic (for example TypeA.sum).
This sample code shows how the _total column is hidden and the _source cells in the sub- and report- total lines are 'injected' with the hidden _total value. The _source variable has to be lengthened to accommodate the longer values that are in the _total variable.
data tabletest;
* ensure referral_source big enough to accommodate _total || ' TOTAL';
length referral_total $50 referral_source $60;
informat referral_total $50. referral_source $20.;
infile datalines delimiter='|';
input referral_total referral_source TypeA TypeB ;
datalines;
Long Org Name | SubA | 12 | 5
Long Org Name | SubB | 14 | 3
Longer Org Name | SubC | 0 | 1
Longer Org Name | SubD | 4 | 12
Very Long Org | SubE | 3 | 11
Very Long Org | SubF | 9 | 19
Very Long Org | SubG | 1 | 22
run;
proc report data=tabletest;
column
( 'Table 1 - Stacking gives you custom headers and hierarchies'
referral_total
referral_source
TypeA TypeB
TypeTotal
);
define referral_total / group noprint; * hide this column;
define referral_source / group;
define TypeA / sum 'Freq(A)'; * field labels are column headers;
define TypeB / sum 'Freq(B)';
define TypeTotal / computed 'Freq(ALL)'; * specify custom computation;
break after referral_total / summarize style={background=lightblue font_weight=bold };
rbreak after /summarize;
/*
* no thanks, doing this in the _source compute block instead;
compute referral_total;
if _break_ = 'referral_total' then
do;
referral_total = catx(' ', referral_total, 'Total');
end;
else if _break_ in ('_RBREAK_') then
do;
referral_total='Total';
end;
endcomp;
*/
compute referral_source;
* the referral_total value is available because it is left of me. It just happens to be invisible;
* at the break lines override the value that appears in the _source cell, effectively 'moving it over';
select (_break_);
when ('referral_total') referral_source = catx(' ', referral_total, 'Total');
when ('_RBREAK_') referral_source = 'Total';
otherwise;
end;
endcomp;
compute TypeTotal;
* .sum is needed because the left of me are groups and only aggregate values available here;
TypeTotal = Sum(TypeA.sum,TypeB.sum);
endcomp;
run;
I have two datasets, one for male and one for female, which contain identical variables. I need to find the percent difference between the sexes on each variable by group.
The datasets look something like this, but with more variables and groups,
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | F | 8 | 5 |
| 2 | F | 6 | 3 |
| 3 | F | 7 | 0 |
|-------+-----+------+------|
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | M | 9 | 7 |
| 2 | M | 8 | 5 |
| 3 | M | 6 | 3 |
|-------+-----+------+------|
The result I need is this:
| Group | percent_diffA | percent_diffB |
|-------+---------------+---------------|
| 1 | -0.117647059 | -0.333333333 |
| 2 | -0.285714286 | -0.5 |
| 3 | 0.153846154 | -2 |
|-------+---------------+---------------|
I could solve this via a merge by renaming each variable.
data difference;
merge
females (rename = (VarA = VarA_F VarB = VarB_F))
males (rename = (VarA = VarA_M VarB = VarB_M))
;
by group;
percent_diffA = (VarA_F - VarA_M) / ( (VarA_F + VarA_M) / 2 );
percent_diffB = (VarB_F - VarB_M) / ( (VarB_F + VarB_M) / 2 );
drop sex;
run;
However, this approach requires me to rename everything manually. With several variables, the rename statement becomes cumbersome. Unfortunately, this calculation is being interjected into some old code, so renaming the original datasets is not practical.
I'm wondering if there is another way to solve this problem which is less cumbersome.
EDIT: I have updated the variable names because that appears to have caused people confusion. They were originally called Var1 and Var2. They are now VarA and VarB. The real variable names are descriptive, for instance body_weight_g or gonadal_somatic_index. The variables are not simply listed with sequential numbers.
For a data set that contains sequentially numbered variables, there is variable list syntax for renaming the whole range of variables.
This example creates two sample data sets, each with 100 variables.
data have1 have2;
do group = 1 to 100;
sex = 'M';
array var(100);
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 42 then output have1;
sex = 'F';
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 100-42 then output have2;
end;
run;
The rename option works on all 100 variables.
data want;
merge
have1(rename=var1-var100=mvar1-mvar100 in=_M)
have2(rename=var1-var100=fvar1-fvar100 in=_F)
;
by group;
if _M & _F & first.group & last.group then do;
array one mvar1-mvar100;
array two fvar1-fvar100;
array results result1-result100;
do i = 1 to dim(results);
diff = one(i) - two(i);
mean = mean (one(i), two(i));
results(i) = diff / mean * 100;
end;
end;
keep group result:;
run;
Shenglin's answer is a nice and concise use of SQL.
An alternative method is constructing a macro variable specifying the renames to be used in the rename DSO (data set option). This can be done with an SQL query to the dictionary table containing the column names.
* This macro creates the macro variable rename_<inset>, to be used in a rename statement or data set option ;
* It will be of form: var1 = var1_suffix var2 = var2_suffix ... ;
* &inset is the input set. &suffix is the suffix to be added to all variables except for the variables specified in &keys. ;
* &keys variables should each be given in quotation marks, and separated by spaces. ;
%macro rename_list(inset, suffix, keys) ;
%global rename_&inset ; * So that this macro variable is accessible outside the macro ;
proc sql ;
select strip(name) || ' = ' || strip(name) || "_&suffix"
into :rename_&inset separated by ' '
from sashelp.vcolumn /* dictionary.columns can be used in place of sashelp.vcolumn */
where libname = 'WORK' & memname = "%sysfunc(upcase(&inset))"
& upcase(strip(name)) not in (' ' %sysfunc(upcase(&keys))); * The ' ' is included, so there is no error if no keys are given ;
quit ;
%mend rename_list ;
%rename_list(females, F, 'GROUP' 'SEX')
%rename_list(males , M, 'GROUP' 'SEX')
%put &rename_females ; * Check that the macro variables are correct ;
%put &rename_males ;
%macro pct_diff(num) ;
percent_diff&num = (Var&num._F - Var&num._M) / ( (Var&num._F + Var&num._M) / 2 ) ;
%mend pct_diff ;
data difference ;
merge females(rename = (&rename_females) drop = sex)
males (rename = (&rename_males ) drop = sex) ;
by group ;
%pct_diff(1) ;
%pct_diff(2) ;
run ;
dm 'vt difference';
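The percent_diff variable creation can also be shortened with a macro (as shown). If you had a large and/or variable number of sequentially numbered variables to compare, you could shorten it further by detecting the number of comparisons automatically: run the same SQL query with the select into part modified to be
select count(name) into :varct trimmed
to count the number of variables, and then generate the calls with a macro %do loop. A data step do loop would not work here, because macro calls are resolved before the data step runs, so the loop has to live inside a macro that is called within the data step:
%macro all_pct_diffs ;
%do i = 1 %to &varct ;
%pct_diff(&i)
%end ;
%mend all_pct_diffs ;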
Use table aliases in proc sql to avoid renaming the variables:
proc sql;
select a.group,(a.var1-b.var1)/((a.var1+b.var1)/2) as percent_diff1,
(a.var2-b.var2)/((a.var2+b.var2)/2) as percent_diff2
from females as a,males as b
where a.group=b.group;
quit;
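As a self-contained check of the alias approach against the question's edited variable names, here is a minimal sketch. The data set names females and males and the variables VarA and VarB come from the question; the output table name difference and the aliases f and m are only illustrative choices.

```sas
/* Sample data copied from the question */
data females;
  input group sex $ VarA VarB;
  datalines;
1 F 8 5
2 F 6 3
3 F 7 0
;
run;

data males;
  input group sex $ VarA VarB;
  datalines;
1 M 9 7
2 M 8 5
3 M 6 3
;
run;

/* The aliases f and m tell the two VarA (and VarB) columns apart, so no rename is needed */
proc sql;
  create table difference as
  select f.group,
         (f.VarA - m.VarA) / ((f.VarA + m.VarA) / 2) as percent_diffA,
         (f.VarB - m.VarB) / ((f.VarB + m.VarB) / 2) as percent_diffB
  from females as f
       inner join males as m
       on f.group = m.group
  order by f.group;
quit;
```

This should reproduce the result table in the question (for group 1, -1/8.5 and -2/6). The formulas still have to be written once per variable, so for many variables the macro-generated rename list may remain the more practical route.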
Sample Data
I was wondering if it is possible to use data instead of proc to count the number of categorical variables on a row as shown in 'count' example above. This will allow me to further use the data e.g COUNT=1 or COUNT > 1 to check morbidity.
Also will it be possible to then count the number of each diagnosis in the entire data set per patient while accounting for duplicates if there is any? For example there are 3 CB's and 2 AA's in this data set but CB should be 2 because patient 2 had it recorded twice.
Thank you for your time and have a lovely new year.
Your question is not clear, but you could handle the diag columns with UNION ALL and COUNT(DISTINCT ...):
select patient, count(distinct diag)
from (
select patient, diag1 as diag
from my_table
union all
select patient, diag2
from my_table
union all
select patient, diag3
from my_table
union all
select patient, diag4
from my_table
) t
group by patient
or simply UNION (which already removes duplicate rows) and COUNT:
select patient, count(diag)
from (
select patient, diag1 as diag
from my_table
union
select patient, diag2
from my_table
union
select patient, diag3
from my_table
union
select patient, diag4
from my_table
) t
group by patient
The image indicates that for each row you want a count of the number of columns with non-missing values. Additionally, you apparently have some way to do this using a PROC step, but would like to know how using a DATA step.
In DATA step you can count the number of non-missing values indirectly using CMISS, or directly using COUNTC against a constructed value:
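data have;
attrib pid length=8 diag1-diag4 length=$5;
* the & modifier lets a value contain single blanks (such as B B), so fields must be separated by two or more spaces;
input pid diag1 & diag2 & diag3 & diag4 &;
datalines;
1  AA  J9  HH6  .
2  CB  .  .  CB
3  J10  AA  CB  J10
4  B B  .  F90  .
5  J10  .  .  .
6  .  .  .  .
run;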
data have_with_count;
set have;
count = 4 - cmiss (of diag1-diag4);
count_way2 = countc(catx('~', of diag1-diag4, 'SENTINEL'), '~');
run;
In order to work against a MySQL data source, you will also need a libref that connects you to that remote data server.
Added
Counting distinct values across a row can be accomplished using a hash or sortc. Consider this example that sorts a copy of the row data (as an array) and counts the unique values within:
data want;
set have;
array diag diag1-diag4;
array v(4) $5 _temporary_;
do _n_ = 1 to dim(diag);
v(_n_) = diag(_n_);
end;
call sortc(of v(*));
uniq = 0;
do _n_ = 1 to dim(v);
if missing(v(_n_)) then continue;
if uniq = 0 then
uniq + 1;
else
uniq + ( v(_n_) ne v(_n_-1) );
end;
run;
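The hash alternative mentioned above is not shown in the answer, so here is a minimal sketch of one way it could look. The hash object name seen, the helper variables _d, _i and _rc, the output name want_hash, and the cut-down sample data are all assumptions made for illustration.

```sas
/* Cut-down sample data (single-blank delimited, no embedded blanks) */
data have;
  length pid 8 diag1-diag4 $5;
  input pid diag1-diag4;
  datalines;
1 AA J9 HH6 .
2 CB . . CB
3 J10 AA CB J10
6 . . . .
;
run;

data want_hash;
  length _d $5;
  if _n_ = 1 then do;
    /* one hash object, reused for every row */
    declare hash seen();
    seen.defineKey('_d');
    seen.defineData('_d');
    seen.defineDone();
  end;
  set have;
  array diag diag1-diag4;
  seen.clear();                  /* start each row with an empty hash */
  do _i = 1 to dim(diag);
    if not missing(diag[_i]) then do;
      _d = diag[_i];
      _rc = seen.replace();      /* adds the key if new, overwrites it if already present */
    end;
  end;
  uniq = seen.num_items;         /* number of distinct non-missing values in the row */
  drop _d _i _rc;
run;
```

For these four rows uniq should come out as 3, 1, 3 and 0. Compared with the sortc version, this avoids copying and sorting the row, at the cost of clearing the hash for every observation.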
With Richard's dummy data, to count the number of diagnoses and the number of unique diagnoses:
data want;
  set have;
  array var diag:;
  length temp $30;
  call missing(diag_num);
  unique_diag = 0;
  do over var;
    if not missing(var) then do;
      diag_num+1;
      /* temp collects the distinct values seen so far, separated by ~ ; FINDW matches whole values only */
      if not findw(temp, var, '~', 't') then do;
        temp = catx('~', temp, var);
        unique_diag+1;
      end;
    end;
  end;
  drop temp;
run;
I am trying to create a hash variable that is built incrementally. The specific problem I am trying to solve is that I have a column of currency pairs:
|--------------------|
| ID | CurrencyPair |
|----|---------------|
| 1 | USD/GBP |
| 2 | GBP/USD |
| 3 | USD/BRL |
| ...| ... |
I want currency pair for row 1 and currency pair for row 2 (USD/GBP) and (GBP/USD) to be recognized as the same. So I am trying to implement the following algorithm:
Create an empty column CurrencyPairRecode
Create a hash variable declare hash h(); h.defineKey('k'); h.defineData('d');
For every row of data, lookup if the currency pair exists in the hash table. If it does the value of CurrencyPairRecode is the same as CurrencyPair
rc = h.Check(key: CurrencyPair)
IF (rc=0) THEN
CurrencyPairRecode = CurrencyPair
If not, check if the flipped currency pair is in the hash table. If it is, CurrencyPairRecode is the flipped value
CALL CATX("/",FLIPPED,SUBSTR(SETTLEMENT_EXCHANGE_RATE_BASIS, 4, 3),SUBSTR(SETTLEMENT_EXCHANGE_RATE_BASIS, 1, 3));
flip_rc = h.Check(key: FLIPPED);
IF (flip_rc = 0) THEN
CurrencyPairRecode = flipped;
If neither, CurrencyPairRecode is the same as CurrencyPair and add CurrencyPair to the hash table.
IF (rc^=0 AND flip_rc^= 0) THEN
h.ADD(key: CurrencyPair, data: 1);
CurrencyPairRecode = CurrencyPair
I have tried this code but am getting errors. I am completely new to SAS so not sure how to troubleshoot. All help is appreciated.
The general approach I'd use is to store the currency pair in always sorted order. This is particularly appealing when the order is really not relevant (as you don't have to keep track of it).
I would do something like this.
data have;
input ID CurrencyPair $;
datalines;
1 USD/GBP
2 GBP/USD
3 USD/BRL
;;;;
run;
data for_hash;
set have;
array curs[2] $ _temporary_;
curs[1] = scan(currencyPair,1,'/');
curs[2] = scan(currencyPair,2,'/');
call sortc(of curs[*]);
new_pair = catx('/',of curs[*]);
put _all_;
run;
You can then load the hash in that same datastep. Using call sortc will sort the variables alphabetically, so that you have a single currency pair. You can then test its presence and add it if needed without having to test twice.
I would also express a general preference for storing it with two keys (the two currencies) rather than with a single merged key, but there may be reasons for not doing that as well in your application. Two keys tends to be easier to work with in applications like this in my experience.
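A minimal sketch of loading the hash in that same data step, assuming the normalized (alphabetically sorted) pair itself is an acceptable value for CurrencyPairRecode. Note that this differs from the question's algorithm, which keeps whichever orientation appears first. The flag first_seen is a hypothetical variable added only to show the lookup.

```sas
data have;
  input ID CurrencyPair $;
  datalines;
1 USD/GBP
2 GBP/USD
3 USD/BRL
;
run;

data want;
  length CurrencyPairRecode $ 7;
  if _n_ = 1 then do;
    declare hash h();
    h.defineKey('CurrencyPairRecode');
    h.defineData('CurrencyPairRecode');
    h.defineDone();
  end;
  set have;
  array curs[2] $ 3 _temporary_;
  curs[1] = scan(CurrencyPair, 1, '/');
  curs[2] = scan(CurrencyPair, 2, '/');
  call sortc(of curs[*]);                    /* alphabetical order, so USD/GBP and GBP/USD agree */
  CurrencyPairRecode = catx('/', of curs[*]);
  first_seen = (h.check() ne 0);             /* 1 the first time a pair appears, in either order */
  if first_seen then h.add();
run;
```

With the three sample rows, CurrencyPairRecode should be GBP/USD, GBP/USD and BRL/USD, and first_seen should be 1, 0 and 1. Because the key is normalized before the lookup, a single check is enough.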
If I understand your question correctly, maybe you could try this:
data want;
if _n_=1 then do;
declare hash h();
h.definekey('CurrencyPair');
h.definedata('CurrencyPair');
h.definedone();
end;
set have;
_CurrencyPair=prxchange('s/(.*)\/(.*)/$2\/$1/',-1,strip(CurrencyPair));
rc1=h.check();
rc2=h.check(key:_CurrencyPair);
if rc1^=0 and rc2^=0 then do;
h.add();
CurrencyPairRecode = CurrencyPair;
end;
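else if rc1^=0 and rc2=0 then do;
/* the flipped pair is already stored, so reuse it; adding this orientation as well would make later rows match themselves */
CurrencyPairRecode =_CurrencyPair;
end;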
else if rc1=0 then CurrencyPairRecode = CurrencyPair;
drop rc: _:;
run;