I'm trying to compare multiple sets of data by putting the values into groups bounded by two numbers. Originally I had statements like,
if COLUMN1 gt 0 and COLUMN1 LE 1000 then PRICE_GROUP = 1000;
I had this going up by 1000 to 100,000. The only problem is that once I counted how many observations were in each PRICE_GROUP, some groups were missing entirely (for example, the 57,000 group had no values, so it did not appear at all when I counted by PRICE_GROUP). The solution, I think, is to make a table with the bounds for each group, and then compare the actual value against the upper and lower bound.
proc iml;
mat = j(100,2,0);
total = 100000;
mat[1,1] = 0;
mat[1,2] = mat[1,1] + (total/100);
do i = 2 to nrow(mat);
mat[i,1] = mat[i-1,1] + (total/100);
mat[i,2] = mat[i,1] + (total/100);
end;
create dataset from mat;
append from mat;
quit;
This creates the table against which I can compare the values, but is there an easier way besides proc iml? My next step was going to be a loop comparing each value with the two columns and adding a new column to hold the count in each bucket. That still seems like an intensive, inefficient process.
IML isn't a terrible solution, but there are a few others depending on what exactly you're doing.
The most common is proc format. Create a format that manages each bucket, like so:
proc format;
value buckets
0-1000 = 1000
1000<-2000 = 2000
...
other="NA";
run;
Then you can either use the format (or informat) to create a new variable with the bucketed value, or even better, apply the format on the fly (i.e., in proc means or whatnot). That not only means you don't have to rewrite the dataset, but also lets you swap formats on and off depending on how many buckets you want (say, a buckets100 format or a buckets20).
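For example, a minimal sketch of applying the format on the fly in PROC MEANS (dataset and variable names borrowed from the question):
proc means data=have n;
class column1; /* groups by the formatted (bucketed) value, not the raw value */
format column1 buckets.;
run;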
Second, your specific question looks like it's solvable just using math:
%let total = 100000; /* matches the 100,000 total from the question */
data want;
set have;
bucket = &total/100*(1+floor(column1/(&total/100)));
run;
although obviously that doesn't work for every example.
Third, you could use a hash lookup table if you are unable to use formats (for example, when two or more variables determine the bucket). If that's useful I can expand on it, or just google around, as hashes are very commonly used for lookups in SAS. That's the closest thing to the IML solution inside a regular data step.
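For reference, a minimal sketch of a data-step hash lookup, assuming a hypothetical lookup dataset lookup_tbl with key columns key1/key2 and a data column bucket:
data want;
if _n_ = 1 then do;
if 0 then set lookup_tbl; /* adds the lookup columns to the PDV */
declare hash h(dataset: 'lookup_tbl'); /* loads the lookup table into memory */
h.defineKey('key1', 'key2');
h.defineData('bucket');
h.defineDone();
end;
set have;
if h.find() ne 0 then call missing(bucket); /* no match: leave bucket missing */
run;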
Create another table with groups:
data group_table;
do price_group=1000 to 100000 by 1000;
output;
end;
run;
Then left join the grouping/comparison table with this new table using price_group as key:
proc sql;
create table price_group_compare as
select L.price_group, R.group1_count, R.group2_count
from group_table as L
left join group_counts as R
on L.price_group = R.price_group;
quit;
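One caveat: groups with no observations will come through with missing counts after the left join. If you'd rather see zeros, wrap the counts in coalesce, e.g.:
select L.price_group,
coalesce(R.group1_count, 0) as group1_count,
coalesce(R.group2_count, 0) as group2_count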
Related
I have 18 separate datasets that contain similar information: patient ID, number of 30-day equivalents, and total day supply of those 30-day equivalents. I've output these from a dataset that contains those 3 variables plus the medication class (VA_CLASS) and the quarter it was captured in (a total of 6 quarters).
Here's how I've created the 18 separate datasets from the snip of the dataset shown above:
%macro rx(class,num);
proc sql;
create table dm_sum&class._qtr&num as select PatID,
sum(equiv_30) as equiv_30_&class._&num
from dm_qtrs
where va_class = "HS&class" and dm_qtr = &num
group by 1;
quit;
%mend;
%rx(500,1);
%rx(500,2);
%rx(500,3);
%rx(500,4);
%rx(500,5);
%rx(500,6);
%rx(501,1);
and so on...
I then need to merge all 18 datasets back together by PatID. What I'd like to do is iteratively add each newly created dataset to the previous ones, as in: add dataset dm_sum_500_qtr3 to a file that already contains the results of dm_sum_500_qtr1 and dm_sum_500_qtr2.
Thanks for looking, Brian
In the macro, append the created data set to an accumulator data set. Be sure to delete the accumulator before starting so there is a fresh accumulation. If the process is run at different times (like weekly or monthly), you may want to incorporate a unique index to prevent repeated appends. If you are stacking all these sums, the create table should also select va_class and dm_qtr.
%macro rx(class, num, stack=perm.allClassNumSums);
proc sql; create table dm_sum&class._qtr&num as … ; quit;
proc append force base=&stack data=dm_sum&class._qtr&num;
run;
%mend;
proc sql;
drop table perm.allClassNumSums;
quit;
%rx(500,1)
%rx(500,2)
%rx(500,3)
%rx(500,4)
%rx(500,5)
…
A better approach might be a single query with a larger WHERE clause, leaving class and quarter as categorical variables. Your current approach moves data (class and qtr) into metadata (column names); such a transformation makes additional downstream processing more difficult.
Proc TABULATE or REPORT can use a CLASS statement to assist in creating output with category-based columns. These procedures might even be able to work directly with the original data set, without requiring a preparatory SQL query.
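For instance, a rough PROC TABULATE sketch (untested; the layout is an assumption, using the variable names from the question):
proc tabulate data=dm_qtrs;
class PatID va_class dm_qtr;
var equiv_30;
table PatID, va_class*dm_qtr*equiv_30*sum;
run;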
proc sql;
create table want as
select
PatID, va_class, dm_qtr,
sum(equiv_30) as equiv_30_sum
from dm_qtrs
where catx(':', va_class, dm_qtr) in
(
'HS500:1'
'HS500:2'
'HS500:3'
…
'HS501:1'
)
group by PatID, va_class, dm_qtr;
quit;
I have a dataset which has to be rolled up based on the granularity (FIELD1 & FIELD2). Two of the metric fields (METRIC1 & METRIC2) have to be summed up. Until now it seems an easy GROUP BY task, but I also have a string field (FLAG) which has to be rolled up by concatenating the distinct values.
[Input dataset and expected result were shown as images in the original post.]
This operation can be performed in Oracle using the LISTAGG() function.
Kindly help me out in achieving the same in SAS Proc SQL.
I don't believe there's a direct way to do this in SAS. CATS (and similar concatenation functions) aren't aggregation functions. It was suggested to add these back a few years ago but nothing came of it that I'm aware of (see this thread.)
If I understand right, what you're doing is GROUP BY field1/field2, SUM metric1/metric2, and make a single FLAG field that concatenates all seen FLAG field values (but doesn't group by them).
The way I would handle this is to first do your aggregation (summing by field1/field2), and then join that back to a table that collapses flag to one value per field1/field2. You could build that flag table most easily in a data step, something like:
data want;
set have;
by field1 field2;
length flag_out $100; *or longer if you need longer;
flag_out = catx(',',flag_out,flag);
if last.field2 then output;
rename flag_out=flag;
drop flag;
run;
This assumes it's already sorted by field1/field2; otherwise you need to do that first.
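For completeness, the sort is just:
proc sort data=have;
by field1 field2;
run;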
As stated, there is no LISTAGG() function, and there is also no built-in feature for creating a custom aggregate function. However, there are two possibilities that produce the desired output.
Example one
Data step with DOW processing and hashing for tracking distinct flag values while concatenating within a group.
data want;
if 0 then set have; *prep pdv;
length flags $200;
declare hash _flags();
_flags.defineKey('flag');
_flags.defineDone();
do until (last.f2);
set have;
by f1 f2;
m1_sum = sum(m1_sum,m1);
m2_sum = sum(m2_sum,m2);
if _flags.find() ne 0 then do;
_flags.add();
flags = catx(',',flags,flag);
end;
end;
drop m1 m2 flag;
_flags.delete();
run;
Example two
Create an FCMP custom function used from within SQL. Because FCMP cannot create an aggregate function, the result will be automatically remerged against the original data, which must then be filtered. The FCMP function also uses a hash for tracking distinct values of flag within a group.
proc fcmp outlib=sasuser.functionsx.package;
function listagg(f1 $, f2 $, n, item $) $;
length result $32000 index 8;
static flag;
static index;
declare hash items();
if flag = . then do;
flag = 1;
rc = items.defineKey('item');
rc = items.defineDone();
end;
static items;
index + 1;
rc = items.replace();
if index = n then do;
declare hiter hi('items');
result = '';
do while (hi.next() = 0);
result = catx(',',result,item);
end;
index = 0;
rc = items.clear();
return (result);
end;
else
return ("");
endsub;
run;
options cmplib=sasuser.functionsx;
proc sql;
create table wanted as
select * from
(
select /* subselect is a remerge due to 'listagg' mimic */
f1,
f2,
listagg(f1, f2, count(*), flag) as flags,
sum(m1) as m1,
sum(m2) as m2
from have
group by f1, f2
)
where flags is not null /* filter the subselect */
;
quit;
Ideally a hash of hashes would have been used, but FCMP only provides for hash instance creation in a declare statement, and dynamic hashes cannot be instantiated with _new_. SAS Viya users would be able to use the new component object Dictionary in an FCMP function, and might be able to have a Dictionary of Dictionaries to track the distinct flag values in each group.
Thanks everyone for your valuable input. Apparently there is no straightforward solution to this scenario in SAS. With the bigger picture of the requirement in mind, I have decided to tackle the problem at the data layer itself, or to add another intermediate presentation layer.
I'm sure many have pointed out this need to SAS; I have raised the issue with them too, and hope they look into it and come up with a function similar to LISTAGG or GROUP_CONCAT.
Joe's answer is very good but is missing one critical part. There should be a
retain flag_out;
line after the 'by' line.
I have hundreds of thousands of IDs in a large dataset.
Some records have the same ID but different data points. Some of these IDs need to be merged into a single ID. People registered for a system more than once should be just one person in the database.
I also have a separate file that tells me which IDs need to be merged, but it's not always a one-to-one relationship. For example, in many cases I have x->y and then y->z because they registered three times. I had a macro that essentially was the following set of if-then statements:
if ID='1111111' then do; ID='2222222'; end;
if ID='2222222' then do; ID='3333333'; end;
I believe SAS runs this one record at a time. My list of merged IDs is almost 15k long, so it takes forever to run and the list just gets longer. Is there a faster method of updating these IDs?
Thanks
EDIT: Here is an example of the situation, except the macro is over 15k lines long due to all the merges.
data one;
input ID $5. v1 $ v2 $;
cards;
11111 a b
11111 c d
22222 e f
33333 g h
44444 i j
55555 k l
66666 m n
66666 o p
;
run;
%macro ID_Change;
if ID='11111' then do; ID='77777'; end; *77777 is a brand new ID;
if ID='22222' then do; ID='88888'; end; *88888 is a new ID but is merged below;
if ID='88888' then do; ID='99999'; end; *99999 becomes the newer ID;
%mend;
data two; set one; %ID_Change; run;
A hash table will greatly speed up the process. Hash tables are one of the little-used, but highly effective, tools in SAS. They're a bit bizarre since the syntax is very different from standard SAS programming. For now, think of it as a way to merge data together in-memory (a big reason as to why it's so fast).
First, create a dataset that has the conversions that you need. We want to match up by ID, then convert it to New_ID. Consider ID as your key column, and New_ID as your data column.
dataset: translate

ID      New_ID
111111  222222
222222  333333
In a hash table, you need to consider two things:
The Key column(s)
The Data column(s)
The Data column is what will be replacing observations matched by the Key column. In other words, New_ID will be populated every time there's a match for ID.
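For illustration, the lookup table above could be keyed in like so (a sketch; in practice you would build translate from your merge file):
data translate;
input ID :$6. New_ID :$6.;
cards;
111111 222222
222222 333333
;
run;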
Next, you'll want to do your hash merge. This is performed in the data step.
data want;
set have;
/* Only declare the hash object on the first iteration.
Otherwise it will do this every record. */
if(_N_ = 1) then do;
declare hash id_h(dataset: 'translate'); *Declare a hash object called 'id_h';
id_h.defineKey('ID'); *Define key for matching;
id_h.defineData('New_ID'); *The new ID after matching;
id_h.defineDone(); *Done declaring this hash object;
call missing(New_ID); *Prevents a warning in the log;
end;
/* If a customer has changed multiple times, keep iterating until
there is no longer a match between tables */
do while(id_h.Find() = 0);
_loop_count+1; *Tells us how long we've been in the loop;
/* Just in case the while loop gets to 500 iterations, then
there's likely a problem and you don't want the data step to get stuck */
if(_loop_count > 500) then do;
put 'WARNING: ' ID ' iterated 500 times. The loop will stop. Check observation ' _N_;
leave;
end;
/* If the ID of the hash table matches the ID of the dataset, then
we'll set ID to be New_ID from the hash object */
ID = New_ID;
end;
_loop_count = 0;
drop _loop_count;
run;
This should run very quickly and provide the desired output, assuming that your lookup table is coded in the way that you need it to be.
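Applied to the question's example, translate would hold 11111 -> 77777, 22222 -> 88888, and 88888 -> 99999; the do while loop then resolves 22222 through 88888 to 99999 in a single pass.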
Use PROC SQL or a MERGE step against your separate file (after you have created a separate dataset from it, using infile or proc import) to append this unique id to all records. If your separate file contains only the duplicates, you will need to create a dummy unique id for the non-duplicates.
Do PROC SORT with BY unique id and timestamp of signup.
Use a DATA step with the same BY variables. Depending on whether you want to keep the first or last signup, do if first.timestamp then output; (or last, etc.)
Or you could do it all in one PROC SQL using a left join to the separate file, a coalesce step to return a dummy unique id if it is not contained in the separate file, a group by unique id, and a having max(timestamp) (or min). You can also coalesce any other variables you might want to try to preserve across signups -- for example, if the first signup contained a phone number and successive signups were missing that data point.
Without a reproducible example it's hard to be more specific.
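Still, a minimal sketch of the sort-and-dedupe steps, with hypothetical names registrations, unique_id, and signup_ts:
proc sort data=registrations;
by unique_id signup_ts;
run;
data deduped;
set registrations;
by unique_id;
if first.unique_id; /* keeps the earliest signup per person; use last.unique_id for the latest */
run;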
I have to join 2 tables on a key (say ABC). I have to update one single column in table A using a coalesce function: coalesce(a.status_cd, b.status_cd).
TABLE A:
Contains some 100 columns. Key column ABC.
TABLE B:
Contains just 2 columns: key column ABC and status_cd.
TABLE A, which I use in this left join query, has more than 100 columns. Is there a way to use a.* followed by this coalesce function in PROC SQL, without creating a new column from the PROC SQL; CREATE TABLE AS ... step?
Thanks in advance.
You can take advantage of dataset options to make it so you can use wildcards in the select statement. Note that the order of the columns could change doing this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;
I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
update master a
set status_cd = coalesce(status_cd,
(select status_cd
from transaction b
where a.ABC = b.ABC))
where exists (select 1
from transaction b
where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
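For example, a hedged sketch of explicit pass-through with Oracle as the target (connection options and table names are placeholders, and since Oracle lacks update...from, a correlated subquery stands in):
proc sql;
connect to oracle (user=myuser password=mypass path='mydb'); /* hypothetical credentials */
execute (
update table_a a
set status_cd = (select b.status_cd from table_b b where b.abc = a.abc)
where status_cd is null
and exists (select 1 from table_b b where b.abc = a.abc)
) by oracle;
disconnect from oracle;
quit;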
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the modify statement: if this data step is interrupted, the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
update master
set weight = coalesce(weight,input(name,key.))
where master.name in (select name from transaction);
quit;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses to convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric, you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.
So, I have a significant problem with proc compare. I have two datasets, each with two columns: one column lists table names and the other the names of variables that correspond to those tables. I want to compare the variable-name values between the datasets, matched on the table-name column. I somewhat made it work, but the datasets have different sizes due to additional values in one of them, meaning a new variable was added in the middle of a dataset (a new variable was added to a table). Unfortunately, proc compare matches values from the two datasets positionally and checks them against each other, so in my case it looks like this:
ds 1 | ds 2
cost | box_nr
other | cost_total
As you can see, a new value box_nr was added to the second dataset, appearing above the value that I want compared against cost (cost_total). So I would like to know whether it's possible to compare values (check for differences in character sequence) that have at least minimal similarity - for example, the first 3 letters (cos) - or whether values like box_nr can be pushed to the end, indicating that they don't appear in the other dataset.
My code:
PROC COMPARE base=USERSPDS.MIzew compare=USERSPDS.MIwew
out=USERSPDS.result outbase outcomp outdif noprint;
id 'TABLE HD'n;
run;
proc print data=USERSPDS.result noobs;
by 'TABLE HD'n;
id 'TABLE HD'n;
title 'COMPARISON:';
run;
Untested, but this should get you some of the way.
proc sql;
create table compare as
select
coalesce(a.cola, b.cola) as cola,
a.colb as acolb,
b.colb as bcolb
from dataa as a
full outer join datab as b
on
a.cola = b.cola and
compged(a.colb, b.colb) <= 100;
quit;
Have a look at the compged documentation for further information.
Sounds like you could make a new variable in both datasets, VAR3chars=substr(var,1,3), and then add that variable to your ID statement. I think that should work unless there are duplicate values.
So if one dataset had var="cost" and the other had var="cost_total", they would match on the id so they would be compared and found to be different.
If one dataset had var="box_nr" and the other did not have any values starting with "box", they would not match on the id so compare would find that a record exists for that id in one dataset but not the other.
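A minimal sketch of that prep step, assuming the column holding the variable names is called var (adjust to your actual names):
data base_prep;
set USERSPDS.MIzew;
length var3 $3;
var3 = substr(var, 1, 3);
run;
data comp_prep;
set USERSPDS.MIwew;
length var3 $3;
var3 = substr(var, 1, 3);
run;
/* both datasets must be sorted by the ID variables before comparing */
proc compare base=base_prep compare=comp_prep;
id 'TABLE HD'n var3;
run;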