SAS creating subtables ordered by year - sas

I have a table in SAS which consists of data from stock exchange. One of its columns holds information about date. I would like to create subtables, each one hold data from only one specific year.

Assuming you want to do this (it is often the inferior option, since analyses that need to run separately by year can usually be run from a single dataset using by year;, but it is sometimes appropriate), the gold standard method is the hash table, because a hash object can write out any number of datasets, with names determined by the data at run time. It's the 'hashing' method described on this page.
Hashing code, adapted from the sascommunity.org page above:
data have;
call streaminit(7);
do year=1998 to 2014;
do id= 1 to 10;
x=rand('Uniform');
output;
end;
end;
run;
data _null_ ;
dcl hash byyear () ;
byyear.definekey ('k') ; /* if id or similar is a safe unique ID you could use that here; otherwise k serves as the unique key, since the hash requires unique keys */
byyear.definedata ('year','id','x') ;
byyear.definedone () ;
do k = 1 by 1 until ( last.year ) ;
set have;
by year ;
byyear.add () ;
end ;
dsetname=cats('year',year);
byyear.output (dataset: dsetname) ;
run ;
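For comparison, the single-dataset alternative mentioned at the start of this answer can be sketched as follows. This is only an illustration: it re-creates the same example have dataset, and proc means with x as the analysis variable is just one assumed example of a per-year analysis.

```sas
/* Same example data as above: 10 rows per year, 1998-2014 */
data have;
  call streaminit(7);
  do year = 1998 to 2014;
    do id = 1 to 10;
      x = rand('Uniform');
      output;
    end;
  end;
run;

/* BY-group processing needs the data ordered by the BY variable */
proc sort data=have;
  by year;
run;

/* One set of statistics per year, without splitting the dataset */
proc means data=have n mean min max;
  by year;
  var x;
run;
```

If the per-year results are all that is needed, this avoids creating and managing one dataset per year.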
There is a similar set of methods that revolve around using a macro to generate the code. This paper goes into detail about one method to do that; I won't explain it in detail as I consider it inferior to the hash method (even if it is lower CPU time, it is more complicated to write than either a pure macro method or a pure hash method) but in certain cases it could be better.
A simple example of the macro method, assuming a have dataset that contains a date variable:
proc sql;
select distinct(cats('year',year(date))) into :dsetlist
separated by ' '
from have;
select distinct(cats('%outputyear(year=',year(date),')')) into :outputlist
separated by ' '
from have;
quit;
%macro outputyear(year=);
if year(date)=&year. then output year&year.;
%mend outputyear;
data &dsetlist.;
set have;
&outputlist.;
run;

data year1 year2 year3 yearN;
set stockdata;
if year(date) = 2014 then
output year1;
else if year(date) = 2013 then
output year2;
else if year(date) = 2012 then
output year3;
else
output yearN;
run;
You could also use a select/when block (the data step counterpart of a case statement) instead of the if/else chain.
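A minimal sketch of the select/when form, assuming the same stockdata dataset with a date variable. The sample data step and the output dataset names (year2014, year_other and so on) are made up for illustration.

```sas
/* Hypothetical sample data standing in for STOCKDATA */
data stockdata;
  format date date9.;
  do yr = 2011 to 2014;
    date = mdy(6, 30, yr);
    price = 100 + yr - 2011;
    output;
  end;
  drop yr;
run;

/* Route each observation to a dataset according to its year */
data year2014 year2013 year2012 year_other;
  set stockdata;
  select (year(date));
    when (2014) output year2014;
    when (2013) output year2013;
    when (2012) output year2012;
    otherwise   output year_other;
  end;
run;
```

Like the if/else version, this requires every target dataset to be listed in the data statement, so the years have to be known in advance.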

Related

Combine one column's values into a single string

This might sound awkward but I do have a requirement to be able to concatenate all the values of a char column from a dataset, into one single string. For example:
data person;
input attribute_name $ dept $;
datalines;
John Sales
Mary Acctng
skrill Bish
;
run;
Result : test_conct = "JohnMarySkrill"
The column could vary in number of rows in the input dataset.
So, I tried the code below but it errors out when the length of the combined string (samplkey) exceeds 32K in length.
DATA RECKEYS(KEEP=test_conct);
length samplkey $32767;
do until(eod);
SET person END=EOD;
if lengthn(attribute_name) > 0 then do;
test_conct = catt(test_conct, strip(attribute_name));
end;
end;
output; stop;
run;
Can anyone suggest a better way to do this, maybe by breaking the column down into chunks of 32k-length macro vars?
Regards
It would very much help if you indicated what you're trying to do but a quick method is to use SQL
proc sql NOPRINT;
select name into :name_list separated by ""
from sashelp.class;
quit;
%put &name_list.;
As you've indicated, macro variables do have a size limit (64k characters in most installations now). Depending on what you're doing, a better method may be to build a macro that writes the list dynamically wherever it needs to go, but you would need to explain the usage for anyone to suggest that option. This answers your question as posted.
Try this, using the VARCHAR() option. If you're on an older version of SAS this may not work.
data _null_;
set sashelp.class(keep = name) end=eof;
length long_var varchar(1000000);
length want $32;
retain long_var;
long_var = catt(long_var, name);
if eof then do;
want = put(md5(long_var), $hex32.); /* MD5 returns 16 raw bytes; $hex32. makes them printable */
put want;
end;
run;

List Aggregation and Group Concatenation in SAS Proc SQL

I have a dataset which has to be rolled up based on the granularity(FIELD1 & FIELD2). Two of the metrics fields(METRIC1 & METRIC2) have to be summed up. Until now it seems to be an easy GROUP BY task. But I have a string field(FLAG) which has to be rolled up too, by concatenating the distinct values.
Input Dataset:
Expected Result:
This operation can be performed in Oracle using the LISTAGG() function.
Kindly help me out in achieving the same in SAS Proc SQL.
I don't believe there's a direct way to do this in SAS. CATS (and similar concatenation functions) aren't aggregation functions. It was suggested to add these back a few years ago but nothing came of it that I'm aware of (see this thread.)
If I understand right, what you're doing is GROUP BY field1/field2, SUM metric1/metric2, and make a single FLAG field that concatenates all seen FLAG field values (but doesn't group by them).
The way I would handle this is to first do your aggregation (field1/field2), and then join that to a separate table that was just field1/field2/flag. You could make that most easily in the data step, something like:
data want;
set have;
by field1 field2;
length flag_out $100; *or longer if you need longer;
flag_out = catx(',',flag_out,flag);
if last.field2 then output;
rename flag_out=flag;
drop flag;
run;
This assumes it's sorted already by field1/field2, otherwise you need to do that first.
As stated, there is no LISTAGG() function, and there is also no built-in feature for creating a custom aggregate function. However, there are two possibilities that get the output.
Example one
Data step with DOW processing and hashing for tracking distinct flag values while concatenating within a group.
data want;
if 0 then set have; *prep pdv;
length flags $200;
declare hash _flags();
_flags.defineKey('flag');
_flags.defineDone();
do until (last.f2);
set have;
by f1 f2;
m1_sum = sum(m1_sum,m1);
m2_sum = sum(m2_sum,m2);
if _flags.find() ne 0 then do;
_flags.add();
flags = catx(',',flags,flag);
end;
end;
drop m1 m2 flag;
_flags.delete();
run;
Example two
Create a FCMP custom function used from within SQL. Because FCMP can not create an aggregate function, the result will be automatically remerged against original data which must then be filtered. The FCMP function also uses a hash for tracking distinct values of flag within a group.
proc fcmp outlib=sasuser.functionsx.package;
function listagg(f1 $, f2 $, n, item $) $;
length result $32000 index 8;
static flag;
static index;
declare hash items();
if flag = . then do;
flag = 1;
rc = items.defineKey('item');
rc = items.defineDone();
end;
static items;
index + 1;
rc = items.replace();
if index = n then do;
declare hiter hi('items');
result = '';
do while (hi.next() = 0);
result = catx(',',result,item);
end;
index = 0;
rc = items.clear();
return (result);
end;
else
return ("");
endsub;
run;
options cmplib=sasuser.functionsx;
proc sql;
create table wanted as
select * from
(
select /* subselect is a remerge due to 'listagg' mimic */
f1,
f2,
listagg(f1, f2, count(*), flag) as flags,
sum(m1) as m1,
sum(m2) as m2
from have
group by f1, f2
)
where flags is not null /* filter the subselect */
;
quit;
Ideally a hash of hashes would have been used, but FCMP only provides for hash instance creation in a declare statement, and dynamic hashes can not be instantiated with _new_. SAS Viya users would be able to use the new component object Dictionary in an FCMP function, and might be able to have a Dictionary of Dictionaries to track the distinct flag values in each group.
Thanks everyone for your valuable inputs. Apparently there is no straightforward solution to this scenario in SAS. With the bigger picture of the requirement in mind, I have decided to tackle the problem at the data layer itself or add another intermediate presentation layer.
I'm sure many have pointed out this need to SAS, I have raised this issue with SAS too. Hope they look into it and come up with a similar function as LISTAGG OR GROUP_CONCAT.
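Answer from Joe is very good but is missing one critical part.
There should be a
retain flag_out;
line after the 'by' line. Without it, flag_out is reset to missing on every observation, so only the last flag of each group is output. The retained value also has to be cleared at the start of each group (for example with if first.field2 then call missing(flag_out);), otherwise flags carry over from one group to the next.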
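Putting that correction together with the approach it refers to, a sketch of the whole flow might look like this. The sample rows and the intermediate dataset name flags are assumptions made for illustration; the variable names follow the question.

```sas
/* Hypothetical sample data; variable names follow the question */
data have;
  input field1 $ field2 $ flag $ metric1 metric2;
  datalines;
A X red 1 10
A X blue 2 20
A Y red 3 30
B X green 4 40
;
run;

/* BY-group processing requires sorted input */
proc sort data=have;
  by field1 field2;
run;

/* One row per field1/field2 holding the concatenated flags */
data flags;
  set have;
  by field1 field2;
  length flag_out $100;                         /* longer if needed */
  retain flag_out;                              /* keep the value across observations */
  if first.field2 then call missing(flag_out);  /* start each group empty */
  flag_out = catx(',', flag_out, flag);
  if last.field2 then output;
  keep field1 field2 flag_out;
  rename flag_out=flag;
run;

/* Aggregate the metrics and attach the concatenated flags */
proc sql;
  create table want as
  select s.field1, s.field2, s.metric1, s.metric2, f.flag
  from (select field1, field2,
               sum(metric1) as metric1,
               sum(metric2) as metric2
        from have
        group by field1, field2) as s
       inner join flags as f
       on s.field1 = f.field1 and s.field2 = f.field2;
quit;
```

Note that this sketch concatenates every flag value in a group, including repeats. If only distinct values are wanted, the input would need to be de-duplicated on field1, field2 and flag first, or a hash-based approach such as the one in Example one could be used.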

SAS pass-through facility. How to insert a big list from a local table in a query?

I need to query a large table in a server (REMOTE_TBL) using the SAS pass-through facility. In order to make the query shorter, I want to send a list of IDs extracted from a local table (LOCAL_TBL).
My first step is to get the IDs into a variable called id_list using an INTO statement:
select distinct ID into: id_list separated by ',' from WORK.LOCAL_TBL
Then I pass these IDs to the pass-through query:
PROC SQL;
CONNECT TO sybaseiq AS dbcon
(host="name.cl" server=alias db=iws user=sas_user password=XXXXXX);
create table WANT as
select * from connection to dbcon(
select *
from dbo.REMOTE_TBL
where ID in (&id_list)
);
QUIT;
The code runs fine except that I get the following message:
The length of the value of the macro variable exceeds the maximum length
Is there an easier way to send the selected ID's to the pass-through query?
Is there a way to store the selected ID's in two or more variables?
Store the values into multiple macro variables and then store the names of the macro variables into another macro variable.
So this code will make a series of macro variables named M1, M2, .... and then set ID_LIST to &M1,&M2....
data _null_;
length list $20200 mlist $20000;
do until(eof or length(list)>20000);
set LOCAL_TBL end=eof;
list=catx(',',list,id);
end;
call symputx(cats('m',_n_),list);
mlist=catx(',',mlist,cats('&m',_n_));
if eof then call symputx('id_list',mlist);
run;
Then when you expand ID_LIST the macro processor will expand all of the individual Mx macro variables. This little data step will create a couple of example macro variables to demonstrate the idea.
data _null_;
call symputx('m1','a,b,c');
call symputx('m2','d,e,f');
call symputx('id_list','&m1,&m2');
run;
Results:
70 %put ID_LIST=%superq(id_list);
ID_LIST=&m1,&m2
71 %put ID_LIST=&id_list;
ID_LIST=a,b,c,d,e,f
You are passing many data values in your IN (…) clause. The number of values allowed varies by database; some may limit you to 250 values per clause, and the overall length of a statement might be limited as well. If the macro variable creates a list of values 20,000 characters long, the remote side might not like that.
When dealing with a lookup of perhaps > 100 values, take some time first to ask the DB admin for the right to create temporary tables. When you have such rights, your queries will run more efficiently on the remote side.
… upload id values to #myidlist …
create table WANT as
select * from connection to dbcon(
select *
from dbo.REMOTE_TBL
where ID in (select id from #myidlist)
);
QUIT;
If you can't get the proper permissions, you would have to chop up the id list into pieces and have a macro create a series of ORed IN searches.
1=0
OR ID IN ( … list-values-1 … )
…
OR ID IN ( … list-values-N … )
For example:
data have;
do id = 1 to 44;
output;
end;
run;
%let IDS_PER_MACVAR = 10; * <---------- make as large as you want until error happens again;
* populate the macro vars holding the chopped up ID list;
data _null_;
length macvar $20; retain macvar;
length macval $32000; retain macval;
set have end=end;
if mod(_n_-1, &IDS_PER_MACVAR) = 0 then do;
if not missing(macval) then call symput(macvar, trim(macval));
group + 1;
macvar = cats('idlist',group);
macval = '';
end;
macval = catx(',',macval,id);
if end then do;
if not missing(macval) then call symput(macvar, trim(macval));
call symputx ('MVARCOUNT', group);
end;
run;
* macro that assembles the chopped up bits as a series of ORd INs;
%macro id_in_ors (N=,NAME=);
%local i;
1 = 0
%do i = 1 %to &N;
OR ID IN (&&&NAME.&i)
%end;
%mend;
* use %put to get a sneak peek at what will be passed through;
%put %id_in_ors(N=&MVARCOUNT,NAME=IDLIST);
* actual sql with pass through;
...
create table WANT as
select * from connection to dbcon(
select *
from dbo.REMOTE_TBL
where ( %ID_IN_ORS(N=&MVARCOUNT,NAME=IDLIST) ) %* <--- idlist piecewise ors ;
);
...
I suggest that you first save all the distinct values into a table, and then (again using proc sql + into) load them into several stand-alone macro variables, reading the table in several passes; the subsets read in each pass have to be mutually exclusive yet jointly exhaustive.
Do you have access to and CREATE privileges in the DB where your dbo.REMOTE_TBL resides? If so you might also think about copying your WORK.LOCAL_TBL into a temporary table in the DB and run an inner join right there.
Another option - write out the query to a temporary file and then %include it. No macro logic needed!
proc sort
data = WORK.LOCAL_TBL(keep = ID)
out = distinct_ids
nodupkey;
by ID;
run;
data _null_;
set distinct_ids end = eof;
file "%sysfunc(pathname(work))/temp.sas";
if _n_ = 1 then put "PROC SQL;
CONNECT TO sybaseiq AS dbcon
(host=""name.cl"" server=alias db=iws user=sas_user password=XXXXXX);
create table WANT as
select * from connection to dbcon(
select *
from dbo.REMOTE_TBL
where ID in (" @;
put ID @;
if not(eof) then put "," @;
if eof then put ");QUIT;" @;
put;
run;
/*Use nosource2 to avoid cluttering the log*/
%include "%sysfunc(pathname(work))/temp.sas" /nosource2;

Delete all observations, which are doubled on some variable

Suppose I have a table:
Name Age
Bob 4
Pop 5
Yoy 6
Bob 5
I want to delete all names, which are not unique in the table:
Name Age
Pop 5
Yoy 6
ATM, my solution is to make a new table with counts of unique names:
Name Count
Bob 2
Pop 1
Yoy 1
And then keep only the names whose Count = 1.
I believe there are much more beautiful solutions.
If I understand you correctly there are two ways to do it:
The SQL Procedure
In SAS you may not need to use a summarisation function such as MIN() as I have here, but when there is only one of name then min(age) = age anyway, and when migrating this to another RDBMS (e.g. Oracle, SQL Server) it may be required:
proc sql;
create table want as
select name, min(age) as age
from have
group by name
having count(*) = 1;
quit;
Data Step
Requires the data to be pre-sorted:
proc sort data=have out=have_stg;
by name;
run;
When doing SAS data-step by group processing, the first. (first-dot) and last. (last-dot) variables are generated which denote whether the current observation is the first and/or last in the by-group. Using SAS conditional logic one can simply test if first.name = 1 and last.name = 1. Reducing this using logical shorthand becomes:
data want;
set have_stg;
by name;
if first.name and last.name;
/* Equivalent to:*/
*if first.name = 1 and last.name = 1;
run;
I left both versions in the code above, use whichever version you find more readable.
You can use proc sort with the nouniquekey option. Use uniqueout= to write the observations with unique keys to one dataset and out= to write the duplicates to another (the out= option is necessary if you don't want to overwrite your original dataset). The rows you want end up in the uniqueout= dataset.
proc sort data = have nouniquekey uniqueout = unique out = dups;
by name;
run;

select only a few columns from a large table in SAS

I have to join 2 tables on a key (say XYZ). I have to update one single column in table A using a coalesce function. Coalesce(a.status_cd, b.status_cd).
TABLE A:
contains some 100 columns. KEY Columns ABC.
TABLE B:
Contains just 2 columns. KEY Column ABC and status_cd
TABLE A, which I use in this left join query is having more than 100 columns. Is there a way to use a.* followed by this coalesce function in my PROC SQL without creating a new column from the PROC SQL; CREATE TABLE AS ... step?
Thanks in advance.
You can take advantage of dataset options to make it so you can use wildcards in the select statement. Note that the order of the columns could change doing this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;
I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
update master a
set status_cd= coalesce(status_cd,
(select status_cd
from transaction b
where a.ABC = b.ABC))
where exists (select 1
from transaction b
where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the modify statement: if this data step is interrupted then the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
update master
set weight = coalesce(weight,input(name,key.))
where master.name in (select name from transaction);
quit;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses to convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.