SAS macro to semi-efficiently manipulate data

Objective: Go from Have table + Help table to Want table. The current implementation (below) is slow. I believe this is a good example of how not to use SAS Macros, but I'm curious as to whether...
1. the macro approach could be salvaged / made fast enough to be viable
(e.g. proc append is supposed to speed up the action of stacking datasets, but I was unable to see any performance gains.)
2. what all the alternatives would look like.
I have written a non-macro solution that I will post below for comparison's sake.
Data:
data have ;
input name $ term $;
cards;
Joe 2000
Joe 2002
Joe 2008
Sally 2001
Sally 2003
; run;
proc print ; run;
data help ;
input terms $ ;
cards;
2000
2001
2002
2003
2004
2005
2006
2007
2008
; run;
proc print ; run;
data want ;
input name $ term $ status $;
cards;
Joe 2000 here
Joe 2001 gone
Joe 2002 here
Joe 2003 gone
Joe 2004 gone
Joe 2005 gone
Joe 2006 gone
Joe 2007 gone
Joe 2008 here
Sally 2001 here
Sally 2002 gone
Sally 2003 here
; run;
proc print data=want ; run;
I can write a little macro to get me there for each individual:
%MACRO RET(NAME);
proc sql ;
create table studtermlist as
select distinct term
from have
where NAME = "&NAME"
;
SELECT Max(TERM) INTO :MAXTERM
FROM HAVE
WHERE NAME = "&NAME"
;
SELECT MIN(TERM) INTO :MINTERM
FROM HAVE
WHERE NAME = "&NAME"
;
CREATE TABLE TERMLIST AS
SELECT TERMS
FROM HELP
WHERE TERMS BETWEEN "&MINTERM." and "&MAXTERM."
ORDER BY TERMS
;
CREATE TABLE HEREGONE_&Name AS
SELECT
A.terms ,
"&Name" as Name,
CASE
WHEN TERMS EQ TERM THEN 'Here'
when term is null THEN 'Gone'
end as status
from termlist a left join studtermlist b
on a.terms eq b.term
;
quit;
%MEND RET ;
%RET(Joe);
%RET(Sally);
proc print data=HEREGONE_Joe; run;
proc print data=HEREGONE_Sally; run;
But it's incomplete: I still need to loop through all the names (presumably quite a few)...
*******need procedure for all names - grab info on have ;
proc sql noprint;
select distinct name into :namelist separated by ' '
from have
; quit;
%let n=&sqlobs ;
%MACRO RETYA ;
OPTIONS NONOTEs ;
%do i = 1 %to &n ;
%let currentvalue = %scan(&namelist,&i);
%put &currentvalue ;
%put &i ;
%RET(&currentvalue);
%IF &i = 1 %then %do ;
data base; set HEREGONE_&currentvalue; run;
%end;
%IF &i gt 1 %then %do ;
proc sql ; create table base as
select * from base
union
select * from HEREGONE_&currentvalue
;
drop table HEREGONE_&currentvalue;
quit;
%end;
%end ;
OPTIONS NOTES;
%MEND;
%RETYA ;
proc sort data=base ; by name terms; run;
proc print data=base; run;
So now I have want, but with 6,000 names, it takes over 20 minutes.

Let's try the alternative solution. For each name, find the min/max term with a PROC SQL step. Then use a data step to create the table of time periods, and join that with your original table.
*Sample data;
data have ;
input name $ term ;
cards;
Joe 2000
Joe 2002
Joe 2008
Sally 2001
Sally 2003
; run;
*find min/max of each name;
proc sql;
create table terms as
select name, min(term) as term_min, max(term) as term_max
from have
group by name
order by name;
quit;
*Create table with the time periods for each name;
data empty;
set terms;
do term=term_min to term_max;
output;
end;
drop term_min term_max;
run;
*Create final table by merging the original table with table previously generated;
proc sql;
create table want as
select a.name, a.term, case when missing(b.term) then 'Gone'
else 'Here' end as status
from empty a
left join have b
on a.name=b.name
and a.term=b.term
order by a.name, a.term;
quit;
EDIT: Now looking at your macro solution, part of the problem is that you're scanning your table too many times.
The first table, studtermlist, is not required; the last join can be filtered instead.
The two macro variables, min/max term, can be calculated in a single pass.
Avoid the smaller interim term list and use a where clause to filter your results.
Use CALL EXECUTE to call your macro rather than another macro loop.
Rather than looping to append the data, take advantage of a naming convention and use a single data step to append all outputs.
%MACRO RET(NAME);
proc sql noprint;
SELECT MIN(TERM), Max(TERM) INTO :MINTERM, :MAXTERM
FROM HAVE
WHERE NAME = "&NAME"
;
CREATE TABLE _HG_&Name AS
SELECT
A.terms ,
"&Name" as Name,
CASE
WHEN TERMS EQ TERM THEN 'Here'
when term is null THEN 'Gone'
end as status
from help a
left join have b
on a.terms eq b.term
and b.name="&name"
where a.terms between "&minterm" and "&maxterm"
;
quit;
%MEND RET ;
*call macro;
proc sort data=have;
by name term;
run;
data _null_;
set have;
by name;
if first.name then do;
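/* %nrstr delays macro execution until after this data step, so the INTO macro variables are set before they are referenced */
str=catt('%nrstr(%ret(', name, '));');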
call execute(str);
end;
run;
*append results;
data all;
set _hg:;
run;

You can actually do this in a single nested SQL query. It would be messy and hard to read.
I'm going to break it out into the three components.
First, get the distinct names;
proc sql noprint;
create table names as
select distinct name from have;
quit;
Second, Cartesian product names and terms to get all the combos.
proc sql noprint;
create table temp as
select a.name, b.terms as term
from names as a,
help as b;
quit;
Third, left join to find the matches
proc sql noprint;
create table want as
select a.name,
a.term,
case
when missing(b.term) then "gone"
else "here"
end as Status
from temp as a
left join
have as b
on a.name=b.name
and a.term=b.term;
quit;
Last, delete the temp table to save space;
proc datasets lib=work nolist;
delete temp;
run;
quit;
As Reeza shows, there are other ways to do this. As I said above, you can merge all this into a single SQL join and get the results you want. Depending on computer memory and data size, it should be OK (and might be faster as everything is in memory).

proc sql;
create table want as
select c.name, c.terms, a.term,
( case when missing(a.term) then "Gone"
else "Here" end ) as status
from (select distinct a.name, b.terms
from have a, help b) c
left join have a
on c.terms = a.term and c.name = a.name
order by c.name, c.terms, a.term
;
quit;

I'm going to throw in my similar answer so I can compare them all later.
proc sql ;
create table studtermlist as
select distinct term,name
from have
;
create table MAXMINTERM as
SELECT Max(TERM) as MAXTERM, Min(TERM) as MINTERM, name
FROM HAVE
GROUP BY name
;
CREATE TABLE TERMLIST AS
SELECT TERMS,name
FROM HELP a,MAXMINTERM b
WHERE TERMS BETWEEN MINTERM and MAXTERM
ORDER BY name,TERMS
;
CREATE TABLE HEREGONE AS
SELECT
a.terms ,
a.Name ,
CASE
WHEN TERMS EQ TERM THEN 'Here'
when term is null THEN 'Gone'
end as status
from termlist a left join studtermlist b
on a.terms eq b.term
and a.name eq b.name
order by name, terms
;
quit;

Related

Using PRXMATCH to match strings from another sas dataset

Need your assistance and guidance. Please see below
*rsubmit;proc sql;
connect to teradata(user=&user_id. password=&user_pwd.);
create table mylib.DWH_table as select * from connection to teradata(
select distinct nm from DWH_table
);
quit;*endrsubmit;
*rsubmit;
DATA mylib.out_sas1;
set mylib.DWH_table;
if prxmatch ("m/studio/i",nm) > 0;
run;*endrsubmit;
So the above code checks for the word "studio" in the column nm and returns the results. However, this is a manual process that needs to be automated. I have another dataset that contains just one column named "KEYWORDS". Some of the sample data I have given below
KEYWORDS:
apple
mango
banana
grapes
The goal is that SAS should take the word in the column and compare it to the value in the database and create a separate output table.
So for example:
*rsubmit;
DATA mylib.out_sas2;
set mylib.DWH_table;
if prxmatch ("m/apple/i",nm) > 0;
run;*endrsubmit;
*rsubmit;
DATA mylib.out_sas3;
set mylib.DWH_table;
if prxmatch ("m/mango/i",nm) > 0;
run;*endrsubmit;
Can this be done in SAS?
Put your keywords in macro variables
proc sql noprint;
select count(distinct KEYWORDS)
into :no_keys trimmed
from mylib.MY_KEYWORDS;
select distinct KEYWORDS
into :key_1-:key_&no_keys
from mylib.MY_KEYWORDS;
quit;
Now use those macro variables
%macro find_keywords;
data
%do key_nr = 1 %to &no_keys;
mylib.out_sas&key_nr (drop = UP_nm)
%end;
;
set mylib.DWH_table;
length keyword $200; /* assumed maximum keyword length */
UP_nm = upcase(nm);
%do key_nr = 1 %to &no_keys;
keyword = "&&key_&key_nr";
if prxmatch ("m/&&key_&key_nr/i",UP_nm) > 0 then output mylib.out_sas&key_nr;
%end;
run;
%mend;
%find_keywords;
You need to embed this in a macro, because you cannot use %do ... %end; in "open" code. && resolves to &, which makes it a delayed &, that is resolved after resolving &key_nr.
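To see the delayed resolution in isolation, here is a minimal sketch. The keyword values and the fixed index are made up for illustration; in the macro above they come from the SQL step and the %do loop.

```sas
/* Hypothetical values, for illustration only */
%let key_1 = apple;
%let key_2 = mango;
%let key_nr = 2;

/* Pass 1: && becomes &, key_ is plain text, &key_nr becomes 2, giving &key_2 */
/* Pass 2: &key_2 resolves to mango */
%put &&key_&key_nr;
```

This writes mango to the log. With a single ampersand, the macro processor would instead try to resolve &key_ on its own and complain that it does not exist.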
Disclaimer: this code is not tested. If you have trouble getting it running, please respond.
Consider a macro call via a data step using CALL EXECUTE:
%macro subset_data(key);
%let name_unquoted = %qsysfunc(compress(&key., %str(%")));
data mylib.out_&name_unquoted.;
set mylib.DWH_table;
if prxmatch ("m/"||trim(&key.)||"/i",nm) > 0;
run;
%mend;
data _null_;
set mydata;
call execute('%nrstr(%subset_data("'||KEYWORDS||'"))');
run;
Alternatively, instead of call execute, create a SAS script file of macro calls, then run with %include:
data _null_;
set mydata;
file "Temp.sas" ;
put '%subset_data("' KEYWORDS '") ;' ;
run;
%include "Temp.sas";
But if there are many keywords (tens, hundreds, or thousands), consider @Richard's comment above and build an indicator column in one concatenated dataset via a temporary helper dataset:
%macro subset_data(key);
*** BUILD temp WITH INDICATOR;
data temp;
set mylib.DWH_table;
if prxmatch ("m/"||trim(&key.)||"/i",nm) > 0;
keyword = &key.;
run;
*** CONCATENATE temp;
data mylib.subset_data;
set mylib.subset_data
temp;
run;
%mend;
Reproducible Example (using sashelp.class dataset)
proc contents data = sashelp.class; run;
%macro subset_data(key);
%let name_unquoted = %qsysfunc(compress(&key.,%str(%")));
data &name_unquoted.;
set sashelp.class;
if prxmatch("m/"||trim(&key.)||"/i", Name) > 0;
run;
%mend;
data keywords;
input id keyword $;
datalines;
1 w
2 u
3 y
;
data _null_;
set keywords;
call execute('%nrstr(%subset_data("'||keyword||'"))');
run;
proc sql version
%macro subset_data(key);
%let name_unquoted = %qsysfunc(compress(&key., %str(%")));
proc sql;
create table &name_unquoted. as
select * from mylib.DWH_table
where nm like "%" || trim(&key.) || "%";
/* alternative: where index(nm, trim(&key.)) > 0; */
quit;
%mend;
proc sql (with SAS## datasets)
data keywords;
set keywords;
dname = cat("", "sas", _n_);
run;
%macro subset_data(key, dname);
%let name_unquoted = %qsysfunc(compress(&dname.,%str(%")));
proc sql;
create table &name_unquoted. as
select * from mylib.DWH_table
where nm like "%" || trim(&key.) || "%";
/* alternative: where index(nm, trim(&key.)) > 0; */
quit;
%mend;
data _null_;
set keywords;
call execute('%nrstr(%subset_data("'||keyword||'", "'||dname||'"))');
run;
One idea is to perform a cross join filtered on an "is match" criterion. The result is one table with one row per name/noun match.
Sample data and code:
data names;
length name $80;
infile cards length=L;
input name $varying. L;
datalines;
Bob
Bob's Burgers
Angel
Angle iron city
Chad
Chadwicks town council
Dutch
Edward
run;
data nouns;
length noun $10;
infile cards length=L;
input noun $varying. L;
datalines;
chad
own
ward
burger
run;
/*
* might want to pre lowercase the data being matched up
data lower_names;
set names;
lower_name = lowcase(name);
run;
data lower_nouns;
set nouns;
lower_noun = lowcase(noun);
run;
*/
proc sql;
create table want as
select name, noun
from names as NAME
cross join nouns as NOUN
where index(lowcase(NAME),lowcase(trim(NOUN))) >= 1 /* SAS INDEX() result: 1 or higher means noun is present */
;
quit;
Regardless of your approach there will be a lot of activity. Suppose there are 100 nouns to be checked against all names: that's 26M names x 100 nouns = 2.6B "is match" evaluations. Whichever system is more powerful and has more resources available will usually get you the fastest answer.
Case 1: SAS installation better
Download names to SAS
cross join names to nouns in SAS
Case 2: Teradata installation is better
Upload nouns to Teradata
cross join names to nouns in Teradata (via passthrough SQL)
Case 1 code:
Proc SQL;
connect to teradata (user=&user_id. password=&user_pwd.);
* download names;
create table mylib.DWH_names as
select * from connection to Teradata (
select distinct nm from DWH_table
);
create table work.NameNounMatches as
select
nm,
noun
from
mylib.dwh_names as NAMES
cross join
mylib.nouns as NOUNS
where
INDEX(lowcase(NAMES.nm),lowcase(trim(NOUNS.noun))) >= 1
;
quit;
Case 2 code:
Teradata temp table -- Upload (connection=global) from Tom on https://communities.sas.com/t5/SAS-Enterprise-Guide/SAS-Access-to-Teradata-How-to-create-Temporary-tables-in/td-p/228852
libname tdwork teradata username=&username password=&password server=&server
connection=global dbmstemp=yes
;
data tdwork.NOUNS_UPLOADED;
set mylib.nouns;
run;
* cross join in Teradata via passthrough;
proc sql;
connect using tdwork;
create table work.NameNounMatches as
select * from connection to tdwork
( select NAMES_LIST.nm, NOUNS_LIST.noun
from DWH_table as NAMES_LIST
cross join NOUNS_UPLOADED as NOUNS_LIST
where POSITION(NOUNS_LIST.noun IN NAMES_LIST.nm) >= 1
);
quit;

SAS: Create Variable from Proc SQL to use in Macro

I want to count the number of unique items in a variable (call it "categories") then use that count to set the number of iterations in a SAS macro (i.e., I'd rather not hard code the number of iterations).
I can get a count like this:
proc sql;
select count(*)
from (select DISTINCT categories from myData);
quit;
I can run a macro like this:
%macro superFreq;
%do i=1 %to &iterationVariable;
proc freq data=myData;
table var&i / out=var&i.freq;
run;
%end;
%mend superFreq;
%superFreq
I want to know how to get the count into the iteration variable so that the macro iterates as many times as there are unique values in the variable "categories".
Sorry if this is confusing. Happy to clarify if need be. Thanks in advance.
You can achieve this by using the into clause in proc sql:
proc sql noprint;
select max(age),
max(height),
max(weight)
into :max_age,
:max_height,
:max_weight
from sashelp.class;
quit;
%put &=max_age &=max_height &=max_weight;
Result:
MAX_AGE= 16 MAX_HEIGHT= 72 MAX_WEIGHT= 150
You can also select a list of results into a macro variable by combining the into clause with the separated by clause:
proc sql noprint;
select name into :list_of_names separated by ' ' from sashelp.class;
quit;
%put &=list_of_names;
Result:
LIST_OF_NAMES=Alfred Alice Barbara Carol Henry James Jane Janet Jeffrey John Joyce Judy Louise Mary Philip Robert Ronald Thomas William
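Applied to the question, a sketch might look like the following. It assumes the dataset myData and the variable categories from the question, and it uses the TRIMMED option (available from SAS 9.3) so that the stored count has no leading blanks.

```sas
proc sql noprint;
    /* count the unique values and store the count in a macro variable */
    select count(distinct categories)
        into :iterationVariable trimmed
        from myData;
quit;

/* check the value in the log before using it as the %do loop bound */
%put &=iterationVariable;
```

The macro can then loop with %do i=1 %to &iterationVariable; without any hard-coded limit.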

SAS: How to Automate the Creation of Many Datasets using Another Data set

I am looking to create multiple datasets from the city_variables dataset. There are a total of 58 observations; I stored that count in a macro variable (&count) to stop the do loop.
The city_variables dataset looks like this (vertically, of course):
CITY_NAME
City1
City2
City3
City4
City5
City6
City7
City8
City9
City10
..........
City58
I created the macro variable &name in a data _null_ step in order to put the city name into the dataset name.
Any help would be great on how to automate the creation of the 48 files by name (not number). Thanks again.
/* Create macro with number of observations in concordinate file */
proc sql;
select count(area_name)
into :count
from main.state_all;
quit;
%macro repeat;
data _null_;
set city_variables;
%do i= 1 %UNTIL (i = &count);
call symput('name',CITY_NAME);
run;
data &name;
set dataset;
where city_name = &name;
run;
%end;
%mend repeat;
%repeat
Well, if you're going to do
proc sql;
select count(area_name)
into :count
from main.state_all;
quit;
Then why not go all the way? Make a macro that does one dataset output, given the criteria as parameters, then make one call for each separate whatever-name. This might be close to what you're looking at.
%macro make_data(data_name=, set_name=, where=);
data &data_name.;
set &set_name.;
where &where.;
run;
%mend make_data;
proc sql;
select
cats('%make_data(data_name=',city_name,
', set_name=dataset, where=city_name="',
city_name,
'" )')
into :make_datalist
separated by ' '
from main.state_all;
quit;
&make_datalist.;
Some other options that I'll just link to:
Chris Hemedinger @ The SAS Dummy blog, "How to Split One Data Set Into Many", shows a similar concept, except he doesn't put the macro wrapper where I do.
Paul Dorfman, Data Step Hash Objects as Programming Tools is the seminal paper on using a hash table to do this. This is the "fastest" way to do this, likely, if you understand hash tables and have the memory available.
You don't need to use a macro to automate splitting up your data in this way. Since your example is really simple, I would consider using call execute in a null data step:
data test;
infile datalines ;
input city_name $20.;
datalines;
City1
City2
City2
City3
City3
City3
;
run;
data _null_;
set test;
call execute("data "||strip(city_name)||";"||"
set test;
where city_name = '"||strip(city_name)||"';"||"
run;");
run;

SAS loop through datasets

I have multiple tables in a library called snap1:
cust1, cust2, cust3, etc
I want to generate a loop that gets the records' count of the same column in each of these tables and then insert the results into a different table.
My desired output is:
Table Count
cust1 5,000
cust2 5,555
cust3 6,000
I'm trying this but it's not working:
%macro sqlloop(data, byvar);
proc sql noprint;
select &byvar.into:_values SEPARATED by '_'
from %data.;
quit;
data_&values.;
set &data;
select (%byvar);
%do i=1 %to %sysfunc(count(_&_values.,_));
%let var = %sysfunc(scan(_&_values.,&i.));
output &var.;
%end;
end;
run;
%mend;
%sqlloop(data=libsnap, byvar=membername);
First off, if you just want the number of observations, you can get that trivially from dictionary.tables or sashelp.vtable without any loops.
proc sql;
select memname, nlobs
from dictionary.tables
where libname='SNAP1';
quit;
This is fine to retrieve number of rows if you haven't done anything that would cause the number of logical observations to differ - usually a delete in proc sql.
Second, if you're interested in the number of valid responses, there are easier non-loopy ways too.
For example, given whatever query that you can write determining your table names, we can just put them all in a set statement and count in a simple data step.
%let varname=mycol; *the column you are counting;
%let libname=snap1;
proc sql;
select cats("&libname..",memname)
into :tables separated by ' '
from dictionary.tables
where libname=upcase("&libname.");
quit;
data counts;
set &tables. indsname=ds_name end=eof; *9.3 or later;
retain count dataset_name;
if _n_=1 then count=0;
if ds_name ne lag(ds_name) and _n_ ne 1 then do;
output;
count=0;
end;
dataset_name=ds_name;
count = count + ifn(&varname.,1,1,0); *true, false, missing; *false is 0 only;
if eof then output;
keep count dataset_name;
run;
Macros are rarely needed for this sort of thing, and macro loops like you're writing even less so.
If you did want to write a macro, the easier way to do it is:
Write code to do it once, for one dataset
Wrap that in a macro that takes a parameter (dataset name)
Create macro calls for that macro as needed
That way you don't have to deal with %scan and troubleshooting macro code that's hard to debug. You write something that works once, then just call it several times.
proc sql;
select cats('%mymacro(name=',"&libname..",memname,')')
into :macrocalls separated by ' '
from dictionary.tables
where libname=upcase("&libname.");
quit;
&macrocalls.;
Assuming you have a macro, %mymacro, which does whatever counting you want for one dataset.
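As a rough sketch of what such a %mymacro could look like: the version below counts the non-missing values of &varname. (the macro variable set earlier) in one dataset and stacks the result. The table names work._one_count and work.all_counts are hypothetical, chosen only for this illustration, and the macro has to be defined before &macrocalls. is run.

```sas
%macro mymacro(name=);
    /* count() of a column counts only its non-missing values */
    proc sql noprint;
        create table work._one_count as
        select "&name." as dataset_name length=41,
               count(&varname.) as count
        from &name.;
    quit;

    /* stack the one-row result; PROC APPEND creates the base table on first use */
    proc append base=work.all_counts data=work._one_count;
    run;
%mend mymacro;
```

After the generated calls have run, work.all_counts holds one row per dataset.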
* Updated *
In the future, please post the log so we can see what is specifically not working. I can see some issues in your code, particularly where your macro variables are being declared, and a select statement that is not doing anything. Here is an alternative process to achieve your goal:
Step 1: Read all of the customer datasets in the snap1 library into a macro variable:
proc sql noprint;
select memname
into :total_cust separated by ' '
from sashelp.vmember
where upcase(memname) LIKE 'CUST%'
AND upcase(libname) = 'SNAP1';
quit;
Step 2: Count the total number of obs in each data set, output to permanent table:
%macro count_obs;
%do i = 1 %to %sysfunc(countw(&total_cust) );
%let dsname = %scan(&total_cust, &i);
%let dsid=%sysfunc(open(snap1.&dsname) );
%let nobs=%sysfunc(attrn(&dsid,nobs) );
%let rc=%sysfunc(close(&dsid) );
data _total_obs;
length Member_Name $15.;
Member_Name = "&dsname";
Total_Obs = &nobs;
format Total_Obs comma8.;
run;
proc append base=Total_Obs
data=_total_obs;
run;
%end;
proc datasets lib=work nolist;
delete _total_obs;
quit;
%mend;
%count_obs;
You will need to delete the permanent table Total_Obs if it already exists, but you can add code to handle that if you wish.
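One way to handle that, as a sketch: drop the table up front with PROC DATASETS. The NOWARN option keeps the step quiet when Total_Obs does not exist yet, so the same code works on the first run and on reruns.

```sas
/* remove any leftover Total_Obs before calling the count_obs macro */
proc datasets lib=work nolist nowarn;
    delete Total_Obs;
quit;
```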
If you want to get the total number of non-missing observations for a particular column, do the same code as above, but delete the 3 %let statements below %let dsname = and replace the data step with:
data _total_obs;
length Member_Name $7.;
set snap1.&dsname end=eof;
retain Member_Name "&dsname";
if(NOT missing(var) ) then Total_Obs+1;
if(eof);
format Total_Obs comma8.;
run;
(Update: Fixed %do loop in step 2)

SAS : How to iterate a dataset elements within the proc sql WHERE statement?

I need to create multiple tables using proc sql
proc sql;
/* first city */
create table London as
select * from connection to myDatabase
(select * from mainTable
where city = 'London');
/* second city */
create table Beijing as
select * from connection to myDatabase
(select * from mainTable
where city = 'Beijing');
/* . . the same thing for other cities */
quit;
The names of those cities are in the sas table myCities
How can I embed the data step into proc sql in order to iterate through all cities ?
proc sql noprint;
select quote(city_varname) into :cities separated by ',' from myCities;
quit;
*This step above creates a list as a macro variable to be used with the in() operator below. EDIT: Per Joe's comment, added quote() function so that each city will go into the macro-var list within quotes, for proper referencing by in() operator below.
create table all_cities as
select * from connection to myDatabase
(select * from mainTable
where city in (&cities));
*this step is just the step you provided in your question, slightly modified to use in() with the macro-variable list defined above.
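To check what the macro variable contains before sending it to the database, a small local sketch can help. The two-row myCities table here is a hypothetical stand-in for the real one.

```sas
/* hypothetical stand-in for the myCities table */
data myCities;
    length city_varname $20;
    input city_varname $;
    datalines;
London
Beijing
;
run;

proc sql noprint;
    /* strip() removes trailing blanks before quote() wraps each value */
    select quote(strip(city_varname))
        into :cities separated by ','
        from myCities;
quit;

/* writes the list to the log, e.g. CITIES="London","Beijing" */
%put &=cities;
```

The resulting list can be dropped straight into the in() operator, as in the query above.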
One relatively simple solution to this is to do this entirely in a data step. Assuming you can connect via libname (which if you can connect via connect to you probably can), let's say the libname is mydb. Using a similar construction to Max Power's for the first portion:
proc sql noprint;
select city_varname
into :citylist separated by ' '
from myCities;
select cats('%when(var=',city_varname,')')
into :whenlist separated by ' '
from myCities;
quit;
%macro when(var=);
when ("&var.") output &var.;
%mend when;
data &citylist.;
set mydb.mainTable;
select(city);
&whenlist.
otherwise;
end;
run;
If you're using most of the data in mainTable, this probably wouldn't be much slower than doing it database-side, as you're moving all of the data anyway - and likely it would be faster since you only hit the database once.
Even better would be to pull this to one table (like Max shows), but this is a reasonable method if you do need to create multiple tables.
You need to put your proc sql code into a SAS macro.
Create a macro parameter for the city (in my example I called it "city").
Execute the macro from a data step. Since the data step runs once for each observation, there is no need to write complex logic to iterate.
data mycities;
infile datalines dsd;
input macrocity $ 32.;
datalines;
London
Beijing
Buenos_Aires
;
run;
%macro createtablecity(city=);
proc sql;
/* all cities */
create table &city. as
select * from connection to myDatabase
(select * from mainTable
where city = "&city.");
quit;
%mend;
data _null_;
set mycities;
city = macrocity;
call execute('%createtablecity(city='||strip(city)||')');
run;
Similar to the other solutions here really, maybe a bit simpler... Pull out a distinct list of cities, place them into macro variables, and run the SQL query within a do loop.
Proc sql noprint;
Select city
Into :n1-:n999
From connection to mydb
(Select distinct city
From mainTable)
;
Quit;
%let c=&sqlobs;
%macro createTables;
%do a=1 %to &c;
Proc sql;
Create table &&n&a as
Select *
From connection to myDb
(Select *
From mainTable
Where city="&&n&a")
;
Quit;
%end;
%mend createTables;
%createTables;