Using PRXMATCH to match strings from another SAS dataset

Need your assistance and guidance. Please see below
*rsubmit;
proc sql;
connect to teradata(user=&user_id. password=&user_pwd.);
create table mylib.DWH_table as select * from connection to teradata(
select distinct nm from DWH_table
);
quit;
*endrsubmit;
*rsubmit;
DATA mylib.out_sas1;
set mylib.DWH_table;
if prxmatch ("m/studio/i",nm) > 0;
run;
*endrsubmit;
So the above code checks for the word "studio" in the column nm and returns the matching rows. However, this is a manual process that needs to be automated. I have another dataset that contains just one column named "KEYWORDS". Some sample data is given below:
KEYWORDS:
apple
mango
banana
grapes
The goal is for SAS to take each word in that column, compare it against the values in the database column, and create a separate output table per keyword.
So for example:
*rsubmit;
DATA mylib.out_sas2;
set mylib.DWH_table;
if prxmatch ("m/apple/i",nm) > 0;
run;
*endrsubmit;
*rsubmit;
DATA mylib.out_sas3;
set mylib.DWH_table;
if prxmatch ("m/mango/i",nm) > 0;
run;
*endrsubmit;
Can this be done in SAS?

Put your keywords in macro variables:
proc sql;
select count(distinct KEYWORDS)
into :no_keys
from mylib.MY_KEYWORDS;
select distinct KEYWORDS
into :key_1-:key_&no_keys
from mylib.MY_KEYWORDS;
quit;
Now use those macro variables
%macro find_keywords;
data
%do key_nr = 1 %to &no_keys;
mylib.out_sas&key_nr (drop = UP_nm)
%end;
;
set mylib.DWH_table;
UP_nm = upcase(nm);
%do key_nr = 1 %to &no_keys;
keyword = "&&key_&key_nr";
if prxmatch ("m/&&key_&key_nr/i",UP_nm) > 0 then output mylib.out_sas&key_nr;
%end;
run;
%mend;
%find_keywords;
You need to embed this in a macro because you cannot use %do ... %end; in "open" code. && resolves to &, which makes it a delayed &, one that is resolved only after &key_nr has been resolved.
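A minimal sketch of that delayed resolution (the keyword values here are made up for illustration):

```sas
%let no_keys = 2;
%let key_1 = apple;
%let key_2 = mango;

%macro show_keys;
%do key_nr = 1 %to &no_keys;
  /* && becomes &, and &key_nr resolves first,          */
  /* so &&key_&key_nr -> &key_1 -> apple on pass one    */
  %put Keyword &key_nr is: &&key_&key_nr;
%end;
%mend;
%show_keys;
```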
Disclaimer: this code is not tested. If you have trouble getting it running, please respond.

Consider a macro call via a data step using CALL EXECUTE:
%macro subset_data(key);
%let name_unquoted = %qsysfunc(compress(&key., %str(%")));
data mylib.out_&name_unquoted.;
set mylib.DWH_table;
if prxmatch ("m/"||trim(&key.)||"/i",nm) > 0;
run;
%mend;
data _null_;
set mydata;
call execute('%nrstr(%subset_data("'||KEYWORDS||'"))');
run;
Alternatively, instead of call execute, create a SAS script file of macro calls, then run with %include:
data _null_;
set mydata;
file "Temp.sas" ;
put '%subset_data("' KEYWORDS '") ;' ;
run;
%include "Temp.sas";
But if the keywords are many (tens to hundreds to thousands), consider @Richard's comment above and develop an indicator column in a concatenated dataset via a helper, temp dataset:
%macro subset_data(key);
*** BUILD temp WITH INDICATOR;
data temp;
set mylib.DWH_table;
if prxmatch ("m/"||trim(&key.)||"/i",nm) > 0;
keyword = &key.;
run;
*** CONCATENATE temp;
data mylib.subset_data;
set mylib.subset_data
temp;
run;
%mend;
Reproducible Example (using sashelp.class dataset)
proc contents data = sashelp.class; run;
%macro subset_data(key);
%let name_unquoted = %qsysfunc(compress(&key.,%str(%")));
data &name_unquoted.;
set sashelp.class;
if prxmatch("m/"||trim(&key.)||"/i", Name) > 0;
run;
%mend;
data keywords;
input id keyword $;
datalines;
1 w
2 u
3 y
;
data _null_;
set keywords;
call execute('%nrstr(%subset_data("'||keyword||'"))');
run;
PROC SQL version:
%macro subset_data(key);
%let name_unquoted = %qsysfunc(compress(&key., %str(%")));
proc sql;
create table &name_unquoted. as
select * from mylib.DWH_table
where nm like "%" || trim(&key.) || "%";
/* alternative: where index(nm, trim(&key.)) > 0; */
quit;
%mend;
PROC SQL version (with sasN dataset names):
data keywords;
set keywords;
dname = cats("sas", _n_);
run;
%macro subset_data(key, dname);
%let name_unquoted = %qsysfunc(compress(&dname.,%str(%")));
proc sql;
create table &name_unquoted. as
select * from mylib.DWH_table
where nm like "%" || trim(&key.) || "%";
/* alternative: where index(nm, trim(&key.)) > 0; */
quit;
%mend;
data _null_;
set keywords;
call execute('%nrstr(%subset_data("'||keyword||'", "'||dname||'"))');
run;

One idea is to perform a cross join with an is-match criterion. The result is one table with one row per name/noun match.
Sample data and code:
data names;
length name $80;
infile cards length=L;
input name $varying. L;
datalines;
Bob
Bob's Burgers
Angel
Angle iron city
Chad
Chadwicks town council
Dutch
Edward
run;
data nouns;
length noun $10;
infile cards length=L;
input noun $varying. L;
datalines;
chad
own
ward
burger
run;
/*
* might want to pre-lowercase the data being matched up;
data lower_names;
set names;
lower_name = lowcase(name);
run;
data lower_nouns;
set nouns;
lower_noun = lowcase(noun);
run;
*/
proc sql;
create table want as
select name, noun
from names as N
cross join nouns as W
where index(lowcase(N.name),lowcase(trim(W.noun))) >= 1 /* SAS INDEX() result: 1 or higher means the noun is present */
;
quit;
Regardless of your approach there will be a lot of activity. Suppose there are 100 nouns to be checked against 26M names: that's 26M names x 100 nouns = 2.6B is-match evaluations. The system with the most power and available resources will usually get you the fastest answer.
Case 1: SAS installation better
Download names to SAS
cross join names to nouns in SAS
Case 2: Teradata installation is better
Upload nouns to Teradata
cross join names to nouns in Teradata (via passthrough SQL)
Case 1 code:
Proc SQL;
connect to teradata (user=&user_id. password=&user_pwd.);
* download names;
create table mylib.DWH_names as
select * from connection to Teradata (
select distinct nm from DWH_table
);
create table work.NameNounMatches as
select
nm,
noun
from
mylib.dwh_names as NAMES
cross join
mylib.nouns as NOUNS
where
INDEX(lowcase(NAMES.nm),lowcase(trim(NOUNS.noun))) >= 1
;
quit;
Case 2 code:
Teradata temp table upload (connection=global), from Tom at https://communities.sas.com/t5/SAS-Enterprise-Guide/SAS-Access-to-Teradata-How-to-create-Temporary-tables-in/td-p/228852
libname tdwork teradata user=&username password=&password server=&server
connection=global dbmstemp=yes
;
data tdwork.NOUNS_UPLOADED;
set mylib.nouns;
run;
* cross join in Teradata via passthrough;
proc sql;
connect using tdwork;
create table work.NameNounMatches as
select * from connection to tdwork
( select NAMES_LIST.nm, NOUNS_LIST.noun
from TABLE_DWH as NAMES_LIST
cross join NOUNS_UPLOADED as NOUNS_LIST
where POSITION(NOUNS_LIST.noun IN NAMES_LIST.nm) >= 1
);
quit;

Related

SAS - Loop through rows and calculate MD5

I want to sweep each table in a libname and calculate a hash over each row.
For that purpose, I already have a table with libname, memname, the columns concatenated with ',', and the number of observations:
libname  memname  columns              num_obs
lib_A    table_A  col1a,col2a...colna  1
lib_A    table_B  col1b,col2b...colnb  2
lib_B    table_C  col1c,col2c...colnc  1
I first get all the data into ranged macro variables (I think it's easier to work with, but I could be wrong, of course):
proc sql;
select libname, memname, columns, num_obs
into :lib1-, :tab1-, :column1-, :sqlobs1-
from have
where libname="&sel_livraria"; /*macro var = prompt from user*/
quit;
Just as a development guideline, I made the code check one specific table, without getting its row number, since a simple counter doesn't work (I get the order of the rows messed up each time I run), and it works for that purpose:
%let lib=lib_A;
%let tab=table_B;
%let columns=col1b,col2b,colnb;
data want;
length check $32.;
format check $hex32.;
set &lib..&tab;
libname="&lib";
memname="&tab";
check = md5(cats(&columns));
hash = put(check,$hex32.);
keep libname memname hash;
put hash;
put _all_;
run;
So, what's the best approach for getting an MD5 from each row (in the same order as the tables) of all tables in a libname? I ran into problems I couldn't overcome using data steps, procs or macros.
The result I wanted, if lib_A was selected in the prompt, would be something like:
libname  memname  obs_row  hash
lib_A    table_A  1        64A29CCA15F53C83A9583841294A26AA
lib_A    table_B  1        80DAC7B9854CF71A67F9C00A7EC4D9EF
lib_A    table_B  2        0AC44CD79DAB2E33C93BB2312D3A9A40
Need some help.
Thanks in advance.
You're pretty close. This is how I would approach it. We'll create a macro with three parameters: data, lib, and out. data is the dataset with the column information, lib is the library you want to pull from, and out is the output dataset you want to create.
We'll read each column into an individual macro variable:
memname1
memname2
memname3
libname1
libname2
libname3
etc.
From here, we simply need to loop over all of the macro variables and apply them where appropriate. We can easily count how many there are in a data step. All we need to do is add double-ampersands to resolve them correctly. For more information on why this is, check out this MWSUG paper.
%macro get_md5(data=, lib=, out=);
/* Save all variables into macro variables:
memname1 memname2 ...
columns1 columns2 ...
*/
data _null_;
set &data.;
where upcase(libname)=upcase("&lib.");
call symputx(cats('memname', _N_), memname);
call symputx(cats('columns', _N_), columns);
call symputx(cats('obs', _N_), obs);
call symputx('n_datasets', _N_);
run;
/* Loop through all the datasets and access each macro variable */
%do i = 1 %to &n_datasets.;
/* Double ampersand needed:
First, resolve &i. to get &memname1
Then resolve &memname1 to get the value stored in the macro variable memname1
*/
%let memname = &&memname&i.;
%let columns = &&columns&i.;
%let obs = &&obs&i.;
/* Calculate md5 in a temporary dataset */
data _tmp_;
length lib $8.
memname $32.
obs_row 8.
hash $32.
;
set &lib..&memname.(obs=&obs.);
lib = "&lib.";
memname = "&memname.";
obs_row = _N_;
hash = put(md5(cats(&columns.)), $hex32.);
keep lib memname obs_row hash;
run;
/* Overwrite the dataset so we don't keep appending */
%if(&i. = 1) %then %do;
data &out.;
set _tmp_;
run;
%end;
%else %do;
proc append base=&out. data=_tmp_;
run;
%end;
%end;
/* Remove temporary data */
proc datasets lib=work nolist;
delete _tmp_;
quit;
%mend;
Example:
data have;
length libname memname columns $15.;
input libname$ memname$ columns$ obs;
datalines;
sashelp cars make,model,msrp 1
sashelp class age,height,name 2
sashelp comet dose,length,sample 1
;
run;
%get_md5(data=have, lib=sashelp, out=want);
Output:
lib memname obs_row hash
sashelp cars 1 258DADA4843E7068ABAF95667E881B7F
sashelp class 1 29E8F4F03AD2275C0F191FE3DAA03778
sashelp class 2 DB664382B88BE7E445418B1A1C8CE13B
sashelp comet 1 210394B77E7696506FDEFD78890A8AB9
I would make a macro that takes as input the four values in your metadata dataset. Note that commas are anathema to SAS programs, especially macro code, so make the macro so it can accept space delimited variable lists (like normal SAS program statements do).
To reduce the risk of name conflict I will name the variable using triple underscores and then rename them back to human friendly names when the dataset is written.
%macro next_ds(libname,memname,num_obs,varlist);
data next_ds;
length ___1 $8 ___2 $32 ___3 8 ___4 $32 ;
___1 = "&libname";
___2 = "&memname";
___3 + 1;
set &libname..&memname(obs=&num_obs keep=&varlist);
___4 = put(md5(cats(of &varlist)),$hex32.);
keep ___1-___4 ;
rename ___1=libname ___2=memname ___3=obs_row ___4=hash;
run;
%mend next_ds;
Let's make some test metadata that reference datasets everyone should have.
data have;
infile cards truncover ;
input libname :$8. memname :$32. num_obs columns $200.;
cards;
sashelp class 3 name,sex,age
sashelp cars 2 make,model
;
And make sure the target dataset does not already exist.
%if %sysfunc(exist(want)) %then %do;
proc delete data=want; run;
%end;
Now you can call that macro once for each observation in your source metadata dataset. There is no need to generate oodles of macro variables. Instead you can use CALL EXECUTE() to generate the macro calls directly from the dataset.
We can replace the commas in the column lists when making the macro call. You can add in a PROC APPEND step after each macro call to aggregate the results into a single dataset.
data _null_;
set have;
call execute(cats(
'%nrstr(%next_ds)(',libname,',',memname,',',num_obs
,',',translate(columns,' ',','),')'
));
call execute('proc append data=next_ds base=want force; run;');
run;
Notice that wrapping the macro call in %NRSTR() makes the SAS log easier to read.
1 + %next_ds(sashelp,class,3,name sex age)
2 + proc append data=next_ds base=want force; run;
3 + %next_ds(sashelp,cars,2,make model)
4 + proc append data=next_ds base=want force; run;
Results:
Obs libname memname obs_row hash
1 sashelp class 1 5425E9CEDA1DDEB71B2692A3C7050A8A
2 sashelp class 2 C532D227D358A3764C2D225DC8C02D18
3 sashelp class 3 13AD5F1517E0C4494780773B6DC15211
4 sashelp cars 1 777C60693BF5E16F38706C89301CD0A8
5 sashelp cars 2 07080C9321145395D1A2BCC10FBE6B83
Note that CATS() might not be the best method for generating the string to pass to the MD5() function. That can generate the same string for different combinations of the source variables. For example 'AB' || 'CD' is the same as 'A' || 'BCD'. Perhaps just use CAT() instead.
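A quick sketch of that collision (the values are made up for illustration; a delimiter via CATX is one way to keep variable boundaries distinct):

```sas
data _null_;
  /* same concatenated string 'ABCD', so the same hash */
  h1 = put(md5(cats('AB','CD')), $hex32.);
  h2 = put(md5(cats('A','BCD')), $hex32.);
  /* CATX inserts a delimiter: 'AB|CD' vs 'A|BCD' now differ */
  h3 = put(md5(catx('|','AB','CD')), $hex32.);
  h4 = put(md5(catx('|','A','BCD')), $hex32.);
  put h1= / h2= / h3= / h4=;  /* h1 equals h2; h3 does not equal h4 */
run;
```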
Stu's approach is nice, and will work most of the time but will fall over when you have wiiiide variables, a large number of variables, variables with large precision, and other edge cases.
So for the actual hashing part, you might consider this macro, which is extensively tested within Data Controller for SAS:
https://core.sasjs.io/mp__md5_8sas.html
Usage:
data _null_;
set sashelp.class;
hashvar=%mp_md5(cvars=name sex, nvars=age height weight);
put hashvar=;
run;

SAS - Creating a panel from several datasets

Suppose there are ten datasets with the same structure, date and price; they cover the same time period but have different prices:
date price
20140604 5
20140605 7
20140607 9
I want to combine them and create a panel dataset. Since there is no name in each dataset, I attempt to add a new variable, name, to each dataset and then combine them.
The following code is used to add the name variable to each dataset:
%macro name(sourcelib=,from=,going=);
proc sql noprint; /*read datasets in a library*/
create table mytables as
select *
from dictionary.tables
where libname = &sourcelib
order by memname ;
select count(memname)
into:obs
from mytables;
%let obs=&obs.;
select memname
into : memname1-:memname&obs.
from mytables;
quit;
%do i=1 %to &obs.;
data
&going.&&memname&i;
set
&from.&&memname&i;
name=&&memname&i;
run;
%end;
%mend;
So, is this strategy correct? Or is there a different way of creating panel data?
There are really two ways to set up repeated measures data. You can use the TALL method that your code will create. That is generally the most flexible. The other would be a WIDE format with each PRICE being stored in a different variable. That is usually less flexible, but can be easier for some analyses.
You probably do not need macro code or even code generation to combine 10 datasets. You might find it easier to just type the 10 dataset names than to write complex code to pull the names from metadata. A data step like this lets you list any number of datasets in the SET statement and use the member name as the value of the new PANEL variable that distinguishes the source dataset:
data want ;
length dsn $41 panel $32 ;
set in1.panel1 in1.panela in1.panelb indsname=dsn ;
panel = scan(dsn,-1,'.') ;
run;
And if your dataset names follow a pattern that can be used as a member list in the SET statement then the code is even easier to write. So you could have a list of names that have a numeric suffix.
set in1.panel1-in1.panel10 indsname=dsn ;
or perhaps names that all start with a particular prefix.
set in1.panel: indsname=dsn ;
If the different panels are for the same dates then perhaps the wide format is easier? You could then merge the datasets by DATE and rename the individual PRICE variables. That is, generate a data step that looks like this:
data want ;
merge in1.panel1 (rename=(price=price1))
in1.panel2 (rename=(price=price2))
...
;
by date;
run;
Or perhaps it would be easier to add a BY statement to the data set that makes the TALL dataset and then transpose it into the WIDE format.
data tall;
length dsn $41 panel $32 ;
set in1.panel1 in1.panela in1.panelb indsname=dsn ;
by date ;
panel = scan(dsn,-1,'.') ;
run;
proc transpose data=tall out=want ;
by date;
id panel;
var price ;
run;
I can't comment on the SQL code but the strategy is correct. Add a name to each data set and then panel on the name with the PANELBY statement.
That is a valid way to achieve what you are looking for.
You are going to need 2 dots in between the macro variables for library.dataset syntax. The first . terminates the macro variable reference; the second shows up as the literal . between the libref and the dataset name.
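A quick %PUT sketch of the two-dot rule (the macro variable value here is hypothetical):

```sas
%let going = mylib;
%put ONE DOT:  &going.want;   /* the dot only ends the reference: mylibwant */
%put TWO DOTS: &going..want;  /* first dot ends it, second is literal: mylib.want */
```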
I assume you will want to append all of these data sets together. You can add
data &going..want;
set
%do i=1 %to &obs;
&from..&&memname&i
%end;
;
run;
You can combine your loop that adds the names and that data step like this:
data &going..want;
set
%do i=1 %to &obs;
&from..&&memname&i (in=d&i)
%end;
;
%do i=1 %to &obs;
if d&i then
name = &&memname&i;
%end;
run;

SAS: How to Automate the Creation of Many Datasets using Another Data set

I am looking to create multiple datasets from the city_variables dataset. There are a total of 58 observations, which I counted into a macro variable (&count) to stop the do loop.
The city_variables dataset looks like (vertically ofcourse):
CITY_NAME
City1
City2
City3
City4
City5
City6
City7
City8
City9
City10
..........
City58
I created macro variable &name in a DATA _NULL_ step in order to put the city name into the dataset name.
Any help would be great on how to automate the creation of the 58 files by name (not number). Thanks again.
/* Create macro variable with number of observations in concordance file */
proc sql;
select count(area_name)
into :count
from main.state_all;
quit;
%macro repeat;
data _null_;
set city_variables;
%do i= 1 %UNTIL (i = &count);
call symput('name',CITY_NAME);
run;
data &name;
set dataset;
where city_name = &name;
run;
%end;
%mend repeat;
%repeat
Well, if you're going to do
proc sql;
select count(area_name)
into :count
from main.state_all;
quit;
Then why not go all the way? Make a macro that does one dataset output, given the criteria as parameters, then make one call for each separate whatever-name. This might be close to what you're looking at.
%macro make_data(data_name=, set_name=, where=);
data &data_name.;
set &set_name.;
where &where.;
run;
%mend make_data;
proc sql;
select
cats('%make_data(data_name=',city_name,
', set_name=dataset, where=city_name="',
city_name,
'" )')
into :make_datalist
separated by ' '
from main.state_all;
quit;
&make_datalist.;
Some other options that I'll just link to:
Chris Hemedinger @ The SAS Dummy blog, How to Split One Data Set Into Many, shows a similar concept, except he doesn't put the macro wrapper where I do.
Paul Dorfman, Data Step Hash Objects as Programming Tools, is the seminal paper on using a hash table to do this. This is likely the fastest way, if you understand hash tables and have the memory available.
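For reference, a minimal sketch of that hash-object split (not lifted from the paper; it assumes the input is sorted by city_name and that the city names are valid SAS dataset names):

```sas
proc sort data=test;
  by city_name;
run;

data _null_;
  if _n_ = 1 then do;
    dcl hash h(ordered:'a');
    h.definekey('_n_');          /* row counter, so duplicate rows are kept */
    h.definedata('city_name');
    h.definedone();
  end;
  /* load one BY group, write it out as its own dataset, then reset */
  do _n_ = 1 by 1 until (last.city_name);
    set test;
    by city_name;
    h.add();
  end;
  h.output(dataset: city_name);
  h.clear();
run;
```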
You don't need to use a macro to automate splitting up your data in this way. Since your example is really simple, I would consider using call execute in a null data step:
data test;
infile datalines ;
input city_name $20.;
datalines;
City1
City2
City2
City3
City3
City3
;
run;
data _null_;
set test;
call execute("data "||strip(city_name)||";"||"
set test;
where city_name = '"||strip(city_name)||"';"||"
run;");
run;

SAS loop through datasets

I have multiple tables in a library called snap1:
cust1, cust2, cust3, etc
I want to write a loop that gets the record count of the same column in each of these tables and then inserts the results into a different table.
My desired output is:
Table Count
cust1 5,000
cust2 5,555
cust3 6,000
I'm trying this but it's not working:
%macro sqlloop(data, byvar);
proc sql noprint;
select &byvar.into:_values SEPARATED by '_'
from %data.;
quit;
data_&values.;
set &data;
select (%byvar);
%do i=1 %to %sysfunc(count(_&_values.,_));
%let var = %sysfunc(scan(_&_values.,&i.));
output &var.;
%end;
end;
run;
%mend;
%sqlloop(data=libsnap, byvar=membername);
First off, if you just want the number of observations, you can get that trivially from dictionary.tables or sashelp.vtable without any loops.
proc sql;
select memname, nlobs
from dictionary.tables
where libname='SNAP1';
quit;
This is fine for retrieving the number of rows as long as you haven't done anything that would cause the number of logical observations to differ from the physical count - usually a delete in PROC SQL.
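A quick demonstration of that caveat, on a hypothetical WORK copy of sashelp.class (a PROC SQL delete marks rows deleted in place, so the physical and logical counts diverge):

```sas
proc sql;
  create table work.demo as select * from sashelp.class;
  delete from work.demo where age < 13;    /* rows marked deleted in place */
  select memname, nobs, nlobs, delobs      /* physical vs logical vs deleted */
    from dictionary.tables
    where libname='WORK' and memname='DEMO';
quit;
```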
Second, if you're interested in the number of valid responses, there are easier non-loopy ways too.
For example, given whatever query you write to determine your table names, we can just put them all in a SET statement and count in a simple data step.
%let varname=mycol; *the column you are counting;
%let libname=snap1;
proc sql;
select cats("&libname..",memname)
into :tables separated by ' '
from dictionary.tables
where libname=upcase("&libname.");
quit;
data counts;
set &tables. indsname=ds_name end=eof; *9.3 or later;
retain count dataset_name;
if _n_=1 then count=0;
if ds_name ne lag(ds_name) and _n_ ne 1 then do;
output;
count=0;
end;
dataset_name=ds_name;
count = count + ifn(&varname.,1,1,0); *true, false, missing; *false is 0 only;
if eof then output;
keep count dataset_name;
run;
Macros are rarely needed for this sort of thing, and macro loops like you're writing even less so.
If you did want to write a macro, the easier way to do it is:
Write code to do it once, for one dataset
Wrap that in a macro that takes a parameter (dataset name)
Create macro calls for that macro as needed
That way you don't have to deal with %scan and troubleshooting macro code that's hard to debug. You write something that works once, then just call it several times.
proc sql;
select cats('%mymacro(name=',"&libname..",memname,')')
into :macrocalls separated by ' '
from dictionary.tables
where libname=upcase("&libname.");
quit;
&macrocalls.;
Assuming you have a macro, %mymacro, which does whatever counting you want for one dataset.
Updated:
In the future, please post the log so we can see what is specifically not working. I can see some issues in your code, particularly where your macro variables are being declared, and a select statement that is not doing anything. Here is an alternative process to achieve your goal:
Step 1: Read all of the customer datasets in the snap1 library into a macro variable:
proc sql noprint;
select memname
into :total_cust separated by ' '
from sashelp.vmember
where upcase(memname) LIKE 'CUST%'
AND upcase(libname) = 'SNAP1';
quit;
Step 2: Count the total number of obs in each data set, output to permanent table:
%macro count_obs;
%do i = 1 %to %sysfunc(countw(&total_cust) );
%let dsname = %scan(&total_cust, &i);
%let dsid=%sysfunc(open(&dsname) );
%let nobs=%sysfunc(attrn(&dsid,nobs) );
%let rc=%sysfunc(close(&dsid) );
data _total_obs;
length Member_Name $15.;
Member_Name = "&dsname";
Total_Obs = &nobs;
format Total_Obs comma8.;
run;
proc append base=Total_Obs
data=_total_obs;
run;
%end;
proc datasets lib=work nolist;
delete _total_obs;
quit;
%mend;
%count_obs;
You will need to delete the permanent table Total_Obs if it already exists, but you can add code to handle that if you wish.
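For example, a pre-check along these lines (open-code %IF needs SAS 9.4M5; on older releases wrap it in a small macro):

```sas
%if %sysfunc(exist(Total_Obs)) %then %do;
  proc delete data=Total_Obs;
  run;
%end;
```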
If you want to get the total number of non-missing observations for a particular column, do the same code as above, but delete the 3 %let statements below %let dsname = and replace the data step with:
data _total_obs;
length Member_Name $7.;
set snap1.&dsname end=eof;
retain Member_Name "&dsname";
if(NOT missing(var) ) then Total_Obs+1;
if(eof);
format Total_Obs comma8.;
run;
(Update: Fixed %do loop in step 2)

SAS Macro to semi-efficiently manipulate data

Objective: Go from Have table + Help table to Want table. The current implementation (below) is slow. I believe this is a good example of how not to use SAS Macros, but I'm curious as to whether...
1. the macro approach could be salvaged / made fast enough to be viable
(e.g. proc append is supposed to speed up the action of stacking datasets, but I was unable to see any performance gains.)
2. what all the alternatives would look like.
I have written a non-macro solution that I will post below for comparison's sake.
Data:
data have ;
input name $ term $;
cards;
Joe 2000
Joe 2002
Joe 2008
Sally 2001
Sally 2003
; run;
proc print ; run;
data help ;
input terms $ ;
cards;
2000
2001
2002
2003
2004
2005
2006
2007
2008
; run;
proc print ; run;
data want ;
input name $ term $ status $;
cards;
Joe 2000 here
Joe 2001 gone
Joe 2002 here
Joe 2003 gone
Joe 2004 gone
Joe 2005 gone
Joe 2006 gone
Joe 2007 gone
Joe 2008 here
Sally 2001 here
Sally 2002 gone
Sally 2003 here
; run;
proc print data=have ; run;
I can write a little macro to get me there for each individual:
%MACRO RET(NAME);
proc sql ;
create table studtermlist as
select distinct term
from have
where NAME = "&NAME"
;
SELECT Max(TERM) INTO :MAXTERM
FROM HAVE
WHERE NAME = "&NAME"
;
SELECT MIN(TERM) INTO :MINTERM
FROM HAVE
WHERE NAME = "&NAME"
;
CREATE TABLE TERMLIST AS
SELECT TERMS
FROM HELP
WHERE TERMS BETWEEN "&MINTERM." and "&MAXTERM."
ORDER BY TERMS
;
CREATE TABLE HEREGONE_&Name AS
SELECT
A.terms ,
"&Name" as Name,
CASE
WHEN TERMS EQ TERM THEN 'Here'
when term is null THEN 'Gone'
end as status
from termlist a left join studtermlist b
on a.terms eq b.term
;
quit;
%MEND RET ;
%RET(Joe);
%RET(Sally);
proc print data=HEREGONE_Joe; run;
proc print data=HEREGONE_Sally; run;
But it's incomplete. If I loop through for (presumably quite a few names)...
*******need procedure for all names - grab info on have ;
proc sql noprint;
select distinct name into :namelist separated by ' '
from have
; quit;
%let n=&sqlobs ;
%MACRO RETYA ;
OPTIONS NONOTEs ;
%do i = 1 %to &n ;
%let currentvalue = %scan(&namelist,&i);
%put &currentvalue ;
%put &i ;
%RET(&currentvalue);
%IF &i = 1 %then %do ;
data base; set HEREGONE_&currentvalue; run;
%end;
%IF &i gt 1 %then %do ;
proc sql ; create table base as
select * from base
union
select * from HEREGONE_&currentvalue
;
drop table HEREGONE_&currentvalue;
quit;
%end;
%end ;
OPTIONS NOTES;
%MEND;
%RETYA ;
proc sort data=base ; by name terms; run;
proc print data=base; run;
So now I have want, but with 6,000 names, it takes over 20 minutes.
Let's try an alternative solution. For each name, find the min/max term via PROC SQL. Then use a data step to create the time-period table and merge that with your original table.
*Sample data;
data have ;
input name $ term ;
cards;
Joe 2000
Joe 2002
Joe 2008
Sally 2001
Sally 2003
; run;
*find min/max of each name;
proc sql;
create table terms as
select name, min(term) as term_min, max(term) as term_max
from have
group by name
order by name;
quit;
*Create table with the time periods for each name;
data empty;
set terms;
do term=term_min to term_max;
output;
end;
drop term_min term_max;
run;
*Create final table by merging the original table with table previously generated;
proc sql;
create table want as
select a.name, a.term, case when missing(b.term) then 'Gone'
else 'Here' end as status
from empty a
left join have b
on a.name=b.name
and a.term=b.term
order by a.name, a.term;
quit;
EDIT: Now looking at your macro solution, part of the problem is that you're scanning your table too many times.
- The first table, studtermlist, is not required; the last join can be filtered instead.
- The two macro variables, min/max term, can be calculated in a single pass.
- Avoid the smaller interim term list and use a WHERE clause to filter your results.
- Use CALL EXECUTE to call your macro rather than another macro loop.
- Rather than loop through to append the data, take advantage of a naming convention and use a single data step to append all outputs.
%MACRO RET(NAME);
proc sql noprint;
SELECT MIN(TERM), Max(TERM) INTO :MINTERM, :MAXTERM
FROM HAVE
WHERE NAME = "&NAME"
;
CREATE TABLE _HG_&Name AS
SELECT
A.terms ,
"&Name" as Name,
CASE
WHEN TERMS EQ TERM THEN 'Here'
when term is null THEN 'Gone'
end as status
from help a
left join have b
on a.terms eq b.term
and b.name="&name"
where a.terms between "&minterm" and "&maxterm";
quit;
%MEND RET ;
*call macro;
proc sort data=have;
by name term;
run;
data _null_;
set have;
by name;
if first.name then do;
str=catt('%ret(', name, ');');
call execute(str);
end;
run;
*append results;
data all;
set _hg:;
run;
You can actually do this in a single nested SQL query. It would be messy and hard to read.
I'm going to break it out into three components.
First, get the distinct names:
proc sql noprint;
create table names as
select distinct name from have;
quit;
Second, Cartesian product names and terms to get all the combos.
proc sql noprint;
create table temp as
select a.name, b.terms as term
from names as a,
help as b;
quit;
Third, left join to find the matches:
proc sql noprint;
create table want as
select a.name,
a.term,
case
when missing(b.term) then "gone"
else "here"
end as Status
from temp as a
left join
have as b
on a.name=b.name
and a.term=b.term;
quit;
Last, delete the temp table to save space:
proc datasets lib=work nolist;
delete temp;
run;
quit;
As Reeza shows, there are other ways to do this. As I said above, you can merge all this into a single SQL join and get the results you want. Depending on computer memory and data size, it should be OK (and might be faster as everything is in memory).
proc sql;
create table want as
select c.name, c.terms, a.term,
( case when missing(a.term) then "Gone"
else "Here" end ) as status
from (select distinct a.name, b.terms
from have a, help b) c
left join have a
on c.terms = a.term and c.name = a.name
order by c.name, c.terms, a.term
;
quit;
I'm going to throw in my similar answer so I can compare them all later.
proc sql ;
create table studtermlist as
select distinct term,name
from have
;
create table MAXMINTERM as
SELECT Max(TERM) as MAXTERM, Min(TERM) as MINTERM, name
FROM HAVE
GROUP BY name
;
CREATE TABLE TERMLIST AS
SELECT TERMS,name
FROM HELP a,MAXMINTERM b
WHERE TERMS BETWEEN MINTERM and MAXTERM
ORDER BY name,TERMS
;
CREATE TABLE HEREGONE AS
SELECT
a.terms ,
a.Name ,
CASE
WHEN TERMS EQ TERM THEN 'Here'
when term is null THEN 'Gone'
end as status
from termlist a left join studtermlist b
on a.terms eq b.term
and a.name eq b.name
order by name, terms
;
quit;