Obtain and subtract the total number of observations from multiple datasets in SAS - sas

I have to prepare 5 tables from a large dataset given certain conditions.
The total number of obs for 5 tables is 1000.
I have prepared the first four tables.
For the fifth table, I am having trouble selecting the remaining obs (1000 minus the sum of the obs in tables 1 to 4).
I can sum odsnumber manually, but that hurts efficiency because this has to be done routinely.
Can anyone guide me on how to improve these scripts?
Proc sql;
select nobs
into: odsnumber trimmed
from sashelp.vtable
where libname='work' and memname in ('table1' 'table2' 'table3' 'table4')
;quit;
data table5;
set source;
if 1<=_N_<=sum(1000,manual calculation of nobs from 4 tables);
run;

Compute the SUM of the four NOBS values in the SQL query and you won't have to calculate it manually.
In SQL I prefer to use DICTIONARY tables instead of SASHELP.V* views. Use the views when needing to access metadata from a DATA or PROC step.
libname and memname values are ALWAYS uppercase when queried from the meta-data DICTIONARY tables.
Example:
data table1; do row = 1 to 10; output; end;
data table2; do row = 1 to 100; output; end;
data table3; do row = 1 to 100; output; end;
data table4; do row = 1 to 40 ; output; end;
data source; do row = 1 to 2500; output; end;
proc sql NOPRINT;
select SUM(nobs)
into: NOBS_OF_4_TABLES trimmed
from DICTIONARY.TABLES
where libname='WORK'
and memname in ('TABLE1' 'TABLE2' 'TABLE3' 'TABLE4')
;
quit;
%put NOTE: &=NOBS_OF_4_TABLES; %* Check log to see value computed;
data table5;
set source;
if _N_ > 1000 - &NOBS_OF_4_TABLES then stop;
run;
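As a possible variation on the step above, the remaining row count could be passed to the OBS= dataset option so that SAS stops reading on its own. This is only a sketch: it assumes &NOBS_OF_4_TABLES was created by the PROC SQL step above and that the four tables together hold no more than 1000 rows.

```sas
/* Sketch: reuse the macro variable computed by the PROC SQL step above. */
/* Assumption: &NOBS_OF_4_TABLES <= 1000; a negative OBS= value is an error. */
data table5;
  set source(obs=%eval(1000 - &NOBS_OF_4_TABLES));
run;
```

With the example tables above (10 + 100 + 100 + 40 = 250 rows), both versions keep the first 750 rows of source.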

Related

SAS - Loop through rows and calculate MD5

I want to sweep each table in a libname and calculate a hash over each row.
For that purpose, I already have a table with libname, memname, the columns concatenated with ',' and the number of observations:
libname memname columns num_obs
lib_A table_A col1a,col2a...colna 1
lib_A table_B col1b,col2b...colnb 2
lib_B table_C col1c,col2c...colnc 1
I first read all of the data into ranged macro variables (I think it's easier to work with, but I could be wrong, of course):
proc sql;
select libname, memname, columns, num_obs
into :lib1-, :tab1-, :column1-, :sqlobs1-
from have
where libname="&sel_livraria"; /*macro var = prompt from user*/
quit;
As a starting point while developing, I wrote code that checks one specific table without producing its row numbers, because a simple counter doesn't work (the order of the rows gets messed up each time I run it). It works for that purpose:
%let lib=lib_A;
%let tab=table_B;
%let columns=col1b,col2b,colnb;
data want;
length check $32.;
format check $hex32.;
set &lib..&tab;
libname="&lib";
memname="&tab";
check = md5(cats(&columns));
hash = put(check,$hex32.);
keep libname memname hash;
put hash;
put _all_;
run;
So, what's the best approach for getting an MD5 for each row (in the same order as the tables) of all tables in a libname? I ran into problems I couldn't overcome using data steps, procs or macros.
The result I want, if lib_A is selected in the prompt, is something like:
libname memname obs_row hash
lib_A table_A 1 64A29CCA15F53C83A9583841294A26AA
lib_A table_B 1 80DAC7B9854CF71A67F9C00A7EC4D9EF
lib_A table_B 2 0AC44CD79DAB2E33C93BB2312D3A9A40
Need some help.
Tks in advance.
You're pretty close. This is how I would approach it. We'll create a macro with three parameters: data, lib, and out. data is the dataset you have with the column information. lib is the library you want to pull from your dataset, and out is the output dataset that you want to have.
We'll read each column into an individual macro variable:
memname1
memname2
memname3
libname1
libname2
libname3
etc.
From here, we simply need to loop over all of the macro variables and apply them where appropriate. We can easily count how many there are in a data step. All we need to do is add double-ampersands to resolve them correctly. For more information on why this is, check out this MWSUG paper.
%macro get_md5(data=, lib=, out=);
/* Save all variables into macro variables:
memname1 memname2 ...
columns1 columns2 ...
*/
data _null_;
set &data.;
where upcase(libname)=upcase("&lib.");
call symputx(cats('memname', _N_), memname);
call symputx(cats('columns', _N_), columns);
call symputx(cats('obs', _N_), obs);
call symputx('n_datasets', _N_);
run;
/* Loop through all the datasets and access each macro variable */
%do i = 1 %to &n_datasets.;
/* Double ampersand needed:
First, resolve &i. to get &memname1
Then resolve &memname1 to get the value stored in the macro variable memname1
*/
%let memname = &&memname&i.;
%let columns = &&columns&i.;
%let obs = &&obs&i.;
/* Calculate md5 in a temporary dataset */
data _tmp_;
length libname $8.
memname $32.
obs_row 8.
hash $32.
;
set &lib..&memname.(obs=&obs.);
libname = "&lib.";
memname = "&memname.";
obs_row = _N_;
hash = put(md5(cats(&columns.)), $hex32.);
keep libname memname obs_row hash;
run;
/* Overwrite the dataset so we don't keep appending */
%if(&i. = 1) %then %do;
data &out.;
set _tmp_;
run;
%end;
%else %do;
proc append base=&out. data=_tmp_;
run;
%end;
%end;
/* Remove temporary data */
proc datasets lib=work nolist;
delete _tmp_;
quit;
%mend;
Example:
data have;
length libname memname columns $15.;
input libname$ memname$ columns$ obs;
datalines;
sashelp cars make,model,msrp 1
sashelp class age,height,name 2
sashelp comet dose,length,sample 1
;
run;
%get_md5(data=have, lib=sashelp, out=want);
Output:
libname memname obs_row hash
sashelp cars 1 258DADA4843E7068ABAF95667E881B7F
sashelp class 1 29E8F4F03AD2275C0F191FE3DAA03778
sashelp class 2 DB664382B88BE7E445418B1A1C8CE13B
sashelp comet 1 210394B77E7696506FDEFD78890A8AB9
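To see the double-ampersand resolution used inside the loop in isolation, here is a minimal sketch with two made-up macro variables (memname1, memname2) and a hard-coded index i; none of these values come from the macro above.

```sas
/* Illustrative values only. */
%let memname1 = cars;
%let memname2 = class;
%let i = 2;

/* First scan:  && becomes &, and &i becomes 2, leaving &memname2 */
/* Second scan: &memname2 resolves to class                       */
%put NOTE: &&memname&i;
```

The log should show NOTE: class. Writing &memname&i with a single ampersand would instead try to resolve a macro variable named memname, which does not exist here.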
I would make a macro that takes as input the four values in your metadata dataset. Note that commas are anathema to SAS programs, especially macro code, so make the macro so it can accept space delimited variable lists (like normal SAS program statements do).
To reduce the risk of name conflict I will name the variable using triple underscores and then rename them back to human friendly names when the dataset is written.
%macro next_ds(libname,memname,num_obs,varlist);
data next_ds;
length ___1 $8 ___2 $32 ___3 8 ___4 $32 ;
___1 = "&libname";
___2 = "&memname";
___3 + 1;
set &libname..&memname(obs=&num_obs keep=&varlist);
___4 = put(md5(cats(of &varlist)),$hex32.);
keep ___1-___4 ;
rename ___1=libname ___2=memname ___3=obs_row ___4=hash;
run;
%mend next_ds;
Let's make some test metadata that reference datasets everyone should have.
data have;
infile cards truncover ;
input libname :$8. memname :$32. num_obs columns $200.;
cards;
sashelp class 3 name,sex,age
sashelp cars 2 make,model
;
And make sure the target dataset does not already exist.
%if %sysfunc(exist(want)) %then %do;
proc delete data=want; run;
%end;
Now you can call that macro once for each observation in your source metadata dataset. There is no need to generate oodles of macro variables. Instead you can use CALL EXECUTE() to generate the macro calls directly from the dataset.
We can replace the commas in the column lists when making the macro call. You can add in a PROC APPEND step after each macro call to aggregate the results into a single dataset.
data _null_;
set have;
call execute(cats(
'%nrstr(%next_ds)(',libname,',',memname,',',num_obs
,',',translate(columns,' ',','),')'
));
call execute('proc append data=next_ds base=want force; run;');
run;
Notice that wrapping the macro call in %NRSTR() makes the SAS log easier to read.
1 + %next_ds(sashelp,class,3,name sex age)
2 + proc append data=next_ds base=want force; run;
3 + %next_ds(sashelp,cars,2,make model)
4 + proc append data=next_ds base=want force; run;
Results:
Obs libname memname obs_row hash
1 sashelp class 1 5425E9CEDA1DDEB71B2692A3C7050A8A
2 sashelp class 2 C532D227D358A3764C2D225DC8C02D18
3 sashelp class 3 13AD5F1517E0C4494780773B6DC15211
4 sashelp cars 1 777C60693BF5E16F38706C89301CD0A8
5 sashelp cars 2 07080C9321145395D1A2BCC10FBE6B83
Note that CATS() might not be the best method for generating the string to pass to the MD5() function. That can generate the same string for different combinations of the source variables. For example 'AB' || 'CD' is the same as 'A' || 'BCD'. Perhaps just use CAT() instead.
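The collision described above can be checked with a short sketch. The variable names are made up, and CATX() with a '|' delimiter is used here only as one possible alternative to the CAT() suggestion; it assumes the delimiter never appears in the data, and it still skips blank values, so it is not a complete fix either.

```sas
data _null_;
  a1 = 'AB'; b1 = 'CD';
  a2 = 'A';  b2 = 'BCD';
  /* CATS() builds 'ABCD' for both pairs, so the two digests are identical */
  same_cats = (md5(cats(a1, b1)) = md5(cats(a2, b2)));
  /* A delimiter keeps the value boundaries: 'AB|CD' versus 'A|BCD' */
  same_catx = (md5(catx('|', a1, b1)) = md5(catx('|', a2, b2)));
  put same_cats= same_catx=;
run;
```

The log should show same_cats=1 same_catx=0.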
Stu's approach is nice, and will work most of the time but will fall over when you have wiiiide variables, a large number of variables, variables with large precision, and other edge cases.
So for the actual hashing part, you might consider this macro, which is extensively tested within Data Controller for SAS:
https://core.sasjs.io/mp__md5_8sas.html
Usage:
data _null_;
set sashelp.class;
hashvar=%mp_md5(cvars=name sex, nvars=age height weight);
put hashvar=;
run;

Extracting unique values from all the datasets in a SAS library

I need to extract all the unique/distinct values of some common variables across all the datasets in a SAS library. I tried the following code, but is there a better way to get this into one dataset?
%macro dslist();
proc sql noprint;
select memname into :mylist separated by ' '
from dictionary.tables where libname= "VIEW" and upcase(memname) like "data_%"
;
quit;
%put &mylist;
data _null_;
datanum = countw("&mylist");
call symput('Dataset', put(datanum, 10.));
run;
%put #######&Dataset;
proc sql ;
%do i = 1 %to &Dataset ;
%let dataname=view.%scan(&mylist,&i,%str( ));
create table %scan(&mylist,&i,%str( )) as
select distinct id,visit
from &dataname
order by id,visit
;
%end;
quit;
%mend;
%dslist;
I use proc append after this step to set all the datasets and then remove duplicates.
Also, does anyone know a hash approach that would be more efficient?
Thank you!
If the number of datasets is small, you might just generate one SQL statement that selects and de-dups. But there is a limit on the number of tables that a single SQL statement can reference, just as there is a limit on the number of dataset names that can fit into the single macro variable your current code is generating.
So to make something that is more robust you could use a data step view to combine the data and PROC SORT to de-dup.
First get the list of datasets that have both ID and VISIT variables and meet your other criteria.
proc sql ;
create table dslist as
select catx('.',libname,nliteral(memname)) as dsname
from dictionary.columns
where libname= "VIEW"
and memname like %upcase("data_%")
and upcase(name) in ('ID' 'VISIT')
group by 1
having count(*)=2
;
quit;
Then use that list to define a data step view that combines just the ID and VISIT variables from all of them.
filename code temp;
data _null_;
set dslist end=eof;
file code lrecl=72;
if _n_=1 then put 'data id_visit_v / view=id_visit_v;' / ' set ' @;
put dsname '(keep=id visit) ' @;
if eof then put ';' / 'run;' ;
run;
%include code / source2;
Then use PROC SORT to get the set of distinct ID*VISIT combinations.
proc sort data=id_visit_v nodupkey out=id_visit ;
by id visit;
run;
Clean up.
proc delete data=id_visit_v (memtype=view);
run;
I wonder how your code is actually working with the following upcase(memname) like "data_%".
Creating fake data
libname view "/home/kermit/folder";
data view.data_A;
call streaminit(123);
array _{5} $ ('s', 't', 'a', 'c', 'k');
do i=1 to 100000;
id=rand("integer", 1, 1000);
j=rand('integer', 1, dim(_));
visit=_[j];
output;
end;
drop i j _:;
run;
data view.data_B;
call streaminit(123);
array _{5} $ ('s', 't', 'a', 'c', 'k');
do i=1 to 100000;
id=rand("integer", 1, 1000);
j=rand('integer', 1, dim(_));
visit=_[j];
output;
end;
drop i j _:;
run;
data view.data_C;
call streaminit(123);
array _{5} $ ('s', 't', 'a', 'c', 'k');
do i=1 to 100000;
id=rand("integer", 1, 1000);
j=rand('integer', 1, dim(_));
visit=_[j];
output;
end;
drop i j _:;
run;
Consolidate in one single table
proc sql noprint;
select cats(libname,'.',memname,"(keep= id visit)") into :mylist separated by ' '
from dictionary.tables where libname="VIEW" and upcase(memname) like "DATA_%"
;
quit;
data have;
set &mylist.;
run;
Extract all unique values of id and visit
proc sort data=have out=want nodupkey; by id visit; run;
NOTE: There were 300000 observations read from the data set WORK.HAVE.
NOTE: 295000 observations with duplicate key values were deleted.
NOTE: The data set WORK.WANT has 5000 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.08 seconds
user cpu time 0.14 seconds
system cpu time 0.02 seconds
memory 23404.76k
OS Memory 51740.00k
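Since the question also asks about a hash approach, here is a minimal sketch of one. It assumes the three fake datasets created above (view.data_A, view.data_B, view.data_C) and that all distinct id/visit pairs fit in memory; the output name want_hash is made up for illustration.

```sas
data _null_;
  /* The keys double as the data portion because defineData is not called */
  declare hash h(ordered: 'a');
  h.defineKey('id', 'visit');
  h.defineDone();
  do until (eof);
    set view.data_A(keep=id visit)
        view.data_B(keep=id visit)
        view.data_C(keep=id visit) end=eof;
    rc = h.add();   /* nonzero rc means the pair is already stored; ignore it */
  end;
  /* Write the distinct pairs, ordered by id and visit */
  rc = h.output(dataset: 'work.want_hash');
  stop;
run;
```

This reads each input once and avoids the intermediate combined table, at the cost of holding every distinct key pair in memory.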

SAS - Add origin table name as a column in report

I have an output table that contains 300+ variables from 30 different tables that are joined by UNION, which is used for modelling. I have created a macro that creates a report with a number of statistics, such as mean, min/max values etc. using this output table. I am trying to add a column to the report that details which table(s) the variables come from. I say table(s) as some of the variables are shared across different tables. I want to avoid having the same variable in the report multiple times as the statistics are the same irrespective of what table the variable comes from. Is there an efficient way to do this?
Instead of UNION, consider combining the tables in a DATA step and using the INDSNAME= option to record the source dataset of each row.
data want;
set sashelp.class sashelp.cars indsname=source;
source_dataset = source;
run;
If it were me, I would loop over each of the union datasets and just put the table name and variable names into a compiled dataset. You probably have all the table names in either a macro list or typed out, so you can just add a few more lines of code to run proc contents on each of those to compile a full list of table and variable names. Note that like your example, there will be duplicate variable names that you can modify after the table is compiled:
** create different tables **;
data height; set sashelp.class(keep=name height); run;
data weight; set sashelp.class(keep=name weight); run;
data sex; set sashelp.class(keep=name sex); run;
** put your datasets into a list either manually or dynamically **;
/* manually */
%let ds_list=height weight sex;
/* dynamically -- be careful to include only tables in your union */
proc sql noprint;
select MEMNAME
into: ds_list separated by " "
from sashelp.vmember
where libname = "WORK" and memname not in ("SASMACR","FORMATS");
quit;
%put &ds_list.;
** loop over each table to put the table name and variables in a dataset **;
%MACRO get_names(ds_list);
%do i=1 %to %sysfunc(countw(&ds_list.));
%let ds = %scan(&ds_list.,&i.);
proc contents data = &ds. noprint
out=names_&ds.(keep=MEMNAME NAME rename=(MEMNAME=SOURCE_DATASET));
run;
proc append data = names_&ds. base=full force; run;
%end;
%MEND;
%get_names(&ds_list.);
I managed to do this using the following:
Create table with source tables.
PROC SQL;
CREATE TABLE SOURCES AS
SELECT NAME
,MEMNAME
FROM DICTIONARY.COLUMNS
WHERE LIBNAME='LIBNAME'
ORDER BY 1,2;
QUIT;
Join to my stats table.
PROC SQL;
CREATE TABLE STATS_NEW AS
SELECT memname AS TABLE_NAME,a.*
FROM STATS a
LEFT JOIN SOURCES b
ON a.name = b.name
GROUP BY a.name
ORDER BY a.name;
QUIT;
Transpose data and add in comma separators.
DATA STATS_TRANSPOSE (drop=TABLE_NAME);
LENGTH INPUT_TABLES $1000;
SET STATS_NEW;
BY name;
RETAIN INPUT_TABLES;
IF FIRST.name THEN DO; INPUT_TABLES=TABLE_NAME; END;
IF NOT FIRST.name
THEN DO;
INPUT_TABLES=CATX(', ',INPUT_TABLES,TABLE_NAME);
END;
IF LAST.name THEN DO;
IF name IN ('FIELD1','FIELD2')
THEN DO; INPUT_TABLES='ALL'; END;
OUTPUT;
END;
RUN;

replacing field name suffixes in bulk

I have a dataset where I have several variables with suffixes that correspond to given dates. I want to replace the suffixes with the dates to make my output tables more user friendly.
Here is a sample of my code
the fields in my sales dataset are
product number_of_sales_1 number_of_sales_2 number_of_sales_3 revenue_1 revenue_2 revenue_3 tax_1 tax_2 tax_3
The suffixes 1,2,3 correspond to dates which are held in a second dataset with the following format
dates
id date
1 01Apr
2 01May
3 01Jun
I want to bulk replace the suffixes with the dates so my fields in sales become
product number_of_sales_01Apr number_of_sales_01May number_of_sales_01Jun revenue_01Apr revenue_01May revenue_01Jun tax_01Apr tax_01May tax_01Jun
Both the number of dates and the number of metrics in sales are dynamic, so I can't just hardcode them in the code.
I assume your datasets look like below:
data sales;
product="abc";number_of_sales_1=1;number_of_sales_2=2;number_of_sales_3=3;
revenue_1=1000;revenue_2=2000;revenue_3=3000;tax_1=100;tax_2=200;tax_3=300;
run;
data dates;
id=1;date="01Apr";output;id=2;date="01May";output;id=3;date="01Jun";output;
run;
1st Step - Finding the date variables that need to be renamed
proc contents data=sales out=sales_temp(keep=name) noprint; run;
data sales_temp1;
length check_date_vars $1. id 8.;
set sales_temp;
check_date_vars=compress(substr(name,length(name)));
temp=notdigit(check_date_vars);
if temp=0 then id=check_date_vars;
run;
2nd step - Merging the above dataset with the dataset that contains the dates, to create a mapping between old names and new names, and creating macro variables from it
proc sort data=sales_temp1; by id; run;
proc sort data=dates; by id; run;
data sales_temp_date;
merge sales_temp1(in=a) dates(in=b);
by id;
if a and b;
new_name=substr(name,1,length(name)-1)||date;
run;
proc sql noprint;
select count(*) into :num_vars separated by " " from sales_temp_date;
quit;
proc sql noprint;
select name into:old_name1 - :old_name&num_vars. from sales_temp_date;
select new_name into:new_name1 - :new_name&num_vars. from sales_temp_date;
quit;
3rd Step - Renaming the variables
%macro rename();
proc datasets library=work nolist;
modify sales;
rename
%do i=1 %to &num_vars.;
&&old_name&i.= &&new_name&i.
%end;
;
run;
quit;
%mend;
%rename;
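As a possible simplification of the 3rd step, the old=new pairs could be collected into a single macro variable so that no macro loop is needed. This sketch assumes the sales_temp_date dataset built in the 2nd step (with its name and new_name columns); rename_list is a made-up macro variable name, and the whole list has to fit within the macro variable length limit.

```sas
/* Build one string such as: revenue_1=revenue_01Apr revenue_2=revenue_01May ... */
proc sql noprint;
  select catx('=', name, new_name)
    into :rename_list separated by ' '
    from sales_temp_date;
quit;

/* Apply all of the renames in a single RENAME statement */
proc datasets library=work nolist;
  modify sales;
  rename &rename_list;
  run;
quit;
```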

SAS loop through datasets

I have multiple tables in a library called snap1:
cust1, cust2, cust3, etc
I want to generate a loop that gets the records' count of the same column in each of these tables and then insert the results into a different table.
My desired output is:
Table Count
cust1 5,000
cust2 5,555
cust3 6,000
I'm trying this but it's not working:
%macro sqlloop(data, byvar);
proc sql noprint;
select &byvar.into:_values SEPARATED by '_'
from %data.;
quit;
data_&values.;
set &data;
select (%byvar);
%do i=1 %to %sysfunc(count(_&_values.,_));
%let var = %sysfunc(scan(_&_values.,&i.));
output &var.;
%end;
end;
run;
%mend;
%sqlloop(data=libsnap, byvar=membername);
First off, if you just want the number of observations, you can get that trivially from dictionary.tables or sashelp.vtable without any loops.
proc sql;
select memname, nlobs
from dictionary.tables
where libname='SNAP1';
quit;
This is fine to retrieve number of rows if you haven't done anything that would cause the number of logical observations to differ - usually a delete in proc sql.
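A small sketch of that difference, using a throwaway dataset named work.demo (a made-up name) stored with the default BASE engine:

```sas
data work.demo;
  do i = 1 to 10;
    output;
  end;
run;

/* PROC SQL DELETE marks rows as deleted without physically removing them */
proc sql;
  delete from work.demo where i > 7;

  /* NOBS counts physical rows, DELOBS the deleted ones, NLOBS the logical rows */
  select memname, nobs, delobs, nlobs
    from dictionary.tables
    where libname = 'WORK' and memname = 'DEMO';
quit;
```

Here NOBS should still report 10 while NLOBS reports 7, which is why NLOBS is the safer column to query.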
Second, if you're interested in the number of valid responses, there are easier non-loopy ways too.
For example, given whatever query that you can write determining your table names, we can just put them all in a set statement and count in a simple data step.
%let varname=mycol; *the column you are counting;
%let libname=snap1;
proc sql;
select cats("&libname..",memname)
into :tables separated by ' '
from dictionary.tables
where libname=upcase("&libname.");
quit;
data counts;
set &tables. indsname=ds_name end=eof; *9.3 or later;
retain count dataset_name;
if _n_=1 then count=0;
if ds_name ne lag(ds_name) and _n_ ne 1 then do;
output;
count=0;
end;
dataset_name=ds_name;
count = count + ifn(&varname.,1,1,0); *true, false, missing; *false is 0 only;
if eof then output;
keep count dataset_name;
run;
Macros are rarely needed for this sort of thing, and macro loops like you're writing even less so.
If you did want to write a macro, the easier way to do it is:
1. Write code to do it once, for one dataset
2. Wrap that in a macro that takes a parameter (dataset name)
3. Create macro calls for that macro as needed
That way you don't have to deal with %scan and troubleshooting macro code that's hard to debug. You write something that works once, then just call it several times.
proc sql;
select cats('%mymacro(name=',"&libname..",memname,')')
into :macrocalls separated by ' '
from dictionary.tables
where libname=upcase("&libname.");
quit;
&macrocalls.;
Assuming you have a macro, %mymacro, which does whatever counting you want for one dataset.
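For illustration only, one hypothetical shape such a %mymacro could take is sketched below. The var= parameter, the scratch table _one and the result table counts are all assumptions rather than part of the answer; the macro would need to be defined before &macrocalls. runs, and counts should not already exist from an earlier run.

```sas
/* Hypothetical helper: count the non-missing values of one column in one dataset */
%macro mymacro(name=, var=mycol);
  proc sql noprint;
    create table _one as
      select "&name" as dataset_name length=41,
             count(&var) as n_nonmissing   /* COUNT(col) skips missing values */
        from &name;
  quit;

  /* Stack the one-row result onto the running results table */
  proc append base=counts data=_one force;
  run;
%mend mymacro;
```

A single call would look like %mymacro(name=snap1.cust1), which is the form the generated calls in &macrocalls. take.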
* Updated *
In the future, please post the log so we can see what is specifically not working. I can see some issues in your code, particularly where your macro variables are being declared, and a select statement that is not doing anything. Here is an alternative process to achieve your goal:
Step 1: Read all of the customer datasets in the snap1 library into a macro variable:
proc sql noprint;
select memname
into :total_cust separated by ' '
from sashelp.vmember
where upcase(memname) LIKE 'CUST%'
AND upcase(libname) = 'SNAP1';
quit;
Step 2: Count the total number of obs in each data set, output to permanent table:
%macro count_obs;
%do i = 1 %to %sysfunc(countw(&total_cust) );
%let dsname = %scan(&total_cust, &i);
%let dsid=%sysfunc(open(snap1.&dsname) );
%let nobs=%sysfunc(attrn(&dsid,nobs) );
%let rc=%sysfunc(close(&dsid) );
data _total_obs;
length Member_Name $15.;
Member_Name = "&dsname";
Total_Obs = &nobs;
format Total_Obs comma8.;
run;
proc append base=Total_Obs
data=_total_obs;
run;
%end;
proc datasets lib=work nolist;
delete _total_obs;
quit;
%mend;
%count_obs;
You will need to delete the permanent table Total_Obs if it already exists, but you can add code to handle that if you wish.
If you want to get the total number of non-missing observations for a particular column, do the same code as above, but delete the 3 %let statements below %let dsname = and replace the data step with:
data _total_obs;
length Member_Name $7.;
set snap1.&dsname end=eof;
retain Member_Name "&dsname";
if(NOT missing(var) ) then Total_Obs+1;
if(eof);
format Total_Obs comma8.;
run;
(Update: Fixed %do loop in step 2)