I cant figure out how to do lookup properly (SAS)

I cant figure out how to do lookup properly (SAS) - sas

In Base SAS, I have a script with hash object to do table lookup. The condition as follow.
Table A is the original primary table which will take Table B to do lookup.
The lookup key is AssetName and Voltage.
Voltage is always either 33 or 11.
So now we can imagine that it is Table A lookup with Table B by using AssetName and Voltage to get some data from Table B.
Lets have a look at the sample code that I have.
data ncpdm.ncp_load_re (drop=excp_code re_state re_supply_zone)
work.excp_ncp_load_re;
length excp_code $50 re_state re_supply_zone $30;
if _n_=1 then do;
declare hash pmu_list(dataset:"ncpdm.ncp_asset_pmu");
pmu_list.definekey('assetname','voltage');
pmu_list.definedata('region','zone','state_code','state',
'business_area_code','business_area',
'supply_zone_code','supply_zone',
'sub_supply_zone_code','sub_supply_zone',
'pmu_name','substation_name_tnbt','functional_location');
pmu_list.definedone();
call missing(region,zone,state_code,state,
business_area_code,business_area,
supply_zone_code,supply_zone,
sub_supply_zone_code,sub_supply_zone,
pmu_name,substation_name_tnbt,functional_location);
end;
set asset_re (RENAME=(pmu=assetname voltage=voltage_));
data_dttm=datetime();
voltage_=strip(voltage_);
voltage=cats('132/',voltage_);
mnemonic_tnbt=strip(mnemonic_tnbt);
assetname=mnemonic_tnbt;
rc=pmu_list.find();
if (rc^=0) then do;
excp_code='Exception: Mnemonic_tnbt and Voltage not mapped to PMU master list';
output work.excp_ncp_load_re;
end;
else do;
output ncpdm.ncp_load_re;
end;
keep mnemonic_tnbt excp_code re_state re_supply_zone
region zone state_code state business_area_code business_area
supply_zone_code supply_zone sub_supply_zone_code sub_supply_zone
pmu_name substation_name_tnbt functional_location voltage
re_state
re_station
re_ca_no
re_customer_name
re_capacity
re_commission_date
re_technology
pmu
ppu
ssu_pe
re_switch_no
voltage
period
data_dttm
active_flag
program
scod_date
kick_off_date
iom_date
geo_longitude
geo_latitude;
run;
From my code above, i set those that cannot be mapped/lookup to output to excp table. I then use the same hash object code but with excp as data source to lookup the same table again, code as below.
(i change Voltage to 33 or 11(the opposite of existing voltage).
/*2nd round lookup for failed record*/
data ncpdm.ncp_load_rev2 (drop=excp_code re_state re_supply_zone)
work.excp_ncp_load_re (drop=excp_code re_state re_supply_zone);
length excp_code $50 re_state re_supply_zone $30;
if _n_=1 then do;
declare hash pmu_list(dataset:"ncpdm.ncp_asset_pmu");
pmu_list.definekey('assetname','voltage');
pmu_list.definedata('region','zone','state_code','state',
'business_area_code','business_area',
'supply_zone_code','supply_zone',
'sub_supply_zone_code','sub_supply_zone',
'pmu_name','substation_name_tnbt','functional_location');
pmu_list.definedone();
call missing(region,zone,state_code,state,
business_area_code,business_area,
supply_zone_code,supply_zone,
sub_supply_zone_code,sub_supply_zone,
pmu_name,substation_name_tnbt,functional_location);
end;
set work.excp_ncp_load_re;
data_dttm=datetime();
if voltage='132/11' then voltage = '132/33';
else if voltage='132/33' then voltage='132/11';
mnemonic_tnbt=strip(mnemonic_tnbt);
assetname=mnemonic_tnbt;
re_state=state;
re_station=station;
re_ca_no=ca_no;
re_customer_name=applicant_name;
re_capacity=capacity;
re_commission_date=commission_date;
re_technology=technology;
geo_latitude=lat;
geo_longitude=lng;
rc=pmu_list.find();
if (rc^=0) then do;
excp_code='Exception: Mnemonic_tnbt and Voltage not mapped to PMU master list';
output work.excp_ncp_load_re;
end;
else do;
output ncpdm.ncp_load_rev2;
end;
keep mnemonic_tnbt excp_code re_state re_supply_zone
region zone state_code state business_area_code business_area
supply_zone_code supply_zone sub_supply_zone_code sub_supply_zone
pmu_name substation_name_tnbt functional_location voltage
re_state
re_station
re_ca_no
re_customer_name
re_capacity
re_commission_date
re_technology
pmu
ppu
ssu_pe
re_switch_no
voltage
period
data_dttm
active_flag
program
scod_date
kick_off_date
iom_date
geo_longitude
geo_latitude;
run;
Problem is, for those that doesn't have matching Voltage in my first hash object, I managed to lookup in the second hash object code BUT I still get un-mapped records. Once I append table generated from 1st hash object and 2nd hash object, i still get lesser records than my desired result.
I couldnt figure out on how to apply a better logic. Somehow i feel that my method of using 2nd hash object to lookup is not necessary but i just dont know what is the better way.
Is there a better way to that?

You can perform both lookups in a single step. After the first find() fails change the key2 value to it's alternate value and perform another find(). Note: For the case of no matches detected, the lookup satellite variables should be explicitly set to missing so that the output data set is accurate -- if not reset the satellite lookup variables will contain values from the most recent prior match.
Sample code:
This sample generates data for a main table and a lookup table. Each table has some satellite variables that are to be carried into the output data sets. A status variable is also created to indicate the kind of match or mismatch that occurred.
data have(keep=key1 key2 have:);
length key1 $6 key2 $2;
do _n_ = 1 to 5000;
key1 = repeat(byte(26*ranuni(123)+rank('A')),5);
key2 = ifc(ranuni(123) < 0.35, '11', '33');
if ranuni(123) < 0.02 then key2 = '22';
array have(3); * satellite (non-key) variables in have;
do i = 1 to 3;
have(i) = _n_ * 10000 + i;
end;
output;
end;
format have: 8.;
run;
data lookup(keep=key1 key2 look:);
length key1 $6 key2 $2;
array look(5); * satellite (non-key) variables in lookup;
do i = 1 to 26;
key1 = repeat(byte(i-1+rank('A')),5);
array key2s(2) $2 _temporary_ ('11','33');
do j = 1 to 2;
key2 = key2s(j);
do k = 1 to 5;
look(k) = i * 1000 + j * 100 + k;
end;
if ranuni(123) < 0.40 then output;
end;
end;
format look: 8.;
run;
data match no_match;
length key1 $6 key2 $2 match_status $30;
if _n_ = 1 then do;
if 0 then set lookup; * prep pdv at compilation time, set is never executed at runtime;
declare hash lookup(dataset:'lookup');
lookup.defineKey('key1', 'key2');
do i = 1 to 5; drop i;
lookup.defineData(cats('look',i));
end;
lookup.defineDone();
end;
set have;
rc = lookup.find();
if rc = 0 then do;
match_status = 'match on actual keys';
output match;
return; * to top of step;
end;
* first lookup failed, try the alternate key2;
if key2 = '11' then
key2='33';
else
if key2 = '33' then
key2 = '11';
else do;
call missing (of look:); * clear lookup values that find() loaded at last prior match;
match_status = 'no match, key2 invalid';
output no_match;
return; * to top of step;
end;
rc = lookup.find();
if rc = 0 then do;
match_status = 'match after key2 swap';
output match;
return; * to top of step;
end;
else do;
call missing (of look:); * clear lookup values that find() loaded at last prior match;
match_status = 'no match with 11 or 33';
output no_match;
return; * to top of step;
end;
run;

Related

Aggregating multiple observations depending on validity ranges

I am facing a problem regarding the concatenation of multiple observations depending on validity ranges. The function I am trying to reproduce is similar to the Listagg() function in Oracle but I want to use it with regards to validity ranges.
Here is a reproducible minimal dataset:
data have;
infile datalines4 delimiter=",";
input id var $ value $ start:datetime20. end:datetime20.;
format start end datetime20.;
datalines4;
1,NAME,AAA,01JAN2014:00:00:00,31DEC2020:00:00:00
1,MEMBER,Y,01JAN2014:00:00:00,31DEC9999:00:00:00
2,NAME,BBB,01JAN2014:00:00:00,31DEC9999:00:00:00
2,MEMBER,Y,01JAN2014:00:00:00,31DEC2016:00:00:00
2,MEMBER,N,01JAN2017:00:00:00,31DEC2019:00:00:00
3,NAME,CCC,01JAN2014:00:00:00,31DEC9999:00:00:00
3,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
3,MEMBER,N,01JAN2014:00:00:00,31DEC2017:00:00:00
4,NAME,DDD,01JAN2014:00:00:00,31DEC9999:00:00:00
4,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
4,MEMBER,N,10JAN2016:00:00:00,31DEC2019:00:00:00
5,NAME,EEE,01JAN2014:00:00:00,31DEC9999:00:00:00
5,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,N,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,Y,01JAN2019:00:00:00,31DEC2019:00:00:00
5,MEMBER,N,01JAN2019:00:00:00,31DEC2019:00:00:00
;;;;
run;
                            
What I would like to do is to concatenate the value variable for each group of var inside an id.
However, there a multiple types of cases:
If there is only one value for a given var inside an id, don't do anything (e.g. case of id=1 in my example)
If validity ranges are consecutive, output every value of var inside an id (e.g. case of id=2)
If validity ranges are the same for the same var inside an id, concatenate them altogether (e.g. case of id=3)
If validity ranges are overlapping, the range that shares the two value of var are concatenated with the corresponding validity range (e.g. case of id=4)
If there are multiple validity ranges that are non consecutive for the same value in var inside an id, concatenate each that shares the same validity ranges (e.g. case of id=5)
Here is the desired result:
                            
Following #Kiran's answer on how to do Listagg function in SAS and #Joe's answer on List Aggregation and Group Concatenation in SAS Proc SQL, I tried to use the CATX function.
This is my attempt:
proc sort data=have;
by id var start;
run;
data staging1;
set have;
by id var start;
if first.var then group_number+1;
run;
/* Simulate LEAD() function in SAS */
data staging2;
merge staging1 staging1(firstobs = 2
keep=group_number start end
rename=(start=lead_start end=lead_end group_number=nextgrp));
if group_number ne nextgrp then do;
lead_start = .;
lead_end = .;
end;
drop nextgrp;
format lag_: datetime20.;
run;
proc sort data=staging2;
by id var group_number start;
run;
data want;
retain _temp;
set staging2;
by id var group_number;
/* Only one obs for a given variable, output directly */
if first.group_number = 1 and last.group_number = 1 then
output;
else if first.group_number = 1 and last.group_number = 0 then
do;
if lead_start ne . and lead_end ne .
and ((lead_start < end) or (lead_end < start)) then
do;
if (lead_start = start) or (lead_end = end) then
do;
retain _temp;
_temp = value;
end;
if (lead_start ne start) or (lead_end ne end) then
do;
_temp = value;
end = intnx('dtday',lead_start,-1);
output;
end;
end;
else if lead_start ne . and lead_end ne . and intnx('dtday', end, 1) = lead_start then
do;
_temp = value;
output;
end;
else output;
end;
else if first.group_number = 0 and last.group_number = 1 then
do;
/* Concatenate preceded retained value */
value = catx(";",_temp, value);
output;
call missing(_temp);
end;
else output;
drop _temp lead_start lead_end group_number;
run;
My attempt did not solve all the problems. Only the cases of id=1 and id=3 were correctly output. I am starting to think that the use of first. and last. as well as the simulated LEAD() function might not be the most optimal one and that there is a probably a better way to do this.
Result of my attempt:
                            
Desired results in data:
data want;
infile datalines4 delimiter=",";
input id var $ value $ start:datetime20. end:datetime20.;
format start end datetime20.;
datalines4;
1,NAME,AAA,01JAN2014:00:00:00,31DEC2020:00:00:00
1,MEMBER,Y,01JAN2014:00:00:00,31DEC9999:00:00:00
2,NAME,BBB,01JAN2014:00:00:00,31DEC9999:00:00:00
2,MEMBER,Y,01JAN2014:00:00:00,31DEC2016:00:00:00
2,MEMBER,N,01JAN2017:00:00:00,31DEC2019:00:00:00
3,NAME,CCC,01JAN2014:00:00:00,31DEC9999:00:00:00
3,MEMBER,Y;N,01JAN2014:00:00:00,31DEC2017:00:00:00
4,NAME,DDD,01JAN2014:00:00:00,31DEC9999:00:00:00
4,MEMBER,Y,01JAN2014:00:00:00,09JAN2016:00:00:00
4,MEMBER,Y;N,10JAN2016:00:00:00,31DEC2017:00:00:00
4,MEMBER,N,01JAN2018:00:00:00,31DEC2019:00:00:00
5,NAME,EEE,01JAN2014:00:00:00,31DEC9999:00:00:00
5,MEMBER,Y;N,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,Y;N,01JAN2019:00:00:00,31DEC2019:00:00:00
;;;;
run;

It's pretty hard to do this in raw SQL, without built in windowing functions; data step SAS will have some better solutions.
Some of this depends on your data size. One example, below, does exactly what you ask for, but it probably will be impractical with your real data. Some of that is the 31DEC9999 dates - that makes for a lot of data - but even without that, this has thousands of rows per person, so if you have a million people or something this will get rather large. But, it might still be the best solution, depending on what you need - it does give you the absolute best control.
* First, expand the dataset to one row per day/value. (Hopefully you do not need the datetime - just the date.)
data daily;
set have;
do datevar = datepart(start) to datepart(end);
output;
end;
format datevar date9.;
drop start end;
run;
proc sort data=daily;
by id var datevar value;
run;
*Now, merge together the rows to one row per day - so days with multiple values will get merged into one.;
data merging;
set daily;
by id var datevar;
retain merge_value;
if (first.datevar and last.datevar) then output;
else do;
if first.datevar then merge_value = value;
else merge_value = catx(',',merge_value,value);
if last.datevar then do;
value = merge_value;
output;
end;
end;
keep id var datevar value;
run;
proc sort data=merging;
by id var value datevar;
run;
*Now, re-condense;
data want;
set merging;
by id var value datevar;
retain start end;
last_datevar = lag(datevar);
if first.value then do;
start = datevar;
end = .;
end;
else if last_datevar ne (datevar - 1) then do;
end = last_datevar;
output;
start = datevar;
end = .;
end;
if last.value then do;
end = datevar;
output;
end;
format start end date9.;
run;
I do not necessarily recommend doing this - it's provided for completeness, and in case it turns out it's the only way to do what you do.
Easier, most likely, is to condense using the data step using an event level dataset, where 'start' and 'end' are events. Here's an example that does what you require; it translates the original dataset to only 2 rows per original row, and then uses logic to decide what should happen for each event. This is pretty messy, so you'd want to clean it up for production, but the idea should work.
* First, make event level dataset so we can process the start and end separately;
data events;
set have;
type = 'Start';
dt_event = start;
output;
type = 'End';
dt_event = end;
output;
drop start end;
format dt_event datetime.;
run;
proc sort data=events;
by id var dt_event value;
run;
*Now, for each event, a different action is taken. Starts and Ends have different implications, and do different things based on those.;
data want;
set events(rename=value=in_value);
by id var dt_event;
retain start end value orig_value;
format value new_value $8.;
* First row per var is easy, just start it off with a START;
if first.var then do;
start = dt_event;
value = in_value;
end;
else do; *Now is the harder part;
* For ENDs, we want to remove the current VALUE from the concatenated VALUE string, always, and then if it is the last row for that dt_event, we want to output a new record;
if type='End' then do;
*remove the current (in_)value;
if first.dt_event then orig_value = value;
do _i = 1 to countw(value,',');
if scan(orig_value,_i,',') ne in_value then new_value = catx(',',new_value,scan(orig_value,_i,','));
end;
orig_value = new_value;
if last.dt_event then do;
end = dt_event;
output;
start = dt_event + 86400;
value = new_value;
orig_value = ' ';
end;
end;
else do;
* For START, we want to be more careful about outputting, as this will output lots of unwanted rows if we do not take care;
end = dt_event - 86400;
if start < end and not missing(value) then output;
value = catx(',',value,in_value);
start = dt_event;
end = .;
end;
end;
format start end datetime21.;
keep id var value start end;
run;
Last, I'll leave you with this: you probably work in insurance, pharma, or banking, and either way this is a VERY solved problem - it's done a lot (this sort of windowing). You shouldn't really be writing new code here for the most part - first look in your company, and then if not, look for papers in either PharmaSUG or FinSUG or one of the other SAS user groups, where they talk about this. There's probably several dozen implementations of code that does this already published.

Frequency of entire table in SAS

Is it possible to get the frequency of an entire table in SAS? For example I want to count how many yes's or no's are in an entire table? Thanks

A hash component object has keys and can track .FIND references in key summary variable specified with the keysum: tag attribute supplied at instantiation. The keysum variable, when incremented by 1 per suminc: variable will compute a frequency count.
data have;
* Words array from Abstract;
* "How Do I Love Hash Tables? Let Me Count The Ways!";
* by Judy Loren, Health Dialog Analytic Solutions;
* SGF 2008 - Beyond the Basics;
* https://support.sas.com/resources/papers/proceedings/pdfs/sgf2008/029-2008.pdf;
array words(17) $10 _temporary_ (
'I' 'love' 'hash' 'tables'
'You' 'will' 'too' 'after' 'you' 'see'
'what' 'they' 'can' 'do' '--' 'Judy' 'Loren'
);
call streaminit(123);
do row = 1 to 127;
attrib RESPONSE1-RESPONSE20 length = $10;
array RESPONSE RESPONSE1-RESPONSE20;
do over RESPONSE;
RESPONSE = words(rand('integer', 1, dim(words)));
end;
output;
end;
run;
data _null_;
set have;
if _n_ = 1 then do;
length term $10;
call missing (term);
retain one 1;
retain count 0;
declare hash bins(suminc:'one', keysum:'count');
bins.defineKey('term');
bins.defineData('term');
bins.defineDone();
end;
set have end=lastrow;
array response response1-response20;
do over response;
if bins.find(key:response) ne 0 then do;
bins.add(key:response, data:response, data:1);
end;
end;
if lastrow;
bins.output(dataset:'all_freq');
run;
Original answer, presumed only Yes and No
Yes. You can array values, compute as 0/1 flag for each No/Yes value and then use SUM to count 0's and 1's. SUM computes FREQ only when dealing with just 0's and 1's.
Example:
data have;
call streaminit(123);
do row = 1 to 100;
attrib ANSWER1-ANSWER20 length = $3;
array ANSWER ANSWER1-ANSWER20;
do over ANSWER; ANSWER = ifc(rand('uniform') > 0.15,'Yes','No'); end;
output;
end;
run;
data want(keep=freq_1 freq_0);
set have end=lastrow;
array ANSWER ANSWER1-ANSWER20;
array X(20) _temporary_;
do over ANSWER; x(_I_) = ANSWER = 'Yes'; end;
freq_1 + sum (of X(*));
freq_0 + dim(X) - sum (of X(*));
if lastrow;
run;

Transpose your main data and then do a proc freq. This is fully dynamic and scales if the number of question scales or the responses scale. You do need to have all variables be of the same type though - character or numeric.
*generate fake data;
data have;
call streaminit(99);
array q(30) q1-q30;
do i=1 to 100;
do j=1 to dim(q);
q(j) = rand('bernoulli', 0.8);
end;
output;
end;
run;
*flip it to a long format;
proc transpose data=have out=long;
by I;
var q1-q30;
run;
*get the summaries needed;
proc freq data=long;
table col1;
run;
You should get output as follows:
The FREQ Procedure
COL1 Frequency Percent Cumulative
Frequency Cumulative
Percent
0 581 19.37 581 19.37
1 2419 80.63 3000 100.00

Designing new RK number for unique record

I am a SAS Developer. I am starting a project that requires me to assign RK number to unique record. Every extraction will get data that is already existed in the target table and some may not.
For example.
Source Data:
Name
A
B
C
D
E
Target Table:
Name RK
A 1
B 2
C 3
When I load, i want it to insert D and E into the target table with RK 4 & 5 respectively. Currently, I can think of doing hash lookup from source with target table. For data that is not mapped using hash object, RK field will be blank. I will put the max RK number from the target table and incremental 1 to it by appending D & E into it.
I am not sure if this is the most efficient way of doing so. Is there another more efficient way?

You could use a hash to determine if some name (I'll call it value) already exists in target table. However, new keys would have to be tracked, output at the end of the step and then PROC APPPEND'd to target table (I'll call it master) .
For the case of just updating the master table with new RK values, a traditional SAS approach is to use a DATA step to MODIFY a unique keyed master table. The coding pattern is:
SET <source>
MODIFY <master> KEY=<value> / UNIQUE;
... _IORC_ logic ...
Example:
%* Create some source data and the master table;
data have1 have2 have3 have4 have5;
call streaminit(123);
value = 2020; output; output; output;
do _n_ = 1 to 2500;
value = ceil(rand('uniform', 5000));
select;
when (rand('uniform') < 0.20) output have1;
when (rand('uniform') < 0.20) output have2;
when (rand('uniform') < 0.20) output have3;
when (rand('uniform') < 0.20) output have4;
otherwise output have5;
end;
end;
run;
data have6;
do _n_ = 1 to 20;
value = 2020;
output;
end;
run;
* Create the unique keyed master table;
* Typically done once and stored in a permanent library.;
proc sql;
create table keys (value integer, RK integer);
create distinct index value on work.keys;
quit;
%* A macro for adding new RK values as needed;
%macro RK_ASSIGN(master, data);
%local last;
proc sql noprint;
select max(RK) into :last trimmed from &master;
quit;
data &master;
retain newkey %sysevalf(0&last+0); %* trickery for 1st use case when max(RK) is .;
set &data;
modify &master key=value / unique;
if _iorc_ eq %sysrc(_DSENOM);
newkey + 1;
RK = newkey;
output;
_error_ = 0;
run;
%mend;
%* Use the macro to process source data;
%RK_ASSIGN(keys,have1)
%RK_ASSIGN(keys,have2)
%RK_ASSIGN(keys,have3)
%RK_ASSIGN(keys,have4)
%RK_ASSIGN(keys,have5)
%RK_ASSIGN(keys,have6)
You can see the forced repeats of the 2020 value in the source data is only RK'd once in the master table, and there are no errors during processing.
If you want to backfill the source data with the found or assigned RK value there would be additional steps. You could update a custom format, or do a traditional left join. If you want to focus on backfill during a read over source data the HASH step + APPEND new RK's step might be preferable.
Example 2 Master table is named values
HASH version with RK assignment added to source data. New RKs output and appended.
proc sql;
create table values (value integer, RK integer);
create distinct index value on work.values;
%macro RK_HASH_ASSIGN(master,data);
%local last;
proc sql noprint;
select max(RK) into :last trimmed from &master;
quit;
data &data(drop=next_RK);
set &data end=end;
if _n_ = 1 then do;
declare hash lookup (dataset:"&master");
lookup.defineKey("value");
lookup.defineData("value", "RK");
lookup.defineDone();
declare hash newlookup (dataset:"&master(obs=0)");
newlookup.defineKey("value");
newlookup.defineData("value", "RK");
newlookup.defineDone();
end;
retain next_RK %sysevalf(0&last+0); %* trick;
* either load existing RK from hash, or compute and apply next RK value;
if lookup.find() ne 0 then do;
next_RK + 1;
RK = next_RK;
lookup.add();
newlookup.add();
end;
if end then do;
newlookup.output(dataset:'work.newmasters');
end;
run;
proc append base=&master data=work.newmasters;
proc delete data=work.newmasters;
run;
%mend;
%RK_HASH_ASSIGN(values,have1)
%RK_HASH_ASSIGN(values,have2)
%RK_HASH_ASSIGN(values,have3)
%RK_HASH_ASSIGN(values,have4)
%RK_HASH_ASSIGN(values,have5)
%RK_HASH_ASSIGN(values,have6)
%* Compare the two assignment strategies, no differences!;
proc sort force data=values(index=(value));
by RK;
run;
proc compare noprint base=keys compare=values out=diffs outnoequal;
by RK;
run;
----- LOG -----
2525 proc compare noprint base=keys compare=values out=diffs
outnoequal <------------- do not output when data is identical ;
;
2526 by RK;
2527 run;
NOTE: There were 215971 observations read from the data set WORK.KEYS.
NOTE: There were 215971 observations read from the data set WORK.VALUES.
NOTE: The data set WORK.DIFFS has 0 observations and 4 variables. <--- all the same ---
NOTE: PROCEDURE COMPARE used (Total process time):
real time 0.25 seconds
cpu time 0.26 seconds

Creating a dataset with the unique values of indexed variable

I have a dataset (LRG_DS) with about 74,000,000 observations. The dataset has been indexed by a variable (I_VAR1) that has about 7500 unique values. I've discovered this by running a proc contents on the dataset.
I'd like to create a dataset (TEMP)contains just the 7000 unique values of the index variable.
I've tried the following:
data TEMP;
set LRG_DS (keep = I_VAR1);
by I_VAR1;
if first.I_VAR1;
run;
and
proc sort data = LRG_DS nodupkey out = TEMP (keep = I_VAR1);
by I_VAR1;
run;
The first approach takes about 46 seconds and the second takes about 55 seconds.
I've read that the sas7bndx is file is not intended to be examined in isolation, but rather as a file to speed up the some of the procedures performed using the index variable.
Any help is much appreciated!

YMMV but using populating an empty hash table with the unique key values may perform better than a sort.
Create some example data:
data x;
do cnt=1 to 10*100000;
var=round(rand('uniform'),0.001);
do cnt2=1 to 10;
output;
end;
drop cnt2;
end;
run;
Test speed with a proc sort:
proc sort data=x(keep=var) out=sorted nodupkey;
by var;
run;
Compare with the hash table version:
data _null_;
set x(keep=var) end=eof;
if _n_ eq 1 then do;
declare hash ht ();
rc = ht.DefineKey ('var');
rc = ht.DefineDone ();
end;
if ht.check() ne 0 then do;
rc = ht.add();
end;
if eof then do;
ht.output(dataset:"ids");
end;
run;
From my very brief tests, I found that the hash table version starts to perform worse as the number of unique values grows. It may be possible to offset this by dimensioning the hash appropriately beforehand but I didn't test.

Read previous and next observations

I have a dataset like this(sp is an indicator):
datetime sp
ddmmyy:10:30:00 N
ddmmyy:10:31:00 N
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
ddmmyy:10:34:00 N
And I would like to extract observations with "Y" and also the previous and next one:
ID sp
ddmmyy:10:31:00 N
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
I tired to use "lag" and successfully extract the observations with "Y" and the next one, but still have no idea about how to extract the previous one.
Here is my try:
data surprise_6_step3; set surprise_6_step2;
length lag_sp $1;
lag_sp=lag(sp);
if sp='N' and lag(sp)='N' then delete;
run;
and the result is:
ID sp
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
Any methods to extract the previous observation also?
Thx for any help.

Try using the point option in set statement in data step.
Like this:
data extract;
set surprise_6_step2 nobs=nobs;
if sp = 'Y' then do;
current = _N_;
prev = current - 1;
next = current + 1;
if prev > 0 then do;
set x point = prev;
output;
end;
set x point = current;
output;
if next <= nobs then do;
set x point = next;
output;
end;
end;
run;
There is an implicite loop through dataset when you use it in set statement.
_N_ is an automatic variable that contains information about what observation is implicite loop on (starts from 1). When you find your value, you store the value of _N_ into variable current so you know on which row you have found it. nobs is total number of observations in a dataset.
Checking if prev is greater then 0 and if next is less then nobs avoids an error if your row is first in a dataset (then there is no previous row) and if your row is last in a dataset (then there is no next row).

/* generate test data */
data test;
do dt = 1 to 100;
sp = ifc( rand("uniform") > 0.75, "Y", "N" );
output;
end;
run;
proc sql;
create table test2 as
select *,
monotonic() as _n
from test
;
create table test3 ( drop= _n ) as
select a.*
from test2 as a
full join test2 as b
on a._n = b._n + 1
full join test2 as c
on a._n = c._n - 1
where a.sp = "Y"
or b.sp = "Y"
or c.sp = "Y"
;
quit;

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

I cant figure out how to do lookup properly (SAS) - sas

Related

Aggregating multiple observations depending on validity ranges

Frequency of entire table in SAS

Designing new RK number for unique record

Creating a dataset with the unique values of indexed variable

Read previous and next observations

Categories

Resources