Merge huge data sets in SAS

I need some advice from the SAS gurus :).
Suppose I have two big data sets. The first one is huge (about 50-100 GB!) and contains phone numbers. The second one contains prefixes (20-40 thousand observations).
I need to add the most appropriate prefix to the first table for each phone number.
For example, if I have a phone number +71230000 and prefixes
+7
+71230
+7123
The most appropriate prefix is +71230 (the longest one that matches).
My idea: first, sort the prefix table. Then, in a data step, process the phone numbers table:
data OutputTable;
set PhoneNumbersTable end=_last;
if _N_ = 1 then do;
dsid = open('PrefixTable');
end;
/* for each observation in PhoneNumbersTable:
1. Take the first digit of the phone number (`+7`).
Look it up in PrefixTable. Store the observation number of
this prefix (`n_obs`).
2. Take the first TWO digits of the phone number (`+71`).
Look it up in PrefixTable, starting from observation `n_obs + 1`.
Stop when this prefix is found
(then store its observation number) or
when the first digit changes (then the previous one was the
most appropriate prefix).
etc....
*/
if _last then do;
rc = close(dsid);
end;
run;
I hope my idea is clear enough; if it's not, I'm sorry.
So what do you suggest?
Thank you for your help.
P.S. Of course, phone numbers in the first table are not unique (they may be repeated), and my algorithm, unfortunately, doesn't take advantage of that.

There are a couple of ways you could do this: you could use a format or a hash table.
Example using a format:
/* Build a simple format of all prefixes, and determine max prefix length */
data prefix_fmt ;
set prefixtable end=eof ;
retain fmtname 'PREFIX' type 'C' maxlen . ;
maxlen = max(maxlen,length(prefix)) ; /* Store maximum prefix length */
start = prefix ;
label = 'Y' ;
output ;
if eof then do ;
hlo = 'O' ;
label = 'N' ;
output ;
call symputx('MAXPL',maxlen) ;
end ;
drop maxlen ;
run ;
proc format cntlin=prefix_fmt ; run ;
/* For each phone number, start with full number and reduce by 1 digit until prefix match found */
/* For efficiency, initially reduce phone number to length of max prefix */
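data match_prefix ;
  set phonenumberstable ;
  length prefix $&MAXPL ;
  prefix = '' ;
  /* Try the longest candidate first, then drop one trailing character at a time. */
  /* Looping on the length also tests the 1-character candidate and never asks substr for 0 characters. */
  do _len = min(&MAXPL, length(phonenumber)) to 1 by -1 until (not missing(prefix)) ;
    if put(substr(phonenumber,1,_len),$PREFIX.) = 'Y' then prefix = substr(phonenumber,1,_len) ;
  end ;
  drop _len ;
run ;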

Here's another solution which works very well, speed-wise, as long as you can work under one major (maybe okay) restriction: phone numbers can't start with a 0, and have to be either numeric or convertible to numeric (i.e., the "+" must not be needed for the lookup).
What I'm doing is building an array of 1/null flags, one flag per possible prefix. This doesn't work with a leading 0, since '9512' and '09512' are the same number. This could be worked around - adding a '1' at the start (so if you have 6 possible digits of prefix, then everything is 1000000+prefix), for example, would work - but it would require adjusting the code below (and might have performance implications, though I think it wouldn't be all that bad). If "+" is also needed, that might need to be converted to a digit; here you could say anything with a "+" gets 2000000 added to the beginning, or something like that.
The nice thing is this only takes 6 queries (or so) of an array at most per row - quite a bit faster than any of the other search options (since temporary arrays are contiguous blocks of memory, it's just "go check 6 memory addresses that are pre-computed"). Hash and format will be a decent chunk slower, since they have to look up each one anew.
One major performance suggestion: pay attention to which way your prefixes will likely fail to match. Checking 6 then 5 then 4 then ... might be faster, or checking 1 then 2 then 3 then ... might be faster. It all depends on the actual prefixes themselves, and the actual phone numbers. If most of your prefixes are "+11" and things like that, you almost certainly want to start from the left, if that means a number starting with "94" will quickly be found not to match.
With that, the solution.
data prefix_match;
if _n_=1 then do;
array prefixes[1000000] _temporary_;
do _i = 1 to nobs_prefix;
set prefixes point=_i nobs=nobs_prefix;
prefixes[prefix]=1;
end;
call missing(prefix);
end;
set phone_numbers;
do _j = 6 to 1 by -1;
prefix = input(substr(phone_no,1,_j),6.);
if prefix ne 0 and prefixes[prefix]=1 then leave;
prefix=.;
end;
drop _:;
run;
Against a test set with 40k prefixes and 100m phone numbers (and no other variables), this ran in a bit over 1 minute on my (good) laptop, versus a bit over 6 minutes with the format solution and a bit over 4 minutes with the hash solution (modified to output all rows, since the other two solutions do). That seems about right to me performance-wise.

Here's a hashtable example.
Generate some dummy data.
data phone_numbers(keep=phone)
prefixes(keep=prefix);
length phone $10 prefix $4;
do i=1 to 10000000;
phone = cats(int(ranuni(0) * 9999999999 + 1));
len = int(ranuni(0) * 4 + 1);
prefix = substr(phone,1,len);
if input(phone,best.) ge 1000000000 then do;
output;
end;
end;
run;
Assuming the longest prefix is 4 chars, try finding a match with the longest first, then continue until the shortest prefix has been tried. If a match is found, output the record and move on to the next observation.
data ht;
attrib prefix length=$4;
set phone_numbers;
if _n_ eq 1 then do;
declare hash ht(dataset:"prefixes");
ht.defineKey('prefix');
ht.defineDone();
end;
do len=4 to 1 by -1;
prefix = substr(phone,1,len);
if ht.find() eq 0 then do;
output;
leave;
end;
end;
drop len;
run;
You may need to add logic to output the record with a blank prefix when no match is found; I'm not sure how you want to handle that scenario.
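One way to keep unmatched phone numbers is sketched below. It is only an illustration: the tiny sample tables, the output name ht_all and the found flag are assumptions, not part of the original answer. It uses the hash check() method, which only tests whether the key exists, and relies on the implicit output at the end of the data step so that every phone number is written exactly once.

```sas
/* Hypothetical sample data; table and variable names mirror the example above */
data prefixes;
  length prefix $4;
  input prefix $;
  datalines;
7
712
7123
;
run;

data phone_numbers;
  length phone $10;
  input phone $;
  datalines;
7123000000
9990000000
;
run;

data ht_all;
  attrib prefix length=$4;
  set phone_numbers;
  if _n_ eq 1 then do;
    declare hash ht(dataset:"prefixes");
    ht.defineKey('prefix');
    ht.defineDone();
  end;
  found = 0;
  /* Longest candidate first; stop at the first hit */
  do len = 4 to 1 by -1 until (found);
    prefix = substr(phone, 1, len);
    found = (ht.check() eq 0);
  end;
  /* No hit at any length: blank the prefix but still keep the row */
  if not found then call missing(prefix);
  drop len found;
run;
```

With this sample data the first phone number gets prefix 7123 and the second is kept with a blank prefix.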

Related

Mixing macro-DO-loops with data step DO-loops

Some context:
I have a string of numbers (not ordered, but within the known range 1 - 78) and I want to extract the numbers to create specific variables from them, so I have
"64,2,3" => var_64 = 1; var_02 = 1; var_03 = 1; (the rest, like var_01, are all set to missing)
I basically came up with two solutions, one using a macro DO loop and the other a data step DO loop. The non-macro solution was to first initialize all variables var_01 - var_78 (via a macro), then to put them into an array and then to gradually set the values of this array while looping through the string, word by word.
I then realized that it would be way easier to use the loop iterator as a macro variable and I came up with this MWE:
%macro fast(w,l);
do p = 1 to &l.;
%do j = 1 %to 9;
if &j. = scan(&w.,p,",") then var_0&j. = 1 ;
%end;
%do j = 10 %to 78;
if &j. = scan(&w.,p,",") then var_&j. = 1 ;
%end;
end;
%mend;
data want;
string = "2,4,64,54,1,4,7";
l = countw(string,",");
%fast(string,l);
run;
It works (no errors, no warnings, expected result) but I am unsure about mixing macro-DO-loops and non-macro-DO-loops. Could this lead to any inconsistencies or should I just stay with the non-macro solution?
Your current code is comparing numbers like 1 to strings like "1".
&j. = scan(&w.,p,",")
It will work as long as the strings can be converted into numbers, but it is not a good practice. It would be better to explicitly convert the strings into numbers.
input(scan(&w.,p,","),32.)
You can do what you want with an array. Use the number generated from the next item in the list as the index into the array.
data want;
string = "2,4,64,54,1,4,7";
array var_ var_01-var_78 ;
do index=1 to countw(string,",");
var_[input(scan(string,index,","),32.)]=1;
end;
drop index;
run;

How do you compare observations having part of the key in common in SAS? (i.c. same family, different child)

I have a data set like this
Family Student Age Grade
1 Bob 10 4
1 Kris 12 5
1 Tracy 15 9
There are many other families in this data set. I need to find the siblings of the fifth grader, and create a new variable that contains the age difference between the sibling and the fifth grader. This is an activity that involves merging sets.
The set "school" contains all students and "fifthgraders" only has the fifth graders. I know how to merge them, but I'm stuck on finding their siblings and subtracting their ages.
data mergeStudents
set school fifthgraders
by student
run;
Your code just can't work, because
you process data by student, not by family
you read fifthgraders before their siblings
Also, there is no need to make a separate dataset with only the fifth graders; the where= dataset option can do that filtering while the data is read.
A better solution would be
data mergeStudents;
set school (where =(Grade eq 5) in = grade_5)
school (where =(Grade ne 5) in = sibling);
by Family;
retain age_5; * do not nullify when reading the next observation ;
if first.Family then age_5 = .; * forget the data about the previous family;
if grade_5 then age_5 = Age; * remember the age of the fifth grader in the family;
* For a family with two fifth graders, this will be the last one in the dataset;
* By sorting 'by Family Age' first, you can force it to be the oldest one;
if sibling then do;
age_difference = age - age_5;
output; * If you write an output statement in a dataset, ;
* only observations for which the output statement is executed will be in the result;
end;
run;
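As a worked example, here is the same interleaving technique run on the three rows from the question. This is a sketch under assumptions: the input data step, the sort and the output name siblings are added for illustration, and the school table is assumed to hold exactly the columns shown in the question.

```sas
/* The sample family from the question */
data school;
  input Family Student $ Age Grade;
  datalines;
1 Bob 10 4
1 Kris 12 5
1 Tracy 15 9
;
run;

/* BY-group processing needs the data sorted by Family */
proc sort data=school;
  by Family Age;
run;

data siblings;
  /* Within each family, the fifth graders are read first, then everyone else */
  set school (where=(Grade eq 5) in=grade_5)
      school (where=(Grade ne 5) in=sibling);
  by Family;
  retain age_5;
  if first.Family then age_5 = .;
  if grade_5 then age_5 = Age;
  if sibling then do;
    age_difference = Age - age_5;
    output;
  end;
run;
```

For this family the result has two rows: Bob with an age_difference of -2 and Tracy with an age_difference of 3.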

SAS - Changing Existing Character Variable values to Numeric using Input

Have a variable called var1 that has two kinds of values (both as character strings). One is "ND" the other is a number out of 0-100, as a string. I want to convert "ND" to 0 and the character string to a numeric value, for example 1(character) to 1(numeric).
Here's my code attempt:
data cleaned_up(drop = exam_1);
set dataset.df(rename=(exam1=exam_1));
select (exam1);
when ('ND') do;
exam1 = 0;
end;
when ;
exam1 = input(exam_1,2.);
end;
otherwise;
end;
Clearly not working. What am I doing wrong?
There are a couple of problems with your code. Putting rename= as a dataset option on the input dataset performs the rename before the data is read in. Therefore exam1 won't exist, as it is now called exam_1. This will still be defined as a character column, so the input function won't work.
You need to keep the existing column, create a new numeric column to do the conversion, then drop the old column and rename the new one. This can be done as a dataset option against the output dataset.
The tranwrd function will replace all occurrences of 'ND' with '0'; then input with the best12. informat will read in all the data as numbers. You don't have to specify the exact width when reading numbers (i.e. 2. for 2 digits, 3. for 3 digits etc).
data cleaned_up (drop=exam1 rename=(exam_1=exam1));
set df;
exam_1 = input(tranwrd(exam1,'ND','0'),best12.);
run;
You are using select(exam1) while it should be select(exam_1). You can use select for this purpose, but I think a simple if condition solves this much more easily:
data test;
length source $32;
do source='99', '34.5', '105', 'ND';
output;
end;
run;
data result(drop = convertedValue);
set test;
if (source eq 'ND') then do;
result = 0;
end;
else do;
convertedValue = input(source,??best.);
if not missing(convertedValue) then do;
if (0 <= round(convertedValue, 1E-12) <= 100) then do;
result = convertedValue;
end;
end;
end;
run;
input(source,??best.) tries to convert source to a number; if it fails (e.g. the value contains a word), it returns a missing value without printing an error and execution simply continues.
round(convertedValue,1E-12) is used to avoid floating-point precision errors during the comparison. If you want to be absolutely safe, you have to use something like
if (0 < round(convertedValue,1E-12) < 100
or abs(round(convertedValue,1E-12)) < 1E-10
or abs(round(convertedValue-100,1E-12)) < 1E-10
)
Try using the ifc function, then convert the result to a numeric variable.
data have;
input x $3.;
_x=input(ifc(x='ND','0',x),best12.);
cards;
3
10
ND
;

Split a row into multiple rows in SAS enterprise guide

I need help splitting a row into multiple rows when the value in the row is a range such as 1-5. The reason is that I need 1-5 to count as 5, not as 1, which is what happens when it sits in a single row.
I have an ID, the value, and the page it belongs to.
For example:
ID Value Page
1 1-5 2
The output I want is something like this:
ID Value Page
1 1 2
1 2 2
1 3 2
1 4 2
1 5 2
I've tried using an IF statement:
IF bioVerdi='1-5' THEN
DO;
..
END;
But I don't know what to put between the DO; and END;. Any clues to help me out here?
You need to loop over the values inside your range and OUTPUT the values. The OUTPUT statement causes the Data Step to write a record to the output data set.
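data want;
set have;
if bioVerdi = '1-5' then do;
do value=1 to 5;
output;
end;
end;
else output; /* keep rows that are not a range; an explicit OUTPUT turns off the automatic one */
run;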
Here is another solution that is less restricted to the actual value '1-5' given in your example, but would work for any value in the format '1-6', '1-7', '1-100', etc.
*this is the data you gave ;
data have ;
ID = 1 ;
value = '1-5';
page = 2;
run;
data want ;
set have ;
/* scan returns character values, so convert them explicitly instead of relying on automatic conversion */
min = input( scan( value, 1, '-' ), best12. ) ; * get the 1st word, delimited by a dash ;
max = input( scan( value, 2, '-' ), best12. ) ; * get the 2nd word, delimited by a dash ;
/*loop through the values from min to max, and assign each value as the loop iterates to a new column 'NEWVALUE.' Each time the loop iterates through the next value, output a new line */
do newvalue = min to max ;
output ;
end;
/*drop the old variable 'value' so we can rename the newvalue to it in the next step*/
drop value min max;
/*newvalue was a temporary name, so renaming here to keep the original naming structure*/
rename newvalue = value ;
run;
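If the column can also hold single values such as '3' next to ranges such as '1-5', the second word is missing and the loop bounds need a fallback. The following is a minimal sketch under that assumption; the second sample row and the names lo and hi are hypothetical additions for illustration.

```sas
/* Sample data: one range and one single value (the second row is an assumption) */
data have;
  length value $10;
  input ID value $ page;
  datalines;
1 1-5 2
2 3 4
;
run;

data want;
  set have;
  /* First word before the dash is the start of the range */
  lo = input(scan(value, 1, '-'), ?? best12.);
  /* A value without a dash has no second word, so fall back to lo */
  if countw(value, '-') > 1 then hi = input(scan(value, 2, '-'), ?? best12.);
  else hi = lo;
  /* A missing bound would make the iterative DO fail, so guard against it */
  if not missing(lo) and not missing(hi) then do newvalue = lo to hi;
    output;
  end;
  else output; /* unparseable value: keep the row with a missing newvalue */
  drop value lo hi;
  rename newvalue = value;
run;
```

With this sample data, ID 1 expands to five rows with values 1 to 5 on page 2, and ID 2 stays a single row with value 3 on page 4.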

character to numeric in SAS

I'm new to sas and I have the following problem.
I have a variable that stores time but is a character variable with format $50. It looks like 30 min, 1.5 h, 5 h, 10 h. I need to convert it to numeric and calculate the time in hours. I tried the substrn function to extract the numbers, but substrn(var, 1,2) gives 30, 1 (instead of 1.5), 5, 10 and substrn(var, 1,3) gives 30, 1.5, . (instead of 5), 10. How can I solve this?
Any help is appreciated.
Conversion from character to numeric is usually done using the input function. The second argument passes the expected informat (a rule telling SAS how to interpret the input).
You can use the compress function (with the "k" option to keep rather than discard characters) to get just the numeric part of the character variable. Compress will remove certain characters from a value; the first argument passes the string for it to work on, the second argument lists the characters to remove, the third argument passes additional options (here "d" to add numerals to the list of characters to remove and "k" to invert the process, i.e. keep rather than remove the selected characters).
The index function can be used to identify the cases where the string contains "m" for minutes. Index returns the position of the first occurrence of the search string within the input. If the input does not contain "m", it returns 0, which evaluates as FALSE in the if statement.
/* Create some input data */
data temp;
input time : $20.;
datalines;
1.5h
30min
120min
4.25hour
;
run;
data temp2;
set temp;
/* Extract only the numeric part of the string and convert to numeric */
newTime = input(compress(time, ".","dk"), best9.);
/* Check if the string contains the letter "m" and if so divide by 60 */
if index(time, "m") then newTime = newTime / 60;
run;
proc print;
run;
There's probably a way to create a custom informat that would deal with this, which I expect Joe or one of the other regulars here can advise you on. However, failing that, here's a function-based approach:
data have;
input time_raw $1-50;
cards;
30 min
1.5 h
5 h
10 h
;
run;
data want;
set have;
if index(time_raw, 'min') then do;
minutes = input(substr(time_raw,1,length(time_raw) - 4), 8.);
hours = 0;
end;
else do;
hours = input(substr(time_raw, 1, length(time_raw) - 2), 8.);
minutes = 0;
end;
format time time.;
time = hms(hours, minutes, 0);
run;
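Since the question asks for the time in hours as a plain number, another option is to split each value into its number and its unit with scan. This is only a sketch, assuming the number and the unit are always separated by a space, as in the question's examples; the names want_hours, amount and unit are hypothetical.

```sas
data have;
  input time_raw $1-50;
  datalines;
30 min
1.5 h
5 h
10 h
;
run;

data want_hours;
  set have;
  /* First word is the number, second word is the unit */
  amount = input(scan(time_raw, 1, ' '), ?? best12.);
  unit = lowcase(scan(time_raw, 2, ' '));
  /* =: compares only the leading character, so 'min' and 'm' both count as minutes */
  if unit =: 'm' then hours = amount / 60;
  else hours = amount;
  drop amount unit;
run;
```

For the four sample rows this gives hours of 0.5, 1.5, 5 and 10.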