SAS locate country from the address string?

I have a list of customer addresses in a dataset and am trying to locate the country of residence; for example, NEWSOUTHWALESAUSTRALIA should be indexed to report the country as Australia. I am trying to use a do loop to scan through a list of 252 countries and assign the country of residence in a dataset called address_format.
The dataset test holds the list of 252 countries, which have been upcased and compressed, as has the field concat_address, so there should be no issues with differences in the text.
%macro counter;
%do ii = 1 %to 252;
data test;
set country_data (obs=&ii.);
call symput('New_upcase_country',trim(New_upcase_country));
country_new = compress(trim(country_two));
call symput('country_new',trim(country_new));
run;
data ADDRESS_FORMAT_NEW;
set ADDRESS_FORMAT;
length success $70.;
format success $70.;
if index(concat_address,"&country_new.") ge 1
then do ;
country="&country_new.";
end;
run;
%end;
%mend;
%counter;
For some odd reason, if I manually programme if index(concat_address,'AUSTRALIA'), I get results, but inside the macro the results are blank.
Is there something obvious I am missing that prevents the country index from succeeding?

The obs= option can be thought of as lastobs= (there is no lastobs option).
The option for skipping the first n-1 observations is firstobs=
This example will yield 4 rows (8-5+1)
data class;
set sashelp.class (firstobs=5 obs=8);
run;
So you want
firstobs=&ii obs=&ii, or
firstobs=&ii and a STOP;RUN; to prevent going to row &ii+1 and beyond.
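As a minimal sketch of the first suggestion, the macro below reads exactly one row per call. It uses sashelp.class so it can run anywhere; the macro name show_row and the macro variable picked_name are made up for illustration and are not part of the original code.

```sas
%macro show_row(ii);
/* firstobs= and obs= together select exactly row &ii */
data _null_;
set sashelp.class (firstobs=&ii obs=&ii);
/* picked_name is a hypothetical macro variable holding that row's value */
call symputx('picked_name', name);
run;
%put Row &ii of sashelp.class has name &picked_name;
%mend show_row;
%show_row(5)
```

In the original macro, the same pair of options would go on the set country_data statement.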
Despite the above answer, I would recommend switching to a non-macro approach that does all 252 checks in one data step (versus one step per check). There are numerous ways of doing this; here is one that does not use arrays or hashes.
For example:
data have;
input;
text = _infile_;
datalines;
BONGOBONGOAUSTRALIA
SOMEWHERE IN CHINA
CANADA USA
TIBET LANE, NORWAY
;
run;
data countries;
length name $50;
input;
name = _infile_;
datalines;
AUSTRALIA
GERMANY
UNITED STATES
TIBET
NORWAY
;
run;
Output only the first match. An important feature of the code is the use of the point= and nobs= options on the set statement inside the do loop.
data want;
set have;
do index = 1 to check_count until (found);
set countries point=index nobs=check_count;
found = index(text,trim(name));
if found then matched_country = name;
end;
run;
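To see the point= and nobs= options in isolation, here is a small sketch that walks sashelp.class by row number. The variable names i and n are arbitrary. Unlike the step above, there is no sequential set statement to end this step, so an explicit stop is needed.

```sas
data _null_;
/* nobs= puts the row count into n at compile time, so it can be used as the loop bound */
do i = 1 to n;
/* point= fetches row number i directly (random access) */
set sashelp.class point=i nobs=n;
put i= name=;
end;
/* required: a step that only reads with point= never reaches end of file */
stop;
run;
```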
Output all matches
data want (keep=text matched_country);
set have;
do index = 1 to check_count;
set countries point=index nobs=check_count;
found = index(text,trim(name));
if found then do;
found_count = sum(found_count,1);
matched_country = name;
output;
end;
end;
if not (found_count > 0) then do;
matched_country = '** NO MATCH **';
output;
end;
run;

Related

Combine one column's values into a single string

This might sound awkward but I do have a requirement to be able to concatenate all the values of a char column from a dataset, into one single string. For example:
data person;
input attribute_name $ dept $;
datalines;
John Sales
Mary Acctng
skrill Bish
;
run;
Result : test_conct = "JohnMaryskrill"
The number of rows in the input dataset can vary.
So, I tried the code below, but it errors out when the length of the combined string (samplkey) exceeds 32K.
DATA RECKEYS(KEEP=test_conct);
length samplkey $32767;
do until(eod);
SET person END=EOD;
if lengthn(attribute_name) > 0 then do;
test_conct = catt(test_conct, strip(attribute_name));
end;
end;
output; stop;
run;
Can anyone suggest a better way to do this, maybe by breaking the column down into macro variables in chunks of 32K?
Regards
It would help a lot if you indicated what you're trying to do, but a quick method is to use SQL:
proc sql NOPRINT;
select name into :name_list separated by ""
from sashelp.class;
quit;
%put &name_list.;
As you've indicated, macro variables do have a size limit (64K characters in most installations now). Depending on what you're doing, a better method may be to build a macro that puts the entire list wherever it needs to go dynamically, but you would need to explain the usage for anyone to suggest that option. This answers your question as posted.
Try this, using the VARCHAR data type. If you're on an older version of SAS this may not work.
data _null_;
set sashelp.class(keep = name) end=eof;
length long_var varchar(1000000);
length want $256.;
retain long_var;
long_var = catt(long_var, name);
if eof then do;
want = md5(long_var);
put want;
end;
run;
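If the combined string can exceed the 32,767-character limit of a regular character variable, one option in the spirit of the "chunks" idea from the question is to spread the result over several rows. This is only a sketch: it reads the person dataset from the question, and the names chunks, chunk_id and chunk are made up for illustration.

```sas
data chunks (keep=chunk_id chunk);
length chunk $32767;
/* keep the running string and its sequence number across rows */
retain chunk_id 1 chunk;
set person end=eod;
/* write out the current chunk and start a new one when the next value would not fit */
if lengthn(chunk) + lengthn(attribute_name) > 32767 then do;
output;
chunk_id = chunk_id + 1;
chunk = '';
end;
chunk = cats(chunk, attribute_name);
/* write the final, possibly partial, chunk */
if eod then output;
run;
```

Each row of chunks could then be loaded into its own macro variable if that is what the downstream code needs.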

All values for only most recent occurrence

I am trying to extract all the Time values for only the most recent visit. Can someone help me with the code, please?
Here is my data:
Obs Name Date Time
1 Bob 2017090 1305
2 Bob 2017090 1015
3 Bob 2017081 0810
4 Bob 2017072 0602
5 Tom 2017090 1300
6 Tom 2017090 1010
7 Tom 2017090 0805
8 Tom 2017072 0607
9 Joe 2017085 1309
10 Joe 2017081 0815
I need the output as:
Obs Name Date Time
1 Bob 2017090 1305,1015
2 Tom 2017090 1300,1010,0805
3 Joe 2017085 1309
Right now my code is designed to give me only one recent entry:
DATA OUT2;
SET INP1;
BY DATE;
IF FIRST.DATE THEN OUTPUT OUT2;
RETURN;
I would first sort the data by name and date. Then I would transpose and process the results.
proc sort data=have;
by name date;
run;
proc transpose data=have out=temp1;
by name date;
var time; /* the variable to collapse; TIME in the sample data */
run;
data want;
set temp1;
by name date;
if last.name;
format value $2000.;
value = catx(',',of col:);
drop col: _name_;
run;
You may want to process the new VALUE variable further to remove excess commas (,) and the periods (.) that stand for missing values.
Very similar to the question yesterday from another user; there are quite a few solutions you can use here.
SQL again is the easiest. This is not valid ANSI SQL, and pretty much only SAS supports it, but it does work in SAS:
proc sql;
select name, date, time
from have
group by name
having date=max(date);
quit;
Even though date and time are not in the GROUP BY, it is legal in SAS to put them in the SELECT. SAS then automatically remerges (inner joins) the result of select name, max(date) from have group by name having date=max(date) with the original have dataset, returning multiple rows as needed. Then you'd want to collapse the rows, which I leave as an exercise for the reader.
You could also simply generate a table of maximum dates using any method you choose and then do the merge yourself. In practice this is probably the easiest approach to use, particularly when it comes to troubleshooting.
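A minimal sketch of that "build the maximum dates, then merge" idea, using only sorts and data steps. It assumes have contains name, date and time as in the question; the dataset names sorted, max_dates and latest and the flag is_latest are made up for illustration.

```sas
proc sort data=have out=sorted;
by name date;
run;
/* one row per name: after the sort, the last row of each name carries the latest date */
data max_dates (keep=name date);
set sorted;
by name;
if last.name;
run;
/* keep only the rows of each name that fall on that latest date */
data latest;
merge sorted max_dates (in=is_latest);
by name date;
if is_latest;
run;
```

Because max_dates is a separate table, it can be printed and checked on its own before the merge, which is where the troubleshooting benefit comes from.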
The DoW loop also appeals here. This is basically the precise SAS data step implementation of the SQL above. First iterate over that name, figure out the max, then iterate again and output the ones with that max.
proc sort data=have;
by name date;
run;
data want;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
max_Date = max(max_date,date);
end;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
if date=max_date then output;
end;
run;
Of course, here you can more easily collapse the rows, too:
data want;
length timelist $1024;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
max_Date = max(max_date,date);
end;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
if date=max_date then timelist=catx(',',timelist,time);
if last.name then output;
end;
run;
If the data is sorted then just retain the first date so you know which records to combine and output.
proc sort data=have ;
by name descending date time;
run;
data want ;
set have ;
by name descending date ;
length timex $200 ;
retain start timex;
if first.name then do;
start=date;
timex=' ';
end;
if date=start then do;
timex=catx(',',timex,time);
if last.date then do;
output;
call missing(start,timex);
end;
end;
drop start time ;
rename timex=time ;
run;

Flagging values based on subsequent occurrences using first. retain etc

Thank you to whoever is able to help me. I've got a dataset as below:
data smp;
infile datalines dlm=',';
informat identifier $7. trx_date $9. transaction_id $13. product_description $50. ;
input identifier $ trx_date transaction_id $ product_description $ ;
datalines;
Cust1,11Aug2016,20-0030417313,ONKEN BIOPOT F/FREE STRAWBERRY
Cust1,11Aug2016,20-0030417313,ONKEN BIOPOT F/FREE STRAWBERRY
Cust1,11Aug2016,20-0030417313,ONKEN BIOPOT FULL STRAWB/GRAIN
Cust1,11Aug2016,20-0030417313,RACHELS YOG GREEK NAT F/F/ORG
Cust1,03Nov2016,23-0040737060,RACHELS YOG GREEK NAT F/F/ORG
Cust3,13Feb2016,39-0070595440,COLLECT YOG LEMON
Cust3,21Jun2016,34-0050769524,AF YOG FARMHOUSE STRAWB/REDCUR
Cust3,21Jun2016,34-0050769524,Y/VALLEY GREEK HONEY ORGANIC
Cust3,21Jun2016,34-0050769524,Y/VALLEY THICK LEMON CURD ORG
Cust3,21Jun2016,34-0050769524,Y/VALLEY THICK YOG FRUITY FAVS
Cust3,21Jun2016,34-0050769524,Y/VALLEY THICK YOG STRAWB ORG
Cust3,26Jun2016,39-0430106897,TOTAL GREEK YOGURT 0%
Cust3,14Aug2016,54-0040266755,M/BUNCH SQUASHUMS STRAW/RASP
Cust3,14Aug2016,54-0040266755,MULLER CORNER STRAWBERRY
Cust3,14Aug2016,54-0040266755,TOTAL GREEK YOGURT 0%
Cust3,22Aug2016,54-0050447336,M/BUNCH SQUASHUMS STRAW/RASP
;
For each customer (and each of their purchases, based on transaction_id), I want to flag each product that will be repurchased during their next visit (only their next visit) on a rolling basis. So in the above dataset, the correct flags would be on rows 4, 12 and 13, because these products are bought on the customer's next visit (we only look at the next visit).
I'm trying to do it with the following program:
proc sort data = smp out = td;
by descending identifier transaction_id product_description;
run;
DATA TD2(DROP=tmp_product);
SET td;
BY identifier transaction_id product_description;
RETAIN tmp_product;
IF FIRST.product_description and first.transaction_id THEN DO;
tmp_product = product_description;
END;
ATTRIB repeat_flag FORMAT=$1.;
IF NOT FIRST.product_description THEN DO;
IF tmp_product EQ product_description THEN repeat_flag ='Y';
ELSE repeat_flag = 'N';
END;
RUN;
proc sort data = td2;
by descending identifier transaction_id product_description;
run;
But it's not working. If someone could please help, it would be fab.
Best Wishes
Another method is to create a dummy group number in both the original dataset and a temporary dataset. In the original dataset, the group numbers the visits in sequence for each customer. In the temporary dataset, the numbering starts from the customer's SECOND visit, so a given group number in the temporary dataset refers to the visit that follows the same-numbered visit in the original dataset. With this dummy group, a hash table makes it easy to find products that were repurchased on the next visit.
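/* trx_date is character in smp; convert it to a numeric SAS date so it sorts chronologically */
data smp_num;
set smp;
trx_dt = input(trx_date, date9.);
format trx_dt date9.;
run;
proc sort data=smp_num;
by identifier trx_dt;
run;
data have(drop=_group) temp(drop=group rename=(_group=group));
set smp_num;
by identifier trx_dt;
/* restart the counters for each customer */
if first.identifier then do;
group=0; _group=-1;
end;
/* each new visit date advances both counters: group starts at 1, _group at 0 */
if first.trx_dt then do;
group+1; _group+1;
end;
if _group^=0 then output temp;
output have;
run;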
data want;
if 0 then set temp;
if _n_=1 then do;
declare hash h(dataset:'temp');
h.definekey('identifier','group','product_description');
h.definedata('product_description');
h.definedone();
end;
set have;
flag=(h.find()=0);
drop group;
run;
The method below will "look ahead" to the next row (opposite to LAG) after sorting so you can bring comparisons onto the same row for simple logic:
** convert character date to numeric **;
data smp1; set smp;
TRX_DATE_NUM = input(TRX_DATE,ANYDTDTE10.);
format TRX_DATE_NUM mmddyy10.;
run;
** sort **;
proc sort data = smp1;
by IDENTIFIER PRODUCT_DESCRIPTION TRX_DATE_NUM;
run;
** look ahead at the next observations and use logic to identify flags **;
data look_ahead;
set smp1;
by IDENTIFIER;
set smp1 (firstobs = 2
keep = IDENTIFIER PRODUCT_DESCRIPTION TRX_DATE_NUM
rename = (IDENTIFIER = NEXT_ID PRODUCT_DESCRIPTION = NEXT_PROD TRX_DATE_NUM = NEXT_DT))
smp1 (obs = 1 drop = _ALL_);
if last.IDENTIFIER then do;
NEXT_ID = "";
NEXT_PROD = "";
NEXT_DT = .;
end;
run;
** logic says if the next row is the same customer who bought the same product on a different date then flag **;
data look_ahead_final; set look_ahead;
if IDENTIFIER = NEXT_ID and NEXT_PROD = PRODUCT_DESCRIPTION and TRX_DATE_NUM ne NEXT_DT then FLAG = 1;
else FLAG = 0;
run;
There are a few ways to do this; I think the simplest to understand, while still having a reasonable level of performance, is to sort the data in descending date order and then use an array to store the product_descriptions of the last trx_date.
Here I use a 2 dimensional array where the first dimension is just a 1/2 value; each trx_date simultaneously loads one row of the array and checks against the other row of the array (using _array_switch to determine which is being loaded/checked).
You could do the same thing with a hash table, and it would be appreciably faster along with perhaps a bit less complicated in some ways; if you are familiar with hash tables and want to see that solution comment and I or someone else can provide it.
You also could use SQL to do this, and I suspect that is the most common solution overall, but I couldn't quite get it to work, as it has some complexity with subqueries within subqueries the way I was approaching it, and I'm apparently not good enough with those.
Here's the array solution. Set the second dimension of prods to a reasonable maximum for your data. It could even be in the thousands; this is a temporary array and does not use much memory, so setting it to 32000 or so would not be a big deal.
proc sort data=smp;
by identifier descending trx_date ;
run;
data want;
array prods[2,20] $255. _temporary_;
retain _array_switch 2;
do _n_ = 1 by 1 until (last.trx_date);
set smp;
by identifier descending trx_date;
/* for first row for an identifier, clear out the whole thing */
if first.identifier then do;
call missing(of prods[*]);
end;
/* for first row of a trx_date, clear out the array-row we were looking at last time, and switch _array_switch to the other value */
if first.trx_date then do;
do _i = 1 to dim(prods,2);
if missing(prods[_array_switch,_i]) then leave;
call missing(prods[_array_switch,_i]);
end;
_array_switch = 3-_array_switch;
end;
*now check the array to see if we should set next_trans_flag;
next_trans_flag='N';
do _i = 1 to dim(prods,2);
if missing(prods[_array_switch,_i]) then leave; *for speed;
if prods[_array_switch,_i] = product_description then next_trans_flag='Y';
end;
prods[3-_array_switch,_n_] = product_description; *set for next trx_date;
output;
end;
drop _:;
run;
I think to really answer this you need to generate a list of distinct visit*product combinations. And also a list of the distinct products bought on particular visits.
proc sql noprint ;
create table bought as
select distinct identifier, product_description, trx_date, transaction_id
from smp
order by 1,2,3,4
;
create table all_visits as
select a.identifier, product_description, trx_date, transaction_id
from (select distinct identifier,product_description from bought) a
natural join (select distinct identifier,transaction_id,trx_date from bought) b
order by 1,2,3,4
;
quit;
You can then combine them and make a flag for whether the product was bought on that visit.
data check ;
merge all_visits bought(in=in1) ;
by identifier product_description trx_date transaction_id ;
bought=in1;
run;
You can now use a lead technique to figure out whether they also bought the product on the next visit.
data flag ;
set check ;
by identifier product_description trx_date transaction_id ;
set check(firstobs=2 keep=bought rename=(bought=bought_next)) check(drop=_all_ obs=1);
if last.product_description then bought_next=0;
run;
You can then combine back with the actual purchases and eliminate the extra dummy records.
proc sort data=smp;
by identifier product_description trx_date transaction_id ;
run;
data want ;
merge flag smp (in=in1);
by identifier product_description trx_date transaction_id ;
if in1 ;
run;
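Let's put the records back into the original order so we can check the results. (The sort below presumably relies on a variable ROW holding the original observation number, for example row = _n_, having been added when smp was created.)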
proc sort; by row; run;
proc print; run;

SAS retrieving data from monthly datasets

I have 2 variables and 3 records in a sas data set, and based on the date field in that data set, I need to read different monthly data sets.
For example,
I have
item no. Date
1 30Jun2015
2 31Jul2015
3 31Aug2015
When I read the first record, then based on the date field (30jun2015) here, it should merge another dataset suffixed with 30jun2015 with this current dataset.
How can I achieve that?
I'll hazard a guess at what you're looking for; I've left a bit of a gap where you'll have to specify the criteria for your own merge.
1) Read in base data
data MAIN_DATA;
infile cards;
input ITEM_NO DATE:date9.;
format DATE date9.;
cards;
1 30JUN2015
2 31JUL2015
3 31AUG2015
;
run;
2) Store each date in a macro variable (date1, date2, and so on) and the number of dates in daten. This assumes ddmmyy6. is a suitable format for your table names.
Data _null_;
Set Main_data;
Call symputx('date'||strip(_n_),put(DATE,ddmmyy6.));
Call symputx('daten', _n_);
Run;
3) Loop over the macro variables and read the associated table. You haven't specified how to do the merge, so I'll leave that up to you.
%macro readin;
%do i = 1 %to &daten;
data NEW_TABLE_&&date&i..;
set TEST_&&date&i..; /*in this step you can merge on the original table however you intend to*/
run;
%end;
%mend readin;
%readin;
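The &&date&i.. reference in the macro relies on the macro processor scanning the text twice. A small sketch with hypothetical values (300615 is what ddmmyy6. would produce for 30JUN2015) shows the resolution outside the loop:

```sas
/* hypothetical values, standing in for what the data _null_ step would create */
%let date1 = 300615;
%let i = 1;
/* pass 1: && becomes &, and &i. becomes 1, leaving TEST_&date1. */
/* pass 2: &date1. resolves, so this should print TEST_300615 */
%put TEST_&&date&i..;
```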

SAS: Drop column in a if statement

I have a dataset called have with one entry with multiple variables that look like this:
message reference time qty price
x 101 35000 100 .
The above dataset changes on every iteration of a loop, and message can also be "A". If message="X", this means remove 100 qty from the MASTER set where the reference number equals the reference number in the MASTER database. The price is missing (.) because it is already in the MASTER database under reference=101. The MASTER database aggregates all the available orders at some price with the quantity available. If in the next iteration message="A", the have dataset would look like this:
message reference time qty price
A 102 35010 150 500
This means adding a new reference number to the MASTER database; in other words, appending the line to MASTER.
I have the following code in my loop to update the quantity in my MASTER database when there is a message X:
data b.master;
modify b.master have(where=(message="X")) updatemode=nomissingcheck;
by order_reference_number;
if _iorc_ = %sysrc(_SOK) then do;
replace;
end;
else if _iorc_ = %sysrc(_DSENMR) then do;
output;
_error_ = 0;
end;
else if _iorc_ = %sysrc(_DSEMTR) then do;
_error_ = 0;
end;
else if _iorc_ = %sysrc(_DSENOM) then do;
_error_ = 0;
end;
run;
I use replace to update the quantity. But since price is missing (.) in my entry when message is X, the above code sets price to missing where reference=101 in MASTER via the replace statement, which I don't want. Hence, I would prefer to delete the price column in the have dataset if message=X. But I don't want to delete the price column when message=A, since I use this code:
proc append base=MASTER data=have(where=(msg_type="A")) force;
run;
Hence, I have this code prior to my MODIFY statement:
data have(drop=price_alt);
set have; if message="X" then do;
output;end;
else do; /*I WANT TO MAKE NO CHANGE*/
end;run;
But it doesn't do what I want. If message is not equal to X, I don't want to drop the column; if it is equal to X, I want to drop it. How can I adapt the code above to make it work?
It's a bit of a strange request to be honest, such that it raises questions about whether what you're doing is the best way of doing it. However, in the spirit of answering the question...
The answer by DomPazz gives the option of splitting the data into two possible sets, but if you want code down the line to always refer to a specific data set, this creates its own complications.
You also can't, in one data step, tell SAS to output to the "same" data set where one instance has a column and one instance doesn't. So what you'd like, therefore, is for the code itself to be dynamic, so that the data step that runs is either one that drops the column or one that does not, depending on whether message=x. The answer to this need for dynamic code, like many things in SAS, comes down to the creative use of macros. It looks something like this:
/* Just making your input data set */
data have;
message='x';
time=35000;
qty=1000;
price=10.05;
price_alt=10.6;
run;
/* Writing the macro */
%macro solution;
%local id rc1 rc2;
%let id=%sysfunc(open(work.have));
%syscall set(id);
%let rc1=%sysfunc(fetchobs(&id, 1));
%let rc2=%sysfunc(close(&id));
%IF &message=x %THEN %DO;
data have(drop=price_alt);
set have;
run;
%END;
%ELSE %DO;
data have;
set have;
run;
%END;
%mend solution;
/* Running the macro */
%solution;
Try this:
data outX(drop=price_alt) outNoX;
set have;
if message = "X" then
output outX;
else
output outNoX;
run;
As @sasfrog says in the comments, a table either has a column or it does not. If you want to subset things where MESSAGE="X", then you can use something like this to create 2 data sets.