I have a list of customer addresses in a dataset where I am trying to locate the Country of residence, for example: NEWSOUTHWALESAUSTRALIA could be indexed to report the country as Australia. I am trying to use the do loop approach to scan through the list of 252 Countries to relate the Country of residence from a dataset called address_format
The dataset test has the list of 252 Countries which have upcased & compressed, as has the field concat_address, so should no issues with differences in the text.
%macro counter;
%do ii = 1 %to 252;
data test;
set country_data (obs=&ii.);
call symput('New_upcase_country',trim(New_upcase_country));
country_new = compress(trim(country_two));
call symput('country_new',trim(country_new));
run;
data ADDRESS_FORMAT_NEW;
set ADDRESS_FORMAT;
length success $70.;
format success $70.;
if index(concat_address,"&country_new.") ge 1
then do ;
country="&country_new.";
end;
run;
%end;
%mend;
%counter;
For some odd reason If I manually programme if index(concat_address,'AUSTRALIA'), I get results, but inside the macro the results are blank.
Is there something obvious I am missing that is preventing the success of the country index?
The obs= option can be thought of as lastobs= (there is no lastobs option).
The option for skipping the first n-1 observations is firstobs=
This example will yield 4 rows (8-5+1)
data class;
set sashelp.class (firstobs=5 obs=8);
run;
So you want
firstobs=&ii obs=&ii, or
firstobs=&ii and a STOP;RUN; to prevent going to row &ii+1 and beyond.
Despite the above answer, I would recommend switching to a no macro approach that does all 252 checks in one data step (versus one step per check). There are numerous ways of doing such, here is one way that does not use arrays or hashes
For example:
data have;
input;
text = _infile_;
datalines;
BONGOBONGOAUSTRALIA
SOMEWHERE IN CHINA
CANADA USA
TIBET LANE, NORWAY
run;
data countries;
length name $50;
input;
name = _infile_;
datalines;
AUSTRALIA
GERMANY
UNITED STATES
TIBET
NORWAY
run;
Output only first match. An important code feature is using point= and nobs= options on the set statement inside the do loop.
data want;
set have;
do index = 1 to check_count until (found);
set countries point=index nobs=check_count;
found = index(text,trim(name));
if found then matched_country = name;
end;
run;
Output all matches
data want (keep=text matched_country);
set have;
do index = 1 to check_count;
set countries point=index nobs=check_count;
found = index(text,trim(name));
if found then do;
found_count = sum(found_count,1);
matched_country = name;
output;
end;
end;
if not found_count > 0 then do;
matched_country = '** NO MATCH **';
output;
end;
run;
Suppose the dataset has three columns
Date Region Price
01-03 A 1
01-03 A 2
01-03 B 3
01-03 B 4
01-03 A 5
01-04 B 4
01-04 B 6
01-04 B 7
I try to get the lead price by date and region through following code.
data want;
set have;
by _ric date_l_;
do until (eof);
set have(firstobs=2 keep=price rename=(price=lagprice)) end=eof;
end;
if last.date_l_ then call missing(lagprice);
run;
However, the WANT only have one observations. Then I create new_date=date and try another code:
data want;
set have nobs=nobs;
do _i = _n_ to nobs until (new_date ne Date);
if eof1=0 then
set have (firstobs=2 keep=price rename=(price=leadprice)) end=eof1;
else leadprice=.;
end;
run;
With this code, SAS is working slowly. So I think this code is also not appropriate. Could anyone give some suggestions? Thanks
Try sorting by the variables you want lead price for then set together twice:
data test;
length Date Region $12 Price 8 ;
input Date $ Region $ Price ;
datalines;
01-03 A 1
01-03 A 2
01-03 B 3
01-03 B 4
01-03 A 5
01-04 B 4
01-04 B 6
01-04 B 7
;
run;
** sort by vars you want lead price for **;
proc sort data = test;
by DATE REGION;
run;
** set together twice -- once for lead price and once for all variables **;
data lead_price;
set test;
by DATE REGION;
set test (firstobs = 2 keep = PRICE rename = (PRICE = LEAD_PRICE))
test (obs = 1 drop = _ALL_);
if last.DATE or last.REGION then do;
LEAD_PRICE = .;
end;
run;
You can use proc expand to generate leads on numeric variables by group. Try the following method instead:
Step 1: Sort by Region, Date
proc sort data=have;
by Region Date;
run;
Step 2: Create a new ID variable to denote observation numbers
Because you have multiple values per date per region, we need to generate a new ID variable so that proc expand uses lead by observation number rather than by date.
data have2;
set have;
_ID_ = _N_;
run;
Step 3: Run proc expand by region with the lead transformation
lead will do exactly as it sounds. You can lead by as many values as you'd like, as long as the data supports it. In this case, we are leading by one observation.
proc expand data=have2
out=want;
by Region;
id _ID_;
convert Price = Lead_Price / transform=(lead 1) ;
run;
I am trying to extract all the Time occurrences for only the recent visit. Can someone help me with the code please.
Here is my data:
Obs Name Date Time
1 Bob 2017090 1305
2 Bob 2017090 1015
3 Bob 2017081 0810
4 Bob 2017072 0602
5 Tom 2017090 1300
6 Tom 2017090 1010
7 Tom 2017090 0805
8 Tom 2017072 0607
9 Joe 2017085 1309
10 Joe 2017081 0815
I need the output as:
Obs Name Date Time
1 Bob 2017090 1305,1015
2 Tom 2017090 1300,1010,0805
3 Joe 2017085 1309
Right now my code is designed to give me only one recent entry:
DATA OUT2;
SET INP1;
BY DATE;
IF FIRST.DATE THEN OUTPUT OUT2;
RETURN;
I would first sort the data by name and date. Then I would transpose and process the results.
proc sort data=have;
by name date;
run;
proc transpose data=have out=temp1;
by name date;
var value;
run;
data want;
set temp1;
by name date;
if last.name;
format value $2000.;
value = catx(',',of col:);
drop col: _name_;
run;
You may want to further process the new VALUE to remove excess commas (,) and missing value .'s.
Very similar to the question yesterday from another user, you can use quite a few solutions here.
SQL again is the easiest; this is not valid ANSI SQL and pretty much only SAS supports this, but it does work in SAS:
proc sql;
select name, date, time
from have
group by name
having date=max(date);
quit;
Even though date and time are not on the group by it's legal in SAS to put them on the select, and then SAS automatically merges (inner joins) the result of select name, max(date) from have group by name having date=max(date) to the original have dataset, returning multiple rows as needed. Then you'd want to collapse the rows, which I leave as an exercise for the reader.
You could also simply generate a table of maximum dates using any method you choose and then merge yourself. This is probably the easiest in practice to use, in particular including troubleshooting.
The DoW loop also appeals here. This is basically the precise SAS data step implementation of the SQL above. First iterate over that name, figure out the max, then iterate again and output the ones with that max.
proc sort data=have;
by name date;
run;
data want;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
max_Date = max(max_date,date);
end;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
if date=max_date then output;
end;
run;
Of course here you more easily collapse the rows, too:
data want;
length timelist $1024;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
max_Date = max(max_date,date);
end;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
if date=max_date then timelist=catx(',',timelist,time);
if last.name then output;
end;
run;
If the data is sorted then just retain the first date so you know which records to combine and output.
proc sort data=have ;
by name descending date time;
run;
data want ;
set have ;
by name descending date ;
length timex $200 ;
retain start timex;
if first.name then do;
start=date;
timex=' ';
end;
if date=start then do;
timex=catx(',',timex,time);
if last.date then do;
output;
call missing(start,timex);
end;
end;
drop start time ;
rename timex=time ;
run;
Thank you who will be able to help me. I've got a dataset as below:
data smp;
infile datalines dlm=',';
informat identifier $7. trx_date $9. transaction_id $13. product_description $50. ;
input identifier $ trx_date transaction_id $ product_description $ ;
datalines;
Cust1,11Aug2016,20-0030417313,ONKEN BIOPOT F/FREE STRAWBERRY
Cust1,11Aug2016,20-0030417313,ONKEN BIOPOT F/FREE STRAWBERRY
Cust1,11Aug2016,20-0030417313,ONKEN BIOPOT FULL STRAWB/GRAIN
Cust1,11Aug2016,20-0030417313,RACHELS YOG GREEK NAT F/F/ORG
Cust1,03Nov2016,23-0040737060,RACHELS YOG GREEK NAT F/F/ORG
Cust3,13Feb2016,39-0070595440,COLLECT YOG LEMON
Cust3,21Jun2016,34-0050769524,AF YOG FARMHOUSE STRAWB/REDCUR
Cust3,21Jun2016,34-0050769524,Y/VALLEY GREEK HONEY ORGANIC
Cust3,21Jun2016,34-0050769524,Y/VALLEY THICK LEMON CURD ORG
Cust3,21Jun2016,34-0050769524,Y/VALLEY THICK YOG FRUITY FAVS
Cust3,21Jun2016,34-0050769524,Y/VALLEY THICK YOG STRAWB ORG
Cust3,26Jun2016,39-0430106897,TOTAL GREEK YOGURT 0%
Cust3,14Aug2016,54-0040266755,M/BUNCH SQUASHUMS STRAW/RASP
Cust3,14Aug2016,54-0040266755,MULLER CORNER STRAWBERRY
Cust3,14Aug2016,54-0040266755,TOTAL GREEK YOGURT 0%
Cust3,22Aug2016,54-0050447336,M/BUNCH SQUASHUMS STRAW/RASP
;
For each customers (and each of their purchase based on transaction_id), i'm wanting to flag each product that will be repurchased during their next visit (only their next visit) on a rolling basis. So in the above dataset, correct flags would be on rows 4,12 and 13 because these products are bought on the next customer visit (we only look at the next visit).
I'm trying to do it with the following program:
proc sort data = smp out = td;
by descending identifier transaction_id product_description;
run;
DATA TD2(DROP=tmp_product);
SET td;
BY identifier transaction_id product_description;
RETAIN tmp_product;
IF FIRST.product_description and first.transaction_id THEN DO;
tmp_product = product_description;
END;
ATTRIB repeat_flag FORMAT=$1.;
IF NOT FIRST.product_description THEN DO;
IF tmp_product EQ product_description THEN repeat_flag ='Y';
ELSE repeat_flag = 'N';
END;
RUN;
proc sort data = td2;
by descending identifier transaction_id product_description;
run;
But it's not working? if someone could pse help it would be fab.
Best Wishes
Other method is to produce a dummy group in original dataset and temporary dataset. In original dataset, group is sequenced by visit time per customer, in temporary dataset, group is sequenced from beginning of SECOND visit time per customer, group number in temporary dataset is the same as group number of original dataset, but its visit time is next visit of original dataset. With the dummy group, it is easy to find the same product that was repurchased during their next visit by hash table.
proc sort data=smp;
by identifier trx_date;
run;
data have(drop=_group) temp(drop=group rename=(_group=group));
set smp;
by identifier trx_date;
if first.identifier then do;
group=1; _group=0;
end;
if dif(trx_date)>0 then do;
group+1; _group+1;
end;
if _group^=0 then output temp;
output have;
run;
data want;
if 0 then set temp;
if _n_=1 then do;
declare hash h(dataset:'temp');
h.definekey('identifier','group','product_description');
h.definedata('product_description');
h.definedone();
end;
set have;
flag=(h.find()=0);
drop group;
run;
The method below will "look ahead" to the next row (opposite to LAG) after sorting so you can bring comparisons onto the same row for simple logic:
** convert character date to numeric **;
data smp1; set smp;
TRX_DATE_NUM = input(TRX_DATE,ANYDTDTE10.);
format TRX_DATE_NUM mmddyy10.;
run;
** sort **;
proc sort data = smp1;
by IDENTIFIER PRODUCT_DESCRIPTION TRX_DATE_NUM;
run;
** look ahead at the next observations and use logic to identify flags **;
data look_ahead;
set smp1;
by IDENTIFIER;
set smp1 (firstobs = 2
keep = IDENTIFIER PRODUCT_DESCRIPTION TRX_DATE_NUM
rename = (IDENTIFIER = NEXT_ID PRODUCT_DESCRIPTION = NEXT_PROD TRX_DATE_NUM = NEXT_DT))
smp1 (obs = 1 drop = _ALL_);
if last.IDENTIFIER then do;
NEXT_ID = "";
NEXT_PROD = "";
NEXT_DT = .;
end;
run;
** logic says if the next row is the same customer who bought the same product on a different date then flag **;
data look_ahead_final; set look_ahead;
if IDENTIFIER = NEXT_ID and NEXT_PROD = PRODUCT_DESCRIPTION and TRX_DATE_NUM ne NEXT_DT then FLAG = 1;
else FLAG = 0;
run;
There are a few ways to do this; I think the simplest to understand, while still having a reasonable level of performance, is to sort the data in descending date order and then use an array to store the product_descriptions of the last trx_date.
Here I use a 2 dimensional array where the first dimension is just a 1/2 value; each trx_date simultaneously loads one row of the array and checks against the other row of the array (using _array_switch to determine which is being loaded/checked).
You could do the same thing with a hash table, and it would be appreciably faster along with perhaps a bit less complicated in some ways; if you are familiar with hash tables and want to see that solution comment and I or someone else can provide it.
You also could use SQL to do this, and I suspect that is the most common solution overall, but I couldn't quite get it to work, as it has some complexity with subqueries within subqueries the way I was approaching it, and I'm apparently not good enough with those.
Here's the array solution. Set the second dimension of prods to a reasonable maximum for your data - it could even be thousands, this is a temporary array and does not use much memory so set to 32000 or whatever would not be a big deal.
proc sort data=smp;
by identifier descending trx_date ;
run;
data want;
array prods[2,20] $255. _temporary_;
retain _array_switch 2;
do _n_ = 1 by 1 until (last.trx_date);
set smp;
by identifier descending trx_date;
/* for first row for an identifier, clear out the whole thing */
if first.identifier then do;
call missing(of prods[*]);
end;
/* for first row of a trx_date, clear out the array-row we were looking at last time, and switch _array_switch to the other value */
if first.trx_date then do;
do _i = 1 to dim(prods,2);
if missing(prods[_array_switch,_i]) then leave;
call missing(prods[_array_switch,_i]);
end;
_array_switch = 3-_array_switch;
end;
*now check the array to see if we should set next_trans_flag;
next_trans_flag='N';
do _i = 1 to dim(prods,2);
if missing(prods[_array_switch,_i]) then leave; *for speed;
if prods[_array_switch,_i] = product_description then next_trans_flag='Y';
end;
prods[3-_array_switch,_n_] = product_description; *set for next trx_date;
output;
end;
drop _:;
run;
I think to really answer this you need to generate a list of distinct visit*product combinations. And also a list of the distinct products bought on particular visits.
proc sql noprint ;
create table bought as
select distinct identifier, product_description, trx_date, transaction_id
from smp
order by 1,2,3,4
;
create table all_visits as
select a.identifier, product_description, trx_date, transaction_id
from (select distinct identifier,product_description from bought) a
natural join (select distinct identifier,transaction_id,trx_date from bought) b
order by 1,2,3,4
;
quit;
You can then combine them and make a flag for whether the product was bought on that visit.
data check ;
merge all_visits bought(in=in1) ;
by identifier product_description trx_date transaction_id ;
bought=in1;
run;
You can now use a lead technique to figure out if the they also bought the product on the next visit.
data flag ;
set check ;
by identifier product_description trx_date transaction_id ;
set check(firstobs=2 keep=bought rename=(bought=bought_next)) check(drop=_all_ obs=1);
if last.product_description then bought_next=0;
run;
You can then combine back with the actual purchases and eliminate the extra dummy records.
proc sort data=smp;
by identifier product_description trx_date transaction_id ;
run;
data want ;
merge flag smp (in=in1);
by identifier product_description trx_date transaction_id ;
if in1 ;
run;
Let's put the records back into the original order so we can check the results.
proc sort; by row; run;
proc print; run;
I have a list of financial advisors and I need to pull 4 samples per advisor but catch is in those 4 samples I need to force 2 mortgages, 1 loan, 1 credit card lets say.
Is there a way in the Survey select statement to set the specific number of samples to pull per stratum? I know you can stratify on 1 category and set it as a equal number. I was hoping I could use a mapping of employee names + the number of samples left to pull for each category and have survey select utilize that to pull in a dynamic way.
I'm using this as an example but this only stratifies on employee first and gives me 4 per employee. I would need to further stratify on Product type and set that to a specific sample size per product.
proc surveyselect data=work.Emp_Table_Final
method=srs n=4 out=work.testsample SELECTALL;
strata Employee_No;
run;
Thanks i know it might sound complicated, but if i know its possible then i can google the rest
Yes, you can have a dataset be the target of the n option. That dataset must:
Contain the strata variables as well as a variable SAMPSIZE or _NSIZE_ with the number to select
Have the same type and length as the strata variables
Be sorted by the strata variables
Have an entry for every strata variable value
See the documentation for more details.
data sample_counts;
length sex $1;
input sex $ _NSIZE_;
datalines;
F 5
M 3
;;;;
run;
proc sort data=sashelp.class out=class;
by sex;
run;
proc surveyselect n=sample_counts method=srs out=samples data=class;
strata sex;
run;
For two variables it's the same, you just need two variables in the sample_counts. Of course it makes it a lot more complicated, and you may want to produce this in an automated fashion.
proc sort data=sashelp.class out=class;
by sex age;
run;
data sample_counts;
length sex $1;
input sex $ age _NSIZE_;
datalines;
F 11 1
F 12 1
F 13 1
F 14 1
F 15 1
M 11 1
M 12 1
M 13 1
M 14 1
M 15 1
M 16 0
;;;;
run;
/* or do it in an automated way*/
data sample_counts;
set class;
by sex age; *your strata;
if first.age then do; *do this once per stratum level;
if age le 15 then _NSIZE_ = 1; *whatever your logic is for defining _NSIZE_;
else _NSIZE_=0;
output;
end;
run;
proc surveyselect n=sample_counts method=srs out=samples data=class;
strata sex age;
run;