I have two SAS datasets with (assume for simplicity) one char variable in each. The first dataset has a variable with a company description (a messy field that sometimes includes the city and sometimes does not), and the second dataset has a variable listing all cities. I need to create a variable in the first dataset that shows whether any of the cities from the second dataset was found. The result should not be just a 0 or 1 flag; it should contain the city itself.
Is there an easy way to do it without looping INDEXW (or similar) functions?
What's wrong with indexw? Using proc sql and indexw allows a pretty straightforward solution.
Sample data:
data have_messy;
length messy $100;
messy = 'this is a city name: brisbane' ; output;
messy = 'this is a city name: sydney' ; output;
messy = 'this is a city name: melbourne'; output;
run;
data have_city;
length city $20;
city = 'sydney' ; output;
city = 'brisbane'; output;
run;
Example query:
proc sql noprint;
create table want as
select a.*,
b.city
from have_messy a
left join have_city b on indexw(a.messy, b.city)
;
quit;
Results:
messy                            city
===============================  =========
this is a city name: sydney      sydney
this is a city name: brisbane    brisbane
this is a city name: melbourne
Be careful: the above query can return multiple rows per row in table a if more than one city name is found in the same description. I suggest you run a follow-up step to handle any duplicate rows, depending on your requirements.
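One possible follow-up step, as a minimal sketch: it assumes that a single matched city per description is enough and that every value of messy is distinct in have_messy (otherwise identical descriptions would be collapsed too). The output name want_one_city is made up for illustration.

```sas
/* Keep one row per description.                              */
/* NODUPKEY keeps the first observation for each value of     */
/* messy; which matching city that is has not been controlled */
/* here, so add a second sort key first if that matters.      */
proc sort data=want out=want_one_city nodupkey;
  by messy;
run;
```

If every matching city has to be kept, leave the join output as it is and handle the extra rows downstream.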
I have 33 different datasets, each with one column, and all share the same variable name:
net_worth
I want to load the values into arrays and use them in a data step, but the array I use should depend on the BY group in the data step (country by city). There are 33 datasets and 33 BY groups in total, and each dataset corresponds to exactly one BY group.
Here is an example of what the BY groups look like in the dataset customers:
UK 105 (other fields)
UK 102 (other fields)
US 291 (other fields)
US 292 (other fields)
Could I get some advice on how to load the columns into arrays and then use them in a data step? Or would you suggest doing it another way?
%let var1 = uk105;
%let var2 = uk102;
.....
%let var33 = jk12;
data want;
set customers;
by country city;
if _n_ = 1 then do;
*set datasets and create and populate arrays*;
* use array values in calculations with fields from dataset customers, depending on which by group. if the by group is uk and city is 105 then i need to use the created array corresponding to that by group;
It is a little hard to understand what you want.
It sounds like you have one dataset named CUSTOMERS that has all of the main variables, and a bunch of single-variable datasets that hold the values of NET_WORTH for a lot of different things (countries?).
Assuming that the observations in all of the datasets are in the same order then I think you are asking for how to generate a data step like this:
data want;
set customers;
set uk105 (rename=(net_worth=uk105));
set uk102 (rename=(net_worth=uk102));
....
run;
The easiest way to produce those SET statements is probably to generate them with a data step:
filename code temp;
data _null_;
input name $32. ;
file code ;
put ' set ' name '(rename=(net_worth=' name '));' ;
cards;
uk105
uk102
;;;;
data want;
set customers;
%include code / source2;
run;
I have a table of customer purchases. The goal is to be able to pull summary statistics on the last 20 purchases for each customer and update them as each new order comes in. What is the best way to do this? Do I need a table for each customer? Keep in mind there are over 500 customers. Thanks.
This is asked at a high level, so I'll answer it at that level. If you want more detailed help, you'll want to give more detailed information, and make an attempt to solve the problem yourself.
In SAS, you have the BY statement available in every PROC or DATA step, as well as the CLASS statement, available in most PROCs. These both are useful for doing data analysis at a level below global. For many basic uses they give a similar result, although not in all cases; look up the particular PROC you're using to do your analysis for more detailed information.
Presumably, you'd create one table containing the twenty most recent records per customer, or even one view (a view is like a table, except it's not written to disk), and then run your analysis PROC with a BY statement on your customer ID variable. If you set it up as a view, you don't even have to rerun that part: you can create a permanent view pointing to your constantly updating data, and the subsetting to the last 20 records will happen every time you run the analysis PROC.
Yes, you can either add a rank to your existing table or create another table containing the last 20 purchases for each customer.
My recommendation is to use a data step to select the 20 most recent purchases per customer and then run your summary statistics. My code below creates a table called "WANT" with those purchases and a rank field.
Sample Data:
data have;
input id $ purchase_date amount;
informat purchase_date datetime19.;
format purchase_date datetime19.;
datalines;
cust01 21dec2017:12:12:30 234.57
cust01 23dec2017:12:12:30 2.88
cust01 24dec2017:12:12:30 4.99
cust02 21nov2017:12:12:30 34.5
cust02 23nov2017:12:12:30 12.6
cust02 24nov2017:12:12:30 14.01
;
run;
Sort the data by ID and, within each ID, by descending purchase date:
proc sort data=have ;
by id descending purchase_date ;
run;
Select the top 2 (change my 2 to 20 in your case):
/*Top 2*/
%let top=2;
data want (where=(Rank ne .));
set have;
by id;
retain i;
/*reset counter for top */
if first.id then do; i=1; end;
if i <= &top then do; Rank= &top+1-i; output; i=i+1;end;
drop i;
run;
Output: Last 2 Customer Purchases:
id=cust01 purchase_date=24DEC2017:12:12:30 amount=4.99 Rank=2
id=cust01 purchase_date=23DEC2017:12:12:30 amount=2.88 Rank=1
id=cust02 purchase_date=24NOV2017:12:12:30 amount=14.01 Rank=2
id=cust02 purchase_date=23NOV2017:12:12:30 amount=12.6 Rank=1
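The summary statistics themselves could then be pulled from WANT in one step. This is only a sketch: it assumes amount is the variable to summarise, and the choice of statistics is illustrative.

```sas
/* Summarise the retained purchases for each customer.   */
/* CLASS does not need sorted input; BY id would also    */
/* work here because WANT is already ordered by id.      */
proc means data=want n mean min max sum;
  class id;
  var amount;
run;
```

Rerunning the sort, the top-N data step and this PROC after new orders arrive refreshes the figures for all customers at once, so no per-customer table is needed.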
Suppose I have a table:
Name Age
Bob 4
Pop 5
Yoy 6
Bob 5
I want to delete all names, which are not unique in the table:
Name Age
Pop 5
Yoy 6
ATM, my solution is to make a new table with counts of unique names:
Name Count
Bob 2
Pop 1
Yoy 1
And then keep only the names whose Count = 1.
I believe there are much more beautiful solutions.
If I understand you correctly there are two ways to do it:
The SQL Procedure
In SAS you may not need to use a summarisation function such as MIN() as I have here, but when there is only one of name then min(age) = age anyway, and when migrating this to another RDBMS (e.g. Oracle, SQL Server) it may be required:
proc sql;
create table want as
select name, min(age) as age
from have
group by name
having count(*) = 1;
quit;
Data Step
Requires the data to be pre-sorted:
proc sort data=have out=have_stg;
by name;
run;
When doing SAS data-step by group processing, the first. (first-dot) and last. (last-dot) variables are generated which denote whether the current observation is the first and/or last in the by-group. Using SAS conditional logic one can simply test if first.name = 1 and last.name = 1. Reducing this using logical shorthand becomes:
data want;
set have_stg;
by name;
if first.name and last.name;
/* Equivalent to:*/
*if first.name = 1 and last.name = 1;
run;
I left both versions in the code above; use whichever version you find more readable.
You can use proc sort with the nouniquekey option. Then use uniqueout= to output the unique values and out= to output the duplicates (the out= option is necessary if you don't want to overwrite your original dataset).
proc sort data = have nouniquekey uniqueout = unique out = dups;
by name;
run;
Excuse me for the vague title, but I really don't know else how to word it.
How do I get from this...
NAME      ID NO.   CITY       SCORE
Name 1    222      New York   27
Name 1    222      New York   58
Name 1    222      New York   71
Name 2    333      LA         12
Name 2    333      LA         92
Name 2    333      LA         08
To This?
NAME      ID NO.   CITY       WORST SCORE
Name 1    222      New York   27
Name 2    333      LA         08
I'd like to see the solution in both PROC SQL and Data step, thanks.
For a data step you could try something like this
proc sort data=mydata; by name id_no score; run;
data worst;
set mydata;
by name id_no;
if first.id_no;
run;
The sort puts the lowest score on the first row for each name and ID, and the data step then keeps only that first record.
I'm not so eloquent with proc sql joins, so a quick and clumsy solution might look like
proc sql;
create table ws as
select name, id_no, city, score as worst_score, min(score) as min
from mydata
group by name;
create table worst as
select name, id_no, city, worst_score
from ws where worst_score=min;
quit;
Or you could use HAVING, it might look something like this
proc sql;
create table worst as
select name, id_no, city, score as worst_score
from mydata
group by name
having score = min(score);
quit;
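Both queries above select non-grouped columns alongside a summary function, which PROC SQL handles by remerging the summary statistics back with the original data (it writes a note about this to the log). If you would rather not depend on that behaviour, a sketch that joins to an explicit subquery might look like this; the aliases a, b and min_score are made up for illustration.

```sas
proc sql;
  create table worst as
  select a.name, a.id_no, a.city, a.score as worst_score
  from mydata as a
  inner join
       /* one row per name holding that name's lowest score */
       (select name, min(score) as min_score
        from mydata
        group by name) as b
    on a.name = b.name
   and a.score = b.min_score;
quit;
```

Like the HAVING version, this returns more than one row for a name if its lowest score occurs more than once.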
I'm very new to SAS and I'm having trouble nailing some of its concepts down (I'll be using the native example table BASEBALL for this question). I'm making two new columns for the table, batavg86 and batavgcr, shown below (I believe they work just fine), and then printing specific columns of the table (name, batavg86, team, and salary) if the value of batavg86 is greater than or equal to .300. What I have posted below does not work; it just prints the whole table. Can someone explain this to me? I'm pretty lost (my professor started us on this language and then went out of town for two weeks).
data mybaseball;
set sashelp.baseball;
batavg86 = nHits/nAtBat;
batavgcr = crHits/CrAtBat;
proc print data = name,batavg86,team,salary;
where batavg86 => .300;
run;
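This should give you the results you're looking for. The data= option on proc print takes the name of a dataset, not a list of columns; use a var statement to choose which columns to print: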
data mybaseball;
set sashelp.baseball;
batavg86 = nHits/nAtBat;
batavgcr = crHits/CrAtBat;
run;
proc print data = mybaseball;
var name batavg86 team salary;
where batavg86 >= .300;
run;