I'm very new to SAS and I'm having trouble nailing down some of its concepts (I'll use the built-in example table BASEBALL for this question). I'm adding two new columns to the table, batavg86 and batavgcr, shown below (I believe they work fine), and then trying to print specific columns (name, batavg86, team, and salary) for rows where batavg86 is greater than or equal to .300. What I have posted below does not work; it just prints the whole table. Can someone explain this to me? I'm pretty lost (my professor started us on this language and then went out of town for two weeks).
data mybaseball;
set sashelp.baseball;
batavg86 = nHits/nAtBat;
batavgcr = crHits/CrAtBat;
proc print data = name,batavg86,team,salary;
where batavg86 => .300;
run;
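This should give you the results you're looking for. The DATA= option of PROC PRINT takes a dataset name, not a list of columns; the columns to print go in a VAR statement. The DATA step also gets its own RUN statement: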
data mybaseball;
set sashelp.baseball;
batavg86 = nHits/nAtBat;
batavgcr = crHits/CrAtBat;
run;
proc print data = mybaseball;
var name batavg86 team salary;
where batavg86 >= .300;
run;
I have a table of customer purchases. The goal is to be able to pull summary statistics on the last 20 purchases for each customer and update them as each new order comes in. What is the best way to do this? Do I need a table for each customer? Keep in mind there are over 500 customers. Thanks.
This is asked at a high level, so I'll answer it at that level. If you want more detailed help, you'll want to give more detailed information, and make an attempt to solve the problem yourself.
In SAS, you have the BY statement available in every PROC or DATA step, as well as the CLASS statement, available in most PROCs. These both are useful for doing data analysis at a level below global. For many basic uses they give a similar result, although not in all cases; look up the particular PROC you're using to do your analysis for more detailed information.
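As a minimal sketch of the difference, here is the same mean computed both ways on the SASHELP.CLASS sample table (my choice for illustration, not part of the question). CLASS groups the rows without a prior sort, while BY expects the input to be sorted by the grouping variable and reports each group separately.

```sas
/* CLASS: no sort needed; one table with a row per group */
proc means data=sashelp.class mean;
  class sex;
  var height;
run;

/* BY: the input must first be sorted by the BY variable */
proc sort data=sashelp.class out=class_sorted;
  by sex;
run;

/* BY: one separate block of output per group */
proc means data=class_sorted mean;
  by sex;
  var height;
run;
```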
Presumably, you'd create one table containing the twenty most recent records per customer, or even one view (a view is like a table, except it's not written to disk), and then run your analysis PROC with a BY statement on your customer ID variable. If you set it up as a view, you don't even have to rerun that part: you can create a permanent view pointing to your constantly updating data, and the subsetting to the last 20 records will happen every time you run the analysis PROC.
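A rough sketch of the view approach, assuming a purchases dataset named HAVE with variables id, purchase_date and amount (all hypothetical names, as are the view names PURCHASES_SORTED and LAST20). To make the views permanent, you would write them to a permanent library instead of WORK.

```sas
/* View that presents the purchases newest-first within each customer */
proc sql;
  create view purchases_sorted as
  select *
  from have
  order by id, purchase_date desc;
quit;

/* DATA step view that keeps only the 20 most recent rows per customer */
data last20 / view=last20;
  set purchases_sorted;
  by id descending purchase_date;
  if first.id then n = 0;  /* restart the counter for each customer */
  n + 1;                   /* sum statement: n is retained across rows */
  if n <= 20;              /* subsetting IF keeps the first 20 rows only */
  drop n;
run;

/* The subsetting is re-evaluated every time the view is read */
proc means data=last20 n mean min max;
  by id;
  var amount;
run;
```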
Yes, you can either add a rank to your existing table or create another table containing the last 20 purchases for each customer.
My recommendation is to use a data step to select the top 20 purchases per customer and then do your summary statistics. My code below will create a table called "WANT" with the top 20 purchases and a rank field.
Sample Data:
data have;
input id $ purchase_date amount;
informat purchase_date datetime19.;
format purchase_date datetime19.;
datalines;
cust01 21dec2017:12:12:30 234.57
cust01 23dec2017:12:12:30 2.88
cust01 24dec2017:12:12:30 4.99
cust02 21nov2017:12:12:30 34.5
cust02 23nov2017:12:12:30 12.6
cust02 24nov2017:12:12:30 14.01
;
run;
Sort the data by ID, then by descending date within each ID:
proc sort data=have ;
by id descending purchase_date ;
run;
Select the top 2 (change the 2 to 20 in your case):
/*Top 2*/
%let top=2;
data want (where=(Rank ne .));
set have;
by id;
retain i;
/*reset counter for top */
if first.id then do; i=1; end;
if i <= &top then do; Rank= &top+1-i; output; i=i+1;end;
drop i;
run;
Output: Last 2 Customer Purchases:
id=cust01 purchase_date=24DEC2017:12:12:30 amount=4.99 Rank=2
id=cust01 purchase_date=23DEC2017:12:12:30 amount=2.88 Rank=1
id=cust02 purchase_date=24NOV2017:12:12:30 amount=14.01 Rank=2
id=cust02 purchase_date=23NOV2017:12:12:30 amount=12.6 Rank=1
I have two SAS datasets with (assume for simplicity) one char variable in each. The first dataset has a variable with a company description (sometimes including the city, sometimes not; a messy field), and the second dataset has a variable in which all cities are listed. I need to create a variable in the first dataset saying whether any of the cities from the second dataset was found, and the outcome should not be just a 0 or 1 answer, but the city itself.
Is there an easy way to do it without looping INDEXW (or similar) functions?
What's wrong with indexw? Using proc sql and indexw allows a pretty straightforward solution.
Sample data:
data have_messy;
length messy $100;
messy = 'this is a city name: brisbane' ; output;
messy = 'this is a city name: sydney' ; output;
messy = 'this is a city name: melbourne'; output;
run;
data have_city;
length city $20;
city = 'sydney' ; output;
city = 'brisbane'; output;
run;
Example query:
proc sql noprint;
create table want as
select a.*,
b.city
from have_messy a
left join have_city b on indexw(a.messy, b.city)
;
quit;
Results:
messy                            city
===============================  =========
this is a city name: sydney      sydney
this is a city name: brisbane    brisbane
this is a city name: melbourne
Be careful: the above query can return multiple rows per row of table a if multiple city names are found. I suggest you run a follow-up step to handle any duplicate rows depending on your requirements.
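One possible follow-up step, as a sketch only: keep a single row per description with NODUPKEY. This assumes any one matching city is acceptable and that the descriptions in have_messy are themselves unique (identical descriptions would be collapsed as well). The output name WANT_ONE is made up for the example.

```sas
/* Keep one row per messy description; NODUPKEY keeps the first
   row it encounters for each BY value and drops the rest */
proc sort data=want out=want_one nodupkey;
  by messy;
run;
```

If you need every matching city instead, keep WANT as it is and treat it as a one-to-many result.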
Suppose I have a table:
Name Age
Bob 4
Pop 5
Yoy 6
Bob 5
I want to delete all names, which are not unique in the table:
Name Age
Pop 5
Yoy 6
ATM, my solution is to make a new table with counts of unique names:
Name Count
Bob 2
Pop 1
Yoy 1
And then drop every name whose Count > 1.
I believe there are much more beautiful solutions.
If I understand you correctly there are two ways to do it:
The SQL Procedure
In SAS you may not need to use a summarisation function such as MIN() as I have here, but when there is only one of name then min(age) = age anyway, and when migrating this to another RDBMS (e.g. Oracle, SQL Server) it may be required:
proc sql;
create table want as
select name, min(age) as age
from have
group by name
having count(*) = 1;
quit;
Data Step
Requires the data to be pre-sorted:
proc sort data=have out=have_stg;
by name;
run;
When doing SAS data-step by group processing, the first. (first-dot) and last. (last-dot) variables are generated which denote whether the current observation is the first and/or last in the by-group. Using SAS conditional logic one can simply test if first.name = 1 and last.name = 1. Reducing this using logical shorthand becomes:
data want;
set have_stg;
by name;
if first.name and last.name;
/* Equivalent to:*/
*if first.name = 1 and last.name = 1;
run;
I left both versions in the code above; use whichever you find more readable.
You can use proc sort with the nouniquekey option. Then use uniqueout= to output the unique values and out= to output the duplicates (the out= option is necessary if you don't want to overwrite your original dataset).
proc sort data = have nouniquekey uniqueout = unique out = dups;
by name;
run;
I have a SQL query that creates, for each customer, a short excerpt of their history. Suppose the columns I am interested in are TIMESTAMP and PURCHASE VALUE. I'd like to calculate a linear regression for each customer and put the result into a table.
proc sql;
create table CUSTOMERHISTORY as
select
TIME_STAMP
,PURCHASE_VALUE
,CUSTOMER_ID
from <my data source>
;quit;
The table is quite large; it would be best if the table didn't have to be loaded into RAM prior to computation.
I tried
proc reg
data = CUSTOMERHISTORY;
model PURCHASE_VALUE=TIME_STAMP;
outest = OUTTABLE;
by CUSTOMER_ID;
but it never wrote anything to the OUTTABLE. (I found parameter outest in http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_reg_sect007.htm )
According to the documentation you link to, outest is a parameter that you should give as an option on the proc reg statement itself. So to get that specific output, your code should look like this:
proc reg
data = CUSTOMERHISTORY
outest = OUTTABLE;
model PURCHASE_VALUE=TIME_STAMP;
by CUSTOMER_ID;
run;
Note that there is no semicolon between data = ... and outest = ....
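For completeness, a sketch of the whole flow under two assumptions of mine: that the data can be sorted by CUSTOMER_ID (BY-group processing expects sorted or indexed input), and that you don't need the printed regression reports, hence NOPRINT. In the OUTEST= dataset the slope is stored in a variable named after the regressor, next to Intercept.

```sas
/* BY-group processing expects the input sorted by the BY variable */
proc sort data=CUSTOMERHISTORY;
  by CUSTOMER_ID;
run;

/* One regression per customer; estimates go to OUTTABLE */
proc reg data=CUSTOMERHISTORY outest=OUTTABLE noprint;
  model PURCHASE_VALUE = TIME_STAMP;
  by CUSTOMER_ID;
run;
quit;

/* One row per customer: the intercept and the slope on TIME_STAMP */
proc print data=OUTTABLE;
  var CUSTOMER_ID Intercept TIME_STAMP;
run;
```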
I have data on exam results for 2 years for a number of students. I have a column with the year, the student's name and the mark. Some students don't appear in year 2 because they don't sit any exams in the second year. I want to show whether the performance of students persists or whether there's any pattern in their subsequent performance. I can split the data into two halves of equal size to account for the 'first-half' and 'second-half' marks. I can also split the first half into quintiles according to the exam results using 'proc rank'.
I know the output I want is a table that has the original 5 quintiles on one axis and the 5 subsequent quintiles plus a 'dropped out' category on the other, so a 5 x 6 matrix. There will obviously be around 20% of the total number of students in each quintile in the first exam, and if there's no relationship there should be 16.67% in each of the 6 subsequent categories. But I don't know how to proceed to show whether this is the case or not with this data.
How can I go about doing this in SAS, please? Could someone point me towards a good tutorial that would show how to set this up? I've been searching for terms like 'performance persistence' etc., but to no avail.
I've been proceeding like this to set up my dataset. I've added a column with 0 or 1 for the first or second half of the data using the first procedure below. I've also added a column with the quintile rank in terms of marks for all the students. But I think I've gone about this the wrong way. Shouldn't I be dividing the data into quintiles in each half, rather than across the whole two periods?
Proc rank groups=2;
var yearquarter;
ranks ExamRank;
run;
Proc rank groups=5;
var percentageResult;
ranks PerformanceRank;
run;
Thanks in advance.
Why are you dividing the data into quintiles?
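I would leave the scores as they are, then make a scatterplot with a loess fit:
proc sgplot data = dataset;
/* raw scores: one point per student */
scatter x = year1 y = year2;
/* smoothed trend; nomarkers avoids drawing the points twice */
loess x = year1 y = year2 / nomarkers;
run;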
Here's a fairly basic example of the simple tabulation. I transpose your quintile data and then make a table. Here there is basically no relationship, except that I only allow a 5% DNF so you have more like 19% 19% 19% 19% 19% 5%.
data have;
do i = 1 to 10000;
do year = 1 to 2;
if year=2 and ranuni(7) < 0.05 then call missing(quintile);
else quintile = ceil(5*ranuni(7));
output;
end;
end;
run;
proc transpose data=have prefix=year out=have_t;
by i;
var quintile;
id year;
run;
proc tabulate data=have_t missing;
class year1 year2;
tables year1,year2*rowpctn;
run;
PROC CORRESP might be helpful for the analysis, though it doesn't look like it exactly does what you want.
proc corresp data=have_t outc=want outf=want2 missing;
tables year1,year2;
run;