Sorting Data to get Top Ten Largest Salaries - sas

Hi I am working on a problem where I imported an Excel file to SAS
and I need to Sort Salaries of Baseball players and print the top ten largest salaries.
I tried to sort it but it is only sorting the first ten observations from my dataset from greatest to smallest and it is not getting the top ten greatest salaries from the entire dataset.
This is my code below.
proc sort data=MLB out=salaries_sorted;
format Salary dollar12.3;
by descending Salary descending Year;
proc print data=MLB (obs=10);
run;

In this code:
proc sort data=MLB out=salaries_sorted;
format Salary dollar12.3;
by descending Salary descending Year;
You are taking the input dataset MLB, and creating a new sorted table salaries_sorted. You're adding a format there (I'm not sure that works, but even if it does, it's a very unusual place to put it), and then you're telling it how to sort (by salary and year, both descending order).
Then, in this code:
proc print data=MLB (obs=10);
run;
You're printing the top 10 observations of the MLB dataset - but not the sorted dataset.
In order to print the sorted dataset's top 10 rows, you'd need to do this:
proc print data=salaries_sorted(obs=10);
run;
If you don't want the name to change, you can skip the out=salaries_sorted part (and then the MLB dataset will itself be sorted), but it's normally good practice to sort into a new dataset.

Related

Creating table with cumulative values

I have table like first table on the picture.
It's information about banks deals on the FX market on daily basis (buy minus sell). I would like to calculate cumulative results like on the second table. The number of banks and their names, also as date are not fixed. I'm new in SAS and tried to find solutions, but didn't find anything useful. I will be glad for any help.
When data such as this is in a wide format, it can be more difficult to process in SAS compared to a long format. Long data formats have numerous benefits in the form of by-group processing, indexing, filtering, etc. Many SAS procedures are designed around this concept.
For more information on the examples below, check out SAS's example on the Program Data Vector and by-group processing. Mastering these concepts will help you with data step programming.
Here are two ways you can solve it:
1. Use a sum statement and by-group processing.
In this example, we will:
Convert the data from wide to long in order to convert the bank name to a character variable
Perform a cumulative sum on each bank
Convert back to long again
By converting the bank name into a character variable, we can use by-group processing on it.
/* Convert from wide to long */
proc transpose data=raw
out=raw_transposed
name=bank
;
by date;
run;
proc sort data=raw_transposed;
by bank date;
run;
/* Use by-group processing to get cumulative values by month for each bank */
data cumulative_long;
set raw_transposed;
by bank date;
/* Reset the cumulative sum for each bank */
if(first.bank) then call missing(cumulative);
cumulative+COL1;
run;
proc sort data=raw_transposed;
by date bank;
run;
/* Convert from long to wide */
proc transpose data=raw_transposed
out=want(drop=_NAME_)
;
by date;
id bank;
var COL1;
run;
The sum statement can be used as a shortcut of the following code:
data cumulative_long;
set raw_transposed;
by bank date;
retain cumulative;
if(first.bank) then cumulative = 0;
cumulative = cumulative + COL1;
run;
cumulative does not exist in the dataset: we are creating it here. This value will become missing whenever SAS moves on to read a new row. We want SAS to carry the last value forward. retain tells SAS to carry its last value forward until we change it.
2. Use macro variables and dictionary tables
A second option would be to read all of the bank names from a dictionary table to prevent transposing. We will:
Read the names of the banks from the special table dictionary.columns into a macro variable using PROC SQL
Use arrays to perform cumulative sums
This assumes the bank naming scheme is always prefixed with "Bank." If does not follow a regular pattern, you can exclude all other variables from the initial SQL query.
proc sql noprint;
select name
, cats(name, '_cume')
into :banks separated by ' '
, :banks_cume separated by ' '
from dictionary.columns
where memname = 'RAW'
AND libname = 'WORK'
AND upcase(name) LIKE 'BANK%'
;
quit;
data want;
set raw;
array banks[*] &banks.;
array banks_cume[*] &banks_cume.;
do i = 1 to dim(banks);
banks_cume[i]+banks[i];
end;
drop i;
run;

How can I pull certain records and update data in SAS?

I have a table of customer purchases. The goal is to be able to pull summary statistics on the last 20 purchases for each customer and update them as each new order comes in. What is the best way to do this? Do I need to a table for each customer? Keep in mind there are over 500 customers. Thanks.
This is asked at a high level, so I'll answer it at that level. If you want more detailed help, you'll want to give more detailed information, and make an attempt to solve the problem yourself.
In SAS, you have the BY statement available in every PROC or DATA step, as well as the CLASS statement, available in most PROCs. These both are useful for doing data analysis at a level below global. For many basic uses they give a similar result, although not in all cases; look up the particular PROC you're using to do your analysis for more detailed information.
Presumably, you'd create one table containing your most twenty recent records per customer, or even one view (a view is like a table, except it's not written to disk), and then run your analysis PROC BY your customer ID variable. If you set it up as a view, you don't even have to rerun that part - you can create a permanent view pointing to your constantly updating data, and the subsetting to last 20 records will happen every time you run the analysis PROC.
Yes, You can either add a Rank to your existing table or create another table containing the last 20 purchases for each customer.
My recommendation is to use a datasetp to select the top20 purchasers per customer then do your summary statistics. My Code below will create a table called "WANT" with the top 20 and a rank field.
Sample Data:
data have;
input id $ purchase_date amount;
informat purchase_date datetime19.;
format purchase_date datetime19.;
datalines;
cust01 21dec2017:12:12:30 234.57
cust01 23dec2017:12:12:30 2.88
cust01 24dec2017:12:12:30 4.99
cust02 21nov2017:12:12:30 34.5
cust02 23nov2017:12:12:30 12.6
cust02 24nov2017:12:12:30 14.01
;
run;
Sort Data in Descending order by ID and Date:
proc sort data=have ;
by id descending purchase_date ;
run;
Select Top 2: Change my 2 to 20 in your case
/*Top 2*/
%let top=2;
data want (where=(Rank ne .));
set have;
by id;
retain i;
/*reset counter for top */
if first.id then do; i=1; end;
if i <= &top then do; Rank= &top+1-i; output; i=i+1;end;
drop i;
run;
Output: Last 2 Customer Purchases:
id=cust01 purchase_date=24DEC2017:12:12:30 amount=4.99 Rank=2
id=cust01 purchase_date=23DEC2017:12:12:30 amount=2.88 Rank=1
id=cust02 purchase_date=24NOV2017:12:12:30 amount=14.01 Rank=2
id=cust02 purchase_date=23NOV2017:12:12:30 amount=12.6 Rank=1

randomly select two observation and calculate the distance

I have a data set have with numerical column x. I want to randomly select any two distinct points and then calculate the distance between them.
If I only do it once, then I just use proc surveyselect to generate another data set with two obs.
proc surveyselect data=have out=want method=srs
sampsize=2;
run;
data out;
set have end=eof;
dist = abs(dif1(x));
if eof;
run;
But how can I do it multiple times, say 1000000? Each time two points are selected with equal prob, then finally I have 1000000 dist records.
How about you reorder your input dataset into a random order and then calculate the distance for every second observation?
proc sql ;
create table random as
select *, ranuni(0) as randorder
from have
order by randorder
;quit ;
data want ;
set random;
dist=abs(dif1(x)) ;
if _n_/2=int(_n_/2) ;
run;
If you need to specify a specific number of samples to calculate then you can add update set random to set random(obs=100000) for example. Although note this would be 'sample pairs' so 100,000 would give your want dataset 50,000 observations

SAS- rank variables conditional on value of variable

I am writing a SAS code, and I have the following issue. I have a table of financial data, and I want to rank the data (stocks) into groups according to a variable, but I want to omit the stocks for which the price (another variable) is less than 5. However, I don't want to just delete all stocks whose values are less than 5 at some date, as then I would need to calculate the returns of the ranked stocks- Hence if a stock is now at 10 but it is 3 in 2 months, I want to have it in the data today, but "not have it in the data" in 2 months.
For the moment my code is:
proc sort data=umd;
by date;
run;
proc rank data=umd out=umd1 group=10;
by date;
var cum_return;
ranks momr;
run;
Could you please help me with that?
I think you can use the " where= " option at dataset out=umdi. This will not delete the records only subset the output dataset.
proc rank data=umd out=umd1(where=(price > 5)) group=10;
by date;
var cum_return;
ranks momr;
run;

Contingency table in SAS

I have data on exam results for 2 years for a number of students. I have a column with the year, the students name and the mark. Some students don't appear in year 2 because they don't sit any exams in the second year. I want to show whether the performance of students persists or whether there's any pattern in their subsequent performance. I can split the data into two halves of equal size to account for the 'first-half' and 'second-half' marks. I can also split the first half into quintiles according to the exam results using 'proc rank'
I know the output I want is a 5 X 5 table that has the original 5 quintiles on one axis and the 5 subsequent quintiles plus a 'dropped out' category as well, so a 5 x 6 matrix. There will obviously be around 20% of the total number of students in each quintile in the first exam, and if there's no relationship there should be 16.67% in each of the 6 susequent categories. But I don't know how to proceed to show whether this is the case of not with this data.
How can I go about doing this in SAS, please? Could someone point me towards a good tutorial that would show how to set this up? I've been searching for terms like 'performance persistence' etc, but to no avail. . .
I've been proceeding like this to set up my dataset. I've added a column with 0 or 1 for the first or second half of the data using the first procedure below. I've also added a column with the quintile rank in terms of marks for all the students. But I think I've gone about this the wrong way. Shoudn't I be dividing the data into quintiles in each half, rather than across the whole two periods?
Proc rank groups=2;
var yearquarter;
ranks ExamRank;
run;
Proc rank groups=5;
var percentageResult;
ranks PerformanceRank;
run;
Thanks in advance.
Why are you dividing the data into quintiles?
I would leave the scores as they are, then make a scatterplot with
PROC SGPLOT data = dataset;
x = year1;
y = year2;
loess x = year1 y = year2;
run;
Here's a fairly basic example of the simple tabulation. I transpose your quintile data and then make a table. Here there is basically no relationship, except that I only allow a 5% DNF so you have more like 19% 19% 19% 19% 19% 5%.
data have;
do i = 1 to 10000;
do year = 1 to 2;
if year=2 and ranuni(7) < 0.05 then call missing(quintile);
else quintile = ceil(5*ranuni(7));
output;
end;
end;
run;
proc transpose data=have prefix=year out=have_t;
by i;
var quintile;
id year;
run;
proc tabulate data=have_t missing;
class year1 year2;
tables year1,year2*rowpctn;
run;
PROC CORRESP might be helpful for the analysis, though it doesn't look like it exactly does what you want.
proc corresp data=have_t outc=want outf=want2 missing;
tables year1,year2;
run;