SAS: rank variables conditional on the value of another variable

I am writing SAS code and have the following issue. I have a table of financial data, and I want to rank the stocks into groups according to one variable, but omit the stocks whose price (another variable) is less than 5. However, I don't want to simply delete every stock whose price is below 5 at some date, because I then need to calculate the returns of the ranked stocks. So if a stock is at 10 now but at 3 in two months, I want it in the data today but not in the data in two months.
For the moment my code is:
proc sort data=umd;
by date;
run;
proc rank data=umd out=umd1 group=10;
by date;
var cum_return;
ranks momr;
run;
Could you please help me with that?

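I think you can use the WHERE= data set option on the output data set (out=umd1). This does not delete any records from umd; it only subsets the output data set. Note that the filter is applied after the ranks are computed, so the group boundaries still reflect every stock, including those priced at 5 or below.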
proc rank data=umd out=umd1(where=(price > 5)) group=10;
by date;
var cum_return;
ranks momr;
run;
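If the low-priced stocks should not influence the group boundaries at all, one possible variation is to put the WHERE= option on the input data set instead, so that PROC RANK never sees those rows. This is only a sketch: it assumes umd is already sorted by date, as in the question, and it reads "omit prices less than 5" as keeping price >= 5, which is an interpretation rather than something the answer above states.

```sas
/* Sketch: filter on the way in, so only eligible stocks are ranked. */
/* Assumes umd is already sorted by date (see the PROC SORT above).  */
/* The >= 5 cutoff is an assumption based on the question's wording. */
proc rank data=umd(where=(price >= 5)) out=umd1 groups=10;
    by date;
    var cum_return;
    ranks momr;
run;
```

Because the filter is evaluated row by row, a stock priced at 10 today and 3 in two months is ranked on today's date and left out on the later one, while umd itself is unchanged.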

Related

SAS combining range of class values for PROC MEANS

I am taking a scripting class and I have no idea what I'm doing!
For my assignment, I am supposed to print min/max/mean/std for each year. The .csv file I was given to use has a year column with the years as
1949.083
1949.167
1949.25
1949.333
1949.417
1949.5
1949.583
1949.667
1949.75
1949.833
1949.917
1950
1950.083
1950.167
and so on, all the way to 1960.
Assuming I am using PROC MEANS, is there a way to maybe combine the years so I can print a single set of calculations (min/max/mean/std) for each year? As in one set of calculations for the year 1949 (data values from 1949-1949.917), another one for 1950 (data values from 1950-1950.917), etc. Not sure if I'm making sense! I've been looking everywhere for hours and I can't figure it out! :(
If you want PROC MEANS to calculate separate statistics per year you can use a CLASS statement. With a CLASS statement it will define the groups based on the formatted value. So if you just use the format 4. with the variable YEAR then each value will be mapped to a simple 4 digit value.
proc means data=have min max mean std ;
class year;
format year 4.;
var analysis_var ;
run;
But that will round values like 1949.667 to 1950 and not 1949. If you want to ignore the fractional part of the year you can use the INT() function. So first create a new variable and then use that new variable in the CLASS statement.
data step1;
set have;
yrnum = int(year);
run;
proc means data=step1 min max mean std ;
class yrnum ;
var analysis_var ;
run;
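To see the difference between the two approaches on a single value, a small throwaway step like the one below can help. It is only an illustration; the variable names formatted and truncated are made up for this sketch.

```sas
/* Illustration only: compare the 4. format with INT() on one value. */
data _null_;
    year = 1949.667;
    formatted = put(year, 4.);  /* the 4. format rounds to 1950   */
    truncated = int(year);      /* INT() drops the fraction: 1949 */
    put formatted= truncated=;  /* writes both values to the log  */
run;
```

The log line should show 1950 for the formatted value and 1949 for the truncated one, which is why the INT() version groups 1949.667 with the rest of 1949.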

Creating table with cumulative values

I have a table like the first table in the picture.
It holds information about banks' deals on the FX market on a daily basis (buy minus sell). I would like to calculate cumulative results, as in the second table. The number of banks and their names, as well as the dates, are not fixed. I'm new to SAS and tried to find a solution, but didn't find anything useful. I would be glad of any help.
When data such as this is in a wide format, it can be more difficult to process in SAS compared to a long format. Long data formats have numerous benefits in the form of by-group processing, indexing, filtering, etc. Many SAS procedures are designed around this concept.
For more information on the examples below, check out SAS's example on the Program Data Vector and by-group processing. Mastering these concepts will help you with data step programming.
Here are two ways you can solve it:
1. Use a sum statement and by-group processing.
In this example, we will:
Convert the data from wide to long in order to convert the bank name to a character variable
Perform a cumulative sum on each bank
Convert back to wide again
By converting the bank name into a character variable, we can use by-group processing on it.
/* Convert from wide to long */
proc transpose data=raw
out=raw_transposed
name=bank
;
by date;
run;
proc sort data=raw_transposed;
by bank date;
run;
/* Use by-group processing to get cumulative values by date for each bank */
data cumulative_long;
set raw_transposed;
by bank date;
/* Reset the cumulative sum for each bank */
if(first.bank) then call missing(cumulative);
cumulative+COL1;
run;
/* Sort the cumulative results back into date order */
proc sort data=cumulative_long;
by date bank;
run;
/* Convert from long to wide, using the cumulative values */
proc transpose data=cumulative_long
out=want(drop=_NAME_)
;
by date;
id bank;
var cumulative;
run;
The sum statement is a shortcut for the following code:
data cumulative_long;
set raw_transposed;
by bank date;
retain cumulative;
if(first.bank) then cumulative = 0;
cumulative = cumulative + COL1;
run;
cumulative does not exist in the input dataset: we are creating it here. Without retain, its value would be reset to missing each time SAS moves on to read a new row. We want SAS to carry the last value forward, and retain tells SAS to do exactly that until we change it.
2. Use macro variables and dictionary tables
A second option would be to read all of the bank names from a dictionary table to prevent transposing. We will:
Read the names of the banks from the special table dictionary.columns into a macro variable using PROC SQL
Use arrays to perform cumulative sums
This assumes the bank naming scheme is always prefixed with "Bank". If the names do not follow a regular pattern, you can instead exclude all other variables in the initial SQL query.
proc sql noprint;
select name
, cats(name, '_cume')
into :banks separated by ' '
, :banks_cume separated by ' '
from dictionary.columns
where memname = 'RAW'
AND libname = 'WORK'
AND upcase(name) LIKE 'BANK%'
;
quit;
data want;
set raw;
array banks[*] &banks.;
array banks_cume[*] &banks_cume.;
do i = 1 to dim(banks);
banks_cume[i]+banks[i];
end;
drop i;
run;
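Since the original tables were posted as a picture, here is a small made-up input that can be used to try the code above. The bank names Bank_A and Bank_B and all of the values are hypothetical; only the layout (one row per date, one column per bank) is taken from the question.

```sas
/* Hypothetical input: one row per date, one numeric column per bank. */
data raw;
    input date :date9. Bank_A Bank_B;
    format date date9.;
    datalines;
01JAN2020 10 -5
02JAN2020 -3 7
03JAN2020 4 2
;
run;
```

The data set is already in date order and the column names start with "Bank", so it fits the assumptions stated for both options, and it is small enough to check the cumulative columns by hand.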

How can I pull certain records and update data in SAS?

I have a table of customer purchases. The goal is to be able to pull summary statistics on the last 20 purchases for each customer and update them as each new order comes in. What is the best way to do this? Do I need a table for each customer? Keep in mind there are over 500 customers. Thanks.
This is asked at a high level, so I'll answer it at that level. If you want more detailed help, you'll want to give more detailed information, and make an attempt to solve the problem yourself.
In SAS, you have the BY statement available in every PROC or DATA step, as well as the CLASS statement, available in most PROCs. These both are useful for doing data analysis at a level below global. For many basic uses they give a similar result, although not in all cases; look up the particular PROC you're using to do your analysis for more detailed information.
Presumably, you'd create one table containing the twenty most recent records per customer, or even one view (a view is like a table, except it's not written to disk), and then run your analysis PROC BY your customer ID variable. If you set it up as a view, you don't even have to rerun that part: you can create a permanent view pointing to your constantly updating data, and the subsetting to the last 20 records will happen every time you run the analysis PROC.
Yes, you can either add a rank to your existing table or create another table containing the last 20 purchases for each customer.
My recommendation is to use a data step to select the top 20 purchases per customer and then do your summary statistics. My code below will create a table called "WANT" with the top 20 and a rank field.
Sample Data:
data have;
input id $ purchase_date amount;
informat purchase_date datetime19.;
format purchase_date datetime19.;
datalines;
cust01 21dec2017:12:12:30 234.57
cust01 23dec2017:12:12:30 2.88
cust01 24dec2017:12:12:30 4.99
cust02 21nov2017:12:12:30 34.5
cust02 23nov2017:12:12:30 12.6
cust02 24nov2017:12:12:30 14.01
;
run;
Sort the data by ID, with purchase dates in descending order:
proc sort data=have ;
by id descending purchase_date ;
run;
Select Top 2: Change my 2 to 20 in your case
/*Top 2*/
%let top=2;
data want (where=(Rank ne .));
set have;
by id;
retain i;
/*reset counter for top */
if first.id then do; i=1; end;
if i <= &top then do; Rank= &top+1-i; output; i=i+1;end;
drop i;
run;
Output: Last 2 Customer Purchases:
id=cust01 purchase_date=24DEC2017:12:12:30 amount=4.99 Rank=2
id=cust01 purchase_date=23DEC2017:12:12:30 amount=2.88 Rank=1
id=cust02 purchase_date=24NOV2017:12:12:30 amount=14.01 Rank=2
id=cust02 purchase_date=23NOV2017:12:12:30 amount=12.6 Rank=1
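From there, the summary statistics could be produced per customer with something like the step below. This is a sketch rather than part of the answer's code: the choice of statistics (N, mean, min, max, sum) is an assumption, and it relies on the WANT table and the id and amount variables from the sample above.

```sas
/* Sketch: summary statistics on the retained purchases, per customer. */
/* The statistics listed here are only an example selection.           */
proc means data=want n mean min max sum;
    class id;
    var amount;
run;
```

Rerunning the sort, the selection step, and this PROC MEANS after new orders arrive refreshes the figures, since WANT is rebuilt from HAVE each time.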

Contingency table in SAS

I have data on exam results for 2 years for a number of students. I have a column with the year, the student's name and the mark. Some students don't appear in year 2 because they don't sit any exams in the second year. I want to show whether the performance of students persists or whether there's any pattern in their subsequent performance. I can split the data into two halves of equal size to account for the 'first-half' and 'second-half' marks. I can also split the first half into quintiles according to the exam results using 'proc rank'.
I know the output I want is a 5 x 5 table that has the original 5 quintiles on one axis and the 5 subsequent quintiles plus a 'dropped out' category as well, so a 5 x 6 matrix. There will obviously be around 20% of the total number of students in each quintile in the first exam, and if there's no relationship there should be 16.67% in each of the 6 subsequent categories. But I don't know how to proceed to show whether this is the case or not with this data.
How can I go about doing this in SAS, please? Could someone point me towards a good tutorial that would show how to set this up? I've been searching for terms like 'performance persistence' etc., but to no avail.
I've been proceeding like this to set up my dataset. I've added a column with 0 or 1 for the first or second half of the data using the first procedure below. I've also added a column with the quintile rank in terms of marks for all the students. But I think I've gone about this the wrong way. Shouldn't I be dividing the data into quintiles in each half, rather than across the whole two periods?
Proc rank groups=2;
var yearquarter;
ranks ExamRank;
run;
Proc rank groups=5;
var percentageResult;
ranks PerformanceRank;
run;
Thanks in advance.
Why are you dividing the data into quintiles?
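I would leave the scores as they are, then make a scatterplot with a loess fit:
PROC SGPLOT data = dataset;
/* plot the raw year 1 vs. year 2 scores */
scatter x = year1 y = year2;
/* overlay a loess curve without redrawing the points */
loess x = year1 y = year2 / nomarkers;
run;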
Here's a fairly basic example of the simple tabulation. I transpose your quintile data and then make a table. Here there is basically no relationship, except that I only allow a 5% DNF so you have more like 19% 19% 19% 19% 19% 5%.
data have;
do i = 1 to 10000;
do year = 1 to 2;
if year=2 and ranuni(7) < 0.05 then call missing(quintile);
else quintile = ceil(5*ranuni(7));
output;
end;
end;
run;
proc transpose data=have prefix=year out=have_t;
by i;
var quintile;
id year;
run;
proc tabulate data=have_t missing;
class year1 year2;
tables year1,year2*rowpctn;
run;
PROC CORRESP might be helpful for the analysis, though it doesn't look like it exactly does what you want.
proc corresp data=have_t outc=want outf=want2 missing;
tables year1,year2;
run;

SAS and Date operations

I've tried googling and I haven't turned up any luck to my current problem. Perhaps someone can help?
I have a dataset with the following variables:
ID, AccidentDate
It's in long format, and each participant can have more than 1 accident, with participants having not necessarily an equal number of accidents. Here is a sample:
Code:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
What I need to do is count the number of days between each individual's first and last recorded accident date. I've been playing around with the first.byvariable and last.byvariable variables, but I'm just not making any progress. Any tips, or any links to a source?
Thank you,
Also. I posted this originally over at Talkstats.com (cross-posting etiquette)
I'm not sure what you mean by long format.
Long format should look like this:
id accident date
1 1 1JAN2001
1 2 1JAN2002
2 1 1JAN2001
2 2 1JAN2003
Then you can try PROC SQL like this:
proc sql;
/* one row per id: days between the earliest and latest date */
select id, max(date)-min(date) as days_between
from table
group by id;
quit;
By long format I think you mean it is a "stacked" dataset with each person having multiple observations (instead of one row per person with multiple columns). In your situation, it is probably the correct way to have the data stored.
To do it with data steps, I think you are on the right track with first. and last.
I would do it like this:
proc sort data=accidents;
by id date;
run;
data accidents; set accidents;
by id; *this is important-it makes first.id and last.id available for use;
retain first last;
if first.id then first=date;
if last.id then last=date;
run;
Now you have a dataset with ID, Date, Date of First Accident, Date of Last Accident
You could calculate the time between with
data accidents; set accidents;
timebetween = last-first;
run;
Note that the "last" variable is not accurate until the step has reached the last row for an ID, so timebetween is only correct on each ID's last accident observation; the other rows should be ignored.
Assuming the data looks like:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
You have the right idea. Retain the first accident date in order to have access to both the first and last dates. Then calculate the difference.
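proc sort data=accidents;
by id accidentdate;
run;
data accidents;
set accidents;
by id;
/* carry the first accident date across the rows of each ID */
retain first_accidentdate;
if first.id then first_accidentdate = accidentdate;
/* on the last row per ID, compute the span and keep only that row */
if last.id then do;
daysbetween = accidentdate - first_accidentdate;
output;
end;
run;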
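To try this out, the sample rows from the question can be read in as below, followed by an independent cross-check with PROC SQL. The DATE9. informat and format are assumptions about how the dates are stored; the rows themselves are copied from the question.

```sas
/* Sample rows from the question; the DATE9. informat is an assumption. */
data accidents;
    input id accidentdate :date9.;
    format accidentdate date9.;
    datalines;
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
;
run;

/* Cross-check: days between the first and last accident per ID. */
proc sql;
    select id,
           max(accidentdate) - min(accidentdate) as daysbetween
    from accidents
    group by id;
quit;
```

Run on the freshly created data, the query should report 0, 12, 173 and 0 days for IDs 1 through 4, which gives something to compare the data step result against.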