Sorry for possibly asking a very basic question. I am quite new to coding in SAS.
As far as I can see, there is a standard SAS FCMP procedure to calculate the Black-Scholes implied volatility for individual option data.
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a003193738.htm
But my problem is that I have a data table of intraday option trading data that lists all the required fields, i.e. strike price, time to expiry, equity price, interest rate and the observed option price. Is there any way I can code this so that the implied volatility is calculated for each of these individual entries?
I understand that I possibly need to use some kind of loop to do so, but I do not understand how to pass values within procedures in SAS. Any help will be highly appreciated.
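A minimal sketch of one way to do this, not the documented SAS example: define the Black-Scholes call price once in PROC FCMP, then call it from a DATA step, which already loops over every row, and invert it for volatility by bisection. The input dataset name (options_data) and its variable names (strike, tte, spot, rate, opt_price) are assumptions; rename them to match your table.

proc fcmp outlib=work.funcs.finance;
    function bs_call(strike, tte, spot, rate, sigma);
        /* standard Black-Scholes price of a European call */
        d1 = (log(spot/strike) + (rate + sigma**2/2)*tte) / (sigma*sqrt(tte));
        d2 = d1 - sigma*sqrt(tte);
        return (spot*probnorm(d1) - strike*exp(-rate*tte)*probnorm(d2));
    endsub;
run;

/* make the FCMP function visible to DATA steps */
options cmplib=work.funcs;

data implied_vol;
    set options_data;            /* one row per option trade */
    lo = 0.0001; hi = 5;         /* bracket for the volatility search */
    do iter = 1 to 100;          /* bisection, repeated for each observation */
        mid = (lo + hi)/2;
        if bs_call(strike, tte, spot, rate, mid) > opt_price then hi = mid;
        else lo = mid;
    end;
    impvol = (lo + hi)/2;        /* assumes opt_price lies within Black-Scholes bounds */
    drop lo hi mid iter;
run;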
This is mainly a question about efficiency, as I'm unfamiliar with how SAS processes datasets. A lot of code that I run reads from multiple datasets with consecutive dates (whether this is consecutive months/quarters/years depends on the datasets).
At the moment, the code requires manual updates each time it's run to ensure it's picking up the correct dates, so I would have something such as:
Data Quarters;
    Set XYZ_201803
        XYZ_201806
        ...
        ...
        XYZ_202006;
Run;
To help tidy up the code and make it a bit less tedious, I've considered a few different ideas (and had a few sent my way). One of the big ideas is to store all of the XYZ_YYYYMM datasets as a single, appended dataset, so it can be read with a simple filter on the date, as below:
Data Quarters;
    Set AppendedData;
    Where Date > 201812;
Run;
Which of these two options is more efficient as far as computation goes? On datasets that are typically a couple of GB in size, which would you recommend? What other pros and cons come with each idea?
Thanks for any input. :)
Most likely a single dataset and several separate datasets will be similar from a performance standpoint; there is some small overhead opening new datasets, but as long as it's not thousands of them you probably won't notice a difference.
There will be a performance hit with a single dataset, both in creating it and in using it, if you typically use only small sections of it. Separate datasets are common where people usually analyze individual quarters and rarely combine them.
Finally, if the datasets can vary from quarter to quarter in their contents (if the formats could change, if the fields can change), then keeping them separate is easier in some ways than having to manage those changes between the different periods.
That said, there's a huge organizational benefit to a single dataset, and all of the above issues can be dealt with. Think of SAS datasets as large SQL tables - they are effectively the same, and the same things that help SQL tables can help SAS. Proper sizing of columns, proper sorting of the stored data, indexing appropriately, are all important solutions. If you have a database team at your place of work, they may be able to help construct an ideal table plan. Files of several GB can definitely benefit from indexing and proper sorting, to allow users to easily get at the bits they need.
If you were to stay with separate datasets, you could use the macro language to make sure you're reading in the right datasets, assuming they're named in a consistent fashion. That might be the ideal solution if there are other reasons to stay separate; then no changes are needed each quarter.
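For example, a minimal sketch of such a macro, assuming the quarterly datasets follow the XYZ_YYYYMM naming convention (quarter-end months 03, 06, 09, 12) and sit in a library that is already assigned; the macro name and parameters are made up for illustration:

%macro stack_quarters(start=201803, end=202006);
    data Quarters;
        set
        %let ym = &start;
        %do %while (&ym <= &end);
            XYZ_&ym
            /* advance three months to the next quarter-end dataset name */
            %let next = %sysfunc(intnx(month, %sysfunc(inputn(&ym.01, yymmdd8.)), 3));
            %let ym   = %sysfunc(putn(&next, yymmn6.));
        %end;
        ;
    run;
%mend stack_quarters;

/* pick up everything from 2019Q1 onwards without editing the step each quarter */
%stack_quarters(start=201903, end=202006);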
Points of interest:
From a coding standpoint
Dealing with a single, stacked data set created by appending the quarterly data sets is more efficient.
From a resource standpoint
You have to make sure you have a large enough disk to hold the single large table.
Keep additional offline storage to hold the original pieces -- no need to clutter up the primary data disk with all of them.
A 2TB SSD is very fast, remarkably cheap, and low power and can contain a table comprised of quite a few "couple GB" pieces.
Spinning disk has lower $/TB and more capacity. I/O will be slower and consume more power.
To further improve query performance you will want to index the variables most commonly used in BY, CLASS, and WHERE statements (see the sketch below).
"... simple filter ..." is part of "Keep it Simple S****" (KISS)
I am simulating PGA tournaments using Stata. My simulation results table consists of:
column 1: the names of the 30 players in the tournament
columns 2-30,001: the 4-round results of my Monte Carlo simulations.
What I am trying to do is create a 30 x 30 matrix with the golfers' names down column 1 and across the column headers, where each cell represents the percentage of times Golfer A beat Golfer B outright across the 30,000 simulations. Is this possible to do in Stata? Thanks
I tend to say that everything is always possible in all programming languages, but some things are much more difficult to do in some languages than in others. I do not think that Stata is a great tool for what you intend to do.
You need to provide some code examples for us to be able to help you with your task, but here is one thing I can say. Stata has two programming languages. One is often called Stata (but is called ado on StataCorp's website) and the other is Mata. If for some reason you need to use the software Stata, you should do this in the Mata language, which has more matrix operators than ado. In ado you can't store text in a matrix, so if you want to store the golfers' names you need to use Mata; alternatively, you can use row and column indexes to keep track of the golfers.
With that said, Stata is primarily a tool for operating on and analyzing a single dataset loaded into memory (support for multiple datasets in memory was added recently). So to answer your question: yes, this can be done in Stata, but you are probably much better off doing it in a language with more support for multidimensional arrays/vectors, for example R or Python.
How do you take an average of the coefficients across all months?
Please refer to this earlier question:
How do I perform regression by month on the same SAS data set?
The comments in the linked question provide the code to get the estimates into a data set. Then you would run PROC MEANS on the saved data set to get the averages. But you could also run the model without the BY variable to get a single set of estimates across all months rather than averaging the monthly ones. In general, it isn't common to average parameter estimates this way, except in a bootstrapping process.
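A minimal sketch of that workflow, assuming a dataset named Have with variables Month, Y, and X (these names are placeholders; adjust them to match the linked question):

proc sort data=Have;
    by Month;
run;

/* one regression per month; parameter estimates land in MonthlyEst */
proc reg data=Have outest=MonthlyEst noprint;
    by Month;
    model Y = X;
run;
quit;

/* average the monthly coefficients */
proc means data=MonthlyEst mean;
    var Intercept X;
run;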
Let's say you had a table.
OrderNumber, OrderDate, City, & Sales
The Sales field is given to you. No need to calculate it.
When you bring in this data into Power BI, say you want to analyze Sales by City (in a table format).
You can just straight away drag the two fields into the table.
No need to create a measure.
So now, suppose you created a measure, though.
Total Sales = Sum(Sales).
Is there any advantage to it, in this scenario?
Is it more efficient to use: City, Total Sales
than it is to use: City, Sales
Both display the same information.
When you drag the field into the table, what Power BI does is create an implicit measure automatically based on its best guess of what aggregation (e.g. sum, max, count) it thinks you want.
So in this case, using an explicitly defined measure or an implicitly generated measure should perform the same since it is doing the same thing in the background, i.e., SUM(TableName[Sales]).
It's generally considered best practice to use explicit measures.
You may be interested in this video discussing the differences.
I was told that it is good to always create explicit measures, and that measures are more efficient. Whether that's right or wrong I don't know, but from a policy perspective it is a good idea, since explicit measures protect you from column name changes. In general, I think the rule of thumb is to always explicitly define any measures you want to report on... but the answer above could also be correct; Stack Exchange doesn't let you choose multiple answers.
The problem I'm solving has many simple solutions, but what I need is a way to reduce the time and memory needed for the process.
On one side I have a table with a few hundred IDs, and on the other, 40 monthly tables and counting.
Each of the tables has between 500,000 and 1 million records, each for a unique ID. Each table has a few thousand variables, but I only need 10-20 of them.
I need to look up the tables to find the latest table in which a particular ID from the base table occurs and get the variable values I need.
The newest monthly table is recalculated every day, so many IDs from previous months may occur again; I cannot just create an indexed dictionary (last.id and variables) once. I also can't afford to create a new dictionary based on all the tables every day.
I came up with some ideas but I need your help to find the most efficient concept:
Concatenate all monthly tables with the variables needed, sort ascending by ID and month, and select last.id using a data step (see the sketch below). Then use a join or merge with the base table.
Problem: too much memory needed to set all the tables.
Alternatively, I used PROC APPEND in a loop; unfortunately, it is not very time or memory efficient.
Inner join with each of the tables separately in a loop:
Low memory use, but very time-consuming.
Create a dictionary based on all months besides the latest and update it every day.
Problem: a large dictionary table.
Now I'm looking for smart concepts for solving this kind of problem. Maybe hash objects... but how?
I would greatly appreciate it if you give me some feedback on this case.
Thank you!
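For reference, a minimal sketch of idea 1 above (stack, sort, take last.id), assuming the monthly tables are named Month_1 through Month_40 and each contains ID, Month, and the 10-20 variables you need; var1-var3 stand in for those, and all names are illustrative:

data stacked;
    set Month_1-Month_40;
    keep ID Month var1-var3;       /* limit columns as early as possible */
run;

proc sort data=stacked;
    by ID Month;
run;

/* keep only the record from the latest month for each ID */
data latest;
    set stacked;
    by ID Month;
    if last.ID;
run;

/* then join or merge Latest with the base table on ID */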
If someone were to write some code to generate dummy data based on your specs, they might be able to provide a more specific answer to your question. But without sample data it's hard to know the best way without trial and error.
Instead I've paraphrased some of my old answers into a more comprehensive list of things you can check.
Below are some ways to boost performance (roughly in order of performance improvement, YMMV):
Index the fields in each table that you will be joining on or using in a where clause. Not all fields are good candidates for indexes so do a little research on how to determine this before indexing.
Reduce the number of rows as early in the process as possible (ie. use a where clause to get rid of anything you don't care about).
If the joins are still time consuming, consider replacing them with hash table lookups (see the sketch after this list).
Compression. When you build the datasets make sure you use the compress=yes option if you're not already. This will shrink the size of the table on disk resulting in less disk I/O (the slowest part of querying).
If the steps are IO intensive, consider using views rather than creating temporary tables.
Make sure you are using proc append to append datasets together to reduce IO (sounds like you are, just adding this for completeness). Append the smaller dataset to the larger dataset. Alternatively use a view to 'append' them without duplicating overhead.
Limit the columns you are processing by using a keep statement (reduces IO).
Check column lengths - make sure you're not using a field length of $255 to store something that only needs a length of $20 etc...
Use the SAS SPDE (Scalable Performance Data Engine). It allows you to partition your SAS datasets into multiple files and optionally spread them across different disks. Once your SAS datasets reach a certain size you can see performance improvements. I generally tend to use SPD libnames any time a dataset grows > 10G. No additional SAS modules are required - this is enabled as part of Base SAS.
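A minimal sketch of the hash-lookup idea mentioned in the list above: load the small base table into memory once and keep only the rows of a large monthly table whose ID appears in it, with no sort or join needed. The dataset and variable names (Base, Month_40, ID, var1-var3) are assumptions for illustration:

data matched;
    if _n_ = 1 then do;
        /* load the small table of IDs into an in-memory hash */
        declare hash base(dataset: 'Base');
        base.defineKey('ID');
        base.defineData('ID');
        base.defineDone();
    end;
    set Month_40(keep=ID var1-var3);   /* stream the big table, few columns */
    if base.check() = 0;               /* keep the row only if ID is in Base */
run;

And if you want to try SPDE, the engine is chosen on the LIBNAME statement; the paths below are placeholders:

libname big spde '/primary/metadata/path'
    datapath=('/disk1/data' '/disk2/data')
    indexpath=('/disk3/index');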