I am simulating PGA tournaments using Stata. My simulation results table consists of:
column 1: the names of the 30 players in the tournament
columns 2-30,001: the 4-round results of my Monte Carlo simulations.
What I am trying to do is create a 30 x 30 matrix with the golfers' names down column 1 and across the column headers, where each cell represents the percentage of times Golfer A beat Golfer B outright across the 30,000 simulations. Is this possible to do in Stata? Thanks
I tend to say that everything is always possible in all programming languages, but some things are much more difficult to do in some languages than in others. I do not think that Stata is a great tool for what you intend to do.
You would need to provide some code examples for us to help you with your task, but here is one thing I can say. Stata has two programming languages. One is often called Stata (but is called ado on StataCorp's website) and the other is Mata. If you for some reason need to use the software Stata, you should do this in the language Mata, which has more matrix operators than ado. Also, in ado you can't store text in a matrix, so if you want to store the golfers' names you need to use Mata; alternatively, you can use row and column indexes to keep track of the golfers.
With that said, Stata is primarily a tool to perform operations on and analyze a single dataset loaded into memory (support for multiple datasets has recently been added). So to answer your question: yes, this can be done in Stata, but you are probably much better off doing it in a language with more support for multidimensional arrays/vectors, for example R or Python.
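Here is a minimal Mata sketch of one way to build that matrix. It assumes the golfers' names are in a string variable called golfer (column 1), each remaining column holds one simulation's four-round total for that golfer, and a lower total wins outright, with ties counting for neither player. The variable names and layout are assumptions, not your actual data.

* run after loading the simulation results into memory
mata:
names = st_sdata(., "golfer")           // 30 x 1 vector of golfer names
S     = st_data(., (2..st_nvar()))      // 30 x 30,000 matrix of simulated totals
n     = rows(S)
W     = J(n, n, .)                      // head-to-head win percentages (diagonal left missing)
for (a = 1; a <= n; a++) {
    for (b = 1; b <= n; b++) {
        if (a != b) {
            // lower total score wins outright
            W[a, b] = 100 * sum(S[a, .] :< S[b, .]) / cols(S)
        }
    }
}
names, strofreal(W)                     // display the names alongside the percentages
end

If you need the result back in Stata proper, you could push W into a Stata matrix with st_matrix() and attach the names as row and column stripes.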
This is mainly a question about efficiency, as I'm unfamiliar with how SAS processes datasets. A lot of code that I run reads from multiple datasets with consecutive dates (whether this is consecutive months/quarters/years depends on the datasets).
At the moment, the programs require manual updates each time they're run to ensure they're picking up the correct dates, so I would have something such as:
Data Quarters;
Set XYZ_201803
XYZ_201806
...
...
XYZ_202006;
Run;
To help tidy up the code and make it a bit less tedious, I've considered a few different ideas and had a few sent my way. One of the big ideas is to store all of the XYZ_YYYYMM datasets as a single, appended dataset, so they can be read with a simple filter on the date, as below:
Data Quarters;
Set AppendedData;
Where Date > 201812;
Run;
Which of these two options is more efficient as far as computation goes? On datasets which are typically a couple of GB in size, which would you recommend? What other pros and cons come with each idea?
Thanks for any input. :)
Most likely a single dataset and several separate datasets will be similar from a performance standpoint; there is some small overhead opening new datasets, but as long as it's not thousands of them you probably won't notice a difference.
There will be a performance hit with a single dataset, both in creating it and in using it, if you typically only use small sections of it. Separate datasets are common where people usually analyze individual quarters and rarely combine them.
Finally, if the datasets can vary from quarter to quarter in their contents (if the formats could change, if the fields can change), then keeping them separate is easier in some ways than having to manage the changes between the different periods.
That said, there's a huge organizational benefit to a single dataset, and all of the above issues can be dealt with. Think of SAS datasets as large SQL tables - they are effectively the same, and the same things that help SQL tables can help SAS. Proper sizing of columns, proper sorting of the stored data, indexing appropriately, are all important solutions. If you have a database team at your place of work, they may be able to help construct an ideal table plan. Files of several GB can definitely benefit from indexing and proper sorting, to allow users to easily get at the bits they need.
If you were to stay with separate datasets, you can use the macro language to make sure you're reading in the right datasets, assuming they're named in a consistent fashion. That might be the ideal solution if there are other reasons to stay separate - then no changes are needed each quarter.
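As a sketch of that macro approach (not your actual code), assuming the members follow the XYZ_YYYYMM naming convention, sit in the WORK library, and you want a run of consecutive quarter-end months starting from a given quarter:

%macro read_quarters(start=2018Q4, nquarters=7);
    data quarters;
        set
        /* generate one member name per quarter-end month */
        %do i = 0 %to %eval(&nquarters - 1);
            %let qtr_end = %sysfunc(intnx(qtr, %sysfunc(inputn(&start, yyq6.)), &i, end));
            xyz_%sysfunc(putn(&qtr_end, yymmn6.))
        %end;
        ;
    run;
%mend read_quarters;

%read_quarters(start=2018Q4, nquarters=7)  /* reads XYZ_201812 through XYZ_202006 */

Each quarter you would then only change the macro call (or derive the parameters from today's date) rather than editing the SET statement by hand.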
Points of interest:
From a coding standpoint
Dealing with a single, stacked data set created by appending the quarterly data sets is more efficient.
From a resource standpoint
Have to make sure you have a large enough disk to hold the single large table
Have additional storage elsewhere to hold the original pieces -- no need to clutter up the primary data disk with all of them.
A 2TB SSD is very fast, remarkably cheap, and low power and can contain a table comprised of quite a few "couple GB" pieces.
Spinning disk has lower $/TB and more capacity. I/O will be slower and consume more power.
To further improve query performance you will want to index the variables most commonly used in BY, CLASS, and WHERE statements (see the sketch after this list).
"... simple filter ..." is part of "Keep it Simple S****" (KISS)
I'm playing around with some datasets on Kaggle.com, trying to learn better practices for ETL, as I tend to get stuck with specific things with the transform part. For this question, I am dealing with the survey results from Stack Overflow 2018: https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey - specifically the LanguageWorkedWith column.
Currently I am using a combination of RapidMiner/Excel to attempt to change the data. I am not well versed enough in R or Python to solve this problem with code.
The problem with the current column is that it lists all the languages a user has chosen, separated by semicolons. I can easily split the column on the semicolons, but then one of two things occurs:
I get 31 columns, LanguageWorkedWith1 - LanguageWorkedWith31, which makes gathering a count of languages by salary unworkable.
A Cartesian effect where each row is duplicated once per chosen language. So you'll have a lot of duplicate rows, which definitely affects the integrity of the data. I have also tried using Power BI (the load destination) to remove duplicates on the respondent ID and language, but that didn't work.
Ideally I'd like to do a language-by-salary visual in Power BI, similar to what many kernels show, but I can't figure out the process for making this happen outside of code. I'm not sure how this would look exactly, but if I can split all the languages and count them, I can at least produce a simple count-by-language visual.
But I'm not sure if I can relate this back to salary given how the data is structured.
I just want to understand some transforming processes better! Appreciate any help!
The key here is to split into rows instead of columns.
So that you end up with a table that has one row per respondent-language pair.
You can keep that row expansion in its own related table in your data model so you aren't creating a giant table.
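If you want to do that expansion in Power BI itself, here is a minimal Power Query (M) sketch; it assumes the survey CSV is already loaded as survey_results_public and uses the Respondent and LanguageWorkedWith columns from the Kaggle file (the same operation is available in the Power Query UI as Split Column > By Delimiter > Split into Rows):

let
    Source   = survey_results_public,
    KeepCols = Table.SelectColumns(Source, {"Respondent", "LanguageWorkedWith"}),
    ToLists  = Table.TransformColumns(KeepCols,
                   {{"LanguageWorkedWith", each Text.Split(_, ";"), type list}}),
    Language = Table.ExpandListColumn(ToLists, "LanguageWorkedWith")
in
    Language

Load the result as a query named Language so the DAX measure below can reference it as a related table.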
From there it's pretty easy to make visuals provided you know a little bit of DAX. For example, I created an AvgSalary measure (after converting that column to a numeric type) like this:
AvgSalary =
CALCULATE (
AVERAGE ( survey_results_public[ConvertedSalary] ),
FILTER (
survey_results_public,
survey_results_public[Respondent] IN VALUES ( 'Language'[Respondent] )
)
)
and was then able to create interesting charts, such as average salary by language.
Sorry for possibly asking a very basic question. I am quite new to coding in SAS.
As far as I can see, there is a standard SAS FCMP procedure to calculate the Black-Scholes implied volatility for individual option data.
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a003193738.htm
But my problem is that I have a data table of intraday option trading data that lists all the required fields, i.e. strike price, time to expiry, equity price, interest rate and volatility. Is there any way I can code this so that I can calculate the implied volatility for each of these individual entries?
I understand that I possibly need to use some kind of loop to do so, but I am not able to understand how to pass values within procedures in SAS. Any help will be highly appreciated.
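One possible approach, sketched below, sidesteps the loop question: a DATA step already iterates over every row of a table, so a numerical search for the implied volatility can be written inline. The column names (strike, t, spot, rate, call_obs) and the input table name optdata are placeholders for whatever your table uses, t is assumed to be in years, BLKSHCLPRC is SAS's built-in Black-Scholes call-price function, and the root is found by simple bisection rather than by the method in the linked FCMP example.

data optdata_iv;
    set optdata;                    /* hypothetical input table, one row per trade */
    lo = 0.0001;                    /* lower and upper bounds for the volatility   */
    hi = 5;
    do iter = 1 to 60;              /* bisection: call price increases with sigma  */
        mid = (lo + hi) / 2;
        if blkshclprc(strike, t, spot, rate, mid) > call_obs then hi = mid;
        else lo = mid;
    end;
    implied_vol = (lo + hi) / 2;
    drop lo hi mid iter;
run;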
Is it possible to show the mathematical formula / concept behind the analysis done with SAS Enterprise?
Assuming SAS would calculate a correlation between a list of numbers -- is it possible to see what exactly SAS did from a mathematical perspective?
It is not possible to ask SAS for the mathematical formula, no. You can check the documentation; for example, this page gives many of the 'elementary statistics' formulas (like variance, UCLM, etc.).
If you need the formula behind something more complex that you can't find online, contact your SAS Support rep, and they may be able to put you in contact with the developer of that particular proc - for example, if you need to know some particular detail of how PROC GLM does something.
If you executed a task, you can in many cases ask SAS to give you the SAS code that it ran (it's usually available by clicking on the task node), but that would be something like proc freq; tables a*b; run;, not a mathematical formula per se.
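For the correlation example in the question specifically, the documented default (what PROC CORR computes unless told otherwise) is the Pearson product-moment correlation between two columns of numbers x and y:

$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}} $$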
I'm going to study the relationship between illiquidity and returns in stock markets, using the Amihud model proposed in the paper "Illiquidity and stock returns: cross-section and time-series effects" (2002). I would like to know if it is possible to automate the regression analysis. I have more than 2000 stocks in the sample and I'd like to avoid running each regression one by one, to speed the process up.
Do you know if it is possible to automate this process in Stata, or if it is possible to do it using some other statistical software (R, SAS, Matlab, Gretl, ...)? If so, how could I do that?
You should look at foreach and forval as ways of looping.
forval i = 1/3 {
regress Ystock`i' Xstock`i'
}
would be an example if and only if there are variables with names like those you indicated. If you have other names, or a different data structure, a loop would still be possible.
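If the data are instead in the more common long layout, with one row per stock and date, a sketch along these lines runs every stock's regression in one pass and collects the coefficients; the variable names stockid, ret, and illiq are placeholders for whatever your dataset uses:

* one regression of return on the Amihud illiquidity measure per stock
statsby _b _se, by(stockid) clear: regress ret illiq
list in 1/5    // one row of coefficients and standard errors per stock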