I am exporting a lot of strings from Stata to Excel.
For a single column of 3000+ rows with a different string in each, I need to check the length of each string/cell. I could do this in Stata using the length() function, but I need to be able to open the Excel file, edit a given string, and have the length update automatically in Excel.
This seems like it should be simple using the putexcel command or mata's put_formula() function, but the time to run is prohibitive.
At root, my question is about building many relative references (e.g., =LEN(A1)) in mata all at once, as opposed to one at a time.
This may make more sense after seeing the code below:
mata: b = xl()
mata: b.create_book("Formula_Test", "Formula_Test", "xlsx")
mata: b.load_book("Formula_Test")
*Put some strings in column 1
mata: b.put_string(1, 1, "asfas")
mata: b.put_string(2, 1, "sfhds")
mata: b.put_string(3, 1, "qwrq")
mata: b.put_string(4, 1, "dgsdgsdgsdgs")
*Formula - export one-at-a-time
*This works, but is slow
foreach i of numlist 1/4{
mata: b.put_formula(`i', 2, "LEN(A`i')")
}
*Formula - export all at once with relative reference
*This would be faster, but throws error
mata: b.put_formula((1,4), 3, "LEN(INDIRECT("C[-2]",FALSE))")
When I run the last line, I get an error:
invalid expression
r(3000);
Is there an efficient way to write an entire column or row of Excel formulas using mata, with relative references?
The mata function put_formula() only accepts scalars for rows and columns. Note that you also need to use compound double quotes in its string matrix argument.
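A minimal sketch of why the compound double quotes matter, assuming nothing beyond built-in Mata string functions (the variable name f is just for illustration). With ordinary double quotes the string would end at the quote before C[-1], which is consistent with the "invalid expression" error in the question:

```stata
mata:
// Plain double quotes would terminate the string at the quote before C[-1],
// so the formula is wrapped in compound double quotes instead.
f = `"LEN(INDIRECT("C[-1]",FALSE))"'
f                       // displays the formula with its inner quotes intact
strpos(f, char(34))     // position of the first embedded double quote
end
```

The same string can then be passed as the last argument of put_formula().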
Looping in Mata is generally faster than looping in Stata:
mata:
for (i = 1; i <= 4; i++) {
b.put_formula(i, 2, `"LEN(INDIRECT("C[-1]",FALSE))"')
}
end
Nevertheless, despite the limitation of having to use scalars as arguments for rows and columns in put_formula(), a loop is in fact not necessary. This is because one can specify a string matrix J of constants as the final argument.
Indeed, the following does the same in seconds:
mata:
k = J(3000, 1, `"LEN(INDIRECT("C[-1]",FALSE))"')
b.put_formula(1, 2, k)
end
In this way, the 3000 x 1 string matrix k is written in a single call, starting at cell B1 of the spreadsheet. Because it has 3000 rows, it fills every cell down to B3000.
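The same single-call idea also works when each row needs its own explicit reference such as LEN(A1), LEN(A2), and so on. The following is a sketch only: the file name Formula_Test_Rows and sheet name Sheet1 are made up for illustration, and create_book() will complain if that file already exists.

```stata
mata:
n = 3000                                   // number of rows to fill
// Build "LEN(A1)", "LEN(A2)", ..., "LEN(A3000)" as a string column vector
k = "LEN(A" :+ strofreal(1::n) :+ ")"

b = xl()
b.create_book("Formula_Test_Rows", "Sheet1", "xlsx")   // hypothetical names
b.put_formula(1, 2, k)                     // one call fills B1:B3000
end
```

Because the references are spelled out per row, INDIRECT() is not needed here, at the cost of building the vector first.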
This answer is secondary but may be useful to someone.
The inefficiency in the code in the question -- looping through a numlist and writing the formula in mata one cell at a time -- comes partially from the use of a Stata loop (as Pearly Spencer pointed out and corrected). But a bigger issue is the number of times mata has to write individual cells when the example is expanded from 4 cells to several thousand.
If you can avoid looping and writing many cells individually, -putexcel- and mata's b.put_formula() are not dramatically different in speed in most applications. If you are writing cells in a single column, row, or matrix of cells, and can write them all at once, either option will be fast. A -putexcel- example:
*A -putexcel- example
mata: b = xl()
mata: b.create_book("Formula_Test", "Formula_Test", "xlsx")
putexcel set "Formula_Test", sheet("Formula_Test") modify
putexcel B1:B30000 = formula(`" =LEN(INDIRECT("C[-1]",FALSE)) "')
For 30,000 cells in a single column, -putexcel- took 37 seconds.
Using Pearly Spencer's J matrix approach in mata took 36 seconds.
The important point is: if you are writing a formula to many cells, try to consolidate it into blocks that can be written together as matrices, rather than looping over all cells. This will give you the biggest speed gains; using mata instead of -putexcel- will help, but will provide only a second-order improvement. Even in mata it will take a long time to write individually to thousands of cells.
I have a list of circumstances and effects:
I want to generate a matrix containing the values of the betas. I am going to run the loop 10 times, because I am in fact going to bootstrap my observations.
So far I have tried:
local circumstances height weight
local effort training diet
foreach i in 1 10 {
reg outcome `circumstances' `effort'
* store in column i the values of betas of circumstances
* store in column i the values of betas of effort
}
Does anyone know what the code should look like in order to store those values?
Thank you
The pseudocode would first store in "column 1" the first lot of betas and then overwrite them (column 1) with the second lot of betas. Then it would do the same again for column 10 with the first lot of betas and the second lot of betas. That is a long way from anything that makes sense. Nothing in your pseudocode takes bootstrap samples from the dataset, although perhaps you are intending to add code for that later.
Stata doesn't really work with any idea of column numbers, although the idea makes sense to Mata.
Unless there are very specific reasons -- which you would need to spell out -- there is no need to write your own code ab initio for bootstrapping, as the whole point of bootstrap is to do that for you.
Here is complete code for a reproducible example of bootstrapping a silly regression:
sysuse auto, clear
bootstrap b_weight=_b[weight] b_price=_b[price] , reps(1000) seed(2803) : regress mpg weight price
See also the help for bootstrap to learn about its other options, including saving().
10 repetitions would be regarded as absurdly small for the number of bootstrap samples.
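As a sketch of the saving() option mentioned above, the following stores one row of betas per bootstrap replication in a file and then loads it. The temporary file name boot is an assumption for illustration, and 200 replications are used only to keep the example quick:

```stata
sysuse auto, clear
tempfile boot                         // hypothetical temporary results file

* One row per bootstrap replication is written to the file named in saving()
bootstrap b_weight=_b[weight] b_price=_b[price], ///
    reps(200) seed(2803) saving("`boot'", replace) : ///
    regress mpg weight price

use "`boot'", clear                   // the stored betas are now the data in memory
describe
summarize b_weight b_price
```

Each variable in the saved dataset holds the replicates of one coefficient, which is the "matrix of betas" asked for, only stored as a dataset rather than a matrix.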
I am using FORTRAN77 as a third-party language with the ANSYS computation software. Here we can write entire rows and columns to files during I/O operations. However, I am not able to move the cursor back to the first row and then write column-wise for every column in the 2D array. Unfortunately, all the data is written into a single column. I need to know what I can use in the place marked XXX:
*CFOPEN, ACT_STR, CSV,,APPEND
*DO,INF,1,2*S,1
*VWRITE, S0(1,INF),
(XXX,F10.2,',')
*CFCLOS
You can try transposing the matrix and then printing it row-wise. You can write a small subroutine that does the transpose for S0.
I have a data table that has this format:
and I want to plot temperature against time. Any idea how to do that?
This can be done in a TERR data function. I don't know how comfortable you are with integrating Spotfire with TERR; there is an intro video here, for instance (the demo starts at about minute 7):
https://www.youtube.com/watch?v=ZtVltmmKWQs
With that in mind, I wrote the script without loading any library, so it is quite verbose and explicit, but hopefully simpler to follow step by step. I am sure there is a more elegant way, and there are better ways of making it flexible with column names, but this is a start.
Your input will be a data table (dt, the original data) and the output a new data table (dt.out, the transformed data). All column names (and some values) are addressed explicitly in the script (so if you change them it won't work).
#remove the []
dt$Values=gsub('\\[|\\]','',dt$Values)
#separate into two different data frames, one for time and one for temperature
dt.time=dt[dt$Description=='time',]
dt.temperature=dt[dt$Description=='temperature',]
#split the columns we want to separate into a list of vectors
dt2.time=strsplit(as.character(dt.time$Values),',')
dt2.temperature=strsplit(as.character(dt.temperature$Values),',')
#rearrange times
names(dt2.time)=dt.time$object
dt2.time=stack(dt2.time) #stack vectors
dt2.time$id=c(1:nrow(dt2.time)) #assign running id for merging later
colnames(dt2.time)[colnames(dt2.time)=='values']='time'
#rearrange temperatures
names(dt2.temperature)=dt.temperature$object
dt2.temperature=stack(dt2.temperature) #stack vectors
dt2.temperature$id=c(1:nrow(dt2.temperature)) #assign running id for merging later
colnames(dt2.temperature)[colnames(dt2.temperature)=='values']='temperature'
#merge time and temperature
dt.out=merge(dt2.time,dt2.temperature,by=c('id','ind'))
colnames(dt.out)[colnames(dt.out)=='ind']='object'
dt.out$time=as.numeric(dt.out$time)
dt.out$temperature=as.numeric(dt.out$temperature)
Gaia
Because all of the example rows you've shown here contain exactly four list items and you haven't specified otherwise, I'll assume that all of the data fits this format.
With this assumption, it becomes pretty trivial, albeit a little messy, to split the values out into columns using the RXReplace() expression function.
You can create four calculated columns, each with an expression like:
Int(RXReplace([values],"\\[([\\d\\-]+),([\\d\\-]+),([\\d\\-]+),([\\d\\-]+)]","\\1",""))
The third argument "\\1" determines which number in the list to extract. Backslashes are doubled ("escaped") per the requirements of the RXReplace() function.
Note that this example assumes the numbers are all whole numbers. If you have decimals, you'd need to adjust each "phrase" of the regular expression to ([\\d\\-\\.]+), and you'd need to wrap the expression in Real() rather than Int() (if you leave the wrapper out, the result will be a String type, which could cause confusion later on when working with the data).
Once you have the four columns, you'll be able to unpivot to get the data easily.
I have been struggling to write efficient code to estimate monthly weighted mean portfolio returns.
I have the following variables:
firm stock returns (ret)
month1, year1 and date
portfolio (port1): this defines portfolio of the firm stock returns
market capitalisation (mcap): to estimate weights (by month1 year1 port1)
I want to calculate weighted returns for each month and portfolio weighted by market cap. (mcap) of each firm.
I have written the following code, which works without fail but takes ages and is highly inefficient:
foreach x in 11 12 13 21 22 23 {
display `x'
forvalues y = 1980/2010 {
display `y'
forvalues m = 1/12 {
display `m'
tempvar tmp_wt tmp_tm tmp_p
egen `tmp_tm' = total(mcap) if month1==`m' & year1==`y' & port1 ==`x'
gen `tmp_wt' = mcap/`tmp_tm' if month1==`m' & year1==`y' & port1 ==`x'
gen `tmp_p' = ret*`tmp_wt' if month1==`m' & year1==`y' & port1 ==`x'
gen port_ret_`m'_`y'_`x' = `tmp_p'
}
}
}
The data look as shown in the image: [image: Data for value weighted portfolio return]
This does appear to be a casebook example of how to do things as slowly as possible, except that naturally you are not doing that on purpose. All it lacks is a loop over observations to calculate totals. So, the good news is that you should indeed be able to speed this up.
It seems to boil down to
gen double wanted = .
bysort port1 year1 month1 : replace wanted = sum(mcap)
by port1 year1 month1 : replace wanted = (mcap * ret) / wanted[_N]
Principle. To get a sum in a single scalar, use summarize, meanonly rather than using egen, total() to put that scalar into a variable repeatedly, but use sum() with by: to get group sums into a variable when that is what you need, as here. sum() returns cumulative sums, so you want the last value of the cumulative sum.
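A small worked sketch of that principle on made-up data (the four observations below are invented for illustration; the variable names follow the question). The last two commands go one step beyond the code above, on the assumption that the portfolio return wanted in the end is the sum of the weighted contributions within each group:

```stata
clear
input int port1 int year1 byte month1 double mcap double ret
11 1980 1 100  0.10
11 1980 1 300  0.02
12 1980 1  50  0.04
12 1980 1  50 -0.02
end

* Cumulative sum within group; its last value is the group total
bysort port1 year1 month1 : gen double wanted = sum(mcap)
by port1 year1 month1 : replace wanted = (mcap * ret) / wanted[_N]

* Assumption: the portfolio return is the sum of the weighted contributions
by port1 year1 month1 : gen double port_ret = sum(wanted)
by port1 year1 month1 : replace port_ret = port_ret[_N]

list, sepby(port1)
```

For these toy numbers, portfolio 11 comes out at (100 * 0.10 + 300 * 0.02) / 400 = 0.04 and portfolio 12 at (50 * 0.04 - 50 * 0.02) / 100 = 0.01, with no loops and no temporary variables.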
Principle. Loops (here using foreach) are not needed when a groupwise calculation can be done under the aegis of by:. That is a powerful construct which Stata programmers need to learn.
Principle. Creating lots of temporary variables, here 6 * 31 * 12 * 3 = 6696 of them, is going to slow things down and use more memory than is needed. Each time you execute tempvar and follow with generate commands, there are three more temporary variables, all the size of a column in a dataset (that's what a variable is in Stata), but once they are used they are just left in memory and never looked at again. It's a subtlety with temporary variables that a tempvar assigns a new name every time, but it should be clear that generate creates a new variable every time; generate will never overwrite an existing variable. The temporary variables would all be dropped at the end of a program, but by the end of that program, you are holding a lot of stuff unnecessarily, possibly the size of the dataset multiplied by about one thousand. If that temporarily expanded dataset could not all fit in memory, you flip Stata into a crawl.
Principle. Using if obliges Stata to check each observation in turn; in this case most are irrelevant to the particular intersection of loops being executed and you make Stata check almost all of the data set (a fraction of 2231/2232, almost 1) irrelevantly while doing each particular calculation for 1/2232 of the dataset. If you have more years, or more portfolios, the fraction looked at irrelevantly is even higher.
In essence, Stata will obey your instructions (and not try any kind of optimization -- your code is interpreted utterly literally) but by: would give the cross-combinations much more rapidly.
Note. I don't know how big or how close to zero these numbers will get, so I gave you a double. For all I know, a float would work fine for you.
Comment. I guess you are being influenced by coding experience in other languages where creating variables means something akin to x = 42 to hold a constant. You could do that in Stata too, with scalars or local or global macros, not to mention Mata. Remember that a new variable in Stata is an entire new column in the dataset, regardless of whether it is holding a constant or different values in each observation. You will get what you ask for, but it is more like getting an array every time. Again, it seems that you want as an end result just one new variable, and you do not in fact need to create any others temporarily at all.
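To illustrate the difference between holding a constant and creating a variable, here is a sketch using the auto dataset shipped with Stata; the names mean_mpg and n_obs are arbitrary:

```stata
sysuse auto, clear

* A constant belongs in a scalar or a local macro, not in a new variable
summarize mpg, meanonly
scalar mean_mpg = r(mean)     // named scalar holding one number
local  n_obs    = r(N)        // local macro holding one number

display "mean mpg = " %6.3f scalar(mean_mpg) " over `n_obs' cars"
display "variables in the dataset: " c(k)   // unchanged: no column was added
```

Neither the scalar nor the macro adds a column to the dataset, whereas each generate call would.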
I have built a model which basically does the following:
run regressions on single time period
organise stocks into quantiles based on coefficient from linear regression
statsby to calculate portfolio returns for stocks based on quantile (averaging all quantile x returns)
store quantile 1 portfolio and quantile 10 return for the last period
The pair of variables are just the final entries in the timeframe. However, I intend to extend the single time period to rolling through a large timeframe, in essence:
for i in timeperiod {
organise stocks into quantiles based on coefficient from linear regression
statsby to calculate portfolio returns for stocks based on quantile (averaging all quantile x returns)
store quantile 1 portfolio and quantile 10 return for the last period
}
The data I'm after is the portfolio 1 and 10 returns for the final day of each timeframe (built using the previous 3 years of data). This should result in a time series of returns (my total data covers 60 years, minus the 3 years needed to build the first result, so 57 years), which I can then regress against each other.
regress portfolio 1 against portfolio 10
I am coming from an R background, where storing a variable in a vector is very simple, but I'm not quite sure how to go about this in Stata.
In the end I want a 2xn matrix (a separate dataset) of numbers, each pair being results of one run of a rolling regression. Sorry for the very vague description, but it's better than explaining what my model is about. Any pointers (even if it's to the right manual entry) will be much appreciated. Thank you.
EDIT: The actual data I want to store is just a variable. I made it confusing by adding regressions. I've changed the code to more represent what I want.
Sounds like a case for either rolling or statsby, depending on exactly what you want to do. These are prefix commands that you put in front of your regression command. rolling or statsby will take care of both the looping and the storing of results for you.
If you want maximum control, you can do the loop yourself with forvalues or foreach and store the results in a separate file using post. In fact, if you look inside rolling and statsby (using viewsource) you will see that this is what these commands do internally.
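As a rough sketch of the prefix-command route, here is statsby on the auto dataset shipped with Stata, with foreign standing in for whatever defines your groups or periods (the result names b_weight and b_cons are arbitrary):

```stata
sysuse auto, clear

* statsby runs the command once per group and, because of the clear option,
* replaces the data in memory with one row of stored results per group
statsby b_weight=_b[weight] b_cons=_b[_cons], by(foreign) clear : ///
    regress mpg weight

list
```

rolling follows the same pattern but needs data that have been tsset and a window() option, so it is not shown here.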
Unlike R, Stata operates with only one major rectangular object in memory, called (ta-da!) the data set. (It has a multitude of other stuff, of course, but that stuff can rarely be addressed as easily as the data set that was brought into memory with use). Since your ultimate goal is to run a regression, you will either need to create an additional data set, or awkwardly add the data to the existing data set. Given that your problem is sufficiently custom, you seem to need a custom solution.
Solution 1: create a separate data set using post (see help).
use my_data, clear
postfile topost int(time_period) str40(portfolio) double(return_q1 return_q10) ///
using my_derived_data, replace
* 1. topost is a placeholder name
* 2. I have no clue what you mean by "storing the portfolio", so you'd have to fill in
* 3. This will create the file my_derived_data.dta,
* which of course you can name as you wish
* 4. The triple slash is a continuation comment: the code is continued on next line
levelsof time_period, local( allyears )
* 5. This will create a local macro allyears
* that contains all the values of time_period
foreach t of local allyears {
regress outcome x1 x2 x3 if time_period == `t', robust
* 6. the opening and closing single quotes are references to Stata local macros
* Here, I am referring to the cycle index t
organise_stocks_into_quantiles_based_on_coefficient_from_linear_regression
* this isn't making huge sense for me, so you'll have to put your code here
* don't forget inserting if time_period == `t' as needed
* something like this:
predict yhat`t' if time_period == `t', xb
xtile decile`t' = yhat`t' if time_period == `t', n(10)
calculate_portfolio_returns_for_stocks_based_on_quantile
forvalues q=1/10 {
* do whatever if time_period == `t' & decile`t' == `q'
}
* store quantile 1 portfolio and quantile 10 return for the last period
* again I am not sure what you mean and how to do that exactly
* so I'll pretend it is something like
ratio change / price if time_period == `t' , over( decile`t' )
post topost (`t') ("whatever text describes the time `t' portfolio") ///
(_b[_ratio_1:1]) (_b[_ratio_1:10])
* the last two sets of parentheses may contain whatever numeric answer you are producing
}
postclose topost
* 7. close the file you are creating
use my_derived_data, clear
tsset time_period, year
newey return_q10 return_q1, lag(3)
* 8. just in case the business cycles have about 3 years of effect
exit
* 9. you always end your do-files with exit
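Since the template above contains placeholders, here is a stripped-down sketch of the same postfile / post / postclose pattern that should run as is on the auto dataset shipped with Stata. The grouping variable foreign stands in for time_period, and the posted quantities (a slope and its standard error) are stand-ins for the portfolio returns:

```stata
sysuse auto, clear
tempname topost                 // handle for the results file
tempfile results                // hypothetical temporary file for the results

postfile `topost' byte(foreign) double(b_weight se_weight) ///
    using "`results'", replace

levelsof foreign, local(groups)
foreach g of local groups {
    quietly regress mpg weight if foreign == `g'
    post `topost' (`g') (_b[weight]) (_se[weight])   // one row per pass
}
postclose `topost'

use "`results'", clear
list
```

Using tempname and tempfile avoids clashes with existing names and files; a named file works the same way if the results should be kept.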
Solution 2: keep things within your current data set. If the above code looks awkward, you can instead create a weird centaur of a data set with both your original stocks and the summaries in it.
use my_data, clear
gen int collapsed_time_period = .
gen double collapsed_return_q1 = .
gen double collapsed_return_q10 = .
* 1. set up placeholders for your results
levelsof time_period, local( allyears )
* 2. This will create a local macro allyears
* that contains all the values of time_period
local T : word count `allyears'
* 3. I now use the local macro allyears as is
* and count how many distinct values there are of time_period variable
forvalues n=1/`T' {
* 4. my cycle now only runs for the numbers from 1 to `T'
local t : word `n' of `allyears'
* 5. I pull the `n'-th value of time_period
** computations as in the previous solution
replace collapsed_time_period = `t' in `n'
replace collapsed_return_q1 = (compute) in `n'
replace collapsed_return_q10 = (compute) in `n'
* 6. I am filling the pre-arranged variables with the relevant values
}
tsset collapsed_time_period, year
* 7. this will likely complain about missing values, so you may have to fix it
newey collapsed_return_q10 collapsed_return_q1, lag(3)
* 8. just in case the business cycles have about 3 years of effect
exit
* 9. you always end your do-files with exit
I avoided statsby as it overwrites the data set in memory. Remember that unlike R, Stata keeps only one data set in memory at a time (frames, added in Stata 16, relax this), so my preference is to avoid excessive I/O operations, as they may well be the slowest part of the whole thing if you have a data set of 50+ Mbytes.
I think you're looking for the estout command to store the results of the regressions.