Recoding 0-3 values - stata

I have speech data set so here is how it is coded now:
Hypernasality (0-3)
Speech understandibility (0-3)
Speech Acceptability (0-3)
Where 0 is good 3 is severe deviation from normal speech.
Hypnasality (0 and 1)
Audible Air Emission (0 and 1)
Where 0 is none and 1 is yes
I recoded my data this way:
foreach j in speechunderstandibility speechacceptability hypernasality {
recode `j' (0 = 3) (3 = 0) (1 = 2) (2 = 1), gen (`j'_1)
}
foreach j in hyponasality audibleemission {
recode `j' (0 = 1) (1 = 0), gen (`j'_1)
}
However, when I run my regression it gives me counter-intuitive results.
My dependent variable is speech outcome and beta of interest is cleft severity.
Results after recoding would say" Cleft severity improves speech but cleft surgery decreases it"
If I leave it the way it is coded then all 5 outcomes mentioned above have different outcomes.
I need them to go in one direction so I can build a summary index.

It may be a raw data issue. I would make sure that all the data points were entered correctly, because at some stage of data entry, the entry of the 0-3 may have gotten mixed up. So it may be a mix up during data-entry.
Secondly, if you're really sure of the data entry (this sounds like a data entry issue to me, or like Nick Cox says, a data interpretation issue), then perhaps try using "gen" and "replace" commands to recode your variables inside or outside of a loop.
When I have trouble with a looped command, I dissect each part and raw code it.

Related

How to hold estimations in memory and write to file in bulk

to speed up the code, I would like to export results to a file for every 10 regressions. Is it possible to do something like the following?
forvalues i = 1(1)20{
ppmlhdfe y ${varlist`i'}, absorb(year) cluster(year)
estimates store result`i'
if mod(`i', 10) == 0 {
outreg2 result* using "$outputdir1\results.csv"
est clear
}
}
In this pseudo-code, I mean that we save results for every 10 regressions, and clear the estimations in memory move on to the next 10 regressions. Is it possible in Stata?
You can certainly do things conditional on the loop index as in
if mod(`i', 10) == 0 {
}
I am not sure that I understand most of the rest of the code. outreg2 is being asked to put stuff in exactly the same file; I don't use outreg2 and have no idea what makes sense there. Similarly, on the face of it, you are storing estimates before you have them.
What is the real saving here? Why execute regressions if you don't wnat to see the results?

How to prepare the multilevel multivalued training dataset in python

I am a beginner in machine learning. My academic project involves detecting human posture from acceleration and gyro data. I am stuck at the beginning itself. My accelerometer data has x,y,z values and gyro also has x,y,z values stored in file acc.csv and gyro.csv. I want to classify the 'standing', 'sitting', 'walking' and 'lying' position. The idea is to train the machine using some ML algorithm (supervised) and then throw a new acc + gyro data set to identify what this new dataset predict (what the subject is doing at present). I am facing the following problems--
Constructing a training dataset -- I think my activities will be dependent variable, and acc & gyro axis readings will be independent. So if I like to combine it in single matrix with each element of the matrix again has it's own set of acc and gyro value [Something like main and sub matrix], how can I do that? or is there any alternative idea to do the same?
How can I take the data of multiple activities with multiple readings in a single training matrix,
I mean 10 walking data each with it's own acc(xyz) and gyro (xyz) + 10 standing data each with it's own acc(xyz) and gyro (xyz) + 10 sitting data each with it's own acc(xyz) and gyro (xyz) and so on.
Each data file has different number of records and time stamp, how to bring them into a common platform.
I know I am asking very basic things but these are the confusion part nobody has clearly explained to me. I am feeling like standing in front of a big closed door, inside very interesting things are happening where I cannot participate at this moment with my limited knowledge. My mathematical background is high school level only. Please help.
I have gone through some projects on activity recognition in Github. But they are way too complicated for a beginner like me.
import pandas as pd
import os
import warnings
from sklearn.utils import shuffle
warnings.filterwarnings('ignore')
os.listdir('../input/testtraindata/')
base_train_dir = '../input/testtraindata/Train_Set/'
#Train Data
train_data = pd.DataFrame(columns = ['activity','ax','ay','az','gx','gy','gz'])
train_folders = os.listdir(base_train_dir)
for tf in train_folders:
files = os.listdir(base_train_dir+tf)
for f in files:
df = pd.read_csv(base_train_dir+tf+'/'+f)
train_data = pd.concat([train_data,df],axis = 0)
train_data = shuffle(train_data)
train_data.reset_index(drop = True,inplace = True)
train_data.head()
The Data Set
Problem in Train_set
Surprisingly if I remove the last 'gz' from
train_data = pd.DataFrame(columns =['activity','ax','ay','az','gx','gy','gz'])
Everything is working fine.
You have the data labeled? --> position of x,y,z... = positure?
I have no clue about the values (as I have not seen the dataset, and have no clue about positions, acc or gyro), but Im guessing you should have a dataset within a matrise with x, y, z as categories and a target category ;"positions".
If you need all 6 (3 from one csv and 3 from the other) to define the positions you can make 6 categories + positions.
Something like : x_1, y_1 z_1 , x_2, y_2, and z_2 + position label ("position" category).
You can also make each position an own category with 0/1 as true/false.
"sitting" , "walking" etc... and have 0 and 1 as the values in the columns.
Is the timestamp of any importance towards the position? If it is not a feature of importance I would just drop it. If it is important in some way, you might want to bin them.
Here is a beginners guide from Medium in which you can see a bit how to preprocess your data. It also shows one hot encoding :)
https://medium.com/hugo-ferreiras-blog/dealing-with-categorical-features-in-machine-learning-1bb70f07262d
Also try googling Preprocessing your data, then you will probably find the right recipe

'Where' clause with graphite queries

I hold a lot of time-series metric data in graphite. Let's assume I have 2 different metrics X and Y that represent the prices of two different products.
I would now like to query the data from an app and do something like this (of course, that's a pseudo-sql):
Select all points of metric X where value is smaller than value of metric Y during a time frame
I couldn't find any reasonable way, without writing my own script or some map-reduce job, to do it.
I could, for example, plot the graphs and try to understand it visually pretty easily. But it won't be usable for an app to use.
I also thought about using some functions like currentBelow and currentAbove but it doesn't look like I can provide two different series to compare but only one specific integer per the entire period of time.
Disclaimer: I hope there is some better solution ;).
The solution is:
Metrics:
product.X.price
product.Y.price
create a filter mask - values 0 is not greater, 1 is greater
diffSeries - to set bound 0 - above points we want to include, below to exclude.
diffSeries(product.X.price, product.Y.price)
removeBelowValue - to remove all excluded - below 0,
removeBelowValue(diffSeries(product.X.price, product.Y.price), 0)
divideSeries - divide above series by itself to create a mask of 0 and 1 - fortunately in graphite's functions 0/0 = 0 (sic!),
divideSeries(removeBelowValue(diffSeries(product.X.price, product.Y.price), 0), removeBelowValue(diffSeries(product.X.price, product.Y.price), 0))
filter out using prepared mask with multiplySeries, since
0 * datapoint = 0
1 * datapoint = datapoint
multiplySeries(product.X.price, divideSeries(removeBelowValue(diffSeries(product.X.price, product.Y.price), 0), removeBelowValue(diffSeries(product.X.price, product.Y.price), 0)))

Stata: estimating monthly weighted mean for portfolio

I have been struggling to write optimal code to estimate monthly, weighted mean for portfolio returns.
I have following variables:
firm stock returns (ret)
month1, year1 and date
portfolio (port1): this defines portfolio of the firm stock returns
market capitalisation (mcap): to estimate weights (by month1 year1 port1)
I want to calculate weighted returns for each month and portfolio weighted by market cap. (mcap) of each firm.
I have written following code which works without fail but takes ages and is highly inefficient:
foreach x in 11 12 13 21 22 23 {
display `x'
forvalues y = 1980/2010 {
display `y'
forvalues m = 1/12 {
display `m'
tempvar tmp_wt tmp_tm tmp_p
egen `tmp_tm' = total(mcap) if month1==`m' & year1==`y' & port1 ==`x'
gen `tmp_wt' = mcap/`tmp_tm' if month1==`m' & year1==`y' & port1 ==`x'
gen `tmp_p' = ret*`tmp_wt' if month1==`m' & year1==`y' & port1 ==`x'
gen port_ret_`m'_`y'_`x' = `tmp_p'
}
}
}
Data looks as shown in the image:![Data for value weighted portfolio return][1]
This does appear to be a casebook example of how to do things as slowly as possible, except that naturally you are not doing that on purpose. All it lacks is a loop over observations to calculate totals. So, the good news is that you should indeed be able to speed this up.
It seems to boil down to
gen double wanted = .
bysort port1 year month : replace wanted = sum(mcap)
by port1 year month : replace wanted = (mcap * ret) / wanted[_N]
Principle. To get a sum in a single scalar, use summarize, meanonly rather than using egen, total() to put that scalar into a variable repeatedly, but use sum() with by: to get group sums into a variable when that is what you need, as here. sum() returns cumulative sums, so you want the last value of the cumulative sum.
Principle. Loops (here using foreach) are not needed when a groupwise calculation can be done under the aegis of by:. That is a powerful construct which Stata programmers need to learn.
Principle. Creating lots of temporary variables, here 6 * 31 * 12 * 3 = 6696 of them, is going to slow things down and use more memory than is needed. Each time you execute tempvar and follow with generate commands, there are three more temporary variables, all the size of a column in a dataset (that's what a variable is in Stata), but once they are used they are just left in memory and never looked at again. It's a subtlety with temporary variables that a tempvar assigns a new name every time, but it should be clear that generate creates a new variable every time; generate will never overwrite an existing variable. The temporary variables would all be dropped at the end of a program, but by the end of that program, you are holding a lot of stuff unnecessarily, possibly the size of the dataset multiplied by about one thousand. If that temporarily expanded dataset could not all fit in memory, you flip Stata into a crawl.
Principle. Using if obliges Stata to check each observation in turn; in this case most are irrelevant to the particular intersection of loops being executed and you make Stata check almost all of the data set (a fraction of 2231/2232, almost 1) irrelevantly while doing each particular calculation for 1/2232 of the dataset. If you have more years, or more portfolios, the fraction looked at irrelevantly is even higher.
In essence, Stata will obey your instructions (and not try any kind of optimization -- your code is interpreted utterly literally) but by: would give the cross-combinations much more rapidly.
Note. I don't know how big or how close to zero these numbers will get, so I gave you a double. For all I know, a float would work fine for you.
Comment. I guess you are being influenced by coding experience in other languages where creating variables means something akin to x = 42 to hold a constant. You could do that in Stata too, with scalars or local or global macros, not to mention Mata. Remember that a new variable in Stata is an entire new column in the dataset, regardless of whether it is holding a constant or different values in each observation. You will get what you ask for, but it is more like getting an array every time. Again, it seems that you want as an end result just one new variable, and you do not in fact need to create any others temporarily at all.

How do I store regression results from loops in Stata?

I have built a model which basically does the following:
run regressions on single time period
organise stocks into quantiles based on coefficient from linear regression
statsby to calculate portfolio returns for stocks based on quantile (averaging all quantile x returns)
store quantile 1 portolio and quantile 10 return for the last period
The pair of variables are just the final entries in the timeframe. However, I intend to extend the single time period to rolling through a large timeframe, in essence:
for i in timeperiod {
organise stocks into quantiles based on coefficient from linear regression
statsby to calculate portfolio returns for stocks based on quantile (averaging all quantile x returns)
store quantile 1 portolio and quantile 10 return for the last period
}
The data I'm after is the portfolio 1 and 10 returns for the final day of each timeframe (built using the previous 3 years of data). This should result in a time series (of my total data 60 -3 years to build first result, so 57 years) of returns which I can then regress against eachother.
regress portfolio 1 against portfolio 10
I am coming from an R background, where storing a variable in a vector is very simple, but I'm not quite sure how to go about this in Stata.
In the end I want a 2xn matrix (a separate dataset) of numbers, each pair being results of one run of a rolling regression. Sorry for the very vague description, but it's better than explaining what my model is about. Any pointers (even if it's to the right manual entry) will be much appreciated. Thank you.
EDIT: The actual data I want to store is just a variable. I made it confusing by adding regressions. I've changed the code to more represent what I want.
Sounds like a case for either rolling or statsby, depending on what you exactly want to do. These are prefix commands, that you prefix to your regression model. rolling or statsby will take care of both the looping and storing of results for you.
If you want maximum control, you can do the loop yourself with forvalues or foreach and store the results in a separate file using post. In fact, if you look inside rolling and statsby (using viewsource) you will see that this is what these commands do internally.
Unlike R, Stata operates with only one major rectangular object in memory, called (ta-da!) the data set. (It has a multitude of other stuff, of course, but that stuff can rarely be addressed as easily as the data set that was brought into memory with use). Since your ultimate goal is to run a regression, you will either need to create an additional data set, or awkwardly add the data to the existing data set. Given that your problem is sufficiently custom, you seem to need a custom solution.
Solution 1: create a separate data set using post (see help).
use my_data, clear
postfile topost int(time_period) str40(portfolio) double(return_q1 return_q10) ///
using my_derived_data, replace
* 1. topost is a placeholder name
* 2. I have no clue what you mean by "storing the portfolio", so you'd have to fill in
* 3. This will create the file my_derived_data.dta,
* which of course you can name as you wish
* 4. The triple slash is a continuation comment: the code is coninued on next line
levelsof time_period, local( allyears )
* 5. This will create a local macro allyears
* that contains all the values of time_period
foreach t of local allyears {
regress outcome x1 x2 x3 if time_period == `t', robust
* 6. the opening and closing single quotes are references to Stata local macros
* Here, I am referring to the cycle index t
organise_stocks_into_quantiles_based_on_coefficient_from_linear_regression
* this isn't making huge sense for me, so you'll have to put your code here
* don't forget inserting if time_period == `t' as needed
* something like this:
predict yhat`t' if time_period == `t', xb
xtile decile`t' = yhat`t' if time_period == `t', n(10)
calculate_portfolio_returns_for_stocks_based_on_quantile
forvalues q=1/10 {
* do whatever if time_period == `t' & decile`t' == `q'
}
* store quantile 1 portolio and quantile 10 return for the last period
* again I am not sure what you mean and how to do that exactly
* so I'll pretend it is something like
ratio change / price if time_period == `t' , over( decile`t' )
post topost (`t') ("whatever text describes the time `t' portfolio") ///
(_b[_ratio_1:1]) (_b[_ratio_1:10])
* the last two sets of parentheses may contain whatever numeric answer you are producing
}
postclose topost
* 7. close the file you are creating
use my_derived_data, clear
tsset time_period, year
newey return_q10 return_q1, lag(3)
* 8. just in case the business cycles have about 3 years of effect
exit
* 9. you always end your do-files with exit
Solution 2: keep things within your current data set. If the above code looks awkward, you can instead create a weird centaur of a data set with both your original stocks and the summaries in it.
use my_data, clear
gen int collapsed_time = .
gen double collapsed_return_q1 = .
gen double collapsed_return_q10 = .
* 1. set up placeholders for your results
levelsof time_period, local( allyears )
* 2. This will create a local macro allyears
* that contains all the values of time_period
local T : word count `allyears'
* 3. I now use the local macro allyears as is
* and count how many distinct values there are of time_period variable
forvalues n=1/`T' {
* 4. my cycle now only runs for the numbers from 1 to `T'
local t : word `n' of `allyears'
* 5. I pull the `n'-th value of time_period
** computations as in the previous solution
replace collapsed_time_period = `t' in `n'
replace collapsed_return_q1 = (compute) in `n'
replace collapsed_return_q10 = (compute) in `n'
* 6. I am filling the pre-arranged variables with the relevant values
}
tsset collapsed_time_period, year
* 7. this will likely complain about missing values, so you may have to fix it
newey collapsed_return_q10 collapsed_return_q1, lag(3)
* 8. just in case the business cycles have about 3 years of effect
exit
* 9. you always end your do-files with exit
I avoided statsby as it overwrites the data set in memory. Remember that unlike R, Stata can only remember one data set at a time, so my preference is to avoid excessive I/O operations as they may well be the slowest part of the whole thing if you have a data set of 50+ Mbytes.
I think you're looking for the estout command to store the results of the regressions.