How do I generate a mean by year and industry in Stata

I'm trying to generate in Stata the mean per year (e.g. 2002-2012) for each industry (by 2-digit SIC code, so about 50 different industries).
I found how to do it for one year with:
by sic_2digit, sort: egen test = mean(oancf_at_rsd10) if fyear == 2004
Is there a more efficient way to do this than repeating the command 10 times by hand and then adding the values together?

You can specify more than one variable with by:.
by sic_2digit fyear, sort: egen test = mean(oancf_at_rsd10)
Check out the help for by:, which gives the syntax and an example, and also that for collapse.
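As a rough sketch of both routes mentioned above, the following uses a few made-up rows with the variable names from the question. The new variable names ind_year_mean and mean_oancf are only illustrative. The egen version keeps every observation and attaches the industry-year mean to each row; the collapse version reduces the data to one row per industry and year.

```stata
* Toy data: the values are invented purely for illustration
clear
input sic_2digit fyear oancf_at_rsd10
10 2002 0.10
10 2002 0.30
10 2003 0.20
20 2002 0.50
20 2003 0.70
20 2003 0.90
end

* Route 1: one mean per industry-year, stored on every row,
* restricted to fiscal years 2002-2012
bysort sic_2digit fyear: egen ind_year_mean = mean(oancf_at_rsd10) if inrange(fyear, 2002, 2012)
list, sepby(sic_2digit fyear)

* Route 2: replace the data with one row per industry-year
* (preserve/restore brings the original data back afterwards)
preserve
keep if inrange(fyear, 2002, 2012)
collapse (mean) mean_oancf = oancf_at_rsd10, by(sic_2digit fyear)
list
restore
```

Which route fits better depends on whether you still need the firm-level observations afterwards.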

Related

Trying to find Top 10 products within categories through Regex

I have a ton of products, separated into different categories.
I've aggregated each product's revenue within its category, and I now need to locate the top 10.
The issue is that not every product has sold within a given timeframe, and some categories don't even have 10 products, leaving me with fewer than 10 values.
As an example, these are some of the values:
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3,3,5,6,20,46,47,53,78,92,94,111,115,139,161,163,208,278,291,412,636,638,729,755,829,2673
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,57,124,158,207,288,547
0,0,90,449,1590,10492
0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,7,12,14,32,32,37,62,64,64,64,94,100,103,109,113,114,114,129,133,148,152,154,160,167,177,188,205,207,207,209,214,214,224,225,238,238,244,247,254,268,268,285,288,298,301,305,327,333,347,348,359,362,368,373,402,410,432,452,462,462,472,482,495,511,512,532,566,597,599,600,609,620,636,639,701,704,707,728,747,768,769,773,805,833,899,937,1003,1049,1150,1160,1218,1230,1262,1327,1377,1396,1474,1532,1547,1565,1760,1768,1836,1962,1963,2137,2293,2423,2448,2451,2484,2529,2609,3138,3172,3195,3424,3700,3824,4310,4345,4415,4819,4943,5083,5123,5158,5334,5734,6673,7160,7913,9298,9349,10148,11047,11078,12929,18535,20756,28850,63447
63,126
How would you get as close as possible to capturing the top 10 within a category, and how would you ensure that it is only products that have sold, that are included as a possibility? And all of this through Regex.
My current setup is very basic and only finds the top 3:
Step 1: ^.*\,(.*\,.*\,.*)$ finding top 3
Step 2: ^(.*)\,.*\,.*$ finding the lowest value of the top 3 products
Step 3: Checking if original revenue value is higher than, or equal to, step 2 value.
Step 4: If yes, then bestseller, otherwise just empty value.
Thanks in advance

You didn't specify a programming language, so I'm going with JavaScript here, but this regex is compatible with almost any regex flavor:
(?:[1-9]\d*,){0,9}[1-9]\d*$
(?:[1-9]\d*,){0,9} - between 0 and 9 times, find numbers followed by a comma; ignore zero revenue
[1-9]\d* - guarantee a non-zero revenue one time
$ - end line anchor
https://regex101.com/r/1xBQD3/1
If your data were to have leading zeros, like 0,0,00090,00449,01590,10492, for some reason, then you would need this regex, which is 33% more expensive:
(?:0*[1-9]\d*,){0,9}0*[1-9]\d*$
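Since the rest of this page is about Stata, here is a hedged sketch of trying the first pattern there. It assumes Stata 14 or later, where ustrregexm() and ustrregexs() are available, and that each category's sorted revenues sit in one string variable; the names revenues and top10 are made up for the example.

```stata
* Three of the sample rows from the question, one category per observation
clear
input str60 revenues
"0,0,90,449,1590,10492"
"0"
"63,126"
end

* ustrregexs(0) returns the whole matched substring of the most recent
* successful ustrregexm() call, i.e. up to ten trailing non-zero values
generate str60 top10 = ustrregexs(0) if ustrregexm(revenues, "(?:[1-9]\d*,){0,9}[1-9]\d*$")
list
```

Rows with no non-zero revenue simply do not match, so top10 stays empty for them.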

Type mismatch when replacing missing observations with previous values using time-series operators in Stata

Consider the following example. I begin with an str6 'name' variable, and a year for two entities observed every other year.
clear
input str6 nameStr year
"A" 2002
"A" 2004
"A" 2006
"B" 2002
"B" 2004
"B" 2006
end
Then I use tsfill to balance the panel:
egen id = group(nameStr)
xtset id year
tsfill
The dataset is now:
input str6 nameStr year id
"A" 2002 1
"" 2003 1
"A" 2004 1
"" 2005 1
"A" 2006 1
"B" 2002 2
"" 2003 2
"B" 2004 2
"" 2005 2
"B" 2006 2
end
Now I could use something like xfill to fill in the missing string identifier. Or, based on the related Stata FAQ and the documentation for time-series varlists (help tsvarlist), I expect something like the following to fill in the values of nameStr:
sort id year // not required because the data are still sorted from xtset and tsfill
replace nameStr = nameStr[_n-1] if mi(nameStr) & id[_n-1] == id
and it does.
However, I also expect the following to produce the same behavior, and it does not.
replace nameStr = l.nameStr if mi(nameStr)
Instead Stata returns:
type mismatch
r(109);
While there are several ways to work around this (I've listed two), I'm interested in understanding why this happens. Most similar discussions address cases where two variables of differing types are involved. That obviously isn't the case here, since only one variable is involved.

Stata does not allow time series operators to be applied to string variables. If you think about it you will see that previous (lagging) and following (leading) string values make sense but differences don't, at least not so much. The only simple interpretation of differences would be binary, namely strings at two times are the same or different.
So, Stata is not implying that you can't work with other string values for any panel; it just doesn't support calculations on strings using time series operators.
In addition to the syntax you mention, stripolate from SSC supports string interpolation: see this Statalist thread.
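For completeness, a small sketch of the subscript approach done within panels, rebuilding the example data from the question. Using bysort id (year) makes the sort order explicit and removes the need for the separate id[_n-1] == id check, because under by: the subscript _n-1 never reaches into the previous panel.

```stata
* Rebuild the example from the question
clear
input str6 nameStr year
"A" 2002
"A" 2004
"A" 2006
"B" 2002
"B" 2004
"B" 2006
end
egen id = group(nameStr)
xtset id year
tsfill

* L.nameStr fails with r(109) because nameStr is a string variable.
* Explicit subscripts within each panel carry the previous value forward;
* replace works observation by observation, so longer gaps fill in cascade.
bysort id (year): replace nameStr = nameStr[_n-1] if missing(nameStr)
list, sepby(id)
```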

Count if multiple conditions met in varlist

I am trying to count values of many string variables (hh_1_age hh_2_age hh_3_age etc.) based on multiple conditions, and send output to a new variable schoolage.
The closest I think I've gotten is coded below...but I'm still getting error messages. I'm hoping I'm on the right track and it's just a small syntax error:
generate schoolage = .
foreach var of varlist hh_* {
count if `v'=="6 - 10 years of age" | `v'=="11 - 14 years of age"
}

This requires a great deal of guesswork compared with the small amount of information given. Please note the suggestions here for good Stata questions, which include giving a reproducible example based on real(istic) data.
You may have data on individual members of a household and wish to count how many are of school age, that meaning aged between 6 and 14 years.
gen schoolage = 0
foreach v of var hh_*_age {
replace schoolage = schoolage + inlist(`v', "6 - 10 years of age", "11 - 14 years of age")
}
would be one solution to that.
Note that this assumes that the variables in question really are string, as you report. Further, checks for equality are entirely literal: all characters must match in turn.
For most purposes holding data on each individual in a separate observation is a much better practice than your question seems to imply.
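To illustrate that last point, here is a sketch of the same count after moving to one observation per household member. The household identifier hhid, the toy rows, and the category "15 - 18 years of age" are assumptions for the example, not something taken from your data.

```stata
* Toy wide data: one observation per household, one age variable per member
clear
input hhid str25 hh_1_age str25 hh_2_age
1 "6 - 10 years of age" "11 - 14 years of age"
2 "6 - 10 years of age" "15 - 18 years of age"
end

* Long layout: one observation per household member.
* The @ marks where the member number sits, so the new variable is hh__age.
reshape long hh_@_age, i(hhid) j(member)

* Count school-age members within each household
egen schoolage = total(inlist(hh__age, "6 - 10 years of age", "11 - 14 years of age")), by(hhid)
list, sepby(hhid)
```

In the long layout the count needs no loop, and tabulations by age group become one-liners.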

Dropping observations in Stata based on length?

I have a string variable in Stata called Cod. I want to drop the observations in which Cod has fewer than 16 characters. Any suggestions?

You should look at help string functions to learn basic syntax here.
drop if length(Cod) < 16
may be what you seek.
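A cautious way to go about it is to count before dropping, as in this sketch on two made-up values. It uses strlen(), which I believe behaves the same as length() here. One assumption worth flagging: in Stata 14 and later strlen() counts bytes, so if Cod could contain non-ASCII characters, ustrlen() counts Unicode characters instead.

```stata
* Toy data: one 16-character code and one shorter code
clear
input str20 Cod
"1234567890123456"
"12345"
end

* See how many observations would go before dropping anything
count if strlen(Cod) < 16
drop if strlen(Cod) < 16

* Character-based variant for strings that may hold non-ASCII text (Stata 14+):
* drop if ustrlen(Cod) < 16
list
```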

Multiple responses in Stata

Let's say you have a survey dataset with 12 variables that stem from the same question, where each variable reports one response option for that question (multiple responses are possible). Each variable (i.e. response option) is numeric with yes/no values. I am trying to combine all of these variables into one, so that I can do cross-tabs with other variables such as village name, draw out the frequencies of each individual response, and graph them nicely without extensive formatting. Does anyone have a solution to this: either to combine the variables or to do a multivariable cross-tab that doesn't require a lot of time spent on formatting?
Example data:
A B C D E F
1 0 1 0 1 0
0 0 1 0 1 1
1 1 1 0 0 0

There are many tricks and techniques here.
Tricks include using egen's concat() function as well as the group() function mentioned by @Dimitriy V. Masterov.
Techniques include special tabulation or listing commands, including tabm and groups on SSC and mrtab at the Stata Journal; on the last, see this article.
See also this article in the Stata Journal for a general discussion of handling multiple responses.

Does egen pattern = group(A-F), label do what you desire? If not, perhaps you can clarify what the desired transformation would look like for the 3 respondents you have shown.
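A minimal sketch of both egen tricks on the three example respondents follows; the variable names combo and pattern are just illustrative. concat() builds a string such as "101010" for each respondent, while group() with the label option assigns one numeric code per distinct response pattern and labels it with the underlying values.

```stata
* The three respondents shown in the question
clear
input A B C D E F
1 0 1 0 1 0
0 0 1 0 1 1
1 1 1 0 0 0
end

* String version of each response pattern
egen combo = concat(A-F)

* Numeric code per distinct pattern, labelled with the values of A-F
egen pattern = group(A-F), label

tabulate pattern
list
```

Either variable can then be cross-tabulated against something like village name, though with 12 options the number of distinct patterns can get large.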
