I want to display the total count of the data in the graph note().
I tried the following:
note(count)
However, this just displays the literal word "count".
I also tried to create a local variable but I am having difficulty just initializing it.
While I can do the following:
. local N = 100
. di `N'
100
I can't seem to do:
. local N = count
count not found
The total number of observations is stored in _N.
sysuse auto, clear
display _N
74
So the following works for me:
local N = _N
twoway scatter mpg price, note(Total no of observations: `N')
The total number of observations is kept in _N but it is not necessarily the number of observations used in a graph.
The command count displays a result and leaves a saved result, the number counted, in its wake as r(N). This is documented both in the help for count and in the manual entry.
Hence you can verify that this sequence leaves the note "74 observations" on the resulting graph.
. sysuse auto, clear
(1978 Automobile Data)
. count if mpg < .
74
. histogram mpg, note(`r(N)' observations)
(bin=8, start=12, width=3.625)
Note that no r-class command should intervene here between count and your use of its result. r-class saved results, like any other saved results, are overwritten easily. In many circumstances you are well advised, as you did, to store the result in a local macro, say by
. local N = r(N)
immediately after the count command and then refer to that later in the note().
This method is more general: count with an if qualifier gives the number of observations actually used, while count by itself returns the total number of observations and so can be used when that is directly what you want.
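As a minimal sketch of why storing the result immediately matters, the following uses the auto data, with the foreign subset chosen only for illustration. summarize is r-class too, so after it runs r(N) no longer holds the result of count, while the local macro N still does.

```stata
sysuse auto, clear
count if foreign
* store the result at once, before any other r-class command runs
local N = r(N)
* summarize is r-class and overwrites r(N) with its own count
summarize mpg
* the local macro still holds the number counted by count
histogram mpg if foreign, note(`N' observations)
```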
Combining the other answers, I ultimately did:
count
local N = r(N)
count if male
local N_male = r(N)
count if !male
local N_female = r(N)
...
note("N = `N'" " `N_male' (Male)" " `N_female' (Female)")
But I still can't get commas to render at the thousands and millions places.
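One possible approach, sketched here under the assumption that a formatted string is acceptable in the note, is to pass the count through string() with a display format that ends in c, which requests commas. The macro names N_fmt and big_fmt and the number 1234567 are made up for illustration. Keeping the note text inside double quotes should stop the commas from being read as the start of suboptions.

```stata
sysuse auto, clear
count
* string() applies a display format; the c in %15.0fc requests commas
local N_fmt = strtrim(string(r(N), "%15.0fc"))
* a made-up larger number, used only to show the commas
local big_fmt = strtrim(string(1234567, "%15.0fc"))
display "`N_fmt' and `big_fmt'"
* quote the note text so that any commas stay part of the text
scatter mpg price, note("N = `N_fmt'")
```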
Suppose I make the following chart showing the weight of 9 pigs over time:
webuse pig
tw line weight week if inrange(id,1,9), by(id) subtitle(, nospan)
Is it possible to reorder the panels by another variable while retaining the original label? I can imagine defining another variable that is sorted the right way and then labeling it with the right id, but I am curious whether there is a less clunky way of achieving that.
I think you are right: you need a new ordering variable. Positively, you can order on any criterion of choice. Watch out for ties on the variable used to order, which can always be broken by referring to the original identifier. Here we sort on final weights, by default smallest first. (For largest first, negate the weight variable.)
webuse pig, clear
keep if id <= 9
bysort id (week) : gen last = weight[_N]
egen newid = group(last id)
bysort newid : gen toshow = strofreal(id) + " (" + strofreal(last, "%2.1f") + ")"
* search labmask for download links
labmask newid , values(toshow)
set scheme s1color
line weight week, by(newid, note("")) sort xla(1/9)
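For largest first, a minimal sketch of the negation mentioned above could look like this. The variable name neglast is an assumption introduced here; the labelling and graphing steps would then proceed as before.

```stata
webuse pig, clear
keep if id <= 9
bysort id (week) : gen last = weight[_N]
* neglast is a helper variable: negating reverses the ordering
gen neglast = -last
* ties on neglast are still broken by the original id
egen newid = group(neglast id)
* check: newid 1 should now be the pig with the largest final weight
sort newid week
list id newid last if week == 9, noobs
```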
Short papers discussing the principles here are already in train for publication in the Stata Journal in 2021.
I am trying to update some missing values in a dataset with values from another.
Here is an example in Stata 14.2:
sysuse auto, clear
// save in order to merge below
save auto, replace
// create some missing to update
replace length = . if length < 175
// just so the two datasets are not exactly the same, which is my real example
drop if _n == _N
merge 1:1 make using auto, nogen keep(master match_update) update
The code above only keeps the observations updated (26 observations). It is exactly the same result if one uses keep(match_update) instead.
Why is Stata not keeping all observations in the master dataset?
Note that not using match_update is not helpful either, as it removes all observations.
My current workaround is to rename original variables, merge all, and then replace if original was missing. However, this defeats the point of using the update option, and it is cumbersome for updating many variables.
Personally, I always prefer to manually drop / keep observations using the _merge variable as it is more transparent and less error prone.
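However, the following does what you want. With update, matched observations that needed no updating are still classified as match (_merge == 3), so match must be listed in keep() alongside match_update: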
merge 1:1 make using auto, nogenerate keep(master match match_update) update
    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                73
        not updated                        47
        missing updated                    26
        nonmissing conflict                 0
    -----------------------------------------
You can confirm that this is the case as follows:
sysuse auto, clear
save auto, replace
replace length = . if length < 175
drop if _n == _N
merge 1:1 make using auto, update
drop if _merge == 2
drop _merge
save m1
sysuse auto, clear
save auto, replace
replace length = . if length < 175
drop if _n == _N
merge 1:1 make using auto, nogen keep(master match match_update) update
save m2
cf _all using m1
display r(Nsum)
0
This has probably already been answered, but I must just be searching for the wrong terms.
Suppose I am using the built-in Stata dataset auto:
sysuse auto, clear
Say, for example, that I am working with one independent and one dependent variable, and I want to compress the data down to the IQR elements: min, p(25), median, p(75), max.
So I use the commands:
keep weight mpg
sum weight, detail
return list
local min=r(min)
local lqr=r(p25)
local med = r(p50)
local uqr = r(p75)
local max = r(max)
keep if weight==`min' | weight==`max' | weight==`med' | weight==`lqr' | weight==`uqr'
Hence, I want to compress the dataset down to only those 5 observations. In this situation, for example, the median is not actually an element of the weight vector: there is an observation above and an observation below (given the definition of the median, this is no surprise). Is there a way to tell Stata to look for the nearest neighbor above the percentile? That is, if r(p50) is not an element of weight, search above that value for the next observation?
The end result I want is two vectors, say weight and mpg, such that each of the 5 elements of weight has its matching response in mpg.
Any thoughts?
I think you want something like:
clear all
set more off
sysuse auto
keep weight mpg
summarize weight, detail
local min = r(min)
local lqr = r(p25)
local med = r(p50)
local uqr = r(p75)
local max = r(max)
* differences between weights and its median
gen diff = abs(weight - `med')
* put the smallest difference in observation 1 (there can be several, watch out!)
isid diff weight mpg, sort
* replace the original median with the weight "closest" to the median
local med = weight[1]
keep if inlist(weight, `min', `lqr', `med', `uqr', `max')
drop diff
* pretty print
order weight mpg
sort weight mpg
list, sep(0)
Notice the median does not appear because we kept its "closest" neighbor instead (weight == 3,180).
Also, percentile 75 has two associated mpg values.
You could probably work something out with collapse and merge (and many more), but I'll leave it at this.
Use help <command> for whatever is not clear.
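If the nearest observed value at or above the percentile is wanted, rather than the closest value in either direction, a minimal sketch is to summarize the weights that are at least as large as the reported median and take the minimum. The macro name med_up is an assumption introduced here.

```stata
sysuse auto, clear
keep weight mpg
summarize weight, detail
local med = r(p50)
* smallest observed weight at or above the reported median
summarize weight if weight >= `med', meanonly
local med_up = r(min)
display "median = `med', nearest observed weight at or above = `med_up'"
```

The same pattern could be repeated for r(p25) and r(p75) before using inlist() to keep the observations.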
Thank you for all the suggestions; here is what I came up with. The idea is that I was pulling these 5 numbers so I could send them to Mata for a cubic spline that I am attempting to write.
For whatever reason trying to generalize this was giving me a headache.
My final solution:
sysuse auto, clear
preserve
sort weight
count if weight<.
keep if _n==1 | _n==ceil(r(N)/4) | _n==ceil(r(N)/2) | _n==ceil(3*r(N)/4) | _n==_N
gen X = weight
gen Y = mpg
list X Y
/* at this point I will send X and Y to mata for the cubic spline
routine that I am in the process of writing. It was this little step that
was bugging me. */
restore
I have a list of companies with start and end dates for each. I want to count the number of companies alive over time. I have the following code but it runs slowly on my large dataset. Is there a more efficient way to do this in Stata?
forvalues y = 1982/2012 {
    forvalues m = 1/12 {
        *display date("01-`m'-`y'","DMY")
        count if start_dt <= date("01-`m'-`y'","DMY") & date("01-`m'-`y'","DMY") <= end_dt
    }
}
One way is to compute each date once, store it in a local macro, and use the inrange() function. In Stata, date variables are just integers, so you can easily operate on them.
forvalues y = 1982/2012 {
    forvalues m = 1/12 {
        local d = date("01-`m'-`y'","DMY")
        count if inrange(`d', start_dt, end_dt)
    }
}
This alone will save you a huge amount of time. For 50,000 observations (and made-up data):
. timer list 1
1: 3.40 / 1 = 3.3980
. timer list 2
2: 18.61 / 1 = 18.6130
Timer 1 is with inrange(); timer 2 is your original code. Results are in seconds. Run help inrange and help timer for details.
That said, maybe someone can suggest an overall better strategy.
Assuming a firm identifier firmid, this is another way to think about the problem, but with a different data structure. Make sure you have a saved copy of your dataset before you do this.
expand 2
bysort firmid : gen eitherdate = cond(_n == 1, start_dt, end_dt)
by firmid : gen score = cond(_n == 1, 1, -1)
sort eitherdate
gen living = sum(score)
by eitherdate : replace living = living[_N]
So,
1. We expand each observation to 2 and put both dates in a new variable, the start date in one observation and the end date in the other observation.
2. We assign a score that is 1 when a firm starts and -1 when it ends.
3. The number of firms is increased by 1 every time a firm starts and decreased by 1 every time one ends. We just need to sort by date and the number of firms is the cumulative sum of those scores. (EDIT: There is a fix for changes on the same date.)
This new data structure could be useful for other purposes.
There is a write-up at http://www.stata-journal.com/article.html?article=dm0068
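To see the steps above at work, here is a minimal sketch on toy data. The three firms and their dates are made up for illustration; the variable names firmid, start_dt and end_dt follow the question.

```stata
* toy data, made up for illustration: three firms with start and end dates
clear
input firmid str9 s str9 e
1 "01jan2000" "31dec2005"
2 "01jul2001" "30jun2003"
3 "01jul2001" "31dec2010"
end
gen start_dt = date(s, "DMY")
gen end_dt = date(e, "DMY")
drop s e

* the method described above, unchanged
expand 2
bysort firmid : gen eitherdate = cond(_n == 1, start_dt, end_dt)
by firmid : gen score = cond(_n == 1, 1, -1)
sort eitherdate
gen living = sum(score)
by eitherdate : replace living = living[_N]

format eitherdate %td
list eitherdate score living, sepby(eitherdate)
```

Two firms start on the same date here, so the listing also shows the effect of the last line, which gives both observations for that date the same count.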
EDIT:
Notes in response to @Roberto Ferrer (and anyone else who read this):
I fixed a bad bug, which made this too difficult to understand. Sorry about that.
The dates used here are just the dates at which firms start and end. There is no evident point in evaluating the number of firms at any other date as it would just be the same number as the previous date used. If you needed, however, to interpolate to a grid of dates, copying the previous count would be sufficient.
It is important not to confuse the Stata function sum() which returns the cumulative sum with any egen function. The impression that egen's total() is an alternative here was a side-effect of my bug.
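The difference can be seen in a short sketch on the auto data. The variable names running and overall are assumptions introduced here.

```stata
sysuse auto, clear
* sum() returns the cumulative (running) sum, observation by observation
gen running = sum(price)
* egen's total() puts the same grand total in every observation
egen overall = total(price)
list price running overall in 1/5
```

Only in the last observation do the two variables agree.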
Using the auto.dta data, I want to top-code the PRICE variable at the median of top X%. For example, X% could be 3%, 4%, etc. How can I do this in Stata?
In answering your question, I am assuming that you want to replace all values in, say, the top 10% with a cutoff value X (the 90th percentile in the following code).
Here is the sample code:
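capture program drop topcode
program topcode
    sysuse auto, clear
    * deciles of price; r(r9) is the 90th percentile
    pctile pct = price, nq(10)
    * store the cutoff at once, before r() can be overwritten
    local cutoff = r(r9)
    display `cutoff'
    gen newprice = price
    * top-code at the cutoff, leaving any missing values alone
    replace newprice = `cutoff' if newprice > `cutoff' & !missing(newprice)
end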
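If the question is instead read literally, so that values in the top X% are replaced by the median of those top values, a minimal sketch might look as follows. This reading is an assumption, as are the macro names X, p, cutoff and topmed and the choice of 10%.

```stata
sysuse auto, clear
* hypothetical choice: top-code the top 10%
local X = 10
local p = 100 - `X'
* _pctile leaves the requested percentile in r(r1)
_pctile price, p(`p')
local cutoff = r(r1)
* median of the values above the cutoff
summarize price if price > `cutoff' & price < ., detail
local topmed = r(p50)
* replace values above the cutoff; missing values are left alone
gen newprice = cond(price > `cutoff' & price < ., `topmed', price)
```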