I am trying to update some missing values in a dataset with values from another.
Here is an example in Stata 14.2:
sysuse auto, clear
// save in order to merge below
save auto, replace
// create some missing to update
replace length = . if length < 175
// just so the two datasets are not exactly the same, which is my real example
drop if _n == _N
merge 1:1 make using auto, nogen keep(master match_update) update
The code above only keeps the observations updated (26 observations). It is exactly the same result if one uses keep(match_update) instead.
Why is Stata not keeping all observations in the master dataset?
Note that not using match_update is not helpful either, as it removes all observations.
My current workaround is to rename original variables, merge all, and then replace if original was missing. However, this defeats the point of using the update option, and it is cumbersome for updating many variables.
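For concreteness, here is a rough sketch of that workaround for a single variable (length_orig is just an illustrative name):
sysuse auto, clear
save auto, replace
replace length = . if length < 175
drop if _n == _N
// rename the variable holding missings, merge in the complete copy,
// and fill in only where the original value is missing
rename length length_orig
merge 1:1 make using auto, nogen keep(master match)
replace length_orig = length if missing(length_orig)
drop length
rename length_orig length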
Personally, I always prefer to manually drop or keep observations using the _merge variable, as it is more transparent and less error-prone.
However, the following does what you want:
merge 1:1 make using auto, nogenerate keep(master match match_update) update
    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                73
        not updated                        47
        missing updated                    26
        nonmissing conflict                 0
    -----------------------------------------
You can confirm that this is the case as follows:
sysuse auto, clear
save auto, replace
replace length = . if length < 175
drop if _n == _N
merge 1:1 make using auto, update
drop if _merge == 2
drop _merge
save m1
sysuse auto, clear
save auto, replace
replace length = . if length < 175
drop if _n == _N
merge 1:1 make using auto, nogen keep(master match match_update) update
save m2
cf _all using m1
display r(Nsum)
0
Is it possible to create a backwards counting variable in Stata (like _n, just numbering observations backwards)? Or a command to flip the dataset, so that the observation with the most recent date comes first? I would like to make a scatter plot with AfD on the y-axis and the date (row_id) on the x-axis. When I make the plot, however, the weeks are ordered backwards. How can I change the order?
This is the code:
generate row_id=_n
twoway scatter AfD row_id || lfit AfD row_id
Here are the data set and the plot:
Your date variable is a string variable, which is unlikely to get you the desired result if you sort on that variable.
You can create a Stata internal form date variable from your string variable:
gen date_num = daily(date, "MDY")
format date_num %td
The values of this new variable will represent the number of days since 1 Jan 1960.
If you create a scatter plot with this date variable on the x-axis, by default it will be sorted from min to max. To let it run from max to min you can specify option xscale(reverse).
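For example, reusing the AfD variable from your question:
// x-axis now runs from the most recent to the oldest date
twoway (scatter AfD date_num) (lfit AfD date_num), xscale(reverse)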
If you still want to create an id variable by yourself you can choose one of these options (ascending and descending):
sort date_num
gen id = _n
gsort -date_num
gen id = _n
For your problem, plotting in terms of a daily date variable and -- if for some reason that is a good idea -- using xscale(reverse) are likely to be what you need, as well explained by @Wouter.
In general something like
gen long newid = _N - _n + 1
sort newid
will reverse a dataset.
I want to display the total count of the data in the graph note().
I tried the following:
note(count)
However, this just displays the literal word "count".
I also tried to create a local variable but I am having difficulty just initializing it.
While I can do the following:
. local N = 100
. di `N'
100
I can't seem to do:
. local N = count
count not found
The total number of observations is stored in _N.
sysuse auto, clear
display _N
74
So the following works for me:
local N = _N
twoway scatter mpg price, note(Total no of observations: `N')
The total number of observations is kept in _N but it is not necessarily the number of observations used in a graph.
The command count displays a result and leaves a saved result, the number counted, in its wake as r(N). This is documented both in the help for count and in the manual entry.
Hence you can verify that this sequence leaves a note reading "74 observations" in the resulting graph.
. sysuse auto, clear
(1978 Automobile Data)
. count if mpg < .
74
. histogram mpg, note(`r(N)' observations)
(bin=8, start=12, width=3.625)
Note that no r-class command should intervene here between count and your use of its result. r-class saved results, like any other saved results, are overwritten easily. In many circumstances you are well advised, as you did, to store the result in a local macro, say by
. local N = r(N)
immediately after the count command and then refer to that later in the note().
This is a more general method because count by itself returns the number of observations and so can be used when this is directly what you want.
Combining the other answers, I ultimately did:
count
local N = r(N)
count if male
local N_male = r(N)
count if !male
local N_female = r(N)
...
note("N = `N'" " `N_male' (Male)" " `N_female' (Female)")
But I still can't get the commas to render at the thousands and millions places.
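One way to get the commas, for what it's worth, is to apply one of Stata's comma display formats (the c suffix on a numeric format inserts thousands separators) when storing the count in the macro. A minimal sketch, using the auto data from above:
count
// format the count with thousands separators while storing it in the macro
local N : display %15.0fc r(N)
local N = trim("`N'")    // remove the leading blanks added by the fixed-width format
twoway scatter mpg price, note("N = `N'")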
I have panel data (time: date, name: ticker). I want to create 10 lags for variables x and y. Now I create each lag variable one by one using the following code:
by ticker: gen lag1 = x[_n-1]
However, this looks messy.
Can anyone tell me how can I create lag variables more efficiently, please?
Shall I use a loop or does Stata have a more efficient way of handling this kind of problem?
@Robert has shown you the streamlined way of doing it. For completeness, here is the "traditional", boring way:
clear
set more off
*----- example data -----
set obs 2
gen id = _n
expand 20
bysort id: gen time = _n
tsset id time
set seed 12345
gen x = runiform()
gen y = 10 * runiform()
list, sepby(id)
*----- what you want -----
// "traditional" loop
forvalues i = 1/10 {
    gen x_`i' = L`i'.x
    gen y_`i' = L`i'.y
}
list, sepby(id)
And a combination:
// a combination
foreach v in x y {
    tsrevar L(1/10).`v'
    rename (`r(varlist)') `v'_#, addnumber
}
If the purpose is to create lagged variables in order to use them in some estimation, note that you can use time-series operators directly within many estimation commands; that is, there is no need to create the lagged variables in the first place. See help tsvarlist.
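For example, with the example panel above already tsset, something like the following uses the lags on the fly (purely illustrative):
// lags specified directly; no lag variables are added to the dataset
regress y L(1/10).x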
You can loop to do this but you can also take advantage of tsrevar to generate temporary lagged variables. If you need permanent variables, you can use rename group to rename them.
clear
set obs 2
gen id = _n
expand 20
bysort id: gen time = _n
tsset id time
set seed 12345
gen x = runiform()
gen y = 10 * runiform()
tsrevar L(1/10).x
rename (`r(varlist)') x_#, addnumber
tsrevar L(1/10).y
rename (`r(varlist)') y_#, addnumber
Note that if you are doing this to calculate a statistic on a rolling window, check out tsegen (from SSC).
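For instance (sketching tsegen's syntax from memory; check help tsegen after installing), a 12-period rolling mean of x within each panel could look like:
ssc install tsegen
// rolling mean over the current value plus 11 lags, by panel
tsegen x_ma12 = rowmean(L(0/11).x)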
I am trying to generate frequencies for a variable in Stata conditional on categories of another variable.
This other categorical variable has about 790,000 observations for the category I am interested in.
Stata's limits of 12,000 rows for one-way tables and 1,200 rows for two-way tables make this impossible.
Every time I run tab x if y==<category of interest> I get the following error:
too many values
r(134);
I installed the bigtab package; though it gives me tables, it cannot be used with by, nor can it run statistical tests.
Is there a work around for this?
It seems silly that Stata should have this arbitrary limit when SAS and even SPSS can run the exact same operation without trouble.
To some it might seem silly, or at least puzzling, that people want tables with more than 12000 rows, as there must be a better way to display results or answer the question that is in mind.
That said, the limits of tabulate are hard-wired. But you just need to think of reproducing whatever you want to show. So, for one-way frequencies
. bysort rowvar : gen freq = _N
. by rowvar : gen tag = _n == 1
. gsort -freq rowvar
. list rowvar freq if tag, noobs
and for two-way frequencies
. bysort rowvar colvar : gen freq = _N
. by rowvar colvar : gen tag = _n == 1
. gsort -freq rowvar colvar
. list rowvar colvar freq if tag, noobs
A similar approach, with more bells and whistles, is coded in the groups command (SSC). An even simpler approach in many ways is to collapse or contract the dataset and then list it.
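For instance, a minimal sketch of the contract route for the one-way case (contract replaces the data in memory, hence the preserve/restore):
preserve
contract rowvar, freq(freq)    // one observation per distinct value of rowvar, with its count
gsort -freq
list rowvar freq, noobs
restore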
To flag the general strategy here:
Produce what you want as new variables.
Select just one observation from each group if there are multiple observations.
list, not tabulate.
UPDATE
OP asked
. bysort rowvar : gen freq = _N
OP: This generates the freq variable for the last count of every individual value in my rowvar
Me: No. The freq variable is the count of observations for every distinct value of rowvar.
. by rowvar : gen tag = _n == 1
OP: This generates the tag variable for the first count of every unique observation in rowvar.
Me: Correct, provided you say "distinct", not "unique". Unique values occur once only.
. gsort -freq rowvar
OP: This sorts freq and rowvar in descending order
Me: It sorts freq in descending order and rowvar in ascending order within blocks of constant freq.
. list rowvar freq if tag, noobs
OP: What does if do here?
Me: That one is left as an exercise.
Use the command bigtab. (You have to install the package first: run ssc install bigtab.) For help type h bigtab.
I am using the rolling command in a foreach loop:
use "MyFile.dta"
tsset time, monthly
foreach i of varlist var1 var2 {
    rolling _b, window(12) saving(beta_`i'): reg `i' DependentVariable
}
Now, this code saves a different file for each rolling regression. What I would really like is to save each vector of betas obtained from the rolling estimation as a variable.
The final result I would like to obtain is a dataset with a time variable and a "beta_var#" variable for each rolling:
    time    | beta_var1 | beta_var2
    ________|___________|__________
    1990m1  |    ##     |    ##
    1990m2  |    ##     |    ##
    ...     |    ##     |    ##
    200m12  |    ##     |    ##
(PS, a secondary question: is there a shortcut for specifying a varlist equal to all the variables in the dataset?)
I misread your post and my initial answer does not give what you ask for. Here's one way. It is neither elegant nor very efficient, but it works (just change the directory names):
clear all
set more off
* Do not mix with previous trials
capture erase "/home/roberto/results.dta"
* Load data
sysuse sp500
tsset date
* Set fixed independent variable
local var open
foreach depvar of varlist high low close volume {
    rolling _b, window(30) saving(temp, replace): regress `depvar' `var'
    use "/home/roberto/temp.dta", clear
    rename (_b_`var' _b_cons) (b_`depvar' b_cons_`depvar')
    capture noisily merge 1:1 start end using "/home/roberto/results.dta", assert(match)
    capture noisily drop _merge
    save "/home/roberto/results.dta", replace
    sysuse sp500, clear
    tsset date
}
* Delete auxiliary file
capture erase "/home/roberto/temp.dta"
* Check results
use "/home/roberto/results.dta"
browse
Maybe other solutions can be proposed using postfile or concatenating vectors of results and converting to a dataset using svmat. I'm not sure.
Original answer
Use the saving() option with replace and provide only one file name (drop the macro suffix):
clear all
set more off
webuse lutkepohl2
tsset qtr
rolling _b, window(30) saving(results, every(5) replace): regress dln_inv dln_inc dln_consump