Create a new variable using observations and labels - stata

I have a variable that looks like this:
I want a new variable that multiplies each label by its frequency, so for example the first row would be 1 * 70,105 = 70,105, the second would be 2 * 36,377 = 72,754, and so on. I want my new variable to look like this:
How can I do this?

On the face of it you have at least 119167 observations. The "at least" refers to the possibility of missing values, not tabulated by default.
You don't say whether you want these values in the same observations or in a much reduced new dataset. If the former, then consider this (noting that 3845 * 4 = 15380).
clear
input apple freq
1 70105
2 36377
3 8840
4 3845
end
expand freq
tab apple
bysort apple : gen new = apple * _N
tabdisp apple, c(new)
----------------------
apple | new
----------+-----------
1 | 70105
2 | 72754
3 | 26520
4 | 15380
----------------------
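If you want the much reduced dataset instead, one possible sketch is to collapse the data to one observation per distinct value with contract and then multiply. The variable name n is my own choice for illustration, and the expand step is only there to recreate data shaped like yours.

```stata
clear
input apple freq
1 70105
2 36377
3 8840
4 3845
end
* recreate one observation per case, as in the original data
expand freq
drop freq
* collapse to one observation per distinct value of apple;
* n (a name assumed here) holds the number of observations for each value
contract apple, freq(n)
gen long new = apple * n
list, noobs
```

Note that contract replaces the data in memory, so save your dataset or use preserve first if you still need the full set of observations.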

Related

Stata: Keep the first observation by group

I have a data set that looks like this:
id firm earnings A
1 A 100 0
1 A 200 0
2 B 50 1
2 B 70 1
3 C 900 0
Within each id and firm group, I want to keep only the first observation if A==0 and keep all the observations if A==1.
I've tried the following code:
if A==0{
bys id firm: keep if _n==1
}
However, this code drops all the _n>1 observations no matter what the A value is.
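The if (conditional) {do something} syntax is the if command, which is used for control flow; it is not the if qualifier that selects observations. As you have it now, Stata only tests whether A==0 in the first observation and then runs or skips the whole block accordingly. Put the condition in the if qualifier instead, combining conditions with and (&) or or (|). Try this: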
bys id firm: keep if (_n==1 & A==0) | A==1
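Here is a minimal sketch of that approach on the example data. Two things in it are my own assumptions rather than part of the question: the helper variable order, which records the original row order so that "first" is well defined after sorting, and the shortened condition, which is logically the same as the one above when A only takes the values 0 and 1.

```stata
clear
input id str1 firm earnings A
1 A 100 0
1 A 200 0
2 B 50 1
2 B 70 1
3 C 900 0
end
* order is a hypothetical helper: it pins down which observation counts as "first"
gen long order = _n
* keep the first observation of each group, plus every observation with A == 1
bysort id firm (order): keep if _n == 1 | A == 1
list, noobs
```

Without a variable in parentheses after the by-variables, the order of observations within each group is not guaranteed after sorting, so it is safer to state it explicitly.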

Split data into categories in the same row in Power BI

I have a table that contains multiple columns whose names have either the suffix _EXPECTED or _ACTUAL. For example, I'm looking at my sold items from my SoldItems table and I have the following columns: APPLES_EXPECTED, BANANAS_EXPECTED, KIWIS_EXPECTED, APPLES_ACTUAL, BANANAS_ACTUAL, KIWIS_ACTUAL (the identifier of the table is the date, so we have results per date). I want to show that data in table form, something like this (for a date selected in the filters):
+------------+----------+--------+
| Sold items | Expected | Actual |
+------------+----------+--------+
| Apples | 10 | 15 |
| Bananas | 8 | 5 |
| Kiwis | 2 | 1 |
+------------+----------+--------+
How can I manage something like this in Power BI ? I tried playing with the matrix/table visualization, however, I can't figure out a way to merge all the expected and actual columns together.
It looks like the easiest option for you would be to reshape the data using Power Query. You can UNPIVOT your data so that all the expected and actual values become rows instead of columns. For example, take the following sample:
Date Apples_Expected Apples_Actual
1/1/2019 1 2
Once you unpivot this it will become:
Date Fruit Count
1/1/2019 Apples_Expected 1
1/1/2019 Apples_Actual 2
Once you unpivot, it should be fairly straightforward to get the view you are looking for. The following link should walk you through the steps to unpivot:
https://support.office.com/en-us/article/unpivot-columns-power-query-0f7bad4b-9ea1-49c1-9d95-f588221c7098
Hope this helps.

Stata: Gaps between dates

I have a situation where I need to order several dates to see if there is a gap in coverage. My data set looks like this, where id is the panel id and start and end are dates.
id start end
a 01.01.15 02.01.15
a 02.01.15 03.01.15
b 05.01.15 06.01.15
b 07.01.15 08.01.15
b 06.01.15 07.01.15
I need to identify any cases where there is a gap in coverage, meaning when the 2nd start date for an id is greater than the first end date for the same id. Also, the same id can have an undetermined number of observations, and they might not be in any particular order. I wrote the code below for a case where there are only two observations per id.
bys id: gen y=1 if end < start[_n+1]
However, this code does not produce the desired results. I'm thinking that there should be another way to approach this problem.
Your approach seems sound in essence, assuming that your date variables are really Stata daily date variables formatted suitably. You don't explain at all what "does not produce the desired results" means to you.
The code below creates a sandbox similar to your example, but with string variables converted to daily dates.
Key details include:
Observations must be sorted by date within panel.
The start date for the observation after the last in each panel would always be returned as missing, and so as greater than any known date. The code here returns the corresponding indicator as missing.
clear
input str1 id str8 (s_start s_end)
a "01.01.15" "02.01.15"
a "02.01.15" "03.01.15"
b "05.01.15" "06.01.15"
b "07.01.15" "08.01.15"
b "06.01.15" "07.01.15"
b "10.01.15" "12.01.15"
end
foreach v in start end {
gen `v' = daily(s_`v', "DMY", 2050)
format `v' %td
}
// the important line here
bysort id (start) : gen first = end < start[_n+1] if _n < _N
list , sepby(id)
+----------------------------------------------------------+
| id s_start s_end start end first |
|----------------------------------------------------------|
1. | a 01.01.15 02.01.15 01jan2015 02jan2015 0 |
2. | a 02.01.15 03.01.15 02jan2015 03jan2015 . |
|----------------------------------------------------------|
3. | b 05.01.15 06.01.15 05jan2015 06jan2015 0 |
4. | b 06.01.15 07.01.15 06jan2015 07jan2015 0 |
5. | b 07.01.15 08.01.15 07jan2015 08jan2015 1 |
6. | b 10.01.15 12.01.15 10jan2015 12jan2015 . |
+----------------------------------------------------------+
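If what you need in the end is to know which ids have a gap anywhere, one possible follow-up is to spread the indicator across the panel with egen. This is only a sketch: the variable names gap and anygap are assumptions of mine, and the data are a shortened version of the sandbox above.

```stata
clear
input str1 id str8 (s_start s_end)
a "01.01.15" "02.01.15"
a "02.01.15" "03.01.15"
b "05.01.15" "06.01.15"
b "07.01.15" "08.01.15"
b "10.01.15" "12.01.15"
end
foreach v in start end {
    gen `v' = daily(s_`v', "DMY", 2050)
    format `v' %td
}
* 1 if the next spell starts after this one ends, missing for the last spell
bysort id (start) : gen gap = end < start[_n+1] if _n < _N
* anygap (hypothetical name): 1 for every observation of an id with at least one gap
bysort id : egen anygap = max(gap)
list id start end gap anygap, sepby(id)
```

egen's max() ignores missing values, so the missing indicator on the last observation of each panel does not affect the result. An id with a single observation would get missing for anygap.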

Run a regression of countries by quartiles for a specific year

I am exploring an effect that I think will vary by GDP levels, from a data set that has, vertically, country and year (1960 to 2015), so each country label is on 56 rows. I ran
sort year
by year: egen yrank = xtile(rgdp), nquantiles(4)
which tags every year row with what quartile of GDP they were in that year. I want to run this:
xtreg fiveyearg taxratio if yrank == 1 & year==1960
which would regress my variable (tax ratio) against some averaged gdp data from countries that were in the bottom quartile of GDPs in 1960 alone. So even if later on they grew enough to change ranks, the later data would still be in the regression pool. Sadly, I cannot get this code, or any variation, to run.
My current approach is to try to generate some new variable that would give every row with country label X a value of 1 if they were in the bottom quartile in 1960, but I can't get that to work either. I have run out of ideas, so I thought I would ask!
Based on your latest comment, which describes the (un)expected behavior:
clear
set more off
*----- example data -----
input ///
country year rank
1 1960 2
1 1961 1
1 1962 2
2 1960 1
2 1961 1
2 1962 1
3 1960 3
3 1961 3
3 1962 3
end
list, sepby(country)
*----- what you want -----
// tag countries whose first observation for -rank- is 1
// (I assume the first observation for -year- is always 1960)
bysort country (year) : gen toreg = rank[1] == 1
list, sepby(country)
// run regression conditional on -toreg-
xtreg ... if toreg
Check help subscripting if in doubt.
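A variation that does not rely on 1960 being the first observation for each country is to test the year explicitly and spread the result with egen. This is a sketch on the same example data, not a claim about your real dataset.

```stata
clear
input country year rank
1 1960 2
1 1961 1
1 1962 2
2 1960 1
2 1961 1
2 1962 1
3 1960 3
3 1961 3
3 1962 3
end
* the expression is 1 only in a 1960 row with rank 1;
* taking the maximum within country copies that 1 to all of the country's rows
bysort country : egen byte toreg = max(rank == 1 & year == 1960)
list, sepby(country)
```

In this example only country 2 is tagged, because it is the only one with rank 1 in 1960. Country 1 reaches rank 1 in 1961, but that does not count.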

Stata output files in surveys (proportions)

I need to modify the code below which I'm using on some CPS data to capture insurance coverage. I need to output a file with the percent covered by Census region (there are four). It should look something like this:
region n percent
1 xxx xx
2 xxx xx
3 xxx xx
4 xxx xx
I could live with two rows defining the percentages covered and not covered in each region if necessary, but I really only need the percentage covered.
Here's the code I'm using:
svyset [iw=hinswt], sdrweight(repwt1-repwt160) vce(sdr)
tempname memhold
postfile `memhold' region_rec n prop using Insurance, replace
levelsof region_rec, local(lf)
foreach x of local lf{
svy, subpop(if region_rec==`x' & age>=3 & age<=17): proportion hcovany
scalar forx = `x'
scalar prop = _b[hcovany]
matrix b = e(_N_subp)
matrix c = e(_N)
scalar n = el(c,1,1)
post `memhold' (forx) (n) (prop)
}
postclose `memhold'
use Insurance, clear
list
This is what it produces:
Survey: Proportion estimation Number of obs = 210648
Population size = 291166198
Subpop. no. obs = 10829
Subpop. size = 10965424.5
Replications = 160
_prop_1: hcovany = Not covered
--------------------------------------------------------------
| SDR
| Proportion Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
hcovany |
_prop_1 | .0693129 .0046163 .0602651 .0783607
Covered | .9306871 .0046163 .9216393 .9397349
--------------------------------------------------------------
[hcovany] not found
r(111);
I can't figure out how to get around the error message at the bottom and get it to save the results. I think a SE and CV would be a desirable feature as well, but I'm not sure how to handle that within the matrix framework.
EDIT: Additional output
+----------------------------------+
| region~c n prop se |
|----------------------------------|
| 1 9640 .9360977 2 |
| 2 12515 .9352329 2 |
| 3 14445 .8769684 2 |
| 4 13241 .8846368 2 |
+----------------------------------+
Try replacing _b[hcovany] with _b[some-value-label]. To be clear, the following nonsensical example is similar to your code, but instead of using _b[sex], where sex is a variable, it uses _b[Male], where Male is a value label for sex. Subpopulation sizes and standard errors are also saved.
clear all
set more off
webuse nhanes2f
svyset [pweight=finalwgt]
tempname memhold
tempfile results
postfile `memhold' region nsubpop maleprop stderr using `results', replace
levelsof region, local(lf)
foreach x of local lf{
svy, subpop(if region == `x' & inrange(age, 20, 40)): proportion sex
post `memhold' (`x') (e(N_subpop)) (_b[Male]) (_se[Male])
}
postclose `memhold'
use `results', clear
list
If we were to use _b[sex] instead of _b[Male], we would get the same r(111) error as in your original post.
For this example, let's see what the matrix e(b), containing the estimated proportions, looks like:
. matrix list e(b)
e(b)[1,2]
sex: sex:
Male Female
y1 .48821487 .51178513
Therefore, if we wanted to extract the proportions for females instead of males, we could use _b[Female].
Yet another option is to save the estimation result in a matrix and use numerical subscripts:
<snip>
matrix b = e(b)
post `memhold' (`x') (b[1,2])
<snip>
There are other slight changes like the use of inrange and direct use of returned estimation results with post.
Also, you may want to take a look at help _variables and its link:
[U] 13.5 Accessing coefficients and standard errors.
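Since you also mentioned wanting a SE and a CV, here is a sketch of the numerical-subscript route that posts both. It assumes the standard error can be taken as the square root of the matching diagonal element of e(V), and it defines the CV as 100 times the standard error divided by the proportion. The posted variable names prop, se and cv are my own choices.

```stata
clear all
set more off
webuse nhanes2f
svyset [pweight=finalwgt]
tempname memhold b V
tempfile results
postfile `memhold' region nsubpop prop se cv using `results', replace
levelsof region, local(lf)
foreach x of local lf {
    quietly svy, subpop(if region == `x' & inrange(age, 20, 40)): proportion sex
    * copy the estimates and their variance matrix into temporary matrices
    matrix `b' = e(b)
    matrix `V' = e(V)
    * column 1 is the first category of sex; change the index for another category
    post `memhold' (`x') (e(N_subpop)) (`b'[1,1]) (sqrt(`V'[1,1])) ///
        (100 * sqrt(`V'[1,1]) / `b'[1,1])
}
postclose `memhold'
use `results', clear
list
```

Using temporary names for the matrices avoids any clash with variables in the dataset whose names start with b or V.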