drop entire panel id/firm if at least 1 missing value for variable - stata

I have panel data and want to delete an entire panel id/firm ID if it has at least 1 missing value for total assets (at) in any of the years. Could someone help me?
So to be clear the panel data contains the following variables:
1) year: year
2) gvkey: firm id
3) TotalAssets: amount of Total Assets
So if a firm (id) has at least 1 missing value for TotalAssets in any of the years, it needs to be removed from the sample entirely.

With annual data there can be at most one value per firm and year, but the question implies that the criterion is one or more missing values anywhere within each panel. If you sort on the variable within panels, missing values are sorted to the end. So the last value within a panel is missing if and only if any values are missing, and the drop can be made conditional on that:
bysort gvkey (TotalAssets) : drop if missing(TotalAssets[_N])
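A roughly equivalent approach, shown here as a sketch using the variable names from the question, is to flag firms that have any missing value and then drop all their observations:
* flag firms (gvkey) with at least one missing TotalAssets, then drop them
egen byte anymiss = max(missing(TotalAssets)), by(gvkey)
drop if anymiss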


Panel data - drop observations in stata

I have panel data containing 4 waves. I need help to keep only individuals who have participated in all waves. I saw this post on dropping observations, but ended up deleting everything.
Can anyone help me?
The details depend on information you haven't given. But suppose, for example, that you have variables id and wave, where wave runs 1 2 3 4. Conditions for selecting complete panels only might be
bysort id : keep if _N == 4
or
egen total = total(wave), by(id)
keep if total == 10
These commands won't help if you have a wave variable always present but the problem is missing values on other variables.
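For that case, a sketch along the following lines may help; income is a hypothetical variable name standing in for whatever variable must be observed in every wave:
* keep only individuals with non-missing income in all 4 waves
egen nonmiss = total(!missing(income)), by(id)
keep if nonmiss == 4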

how to set current quarter in Superset?

I want to set the current quarter dynamically, e.g. [2021-01-01 ~ 2021-04-01).
Does Superset support this? If so, how do I configure it?
The Last vs Previous options, and the date range control in general, have been a source of confusion for my users.
Last Quarter just shows the last 3 months [because it's a quarter of a year?].
It would be great to have options like Week to date, Month/Period to date, Quarter to date, etc...
Another issue is that each company may define their quarters/periods on different starting dates, depending on their fiscal calendar.
As a stop-gap, I've done the following:
1) enriched the underlying dataset with additional columns like period_start_date and fiscal_quarter_start_date;
2) created a fiscal_dates table that contains one row for every day over the years I need to query, with columns that correlate with date columns in my other tables, like dob, fiscal_week_start_date, period_start_date, fiscal_quarter_start_date (I created this table in Postgres using generate_series);
3) created a new virtual dataset containing the column period_start_date, which shows the last 4 years of period start dates;
4) used a value native filter to select from the list of dates;
5) made the values sorted descending, with the default value set to "first item in list".
This allows the user to select all records that occur in the same quarter/period, with a default of the current quarter.
The tentative apache/superset#17416 pull request should remedy this problem, i.e., for the QTD you would simply specify the START as datetrunc(datetime("now"), quarter) and leave the END undefined.

Stata generate by groups

I have a very weird thing happening with my code. I have a panel data set with the panel id being p_id, and I am trying to create another variable using the panel id. My code is below, where p_id is the panel id, marital_status is the marital status of the person observed in each time period, and x is the variable I want to create.
bys p_id: gen count =_N
bys p_id: gen count1 =_n
bys p_id: gen x= marital_status if count1 ==1
However when I do
tab x
I get different counts in the rows (the overall total does not change) each time I run this code. The numbers are pretty closely clustered, but I need to understand why this is happening.
Although the lack of a reproducible example is poor practice, it is possible to guess at what is going on. The first line of code is not problematic, but the last two lines have the same effect as
bys p_id: gen x = marital_status if _n == 1
In words, the new variable contains the marital status from the first observation in each group of observations with the same p_id. But sorting on p_id says nothing about the sort order of observations within the same p_id, and that within-group order is not reproducible without some further constraint. So the first observation in a group can easily differ from run to run (unless there is only one observation in the group), with the results you report.
Concretely, suppose that there are 3 observations for p_id 42. Then any of 6 possible orders of those observations is consistent with sorting on p_id. And so forth.
Presumably there is something special about one observation in each group. You would need to explain more about your data and what you want to get to allow fuller advice, but this problem is not a puzzle.
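If one particular observation per person should be picked out, a common fix (a sketch; year is a hypothetical variable name standing in for whatever defines "first") is to make the within-panel sort order explicit, so the result is reproducible:
* sort within p_id on year so the "first" observation is well defined
bysort p_id (year) : gen x = marital_status if _n == 1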

Merging data with a dates table gives strange behaviour when some recordfield is added

I'm using Crystal Reports again after not touching it for about 8 years.
I'm having this situation...
I have 1 data table, and 1 table with just day numbers from 1 to 31.
The two tables are not actually linked to each other.
In my report I let the user select a reference date.
From that date I derive the number of days in that month.
The report lists a row per day of that month, but there are no actual database fields in there: just the first 2 letters of the day name, the day number, and another formula-based field showing 'yes/no' or '' depending on a value in the main record.
So far so good.
In the group header I was adding fields from the main data table, which went fine until I added fields that rely on CASE expressions in the query on the SQL server; Crystal Reports just reads them out as 1 single record row with everything in it.
For some reason report generation goes from 1-2 seconds to 30-40 once I add that field, which just outputs 'X' or '' (it represents things assigned to that user).
Other reports where I'm using the same data still generate in 2 seconds.
To get this working right and to eliminate duplicate date records I'm stuck with 3 groups.
I think this isn't optimal and is the reason for the slowdown, although it wasn't there at the start.
So I was wondering:
Should I go for a sub report for the day listing?
Can I feed the subreport with my date parameter?
or is there some kind of scripted way to list a row x-times without all the grouping requirements?
Synchro was right, the problem was in the actual query/view.
For some reason the view takes half a minute longer if you just add an ORDER BY on a specific field.
The "where id between 211 and 265 or id=67" clause has been moved from a joined view into the actual query.
Thanks for the hint, Synchro.

Stata: using egen group() to create unique identifiers

I have a dataset where each row is a firm, year pair with a firmid that is a string.
If I do
duplicates drop firmid year, force
it doesn't delete anything since there are no duplicates (I originally created the dataset after running duplicates drop firmid year, force).
So far so good. I want to create a panel which requires a firmid that is numeric. So I run
egen newid = group(firmid)
xtset newid year
But the 'repeated time values in panel' error pops up. Moreover,
duplicates list newid year
lists a whole bunch of duplicates.
It seems as though egen, group() isn't generating unique groups. My question is: why, and how do I create unique groups in a robust way?
This is an old thread, but I have recently experienced the same symptoms, so I wanted to share my solution. Of course, as long as the questioner does not give further details, we will not know whether the cause is the same in my case as in theirs.
The problem turned out to be an issue of precision. As explained here in section 4.4, calculations on integers stored as floats are exact only up to 16,777,216. egen, group() creates a float variable by default, so if you have more than 16,777,216 firms in your sample, rounding error will result in the same ID being assigned to multiple firms. This is straightforwardly dealt with by storing the ID variable as a long instead:
egen long newid = group(firmid)
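As a quick check (a sketch, using the same variable names), you can then confirm that the new identifier together with year uniquely identifies observations before declaring the panel:
* isid aborts with an error if newid and year do not uniquely identify observations
isid newid year
xtset newid year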