Removing characters before a certain value in variable names in stata - stata

EDIT: the issue with this question was resolved as Stata changed the variable names in Excel to variable "labels" upon importing the data, and generated the variable "names" that I needed automatically. So the question is unnecessary.
I have a dataset in Stata that has a handful of variable names, some of which begin with a number and a period. Like so:
name of car 62. color of car 145. year of sale state of sale
Accord Red 1995 GA
Corvette Pink 2010 FL
...
How can I remove the numbers from the variable names that contain them so that I wind up with:
name of car color of car year of sale state of sale
Accord Red 1995 GA
Corvette Pink 2010 FL
...
I have some familiarity with the substr() function, but I am confused by the fact that the character count that I need to remove from is not consistent. Instead, I need to remove everything from the period following the number, back.

All those "names" are illegal as variable names, because Stata variable names just can't include spaces or periods or start with a number.
So either your Stata is corrupted beyond belief or you're misunderstanding what you have.
My best guess is that you have read in metadata so that text that could and should be variable labels is in fact making up the first observation (row) in your dataset. If so, the best advice is to go back and repeat the import so that metadata is not read into the dataset. The commands concerned have options to choose that.
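A minimal sketch of such a re-import, assuming the data came from an Excel workbook. The file name cars_demo.xlsx and the variable names are hypothetical, and the demo file is written first only so that the example runs on its own. With the firstrow option of import excel, the header row is treated as header text rather than as an observation. Header text that is not a legal variable name should then end up in the variable labels, as the EDIT to the question reports. The loop strips a leading number and period from each label with regexr().

```stata
* Build a small demo workbook whose header row holds the descriptive text
clear
input str10 car str6 color int year str2 state
"Accord"   "Red"  1995 "GA"
"Corvette" "Pink" 2010 "FL"
end
label variable car   "name of car"
label variable color "62. color of car"
label variable year  "145. year of sale"
label variable state "state of sale"
export excel using "cars_demo.xlsx", firstrow(varlabels) replace

* Re-import, treating the first row as header text rather than as data
import excel using "cars_demo.xlsx", firstrow clear

* Strip a leading "number + period" from every variable label
foreach v of varlist * {
    local lbl : variable label `v'
    local lbl = regexr(`"`lbl'"', "^[0-9]+\. *", "")
    label variable `v' `"`lbl'"'
}
describe
```

For a text file, import delimited with varnames(1) plays the same role as firstrow does here.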
In any case, it is immensely better to show data examples using dataex: see the tag wiki for Stata.

Related

Attrition in panel data - Stata

I am constructing a panel dataset based on the survey data for the years 2010-2013 (four consecutive years). As is usually the case with household survey data, there is an issue of attrition, i.e. some households drop out from the survey from year to year. I need to figure out whether these households are missing at random.
My idea is to come up with a dummy equal to 1 in 2011 if a household present in 2010 is missing in 2011 (and 0 otherwise), and so on for the years 2012, 2013. Then I want to run the logit/probit regression on this dummy with a set of covariates that I would like to control for in my study. The variable for household id is "hhid" and I have of course the time dimension variable "year".
Does anyone have a precise idea how this should be properly coded in Stata? I know it is not complicated, but I just cannot wrap my head around it and figure this out....
Here is an example of how to create dummies in panel data, collapse them to the parent unit of observation (so that a dummy is 1 if it was 1 in any time period), and then merge the parent-level data back onto the panel.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte hhid int year
1 2010
1 2011
1 2012
1 2013
2 2010
2 2011
2 2013
3 2010
3 2011
end
* Create one dummy per year: 1 if this year-hh observation is from that year
local year_dummies ""
forvalues year = 2010/2013 {
    gen dummy`year' = (year==`year')
    local year_dummies "`year_dummies' dummy`year'"
}
* Collapse to hh level: a dummy is 1 if the hh was observed in that year
preserve
collapse (max) `year_dummies' , by(hhid)
tempfile year_dummy_hhlevel
save `year_dummy_hhlevel'
restore
* Rename the observation-level dummies so the merge does not clash with them
rename dummy???? org_dummy????
* Merge the hh-level dummies back onto each year-hh observation
merge m:1 hhid using `year_dummy_hhlevel', nogen
Your question is whether the households you do not observe in year X differ from those you do observe in year X. There is no perfect way to answer this question because, by definition, you did not observe those households.
You did, however, observe all households in your study in year 0 (2010 in your case). As you imply yourself, you can use observations in year 0 as a proxy to assess whether those households are different in year X. I can show you how to code this, but StackOverflow is not the appropriate forum to discuss whether this is statistically valid given your data, how it was collected and what analysis you intend to use.
One way to code this is to use iebaltab in the package called ietoolkit available from SSC (disclosure, I wrote that command).
You can create an attrition dummy indicating attrition and use iebaltab like this: iebaltab balancevars, grpvar(attrition) where balancevars is a list of variables for household characteristics that you want to check were similar in year 0. You can use the option ftest to include the test across all balance variables the way you are suggesting.
Note that this command generates statistics, but it is up to you to decide whether this is valid; the validity of balance tests is hotly debated. But those debates are not about coding, which is what StackOverflow is about.
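One way to build the attrition dummy itself is sketched here on the same example data. It assumes attrition means "observed in 2010 but not in 2013". The names in2010, in2013 and attrition are illustrative, and balancevars remains a placeholder for your own baseline covariates.

```stata
clear
input byte hhid int year
1 2010
1 2011
1 2012
1 2013
2 2010
2 2011
2 2013
3 2010
3 2011
end

* Household-level flags: was the household observed in 2010 and in 2013?
egen byte in2010 = max(year == 2010), by(hhid)
egen byte in2013 = max(year == 2013), by(hhid)

* Attrition: present at baseline but absent in the final year
gen byte attrition = in2010 == 1 & in2013 == 0

* Keep the baseline year, where every household's covariates are observed
keep if year == 2010
list hhid attrition

* With ietoolkit installed (ssc install ietoolkit), one would then run:
* iebaltab balancevars, grpvar(attrition)
```

In this toy data only household 3 is flagged, since household 2 skips 2012 but returns in 2013. A year-by-year definition of attrition would need one such dummy per year.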

longitudinal dataset categorical variables

I have a longitudinal dataset which contains variables on individuals from 2 waves from Feb and June which measure economic activity across these individuals. The variables from Feb and May wave are categorical variables and I am running the proportion command in Stata to get the individual change in economic activity. For example. I am looking for changes in hours worked across 2 waves and I run proportion but am not able to figure out the if condition as I only want individuals who responded in both Feb and June. I want to drop all those who responded in Feb but not in May or likewise.
Let's suppose you have an identifier variable id and a time-like variable, wave that takes values 1 and 2. If so, you are looking for individuals that satisfy
bysort id (wave) : gen wanted = wave[1] == 1 & wave[2] == 2
So wanted is an indicator that is 1 for individuals present in both waves and 0 otherwise, and if wanted is an if condition that selects those individuals.
There are many variations on this, depending on: your variable names; your data layout; how the information on waves is held (could be also, say, a string variable containing values like "Feb", "May" or "June", or a numeric variable holding dates).
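As one such variation, here is a hedged sketch for the case where the wave is held as a string variable. The data and the names id, wave and hours_cat are made up for illustration; your own names and wave values will differ.

```stata
clear
input byte id str4 wave byte hours_cat
1 "Feb"  1
1 "June" 2
2 "Feb"  3
3 "June" 1
end

* Flag, for every observation of a person, whether each wave is present
egen byte in_feb  = max(wave == "Feb"),  by(id)
egen byte in_june = max(wave == "June"), by(id)

* wanted is 1 only for people observed in both waves
gen byte wanted = in_feb & in_june

* Restrict any later command to the balanced sample
tabulate wave hours_cat if wanted, row
```

Here only id 1 appears in both waves, so only its two observations enter the table.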
You gave a broad-brush description sketching the problem but almost no precise information on the data. The stata tag wiki gives much detailed advice on how to post a question and flags the importance of giving a concrete data example.

Deleting all observations given one observation for each variable's type

I have a table with firm identifiers, fiscal year, quarter and market_capital. I want to delete all firm observations that had a specific market capital at a specific quarter of a specific year. That is, I want to delete all observations for a firm if its market capital for 2006, quarter 2 was below 50.
My table is in the form:
[image of the data table]
If I understand correctly, you have a Stata dataset containing four variables which I will call firm, year, quarter, and mc (since "Capital Market" shown in the picture of your data is not a valid Stata variable name).
The following code might start you in the right direction, but it is untested since my copy of Stata cannot read the picture of your data, and "I want to retype data from a picture of data" said nobody, ever.
Added in edit: the untested code had an error, so I removed it.
Having a quarterly date variable -- rather than separate year and quarter variables -- will be needed sooner or later.
That could be
gen qdate = yq(year, quarter)
format qdate %tq
Then your code for dropping is
egen todrop = total(capital < 50 & qdate == yq(2006, 2)), by(firm)
drop if todrop
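as the variable todrop counts, within each firm, the observations that meet the condition: with one observation per firm and quarter it will be 1 if and only if you want to drop a firm and 0 otherwise.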
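A small worked example of the steps above, using made-up numbers. The variable names firm, year, quarter and capital follow the code in this answer rather than the original data, which was only shown as a picture.

```stata
clear
input byte firm int year byte quarter float capital
1 2006 1 60
1 2006 2 40
1 2006 3 70
2 2006 1 80
2 2006 2 90
2 2006 3 45
end

* Quarterly date built from the separate year and quarter variables
gen qdate = yq(year, quarter)
format qdate %tq

* Count, per firm, the observations with capital below 50 in 2006q2
egen todrop = total(capital < 50 & qdate == yq(2006, 2)), by(firm)

* Drop every observation of any firm that has such an observation
drop if todrop
list firm qdate capital
```

Firm 1 is dropped entirely because its capital was 40 in 2006q2. Firm 2 is kept, since its value below 50 falls in 2006q3.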
See this paper for a review of related technique.

Can I impute a variable conditional on another?

I am trying to impute the data about whether someone is born in the UK from wave 1 to wave 2. I suspect the egen function would work but I am not sure what the code would look like?
As you can see, I need to assign the same born in the uk response for person id 1 in wave 1 to wave 2.
I know I could do it by reshaping the dataset to a wide format but do you know whether there is any other way?
This is a Stata FAQ as accessible here.
You can copy downwards in the dataset without creating any new variables.
bysort id (wave) : replace born_in_uk = born_in_uk[_n-1] if missing(born_in_uk)
mipolate (SSC) has a groupwise option that checks for there being more than one non-missing value. Search within www.statalist.org for mentions.
Note that egen is a command, not a function.
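The copy-downwards command can be tried on a toy dataset like this one. The values are invented, and the sketch assumes born_in_uk is numeric and recorded in wave 1 for every person.

```stata
clear
input byte id byte wave byte born_in_uk
1 1 1
1 2 .
2 1 0
2 2 .
end

* Within each person, sorted by wave, fill a missing value from the previous wave
bysort id (wave) : replace born_in_uk = born_in_uk[_n-1] if missing(born_in_uk)
list, sepby(id)
```

Because replace works through the observations in order, the same line also fills runs of several missing waves, as long as an earlier wave holds a value.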
I am not sure whether born in the UK is numeric with value labels or a string here. If it is a string, what if you did something like:
encode born_in_UK, gen(born_num)
bysort person_id: egen born_num2=mean(born_num)
drop born_num
rename born_num2 born_num
The idea is to think of the repeating personal ids as groups and use the mean function to fill the missing values in the group. I think this should work.

How to import dates from Excel into Stata

I'm using Stata 12.0.
I have a CSV file of exposures for days of the year e.g. 01/11/2002 (DMY).
I want these imported into Stata and it to recognise that it is a date variable. I've been using:
insheet using "FILENAME", comma
But by doing this I am only getting the dates as labels rather than names of the variables. I guess this is because Stata doesn't allow variable names to start with numbers. I have tried to reformat the cells as Dates in Excel and import but then Stata thinks the whole column is a Date and changes the exposure data into dates.
Any advice on the best course of action is appreciated...
As commented elsewhere, I too think you probably have a dataset that is best formatted as panel data. However, I address first the specific problem I think you have according to your question. Then I show some code in case you are interested in switching to a panel structure.
Here is an example CSV file open as a spreadsheet:
And here the same file, open in a text editor. Imagine the ; are ,. This is related to my system's language settings.
Running this (substitute delimiter(";") for comma, in your case):
clear all
set more off
insheet using "D:\xlsdates.csv", delimiter(";")
results in
which I think is the problem you describe: dates as variable labels. You would like to have the dates as variable names. One solution is to use a loop and strtoname() to rename the variables based on the variable labels. The following goes after importing with insheet:
foreach var of varlist * {
local j = "`: variable l `var''"
local newname = strtoname("`j'", 1)
rename `var' `newname'
}
The result is
The function strtoname() replaces illegal characters with underscores. See help strtoname.
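To see what strtoname() does to a single label, a one-line check is enough. The date string is just the example value from the question.

```stata
* The second argument, 1, asks for a leading underscore when the text
* starts with a digit, so that the result is a legal variable name
display strtoname("01/11/2002", 1)
* displays _01_11_2002
```

The leading underscore is what later makes _ usable as the stub name in reshape long.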
Now, if you want to work with a panel structure, one way would be:
clear all
set more off
insheet using "D:\xlsdates.csv", delimiter(";")
* Rename variables
foreach var of varlist * {
local j = "`: variable l `var''"
local newname = strtoname("`j'", 1)
rename `var' `newname'
}
* Generate ID
generate id = _n
* Change to long format
reshape long _, i(id) j(dat) string
* Sensible name
rename _ metric
* Generate new date variable
gen dat2 = date(dat,"DMY", 2050)
format dat2 %d
list, sepby(id)
As you can see, there's no need to do anything beforehand in Excel or in an editor. Stata seems to be enough in this case.
Note: I've reused code from http://www.stata.com/statalist/archive/2008-09/msg01316.html.
A further note on performance: A CSV file with 122 variables or days (columns) and 10,000 observations or subjects (rows) + 1 header row will produce 1,220,000 observations after the reshape. I have tested this on an old machine with a 1.79 GHz AMD processor and 640 MB RAM, and the reshape takes approximately 8 minutes. Stata 12 has a hard limit of 2,147,483,647 observations (although available RAM determines whether you can actually reach it), and Stata SE has a limit of 32,767 variables.
There seems to be some confusion here between the names that variables may have, the values that variables may have and the types that they may have.
Thus, the statement "Stata doesn't allow variables to start with numbers" appears to be a reference to Stata's rules for variable names; if it were true, numeric variables would be impossible.
Stata has no variable (i.e. storage) type that is a date. Strictly, it has no concept of a date variable, but dates may be held as strings or numbers. Dates may be held as strings insofar as any text indicating a date is likely to be a string that Stata can hold. This is flexible, but not especially useful. For almost all useful work, dates need to be converted to integers and then assigned a display format that matches their content to be readable by people. Stata has various conventions here, e.g. that daily dates are held as integers with 0 meaning 1 January 1960.
It seems likely in your case that daily dates are being imported as strings: if so, the function date() (also known as daily()) may be used to convert to an integer date. The example here just uses the minimal default display format for daily dates: friendlier formats exist.
. set obs 1
obs was 0, now 1
. gen sdate = "12/03/12"
. gen ndate = daily(sdate, "DMY", 2050)
. format ndate %td
. l
     +----------------------+
     |    sdate       ndate |
     |----------------------|
  1. | 12/03/12   12mar2012 |
     +----------------------+
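As a sketch of one of the friendlier formats, the same conversion can be applied to a four-digit-year date like the one in the question and displayed as day/month/year. The variable names are again illustrative.

```stata
clear
set obs 1
gen sdate = "01/11/2002"

* Four-digit years need no third argument to daily()
gen ndate = daily(sdate, "DMY")

* Show the integer date as DD/MM/YYYY instead of the default 01nov2002
format ndate %tdDD/NN/CCYY
list
```

The stored value is still an integer count of days since 1 January 1960. Only the display format changes.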
If your variable names are being misread, as guessed by @ChrisP, you may need to tell us more. A short and concrete example is worth more than a longer verbal description.