I have data on online job postings, but some variables are stored as strings when I want them to be numeric so that I can create time series graphs.
The three variables I am interested in converting into numeric variables look as follows:
dataex month posted_date revenue
* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 month str19 posted_date str32 revenue
"March_2021" "2021-03-08 10:44:15" "Less than $1 million (USD)"
"March_2021" "2021-03-08 10:44:15" "Less than $1 million (USD)"
"Dec_2020" "2020-12-13 08:04:59" "$10+ billion (USD)"
"Nov_2020" "44150.33611" "$10+ billion (USD)"
"Dec_2020" "2021-01-04 04:59:40" "$10+ billion (USD)"
"Nov_2020" "44167.24444" "$10+ billion (USD)"
"Dec_2020" "2020-12-16 10:49:38" "$10+ billion (USD)"
"Nov_2020" "44167.24514" "$10+ billion (USD)"
"Nov_2020" "44172.01319" "$10+ billion (USD)"
"Dec_2020" "2020-12-30 05:52:25" "$10+ billion (USD)"
"April_2021" "2021-04-21 04:16:12" ""
"April_2021" "2021-04-21 04:16:12" ""
"Feb_2021" "2021-03-01 01:03:09" ""
"Feb_2021" "2021-03-01 01:03:09" ""
"Feb_2021" "2021-03-01 01:03:09" ""
"April_2021" "2021-04-21 05:57:59" ""
"April_2021" "2021-04-21 05:57:59" ""
"Dec_2020" "2020-12-22 08:13:06" "$500 million to $1 billion (USD)"
end
I would like the new variables to look something like the following:
month_n posted_date_n revenue_n
02/21 09/02/21 $500m_1B
03/21 14/03/21 +10B
04/21 11/04/21 +1m
So based on the instructions here, I ran the following code:
// Destring string variables with numerical values
gen posted_date_n = real(posted_date)
gen month_n = real(month)
gen revenue_n = real(revenue)
However, I could not really get what I am looking for and instead, the data looks as follows:
dataex revenue_n posted_date_n month_n
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(revenue_n posted_date_n month_n)
. . .
. . .
. . .
. 44150.34 .
. . .
. 44167.25 .
. . .
. 44167.25 .
. 44172.01 .
. . .
. . .
. . .
end
I was able to run code that gets the data into almost the form you wanted, except for date values like 44150.33611. These appear to be Excel serial dates, as noted by @JR96.
I recommend using the split command; a really handy write-up by Nick Cox is a useful read (source).
// Month/Year
split month, p("_")
drop month
rename month1 month
gen month_n = date(month,"M")
format month_n %td_Month
rename month2 year
destring year, replace
format year %ty
rename year year_n
// Posted Date
split posted_date, p(" ")
drop posted_date
rename posted_date1 date
rename posted_date2 time
gen posted_date_n = date(date, "YMD")
format %tdNN/DD/CCYY posted_date_n
This doesn't do exactly what you ask for, but in my opinion it is better than nothing. Example output:
month_n, year_n, posted_date_n
March, 2021, 03/08/2021
March, 2021, 03/08/2021
Here everything is formatted as a date that Stata can recognize. Maybe someone else can jump in here on combining the month_n and year_n columns?
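For the values like 44150.33611 that the split approach leaves unconverted, here is a minimal Python sketch of one way to handle the mixed column outside Stata. It assumes those values are Excel serial dates in the 1900 date system (days counted from 1899-12-30); the function name parse_posted_date is hypothetical, not something from the original answer.

```python
from datetime import datetime, timedelta

# Assumption: the workbook used Excel's 1900 date system, whose serial
# numbers count days from 1899-12-30. The 1904 system needs another origin.
EXCEL_ORIGIN = datetime(1899, 12, 30)

def parse_posted_date(value):
    """Parse 'YYYY-MM-DD HH:MM:SS' text or an Excel serial number string."""
    try:
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Not a timestamp string, so treat it as days since the Excel origin.
        # Round to whole seconds because the serial carries only 5 decimals.
        seconds = round(float(value) * 86400)
        return EXCEL_ORIGIN + timedelta(seconds=seconds)

print(parse_posted_date("2021-03-08 10:44:15"))
print(parse_posted_date("44150.33611"))
```

Under that assumption the serial value lands in mid-November 2020, which is consistent with its Nov_2020 month label in the example data.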
I have data with dates as:
ID Date1 Date2
1 1.929e+12 1.929e+12
2 1.917e+12 1.901e+12
3 1.922e+12 .
Based on other dates in the dataset, they should be in yyyy-mm-dd hh:mm:ss format.
Open to solutions within Stata or using different software.
Your values are already date-times so all you need to do is assign a date format starting %tc. Here are examples using display on one of your scalar values:
. di %tc 1.929e+12
15feb2021 09:20:00
. di %tcCCyy-NN-dd_HH:MM:SS 1.929e+12
2021-02-15 09:20:00
See
help datetime_display_formats
help format
for how to select a datetime format and assign it to a variable.
For example,
format Date1 %tcCCyy-NN-dd_HH:MM:SS
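Since the question is open to other software, the same conversion can be sketched in Python. A %tc value is a count of milliseconds since 01jan1960 00:00:00, so adding that many milliseconds to the origin recovers the timestamp. The function name stata_tc_to_datetime is hypothetical, and the sketch ignores missing values.

```python
from datetime import datetime, timedelta

# Stata %tc values count milliseconds from the start of 1960.
STATA_ORIGIN = datetime(1960, 1, 1)

def stata_tc_to_datetime(ms):
    """Convert a Stata %tc value (milliseconds since 1960) to a datetime."""
    return STATA_ORIGIN + timedelta(milliseconds=ms)

# Same scalar value as in the display examples above.
print(stata_tc_to_datetime(1.929e12).strftime("%Y-%m-%d %H:%M:%S"))
```

Note that 1.929e+12 is only the rounded display of the stored value; the actual data may carry more digits than are shown.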
I have a pandas data frame which contains a column with dates. The dates are represented by strings in the format mm/dd/yyyy.
But I have a problem with the format of the day: dates up to the 9th day of a month are in the format mm/d/yyyy. For example, the first of December 2008 is displayed as 12/1/2008. The 10th day through the end of a month are displayed as mm/dd/yyyy. For example, the 17th of December 2008 is represented by 12/17/2008.
My goal is to transform all the dates into the form mm.dd.yyyy. The code would then represent the above examples as 12.01.2008 and 12.17.2008.
My idea is to write the day, month and year into separate columns and then join the strings in the format mm.dd.yyyy.
So far I have tried to extract the year and the month by their position in the string (see code and example below). But this does not work for the days, as there are two cases: the day has either one or two digits.
My idea is to use a regular expression. It is basically the case of a backslash, one or two digits, and a backslash. But I do not know how to express this as a regular expression.
Or is there a totally different approach which is much simpler?
Thank you in advance for the help! I am sure that there is a way to do this with regular expressions, but I am also grateful for totally different approaches.
import pandas as pd
# example data frame with dates in the format mm/d/yyyy and mm/dd/yyyy
df = pd.DataFrame({'date' : ['12/1/2008','12/5/2008','12/10/2008','12/17/2008']})
# withdraw month
df['month'] = df['date'].str[:2]
# withdraw year
df['year'] = df['date'].str[-4:]
# withdraw day - this is my problem
df[day] = df['day'] = df['date'].str.extract(r'[\]\d*')
# generate string with dates in the format mm.dd.yyyy
df['date_new'] = df['month'] + '.' + df['day'] + '.' + df['year']
From the code of df['day'] I get the following error: error: unterminated character set at position 0
I think you are looking for this:
df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].dt.strftime('%m.%d.%Y')
Output:
date
0 12.01.2008
1 12.05.2008
2 12.10.2008
3 12.17.2008
Another thing to bring to your attention: if you want to extract days, months, or years, pandas provides the dt accessor for datetime columns, so you need to convert your column to that type first.
You can access days and months like this:
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['year'] = df['date'].dt.year
You will get something like:
date month day year
0 2008-12-01 12 1 2008
1 2008-12-05 12 5 2008
2 2008-12-10 12 10 2008
3 2008-12-17 12 17 2008
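Since the question also asks how this could be expressed as a regular expression, here is a minimal sketch using only the standard library. The separator in the data is a forward slash, so the pattern matches one or two digits on either side of "/" and the result is zero-padded on output. The names DATE_PATTERN and to_dotted are hypothetical.

```python
import re

# One or two digits for month and day, four digits for the year,
# separated by forward slashes (the separator actually used in the data).
DATE_PATTERN = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")

def to_dotted(date_str):
    """Turn 'm/d/yyyy' or 'mm/dd/yyyy' into zero-padded 'mm.dd.yyyy'."""
    match = DATE_PATTERN.fullmatch(date_str)
    if match is None:
        raise ValueError(f"unexpected date format: {date_str!r}")
    month, day, year = match.groups()
    return f"{int(month):02d}.{int(day):02d}.{year}"

dates = ['12/1/2008', '12/5/2008', '12/10/2008', '12/17/2008']
print([to_dotted(d) for d in dates])
```

On a data frame the function could be applied with df['date'].map(to_dotted). The to_datetime route above is shorter and also validates the dates, so the regular expression is mainly useful when the values must stay strings.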
I am trying to make an easy to use do file where the user inserts the names of the towns s/he wants to summarize and then Stata:
summarizes the towns
saves the results in an Excel file
exports the names of the towns summarized
I am using a list saved in a local macro since it works well with the inlist() function:
clear
input Date AskingRent str10 Town
2019 12 Boston
2019 13 Cambridge
2018 14 Boston
2018 15 Cambridge
end
local towns `" "Billerica", "Boston" "'
keep if inlist(Town, `towns')
***some analysis
putexcel set "results.xlsx", modify
putexcel A1 = `towns'
I want the Excel file to have "Billerica, Boston" in cell A1.
However, I get an error in the last line of code that says:
nothing found where expression expected
The following works for me:
clear
input foo1 str20 foo2
5 "Billerica"
6 "Boston"
7 "London"
8 "New York"
end
. local towns `" "Billerica", "Boston" "'
. keep if inlist(foo2, `towns')
. putexcel set "results.xlsx", modify
. putexcel A1 = `"`towns'"'
file results.xlsx saved
I have weekly Google Trends Search query data in Stata. Here is a sample of what the data looks like:
I converted the date string into a date object like so:
gen date2 = date(date, "YMD")
gen year= year(date2)
gen w = week(date2)
gen weekly = yw(year,w)
format weekly %tw
I now want to declare "date2" as my time series reference, so I did the following:
tsset date2, weekly
However, upon using tsreport I get the following information
However, I should have no gaps in the data, as it is weekly. For some reason, Stata is still assuming I have daily data.
I cannot take first differences on any of these variables because of this issue. How do I resolve this?
I agree with William Lisowski's general advice but have different specific recommendations.
You have weekly data with daily flags for each week.
Stata weeks are likely to be of little or no use to you for reasons documented in detail in references that
search week, sj
will disclose. Specifically,
SJ-12-4 dm0065_1 . . . . . Stata tip 111: More on working with weeks, erratum
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q4/12 SJ 12(4):765 (no commands)
lists previously omitted key reference
SJ-12-3 dm0065 . . . . . . . . . . Stata tip 111: More on working with weeks
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q3/12 SJ 12(3):565--569 (no commands)
discusses how to convert data presented in yearly and weekly
form to daily dates and how to aggregate such data to months
or longer intervals
SJ-10-4 dm0052 . . . . . . . . . . . . . . . . Stata tip 68: Week assumptions
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q4/10 SJ 10(4):682--685 (no commands)
tip on Stata's solution for weeks and on how to set up
your own alternatives given different definitions of the
week
Issuing that search command will give you links to .pdf copies of each paper.
I suggest simply
gen date2 = daily(date, "YMD")
format date2 %td
tsset date2, delta(7)
daily() is the same function as date() but I think the name is a better signal to all of precisely what it does. The more important detail is that delta(7) is sufficient to indicate daily data spaced 7 days apart, which is precisely what you have.
To expand on the problem you had: when you converted to daily dates, then you got a numeric variable with values like 18755 in steps of 7 to your last date. You then told Stata through your tsset ..., weekly that these are really weeks. Stata uses an origin for all dates like these of the beginning of 1960. So, Stata is working out what 18755 weeks (etc.) from the beginning of 1960 would be. And your numeric variable is still in steps of 7. So, the reason that Stata is misinterpreting your data is that you gave it incorrect information. tsset will never change a date variable; it just interprets it as you instruct.
Note also that you created a weekly date variable but then did not use it. That wouldn't have been a good solution either, but it would have been closer to what you want. It appears that all your dates are Sundays, so some years would contain 53 of them and others 52; that's not true of Stata's own weeks, of which there are always 52 in a year.
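The arithmetic behind that misreading can be checked outside Stata. The Python sketch below interprets the same number first as a daily date (days since 01jan1960) and then as a Stata weekly date, assuming Stata's convention of exactly 52 weeks per year counted from 1960w1. The helper names are hypothetical, and 18755 is just the illustrative value used above.

```python
from datetime import date, timedelta

STATA_ORIGIN = date(1960, 1, 1)

def as_daily(value):
    """Read a number as a Stata daily date: days since 01jan1960."""
    return STATA_ORIGIN + timedelta(days=value)

def as_stata_week(value):
    """Read a number as a Stata weekly date: weeks since 1960w1,
    assuming exactly 52 weeks in every year."""
    years, week_index = divmod(value, 52)
    return 1960 + years, week_index + 1

print(as_daily(18755))       # the date the number was meant to be
print(as_stata_week(18755))  # (year, week) when declared as weekly
print(as_stata_week(18762))  # the next observation, 7 "weeks" later
```

Read as days, the value falls in 2011. Read as weeks, it falls centuries later, and consecutive observations sit 7 weeks apart, which is why tsset reports gaps.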
The problem would be more helpfully stated if it included a listing of the data, rather than a picture, so that others could test and demonstrate correct code.
With that said, you need to carefully review the output of help datetime to improve your understanding of how to work with Stata Internal Format (SIF) date and time data, and of the meaning of a "weekly date" in Stata. I believe that something like the following will start you along the correct path.
gen date2 = date(date, "YMD")
gen weekly = wofd(date2)
format weekly %tw
or in one fewer step
gen weekly = wofd(date(date, "YMD"))
format weekly %tw
I am trying to calculate the 95% binomial Wilson confidence interval for the proportion of people completing treatment by year (dataset is line-listed for each person).
I want to store the results into a matrix so that I can use the putexcel command to export the results to an existing Excel spreadsheet without changing the formatting of the sheet. I have created a binary variable dscomplete_binary which is 0 for a person if treatment was not completed, and 1 if treatment was completed.
I have tried the following:
bysort year: ci dscomplete_binary, binomial wilson level(95)
This gives output of each year with the 95% confidence intervals. Previously I used statsby to collapse the dataset to store the results in variables but this clears the dataset from the memory and so I have to constantly re-open it.
Is there a way to run the command and store the results in a tabular format so that the data is stored in a similar way to this:
year mean LowerCI UpperCI
r1 2005 .7031588 .69229454 .71379805
r2 2006 .75532377 .74504232 .7653212
r3 2007 .78125924 .77125096 .79094833
r4 2008 .80014324 .79059798 .80935836
r5 2009 .81860977 .80955398 .82732689
r6 2010 .82641232 .81723672 .83522016
r7 2011 .81854123 .80955547 .82719356
r8 2012 .83497983 .82621944 .8433823
r9 2013 .85411799 .84527379 .86253893
r10 2014 .84461939 .83499599 .85377985
I have tried the following commands, which give different estimates to the binomial Wilson option:
svyset id2
bysort year: eststo: ci dscomplete_binary, binomial wilson level(95)
I think the postfile family of commands will help you here. This won't save your data into a matrix, but will save the results of the ci command into a new data set, which you name and whose structure you set. After the analysis is complete, you can load the data saved by postfile and export to Excel in the manner of your choosing.
For postfile, you analyze the data in a loop instead of using by or bysort.
Assuming the years in your data run 2005-2014, here is sample code:
/*make sure no postfile is open, in case a previous run did not close the file*/
cap postclose ci_results
/*create the postfile that will store results*/
postfile ci_results year mean lowerCI upperCI using ci_results.dta, replace
/*loop through years*/
forval y = 2005/2014 {
ci dscomplete_binary if year==`y', binomial wilson level(95)
/*store saved results from ci to postfile. Make sure the post statement names the postfile and lists results in the same order stated in the postfile command.*/
post ci_results (`y') (r(mean)) (r(lb)) (r(ub))
}
}
/*close the postfile once you've looped through all the cases of interest*/
postclose ci_results
use ci_results.dta, clear
Once you load the ci_results.dta data into memory, you can apply any Excel exporting command you like.
This is a development of the suggestion already made to use statsby. The objections to it are quite puzzling, as it is easy to get back to the original dataset. There is some machine time in re-loading a dataset, but how much personal time has been spent in pursuit of an alternative?
Absent a dataset which we can use, I've provided a reproducible example.
If you wish to do this repeatedly, you'll write a more elaborate program to do it, which is what this forum is all about.
I leave how to export results to Excel as a matter for those so inclined: no details of what is wanted are provided in any case.
. sysuse auto, clear
(1978 Automobile Data)
. preserve
. statsby mean=r(mean) ub=r(ub) lb=r(lb), by(rep78) : ci foreign, binomial wilson level(95)
(running ci on estimation sample)
command: ci foreign, binomial wilson
mean: r(mean)
ub: r(ub)
lb: r(lb)
by: rep78
Statsby groups
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.....
. list
+----------------------------------------+
| rep78 mean ub lb |
|----------------------------------------|
1. | 1 0 .6576198 0 |
2. | 2 0 .3244076 0 |
3. | 3 .1 .2562108 .0345999 |
4. | 4 .5 .7096898 .2903102 |
5. | 5 .8181818 .9486323 .5230194 |
+----------------------------------------+
. restore
. describe
The describe results will show that we are back where we started.
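As a cross-check on what ci reports with the wilson option, the Wilson score interval can be computed directly from the number of successes and the group size. The Python sketch below is an illustration of the standard formula, not Stata's implementation. The counts are assumptions taken from the standard auto dataset (3 foreign cars out of 30 for rep78 == 3, and 0 out of 2 for rep78 == 1), and the function name wilson_interval is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, level=0.95):
    """Return (proportion, lower, upper) for the Wilson score interval."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, centre - half_width, centre + half_width

# Assumed counts per group: (successes, group size).
counts = {1: (0, 2), 3: (3, 30)}
for group, (k, n) in counts.items():
    mean, lb, ub = wilson_interval(k, n)
    print(group, round(mean, 7), round(lb, 7), round(ub, 7))
```

With those assumed counts, the bounds can be compared against the rep78 rows in the listing above. The same function could be looped over years once successes and totals per year are known.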