How to collapse data by week correctly in Stata? - stata

I have a transaction level dataset and I want to collapse and calculate weekly average price. The dataset can be simplified as follows,
clear
input str9 date quantity price id
"01jan2010" 50 70 1
"02jan2010" 60 80 2
"02jan2010" 70 90 3
"04jan2010" 70 95 4
"08jan2010" 60 81 5
"09jan2010" 70 88 6
"12jan2010" 55 87 7
"13jan2010" 52 88 8
end
gen date2=date(date,"DMY")
format date2 %td
drop date
I want to create a variable date3. For every transaction happened in a week, date3 is the Monday of that week.
Here's the code I have:
sort date2
gen date3=date2 if dow(date2)==1
replace date3=date3[_n-1] if missing(date3)
format date3 %td
However, there are Mondays with no transactions, but the rest of the week has transactions. In those cases, date3 is not the Monday date of that week, but Monday date in the weeks before.
My data becomes the following using the above code:
quantity price id date2 date3
50 70 1 01jan2010
60 80 2 02jan2010
70 90 3 02jan2010
70 95 4 04jan2010 04jan2010
60 81 5 08jan2010 04jan2010
70 88 6 09jan2010 04jan2010
55 87 7 12jan2010 04jan2010
52 88 8 13jan2010 04jan2010
To me, it does not matter if id =1,2,3 have no date3. What I am concerned is that id=7 and id=8 should have a date3 of 11jan2010. But because there is no transaction on that day, the date becomes 04jan2010. Is there a way to fix this?
(I was thinking of constructing a new dataset with consecutive dates since 01jan2010 and then merge with the one above, and then drop if missing quantity of price. But I was wondering if there's a more efficient way).
In addition, I have a weekly index data that reports on every Friday since 01jan2010. If I use wofd command, Stata will generate 53 weeks in 2010. (Or more precisely, two 2010w52.) How can I get just 52 weeks in Stata?
(I found this http://www.stata.com/statalist/archive/2012-02/msg01030.html but I still cannot figure out how this can help solve my problem. )

Your weeks start on Mondays. Everything you need follows from using dow() to exploit the fact that in every one of your weeks, the day of week function dow() yields 1, 2, 3, 4, 5, 6, 0 for the days from Monday to Sunday.
The present or previous Monday for daily dates daily is just
gen Monday = cond(dow(daily) == 0, daily - 6, daily - dow(daily) + 1)
The branch is like this. If it's a Sunday, the previous Monday was 6 days ago. Otherwise, the Monday that starts the week was today if it's Monday and dow() yields 1, yesterday if it's Tuesday and 2, and so forth. Here the variable Monday is just the dates of Mondays that define the weeks.
Important detail: There are no assumptions here about dates being complete in the data or even in order.
Small note: Arbitrary names like date2 and date3 mean nothing much. Use evocative names in your questions (and your practice).
There was a sequel to the article mentioned by Robert Ferrer. search week, sj in Stata to get the references.
Do not use Stata's weeks and in particular do not use the wofd() function (not a command), as they can't help you. Stata's weeks will not map on to your weeks. The article mentioned by Robert Ferrer really is worthwhile reading to understand this (even though I wrote it).
(This is all explained in the Statalist threads you link to.)

Related

How do I calculate average from June 1 to June 30 annually and make it a dynamic function using DAX?

I have two tables in PowerBI, one modified date and one fact for customer scores. The relationship will be using the "Month Num" column. Score assessments take place every June, so I would like to be able to have the scores for 12 months (June 1 to June 30) averaged. Then I will just have a card comparing the Previous year score and Current year score. Is there a way to do this dynamically, so I do not have to change the year in the function every new year? I know using the AVERAGE function will be nested into the function somehow, but I am getting confused not using a calendar year and not seasoned enough to use Time Intelligence functions yet.
Customer Score Table
Month
Month Num
Year
Score
Customer #
June
6
2020
94.9
11111
July
7
2020
97
11111
months
continue
2020
100
June
6
2021
89
22222
July
7
2021
91
22222
months
continue
2021
100
June
6
2022
93
33333
July
7
2022
94
33333
Date Table
Month
Month Num
Month Initial
january
1
J
feb
2
F
march
3
M
other
months
continued

Inactivity duration variable in panel data (Stata)

I have a dataset for U.S. manufacturing workers in the past 30 decades, and I am particularly interested in the following variables:
Month and year of 1st manufacturing job, recorded separately and named "start_month_job_1" & "start_yr_job_1."
Month and year of leaving the 1st manufacturing job, recorded separately and named "end_month_job_1" & "end_yr_job_1."
The reason for leaving the job (e.g. retirement, firing, factory shutdown, etc.), named "leaving_reason"
Month and year of 2nd manufacturing job, recorded separately and named "start_month_job_2" & "start_yr_job_2."
Month and year of leaving the 2nd manufacturing job, recorded separately and named "end_month_job_2" & "end_yr_job_2."
I am trying to create a variable that measures the duration of economic inactivity/idleness. I am defining "duration of economic inactivity" this as the time difference between leaving a 1st job and starting another job. I have created a variable that accomplishes that with years as in below:
gen econ_inactivity_duration_1 = start_yr_job_2 - end_yr_job_1
replace econ_inactivity_1 = 2018 - end_yr_job_1 if missing(start_yr_job_2 ) /// In cases where a worker never starts a second job until 2018, which is the latest year measured in the survey.
However, I want to actually create an economic_inactivity_duration variable that takes into account the difference in month and year, for both starting and leaving a job, respectively. For instance, the duration for the worker in row 1 would be 2 months, between May, 1993 and July, 1993, as opposed to zero, which is what my code above computes.
dataex start_month_job_1 byte start_yr_job_1 byte end_month_job_1 byte end_yr_job_1 byte start_month_job_2 byte start_yr_job_2 byte end_month_job_2 byte end_yr_job_2 byte leaving_reason
3 1990 5 1993 7 1993 4 1994 "Firm shutdown"
1 2003 7 2015 . . . . "job automation"
98 1979 98 2004 . . . . "Firm shutdown"
98 1975 98 2010 98 2010 98 2015 "job automation"
1 1983 12 1985 1 1986 . . "Firm shutdown"
98 1996 98 1998 . . . . "Firm shutdown"
There is probably a better way, but here is a crude method.
* Data example
input end_month_job_1 end_yr_job_1 start_month_job_2 start_yr_job_2
5 1993 7 1993
end
* Calculate months since 1960
gen j1_end = (end_yr_job_1 - 1960) * 12 + end_month_job_1
gen j2_start = (start_yr_job_2 - 1960) * 12 + start_month_job_2
* Calculate difference
gen wanted = j2_start - j1_end
* Check difference is positive
assert wanted > 0
list
+------------------------------------------------------------------------+
| end_mo~1 end_yr~1 s~mont~2 s~yr_j~2 j1_end j2_start wanted |
|------------------------------------------------------------------------|
1. | 5 1993 7 1993 401 403 2 |
+------------------------------------------------------------------------+

DAX to show rows with null date values in selected date slicer range

I have a typical scenario as below.
I have a student table and it contains four columns as below :-
1.StudentID
2.StudentName
3.LastAttendanceDate
4.StudentType
Now there are some null values in the date column LastAttendanceDate.Is it possible to use a date slicer to show these values of the students who have LastAttendanceDate column value as null? In simple words: Say you are a student who went to a school on Monday, Tuesday and Friday and you were absent on Wednesday and Thursday so here Wednesday and Thursday are the days where you were absent in the week and we need to display these records in the table visualization.
My excel Input data:-
StudentID StudentName LastAttendanceDate StudentType
100 Mary 02-05-2011 10:45 Fulltime
100 Mary Fulltime
100 Mary 04-05-2011 12:45 Fulltime
100 Mary 06-05-2011 15:45 Fulltime
100 Mary Fulltime
100 Mary 08-05-2011 19:45 Fulltime
100 Mary 09-05-2011 12:45 Fulltime
101 John 02-05-2011 10:45 Part Time
101 John 03-05-2011 11:23 Part Time
101 John 04-05-2011 10:45 Part Time
101 John 06-05-2011 15:49 Part Time
101 John Part Time
101 John 08-05-2011 19:45 Part Time
101 John 09-05-2011 12:45 Part Time
so here I need to dynamically find in the week/month range or any dynamic date range say from date range 02-05-2011 and 08-05-2011 or 02-05-2011 and 09-05-2011 or even 06-05-2011 and 09-05-2011, the students who were absent and show it in my table visualization.
Can anyone provide an approach or any helpful DAX? Appreciate all the help
My present visualization looks like this :
I want to show the students who were absent in the given time range as selected in the date slicer.
so if I slide the date slicer as per minimum and maximum ranges, it should show all the rows of students who were absent or with null values for Last Attendance Date column in those time range.
Kind regards
Sameer

PowerBI and filtered sum calculation

I should be able to make a report concerning a relationship between sick leaves (days) and man-years. Data is on monthly level, consists of four years and looks like this (there is also own columns for year and business unit):
Month Sick leaves (days) Man-years
January 35 1,5
February 0 1,63
March 87 1,63
April 60 2,4
May 44 2,6
June 0 1,8
July 0 1,4
August 51 1,7
September 22 1,6
October 64 1,9
November 70 2,2
December 55 2
It has to be possible for the user to filter year, month, as well as business unit and get information about sick leave days during the filtered time period (and in selected business unit) compared to the total sum of man-years in the same period (and unit). Calculated from the test data above, the desired result should be 488/22.36 = 21.82
However, I have not managed to do what I want. The main problem is, that calculation takes into account only those months with nonzero sick leave days and ignores man-years of those months with zero days of sick leaves (in example data: February, June, July). I have tried several alternative functions (all, allselected, filter…), but results remain poor. So all information about a better solution will be highly appreciated.
It sounds like this has to do with the way DAX handles blanks (https://www.sqlbi.com/articles/blank-handling-in-dax/). Your context is probably filtering out the rows with blank values for "Sick-days". How to resolve this depends on how your data are structured, but you could try using variables to change your filter context or use "IF ( ISBLANK ( ... ) )" to make sure you're counting the blank rows.

How to write the best code for data aggregation?

I have the following dataset (individual level data):
pid year state income
1 2000 il 100
2 2000 ms 200
3 2000 al 30
4 2000 dc 400
5 2000 ri 205
1 2001 il 120
2 2001 ms 230
3 2001 al 50
4 2001 dc 400
5 2001 ri 235
.........etc.......
I need to estimate average income for each state in each year and create a new dataset that would look like this:
state year average_income
ar 2000 150
ar 2001 200
ar 2002 250
il 2000 150
il 2001 160
il 2002 160
...........etc...............
I already have a code that runs perfectly fine (I have two loops). However, I would like to know is there any better way in Stata like sql style query?
This is shorter code than any suggested so far:
collapse average_income=income, by(state year)
This shouldn't need 2 loops, or any for that matter. There are in fact more efficient ways to do this. When you are repeating an operation on many groups, the bysort command is useful:
bysort year state: egen average_income = mean(income)
You also don't have to create a new dataset, you can just prune this one and save it. Start by only keeping the variables you want (state, year and average_income) and get rid of duplicates:
keep state year average_income
duplicates drop
save "mynewdataset.dta"
You have the SQL tag on the question. This is a basic aggregation query in SQL:
select state, year, avg(income) as average_income
from t
group by state, year;
To put this in a table, depends on your database. One of the following typically works:
create table NewTable as
select state, year, avg(income) as average_income
from t
group by state, year;
Or:
select state, year, avg(income) as average_income
into NewTable
from t
group by state, year;