Resolving gaps in the data in Stata with Weekly Time Series Data

I have weekly Google Trends Search query data in Stata. Here is a sample of what the data looks like:
I converted the date string into a date object like so:
gen date2 = date(date, "YMD")
gen year= year(date2)
gen w = week(date2)
gen weekly = yw(year,w)
format weekly %tw
I now want to declare "date2" as my time series reference, so I did the following:
tsset date2, weekly
However, upon using tsreport I get the following information
However, I should have no gaps in the data, as it is weekly. For some reason, Stata is still assuming I have daily data.
I cannot take first differences on any of these variables because of this issue. How do I resolve this?

I agree with William Lisowski's general advice but have different specific recommendations.
You have weekly data with daily flags for each week.
Stata weeks are likely to be of little or no use to you for reasons documented in detail in references that
search week, sj
will disclose. Specifically,
SJ-12-4 dm0065_1 . . . . . Stata tip 111: More on working with weeks, erratum
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q4/12 SJ 12(4):765 (no commands)
lists previously omitted key reference
SJ-12-3 dm0065 . . . . . . . . . . Stata tip 111: More on working with weeks
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q3/12 SJ 12(3):565--569 (no commands)
discusses how to convert data presented in yearly and weekly
form to daily dates and how to aggregate such data to months
or longer intervals
SJ-10-4 dm0052 . . . . . . . . . . . . . . . . Stata tip 68: Week assumptions
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q4/10 SJ 10(4):682--685 (no commands)
tip on Stata's solution for weeks and on how to set up
your own alternatives given different definitions of the
week
Issuing that search command will give you links to .pdf copies of each paper.
I suggest simply
gen date2 = daily(date, "YMD")
format date2 %td
tsset date2, delta(7)
daily() is the same function as date() but I think the name is a better signal to all of precisely what it does. The more important detail is that delta(7) is sufficient to indicate daily data spaced 7 days apart, which is precisely what you have.
To expand on the problem you had: when you converted to daily dates, then you got a numeric variable with values like 18755 in steps of 7 to your last date. You then told Stata through your tsset ..., weekly that these are really weeks. Stata uses an origin for all dates like these of the beginning of 1960. So, Stata is working out what 18755 weeks (etc.) from the beginning of 1960 would be. And your numeric variable is still in steps of 7. So, the reason that Stata is misinterpreting your data is that you gave it incorrect information. tsset will never change a date variable; it just interprets it as you instruct.
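The arithmetic behind this misreading can be sketched outside Stata. The Python snippet below is only an illustration, not Stata code. It assumes Stata's conventions that daily dates count days from 1 January 1960 and that weekly dates count weeks from the first week of 1960 with exactly 52 weeks per year. The helper names are hypothetical.

```python
from datetime import date, timedelta

STATA_EPOCH = date(1960, 1, 1)  # origin of Stata's daily dates

def read_as_daily(n):
    """Interpret n as a Stata daily date (days since 1 Jan 1960)."""
    return STATA_EPOCH + timedelta(days=n)

def read_as_weekly(n):
    """Interpret n as a Stata weekly date: (year, week), 52 weeks per year."""
    return 1960 + n // 52, n % 52 + 1

def count_gaps(values, delta=1):
    """Count consecutive pairs whose spacing differs from delta."""
    return sum(1 for a, b in zip(values, values[1:]) if b - a != delta)

# Four consecutive observations, stored as daily dates 7 days apart.
values = [18755 + 7 * k for k in range(4)]

print(read_as_daily(values[0]))     # the intended reading: a day in 2011
print(read_as_weekly(values[0]))    # the "weekly" reading: a week in the year 2320
print(count_gaps(values, delta=1))  # unit spacing assumed: every step is a gap
print(count_gaps(values, delta=7))  # spacing of 7 declared: no gaps
```

The same numbers are gap-free or full of gaps depending only on the declared spacing, which is why delta(7) on the daily variable is enough.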
Note also that you created a weekly date variable but then did not use it. That wouldn't have been a good solution either, but it would have been closer to what you want. It appears that all your dates are Sundays, so in some years there would be 53 and in other years 52; that's not true of Stata's own weeks.
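To see why Sunday-based weeks and Stata's own weeks disagree, here is a rough Python sketch (again an illustration, not Stata code). It assumes Stata's rule that week 1 always starts on 1 January and that week 52 absorbs the last 8 or 9 days of the year. The function names are made up for this example.

```python
from collections import Counter
from datetime import date, timedelta

def stata_week(d):
    """Week number under the assumed Stata rule: week 1 starts on 1 January,
    and week 52 takes whatever days remain at the end of the year."""
    day_of_year = d.timetuple().tm_yday
    return min((day_of_year - 1) // 7 + 1, 52)

def sundays(year):
    """All Sundays in the given calendar year."""
    d = date(year, 1, 1)
    d += timedelta(days=(6 - d.weekday()) % 7)  # weekday(): Monday=0 ... Sunday=6
    out = []
    while d.year == year:
        out.append(d)
        d += timedelta(days=7)
    return out

for year in (2011, 2012):
    days = sundays(year)
    weeks = Counter(stata_week(d) for d in days)
    repeated = sorted(w for w, n in weeks.items() if n > 1)
    print(year, len(days), "Sundays; Stata weeks hit more than once:", repeated)
```

In a year with 53 Sundays, two of them land in the same Stata week, so a weekly variable built this way would contain repeated time values, which tsset does not accept.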

The problem would be more helpfully stated if it included a listing of the data, rather than a picture, so that others could test and demonstrate correct code.
With that said, you need to carefully review the output of help datetime to improve your understanding of how to work with Stata Internal Format (SIF) date and time data, and of the meaning of a "weekly date" in Stata. I believe that something like the following will start you along the correct path.
gen date2 = date(date, "YMD")
gen weekly = wofd(date2)
format weekly %tw
or, in one fewer step,
gen weekly = wofd(date(date, "YMD"))
format weekly %tw

Related

Converting day and month variables into Numerical values (Stata)

I have data on online job postings, but with some variables structured as string when I want them to be numerical to create time series graphs as in here.
The three variables I am interested in converting into numeric variables look as follows:
dataex month posted_date revenue
[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 month str19 posted_date str32 revenue
"March_2021" "2021-03-08 10:44:15" "Less than $1 million (USD)"
"March_2021" "2021-03-08 10:44:15" "Less than $1 million (USD)"
"Dec_2020" "2020-12-13 08:04:59" "$10+ billion (USD)"
"Nov_2020" "44150.33611" "$10+ billion (USD)"
"Dec_2020" "2021-01-04 04:59:40" "$10+ billion (USD)"
"Nov_2020" "44167.24444" "$10+ billion (USD)"
"Dec_2020" "2020-12-16 10:49:38" "$10+ billion (USD)"
"Nov_2020" "44167.24514" "$10+ billion (USD)"
"Nov_2020" "44172.01319" "$10+ billion (USD)"
"Dec_2020" "2020-12-30 05:52:25" "$10+ billion (USD)"
"April_2021" "2021-04-21 04:16:12" ""
"April_2021" "2021-04-21 04:16:12" ""
"Feb_2021" "2021-03-01 01:03:09" ""
"Feb_2021" "2021-03-01 01:03:09" ""
"Feb_2021" "2021-03-01 01:03:09" ""
"April_2021" "2021-04-21 05:57:59" ""
"April_2021" "2021-04-21 05:57:59" ""
"Dec_2020" "2020-12-22 08:13:06" "$500 million to $1 billion (USD)"
I would like the new variables to look something like the following:
month_n posted_date_n revenue_n
02/21 09/02/21 $500m_1B
03/21 14/03/21 +10B
04/21 11/04/21 +1m
So based on the instructions here, I ran the following code:
// Destring string variables with numerical values
gen posted_date_n = real(posted_date)
gen month_n = real(month)
gen revenue_n = real(revenue)
However, I could not really get what I am looking for and instead, the data looks as follows:
dataex revenue_n posted_date_n month_n
[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(revenue_n posted_date_n month_n)
. . .
. . .
. . .
. 44150.34 .
. . .
. 44167.25 .
. . .
. 44167.25 .
. 44172.01 .
. . .
. . .
. . .
I was able to run code to get the data into almost the form you wanted, but not for the date values like 44150.33611, etc. These seem to be in Excel format, as noted by @JR96.
I recommend using the split command; a really handy write-up by Nick Cox is a useful read (source).
// Month/Year
split month, p("_")
drop month
rename month1 month
gen month_n = date(month,"M")
format month_n %td_Month
rename month2 year
destring year, replace
format year %ty
rename year year_n
// Posted Date
split posted_date, p(" ")
drop posted_date
rename posted_date1 date
rename posted_date2 time
gen posted_date_n = date(date, "YMD")
format %tdNN/DD/CCYY posted_date_n
This doesn't do exactly what you ask for, but in my opinion it gets you closer. Example output:
month_n, year_n, posted_date_n
March, 2021, 03/08/2021
March, 2021, 03/08/2021
Here everything is formatted as a date that Stata can recognize. Maybe someone else can jump in here on combining the month_n and year_n columns?
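For the values the code above leaves unconverted (such as 44150.33611), the conversion is just a change of origin, provided they really are Excel serial date-times. That is an assumption here, as is the use of Excel's 1900 date system, in which serial numbers count days from 30 December 1899. The Python sketch below only illustrates the arithmetic; the function name is hypothetical.

```python
from datetime import datetime, timedelta

EXCEL_ORIGIN = datetime(1899, 12, 30)  # assumed: Excel's 1900 date system

def parse_posted_date(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' or an Excel serial such as '44150.33611'."""
    if "-" in text:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    seconds = round(float(text) * 86400)  # whole seconds since the origin
    return EXCEL_ORIGIN + timedelta(seconds=seconds)

for raw in ("2020-12-13 08:04:59", "44150.33611", "44167.24444"):
    print(raw, "->", parse_posted_date(raw))
```

Under that assumption 44150.33611 falls in mid-November 2020, which agrees with the Nov_2020 label on that row. In Stata the corresponding step would be to add the serial number to the daily date for 30 December 1899.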

How to remove missing value in SAS by a sequence of variables

Here is the demonstration data.
data faminc;
input famid faminc1-faminc12;
cards;
1 3281 3413 3114 2500 2700 . 3114 3319 3514 1282 2434 2818
2 4042 . . . . . 1531 2914 3819 4124 4274 4471
3 6015 . . . . . . . . . . .
;
run;
I would like to create an indicator variable called fam_indicator. If variables faminc2-faminc12 are all missing, then fam_indicator=1. Otherwise fam_indicator=0.
I tried the code below but it didn't work.
data fam;
set faminc;
if missing(faminc2-faminc12) then fam_indicator=1;
else fam_indicator=0;
run;
You can do this a bunch of different ways. If the variables are all numeric, then n will do it for you.
data fam;
set faminc;
if n(of faminc2-faminc12) eq 0 then fam_indicator=1;
else fam_indicator=0;
run;
cmiss and nmiss also could work; cmiss is generic regardless of type, while nmiss is only for numerics. They would count the number of missings, so you'd want if cmiss(of faminc2-faminc12) eq 11 or similar.
The other thing you needed was the of. n(faminc2-faminc12) would just subtract the one from the other. of says "the next thing here is a variable list" and it will then expand the list out.
The nmiss function could be used directly. The sum function is another option, because the sum of values that are all missing is itself missing.
fam_indicator=ifn(sum(of faminc2-faminc12)=.,1,0);

How to know ids with missing variable in SAS

In my dataset there are several observations (IDs) with all or too many variables missing. I want to know which IDs have no data (all variables are missing). I used proc freq, but it gives me only the frequency of each variable, which does not serve my purpose. Proc means with nmiss also gives me just the total number missing. I want to know exactly which IDs have missing variables. I searched online but couldn't find a solution to my problem. Help would be appreciated. Below is the sample data:
ID a b c d e
1 . 3 1 2 2
2 . . . . .
3 . . . . .
4 3 . 5 . .
I want the result to show the data for the IDs with completely missing information, like this:
ID a b c d e
2 . . . . .
3 . . . . .
Thanks in advance
Use the nmiss function instead, which counts the number of missing values in the row for a specified list of variables. If you're looking at 3 variables, for example:
If nmiss(var1, var2, var3) =3;
Keep ID;
This will keep only records with all three variables missing.
The n function returns the number of non-missing numeric values in a list. This means you could use a variable list and not worry about counting the variables:
if n(of _numeric_) = 0 then output;
or
if n(of a--e) = 0 then output;
If you're checking character variables, there is no corresponding c function, but you could use the coalescec function to do something similar. The coalesce functions return the first non-missing value from a list of values. To select rows with all character values missing, use something like:
if missing(coalescec(of _character_)) then output;

Shift columns to the right

I have a SAS dataset which looks like this:
Month Col1 Col2 Col3 Col4
200801 11 2 3 20
200802 5 9 4 10
. . . . .
. . . . .
. . . . .
201212 3 34 1 0
I want to create a dataset by shifting each row's Col1-Col4 values to the right, so that the result looks diagonally shifted.
Month Col1 Col2 Col3 Col4 Col5 Col6 Col7 . . . . . . . Coln
200801 11 2 3 20
200802 . 5 9 4 10
. . . . .
. . . . .
. . . . .
201212 . . . . . . . . . 3 34 1 0
Can someone suggest how I can do it?
Thanks!
First off, if you can avoid doing so, do. This is a pretty sparse way to store data, and will involve large datasets (definitely use OPTIONS COMPRESS at least), and usually can be worked around with good use of CLASS variables.
If you really must do this, PROC TRANSPOSE is your friend. While this is possible in the data step, it's less messy and more flexible in PROC TRANSPOSE.
First, make a totally vertical dataset (month+colname+colvalue):
data pre_t;
set have;
array cols col1-col4;
do _t = 1 to dim(cols);
colname = cats("col",((_N_-1) + _t)); *shifting here, edit this logic as needed;
value = cols[_t];
output;
end;
keep colname value month;
run;
In that data step, you are creating the eventual column name in colname and setting it up for transpose. If your data are not identical to the above (in particular, if they are grouped by something else), _N_ may not work and you may need some logic (such as figuring out the difference from 200801) to calculate the column number.
Then, proc transpose:
proc transpose data=pre_t out=want;
by month;
id colname;
var value;
run;
And voilà, you should have what you were looking for. Make sure it's sorted properly in order to get the output in the expected order.
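The two-step idea (go long with a shifted column index, then go wide) can be sketched in a few lines of Python. This is only an illustration of the logic, not a replacement for the SAS code; the function name and the use of None for empty cells are choices made for this example.

```python
def shift_rows(rows, width):
    """Place the values of row i (0-based) in 0-based columns i .. i+width-1,
    padding every other column with None."""
    n_cols = len(rows) - 1 + width
    shifted = []
    for i, (month, values) in enumerate(rows):
        # Long format, as in the first step: shifted column index -> value.
        long_form = {i + t: v for t, v in enumerate(values)}
        # Wide format, as in the transpose step: one slot per final column.
        shifted.append((month, [long_form.get(c) for c in range(n_cols)]))
    return shifted

have = [(200801, [11, 2, 3, 20]), (200802, [5, 9, 4, 10])]
for month, cols in shift_rows(have, width=4):
    print(month, cols)
```

Here the row position drives the shift, which mirrors the role of the row counter in the data step; grouped data would need a different offset rule.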

Generating Interdependent Data in SAS

I am trying to compute a column in SAS, that has dependency on itself. For example, I have the following list of initial values
ID Var_X Var_Y Var_Z
1 2 3 .
2 . 2 .
3 . . .
4 . . .
5 . . .
6 . . .
7 . . .
I need to fill up the blank spaces. The formulae are as follows:
Var_Z = 0.1 + 4*Var_x + 5*Var_Y
Var_X = lag1(Var_Z)
Var_Y = lag2(Var_Z)
As we can see, the values of Var_X, Var_Y and Var_Z are interdependent, so the computation needs to follow a specific order.
First we compute when ID = 1, Var_Z = 0.1 + 4*2 + 5*3 = 23.1
Next, when ID = 2, Var_X = lag1(Var_Z) = 23.1
Var_Y does not need computation at ID = 2 as we already have the initial value here. So, we have
ID Var_X Var_Y Var_Z
1 2 3 23.1
2 23.1 2 102.5 (= 0.1 + 4*23.1 +5*2)
3 . . .
4 . . .
5 . . .
6 . . .
7 . . .
We keep repeating this procedure until all values are calculated.
Is there a way SAS can handle this? I tried a DO loop, but I guess I did not code it right. It just stops after ID = 2.
I am new to SAS, so I am not sure whether there is an easy way to do this. I will wait for your suggestions.
You don't need to use LAG or RETAIN, if you're just doing this in a single data step. DO loop by itself will handle things nicely. RETAIN would only be needed if we were doing something involving a pre-existing data set, but there's really no reason to use one.
I'm using a shortcut here - while you describe VAR_Y in terms of VAR_Z, you really mean that after one iteration, VAR_Z moves to VAR_X and VAR_X moves to VAR_Y, so I do that (in the proper order to not mix things up).
data test_data;
if _n_ = 1 then do;
var_x=2;
var_y=3;
end;
do _iter = 1 to 7;
var_z = 0.1+4*var_x+5*var_y;
output;
var_y=var_x;
var_x=var_z;
end;
run;
proc print data=test_data;
run;
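As a cross-check on the values the data step should produce, the same recurrence can be written in a few lines of Python. This is a sketch for verification only; the function name is hypothetical, and the starting values 2 and 3 are taken from the question.

```python
def simulate(var_x, var_y, n_rows):
    """Iterate Var_Z = 0.1 + 4*Var_X + 5*Var_Y, then shift X into Y and Z into X."""
    rows = []
    for row_id in range(1, n_rows + 1):
        var_z = 0.1 + 4 * var_x + 5 * var_y
        rows.append((row_id, var_x, var_y, var_z))
        # Same order of updates as in the data step: Y takes the old X first.
        var_y, var_x = var_x, var_z
    return rows

for row in simulate(2, 3, 7):
    print(row)
```

The first two rows should match the hand calculation in the question (23.1, then 102.5), up to floating-point rounding.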
I believe you can do this within a DO loop; the key is making SAS remember the last values of your variables. My suggestion is to poke around a bit with a simple "counter" program, something like:
data counter;
count = 0;
do i = 1 to 100;
count = count + 1; *running total carried from one iteration to the next;
end;
run;
Then see how that pattern carries over to your variables. I suspect your problem is that you're not using the retain statement with your DO loop. Check the SAS documentation for that and see if it fixes your problem.