Is there a Stata function to change string variable of dates (Month/ Day/ Year/ Time) to month/date/year and numeric? - stata

I have a string variable for time that has the timestamp. For example, one entry would look like: "4/25/2022 17:26". I have over 1,000 observations. I need to categorize the dates (like time period A, time period B...). I want one category per month so I would have 25 categories (because I have data from over 2 years ago). I thought I would first make the string variable a continuous/ numeric variable so that I can do an if...then statement, creating a new, categorical variable, where I can efficiently say if a date is within a certain range it would go to the new categorical variable for the time period.
I also know I might be planning this all wrong, any suggestions?

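If your string variable holding the timestamp is called date_string, you can use clock() to create a new numeric datetime variable (d in this example). Store it as a double, because datetime values are too large to be held precisely in the default float type: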
gen double d = clock(date_string, "MDYhm")
format d %tc
That code would convert this variable:
date_string
1. 4/25/2022 17:26
2. 4/26/2022 19:52
3. 5/17/2023 7:16
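into a new variable d:
   date_string       d
1. 4/25/2022 17:26   25apr2022 17:26:00
2. 4/26/2022 19:52   26apr2022 19:52:00
3. 5/17/2023 7:16    17may2023 07:16:00
To get one category per month, as the question asks, d can then be reduced to a monthly date with mofd(dofc(d)) and displayed with the %tm format.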

Related

Multiple To clauses in Data step

I have a data step where I have a few columns that need to be tied to one other column.
I have tried using multiple "from" statements and "to" statements and a couple of other permutations of that, but nothing seems to do the trick. The code looks something like this:
data analyze;
set css_email_analysis;
from = bill_account_number;
to = customer_number;
output;
from = bill_account_number;
to = email_addr;
output;
from = bill_account_number;
to = e_customer_nm;
output;
run;
I would like to see two columns showing bill accounts in the "from" column, and the other values in the "to", but instead I get a bill account and its customer number, with some "..."'s for the other values.
Issue
This is most likely because SAS has only two data types (numeric and character), and the type of the to variable is fixed the first time it is assigned, when it takes the value of customer_number. Your second assignment then tries to give to the value of email_addr. Assuming email_addr is a character variable, one of two things can happen here:
Customer_number is a number - to has already been set up as numeric, so SAS cannot turn it into a character variable. It sets to to missing and writes a note like this to the log:
NOTE: Invalid numeric data, 'me#mywebsite.com' , at line 15 column 8.
to=. _ERROR_=1 _N_=1
Customer_number is a character - to has been set up as a character, but without explicitly defining its length, if it happens to be shorter than the value of email_addr then the email address will be truncated. SAS will not show an error if this happens:
Code:
data _NULL_;
to = 'hiya';
to = 'me#mydomain.com';
put to=;
run;
to=me#m
to is set with a length of 4, and SAS does not expand it to fit the new data.
Detail
The thing to bear in mind here is how SAS works behind the scenes.
The data statement sets up an output location.
The set statement adds the variables from the first observation of the specified dataset to a space in memory called the PDV (program data vector), inheriting lengths and data types.
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm
===================================================================
010101 | 758|me#my.com |John Smith
The to statement adds another variable inheriting the characteristics of customer_number
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me#my.com |John Smith |758
(to is either char length 3 or a numeric)
Subsequent to statements will not alter the characteristics of the variable, and SAS will continue processing:
PDV (if customer_number is character = TRUNCATION):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me#my.com |John Smith |me#
PDV (if customer_number is numeric = DATA ERROR, to set to missing):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me#my.com |John Smith |.
Resolution
To resolve this issue it's probably easiest to set the length and type of to before your first to statement:
data analyze;
set css_email_analysis;
from = bill_account_number;
length to $200;
to = customer_number;
output;
...
You may get messages like this, where SAS has converted data on your behalf:
NOTE: Numeric values have been converted to character
values at the places given by: (Line):(Column).
27:8
N.B. it's not necessary to explicitly define the length and type of from, because as far as I can see, you only ever get the values for this variable from one variable in the source dataset. You could also achieve this with a rename if you don't need to keep the bill_account_number variable:
rename bill_account_number = from;

Datetime object through 'datetime.strptime is not iterable'

I have a CSV file containing years of data, and I need to calculate the difference between the max date and the min date. My problem is that I cannot work out how to determine the max value of the dates.
I am doing this to convert my dates into datetime objects:
Temps = datetime.strptime(W['datum'][i]+' '+W['timestamp'][i],'%Y-%m-%d %H:%M:%S')
Printing this gives me exactly the result I want, but when I try to extract the max value of these dates using this line of code:
start = max(Temps)
I got this error: 'datetime.strptime' object is not iterable
Where am I mistaken?
The expression
datetime.strptime(W['datum'][i]+' '+W['timestamp'][i],'%Y-%m-%d %H:%M:%S')
produces a single value (a scalar). When you assign it to Temps, this variable becomes a scalar, not a list. It contains only one value, and each pass through your loop overwrites the previous one.
Then, when you evaluate max(Temps), max expects an iterable (something holding multiple values) as its argument, but it finds only the single datetime that Temps was assigned most recently.
A single datetime is not iterable, hence the error.
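One way around this is to collect every parsed value in a list and only then call min and max on it. The following is a minimal sketch, not the asker's actual code: W is not shown in the question, so here it is assumed to behave like a mapping from column name to a list of strings, and the sample values and the names temps, earliest, latest, and span are made up for illustration.

```python
from datetime import datetime

# Hypothetical stand-in for the question's W: column name -> list of strings.
W = {
    "datum": ["2021-03-01", "2019-07-15", "2020-12-31"],
    "timestamp": ["08:30:00", "23:59:59", "00:00:00"],
}

FMT = "%Y-%m-%d %H:%M:%S"

# Parse every row and keep all results in a list, so that min() and max()
# receive an iterable instead of one datetime.
temps = [
    datetime.strptime(d + " " + t, FMT)
    for d, t in zip(W["datum"], W["timestamp"])
]

earliest = min(temps)
latest = max(temps)
span = latest - earliest  # a datetime.timedelta

print(earliest, latest, span)
```

If W is a pandas DataFrame, the same idea applies: build the full collection of datetimes first, then take its minimum and maximum.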

Day of the week effect - excluding dummy variables not individually

I want to test the day-of-the-week effect on stock returns. The Stata code I have written works, but it looks fairly inefficient.
// 1) Monday effect
eststo:reg return day_dummy2 day_dummy3 day_dummy4 day_dummy5
// 2) Tuesday effect
eststo:reg return day_dummy1 day_dummy3 day_dummy4 day_dummy5
// 3) Wednesday effect
eststo:reg return day_dummy1 day_dummy2 day_dummy4 day_dummy5
and so on.
Is there a way to write code that does the same thing (excluding one day at a time) with, e.g., a foreach loop?
Thank you very much for your help!
A bit clunky, perhaps, but you could use Stata's macro extended functions (see help extended_fcn) to iteratively exclude one of your listed variables and generate the list of remaining variables.
local vars "day1 day2 day3 day4 day5 day6 day7"
forvalues i = 1/7 {
local varexclude : word `i' of `vars'
local varsout`i' : subinstr local vars "`varexclude'" ""
// insert -estout- command here
}
macro list // to verify the individual `varsout`i'' local macros
You can obtain the initial varlist with ds day*, which stores the variable list in r(varlist).

giving a string variable values conditional on another variable

I am using Stata 14. I have US states and their corresponding regions coded as integers.
I want to create a string variable that represents the region for each observation.
Currently my code is
gen div_name = "A"
replace div_name = "New England" if div_no == 1
replace div_name = "Middle Atlantic" if div_no == 2
.
.
replace div_name = "Pacific" if div_no == 9
...so the code is really long.
I was wondering if there is a shorter way to do this, where I can automate assigning the values rather than hard-coding them manually.
You can define value labels in one line with label define and then use decode to create the string variable. See the help for those commands.
If the correspondence was defined in a separate dataset you could use merge. See e.g. this FAQ
There can't be a short-cut here other than typing all the names at some point or exploiting the fact that someone else typed them earlier into a file.
With nine or so labels, typing them yourself is quickest.
Note that you type one statement more than you need, even doing it the long way, as you could start
gen div_name = "New England" if div_no == 1

specify moment at which to change value after converting (tri)annual-subject to month-subject observations (Stata)

My goal is to convert a subject-triannual data set to one with subject-month observations, and specify the month at which one string variable (named "strvar" below) should change value, according to the var called "exact_time".
I have a data set with four records per subject (subject-year observations, aka a multiple-record-per-subject data set). Information was recorded every three years for each subject, as follows:
[Image: table with tri-annual-subject observations and the exact_time variable]
"strvar" changes its value every three years. The variable "exact_time" records the exact moment (month.day.year) at which the variable "strvar" changes its value. Once "strvar" changes, it keeps the same value for the following months, until the moment indicated by the next value of "exact_time".
I want Stata to change the value of "strvar" according to the variable "exact_time". For instance, subject 1 changed the value of "strvar" on April 1, 1992, so I want Stata to assign the new value of "strvar" starting in April 1992. The value of "strvar" for subject 1 should remain the same until "exact_time" changes value (November 30, 1995), so starting in November 1995, subject 1 should adopt the new value of "strvar". In 1998, "strvar" of subject 1 changed value once again, this time at the beginning of the next year (January 1, 1999), so "strvar" will adopt a new value starting in January 1999 and keep it until subject 1's last observation (December 2002). As follows:
[Image: example table with monthly-subject observations]
I believe this can be achieved in two steps, the second of which I need your support with:
1. Expand each tri-annual observation 36 times, so as to have monthly-subject observations, i.e., generate the variable "new_time". I guess this can be achieved through:
expandcl 36, generate(new_time) cluster(subject)
2. Instruct Stata to change the value of "strvar" according to the date specified by "exact_time". I have no idea how to do this and would appreciate your support.
Thank you in advance!
For future questions, please provide your failed attempts in the form of code. They prove that you have done your part in trying to solve the problem.
Also, please provide example data that can easily be copied/pasted by other users. Linking images is not the best option, for several reasons.
Find example code below.
clear
set more off
*----- example data -----
input ///
id str1 strvar str22 xtime
1 z "april 1, 1992"
1 u "november 30, 1995"
1 a "january 1, 1999"
2 b "january 15, 1989"
2 z "june 15, 1992"
2 c "august 30, 1995"
end
gen xtime2 = date(xtime, "MDY")
format %td xtime2
list, sepby(id)
*----- what you want -----
xtset id xtime2
tsfill
gen strvar2 = strvar
replace strvar2 = strvar2[_n-1] if missing(strvar2)
browse
tsfill facilitates the job. See also help xtset, help subscripting and help datetime.
Think about whether you actually need this. You are not adding any new information to the dataset, so what's the point of having a blown-up version of the original?
(The output doesn't exactly match the one in your image; but this really is meant to be an example.)