Stata: date comparison in double

I'm trying to divide the data by a certain datetime.
I've created e_time from what was originally a string, e.g. "2019-10-15 20:33:04".
To obtain all the information from the string, including h:m:s, I used the following command to create a double:
gen double e_time = clock(event_timestamp, "YMDhms")
With format e_time %tc the variable now displays in human-readable form.
I want to generate a new variable that is 1 for anything greater than 2019-10-15 and 0 for anything less than that.
I've tried
// 1
gen new_d = 0 if e_time < "1.887e+12"
replace new_d = 1 if e_time >= "1.887e+12"
// 2
gen new_d = 0 if e_time < "2019-10-15"
replace new_d = 1 if e_time > "2019-10-15"
However, I get the error message type mismatch.
I tried converting the string "2019-10-15" to a double with display, to check whether 1.887e+12 really means 2019-10-15, but I'm not sure how the command works here.
Anyhow I tried
// 3
di clock("2019-10-15", "YMDhms")
but it didn't work.
Can anyone give advice on how to properly compare dates stored in a double?

Your post is a little hard to follow (a reproducible data example would help a lot), but the error type mismatch occurs because e_time is numeric and "2019-10-15" is a string.
I suggest the following:
clear
input str20 datetime
"2019-10-14 20:33:04"
"2019-10-16 20:33:04"
end
* Keep first 10 characters
gen date = substr(datetime,1,10)
* Check that all strings are 10 characters
assert length(date) == 10
* Convert from string to numeric date variable
gen m = substr(date,6,2)
gen d = substr(date,9,2)
gen y = substr(date,1,4)
destring m d y, replace
gen newdate = mdy(m,d,y)
format newdate %d
gen wanted = newdate >= mdy(10,15,2019) & !missing(newdate)
drop date m d y
list
     +------------------------------------------+
     |            datetime     newdate   wanted |
     |------------------------------------------|
  1. | 2019-10-14 20:33:04   14oct2019        0 |
  2. | 2019-10-16 20:33:04   16oct2019        1 |
     +------------------------------------------+
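If you would rather keep the comparison in %tc (millisecond) units instead of building a daily date, note that the di attempt failed only because the mask "YMDhms" does not match the date-only string "2019-10-15", so clock() returns missing. With a matching mask the comparison works directly; a minimal sketch, assuming e_time was created with clock() as in the question:
di %tc clock("2019-10-15", "YMD")
gen byte new_d = e_time >= clock("2019-10-15", "YMD") & !missing(e_time)
The !missing() guard matters because missing values compare as larger than any number in Stata.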


Parsing periods in a column dataframe

I have a csv with one of the columns that contains periods:
timespan (string): PnYnMnD, where:
- P is a literal value that starts the expression,
- nY is the number of years followed by a literal Y,
- nM is the number of months followed by a literal M,
- nD is the number of days followed by a literal D,
- any of these numbers and their designator may be absent if equal to 0,
- a minus sign may appear before the P to specify a negative duration.
I want to return a data frame that contains all the data in the csv, with the timespan column parsed.
So far I have code that parses periods:
import re
timespan_regex = re.compile(r'P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?')
def parse_timespan(timespan):
    # check if the input is a valid timespan
    if not timespan or 'P' not in timespan:
        return None
    # check if timespan is negative and skip initial 'P' literal
    curr_idx = 0
    is_negative = timespan.startswith('-')
    if is_negative:
        curr_idx = 1
    # extract years, months and days with the regex
    match = timespan_regex.match(timespan[curr_idx:])
    years = int(match.group(1) or 0)
    months = int(match.group(2) or 0)
    days = int(match.group(3) or 0)
    timespan_days = years * 365 + months * 30 + days
    return timespan_days if not is_negative else -timespan_days
print(parse_timespan(''))
print(parse_timespan('P2Y11M20D'))
print(parse_timespan('-P2Y11M20D'))
print(parse_timespan('P2Y'))
print(parse_timespan('P0Y'))
print(parse_timespan('P2Y4M'))
print(parse_timespan('P16D'))
Output:
None
1080
-1080
730
0
850
16
How do I apply this function to the whole timespan column while processing the csv?
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv(f_path, names=['oci', 'citing', 'cited', 'creation', 'timespan',
                                         'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # to remove the first row
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    my_ocan['timespan'] = parse_timespan(my_ocan['timespan'])  # I tried like this, but sure it is not working :)
    return my_ocan
Thank you and have a lovely day :)
Like Python's built-in map, pandas has a map method (see the pandas.Series.map documentation). Since you already have a function that takes a single value and returns a value, you just need this:
my_ocan['timespan'] = my_ocan['timespan'].map(parse_timespan)  # apply parse_timespan to each value in 'timespan', storing the returned values
And here is a generic demo:
import pandas as pd
def demo_func(x):
    # Takes an int or string, prefixes it with 'A' and returns a string.
    return "A" + str(x)

df = pd.DataFrame({"Column_1": [1, 2, 3, 4], "Column_2": [10, 9, 8, 7]})
print(df)
df['Column_1'] = df['Column_1'].map(demo_func)
print("After mapping:\n{}".format(df))
Output:
   Column_1  Column_2
0         1        10
1         2         9
2         3         8
3         4         7
After mapping:
  Column_1  Column_2
0       A1        10
1       A2         9
2       A3         8
3       A4         7
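Applied back to the original function, a sketch (assuming the same column layout; timespan is removed from parse_dates because it is a duration, not a date, and skiprows=1 replaces the iloc[1:] header hack):
import pandas as pd

def do_process_citation_data(f_path):
    my_ocan = pd.read_csv(f_path,
                          names=['oci', 'citing', 'cited', 'creation', 'timespan',
                                 'journal_sc', 'author_sc'],
                          skiprows=1)  # skip the header row instead of iloc[1:]
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d")
    my_ocan['timespan'] = my_ocan['timespan'].map(parse_timespan)
    return my_ocan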

Splitting a long string in pandas cell near the n-th character position into multiple cells without splitting words

As MS Excel limits the number of characters in a cell to 32767, I have to split longer strings in a pandas dataframe into several cells.
Is there a way to split the strings of a pandas column "Text" into several columns "Text_1", "Text_2", "Text_3", ...? It is also important that the text is not split within a word, so I assume regex is needed.
An example dataframe:
df_test = pd.DataFrame({'Text': ['This should be the first very long string',
                                 'This is the second very long string',
                                 'This is the third very long string',
                                 'This is the last string which is very long'],
                        'Date': [2019, 2018, 2019, 2018],
                        'Source': ["FAZ", "SZ", "HB", "HB"],
                        'ID': ["ID_1", "ID_2", "ID_3", "ID_4"]})
df_test
                                          Text  Date Source    ID
0   This should be the first very long string  2019    FAZ  ID_1
1         This is the second very long string  2018     SZ  ID_2
2          This is the third very long string  2019     HB  ID_3
3  This is the last string which is very long  2018     HB  ID_4
Assuming that the cut in this example occurs at n=15 and not at n=32767, I want to split the Text column into something like this (wrapping near 15 characters without breaking words):
           Text_1            Text_2         Text_3  Text_4  Date Source    ID
0  This should be    the first very    long string          2019    FAZ  ID_1
1     This is the       second very    long string          2018     SZ  ID_2
2     This is the   third very long         string          2019     HB  ID_3
3     This is the       last string  which is very    long  2018     HB  ID_4
Ultimately the approach should be scalable to n=32767 and at least ten new columns "Text_1", "Text_2", and so on.
So far I have created a new column "n" indicating the length of the df_test["Text"] strings per row:
df_test['n'] = df_test['Text'].str.split("").str.len()
Here is the general idea.
# find the longest string, then divide each text
# into the number of new cols you want, adding a | at
# each division and later splitting on that |
longest = ""
for x in df_test['Text']:
    if len(x) > len(longest):
        longest = x

import math
num_cols = math.floor(len(longest.split(' ')) / 3)  # shoot for 3 words per column

for index, row in df_test.iterrows():
    word_str = row['Text']
    word_as_list = word_str.split(' ')
    col_index = math.ceil(len(word_as_list) / num_cols)
    for _ in range(num_cols - 1):
        word_as_list.insert(col_index, '|')
        col_index += col_index
    new = ' '.join(word_as_list)
    df_test.at[index, 'Text'] = new

cols = ['Text' + str(i) for i in range(1, num_cols + 1)]
df_test[cols] = df_test.Text.str.split('|', expand=True)
del df_test['Text']
print(df_test)
OUTPUT
   Date Source    ID           Text1        Text2               Text3
0  2019    FAZ  ID_1  This should be    the first    very long string
1  2018     SZ  ID_2     This is the  second very         long string
2  2019     HB  ID_3     This is the   third very         long string
3  2018     HB  ID_4     This is the  last string  which is very long
I will upload a full one when I am done. Comment if you don't like this way or have other suggestions.
Yes - a single pandas cell should contain a maximum of 32767 characters, so the strings from df_test["Text"] should be split accordingly.
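An alternative sketch that avoids the manual bookkeeping: the standard library's textwrap.wrap splits a string near a given width without cutting words (pass break_long_words=False so words longer than the width are kept whole). Shown with the toy width of 15; for the real limit use width=32767:
import textwrap

import pandas as pd

def split_text(text, width=15):
    # wrap without splitting words; returns a list of chunks
    return textwrap.wrap(text, width=width, break_long_words=False)

chunks = df_test['Text'].apply(split_text).apply(pd.Series)
chunks.columns = ['Text_' + str(i + 1) for i in range(chunks.shape[1])]
df_test = pd.concat([df_test.drop(columns='Text'), chunks], axis=1)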

Convert date format to character string

I have a column of format DATETIME23. like this:
14.02.2017 13:00:25
I want to convert it to a string so that I can modify it later; for example, the final version would look like:
2017-02-14 13:00:25.000
The problem occurs when I try to convert the date to a character format: as a result I get a string like 1802700293, which is the number of seconds.
I tried:
format date $23.0
or
date = put(date, $23.0)
P.S. This is another try:
data a;
  format d date9.;
  d = '12jan2016'd;
  dtms = cat(day(d),'-',month(d),'-',year(d),' 00:00:00.000');
  /* if two-digit day and month are strictly required, this kludge: */
  if day(d) < 10 then dd=cat('0',put(day(d),$1.));
  else ddday=put(day(d),$2.);
  if month(d) < 10 then mm=cat('0',put(month(d),$1.));
  else mm=put(month(d),$2.);
  yyyy=put(year(d),$4.);
  /*dtms2 = cat(dd,'-',mm,'-',yyyy,' 00:00:00.000');*/
  dtms2 = cat(dd,'-',mm,'-',yyyy,' 00:00:00.000');
  dtms = cat(day(d),'-',month(d),'-',year(d),' 00:00:00.000');
run;
But, strangely, the dtms2 concatenation destroys the leading zero in the month element.
If your datetime is stored as a SAS datetime, just use the appropriate format:
data test ;
  dt = '09feb2017:13:53:26'dt ; /* specify a datetime constant */
  new_dt = put(dt,E8601DT23.3) ; /* ISO datetime format */
run ;
Output
dt new_dt
1802267606 2017-02-09T13:53:26.000
If you need to replace the 'T' with a space, simply add a translate function around the put().
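For example, a one-line sketch of that translate() wrapper:
new_dt2 = translate(put(dt,E8601DT23.3), ' ', 'T') ; /* replace the ISO 'T' with a space */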
For your dtms solution you can use put and the Z2. format to keep the leading zero when you concatenate:
dtms = cat(day(d),'-', put(month(d),z2.),'-',year(d),' 00:00:00.000');
For your problem, though, you should be able to just use put(date, datetime23.) instead of $23., which converts the number of seconds to a string of length 23. However, as a comment has mentioned, datetime23. is not the format from your example.
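A minimal sketch of that simpler approach, assuming the datetime variable is named date (datetime23. renders as, e.g., 14FEB2017:13:00:25 rather than the ISO form above):
data b ;
  set a ; /* dataset and variable names are placeholders */
  date_str = put(date, datetime23.) ;
run ;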

Stata: Subsetting data using criteria stored in other data set

I have a large data set. I have to subset the data set (Big_data) using values stored in another dta file (Criteria_data). I will show you the problem first:
Big_data                  Criteria_data
====================      ================================================
lon       lat             4_digit_id   minlon   maxlon   minlat   maxlat
-76.22    44.27           0765         -78.44   -77.22   34.324   35.011
-67.55    33.19           6161         -66.11   -65.93   40.32    41.88
.......                   ........
(over 1 million obs)      (271 observations)
====================      ================================================
I have to subset the big data as follows:
use Big_data
preserve
keep if (-78.44<lon<-77.22) & (34.324<lat<35.011)
save data_0765, replace
restore
preserve
keep if (-66.11<lon<-65.93) & (40.32<lat<41.88)
save data_6161, replace
restore
....
(1) What is an efficient way to program this subsetting in Stata? (2) Are the inequality expressions correctly written?
1) Subsetting data
With 400,000 observations in the main file and 300 in the reference file, it takes about 1.5 minutes. I can't test this with double the observations in the main file because the lack of RAM brings my computer to a crawl.
The strategy involves creating as many variables as needed to hold the reference latitudes and longitudes (271*4 = 1084 in the OP's case; Stata IC and up can handle this. See help limits). This requires some reshaping and appending. Then we check for those observations of the big data file that meet the conditions.
clear all
set more off
*----- create example databases -----
tempfile bigdata reference
input ///
lon lat
-76.22 44.27
-66.0 40.85 // meets conditions
-77.10 34.8 // meets conditions
-66.00 42.0
end
expand 100000
save "`bigdata'"
*list
clear all
input ///
str4 id minlon maxlon minlat maxlat
"0765" -78.44 -75.22 34.324 35.011
"6161" -66.11 -65.93 40.32 41.88
end
drop id
expand 150
gen id = _n
save "`reference'"
*list
*----- reshape original reference file -----
use "`reference'", clear
tempfile reference2
destring id, replace
levelsof id, local(lev)
gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id)
gen lat = .
gen lon = .
save "`reference2'"
*----- create working database -----
use "`bigdata'"
timer on 1
quietly {
    forvalues num = 1/300 {
        gen minlon`num' = .
        gen maxlon`num' = .
        gen minlat`num' = .
        gen maxlat`num' = .
    }
}
timer off 1
timer on 2
append using "`reference2'"
drop i
timer off 2
*----- flag observations for which conditions are met -----
timer on 3
gen byte flag = 0
foreach le of local lev {
    quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
}
timer off 3
*keep if flag
*keep lon lat
*list
timer list
The inrange() function implies that the minimums and maximums must be adjusted beforehand to satisfy the OP's strict inequalities (the function tests <=, >=).
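If adjusting the bounds beforehand is inconvenient, the strict test can instead be written out explicitly inside the loop; a sketch of the equivalent replace, using the same variables as above:
quietly replace flag = 1 if lon > minlon`le'[_N] & lon < maxlon`le'[_N] & lat > minlat`le'[_N] & lat < maxlat`le'[_N]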
Some expansion using expand, plus sequential identifiers and by (so the data is in long form), could probably speed things up; it's not totally clear to me right now. I'm sure there are better ways in plain Stata mode. Mata may be even better.
(joinby was also tested but again RAM was a problem.)
Edit
Doing the computations in chunks rather than on the complete database significantly improves the RAM issue. Using a main file with 1.2 million observations and a reference file with 300 observations, the following code does all the work in about 1.5 minutes:
set more off
*----- create example big data -----
clear all
set obs 1200000
set seed 13056
gen lat = runiform()*100
gen lon = runiform()*100
local sizebd `=_N' // to be used in computations
tempfile bigdata
save "`bigdata'"
*----- create example reference data -----
clear all
set obs 300
set seed 97532
gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5
gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5
gen id = _n
tempfile reference
save "`reference'"
*----- reshape original reference file -----
use "`reference'", clear
destring id, replace
levelsof id, local(lev)
gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id)
drop i
tempfile reference2
save "`reference2'"
*----- create file to save results -----
tempfile results
clear all
set obs 0
gen lon = .
gen lat = .
save "`results'"
*----- start computations -----
clear all
* local that controls # of observations in intermediate files
local step = 5000 // can't be larger than sizedb
timer clear
timer on 99
forvalues en = `step'(`step')`sizebd' {
    * load observations and join with references
    timer on 1
    local start = `en' - (`step' - 1)
    use in `start'/`en' using "`bigdata'", clear
    timer off 1
    timer on 2
    append using "`reference2'"
    timer off 2
    * flag observations that meet conditions
    timer on 3
    gen byte flag = 0
    foreach le of local lev {
        quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
    }
    timer off 3
    * append to result database
    timer on 4
    quietly {
        keep if flag
        keep lon lat
        append using "`results'"
        save "`results'", replace
    }
    timer off 4
}
timer off 99
timer list
display "total time is " `r(t99)'/60 " minutes"
use "`results'"
browse
2) Inequalities
You ask if your inequalities are correct. They are in fact legal, meaning that Stata will not complain, but the result is probably unexpected.
The following result may seem surprising:
. display (66.11 < 100 < 67.93)
1
How is it the case that the expression evaluates to true (i.e. 1)? Stata first evaluates 66.11 < 100, which is true, and then sees 1 < 67.93, which is also true, of course.
The intended expression was (and Stata will now do what you want):
. display (66.11 < 100) & (100 < 67.93)
0
You can also rely on the function inrange().
The following example is consistent with the previous explanation:
. display (66.11 < 100 < 0)
0
Stata sees 66.11 < 100 which is true (i.e. 1) and follows up with 1 < 0, which is false (i.e. 0).
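For completeness, the intended test written with inrange(), which returns 1 when a <= x <= b:
. display inrange(100, 66.11, 67.93)
0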
This uses Roberto's data setup:
clear all
set obs 1200000
set seed 13056
gen lat = runiform()*100
gen lon = runiform()*100
local sizebd `=_N' // to be used in computations
tempfile bigdata
save "`bigdata'"
*----- create example reference data -----
clear all
set obs 300
set seed 97532
gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5
gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5
gen id = _n
tempfile reference
save "`reference'"
timer on 1
levelsof id, local(id_list)
foreach id of local id_list {
    sum minlat if id==`id', meanonly
    local minlat = r(min)
    sum maxlat if id==`id', meanonly
    local maxlat = r(max)
    sum minlon if id==`id', meanonly
    local minlon = r(min)
    sum maxlon if id==`id', meanonly
    local maxlon = r(max)
    preserve
    use if (inrange(lon,`minlon',`maxlon') & inrange(lat,`minlat',`maxlat')) using "`bigdata'", clear
    qui save data_`id', replace
    restore
}
timer off 1
I would try to avoid preserveing and restoreing the "big" file; doing so is possible, but at the expense of losing the Stata file format.
Using the same set up as Roberto and Dimitriy did,
set more off
use `bigdata', clear
merge 1:1 _n using `reference'
* check for data consistency:
* minlat, maxlat, minlon, maxlon are either all defined or all missing
assert inlist( mi(minlat) + mi(maxlat) + mi(minlon) + mi(maxlon), 0, 4)
* this will come handy later
gen byte touse = 0
* set up and cycle over the reference data
count if !missing(minlat)
forvalues n=1/`=r(N)' {
    replace touse = inrange(lat,minlat[`n'],maxlat[`n']) & inrange(lon,minlon[`n'],maxlon[`n'])
    local thisid = id[`n']
    outfile lat lon if touse using data_`thisid'.csv, replace comma
}
Time it on your machine. You could avoid touse and thisid and only have the single outfile within the cycle, but it would be less readable.
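For reference, that compact variant with a single outfile inside the cycle might look like this (untested sketch; relies on r(N) still holding from the count above):
forvalues n=1/`=r(N)' {
    outfile lat lon using data_`=id[`n']'.csv if inrange(lat,minlat[`n'],maxlat[`n']) & inrange(lon,minlon[`n'],maxlon[`n']), replace comma
}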
You can then infile lat lon using data_###.csv, clear later. If you really need the Stata files proper, you can convert that swarm of CSV files with
clear
local allcsv : dir . files "*.csv"
foreach f of local allcsv {
    * change the filename extension
    local dtaname = subinstr(`"`f'"',".csv",".dta",.)
    infile lat lon using `"`f'"', clear
    if _N>0 save `"`dtaname'"', replace
}
Time it, too. I protected the save as some of the simulated data sets were empty. I think this was faster than 1.5 min on my machine, including the conversion.

Stata - assign different variables depending on the value within a variable

Sorry that title is confusing. Hopefully it's clear below.
I'm using Stata and I'd like to assign the value 1 to a variable that depends on the value within a different variable. I have 20 order variables and also 20 corresponding variables. For example if order1 = 3, I'd like to assign variable3 = 1. Below is a snippet of what the final dataset would look like if I had only 3 of each variable.
Right now I'm doing this with two loops, but I have another loop around this that runs through 9 more times, plus I'm doing this for a couple hundred data files, so I'd like to make it more efficient.
forvalues i = 1/20 {
    forvalues j = 1/20 {
        replace variable`j' = 1 if order`i'==`j'
    }
}
Is it possible to use the value of order`i' to address variable[order`i' VALUE] directly? Then I could get rid of the j loop above. Something like this:
forvalues i = 1/20 {
    replace variable[`order`i'value] = 1
}
Thanks for your help!
CLARIFICATION ADDED Feb 2nd.
I simplified my problem and the dataset too much, because the suggested solutions work for what I presented but are not getting at what I'm really attempting to do. Thank you three for your solutions, though; I was not clear enough in my post.
To clarify, my data doesn't have a one-to-one correspondence where each order# assigns variable# a 1 if it's not missing. For example, if the first observation has order1=3, variable1 isn't supposed to get a 1; variable3 should get a 1. What I didn't include in my original post is that I'm also checking other conditions before setting it equal to 1.
For more background, I'm counting up births of women by birth order (1st child, 2nd child, etc.) that occurred at different ages of the mothers. In the data, each row is a woman, and each order# is the birth order (order1=3 means it's her third child). The corresponding variable#s are the counts (variable# means the woman has a child of birth order #). I mentioned in the post that I do this 9 times because I'm doing it for 5-year age groups (15-19; 20-24; etc.). So the first set of variable# would be counts of births by order when women were ages 15-19; the second set of variable# would be counts of births by order when women were 20-24; and so on. After this, I sum up the counts in different ways (by woman's education, geography, etc.).
So with the additional loop what I do is something more like this
forvalues k = 1/9 {
    forvalues i = 1/20 {
        forvalues j = 1/20 {
            replace variable`k'_`j' = 1 if order`i'==`j' & age`i'==`k' & birth_age`i'<36
        }
    }
}
Not sure if it's possible, but I wanted to simplify so that I only cycle through each child once, without cycling through the birth orders, and directly use the value in order# to assign a 1 to the correct variable. So if order1=3 and the woman had the child in the specific age group, assign variable[agegroup][3]=1; if order1=2, then variable[agegroup][2] should get a 1.
forvalues k = 1/9 {
    forvalues i = 1/20 {
        replace variable`k'_[`order`i'value] = 1 if age`i'==`k' & birth_age`i'<36
    }
}
I would reshape twice. First reshape to long, then condition variable on !missing(order), then reshape back to wide.
* generate your data
clear
set obs 3
forvalues i = 1/3 {
    generate order`i' = .
    local k = (3 - `i' + 1)
    forvalues j = 1/`k' {
        replace order`i' = (`k' - `j' + 1) if (_n == `j')
    }
}
list
*. list
*
*     +--------------------------+
*     | order1   order2   order3 |
*     |--------------------------|
*  1. |      3        2        1 |
*  2. |      2        1        . |
*  3. |      1        .        . |
*     +--------------------------+
* I would reshape to long, then back to wide
generate id = _n
reshape long order, i(id)
generate variable = !missing(order)
reshape wide order variable, i(id) j(_j)
order order* variable*
drop id
list
*. list
*
*     +-----------------------------------------------------------+
*     | order1   order2   order3   variab~1   variab~2   variab~3 |
*     |-----------------------------------------------------------|
*  1. |      3        2        1          1          1          1 |
*  2. |      2        1        .          1          1          0 |
*  3. |      1        .        .          1          0          0 |
*     +-----------------------------------------------------------+
Using a simple forvalues loop with generate and missing() is orders of magnitude faster than the other solutions proposed so far. For this problem you need only one loop to traverse the complete list of variables, not two as in the original post. Below is some code that shows both points.
*----------------- generate some data ----------------------
clear all
set more off
local numobs 60
set obs `numobs'
quietly {
    forvalues i = 1/`numobs' {
        generate order`i' = .
        local k = (`numobs' - `i' + 1)
        forvalues j = 1/`k' {
            replace order`i' = (`k' - `j' + 1) if (_n == `j')
        }
    }
}
timer clear
*------------- method 1 (gen + missing()) ------------------
timer on 1
quietly {
    forvalues i = 1/`numobs' {
        generate variable`i' = !missing(order`i')
    }
}
timer off 1
* ----------- method 2 (reshape + missing()) ---------------
drop variable*
timer on 2
quietly {
    generate id = _n
    reshape long order, i(id)
    generate variable = !missing(order)
    reshape wide order variable, i(id) j(_j)
}
timer off 2
*--------------- method 3 (egen, rowmax()) -----------------
drop variable*
timer on 3
quietly {
    // loop over the order variables creating dummies
    forvalues v=1/`numobs' {
        tab order`v', gen(var`v'_)
    }
    // loop over the domain of the order variables
    // (may need to change)
    forvalues l=1/`numobs' {
        egen variable`l' = rmax(var*_`l')
        drop var*_`l'
    }
}
timer off 3
*----------------- method 4 (original post) ----------------
drop variable*
timer on 4
quietly {
    forvalues i = 1/`numobs' {
        gen variable`i' = 0
        forvalues j = 1/`numobs' {
            replace variable`i' = 1 if order`i'==`j'
        }
    }
}
timer off 4
*-----------------------------------------------------------
timer list
The timed procedures give
. timer list
   1:      0.00 /        1 =       0.0010
   2:      0.30 /        1 =       0.3000
   3:      0.34 /        1 =       0.3390
   4:      0.07 /        1 =       0.0700
where timer 1 is the simple gen, timer 2 the reshape, timer 3 the egen, rowmax(), and timer 4 the original post.
The reason you need only one loop is that Stata executes each command on all observations in the dataset, from top (first observation) to bottom (last observation). For example, variable1 is generated according to whether order1 is missing or not, and this is done for all observations of both variables without an explicit loop.
I wonder if you actually need to do this. For future questions, if you have a further goal in mind, I think a good strategy is to mention it in your post.
Note: I've reused code from other posters' answers.
Here's a simpler way to do it (that still requires 2 loops):
// loop over the order variables creating dummies
forvalues v=1/20 {
    tab order`v', gen(var`v'_)
}
// loop over the domain of the order variables (may need to change)
forvalues l=1/3 {
    egen variable`l' = rmax(var*_`l')
    drop var*_`l'
}