How to change order of string based on dates - stata

I received data with a string variable that looks something like:
var_name
25-DEC-99: A11, B14, C89; 28-FEB-94: A27, B94, C30
01-APR-11: A25, B82, C65
04-JUL-09: A21, B55, C26; 12-MAR-03: A11, B72, C68; 08-JUN-11: A62, B47, C82
12-JUN-00: A77, B19, C73; 03-JUL-12: A99, B04, C54
27-OCT-15: A22, B95, C08
And so on. My goal is to split these strings up into different variable names. The variable names would be v1_date, v1_A, v1_B, v1_C, v2_date, v2_A, v2_B, v2_C, v3_date, v3_A, v3_B, v3_C.
I can use split var_name, p(";"), rename to be v1, v2, and v3, and then split again to do this. But the problem is that I want v1, v2, and v3 to be in chronological order based on the date and the data is not currently arranged in that fashion. How can I make it so that the date of v1 comes before v2 and the date of v2 comes before the date of v3? For example in the first observation, I want 25-DEC-99: A11, B14, C89 to be associated with v2 and 28-FEB-94: A27, B94, C30 to be associated with v1.

The following gets you close, I believe. It uses both split and reshape.
clear
set more off
input ///
str100 myvar
"25-DEC-99: A11, B14, C89; 28-FEB-94: A27, B94, C30"
"01-APR-11: A25, B82, C65"
"04-JUL-09: A21, B55, C26; 12-MAR-03: A11, B72, C68; 08-JUN-11: A62, B47, C82"
"12-JUN-00: A77, B19, C73; 03-JUL-12: A99, B04, C54"
"27-OCT-15: A22, B95, C08"
end
split myvar, p(;)
drop myvar
gen obs = _n
reshape long myvar, i(obs)
drop if missing(myvar)
split myvar, p(:)
drop myvar
gen myvar11 = date(myvar1, "DMY", 2020)
format %td myvar11
drop myvar1
rename (myvar11 myvar2) (mydate mycells)
order mydate, before(mycells)
bysort obs (mydate) : gen neworder = _n
drop _j
reshape wide mydate mycells, i(obs) j(neworder)
list
You can loop over the mycells variables if you need to further split them.
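That last step could look something like the sketch below. It assumes the data are exactly as left by the code above (variables mydate1-mydate3 and mycells1-mycells3), that no observation holds more than three records, and that every record has exactly three comma-separated cells. The stub part and the v1_A-style names are only illustrative.

```stata
* Sketch: split each mycells# into its A, B and C parts
* (assumes at most 3 records per observation and 3 cells per record)
forvalues j = 1/3 {
    split mycells`j', parse(,) generate(part)
    rename (part1 part2 part3) (v`j'_A v`j'_B v`j'_C)
    * the pieces keep the blank that followed each comma or colon
    foreach s in A B C {
        replace v`j'_`s' = strtrim(v`j'_`s')
    }
    rename mydate`j' v`j'_date
    drop mycells`j'
}
list
```

If the number of records varies across datasets, the upper limit of the loop would need to be worked out from the variable names rather than hard-coded.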

In general, please consider using dataex (SSC) to create easy data examples.
You don't give all the (not trivial) code you used to split the variables. As it happens, I don't think your variable names are easy to work with, so I re-created the split in my own fashion. If you reshape long the split data, then sorting by date is easy, but I have pulled up short of the reverse reshape wide, as I suspect the long structure is much easier to work with.
clear
input str80 data
"25-DEC-99: A11, B14, C89; 28-FEB-94: A27, B94, C30"
"01-APR-11: A25, B82, C65"
"04-JUL-09: A21, B55, C26; 12-MAR-03: A11, B72, C68; 08-JUN-11: A62, B47, C82"
"12-JUN-00: A77, B19, C73; 03-JUL-12: A99, B04, C54"
"27-OCT-15: A22, B95, C08"
end
split data, p(;) gen(x)
local j = 1
gen work = ""
foreach x of var x* {
    replace work = substr(`x', 1, strpos(`x', ":") - 1)
    gen date`j' = daily(work, "DMY", 2050)
    replace work = substr(`x', strpos(`x', ":") + 1, .)
    split work, p(,)
    rename (work1 work2 work3) (vA`j' vB`j' vC`j')
    local ++j
}
drop work
drop x*
drop data
gen id = _n
reshape long date vA vB vC, i(id) j(which)
drop if missing(date)
bysort id (date): replace which = _n
list, sepby(id)
     +----------------------------------------+
     | id   which    date    vA     vB     vC |
     |----------------------------------------|
  1. |  1       1   12477   A27    B94    C30 |
  2. |  1       2   14603   A11    B14    C89 |
     |----------------------------------------|
  3. |  2       1   18718   A25    B82    C65 |
     |----------------------------------------|
  4. |  3       1   15776   A11    B72    C68 |
  5. |  3       2   18082   A21    B55    C26 |
  6. |  3       3   18786   A62    B47    C82 |
     |----------------------------------------|
  7. |  4       1   14773   A77    B19    C73 |
  8. |  4       2   19177   A99    B04    C54 |
     |----------------------------------------|
  9. |  5       1   20388   A22    B95    C08 |
     +----------------------------------------+

Related

Stata: date comparison in double

I'm trying to divide the data by a certain datetime.
I've created e_time from what was originally a string, "2019-10-15 20:33:04" for example.
To keep all the information in the string, including h:m:s, I used the following command to create a double:
gen double e_time = clock(event_timestamp, "YMDhms")
Now I get the result I want from format e_time %tc (human readable).
I want to generate a new variable that is 1 for anything at or after 2019-10-15 and 0 for anything before it.
I've tried
// 1
gen new_d = 0 if e_time < "1.887e+12"
replace new_d = 1 if e_time >= "1.887e+12"
// 2
gen new_d = 0 if e_time < "2019-10-15"
replace new_d = 1 if e_time > "2019-10-15"
However, I get the error message type mismatch.
I tried converting the string "2019-10-15" to a double, to check with display whether 1.887e+12 really meant 2019-10-15, but I'm not sure how the command really works here.
Anyhow I tried
// 3
di clock("2019-10-15", "YMDhms")
but it didn't work.
Can anyone give advice on comparing dates that are in a double format properly?
Your post is a little hard to follow (a reproducible data example would help a lot) but the error type mismatch is because e_time is numeric, and "2019-10-15" is a string.
I suggest the following:
clear
input str20 datetime
"2019-10-14 20:33:04"
"2019-10-16 20:33:04"
end
* Keep first 10 characters
gen date = substr(datetime,1,10)
* Check that all strings are 10 characters
assert length(date) == 10
* Convert from string to numeric date variable
gen m = substr(date,6,2)
gen d = substr(date,9,2)
gen y = substr(date,1,4)
destring m d y, replace
gen newdate = mdy(m,d,y)
format newdate %d
gen wanted = newdate >= mdy(10,15,2019) & !missing(newdate)
drop date m d y
list
     +------------------------------------------+
     |            datetime     newdate   wanted |
     |------------------------------------------|
  1. | 2019-10-14 20:33:04   14oct2019        0 |
  2. | 2019-10-16 20:33:04   16oct2019        1 |
     +------------------------------------------+
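If you would rather keep the datetime variable from the question, a comparison can also be made directly on the %tc double. The following is only a sketch: it reuses the names event_timestamp, e_time and new_d from the question, and new_d2 is an illustrative name. The point is that the right-hand side of the comparison has to be numeric too, either a datetime in milliseconds or a daily date.

```stata
clear
input str20 event_timestamp
"2019-10-14 20:33:04"
"2019-10-16 20:33:04"
end
gen double e_time = clock(event_timestamp, "YMDhms")
format e_time %tc

* compare against midnight at the start of 15 Oct 2019, as a datetime
gen byte new_d = e_time >= tc(15oct2019 00:00:00) if !missing(e_time)

* or convert the datetime to a daily date and compare dates
gen byte new_d2 = dofc(e_time) >= mdy(10, 15, 2019) if !missing(e_time)

* the mask has to match the string: a date-only string needs "YMD"
display %tc clock("2019-10-15", "YMD")

list
```

As far as I can tell, attempt 3 in the question comes back missing because the mask "YMDhms" asks for a time component that the string "2019-10-15" does not have.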

k-fold cross validation: how to filter data based on a randomly generated integer variable in Stata

The following seems obvious, yet it does not behave as I would expect. I want to do k-fold cross validation without using SSC packages, and thought I could just filter my data and run my own regressions on the subsets.
First I generate a variable with a random integer between 1 and 5 (5-fold cross validation), then I loop over each fold number. I want to filter the data by the fold number, but using a boolean filter fails to filter anything. Why?
Bonus: what would be the best way to capture all of the test MSEs and average them? In Python I would just make a list or a numpy array and take the average.
gen randint = floor((6-1)*runiform()+1)
recast int randint
forval b = 1(1)5 {
    xtreg c.DepVar /// // training set
        c.IndVar1 ///
        c.IndVar2 ///
        if randint != `b' ///
        , fe vce(cluster uuid)
    xtreg c.DepVar /// // test set, needs to be performed with model above, not a
        c.IndVar1 /// // new model...
        c.IndVar2 ///
        if randint == `b' ///
        , fe vce(cluster uuid)
}
EDIT: Test set needs to be performed with model fit to training set. I changed my comment in the code to reflect this.
Ultimately the solution to the filtering issue was I was using a scalar in quotes to define the bounds and I had:
replace randint = floor((`varscalar'-1)*runiform()+1)
instead of just
replace randint = floor((varscalar-1)*runiform()+1)
When and where to use the quotes in Stata is confusing to me. I cannot just use varscalar in a loop, I have to use `=varscalar', but I can for some reason use varscalar - 1 and get the expected result. Interestingly, I cannot use
replace randint = floor((`varscalar')*runiform()+1)
I suppose I should just use
replace randint = floor((`=varscalar')*runiform()+1)
So why is it ok to use the version with the minus one and without the equals sign??
The answer below is still extremely helpful and I learned much from it.
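On the question about quotes: the single quotes are macro substitution, which is a text paste done before the command runs, whereas a scalar is referred to by its bare name inside an expression. A reference to a local macro that does not exist expands to nothing, which would explain both symptoms described above. A small sketch, where varlocal is an illustrative name and varscalar is the scalar from the question:

```stata
scalar varscalar = 6
local varlocal = 6

display varscalar - 1       // scalar, used by name: 5
display `varlocal' - 1      // local macro, pasted in as text: 5
display `=varscalar' - 1    // scalar evaluated first, result pasted in: 5

* There is no local macro called varscalar, so `varscalar' expands to
* nothing and the expression below is read as ( - 1), i.e. -1
display (`varscalar' - 1)

* By the same logic (`varscalar')*runiform()+1 would be read as
* ()*runiform()+1, which is a syntax error
```

So with the quoted version the bound was silently replaced by -1, and floor(-1*runiform()+1) is 0 for every observation, which matches none of the folds 1 to 5.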
Two different things are going on here that are not necessarily related: 1) how to filter data with a randomly generated integer value, and 2) the k-fold cross-validation procedure.
For the first, I will leave an example below that uses tools in Stata that transfer easily to other problems (such as generating and manipulating a matrix to store the metrics). However, I would call neither your sketch of code nor my example "k-fold cross-validation", mainly because both fit a model separately on the training data and on the testing data. Strictly speaking, the model should be fitted on the training data only, and those parameter estimates should then be used to assess its performance on the testing data.
For further reference on the procedure, Scikit-learn has done brilliant work explaining it, with several visualizations included.
That being said, here is something that could be helpful.
clear all
set seed 4
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
gen byte randint = runiformint(1, 5)
tab randint
/*
    randint |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         17       17.00       17.00
          2 |         18       18.00       35.00
          3 |         21       21.00       56.00
          4 |         19       19.00       75.00
          5 |         25       25.00      100.00
------------+-----------------------------------
      Total |        100      100.00
*/
// create a matrix to store results
// (note: the MSE_* columns hold e(rmse), i.e. the root MSE)
matrix res = J(5,4,.)
matrix colnames res = "R2_fold" "MSE_fold" "R2_hold" "MSE_hold"
matrix rownames res = "1" "2" "3" "4" "5"
// show formatted empty matrix
matrix li res
/*
res[5,4]
    R2_fold  MSE_fold   R2_hold  MSE_hold
1         .         .         .         .
2         .         .         .         .
3         .         .         .         .
4         .         .         .         .
5         .         .         .         .
*/
// loop over the folds
forvalues b = 1/5 {
    // fit the model on fold `b' only
    qui reg y x1 x2 if randint == `b'
    // save R-squared and RMSE from that fit
    matrix res[`b', 1] = e(r2)
    matrix res[`b', 2] = e(rmse)
    // refit the model on all the other folds
    qui reg y x1 x2 if randint != `b'
    // save R-squared and RMSE from the refit (in-sample, not test-set metrics)
    matrix res[`b', 3] = e(r2)
    matrix res[`b', 4] = e(rmse)
}
// Show matrix with stored metrics
mat li res
/*
res[5,4]
     R2_fold   MSE_fold    R2_hold   MSE_hold
1  .50949187  1.2877728  .74155365  1.0070531
2  .89942838  .71776458  .66401888   1.089422
3  .75542004  1.0870525  .68884359  1.0517139
4  .68140328  1.1103964  .71990589  1.0329239
5  .68816084  1.0017175  .71229925  1.0596865
*/
// some matrix algebra to obtain the mean of each metric
mat U = J(rowsof(res),1,1)
mat sum = U'*res
/* create vector of column (variable) means */
mat mean_res = sum/rowsof(res)
// show the average of the metrics across the folds
mat li mean_res
/*
mean_res[1,4]
      R2_fold   MSE_fold    R2_hold   MSE_hold
c1  .70678088  1.0409408  .70532425  1.0481599
*/
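For completeness, here is a rough sketch of what the stricter procedure described above might look like: fit on the other folds, predict into the held-out fold, and average the out-of-sample squared errors. It reuses the simulated data from this answer and plain regress; the names fold, mse and total are illustrative. With xtreg, fe the prediction step needs more thought, since predict, xb leaves out the fixed effects.

```stata
clear all
set seed 4
set obs 100
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 * x2 + rnormal()
gen byte fold = runiformint(1, 5)

matrix mse = J(5, 1, .)
local total = 0
forvalues b = 1/5 {
    * fit on the training folds only
    quietly regress y x1 x2 if fold != `b'
    * predict into the held-out fold using those estimates
    tempvar yhat sqerr
    quietly predict double `yhat' if fold == `b', xb
    quietly gen double `sqerr' = (y - `yhat')^2
    * mean squared prediction error in the held-out fold
    quietly summarize `sqerr', meanonly
    matrix mse[`b', 1] = r(mean)
    local total = `total' + r(mean)
    drop `yhat' `sqerr'
}
matrix list mse
display "Average test MSE over 5 folds = " `total' / 5
```

Keeping a running total in a local macro is roughly the counterpart of appending to a list in Python and averaging at the end; the matrix is there only if the per-fold values are wanted as well.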

Move some columns from a row into a new row

I have already read about some approaches to this issue but so far have not been able to produce any results on my own, probably because I am new to this tool, so I am asking for some help :)
I'm using SAS Enterprise Guide 8.1.
Have:
DATE |COD|TOTAL |P1_DX |P1_DY |P1_CD|P2_DX |P2_DY |P2_CD| ...until P8_
01JAN2004|9 |185 |02FEB2005|27SEP2010|36 |10SEP2011|12DEC2020|16 |
31JAN2010|2 |351 |17FEB2015|27DEC2020|2 | | | |
(...)
Want
DATE |COD|TOTAL |DX |DY |CD |
01JAN2004 |9 |185 |02FEB2005|27SEP2010|36 |
01JAN2004 |9 |185 |10SEP2011|12DEC2020|16 |
31JAN2010 |2 |351 |17FEB2015|27DEC2020|2 |
(...)
You are pivoting (also known as transposing) the data in multiple sets of columns.
Coders typically use PROC TRANSPOSE to pivot data, but the specifics of this question can't be handled in a single proc step.
Rather than doing steps TRANSPOSE/DATA/TRANSPOSE or TRANSPOSE/TRANSPOSE/TRANSPOSE/MERGE, a single DATA step with ARRAYs can be coded to perform the pivot.
Example:
NOTE: Your column naming convention P<#>_DX, P<#>_DY, and P<#>_CD means that the elements of the ARRAY must be explicitly listed. If the column names instead were constructed using convention DX_<#> the columns could be specified in numbered suffix name list syntax DX_1-DX_8
data want;
  set have;
  array DXs(8) P1_DX P2_DX ...you fill in the rest... P8_DX;
  array DYs(8) P1_DY P2_DY ...you fill in the rest... P8_DY;
  array CDs(8) P1_CD P2_CD ...you fill in the rest... P8_CD;
  length DX DY CD 8;
  * assumes the P<#>_DX and P<#>_DY columns are numeric SAS date values;
  format DX DY date9.;
  do seq = 1 to dim(DXs);
    DX = DXs(seq);
    DY = DYs(seq);
    CD = CDs(seq);
    * only output if there is some data;
    if NMISS(DX,DY,CD) < 3 then OUTPUT;
  end;
  * seq is also kept in case you need to know which <#> a DX DY CD came from;
  keep DATE COD TOTAL DX DY CD seq;
run;

How to extract components of a disorganized string variable in Stata?

I have a text variable showing patient prescription that looks quite messy like this:
PatientRx
ACETAZOLAMIDE 250MG TABLET- 100
ADAPALENE + BENZOYL 0.1% + 2.5% GEL-..
ADRENALINE/EPIPEN 300MCG/0.3ML INJ..
ALENDRONATE + COLECA 70MG + 140MCG TA..
ALLOPURINOL 100MG TABLET- 100
ALUM HYDROX + MAG HY 250+120+120MG/5M..
AMILORIDE + HYDROCHL 5MG + 50MG HCL T..
While I haven't looked through all these values, some patterns emerge:
Often there is more than one drug, and the drugs are separated, for example by a space and a forward slash.
Drugs are also separated with a plus sign. But the plus sign is also used between doses.
The use of spaces is very arbitrary, both at the beginning and in the middle of an entry.
How can I extract only the names of the drugs into new variables? New variables should look like this:
Newvar1 Newvar2
ACETAZOLAMIDE
ADAPALENE BENZOYL
ADRENALINE EPIPEN
ALENDRONATE COLECA
and so on.
Some would reach first for regular expressions, which you might indeed need for the full problem. In addition note moss as installed by ssc install moss.
But it seems easiest, given the information in the example here, which is all we have to go on, to look for the position of the first numeric digit 0 to 9 and then parse what goes before. I don't know whether drug names ever contain numeric digits.
clear
input str40 sandbox
" ACETAZOLAMIDE 250MG TABLET- 100"
"ADAPALENE + BENZOYL 0.1% + 2.5% GEL-"
" ADRENALINE/EPIPEN 300MCG/0.3ML INJ"
"ALENDRONATE + COLECA 70MG + 140MCG TA"
" ALLOPURINOL 100MG TABLET- 100"
"ALUM HYDROX + MAG HY 250+120+120MG/5M"
" AMILORIDE + HYDROCHL 5MG + 50MG HCL T"
end
gen wherenum = .
quietly forval j = 0/9 {
    replace wherenum = min(wherenum, strpos(sandbox, "`j'")) if strpos(sandbox, "`j'")
}
gen drug = substr(sandbox, 1, wherenum - 1)
split drug, parse(+ /)
l drug?, sep(0)
     +---------------------------+
     |         drug1       drug2 |
     |---------------------------|
  1. | ACETAZOLAMIDE             |
  2. |     ADAPALENE     BENZOYL |
  3. |    ADRENALINE      EPIPEN |
  4. |   ALENDRONATE      COLECA |
  5. |   ALLOPURINOL             |
  6. |   ALUM HYDROX      MAG HY |
  7. |     AMILORIDE    HYDROCHL |
     +---------------------------+
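For comparison, a regular-expression version of the same idea might look like the sketch below. It rests on the same assumption that drug names never contain digits, and it only captures whatever precedes the first digit; the stub name is illustrative.

```stata
clear
input str40 sandbox
" ACETAZOLAMIDE 250MG TABLET- 100"
"ADAPALENE + BENZOYL 0.1% + 2.5% GEL-"
" ADRENALINE/EPIPEN 300MCG/0.3ML INJ"
end

* capture everything from the start of the string up to the first digit
gen drug = regexs(1) if regexm(sandbox, "^([^0-9]*)")

* split on plus signs and slashes, then strip the stray blanks
split drug, parse(+ /) generate(name)
foreach v of varlist name* {
    replace `v' = strtrim(`v')
}
list name*, sep(0)
```

Whether this is any clearer than the strpos() loop is a matter of taste; the regular expression mostly pays off once the patterns get messier than "stop at the first digit".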

Stata: Permutations of string variables

I have three string variables of the length 2 and I need to get (a) all possible permutations of the three variables (keeping the order of strings within each variable fixed), (b) all possible variable pairs. Small number of variables allows me to do it manually, but I was wondering if there is a more elegant and concise way of solving this.
It is currently coded as:
egen perm1 = concat(x1 x5 x9)
egen perm2 = concat(x1 x9 x5)
egen perm3 = concat(x5 x1 x9)
egen perm4 = concat(x5 x9 x1)
egen perm5 = concat(x9 x5 x1)
egen perm6 = concat(x9 x1 x5)
gen tuple1 = substr(perm1,1,4)
gen tuple2 = substr(perm2,3,4)
gen tuple3 = substr(perm3,1,4)
gen tuple4 = substr(perm4,3,4)...
An abstract from a resulting table illustrates the desired outcome:
+----+----+----+--------+--------+--------+--------+--------+--------+--------+--------+
| x1 | x5 | x9 | perm1 | perm2 | perm3 | perm4 | perm5 | perm6 | tuple1 | tuple2 |
+----+----+----+--------+--------+--------+--------+--------+--------+--------+--------+
| 01 | 05 | 09 | 010509 | 010905 | 050109 | 050901 | 090501 | 090105 | 0105 | 0509 |
+----+----+----+--------+--------+--------+--------+--------+--------+--------+--------+
Neat question. I don't know if there's a "built in" way to do permutations, but the following should do it.
You want to loop over all your variables, but make sure that you don't get duplicates. As the dimensions increase this gets tricky. What I do is loop over the same list and each time remove the current element from the list that the nested loop runs over.
Unfortunately, this still requires you to write each loop structure, but this should be easy enough to cut-paste-find-replace.
clear
set obs 100
generate x1 = "01"
generate x5 = "05"
generate x9 = "09"
local vars x1 x5 x9
local i = 0
foreach a of varlist `vars' {
    local bs : list vars - a
    foreach b of varlist `bs' {
        local cs : list bs - b
        foreach c of varlist `cs' {
            local ++i
            egen perm`i' = concat(`a' `b' `c')
        }
    }
}
Edit: Re-reading the question, I'm not clear on what you want (since row1_1 isn't one of your concatenated variables). Note that if you really want the "drop one" permutations, then just remove one variable from the concat call. This is because "n permute n" is the same as "n permute n-1". That is, there are 6 3-item permutations of 3 items. There are also 6 2-item permutations of 3 items. So, inside the innermost loop:
egen perm`i' = concat(`a' `b')