How to extract components of a disorganized string variable in Stata? - stata

I have a text variable showing patient prescription that looks quite messy like this:
PatientRx
ACETAZOLAMIDE 250MG TABLET- 100
ADAPALENE + BENZOYL 0.1% + 2.5% GEL-..
ADRENALINE/EPIPEN 300MCG/0.3ML INJ..
ALENDRONATE + COLECA 70MG + 140MCG TA..
ALLOPURINOL 100MG TABLET- 100
ALUM HYDROX + MAG HY 250+120+120MG/5M..
AMILORIDE + HYDROCHL 5MG + 50MG HCL T..
While I haven't looked through all these values, some patterns may arise:
Often there is more than one drug, and they are separated, for example, by a space and a forward slash.
Drugs are also separated by a plus sign. But the plus sign is also used between doses.
The use of spaces is quite arbitrary, both at the beginning and in the middle of an entry.
How can I extract only the names of the drugs into new variables? New variables should look like this:
Newvar1        Newvar2
ACETAZOLAMIDE
ADAPALENE      BENZOYL
ADRENALINE     EPIPEN
ALENDRONATE    COLECA
and so on.

Some would reach first for regular expressions, which you might indeed need for the full problem. In addition, note moss, which you can install with ssc install moss.
But it seems easiest, given the information in the example here, which is all we have to go on, to look for the position of the first numeric digit 0 to 9 and then parse what goes before. I don't know whether drug names ever contain numeric digits.
clear
input str40 sandbox
" ACETAZOLAMIDE 250MG TABLET- 100"
"ADAPALENE + BENZOYL 0.1% + 2.5% GEL-"
" ADRENALINE/EPIPEN 300MCG/0.3ML INJ"
"ALENDRONATE + COLECA 70MG + 140MCG TA"
" ALLOPURINOL 100MG TABLET- 100"
"ALUM HYDROX + MAG HY 250+120+120MG/5M"
" AMILORIDE + HYDROCHL 5MG + 50MG HCL T"
end
gen wherenum = .
quietly forval j = 0/9 {
replace wherenum = min(wherenum, strpos(sandbox, "`j'")) if strpos(sandbox, "`j'")
}
gen drug = substr(sandbox, 1, wherenum - 1)
split drug, parse(+ /)
l drug?, sep(0)
     +---------------------------+
     | drug1          drug2      |
     |---------------------------|
  1. | ACETAZOLAMIDE             |
  2. | ADAPALENE      BENZOYL    |
  3. | ADRENALINE     EPIPEN     |
  4. | ALENDRONATE    COLECA     |
  5. | ALLOPURINOL               |
  6. | ALUM HYDROX    MAG HY     |
  7. | AMILORIDE      HYDROCHL   |
     +---------------------------+
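For readers who want to check the logic outside Stata, here is a minimal Python sketch of the same idea: find the first digit, keep what precedes it, then split on "+" and "/". The function name drug_names is hypothetical, and keeping the whole string when no digit is found is a choice made for this sketch only.

```python
import re


def drug_names(rx: str) -> list[str]:
    """Return the drug names that precede the first digit in rx."""
    match = re.search(r"\d", rx)
    # Keep everything before the first digit; keep the whole string if
    # there is no digit at all (an assumption of this sketch).
    head = rx[: match.start()] if match else rx
    # Split on "+" or "/" only, so names containing spaces stay intact,
    # then drop surrounding spaces and empty pieces.
    return [part.strip() for part in re.split(r"[+/]", head) if part.strip()]


print(drug_names(" ACETAZOLAMIDE 250MG TABLET- 100"))
print(drug_names("ALUM HYDROX + MAG HY 250+120+120MG/5M"))
```

As with the Stata code, this relies on drug names never containing digits.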

Ignore missing values when generating new variable

I want to create a new variable in Stata, that is a function of 3 different variables, X, Y and Z, like:
gen new_var = (((X)*3) + ((Y)*2) + ((Z)*4))/7
All observations have missing values for one or two of the variables.
When I run the aforementioned command, all it generates are missing values, because no observation has values for all 3 of the variables. I would like Stata to complete the function ignoring the missing variables.
I tried the following commands without success:
gen new_var= (cond(missing(X*3),., X) + cond(missing(Y*2),., Y))/7
gen new_var= (!missing(X*3+Y*2+Z*4)/7)
gen new_var= (max(X , Y, Z)/7) if missing(X , Y, Z)
The egen command does not allow complicated functions; otherwise rowtotal() could work.
EDIT:
To clarify, "ignoring missing variables" means that even if any one of the component variables is not missing, then apply the function to only that variable and produce a value for the new variable. The new variable should have missing values only when all three component variables are missing.
I am going to guess that "ignoring missing values" means "treating them as zeros". If you have some other idea, you should make it explicit.
That could be
gen new_var = (cond(missing(X), 0, 3 * X) ///
+ cond(missing(Y), 0, 2 * Y) ///
+ cond(missing(Z), 0, 4 * Z)) / 7
Let's look at your solutions and explain why they are all wrong either in general or usually.
(cond(missing(X*3),., X) + cond(missing(Y*2),., Y))/7
It is sufficient to note that if it's true that X is missing, then cond() yields missing, as then X * 3 is missing too. The same kind of remark applies to terms involving Y and Z. So you're replacing any missing values by missing values, which is no gain.
!missing(X*3+Y*2+Z*4)/7
Given the information that at least one of X Y Z is always missing, then this always evaluates to 0/7 or 0. Even if X Y Z were all non-missing, then it would evaluate to 1/7. That is a long way from the sum you want. missing() always yields 1 or 0, and its negation thus 0 or 1.
(max(X, Y, Z)/7) if missing(X , Y, Z)
The maximum of X, Y, Z will be the right answer if and only if one of the values is not missing and the other two are missing. max() ignores missings to the extent possible (even though in other contexts missings are treated as if arbitrarily large positive numbers).
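To make the intended rule from the question's EDIT concrete (use whichever components are present, and return missing only when all three are missing), here is a small Python sketch. It is not Stata code: None stands in for a Stata missing value, and the function name is hypothetical.

```python
def weighted_ignoring_missing(x, y, z):
    """Return (3x + 2y + 4z) / 7 using only the non-missing inputs.

    None stands in for a missing value. The result is None only when
    all three inputs are missing.
    """
    terms = [w * v for w, v in ((3, x), (2, y), (4, z)) if v is not None]
    return sum(terms) / 7 if terms else None


print(weighted_ignoring_missing(7, None, None))     # only X contributes
print(weighted_ignoring_missing(None, None, None))  # all missing
```

Note the difference from the cond() expression above, which yields 0 rather than missing when all three variables are missing.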
If you just want to "ignore missing values" without "treating them as zeros", the following will work (note that new_var is left missing whenever any of X, Y, Z is missing, as the listing shows):
clear
set obs 10
generate X = rnormal(5, 2)
generate Y = rnormal(10, 5)
generate Z = rnormal(1, 10)
replace X = . in 2
replace Y = . in 5
replace Z = . in 9
generate new_var = (((X)*3) + ((Y)*2) + ((Z)*4)) / 7 if X != . | Y != . | Z != .
list
     +---------------------------------------------+
     |        X          Y           Z     new_var |
     |---------------------------------------------|
  1. | 3.651024    3.48609    -24.1695   -11.25039 |
  2. |        .   14.14995    8.232919           . |
  3. | 3.689442   9.812483    1.154064    5.044221 |
  4. | 2.500493   13.02909     5.25539    7.797317 |
  5. |  4.19431          .    6.584174           . |
  6. | 7.221717   13.92533    5.045283    9.956708 |
  7. | 5.746871   14.26329    3.828253    8.725744 |
  8. | 1.396223    16.2358    19.01479    16.10277 |
  9. | 4.633088   13.95751           .           . |
 10. | 2.521546   4.490258   -3.396854     .422534 |
     +---------------------------------------------+
Alternatively, you could use the inlist() function, which restricts the calculation to observations where none of X, Y, Z is missing:
generate new_var = (((X)*3) + ((Y)*2) + ((Z)*4)) / 7 if !inlist(., X, Y, Z)

How to calculate number of non blank rows based on the value using dax

I have a table with numeric values and blank records. I'm trying to calculate a number of rows that are not blank and bigger than 20.
+--------+
| VALUES |
+--------+
| 2 |
| 0 |
| 13 |
| 40 |
| |
| 1 |
| 200 |
| 4 |
| 135 |
| |
| 35 |
+--------+
I've tried different options but constantly get the next error: "Cannot convert value '' of type Text to type Number". I understand that blank cells are treated as text and thus my filter (>20) doesn't work. Converting blanks to "0" is not an option as I need to use the same values later to calculate AVG and Median.
CALCULATE(
COUNTROWS(Table3),
VALUE(Table3[VALUES]) > 20
)
OR getting "10" as a result:
=CALCULATE(
COUNTROWS(ALLNOBLANKROW(Table3[VALUES])),
VALUE(Table3[VALUES]) > 20
)
The final result in the example table should be: 4
Would be grateful for any help!
First, the VALUE function expects a string. It converts strings like "123" into the integer 123, so let's not use that.
The easiest approach is with an iterator function like COUNTX.
CountNonBlank = COUNTX(Table3, IF(Table3[Values] > 20, 1, BLANK()))
Note that we don't need a separate case for BLANK() (null) here since BLANK() > 20 evaluates as False.
There are tons of other ways to do this. Another iterator solution would be:
CountNonBlank = COUNTROWS(FILTER(Table3, Table3[Values] > 20))
You can use the same FILTER inside of a CALCULATE, but that's a bit less elegant.
CountNonBlank = CALCULATE(COUNT(Table3[Values]), FILTER(Table3, Table3[Values] > 20))
Edit
I don't recommend the CALCULATE version. If you have more columns with more conditions, just add them to your FILTER. E.g.
CountNonBlank =
COUNTROWS(
FILTER(Table3,
Table3[Values] > 20
&& Table3[Text] = "xyz"
&& Table3[Number] <> 0
&& Table3[Date] <= DATE(2018, 12, 31)
)
)
You can also do OR logic with || instead of the && for AND.
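As a quick cross-check of the expected result on the sample table, here is a Python sketch of the same filter-then-count logic. None stands in for a blank cell and the function name is hypothetical. Unlike DAX, Python will not compare None with a number, so the blank test has to be explicit.

```python
# Sample column from the question; None stands in for a blank cell.
values = [2, 0, 13, 40, None, 1, 200, 4, 135, None, 35]


def count_above(rows, threshold):
    """Count the non-blank entries strictly greater than threshold."""
    return sum(1 for v in rows if v is not None and v > threshold)


print(count_above(values, 20))
```

The blanks stay in the data untouched, so a later average or median can still skip them rather than treat them as zeros.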

Coding dichotomous variables in Stata

I have a set of dichotomous variables for firm size:
emp1_2 (i.e. firm with 1 or 2 employed people, including the owner), emp3_9, emp10_19, emp20_49, emp50_99, emp100_249, emp250_499, emp500, plus I do not have information on 27 firms size but I have an educated guess that they are large firms.
I want to create a dichotomous variable for a firm being a "small firm"; therefore, this variable equals 1 when emp1_2==1 | emp3_9==1 | emp10_19==1 is true, and 0 otherwise.
To my understanding of Stata, of which I am a bare user, the two following methods to construct dichotomous variables should be equivalent.
Method 1)
gen lar_firm = 0
replace lar_firm = 1 if emp1_2==1 | emp3_9==1 | emp10_19==1
Method 2)
gen lar_firm = (emp1_2 | emp3_9 | emp10_19)
Instead, I have found that with method 2) lar_firm equals 1 both for firms for which emp1_2 | emp3_9 | emp10_19 holds and for firms that do not fall into any of the categories (i.e. emp1_2, emp3_9, emp10_19, emp20_49, emp50_99, emp100_249, emp250_499, emp500) but for which I have an educated guess that they are large firms.
I am wondering whether there is some subtle difference between the two methods. I thought they should lead to equal outcomes.
When you do
gen lar_firm = emp1_2 | emp3_9 | emp10_19
you're testing if
(emp1_2 != 0) | (emp3_9 != 0) |(emp10_19 != 0)
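In particular, missing values . are different from 0: in fact, Stata treats them as greater than any number. So if the firms with no size information have missing values on these variables, they count as true under method 2, whereas under method 1 the comparison . == 1 is false.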
For more information:
http://www.stata.com/support/faqs/data-management/logical-expressions-and-missing-values/
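The difference between the two methods can be mimicked in a few lines of Python. This is only a model of the behaviour described above, not Stata itself: None stands in for the missing value ".", and all function names are hypothetical.

```python
MISSING = None  # stand-in for Stata's numeric missing value "."


def stata_true(value):
    """Model Stata's truth rule: anything other than 0 is true, including missing."""
    return value is MISSING or value != 0


def method1(e1, e2, e3):
    # Explicit comparisons: only an exact 1 counts, so missing gives 0.
    return int(e1 == 1 or e2 == 1 or e3 == 1)


def method2(e1, e2, e3):
    # Bare variables in a logical expression: each is tested as "!= 0".
    return int(stata_true(e1) or stata_true(e2) or stata_true(e3))


print(method1(MISSING, MISSING, MISSING), method2(MISSING, MISSING, MISSING))
```

The two methods agree whenever all three indicators are 0 or 1 and differ only when missing values are present.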

Stata: Counting number of consecutive occurrences of a pre-defined length

Observations in my data set contain the history of moves for each player. I would like to count the number of consecutive series of moves of some pre-defined length (2, 3 and more than 3 moves) in the first and the second halves of the game. The sequences cannot overlap, i.e. the sequence 1111 should be considered as a sequence of the length 4, not 2 sequences of length 2. That is, for an observation like this:
+-------+-------+-------+-------+-------+-------+-------+-------+
| Move1 | Move2 | Move3 | Move4 | Move5 | Move6 | Move7 | Move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 1 | 1 | . | . | 1 | 1 |
+-------+-------+-------+-------+-------+-------+-------+-------+
…the following variables should be generated:
Number of sequences of 2 in the first half =0
Number of sequences of 2 in the second half =1
Number of sequences of 3 in the first half =0
Number of sequences of 3 in the second half =0
Number of sequences of >3 in the first half =1
Number of sequences of >3 in the second half = 0
I have two potential options of how to proceed with this task but neither of those leads to the final solution:
Option 1: Elaborating on Nick’s tactical suggestion to use strings (Stata: Maximum number of consecutive occurrences of the same value across variables), I have concatenated all “move*” variables and tried to identify the starting position of a substring:
egen test1 = concat(move*)
gen test2 = subinstr(test1,"11","X",.) // find all consecutive series of length 2
There are several problems with Option 1:
(1) it does not account for cases with overlapping sequences (“1111” is recognized as 2 sequences of 2)
(2) it shortens the resulting string test2 so that positions of X no longer correspond to the starting positions in test1
(3) it does not account for variable length of substring if I need to check for sequences of the length greater than 3.
Option 2: Create an auxiliary set of variables to identify the starting positions of the consecutive set (sets) of the 1s of some fixed predefined length. Building on the earlier example, in order to count sequences of length 2, what I am trying to get is an auxiliary set of variables that will be equal to 1 if the sequence of started at a given move, and zero otherwise:
+-------+-------+-------+-------+-------+-------+-------+-------+
| Move1 | Move2 | Move3 | Move4 | Move5 | Move6 | Move7 | Move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
My code looks as follows but it breaks when I am trying to restart counting consecutive occurrences:
quietly forval i = 1/42 {
gen temprow`i' =.
egen rowsum = rownonmiss(seq1-seq`i') //count number of occurrences
replace temprow`i'=rowsum
mvdecode seq1-seq`i',mv(1) if rowsum==2
drop rowsum
}
Does anyone know a way of solving the task?
Assume a string variable, all, that concatenates all moves (the name test1 is hardly evocative).
FIRST TRY: TAKING YOUR EXAMPLE LITERALLY
From your example with 8 moves, the first half of the game is moves 1-4 and the second half moves 5-8. Thus there is for each half only one way to have >3 moves, namely that there are 4 moves. In that case each substring will be "1111" and counting reduces to testing for the one possibility:
gen count_1_4 = substr(all, 1, 4) == "1111"
gen count_2_4 = substr(all, 5, 4) == "1111"
Extending this approach, there are only two ways to have 3 moves in sequence:
gen count_1_3 = inlist(substr(all, 1, 4), "111.", ".111")
gen count_2_3 = inlist(substr(all, 5, 4), "111.", ".111")
In similar style, there can't be two instances of 2 moves in sequence in each half of the game as that would qualify as 4 moves. So, at most there is one instance of 2 moves in sequence in each half. That instance must match either of two patterns, "11." or ".11". The case ".11." is allowed too, as it matches both patterns. We must also exclude any false match with a sequence of 3 moves, as just mentioned.
gen count_1_2 = (strpos(substr(all, 1, 4), "11.") | strpos(substr(all, 1, 4), ".11") ) & !count_1_3
gen count_2_2 = (strpos(substr(all, 5, 4), "11.") | strpos(substr(all, 5, 4), ".11") ) & !count_2_3
The result of each strpos() evaluation will be positive if a match is found and (arg1 | arg2) will be true (1) if either argument is positive. (For Stata, non-zero is true in logical evaluations.)
That's very much tailored to your particular problem, but not much worse for that.
P.S. I didn't try hard to understand your code. You seem to be confusing subinstr() with strpos(). If you want to know positions, subinstr() cannot help.
SECOND TRY
Your last code segment implies that your example is quite misleading: if there can be 42 moves, the approach above cannot be extended without pain. You need a different approach.
Let's suppose that the string variable all can be 42 characters long. I will set aside the distinction between first and second halves, which can be tackled by modifying this approach. At its simplest, just split the history into two variables, one for the first half and one for the second and repeat the approach twice.
You can clone the history by
clonevar work = all
gen length1 = .
gen length2 = .
and set up your count variables. Here count_4 will hold counts of 4 or more.
gen count_4 = 0
gen count_3 = 0
gen count_2 = 0
First we look for move sequences of length 42, ..., 2. Every time we find one, we blank it out and bump up the count.
qui forval j = 42(-1)2 {
    replace length1 = length(work)
    // build a pattern of `j' consecutive 1s
    local pattern : di _dup(`j') "1"
    replace work = subinstr(work, "`pattern'", "", .)
    replace length2 = length(work)
    // count variables are named as created above: count_4, count_3, count_2
    if `j' >= 4 {
        replace count_4 = count_4 + (length1 - length2) / `j'
    }
    else if `j' == 3 {
        replace count_3 = count_3 + (length1 - length2) / 3
    }
    else if `j' == 2 {
        replace count_2 = count_2 + (length1 - length2) / 2
    }
}
The important details here are:
If we delete (repeated instances of) a pattern and measure the change in length, we have just deleted (change in length) / (length of pattern) instances of that pattern. So, if I look for "11" and find that the length decreased by 4, I have just found two instances.
Working downwards and deleting what we found ensures that we don't find false positives, e.g. if "1111111" is deleted, we don't find later "111111", "11111", ..., "11" which are included within it.
Deletion implies that we should work on a clone in order not to destroy what is of interest.
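The three points above can be checked on the question's example with a short Python sketch of the same delete-and-measure loop. The function name and the dictionary keys are hypothetical. The sketch works on a copy of the string and defaults to the string's own length as the longest run to look for.

```python
def count_runs(history, max_len=None):
    """Count runs of 1s of length 2, 3 and 4 or more by deleting them, longest first."""
    if max_len is None:
        max_len = len(history)
    counts = {"2": 0, "3": 0, "4+": 0}
    work = history  # work on a copy so the original string is untouched
    for j in range(max_len, 1, -1):
        before = len(work)
        work = work.replace("1" * j, "")
        # (change in length) / (length of pattern) = number of runs deleted
        found = (before - len(work)) // j
        counts["4+" if j >= 4 else str(j)] += found
    return counts


moves = "1111..11"  # the example observation, "." marking a missing move
print(count_runs(moves[:4]))  # first half
print(count_runs(moves[4:]))  # second half
```

Splitting the string before counting is the simple way of handling the two halves mentioned earlier. A run that straddles the midpoint is then counted separately in each half, which may or may not be what you want.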

posix regexp to split a table

I'm currently working on data migration in PostgreSQL. Since I'm new to posix regular expressions, I'm having some trouble with a simple pattern and would appreciate your help.
I want to have a regular expression split my table on each alphanumeric char in a column, e.g. when a column contains a string 'abc' I'd like to split it into 3 rows: ['a', 'b', 'c']. I need a regexp for that.
The second case is a little more complicated: I'd like to split an expression '105AB' into ['105A', '105B']. That is, I'd like to copy the numbers at the beginning of the string and split the table on uppercase letters, in the end joining the number with exactly 1 uppercase letter.
The function I'll be using is probably regexp_split_to_table(string, regexp).
I'm intentionally providing very little data not to confuse anyone, since what I posted is the essence of the problem. If you need more information please comment.
The first was already solved by you:
select regexp_split_to_table(s, ''), i
from (values
('abc', 1),
('def', 2)
) s(s, i);
 regexp_split_to_table | i
-----------------------+---
 a                     | 1
 b                     | 1
 c                     | 1
 d                     | 2
 e                     | 2
 f                     | 2
In the second case you don't say if the numerics are always the first three characters:
select
left(s, 3) || regexp_split_to_table(substring(s from 4), ''), i
from (values
('105AB', 1),
('106CD', 2)
) s(s, i);
 ?column? | i
----------+---
 105A     | 1
 105B     | 1
 106C     | 2
 106D     | 2
For a variable number of numerics:
select n || a, i
from (
select
substring(s, '^\d{1,3}') n,
regexp_split_to_table(substring(s, '[A-Z]+'), '') a,
i
from (values
('105AB', 1),
('106CD', 2)
) s(s, i)
) s;
 ?column? | i
----------+---
 105A     | 1
 105B     | 1
 106C     | 2
 106D     | 2
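If it helps to prototype the second case outside the database, the same "leading digits, then one row per uppercase letter" rule can be sketched in Python. The function name expand is hypothetical. This sketch accepts any number of leading digits and requires the whole string to match, which is stricter than the SQL above.

```python
import re


def expand(code):
    """Split e.g. '105AB' into ['105A', '105B']; return [] if the shape differs."""
    match = re.fullmatch(r"(\d+)([A-Z]+)", code)
    if match is None:
        return []
    number, letters = match.groups()
    # Pair the numeric prefix with each uppercase letter in turn.
    return [number + letter for letter in letters]


print(expand("105AB"))
print(expand("106CD"))
```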