How to add across columns - stata

Say I have this data:
clear
set more off
input ///
str15(s1 s2 s3 s4)
a "b" "b" "c"
b "b" "" "a"
c "c" "" ""
d "f" "" "g"
e "" "" ""
end
I want to find the number of non-missing values across columns by row. I try:
gen sum = s1!="" + s2!="" + s3!="" + s4!=""
But this gives a type mismatch error. What am I doing wrong?
However, I can write it all out and it works:
gen x=s1!=""
gen y=s2!=""
gen z=s3!=""
gen q=s4!=""
gen sum1=x + y + z + q

The problem is operator precedence. The following works:
gen sum = (s1!="") + (s2!="") + (s3!="") + (s4!="")
From the Stata User's Guide
The order of evaluation (from first to last) of all operators is ! (or ~), ^, - (negation), /, *, - (subtraction), +, != (or ~=), >, <, <=, >=, ==, &, and |.
However, I prefer Roberto's and Dimitry's recommendation of rownonmiss.

I believe you want something like this:
gen sum2 = !missing(s1) + !missing(s2) + !missing(s3) + !missing(s4)
or even something more compact:
egen sum3 = rownonmiss(s1 s2 s3 s4), strok
egen stands for extended generate, and is usually a good place to start looking, followed by egenmore if the first fails.
However, I don't have good intuition why your code fails with strings. It seems to work with numeric variables.

Related

Repeating code in an if qualifier in Stata

In Stata I am trying to repeat code inside an if qualifier using perhaps a forvalues loop. My code looks something like this:
gen y=0
replace y=1 if x_1==1 & x_2==1 & x_3==1 & x_4==1
Instead of writing the & x_i==1 statement every time for each variable, I want to do it using a loop, something like this:
gen y=0
replace y=1 if forvalues i=1/4{x_`i'==1 &}
LATER EDIT:
Would it be possible to create a local in the line of this with the elements added together:
forvalues i=1/4{
local text_`i' "x_`i'==1 &"
display "`text_`i''"
}
And then call it at the if qualifier ?
Although you use the term "if statement" all your code is phrased in terms of if qualifiers, which aren't commands or statements. (Your use of the term "statement" is looser than customary, but that doesn't affect an answer directly.)
You can't insert loops in if qualifiers.
For the differences, see
help if
help ifcmd
The entire example
gen y = 0
replace y = 1 if x==1 | x==2 | x==3 | x==4
would be better as
gen y = inlist(x, 1, 2, 3, 4)
or (dependent possibly on whatever values are allowed)
gen y = inrange(x, 1, 4)
A loop solution could be
gen y = 0
quietly forval i = 1/4 {
replace y = 1 if x == `i'
}
We can't discuss whether inlist() or inrange() would or would not be a solution for your real problem if you don't show it to us.
I usually don't like - in Nick's terms - to write code to write code. I see an immediate, though neither elegant nor 'heterodox', solution to your issue. The whole thing amounts to generating an indicator function for all your indicators, and using it with your if qualifier.
Implicit assumptions, which make this a bad, non-generalizable solution, are: 1) all variables are dummies, and you need them to be == 1, and 2) variable names are conveniently ordered 1 to N (although, if that is not the case, you can easily change the forv into a 'foreach var of varlist etc.')
g touse = 1
forv i =1/30{
replace touse = touse * x_`i'
}
<your action> if touse == 1

python tuple comparison, 'less than' revisited (2.7)

I want to understand Python's behavior for tuple comparison operators, specifically < (the less-than operator).
edit: Now I get it, thank you Jaroslaw for a clear answer.
My error was thinking 1) "Lexicographical comparison" was a synonym for "string compare",
and 2) thinking that the "string compare logic" applied to each element of the tuples being compared instead of at the tuple-level. Which yields valid behavior for == but not so much for <.
On the off chance anyone else gets stuck on the distinction.
excerpt from wikipedia (emphasis added): ...lexicographical order ... is a generalization
of the way the alphabetical order of words is based on the alphabetical order of their component letters.
This generalization consists primarily in defining a total order over the sequences ... of
elements of a finite totally ordered set, often called alphabet.
original question text follows...
Wait, you say, hasn't this been asked before?
Yes... and pretty well answered for == (equality).
I want to understand why ( 1, 2 ) < ( 1, 3 ) yields True (note the < less than operator, code example below). This is likely just a python-newbie error on my part, but I am having trouble finding it.
I've read some other questions about how tuple comparisons involve "lexicographic comparisons of the respective elements from each tuple."
Question: python-tuple-comparison-odd-behavior: This question is about using the in operator, I'm interested in < (less than), not so much the behavior of in (at least not yet).
Question: python-tuple-comparison: For this one the answer says (emphasis added):
excerpt from Answer: Tuples are compared position by position: the first item of first
tuple is compared to the first item of the second tuple; if they are
not equal, this is the result of the comparison, else the second item
is considered, then the third and so on.
Which I understand for == comparisons.
edit: *thought I understood
To generalize to all comparison operators I would modify the answer to be something like this:
... the first item of first tuple is compared to the
first item of the second tuple; if the comparison yields False
then the result of the tuple comparison is also False. Otherwise
the comparison continues with the remaining items....
edit: this was wrong. Subtly wrong though, worked for == but not other relational operators.
I am having trouble seeing how that works for < (less than) comparisons.
The python documentation they link to (modified to point to 2.7) also talks about this in terms of equality, not less than - again, emphasis added:
excerpt from python docs: Sequence types also support comparisons. In particular, tuples and
lists are compared lexicographically by comparing corresponding
elements. This means that to compare equal, every element must compare
equal and the two sequences must be of the same type and have the same
length. (For full details see Comparisons in the language reference.)
edit: at this point when writing up my original question I had tunnel vision on 'equality'.
I found nothing helpful in the Comparisons language reference; it doesn't touch on why ( 1, 2 ) < ( 1, 3 ) yields True when the comparison operator seems like it should yield False for the first pair of elements.
The following is some example output of a toy test program; most of it works as I would expect. Please note the 2 embedded questions.
Output from "tcomp.py" (source below).
$ python tcomp.py
tcomp.py
version= 2.7.12
--- empty tuples, intuitive, no surprises ---
() == () : True
() < () : False
() > () : False
--- single-item tuples, equal values: intuitive, no surprises ---
(1,) == (1,) : True
(1,) < (1,) : False
(1,) > (1,) : False
--- single-item diff: intuitive, no surprises ---
(1,) == (2,) : False
(1,) < (2,) : True
(1,) > (2,) : False
--- double-item same: intuitive, no surprises ---
(1, 2) == (1, 2) : True
(1, 2) < (1, 2) : False
(1, 2) > (1, 2) : False
* Question: do a<b and a>b both yield False
* because Python short circuits on
* a[0] < b[0] (and correspondingly a[0] > b[0] )?
* e.g. Python never bothers comparing second
* elements: a[1] < b[1] (and, correspondinlgy, a[1] > b[1] ).
--- double-item 1st=same 2nd=diff: ??? What is with a<b ???
(1, 2) == (1, 3) : False
(1, 2) < (1, 3) : True
(1, 2) > (1, 3) : False
* Question: Here I get the == comparison, that yields False like I would expect.
* But WHAT is going on with the < comparison?
* Even comapring "lexicographically", how does a[0] < b[0]
* actually yield True ?
* Is Python really comparing a[0] < b[0] ?
* Because in my mental model that is the same as comparing: 1 < 1
* I kind of think 1 < 1 is supposed to be False, even if Python
* is comparing "1" < "1" (e.g. lexicographically).
$
To add to the last "*Question" above, comparing a[0] < b[0] lexicographically would be like comparing '1' < '1' which still should be false, yes?
tcomp.py:
import platform
def tupleInfo( a, b, msg ):
    # using labels instead of eval-style stuff to keep things simpler.
    print
    print '--- ' , msg , ' ---'
    print a, ' == ', b, ' : ', a == b
    print a, ' < ', b, ' : ', a < b
    print a, ' > ', b, ' : ', a > b
print 'tcomp.py'
print 'version=', platform.python_version()
# let's start with some empty tuples.
e1 = tuple()
e2 = tuple()
tupleInfo( tuple( ) , tuple( ) , 'empty tuples,intuitive, no surprises' )
tupleInfo( tuple( [ 1 ] ) , tuple( [ 1 ] ) , 'single-item tuples, equal values: intuitive, no surprises' )
tupleInfo( tuple( [ 1 ] ) , tuple( [ 2 ] ) , 'single-item diff: intuitive, no surprises' )
tupleInfo( tuple( [ 1, 2 ] ), tuple( [ 1, 2 ] ), 'double-item same: intuitive, no surprises' )
print '* Question: do a<b and a>b both yield False '
print '* because Python short circuits on'
print '* a[0] < b[0] (and correspondingly a[0] > b[0] )?'
print '* e.g. Python never bothers comparing second'
print '* elements: a[1] < b[1] (and, correspondinlgy, a[1] > b[1] ).'
tupleInfo( tuple( [ 1, 2 ] ), tuple( [ 1, 3 ] ), 'double-item 1st=same 2nd=diff: ??? What is with a<b ???' )
print '* Question: Here I get the == comparison, that yields False like I would expect.'
print '* But WHAT is going on with the < comparison?'
print '* Even comapring "lexicographically", how does a[0] < b[0]'
print '* actually yield True ?'
print '* Is Python really comparing a[0] < b[0] ?'
print '* Because in my mental model that is the same as comparing: 1 < 1'
print '* I kind of think 1 < 1 is supposed to be False, even if Python'
print '* is comparing "1" < "1" (e.g. lexicographically).'
The answer lies in the word 'lexicographically'. It means that Python compares tuples beginning from the first position. This order is used in vocabularies or lexicons: word a is smaller than word b if a appears in the vocabulary before b. So we can compare two words like 'anthrax' and 'antipodes', whose first three letters are equal: 'anthrax' appears in the vocabulary before 'antipodes', so the statement 'anthrax' < 'antipodes' is True.
This comparison can be represented like this:
def isSmaller(a, b): # returns true if a<b, false if a>=b
    for i in xrange(0, a.__len__()): # for every element of a, starting from left
        if b.__len__() <= i: # if a starts with b, but a is longer, eg. b='ab', a='ab...'
            return False
        if a[i] < b[i]: # if a starts like b, but there is difference on i-th position,
                        # eg. a='abb...', b='abc...',
            return True
        if a[i] > b[i]: # eg. a='abc...', b='abb...',
            return False
        if a[i] == b[i]: # if there is no difference, check the next position
            pass
    if a.__len__() < b.__len__(): # if b starts with a, but b is longer, eg. a='ac', b='ac...'
        return True
    else: # else, ie. when a is the same as b, eg. a='acdc', b='acdc'
        return False
print (1,2,3)<(1,2)
print (1,2,3)<(1,2,3)
print (1,2,3)<(1,3,2)
print isSmaller((1,2,3),(1,2))
print isSmaller((1,2,3),(1,2,3))
print isSmaller((1,2,3),(1,3,2))
Output:
False
False
True
False
False
True
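For readers on Python 3 (the code above is Python 2.7), the same rule can be written more compactly. This is only a sketch of the idea, not how CPython implements it internally; the name is_smaller is made up for illustration. The first pair of elements that differ decides the result, and if no pair differs the shorter sequence sorts first.

```python
def is_smaller(a, b):
    """Return True if sequence a sorts strictly before sequence b."""
    for x, y in zip(a, b):
        if x != y:
            # The first differing pair decides the whole comparison.
            return x < y
    # All shared positions are equal: the shorter sequence sorts first.
    return len(a) < len(b)


print(is_smaller((1, 2), (1, 3)))     # first elements tie, so 2 < 3 decides
print(is_smaller((1, 2, 3), (1, 2)))  # (1, 2) is a proper prefix of (1, 2, 3)
print(tuple("anthrax") < tuple("antipodes"))  # same rule, letter by letter
```

Note that the tie on the first elements in (1, 2) < (1, 3) does not produce False; it just passes the decision on to the next position.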

Simple function in Stata

EDIT: Thanks to Joe's advice, I will make my question more specific. Actually I need to code a function in Stata which takes variables A,B,C,D,... as inputs and a variable Y as output which can be evaluated with usual Stata functions/commands like "generate dummy=2*myfun(X) if ..."
The function itself contains numerical calculations. A pseudo Stata code will look like
myfun(X)
gen Y=0.5*X if X==1
replace Y=31-X if X==2
replace Y=X-2 if X==3
.... a long list
return(Y)
Notice that X can be a huge set of different Stata variables and the numerical calculations are rather long inside the function. That's why I would like to use a function. I guess that the native "program" command in Stata is not suitable for this type of problem because it cannot take variables as input/output.
(ANSWER TO ORIGINAL QUESTION)
I have never used SAS, but at a wild guess you want something like
foreach v in A B C D {
gen test`v' = 0.5 * (`v' == 1) + 0.6 * (`v' == 2) + 0.7 * (`v' == 3)
}
or
foreach v in A B C D {
gen test`v' = cond(`v' == 1, 0.5, cond(`v' == 2, 0.6, cond(`v' == 3, 0.7, .)))
}
But hang on; that middle line also looks like
gen test`v' = (4 + `v') / 10
(ANSWER TO COMPLETELY DIFFERENT REVISED QUESTION)
This can be done in various ways. As above you could have a loop
foreach v in A B C D {
gen test`v' = 0.5 * `v' if `v' == 1
replace test`v' = 31 - `v' if `v' == 2
replace test`v' = `v' - 2 if `v' == 3
}
The question says "I guess that the native "program" command in Stata is not suitable for this type of problem because it cannot take variables as input/output." That guess is completely incorrect. You could write a program to do this too. This example is schematic, not definitive. A real program would include more checks and error messages to match any incorrect input. For detailed advice, you really need to read the documentation. One answer on SO can't teach you all you need to know even to write simple Stata programs. In any case, the example is evidently frivolous and/or incomplete, so a complete working example would be pointless or impossible.
program myweirdexample
    version 13
    syntax varlist(numeric), Generate(namelist)
    local nold : word count `varlist'
    local nnew : word count `generate'
    if `nold' != `nnew' {
        di as err "`generate' does not match `varlist'"
        exit 198
    }
    local i = 1
    quietly foreach v of local varlist {
        local new : word `i' of `generate'
        gen `new' = 0.5 * `v' if `v' == 1
        replace `new' = 31 - `v' if `v' == 2
        replace `new' = `v' - 2 if `v' == 3
        local ++i
    }
end
Footnote on terminology: The question uses the term function more broadly than it is used in Stata. In Stata, commands and functions are distinct; "function" is not a synonym for command.
Second footnote: Check out recode. It may be what you need, but it is best for mapping integer codes to other integer codes.
Third footnote: An example of a needed check is that the argument of generate() should be variable names that are legal and new.

Regular Expressions with repeated characters

I need to write a regular expression that can detect a string that contains only the characters x,y, and z, but where the characters are different from their neighbors.
Here is an example
xyzxzyz = Pass
xyxyxyx = Pass
xxyzxz = Fail (repeated x)
zzzxxzz = Fail (adjacent characters are repeated)
I thought that this would work ((x|y|z)?)*, but it does not seem to work. Any suggestions?
EDIT
Please note, I am looking for an answer that does not allow for look ahead or look behind operations. The only operations allowed are alternation, concatenation, grouping, and closure
Usually for this type of question, if the regex is not simple enough to be derived directly, you can start from drawing a DFA and derive a regex from there.
You should be able to derive the following DFA. q1, q2, q3, q4 are end states, with q1 also being the start state. q5 is the failed/trap state.
There are several methods to find Regular Expression for a DFA. I am going to use Brzozowski Algebraic Method as explained in section 5 of this paper:
For each state qi, the equation Ri is a union of terms: for a transition a from qi to qj, the term is aRj. Basically, you look at all the outgoing edges from a state. If qi is a final state, λ is also one of the terms.
Let me quote the identities from the definition section of the paper, since they will come in handy later (λ is the empty string and ∅ is the empty set):
(ab)c = a(bc) = abc
λx = xλ = x
∅x = x∅ = ∅
∅ + x = x
λ + x* = x*
(λ + x)* = x*
Since q5 is a trap state, the formula will end up an infinite recursion, so you can drop it in the equations. It will end up as empty set and disappear if you include it in the equation anyway (explained in the appendix).
You will come up with:
R1 = xR2 + yR3 + zR4 + λ
R2 = yR3 + zR4 + λ
R3 = xR2 + zR4 + λ
R4 = xR2 + yR3 + λ
Solve the equation above with substitution and Arden's theorem, which states:
Given an equation of the form X = AX + B where λ ∉ A, the equation has the solution X = A*B.
You will get to the answer.
I don't have the time or confidence to derive the whole thing, but I will show the first few steps of the derivation.
Remove R4 by substitution, note that zλ becomes z due to the identity:
R1 = xR2 + yR3 + (zxR2 + zyR3 + z) + λ
R2 = yR3 + (zxR2 + zyR3 + z) + λ
R3 = xR2 + (zxR2 + zyR3 + z) + λ
Regroup them:
R1 = (x + zx)R2 + (y + zy)R3 + z + λ
R2 = zxR2 + (y + zy)R3 + z + λ
R3 = (x + zx)R2 + zyR3 + z + λ
Apply Arden's theorem to R3:
R3 = (zy)*((x + zx)R2 + z + λ)
= (zy)*(x + zx)R2 + (zy)*z + (zy)*
You can substitute R3 back to R2 and R1 and remove R3. I leave the rest as exercise. Continue ahead and you should reach the answer.
Appendix
We will explain why trap states can be discarded from the equations, since they will just disappear anyway. Let us use the state q5 in the DFA as an example here.
R5 = (x + y + z)R5
Use identity ∅ + x = x:
R5 = (x + y + z)R5 + ∅
Apply Arden's theorem to R5:
R5 = (x + y + z)*∅
Use identity ∅x = x∅ = ∅:
R5 = ∅
The identity ∅x = x∅ = ∅ will also take effect when R5 is substituted into other equations, causing the term with R5 to disappear.
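Since the DFA is only described in words here, a small simulation may help. The sketch below assumes the state meanings implied by the equations R1 to R4: q1 is the start state, q2, q3 and q4 mean "the last character read was x, y, z" respectively, and q5 is the trap state. The table layout and the name dfa_accepts are made up for illustration.

```python
# Assumed encoding of the DFA: q1 = start, q2/q3/q4 = last character
# was x/y/z, q5 = trap state reached after a repeated character.
TRANSITIONS = {
    "q1": {"x": "q2", "y": "q3", "z": "q4"},
    "q2": {"x": "q5", "y": "q3", "z": "q4"},
    "q3": {"x": "q2", "y": "q5", "z": "q4"},
    "q4": {"x": "q2", "y": "q3", "z": "q5"},
    "q5": {"x": "q5", "y": "q5", "z": "q5"},
}
ACCEPTING = {"q1", "q2", "q3", "q4"}


def dfa_accepts(s):
    """Run the DFA over s and report whether it ends in an accepting state."""
    state = "q1"
    for ch in s:
        if ch not in TRANSITIONS[state]:
            return False  # character outside the alphabet {x, y, z}
        state = TRANSITIONS[state][ch]
    return state in ACCEPTING


for s in ["xyzxzyz", "xyxyxyx", "xxyzxz", "zzzxxzz"]:
    print(s, "Pass" if dfa_accepts(s) else "Fail")
```

Each row of the table corresponds to one equation: the outgoing edges of q2, for example, give R2 = yR3 + zR4 + λ once the edge into the trap state is dropped.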
This should do what you want:
^(?!.*(.)\1)[xyz]*$
(Obviously, only on engines with lookahead)
The content itself is handled by the second part: [xyz]* (any number of x, y, or z characters). The anchors ^...$ are here to say that it has to be the entirety of the string. And the special condition (no adjacent pairs) is handled by a negative lookahead (?!.*(.)\1), which says that there must not be a character followed by the same character anywhere in the string.
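As a quick check, here is the lookahead pattern run against the four examples from the question, using Python's re module (which supports lookahead and backreferences). The constant name NO_REPEATS is just a label chosen for this sketch.

```python
import re

# Negative lookahead rejects any string containing a doubled character;
# [xyz]* then restricts the alphabet.
NO_REPEATS = re.compile(r"^(?!.*(.)\1)[xyz]*$")

for s in ["xyzxzyz", "xyxyxyx", "xxyzxz", "zzzxxzz"]:
    print(s, "Pass" if NO_REPEATS.match(s) else "Fail")
```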
I've had an idea while I was walking today and put it on regex and I have yet to find a pattern that it doesn't match correctly. So here is the regex :
^((y|z)|((yz)*y?|(zy)*z?))?(xy|xz|(xyz(yz|yx|yxz)*y?)|(xzy(zy|zx|zxy)*z?))*x?$
Here is a fiddle to go with it!
If you find a pattern mismatch tell me I'll try to modify it! I know it's a bit late but I was really bothered by the fact that I couldn't solve it.
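One way to look for mismatches is to brute-force every string over x, y, z up to some length and compare the pattern's verdict with the plain rule "no two neighbors are equal". The helper names below (is_valid, find_mismatches) are made up for this sketch, and the length bound of 6 is arbitrary.

```python
import itertools
import re

PATTERN = re.compile(
    r"^((y|z)|((yz)*y?|(zy)*z?))?"
    r"(xy|xz|(xyz(yz|yx|yxz)*y?)|(xzy(zy|zx|zxy)*z?))*x?$"
)


def is_valid(s):
    """Reference rule: no character equals its right-hand neighbor."""
    return all(a != b for a, b in zip(s, s[1:]))


def find_mismatches(pattern, max_len):
    """List every string over {x, y, z} where pattern and rule disagree."""
    mismatches = []
    for n in range(max_len + 1):
        for chars in itertools.product("xyz", repeat=n):
            s = "".join(chars)
            if (pattern.match(s) is not None) != is_valid(s):
                mismatches.append(s)
    return mismatches


print(find_mismatches(PATTERN, 6))
```

Tracing by hand, the pattern appears to accept xyzyxx: the xyz(yz|yx|yxz)* branch can end in x, and the trailing x? then adds a second x. The search above should surface that case along with any others.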
I understand this is quite an old question and already has an accepted solution, but I am posting one more quick solution for the case where you want to check whether a string contains consecutive repeated characters.
Use the regular expression below:
String regex = "\\b\\w*(\\w)\\1\\1\\w*";
Here are some cases and the results the above expression returns:
Case 1: abcdddd or 123444
Result: Matched
Case 2: abcd or 1234
Result: Unmatched
Case 3: &*%$$$ (Special characters)
Result: Unmatched
Hope this will be helpful...
Thanks:)

Regular expression puzzle

This is not homework, but an old exam question. I am curious to see the answer.
We are given an alphabet S={0,1,2,3,4,5,6,7,8,9,+}. Define the language L as the set of strings w from this alphabet such that w is in L if:
a) w is a number such as 42 or w is the (finite) sum of numbers such as 34 + 16 or 34 + 2 + 10
and
b) The number represented by w is divisible by 3.
Write a regular expression (and a DFA) for L.
This should work:
^(?:0|(?:(?:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\
+)*[369]0*)*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:
\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[
258](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0
\+)*[147])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+)(?:\+(?:0|(?:(?
:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:\+?(?:0\+)*
[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[258](?:0*(?
:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])*
(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)
*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+))*$
It works by having three states representing the sum of the digits so far modulo 3. It disallows leading zeros on numbers, and plus signs at the start and end of the string, as well as two consecutive plus signs.
Generation of regular expression and test bed:
a = r'0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*'
b = r'a[147]'
c = r'a[258]'
r1 = '[369]|[147](?:bc)*(?:c|bb)|[258](?:cb)*(?:b|cc)'
r2 = '(?:0|(?:(?:' + r1 + ')0*)+)'
r3 = '^' + r2 + r'(?:\+' + r2 + ')*$'
r = r3.replace('b', b).replace('c', c).replace('a', a)
print r
# Test on 10000 examples.
import random, re
random.seed(1)
r = re.compile(r)
for _ in range(10000):
    x = ''.join(random.choice('0123456789+') for j in range(random.randint(1,50)))
    if re.search(r'(?:\+|^)(?:\+|0[0-9])|\+$', x):
        valid = False
    else:
        valid = eval(x) % 3 == 0
    result = re.match(r, x) is not None
    if result != valid:
        print 'Failed for ' + x
Note that my memory of DFA syntax is woefully out of date, so my answer is undoubtedly a little broken. Hopefully this gives you a general idea. I've chosen to ignore + completely. As AmirW states, abc+def and abcdef are the same for divisibility purposes.
Accept state is C.
A=1,4,7,BB,AC,CA
B=2,5,8,AA,BC,CB
C=0,3,6,9,AB,BA,CC
Notice that the above language uses all 9 possible ABC pairings. It will always end at either A,B,or C, and the fact that every variable use is paired means that each iteration of processing will shorten the string of variables.
Example:
1490 = AACC = BCC = BC = B (Fail)
1491 = AACA = BCA = BA = C (Success)
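The pairing rules above amount to adding digits modulo 3, with A, B and C standing for remainders 1, 2 and 0. A minimal Python 3 sketch of that reading (the function name final_state is made up here, and plus signs are assumed to have been stripped already):

```python
def final_state(digits):
    """Return 'A', 'B' or 'C' for a string of decimal digits."""
    remainder = 0  # start in C: nothing read yet, sum is 0
    for ch in digits:
        remainder = (remainder + int(ch)) % 3
    # Remainder 0 is C (accept), 1 is A, 2 is B.
    return "CAB"[remainder]


print(final_state("1490"))  # B, so 1490 is rejected
print(final_state("1491"))  # C, so 1491 is accepted
```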
Not a full solution, just an idea:
(B) alone: The "plus" signs don't matter here. abc + def is the same as abcdef for the sake of divisibility by 3. For the latter case, there is a regexp here: http://blog.vkistudios.com/index.cfm/2008/12/30/Regular-Expression-to-determine-if-a-base-10-number-is-divisible-by-3
to combine this with requirement (A), we can take the solution of (B) and modify it:
First read character must be in 0..9 (not a plus)
Input must not end with a plus, so: Duplicate each state (will use S for the original state and S' for the duplicate to distinguish between them). If we're in state S and we read a plus we'll move to S'.
When reading a number we'll go to the new state as if we were in S. S' states cannot accept (another) plus.
Also, S' is not "accept state" even if S is. (because input must not end with a plus).
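A rough Python 3 sketch of this idea follows. It tracks the digit sum modulo 3 together with a flag that plays the role of the S' states (set at the start and right after a plus, when only a digit may follow). The name plus_dfa_accepts is hypothetical, and, like the outline above, the sketch does not reject numbers with leading zeros.

```python
def plus_dfa_accepts(w):
    """Accept sums of numbers whose total is divisible by 3."""
    remainder = 0
    need_digit = True  # True at the start and right after '+' (the S' states)
    for ch in w:
        if ch == "+":
            if need_digit:
                return False  # leading plus or two pluses in a row
            need_digit = True
        elif ch in "0123456789":
            remainder = (remainder + int(ch)) % 3
            need_digit = False
        else:
            return False  # character outside the alphabet
    # Must not end on a plus (or be empty), and the digit sum must be 0 mod 3.
    return not need_digit and remainder == 0


for w in ["42", "34+16", "12+3", "12+", "+12", "1++2"]:
    print(w, plus_dfa_accepts(w))
```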