Stata: Comparing string variables

I have two string variables that differ in exactly one character in each observation. I need to get the position of that differing character.
I tried the indexnot() function, but it gives the wrong result because both strings are made up of the same set of characters.
Here is an illustrative example; the variable Position is the one I am trying to create:
+--------------+--------------+-----------+
| String 1 | String 2 | Position |
+--------------+--------------+-----------+
| 000002002000 | 000000002000 | 6 |
| 000002102000 | 000002002000 | 7 |
| 000002112000 | 000002102000 | 8 |
| 000002112020 | 000002112000 | 11 |
| 000002112120 | 000002112020 | 10 |
+--------------+--------------+-----------+

gen Position = .
quietly forval j = 1/12 {
    replace Position = `j' if substr(String1, `j', 1) != substr(String2, `j', 1) & missing(Position)
}
Commentary is perhaps redundant here, but will harm no-one.
In the absence of a built-in function to do this, you need to write some code using existing commands and functions. Initialise Position to missing (zero would do fine as an alternative, provided the condition in the loop is changed to match). Then loop over the characters, here 1 to 12 because the example shows 12-character strings. We record the position of the first difference in characters. Note how the condition missing(Position) (Position == . if you like) restricts changes to the first difference met.
Stata loops automatically over all the observations here, so the only loop needed is over string positions.
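The same first-difference logic can be cross-checked outside Stata. The following is a minimal Python sketch, not part of the original answer: the function name first_diff_position is hypothetical, positions are 1-based to match Stata's substr(), and both strings are assumed to have equal length.

```python
def first_diff_position(s1, s2):
    """Return the 1-based position of the first differing character, or None."""
    for j, (a, b) in enumerate(zip(s1, s2), start=1):
        if a != b:
            return j
    return None  # no difference found (Position would stay missing in Stata)


# The five example pairs from the question.
pairs = [
    ("000002002000", "000000002000"),
    ("000002102000", "000002002000"),
    ("000002112000", "000002102000"),
    ("000002112020", "000002112000"),
    ("000002112120", "000002112020"),
]
positions = [first_diff_position(a, b) for a, b in pairs]
print(positions)  # [6, 7, 8, 11, 10]
```

As in the Stata loop, only the first difference is reported, so the result is the same whether the strings differ in one place or several.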

Related

Drop a specific character from string responses

I have a string variable and some of the responses have an extra character at the beginning. The character in question is a constant character in all cases. The variable is ICD-code. For example, instead of G23 I have DG23.
Is there a way in Stata to remove the excess D character?
My data looks like this:
+----+-------+
| ID | diag  |
+----+-------+
| 1  | DZ456 |
| 2  | DG32  |
| 3  | DY258 |
| 4  | DD35  |
| 5  | DS321 |
| 6  | DD21  |
| 7  | DA123 |
+----+-------+
For basic information in this territory, consult help string functions.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte d str5 diag
1 "DZ456"
2 "DG32"
3 "DY258"
4 "DD35"
5 "DS321"
6 "DD21"
7 "DA123"
end
replace diag = substr(diag, 2, .) if substr(diag, 1, 1) == "D"
list
+----------+
| d diag |
|----------|
1. | 1 Z456 |
2. | 2 G32 |
3. | 3 Y258 |
4. | 4 D35 |
5. | 5 S321 |
|----------|
6. | 6 D21 |
7. | 7 A123 |
+----------+
An alternative to string functions is to use regular expressions; see help regex.
replace diag = regexs(1) if regexm(diag, "^D(.*)")
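For readers who want to check the pattern outside Stata, here is a small Python sketch of the same idea. It is an illustration only: the function name strip_leading_d is hypothetical, and it assumes, as in the question, that at most one extra leading D needs to go.

```python
import re


def strip_leading_d(code):
    """Drop a single leading 'D', if present; leave other codes unchanged."""
    return re.sub(r"^D", "", code, count=1)


codes = ["DZ456", "DG32", "DY258", "DD35", "DS321", "DD21", "DA123"]
cleaned = [strip_leading_d(c) for c in codes]
print(cleaned)  # ['Z456', 'G32', 'Y258', 'D35', 'S321', 'D21', 'A123']
```

Note that a code such as DD35 keeps its second D, which matches the Stata output listed above.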

How to index match a condition set in a cell

I am trying to avoid a formula with multiple nested IFs by using INDEX/MATCH against a table instead. However, what I need to match is an actual condition together with a string.
Lookup table:
+---+------------------+-------------------+-------+
| | A | B | C |
+---+------------------+-------------------+-------+
| 1 | Current to Prior | Portfolio Comment | Error |
| 2 | =0 | "" | 1 |
| 3 | <>0 | "" | -1 |
| 4 | >0 | OK – Losses | 0 |
| 5 | <0 | OK – Losses | 1 |
| 6 | <0 | OK – New Sales | 0 |
| 7 | >0 | OK – New Sales | 1 |
+---+------------------+-------------------+-------+
Column A: Lookup Condition
Column B: Lookup string
Column C: Return value
Data example with correct hard coded output (column C):
+---+------------------+-------------------+-------+
| | A | B | C |
+---+------------------+-------------------+-------+
| 1 | Current to Prior | Portfolio comment | Error |
| 2 | 0 | | 1 |
| 3 | -100 | OK – Losses | 1 |
| 4 | 50 | | -1 |
| 5 | 200 | OK – Losses | 0 |
| 6 | 0 | | 1 |
| 7 | -400 | OK – New Sales | 0 |
| 8 | 0 | | 1 |
+---+------------------+-------------------+-------+
Column A: Data value
Column B: Data string
Column C: Output formula
I need a formula that matches the data value with the lookup condition, the data string with the lookup string and outputs the return value.
I know you weren't necessarily asking for a VBA solution, but I (and many others) prefer using UDFs because, in my opinion, they make formulas easier to read and cleaner. You can also do without the helper cells.
We start the UDF with a Select Case statement. The cases could be keyed on either the numerical value or the string; I decided to go with the string.
Within each case, the numerical value passed in the lngCondition parameter is compared against zero, and the result of that comparison determines the value the function returns.
Since your table has no rule for a non-empty comment with lngCondition = 0, I made the function return the worksheet error #VALUE!, just as you'd expect from any other Excel formula. This is the reason the UDF has a Variant return type.
Public Function ReturnErrorCode(lngCondition As Long, strComment As String) As Variant
    Select Case strComment
        Case ""
            If lngCondition = 0 Then
                ReturnErrorCode = 1
            Else
                ReturnErrorCode = -1
            End If
        Case "OK - Losses"
            If lngCondition > 0 Then
                ReturnErrorCode = 0
            ElseIf lngCondition < 0 Then
                ReturnErrorCode = 1
            Else
                ' Your conditions don't specify that 'OK - Losses'
                ' can have a 0 value
                ReturnErrorCode = CVErr(xlErrValue)
            End If
        Case "OK - New Sales"
            If lngCondition < 0 Then
                ReturnErrorCode = 0
            ElseIf lngCondition > 0 Then
                ReturnErrorCode = 1
            Else
                ' Your conditions don't specify that 'OK - New Sales'
                ' can have a 0 value
                ReturnErrorCode = CVErr(xlErrValue)
            End If
        Case Else
            ReturnErrorCode = CVErr(xlErrValue)
    End Select
End Function
You would then use this formula in the worksheet like so (row 2 is the first data row, since row 1 holds the headers):
=ReturnErrorCode(A2, B2)
Great! But I have no knowledge of VBA and don't know how to add a UDF.
First, you need to open the VBA Editor. You can do this by simultaneously pressing Alt + F11.
Next, you need to create a standard code module. In the VBE, click Insert then select Module (NOT Class module!).
Then copy the code above, and paste it into the new code module you just created.
Since you have now added VBA code to your workbook, you need to save it as a macro-enabled workbook (.xlsm) the next time you save.
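The asker's original idea, matching against a table of conditions rather than hard-coding branches, can also be sketched in a table-driven way. The Python below is only an illustration of that logic, not a replacement for the UDF: the names OPS, LOOKUP and return_error_code are hypothetical, every condition is assumed to compare against zero as in the lookup table, and the comment strings are assumed to use a plain hyphen as in the VBA code.

```python
import operator

# Map the condition symbols used in the lookup table to comparisons with 0.
OPS = {"=": operator.eq, "<>": operator.ne, ">": operator.gt, "<": operator.lt}

# (condition, comment, return value), in the order of the lookup table.
LOOKUP = [
    ("=", "", 1),
    ("<>", "", -1),
    (">", "OK - Losses", 0),
    ("<", "OK - Losses", 1),
    ("<", "OK - New Sales", 0),
    (">", "OK - New Sales", 1),
]


def return_error_code(value, comment):
    """Return the code of the first rule matching both comment and condition."""
    for op, text, result in LOOKUP:
        if text == comment and OPS[op](value, 0):
            return result
    return None  # no rule applies; the VBA version returns #VALUE! here


print(return_error_code(-100, "OK - Losses"))  # 1
print(return_error_code(50, ""))               # -1
```

Keeping the rules in a list means a new condition is one extra row rather than another branch, which is the property the question was after.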

How to find a cell based on row and col criteria?

I have a table like this:
| a | b | c |
x | 1 | 8 | 6 |
y | 5 | 4 | 2 |
z | 7 | 3 | 5 |
What I want to do is find a value based on the row and column titles. For example, given c and y, it should return 2. Which function(s) should I use to do this in OpenOffice Calc?
later:
I tried =INDEX(B38:K67;MATCH('c';B37:K37;0);MATCH('y';A38:A67;0)), but it writes invalid argument.
It turned out I wrote the arguments of INDEX in the wrong order. The =INDEX(B38:K67;MATCH('y';A38:A67;0);MATCH('c';B37:K37;0)) formula works properly. INDEX expects the row position as its second argument and the column position as its third, not the other way round.
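The row-first, column-second order can be illustrated with a short Python sketch of the same two-way lookup. This is only an analogy to INDEX/MATCH, and the names index_match, table, row_labels and col_labels are hypothetical.

```python
def index_match(table, row_labels, col_labels, row_key, col_key):
    """Look up a cell by row title and column title (row first, then column)."""
    r = row_labels.index(row_key)   # plays the role of MATCH on the row titles
    c = col_labels.index(col_key)   # plays the role of MATCH on the column titles
    return table[r][c]              # plays the role of INDEX(range; row; column)


col_labels = ["a", "b", "c"]
row_labels = ["x", "y", "z"]
table = [
    [1, 8, 6],
    [5, 4, 2],
    [7, 3, 5],
]
print(index_match(table, row_labels, col_labels, "y", "c"))  # 2
```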

How to detect specific subwords in text

I have a column as a string with no spaces:
clear
input str100 var
"ihaveanewspaper"
"watchingthenewsonthetv"
"watchthenewsandreadthenewspaper"
end
I am using the following command:
gen news = regexm(var, "(news)")
This outputs 1 1 1 because it finds that the 3 rows in the column var contain the word news.
I'm trying to alter the regular expression "(news)" to create two columns. One for news and one for newspaper. regexm(var, "(newspaper)") makes sure that the row contains a newspaper, but I need a command to make sure characters after news are not "paper" as I'm trying to quantify the two.
EDIT:
Is there a way to count the third entry as 1, because it has a news occurrence without however being a newspaper?
You can quantify as follows without a regular expression:
clear
input str100 var
"ihaveanewspaper"
"watchingthenewsonthetv"
"watchthenewsandreadthenewspaper"
"fdgdnews"
"fgogodigjhoigjnewspaper"
"fgeogeionnewsfgdgfpaper"
"45pap9358newsfjfgni"
end
generate news = strmatch(var, "*news*") & !strmatch(var, "*newspaper*")
list, separator(0)
+----------------------------------------+
| var news |
|----------------------------------------|
1. | ihaveanewspaper 0 |
2. | watchingthenewsonthetv 1 |
3. | watchthenewsandreadthenewspaper 0 |
4. | fdgdnews 1 |
5. | fgogodigjhoigjnewspaper 0 |
6. | fgeogeionnewsfgdgfpaper 1 |
7. | 45pap9358newsfjfgni 1 |
+----------------------------------------+
count if news
4
count if !news
3
EDIT:
One way to do this is to eliminate all instances of the word newspaper and repeat the process:
generate var2 = subinstr(var, "newspaper", "", .)
replace news = 1 if strmatch(var2, "*news*")
list, separator(0)
+------------------------------------------------------------------+
| var news var2 |
|------------------------------------------------------------------|
1. | ihaveanewspaper 0 ihavea |
2. | watchingthenewsonthetv 1 watchingthenewsonthetv |
3. | watchthenewsandreadthenewspaper 1 watchthenewsandreadthe |
4. | fdgdnews 1 fdgdnews |
5. | fgogodigjhoigjnewspaper 0 fgogodigjhoigj |
6. | fgeogeionnewsfgdgfpaper 1 fgeogeionnewsfgdgfpaper |
7. | 45pap9358newsfjfgni 1 45pap9358newsfjfgni |
+------------------------------------------------------------------+
count if news
5
count if !news
2
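The remove-then-match idea from the second approach can be checked with a few lines of Python. This is a sketch of the logic only; the function name has_standalone_news is hypothetical.

```python
def has_standalone_news(text):
    """True if 'news' still occurs after every 'newspaper' has been removed."""
    return "news" in text.replace("newspaper", "")


values = [
    "ihaveanewspaper",
    "watchingthenewsonthetv",
    "watchthenewsandreadthenewspaper",
    "fdgdnews",
    "fgogodigjhoigjnewspaper",
    "fgeogeionnewsfgdgfpaper",
    "45pap9358newsfjfgni",
]
flags = [has_standalone_news(v) for v in values]
print(sum(flags), len(flags) - sum(flags))  # 5 2
```

The third string now counts as a news occurrence because one "news" survives once "newspaper" has been stripped out.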

Stata: Using egen, anycount() when values vary for each observation

Each observation in my data represents a player who follows some random pattern. Variables move1 onwards record which player was active on each move. I need to count the number of times each player was active.
The data look as follows (with _count being the variable that I would like to generate). The number of moves can also differ between simulations.
+------------+------------+-------+-------+-------+-------+-------+-------+--------+
| simulation | playerlist | move1 | move2 | move3 | move4 | move5 | move6 | _count |
+------------+------------+-------+-------+-------+-------+-------+-------+--------+
| 1 | 1 | 1 | 1 | 1 | 2 | . | . | 3 |
| 1 | 2 | 2 | 2 | 4 | 4 | . | . | 2 |
| 2 | 3 | 1 | 2 | 3 | 3 | 3 | 3 | 4 |
| 2 | 4 | 4 | 1 | 2 | 3 | 3 | 3 | 1 |
+------------+------------+-------+-------+-------+-------+-------+-------+--------+
egen combined with anycount() is not applicable in this case because the argument for the values() option is not a constant integer.
I have made an attempt to cycle through each observation and use egen rowwise (see below), but it leaves _count as missing (as initialised) and is not very efficient (I have 50,000 observations). Is there a way to do this in Stata?
gen _count =.
quietly forval i = 1/`=_N' {
    egen temp = anycount(move*), values( `=`playerlist'[`i']')
    replace _count = temp
    drop temp
}
You can easily cut out the loop over observations. In addition, egen is only to be used for convenience, never speed.
gen _count = 0
quietly forval j = 1/6 {
    replace _count = _count + (move`j' == playerlist)
}
or
gen _count = move1 == playerlist
quietly forval j = 2/6 {
    replace _count = _count + (move`j' == playerlist)
}
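The rowwise comparison is easy to verify by hand against the example table. The Python sketch below is not part of the original answer: the name count_active is hypothetical, and None stands in for Stata's missing value, which likewise never equals a player number.

```python
def count_active(player, moves):
    """Count the moves on which this player was the active one."""
    return sum(1 for m in moves if m == player)


# (playerlist, move1..move6) for the four example observations.
rows = [
    (1, [1, 1, 1, 2, None, None]),
    (2, [2, 2, 4, 4, None, None]),
    (3, [1, 2, 3, 3, 3, 3]),
    (4, [4, 1, 2, 3, 3, 3]),
]
counts = [count_active(player, moves) for player, moves in rows]
print(counts)  # [3, 2, 4, 1]
```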
Even if you had been determined to use egen, the loop need only be over the distinct values of playerlist, not all the observations. Say the maximum is 42:
gen _count = 0
quietly forval k = 1/42 {
    egen temp = anycount(move*), value(`k')
    replace _count = _count + temp
    drop temp
}
But that's still a lousy method for your problem. (I wrote the original of anycount() so I can say why it was written.)
See also http://www.stata-journal.com/sjpdf.html?articlenum=pr0046 for a review of working rowwise.
P.S. Your code contains bugs.
You replace your count variable in all observations by the last value calculated for the count in the last observation.
Values are compared with a local macro playerlist. You presumably have no local macro of that name, so the macro is evaluated as empty. The result is that you end by comparing each value of your move* variables with the observation numbers. You meant to use the variable name playerlist, but the single quotation marks force the macro interpretation.
For the record, this fixes both bugs:
gen _count = .
quietly forval i = 1/`=_N' {
    egen temp = anycount(move*), values(`= playerlist[`i']')
    replace _count = temp in `i'
    drop temp
}