Stata: code binary variable conditional on key words in string variable - stata

Is there any way to code a binary variable dependent on keywords being present in a given string variable? Simple example:
I have a string variable that describes various meals and a dummy variable that denotes if a given meal is breakfast or not. Is there any way to code
breakfast = 1 if meal== [then something saying contains eggs, bacon, etc.]
This is a silly example, but I am more interested in identifying a shortcut to coding binary variables, based on information found in string data.

The inbuilt strpos() will yield a positive value if a string is found inside another. Building on that
gen breakfast = strpos(meal, "bacon") | strpos(meal, "eggs")
and so forth. In practice, working with a string made lower case will often help, or indeed be essential. Also, if you have a long list, you may prefer
gen breakfast = 0
quietly foreach thing in bacon eggs cereal "orange juice" {
    replace breakfast = breakfast | strpos(lower(meal), `"`thing'"')
}
The principle here is using | (or) as a logical operator, yielding 1 (true) if any argument is non-zero. Note that lower() is included to compare with a lower case version of the original.
This technique is naturally not robust to spelling mistakes or small variations in wording.

You can use the incss function of the egenmore package for this.
ssc install egenmore
egen bacon = incss(meal), sub(bacon) insensitive
This gives you a dummy equal to one if for a given observation the string variable "meal" contains the word bacon. It is zero otherwise. The option insensitive tells Stata to not consider case sensitivity (otherwise Bacon is different from bacon). As far as I know you can only search for one sub-string at a time but you can easily write a loop for this:
foreach word in bacon eggs cheese {
    egen `word' = incss(meal), sub(`word') insensitive
}


How to remove repeated words or phrases within the same string

I am working with a string variable response in Stata. This variable stores complete sentences, and many of these sentences have repeated phrases.
For example:
how do you know how do you know what it is?
it was during the during the past thirty days
well well I would hope I would hope that they're doing that
I want to clean these strings by removing all repeated phrases.
In other words, I want to transform this sentence:
how do you know how do you know what it is?
to the one below:
how do you know what it is?
So far, I have tried to fix each case individually, but this is incredibly time-consuming as there are thousands of repeated words/phrases.
I would like to run code that can identify when a phrase is repeated within the same observation / string, and then remove one instance of that phrase (or word).
I imagine regular expressions would help, but I cannot figure out much more than this.
The following works for me:
clear
input str80 string
"Pearly Spencer how do you know how do you know what it is?"
"it was during the during the past thirty days"
"well well I would hope I would hope that they're doing that"
"well well they're doing that I would hope I would hope "
"well well I would hope I would hope that they're doing that but but they don't"
end
clonevar wanted = string
local stop = 0
while `stop' == 0 {
    generate dup = ustrregexs(2) if ustrregexm(wanted, "(\W|^)(.+)\s\2")
    replace wanted = subinstr(wanted, dup, "", 1)
    capture assert dup == ""
    if _rc == 0 local stop = 1
    else drop dup
}
replace wanted = strtrim(stritrim(wanted))
list wanted
+----------------------------------------------------------+
| wanted |
|----------------------------------------------------------|
1. | Pearly Spencer how do you know what it is? |
2. | it was during the past thirty days |
3. | well I would hope that they're doing that |
4. | well they're doing that I would hope |
5. | well I would hope that they're doing that but they don't |
+----------------------------------------------------------+
The above solution uses a regular expression to first identify a repeated word or phrase. It then removes the first occurrence of that phrase from the string by substituting an empty string for it.
Because this particular regular expression does not find all sets in one pass (for example in the last observation there are three sets - well, I would hope and but), the process is repeated using a while loop until no repeated elements remain in the string.
In the final step, all unnecessary spaces are deleted to bring the string back to shape.
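For readers who want to try the same idea outside Stata, here is a minimal Python sketch of the loop described above. It reuses the answer's regular expression unchanged; the function name `remove_repeats` is made up for illustration, and Python's `re` module stands in for Stata's `ustrregexm()`, so treat it as an approximation rather than a translation of the original code.

```python
import re

# Same pattern as in the Stata answer: group 2 is a word or phrase that is
# immediately repeated after a single whitespace character.
REPEAT = re.compile(r"(\W|^)(.+)\s\2")

def remove_repeats(text):
    """Delete one copy of each repeated phrase until no repeats remain."""
    while True:
        match = REPEAT.search(text)
        if match is None:
            break
        # Like subinstr(wanted, dup, "", 1): drop the first occurrence only.
        text = text.replace(match.group(2), "", 1)
    # Collapse the leftover runs of spaces and trim both ends.
    return " ".join(text.split())

print(remove_repeats("it was during the during the past thirty days"))
```

Each pass removes at least one character, so the loop always terminates. Like the original, it only catches repeats that directly follow each other.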

How do I count emoji and symbols in a cell?

What formula can I use to get a count of emoji and characters in a single cell?
For example, In cells, A1,A2 and A3:
🙌🙌🙌
🤜✋️👈🤜🤜
??👊👊👊
Total Count of characters in each cell(Desired Output):
3
5
5
For the given emojis, this will work well:
=LEN(REGEXREPLACE(A1,".","."))
MID/LEN considers each emoji as 2 separate characters.
REGEX will consider them as one.
But even REGEX will fail with a complex emoji like this:
👨‍👩‍👧‍👦
This contains a literal man emoji 👨, a woman emoji 👩, a girl emoji 👧 and a boy emoji 👦, all joined by a zero-width joiner. You could even swap the boy for another girl with this formula:
=SUBSTITUTE("👨‍👩‍👧‍👦","👦","👧")
It'll become like this:
👨‍👩‍👧‍👧
=COUNTA(FILTER(
SPLIT(REGEXREPLACE(A1,"(.)","#$1"),"#"),
SPLIT(REGEXREPLACE(A1,"(.)","#$1"),"#")<>""
))
Based on the answer by @I'-'I
Some emojis consist of multiple emojis joined by CHAR(8205):
👨‍👩‍👧‍👦‍👦‍👆
The result differs depending on the browser you use.
I wonder, how do we count them?
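One way to reason about the joined case is to count code points while skipping the joiner and anything attached by it. The Python sketch below does that outside of Sheets; `count_emoji` is a hypothetical helper, and the rules it applies (zero-width joiner, variation selectors, skin-tone modifiers) are a deliberate simplification, not full Unicode grapheme segmentation, so flags and keycap sequences would still be miscounted.

```python
ZWJ = "\u200d"                          # zero-width joiner, CHAR(8205)
VARIATION_SELECTORS = {"\ufe0e", "\ufe0f"}
SKIN_TONES = range(0x1F3FB, 0x1F400)    # U+1F3FB..U+1F3FF modifiers

def count_emoji(text):
    """Count visible symbols, treating ZWJ-joined sequences as one."""
    count = 0
    joined = False  # True when the previous code point was a ZWJ
    for ch in text:
        if ch == ZWJ:
            joined = True
            continue
        if ch in VARIATION_SELECTORS or ord(ch) in SKIN_TONES:
            continue  # modifies the previous symbol, not a new one
        if not joined or count == 0:
            count += 1
        joined = False
    return count

family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
print(count_emoji("\U0001F64C" * 3))  # three raised-hands emoji
print(count_emoji(family))            # one ZWJ-joined family emoji
```

Whether a joined sequence is drawn as one glyph or several still depends on the font and browser; the sketch only counts what the sequence is meant to represent.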

Regex for words that don't differ by only one letter

I want to create series of puzzle games where you change one letter in a word to create a new word, with the aim of reaching a given target word. For example, to change "this" to "that":
this
thin
than
that
What I want to do is create a regex which will scan a list of words and choose all those that do not match the current word by all but one letter. For example, if my starting word is "pale" and my list of words is...
pale
male
sale
tale
pile
pole
pace
page
pane
pave
palm
peal
leap
play
help
pack
... I want all the words from "peal" to "pack" to be selected. This means that I can delete them from my list, leaving only the words that could be the next match. (It's OK for "pale" itself to be unselected.)
I can do this in parts:
^.(?!ale).{3}\n selects words not like "*ale"
^.(?<!p).{3}\n|^.{2}(?!le).{2}\n selects words not like "p*le"
^.{2}(?<!pa).{2}\n|^.{3}(?!e).\n selects words not like "pa*e"
^.{3}(?<!pal).\n selects words not like "pal*".
However, when I put them together...
^.(?!ale).{3}\n|^.(?<!p).{3}\n|^.{2}(?!le).{2}\n|^.{2}(?<!pa).{2}\n|^.{3}(?!e).\n|^.{3}(?<!pal).\n
... everything but "pale" is matched.
I need some way to create an AND relationship between the different regexes, or (more likely) a completely different approach.
You can use the Python regex module that allows fuzzy matching:
>>> import regex
>>> regex.findall(r'(?:pale){s<=1}', "male sale tale pile pole pace page pane pave palm peal leap play help pack")
['male', 'sale', 'tale', 'pile', 'pole', 'pace', 'page', 'pane', 'pave', 'palm']
Here {s<=1} means that zero or one substitution still counts as a match.
Or consider the TRE library and the command line agrep which supports a similar syntax.
Given:
$ echo $s
male sale tale pile pole pace page pane pave palm peal leap play help pack
You can filter to a list of a single substitution:
$ echo $s | tr ' ' '\n' | agrep '(?:pale){ 1s <2 }'
male
sale
tale
pile
pole
pace
page
pane
pave
palm
Here's a solution that uses cool Python tricks and no regex:
def almost_matches(word1, word2):
    # True when exactly three of the four letters agree position by position
    return sum(map(str.__eq__, word1, word2)) == 3

for word in "male sale tale pile pole pace page pane pave palm peal leap play help pack".split():
    print(almost_matches("pale", word))
A completely different approach: Levenshtein distance
...the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
PHP example:
$words = array(
"pale",
"male",
"sale",
"tale",
"pile",
"pole",
"pace",
"page",
"pane",
"pave",
"palm",
"peal",
"leap",
"play",
"help",
"pack"
);
foreach($words AS $word)
    if(levenshtein("pale", $word) > 1)
        echo $word."\n";
This assumes the word on the first line is the keyword. Just a brute force parallel letter-match and count gets the job done:
awk 'BEGIN{FS=""}
NR==1{n=NF;for(i=1;i<=n;++i)c[i]=$i}
NR>1{j=0;for(i=1;i<=n;++i)j+=c[i]==$i;if(j<n-1)print}'
A regexp general solution would need to be a 2-stepper I think -- generate the regexp in first step (from the keyword), run the regexp against the file in the second step.
By the way, the way to do an "and" of regexps is to chain negative lookaheads one after another (and the lookaheads don't need to be as complicated as the ones you had above, I think):
^(?!.ale)(?!p.le)(?!pa.e)(?!pal.)
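The two-step idea (generate the regexp from the keyword, then run it against the list) can be sketched in Python as follows. The function name `not_one_letter_away` is hypothetical, and the sketch assumes every word has the same length as the keyword, as in the question's list.

```python
import re

def not_one_letter_away(keyword):
    """Build a regex for same-length words that differ from keyword
    in more than one position."""
    # One negative lookahead per position, with that letter wildcarded.
    lookaheads = "".join(
        "(?!{}.{}$)".format(re.escape(keyword[:i]), re.escape(keyword[i + 1:]))
        for i in range(len(keyword))
    )
    return re.compile("^" + lookaheads + ".{" + str(len(keyword)) + "}$")

words = ("pale male sale tale pile pole pace page pane pave palm "
         "peal leap play help pack").split()
pattern = not_one_letter_away("pale")

# Words selected for deletion: they cannot be the next step in the puzzle.
to_delete = [w for w in words if pattern.match(w)]
print(to_delete)
```

For "pale" the generated pattern is `^(?!.ale$)(?!p.le$)(?!pa.e$)(?!pal.$).{4}$`, which is the chained-lookahead form with anchors added so that it consumes the whole word. The keyword itself is left unselected, which the question says is acceptable.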

Google Scripts replace function to prefix a regexp in CAPS

In GAS, using the .replace(), Is it possible to match any term within a long text string that is at least 5 consecutive ALL CAPS characters (may have 1 space in there) and prefix it with a string, such as ][? There may be multiple matches within the text string, so I want to insert markers that begin and end a phrase beginning with an ALLCAPS category.
An example of a similar type of text would be this (structurally similar, but with other sensitive data):
"VACATION: Approved by Supervisor - Frequency 1-3 times per year; duration not to exceed 5 days. SICK LEAVE: Approved by Supervisor - Frequency up to 8 per year, no more than 5 days consecutively without MD excuse. FMLA FEDERAL: Approved by HR - Frequency as needed, must be approved at least 14 days in advance, or within 24 hours of employee's identified need."
I have learned, through Serge, how to replace globally, which was a big help, but the more I research regexp's, the more confusing it gets. I tried substituting the all caps regexp for a specific term, but failed. I think that I could go through and extract all of the all caps regexp's and use them in a replace with multiple values, but it seems that would be a very long way around.
Is it possible, in a couple of lines to make the above text look like this:
"][VACATION: Approved by Supervisor - Frequency 1-3 times per year; duration not to exceed 5 days. ][SICK LEAVE: Approved by Supervisor - Frequency up to 8 per year, no more than 5 days consecutively without MD excuse. ][FMLA FEDERAL: Approved by HR - Frequency as needed, must be approved at least 14 days in advance, or within 24 hours of employee's identified need."
My intention is to then split on the ] Which would mean that new cells would start with the all caps term, and end with ]. I have the code to convert the text to an array (there are lots of entries), then use .replace() to find and replace within the array, and to set the values back into the sheet, but I just don't know if there is a way to either prefix (my research says lookback isn't possible in GAS), or to pick up the allcaps value, add the string "][", and put it back.
If this is asking too much, or feels like I haven't included any code, here is the first part that Serge already helped with: Looking for a Google script that will perform CTRL+F replace for a string
Here is the code, as I used it, combining Serge's previous help and the new recommendation. I had to fix some case issues with a term before running the all caps because some people can't follow a template, but it works.
function insertSplitMarkers(){
  var sh = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Freq Iso');
  var data = sh.getRange(2,1,sh.getLastRow(),sh.getLastColumn()).getValues();// get all data
  var regexp = /(([A-Z]\s*){5,})/g;
  for(var n=0;n<data.length;n++){
    for(var m=0;m<data[0].length;m++){
      if(typeof(data[n][m])=='string'){ // if it is a string
        data[n][m]=data[n][m].replace(/Interventions/g,'INTERVENTIONS');// use the regex replace with /g parameter meaning "globally"
        data[n][m]=data[n][m].replace(regexp, "][$1");
      }
    }
  }
  Logger.log(data);
  sh.getRange(2,1,data.length,data[0].length).setValues(data);
}
It looks like this will do what you want although as is, it will also pick out aoAOEOUE:
var yourString = "VACATION: Approved by Supervisor - Frequency 1-3 times per year; duration not to exceed 5 days. SICK LEAVE: Approved by Supervisor - Frequency up to 8 per year, no more than 5 days consecutively without MD excuse. FMLA FEDERAL: Approved by HR - Frequency as needed, must be approved at least 14 days in advance, or within 24 hours of employee's identified need.";
var regexp = /(([A-Z]\s*){5,})/g;
var newString = yourString.replace(regexp, "][$1");
Logger.log(newString);
@user3169581 I've adjusted your regex slightly to avoid matching whitespace around the desired phrase and to make sure you get the whole phrase. It requires a small adjustment in the replace, because the phrase is now group 1 and the colon is group 2:
var regexp = /\b([A-Z\s]{5,})(:)/g;
...
data[n][m] = data[n][m].replace(regexp,"][$1$2");
Link to regex101 with working matching here: http://regex101.com/r/rD5kS9
HTH
EDIT: for some reason the existing answer wasn't showing up for me when I started this response. Forgive the redundancy.
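To check the adjusted pattern without a spreadsheet, the same substitution can be tried in Python. This is only an illustrative sketch using Python's `re` module, not Apps Script; the function name `insert_split_markers` and the shortened sample text are made up for the example.

```python
import re

# Adjusted pattern from the answer above: group 1 is the ALL CAPS phrase,
# group 2 is the colon that follows it.
CAPS_HEADING = re.compile(r"\b([A-Z\s]{5,})(:)")

def insert_split_markers(text):
    """Prefix every ALL CAPS category (followed by a colon) with '][' ."""
    return CAPS_HEADING.sub(r"][\1\2", text)

sample = ("VACATION: Approved by Supervisor. "
          "SICK LEAVE: Approved by Supervisor. "
          "FMLA FEDERAL: Approved by HR.")
print(insert_split_markers(sample))
```

Because the pattern requires a trailing colon, a stray run of capitals inside a word, such as aoAOEOUE, is left untouched.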

How to increment date using regex

So, I have a spinEdit that should display the year and month in this format yyyyMM. I am using RegEx to mask the value to that format but when I want to increment from say 201212 to 201301, it fails and displays 20121. The RegEx I am using looks like this
([0-9][0-9][0-9][0-9])(0[1-9])|(1[0-2])
The issue is that incrementing the value (add 1 to month) isn't incrementing the year field when the month is at 12. The same happens in reverse where decreasing the value (minus 1 month) isn't decreasing the year, 201301 - 1 takes it to 2013. Is there a way to fix this using just RegEx?
I think it is possible, but not as a pure regex solution: you need Linux and bash available (personally I find the date function in bash ve). I had to take a date string from a filename and compare it with a date in a script. Below is the code snippet:
#!/bin/bash
# yyyymm you got after regex (no spaces around = in bash assignments)
inputdate=201307
# value you want to subtract
x=8
# outputdate should return you 201211 (the -d option requires GNU date)
outputdate=$(date -d "${inputdate}01 -$x month" +"%Y%m")
I believe there may be a way; however, it would be far too complicated to be worthwhile in a practical situation. So, keeping things simple, it is not possible.
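The rollover the question asks about is arithmetic rather than pattern matching, which is why a mask alone cannot do it. As a rough illustration, here is a small Python sketch of the year and month carry; `add_months` is a hypothetical helper and is not tied to any particular spin-edit control.

```python
def add_months(yyyymm, delta):
    """Shift a yyyyMM integer by delta months, carrying into the year."""
    year, month = divmod(yyyymm, 100)
    if not 1 <= month <= 12:
        raise ValueError("month part must be 01..12")
    # Work in a zero-based month count so that divmod handles the carry.
    total = year * 12 + (month - 1) + delta
    year, month_index = divmod(total, 12)
    return year * 100 + month_index + 1

print(add_months(201212, 1))   # → 201301
print(add_months(201301, -1))  # → 201212
```

The regex can still validate the displayed text, while a function like this computes the next or previous value.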