How to remove everything but certain words in string variable (Stata)?

I have a string variable response, which contains text as well as categories that have already been coded (categories like "CatPlease", "CatThanks", "ExcuseMe", "Apology", "Mit", etc.).
I would like to erase everything in response except for these previously coded categories.
For example, I would like response to change from:
"I Mit understand CatPlease read it again CatThanks"
to:
"Mit CatPlease CatThanks"
This seems like a simple problem, but I can't get my regex code to work perfectly.
The code below attempts to store the categories in a variable cat_only. It only works if the category appears at the beginning of response. The local macro, cats, contains all of the words I would like to preserve in response:
local cats = "(CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)?"
gen cat_only = strltrim(strtrim(ustrregexs(1)+" "+ustrregexs(2)+" "+ustrregexs(3))) if ustrregexm(response, "`cats'.+?`cats'.+?`cats'")
If I add characters to the beginning of the search pattern in ustrregexm, however, nothing will be stored in cat_only:
gen cat_only = strltrim(strtrim(ustrregexs(1)+" "+ustrregexs(2)+" "+ustrregexs(3))) if ustrregexm(response, ".+?`cats'.+?`cats'.+?`cats'")
Is there a way to fix my code to make it work, or should I approach the problem differently?

* Example generated by -dataex-. To install: ssc install dataex
clear
input str50 response
"I Mit understand CatPlease read it again CatThanks"
end
local regex "(?!CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)\b[^\s]+\b"
gen wanted = strtrim(stritrim(ustrregexra(response, "`regex'", "")))
list
. list
     +------------------------------------------------------------------------------+
     |                                           response                    wanted |
     |------------------------------------------------------------------------------|
  1. | I Mit understand CatPlease read it again CatThanks   Mit CatPlease CatThanks |
     +------------------------------------------------------------------------------+
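For anyone who wants to try the lookahead idea outside Stata, here is a minimal Python sketch of the same approach. It is an illustration only: it assumes Python's re module treats this pattern the way Stata's Unicode regex functions do, and the names CATS and keep_categories are made up for the example. One deliberate difference from the pattern above: the word boundary sits before the lookahead and a second \b closes the alternation, so a word that merely starts with a category (say "Mitigate") is removed rather than kept.

```python
import re

# Category words to preserve (copied from the question).
CATS = ["CatPlease", "CatThanks", "ExcuseMe", "Apology", "Mit",
        "IThink", "DK", "Confused", "Offers", "CatYG"]

# A run of non-space characters that starts at a word boundary and is
# NOT exactly one of the categories.
_NOT_CAT = re.compile(r"\b(?!(?:%s)\b)\S+" % "|".join(map(re.escape, CATS)))


def keep_categories(text: str) -> str:
    """Delete non-category words, then collapse the leftover spaces."""
    stripped = _NOT_CAT.sub("", text)
    return " ".join(stripped.split())


print(keep_categories("I Mit understand CatPlease read it again CatThanks"))
```

The final split/join plays the role of strtrim(stritrim(...)) in the Stata line.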

I don't regard myself as fluent with Stata's regex functions, but this may be helpful:
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen test = "I Mit understand CatPlease read it again CatThanks"
. local OK "(CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)"
. ssc install moss
. moss test, match("`OK'") regex
. egen wanted = concat(_match*), p(" ")
. l wanted
     +-------------------------+
     |                  wanted |
     |-------------------------|
  1. | Mit CatPlease CatThanks |
     +-------------------------+
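The moss approach (find every match, then concatenate the matches) can also be sketched in Python. This is only an analogue under my own assumptions, not what moss does internally; CATS and extract_categories are hypothetical names. I have added \b on both sides so that only whole words count as matches.

```python
import re

# Category words to extract (copied from the question).
CATS = ["CatPlease", "CatThanks", "ExcuseMe", "Apology", "Mit",
        "IThink", "DK", "Confused", "Offers", "CatYG"]

# Whole-word match for any one of the categories.
_CAT = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, CATS)))


def extract_categories(text: str) -> str:
    """Return the categories found in text, in order, separated by spaces."""
    return " ".join(_CAT.findall(text))


print(extract_categories("I Mit understand CatPlease read it again CatThanks"))
```

Collecting what you want tends to be easier to reason about than deleting everything you do not want, because no whitespace cleanup is needed afterwards.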

Spaces can be handled using regex:
local words = "(?!CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)\b\S+\b"
gen wanted = ustrregexra(response, "`words' | ?`words'", "")
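This uses an alternation (a regex OR, written |) to consume the space that follows or precedes each removed word. The leading space in the second branch is optional so that a word with no space before it is still matched, for example when the entire input is a single word to be removed.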

Related

Extract term within a string that matches a variable

I have a large dataset with two string variables: people_attending and special_attendee:
*Example generated by -dataex-. To install: ssc install dataex
clear
input str148 people_attending str16 special_attendee
"; steve_jobs-apple_CEO; kevin_james-comedian; michael_crabtree-football_player; sharon_stone-actor; bill_gates-microsoft_CEO; kevin_nunes-politician" "michael_crabtree"
"; rob_lowe-actor; ted_cruz-politician; niki_minaj-music_artist; lindsey_whalen-basketball_coach" "niki_minaj"
end
The first variable varies in length and contains a list of every person who attended an event along with their title. Name and title are separated by a dash, and attendees are separated by a semi-colon and space. The second variable is an exact match of one of the names contained in the first variable.
I want to create a third variable that extracts the title for whichever person is listed in the second variable. In the above example, I would want the new variable to be "football_player" for observation 1 and "music_artist" for observation 2.
Here is a way to do this using a simple regular expression:
generate wanted = subinstr(people_attending, special_attendee, ">", .)
replace wanted = ustrregexs(0) if ustrregexm(wanted, ">(.*?);")
replace wanted = substr(wanted, 3, strpos(wanted, ";")-3)
list wanted
     +-----------------+
     |          wanted |
     |-----------------|
  1. | football_player |
  2. |    music_artist |
     +-----------------+
In the first step you substitute the name with a marker >. Then you extract the relevant substring using the regular expression. In the final step, you clean up.
EDIT:
The third step can be omitted if you slightly modify the code as follows:
generate wanted = subinstr(people_attending, special_attendee, ">", .)
replace wanted = ustrregexs(1) if ustrregexm(wanted, ">-(.*?);")
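For comparison, here is a rough Python sketch of the same extraction. It is not a translation of the Stata code: instead of substituting a marker first, it builds the pattern from the name directly, and the function name title_of is invented for the example. It assumes names and titles never contain a semicolon. Because it captures up to the next semicolon or the end of the string, it also handles the case where the special attendee is the last entry in the list.

```python
import re


def title_of(people_attending: str, special_attendee: str) -> str:
    """Return the title that follows 'name-' in a '; name-title; ...' list."""
    # re.escape guards against names containing regex metacharacters.
    # The name must start the string or follow a semicolon separator.
    pattern = r"(?:^|;\s*)" + re.escape(special_attendee) + r"-([^;]*)"
    m = re.search(pattern, people_attending)
    return m.group(1).strip() if m else ""


row = ("; rob_lowe-actor; ted_cruz-politician; "
       "niki_minaj-music_artist; lindsey_whalen-basketball_coach")
print(title_of(row, "niki_minaj"))
```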

Dialogflow: Regexp entity not matched

I am going crazy with this problem; I am sure I am missing something...
I would like to match words that start with 2 letters or digits, followed by 1 or more letters, digits, or slashes.
Some examples:
AM9
B9C
AS/1
etc...
So I have created an entity, let's say EntityOne, as follows, based on some RegExp tests (I have also tested the same regexp surrounded by "()"; everything was tested on https://regex-golang.appspot.com/assets/html/index.html, which seems to use re2):
and a test Intent with params defined as follows:
REQUIRED | PARAM NAME | ENTITY | VALUE | IS LIST | PROMPTS
yes | name | #EntityOne | $value | no | test:
And inside this intent I try with words similar to the examples above that should be matched.
But I see the prompt "test:" over and over, the entity is never matched.
Any hints please? Tell me if you want me to share additional info, but I think that there is nothing much to share. Thanks in advance
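The regexp itself is not shown above, so the following is only a guess at a pattern that fits the description, checked in Python rather than in Dialogflow. The pattern and the name looks_like_code are my assumptions; it uses nothing beyond character classes and quantifiers, which RE2 also supports, but whether Dialogflow accepts it as a regexp entity is not something this sketch can confirm.

```python
import re

# Two letters or digits, then one or more letters, digits, or slashes.
# Hypothetical pattern reconstructed from the description in the question.
CODE = re.compile(r"[A-Za-z0-9]{2}[A-Za-z0-9/]+")


def looks_like_code(word: str) -> bool:
    """True if the whole word matches the pattern."""
    return CODE.fullmatch(word) is not None


for sample in ["AM9", "B9C", "AS/1", "AB", "A/1"]:
    print(sample, looks_like_code(sample))
```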

How to find strings beginning with X?

I am trying to identify strings that begin with X using the function regexm() in Stata.
My code:
for var lookin: count if regexm(X, "X")
I have tried using double quotes, square brackets, adding the options for the other characters in the string X[0-9][0-9] etc. but to no avail.
I expect the resultant number to be about 1000, but it returns 0.
The following works for me:
clear
input str22 foo
"Xhello"
"this is a X sentence"
"X a silly one"
"but serves the purpose"
end
generate tag = strmatch(foo, "X*")
list
     +------------------------------+
     |                    foo   tag |
     |------------------------------|
  1. |                 Xhello     1 |
  2. |   this is a X sentence     0 |
  3. |          X a silly one     1 |
  4. | but serves the purpose     0 |
     +------------------------------+
count if tag
2
This is the regular expression solution based on the above example:
generate tag = regexm(foo, "^X")
The for command is ancient and now undocumented Stata syntax. If you really are using a very old version of Stata, you should flag that in the question.
X is the default loop element which is substituted everywhere it is found.
Hence your syntax -- looping over a single variable -- reduces to
count if regexm(lookin, "lookin")
and even without a data example we can believe that the answer is 0.
This would be legal and is closer to what you seek:
for Y in var lookin : count if regexm(Y, "X")
but the regular expression is wrong, as @Pearly Spencer points out.
Incidentally,
count if strpos(lookin, "X") == 1
is a direct alternative to your code.
In any Stata that supports regexm() you should be looping with foreach or forvalues.
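The difference between an unanchored and an anchored pattern is easy to see in a small Python sketch. It uses Python's re module as a stand-in (assuming the ^ anchor behaves the same way as in regexm() for this simple case) and reuses the four example strings from above.

```python
import re

foo = ["Xhello", "this is a X sentence", "X a silly one", "but serves the purpose"]

# Unanchored: true whenever an X occurs anywhere in the string.
anywhere = [bool(re.search("X", s)) for s in foo]

# Anchored with ^: true only when the string begins with X.
at_start = [bool(re.search("^X", s)) for s in foo]

# Plain prefix test, the analogue of strpos(lookin, "X") == 1.
prefix = [s.startswith("X") for s in foo]

print(anywhere)
print(at_start)
print(prefix)
```

The unanchored pattern also flags the second string, which is why "X" alone is not the right regular expression for "begins with X".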

A single Regex Match Entire String first to then break up into multiple components

I am trying to come up with a (POSIX-like) regex for a vendor application that returns data like the sample below. It presents a single line of data at a time, so I do not need to account for multiple rows, only to match each row individually.
It can return one or more values in the string result.
The application doesn't let me simply use "\d+\.\d+" to capture one component of the string. I have to map every component of a row to a variable, even those I am going to discard; otherwise it returns a negative match result.
My data looks like the following with the weird underscore padding.
USER | ___________ 3.58625 | ___________ 7.02235 |
USER | ___________ 10.02625 | ___________ 15.23625 |
The syntax it supports is
Matches REGEX "(Var1 Regex), (Var2 Regex), (Var3 Regex), (Var 4 regex), (Var 5 regex)" and the entire string must match the aggregation of the RegEx components, a single character off and you get nothing.
The "|" characters are field separators for the data.
So in the above what I need is a RegEx that takes it up to the beginning of the numeric and puts that in Var1, then capture the numeric value with decimal point in var 2, then capture up to the next numeric in Var 3, and then keep the numeric in var 4, then capture the space and end field | character into var 5. Only Var 2 and 4 will be useful but I have to capture the entire string.
I have mainly tried capturing between the bars "|" using ^.*\|(.*).\|*$ from this question.
I have also tried the multiple variable ([0-9]+k?[.,]?[0-9]+)\s*-\s*.*?([0-9]+k?[.,]?[0-9]+) mentioned in this question.
When I try these in RegExr I cannot get them right, and I suspect I am missing something pretty simple.
In RegExr I never get more than one part of the string: I get either just the number or the entire string in a single variable, neither of which accomplishes the goal here.
The only example the documentation provides is the following, for something like a SysLog entry (I'm condensing it here) containing "Fault with Resource Name: Disk Specific Problem: Offline":
WHERE value matches regex "(.)Resource Name: (.), Specific Problem: ([^,]),(.)"
SET _Rrsc = var02
SET _Prob = var03
I've spun my wheels on this for several hours so would appreciate any guidance / help to get me over this hump.
Something like this should work:
(\D+)([\d.]+)(\D+)([\d.]+)(.*)
Or in normal words: Capture everything but numbers, capture a decimal number, capture everything but numbers, capture a decimal number, capture everything.
Using USER | ___________ 10.02625 | ___________ 15.23625 |
$1 = USER | ___________  
$2 = 10.02625
$3 =  | ___________  
$4 = 15.23625
$5 =  |
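To check the grouping, here is a small Python sketch that applies the same five-group pattern to the sample row. It assumes the vendor engine treats \D, \d and greedy quantifiers the way Python's re does, which may not hold for a strictly POSIX engine; split_row is a name made up for the example. fullmatch mirrors the requirement that the entire string must match.

```python
import re

# Five capture groups: non-digits, number, non-digits, number, the rest.
ROW = re.compile(r"(\D+)([\d.]+)(\D+)([\d.]+)(.*)")


def split_row(line: str):
    """Return the five captured parts, or None if the whole line does not match."""
    m = ROW.fullmatch(line)
    return m.groups() if m else None


line = "USER | ___________ 10.02625 | ___________ 15.23625 |"
var1, var2, var3, var4, var5 = split_row(line)
# Only the second and fourth groups carry the useful values.
print(float(var2), float(var4))
```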

Splitting a comma separated string with regex in sparql

I have a question about regex() in SPARQL.
I would like to replace a variable, which sometimes contains a phrase with a comma, with another that contains just what comes before the comma.
For example, if the variable contains "I like it, ok", I want to get a new variable that contains "I like it". I don't know which regular expression to use.
This is a use case for strbefore; you don't need regex at all. As a general tip, I suggest reading (or skimming) through the table of contents for Section 17 of the SPARQL 1.1 Query Language Recommendation. It lists all the SPARQL functions, and while you don't need to memorize them all, you'll at least have an idea of what's out there. (This is good advice for all programmers and languages: skim the table of contents and the index.) This query (see note 1) shows how to use strbefore:
select ?x ?prefix where {
values ?x { "we invited the strippers, jfk and stalin" }
bind( strbefore( ?x, "," ) as ?prefix )
}
---------------------------------------------------------------------------
| x                                          | prefix                     |
===========================================================================
| "we invited the strippers, jfk and stalin" | "we invited the strippers" |
---------------------------------------------------------------------------
1. See Strippers, JFK, and Stalin Illustrate Why You Should Use the Serial Comma
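One detail worth knowing: per the SPARQL 1.1 specification, strbefore returns an empty string when the separator does not occur, so a value without a comma comes back empty rather than unchanged. The Python sketch below illustrates both behaviours; str_before and before_comma_or_all are hypothetical helper names, and this is plain Python, not a SPARQL implementation.

```python
def str_before(text: str, sep: str) -> str:
    """Mimic SPARQL strbefore: the part before the first sep, or "" if sep is absent."""
    head, found, _ = text.partition(sep)
    return head if found else ""


def before_comma_or_all(text: str) -> str:
    """Keep the text before the first comma; leave comma-free text unchanged."""
    return text.partition(",")[0]


print(str_before("I like it, ok", ","))
print(repr(str_before("I like it", ",")))
print(before_comma_or_all("I like it"))
```

If comma-free values should be kept as they are, the query needs to handle that case explicitly rather than rely on strbefore alone.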