Extract term within a string that matches a variable - stata

I have a large dataset with two string variables: people_attending and special_attendee:
*Example generated by -dataex-. To install: ssc install dataex
clear
input str148 people_attending str16 special_attendee
"; steve_jobs-apple_CEO; kevin_james-comedian; michael_crabtree-football_player; sharon_stone-actor; bill_gates-microsoft_CEO; kevin_nunes-politician" "michael_crabtree"
"; rob_lowe-actor; ted_cruz-politician; niki_minaj-music_artist; lindsey_whalen-basketball_coach" "niki_minaj"
end
The first variable varies in length and contains a list of every person who attended an event along with their title. Name and title are separated by a dash, and attendees are separated by a semi-colon and space. The second variable is an exact match of one of the names contained in the first variable.
I want to create a third variable that extracts the title for whichever person is listed in the second variable. In the above example, I would want the new variable to be "football_player" for observation 1 and "music_artist" for observation 2.

Here is a way to do this using a simple regular expression:
generate wanted = subinstr(people_attending, special_attendee, ">", .)
replace wanted = ustrregexs(0) if ustrregexm(wanted, ">(.*?);")
replace wanted = substr(wanted, 3, strpos(wanted, ";")-3)
list wanted
+-----------------+
| wanted |
|-----------------|
1. | football_player |
2. | music_artist |
+-----------------+
In the first step you substitute the name with a marker >. Then you extract the relevant substring using the regular expression. In the final step, you clean up.
EDIT:
The third step can be omitted if you slightly modify the code as follows:
generate wanted = subinstr(people_attending, special_attendee, ">", .)
replace wanted = ustrregexs(1) if ustrregexm(wanted, ">-(.*?);")
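For readers who want to try the marker-then-capture idea outside Stata, here is a minimal Python sketch of the same two steps. The function name `extract_title` and the `rows` list are made up for illustration; the data is the two example observations from the question.

```python
import re
from typing import Optional

def extract_title(people_attending: str, special_attendee: str) -> Optional[str]:
    # Step 1: replace the attendee's name with the marker ">".
    marked = people_attending.replace(special_attendee, ">")
    # Step 2: capture lazily from just after ">-" up to the next ";".
    match = re.search(r">-(.*?);", marked)
    return match.group(1) if match else None

# The two observations from the question (hypothetical variable name).
rows = [
    ("; steve_jobs-apple_CEO; kevin_james-comedian; "
     "michael_crabtree-football_player; sharon_stone-actor; "
     "bill_gates-microsoft_CEO; kevin_nunes-politician", "michael_crabtree"),
    ("; rob_lowe-actor; ted_cruz-politician; niki_minaj-music_artist; "
     "lindsey_whalen-basketball_coach", "niki_minaj"),
]
titles = [extract_title(attending, name) for attending, name in rows]
```

Because the pattern requires a closing ";", this sketch returns None for an attendee who is listed last and therefore has no trailing semicolon.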

Related

How to extract real name, first name, lastname, and create nickname from name that have acronyms, aristocratic title, academic titles/degrees

I am trying to extract name, firstname, lastname, create nickname/firstname, and nickname/lastname from name list that have acronyms, aristocratic title, academic titles/degrees, nickname inside parentheses or after slash-backslash in libreoffice calc using regex function. This is the expected result
The rules are:
Expected Name: remove nickname inside parentheses or after slash-backslash if exist, job titles, academic titles and degrees except for (R) or (R.), (Ra) or (Ra.), and one alphabet with(out) dot in the name ex: (A.) (A), (M.), and name acronym with(out) dot ex: (Muh.), (Moh), etc
Firstname: (from expected name) extract first name including dot in acronym ex: (R.), (Ra.), (Muh.) and one alphabet with(out) dot in the name ex: (A.) (A), (M.)
Lastname: (from expected name) extract last name including dot
Nickname-fname: from Input Name, extract nickname inside parentheses or after slash-backslash if exist. If not exist, extract first name that is not: one char, acronym with(out) dot, "I Gede", "I Gusti", "I Made", "Ni Luh Putu", or "Ni Putu". If not exist, use next word in name even it's last word. If the name consist only one word, extract it even only one character
Nickname-lname: from Input Name, extract nickname inside parentheses or after slash-backslash if exist. If not exist, extract last name that is not: one char or acronym with(out) dot. If not exist, extract prev word in name even it's first word. If the name consist only one word, extract it even only one character
I tried the following:
=REGEX($A2,"\(.*?\)|\([^)]*\)|\\[^\\]*$|\/[^\/]*$|,.*","")
to extract real name. It remove nickname inside parentheses or after slash-backslash with(out) space like
( Nita ) in Yunita ( Nita )
( Nita ) in Yunita( Nita )
(Nita) in Yunita (Nita)
(Nita) in Yunita(Nita)
( Nita) in Yunita ( Nita)
( Nita) in Yunita( Nita)
(Nita ) in Yunita (Nita )
(Nita ) in Yunita(Nita )
\Nita in Yunita\Nita
\ Nita in Yunita\ Nita
\Nita in Yunita \Nita
\ Nita in Yunita \ Nita
...
and so on
it also remove academic degrees like Ph.D. but it failed on Ra. Ayu S. Ph.D. (because there is no comma after S.). It failed on Prof. and Dr.. I want to keep Ra..
=REGEX($A23,"(?:^|(?:[.!?]\s))([\w.]+)")
to extract first name but failed on Moh.Ali (yes, without space after dot).
=REGEX($A23,"\b(?<last>[\w\[.\]\\]+)$")
to extract last name but also failed on Moh.Ali.
=REGEX($A2,"(\(.*?\)|\([^)]*\)|\\[^\\]*$|\/[^\/]*$)|(\b(?<first>\w+)$)")
to create nickname extracted from given nickname inside parentheses or after slash or backslash. If nickname not exist create one from first name that is not: one char, acronym with(out) dot, "I Gede", "I Gusti", "I Made", "Ni Luh Putu", or "Ni Putu". If the first name not meet condition, use next word in name even it's last word. If still not meet condition, use name that consist only one word, extract it even only one character. The regex failed in most case.
=REGEX($A2,"(\(.*?\)|\([^)]*\)|\\[^\\]*$|\/[^\/]*$)|(\b(?<last>\w+)$)")
to create nickname extracted from given nickname inside parentheses or after slash or backslash. If nickname not exist create one from last name that is not: one char, acronym with(out) dot. If the last name not meet condition, use prev word in name even it's first word. If still not meet condition, use name that consist only one word, extract it even only one character. It failed in most case too.
This is the result of the regex:
But I'm new to regex and stuck. Please help me
The parsing rules are rather elaborate, especially for nicknames, so
I suggest using a multi-step process involving 2 helper columns and
the non-advanced (POSIX ERE plus \b) features of the available ICU
regular expressions combined with a few spreadsheet formulas.
If you copy the formulas listed and commented below to columns B
through H in your spreadsheet you should find that
conditional formatting highlights a total of 5 cells, all nicknames:
1 for Jasmine, 2 for Imah, and 2 for Nur.. For the first 3
the expected result contradicts the nickname rules; removing
the dot in Nur. is left as an exercise.
I used the input as it is but in cases like this pre-processing is
the key to avoid complex or special-case regexes, e.g. insert a blank
in names like M.Ali, or prefix a comma to comma-less Ph.D..
Aside: When you post source data as an image rather than plaintext
what's the likelihood of someone OCR'ing the image and editing the
text?
Step 1
Add auxiliary column AuxSplit separating the input string into its
3 main groups -- name, titles, nickname (last 2 optional) -- and
joining them with #. NB: # is used for this reply but it
must be a non-meta character not appearing in any input string.
REGEX(A22;"(.*?)( *(Ph\.D\.)?|(,[^\\/(]*?))?( *[\\/(].*)? *$";"$1#$2#$5")
where
$ anchors regex to end of string
$5 is the captured optional nickname (incl. any delimiters and blanks)
$2 the optional suffixed titles (incl. any comma and blanks);
note that the comma-less Ph.D is special-cased
$1 the name (incl. any leading blanks)
Sample content: Jasmine Fianna#, B.A., M.A.#/Jasmine
Step 2
Add auxiliary column AuxNick extracting an explicit nickname (if any)
from 3rd main group (after last #) in AuxSplit.
REGEX(G22;".*#[\\/( ]*([^ )]*)[ )]*$";"$1")
Capturing string after \ or / or between (), trimming off blanks.
Sample content Agus.
Step 3
Extract ExpectedName from 1st group (up to first #) in AuxSplit
stripping off any prefixed titles.
REGEX(G22;"^((Prof|Dr)\.[ ]*)?([^#]*).*";"$3")
Note that these titles are special-cased as they have the same form
as an abbreviated name.
Sample content: Q. Ranita El
Step 4
From ExpectedName extract Firstname,
REGEX(B22;"^(Ra?\.?|[^.]+\.|[^ ]+)")
stripping off R/R./Ra/Ra. and blanks from start of string,
and Lastname
REGEX(B22;"[^ .]*[^ ]$")
extracting the last character sequence after blank/dot;
any trailing blanks were removed in step 1.
Note: with M.Ali as M. Ali simplify Lastname formula to
REGEX(B22;"[^ ]+$")
Step 5
Extract Lastnick from AuxNick, LastName, or ExpectedName
IF(LEN(H22);H22;IF(ISNA(REGEX(D22;"\b([^ ]+\.|[^ .])$"));D22;REGEX(B22;".*?([^ ]+)[ ]+([^ ]*\.|[^.])$";"$1")))
IF(LEN(H22);H22 : use explicit nickname if present
;IF(ISNA(REGEX(D22;"…";D22 : else use Lastname unless abbrev. or 1-char.
;REGEX(B22;"…";"$1"))) : else use previous word in name, exploiting the
fact that an unmatched replacement returns the entire string
and Firstnick from AuxNick, FirstName, or ExpectedName
IF(LEN(H22);H22;IF(ISNA(REGEX(B22;"^([^ ]+\.|[^ ]\b|I (Gede|Gusti|Made)\b|Ni (Luh )?Putu\b)"));C22;REGEX(B22;"^(I (Gede|Gusti|Made)\b|Ni (Luh )?Putu\b|.*?)[ .]\b([^ ]+).*";"$4")))
IF(LEN(H22);H22 : use explicit nickname if present
;IF(ISNA(REGEX(B22;"…";C22 : else use FirstName unless abbrev. / 1-char. / special case
;REGEX(B22;"…";"$4"))) : else use next word in name, exploiting the
fact that an unmatched replacement returns the entire string
TODO: remove trailing dot from Nur.
Recall the (LibreOffice 6.2+) REGEX syntax:
Syntax: REGEX( Text ; Expression [ ; [ Replacement ] [ ; Flags|Occurrence ] ] )
Expression: A text representing the regular expression, using ICU
regular expressions
(https://unicode-org.github.io/icu/userguide/strings/regexp.html).
If there is no match and Replacement is not given, #N/A is returned.
Replacement: Optional. The replacement text and references to capture
groups. If there is no match, Text is returned unmodified.
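Steps 1 to 3 can also be sketched in Python to experiment with the patterns outside the spreadsheet. This is only a transliteration under assumptions: Python's `re` is not ICU, the function names are invented here, and inputs are assumed to be single-line strings without "#".

```python
import re

# Patterns copied from steps 1-3; only the surrounding code is new.
SPLIT_RE = re.compile(r"(.*?)( *(Ph\.D\.)?|(,[^\\/(]*?))?( *[\\/(].*)? *$")
NICK_RE = re.compile(r".*#[\\/( ]*([^ )]*)[ )]*$")
NAME_RE = re.compile(r"^((Prof|Dr)\.[ ]*)?([^#]*).*")

def aux_split(text):
    """Step 1: join name, suffixed titles and nickname part with '#'."""
    m = SPLIT_RE.match(text)
    if m is None:  # e.g. multi-line input, outside the assumptions
        return text
    # Unmatched optional groups are None in Python; treat them as empty.
    return "#".join(m.group(i) or "" for i in (1, 2, 5))

def aux_nick(split):
    """Step 2: explicit nickname, or '' when the third group is empty."""
    m = NICK_RE.match(split)
    return m.group(1) if m else split

def expected_name(split):
    """Step 3: first group without a leading Prof./Dr. title."""
    return NAME_RE.match(split).group(3)

sample = aux_split("Yunita ( Nita )")  # input taken from the question
```

With the question's input "Yunita ( Nita )", `aux_split` yields "Yunita# #( Nita )", from which `aux_nick` extracts "Nita" and `expected_name` extracts "Yunita".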

How to remove everything but certain words in string variable (Stata)?

I have a string variable response, which contains text as well as categories that have already been coded (categories like "CatPlease", "CatThanks", "ExcuseMe", "Apology", "Mit", etc.).
I would like to erase everything in response except for these previously coded categories.
For example, I would like response to change from:
"I Mit understand CatPlease read it again CatThanks"
to:
"Mit CatPlease CatThanks"
This seems like a simple problem, but I can't get my regex code to work perfectly.
The code below attempts to store the categories in a variable cat_only. It only works if the category appears at the beginning of response. The local macro, cats, contains all of the words I would like to preserve in response:
local cats = "(CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)?"
gen cat_only = strltrim(strtrim(ustrregexs(1)+" "+ustrregexs(2)+" "+ustrregexs(3))) if ustrregexm(response, "`cats'.+?`cats'.+?`cats'")
If I add characters to the beginning of the search pattern in ustrregexm, however, nothing will be stored in cat_only:
gen cat_only = strltrim(strtrim(ustrregexs(1)+" "+ustrregexs(2)+" "+ustrregexs(3))) if ustrregexm(response, ".+?`cats'.+?`cats'.+?`cats'")
Is there a way to fix my code to make it work, or should I approach the problem differently?
* Example generated by -dataex-. To install: ssc install dataex
clear
input str50 response
"I Mit understand CatPlease read it again CatThanks"
end
local regex "(?!CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)\b[^\s]+\b"
gen wanted = strtrim(stritrim(ustrregexra(response, "`regex'", "")))
list
. list
+-------------------------------------------------------------------------------+
| response wanted |
|-------------------------------------------------------------------------------|
1. | I Mit understand CatPlease read it again CatThanks Mit CatPlease CatThanks |
+-------------------------------------------------------------------------------+
I don't regard myself as fluent with Stata's regex functions, but this may be helpful:
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen test = "I Mit understand CatPlease read it again CatThanks"
. local OK "(CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)"
. ssc install moss
. moss test, match("`OK'") regex
. egen wanted = concat(_match*), p(" ")
. l wanted
+-------------------------+
| wanted |
|-------------------------|
1. | Mit CatPlease CatThanks |
+-------------------------+
Spaces can be handled using regex:
local words = "(?!CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)\b\S+\b"
gen wanted = ustrregexra(response, "`words' | ?`words'", "")
This uses an alternation (a regex OR which is coded |) to match trailing/leading spaces, with the leading space being optional to handle when the entire input is one of the target words.
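The two strategies in these answers (collect the matches, or delete everything else) can be compared side by side in a short Python sketch. The function names are invented for illustration, and Python's `re` stands in for Stata's regex functions, so treat it as an approximation rather than a drop-in equivalent.

```python
import re

CATS = ["CatPlease", "CatThanks", "ExcuseMe", "Apology", "Mit",
        "IThink", "DK", "Confused", "Offers", "CatYG"]
ALT = "|".join(map(re.escape, CATS))

def keep_by_matching(response):
    # Collect every whole-word category and join them with single spaces.
    return " ".join(re.findall(rf"\b(?:{ALT})\b", response))

def keep_by_removing(response):
    # Delete every word that does not start with a category,
    # then collapse the leftover whitespace.
    stripped = re.sub(rf"(?!{ALT})\b\S+\b", "", response)
    return " ".join(stripped.split())

example = "I Mit understand CatPlease read it again CatThanks"
kept = keep_by_matching(example)
```

One difference worth knowing: the negative lookahead only checks the start of a word, so `keep_by_removing` keeps a word such as "Mitten", while `keep_by_matching` drops it.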

A single Regex Match Entire String first to then break up into multiple components

I am trying to come up with a regex (POSIX-like) for a vendor application that returns data like the sample below. It presents a single line of data at a time, so I do not need to account for multiple rows; each row has to be matched individually.
It can return one or more values in the string result.
The application doesn't let me use just "\d+\.\d+" to capture the component out of the string. I have to map every component of a row to a variable, even the ones I am going to discard; otherwise it returns a negative match result.
My data looks like the following with the weird underscore padding.
USER | ___________ 3.58625 | ___________ 7.02235 |
USER | ___________ 10.02625 | ___________ 15.23625 |
The syntax it supports is
Matches REGEX "(Var1 Regex), (Var2 Regex), (Var3 Regex), (Var 4 regex), (Var 5 regex)", and the entire string must match the aggregation of the regex components; a single character off and you get nothing.
The "|" characters are field separators for the data.
So in the above what I need is a RegEx that takes it up to the beginning of the numeric and puts that in Var1, then capture the numeric value with decimal point in var 2, then capture up to the next numeric in Var 3, and then keep the numeric in var 4, then capture the space and end field | character into var 5. Only Var 2 and 4 will be useful but I have to capture the entire string.
I have mainly tried capturing between the bars "|" using ^.*\|(.*).\|*$ from this question.
I have also tried the multiple variable ([0-9]+k?[.,]?[0-9]+)\s*-\s*.*?([0-9]+k?[.,]?[0-9]+) mentioned in this question.
When I try these in RegExr I can't get them right, and I feel like I am missing something pretty simple.
In RegExr I never get more than one part of the string: I get either just the number or the entire string in a single variable, and neither accomplishes the goal in this context.
The only example the documentation provides is the following, for something like a SysLog entry (which I'm condensing here to "Fault with Resource Name: Disk Specific Problem: Offline"):
WHERE value matches regex "(.)Resource Name: (.), Specific Problem: ([^,]),(.)"
SET _Rrsc = var02
SET _Prob = var03
I've spun my wheels on this for several hours so would appreciate any guidance / help to get me over this hump.
Something like this should work:
(\D+)([\d.]+)(\D+)([\d.]+)(.*)
Or in plain words: capture everything up to the first digit, capture a decimal number, capture everything up to the next digit, capture a decimal number, capture the rest.
Using USER | ___________ 10.02625 | ___________ 15.23625 |
$1 = USER | ___________  
$2 = 10.02625
$3 =  | ___________  
$4 = 15.23625
$5 =  |
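To check the whole-string behaviour before putting the pattern into the vendor tool, a small Python sketch can help. It assumes the tool behaves like a full match over five capture groups, which is my reading of the question rather than documented behaviour; `split_row` is a made-up name.

```python
import re

# Same five groups as above: non-digits, number, non-digits, number, rest.
ROW_RE = re.compile(r"(\D+)([\d.]+)(\D+)([\d.]+)(.*)")

def split_row(line):
    # fullmatch mimics "the entire string must match"; None means no match.
    m = ROW_RE.fullmatch(line)
    return m.groups() if m else None

row = "USER | ___________ 10.02625 | ___________ 15.23625 |"
parts = split_row(row)
```

Joining the five groups back together reproduces the input row, which is a quick way to confirm that no character was left unmapped.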

How can I separate a string by underscore (_) in google spreadsheets using regex?

I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
And the output has to be in different columns (one per field), with no "_" and excluding the final number, as it follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
help!
Thanks in advance (and sorry for my english)
=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace all the _ with parentheses to create capture groups, while also excluding the digit string at the end, then surround the whole string with parentheses.
We then use REGEXEXTRACT to actually pull the pieces out; the groups automatically push them to their own cells/columns.
To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 we use SPLIT(text, delimiter, [split_by_each]). The text is first cleaned with REGEXREPLACE(A1,"_+\d*$","") to remove the trailing 123421, and SPLIT then gives you one column for each word delimited by "_":
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = REGEXREPLACE(A1,"_+\d*$","") //This gives you: campaign1_attribute1_whatever_yes
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.
I finally figured it out yesterday in stackoverflow (spanish): https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked for a regex-only solution for Google Sheets was that I need to use it in Google Data Studio (same regex functions as spreadsheets)
To get each column just use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to be changed in the formula to switch columns is the number inside {}, (column number - 1).
If you do not have the final number, just don't put the last "_".
Lastly, remember to do all the calculated fields again, because (for example) it gets an error with CPC, CTR and other Adwords metrics that are calculated automatically.
Hope it helps!
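Both ideas from this thread, the "n-th field" pattern and the "strip the trailing number, then split" approach, can be tried in a few lines of Python. The function names are invented for the sketch, and Python's `re` is only assumed to behave like the RE2-style functions in Sheets and Data Studio for these simple patterns.

```python
import re

def nth_field(text, n):
    # Skip n "field_" prefixes, then capture the next field before a "_".
    m = re.search(r"^(?:[^_]*_){%d}([^_]*)_" % n, text)
    return m.group(1) if m else None

def split_fields(text):
    # Remove the trailing "_<digits>" and split what is left on "_".
    return re.sub(r"_+\d*$", "", text).split("_")

campaign = "campaign1_attribute1_whatever_yes_123421"
fields = split_fields(campaign)
```

Here `nth_field(campaign, 4)` returns None because the final number is not followed by "_", which is how the pattern leaves the trailing number out.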

Get part of a string based on conditions using regex

For the life of me, I can't figure out the combination of regular expression characters to parse the part of the string I want. Each string is one line out of roughly 400 thousand lines (in no particular order), processed inside a for loop. I find the line by matching it against a unique number taken from an array that the loop iterates over.
For every string I'm trying to get a date number (such as 20151212 below).
Given the following examples of the strings (pulled from a CSV file with 400k++ lines of strings):
String1:
314513,,Jr.,John,Doe,652622,U51523144,,20151212,A,,,,,,,
String2:
365422,johnd#blankity.com,John,Doe.,Jr,987235,U23481,z725432,20160221,,,,,,,,
String3:
6231,,,,31248,U51523144,,,CB,,,,,,,
There are several complications here...
Some names have a "," in them, so it makes it more than 15 commas.
We don't know the value of the date, just that it is a date format such as (get-date).tostring("yyyyMMdd")
For those who can think of a better way...
We are given two CSV files to match. Algorithmic steps:
Look in the CSV file 1 for the ID Number (found on the 2nd column)
** No ID Numbers will be blank for CSV file 1
Look in the CSV file 2 and match the ID number from CSV file 1. On this same line, get the date. Once have date, append in 5th column on CSV file 1 with the same row as ID number
** Note: CSV file 2 will have $null for some of the values in the ID
number column
I'm open to suggestions (including using the Import-Csv cmdlet, although I am not too familiar yet with its flags or with the syntax of for loops over those values).
You could try something like this:
,(19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01]),
This will match all dates in the given format from 1900 - 2099. It is also specific enough to rule out most other random numbers, although without a larger sample of data, it's impossible to say.
Then in PowerShell:
gc data.csv | where { $_ -match ",((19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])),"} | % { $matches[1] }
In the PowerShell match we added capturing parentheses around what we want, and reference the group via the group number in the $matches index.
If you are only interested in matching one line based on a preceding id you could use a lookbehind. For example,
$id=314513; # Or maybe U23481
gc c:\temp\reg.txt | where { $_ -match "(?<=$id.*),((19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])),"} | % { $matches[1] }
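As a cross-check of the date pattern outside PowerShell, here is a hedged Python sketch run against the three sample strings from the question. Two things are my additions and not part of the answer: the trailing comma is matched with a lookahead so that adjacent fields are not skipped, and each candidate is validated with `datetime.strptime` to reject impossible dates such as 20150231. The name `find_date` is made up.

```python
import re
from datetime import datetime

# Comma, yyyyMMdd between 1900 and 2099, then a comma (not consumed).
DATE_RE = re.compile(
    r",((?:19|20)[0-9]{2}(?:0[1-9]|1[012])(?:0[1-9]|[12][0-9]|3[01]))(?=,)"
)

def find_date(line):
    for candidate in DATE_RE.findall(line):
        try:
            # Extra check (an addition): the digits must form a real date.
            datetime.strptime(candidate, "%Y%m%d")
        except ValueError:
            continue
        return candidate
    return None

samples = [
    "314513,,Jr.,John,Doe,652622,U51523144,,20151212,A,,,,,,,",
    "365422,johnd#blankity.com,John,Doe.,Jr,987235,U23481,z725432,20160221,,,,,,,,",
    "6231,,,,31248,U51523144,,,CB,,,,,,,",
]
dates = [find_date(s) for s in samples]
```

The third sample has no date field, so the sketch returns None for it instead of guessing.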