Get each value after specific words with Regex

I have the string below and I am trying to get every value after Id and DisplayName. I tried to write a lookbehind, but I could not get it to work: it grabs only the first value, while I want to grab all of them.
This was my code to grab the value after DisplayName:
(?<=\bDisplayName\\\"\=\>\\\")(\w+)
It grabs the first value, but only if it is alphabetic, while most of my text is a mix of Japanese Kanji, Katakana, Hiragana and special characters such as ・.
"{\"Ancestor\"=>{\"Ancestor\"=>{\"Ancestor\"=>{\"Ancestor\"=>{\"ContextFreeName\"=>\"本\", \"DisplayName\"=>\"本\", \"Id\"=>\"465392\"}, \"ContextFreeName\"=>\"本\", \"DisplayName\"=>\"ジャンル別\", \"Id\"=>\"465610\"}, \"ContextFreeName\"=>\"ビジネス・経済\", \"DisplayName\"=>\"ビジネス・経済\", \"Id\"=>\"466282\"}, \"ContextFreeName\"=>\"経営学・キャリア・MBA\", \"DisplayName\"=>\"経営学・キャリア・MBA\", \"Id\"=>\"492076\"}, \"ContextFreeName\"=>\"経営学・キャリア・MBAの起業・開業\", \"DisplayName\"=>\"起業・開業\", \"Id\"=>\"492058\", \"IsRoot\"=>false}"
What I want to achieve from the above string is the following:
Grab each string after DisplayName
ex.
"DisplayName"=>"本" grab 本
"DisplayName"=>"経営学・キャリア・MBA" grab 経営学・キャリア・MBA
Grab each integer after Id
ex.
"Id"=>"465392" grab 465392
"Id"=>"492058" grab 492058
Is it possible to do this with regex, or should I look for something other than regex?

You can use capturing groups, as in:
"DisplayName"=>"([^"]*)"
"Id"=>"(\d+)
Details:
"DisplayName"=>"([^"]*)" - "DisplayName"=>" is matched first, then zero or more chars other than " are captured into Group 1, and the closing " is matched.
"Id"=>"(\d+) - "Id"=>" is matched first, then one or more digits are captured into Group 1.
See the Python demo:
import re
s = "{\"Ancestor\"=>{\"Ancestor\"=>{\"Ancestor\"=>{\"Ancestor\"=>{\"ContextFreeName\"=>\"本\", \"DisplayName\"=>\"本\", \"Id\"=>\"465392\"}, \"ContextFreeName\"=>\"本\", \"DisplayName\"=>\"ジャンル別\", \"Id\"=>\"465610\"}, \"ContextFreeName\"=>\"ビジネス・経済\", \"DisplayName\"=>\"ビジネス・経済\", \"Id\"=>\"466282\"}, \"ContextFreeName\"=>\"経営学・キャリア・MBA\", \"DisplayName\"=>\"経営学・キャリア・MBA\", \"Id\"=>\"492076\"}, \"ContextFreeName\"=>\"経営学・キャリア・MBAの起業・開業\", \"DisplayName\"=>\"起業・開業\", \"Id\"=>\"492058\", \"IsRoot\"=>false}"
print(re.findall(r'"DisplayName"=>"([^"]*)"', s))
# => ['本', 'ジャンル別', 'ビジネス・経済', '経営学・キャリア・MBA', '起業・開業']
print(re.findall(r'"Id"=>"(\d+)', s))
# => ['465392', '465610', '466282', '492076', '492058']
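If each DisplayName needs to stay paired with its Id, one possible variation is to capture both in a single pattern. The sketch below assumes that "Id" always directly follows "DisplayName", separated by a comma and whitespace as in the sample. The shortened string `sample` and the name `pair_re` are made up for illustration.

```python
import re

# Shortened, hypothetical sample in the same format as the string above
sample = ('{"Ancestor"=>{"ContextFreeName"=>"本", "DisplayName"=>"本", "Id"=>"465392"}, '
          '"ContextFreeName"=>"ビジネス・経済", "DisplayName"=>"ビジネス・経済", '
          '"Id"=>"466282", "IsRoot"=>false}')

# Group 1: the display name, Group 2: the numeric id that follows it
pair_re = re.compile(r'"DisplayName"=>"([^"]*)",\s*"Id"=>"(\d+)"')

pairs = pair_re.findall(sample)
print(pairs)
# => [('本', '465392'), ('ビジネス・経済', '466282')]
```

With two capturing groups, re.findall returns one tuple per match, so each name stays next to its id.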

Related

Extracting string with regex from a tf.Tensor in Tensorflow 2?

I am saving my tf.keras model with a signature in TF2 to serve it with TFServing. In the signature function I would like to extract some entities with regex expressions.
My input is a Tensor with datatype tf.string. I cannot use numpy() within it, resulting in "Tensor object has no attribute numpy". tf.py_function() is unavailable in TFServing as well.
So I am left with tensorflow operations. How would I extract a substring with a pattern?
import tensorflow as tf

@tf.function
def serve_fn(input):
    # Returns "Today's date is  . Tomorrow is another day." but I need "11/2020"
    output = tf.strings.regex_replace("Today's date is 11/2020. Tomorrow is another day.", pattern=r'[\d]{2}/[\d]{4}', rewrite=" ")
    # model inference ...
    return {'output': output}
That returns a tensor in which the date is replaced by a space: "Today's date is  . Tomorrow is another day."
What would a pattern look like that returns just the date? If I'm not mistaken, tf.strings.regex_replace uses re2, which does not support lookaheads. Are there maybe other solutions?
Thanks in advance
You can use
tf.strings.regex_replace("Today's date is 11/2020. Tomorrow is another day.", pattern=r'.*?(\d{2}/\d{4}).*', rewrite=r'\1')
Details:
.*?(\d{2}/\d{4}).* - .*? matches zero or more chars other than line break chars, as few as possible; (\d{2}/\d{4}) captures two digits, a / and four digits into Group 1; the trailing .* then matches the rest of the line greedily (as many chars as possible).
\1 is the backreference to the Group 1 value. See the regex_replace reference: the rewrite argument "supports backslash-escaped digits (\1 to \9) can be to insert text matching corresponding parenthesized group.".
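The capture-and-rewrite idea can be tried outside TensorFlow. The sketch below uses Python's re.sub as a stand-in, which is an assumption for illustration only: tf.strings.regex_replace uses RE2, not Python's re, although this particular pattern only uses syntax common to both. The helper name `extract_date` is hypothetical. The second call shows a caveat: when the input contains no date, the pattern does not match and the text comes back unchanged, so downstream code may need to handle that case.

```python
import re

# Same pattern as above: lazy prefix, captured date, greedy suffix
DATE_RE = re.compile(r'.*?(\d{2}/\d{4}).*')

def extract_date(text):
    # Replace the whole matched line with the captured group only
    return DATE_RE.sub(r'\1', text)

with_date = extract_date("Today's date is 11/2020. Tomorrow is another day.")
without_date = extract_date("Tomorrow is another day.")
print(with_date)     # 11/2020
print(without_date)  # Tomorrow is another day.
```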

How to pass multiple regexes in a single re.compile and create a list of matched patterns

I have written multiple regex patterns separately and tried to collect the matches in a list, like this:
pattern=re.compile(r'(?:OR011-|OGEA|LLCM|A|1-)\d{2,15}')
For a single pattern I am able to collect the matches from a column into a list like this, but not for all the patterns as a whole:
pattern_list=list(filter(pattern1.findall, column))
The input column looks like this:
column
OR011-103401461251
Hi the information is 1-234455
How are you?LLCM23466723
Current output:
['OR011-103401461251','Hi the information is 1-234455','How are you?LLCM23466723']
Required output:
['OR011-103401461251','1-234455','LLCM23466723']
How can I compile all patterns in a single re.compile() and make a single pattern_list for all the matched patterns?
You can use an alternation to combine your expressions into one pattern:
(?:OR011-|OGEA|LLCM|A|1-)\d{2,15}
Explanation
(?: Non capturing group
OR011-|OGEA|LLCM|A|1- Match 1 of the options
) Close non capturing group
\d{2,15} Match 2-15 digits
About your approach
The function filter keeps the elements for which the given function returns a truthy value. You pass the method findall to filter, and for every item that contains a match, findall returns a non-empty (truthy) list, so filter keeps the whole original element, which results in:
['OR011-103401461251','Hi the information is 1-234455','How are you?LLCM23466723']
What you could do instead of using filter is to use map and pass findall:
pattern=re.compile(r'(?:OR011-|OGEA|LLCM|A|1-)\d{2,15}')
pattern_list=map(pattern.findall, df.column)
print(list(pattern_list))
That will result in:
[['OR011-103401461251'], ['1-234455'], ['LLCM23466723']]
Or you could pass a lambda to map and first check if the search has a result:
pattern=re.compile(r'(?:OR011-|OGEA|LLCM|A|1-)\d{2,15}')
pattern_list=map(lambda x: pattern.search(x).group() if pattern.search(x) else None, df.column)
print(list(pattern_list))
That will result in:
['OR011-103401461251', '1-234455', 'LLCM23466723']
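If a row can contain several matches, or no match at all, a flat list may be more convenient than a list of lists or a list with None entries. A minimal sketch, using a plain Python list in place of the DataFrame column (the `rows` list, including its extra non-matching row, is illustrative):

```python
import re

pattern = re.compile(r'(?:OR011-|OGEA|LLCM|A|1-)\d{2,15}')

rows = [
    'OR011-103401461251',
    'Hi the information is 1-234455',
    'How are you?LLCM23466723',
    'nothing to find here',
]

# findall returns a list per row; the nested comprehension flattens them
# and rows without a match simply contribute nothing
pattern_list = [match for row in rows for match in pattern.findall(row)]
print(pattern_list)
# ['OR011-103401461251', '1-234455', 'LLCM23466723']
```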

A single Regex Match Entire String first to then break up into multiple components

I am trying to come up with a RegEx (POSIX-like) for a vendor application that returns data like that illustrated below. It presents a single line of data at a time, so I do not need to account for multiple rows; I only need to match each row individually.
It can return one or more values in the string result.
The application doesn't just let me use "\d+\.\d+" to capture the component out of the string. Unfortunately, I need to map every component of a row of data to a variable, even if I am going to discard it; otherwise it returns a negative match result.
My data looks like the following with the weird underscore padding.
USER | ___________ 3.58625 | ___________ 7.02235 |
USER | ___________ 10.02625 | ___________ 15.23625 |
The syntax it supports is
Matches REGEX "(Var1 Regex), (Var2 Regex), (Var3 Regex), (Var 4 regex), (Var 5 regex)" and the entire string must match the aggregation of the RegEx components, a single character off and you get nothing.
The "|" characters are field separators for the data.
So in the above what I need is a RegEx that takes it up to the beginning of the numeric and puts that in Var1, then capture the numeric value with decimal point in var 2, then capture up to the next numeric in Var 3, and then keep the numeric in var 4, then capture the space and end field | character into var 5. Only Var 2 and 4 will be useful but I have to capture the entire string.
I have mainly tried capturing between the bars "|" using ^.*\|(.*).\|*$ from this question.
I have also tried the multiple variable ([0-9]+k?[.,]?[0-9]+)\s*-\s*.*?([0-9]+k?[.,]?[0-9]+) mentioned in this question.
I seem to be missing something when I try them in RegExr, and I feel like it is something pretty simple.
In RegExr I never get more than one part of the string: I get either just the number or the equivalent of the entire string in a single variable, and neither accomplishes the required goal in this context.
The only example the documentation provides is the following, for something like a SysLog entry, which I'm consolidating here as "Fault with Resource Name: Disk Specific Problem: Offline":
WHERE value matches regex "(.*)Resource Name: (.*), Specific Problem: ([^,]*),(.*)"
SET _Rrsc = var02
SET _Prob = var03
I've spun my wheels on this for several hours so would appreciate any guidance / help to get me over this hump.
Something like this should work:
(\D+)([\d.]+)(\D+)([\d.]+)(.*)
Or in plain words: capture everything that is not a digit, capture a decimal number, capture everything up to the next digit, capture a decimal number, then capture the rest.
Using USER | ___________ 10.02625 | ___________ 15.23625 |
$1 = USER | ___________  
$2 = 10.02625
$3 =  | ___________  
$4 = 15.23625
$5 =  |
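To check the grouping outside the vendor tool, here is a small sketch in Python. Python's re module is only a stand-in, since the vendor's POSIX-like engine may behave differently. If that engine does not support \d and \D, the bracket forms [0-9] and [^0-9] used below are the usual equivalents. fullmatch mirrors the requirement that the entire string must match; the names `ROW_RE` and `var1` to `var5` are made up for illustration.

```python
import re

# Bracket-expression version of (\D+)([\d.]+)(\D+)([\d.]+)(.*)
ROW_RE = re.compile(r'([^0-9]+)([0-9.]+)([^0-9]+)([0-9.]+)(.*)')

row = 'USER | ___________ 10.02625 | ___________ 15.23625 |'

# fullmatch succeeds only if the five groups together cover the whole row
m = ROW_RE.fullmatch(row)
var1, var2, var3, var4, var5 = m.groups()
print(var2, var4)  # 10.02625 15.23625
```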

Autohotkey extract text using regex

I am just learning regex with AutoHotkey, but I can't figure out how to extract a specific string and save it to a variable.
Line of text I am searching:
T NW CO NORWALK HUB NW 201-DS3-WLFRCTAICM5-NRWLCT02K16 [DS3 LEC] -1 -1 PSTN
I am trying to save, NW 201-DS3-WLFRCTAICM5-NRWLCT02K16 [DS3 LEC] ONLY.
Here is my regex code:
NW\D\d.*DS3.*]
But how do I store that as a variable in autohotkey?
I have tried RegexMatch but that only shows the position. I am doing something wrong.
You may provide a third argument, a variable that will receive the match:
RegExMatch(str,"NW\D\d.*DS3.*\]",matches)
Then, in the default mode (AutoHotkey v1), matches will contain the whole match.
If you use capturing groups inside the pattern, their values are stored in numbered variables based on that name. If you use "NW\D(\d.*DS3.*)\]" against "NW 5xxx DS3 yyy]", matches will hold the whole string and matches1 will hold 5xxx DS3 yyy.
See AHK RegExMatch docs:
FoundPos := RegExMatch(Haystack, NeedleRegEx [, UnquotedOutputVar = "", StartingPosition = 1])
UnquotedOutputVar
Mode 1 (default): OutputVar is the unquoted name of a variable in which to store the part of Haystack that matched the entire pattern. If the pattern is not found (that is, if the function returns 0), this variable and all array elements below are made blank.
If any capturing subpatterns are present inside NeedleRegEx, their matches are stored in a pseudo-array whose base name is OutputVar. For example, if the variable's name is Match, the substring that matches the first subpattern would be stored in Match1, the second would be stored in Match2, and so on. The exception to this is named subpatterns: they are stored by name instead of number. For example, the substring that matches the named subpattern "(?P<Year>\d{4})" would be stored in MatchYear. If a particular subpattern does not match anything (or if the function returns zero), the corresponding variable is made blank.
; If you want to delete ALL ....
Only(ByRef C)
{
    /*
    RegExReplace
    https://autohotkey.com/docs/commands/RegExReplace.htm
    */
    ; NW 201-DS3-WLFRCTAICM5-NRWLCT02K16 [DS3 LEC]
    C:=RegExReplace(C, "NW\s[\w-]+\s\[[\w\s]+\]","",ReplacementCount,-1)
    if (ReplacementCount = 0)
        return C
    else
        return Only(C)
} ; Only(ByRef C)
string:="Line of text I am searching: T NW CO NORWALK HUB NW 201-DS3-WLFRCTAICM5-NRWLCT02K16 [DS3 LEC] -1 -1 PSTN"
Result:=Only(string)
MsgBox, % Result
MsgBox, % Only(string)

How to stop Regex Search look ahead if keyword group is found (CLOSED)

I have the following strings, on which I need to run a regex search that extracts only account IDs and avoids transaction-related IDs:
Transaction ID 989898989
Trx no. 989898989
Account ID 1234567890
Account Number 1234567890
Acnt No. 1234567890
Account # 1234567890
ID 1234567890
I have created a regex that extracts the account ID from text like this; the ID is the 3rd group in the regex.
import re
txt = "Account ID 1234567890"  # each of the seven strings above, one by one
re1 = r"(No\.|#|Number|ID)(\s)(\d{10,12})"
rg = re.compile(re1, re.IGNORECASE | re.DOTALL)
m = rg.search(txt)
if m:
    print(m.group(3))
If I run this code, the number is extracted from every string. But I want the search to stop if the word "transaction" or "trx" appears in the string. I tried using a negative lookahead but was unable to find a solution.
The result I expect is that the code above prints the number for every string except those that contain the word "transaction" or "trx".
I want a regex that stops searching for the groups once "transaction" is found.
Something like this -
(?!transaction)(\s)(No\.|#|Number|ID)(\s)(\d{10,12})
Please Help!
Solution - using a conditional in the regex
(transaction|trx)?(?(1)|\d{3,12})
Explanation -
(transaction|trx)? => 1st group, optional, so the pattern can still match when neither word is present
(?(1)|\d{3,12}) => conditional - ?(1) checks whether the first group matched; if it did, the (empty) part before the '|' pipe is used, otherwise the part after the '|', \d{3,12}, is matched
After that just run => m.group()
and it will return either the number or the word.
In the business logic, try to cast the value to INT: if the cast succeeds, we extracted a number; if not, whatever we extracted is not an INT.
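An alternative that avoids the conditional altogether is to test for the transaction keywords first and only then search for the account number. This is a different technique from the conditional regex above, sketched under the assumption that any line mentioning "transaction" or "trx" should be skipped entirely. The names `SKIP_RE`, `ACCOUNT_RE` and `account_id` are made up for illustration.

```python
import re

# Lines containing either keyword are rejected before any number is extracted
SKIP_RE = re.compile(r'\b(?:transaction|trx)\b', re.IGNORECASE)
# Label, whitespace, then a 10 to 12 digit account number in Group 1
ACCOUNT_RE = re.compile(r'(?:No\.|#|Number|ID)\s+(\d{10,12})', re.IGNORECASE)

def account_id(text):
    """Return the account number as a string, or None if there is none."""
    if SKIP_RE.search(text):
        return None
    m = ACCOUNT_RE.search(text)
    return m.group(1) if m else None

samples = [
    "Transaction ID 989898989",
    "Trx no. 989898989",
    "Account ID 1234567890",
    "Account Number 1234567890",
    "Acnt No. 1234567890",
    "Account # 1234567890",
    "ID 1234567890",
]
results = [account_id(s) for s in samples]
print(results)
# [None, None, '1234567890', '1234567890', '1234567890', '1234567890', '1234567890']
```

Returning None for skipped lines removes the need to typecast the result afterwards to find out whether a number was extracted.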