Capture groups in 1 line with fixed delimiters - regex

I'm a beginner at regex and still don't understand a lot. I apologize in advance for any wrong notation or missing information :(
I need to extract groups from an e-mail subject. Each value is used later in a process as a folder or document name.
Example: 123456/TEXT/567890/01Moretext
I need to get the following pieces of text:
123456
TEXT
567890
01Moretext
in separate regex commands.
So far I have:
^\d{6}, which gives me 123456
(?<=/)[^/]*, which gives me TEXT
I can't figure out how to extract the third group, 567890
[^/]*$, which gives me 01Moretext
Would appreciate any help that can prevent my head from exploding!

You can use
[^/]+(?=/[^/]*$)
See the regex demo. Details:
[^/]+ - one or more chars other than /
(?=/[^/]*$) - a positive lookahead that requires a / and then zero or more chars other than / till the end of string.
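As a rough illustration, the four standalone patterns from this thread can be checked against the sample subject. This is only a sketch that assumes Python's re module. The question does not say which tool runs the patterns, and the names subject, patterns and parts are made up for the example.

```python
import re

subject = "123456/TEXT/567890/01Moretext"

# One standalone pattern per piece, since each value is needed separately.
patterns = {
    "first": r"^\d{6}",            # six digits at the start of the subject
    "second": r"(?<=/)[^/]*",      # first run of non-slash chars preceded by a slash
    "third": r"[^/]+(?=/[^/]*$)",  # segment that is followed by the last slash
    "fourth": r"[^/]*$",           # everything after the last slash
}

# re.search returns the first match only, which is what each pattern relies on.
parts = {name: re.search(pat, subject).group(0) for name, pat in patterns.items()}
print(parts)
```

Note that the second pattern only yields TEXT because the first match is taken; a tool that returns all matches would also report the later segments.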

Related

Regex Capture Middle Value

I would like to ask for your help...
I have this string from which I need to get the 4.75. I've tried many regex expressions and browsed lots of examples, but I could not get it to work.
Loan Amount Interest Rate
$336,550 4.75 %
So far, below is my current expression
1. (?<=Interest Rate\s*\n*)([^\s]+).+(?=%)
I'm getting the $336,550 4.75
2. ([^\s]+).(?=%)
This resulted in multiple matches. In my entire text, which I can't share, there is other data followed by % as well.
I am only after the 4.75. I know I could just select the first match in code (I guess), but for now that is not an option.
Thanks in advance!
Do you just need to extract "4.75 %"?
Try this:
(?<=Interest Rate\n\n\$\d{3},\d{3}\s)(\d{1,5}\.\d{1,5}\s%)
Since your regex with variable length patterns inside lookbehind works, you can use the following .NET compliant regex:
(?<=Interest Rate\s+\S+\s+)(\S+)(?=\s*%)
See the regex demo.
Details:
(?<=Interest Rate\s+\S+\s+) - a positive lookbehind that requires Interest Rate, one or more whitespaces, one or more non-whitespaces and again one or more whitespaces immediately to the left of the current location
(\S+) - Group 1: one or more non-whitespace chars
(?=\s*%) - a positive lookahead that requires zero or more whitespaces and then a % char immediately to the right of the current location.
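For readers who are not on .NET, here is a hedged sketch of the same idea in Python. Python's built-in re module rejects variable-width lookbehinds, so this version swaps the lookbehind for ordinary matching plus a capture group. The sample text and the names rate_pattern and rate are assumptions made for the example.

```python
import re

# Sample text laid out like the snippet in the question.
text = "Loan Amount Interest Rate\n$336,550 4.75 %"

# The context is matched normally and only the rate is captured,
# because variable-width lookbehind is not available in Python's re.
rate_pattern = re.compile(r"Interest Rate\s+\S+\s+(\S+)\s*%")

match = rate_pattern.search(text)
rate = match.group(1) if match else None
print(rate)
```

The value is then read from group 1 rather than from the whole match, which differs from the lookbehind version where the whole match is the value.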
Hi, please try this (the dot is escaped so that it matches only a literal decimal point):
[0-9]+\.[0-9]+

Regular Expression to Capture First Two Lines That Don't Include String

I am struggling to find a method to extract the first two lines of an address using a regular expression, where it doesn't include the word "Account".
If we take this address:
Company Name
Some Road
Some Town
I can use the regular expression (?:.*\s*){2} to return
Company Name
Some Road
Which is great.
However, if there is an extra line at the top, making the address become:
Accounts Payable
Company Name
Some Road
Some Town
Then it no longer picks up those two lines that I want.
I have tried the method here: Regular expression to match a line that doesn't contain a word? without success, and have also tried combinations of using things like (?!Account.*)(?:.*\s*){3}, but am having little success.
The Microsoft website https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference has masses of characters etc to use, but I haven't managed to get a combination working yet.
The closest I've got was using [^Account.*](?:.*\s*){3} which returns
s Payable
Company Name
Some Road
I just can't get it to remove the rest of that line! Any help would be appreciated. Thanks.
You may use a ^ with multiline mode on:
(?m)^(?!Accounts)(?:.*\n?){2}
Or (a bit more efficient and following best practices):
(?m)^(?!Accounts).*(?:\n.*)?
See the regex demo and this regex demo.
When (?m) is added to the pattern, ^ matches start of a line, and the whole pattern matches
^ - start of a line
(?!Accounts) - with no Accounts as the first word
(?:.*\n?){2} - two occurrences of any 0+ chars other than line break chars followed with an optional newline
.*(?:\n.*)? - matches a line and an optional subsequent line.
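A minimal sketch of the second pattern in Python, assuming the address lines are separated by \n. The question targets .NET, so this only illustrates the logic. The helper name first_two_lines is made up for the example.

```python
import re

# Multiline mode: ^ matches at the start of every line.
# The lookahead skips any line that begins with "Accounts".
pattern = re.compile(r"(?m)^(?!Accounts).*(?:\n.*)?")

def first_two_lines(address):
    """Return the first line not starting with 'Accounts' plus the line after it."""
    match = pattern.search(address)
    return match.group(0) if match else None

with_header = "Accounts Payable\nCompany Name\nSome Road\nSome Town"
without_header = "Company Name\nSome Road\nSome Town"

print(first_two_lines(with_header))
print(first_two_lines(without_header))
```

Both inputs give the same two lines, because the search simply moves on to the next line start when the lookahead rejects the first one.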

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work off is the linked image. The creators can't even confirm the flavour of regex it is; they believe it's Python, but I'm unsure.
Update
Unfortunately, none of the patterns below have worked so far. I tested each one live (rather than in a regex tester, thinking it might behave differently), but it still didn't work.
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
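The two steps could be sketched in Python as follows. This assumes a replace function is available, which the question's update says is not the case in the asker's tool. The helper name extract_id is hypothetical.

```python
import re

def extract_id(line):
    """Strip positioners such as ?21, then look for T followed by 7 digits."""
    cleaned = re.sub(r"\?[0-9]{2}", "", line)  # step 1: remove the positioners
    match = re.search(r"T[0-9]{7}", cleaned)   # step 2: match the cleaned text
    return match.group(0) if match else None

for line in ["T123?214567", "T?211234567", "T1234567"]:
    print(extract_id(line))
```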
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
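A small sketch of the concatenation step, assuming Python and at most one positioner per line. The function name join_around_positioner is made up for illustration.

```python
import re

# Group 1 is the text before the positioner, group 2 the text after it.
split_pattern = re.compile(r"(T.*?)\?\d{2}(.*)")

def join_around_positioner(line):
    """Concatenate the two captured pieces; return the line unchanged if no positioner is found."""
    match = split_pattern.match(line)
    if match is None:
        return line
    return match.group(1) + match.group(2)

for line in ["T123?214567", "T?211234567", "T1234567"]:
    print(join_around_positioner(line))
```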
First, match ?21 and replace it with a distinctive character such as #:
\?21
Then you may try this regex to find what you want:
(T(?:\d{7}|[\#\d]{8}))\s
The target string is captured in group 1 (or \1).
Finally, replace # with ?21 or something you like.
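A Python script might look like this:
import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre = re.compile(r'\?21')
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
# First pass: print the matches with the # placeholder still in place
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Second pass: restore the original ?21 positioner in each match
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))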
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group1group2 \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python
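A minimal sketch of the replacement described above, assuming Python's re module, where \1\2 in the replacement string refers to the two groups. The name strip_positioner is hypothetical.

```python
import re

# Group 1: T plus any digits before the optional positioner.
# Group 2: the digits after it.
pattern = re.compile(r"\b(T\d*)(?:\?\d{2})?(\d+)\b")

def strip_positioner(text):
    """Replace each match with group 1 followed by group 2."""
    return pattern.sub(r"\1\2", text)

print(strip_positioner("T123?214567"))
print(strip_positioner("T?211234567"))
print(strip_positioner("T0000001"))
```

When there is no positioner, the pattern still matches (the digits are split between the two groups by backtracking), so the value comes back unchanged.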
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

Extracting address with Regex

I'm trying to look for Street|St|Drive|Dr and then get all the contents of the line to extract the address:
(?:(?!\s{2,}|\$).)*(Street|St|Drive|Dr).*?(?=\s{2,})
.. but it also matches:
Full match 420-442 ` Tax Invoice/Statement`
Group 1. 433-435 `St`
Full match 4858-4867 `163.66 DR`
Group 1. 4865-4867 `DR`
Full match 11053-11089 ` Permanent Water Saving Plan, please`
Group 1. 11077-11079 `Pl`
How do I match only whole words and not substrings so it ignores words that contain those words (the first match for example).
One option is to use the word-boundary anchor, \b, to accomplish this:
(?:(?!\s{2,}|\$).)*\b(Street|St|Drive|Dr)\b.*?(?=\s{2,})
If you provide an example of the raw text you're parsing, I'll be able to give additional help if this doesn't work.
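To see what \b changes in isolation, here is a small sketch in Python that compares the bare alternation with the word-bounded one. The sample strings are invented for the example, and the rest of the original pattern is left out on purpose.

```python
import re

loose = re.compile(r"(Street|St|Drive|Dr)")       # matches inside longer words
strict = re.compile(r"\b(Street|St|Drive|Dr)\b")  # whole words only

samples = ["Tax Invoice/Statement", "12 Main St", "45 Ocean Drive"]

for sample in samples:
    loose_hit = loose.search(sample)
    strict_hit = strict.search(sample)
    print(sample,
          "| loose:", loose_hit.group(1) if loose_hit else None,
          "| strict:", strict_hit.group(1) if strict_hit else None)
```

The loose pattern finds St inside Statement, while the bounded pattern rejects it and still accepts the standalone St and Drive.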
Edit:
From the link you posted in a comment, it seems that the \b solution solves your question:
How do I match only whole words and not substrings so it ignores words that contain those words (the first match for example).
However, it seems like there are additional issues with your regex.

How to extract a numeric substring from a string but only if the previous string part matches a target

So I am trying to extract defect numbers from changeset comments in TFS. However, there are several ways people have entered the numbers:
"Defect 1321: blah blah blah"
"Fixes HPQC 1427. Logic modified"
"- Bug 976 - Customer"
I am not great with regexes, so any help would be appreciated. I prepare the string ahead of time by lowercasing it and stripping out the # and . characters, so I can be sure I am looking for something that starts with (defect|hpqc|bug), followed by an optional space (\s), then a number (\d), and ending with a space (\s). But this didn't work:
(defect|hpqc|bug)\s\d\s
I only want to find the first match.
I want to extract the numeric component but only if the previous word is a match.
I am sure this is a result of my trivial knowledge of regex creation.
Case matters (usually), you want more than one digit (\d+), and there is an optional number sign too, so something like this should work, depending on your system:
(Defect|HPQC|Bug)\s*#?\s*(\d+)
This allows spaces and # or neither before the digits, and captures the digits. It would help to know if you are using python or something else (tag your question).
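Since the language is not stated, here is a hedged sketch of that pattern in Python, run against the three sample comments from the question. Case-insensitive matching is an assumption added here so that the raw comments work without lowercasing them first. The helper name first_defect_number is made up.

```python
import re

# Group 1 is the keyword, group 2 the digits that follow it.
defect_pattern = re.compile(r"(Defect|HPQC|Bug)\s*#?\s*(\d+)", re.IGNORECASE)

def first_defect_number(comment):
    """Return the first number that follows a keyword, or None if there is none."""
    match = defect_pattern.search(comment)
    return int(match.group(2)) if match else None

comments = [
    "Defect 1321: blah blah blah",
    "Fixes HPQC 1427. Logic modified",
    "- Bug 976 - Customer",
]
for comment in comments:
    print(first_defect_number(comment))
```

search stops at the first match, which covers the requirement to find only the first defect number in each comment.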
I believe this regex should work for you:
(?:defect|hpqc|bug)\s+(\d+)\s+
Defect/Bug # is available in matched group #1
If you are looking only for the number after the keyword, here is a regex that might help...
(?<=(Defect|HPQC|Bug)\s*#?\s*)\d+
Good Luck!
To refine Beroe's answer:
(?:Defect|HPQC|Bug)\s*\#?\s*(\d+)
(?:Defect|HPQC|Bug) : match but don't capture
\# : the backslash escapes the # so that it is not read as the start of a comment
It works for me on Expresso