Get Text Starting From Last Occurrence of Certain Substring Leading to Match - regex

Given a long string that generally follows this syntax:
/C=US/foo=bar/var=1/CN=JONES.FRED.R.0123456789:xxj31ZMTZzkVA
/C=US/foo=pop/var=2/CN=BLAKE.DAPHNE.P.1234567890:xxj31ZMTZzkVA
/C=US/foo=bit/var=8/CN=BINKLEY.VELMA.W.2345678901:xxj31ZMTZzkVA
/C=US/foo=hat/var=17/CN=ROGERS.SHAGGY.N.3456789012:xxj31ZMTZzkVA
/C=US/foo=jam/var=39/CN=DOO.SCOOBY.D.4567890123:xxj31ZMTZzkVA
I want to capture what follows the previous occurrence of "/C=US/" that leads up to the last name + dot + first name that follows "CN=", and finally the text that precedes the colon (:). The last name, dot, and first name are not hard-coded but rather passed in from a variable.
For example, given "DOO.SCOOBY", I want to extract this text:
/C=US/foo=jam/var=39/CN=DOO.SCOOBY.D.4567890123
Here is the Regex I am using:
(?<=\/C=US\/)(.*?)(?=DOO.SCOOBY)+(.*?)+:
The problem is, it extracts ALL of the text preceding the match of "DOO.SCOOBY" to the colon, except for the very first "/C=US/". So, I nearly get the entire string back. It's also important to note there are no linebreaks or spaces in this string; it is all bunched together. How can I get text that only goes back as far as the previous "/C=US/"? I've searched plenty on regexes and specifically this scenario, but can't seem to find anything. It looks like I need to implement the positive lookbehind correctly.

You can use
\/C=US\/(?:(?!\/C=US\/).)*?DOO\.SCOOBY[^:]*
See the regex demo.
Details:
\/C=US\/ - a /C=US/ string
(?:(?!\/C=US\/).)*? - any single char, other than line break chars, zero or more but as few as possible occurrences, that does not start a /C=US/ substring
DOO\.SCOOBY - a DOO.SCOOBY string
[^:]* - zero or more chars other than :.
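Because the name is passed in from a variable, the pattern has to be assembled at run time. Below is a minimal Python sketch of that, assuming the standard re module; the helper name extract_record and the shortened sample text are illustrative and not part of the question. re.escape keeps the dot in the name literal.

```python
import re

def extract_record(text, name, marker="/C=US/"):
    """Return the record for `name`, starting at the nearest preceding marker."""
    m = re.escape(marker)
    pattern = (
        m                           # the record start, e.g. /C=US/
        + r"(?:(?!" + m + r").)*?"  # any chars that do not start another marker
        + re.escape(name)           # the literal name, e.g. DOO\.SCOOBY
        + r"[^:]*"                  # the rest of the record, up to the colon
    )
    match = re.search(pattern, text)
    return match.group(0) if match else None

# Hypothetical sample: three records bunched together without separators.
text = (
    "/C=US/foo=bar/var=1/CN=JONES.FRED.R.0123456789:xxj31ZMTZzkVA"
    "/C=US/foo=hat/var=17/CN=ROGERS.SHAGGY.N.3456789012:xxj31ZMTZzkVA"
    "/C=US/foo=jam/var=39/CN=DOO.SCOOBY.D.4567890123:xxj31ZMTZzkVA"
)
print(extract_record(text, "DOO.SCOOBY"))
```

The lookahead inside the repeated group is what stops the match from reaching back past the nearest /C=US/.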

Related

Regular expression to match a word that contains ONLY one colon

I am new to regex. Basically, I'd like to check whether a word has ONLY one colon.
If it has two or more colons, it should return nothing.
If it has one colon, it should be returned as is. (The colon must be in the middle of the string, not at the end or beginning.)
(1)
a:bc:de #return nothing or error.
a:bc #return a:bc
a.b_c-12/:a.b_c-12/ #return a.b_c-12/:a.b_c-12/
(2)
My thinking so far is below, but it seems too complicated.
^[^:]*(\:[^:]*){1}$
^[-\w.\/]*:[-\w\/.]* #this will not throw error when there are 2 colons.
Any directions would be helpful, thank you!
This will find such "words" within a larger sentence:
(?<= |^)[^ :]+:[^ :]+(?= |$)
See live demo.
If you just want to test the whole input:
^[^ :]+:[^ :]+$
To restrict to only alphanumeric, underscore, dashes, dots, and slashes:
^[\w./-]+:[\w./-]+$
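As a quick check, here is a small Python sketch that runs the restricted whole-input pattern against the samples from the question. The function name check is hypothetical, and returning None for "nothing" is an assumption about the desired behavior.

```python
import re

# Whole-input check: exactly one colon, with allowed characters on both sides.
ONE_COLON = re.compile(r"^[\w./-]+:[\w./-]+$")

def check(word):
    """Return the word if it has exactly one colon in the middle, else None."""
    return word if ONE_COLON.match(word) else None

for sample in ["a:bc:de", "a:bc", "a.b_c-12/:a.b_c-12/", ":abc", "abc:"]:
    print(repr(sample), "->", check(sample))
```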
I saw this as a good opportunity to brush up on my regex skills - so might not be optimal but it is shorter than your last solution.
This is the regex pattern: /^[^:]*:[^:]*$/gm and these are the strings I am testing against: 'oneco:on' (match) and 'one:co:on', 'oneco:on:', ':oneco:on' (these should all not match)
To explain what is going on, the ^ matches the beginning of the string, the $ matches the end of the string.
The [^:] bit says that any character that is not a colon will be matched.
In summary, ^[^:]* matches any number of non-colon characters (including none) from the beginning of the string, : matches a single colon, and [^:]*$ matches any number (*) of non-colon characters up to the end of the string. Because * also allows zero characters, a string such as ':on' would still match; use + instead of * if the colon must not be first or last.
To elaborate, it is because we specify the pattern to look for at the beginning and end of the string, surrounding the single colon we are looking for that only the first string 'oneco:on' is a match.
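To see how the quantifier affects the edge cases, the following Python sketch compares the pattern above with a variant that uses + instead of *. The variant and the two extra test strings (':on' and 'on:') are additions for illustration, not part of the answer.

```python
import re

star = re.compile(r"^[^:]*:[^:]*$")  # the pattern from the answer
plus = re.compile(r"^[^:]+:[^:]+$")  # variant: at least one char on each side

for s in ["oneco:on", "one:co:on", "oneco:on:", ":oneco:on", ":on", "on:"]:
    print(f"{s!r:12} star={bool(star.match(s))} plus={bool(plus.match(s))}")
```

Both patterns reject every string with two colons; they differ only when the single colon is the first or last character.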

Regex to match first part of string (up to the first occurrence of a space character) if it doesn't contain the sequence ;host=

I have this string: cpu.usage_system;cpu=cpu-total;host=host1 6.94024205748818 1626401140 (a graphite metric message with tag support).
I'm trying to match the first part of the string, up to the first occurrence of a space character... but only if that first part of the string doesn't contain ;host=.
I can match all characters up to the first occurrence of a space with ^([\S]+).
I have the feeling I should be using a negative lookahead to check for the absence of ;host= but I can't figure out how to put it all together.
The idea is to match the first part of the metric label (and tags) and see if it contains a host tag. If it does contain a host tag, leave it alone. If it doesn't, append one.
This may not be the most elegant solution, depending on what else you need the regex to do, but if you just want to exclude matching lines that contain ;host=, this lookahead should work:
^(?!.*;host=)([\S]+)
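One way to put that lookahead to work for the "append a host tag if missing" goal is a substitution. This is a Python sketch under assumptions: the function name ensure_host and the default host value "unknown" are made up for illustration.

```python
import re

# Matches the leading non-space run only when the line has no ;host= tag.
NO_HOST = re.compile(r"^(?!.*;host=)(\S+)")

def ensure_host(line, host="unknown"):
    """Append ;host=<host> to the metric path if no host tag is present."""
    return NO_HOST.sub(lambda m: m.group(1) + ";host=" + host, line, count=1)

tagged = "cpu.usage_system;cpu=cpu-total;host=host1 6.94024205748818 1626401140"
untagged = "cpu.usage_system;cpu=cpu-total 6.94024205748818 1626401140"
print(ensure_host(tagged))    # left alone
print(ensure_host(untagged))  # host tag appended before the first space
```

Note that .* in the lookahead scans the whole line, not just the part before the first space; that is harmless here because the value and timestamp fields cannot contain ;host=.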

Regex check for name Initials

I am trying to create a regex that checks if one or more middle-name initials have the following structure:
INITIAL.[BLANK]INITIAL.[BLANK]INITIAL.
There can be multiple Initials as long as they are followed by a dot (.) - blank spaces are only allowed between two initials (e.g. L. B.)
It should not be possible to have a space after an initial if there's no other initial following.
At the moment, I have the following Regex which doesn't work perfectly as of now:
([A-Z]\. (?=[A-Z]|$))+
Using regex101, this is an example:
As you can see, it still matches the string even though there's a blank space at the end, without having another Initial following.
I am not sure why this is happening. I am just learning regex and would be glad if anyone could provide me with a solution to my problem :)
The error you're seeing occurs because, at the last step, your expression reads in [A-Z]\. plus the trailing space, then looks ahead for $ (and finds it). I would express the pattern this way: (?:[A-Z]\. )*[A-Z]\.$. Treat the last initial specially because it does not have a final space.
The pattern you tried, ([A-Z]\. (?=[A-Z]|$))+, uses a repeated capturing group, which will give you the value of the last iteration only.
In that repetition you match [A-Z]\. followed by a space, effectively meaning that the space must be present in the match.
Instead, you could repeat 0+ times matching a char [A-Z], a dot, and a space to match multiple occurrences.
Then match a char [A-Z] and a dot, asserting that what is on the right is not a non-whitespace char.
\b(?:[A-Z]\. )*[A-Z]\.(?!\S)
Regex demo
If there can be multiple spaces but it should not match a newline:
\b(?:[A-Z]\.[^\S\r\n]*)*[A-Z]\.(?!\S)
Regex demo
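If the goal is to validate an entire field rather than find initials inside longer text, a full-string match is a simple option. The Python sketch below assumes that use case; valid_initials is a hypothetical helper name, and re.fullmatch takes the place of the \b and (?!\S) boundaries.

```python
import re

# Zero or more "X. " groups, then a final "X." with no trailing space.
INITIALS = re.compile(r"(?:[A-Z]\. )*[A-Z]\.")

def valid_initials(value):
    """True only if the whole value is initials separated by single spaces."""
    return INITIALS.fullmatch(value) is not None

for value in ["L.", "L. B.", "L. B. K.", "L. B. ", "L.B.", "L"]:
    print(repr(value), valid_initials(value))
```

With a plain search, the boundary-based pattern would still find "L. B." inside "L. B. " (the trailing space satisfies the lookahead), which is why the sketch matches the full string instead.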

Regex match till end of text

I'm using Regex to match whole sentences in a text containing a certain string. This is working fine as long as the sentence ends with any kind of punctuation. It does not work however when the sentence is at the end of the text without any punctuation.
This is my current expression:
[^.?!]*(?<=[.?\s!])string(?=[\s.?!])[^.?!]*[.?!]
Works for:
This is a sentence with string. More text.
Does not work for:
More text. This is a sentence with string
Is there any way to make this work as intended? I can't find any character class for "end of text".
End of text is matched by the anchor $, not a character class.
You have two separate issues you need to address: (1) the sentence ending directly after string, and (2) the sentence ending sometime after string but with no end-of-sentence punctuation.
To do this, you need to make the match after string optional, but anchor that match to the end of the string. This also means that, after you recognize an (optional) end-of-sentence punctuation mark, you need to match everything that follows, so the end-of-string anchor will match.
My changes: Take everything after string in your original regex and surround it in (?:...)? - the (?:...) being a "non-remembered" group, and the ? making the entire group optional. Follow that with $ to anchor the end of the string.
Within that optional group, you also need to make the end-of-sentence itself optional, by replacing the simple [.?!] with (?:[.?!].*)? - again, the (?:...) is to make a "non-remembered" group, the ? makes the group optional - and the .* allows this to match as much as you want after the end-of-sentence has been found.
[^.?!]*(?<=[.?\s!])string(?:(?=[\s.?!])[^.?!]*(?:[.?!].*)?)?$
The symbol for end-of-text is $ (and, the symbol for beginning-of-text, if you ever need it, is ^).
You probably won't get what you're looking for by just adding the $ to your punctuation list, though (e.g., [.?!$]), because inside a character class $ is a literal dollar sign; it works better as an alternative choice: ([.?!]|$).
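Here is a Python sketch of the alternation idea applied to the question's expression. It is one possible adaptation, not the asker's final regex: both the lookahead after string and the closing punctuation get an "or end of text" branch. Like the original, it still requires some character before string because of the lookbehind.

```python
import re

# Sentence containing the word "string"; the sentence may end with
# punctuation or simply with the end of the text.
SENTENCE = re.compile(
    r"[^.?!]*(?<=[.?\s!])string(?=[\s.?!]|$)[^.?!]*(?:[.?!]|$)"
)

def find_sentence(text):
    """Return the first sentence containing 'string', or None."""
    m = SENTENCE.search(text)
    return m.group(0).strip() if m else None

print(find_sentence("This is a sentence with string. More text."))
print(find_sentence("More text. This is a sentence with string"))
```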
Your regex is way too complex for what you want to achieve.
To match only a word just use
"\bstring\b"
It will match start, end and any non-alphanum delimiters.
It works with the following:
string is at the start
this is the end string
this is a string.
stringing won't match (you don't want a match here)
You should mention the language in the question, since the way to use a regex differs between languages.
Here is my example using javascript:
var reg = /^([\w\s\.]*)string([\w\s\.]*)$/;
console.log(reg.test('This is a sentence with string. More text.'));
console.log(reg.test('More text. This is a sentence with string'));
console.log(reg.test('string'))
Note:
* : Match zero or more times.
? : Match zero or one time.
+ : Match one or more times.
You can replace * with ? or + if you want to be more specific.

Why there are two matches

I expected one match, but there are two. That's strange. I want to know why.
Why are you surprised? .* matches any number of characters, including 0.
So you get one match that contains the entire line, and a second match that contains the empty string between the first match and the end of the string.
Regular expressions don't just deal with characters, but also with positions between characters (the tokens that match positions are known as anchors). For example, ^ matches the position before the first character, and $ matches the position after the last character in a string.
A regex engine "walks through" a string, starting from the position before the first character. It then steps forward one character at a time.
For example, when applying the regex .* to "Hello", the regex engine starts before the H. It then matches Hello - after that .* can't match any more characters, so the regex engine returns "Hello" as the first match. The regex engine is now positioned after the o. If you call it again and ask it to match, it will succeed in returning a match because you're asking it to match any string, even an empty one, from the current position - and that's possible.
Why doesn't the regex engine return an infinite number of empty strings, then? It checks whether the last match was started from the end of the string, and if it was, no further matches will be attempted.
Some languages don't even try a regex match once from the final position in a string (Ruby seems to be one example), but I'd say it's more correct to return two matches.
Since it appears more clarification is necessary: The regex engine steps through the string along the positions visualized by |s below:
"|H|e|l|l|o|"
 ^ Position before the first character
           ^ Position after the last character
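The two matches are easy to reproduce. This short Python sketch uses the standard re module (other engines may behave differently, as noted above for Ruby) and shows both the matched text and the positions involved.

```python
import re

# findall reports the whole line, then an empty match at the final position.
matches = re.findall(r".*", "Hello")
print(matches)  # ['Hello', '']

# finditer exposes where each match starts and ends.
spans = [m.span() for m in re.finditer(r".*", "Hello")]
print(spans)  # [(0, 5), (5, 5)]

# Requiring at least one character removes the empty match.
non_empty = re.findall(r".+", "Hello")
print(non_empty)  # ['Hello']
```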