Is there an easier way to find regex of the string? - regex

I'm just beginning with regex and got stuck in a difficult situation.
The string I have is:
ITEM DESCRIPTION: KING AUTHUR 2LB FLOUR PACK: 10 SIZE: 0011.00 OZ
I need to get the parts within "< >":
ITEM DESCRIPTION: <KING AUTHUR 2LB FLOUR> PACK: <10> SIZE: <0011.00 OZ>
I've tried
: *([\w\.]+ ?[\w]* [\d\w]* *[\w]*)
which is not 100% accurate, feels repetitive, and becomes tedious as the text gets longer (multiple key:value pairs).
Is there a generalized way of getting all the values from key:value pairs in a text of indefinite length?
Also, why doesn't something like (ITEM).*: stop at ITEM DESCRIPTION: but instead select all the way up to ITEM DESCRIPTION: ... SIZE: when I just want to get the first key?

Here is one way with a PCRE-compatible regex:
:\s*\K.*?(?=\s*\w+:|$)
See the regex demo.
A pattern compatible with ECMAScript 2018+, .NET, and the PyPI regex module for Python is
(?<=:\s*\b).*?(?=\s*\w+:|$)
See this regex demo.
For other engines, you may rely on a capturing group:
:\s*(.*?)(?=\s*\w+:|$)
See the regex demo.
Details:
:\s* - a colon and zero or more whitespaces
\K - a match reset operator that discards the whole text matched so far from the match memory buffer
(?<=:\s*\b) - a positive lookbehind that matches a location that is immediately preceded with :, zero or more whitespaces and a word boundary
.*? - any zero or more chars other than line break chars as few as possible
(?=\s*\w+:|$) - a positive lookahead that matches a location in string that is immediately followed with zero or more whitespaces, one or more word chars and then a colon, or end of string.
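As a quick check of the capturing variant, here is a minimal Python sketch. Python's built-in re module is an assumption here: it supports neither \K nor variable-width lookbehind, so the capture-group form is used. The sketch also illustrates the (ITEM).*: part of the question: .* is greedy, so it runs to the last colon on the line, while the lazy .*? stops at the first one.

```python
import re

text = "ITEM DESCRIPTION: KING AUTHUR 2LB FLOUR PACK: 10 SIZE: 0011.00 OZ"

# Skip ':' and spaces, then lazily capture text up to the next "key:" or the end of string.
values = re.findall(r":\s*(.*?)(?=\s*\w+:|$)", text)
print(values)  # ['KING AUTHUR 2LB FLOUR', '10', '0011.00 OZ']

# Greedy .* takes everything and backtracks from the end, so it stops at the LAST colon.
greedy = re.match(r"(ITEM).*:", text).group(0)
# Lazy .*? grows one character at a time, so it stops at the FIRST colon.
lazy = re.match(r"(ITEM).*?:", text).group(0)
print(greedy)  # ITEM DESCRIPTION: KING AUTHUR 2LB FLOUR PACK: 10 SIZE:
print(lazy)    # ITEM DESCRIPTION:
```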

Pattern to match everything except a string of 5 digits

I only have access to a function that can match a pattern and replace it with some text:
Syntax
regexReplace('text', 'pattern', 'new text')
And I need to return only the 5 digit string from text in the following format:
CRITICAL - 192.111.6.4: rta nan, lost 100%
Created Time Tue, 5 Jul 8:45
Integration Name CheckMK Integration
Node 192.111.6.4
Metric Name POS1
Metric Value DOWN
Resource 54871
Alert Tags 54871, POS1
So from this text, I want to replace everything with "" except the "54871".
I have come up with the following:
regexReplace("{{ticket.description}}", "\w*[^\d\W]\w*", "")
Which almost works, but it doesn't match the symbols. How can I change this to match any word that includes a letter or a symbol?
The pattern I have is very close; I just need it to cover special characters as well as letters, whereas it currently covers only letters.
You can match the whole string but capture the 5-digit number into a capturing group and replace with the backreference to the captured group:
regexReplace("{{ticket.description}}", "^(?:[\w\W]*\s)?(\d{5})(?:\s[\w\W]*)?$", "$1")
See the regex demo.
Details:
^ - start of string
(?:[\w\W]*\s)? - an optional substring of any zero or more chars as many as possible and then a whitespace char
(\d{5}) - Group 1 ($1 contains the text captured by this group pattern): five digits
(?:\s[\w\W]*)? - an optional substring of a whitespace char and then any zero or more chars as many as possible.
$ - end of string.
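A minimal sketch of the same idea (match everything, keep only the group), using Python's re.sub as a stand-in for regexReplace. This is an assumption: the actual tool's regex flavour and replacement syntax may differ, and Python writes the backreference as \1 rather than $1.

```python
import re

description = "\n".join([
    "CRITICAL - 192.111.6.4: rta nan, lost 100%",
    "Created Time Tue, 5 Jul 8:45",
    "Integration Name CheckMK Integration",
    "Node 192.111.6.4",
    "Metric Name POS1",
    "Metric Value DOWN",
    "Resource 54871",
    "Alert Tags 54871, POS1",
])

pattern = r"^(?:[\w\W]*\s)?(\d{5})(?:\s[\w\W]*)?$"

# The pattern consumes the whole text; the replacement keeps only Group 1.
result = re.sub(pattern, r"\1", description)
print(result)  # 54871
```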
The easiest regex is probably:
^(.*\D)?(\d{5})(\D.*)?$
You can then replace the string with "$2" ("\2" in other languages) to only place the contents of the second capture group (\d{5}) back.
The only issue is that . doesn't match newline characters by default. Normally you can pass a flag to change . to match ALL characters. For most regex variants this is the s (single line) flag (PCRE, Java, C#, Python). Other variants use the m (multi line) flag (Ruby). Check the documentation of the regex variant you are using for verification.
However, the question suggests that you're not able to pass flags separately, in which case you could pass them as part of the regex itself.
(?s)^(.*\D)?(\d{5})(\D.*)?$
regex101 demo
(?s) - Set the s (single line) flag for the remainder of the pattern, which enables . to match newline characters ((?m) for Ruby).
^ - Match the start of the string (\A for Ruby).
(.*\D)? - [optional] Match anything followed by a non-digit and store it in capture group 1.
(\d{5}) - Match 5 digits and store it in capture group 2.
(\D.*)? - [optional] Match a non-digit followed by anything and store it in capture group 3.
$ - Match the end of the string (\z for Ruby).
This regex will result in the last 5-digit number being stored in capture group 2. If you want to use the first 5-digit number instead, make the quantifier in (.*\D)? lazy, so that it becomes (.*?\D)?.
(?s) is supported by most regex variants, but not all. Refer to the regex variant documentation to see if it's available for you.
An example where inline flags are not available is JavaScript. In such a scenario you need to replace . with something that matches ALL characters. In JavaScript [^] can be used. For other variants this might not work and you need to use [\s\S].
With all this out of the way, assuming a language that can use "$2" as the replacement, where you do not need to escape backslashes, and a regex variant that supports an inline (?s) flag, the answer would be:
regexReplace("{{ticket.description}}", "(?s)^(.*\D)?(\d{5})(\D.*)?$", "$2")
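To illustrate the last-versus-first point, here is a small Python sketch. Python's re stands in for the unspecified engine, and the sample text with two different 5-digit numbers is made up for the illustration.

```python
import re

# Hypothetical input with two different 5-digit numbers on separate lines.
text = "Resource 11111\nAlert Tags 22222, POS1"

# Greedy (.*\D)? grabs as much as possible, leaving the LAST 5-digit number for group 2.
last = re.sub(r"(?s)^(.*\D)?(\d{5})(\D.*)?$", r"\2", text)

# Lazy (.*?\D)? grabs as little as possible, leaving the FIRST 5-digit number for group 2.
first = re.sub(r"(?s)^(.*?\D)?(\d{5})(\D.*)?$", r"\2", text)

print(last)   # 22222
print(first)  # 11111
```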

Capture number if string contains "X", but limit match (cannot use groups)

I need to extract numbers like 2.268 out of strings that contain the word output:
Approxmiate output size of output: 2.268 kilobytes
But ignore it in strings that don't:
some entirely different string: 2.268 kilobytes
This regex:
(?:output.+?)([\d\.]+)
Gives me a match with 1 group, with the group being 2.268 for the target string. But since I'm not using a programming language but rather CloudWatch Log Insights, I need a way to only match the number itself without using groups.
I could use a positive lookbehind ?<= in order to not consume the string at all, but then I don't know how to throw away size of output: without using .+, which positive lookbehind doesn't allow.
With your shown samples, please try the following regex.
output:\D+\K\d+(?:\.\d+)?
Online demo for above regex
Explanation:
output:\D+ ##Match output, a colon, then one or more non-digits.
\K ##Forget everything matched so far, so that only what follows is returned as the match.
\d+(?:\.\d+)? ##Match one or more digits followed by an optional dot and digits.
Since you are using PCRE, you can use
output.*?\K\d[\d.]*
See the regex demo. This matches
output - a fixed string
.*? - any zero or more chars other than line break chars, as few as possible
\K - match reset operator that removes all text matched so far from the overall match memory buffer
\d - a digit
[\d.]* - zero or more digits or periods.
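Both patterns above rely on \K, which needs a PCRE-style engine. As a cross-check only: Python's built-in re module has no \K, so the sketch below verifies the same matching logic with a capture group instead. It does not solve the no-groups constraint; it only shows which text the \K patterns are expected to return. The helper name extract_number is hypothetical.

```python
import re

# \K is unavailable in Python's re, so the part after \K is wrapped in a group.
pattern = re.compile(r"output.*?(\d[\d.]*)")

def extract_number(line):
    """Return the number following 'output', or None if the line has no 'output'."""
    m = pattern.search(line)
    return m.group(1) if m else None

print(extract_number("Approxmiate output size of output: 2.268 kilobytes"))  # 2.268
print(extract_number("some entirely different string: 2.268 kilobytes"))     # None
```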

Why my Regex is only giving me ONE group back?

I'm currently having issues with a regex that I'm creating. The regex has to extract all the groups of the form number #### between Hello and Regards. At the moment my regex only extracts one group and I need all the groups inside; in this case there are 2, but there may be more.
I'm using the web page https://regex101.com/
Flavor: PCRE (PHP)
Regex: Hello\s.*(number\s*[\d]*)\s.*Regards
Text:
This is my test text number 25120
Hello my name is testing
I'm 20 years old
Please help me with the regex number 1542
I have been trying to create the regex many times this is my number 5152
Regards
I'm still trying my attempt number 5150
Result:
My result is only the group number 5152, but there is another group inside, number 1542.
You may use
(?si)(?:\G(?!\A)|\bHello\b)(?:(?!\bHello\b).)*?\K\bnumber\s*\d+(?=.*?\bRegards\b)
See the regex demo.
Details
(?si) - s - DOTALL modifier making . match any chars, and i makes the pattern case insensitive
(?:\G(?!\A)|\bHello\b) - either the end of the previous match (\G(?!\A)) or (|) a whole word Hello (\bHello\b)
(?:(?!\bHello\b).)*? - any char, 0 or more times but as few as possible, that does not start a whole word Hello char sequence
\K - match reset operator that discards all text matched so far
\bnumber - a whole word number
\s* - 0+ whitespaces
\d+ - 1+ digits
(?=.*?\bRegards\b) - there must be a whole word Regards somewhere after any 0+ chars (as few as possible).
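\G and \K are not available in every engine. Where they are missing, a two-step approach gives the same result on the sample text. Note that this is a different technique from the single regex above, sketched in Python as one possible alternative: first isolate the block between Hello and Regards, then collect every number #### inside it.

```python
import re

text = """This is my test text number 25120
Hello my name is testing
I'm 20 years old
Please help me with the regex number 1542
I have been trying to create the regex many times this is my number 5152
Regards
I'm still trying my attempt number 5150"""

# Step 1: take the text between the first whole word Hello and the next whole word Regards.
block = re.search(r"\bHello\b(.*?)\bRegards\b", text, flags=re.S | re.I)

# Step 2: collect every "number <digits>" inside that block only.
numbers = re.findall(r"\bnumber\s*\d+", block.group(1), flags=re.I) if block else []
print(numbers)  # ['number 1542', 'number 5152']
```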

Regex for finding words with no or only one word between them

I need to find, in multiple strings, two words with either no words or only one word between them. I created this regex to check whether those two words exist in the string:
^(?=[\s\S]*\bFirst\b)(?=[\s\S]*\bSecond\b)[\s\S]+
and it works correctly.
Then I tried to insert in this regex additional code:
^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b)[\s\S]+
but it didn't work. It also selects text with two or more words between the searched words, which is not what I need.
First Second - must be selected
First word1 Second - must be selected
First word1 word2 Second - must not be selected, but my regex selects it.
Can I get advice on how to solve this problem?
Root cause
You should bear in mind that lookarounds match strings without moving along the string, they "stand their ground". Once you write ^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b), the execution is as follows:
^ - the regex engine checks if the current position is the start of string
(?=[\s\S]*\bFirst\b) - the positive lookahead requires the presence of any 0+ chars followed with a whole word First - note that the regex index is still at the start of the string after the lookahead returns true or false
(\b\w+\b){0,1} - this subpattern is checked only if the above check was true (i.e. there is a whole word First somewhere) and matches (consumes, moves the regex index) 1 or 0 occurrences of a whole word (i.e. there must be 1 or more word chars right at the string start)
(?=[\s\S]*\bSecond\b) - another positive lookahead that makes sure there is a whole word Second somewhere after the first whole word consumed with \b\w+\b - if any. Even if the word Second is the first word in the string, this will return true since backtracking will step back the word matched with (\b\w+\b){0,1} (see, it is optional), and the Second will get asserted, and [\s\S]+ will grab the whole string (Group 1 will be empty). See the regex demo with Second word word2 First string.
So, your approach cannot guarantee the order of First and Second in the string, they are just required to be present but not necessarily in the order you expect.
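The effect can be reproduced with a short Python sketch (Python's re is used here only as a convenient engine): the original pattern accepts a string in which Second comes before First, as well as one with two words in between.

```python
import re

original = r"^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b)[\s\S]+"

# Each lookahead only asserts that its word exists somewhere ahead, so neither
# the order of the words nor the distance between them is ever checked.
wrong_order = re.search(original, "Second word word2 First")
too_far = re.search(original, "First word1 word2 Second")

print(wrong_order is not None)  # True
print(too_far is not None)      # True
```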
Solution
If you need to check the order of First and Second in the string, you need to combine all the checks into one single lookahead. This approach might turn out to be very inefficient with longer strings and multiple alternatives in the lookaround, so consider either unrolling the patterns or trying multiple regex patterns (like this pseudo-code: if /\bFirst\b/.finds_match().index < /\bSecond\b/.finds_match().index => Good, go on...).
If you plan to go on with the regex approach, you may match a string that contains First....Second only in this order:
^(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b)[\s\S]+
See the regex demo
Details:
^ - start of string
(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b) - there must be:
[\s\S]* - any zero or more chars, as many as possible, up to the last occurrence of the subpatterns that follow
\bFirst - whole word First
(?:\W+\w+)? - optional sequence (1 or 0 occurrences) of 1+ non-word chars and 1+ word chars
\W+ - 1+ non-word chars
Second\b - Second as a whole word
[\s\S]+ - any 1 or more characters (empty string won't match).
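A small Python sketch that runs the pattern against the sample strings from the question, plus one reversed-order string. Python's re is an assumption about the engine; the pattern itself uses only widely supported syntax.

```python
import re

pattern = re.compile(r"^(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b)[\s\S]+")

samples = [
    "First Second",              # no word in between
    "First word1 Second",        # one word in between
    "First word1 word2 Second",  # two words in between
    "Second word word2 First",   # wrong order
]

# Map each sample to whether the pattern accepts it.
results = {s: bool(pattern.search(s)) for s in samples}
for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

Only the first two samples are accepted.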

Only match unique string occurrences

We are doing some Data Loss Prevention for emails, but the issue is when people reply to emails multiple times sometimes the credit card number or account number will appear multiple times.
How can we get a Java regex to match each string only once?
For example, we are using the following regex to catch account numbers that consist of 2 letters followed by 5 or 6 digits. It also excludes numbers starting with CR in either case (CR or cr).
\b(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}\b
How can we have it find:
CX12345
CX14584
JB145888
JD748452
CX12345 (ignore, as it was already found above)
LM45855
Unique string occurrence can be matched with
<STRING_PATTERN>(?!.*<STRING_PATTERN>) // Find the last occurrence
(?<!<STRING_PATTERN>.*)<STRING_PATTERN> // Find the first occurrence, only works in regex
// that supports infinite-width lookbehind patterns
where <STRING_PATTERN> is the pattern whose unique occurrence one searches for. Note that both will work with the .NET regex library, but the second one is not supported by the majority of other libraries (only the PyPI regex library for Python and JavaScript ECMAScript 2018+ regex support it). Note also that . does not match line break chars by default, so you need to pass a modifier like DOTALL: in most libraries you may add the (?s) modifier inside the pattern (in Ruby, (?m) does the same), or use specific flags that you pass to the regex compile method. See more about this in How do I match any character across multiple lines in a regular expression?
You seem to need a regex like this:
/\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b)/
The regex demo is available here
Details:
\b - a leading word boundary
((?!CR|cr)[A-Za-z]{2}\d{5,6}) - Group 1 capturing
(?!CR|cr) - the next two characters cannot be CR or cr, the negative lookahead check
[A-Za-z]{2} - 2 ASCII letters
\d{5,6} - 5 to 6 digits
\b - trailing word boundary
(?![\s\S]*\b\1\b) - a negative lookahead that fails the match if there are any 0+ chars ([\s\S]*) followed with a word boundary (\b), same value captured into Group 1 (with the \1 backreference), and a trailing word boundary.
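A minimal Python sketch of this pattern on the sample list (Python's re is used as a stand-in for the Java engine; the CR99999 entry is made up to exercise the exclusion). Because the lookahead rejects a value that appears again later, it is the last occurrence of each duplicate that is kept.

```python
import re

text = "CX12345\nCX14584\nJB145888\nJD748452\nCX12345\nCR99999\nLM45855"

# Group 1 is the account number; the negative lookahead rejects it
# if the same value appears again later in the text.
pattern = re.compile(r"\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b)")

unique = pattern.findall(text)
print(unique)  # ['CX14584', 'JB145888', 'JD748452', 'CX12345', 'LM45855']
```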
I would use a Map of some sort here, to keep a tally of the strings you encounter. For example:
import java.util.HashMap;
import java.util.Map;
String ccNumber = "CX12345";
Map<String, Boolean> ccMap = new HashMap<>();
// Record the candidate only if it has the expected shape and does not start with CR/cr
if (ccNumber.matches("^(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}$")) {
    ccMap.put(ccNumber, Boolean.TRUE); // a repeated key overwrites the entry, so each number is stored once
}
Then just iterate over the key set of the map to get the unique numbers which matched the pattern in your regex:
for (String key : ccMap.keySet()) {
    System.out.println("Found a matching credit card: " + key);
}
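The same keep-a-tally idea as a compact Python sketch, for comparison (an illustrative alternative, not a translation of the Java code): collect all matches, then drop repeats while keeping first-seen order. The sample text and the CR99999 entry are made up for the example.

```python
import re

text = "CX12345 CX14584 JB145888 JD748452 CX12345 CR99999 LM45855"

pattern = re.compile(r"\b(?!CR|cr)[A-Za-z]{2}[0-9]{5,6}\b")

# dict.fromkeys keeps insertion order and silently drops duplicate keys.
unique_in_order = list(dict.fromkeys(pattern.findall(text)))
print(unique_in_order)  # ['CX12345', 'CX14584', 'JB145888', 'JD748452', 'LM45855']
```

Unlike the lookahead-based regex, this keeps the first occurrence of each number.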