Pattern to match everything except a string of 5 digits


I only have access to a function that can match a pattern and replace it with some text:
Syntax
regexReplace('text', 'pattern', 'new text')
And I need to return only the 5 digit string from text in the following format:
CRITICAL - 192.111.6.4: rta nan, lost 100%
Created Time Tue, 5 Jul 8:45
Integration Name CheckMK Integration
Node 192.111.6.4
Metric Name POS1
Metric Value DOWN
Resource 54871
Alert Tags 54871, POS1
So from this text, I want to replace everything with "" except the "54871".
I have come up with the following:
regexReplace("{{ticket.description}}", "\w*[^\d\W]\w*", "")
Which almost works, but it doesn't match the symbols. How can I change this to match any word that includes a letter or a symbol?
As you can see, the pattern I have is very close. I just need it to include special characters as well as letters, whereas currently it only handles letters.

You can match the whole string but capture the 5-digit number into a capturing group and replace with the backreference to the captured group:
regexReplace("{{ticket.description}}", "^(?:[\w\W]*\s)?(\d{5})(?:\s[\w\W]*)?$", "$1")
See the regex demo.
Details:
^ - start of string
(?:[\w\W]*\s)? - an optional substring of any zero or more chars as many as possible and then a whitespace char
(\d{5}) - Group 1 ($1 contains the text captured by this group pattern): five digits
(?:\s[\w\W]*)? - an optional substring of a whitespace char and then any zero or more chars as many as possible.
$ - end of string.
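As a rough illustration, here is the same pattern applied with Python's re.sub instead of the asker's regexReplace function (whose regex engine is not stated in the question). The names TICKET and extract_five_digits are made up for this sketch, and Python writes the backreference as \1 rather than $1.

```python
import re

# Sample ticket text from the question.
TICKET = """CRITICAL - 192.111.6.4: rta nan, lost 100%
Created Time Tue, 5 Jul 8:45
Integration Name CheckMK Integration
Node 192.111.6.4
Metric Name POS1
Metric Value DOWN
Resource 54871
Alert Tags 54871, POS1"""

# Whole-string match; only the whitespace-delimited 5-digit run is captured.
PATTERN = re.compile(r"^(?:[\w\W]*\s)?(\d{5})(?:\s[\w\W]*)?$")


def extract_five_digits(text: str) -> str:
    """Replace the entire text with capture group 1 (hypothetical helper)."""
    return PATTERN.sub(r"\1", text)


print(extract_five_digits(TICKET))
```

If the text contains no 5-digit run delimited by whitespace, the pattern does not match and the text comes back unchanged.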

The easiest regex is probably:
^(.*\D)?(\d{5})(\D.*)?$
You can then replace the string with "$2" ("\2" in other languages) to only place the contents of the second capture group (\d{5}) back.
The only issue is that . doesn't match newline characters by default. Normally you can pass a flag to change . to match ALL characters. For most regex variants this is the s (single line) flag (PCRE, Java, C#, Python). Other variants use the m (multi line) flag (Ruby). Check the documentation of the regex variant you are using for verification.
However, the question suggests that you're not able to pass flags separately, in which case you can pass them as part of the regex itself:
(?s)^(.*\D)?(\d{5})(\D.*)?$
regex101 demo
(?s) - Set the s (single line) flag for the remainder of the pattern, which enables . to match newline characters ((?m) for Ruby).
^ - Match the start of the string (\A for Ruby).
(.*\D)? - [optional] Match anything followed by a non-digit and store it in capture group 1.
(\d{5}) - Match 5 digits and store it in capture group 2.
(\D.*)? - [optional] Match a non-digit followed by anything and store it in capture group 3.
$ - Match the end of the string (\z for Ruby).
This regex will result in the last 5-digit number being stored in capture group 2. If you want the first 5-digit number instead, use a lazy quantifier in (.*\D)?, so that it becomes (.*?\D)?.
(?s) is supported by most regex variants, but not all. Refer to the regex variant documentation to see if it's available for you.
An example where inline flags are not available is JavaScript. In that scenario you need to replace . with something that matches ALL characters. In JavaScript [^] can be used. In other variants this might not work, and you need to use [\s\S] instead.
With all this out of the way, and assuming a language that accepts "$2" as the replacement, does not require escaping backslashes, and uses a regex variant that supports the inline (?s) flag, the answer would be:
regexReplace("{{ticket.description}}", "(?s)^(.*\D)?(\d{5})(\D.*)?$", "$2")
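To make the greedy versus lazy difference concrete, here is a small Python sketch. Python is only a stand-in for the unknown engine behind regexReplace; it supports the inline (?s) flag and uses \2 instead of $2. The sample text and the helper name keep_group_two are invented for illustration.

```python
import re

# Two different 5-digit runs on separate lines (made-up sample).
TEXT = "first 11111 here\nsecond 22222 there"

# Greedy prefix: .* grabs everything and backtracks, so the LAST run is captured.
LAST = re.compile(r"(?s)^(.*\D)?(\d{5})(\D.*)?$")
# Lazy prefix: .*? grows from the start, so the FIRST run is captured.
FIRST = re.compile(r"(?s)^(.*?\D)?(\d{5})(\D.*)?$")


def keep_group_two(pattern: re.Pattern, text: str) -> str:
    """Replace the whole text with capture group 2 (hypothetical helper)."""
    return pattern.sub(r"\2", text)


print(keep_group_two(LAST, TEXT))
print(keep_group_two(FIRST, TEXT))
```

Because of the surrounding \D checks, a longer run such as 123456 is not treated as a 5-digit number, and the text is then left unchanged.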

Related

Identify and replace non-ASCII characters between brackets

I have tags (only ASCII chars inside brackets) of the following structure: [Root.GetSomething], instead, some contributors ended up submitting contributions with Cyrillic chars that look similar to Latin ones, e.g. [Rооt.GеtSоmеthіng].
I need to locate, and then replace those inconsistencies with the matching ASCII characters inside the brackets.
I tried \[([АаІіВСсЕеРТтОоКкХхМ]+)\]; (\[)([^\x00-\x7F]+)(\]), and some variations of the range but those searches don't see any matches. I seem to be missing something important in the regex execution logic.

You can use a regex matching any "interesting" Cyrillic char in between [ + letters or . + ] and a conditional replacement pattern:
Find What: (?:\G(?!\A)|\[)[a-zA-Z.]*\K(?:(А)|(а)|(І)|(і)|(В)|(С)|(с)|(Е)|(е)|(Р)|(Т)|(т)|(О)|(о)|(К)|(к)|(Х)|(х)|(М))(?=[[:alpha:].]*])
Replace With: (?1A:?2a:?3I:?4i:?5B:?6C:?7c:?8E:?9e:?{10}P:?{11}T:?{12}t:?{13}O:?{14}o:?{15}K:?{16}k:?{17}X:?{18}x:?{19}M)
Make sure the Match Case option is ON.
Details:
(?:\G(?!\A)|\[) - end of the previous successful match or a [ char
[a-zA-Z.]* - zero or more . or ASCII letters
\K - match reset operator that discards the currently matched text from the overall match memory buffer
(?:(А)|(а)|(І)|(і)|(В)|(С)|(с)|(Е)|(е)|(Р)|(Т)|(т)|(О)|(о)|(К)|(к)|(Х)|(х)|(М)) - a non-capturing group containing 19 alternatives each of which is put into a separate capturing group
(?=[[:alpha:].]*]) - a positive lookahead that requires zero or more letters or . and then a ] char immediately to the right of the current location.
The (?1A:?2a:?3I:?4i:?5B:?6C:?7c:?8E:?9e:?{10}P:?{11}T:?{12}t:?{13}O:?{14}o:?{15}K:?{16}k:?{17}X:?{18}x:?{19}M) replacement pattern replaces А (\u0410) with A if Group 1 matched, а (\u0430) with a if Group 2 matched, etc.
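Outside an editor that supports conditional replacement patterns, a different technique can produce the same result. The sketch below is a swapped-in approach, not the answer's method: it finds each bracketed tag with a regex and translates the look-alike characters with a lookup table. The mapping covers only the 19 characters listed above, and the names CYRILLIC, ASCII, TABLE and fix_tags are invented for the example.

```python
import re

# The 19 Cyrillic look-alikes from the pattern above, written as escapes
# so they cannot be confused with the Latin letters they resemble.
CYRILLIC = (
    "\u0410\u0430\u0406\u0456\u0412\u0421\u0441\u0415\u0435\u0420"
    "\u0422\u0442\u041e\u043e\u041a\u043a\u0425\u0445\u041c"
)
ASCII = "AaIiBCcEePTtOoKkXxM"
TABLE = str.maketrans(CYRILLIC, ASCII)


def fix_tags(text: str) -> str:
    """Translate look-alikes only inside [...] tags (hypothetical helper)."""
    return re.sub(
        r"\[[^\[\]]*\]",
        lambda m: m.group(0).translate(TABLE),
        text,
    )


broken = "[R\u043e\u043et.G\u0435tS\u043em\u0435th\u0456ng]"
print(fix_tags(broken))
```

Text outside the brackets is left alone, so legitimate Cyrillic prose elsewhere in the file is not affected.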

Regex to check number of spaces after full stop - Strictly 2 required

I need to find occurrences where I have put one whitespace after a full stop and replace it with 2 spaces. I have the regex for it, but Atom seems to call it invalid.
(?<=\.|\") {1,}(?=[a-zA-Z])
Conditions:
1 space after a period.
If the period comes with a closing double quote, then 1 space after the quote.
The above regex works perfectly for my conditions; however, Atom is not able to validate it. I need to use it on existing files.

You may use
([."]) ([a-zA-Z])
and replace with "$1  $2" (note the two spaces between $1 and $2). See the regex demo.
Details
([."]) - Group 1 (its value is referred to with $1 backreference from the replacement pattern): . or "
(a literal space) - a single space character (use \s to match any whitespace)
([a-zA-Z]) - Group 2 ($2): an ASCII letter.
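For a quick check outside the editor, the same find-and-replace can be sketched in Python, where the replacement groups are written \1 and \2 instead of $1 and $2. The function name double_space and the sample sentence are assumptions made for this example.

```python
import re

# A period or double quote, exactly one space, then an ASCII letter.
SINGLE_SPACE = re.compile(r'([."]) ([a-zA-Z])')


def double_space(text: str) -> str:
    """Insert a second space after . or " when only one is present."""
    return SINGLE_SPACE.sub(r"\1  \2", text)


sample = 'One. Two." Three.  Four'
print(double_space(sample))
```

Places that already have two spaces are not matched, because the character after the first space is a space rather than a letter.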

Regex matching a text after a specific string until another specific string

If I have the following example:
X-FileName: pallen (Non-Privileged).pst
Here is our forecast
Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>
How can I select the text
Here is our forecast
after "X-FileName .... \n" up to, but excluding, "Message-ID"?
I read about lookahead and lookbehind and tried this, but it didn't work:
(?<=X-FileName:(\n)+$).+(?=Message-ID:)

This should do it:
(?:X-FileName:[^\n]+)\n+([^\n]+)\n+(?:Message-ID:) (group #1 is the match)
Demo
Explanation:
(?:X-FileName:[^\n]+) matches X-FileName: followed by one or more characters that aren't newlines, without capturing it (?:).
\n+ matches one or more consecutive newlines.
([^\n]+) matches and captures one or more consecutive characters that aren't newlines.
\n+, again, matches one or more consecutive newlines.
(?:Message-ID:) matches Message-ID: without capturing it (?:).
Edit: as @WiktorStribiżew mentioned, though, splitting your text into lines may be an easier/cleaner way to retrieve what you want.
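A minimal Python sketch of this answer's pattern, assuming the message uses \n line endings and that the wanted text sits on a single line between the two headers. The names MESSAGE and body_between_headers are invented for the example.

```python
import re

# Sample message, with blank lines between the headers and the body.
MESSAGE = (
    "X-FileName: pallen (Non-Privileged).pst\n"
    "\n"
    "Here is our forecast\n"
    "\n"
    "Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>"
)

PATTERN = re.compile(r"(?:X-FileName:[^\n]+)\n+([^\n]+)\n+(?:Message-ID:)")


def body_between_headers(text: str):
    """Return capture group 1, or None when the pattern does not match."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


print(body_between_headers(MESSAGE))
```

Since ([^\n]+) cannot cross a newline, a body spanning several lines makes the pattern fail and the helper returns None.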

There are two approaches here, and they depend on the broader context. If your expected substring is the second paragraph, just split with \n\n (or \r\n\r\n) and get the second item from the resulting list.
If it is a text inside some larger text, use a regex.
See a Python demo:
import re

s = '''X-FileName: pallen (Non-Privileged).pst

Here is our forecast

Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>'''

# Non-regex way, for a string whose paragraphs are separated by blank lines
print(s.split('\n\n')[1])

# Regex way to get some substring in a known context
m = re.search(r'X-FileName:.*[\r\n]+(.+)', s)
if m:
    print(m.group(1))
The regex means:
X-FileName: - a literal substring
.* - any 0+ chars other than line break chars
[\r\n]+ - 1 or more CR or LF chars
(.+) - Group 1: one or more chars other than line break chars, as many as possible.
See the regex demo.

Match 3 and 4 delimiters and between them; not less not more

I have a command-line program whose first argument ( = argv[ 1 ] ) is a regex pattern.
./program 's/one-or-more/anything/gi/digit-digit'
So I need a regex to check whether the input entered by the user is correct. This would be easy to solve, but I use the C++ standard library and std::regex_match, and this function implicitly anchors the pattern at the beginning and end of the given string (as if with ^ and $), so the non-greedy quantifier is effectively ignored.
Let me clarify. If I want to match /anything/ then I can use /.*?/, but std::regex_match treats this pattern as ^/.*?/$, and therefore if the user enters /anything/anything/anything/, std::regex_match still returns true even though the input is not correct. std::regex_match only returns true or false, and the input expected from the user can only be text that follows the pattern. Since the inputs vary, I cannot list all the possibilities here, but here are some examples.
Should match
/.//
s/.//
/.//g
/.//i
/././gi
/one-or-more/anything/
/one-or-more/anything/g/3
/one-or-more/anything/i
/one-or-more/anything/gi/99
s/one-or-more/anything/g/4
s/one-or-more/anything/i
s/one-or-more/anything/gi/54
and anything that looks like this pattern
Rules:
delimiters are /|##
the letter s at the beginning and g, i and 2 digits at the end are optional
std::regex_match returns true if the entire target character sequence can be matched, otherwise it returns false
between the first and second delimiter there can be one or more characters (+)
between the second and third delimiter there can be zero or more characters (*)
between the third and fourth delimiter there can be g or i
at least 3 delimiters must be matched, as in /.//, so /./ should not match
ECMAScript (ECMA-262) syntax is allowed for the pattern
NOTE
You may need to see my question about std::regex_match:
std::regex_match and lazy quantifier with strange behavior
I do not need any C++ code, I just need a pattern.
Do not try d?([/|##]).+?\1.*?\1[gi]?[gi]?\1?d?\d?\d?. It fails.
My attempt so far: ^(?!s?([/|##]).+?\1.*?\1.*?\1)s?([/|##]).+?\2.*?\2[gi]?[gi]?\d?\d?$
If you are willing to try, you should put ^ and $ around your pattern.
If you need more details, please comment and I will update the question.
Thanks.

You could use this regular expression:
^s?([/|##])((?!\1).)+\1((?!\1).)*\1((gi?|ig)(\1\d\d?)?|i)?$
See regex101.com
Note how this also rejects these cases:
///anything/
/./anything/gg
/./anything/ii
/./anything/i/12
How it works:
Some explanation of the parts that are different:
((?!\1).): this matches any character that is not the delimiter. This way you can keep track of the exact number of delimiters used. It also prevents the first character after the first delimiter from being that delimiter again, which should not be allowed.
(gi?|ig): matches any of the valid modifier combinations, except a sole i, which is treated separately. So this also excludes gg and ii as valid character sequences.
(\1\d\d?)?: optionally allows for an extra delimiter (after a g modifier -- see previous) to be added with one or two digits following it.
(...|i)?: for the case where there is no g modifier present, just an i or nothing: then no digits are allowed to follow.
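The pattern can be tried against the question's examples with a short Python sketch. Python's re.fullmatch is used here only as a stand-in for std::regex_match (both require the whole input to match); the behaviour of the C++ ECMAScript engine is assumed, not verified, to be the same for these inputs. The character class is copied verbatim from the answer, and is_valid is a made-up helper name.

```python
import re

PATTERN = re.compile(
    r"^s?([/|##])((?!\1).)+\1((?!\1).)*\1((gi?|ig)(\1\d\d?)?|i)?$"
)

SHOULD_MATCH = [
    "/.//", "s/.//", "/.//g", "/.//i", "/././gi",
    "/one-or-more/anything/", "/one-or-more/anything/g/3",
    "/one-or-more/anything/i", "/one-or-more/anything/gi/99",
    "s/one-or-more/anything/g/4", "s/one-or-more/anything/i",
    "s/one-or-more/anything/gi/54",
]
SHOULD_FAIL = [
    "/./", "///anything/", "/./anything/gg", "/./anything/ii",
    "/./anything/i/12", "/anything/anything/anything/",
]


def is_valid(command: str) -> bool:
    """True when the entire command matches the pattern."""
    return PATTERN.fullmatch(command) is not None


for command in SHOULD_MATCH + SHOULD_FAIL:
    print(command, is_valid(command))
```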

This is a tricky one, but I took the challenge - here is what I have ended up with:
^s?([\/|##])(?:(?!\1).)+\1(?:(?!\1).)*\1(?:i|(?:gi?|ig)(\1\d{1,2})?)?$
Pattern breakdown:
^ matches start of string
s? matches an optional 's' character
([\/|##]) matches the delimiter character and captures it as group 1
(?:(?!\1).)+ matches anything other than the delimiter character one or more times (uses negative lookahead to make sure that the character isn't the delimiter matched in group 1)
\1 matches the delimiter character captured in group 1
(?:(?!\1).)* matches anything other than the delimiter character zero or more times
\1 matches the delimiter character captured in group 1
(?: starts a new group
i matches the i character
| or
(?:gi?|ig) matches either g, gi, or ig
(\1\d{1,2})? followed by an optional extra delimiter and 0-9 once or twice
)? closes group and makes it optional
$ matches end of string
I have used non-capturing groups throughout - these are groups that start with (?:
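The breakdown above can also be kept next to the pattern itself. The sketch below rewrites the same regex with Python's re.VERBOSE flag so each part carries its own comment; this is a Python-specific convenience and an assumption of this example, not something std::regex offers. Inside a character class the # characters stay literal even in verbose mode.

```python
import re

COMMAND = re.compile(
    r"""
    ^ s?                  # optional leading 's'
    ([\/|##])             # group 1: the delimiter character
    (?:(?!\1).)+          # one or more chars that are not the delimiter
    \1                    # second delimiter
    (?:(?!\1).)*          # zero or more chars that are not the delimiter
    \1                    # third delimiter
    (?:
        i                 # a lone 'i', no digits may follow
      | (?:gi?|ig)        # 'g', 'gi' or 'ig'
        (\1\d{1,2})?      # optional delimiter plus one or two digits
    )?
    $
    """,
    re.VERBOSE,
)

for command in ["/one-or-more/anything/gi/99", "/./anything/i/12", "/./"]:
    print(command, bool(COMMAND.fullmatch(command)))
```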

Only match unique string occurrences

We are doing some Data Loss Prevention for emails, but the issue is that when people reply to emails multiple times, the credit card number or account number will sometimes appear multiple times.
How can we get a Java regex to match each string only once?
For example, we are using the following regex to catch account numbers that consist of 2 letters followed by 5 or 6 digits. It also excludes the prefix CR in either case.
\b(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}\b
How can we have it find:
CX12345
CX14584
JB145888
JD748452
CX12345 (Ignore, as it's already found above)
LM45855

Unique string occurrence can be matched with
<STRING_PATTERN>(?!.*<STRING_PATTERN>) // Find the last occurrence
(?<!<STRING_PATTERN>.*)<STRING_PATTERN> // Find the first occurrence, only works in regex
// that supports infinite-width lookbehind patterns
where <STRING_PATTERN> is the pattern whose unique occurrence you are searching for. Note that both will work with the .NET regex library, but the second one is not supported by the majority of other libraries (only the PyPI regex library for Python and JavaScript regex as of ECMAScript 2018 support it). Note that . does not match line break chars by default, so you need to pass a modifier like DOTALL: in most libraries you may add the (?s) modifier inside the pattern (in Ruby, (?m) does the same), or use specific flags that you pass to the regex compile method. See more about this in How do I match any character across multiple lines in a regular expression?
You seem to need a regex like this:
/\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b)/
The regex demo is available here
Details:
\b - a leading word boundary
((?!CR|cr)[A-Za-z]{2}\d{5,6}) - Group 1 capturing
(?!CR|cr) - the next two characters cannot be CR or cr, the negative lookahead check
[A-Za-z]{2} - 2 ASCII letters
\d{5,6} - 5 to 6 digits
\b - trailing word boundary
(?![\s\S]*\b\1\b) - a negative lookahead that fails the match if there are any 0+ chars ([\s\S]*) followed with a word boundary (\b), same value captured into Group 1 (with the \1 backreference), and a trailing word boundary.
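A small sketch of this pattern in Python (used here instead of Java purely for a compact demonstration; the regex itself is unchanged). The sample text follows the list in the question, and last_occurrences is a made-up helper name. Note that the lookahead keeps the last occurrence of each number, so a repeated value shows up at its final position.

```python
import re

UNIQUE = re.compile(r"\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b)")

EMAIL = "\n".join(
    ["CX12345", "CX14584", "JB145888", "JD748452", "CX12345", "LM45855"]
)


def last_occurrences(text: str) -> list:
    """Return each account number once, at its last position in the text."""
    return UNIQUE.findall(text)


print(last_occurrences(EMAIL))
```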

I would use a Map of some sort here, to keep tally of the strings which you encounter. For example:
String ccNumber = "CX12345";
Map<String, Boolean> ccMap = new HashMap<>();
if (ccNumber.matches("^(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}$")) {
    // The key is what matters; putting the same number twice keeps a single entry
    ccMap.put(ccNumber, Boolean.TRUE);
}
Then just iterate over the keyset of the map to get unique credit card numbers which matched the pattern in your regex:
for (String key : ccMap.keySet()) {
    System.out.println("Found a matching credit card: " + key);
}
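The same idea (match everything, then deduplicate in code) can be sketched in Python as well. This variant is an assumption-laden illustration rather than a translation of the Java above: it scans a whole text instead of testing one string, and it relies on dict.fromkeys to keep the first occurrence of each match in order. The names ACCOUNT and unique_accounts are invented.

```python
import re

# The asker's pattern, with the two lookaheads merged into one.
ACCOUNT = re.compile(r"\b(?!CR|cr)[A-Za-z]{2}[0-9]{5,6}\b")


def unique_accounts(text: str) -> list:
    """Return matches in first-seen order, with repeats dropped."""
    return list(dict.fromkeys(ACCOUNT.findall(text)))


email = "CX12345 CX14584 JB145888 JD748452 CX12345 LM45855 CR12345"
print(unique_accounts(email))
```

Keeping the deduplication outside the regex avoids the rescanning that a lookahead over the rest of the text does for every candidate match.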
