How to write that the pattern should be repeated? - regex

I have lines of the form:
double1, +double2,-double3.
The pattern for a single double value is:
[+-]?([0-9]+([.][0-9]*)?|[.][0-9]+)
How do I extend it to match exactly three values?
Such as:
1.1, 0, -0
0, -123, 33
Not valid for:
""
1,123
123,123,123,123

You can use a slightly simpler pattern:
^(?:(?:^[+-]?|, ?[+-]?)\d+(?:\.\d+)?){3}$
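It matches only lines with exactly three values, as you specified in your edit. Note that, as written, it also accepts a leading comma (for example ,1,2,3), because the first repetition can take the comma branch.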
You can try it here.
As correctly pointed out by The Fourth Bird in his comments below, if you wish to match entries such as .9, where no digits precede the full stop you can use:
^(?:(?:^[+-]?|, ?[+-]?)(?:\d+(?:\.\d+)?|\.\d+)){3}$
You can check this pattern here.

In your pattern, the fractional part ([.][0-9]*)? is optional, so it matches 0 or 1 times.
To match three values, you could match one double using [-+]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+): an optional + or -, followed by an alternation that matches either one or more digits with an optional part consisting of a dot and one or more digits, or a dot followed by one or more digits.
Then add a group that matches a comma, zero or more whitespace characters \s*, and another double, and repeat that group exactly 2 times using the quantifier {2}.
Add anchors to assert the start ^ and the end $ of the string. You can use non-capturing groups (?:...) if you only want to check for a match and do not need to refer to the groups afterwards.
^[-+]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:,\s*[-+]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)){2}$
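As a rough illustration of the steps above, the pattern can be assembled from a single-number fragment. This is only a sketch in JavaScript; the names NUMBER, TRIPLE and isTriple are made up for the example and are not part of the answer.

```javascript
// One signed decimal: digits with an optional fraction, or a bare fraction such as .9
const NUMBER = String.raw`[-+]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)`;

// One number, then exactly two more, each preceded by a comma and optional whitespace
const TRIPLE = new RegExp(`^${NUMBER}(?:,\\s*${NUMBER}){2}$`);

function isTriple(line) {
  return TRIPLE.test(line);
}

console.log(isTriple("1.1, 0, -0"));       // true
console.log(isTriple("1,123"));            // false (only two values)
console.log(isTriple("123,123,123,123"));  // false (four values)
```

Building the pattern from a named fragment keeps the single-number rule in one place, so changing it (for example to allow exponents) does not require editing it twice.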

Related

regex match two words based on a matching substring

there are 4 strings as shown below
ABC_FIXED_20220720_VALUEABC.csv
ABC_FIXED_20220720_VALUEABCQUERY_answer.csv
ABC_FIXED_20220720_VALUEDEF.csv
ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv
Two strings are considered as matched based on a matching substring value (VALUEABC, VALUEDEF in the above shown strings). Thus I am looking to match first 2 (having VALUEABC) and then next 2 (having VALUEDEF). The matched strings are identified based on the same value returned for one regex group.
What I tried so far
ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv
This returns regex group-1 (from (.*[^QUERY_answer])) value "VALUEABC" for first 2 strings and "VALUEDEF" for next 2 strings and thus desired matching achieved.
But the problem with the above regex is that as soon as the value ends with any of the characters in "QUERY_answer", the regex doesn't match any value for the group. For instance, the 2 strings below don't match at all, as VALUESTU ends with "U":
ABC_FIXED_20220720_VALUESTU.csv
ABC_FIXED_20220720_VALUESTUQUERY_answer.csv
I tried to use Negative Lookahead:
ABC.*[0-9]{8}_(.*(?!QUERY_answer))(?:QUERY_answer)?.csv
but in this case the grouping-1 value is returned as "VALUESTU" for first string and "VALUESTUQUERY_answer" for second string, thus effectively making the 2 strings unmatched.
Any way to achieve the desired matching?
With your shown samples please try following regex.
^ABC_[^_]*_[0-9]+_(.*?)(?:QUERY_answer)?\.csv$
OR to match exact 8 digits try:
^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$
Here is the online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^ABC_[^_]*_ ##Matching from starting of value ABC followed by _ till next occurrence of _.
[0-9]+_ ##Matching continuous occurrences of digits followed by _ here.
(.*?) ##Creating one and only capturing group using lazy match which is opposite of greedy match.
(?:QUERY_answer)? ##In a non-capturing group matching QUERY_answer and keeping it optional.
\.csv$ ##Matching dot literal csv at the end of the value.
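To show how the captured group can drive the pairing, here is a small JavaScript sketch that groups file names by group 1 of the 8-digit pattern above. The helper name groupByKey and the use of a Map are assumptions made for the example.

```javascript
// Anchored pattern from the answer; group 1 is the shared key (e.g. VALUEABC)
const FILE_RE = /^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$/;

// Hypothetical helper: collect file names under the key captured in group 1
function groupByKey(names) {
  const groups = new Map();
  for (const name of names) {
    const m = FILE_RE.exec(name);
    if (!m) continue; // skip names that do not fit the pattern
    const key = m[1];
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(name);
  }
  return groups;
}

const grouped = groupByKey([
  "ABC_FIXED_20220720_VALUEABC.csv",
  "ABC_FIXED_20220720_VALUEABCQUERY_answer.csv",
  "ABC_FIXED_20220720_VALUESTU.csv",
  "ABC_FIXED_20220720_VALUESTUQUERY_answer.csv",
]);
console.log([...grouped.keys()]); // [ 'VALUEABC', 'VALUESTU' ]
```

The lazy (.*?) only grows as far as it must, so the optional QUERY_answer suffix is consumed by its own group rather than ending up in the key.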
You need
ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv
See the regex demo.
Note
.*[^QUERY_answer] matches any zero or more chars other than line break chars as many as possible, and then any one char other than Q, U, E, etc., i.e. any char in the negated character class. This is replaced with .*?, to match any zero or more chars other than line break chars as few as possible.
(?:QUERY_answer)? - the group is made non-capturing to reduce grouping complexity.
\.csv - the . is escaped to match a literal dot.

Is there a way to use Regex to capture numbers out of a string based on a specific leading letters?

I need to extract any number of 4 to 10 digits that follows directly after 'PO#' or 'PO# ' (with a whitespace). I do not want to include the PO# in the extracted value, but I do need it as the criterion to target the value within a string. If the number has fewer than 4 or more than 10 digits, I do not wish to capture it and would like to ignore it.
A sample string would look like this:
PO#12445 for Vendor Enterprise
or
Invoice# 21412556 for Vendor Enterprise for PO# 12445
My current RegEX expression captures PO# with '#' and I use additional logic after the fact to remove the '#', however my expression is also capturing Invoice# and Inv# which I don't want it to do. I'd like it to only target PO#.
Current Expression: [P][O][#]\s*[0-9]{3,9}\d+\w
Any help would be greatly appreciated!
If you need only the digits, you can use (?<=\bPO#)\s?(\d{4,10})\b and read capture group 1, with:
(?<=\bPO#): positive lookbehind; the match must be immediately preceded by PO#. The \b inside it requires a word boundary before PO, so 'SPO#' does not match
\s?: 0 or 1 whitespace character
(\d{4,10}): between 4 and 10 digits, captured in group 1
\b: trailing word boundary, to avoid matching e.g. the first 10 digits of an 11-digit number
Edit: Alexander Mashin is right about the lookbehind having to be fixed width, so \s? stays outside the lookbehind: (?<=\bPO#)\s?(\d{4,10})\b https://regex101.com/r/1KBQd1/5
Edit: added word boundaries
You can use a capturing group and repeat matching the digits 4-10 times using [0-9]{4,10}.
Note that [P][O][#] is the same as PO#
\bPO#\s*([0-9]{4,10})\b
\bPO#\s* Match PO# preceded by a word boundary and match 0+ whitespace chars
( Capture group 1
[0-9]{4,10} Match 4 - 10 digits
)\b Close group followed by a word boundary to prevent the match being part of a larger word
Regex demo
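A short JavaScript sketch of the capturing-group approach, assuming you only need the first PO number in each string. The function name extractPo is made up for the example.

```javascript
// Pattern from the answer: group 1 holds the 4-10 digits after PO#
const PO_RE = /\bPO#\s*([0-9]{4,10})\b/;

// Hypothetical helper: return the digits as a string, or null if there is no valid PO number
function extractPo(text) {
  const m = PO_RE.exec(text);
  return m ? m[1] : null;
}

console.log(extractPo("PO#12445 for Vendor Enterprise")); // 12445
console.log(extractPo("Invoice# 21412556 for Vendor Enterprise for PO# 12445")); // 12445
console.log(extractPo("PO#123")); // null (fewer than 4 digits)
```

Reading group 1 avoids the extra step of stripping PO# afterwards, and the leading \b keeps Invoice# and Inv# from matching.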
If PCRE is available, how about:
PO#\s*\K\d{4,10}(?=\D|$)
PO#\s* matches the leading substring "PO#" followed by 0 or more whitespaces.
\K resets the starting position of the reported match, so it works much like a positive (zero-length) lookbehind.
\d{4,10} matches a sequence of digits of 4 <= length <= 10.
(?=\D|$) is the positive lookahead to match a non-digit character or the end of the string.
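\K is not available in every engine; JavaScript, for one, does not support it. Assuming an engine with lookbehind support (ES2018 or later), a roughly equivalent sketch swaps \K for a lookbehind. This is a substitute technique, not the PCRE pattern itself, and the name findPoDigits is invented for the example.

```javascript
// (?<=PO#\s*) plays the role of "PO#\s*\K"; (?!\d) is equivalent to (?=\D|$)
const PO_DIGITS = /(?<=PO#\s*)\d{4,10}(?!\d)/;

// Hypothetical helper: the whole match is the number, no capture group needed
function findPoDigits(text) {
  const m = text.match(PO_DIGITS);
  return m ? m[0] : null;
}

console.log(findPoDigits("Invoice# 21412556 for Vendor Enterprise for PO# 12445")); // 12445
```

JavaScript allows variable-length lookbehind, which is why \s* can sit inside it here; engines that require fixed-width lookbehind would need the capture-group form instead.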

Find the first set of 5 digits in a text

I need to find the first set of 5 numbers in a text like this :
;SUPER U CHARLY SUR MARNE;;;rte de Pavant CHARLY SUR MARNE Picardie 02310;Charly-sur-Marne;;;02310;;;;;;;;;;;;;;
I need to find the first 02310 only.
My regex but it found all set of 5 numbers :
([^\d]|^)\d{5}([^\d]|$)
To match the first 5-digit number you may use
^.*?\K(?<!\d)\d{5}(?!\d)
See the regex demo. As you want to remove the match, simply keep the Replace With field blank. The ^ matches the start of a line, .*? matches any 0+ chars other than line break chars, as few as possible, and \K operator drops the text matched so far. Then, (?<!\d)\d{5}(?!\d) matches 5 digits not enclosed with other digits.
Another variation includes a capturing group/backreference:
Find What: ^(.*?)(?<!\d)\d{5}(?!\d)
Replace With: $1
See this regex demo.
Here, instead of dropping the found text before the number, (.*?) is captured into Group 1 and $1 in the replacement pattern puts it back.
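The patterns above rely on \K or on an editor's Find/Replace fields. As a hedged sketch of the same idea in JavaScript (which has no \K), the lookarounds alone are enough to find the first number, and the capture-group variant handles the removal. The function names are assumptions made for the example.

```javascript
// Exactly five digits, not touching other digits; without the g flag only the first hit is returned
const FIVE_DIGITS = /(?<!\d)\d{5}(?!\d)/;

function firstFiveDigits(text) {
  const m = text.match(FIVE_DIGITS);
  return m ? m[0] : null;
}

// Capture-group variant: keep the prefix ($1) and drop the first five-digit number
function removeFirstFiveDigits(text) {
  return text.replace(/^(.*?)(?<!\d)\d{5}(?!\d)/, "$1");
}

const line = ";SUPER U CHARLY SUR MARNE;;;rte de Pavant CHARLY SUR MARNE Picardie 02310;Charly-sur-Marne;;;02310;;";
console.log(firstFiveDigits(line)); // 02310
console.log(removeFirstFiveDigits(line));
```

Only the first 02310 is removed; the second one after Charly-sur-Marne is left in place because the replacement is not global.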
I would use
(^(?:(?!\d{5}).)+)(\d{5})(?!\d)
It matches from the beginning of the string to the end of the first 5-digit number; in a replacement you can use $1 or $2 to refer to the corresponding part. For example, the replacement $1<$2> surrounds the number with < and >.
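A minimal JavaScript sketch of that replacement, assuming the $1<$2> form described above; the name wrapFirstNumber is made up for the example.

```javascript
// Group 1: everything before the first run of five digits; group 2: the five digits
const PREFIX_AND_NUMBER = /(^(?:(?!\d{5}).)+)(\d{5})(?!\d)/;

function wrapFirstNumber(text) {
  return text.replace(PREFIX_AND_NUMBER, "$1<$2>");
}

console.log(wrapFirstNumber("Picardie 02310;Charly-sur-Marne;;;02310"));
// Picardie <02310>;Charly-sur-Marne;;;02310

// Group 1 uses +, so it needs at least one character before the number;
// a number at the very start of the string is left untouched.
console.log(wrapFirstNumber("02310;x")); // 02310;x
```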
To find the first 5 digits in the text, you could also match not a digit \D* or 1-4 digits followed by matching 5 digits:
^(?=.*\b\d{5}\b)(?:\D*|\d{1,4})*\K\d{5}(?!\d)
^ Start of string
(?=.*\b\d{5}\b) Assert that there are 5 consecutive digits between word boundaries
(?:\D*|\d{1,4})* Repeat matching 0+ times not a digit or 1-4 digits
\K\d{5} Forget what was matched, then match 5 digits
(?!\d) Assert what followed is not a digit
Regex demo

A regular expression for matching a group followed by a specific character

So I need to match the following:
1.2.
3.4.5.
5.6.7.10
((\d+)\.(\d+)\.((\d+)\.)*) will do fine for the very first line, but the problem is: there could be many lines: could be one or more than one.
\n will only appear if there are more than one lines.
In string version, I get it like this: "1.2.\n3.4.5.\n1.2."
So my issue is: if there is only one line, there must be no \n at the end, but if there is more than one line, each line except the very last must end with \n.
Here is the pattern I suggest:
^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$
Demo
Here is a brief explanation of the pattern:
^ from the start of the string
\d+ match a number
(?:\.\d+)* followed by dot, and another number, zero or more times
\.? followed by an optional trailing dot
(?:\n followed by a newline
\d+(?:\.\d+)*\.?)* and another path sequence, zero or more times
$ end of the string
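A quick way to check the pattern against the cases from the question, sketched in JavaScript. The function name isValidBlock is an assumption. Note that in JavaScript, without the m flag, $ matches only at the very end of the input, so a trailing newline is rejected.

```javascript
// Pattern from the answer: one dotted sequence, then zero or more newline-separated ones
const LINES_RE = /^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$/;

function isValidBlock(text) {
  return LINES_RE.test(text);
}

console.log(isValidBlock("1.2."));               // true (single line, no newline)
console.log(isValidBlock("1.2.\n3.4.5.\n1.2.")); // true (newline between lines only)
console.log(isValidBlock("1.2.\n"));             // false (newline after the last line)
```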
You might check if there is a newline at the end using a positive lookahead (?=.*\n):
(?=.*\n)(\d+)\.(\d+)\.((\d+)\.)*
See a regex demo
Edit
You could use an alternation to either match when on the next line there is the same pattern following, or match the pattern when not followed by a newline.
^(?:\d+\.\d+\.(?:\d+\.)*(?=.*\n\d+\.\d+\.)|\d+\.\d+\.(?:\d+\.)*(?!.*\n))
Regex demo
^ Start of string
(?: Non capturing group
\d+\.\d+\. Match 1+ digits and a dot, 2 times
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?=.*\n\d+\.\d+\.) Positive lookahead, assert that what follows is a newline and then a line starting with the pattern
| Or
\d+\.\d+\. Match 1+ digits and a dot, 2 times
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?!.*\n) Negative lookahead, assert that what follows is not a newline
) Close non capturing group
(\d+\.*)+\n* will match the text you provided. If you need to make sure the final line also ends with a . then (\d+\.)+\n* will work.
Most programming languages offer the m flag, the multiline modifier. Enabling it lets $ match at the end of each line as well as at the end of the string.
The solution below only appends $ to your current regex and sets the m flag. The way to set the flag may vary depending on your programming language.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
    regex = /((\d+)\.(\d+)\.((\d+)\.)*)$/gm,
    match;
while (match = regex.exec(text)) {
  console.log(match);
}
You could simplify the regex to /(\d+\.){2,}$/gm, then split the full match based on the dot character to get all the different numbers. I've given a JavaScript example below, but getting a substring and splitting a string are pretty basic operations in most languages.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
    regex = /(\d+\.){2,}$/gm;
/* Slice is used to drop the dot at the end, otherwise resulting in
 * an empty string on split.
 *
 * "1.2.3.".split(".")    //=> ["1", "2", "3", ""]
 * "1.2.3.".slice(0, -1)  //=> "1.2.3"
 * "1.2.3".split(".")     //=> ["1", "2", "3"]
 */
console.log(
  text.match(regex)
      .map(match => match.slice(0, -1).split("."))
);
For more info about regex flags/modifiers have a look at: Regular Expression Reference: Mode Modifiers

Trying to match zero outside the word boundaries

I have patterns like
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
I can match word TCELL and TBNK with this RegEX
^(\D+)-(\d+)-(\d+)([A-Z1-9]+)?.*
But if I have patterns like
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
the above regex returns
T2 and C192 instead of T20NK and C1920 respectively
Is there a general regex that also matches any number of zeros inside these words, and not only outside them?
Let's consider all 4 examples of your input:
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
The first group, between start of line and the first "_" (e.g. FQC19515 in row 1)
consists of:
a non-empty sequence of letters,
a non-empty sequence of digits.
So the regex matching it, including the start of line anchor and a capturing group is:
^([A-Z]+\d+)
You used \D instead of [A-Z] but I think that [A-Z] is
more specific, as it matches only letters and not e.g. "_".
The next source char is _, so the regex can also include _.
And now the more difficult part: The second group to be captured has
actually 2 variants:
a sequence of letters and a sequence of digits (after that there is
a "_"),
a sequence of letters, a sequence of digits and another sequence of
letters (after that there are digits that you want to omit).
So the most intuitive way is to define 2 alternatives, each with
a respective positive lookahead:
alternative 1: [A-Z]+\d+(?=_),
alternative 2: [A-Z]+\d+[A-Z]+(?=\d).
But there is a bit shorter way. Notice that both alternatives start
from [A-Z]+\d+.
So we can put this fragment at the first place and only the rest
include as a non-capturing group ((?:...)), with 2 alternatives.
All the above should be surrounded with a capturing group:
([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
So the whole regex can be:
^([A-Z]+\d+)_([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
with m option ("^" matches also the start of each line).
For a working example see https://regex101.com/r/GDdt10/1
Your regex: ^(\D+)-(\d+) is wrong as after a sequence of non-digits
(\D+) you specified a minus which doesn't occur in your source.
Also the second minus does not correspond to your input.
Edit
To match all your strings, I modified slightly the previous regex.
The changes are limited to the matching group No 2 (after _):
Alternative No 1: [A-Z]{2,}+(?=\d) - two or more letters, after them
there is a digit, to be omitted. It will match TCELL and TBNK.
Alternative No 2: [A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)) - the previous
content of this group. It will match two remaining cases.
So the whole regex is:
^([A-Z]+\d+)_([A-Z]{2,}+(?=\d)|[A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
For a working example see https://regex101.com/r/GDdt10/2
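To check the final regex against all four file names, here is a hedged JavaScript sketch. JavaScript has no possessive quantifiers, so {2,}+ is written as {2,} here; that is an adaptation for this engine, and it should give the same result for these inputs because backing off would leave a letter, not a digit, as the next character. The function name extractParts is made up.

```javascript
// Adapted from the answer's final regex: {2,}+ (possessive) replaced by {2,} for JavaScript
const NAME_RE = /^([A-Z]+\d+)_([A-Z]{2,}(?=\d)|[A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))/;

// Hypothetical helper: return [group 1, group 2], or null if the name does not fit
function extractParts(fileName) {
  const m = NAME_RE.exec(fileName);
  return m ? [m[1], m[2]] : null;
}

console.log(extractParts("FQC19515_TCELL001_20190319_165944.pdf"));   // [ 'FQC19515', 'TCELL' ]
console.log(extractParts("FLW194640_T20NK022_20190323_131348.pdf"));  // [ 'FLW194640', 'T20NK' ]
console.log(extractParts("FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf")); // [ 'FLW194228', 'C1920' ]
```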
As far as I understand, you could use:
^[A-Z]+\d+_\K[A-Z0-9]{5}
Explanation:
^ # beginning of line
[A-Z]+ # 1 or more capitals
\d+_ # 1 or more digit and 1 underscore
\K # forget all we have seen until this position
[A-Z0-9]{5} # 5 capitals or digits
Demo