Regex to match only one occurrence followed by two digits

I want to replace the , with a . if both of the following conditions are true:
, should be present only once in the string
, should be followed by a maximum of two digits
These are OK: 1 000 000,51, 1.000,9
These are not: 9,523,036.11, 1,000
My attempt so far: https://regex101.com/r/njuKtb/1

You may use this regex for search:
^([^,]*),(?=\d{1,2}(?!\d))(?!.*,)
And use this replacement:
$1.
RegEx Demo
RegEx Details:
^([^,]*): Match 0 or more non-comma characters at the start
,: Match literal comma
(?=\d{1,2}(?!\d)): Assert (without consuming) that 1 or 2 digits follow, and that they are not followed by another digit
(?!.*,): Make sure we don't have comma ahead
Alternatively use this for search:
^([^,]*),(?=\d{1,2}(?!\d))([^,\n]*)$
and replace by:
$1.$2
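A minimal Python sketch of the first search/replace pair, assuming Python's re module, where the replacement is written \1. instead of $1. The helper name fix_decimal_comma is made up for illustration.

```python
import re

# First pattern from the answer above; Python spells the backreference \1, not $1.
DECIMAL_COMMA = re.compile(r"^([^,]*),(?=\d{1,2}(?!\d))(?!.*,)")

def fix_decimal_comma(text: str) -> str:
    """Turn the comma into a dot only when it is the sole comma
    and is followed by one or two digits."""
    return DECIMAL_COMMA.sub(r"\1.", text)

samples = ["1 000 000,51", "1.000,9", "9,523,036.11", "1,000"]
results = [fix_decimal_comma(s) for s in samples]
print(results)
```

The two "OK" inputs from the question get a dot, while the two "not OK" inputs come back unchanged.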

You can do:
/^(?!^[^,\n]*,[^,\n]*,[^,\n]*)(?:[^,\n]*),(?=\d{1,2}\D*$)/m
Demo
Which is:
^ Start of string or line
(?!^[^,\n]*,[^,\n]*,[^,\n]*) Only matches lines with a single ','
(?:[^,\n]*) Consume the left-hand side before the ,
, The ,
(?=\d{1,2}\D*$) no more than two \d before end of the line
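Because the left-hand side is consumed by a non-capturing group, replacing the whole match with a plain . would throw that part away. Here is a hedged Python sketch that works around this with a replacement function; re.MULTILINE stands in for the /m flag, and the names SINGLE_COMMA and swap_last_char are only illustrative.

```python
import re

SINGLE_COMMA = re.compile(
    r"^(?!^[^,\n]*,[^,\n]*,[^,\n]*)(?:[^,\n]*),(?=\d{1,2}\D*$)",
    re.MULTILINE,
)

def swap_last_char(match: re.Match) -> str:
    # The match ends with the comma: keep everything before it and append a dot.
    return match.group(0)[:-1] + "."

text = "1 000 000,51\n1.000,9\n9,523,036.11\n1,000"
converted = SINGLE_COMMA.sub(swap_last_char, text)
print(converted)
```

Wrapping the left-hand side in a capturing group and replacing with \1. would be an equivalent alternative.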

Related

How to match lines in a numbered list with a regex

I want to search for all lines that:
start with a numeric-repeat (one or several times)
this numeric-repeat is not followed by dot and a whitespace character
either a single dot after the numeric-repeat or a letter is okay
Given Lines
1. TEST 1 : DataLogFile
11. TEST 2 : Inter Citro File
111. TEST 3 : Inter Citro File
111.TEST4 : Match this
111TEST4 : Match this
Expected Result
Should only match last 2 lines
111.TEST4 : Match this
111TEST4 : Match this
1. Regex
I tried the regex ^[0-9]+(?!. ).* to match only the last row because there is no whitespace character after the dot.
Tested in Regex101
1. Actual Result
It matched the last 4 lines:
11. TEST 2 : Inter Citro File
111. TEST 3 : Inter Citro File
111.TEST4 : Match this
111TEST4 : Match this
2. Regex like answered
When I try SaSkY's first answer, ^\d+\.\S.*,
it only matches lines that have digits, then a dot, then a non-blank character, then any characters. See Demo
But it does not match input without a dot after the digits,
although 111TEST4 : Match this is also expected to match.
Try this:
^\d+(?:\.\S|[A-Za-z]).*
^ start of the line.
\d+ one or more digits.
(?:\.\S|[A-Za-z]) non-capturing group:
\. a literal dot.
\S any character except a whitespace character.
| OR.
[A-Za-z] a letter.
.* zero or more characters.
See regex demo
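As a quick check, here is a small Python sketch that runs this pattern over the lines from the question. It assumes Python's re module and line-by-line matching; the variable names are illustrative.

```python
import re

# Digits, then either "dot + non-whitespace" or a letter, then the rest of the line.
PATTERN = re.compile(r"^\d+(?:\.\S|[A-Za-z]).*")

lines = [
    "1. TEST 1 : DataLogFile",
    "11. TEST 2 : Inter Citro File",
    "111. TEST 3 : Inter Citro File",
    "111.TEST4 : Match this",
    "111TEST4 : Match this",
]

matched = [line for line in lines if PATTERN.match(line)]
print(matched)
```

Only the last two lines are kept, which is the expected result stated in the question.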
You can try:
^(\d)\1*+(?!\.?\s+).*$
Regex demo.
Or if you want just a number at the beginning (not repeating numbers such as 111):
^\d++(?!\.?\s+).*$
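The possessive quantifier is what stops the engine from giving digits back. The question's attempt matched 11. TEST 2 because [0-9]+ backtracked to a single 1, after which (?!. ) looked at "1." and passed. Python's re module only supports possessive quantifiers from version 3.11 on, so the sketch below emulates \d++ with an extra (?!\d) lookahead; that substitution is an assumption of this example, not part of the answer.

```python
import re

lines = [
    "1. TEST 1 : DataLogFile",
    "11. TEST 2 : Inter Citro File",
    "111. TEST 3 : Inter Citro File",
    "111.TEST4 : Match this",
    "111TEST4 : Match this",
]

# The attempt from the question: the digit run may shrink, and the dot is unescaped.
naive = re.compile(r"^[0-9]+(?!. ).*")

# Emulation of ^\d++(?!\.?\s+).*$ : (?!\d) forbids stopping in the middle of the digits.
guarded = re.compile(r"^\d+(?!\d)(?!\.?\s).*$")

naive_hits = [line for line in lines if naive.match(line)]
guarded_hits = [line for line in lines if guarded.match(line)]
print(len(naive_hits), len(guarded_hits))
```

The naive pattern reproduces the four unwanted matches from the question, while the guarded one keeps only the last two lines.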
You should have stated your expectations clearly before asking.
If you would like to
match: any "identifier" or word that is prefixed either with a number (e.g. 1Hello) or with an ordinal (e.g. 2.World)
but not: a phrase containing a space, as in a numbered-list entry (e.g. 1. Hello)
Simple regex sequentially built
Then ^\d+\.?[a-zA-Z].*
Matches:
111.TEST4 : Match this
111TEST5: Match this
111test6: Match this
But not those numbered-list items having separating spaces inside.
It also does not match anything starting with a letter.
Those do not match:
1. TEST 1 : DataLogFile
11. TEST 2 : Inter Citro File
111. TEST 3 : Inter Citro File
test7: should not match
💡️ So you can apply this regex on lines to filter for poorly formatted numbered-list entries.
See demo
Explained the sequence
^ begin of line
\d+ one or more digits (a number)
\.? an optional dot (raw dots need to be escaped by backslash!)
[a-zA-Z] any alphabetic letter from the range (lower or uppercase)
.* anything else (here the unescaped dot has special meaning "any character")
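A short Python sketch of that filtering idea, assuming the lines are already split into a list; LOOSE_ENTRY and the list names are hypothetical.

```python
import re

LOOSE_ENTRY = re.compile(r"^\d+\.?[a-zA-Z].*")

lines = [
    "111.TEST4 : Match this",
    "111TEST5: Match this",
    "111test6: Match this",
    "1. TEST 1 : DataLogFile",
    "11. TEST 2 : Inter Citro File",
    "111. TEST 3 : Inter Citro File",
    "test7: should not match",
]

# Entries where the number (and optional dot) runs straight into a letter.
poorly_formatted = [line for line in lines if LOOSE_ENTRY.match(line)]
print(poorly_formatted)
```

Since a letter is required right after the optional dot, an entry such as 1.-x is not reported by this pattern.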

Find the first set of 5 digits in a text

I need to find the first set of 5 numbers in a text like this :
;SUPER U CHARLY SUR MARNE;;;rte de Pavant CHARLY SUR MARNE Picardie 02310;Charly-sur-Marne;;;02310;;;;;;;;;;;;;;
I need to find the first 02310 only.
My regex, but it finds every set of 5 digits:
([^\d]|^)\d{5}([^\d]|$)
To match the first 5-digit number you may use
^.*?\K(?<!\d)\d{5}(?!\d)
See the regex demo. As you want to remove the match, simply keep the Replace With field blank. The ^ matches the start of a line, .*? matches any 0+ chars other than line break chars, as few as possible, and \K operator drops the text matched so far. Then, (?<!\d)\d{5}(?!\d) matches 5 digits not enclosed with other digits.
Another variation includes a capturing group/backreference:
Find What: ^(.*?)(?<!\d)\d{5}(?!\d)
Replace With: $1
See this regex demo.
Here, instead of dropping the found text before the number, (.*?) is captured into Group 1 and $1 in the replacement pattern puts it back.
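Python's built-in re module does not support \K, so a Python version has to use the capturing-group variant. A minimal sketch, assuming the sample line from the question; the variable names are illustrative.

```python
import re

text = (";SUPER U CHARLY SUR MARNE;;;rte de Pavant CHARLY SUR MARNE "
        "Picardie 02310;Charly-sur-Marne;;;02310;;;;;;;;;;;;;;")

FIRST_FIVE = re.compile(r"^(.*?)(?<!\d)\d{5}(?!\d)")

# Group 1 holds the text before the number, so \1 puts it back and drops the digits.
without_first = FIRST_FIVE.sub(r"\1", text)

# To only locate the number, search() already returns the leftmost match.
first = re.search(r"(?<!\d)\d{5}(?!\d)", text)
print(first.group(), first.start())
```

Because the pattern is anchored with ^, sub() can match only once, so the second 02310 is left in place.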
I would have used
(^(?:(?!\d{5}).)+)(\d{5})(?!\d)
It matches the fragment from the beginning of the string to the end of the first 5-digit number; in the replacement you can use $1 or $2 to refer to the corresponding part. For example, the replacement $1<$2> will surround the number with < and >.
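The same replacement can be sketched in Python, where $1<$2> becomes \1<\2>. This assumes the sample line from the question; the names are illustrative.

```python
import re

text = (";SUPER U CHARLY SUR MARNE;;;rte de Pavant CHARLY SUR MARNE "
        "Picardie 02310;Charly-sur-Marne;;;02310;;;;;;;;;;;;;;")

# Group 1: everything up to the first spot where 5 digits begin; group 2: those digits.
PREFIX_AND_NUMBER = re.compile(r"(^(?:(?!\d{5}).)+)(\d{5})(?!\d)")

marked = PREFIX_AND_NUMBER.sub(r"\1<\2>", text)
print(marked)
```

Note that the + requires at least one character before the number, so a text that starts with the 5 digits would not match.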
To find the first 5 digits in the text, you could also repeatedly match non-digits (\D*) or runs of 1-4 digits, and then match the 5 digits:
^(?=.*\b\d{5}\b)(?:\D*|\d{1,4})*\K\d{5}(?!\d)
^ Start of string
(?=.*\b\d{5}\b) Assert that there are 5 consecutive digits between word boundaries
(?:\D*|\d{1,4})* Repeat matching 0+ times not a digit or 1-4 digits
\K\d{5} Forget what was matched, then match 5 digits
(?!\d) Assert what followed is not a digit
Regex demo

Regex for text file

I have a text file with the following text:
andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar
Is there any way I can split the name and the version using regex? I want to use the results to make two columns, 'Name' and 'Version', in Excel.
So i want the results from regex to look like
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
So far I have used ^(?:.*-(?=\d)|\D+) to get the Version and -\d.*$ to get the Name separately. The problem with this is that when I do it for a large text file, the results from the two regexes are not in the same order. So is there any way to get the results in the way I have mentioned above?
Ctrl+H
Find what: ^(.+?)[-_](\d.*)$
Replace with: $1\t$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(.+?) # group 1, 1 or more any character but newline, not greedy
[-_] # a dash or underscore
(\d.*) # group 2, a digit then 0 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
\t # a tabulation, you may replace with what you want
$2 # content of group 2
Result for given example:
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
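For anyone doing this outside the editor, here is a rough Python equivalent of the same find/replace. It assumes the file content is already in a string and uses re.MULTILINE so that ^ and $ work per line; the names are made up.

```python
import re

listing = """andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar"""

NAME_VERSION = re.compile(r"^(.+?)[-_](\d.*)$", re.MULTILINE)

# \1 and \2 are the two groups; \t in the template becomes a tab character.
tabbed = NAME_VERSION.sub(r"\1\t\2", listing)
print(tabbed)
```

Tab-separated text like this can be pasted into Excel as two columns; the line without a version (astrix) is left untouched.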
I am not quite sure what problem you mean with the large file, and I believe the two regexes you showed do the opposite of what you said: the first one gets you the name and the second one gives you the version.
Anyway, here are the assumptions I had to make about what may make sense for you:
"Name" may be followed by - or _, followed by the version string.
"Version" is a string preceded by - or _, made of some digits, followed by a dot or underscore, followed by some digits, and then any string.
If these assumptions make sense, you may use
^(.+?)(?:[-_](\d+[._]\d+.*))?$
as your regex. Group 1 will be the name, Group 2 will be the version.
Demo in regex101: https://regex101.com/r/RnwMaw/3
Explanation of regex
^                   start of line
(.+?)               "Name" part, using a reluctant match of at least 1 character
(?: )?              optional group for the "Version String", which consists of:
  [-_]              - or _
  ( )               followed by the "Version", which is
    \d+             at least 1 digit,
    [._]            then 1 dot or underscore,
    \d+             then at least 1 digit,
    .*              then any string
$                   end of line
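Since both parts come out of a single match, name and version always stay paired, which avoids the ordering problem of running two separate regexes. A small Python sketch under that assumption; the list and variable names are illustrative.

```python
import re

NAME_VERSION = re.compile(r"^(.+?)(?:[-_](\d+[._]\d+.*))?$")

entries = [
    "andal-4.1.0.jar",
    "besc_2.1.0-beta",
    "prov-3.0.jar",
    "add4lib-1.0.jar",
    "com_lab_2.0.jar",
    "astrix",
    "lis-2_0_1.jar",
]

rows = []
for entry in entries:
    m = NAME_VERSION.match(entry)
    if m:
        # Group 2 is None when the optional version part is absent.
        rows.append((m.group(1), m.group(2) or ""))

for name, version in rows:
    print(f"{name}\t{version}")
```

Each tuple is one spreadsheet row; astrix gets an empty Version cell.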

How to write that the pattern should be repeated?

I have a line of pattern:
double1, +double2,-double3.
For single double value pattern is :
[+-]?([0-9]+([.][0-9]*)?|[.][0-9]+)
How to make it for triple value?
Such as:
1.1, 0, -0
0, -123, 33
Not valid for:
""
1,123
123,123,123,123
You can use a slightly simpler pattern:
^(?:(?:^[+-]?|, ?[+-]?)\d+(?:\.\d+)?){3}$
It matches only triple occurrences, as you specified in your edit.
You can try it here.
As correctly pointed out by The Fourth Bird in his comments below, if you wish to match entries such as .9, where no digits precede the full stop you can use:
^(?:(?:^[+-]?|, ?[+-]?)(?:\d+(?:\.\d+)?|\.\d+)){3}$
You can check this pattern here.
The fractional part ([.][0-9]*)? of your pattern is optional, so it matches 0 or 1 times.
To match three values, first match a single double using [-+]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+). This matches an optional + or -, followed by an alternation that matches either one or more digits with an optional part consisting of a dot and one or more digits, or a dot followed by one or more digits.
Then repeat that pattern 2 more times using the quantifier {2}, with each repetition preceded by a comma and zero or more whitespace characters \s*.
Add anchors to assert the start ^ and the end $ of the string. You can use a non-capturing group (?:...) if you only want to check whether there is a match and do not need to refer to the groups.
^[-+]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:,\s*[-+]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)){2}$
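To make the repetition easier to see, the pattern can be assembled from a single-number fragment. This is only a sketch in Python; NUMBER, TRIPLE and is_triple are names invented for the example.

```python
import re

# One signed decimal: digits with an optional fraction, or a bare fraction such as .9
NUMBER = r"[-+]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)"

# The first number, then exactly two more, each preceded by a comma and optional whitespace.
TRIPLE = re.compile(rf"^{NUMBER}(?:,\s*{NUMBER}){{2}}$")

def is_triple(text: str) -> bool:
    return TRIPLE.match(text) is not None

valid = ["1.1, 0, -0", "0, -123, 33"]
invalid = ["", "1,123", "123,123,123,123"]

print([is_triple(s) for s in valid + invalid])
```

Changing {2} to another count is all it takes to validate a different number of values.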

How to use regular expression to use as few groups as possible to match as long string as possible

For example, this is the regular expression
([a]{2,3})
This is the string
aaaa // 1 match "(aaa)a" but I want "(aa)(aa)"
aaaaa // 2 match "(aaa)(aa)"
aaaaaa // 2 match "(aaa)(aaa)"
However, if I change the regular expression
([a]{2,3}?)
Then the results are
aaaa // 2 match "(aa)(aa)"
aaaaa // 2 match "(aa)(aa)a" but I want "(aaa)(aa)"
aaaaaa // 3 match "(aa)(aa)(aa)" but I want "(aaa)(aaa)"
My question is: is it possible to use as few groups as possible to match as long a string as possible?
How about something like this:
(a{3}(?!a(?:[^a]|$))|a{2})
This looks for either the character a three times (not followed by a single a and a different character) or the character a two times.
Breakdown:
(             # Start of the capturing group.
  a{3}        # Matches the character 'a' exactly three times.
  (?!         # Start of a negative Lookahead.
    a         # Matches the character 'a' literally.
    (?:       # Start of the non-capturing group.
      [^a]    # Matches any character except for 'a'.
      |       # Alternation (OR).
      $       # Asserts position at the end of the line/string.
    )         # End of the non-capturing group.
  )           # End of the negative Lookahead.
  |           # Alternation (OR).
  a{2}        # Matches the character 'a' exactly two times.
)             # End of the capturing group.
Here's a demo.
Note that if you don't need the capturing group, you can actually use the whole match instead by converting the capturing group into a non-capturing one:
(?:a{3}(?!a(?:[^a]|$))|a{2})
Which would look like this.
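To check the behaviour on the strings from the question, here is a small Python sketch using the non-capturing form with findall(); split_runs is a made-up helper name.

```python
import re

# Prefer three a's, unless that would leave exactly one a stranded; otherwise take two.
CHUNK = re.compile(r"(?:a{3}(?!a(?:[^a]|$))|a{2})")

def split_runs(text: str) -> list:
    # With no capturing groups, findall() returns the whole matches.
    return CHUNK.findall(text)

for sample in ["aaaa", "aaaaa", "aaaaaa", "aaaaaaa"]:
    print(sample, split_runs(sample))
```

Every sample is covered completely, and three a's are chosen whenever that does not leave a single a behind.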
Try this Regex:
^(?:(a{3})*|(a{2,3})*)$
Click for Demo
Explanation:
^ - asserts the start of the line
(?:(a{3})*|(a{2,3})*) - a non-capturing group containing 2 sub-sequences separated by OR operator
(a{3})* - The first subsequence tries to match 3 occurrences of a. The * at the end allows this subsequence to match 0 or 3 or 6 or 9.... occurrences of a before the end of the line
| - OR
(a{2,3})* - matches 2 to 3 occurrences of a, as many as possible. The * at the end would repeat it 0+ times before the end of the line
$ - asserts the end of the line
Try this short regex:
a{2,3}(?!a([^a]|$))
Demo
How it's made:
I started with this simple regex: a{2}a?. It looks for 2 consecutive a's that may be followed by another a. If the 2 a's are followed by another a, it matches all three a's.
This worked for most cases:
However, it failed in cases like:
So now, I knew I had to modify my regex in such a way that it would match the third a only if the third a is not followed by a([^a]|$). So now, my regex looked like a{2}a?(?!a([^a]|$)), and it worked for all cases. Then I just simplified it to a{2,3}(?!a([^a]|$)).
That's it.
EDIT
If you want the capturing behavior, then add parentheses around the regex, like:
(a{2,3}(?!a([^a]|$)))