Regex Match string if it exists - regex

Example 1:
THE COMPANIES ACT
(Cap 486)
IT IS notified
Example 2:
THE COMPANIES ACT
(Cap. 486)
Incorporations
IT IS notified
My current regex: THE COMPANIES ACT\n\(((?:Cap.|Cap) .*?)\)(?:\nIncorporations|\nincorporations)\nIT IS notified only matches Example 2.
I would like it to match both examples.

You should make (?:\nIncorporations|\nincorporations) optional by appending ? (0 or 1 match) after it. Otherwise the first example cannot match, because the pattern requires (?:\nIncorporations|\nincorporations) to be present.
Since both alternatives share ncorporations, you could use (?:\n[Ii]ncorporations)? instead of (?:\nIncorporations|\nincorporations)?, and (?:Cap\.?) instead of (?:Cap.|Cap). This shortens the pattern a bit and also escapes the dot (an unescaped . means any character).
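As a quick check of the suggestion above, here is a minimal Python sketch. Python is my choice for illustration (the question does not say which engine is used), and the names PATTERN, example_1 and example_2 are made up for this example.

```python
import re

# The asker's pattern with the "Incorporations" line made optional
# and the dot after "Cap" escaped and made optional.
PATTERN = re.compile(
    r"THE COMPANIES ACT\n"
    r"\((Cap\.? .*?)\)"
    r"(?:\n[Ii]ncorporations)?"
    r"\nIT IS notified"
)

example_1 = "THE COMPANIES ACT\n(Cap 486)\nIT IS notified"
example_2 = "THE COMPANIES ACT\n(Cap. 486)\nIncorporations\nIT IS notified"

for text in (example_1, example_2):
    match = PATTERN.search(text)
    # Group 1 holds the text between the parentheses.
    print(match.group(1) if match else None)
```

Both examples should now match, with group 1 holding the text inside the parentheses.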

Related

Regular expression repetition in Ruby [duplicate]

I'm reading the regular expressions reference and I'm thinking about the ? and ?? characters. Could you explain their usefulness with some examples? I don't understand them well enough.
Thank you.
This is an excellent question, and it took me a while to see the point of the lazy ?? quantifier myself.
? - Optional (greedy) quantifier
The usefulness of ? is easy enough to understand. If you wanted to find both http and https, you could use a pattern like this:
https?
This pattern will match both inputs, because it makes the s optional.
?? - Optional (lazy) quantifier
?? is more subtle. It usually does the same thing ? does. It doesn't change the true/false result when you ask: "Does this input satisfy this regex?" Instead, it's relevant to the question: "Which part of this input matches this regex, and which parts belong in which groups?" If an input could satisfy the pattern in more than one way, the engine will decide how to group it based on ? vs. ?? (or * vs. *?, or + vs. +?).
Say you have a set of inputs that you want to validate and parse. Here's an (admittedly silly) example:
Input:
http123
https456
httpsomething
Expected result:
Pass/Fail  Group 1  Group 2
Pass       http     123
Pass       https    456
Pass       http     something
You try the first thing that comes to mind, which is this:
^(http)([a-z\d]+)$
Pass/Fail  Group 1  Group 2    Grouped correctly?
Pass       http     123        Yes
Pass       http     s456       No
Pass       http     something  Yes
They all pass, but you can't use the second set of results because you only wanted 456 in Group 2.
Fine, let's try again. Let's say Group 2 can be letters or numbers, but not both:
^(https?)([a-z]+|\d+)$
Pass/Fail  Group 1  Group 2   Grouped correctly?
Pass       http     123       Yes
Pass       https    456       Yes
Pass       https    omething  No
Now the second input is fine, but the third one is grouped wrong because ? is greedy by default (the + is too, but the ? came first). When deciding whether the s is part of https? or [a-z]+|\d+, if the result is a pass either way, the regex engine will always pick the one on the left. So Group 2 loses s because Group 1 sucked it up.
To fix this, you make one tiny change:
^(https??)([a-z]+|\d+)$
Pass/Fail  Group 1  Group 2    Grouped correctly?
Pass       http     123        Yes
Pass       https    456        Yes
Pass       http     something  Yes
Essentially, this means: "Match https if you have to, but see if this still passes when Group 1 is just http." The engine realizes that the s could work as part of [a-z]+|\d+, so it prefers to put it into Group 2.
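To see the grouping difference concretely, here is a small Python sketch. The names GREEDY and LAZY are mine, and both patterns are anchored with ^ and $ so that each input has to match as a whole.

```python
import re

# Same pattern twice; only the quantifier after "s" differs.
GREEDY = re.compile(r"^(https?)([a-z]+|\d+)$")
LAZY = re.compile(r"^(https??)([a-z]+|\d+)$")

for text in ("http123", "https456", "httpsomething"):
    # Every input matches both patterns; only the grouping can differ.
    print(text, GREEDY.match(text).groups(), LAZY.match(text).groups())
```

Only the third input should be grouped differently: the greedy version gives ('https', 'omething'), the lazy one ('http', 'something').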
The key difference between ? and ?? concerns their laziness. ?? is lazy, ? is not.
Let's say you want to search for the word "car" in a body of text, but you don't want to be restricted to just the singular "car"; you also want to match against the plural "cars".
Here's an example sentence:
I own three cars.
Now, if I wanted to match the word "car" and I only wanted to get the string "car" in return, I would use the lazy ?? like so:
cars??
This says, "look for the word car or cars; if you find either, return car and nothing more".
Now, if I wanted to match against the same words ("car" or "cars") and I wanted to get the whole match in return, I'd use the non-lazy ? like so:
cars?
This says, "look for the word car or cars, and return either car or cars, whatever you find".
In the world of computer programming, lazy generally means "evaluating only as much as is needed". So the lazy ?? only returns as much as is needed to make a match; since the "s" in "cars" is optional, don't return it. On the flip side, non-lazy (sometimes called greedy) operations evaluate as much as possible, hence the ? returns all of the match, including the optional "s".
Personally, I find myself using ? as a way of making other regular expression operators lazy (like the * and + operators) more often than I use it for simple character optionality, but YMMV.
See it in Code
Here's the above implemented in Clojure as an example:
(re-find #"cars??" "I own three cars.")
;=> "car"
(re-find #"cars?" "I own three cars.")
;=> "cars"
re-find is a function that takes a regular expression (here #"cars??") as its first argument and returns the first match it finds in its second argument, "I own three cars."
Some Other Uses of Question marks in regular expressions
Apart from what's explained in other answers, there are still 3 more uses of Question Marks in regular expressions.
 
 
Negative Lookahead
Negative lookaheads are used if you want to
match something not followed by something else. The negative
lookahead construct is the pair of parentheses, with the opening
parenthesis followed by a question mark and an exclamation point. x(?!x2)
example
Consider the word There.
By default, the regex e finds the third letter, e, in There:
There
  ^
However, if you don't want an e that is immediately followed by r, you can use the regex e(?!r). Now the match is:
There
    ^
Positive Lookahead
Positive lookahead works just the same. q(?=u) matches a q that
is immediately followed by a u, without making the u part of the
match. The positive lookahead construct is a pair of parentheses,
with the opening parenthesis followed by a question mark and an
equals sign.
example
Consider the word getting.
By default, the regex t finds the third letter, t, in getting:
getting
  ^
However, if you want the t that is immediately followed by i, you can use the regex t(?=i). Now the match is:
getting
   ^
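The two lookahead examples can be checked with a short Python sketch (Python is used here only for illustration; the indexes are zero-based, and the variable names are mine):

```python
import re

# Plain "e" finds the first e in "There", at index 2.
plain_e = re.search(r"e", "There")
# e(?!r) skips the e that is followed by r and finds the last e, at index 4.
e_not_before_r = re.search(r"e(?!r)", "There")

# Plain "t" finds the first t in "getting", at index 2.
plain_t = re.search(r"t", "getting")
# t(?=i) finds the t that is followed by i, at index 3; the i is not consumed.
t_before_i = re.search(r"t(?=i)", "getting")

print(plain_e.start(), e_not_before_r.start())
print(plain_t.start(), t_before_i.start(), t_before_i.group())
```

Note that the positive lookahead match is still just "t": the lookahead checks the next character without making it part of the match.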
Non-Capturing Groups
Whenever you place part of a regular expression in parentheses (),
it creates a numbered capturing group. The group stores the part of
the string matched by the part of the regular expression inside the
parentheses.
If you do not need the group to capture its match, you can turn it
into a non-capturing group:
(?:Value)
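A minimal Python sketch of the difference, assuming a made-up input "SetValue" and a made-up prefix group "Set" (neither comes from the answer above):

```python
import re

# Both parenthesised parts capture, so there are two groups.
capturing = re.search(r"(Set)(Value)", "SetValue")
# (?:Set) still has to match, but it is not stored as a group.
non_capturing = re.search(r"(?:Set)(Value)", "SetValue")

print(capturing.groups())
print(non_capturing.groups())
```

The overall match is the same in both cases; only the list of stored groups changes.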
See also this and this.
? simply makes the previous item (character, character class, group) optional:
colou?r
matches "color" and "colour"
(swimming )?pool
matches "a pool" and "the swimming pool"
?? is the same, but it's also lazy, so the item will be excluded if at all possible. As those docs note, ?? is rare in practice. I have never used it.
Running the test harness from the Oracle documentation with the reluctant "once or not at all" quantifier X?? shows that, when nothing follows it in the pattern, it always produces an empty match.
$ java RegexTestHarness
Enter your regex: x?
Enter input string to search: xx
I found the text "x" starting at index 0 and ending at index 1.
I found the text "x" starting at index 1 and ending at index 2.
I found the text "" starting at index 2 and ending at index 2.
Enter your regex: x??
Enter input string to search: xx
I found the text "" starting at index 0 and ending at index 0.
I found the text "" starting at index 1 and ending at index 1.
I found the text "" starting at index 2 and ending at index 2.
https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
Used on its own like this, it seems identical to the empty matcher.
Enter your regex:
Enter input string to search: xx
I found the text "" starting at index 0 and ending at index 0.
I found the text "" starting at index 1 and ending at index 1.
I found the text "" starting at index 2 and ending at index 2.
Enter your regex:
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.
Enter your regex: x??
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.
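The harness output above only covers x?? used on its own. The following Python sketch (Python instead of the Java harness, and the variable names are mine) suggests that the lazy quantifier does consume the x once the rest of the pattern requires it:

```python
import re

# On its own, x?? is satisfied by matching nothing, so the match is empty.
lazy_alone = re.match(r"x??", "xx").group()
# The greedy x? takes the x when it can.
greedy_alone = re.match(r"x?", "xx").group()
# When the following "y" can only match after the x is consumed, x?? consumes it.
lazy_then_y = re.match(r"x??y", "xy").group()
# The same happens when the whole string must be matched.
lazy_full = re.fullmatch(r"x??", "x").group()

print(repr(lazy_alone), repr(greedy_alone), repr(lazy_then_y), repr(lazy_full))
```

So x?? is not a guaranteed empty match in general; it is empty only when an empty match lets the overall pattern succeed.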

Regex Erasing all except numbers with limited digits

I want to erase everything except a \d{4,7} match, using only a replace operation.
Any ideas to get this?
Examples:
G-A15239L → 15239
(G-A and L should be selected and replaced by empty strings)
now200316stillcovid19asdf → 200316
(now and stillcovid19asdf should be selected and replaced by empty strings)
Also, the replacement text is not limited to the empty string;
substitutions such as $1 are possible too.
I'm using regex in the 'Kustom' apps (including KLCK, KLWP, KWGT).
I don't know which engine they use because there is no information about it.
You may use
(\d{4,7})?.?
Or
(\d{4,7})|.
and replace with $1.
Details
(\d{4,7})? - an optional capturing group (because of the ? after it; without the ?, the group is obligatory) matching 1 or 0 occurrences of 4 to 7 digits
| - or (used in the second pattern)
.? - any one char other than line break chars, 1 or 0 times (because of the ? right after it).
So, any match of 4 to 7 digits is kept (since $1 refers to the Group 1 value) and if there is a char after it, it is removed.
It looks as if the regex engine is Java-based, since all non-matching groups are replaced with null.
So, the only possible solution is to use a second pass to post-process the results: just replace null with some kind of delimiter, a newline for example.
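For comparison, here is a minimal Python sketch of the (\d{4,7})|. approach. Python is only a stand-in, since the engine behind the Kustom apps is unknown, and keep_digit_runs is a hypothetical helper name. The replacement is written as a function so that an unmatched group explicitly becomes an empty string rather than null.

```python
import re

def keep_digit_runs(text):
    """Keep runs of 4 to 7 digits and drop every other character."""
    # Either capture a run of 4-7 digits in group 1, or match any single char.
    # When group 1 did not take part in the match, substitute an empty string.
    return re.sub(r"(\d{4,7})|.", lambda m: m.group(1) or "", text)

print(keep_digit_runs("G-A15239L"))
print(keep_digit_runs("now200316stillcovid19asdf"))
```

The short run "19" is removed character by character, because it is too short for \d{4,7} and so falls through to the . alternative.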
Search: .*?(\d{4,7})[^\d]+|.*
Replace: $1
This works with your examples in, for instance, Notepad++ 6.0 or later (which comes with built-in PCRE support):
jalsdkfilwsehf
now200316stillcovid19asdf
G-A15239L
becomes:
200316
15239

Why the * regular expression indicates what can or cannot be its previous character

Take this for an example which I found in some blog,
"How about searching for apple word which was spelled wrong in a given file where apple is misspelled as ale, aple, appple, apppple, apppppple etc. To find all patterns
grep 'ap*le' filename
Readers should observe that the above pattern will match even ale word as * indicates 0 or more of previous character occurrence."
Now it's saying that "ale" will be accepted when we have ap*le. Aren't the "ap" and "le" fixed?
The * is a quantifier meaning 0 or more times for the previous pattern -- in this case a single literal p. You can also write the same thing as * with an explicit range quantifier:
ap{0,}le
The interesting question sometimes is 'what is the previous pattern?' It is often helpful to put a pattern in a group to aid understanding of what the 'previous pattern' is.
Consider wanting to find any of:
ale, aple, appple, apppple, apppppple, able, abbbbbbble
Your first try might be:
/ap|b*le/
  ^      literal 'p' is meant to be the first alternative (WRONG: the regex will use 'ap')
   ^     or
    ^    literal 'b'
What you want in this case is:
/a(?:p|b)*le/
If you do not want to match ale and only match aple, appple, apppple, apppppple, use the + instead of the * which means one or more:
/ap+le/
This is equivalent to /ap{1,}le/.
And if you want to match only the variants with up to 3 'p's (aple, appple) and leave out the ones with more, add a maximum to the quantifier:
/ap{1,3}le/
All the variants above will match apple correctly spelled. If you want only aple and appple, and not apple, use alternation:
/a(?:p|p{3})le/
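The patterns above can be compared side by side with a small Python sketch. The word list and the helper name matching are my own, and re.fullmatch is used so that each word has to match as a whole (grep, by contrast, looks for the pattern anywhere in a line).

```python
import re

WORDS = ["ale", "aple", "apple", "appple", "apppple", "able", "abbbbbbble"]

def matching(pattern):
    """Return the words from WORDS that the pattern matches in full."""
    return [word for word in WORDS if re.fullmatch(pattern, word)]

print(matching(r"ap*le"))          # zero or more p
print(matching(r"ap|b*le"))        # alternation splits into 'ap' and 'b*le'
print(matching(r"a(?:p|b)*le"))    # group limits the alternation to p or b
print(matching(r"ap+le"))          # one or more p
print(matching(r"ap{1,3}le"))      # one to three p
print(matching(r"a(?:p|p{3})le"))  # exactly one or exactly three p
```

The ungrouped /ap|b*le/ matches none of these words in full, which illustrates why the group is needed.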
No, it's not.
"*" in your case means zero or more occurrences of p, while a and le are fixed. If you need ap and le to be fixed, then this is what you need:
ap+le
"+" means at least once, with no limit on the number of occurrences.
This now requires one or more p after a and before l, so it won't select ale.

Regex to parse international floating-point numbers

I need a regex to get numeric values that can be
111.111,11
111,111.11
111,111
And separate the integer and decimal portions so I can store in a DB with the correct syntax
I tried ([0-9]{1,3}[,.]?)+([,.][0-9]{2})? with no success, since it doesn't detect the second part :(
The result should look like:
111.111,11 -> $1 = 111111; $2 = 11
First Answer:
This matches #,###,##0.00:
^[+-]?[0-9]{1,3}(?:\,?[0-9]{3})*(?:\.[0-9]{2})?$
And this matches #.###.##0,00:
^[+-]?[0-9]{1,3}(?:\.?[0-9]{3})*(?:\,[0-9]{2})?$
Joining the two (there are smarter/shorter ways to write it, but it works):
(?:^[+-]?[0-9]{1,3}(?:\,?[0-9]{3})*(?:\.[0-9]{2})?$)
|(?:^[+-]?[0-9]{1,3}(?:\.?[0-9]{3})*(?:\,[0-9]{2})?$)
You can also add a capturing group around the last comma (or dot) to check which one was used.
Second Answer:
As pointed out by Alan M, my previous solution could fail to reject a value like 11,111111.00, where one comma is missing but the other isn't. After some tests I reached the following regex, which avoids this problem:
^[+-]?[0-9]{1,3}
(?:(?<comma>\,?)[0-9]{3})?
(?:\k<comma>[0-9]{3})*
(?:\.[0-9]{2})?$
This deserves some explanation:
^[+-]?[0-9]{1,3} matches the first (1 to 3) digits;
(?:(?<comma>\,?)[0-9]{3})? matches an optional comma followed by 3 more digits, and captures the comma (or the absence of one) in a group called 'comma';
(?:\k<comma>[0-9]{3})* matches zero-to-any repetitions of the comma used before (if any) followed by 3 digits;
(?:\.[0-9]{2})?$ matches optional "cents" at the end of the string.
Of course, that will only cover #,###,##0.00 (not #.###.##0,00), but you can always join the regexes like I did above.
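As a rough check of the backreference idea, here is a Python sketch. The answer's syntax ((?<comma>...) and \k<comma>) is not Python's, so the pattern is rewritten with Python's (?P<comma>...) and (?P=comma); that translation and the name NUMBER are mine.

```python
import re

# Python spells the named group (?P<comma>...) and the backreference (?P=comma).
NUMBER = re.compile(
    r"^[+-]?[0-9]{1,3}"           # first 1 to 3 digits
    r"(?:(?P<comma>,?)[0-9]{3})?"  # optional comma + 3 digits; remember the comma
    r"(?:(?P=comma)[0-9]{3})*"     # same separator (or none) + 3 digits, repeated
    r"(?:\.[0-9]{2})?$"            # optional "cents"
)

for text in ("1,111,111.00", "1111111.00", "11,111111.00"):
    print(text, bool(NUMBER.match(text)))
```

The first two values are consistent in their use of commas and should match; the third mixes both styles and should be rejected.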
Final Answer:
Now, a complete solution. Indentations and line breaks are there for readability only.
^[+-]?[0-9]{1,3}
(?:
    (?:\,[0-9]{3})*
    (?:\.[0-9]{2})?
|
    (?:\.[0-9]{3})*
    (?:\,[0-9]{2})?
|
    [0-9]*
    (?:[\.\,][0-9]{2})?
)$
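The complete pattern can be tried in Python with re.VERBOSE, which ignores the indentation and line breaks. This is only a sketch: the name FULL is mine, and the decimal dot in the first branch is written escaped (\.) so that it matches a literal dot rather than any character.

```python
import re

FULL = re.compile(
    r"""
    ^[+-]?[0-9]{1,3}
    (?:
        (?:,[0-9]{3})*        # 1,234,567
        (?:\.[0-9]{2})?       # .89
    |
        (?:\.[0-9]{3})*       # 1.234.567
        (?:,[0-9]{2})?        # ,89
    |
        [0-9]*                # 1234567 (no thousands separator)
        (?:[.,][0-9]{2})?     # .89 or ,89
    )$
    """,
    re.VERBOSE,
)

for text in ("111.111,11", "111,111.11", "111,111", "1234567,89", "111x11"):
    print(text, bool(FULL.match(text)))
```

The last input is there to show that an arbitrary character in place of the decimal separator is rejected.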
And this variation captures the separators used:
^[+-]?[0-9]{1,3}
(?:
    (?:(?<thousand>\,)[0-9]{3})*
    (?:(?<decimal>\.)[0-9]{2})?
|
    (?:(?<thousand>\.)[0-9]{3})*
    (?:(?<decimal>\,)[0-9]{2})?
|
    [0-9]*
    (?:(?<decimal>[\.\,])[0-9]{2})?
)$
I would first use this regex to determine whether a comma or a dot is used as the decimal separator (it fetches the last of the two):
[0-9,\.]*([,\.])[0-9]*
I would then strip all occurrences of the other sign (the one the previous regex didn't capture). If there were no matches, you already have an integer and can skip the next steps. The removal of that sign can easily be done with a regex, but there are also many other functions which can do this faster/better.
You are then left with a number in the form of an integer, possibly followed by a comma or a dot and then the decimals, where the integer and decimal parts can easily be separated from each other with the following regex.
([0-9]+)[,\.]?([0-9]*)
Good luck!
Edit:
Here is an example written in Python. The code should be self-explanatory; if it is not, just ask.
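import re

value = input()  # e.g. 111.111,11

# Captures the last comma or dot in the number: the decimal separator.
delimiter_regex = re.compile(r'[0-9,.]*([,.])[0-9]*')
# Splits the cleaned-up number into integer and decimal parts.
split_regex = re.compile(r'([0-9]+)[,.]?([0-9]*)')

delimiter = delimiter_regex.findall(value)
if delimiter:  # no separator at all means the value is already an integer
    if delimiter[0] == ',':
        # Comma is the decimal separator, so dots are thousands separators.
        value = value.replace('.', '')
    elif delimiter[0] == '.':
        # Dot is the decimal separator, so commas are thousands separators.
        value = value.replace(',', '')
print(value)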
With this code, the following inputs give these outputs:
111.111,11 -> 111111,11
111,111.11 -> 111111.11
111,111 -> 111,111
After this step, one can now easily modify the string to match your needs.
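Putting both steps together, a minimal sketch could look like the following. The function name split_number and the constant names are hypothetical, and the "last separator is the decimal separator" rule is the heuristic from this answer, so an input like 111,111 is read as 111 with decimals 111.

```python
import re

# Captures the last comma or dot: treated as the decimal separator.
DELIMITER_RE = re.compile(r"[0-9,.]*([,.])[0-9]*")
# Integer part, optional separator, decimal part.
SPLIT_RE = re.compile(r"([0-9]+)[,.]?([0-9]*)")

def split_number(text):
    """Return (integer_part, decimal_part) as strings, or None if no digits."""
    found = DELIMITER_RE.search(text)
    if found:
        # Remove the other sign, which can only be a thousands separator.
        other = "." if found.group(1) == "," else ","
        text = text.replace(other, "")
    parts = SPLIT_RE.search(text)
    if parts is None:
        return None
    return parts.group(1), parts.group(2)

for value in ("111.111,11", "111,111.11", "111,111", "111111"):
    print(value, split_number(value))
```

If 111,111 should instead be read as the integer 111111, the heuristic needs an extra rule (for example, treating a lone separator followed by exactly three digits as a thousands separator); that choice depends on your data.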
How about
/(\d{1,3}(?:,\d{3})*)(\.\d{2})?/
if you care about validating that the commas separate every 3 digits exactly,
or
/(\d[\d,]*)(\.\d{2})?/
if you don't.
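A quick way to compare the two patterns is the Python sketch below. The names STRICT and LOOSE are mine, and re.fullmatch is used because the patterns as written are unanchored; whether you want whole-string matching is an assumption on my part.

```python
import re

STRICT = re.compile(r"(\d{1,3}(?:,\d{3})*)(\.\d{2})?")
LOOSE = re.compile(r"(\d[\d,]*)(\.\d{2})?")

strict_ok = STRICT.fullmatch("111,111.11")
strict_bad = STRICT.fullmatch("11,11.11")   # commas not every 3 digits
loose_ok = LOOSE.fullmatch("11,11.11")

print(strict_ok.groups())
print(strict_bad)
print(loose_ok.groups())

# Group 1 still contains the commas; strip them to get the integer digits.
print(strict_ok.group(1).replace(",", ""))
```

Note that group 2 includes the leading dot, so it may need to be stripped before storing the decimal part.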
If I'm interpreting your question correctly, and the result you describe is what the output should look like, then I think you just need to leave the comma out of the character class, since it is used as a separator and is not part of what is to be matched.
So get rid of the "." first, then match the two parts.
$value = "111.111,11";
$value =~ s/\.//g;
$value =~ m/(\d+)(?:,(\d+))?/;
# $1 = leading integers with periods removed
# $2 = undef if it didn't exist, or the post-comma digits if they do exist
See Perl's Regexp::Common::number.