Regex nested optional groups - regex

I am trying to capture the bold part of strings like this:
'capture a year range at the end of a string 1995-2010'
'if there's no year range just capture the single year 2005'
'capture a year/year range followed by a parenthesis, including the parenthesis 2007-2012 (58 months)'
This regex works for 1 and 2, but I can't get it to work for 3:
/(\d+([-–— ]\d+( \(\d+ months\))?)?$)/
What am I doing wrong?

Try this regex:
/\d{4}(?:[-–— ]\d{4})?(?:\s*\([^)]+\))?$/gm
This one matches any text inside the parentheses.
If you need a regex specific to the text "(number) months" in the brackets, then you can use this: \d{4}(?:[-–— ]\d{4})?(?:\s+\(\d+\smonths\))?$
Link to test: RegexPal or RegExr
Sample text:
capture a year range at the end of a string 1995-2010
if there's no year range just capture the single year 2005
capture a year/year range followed by a parenthesis, including the parenthesis 2007-2012 (58 months)
trying out another example 1990 (23 weeks)
trying out another example 1995-2002 (x days)
trying out another example 2050 (blah blah)
trying out another example 2050—3000
trying out another example 2050-3000
trying out another example 2050–3000
And the JavaScript code:
var regex = /\d{4}(?:[-–— ]\d{4})?(?:\s*\([^)]+\))?$/gm; // g: all matches, m: $ matches at each line end
var input = "your input string";
var matches = input.match(regex); // null when nothing matches
if (matches) {
    for (var i = 0; i < matches.length; i++) {
        alert(matches[i]);
    }
} else {
    alert("No matches found!");
}

This Regex works nicely. :)
/(?:(?:\d{4}[-–— ])?\d{4})(?: \(\d+ months\))?$/
The main difference between my regex and Jonah's is that mine uses ?: to make the sub-groups non-capturing. A parenthesized group is captured by default unless you tell it not to, and captured groups change the results of methods such as replace or split (split, for example, includes the captured text in its output), which may be part of your problem as well.
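The effect of capturing groups on split can be seen in a small sketch. It is written in Python rather than JavaScript (an assumption made for illustration; Python's re.split treats captured groups the same way), and the sample string is made up:

```python
import re

text = "1995-2010"

# Capturing group: the separator itself is included in the result.
with_capture = re.split(r"(-)", text)

# Non-capturing group: only the pieces between separators are returned.
without_capture = re.split(r"(?:-)", text)

print(with_capture)
print(without_capture)
```

The first call returns the dash as an extra list element, which is the kind of surprise that non-capturing groups avoid.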

The following regex works for me in a sample Perl script. It should be workable in JavaScript:
/(\d{4}([-–— ]\d{4})?( \(\d+ months\))?)$/
We first match a 4-digit year: \d{4}
Then we match an optional separator followed by another 4-digit year: ([-–— ]\d{4})?
Finally, we match the optional months portion: ( \(\d+ months\))?
You may need to insert whitespace matches (\s*) where needed, if your data doesn't always follow this strict template.
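The three steps above can be sketched in Python by building the pattern from its parts. The names YEAR, RANGE, MONTHS and trailing_years are hypothetical, and the groups are made non-capturing here, which is a small deviation from the regex in this answer:

```python
import re

YEAR = r"\d{4}"                   # a 4-digit year
RANGE = r"(?:[-–— ]\d{4})?"       # optional separator plus a second year
MONTHS = r"(?: \(\d+ months\))?"  # optional " (N months)" suffix
PATTERN = re.compile(YEAR + RANGE + MONTHS + r"$")

def trailing_years(line):
    """Return the year, year range, or year range plus months at the end of line."""
    m = PATTERN.search(line)
    return m.group(0) if m else None

samples = [
    "capture a year range at the end of a string 1995-2010",
    "if there's no year range just capture the single year 2005",
    "including the parenthesis 2007-2012 (58 months)",
]
results = [trailing_years(s) for s in samples]
print(results)
```

Because the months group sits next to the range group instead of inside it, a single year followed by a months suffix would also match.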

It actually works fine here, if I understand your needs correctly: Gskinner RegExr
Just alternate which sentence comes last: without the multiline flag, $ matches only at the end of the string, not before each newline.


Regex pattern for mm/dd/yyyy and mmddyyyy in Scala

I have date in my .txt file which comes like either of the below:
mmddyyyy
OR
mm/dd/yyyy
Below is the regex which works fine for mm/dd/yyyy.
^02\/(?:[01]\d|2\d)\/(?:19|20)(?:0[048]|[13579][26]|[2468][048])|(?:0[13578]|10|12)\/(?:[0-2]\d|3[01])\/(?:19|20)\d{2}|(?:0[469]|11)\/(?:[0-2]\d|30)\/(?:19|20)\d{2}|02\/(?:[0-1]\d|2[0-8])\/(?:19|20)\d{2}$
However, I am unable to build the regex for mmddyyyy. Is there a generic regex that would work for both cases?
Why use regex for this? Seems like a case of "Now you have two problems"
It would be more effective (and easier to understand) to use a DateTimeFormatter (assuming you are on the JVM and not using scala-js)
The format patterns support using [] to surround optional sections, such as the /, and the formatters inherently perform input validation so if you plug in a month or day that can't exist, it'll throw an exception.
import java.time.format.DateTimeFormatter
import java.time.LocalDate
val mdy = DateTimeFormatter.ofPattern("MM[/]dd[/]yyyy")
def parse(rawDate: String) = LocalDate.parse(rawDate, mdy)
scala> parse("12252022")
res7: java.time.LocalDate = 2022-12-25
scala> parse("12/25/2022")
res8: java.time.LocalDate = 2022-12-25
scala> parse("25/12/2022")
java.time.format.DateTimeParseException: Text '25/12/2022' could not be parsed: Invalid value for MonthOfYear (valid values 1 - 12): 25
scala> parse("abc123")
java.time.format.DateTimeParseException: Text 'abc123' could not be parsed at index 0
If you want to match all those variations with either 2 forward slashes or only digits, you can use a positive lookahead to assert either only digits or 2 forward slashes surrounded by digits.
Then in the pattern itself you can make matching the / optional.
Note that you don't have to escape the forward slash, so \/ can be written as /.
^(?=\d+(?:/\d+/\d+)?$)(?:02/?(?:[01]\d|2\d)/?(?:19|20)(?:0[048]|[13579][26]|[2468][048])|(?:0[13578]|10|12)/?(?:[0-2]\d|3[01])/?(?:19|20)\d{2}|(?:0[469]|11)/?(?:[0-2]\d|30)/?(?:19|20)\d{2}|02/?(?:[0-1]\d|2[0-8])/?(?:19|20)\d{2})$
Regex demo
Another option is to write an alternation | matching the same pattern without the / in it.
First of all, there is a tiny shortcoming in your regex: the ^ anchor only applies to the first part of your regex, not to the other alternatives that are separated by |. Similarly the final $ applies only to the final alternative. You should put all alternatives in a non-capturing group, like ^(?: | | | )$
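The anchor precedence issue can be checked with a toy pattern. This is a minimal Python sketch with made-up alternatives (ab and cd), not the date regex itself:

```python
import re

loose = re.compile(r"^ab|cd$")       # parsed as (^ab) | (cd$)
strict = re.compile(r"^(?:ab|cd)$")  # both anchors apply to every alternative

# "abXYZ" starts with "ab", so the first alternative of the loose pattern
# matches even though the string does not end there.
loose_hit = loose.search("abXYZ") is not None
strict_hit = strict.search("abXYZ") is not None

print(loose_hit, strict_hit)
```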
Then for the question itself, you could make the forward slash that follows the month optional and put it in a capture group. Then what comes between the day and the year could be a backreference to that capture group. So (\/?) and \1.
^(?:02(\/?)(?:[01]\d|2\d)\1(?:19|20)(?:0[048]|[13579][26]|[2468][048])|(?:0[13578]|10|12)(\/?)(?:[0-2]\d|3[01])\2(?:19|20)\d{2}|(?:0[469]|11)(\/?)(?:[0-2]\d|30)\3(?:19|20)\d{2}|02(\/?)(?:[0-1]\d|2[0-8])\4(?:19|20)\d{2})$
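The backreference idea can be tried out on a much simpler pattern. The sketch below is in Python and is a simplification, not the regex above: it only checks the shape (two digits, optional slash, two digits, the same separator again, four digits) and leaves calendar validation to datetime.date. The names SHAPE and parse_mdy are hypothetical, and the year is not restricted to 19xx/20xx here:

```python
import re
from datetime import date

# Group 2 captures the optional slash; \2 demands the same separator again.
SHAPE = re.compile(r"^(\d{2})(/?)(\d{2})\2(\d{4})$")

def parse_mdy(raw):
    """Return a date for mmddyyyy or mm/dd/yyyy input, or None if invalid."""
    m = SHAPE.match(raw)
    if m is None:
        return None
    month, _sep, day, year = m.groups()
    try:
        return date(int(year), int(month), int(day))
    except ValueError:  # e.g. month 13 or February 30
        return None

print(parse_mdy("12252022"), parse_mdy("12/25/2022"), parse_mdy("12/252022"))
```

Mixed forms such as 12/252022 are rejected because the second separator must equal whatever the first group captured, including the empty string.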

How to capture text between a specific word and the semicolon immediately preceding it with regex?

I have many rows of people and titles in Excel, and am looking to filter out certain people by title. For example, cells may contain the following:
John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder
These cells are varying lengths and have varying numbers of people and titles. My plan is to add semicolons at the beginning and end to standardize it. This would give me:
;John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder;
Currently, I have code that iterates through the cells and uses the regex Founder.*?; which returns each instance of Founder (i.e. Founder;Founder;), but I can't figure out how to also capture the names of the people. I think I need to anchor on the semicolon immediately preceding "Founder", but so far I have not been able to get this working. My ultimate goal is to return something like the following; I already have the code for it, except for the correct regular expression.
;John Smith, Co-Founder;James Jackson, Co-Founder;
Depending on your version of Excel, you could also do this with a formula:
=FILTERXML("<t><s>" & SUBSTITUTE(A1,";","</s><s>")&"</s></t>","//s[contains(.,'Co-Founder')]")
However, for a regex, you could use
(?:^|;)([^;]*?Co-Founder)
which will return the Co-Founders in capturing group 1.
There is no need for leading/trailing semicolons.
Even though VBA regex does not support look-behind, you can work with that limitation.
The Co-Founders regex:
(?:^|;)([^;]*?Co-Founder)
Options: Case sensitive (or not, as you prefer); ^$ match at line breaks
Match the regular expression below: (?:^|;)
    Match this alternative: ^
        Assert position at the beginning of the string: ^
    Or match this alternative: ;
        Match the character “;” literally: ;
Match the regex below and capture its match into backreference number 1: ([^;]*?Co-Founder)
    Match any character that is NOT a “;”: [^;]*?
        Between zero and unlimited times, as few times as possible, expanding as needed (lazy): *?
    Match the character string “Co-Founder” literally: Co-Founder
Created with RegexBuddy
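For readers who want to try the pattern outside VBA, here is a minimal sketch using Python's re module (an assumption for illustration; the question itself is about Excel). The variable names are hypothetical:

```python
import re

cell = "John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder"

# Group 1 holds each entry that ends in "Co-Founder"; the leading
# (?:^|;) is matched but not captured.
co_founders = re.findall(r"(?:^|;)([^;]*?Co-Founder)", cell)

# Rebuild the semicolon-delimited form asked for in the question.
rebuilt = ";" + ";".join(co_founders) + ";"

print(co_founders)
print(rebuilt)
```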
Split the whole string on semicolons and apply a positive filter; the getCoFounders() function then returns an array of the matches:
Sub ExampleCall()
    Dim s As String
    s = ";John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder;"
    Debug.Print Join(getCoFounders(s), "|")
End Sub
Function getCoFounders(s As String)
    getCoFounders = Filter(Split(s, ";"), "Co-Founder", True, vbTextCompare)
End Function
Results in VB Editor's immediate window
John Smith, Co-Founder|James Jackson, Co-Founder
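The same split-and-filter idea, sketched in Python for comparison (not VBA; the function name get_co_founders is hypothetical, and the case-insensitive test is meant to mirror vbTextCompare):

```python
def get_co_founders(s):
    """Split on semicolons and keep the parts that mention Co-Founder."""
    return [part for part in s.split(";") if "co-founder" in part.lower()]

s = ";John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder;"
joined = "|".join(get_co_founders(s))
print(joined)
```

The empty strings produced by the leading and trailing semicolons are dropped by the filter, so no extra cleanup is needed.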

Regex: Separate a string of characters with a non-consistent pattern (Oracle) (POSIX ERE)

EDIT: This question pertains to Oracle implementation of regex (POSIX ERE) which does not support 'lookaheads'
I need to separate a string of characters with a comma, however, the pattern is not consistent and I am not sure if this can be accomplished with Regex.
Corpus: 1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25
The pattern is basically 4 digits, followed by 4 characters, followed by a dot, followed by 1, 2, or 3 digits. To make the string above clear, this is how it looks when separated by spaces: 1710ABCD.13 1711ABCD.43 1711ABCD.4 1711ABCD.404 1711ABCD.25
So the output of a replace operation should look like this:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
I was able to match the pattern using this regex:
(\d{4}\w{4}\.\d{1,3})
It does insert a comma, but after the third digit beyond the dot (wrong; it should have been after the second digit), and I cannot get it to insert the comma in the right position and globally.
Here is a link to a fiddle
https://regex101.com/r/qQ2dE4/329
All you need is a lookahead at the end of the regular expression, so that the greedy \d{1,3} backtracks until it's followed by 4 digits (indicating the start of the next substring):
(\d{4}\w{4}\.\d{1,3})(?=\d{4})
                     ^^^^^^^^^
https://regex101.com/r/qQ2dE4/330
To expand on @CertainPerformance's answer, if you want to be able to match the last token, you can use an alternative match of $:
(\d{4}\w{4}\.\d{1,3})(?=\d{4}|$)
Demo: https://regex101.com/r/qQ2dE4/331
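As a quick check of the lookahead version, here is a minimal Python sketch (Python's re supports lookaheads; the capture group is dropped so that findall returns whole tokens, and the variable names are hypothetical):

```python
import re

corpus = "1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25"

# The greedy \d{1,3} backtracks until it is followed by 4 digits
# (the start of the next token) or by the end of the string.
tokens = re.findall(r"\d{4}\w{4}\.\d{1,3}(?=\d{4}|$)", corpus)
joined = ",".join(tokens)

print(joined)
```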
EDIT: Since you now mentioned in the comment that you're using Oracle's implementation, you can simply do:
regexp_replace(corpus, '(\d{1,3})(\d{4})', '\1,\2')
to get your desired output:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
Demo: https://regex101.com/r/qQ2dE4/333
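The lookahead-free replacement can be tried outside the database as well. The sketch below uses Python's re.sub as a stand-in for Oracle's regexp_replace (an assumption for illustration only; the two engines are not identical in general), with the same pattern and replacement:

```python
import re

corpus = "1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25"

# Group 1 is the 1-3 digits after the dot, group 2 the 4 digits that start
# the next token; backtracking in group 1 finds the right split point.
result = re.sub(r"(\d{1,3})(\d{4})", r"\1,\2", corpus)

print(result)
```

This relies on the 4 characters in the middle of each token not being digits, as in the sample data; a leading run of 5 or more digits would be split too.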
In order to continue finding matches after the first one you must use the global flag /g. The pattern is very tricky but it's feasible if you reverse the string.
Demo
var str = `1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25`;
// Reverse String
var rts = str.split("").reverse().join("");
// Do a reverse version of RegEx
/*In order to continue searching after the first match,
use the `g`lobal flag*/
var rgx = /(\d{1,3}\.\w{4}\d{4})/g;
// Replace on reversed String with a reversed substitution, then drop the
// leading comma (it would become a trailing comma once reversed back)
var res = rts.replace(rgx, `,$1`).slice(1);
// Revert the result back to normal direction
var ser = res.split("").reverse().join("");
console.log(ser);

Building a Regex String - Any assistance provided

I'm very new to regex. I understand its purpose, but I'm still struggling to fully comprehend how to use it. I'm trying to build a regex to pull A8OP2B out of the following (or whatever ends up in that 5th group).
{"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}}
The other items in the line above will change in character length, so I cannot simply take the 51st to the 56th characters. The value I want will always be the 5th group in quotation marks, though.
I've tried building up various regex strings, but it's still mostly a foreign language to me and I still have much reading to do on it.
Could anyone provide me a working example with the above, so I can reverse engineer and understand better?
Thanks
Demo 1: Reference the JSON to a var, then use either dot or bracket notation.
Demo 2: Using RegEx is not recommended, but here's one in JavaScript:
/\b(\w{6})(?=","RfKey":)/g
First match:
Non-consuming match: the word boundary \b, which sits between the non-word character " and the word character A
Consuming match: A8OP2B
    begin capture: (, any word character: \w, 6 times: {6}
    end capture: )
Non-consuming match: ","RfKey":
    lookahead: (?=","RfKey":) asserts that this text follows without consuming it
Demo 1
var obj = {"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}};
var dataDot = obj.RfReceived.Data;
var dataBracket = obj['RfReceived']['Data'];
console.log(dataDot);
console.log(dataBracket);
Demo 2
Note: The input string below contains 3 consecutive records, so 3 matches are expected.
var rgx = /\b(\w{6})(?=","RfKey":)/g;
var str = `{"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}},{"RfReceived":{"Sync":8080,"Low":102,"High":1200,"Data":"PFN07U","RfKey":"None"}},{"RfReceived":{"Sync":7580,"Low":471,"High":360,"Data":"XU89OM","RfKey":"None"}}`;
var res = str.match(rgx);
console.log(res);
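For comparison, both approaches can be sketched in Python. The regex here is a hypothetical variant keyed on the "Data" field name rather than on a fixed length of 6, which is an assumption about what stays stable in the input:

```python
import json
import re

raw = '{"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}}'

# Preferred: parse the JSON and read the field directly.
data_parsed = json.loads(raw)["RfReceived"]["Data"]

# Regex alternative: capture whatever sits between the quotes after "Data":
m = re.search(r'"Data":"([^"]*)"', raw)
data_regex = m.group(1) if m else None

print(data_parsed, data_regex)
```

Parsing keeps working if the value changes length or the fields are reordered, which is why it is usually the safer choice.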

1 to 5 of the same groups in REGEX

For a string such as:
abzyxcabkmqfcmkcde
Notice that there are string patterns between ab and c in bold. To capture the first string pattern:
ab([a-z]{3,5})c
Is it possible to match both of the groups from the sample string? Actually, there should be 1 to 5 groups.
Note: python style regex.
You can verify that a given string conforms to the 1-5 repetitions of ab([a-z]{3,5})c using this regex
(?:ab([a-z]{3,5})c){1,5}
or this one if there are characters expected between the groups
(?:ab([a-z]{3,5})c.*?){1,5}
However, you will only be able to extract the last matching group from that string, not any of the previous ones. To get the previous ones, you need to use hsz's approach.
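The last-group-only behavior can be seen in a short Python sketch using the sample string from the question (variable names are hypothetical):

```python
import re

s = "abzyxcabkmqfcmkcde"

m = re.match(r"(?:ab([a-z]{3,5})c){1,5}", s)

whole = m.group(0)  # everything consumed by the repeated group
last = m.group(1)   # only the capture from the final repetition survives

print(whole, last)
```

The capture from the first repetition is overwritten by the second, so only one value is available through group 1.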
Just match all results - i.e. with g flag:
/ab([a-z]{3,5})c/g
or some method like in Python:
re.findall(pattern, string, flags=0)
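A minimal sketch of the findall approach on the sample string, with an added check (an assumption about how the 1 to 5 requirement might be enforced) on the number of groups found:

```python
import re

s = "abzyxcabkmqfcmkcde"

# With one capture group, findall returns the group text of every match.
groups = re.findall(r"ab([a-z]{3,5})c", s)

# Enforce the "1 to 5 groups" requirement separately.
valid = 1 <= len(groups) <= 5

print(groups, valid)
```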