Nested Groups in Regex - regex

I'm constructing a regex that looks for dates. I would like to return the date found and the sentence it was found in. In the code below, the strings on either side of date_string should check for the conditions of a sentence. For your sake, I've omitted the regex for date_string; suffice it to say, it works for picking out dates. While the inside of date_string isn't important, it is grouped as one entire regex.
"((?:[^.|?|!]*)"+date_string+"(?:[^.|?|!]*[.|?|!]\s*))"
The problem is that date_string matches only the last number of any given date, presumably because the regex in front of date_string matches too far and overruns the date regex. For example, if I say "Independence Day is July 4.", I will get the sentence and 4, even though it should match 'July 4'. In case you're wondering, the alternatives inside date_string are ordered in such a way that 'July 4' should match first. Is there any way to do this all in one regex? Or do I need to split it up somehow (i.e. split the text into sentences, and then check each sentence)?

There are several things wrong with your regex.
There is no alternation in character classes. You want [^.?!], not [^.|?|!].
You don't need the non-capturing groups at all.
You probably don't need any "outer" grouping, since the entire match is what you look for.
Your match part preceding the date is greedy where it should not be (this runs over part of your date).
You make assumptions about what resembles a sentence that do not match reality. Your own example proves that, if you try.
Putting that last point aside for the moment, you end up with this version:
[^.?!]*?(July 4)[^.?!]*[.?!]\s*
Where the literal July 4 stands in for your date regex. This matches in your question text:
' For example, if I say "Independence Day is July 4.'
'", I will get the sentence and 4, even though it should match 'July 4'. '
which pretty much proves my point #5.
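To see what the lazy quantifier changes, here is a minimal Python sketch. The date pattern is a hypothetical stand-in (the question never showed date_string); it lists 'July 4'-style dates before bare numbers, as the question describes.

```python
import re

text = "Independence Day is July 4."

# Hypothetical stand-in for date_string: month-and-day first, bare number second.
date = r"(July \d+|\d+)"

# Greedy prefix: swallows as much as possible, leaving only "4" for the date.
greedy = re.compile(r"[^.?!]*" + date + r"[^.?!]*[.?!]\s*")
# Lazy prefix: stops at the first position where the date can match.
lazy = re.compile(r"[^.?!]*?" + date + r"[^.?!]*[.?!]\s*")

greedy_date = greedy.search(text).group(1)
lazy_date = lazy.search(text).group(1)

print(greedy_date)  # 4
print(lazy_date)    # July 4
```

In both cases the full match is the whole sentence; only the captured date differs.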

You can make the repetition operator non-greedy by adding a question mark. In your case it would be
[^.?!]*?
And yes, splitting the text into sentences (preferably excluding the last character) would make it much easier.
(Seems like I didn't look at what was in the character class. Replaced it with tloflin's.)
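A rough Python sketch of the split-first approach, assuming a deliberately naive sentence splitter and a hypothetical stand-in for the date regex. Abbreviations, decimals, and quotes will trip the splitter, so treat it as an illustration only.

```python
import re

text = "Independence Day is July 4. Nothing to see here! Is Bastille Day July 14?"

# Hypothetical stand-in for date_string.
date = re.compile(r"July \d+")

# Naive splitter: a run of non-terminators followed by one terminator.
sentences = [s.strip() for s in re.findall(r"[^.?!]+[.?!]", text)]

# Check each sentence on its own and keep (date, sentence) pairs.
hits = []
for sentence in sentences:
    match = date.search(sentence)
    if match:
        hits.append((match.group(0), sentence))

for found, sentence in hits:
    print(found, "->", sentence)
```

Because each sentence is searched separately, the date regex no longer competes with the sentence regex for characters.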

Related

Regex Expression to extract county and zipcode

Given the below text, I want to extract county, state and zipcode i.e. BROWNSBURG, IN 46112.
With my current Regex Expression --
text = "BROWNSBURG, IN 46112 10 Other income (loss) 15 Alternative minimum tax (AMT) items"
regex = ([A-z]*[\S][\s]{1}[A-z]{2}[\d\s]+)
output = BROWNSBURG, IN 46112 10
It extracts BROWNSBURG, IN 46112 10, but I don't want the trailing 10. Can anyone suggest a change to the above regex? It works fine for most of the documents.
With only one example provided, I will start by assuming that the match you're looking for is always at the beginning of the line.
If so, it is much safer to add the ^ anchor. Otherwise, you should remove it.
^[A-Z\s]+,\s[A-Z]{2}\s\d{5}
When we break down the pattern, you will see why this works:
^ asserts the beginning of the line (remove if necessary)
[A-Z\s]+ will match any letter or space that comes prior to the ,\s. The space is important in the event of counties/cities that contain more than one word.
[A-Z]{2} must match a 2-letter state code
Then finally, \d{5} will match on the 5-digit zip code.
Here is your custom view of your pattern in action.
Placing your pattern in a capturing group is unnecessary. You can simply return the full match, as it will be the same as the submatch. And while this one seems to be pretty simple, please understand that there are different implementations of Regular Expressions in different languages, so specifying the language in your question tags may prove to be useful in the future.
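A quick check of the anchored pattern in Python, using the sample text from the question. Python is just an assumption here, since the question did not name a language.

```python
import re

text = ("BROWNSBURG, IN 46112 10 Other income (loss) "
        "15 Alternative minimum tax (AMT) items")

# City (one or more words), comma, 2-letter state code, 5-digit ZIP code.
pattern = re.compile(r"^[A-Z\s]+,\s[A-Z]{2}\s\d{5}")

match = pattern.search(text)
result = match.group(0) if match else None
print(result)  # BROWNSBURG, IN 46112
```

The full match is used directly, so no capturing group is needed.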

regex to match specific pattern of string followed by digits

Sample input:
___file___name___2000___ed2___1___2___3
DIFFERENT+FILENAME+(2000)+1+2+3+ed10
Desired output (e.g., all letters, 4-digit numbers, and the literal 'ed' followed immediately by a number of arbitrary length):
file name 2000 ed2
DIFFERENT FILENAME 2000 ed10
I am using:
[A-Za-z]+|[\d]{4}|ed\d+ which only returns:
file name 2000 ed
DIFFERENT FILENAME 2000 ed
I see that there is a related Q+A here: Regular Expression to match specific string followed by number?
E.g., using ed[0-9]* would match ed#, but I'm unsure why it does not match in the above.
As written, your regex is valid, but it does not do what you want. Remember that a regex engine tries the alternatives from left to right. Your ed\d+ is never going to match, because the ed was already consumed by your [A-Za-z]+ alternative. Reorder your regex and it'll work just fine:
ed\d+|[a-zA-Z]+|\d{4}
Demo
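The effect of the ordering can be checked with a short Python sketch (Python is an assumption; the question did not name a language), using the two sample inputs from the question.

```python
import re

samples = [
    "___file___name___2000___ed2___1___2___3",
    "DIFFERENT+FILENAME+(2000)+1+2+3+ed10",
]

# Original order: the letters alternative wins and eats "ed" on its own.
original = re.compile(r"[A-Za-z]+|[\d]{4}|ed\d+")
# Reordered: "ed" followed by digits is tried first.
reordered = re.compile(r"ed\d+|[a-zA-Z]+|\d{4}")

before = [" ".join(original.findall(s)) for s in samples]
after = [" ".join(reordered.findall(s)) for s in samples]

print(before)  # ['file name 2000 ed', 'DIFFERENT FILENAME 2000 ed']
print(after)   # ['file name 2000 ed2', 'DIFFERENT FILENAME 2000 ed10']
```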
Nick's answer is right, but because in-order matching can be a less readable "gotcha", the best (order-insensitive) ways to do this kind of search are 1) with specified delimiters, and 2) by making each search term unique.
Jan's answer handles #1 well. But you would have to specify each specific delimiter, including its length (e.g. ___). It sounds like you may have some unusual delimiters, so this may not be ideal.
For #2, then, you can make each search term unique. (That is, you want the thing matching "file" and "name" to be distinct from the thing matching "2000", and both to be distinct from the thing matching "ed2".)
One way to do this is [A-Za-z]+(?![0-9a-zA-Z])|[\d]{4}|ed\d+. This says that for the first type of search term, you want an alphabetic string that is not followed by an alphanumeric character. This keeps it distinct from the third search term, which is an alphabetic string followed by some digit(s). It also allows you to specify any range of delimiters inside that negative lookahead.
demo
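A small Python sketch (the language is an assumption) showing that the lookahead version gives the desired output without reordering the alternatives:

```python
import re

samples = [
    "___file___name___2000___ed2___1___2___3",
    "DIFFERENT+FILENAME+(2000)+1+2+3+ed10",
]

# Letters only count as a word when no letter or digit follows them,
# so "ed" in "ed2" falls through to the last alternative.
pattern = re.compile(r"[A-Za-z]+(?![0-9a-zA-Z])|[\d]{4}|ed\d+")

outputs = [" ".join(pattern.findall(s)) for s in samples]
print(outputs)  # ['file name 2000 ed2', 'DIFFERENT FILENAME 2000 ed10']
```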
You might very well use (just grab the first capturing group):
(?:^|___|[+(]) # delimiter before
([a-zA-Z0-9]{2,}) # the actual content
(?=$|___|[+)]) # delimiter afterwards
See a demo on regex101.com
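The delimiter-based pattern can be tried in Python with the re.VERBOSE flag, which permits the inline comments. This is only a sketch: note that the pattern accepts any alphanumeric token of two or more characters, not just words, 4-digit numbers, and ed-numbers.

```python
import re

pattern = re.compile(
    r"""
    (?:^|___|[+(])         # delimiter before
    ([a-zA-Z0-9]{2,})      # the actual content
    (?=$|___|[+)])         # delimiter afterwards
    """,
    re.VERBOSE,
)

samples = [
    "___file___name___2000___ed2___1___2___3",
    "DIFFERENT+FILENAME+(2000)+1+2+3+ed10",
]

# findall returns the first capturing group for each match.
outputs = [pattern.findall(s) for s in samples]
print(outputs)
```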

Regular expression: matching part of words [duplicate]

I'm trying to make a Regex that matches this string {Date HH:MM:ss}, but here's the trick: HH, MM and ss are optional, but it needs to be "HH", not just "H" (the same thing applies to MM and ss). If a single "H" shows up, the string shouldn't be matched.
I know I can use H{2} to match HH, but I can't seem to use that functionality plus the ? to match zero or one time (zero because it's optional, and one time max).
So far I'm doing this (which is obviously not working):
Regex dateRegex = new Regex(@"\{Date H{2}?:M{2}?:s{2}?\}");
Next question. Now that I have the match on the first string, I want to take only the HH:MM:ss part and put it in another string (that will be the format for a TimeStamp object).
I used the same approach, like this:
Regex dateFormatRegex = new Regex(@"(HH)?:?(MM)?:?(ss)?");
But when I try that on "{Date HH:MM}" I don't get any matches. Why?
If I add a space like this Regex dateFormatRegex = new Regex(@" (HH)?:?(MM)?:?(ss)?");, I have the result, but I don't want the space...
I thought that the first parenthesis needed to be escaped, but \( won't work in this case. I guess because it's not a parenthesis that is part of the string to match, but a key-character.
(H{2})? matches zero or two H characters.
However, in your case, writing it twice would be more readable:
Regex dateRegex = new Regex(@"\{Date (HH)?:(MM)?:(ss)?\}");
Besides that, make sure there are no functions available for whatever you are trying to do. Parsing dates is pretty common and most programming languages have functions in their standard library - I'd almost bet 1k of my reputation that .NET has such functions, too.
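The behaviour of the grouped-optional pattern can be checked with a short sketch. It is written in Python rather than C#, on the assumption that this basic syntax behaves the same in both engines. Note that only the HH, MM and ss parts are optional here; both colons remain mandatory.

```python
import re

date_regex = re.compile(r"\{Date (HH)?:(MM)?:(ss)?\}")

candidates = [
    "{Date HH:MM:ss}",  # all parts present
    "{Date :MM:ss}",    # HH omitted, colons kept
    "{Date H:MM:ss}",   # a single H must not match
    "{Date HH:MM}",     # second colon missing
]

results = {c: bool(date_regex.fullmatch(c)) for c in candidates}
for candidate, matched in results.items():
    print(candidate, matched)
```

If inputs such as {Date HH:MM} should also match, the colons would need to be made optional as well.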
In your edit you mention an unwanted leading space in the result. To check a leading or trailing condition together with your regex without including it in the result, you can use the lookaround feature of regex.
new Regex(@"(?<=Date )(HH)?:?(MM)?:?(ss)?")
(?<=...) is a lookbehind pattern.
Regex test site with this example.
For the input Date HH:MM:ss, both regexes (with and without the lookbehind) will match.
The input FooBar HH:MM:ss will still match the simple regex, but the lookbehind version will fail here. Lookaround doesn't change the content of the result, but it prevents false matches (e.g., this second input, which is not a Date).
Find more information on regex and lookaround here.
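A Python sketch of the same idea (Python stands in for C# here, on the assumption that the lookbehind syntax behaves the same way). It also hints at one likely reason the pattern without the space seemed to find nothing: every part of it is optional, so the first match is an empty string at the very start of the input.

```python
import re

plain = re.compile(r"(HH)?:?(MM)?:?(ss)?")
anchored = re.compile(r"(?<=Date )(HH)?:?(MM)?:?(ss)?")

# Everything in `plain` is optional, so it succeeds immediately with "".
empty_hit = plain.search("{Date HH:MM}").group(0)

# The lookbehind forces the match to start right after "Date ",
# without including that prefix in the result.
format_hit = anchored.search("{Date HH:MM}").group(0)

# No "Date " prefix anywhere, so the lookbehind version finds nothing.
no_hit = anchored.search("FooBar HH:MM:ss")

print(repr(empty_hit), repr(format_hit), no_hit)
```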

Regex capture into group everything from string except part of string

I'm trying to create a regex that captures everything from a string except for specific parts of it. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a continuous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
However, this common kind of task is easily solved by replacing the unwanted words with empty strings, like this:
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
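For example, a Python equivalent of the substitution could look like this (Python is an assumption; the question did not name a language):

```python
import re

# Remove each unwanted word together with one adjacent hyphen.
pattern = re.compile(r"-production\b|\bproduction-|-public\b|\bpublic-")

samples = [
    "california-public-local-card-production",
    "production-nevada-public",
]

cleaned = [pattern.sub("", s) for s in samples]
print(cleaned)  # ['california-local-card', 'nevada']
```

Thanks to the word boundaries, a string such as my-publication-notes is left untouched.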
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times and combine the captured groups to retrieve the result.
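A sketch of how the pieces from repeated matching might be stitched together in Python. The helper name keep_rest is hypothetical, and a plain non-capturing group (?:...) replaces the atomic group (?>...), which older Python versions do not support. Any text after the last keyword is never captured by this pattern and would be lost.

```python
import re

pattern = re.compile(r"([\s\S]*?)(?:production|public)")

def keep_rest(text):
    """Join the text found before each keyword (hypothetical helper)."""
    # Group 1 of every match is the text preceding a keyword;
    # strip the leftover hyphens and drop empty pieces.
    pieces = (m.group(1).strip("-") for m in pattern.finditer(text))
    return "-".join(p for p in pieces if p)

print(keep_rest("california-public-local-card-production"))  # california-local-card
print(keep_rest("production-nevada-public"))                 # nevada
```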

Regular expression for repeated sequence

I am learning regular expressions. I am trying to extract the dates from the string below. The element <ext:serviceitem> can be repeated up to 20 times in the actual XML. I need only the date strings: for any element whose name ends with Date, I need that element's value, which is a date. For example, StartDate and EndDate. I want all of those dates (and only those) to be printed out.
<ext:serviceitem><ext:name>EnhancedSupport</ext:name><ext:serviceItemData><ext:serviceItemAttribute name="Name">E69D7F93-81F4-09E2-E043-9D3226AD8E1D-1</ext:serviceItemAttribute><ext:serviceItemAttribute name="ProductionDatabase">P1APRD</ext:serviceItemAttribute><ext:serviceItemAttribute name="SupportType">Monthly</ext:serviceItemAttribute><ext:serviceItemAttribute name="Environment">DV1</ext:serviceItemAttribute><ext:serviceItemAttribute name="StartDate">2013-11-04 10:02</ext:serviceItemAttribute><ext:serviceItemAttribute name="EndDate">2013-11-12 10:02</ext:serviceItemAttribute><ext:serviceItemAttribute name="No_of_WeeksSupported"></ext:serviceItemAttribute><ext:serviceItemAttribute name="Cost"></ext:serviceItemAttribute><ext:serviceItemAttribute name="SupportNotes"></ext:serviceItemAttribute><ext:serviceItemAttribute name="FiscalQuarterNumber"></ext:serviceItemAttribute><ext:subscription><ext:loginID>kbasavar</ext:loginID><ext:ouname>020072748</ext:ouname></ext:subscription></ext:serviceItemData></ext:serviceitem><ext:serviceitem><ext:name>EnhancedSupport</ext:name><ext:serviceItemData><ext:serviceItemAttribute name="Name">E69D7F93-81F4-09E2-E043-9D3226AD8E1D-2</ext:serviceItemAttribute><ext:serviceItemAttribute name="ProductionDatabase">P1BPRD</ext:serviceItemAttribute><ext:serviceItemAttribute name="SupportType">Quarterly</ext:serviceItemAttribute><ext:serviceItemAttribute name="Environment">TS2</ext:serviceItemAttribute><ext:serviceItemAttribute name="StartDate">2013-11-11 10:03</ext:serviceItemAttribute><ext:serviceItemAttribute name="EndDate">2013-11-28 10:03</ext:serviceItemAttribute><ext:serviceItemAttribute name="No_of_WeeksSupported"></ext:serviceItemAttribute><ext:serviceItemAttribute name="Cost"></ext:serviceItemAttribute><ext:serviceItemAttribute name="SupportNotes"></ext:serviceItemAttribute><ext:serviceItemAttribute 
name="FiscalQuarterNumber"></ext:serviceItemAttribute><ext:subscription><ext:loginID>kbasavar</ext:loginID><ext:ouname>020072748</ext:ouname></ext:subscription></ext:serviceItemData></ext:serviceitem>
I tried the regex below, but it's returning the rest of the string after the first occurrence.
(?<=Date\"\>).*(?=\<\/ext\:serviceItemAttribute\>)
Any help would be highly appreciated.
Your problem is that .* is greedy, meaning that it will grab from the first instance of Date to the last instance of </ext:ser..... Replace the .* with .*? and it will alter the behaviour to what you're after.
#(?<=Date">).*?(?=</ext:serviceItemAttribute>)#i
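The greedy and lazy versions can be compared with a short Python sketch, run against a shortened, hand-made excerpt of the XML from the question (Python and the excerpt are both assumptions for illustration):

```python
import re

# Shortened excerpt of the question's XML.
xml = (
    '<ext:serviceItemAttribute name="StartDate">2013-11-04 10:02</ext:serviceItemAttribute>'
    '<ext:serviceItemAttribute name="EndDate">2013-11-12 10:02</ext:serviceItemAttribute>'
    '<ext:serviceItemAttribute name="Cost"></ext:serviceItemAttribute>'
)

# Greedy: runs from the first Date"> to the last closing tag.
greedy = re.findall(r'(?<=Date">).*(?=</ext:serviceItemAttribute>)', xml)
# Lazy: stops at the first closing tag after each Date">.
lazy = re.findall(r'(?<=Date">).*?(?=</ext:serviceItemAttribute>)', xml)

print(len(greedy))  # 1
print(lazy)         # ['2013-11-04 10:02', '2013-11-12 10:02']
```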
You should have .*? in a capture group: (.*?).
#(?<=Date">)(.*?)(?=</ext:serviceItemAttribute>)#i
You could also do it - more simply - like:
#Date">(.*?)</ext#i
Update
As has been pointed out in the comment below, the solution above relies on non-greedy matching.
To avoid that dependency, you could use ([^<]*) instead of (.*?).
NOTE: This does not impact the alternatives below.
Alternatives
/(\d{4}-\d{2}-\d{2})/
/(\d{4}-\d{2}-\d{2} \d{2}:\d{2})/
The above patterns will match dates in the format YYYY-XX-XX and YYYY-XX-XX HH:MM respectively
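A brief Python sketch of the negated-class variant and the date-shaped alternative, using a shortened, hand-made excerpt of the question's XML. Keep in mind that the date-shaped pattern matches any text in that format, whether or not it sits inside a *Date element.

```python
import re

# Shortened excerpt of the question's XML.
xml = (
    '<ext:serviceItemAttribute name="StartDate">2013-11-04 10:02</ext:serviceItemAttribute>'
    '<ext:serviceItemAttribute name="EndDate">2013-11-12 10:02</ext:serviceItemAttribute>'
    '<ext:serviceItemAttribute name="Cost"></ext:serviceItemAttribute>'
)

# Negated character class: no reliance on lazy matching.
by_tag = re.findall(r'Date">([^<]*)</ext', xml, flags=re.IGNORECASE)

# Date-shaped pattern: YYYY-XX-XX HH:MM anywhere in the text.
by_shape = re.findall(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}", xml)

print(by_tag)
print(by_shape)
```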