Regular expression for repeated sequence - regex

i am a learner of regular expressions. I am trying to find the date from the below string. The element <ext:serviceitem> can be repeated upto 20 times in actual xml. I need to take out only the date strings (like any element ending with Date in its name, i need that element's value which is a date). For example and . I want all those dates (only) to be printed out.
<ext:serviceitem><ext:name>EnhancedSupport</ext:name><ext:serviceItemData><ext:serviceItemAttribute name="Name">E69D7F93-81F4-09E2-E043-9D3226AD8E1D-1</ext:serviceItemAttribute><ext:serviceItemAttribute name="ProductionDatabase">P1APRD</ext:serviceItemAttribute><ext:serviceItemAttribute name="SupportType">Monthly</ext:serviceItemAttribute><ext:serviceItemAttribute name="Environment">DV1</ext:serviceItemAttribute><ext:serviceItemAttribute name="StartDate">2013-11-04 10:02</ext:serviceItemAttribute><ext:serviceItemAttribute name="EndDate">2013-11-12 10:02</ext:serviceItemAttribute><ext:serviceItemAttribute name="No_of_WeeksSupported"></ext:serviceItemAttribute><ext:serviceItemAttribute name="Cost"></ext:serviceItemAttribute><ext:serviceItemAttribute name="SupportNotes"></ext:serviceItemAttribute><ext:serviceItemAttribute name="FiscalQuarterNumber"></ext:serviceItemAttribute><ext:subscription><ext:loginID>kbasavar</ext:loginID><ext:ouname>020072748</ext:ouname></ext:subscription></ext:serviceItemData></ext:serviceitem><ext:serviceitem><ext:name>EnhancedSupport</ext:name><ext:serviceItemData><ext:serviceItemAttribute name="Name">E69D7F93-81F4-09E2-E043-9D3226AD8E1D-2</ext:serviceItemAttribute><ext:serviceItemAttribute name="ProductionDatabase">P1BPRD</ext:serviceItemAttribute><ext:serviceItemAttribute name="SupportType">Quarterly</ext:serviceItemAttribute><ext:serviceItemAttribute name="Environment">TS2</ext:serviceItemAttribute><ext:serviceItemAttribute name="StartDate">2013-11-11 10:03</ext:serviceItemAttribute><ext:serviceItemAttribute name="EndDate">2013-11-28 10:03</ext:serviceItemAttribute><ext:serviceItemAttribute name="No_of_WeeksSupported"></ext:serviceItemAttribute><ext:serviceItemAttribute name="Cost"></ext:serviceItemAttribute><ext:serviceItemAttribute name="SupportNotes"></ext:serviceItemAttribute><ext:serviceItemAttribute name="FiscalQuarterNumber"></ext:serviceItemAttribute><ext:subscription><ext:loginID>kbasavar</ext:loginID><ext:ouname>020072748</ext:ouname></ext:subscription></ext:serviceItemData></ext:serviceitem>
I tried with below regex, but its returning rest of the string after the first occurence.
(?<=Date\"\>).*(?=\<\/ext\:serviceItemAttribute\>)
Any help would be highly appreciated.

Your problem is that .* is greedy, meaning that it will grab from the first instance of Date to the last instance of </ext:ser..... Replace the .* with .*? and it will alter the behaviour to what you're after.
#(?<=Date">).*?(?=</ext:serviceItemAttribute>)#i
You should have .*? in a capture group: (.*?).
#(?<=Date">)(.*?)(?=</ext:serviceItemAttribute>)#i
You could also do it - more simply - like:
#Date">(.*?)</ext#i
Update
As has been pointed out in the comment below this (above) solution relies on the use of non-greedy matching.
To get around this you could use the following: ([^<]*) instead of (.*?)
NOTE: This does not impact the alternatives below.
Alternatives
/(\d{4}-\d{2}-\d{2})/
/(\d{4}-\d{2}-\d{2} \d{2}:\d{2})/
The above patterns will match dates in the format YYYY-XX-XX and YYYY-XX-XX HH:MM respectively

Related

Regular expression: matching part of words [duplicate]

I'm trying to make a Regex that matches this string {Date HH:MM:ss}, but here's the trick: HH, MM and ss are optional, but it needs to be "HH", not just "H" (the same thing applies to MM and ss). If a single "H" shows up, the string shouldn't be matched.
I know I can use H{2} to match HH, but I can't seem to use that functionality plus the ? to match zero or one time (zero because it's optional, and one time max).
So far I'm doing this (which is obviously not working):
Regex dateRegex = new Regex(#"\{Date H{2}?:M{2}?:s{2}?\}");
Next question. Now that I have the match on the first string, I want to take only the HH:MM:ss part and put it in another string (that will be the format for a TimeStamp object).
I used the same approach, like this:
Regex dateFormatRegex = new Regex(#"(HH)?:?(MM)?:?(ss)?");
But when I try that on "{Date HH:MM}" I don't get any matches. Why?
If I add a space like this Regex dateFormatRegex = new Regex(#" (HH)?:?(MM)?:?(ss)?");, I have the result, but I don't want the space...
I thought that the first parenthesis needed to be escaped, but \( won't work in this case. I guess because it's not a parenthesis that is part of the string to match, but a key-character.
(H{2})? matches zero or two H characters.
However, in your case, writing it twice would be more readable:
Regex dateRegex = new Regex(#"\{Date (HH)?:(MM)?:(ss)?\}");
Besides that, make sure there are no functions available for whatever you are trying to do. Parsing dates is pretty common and most programming languages have functions in their standard library - I'd almost bet 1k of my reputation that .NET has such functions, too.
In your edit you mention an unwanted leading space in the result… to check a leading or trailing condition together with your regex without including this to the result you can use lookaround feature of regex.
new Regex(#"(?<=Date )(HH)?:?(MM)?:?(ss)?")
(?<=...) is a lookbehind pattern.
Regex test site with this example.
For input Date HH:MM:ss, it will match both regexes (with or without lookbehind).
But input FooBar HH:MM:ss will still match a simple regex, but the lookbehind will fail here. Lookaround doesn't change the content of the result, but it prevents false matches (e.g., this second input that is not a Date).
Find more information on regex and lookaround here.

regex needed for parsing string

I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the fda.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours to find a regex solution to retrieve both the delimiter and the value that follows it and, though there seems to be posts that handle this, the code found in the post haven't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance: delimiters are used such as "=", "=,", "/=". In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>'
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
Working demo
The idea is that the first group contains the delimiters and the 2nd group the content. I assume the content can be word characters (a to z and digits with underscore). You have then to grab the content of every capturing group.
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
See live demo.
If your values can contain non-alphanumeric characters, it get complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
See live demo.
Code the delimiters longest first, eg "=," before "=", otherwise the alternation, which matches left to right, will match "=" and the comma will become part of the value.
This uses a negative look ahead to stop matching past the next delimiter.

RegExp , Notepad++ Replace / remove several values

I have this dataset: (about 10k times)
<Id>HOW2SING</Id>
<PopularityRank>1</PopularityRank>
<Title><![CDATA[Superior Singing Method - Online Singing Course]]></Title>
<Description><![CDATA[High Quality Vocal Improvement Product With High Conversions. Online Singing Lessons Course Converts Like Crazy Using Content Packed Sales Video. You Make 75% On Every Sale Including Front End, Recurring, And 1-click Upsells!]]></Description>
<HasRecurringProducts>true</HasRecurringProducts>
<Gravity>45.9395</Gravity>
<PercentPerSale>74.0</PercentPerSale>
<PercentPerRebill>20.0</PercentPerRebill>
<AverageEarningsPerSale>74.9006</AverageEarningsPerSale>
<InitialEarningsPerSale>70.1943</InitialEarningsPerSale>
<TotalRebillAmt>16.1971</TotalRebillAmt>
<Referred>75.0</Referred>
<Commission>75</Commission>
<ActivateDate>2011-06-23</ActivateDate>
</Site>
I am trying to do the following:
Get the data from within the tags, and use it to create a URL, so in this example it should make
http://www.reviews.how2sing.domain.com
also, all other data has to go, i want to perform a REGEX function that will just give me a list of URLS.
I prefer to do it using notepad++ but i suck at regex, any help would be welome
To keep the regex relatively simple you can just use:
.*?<id>(.+?)</id>
Replace with:
http://www.reviews.\1.domain.com\n
That will search and replace all instances of Id tag and preceding text. You can then just remove the last manually.
Make sure matches newline is selected.
Regex is straightforward, only slightly tricky part is that it uses +? and *? which are non-greedy. This prevents the whole file from being matched. The () indicate a capture group that is used in the replacement, i.e. \1.
If you want to a regex that will include replacing the last part then use:
.*?(?:(<id>)?(.+?)</id>).+?(?:<id>|\Z)
This is a bit more tricky, it uses:
?:. A non-capturing group.
| OR
\Z end of file
Basically, the first time it will match everything up to the end of the first </id> and replace up to and including the next <id>. After that it will have replaced the starting <id> so everything before </id> goes in the group. On the last match it will match the end of file \Z.
If you only want the Id values, you can do:
'<Id>([^<]*)<\/Id>'
Then you can get the first captured group \1 which is the Id text value and then create a link from it.
Here is a demo:
http://regex101.com/r/jE9qN8
[UPDATE]
To get rid of all other lines, match this regex: '.*<Id>([^<]*)<\/Id>.*' and replace by first captured group \1. Note for the regex match, since there are multiple lines, you will need to have the DOTALL or /s flag activated to also match newlines.
Hope that helps.

Regex to extract date with negative lookahead

I am using this pattern to extract confirmation dates from a text file and converting them to a date object (see my post here Extract/convert date from string in MS Access).
The current pattern matches all strings that look like a date, but may not be the confirmation date (which is always preceded by Confirmed by), and moreover, may not have complete date information (e.g. no AM or PM).
Pattern: (\d+/\d+/\d+\s+\d+:\d+:\d+\s+\w+|\d+-\w+-\d+\s+\d+:\d+:\d+)
Sample text:
WHEN COMPARED WITH RESULT OF 7/13/12 09:06:42 NO SIGNIFICANT
CHANGE; Confirmed by SMITH, MD, JOHN (2242) on 7/14/2012 3:46:21 PM;
The above pattern matches the following:
WHEN COMPARED WITH RESULT OF 7/13/12 09:06:42 NO SIGNIFICANT
^^^^^^^^^^^^^^^^^^^^
CHANGE; Confirmed by SMITH, MD, JOHN (2242) on 7/14/2012 3:46:21 PM;
^^^^^^^^^^^^^^^^^^^^
I want the pattern to look for the date in the segment of the text file that begins with Confirmed by and ends with a semi-colon. Also, in order to properly convert the time, the pattern should match only AM or PM at the end. How can I restrict the pattern to this segment and add the additional AM or PM criteria?
Can anyone help?
In order to match the end of the string, use $ at the end of your regex. To match the entire phrase "Confirmed by <someone> on <date>", use plain text (remember that plain text can be used in a regex as well -- if you aren't using special characters, the matcher will match your query verbatim). You need to use a negative look-ahead to exclude entire words.So maybe something like this:
Confirmed by (?!\ on\ )(\d+/\d+/\d+\s+\d+:\d+:\d+\s+\w+|\d+-\w+-\d+\s+\d+:\d+:\d+)$
Which will allow you to match a string that starts with "Confirmed by", followed by anything except for " on ", followed by the date that you capture, and the end of the string.
Edit: the negative look-ahead part is tricky, look at the answer below for more reference:
A regular expression to exclude a word/string
I don't see any need for a lookahead here, positive or negative. This works correctly on your sample string:
Confirmed by [^;]*(\d+/\d+/\d+\s+\d+:\d+:\d+(?:\s+(?:AM|PM))?|\d+-\w+-\d+\s+\d+:\d+:\d+);
The [^;]* effectively corrals the match between a Confirmed by sequence and its closing semicolon. (I'm assuming the semicolon will always be present.)
+(?:\s+(?:AM|PM))? makes the AM/PM optional, along with its leading whitespace.
The actual date will be stored in capturing group #1.
Try this:
(\d+/\d+/\d+\s+\d+:\d+:\d+\s+(?:AM|PM));
The simplest answer is more than often a good enough solution. By turning of the default greedy behavior (using the question mark: .*?) the regular expression will instead try to find the shortest match that matches the pattern. A pattern never matches the same string more than once, this means that each Confirmed by can only be coupled with one date which in this case is the next to follow.
Confirmed by.*?(\d+/\d+/\d+\s+\d+:\d+:\d+\s+(?:AM|PM));

Nested Groups in Regex

I'm constructing a regex that is looking for dates. I would like to return the date found and the sentence it was found in. In the code below, the strings on either side of date_string should check for the conditions of a sentence. For your sake, I've omitted the regex for date_string - sufficed to say, it works for picking out dates. While the inside of date_string isn't important, it is grouped as one entire regex.
"((?:[^.|?|!]*)"+date_string+"(?:[^.|?|!]*[.|?|!]\s*))"
The problem is that date_string is only matching the last number of any given date, presumably because the regex in front of date_string is matching too far and overrunning the date regex. For example, if I say "Independence Day is July 4.", I will get the sentence and 4, even though it should match 'July 4'. In case you're wondering, my regex inside date_string are ordered in such a way that 'July 4' should match first. Is there any way to do this all in one regex? Or do I need to split it up somehow (i.e. split up all text into sentences, and then check each sentence)?
There are several things wrong with your regex.
There is no alternation in character classes. You want [^.?!], not [^.|?|!].
You don't need the non-capturing groups at all.
You probably don't need any "outer" grouping, since the entire match is what you look for.
Your match part preceding the date is greedy where it should not be (this runs over part of your date).
You make assumptions about what resembles a sentence that do not match reality. Your own example proves that, if you try.
Putting that last point aside for the moment, you end up with this version:
[^.?!]*?(July 4)[^.?!]*[.?!]\s*
Where the literal July 4 stands in for your date regex. This matches in your question text:
' For example, if I say "Independence Day is July 4.'
'", I will get the sentence and 4, even though it should match 'July 4'. '
which pretty much proves my point #5.
You can make the repetition operator non-greedy by adding a question mark. In your case it would be
[^.?!]*?
And yes, splitting the text into sentences (preferably excluding the last character) would make it really easier.
(Seems like I didn't look at what was in the character class. Replaced it with tloflin's.)