Regex - "captured group equals" in a condition - regex

I am writing a regex to capture various date formats. To keep it short and flexible, I wanted to pack all the possible combinations of months, days and years into separate groups. Let's assume I have two dates like this:
01.01. - 31.12.2013
jan - dec 2013
Now, what I want to achieve is a regex that captures both dates like the ones above. That's easy. But I also want to exclude dates like this one:
01.01. - 31 dec 2013
In other words, whenever the months are mixed, I don't want those dates. Also, if the first date doesn't have a day, I don't want that day to be captured in the second one either.
I wanted to build a conditional which captures only the second date's appropriate fields, based on what is found in the first one (so, e.g. if the first date has an alpha month, look only for an alpha month in the second one, ignore numeric). My regex looks like this:
(?<firstDay>0[1-9]|[12][0-9]|3[01]|[1-9])[-/\s\.](?<firstMonth>0[1-9]|1[012]|[\p{L}]{3,}|[1-9])\s*[-\s/\.]*\s*(?<secondDay>0[1-9]|[12][0-9]|3[01]|[1-9])[-\s/.]*(?<secondMonth>((?<firstMonth>)(?<=0[1-9]|1[012]|[1-9]))(0[1-9]|1[012]|[1-9])|[\p{L}]{3,})[-\s/\.]*(?<year>(19|20)\d\d|[012][0-9]$)
This is all background. My actual question is: is it possible to check what a captured group is equal to and build a capturing condition based on that? I found a similar topic on Stack Overflow (I can't find it now to reference, unfortunately), but when I implement it, it stops capturing some proper dates (e.g. 01.01. - 31.12.2013). This is that part:
(?<secondMonth>((?<firstMonth>)(?<=0[1-9]|1[012]|[1-9]))(0[1-9]|1[012]|[1-9])|[\p{L}]{3,})
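One possible way to express the condition described above is a conditional group that tests whether an earlier group took part in the match. The sketch below uses Python's `re` syntax `(?(name)yes|no)`; it is an illustration, not the asker's pattern. The group names (`num_month`, `alpha_month`, `second_num`, `second_alpha`) and the helper `parse_range` are made up for this example, the first day is made optional so that both sample inputs fit, and `[^\W\d_]` stands in for `\p{L}`, which Python's `re` does not support.

```python
import re

# Assumption: whole-string matching of a "first date - second date year" range.
DATE_RANGE = re.compile(
    r"""
    ^\s*
    (?:(?P<first_day>0?[1-9]|[12]\d|3[01])[-/\s.])?            # optional first day
    (?:(?P<num_month>0?[1-9]|1[0-2])                           # numeric first month
      |(?P<alpha_month>[^\W\d_]{3,}))                          # or alphabetic first month
    [-\s/.]*
    (?(first_day)(?P<second_day>0?[1-9]|[12]\d|3[01])[-/\s.])  # second day only if a first day was seen
    (?(num_month)(?P<second_num>0?[1-9]|1[0-2])                # numeric month only if the first was numeric
                 |(?P<second_alpha>[^\W\d_]{3,}))              # otherwise an alphabetic month
    [-\s/.]*
    (?P<year>(?:19|20)\d\d)
    \s*$
    """,
    re.VERBOSE,
)


def parse_range(text):
    """Return the named groups as a dict, or None if the range is rejected."""
    match = DATE_RANGE.match(text)
    return match.groupdict() if match else None
```

With this sketch, `parse_range("01.01. - 31.12.2013")` and `parse_range("jan - dec 2013")` return dictionaries, while the mixed form `01.01. - 31 dec 2013` is rejected. Note that the conditional only tests whether a group participated, not what text it captured, which is why the numeric and alphabetic months are captured in separate groups.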

Related

grep first n lines only

I'm facing a problem grepping the right date from a letter stored as a document.
The goal is to grep the date the document was created, not any other date within the text.
Usually the document holds information about the company, my address, customer number, bill number....
and the date when it was created.
Maybe also a greeting and/or body text, which may contain dates again.
Often the date at the beginning of the document is written differently from the dates that follow,
for example December 1999 instead of 3.12.1999.
If I grep the date with the pattern
'(([0-9][0-9]{,1}\.)\s+('Januar'|'Februar'|'März'|'April'|'Mai'|'Juni'|'Juli'|'August'|'September'|'Oktober'|'November'|'Dezember')\s+([1-9][0-9][0-9][0-9]{1,}))'
I sometimes get the wrong date as the creation date. The reason is that dates are written differently across documents.
Example 1 is what I usually get, and it works fine: I search for the creation date with the correct pattern.
Example 2 is the problem: I get a date, but it is NOT the creation date, which would be the first date. Instead I get another date from the text that matches the pattern.
Example 1
Example 2
I could use a different pattern '(([0-9][0-9]{,1}\.)([0-9][0-9]{,1}\.)([1-9][0-9][0-9][0-9]{1,}))' to grep the correct date in example 2, but then I would get the same issue for example 1.
My idea was to search the first n lines only: if the pattern matches, take that date; otherwise use a different pattern.
I can't find a way to make pdfgrep look at the first n lines only, which would let me apply different patterns.
Does anybody have an idea how to fix it?
Cheers, bdream
With GNU grep:
-m NUM: Stop reading a file after NUM matching lines.
As an alternative to GNU grep, learn to use GNU gawk, which is specifically designed for such tasks.
Consider also learning Python or GNU Guile (then read SICP).
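As a rough illustration of the "first n lines, then fall back" idea from the question, here is a minimal Python sketch. It assumes the PDF text has already been extracted to a string by some other tool; `creation_date`, `head_lines` and the combined `DATE` pattern are hypothetical names and choices for this example, not part of grep or pdfgrep.

```python
import re

MONTHS = "Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember"

# Accepts either 3.12.1999 or 3. Dezember 1999.
DATE = re.compile(rf"\b\d{{1,2}}\.(?:\d{{1,2}}\.|\s+(?:{MONTHS})\s+)\d{{4}}\b")


def creation_date(text, head_lines=10):
    """Return the first date in the first `head_lines` lines, else the first date anywhere."""
    head = "\n".join(text.splitlines()[:head_lines])
    match = DATE.search(head) or DATE.search(text)
    return match.group(0) if match else None
```

Because both spellings live in one pattern, the earliest date in the header wins regardless of how it is written; the rest of the text is searched only when the header contains no date at all.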

Regex to match date of the week from date

Similar to How to use REGEX patterns to return day of the week if a date is entered? except I'm only looking for Regex answers not "any other clever method".
What is the simplest regex to match a particular day of the week, e.g. Thursday, from a date string formatted like 2017-05-04T10:14:07Z?
This is what I have as a starting point.
2017-05-(04|11|18|25)T.*$
Is there a way to achieve a solution without too many pipes and covering all possible years or at least the last decade?
First and foremost, Thursday can not be extracted from a date string formatted like 2017-05-04T10:14:07Z, as "Thursday" never appears in the string. The best you can capture is the 2 digit day number (the "04").
You CAN get the day number with (\d+-\d{1,2}-(\d{1,2})T.*?Z). However, regex can't verify the correctness of the day. (For example, for a random year, can you tell me if Feb 29 is valid without listing every single instance?) So ONLY DO THIS IF YOU ACCEPT THAT THE DAY MAY NOT BE VALID (or the source will always be right).
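If stepping outside pure regex is acceptable, the capture-then-check idea can be sketched in Python: the regex only pulls out the numeric fields, and the standard `datetime` module decides whether the date exists and which weekday it is. The function name `is_thursday` and the extra capture groups around year and month are assumptions added for this example.

```python
import re
from datetime import date

# Same shape as the pattern above, with year and month captured as well.
STAMP = re.compile(r"(\d+)-(\d{1,2})-(\d{1,2})T.*?Z")


def is_thursday(stamp):
    """True if `stamp` is a real calendar date that falls on a Thursday."""
    match = STAMP.fullmatch(stamp)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        parsed = date(year, month, day)  # raises ValueError for dates like Feb 30
    except ValueError:
        return False
    return parsed.weekday() == 3  # Monday is 0, so Thursday is 3
```

This covers every year that `datetime.date` supports without enumerating day numbers per month.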

Regular expression to expand to sentence

I'm trying to extract regions around keywords from longer passages of text. They should include complete sentences, based on the following conditions:
n=250 characters before / after the keyword should be included if they exist (the keyword can be closer than this to the start / end of the text)
from there it should expand further to include the complete sentence (let's assume here we can define sentence borders with ".?! or :" knowing it's not completely accurate)
I already achieved the expanding to the end of the last sentence, but not to start of the first in the following example, where vitamin is the keyword and the italic is captured by the regex. However, it should capture from "An extra 24 hours..."
Apparently I don't get the corresponding group up front, neither using lazy nor using lookbehind.
((.{0,250}(vitamin)\b.{0,250})(.+?(\.|\!|\?|\:))?)/ig
Well, this year you’re getting an extra day to get ahead on your taxes or (finally) clean out the garage. (Hey, we’re not trying to tell you what do but you might as well be productive.) February 29 is back on the calendar this year because it’s a leap year. Whether you love or loathe the extra winter day, you’re probably wondering why it happens in the first place. An extra 24 hours — or day — is built into the calendar every four years to ensure it aligns with the Earth’s movement around the sun. There’s 365 days in a calendar year, but it actually takes longer for the Earth’s annual journey — about 365.2421 days — around the star that gives us light, life and vitamin D. The difference may seem like no big deal to us, but over time, it adds up. “To ensure consistency with the true astronomical year, it is necessary to periodically add in an extra day to make up the lost time and get the calendar back in sync with the heavens,” according the history.com.
Acknowledgement of the need for a leap year happened around the time of Julius Caesar. In 46 B.C., Caesar enlisted the help of astronomer Sosigenes to update the calendar so that it had 12 months and 365 days, including a leap year every four years.
You can try something like this:
(([.?!:][^.?!:]*.{250}\bvitamin\b.{250})[^.?!:]*[.?!:])
It works by consuming 250 characters of text before and after the keyword "vitamin". From that point it finds the first punctuation point (.?!:) before/after the 250 characters of text.
Here's a sample of it in action.
You can use extra parentheses () to strategically group exactly the output you want. For example, the above answer includes the ending period from the preceding sentence in the output. So you could use
(([.?!:]([^.?!:]*.{250}\bvitamin\b.{250})[^.?!:]*[.?!:]))
and use group 3 from the result set which doesn't have this ending period.
I do not see how the specification in the question can be matched by a regex. It boils down to the following logic problem:
to match as many characters as possible but no more than 250 before/after the keyword, .{0,250} needs to be greedy and can neither be lazy .{0,250}? nor possessive .{0,250}+
if this part is greedy, you will miss the occurrences of the keyword that start before the .{0,250} part is matched.
The same logic applies, to my understanding, to the 'match back to the start of the sentence' part as well.
I played around with the following more or less meaningful regex:
[.?!:]?([^.?!:]*?(.{0,250}\byear\b.{0,250})[^.?!:]*[.?!:]?) misses first 'year'
[.?!:]?([^.?!:]*?(.{0,250}?\byear\b.{0,250})[^.?!:]*[.?!:]?) gets the first 'year' but fails on others.
I suggest you write your own extraction logic in a function, either using regex or not, to achieve the extraction you want.
You could for example find the index of the start of the keyword \bkeyword\b and the full stops (\.[^\d]|[.?!:]$) and then with this information extract the part of the text you want.
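A minimal sketch of that index-based approach in Python, under the question's simplifying assumption that `.`, `?`, `!` and `:` always end a sentence (so a decimal point such as the one in 365.2421 also counts as a boundary). The function name `expand_around` and the `radius` parameter are hypothetical.

```python
import re

SENTENCE_END = re.compile(r"[.?!:]")


def expand_around(text, keyword, radius=250):
    """Return one excerpt per keyword hit, widened to whole sentences."""
    excerpts = []
    for hit in re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
        # Window of up to `radius` characters on each side of the keyword.
        window_start = max(0, hit.start() - radius)
        window_end = min(len(text), hit.end() + radius)
        # Walk left to just after the closest sentence end before the window.
        start = max(text.rfind(ch, 0, window_start) for ch in ".?!:") + 1
        # Walk right to the first sentence end at or after the window.
        closing = SENTENCE_END.search(text, window_end)
        end = closing.end() if closing else len(text)
        excerpts.append(text[start:end].strip())
    return excerpts
```

Working with indices avoids the greedy/lazy conflict described above, because the window and the sentence expansion are computed in separate steps for every occurrence of the keyword.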

Regex to match date of type: ddmmyyyy

I'm trying to match dates of type ddmmyyyy, like: 04072001
So far I have this:
^(?:(?:31(?:0?[13578]|1[02]))\1|(?:(?:29|30)(?:0?[1,3-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:290?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
which is almost the same as here but without the delimiters( (\/|-|\.) )
You could use something more simple like this:
^(0[1-9]|[1-2][0-9]|31(?!(?:0[2469]|11))|30(?!02))(0[1-9]|1[0-2])([12]\d{3})$
It captures the day, month, year, and validates everything except for whether Feb. 29 is actually a leap year. (To do that, I'd just perform the math on the captured year/date afterwards rather than trying to write it into the expression).
Working example: http://regex101.com/r/dH8mG3
Explained:
- Capture the day: 01-29
- OR 31, if not succeeded by 02, 04, 06, 09, or 11
- OR 30, if not succeeded by 02
- Capture the month: 01-12
- Capture the year: 1000-2999 (you could narrow this down by using number ranges like (1[8-9]\d{2}|20\d{2}) == 1800-2099)
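The breakdown above can be tried out in Python, with the leap-year check done on the captured values afterwards as suggested. This is only a sketch: the function name `is_valid_ddmmyyyy` is made up, and `calendar.isleap` applies the Gregorian rule.

```python
import calendar
import re

DDMMYYYY = re.compile(
    r"^(0[1-9]|[1-2][0-9]|31(?!(?:0[2469]|11))|30(?!02))(0[1-9]|1[0-2])([12]\d{3})$"
)


def is_valid_ddmmyyyy(text):
    """Regex check for the general shape, then arithmetic for Feb 29."""
    match = DDMMYYYY.match(text)
    if match is None:
        return False
    day, month, year = (int(group) for group in match.groups())
    if day == 29 and month == 2:
        return calendar.isleap(year)  # the one case the regex cannot decide
    return True
```

For instance, `29022001` passes the regex on its own but is rejected by the leap-year step, while `31042001` is already rejected by the lookahead.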
A regular expression is not the best tool for this job.
If at all possible, just match ^\d{8}$ (or ^\d\d\d\d\d\d\d\d$ if your regexp engine doesn't support the {8} syntax) and then programmatically check that the date is valid.
In a bit more detail:
Match ^(\d\d)(\d\d)(\d\d\d\d)$ (adjust syntax as needed).
Extract the three matching groups and programmatically check that they constitute a valid date.
The latter requires (a) knowing the number of days in each month, and (b) knowing which years are leap years (which depends on which calendar you're using; Gregorian is the obvious choice, but think about years before it was introduced).
The resulting code will be much easier to read and maintain.
(Also, if you have any control over the format, consider using YYYYMMDD rather than DDMMYYYY; it sorts correctly and it's one of the formats specified by the ISO 8601 standard.)
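The two-step approach described above (match the shape, then check the date in code) might look like this in Python. It is a sketch, with `parse_ddmmyyyy` as a hypothetical helper name; `datetime.date` uses the proleptic Gregorian calendar, so it does not address the pre-Gregorian caveat mentioned above.

```python
import re
from datetime import date


def parse_ddmmyyyy(text):
    """Return a date for a valid ddmmyyyy string, otherwise None."""
    match = re.fullmatch(r"(\d\d)(\d\d)(\d\d\d\d)", text)
    if match is None:
        return None
    day, month, year = (int(group) for group in match.groups())
    try:
        return date(year, month, day)  # knows month lengths and leap years
    except ValueError:
        return None
```

The regex stays trivial, and all calendar knowledge lives in the library call.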
Do not validate dates using only RegEx. Your language probably has a built in date object with its own methods and such which you can use in conjunction with your input whose format you've validated with RegEx.

Regex to match time

I want my users to be able to enter a time in a form.
In case more info is necessary: users use this to express how much time is needed to complete a task, and it will be saved in a database if filled in.
here is what I have:
/^$|^([0-1]?[0-9]|2[0-4]):([0-5][0-9])(:[0-5][0-9])?$/
It matches an empty form or 01:30 and 01:30:00 formatted times. I really won't need the seconds as every task takes a minute at least, but I tried removing that part and it just crashed my code and removed support for the empty string. I really don't understand regex at all.
What I'd like is for it to also match simple minutes and simple hours, for instance 3:30, 3:00, 5. Is this possible? It would greatly improve the user experience and limit wasted typing. But I'd like to keep the zero optional in case some users find it natural to type it.
I think the following pattern does what you want:
p=r"((([01]?\d)|(2[0-4])):)?([0-5]\d)?(:[0-5]\d)?"
The first part:
((([01]?\d)|(2[0-4])):)?
is an optional group which deals with hours in format 00-24.
The second part:
([0-5]\d)?
is an optional group which deals with minutes if hours or seconds are present in your expression. The group also deals with expressions containing only minutes or only hours.
The third part:
(:[0-5]\d)?
is an optional group dealing with seconds.
The following samples show the pattern at work:
In [180]: re.match(p,'14:25:30').group(0)
Out[180]: '14:25:30'
In [182]: re.match(p,'2:34:05').group(0)
Out[182]: '2:34:05'
In [184]: re.match(p,'02:34').group(0)
Out[184]: '02:34'
In [186]: re.match(p,'59:59').group(0)
Out[186]: '59:59'
In [188]: re.match(p,'59').group(0)
Out[188]: '59'
In [189]: re.match(p,'').group(0)
Out[189]: ''
As every group is optional, the pattern also matches the empty string. I've tested it with Python, but I think it will work in other languages too with minimal changes.
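One caveat with an all-optional pattern: `re.match` only anchors at the start, so for an input such as `5` it can succeed with an empty match instead of rejecting or consuming the input. The sketch below is one possible stricter variant, not the pattern above: it uses `fullmatch`, drops the seconds (the question says they are not needed), keeps the asker's 0-24 hour range, and assumes that a bare number means hours. `DURATION` and `to_minutes` are hypothetical names.

```python
import re

# Optional "H" or "HH", optionally followed by ":MM". Empty input is allowed.
DURATION = re.compile(r"(?:([01]?\d|2[0-4])(?::([0-5]\d))?)?")


def to_minutes(text):
    """Return the duration in minutes, None for empty input, ValueError if malformed."""
    match = DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a valid time: {text!r}")
    if match.group(1) is None:
        return None  # the field was left empty
    hours = int(match.group(1))
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes
```

Here `3:30`, `03:00` and `5` are all accepted, the leading zero stays optional, and an input like `3:7` raises instead of being silently truncated.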