Regex for month and day before specific symbols - regex

I'm trying to get the day and month from strings such as:
5月2日 or 4月22日 or 12月2日
However I can't see to figure out the correct regex:
I've tried \d{1,2}[^月] and \d{1,2}[^日] however this only returns something if there is a double digit in the day or month.
Any ideas what I'm missing?
Thanks.

\d{1,2} is matching 1 digit and [^月] is matching another. Your current regex will match two digits and then any character except 月
The correct way to ensure the 月 follows is to use a lookahead \d{1,2}(?=月) as seen in use here

Assuming you have 12 months per year and up to 31 days per month this will get you close, you'll still have to do bounds checking after you determine the syntax is correct; (read; month 19 day 37 will be valid syntax here)
1?\d月[123]?\d日
Edit: Here's a better regex that doesn't need to be bounds checked and doesn't require lookahead;
^(1[012]|[1-9])月(3[01]|[12]\d|[1-9])日$

Related

Matching date, excluding 0 from month

We have files named like the following:
d31_20210315-19232.sql.tar.gz
d31_20201215-19232.sql.tar.gz
Matching the year is easy:
(?<=_)(?P<year>\d{4})
matching the month without the 0 has me coming unstuck:
(0(?P<month>[1-9]))|((?P<month>[1][1-9]))
this isn't allowed but describes what I want to do.
Same with the day I want to ignore the 0 and match either 1-9 or 12-31 inside the group:
(?P<day>
I'm at the stage where I'm going to try and code around the regex capture to get the desired result, but its been bugging me I cannot get this.
You may try this one:
(?<=_)(?P<year>\d{4})0?(?P<month>\d+)(?=\d\d-)0?(?P<day>\d+)
(?<=_)(?P<year>\d{4}) matching the year.
0?(?P<month>\d+)(?=\d\d-) matching the month, ignores the 0 if there's any. Also makes sure after the group month, there are still 2 digits with a dash in front.
0?(?P<day>\d+) matching the day, ignore the 0 if there's any.
Check the test cases

Regex expression for date within dates range

I need to validate with regex a date in format yyyy-mm-dd (2019-12-31) that should be within the range 2019-12-20 - 2020-01-10.
What would be the regex for this?
Thanks
Regex only deal with characters. so we have to work out at each position in the date what are the valid characters.
The first part is easy. The first two characters have to be 20
Now it gets complicated the next character can be a 1 or a 2 but what follows depends on the value of that character so we split the rest of the regex into two sections the first if the third character matches 1 and the second if it matches 2
We know that if the third character is a 1 then what must follow is the characters 9-12- as the range starts at 2019-12-20 now for the day part. The 9th character is the tens for the day this can only be 2 or 3 as we are already in the last month and the minimum date is 20. The last character can be any digit 0-9. This gives us a day match of [23][0-9]. Putting this together we now have a pattern for years starting 2019 as 19-12-[23][0-9]
It the third character is a 2 then we can match up to the day part of the date a gain as the range ends in January. This gives us a partial match of 20-01- leaving us to work on the day part. Hear we know that the first character of the day can either be a 1 or 0 however if it's a 1 then the last character must be a 0 and if it's a 0 then the last character can only be in the range 1 to 9. This give us another alteration (?:0[1-9]|10) Putting the second part together we get 20-01-(?:0[1-9]|10).
Combining these together gives the final regex 20(?:19-12-[23][0-9]|20-01-(?:0[1-9]|10))
Note that I'm assuming that the date you are testing against is a validly formatted date.
Try this:
(2019|2020)\-(12|01)\-([0-3][0-9]|[0-9])
But be aware that this will allow number up to where the first digit is between zero and three and the second digit between zero and nine for the dd value. You could specify all numbers you want to allow (from 20 to 10) like this (20|21|22|23|24|25|26|27|28|29|30|31|01|1|02|2|03|3|04|4|05|5|06|6|07|7|08|8|09|9|10).
(2019|2020)\-(12|01)\-(20|21|22|23|24|25|26|27|28|29|30|31|01|1|02|2|03|3|04|4|05|5|06|6|07|7|08|8|09|9|10)
But honestly... Regular-Expressions are not the right tool for this. RegExp gives a mask to something, not a logical context. Use regex to extract the data/value from a string and validate those values using another language.
The above 2nd Regex will, f.e. match your dates, but also values outside of this range since there is no context between 2019|2020 and the second group 12|01 so they match values like 2019-12-11 but also 2020-12-11.
To only match the values you want this will be a really large regex like this (inner brackets only if you need them) ((2019)-(12)-(20)|(2019)-(12)-(21)|(2019)-(12)-(22)|...) and continue with all possible dates - and ask yourself: what would you do if you find such a regex in a project you have to work with ;)
Better solution (quick and dirty, there might be better solutions):
(?<yyyy>20[0-9]{2})\-(?<mm>[01][0-9]|[0-9])\-(?<dd>[0-3][0-9]|[0-9])
This way you have three named groups (yyyy, mm, dd) you can access and validate the matched values... The regex is smaller, you have a better association between code and regex and both are easier to maintain.

date regex which validates 3 different formats

I need a regex for date string which validates
YYYY:MM:DD:HH
YYYY:MM:DD:HH:mm
YYYY:MM:DD:HH:mm:ss
means all 3 formats are valid.
Can someone help me with this ?
I have
d\d\d\d:(0\d|1[012]):([012]\d|3[01]):([01]\d|2[0-3])$ YYYY:MM:DD:HH
^\d\d\d\d:(0\d|1[012]):([012]\d|3[01]):([01]\d|2[0-3]):[0-5]\d$ YYYY:MM:DD:HH:MM
^\d\d\d\d:(0\d|1[012]):([012]\d|3[01]):([01]\d|2[0-3]):[0-5]\d:[0-5]\d$ YYYY:MM:DD:HH:MM:SS
These 3 regex and needs to be combine in one
this is your pattern
YYYY:MM:DD:HH(:mm(:ss)?)?
? means 0 or 1 time
you can test it here
I kept your year month day expression d\d\d\d:(0\d|1[012]):([012]\d|3[01]):([01]\d|2[0-3]). Since your hour and minute expressions where the same :[0-5]\d I just required them to appear zero, once or twice with.
The resulting expression is:
^\d\d\d\d:(0\d|1[012]):([012]\d|3[01]):([01]\d|2[0-3])(:[0-5]\d){0,2}$
This expression by francis-gagnon is a slight modification to prevent edge cases where the day or month is expressed as 00.
^\d\d\d\d:(0[1-9]|1[012]):(0[1-9]|[12]\d|3[01]):([01]\d|2[0-3])(:[0-5]\d){0,2}$
If you're looking to also check the date is valid then you could use something like this monster which will test each date position to it's valid and that the time will fit into 24 hour clock:
^(?:(?:(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))(:|\/|-|\.)(?:0?2\1(?:29)))|(?:(?:(?:1[6-9]|[2-9]\d)?\d{2})(:|\/|-|\.)(?:(?:(?:0?[13578]|1[02])\2(?:31))|(?:(?:0?[13-9]|1[0-2])\2(?:29|30))|(?:(?:0?[1-9])|(?:1[0-2]))\2(?:0?[1-9]|1\d|2[0-8]))))(?::(?:[01]\d|2[0-3]))?(?::[0-5]\d){0,2}$
\d{4}:[0-1][0-9]:[0-3][0-9](?::[0-5][0-9](?::[0-5][0-9])?)?

Gawk regular expression for days in month

Im writing regular expression that accepts days in months ([0-3])([0-9]). How to change it so it will only accept proper amount of days from 1 to 31, but not 37 like mine... i tried alternation |, but i don't know how to include first group into it.
([0-2])([0-9])|(3)([0-1]) does not work
How to change it so i will have still 2 groups and proper dates?
edit: 2 groups, not 4
Try this :
(0)([1-9])|(1|2)([0-9])|(3)(0|1)
DEMO Match numbers between 01 and 31 only
(0[1-9]|[12][0-9]|3[01])
This accepts values between 0-31 in one group, but does not care about about that February has no days as 30,31.
Sorry, misread it.
If you want to get the values in two groups you have to use negative lookahead like so:
([0-2]|3(?![^0-1]))([0-9])
But I think gawk does not support this.

Regular Expression Match to test for a valid year

Given a value I want to validate it to check if it is a valid year. My criteria is simple where the value should be an integer with 4 characters. I know this is not the best solution as it will not allow years before 1000 and will allow years such as 5000. This criteria is adequate for my current scenario.
What I came up with is
\d{4}$
While this works it also allows negative values.
How do I ensure that only positive integers are allowed?
Years from 1000 to 2999
^[12][0-9]{3}$
For 1900-2099
^(19|20)\d{2}$
You need to add a start anchor ^ as:
^\d{4}$
Your regex \d{4}$ will match strings that end with 4 digits. So input like -1234 will be accepted.
By adding the start anchor you match only those strings that begin and end with 4 digits, which effectively means they must contain only 4 digits.
The "accepted" answer to this question is both incorrect and myopic.
It is incorrect in that it will match strings like 0001, which is not a valid year.
It is myopic in that it will not match any values above 9999. Have we already forgotten the lessons of Y2K? Instead, use the regular expression:
^[1-9]\d{3,}$
If you need to match years in the past, in addition to years in the future, you could use this regular expression to match any positive integer:
^[1-9]\d*$
Even if you don't expect dates from the past, you may want to use this regular expression anyway, just in case someone invents a time machine and wants to take your software back with them.
Note: This regular expression will match all years, including those before the year 1, since they are typically represented with a BC designation instead of a negative integer. Of course, this convention could change over the next few millennia, so your best option is to match any integer—positive or negative—with the following regular expression:
^-?[1-9]\d*$
This works for 1900 to 2099:
/(?:(?:19|20)[0-9]{2})/
Building on #r92 answer, for years 1970-2019:
(19[789]\d|20[01]\d)
To test a year in a string which contains other words along with the year you can use the following regex: \b\d{4}\b
In theory the 4 digit option is right. But in practice it might be better to have 1900-2099 range.
Additionally it need to be non-capturing group. Many comments and answers propose capturing grouping which is not proper IMHO. Because for matching it might work, but for extracting matches using regex it will extract 4 digit numbers and two digit (19 and 20) numbers also because of paranthesis.
This will work for exact matching using non-capturing groups:
(?:19|20)\d{2}
Use;
^(19|[2-9][0-9])\d{2}$
for years 1900 - 9999.
No need to worry for 9999 and onwards - A.I. will be doing all programming by then !!! Hehehehe
You can test your regex at https://regex101.com/
Also more info about non-capturing groups ( mentioned in one the comments above ) here http://www.manifold.net/doc/radian/why_do_non-capture_groups_exist_.htm
you can go with sth like [^-]\d{4}$: you prevent the minus sign - to be before your 4 digits.
you can also use ^\d{4}$ with ^ to catch the beginning of the string. It depends on your scenario actually...
/^\d{4}$/
This will check if a string consists of only 4 numbers. In this scenario, to input a year 989, you can give 0989 instead.
You could convert your integer into a string. As the minus sign will not match the digits, you will have no negative years.
I use this regex in Java ^(0[1-9]|1[012])[/](0[1-9]|[12][0-9]|3[01])[/](19|[2-9][0-9])[0-9]{2}$
Works from 1900 to 9999
If you need to match YYYY or YYYYMMDD you can use:
^((?:(?:(?:(?:(?:[1-9]\d)(?:0[48]|[2468][048]|[13579][26])|(?:(?:[2468][048]|[13579][26])00))(?:0?2(?:29)))|(?:(?:[1-9]\d{3})(?:(?:(?:0?[13578]|1[02])(?:31))|(?:(?:0?[13-9]|1[0-2])(?:29|30))|(?:(?:0?[1-9])|(?:1[0-2]))(?:0?[1-9]|1\d|2[0-8])))))|(?:19|20)\d{2})$
You can also use this one.
([0-2][0-9]|3[0-1])\/([0-1][0-2])\/(19[789]\d|20[01]\d)
In my case I wanted to match a string which ends with a year (4 digits) like this for example:
Oct 2020
Nov 2020
Dec 2020
Jan 2021
It'll return true with this one:
var sheetName = 'Jan 2021';
var yearRegex = new RegExp("\b\d{4}$");
var isMonthSheet = yearRegex.test(sheetName);
Logger.log('isMonthSheet = ' + isMonthSheet);
The code above is used in Apps Script.
Here's the link to test the Regex above: https://regex101.com/r/SzYQLN/1
You can try the following to capture valid year from a string:
.*(19\d{2}|20\d{2}).*
Works from 1950 to 2099 and value is an integer with 4 characters
^(?=.*?(19[56789]|20\d{2}).*)\d{4}$