Regular expression to extract the year from a string - regex

This should be simple, but I can't seem to get it going. The purpose is to extract v3 tags from MP3 file names in Mp3tag.
I have these strings, and I want to extract the year from each:
Test String 1 (1994) -> extract 1994
34 Test String 2 (1995) -> extract 1995
Test (String) 3 (1996) -> extract 1996
I had ^(.+)\s\(([0-9]*)\)$ but it's obviously not giving me the results I was expecting. You could say that I'm not very good with regular expressions.
Thanks in advance.

A suggestion for a more generic solution, not sure if that is what you need. Valid years will always have the form 19xx or 20xx, and the years will be separated with a word-break character (something other than a number or a letter):
\b(19|20)\d{2}\b
This doesn't really care where in the tag the year appears. A simpler version that doesn't assume anything more than 4 digits in the year would be this expression:
\b\d{4}\b
The key here is the \b escape sequence, which matches a word boundary: a zero-width position between a word character (word characters are letters, digits and underscores) and a non-word character such as a parenthesis, or the start or end of the string.
Would also like to recommend this site:
http://www.regular-expressions.info/
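As a rough illustration of the word-boundary approach, here is a minimal Python sketch. The helper name find_years is made up for this example, and the group is written as non-capturing so that findall returns the whole year rather than just "19" or "20".

```python
import re

# Years of the form 19xx or 20xx, delimited by word boundaries.
# (?:...) is non-capturing, so findall returns the full 4-digit match.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def find_years(text):
    """Return every 19xx/20xx year found anywhere in text."""
    return YEAR.findall(text)

print(find_years("Test String 1 (1994)"))     # ['1994']
print(find_years("34 Test String 2 (1995)"))  # ['1995']
# "12345" is not picked up: there is no word boundary inside a run of digits.
print(find_years("Track12345 (1996)"))        # ['1996']
```

Because the pattern is not tied to the parentheses, it would also pick up a year that appears elsewhere in the file name, which may or may not be what you want.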

You can use something like this \((\d{4})\)$. The first group will have your match.
Explanation
\( # Match the character “(” literally
( # Match the regular expression below and capture its match into backreference number 1
\d # Match a single digit 0..9
{4} # Exactly 4 times
)
\) # Match the character “)” literally
$ # Assert position at the end of a line (at the end of the string or before a line break character)

You need to escape the parentheses. You can also require that the year has exactly 4 digits:
^(.+)\s\(([0-9]{4})\)$
The year is in matchgroup 2.

I'd go with
^(.*)\s\(([0-9]{4})\)$
(assuming all years have 4 digits, use [0-9]+ if you have an unknown number of digits, but at least one, or [0-9]* if there could be no digits)
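A small Python sketch of how the two groups could be used, assuming you want both the title and the year. The function name split_title_year is hypothetical; it returns None when the name does not end in a parenthesised 4-digit year.

```python
import re

# Group 1: everything before the final " (YYYY)"; group 2: the year itself.
PATTERN = re.compile(r"^(.*)\s\(([0-9]{4})\)$")

def split_title_year(name):
    """Return (title, year) or None if the name has no trailing (YYYY)."""
    m = PATTERN.match(name)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(split_title_year("Test (String) 3 (1996)"))  # ('Test (String) 3', '1996')
print(split_title_year("No year here"))            # None
```

The greedy (.*) first swallows the whole string and then backtracks, so earlier parentheses in the title end up in group 1 rather than confusing the year match.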

You're almost there with your regular expression.
What you really need is:
\s\((\d{4})\)$
Where:
\s is some whitespace
\( is a literal '('
( is the start of the match group
\d is a digit
{4} means four of the previous atom (i.e. four digits)
) is the end of the match group
\) is a literal ')'
$ is the end of the string
For best results, put it into a function:
>>> import re
>>> def get_year(name):
...     return re.search(r'\s\((\d{4})\)$', name).group(1)
...
>>> for name in "Test String 1 (1994)", "34 Test String 2 (1995)", "Test (String) 3 (1996)":
...     print(get_year(name))
...
1994
1995
1996

Related

How can I limit the total length of 2 adjacent strings in Regular Expression?

Example word: name.surname#exm.gov.xx.en
I want to limit the name + surname's total length to 12.
Ex: If name's length is 5 then the surname's length cannot be bigger than 7.
My regex is here: ([a-z|çöşiğü]{0,12}.[a-z|çöşiğü]{0,12}){0,12}#exm.gov.xx.en
Thx in advance
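If there should be a single dot that is neither at the start nor right before the #, you could assert 13 characters followed by a # (12 letters in total plus the dot). Note that {13} demands exactly 13 characters; a range such as {3,13} would accept any total of up to 12 letters.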
^(?=[a-zçöşğü.]{13}#)[a-zçöşğü]+\.[a-zçöşğü]+#exm\.gov\.xx\.en$
In parts
^ Start of string
(?= Positive lookahead, assert what is on the right is
[a-zçöşğü.]{13}# Match 13 times any of the listed followed by an #
) Close lookahead
[a-zçöşğü]+\.[a-zçöşğü]+ Match 1+ times any of the listed, a dot, then 1+ times any of the listed
#exm\.gov\.xx\.en Match #exm.gov.xx.en
$ End of string
Note that I have omitted the pipe | from the character class, as it would match a literal | there instead of meaning OR. If you meant to match it as a character, you could add it back. I have also removed the i, as it is already matched by a-z.
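A possible Python sketch of the lookahead idea, under the assumption that "limit" means at most 12 letters in total. For that reason the lookahead here uses the range {3,13} (two letters plus the dot at minimum) rather than a fixed 13; the helper name is_valid_address is made up for the example.

```python
import re

# The lookahead checks the combined length of "name.surname" (dot included)
# before the main pattern enforces the letters-dot-letters structure.
PATTERN = re.compile(
    r"^(?=[a-zçöşğü.]{3,13}#)[a-zçöşğü]+\.[a-zçöşğü]+#exm\.gov\.xx\.en$"
)

def is_valid_address(text):
    """True if name + surname together have at most 12 letters."""
    return PATTERN.match(text) is not None

print(is_valid_address("name.surname#exm.gov.xx.en"))     # True  (4 + 7 letters)
print(is_valid_address("abcde.abcdefg#exm.gov.xx.en"))    # True  (5 + 7 letters)
print(is_valid_address("abcde.abcdefgh#exm.gov.xx.en"))   # False (5 + 8 letters)
```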

How would I find values in a file, but only on lines that don't start with #?

I've got a document that looks something like this:
# Document ID 8934
# Last updated 2018-05-06
52 84 12 70 23 2 7 20 1 5
4 2 7 81 32 98 2 0 77 6
(..and so on..)
In other words, it starts off with a few comment lines, then the rest of the document is just a bunch of numbers separated by spaces.
I'm trying to write a regex that gets all digits on all lines that don't start with #, but I can't seem to get it.
I've read over answers such as
Regular Expressions: Is there an AND operator?
Regex: Find a character anywhere in a document but only on lines that begin with a specific word
and pawed through sites such as http://regular-expressions.info, but I still can't get an expression that works (the best I can get is a lengthy version of ^[^#].*).
So how can I match digits (or text, or whatever) in a string, but only on lines that don't start with a certain character?
Your regex ^[^#].* uses a negated character class, which matches any single character other than # at the start of the string ^, and after that matches any character zero or more times.
This would for example also match t test
What you might do is use an alternation that either matches a whole line starting with a # (^#.*$) or captures one or more digits in a group (\d+).
Your digits are then in capture group 1. You could change the (\d+) to, for example, a character class ([\w+.]+) to match more than only digits.
(?:^#.*$|(\d+))
Details
(?: Non capturing group
^#.*$ Match from the start of the line ^ a # followed by any character zero or more times .* until the end of the string $
| Or
(\d+) capture one or more digits in a group
) Close non capturing group
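In Python, the alternation trick might look like the sketch below. It assumes the re.MULTILINE flag so that ^ and $ work per line, and the function name numbers_outside_comments is invented for the example. Matches where group 1 is empty are the consumed comment lines and are simply skipped.

```python
import re

# Either swallow a whole comment line, or capture a run of digits.
PATTERN = re.compile(r"^#.*$|(\d+)", re.MULTILINE)

def numbers_outside_comments(text):
    """Return digit runs from lines that do not start with '#'."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None]

doc = "# Document ID 8934\n# Last updated 2018-05-06\n52 84 12\n4 2 7\n"
print(numbers_outside_comments(doc))  # ['52', '84', '12', '4', '2', '7']
```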
I think a much simpler method would be to first replace the comment lines with an empty string using this regex:
^#.*
And then you can just match all the numbers with this:
-?\d+ (-? is for negative)
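The two-step version could be sketched in Python like this, again assuming re.MULTILINE so that ^ anchors at every line start. The name numbers_two_step is hypothetical.

```python
import re

def numbers_two_step(text):
    """Strip comment lines first, then collect all (possibly negative) integers."""
    stripped = re.sub(r"^#.*", "", text, flags=re.MULTILINE)
    return [int(n) for n in re.findall(r"-?\d+", stripped)]

doc = "# Document ID 8934\n# Last updated 2018-05-06\n52 84 -12\n4 2 7\n"
print(numbers_two_step(doc))  # [52, 84, -12, 4, 2, 7]
```

Removing the comments first matters here: otherwise -?\d+ would read the date 2018-05-06 as 2018, -5 and -6.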

Phone regex validation for Argentina

I figured out a regular expression for my country's phone numbers, but I'm missing something.
The rule here is: (Area Code) Prefix - Suffix
Area Code could be 3 to 5 digits
Prefix could be 2 to 4 digits.
Area Code + Prefix is 7 digits long.
Suffix is always 4 digits long
Total digits are 11.
I figured I could have 3 simple regex chained with an OR "|" like this:
/(\(?\d{3}\)?[- .]?\d{4}[- .]?\d\d\d\d)|(\(?\d{4}\)?[- .]?\d{3}[- .]?\d\d\d\d)|(\(?\d{5}\)?[- .]?\d{2}[- .]?\d\d\d\d)/
The thing I'm doing wrong is that \d\d\d\d doesn't match only 4 digits for the suffix. For example, (011) 4740-5000, which is a valid phone number, works OK, but if I add extra digits it is also accepted as a valid phone number, e.g. (011) 4740-5000000000
You should use ^ and $ to match the whole string.
For example, ^\d{4}$ will match exactly 4 digits, no more and no less.
Here is the complete regex pattern
^((\(?\d{3}\)? \d{4})|(\(?\d{4}\)? \d{3})|(\(?\d{5}\)? \d{2}))-\d{4}$
Since your regex pattern allows the delimiter to be -, . or a single space, try:
^((\(?\d{3}\)?[-. ]?\d{4})|(\(?\d{4}\)?[-. ]?\d{3})|(\(?\d{5}\)?[-. ]?\d{2}))[-. ]?\d{4}$
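To see why the anchors matter, here is a small Python comparison. For brevity it uses only the 3-digit area code alternative; the constant names are made up for the example.

```python
import re

# Same pattern with and without ^...$ (3-digit area code variant only).
UNANCHORED = re.compile(r"\(?\d{3}\)?[- .]?\d{4}[- .]?\d{4}")
ANCHORED = re.compile(r"^\(?\d{3}\)?[- .]?\d{4}[- .]?\d{4}$")

good = "(011) 4740-5000"
bad = "(011) 4740-5000000000"

print(bool(UNANCHORED.search(bad)))   # True: a valid-looking prefix is found
print(bool(ANCHORED.search(bad)))     # False: trailing digits break the $ anchor
print(bool(ANCHORED.search(good)))    # True
```

Without anchors the engine is free to match just the first part of the input and ignore the rest, which is exactly the behaviour described in the question.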
This pattern works fine for me:
/^\(?(\d{3,5})?\)?\s?(15)?[\s-]?(4)\d{2,3}[\s-]?\d{4}$/
I've tested this in regex101:
/^((?:\(?\d{3}\)?[- .]?\d{4}|\(?\d{4}\)?[- .]?\d{3}|\(?\d{5}\)?[- .]?\d{2})[- .]?\d{4})$/
^ Matches the beginning of a string
( Beginning of capture group
(?: Beginning of non-capturing group
Your different options for area code & prefix
) End non-capturing group
[- .]?\d{4} The last four digits of the phone number
) End capture group
$ Matches the end of a string
If you're trying to validate such a phone number, then the following one should suit your needs:
^(?=.{15}$)[(]\d{3,5}[)] \d{2,4}-\d{4}$
You need to match the complete expression by indicating the start and end with anchors. You also don't need alternation for the different lengths.
/^(?=(\D*\d){11}$)\(?\d{3,5}\)?[- .]?\d{2,4}[- .]?\d{4}$/
Here's the breakdown:
(?=(\D*\d){11}$) is a positive lookahead ensuring that there are 11 digits in total,
with any number of non-digits amongst them
\(?\d{3,5}\)?[- .]? matches 3-5 digits in parens (area code), followed by a separator
\d{2,4}[- .]? matches 2-4 digits (prefix), followed by a separator
\d{4} matches the suffix
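A quick way to try the lookahead version is the Python sketch below. The pattern is the one above without the surrounding slashes; the helper name is_valid_phone is hypothetical.

```python
import re

# The lookahead counts exactly 11 digits; the rest enforces the layout.
PHONE = re.compile(r"^(?=(\D*\d){11}$)\(?\d{3,5}\)?[- .]?\d{2,4}[- .]?\d{4}$")

def is_valid_phone(text):
    """True if text has the (area) prefix-suffix shape and 11 digits overall."""
    return PHONE.match(text) is not None

print(is_valid_phone("(011) 4740-5000"))    # True  (3 + 4 + 4 digits)
print(is_valid_phone("(0114) 740-5000"))    # True  (4 + 3 + 4 digits)
print(is_valid_phone("(011) 474-5000"))     # False (only 10 digits)
print(is_valid_phone("(011) 4740-50000"))   # False (12 digits)
```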

RegEx with counting digits and allow special chars

I've done some searching but can't find the right regex.
I would like a regex for text that contains only digits, whitespace and plus signs,
like: [0-9 +]
But with a min/max limit for only the digits in that text.
My suggestions ended up with something like this:
^[0-9 \+](?=(.*[0-9]){5,8})$
Should be OK:
"123 456 7"
"12345"
"+ 123 456 78"
Should not be ok:
"123456789"
"+ 124 578a"
"+123456789"
Anyone got a solution that might do the trick?
Edit:
I can see that my explanation of what I'm aiming for was too short.
My regex conditions should be:
Must include between 5-8 digits
Allow whitespaces and plus signs
I'm guessing from your own regex that between 5 and 8 digits in a row without whitespace in between are allowed. If that's true, then the following regex might do the trick (example written in Python). It allows a single digit group that is between 5 and 8 digits long. If there is more than one group, each group must have exactly 3 digits, except for the last group, which can be between 1 and 3 digits long. A single plus sign on the left is optional.
Are you parsing phone numbers? :)
In [176]: regex = re.compile(r"""
^ # start of string
(?: \+\s )? # optional plus sign followed by whitespace
(?:
(?: \d{3}\s )+ # one or more groups of three digits followed by whitespace
\d{1,3} # one group of between one and three digits
| # ALTERNATIVE
\d{5,8} # one group of between five and eight digits
)
$ # end of string
""", flags=re.X)
# --- MATCHES ---
In [177]: regex.findall('123 456 7')
Out[177]: ['123 456 7']
In [178]: regex.findall('12345')
Out[178]: ['12345']
In [179]: regex.findall('+ 123 456 78')
Out[179]: ['+ 123 456 78']
In [200]: regex.findall('12345678')
Out[200]: ['12345678']
# --- NON-MATCHES ---
In [180]: regex.findall('123456789')
Out[180]: []
In [181]: regex.findall('+ 124 578a')
Out[181]: []
In [182]: regex.findall('+123456789')
Out[182]: []
In [198]: regex.findall('123')
Out[198]: []
In [24]: regex.findall('1234 556')
Out[24]: []
You can do something like this:
^(?:[ +]*[0-9]){5}(?:(?:[ +]*[0-9])?){3}$
The first group (?:[ +]*[0-9]){5} are the 5 minimum digits, with any amount of spaces and plus before, the second part (?:(?:[ +]*[0-9])?){3} matches the optional digits, with any amount of spaces and plus before.
You were very close - you need to anchor the lookahead to the start of input, and add a second negative lookahead for the upper bound of the quantity of digits:
^(?=(.*\d){5,8})(?!(.*\d){9,})[\d +]+$
Also, fyi you don't need to escape the plus sign within the character class, and [0-9] is \d
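Here is a hedged Python sketch that runs the two-lookahead pattern against the sample inputs from the question. It also includes a plain alternative (validate the characters, then count the digits) for comparison; both function names are invented for this example.

```python
import re

# Positive lookahead: at least 5 digits. Negative lookahead: not 9 or more.
LOOKAHEAD = re.compile(r"^(?=(.*\d){5,8})(?!(.*\d){9,})[\d +]+$")

def is_valid(text):
    """Regex-only check using the two lookaheads."""
    return LOOKAHEAD.match(text) is not None

def is_valid_plain(text):
    """Same rule without lookaheads: check characters, then count digits."""
    if not re.fullmatch(r"[0-9 +]+", text):
        return False
    return 5 <= sum(ch in "0123456789" for ch in text) <= 8

for sample in ["123 456 7", "12345", "+ 123 456 78",
               "123456789", "+ 124 578a", "+123456789"]:
    print(repr(sample), is_valid(sample), is_valid_plain(sample))
```

On these samples the two functions agree: the first three are accepted and the last three rejected. The plain version may be easier to maintain if the limits change later.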

Regex matching multiple inputs

I am trying to do a smart input field for UK style weight input, e.g. "6 stone and 3 lb" or "6 st 11 pound", capturing the 2 numbers in groups.
For now I got: ([0-9]{1,2}).*?([0-9]{1,2}).*
Problem is it matches "12 stone" in 2 groups, 1 and 2 instead of just 12. Is it possible to make a regex which captures correctly in both cases?
You need to make the first part possessive so it never gets backtracked into.
([0-9]{1,2}+).*?([0-9]{1,2})
Because . matches everything, including digits, try this:
/(\d{1,2})\D+(\d{1,2})?/
Something like this?
\b(\d+)\b.*?\b(\d+)\b
Groups 1 and 2 will have your numbers in either case.
Explanation:
\b # Assert position at a word boundary
( # Match the regular expression below and capture its match into backreference number 1
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\b # Assert position at a word boundary
. # Match any single character that is not a line break character
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\b # Assert position at a word boundary
( # Match the regular expression below and capture its match into backreference number 2
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\b # Assert position at a word boundary
This works, then look at capture groups 1 and 3:
([0-9]{1,2})[^0-9]+(([0-9]{1,2})?.+)?
The idea is to make a number and text mandatory, but make a second number and text optional.
Here is my suggestion for a regex to match both variants you showed:
(?<stone>\d+\s(?:stone|st))(?:\s(and)?\s?)(?<pound>\d+\s(?:pound|lb))
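For what it's worth, Python spells named groups as (?P<name>...) rather than (?<name>...). The sketch below is an adaptation, not the exact pattern above: it captures only the numbers in the named groups, makes the pound part optional, and accepts a few plural spellings. All of that, and the name parse_weight, are assumptions made for the example.

```python
import re

# Named groups hold just the numbers; the unit words are matched but not captured.
WEIGHT = re.compile(
    r"(?P<stone>\d+)\s*(?:stone|st)"
    r"(?:\s+(?:and\s+)?(?P<pound>\d+)\s*(?:pounds?|lbs?))?",
    re.IGNORECASE,
)

def parse_weight(text):
    """Return (stone, pound) as ints; pound is None when it is missing."""
    m = WEIGHT.search(text)
    if m is None:
        return None
    pound = m.group("pound")
    return int(m.group("stone")), (int(pound) if pound is not None else None)

print(parse_weight("6 stone and 3 lb"))   # (6, 3)
print(parse_weight("6 st 11 pound"))      # (6, 11)
print(parse_weight("12 stone"))           # (12, None)
```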
It's a bit vague at the moment, this works:
/([0-9]{1,2})(?:[^0-9]+([0-9]{1,2}).*)?/
for this data:
6 stone and 3 lb
6 st 11 pound
12 stone
12 st and 11lbs
Seeing as everyone is having a go, here's mine:
(\d+)(?:\D+(\d+))?
It's definitely the concisest so far. This will match one or two groups of digits anywhere:
"12": ("12", null)
"12st": ("12", null)
"12 st": ("12", null)
"12st 34 lb": ("12", "34")
"cabbage 12st 34 lb": ("12", "34")
"12 potato 34 moo": ("12", "34")
The next step would be making it catch the name of the units that were used.
Edit: as pointed out above, we don't know what language you're using, and not all regex functionality is available in all implementations. However, as far as I know, \d for digits and \D for non-digits are fairly universal.
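Since \d and \D are available in Python too, a minimal sketch of the "one or two numbers" idea could look like this. It uses a variant in which the whole trailing part (non-digits plus second number) is optional, so that a bare "12" also matches; the function name first_two_numbers is made up for the example.

```python
import re

# First number is required; "non-digits + second number" is optional as a unit.
NUMBERS = re.compile(r"(\d+)(?:\D+(\d+))?")

def first_two_numbers(text):
    """Return (first, second) as ints; second is None if only one number exists."""
    m = NUMBERS.search(text)
    if m is None:
        return None
    second = m.group(2)
    return int(m.group(1)), (int(second) if second is not None else None)

print(first_two_numbers("12"))                  # (12, None)
print(first_two_numbers("12 stone"))            # (12, None)
print(first_two_numbers("6 stone and 3 lb"))    # (6, 3)
print(first_two_numbers("cabbage 12st 34 lb"))  # (12, 34)
```

Because the first group is greedy and the optional part can match nothing, "12 stone" is read as the single number 12 rather than being split into 1 and 2.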