Regex to parse docker tag? - regex

Trying to come up with a regex that will match rabbit in the four cases above. Seems easy enough, but my regex-fu is failing me.

The format is a little under-specified, but this seems to work:
From the docs:
An image name is made up of slash-separated name components, optionally prefixed by a registry hostname. The hostname must comply with standard DNS rules, but may not contain underscores. If a hostname is present, it may optionally be followed by a port number in the format :8080. If not present, the command uses Docker’s public registry by default. Name components may contain lowercase characters, digits and separators. A separator is defined as a period, one or two underscores, or one or more dashes. A name component may not start or end with a separator.
A tag name may contain lowercase and uppercase characters, digits, underscores, periods and dashes. A tag name may not start with a period or a dash and may contain a maximum of 128 characters.
Tests are here.

You can try:
But you need to read the spec to replace a-z and 0-9 to all possible characters.
Alternatively, this regex will capture the container name without regard for the specification other than / and :
Sample data and test
def s = [
    // image references to test go here
]
s.forEach {
    // match each element (it), not the whole list
    def image = (it =~ "(?:.+/)?([^:]+)(?::.+)?")[0][1]
    println image
    assert image == 'image-name'
}
Will output:

There is another solution to this problem which contains all varieties and does not use lookaheads.
// image/tag:v1.0.0
// image/tag
// image
// image:v1.1.1-patch
// ubuntu@sha256:45b23dee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5cb2
// etc...
const dockerImageVerify = "^(([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9])\\.)*([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9])(:[0-9]+\\/)?(?:[0-9a-z-]+[/@])(?:([0-9a-z-]+))[/@]?(?:([0-9a-z-]+))?(?::[a-z0-9\\.-]+)?$"
This covers all cases.

To be precise, the question and almost all the answers have nothing to do with the docker image tag, but rather with the entire docker image name.
The regular expression for the image tag is much simpler:
From the docs:
A tag name may contain lowercase and uppercase characters, digits, underscores,
periods and dashes. A tag name may not start with a period or a dash and may contain a
maximum of 128 characters.
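As a minimal sketch, the quoted rule can be turned into a pattern directly. The names `TAG_RE` and `is_valid_tag` are only illustrative, and the pattern is derived from the quoted text alone: the first character is a letter, digit or underscore, and up to 127 more characters may also be periods or dashes.

```python
import re

# Derived from the quoted rule: first char is a letter, digit or underscore;
# up to 127 more chars may also be '.' or '-'. ASCII only, hence no \w.
TAG_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def is_valid_tag(tag: str) -> bool:
    """Return True if the whole string is a valid tag under the quoted rule."""
    return TAG_RE.fullmatch(tag) is not None


print(is_valid_tag("v1.0.0"))   # True
print(is_valid_tag("-latest"))  # False
```

`fullmatch` is used so that the length limit applies to the entire string, not just to a prefix of it.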

The best source for the official regular expression would be the Go implementation for OCI references used by Docker itself.
To get the full pattern, we can build and execute the following:
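package main
import (
	"fmt"
	// assumed import path for the reference package; adjust to the module you use
	"github.com/distribution/reference"
)
func main() {
	fmt.Printf("%q\n", reference.ReferenceRegexp)
}
go mod init foo.example
go mod tidy
go run ./main.go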
This gives us:
The capture groups are image name, image tag, and digest respectively.
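For quick scripting, the same three parts can be pulled apart without the full pattern. This is a rough Python sketch, not the official grammar: it validates nothing, and `Reference` and `split_reference` are made-up names. It assumes the digest follows an `@` and that a `:` introduces a tag only when it comes after the last `/` (otherwise it is a registry port).

```python
from typing import NamedTuple, Optional


class Reference(NamedTuple):
    name: str
    tag: Optional[str]
    digest: Optional[str]


def split_reference(ref: str) -> Reference:
    """Split an image reference without validating any part of it."""
    # The digest, if present, follows the '@'.
    name_and_tag, _, digest = ref.partition("@")
    # A ':' belongs to a tag only if it comes after the last '/';
    # otherwise it is a registry port (e.g. localhost:5000/image).
    last_slash = name_and_tag.rfind("/")
    last_colon = name_and_tag.rfind(":")
    if last_colon > last_slash:
        name, tag = name_and_tag[:last_colon], name_and_tag[last_colon + 1:]
    else:
        name, tag = name_and_tag, None
    return Reference(name, tag, digest or None)


# Sample references below are invented for illustration.
print(split_reference("localhost:5000/team/rabbit:3.8"))
print(split_reference("rabbit"))
```

For anything that must follow the specification exactly, the official pattern remains the safer choice.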

I think this covers all:
const ipRegexStr = '((([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))';
const hostnameRegexStr = '((([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]))';
const tagNameRegexStr = '([a-z0-9](\-*[a-z0-9])*)';
const tagVersionRegexStr = '([a-z0-9\_]([\-\.\_a-z0-9])*)';
const dockerRepoTagRegexStr = `^(${ipRegexStr}|${hostnameRegexStr}/)${tagNameRegexStr}(:${tagVersionRegexStr})?$`;
const dockerTagRegex = new RegExp(dockerRepoTagRegexStr);
console.log(dockerTagRegex.test('')); //true
console.log(dockerTagRegex.test('')); //true
console.log(dockerTagRegex.test('-0-2')); //false
console.log(dockerTagRegex.test('0-2-')); //false
console.log(dockerTagRegex.test('0-2:-23423')); //false

This is yet another iteration inspired by above answers. Ideally there should be a long-format in the docker-compose file so we don't have to break these up. I'm also not sure if I'm naming things the right way (registry/image/tag:label). I can't vouch for how much it follows spec, whether it is too strict or not strict enough, etc. YMMV.
this is Python (not JS) so may not apply to most in the devops tooling space
allows strict uppercase variable names for the image tag labels (${VAR_NAME})
enforces some sane character limits but I did not research what the spec allows
this code block is in VERBOSE mode and needs to be wrapped in re.compile("<block>", re.X) to allow for comments/whitespace to be removed
# FIRST we have the registry/image/tag:label format
# registry
# IPv4
# domain
# hostname
# port
# registry is optional, and not greedy
# label can be a literal or a variable
# next we have the digest format
Here's a code-hint to render a new image string after changing parts of the parsed image string:
def __str__(self):
    if self.image:
        image_tag = '/'.join(filter(None, (self.image, self.tag)))
        image_tag_label = ':'.join(filter(None, (image_tag, self.label)))
        return '/'.join(filter(None, (self.registry, image_tag_label)))
    return f'{self.user}@sha256:{self.hashcode}'


How to remove/replace specials characters from a 'dynamic' regex/string on ruby?

So I had this code working for a few months already. Let's say I have a table called Categories, which has a string column called name. I receive a string and want to know whether any category was mentioned (a mention occurs when the string contains the substring #name_of_a_category). The approach I followed was something like this: { |category_i| content_received.downcase.match(/##{category_i.downcase}/)}
That worked pretty well until today, when I suddenly started to receive an exception, unmatched close parenthesis. I realized that category names can contain special chars, so I decided not to consider special chars or spaces anymore (I don't want to add restrictions for the user and at the same time don't want to deal with those cases, so the policy is just to ignore them).
So the question is: is there a clean way of removing these special chars (keeping the #) and matching the string? I don't want to modify the data, just ignore those characters while looking for mentions.
You can also use
prep_content_received = content_received.gsub(/[^\w\s]|_/,'')
p { |c|
prep_content_received.match?(/\b#{c.gsub(/[^\w\s]|_/, '').strip()}\b/i)
See the Ruby demo
The prep_content_received = content_received.gsub(/[^\w\s]|_/,'') creates a copy of content_received with no special chars and no _. Doing this once, outside the loop, reduces overhead when there are a lot of categories.
Then, you iterate over the categories list, and each time check if the prep_content_received matches \b (word boundary) + category with all special chars, _ and leading/trailing whitespace stripped from it + \b in a case insensitive way (see the /i flag, no need to .downcase).
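For readers who want to experiment with the same idea outside Ruby, here is a small Python sketch of it (strip once, then do a case-insensitive whole-word match per category). The function name `mentioned` and the choice to return the matching categories are my own. Note that Python's `\w` is Unicode-aware by default, so accented letters are kept, which may differ from the Ruby behaviour.

```python
import re

# Same character filter as the Ruby snippet: drop everything that is not a
# word character or whitespace, and drop underscores too.
SPECIALS = re.compile(r"[^\w\s]|_")


def mentioned(content: str, categories: list[str]) -> list[str]:
    """Return the categories that appear as whole words in content."""
    prepared = SPECIALS.sub("", content)  # computed once, reused per category
    hits = []
    for category in categories:
        needle = SPECIALS.sub("", category).strip()
        if not needle:
            continue  # nothing left to match after stripping
        # re.escape is only a precaution; needle has no metacharacters left
        if re.search(rf"\b{re.escape(needle)}\b", prepared, re.IGNORECASE):
            hits.append(category)
    return hits


print(mentioned("pepe is watching a #comedy :)", ["comedy :)", "terror"]))
```

Because the category text is cleaned before it is interpolated into the pattern, a name like `comedy :)` can no longer produce an unmatched-parenthesis error.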
So after looking around I found some answers on the platform but nothing that met my specific requirements (maybe I missed something; if so, please let me know), and this is how I fixed it for my case:
content_received = 'pepe is watching a #comedy :)'
categories = ['comedy :)', 'terror']
temp_content = content_received.downcase
# categories.any? is assumed here, since the goal is to know whether any category is mentioned
categories.any? { |category_i| temp_content.gsub(/[^\sa-zA-Z0-9]/, '#' => '#').match?(/##{category_i.downcase.gsub(/[^\sa-zA-Z0-9]/, '')}/) }
For the sake of the example, I reduced the categories to a simple array of strings. The first gsub removes any character that is not a letter, a digit or whitespace (that is, any special character), except that each # is replaced with a # and so is kept. The second gsub is a simpler version of the first one and removes special characters from the category name.
You can test the snippet above here

Advanced grouping in domain name regex with Python3

I have a program written in python3 that should parse several domain names every day and extrapolate data.
Parsed data should serve as input for a search function, for aggregation (statistics and charts) and to save some time to the analyst that uses the program.
Just so you know: I don't really have the time to study machine learning (which seems to be a pretty good solution here), so I chose to start with regex, that I already use.
I already searched the regex documentation inside and outside StackOverflow and worked on the debugger on regex101 and I still haven't found a way to do what I need.
Edit (24/6/2019): I mention machine learning because of the reason I need a complex parser, that is automate things as much as possible. It would be useful for making automatic choices like blacklisting, whitelisting, etc.
The parser should consider a few things:
a maximum number of 126 subdomains plus the TLD
each subdomain must not be longer than 64 characters
each subdomain can contain only alphanumeric characters and the - character
each subdomain must not begin or end with the - character
the TLD must not be longer than 64 characters
the TLD must not contain only digits
but I need to go a little deeper:
the first string can (optionally) contain a "usage type" like cpanel., mail., webdisk., autodiscover. and so on... (or maybe a simple www.)
the TLD can (optionally) contain a particle like .co, .gov, .edu and so on ( for example)
the final part of the TLD is not really checked against any list of ccTLD/gTLDs right now and I don't think it will be in the future
What I thought useful to solve the problem is a regex group for the optional usage type, one for each subdomain and one for the TLD (the optional particle must be inside the TLD group)
With these rules in mind I came up with a solution:
The above solution doesn't return the expected results
I report here a couple of examples:
A couple of strings to parse
The groups I expect to find
The groups I find
As you can see from the examples, a couple of particles are found twice, which is not the behavior I was looking for. Any attempt to edit the formula results in unexpected output.
Any idea about a way to find the expected results?
This is a simple, well-defined task. There is no fuzziness, no complexity, no guessing, just a series of easy tests to figure out everything on your checklist. I have no idea how "machine learning" would be appropriate, or helpful. Even regex is completely unnecessary.
I've not implemented everything you want to verify, but it's not hard to fill in the missing bits.
import string
double_tld = ['gov', 'edu', 'co', 'add_others_you_need']
# we'll use this instead of regex to check subdomain validity
valid_sd_characters = string.ascii_letters + string.digits + '-'
valid_trans = str.maketrans('', '', valid_sd_characters)
def is_invalid_sd(sd):
    return sd.translate(valid_trans) != ''
def check_hostname(hostname):
    subdomains = hostname.split('.')
    # each subdomain can contain only alphanumeric characters and
    # the - character
    invalid_parts = list(filter(is_invalid_sd, subdomains))
    # TODO react if there are any invalid parts
    # "the TLD can (optionally) contain a particle like
    # .co, .gov, .edu and so on ( for example)"
    if subdomains[-2] in double_tld:
        subdomains[-2] += '.' + subdomains[-1]
        subdomains = subdomains[:-1]
    # "a maximum number of 126 subdomains plus the TLD"
    # TODO check list length of subdomains
    # "each subdomain must not begin or end with the - character"
    # "the TLD must not be longer than 64 characters"
    # "the TLD must not contain only digits"
    # TODO write loop, check first and last characters, length, isnumeric
    # TODO return something
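One possible way to fill in the TODOs is sketched below. It is an illustration rather than the answer's own code: it uses a set difference instead of `str.translate`, returns a list of problems (my choice), and applies the 64-character limit per label before a particle such as `co` is folded into the TLD. The limits themselves come from the question.

```python
import string

VALID_CHARS = set(string.ascii_letters + string.digits + "-")
DOUBLE_TLD = {"gov", "edu", "co"}  # illustrative; extend as needed


def check_hostname(hostname: str) -> list[str]:
    """Return a list of rule violations (empty if the hostname passes)."""
    problems = []
    labels = hostname.split(".")
    for label in labels:
        if not label or len(label) > 64:
            problems.append(f"bad length: {label!r}")
        elif set(label) - VALID_CHARS:
            problems.append(f"invalid characters: {label!r}")
        elif label[0] == "-" or label[-1] == "-":
            problems.append(f"leading/trailing dash: {label!r}")
    if len(labels) < 2:
        problems.append("missing TLD")
    else:
        # Fold a particle such as 'co' into the TLD, e.g. 'co.uk'.
        if len(labels) > 2 and labels[-2] in DOUBLE_TLD:
            labels[-2:] = [labels[-2] + "." + labels[-1]]
        if len(labels) - 1 > 126:
            problems.append("too many subdomains")
        if labels[-1].replace(".", "").isdigit():
            problems.append("TLD is all digits")
    return problems


print(check_hostname("mail.example.co.uk"))  # []
print(check_hostname("-bad.example.com"))
```

Returning the list of problems instead of a boolean makes it easy to log why a hostname was rejected.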
I don't know if it is possible to get the output exactly as you asked. I think that with a single pattern it cannot catch results in different groups(group2, group3,..).
I found one way to get almost the result you expect using regex module.
match = regex.match(r'^(?:(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?)\.)?(?:([a-z\d][a-z\d\-]{0,62}[a-z\d])\.){0,124}?(?P<TLD>(?:co|com|edu|net|org|gov)?\.(?!\d+)[a-z\d]{1,64})$', '')
match.captures(1) or match.captures('USAGE')
['without', 'further', 'ado', 'lets', 'travel', 'the', 'forest']
match.captures(3) or match.captures('TLD')
Here, to avoid taking . in groups I have added it in non-capturing group like this
Hope it helps.

How to group provided string correctly?

I have the following regex:
I use this regex to match different, yet similar strings:
# MOR644-004-007-001
MOR644004007001 # string provided
# VUF00101-050-08-01
VUF001010500801 # string provided
# MF001317-077944-01
MF00131707794401 # string provided
These strings need to be grouped as shown in the comment above each string; however, my problem is that the regex does not group them correctly.
The first string: MOR644004007001 is grouped: (MOR644004) (007) (001) which should be (MOR644) (004) (007) (001)
The second string: VUF001010500801 is grouped (VUF001010) (500) (801) which should be (VUF00101) (050) (08) (01)
How can I change ([A-Za-z]{2,3}\d{6}|\d{5}|\d{3})((\d{3})?) so that it would group the provided string correctly?
I am not sure that you can do what you want to.
Let's consider the first two strings:
# MOR644-004-007-001
MOR644004007001 # string provided
# VUF00101-050-08-01
VUF001010500801 # string provided
Now, both strings are composed of 3 letters followed by 12 digits. Thus, given a regex R, if R does not depend on particular (sequences of) characters or on particular (sequences of) digits (i.e., it uses [A-Za-z] and \d but not, let's say, MO or 0070), then it will match both strings in the same way.
So, if you want the two to be matched differently, you need to look at the particular occurrence of certain characters or digits. We need more data from you in order to give you an answer.
Finally, I suggest you take a look at this tool. It is a research project that automatically generates a regex from (many) examples of extraction. I warmly recommend trying it, especially if you know for sure that an underlying pattern is present in your case (i.e. strings beginning with VUF must be matched differently from strings beginning with MOR) but you are unable to find it. Again, you will need to provide many examples to the engine. Needless to say, if a generic pattern does not exist, then the tool won't find it ;)
Considering your comment to Serv I'd say the (only?) solution is to have one regex for each possibility, like -
and then use the execution environment (JS/php/python - you haven't provided which one) to piece the parts together.
See example on regex101 here. Note that substitution, only as an example, matches only the second string.
Take a look at this. I have used what is called a named group. As others pointed out earlier, it's better to have one regex for each kind of string. I have shown it here for the first string, MOR644004007001. You can easily extend it to the other two strings:
import re
# MOR644-004-007-001
MOR = "MOR644004007001" # string provided
# VUF00101-050-08-01
VUF = "VUF001010500801" # string provided
# MF001317-077944-01
MF = "MF00131707794401" # string provided
MORcompile = re.compile(r'(?P<first>\w{,6})(?P<second>\d{,3})(?P<third>\d{,3})(?P<fourth>\d{,3})')
MORsearch = MORcompile.search(MOR)
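Extending that idea to all three strings could look like the sketch below, with one pattern per prefix. The digit widths are read off the three examples in the question and are assumptions beyond those examples; `PATTERNS` and `group_code` are names I made up.

```python
import re

# One pattern per known prefix; widths follow the three examples only.
PATTERNS = {
    "MOR": re.compile(r"(MOR\d{3})(\d{3})(\d{3})(\d{3})"),
    "VUF": re.compile(r"(VUF\d{5})(\d{3})(\d{2})(\d{2})"),
    "MF": re.compile(r"(MF\d{6})(\d{6})(\d{2})"),
}


def group_code(code: str) -> str:
    """Return the dash-separated form of code, or raise ValueError."""
    for prefix, pattern in PATTERNS.items():
        if code.startswith(prefix):
            match = pattern.fullmatch(code)
            if match:
                return "-".join(match.groups())
    raise ValueError(f"unrecognised code: {code!r}")


print(group_code("MOR644004007001"))   # MOR644-004-007-001
print(group_code("VUF001010500801"))   # VUF00101-050-08-01
print(group_code("MF00131707794401"))  # MF001317-077944-01
```

Dispatching on the prefix is what makes the grouping unambiguous; a single prefix-agnostic pattern cannot tell the first two strings apart.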

Regex for IBAN allowing for white spaces AND checking for exact length

I need to check an input field for a German IBAN. The user should be allowed to leave in white spaces, and the input should be validated to start with DE followed by exactly 20 characters (digits and letters).
Without the white space allowance, I tried
but I cannot find where and how I can add "white spaces allowed anywhere".
This should be simple, but I simply cannot find a solution.
Thanks for help!
You should use the right tool for the task: do not rely on regexps to validate IBANs. Instead, use the IBAN checksum algorithm to check that the whole code is actually correct, which makes any regexp redundant: remove all spaces, rearrange the code, convert the letters to integers, and compute the remainder. It is best explained here.
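The checksum steps listed above can be sketched in a few lines of Python. This is a bare mod-97 check only: the function name `iban_is_valid` is illustrative, and it does not verify the country code or the country-specific length (22 characters for Germany), which a real validator should also do.

```python
def iban_is_valid(iban: str) -> bool:
    """Check an IBAN with the mod-97 checksum (no per-country length check)."""
    compact = "".join(iban.split()).upper()  # remove all whitespace
    if len(compact) < 5 or not compact.isalnum() or not compact.isascii():
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Replace letters by numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # A valid IBAN leaves a remainder of 1 when divided by 97.
    return int(digits) % 97 == 1


print(iban_is_valid("DE89 3704 0044 0532 0130 00"))  # True
print(iban_is_valid("DE88 3704 0044 0532 0130 00"))  # False
```

Python's arbitrary-precision integers make the remainder step trivial; in languages with fixed-width integers the remainder is usually computed piecewise.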
That said, here is my attempt at answering your question, for the fun of it:
what about:
whose only difference is that it allows an optional whitespace after each occurrence of an alphanumeric character.
here is the visualization:
edit: for the OP's information, the only difference is that this regexp, from @ulugbex-umirov: (?:\s*[0-9a-zA-Z]\s*) does a lookahead check to see if there's a space between the ISO country code and the checksum (which is made only of numerical digits), which I do not support on purpose.
And actually to support a correct IBAN syntax, which is formed of groups of 4 characters, as the wikipedia page says:
If your UI is in Javascript, you can use that library for doing IBAN validation:
<script src="iban.js"></script>
// the API is now accessible from the window.IBAN global object
IBAN.isValid('hello world'); // false
IBAN.isValid('BE68539007547034'); // true
so you know this is a valid IBAN, and can validate it before the data is ever even sent to the backend. Simpler, lighter and more elegant… Why do something else?
Here is a list of IBANs from 70 Countries. I generated it with a python script i wrote based on this
Debuggex Demo
This is the correct regex to match DE IBAN account numbers:
DE\d{2}[ ]\d{4}[ ]\d{4}[ ]\d{4}[ ]\d{4}[ ]\d{2}|DE\d{20}
Pass: DE89 3704 0044 0532 0130 00|||DE89370400440532013000
Fail: DE89-3704-0044-0532-0130-00
Most simple solution I can think of:
In particular, your initial [DE]{2} is wrong, as it allows 'DD', 'EE', 'ED' as well as the intended 'DE'.
To allow any amount of spaces anywhere:
^ *D *E( *[A-Za-z0-9]){20} *$
Since you want to allow lowercase letters, DE might also be lowercase:
^ *[Dd] *[Ee]( *[A-Za-z0-9]){20} *$
^ matches the start of the string
$ end anchor
in between each characters there are optional spaces *
[character class] defines a set/range of characters
To allow at most one space between characters, replace the quantifier * (any amount of) with ? (0 or 1). If supported, the \s shorthand can be used instead of a literal space to match [ \t\r\n\f].
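To try the case-insensitive pattern quickly, here is a small Python check. The pattern is copied unchanged from above; the names `DE_IBAN_SHAPE` and `looks_like_de_iban` are mine. It only tests the shape of the input (DE plus exactly 20 alphanumerics, spaces anywhere), not the checksum.

```python
import re

# The case-insensitive pattern from the answer, unchanged.
DE_IBAN_SHAPE = re.compile(r"^ *[Dd] *[Ee]( *[A-Za-z0-9]){20} *$")


def looks_like_de_iban(text: str) -> bool:
    """True if text is 'DE' plus exactly 20 alphanumerics, ignoring spaces."""
    return DE_IBAN_SHAPE.match(text) is not None


print(looks_like_de_iban("DE89 3704 0044 0532 0130 00"))  # True
print(looks_like_de_iban("DE89-3704-0044-0532-0130-00"))  # False
```

One Python-specific caveat: `$` also matches just before a trailing newline, so strip the input first (or use `\Z`) if that matters.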
Test on, also see the SO regex FAQ
Using Google Apps Script, I pasted Laurent's code from github into a script and added the following code to test.
// Use the Apps Script IDE's "Run" menu to execute this code.
// Then look at the View > Logs menu to see execution results.
function myFunction() {
  // var IBAN = require('iban');
  var t1 = IBAN.isValid('hello world'); // false
  var t2 = IBAN.isValid('BE68539007547034'); // true
  var t3 = IBAN.isValid('BE68 5390 0754 7034'); // true
  Logger.log("Test 1 = %s", t1);
  Logger.log("Test 2 = %s", t2);
  Logger.log("Test 3 = %s", t3);
}
The only thing needed to run the example code was commenting out the require('iban') line:
// var IBAN = require('iban');
Finally, instead of using client handlers to attempt a RegEx validation of IBAN input, I use a server handler to do the validation.

Regular Expression to find string in Expect buffer

I'm trying to find a regex that works to match a string of escape characters (an Expect response, see this question) and a six digit number (with alpha-numeric first character).
Here's the whole string I need to identify:
Ultimately I need to extract the string:
Here's what I have already:
interact {
    #this expression does not identify the screen location
    #I need to find "\r\n\u001b[1;14H" AND "([a-zA-Z0-9]{1})[0-9]{5}$"
    #This regex was what I was using before.
    -nobuffer -re {^([a-zA-Z0-9]{1})?[0-9]{5}$} {
        set number $interact_out(0,string)
    }
}
I need to identify the escape characters to verify that it is a field in that screen region. So I need a regex that includes that first portion, but the backslashes are confusing me...
Also once I have the full string in the $number variable, how do I isolate just the number in another variable in Tcl?
If you just want the number at the end, then this should be enough...
Update with new information
Assuming \n is a newline character, rather than a literal \ followed by a literal n, you can do this...
I found out a few things with some more digging. First of all I wasn't looking at the output of the program but the input of the user. I needed to add the "-o" flag to look at the program output. I also shortened the regex to just the necessary part.
The regex example from @rikh led me to look at why his regex and my own were failing, and that was due to the fact that I wasn't looking at the output but at the input. So the original regex that I tried wasn't at fault; the data being looked at was (because of the missing "-o" flag).
Here's the complete answer to my problem.
interact {
    -o -nobuffer -re {(\[1;14H[a-zA-Z0-9]{1})[0-9]{5}} {
        #get number in place
        set numraw $interact_out(0,string)
        #get just number out
        set num [string range $numraw 6 11]
        #switch to lowercase
        set num [string tolower $num]
        send_user " stored number: $num"
    }
}
I'm a noob with Expect and Tcl so if any of this doesn't make sense or if you have any more insights into the interact flags, please set me straight.
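As an aside on isolating the number: instead of slicing by fixed positions, a capture group around just the six-character field does the same job. The sketch below shows the idea in Python rather than Expect/Tcl, purely to illustrate the pattern; `FIELD_RE` and `extract_number` are made-up names, and the sample input is invented.

```python
import re

# ESC [ 1 ; 1 4 H moves the cursor to row 1, column 14; the six-character
# field (one alphanumeric, then five digits) follows it directly.
FIELD_RE = re.compile(r"\x1b\[1;14H([A-Za-z0-9][0-9]{5})")


def extract_number(screen_output: str):
    """Return the lowercased field after the cursor move, or None."""
    match = FIELD_RE.search(screen_output)
    return match.group(1).lower() if match else None


print(extract_number("\r\n\x1b[1;14HA12345"))  # a12345
```

The group keeps the escape sequence as an anchor for the screen position while leaving it out of the extracted value.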