I'm trying to use a regex to match the following:
I want to capture all characters that are followed by a - and then a numeric character.
So for example, if the string was python-proj-5.0 I would want to get python-proj.
I tried [^-0-9]* but it seems that only matches either a - or numeric characters but not a - preceded by numeric characters.
A pattern like this should work:
(.*)-[\d.]+
This will match any sequence of zero or more characters, captured in group 1, followed by a hyphen, then one or more digits or . characters.
Or using a lookahead:
.*(?=-[\d.]+)
This will match any sequence of zero or more characters which is followed by a hyphen, then one or more digits or . characters. The hyphen and the number which follows will not be included in the match.
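For what it's worth, here is a minimal Python sketch of both variants, assuming the input is the python-proj-5.0 string from the question. The variable names are just illustrative.

```python
import re

# Sample input taken from the question.
text = "python-proj-5.0"

# Variant 1: group 1 captures everything before the last "-<digits or dots>".
m_group = re.search(r"(.*)-[\d.]+", text)
name_from_group = m_group.group(1) if m_group else None

# Variant 2: the lookahead checks for the suffix without consuming it,
# so the whole match is already the name.
m_look = re.search(r".*(?=-[\d.]+)", text)
name_from_lookahead = m_look.group(0) if m_look else None

print(name_from_group)      # python-proj
print(name_from_lookahead)  # python-proj
```

Because .* is greedy, both variants split at the last hyphen that is followed by digits or dots.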
Related
I need to match a string with an identifier.
Pattern
Any word will be considered as identifier if
Word doesn't contain any character other than alpha-numeric characters.
Word doesn't start with number.
Input
The given input string will not contain any preceding or trailing spaces or white-space characters.
Code
I tried using the following regular expressions
\D[a-zA-Z]\w*\D
[ \t\n][a-zA-Z]\w*[ \t\n]
^\D[a-zA-Z]\w*$
None of them works.
How can I achieve this?
Note I want to match a string that contains multiple identifiers (also can be one). For example This is an i0dentifier 1abs, where i0dentifier, This, is, an are expected results.
Note that in your ^\D[a-zA-Z]\w*$ regex, \D can match non-alphanumeric chars since \D matches any non-digit chars, and \w also matches underscores, which is not an alphanumeric char.
I suggest
\b[A-Za-z]+[0-9][A-Za-z0-9]*\b
It matches
\b - word boundary
[A-Za-z]+ - one or more letters (the identifier should start with a letter)
[0-9] - a digit (required)
[A-Za-z0-9]* - zero or more ASCII letters/digits
\b - word boundary.
See the regex demo.
In Python:
identifiers = re.findall(r'\b[A-Za-z]+[0-9][A-Za-z0-9]*\b', text)
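A quick runnable check of that call on the example sentence from the question. Since the pattern requires a digit, letters-only words are skipped. The second pattern is my own assumed variant (not part of the answer) that drops the required digit.

```python
import re

text = "This is an i0dentifier 1abs"  # example sentence from the question

# Pattern from the answer: at least one digit must follow the leading letters.
with_digit = re.findall(r"\b[A-Za-z]+[0-9][A-Za-z0-9]*\b", text)

# Assumed variant (not from the answer): no required digit, so plain words
# made only of letters are accepted too.
any_identifier = re.findall(r"\b[A-Za-z][A-Za-z0-9]*\b", text)

print(with_digit)      # ['i0dentifier']
print(any_identifier)  # ['This', 'is', 'an', 'i0dentifier']
```

In both cases 1abs is rejected, because there is no word boundary between 1 and a.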
A \D matches any non-digit character, which includes not only letters but also punctuation characters, whitespace characters etc., and you definitely do not need those at the beginning.
You can use ^[A-Za-z][A-Za-z0-9]*$ which can be described as
^: Start of string
[A-Za-z]: A letter
[A-Za-z0-9]*: An alphanumeric character, zero or more times
$: End of string
Demo
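Since the pattern is anchored with ^ and $, it validates one word at a time. A rough sketch for a string with several words, assuming they are separated by whitespace (find_identifiers is a made-up helper name):

```python
import re

IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9]*$")

def find_identifiers(text):
    """Return the whitespace-separated words that are valid identifiers."""
    return [word for word in text.split() if IDENTIFIER.match(word)]

print(find_identifiers("This is an i0dentifier 1abs"))
# ['This', 'is', 'an', 'i0dentifier']
```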
An even simpler pattern for an identifier, not using a negative lookahead like Wiktor's answer:
^[^0-9][A-Za-z0-9]*$ decomposed and explained:
^[^0-9]: The word starts (^) with a character that is not ([^...]) a digit (0-9). More exactly, only the first char must be a non-digit; the second character can be a digit. Note that [^0-9] accepts any non-digit as the first char, including punctuation such as _ or -.
[A-Za-z0-9]*: The rest of the word doesn't contain any character other than alpha-numeric characters (not even hyphen or underscore) until the end $.
See demo on regex101.
Positive alternative
As already suggested by Arvind Kumar Avinash:
If (according to both rules) the first char must not be a digit but only a letter, then we could also change the first part of the above regex from "not-numeric" to "only-alpha".
[A-Za-z][A-Za-z0-9]* explained:
[A-Za-z]: first char must be an alpha
[A-Za-z0-9]*: optional second and following chars can be any alpha-numeric
Same effect, see demo on regex101.
Tests
input      result               reason
aB123      matches identifier
Ab123      matches identifier
XXXX12YZ   matches identifier
a2b3       matches identifier
a          matches identifier
Z          matches identifier
0          no match             starts with a digit
1Ab        no match             starts with a digit
12abc      no match             starts with a digit
abc_123    no match             contains underscore, not alphanum
r2-d2      no match             contains hyphen, not alphanum
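If you want to replay these tests yourself, here is a small Python sketch that runs both patterns against the inputs above. The _abc input at the end is an extra case I added (it is not in the table) to show where the two patterns differ.

```python
import re

NOT_DIGIT_FIRST = re.compile(r"^[^0-9][A-Za-z0-9]*$")
ALPHA_FIRST = re.compile(r"^[A-Za-z][A-Za-z0-9]*$")

# Expected outcome per input, copied from the test list above.
cases = {
    "aB123": True, "Ab123": True, "XXXX12YZ": True, "a2b3": True,
    "a": True, "Z": True,
    "0": False, "1Ab": False, "12abc": False,
    "abc_123": False, "r2-d2": False,
}

results_not_digit = {s: bool(NOT_DIGIT_FIRST.match(s)) for s in cases}
results_alpha = {s: bool(ALPHA_FIRST.match(s)) for s in cases}

print(results_not_digit == cases)  # True
print(results_alpha == cases)      # True

# Extra case (not in the list): a leading underscore is a non-digit,
# so only the "not a digit" variant accepts it.
print(bool(NOT_DIGIT_FIRST.match("_abc")))  # True
print(bool(ALPHA_FIRST.match("_abc")))      # False
```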
I have a string which has the following format:
Foo/FooVersion some info
Foo can contain:
punctuation
special characters
emojis
alpha numeric
Chinese characters
I have this regex to capture the following pattern:
^[\+$-¨™®é!?_ó–:—🔥😘兼职,.&\w\s]+\/\d+[\+\w.-]*
The character set looks like quite an exhaustive list, and I am not sure whether it covers all the characters. What I am looking for is a simplified regex that takes these characters into account and returns true if there is a match. I am using sql.
FooVersion can consist of:
start with digit followed by word including dot or hyphen
You could use a pattern such as ([^\/]+)\/\1Version.+
Pattern explanation:
([^\/]+) - [^\/]+ matches one or more characters other than / (this is a negated character class), () means capturing group, so the matched text is put into the first capturing group
\/ - match / literally
\1 - back reference to match the same text as was matched by first capturing group
Version - match Version literally
.+ - match one or more of any characters (to match rest of a string - this is optional and can be removed)
Regex demo
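The question is about SQL, but the backreference is easy to try out in Python first. A small sketch, with the second input made up to show a mismatch:

```python
import re

pattern = re.compile(r"([^\/]+)\/\1Version.+")

same_name = pattern.search("Foo/FooVersion some info")
other_name = pattern.search("Foo/BarVersion some info")  # made-up counterexample

print(same_name.group(1))  # Foo
print(other_name)          # None
```

The \1 only matches when the text after the slash repeats whatever group 1 captured. Backreference support varies between SQL dialects, so treat this as an illustration of the pattern rather than of any particular database.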
Update
To match updated requirements, you should use ([^\/]+)\/\d[a-zA-Z\d.-]+
What's new is:
[a-zA-Z\d.-]+ - match one or more characters from the set a-z (lowercase letters), A-Z (uppercase letters), \d (digits), .- (dot or hyphen)
Updated demo
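A quick Python check of the updated pattern. The sample strings are my own, loosely based on the character types listed in the question.

```python
import re

pattern = re.compile(r"([^\/]+)\/\d[a-zA-Z\d.-]+")

# Hypothetical inputs: plain name, name with Chinese characters and an emoji,
# and a version part that does not start with a digit.
samples = [
    "Foo/1.2.3 some info",
    "兼职🔥/2.0-beta some info",
    "Foo/FooVersion some info",
]

# match() anchors at the start of the string, like ^ in the question's regex.
results = [pattern.match(s) is not None for s in samples]
print(results)  # [True, True, False]
```

Because [^\/]+ accepts anything except a slash, there is no need to enumerate emojis or other special characters. Note that \d followed by [a-zA-Z\d.-]+ needs a version part of at least two characters.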
I am working on regex with the following conditions:
Must contain from 1 to 63 alphanumeric characters or hyphens.
First character must be a letter.
Cannot end with a hyphen or contain two consecutive hyphens.
I am able to get the regex like:
^[a-zA-Z0-9](?!.*--)[a-zA-Z0-9-]{0,61}[A-Za-z0-9]$
But it fails on the length constraint and also allows patterns like "a-". How can I meet the conditions?
I would phrase your requirements as:
^(?=.{1,63}$)(?!.*--)[a-zA-Z]([a-zA-Z0-9\-]*[a-zA-Z0-9])?$
Demo
Here is a brief explanation of what each part of the above regex does:
^ from the start of the string
(?=.{1,63}$) assert that the string is between 1 and 63 characters long
(?!.*--) assert that two hyphens do not appear together anywhere
[a-zA-Z] first character is a letter (mandatory in all matches)
([a-zA-Z0-9\-]*[a-zA-Z0-9])?
The final portion says to match a final character which is alphanumeric, but not dash, possibly preceded by alphanumeric characters or dash.
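A short Python sketch to try the pattern on a few inputs, including the "a-" case from the question (is_valid is just an illustrative helper name):

```python
import re

LABEL = re.compile(r"^(?=.{1,63}$)(?!.*--)[a-zA-Z]([a-zA-Z0-9\-]*[a-zA-Z0-9])?$")

def is_valid(name):
    """True if name satisfies the three conditions from the question."""
    return LABEL.match(name) is not None

print(is_valid("a"))       # True
print(is_valid("a-b"))     # True
print(is_valid("a-"))      # False, ends with a hyphen
print(is_valid("a--b"))    # False, two consecutive hyphens
print(is_valid("1abc"))    # False, first character is not a letter
print(is_valid("a" * 63))  # True
print(is_valid("a" * 64))  # False, too long
```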
My take on this would be:
^[A-Za-z](?!.*?--)[A-Za-z0-9\-]{0,62}(?<!-)$
Try it out here
Explanation:
^ - Matches the start of the string.
[A-Za-z] - Matches the first letter.
(?!.*?--) - Ensures that there are no two consecutive hyphens in the rest of the string.
[A-Za-z0-9\-]{0,62} - Matches the remaining alphanumeric and hyphen characters.
(?<!-) - Ensures that the string doesn't end with a hyphen.
$ - Matches the end of the string.
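To convince yourself that this lookbehind version and the lookahead-based one from the other answer agree, here is a rough Python comparison over a handful of edge cases. The sample list is my own and is not an exhaustive proof.

```python
import re

LOOKAHEAD_STYLE = re.compile(r"^(?=.{1,63}$)(?!.*--)[a-zA-Z]([a-zA-Z0-9\-]*[a-zA-Z0-9])?$")
LOOKBEHIND_STYLE = re.compile(r"^[A-Za-z](?!.*?--)[A-Za-z0-9\-]{0,62}(?<!-)$")

# Edge cases: length limits, leading digit/hyphen, trailing and double hyphens.
samples = ["a", "a-", "a-b", "a--b", "-a", "9a",
           "a" * 63, "a" * 64, "a" + "-b" * 31, ""]

verdicts = {s: bool(LOOKBEHIND_STYLE.match(s)) for s in samples}
agree = all(bool(LOOKAHEAD_STYLE.match(s)) == ok for s, ok in verdicts.items())

print(verdicts["a" * 63], verdicts["a" * 64])  # True False
print(agree)                                   # True
```

Here the length limit comes from the {0,62} quantifier plus the one leading letter, rather than from a separate lookahead.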
I am using the regex
(.*)\d.txt
on the expression
MyFile23.txt
Now the online tester says that, using the above regex, the mentioned string is allowed (selected). My understanding is that it should not be allowed, because there are two numeric digits, 2 and 3, while the above regex has only one numeric digit in it, i.e. \d. It should have been \d+. My current expression reads: zero or more of any character, followed by one numeric digit, followed by .txt. My question is: why does the above string pass the regex?
This regex (.*)\d.txt will still match MyFile23.txt because of .* which will match 0 or more of any character (including a digit).
So for the given input: MyFile23.txt here is the breakdown:
.* # matches MyFile2
\d # matches 3
. # matches a dot (though it can match anything here due to unescaped dot)
txt # will match literal txt
To make sure it only matches MyFile2.txt you can use:
^\D*\d\.txt$
Where ^ and $ are anchors to match start and end. \D* will match 0 or more non-digit.
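You can see the same breakdown in Python by inspecting the capture group. A minimal sketch (MyFile2Xtxt is a made-up input to show the unescaped dot):

```python
import re

# Original pattern: the greedy group swallows the first digit.
loose = re.search(r"(.*)\d.txt", "MyFile23.txt")
captured = loose.group(1)
print(captured)  # MyFile2

# The unescaped dot also accepts an arbitrary character before "txt".
print(bool(re.search(r"(.*)\d.txt", "MyFile2Xtxt")))  # True

# Anchored pattern with non-digits only before exactly one digit.
strict = re.compile(r"^\D*\d\.txt$")
print(bool(strict.match("MyFile2.txt")))   # True
print(bool(strict.match("MyFile23.txt")))  # False
```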
The pattern you have has one group, (.*), which for your example matches MyFile2, because the . allows any character.
Furthermore the . in the pattern after this group is not escaped which will result in allowing another character of any kind.
To avoid this use:
(\D*)\d+\.txt
the group (\D*) would now match all non digit characters.
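A short Python check of what the group captures with this version (the second input is made up to show the effect of escaping the dot):

```python
import re

pattern = re.compile(r"(\D*)\d+\.txt")

m = pattern.search("MyFile23.txt")
name = m.group(1)
print(name)        # MyFile
print(m.group(0))  # MyFile23.txt

# With the dot escaped, an arbitrary character before "txt" no longer matches.
no_dot = pattern.search("MyFile2Xtxt")
print(no_dot)  # None
```

Note that \d+ accepts any number of digits, so MyFile23.txt matches as a whole here, with only the non-digit prefix in the group.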
Here is the explanation of why your "MyFile23.txt" matches the regex pattern:
A literal period . should always be escaped as \. else it will match "any character".
And finally, (.*) matches all the string from the beginning to the last digit (MyFile2). Have a look at the "MATCH INFORMATION" area on the right at this page.
So, I'd suggest the following fix:
^\D*\d\.txt$ = beginning of a line/string, zero or more non-digit characters, a digit, a literal period, a literal txt, and the end of the string/line (depending on the m switch; whether you need it depends on the input, i.e. whether you have a list of words on separate lines or just a single file name).
Here is a working example.
^(?!-)[a-z\d\-]{1,100}$
Here's an explanation using regex comment mode, so this expanded form can itself be used as a regex:
(?x) # flag to enable comment mode
^ # start of line/string.
(?!-) # negative lookahead for literal hyphen (-) character, so fails if the next position contains one.
[a-z\d\-] # character class matches a single alpha (a-z), digit (\d) or hyphen (\-).
{1,100} # match the above [class] up to 100 times, at least once.
$ # end of line/string.
In short, it's matching up to 100 lowercase alphanumerics or hyphens, but the first character must not be a hyphen.
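As a sketch of how the commented form could be used in Python: I pass re.VERBOSE instead of the inline (?x) flag and shortened the comments, otherwise the layout is the same.

```python
import re

# re.VERBOSE ignores whitespace and "#" comments outside character classes.
COMMENTED = re.compile(r"""
    ^            # start of line/string
    (?!-)        # negative lookahead: the next character must not be a hyphen
    [a-z\d\-]    # one lowercase letter, digit, or hyphen
    {1,100}      # repeat the class above 1 to 100 times
    $            # end of line/string
""", re.VERBOSE)

print(bool(COMMENTED.match("abc-123")))  # True
print(bool(COMMENTED.match("-abc")))     # False, starts with a hyphen
print(bool(COMMENTED.match("ABC")))      # False, uppercase is not in the class
print(bool(COMMENTED.match("a" * 101)))  # False, too long
```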
Could be attempting to validate a serial number, or similar, but it's too general to say for sure.
Not all regex engines support negative lookaheads. If you're trying to figure out what it is doing in order to adapt for an engine without negative lookaheads, you can use:
^[a-z\d][a-z\d-]{0,99}$
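A quick Python spot check that the lookahead-free rewrite behaves the same as the original on a few inputs. The sample list is mine; it is a sanity check, not a proof.

```python
import re

WITH_LOOKAHEAD = re.compile(r"^(?!-)[a-z\d\-]{1,100}$")
WITHOUT_LOOKAHEAD = re.compile(r"^[a-z\d][a-z\d-]{0,99}$")

samples = ["a", "abc-123", "-abc", "abc-", "", "A1", "x" * 100, "x" * 101]

same = all(
    bool(WITH_LOOKAHEAD.match(s)) == bool(WITHOUT_LOOKAHEAD.match(s))
    for s in samples
)
print(same)  # True
```

The rewrite works because "first character is not a hyphen" and "first character is a letter or digit" mean the same thing once the class is limited to letters, digits and hyphens.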
(?!-) == negative lookahead
Start of line not followed by a -, then 1 to 100 characters, each of which can be a-z, 0-9 or -, followed by the end of the line. Depending on the regex flavor, the \d in the character class may be wrong: where \d is not supported inside a class it should be written as 0-9, since a-z already covers a literal d.
A string of letters, digits and dashes. Between 1 and 100 characters. The first character is not a dash.