I am trying to write a regex that matches words according to the following rules:
1. A word consists only of letters, digits, apostrophes, hyphens and underscores.
2. It starts with a letter, or with an apostrophe followed by a letter.
3. It does not contain a sequence of 2 or more apostrophes, underscores or hyphens.
4. It ends with a letter, a digit, an apostrophe preceded by the letter s, or an apostrophe followed by s.
So far I have a few regexes built:
For rule 2 I have built
^[']?[a-zA-Z][a-zA-Z0-9]+
For rule 3 I have built
(?!.*[-_'][-_'])(?=[a-z])[a-zA-Z0-9]*
but for the test string abc def''ghi it matches ghi, not abc.
For rule 4 I have built
.*[a-zA-Z0-9](?:'s)?(?:s')?$
but for the test string test's abc' it does not match anything, although it should match test's.
I am looking for advice on how to improve my regexes for rules 3 and 4 so that they work.
(?:^|\s)\K(?!'')['a-z](?:['_-]?[a-z0-9])+['_-]?(?:(?<!')'s|s'|[a-z])(?=\s|$)
Explanation:
(?:^|\s) # non capture group, beginning of line OR space
\K # forget all we've seen until this position
(?!'') # negative lookahead, not two apos.
['a-z] # apos. or letter
(?: # start non capture group
['_-]? # apos, dash or underscore, optional
[a-z0-9] # a letter or digit
)+ # group may appear 1 or more times
['_-]? # apos, dash or underscore, optional
(?: # start non capture group
(?<!') # negative lookbehind, make sure there is no apos. right before
's # apos and s
| # OR
s' # s and apos
| # OR
[a-z] # a letter
) # end group
(?=\s|$) # lookahead, make sure we have a space or end of line after
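For engines without \K, such as JavaScript, a rough equivalent can be sketched by swapping (?:^|\s)\K for a lookbehind. The lookbehind substitution, the i flag and the helper name findWords are assumptions made for this illustration, not part of the answer above.

```javascript
// Assumption: JavaScript has no \K, so (?:^|\s)\K is swapped for a lookbehind.
// The "i" flag is also assumed, since the pattern only lists lowercase letters.
const wordPattern =
  /(?<=^|\s)(?!'')['a-z](?:['_-]?[a-z0-9])+['_-]?(?:(?<!')'s|s'|[a-z])(?=\s|$)/gi;

// Hypothetical helper: returns every word in `text` accepted by the pattern.
function findWords(text) {
  return text.match(wordPattern) ?? [];
}

console.log(findWords("abc def''ghi")); // [ 'abc' ]
console.log(findWords("test's abc'"));  // [ "test's" ]
```

Both test strings from the question behave as requested here: def''ghi and abc' are skipped, while abc and test's are found.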
I want to pull out a base string (Wax) or (noWax) from a longer string, along with potentially any data before and after if the string is Wax. I'm having trouble getting the last item in my list below (noWax) to match.
Can anyone flex their regex muscles? I'm fairly new to regex so advice on optimization is welcome as long as all matches below are found.
What I'm working with in Regex101:
/(?<Wax>Wax(?:Only|-?\d+))/mg
Original string              Need to extract in a capturing group
Loc3_341001_WaxOnly_S212     WaxOnly
Loc4_34412-a_Wax4_S231       Wax4
Loc3a_231121-a_Wax-4-S451    Wax-4
Loc3_34112_noWax_S311        noWax
Here is one way to do so, using a conditional:
(?<Wax>(no)?Wax(?(2)|(?:Only|-?\d+)))
See the online demo.
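(no)?: Optional capture group. It is group 2, because the enclosing named group Wax counts as group 1.
(?(2): If capture group 2, i.e. (no), took part in the match,
then match nothing more (the empty branch before the |).
|: Otherwise,
(?:Only|-?\d+): match Only, or an optional hyphen followed by one or more digits.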
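JavaScript does not support (?(2)...) conditionals, so the following is only a sketch of the same idea written as a plain alternation. The alternation and the helper name extractWax are assumptions for illustration, not the pattern given above.

```javascript
// Assumption: no conditional syntax in JavaScript, so the two cases
// ("noWax" alone, or "Wax" plus a suffix) are spelled out as alternatives.
const waxPattern = /(?<Wax>noWax|Wax(?:Only|-?\d+))/;

// Hypothetical helper: returns the named group "Wax", or null without a match.
function extractWax(text) {
  const match = waxPattern.exec(text);
  return match ? match.groups.Wax : null;
}

const waxSamples = [
  "Loc3_341001_WaxOnly_S212",
  "Loc4_34412-a_Wax4_S231",
  "Loc3a_231121-a_Wax-4-S451",
  "Loc3_34112_noWax_S311",
];
console.log(waxSamples.map(extractWax));
// [ 'WaxOnly', 'Wax4', 'Wax-4', 'noWax' ]
```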
I assume the following match is desired.
The match must include 'Wax'.
'Wax' is to be preceded by '_' or by '_no'. In the latter case, 'no' is included in the match.
'Wax' may be followed by:
'Only' followed by '_', in which case 'Only' is part of the match, or
one or more digits followed by '_', in which case the digits are part of the match, or
'-' followed by one or more digits, followed by '-', in which case '-' followed by the digits is part of the match.
If these assumptions are correct the string can be matched against the following regular expression:
(?<=_)(?:(?:no)?Wax(?:(?:Only|\d+)?(?=_)|\-\d+(?=-)))
The regular expression can be broken down as follows.
(?<=_) # positive lookbehind asserts previous character is '_'
(?: # begin non-capture group
(?:no)? # optionally match 'no'
Wax # match literal
(?: # begin non-capture group
(?:Only|\d+)? # optionally match 'Only' or >=1 digits
(?=_) # positive lookahead asserts next character is '_'
| # or
\-\d+ # match '-' followed by >= 1 digits
(?=-) # positive lookahead asserts next character is '-'
) # end non-capture group
) # end non-capture group
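As a quick check, the pattern can be run against the four sample strings from the question. This is a minimal sketch in JavaScript, assuming an engine with lookbehind support; the variable names are illustrative only.

```javascript
// The pattern from the answer above, unchanged.
const waxLookaround = /(?<=_)(?:(?:no)?Wax(?:(?:Only|\d+)?(?=_)|\-\d+(?=-)))/;

const waxInputs = [
  "Loc3_341001_WaxOnly_S212",
  "Loc4_34412-a_Wax4_S231",
  "Loc3a_231121-a_Wax-4-S451",
  "Loc3_34112_noWax_S311",
];

// Take the whole match (index 0), or null when the pattern does not match.
const waxResults = waxInputs.map((s) => (s.match(waxLookaround) ?? [null])[0]);
console.log(waxResults); // [ 'WaxOnly', 'Wax4', 'Wax-4', 'noWax' ]
```

Because the delimiters are only asserted by lookarounds, the whole match is already the wanted text and no capture group is needed.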
Example:
I have the following string
a125A##THISSTRING##.test123
I need to find THISSTRING. There are many strings which are nearly the same so I'd like to check if there is a digit or letter before the ## and also if there is a dot (.) after the ##.
I have tried something like:
([a-zA-Z0-9]+##?)(.+?)(.##)
But I am unable to get it working
You can use look behind and look ahead:
(?<=[a-zA-Z0-9]##).*?(?=##\.)
https://regex101.com/r/i3RzFJ/2
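A minimal JavaScript sketch of this lookaround approach, assuming an engine with lookbehind support (variable names are illustrative):

```javascript
// Lookbehind: a letter or digit followed by ## must precede the match.
// Lookahead: ## followed by a literal dot must come right after it.
const betweenHashes = /(?<=[a-zA-Z0-9]##).*?(?=##\.)/;

const found = "a125A##THISSTRING##.test123".match(betweenHashes);
const extracted = found ? found[0] : null;
console.log(extracted); // THISSTRING
```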
But I am unable to get it working.
Let's deconstruct what your regex ([a-zA-Z0-9]+##?)(.+?)(.##) says.
([a-zA-Z0-9]+##?) matches one or more characters from [a-zA-Z0-9], followed by a # and an optional second #.
(.+?) lazily matches one or more of any character, i.e. as few as possible.
(.##) matches any single character followed by two #, because the . is unescaped. Here the . consumes the G before ##, hence THISSTRING is not completely captured in group 2.
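The effect described above can be observed by printing the three groups of the original regex. A small JavaScript sketch (variable names are illustrative):

```javascript
// The regex from the question, unchanged.
const originalAttempt = /([a-zA-Z0-9]+##?)(.+?)(.##)/;
const attemptGroups = originalAttempt.exec("a125A##THISSTRING##.test123");

console.log(attemptGroups[1]); // a125A##
console.log(attemptGroups[2]); // THISSTRIN
console.log(attemptGroups[3]); // G##
```

The unescaped dot at the start of the third group takes the final G, which is why group 2 comes up one character short.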
Lookaround assertions are great but can be a little expensive.
You can easily search for such patterns by matching wanted and unwanted and capturing wanted stuff in a capturing group.
Regex: (?:[a-zA-Z0-9]##)([^#]+)(?:##\.)
Explanation:
(?:[a-zA-Z0-9]##) Non-capturing group matching ## preceded by a letter or digit.
([^#]+) Capturing as many characters other than #. Stops before a # is met.
(?:##\.) Non-capturing group matching ##. literally.
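JavaScript example:
const myString = "a125A##THISSTRING##.test123";
// No g flag is needed for a single exec() call.
const myRegexp = /(?:[a-zA-Z0-9]##)([^#]+)(?:##\.)/;
const match = myRegexp.exec(myString);
// exec() returns null when nothing matches, so guard before reading group 1.
console.log(match ? match[1] : "no match"); // THISSTRING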
You wrote:
check if there is a digit or letter before the ##
I assume you mean a digit / letter before the first ## and
check for a dot after the second ## (as in your example).
You can use the following regex:
[a-z0-9]+ # Chars before "##", except the last
(?: # Last char before "##"
(\d) # either a digit - group 1
| # or
([a-z]) # a letter - group 2
)
\#\#? # 1 or 2 hash chars (escaped, because an unescaped # starts a comment in x mode)
([^#]+) # "Central" part - group 3
\#\#? # 1 or 2 hash chars
(?: # Check for a dot
(\.) # Captured - group 4
| # or nothing captured
)
[a-z0-9]+ # The last part
# Flags:
# i - case insensitive
# x - ignore blanks and comments
How it works:
Group 1 or 2 captures the last char before the first ##
(either group 1 captures a digit or group 2 captures a letter).
Group 3 catches the "central" part (THISSTRING,
a sequence of chars other than #).
Group 4 catches a dot, if any.
You can test it at https://regex101.com/r/ATjprp/1
Your regex has an error: an unescaped dot matches any character.
If you want to check for a literal dot, you must escape it
with a backslash (compare with group 4 in my solution).
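To see which groups are filled, the pattern can be collapsed onto one line and run in JavaScript. This is only a sketch: JavaScript has no x flag, so the single-line form and the variable names are assumptions for illustration.

```javascript
// Assumption: the commented pattern above, written without whitespace and
// comments because JavaScript lacks the x flag. Only the i flag is kept.
const hashPattern = /[a-z0-9]+(?:(\d)|([a-z]))##?([^#]+)##?(?:(\.)|)[a-z0-9]+/i;
const hashMatch = hashPattern.exec("a125A##THISSTRING##.test123");

console.log(hashMatch[1]); // undefined (the char before ## is not a digit)
console.log(hashMatch[2]); // A
console.log(hashMatch[3]); // THISSTRING
console.log(hashMatch[4]); // .
```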
I have created the following expression: (.NET regex engine)
((-|\+)?\w+(\^\.?\d+)?)
hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121
It works well except that:
it captures the term 'hello+', but without the '+': this term should not be captured at all
the last term 'hello^-1212121' is captured as 2 groups, 'hello' and '-1212121'; both should be ignored
The strings to capture are as follows :
word can have a + or a - before it
or word can have a ^ that is followed by a positive number (not necessarily an integer)
words are separated by commas and any number of white spaces (both not part of the capture)
A few examples of valid strings to capture :
hello^2
hello^.2
+hello
-hello
hello
EDIT
I have found the following expression which effectively captures all these terms, it's not really optimized but it just works :
([a-zA-Z]+(?= ?,))|((-|\+)[a-zA-Z]+(?=,))|([a-zA-Z]+\^\.?\d+)
Ok, there are some issues to tackle here:
((-|+)?\w+(\^.?\d+)?)
    ^        ^
The + and . should be escaped like this:
((-|\+)?\w+(\^\.?\d+)?)
Now, you'll also get -1212121 there. If your string hello is always letters, then you would change \w to [a-zA-Z]:
((-|\+)?[a-zA-Z]+(\^\.?\d+)?)
\w includes letters, numbers and underscore. So, you might want to restrict it down a bit to only letters.
And finally, to leave out the terms that should not be captured at all, you'll have to use lookarounds. I don't know of any other way to check the delimiters without hindering the matches:
(?<=^|,)\s*((-|\+)?[a-zA-Z]+(\^\.?\d+)?)\s*(?=,|$)
EDIT: If it cannot be something like -hello^2, and if another valid string is hello^9.8, then this one will fit better:
(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))
And lastly, if capturing the words is sufficient, we can remove the lookarounds:
([-+]?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)
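The lookaround-based pattern above targets .NET, but it also happens to be valid JavaScript, so it can be sketched against the test string from the question. The variable names are illustrative, and the wanted term is read from capture group 1 because the overall match may include leading whitespace.

```javascript
// The second lookaround pattern from this answer, with the g flag added.
const termPattern =
  /(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))/g;

const termInput =
  "hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121";

// Group 1 holds the term itself, without surrounding whitespace.
const terms = [...termInput.matchAll(termPattern)].map((m) => m[1]);
console.log(terms);
// [ 'hello', 'hello^.555', 'hello^111', '-hello', '+hello', 'hello^.25' ]
```

Both problem terms, hello+ and hello^-1212121, are skipped entirely because the lookarounds require a comma or a string boundary on each side.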
It would be better if you first state what it is you are looking to extract.
You also don't indicate which Regular Expression engine you're using, which is important since they vary in their features, but...
Assuming you want to capture only:
words that have a leading + or -
words that have a trailing ^ followed by an optional period followed by one or more digits
and that words are sequences of one or more letters
I'd use:
([a-zA-Z]+\^\.?\d+|[-+][a-zA-Z]+)
which breaks down into:
( # start capture group
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
\^ # literal
\.? # optional period
\d+ # one or more digits
| # OR
[-+] # literal plus or minus
[a-zA-Z]+ # one or more letters
) # end of capture group
EDIT
To also capture plain words (without leading or trailing chars) you'll need to rearrange the regexp a little. I'd use:
([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)
which breaks down into:
( # start capture group
[+-] # literal plus or minus
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
| # OR
[a-zA-Z]+ # one or more letters
\^ # literal
(?: # start of non-capturing group
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
) # end of non-capturing group
| # OR
[a-zA-Z]+ # one or more letters
) # end of capture group
Also note that, per your updated requirements, this regexp accepts both ordinary non-negative numbers (e.g. 0, 1, 1.2, 1.23) and those lacking a leading digit (e.g. .1, .12).
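A short JavaScript sketch of the rearranged pattern, showing which alternative picks up each kind of term (the sample input and variable names are made up for illustration):

```javascript
// The rearranged pattern from above, with the g flag to collect all matches.
const rearranged =
  /([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)/g;

const rearrangedInput = "+hello,hello^.2,hello^2,hello^1.23,hello";
const rearrangedTerms = [...rearrangedInput.matchAll(rearranged)].map((m) => m[1]);
console.log(rearrangedTerms);
// [ '+hello', 'hello^.2', 'hello^2', 'hello^1.23', 'hello' ]
```

The order inside the number group matters: \d+\.\d+ is tried before the bare \d+, otherwise hello^1.23 would stop at hello^1.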
FURTHER EDIT
This regexp will only match the following patterns delimited by commas:
word
word with leading plus or minus
word with trailing ^ followed by a positive number of the form \d+, \d+\.\d+, or \.\d+
([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)
Please note that the useful match will appear in the first capture group, not the entire match.
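So, in JavaScript, you could write:
var src = "hello , hello ,hello,+hello,-hello,hello+,hello-,hello^1,hello^1.0,hello^.1",
    RE = /([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)/g;
// Read the term from capture group 1 of each match
// (matchAll avoids the legacy RegExp.$1 property).
for (const m of src.matchAll(RE)) {
  console.log(m[1]);
}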
which produces:
hello
hello
hello
+hello
-hello
hello^1
hello^1.0
hello^.1
I want to match all strings satisfying the following rules:
should consist of lower-case letters, digits and dashes
should start with a letter or a number
should end with a letter or a number
total string length should be at least 3 and at most 20 characters
dot . is optional, there shouldn't be two or more consecutive dots .
dash - is optional, there shouldn't be two or more consecutive dashes -
dot . and dash - shouldn't be consecutive // the string aaa.-aaabbb is invalid
underscore not allowed
I have come up with this regex:
^[a-z0-9]([a-z0-9]+\.?\-?[a-z0-9]+){1,18}[a-z0-9]$
[a-z0-9] //should start/end with a letter or a number
([a-z0-9]+\.?\-?[a-z0-9]+){1,18} //other rules
However it is failing in some scenarios like -
abcdefghijklmnopqrstuvwxyz //should fail total number of chars greater than 20
aaa.-aaabbb //should fail as dot '.' and dash '-' are consecutive
Can anyone please help me in correcting this regex?
You can achieve this with a lookahead assertion:
^(?!.*[.-]{2})[a-z0-9][a-z0-9.-]{1,18}[a-z0-9]$
Explanation:
^ # Start of string
(?! # Assert that the following can't be matched:
.* # Any number of characters
[.-]{2} # followed by .. or -- or .- or -.
) # End of lookahead
[a-z0-9] # Match lowercase letter/digit
[a-z0-9.-]{1,18} # Match 1-18 of the allowed characters
[a-z0-9] # Match lowercase letter/digit
$ # End of string
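The lookahead pattern can be wrapped in a small validator to try the failing cases from the question. This is a minimal JavaScript sketch; the function name isValidName is hypothetical.

```javascript
// The lookahead-based pattern from the answer above, unchanged.
const namePattern = /^(?!.*[.-]{2})[a-z0-9][a-z0-9.-]{1,18}[a-z0-9]$/;

// Hypothetical helper: true when `name` satisfies all of the stated rules.
function isValidName(name) {
  return namePattern.test(name);
}

console.log(isValidName("abc"));                        // true
console.log(isValidName("a.b-c"));                      // true
console.log(isValidName("abcdefghijklmnopqrstuvwxyz")); // false (26 chars)
console.log(isValidName("aaa.-aaabbb"));                // false (".-" is consecutive)
console.log(isValidName("abc_def"));                    // false (underscore)
```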
I came up with this, which uses a negative lookahead similar to Tim's solution but applies it differently. Because it only does the lookahead when it sees a dot or a dash, it may need less backtracking, which may make it very slightly faster.
^[a-z0-9]([a-z0-9]|([-.](?![.-]))){1,18}[a-z0-9]$
Explanation:
^ # Start of string
[a-z0-9] # Must start with a letter or number
( # Begin Group
[a-z0-9] # Match a letter or number
| # OR
([-.](?![.-])) # Match a dot or dash that is not followed by a dot or dash
){1,18} # Match group 1 to 18 times
[a-z0-9] # Must end with a letter or number
$ # End of string
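One way to gain some confidence that the two patterns accept the same strings is to run both over a handful of inputs and compare. The sketch below does that in JavaScript; the sample list is an assumption and is nowhere near exhaustive, and no timing is measured.

```javascript
// Both patterns from this thread, unchanged.
const upfrontLookahead = /^(?!.*[.-]{2})[a-z0-9][a-z0-9.-]{1,18}[a-z0-9]$/;
const perCharLookahead = /^[a-z0-9]([a-z0-9]|([-.](?![.-]))){1,18}[a-z0-9]$/;

const nameSamples = [
  "abc", "a-b", "a.b-c", "a--b", "aaa.-aaabbb",
  "-abc", "abc-", "ab", "abcdefghijklmnopqrstuvwxyz",
];

// Collect the samples on which the two patterns disagree.
const disagreements = nameSamples.filter(
  (s) => upfrontLookahead.test(s) !== perCharLookahead.test(s)
);
console.log(disagreements); // []
```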
I'm trying to create a regex pattern for my PowerShell code. I've never worked with regex before, so I'm a total noob.
The regex should check if there are two dots in the string.
Examples that SHOULD work:
3.1.1
5.10.12
10.1.15
Examples that SHOULD NOT work:
3
3.1
5.10.12.1
The string must have two dots in it; the number of digits doesn't matter.
I've tried something like this, but it doesn't really work and I think it's far from the right solution...
([\d]*.[\d]*.[\d])
In your current regex you need to escape the dot as \. or else it matches any character.
You could also add anchors for the start ^ and the end $ of the string and update your regex to ^\d*\.\d*\.\d*$
Note that this would also match ..4 and ..
Or, if you want to require one or more digits in each part, you could use ^\d+(?:\.\d+){2}$
That breaks down as:
^ # From the beginning of the string
\d+ # Match one or more digits
(?: # Non capturing group
\.\d+ # Match a dot and one or more digits
){2} # Close non capturing group and repeat 2 times
$ # The end of the string
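The question is about PowerShell, but the pattern itself is engine-neutral, so here is a minimal sketch in JavaScript that runs it over the examples from the question. The function name hasThreeParts is hypothetical.

```javascript
// One or more digits, then exactly two ".digits" groups, anchored at both ends.
const threeParts = /^\d+(?:\.\d+){2}$/;

// Hypothetical helper: true for strings like 3.1.1 or 10.1.15.
function hasThreeParts(version) {
  return threeParts.test(version);
}

for (const v of ["3.1.1", "5.10.12", "10.1.15", "3", "3.1", "5.10.12.1", "..4"]) {
  console.log(v, hasThreeParts(v));
}
// The first three print true, the remaining four print false.
```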
Use a lookahead:
^\d(?=(?:[^.]*\.[^.]*){2}$)[\d.]*$
Broken down, this says:
^ # start of the line
\d # a single leading digit
(?= # start of lookahead
(?:[^.]*\.[^.]*){2} # not a dot, a dot, not a dot - twice
$ # anchor it to the end of the string
)
[\d.]* # only digits and dots, 0+ times
$ # the end of the string
See a demo on regex101.com.
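A minimal JavaScript sketch of the lookahead version, run over the examples from the question (the function name hasTwoDots is hypothetical):

```javascript
// The lookahead counts exactly two dots; [\d.]* then restricts the characters.
const twoDots = /^\d(?=(?:[^.]*\.[^.]*){2}$)[\d.]*$/;

// Hypothetical helper: true when the string starts with a digit, contains
// exactly two dots and otherwise only digits.
function hasTwoDots(value) {
  return twoDots.test(value);
}

console.log(hasTwoDots("3.1.1"));     // true
console.log(hasTwoDots("10.1.15"));   // true
console.log(hasTwoDots("3.1"));       // false
console.log(hasTwoDots("5.10.12.1")); // false
console.log(hasTwoDots("3..1"));      // true (empty parts are not rejected)
```

Note the last case: this pattern only counts dots, so a string with an empty middle part still passes. If every part must contain digits, a pattern that requires \d+ around each dot is the stricter choice.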