I need a regex that checks whether a string contains both "if" and "&".
For example:
if a & b // the regex should match here
for a & b // the regex should not match here
I have tried the regex (if|&), but it matches when the string contains "if" OR an ampersand, not both.
Note: This answer assumes that a and b are words, that is, only containing the characters [a-zA-Z0-9_]. If this is not what you want, replace \w+ in the examples below appropriately.
if\s+(\w+)\s*&\s*(\w+)
matches the following:
if a & b
if foo&bar
Broken down:
Match the word if
Look for at least one whitespace character
Match (and capture) a word
Match 0 or more whitespace characters
Match the & character
Match 0 or more whitespace characters
Match (and capture) a word
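As a minimal sketch, here is how the pattern above could be used from Python. The language choice and the helper name find_operands are assumptions for illustration only; the question does not say which regex flavor is in use.

```python
import re

# Pattern from the answer: "if", whitespace, a word, optional spaces, "&", a word.
IF_AND = re.compile(r"if\s+(\w+)\s*&\s*(\w+)")

def find_operands(line):
    """Hypothetical helper: return the two captured words, or None if no match."""
    m = IF_AND.search(line)
    return m.groups() if m else None

print(find_operands("if a & b"))    # ('a', 'b')
print(find_operands("if foo&bar"))  # ('foo', 'bar')
print(find_operands("for a & b"))   # None
```

Because the pattern is not anchored, it also matches an "if" that ends a longer word (for example elif a & b); prefixing it with \b is one way to avoid that, if it matters for your input.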
I need to match a string with an identifier.
Pattern
A word is considered an identifier if:
The word doesn't contain any character other than alphanumeric characters.
The word doesn't start with a digit.
Input
The given input string will not contain any leading or trailing spaces or other white-space characters.
Code
I tried using the following regular expressions
\D[a-zA-Z]\w*\D
[ \t\n][a-zA-Z]\w*[ \t\n]
^\D[a-zA-Z]\w*$
None of them works.
How can I achieve this?
Note: I want to find every identifier in a string that may contain several of them (or just one). For example, in This is an i0dentifier 1abs, the expected results are This, is, an and i0dentifier.
Note that in your ^\D[a-zA-Z]\w*$ regex, \D can match non-alphanumeric chars, since \D matches any non-digit char, and \w also matches underscores, which are not alphanumeric chars.
I suggest
\b[A-Za-z][A-Za-z0-9]*\b
It matches
\b - word boundary
[A-Za-z] - a letter (the identifier should start with a letter)
[A-Za-z0-9]* - zero or more ASCII letters/digits
\b - word boundary.
See the regex demo.
In Python:
identifiers = re.findall(r'\b[A-Za-z][A-Za-z0-9]*\b', text)
\D matches any non-digit character: not only letters but also punctuation characters, whitespace characters and so on, and you definitely do not want those at the beginning.
You can use ^[A-Za-z][A-Za-z0-9]*$ which can be described as
^: Start of string
[A-Za-z]: A letter
[A-Za-z0-9]*: An alphanumeric character, zero or more times
$: End of string
Demo
An even simpler pattern for an identifier, one that does not use a negative lookahead as Wiktor's answer does:
^[^0-9][A-Za-z0-9]*$ decomposed and explained:
^[^0-9]: The word does not start with a digit: ^ anchors the start and [^0-9] matches any single character that is not a digit. Note that this also admits a non-alphanumeric first character such as _ or -; only the second and following characters are restricted (and they may be digits).
[A-Za-z0-9]*: The rest of the word doesn't contain any character other than alphanumeric characters (not even a hyphen or underscore) until the end $.
See demo on regex101.
Positive alternative
As already suggested by Arvind Kumar Avinash:
If (according to both rules) the first char must be a letter, rather than merely a non-digit, then we can change the first part of the regex above from "not numeric" to "only alpha".
[A-Za-z][A-Za-z0-9]* explained:
[A-Za-z]: first char must be an alpha
[A-Za-z0-9]*: optional second and following chars can be any alpha-numeric
Same effect, see demo on regex101.
Tests
input      result               reason
aB123      matches identifier
Ab123      matches identifier
XXXX12YZ   matches identifier
a2b3       matches identifier
a          matches identifier
Z          matches identifier
0          no match             starts with a digit
1Ab        no match             starts with a digit
12abc      no match             starts with a digit
abc_123    no match             contains underscore, not alphanum
r2-d2      no match             contains hyphen, not alphanum
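The test cases above can be checked mechanically. The following is only a sketch in Python: the helper name is_identifier is made up for illustration, and re.fullmatch is used in place of the ^ and $ anchors (an assumption about how one would apply the "only-alpha" pattern to a single word).

```python
import re

# "Positive alternative" pattern from the answer; fullmatch stands in for ^...$.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_identifier(word):
    """Hypothetical helper: True if the whole word is a valid identifier."""
    return IDENTIFIER.fullmatch(word) is not None

# Expected results, copied from the test table above.
cases = {
    "aB123": True, "Ab123": True, "XXXX12YZ": True, "a2b3": True,
    "a": True, "Z": True,
    "0": False, "1Ab": False, "12abc": False,   # start with a digit
    "abc_123": False,                           # contains an underscore
    "r2-d2": False,                             # contains a hyphen
}

for word, expected in cases.items():
    assert is_identifier(word) == expected, word
print("all cases behave as listed")
```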
I want to regex match the last word in a string where the string ends in ... The match should be the word preceding the ...
Example: "Do not match this. This sentence ends in the last word..."
The match would be word. This gets close: \b\s+([^.]*). However, I don't know how to make it work with only matching ... at the end.
This should NOT match: "Do not match this. This sentence ends in the last word."
If you use \s+, at least one whitespace char must precede the word, so the pattern will not match when the string consists of word... only.
If you want to use the negated character class, you could also use
([^\s.]+)\.{3}$
( Capture group 1
[^\s.]+ Match 1+ times any char except a whitespace char or dot
) Close group
\.{3} Match 3 dots
$ End of string
Regex demo
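A short Python sketch of the pattern above, run against the two sentences from the question. The function name last_word_before_ellipsis is hypothetical and only there for illustration.

```python
import re

# Capture the run of non-space, non-dot chars directly before a final "...".
LAST_WORD = re.compile(r"([^\s.]+)\.{3}$")

def last_word_before_ellipsis(text):
    """Hypothetical helper: the word before a trailing '...', else None."""
    m = LAST_WORD.search(text)
    return m.group(1) if m else None

ends_with_ellipsis = "Do not match this. This sentence ends in the last word..."
ends_with_period = "Do not match this. This sentence ends in the last word."

print(last_word_before_ellipsis(ends_with_ellipsis))  # word
print(last_word_before_ellipsis(ends_with_period))    # None
print(last_word_before_ellipsis("word..."))           # word
```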
You can anchor your regex to the end with $. To match a literal period you will need to escape it as it otherwise is a meta-character:
(\S+)\.\.\.$
\S matches everything but space-like characters. What exactly it matches depends on your regex flavor, but it usually excludes spaces, tabs, newlines and a set of Unicode spaces.
You can play around with it here:
https://regex101.com/r/xKOYa4/1
How do you match any one character with a regular expression?
A number of other questions on Stack Overflow sound like they promise a quick answer, but they are actually asking something more specific:
Regex for a string of repeating characters and another optional one at the end
regex to match a single character that is anything but a space
Replace character in regex match only
Match any single character
Use the dot . character as a wildcard to match any single character.
Example regex: a.c
abc // match
a c // match
azc // match
ac // no match
abbc // no match
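For anyone who wants to try this, here is a small Python sketch. It assumes the examples are meant as whole-string matches, hence re.fullmatch. It also shows one detail that is specific to flavors like Python's: by default the dot does not match a newline unless the DOTALL flag is set.

```python
import re

samples = ["abc", "a c", "azc", "ac", "abbc"]

# Keep only the samples that a.c matches as a whole.
matched = [s for s in samples if re.fullmatch(r"a.c", s)]
print(matched)  # ['abc', 'a c', 'azc']

# In Python, "." skips newlines unless re.DOTALL is given.
newline_default = bool(re.fullmatch(r"a.c", "a\nc"))
newline_dotall = bool(re.fullmatch(r"a.c", "a\nc", re.DOTALL))
print(newline_default, newline_dotall)  # False True
```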
Match any specific character in a set
Use square brackets [] to match any one character in a set.
Use \w to match any single alphanumeric character: 0-9, a-z, A-Z, and _ (underscore).
Use \d to match any single digit.
Use \s to match any single whitespace character.
Example 1 regex: a[bcd]c
abc // match
acc // match
adc // match
ac // no match
abbc // no match
Example 2 regex: a[0-7]c
a0c // match
a3c // match
a7c // match
a8c // no match
ac // no match
a55c // no match
Match any character except ...
Use the hat in square brackets [^] to match any single character except for any of the characters that come after the hat ^.
Example regex: a[^abc]c
aac // no match
abc // no match
acc // no match
a c // match
azc // match
ac // no match
azzc // no match
(Don't confuse the ^ here in [^] with its other usage as the start of line character: ^ = line start, $ = line end.)
Match any character optionally
Use the optional character ? after any character to specify zero or one occurrence of that character. Thus, you would use .? to match any single character optionally.
Example regex: a.?c
abc // match
a c // match
azc // match
ac // match
abbc // no match
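The match / no match lists for the set, negated-set and optional examples can be verified with a few lines of Python. This is just a sketch: it assumes whole-string matching (re.fullmatch), and the names tables and check are made up for illustration.

```python
import re

# pattern -> (strings listed as "match", strings listed as "no match")
tables = {
    r"a[bcd]c":  (["abc", "acc", "adc"], ["ac", "abbc"]),
    r"a[0-7]c":  (["a0c", "a3c", "a7c"], ["a8c", "ac", "a55c"]),
    r"a[^abc]c": (["a c", "azc"], ["aac", "abc", "acc", "ac", "azzc"]),
    r"a.?c":     (["abc", "a c", "azc", "ac"], ["abbc"]),
}

def check(pattern, should_match, should_fail):
    """True if every listed string behaves as the table says."""
    hits = all(re.fullmatch(pattern, s) for s in should_match)
    misses = not any(re.fullmatch(pattern, s) for s in should_fail)
    return hits and misses

results = {p: check(p, yes, no) for p, (yes, no) in tables.items()}
for pattern, ok in results.items():
    print(pattern, "agrees with the table" if ok else "differs from the table")
```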
See also
A quick tutorial to teach you the basics of regex
A practice sandbox to try things out
Simple answer
If you want to match a single character, put it inside brackets [ ]
Examples
match + ...... [+] or \+
match a ...... a
match & ...... &
...and so on. You can check your regular expression online on this site: https://regex101.com/
(updated based on comment)
If you are searching for a single isolated character, or a set of isolated characters, within any string, you can use
\b[a-zA-Z]\s
This will find every English letter that stands alone in the string and is followed by whitespace.
Similarly, use
\b[0-9]\s
to find single digits: it will pick 9 but not 98, and so on.
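A quick Python sketch of how these patterns behave. Two things are worth knowing: the trailing \s is part of each match, and an isolated character at the very end of the string is not found because no whitespace follows it. The variant with a second \b is my own assumption, not part of the answer above, and it treats any non-word character (such as a hyphen or apostrophe) as a separator.

```python
import re

text = "x marks a spot y"

# Pattern from the answer: the whitespace is included in each match,
# and the final "y" is missed because nothing follows it.
with_space = re.findall(r"\b[a-zA-Z]\s", text)
print(with_space)  # ['x ', 'a ']

# Assumed variant: a second word boundary instead of \s.
with_boundary = re.findall(r"\b[a-zA-Z]\b", text)
print(with_boundary)  # ['x', 'a', 'y']

# Same idea for digits: picks 9 and 7, but not 98.
single_digits = re.findall(r"\b[0-9]\b", "9 98 7")
print(single_digits)  # ['9', '7']
```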
I'm attempting to match the last character in a WORD.
A WORD is a sequence of non-whitespace characters
'[^\n\r\t\f ]', or an empty line matching ^$.
The expression I made to do this is:
"[^ \n\t\r\f]\(?:[ \$\n\t\r\f]\)"
The regex is meant to match a non-whitespace character that is followed by a whitespace character or the end of the line.
But I don't know how to stop it from including the following whitespace character in the result, or why it doesn't seem to capture a character preceding the end of the line.
Using the string "Hi World!", I would expect: the "i" and "!" to be captured.
Instead I get: "i ".
What steps can I take to solve this problem?
"Word" that is a sequence of non-whitespace characters scenario
Note that the non-capturing group (?:...) in [^ \n\t\r\f](?:[ \$\n\t\r\f]) still matches (consumes) the whitespace char, so it becomes part of the match. The pattern also does not match at the end of the string, because $ is not a string end anchor inside a character class; there it is parsed as a literal $ symbol.
You may use
\S(?!\S)
See the regex demo
The \S matches a non-whitespace char that is not followed by another non-whitespace char (due to the (?!\S) negative lookahead).
General "word" case
If a word consists of just letters, digits and underscores, that is, if it is matched with \w+, you may simply use
\w\b
Here, \w matches a "word" char, and the word boundary asserts there is no word char right after.
See another regex demo.
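Both patterns can be tried on the string from the question. This is a sketch in Python (an assumption; the question appears to use a different regex flavor, but \S, \w, \b and negative lookaheads behave the same way here).

```python
import re

text = "Hi World!"

# Last char of each whitespace-delimited WORD: nothing is consumed after it.
last_of_WORD = re.findall(r"\S(?!\S)", text)
print(last_of_WORD)  # ['i', '!']

# Last char of each \w+ word: "!" is not a word char, so "d" is reported.
last_of_word = re.findall(r"\w\b", text)
print(last_of_word)  # ['i', 'd']
```

The difference in the second result is the reason to pick the pattern according to which definition of "word" you need.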
In a Word document, suppose I want to highlight the last a in para. I search for [space]para[space] to make sure I only find the word I want, and when it is found I highlight it.
Next, I search within the selection for [a ] (an a followed by a space); this finds only the last a, which I can then highlight or color differently.
I am using the regex
(.*)\d.txt
on the expression
MyFile23.txt
Now the online tester says that the above regex matches (selects) this string. My understanding is that it should not match, because there are two digits, 2 and 3, while the regex has only one \d in it; to allow two digits it should have been \d+. I read my current expression as: zero or more of any character, followed by one digit, followed by .txt. Why does the above string pass the regex?
This regex (.*)\d.txt will still match MyFile23.txt because of .* which will match 0 or more of any character (including a digit).
So for the given input: MyFile23.txt here is the breakup:
.* # matches MyFile2
\d # matches 3
. # matches a dot (though it can match anything here due to unescaped dot)
txt # matches literal txt
To make sure it only matches MyFile2.txt you can use:
^\D*\d\.txt$
Where ^ and $ are anchors to match start and end. \D* will match 0 or more non-digit.
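To see the difference concretely, here is a small Python sketch (the language is an assumption; the question only mentions an online tester). The file name MyFile2xtxt is a made-up input that shows the effect of the unescaped dot.

```python
import re

name = "MyFile23.txt"

# Original pattern: the greedy group absorbs the first digit.
loose = re.search(r"(.*)\d.txt", name)
captured = loose.group(1)
print(captured)  # MyFile2

# The unescaped dot accepts any character in place of the period.
dot_is_wildcard = bool(re.search(r"(.*)\d.txt", "MyFile2xtxt"))
print(dot_is_wildcard)  # True

# Anchored pattern from the answer: exactly one digit, literal period.
strict = re.compile(r"^\D*\d\.txt$")
one_digit = bool(strict.search("MyFile2.txt"))
two_digits = bool(strict.search("MyFile23.txt"))
print(one_digit, two_digits)  # True False
```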
The pattern you have has one group, (.*), which for your example matches MyFile2, because the . allows any character.
Furthermore, the . in the pattern after the \d is not escaped, so it allows one more character of any kind.
To avoid this use:
(\D*)\d+\.txt
the group (\D*) would now match all non digit characters.
Here is the explanation, your "MyFile23.txt" matches the regex pattern:
A literal period . should always be escaped as \. else it will match "any character".
And finally, (.*) matches all the string from the beginning to the last digit (MyFile2). Have a look at the "MATCH INFORMATION" area on the right at this page.
So, I'd suggest the following fix:
^\D*\d\.txt$ = beginning of the line/string, any number of non-digit characters, one digit, a literal period, a literal txt, and the end of the string/line (which of the two depends on the m switch; whether you need it depends on the input, that is, whether you have a list of names on separate lines or just a single file name).
Here is a working example.