regex to match entire words containing only certain characters - regex

I want to match entire words (or strings really) that contain only defined characters.
For example if the letters are d, o, g:
dog = match
god = match
ogd = match
dogs = no match (because the string also has an "s" which is not defined)
gods = no match
doog = match
gd = match
In this sentence:
dog god ogd, dogs o
...I would expect to match on dog, god, and o (not ogd, because of the comma or dogs due to the s)

This should work for you
\b[dog]+\b(?![,])
Explanation
r"""
\b # Assert position at a word boundary
[dog] # Match a single character present in the list “dog”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
[,] # Match the character “,”
)
"""

The following regex represents one or more occurrences of the three characters you're looking for:
[dog]+
Explanation:
The square brackets mean: "any of the enclosed characters".
The plus sign means: "one or more occurrences of the previous expression"
This would be the exact same thing:
[ogd]+

Which regex flavor/tool are you using? (e.g. JavaScript, .NET, Notepad++, etc.) If it's one that supports lookahead and lookbehind, you can do this:
(?<!\S)[dog]+(?!\S)
This way, you'll only get matches that are at the beginning of the string or preceded by whitespace, and at the end of the string or followed by whitespace. If you can't use lookbehind (for example, in a JavaScript engine that predates ES2018) you can spell out the leading condition:
(?:^|\s)([dog]+)(?!\S)
In this case you would retrieve the matched word from group #1. But don't take the next step and try to replace the lookahead with (?:$|\s). If you did that, the first hit ("dog") would consume the trailing space, and the regex wouldn't be able to use it to match the next word ("god").
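The difference between the trailing lookahead and a consuming (?:$|\s) can be seen by running the variants over the sample sentence. This is a minimal sketch in Python, which is an assumed flavor since the question does not name one; its re module supports lookbehind, so all three variants can be compared side by side.

```python
import re

sample = "dog god ogd, dogs o"

# Lookbehind + lookahead: neither assertion consumes the whitespace.
lookaround = re.findall(r"(?<!\S)[dog]+(?!\S)", sample)

# Leading condition spelled out, trailing lookahead kept; the word is group 1.
spelled_out = re.findall(r"(?:^|\s)([dog]+)(?!\S)", sample)

# Trailing condition also consuming: the space after "dog" is used up,
# so "god" can no longer satisfy (?:^|\s) and is skipped.
consuming = re.findall(r"(?:^|\s)([dog]+)(?:$|\s)", sample)

print(lookaround)   # ['dog', 'god', 'o']
print(spelled_out)  # ['dog', 'god', 'o']
print(consuming)    # ['dog', 'o']
```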

Depending on the language, this should do what you need it to do. It will only match what you said above;
this regex:
[dog]+(?![\w,])
in a string of ..
dog god ogd, dogs o
will only match..
dog, god, and o
Anything between square brackets ([]) is a character class: it matches any one of the characters between the brackets. You can also use ranges such as [0-9] or [a-z], but a class on its own still matches only one character. The + and * are quantifiers: + matches one or more repetitions, while * matches zero or more. You can specify an explicit repetition count with curly brackets ({}): {2} matches exactly 2 repetitions, while {1,3} matches between 1 and 3.
Anything between parentheses is a capture group, so you can return the captured value or use it in a replacement. The (?!...) construct is a negative lookahead: it succeeds only if the pattern inside it does not match at that position. Here it ensures that a run of d, o, and g is not matched when it is followed by another word character or a comma.
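The quantifier and lookahead behaviour described above can be checked with a few short calls. This is a rough sketch in Python (an assumed flavor); the strings "a ab abc abcd" and "sdog" are made-up inputs, and the last one shows that this pattern does not check what comes before the run.

```python
import re

words = "a ab abc abcd"

# {2} means exactly two repetitions of the class.
exactly_two = re.findall(r"\b[a-z]{2}\b", words)

# {1,3} means anywhere from one to three repetitions.
one_to_three = re.findall(r"\b[a-z]{1,3}\b", words)

# The pattern from this answer on the sample sentence.
sample = re.findall(r"[dog]+(?![\w,])", "dog god ogd, dogs o")

# Hypothetical extra input: nothing is asserted before the run,
# so the tail of a longer word still matches.
tail = re.findall(r"[dog]+(?![\w,])", "sdog")

print(exactly_two)   # ['ab']
print(one_to_three)  # ['a', 'ab', 'abc']
print(sample)        # ['dog', 'god', 'o']
print(tail)          # ['dog']
```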

Related

Regular Expression to match first word with a character in each line

I am trying to write a regex that finds the first word in each line that contains the character a.
For a string like:
The cat ate the dog
and the mouse
The expression should find cat and
So far, I have:
/\b\w*a\w*\b/g
However this will return every match in each line, not just the first match (cat ate and).
What is the easiest way to only return the first occurrence?
Assuming you are only looking for words without numbers and underscores (\w would include those), I'd advise using:
(?i)^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)
And use whatever is in the 1st capture group. See an online demo. Or, if supported:
(?i)^.*?\K(?<!\S)[b-z]*a[a-z]*(?!\S)
See an online demo.
Please note that I used lookarounds to assert that the word is surrounded only by whitespace (or by the start or end of the line). You may also use word boundaries if you please and swap those lookarounds for \b. Also, depending on your application, you can probably replace the inline case-insensitive switch with a flag. For example, if you happen to use JavaScript, /^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)/gmi should probably be your option. See for example:
var myString = "The cat ate the dog\nand the mouse";
// A regex literal keeps \S intact; inside a string passed to new RegExp
// it would have to be written as "\\S".
var myRegexp = /^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)/gmi;
var m = myRegexp.exec(myString);
while (m !== null) {
  console.log(m[1]); // first word containing "a" on the current line
  m = myRegexp.exec(myString);
}
If you want to match a word using \w you might also use a negated character class matching any character except a or a newline.
Then match a word that consists of at least an a char with word boundaries \b
^[^a\n\r]*\b([^\Wa]*a\w*)
The pattern matches:
^ Start of string (or of each line when the multiline flag is set)
[^a\n\r]*\b Optionally match any characters except a or a newline, then assert a word boundary
( Capture group 1
[^\Wa]*a\w* Optionally match a word character without a, then match a and optional word characters
) Close group 1
Regex demo
Using whitespace boundaries on the left and right:
^[^a\n\r]*(?<!\S)([^\Wa]*a\w*)(?!\S)
Regex demo
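As a quick check of the negated-character-class approach, here is a minimal Python sketch (Python is an assumption; the question itself uses a JavaScript-style literal). It relies on re.MULTILINE so that ^ anchors at every line start, and it reads the word from capture group 1.

```python
import re

text = "The cat ate the dog\nand the mouse"

# ^ anchors at each line start (re.MULTILINE); [^a\n\r]* skips everything
# before the first "a" on the line, and \b backs up to the start of its word.
first_a_word = re.compile(r"^[^a\n\r]*\b([^\Wa]*a\w*)", re.MULTILINE)

found = first_a_word.findall(text)
print(found)  # ['cat', 'and']
```

Because the prefix class cannot cross a newline, a line without any "a" simply produces no match.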
The text could be matched with the regular expression
(?=(\b[a-z]*a[a-z]*\b)).*\r?\n
with the multiline and case-insensitive flags set. For each match, capture group 1 contains the first word (composed only of letters) in a line that contains an "a". There are no matches in lines that do not contain an "a".
Demo
The expression can be broken down as follows.
(?= # begin a positive lookahead
( # begin capture group 1
\b[a-z]*a[a-z]*\b # match a whole word containing an "a"
) # end capture group 1
) # end the positive lookahead
.*\r?\n # match the remainder of the line including the
# line terminator

PCRE Regex: Is it possible to check within only the first X characters of a string for a match

PCRE Regex: Is it possible for Regex to check for a pattern match within only the first X characters of a string, ignoring other parts of the string beyond that point?
My Regex:
I have a Regex:
/\S+V\s*/
This checks the string for non-whitespace characters which have a trailing 'V' and then a whitespace character or the end of the string.
This works. For example:
Example A:
SEBSTI FMDE OPORV AWEN STEM students into STEM
// Match found in 'OPORV' (correct)
Example B:
ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event
//Match not found (correct).
Re the capitalised text: each letter and its placement has a meaning in the source data. This is followed by generic info for humans to read ("Academy Networking Event", etc.)
My Issue:
It can theoretically occur that sometimes there are names that involve roman numerals such as:
Example C:
ARKFE SSETE BLME CARFR Academy IV Networking Event
//Match found (incorrect).
I would like my Regex above to only check the first X characters of the string.
Can this be done in PCRE Regex itself? I can't find any reference to length counting in Regex and I suspect this can't easily be achieved. String lengths are completely arbitrary. (We have no control over the source data).
Intention:
/\S+V\s*/{check within first 25 characters only}
ARKFE SSETE BLME CARFR Academy IV Networking Event
^
\- Cut off point. Not found so far so stop.
//Match not found (correct).
Workaround:
The Regex is in PHP and my current solution is to cut the string in PHP, to only check the first X characters, typically the first 20 characters, but I was curious if there was a way of doing this within the Regex without needing to manipulate the string directly in PHP?
$valueSubstring = substr($coreRow['value'],0,20); /* first 20 characters only */
$virtualCount = preg_match_all('/\S+V\s*/',$valueSubstring);
The trick is to capture, in a lookahead, the rest of the line after the first 25 characters, and then to check that this captured text still follows any match of your subpattern:
$pattern = '~^(?=.{0,25}(.*)).*?\K\S+V\b(?=.*\1)~m';
demo
details:
^ # start of the line
(?= # open a lookahead assertion
.{0,25} # up to the first 25 characters
(.*) # capture the end of the line
) # close the lookahead
.*? # consume lazily the characters
\K # the match result starts here
\S+V # your pattern
\b # a word boundary (that matches between a letter and a white-space
# or the end of the string)
(?=.*\1) # check that the end of the line follows with a reference to
# the capture group 1 content.
Note that you can also write the pattern in a more readable way like this:
$pattern = '~^
(*positive_lookahead: .{0,25} (?<line_end> .* ) )
.*? \K \S+ V \b
(*positive_lookahead: .*? \g{line_end} ) ~xm';
(The alternative syntax (*positive_lookahead: ...) is available since PHP 7.3)
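The same idea can be tried outside PHP. The following is a sketch only, ported to Python's re module as an assumption: re has no \K, so the word is read from a second capture group instead, and the names WINDOW and find_in_window are hypothetical helpers introduced for illustration.

```python
import re

WINDOW = 25  # cut-off taken from the question's example

# Group 1 (inside the lookahead) records everything after the first WINDOW
# characters; group 2 is the word ending in V. The final lookahead requires
# the recorded tail to still follow the word, which keeps the word inside
# the window.
pattern = re.compile(
    r"^(?=.{0,%d}(.*)).*?(\S+V)\b(?=.*\1)" % WINDOW, re.MULTILINE
)

def find_in_window(line):
    """Return the first word ending in V within the window, or None."""
    m = pattern.search(line)
    return m.group(2) if m else None

example_a = "SEBSTI FMDE OPORV AWEN STEM students into STEM"
example_c = "ARKFE SSETE BLME CARFR Academy IV Networking Event"
print(find_in_window(example_a))  # OPORV
print(find_in_window(example_c))  # None
```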
You can find your pattern after X chars and skip the whole string, else, match your pattern. So, if X=25:
^.{25,}\S+V.*(*SKIP)(*F)|\S+V\s*
See the regex demo. Details:
^.{25,}\S+V.*(*SKIP)(*F) - start of string, 25 or more chars other than line break chars, as many as possible, then one or more non-whitespaces and V, and then the rest of the string, the match is failed and skipped
| - or
\S+V\s* - match one or more non-whitespaces, V and zero or more whitespace chars.
Any V followed by whitespace within the first 25 positions
^.{1,24}V\s
See regex
Any word ending in V in the first 25 positions
^.{1,23}[A-Z]V\s
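For comparison, some regex APIs can restrict the search window without slicing the string first, which mirrors the substr workaround from the question rather than solving it inside the pattern. A minimal sketch, assuming Python instead of PHP: compiled patterns there accept pos and endpos arguments, and count_in_prefix is a hypothetical helper name.

```python
import re

pattern = re.compile(r"\S+V\s*")

def count_in_prefix(value, limit=25):
    """Count matches that lie entirely within the first `limit` characters."""
    # finditer(string, pos, endpos) treats the string as if it ended at endpos.
    return sum(1 for _ in pattern.finditer(value, 0, limit))

print(count_in_prefix("SEBSTI FMDE OPORV AWEN STEM students into STEM"))      # 1
print(count_in_prefix("ARKFE SSETE BLME CARFR Academy IV Networking Event"))  # 0
```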

How to use a regex to match if any pattern appears once out of many times in a given sequence

Hard to word this correctly, but TL;DR.
I want to match, in a given text sentence (let's say "THE TREE IS GREEN") if any space is doubled (or more).
Example:
"In this text,
THE TREE IS GREEN should not match,
THE  TREE IS GREEN should
and so should THE TREE  IS   GREEN
but double-spaced  TEXT  SHOULD  NOT BE FLAGGED outside the pattern."
My initial approach would be
/THE( {2,})TREE( {2,})IS( {2,})GREEN/
but this only matches if all spaces are double in the sequence, therefore I'd like to make any of the groups trigger a full match. Am I going the wrong way, or is there a way to make this work?
You can use a negative lookahead if your regex flavor supports it.
First exclude the sentence that you want to fail (in your case, "THE TREE IS GREEN" with single spaces) in the lookahead, then give the most generic pattern that catches your desired result.
(?!THE TREE IS GREEN)(THE[ ]+TREE[ ]+IS[ ]+GREEN)
https://regex101.com/r/EYDU6g/2
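A short way to check this lookahead pattern is to run it line by line. The sketch below uses Python (an assumption), and the exact spacing in the sample lines is chosen for illustration, since only the presence of a doubled space matters.

```python
import re

# The lookahead rejects the exact single-spaced sentence; the group then
# accepts the same words separated by one or more spaces.
pattern = re.compile(r"(?!THE TREE IS GREEN)(THE[ ]+TREE[ ]+IS[ ]+GREEN)")

lines = [
    "THE TREE IS GREEN should not match,",
    "THE  TREE IS GREEN should",
    "and so should THE TREE  IS   GREEN",
    "but double-spaced  TEXT  SHOULD  NOT BE FLAGGED outside the pattern.",
]

flagged = [bool(pattern.search(line)) for line in lines]
print(flagged)  # [False, True, True, False]
```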
You can just search for the spaces that you're looking for:
/ {2,}/ will work to match two or more of the space character. (https://regexr.com/4h4d4)
You can capture the results by surrounding it with parentheses: /( {2,})/
You may want to broaden it a bit.
/\s{2,}/ will match any doubling of whitespace.
(\s - means any whitespace - space, tab, newline, etc.)
No need to match the whole string, just the piece that's of interest.
If I am not mistaken you want the whole match if there is a part present where there are 2 or more spaces between 2 uppercased parts.
If that is the case, you might use:
^.*[A-Z]+ {2,}[A-Z]+.*$
^ Start of string
.*[A-Z]+ match any char except a newline 0+ time, then match 1+ times [A-Z]
[ ]{2,} Match 2 or more times a space (used square brackets for clarity)
[A-Z]+ Match 1+ times an uppercase char
.*$ Match any char except a newline 0+ times until the end of the string
Regex demo
You could do this:
import re

# One or more spaces between the words, so every spacing variant is found
pattern = r"THE +TREE +IS +GREEN"
test_str = ("In this text,\n"
            "THE TREE IS GREEN should not match,\n"
            "THE  TREE IS GREEN should\n"
            "and so should THE TREE  IS   GREEN\n"
            "but double-spaced  TEXT  SHOULD  NOT BE FLAGGED outside the pattern.")

for match in re.finditer(pattern, test_str):
    # Report only the occurrences that differ from the single-spaced sentence
    if match.group() != 'THE TREE IS GREEN':
        print(match.group())

extract substring with regular expression

I have a string that is actually a file path.
str='\\198.168.0.10\share\ccdfiles\UA-midd3-files\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3'
I need to extract the target substring 'UA0001A' with MATLAB (although I would think all tools have a similar syntax).
It does not have to be exactly 'UA0001A'; it can be an arbitrary combination of letters and digits.
To make it more general, the substring (or word) should satisfy the following:
it is a word made of letters and digits
it cannot be a purely alphabetic or purely numeric word
it cannot include 'midd' or 'midd3' or 'Midd3' or 'MIDD3', etc., so a case-insensitive method may be used to exclude words beginning with 'midd'
it cannot include 'y[0-9]{2,4}m[0-9]{1,2}d[0-9]{1,2}\w*'
How do I write the regular expression to find the target substring?
Thanks in advance!
You can use
s = '\\198.168.0.10\share\ccdfiles\UA-midd3-files\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3';
res = regexp(s, '(?i)\\(?![^\W_]*(midd|y\d+m\d+))(?=[^\W_]*\d)(?=[^\W_]*[a-zA-Z])([^\W_]+)','tokens');
disp(res{1}{1})
See the regex demo
Pattern explanation:
(?i) - the case-insensitive modifier
\\ - a literal backslash
(?![^\W_]*(midd|y\d+m\d+)) - a negative lookahead that will fail a match if there are midd or y+digits+m+digits after 0+ letters or digits
(?=[^\W_]*\d) - a positive lookahead that requires at least 1 digit after 0+ digits or letters ([^\W_]*)
(?=[^\W_]*[a-zA-Z]) - there must be at least 1 letter after 0+ letters or digits
([^\W_]+) - Group 1 (the value that will be extracted) matching 1+ letters or digits (or 1+ characters other than non-word chars and _).
The 'tokens' "mode" will let you extract the captured value rather than the whole match.
See the IDEONE demo
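The lookahead chain is not specific to MATLAB. As a rough cross-check, here is the same pattern in Python (an assumed port, not part of the original answer); the inner alternation is made non-capturing so that findall returns only the extracted word.

```python
import re

path = r"\\198.168.0.10\share\ccdfiles\UA-midd3-files\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3"

pattern = re.compile(
    r"(?i)\\"                        # case-insensitive; a literal backslash
    r"(?![^\W_]*(?:midd|y\d+m\d+))"  # word must not contain midd or y<digits>m<digits>
    r"(?=[^\W_]*\d)"                 # word must contain at least one digit
    r"(?=[^\W_]*[a-zA-Z])"           # word must contain at least one letter
    r"([^\W_]+)"                     # group 1: the run of letters and digits
)

tokens = pattern.findall(path)
print(tokens)  # ['UA0001A']
```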
This should get you started:
[\\](?i)(?![^\\]*midd)([0-9]+[a-z]+[a-z0-9]*|[a-z]+[0-9]+[a-z0-9]*)
[\\] : match a backslash
(?i) : the rest of the regex is case-insensitive
(?!...) : negative lookahead; what follows must not match the enclosed pattern
(?![^\\]*midd) : the current path segment (everything up to the next backslash) must not contain midd
([0-9]+[a-z]+[a-z0-9]*|[a-z]+[0-9]+[a-z0-9]*) : match at least one digit followed by at least one letter, OR at least one letter followed by at least one digit, in both cases followed by any number of letters and digits (remember, the segment cannot satisfy the (?!...) group, so no word that contains midd)

regex: find one-digit number

I need to find all the one-digit numbers in a text.
My code:
$string = 'text 4 78 text 558 my.name#gmail.com 5 text 78998 text';
$pattern = '/ [\d]{1} /';
(result: 4 and 5)
Everything works perfectly; I just wanted to ask whether it is correct to use spaces.
Maybe there is some other way to distinguish one-digit numbers.
Thanks
First of all, [\d]{1} is equivalent to \d.
As for your question, it would be better to use a zero width assertion like a lookbehind/lookahead or word boundary (\b). Otherwise you will not match consecutive single digits because the leading space of the second digit will be matched as the trailing space of the first digit (and overlapping matches won't be found).
Here is how I would write this:
(?<!\S)\d(?!\S)
This means "match a digit only if there is not a non-whitespace character before it, and there is not a non-whitespace character after it".
I used the double negative like (?!\S) instead of (?=\s) so that you will also match single digits that are at the beginning or end of the string.
I prefer this over \b\d\b for your example because it looks like you really only want to match when the digit is surrounded by spaces, and \b\d\b would match the 4 and the 5 in a string like 192.168.4.5
To allow punctuation at the end, you could use the following:
(?<!\S)\d(?![^\s.,?!])
Add any additional punctuation characters that you want to allow after the digit to the character class (inside of the square brackets, but make sure it is after the ^).
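The points above can be compared directly. The sketch below uses Python rather than the question's PHP (an assumption made for brevity); the strings " 1 2 3 ", "192.168.4.5" and "Is it 7? No, 8." are illustrative inputs, the first of which shows the overlap problem with literal spaces.

```python
import re

text = "text 4 78 text 558 my.name#gmail.com 5 text 78998 text"

spaces = re.findall(r" \d ", text)
lookaround = re.findall(r"(?<!\S)\d(?!\S)", text)

# Consecutive single digits: the space after "1" is consumed, so "2" is lost.
consecutive_spaces = re.findall(r" \d ", " 1 2 3 ")
consecutive_lookaround = re.findall(r"(?<!\S)\d(?!\S)", "1 2 3")

# Word boundaries also fire next to dots; whitespace lookarounds do not.
ip_boundary = re.findall(r"\b\d\b", "192.168.4.5")
ip_lookaround = re.findall(r"(?<!\S)\d(?!\S)", "192.168.4.5")

# Variant that tolerates trailing punctuation.
punct = re.findall(r"(?<!\S)\d(?![^\s.,?!])", "Is it 7? No, 8.")

print(spaces, lookaround)                          # [' 4 ', ' 5 '] ['4', '5']
print(consecutive_spaces, consecutive_lookaround)  # [' 1 ', ' 3 '] ['1', '2', '3']
print(ip_boundary, ip_lookaround)                  # ['4', '5'] []
print(punct)                                       # ['7', '8']
```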
Use word boundaries. Note that the quantifier {1} is redundant (a single \d only matches one digit anyway), and so is the character class [], because it contains only one element.
\b\d\b
Search around word boundaries:
\b\d\b
As explained by the others, this will extract single digits, meaning that some special characters might not be respected, like "." in an IP address. To address that, see F.J and Mike Brant's answer(s).
It really depends on where the numbers can appear and whether you care if they are adjacent to other characters (like . at the end of a sentence). At the very least, I would use word boundaries so that you can get numbers at the beginning and end of the input string:
$pattern = '/\b\d\b/';
But you might consider punctuation at the end like:
$pattern = '/\b\d(\b|\.|\?|\!)/';
If one-digit numbers can be preceded or followed by characters other than digits (e.g., "a1 cat" or "Call agent 7, pronto!") use
(?<!\d)\d(?!\d)
Demo
The regular expression reads: match a digit (\d) that is neither preceded nor followed by a digit, (?<!\d) being a negative lookbehind and (?!\d) being a negative lookahead.
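A small check of the digit lookarounds, sketched in Python (an assumption; the question uses PHP) with the two example phrases from this answer and the string from the question. The \b\d\b call is included only to show where the two approaches differ.

```python
import re

single_digit = re.compile(r"(?<!\d)\d(?!\d)")

attached = single_digit.findall("a1 cat")
before_comma = single_digit.findall("Call agent 7, pronto!")
from_question = single_digit.findall(
    "text 4 78 text 558 my.name#gmail.com 5 text 78998 text"
)

# For comparison: there is no word boundary between "a" and "1".
boundary_attached = re.findall(r"\b\d\b", "a1 cat")

print(attached)           # ['1']
print(before_comma)       # ['7']
print(from_question)      # ['4', '5']
print(boundary_attached)  # []
```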