Regular expression - finding full words that contain a specific string - regex

I am trying to match the 'words' that contain a specific string inside a provided string.
This regex works great:
preg_match('/\b(\w*form\w*)\b/', $string, $matches);
So for example if my string contained: "Which person has reformed or performed" it returns reformed and performed.
However, I need to match codes inside codes so my definition of 'word' is based on splitting the string purely by a space.
For example, I have a string like:
Test MFC-123/Ben MFC/7474
And I need to match 'MFC' which should return 'MFC-123/Ben' and 'MFC/7474'.
How can I modify the above regex to match all characters and use a space as the boundary?
Thanks

Simply using this will do it for you:
(MFC\S+)
It matches MFC followed by one or more non-whitespace characters.
If the MFC can appear in the middle of the text, or on its own, place \S* before and after the MFC. For example:
(\S*MFC\S*)
This matches:
MFC-12312
1231-MFC
MFC
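As a minimal sketch, the pattern above can be tried in Python. The question uses PHP's preg_match; Python's re module is used here only for illustration, and the variable names are made up for the example:

```python
import re

# Sample string from the question.
text = "Test MFC-123/Ben MFC/7474"

# \S*MFC\S* : any run of non-whitespace characters that contains "MFC".
codes = re.findall(r"\S*MFC\S*", text)
print(codes)  # ['MFC-123/Ben', 'MFC/7474']
```

In PHP, preg_match stops at the first match; preg_match_all collects all of them.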

If you want to match the whole space-delimited block of text that contains your MFC, you can use the following regex:
\b(\S*MFC\S+)\b
explanation:
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
1st Capturing group (\S*MFC\S+)
\S* match any non-white space character [^\r\n\t\f ]
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed.
MFC matches the characters MFC literally (case sensitive)
\S+ match any non-white space character [^\r\n\t\f ]
Quantifier: Between one and unlimited times, as many times as possible, giving back as needed.
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
Example (the matched blocks are MFC-123/Ben, lmasdlmasd;mwrsMFCkmasd and MFC/7474):
Test MFC-123/Ben jbas2/jda lmasdlmasd;mwrsMFCkmasd j2\13 MFC/7474
Hope this helps.
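A short Python sketch of this variant, using the sample line from this answer. Python's re module stands in for whichever engine you actually use, and the names are illustrative only:

```python
import re

pattern = re.compile(r"\b(\S*MFC\S+)\b")

# Raw string so that the backslash in "j2\13" stays a literal backslash.
sample = r"Test MFC-123/Ben jbas2/jda lmasdlmasd;mwrsMFCkmasd j2\13 MFC/7474"
blocks = pattern.findall(sample)
print(blocks)  # ['MFC-123/Ben', 'lmasdlmasd;mwrsMFCkmasd', 'MFC/7474']

# Because of \S+, a bare "MFC" with nothing after it is not matched.
lone = pattern.findall("code MFC here")
print(lone)  # []
```

If a standalone MFC should also match, \S* in place of \S+ (as in the previous answer) is the looser choice.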

Related

Modifying regex to match beginning and end characters

I am new to regex and playing around with writing regex to match markdown syntaxes, particularly italic text like:
this is markdown with some *italic text*
After writing some naive implementations I found this regex which seems to do the job quite nicely (dealing with edge-cases) and matches the entire string:
(?<!\*)\*([^ ][^*\n]*?)\*(?!\*)
However, I don't want to match the entire string - I only want to match the beginning and end * characters (so that I can do some special formatting to those characters). How might I go about doing that?
The tricky thing is that I only want to the match the * characters when the rest of the string matches the correct format of a string in italics (i.e. meets the requirements of that regex above). So a simple regex like (\*|\*) isn't going to cut it.
Apart from using a capturing group for the asterisk at the start and at the end, you can add an asterisk to the first negated character class to prevent matching a double **.
Note that, as pointed out by @toto, you don't really need the capturing groups around the asterisk (\*). You can also match them and add the replacement characters before and after the single capturing group for the content in the middle.
It also means that it should match at least a single character other than an asterisk.
You don't have to make the quantifier on the character class non-greedy (*?), as it cannot cross the * boundary that follows.
(?<!\*)(\*)([^*\s][^*\r\n]*)(\*)(?!\*)
If there also cannot be a space before the ending asterisk, you can repeatedly match a space followed by one or more non-whitespace chars other than an asterisk: (?: [^*\s]+)*
The \r\n in the negated character class keeps the match from crossing a line break. Note that \s in the first character class also excludes newlines; if that should not be the case, you can replace \s with a space, or with a tab and a space.
(?<!\*)(\*)([^*\s]+(?: [^*\s]+)*)(\*)(?!\*)
Just change the first and second \* to capturing groups and you can change at will:
(?<!\*)(\*)([^ ][^*\n]*?)(\*)(?!\*)
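To illustrate how the three capturing groups can be used, here is a minimal Python sketch. The square brackets are only a placeholder for whatever special formatting is wanted, and mark_delimiters is a hypothetical helper name:

```python
import re

# Groups 1 and 3 are the delimiter asterisks, group 2 is the content.
italic = re.compile(r"(?<!\*)(\*)([^*\s][^*\r\n]*)(\*)(?!\*)")

def mark_delimiters(line: str) -> str:
    # Wrap each delimiter asterisk in square brackets and keep the
    # content in the middle unchanged.
    return italic.sub(r"[\1]\2[\3]", line)

marked = mark_delimiters("this is markdown with some *italic text*")
print(marked)  # this is markdown with some [*]italic text[*]

# Double asterisks are rejected by the lookarounds and the negated class.
untouched = mark_delimiters("this is **bold** text")
print(untouched)  # this is **bold** text
```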

Regex: how to match repeating pattern incrementally

Given the following string:
one.two.three.four
How do I match/capture which results in the following in one go:
one
one.two
one.two.three
(if it's possible at all)
You can use this:
(?=(^|(?<=[.]))([\w.]+))
This performs a zero-width lookahead, which means the string is scanned one character at a time and the pattern is tested at each position. Inside the lookahead:
Using a zero-width assertion (^ or a lookbehind), it checks:
is this the beginning of the string?
is there a . behind the cursor?
Then, using a capture group, it gets the rest of the string that has not been consumed yet.
(\w+)\.?
(\w+) matches any word character (equal to [a-zA-Z0-9_])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed
\.? matches the character . literally, zero or one time (the ? makes it optional)
If your characters are all lowercase letters, try this instead: ([a-z]+)\.?
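One way to get the incremental results asked for is to capture the segments with (\w+)\.? and join them afterwards. This is a sketch in Python; the joining step is an addition for illustration and not part of either answer:

```python
import re

s = "one.two.three.four"

# (\w+)\.? captures each dot-separated segment in turn.
segments = re.findall(r"(\w+)\.?", s)

# Join the first 1, 2, ... segments, leaving out the full string.
prefixes = [".".join(segments[:i]) for i in range(1, len(segments))]
print(prefixes)  # ['one', 'one.two', 'one.two.three']
```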

regex find character parse until space or period

I have a chunk of text which may include a social media account. I want that account without the trailing space or period. This is using Google Sheets and REGEXEXTRACT. So far, I still get the period returned (if it exists). I'm searching for # and then want to return all text until a space or period.
Here's my formula:
=if(REGEXMATCH(E2,"#"),REGEXEXTRACT(E2,"#.*?\s"),"No social handle")
E2 is the cell that I'm searching. Here's a sample text: Former foo, now blah blah blahr #socialaccount. blah blah blah blah foo.
You can use this:
=if(REGEXMATCH(E2,"#"),REGEXEXTRACT(E2,"#.+?\b"),"No social handle")
It captures everything, non-greedily, until a word boundary \b is found. I tested it in my own Google Sheets spreadsheet.
Some explanation
REGEXEXTRACT returns the first part of the text that matches the regex pattern, e.g.:
REGEXEXTRACT("bla ble bli", "b.e") finds a b, then any single character, then an e, so it returns ble
REGEXEXTRACT("bla bleble bli", "b.+e") finds a b, then one or more of any character (greedy), up to the last e it can reach. The match starts at the first b, so it returns bla bleble
REGEXEXTRACT("bla bleble bli", "b.+?e") finds a b, then one or more of any character (non-greedy), up to the first e. The match again starts at the first b, so it returns bla ble
That special \b is called a word boundary.
And the full explanation for the regex I provided:
# matches the character # literally (case sensitive)
.+? matches any character (except for line terminators)
+? Quantifier: matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
Explanation from Regex101
You need to replace #.*?\s with
#\S+\b
This will match:
# - a # char
\S+ - one or more non-whitespace chars, as many as possible
\b - a word boundary position.
As \b appears after \S+, it means that all trailing non-word chars other than whitespaces will be cut off the match value.
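The three patterns can be compared side by side on the sample text. This sketch uses Python's re module rather than Google Sheets; for patterns this simple the two engines should agree, but treat that as an assumption and check in your own sheet:

```python
import re

text = "Former foo, now blah blah blahr #socialaccount. blah blah blah blah foo."

original = re.search(r"#.*?\s", text).group()  # pattern from the question
lazy = re.search(r"#.+?\b", text).group()      # lazy match up to a word boundary
greedy = re.search(r"#\S+\b", text).group()    # greedy match, then back off to a boundary

print(repr(original))  # '#socialaccount. '
print(repr(lazy))      # '#socialaccount'
print(repr(greedy))    # '#socialaccount'
```

The pattern from the question keeps both the period and the trailing space, because .*? only stops once \s has matched.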

How do I match name, name here in Regex?

I've tried everything I can think of to get this to work, and after hours of trying, I've got to ask for help.
This is the string I'm scanning.
F:\Downloads\Downloads\500 Comics CCC English\Jack, Byrd - Art #01.cbr
This is the current Regex I'm using to try to match what I want.
(?i)English\\(?<Writer>.*(?= ))(?-i)
It matches English\Jack, Byrd - Art
All I want it to match is Jack, Byrd (with no space after it.)
For some reason, the only space I can get it to match is the space after Art.
No matter what I try, it will only match that space. It's like it doesn't consider the other spaces to be spaces.
The (?i)English\\(?<Writer>.*(?= ))(?-i) pattern contains .* that grabs as many chars as it can and then backtracking results in the English\Jack, Byrd - Art match because the space after Art is the last space (the space is required due to the positive lookahead (?= )).
There are several ways to fix it depending on the contexts you have. If there is always a space-hyphen-space after the necessary value add a space and hyphen to the lookahead
(?i)English\\(?<Writer>.*(?= - ))(?-i)
If the value is always a string of non-whitespace chars, comma, space and again a string of non-whitespace chars use
(?i)English\\(?<Writer>\S+,\s*\S+)(?-i)
where \S+ matches one or more non-whitespace chars.
For those who wonder what (?<Writer>...) is: the construct is called a named capturing group. It is referred to as \k<Writer> when used inside the same regex pattern, or as ${Writer} when used inside a string replacement argument.
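Both fixes can be sketched in Python. Note the differences from the flavor used in the question: Python spells a named group (?P<Writer>...) and has no bare (?-i), so case-insensitivity is passed as a flag here. The variable names are illustrative:

```python
import re

path = r"F:\Downloads\Downloads\500 Comics CCC English\Jack, Byrd - Art #01.cbr"

# Lookahead variant: stop right before the " - " separator.
by_separator = re.search(r"English\\(?P<Writer>.*(?= - ))", path, re.IGNORECASE)
print(by_separator.group("Writer"))  # Jack, Byrd

# Shape variant: non-whitespace run, comma, optional whitespace, non-whitespace run.
by_shape = re.search(r"English\\(?P<Writer>\S+,\s*\S+)", path, re.IGNORECASE)
print(by_shape.group("Writer"))  # Jack, Byrd
```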

Regular expression to allow spaces between words

I want a regular expression that prevents symbols and only allows letters and numbers. The regex below works great, but it doesn't allow for spaces between words.
^[a-zA-Z0-9_]*$
For example, when using this regular expression "HelloWorld" is fine, but "Hello World" does not match.
How can I tweak it to allow spaces?
tl;dr
Just add a space in your character class.
^[a-zA-Z0-9_ ]*$
Now, if you want to be strict...
The above isn't exactly correct. Due to the fact that * means zero or more, it would match all of the following cases that one would not usually mean to match:
An empty string, "".
A string comprised entirely of spaces, " ".
A string that leads and / or trails with spaces, " Hello World ".
A string that contains multiple spaces in between words, "Hello   World".
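The cases listed above can be checked quickly. A minimal sketch in Python, with sample strings chosen for illustration:

```python
import re

lenient = re.compile(r"^[a-zA-Z0-9_ ]*$")

# Each of these is accepted, although it is probably not what was intended.
samples = ["", "   ", " Hello World ", "Hello   World"]
accepted = [bool(lenient.search(s)) for s in samples]
print(accepted)  # [True, True, True, True]
```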
Originally I didn't think such details were worth going into, as OP was asking such a basic question that it seemed strictness wasn't a concern. Now that the question's gained some popularity however, I want to say...
...use @stema's answer.
Which, in my flavor (without using \w) translates to:
^[a-zA-Z0-9_]+( [a-zA-Z0-9_]+)*$
(Please upvote @stema regardless.)
Some things to note about this (and @stema's) answer:
If you want to allow multiple spaces between words (say, if you'd like to allow accidental double-spaces, or if you're working with copy-pasted text from a PDF), then add a + after the space:
^\w+( +\w+)*$
If you want to allow tabs and newlines (whitespace characters), then replace the space with a \s+:
^\w+(\s+\w+)*$
Here I suggest the + by default because, for example, Windows linebreaks consist of two whitespace characters in sequence, \r\n, so you'll need the + to catch both.
Still not working?
Check what dialect of regular expressions you're using.* In languages like Java you'll have to escape your backslashes, i.e. \\w and \\s. In older or more basic languages and utilities, like sed, \w and \s may not be defined, so write them out with character classes, e.g. [a-zA-Z0-9_] and [ \f\n\r\t\v], respectively.
* I know this question is tagged vb.net, but based on 25,000+ views, I'm guessing it's not only those folks who are coming across this question. Currently it's the first hit on google for the search phrase, regular expression space word.
One possibility would be to just add the space to your character class, like acheong87 suggested. Whether that is enough depends on how strict your pattern needs to be, because it would also allow a string starting with 5 spaces, or strings consisting only of spaces.
The other possibility is to define a pattern:
I will use \w, which in most regex flavours is the same as [a-zA-Z0-9_] (in some it is Unicode based)
^\w+( \w+)*$
This will allow a series of at least one word and the words are divided by spaces.
^ Match the start of the string
\w+ Match a series of at least one word character
( \w+)* is a group that is repeated 0 or more times. In the group it expects a space followed by a series of at least one word character
$ matches the end of the string
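A minimal sketch of this pattern as a validator, in Python. Note that Python's \w is one of the Unicode-based flavours mentioned above, so it also accepts non-ASCII letters; is_valid is a hypothetical helper name:

```python
import re

strict = re.compile(r"^\w+( \w+)*$")

def is_valid(value: str) -> bool:
    # True when the value is one or more words separated by single spaces.
    return strict.search(value) is not None

checks = {
    "HelloWorld": is_valid("HelloWorld"),      # True
    "Hello World": is_valid("Hello World"),    # True
    "": is_valid(""),                          # False: at least one word is required
    " Hello": is_valid(" Hello"),              # False: leading space
    "Hello  World": is_valid("Hello  World"),  # False: double space
}
print(checks)
```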
This one worked for me
([\w ]+)
Try with:
^(\w+ ?)*$
Explanation:
\w - alias for [a-zA-Z_0-9]
" "? - allows a single space after each word; the ? makes it optional
I assume you don't want leading/trailing space. This means you have to split the regex into "first character", "stuff in the middle" and "last character":
^[a-zA-Z0-9_][a-zA-Z0-9_ ]*[a-zA-Z0-9_]$
or if you use a perl-like syntax:
^\w[\w ]*\w$
Also: If you intentionally worded your regex that it also allows empty Strings, you have to make the entire thing optional:
^(\w[\w ]*\w)?$
If you want to only allow single space chars, it looks a bit different:
^((\w+ )*\w+)?$
This matches 0..n words followed by a single space, plus one word without space. And makes the entire thing optional to allow empty strings.
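The difference between the variants in this answer shows up on very short inputs and on repeated spaces. A small Python sketch, with inputs chosen for illustration and ok as a hypothetical helper:

```python
import re

two_ended = re.compile(r"^\w[\w ]*\w$")
single_spaced = re.compile(r"^((\w+ )*\w+)?$")

def ok(pattern, value):
    return pattern.search(value) is not None

results = [
    ok(two_ended, "a"),         # False: needs a first and a last character
    ok(single_spaced, "a"),     # True
    ok(two_ended, "a  b"),      # True: repeated spaces are allowed inside
    ok(single_spaced, "a  b"),  # False: only single spaces between words
    ok(single_spaced, ""),      # True: the whole group is optional
]
print(results)  # [False, True, True, False, True]
```

So if single-character input must be accepted, the last form is the safer of the two.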
This regular expression
^\w+(\s\w+)*$
will only allow a single space between words and no leading or trailing spaces.
Below is the explanation of the regular expression:
^ Assert position at start of the string
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
1st Capturing group (\s\w+)*
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\s Match any white space character [\r\n\t\f ]
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
$ Assert position at end of the string
Just add a space to the end of your character class as follows:
[a-zA-Z0-9_ ]
This does not allow a space at the beginning, but allows spaces between words. It also allows special characters between words. A good regex for FirstName and LastName fields.
\w+.*$
For alphabets only:
^([a-zA-Z])+(\s)+[a-zA-Z]+$
For alphanumeric value and _:
^(\w)+(\s)+\w+$
If you are using JavaScript then you can use this regex (the hyphen is placed last in the character class so that it cannot be read as a range):
/^[a-z0-9_.\s-]+$/i
For example:
/^[a-z0-9_.\s-]+$/i.test("") // false
/^[a-z0-9_.\s-]+$/i.test("helloworld") // true
/^[a-z0-9_.\s-]+$/i.test("hello world") // true
/^[a-z0-9_.\s-]+$/i.test("none alpha: ɹqɯ") // false
The only drawback of this regex is that a string made up entirely of spaces, such as "       ", will also show as true.
It was my regex: @"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)*$"
I just added ([\w ]+) at the end of my regex before *
@"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)([\w ]+)*$"
Now string is allowed to have spaces.
This regex allows only letters and spaces:
^[a-zA-Z ]*$
Try with this one:
import re
result = re.search(r"\w+( )\w+", text)  # text is the string to check