How to allow spaces in between words? - regex

EDIT: I've been experimenting, and it seems like putting this:
\(\w{1,12}\s*\)$
works. However, it only allows spaces at the end of the word.
For example,
Matches
(stuff )
(stuff )
Does not
(st uff)
Regexp:
\(\w{1,12}\)
This matches the following:
(stuff)
But not:
(stu ff)
I want to be able to match spaces too.
I've tried putting \s but it just broke the whole thing, nothing would match. I saw one post on here that said to enclose the whole thing in a ^[]*$ with space in there. That only made the regex match everything.
This is for Google Forms validation, if that helps. I'm completely new to regex, so go easy on me. I looked up my problem but could not find anything that worked with my regex. (Is it because of the parentheses?)

To match text like (st uff) or (st uff some more), write your regex like this:
\(\w{1,12}(?:\s+\w{1,12})*\)
Regex explanation:
\( - Literal start parenthesis
\w{1,12} - Match a word of length 1 to 12 like you wanted
(?:\s+\w{1,12})* - Match one or more whitespace characters followed by a word of length 1 to 12, and repeat this whole group zero or more times
\) - Literal closing parenthesis
Now if you want to optionally also allow spaces just after starting parenthesis and ending parenthesis, you can just place \s* in the regex like this,
\(\s*\w{1,12}(?:\s+\w{1,12})*\s*\)
  ^^^                        ^^^
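To see how the two patterns above differ, here is a minimal sketch that runs them against the examples from the question. It assumes Python's re module rather than the Google Forms engine (the constructs used are common to both, but that is an assumption), and it uses fullmatch so the whole input must match. The names STRICT, LOOSE, samples and check are made up for illustration.

```python
import re

# Words of 1-12 word characters, separated by whitespace, inside literal parentheses.
STRICT = re.compile(r"\(\w{1,12}(?:\s+\w{1,12})*\)")
# Same, but also tolerating whitespace right after "(" and right before ")".
LOOSE = re.compile(r"\(\s*\w{1,12}(?:\s+\w{1,12})*\s*\)")

samples = ["(stuff)", "(st uff)", "(st uff some more)", "(stuff )", "( stuff)", "()"]


def check(pattern, text):
    """Return True when the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


strict_results = {s: check(STRICT, s) for s in samples}
loose_results = {s: check(LOOSE, s) for s in samples}

for s in samples:
    print(f"{s!r:24} strict={strict_results[s]} loose={loose_results[s]}")
```

Only the second pattern accepts (stuff ) and ( stuff); neither accepts empty parentheses, because at least one word is required.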

If you are trying to get up to 12 characters between parentheses:
\([^\)]{1,12}\)
The [^\)] segment is a character class that represents all characters that aren't closing parentheses (^ inverts the class).
If you want some specific characters, like alphanumeric and spaces, group that into the character class instead:
\([\w ]{1,12}\)
Or
\([\w\s]{1,12}\)
If you want up to 12 word characters with an arbitrary number of spaces anywhere in between:
\(\s*(?:\w\s*){1,12}\)
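The difference between these variants is what counts toward the limit of 12. A rough sketch in Python (an assumption, the question is about Google Forms) makes this visible; the constant names and sample strings are invented for the example.

```python
import re

# 1-12 characters of anything except ")".
ANY_CHARS = re.compile(r"\([^\)]{1,12}\)")
# 1-12 word characters or spaces; spaces count toward the 12.
WORD_OR_SPACE = re.compile(r"\([\w ]{1,12}\)")
# 1-12 word characters; whitespace around them is not counted.
WORDS_FREE_SPACES = re.compile(r"\(\s*(?:\w\s*){1,12}\)")


def matches(pattern, text):
    """Return True when the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


text_short = "(st uff)"            # 6 characters between the parentheses
text_punct = "(st-uff)"            # contains a hyphen, which is not a word character
text_spaced = "(a b c d e f g h)"  # 8 word characters, 15 characters in total

for text in (text_short, text_punct, text_spaced):
    print(text,
          matches(ANY_CHARS, text),
          matches(WORD_OR_SPACE, text),
          matches(WORDS_FREE_SPACES, text))
```

The last sample is accepted only by the third pattern, because its spaces push the total past 12 characters while the word characters stay under the limit.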

Related

Regex Extraction - Match before a space, or NOT before a space

Here are my potential inputs:
brian#muck.co, brian#gmail.com
brian#gmail.com, brian#muck.co
What I want to do is extract the #muck.co email address.
What I have tried is:
\s.*#muck.co
The problem is that this only grabs an email address if it is preceded by a space (so it would only match the second example input above). How would I write a regex to match either input?
\s matches whitespace, so you want something like [^\s]*#muck.co instead, which means any number of non-space characters. [] defines a set of characters, and a leading ^ negates it.
That does not work for me, because \s in my regex flavour does not seem to include the regular space, but this works: [^[:space:]]\+#muck\.co. It also uses \+ instead of * to require one or more non-space characters rather than any number, and escapes the dot as \., since an unescaped dot stands for any single character.
You can use a negated character class so the match does not cross the #, and a word boundary at the end to prevent a partial word match:
[^\s#]+#muck\.co\b
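As a quick check, this pattern can be run against both inputs from the question. The sketch below assumes Python's re module (the question does not name a flavour); MUCK, inputs and found are illustrative names.

```python
import re

# One or more characters that are neither whitespace nor "#",
# then the literal "#muck.co", ending at a word boundary.
MUCK = re.compile(r"[^\s#]+#muck\.co\b")

inputs = [
    "brian#muck.co, brian#gmail.com",
    "brian#gmail.com, brian#muck.co",
]

# search() scans the whole line, so the position of the address does not matter.
found = [MUCK.search(line).group(0) for line in inputs]
print(found)

# The trailing \b rejects a longer domain such as "muck.com".
partial = MUCK.search("brian#muck.com")
print(partial)
```

Both inputs yield the same address, whether or not it is preceded by a space.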

RegEx in VSCode: capture every character/letter - not just ASCII

I am working with historical text and I want to reformat it with RegEx. Problem is: There are lots of special characters (that is: letters) in the text that are not matched by RegEx character classes like [a-z] / [A-Z] or \w .
For example I want to match the dot (and only the dot) in the following line:
<tag1>Quomodo restituendus locus Demosth. Olÿnth</tag1>
Without the ÿ I could easily work with the mentioned character classes, like:
(?<=(<tag1>(\w|\s)*))\.(?=((\w|\s)*</tag1>))
But it does not work with special characters that are not covered by ASCII. I tried lots of things but I can't make it work so the RegEx really only captures the dot in this very line. If I use more general Expressions like (.)* (instead of (\w|\s)* ) I get many more of the dots in the document (for example dots that are not between an opening and a closing tag but in between two such tagsets), which is not what I want. Any ideas for an expression that covers like all unicode letters?
You may match any run of text that contains no < or > with [^<>]*:
(?<=(<tag1>[^<>]*))\.(?=([^<>]*</tag1>))
You probably do not need all those capturing groups; you might get what you need without them:
(?<=<tag1>[^<>]*)\.(?=[^<>]*</tag1>)
Details:
(?<=<tag1>[^<>]*) - a location immediately preceded by <tag1> and then any zero or more chars other than < and >
\. - a dot
(?=[^<>]*</tag1>) - a location immediately followed by any zero or more chars other than < and > and then </tag1>.
Use a negated character class that excludes the dot and the opening angle bracket:
(?<=<tag1>[^.<]*(?:<(?!/tag1>)[^.<]*)*)\.
With this kind of pattern it isn't even necessary to check the closing tag. But if you absolutely want to check it, end the pattern with:
(?=[^<]*(?:<(?!/tag1>)[^<]*)*</tag1>)
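The patterns above rely on variable-length lookbehind, which VSCode's search supports but Python's built-in re module does not. For scripting the same task outside the editor, here is a sketch of a different, two-step technique (not the lookbehind approach from the answers): first find each <tag1> element whose content has no angle brackets, then locate the dots inside it. The sample line extends the one from the question with extra text, and dot_positions is a hypothetical helper name.

```python
import re

line = ("<tag1>Quomodo restituendus locus Demosth. Olÿnth</tag1> "
        "Outside. <tag1>No dot here</tag1>")

# Step 1: each <tag1>...</tag1> span whose content contains no "<" or ">".
# [^<>]* does not care whether the letters are ASCII or not.
TAG = re.compile(r"<tag1>([^<>]*)</tag1>")


def dot_positions(text):
    """Return the index of every '.' that sits inside a <tag1> element."""
    positions = []
    for m in TAG.finditer(text):
        start = m.start(1)  # absolute offset of the tag content
        content = m.group(1)
        # Step 2: plain character scan inside the captured content.
        positions.extend(start + i for i, ch in enumerate(content) if ch == ".")
    return positions


positions = dot_positions(line)
print(positions)
```

The dot after "Outside" sits between two tag sets and is skipped, which is the behaviour the question asks for.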

Modifying regex to match beginning and end characters

I am new to regex and playing around with writing regex to match markdown syntaxes, particularly italic text like:
this is markdown with some *italic text*
After writing some naive implementations I found this regex which seems to do the job quite nicely (dealing with edge-cases) and matches the entire string:
(?<!\*)\*([^ ][^*\n]*?)\*(?!\*)
However, I don't want to match the entire string - I only want to match the beginning and end * characters (so that I can do some special formatting to those characters). How might I go about doing that?
The tricky thing is that I only want to match the * characters when the rest of the string matches the correct format of a string in italics (i.e. meets the requirements of the regex above). So a simple regex like (\*|\*) isn't going to cut it.
Apart from using a capturing group for the asterisk at the start and at the end, you can add an asterisk to the first negated character class to prevent matching a double **.
Note that, as pointed out by #toto, you don't really need the capturing groups around the asterisks (\*). You can also just match them and add the replacement characters before and after the single capturing group for the content in the middle.
It also means that the content must contain at least a single character other than an asterisk.
You don't have to make the quantified character class non-greedy with *?, as it cannot cross the * boundary that follows.
(?<!\*)(\*)([^*\s][^*\r\n]*)(\*)(?!\*)
If there also must not be a space before the ending asterisk, you can repeat matching a space followed by one or more non-whitespace chars other than an asterisk: (?: [^*\s]+)*
The \r\n in the negated character class keeps the match from crossing line breaks, which \s also matches. If that is not what you want, you can replace \s with a space, or with a tab and a space.
(?<!\*)(\*)([^*\s]+(?: [^*\s]+)*)(\*)(?!\*)
Just turn the first and second \* into capturing groups, and you can replace them at will:
(?<!\*)(\*)([^ ][^*\n]*?)(\*)(?!\*)
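To illustrate how the three groups can be used in a replacement, here is a small sketch with Python's re.sub (the question does not say which language is used, so this is an assumption). Wrapping each asterisk in square brackets is an arbitrary stand-in for the "special formatting"; ITALIC, text and formatted are illustrative names.

```python
import re

# Group 1: opening asterisk, group 2: italic content, group 3: closing asterisk.
# The lookarounds reject asterisks that belong to a double ** (bold) marker.
ITALIC = re.compile(r"(?<!\*)(\*)([^*\s][^*\r\n]*)(\*)(?!\*)")

text = "this is markdown with some *italic text* and **bold** text"

# Re-emit the content unchanged and decorate only the two asterisks.
formatted = ITALIC.sub(r"[\1]\2[\3]", text)
print(formatted)
```

The bold markers are left alone, and only asterisks that delimit a valid italic span are touched.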

Regex to match words after dot until a whitespace occurs

Given the following string
span.a.b this.is.really.confusing
I need to return the matches a and b. I've been able to get close with the following regex:
(?<=\.)[\w]+
But it's also matching is, really, and confusing. When I include a positive lookahead I get even closer, but I'm still not there.
(?<=\.)[\w]+(?=\s) # matches b, confusing
How can I match words after a dot until a whitespace occurs?
NB: this is language agnostic pseudo-code, but should work.
regex = "^[^\s.]+\.(\S+).*"
targets = <extracted_group>.split(".")
Regex explanation:
"^": begins with
"[^\s.]+\.": 1 or more non-whitespace, non-period characters, followed by a literal period.
"(\S+)": group and capture all of the following non-whitespace characters
".*": matches 0 or more of any non-newline character
If the split function takes a regex instead of a string, you'll need to escape the '.' or use a character class.
NB: You can do it without the split, but I think that the split is more transparent.
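For reference, the pseudo-code above could look like this in Python. This is only a sketch under the assumption that the leading token (span here) should be skipped and that just the first whitespace-delimited word matters. The trailing .* is dropped because match() does not need to consume the rest of the string, and FIRST_CHAIN and words_after_dots are made-up names.

```python
import re

text = "span.a.b this.is.really.confusing"

# Skip the leading token up to the first period,
# then capture the rest of the first whitespace-delimited word.
FIRST_CHAIN = re.compile(r"^[^\s.]+\.(\S+)")


def words_after_dots(s):
    """Return the dot-separated words that follow the first token."""
    m = FIRST_CHAIN.match(s)
    if m is None:
        return []
    # str.split takes a plain string, so the "." needs no escaping here.
    return m.group(1).split(".")


targets = words_after_dots(text)
print(targets)
```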
I am not sure if this is good enough for all your possible cases, but it should work with the provided example:
\.([\w]+)\.([\w]+)\s
$1 = a, $2 = b

Regular expression to remove parenthesis and space before it

I'm trying to write a regular expression (inside a Google Spreadsheet) to remove the parentheses, the text inside the parentheses, and the space before them. In other words, I'm trying to extract only the name from the text. For example, I'd like the string "A.J. Smith (iOS Developer, San Francisco)" to become "A.J. Smith"
So far I've gotten both =REGEXEXTRACT(D2,"[^()]*") and =REGEXEXTRACT(D2,"^[^(]+") to extract "A.J. Smith " but it leaves that last space at the end. This is probably a really easy problem to solve, I'm just not great with regex.
Just use word boundary.
=REGEXEXTRACT(D2,"^[^(]+\\b")
^[^(]+ greedily matches all the characters up to the first ( symbol, including the space that precedes the (. Then, because of the \b in the regex, it backtracks to the last word boundary in the matched string.
Try this instead:
=REGEXREPLACE(D2,"\s\(.*","")
What I'm doing is replacing everything from a space next to a parenthesis to the end of the string with nothing.
I used https://regoio.herokuapp.com/ to help build a regex to match. This regex would match this example without the space: ^(.+)\s\(
The regex works like this: the ^ matches the beginning of the string, and the parentheses capture whatever expression is inside them, in this case .+, which matches any character one or more times. The \s matches a whitespace character and \( matches the opening parenthesis.
If you want a regex that removes whitespace at the beginning of the string and any before the parenthesis this should work: ^[\s]*(.+)[\s]+\(
With this regex you can extract all the text you wanted in a single REGEXEXTRACT instead of using multiple ones:
=REGEXEXTRACT(D2,"^[\s]*(.+)[\s]+\(")
I found that =REGEXEXTRACT(D2,"(.*)\s\(") also worked for me.
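To compare the approaches outside the spreadsheet, here is a rough Python analog of the replace and extract formulas above. It is only a sketch: Python's re is not the engine Google Sheets uses, although these particular patterns behave the same way in both as far as I know. The variable names are invented for the example.

```python
import re

cell = "A.J. Smith (iOS Developer, San Francisco)"

# Analog of REGEXREPLACE(D2, "\s\(.*", ""):
# drop the space, the "(" and everything after it.
replaced = re.sub(r"\s\(.*", "", cell)

# Analog of REGEXEXTRACT(D2, "(.*)\s\("):
# keep only the captured group in front of " (".
m = re.search(r"(.*)\s\(", cell)
extracted = m.group(1) if m else cell

# The attempt from the question stops at "(" and so keeps the trailing space.
with_space = re.search(r"^[^(]+", cell).group(0)

print(repr(replaced), repr(extracted), repr(with_space))
```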
This should work to remove all parentheses and white space before:
=REGEXEXTRACT(D2,"\s|\(|\)|\[|]|{|}|")
Feel free to play around with this on rubular.