I have a chunk of text which may include a social media account. I want that account without the trailing space or period. This is using google sheets and regextract. So far, I still get the period returned (if it exists). I'm searching for # then want to return all text until space or period.
Here's my formula:
=if(REGEXMATCH(E2,"#"),REGEXEXTRACT(E2,"#.*?\s"),"No social handle")
E2 is the cell that I'm searching. Here's a sample text: Former foo, now blah blah blahr #socialaccount. blah blah blah blah foo.
You can use this:
=if(REGEXMATCH(E2,"#"),REGEXEXTRACT(E2,"#.+?\b"),"No social handle")
It lazily captures everything after the # until a word boundary (\b) is found. I tested it in my own Google Sheets spreadsheet.
Some explanation
REGEXEXTRACT returns the first part of the text that matches the whole pattern, from the first character the pattern matches to the last, e.g.:
REGEXEXTRACT("bla ble bli", "b.e") finds the first substring that starts with a b, is followed by any single character, and ends with an e, therefore it will return ble
REGEXEXTRACT("bla bleble bli", "b.+e") starts at the first b and takes as many characters as possible (greedy) up to the last e, therefore it will return bla bleble
REGEXEXTRACT("bla bleble bli", "b.+?e") starts at the first b and takes as few characters as possible (non greedy) up to the first e after it, therefore it will return bla ble
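The greedy versus lazy difference can also be checked outside Sheets. This is only a minimal sketch, assuming Python's re module as a stand-in for REGEXEXTRACT (Sheets uses RE2, which should behave the same for these simple patterns); the helper name regexextract is hypothetical. Note that the match always begins at the leftmost b that can start a match.

```python
import re

def regexextract(text, pattern):
    """Hypothetical stand-in for REGEXEXTRACT: first match, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

single = regexextract("bla ble bli", "b.e")       # b, one character, e -> "ble"
greedy = regexextract("bla bleble bli", "b.+e")   # runs to the last e -> "bla bleble"
lazy = regexextract("bla bleble bli", "b.+?e")    # stops at the first e -> "bla ble"

print(single, "|", greedy, "|", lazy)
```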
That special \b is called a Word Boundary (detailed article on it, enjoy)
And the full explanation for the regex I provided:
# matches the character # literally (case sensitive)
.+? matches any character (except for line terminators)
+? Quantifier: matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
Explanation from Regex101
You need to replace #.*?\s with
#\S+\b
This will match:
# - a # char
\S+ - one or more non-whitespace chars, as many as possible
\b - a word boundary position.
As \b appears after \S+, it means that all trailing non-word chars other than whitespaces will be cut off the match value.
See regex example.
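To compare the patterns from this thread side by side, here is a small sketch using Python's re module rather than Sheets (an assumption on my part; the function name extract_handle is made up for illustration). It mirrors the IF/REGEXMATCH fallback from the question's formula.

```python
import re

sample = "Former foo, now blah blah blahr #socialaccount. blah blah blah blah foo."

def extract_handle(text, pattern):
    """Return the first match of pattern in text, or the fallback message."""
    m = re.search(pattern, text)
    return m.group(0) if m else "No social handle"

original = extract_handle(sample, r"#.*?\s")     # keeps the period and the space
lazy_b = extract_handle(sample, r"#.+?\b")       # stops at the first word boundary
nonspace_b = extract_handle(sample, r"#\S+\b")   # greedy, then backs off trailing punctuation

print(repr(original), repr(lazy_b), repr(nonspace_b))
```

The two fixes differ when the handle itself contains punctuation: for "#first.last rest", #.+?\b stops at #first, while #\S+\b returns #first.last.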
Hello good afternoon!!
I'm new to the world of regular expressions and would like some help creating the following expression!
I have a query that returns the following values:
caixa-pod
config-pod
consultas-pod
entregas-pod
monitoramento-pod
vendas-pod
I would like the results to be presented as follows:
caixa
config
consultas
entregas
monitoramento
vendas
In this case, it would exclude the word "-pod" from each value.
I would try (.*)-pod. It is not clear where you want to use that regexp (the exact syntax may differ depending on that). I guess it is a dashboard variable.
You can try
\b[a-z]*(?=-pod)\b
This regex basically tells the regex engine to match
\b a word boundary
[a-z]* any number of lowercase characters in range a-z (feel free to extend to whatever is needed e.g. [a-zA-Z0-9] matches all alphanumeric characters)
(?=-pod) followed by -pod but exclude that from the result (positive lookahead)
\b another word boundary
\b matches a word boundary position between a word character and non-word character or position (start / end of string).
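Both suggestions can be tried against the values from the question. This is just a sketch in Python's re module; the tool the question is actually about may use a different regex flavour, so treat the flavour as an assumption.

```python
import re

values = ["caixa-pod", "config-pod", "consultas-pod",
          "entregas-pod", "monitoramento-pod", "vendas-pod"]

# Lookahead version: the match itself stops before "-pod".
lookahead = re.compile(r"\b[a-z]*(?=-pod)\b")
# Capture-group version: group 1 holds everything before "-pod".
capture = re.compile(r"(.*)-pod")

by_lookahead = [lookahead.search(v).group(0) for v in values]
by_capture = [capture.search(v).group(1) for v in values]

print(by_lookahead)
```

The capture-group form is the safer choice if the target environment does not support lookahead.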
I am trying to implement a regex that matches any sequence of letters, but ignores the match if it is followed by a :. I decided to use a negative lookahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and only ignores the last character before the :.
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ reaches a :, the lookahead fails, so the engine steps back one letter from the failing position and re-checks the lookahead. It therefore finds a match whenever there are at least two letters before a colon, returning the part that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
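The backtracking effect is easy to reproduce. A minimal sketch in Python's re module (my choice of language, the question does not name one), run on the question's string and the two demo strings mentioned above:

```python
import re

text = "lame:joker"

# Original pattern: the engine gives back one letter so the lookahead passes.
naive = re.findall(r"[a-zA-Z]+(?!:)", text)
# Fixed pattern: the match may not stop in the middle of a run of letters.
fixed = re.findall(r"[A-Za-z]+(?![A-Za-z:])", text)

print(naive)   # ['lam', 'joker']
print(fixed)   # ['joker']
```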
Preventing backtracking into a word-like pattern by using a word boundary
As #scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12," but not in "12:", and also not in "12c" or "12_"
\d+(?![\d:]) matches 12 in "12,", "12c" and "12_", and fails only in "12:".
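The four cases can be checked mechanically. A short sketch, again assuming Python's re module; first_match is a hypothetical helper:

```python
import re

with_boundary = re.compile(r"\d+\b(?!:)")
with_lookahead = re.compile(r"\d+(?![\d:])")

samples = ["12,", "12:", "12c", "12_"]

def first_match(pattern, text):
    """Return the first matched text, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

boundary_results = {s: first_match(with_boundary, s) for s in samples}
lookahead_results = {s: first_match(with_lookahead, s) for s in samples}

print(boundary_results)
print(lookahead_results)
```

The word-boundary version rejects 12c and 12_ because c and _ are word characters, so there is no boundary after the 2.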
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.
I want to find words which contains a given sequence of letters. However the word should be different than a given banned word.
For instance in
"modal dalaman odal Modal ODAL amodal modalex amodale"
If the sequence is "dal" and the banned word is modal, I want to get the dalaman, odal, ODAL, amodal, modalex, amodale.
How can I do that in regex? BTW, there is no specific programming language for this question.
You can use the pattern below to match all words that contain "dal" but are not equal to "modal" as a full word.
Pattern:
\w*dal(?<!\bmodal\b)\w*
Explanation:
\w* matches any number of word characters (alphanumeric and underscore "_"), including zero
dal matches the sequence "dal" literally
(?<!\bmodal\b) is a negative lookbehind which assures that the sequence "modal" could not be matched immediately on the left of this token.
The \b matches only at word boundaries, but does not consume any characters.
\w* matches any number of word characters (alphanumeric and underscore "_"), including zero
Check this regex out on regex101.com
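A quick check of this pattern on the question's input, sketched with Python's re module. I am assuming the case-insensitive flag here, since the expected output includes ODAL and excludes Modal; without it the result would differ.

```python
import re

text = "modal dalaman odal Modal ODAL amodal modalex amodale"

# The lookbehind rejects a "dal" that completes the standalone word "modal".
pattern = re.compile(r"\w*dal(?<!\bmodal\b)\w*", re.IGNORECASE)
matches = pattern.findall(text)

print(matches)
```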
This is the old version of my answer that was valid before the question update:
You could use the pattern below together with the i (case insensitivity) flag.
Depending on what programming language or environment you use to process the regex, you might either also have to set the g (global) flag to match all separate occurrences of the pattern, or use a method of your environment that searches all matches, like e.g. in Python re.findall().
Pattern:
\S*(?<!mo)dal\S*
Explanation:
\S* matches any number of non-whitespace characters, including zero
(?<!mo) is a negative lookbehind which assures that the sequence "mo" could not be matched immediately on the left of this token
dal matches the sequence "dal" literally
\S* matches any number of non-whitespace characters, including zero
Check this regex out on regex101.com
More general, you can use this pattern:
\S*(?<!%%FORBIDDEN_LEFT%%)%%REQUIRED%%(?!%%FORBIDDEN_RIGHT%%)\S*
after replacing the placeholders %%REQUIRED%%, %%FORBIDDEN_LEFT%% and %%FORBIDDEN_RIGHT%% with whatever strings you need.
For example, if you want to match "cd" but not "abcdef", you have to use the pattern \S*(?<!ab)cd(?!ef)\S*.
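Filling in the placeholders can be automated. The following is only a sketch: build_pattern is a hypothetical helper, and it assumes the three strings are literals (hence re.escape) and that Python's re module is the target, which requires the lookbehind to be fixed-width.

```python
import re

def build_pattern(required, forbidden_left="", forbidden_right=""):
    """Hypothetical helper: fill the template with escaped literal strings."""
    left = f"(?<!{re.escape(forbidden_left)})" if forbidden_left else ""
    right = f"(?!{re.escape(forbidden_right)})" if forbidden_right else ""
    return re.compile(rf"\S*{left}{re.escape(required)}{right}\S*")

pattern = build_pattern("cd", forbidden_left="ab", forbidden_right="ef")
hits = pattern.findall("abcdef xcdy cd abcd cdef")

print(pattern.pattern)
print(hits)
```

Note that the two conditions are checked independently: abcd and cdef are rejected as well, not only abcdef.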
I am trying to match the 'words' that contain a specific string inside a provided string.
This reg_ex works great:
preg_match('/\b(\w*form\w*)\b/', $string, $matches);
So for example if my string contained: "Which person has reformed or performed" it returns reformed and performed.
However, I need to match codes inside codes so my definition of 'word' is based on splitting the string purely by a space.
For example, I have a string like:
Test MFC-123/Ben MFC/7474
And I need to match 'MFC' which should return 'MFC-123/Ben' and 'MFC/7474'.
How can I modify the above reg_ex to match all characters and use space as a boundary.
Thanks
Simply using this will do it for you:
(MFC\S+)
It means MFC followed by one or more non-whitespace characters.
If the MFC comes in between text, or alone, then you can place \S* before and after the MFC. For example
(\S*MFC\S*)
This matches:
MFC-12312
1231-MFC
MFC
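Here is how the space-delimited version behaves on the string from the question. The sketch uses Python's re.findall to collect every match (in PHP, preg_match_all would be the function that collects all matches); the split-and-filter alternative is my own addition for comparison.

```python
import re

text = "Test MFC-123/Ben MFC/7474"

# Whole space-delimited tokens that contain "MFC" anywhere.
tokens = re.findall(r"\S*MFC\S*", text)

# Roughly the same result without a regex: split on whitespace and filter.
tokens_split = [tok for tok in text.split() if "MFC" in tok]

print(tokens)
```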
If you want to get the whole block of text that contains no spaces and contains your MFC as a match, you can use the following regex:
\b(\S*MFC\S+)\b
explanation:
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
1st Capturing group (\S*MFC\S+)
\S* match any non-white space character [^\r\n\t\f ]
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed.
MFC matches the characters MFC literally (case sensitive)
\S+ match any non-white space character [^\r\n\t\f ]
Quantifier: Between one and unlimited times, as many times as possible, giving back as needed.
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
example where matched blocks are in bold:
Test MFC-123/Ben jbas2/jda lmasdlmasd;mwrsMFCkmasd j2\13 MFC/7474
hope this helps.
I have this regex:
(?:\S)\++(?:\S)
Which is supposed to catch all the pluses in a query string like this:
?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or
It should have been 4 matches, but there are only 3:
s+n
e+c
s+e
It is missing the last one:
e+S
And it seems to happen because the "e" character has participated in a previous match (s+e), because the "e" character is right in the middle of two pluses (Teni s+e+S quash).
If you test the regex with the following input, it matches the last "+":
?busca=tenis+nike+categoria:"Tenis_e+Squash"&pagina=4&operador=or
(changed "s+e" for "s_e" in order not to cause the "e" character to participate in the match).
Would someone please shed a light on that?
Thanks in advance!
In a consecutive match, the search for the next match starts at the position where the previous match ended. And since the non-whitespace character after the + is matched too, the search for the next match will start after that non-whitespace character. So in a sequence like s+e+S you will only find one match:
s+e+S
\_/
You can fix that by using look-around assertions, which check the surrounding characters without making them part of the match, like:
\S\++(?=\S)
This will match any non-whitespace character followed by one or more + only if it is followed by another non-whitespace character.
But since whitespace is not allowed in a URI query, you don’t need the surrounding \S at all as every character is non-whitespace. So the following will already match every sequence of one or more + characters:
\++
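The difference between consuming the trailing character and only asserting it shows up directly in the match lists. A minimal sketch on the query string from the question, using Python's re module (my assumption, the question does not say which engine is in use):

```python
import re

query = '?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or'

consuming = re.findall(r"(?:\S)\++(?:\S)", query)   # trailing character is consumed
lookahead = re.findall(r"\S\++(?=\S)", query)       # trailing character is only checked
plus_only = re.findall(r"\++", query)               # no surrounding checks at all

print(consuming)   # ['s+n', 'e+c', 's+e']
print(lookahead)   # ['s+', 'e+', 's+', 'e+']
print(plus_only)   # ['+', '+', '+', '+']
```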
You are correct: The fourth match doesn't happen because the surrounding character has already participated in the previous match. The solution is to use lookaround (if your regex implementation supports it; older JavaScript engines don't support lookbehind, for example).
Try
(?<!\s)\++(?!\s)
This matches one or more + unless they are surrounded by whitespace. This also works if the plus is at the start or the end of the string.
Explanation:
(?<!\s) # assert that there is no space before the current position
# (but don't make that character a part of the match itself)
\++ # match one or more pluses
(?!\s) # assert that there is no space after the current position
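The commented form above can be used almost as is in engines with a verbose mode. A sketch with Python's re.VERBOSE flag (one possible engine, chosen by me); the test string is made up to cover a plus at the start, at the end, and one surrounded by spaces:

```python
import re

plus_run = re.compile(r"""
    (?<!\s)   # no whitespace immediately before (start of string is fine)
    \++       # one or more pluses
    (?!\s)    # no whitespace immediately after (end of string is fine)
""", re.VERBOSE)

# Positions of the matched plus runs; the plus between spaces is skipped.
starts = [m.start() for m in plus_run.finditer("+a + b+c+")]

print(starts)   # [0, 6, 8]
```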
If your regex implementation doesn't support lookbehind, you could also use
\S\++(?!\s)
That way, your match would contain the character before the plus, but not after it, and therefore there will be no overlapping matches (Thanks Gumbo!). This will fail to match a plus at the start of the string, though (because the \S does need to match a character). But this is probably not a problem.
You can use the regex:
(?<=\S)\++(?=\S)
To match only the +'s that are surrounded by non-whitespace.