Add exceptions to complex regular expression (lookahead and lookbehind utilized) - regex

I'd like some help with regular expressions because I'm not really familiar with them.
So far, I have created the following regex:
/\b(?<![\#\-\/\>])literal(?![\<\'\"])\b/i
As https://regex101.com/ states:
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
Negative Lookbehind (?<![\#\-\/\>])
Assert that the Regex below does not match
Match a single character present in the list below [\#\-\/\>]
# matches the character # literally (case insensitive)
- matches the character - literally (case insensitive)
/ matches the character / literally (case insensitive)
> matches the character > literally (case insensitive)
literal matches the characters literal literally (case insensitive)
Negative Lookahead (?![\<\'\"])
Assert that the Regex below does not match
Match a single character present in the list below [\<\'\"]
\< matches the character < literally (case insensitive)
\' matches the character ' literally (case insensitive)
\" matches the character " literally (case insensitive)
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
Global pattern flags
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
I want to add two exceptions to this matching rule. 1) If the ">" is preceded by "p", for example in a <p> starting tag, the literal should still be matched. 2) The literal should also be matched when the "<" is followed by "/p", for example in a </p> closing tag.
How can I achieve this?
Example: only the bold ones should match.
<p>
**Literal** in computer science is a
<a href='http://www.google.com/something/literal#literal'>literal</a>
for representing a fixed value in source code. Almost all programming
<a href='http://www.google.com/something/else-literal#literal'>languages</a>
have notations for atomic values such as integers, floating-point
numbers, and strings, and usually for booleans and characters; some
also have notations for elements of enumerated types and compound
values such as arrays, records, and objects. An anonymous function
is a **literal** for the function type which is **LITERAL**
</p>
I know I have over-complicated things, but the situation is complicated itself and I think I have no other way.

If the text you're searching is just text mixed with some <a> tags, then you can simplify the < and > parts of the lookarounds, and give a specific string that it shouldn't be followed by: </a>.
/\b(?<![-#\/])literal(?!<\/a>)\b/i
Regex101 Demo
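
To make the behaviour of the simplified pattern concrete, here is a minimal sketch that runs it over a shortened copy of the sample from the question. Python's re module is an assumption here (the question does not name a language), and the variable names are only illustrative.

```python
import re

# The pattern from the answer above, with the i flag expressed as re.IGNORECASE.
pattern = re.compile(r"\b(?<![-#/])literal(?!</a>)\b", re.IGNORECASE)

# Shortened copy of the sample text from the question (bold markers removed).
html = """<p>
Literal in computer science is a
<a href='http://www.google.com/something/literal#literal'>literal</a>
for representing a fixed value in source code.
<a href='http://www.google.com/something/else-literal#literal'>languages</a>
An anonymous function is a literal for the function type which is LITERAL
</p>"""

# Occurrences preceded by -, # or /, or followed by </a>, are skipped.
matches = [m.group(0) for m in pattern.finditer(html)]
print(matches)  # ['Literal', 'literal', 'LITERAL']
```

Only the three occurrences in the running text are returned; the ones inside the href values and the link text are rejected by the lookarounds.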

Related

Regex: Match pattern unless preceded by pattern containing element from the matching character class

I am having a hard time coming up with a regex to match a specific case:
This can be matched:
any-dashed-strings
this-can-be-matched-even-though-its-big
This cannot be matched:
strings starting with elem- or asdf- or a single -
elem-this-cannot-be-matched
asdf-this-cannot-be-matched
-
So far what I came up with is:
/\b(?!elem-|asdf-)([\w\-]+)\b/
But I keep matching a single - and the whole -this-cannot-be-matched suffix. I cannot figure out how to conditionally exclude a character that is part of the matching character class, and how to match nothing at all when one of those prefixes is found.
I am currently working with the Oniguruma engine (Ruby 1.9+/PHP multi-byte string module).
If possible, please elaborate on the solution. Thanks a lot!
If a lookbehind is supported, you can assert a whitespace boundary to the left, and make the alternation for both words without the hyphen optional.
(?<!\S)(?!(?:elem|asdf)?-)[\w-]+\b
Explanation
(?<!\S) Assert a whitespace boundary to the left
(?! Negative lookahead, assert that what is directly to the right is not
(?:elem|asdf)?- Optionally match elem or asdf followed by -
) Close the lookahead
[\w-]+ Match 1+ word chars or -
\b A word boundary
See a regex demo.
Or a version with a capture group and without a lookbehind:
(?:\s|^)(?!(?:elem|asdf)?-)([\w-]+)\b
See another regex demo.
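
As a quick check of both variants, the sketch below applies them to the strings from the question. It uses Python's re module rather than Oniguruma, which is an assumption made only for illustration; the syntax used here is the same in both engines as far as I know.

```python
import re

# Variant with a lookbehind: the whole match is the wanted string.
with_lookbehind = re.compile(r"(?<!\S)(?!(?:elem|asdf)?-)[\w-]+\b")

# Variant without a lookbehind: the wanted string is in capture group 1.
with_group = re.compile(r"(?:\s|^)(?!(?:elem|asdf)?-)([\w-]+)\b")

# One candidate per line, taken from the question.
text = "\n".join([
    "any-dashed-strings",
    "elem-this-cannot-be-matched",
    "asdf-this-cannot-be-matched",
    "-",
    "this-can-be-matched-even-though-its-big",
])

found_lookbehind = with_lookbehind.findall(text)
found_group = with_group.findall(text)  # findall returns group 1 here

print(found_lookbehind)
print(found_group)
```

Both calls return only any-dashed-strings and this-can-be-matched-even-though-its-big; the prefixed strings and the lone hyphen are rejected by the lookahead.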

Find all words in a string that contain <sub> tags using regex

I have the following string:
CO<sub>2</sub> is one of the most abundant gases there is, while C<sub>2</sub>SO<sub>4</sub> is very corrosive. Drink H<sub>2</sub> to stay hydrated.
I want to extract all the words from this string that contain the sub tags.
I have gotten as far as this for my regular expression, but I can't seem to figure out how to continue.
https://regexr.com/495sp
The following should work:
/\w*<sub>\w*<\/sub>[^ \.]*/g
Demo
Explanation:
\w* - Matches any word characters before the first tag.
<sub> - Matches the first opening tag.
\w* - Matches the text between the first tags.
<\/sub> - Matches the first closing tag.
[^ \.]* - Matches any following characters that aren't spaces or full stops (in case the match occurs at the end of a sentence). Includes matching any further connected sub tags.
g flag - Enables global search, causing all occurrences to be matched.
Updated: to select all words that contain the <sub> tag
(\w+<sub>\w+<\/sub>)+
\w+ Matches one or more word characters
<sub> Matches the characters <sub> literally (case sensitive)
<\/sub> Matches the characters </sub> literally (case sensitive)
+ Matches between one and unlimited times
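
The two patterns above can be compared side by side. This is a minimal sketch in Python (an assumption; the question links to regexr, which uses JavaScript syntax, and these patterns behave the same way in both). The second input string is a made-up example added to show where the patterns differ.

```python
import re

text = ("CO<sub>2</sub> is one of the most abundant gases there is, "
        "while C<sub>2</sub>SO<sub>4</sub> is very corrosive. "
        "Drink H<sub>2</sub> to stay hydrated.")

first = re.compile(r"\w*<sub>\w*</sub>[^ .]*")
second = re.compile(r"(\w+<sub>\w+</sub>)+")

# finditer + group(0) gives the full match even when the pattern has a group.
first_hits = [m.group(0) for m in first.finditer(text)]
second_hits = [m.group(0) for m in second.finditer(text)]
print(first_hits)
print(second_hits)

# Hypothetical extra input: characters trailing the last </sub>.
extra = "Drink H<sub>2</sub>O, please."
first_extra = first.search(extra).group(0)    # keeps the trailing "O,"
second_extra = second.search(extra).group(0)  # stops at the last </sub>
print(first_extra, second_extra)
```

On the original string both patterns give the same three words. On the extra string the first pattern also picks up the trailing letter and comma, while the second stops after the closing tag, so which one fits depends on the input.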

Regular expression brackets containing any letters

I need to find brackets that contain any letter.
for example:
a17(1d34) xc
the previous brackets contain the letter d.
So I need to find: (1d34)
The following regex can do the job:
\([^a-z]*[a-z]+[^a-z]*\) with flags g and i
You can test it with the live demo at regex101 to check if it works with all the cases you expect.
Also, I don't know which language you are using; regex101 lets you generate code for some.
Breakdown
\( matches the literal opening bracket
[^a-z]* matches any characters before the letter that are not letters (can be nothing)
the ^ character right after the opening [ of a character class negates the class
[a-z]+ matches at least one letter
[^a-z]* matches any character after the letter that is not a letter (can be nothing)
\) matches the literal closing bracket
the flag i (case insensitive) extends the range a to z, to uppercase also
the flag g (global match) lets you match multiple times
Hope it helps!
/\((?:\d*[A-Z]+\d*)+\)/gi will match brackets that contain at least one letter and otherwise only digits.
var rgx = /\((?:\d*[A-Z]+\d*)+\)/gi;
rgx.exec("a17(1d34) xc"); //(1d34)
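
For comparison, here is a small Python sketch (Python is an assumption, the question does not name a language) that runs both patterns from the answers on the sample input. It also includes a second, made-up input on which the first pattern runs across two bracket pairs, plus a stricter variant that is not from either answer and is only one possible way to avoid that.

```python
import re

first = re.compile(r"\([^a-z]*[a-z]+[^a-z]*\)", re.IGNORECASE)
second = re.compile(r"\((?:\d*[A-Z]+\d*)+\)", re.IGNORECASE)

sample = "a17(1d34) xc"
first_sample = first.findall(sample)    # ['(1d34)']
second_sample = second.findall(sample)  # ['(1d34)']

# Hypothetical input: neither bracket pair contains a letter.
tricky = "a17(1234) xc (5)"
first_tricky = first.findall(tricky)    # ['(1234) xc (5)'], spans both pairs
second_tricky = second.findall(tricky)  # []

# Assumed stricter variant: parentheses are not allowed inside the match.
strict = re.compile(r"\([^()a-z]*[a-z][^()]*\)", re.IGNORECASE)
strict_sample = strict.findall(sample)  # ['(1d34)']
strict_tricky = strict.findall(tricky)  # []

print(first_sample, second_sample, first_tricky, second_tricky)
print(strict_sample, strict_tricky)
```

The first pattern's [^a-z]* can consume ) and (, which is why it spans from the first opening bracket to the last closing one on the tricky input. Whether that matters depends on the real data.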

What's a RegEx for "up to three words but no more than 20 characters"?

I can use \s?(\w+\s){0,2}\w* for "up to three words" and \w{0,20} for "no more than twenty characters", but how can I combine these? Trying to merge the two via a lookahead as mentioned here seems to fail.
Some examples for clarification:
The early bird catches the worm.
should match any three words in sequence (including the worm*).
Here we have a supercalifragilisticexpialidocious sentence.
"a supercalifragilisticexpialidocious sentence" is too long a sequence and therefore should not match.
* In my actual use case I'm going for a paragraph's last three words, i.e. a (?:\r) would be at the end of the RegEx and the match would be "catches the worm." Matches are then applied with a "no linebreaks" character style in Adobe InDesign in order to avoid orphans.
To match 3 words separated with whitespace(s) at the end of a line or string, you can use
\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])
See the regex demo. Note that in the demo, I use [^\S\r\n] instead of the \s in the lookahead since the text contains newlines, use the same trick if you need that.
Regex explanation
\b - a word boundary
(?!(?:\s*\w){21}) - a lookahead check that fails the match if after the initial word boundary there are 21 word characters optionally preceded with any number of whitespace symbols
\w+ - 1 word (consisting of 1 or more word characters)
(?:\s+\w+){0,2} - zero, one or two sequences of 1+ whitespaces followed with 1+ word characters
(?=$|[\r\n]) - a positive lookahead that only allows a match to be returned if there is the end-of-string ($) or the end of a line ([\r\n]).
Now, if your words should only contain letters, use [a-zA-Z] or equivalent for your language. If the regex flavor allows, use \p{L} Unicode category/property class.
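
The behaviour described above can be checked with a short sketch. Python's re module is used here as an assumption (the question targets InDesign's GREP engine, which may differ in details). Note that the lookahead counts word characters only, so whitespace does not count towards the limit of 20.

```python
import re

pattern = re.compile(r"\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])")

# Without trailing punctuation, the last three words fit within 20 word characters.
short_tail = pattern.search("The early bird catches the worm")
print(short_tail.group(0))  # catches the worm

# The long word pushes any two- or three-word tail over the limit,
# so only the final word is matched.
long_tail = pattern.search(
    "Here we have a supercalifragilisticexpialidocious sentence")
print(long_tail.group(0))  # sentence

# With a trailing full stop the final lookahead cannot succeed,
# because the match has to end on a word character.
with_period = pattern.search("The early bird catches the worm.")
print(with_period)  # None
```

If trailing punctuation has to be allowed, as in the question's example sentence, the end-of-line lookahead would need to be adapted; that is left out of this sketch.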

How to replace any_string#a.net to any_string#b.com

How to replace any_string#a.net to any_string#b.com using RegEx?
I want to strip the #a.net and replace it with #b.com
I've tried
(.*#a.net)
but $1 contains the whole string.
So when I try to replace it, it becomes
any_string#a.net#b.com
And can someone point me to a nice tutorial regarding RegEx?
The () indicates the capture group. Put the parts of the expression you don't want to capture outside the parens:
(.*)#a.net
A great site to play around with regular expressions is http://refiddle.com/.
I fiddled this problem already.
You can use
\b#[a-zA-Z].net\b
\b to set word boundaries before #
# matches the character # literally
a-z a single character in the range between a and z (case sensitive)
A-Z a single character in the range between A and Z (case sensitive)
. matches any character (except newline)
net matches the characters net literally (case sensitive)
\b word boundary
The above regex will match the given characters, which you can then replace with #b.com
And if you simply want to match only #a.net, then you can simply use
\b#a.net\b
Regex Demo
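
Both approaches from the answers can be sketched as follows. Python is an assumption (the question does not say which language or tool is used), and escaping the dot as \. is a small addition so that it only matches a literal full stop.

```python
import re

address = "any_string#a.net"

# Approach 1: capture everything before #a.net and reuse it in the replacement.
# \1 is Python's spelling of the $1 backreference mentioned in the question.
via_group = re.sub(r"(.*)#a\.net", r"\1#b.com", address)

# Approach 2: match only the #a.net part and replace it directly.
via_direct = re.sub(r"\b#a\.net\b", "#b.com", address)

print(via_group)   # any_string#b.com
print(via_direct)  # any_string#b.com
```

The second form avoids the capture group entirely, since only the part being replaced needs to be matched.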