Regex with exception of particular words - regex

I have problem with regex.
I need to make regex with an exception of a set of specified words, for example: apple, orange, juice.
and given these words, it will match everything except those words above.
applejuice (match)
yummyjuice (match)
yummy-apple-juice (match)
orangeapplejuice (match)
orange-apple-juice (match)
apple-orange-aple (match)
juice-juice-juice (match)
orange-juice (match)
apple (should not match)
orange (should not match)
juice (should not match)

If you really want to do this with a single regular expression, you might find lookaround helpful (especially negative lookahead in this example). Regex written for Ruby (some implementations have different syntax for lookarounds):
rx = /^(?!apple$|orange$|juice$)/

I noticed that apple-juice should match according to your parameters, but what about apple juice? I'm assuming that if you are validating apple juice you still want it to fail.
So - lets build a set of characters that count as a "boundary":
/[^-a-z0-9A-Z_]/ // Will match any character that is <NOT> - _ or
// between a-z 0-9 A-Z
/(?:^|[^-a-z0-9A-Z_])/ // Matches the beginning of the string, or one of those
// non-word characters.
/(?:[^-a-z0-9A-Z_]|$)/ // Matches a non-word or the end of string
/(?:^|[^-a-z0-9A-Z_])(apple|orange|juice)(?:[^-a-z0-9A-Z_]|$)/
// This should >match< apple/orange/juice ONLY when not preceded/followed by another
// 'non-word' character just negate the result of the test to obtain your desired
// result.
In most regexp flavors \b counts as a "word boundary" but the standard list of "word characters" doesn't include - so you need to create a custom one. It could match with /\b(apple|orange|juice)\b/ if you weren't trying to catch - as well...
If you are only testing 'single word' tests you can go with a much simpler:
/^(apple|orange|juice)$/ // and take the negation of this...

This gets some of the way there:
((?:apple|orange|juice)\S)|(\S(?:apple|orange|juice))|(\S(?:apple|orange|juice)\S)

\A(?!apple\Z|juice\Z|orange\Z).*\Z
will match an entire string unless it only consists of one of the forbidden words.
Alternatively, if you're not using Ruby or you're sure that your strings contain no line breaks or you have set the option that ^ and $ do not match on beginnings/ends of lines
^(?!apple$|juice$|orange$).*$
will also work.

Here's some easy copy-paste code that works for more than just exact-words exceptions.
Copy/Paste Code:
In the following regex, ONLY replace the all-caps sections with your regex.
Python regex
pattern = r"REGEX_BEFORE(?>(?P<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER"
Ruby regex
pattern = /REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(<exceptions_group_1>)always(?<=fail)|)REGEX_AFTER/
PCRE regex
REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER
JavaScript
Impossible as of 6/17/2020, and probably won't be possible in the near future.
Full Examples
REGEX_BEFORE = \b
YOUR_NORMAL_PATTERN = \w+
REGEX_AFTER =
EXCEPTION_PATTERN = (apple|orange|juice)
Python regex
pattern = r"\b(?>(?P<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)"
Ruby regex
pattern = /\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
PCRE regex
\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)
How does it work?
This uses decently complicated regex, namely Atomic Groups, Conditionals, Lookbehinds, and Named Groups.
The (?> is the start of an atomic group, which means its not allowed to backtrack: which means, If that group matches once, but then later gets invalidated because a lookbehind failed, then the whole group will fail to match. (We want this behavior in this case).
The (?<exceptions_group_1> creates a named capture group. Its just easier than using numbers. Note that the pattern first tries to find the exception, and then falls back on the normal pattern if it couldn't find the exception.
Note that the atomic pattern first tries to find the exception, and then falls back on the normal pattern if it couldn't find the exception.
The real magic is in the (?(exceptions_group_1). This is a conditional asking whether or not exceptions_group_1 was successfully matched. If it was, then it tries to find always(?<=fail). That pattern (as it says) will always fail, because its looking for the word "always" and then it checks 'does "ways"=="fail"', which it never will.
Because the conditional fails, this means the atomic group fails, and because it's atomic that means its not allowed to backtrack (to try to look for the normal pattern) because it already matched the exception.
This is definitely not how these tools were intended to be used, but it should work reliably and efficiently.
Exact answer to the original question in Ruby
/\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
Unlike other methods, this one can be modified to reject any pattern such as any word not containing the sub-string "apple","orange", or "juice".
/\b(?>(?<exceptions_group_1>\w*(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/

Something like (PHP)
$input = "The orange apple gave juice";
if(preg_match("your regex for validating") && !preg_match("/apple|orange|juice/", $input))
{
// it's ok;
}
else
{
//throw validation error
}

Related

Building regular expression ending with either one word or other

I would like some help in building a regular expression. The conditions are as follows
The expression must start with #
Then it should contain atleast one or more groups of alphanumeric characters separated by -
Each group contains atleast one alphanumeric character
The expression should end either with -apples or -bananas
Some test cases
#hshg1h2-hd212df-7632jhsd-bananas (Match)
#jhkj31j-jkh213j-jjkhjj324-apples (Match)
hjsdjjhsd-jhsshdjs-jdshdsj-apples (No Match)
#---apples (No Match)
#jhkj31j-jkh213j-jjkhjj324 (No Match)
#jhkj31j-jkh213j-jjkhjj324-apples-bananas (No Match)
I created the following expression
^#([a-zA-Z0-9]{1,}-){1,}(apples|bananas)$. For most of the test cases, it provides the correct result. However it also matches the test case 6 which it should not.
Background
The test cases simulate product-ids for the two products apples and
bananas. Those ids always contains as the last group -bananas or -apples. Thus -apples-bananas or vice versa is suppose be invalid product id.
Could anyone please show me how can I do this?
You can use lookaheads and alteration:
/^#(?!.*apples.*bananas|.*bananas.*apples)(?=[a-zA-Z0-9]+-[a-zA-Z0-9]+).*(?:apples|bananas)$/
Demo
And it is always good to use word boundary assertions:
/^#(?!.*\bapples\b.*\bbananas\b|.*\bbananas\b.*\bapples\b)(?=[a-zA-Z0-9]+-[a-zA-Z0-9]+).*(?:\bapples\b|\bbananas\b)$/
Alternatively, here is a modified version of the regex in comments (that is pretty good!)
^#(?:(?!\b(?:apples|bananas)\b)[a-zA-Z0-9]+-)+\b(?:apples|bananas)$
Demo

Regex to match(extract) string between dot(.)

I want to select some string combination (with dots(.)) from a very long string (sql). The full string could be a single line or multiple line with new line separator, and this combination could be in start (at first line) or a next line (new line) or at both place.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
, can I also add some prefix or suffix words or space which should be present where ever this logic gets checked. In above case if its update regex should pick test.table, but if the statement is like select test.table regex should not pick it up this combinations and same applies for suffix.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that understands any word character \w that matches between 0 and unlimited times, followed by a dot, followed by another series of word character that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w][.][\w])+/gm;
in Java:
final String regex = "([\w][.][\w])+";
in Python:
regex = r"([\w][.][\w])+"
in PHP:
$re = '/([\w][.][\w])+/m';
Note that: this solution is written for your use case (to be used for SQL strings), because now if you have something like '.word' or 'word..word', it will still catch it which I assume you don't have a string like that.
See this screenshot for more details

Regex to match text from multiple links

How to extract links which contain a certain word?
For e.g.:
https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text
How to search "word" from below regex?
((https:).*?(###))
The result should be like this
https://www.test.com/text/word/2
https://www.test.com/text/text/word/3
https://www.test.com/word/3/text/text
Let's try to build such regex. First we need to find the beginning of url:
/(https?:\/\//
We add ? after https for http urls.
Then we need to find any text except ###, so we need to add:
(?:(?!###).)*
which means - any amount of characters not starting a ### sequence.
Also we need to add word itself and previous sub-expression again, since word can be surrounded by any text:
word(?:(?!###).)*
But the thing is that last sub-expression will skip last character before ###, so we need to add one more thing to handle it:
.(?=###|$)
which means - any character followed by ### or end of string. The final expression will look like:
/(https:\/\/(?:(?!###).)*word(?:(?!###).)*.(?=###|$))/g
But i believe, it's better to just split text by ### and then check for needed word by String.prototype.includes.
If the word has to be a part of the pathname, you might use filter in combination with URL and check if the parts of the pathname contain word.
let str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
let filteredUrls = str.split("###")
.filter(s =>
new URL(s).pathname
.split('/')
.includes('word')
);
console.log(filteredUrls);
If you want to use regex only and possessive quantifiers are supported (The javascript tag has been removed) you might use:
https?://[^#w]*(?:#(?!##)|w(?!ord)|[^#w]*)++word.*?(?=###|$)
Regex demo
Previous answer
You for sure looking for this regular expression:
https://www.test.com/(text/)*word/\d+(/text)*
Here is how you can use it in JavaScript context (very slash / is escaped by backslash \/):
var str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var urls = str.match(/https:\/\/www.test.com\/(text\/)*word\/\d+(\/text)*/g);
console.log(urls);
In the array you get exactly the elements you wanted.
Update the answer after update question and adding comment by the author
If you need take the words from your example string, then you have to use a little more complex regular exception:
var str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var urls = str.match(/(?<=\/)\w+(?=\/\d+\/\w)|(?<=(\w\/\w+\/))\w+(?=\/\d)/g);
console.log(urls);
Explanation
Here is regular expression /(?<=(\w\/\w+\/))\w+(?=\/\d)|(?<=\/)\w+(?=\/\d+\/\w)/g, limited by /.../ and with the g flag forcing pattern searches for occurrence.
The regular expression has two alternatives ...|...
The first one (?<=\/)\w+(?=\/\d+\/\w) captures cases when the searched word is directly behind the slash (?<=\/) and before more words behind the number (?=\/\d+\/\w).
https://www.test.com/word/3/text/text
The second alternative (?<=(\w\/\w+\/))\w+(?=\/\d) captures cases where the word is preceded by other words following the domain (?<=(\w\/\w+\/)) (in fact two slashes separated by alphanumeric characters) and the searched word is immediately before the slash followed by the number (?=\/\d).
https://www.test.com/text/word/2
https://www.test.com/text/text/word/3
All slashes must be escaped: \/.
The construction (?<=...) means lookbehind in regular expressions and (?=...) means lookahead in regular expressions.
Note 1. The above example currently only works well in a Chrome browser, as that:
(...) now lookbehind is part of the ECMAScript 2018 specification. As of this writing (late 2018), Google's Chrome browser is the only popular JavaScript implementation that supports lookbehind. So if cross-browser compatibility matters, you can't use lookbehind in JavaScript.
Note 2. Lookbehnd, even if it is interpreted correctly, in most regular expression engines must contain a fixed length regular expression, which I do not keep in the example above, because this one is still valid and works for regular expression engines used in Google Chrome's JavaScript engine, JGsoft engine and .NET framework RegEx classes.
Note 3. The lookbehind syntax or its poorer \K replacement are widely supported by many regular expression engines used in a large group of programming languages.
More explanation about regular expressions which I used you can find for example here.
You may first split by ### then check whether /word/ exists in each element:
var s = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var result = [];
s.split(/###/).forEach(function(el) {
if (el.includes('/word/'))
result.push(el);
})
// or else by using filter
// result = s.split(/###/).filter(el => el.includes('/word/'))
console.log(result);

How do you match a pattern skipping exceptions?

In vim, I'd like to match a regular expression in a search and replace operation, but with exceptions — a list of matches that I want to skip.
For example, suppose I have the text:
-one- lorem ipsum -two- blah blah -three- now is the time -four- the quick brown -five- etc. etc.
(but with lots of other possibilities) and I want to match -\(\w\+\)- and replace it with *\1* but skipping over (not matching) -two- and -four-, so the result would be:
*one* lorem ipsum -two- blah blah *three* now is the time -four- the quick brown *five* etc. etc.
It seems like I should be able to use some kind of assertion (lookbehind, lookahead, something) for this, but I'm coming up blank.
You're looking for a negative lookahead assertion. In Vim, that's done via :help /\#!, like (?!pattern) in Perl.
Basically, you say don't match FOO here, and in general match word characters:
/-\(\%(FOO\)\#!\w\+\)-/
Note how I'm using non-capturing groups (:help /\%(). What's still missing is an assertion on the end, so the above would also exclude -FOOBAR-. As we have a unique end delimiter here, it's easiest to append that:
/-\(\%(FOO-\)\#!\w\+\)-/
Applied to your example, you just need to introduce two branches (for the two exclusions) in place of FOO, and you're done:
/-\(\%(two-\|four-\)\#!\w\+\)-/
Or, by factoring out the duplicated end delimiter:
/-\(\%(\%(two\|four\)-\)\#!\w\+\)-/
This matches any word characters in between dashes, except if those words form either two or four.
The negative lookahead in my other answer is the direct solution, but its syntax is a bit complex (and there can be patterns where the rules for the delimiter are not so simple, and the result then is much less readable).
As you're using substitution, an alternative is to put the selection logic into the replacement part of :substitute. Vim allows a Vimscript expression in there via :help sub-replace-expression.
We have the captured word in submatch(1) (equivalent to \1 in a normal replacement), and now just need to check for the two excluded words; if it's one of those, do a no-op substitution by returning the original full match (submatch(0)), else just return the captured group.
:substitute/-\(\w\+\)-/\=submatch(index(['two', 'four'], submatch(1)) == -1 ? 1 : 0)/g
It's not shorter than the lookahead pattern (well, we could golf the pattern and drop the ternary operator, as a boolean is represented by 0/1, anyway), so here I would still use the pattern. But in general, it's good to know that there's more than one way to do it :-)

Get all matches for a certain pattern using RegEx

I am not really a RegEx expert and hence asking a simple question.
I have a few parameters that I need to use which are in a particular pattern
For example
$$DATA_START_TIME
$$DATA_END_TIME
$$MIN_POID_ID_DLAY
$$MAX_POID_ID_DLAY
$$MIN_POID_ID_RELTM
$$MAX_POID_ID_RELTM
And these will be replaced at runtime in a string with their values (a SQL statement).
For example I have a simple query
select * from asdf where asdf.starttime = $$DATA_START_TIME and asdf.endtime = $$DATA_END_TIME
Now when I try to use the RegEx pattern
\$\$[^\W+]\w+$
I do not get all the matches(I get only a the last match).
I am trying to test my usage here https://regex101.com/r/xR9dG0/2
If someone could correct my mistake, I would really appreciate it.
Thanks!
This will do the job:
\$\$\w+/g
See Demo
Just Some clarifications why your regex is doing what is doing:
\$\$[^\W+]\w+$
Unescaped $ char means end of string, so, your pattern is matching something that must be on the end of the string, that's why its getting only the last match.
This group [^\W+] doesn't really makes sense, groups starting with [^..] means negate the chars inside here, and \W is the negation of words, and + inside the group means literally the char +, so you are saying match everything that is Not a Not word and that is not a + sign, i guess that was not what you wanted.
To match the next word just \w+ will do it. And the global modifier /g ensures that you will not stop on the first match.
This should work - Based on what you said you wanted to match this should work . Also it won't match $$lower_case_strings if that's what you wanted. If not, add the "i" flag also.
\${2}[A-Z_]+/g