Oracle Regex match all words ignoring order - regex

I need to find records matching all query phrases, ignoring the order in which they occur.
For example, my query string is apple banana kiwi. The following values should be true:
I like apple, banana and kiwi
Banana, kiwi and apple are fruits
The following value should be false:
He does not like kiwi
How can I implement this in SQL on Oracle 11?

In a modern regex engine you would use look-ahead assertions to combine the three conditions into one expression:
^(?=.*?\bapple\b)(?=.*?\bbanana\b)(?=.*?\bkiwi\b)
Oracle does not support look-aheads, though, and that means you cannot write an expression that checks all three conditions at the same time(*).
Your options:
Split up the regular expression and combine multiple simple expressions with AND - this is the slowest variant, but it would work.
Dump regular expressions and use multiple LIKE clauses with AND - this will be a little faster than regex, but expression complexity is limited in comparison.
Set up a full text index on that table and use it - this will be the fastest variant, but expression complexity is limited compared to regex. It will be sufficient for a pure natural language keyword search, though, and it would support stemming and alternative word forms.
(*) Technically, academically, you can. You could write an expression that checks all possible permutations of your keywords, like this
A.*?B.*?C|B.*?C.*?A|C.*?A.*?B|...and so on|and so forth
Think whether you would call this an acceptable solution. Oh yeah and it would be slow as hell, too.
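To illustrate the look-ahead idea and the "several simple expressions combined with AND" option, here is a minimal sketch in Python rather than Oracle SQL (Python's re module supports look-aheads, Oracle 11 does not). The function names and the sample list are made up for this example.

```python
import re

# One look-ahead per keyword, all evaluated from the start of the string,
# so the order of the words in the text does not matter.
ALL_WORDS = re.compile(
    r"^(?=.*?\bapple\b)(?=.*?\bbanana\b)(?=.*?\bkiwi\b)",
    re.IGNORECASE,
)

def matches_with_lookaheads(text):
    """True if all three keywords occur as whole words, in any order."""
    return ALL_WORDS.search(text) is not None

def matches_with_and(text, words=("apple", "banana", "kiwi")):
    """Same check written as several simple tests combined with AND."""
    return all(
        re.search(r"\b" + re.escape(w) + r"\b", text, re.IGNORECASE)
        for w in words
    )

samples = [
    "I like apple, banana and kiwi",
    "Banana, kiwi and apple are fruits",
    "He does not like kiwi",
]
for s in samples:
    print(s, matches_with_lookaheads(s), matches_with_and(s))
```

In Oracle the second form would roughly correspond to one REGEXP_LIKE or LIKE condition per keyword, joined with AND.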

Here is an attempt:
with w as -- The words
(
select 'apple banana kiwi' words from dual
),
p as -- the patterns taken from the words
(
select regexp_substr(w.words, '\w+', 1, level) pattern
from w
connect by regexp_substr(w.words, '\w+', 1, level) is not null
),
r as -- the phrases to test
(
select 'I like apple, banana and kiwi' phrase from dual
union all
select 'Banana, kiwi and apple are fruits' phrase from dual
union all
select 'He does not like kiwi' phrase from dual
)
select r.phrase,
case sum(case instr(upper(r.phrase), upper(p.pattern))
when 0 then 0
else 1 end)
when regexp_count(w.words, '\w+', 1) then 'true'
else 'false' end all_present
from r, p, w
group by r.phrase, w.words
;
And the result:
He does not like kiwi false
Banana, kiwi and apple are fruits true
I like apple, banana and kiwi true
The principle:
test for every pattern if it's in the phrase (by instr: if 0, it's not present, else it is)
group by phrase to sum this match
if this sum is equal to the number of words tested (here, 3), this is true
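The counting principle can also be sketched outside SQL. The following Python snippet is only an illustration under the same assumptions as the query (case-insensitive substring test, words separated by whitespace); the name all_present is borrowed from the column alias above, everything else is hypothetical.

```python
def all_present(phrase, words):
    """Count how many query words occur in the phrase and compare
    that count with the number of query words."""
    patterns = words.split()
    # Case-insensitive substring test, like instr(upper(...), upper(...)) <> 0
    hits = sum(1 for p in patterns if p.upper() in phrase.upper())
    return hits == len(patterns)

query = "apple banana kiwi"
for phrase in (
    "I like apple, banana and kiwi",
    "Banana, kiwi and apple are fruits",
    "He does not like kiwi",
):
    print(phrase, all_present(phrase, query))
```

Note that a plain substring test also accepts partial words: a phrase containing "pineapple" would count as a hit for "apple".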

Related

Adding three cells if a cell has a value, and if it doesn't, adding two

So basically here's what I want to do:
I need to add cells B12 and C12 normally, however -
If cell C3 has a certain text value (let's say "Apples"), I need to add B12, C12, and K3.
But if C3 -isn't- Apples, it should just add B12 and C12.
Additionally, I have two versions of Apples: "Apples - Red" and "Apples - Green". Maybe an Apples wildcard?
try simple:
=IF(COUNTIF(C3, "Apples*"), SUM(B12:C12, K3), SUM(B12:C12))
or:
=IF(REGEXMATCH(C3, "Apples"), SUM(B12:C12, K3), SUM(B12:C12))
if your "Apples" are numbers you can do:
=IF(COUNTIF(C3&"", "123*"), SUM(B12:C12, K3), SUM(B12:C12))
or:
=IF(REGEXMATCH(C3&"", "123"), SUM(B12:C12, K3), SUM(B12:C12))
=if(REGEXMATCH(C3, "(?i).*apples.*"), SUM(B12,C12,K3), SUM(B12,C12))
(?i).*apples.* is a regular expression that matches any string that contains the word apples, ignoring case. So it will match any of the following cell contents: Apples - Red, AWFEFAPPLESWEFWE, apples, apples - purple, red aPPLES, etc. You can narrow the regex a bit if you want to be more strict.
Breaking it down, the regex is built as follows:
(?i) - Ignore case for the entire matching pattern
. - Matches any character.
* - repeats that match 0 or more times.
apples - indicates that we have to match apples
.* - Like above, matches any string, including a zero-length string.
So it translates to "Ignoring case, match any string that has apples in it."
REGEXMATCH() is a google spreadsheet function that lets us compare a cell's contents against a regular expression.
The rest of it is just a standard if.
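For readers who want to check the logic outside a spreadsheet, here is a small Python sketch of the same decision. The function name total and its parameters are hypothetical stand-ins for the cells C3, B12, C12 and K3.

```python
import re

def total(c3, b12, c12, k3):
    """Mirror of =IF(REGEXMATCH(C3, "(?i).*apples.*"), SUM(B12,C12,K3), SUM(B12,C12))."""
    # re.search finds "apples" anywhere in the text, so the leading and
    # trailing .* of the spreadsheet pattern are not needed here.
    if re.search(r"apples", str(c3), re.IGNORECASE):
        return b12 + c12 + k3
    return b12 + c12

print(total("Apples - Red", 1, 2, 10))
print(total("Pears", 1, 2, 10))
```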

POSIX ERE Regular expression to find repeated substring

I have a set of strings containing a minimum of 1 and a maximum of 3 values in a format like this:
123;456;789
123;123;456
123;123;123
123;456;456
123;456;123
I'm trying to write a regular expression so I can find values repeated on the same string, so if you have 123;456;789 it would return null but if you had 123;456;456 it would return 456 and for 123;456;123 return 123
I managed to write this expression:
(.*?);?([0-9]+);?(.*?)\2
It works in the sense that it returns null when there are no duplicate values but it doesn't return exactly the value I need, e.g. for the string 123;456;456 it returns 123;456;456 and for the string 123;123;123 it returns 123;123
What I need is to return only the value for the ([0-9]+) portion of the expression, from what I've read this would normally be done using non-capturing groups. But either I'm doing it wrong or Oracle SQL doesn't support this as if I try using the ?: syntax the result is not what I expect.
Any suggestions on how you would go about this on oracle sql? The purpose of this expression is to use it on a query.
SELECT REGEXP_SUBSTR(column, "expression") FROM DUAL;
EDIT:
Actually according to https://docs.oracle.com/cd/B12037_01/appdev.101/b10795/adfns_re.htm
Oracle Database implements regular expression support compliant with the POSIX Extended Regular Expression (ERE) specification.
Which according to https://www.regular-expressions.info/refcapture.html
Non-capturing group is not supported by POSIX ERE
This answer describes how to select a matching group from a regex. So using that,
SELECT regexp_substr(column, '(\d{3}).*\1', 1, 1, NULL, 1) from dual;
-- the last argument (1) selects capture group 1
Working demo of the regex (courtesy: OP).
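The same backreference idea can be tried in Python. This is only a sketch: it assumes, like the expression above, that every value has exactly three digits, and the function name repeated_value is made up for the example.

```python
import re

def repeated_value(s):
    """Return the first three-digit value that occurs again later in
    the string, or None if no value is repeated."""
    m = re.search(r"(\d{3}).*\1", s)
    # group(1) plays the role of the final argument of regexp_substr
    return m.group(1) if m else None

for s in ("123;456;789", "123;123;456", "123;123;123",
          "123;456;456", "123;456;123"):
    print(s, repeated_value(s))
```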
If you only have three substrings, then you can use a brute force method. It is not particularly pretty, but it should do the job:
select (case when val1 in (val2, val3) then val1
when val2 = val3 then val2
end) as repeated
from (select t.*,
regexp_substr(col, '[^;]+', 1, 1) as val1,
regexp_substr(col, '[^;]+', 1, 2) as val2,
regexp_substr(col, '[^;]+', 1, 3) as val3
from t
) t
where val1 in (val2, val3) or val2 = val3;
Please bear with me and consider a different approach. Look at the problem a little differently and break it down in a way that gives you more flexibility in how you are able to look at the data. It may or may not apply to your situation, but it is worth keeping in mind that there are always different ways to approach a problem.
What if you turned the strings into rows so you could do standard SQL against them? That way you could not only count elements that repeat but perhaps apply aggregate functions to look for patterns across sets or something.
Consider this then. The first Common Table Expression (CTE) builds the original data set. The second one, tbl_split, turns that data into a row for each element in the list. Uncomment the select that immediately follows to see. The last query selects from the split data, showing the count of how often the element occurs in the id's data. Uncomment the HAVING line to restrict the output to those elements that appear more than one time for the data you are after.
With the data in rows you can see how other aggregate functions could be applied to slice and dice to reveal patterns, etc.
SQL> with tbl_orig(id, str) as (
select 1, '123;456;789' from dual union all
select 2, '123;123;456' from dual union all
select 3, '123;123;123' from dual union all
select 4, '123;456;456' from dual union all
select 5, '123;456;123' from dual
),
tbl_split(id, element) as (
select id,
regexp_substr(str, '(.*?)(;|$)', 1, level, NULL, 1) element
from tbl_orig
connect by level <= regexp_count(str, ';')+1
and prior id = id
and prior sys_guid() is not null
)
--select * from tbl_split;
select distinct id, element, count(element)
from tbl_split
group by id, element
--having count(element) > 1
order by id;
ID ELEMENT COUNT(ELEMENT)
---------- ----------- --------------
1 123 1
1 456 1
1 789 1
2 123 2
2 456 1
3 123 3
4 123 1
4 456 2
5 123 2
5 456 1
10 rows selected.
SQL>
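The split-then-count idea is not specific to SQL. As a rough illustration, the Python sketch below splits each string on the semicolon and counts the elements; the dictionary rows copies the sample data, and repeated_elements is a hypothetical helper that corresponds to the query with the HAVING line enabled.

```python
from collections import Counter

rows = {
    1: "123;456;789",
    2: "123;123;456",
    3: "123;123;123",
    4: "123;456;456",
    5: "123;456;123",
}

def repeated_elements(s):
    """Split on ';', count each element, keep those seen more than once."""
    counts = Counter(s.split(";"))
    return {elem: n for elem, n in counts.items() if n > 1}

for row_id, s in rows.items():
    print(row_id, repeated_elements(s))
```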

Regular Expression - Not matching certain characters and position of characters

SOLUTION:
Finally solved it using the regex provided by Gary_W below and a simple PowerShell command that uses the discussed replacement function. So there was no need to use the built-in regex activity in the software we use. Here's the PS:
"100,000.00" -replace "([,.]\d{2}$)|[,.]",""
Regular Expressions are freaking me out. I cannot get used to that logic. However, I think my current RE problem is quite a simple one, but I cannot make it work :(
So here's what I want to achieve:
I want the RE to match only the digits before the last two decimal places.
Thus, the RE must ignore any "." and "," AND always the last two digits.
> Examples:
> 1.000.000,00 --> 1000000
> 123,456.00 --> 123456
> 100.000,00 --> 100000
> 10.000,00 --> 10000
> 10,000.00 --> 10000
> 1.000,00 --> 1000
> 100,00 --> 100
> 99.88 --> 99
> 99,88 --> 99
> 1,23 --> 1
> ...
Any ideas how to get this working?
Here's how I would do it in Oracle, for what it's worth. Maybe the regex used here will give you an idea. Read the regex as "Look for a match of a comma or a decimal point followed by 2 digits at the end of the line, OR a comma or a decimal point, and replace with nothing."
Note the match for the optional decimal places at the end needs to be first in the regex, otherwise the single characters are matched first, making the 2 decimal places non-existent and thus not matched.
SQL> with tbl(str) as (
select '1.000.000,00' from dual union all
select '123,456.00' from dual union all
select '100.000,00' from dual union all
select '10.000,00' from dual union all
select '10,000.00' from dual union all
select '1.000,00' from dual union all
select '100,00' from dual union all
select '99.88' from dual union all
select '99,88' from dual union all
select '1,23' from dual union all
select '3' from dual
)
select str,
regexp_replace(str, '([,.]\d{2}$)|[,.]') fixed
from tbl;
STR FIXED
------------ ------------
1.000.000,00 1000000
123,456.00 123456
100.000,00 100000
10.000,00 10000
10,000.00 10000
1.000,00 1000
100,00 100
99.88 99
99,88 99
1,23 1
3 3
11 rows selected.
SQL>
Just saw the regexr link, plugging in my regex looks like it works with the global flag. The characters you wish to remove are highlighted.
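The same pattern can be checked quickly in Python, whose re.sub replaces every match, much like regexp_replace does by default. This is a minimal sketch; the function name is made up for the example.

```python
import re

def strip_separators_and_decimals(s):
    """Remove a trailing separator plus two digits, and every other
    comma or dot. The end-anchored alternative is listed first so the
    decimals are consumed together with their separator."""
    return re.sub(r"([,.]\d{2}$)|[,.]", "", s)

for s in ("1.000.000,00", "123,456.00", "99.88", "1,23", "3"):
    print(s, strip_separators_and_decimals(s))
```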
In which language/with which tool? With sed, you can do:
sed 's/\(.*\)[\.,]../\1/;s/[\.,]//g'
In perl it's similar, just without the initial backslashes:
perl -pe 's/(.*)[\.,]../\1/;s/[\.,]//g'
This is done with two regexes, by the way. The first one reads "save all that you can, up to a dot or a comma followed by two chars, and then replace the whole match with that". The second one reads "replace all dots and commas with nothing", that is, "remove all dots and commas".
In regexr.com you can use "Replace" in Tools to replace the match with the first capture group. Just put (.*)[\.,].. in Expression, and $1 in Replace, to see the first regex working. Then you can do something similar with the second one, as regexr doesn't support chaining of expressions, as far as I can see.
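The two-step approach translates directly to other tools. As an illustration, here is a hedged Python version of the two substitutions; two_step is a hypothetical name, and count=1 mimics sed's s/// without the g flag.

```python
import re

def two_step(s):
    """Step 1: keep everything before the last separator that is
    followed by two characters. Step 2: drop remaining separators."""
    s = re.sub(r"(.*)[.,]..", r"\1", s, count=1)
    return re.sub(r"[.,]", "", s)

for s in ("1.000.000,00", "10,000.00", "99,88", "3"):
    print(s, two_step(s))
```

Like the sed version, this assumes the input always ends in a separator and two decimals (or has no separator at all); a value such as 1,234 without decimals would lose digits.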

Oracle regexp_replace - Adding space to separate sentences

I am working in Oracle to fix some text. The issue is that sentences in my data have words where sentences aren't separated by spaces. For example:
Sentence without space.Between sentences
Sentence with question mark?Second sentence
I've tested the following replace statement in regex101 and it seems to work out there, but I can't pinpoint why it's not working in Oracle:
regexp_replace(review_text, '([^\s\.])([\.!\?]+)([^\s\.\d])', '\1\2 \3')
This should allow me to look for sentence-separating periods/exclamation points/question marks (single or grouped) and add the necessary space between sentences. I realize that there are other ways that sentences can be separated, but what I have above should cover a large majority of the use cases. The \d in the third capture group is to make sure that I'm not accidentally changing numeric values like "4.5" to "4. 5".
Before test group:
Sentence without space.Between sentences
Sentence with space. Between sentences
Sentence with multiple periods...Between sentences
False positive sentence with 4.5 Liters
Sentence with!Exclamation point
Sentence with!Question mark
After changes should look like this:
Sentence without space. Between sentences
Sentence with space. Between sentences
Sentence with multiple periods... Between sentences
False positive sentence with 4.5 Liters
Sentence with! Exclamation point
Sentence with! Question mark
Regex101 link: https://regex101.com/r/dC9zT8/1
While all changes work as expected in regex101, my issue is that in Oracle my third and fourth test cases aren't working as intended. Oracle isn't adding a space after the multiple period (ellipsis) group, and regexp_replace is adding a space for "4.5". I'm not sure why this is the case, but perhaps there's some peculiarity about Oracle regexp_replace that I'm missing.
Any and all insight is appreciated. Thanks!
This may get you started. This will check for .?! in any combination, followed by zero or more spaces and by an uppercase letter, and it will replace "zero or more spaces" by exactly one space. This will not separate a decimal number; but it will miss sentences that begin with anything other than an uppercase letter. You may start adding conditions - if you run into difficulty please write back and we'll try to help. Referring to other regex dialects may be helpful, but it may not be the fastest way to get your answer.
with
inputs ( str ) as (
select 'Sentence without space.Between sentences' from dual union all
select 'Sentence with space. Between sentences' from dual union all
select 'Sentence with multiple periods...Between sentences' from dual union all
select 'False positive sentence with 4.5 Liters' from dual union all
select 'Sentence with!Exclamation point' from dual union all
select 'Sentence with!Question mark' from dual
)
select regexp_replace(str, '([.!?]+)\s*([A-Z])', '\1 \2') as new_str
from inputs;
NEW_STR
-------------------------------------------------------
Sentence without space. Between sentences
Sentence with space. Between sentences
Sentence with multiple periods... Between sentences
False positive sentence with 4.5 Liters
Sentence with! Exclamation point
Sentence with! Question mark
6 rows selected.
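To experiment with this pattern outside the database, the following Python sketch applies the same expression and replacement to the test strings. The function name add_sentence_space is an assumption made for the example.

```python
import re

def add_sentence_space(s):
    """After a run of . ! or ? replace any amount of whitespace
    (including none) before an uppercase letter with one space."""
    return re.sub(r"([.!?]+)\s*([A-Z])", r"\1 \2", s)

tests = [
    "Sentence without space.Between sentences",
    "Sentence with space. Between sentences",
    "Sentence with multiple periods...Between sentences",
    "False positive sentence with 4.5 Liters",
    "Sentence with!Exclamation point",
]
for t in tests:
    print(add_sentence_space(t))
```

Because the character after the punctuation must be an uppercase letter, "4.5" is left alone, at the cost of missing sentences that start with a digit or a lowercase letter.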

Regular expression that both includes and excludes certain strings in R

I am trying to use R to parse through a number of entries. I have two requirements for the entries I want back. I want all the entries that contain the word apple but don't contain the word orange.
For example:
I like apples
I really like apples
I like apples and oranges
I want to get entries 1 and 2 back.
How could I go about using R to do this?
Thanks.
Could do
temp <- c("I like apples", "I really like apples", "I like apples and oranges")
temp[grepl("apple", temp) & !grepl("orange", temp)]
## [1] "I like apples" "I really like apples"
Using a regular expression, you could do the following.
x <- c('I like apples', 'I really like apples',
'I like apples and oranges', 'I like oranges and apples',
'I really like oranges and apples but oranges more')
x[grepl('^((?!.*orange).)*apple.*$', x, perl=TRUE)]
# [1] "I like apples" "I really like apples"
The regular expression looks ahead to see if there's no character except a line break and no substring orange and if so, then the dot . will match any character except a line break as it is wrapped in a group, and repeated (0 or more times). Next we look for apple and any character except a line break (0 or more times). Finally, the start and end of line anchors are in place to make sure the input is consumed.
UPDATE: You could use the following if performance is an issue.
x[grepl('^(?!.*orange).*$', x, perl=TRUE)]
This regex is a bit smaller and much faster than the other regex versions (see comparison below). I don't have the tools to compare to David's double grepl so if someone can compare the single grep below vs the double grepl we'll be able to know. The comparison must be done both for a success case and a failure case.
^(?!.*orange).*apple.*$
The negative lookahead ensures we don't have orange
We just match the string, so long as it contains apple. No need for a lookahead there.
Code Sample
grep("^(?!.*orange).*apple.*$", subject, perl=TRUE, value=TRUE);
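For comparison, both approaches (two separate tests, and a single pattern with a negative look-ahead) can be sketched in Python, which uses Perl-style syntax for look-aheads. The list entries copies the sample vector from the earlier answer; the function names are hypothetical.

```python
import re

entries = [
    "I like apples",
    "I really like apples",
    "I like apples and oranges",
    "I like oranges and apples",
    "I really like oranges and apples but oranges more",
]

def with_two_tests(items):
    """Include 'apple', exclude 'orange', as two separate checks."""
    return [s for s in items if "apple" in s and "orange" not in s]

def with_lookahead(items):
    """Single pattern: reject 'orange' anywhere, then require 'apple'."""
    pattern = re.compile(r"^(?!.*orange).*apple.*$")
    return [s for s in items if pattern.search(s)]

print(with_two_tests(entries))
print(with_lookahead(entries))
```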
Speed Comparison
@hwnd has now removed that double lookahead version, but according to RegexBuddy the speed difference remains:
Against I like apples and oranges, the engine takes 22 steps to fail, vs. 143 for the double lookahead version ^(?=.*apple)((?!orange).)*$ and 22 steps for ^((?!.*orange).)*apple.*$ (equal there but wait for point 2).
Against I really like apples, the engine takes 64 steps to succeed, vs. 104 for the double lookahead version ^(?=.*apple)((?!orange).)*$ and 538 steps for ^((?!.*orange).)*apple.*$.
These numbers were provided by the RegexBuddy debugger.