Oracle regexp_replace - Adding space to separate sentences

I am working in Oracle to fix some text. The issue is that some sentences in my data aren't separated by spaces. For example:
Sentence without space.Between sentences
Sentence with question mark?Second sentence
I've tested the following replace statement in regex101 and it seems to work out there, but I can't pinpoint why it's not working in Oracle:
regexp_replace(review_text, '([^\s\.])([\.!\?]+)([^\s\.\d])', '\1\2 \3')
This should allow me to look for sentence-separating periods/exclamation points/question marks (single or grouped) and add the necessary space between sentences. I realize that there are other ways that sentences can be separated, but what I have above should cover a large majority of the use cases. The \d in the third capture group is to make sure that I'm not accidentally changing numeric values like "4.5" to "4. 5".
Before test group:
Sentence without space.Between sentences
Sentence with space. Between sentences
Sentence with multiple periods...Between sentences
False positive sentence with 4.5 Liters
Sentence with!Exclamation point
Sentence with!Question mark
After changes should look like this:
Sentence without space. Between sentences
Sentence with space. Between sentences
Sentence with multiple periods... Between sentences
False positive sentence with 4.5 Liters
Sentence with! Exclamation point
Sentence with! Question mark
Regex101 link: https://regex101.com/r/dC9zT8/1
While all changes work as expected in regex101, my issue in Oracle is that my third and fourth test cases aren't working as intended. Oracle isn't adding a space after the multiple-period (ellipsis) group, and regexp_replace is adding a space inside "4.5". I'm not sure why this is the case, but perhaps there's some peculiarity about Oracle's regexp_replace that I'm missing.
Any and all insight is appreciated. Thanks!

This may get you started. This will check for .?! in any combination, followed by zero or more spaces and by an uppercase letter, and it will replace "zero or more spaces" by exactly one space. This will not separate a decimal number; but it will miss sentences that begin with anything other than an uppercase letter. You may start adding conditions - if you run into difficulty please write back and we'll try to help. Referring to other regex dialects may be helpful, but it may not be the fastest way to get your answer.
with
inputs ( str ) as (
select 'Sentence without space.Between sentences' from dual union all
select 'Sentence with space. Between sentences' from dual union all
select 'Sentence with multiple periods...Between sentences' from dual union all
select 'False positive sentence with 4.5 Liters' from dual union all
select 'Sentence with!Exclamation point' from dual union all
select 'Sentence with!Question mark' from dual
)
select regexp_replace(str, '([.!?]+)\s*([A-Z])', '\1 \2') as new_str
from inputs;
NEW_STR
-------------------------------------------------------
Sentence without space. Between sentences
Sentence with space. Between sentences
Sentence with multiple periods... Between sentences
False positive sentence with 4.5 Liters
Sentence with! Exclamation point
Sentence with! Question mark
6 rows selected.
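As for why the original pattern misbehaves: as far as I know, Oracle follows POSIX rules inside bracket expressions, so \s and \d inside [...] are not treated as character classes but as a literal backslash plus the letters s and d (the POSIX classes [[:space:]] and [[:digit:]] are the usual substitutes there). The Python sketch below is only an emulation of that assumed reading, not Oracle itself; the names emulate_oracle and fix_spacing are made up for illustration. It reproduces both symptoms from the question and then runs a direct port of the pattern from the query above.

```python
import re

# Assumed Oracle/POSIX reading of the original pattern: inside [...] the
# backslash, "s" and "d" are ordinary characters, not \s and \d classes.
literal_reading = re.compile(r'([^\\s.])([.!?]+)([^\\s.d])')

# Direct Python port of the pattern used in the query above.
answer_pattern = re.compile(r'([.!?]+)\s*([A-Z])')


def emulate_oracle(text):
    """Apply the original replacement under the assumed literal reading."""
    return literal_reading.sub(r'\1\2 \3', text)


def fix_spacing(text):
    """Put exactly one space between end punctuation and an uppercase letter."""
    return answer_pattern.sub(r'\1 \2', text)


samples = [
    "Sentence without space.Between sentences",
    "Sentence with multiple periods...Between sentences",
    "False positive sentence with 4.5 Liters",
]
for s in samples:
    print(emulate_oracle(s), "|", fix_spacing(s))
```

Under the literal reading, "periods..." gets no space because the character before the dots is an "s", which the first group now excludes, and "4.5" is split because "5" is no longer excluded by the third group.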

Related

remove periods, not decimal points

I have a long string variable (many sentences separated by ".") with some important numerical information, generally with a decimal point (e.g., "6.5 lbs").
I would like to regex out all the periods when they appear at the end of a sentence, but leave them when they appear between numbers.
FROM:
First sentence. Second sentence contains a number 1.0 and more words. One more sentence.
TO:
First sentence Second sentence contains a number 1.0 and more words One more sentence
I am doing this in Stata, using Unicode regex functions which follow this standard: http://userguide.icu-project.org/strings/regexp
What I thought I was doing in the following is: "replace the period with a space when the previous character is a lowercase letter".
gen new_variable = ustrregexrf(note_text, "(?<=[a-z])\.", " ")
I find that it will remove one period per line, but will not remove all of them. Maybe what I need to do is tell it: do this for all the periods you find satisfying the condition, but since it's not working the way I think it is already maybe I need an explanation of what it actually is doing.
Bonus points if you can tell me how to remove a period when there is a number followed by a space:
number is 1.0. Next sentence -> number is 1.0 Next sentence
EDIT: there are occasionally strings like end sentence.begin next sentence without spacing so separating on white space won't handle all of my cases.
Method 1
Maybe,
\.(?=\s|$)
might be OK to look into.
Demo 1
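To illustrate Method 1, and the "one period per line" symptom from the question, here is a small Python sketch. It assumes the symptom comes from ustrregexrf replacing only the first match (ustrregexra is the replace-all counterpart); count=1 in re.sub is used here purely to mimic that behaviour, and the variable names are illustrative.

```python
import re

text = ("First sentence. Second sentence contains a number 1.0 "
        "and more words. One more sentence.")

# Method 1: a period that is followed by whitespace or the end of the string.
sentence_end = re.compile(r'\.(?=\s|$)')

# Replacing only the first match mimics the "one period per line" symptom.
first_only = sentence_end.sub('', text, count=1)

# Replacing every match gives the result asked for in the question.
all_matches = sentence_end.sub('', text)

# The bonus case: a number followed by a sentence-ending period.
bonus = sentence_end.sub('', "number is 1.0. Next sentence")

print(first_only)
print(all_matches)
print(bonus)
```

Note that this pattern leaves "end sentence.begin next sentence" untouched, because that period is not followed by whitespace.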
Method 2
\d+\.\d+(*SKIP)(*FAIL)|\.
is another option to look at. It relies on (*SKIP)(*FAIL), which Python's built-in re module does not support, so it needs the third-party regex module:
Demo 2
$ pip3 install regex
Test
import regex as re
string = '''
First sentence. Second sentence contains a number 1.0 and more words. One more sentence.First sentence. Second sentence contains a number 1.0 and more words. One more sentence.
'''
expression = r'\d+\.\d+(*SKIP)(*FAIL)|\.'
print(re.sub(expression, '', string))
Output
First sentence Second sentence contains a number 1.0 and more words
One more sentenceFirst sentence Second sentence contains a number 1.0
and more words One more sentence
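If installing a third-party module is not an option, a lookaround-based pattern is a possible substitute. This is a different technique from (*SKIP)(*FAIL), sketched here with the standard re module only: it removes every period that is not sitting between two digits. The function name strip_periods is hypothetical.

```python
import re

# A period is removed unless it has a digit on both sides:
#   \.(?!\d)   period not followed by a digit
#   (?<!\d)\.  period not preceded by a digit
not_decimal_point = re.compile(r'\.(?!\d)|(?<!\d)\.')


def strip_periods(text):
    """Remove sentence periods while keeping decimal points such as 1.0."""
    return not_decimal_point.sub('', text)


print(strip_periods("Second sentence contains a number 1.0 and more words. "
                    "One more sentence.First sentence."))
print(strip_periods("number is 1.0. Next sentence"))
```

Unlike Method 1, this also catches periods with no whitespace after them, at the cost of gluing the two words together (as in the output above), so replacing with a space instead of an empty string may be preferable.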

Capture the latest in backreference

I have this regex
(\b(\S+\s+){1,10})\1.*MY
and I want to group 1 to capture "The name" from
The name is is The name MY
I get "is" for now.
The name can be any random words of any length.
It need not be at the beginning.
It need not be only 2 or 3 words. It can be less than 10 words.
Only thing sure is that it will be the last set of repeating words.
Examples:
The name is Anthony is is The name is Anthony - "The name is Anthony".
India is my country All Indians are India is my country - "India is my country "
Times of India Alphabet Google is the company Alphabet Google canteen - "Alphabet Google"
You could try:
(\b\w+[\w\s]+\b)(?:.*?\b\1)
As demonstrated here
Explanation -
(\b\w+[\w\s]+\b) is the capture group 1 - which is the text that is repeated - separated by word boundaries.
(?:.*?\b\1) is a non-capturing group which tells the regex system to match the text in group 1, only if it is followed by zero-or-more characters, a word-boundary, and the repeated text.
Regex generally captures the longest leftmost match. There are no examples in your question where this would not actually be the string you want, but that could just mean you have not found good examples to show us.
With that out of the way,
((\S+\s)+)(\S+\s){0,9}\1
would appear to match your requirements as currently stated. The "longest leftmost" behavior could still get in the way if there are e.g. straddling repetitions, like
this that more words this that more words
where in the general case regex alone cannot easily be made to always prefer the last possible match and tolerate arbitrary amounts of text after it.
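A quick way to try the pattern above is a small Python harness like the following. It is only a sketch: repeated_phrase is a made-up helper name, and the trailing whitespace that group 1 always carries is stripped for readability.

```python
import re

# Group 1: one or more whitespace-terminated words; then up to nine
# intervening words; then the same text as group 1 again.
pattern = re.compile(r'((\S+\s)+)(\S+\s){0,9}\1')


def repeated_phrase(text):
    """Return the repeated phrase captured by group 1, or None if no match."""
    m = pattern.search(text)
    return m.group(1).strip() if m else None


print(repeated_phrase("The name is is The name MY"))
print(repeated_phrase("Times of India Alphabet Google is the company "
                      "Alphabet Google canteen"))
```

Because every word in group 1 must be followed by whitespace, a repeat that runs to the very end of the string with nothing after it is only matched up to its last whitespace-terminated word.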

Regex to match specific-length string with white space in the middle (anywhere)

I need a regex which will match a phrase (with specific length and structure) even if there is additional white space in the middle (anywhere).
Let's say we have some description:
Serial numbers: ABC1234567890 XYZ0987654321
Then we want to find all phrases matching the regex [A-Z]{3}[0-9]{10}, but the description is malformed because it was processed by an external service. That service splits the description into chunks of 12 characters each. So it will be:
Serial numbe
rs: ABC12345
67890 XYZ098
7654321
Important: "Serial numbers:" isn't fixed; it can be anything, so the required phrases can be split anywhere (ABC1 234567890, ABC1234567 890, etc.). A newline and a space mean the same thing for matching purposes, but in special cases there can be more whitespace characters between the parts of a phrase (for example, a space as the last character of a chunk followed by a newline, or multiple spaces in the source description). The regex should simply treat any run of whitespace between two strings as one space (ABC1 234567890 = ABC1234 567890, also with a line break). The serials can appear anywhere in the malformed description, and there can be more than one serial number within a description. [A-Z]{3}[0-9]{10} is also only an example; I want to know how to match with optional whitespace in the middle, but the base regex can be different.
EXPECTED RESULT: collection of matched phrases (serial numbers from the example).
ABC1234567890
XYZ0987654321
Info: the result can contain whitespace characters within a phrase (for the example above it would be: ABC12345 67890 and XYZ098 7654321). The most important thing is to match the base phrase (serial number).
Is it possible to write a regex that will match it? I think it would be a rather simple algorithm to match it without regex, but maybe it can be done with a regular expression as a one-liner.
This will handle multiple spaces, multiple times:
(([A-Z]\s*){3}([0-9]\s*){10})
Note that it will also match inside AB C A A A A AD E12 34567890,
since AD E12 34567890 fits the pattern.
https://regex101.com/r/bK3sF8/1
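As a rough Python sketch of how that pattern could be used to get the expected collection: match with optional whitespace after every character, then strip the whitespace from each hit. The helper name find_serials is hypothetical, and non-capturing groups are used in place of the capturing ones so the whole match is easy to read back.

```python
import re

# Same idea as the pattern above: optional whitespace after every character.
serial = re.compile(r'(?:[A-Z]\s*){3}(?:[0-9]\s*){10}')


def find_serials(text):
    """Return every serial number found, with inner whitespace removed."""
    return [re.sub(r'\s+', '', m.group(0)) for m in serial.finditer(text)]


# The description after the external service split it into 12-character chunks.
chunked = "Serial numbe\nrs: ABC12345\n67890 XYZ098\n7654321"
print(find_serials(chunked))
```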
Edit:
Considering just one \n (line break) inside or outside the word (you can adjust this for more), use: ([\w\n?]*)
You should try grouping the result
in this case:
/(([\w\n?]*)\s([\w\n?]*):\s([\w\n?]*)\n?\n?\s([\w\n]*))/ig
you can get the serial number by groups $3 and $4
http://regexr.com/3d67n

How Can I Create a RegEx Pattern that will Get N Words Using Custom Word Boundary?

I need a RegEx pattern that will return the first N words using a custom word boundary that is the normal RegEx white space (\s) plus punctuation like .,;:!?-*_
EDIT #1: Thanks for all your comments.
To be clear:
I'd like to set the characters that would be the word delimiters
Lets call this the "Delimiter Set", or strDelimiters
strDelimiters = ".,;:!?-*_"
nNumWordsToFind = 5
A word is defined as any contiguous text that does NOT contain any character in strDelimiters
The RegEx word boundary is any contiguous text that contains one or more of the characters in strDelimiters
I'd like to build the RegEx pattern to get/return the first nNumWordsToFind using the strDelimiters.
EDIT #2: Sat, Aug 8, 2015 at 12:49 AM US CT
@maraca definitely answered my question as originally stated.
But what I actually need is to return the number of words ≤ nNumWordsToFind.
So if the source text has only 3 words, but my RegEx asks for 4 words, I need it to return the 3 words. The answer provided by maraca fails if nNumWordsToFind > number of actual words in the source text.
For example:
one,two;three-four_five.six:seven eight nine! ten
It would see this as 10 words.
If I want the first 5 words, it would return:
one,two;three-four_five.
I have this pattern using the normal \s whitespace, which works, but NOT exactly what I need:
([\w]+\s+){<NumWordsOut>}
where <NumWordsOut> is the number of words to return.
I have also found this word boundary pattern, but I don't know how to use it:
a "real word boundary" that detects the edge between an ASCII letter
and a non-letter.
(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
However, I would want my words to allow numbers as well.
IAC, I have not been able to work out how to use the above custom word boundary pattern to return the first N words of my text.
BTW, I will be using this in a Keyboard Maestro macro.
Can anyone help?
TIA.
All you have to do is adapt your pattern ([\w]+\s+){<NumWordsOut>} to the following, which also covers some special cases:
^[\s.,;:!?*_-]*([^\s.,;:!?*_-]+([\s.,;:!?*_-]+|$)){<NumWordsOut>}
1. Match any amount of delimiters before the first word
2. Match a word (= at least one non-delimiter)
3. The word has to be followed by at least one delimiter
4. Or it can be at the end of the string (in case no delimiter follows at the end)
5. Repeat 2. to 4. <NumWordsOut> times
Note how I moved the - within the character class: it has to be at the start or end, otherwise it needs to be escaped: \-.
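For the follow-up requirement in EDIT #2 (return at most N words when the text has fewer), one possible adjustment is to turn the fixed count into a range, {1,N}. That change is an assumption on top of the pattern above, not part of it. The Python sketch below builds the pattern from the delimiter set; first_words is a hypothetical helper name.

```python
import re


def first_words(text, max_words, delimiters=".,;:!?-*_"):
    """Return up to max_words leading words, including the delimiters between them."""
    # Whitespace plus the custom delimiters; re.escape makes "-" safe in a class.
    d = r'\s' + re.escape(delimiters)
    # {1,max_words} instead of {max_words}: accept fewer words if that is all there is.
    pattern = rf'^[{d}]*(?:[^{d}]+(?:[{d}]+|$)){{1,{max_words}}}'
    m = re.match(pattern, text)
    return m.group(0) if m else ""


print(first_words("one,two;three-four_five.six:seven eight nine! ten", 5))
print(first_words("one two three", 5))
```

With the fixed count {5}, the second call would find no match at all, which is the failure described in EDIT #2.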
Thanks to @maraca for providing the complete answer to my question.
I just wanted to post the Keyboard Maestro macro that I have built using @maraca's RegEx pattern for anyone interested in the complete solution.
See KM Forum Macro: Get a Max of N Words in String Using RegEx

Excel Sort by 2nd character in alphanumeric string

I have a column in an Excel spreadsheet that contains the following:
### - 3-digit number
#### - 4-digit number
A### - character with 3-digits
#A## - digit followed by character then 2 more digits
There may also be superfluous characters to the right of these strings.
I would like to sort the entire spreadsheet by this column in the following order (ascending or descending):
the first three types of strings alphabetically as expected (NOT ASCII-Betically!)
Then the #A## by the character first, then by the first digit.
Example:
000...999, 0000...9999, A000...Z999, 0A00...9A99, 0B00...9B99...9Z99
I feel there is a very simple solution using a regular expression or macro, but my VBA and RegExp are pretty rusty (a friend asked me for this, but I'm more of a C guy these days). I have read some solutions which involve splitting the data into additional columns, which I would be fine with.
I would settle for a link to a good guide. Eternal thanks in advance.
If you want to sort by second character regardless of the content ahead and behind, then regex ^.(.) represents second character match...
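To show what that match gives you, here is a minimal Python sketch (not Excel or VBA) that extracts the second character with ^.(.) and uses it as the primary sort key for the #A## strings, with the first digit as the tie-breaker, as described in the question. The helper name mixed_key and the sample values are illustrative only.

```python
import re

# ^.(.) skips the first character and captures the second one.
second_char = re.compile(r'^.(.)')


def mixed_key(code):
    """Sort key for #A## strings: the letter first, then the leading digit.

    Assumes the string has at least two characters.
    """
    return (second_char.match(code).group(1), code[0])


codes = ["9A99", "0B00", "0A00", "9Z99", "5A12"]
print(sorted(codes, key=mixed_key))
```

In a spreadsheet, the same idea would correspond to putting the extracted second character in a helper column and sorting on that column first.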