Using REGULAR EXPRESSION to replace string between special characters in oracle - c++

select 'a[b]c[d][e]f[g]' from dual;
I need an output:
acf
i.e. every [] pair is removed, along with the text between them.
Solution can be in Oracle or C++ function.
Tried erase function in C++ , something like :
int main ()
{
std::string str ("a[b]c[d]e[f]");
std::cout << str << '\n';
while(1)
{
std::size_t foundStart = str.find("[");
//if (foundStart != std::string::npos)
std::cout << "'[' found at: " << foundStart << '\n';
str.begin();
std::size_t foundClose = str.find("]");
//if (foundClose != std::string::npos)
std::cout << "']' found at: " << foundClose << '\n';
str.begin();
str.erase (foundStart,foundClose);
std::cout << str << '\n';
}
return 0;
}
which returns an output as :
a[b]c[d]e[f]
'[' found at: 1
']' found at: 3
ac[d]e[f]
'[' found at: 2
']' found at: 4
ac[f]
'[' found at: 2
']' found at: 4
ac
'[' found at: 18446744073709551615
']' found at: 18446744073709551615
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::erase
Thanks in Advance.

I don't know enough C++ or Oracle to implement it, but the Regular Expression would look something like this, I suppose:
(?<=[\s\]])[a-z](?=(\[[a-z]\])+[\sa-z])
This will match a, c and f.
You will need to iterate over the matches and print them accordingly.
The regular expression is decoupled from whatever text is around the target,
hello there a[b]c[d][e]f[g] way to go! will have the same matches,
just be sure to have spaces around the target string a[b]c[d][e]f[g]
I hope I helped you!
Good luck

You can use regexp_replace(<your_string>,'\[.*?\]')
Breaking it down:
\[ --matches a single opening square bracket '['. It must be escaped with a backslash '\' because '[' is a regex metacharacter
.*? --non-greedy expression that matches as little text as possible
\] --matches a single closing square bracket ']'. It is escaped with a backslash '\' for the same reason
Example:
SQL> with x(y) as (
select 'a[b]c[d][][e]f[g][he]ty'
from dual
)
select y, regexp_replace(y,'\[.*?\]') regex_str
from x;
Y                       REGEX_STR
----------------------- -----------
a[b]c[d][][e]f[g][he]ty acfty
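For the C++ side of the question, the same idea can be sketched in two ways. In the posted loop, erase(foundStart, foundClose) treats its second argument as a character count rather than an end position (which is why the second pass also swallowed the 'e'), and nothing stops the loop once find returns npos. The function names below are made up for illustration, and std::regex is assumed to be available (C++11 or later).

```cpp
#include <regex>
#include <string>

// Regex-based variant: delete every "[...]" group, brackets included.
std::string strip_brackets_regex(const std::string& text) {
    // \[ and \] are literal brackets; [^\]]* is any run of characters other than ']'.
    static const std::regex bracketed(R"(\[[^\]]*\])");
    return std::regex_replace(text, bracketed, "");
}

// Loop-based variant: the find/erase approach from the question with an
// exit condition and a corrected erase() length.
std::string strip_brackets_loop(std::string text) {
    std::size_t open = 0;
    while ((open = text.find('[', open)) != std::string::npos) {
        const std::size_t close = text.find(']', open);
        if (close == std::string::npos) {
            break;  // unmatched '[': leave the rest untouched
        }
        // erase(pos, count): count is a length, not an end position.
        text.erase(open, close - open + 1);
    }
    return text;
}
```

Both functions should turn a[b]c[d][e]f[g] into acf. Neither handles nested brackets; that would need a depth counter instead.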

Related

Regex (JS Notation): Select spaces not in [ [], {}, "" ] to tokenize string

So I need to tokenize a string by all spaces not between quotes, I am using regex in Javascript notation.
For example:
" Test Test " ab c " Test" "Test " "Test" "T e s t"
becomes
[" Test Test ",ab,c," Test","Test ","Test","T e s t"]
For my use case however, the solution should work in the following test setting:
https://www.regextester.com/
All Spaces not within quotes should be highlighted in the above setting. If they are highlighted in the above setting they would be parsed correctly in my program.
For more specificity, I am using Boost::Regex C++ to do the parsing as follows:
...
std::string test_string("\" Test Test \" ab c \" Test\" \"Test \" \"Test\" \"T e s t\"");
// (,|;)?\\s+ : Split on ,\s or ;\s
// (?![^\\[]*\\]) : Ignore spaces inside []
// (?![^\\{]*\\}) : Ignore spaces inside {}
// (?![^\"].*\") : Ignore spaces inside "" !!! MY ATTEMPT DOESN'T WORK !!!
//Note the below regex delimiter declaration does not include the erroneous regex.
boost::regex delimiter("(,|;\\s|\\s)+(?![^\\[]*\\])(?![^\\(]*\\))(?![^\\{]*\\})");
std::vector<std::string> string_vector;
boost::split_regex(string_vector, test_string, delimiter);
For those of you who do not use Boost::regex or C++ the above link should enable testing of viable regex for the above use case.
Thank you all for your assistance. I hope you can help me with the above problem.
I would 100% not use regular expressions for this. First off, because it's way easier to express as a PEG grammar instead. E.g.:
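#include <boost/spirit/home/x3.hpp>
#include <iomanip>  // std::quoted, used in main() below
#include <iostream>
#include <string>
#include <string_view>
#include <vector>
std::vector<std::string> tokens(std::string_view input) {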
namespace x3 = boost::spirit::x3;
std::vector<std::string> r;
auto atom //
= '[' >> *~x3::char_(']') >> ']' //
| '{' >> *~x3::char_('}') >> '}' //
| '"' >> *~x3::char_('"') >> '"' //
| x3::graph;
auto token = x3::raw[*atom];
parse(input.begin(), input.end(), token % +x3::space, r);
return r;
}
This, off the bat, already performs as you intend:
int main() {
for (std::string const input : {R"(" Test Test " ab c " Test" "Test " "Test" "T e s t")"}) {
std::cout << input << "\n";
for (auto& tok : tokens(input))
std::cout << " - " << quoted(tok, '\'') << "\n";
}
}
Output:
" Test Test " ab c " Test" "Test " "Test" "T e s t"
- '" Test Test "'
- 'ab'
- 'c'
- '" Test"'
- '"Test "'
- '"Test"'
- '"T e s t"'
BONUS
Where this really makes the difference, is when you realize that you wanted to be able to handle nested constructs (e.g. "string" [ {1,2,"3,4", [true,"more [string]"], 9 }, "bye ]).
Regular expressions are notoriously bad at this. Spirit grammar rules can be recursive though. If you make your grammar description more explicit I could show you examples.
You can use multiple regexes if you are ok with that. The idea is to replace spaces inside quotes with a non-printable char (\x01), and restore them after the split:
const input = `" Test Test " ab c " Test" "Test " "Test" "T e s t"`;
let result = input
.replace(/"[^"]*"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
.split(/ +/) // split on spaces
.map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
If you have escaped quotes within a string, such as "a \"quoted\" token" you can use this regex instead:
const input = String.raw`"A \"quoted\" token" " Test Test " ab c " Test" "Test " "Test" "T e s t"`; // String.raw keeps the backslashes; a plain template literal would drop them
let result = input
.replace(/".*?[^\\]"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
.split(/ +/) // split on spaces
.map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
If you want to parse nested brackets you need a proper language parser. You can also do that with regexes however: Parsing JavaScript objects with functions as JSON
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
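Since the question is about C++, here is a minimal sketch of a third option using only std::regex. It does not split on delimiters. It collects tokens instead: each token is either a double-quoted run (quotes kept) or a run of characters that are neither whitespace nor quotes. The function name is hypothetical, and brackets, braces and escaped quotes are deliberately not handled.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect tokens rather than splitting: a token is either "..." (quotes
// kept) or a run of non-space, non-quote characters.
std::vector<std::string> tokenize_outside_quotes(const std::string& input) {
    static const std::regex token(R"re("[^"]*"|[^\s"]+)re");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(input.begin(), input.end(), token), end; it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

On the test string from the question this should give the seven tokens listed there. An unterminated quote is silently skipped, which may or may not be what you want.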

Split string with specific constraint on delimiter

Suppose we have a string: "((0.2,0), (1.5,0)) A1 ABC p". I want to split it into logical units like this:
((0.2,0), (1.5,0))
A1
ABC
p
I.e. split string by whitespaces with requirement that previous character isn't a comma.
Is it possible to use regex as solution?
Update: I've tried in this way:
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string s = "((0.2,0), (1.5,0)) A1 ABC p";
std::regex re("[^, ]*\\(, *[^, ]*\\)*"); // as suggested in the updated answers
std::sregex_token_iterator
p(s.begin(), s.end(), re, -1);
std::sregex_token_iterator end;
while (p != end)
std::cout << *p++ << std::endl;
}
The result was: ((0.2,0), (1.5,0)) A1 ABC p
Solution:
#include <iostream>
#include <string>
#include <regex>
int main() {
std::string s = "((0.2,0), (1.5,0)) A1 ABC p";
std::regex re("[^, ]*(, *[^, ]*)*");
std::regex_token_iterator<std::string::iterator> p(s.begin(), s.end(), re);
std::regex_token_iterator<std::string::iterator> end;
while (p != end)
std::cout << *p++ << std::endl;
}
Output:
((0.2,0), (1.5,0))
A1
ABC
p
you can do it like this:
[^, ]*(, *[^, ]*)*
What does this do?
First, let's go over the basics of regular expressions:
[] defines a set of characters to match; for example, [ab] will match an 'a' or a 'b'.
The [^] syntax describes the characters you do NOT want to match, so [^ab] will match any character that is NOT an 'a' or a 'b'.
The * symbol tells the regular expression that the previous element can appear zero or more times, so a* will match the empty string '' or 'a' or 'aaa' or 'aaaaaaaaaaaaa'.
Putting () around part of an expression creates a group that you can then do interesting things with. Here it is used to define a part of the pattern that is optional: putting * after it lets it appear zero or more times.
OK, putting it all together:
The first part, [^, ]*, says: match zero or more characters that are NOT ' ' or ','. This will match strings like 'A1' or '((0.2'.
The second part, in ()*, continues matching strings that have ',' and spaces in them but that you do not want to split. This part is optional, so the pattern still correctly matches 'A1' or 'ABC' or 'p'.
So (, *[^, ]*)* will match zero or more strings that start with ',' and any number of ' ', followed by a string that does not have ',' or ' ' in it. In your example it would match ",0)", which is the continuation of "((0.2", then ", (1.5", and then ",0))", which all get added together to make "((0.2,0), (1.5,0))".
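NOTE: You may need to escape some characters in your expression, depending on the regular expression library you are using. The solution works as written in this online tester: http://www.regexpal.com/
Some libraries and tools (typically those using POSIX basic syntax) need grouping parentheses escaped,
so the expression would look like:
[^, ]*\(, *[^, ]*\)*
With std::regex's default ECMAScript grammar, do not escape them: there \( and \) match literal parentheses, so the pattern finds no match at all, which is why the attempt in the question's update returned the whole string unsplit.
Also, I removed the ( |$) part, as it is only required if you want the trailing space to be part of the match.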
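One caveat: every part of the pattern is starred, so it can also match the empty string. Depending on the library, a match iterator may then report empty matches between the tokens. A minimal sketch of one way around that, assuming std::regex and a made-up function name, is to require at least one character in the first class:

```cpp
#include <regex>
#include <string>
#include <vector>

// Split into units on spaces, except where the space follows a comma.
std::vector<std::string> split_units(const std::string& text) {
    // Same shape as the pattern above, but with '+' on the first class so
    // that the empty string is never a match. (?: ) is a non-capturing group.
    static const std::regex unit("[^, ]+(?:, *[^, ]*)*");
    std::vector<std::string> units;
    for (std::sregex_iterator it(text.begin(), text.end(), unit), end; it != end; ++it) {
        units.push_back(it->str());
    }
    return units;
}
```

For "((0.2,0), (1.5,0)) A1 ABC p" this should yield the four units from the question and nothing else.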

Bug in std::regex?

Here is code :
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string pattern("[^c]ei");
pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";
std::regex r(pattern);
std::smatch results;
std::string test_str = "cei";
if (std::regex_search(test_str, results, r))
std::cout << results.str() << std::endl;
return 0;
}
Output :
cei
The compiler used is gcc 4.9.1.
I'm a newbie learning regular expressions. I expected no output, since "cei" doesn't match the pattern. Am I doing it right? What's the problem?
Update:
This one has been reported and confirmed as a bug, for detail please visit here :
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63497
It's a bug in the implementation. Not only do a couple other tools I tried agree that your pattern does not match your input, but I tried this:
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string pattern("([a-z]*)([a-z])(e)(i)([a-z]*)");
std::regex r(pattern);
std::smatch results;
std::string test_str = "cei";
if (std::regex_search(test_str, results, r))
{
std::cout << results.str() << std::endl;
for (size_t i = 0; i < results.size(); ++i) {
std::ssub_match sub_match = results[i];
std::string sub_match_str = sub_match.str();
std::cout << i << ": " << sub_match_str << '\n';
}
}
}
This is basically similar to what you had, but I replaced [:alpha:] with [a-z] for simplicity, and I also temporarily replaced [^c] with [a-z] because that seems to make it work correctly. Here's what it prints (GCC 4.9.0 on Linux x86-64):
cei
0: cei
1:
2: c
3: e
4: i
5:
If I replace [a-z] where you had [^c] and just put f there instead, it correctly says the pattern doesn't match. But if I use [^c] like you did:
std::string pattern("([a-z]*)([^c])(e)(i)([a-z]*)");
Then I get this output:
cei
0: cei
1: cei
terminate called after throwing an instance of 'std::length_error'
what(): basic_string::_S_create
Aborted (core dumped)
So it claims to match successfully, and results[0] is "cei" which is expected. Then, results[1] is "cei" also, which I guess might be OK. But then results[2] crashes, because it tries to construct a std::string of length 18446744073709551614 with begin=nullptr. And that giant number is exactly 2^64 - 2, aka std::string::npos - 1 (on my system).
So I think there is an off-by-one error somewhere, and the impact can be much more than just a spurious regex match--it can crash at runtime.
The regex is correct and should not match the string "cei".
The regex can be tested and explained best in Perl:
my $regex = qr{ # start regular expression
[[:alpha:]]* # 0 or any number of alpha chars
[^c] # followed by NOT-c character
ei # followed by e and i characters
[[:alpha:]]* # followed by 0 or any number of alpha chars
}x; # end + declare 'x' mode (ignore whitespace)
print "xei" =~ /$regex/ ? "match\n" : "no match\n";
print "cei" =~ /$regex/ ? "match\n" : "no match\n";
The regex will first consume all chars to the end of the string ([[:alpha:]]*), then backtrack to find the NON-c char [^c] and proceed with the e and i matches (by backtracking another time).
Result:
"xei" --> match
"cei" --> no match
for obvious reasons. Any discrepancies to this in various C++ libraries and testing tools are the problem of the implementation there, imho.
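The same cross-check can be sketched in C++ without relying on the regex engine. Because both [[:alpha:]]* parts may match nothing, a search with this pattern succeeds exactly when the text contains "ei" preceded by some character other than 'c'. The function names below are hypothetical, and the regex variant is only expected to agree with the manual one on an implementation without the bug discussed above.

```cpp
#include <regex>
#include <string>

// The pattern from the question, used with regex_search.
bool has_non_c_ei_regex(const std::string& text) {
    static const std::regex pattern("[[:alpha:]]*[^c]ei[[:alpha:]]*");
    return std::regex_search(text, pattern);
}

// Hand-written check with the same meaning: some "ei" that is preceded
// by a character other than 'c'.
bool has_non_c_ei_manual(const std::string& text) {
    for (std::size_t pos = text.find("ei"); pos != std::string::npos;
         pos = text.find("ei", pos + 1)) {
        if (pos > 0 && text[pos - 1] != 'c') {
            return true;
        }
    }
    return false;
}
```

Comparing the two on inputs such as "cei" and "xei" is a quick way to tell whether a given standard library build is affected.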

QRegExp not extracting text as expected

I am trying to extract text from between square brackets on a line of text. I've been messing with the regex for some time now, and cannot get what I need. (I can't even explain why the output is what it is). Here's the code:
QRegExp rx_timestamp("\[(.*?)\]");
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
qDebug() << "timestamp cap: " <<rx_timestamp.cap(0);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(1);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(2);
} else qDebug() << "No indexin";
The input line is:
messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found
And the output is:
Captured texts: (".")
timestamp cap: "."
timestamp cap: ""
timestamp cap: ""
Can someone explain what is going on? Why is cap returning "." when no such character exists between the square brackets?
Can someone correct the regex to extract the timestamp from between the square brackets?
You are missing two things. Escaping the backslash, and using setMinimal. See below.
QString line = "messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '\"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found";
QRegExp rx_timestamp("\\[(.*)\\]");
rx_timestamp.setMinimal(true);
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
qDebug() << "timestamp cap: " <<rx_timestamp.cap(0);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(1);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(2);
} else qDebug() << "No indexin";
Output:
Captured texts: ("[2013-10-08 09:13:41]", "2013-10-08 09:13:41")
timestamp cap: "[2013-10-08 09:13:41]"
timestamp cap: "2013-10-08 09:13:41"
timestamp cap: ""
UPDATE: What is going on:
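In C++ source code, a backslash inside a string literal starts an escape sequence, such as \n. To get a backslash into a regular expression you have to escape the backslash itself: \\ That way the regular expression engine sees a single \, as it would in Ruby, Perl or Python.
The square brackets have to be escaped too, because in a regex they normally delimit a character class (a set or range of characters). That is also where the "." came from: with a single backslash the compiler typically discards the unknown escape \[ (usually with a warning), so the engine receives [(.*?)], a character class matching any one of the characters ( . * ? ), and the first such character in the line is the '.' in chan_sip.c.
So for the regular expression engine to see a literal square bracket character you need to send it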
\[
but a c++ source file can't get a \ character into a string without two of them in a row so it turns into
\\[
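The doubling can be avoided with a C++11 raw string literal, where backslashes are taken literally. The sketch below illustrates this with std::regex rather than QRegExp (an assumption made so that it compiles without Qt), and the function name is made up. std::regex's ECMAScript grammar supports the lazy quantifier .*? directly, whereas the QRegExp answer above uses setMinimal instead.

```cpp
#include <regex>
#include <string>

// Stores the text inside the first [...] pair in `out` and returns true,
// or returns false if the line contains no such pair.
bool first_bracketed(const std::string& line, std::string& out) {
    // Two equivalent spellings of the pattern \[(.*?)\] :
    //   ordinary literal: "\\[(.*?)\\]"
    //   raw literal:      R"(\[(.*?)\])"
    static const std::regex bracketed(R"(\[(.*?)\])");
    std::smatch match;
    if (!std::regex_search(line, match, bracketed)) {
        return false;
    }
    out = match[1].str();  // capture group 1: the text between the brackets
    return true;
}
```

Applied to the log line from the question, this should store 2013-10-08 09:13:41 in out.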
While learning regex, I liked using this regex tool by GSkinner. It has a listing on the right hand side of the page of unique codes and characters.
QRegExp doesn't match other regex flavors exactly. If you study the documentation you will find a lot of little differences, such as how it handles greedy vs. lazy matching.
See also: QRegExp and double-quoted text for QSyntaxHighlighter
The way the captures are listed is pretty typical of regex libraries, as far as I have seen: the list starts with the entire match, followed by the first capture group (whatever was enclosed by the first set of parentheses), and so on.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#cap
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#capturedTexts
To find more matches, you have to iteratively call indexIn.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#indexIn
QString str = "offsets: 1.23 .50 71.00 6.00";
QRegExp rx("\\d*\\.\\d+"); // primitive floating point matching
int count = 0;
int pos = 0;
while ((pos = rx.indexIn(str, pos)) != -1) {
++count;
pos += rx.matchedLength();
}
// pos will be 9, 14, 18 and finally 24; count will end up as 4
Hope that helps.

boost regex whitespace followed by one or more asterisk

Why does the following boost regex not return the results I am looking for (a line that starts with zero or more whitespace characters followed by one or more asterisks)?
boost::regex tmpCommentRegex("(^\\s*)\\*+");
for (std::vector<std::string>::iterator vect_it =
tmpInputStringLines.begin(); vect_it != tmpInputStringLines.end();
++vect_it) {
boost::match_results<std::string::const_iterator> tmpMatch;
if (boost::regex_match((*vect_it), tmpMatch, tmpCommentRegex,
boost::match_default) == 0) {
std::cout << "Found comment " << (*vect_it) << std::endl;
} else {
std::cout << "No comment" << std::endl;
}
}
On the following input:
* Script 7
[P]%OMO * change
[P]%QMS * change
[T]%OMO * change
[T]%QMM * change
[S]%G1 * Resume
[]
This should read
Found comment * Script 7
No comment
No comment
No comment
No comment
No comment
No comment
Quoting from the documentation for regex_match:
Note that the result is true only if the expression matches the whole of the input sequence. If you want to search for an expression somewhere within the sequence then use regex_search. If you want to match a prefix of the character string then use regex_search with the flag match_continuous set.
None of your input lines are matched by your regular expression as a whole, so the program works as expected. You should use regex_search to get the desired behavior.
Besides, regex_match and regex_search both return bool, not int, and the == 0 test inverts the logic: "Found comment" is printed when the match fails.
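A minimal sketch of the search-based check follows. It uses std::regex instead of boost::regex so that it is self-contained; the assumption is that both behave the same for this simple pattern, and the function name is made up.

```cpp
#include <regex>
#include <string>

// True if the line starts with optional whitespace followed by one or
// more asterisks. regex_search only needs the pattern to match somewhere,
// and the ^ anchor pins that match to the start of the line.
bool is_comment_line(const std::string& line) {
    static const std::regex comment_prefix(R"(^\s*\*+)");
    return std::regex_search(line, comment_prefix);
}
```

The bool result can be used directly in the if condition, with no comparison against 0.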