Can't fit regexp for a substring - regex

I have to find and remove a substring from the text using regexp in PostgreSQL. The substring corresponds to the condition: <any text between double-quotes containing for|while inside>
Example
Text:
PERFORM xxkkcsort.f_write_log("INFO", "xxkkcsort.f_load_tables__outof_order_system", " Script for data loading: ", false, v_sql, 0);
So, my purpose is to find and remove the substring "Script for data loading: ".
When I tried to use the script below:
SELECT regexp_replace(
'PERFORM xxkkcsort.f_write_log("INFO", "xxkkcsort.f_load_tables__outof_order_system", "> Table for loading: "||cc.source_table_name , false, null::text, 0);'
, '(\")(.*(for|while)(\s).*)(\")'
, '');
all the double-quoted texts get replaced, not just the one I want. The result looks like:
PERFORM xxkkcsort.f_write_log(||cc.source_table_name , false, null::text, 0);
What's a proper regular expression to solve the issue?

You can use
SELECT regexp_replace(
'PERFORM xxkkcsort.f_write_log("INFO", "xxkkcsort.f_load_tables__outof_order_system", "> Table for loading: "||cc.source_table_name , false, null::text, 0);',
'"[^"]*(for|while)\s[^"]*"',
'') AS Result;
Output:
PERFORM xxkkcsort.f_write_log("INFO", "xxkkcsort.f_load_tables__outof_order_system", ||cc.source_table_name , false, null::text, 0);
See the regex demo and the DB fiddle. Details:
" - a " char
[^"]* - zero or more chars other than "
(for|while) - for or while
\s - a whitespace
[^"]*" - zero or more chars other than " and then a " char.
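The difference between the greedy dot from the question and the negated character class can be sketched in JavaScript, whose regex syntax behaves the same way for this pattern. This is only an illustration under assumptions: the statement is a shortened version of the one in the question, and the variable names are made up for the example.

```javascript
// Shortened version of the statement from the question (illustrative only).
const stmt = 'f_write_log("INFO", "> Table for loading: "||cc.source_table_name, false);';

// Greedy dot: the match starts at the first quote and .* runs across quote boundaries.
const greedy = stmt.replace(/".*(for|while)\s.*"/, '');

// Negated character class: the match cannot cross a double quote.
const scoped = stmt.replace(/"[^"]*(for|while)\s[^"]*"/, '');

console.log(greedy); // f_write_log(||cc.source_table_name, false);
console.log(scoped); // f_write_log("INFO", ||cc.source_table_name, false);
```

Only the second call keeps "INFO" intact, because [^"]* forces the whole match to stay inside one pair of quotes.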

any text between double-quotes containing for|while inside
SELECT regexp_replace(string, '"[^"]*\m(?:for|while)\M[^"]*"', '');
" ... literal " (no special meaning here, so no need to escape it)
[^"]* ... character class including all characters except ", 0-n times
\m ... beginning of a word
(?:for|while) ... two branches in non-capturing parentheses
(regexp_replace() works with simple capturing parentheses, too, but it's cheaper this way since you don't use the captured substring. But try either with the replacement '\1', where it makes a difference ...)
\M ... end of a word
[^"]* ... like above
" ... like above
I dropped \s from your expression, as the task description does not strictly require a white-space character (end of string or punctuation delimiting the word ...).
Related:
Escape function for regular expression or LIKE patterns
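To see what the word-boundary version changes compared with the \s version, here is a rough JavaScript sketch. JavaScript has no \m or \M, so \b (a boundary on either side of a word) is used as a stand-in; that substitution and the two sample strings are assumptions made for the illustration.

```javascript
// \b stands in for PostgreSQL's \m and \M (assumption: JavaScript lacks those escapes).
const wordBounded = /"[^"]*\b(?:for|while)\b[^"]*"/g;
const needsSpace = /"[^"]*(?:for|while)\s[^"]*"/g;

const punctuated = 'log("wait for.")';    // keyword followed by punctuation
const embedded = 'log("meanwhile idle")'; // "while" only as the tail of a longer word

console.log(punctuated.replace(wordBounded, '')); // log()
console.log(punctuated.replace(needsSpace, ''));  // log("wait for.")
console.log(embedded.replace(wordBounded, ''));   // log("meanwhile idle")
console.log(embedded.replace(needsSpace, ''));    // log()
```

The boundary version catches a keyword followed by punctuation and ignores keywords buried inside longer words; the \s version does the opposite in both cases.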


Regex to remove unescaped quotes from a CSV

I need to feed a CSV file into a database. For that I have to remove "wild" un-escaped quotes.
Following input structure is possible:
"aa";"bb";"cc";"dd";"ee"
"aa";"bb";"c "cc" c";"dd";"ee"
"aa";;"cc";"dd";"ee"
"aa";55;"cc";"dd";"ee"
The expression:
(?<!^|\"\;)\"(?!\;|$)
does work for #1 and #2 of the input examples but fails when there is an empty element (#3) or an unquoted numeric field (#4). Also see this Rubular example
Any pointer how to get these cases covered would be highly appreciated.
Edit:
Following @Wiktor Stribiżew's advice, I'm now using
(^"|"$|";+"|";\d+;"|";|;")|"
which also covers some additional edge cases I have identified in the input data.
The following solution only meets your current requirements and is not a universal solution to fix quotes in CSV:
(^"|"$|";+"|";\d+;")|"
Replace with $1 (or \1, depending on where you use this regex).
See the regex demo.
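A minimal JavaScript sketch of this keep-or-drop replacement, run line by line against the four sample rows from the question. The helper name fixQuotes is hypothetical. In JavaScript, $1 expands to an empty string when group 1 did not take part in the match, which is what removes the stray quotes.

```javascript
// Hypothetical helper: structural quotes are captured in group 1 and written back,
// any other quote matches the bare " alternative and is dropped.
const fixQuotes = (line) => line.replace(/(^"|"$|";+"|";\d+;")|"/g, '$1');

const rows = [
  '"aa";"bb";"cc";"dd";"ee"',
  '"aa";"bb";"c "cc" c";"dd";"ee"',
  '"aa";;"cc";"dd";"ee"',
  '"aa";55;"cc";"dd";"ee"',
];

const fixed = rows.map(fixQuotes);
console.log(fixed[1]); // "aa";"bb";"c cc c";"dd";"ee"
```

Rows 1, 3 and 4 come back unchanged; only the inner quotes of row 2 are removed.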
Details
(^"|"$|";+"|";\d+;") - Group 1:
^"| - " at the start of the string, or
"$| - " at the end of the string, or
";+"| - ", 1+ ; chars, and then ", or
";\d+;" - ";, 1+ digits, then ;"
| - or
" - a " char.

How do I use a regex pattern in VBS to match commas not preceded or followed by a line feed or carriage return?

As I understand regular expressions, I think this pattern should work in VBS to pick up commas in a string that are preceded or followed by a line feed or carriage return as submatch 0 or submatch 1 (one of the first two pattern groups):
oRe.Pattern = "(,[\n\r])|([\n\r],)|(.{2},.{2})"
However, in the string excerpt below, submatch 2 (third pattern group) is picking up the commas, each of which is preceded by a carriage return:
I want these commas ignored
Here's the code from the picture:
SELECT
di.QuestionSetID AS SectionID
,di.ScoreNBR AS SectionLowestTopBoxNBR
,di.AveragePercentileNBR AS SectionTopBoxPercentileRankNBR
,qdate.QuarterStartDTS AS SectionStartDTS
FROM NRCPicker.PatientSatisfaction.DimensionPercentile AS di
INNER JOIN (
Can anyone see why these commas are being picked up as submatch 2?
I based my pattern on this article: http://www.rexegg.com/regex-best-trick.html. I also used regex101.com in developing and testing this pattern.
I am using VBS to parse fields from a SQL script by creating an array using split(string, ","). In some cases, there are composite fields that include commas within them. I don't want to split on those commas, so I am replacing those commas with a space before I perform the split operation. The result of my regex pattern then would be to pick up only those commas not preceded or followed by a carriage return/line feed and replace them with a space.
Hopefully this is a better illustration of what I'm trying to do:
Here's a sample of my VBscript:
SQLScript = "SELECT
di.QuestionSetID AS SectionID
,di.ScoreNBR AS Section,LowestTopBoxNBR
,di.AveragePercentileNBR AS SectionTopBoxPercentileRankNBR
,qdate.Quarter,StartDTS AS Section,StartDTS
FROM NRCPicker.PatientSatisfaction.DimensionPercentile AS di
INNER JOIN ("
oRe.Pattern = "(,[\n\r])|([\n\r],)|(.{2},.{2})"
oLoadFields = oRe.Replace(SQLScript, "$1$2$3")
Expected output (commas replaced with a space only when not at the beginning or end of a line):
oLoadFields = "SELECT
di.QuestionSetID AS SectionID
,di.ScoreNBR AS Section LowestTopBoxNBR
,di.AveragePercentileNBR AS SectionTopBoxPercentileRankNBR
,qdate.Quarter StartDTS AS Section StartDTS
FROM NRCPicker.PatientSatisfaction.DimensionPercentile AS di
INNER JOIN ("
You are matching the first occurrence only - that is,
SELECT
di.QuestionSetID AS SectionID
,<- here
You're not seeing any effect, though, because you are replacing it with the same text you captured, in doing "$1$2$3".
What you want to do, if you don't want to match the commas next to line breaks and only want to replace those in the middle of a line, is not anchor the commas to [\r\n]. You can invert it with the caret: [^\r\n], so that it matches anything that is NOT \r or \n. Then you need to restructure the pattern accordingly.
([^\r\n]),([^\n\r]) will match anything that is not \r or \n either side of a comma, and capture those characters in $1 and $2. To replace the comma with a space, your replacement string should therefore be: "$1 $2".
SQLScript = "SELECT
di.QuestionSetID AS SectionID
,di.ScoreNBR AS Section,LowestTopBoxNBR
,di.AveragePercentileNBR AS SectionTopBoxPercentileRankNBR
,qdate.Quarter,StartDTS AS Section,StartDTS
FROM NRCPicker.PatientSatisfaction.DimensionPercentile AS di
INNER JOIN ("
oRe.Pattern = "([^\r\n]),([^\n\r])"
oLoadFields = oRe.Replace(SQLScript, "$1 $2")
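The same replacement can be tried in JavaScript, whose engine is close to the one behind VBScript's RegExp for this pattern. This is a sketch under assumptions: the script is shortened, 'FROM t' is a placeholder, and the g flag plays the role of oRe.Global = True, which the VBScript version also needs in order to replace more than the first match.

```javascript
// Shortened script with CRLF line endings; 'FROM t' is a placeholder.
const sql = [
  'SELECT',
  ' di.QuestionSetID AS SectionID',
  ',di.ScoreNBR AS Section,LowestTopBoxNBR',
  ',qdate.Quarter,StartDTS AS Section,StartDTS',
  'FROM t',
].join('\r\n');

// Capture the character on each side of the comma and write both back around a space.
const cleaned = sql.replace(/([^\r\n]),([^\r\n])/g, '$1 $2');

console.log(cleaned.split('\r\n'));
```

The commas at the start of a line survive because the character before them is a line break, which [^\r\n] refuses to match.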
Try it like this:
(\S+?),(?=\S+)
We exploit the fact that the , in question are always surrounded by non-whitespace \S. Since there is no (positive) lookbehind in VBScript's RegExp I simply capture the leading part and put it back while the comma itself is replaced by a space: "$1 ".
This also works if there is extra whitespace at the end or beginning of the line.
Demo
Code Sample:
Set re = New RegExp
re.Pattern = "(\S+?),(?=\S+)"
re.Global = True
Dim Input
Input = "SELECT " & vbCRLF & _
" di.QuestionSetID AS SectionID, " & vbCRLF & _
" di.QuestionSetID AS SectionID2 " & vbCRLF & _
",di.ScoreNBR AS Section,LowestTopBoxNBR" & vbCRLF & _
",di.AveragePercentileNBR AS SectionTopBoxPercentileRankNBR " & vbCRLF & _
",qdate.Quarter,StartDTS AS Section,StartDTS "& vbCRLF & _
"FROM NRCPicker.PatientSatisfaction.DimensionPercentile AS di" & vbCRLF & _
"INNER JOIN ("
msgbox re.Replace(Input, "$1 ")
If VBS uses roughly the same engine that JS uses, then you can leverage the lookahead assertion and BOL/EOL anchors.
In multi-line mode:
Find (?!^),(?!$)
Replace with a space
https://regex101.com/r/LRXNvz/1
update note:
Note that you can't just capture what's to the left and right of the comma and then write that back, since there could be adjacent sequential commas.
So anything like (.),(.) won't work.
Example 1: It matches 'hello,,,,,world', which advances the current position past the next comma, so it will never match the second one.
Example 2: It matches 'hello,,,,,world', which writes back a comma.
You can see this dysfunction here https://regex101.com/r/u5CPgb/1
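The difference between the two approaches can be reproduced with a few lines of JavaScript. This is a sketch only; the second input string is made up to show the line-start and line-end cases.

```javascript
const text = 'hello,,,,,world';

// Lookaheads consume nothing, so every inner comma is matched on its own.
const viaLookahead = text.replace(/(?!^),(?!$)/gm, ' ');

// Capturing the neighbours consumes them, so adjacent commas are skipped or written back.
const viaCaptures = text.replace(/(.),(.)/g, '$1 $2');

// Commas at the start or end of a line are left alone in multi-line mode.
const edges = ',a,b\n,c,'.replace(/(?!^),(?!$)/gm, ' ');

console.log(JSON.stringify(viaLookahead)); // "hello     world"
console.log(JSON.stringify(viaCaptures));  // "hello ,, ,world"
console.log(JSON.stringify(edges));        // ",a b\n,c,"
```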

Regular Expression starting and ending with special characters

I need to extract all matches from a huge text that start with [" and end with "]. These special characters separate each record from database. I need to extract all records.
Inside this record there are letters, numbers and special characters like -, ., &, (), /, {space} or so.
I'm writing this in Office VBA.
The pattern I have come up with so far looks like this: .Pattern = "[[][""][a-z|A-Z|w|W]*".
With this pattern, I am able to extract the first word from each record, with the starting characters [". The count of found matches is correct.
Example of one record:
["blabla","blabla","blabla","\u00e1no","nie","\u00e1no","\u00e1no","\u00e1no","\u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-pencil\u0022\u003E\u003C\/i\u003E Upravi\u0165\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;crz-form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva CRZ\u003C\/a\u003E"]
The question is: How can I extract all the records starting with [" and ending with "]?
I don't necessarily need the starting and ending characters; I can clean that up later.
Thanks for help.
The easiest way is to get rid of the initial and trailing [" and "] with either Replace or Left/Right/Mid functions, and then Split with "," (in VBA, """,""").
E.g.
input = "YOUR_STRING"
input = Replace(Replace(input, """]", ""), "[""", "")
result = Split(input, """,""")
If you plan to use Regex, you can use \["[\s\S]*?"] pattern, but it is not that efficient with long inputs and may even freeze the macro if timeout issue occurs. You can unroll it as
\["[^"]*(?:"(?!])[^"]*)*"]
See the regex demo. In VBA, Pattern = "\[""[^""]*(?:""(?!])[^""]*)*""]"
Note that with this unrolled pattern, you do not even need to use the workarounds for dot matching newline issue (negated character class [^"] matches any char but ", including a newline).
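As a rough illustration, the unrolled pattern can be exercised in JavaScript, where the syntax is the same apart from the string escaping VBA needs. The sample text and variable names are assumptions; the closing ] is escaped here only for readability.

```javascript
// Unrolled pattern: [" ... "] where an inner " may not be followed by ].
const recordPattern = /\["[^"]*(?:"(?!\])[^"]*)*"\]/g;

// Made-up text holding two records with some noise around them.
const text = 'x ["blabla","nie"] y ["a b","c"]';

// Strip the leading [" and trailing "] from each match, then split on ",".
const records = (text.match(recordPattern) || [])
  .map((record) => record.slice(2, -2).split('","'));

console.log(records); // [ [ 'blabla', 'nie' ], [ 'a b', 'c' ] ]
```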
Pattern details:
\[" - [" literally
[^"]* - zero or more characters other than "
(?:"(?!])[^"]*)* - zero or more sequences of
"(?!]) - " not followed with ]
[^"]* - zero or more characters other than "
"] - literal character sequence "]

Regex: match whole line except first string and #comment lines

I tried (\s|\t).*[\b\w*\s\b]; this one is almost OK, but I also want to exclude lines starting with #.
#Name Type Allowable values
#========================== ========= ========================================
_absolute-path-base-uri String -
add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0
As @anubhava said in his answer, it looks like you just need to check for # at the beginning of the line. The regex for that is simple, but the mechanics of applying the regex vary wildly, so it would help if we knew which regex flavor/tool you're using (e.g. PHP, .NET, Notepad++, EditPad Pro, etc.). Here's a JavaScript version:
/^[^#].*$/mg
Notice the modifiers: m ("multiline") allows ^ and $ to match at line boundaries, and g ("global") allows you to find all the matches, not just the first one.
Now let's look at your regex. [\b\w*\s\b] is a character class that matches a word character (\w), a whitespace character (\s), an asterisk (*), or a backspace (\b). In other words, both * and \b lose their special meanings when they appear in a character class.
\s matches any whitespace character including \t, so (\s|\t) is needlessly redundant, and may not be needed at all. What it's actually doing in your case is matching the newline before each matched line. There's no need for that when you can use ^ in multiline mode. If you want to allow for horizontal whitespace (i.e., spaces and tabs) before the #, you can do this:
/^(?![ \t]*#).*$/mg
(?![ \t]*#) is a negative lookahead; it means "from this position, it is impossible to match zero or more tabs or spaces followed by #". Coming right after the ^ line anchor as it does, "this position" means the beginning of a line.
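A short JavaScript sketch comparing the two patterns on a few lines, one of which is a made-up comment line with leading spaces:

```javascript
const text = [
  '#Name Type',
  '  # indented comment', // made-up line: a comment with leading spaces
  '_absolute-path-base-uri String -',
  'add-xml-decl Boolean y/n',
].join('\n');

// Only rejects lines whose very first character is #.
const simple = text.match(/^[^#].*$/gm);

// Also rejects lines where spaces or tabs come before the #.
const strict = text.match(/^(?![ \t]*#).*$/gm);

console.log(simple.length); // 3 (the indented comment slips through)
console.log(strict.length); // 2
```

Note that the lookahead version also matches empty lines, since .* may match nothing; filter those out if the input has any.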
Try this:
^[A-Za-z0-9_-]+\s+(.+)$
Assuming your first string will consist of only letters, numbers, underscores or hyphens, the first part will match that. Then we match whitespace, and then capture the rest. However, this is all dependent on the regular expression engine being used. Is this using language support for regexes, a specific editor, or a certain library? Which one? There isn't a standard: each regex engine works slightly differently.
Try this:
^[^#].*?(\s|\t)(?<Group>.*)$
After a match is found, the Group group will contain your string.
I would use this regex. In English, this says: "The first character is not a pound sign (#), then non-whitespace to match the first 'word', then whitespace, then capture the rest of the line."
^[^#]\S*\s+(.+)$
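For illustration, this is how the pattern behaves in JavaScript with the multiline and global flags, collecting group 1 from every match. The flags, the shortened sample lines and the variable names are assumptions, since the question does not say which tool is used.

```javascript
const text = [
  '#Name Type Allowable values',
  '#===== ===== =====',
  '_absolute-path-base-uri String -',
  'add-xml-decl Boolean y/n, yes/no',
].join('\n');

// Group 1 holds everything after the first word and the whitespace that follows it.
const rest = [...text.matchAll(/^[^#]\S*\s+(.+)$/gm)].map((m) => m[1]);

console.log(rest); // [ 'String -', 'Boolean y/n, yes/no' ]
```

One caveat of this sketch: \s+ can also match a line break, so a data line consisting of a single word would pull in the following line.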
Can I suggest another approach though? It looks like there are tabs between each field in the text, so why not just read the text line-by-line and split by tab into an array?
Here is an example in C# (untested):
// Requires: using System; using System.IO;
using(StreamReader sr = new StreamReader("C:\\Path\\to\\file.txt"))
{
    string line;
    // ReadLine() returns null at end of file, so the last line is processed too
    while((line = sr.ReadLine()) != null)
    {
        //skip the comment lines
        if(line.StartsWith("#"))
            continue;
        string[] fields = line.Split(new string[] {"\t"}, StringSplitOptions.RemoveEmptyEntries);
        //now fields[0] contains the Name field
        //fields[1] contains the Type field
        //fields[2] contains the Allowable Values field
    }
}
Try this code in php:
<?php
$s="#Name Type Allowable values
#========================== ========= ========================================
_absolute-path-base-uri String -
add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0 ";
$a = explode("\n", $s);
foreach($a as $str) {
preg_match('~^[^#].*$~', $str, $m);
var_dump($m);
}
?>
OUTPUT
array(0) {
}
array(0) {
}
array(1) {
[0]=>
string(79) "_absolute-path-base-uri String - "
}
array(1) {
[0]=>
string(77) "add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0 "
}
The code is pretty simple: the pattern refuses to match a # at the start of a line, thus ignoring those lines completely.

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';
/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);
This one comes from nanorc.sample, available in many Linux distros. It is used for syntax highlighting of C-style strings:
\"(\\.|[^\"])*\"
As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
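As a quick check, the double-quote pattern above (written here with non-capturing groups) and the simpler alternation form can be run side by side in JavaScript. String.raw is used so that the backslashes really are part of the test string; the variable names are made up for the example.

```javascript
// String.raw keeps the backslashes, so the text really contains \" sequences.
const s = String.raw` function(){ return " It's big \"problem "; }`;

const alternation = /"(?:[^"\\]|\\.)*"/;     // one character or one escape per iteration
const unrolled = /"[^"\\]*(?:\\.[^"\\]*)*"/; // runs of plain characters between escapes

console.log(s.match(alternation)[0]); // " It's big \"problem "
console.log(s.match(unrolled)[0]);    // " It's big \"problem "
```

Both find the same substring; the unrolled form simply needs fewer loop iterations, because each run of ordinary characters is consumed in one step.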
Most of the solutions provided here use alternative repetition paths i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some pattern compilers implement this using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.
/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string
"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, for it is known beforehand that a string without a closing quote wouldn't match in any case.
This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
Every quoted string starts with Char: " ;
It may contain any number of any characters: .*? {Lazy match}; ending with non escape character [^\\];
Statement (2) is Lazy(!) optional because string can be empty(""). So: (.*?[^\\])??
Finally, every quoted string ends with Char("), but it can be preceded with even number of escape sign pairs (\\\\)+; and it is Greedy(!) optional: ((\\\\)+)?+ {Greedy matching}, because string can be empty or without ending pairs!
An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Let's say you had the following string; String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
// Match close
'([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
'((?:' +
// Match escaped close quote
'(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
// Match everything thats not the close quote
'(?:(?!\\1).)' +
'){0,})' +
// Match open
'(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
'g'
);
// Reverse the matched strings.
const matches = myString
// Reverse the string.
.split('').reverse().join('')
// '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
// Match the quoted
.match(regExp)
// ['"hctam "\dluohs"\ siht"', '"dluohs"']
// Reverse the matches
.map(x => x.split('').reverse().join(''))
// ['"this \"should\" match"', '"should"']
// Re order the matches
.reverse();
// ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
The regexp can easily be broken into three pieces, as follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # A pair of escape characters
\\ # and a single escape
(?![\\]) # As long as that's not followed by an escape
) #
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form, generated using Jex's Regulex (a JavaScript regular expression visualizer); the image is on GitHub.
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js
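Here is one that works with both " and ', and you can easily add others at the start.
("|')(?:\\\1|[^\1])*?\1
It uses the backreference (\1) to match exactly what is in the first group (" or '). Note that in most flavors a backreference is not recognized inside a character class, so [^\1] does not mean "anything but the opening quote"; it is the lazy quantifier that stops the match at the first closing quote not preceded by a backslash.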
http://www.regular-expressions.info/backref.html
One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and linear, manual seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^”\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)
If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"
I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
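The same two-step idea, sketched in JavaScript rather than Java so it can be run directly. The input line is made up, and the "x" placeholder is taken from the snippet above.

```javascript
// Made-up input; String.raw keeps the backslashes in front of the inner quotes.
const line = String.raw`print("say \"hi\"", other("b"))`;

// Step 1: replace escaped quotes with something easier to handle.
const noEscapes = line.replace(/\\"/g, "'");

// Step 2: with no escaped quotes left, a plain negated class is enough.
const masked = noEscapes.replace(/"[^"]*"/g, '"x"');

console.log(noEscapes); // print("say 'hi'", other("b"))
console.log(masked);    // print("x", other("x"))
```

This shortcut assumes the input never contains an escaped backslash directly before a closing quote (\\"); such a case would be misread as an escaped quote.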
If your IDE is IntelliJ IDEA, you can forget all these headaches: store your regex in a String variable, and when you paste it inside the double quotes it is automatically escaped into an acceptable format.
Example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
Now you can use this variable in your regexp or anywhere else.
(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "
Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"