Regex to remove unescaped quotes from a CSV

Regex to remove unescaped quotes from a CSV - regex

I need to feed a CSV file into a database. For that I have to remove "wild" un-escaped quotes.
Following input structure is possible:
"aa";"bb";"cc";"dd";"ee"
"aa";"bb";"c "cc" c";"dd";"ee"
"aa";;"cc";"dd";"ee"
"aa";55;"cc";"dd";"ee"
The expression:
(?<!^|\"\;)\"(?!\;|$)
does work for #1 and #2 of the input examples but fails when there is an empty element (#3) or an unquoted numeric field (#4). Also see this Rubular example
Any pointer how to get these cases covered would be highly appreciated.
Edit:
Following #Wiktor Stribiżew advice, I'm now using
(^"|"$|";+"|";\d+;"|";|;")|"
this also covers some additional edge cases, I have identified in the input data, as shown here

The following solution only meets your current requirements and is not a universal solution to fix quotes in CSV:
(^"|"$|";+"|";\d+;")|"
Replace with $1 (or \1, depending on where you use this regex).
See the regex demo.
Details
(^"|"$|";+"|";\d+;") - Group 1:
^"| - " at the start of the string, or
"$| - " at the end of the string, or
";+"| - ", 1+ ; chars, and then ", or
";\d+;" - ";, 1+ digits, then ;"
| - or
" - a " char.

Related

Regex for SQL Query

Hello together I have the following problem:
I have a long list of SQL queries which I would like to adapt to one of my changes. Finally, I have a renaming problem and I'm afraid I want to solve it more complicated than expected.
The query looks like this:
INSERT member (member, prename, name, street, postalcode, town, tel1, tel2, fax, bem, anrede, salutation, email, name2, name3, association, project) VALUES (2005, N'John', N'Doe', N'Street 4711', N'1234', N'Town', N'1234-5678', N'1234-5678', N'1234-5678', N'Leader', NULL, N'Dear Mr. Doe', N'a#b.com', N'This is the text i want to delete', N'Name2', N'Name3', NULL, NULL);
In the "Insert" there was another column which I removed (which I did simply via Notepad++ by typing the search term - "example, " - and replaced it with an empty field. Only the following entry in Values I can't get out using this method, because the text varies here. So far I have only worked with the text file in which I adjusted the list of queries.
So as you can see there is one more entry in Values than in the insertions (there was another column here, but it was removed by my change).
It is the entry after the email address. I would like to remove this including the comma (N'This is the text i want to delete',).
My idea was to form a group and say that the 14th digit after the comma should be removed. However, even after research I do not know how to realize this.
I thought it could look like this (tried in https://regex101.com/)
VALUES\s?\((,) something here
Is this even the right approach or is there another method? I only knew Regex to solve this problem, because of course the values look different here.
And how can I finally use the regex to get the queries adapted (because the queries are local to my computer and not yet included in the code).
Short summary:
Change the query from
VALUES (... test5, test6, test7 ...)
To
VALUES (... test5, test7 ...)

As per my comment, you could use find/replace, where you search for:
(\bVALUES +\((?:[^,]+,){13})[^,]+,
And replace with $1
See the online demo
( - Open 1st capture group.
\bValues +\( - Match a word-boundary, literally 'VALUES', followed by at least a single space and a literal open paranthesis.
(?: - Open non-capturing group.
[^,]+, - Match anything but a comma at least once followed by a comma.
){13} - Close non-capture group and repeat it 13 times.
) - Close 1st capture group.
[^,]+, - Match anything but a comma at least once followed by a comma.

You may use the following to remove / replace the value you need:
Find What: \bVALUES\s*\((\s*(?:N'[^']*'|\w+))(?:,(?1)){12}\K,(?1)
Replace With: (empty string, or whatever value you need)
See the regex demo
Details
\bVALUES - whole word VALUES
\s* - 0+ whitespaces
\( - a (
(\s*(?:N'[^']*'|\w+)) - Group 1: 0+ whitespaces and then either N' followed with any 0 or more chars other than ' and then a ', or 1+ word chars
(?:,(?1)){12} - twelve repetitions of , followed with the Group 1 pattern
\K - match reset operator that discards the text matched so far from the match memory buffer
, - a comma
(?1) - Group 1 pattern.
Settings screen:

Replace double quoted strings by single quoted except for GStrings

My OCD has gotten the better of me and I'm going through my groovy codebase replacing simple strings with double quotes around them into single quoted strings.
However, I want to avoid GStrings that actually contain dollar symbols and variables.
I'm using IntelliJ to do the substitution, and the following almost works:
From: "([^$\"\n\r]+)"
To: '$1'
It captures strings without any dollars in, but only partially skips any strings that contain them.
For example it matches the quotes between two double quoted strings in this case:
foo("${var}": "bar")
^^^^
Is it possible to create a regex that would skip a whole string that contained dollars, so in the above case it skips "${var}" and selects "bar", instead of erroneously selecting ": "?
EDIT: Here's a section of code to try against
table.columns.elements.each{ columnName, column ->
def columnText = "${columnName} : ${column.dataType}"
cols += "${columnText}\n"
if (columnText.length() > width) {
width = columnText.length()
}
height++
}
builder."node"("id": table.elementName) {
builder."data"("key": "d0") {
builder."y:ShapeNode"()
}
}
def foo() {
def string = """
a multiline quote using triple quotes with ${var} gstring vars in.
"""
}

Do single and triple quote replacements separately.
Single quotes:
Use a look ahead for an even number of quotes after your hit. A negative look behind stops it matching the inner quotes of triple quoted strings.
Find: (?<!")"([^"$]*)"(?=(?:(?:[^"\r\n]*"){2})*[^"]*$)
Replace: '$1'
See live demo.
Triple quotes:
Use a simpler match for triple quoted strings, since they are on their own lines.
Find: """([^"$]*?)"""
Replace: '''$1'''
See live demo, which includes a triple-quoted string that contains a variable.

You need to make sure the first quote comes after even number of quotes:
^[^\n\r"]*(?:(?:"[^"\n\r]*){2})*"([^$\"\n\r]+)"
Here you can play with it.
Explanation:
^[^"\n\r]* - some non-quotes at the beginning
"[^"\n\r]* - a quote, then some more non-quotes
(?:"[^"\n\r]*){2} - let's have two of this
(?:(?:...)) - actually, let's have 0, 2, 4, 6, ... whatever amount of this
Then your regex comes to match the right string: "([^$\"\n\r]+)"
If intellij supports that, then you can make it faster by replacing the non-capturing groups (?:...) with atomic groups (?>...).
This regex finds the last string in the line so you'll have to run the replace several times.
Update
Updated the negated character classes with the newline characters. Now it works well for multi-line texts too. Still, you'll have to run it several times because it finds only one string per line.

Regular Expression starting and ending with special characters

I need to extract all matches from a huge text that start with [" and end with "]. These special characters separate each record from database. I need to extract all records.
Inside this record there are letters, numbers and special characters like -, ., &, (), /, {space} or so.
I'm writing this in Office VBA.
The pattern I have come so far looks like this: .Pattern = "[[][""][a-z|A-Z|w|W]*".
With this pattern, I am able to extract the first word from each record, with the starting characters [". The count of found matches is correct.
Example of one record:
["blabla","blabla","blabla","\u00e1no","nie","\u00e1no","\u00e1no","\u00e1no","\u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-pencil\u0022\u003E\u003C\/i\u003E Upravi\u0165\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;crz-form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva CRZ\u003C\/a\u003E"]
The question is : How can I extract the all records starting with [" and ending with "]?
I don't necessary need the starting and ending characters, but I can clean that up later.
Thanks for help.

The easiest way is to get rid of the initial and trailing [" and "] with either Replace or Left/Right/Mid functions, and then Split with "," (in VBA, """,""").
E.g.
input = "YOUR_STRING"
input = Replace(Replace(input, """]", ""), "[""", "")
result = Split(input, """,""")
If you plan to use Regex, you can use \["[\s\S]*?"] pattern, but it is not that efficient with long inputs and may even freeze the macro if timeout issue occurs. You can unroll it as
\["[^"]*(?:"(?!])[^"]*)*"]
See the regex demo. In VBA, Pattern = "\[""[^""]*(?:""(?!])[^""]*)*""]"
Note that with this unrolled pattern, you do not even need to use the workarounds for dot matching newline issue (negated character class [^"] matches any char but ", including a newline).
Pattern details:
\[" - [" literally
[^"]* - zero or more characters other than "
(?:"(?!])[^"]*)* - zero or more sequences of
"(?!]) - " not followed with ]
[^"]* - zero or more characters other than "
"] - literal character sequence "]

Strip comments from text except for comment char between quotes

I'm trying to build a regexp for removing comments from a configuration file. Comments are marked with the ; character. For example:
; This is a comment line
keyword1 keyword2 ; comment
keyword3 "key ; word 4" ; comment
The difficulty I have is ignoring the comment character when it's placed between quotes.
Any ideas?

You could try matching a semicolon only if it's followed by an even number of quotes:
;(?=(?:[^"]*"[^"]*")*[^"]*$).*
Be sure to use this regex with the Singleline option turned off and the Multiline option turned on.
In Python:
>>> import re
>>> t = """; This is a comment line
... keyword1 keyword2 ; comment
... keyword3 "key ; word 4" ; comment"""
>>> regex = re.compile(';(?=(?:[^"]*"[^"]*")*[^"]*$).*', re.MULTILINE)
>>> regex.sub("", t)
'\nkeyword1 keyword2 \nkeyword3 "key ; word 4" '

No regex :)
$ grep -E -v '^;' input.txt
keyword1 keyword2 ; comment
keyword3 "key ; word 4" ; comment

You may use regexp to get all strings out first, replace them with some place-holder, and then simply cut off all \$.*, and replace back the strings at last :)

Something like this:
("[^"]*")*.*(;.*)
First, match any number of text between quotes, then match a ;. If the ; is between quotes it will be matches by the first group, not by the second group.

I (somewhat accidentally) came up with a working regex:
replace(/^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm, '$1')
I wanted:
remove single line comments at start of line or end of line,
to use single and double quotes,
the ability to have just one quote in a comment: that's useful (but accept " as well)
(so matching on a balanced set (even number) of quotes after a comment-delimiter as in Tim Pietzcker's answer was not suitable),
leave comment-delimiter ; alone in correctly (closed) quoted 'strings'
mix quoting style
multiple quoted strings (and comments in/after comments)
nest single/double quotes in resp. double/single quoted 'strings'
data to work on is like valid ini-files (or assembly), as long as it doesn't contain escaped quotes or regex-literals etc.
Lacking look-back on javascript I thought it might be an idea to not match comments (and replace them with ''), but match on data preceding the comment and then replace the full match data with the sub-match data.
One could envision this concept on a line by line basis (so replace the full line with the match, thereby 'loosing' the comment), BUT the multiline parameter doesn't seem to work exactly that way (at least in the browser).
[^'";]* starts eating any characters from the 'start' that are not '";.
(Completely counter-intuitive (to me), [^'";\r\n]* will not work.)
(?:'[^']*'|"[^"]*")? is a non-capturing group matching zero or one set of quote any chars quote (and (?:(['"])[^\2]*\2)? in /^((?:[^'";]*(?:(['"])[^\2]*\2)?)*)[ \t]*;.*$/gm or
(?:(['"])[^\2\r\n]*\2)? in /^((?:[^'";]*(?:(['"])[^\2\r\n]*\2)?)*)[ \t]*;.*$/gm (although mysteriously better) do not work (broke on db 'WDVPIVAlQEFQ;WzRcU',"hi;hi",0xfe,"'as), but not adding another capturing group for re-use in the match is a good thing as they come with penalties anyway).
The above combo is placed in a non-capturing group which may repeat zero or more times and it's result is placed in a capturing group 1 to pass along.
That leaves us with [ \t]*;.* which 'simply' matches zero or more spaces and tabs followed by a semicolon, followed by zero or more chars that are not a new line. Note how ; is NOT optional !!!
To get a better idea of how this (multi-line parameter) works, hit the exp button in the demo below.
function demo(){
var elms=document.getElementsByTagName('textarea');
var str=elms[0].value;
elms[1].value=str.replace( /^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm
, '$1'
)
.replace( /[ \t]*$/gm, ''); //optional trim
}
function demo_exp(){
var elms=document.getElementsByTagName('textarea');
var str=elms[0].value;
elms[1].value=str.replace( /^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm
, '**S**$1**E**' //to see start and end of match.
);
}
<textarea style="width:98%;height:150px" onscroll="this.nextSibling.scrollTop=this.scrollTop;">
; This is a comment line
keyword1 keyword2 ; comment
keyword3 "key ; word 4" ; comment
"Text; in" and between "quotes; plus" semicolons; this is the comment
; This is a comment line
keyword1 keyword2 ; comment
keyword3 'key ; word 4' ; comment and one quote ' ;see it?
_b64decode:
db 0x83,0xc6,0x3A ; add si, b64decode_end - _b64decode ;39
push 'a'
pop di
cmp byte [si], 0x2B ; '+'
b64decode_end:
;append base64 data here
;terminate with printable character less than '+'
db 'WDVPIVAlQEFQ;WzRcU',"hi;hi",0xfe,"'as;df'" ;'haha"
;"end'
</textarea><textarea style="width:98%;height:150px" onscroll="this.previousSibling.scrollTop=this.scrollTop;">
result here
</textarea>
<br><button onclick="demo()">remove comments</button><button onclick="demo_exp()">exp</button>
Hope this helps.
PS: Please comment valid examples if and where this might break! Since I generally agree (from extensive personal experience) that it is impossible to reliably remove comments using regex (especially higher level programming languages), my gut is still saying this can't be fool-proof. However I've been throwing existing data and crafted 'what-ifs' at it for over 2 hours and couldn't get it to break (, which I'm usually very good at).

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';

/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);

This one comes from nanorc.sample available in many linux distros. It is used for syntax highlighting of C style strings
\"(\\.|[^\"])*\"

As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/

Most of the solutions provided here use alternative repetition paths i.e. (A|B)*.
You may encounter stack overflows on large inputs since some pattern compiler implements this using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.

/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string

"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes

/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'posessive' form of both + and * to prevent backtracking, for it is known beforehand that a string without a closing quote wouldn't match in any case.

This one works perfect on PCRE and does not fall with StackOverflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
Every quoted string starts with Char: " ;
It may contain any number of any characters: .*? {Lazy match}; ending with non escape character [^\\];
Statement (2) is Lazy(!) optional because string can be empty(""). So: (.*?[^\\])??
Finally, every quoted string ends with Char("), but it can be preceded with even number of escape sign pairs (\\\\)+; and it is Greedy(!) optional: ((\\\\)+)?+ {Greedy matching}, bacause string can be empty or without ending pairs!

An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Lets say you had the following string; String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
// Match close
'([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
'((?:' +
// Match escaped close quote
'(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
// Match everything thats not the close quote
'(?:(?!\\1).)' +
'){0,})' +
// Match open
'(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
'g'
);
// Reverse the matched strings.
matches = myString
// Reverse the string.
.split('').reverse().join('')
// '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
// Match the quoted
.match(regExp)
// ['"hctam "\dluohs"\ siht"', '"dluohs"']
// Reverse the matches
.map(x => x.split('').reverse().join(''))
// ['"this \"should\" match"', '"should"']
// Re order the matches
.reverse();
// ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
This is the regexp can be easily broken into three pieces. As follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # A pair of escape characters
\\ #
(?![\\]) # As long as that's not followed by an escape
) # and a single escape
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form: generated using Jex's Regulex
Image on github (JavaScript Regular Expression Visualizer.)
Sorry, I don't have a high enough reputation to include images, so, it's just a link for now.
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js

here is one that work with both " and ' and you easily add others at the start.
("|')(?:\\\1|[^\1])*?\1
it uses the backreference (\1) match exactley what is in the first group (" or ').
http://www.regular-expressions.info/backref.html

One has to remember that regexps aren't a silver bullet for everything string-y. Some stuff are simpler to do with a cursor and linear, manual, seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).

A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)

If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"

I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.

If your IDE is IntelliJ Idea, you can forget all these headaches and store your regex into a String variable and as you copy-paste it inside the double-quote it will automatically change to a regex acceptable format.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.

(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "

Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js