Notepad++ regex replace - replace all commas with \, within quotations - regex

I am trying to import a csv file into mysql, and I need to convert it into a proper format before importing.
If there's a comma in a column, the csv encloses it within double quotations, here's an example of a row without a comma, and a row with a comma:
1,Superman
2,"Batman,Flash"
What I need to do is to convert all columns which have commas to escape the comma and remove the quotations... such as "Batman,Flash" to Batman\,Flash
Here's what I have so far
Find: "(.*),(.*)"
Replace: \1\\,\2
However, there are two cases in which this does not work:
It will only replace one comma if there's more than one comma within a quoted column. So something like "Batman,Flash,Robin" will be converted to Batman,Flash\,Robin
This doesn't work if the first column has a comma as well. For example, on a row such as "1,2,3","Batman,Robin"
How can I change the regexes to accommodate the two cases that don't yet work?

I'm sorry, but regex is not the tool for this. You must parse it.
Why?
Do you want to convert this?
"test\, w00t!"
Or what about this?
"test\\\\\, w00t!"
Heck, even this?
"tes\\","\"ing\,\\,"
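If a script outside Notepad++ is acceptable, the "parse it" advice can be sketched with Python's standard csv module. This is only an illustration under assumptions that are not in the question: the function name escape_commas is made up, quotes inside fields are assumed to follow the csv module's default doubled-quote convention, and existing backslashes are doubled on the assumption that the importer treats backslash as its escape character.

```python
import csv
import io


def escape_commas(csv_text: str) -> str:
    """Parse CSV text and re-emit it unquoted, with field commas written as \\,"""
    out_lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Assumption: backslash is the importer's escape character, so any
        # backslash already in the data is doubled before commas are escaped.
        fields = [f.replace("\\", "\\\\").replace(",", "\\,") for f in row]
        out_lines.append(",".join(fields))
    return "\n".join(out_lines)


sample = '1,Superman\n2,"Batman,Flash"\n"1,2,3","Batman,Robin"\n'
converted = escape_commas(sample)
print(converted)
```

Because the parser decides where each field starts and ends, both problem cases from the question (several commas in one field, and a quoted first column) are handled without special treatment.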

Related

Replace commas between at's on notepad++

I have a CSV with data to import, the separator character is the comma here; but when the row or line has two e-mails, a comma separates them so the import fails at that point.
So I thought removing the commas between two at's when they're on the same line, but I don't know how.
If you have an alternative solution, it'll be welcome too!!
Thanks.
Example:
ENTERPRISE1 S.L.,,ENTERPRISE1,999461678,,,,,,ent1#mail.com, ent1alternate#mail2.com,Spain,,,
ENTERPRISE2 S.A.,,ENTERPRISE2.,999859177,,,,,,ent2#mail.com,Italy,,,
Given that your data doesn't use any escaping and the #-char will only be present in the mail column, you could use ((?:#|\G(?!^))[^,]+),([^,#]+#) as a search pattern and $1$2 as the replacement. This also handles more than two mails in the column correctly. Of course, you can place a separator of your choice between $1 and $2, like $1;$2
You can see it in action here.
You can do it with notepad:
search field:
([^#]+#[^,]+)\s*,\s*([^#]+#[^,]+)
Replace field:
\1|\2
Check regular expression checkbox
So
ent1#mail.com, ent1alternate#mail2.com
will be:
ent1#mail.com|ent1alternate#mail2.com
This lets you keep your column layout, so you can still process the data without losing any of it.
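For anyone who wants to try this search/replace pair outside Notepad++, here is a rough Python sketch. The pattern is the one from the answer above; the helper name join_mails and the choice to run it one line at a time are my own assumptions. Working per line matters because [^#]+ would otherwise be free to run across line breaks.

```python
import re

# Search pattern from the answer above; '#' stands in for '@' as in the sample data.
PAIR = re.compile(r'([^#]+#[^,]+)\s*,\s*([^#]+#[^,]+)')


def join_mails(line: str, sep: str = "|") -> str:
    """Join two comma-separated addresses on one line with `sep`."""
    return PAIR.sub(lambda m: m.group(1) + sep + m.group(2), line)


rows = [
    "ENTERPRISE1 S.L.,,ENTERPRISE1,999461678,,,,,,ent1#mail.com, ent1alternate#mail2.com,Spain,,,",
    "ENTERPRISE2 S.A.,,ENTERPRISE2.,999859177,,,,,,ent2#mail.com,Italy,,,",
]
fixed = [join_mails(row) for row in rows]
for row in fixed:
    print(row)
```

The second row has only one address, so the pattern finds nothing to join and the row comes back unchanged.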
Another option is to use the correct CSV formatting: double quotes around any field that contains the delimiter.
([^,]+#[^,]+,[^,]+#[^,]+)
Replace:
"\1"
(Regex adapted from destrif's answer).
It looks like your example always has a space after the comma separating multiple email addresses. If that's a general rule, you could replace the ", " (comma + space) string with another separator such as a semicolon, using Ctrl+H to open the Replace dialog.

Regex to remove commas between quotes with comma right before end quote Notepad++

In Notepad++, I am using Regex to replace commas between quotes in CSV file.
Using a similar example from here. This is what I am trying to read.
1070,17,2,GN3-670,"COLLAR B, M STAY,","2,606.45"
except in my text there is an extra comma right before the closing quotes.
The regex ("[^",]+),([^"]+") does not seem to pick up the last comma and result is
1070,17,2,GN3-670,"COLLAR B M STAY,","2606.45"
I would like
1070,17,2,GN3-670,"COLLAR B M STAY","2606.45"
Is there a simple regex, or will I have to use a CSV reader in C#?
Edit: Some of the Regex is giving false matches so I would like to add another scenario. If I have
1070,17,2,GN3-670,"COLLAR B, M STAY,",55, FREE,"2,606.45"
I would like
1070,17,2,GN3-670,"COLLAR B M STAY",55, FREE,"2606.45"
I think this is what you're looking for:
,(?=[^"]*"(?:[^"]*"[^"]*")*[^"]*$)
This matches any comma that's followed by an odd number of quotes. It consumes only the comma, so you replace it with nothing.
The thing about your original solution is that it would only match one comma per quoted field. It never even tried to match the second comma in "COLLAR B, M STAY,", so its position didn't really matter. This solution removes any number of commas, regardless of their position within the field.
UPDATE: This regex assumes you're processing one line at a time. If you're using it on a whole document containing many lines, the regex is probably timing out. You can work around that by excluding line terminators (carriage returns and linefeeds), like this:
,(?=[^"\r\n]*"(?:[^"\r\n]*"[^"\r\n]*")*[^"\r\n]*$)
Note that the CSV spec (such as it is) says you can have line terminators in quoted fields, so this regex is technically incorrect. If you do need to support multiline fields, you might as well switch to the CSV library. Regexes are not quite capable of handling CSV fully, but in most cases they're good enough.
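As a quick check of the line-safe version, here is a small Python sketch that applies it to both rows from the question. The variable names are mine, and the sample uses plain \n line endings: Python's multiline $ only matches before \n, so a file with \r\n endings would need them normalised first.

```python
import re

# The line-safe pattern from the answer: a comma followed by an odd number
# of double quotes before the end of the line.
COMMA_IN_QUOTES = re.compile(
    r',(?=[^"\r\n]*"(?:[^"\r\n]*"[^"\r\n]*")*[^"\r\n]*$)',
    re.MULTILINE,
)

text = (
    '1070,17,2,GN3-670,"COLLAR B, M STAY,","2,606.45"\n'
    '1070,17,2,GN3-670,"COLLAR B, M STAY,",55, FREE,"2,606.45"\n'
)
cleaned = COMMA_IN_QUOTES.sub("", text)
print(cleaned)
```

Both rows come out in the form the question asks for, with the field-separating commas left in place.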
You can use the following to match:
((["])(?:(?=(\\?))\3.)*?),\2
And replace with the following:
\1"
See DEMO
This should work
Find What ("[^"]*),"
Replace With \1"

Changing order of csv-file entries with regex replacement in Notepad++

I am trying to change the order of the entries of a *.csv file with Notepad++'s built-in find/replace function. This is what the file looks like now:
ABC;DEF;Here comes some long text with ,.- in it;true;false;
QWE;RTY;Here comes some long text with ,.- in it;true;false;
And this is how it should look like after find/replace:
DEF;Here comes some long text with ,.- in it;ABC;true;false;;
RTY;Here comes some long text with ,.- in it;QWE;true;false;;
So column #1 should be at the position of #3, column number #2 and #3 should shift one to the left.
What I tried so far:
I tried to capture the first three columns with a regular expression in the find field, put some brackets around them and reorder them with the $ sign in the replace field. But my regex matches nearly the whole line, not only the first three columns. What am I doing wrong? Here is my regex:
([A-Z]{3})\;([A-Z]{3})\;(.*[^\;])\;
The first two columns and the following ; are selected properly, so the problem must be in the third round bracket. But I have no clue what the problem is. The third expression should match everything except ; and is ended by a ;.
The content of the replacement field should be $2;$3;$1;, I guess that's right.
The main problem is the greedy .* in your third group: it runs on to the last semi-colon of the line instead of stopping at the next one. You are also escaping the semi-colons unnecessarily. Use this expression ^(?s)([A-Z]{3};)([A-Z]{3};)([^\n\r;]*;) and replace it with this expression $2$3$1
I have excluded the line delimiters \r and \n from the third group as well, in case a line has fewer columns. You should also use the ^ anchor to be safe if you have more columns.
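The same reordering can be tried in Python as a rough sketch. Two things differ from the Notepad++ version and are my own adjustments: the (?s) flag is left out (the pattern contains no dot, and recent Python versions reject a global flag that is not at the very start of the pattern), and the groups are referenced as \2\3\1 rather than $2$3$1.

```python
import re

# Three leading columns, each captured together with its trailing semicolon.
REORDER = re.compile(r'^([A-Z]{3};)([A-Z]{3};)([^\n\r;]*;)', re.MULTILINE)

text = (
    "ABC;DEF;Here comes some long text with ,.- in it;true;false;\n"
    "QWE;RTY;Here comes some long text with ,.- in it;true;false;\n"
)
# Emit column 2, then column 3, then column 1; the rest of the line is untouched.
reordered = REORDER.sub(r'\2\3\1', text)
print(reordered)
```

Since each group carries its own semicolon, the replacement needs no literal separators.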

PCRE regex replace a text pattern within double quotes

In Notepad++ 6.5.1 I need to replace certain patterns within quote pairs. I want to save the replace as part of a macro, so all replacements need to happen in one step.
For example, in the following string, replace all 'a' characters within quote pairs with a dash, while leaving characters outside the quote pairs untouched:
Input: aa"bbabaavv"kdjhas"bbabaavv"x
Desired result: aa"bb-b--vv"kdjhas"bb-b--vv"x
Note that the quotes are matched up pairwise, such that the 'a' in kdjhas is untouched.
So far I have tried searching for (?:"[^"a]*|\G)\Ka([^"a]*) and replacing with -$1, but that simply replaces all the a's, with the result --"bb-b--vv"kdjh-s"bb-b--vv"x. I'm attempting PCRE regex that will let me recursively replace the quote-delimited text.
Edit: Quote marks within a quoted string are escaped with an extra quote, e.g. "". However, assume I will have already replaced these in a previous pass with a special character. Therefore a regex solution to this problem will not have to deal with escaped quotes.
It is hard to tell if this is possible as you've only provided one line of input text.
But assuming that input follows this pattern:
BOL|any text|string with two groups of a's|any text|string with two groups of a's|any text|EOL
aa "bbabaavv" kdjhas "bbabaavv" x
I was able to create this regexp search string:
^(.+?\".+?)([a]+)(.+?)([a]+)(.*?\")(.+?\".+?)([a]+)(.+?)([a]+)(.*?\".*)$
With this replace string:
\1-\3-\5\6-\8-\A
and it turns your input string from this:
aa"bbabaavv"kdjhas"bbabaavv"x
into this:
aa"bb-b-vv"kdjhas"bb-b-vv"x
Now naturally the search and replace will fail if the input varies from the pattern described, as the search is looking for those four groups of a's inside the two groups of quoted strings.
Also I tested that regexp using Zeus which can create a regexp with more than 9 groups.
As you can see the regexp requires 10 groups.
I'm not familiar with Notepad++, so I don't know if it supports that many groups.
If your data have variable number of occurrences of quoted strings, then it is not possible to perform replacements only via regex at least in its form offered by Notepad++.
To replace using regex, you would need to perform regex find in existing regex match. As far as I know such a functionality is not available in Notepad++ regexes.
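For comparison, the "find within an existing match" step is easy to express in a scripting language through a replacement callback. The Python sketch below is not a Notepad++ solution; the function name is made up, and it assumes (as the question's edit states) that escaped quotes have already been replaced in an earlier pass.

```python
import re


def dash_a_inside_quotes(s: str) -> str:
    """Replace every 'a' with '-' inside each "..." pair, leaving the rest alone."""
    # The outer regex finds one quoted run at a time; the callback then
    # performs the inner replacement on just that run.
    return re.sub(r'"[^"]*"', lambda m: m.group(0).replace("a", "-"), s)


result = dash_a_inside_quotes('aa"bbabaavv"kdjhas"bbabaavv"x')
print(result)
```

Each match consumes both of its quotes, so the text between two quoted runs (kdjhas here) is never treated as quoted.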
Self-answer
I may have been reaching for the stars in trying to get Notepad++ to do this regex replace, but I think I found a workaround.
The actual task I was attempting involved creating a SQL Server VALUES list from an Excel spreadsheet, where I was copying and pasting selected cells into Notepad++. The delimiters are \t and \r\n. But, cells can have linefeeds too, which are delimited by ". So, I was going to replace these linefeeds with <br> (or something like it), so that
"line1
line2"
would become "line1<br>line2", before processing the actual end-of-row line feeds.
Having such parsing work reliably, especially when more than two lines were in a single cell, may have been too much to ask of Notepad++'s regex capability.
So I came up with a workaround that seems to be working :) Basically it starts with selecting a blank "dummy" column to the right of my column selection (which I can insert if I'm partially selecting from the middle). This will leave a trailing \t at the end of each row, which effectively sets these EOLs apart from ones that might exist within a text cell, freeing me from having to parse line feeds from a "..." field.
So I compiled a macro from the following steps, which seems to be working well:
replace ' with ''
replace \t\r\n with '\)\r\n, \('
replace \t with ', '
replace "" with ''
replace " with <blank>
replace ^ with \(' (cleanup - first row only)
replace ^, \('$ with <blank> (cleanup - last row only)
Example transformation:
from
line1 line 2
"line3
line3b
line3c" line 4
to
('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
which can now be easily modified into a SELECT statement:
SELECT *
FROM (VALUES('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
) t(a,b)
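For readers who would rather script this than record a macro, the seven replacement steps translate almost one-to-one into plain string operations. The Python sketch below is only an approximation under assumptions of mine: the function name is invented, rows are assumed to end in \t\r\n because of the dummy column, and line breaks inside a cell are assumed to be bare \n.

```python
def to_values_list(pasted: str) -> str:
    """Apply the macro's replacement steps to tab-separated text copied from Excel."""
    out = pasted.replace("'", "''")            # step 1: escape single quotes for SQL
    out = out.replace("\t\r\n", "')\r\n, ('")  # step 2: row ends (marked by the dummy column)
    out = out.replace("\t", "', '")            # step 3: remaining tabs separate cells
    out = out.replace('""', "''")              # step 4: doubled double quotes
    out = out.replace('"', "")                 # step 5: drop the cell-wrapping quotes
    out = "('" + out                           # step 6: open the first row
    if out.endswith(", ('"):                   # step 7: remove the dangling opener after the last row
        out = out[: -len(", ('")]
    return out


pasted = 'line1\tline 2\t\r\n"line3\nline3b\nline3c"\tline 4\t\r\n'
values = to_values_list(pasted)
print(values)
```

The order of the steps matters in the same way as in the macro: the \t\r\n replacement has to run before the plain \t replacement, otherwise the row-end markers are lost.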

Replace a comma in text values in CSV using regex in Notepad++

I searched a lot but couldn't find an exact solution.
I have a CSV in which some values contain a comma.
Following is a sample row
"BEIAAGJIPAMBPJIF",2757,08042010,"13:53.59",09042010,"01:55.39","SIHAM","BEIAIGHEIPLGPJIF",20,"A",20,"S",0.00,0.00,0.00,"OLY
SPECIAL ORDER","IN STOCK , DESIGNER",0.00000,0,"","N","N",
Now if you look at the value "IN STOCK , DESIGNER", it contains a comma in between. Because of this, both my .NET application and the MS Dynamics CRM import file wizard break it into two separate values instead of one single value when reading the CSV.
I need a regex that I can use in Notepad++ to match such strings and replace the comma with a hyphen "-".
Kindly help.
Thanks.
This solution worked for me, although it is a bit indirect:
1. By searching, find a character that is unused in the file, e.g. #
2. Use the following regex replace to replace all delimiters: find: (".*?"|.*?), replace: \1# (note the character from step 1)
3. Now all leftover commas are only those inside the quotes. Mass replace them with -
4. Replace all #'s back with commas
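The placeholder trick above can be sketched in Python to see what each pass does. The function name and the check for an unused placeholder are my own additions. One caveat I'm fairly sure of: the find pattern only treats a field as complete when a comma follows it, so this relies on every row ending in a trailing comma, as the sample row does.

```python
import re


def hyphenate_quoted_commas(line: str, placeholder: str = "#") -> str:
    """Turn commas inside quoted values into '-' using a temporary placeholder."""
    if placeholder in line:
        raise ValueError("placeholder already occurs in the data; pick another one")
    # Pass 1: swap every delimiter comma (the one right after a field) for the placeholder.
    marked = re.sub(r'(".*?"|.*?),', lambda m: m.group(1) + placeholder, line)
    # Pass 2: any comma still present sits inside a quoted value.
    marked = marked.replace(",", "-")
    # Pass 3: restore the real delimiters.
    return marked.replace(placeholder, ",")


row = '"SPECIAL ORDER","IN STOCK , DESIGNER",0.00000,0,"","N","N",'
fixed_row = hyphenate_quoted_commas(row)
print(fixed_row)
```

If the file can contain rows without a trailing comma, a real CSV parser is the safer route.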