PCRE regex replace a text pattern within double quotes - regex

In Notepad++ 6.5.1 I need to replace certain patterns within quote pairs. I want to save the replace as part of a macro, so all replacements need to happen in one step.
For example, in the following string, replace all 'a' characters within quote pairs with a dash, while leaving characters outside the quote pairs untouched:
Input: aa"bbabaavv"kdjhas"bbabaavv"x
Desired result: aa"bb-b--vv"kdjhas"bb-b--vv"x
Note that the quotes are matched up pairwise, such that the 'a' in kdjhas is untouched.
So far I have tried searching for (?:"[^"a]*|\G)\Ka([^"a]*) and replacing with -$1, but that simply replaces all the a's, with the result --"bb-b--vv"kdjh-s"bb-b--vv"x. I'm attempting PCRE regex that will let me recursively replace the quote-delimited text.
Edit: Quote marks within a quoted string are escaped with an extra quote, e.g. "". However, assume I will have already replaced these in a previous pass with a special character. Therefore a regex solution to this problem will not have to deal with escaped quotes.

It is hard to tell if this is possible as you've only provided one line of input text.
But assuming that input follows this pattern:
BOL|any text|string with two groups of a's|any text|string with two groups of a's|any text|EOL
aa "bbabaavv" kdjhas "bbabaavv" x
I was able to create this regexp search string:
^(.+?\".+?)([a]+)(.+?)([a]+)(.*?\")(.+?\".+?)([a]+)(.+?)([a]+)(.*?\".*)$
With this replace string:
\1-\3-\5\6-\8-\A
and it turn your input string from this:
aa"bbabaavv"kdjhas"bbabaavv"x
into this:
aa"bb-b-vv"kdjhas"bb-b-vv"x
Now naturally the search an replace will fail if the input varies from that pattern described as the search is looking for those four groups of a's inside the two groups of quoted strings.
Also I tested that regexp using Zeus which can create a regexp with more than 9 groups.
As you can see the regexp requires 10 groups.
I'm not familar with Notpad++ so I don't know if it supports that many groups.

If your data have variable number of occurrences of quoted strings, then it is not possible to perform replacements only via regex at least in its form offered by Notepad++.
To replace using regex, you would need to perform regex find in existing regex match. As far as I know such a functionality is not available in Notepad++ regexes.

Self-answer
I may have been reaching for the stars in trying to get Notepad++ to do this regex replace, but I think I found a workaround.
The actual task I was attempting involved creating a SQL Server VALUES list from an Excel spreadsheet, where I was copying and pasting selected cells into Notepad++. The delimiters are \t and \r\n. But, cells can have linefeeds too, which are delimited by ". So, I was going to replace these linefeeds with <br> (or something like it), so that
"line1
line2"
would become "line1<br>line2", before processing the actual end-of-row line feeds.
Having such parsing work reliably, especially when more than two lines were in a single cell, may have been too much to ask of Notepad++'s regex capability.
So I came up with a workaround that seems to be working:) Basically it starts with selecting a blank "dummy" column to the right of my column selection (which I can insert if I'm partially selecting from the middle). This will leave a trailing \t at the end of each row, which effectively sets these EOL's apart from ones that might exist with a text cell, freeing me from having to parse line feeds from a "..." field.
So I compiled a macro from the following steps, which seems to be working well:
replace ' with ''
replace \t\r\n with '\)\r\n, \('
replace \t with ', '
replace "" with ''
replace " with <blank>
replace ^ with \(' (cleanup - first row only)
replace ^, \('$ with <blank> (cleanup - last row only)
Example transformation:
from
line1 line 2
"line3
line3b
line3c" line 4
to
('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
which can now be easily modified into a SELECT statement:
SELECT *
FROM (VALUES('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
) t(a,b)

Related

Remove columns from CSV

I don't know anything about Notepad++ Regex.
This is the data I have in my CSV:
6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389
How can I use a Regex to remove the 5th and 6th column? The numbers in the 5th and 6th column are variable in length.
Another problem is the User row can also contain a |, to make it even worse.
I can use a macro to fix this, but the file is a few millions lines long.
This is the final result I want to achieve:
6454345|User1-2ds3|62562012032|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|411b0fdf54fe29745897288c6ad699f7be30f389
I am open for suggestions on how to do this with another program, command line utility, either Linux or Windows.
Match \|[^|]+\|[^|]+(\|[^|]+$)
Repalce $1
Basically, Anchor to the end of the line, and remove columns [-1] and [-2] (I assume columns can't be empty. Replace + with * if they can)
If you need finer detail then that, I'd recommend writing a Java or Python script to manual parse and rewrite the file for you.
I've captured three groups and given them names. If you use a replace utility like sed or vimregex, you can replace remove with nothing. Or you can use a programming language to concatenate keep_before and keep_after for the desired result.
^(?<keep_before>(?:[^|]+\|){3})(?<remove>(?:[^|]+\|){2})(?<keep_after>.*)$
You may have to remove the group namings and use \1 etc. instead, depending on what environment you use.
Demo
From Notepad++ hit ctrl + h then enter the following in the dialog:
Find what: \|\d+\|\d+(\|[0-9a-z]+)$
Replace with: $1
Search mode: Regular Expression
Click replace and done.
Regex Explain:
\|\d+ : match 1st string that starts with | followed by number
\|\d+ : match 2nd string that starts with | followed by number
(\|[0-9a-z]+): match and capture the string after the 2nd number.
$ : This is will force regex search to match the end of the string.
Replacement:
$1 : replace the found string with whatever we have between the captured group which is whatever we have between the parentheses (\|[0-9a-z]+)

Replace repeated special characters with a single special character

I am attempting to use REGEXREPLACE in Google Sheets to remove the repeating special character \n.
I can't get it to replace all repeating instances of the characters with a single instance.
Here is my code:
REGEXREPLACE("Hi Gene\n\n\n\n\nHope","\\n+","\\n")
I want the results to be:
Hi Gene\nHope
But it always maintains the new lines.
Hi Gene\n\n\n\n\nHope
It has to be an issue with replacing the special characters because this:
REGEXREPLACE("Hi Gennnne\nHope","n+","n")
Produces:
Hi Gene\nHope
How do I remove repeating instances of special characters with a single instance of the special character in Google Sheets?
Edit
Just found easier way:
=REGEXREPLACE("Hi Gene\n\n\n\n\nHope","(\\n)+","\\n")
Original solution
Thy this formula:
=REGEXREPLACE(A1,REPT(F2,(len(A1)-len(REGEXREPLACE(A1,"\\n","")))/2),"\\n")
Put your text in A1.
How it works
It's workaround, we want to use final formula like this:
REGEXREPLACE("Hi Gene\n\n\n\n\nHope","\\n+\\n+\\n+\\n+\\n+","\\n")
First target is to find, how many times to repeat \\n+:
=(len(F1)-len(REGEXREPLACE(F1,F2,F3)))/2
Then just combine RegEx.
https://support.google.com/docs/answer/3098245?hl=en
REGEXREPLACE(text, regular_expression, replacement)
The problem seems to be how it interprets the "text". If I put this in a cell REGEXREPLACE("Hi Gene\n\n\n\n\nHope","","")
the output is Hi Gene\n\n\n\n\nHope as well.
If I place the text in a cell by itself with proper newlines and have this REGEXREPLACE(A1, "(\n)\n*", "$1") it works.
Note I could not just do s/\n+/\n/ as it still does not interpret the newline notation as anything special. It would just output \n instead of a newline.
I believe that you don't need to double escape the newlines, e.g. just search for \n:
REGEXREPLACE("Hi Gene\n\n\n\n\nHope", "\n+", "\n")
When you replace \\n you are searching for the literal text \n, rather than newline.

REGEXREPLACE in Google Spreadsheet

I am trying to use REGEX in Google Sheets to clean up form data arriving as comma delimited data with arbitrary leading commas and single spaces.
sample data from form:
,,Refrigerator,,,,, ,,Slide,,Dual Slide,,Microwave Oven,,Indoor Shower,Built in Stereo,Day/Night Switch,,BluRay/DVD
I want to use
REGEXREPLACE(text, regular_expression, replacement)
to remove multiple commas and single spaces that may occur between commas, replacing with a single comma so the line reads
Refrigerator,Slide,Dual Slide,Microwave Oven, . . . etc
The match string (^,+|(,+ ,)|,+) works properly in the Rubular.com simulator, but when used in the Google Spreadsheet as in example with raw data above pasted in at cell M12 as source text:
REGEXREPLACE("M12","(^,+|(,+ ,)|,+)",",")
it fails by not removing one of the leading commas.
,Refrigerator,,,,, ,,Slide,,Dual Slide,,Microwave Oven,,Indoor Shower,Built in Stereo,Day/Night Switch,,BluRay/DVD
The Googlesheet REGEX help points to https://github.com/google/re2/blob/master/doc/syntax.txt which seems to describe the operations the same as the simulator.
From what you're describing, Google is working as expected and the other site linked isn't. Your regex is matching ^,+, amongst other things, (ie one or more commas at the start), and replacing them with a single comma. If the input string has commas at the start, I would expect the output to have one too.
You could build on what you've done with another regular expression replace, and strip any leading commas:
REGEXREPLACE(REGEXREPLACE(M12,"((,+ ,)|,+)",","), "^,+", "")
This uses your original one, minus the leading commas part, to do the original replace, then wraps it in a second call looking for just leading commas, and replacing those with nothing.
Having said that, your original regex is also not quite working as expected either and isn't stripping all the commas and spaces down to a single comma in all circumstances. Instead, you can use this one:
REGEXREPLACE(REGEXREPLACE(M12,"( ?(, *)+)",","), "^,+", "")
This looks for an optional space, followed by one or more commas, each with zero or more commas after them, replacing the whole lot with a single comma, then keeping the new "remove all commas at the start" replace also.
One more good way to do this:
=TEXTJOIN(", ",1,SPLIT(A1,", "))

Replace a comma in text values in CSV using regex in Notepad++

I searched a lot but couldn't find any exact soluion.
I have a CSV which contains some values that contains a comma in between the values.
Following is a sample row
"BEIAAGJIPAMBPJIF",2757,08042010,"13:53.59",09042010,"01:55.39","SIHAM","BEIAIGHEIPLGPJIF",20,"A",20,"S",0.00,0.00,0.00,"OLY
SPECIAL ORDER","IN STOCK , DESIGNER",0.00000,0,"","N","N",
Now it you look at the value "IN STOCK , DESIGNER", it containts a comma in between. due to which while reading the csv in my .net application and in MS Dynamics CRM import file wizard, it breaks it into two seprate values instead of one single value.
I need a regex that can match such strings and replace the comma with a hyphen "-" that I can use in Notepad ++.
Kindly help.
Thanks.
This solution worked for me, although it is a bit indirect:
by searching, detect character which is unused in the file, e.g. #
use the following regex replace to replace all delimiters: find: (".*?"|.*?), replace: \1# (note the character from step 1)
now, all leftover commas are only those which are inside the quotes. Mass replace them for -
replace back all #'s for commas

How to cycle through delimited tokens with a Regular Expression?

How can I create a regular expression that will grab delimited text from a string? For example, given a string like
text ###token1### text text ###token2### text text
I want a regex that will pull out ###token1###. Yes, I do want the delimiter as well. By adding another group, I can get both:
(###(.+?)###)
/###(.+?)###/
if you want the ###'s then you need
/(###.+?###)/
the ? means non greedy, if you didn't have the ?, then it would grab too much.
e.g. '###token1### text text ###token2###' would all get grabbed.
My initial answer had a * instead of a +. * means 0 or more. + means 1 or more. * was wrong because that would allow ###### as a valid thing to find.
For playing around with regular expressions. I highly recommend http://www.weitz.de/regex-coach/ for windows. You can type in the string you want and your regular expression and see what it's actually doing.
Your selected text will be stored in \1 or $1 depending on where you are using your regular expression.
In Perl, you actually want something like this:
$text = 'text ###token1### text text ###token2### text text';
while($text =~ m/###(.+?)###/g) {
print $1, "\n";
}
Which will give you each token in turn within the while loop. The (.*?) ensures that you get the shortest bit between the delimiters, preventing it from thinking the token is 'token1### text text ###token2'.
Or, if you just want to save them, not loop immediately:
#tokens = $text =~ m/###(.+?)###/g;
Assuming you want to match ###token2### as well...
/###.+###/
Use () and \x. A naive example that assumes the text within the tokens is always delimited by #:
text (#+.+#+) text text (#+.+#+) text text
The stuff in the () can then be grabbed by using \1 and \2 (\1 for the first set, \2 for the second in the replacement expression (assuming you're doing a search/replace in an editor). For example, the replacement expression could be:
token1: \1, token2: \2
For the above example, that should produce:
token1: ###token1###, token2: ###token2###
If you're using a regexp library in a program, you'd presumably call a function to get at the contents first and second token, which you've indicated with the ()s around them.
Well when you are using delimiters such as this basically you just grab the first one then anything that does not match the ending delimiter followed by the ending delimiter. A special caution should be that in cases as the example above [^#] would not work as checking to ensure the end delimiter is not there since a singe # would cause the regex to fail (ie. "###foo#bar###). In the case above the regex to parse it would be the following assuming empty tokens are allowed (if not, change * to +):
###([^#]|#[^#]|##[^#])*###