REGEXREPLACE in Google Spreadsheet - regex

I am trying to use REGEX in Google Sheets to clean up form data arriving as comma delimited data with arbitrary leading commas and single spaces.
sample data from form:
,,Refrigerator,,,,, ,,Slide,,Dual Slide,,Microwave Oven,,Indoor Shower,Built in Stereo,Day/Night Switch,,BluRay/DVD
I want to use
REGEXREPLACE(text, regular_expression, replacement)
to remove multiple commas and single spaces that may occur between commas, replacing with a single comma so the line reads
Refrigerator,Slide,Dual Slide,Microwave Oven, . . . etc
The match string (^,+|(,+ ,)|,+) works properly in the Rubular.com simulator, but when used in the Google Spreadsheet as in example with raw data above pasted in at cell M12 as source text:
REGEXREPLACE("M12","(^,+|(,+ ,)|,+)",",")
it fails by not removing one of the leading commas.
,Refrigerator,,,,, ,,Slide,,Dual Slide,,Microwave Oven,,Indoor Shower,Built in Stereo,Day/Night Switch,,BluRay/DVD
The Googlesheet REGEX help points to https://github.com/google/re2/blob/master/doc/syntax.txt which seems to describe the operations the same as the simulator.

From what you're describing, Google is working as expected and the other site linked isn't. Your regex is matching ^,+, amongst other things, (ie one or more commas at the start), and replacing them with a single comma. If the input string has commas at the start, I would expect the output to have one too.
You could build on what you've done with another regular expression replace, and strip any leading commas:
REGEXREPLACE(REGEXREPLACE(M12,"((,+ ,)|,+)",","), "^,+", "")
This uses your original one, minus the leading commas part, to do the original replace, then wraps it in a second call looking for just leading commas, and replacing those with nothing.
Having said that, your original regex is also not quite working as expected either and isn't stripping all the commas and spaces down to a single comma in all circumstances. Instead, you can use this one:
REGEXREPLACE(REGEXREPLACE(M12,"( ?(, *)+)",","), "^,+", "")
This looks for an optional space, followed by one or more commas, each with zero or more commas after them, replacing the whole lot with a single comma, then keeping the new "remove all commas at the start" replace also.

One more good way to do this:
=TEXTJOIN(", ",1,SPLIT(A1,", "))

Related

regex needed for parsing string

I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the fda.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours to find a regex solution to retrieve both the delimiter and the value that follows it and, though there seems to be posts that handle this, the code found in the post haven't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance: delimiters are used such as "=", "=,", "/=". In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>'
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
Working demo
The idea is that the first group contains the delimiters and the 2nd group the content. I assume the content can be word characters (a to z and digits with underscore). You have then to grab the content of every capturing group.
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
See live demo.
If your values can contain non-alphanumeric characters, it get complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
See live demo.
Code the delimiters longest first, eg "=," before "=", otherwise the alternation, which matches left to right, will match "=" and the comma will become part of the value.
This uses a negative look ahead to stop matching past the next delimiter.

Replace repeated special characters with a single special character

I am attempting to use REGEXREPLACE in Google Sheets to remove the repeating special character \n.
I can't get it to replace all repeating instances of the characters with a single instance.
Here is my code:
REGEXREPLACE("Hi Gene\n\n\n\n\nHope","\\n+","\\n")
I want the results to be:
Hi Gene\nHope
But it always maintains the new lines.
Hi Gene\n\n\n\n\nHope
It has to be an issue with replacing the special characters because this:
REGEXREPLACE("Hi Gennnne\nHope","n+","n")
Produces:
Hi Gene\nHope
How do I remove repeating instances of special characters with a single instance of the special character in Google Sheets?
Edit
Just found easier way:
=REGEXREPLACE("Hi Gene\n\n\n\n\nHope","(\\n)+","\\n")
Original solution
Thy this formula:
=REGEXREPLACE(A1,REPT(F2,(len(A1)-len(REGEXREPLACE(A1,"\\n","")))/2),"\\n")
Put your text in A1.
How it works
It's workaround, we want to use final formula like this:
REGEXREPLACE("Hi Gene\n\n\n\n\nHope","\\n+\\n+\\n+\\n+\\n+","\\n")
First target is to find, how many times to repeat \\n+:
=(len(F1)-len(REGEXREPLACE(F1,F2,F3)))/2
Then just combine RegEx.
https://support.google.com/docs/answer/3098245?hl=en
REGEXREPLACE(text, regular_expression, replacement)
The problem seems to be how it interprets the "text". If I put this in a cell REGEXREPLACE("Hi Gene\n\n\n\n\nHope","","")
the output is Hi Gene\n\n\n\n\nHope as well.
If I place the text in a cell by itself with proper newlines and have this REGEXREPLACE(A1, "(\n)\n*", "$1") it works.
Note I could not just do s/\n+/\n/ as it still does not interpret the newline notation as anything special. It would just output \n instead of a newline.
I believe that you don't need to double escape the newlines, e.g. just search for \n:
REGEXREPLACE("Hi Gene\n\n\n\n\nHope", "\n+", "\n")
When you replace \\n you are searching for the literal text \n, rather than newline.

Regex to remove commas between quotes with comma right before end quote Notepad++

In Notepad++, I am using Regex to replace commas between quotes in CSV file.
Using similar example from here.This is what I am trying to read.
1070,17,2,GN3-670,"COLLAR B, M STAY,","2,606.45"
except in my text there is an extra comma right before the closing quotes.
The regex ("[^",]+),([^"]+") does not seem to pick up the last comma and result is
1070,17,2,GN3-670,"COLLAR B M STAY,","2606.45"
I would like
1070,17,2,GN3-670,"COLLAR B M STAY","2606.45"
Is there a simple Regex or will I have to use csv reader C#?
Edit: Some of the Regex is giving false matches so I would like to add another scenario. If I have
1070,17,2,GN3-670,"COLLAR B, M STAY,",55, FREE,"2,606.45"
I would like
1070,17,2,GN3-670,"COLLAR B M STAY",55, FREE,"2606.45"
I think this is what you're looking for:
,(?=[^"]*"(?:[^"]*"[^"]*")*[^"]*$)
This matches any comma that's followed by an odd number of quotes. It consumes only the comma, so you replace it with nothing.
The thing about your original solution is that it would only match one comma per quoted field. It never even tried to match the second comma in "COLLAR B, M STAY,", so its position didn't really matter. This solution removes any number of commas, regardless of their position within the field.
UPDATE: This regex assumes you're processing one line at a time. If you're using it on a whole document containing many lines, the regex is probably timing out. You can work around that by excluding line terminators (carriage returns and linefeeds), like this:
,(?=[^"\r\n]*"(?:[^"\r\n]*"[^"\r\n]*")*[^"\r\n]*$)
Note that the CSV spec (such as it is) says you can have line terminators in quoted fields, so this regex is technically incorrect. If you do need to support multiline fields, you might as well switch to the CSV library. Regexes are not quite capable of handling CSV fully, but in most cases they're good enough.
You can use the following to match:
((["])(?:(?=(\\?))\3.)*?),\2
And replace with the following:
\1"
See DEMO
This should work
Find What ("[^"]*),"
Replace With \1"

How do I properly format this Regex search in R? It works fine in the online tester

In R, I have a column of data in a data-frame, and each element looks something like this:
Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae
What I want is the section after the last semicolon, and I've been trying to use 'sub' and also duplicating the existing column and create a new one with just the endings kept. In essence, I want this (the genus):
Marinilabiaceae
A snippet of the code looks like this:
mydata$new_column<- sub("([\\s\\S]*;)", "", mydata$old_column)
In this situation, I am using \\ rather than \ because of R's escape sequences. The sub replaces the parts I don't want and updates it to the new column. I've tested the Regex several times in places such as this: http://regex101.com/r/kS7fD8/1
However, I'm still struggling because the results are very bizarre. Now my new column is populated with the organism's domain rather than the genus: Bacteria.
How do I resolve this? Are there any good easy-to-understand resources for learning more about R's Regex formats?
Starting with your simple string,
string <- "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"
You can remove everything up to the last semicolon with "^(.*);" in your call to sub
> sub("^(.*);", "", string)
# [1] "Marinilabiaceae"
You can also use strsplit with tail
> tail(strsplit(string, ";")[[1]], 1)
# [1] "Marinilabiaceae"
Your regular expression, ([\\s\\S]*;) wouldn't work primarily because \\s matches any space characters, and your string does not contain any spaces. I think it worked in the regex101 site because that regex tester defaults to pcre (php) (see "Flavor" in top-left corner), and R regex syntax is slightly different. R requires extra backslash escape characters in many situations. For reference, this R text processing wiki has come in handy for me many times before.
Make it Greedy and get the matched group from desired index.
(.*);(.*)
^^^------- Marinilabiaceae
Here is regex101 demo
Or to get the first word use Non-Greedy way
(.*?);(.*)
Bacteria -----^^^
Here is demo
To extract everything after the last ; to the end of the line you can use:
[^;]*?$

PCRE regex replace a text pattern within double quotes

In Notepad++ 6.5.1 I need to replace certain patterns within quote pairs. I want to save the replace as part of a macro, so all replacements need to happen in one step.
For example, in the following string, replace all 'a' characters within quote pairs with a dash, while leaving characters outside the quote pairs untouched:
Input: aa"bbabaavv"kdjhas"bbabaavv"x
Desired result: aa"bb-b--vv"kdjhas"bb-b--vv"x
Note that the quotes are matched up pairwise, such that the 'a' in kdjhas is untouched.
So far I have tried searching for (?:"[^"a]*|\G)\Ka([^"a]*) and replacing with -$1, but that simply replaces all the a's, with the result --"bb-b--vv"kdjh-s"bb-b--vv"x. I'm attempting PCRE regex that will let me recursively replace the quote-delimited text.
Edit: Quote marks within a quoted string are escaped with an extra quote, e.g. "". However, assume I will have already replaced these in a previous pass with a special character. Therefore a regex solution to this problem will not have to deal with escaped quotes.
It is hard to tell if this is possible as you've only provided one line of input text.
But assuming that input follows this pattern:
BOL|any text|string with two groups of a's|any text|string with two groups of a's|any text|EOL
aa "bbabaavv" kdjhas "bbabaavv" x
I was able to create this regexp search string:
^(.+?\".+?)([a]+)(.+?)([a]+)(.*?\")(.+?\".+?)([a]+)(.+?)([a]+)(.*?\".*)$
With this replace string:
\1-\3-\5\6-\8-\A
and it turn your input string from this:
aa"bbabaavv"kdjhas"bbabaavv"x
into this:
aa"bb-b-vv"kdjhas"bb-b-vv"x
Now naturally the search an replace will fail if the input varies from that pattern described as the search is looking for those four groups of a's inside the two groups of quoted strings.
Also I tested that regexp using Zeus which can create a regexp with more than 9 groups.
As you can see the regexp requires 10 groups.
I'm not familar with Notpad++ so I don't know if it supports that many groups.
If your data have variable number of occurrences of quoted strings, then it is not possible to perform replacements only via regex at least in its form offered by Notepad++.
To replace using regex, you would need to perform regex find in existing regex match. As far as I know such a functionality is not available in Notepad++ regexes.
Self-answer
I may have been reaching for the stars in trying to get Notepad++ to do this regex replace, but I think I found a workaround.
The actual task I was attempting involved creating a SQL Server VALUES list from an Excel spreadsheet, where I was copying and pasting selected cells into Notepad++. The delimiters are \t and \r\n. But, cells can have linefeeds too, which are delimited by ". So, I was going to replace these linefeeds with <br> (or something like it), so that
"line1
line2"
would become "line1<br>line2", before processing the actual end-of-row line feeds.
Having such parsing work reliably, especially when more than two lines were in a single cell, may have been too much to ask of Notepad++'s regex capability.
So I came up with a workaround that seems to be working:) Basically it starts with selecting a blank "dummy" column to the right of my column selection (which I can insert if I'm partially selecting from the middle). This will leave a trailing \t at the end of each row, which effectively sets these EOL's apart from ones that might exist with a text cell, freeing me from having to parse line feeds from a "..." field.
So I compiled a macro from the following steps, which seems to be working well:
replace ' with ''
replace \t\r\n with '\)\r\n, \('
replace \t with ', '
replace "" with ''
replace " with <blank>
replace ^ with \(' (cleanup - first row only)
replace ^, \('$ with <blank> (cleanup - last row only)
Example transformation:
from
line1 line 2
"line3
line3b
line3c" line 4
to
('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
which can now be easily modified into a SELECT statement:
SELECT *
FROM (VALUES('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
) t(a,b)