Regex for string with optional quotes

I'm trying to break apart a string I'm getting from a telnet service. The service puts quotes around a filename if it has white space in it, and omits the quotes if there is no white space. All the other fields are delimited by spaces, so no real issue there.
I'm trying (maybe too ambitiously!) to get the whole lot out in Regex groups. Not that it has much bearing on it, but I'm using Perl.
An example of a quoted string is:
"RAW Superleague backchat 0907 1531" movie/DV/DV100 63173952000 576000 15:21:35:24 16:34:43:01
and an unquoted string might be:
F0736584_02 movie/DV/DV100 9172224000 576000 16:04:19:00 16:14:55:24
I'm using the regex:
/^"?(.*)"$?\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)/
which captures the quoted string very nicely in groups, but doesn't match the second, unquoted one. I thought that the optional quantifier (?) would handle this, but it seems not. Any help appreciated.

Because the second line doesn't start with a whitespace. Try this:
/^"?(.*)"$?\s?(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)/
(New: the ? after the first \s.)
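If that still doesn't match the unquoted line, another option is to make the first field an alternation: either a quoted string or a run of non-space characters. This is a different technique from the regex above, sketched in Python rather than Perl. The names LINE_RE and split_line are made up for illustration, and it assumes the filename never contains a double quote:

```python
import re

# Hypothetical pattern: the first field is either a quoted string (group 1)
# or a run of non-space characters (group 2); five plain fields follow.
LINE_RE = re.compile(
    r'^(?:"([^"]*)"|(\S+))\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)$'
)

def split_line(line):
    """Return the six fields of a line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # Exactly one of groups 1 and 2 is set, depending on the quoting.
    name = m.group(1) if m.group(1) is not None else m.group(2)
    return [name, *m.groups()[2:]]
```

Because the quoted branch uses [^"]* instead of a greedy .*, the filename group cannot run past the closing quote.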

Related

How to include '#' pound sign or hashtag to be read as part of a field?

I have my yml file looking like this:
fields:
  MC Number: \s+\d+
  BOL \#: \s+[0-9A-Z]
But the whole line is not colored correctly as shown in the picture, meaning that it's still being read from MC Number.
I tried adding quotes on the hashtag ("#") but it still wouldn't work. Obviously, leaving it without the backwards slash, everything starting at the hashtag would be commented out on that line.
Marked in red is the problem field; it looks green through the whole line.

Strictly speaking, this is probably a bug in the syntax highlighter, but one you can work around simply.
I'm assuming you put the backslash in front of the hash to prevent it from being treated as a comment delimiter. The problem is that backslash escapes only work inside double-quoted strings.
I'd suggest you put quotes around the "BOL #" key. Inside quotes, backslash-escaping the # is not needed.
Putting it all together, the line with the error should probably read:
"BOL #": \s+[0-9A-Z]

Regex Find Spaces between single quotes and replace with underscore

I have a database table that I have exported. I need to replace the spaces in the image file names with underscores and would like to use Notepad++ and regex to do so. I have:
'data/green tea powder.jpg'
'data/prod_img/lumina herbal shampoo.JPG'
'data/ALL GREEN HERBS.jpeg'
'data/prod_img/PSORIASIS KIT (640x530) (2).jpg'
and need to make them look like this:
'data/green_tea_powder.jpg'
'data/prod_img/lumina_herbal_shampoo.JPG'
'data/ALL_GREEN_HERBS.jpeg'
'data/prod_img/PSORIASIS_KIT_(640x530)_(2).jpg'
I just want to change the spaces between the quotes (I don't want to change the capitalization). To be more specific I would like to replace any and all spaces between 'data/ and ' because there are other spaces between quotes in the DB, for example:
'data/ REPLACE ANY SPACE HERE '
I found this:
\s(?!(?:[^']*'[^']*')*[^']*$)
but there are other places where there are spaces between quotes, so I'd like to anchor the search on data/ at the beginning rather than on just a single quote, and I can't figure out how. I tried \s(?!(?:[^'data\/]*'[^']*')*[^']*$) but it didn't work, and I am not familiar enough with regex to make it do so.
An example of a full line from the database is:
(712, 'GRTE-P', '', 'data/green tea powder.jpg', '2014-03-12 22:52:03'),
I don't want to replace the spaces in the date and time stamp at the end of the line, just the ones in the image file names.
Thanks in advance for your help!

You have to use a \G based pattern to ensure that matches are contiguous.
search: (?:\G(?!^)|'data/)[^' ]*\K[ ]
replace: _
The first match uses the second branch of the alternation, then the next matches are contiguous and use the first branch.
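Outside Notepad++, the same result can be had without \G. Python's built-in re module supports neither \G nor \K, so this sketch swaps in a different technique: match each whole 'data/...' value and replace the spaces inside the match with a callback. The names DATA_PATH_RE and underscore_data_paths are made up for illustration, and it assumes the file names contain no escaped single quotes:

```python
import re

# Matches a whole quoted value that starts with 'data/
# (assumption: no escaped single quotes inside the value).
DATA_PATH_RE = re.compile(r"'data/[^']*'")

def underscore_data_paths(line):
    """Replace spaces with underscores, but only inside 'data/...' values."""
    return DATA_PATH_RE.sub(lambda m: m.group(0).replace(" ", "_"), line)
```

Other quoted values on the line, such as the timestamp, never match the pattern, so their spaces are left alone.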

Regex to remove commas between quotes with comma right before end quote Notepad++

In Notepad++, I am using Regex to replace commas between quotes in CSV file.
Using a similar example from here. This is what I am trying to read:
1070,17,2,GN3-670,"COLLAR B, M STAY,","2,606.45"
except in my text there is an extra comma right before the closing quotes.
The regex ("[^",]+),([^"]+") does not seem to pick up the last comma, and the result is
1070,17,2,GN3-670,"COLLAR B M STAY,","2606.45"
I would like
1070,17,2,GN3-670,"COLLAR B M STAY","2606.45"
Is there a simple regex, or will I have to use a CSV reader in C#?
Edit: Some of the Regex is giving false matches so I would like to add another scenario. If I have
1070,17,2,GN3-670,"COLLAR B, M STAY,",55, FREE,"2,606.45"
I would like
1070,17,2,GN3-670,"COLLAR B M STAY",55, FREE,"2606.45"

I think this is what you're looking for:
,(?=[^"]*"(?:[^"]*"[^"]*")*[^"]*$)
This matches any comma that's followed by an odd number of quotes. It consumes only the comma, so you replace it with nothing.
The thing about your original solution is that it would only match one comma per quoted field. It never even tried to match the second comma in "COLLAR B, M STAY,", so its position didn't really matter. This solution removes any number of commas, regardless of their position within the field.
UPDATE: This regex assumes you're processing one line at a time. If you're using it on a whole document containing many lines, the regex is probably timing out. You can work around that by excluding line terminators (carriage returns and linefeeds), like this:
,(?=[^"\r\n]*"(?:[^"\r\n]*"[^"\r\n]*")*[^"\r\n]*$)
Note that the CSV spec (such as it is) says you can have line terminators in quoted fields, so this regex is technically incorrect. If you do need to support multiline fields, you might as well switch to the CSV library. Regexes are not quite capable of handling CSV fully, but in most cases they're good enough.
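For anyone who wants to try the lookahead outside Notepad++, here is a minimal sketch in Python that applies the first regex to one line at a time, which is the assumption stated above. The names COMMA_IN_QUOTES and strip_quoted_commas are made up for illustration:

```python
import re

# A comma is inside a quoted field if an odd number of double quotes
# follows it on the same line (assumes quotes are balanced per line).
COMMA_IN_QUOTES = re.compile(r',(?=[^"]*"(?:[^"]*"[^"]*")*[^"]*$)')

def strip_quoted_commas(line):
    """Remove every comma that sits inside a quoted field of one CSV line."""
    return COMMA_IN_QUOTES.sub("", line)
```

Calling it once per line sidesteps the whole-document problem, since the $ anchor then only ever sees a single line.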

You can use the following to match:
((["])(?:(?=(\\?))\3.)*?),\2
And replace with the following:
\1"

This should work:
Find what: ("[^"]*),"
Replace with: \1"

REGEXREPLACE in Google Spreadsheet

I am trying to use REGEX in Google Sheets to clean up form data arriving as comma delimited data with arbitrary leading commas and single spaces.
sample data from form:
,,Refrigerator,,,,, ,,Slide,,Dual Slide,,Microwave Oven,,Indoor Shower,Built in Stereo,Day/Night Switch,,BluRay/DVD
I want to use
REGEXREPLACE(text, regular_expression, replacement)
to remove multiple commas and single spaces that may occur between commas, replacing with a single comma so the line reads
Refrigerator,Slide,Dual Slide,Microwave Oven, . . . etc
The pattern (^,+|(,+ ,)|,+) works properly in the Rubular.com simulator, but when used in Google Sheets as in the example below, with the raw data above pasted into cell M12 as the source text:
REGEXREPLACE(M12,"(^,+|(,+ ,)|,+)",",")
it fails to remove one of the leading commas:
,Refrigerator,,,,, ,,Slide,,Dual Slide,,Microwave Oven,,Indoor Shower,Built in Stereo,Day/Night Switch,,BluRay/DVD
The Google Sheets REGEX help points to https://github.com/google/re2/blob/master/doc/syntax.txt which seems to describe the operations the same way as the simulator.

From what you're describing, Google is working as expected and the other site linked isn't. Your regex matches ^,+ among other things (i.e. one or more commas at the start) and replaces them with a single comma. If the input string has commas at the start, I would expect the output to have one too.
You could build on what you've done with another regular expression replace, and strip any leading commas:
REGEXREPLACE(REGEXREPLACE(M12,"((,+ ,)|,+)",","), "^,+", "")
This uses your original one, minus the leading commas part, to do the original replace, then wraps it in a second call looking for just leading commas, and replacing those with nothing.
Having said that, your original regex is not quite working as expected either: it doesn't strip all the commas and spaces down to a single comma in all circumstances. Instead, you can use this one:
REGEXREPLACE(REGEXREPLACE(M12,"( ?(, *)+)",","), "^,+", "")
This looks for an optional space, followed by one or more commas, each with zero or more spaces after it, and replaces the whole lot with a single comma. It keeps the new "remove all commas at the start" replace as well.
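To check the two-step logic away from the spreadsheet, here is a rough equivalent in Python. It is only a sketch: Python's re engine is not RE2, though this simple pattern should behave the same in both, and the function name clean_form_data is made up for illustration:

```python
import re

def clean_form_data(text):
    """Mimic the nested REGEXREPLACE calls on one cell's text."""
    # Inner call: collapse each run of commas (plus stray spaces) into one comma.
    collapsed = re.sub(r" ?(?:, *)+", ",", text)
    # Outer call: drop whatever comma is left at the start.
    return re.sub(r"^,+", "", collapsed)
```

Spaces inside an item such as "Dual Slide" survive because the pattern only consumes a space that sits next to a comma.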

One more good way to do this:
=TEXTJOIN(", ",1,SPLIT(A1,", "))

Change `"` quotation marks to latex style

I'm editing a book in LaTeX and its quotation marks syntax is different from the simple " characters. So I want to convert "quoted text here" to ``quoted text here''.
I have 50 text files with lots of quotations inside. I tried to write a regular expression to substitute the first " with `` and the second " with '', but I failed. I searched the internet and asked some friends, but had no success at all. The closest I got, replacing the first quotation mark, is
s/"[a-z]/``/g
but this is clearly wrong, since
"quoted text here"
will become
``uoted text here"
How can I solve my problem?

I'm a little confused by your approach. Shouldn't it be the other way round with s/``/"[a-z]/g? But then, I think it'll be better with:
s/``(.*?)''/"\1"/g
(.*?) captures what's between `` and ''.
\1 contains this capture.
If it's the opposite that you're looking for (i.e. I wrongly interpreted your question), then I would suggest this:
s/"(.*?)"/``\1''/g
Which works on the same principles as the previous regex.

Use the following to tackle multiple quotations, replacing all " in one step.
echo '"Quote" she said, "again."' | sed "s/\"\([^\"]*\)\"/\`\`\1''/g"
The [^\"]* avoids the need for ungreedy matching, which does not seem possible in sed.

If you are using the TeXmaker software, you could use a regular expression with the Replace command (CTRL+R), and put the following into the Find field:
"([^}]*)"
and into the Replace field:
``$1''
And then just press the Replace All button. But after that, you still have to check that everything is fine, and maybe you need to do some corrections. This has worked pretty well for me.

Try grouping the word:
sed 's/"\([a-z]\)/``\1/'
On my PC:
abhishekm71@PC:~$ echo \"hello\" | sed 's/"\([a-z]\)/``\1/'
``hello"

It depends a little on your input file (are quotes always paired, or can there be omissions?). I suggest the following robust approach:
sed 's/"\([0-9a-zA-Z]\)/``\1/g'
sed "s/\([0-9a-zA-Z]\)\"/\1\'\'/g"
Assumption: An opening quotation mark is always immediately followed by a letter or digit, and a closing quotation mark is always preceded by one. Quotations can span several words and even several input lines (some of the other solutions don't work when this happens).
Note that I also replace the closing quotation mark: depending on the fonts you use, the double quotation mark can be typeset as a neutral straight quotation mark.

You are looking for something contained in straight quotation marks that does not itself contain a quotation mark, so the best regex is "([^"]*?)". Replace it with ``\1''. In Perl this can be written as s/"([^"]*?)"/``\1''/g. I would be very careful with this approach: it only works if all opening quotation marks have matching closing ones, as in "one" two "three" four. It will fail on "one" t"wo "three" four, producing ``one'' t``wo ''three" four.
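The paired-quote substitution is easy to try on a sample before running it over all 50 files. A minimal sketch in Python (the function name to_latex_quotes is made up for illustration, and it makes the same assumption that quotes are balanced):

```python
import re

def to_latex_quotes(text):
    """Turn each "..." pair into ``...'' (assumes quotes are balanced)."""
    # [^"]* cannot cross another quote, so no lazy quantifier is needed,
    # and it also matches newlines, so a quotation may span several lines.
    return re.sub(r'"([^"]*)"', r"``\1''", text)
```

Running it on text with a stray quote mispairs everything after that point, so it is worth searching the output for any leftover " characters before trusting the result.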