Remove <space><comma> but not <space><space><comma> - regex

I have a CSV file that has un-encapsulated text strings, some that contain commas. This of course throws off the CSV parser.
My CSV has the following patterns:
A column with no value will contain 2 spaces
A column with a value will look like <comma><value><comma>, with no spaces between the value and the commas.
All of the errant commas that I need to remove (that are contained in text strings) are either preceded by or followed by a single space. Example:
<somevalue>,Check this out, I think you'll like it.,<somevalue>
I need a regex to replace that <space><comma> with just a <space> or a <hyphen>. But I can't just search on <comma> because that will catch all of the valid instances.

You can use the following to match:
(?<![ ])[ ],
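And replace with '' (empty string) to drop both characters, or with ' ' or '-' if you want to keep a space or put a hyphen there, as the question asks.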

Another option would be to match on non-space followed by space and comma and replace with the non-space:
... -replace '(^|[^ ]) ,', '$1'
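For anyone doing this outside PowerShell, here is a minimal Python sketch of the same lookbehind idea. The function name strip_errant_commas and the sample line are made up for illustration, and the default replacement of a single space is an assumption based on what the question asks for.

```python
import re

# A single space followed by a comma, where that space is not itself
# preceded by another space (so "<space><space><comma>" is left alone).
ERRANT_COMMA = re.compile(r"(?<! ) ,")

def strip_errant_commas(line: str, replacement: str = " ") -> str:
    """Replace every lone <space><comma> in `line` with `replacement`."""
    return ERRANT_COMMA.sub(replacement, line)

if __name__ == "__main__":
    # Hypothetical sample row: one errant comma, one empty column ("  ").
    print(strip_errant_commas("x,Check this out ,I think,  ,y"))
```

Passing "-" as the second argument gives the hyphen variant instead.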

Related

\1 not defined in the RE

In my script, I'm passing in a markdown file and, using sed, I'm trying to find lines that do not start with one or more # and are not empty, and then surround those lines with <p></p> tags.
My reasoning:
^[^#]+ At beginning of line, find lines that do not begin with 1 or more #
.\+ Then find lines that contain one or more character (aka not empty lines)
Then replace the matched line with <p>\1</p>, where \1 represents the matched line.
However, I'm getting "\1 not defined in the RE". Is my reasoning above correct and how do I fix this error?
BODY=$(sed -E 's/^[^#]+.\+/<p>\1</p>/g' "$1")
Backslash followed by a number is replaced with the match for the Nth capture group in the regexp, but your regexp has no capture groups.
If you want to replace the entire match, use &:
BODY=$(sed -E 's%^[^#].*%<p>&</p>%' "$1")
You don't need to use .+ to find non-empty lines -- the fact that the line has a character at the beginning that isn't # means it's not empty. And you don't need + after [^#] -- all you care about is that the first character isn't #. You also don't need the g modifier when the regexp matches the entire line -- that's only needed to replace multiple matches per line.
And since your replacement string contains /, you need to either escape it or change the delimiter to some other character.
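If it helps to see the same logic outside sed, here is a rough Python sketch. The function name wrap_paragraphs is hypothetical, and \g<0> plays the role of sed's & (the whole match). It works line by line on purpose, because in Python [^#] would also match a newline if the pattern were run over the whole text at once.

```python
import re

# A line whose first character exists and is not '#'.
NON_HEADING = re.compile(r"^[^#].*")

def wrap_paragraphs(markdown: str) -> str:
    """Wrap every non-empty, non-heading line in <p>...</p>."""
    wrapped = []
    for line in markdown.splitlines():
        # \g<0> inserts the entire match, like & in sed.
        wrapped.append(NON_HEADING.sub(r"<p>\g<0></p>", line))
    return "\n".join(wrapped)

if __name__ == "__main__":
    print(wrap_paragraphs("# Title\n\nSome text\nMore text"))
```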

regex to remove all text in last column of pipe-delimited ragged flat file

I have a ragged .pip, pipe-delimited, quote-qualified flat file with 3 columns. The end-of-record delimiter is carriage-return line-feed ({CR}{LF}). An example file is:
x|stuff|zz {CR}{LF}
ab|"some|thing"|"els|e" {CR}{LF}
"wh|at"|text|b {CR}{LF}
I need to remove the text in the last (3rd) column, including its column delimiter. So, I want the above example file to appear as:
x|stuff {CR}{LF}
ab|"some|thing" {CR}{LF}
"wh|at"|text {CR}{LF}
I want to use a regex find-replace in Notepad++. What should my regex (find) be? I know there is a similar post for this (Regular expression to remove the last column from a pipe delimited file), but it doesn't seem to work for my situation.
Your search pattern can be constructed by a literal pipe (must be escaped), followed by zero or more non-pipe chars (greedy) and anchored at end of line. But I see that some of the fields may contain quoted values with pipes. So you would need to handle those in a separate match. Try this:
\|("[^"]*"|[^|]*)$
I just tested this pattern on your example data set and confirmed it works. Do you have any quoted values that have quote characters that need to be escaped? If so, how are they escaped? With a leading quote? With a backslash? Perhaps it might be better to use a CSV parser instead of a regex if you do have any quoted data in the last column with literal quotes inside.
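Here is a small Python sketch of the same pattern, in case you want to check it outside Notepad++. The function name drop_last_column is made up, and it assumes each record has already been split off from its {CR}{LF} terminator, has no trailing space, and contains no escaped quotes inside quoted fields.

```python
import re

# A pipe followed by either a quoted field or a run of non-pipe
# characters, anchored at the end of the record.
LAST_COLUMN = re.compile(r'\|("[^"]*"|[^|]*)$')

def drop_last_column(record: str) -> str:
    """Remove the last column and its leading pipe delimiter."""
    return LAST_COLUMN.sub("", record)

if __name__ == "__main__":
    for record in ['x|stuff|zz', 'ab|"some|thing"|"els|e"', '"wh|at"|text|b']:
        print(drop_last_column(record))
```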

remove all commas between quotes with a vim regex

I've got a CSV file with lines like:
57,13,"Bob, Bill and Susan",Student,Club,Funded,64,3200^M
I need them to look like
57,13,Bob-Bill-and-Susan,Student,Club,Funded,64,3200
I'm using vim regexes. I've broken it down into 4 steps:
Remove ^M and insert newlines:
:%s:<ctrl-V><ctrl-M>:\r:g
Replace all spaces with -:
:%s: :\-:g
Remove commas between quotes: Need help here.
Remove quotes:
:%s:\"\([^"]*\)\":\1:g
How do I remove commas between quotes, without removing all commas in the file?
Something like this?
:%s:\("\w\+\),\(\w\+"\):\1 \2:g
My preferred solution to this problem (removing commas inside quoted regions) is to use replacements with an expression instead of trying to get this done in one regex.
To do this you need to prepend your replacement with \= to get the replacement treated as a vim expression. From there you can extract just the parts between quotes and then manipulate the matched part separately. This requires two short regexes instead of one complicated one.
:%s/".\{-}"/\=substitute(submatch(0), ',', '' , 'g')/g
So ".\{-}" matches anything in quotes (non greedy) and substitute(submatch(0), ',', '' , 'g') takes what was matched and removes all of the commas and its return value is used as the actual replacement.
The relevant help page is :help sub-replace-special.
As for the other parts of your question: Step 1 is essentially trying to remove all carriage returns, since the file is actually in the dos file format. You can also remove them with the dos2unix program.
In Step 2 escaping the - in the replacement is unnecessary. So the command is just
:%s/ /-/g
In Step 4, you have an overly complicated regex if all you want to do is remove quotes. All you need to do is match the quotes and remove them:
:%s/"//g
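The same "match the quoted region, then edit the match" idea can be sketched in Python, where a function passed to re.sub stands in for vim's \= expression. This is only an illustration: the name clean_line is made up, and it assumes quoted fields never contain escaped quotes.

```python
import re

QUOTED = re.compile(r'"[^"]*"')  # one quoted region, quotes included

def clean_line(line: str) -> str:
    """Apply the four steps from the question to a single line."""
    line = line.rstrip("\r\n")                     # step 1: drop the ^M
    # step 3: remove commas, but only inside each quoted region
    line = QUOTED.sub(lambda m: m.group(0).replace(",", ""), line)
    line = line.replace(" ", "-")                  # step 2: spaces -> hyphens
    return line.replace('"', "")                   # step 4: drop the quotes

if __name__ == "__main__":
    print(clean_line('57,13,"Bob, Bill and Susan",Student,Club,Funded,64,3200\r'))
```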
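:%s:\("\w*\)\(,\)\(.*"\):\1\3:g
example: "this is , an, example"
\("\w*\) matches the opening " and the word characters that follow it -> group 1 for back reference
\(,\) captures the comma -> group 2 for back reference
\(.*"\) matches every other character up to the closing quote -> group 3 for back reference
:\1\3: keeps only the groups without the comma; group 2 (\2) is discarded from the result
Note that \w does not match spaces, so this only matches when the comma directly follows the first word after the opening quote (as in "Bob, Bill and Susan"); the example line above has a space before its first comma, so it is not matched as written. Because .*" is greedy, each run also removes at most one comma per line.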

Regex for this dashed pattern

Would anyone have a suggestion for a regex that takes a line that ends in:
,04-721-0G-00033-AU
and transform that string into:
,04,721,0G,00033,AU
(replaces all dashes after last comma in a string into commas)
Keep in mind that there could be preceding parts of the string that have dashes and commas, so what I know for sure is that the part of the line I want manipulated is a string that starts with a last comma in the line, ends with EOL and has this structure of ,XX-XXX-XX-XXXXX-XX
Any suggestions?
Thanks.
Match: ,(?=[^,]*$)(\w{2})-(\w{3})-(\w{2})-(\w{5})-(\w{2})$
Replace by: ,$1,$2,$3,$4,$5
How it works:
,(?=[^,]*$) selects the last , of the line (literally: the , that is followed only by characters other than , until the end of the line).
after that, we try to match your XX-XXX-XX-XXXXX-XX with
(\w{2})-(\w{3})-(\w{2})-(\w{5})-(\w{2})
finally, $ makes sure that the end of the line has been reached
Then you just rewrite:
the ,
each XX group, now separated by a , instead of a -.
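A quick way to try the match/replace pair above is a Python sketch like the following. The function name split_last_field and the sample line are hypothetical; Python writes the group references as \1 to \5 rather than $1 to $5.

```python
import re

# The last comma of the line, then the XX-XXX-XX-XXXXX-XX structure
# captured group by group up to the end of the line.
LAST_FIELD = re.compile(
    r",(?=[^,]*$)(\w{2})-(\w{3})-(\w{2})-(\w{5})-(\w{2})$"
)

def split_last_field(line: str) -> str:
    """Turn the dashes of the final field into commas."""
    return LAST_FIELD.sub(r",\1,\2,\3,\4,\5", line)

if __name__ == "__main__":
    print(split_last_field("a-b,c-d,04-721-0G-00033-AU"))
```

Lines whose last field does not have exactly this structure are returned unchanged.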
Would this pattern do what you like?
-(?=[^,]{1,15}$)
Replace with ,
At each hyphen, a lookahead checks whether 1 to 15 non-comma characters remain before the end of the line; if so, the hyphen is replaced with a comma.
As no language is specified: for a multiline replace you might want to add the m modifier, and for JS additionally the g modifier for global.
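For comparison, here is the lookahead variant as a Python sketch. The name hyphens_to_commas is made up, and the input is assumed to be handled one line at a time, so no m modifier is needed; re.sub already replaces every match, so there is no g modifier either.

```python
import re

# A hyphen with 1 to 15 non-comma characters left before the end of the
# line, i.e. a hyphen that sits after the last comma.
TRAILING_HYPHEN = re.compile(r"-(?=[^,]{1,15}$)")

def hyphens_to_commas(line: str) -> str:
    """Replace the hyphens after the last comma with commas."""
    return TRAILING_HYPHEN.sub(",", line)

if __name__ == "__main__":
    print(hyphens_to_commas("a-b,c-d,04-721-0G-00033-AU"))
```

Unlike the capture-group version, this does not check the XX-XXX-XX-XXXXX-XX structure, so any hyphen within 15 non-comma characters of the end of a line is converted.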

Notepad++ matching end of line in regexp

I want to transform this (the lines are separated by a variable number of empty lines)
a
b
b
into this
a
b
b
The number of empty lines is variable and can be pretty huge. Empty lines may contain spaces. I want to use a regexp like \r\n( *\r\n)+, but Notepad++ doesn't seem to like those special characters in a regexp; I also tried \\r\\n( *\\r\\n)+
Please note that empty lines may contain spaces, so the correct regexp would be something like \\r\\n( *\\r\\n)+
You can do 'replace all' multiple times on
\r\n\r\n -> \r\n
That's with 'Extended' option selected, not 'Regular expression'.
If the empty lines contain spaces, then first replace all lines containing only spaces with nothing using the regex: ^\s+$ -> ''. Then do the extended replacement above.
Alternatively:
You can also replace all \r\n with some sequence of characters that doesn't exist in the document, e.g. ###, then use the following regex replacement: '###(\s*###)+' -> '###', and finally replace the sequence ('###') back with \r\n.
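As a sanity check on the regexp from the question itself, here is a minimal Python sketch showing that \r\n( *\r\n)+ does collapse the blank lines when the engine understands \r and \n. The function name collapse_blank_lines is hypothetical, and the text is assumed to use {CR}{LF} line endings throughout.

```python
import re

# A line break followed by one or more lines holding only spaces (or nothing).
BLANK_RUN = re.compile(r"\r\n( *\r\n)+")

def collapse_blank_lines(text: str) -> str:
    """Replace each run of empty or space-only lines with one line break."""
    return BLANK_RUN.sub("\r\n", text)

if __name__ == "__main__":
    sample = "a\r\n  \r\n\r\nb\r\n \r\nb"
    print(repr(collapse_blank_lines(sample)))
```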