Regex find comma not inside quotes

I'm checking line by line in C#
Example data:
bob jones,123,55.6,,,"Hello , World",,0
jim neighbor,432,66.5,,,Andy "Blank,,1
john smith,555,77.4,,,Some value,,2
The answer in "Regex to pick commas outside of quotes" is the closest I have found, but it doesn't handle the second line.

Try the following regex:
(?!\B"[^"]*),(?![^"]*"\B)
It does not match the second line because the " you inserted does not have a closing quotation mark.
It will not match values like so: ,r"a string",10 because the letter on the edge of the " will create a word boundary, rather than a non-word boundary.
Alternative version
(".*?,.*?"|.*?(?:,|$))
This will match the content and the commas and is compatible with values that are full of punctuation marks

The regexes below parse each field in a line, not an entire line.
Apply the methodical and desperate regex technique: divide and conquer.
Case: field does not contain a quote
abc,
abc(end of line)
[^,"]*(,|$)
Case: field contains exactly two quotes
abc"abc,"abc,
abc"abc,"abc(end of line)
[^,"]*"[^"]*"[^,"]*(,|$)
Case: field contains exactly one quote
abc"abc(end of line)
abc"abc, (and that there's no quote before the end of this line)
[^,"]*"[^,"]*$
[^,"]*"[^,"]*,(?!.*")
Now that we have all the cases, we then '|' everything together and enjoy the resultant monstrosity.
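One way the three cases might be glued together, as a rough Python sketch rather than a finished CSV parser. The names FIELD and split_fields are made up for illustration, and the one-quote alternative assumes that any number of non-comma, non-quote characters may follow the lone quote. A field with three or more quotes is not covered by any case and raises an error here.

```python
import re

# Hypothetical combination of the three cases. Order matters: the
# two-quote case is tried first, then one quote, then no quote.
FIELD = re.compile(
    r'[^,"]*"[^"]*"[^,"]*(?:,|$)'      # field contains exactly two quotes
    r'|[^,"]*"[^,"]*(?:,(?!.*")|$)'    # field contains exactly one quote
    r'|[^,"]*(?:,|$)'                  # field contains no quote
)

def split_fields(line):
    """Split one line (without its trailing newline) into fields."""
    fields, pos = [], 0
    while True:
        m = FIELD.match(line, pos)
        if m is None:
            raise ValueError(f"cannot parse field at offset {pos}")
        text = m.group(0)
        if not text.endswith(","):
            # The match ran to the end of the line: last field.
            fields.append(text)
            return fields
        fields.append(text[:-1])  # drop the delimiter comma
        pos = m.end()

print(split_fields('bob jones,123,55.6,,,"Hello , World",,0'))
print(split_fields('jim neighbor,432,66.5,,,Andy "Blank,,1'))
```

The loop matches one field at a time instead of using finditer, so an empty match at the very end of the line is only taken when the previous field ended with a comma.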

The best answer written by Vasili Syrakis does not work with negative numbers inside quotation marks such as:
bob jones,123,"-55.6",,,"Hello , World",,0
jim neighbor,432,66.5
Following regex works for this purpose:
,(?!(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$))
But I was not successful with this part of input:
,Andy "Blank,
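To make the behaviour concrete, here is a small Python sketch (the question is about C#, so treat this as an illustration of the pattern only). The name OUTSIDE_QUOTES is mine. The pattern matches a comma that is not followed by an odd number of quotes, which is why a single unbalanced quote disables every comma before it.

```python
import re

# Pattern from this answer: a comma NOT followed by an odd number of quotes.
OUTSIDE_QUOTES = re.compile(r',(?!(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$))')

balanced = 'bob jones,123,"-55.6",,,"Hello , World",,0'
unbalanced = 'jim neighbor,432,66.5,,,Andy "Blank,,1'

# Balanced quotes: the comma inside "Hello , World" is left alone.
print(OUTSIDE_QUOTES.split(balanced))

# One stray quote: every comma before it counts as "inside quotes",
# so only the commas after the quote are used as separators.
print(OUTSIDE_QUOTES.split(unbalanced))
```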

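Try this pattern: ".*?"(*SKIP)(*FAIL)|,
Note that (*SKIP) and (*FAIL) are PCRE backtracking verbs, so this works in PCRE-based engines but not in .NET's Regex class.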

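import re
# Remove every comma followed by an odd number of quotes, i.e. a comma inside a quoted field
line = 'bob jones,123,55.6,,,"Hello , World",,0'
print(re.sub(r',(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)', "", line))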

Related

Regex to capture a single new line instance, but not 2

I have a text file where lines are terminated by newline characters (\n) and paragraphs by double newlines (\n\n).
I want to strip out those single newlines and replace with simple spaces. But I do not want the double newlines affected.
I thought something like one of these would work:
(?!\n\n)\n
\n{1}
\n{1,1}
But no luck. Everything I try inevitably ends up affecting the double newlines too. How can I write a regex that effectively "ignores" \n\n but captures a single \n?
You can search using this regex:
(.)\n(?!\n)
And replace it with:
"\1 "
RegEx Breakup:
.\n: Match any character followed by a line break
(?!\n): Negative lookahead to assert that we don't have a line break at next position. We match one character before matching \n to make sure we don't match an empty line. Also note that this character is being captured in capture group #1. This will match all single line breaks but will skip double line breaks.
\1 : is replacement to append a space after first capture group
Python Code:
import re
repl = re.sub('(.)\n(?!\n)', r'\1 ', input)
print (repl)
JavaScript Code:
repl = input.replace(/(.)\n(?!\n)/g, '$1 ')
console.log (repl)
You'll need a negative lookahead and a negative lookbehind. /(?<!\n)\n(?!\n)/g would probably work off the top of my head.
That said, you should be aware of kind of spotty browser support for lookbehinds. It's gotten better since I last checked, but Safari and IE don't support it at all.
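For what it's worth, the same lookbehind/lookahead idea can be tried outside the browser. A minimal Python sketch, assuming \n line endings only; the sample text and variable names are invented for the example:

```python
import re

text = "line one\nline two\n\nnew paragraph\nstill the same paragraph"

# Replace a newline only if it has no newline directly before or after it.
joined = re.sub(r"(?<!\n)\n(?!\n)", " ", text)

print(repr(joined))
```

Unlike the (.)\n(?!\n) version, nothing needs to be captured and put back, because both lookarounds are zero-width.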
I thought of a simple way to do this. It may not be the right way from a regex point of view, but it works as a workaround.
import re
sample = """This is a sentence in para1.
this is also a sentence in para1
The begining of paragraph2 and sentence1
this is a second line in paragraph2.
"""
print(sample)
sample = re.sub(r'\n\n\n',"NPtag",sample)
sample = re.sub(r'\n\n'," ",sample)
sample = re.sub(r"NPtag",'\n\n\n',sample)
print("OUTPUT*****\n")
print(sample)
The workaround is to replace the multi-line break (three newlines here, to show the spacing clearly) with a new-paragraph tag (NPtag), then replace the single line break (two newlines in the code above, to show the spacing clearly in a notebook environment) with a space, and finally substitute the NPtag back with the multi-line break.
Hope this helps. Eager to see other regex answers too! Happy coding

Regex: Exact match string ending with specific character

I'm using Java. So I have a comma separated list of strings in this form:
aa,aab,aac
aab,aa,aac
aab,aac,aa
I want to use regex to remove aa and the trailing ',' if it is not the last string in the list. I need to end up with the following result in all 3 cases:
aab,aac
Currently I am using the following pattern:
"aa[,]?"
However it is returning:
b,c
If lookarounds are available, you can write:
,aa(?![^,])|(?<![^,])aa,
with an empty string as replacement.
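A quick way to check the lookaround pattern against the three sample lists. The question is about Java, so this Python sketch only illustrates the regex itself; EXACT_AA is a name I made up.

```python
import re

# ",aa" not followed by a non-comma, or "aa," not preceded by a non-comma.
EXACT_AA = re.compile(r",aa(?![^,])|(?<![^,])aa,")

for s in ("aa,aab,aac", "aab,aa,aac", "aab,aac,aa"):
    print(s, "->", EXACT_AA.sub("", s))

# For comparison, the pattern from the question also eats the "aa" in aab and aac.
print(re.sub(r"aa[,]?", "", "aa,aab,aac"))
```

The double negatives (?![^,]) and (?<![^,]) mean "a comma or the string edge", which is what keeps aab and aac intact.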
Otherwise, with a POSIX ERE syntax you can do it with a capture:
^(aa(,|$))+|(,aa)+(,|$)
with the 4th group as replacement (so $4 or \4)
Without knowing your flavor, I propose this solution for the case that it does know the \b.
I use perl as demo environment and do a replace with "_" for demonstration.
perl -pe "s/\baa,|,aa\b/_/"
\b is the "word boundary" anchor, i.e. it matches at any start or end of something that looks like a word. This lets it handle line end, line start, blank and comma.
Using it, two alternatives suffice to cover all the cases in your sample input.
Output (with interleaved input, with both, line ending in newline and line ending in blank):
aa,aab,aac
_aab,aac
aab,aa,aac
aab_,aac
aab,aac,aa
aab,aac_
aa,aab,aac
_aab,aac
aab,aa,aac
aab_,aac
aab,aac,aa
aab,aac_
If the \b is unknown in your regex engine, then please state which one you are using, i.e. which tool (e.g. perl, awk, notepad++, sed, ...). Also in that case it might be necessary to do replacing instead of deleting, i.e. to fine tune a "," or "" as replacement. For supporting that, please show the context of your regex, i.e. the replacing mechanism you are using. If you are deleting, then please switch to replacing beforehand.
(I picked up a suggestion from a comment by gisek that the capturing groups are not needed. I usually use () generously, including in other syntaxes. In my opinion, not having to think about or look up evaluation order is a net benefit in time and risk. But after testing, I use this terser, more elegant way.)
If your regex engine supports positive lookaheads and positive lookbehinds, this should work:
,aa(?=,)|(?<=,)aa,|(,|^)aa(,|$)
You could probably use the following and replace it by nothing :
(aa,|,aa$)
This matches either aa, when it is at the beginning or in the middle of the string,
or ,aa$ when it is at the end of the string.
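One caveat worth checking before relying on this: the aa, branch has nothing that ties it to the start of an item. A small Python sketch of that, where the second input is a hypothetical list I made up (it is not in the question):

```python
import re

loose = re.compile(r"(aa,|,aa$)")

# Sample input from the question: works as intended.
print(loose.sub("", "aab,aa,aac"))

# Hypothetical input with a longer item that merely ends in "aa":
# the tail of "aaa," is removed as well.
print(loose.sub("", "aaa,aab"))
```

If items like that can occur, one of the anchored or lookaround-based patterns is probably the safer choice.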
Since you want to delete aa followed by a comma or the end of the line, this should do the trick: ,aa(?=,|$)|^aa,

How do you "quantify" a variable number of lines using a regexp?

Say you know the starting and ending lines of some section of text, but the chars in some lines and the number of lines between the starting and ending lines are variable, à la:
aaa
bbbb
cc
...
...
...
xx
yyy
Z
What quantifier do you use, something like:
aaa\nbbbb\ncc\n(.*\n)+xx\nyyy\nZ\n
to parse those sections of text as a group?
You can use the s flag to match multi-line text, like this:
~\w+ ~s.
There is a similar question here:
Javascript regex multiline flag doesn't work
If I understood correctly, you know that your text begins with aaa\nbbbb\ncc and ends with xx\nyyy\nZ\n. You could use aaa.+?bbbb.+?cc(.+?)xx.+?yyy.+?Z so that all quantifiers are non-greedy and you don't accidentally capture two sections at once. The text in between would be in match group 1. You also need to turn on the setting that makes the dot match newlines.
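In Python, the "dot matches newline" setting is the re.DOTALL flag. A minimal sketch of the suggestion above, with made-up sample text and the invented name SECTION:

```python
import re

text = "aaa\nbbbb\ncc\nline 1\nline 2\nline 3\nxx\nyyy\nZ\n"

# re.DOTALL lets "." match "\n", so ".+?" can span several lines.
SECTION = re.compile(r"aaa.+?bbbb.+?cc(.+?)xx.+?yyy.+?Z", re.DOTALL)

m = SECTION.search(text)
# Group 1 still carries the newlines around the body; strip them off.
body = m.group(1).strip("\n").split("\n")
print(body)
```

Because every quantifier is lazy, group 1 stops at the first xx that is followed by yyy and Z, so two adjacent sections are not merged into one match.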
Try this:
aaa( |\n)bbbb( |\n)cc( |\n)( |\n){0,1}(.|\n)*xx( |\n)yyy( |\n)Z
( |\n) matches a space or a newline (so your starting and ending phrases can be split into different lines)
At the end of the day what worked for me using Kate was:
( )+aaa\n( )+bbbb\n( )+cc\n(.|\n)*( )+xx\n( )+yyy\n( )+Z\n
using such regexps you can clear pages of quite a bit of junk.

Regex validation of filename failing

I'm trying to validate a filename having letters "CAT" or "DOG" followed by 8 numerics, and ending in ".TXT".
Examples:
CAT20000101.TXT
DOG20031212.TXT
This would NOT match:
ATA12330000.TXT
CAT200T0101.TXT
DOG20031212.TX1
Here's the regex I am trying to make work:
(([A-Z]{3})([0-9]{8})([\.TXT]))\w+
Why is the last section (.TXT) failing against non-matching file extensions?
See example: http://regexr.com/3a7fo
Inside character class there is no regex grouping hence [\.TXT] is not right.
You can use this regex:
^[A-Z]{3}[0-9]{8}\.TXT$
For only matching CAT and DOG use:
^(CAT|DOG)[0-9]{8}\.TXT$
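Here is a short Python sketch that runs the examples from the question through the corrected pattern. VALID_NAME and is_valid are names I picked; fullmatch is used in place of the ^ and $ anchors.

```python
import re

# Same pattern as above, without ^ and $ because fullmatch() anchors both ends.
VALID_NAME = re.compile(r"(?:CAT|DOG)[0-9]{8}\.TXT")

def is_valid(name):
    """Return True if name is CAT or DOG, 8 digits, then .TXT."""
    return VALID_NAME.fullmatch(name) is not None

for name in ("CAT20000101.TXT", "DOG20031212.TXT",
             "ATA12330000.TXT", "CAT200T0101.TXT", "DOG20031212.TX1"):
    print(name, is_valid(name))

# Why the original pattern accepted .TX1: [\.TXT] consumes only the dot,
# and \w+ then happily matches "TX1".
original = re.compile(r"(([A-Z]{3})([0-9]{8})([\.TXT]))\w+")
print(original.search("DOG20031212.TX1") is not None)
```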
lose the unnecessary parentheses
[A-Z]{3}[0-9]{8}[\.TXT]\w+
lose the unnecessary/pattern-breaking character class [] around \.TXT
[A-Z]{3}[0-9]{8}\.TXT\w+
lose the \w+ at the end
[A-Z]{3}[0-9]{8}\.TXT
change [A-Z]{3} to (?:CAT|DOG).
(?:CAT|DOG)[0-9]{8}\.TXT
voilà.
It's failing because \.TXT is in square brackets, which matches only one of those four characters. Just use (\.TXT).
Remove the square brackets around [\.TXT], leaving \.TXT.
Your example modified http://regexr.com/3a7fu

Regex to match everything before two forward-slashes (//) not contained in quotes

I've been grappling with some negative lookahead and lookbehind patterns to no avail. I need a regex that will match everything in a string before two forward slashes, unless said characters are in quotes.
For example, in the string "//this is a string//" //A comment about a string about strings,
the substring "//this is a string//" ought to be matched and the rest ignored.
As you can see, the point is to exclude any single-line comments (C++/Java style).
Thanks in advance.
Here you go:
^([^/"]|".+?"|/[^/"])*
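To see what this pattern actually returns on the example from the question, here is a small Python sketch (BEFORE_COMMENT is just an illustrative name). Note that the match keeps the whitespace in front of the comment, so you may want to strip it.

```python
import re

# Repeatedly consume: a non-slash non-quote char, a quoted string,
# or a single slash that is not followed by another slash or a quote.
BEFORE_COMMENT = re.compile(r'^([^/"]|".+?"|/[^/"])*')

line = '"//this is a string//" //A comment about a string about strings'
code = BEFORE_COMMENT.match(line).group(0)

print(repr(code))           # includes the trailing space before the comment
print(repr(code.rstrip()))
```

A lone division slash outside quotes is kept too, because the third alternative only refuses a slash when the next character is a second slash or a quote.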
How about
\/\/[^\"']*$
It will match // if it is not followed by either a " or a '. It's not exactly what you requested, but closely meets your requirements. It will only choke on comments that contain " or ', like
// I like "bread".
Maybe better than no solution.
A python/regex based comment remover I wrote a while back, if it's helpful:
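import re

def remcomment(line):
    # Quoted strings match first and are skipped; group 1 is set only for a // outside quotes.
    for match in re.finditer(r'"[^"]*"|(//)', line):
        if match.group(1):
            # Keep everything before the comment marker, minus trailing whitespace.
            return line[:match.start()].rstrip()
    return line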