Regex to match invalid CSV line with unescaped quotes - regex

Let's say I have a file of strings like
11,"abc","def"
12,"ab "c"","def" // invalid
13,"ab,"c"","def" // invalid
14,""a" b,c","def" // invalid
15,""a", "b"c","def" // invalid
As you can see, some of the double quotes are unescaped. I'd like to filter out the invalid strings before I try to parse them.
I'm thinking of doing something like \,\".+\"\, to find a token and then checking that it doesn't contain "," inside, but I can't figure out how to make it work.
I've searched SO but haven't found an answer that works for me.
Thank you.

If the string fields always start and end with ", you can try this Java regex:
(?<=,\s{0,99}"|(?!\A)\G)[^"]+|(?<=(?!\A)\G|")(")(?!\s*[,\n]|$)
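Group 1 captures the invalid quotes; you can get their indices with matcher.start(1) and matcher.end(1). The \s{0,99} inside the lookbehind relies on Java's support for bounded-length lookbehind, so the pattern may not work in other regex engines.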
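If you're not tied to Java, another way to do the filtering is to describe what a valid line looks like and reject everything else. Below is a minimal sketch in Python. It is a different technique from the \G-based regex above, and it assumes that legitimate inner quotes are escaped by doubling them ("") and that unquoted fields contain no quotes or commas. The names FIELD, VALID_LINE and is_valid_csv_line are made up for this example.

```python
import re

# One field: either quoted (inner quotes must be doubled) or bare
# (no quote and no comma allowed). Assumption: doubled-quote escaping.
FIELD = r'(?:"(?:[^"]|"")*"|[^",]*)'

# A valid line is one field followed by zero or more ",field" groups.
VALID_LINE = re.compile(FIELD + r'(?:,' + FIELD + r')*')


def is_valid_csv_line(line):
    """Return True if the whole line consists of well-formed fields."""
    return VALID_LINE.fullmatch(line) is not None


lines = [
    '11,"abc","def"',
    '12,"ab "c"","def"',
    '13,"ab,"c"","def"',
    '14,""a" b,c","def"',
    '15,""a", "b"c","def"',
]

# Keep only the lines that fail validation.
invalid = [line for line in lines if not is_valid_csv_line(line)]
for line in invalid:
    print(line)
```

This only tells you which lines are bad, not where the offending quote is; for the position you would still need something like the group 1 indices from the Java pattern.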

Related

Regex to match forward slash surrounded by double quotes

I have a serialised string that comes from Spring hosted end-points. On the frontend which is javascript based, I wanted to prettify the serialised string that comes from API to a string that is parsable through JSON.parse();
Please let me know the regex to match and replace the required fields as below.
Sample string: \"address\":\"<VALUE>"\"}. I want to replace every instance of "\" that comes at the end of VALUE with \"
Tried doing this: str.replaceAll('\"/\\\"', '/\\\"') but no luck.
Here is the code. The backslashes and quotes have to be escaped to get the intended value into the variable, and the g flag makes replace() act on every occurrence rather than only the first:
var testString = '\\\"address\\\":\\\"<VALUE>"\\\"},';
alert(testString);
alert(testString.replace(/\"\\\"/g, '\\\"'));
The first alert gives us the original testString:
\"address\":\"<VALUE>"\"},
and the second shows the modified testString:
\"address\":\"<VALUE>\"},
Tested with https://www.webtoolkitonline.com/javascript-tester.html
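For anyone who wants to double-check the replacement outside the browser, here is the same idea as a small Python sketch. It uses a plain substring replace instead of a regex, which is enough here because the sequence to find (quote, backslash, quote) is fixed. The function name fix_trailing_quote is made up for this example.

```python
def fix_trailing_quote(text):
    """Replace every quote-backslash-quote sequence with backslash-quote."""
    # '"\\"' is the three characters  " \ "   and '\\"' is the two characters  \ "
    return text.replace('"\\"', '\\"')


# Raw string so the backslashes are taken literally.
raw = r'\"address\":\"<VALUE>"\"},'
print(raw)
print(fix_trailing_quote(raw))
```

str.replace works on all non-overlapping occurrences, so it behaves like the JavaScript version with the g flag.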

Get an exact regex match of an email value from a list of email addresses

I have a text field which stores a list of email addresses, e.g. x@demo.com; a.x@demo.com. I have another text field which stores the exact value matched from the list of emails, i.e. if /x@demo.com/i is in x@demo.com;a.x@demo.com then it should return x@demo.com.
The issue I am having is that if I have /a.x@demo.com/i, I will get x@demo.com instead of a.x@demo.com
I know of the regex /^x@demo.com$/i, but this means I can only have one email in my list of email addresses, which won't help.
I have tried a couple of other regular expressions with no luck.
Any ideas on how I can achieve this?
You can use this slightly changed regex:
/(^|;)x@demo.com($|;)/i
It matches starting either at the beginning of the string or right after a semicolon, and ending either at the end of the string or at a semicolon.
Edit:
Small change: this version uses lookbehind and lookahead, so you get only the match you want:
(?<=^|;)x@demo.com(?=$|;)
Edit2:
To allow spaces around the semicolon and at the start and end, use this (@-quoted):
@"(?<=^\s*|;\s*)x@demo.com(?=\s*$|\s*;)"
or use double escaping:
"(?<=^\\s*|;\\s*)x@demo.com(?=\\s*$|\\s*;)"

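The delimiter-anchored match from the answer above can also be sketched in Python. Python's re module only accepts fixed-width lookbehind, so this version consumes the leading delimiter in a non-capturing group and returns a capture group instead of relying on lookbehind. The address is passed through re.escape so that the dots are matched literally. The function name find_exact is made up for this example.

```python
import re


def find_exact(email, email_list):
    """Return the list entry equal to `email` (ignoring case), or None."""
    pattern = (
        r'(?:^|;)\s*('          # start of string or a semicolon, optional spaces
        + re.escape(email)      # the address itself, dots matched literally
        + r')\s*(?=;|$)'        # optional spaces, then a semicolon or the end
    )
    match = re.search(pattern, email_list, re.IGNORECASE)
    return match.group(1) if match else None


print(find_exact("x@demo.com", "a.x@demo.com; x@demo.com"))
print(find_exact("x@demo.com", "a.x@demo.com"))
```

Because the match has to start at a delimiter, x@demo.com is no longer found inside a.x@demo.com.
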
Talend tExtractRegexFields match after comma syntax

I'm trying to extract everything in a string after the first comma, using the tExtractRegexFields component.
I'm splitting strings in an address field (Address_1) to a second address field (Address_2).
On regexr.com, the following syntax works perfectly: ,[\s\S]*$
In order to comply with Talend's escape sequences, I changed that syntax to
,[\\s\\S]*$. That solved the error, but the code doesn't appear to match on anything, since nothing is split from Address_1 to Address_2.
What's wrong? Does this syntax not work in Talend? Are there alternate Regex solutions?
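To split a string using tExtractRegexFields, use a regex with capture groups so that each group is delivered to its own column. I used this regex and it works fine: "^(.*)[,]([^,]*)$" (my input string: "123 North Drive,PO Box 1,Miami, FL 55555-5555"). Note that because .* is greedy, this pattern splits at the last comma; to split at the first comma instead, "^([^,]*),(.*)$" should do it.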
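To see which comma each grouping pattern splits on, here is a quick check in Python rather than Talend. Talend uses Java regex, but these particular constructs (anchors, greedy .*, negated character class) behave the same way in both, so treat this as an illustration of the patterns and not of the component itself. The variable names are made up for this example.

```python
import re

address = "123 North Drive,PO Box 1,Miami, FL 55555-5555"

# Greedy (.*) grabs as much as possible, so the split lands on the LAST comma.
last_comma = re.match(r'^(.*)[,]([^,]*)$', address)

# ([^,]*) cannot cross a comma, so the split lands on the FIRST comma.
first_comma = re.match(r'^([^,]*),(.*)$', address)

print(last_comma.groups())
print(first_comma.groups())
```

With the second pattern, group 1 would feed Address_1 and group 2 would feed Address_2.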

Setting regular expression to validate URL format in Adobe CQ5

I want to validate a URL inside a textfield using Adobe CQ5, so I set up the regex and regexText properties as usual, but for some reason it is not working:
<facebook
jcr:primaryType="cq:Widget"
emptyText="http://www.facebook.com/account-name"
fieldDescription="Set the Facebook URL"
fieldLabel="Facebook"
name="./facebookUrl"
regex="/^(http://www.|https://www.|http://|https://)[a-z0-9]+([-.]{1}[a-z0-9]+)*.[a-z]{2,5}(:[0-9]{1,5})?(/.*)?$/"
regexText="Invalid URL format"
xtype="textfield"/>
So when I type inside the component I can see an error message in the console:
Uncaught TypeError: this.regex.test is not a function
To be more accurate the error comes from this line:
if (this.regex && !this.regex.test(value)) {
I tried several regular expressions and none of them worked. I guess the problem is the regular expression itself, because on the other hand I have this other regex to validate email addresses, and it works perfectly fine:
/^[A-za-z0-9]+[\\._]*[A-za-z0-9]*@[A-za-z.-]+[\\.]+[A-Za-z]{2,4}$/
Any suggestions? Thanks in advance.
The syntax of your regex seems to treat the forward slashes (/) as special characters. Since you want to parse a URL containing slashes, my guess is you should escape them twice like this: '\\/' instead of '/'. The result would be:
/^(http:\\/\\/www.|https:\\/\\/www.|http:\\/\\/|https:\\/\\/)[a-z0-9]+([-.]{1}[a-z0-9]+)*.[a-z]{2,5}(:[0-9]{1,5})?(\\/.*)?$/
You need to escape them twice because the string to be compiled as a regex must contain '\/' to escape the slashes, but to introduce a backslash in a string you have to escape the backslash itself too.
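To rule out the pattern itself, it can help to exercise it outside the dialog. The sketch below does that in Python, where / is not a delimiter and needs no escaping. It says nothing about how CQ5 compiles the regex property. Two small changes from the question's pattern are my own: the dots that are meant literally are escaped, and the redundant {1} is dropped. The names URL_RE and is_valid_url are made up for this example.

```python
import re

# Same structure as the pattern in the question: scheme, host labels,
# a 2-5 letter TLD, optional port, optional path.
URL_RE = re.compile(
    r'^(http://www\.|https://www\.|http://|https://)'
    r'[a-z0-9]+([-.][a-z0-9]+)*\.[a-z]{2,5}'
    r'(:[0-9]{1,5})?(/.*)?$'
)


def is_valid_url(value):
    """Return True if the value matches the URL pattern."""
    return URL_RE.match(value) is not None


print(is_valid_url("http://www.facebook.com/account-name"))
print(is_valid_url("facebook"))
```

If the pattern accepts and rejects the inputs you expect here, the remaining suspect is the way the slashes are written in the widget definition.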

Using regex multiple capture groups to split up a string

I have a file that looks like this...
"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123456","D","TEST1 "
"1234567123456","D","TEST 2~TEST3"
"1234567123456","R","TEST4~TEST5"
"1234567123457","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123457","D","TEST 6"
"1234567123457","D","TEST7"
"1234567123457","R","TEST 8~TEST9~TEST,10"
All I'm trying to do is parse the D and R lines. The ~ is used in this case as a separator. So the end results would be...
"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123456","D","TEST1 "
"1234567123456","D","TEST 2"
"1234567123456","D","TEST3"
"1234567123456","R","TEST4"
"1234567123456","R","TEST5"
"1234567123457","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123457","D","TEST 6"
"1234567123457","D","TEST7"
"1234567123457","R","TEST 8"
"1234567123457","R","TEST9"
"1234567123457","R","TEST,10"
I'm using regex on applications like Textpad and Notepad++. I have not figured out how to use a regex like /.+/g because the applications do not like the forward slashes. So I don't think I can use things like the global modifier. I currently have the following regex...
//In a program like Textpad/Notepad++
<FIND> "(.{13})","D","([^~]*)~(.*)
<REPLACE> "\1","D","\2"\n"\1","D","\3
Now if I run a find and replace with the above params a few times it would work fine (for the D lines only). The problem is there is an unknown number of lines to be made. For example...
"1234567123456","D","TEST1~TEST2~TEST3~TEST4~TEST5"
"1234567123457","D","TEST1~TEST2~TEST3"
"1234567123458","D","TEST1~TEST2"
"1234567123459","D","TEST1~TEST2~TEST3~TEST4"
I was hoping to be able to use a MULTI capture group to make this work. I found this PAGE talking about the common mistake between repeating a capturing group and capturing a repeated group. I need to capture a repeated group. For some reason I just could not make mine work right though. Anyone else have an idea?
Note: If I could get rid of the leading and trailing spaces EX: "1234567123456","D","TEST1 " ending up as "1234567123456","D","TEST1" that would be even better but not necessary.
RESOURCES:
http://www.regular-expressions.info/captureall.html
http://regex101.com/
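Not a pure find-and-replace answer, but if running a small script is an option, the unknown number of ~ parts stops being a problem. Here is a minimal sketch in Python. It assumes every D or R line has exactly the three-field shape shown in the question (a 13-character key, the type, one quoted payload), and it also trims the leading and trailing spaces mentioned in the note. LINE_RE and expand are made-up names.

```python
import re

# "<13-char key>","D" or "R","<payload that may contain ~>"
LINE_RE = re.compile(r'^"(.{13})","([DR])","(.*)"$')


def expand(lines):
    """Split each D/R line into one line per ~-separated part."""
    result = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            # V lines (and anything unexpected) pass through unchanged.
            result.append(line)
            continue
        key, kind, payload = match.groups()
        for part in payload.split("~"):
            # strip() removes the leading/trailing spaces of each part.
            result.append('"%s","%s","%s"' % (key, kind, part.strip()))
    return result


sample = [
    '"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"',
    '"1234567123456","D","TEST1 "',
    '"1234567123456","D","TEST 2~TEST3"',
    '"1234567123457","R","TEST 8~TEST9~TEST,10"',
]
for out_line in expand(sample):
    print(out_line)
```

Splitting on ~ in code handles any number of parts in one pass, which is the part that a single capture-group replacement cannot do.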