Adjust existing regex to ignore semicolon inside quotes - regex

I am using a regex to read CSV files and split their columns. The input files change frequently, and the content (though not the format) is unpredictable. I already use the following regex to read the CSV file and split the columns:
;(?=(?:[^\"]*\"*[^\"]*\")*[^\"]*$)
It was working until I faced an input like this:
'02'.'018'.'7975';PRODUCT 1;UN;02
'02'.'018'.'7976';PRODUCT 2;UN;02
'02'.'018'.'7977';PRODUCT 3;UN;02
'02'.'018'.'7978';"PRODUCT 4 ; ADDITIONAL INFO";UN;02 // Problem
'02'.'018'.'7979';"PRODUCT 5 ; ADDITIONAL INFO";UN;02 // Problem
I would like to understand how I can adjust my regex and adapt it to ignore semicolon inside quotes.
I am using Java with the method split from String class.

Bear in mind that you should probably use a parser for this, but if you must use regex, here's one that should work:
;(?=[^"]*(?:(?:"[^"]*){2})*$)
Explanation
; matches the semicolon.
(?=...) is a positive lookahead. It checks that the pattern contained in it will match, without actually matching it.
[^"]*(?:(?:"[^"]*){2})*$ ensures that there is an even number of quotes in the rest of the string, which means the semicolon is not inside a quoted field.
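Since the question mentions Java and String.split, here is a minimal sketch of how the pattern above could be used there. The class and method names (QuotedSemicolonSplit, splitLine) are made up for illustration, and it assumes quotes are always balanced within a line:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuotedSemicolonSplit {
    // The pattern from the answer: a semicolon followed by an even number
    // of double quotes up to the end of the line.
    static final Pattern SEPARATOR =
            Pattern.compile(";(?=[^\"]*(?:(?:\"[^\"]*){2})*$)");

    // Hypothetical helper: splits one CSV line into its columns.
    static String[] splitLine(String line) {
        // A limit of -1 keeps trailing empty columns instead of dropping them.
        return SEPARATOR.split(line, -1);
    }

    public static void main(String[] args) {
        String line = "'02'.'018'.'7978';\"PRODUCT 4 ; ADDITIONAL INFO\";UN;02";
        // The semicolon inside the double quotes is not treated as a separator.
        System.out.println(Arrays.toString(splitLine(line)));
    }
}
```

The quotes stay part of the second column, so you would still need to strip them afterwards if you don't want them in the value.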

Related

Regex: Trying to extract all values (separated by new lines) within an XML tag

I have a project that demands extracting data from XML files (values inside the <Number>... </Number> tag), however, in my regular expression, I haven't been able to extract lines that had multiple data separated by a newline, see the below example:
As you can see above, I couldn't replicate the multiple lines detection by my regular expression.
If you are using a script somewhere, your first plan should be to use an XML parser. Almost every language has one, and it should be far more accurate than using regex. However, if you just want to use regex to search for strings inside Notepad++, then you can use \s to match the newlines between the values:
<Number>(\d+\s)+<\/Number>
https://regex101.com/r/MwvBxz/1
I'm not sure I fully understand what you are trying to do so if this doesn't do it then let me know what you are going for.
You can use this find+replace combo to remove everything which is not a digit in between the <Number> tag:
Find:
.*?<Number>(.*?)<\/Number>.*
Replace:
$1
Finally I was able to find the right regular expression. I'll leave it below in case anyone needs it:
<Type>\d</Type>\n<Number>(\d+\n)+(\d+</Number>)
Explanation:
\d: Shortcut for digits, same as [0-9]
\n: Newline.
+: Find the previous element 1 to many times.
Have a good day everybody,
After giving it some more thought I decided to write a second answer.
You can make use of look arounds:
(?<=<Number>)[\d\s]+(?=<\/Number>)
https://regex101.com/r/FiaTKD/1
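If you end up doing this from code rather than from an editor, the lookaround version could be applied roughly like this in Java. This is only a sketch: the class and method names are hypothetical, and the sample XML in main is invented since the original example is not shown here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberTagExtractor {
    // Lookbehind/lookahead keep the tags out of the match itself.
    static final Pattern NUMBERS =
            Pattern.compile("(?<=<Number>)[\\d\\s]+(?=</Number>)");

    // Hypothetical helper: returns every value found inside <Number> tags.
    static List<String> extract(String xml) {
        List<String> values = new ArrayList<>();
        Matcher m = NUMBERS.matcher(xml);
        while (m.find()) {
            // One match may hold several values separated by newlines.
            for (String value : m.group().trim().split("\\s+")) {
                if (!value.isEmpty()) {
                    values.add(value);
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Invented sample input, not taken from the question.
        String xml = "<Type>1</Type>\n<Number>123\n456\n789</Number>";
        System.out.println(extract(xml)); // prints [123, 456, 789]
    }
}
```

For anything beyond a quick extraction, the XML parser advice above still applies.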

Regex: Deal with CSV containing nested JSON objects (comma hell!)

I have a CSV file which contains arbitrary JSON objects. Here's a simplified version of the file:
v1,2020-06-09T22:44:46.377Z,cb6deb64-d6a0-4151-ba9b-bfa54ae75180,{"payload":{"assetId":"a3c2a944-d554-44bb-90a4-b7beafbc6bff","permissionsToParty":[{"partyType":1,"partyId":"74457bd4-c2ab-4760-942b-d6c623a97f19","permissions":["CREATE","DELETE","DOWNLOAD","EDIT","VIEW"]}]}},lastcolumn
v2,2020-06-09T22:44:47.377Z,50769c0d-0a05-4028-9f0b-40ab570af31a,{"scheduleIds":[]},lastcolumn
v3,2020-06-09T22:44:48.377Z,12345678-0a05-4028-9f0b-40ab570af31a,{"jobId":"4dfeb16d-f9d6-4480-9b84-60c5af0bd3ce","result":"success","status":"completed"},lastcolumn
The commas (if any) inside the JSON wreak havoc with CSV parsing.
I'm looking for a way to either...
...capture and replace all the commas outside the JSON objects with pipes (|) so I can simply key on those:
v1|2020-06-09T22:44:46.377Z|cb6deb64-d6a0-4151-ba9b-bfa54ae75180|{"payload":{"assetId":"a3c2a944-d554-44bb-90a4-b7beafbc6bff","permissionsToParty":[{"partyType":1,"partyId":"74457bd4-c2ab-4760-942b-d6c623a97f19","permissions":["CREATE","DELETE","DOWNLOAD","EDIT","VIEW"]}]}}|lastcolumn
v2|2020-06-09T22:44:47.377Z|50769c0d-0a05-4028-9f0b-40ab570af31a|{"scheduleIds":[]}|lastcolumn
v3|2020-06-09T22:44:48.377Z|12345678-0a05-4028-9f0b-40ab570af31a|{"jobId":"4dfeb16d-f9d6-4480-9b84-60c5af0bd3ce","result":"success","status":"completed"}|lastcolumn
...or wrap each JSON object with single quotes:
v1,2020-06-09T22:44:46.377Z,cb6deb64-d6a0-4151-ba9b-bfa54ae75180,'{"payload":{"assetId":"a3c2a944-d554-44bb-90a4-b7beafbc6bff","permissionsToParty":[{"partyType":1,"partyId":"74457bd4-c2ab-4760-942b-d6c623a97f19","permissions":["CREATE","DELETE","DOWNLOAD","EDIT","VIEW"]}]}}',lastcolumn
v2,2020-06-09T22:44:47.377Z,50769c0d-0a05-4028-9f0b-40ab570af31a,'{"scheduleIds":[]}',lastcolumn
v3,2020-06-09T22:44:48.377Z,12345678-0a05-4028-9f0b-40ab570af31a,'{"jobId":"4dfeb16d-f9d6-4480-9b84-60c5af0bd3ce","result":"success","status":"completed"}',lastcolumn
Alas, my regex kung-fu is too weak to create something flexible enough based on the arbitrary nature of the JSON objects that may show up.
The closest I've gotten is:
(?!\B{[^}]*),(?![^{]*}\B)
Which still captures commas (the comma directly before "permissionsToParty", below) in an object like this:
{"payload":{"assetId":"710728f9-7c13-4bcb-8b5d-ef347afe0b58","permissionsToParty":[{"partyType":0,"partyId":"32435a92-c7b3-4fc0-b722-2e88e9e839e5","permissions":["CREATE","DOWNLOAD","VIEW"]}]}}
Can anyone simplify what I've done thus far and help me with an expression that ignores ALL commas within the outermost {} symbols of the JSON?
Your regex is quite close to what you need. To stop it from matching that unwanted comma, you may try the regex below:
(?!\B{[^}]*|"),(?![^{]*}\B|")
The only change is the |" alternative added at the end of each lookahead.
You can find the demo of the above regex in here.
Alternatively, you can use the second regex below to put the JSON string in quotes. This is more efficient than the regex above, and I recommend that you use it.
(\{.*\})
Explanation of the above regex:
() - Represents capturing group.
\{ - Matches the curly brace { literally.
.* - Matches everything inside of the curly braces greedily in order to capture all the nested braces.
'$1' - You can replace the given test string with 1st captured group $1 and put quotes outside of it.
You can find the demo of the above regex in here.
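As a rough illustration of the second regex, here is how the replacement could look in Java. The class and method names are made up, and it assumes each line holds exactly one JSON object and that no other column contains curly braces:

```java
class JsonColumnQuoter {
    // Hypothetical helper: wraps the JSON part of a line in single quotes.
    static String quoteJson(String line) {
        // The greedy .* runs from the first { to the last } on the line,
        // so nested braces end up inside the same capture group.
        return line.replaceAll("(\\{.*\\})", "'$1'");
    }

    public static void main(String[] args) {
        String line = "v2,2020-06-09T22:44:47.377Z,"
                + "50769c0d-0a05-4028-9f0b-40ab570af31a,{\"scheduleIds\":[]},lastcolumn";
        System.out.println(quoteJson(line));
    }
}
```

Because . does not match line breaks by default, the same call should also work on a whole file read into one string, handling each line separately.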
Remember that you can do a lot with regex, but sometimes you need to write your own code instead. You can't do everything with it.
How to do this depends on what you know about the CSV you receive. It looks like there aren't any values within double quotes, if you do not count the JSON part?
Some regex engines support recursion.
If yours does, you can find the JSON parts with this expression: \{((?>[^{}]+|(?R))*)\}
How it works: Recursion explained here.
Here is a guide to how CSV can be parsed if it has double-quoted parts.
Expression: (?:(?:"(?:[^"]|"")*"|(?<=,)[^,]*(?=,))|^[^,]+|^(?=,)|[^,]+$|(?<=,)$)
Guide to parse CSV
If you know that the CSV does not contain any double-quoted values, then it may be doable if you convert it in a two-step process.
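To illustrate the "write your own code" route: java.util.regex has no (?R) recursion, so the sketch below swaps the recursive regex for a plain brace-depth counter. It is not the approach from the linked guides, the names are hypothetical, and it assumes the only double quotes on a line are the ones inside the JSON object:

```java
import java.util.ArrayList;
import java.util.List;

class BraceAwareCsvSplitter {
    // Hypothetical helper: splits on commas that are outside any { } pair.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;            // current nesting depth of curly braces
        boolean inString = false; // true while inside a JSON string literal

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inString) {
                current.append(c);
                if (c == '\\' && i + 1 < line.length()) {
                    current.append(line.charAt(++i)); // keep escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"' && depth > 0) {
                inString = true; // braces inside JSON strings must not count
                current.append(c);
            } else if (c == '{') {
                depth++;
                current.append(c);
            } else if (c == '}') {
                if (depth > 0) depth--;
                current.append(c);
            } else if (c == ',' && depth == 0) {
                fields.add(current.toString()); // top-level comma ends a column
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String line = "v2,2020-06-09T22:44:47.377Z,"
                + "50769c0d-0a05-4028-9f0b-40ab570af31a,{\"scheduleIds\":[]},lastcolumn";
        // Joining with | gives the pipe-delimited form asked for in the question.
        System.out.println(String.join("|", split(line)));
    }
}
```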

Use REGEX to find line breaks within a wrapped content

The direct question: How can I use REGEX lookarounds to find instances of \r\n that occur between a set of characters (stand in open and closing tags), "[ and ]" with arbitrary characters and line breaks inside as well?
The situation:
I have a large database exported to tab- or comma-delimited text files that I'm trying to import into Excel. The problem is that some of the cells come from text areas that contain line breaks, and are qualified by double quotes. When importing into Excel, these line breaks are treated as new rows. I cannot adjust how the file is exported. The data needs to be preserved, but the exact format doesn't, so I was planning on using some placeholder for the returns, or ~
Here's a generic illustration of the format of my data:
column1rowA column2rowA column3rowA column4rowA
column1rowB column2rowB "column3rowB
3Bcont
3Bcont
3Bcont
" column4rowB
column1rowC column2rowC column4rowC
column1rowD column2rowD "column3rowD
3Dcont" column4rowD
My thought has been to try to select and replace line breaks within the quotes using REGEX search and replace in Notepad++. To try to make it simpler, I have tried adding a character to the double quotes to help indicate whether it is an opening or closing quote:
"[column3rowB
3Bcont
3Bcont
3Bcont
]"
I am new to REGEX. The progress I've made (which isn't much) is:
(?<="[) missing some sort of wildcard \r\n(?=.*]")
Every iteration I've tried has also included every line break between the first "[ and last ]"
I would also appreciate any other approaches that solve the underlying problem
If you can use some tool other than Notepad++, you can use this regex (see my working example on regex101):
(?!\n(([^"]*"){2})*[^"]*$)\n
It uses a negative lookahead to find line breaks only when not followed by an even number of quotes. You could replace them with <br>, spaces, or whatever is appropriate.
Breakdown:
(?! ... ) This is the negative lookahead, necessary because it's zero-width. Anything matched by it will still be available to match again.
(([^"]*"){2})* This is the other key piece. It ensures even-numbered pairs of non-quote characters followed by a quote.
[^"]*$ This is ensuring that there are no more quotes from there until the end of the string.
Caveat:
I couldn't get it to work in Notepad++ because it always recognizes $ as the end of a line, not the end of the entire string.
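For what it's worth, in Java the default (non-MULTILINE) meaning of $ is the end of the input, optionally before a final line terminator, so the regex above can be tried there. A small sketch, with a made-up class name and a made-up tab-separated sample, assuming quotes are balanced across the whole text:

```java
class QuotedNewlineReplacer {
    // The regex from the answer: a \n that is NOT followed by an even
    // number of quotes up to the end of the input, i.e. a \n inside quotes.
    static final String NEWLINE_IN_QUOTES = "(?!\\n(([^\"]*\"){2})*[^\"]*$)\\n";

    // Hypothetical helper: replaces line breaks inside quoted cells.
    static String flatten(String text, String placeholder) {
        // Note: replaceAll treats $ and \ in the placeholder specially.
        return text.replaceAll(NEWLINE_IN_QUOTES, placeholder);
    }

    public static void main(String[] args) {
        // Invented sample: row 1 has a quoted cell with a line break in it.
        String text = "a\tb\t\"c\nd\"\te\nf\tg\th";
        // Only the break between c and d is replaced; the row break stays.
        System.out.println(flatten(text, " "));
    }
}
```

The lookahead rescans the rest of the input for every line break, so this may get slow on very large files.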
Great answer from Brian. I added an option that would only consider real linebreaks (i.e. \n\r), which worked for my CSV file:
(?!(?:\n|\r)(([^"]*"){2})*[^"]*$)(?:\n|\r)

Regular expression to delete row from csv

I have a line from CSV
first decimal;;;first text;;second text with newlines, special symbols, including semicolons;second decimal, always present;first dot separated float, may not present;second dot separated float, may not present;third text that present only if present previous float
I need to delete second text (with new lines and special symbols).
As for now I have expression like:
(?<=;;)(.*?)(?=;\d+)
The first part of it does not work, and I don't know how to make it select text preceded by exactly two semicolons (for now it selects text preceded by two or more semicolons, and the first decimal preceded by semicolons + newline if I turn on dotall). Besides, I do not know how to include the newline symbol in (.*?).
If you have a CSV file that contains semicolons and newlines as part of quoted fields, then regex is not the right tool for this. Imagine what would happen if you had a field like "This is one field;;don't split this;42"...
If you're sure that you'll never have two semicolons before or within a quoted field, then you may give regex a try. But a dedicated CSV parser would definitely be a safer bet.
That said, let's see why your regex fails:
Imagine the line 1;;;2;3. Your regex will match ;2 because it fulfills all the requirements - there are two semicolons before it, and a semicolon plus digit after it. It's also the shortest possible match at this position in the string.
What can you do? You could use another lookbehind assertion to make sure that it's not possible to match three semicolons before the current position:
(?<=;;)(?<!;;;)(.*?)(?=;\d+)
Give it a try - but look into CSV libraries too, because they will solve your problem better.

Regex Partial String CSV Matching

Let me preface this by saying I'm a complete amateur when it comes to RegEx and only started a few days ago. I'm trying to solve a problem formatting a file and have hit a hitch with a particular type of data. The input file is structured like this:
Two words,Word,Word,Word,"Number, number"
What I need to do is format it like this...
"Two words","Word","Word","Word","Number, number"
I have had a RegEx pattern of
s/,/","/g
working, except it also replaces the comma in the already quoted Number, number section, which causes the field to separate and breaks the file. Essentially, I need to modify my pattern to replace a comma with "," [quote comma quote], but only when that comma isn't followed by a space. Note that the other fields will never have a space following the comma, only the delimited number list.
I managed to write up
s/,[A-Za-z0-9]/","/g
which, while matching the appropriate strings, would replace the comma AND the following letter. I have heard of backreferences and think that might be what I need to use? My understanding was that
s/(,)[A-Za-z0-9]\b
should work, but it doesn't.
Anyone have an idea?
My experience has been that this is not a great use of regexes. As already said, CSV files are better handled by real CSV parsers. You didn't tag a language, so it's hard to tell, but in perl, I use Text::CSV_XS or DBD::CSV (allowing me SQL to access a CSV file as if it were a table, which, of course, uses Text::CSV_XS under the covers). Far simpler than rolling my own, and far more robust than using regexes.
s/,([^ ])/","$1/ will match a "," followed by a "not-a-space", capturing the not-a-space, then replacing the whole thing with "," followed by the captured part.
Depending on which regex engine you're using, you might be writing \1 or other things instead of $1.
If you're using Perl or otherwise have access to a regex engine with negative lookahead, s/,(?! )/","/ (a "," not followed by a space) works.
Your input looks like CSV, though, and if it actually is, you'd be better off parsing it with a real CSV parser rather than with regexes. There are a lot of other odd corner cases to worry about.
This question is similar to: Replace patterns that are inside delimiters using a regular expression call.
This could work:
s/"([^"]*)"|([^",]+)/"$1$2"/g
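In case it helps to see that substitution outside of Perl or sed, here is a sketch of the same pattern in Java. The class and method names are invented for the example:

```java
class CsvFieldQuoter {
    // Hypothetical helper: puts double quotes around every non-empty field.
    static String quoteFields(String line) {
        // Alternative 1 matches an already quoted field (content in $1),
        // alternative 2 matches a bare field (content in $2).
        // The group that did not take part in the match contributes nothing.
        return line.replaceAll("\"([^\"]*)\"|([^\",]+)", "\"$1$2\"");
    }

    public static void main(String[] args) {
        String line = "Two words,Word,Word,Word,\"Number, number\"";
        System.out.println(quoteFields(line));
    }
}
```

Empty fields (two commas in a row) are left as they are, since neither alternative can match an empty string.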
Looks like you're using Sed.
While your pattern seems to be a little inconsistent, I'm assuming you'd like every item separated by commas to have quotations around it. Otherwise, you're looking at areas of computational complexity regular expressions are not meant to handle.
Through sed, your command would be:
sed 's/[ \"]*,[ \"]*/\", \"/g'
Note that you'll still have to put doublequotes at the beginning and end of the string.