Regex: Deal with CSV containing nested JSON objects (comma hell!) - regex

I have a CSV file which contains arbitrary JSON objects. Here's a simplified version of the file:
v1,2020-06-09T22:44:46.377Z,cb6deb64-d6a0-4151-ba9b-bfa54ae75180,{"payload":{"assetId":"a3c2a944-d554-44bb-90a4-b7beafbc6bff","permissionsToParty":[{"partyType":1,"partyId":"74457bd4-c2ab-4760-942b-d6c623a97f19","permissions":["CREATE","DELETE","DOWNLOAD","EDIT","VIEW"]}]}},lastcolumn
v2,2020-06-09T22:44:47.377Z,50769c0d-0a05-4028-9f0b-40ab570af31a,{"scheduleIds":[]},lastcolumn
v3,2020-06-09T22:44:48.377Z,12345678-0a05-4028-9f0b-40ab570af31a,{"jobId":"4dfeb16d-f9d6-4480-9b84-60c5af0bd3ce","result":"success","status":"completed"},lastcolumn
The commas (if any) inside the JSON wreak havoc with CSV parsing.
I'm looking for a way to either...
...capture and replace all the commas outside the JSON objects with pipes (|) so I can simply key on those:
v1|2020-06-09T22:44:46.377Z|cb6deb64-d6a0-4151-ba9b-bfa54ae75180|{"payload":{"assetId":"a3c2a944-d554-44bb-90a4-b7beafbc6bff","permissionsToParty":[{"partyType":1,"partyId":"74457bd4-c2ab-4760-942b-d6c623a97f19","permissions":["CREATE","DELETE","DOWNLOAD","EDIT","VIEW"]}]}}|lastcolumn
v2|2020-06-09T22:44:47.377Z|50769c0d-0a05-4028-9f0b-40ab570af31a|{"scheduleIds":[]}|lastcolumn
v3|2020-06-09T22:44:48.377Z|12345678-0a05-4028-9f0b-40ab570af31a|{"jobId":"4dfeb16d-f9d6-4480-9b84-60c5af0bd3ce","result":"success","status":"completed"}|lastcolumn
...or wrap each JSON object with single quotes:
v1,2020-06-09T22:44:46.377Z,cb6deb64-d6a0-4151-ba9b-bfa54ae75180,'{"payload":{"assetId":"a3c2a944-d554-44bb-90a4-b7beafbc6bff","permissionsToParty":[{"partyType":1,"partyId":"74457bd4-c2ab-4760-942b-d6c623a97f19","permissions":["CREATE","DELETE","DOWNLOAD","EDIT","VIEW"]}]}}',lastcolumn
v2,2020-06-09T22:44:47.377Z,50769c0d-0a05-4028-9f0b-40ab570af31a,'{"scheduleIds":[]}',lastcolumn
v3,2020-06-09T22:44:48.377Z,12345678-0a05-4028-9f0b-40ab570af31a,'{"jobId":"4dfeb16d-f9d6-4480-9b84-60c5af0bd3ce","result":"success","status":"completed"}',lastcolumn
Alas, my regex kung-fu is too weak to create something flexible enough based on the arbitrary nature of the JSON objects that may show up.
The closest I've gotten is:
(?!\B{[^}]*),(?![^{]*}\B)
Which still captures commas (the comma directly before "permissionsToParty", below) in an object like this:
{"payload":{"assetId":"710728f9-7c13-4bcb-8b5d-ef347afe0b58","permissionsToParty":[{"partyType":0,"partyId":"32435a92-c7b3-4fc0-b722-2e88e9e839e5","permissions":["CREATE","DOWNLOAD","VIEW"]}]}}
Can anyone simplify what I've done thus far and help me with an expression that ignores ALL commas within the outermost {} symbols of the JSON?

Your regex is quite close to what you want. To stop it from matching the unwanted comma, you may try the regex below:
(?!\B{[^}]*|"),(?![^{]*}\B|")
The changed part is the |" alternative added at the end of each lookahead.
You can find the demo of the above regex here.
Alternatively, you can use the second regex below to wrap the JSON string in quotes. It is more efficient than the regex above, and I recommend using it.
(\{.*\})
Explanation of the above regex:
() - Represents capturing group.
\{ - Matches the curly brace { literally.
.* - Greedily matches everything inside the curly braces, so that all the nested braces are captured.
\} - Matches the closing curly brace } literally.
'$1' - The replacement string: the first captured group $1 with single quotes around it.
You can find the demo of the above regex here.
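As a minimal sketch of that replacement in Python (the language choice and the wrap_json helper name are assumptions, not part of the answer above), assuming each line holds exactly one JSON object and no braces appear in the other columns:

```python
import re

# Greedy match from the first "{" to the last "}" on the line.
JSON_BLOB = re.compile(r"(\{.*\})")

def wrap_json(line: str) -> str:
    """Wrap the JSON part of one CSV line in single quotes."""
    # \1 is Python's spelling of the $1 backreference used above.
    return JSON_BLOB.sub(r"'\1'", line, count=1)

line = 'v2,2020-06-09T22:44:47.377Z,50769c0d-0a05-4028-9f0b-40ab570af31a,{"scheduleIds":[]},lastcolumn'
print(wrap_json(line))
```

Because the match is greedy, a brace in a later column would be swallowed into the quoted part, so this only holds under the assumption stated above.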

Remember that you can do a lot with regex, but sometimes you need to write your own code for it. You can't do everything with it.
How to do this depends on what you know about the CSV you receive. It looks like there aren't any values within double quotes, if you do not count the JSON part?
Some regex engines support recursion.
If yours does, you can find the JSON parts with this expression: \{((?>[^{}]+|(?R))*)\}
For a description of how it works, recursion is explained here.
Here is a guide to how CSV can be parsed if it has double-quoted parts.
Expression: (?:(?:"(?:[^"]|"")*"|(?<=,)[^,]*(?=,))|^[^,]+|^(?=,)|[^,]+$|(?<=,)$)
Guide to parsing CSV
If you know that the CSV does not contain any double-quoted values, then it may be doable if you convert it in a two-step process.
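For the "write your own code" route, here is a minimal sketch in Python. The function name pipe_delimit is hypothetical, and it assumes each line contains well-formed JSON and that braces outside the JSON column never occur. It tracks brace depth and skips over JSON string literals, so only commas outside the outermost {} become pipes:

```python
def pipe_delimit(line: str) -> str:
    """Replace commas outside the outermost {...} with pipes."""
    out = []
    depth = 0          # current nesting level of { }
    in_string = False  # inside a JSON string literal?
    escaped = False    # previous character was a backslash inside a string?
    for ch in line:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"' and depth > 0:
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
        elif ch == "," and depth == 0:
            ch = "|"   # a column separator, not a JSON comma
        out.append(ch)
    return "".join(out)

print(pipe_delimit('v2,2020-06-09T22:44:47.377Z,{"scheduleIds":[]},lastcolumn'))
```

Unlike a single regex, this copes with arbitrarily deep nesting and with braces or commas inside JSON string values.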

Related

Regex: Trying to extract all values (separated by new lines) within an XML tag

I have a project that demands extracting data from XML files (values inside the <Number>... </Number> tag), however, in my regular expression, I haven't been able to extract lines that had multiple data separated by a newline, see the below example:
As you can see above, I couldn't replicate the multiple lines detection by my regular expression.
If you are using a script somewhere, your first plan should be to use an XML parser. Almost every language has one, and it should be far more accurate than regex. However, if you just want to use regex to search for strings inside Notepad++, then you can use \s+ to capture multiple new lines:
<Number>(\d+\s)+<\/Number>
https://regex101.com/r/MwvBxz/1
I'm not sure I fully understand what you are trying to do so if this doesn't do it then let me know what you are going for.
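To illustrate the parser route, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The SAMPLE document and its Root wrapper element are assumptions made up for illustration, since the question's original example is not shown here:

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a <Number> element holding several
# newline-separated values.
SAMPLE = """<Root>
<Type>1</Type>
<Number>123
456
789</Number>
</Root>"""

def extract_numbers(xml_text: str) -> list:
    """Return one list of values per <Number> element."""
    root = ET.fromstring(xml_text)
    # str.split() with no arguments splits on any whitespace, newlines included.
    return [elem.text.split() for elem in root.iter("Number") if elem.text]

print(extract_numbers(SAMPLE))
```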
You can use this find+replace combo to remove everything which is not a digit in between the <Number> tag:
Find:
.*?<Number>(.*?)<\/Number>.*
Replace:
$1
Finally, I was able to find the right regular expression. I'll leave it below in case anyone needs it:
<Type>\d</Type>\n<Number>(\d+\n)+(\d+</Number>)
Explanation:
\d: Shortcut for digits, same as [0-9]
\n: Newline.
+: Matches the previous element one or more times.
Have a good day everybody,
After giving it some more thought I decided to write a second answer.
You can make use of look arounds:
(?<=<Number>)[\d\s]+(?=<\/Number>)
https://regex101.com/r/FiaTKD/1
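A small Python sketch of the lookaround pattern above, in case it helps to see it applied outside an editor. The number_values helper and the sample text in the usage line are assumptions for illustration:

```python
import re

# Same idea as the pattern above: digits and whitespace that sit
# directly between <Number> and </Number>.
NUMBER_BODY = re.compile(r"(?<=<Number>)[\d\s]+(?=</Number>)")

def number_values(text: str) -> list:
    """Return the whitespace-separated values of each <Number> element."""
    return [body.split() for body in NUMBER_BODY.findall(text)]

print(number_values("<Number>123\n456\n789</Number>"))
```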

Regex for fixing YAML strings

I am trying to create a bunch of YAML files, mostly composed of strings of text. Now when using apostrophes in words, they must be escaped by typing a double apostrophe, because I’m using apostrophes to wrap the strings.
I want to create a regex that will check for apostrophes in the text that aren’t double. What I have is this:
^([^'\n]*?)'(([^'\n]*?)'(?!')([^'\n]+?))*?'$\n
https://regex101.com/r/v4nUTn/3
My issue is that as soon as my string has a double apostrophe, but also has an apostrophe which isn’t a double apostrophe, it doesn’t match because my negative lookahead doesn’t match as soon as it sees the double apostrophe. (for example the string t''e'st won’t match even though it is missing a double apostrophe after the e)
How can I make it so that my negative lookahead will not fail as soon as it sees one double apostrophe?
This regex should work:
\w'\w
Test here.
My guess is that maybe an expression similar to
('[^'\r\n]*'|[^\r\n\w']+)|([\w']*)
would be an option to look into.
If the second capturing group returns true, then the string is undesired.
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
One suggestion would be to do this in two steps.
For example, if every 'candidate' value looks like this: - 'something here' (where you want to test the apostrophes in the something here content of the string), then first isolate that content via:
/^\s*- '(.+)'$/im
And then make sure all apostrophes appear as you want them to within match group 1 of the result.
Then, replace the original match with your 'sanitised' match.
Doing this means you don't have to be concerned with the bounding apostrophes causing complications to the check for apostrophes in the value.
Note: there may well be a perfect one-step regex to do this, but understanding that you can break tasks into several steps is useful if you spend a lot of time with regular expressions, and can help you sidestep 'perfect regex paralysis'.
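The two-step idea can be sketched in Python as follows. The function name and the sample YAML in the usage lines are assumptions; step two here strips every doubled apostrophe and reports any content that still contains a lone one:

```python
import re

# Step 1: isolate the content between the bounding apostrophes.
ITEM = re.compile(r"^\s*- '(.+)'$", re.MULTILINE)

def values_with_unescaped_apostrophes(yaml_text: str) -> list:
    """Return the quoted values that contain an apostrophe that is not doubled."""
    bad = []
    for match in ITEM.finditer(yaml_text):
        content = match.group(1)
        # Step 2: remove escaped pairs; anything left over is unescaped.
        if "'" in content.replace("''", ""):
            bad.append(content)
    return bad

sample = "- 'it''s fine'\n- 't''e'st'\n- 'plain'"
print(values_with_unescaped_apostrophes(sample))
```

Splitting the work this way means the bounding apostrophes never take part in the check.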
If you want your regex to match when there is at least one lone single quote inside the quoted string, the pattern should consume either text that contains no single quote or a pair of single quotes. To do that, add |'' as an alternative in your regex, so that it consumes either non-single-quote text or a doubled single quote.
Try this updated regex demo and see if it works the way you wanted:
https://regex101.com/r/v4nUTn/4

Adjust existing regex to ignore semicolon inside quotes

I am using a regex to read CSV files and split their columns. The content of the files changes frequently and is unpredictable (the format is not). I already use the following regex to read the CSV file and split the columns:
;(?=(?:[^\"]*\"*[^\"]*\")*[^\"]*$)
It was working until I faced input like this:
'02'.'018'.'7975';PRODUCT 1;UN;02
'02'.'018'.'7976';PRODUCT 2;UN;02
'02'.'018'.'7977';PRODUCT 3;UN;02
'02'.'018'.'7978';"PRODUCT 4 ; ADDITIONAL INFO";UN;02 // Problem
'02'.'018'.'7979';"PRODUCT 5 ; ADDITIONAL INFO";UN;02 // Problem
I would like to understand how I can adjust my regex and adapt it to ignore semicolon inside quotes.
I am using Java with the method split from String class.
Bear in mind that you should probably use a parser for this, but if you must use regex, here's one that should work:
;(?=[^"]*(?:(?:"[^"]*){2})*$)
Explanation
; matches the semicolon.
(?=...) is a positive lookahead. It checks that the pattern contained in it will match, without actually matching it.
[^"]*(?:(?:"[^"]*){2})*$ ensures that there are an even number of quotes in the rest of the string.
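As a quick check of the pattern, here is a minimal sketch in Python rather than Java (the language and the split_columns name are assumptions; the regex itself is unchanged). It assumes double quotes are always balanced within a line:

```python
import re

# Split on a semicolon only if an even number of double quotes follows it,
# which means the semicolon itself is outside any quoted field.
SEPARATOR = re.compile(r';(?=[^"]*(?:(?:"[^"]*){2})*$)')

def split_columns(line: str) -> list:
    return SEPARATOR.split(line)

print(split_columns("'02'.'018'.'7978';\"PRODUCT 4 ; ADDITIONAL INFO\";UN;02"))
```

The quoted field keeps its surrounding double quotes; a real CSV parser would normally strip them.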

Regex for value.contains() in Google Refine

I have a column of strings, and I want to use a regex to find commas or pipes in every cell, and then perform an action. I tried this, but it doesn't work (no syntax error, it just matches neither commas nor pipes).
if(value.contains(/(,|\|)/), ...
The funny thing is that the same regex works with the same data in SublimeText. (Yes, I can work it there and then reimport, but I would like to understand what's the difference or what is my mistake).
I'm using Google Refine 2.5.
Since value.match should return captured texts, you need to define a regex with a capture group and check if the result is not null.
Also, pay attention to the regex itself: the string should be matched in its entirety:
Attempts to match the string s in its entirety against the regex pattern p and returns an array of capture groups.
So, add .* before and after the pattern you are looking inside a larger string:
if(value.match(/.*([,|]).*/) != null)
You can use a combination of if and isNonBlank like:
if(isNonBlank(value.match(/your regex/)), ...
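The whole-string requirement can be illustrated with an analogy in Python. This is not GREL code, and the helper name is an assumption; re.fullmatch, like the match function described above, only succeeds when the pattern consumes the entire string:

```python
import re

def has_comma_or_pipe(value: str) -> bool:
    # fullmatch must consume the whole string, hence the .* padding
    # on both sides of the character class.
    return re.fullmatch(r".*([,|]).*", value, re.DOTALL) is not None

print(has_comma_or_pipe("a,b"))                    # padded pattern
print(re.fullmatch(r"([,|])", "a,b") is not None)  # unpadded pattern
```

Without the padding, the pattern only matches a cell whose entire content is a single comma or pipe.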

Using an asterisk in a RegExp to extract data that is enclosed by a certain pattern

I have a text that consists of information enclosed by a certain pattern.
The only thing I know is the pattern: "${template.start}" and "${template.end}"
To keep it simple I will substitute ${template.start} and ${template.end} with "a" in the example.
So one entry in the text would be:
aINFORMATIONHEREa
I do not know how many of these entries are concatenated in the text. So the following is correct too:
aFOOOOOOaaASDADaaASDSDADa
I want to write a regular expression to extract the information enclosed by the "a"s.
My first attempt was to do:
a(.*)a
which works as long as there is only one entry in the text. As soon as there is more than one entry it fails, because .* matches everything. So using a(.*)a on aFOOOOOOaaASDADaaASDSDADa results in only one capturing group, containing everything between the first and the last character of the text, which are both "a":
FOOOOOOaaASDADaaASDSDAD
What I want to get is something like
captureGroup(0): aFOOOOOOaaASDADaaASDSDADa
captureGroup(1): FOOOOOO
captureGroup(2): ASDAD
captureGroup(3): ASDSDAD
It would be great to be able to extract each entry from the text and, from each entry, the information that is enclosed between the "a"s. By the way, I am using the QRegExp class of Qt4.
Any hints? Thanks!
Markus
Multiple variations of this question have been seen before. Various related discussions:
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Using regular expressions how do I find a pattern surrounded by two other patterns without including the surrounding strings?
Use RegExp to match a parenthetical number then increment it
Regex for splitting a string using space when not surrounded by single or double quotes
What regex will match text excluding what lies within HTML tags?
and probably others...
Simply use non-greedy expressions, namely:
a(.*?)a
You need to match something like:
a[^a]*a
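Both suggestions can be tried with a short Python sketch (Python's re module is an assumption here; the question itself uses QRegExp, whose syntax differs). A capture group is added to the second pattern so that only the enclosed text is returned:

```python
import re

text = "aFOOOOOOaaASDADaaASDSDADa"

# Non-greedy: stop at the first closing "a".
lazy = re.findall(r"a(.*?)a", text)

# Negated character class: the body may not contain the delimiter.
negated = re.findall(r"a([^a]*)a", text)

print(lazy)
print(negated)
```

Both patterns give the same entries for this input, since the delimiter never occurs inside an entry.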
You have a couple of working answers already, but I'll add a little gratuitous advice:
Using regular expressions for parsing is a road fraught with danger
Edit: To be less cryptic: for all their power, flexibility and elegance, regular expressions are not sufficiently expressive to describe any but the simplest grammars. They are adequate for the problem asked here, but are not a suitable replacement for state machines or recursive descent parsers if the input language becomes more complicated.
So, choosing to use RE for parsing input streams is a decision that should be made with care and with an eye towards the future.