Matching key/value pairs with comments - regex

For a JavaScript application, I'm trying to come up with a regex that will match key/value pairs in a string. It's working pretty well, but there is one last thing that I need to implement and I'm not sure how.
The syntax is very similar to what you'll find in a .env file. So key/value pairs look like KEY=value.
A few rules that I have already implemented:
The key
alphanumeric string.
can't be empty and can't be a number.
may contain an underscore
The value
can be a string
may be surrounded by single or double quotes, or none at all.
Now I'm trying to add comments with # in there. It works, except when # is between the quotes. Any idea how to fix that? Thanks!
Here is my code sample:
// This is my regex
const regex = /^\s*(?![0-9_]*\s*=\s*([\W\w\s.]*)\s*$)[A-Z0-9_]+\s*=\s*(.*)?\s*(?<!#.*)/gi;
// Outputs [ "KEY=value " ] --> OK
const str = `KEY=value # Comment`;
console.log(str.match(regex));
// Outputs [ "KEY2=val" ] --> OK
const str2 = `KEY2=val#ue # Comment`;
console.log(str2.match(regex));
// Outputs [ "key3='value3' " ] --> OK
const str3 = `key3='value3' # Comment`;
console.log(str3.match(regex));
// Outputs [ "key_4='val" ] --> NOT OK
// Expecting [ "key_4='val#ue4' " ]
const str4 = `key_4='val#ue4' # Comment`;
console.log(str4.match(regex));
EDIT:
Here is another sample for testing:
# The following are matching
ONE = This is ONE
TWO=This is TWO
THREE="This is 'THREE'"
FOUR = "This is \"FOUR\""
fi_ve = 'This is \'FIVE\''
six='This is "SIX"'
NUMBER7="This is SEVEN" # Comment for SEVEN
number8="This is EIGHT"#Comment for EIGHT
NINE="This is #9"
TEN=This is #10
ELEVEN=
TWELVE=10
THIRTEEN=TRUE
FOURTEEN="true"
FIFTEEN=false
SIXTEEN='FALSE'
# The following are not matching(incl. empty line)
17="Is not valid because the key is a number"
="Is also not valid because the key is missing"

You may use
([A-Za-z_]\w*)[ \t]*=[ \t]*('[^'\\]*(?:\\.[^'\\]*)*'|"[^"\\]*(?:\\.[^"\\]*)*"|[^\r\n#]*)
See the regex demo
([A-Za-z_]\w*) - Group 1: a letter or underscore followed by 0 or more word chars (letters, digits, underscores)
[ \t]*=[ \t]* - a = enclosed with 0 or more spaces or tabs
('[^'\\]*(?:\\.[^'\\]*)*'|"[^"\\]*(?:\\.[^"\\]*)*"|[^\r\n#]*) - Group 2:
'[^'\\]*(?:\\.[^'\\]*)*'| - a '...' like substring that may contain any string escape sequence, or
"[^"\\]*(?:\\.[^"\\]*)*"| - a "..." like substring that may contain any string escape sequence, or
[^\r\n#]* - 0 or more chars other than #, CR and LF
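As a rough usage sketch (not part of the original answer), the pattern can be run over multi-line input with matchAll. The leading ^[ \t]* anchor, the g and m flags, the parsePairs helper name, and the trimming of unquoted values are all assumptions added here, so that commented-out lines and numeric keys are skipped:

```javascript
// Sketch: parse .env-style text with the pattern from the answer.
// Assumption: the ^[ \t]* prefix and the "gm" flags are additions for this example.
const pairRe = /^[ \t]*([A-Za-z_]\w*)[ \t]*=[ \t]*('[^'\\]*(?:\\.[^'\\]*)*'|"[^"\\]*(?:\\.[^"\\]*)*"|[^\r\n#]*)/gm;

// Hypothetical helper: returns an array of [key, value] pairs.
function parsePairs(text) {
  const pairs = [];
  for (const m of text.matchAll(pairRe)) {
    // Unquoted values may carry trailing spaces before a comment; trim them.
    // Quoted values keep their surrounding quotes.
    pairs.push([m[1], m[2].trim()]);
  }
  return pairs;
}

const sample = [
  "KEY=value # Comment",
  "KEY2=val#ue # Comment",
  "key_4='val#ue4' # Comment",
  'NINE="This is #9"',
  "ELEVEN=",
  '17="Is not valid because the key is a number"',
  '="Is also not valid because the key is missing"',
].join("\n");

console.log(parsePairs(sample));
```

Note that an unquoted value is still cut at the first #, so KEY2=val#ue yields val, exactly as in the question's second example.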

Related

How to match all characters between [ ] and except ", "

Hello, I'm trying to extract each group of data (each item is separated by ", ") from a string like this:
MyString=[XXXXXX:XX XX XX XX, XXXXX:332.83, XXXXX:XXX-XX-XX XX:XX:XX, XXXX:0.0, XXXX:2, XXXX:0, XXXX:-256, counter_tipeee:5, XXXX:136935, XXXX:0, XXXX:XX XXX XXX, XXXX:0.5, XXXXX:true, XXXX:0.509375, XXX:0.0, XXXX:[2022-06-14 06:45:00], 2022-09-17 XXXXX:1]
With this regex, I can match all characters except ,
([^,]*)
https://regex101.com/r/lCN2YK/1
But I also want to leave out the ", " separator itself.
The problem is that if I remove spaces with \s, it also removes spaces from inside some of the data in my string. I want to extract every item, splitting only on exactly comma+space (", ").
Another problem with my regex is that it does not exclude the first [ and the last ] from my string. I can't exclude all [ ] because some of the data contains [ ].
I found this regex to exclude the first and last character, ^.(.*).$, but I don't know how to combine my two regexes.
https://regex101.com/r/CAsKHE/1
The output that I expect is
List<String> My_goal= [
XXXXXX:XX XX XX XX
XXXXX:332.83
XXXXX:XXX-XX-XX XX:XX:XX
XXXX:0.0, XXXX:2
....
2022-09-17,XXXXX:1
]
Try this:
(?<=(?<!: *)\[).*?(?=,)|(?<=, *(?=[^ \r\n]))(?:.*?(?=,)|[^,\r\n\[\]]+?(?=\])|[^,\r\n]+\](?= *\]))
See regex demo.
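If the separator is always exactly comma+space and only the outermost brackets need to go, a simpler alternative to a single regex is to slice off the first and last character and split on ", ". This is a different technique from the regex above, sketched in JavaScript because the asker's language is not stated; the splitItems name is hypothetical:

```javascript
// Alternative sketch: strip the outer [ ] and split on the exact ", " separator.
// Assumption: the input always starts with "[" and ends with "]".
function splitItems(input) {
  return input.slice(1, -1).split(", ");
}

const myString =
  "[XXXXXX:XX XX XX XX, XXXXX:332.83, XXXX:[2022-06-14 06:45:00], 2022-09-17 XXXXX:1]";

// Inner brackets and inner spaces are left untouched.
console.log(splitItems(myString));
```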

Regex to match certain characters anywhere between two characters

I want to detect (and return) any punctuation within brackets. The line of text I'm looking at will have multiple sets of brackets (which I can assume are properly formatted). So given something like this:
[abc.] [!bc]. [.e.g] [hi]
I'd want to detect all those cases and return something like [[.], [!], [..]].
I tried to do /{.*?([.,!?]+).*?}/g but then it returns true for [hello], [hi] which I don't want to match!
I'm using JS!
You can match substrings between square brackets and then remove all chars that are not punctuation:
const text = '[abc.] [!bc]. [.e.g]';
const matches = text.match(/\[([^\][]*)]/g).map(x => `[${x.replace(/[^.,?!]/g, '')}]`)
console.log(matches);
If you need to make your regex fully Unicode-aware, you can use an ECMAScript 2018+ compliant solution like
const text = '[abc.] [!bc、]. [.e.g]';
const matches = text.match(/\[([^\][]*)]/g).map(x => `[${x.replace(/[^\p{P}\p{S}]/gu, '')}]`)
console.log(matches);
So,
\[([^\][]*)] matches a string between [ and ] with no other [ and ] inside
.replace(/[^.,?!]/g, '') removes all chars other than ., ,, ? and !
.replace(/[^\p{P}\p{S}]/gu, '') removes all chars other than Unicode punctuation proper and symbols.
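The snippets above return an empty [] for a group with no punctuation. Since the question does not want groups such as [hi] reported, here is a small variant sketch that skips them; the punctuationInBrackets name is made up for this example:

```javascript
// Sketch: collect only the punctuation found inside each [...] group,
// skipping groups that contain none (such as [hi]).
function punctuationInBrackets(text) {
  const found = [];
  for (const m of text.matchAll(/\[([^\][]*)]/g)) {
    // m[1] is the text between the brackets; keep only . , ? !
    const punct = m[1].replace(/[^.,?!]/g, "");
    if (punct !== "") found.push(`[${punct}]`);
  }
  return found;
}

console.log(punctuationInBrackets("[abc.] [!bc]. [.e.g] [hi]"));
```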

Extract table key-values from Lua code

I have multiple strings from Lua code, each one with a Lua table item, something like:
atable['akeyofthetable'] = { 'name' = 'a name', 'thevalue' = 34, 'anotherkey' = 'something' }
The string might be spanned in multiple lines, meaning it might be:
atable['akeyofthetable'] = { 'name' = 'a name',
'thevalue' = 34,
"anotherkey" = 'something' }
How can I get some of the keys with their values (e.g. only name and anotherkey in the above example) as "re.match" objects in Python 3 from that string? Because this is taken from code, the existence of keys is not guaranteed, the quoting of keys and values (double or single quotes) may vary, even from key to key, and there may be empty values ('name' = '') or unquoted strings as values ('thevalue' = anonquotedstringasvalue). Even the order of the keys is not guaranteed. Splitting on commas (,) does not work because some string values contain commas (e.g. 'anotherkey' = 'my beloved, strange, value' or even 'anotherkey' = "my beloved, 'strange' = 34, value"). Also, keys may or may not be quoted (if the names are in ASCII they probably will not be quoted).
Is it possible to do this using one regex, or must I do a separate search for every key needed?
Code
If there is a possibility of escaped quotes \' or \" within the string, you can substitute the respective capture groups for '((?:[^'\\]|\\.)*)' as seen here.
See regex in use here
['\"](?:name|anotherkey)['\"]\s*=\s*(?:'([^']*)'|\"([^\"]*)\")
Usage
See code in use here
import re
keys = [
"name",
"anotherkey"
]
r = r"['\"](" + "|".join([re.escape(key) for key in keys]) + r")['\"]\s*=\s*(?:'([^']*)'|\"([^\"]*)\")"
s = "atable['akeyofthetable'] = { 'name' = 'a name',\n\t 'thevalue' = 34, \n\t \"anotherkey\" = 'something' }"
print(re.findall(r, s))
Explanation
The second point below is replaced by a join of the keys array.
['\"] Match any character in the set '"
(name|anotherkey) Capture the key into capture group 1
['\"] Match any character in the set '"
\s* Match any number of whitespace characters
= Match this literally
\s* Match any number of whitespace characters
(?:'([^']*)'|\"([^\"]*)\") Match either of the following
'([^']*)' Match ', followed by any character except ' any number of times, followed by '
\"([^\"]*)\" Match ", followed by any character except " any number of times, followed by "

Why is this regexp returning an empty item at the beginning of the array?

I've got a javascript string I'm trying to split, but I'm getting an empty element at the beginning of the returned array, and I can't figure out why.
var split_in_el = in_el.split(/(#|\.|\[)/);
where in_el is one of .first, #last, or [color:red].
The returned arrays I'm getting (in Node.js, but that shouldn't matter) are
.first //in_el
[ '', '.', 'first' ] //returned
#last //in_el
[ '', '#', 'last' ] //returned
[color:blue] //in_el
[ '', '[', 'color:blue]' ] //returned
Here's a jsfiddle showing the issue.
This is how split() works in general. Let's say we are splitting the following on the dot .:
Hello.World
     ^
----- -----
Then the returned array would be: ["Hello", "World"].
Now what if the previous line were like this:
.World
^
-- -----
Then we get an array like this: ["", "World"]. The split() method returns everything before the dot . and everything after it; nothing exists before the dot here, so the first element is the empty string "".
In a larger example:
.Hello.World.From.
It would return: ["", "Hello", "World", "From", ""].
Now the confusing part in your situation shouldn't be how you are getting the empty string, but rather why the character you are splitting on appears in the resulting array.
For example, there is a dot . when you split around the dot . in .first, there is a pound sign # when you split around the pound sign # in #last, and so on.
This can become obvious when you look at the documentation of split() method:
If separator is a regular expression that contains capturing parentheses, then each time separator is matched the results (including any undefined results) of the capturing parentheses are spliced into the output array.
The separator in your case is a regular expression /(#|\.|\[)/ that matches (or splits around) either a pound sign #, a dot . or an opening square bracket [ inside a capturing group, so the matched character is added to the resulting array.
/(#|\.|\[)/
 ^       ^
 ---------
These parentheses are used to create the capturing group
You can solve this by converting the capturing group into a non-capturing one like this:
/(?:#|\.|\[)/
  ^^
Notice the syntax
Finally I want to add one thing: in situations like .first and #last you probably don't want to use split(), but rather RegExp.exec() or String.match() to look for a particular match using a given pattern.
For example if you want to retrieve the word after a . character like .first then you can do:
var matches = ".first".match(/\.\w+/);
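To see the difference side by side, here is a small sketch comparing the capturing and non-capturing separators on the three inputs from the question. The splitSelector helper and the filter step that drops the leading empty string are additions for illustration, not part of the answer above:

```javascript
const selectors = [".first", "#last", "[color:blue]"];

// Hypothetical helper: split without keeping the separator or empty strings.
function splitSelector(sel) {
  return sel.split(/(?:#|\.|\[)/).filter((part) => part !== "");
}

for (const sel of selectors) {
  // Capturing group: the matched separator is spliced into the result.
  const withCapture = sel.split(/(#|\.|\[)/);
  // Non-capturing group: only the pieces around the separator are returned,
  // so the leading empty string is still there.
  const withoutCapture = sel.split(/(?:#|\.|\[)/);
  console.log(sel, withCapture, withoutCapture, splitSelector(sel));
}

// With match(), a capture group isolates the word after the dot.
const m = ".first".match(/\.(\w+)/);
console.log(m && m[1]);
```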

Regular expression to find unescaped double quotes in CSV file

What would a regular expression be to find sets of 2 unescaped double quotes that are contained in columns set off by double quotes in a CSV file?
Not a match:
"asdf","asdf"
"", "asdf"
"asdf", ""
"adsf", "", "asdf"
Match:
"asdf""asdf", "asdf"
"asdf", """asdf"""
"asdf", """"
Try this:
(?m)""(?![ \t]*(,|$))
Explanation:
(?m) // enable multi-line matching (^ will act as the start of the line and $ will act as the end of the line (i))
"" // match two successive double quotes
(?! // start negative look ahead
[ \t]* // zero or more spaces or tabs
( // open group 1
, // match a comma
| // OR
$ // the end of the line or string
) // close group 1
) // stop negative look ahead
So, in plain English: "match two successive double quotes, only if they DON'T have a comma or end-of-the-line ahead of them, optionally with spaces and tabs in between".
(i) besides being the normal start-of-the-string and end-of-the-string meta characters.
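A quick way to check the pattern against the sample rows from the question, sketched in JavaScript. JavaScript regex literals do not accept a leading (?m), so the m flag is used instead; the variable names are illustrative only:

```javascript
// Same pattern as above, with the multi-line option expressed as the "m" flag.
// No "g" flag, so test() keeps no state between calls.
const unescapedPair = /""(?![ \t]*(?:,|$))/m;

const shouldNotMatch = [
  '"asdf","asdf"',
  '"", "asdf"',
  '"asdf", ""',
  '"adsf", "", "asdf"',
];

const shouldMatch = [
  '"asdf""asdf", "asdf"',
  '"asdf", """asdf"""',
  '"asdf", """"',
];

for (const row of [...shouldNotMatch, ...shouldMatch]) {
  console.log(row, "=>", unescapedPair.test(row));
}
```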
Due to the complexity of your problem, the solution depends on the engine you are using. This is because solving it requires lookbehind and lookahead, and engines do not all support them in the same way.
My answer uses the Ruby engine. The check is just one regex, but I put the whole code here to explain it better.
NOTE that, due to the Ruby regex engine (or the limits of my knowledge), optional lookahead/lookbehind is not possible, so I handle spaces before and after the comma in a small preprocessing step.
Here is my code:
orgTexts = [
'"asdf","asdf"',
'"", "asdf"',
'"asdf", ""',
'"adsf", "", "asdf"',
'"asdf""asdf", "asdf"',
'"asdf", """asdf"""',
'"asdf", """"'
]
orgTexts.each { |orgText|
  # Preprocessing: remove spaces before and after a separating comma.
  # This is needed if there may be spaces around a valid comma.
  orgText = orgText.gsub(Regexp.new('\" *, *\"'), '","')
  # Replace every valid character (a non-quote or a valid quote) with '-'
  resText = orgText.gsub(Regexp.new('([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\\\)\")'), '-')
  # resText = orgText.gsub(Regexp.new('([^\"]|(^|(?<=,)|(?<=\\\\))\"|\"($|(?=,)))'), '-')
  # [^\"] ===> a non-quote character
  # | ===> or
  # ^\" ===> beginning quote
  # | ===> or
  # \"$ ===> ending quote
  # | ===> or
  # (?<=,)\" ===> quote just after a comma
  # \"(?=,) ===> quote just before a comma
  # (?<=\\\\)\" ===> escaped quote
  # Show where the invalid, unescaped quotes are (puts adds the newline)
  puts orgText
  puts resText.gsub(Regexp.new('"'), '^')
  # Determine whether there are any unescaped quotes.
  # This is the actual matching; use this one if you don't need to know which quote is unescaped.
  # In a regex literal a single backslash is written \\ (not \\\\ as in the quoted strings above).
  isMatch = ((orgText =~ /^([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\)\")*$/) != 0).to_s
  # Basically, it matches from start to end (^...$) only if every character is valid
  puts orgText + ": " + isMatch
  # Blank line between cases
  puts ""
}
When executed the code prints:
"asdf","asdf"
-------------
"asdf","asdf": false
"","asdf"
---------
"","asdf": false
"asdf",""
---------
"asdf","": false
"adsf","","asdf"
----------------
"adsf","","asdf": false
"asdf""asdf","asdf"
-----^^------------
"asdf""asdf","asdf": true
"asdf","""asdf"""
--------^^----^^-
"asdf","""asdf""": true
"asdf",""""
--------^^-
"asdf","""": true
I hope this gives you some ideas that you can use with other engines and languages.
".*"(\n|(".*",)*)
should work, I guess...
For single-line matches:
^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"
or for multi-line:
(^|\r\n)("[^\r\n"]*"\s*,\s*)*"[^\r\n"]*""[^\r\n"]*"
Edit/Note: Depending on the regex engine used, you could use lookbehinds and other stuff to make the regex leaner. But this should work in most regex engines just fine.
Try this regular expression:
"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"
That will match any quoted string with at least one pair of unescaped double quotes.