Extract table key-values from LUA code - regex

I have multiple strings from LUA code, each one with a LUA table item, something like:
atable['akeyofthetable'] = { 'name' = 'a name', 'thevalue' = 34, 'anotherkey' = 'something' }
The string might span multiple lines, for example:
atable['akeyofthetable'] = { 'name' = 'a name',
'thevalue' = 34,
"anotherkey" = 'something' }
How can I get some of the keys (e.g. only name and anotherkey in the above example) with their values as "re.match" objects in Python 3 from that string? Because this is taken from code, the existence of keys is not guaranteed, the quoting of keys and values (double or single quotes) may vary, even from key to key, and there may be empty values ('name' = '') or unquoted strings as values ('thevalue' = anonquotedstringasvalue). Even the order of the keys is not guaranteed. Splitting on commas (,) does not work because some string values contain commas (e.g. 'anotherkey' = 'my beloved, strange, value' or even 'anotherkey' = "my beloved, 'strange' = 34, value"). Keys may or may not be quoted (if the names are ASCII, they will probably not be quoted).
Is it possible to do this using one regex, or must I run a separate search for every key needed?

Code
If escaped quotes (\' or \") can appear within the string, replace the respective capture groups with a pattern such as '((?:[^'\\]|\\.)*)'.
['\"](?:name|anotherkey)['\"]\s*=\s*(?:'([^']*)'|\"([^\"]*)\")
Usage
import re
keys = [
"name",
"anotherkey"
]
r = r"['\"](" + "|".join([re.escape(key) for key in keys]) + r")['\"]\s*=\s*(?:'([^']*)'|\"([^\"]*)\")"
s = "atable['akeyofthetable'] = { 'name' = 'a name',\n\t 'thevalue' = 34, \n\t \"anotherkey\" = 'something' }"
print(re.findall(r, s))
Explanation
In the usage code, the key alternation in the second point below is built by joining the keys list.
['\"] Match any character in the set '"
(name|anotherkey) Capture the key into capture group 1
['\"] Match any character in the set '"
\s* Match any number of whitespace characters
= Match this literally
\s* Match any number of whitespace characters
(?:'([^']*)'|\"([^\"]*)\") Match either of the following
'([^']*)' Match ', followed by any character except ' any number of times, followed by '
\"([^\"]*)\" Match ", followed by any character except " any number of times, followed by "
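The question also mentions unquoted keys, unquoted values, and values that themselves contain text resembling a key/value pair. One way to cope with those, sketched here, is to tokenize every pair with a single pattern and filter for the wanted keys afterwards, so that quoted values are consumed whole and never rescanned. The names PAIR and extract are hypothetical, the rule for unquoted values (anything up to whitespace, a comma or a closing brace) is an assumption about the input, and escape sequences are returned undecoded.

```python
import re

# One pattern for any "key = value" pair. Keys and values may each be
# single-quoted, double-quoted, or bare (assumption: bare values end at
# whitespace, a comma, or a closing brace).
PAIR = re.compile(r"""
    (?:'(?P<k1>[^']*)'|"(?P<k2>[^"]*)"|(?P<k3>\w+))      # key
    \s*=\s*
    (?:'(?P<v1>(?:[^'\\]|\\.)*)'                         # 'value', escapes allowed
      |"(?P<v2>(?:[^"\\]|\\.)*)"                         # "value", escapes allowed
      |(?P<v3>[^\s,}]*))                                 # bare value
""", re.VERBOSE)

def extract(text, wanted):
    """Return {key: value} for every wanted key that appears in text."""
    found = {}
    for m in PAIR.finditer(text):
        key = next(k for k in m.group('k1', 'k2', 'k3') if k is not None)
        if key in wanted:
            found[key] = next(v for v in m.group('v1', 'v2', 'v3') if v is not None)
    return found

s = "atable['akeyofthetable'] = { 'name' = 'a name',\n\t 'thevalue' = 34, \n\t \"anotherkey\" = 'something' }"
print(extract(s, {'name', 'anotherkey'}))
```

Because finditer resumes after each complete pair, a fragment such as 'strange' = 34 inside a quoted value is not reported as a key of its own.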

Related

Matching key/value pairs with comments

For a JavaScript application, I'm trying to come up with a regex that will match key/value pairs in a string. It's working pretty well, but there is one last thing that I need to implement and I'm not sure how.
The syntax is very similar to what you'll find in a .env file. So key/value pairs look like KEY=value.
A few rules that I have already implemented:
The key
alphanumeric string.
can't be empty and can't be a number.
may contain an underscore
The value
can be string
may be surrounded by single or double quotes, or none at all.
Now I'm trying to add comments with # in there. It works, except when # is between the quotes. Any idea how to fix that? Thanks!
Here is my code sample:
// This is my regex
const regex = /^\s*(?![0-9_]*\s*=\s*([\W\w\s.]*)\s*$)[A-Z0-9_]+\s*=\s*(.*)?\s*(?<!#.*)/gi;
// Outputs [ "KEY=value " ] --> OK
const str = `KEY=value # Comment`;
console.log(str.match(regex));
// Outputs [ "KEY2=val" ] --> OK
const str2 = `KEY2=val#ue # Comment`;
console.log(str2.match(regex));
// Outputs [ "key3='value3' " ] --> OK
const str3 = `key3='value3' # Comment`;
console.log(str3.match(regex));
// Outputs [ "key_4='val" ] --> NOT OK
// Expecting [ "key_4='val#ue4' " ]
const str4 = `key_4='val#ue4' # Comment`;
console.log(str4.match(regex));
EDIT:
Here is another sample for testing:
# The following are matching
ONE = This is ONE
TWO=This is TWO
THREE="This is 'THREE'"
FOUR = "This is \"FOUR\""
fi_ve = 'This is \'FIVE\''
six='This is "SIX"'
NUMBER7="This is SEVEN" # Comment for SEVEN
number8="This is EIGHT"#Comment for EIGHT
NINE="This is #9"
TEN=This is #10
ELEVEN=
TWELVE=10
THIRTEEN=TRUE
FOURTEEN="true"
FIFTEEN=false
SIXTEEN='FALSE'
# The following are not matching(incl. empty line)
17="Is not valid because the key is a number"
="Is also not valid because the key is missing"
You may use
([A-Za-z_]\w*)[ \t]*=[ \t]*('[^'\\]*(?:\\.[^'\\]*)*'|"[^"\\]*(?:\\.[^"\\]*)*"|[^\r\n#]*)
See the regex demo
([A-Za-z_]\w*) - Group 1: a letter or underscore followed by 0 or more word chars (the key)
[ \t]*=[ \t]* - a = enclosed with 0 or more spaces or tabs
('[^'\\]*(?:\\.[^'\\]*)*'|"[^"\\]*(?:\\.[^"\\]*)*"|[^\r\n#]*) - Group 2:
'[^'\\]*(?:\\.[^'\\]*)*'| - a '...' like substring that may contain any string escape sequence, or
"[^"\\]*(?:\\.[^"\\]*)*"| - a "..." like substring that may contain any string escape sequence, or
[^\r\n#]* - 0 or more chars other than #, CR and LF
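To see what the two groups capture, here is a small sketch that runs the same pattern over a few of the sample lines. It is written in Python rather than JavaScript, the names PAIR, sample and pairs are hypothetical, and stripping trailing whitespace from unquoted values is an added assumption.

```python
import re

# The pattern from the answer, unchanged.
PAIR = re.compile(
    r"""([A-Za-z_]\w*)[ \t]*=[ \t]*('[^'\\]*(?:\\.[^'\\]*)*'|"[^"\\]*(?:\\.[^"\\]*)*"|[^\r\n#]*)"""
)

sample = "\n".join([
    'ONE = This is ONE',
    r'FOUR = "This is \"FOUR\""',
    'NUMBER7="This is SEVEN" # Comment for SEVEN',
    'NINE="This is #9"',
    "key_4='val#ue4' # Comment",
    'ELEVEN=',
    '17="Is not valid because the key is a number"',
])

# Group 1 is the key, group 2 the raw value (quotes included).
pairs = {m.group(1): m.group(2).strip() for m in PAIR.finditer(sample)}
for key, value in pairs.items():
    print(key, '->', value)
```

Quoted values keep their quotes and any # inside them, while the trailing comments and the line with the numeric key are left out.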

Replacing unknown number of named groups

I am working on such a pattern:
<type>"<prefix>"<format>"<suffix>";<neg_type>"<prefix>"<format>"<suffix>"
Here are two examples, with and without a prefix:
n"prefix"#,##0"suffix";-"prefix"#,##0"suffix"
n#,##0"suffix";-#,##0"suffix"
I wrote the following regex to capture my groups:
raw = r"(?P<type>^.)(?:\"(?P<prefix>[^\"]*)\"){0,1}(?P<format>[^\"]*)(?:\"(?P<suffix>[^\"]*)\"){0,1};(?P<negformat>.)(?:\"(?P=prefix)\"){0,1}(?P=format)(?:\"(?P=suffix)\"){0,1}"
Now I am parsing a large text that contains such structures, and I would like to replace the prefix or the suffix (only if they exist!). Because the number of captured groups is unknown (potentially none), I do not know how to make my replacements easily (with re.sub).
Additionally, due to an implementation constraint, I process prefix and suffix sequentially, so I do not get the suffix replacement at the same time as the prefix replacement, even if they belong to the same sentence.
First, we can simplify your regex by using single quotes for the string. That removes the necessity of escaping the " character. Second, {0,1} can be replaced by ?:
raw = r'(?P<type>^.)(?:"(?P<prefix>[^"]*)")?(?P<format>[^"]*)(?:"(?P<suffix>[^"]*)")?;(?P<negformat>.)(?:"(?P<prefix2>(?P=prefix))")?(?P=format)(?:"(?P<suffix2>(?P=suffix))")?'
Notice that I have added the (?P<prefix2>...) and (?P<suffix2>...) named groups above for the second occurrences of the prefix and suffix.
I am working on the assumption that your pattern may be repeated within the text (if the pattern only appears once, this code will still work). In that case, the character substitutions must be made from the last to first occurrence so that the start and last character offset information returned by the regex engine remains correct even after character substitutions are made. Similarly, when we find an occurrence of the pattern, we must first replace in the order suffix2, prefix2, suffix and prefix.
We use re.finditer to iterate through the text to return match objects and form these into a list, which we reverse so that we can process the last matches first:
import re
raw = r'(?P<type>^.)(?:"(?P<prefix>[^"]*)")?(?P<format>[^"]*)(?:"(?P<suffix>[^"]*)")?;(?P<negformat>.)(?:"(?P<prefix2>(?P=prefix))")?(?P=format)(?:"(?P<suffix2>(?P=suffix))")?'
s = """a"prefix"format"suffix";b"prefix"format"suffix"
x"prefix_2"format_2"suffix_2";y"prefix_2"format_2"suffix_2"
"""
new_string = s
matches = list(re.finditer(raw, s, flags=re.MULTILINE))
matches.reverse()
if matches:
    for match in matches:
        if match.group('suffix2'):
            new_string = new_string[0:match.start('suffix2')] + 'new_suffix' + new_string[match.end('suffix2'):]
        if match.group('prefix2'):
            new_string = new_string[0:match.start('prefix2')] + 'new_prefix' + new_string[match.end('prefix2'):]
        if match.group('suffix'):
            new_string = new_string[0:match.start('suffix')] + 'new_suffix' + new_string[match.end('suffix'):]
        if match.group('prefix'):
            new_string = new_string[0:match.start('prefix')] + 'new_prefix' + new_string[match.end('prefix'):]
print(new_string)
Prints:
a"new_prefix"format"new_suffix";b"new_prefix"format"new_suffix"
x"new_prefix"format_2"new_suffix";y"new_prefix"format_2"new_suffix"
The above code, for demo purposes, makes the same substitutions for each occurrence of the pattern.
As far as your second concern:
There is nothing preventing you from making two passes over the text, one to replace the prefixes and one to replace the suffixes, as these become known. Obviously, you would only be checking certain groups in each pass, but you could still use the same regex. And, of course, each occurrence of the pattern can have its own substitutions. The above code shows how to find and make the substitutions.
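As an alternative to offset arithmetic, each pass can also be written with re.sub and a replacement function that rebuilds the matched text from its named groups. This is only a sketch of that idea: replace_part and quoted are hypothetical helper names, and the rebuild assumes the part after the semicolon always repeats the same prefix and suffix as the part before it.

```python
import re

# Same structure as the pattern above, without the prefix2/suffix2 groups.
RAW = re.compile(
    r'(?P<type>^.)(?:"(?P<prefix>[^"]*)")?(?P<format>[^"]*)(?:"(?P<suffix>[^"]*)")?'
    r';(?P<negformat>.)(?:"(?P=prefix)")?(?P=format)(?:"(?P=suffix)")?',
    flags=re.MULTILINE,
)

def quoted(value):
    """Render an optional part: nothing if absent, otherwise in double quotes."""
    return '' if value is None else '"' + value + '"'

def replace_part(text, group, new_value):
    """Replace one optional named group ('prefix' or 'suffix') wherever it exists."""
    def rebuild(m):
        parts = m.groupdict()
        if parts[group] is not None:      # leave absent parts absent
            parts[group] = new_value
        body = quoted(parts['prefix']) + parts['format'] + quoted(parts['suffix'])
        return parts['type'] + body + ';' + parts['negformat'] + body
    return RAW.sub(rebuild, text)

text = 'n"prefix"#,##0"suffix";-"prefix"#,##0"suffix"\nn#,##0"suffix";-#,##0"suffix"\n'
step1 = replace_part(text, 'prefix', 'new_prefix')    # first pass: prefixes only
step2 = replace_part(step1, 'suffix', 'new_suffix')   # second pass: suffixes only
print(step2)
```

Since each pass returns a complete new string, the prefix and suffix replacements can be done at different times without tracking character offsets.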
To allow 0 to 9 instances of the prefix:
import re
raw = r'(?P<type>^.)(?:"(?P<prefix>[^"]*)")?(?P<format>[^"]*)(?:"(?P<suffix>[^"]*)")?;(?P<negformat>.)(?P<prefix2>(?:"(?P=prefix)"){0,9})(?P=format)(?:"(?P<suffix2>(?P=suffix))")?'
s = """a"prefix"format"suffix";b"prefix""prefix""prefix"format"suffix"
x"prefix_2"format_2"suffix_2";y"prefix_2"format_2"suffix_2"
"""
new_string = s
matches = list(re.finditer(raw, s, flags=re.MULTILINE))
matches.reverse()
if matches:
    for match in matches:
        if match.group('suffix2'):
            new_string = new_string[0:match.start('suffix2')] + 'new_suffix' + new_string[match.end('suffix2'):]
        if match.group('prefix2'):
            start = match.start('prefix2')
            end = match.end('prefix2')
            repl = s[start:end]
            n = repl.count('"') // 2
            new_string = new_string[0:start] + (n * '"new_prefix"') + new_string[end:]
        if match.group('suffix'):
            new_string = new_string[0:match.start('suffix')] + 'new_suffix' + new_string[match.end('suffix'):]
        if match.group('prefix'):
            new_string = new_string[0:match.start('prefix')] + 'new_prefix' + new_string[match.end('prefix'):]
print(new_string)
Prints:
a"new_prefix"format"new_suffix";b"new_prefix""new_prefix""new_prefix"format"new_suffix"
x"new_prefix"format_2"new_suffix";y"new_prefix"format_2"new_suffix"

Regular Expression: is it possible to get numbers in optional parts by one regex

I have one string, it will be like: 1A2B3C or 2B3C or 1A2B or 1A3C.
The string is composed of several optional parts, each a number followed by one of [A|B|C].
Is it possible to get the number before every character with one regex?
For example:
1A2B3C => (1, 2, 3)
1A3C => (1, 0, 3) There is no 'B', so gives 0 instead.
=> Or just (1, 3) but should show that the 3 is in front of 'C'.
Assuming Python because of your tuple notation, and because that's what I feel like using.
If the only allowed letters are A, B and C, you can do it with an extra processing step:
import re
pattern = re.compile(r'(?:(\d+)A)?(?:(\d+)B)?(?:(\d+)C)?')
match = pattern.fullmatch(some_string)
if match:
    result = tuple(int(g) for g in match.groups('0'))
else:
    raise ValueError('Bad input string')
Each option is surrounded by a non-capturing group (?:...) so the whole thing gets treated as a unit. Inside the unit, there is a capturing group (\d+) to capture the number, and an uncaptured character.
The method Match.groups returns a tuple of all the groups in the regex, with unmatched ones replaced by the default you pass, here '0'. The generator expression then converts each one to an int. You could equally use tuple(map(int, match.groups('0'))).
You can also use a dictionary to hold the numbers, keyed by character:
pattern = re.compile(r'(?:(?P<A>\d+)A)?(?:(?P<B>\d+)B)?(?:(?P<C>\d+)C)?')
match = pattern.fullmatch(some_string)
if match:
    result = {k: int(v) for k, v in match.groupdict('0').items()}
else:
    raise ValueError('Bad input string')
Match.groupdict is just like groups except that it returns a dictionary of the named groups: capture groups marked (?P<NAME>...).
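As a quick check of how the default argument fills in missing parts, the sketch below runs the four sample strings from the question through both methods. It assumes every part, including the A part, is optional, and it rejects the empty string explicitly, since a pattern made only of optional groups would otherwise accept it. The names PARTS, parse and parse_dict are hypothetical.

```python
import re

# Every part is optional; named groups make groupdict() usable as well.
PARTS = re.compile(r'(?:(?P<A>\d+)A)?(?:(?P<B>\d+)B)?(?:(?P<C>\d+)C)?')

def parse(text):
    """Return the numbers before A, B and C as a tuple, 0 where a part is missing."""
    match = PARTS.fullmatch(text)
    if not text or not match:
        raise ValueError('Bad input string')
    return tuple(int(g) for g in match.groups('0'))

def parse_dict(text):
    """Same as parse(), but keyed by letter."""
    match = PARTS.fullmatch(text)
    if not text or not match:
        raise ValueError('Bad input string')
    return {k: int(v) for k, v in match.groupdict('0').items()}

for sample in ('1A2B3C', '2B3C', '1A2B', '1A3C'):
    print(sample, parse(sample), parse_dict(sample))
```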
Finally, if you don't mind having the dictionary, you can adapt this approach to parse any number of groups with arbitrary characters:
pattern = re.compile(r'(\d+)([A-Z])')
result = {}
while some_string:
    match = pattern.match(some_string)
    if not match:
        raise ValueError('Bad input string')
    result[match.group(2)] = int(match.group(1))
    some_string = some_string[match.end():]

Matching multiple quoted strings in a single line with regex

I want to match quoted strings of the form 'a string' within a line. My issue comes with the fact that I may have multiple strings like this in a single line. Something like
result = functionCall('Hello', 5, 'World')
I can search for phrases bounded by quotes with ['].*['], and that picks up quoted strings just fine if there is a single one in a line. But with the above example it would find 'Hello', ', 5, ' and 'World', when I only actually want 'Hello' and 'World'. Obviously I need some way of knowing how many ' precede the currently found ' and not try to match when there is an odd number.
Just to note, in my case strings are only defined using ', never ".
You should use [^']+ between the quotes:
var myString = "result = functionCall('Hello', 5, 'World')";
var parts = myString.match(/'[^']+'/g);
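The same negated character class works unchanged in other regex engines. As a sketch, here is the equivalent in Python with re.findall, where the variable names are hypothetical. Swapping + for * is an optional variation that would also pick up empty strings such as ''.

```python
import re

my_string = "result = functionCall('Hello', 5, 'World')"

# [^']+ cannot cross a quote, so each match stops at the nearest closing quote.
parts = re.findall(r"'[^']+'", my_string)
print(parts)

# Variation: * instead of + also accepts the empty string ''.
with_empty = re.findall(r"'[^']*'", "f('', 'x')")
print(with_empty)
```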

String separation in required format, Pythonic way? (with or w/o Regex)

I have a string in the format:
t='#abc #def Hello this part is text'
I want to get this:
l=["abc", "def"]
s='Hello this part is text'
I did this:
s=t[t.find(' ',t.rfind('#')):].strip()
a=t[:t.find(' ',t.rfind('#'))].strip()
b=a.split('#')
l=[i.strip() for i in b][1:]
It works for the most part, but it fails when the text part has the '#'.
Eg, when:
t='#abc #def My email is red#hjk.com'
it fails. The #names are there in the beginning and there can be text after #names, which may possibly contain #.
Clearly I can prepend a space and find the first word that does not start with '#'. But that doesn't seem an elegant solution.
What is a pythonic way of solving this?
Building unashamedly on MrTopf's effort:
import re
rx = re.compile(r"((?:#\w+ +)+)(.*)")
t='#abc #def #xyz Hello this part is text and my email is foo#ba.r'
a,s = rx.match(t).groups()
l = re.split('[# ]+',a)[1:-1]
print(l)
print(s)
prints:
['abc', 'def', 'xyz']
Hello this part is text and my email is foo#ba.r
Justly called to account by hasen j, let me clarify how this works:
/#\w+ +/
matches a single tag - # followed by at least one alphanumeric or _ followed by at least one space character. + is greedy, so if there is more than one space, it will grab them all.
To match any number of these tags, we need to add a plus (one or more things) to the pattern for tag; so we need to group it with parentheses:
/(#\w+ +)+/
which matches one-or-more tags, and, being greedy, matches all of them. However, those parentheses now fiddle around with our capture groups, so we undo that by making them into an anonymous group:
/(?:#\w+ +)+/
Finally, we make that into a capture group and add another to sweep up the rest:
/((?:#\w+ +)+)(.*)/
A last breakdown to sum up:
((?:#\w+ +)+)(.*)
(?:#\w+ +)+
( #\w+ +)
#\w+ +
Note that in reviewing this, I've improved it - \w didn't need to be in a set, and it now allows for multiple spaces between tags. Thanks, hasen-j!
t='#abc #def Hello this part is text'
words = t.split(' ')
names = []
while words:
    w = words.pop(0)
    if w.startswith('#'):
        names.append(w[1:])
    else:
        # put the first non-tag word back so it stays part of the text
        words.insert(0, w)
        break
text = ' '.join(words)
print(names)
print(text)
How about this:
1. Split by space.
2. For each word, check:
2.1. if the word starts with #, push it to the first list;
2.2. otherwise, just join the remaining words with spaces.
You might also use regular expressions:
import re
rx = re.compile(r"#([\w]+) #([\w]+) (.*)")
t='#abc #def Hello this part is text and my email is foo#ba.r'
a,b,s = rx.match(t).groups()
But this all depends on what your data can look like, so you might need to adjust it. What it does is basically create groups via () and check for what's allowed in them.
[i.strip('#') for i in t.split(' ', 2)[:2]] # for a fixed number of #def
a = [i.strip('#') for i in t.split(' ') if i.startswith('#')]
s = ' '.join(i for i in t.split(' ') if not i.startswith('#'))
[edit: this is implementing what was suggested by Osama above]
This will create L based on the # variables from the beginning of the string, and then once a non # var is found, just grab the rest of the string.
t = '#one #two #three some text afterward with # symbols# meow#meow'
words = t.split(' ') # split into list of words based on spaces
L = []
s = ''
for i in range(len(words)): # go through each word
    word = words[i]
    if word[0] == '#': # grab #'s from beginning of string
        L.append(word[1:])
        continue
    s = ' '.join(words[i:]) # put spaces back in
    break # you can ignore the rest of the words
You can refactor this to be less code, but I'm trying to make what is going on obvious.
Here's just another variation that uses split() and no regexps:
t='#abc #def My email is red#hjk.com'
tags = []
words = iter(t.split())
# iterate over words until first non-tag word
for w in words:
    if not w.startswith("#"):
        # join this word and all the following
        s = w + " " + (" ".join(words))
        break
    tags.append(w[1:])
else:
    s = "" # handle string with only tags
print(tags, s)
Here's a shorter but perhaps a bit cryptic version that uses a regexp to find the first space followed by a non-# character:
import re
t = '#abc #def My email is red#hjk.com #extra bye'
m = re.search(r"\s([^#].*)$", t)
tags = [tag[1:] for tag in t[:m.start()].split()]
s = m.group(1)
print(tags, s) # ['abc', 'def'] My email is red#hjk.com #extra bye
This doesn't work properly if there are no tags or no text. The format is underspecified. You'll need to provide more test cases to validate.
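One possible way to cover the no-tags and no-text cases is sketched below. It assumes that a tag is # followed by word characters, that tags only count at the very start of the string, and that a tag is followed by whitespace or the end of the string. The names TAGS_THEN_TEXT and split_tags are hypothetical.

```python
import re

# Group 1: zero or more leading tags, each followed by whitespace or the end
# of the string. Group 2: everything that is left, possibly empty.
TAGS_THEN_TEXT = re.compile(r'((?:#\w+\s+|#\w+$)*)(.*)', re.DOTALL)

def split_tags(t):
    """Return (list of leading tag names, remaining text)."""
    head, text = TAGS_THEN_TEXT.match(t).groups()
    return re.findall(r'#(\w+)', head), text

for t in ('#abc #def My email is red#hjk.com #extra bye',
          '#abc #def',
          'no tags here'):
    print(split_tags(t))
```

Because both groups may be empty, the pattern always matches, so there is no need to check the match object for None.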