Parsing a game's console commands? - regex

I need to be able to handle data that can look like:
set setting1 "bind button_x +actionslot1;bind button_y \" bind button_x +stance \" "
bind button_a jump
set setting2 1 1 0 1
toggle setting_3 " \"value 1\" \"value 2\" \"value 3\" "
These are what some of the commands for the console of a game look like, and I'm trying to write an emulator of sorts that will interpret the code the same way the game will.
The first thing that comes to mind is regex, but I'm not sure it's the best option. For example, when matching for the value of a setting, I might trying something like /set [\w_]+ "?(.+)"?/, but the wildcard matches the ending quote because it's not lazy, but if I make it lazy, it matches the quote inside the value. If I make it greedy and stop it from matching the quotes, it won't match the escaped quotes in the values.
Even if there are possible regex solutions, they seem like the wrong option. I had asked before about how programs like Visual Studio and Notepad++ know which parentheses and curly braces matched, and I was told there was something similar to regex in some ways but much more powerful.
The only other thing I can think of is to go through the lines of code character by character and use booleans to determine that state of the current character.
What are my options here? What do game developers use to handle console commands?
edit: Here's another possible command which strongly deters me from using regex:
set setting4 "bind button_a \" bind button_b "\" set setting1 0 \" " \" "
The commands include not just escaped quotes, but quotes of the manner "\" inside escaped quotes.

I would suggest you read about Lexical Analysis
, this is the process of tokenizing your text using a grammar.
I think it will help you with what you are trying to do.

I don't want to keep you on the path of regex -- you are correct that there are non-regex solutions that may be more appropriate (I just don't know what they are). However, here is one possible regex that should fix your quotes issue:
/set [\w_]+ "?((\\"|[^"])+)"?/
I changed .+ to (\\"|[^"])+. Basically it's matching occurrences of \" OR of anything that isn't a quote. In other words, it will will match anything except quotes that aren't escaped.
Again, if someone can suggest a more sophisticated non-regex solution, you should strongly consider it.
Edit: The updated example you've provided breaks this solution, and I think it would break any regex solution.
Edit 2: Here is a C# string version of your regex. It uses # to tell the compiler to treat the string as a verbatim literal, which means it ignores \ as an escape character. The only caveat is that in order to represent " in a verbatim literal you have to type it as "", but it's still better than having slashes everywhere. Given the prevalence of escape sequences in regexes, I recommend using verbatim literals anywhere that you have to type a regex in a string.
string pattern = #"set [\w_]+ ""?((\\""|[^""])+)""?"

Related

How to do a negative lookbehind within a %r<…>-delimited regexp in Ruby?

I like the %r<…> delimiters because it makes it really easy to spot the beginning and end of the regex, and I don't have to escape any /. But it seems that they have an insurmountable limitation that other delimiters don't have?
Every other delimiter imaginable works fine:
/(?<!foo)/
%r{(?<!foo)}
%r[(?<!foo)]
%r|(?<!foo)|
%r/(?<!foo)/
But when I try to do this:
%r<(?<!foo)>
it gives this syntax error:
unterminated regexp meets end of file
Okay, it probably doesn't like that it's not a balanced pair, but how do you escape it such that it does like it?
Does something need to be escaped?
According to wikibooks.org:
Any single non-alpha-numeric character can be used as the delimiter,
%[including these], %?or these?, %~or even these things~.
By using this notation, the usual string delimiters " and ' can appear
in the string unescaped, but of course the new delimiter you've chosen
does need to be escaped.
Indeed, escaping is needed in these examples:
%r!(?<\!foo)!
%r?(\?<!foo)?
But if that were the only problem, then I should be able to escape it like this and have it work:
%r<(?\<!foo)>
But that yields this error:
undefined group option: /(?\<!foo)/
So maybe escaping is not needed/allowed? wikibooks.org does list %<pointy brackets> as one of the exceptions:
However, if you use
%(parentheses), %[square brackets], %{curly brackets} or
%<pointy brackets> as delimiters then those same delimiters
can appear unescaped in the string as long as they are in balanced
pairs
Is it a problem with balanced pairs?
Balanced pairs are no problem as long as you are doing something in the Regexp that requires them, like...
%r{(?<!foo{1})} # repetition quantifier
%r[(?<![foo])] # character class
%r<(?<name>foo)> # named capture group
But what if you need to insert a left-side delimiter ({, [, or <) inside the regex? Just escape it, right? Ruby seems to have no problem with escaped unbalanced delimiters most of the time...
%r{(?<!foo\{)}
%r[(?<!\[foo)]
%r<\<foo>
It's just when you try to do it in the middle of the "group options" (which I guess is what the <! characters are classified as here) following a (? that it doesn't like it:
%r<(?\<!foo)>
# undefined group option: /(?\<!foo)/
So how do you do that then and make Ruby happy? (without changing the delimiters)
Conclusion
The workaround is easy. I'll just change this particular regex to just use something else instead like %r{…} instead.
But the questions remain...
Is there really no way to escape the < here?
Are there really some regular expression that are simply impossible to write using certain delimiters like %r<…>?
Is %r<…> the only regular expression delimiter pair that has this problem (where some regular expressions are impossible to write when using it). If you know of a similar example with %r{…}/%r[…], do share!
Version info
Not that it probably matters since this syntax probably hasn't changed, but I'm using:
⟫ ruby -v
ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux]
Reference:
https://ruby-doc.org/core-2.6.3/Regexp.html
% Notation
As others have mentioned, seems like an oversight based on how this character differs from other paired boundaries.
As far as "Is there really no way to escape the < here?" there is a way... but you're not going to like it:
%r<(?#{'<'}!foo)> == %r((?<!foo))
Using interpolation to insert the < character seems to work. But given that there are much better options, I would avoid it unless you were planning on splitting the regex into sections anyway...

Basic Vim - Search and Replace text bounded by specific characters

Say I wanted to replace :
"Christoph Waltz" = "That's a Bingo";
"Gary Coleman" = "What are you talking about, dear Willis?";
to just have :
"Christoph Waltz"
"Gary Coleman"
i.e. I want to remove all the characters including and after the = and the ;
I thought the regex for finding the pattern would be \=.*?\;. In vim, I tried :
:%s/\=.*?\;$//g
but it gave me an Invalid Command error and Nothing after \=. How do I remove the above text? Apologies, but I'm new to this.
Vim's regular expression dialect is different; its escaping is optimized for text searches. See :help perl-patterns for a comparison with Perl regular expressions. As #EvergreenTree has noted, you can influence the escaping behavior with special atoms; cp. :help /\v.
For your example, the non-greedy match is .\{-}, not .*?, and, as mentioned, you mustn't escape several literal characters:
:%s/ =.\{-};$//
(The /g flag is superfluous, too; there can be only one match anchored to the end via $.)
This is because of vim's weird handling of regexes by default. Instead of \= interpreting as a literal =, it interprets it as a special regex character. To make vim's regex system work more normally, you can prefix it with \v, which is "very magic" mode. This should do the trick:
%s/\v\=.*\;$//g
I won't explain how vim enterprets every single character in very magic mode here, but you can read about it in this help topic:
:help /magic

How many backslashes are required to escape regexps in emacs' "Customize" mode?

I'm trying to use emacs' customize-group packages to tweak some parts of my setup, and I'm stymied. I see things like this in my .emacs file after I make changes with customize:
'(tramp-backup-directory-alist (quote (("\\\\`.*\\\\'" . "~/.emacs.d/autobackups"))))
This was the result of putting the following into the customize text field:
Regexp matching filename: \\`.*\\'
This is a representative sample: I'm actually trying to change several things that want a regexp, and they all show this same problem. How many layers of quoting are there, really? I can't seem to find the magic number of backslashes to get the gosh-dang thing to do what I'm asking it to, even for the simplest regular expressions like .*. Right now, the given customization produces - nothing. It makes no change from emacs' default behavior.
Better yet, where on earth is this documented? It's a little difficult to Google for, but I've been trying quite a few things there as well as in the official documentation and the Emacs wiki. Where is an authoritative source for how many dang backslashes one needs to make a regular expression in customize-mode actually work - or at the very least, to fail with some kind of warning instead of failing silently?
EDIT: As so often happens with questions asked in anger, I was asking the wrong question. Fortunately the answers below, led me to the answer to the question that I needed, which was about quoting rules. I'm going to try to write down what I learned here, because I find the documentation and Googleable resources to be maddeningly obscure about this. So here are the quoting rules I found by trial and error, and I hope that they help someone else, inspire correction, or both.
When an emacs customize-mode buffer asks you for a "Regexp matching filename", it is being, as emacs often is, both terse and idiosyncratic (how often the creator's personality is imparted to the creation!). It means, for one thing, a regexp that will be compared to the whole path of the file in search of a match, not just to the name of the file itself as you might assume from the term "filename". This is the same sense of "filename" used in emacs' buffer-file-name function, for example.
Further, although if you put foo in the field, you'll see "foo" (with double-quotes) written to the actual file, that's not enough quoting and not the right quoting. You need to quote your regexp with the quoting style that, as far as I can tell, only emacs uses: the ``backtick-foo-single-quote'`scheme. And then you need to escape that, making it \`backslash-backtick-foo-backslash-single-quote\' (and if you think that's a headache to input in Markdown, it's more so in emacs).
On top of this, emacs appears to have a rule that the . regexp special character does not match a / at the beginning of filenames, so, as was happening to me above, the classic .* pattern will appear to match nothing: to match "all files", you actually need the regexp /.*, which then you stuff into the quote format of customize-mode to produce \`/.*\', after which customize paints another layer of escaping onto it and writes it to the customization file.
The final result for one of my efforts - a setting such that #autosave# files don't gunk up the directory you're working in, but instead all live in one place:
(custom-set variables
'(auto-save-file-name-transforms (quote (
("\\`/[^/]*:\\([^/]*/\\)*\\([^/]*\\)\\'" "~/.emacs.d/autobackups/\\2" t)
("\\`/.*/\\(.*?\\)\\'" "~/.emacs.d/autobackups/\\1" t)
))))
Backslashes in elisp are a far greater threat to your sanity than parentheses.
EDIT 2: Time for me to be wrong again. I finally found the relevant documentation (through reading another Stack Overflow question, of course!): Regexp Backslash Constructs. The crucial point of confusion for me: the backtick and single quote are not quoting in this context: they're the equivalent of perl's ^ and $ special characters. The backslash-backtick construct matches an empty string anchored at the beginning of the string being checked for a match, and the backslash-single-quote construct matches the empty string at the end of the string-under-consideration. And by "string under consideration," I mean "buffer, which just happens to contain only a file path in this case, but you need to match the whole dang thing if you want a match at all, since this is elisp's global regexp behavior."
Swear to god, it's like dealing with an alien civilization.
EDIT 3: In order to avoid confusing future readers -
\` is the emacs regex for "the beginning of the buffer." (cf Perl's \A)
\' is the emacs regex for "the end of the buffer." (cf Perl's \Z)
^ is the common-idiom regex for "the beginning of the line." It can be used in emacs.
$ is the common-idiom regex for "the end of the line." It can be used in emacs.
Because regex searches across multi-line bodies of text are more common in emacs than elsewhere (e.g. M-x occur), the backtick and single-quote special characters are used in emacs, and as best as I can tell, they're used in the context of customize-mode because if you are considering generic unknown input to a customize-mode field, it could contain newlines, and therefore you want to use the beginning-of-buffer and end-of-buffer special characters because the beginning and end of the input are not guaranteed to be the beginning and end of a line.
I am not sure whether to regret hijacking my own Stack Overflow question and essentially turning it into a blog post.
In the customize field, you'd enter the regexp according to the syntax described here. When customize writes the regexp into a string, any backslashes or double-quote chars in the regexp will be escaped, as per regular string escaping conventions.
So in short, just enter single backslashes in the regexp field, and they'll get correctly doubled up in the resulting custom-set-variables clause written to your .emacs.
Also: since your regexp is for matching filenames, you might try opening up a directory containing files you'd like to match, and then run M-x re-builder RET. You can then enter the regexp in string-escaped format to confirm that it matches those files. By typing % m in a dired buffer, you can enter a regexp in unescaped format (ie. just like in the customize field), and dired will mark matching filenames.

How to search (using regex) for a regex literal in text?

I just stumbled on a case where I had to remove quotes surrounding a specific regex pattern in a file, and the immediate conclusion I came to was to use vim's search and replace util and just escape each special character in the original and replacement patterns.
This worked (after a little tinkering), but it left me wondering if there is a better way to do these sorts of things.
The original regex (quoted): '/^\//' to be replaced with /^\//
And the search/replace pattern I used:
s/'\/\^\\\/\/'/\/\^\\\/\//g
Thanks!
You can use almost any character as the regex delimiter. This will save you from having to escape forward slashes. You can also use groups to extract the regex and avoid re-typing it. For example, try this:
:s#'\(\\^\\//\)'#\1#
I do not know if this will work for your case, because the example you listed and the regex you gave do not match up. (The regex you listed will match '/^\//', not '\^\//'. Mine will match the latter. Adjust as necessary.)
Could you avoid using regex entirely by using a nice simple string search and replace?
Please check whether this works for you - define the line number before this substitute-expression or place the cursor onto it:
:s:'\(.*\)':\1:
I used vim 7.1 for this. Of course, you can visually mark an area before (onto which this expression shall be executed (use "v" or "V" and move the cursor accordingly)).

Can some please explain this elisp regexp

Can some please explain the following regexp, which I found in ediff-trees.el as a specification for which files/directories to exclude from its comparison process.
"\\`\\(\\.?#.*\\|.*,v\\|.*~\\|\\.svn\\|CVS\\|_darcs\\)\\'"
Although I am somewhat familiar with regular expressions encountering this elisp string-based variant has thrown me off.
First thing, remember that elisp's regexes have to be string-escaped, which created a lot of extra backslashes. Removing them, we get
\`\(\.?#.*\|.*,v\|.*~\|\.svn\|CVS\|_darcs\)\'
Then, \( and \) mean grouping, "foo\|bar" means "either foo, or bar".
So, piece by piece, this regexp matches: either an emacs temporary file (something starting with #, possibly preceded by a period: .?#.), or an RCS file (ending in ,v: .,v), or an emacs backup file (ending in ~: .*~), or an svn directory (.svn), or a cvs directory (CVS), or a darcs directory (_darcs).
Edit to correct: as andre-r correctly points out, the backtick \` and single quote \' basically mean "beginning and end of the string" (respectively). So this means that the regexp finds strings which match exactly one of the choices I've outlined above (i.e., the string starts, then comes one of those choices, then the string ends). I had previously said they meant quoting, I don't know what I was thinking :). Thanks andre-r!
Sorry, this isn't really an answer; it's merely a comment to rbp's answer. But I can't figure out how to get the code sample to render nicely inside a comment, whereas it looks fine here in this answer.
Anyway:
I dunno about you, but I find
(rx bos (group (or (and (zero-or-one ".") "#" (zero-or-more nonl))
(and (zero-or-more nonl) ",v" )
(and (zero-or-more nonl) "~" )
".svn"
"CVS"
"_darcs"
))
eos)
a lot easier to read -- and it's exactly equivalent.
Parentheses in elisp regexes need to be escaped. Backslashes in strings need to be escaped, so you end up with \\( and \\) when any sensible regex parser would just use ( and ). Don't get me wrong, I love Emacs, but having to escape parentheses in a regex was a really bad idea. The pipes and periods and backticks are also being escaped - that's why you've got this hell of double backslashes. Strip out those and you get (in regex literal form):
`(.?#.*|.*,v|.*~|\.svn|CVS|_darcs)'
See this question for more discussion on the subject of escaped parens in elisp.