How can I disambiguate strings and regexps in Elisp custom declarations?

Consider the following Emacs Lisp code, which defines a customizable variable that can either be a literal string or a regular expression:
(defcustom myvar "" "String or regexp"
  :type '(choice (string :tag "String")
                 (regexp :tag "Regexp")))
This works just fine in the Custom interface (customize-variable 'myvar), but it then becomes impossible to tell whether the variable was set to a string or a regular expression. Even the Custom interface thinks it's a string no matter what. If you set the variable to a regexp using Custom and then close and reopen the Custom buffer for the variable, it will once again say it's a string.
So, is there any way to disambiguate this, to ensure that when the variable is set to a regexp through Custom, my code can determine that it's meant to be a regexp and not a simple string?
Ideally, I would like some sort of mechanism to have the string be stored internally as (cons 'string VALUE), where VALUE is the string that the user types in, and similarly have the regexp stored internally as (cons 'regexp VALUE).
Edit
From my searching, I've found the :value-to-internal and :value-to-external properties that you can supply to define-widget, but I can't figure out how to use them in a way that doesn't cause an error when I try to customize the resulting variable.

Rephrasing my answer from help-gnu-emacs --
Do what you suggested: use a cons instead of a string.
You need some way to programmatically distinguish arbitrary text from text used as a regexp. You know that intention at customize time: the user chooses one or the other.
You need to make sure that the user choice results in different (distinguishable) values. A cons recording (a) the text and (b) the type/choice/use is a good way to do that.

Related

TCL: check if variable is list

set var1 A
set var2 {A}
Is it possible to check whether a variable is a list in Tcl? For both var1 and var2, llength gives 1. I think these two variables are considered the same: both are lists with one element. Am I right?
Those two things are considered to be entirely identical, and will produce identical bytecode (except for any byte offsets used for indicating where the content of constants is located, which is not information normally exposed to scripts at all so you can ignore it, plus the obvious differences due to variable names). Semantically, braces are a quoting mechanism and not an indicator of a list (or a script, or …).
You need to write your code to not assume that it can look things up by inspecting the type of a value. The type of 123 could be many different things, such as an integer, a list (of length 1), a unicode string or a command name. Tcl's semantics are based on you not asking what the type of a value is, but rather just using commands and having them coerce the values to the right type as required. Tcl's different to many other languages in this regard.
Because of this different approach, it's not easy to answer questions about this in general: the answers get too long with all the different possible cases to be considered in general yet most of it will be irrelevant to what you're really seeking to do. Ask about something specific though, and we'll be able to tell you much more easily.
You can try string is list $var1, but that will accept both of these forms. It returns false only for something that can't syntactically be interpreted as a list, e.g. because there is an unmatched brace, as in "aa { bb".

Splitting a mixed string/number argument list of a Lua function call in C++/Qt

I want to parse the argument list of a Lua function call in C++ using Qt (4.8) in order to avoid a dependency on the Lua interpreter. The comma-separated argument list can be assumed to consist only of string literals and numbers. Eventually the result should be available as a QStringList. The tricky part is coping with commas that are part of string arguments, as well as with the fact that string arguments may use single or double quotes. Until I get to a solution (using regular expressions) myself, somebody might already have dealt with this or a similar problem.
Example:
The argument list string
"Foo", "not 'bar'", 'a, b ,c', 42, 1e-8
should be transformed to a string list containing the items
Foo, not 'bar', a, b, c, 42 and 1e-8
(omitting the quotes per item to avoid confusion)
Not familiar with all the possibilities of your arguments, but the examples you mentioned get correctly matched with this: (?<=")[\w',-]*?(?=")|(?<=^'|\s').*(?='(?:,|$))|[\w-]+, as seen here: https://regex101.com/r/rX7fX7/3
The idea is that you write the "difficult" situations as alternations, preferably to the left, and the less difficult ones to the right. This way, the engine will first check whether a problem situation is present before trying to match whole words.
The current regex doesn't work correctly if single or double quotes appear in the middle of the arguments, but your examples didn't have such situations.
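For the argument lists described in the question (only quoted strings and numbers), a small hand-written scanner is one alternative to a single regex. The following is a minimal sketch under stated assumptions: splitLuaArgs and checkSplitLuaArgs are hypothetical names, std::string and std::vector stand in for QString and QStringList so that the example compiles without Qt, and escape sequences inside strings (such as \") are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: splits a Lua-style argument list into its items.
// Assumes only quoted strings and bare numbers; backslash escapes inside
// strings are NOT handled in this sketch.
std::vector<std::string> splitLuaArgs(const std::string& args)
{
    std::vector<std::string> items;
    std::string current;
    char quote = '\0';      // active quote character, or '\0' outside a string
    bool haveItem = false;  // true once the current argument has started

    for (char c : args) {
        if (quote != '\0') {
            if (c == quote)
                quote = '\0';       // closing quote
            else
                current += c;       // commas and the other quote kind are literal here
        } else if (c == '"' || c == '\'') {
            quote = c;              // opening quote
            haveItem = true;        // so that "" still yields an empty item
        } else if (c == ',') {
            if (haveItem)
                items.push_back(current);
            current.clear();
            haveItem = false;
        } else if (!std::isspace(static_cast<unsigned char>(c))) {
            current += c;           // part of a number
            haveItem = true;
        }
    }
    if (haveItem)
        items.push_back(current);
    return items;
}

// Usage check with the example from the question.
void checkSplitLuaArgs()
{
    const std::vector<std::string> items =
        splitLuaArgs("\"Foo\", \"not 'bar'\", 'a, b ,c', 42, 1e-8");
    assert(items.size() == 5);
    assert(items[1] == "not 'bar'");
    assert(items[2] == "a, b ,c");
}
```

Because the scanner remembers which quote character opened the current string, a comma or the other kind of quote inside a string is kept as ordinary text.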

How to store regex patterns, as regex objects or strings?

I have a class X, and I need to store a pattern that will later
be used for matching regular expressions. At this point I simply
have a member called 'patternRegex' as std::string. Would it not
be better if I store an object of type regex? Then the naming
would be just 'pattern' because from the type it will be clear
that it is regex. Are there any tradeoffs I should watch out for?
"Compilation" from a string to a regular expression finite state machine is costly in time. If you plan to use the regular expressions frequently, e.g. in loops, your code will be faster if you keep the regex objects instead of their string representations.
Regular expression strings get compiled before use. If you intend to use one regular expression more than once you may like to compile it first by instantiating a regex object.
It's better to store them as objects, because constructing a regex from a string involves parsing the string and building (implementation-defined) matching structures. So it is better to create a member field of type std::regex.
The other answers already mentioned that you should store a std::regex because it is faster when used multiple times. I think it's worth pointing out that there is another advantage, which holds even if the regex is used only once: it catches errors early.
In my code the string often comes from some configuration file and I'd like to know as soon as possible if it is a valid regular expression or not. When you store just the string, it'll only fail when first used which might be much harder to test.
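A minimal sketch of what the answers above recommend, assuming a class X like the one in the question. The member layout, the matches function, and the isValidPattern and checkX helpers are illustrative names, not taken from the original code.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical stand-in for class X from the question.
class X {
public:
    // Compiles the pattern once. An invalid pattern throws std::regex_error
    // here, at construction, rather than at first use.
    explicit X(const std::string& patternText) : pattern(patternText) {}

    bool matches(const std::string& text) const
    {
        return std::regex_search(text, pattern);  // reuses the compiled object
    }

private:
    std::regex pattern;
};

// Hypothetical helper: reports whether a configuration string compiles.
bool isValidPattern(const std::string& patternText)
{
    try {
        const X probe(patternText);
        (void)probe;
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}

// Usage check.
void checkX()
{
    const X digits("[0-9]+");
    assert(digits.matches("abc123"));
    assert(!digits.matches("abc"));
    assert(!isValidPattern("(unclosed"));
}
```

One tradeoff to keep in mind: std::regex offers no way to read the original pattern text back, so keep the string as well if it is needed for logging or error messages.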

String matching in Emacs Lisp: matching an arbitrary string

In emacs lisp I only know the functions string-match[-p], but I know no method for matching a literal string to a string.
E.g. assume that I have a string generated by some function and want to know if another string contains it. In many cases string-match-p will work fine, but when the generated string contains regexp syntax, it will result in unexpected behaviour, maybe even crash if the regular expression syntax contained is invalid (e.g. unbalanced quoted parentheses \(, \)).
Is there some function in Emacs Lisp that is similar to string-match-p but doesn't interpret regular expression syntax?
As regexp-matching is implemented in C I assume that matching the correct regexp is faster than some substring/string= loop; Is there some method to escape an arbitrary string into a regular expression that matches that string and only that string?
Are you looking for regexp-quote?
The docs say:
(regexp-quote STRING)
Return a regexp string which matches exactly STRING and nothing else.
And I don't know that your assumption in #2 is correct, string= should be faster...
Either use regexp-quote as recommended by @trey-jackson, or don't use strings at all.
Emacs is not optimized for string handling; it is optimized for buffers. So, if you manipulate text, you might find it faster to create a temporary buffer, insert your text there, and then use search-forward to find your fixed string (non-regexp) in that buffer.
Perhaps cl-mismatch, an analogue of the Common Lisp mismatch function? Example usage below:
(require 'cl-lib)
(cl-mismatch "abcd" "abcde")
;; 4
(cl-mismatch "abcd" "aabcd" :from-end t)
;; -1
(cl-mismatch "abcd" "aabcd" :start2 1)
;; nil
Ah, sorry, I didn't understand the question the first time. If you want to know whether the string is a substring of another string (may start at any index in the searched string), then you could use cl-search, again, an analogue of Common Lisp search function.
(cl-search "foo\\(bar" "---foo\\(bar")
;; 3

How to create regular expression to get all functions from code

I have a problem with my regular expression. I need to find all functions in a text. I have this regular expression: \w*\([^(]*\). It works fine as long as the text does not contain brackets without a function name. For example, for the string 'hello world () testFunction()' it returns () and testFunction(), but I need only testFunction(). I want to use it in my C# application to parse a string passed to my method. Can anybody help me?
Thanks!
Programming languages have a hierarchical structure, which means that they cannot be parsed by simple regular expressions in the general case. If you want to write correct code that always works, you need to use an LR-parser. If you simply want to apply a hack that will pick up most functions, use something like:
\w+\([^)]*\)
But keep in mind that this will fail in some cases. E.g. it cannot differentiate between a function definition (signature) and a function call, because it does not look at the context.
Try \w+\([^(]*\)
Here I have changed \w* to \w+. This means that the match will need to contain at least one word character.
Hope that helps
Change the * to + (if it exists in your regex implementation, otherwise do \w\w*). This will ensure that \w is matched one or more times (rather than the zero or more that you currently have).
It largely depends on the definition of "function name". For example, based on your description you only want to filter out the "empty" names, not find all valid names.
If your current solution is otherwise good enough and your only problem is these empty names, then try changing the * to a +, requiring at least one word character right before the bracket.
\w+\([^(]*\)
OR
\w\w*\([^(]*\)
Depending on your regexp application's syntax.
(\w+)\(
The regex groups would hold the function names without any parentheses; you can add them later if you want. I supposed you don't need the parameters.
If you do need the parameters then use:
\w+\(.*\)
for a greedy regex (it would match nested function calls)
or...
\w+\([^)]*\)
for a regex that stops at the first closing parenthesis. It does not handle nested function calls: for f(g(x)) it matches f(g(x) and leaves out the final parenthesis.
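The question targets C#, but the patterns discussed above are simple enough to try in any backtracking engine. The following is a minimal sketch using C++ std::regex with its default ECMAScript grammar; findAll and checkFindAll are hypothetical helper names, and results on other engines are assumed, not verified, to be the same for these patterns.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every match of `pattern` in `text`.
std::vector<std::string> findAll(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);  // ECMAScript grammar by default
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
        found.push_back(it->str());
    return found;
}

void checkFindAll()
{
    const std::string text = "hello world () testFunction()";

    // \w* also accepts an empty name, so the bare "()" is reported too.
    const std::vector<std::string> loose = findAll(text, R"(\w*\([^(]*\))");
    assert(loose.size() == 2);

    // \w+ requires at least one word character before the parenthesis.
    const std::vector<std::string> strict = findAll(text, R"(\w+\([^(]*\))");
    assert(strict.size() == 1 && strict[0] == "testFunction()");

    // Nested call: the negated class [^)]* stops at the first ')'.
    const std::vector<std::string> nested = findAll("f(g(x))", R"(\w+\([^)]*\))");
    assert(nested.size() == 1 && nested[0] == "f(g(x)");
}
```

The last check illustrates why none of these patterns is a substitute for a parser once calls can be nested.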