String matching in Emacs Lisp: matching an arbitrary string - regex

In Emacs Lisp I only know the functions string-match[-p], and I know of no way to match a literal string against a string.
E.g. assume that I have a string generated by some function and want to know if another string contains it. In many cases string-match-p will work fine, but when the generated string contains regexp syntax, the result is unexpected behaviour, or even an error if the regular expression syntax it contains is invalid (e.g. unbalanced quoted parentheses \(, \)).
Is there some function in Emacs Lisp that is similar to string-match-p but doesn't interpret regular expression syntax?
As regexp matching is implemented in C, I assume that matching the correct regexp is faster than some substring/string= loop. Is there some method to escape an arbitrary string into a regular expression that matches that string and only that string?

Are you looking for regexp-quote?
The docs say:
(regexp-quote STRING)
Return a regexp string which matches exactly STRING and nothing else.
Also, I'm not sure that your assumption in #2 is correct; string= should be faster...

Either use regexp-quote as recommended by @trey-jackson, or don't use strings at all.
Emacs is not optimized for string handling; it is optimized for buffers. So, if you manipulate text, you might find it faster to create a temporary buffer, insert your text there, and then use search-forward to find your fixed string (non-regexp) in that buffer.

Perhaps cl-mismatch, an analogue of the Common Lisp mismatch function? Example usage below:
(require 'cl-lib)
(cl-mismatch "abcd" "abcde")
;; 4
(cl-mismatch "abcd" "aabcd" :from-end t)
;; -1
(cl-mismatch "abcd" "aabcd" :start2 1)
;; nil
Ah, sorry, I didn't understand the question the first time. If you want to know whether one string is a substring of another (starting at any index in the searched string), then you could use cl-search, an analogue of the Common Lisp search function.
(cl-search "foo\\(bar" "---foo\\(bar")
;; 3

Related

Intellij: Regular Expression failed to match - produced stack overflow when matching content of the file [duplicate]

This is my Regex
((?:(?:'[^']*')|[^;])*)[;]
It tokenizes a string on semicolons. For example,
Hello world; I am having a problem; using regex;
Result is three strings
Hello world
I am having a problem
using regex
But when I use a large input string I get this error
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
How is this caused and how can I solve it?
Unfortunately, Java's built-in regex support has problems with regexes containing repetitive alternative paths (that is, (A|B)*). This is compiled into a recursive call, which results in a StackOverflowError when used on a very large string.
A possible solution is to rewrite your regex so that it does not use a repetitive alternative, but if your goal is to tokenize a string on semicolons, you don't really need a complex regex at all; just use String.split() with a simple ";" as the argument.
If you really need to use a regex that overflows your stack, you can increase the size of your stack by passing something like -Xss40m to the JVM.
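The String.split() suggestion could look roughly like the sketch below. The names SemicolonSplit and tokens are made up for illustration, and the trimming is an assumption about the desired output. Note that, unlike the regex in the question, a plain split does not protect semicolons inside single-quoted sections.

```java
import java.util.ArrayList;
import java.util.List;

final class SemicolonSplit {
    // Splits on every ';' and trims each piece.
    // Assumption: no semicolons need to be protected by quotes.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        // split() drops trailing empty strings, so the final ';' adds no empty token.
        for (String part : input.split(";")) {
            out.add(part.trim());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Hello world; I am having a problem; using regex;"));
    }
}
```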
It might help to add a + after the [^;], so that you have fewer repetitions.
Isn't there also some construct that says "if the regular expression matched up to this point, don't backtrack"? Maybe that comes in handy, too. (Update: these are called possessive quantifiers.)
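A possible way to combine both hints is sketched below: each run of unquoted characters is consumed in one step, and the quantifiers are made possessive (*+ and ++). The class name PossessiveTokenizer is hypothetical. The unquoted branch excludes the quote character so that it cannot swallow the start of a quoted section, which assumes quotes are balanced; input with a stray quote will tokenize differently from the original regex. Whether this avoids the StackOverflowError for a given input size has not been measured here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PossessiveTokenizer {
    // Group 1 is a sequence of quoted sections ('...') and unquoted runs,
    // terminated by a ';'. The possessive quantifiers tell the engine
    // not to backtrack into what it has already consumed.
    private static final Pattern TOKEN =
            Pattern.compile("((?:'[^']*+'|[^;']++)*+);");

    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("one;'a;b' two;three;"));
    }
}
```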
A completely different alternative is to write a utility method called splitQuoted(char quote, char separator, CharSequence s) that explicitly iterates over the string and remembers whether it has seen an odd number of quotes. In that method you could also handle the case that the quote character might need to be unescaped when it appears in a quoted string.
'I'm what I am', said the fox; and he disappeared.
'I\'m what I am', said the fox; and he disappeared.
'I''m what I am', said the fox; and he disappeared.
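A minimal sketch of such a splitQuoted method follows. Only the signature comes from the text above; the class name QuotedSplitter and the details are assumptions. It toggles a flag on every quote character, so the doubled-quote style ('I''m') works without special handling, and it copies a backslash together with the character after it, so the 'I\'m' style works too. It does not unescape anything, and the first example line, with a bare apostrophe inside the quotes, stays ambiguous.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedSplitter {
    static List<String> splitQuoted(char quote, char separator, CharSequence s) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // true after an odd number of quote characters
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                // Backslash escape: copy both characters without interpreting them.
                current.append(c).append(s.charAt(++i));
            } else if (c == quote) {
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == separator && !inQuotes) {
                // Separator outside quotes ends the current piece.
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitQuoted('\'', ';', "a;'b;c';d"));
    }
}
```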

Racket raw strings

In Racket you have to escape backslashes in strings, therefore Windows paths and regexes become verbose.
For example, the regular expression (.*)\1 can be represented with the string "(.*)\\1" or the regexp constant #rx"(.*)\\1"; the \ in the regular expression must be escaped to include it in a string or regexp constant. [Source: Regexp Syntax]
In many languages like Perl and Ruby regexes are supported syntactically /\([a-z]+\)/, in others there are optional raw strings, like in Python r"\([a-z]+\)". It seems that Racket doesn't support raw strings, where you don't need to escape backslashes, natively. Is there any method to implement them, a third-party library, a proposal, whatever?
See also:
Regular Expressions # The Racket Guide
Regular Expressions # The Racket Reference
As Chris mentioned, a custom reader can do this.
An example of a reader that Racket already supplies, that you could use, is at-exp:
#lang at-exp racket
@~a{C:\Windows\win.ini}
;; "C:\\Windows\\win.ini"
@~a{This is a string
with newlines.}
;; "This is a string\nwith newlines."
I like to use ~a with this because it converts anything to a string, and it's only two characters to type.
However for your regexp example, you can't use ~a or #rx. Instead you should use regexp:
@regexp{(.*)\1}
;; #rx"(.*)\\1"
In all of these examples, @function{string} is read as (function "string") -- basically. There are some nuances you can read about in the documentation for at-exp and Scribble.

How can I disambiguate strings and regexps in Elisp custom declarations?

Consider the following Emacs Lisp code, which defines a customizable variable that can either be a literal string or a regular expression:
(defcustom myvar "" "String or regexp"
  :type '(choice (string :tag "String")
                 (regexp :tag "Regexp")))
This works just fine in the Custom interface (customize-variable 'myvar), but it then becomes impossible to tell whether the variable was set to a string or a regular expression. Even the Custom interface thinks it's a string no matter what. If you set the variable to a regexp using Custom and then close and reopen the Custom buffer for the variable, it will once again say it's a string.
So, is there any way to disambiguate this, to ensure that when the variable is set to a regexp through Custom, my code can determine that it's meant to be a regexp and not a simple string?
Ideally, I would like some sort of mechanism to have the string be stored internally as (cons 'string VALUE), where VALUE is the string that the user types in, and similarly have the regexp stored internally as (cons 'regexp VALUE).
Edit
From my searching, I've found the :value-to-internal and :value-to-external properties that you can supply to define-widget, but I can't figure out how to use them in a way that doesn't cause an error when I try to customize the resulting variable.
Rephrasing my answer from help-gnu-emacs --
Do what you suggested: use a cons instead of a string.
You need some way to programmatically distinguish arbitrary text from text used as a regexp. You know that intention at customize time: the user chooses one or the other.
You need to make sure that the user choice results in different (distinguishable) values. A cons recording (a) the text and (b) the type/choice/use is a good way to do that.

Regular Expression extract first three characters from a string

Using a regular expression how can I extract the first 3 characters from a string (Regardless of characters)? Also using a separate expression I want to extract the last 3 characters from a string, how would I do this? I can't find any examples on the web that work so thanks if you do know.
Thanks
Steven
Any programming language should have a better solution than using a regular expression (namely some kind of substring function or a slice function for strings). However, this can of course be done with regular expressions (in case you want to use it with a tool like a text editor). You can use anchors to indicate the beginning or end of the string.
^.{0,3}
.{0,3}$
This matches up to 3 characters of a string (as many as possible). I added the "0 to 3" semantics instead of "exactly 3", so that this would work on shorter strings, too.
Note that . generally matches any character except line breaks. There is usually an s or singleline option that changes this behavior, but an alternative that needs no option is the following, which really matches any 3 characters:
^[\s\S]{0,3}
[\s\S]{0,3}$
But as I said, I strongly recommend against this approach if you want to use this in some code that provides other string manipulation functions. Plus, you should really dig into a tutorial.
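In Java, for example, the substring approach recommended above might look like this sketch. The class and method names are invented for illustration; the regex variant is included only for comparison with the first pattern. Lengths here count UTF-16 code units, as String.substring does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Edges {
    // Up to the first three characters; shorter strings are returned whole.
    static String firstThree(String s) {
        return s.substring(0, Math.min(3, s.length()));
    }

    // Up to the last three characters; shorter strings are returned whole.
    static String lastThree(String s) {
        return s.substring(Math.max(0, s.length() - 3));
    }

    // Same result as firstThree, using the ^[\s\S]{0,3} pattern.
    private static final Pattern FIRST = Pattern.compile("^[\\s\\S]{0,3}");

    static String firstThreeByRegex(String s) {
        Matcher m = FIRST.matcher(s);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(firstThree("abcdef") + " " + lastThree("abcdef"));
    }
}
```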

Using regexp to evaluate search query

Is it possible to convert a properly formed (in terms of brackets) expression such as
((a and b) or c) and d
into a Regex expression and use Java or another language's built-in engine with an input term such as ABCDE (case-insensitive...)?
So far I've tried something along the lines of (b)(^.?)(a|e)* for the search b and (a or e) but it isn't really working out. I'm looking for it to match the characters 'b' and any of 'a' or 'e' that appear in the input string.
About the process - I'm thinking of splitting the input string into an array (based on this Regex) and receiving as output the characters that match (or none if the AND/OR conditions are not met). I'm relatively new to Regex and haven't spent a lot of time on it, so I'm sorry if what I'm asking about is not possible or the answer is really obvious.
Thanks for any replies.
The language of strings with balanced parentheses is not a regular language, which means no (pure) regular expression will match it.
That is because some kind of memory construct, usually a stack, is needed to maintain open parentheses.
That said, many languages offer recursive evaluation in regexes, notably Perl. I don't know the fine details, but I'm not going to bother with them because you can probably write your own parser.
Just iterate over every character in the string and keep track of a counter of open parentheses and a stack of strings. When you get to an opening parenthesis, push a new empty string onto the stack, and append characters that aren't parentheses to the string on top of the stack. When you get to a closing parenthesis, pop and evaluate the expression you have built up, and append the result to the string that is now on top of the stack.
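As one way to write your own parser, here is a small recursive-descent evaluator in Java. It is a swapped-in technique rather than the exact stack-of-strings procedure described above: the call stack plays the role of the explicit stack. The class name QueryEvaluator is hypothetical, and two things are assumptions: "and" binds tighter than "or", and a term matches when the input contains it, ignoring case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QueryEvaluator {
    private final List<String> tokens = new ArrayList<>();
    private final String haystack;
    private int pos = 0;

    private QueryEvaluator(String query, String input) {
        // Tokens are single parentheses or runs of non-space, non-parenthesis characters.
        Matcher m = Pattern.compile("[()]|[^\\s()]+").matcher(query);
        while (m.find()) {
            tokens.add(m.group());
        }
        haystack = input.toLowerCase(Locale.ROOT);
    }

    static boolean matches(String query, String input) {
        QueryEvaluator e = new QueryEvaluator(query, input);
        boolean result = e.parseOr();
        if (e.pos != e.tokens.size()) {
            throw new IllegalArgumentException("unexpected token: " + e.tokens.get(e.pos));
        }
        return result;
    }

    // or-expression := and-expression ("or" and-expression)*
    private boolean parseOr() {
        boolean value = parseAnd();
        while (peekIs("or")) {
            pos++;
            boolean rhs = parseAnd(); // always parse, so the tokens are consumed
            value = value || rhs;
        }
        return value;
    }

    // and-expression := atom ("and" atom)*
    private boolean parseAnd() {
        boolean value = parseAtom();
        while (peekIs("and")) {
            pos++;
            boolean rhs = parseAtom();
            value = value && rhs;
        }
        return value;
    }

    // atom := "(" or-expression ")" | term
    private boolean parseAtom() {
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("unexpected end of query");
        }
        String t = tokens.get(pos++);
        if (t.equals("(")) {
            boolean value = parseOr();
            if (!peekIs(")")) {
                throw new IllegalArgumentException("missing closing parenthesis");
            }
            pos++;
            return value;
        }
        if (t.equals(")")) {
            throw new IllegalArgumentException("unexpected closing parenthesis");
        }
        return haystack.contains(t.toLowerCase(Locale.ROOT));
    }

    private boolean peekIs(String s) {
        return pos < tokens.size() && tokens.get(pos).equalsIgnoreCase(s);
    }

    public static void main(String[] args) {
        System.out.println(matches("((a and b) or c) and d", "ABCDE"));
    }
}
```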
Then again, I'm not fully sure I understand what you're doing. I apologize, then, if this is no help.
I'm not entirely certain I understand what you're trying to do, but here's something that might help. Start with something like
((a and b) or c) and d
And pass it through these substitution statements:
s/or/|/g
s/and| //g
s/([^()|])/(?=.*$1)/g
That will give you
(((?=.*a)(?=.*b))|(?=.*c))(?=.*d)
which is a regex that will match what you want.
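The same three substitutions can be tried in Java with String.replaceAll, as in the sketch below. The class and method names are made up. Like the substitutions themselves, it assumes every search term is a single character that is not a regex metacharacter, and that no term contains "or" or "and". The case-insensitive and DOTALL flags are an assumption based on the question's "case-insensitive" remark.

```java
import java.util.regex.Pattern;

final class QueryToRegex {
    // Applies the three substitutions in order:
    // "or" -> "|", drop "and" and spaces, wrap each term in a lookahead.
    static String toLookaheadRegex(String query) {
        return query
                .replaceAll("or", "|")
                .replaceAll("and| ", "")
                .replaceAll("([^()|])", "(?=.*$1)");
    }

    static boolean matches(String query, String input) {
        Pattern p = Pattern.compile(toLookaheadRegex(query),
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        // The pattern consists only of lookaheads, so it is tested at the start of the input.
        return p.matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(toLookaheadRegex("((a and b) or c) and d"));
        System.out.println(matches("((a and b) or c) and d", "ABCDE"));
    }
}
```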
No. A regex isn't computationally powerful enough to make sure that the opening and closing parentheses match. You need something that can describe it using a formal grammar.