Using regexp to evaluate search query - regex

Is it possible to convert a properly formed (in terms of brackets) expression such as
((a and b) or c) and d
into a Regex expression and use Java or another language's built-in engine with an input term such as ABCDE (case-insensitive...)?
So far I've tried something along the lines of (b)(^.?)(a|e)* for the search b and (a or e) but it isn't really working out. I'm looking for it to match the characters 'b' and any of 'a' or 'e' that appear in the input string.
About the process - I'm thinking of splitting the input string into an array (based on this Regex) and receiving as output the characters that match (or none if the AND/OR conditions are not met). I'm relatively new to Regex and haven't spent a lot of time on it, so I'm sorry if what I'm asking about is not possible or the answer is really obvious.
Thanks for any replies.

The language of strings with balanced parentheses is not a regular language, which means no (pure) regular expression will match it.
That is because some kind of memory construct, usually a stack, is needed to maintain open parentheses.
That said, many languages offer recursive evaluation in regexes, notably Perl. I don't know the fine details, but I'm not going to bother with them because you can probably write your own parser.
Just iterate over every character in the string, keeping track of a counter of open parentheses and a stack of strings. When you reach an opening parenthesis, push a new empty string onto the stack; append characters that aren't parentheses to the string on top of the stack. When you reach a closing parenthesis, pop and evaluate the expression you have built up, then append the result to the string that is now on top of the stack.
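A hand-written parser along these lines could look like the following Python sketch. It uses recursive descent rather than an explicit stack of strings, so the call stack plays the role of the stack described above. The function name evaluate, the choice to let "and" bind tighter than "or", and the substring test for terms are all assumptions made for illustration.

```python
import re

def evaluate(query: str, text: str) -> bool:
    """Evaluate a bracketed and/or query against text, case-insensitively."""
    tokens = re.findall(r"\(|\)|\w+", query.lower())
    haystack = text.lower()
    pos = 0  # index of the next unread token

    def parse_or() -> bool:
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "or":
            pos += 1
            rhs = parse_and()  # always parse so that pos advances
            result = result or rhs
        return result

    def parse_and() -> bool:
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "and":
            pos += 1
            rhs = parse_atom()
            result = result and rhs
        return result

    def parse_atom() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in (")", "and", "or"):
            raise ValueError(f"unexpected token {token!r}")
        return token in haystack  # a term is true if it occurs in the input

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return result
```

For example, evaluate("((a and b) or c) and d", "ABCDE") is true, while the same query against "abc" is false because d is missing.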
Then again, I'm not fully sure I understand what you're doing. I apologize, then, if this is no help.

I'm not entirely certain I understand what you're trying to do, but here's something that might help. Start with something like
((a and b) or c) and d
And pass it through these substitution statements:
s/or/|/g
s/and| //g
s/([^()|])/(?=.*$1)/g
That will give you
(((?=.*a)(?=.*b))|(?=.*c))(?=.*d)
which is a regex that will match what you want.
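The same three substitutions can be sketched in Python as below. The helper names query_to_regex and matches are made up for this example, and it assumes single-character search terms that are not regex metacharacters.

```python
import re

def query_to_regex(query: str) -> str:
    """Translate an and/or query into a chain of lookaheads."""
    pattern = query.replace("or", "|")        # s/or/|/g
    pattern = re.sub(r"and| ", "", pattern)   # s/and| //g
    # Wrap every remaining term character in a lookahead: s/([^()|])/(?=.*$1)/g
    return re.sub(r"([^()|])", r"(?=.*\1)", pattern)

def matches(query: str, text: str) -> bool:
    """Case-insensitive test of the translated query against text."""
    return re.search(query_to_regex(query), text, re.IGNORECASE) is not None
```

With this, matches("((a and b) or c) and d", "ABCDE") is true and matches("((a and b) or c) and d", "abc") is false. Note that it only says whether the query holds; it does not return the matching characters.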

No. A regex isn't computationally powerful enough to make sure that the opening and closing parentheses match. You need something that can describe it using a formal grammar.

Related

Automatically find short regexp to match a set of words?

I am not looking for a specific regular expression, but for software that finds them.
Let us say I have a file A and a file B: how to find a regexp that matches all words of A, but does not match any of the words in B?
If A contains "truit fruit" and B contains "ridiculous", then the software could return something like ".ru." but '.r.' only would be invalid.
It is the "practical" aspect of another question [1], though what interests me is to find an actual software that solves it in practice.
Thanks for your help,
Nathann
[1] https://cstheory.stackexchange.com/questions/1854/is-finding-the-minimum-regular-expression-an-np-complete-problem
There is no algorithm to somehow "cleverly derive" a regular expression from examples. You can only brute-force it: iterate through all combinations of common substrings of the words in A and test each candidate against B until you find a solution. You are not guaranteed to find a solution, though.
If the words in A have no common substring, you could extend that approach by introducing the "or" operator of regular expressions. But that gets really ugly and slow.
If that does not lead to a solution, you would have to extend your attempts further by adding exclusion rules to the expression, iterating through all words in B and creating anti-patterns from them. A horrible approach.
And as said: you are never guaranteed to find a solution.
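The simplest form of that brute-force search, restricted to plain common substrings, might look like this Python sketch. The function name and the rule of preferring the shortest candidate are assumptions for illustration; it does not attempt the "or" or exclusion extensions.

```python
import re

def shortest_separating_substring(positives, negatives):
    """Return the shortest literal regex that matches every word in
    positives and no word in negatives, or None if there is none."""
    if not positives:
        return None
    # Any substring common to all positives is a substring of the shortest one.
    seed = min(positives, key=len)
    candidates = {
        seed[i:j]
        for i in range(len(seed))
        for j in range(i + 1, len(seed) + 1)
    }
    # Try short candidates first; ties are broken alphabetically.
    for candidate in sorted(candidates, key=lambda s: (len(s), s)):
        regex = re.compile(re.escape(candidate))
        if all(regex.search(word) for word in positives) and not any(
            regex.search(word) for word in negatives
        ):
            return regex.pattern
    return None
```

For the words from the question, shortest_separating_substring(["truit", "fruit"], ["ridiculous"]) returns "t", since t occurs in both words of A and nowhere in B.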
There is one thing though:
If you do not care what the final regular expression looks like, you can do this: build a regex that simply combines all words in a "whitespace-padded version of A" with an "or" operation (so \struit\s|\sfruit\s in your example). Obviously that approach creates huge expressions. You would then have to take care to exclude exact substrings that might also occur in B, which may lead to much longer expressions still.
Bottom line: there is no really elegant solution for this, simply because the question does not allow for one. The question is: why does it have to be a regular expression? Why can't you simply do string comparisons? That would probably not be more expensive anyway in such a vaguely defined scenario...

Emacs - Subword Regular Expressions Clarification

I'm trying to change the locations at which the subword-mode commands (subword-forward,subword-backward, etc.) stop.
I noticed that subword.el provides regular expressions for forward and backward matching, and I've been messing with them in trying to make some headway on adding more subword delimiters.
What I would really like is some clarification on how exactly the subword regular expressions work, as far as what exactly is being matched, so that I might be able to change it to include characters I want to stop on. I have a basic understanding of regular expressions and have used them before, but never any as large as those in subword.el.
I don't necessarily need help with both regular expressions. Any guidance on adding additional delimiters to one of the existing regular expressions would be equally appreciated, since that is my goal in changing them, but I would really like to know a bit about how the regular expressions are set up.
Lastly, in searching for a solution, I found this related StackOverflow question. I read it over, but subword.el doesn't seem to contain the regular expressions in the form shown in the quoted section of that question, and I don't understand what is meant by the last parenthetical statement in that quoted section.
Edit:
To try to put what I am looking to do in a clearer context, I just want the Ctrl+Left/Right in Emacs (subword-forward/backward) to act as closely to Eclipse as possible, in that I would like to have the cursor move similarly, stopping at the end and beginnings of lines with Ctrl+Left/Right once reached.
Here is another related StackOverflow question. The "viper" commands are much closer to what I am looking for, but slightly off, because I want the point to stop at the end of the line before continuing to the next.
The answer to the question in your last paragraph is contained in the other answer on that same linked page: (modify-syntax-entry ?\\ "w"). That makes backslash be a word-constituent character, so word functions treat it as part of a word.
Please specify the behavior you are trying to implement, in particular, what you mean by "adding more subword delimiters."
The regexps in subword.el are fairly straightforward. You say you do not need help understanding those regexps. But then what do you mean by asking "how exactly the subword regular expressions are constructed"? They were likely constructed by hand (based on what you already understand their various parts to be for).
A guess, since your description is unclear to me so far, is that all you are looking for is to specify some additional chars as having non-word syntax. If that is what you mean by "adding more subword delimiters" then just do that. If, for example, you want the char a to be a non-word character, then do something like this:
(modify-syntax-entry ?a ".") ; Or another nonword-constituent syntax class (this uses punctuation)
That makes a be a punctuation character instead of a word-constituent character. If you want some other syntax class than punctuation, then choose it similarly.
Update after comments
E.g., If you want any punctuation syntax to act the same as an uppercase letter, this will do it:
(defvar subword-forward-regexp
  "\\W*\\(\\(\\([[:upper:]]\\|\\s.\\)*\\(\\W\\)?\\)[[:lower:][:digit:]]*\\)"
  "Regexp used by `subword-forward-internal'.")
(defvar subword-backward-regexp
  "\\(\\(\\W\\|[[:lower:][:digit:]]\\)\\(\\([[:upper:]]\\|\\s.\\)+\\W*\\)\\|\\W\\w+\\)"
  "Regexp used by `subword-backward-internal'.")
Or if you want, say, just , to act the same as an uppercase letter, this will do that:
(defvar subword-forward-regexp
  "\\W*\\(\\([,[:upper:]]*\\(\\W\\)?\\)[[:lower:][:digit:]]*\\)"
  "Regexp used by `subword-forward-internal'.")
(defvar subword-backward-regexp
  "\\(\\(\\W\\|[[:lower:][:digit:]]\\)\\([,[:upper:]]+\\W*\\)\\|\\W\\w+\\)"
  "Regexp used by `subword-backward-internal'.")
If this is still not what you want, then try explaining what you want a bit better. E.g., you have not given a single example -- neither positive (should stop here) nor negative (should not stop here). You make those who try to help you guess more than they should have to, which is not efficient.

PCRE Regex Syntax

I guess this is more or less a two-part question, but here are the basics first: I am writing some PHP that uses preg_match_all to look in a variable for strings book-ended by {}. It then iterates through each string returned and replaces it with data from a MySQL query.
The first question is this: Any good sites out there to really learn the ins and outs of PCRE expressions? I've done a lot of searching on Google, but the best one I've been able to find so far is http://www.regular-expressions.info/. In my opinion, the information there is not well-organized and since I'd rather not get hung up having to ask for help whenever I need to write a complex regex, please point me at a couple sites (or a couple books!) that will help me not have to bother you folks in the future.
The second question is this: I have this regex
"/{.*(_){1}(.*(_){1}[a-z]{1}|.*)}/"
and I need it to catch instances such as {first_name}, {last_name}, {email}, etc. I have three problems with this regex.
The first is that it sees "{first_name} {last_name}" as one string, when it should see it as two. I've been able to solve this by checking for the existence of the space, then exploding on the space. Messy, but it works.
The second problem is that it includes punctuation as part of the captured string. So, if you have "{first_name} {last_name},", then it returns the comma as part of the string. I've been able to partially solve this by simply using preg_replace to delete periods, commas, and semi-colons. While it works for those punctuation items, my logic is unable to handle exclamation points, question marks, and everything else.
The third problem I have with this regex is that it is not seeing instances of {email} at all.
Now, if you can, are willing, and have time to simply hand me the solution to this problem, thank you as that will solve my immediate problem. However, even if you can do this, please please provide an LMGTFY that provides good web sites as references and/or a book or two that would provide a good education on this subject. Sites would be preferable as money is tight, but if a book is the solution, I'll find the money (assuming my local library system is unable to procure said volume).
Back then I found PHP's own PCRE syntax reference quite good: http://uk.php.net/manual/en/reference.pcre.pattern.syntax.php
Let's talk about your expression. It's quite a bit more verbose than necessary; I'm going to simplify it while we go through this.
A rather simpler way of looking at what you're trying to match: "find a {, then any number of letters or underscores, then a }". A regular expression for that is (in PHP's string-y syntax): '/\{[a-z_]+\}/'
This will match all of your examples but also some wilder ones like {__a_b}. If that's not an option, we can go with a somewhat more complex description: "find a {, then a bunch of letters, then (as often as possible) an underscore followed by a bunch of letters, then a }". In a regular expression: /\{[a-z]+(_[a-z]+)*\}/
This second one maybe needs a bit more explanation. Since we want to repeat the thing that matches _foo segments, we need to put it in parentheses. Then we say: try finding this as often as possible, but it's also okay if you don't find it at all (that's the meaning of *).
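To see the stricter pattern at work, here is a small Python sketch (Python's re module accepts the same syntax for this pattern, although it is not PCRE). The names PLACEHOLDER and fill_template and the values dictionary are hypothetical stand-ins for the MySQL lookup; a capturing group around the name is added so the braces can be dropped.

```python
import re

# "{", letters, then zero or more "_letters" segments, then "}".
# The outer group captures the name; the inner group is non-capturing.
PLACEHOLDER = re.compile(r"\{([a-z]+(?:_[a-z]+)*)\}")

def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace each {name} with values[name]; leave unknown names untouched."""
    def replace(match: re.Match) -> str:
        key = match.group(1)
        return values.get(key, match.group(0))
    return PLACEHOLDER.sub(replace, template)
```

Here PLACEHOLDER.findall("Dear {first_name} {last_name}, mail {email}! {__a_b}") yields first_name, last_name and email as three separate names, skips {__a_b}, and picks up no punctuation.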
So now that we have something to compare your attempt to, let's have a look at what caused your problems:
Your expression matches any characters inside the {}, including } and { and a whole bunch of other things. In other words, {abcde{_fgh} would be accepted by your regex, as would {abcde} fg_h {ijkl}.
You've got a mandatory _ in there, right after the first .*. The (_){1} (which means exactly the same as _) says: whatever happens, explode if this ain't here! Clearly you don't actually want that, because it'll never match {email}.
Here's a complete description in plain language of what your regex matches:
Match a {.
Match absolutely anything, as long as the remaining rules can still match after it.
Match a _.
Match absolutely anything as long as you can match all the remaining rules right after that anything.
Match a _.
Match a single letter.
Instead of that anything, the _ and the single letter, absolutely anything is okay, too.
Match a }.
This is probably pretty far from what you wanted. Don't worry, though. Regular expressions take a while to get used to. I think it's very helpful if you think of it in terms of instructions, i.e. when building a regular expression, try to build it in your head as a "find this, then find that", etc. Then figure out the right syntax to achieve exactly that.
This is hard mainly because not all instructions you might come up with in your head easily translate into a piece of a regular expression... but that's where experience comes in. I promise you that you'll have it down in no time at all... if you are fairly methodical about making your regular expressions at first.
Good luck! :)
For PCRE, I simply digested the PCRE manpages, but then my brain works that way anyway...
As for matching delimited stuff, you generally have 2 approaches:
1. Match the first delimiter, match anything that is not the closing delimiter, match the closing delimiter.
2. Match the first delimiter, match anything ungreedily, match the closing delimiter.
E.g. for your case:
1. \{([^}]+)\}
2. \{(.+?)\} - Note the ? after the +
I added a group around the content you'd likely want to extract too.
Note also that both patterns can match line breaks inside the braces: #1 always does, because a negated character class matches newlines, and #2 does when "dot matches anything" (dotall, singleline or whatever your favourite regex flavour calls it) is in effect. If that would be a problem, you need to exclude line breaks, and anything else you don't want, explicitly; see the above answer if you want something more like a whitelist approach.
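A quick way to compare the two approaches with a plain greedy match is this Python sketch. The sample text is made up, and Python's re module stands in for PCRE here, which is fine for these particular patterns.

```python
import re

text = "{first_name} {last_name}, {email}!"

# Greedy: .+ runs to the last "}" in the string, so everything is one match.
greedy = re.findall(r"\{(.+)\}", text)

# Approach 1: anything that is not the closing delimiter.
negated = re.findall(r"\{([^}]+)\}", text)

# Approach 2: ungreedy match, stops at the first "}".
lazy = re.findall(r"\{(.+?)\}", text)

print(greedy)   # one long capture spanning all three placeholders
print(negated)  # three separate names
print(lazy)     # the same three names
```

Both approaches return first_name, last_name and email separately and leave the comma and exclamation mark out, whereas the greedy version returns a single capture running from first_name to email.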
Here's a good regex site.
Here's a PCRE regex that will work: \{\w+\}
Here's how it works:
It's basically looking for { followed by one or more word characters followed by }. The interesting part is that the word character class actually includes an underscore as well. \w is essentially shorthand for [A-Za-z0-9_]
So it will basically match any combination of those characters within braces and because of the plus sign will only match braces that are not empty.

Lua string.match uses irregular regular expressions?

I'm curious why this doesn't work, and need to know why/how to work around it; I'm trying to detect whether some input is a question, I'm pretty sure string.match is what I need, but:
print(string.match("how much wood?", "(how|who|what|where|why|when).*\\?"))
returns nil. I'm pretty sure Lua's string.match uses regular expressions to find matches in a string, as I've used wildcards (.) before with success, but maybe I don't understand all the mechanics? Does Lua require special delimiters in its string functions? I've tested my regular expression here, so if Lua used regular regular expressions, it seems like the above code would return "how much wood?".
Can any of you tell me what I'm doing wrong, what I mean to do, or point me to a good reference where I can get comprehensive information about how Lua's string manipulation functions utilize regular expressions?
Lua doesn't use regex. Lua uses Patterns, which look similar but match different input.
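.* is greedy, so it first consumes the final ? of the input as well. Lua then backtracks to give it back, but \\? still does not match it: a backslash is not an escape character in Lua patterns, so \\? means an optional literal backslash. Special characters are escaped with %. Excluding the question mark from the wildcard also stops the match at the first ?: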
"how[^?]*%?"
As Omri Barel said, there's no alternation operator. You probably need to use multiple patterns, one for each alternative word at the beginning of the sentence. Or you could use a library that supports regex like expressions.
According to the manual, patterns don't support alternation.
So while "how.*" works, "(how|what).*" doesn't.
And kapep is right that the question mark needs attention: it has to be escaped as %?, not \\?.
There's a related question: Lua pattern matching vs. regular expressions.
As they have already answered before, it is because the patterns in lua are different from the Regex in other languages, but if you have not yet managed to get a good pattern that does all the work, you can try this simple function:
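local function capture_answer(text)
  local text = text:lower()
  -- Caution: [how] is a character set (one of h, o, w), not the word "how".
  -- Every set here is optional, so the pattern effectively reduces to '(.+%?)'.
  local pattern = '([how]?[who]?[what]?[where]?[why]?[when]?[would]?.+%?)'
  -- Return the first capture only; '.' also matches newlines in Lua patterns.
  for capture in string.gmatch(text, pattern) do
    return capture
  end
end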
print(capture_answer("how much wood?"))
Output: how much wood?
That function will also help you if you want to find a question in a larger text string
Ex.
print(capture_answer("Who is the best football player in the world?\nWho are your best friends?\nWho is that strange guy over there?\nWhy do we need a nanny?\nWhy are they always late?\nWhy does he complain all the time?\nHow do you cook lasagna?\nHow does he know the answer?\nHow can I learn English quickly?"))
Output:
who is the best football player in the world?
who are your best friends?
who is that strange guy over there?
why do we need a nanny?
why are they always late?
why does he complain all the time?
how do you cook lasagna?
how does he know the answer?
how can i learn english quickly?

How to write the regex for this expression

I want to match strings like the one below. I'm assuming the input only needs to contain the right elements; whether they can actually be evaluated is left to the evaluator.
1+(2-3)*(4/5)
what is the regex for matching this, something like this: ([0-9\+-\*/\(\)]+)? but this seems not working.
If you are only asking for a character validation, you can use
^[0-9+*/()-]*$
You don't need to escape most characters in a character class (inside square brackets). And if you must include a literal hyphen, put it at the very end (or the very start) of the class or escape it; otherwise it is treated as the character range operator.
That said, keep in mind this will only guarantee you that you have no other characters. It will NOT validate the structure (regexes are not the right tool for that). However, since you stated an evaluator will then process the input, that might be right for you.
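In Python, that character check could be sketched as follows. The function name has_only_allowed_chars is an assumption, and fullmatch takes the place of the ^ and $ anchors.

```python
import re

# Digits, the four operators and parentheses; the hyphen sits last
# in the class so it is taken literally.
ALLOWED = re.compile(r"[0-9+*/()-]*")

def has_only_allowed_chars(expr: str) -> bool:
    """True if expr consists solely of the permitted characters."""
    return ALLOWED.fullmatch(expr) is not None
```

This accepts 1+(2-3)*(4/5) and rejects 1+a, but it also accepts structurally broken input such as ((1+, which is exactly the limitation described above.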
You can't; this is not a regular language. Some regexp implementations do provide additional features to match balanced parentheses, though.
Regular expressions can not match arbitrary arithmetic formulas. Regexps only describe regular languages, while arithmetic formulas use a recursive grammar. See http://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory
A regex may be possible if you limit nesting depth, but if you want it all the way, with matching bracket detection, it will probably be very, very complicated.
If you want to match "1+(2-3)*(4/5)", then you can use this regular expression.
/1\+\(2-3\)\*\(4\/5\)/
What's that? That doesn't tell you what you want to know? Well, then what do you want to know? What information are you trying to extract from the string?
You can't just say "strings like this". Your question is not nearly clear enough.
If your question is how to check whether an equation is valid, then you will need a parser to tokenize the expression, then a grammar to decide whether the expression is well-formed.
You can't check whether the equation has balanced parentheses using a regex. This is because a regular expression is equivalent to a deterministic finite automaton. Since the automaton has finitely many states, no automaton is ever big enough to track arbitrarily deep nesting.
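Outside of regexes the balance check itself is short. Here is a minimal Python sketch, assuming only round parentheses matter (the function name is_balanced is made up); with a single bracket type a counter is enough, and a stack is only needed once several bracket types are mixed.

```python
def is_balanced(expr: str) -> bool:
    """True if every ')' closes an earlier '(' and none is left open."""
    depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" appeared before its "("
                return False
    return depth == 0
```

The counter is the unbounded memory that a finite automaton lacks: it can grow as deep as the nesting in the input.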