Pattern matching language knowledge, pattern matching approach - regex

I am trying to implement a pattern matching "syntax" and language.
I know of regular expressions, but they aren't enough for my purposes.
I have identified some "mathematical" operators.
In the examples that follow I will assume that the subject of pattern matching is a character string, but that isn't necessary.
Having read the description below, the question is: does anybody know of a mathematical theory that makes this explicit, or of any language that takes the same approach in its implementation? I would like to look at it to get ideas!
Description of the approach:
At first we have characters. Characters may be aggregated to form strings.
A pattern is:
a) a single character
b) an ordered group of patterns with the operator matchAny
c) an ordered group of patterns with the operator matchAll
d) other various operators to see later on.
Explanation:
We have a subject character string and a starting position.
If we check for a match of a single character and it matches, the current position moves forward by one.
If we check for a match of an ordered group of patterns with the operator matchAny, each element of the group is checked in sequence, and we get a proliferation of starting positions: one per successful match, each advanced by the length of that match.
E.g. suppose the group of patterns is { "a" "aba" "ab" "x" "dd" } and the string under examination is:
"Dabaxddc" with current position 2 ( counting from 1 ).
Then applying matchAny with the previous group we have that "a" matches, "aba" matches and "ab" matches, while "x" and "dd" do not match.
After those matches there are 3 starting positions, 3 4 5 ( corresponding to "a" "ab" "aba" ).
We may continue our pattern matching by accepting that there is more than one starting position. So now we may continue to the next case under examination and check for a matchAll.
matchAll means that all patterns must match sequentially and are applied sequentially.
subcases of matchAll are match0+ match1+ etc.
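The position-set idea described above can be sketched as a few combinators in Python. This is only an illustrative sketch under some assumptions: the names lit, match_any and match_all are made up here, positions are 0-based (so position 2 in the example becomes 1), and a pattern is modelled as a function that maps a set of starting positions to the set of positions reached after a match.

```python
def lit(s):
    """Pattern for a literal string (a single character is the simplest case)."""
    def run(subject, positions):
        # Every start position where `s` occurs advances by len(s).
        return {p + len(s) for p in positions if subject.startswith(s, p)}
    return run

def match_any(*patterns):
    """Try every pattern from every start position and keep all resulting positions."""
    def run(subject, positions):
        result = set()
        for pattern in patterns:
            result |= pattern(subject, positions)
        return result
    return run

def match_all(*patterns):
    """Apply the patterns one after another, feeding positions forward."""
    def run(subject, positions):
        for pattern in patterns:
            positions = pattern(subject, positions)
        return positions
    return run

group = match_any(lit("a"), lit("aba"), lit("ab"), lit("x"), lit("dd"))
after_group = group("Dabaxddc", {1})                       # 0-based positions 2, 3, 4
after_seq = match_all(group, lit("x"))("Dabaxddc", {1})    # only "aba" then "x" survives
```

An empty result set means the overall match failed; a non-empty one lists every position from which matching could continue.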
I should add that the very act of formulating this question has already helped me and cleared some things up.
But I would like to know of similar approaches in order to study them.
Please suggest only languages you have used yourself, not a bibliography!

I suggest you have a look at the paper "Parsing Permutation Phrases". It deals with recognizing a set of things in any order, where the "things" can be recognizers themselves. The presentation in the paper might be a little different from what you expect; they don't compile to a finite automaton. However, they do give an implementation in a functional language, and that should be helpful to you.

Your description of matching strings against patterns is exactly what a compiler does. In particular, your description of multiple potential matches is highly reminiscent of the way an LR parser works.
If the patterns are static and can be described by an EBNF, then you could use an LR parser generator (such as YACC) to generate a recogniser.
If the patterns are dynamic but can still be formulated as EBNF there are other tools that can be applied. It just gets a bit more complicated.
[In Australia at least, Computer Science was a University course in 1975, when I did mine. YACC dates from around 1970 in its original form. EBNF is even older.]

Related

What's the regular expression for an alphabet without the first occurrence of a letter?

I am trying to use FLEX to recognize some regular expressions that I need.
What I am looking for is given a set of characters, say [A-Z], I want a regular expression that can match the first letter no matter what it is, followed by a second letter that can be anything in [A-Z] besides the first letter.
For example, if I give you AB, you match it but if I give you AA you don't. So I am kind of looking for a regex that's something like
[A-Z][A-Z^Besides what was picked in the first set].
How could this be implemented for more occurrences of letters? Say if I want to match 3 letters without each new letter being anything from the previous ones. For instance ABC but not AAB.
Thank you!
(Mathematical) regular expressions have no context. In (f)lex -- where regular expressions are actually regular, unlike most regex libraries -- there is no such thing as a back-reference, positive or negative.
So the only way to accomplish your goal with flex patterns is to enumerate the possibilities, which is tedious for two letters and impractical for more. The two-letter case would be something like (abbreviated):
A[B-Z]|B[AC-Z]|C[ABD-Z]|D[A-CE-Z]|…|Z[A-Y]
The inverse expression also has 26 cases but is easier to type (and read). You could use (f)lex's first-longest-match rule to make use of it:
AA|BB|CC|DD|…|ZZ { /* Two identical letters */ }
[[:upper:]]{2} { /* This is the match */ }
Probably, neither of those is the best solution. However, I don't think I can give better advice without knowing more specifics. The key is knowing what action you want to take if the letters do match, which you don't specify. And what the other patterns are. (Recall that a lexical scanner is intended to divide the input into tokens, although you are free to ignore a token once it is identified.)
Flex does come with a number of useful features which can be used for more flexible token handling, including yyless (to rescan part or all of the token), yymore (to combine the match with the next token), and unput (to insert a character into the input stream). There is also REJECT, but you should try other solutions first. See the flex manual chapter on actions for more details.
So the simplest solution might be to just match any two capital letters, and then in the action check whether or not they are the same.
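The "match first, check in the action" idea can be sketched outside flex as well. The Python below is only an illustration, not flex code: distinct_tokens is a made-up helper that matches any n capital letters and then filters in ordinary code, and TWO_DISTINCT shows the back-reference form that backtracking regex libraries (unlike flex) allow.

```python
import re

# Back-reference plus negative lookahead: the second letter must differ from the first.
# This works in Python's re module, but has no equivalent in flex patterns.
TWO_DISTINCT = re.compile(r"([A-Z])(?!\1)[A-Z]")

def distinct_tokens(text, n=2):
    """Match any n capital letters, then keep only tokens whose letters all differ."""
    tokens = re.findall(r"[A-Z]{%d}" % n, text)
    return [t for t in tokens if len(set(t)) == n]

two = distinct_tokens("AB AA CD")        # the check plays the role of the flex action
three = distinct_tokens("ABC AAB", n=3)  # the same code scales to three letters
```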

A Regex to ignore a set of words

Is there a way to set regex to ignore a set of words separated by space?
I have different products names like:
"Matrix 10X, 10 ml + DISPENSER"
"Matrix 10X,10ml + DISPENSER" where the quantity varies
What I'm trying to do is to replace using regex all words except for:
"10 ml" | "10 ML" | "10ml" ---> these are to be ignored
I have found a code to replace all characters except words separated by space (like "10 ml")
https://regex101.com/r/bG8vB4/5
and to replace them when they are together (like "10ml")
https://regex101.com/r/bG8vB4/4
but can find a way to mix them together to keep just "10 ml" OR "10 ML" OR "10ml" and remove other characters up to the end of the string
Regexps are a mathematical model for efficient computer recognition of strings. It is easy to write a regular expression that matches a string if it contains any of some words, and the theory shows that a regexp that matches a string only if it contains none of those words also exists. Getting such a regexp, though, is far more complex.
In regular expression theory, a regular language is one that can be recognized by a finite automaton built from a regular expression. The automaton that accepts exactly the strings the original rejects is obtained by swapping accepting and non-accepting states (on a complete deterministic automaton). Once this is done, the hardest part is building a regular expression that matches that automaton (that is possible, but the final regular expression is in general far more complex than the original). This can be shown with a simple example, and you'll see that it is a complex thing (of course, some regexp libraries let you use an operator for this, but you don't specify whether the one you are using does). One such example is recognizing a simple C language comment. A comment is a string delimited by the sequences /* and */, but the inner part cannot contain the sequence */.
The first approach could be to use the following regexp:
\/\*.*\*\/
but that fails, as the inner regexp also recognizes */ as part of it, so /* bla bla bla */ bla bla bla */ will be recognized as one whole comment (it should end at the first */). So we need a regexp that recognizes anything except something that includes */.
Such a subexpression is:
([^*]|\*+[^*/])*
which means an indefinite concatenation of characters different from *, or runs of one or more * followed by a character that is neither * nor /. If you follow that concatenation, you'll see that it cannot contain the sequence */, because every run of * inside it is followed by something other than /. The closing delimiter then has to absorb any * that immediately precedes it, leading to our final regexp:
\/\*([^*]|\*+[^*/])*\*+\/
(now you see how things get complicated)
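A quick way to see the difference is to try the patterns in Python. The sketch below compares the naive greedy pattern with one commonly used comment pattern, /\*([^*]|\*+[^*/])*\*+/, which treats runs of asterisks explicitly. The helper name first_comment is made up for this example.

```python
import re

GREEDY = re.compile(r"/\*.*\*/")                   # naive: runs to the last */
COMMENT = re.compile(r"/\*([^*]|\*+[^*/])*\*+/")   # stops at the first */

def first_comment(pattern, source):
    """Return the first substring matched by `pattern`, or None."""
    match = pattern.search(source)
    return match.group(0) if match else None

text = "/* bla bla bla */ bla bla bla */"
greedy_result = first_comment(GREEDY, text)    # swallows the whole line
exact_result = first_comment(COMMENT, text)    # ends at the first */
```

Comments made of asterisk runs, such as /***/, are the cases worth testing against any hand-written variant of this pattern.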
To extend this to a single word (say word, which is longer than two letters), you have to consider that you can allow:
([^w]|w[^o]|wo[^r]|wor[^d])*
in the set, and if you have two words (like foo and bar) you have to write:
([^fb]|f[^o]|fo[^o]|b[^a]|ba[^r])*
(the single-character alternative must exclude the first letters of all the words, otherwise [^f] would happily consume the b of bar). This means that each word contributes its own alternatives, making the final regexp a bit complicated. These expressions are only a first approximation: there can be interactions between words if one can be the prefix of another or some share prefix characters, and even a single word can slip through when its first letter is repeated (ffoo is matched as ff, o, o). Another problem is that many regexp libraries treat the | operator as non-commutative (they try the alternatives in order and keep the first that succeeds), which can lead to erroneous results.
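Where the regexp library supports negative lookahead (Python's re module does; flex and POSIX regexps do not), the same "contains none of these words" test can be written much more directly. A minimal sketch, assuming a non-empty word list; the helper name contains_none is made up for this example:

```python
import re

def contains_none(words, text):
    """True if `text` contains none of `words` as a substring (words must be non-empty)."""
    alternatives = "|".join(re.escape(word) for word in words)
    # At every position, refuse to consume a character if a forbidden word starts there.
    pattern = re.compile(r"(?:(?!%s).)*" % alternatives, re.DOTALL)
    return pattern.fullmatch(text) is not None

ok = contains_none(["foo", "bar"], "fob bat")       # no forbidden word present
rejected = contains_none(["foo", "bar"], "ffoo")    # repeated first letter is handled
```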
You have also not explained what you mean by ignoring. Matching the words and passing over them is different from ignoring the whole line they appear on. The regexps (and the definition of the problem you need to solve) are then quite different. My explanation was in the sense of rejecting a full sentence if it contains any of the words, which is probably not what you mean. So please explain (in your question) what you mean by:
accepting that you have matched a sentence containing a word.
rejecting such a sentence.
what you are rejecting (or ignoring) at all.
Rejecting just a word simply means selecting a sentence that contains that word and marking the word so that you can pass over it. But that's a different problem, and it requires selecting sentences that do contain the word.

Pattern matching for strings independent from symbols

I need an algorithm which can find pre-defined patterns in data (which is present in the form of strings) independent of the actual symbols/characters of the data and the pattern. I only care about the relations between the symbols, not the symbols themselves. It is also legal to have different pattern symbols for the same symbol in the data. The only thing the pattern matching algorithm has to enforce is that multiple occurrences of the same symbol in the pattern are preserved. To give you an example:
The pattern is abca, so the first and the last letter are the same. For my application, an equivalent way to write this would be 1 2 3 1, where the digits are just variables. The data I have is thistextisatest. The resulting algorithm should give me two correct matches here, text and test. Because only in these two cases, the first and the fourth letter are the same, as in the pattern.
As a second example, the pattern abcd should return 12 matches (one for each position in thistextisat). Since no variable in the pattern is repeated, it is trivially matched everywhere. Even in the case of text and test, because it is legal that the variables a and d of the pattern map to the same symbol.
The goal of this algorithm should be to detect similarities in written language. Imagine having a dictionary of the English language and parsing it with the pattern unseen or equivalently 1 2 3 4 4 2. You would then see that, for example, the word belittle contains the same pattern of letters.
So, now that I hopefully made clear what I need, I have some questions:
What is this algorithm called? Is it a well-known problem that has been solved?
Are there publications on the matter? It is really hard to find anything useful when you don't know the correct search terms to separate this problem from regular pattern matching.
Is there a ready implementation of this?
I have not used Regex for anything too complicated, so I don't know if anything like this would even be possible in Regex, when you basically do not care about the symbols as such, but only consider the pattern of their occurrences.
I'd really appreciate your help!
I don't think you need regular expressions here. Your search term:
unseen
123442
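This has six characters, so index each word of your text into 6-mers, renumbering the letters of each n-mer in order of first appearance:
belittle
12,12,12,12,11,12,12 2-mers
123,123,123,122,112,123 3-mers
1234,1234,1233,1223,1123 4-mers
12345,12344,12331,12234 5-mers
123455,123442,123314 6-mers
So just looking at the 6-mers, you've got a match (123442, from elittl). To allow for the abcd (1234) case matching an abca (1231) word, an n-mer whose code is numerically smaller than the search term can also be a candidate, because mapping two pattern symbols to the same letter lowers the code. A smaller code alone does not prove a match, though (1123 is less than 1231, yet its first and fourth symbols differ), so each candidate still needs a check that positions sharing a digit in the search term also share a letter in the n-mer.
So given a search term of n characters, split each word into its constituent n-mers, encode them, and compare them against the encoded search term.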
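The n-mer idea can be sketched in Python as follows. This is a minimal sketch rather than a tuned implementation: the function names signature, fits and find are made up here, and instead of comparing codes numerically the check is done directly, by requiring that equal pattern symbols always line up with equal data symbols (different pattern symbols may still map to the same letter, as the question allows).

```python
def signature(s):
    """Renumber symbols in order of first appearance, e.g. 'unseen' -> (1, 2, 3, 4, 4, 2)."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, len(first_seen) + 1) for ch in s)

def fits(pattern, window):
    """True if equal pattern symbols always correspond to equal symbols in `window`."""
    binding = {}
    for p, c in zip(pattern, window):
        # setdefault binds p on first sight and returns the earlier binding afterwards.
        if binding.setdefault(p, c) != c:
            return False
    return True

def find(pattern, text):
    """Return every substring of len(pattern) in `text` that fits the pattern."""
    n = len(pattern)
    windows = (text[i:i + n] for i in range(len(text) - n + 1))
    return [w for w in windows if fits(pattern, w)]

abca_hits = find("abca", "thistextisatest")   # windows whose 1st and 4th letters agree
unseen_hits = find("unseen", "belittle")      # the 6-mer of belittle shaped like 'unseen'
```

Each window is checked in time proportional to the pattern length, so a whole text costs roughly len(text) times len(pattern) steps; precomputing signatures per word would be one way to index a dictionary up front.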

Negating Regular Expression for Price

I have a regular expression for matching price where decimals are optional like so,
/[0-9]+(\.[0-9]{1,2})?/
Now what I would like to do is get the inverse of the expression, but having trouble doing so. I came up with something simple like,
/[^0-9.]/g
But this allows for multiple '.' characters and more than 2 numbers after the decimal. I am using jQuery replace function on blur to correct an input price field. So if a user types in something like,
"S$4sd3.24151 . x45 blah blah text blah" or "!#%!$43.24.234asdf blah blah text blah"
it will return
43.24
Can anyone offer any suggestions for doing this?
I would do it in two steps. First replace every character that is neither a digit nor a dot with nothing.
/[^0-9.]//g
This will yield 43.24151.45 and 43.24.234 for the first and second example respectively.
Then you can use your first regex to match the first occurrence of a valid price.
/\d+(\.\d{1,2})?/
Doing this will give you 43.24 for both examples.
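The two steps can be checked with a short Python sketch (the question uses JavaScript, where String.prototype.replace and match behave analogously). The function name clean_price is made up for this example.

```python
import re

def clean_price(raw):
    """Strip everything but digits and dots, then keep the first valid price."""
    digits_and_dots = re.sub(r"[^0-9.]", "", raw)
    match = re.search(r"[0-9]+(\.[0-9]{1,2})?", digits_and_dots)
    return match.group(0) if match else ""

first = clean_price("S$4sd3.24151 . x45 blah blah text blah")
second = clean_price("!#%!$43.24.234asdf blah blah text blah")
```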
I suppose in programming, it is not always clear what "inverse" means.
To suggest a solution exclusively based on the example that you presented, I will present one that is very similar to what Vince presented. I am having difficulty composing a Regular Expression that both matches the pattern that you need and captures a potentially arbitrary number of digits, through repeating capture groups. And I am not sure whether this would be doable in some reasonable way (perhaps someone else does). But a two step approach should be straightforward.
To note, I suspect that you are referring to JavaScript's replace function, which is a member of the String Object, and not jQuery replaceWith and replaceAll functions, in referring to 'jQuery replace function.' The latter are 'Dom manipulation' functions. But, correct me if I misunderstood.
As an example, based on some hypothetical input, you could use
var numeric_raw = jQuery('input.textbox').attr ('value').replace (/[^0-9.]/g, "")
to remove all characters from a value entered in a text field that are not digits or periods;
then you could use
var numeric_str = numeric_raw.replace (/^[0]*(\d+\.\d{1,2}).*$/, "$1")
The difference between the expressions specified here and those in Vince's answer is that I am also filtering out leading 0s.
To note, in Vince's first reg ex, there might be an extra '/' -- but perhaps it has a purpose that I didn't catch.
With respect to "inverse," one way to understand your initial inquiry is that you are looking for an expression that does the opposite of the one that you provided.
To note, while the expression that you provided (/[0-9]+(\.[0-9]{1,2})?/) does match both whole numbers and decimal numbers with up to two fractional digits, it also matches any single digit, so it may identify a match where one might not be envisioned for a given input string. The expression does not have anchors ('^', '$'), and so might allow multiple possible matches. For example, in the String "1.111", both "1.11" and "1" match the pattern that you provided.
It appears to me that one pattern that matches any string that does not match your pattern, or at least does so in most cases, is the following:
/^(?:(?!.*[0-9]+(\.[0-9]{1,2})?).*)*$/
-- if someone could identify a precisely 'inverse' pattern, please feel free -- I am having some trouble understanding how lookaheads are interpreted at least for some nuances.
This relies on "negative lookahead" functionality, which JavaScript these days supports. You could refer to several stackoverflow postings for more information (eg. Regular Expressions and negating a whole character group), and there are multiple resources that could be found on the Internet that discuss "lookahead" and "lookbehind."
I suppose this answer carries some redundancy with respect to the one already given -- I might have commented on the Original Poster's post or on Vince's answer (instead of writing at least parts of my answer), but I am not yet able to make comments!

Extracting static strings from a regular expression

I'm trying to efficiently extract static strings (strings that MUST be matched for a given regular expression to match). I've been able to do it in the simplest cases but I'm trying to discover a more robust solution.
Given a regex such as the one below
"fox jump(ed|ing|s)"
would give us
"fox,jumped,jumping,jumps"
Another example is
"fox jump(ed|ing|s)?"
which would give us
"fox,jump"
because of the optional operator
The algorithm I have is overly simple for now. It will start from the end of the regex and removes groups or a single character followed by these operators "* ?" as well as "explode" grouped OR operators "(|)". This has worked quite well but doesn't take into consideration the full syntax of a regex. You can think of it as kind of a minimal set generating process for a regex (the minimal set of strings that the regex can "generate/must match").
WHY?
I'm trying to match a bunch of text against a large set of regexes. If I can get a list of "keywords" that are "required" for these regexes, I can do a quick text search for those keywords to filter the regexes I care about (ignoring the ones I am guaranteed not to match, or even skipping that text entirely and not running any regexes on it, because we are guaranteed not to have a match within our set of regexes). I can organize this set of keywords in an efficient data structure (Binary Search/Trie/Aho-Corasick) to filter the set of regexes before I even try to run the text through the Finite Automata. There are extremely fast string matching algorithms that I can run as a filtering stage before I attempt to run a regular expression. I've been able to increase throughput manyfold with this simple process.
See the library Xeger which given a regular expression will give you all the possible strings that match.
You seem to only want to keep the common prefix of these strings (the part where you said to ignore optional operators), but if you do that you might capture strings that have that common prefix yet do not have the ending you want (such as "jumpy" in your example). If this is not a problem, then just find the shortest string given by Xeger, assuming that optional operators occur only at the end of the regex.
If I understand your problem correctly, you are looking for a set of words such that all these words are (disjoint) substrings of any word accepted by the (given) regular expression.
My guess is that such a set will very often be empty, but nevertheless it can be found.
To find such a set, I propose the following algorithm:
Find the FA corresponding to your input regex.
Identify bridges ( https://en.wikipedia.org/wiki/Bridge_(graph_theory) ) between the starting state S and the accepting state F. This can for example be done by removing an edge E and asking whether a path still exists from S to F in the FA with E removed - repeat this for all edges.
All edges that are bridges must be hit during an accepting run, and each edge corresponds to a letter of input.
You may now generate the required words by connecting subsequent bridge edges end-to-end.
I think it follows from the algorithm construction that an FA (and not a DFA) suffices for this to work. Again, a proof would be nice but I think it should work:)
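The bridge idea can be tried out on a small hand-built automaton. The Python below is a sketch under several assumptions that are not in the answer: the FA is given as a list of (source, letter, target) edges listed in path order, it has a single accepting state and no epsilon transitions, and two required edges are only joined when the state between them has exactly one incoming and one outgoing edge (so nothing can come between them). All function names are made up for this example.

```python
from collections import deque

def reachable(edges, start, goal):
    """Breadth-first search: can `goal` be reached from `start` over `edges`?"""
    successors = {}
    for src, _label, dst in edges:
        successors.setdefault(src, []).append(dst)
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return True
        for nxt in successors.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def required_edges(edges, start, accept):
    """Edges whose removal leaves no path from `start` to `accept`."""
    return [edge for i, edge in enumerate(edges)
            if not reachable(edges[:i] + edges[i + 1:], start, accept)]

def required_words(edges, start, accept):
    """Concatenate required edges that are forced to follow each other directly."""
    in_deg, out_deg = {}, {}
    for src, _label, dst in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
    words, current, prev_dst = [], "", None
    for src, label, dst in required_edges(edges, start, accept):
        forced = src == prev_dst and in_deg[src] == 1 and out_deg[src] == 1
        if current and not forced:
            words.append(current)   # gap between required edges: start a new word
            current = ""
        current += label
        prev_dst = dst
    if current:
        words.append(current)
    return words

# Hand-built automaton for the regex ab(c|d)e; states are 0..4, 4 is accepting.
EDGES = [(0, "a", 1), (1, "b", 2), (2, "c", 3), (2, "d", 3), (3, "e", 4)]
keywords = required_words(EDGES, 0, 4)
```

Removing each edge and re-running the search costs one traversal per edge, which is fine for small automata; larger ones would call for a dedicated dominator or bridge-finding algorithm.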