Determining whether a string is a math formula? - regex

I'm processing math formulas from strings using the shunting-yard algorithm. I pass every string through a function which converts specific strings into values, but occasionally I get strings that are just plain strings and should simply be passed through (read: skip the shunting-yard pass). Should I just use a regular expression to test for all of the symbols and numbers? Or is there a simpler way to test for this? I guess the inverse would be checking whether there are still any letters left in the string?

While I'm sure there is a better answer, I've decided to use the following until something better comes up. I'm working in AS3:
    if ( !String( value ).match( /[a-zA-Z]/ ) && !String( value ).match( /^(-?\d+)$/ ) && String( value ).match( /[()\\*+-]/ ) )
        value = MathParser.parse( value );
The first expression verifies that there are no letters. My shunting-yard implementation doesn't process special characters or operations, it just does basic math, so if any letters remain, something is wrong.
The second verifies that the value isn't just a standalone positive or negative integer.
The third verifies that the string contains mathematical symbols. There's no point in attempting the operation if there's no math to process. I assume this check is unnecessary if I handle errors inside the shunting-yard process instead.
Comments for improvement appreciated.
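For comparison, here is a rough Python sketch of the same three checks. The function name looks_like_formula is made up for illustration, and the operator class assumes that / (division) is wanted rather than the backslash that appears in the AS3 pattern; adjust it if that assumption is wrong.

```python
import re

# Hypothetical helper; the three patterns mirror the three checks described above.
LETTERS = re.compile(r"[a-zA-Z]")
PLAIN_INTEGER = re.compile(r"-?\d+")
OPERATORS = re.compile(r"[()*/+\-]")  # assumption: '/' is intended, not '\'


def looks_like_formula(value):
    """Return True if value should be handed to the math parser."""
    text = str(value)
    if LETTERS.search(text):
        return False  # letters left over: not basic math
    if PLAIN_INTEGER.fullmatch(text):
        return False  # a standalone integer needs no parsing
    return OPERATORS.search(text) is not None  # must contain some math symbol
```

With this, looks_like_formula("3*(4-1)") is true, while "42", "-5" and "hello" are all skipped.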


C++: finding a specific digit in a number

Could someone please guide me on how to check whether a specific digit exists in an integer? For the sake of code optimization, I am trying to avoid using strings or any kind of loop that iterates over all the digits of the integer.
If I need to find out whether 4 exists in an integer, the input and output samples are given below:
Sample Input:
Sample Output:
Desired Code:
bool ifExists(int digit, int number)
{
    if ( /* condition on digit and number goes here */ )
        return true;
    return false;
}
Possible Logic:
I believe there must be a mathematical approach which will do the job in the if condition, however I am unable to find such method in cmath library.
Convert integer to string, do a string search for digit.
The "mathematical" method would have to do the same, compute the digit sequence by division/remainder by 10 and compare with the given digit.
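The division/remainder idea can be sketched as follows. This is only an illustration in Python rather than C++, and the name digit_exists is invented here; the same loop translates directly to C++ with / and %.

```python
def digit_exists(digit, number):
    """Return True if the decimal digit (0-9) occurs in the integer number."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    number = abs(number)  # ignore the sign
    if number == 0:
        return digit == 0  # the number 0 consists of the single digit 0
    while number > 0:
        number, remainder = divmod(number, 10)  # peel off the last digit
        if remainder == digit:
            return True
    return False
```

The string-based variant from the previous suggestion would be str(digit) in str(abs(number)), which does the same work behind the scenes.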
A mathematical approach might be possible, I think, by using log(x) in some crazy way.
But one thing is for sure: you should use a string or a loop that divides by 10. log(x) requires far more resources than string conversion or repeated division.
Also, a string or dividing by 10 is far easier to read than a mathematical solution.
As for the mathematical solution:
I suggest you try transforming the decimal number into a binary representation. It might then be possible to create a filter for the digit you are looking for, and by combining the filter and the transformed value with a bitwise & you could get what you seek. I'm not sure whether that is really possible, but it would be my first idea for an approach.
But as I said before, this method would be very expensive for the CPU and hard to read.

How do I implement a lexer given that I have already implemented a basic regular expression matcher?

I'm trying to implement a lexer for fun. I have already implemented a basic regular expression matcher (by first converting a pattern to an NFA and then to a DFA). Now I'm clueless about how to proceed.
My lexer would take a list of tokens and their corresponding regexes. What is the general algorithm used to create a lexer out of this?
I thought about ORing all the regexes together, but then I can't identify which specific token was matched. Even if I extend my regex module to return the pattern matched when a match is successful, how do I implement lookahead in the matcher?
Assuming you have a working regex matcher, regex_match, which returns a boolean (True if a string satisfies the regex): first, you need an ordered list of tokens (with a regex for each), tokens_regex. The order is important because it prescribes precedence.
One algorithm could be (this is not necessarily the only one):
Write a procedure next_token which takes a string, and returns the first token, its value and the remaining string (or - if an illegal/ignore character - None, the offending character and the remaining string). Note: this has to respect precedence, and should find the longest token.
Write a procedure lex which repeatedly calls next_token until the string is consumed.
Something like this (written in Python):
    tokens_regex = [(TOKEN_NAME, TOKEN_REGEX), ...]  # order describes precedence

    def next_token(remaining_string):
        for t_name, t_regex in tokens_regex:  # check in order of precedence
            # check longest possibilities first (there may be a more efficient method)
            for i in range(len(remaining_string), 0, -1):
                if regex_match(remaining_string[:i], t_regex):
                    return t_name, remaining_string[:i], remaining_string[i:]
        # either an ignored or an illegal character
        return None, remaining_string[0], remaining_string[1:]

    def lex(string):
        tokens_so_far = []
        remaining_string = string
        while len(remaining_string) > 0:
            # reassign remaining_string so the loop actually advances
            t_name, t_value, remaining_string = next_token(remaining_string)
            if t_name is not None:
                tokens_so_far.append((t_name, t_value))
            # elif not regex_match(t_value, ignore_regex):
            #     check against ignore regex; if not in it, add to an error list/illegal characters
        return tokens_so_far
Some things to add to improve your lexer: ignore regex, error lists and locations/line numbers (for these errors or for tokens).
Have fun! And good luck making a parser :).
I've done pretty much the same thing. The way I did it was to combine all the expressions into one fairly big NFA and convert that into a single DFA. When doing that, keep track of which states were accepting states in each of the original automata, along with their precedence.
The generated DFA will have many accepting states. You run this DFA until it receives a character that it has no corresponding transition for. If the DFA is then in an accepting state, you look at which of your original NFAs contributed that accepting state. The one with the highest precedence is the token you return.
This does not handle regular expression lookaheads. These are typically not really needed for lexer work anyway. That would be the job of the parser.
Such a lexer runs at much the same speed as a single regular expression, since there is basically only one DFA for it to run. You can skip converting the NFA to a DFA altogether for a lexer that is faster to construct but slower to run. The algorithm is basically the same.
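To make the combined-DFA idea concrete, here is a small Python sketch. It is not the code from the lexer mentioned in this answer: the DFA is written out by hand for three made-up token types (IF, IDENT, NUM), and every name in it is an assumption for illustration. It runs the DFA until no transition exists, then picks the highest-precedence token among those accepted in the final state.

```python
# Hypothetical token types, highest precedence first.
PRECEDENCE = ["IF", "IDENT", "NUM"]

# For each DFA state, the token types whose original automaton accepts there.
ACCEPTING = {
    1: {"IDENT"},        # after "i"
    2: {"IF", "IDENT"},  # after "if": both automata accept
    3: {"IDENT"},        # any other run of lowercase letters
    4: {"NUM"},          # a run of digits
}


def step(state, ch):
    """Return the next DFA state, or None if there is no transition."""
    if "0" <= ch <= "9":
        return 4 if state in (0, 4) else None
    if "a" <= ch <= "z":
        if state == 0:
            return 1 if ch == "i" else 3
        if state == 1:
            return 2 if ch == "f" else 3
        if state in (2, 3):
            return 3
    return None


def next_token(text, pos):
    """Run the DFA from pos until it gets stuck, then resolve by precedence."""
    state, i = 0, pos
    while i < len(text):
        nxt = step(state, text[i])
        if nxt is None:
            break
        state, i = nxt, i + 1
    candidates = ACCEPTING.get(state, set())
    for name in PRECEDENCE:
        if name in candidates:
            return name, text[pos:i], i
    return None, text[pos], pos + 1  # nothing matched: skip one character


def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        name, value, pos = next_token(text, pos)
        if name is not None:
            tokens.append((name, value))
    return tokens
```

In this toy DFA every non-start state is accepting, so checking the state where the run stops is enough. A DFA with non-accepting intermediate states would also need to remember the last accepting state it passed through.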
The source code for the lexer I wrote is freely available on github, should you want to see how I did it.

Stream or Iterator to generate all strings that match a regular expression?

This is a follow-up to my previous question.
Suppose I want to generate all strings that match a given (simplified) regular expression.
It is just a coding exercise and I do not have any additional requirements (e.g. how many strings are generated actually). So the main requirement is to produce nice, clean, and simple code.
I thought about using Stream but after reading this question I am thinking about Iterator. What would you use?
The solution to this question asks for too much code for it to be practical to answer here, but the outline goes as follows.
First, you want to parse your regular expression--you can look into parser combinators for this, for example. You'll then have an evaluation tree that looks like, for example,
Rather than running this expression tree as a matcher, you can run it as a generator by defining a generate method on each term. For some terms, (e.g. ZeroOrOne(Constant("d"))), there will be multiple options, so you can define an iterator. One way to do this is to store internal state in each term and pass in either an "advance" flag or a "reset" flag. On "reset", the generator returns the first possible match (e.g. ""); on advance, it goes to the next one and returns that (e.g. "d") while consuming the advance flag (leaving the rest to evaluate with no flags). If there are no more items, it produces a reset instead for everything inside itself and leaves the advance flag intact for the next item. You start by running with a reset; on each iteration, you put an advance in, and stop when you get it out again.
Of course, some regex constructs like "d+" can produce infinitely many values, so you'll probably want to limit them in some way (or at some point return e.g. d...d meaning "lots"). Others have very many possible values (e.g. . matches any char, but do you really want all 64k chars, or however many Unicode code points there are?), and you may wish to restrict those as well.
Anyway, this, though time-consuming, will result in a working generator. And, as an aside, you'll also have a working regex matcher, if you write a match routine for each piece of the parsed tree.
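As a rough illustration of "run the tree as a generator", here is a Python sketch. It swaps the advance/reset flag protocol described above for Python generators, which keep the iteration state implicitly. Constant and ZeroOrOne follow the names used above; Alternation, Sequence and Repeat are assumed names, and Repeat takes explicit bounds so that constructs like "d+" stay finite.

```python
from itertools import product


class Constant:
    def __init__(self, text):
        self.text = text

    def generate(self):
        yield self.text


class ZeroOrOne:
    def __init__(self, term):
        self.term = term

    def generate(self):
        yield ""  # the "zero" case comes first
        yield from self.term.generate()


class Alternation:
    def __init__(self, *terms):
        self.terms = terms

    def generate(self):
        for term in self.terms:
            yield from term.generate()


class Sequence:
    def __init__(self, *terms):
        self.terms = terms

    def generate(self):
        # Every combination of one option per term, in order.
        options = [list(term.generate()) for term in self.terms]
        for parts in product(*options):
            yield "".join(parts)


class Repeat:
    """term repeated between lo and hi times (hi caps unbounded + and *)."""

    def __init__(self, term, lo, hi):
        self.term, self.lo, self.hi = term, lo, hi

    def generate(self):
        options = list(self.term.generate())
        for count in range(self.lo, self.hi + 1):
            for parts in product(options, repeat=count):
                yield "".join(parts)
```

For example, the tree for ab?(c|d) is Sequence(Constant("a"), ZeroOrOne(Constant("b")), Alternation(Constant("c"), Constant("d"))), and iterating its generate() yields "ac", "ad", "abc", "abd". Note that this sketch does not remove duplicates, which can appear when a repeated term can itself be empty.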

Create regex from samples algorithm

AFAIK no one has implemented an algorithm that takes a set of strings and substrings and gives back one or more regular expressions that would match the given substrings inside the strings. So, for instance, if I gave my algorithm these two samples:
string1 = "fwef 1234 asdfd"
substring1 = "1234"
string2 = "asdf456fsdf"
substring2 = "456"
The algorithm would give me the regular expression "[0-9]*" back. I know it could give back more than one regex, or even no regex at all, and you might find 1000 reasons why such an algorithm would be close to impossible to implement to perfection. But what's the closest thing?
I don't really care about the regex itself, either. Basically what I want is an algorithm that takes samples like the ones above and then finds a pattern in them that can be used to easily find the "kind" of text I want in a string, without having to write any regex or code manually.
I don't have proof but I suspect no such discrete algorithm with a finite output could exist since you are asking for the creation of a regular language which could be "large" in respect to the input size.
With that, I suggest you peek at txt2re which can break down sample texts one-by-one and help you build regexes.
Flash Fill, a new feature of MS Excel 2013, would do exactly the task you want, but it does not give you the regular expression. It's an NP-complete problem and an open question for practical purposes. If you're interested in how to synthesise string manipulation from multiple examples, go to the official Flash Fill website and read a few of the papers. They have pseudo-code and demo movies as well.
There are many such algorithms, in fact. This is a research area called "grammatical inference".
I know RPNI, for example (you could also look at the probabilistic branch: ALERGIA, MDI, DEES). These algorithms generate DFAs (deterministic finite automata). In fact, you don't need to supply the full strings from your example at all, only the substrings.
There are also some algorithms that directly generate nondeterministic automata.
Of course, getting the regular expression from a nondeterministic automaton is easy.
The main ideas are simple:
Generate a prefix tree automaton (usually called a prefix tree acceptor, PTA) from your sample.
Then you have to try to "merge" some states. Loops (i.e. * in the regular expression) emerge from these merges. The whole difficulty lies in choosing the right rule for merging.
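The first of these two steps, building the prefix tree automaton, can be sketched in a few lines of Python. This is only an illustration with made-up function names; the state-merging step, which is where algorithms like RPNI actually differ, is deliberately left out.

```python
def build_prefix_tree(samples):
    """Build a prefix tree automaton: state 0 is initial, one state per distinct prefix."""
    transitions = {0: {}}  # state -> {character: next state}
    accepting = set()
    for sample in samples:
        state = 0
        for ch in sample:
            if ch not in transitions[state]:
                new_state = len(transitions)  # next unused state number
                transitions[new_state] = {}
                transitions[state][ch] = new_state
            state = transitions[state][ch]
        accepting.add(state)  # the state reached by a whole sample accepts
    return transitions, accepting


def accepts(transitions, accepting, text):
    """Check whether the automaton accepts text."""
    state = 0
    for ch in text:
        state = transitions[state].get(ch)
        if state is None:
            return False
    return state in accepting
```

Before any merging, the automaton accepts exactly the samples it was built from, so build_prefix_tree(["1234", "456"]) accepts "456" but not "45" or "789". Generalisation only appears once states are merged.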
Here you go:
re = '|'.join(substrings)
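One caveat with the one-liner: it misbehaves if a sample contains regex metacharacters such as + or (. A slightly more careful Python sketch (the name build_pattern is invented here) escapes each sample first:

```python
import re


def build_pattern(substrings):
    """Build an alternation that matches exactly the given sample substrings."""
    # Longest first, so a longer sample wins over a sample that is its prefix.
    ordered = sorted(set(substrings), key=len, reverse=True)
    return "|".join(re.escape(s) for s in ordered)
```

With the samples from the question, re.findall(build_pattern(["1234", "456"]), "fwef 1234 asdfd") finds "1234". This still only enumerates the samples; it does not generalise to other digit strings.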
If you want anything more general, your algorithm is going to have to make educated guesses about what type of strings are acceptable as matches, and it's trivial to demonstrate that no procedure can account for all possible sets of possible inputs without simply enumerating them all. For instance, consider some of these scenarios:
Match all prime numbers
Match hexadecimal strings, but no strings containing 'f' are in the sample set
Match the same string repeated twice
Match any even-length string
The root problem is that your question is incompletely specified. If you have a more specific requirement, that might be solvable, depending on what it is.

Random string that matches a regexp

How would you go about creating a random alpha-numeric string that matches a certain regular expression?
This is specifically for creating initial passwords that fulfill regular password requirements.
Welp, just musing, but the general question of generating random inputs that match a regex sounds doable to me for a sufficiently relaxed definition of random and a sufficiently tight definition of regex. I'm thinking of the classical formal definition, which allows only ()|* and alphabet characters.
Regular expressions can be mapped to formal machines called finite automata. Such a machine is a directed graph with a particular node called the final state, a node called the initial state, and a letter from the alphabet on each edge. A word is accepted by the regex if it's possible to start at the initial state and traverse one edge labeled with each character through the graph and end at the final state.
One could build the graph, then start at the final state and traverse random edges backwards, keeping track of the path. In a standard construction, every node in the graph is reachable from the initial state, so you do not need to worry about making irrecoverable mistakes and needing to backtrack. If you reach the initial state, stop, and read off the path going forward. That's your match for the regex.
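As a toy illustration of the backward walk, here is a Python sketch over a hand-built automaton for the regex ab*c. The edge list, state numbers and function name are all assumptions for the example, and the step cap is an addition because, as discussed, nothing guarantees the walk reaches the initial state quickly.

```python
import random

# Hypothetical automaton for ab*c: state 0 is initial, state 2 is final.
EDGES = [(0, "a", 1), (1, "b", 1), (1, "c", 2)]  # (source, character, target)
INITIAL, FINAL = 0, 2


def random_backward_walk(rng, max_steps=1000):
    """Walk random edges backwards from FINAL; return the word read forwards."""
    state, chars, steps = FINAL, [], 0
    while state != INITIAL:
        if steps == max_steps:
            return None  # gave up: the walk did not reach the initial state
        incoming = [(src, ch) for src, ch, dst in EDGES if dst == state]
        if not incoming:
            return None  # dead end: no edge leads into this state
        state, ch = rng.choice(incoming)
        chars.append(ch)
        steps += 1
    return "".join(reversed(chars))  # collected back to front
```

Calling random_backward_walk(random.Random()) returns words such as "ac" or "abbc". Each extra "b" is half as likely as the one before, which shows why the sense in which the result is "random" needs thought.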
There's no particular guarantee about when or if you'll reach the initial state, though. One would have to figure out in what sense the generated strings are 'random', and in what sense you are hoping for a random element from the language in the first place.
Maybe that's a starting point for thinking about the problem, though!
Now that I've written that out, it seems to me that it might be simpler to repeatedly resolve choices to simplify the regex pattern until you're left with a simple string. Find the first non-alphabet character in the pattern. If it's a *, replicate the preceding item some number of times and remove the *. If it's a |, choose which of the OR'd items to preserve and remove the rest. For a left paren, do the same, but looking at the character following the matching right paren. This is probably easier if you parse the regex into a tree representation first that makes the paren grouping structure easier to work with.
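Working on a parsed tree, the "resolve the choices" idea might look like the following Python sketch. The tuple-based tree format and the function name random_match are invented for illustration, and max_repeat is an arbitrary cap on how often a * item is replicated.

```python
import random


def random_match(node, rng, max_repeat=3):
    """Resolve every choice in a regex tree at random and return one matching string.

    Nodes are tuples: ("lit", text), ("cat", [nodes]), ("alt", [nodes]), ("star", node).
    """
    kind = node[0]
    if kind == "lit":
        return node[1]
    if kind == "cat":
        return "".join(random_match(child, rng, max_repeat) for child in node[1])
    if kind == "alt":
        # Keep one of the OR'd items and drop the rest.
        return random_match(rng.choice(node[1]), rng, max_repeat)
    if kind == "star":
        # Replicate the item some number of times (possibly zero).
        count = rng.randint(0, max_repeat)
        return "".join(random_match(node[1], rng, max_repeat) for _ in range(count))
    raise ValueError("unknown node kind: %r" % (kind,))


# Tree for the regex (ab|c)*d
TREE = ("cat", [("star", ("alt", [("lit", "ab"), ("lit", "c")])), ("lit", "d")])
```

random_match(TREE, random.Random()) then returns strings such as "d", "abd" or "cabcd". Each repetition of the starred item resolves its inner choice independently.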
To the person who worried that deciding if a regex actually matches anything is equivalent to the halting problem: Nope, regular languages are quite well behaved. You can tell if any two regexes describe the same set of accepted strings. You basically make the machine above, then follow an algorithm to produce a canonical minimal equivalent machine. Do that for two regexes, then check if the resulting minimal machines are equivalent, which is straightforward.
String::Random in Perl will generate a random string from a subset of regular expressions:
use strict;
use warnings;
use String::Random qw/random_regex/;
print random_regex('[A-Za-z]{3}[0-9][A-Z]{2}[!@#$%^&*]'), "\n";
If you have a specific problem, you probably have a specific regular expression in mind. I would take that regular expression, work out what it means in simple human terms, and work from there.
I suspect it's possible to create a general regex random match generator, but it's likely to be much more work than just handling a specific case - even if that case changes a few times a year.
(Actually, it may not be possible to generate random matches in the most general sense - I have a vague memory that the problem of "does any string match this regex" is the halting problem in disguise. With a very cut-down regex language you may have more luck though.)
I have written Parsley, which consists of a Lexer and a Generator.
The Lexer converts a regular expression-like string into a sequence of tokens.
The Generator uses these tokens to produce a defined number of codes.
$generator = new \Gajus\Parsley\Generator();
/**
 * Generate a set of random codes based on Parsley pattern.
 * Codes are guaranteed to be unique within the set.
 *
 * @param string $pattern Parsley pattern.
 * @param int $amount Number of codes to generate.
 * @param int $safeguard Number of additional codes generated in case there are duplicates that need to be replaced.
 * @return array
 */
$codes = $generator->generateFromPattern('FOO[A-Z]{10}[0-9]{2}', 100);
The above example will generate an array containing 100 codes, each prefixed with "FOO", followed by 10 characters from "ABCDEFGHKMNOPRSTUVWXYZ23456789" haystack and 2 numbers from "0123456789" haystack.
This PHP library looks promising: ReverseRegex
Like all of these, it only handles a subset of regular expressions but it can do fairly complex stuff like UK Postcodes:
([A-PR-UWYZ]([0-9]([0-9]|[A-HJKSTUW])?|[A-HK-Y][0-9]([0-9]|[ABEHMNPRVWXY])?) ?[0-9][ABD-HJLNP-UW-Z]{2}|GIR0AA)
B6 6SB
P9 7EX
N9 2DH
GQ28 4UL
You'd need to write a string generator that can parse regular expressions and generate random members of character ranges for random lengths, etc.
Much easier would be to write a random password generator with certain rules (starts with a lower case letter, has at least one punctuation, capital letter and number, at least 6 characters, etc) and then write your regex so that any passwords created with said rules are valid.
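A minimal Python sketch of that rule-based approach follows, using the example rules listed above (starts with a lowercase letter; at least one punctuation character, capital letter and digit; at least 6 characters). The names generate_password and PASSWORD_RE and the default length of 10 are assumptions for illustration.

```python
import re
import secrets
import string

# A regex written so that everything the generator produces is valid.
PASSWORD_RE = re.compile(r"^[a-z](?=.*[A-Z])(?=.*\d)(?=.*[^A-Za-z0-9]).{5,}$")


def generate_password(length=10):
    """Generate a password that satisfies the rules by construction."""
    if length < 6:
        raise ValueError("length must be at least 6")
    first = secrets.choice(string.ascii_lowercase)  # rule: starts lowercase
    required = [
        secrets.choice(string.punctuation),      # rule: one punctuation mark
        secrets.choice(string.ascii_uppercase),  # rule: one capital letter
        secrets.choice(string.digits),           # rule: one digit
    ]
    alphabet = string.ascii_letters + string.digits + string.punctuation
    filler = [secrets.choice(alphabet) for _ in range(length - 1 - len(required))]
    rest = required + filler
    secrets.SystemRandom().shuffle(rest)  # hide where the required characters sit
    return first + "".join(rest)
```

Because the required characters are placed deliberately rather than hoped for, no generated password ever has to be thrown away, and PASSWORD_RE.match(generate_password()) always succeeds.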
Presuming you have both a minimum length and 3-of-4* (or similar) requirement, I'd just be inclined to use a decent password generator.
I've built a couple in the past (both web-based and command-line), and have never had to skip more than one generated string to pass the 3-of-4 rule.
* 3-of-4: must have at least three of the following characteristics: lowercase, uppercase, number, symbol.
It is possible (for example, Haskell regexp module has a test suite which automatically generates strings that ought to match certain regexes).
However, for a simple task at hand you might be better off taking a simple password generator and filtering its output with your regexp.