Parsing Natural Language - c++

What's the most efficient way to parse natural language?
Let "strings" be a map<string, void (*)(int, char**)> containing strings such as:
Set the alarm for *.
Call *.
Get me an * at * for *.
and their corresponding functions. Now suppose "input" is a string containing a sentence like:
Call David.
How can I implement a function such as parse that takes "input", matches it against one of the strings in the map, and then calls the corresponding function, passing it argc and argv containing all the wildcard entries (the * in the strings)? What's the most efficient way to implement such a function?

Not sure why this question got a downvote. It's well-posed and non-trivial.
There are plenty of academic approaches to parsing, which are mostly needed for degenerate grammars. "natural language" is perhaps not a well-defined term, and natural languages do have some ambiguity, but such constrained subsets are not problematic.
In this specific example, we see that the different production rules (map entries) are not mutually ambiguous. In fact, the first token is sufficient for disambiguation. And since a std::map is sorted, we can do an efficient O(log N) search for that token.
Hence, we only need to derive the substitutions. Again, we'll ignore the degenerate cases. Nobody is going to bother with "Get me an at at at for at.", even though it parses unambiguously.
Instead, for substitutions you simply collect tokens until you get the expected next token. Get me an * at * for *. means that the first * gets all tokens up to at, the second * collects tokens up to for, and the final * gets all remaining tokens.
You see that no backtracking is needed. If parsing fails, there simply is no match.
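The token-collecting approach above can be sketched roughly as follows. This is only an illustration under some assumptions: tokens are split on whitespace, a trailing period is ignored, and the handler takes a vector of captured strings instead of argc/argv. The names tokenize, match, parse and Handler are hypothetical. For brevity it scans all patterns linearly rather than using the O(log N) first-token lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Split on whitespace; a trailing '.' is dropped, so "Call David." -> {"Call", "David"}.
std::vector<std::string> tokenize(std::string text) {
    if (!text.empty() && text.back() == '.') text.pop_back();
    std::istringstream in(text);
    std::vector<std::string> tokens;
    for (std::string word; in >> word;) tokens.push_back(word);
    return tokens;
}

// Match input tokens against one pattern. Each "*" collects tokens until the
// pattern's next literal token shows up; a trailing "*" takes everything left.
// Returns the captured wildcard values, or nullopt when there is no match.
std::optional<std::vector<std::string>> match(const std::vector<std::string>& pattern,
                                              const std::vector<std::string>& input) {
    std::vector<std::string> captures;
    std::size_t i = 0;
    for (std::size_t p = 0; p < pattern.size(); ++p) {
        if (pattern[p] != "*") {  // literal token: must match exactly
            if (i >= input.size() || input[i] != pattern[p]) return std::nullopt;
            ++i;
            continue;
        }
        const bool last = (p + 1 == pattern.size());
        std::string value;
        while (i < input.size() && (last || input[i] != pattern[p + 1])) {
            if (!value.empty()) value += ' ';
            value += input[i++];
        }
        if (value.empty()) return std::nullopt;  // a wildcard must capture something
        captures.push_back(value);
    }
    if (i != input.size()) return std::nullopt;  // leftover input: no match
    return captures;
}

// Hypothetical handler type, used here instead of void (*)(int, char**).
using Handler = void (*)(const std::vector<std::string>&);

// Try each rule in turn and call the handler of the first one that matches.
bool parse(const std::map<std::string, Handler>& rules, const std::string& input) {
    const auto tokens = tokenize(input);
    for (const auto& [pattern, handler] : rules) {
        if (auto captures = match(tokenize(pattern), tokens)) {
            assert(handler != nullptr);
            handler(*captures);
            return true;
        }
    }
    return false;
}
```

No backtracking happens anywhere: each wildcard stops at the first occurrence of the next literal, so an input that needs a later occurrence is simply reported as not matching.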

Related

Why would I ever want to use Maybe instead of a List?

Seeing as the Maybe type is isomorphic to the set of null and singleton lists, why would anyone ever want to use the Maybe type when I could just use lists to accommodate absence?
Because if you match a list against the patterns [] and [x] that's not an exhaustive match and you'll get a warning about that, forcing you to either add another case that'll never get called or to ignore the warning.
Matching a Maybe against Nothing and Just x however is exhaustive. So you'll only get a warning if you fail to match one of those cases.
If you choose your types such that they can only represent values that you may actually produce, you can rely on non-exhaustiveness warnings to tell you about bugs in your code where you forget to check for a given case. If you choose more "permissive" types, you'll always have to think about whether a warning represents an actual bug or just an impossible case.
You should strive to have accurate types. Maybe expresses that there is exactly one value or that there is none. Many imperative languages represent the "none" case by the value null.
If you chose a list instead of Maybe, all your functions would be faced with the possibility that they get a list with more than one member. Probably many of them would only be defined for one value, and would have to fail on a pattern match. By using Maybe, you avoid a class of runtime errors entirely.
Building on existing (and correct) answers, I'll mention a typeclass based answer.
Different types convey different intentions - returning a Maybe a represents a computation with the possibility of failing while [a] could represent non-determinism (or, in simpler terms, multiple possible return values).
This plays into the fact that different types have different instances for typeclasses - and these instances cater to the underlying essence the type conveys. Take Alternative and its operator (<|>) which represents what it means to combine (or choose) between arguments given.
Maybe a: combining computations that can fail just means taking the first that is not Nothing.
[a]: combining two computations that each had multiple return values just means concatenating all possible values.
Then, depending on which types your functions use, (<|>) would behave differently. Of course, you could argue that you don't need (<|>) or anything like that, but then you are missing out on one of Haskell's main strengths: its many high-level combinator libraries.
As a general rule, we like our types to be as snug fitting and intuitive as possible. That way, we are not fighting the standard libraries and our code is more readable.
Lisp, Scheme, Python, Ruby, JavaScript, etc., manage to get along with just one type each, which you could represent in Haskell with a big sum type. Every function handling a JavaScript (or whatever) value must be prepared to receive a number, a string, a function, a piece of the document object model, etc., and throw an exception if it gets something unexpected. People who program in typed languages like Haskell prefer to limit the number of unexpected things that can occur. They also like to express ideas using types, making types useful (and machine-checked) documentation. The closer the types come to representing the intended meaning, the more useful they are.
Because a list can hold any number of elements, whereas a Maybe value has exactly two shapes: Nothing or Just x. It perfectly represents one thing or the absence of something, without any other possibility.
Several answers have mentioned exhaustiveness as a factor here. I think it is a factor, but not the biggest one, because there is a way to consistently treat lists as if they were Maybes, which the listToMaybe function illustrates:
listToMaybe :: [a] -> Maybe a
listToMaybe [] = Nothing
listToMaybe (a:_) = Just a
That's an exhaustive pattern match, which rules out any straightforward errors.
The factor I'd highlight as bigger is that by using the type that more precisely models the behavior of your code, you eliminate potential behaviors that would be possible if you used a more general alternative. Say for example you have some context in your code where you use a type of the form a -> [b], though the only correct alternatives (given your program's specification) are empty or singleton lists. Try as hard as you may to enforce the convention that this context should obey that rule, it's still possible that you'll mess up and:
Somehow a function used in that context will produce a list of two or more items;
And somehow a function that uses the results produced in that context will observe whether the lists have two or more items, and behave incorrectly in that case.
Example: some code that expects there to be no more than one value will blindly print the contents of the list and thus print multiple items when only one was supposed to be.
But if you use Maybe, then there really must be either one value or none, and the compiler enforces this.
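The same trade-off can be illustrated outside Haskell. The sketch below uses C++, with std::optional standing in for Maybe and std::vector for the list; this is an analogy rather than a translation, and the function names are hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// With a vector, "at most one result" is only a convention, so every consumer
// has to decide what to do when size() > 1.
std::string describe_list(const std::vector<int>& result) {
    if (result.empty()) return "none";
    if (result.size() == 1) return "one: " + std::to_string(result[0]);
    return "unexpected: more than one value";  // a state the type failed to rule out
}

// With optional, the type itself rules out the third case.
std::string describe_optional(const std::optional<int>& result) {
    if (!result) return "none";
    return "one: " + std::to_string(*result);
}

// Rough counterpart of listToMaybe: keep the first element, if there is one.
std::optional<int> first_or_none(const std::vector<int>& xs) {
    if (xs.empty()) return std::nullopt;
    return xs.front();
}
```

describe_list needs a branch for a case that should never occur, while describe_optional has nothing left to forget.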
Even though the two representations are isomorphic, a tool such as QuickCheck will run slower on the list version, because the space of values it has to search is larger.

Return Set for a Command Parser

I need to write a parser to parse commands. 5 such commands are:
"a=10"
"b=foo"
"c=10,10"
"clear d"
"c push_back 2"
In the case of the first example, set is the command, a is the object and 10 is the value.
What do you think the parser should return for each line above?
Here is my idea:
"a=10" -> SET (COMMAND_ENUM), INT (VALUE_TYPE), "a", ("10")
"b=foo" -> SET (COMMAND_ENUM), STRING (VALUE_TYPE), "b", ("foo")
Is this a good approach? What is the standard approach for this problem? Should I dispatch instead?
I have a function which checks the type associated with an object. For example, a above is of type INT and must be assigned an INT value, otherwise the parser should return or throw an error of some sort. I also have a convert function for converting values from strings to the desired type. These throw if the conversion is not possible. If the parser tries to convert the values from strings to the required type, then it is probably a good idea to return them via a boost::variant.
You need to come up with at least a semi-formal grammar for the command language you want to recognize, since you've left a whole lot of things vaguely specified. For example, in b=foo you want b to be a variable name but foo to be a string literal. How do you distinguish them? Does a sequence of characters represent an identifier if it's on the left side of an assignment, but a literal if it's on the right side? Or does a single character represent an identifier, but multiple characters represent a literal? In c=10,10 does 10,10 represent a list or a vector? Writing a grammar will at least force you to think about such things, and it will also serve at least as a guide to how to write your parser (at most it will be something that can be automatically translated into your parser).
You're on the right track by thinking of how statements should be represented as Abstract Syntax Trees (ASTs), but you need to take a step backwards and look at what you want in terms of concrete syntax.
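As a rough illustration of what such a parser might return, here is a sketch built on an assumed grammar that the question does not actually state: name=value[,value...], clear name, and name push_back value, where an all-digit literal is an INT and anything else is a STRING. It uses std::variant in place of boost::variant, and the names Command, CommandKind, parse_value and parse_command are hypothetical. Checking a value against the declared type of the target object is left out.

```cpp
#include <cassert>
#include <cctype>
#include <optional>
#include <sstream>
#include <string>
#include <variant>
#include <vector>

enum class CommandKind { Set, Clear, PushBack };
using Value = std::variant<int, std::string>;

struct Command {
    CommandKind kind;
    std::string target;         // the object the command acts on
    std::vector<Value> values;  // empty for Clear
};

// Assumed literal rule: all digits means INT, anything else is a STRING.
Value parse_value(const std::string& text) {
    bool digits = !text.empty();
    for (unsigned char ch : text) digits = digits && std::isdigit(ch);
    if (digits) return std::stoi(text);
    return text;
}

// Returns the parsed command, or nullopt if the line fits none of the forms.
std::optional<Command> parse_command(const std::string& line) {
    const auto eq = line.find('=');
    if (eq != std::string::npos) {  // name=value[,value...]
        Command cmd{CommandKind::Set, line.substr(0, eq), {}};
        std::istringstream rest(line.substr(eq + 1));
        for (std::string item; std::getline(rest, item, ',');)
            cmd.values.push_back(parse_value(item));
        if (cmd.target.empty() || cmd.values.empty()) return std::nullopt;
        return cmd;
    }
    std::istringstream in(line);
    std::vector<std::string> words;
    for (std::string word; in >> word;) words.push_back(word);
    if (words.size() == 2 && words[0] == "clear")
        return Command{CommandKind::Clear, words[1], {}};
    if (words.size() == 3 && words[1] == "push_back")
        return Command{CommandKind::PushBack, words[0], {parse_value(words[2])}};
    return std::nullopt;
}
```

Returning one tagged structure keeps parsing separate from execution; a dispatcher can then switch on kind and visit each variant value.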

Stream or Iterator to generate all strings that match a regular expression?

This is a follow-up to my previous question.
Suppose I want to generate all strings that match a given (simplified) regular expression.
It is just a coding exercise and I do not have any additional requirements (e.g. on how many strings are actually generated). So the main requirement is to produce nice, clean, and simple code.
I thought about using Stream but after reading this question I am thinking about Iterator. What would you use?
The solution to this question asks for too much code for it to be practical to answer here, but the outline goes as follows.
First, you want to parse your regular expression--you can look into parser combinators for this, for example. You'll then have an evaluation tree that looks like, for example,
List(
  Constant("abc"),
  ZeroOrOne(Constant("d")),
  Constant("efg"),
  OneOf(Constant("h"), List(Constant("ij"), ZeroOrOne(Constant("klmnop")))),
  Constant("qrs"),
  AnyChar()
)
Rather than running this expression tree as a matcher, you can run it as a generator by defining a generate method on each term. For some terms, (e.g. ZeroOrOne(Constant("d"))), there will be multiple options, so you can define an iterator. One way to do this is to store internal state in each term and pass in either an "advance" flag or a "reset" flag. On "reset", the generator returns the first possible match (e.g. ""); on advance, it goes to the next one and returns that (e.g. "d") while consuming the advance flag (leaving the rest to evaluate with no flags). If there are no more items, it produces a reset instead for everything inside itself and leaves the advance flag intact for the next item. You start by running with a reset; on each iteration, you put an advance in, and stop when you get it out again.
Of course, some regex constructs like "d+" can produce infinitely many values, so you'll probably want to limit them in some way (or at some point return e.g. d...d meaning "lots"); and others have very many possible values (e.g. . matches any char, but do you really want all 64k chars, or however many Unicode code points there are?), and you may wish to restrict those also.
Anyway, this, though time-consuming, will result in a working generator. And, as an aside, you'll also have a working regex matcher, if you write a match routine for each piece of the parsed tree.
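To make the "run the tree as a generator" idea concrete, here is a small sketch, written in C++ rather than Scala. It swaps the stateful advance/reset iterator described above for a simpler eager enumeration that returns every string at once, and it only covers the finite constructs (Constant, ZeroOrOne, OneOf, and a sequence); AnyChar and unbounded repetition are left out. The Node type and helper names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct Node {
    enum class Kind { Constant, ZeroOrOne, OneOf, Sequence } kind;
    std::string text;            // used by Constant only
    std::vector<Node> children;  // operands of the other kinds
};

Node constant(std::string s) { return {Node::Kind::Constant, std::move(s), {}}; }
Node zero_or_one(Node n) { return {Node::Kind::ZeroOrOne, "", {std::move(n)}}; }
Node one_of(std::vector<Node> alts) { return {Node::Kind::OneOf, "", std::move(alts)}; }
Node sequence(std::vector<Node> parts) { return {Node::Kind::Sequence, "", std::move(parts)}; }

// Eagerly list every string the tree can produce.
std::vector<std::string> generate(const Node& node) {
    switch (node.kind) {
        case Node::Kind::Constant:
            return {node.text};
        case Node::Kind::ZeroOrOne: {  // the empty string first, then the operand's strings
            std::vector<std::string> out{std::string{}};
            for (const auto& s : generate(node.children.at(0))) out.push_back(s);
            return out;
        }
        case Node::Kind::OneOf: {  // union of all alternatives, in order
            std::vector<std::string> out;
            for (const Node& alt : node.children)
                for (const auto& s : generate(alt)) out.push_back(s);
            return out;
        }
        case Node::Kind::Sequence: {  // every prefix combined with every continuation
            std::vector<std::string> out{std::string{}};
            for (const Node& part : node.children) {
                const auto tails = generate(part);
                std::vector<std::string> next;
                for (const auto& head : out)
                    for (const auto& tail : tails) next.push_back(head + tail);
                out = std::move(next);
            }
            return out;
        }
    }
    return {};
}
```

Eager enumeration is the shortest thing that works for small trees; the lazy iterator from the answer becomes worthwhile once the number of results grows large.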

what exactly is a token, in relation to parsing

I have to use a parser and writer in C++. I am trying to implement the functions, but I do not understand what a token is. One of my functions is supposed to check whether there are more tokens to produce:
bool Parser::hasMoreTokens()
How exactly do I go about this?
I am opening a text file with text in it; all words are lowercase. How do I check whether it has more tokens?
This is what I have:
bool Parser::hasMoreTokens() {
    while(source.peek()!=NULL){
        return true;
    }
    return false;
}
Tokens are the output of lexical analysis and the input to parsing. Typically they are things like
numbers
variable names
parentheses
arithmetic operators
statement terminators
That is, roughly, the biggest things that can be unambiguously identified by code that just looks at its input one character at a time.
One note, which you should feel free to ignore if it confuses you: The boundary between lexical analysis and parsing is a little fuzzy. For instance:
Some programming languages have complex-number literals that look, say, like 2+3i or 3.2e8-17e6i. If you were parsing such a language, you could make the lexer gobble up a whole complex number and make it into a token; or you could have a simpler lexer and a more complicated parser, and make (say) 3.2e8, -, 17e6i be separate tokens; it would then be the parser's job (or even the code generator's) to notice that what it's got is really a single literal.
In some programming languages, the lexer may not be able to tell whether a given token is a variable name or a type name. (This happens in C, for instance.) But the grammar of the language may distinguish between the two, so that you'd like "variable foo" and "type name foo" to be different tokens. (This also happens in C.) In this case, it may be necessary for some information to be fed back from the parser to the lexer so that it can produce the right sort of token in each case.
So "what exactly is a token?" may not always have a perfectly well defined answer.
A token is whatever you want it to be. Traditionally (and for
good reasons), language specifications broke the analysis into
two parts: the first part broke the input stream into tokens,
and the second parsed the tokens. (Theoretically, I think you
can write any grammar in only a single level, without using
tokens—or what is the same thing, using individual
characters as tokens. I wouldn't like to see the results of
that for a language like C++, however.) But the definition of
what a token is depends entirely on the language you are
parsing: most languages, for example, treat white space as
a separator (but not Fortran); most languages will predefine
a set of punctuation/operators using punctuation characters, and
not allow these characters in symbols (but not COBOL, where
"abc-def" would be a single symbol). In some cases (including
in the C++ preprocessor), what is a token depends on context, so
you may need some feedback from the parser. (Hopefully not;
that sort of thing is for very experienced programmers.)
One thing is probably sure (unless each character is a token):
you'll have to read ahead in the stream. You typically can't
tell whether there are more tokens by just looking at a single
character. I've generally found it useful, in fact, for the
tokenizer to read a whole token at a time, and keep it until the
parser needs it. A function like hasMoreTokens would in fact
scan a complete token.
(And while I'm at it, if source is an istream:
istream::peek does not return a pointer, but an int.)
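A minimal sketch of that read-ahead idea, assuming tokens are simply whitespace-separated words (which fits the lowercase text file in the question) and that source is a std::istream. The member names lookahead_ and buffered_ and the nextToken function are hypothetical.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

class Parser {
public:
    explicit Parser(std::istream& source) : source_(source) {}

    // Scans one complete token ahead and keeps it until nextToken() takes it.
    // Peeking at a single character is not enough: trailing whitespace would
    // look like "more input" even though no token follows.
    bool hasMoreTokens() {
        if (!buffered_) buffered_ = static_cast<bool>(source_ >> lookahead_);
        return buffered_;
    }

    // Hands out the buffered token; call hasMoreTokens() first.
    std::string nextToken() {
        const bool available = hasMoreTokens();
        assert(available && "nextToken() called with no tokens left");
        (void)available;  // silence the unused warning when asserts are disabled
        buffered_ = false;
        return lookahead_;
    }

private:
    std::istream& source_;
    std::string lookahead_;
    bool buffered_ = false;
};
```

A typical loop would be while (parser.hasMoreTokens()) { use(parser.nextToken()); }, with the parser constructed over a std::ifstream or, for testing, a std::istringstream.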
A token is the smallest unit of a programming language that has a meaning. A parenthesis (, a name foo, an integer 123, are all tokens. Reducing a text to a series of tokens is generally the first step of parsing it.
A token is usually akin to a word in spoken language. In C++, int, float, 5.523 and const are all tokens. It is the minimal unit of text that constitutes a semantic element.
When you split a large unit (long string) into a group of sub-units (smaller strings), each of the sub-units (smaller strings) is referred to as a "token". If there are no more sub-units, then you are done parsing.
See also: How do I tokenize a string in C++?
A token is a terminal in a grammar, a sequence of one or more symbols that is defined by the sequence itself, i.e. it does not derive from any other production defined in the grammar.

Random string that matches a regexp [duplicate]

How would you go about creating a random alpha-numeric string that matches a certain regular expression?
This is specifically for creating initial passwords that fulfill regular password requirements.
Welp, just musing, but the general question of generating random inputs that match a regex sounds doable to me for a sufficiently relaxed definition of random and a sufficiently tight definition of regex. I'm thinking of the classical formal definition, which allows only ()|* and alphabet characters.
Regular expressions can be mapped to formal machines called finite automata. Such a machine is a directed graph with a particular node called the final state, a node called the initial state, and a letter from the alphabet on each edge. A word is accepted by the regex if it's possible to start at the initial state and traverse one edge labeled with each character through the graph and end at the final state.
One could build the graph, then start at the final state and traverse random edges backwards, keeping track of the path. In a standard construction, every node in the graph is reachable from the initial state, so you do not need to worry about making irrecoverable mistakes and needing to backtrack. If you reach the initial state, stop, and read off the path going forward. That's your match for the regex.
There's no particular guarantee about when or if you'll reach the initial state, though. One would have to figure out in what sense the generated strings are 'random', and in what sense you are hoping for a random element from the language in the first place.
Maybe that's a starting point for thinking about the problem, though!
Now that I've written that out, it seems to me that it might be simpler to repeatedly resolve choices to simplify the regex pattern until you're left with a simple string. Find the first non-alphabet character in the pattern. If it's a *, replicate the preceding item some number of times and remove the *. If it's a |, choose which of the OR'd items to preserve and remove the rest. For a left paren, do the same, but looking at the character following the matching right paren. This is probably easier if you parse the regex into a tree representation first that makes the paren grouping structure easier to work with.
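The "resolve each choice" idea is easy to sketch once the regex is already a tree. The example below assumes the pattern has been parsed by hand into a hypothetical Regex structure covering only the classical operators (concatenation, |, and *), and it caps * at a small repeat count so that the output stays finite. It makes no claim about how the resulting strings are distributed over the language.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

struct Regex {
    enum class Kind { Literal, Sequence, Choice, Star } kind;
    std::string text;             // used by Literal only
    std::vector<Regex> children;  // operands of the other kinds
};

// Walk the tree and resolve every choice at random. Star repeats its operand
// between 0 and max_repeat times; the cap is an assumption, not part of the regex.
std::string random_match(const Regex& re, std::mt19937& rng, int max_repeat = 3) {
    switch (re.kind) {
        case Regex::Kind::Literal:
            return re.text;
        case Regex::Kind::Sequence: {
            std::string out;
            for (const Regex& part : re.children) out += random_match(part, rng, max_repeat);
            return out;
        }
        case Regex::Kind::Choice: {
            assert(!re.children.empty());
            std::uniform_int_distribution<std::size_t> pick(0, re.children.size() - 1);
            return random_match(re.children[pick(rng)], rng, max_repeat);
        }
        case Regex::Kind::Star: {
            std::uniform_int_distribution<int> count(0, max_repeat);
            std::string out;
            for (int n = count(rng); n > 0; --n)
                out += random_match(re.children.at(0), rng, max_repeat);
            return out;
        }
    }
    return {};
}
```

For (a|b)*c, the tree is a Sequence of a Star over a Choice of the literals "a" and "b", followed by the literal "c"; every result is then zero to three letters from {a, b} followed by c.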
To the person who worried that deciding if a regex actually matches anything is equivalent to the halting problem: Nope, regular languages are quite well behaved. You can tell if any two regexes describe the same set of accepted strings. You basically make the machine above, then follow an algorithm to produce a canonical minimal equivalent machine. Do that for two regexes, then check if the resulting minimal machines are equivalent, which is straightforward.
String::Random in Perl will generate a random string from a subset of regular expressions:
#!/usr/bin/perl
use strict;
use warnings;
use String::Random qw/random_regex/;
print random_regex('[A-Za-z]{3}[0-9][A-Z]{2}[!@#$%^&*]'), "\n";
If you have a specific problem, you probably have a specific regular expression in mind. I would take that regular expression, work out what it means in simple human terms, and work from there.
I suspect it's possible to create a general regex random match generator, but it's likely to be much more work than just handling a specific case - even if that case changes a few times a year.
(Actually, it may not be possible to generate random matches in the most general sense - I have a vague memory that the problem of "does any string match this regex" is the halting problem in disguise. With a very cut-down regex language you may have more luck though.)
I have written Parsley, which consists of a Lexer and a Generator.
Lexer converts a regular expression-like string into a sequence of tokens.
Generator uses these tokens to produce a defined number of codes.
$generator = new \Gajus\Parsley\Generator();
/**
 * Generate a set of random codes based on Parsley pattern.
 * Codes are guaranteed to be unique within the set.
 *
 * @param string $pattern Parsley pattern.
 * @param int $amount Number of codes to generate.
 * @param int $safeguard Number of additional codes generated in case there are duplicates that need to be replaced.
 * @return array
 */
$codes = $generator->generateFromPattern('FOO[A-Z]{10}[0-9]{2}', 100);
The above example will generate an array containing 100 codes, each prefixed with "FOO", followed by 10 characters from "ABCDEFGHKMNOPRSTUVWXYZ23456789" haystack and 2 numbers from "0123456789" haystack.
This PHP library looks promising: ReverseRegex
Like all of these, it only handles a subset of regular expressions but it can do fairly complex stuff like UK Postcodes:
([A-PR-UWYZ]([0-9]([0-9]|[A-HJKSTUW])?|[A-HK-Y][0-9]([0-9]|[ABEHMNPRVWXY])?) ?[0-9][ABD-HJLNP-UW-Z]{2}|GIR0AA)
Outputs
D43WF
B6 6SB
MP445FR
P9 7EX
N9 2DH
GQ28 4UL
NH1 2SL
KY2 9LS
TE4Y 0AP
You'd need to write a string generator that can parse regular expressions and generate random members of character ranges for random lengths, etc.
Much easier would be to write a random password generator with certain rules (starts with a lower case letter, has at least one punctuation, capital letter and number, at least 6 characters, etc) and then write your regex so that any passwords created with said rules are valid.
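A rule-driven generator of that kind might look like the sketch below, using exactly the rules listed in the paragraph above. The punctuation set and the default length are assumptions. std::mt19937 is not a cryptographically secure generator, so real initial passwords would need a secure random source instead; it is used here only to keep the example short and reproducible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>

// Rules taken from the text: starts with a lowercase letter, contains at least
// one uppercase letter, one digit and one punctuation mark, length >= 6.
std::string make_password(std::mt19937& rng, std::size_t length = 10) {
    assert(length >= 6);
    const std::string lower = "abcdefghijklmnopqrstuvwxyz";
    const std::string upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const std::string digit = "0123456789";
    const std::string punct = "!@#$%^&*";  // assumed punctuation set
    const std::string all = lower + upper + digit + punct;

    auto pick = [&rng](const std::string& set) {
        std::uniform_int_distribution<std::size_t> index(0, set.size() - 1);
        return set[index(rng)];
    };

    // One character from each required class, then filler from any class.
    std::string rest{pick(upper), pick(digit), pick(punct)};
    while (rest.size() + 1 < length) rest += pick(all);
    std::shuffle(rest.begin(), rest.end(), rng);  // hide the fixed class order

    return pick(lower) + rest;  // the first character is always lowercase
}
```

Because the rules are satisfied by construction, a matching regex such as the password policy only needs to be checked once against the generator's design, not against every generated string.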
Presuming you have both a minimum length and 3-of-4* (or similar) requirement, I'd just be inclined to use a decent password generator.
I've built a couple in the past (both web-based and command-line), and have never had to skip more than one generated string to pass the 3-of-4 rule.
* 3-of-4: must have at least three of the following characteristics: lowercase, uppercase, number, symbol
It is possible (for example, a Haskell regexp module has a test suite which automatically generates strings that ought to match certain regexes).
However, for the simple task at hand you might be better off taking a simple password generator and filtering its output with your regexp.