Most Efficient way to 'look up' Keywords - c++

Alright so I am writing a function as part of a lexical analyzer that 'looks up' or searches for a match with a keyword. My lexer catches all the obvious tokens such as single and multi character operators (+ - * / > < = == etc) (also comments and whitespace are already taken out) so I call a function after I've collected a stream of only alphanumeric characters (including underscores) into a string, this string then needs to be matched as either a known keyword or an identifier.
So I was wondering how I might go about identifying it? I know I basically need to compare it to some list or array or something of all the built in keywords, and if it matches one return that match to it's corresponding enum value; otherwise, if there is no match, then it must be a function or variable identifier. So how should I look for matches? I read somewhere that something called a Binary Search Tree is an efficient way to do it or by using Hash Tables, problem is I've never used either so I am not sure if it's the right way. Could I possibly use a MySQL database?

If your set of keywords is fixed, a perfect hash can be built for O(1) lookup. Check out gperf or cmph.

A "trie" will surely be the most efficient way.

Whatever implementation of std::map you have will probably be sufficient.

This is for a language, with a specific set of keywords that never change, and there aren't very many of them?
If so, it probably doesn't matter what you use. You will have bigger fish to fry.
However, since the list doesn't change, it would be hard to beat a hard coded search like this:
// search on first letter
switch(s[0]){
case 'a':
// search on 2nd letter, etc.
break;
case 'b':
// search on 2nd letter, etc.
break;
........
case '_':
// search on 2nd letter, etc.
break;
}

For singe character keywords a lookup table would be perfect. For multicharacter (especially if the lengths differs): a hash table. If you need performance, you could even use source code generation to create the hash tables (using a simple hash function that is able or not to ignore case, depending on your syntax).
So I'd implement it with a LUT and a hash table: first you check the first character with the LUT (if it's a simple operator, it would start with a non-alpha-numeric value), and, if not found, check the hash table.

Related

ICU: simple case mapping for whole strings

I would like to find a substring in UTF-8 string, case-insensitive. from what I read, the usual way it's done is case folding the strings in order to bring them to canonical form.
However, since case folding can change the length of the strings (and I don't want to change the length of the strings because I need to know what is the exact offset of the substring match in the original string), it seems I should be using Simple case mapping. although the case-insensitive comparison won't be accurate, it will be best effort.
However, I cannot find in ICU API functions that operate on strings with Simple case mapping. I can find Simple case mapping only for single char functions (u_foldCase() in uchar.h). Is there an option to use Simple case folding for whole strings?

How to create a program that identifies longest substring palindrome regardless of letter casing

I've attempted at googling algorithms for a program that outputs the result indicated in the question. Mostly, all what I've found was algorithms that satisfied the first constraint, but did not take into account the second part (ignoring the casing of letters). Conventional functions, such as strcmpi (I'm using c++) requires constant characters which make it impossible to incorporate within the algorithms alluded to above. In essence, I just need an idea on how I can go about creating such a program.
First create a program that identifies longest substring palindrome using your own compare function. And in that compare function if two characters are same then return true else if the difference between ASCII values of two characters is 32 then also return true. And rest as it is.

Using the Levenshtein distance in a spell checker

I am working on a spell checker in C++ and I'm stuck at a certain step in the implementation.
Let's say we have a text file with correctly spelled words and an inputted string we would like to check for spelling mistakes. If that string is a misspelled word, I can easily find its correct form by checking all words in the text file and choosing the one that differs from it with a minimum of letters. For that type of input, I've implemented a function that calculates the Levenshtein edit distance between 2 strings. So far so good.
Now, the tough part: what if the inputted string is a combination of misspelled words? For example, "iloevcokies". Taking into account that "i", "love" and "cookies" are words that can be found in the text file, how can I use the already-implemented Levenshtein function to determine which words from the file are suitable for a correction? Also, how would I insert blanks in the correct positions?
Any idea is welcome :)
Spelling correction for phrases can be done in a few ways. One way requires having an index of word bi-grams and tri-grams. These of course could be immense. Another option would be to try permutations of the word with spaces inserted, then doing a lookup on each word in the resulting phrase. Take a look at a simple implementation of a spell checker by Peter Norvig from Google. Either way, consider using an n-gram index for better performance, there are libraries available in C++ for reference.
Google and other search engines are able to do spelling correction on phrases because they have a large index of queries and associated result sets, which allows them to calculate a statistically good guess. Overall, the spelling correction problem can become very complex with methods like context-sensitive correction and phonetic correction. Given that using permutations of possible sub-terms can become expensive you can utilize certain types of heuristics, however this can get out of scope quick.
You may also consider using and existing spelling library, such as aspell.
A starting point for an idea: one of the top hits of your L-distance for "iloevcokies" should be "cookies". If you can change your L-distance function to also track and return a min-index and max-index (i.e., this match is best starting from character 5 and going to character 10) then you could remove that substring and re-check L-distance for the string before it and after it, then concatenate those for a suggestion....
Just a thought, good luck....
I will suppose that you have an existing index, on which you run your levenshtein distance (for example, a Trie, but any sorted index generally work well).
You can consider the addition of white-spaces as a regular edit operation, it's just that there is a twist: you need (then) to get back to the root of your index for the next word.
This way you get the same index, almost the same route, approximately the same traversal, and it should not even impact your running time that much.

Deduplicating an array of keywords (but not based on EXACT match)

I have a list of a few thousand terms. There is significant overlap in those terms, but in different forms. For example (ruby, a_ruby), (triathlon, triathlete, triathletes), (nonprofit, non_profit, non_profits).
Most of these have significant number of character overlap, but not exactly in the same form. For example, (nonprofit and non_profit)
What regex sequence will be the best for this? I know that i can use stemming as well, but wondering how i can combine that with the regex.
For a single list of a few thousand items, I'd consider an alternate approach.
Sort the list alphabetically then manually remove the duplicates. Whatever regex and subsequent processing you end up with will probably take as much time if not more than going through the list manually.
Of course, I'm assuming this is a one-time proposition. I defer to regex experts for a programmatic solution.
I agree with Bob Kaufman that you should do a first pass to eliminate actual duplicates. After that, you have a problem that regex cannot solve for you; you will need to look into measurements of edit distance to get anywhere with it.
My usual strategy in this situation, which is not perfectly reliable, is as follows:
1) Remove all nonalphanumeric characters.
2) Make all strings lowercase.
3) Put all of the strings in a HashSet (this will remove duplicates).
4) Check for any cases where word and word+"s" are both in the set, and remove the plural one.
5) Output the strings in alphabetical order, and do a quick manual search for duplicates. If any are found, define new rules accordingly.
Other rules you may need:
Replace & with and.
Remove all instances of "inc"
Replace all instances of television with TV.

Random string that matches a regexp [duplicate]

This question already has answers here:
Using Regex to generate Strings rather than match them
(12 answers)
Closed 1 year ago.
How would you go about creating a random alpha-numeric string that matches a certain regular expression?
This is specifically for creating initial passwords that fulfill regular password requirements.
Welp, just musing, but the general question of generating random inputs that match a regex sounds doable to me for a sufficiently relaxed definition of random and a sufficiently tight definition of regex. I'm thinking of the classical formal definition, which allows only ()|* and alphabet characters.
Regular expressions can be mapped to formal machines called finite automata. Such a machine is a directed graph with a particular node called the final state, a node called the initial state, and a letter from the alphabet on each edge. A word is accepted by the regex if it's possible to start at the initial state and traverse one edge labeled with each character through the graph and end at the final state.
One could build the graph, then start at the final state and traverse random edges backwards, keeping track of the path. In a standard construction, every node in the graph is reachable from the initial state, so you do not need to worry about making irrecoverable mistakes and needing to backtrack. If you reach the initial state, stop, and read off the path going forward. That's your match for the regex.
There's no particular guarantee about when or if you'll reach the initial state, though. One would have to figure out in what sense the generated strings are 'random', and in what sense you are hoping for a random element from the language in the first place.
Maybe that's a starting point for thinking about the problem, though!
Now that I've written that out, it seems to me that it might be simpler to repeatedly resolve choices to simplify the regex pattern until you're left with a simple string. Find the first non-alphabet character in the pattern. If it's a *, replicate the preceding item some number of times and remove the *. If it's a |, choose which of the OR'd items to preserve and remove the rest. For a left paren, do the same, but looking at the character following the matching right paren. This is probably easier if you parse the regex into a tree representation first that makes the paren grouping structure easier to work with.
To the person who worried that deciding if a regex actually matches anything is equivalent to the halting problem: Nope, regular languages are quite well behaved. You can tell if any two regexes describe the same set of accepted strings. You basically make the machine above, then follow an algorithm to produce a canonical minimal equivalent machine. Do that for two regexes, then check if the resulting minimal machines are equivalent, which is straightforward.
String::Random in Perl will generate a random string from a subset of regular expressions:
#!/usr/bin/perl
use strict;
use warnings;
use String::Random qw/random_regex/;
print random_regex('[A-Za-z]{3}[0-9][A-Z]{2}[!##$%^&*]'), "\n";
If you have a specific problem, you probably have a specific regular expression in mind. I would take that regular expression, work out what it means in simple human terms, and work from there.
I suspect it's possible to create a general regex random match generator, but it's likely to be much more work than just handling a specific case - even if that case changes a few times a year.
(Actually, it may not be possible to generate random matches in the most general sense - I have a vague memory that the problem of "does any string match this regex" is the halting problem in disguise. With a very cut-down regex language you may have more luck though.)
I have written Parsley, which consist of a Lexer and a Generator.
Lexer is for converting a regular expression-like string into a sequence of tokens.
Generator is using these tokens to produce a defined number of codes.
$generator = new \Gajus\Parsley\Generator();
/**
* Generate a set of random codes based on Parsley pattern.
* Codes are guaranteed to be unique within the set.
*
* #param string $pattern Parsley pattern.
* #param int $amount Number of codes to generate.
* #param int $safeguard Number of additional codes generated in case there are duplicates that need to be replaced.
* #return array
*/
$codes = $generator->generateFromPattern('FOO[A-Z]{10}[0-9]{2}', 100);
The above example will generate an array containing 100 codes, each prefixed with "FOO", followed by 10 characters from "ABCDEFGHKMNOPRSTUVWXYZ23456789" haystack and 2 numbers from "0123456789" haystack.
This PHP library looks promising: ReverseRegex
Like all of these, it only handles a subset of regular expressions but it can do fairly complex stuff like UK Postcodes:
([A-PR-UWYZ]([0-9]([0-9]|[A-HJKSTUW])?|[A-HK-Y][0-9]([0-9]|[ABEHMNPRVWXY])?) ?[0-9][ABD-HJLNP-UW-Z]{2}|GIR0AA)
Outputs
D43WF
B6 6SB
MP445FR
P9 7EX
N9 2DH
GQ28 4UL
NH1 2SL
KY2 9LS
TE4Y 0AP
You'd need to write a string generator that can parse regular expressions and generate random members of character ranges for random lengths, etc.
Much easier would be to write a random password generator with certain rules (starts with a lower case letter, has at least one punctuation, capital letter and number, at least 6 characters, etc) and then write your regex so that any passwords created with said rules are valid.
Presuming you have both a minimum length and 3-of-4* (or similar) requirement, I'd just be inclined to use a decent password generator.
I've built a couple in the past (both web-based and command-line), and have never had to skip more than one generated string to pass the 3-of-4 rule.
3-of-4: must have at least three of the following characteristics: lowercase, uppercase, number, symbol
It is possible (for example, Haskell regexp module has a test suite which automatically generates strings that ought to match certain regexes).
However, for a simple task at hand you might be better off taking a simple password generator and filtering its output with your regexp.