Formulation of language and regular expressions - regex

I can't figure out the formal language and regular expression
of this automaton:
DFA automaton
I know that the number of occurrences of 'a' and of 'b' both have to be even.
At first I thought the language was:
L = {(a^i)(b^j) | i(mod2) = j(mod2) = 0, i,j>=0}
But the automaton can start with 'b', so that language is incorrect.
Also, the regular expression I found, ((aa)* + (bb)), doesn't match either:
it can't produce abab, for example.

The regex I got by progressively ripping out nodes (order: 3,1,2,0) is:
(aa|bb|(ab|ba)(bb|aa)*(ab|ba))*
As far as I can tell, that's the simplest it goes. (I'd love to know if anyone has a simpler reduction—I'm actually taking a test on this stuff this week!)
Step-by-step process
We start off by adding a new start and accept state. Every old accept state (in this case, there's only one) gets linked to the new accept state with an ε transition:
Next, we rip out state 3. We need to preserve all paths that run through state 3. In this case we've added a path from state 0 back to itself, paths from state 0 to state 2, and state 2 back to itself:
We do the same with state 1:
We can simplify this a bit: we'll join the looping-back transitions with commas. (At the end, this will turn into the union operator: | or ⋃ etc., depending on your notation.)
We'll remove state 2 next, and get everything smooshed onto one big loop:
Loops become stars; we remove the last state so we just have a transition from the start state to the end state connected with one big regular expression:
And that's our regular expression!
Language definition
You're pretty close with the language definition. If you can allow something a little looser, it would be this:
L = { w | w contains an even number of 'a's and an even number of 'b's }
The problem with your definition is that you start the string w off with a's every time and only then allow b's, whereas the only restriction is on the parity of the number of a's and b's.
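One way to sanity-check both the regex and the language definition is to compare them by brute force on every short string over {a, b}. The following is only a sketch in Python; the helper names in_language and mismatches are made up for illustration.

```python
import re
from itertools import product

# The regex obtained by state elimination, in Python syntax.
REGEX = re.compile(r"(aa|bb|(ab|ba)(bb|aa)*(ab|ba))*")

def in_language(w):
    """Membership test for L: even number of a's and even number of b's."""
    return w.count("a") % 2 == 0 and w.count("b") % 2 == 0

def mismatches(max_len):
    """Return every string up to max_len on which regex and definition disagree."""
    bad = []
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            w = "".join(letters)
            if bool(REGEX.fullmatch(w)) != in_language(w):
                bad.append(w)
    return bad

print(mismatches(10))
```

An empty list means the two descriptions agree on all strings up to that length. That is evidence, not a proof.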

Related

Steps to draw a DFA (or NFA) from a simple statement?

I am given a simple statement: construct a DFA over the alphabet {0, 1} that accepts all strings that end in 101.
My question is: what are the steps to design it? Or to design an NFA instead, because I know the steps to convert an NFA to a DFA, so I could then convert the NFA to the DFA.
Note: it is just a minor course for me, so I have never studied regular expressions or any of the algorithms probably used to construct DFAs.
If you want more of an explanation on how I derived this, I'd be happy to explain, but for now I just drew the DFA and explained each state.
Sorry about the screenshot...I didn't know how to convert it straight to an image.
On input 0 at q0, it loops back to itself. On 1, it prepares itself to end because it could possibly be '101'.
q1 loops to itself on input 1 because it's still preparing to end on '101'. Input '0' on q1 means it is preparing for input '10', so it goes to q2.
Input '0' on q2 breaks the whole cycle and goes back to q0. Input '1' results in moving to q3, the accepting state.
Any input on q3 results in going back to whatever point in the cycle the input corresponds with.
That is, on '1' it goes back to q1, the state where the first '1' of '101' was encountered, preparing to end.
On '0', it goes to q2 because in order to get to q3, there must have been an input of '1' from q2, so no matter what, the last two input symbols are now '10'.
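The four states described above can be written down as a transition table and run directly. This is a small Python sketch of that description; the names DELTA and accepts are just for illustration.

```python
# Transition table for the DFA described above: (state, symbol) -> next state.
DELTA = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q2", ("q1", "1"): "q1",
    ("q2", "0"): "q0", ("q2", "1"): "q3",
    ("q3", "0"): "q2", ("q3", "1"): "q1",
}

def accepts(s):
    """Run the DFA on a string of 0s and 1s; q3 is the only accepting state."""
    state = "q0"
    for ch in s:
        state = DELTA[(state, ch)]
    return state == "q3"

print(accepts("0010101"), accepts("1010"))
```

Comparing accepts(s) with s.endswith("101") on all short binary strings is a quick way to check the table.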
TikZ DFA examples.
Here, the string should end with 101, so we draw an NFA for it and later convert it into a DFA.
The states are A, B, C and D.
I will upload an image here. In it I have drawn the NFA and then its transition table.
Then I have drawn the transition table for the conversion of the NFA to a DFA.
I have also drawn the DFA for you.
In an NFA, when a specific input is given to the current state, the machine can go to multiple states. It can have zero, one or more than one move on a given input symbol. In a DFA, on the other hand, when a specific input is given to the current state, the machine goes to exactly one state. A DFA has only one move on a given input symbol.
THE STEPS FOR CONVERTING NFA TO DFA:
Step 1: Initially Q' = ∅
Step 2: Add q0 of the NFA to Q'. Then find the transitions from this start state.
Step 3: For each state set in Q', find the set of states reachable on each input symbol. If this set of states is not in Q', then add it to Q'.
Step 4: In the DFA, the final states are all the states of Q' that contain at least one state of F (the final states of the NFA).
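The steps above can be sketched in Python as follows. The NFA transitions here are an assumption, since the image is not reproduced: A loops on 0 and 1, and the path A -1-> B -0-> C -1-> D recognises the suffix 101. The function names are hypothetical.

```python
# Assumed NFA for "ends in 101": (state, symbol) -> set of next states.
NFA = {
    ("A", "0"): {"A"}, ("A", "1"): {"A", "B"},
    ("B", "0"): {"C"},
    ("C", "1"): {"D"},
}

def subset_construction(nfa, start, finals, alphabet="01"):
    """Build a DFA whose states are frozensets of NFA states."""
    start_set = frozenset({start})          # Step 2: begin with {q0}
    seen, todo, dfa = {start_set}, [start_set], {}
    while todo:                             # Step 3: explore new state sets
        current = todo.pop()
        for sym in alphabet:
            target = frozenset(t for q in current for t in nfa.get((q, sym), ()))
            dfa[(current, sym)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_finals = {s for s in seen if s & finals}   # Step 4
    return dfa, start_set, dfa_finals

def dfa_accepts(dfa, start, finals, s):
    state = start
    for ch in s:
        state = dfa[(state, ch)]
    return state in finals

dfa, q0, finals = subset_construction(NFA, "A", {"D"})
print(sorted(sorted(state) for state in {src for src, _ in dfa}))
```

Only the state sets reachable from {A} are generated, which is why the result stays small instead of having all 16 subsets.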
View the image here

Avoiding Comments w/ C++ getline()

I'm using getline() to read a .cpp file line by line.
getline(theFile, fileData);
I'm wondering if there is any way to have getline() avoid grabbing C++ comments (/*, */ and //)?
So far, trying something like this doesn't quite work.
if (fileData[i] == '/*')
I think it's unavoidable for you to read the comments, but you can dispose of them by reading through the file one character at a time.
To do this, you can load the file into a string and build a state machine with the following states:
State 1: This is actual code
State 2: The previous character was /
State 3: The previous character was * (inside a multi-line comment)
State 4: I am a single-line comment
State 5: I am a multi-line comment
The state machine starts in State 1.
If the machine is in State 1 and hits a / character, transition to State 2.
If the machine is in State 2 and hits a / character, transition to State 4. If it hits a * character, transition to State 5. Otherwise, transition to State 1.
If the machine is in State 4 and hits a newline character, transition to State 1.
If the machine is in State 5 and hits a * character, transition to State 3.
If the machine is in State 3 and hits a / character, transition to State 1 (the multi-line comment ends). If it hits another * character, stay in State 3. Otherwise, transition to State 5.
If you mark the positions of the characters where the machine enters and exits the comment states, you can then strip these characters from the string.
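Here is a rough sketch of that state machine, written in Python for brevity; the same loop carries over to C++ almost line for line. It drops comment characters as it goes instead of marking positions. Two choices are my own: it stays in the "previous character was *" state when it sees another *, so that a comment ending in **/ closes correctly, and it does not treat string literals specially, so a // inside a string would be stripped too.

```python
CODE, SLASH, STAR, LINE, BLOCK = range(1, 6)   # States 1..5 as listed above

def strip_comments(source):
    """Return source with // and /* */ comments removed (string literals not handled)."""
    out = []
    state = CODE
    for ch in source:
        if state == CODE:
            if ch == "/":
                state = SLASH            # might be the start of a comment; hold the '/'
            else:
                out.append(ch)
        elif state == SLASH:
            if ch == "/":
                state = LINE
            elif ch == "*":
                state = BLOCK
            else:
                out.append("/")          # it was an ordinary '/', emit it now
                out.append(ch)
                state = CODE
        elif state == LINE:
            if ch == "\n":
                out.append(ch)           # keep the newline so line numbers survive
                state = CODE
        elif state == BLOCK:
            if ch == "*":
                state = STAR
        elif state == STAR:
            if ch == "/":
                state = CODE
            elif ch != "*":
                state = BLOCK
    if state == SLASH:
        out.append("/")                  # input ended right after a lone '/'
    return "".join(out)

print(strip_comments("int a; // x\nint b; /* y **/ int c = a / b;"))
```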
Alternatively, you could explore regular expressions, which provide ways of describing this kind of state machine very succinctly.
So, one problem is that if(fileData[i] == '/*') is testing if the char fileData[i] is equal to '/*' which is... Not a char.
To find if a line contains a comment, you will probably want to look into one of the following:
<regex> in C++11 (Boost has a regular expression library as well, if that's more your thing.)
strstr in vanilla C/C++.
For multi-line comments, you'll probably want to store a flag indicating whether the previous line ended "in comment" or not, and then search for /* or */ according to that flag, updating it as you go.
Single quotation marks designate a char, and the char data type represents a SINGLE char. '/*' doesn't make sense, because it's two chars while fileData[i] refers to a single char.
Your if statement needs to be far more robust.

Checking if a string contains an English sentence

As of right now, I decided to take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing from that newline to the next newline, then I do string.find() to see if that English word is somewhere in there. This takes a VERY long time, each word taking about 1/2-1/4 a second to verify.
It is working perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed (Multithreading), but it still only checks like 10 a second. (I need thousands)
I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed it up a lot, but still not get the speed I want. There has to be a better way to do this.
The strings I'm checking will look like this:
"hithisisastringthatmustbechecked"
but most of them contain complete garbage, just random letters.
I can't check for impossible combinations of letters, because that string would be thrown out because of the 'tm' in between 'thatmust'.
You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.
Go through every dictionary word, and build a search table for it. You need to do it only once. Now your search for individual words will proceed at faster pace, because the "false starts" will be eliminated.
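To make the "search table" idea concrete, here is a minimal KMP sketch in Python. The function names build_table and kmp_find are mine; the table is built once per dictionary word and then reused for every string you check.

```python
def build_table(word):
    """KMP failure table: table[i] is the length of the longest proper prefix
    of word[:i+1] that is also a suffix of it."""
    table = [0] * len(word)
    k = 0
    for i in range(1, len(word)):
        while k and word[i] != word[k]:
            k = table[k - 1]
        if word[i] == word[k]:
            k += 1
        table[i] = k
    return table

def kmp_find(text, word, table):
    """Return the index of the first occurrence of word in text, or -1."""
    if not word:
        return 0
    k = 0                                  # number of characters of word matched so far
    for i, ch in enumerate(text):
        while k and ch != word[k]:
            k = table[k - 1]               # fall back without re-reading the text
        if ch == word[k]:
            k += 1
        if k == len(word):
            return i - k + 1
    return -1

table = build_table("string")              # done once per dictionary word
print(kmp_find("hithisisastringthatmustbechecked", "string", table))
```

Whether this beats the library's string.find() in practice depends on the implementation, so it is worth measuring before committing to it.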
There are a lot of strategies for doing this quickly.
Idea 1
Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter is used twice, store the longer substring.)
So the array looks like this:
a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth
Then, for each word in the dictionary, search in the array element indicated by its first letter. This limits the amount of stuff that has to be searched. Plus you can't ever find a word beginning with, say 'r', anywhere before the first 'r' in the string. And some words won't even do a search if the letter isn't in there at all.
Idea 2
Expand upon that idea by noting the longest word in the dictionary and get rid of letters from those strings in the arrays that are longer than that distance away.
So you have this in the array:
a - substr[0] = "astringthatmustbechecked"
But if the longest word in the list is 5 letters, there is no need to keep any more than:
a - substr[0] = "astri"
If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.
e - substr[4] = "echecked"
You can expand upon this by using the longest words starting with any particular letter when condensing the strings.
Idea 3
This has nothing to do with 1 and 2. It's an idea that you could use instead.
You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write the regular expression too and then apply it.
Assume these are the words in the dictionary:
arun
bob
bill
billy
body
jose
Build this sort of linked structure. (It's a binary tree, really, represented in such a way that I can explain how to use it.)
a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
|    |              |
|    o -> b -> *    y -> *
|         |
|         d -> y -> *
|
j -> o -> s -> e -> *
The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.
The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".
The regular expression looks sort of like: /(arun)|(b((ill(y?))|(o(b|dy))))|(jose)/. This gives the gist of creating it as a regex.
Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking for the alternatives and if one matches, move forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move to the next column.
This is a lot of work but can, sometimes, be handy.
Side note: I built one of these some time back by writing a program that generated code which ran the algorithm directly, instead of having code walk the binary tree data structure.
Think of each set of vertical bar options being a switch statement against a particular character column and each arrow turning into a nesting. If there is only one option, you don't need a full switch statement, just an if.
That was some fast character matching and really handy for some reason that eludes me today.
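A sketch of Idea 3 in Python, using nested dictionaries for the linked structure rather than generated switch statements. The sentinel END plays the role of the asterisk in the drawing; build_trie and find_words are hypothetical names.

```python
END = object()   # marks "a word ends here", like the * in the drawing

def build_trie(words):
    """Nested dicts: each level maps a letter to the sub-structure after it."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def find_words(trie, text):
    """Try every starting column; report (start, word) for each dictionary word found."""
    found = []
    for start in range(len(text)):
        node = trie
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break                       # no alternative matches, next column
            if END in node:
                found.append((start, text[start:i + 1]))
    return found

trie = build_trie(["arun", "bob", "bill", "billy", "body", "jose"])
print(find_words(trie, "xxbillyxbody"))
```

Each starting column costs at most the length of the longest dictionary word, regardless of how many words the dictionary holds.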
How about a Bloom Filter?
A Bloom filter, conceived by Burton Howard Bloom in 1970 is a
space-efficient probabilistic data structure that is used to test
whether an element is a member of a set. False positive matches are
possible, but false negatives are not; i.e. a query returns either
"inside set (may be wrong)" or "definitely not in set". Elements can
be added to the set, but not removed (though this can be addressed
with a "counting" filter). The more elements that are added to the
set, the larger the probability of false positives.
The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
It is my understanding that some spell-checkers rely on bloom filters to quickly test whether your latest word belongs to the dictionary of known words.
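For illustration, a very small Bloom filter can be sketched in Python like this. The sizes, the number of hashes and the use of SHA-256 are arbitrary assumptions on my part, not tuned values.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

words = ["this", "is", "a", "string", "that", "must", "be", "checked"]
bloom = BloomFilter()
for w in words:
    bloom.add(w)
print(all(bloom.might_contain(w) for w in words))
```

Every substring that comes back False can be skipped; anything that comes back True still needs a real dictionary lookup.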
This code was modified from How to split text without spaces into list of words?:
from math import log

words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    costsum = 0
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        costsum += c
        i -= k

    return costsum
Using the same dictionary of that answer and testing your string outputs
>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056
The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word, it returns inf, since it would split everything to single-letter words).
In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?
The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary; even 5 or so books by different authors should be enough for very accurate classification).
Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So, you can make it ignore spaces, by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph would, perhaps, look like so:
alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation
It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.
I see that you are concerned about time: Training may take a few minutes (assuming you have already compiled gold standard "sentences" and "random scrambled strings" texts). You only need to train once, you can easily save the "trained" model to disk and reuse it for subsequent runs by loading from disk, which may take a few seconds. Making a call on a string would take a trivially small number of floating point multiplications to get a probability, so after you finish training it, it should be very fast.
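As a rough sketch of what such a model could look like, here is a first-order (letter-to-letter) Markov model in Python. The add-one smoothing and the average log-probability score are my own choices for the example, and the threshold that separates "sentence" from "garbage" would still have to be found from your own data.

```python
from collections import Counter
from math import log
import string

ALPHABET = string.ascii_lowercase

def train(text):
    """Count letter-to-letter transitions, ignoring everything except a-z."""
    letters = [c for c in text.lower() if c in ALPHABET]
    pairs = Counter(zip(letters, letters[1:]))     # (current, next) -> count
    row_totals = Counter(letters[:-1])             # how often each letter has a successor
    return pairs, row_totals

def score(model, s):
    """Average log-probability per transition, with add-one smoothing."""
    pairs, row_totals = model
    letters = [c for c in s.lower() if c in ALPHABET]
    steps = list(zip(letters, letters[1:]))
    if not steps:
        return float("-inf")
    total = 0.0
    for a, b in steps:
        total += log((pairs[(a, b)] + 1) / (row_totals[a] + len(ALPHABET)))
    return total / len(steps)

model = train("alice was beginning to get very tired of sitting by her sister on the bank")
print(score(model, "hithisisastring"), score(model, "xqzvkjwpfg"))
```

Scoring a string is one table lookup and one logarithm per character, which is why it stays fast once the training is done.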

Regex, writing a toy compiler, parsing, comment remover

I'm currently working my way through this book:
http://www1.idc.ac.il/tecs/
I'm currently on a section where the exercise is to create a compiler for a very simple Java-like language.
The book always states what is required but not the how (which is a good thing). I should also mention that it talks about yacc and lex and specifically says to avoid them for the projects in the book for the sake of learning on your own.
I'm on chapter 10 and starting to write the tokenizer.
1) Can anyone give me some general advice - are regex the best approach for tokenizing a source file?
2) I want to remove comments from source files before parsing - this isn't hard but most compilers tell you the line an error occurs on, if I just remove comments this will mess up the line count, are there any simple strategies for preserving the line count while still removing junk?
Thanks in advance!
The tokenizer itself is usually written using a large DFA table that describes all possible valid tokens (stuff like: a token can start with a letter followed by other letters/numbers followed by a non-letter, or with a number followed by other numbers and either a non-number/point or a point followed by at least 1 number and then a non-number, etc.). The way I built mine was to identify all the regular expressions my tokenizer will accept, transform them into DFAs and combine them.
Now to "remove comments": when you're parsing a token you can have a comment token (the regex to parse a comment is too long to describe in words), and when you finish parsing this comment you just parse a new token, thus ignoring it. Alternatively you can pass it to the compiler and let it deal with it (or ignore it as it will). Either approach will preserve metadata like line numbers and characters-into-the-line.
edit for DFA theory:
Every regular expression can be converted into a DFA, and lexer generators do this for performance reasons. This removes any backtracking in matching them. This link gives you an idea of how this is done. You first convert the regular expression into an NFA (an automaton that may have several possible moves for one input, so simulating it naively needs backtracking), then remove the nondeterminism by inflating your finite automaton so that each state stands for a set of NFA states.
Another way you can build your DFA is by hand using some common sense. Take for example a finite automaton that can parse either an identifier or a number. This of course isn't enough, since you most likely want to add comments too, but it'll give you an idea of the underlying structures.
          A-Z       space
->(Start)----->(I1)------->((Identifier))
     |          | ^
     |          +-+
     |        A-Z0-9
     |
     |            space
     +---->(N1)---+--->((Number)) <----------+
      0-9   | ^   |                          |
            | |   | .       0-9       space  |
            +-+   +--->(N2)----->(N3)--------+
            0-9                   | ^
                                  +-+
                                  0-9
Some notes on the notation used, the DFA starts at the (Start) node and moves through the arrows as input is read from your file. At any one point it can match only ONE path. Any paths missing are defaulted to an "error" node. ((Number)) and ((Identifier)) are your ending, success nodes. Once in those nodes, you return your token.
So from the start, if your token starts with a letter, it HAS to continue with a bunch of letters or numbers and end with a "space" (spaces, new lines, tabs, etc). There is no backtracking: if this fails, the tokenizing process fails and you can report an error. You should read a theory book on error recovery to continue parsing; it's a really huge topic.
If however your token starts with a number, it has to be followed by either a bunch of numbers or one decimal point. If there's no decimal point, a "space" has to follow the numbers, otherwise a number has to follow followed by a bunch of numbers followed by a space. I didn't include the scientific notation but it's not hard to add.
Now for parsing speed, this gets transformed into a DFA table, with all nodes on both the vertical and horizontal lines. Something like this:
            I1             Identifier  N1       N2       N3       Number
start       letter         nothing     number   nothing  nothing  nothing
I1          letter+number  space       nothing  nothing  nothing  nothing
Identifier  nothing        SUCCESS     nothing  nothing  nothing  nothing
N1          nothing        nothing     number   dot      nothing  space
N2          nothing        nothing     nothing  nothing  number   nothing
N3          nothing        nothing     nothing  nothing  number   space
Number      nothing        nothing     nothing  nothing  nothing  SUCCESS
The way you'd run this is you store your starting state and move through the table as you read your input character by character. For example an input of "1.2" would parse as start->N1->N2->N3->Number->SUCCESS. If at any point you hit a "nothing" node, you have an error.
edit 2: the table should actually be node->character->node, not node->node->character, but it worked fine in this case regardless. It's been a while since I last wrote a compiler by hand.
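To show the node->character->node layout in action, here is a small Python sketch of the same identifier/number automaton. Grouping characters into classes (letter, digit, dot, space), skipping whitespace between tokens and appending a trailing space to flush the last token are conveniences I added; the diagram itself only mentions A-Z, 0-9, '.' and space.

```python
def char_class(ch):
    """Map a character to the column it selects in the table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch == ".":
        return "dot"
    if ch.isspace():
        return "space"
    return "other"

# node -> character class -> node; a missing entry is the "nothing"/error case.
TABLE = {
    "start": {"letter": "I1", "digit": "N1"},
    "I1": {"letter": "I1", "digit": "I1", "space": "Identifier"},
    "N1": {"digit": "N1", "dot": "N2", "space": "Number"},
    "N2": {"digit": "N3"},
    "N3": {"digit": "N3", "space": "Number"},
}
ACCEPTING = {"Identifier", "Number"}

def tokenize(text):
    tokens, state, lexeme = [], "start", ""
    for ch in text + " ":                      # trailing space flushes the last token
        cls = char_class(ch)
        if state == "start" and cls == "space":
            continue                           # skip whitespace between tokens
        nxt = TABLE[state].get(cls)
        if nxt is None:
            raise ValueError(f"unexpected {ch!r} in state {state}")
        if nxt in ACCEPTING:
            tokens.append((nxt, lexeme))
            state, lexeme = "start", ""
        else:
            state, lexeme = nxt, lexeme + ch
    return tokens

print(tokenize("x1 42 1.2"))
```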
1- Yes, regexes are a good way to implement the tokenizer. If using a generated tokenizer like lex, then you describe each token as a regex. See Mark's answer.
2- The lexer is what normally tracks line/column information. As tokens are consumed by the tokenizer, you store the line/column information with the token, or keep it as current state, so when a problem is found the tokenizer knows where you are. When processing comments, the tokenizer just increments the line_count as new lines are consumed.
In Lex you can also have parsing states. Multi-line comments are often implemented using these states, thus allowing simpler regexes. Once you find the match to the start of a comment, e.g. '/*', you change into a comment state, which you can set up to be exclusive from the normal state. Therefore, as you consume text looking for the end comment marker '*/', you do not match normal tokens.
This state-based process is also useful for processing string literals that allow embedded end markers, e.g. "test\"more text".
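A minimal Python sketch of point 2: comments are matched like any other token and thrown away, but the newlines inside them still advance the line counter. The token pattern here (identifiers, integers, single characters) is a placeholder of mine, not the book's grammar.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)      # line comment or block comment
  | (?P<space>\s+)                       # whitespace, including newlines
  | (?P<token>[A-Za-z_]\w*|\d+|.)        # identifier, integer, or any single character
""", re.VERBOSE | re.DOTALL)

def tokens_with_lines(source):
    """Return (line_number, text) pairs, skipping comments and whitespace."""
    line, out = 1, []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup == "token":
            out.append((line, m.group()))
        line += m.group().count("\n")    # comments and whitespace still count lines
    return out

print(tokens_with_lines("a /* x\ny */ b\n// c\nd"))
```

Because the line number is attached when the token is produced, the parser can report errors against the original file even though it never sees the comments.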

Tokenize the text depending on some specific rules. Algorithm in C++

I am writing a program which will tokenize the input text depending upon some specific rules. I am using C++ for this.
Rules
Letter 'a' should be converted to token 'V-A'
Letter 'p' should be converted to token 'C-PA'
Letter 'pp' should be converted to token 'C-PPA'
Letter 'u' should be converted to token 'V-U'
This is just a sample; in reality I have 500+ rules like this. If I provide the input 'appu', it should be tokenized as 'V-A + C-PPA + V-U'. I have implemented an algorithm for doing this and wanted to make sure that I am doing the right thing.
Algorithm
All rules will be kept in an XML file with the corresponding mapping to the token. Something like
<rules>
<rule pattern="a" token="V-A" />
<rule pattern="p" token="C-PA" />
<rule pattern="pp" token="C-PPA" />
<rule pattern="u" token="V-U" />
</rules>
1 - When the application starts, read this xml file and keep the values in a 'std::map'. This will be available until the end of the application(singleton pattern implementation).
2 - Iterate over the input text characters. For each character, look for a match. If found, become more greedy and look for more matches by taking the next characters from the input text. Do this until there is no match. So for the input text 'appu', first look for a match for 'a'. If found, try to extend the match by taking the next character from the input text. So it will try to match 'ap' and find no match. So it just returns.
3 - Replace the letter 'a' from input text as we got a token for it.
4 - Repeat step 2 and 3 with the remaining characters in the input text.
Here is a simpler explanation of the steps
input-text = 'appu'
tokens-generated=''
// First iteration
character-to-match = 'a'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'ap'
pattern-found = false
tokens-generated = 'V-A'
// since no match found for 'ap', taking the first success and replacing it from input text
input-text = 'ppu'
// second iteration
character-to-match = 'p'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'pp'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'ppu'
pattern-found = false
tokens-generated = 'V-A + C-PPA'
// since no match found for 'ppu', taking the first success and replacing it from input text
input-text = 'u'
// third iteration
character-to-match = 'u'
pattern-found = true
tokens-generated = 'V-A + C-PPA + V-U' // we're done!
Questions
1 - Does this algorithm look fine for this problem, or is there a better way to address it?
2 - If this is the right method, is std::map a good choice here? Or do I need to create my own key/value container?
3 - Is there a library available which can tokenize string like the above?
Any help would be appreciated
:)
So you're going through all of the tokens in your map looking for matches? You might as well use a list or array, there; it's going to be an inefficient search regardless.
A much more efficient way of finding just the tokens suitable for starting or continuing a match would be to store them as a trie. A lookup of a letter there would give you a sub-trie which contains only the tokens which have that letter as the first letter, and then you just continue searching downward as far as you can go.
Edit: let me explain this a little further.
First, I should explain that I'm not familiar with the C++ std::map beyond the name, which makes this a perfect example of why one learns the theory of this stuff as well as the details of particular libraries in particular programming languages: unless that library is badly misusing the name "map" (which is rather unlikely), the name itself tells me a lot about the characteristics of the data structure. I know, for example, that there's going to be a function that, given a single key and the map, will very efficiently search for and return the value associated with that key, and that there's also likely a function that will give you a list/array/whatever of all of the keys, which you could search yourself using your own code.
My interpretation of your data structure is that you have a map where the keys are what you call a pattern, those being a list (or array, or something of that nature) of characters, and the values are tokens. Thus, you can, given a full pattern, quickly find the token associated with it.
Unfortunately, while such a map is a good match to converting your XML input format to a internal data structure, it's not a good match to the searches you need to do. Note that you're not looking up entire patterns, but the first character of a pattern, producing a set of possible tokens, followed by a lookup of the second character of a pattern from within the set of patterns produced by that first lookup, and so on.
So what you really need is not a single map, but maps of maps of maps, each keyed by a single character. A lookup of "p" on the top level should give you a new map, with two keys: p, producing the C-PPA token, and "anything else", producing the C-PA token. This is effectively a trie data structure.
Does this make sense?
It may help if you start out by writing the parsing code first, in this manner: imagine someone else will write the functions to do the lookups you need, and he's a really good programmer and can do pretty much any magic that you want. Writing the parsing code, concentrate on making that as simple and clean as possible, creating whatever interface using these arbitrary functions you need (while not getting trivial and replacing the whole thing with one function!). Now you can look at the lookup functions you ended up with, and that tells you how you need to access your data structure, which will lead you to the type of data structure you need. Once you've figured that out, you can then work out how to load it up.
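To illustrate the maps-of-maps idea, here is a sketch in Python (a C++ version would use nested std::map or a small node struct in the same way). The sentinel TOKEN and the function names are made up for the example; the rules are the four from the question.

```python
RULES = {"a": "V-A", "p": "C-PA", "pp": "C-PPA", "u": "V-U"}
TOKEN = object()   # key under which a node stores the token of a complete pattern

def build_trie(rules):
    """Maps of maps, each level keyed by a single character."""
    root = {}
    for pattern, token in rules.items():
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[TOKEN] = token
    return root

def tokenize(text, trie):
    """At each position, follow the trie as far as possible and keep the longest match."""
    tokens, pos = [], 0
    while pos < len(text):
        node, best, i = trie, None, pos
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if TOKEN in node:
                best = (node[TOKEN], i)        # remember the longest complete pattern
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best[0])
        pos = best[1]
    return tokens

print(" + ".join(tokenize("appu", build_trie(RULES))))
```

Each input character is looked up once per trie level, so the work per token depends on the pattern length and not on the 500+ rules.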
This method will work - I'm not sure that it is efficient, but it should work.
I would use the standard std::map rather than your own system.
There are tools like lex (or flex) that can be used for this. The issue would be whether you can regenerate the lexical analyzer that it would construct when the XML specification changes. If the XML specification does not change often, you may be able to use tools such as lex to do the scanning and mapping more easily. If the XML specification can change at the whim of those using the program, then lex is probably less appropriate.
There are some caveats - notably that both lex and flex generate C code, rather than C++.
I would also consider looking at pattern matching technology - the sort of stuff that egrep in particular uses. This has the merit of being something that can be handled at runtime (because egrep does it all the time). Or you could go for a scripting language - Perl, Python, ... Or you could consider something like PCRE (Perl Compatible Regular Expressions) library.
Better yet, if you're going to use the boost library, there's always the Boost tokenizer library -> http://www.boost.org/doc/libs/1_39_0/libs/tokenizer/index.html
You could use a regex (perhaps the boost::regex library). If all of the patterns are just strings of letters, a regex like "(a|pp|p|u)" would find the longest match at each position. List longer patterns before their prefixes, because Perl-style engines take the first alternative that matches. So:
1. Run a regex_search using the above pattern to locate the next match.
2. Plug the match-text into your std::map to get the replace-text.
3. Print the non-matched consumed input and replace-text to your output, then repeat 1 on the remaining input.
And done.
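The same loop, sketched with Python's re module instead of boost::regex. Sorting the patterns longest-first before joining them with | is my way of making sure 'pp' wins over 'p'; translate is a hypothetical name.

```python
import re

RULES = {"a": "V-A", "p": "C-PA", "pp": "C-PPA", "u": "V-U"}

# Longest patterns first, so the engine tries 'pp' before 'p'.
PATTERN = re.compile("|".join(
    re.escape(p) for p in sorted(RULES, key=len, reverse=True)))

def translate(text):
    """Replace every matched pattern by its token; keep unmatched text as-is."""
    out, pos = [], 0
    for m in PATTERN.finditer(text):
        if m.start() > pos:
            out.append(text[pos:m.start()])    # non-matched input passes through
        out.append(RULES[m.group()])
        pos = m.end()
    if pos < len(text):
        out.append(text[pos:])
    return out

print(" + ".join(translate("appu")))
```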
It may seem a bit complicated, but the most efficient way to do that is to use a graph to represent a state-chart. At first, I thought boost.statechart would help, but I figured it wasn't really appropriate. This method can be more efficient than using a simple std::map IF there are many rules, the number of possible characters is limited and the length of the text to read is quite high.
So anyway, using a simple graph :
0) create graph with "start" vertex
1) read xml configuration file and create vertices when needed (transition from one "set of characters" (eg "pp") to an additional one (eg "ppa")). Inside each vertex, store a transition table to the next vertices. If "key text" is complete, mark vertex as final and store the resulting text
2) now read text and interpret it using the graph. Start at the "start" vertex. ( * ) Use table to interpret one character and to jump to new vertex. If no new vertex has been selected, an error can be issued. Otherwise, if new vertex is final, print the resulting text and jump back to start vertex. Go back to (*) until there is no more text to interpret.
You could use boost.graph to represent the graph, but I think it is overly complex for what you need. Make your own custom representation.