Avoiding comments with C++ getline()

I'm using getline() to read a .cpp file:
getline(theFile, fileData);
I'm wondering if there is any way to have getline() avoid grabbing C++ comments (/*, */ and //)?
So far, trying something like this doesn't quite work.
if (fileData[i] == '/*')

I think it's unavoidable for you to read the comments, but you can dispose of them by reading through the file one character at a time.
To do this, you can load the file into a string and build a state machine with the following states:
1. This is actual code
2. The previous character was /
3. The previous character was *
4. I am a single-line comment
5. I am a multi-line comment
The state machine starts in State 1
If the machine is in State 1 and hits a / character, transition to State 2.
If the machine is in State 2 and hits a / character, transition to State 4; if it hits a * character, transition to State 5; otherwise, transition back to State 1.
If the machine is in State 4 and hits a newline character, transition to State 1.
If the machine is in State 5 and hits a * character, transition to State 3.
If the machine is in State 3 and hits a / character, transition to State 1 (the multi-line comment ends); if it hits another * character, stay in State 3; otherwise, transition back to State 5.
If you mark the positions of the characters where the machine enters and exits the comment states, you can then strip these characters from the string.
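For concreteness, here is a minimal sketch of that machine in C++ (the enum values mirror States 1 through 5 above; it deliberately ignores complications such as string literals that happen to contain "/*"):

#include <string>

// Sketch of the five-state machine described above.
// Code=1, Slash=2, Star=3, LineComment=4, BlockComment=5.
std::string stripComments(const std::string& code) {
    enum State { Code, Slash, Star, LineComment, BlockComment };
    State state = Code;
    std::string out;
    for (char c : code) {
        switch (state) {
            case Code:
                if (c == '/') state = Slash;
                else out += c;
                break;
            case Slash:                        // previous character was '/'
                if (c == '/') state = LineComment;
                else if (c == '*') state = BlockComment;
                else { out += '/'; out += c; state = Code; }
                break;
            case LineComment:                  // runs to the end of the line
                if (c == '\n') { out += c; state = Code; }
                break;
            case BlockComment:                 // runs until "*/"
                if (c == '*') state = Star;
                break;
            case Star:                         // previous character was '*'
                if (c == '/') state = Code;
                else if (c != '*') state = BlockComment;
                break;
        }
    }
    return out;
}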
Alternatively, you could explore regular expressions, which provide ways of describing this kind of state machine very succinctly.

So, one problem is that if (fileData[i] == '/*') tests whether the char fileData[i] is equal to '/*', which is not a char at all: a multicharacter literal like '/*' has type int.
To find if a line contains a comment, you will probably want to look into one of the following:
<regex> in C++11 (Boost has a regular expression library as well, if that's more your thing.)
strstr in vanilla C/C++.
For multi-line comments, you'll probably want to store a flag indicating whether the previous line ended inside a comment, and then search for /* or */ according to that flag, updating it as you go.
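A sketch of that flag idea, using std::string::find rather than regex (stripLine is a hypothetical helper name, not from the original answer):

#include <string>

// Strips comments from a single line; inComment carries the
// "inside /* ... */" state from one call (line) to the next.
std::string stripLine(const std::string& line, bool& inComment) {
    std::string out;
    size_t i = 0;
    while (i < line.size()) {
        if (inComment) {
            size_t end = line.find("*/", i);
            if (end == std::string::npos) return out;  // comment spans lines
            inComment = false;
            i = end + 2;
        } else {
            size_t block = line.find("/*", i);
            size_t slashes = line.find("//", i);
            if (slashes != std::string::npos &&
                (block == std::string::npos || slashes < block))
                return out + line.substr(i, slashes - i);  // rest is // comment
            if (block == std::string::npos)
                return out + line.substr(i);               // no comment left
            out += line.substr(i, block - i);
            inComment = true;
            i = block + 2;
        }
    }
    return out;
}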

Single quotation marks designate a char, and the char data type represents a SINGLE character. '/*' doesn't make sense, because it's two characters, while fileData[i] refers to a single char.
Your if statement needs to be far more robust.


Is there any drawback to using '\n' at the start instead of end?

Mostly I see people using \n at the end of a string, but putting \n at the beginning makes more sense to me, since then I don't have to keep track of what will be printed next.
For example:
std::cout<<"Some string\n"; //syntax 1
Suppose control then passes to some other function where I don't need a new line; this syntax still forces that newline to be inserted, unless I think ahead and keep track of whether the next output needs to start on a new line.
std::cout<<"\nSome string"; //syntax 2
But by using the second syntax I can avoid such things and I only have to worry about the current statement.
Question: Is it just a personal preference between the two syntaxes, or is there any drawback to the second one compared to the first?
It is not at all "personal preference" - the two solutions are semantically different. You would use one over the other when the requirements of your application demand it.
One critical point, though: when the stream is line-buffered (as terminal output typically is on many platforms), \n causes any buffered text to be flushed and displayed. If you delay the \n, you may not see the output until the next \n arrives, which may not be deterministic or timely.
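When you need the output to appear immediately regardless of where the \n falls, you can flush explicitly; a minimal illustration:

#include <iostream>

int main() {
    std::cout << "working..." << std::flush;  // no newline, but forced out now
    // ... long-running work here ...
    std::cout << "\ndone\n";                  // leading \n ends the first line
}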
You can use either of these syntaxes according to your needs. The choice matters most for recurring statements, such as loops and recursive functions, where a string is printed repeatedly.
The second syntax, std::cout<<"\nSome string";, will however always insert a new line before printing anything, which might not suit your purpose if it's the first line that's being printed.
But I believe it's just a matter of personal preference.
On Unix systems, when the content of standard output is intended to be the input of another program, printing a \n last generally means your output will work with a larger set of command-line utilities.
In the end it depends on what you want to do with standard output. That's why it's possible to write both ways.
It depends on the context you are writing your output in.
Syntax 2 (leading '\n')
If you want to make sure that your output begins on a new line, and you can't tell whether the cursor is already at the beginning of a line, you would favor syntax 2 and put the '\n' first.
Example without '\n':
Output you don't control
----------------------------
Value 1 | Value 2 | Value 3Some string of your function.
Example with '\n':
Output you don't control
----------------------------
Value 1 | Value 2 | Value 3
Some string of your function.
Syntax 1 (trailing '\n')
On the other hand, if you are done with your output, it is good practice to finish with a new line (syntax 1), so the upcoming output doesn't have to be concerned with what state you left the cursor in.
Example without '\n':
Some string of your function.Output you don't control
----------------------------
Value 1 | Value 2 | Value 3
Example with '\n':
Some string of your function.
Output you don't control
----------------------------
Value 1 | Value 2 | Value 3

Formulation of language and regular expressions

I can't figure out the formal language and regular expression of this automaton:
[DFA diagram]
I know that the number of 'a's and the number of 'b's each have to be even.
At first I thought the language was:
L = { a^i b^j | i mod 2 = j mod 2 = 0, i, j >= 0 }
But the automaton can start from 'b', so the language is incorrect.
Also, the regular expression I found, ((aa)* + (bb)), doesn't match either; it can't produce abab, for example.
The regex I got by progressively ripping out nodes (order: 3,1,2,0) is:
(aa|bb|(ab|ba)(bb|aa)*(ab|ba))*
As far as I can tell, that's the simplest it goes. (I'd love to know if anyone has a simpler reduction—I'm actually taking a test on this stuff this week!)
Step-by-step process
We start off by adding a new start and accept state. Every old accept state (in this case, there's only one) gets linked to the new accept state with an ε transition:
Next, we rip out state 3. We need to preserve all paths that run through state 3. In this case we've added a path from state 0 back to itself, paths from state 0 to state 2, and state 2 back to itself:
We do the same with state 1:
We can simplify this a bit: we'll concatenate the looping-back transitions with commas. (At the end, these will turn into the union operator: | or ∪, depending on your notation.)
We'll remove state 2 next, and get everything smooshed onto one big loop:
Loops become stars; we remove the last state so we just have a transition from the start state to the end state connected with one big regular expression:
And that's our regular expression!
Language definition
You're pretty close with the language definition. If you can allow something a little looser, it would be this:
L = { w | w contains an even number of 'a's and 'b's }
The problem with your definition is that it forces all of the a's to come before all of the b's, whereas the only restriction is on the parity of the number of a's and b's.
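As a quick sanity check, you can enumerate all short strings over {a, b} and compare the regex against the parity definition of L; a minimal sketch using std::regex:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // The regex derived above; regex_match anchors it to the whole string.
    std::regex r(R"((aa|bb|(ab|ba)(bb|aa)*(ab|ba))*)");
    for (int len = 0; len <= 6; ++len) {
        for (int bits = 0; bits < (1 << len); ++bits) {
            std::string w;
            int as = 0;
            for (int i = 0; i < len; ++i) {
                bool isA = (bits >> i) & 1;
                w += isA ? 'a' : 'b';
                as += isA;
            }
            // In L iff the counts of a's and b's are both even.
            bool inL = (as % 2 == 0) && ((len - as) % 2 == 0);
            if (std::regex_match(w, r) != inL)
                std::cout << "mismatch on \"" << w << "\"\n";
        }
    }
    std::cout << "check complete\n";
}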

stop short of multiple strings and characters using '^'

I'm doing a regex operation that needs to stop short of either of the character sequences { or \t\t{.
The first is OK, but the second cannot be achieved using the ^ symbol the way I have been using it.
My current regex is [\t+]?{\d+}[^\{]*
As you can see, I've used ^ effectively with a single character, but I cannot apply it to a string of characters like \t\t\{
How can the current regex be applied to consider both of these possibilities?
Example text:
{1} The words of the blessing of Enoch, wherewith he blessed the elect and righteous, who will be living in the day of tribulation, when all the wicked and godless are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, which the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, [even] on Mount Sinai,
And appear from His camp
And appear in the strength of His might from the heaven of heavens.
{5} And all shall be smitten with fear
And the Watchers shall quake,
And great fear and trembling shall seize them unto the ends of the earth.
{6} And the high mountains shall be shaken,
And the high hills shall be made low,
And shall melt like wax before the flame
When I do this as a multi-line extract, the indentation is not maintained for the first line of each block. Ideally, each extract should stop short of the \t\t{, allowing it to be picked up at the start of the next extract and creating perfectly indented blocks. The reason is that when the blocks are read back from the database, the \t\t should be detectable on the first line to allow dynamic formatting.
[\t+]?{\d+}[\s\S]*?(?=\s*{|$)
You can use this. See the demo:
https://regex101.com/r/nNUHJ8/1
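In C++, for example, you could pull the blocks out with std::regex and the same lookahead idea; a minimal sketch (I've escaped the braces and used \t* instead of [\t+]?, since an unescaped { can trip up some regex engines):

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "{1} First block text. {2} Second block text.";
    // Lazily match everything up to (but not including) the next
    // optionally-indented '{', or the end of the input.
    std::regex block(R"(\t*\{\d+\}[\s\S]*?(?=\s*\{|$))");
    for (std::sregex_iterator it(text.begin(), text.end(), block), end;
         it != end; ++it)
        std::cout << '[' << it->str() << "]\n";
}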

Convert and validate string

I need to take time as user input in the form HH:MM and then validate it.
It needs to be a proper time in exactly that format. Any good ideas on how to do that?
I'm trying to make a function that will iterate through the string, validating each character, then convert them into numbers (or some kind of time stamp) so I can compare several strings to each other.
I'm only using the std namespace.
Use boost::regex to match the string and capture its parts (HH) and (MM), then use sscanf on those parts to get the hours and minutes.
It sounds more like an algorithm problem. I would (see the sketch below):
1. Check that the length of the string is 5.
2. Check that ':' is in the middle (index 2).
3. Check that HH is in range (00 to 23).
4. Check that MM is in range (00 to 59).
5. Convert it to a format that makes comparisons convenient.
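A sketch of those steps as one function (parseTime is a hypothetical name; it returns minutes since midnight so results are directly comparable, or -1 on failure):

#include <cctype>
#include <string>

int parseTime(const std::string& s) {
    if (s.size() != 5 || s[2] != ':')            // steps 1 and 2
        return -1;
    const int digitPos[] = {0, 1, 3, 4};         // all four digits present?
    for (int i : digitPos)
        if (!std::isdigit(static_cast<unsigned char>(s[i])))
            return -1;
    int hh = (s[0] - '0') * 10 + (s[1] - '0');
    int mm = (s[3] - '0') * 10 + (s[4] - '0');
    if (hh > 23 || mm > 59)                      // steps 3 and 4
        return -1;
    return hh * 60 + mm;                         // step 5: comparable value
}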
It may be overkill for this particular problem, but this kind of task is a great fit for a state machine. Basically, you'll want to read the input one character at a time, and each character can change the machine's state until you end up in a success or error state (a code sketch follows the list). For example:
First character
If not a number, change to error state
Otherwise store value and change to state 2
Second character
If not a number, change to error state
Otherwise multiply stored value by 10 and add second character. If the result is out of range, change to error state. Otherwise, change to state 3
Third character
If :, change to state 4, otherwise change to error state
Fourth character
Similar to First character, changing to state 5 upon success.
Fifth character
Similar to Second character, changing to state 6 upon success.
Success state
A winner is you!
Error state
Handle the error, duh.
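A rough sketch of this machine (the state names mirror the list above, with Success standing in for state 6; isValidTime is a hypothetical name):

#include <string>

bool isValidTime(const std::string& input) {
    enum State { S1, S2, S3, S4, S5, Success, Error };
    State state = S1;
    int value = 0;  // the number currently being accumulated
    for (char c : input) {
        switch (state) {
            case S1: case S4:                    // first digit of HH or MM
                if (c < '0' || c > '9') state = Error;
                else { value = c - '0'; state = (state == S1) ? S2 : S5; }
                break;
            case S2: case S5:                    // second digit of HH or MM
                if (c < '0' || c > '9') { state = Error; break; }
                value = value * 10 + (c - '0');
                state = (state == S2) ? ((value <= 23) ? S3 : Error)
                                      : ((value <= 59) ? Success : Error);
                break;
            case S3:                             // the ':' separator
                state = (c == ':') ? S4 : Error;
                break;
            default:                             // trailing characters
                state = Error;
                break;
        }
        if (state == Error) return false;
    }
    return state == Success;
}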

Regex, writing a toy compiler, parsing, comment remover

I'm currently working my way through this book:
http://www1.idc.ac.il/tecs/
I'm currently on a section where the exercise is to create a compiler for a very simple Java-like language.
The book always states what is required but not the how (which is a good thing). I should also mention that it talks about yacc and lex, and specifically says to avoid them for the projects in the book, for the sake of learning on your own.
I'm on chapter 10 and starting to write the tokenizer.
1) Can anyone give me some general advice: are regexes the best approach for tokenizing a source file?
2) I want to remove comments from source files before parsing. This isn't hard, but most compilers tell you the line an error occurs on, and if I just remove comments this will mess up the line count. Are there any simple strategies for preserving the line count while still removing junk?
Thanks in advance!
The tokenizer itself is usually written using a large DFA table that describes all possible valid tokens (for example: a token can start with a letter, followed by other letters/numbers, followed by a non-letter; or with a number followed by other numbers, and either a non-number/point, or a point followed by at least one number and then a non-number; etc.). The way I built mine was to identify all the regular expressions my tokenizer will accept, transform them into DFAs, and combine them.
Now to "remove comments", when you're parsing a token you can have a comment token (the regex to parse a comment, too long to describe in words), and when you finish parsing this comment you just parse a new token, thus ignoring it. Alternatively you can pass it to the compiler and let it deal with it (or ignore it as it will). Either aproach will preserve meta-data like line numbers and characters-into-the-line.
edit for DFA theory:
Every regular expression can be converted (and is converted) into a DFA for performance reasons, which removes any backtracking when matching. This link gives you an idea of how it's done: you first convert the regular expression into an NFA (a nondeterministic automaton, which would need backtracking to simulate directly), then eliminate the nondeterminism by inflating your finite automaton (the subset construction).
Another way you can build your DFA is by hand, using some common sense. Take for example a finite automaton that can parse either an identifier or a number. This of course isn't enough, since you most likely want to add comments too, but it'll give you an idea of the underlying structures.
           A-Z       space
 ->(Start)----->(I1)------->((Identifier))
     |           | ^
     |           +-+
     |          A-Z0-9
     |
     |             space
     +---->(N1)---+--->((Number)) <----------+
      0-9   | ^   |                          |
            | |   | .        0-9       space |
            +-+   +--->(N2)----->(N3)--------+
            0-9                    | ^
                                   +-+
                                   0-9
Some notes on the notation used: the DFA starts at the (Start) node and moves through the arrows as input is read from your file. At any one point it can match only ONE path. Any missing paths default to an "error" node. ((Number)) and ((Identifier)) are your ending, success nodes. Once in one of those nodes, you return your token.
So from the start, if your token starts with a letter, it HAS to continue with a run of letters or numbers and end with a "space" (spaces, new lines, tabs, etc.). There is no backtracking; if this fails, the tokenizing process fails and you can report an error. You should read a theory book on error recovery to continue parsing; it's a really huge topic.
If however your token starts with a number, it has to be followed by either a bunch of numbers or one decimal point. If there's no decimal point, a "space" has to follow the numbers; otherwise the point has to be followed by at least one number, then more numbers, and then a space. I didn't include scientific notation, but it's not hard to add.
Now for parsing speed, this gets transformed into a DFA table, with all nodes on both the vertical and horizontal lines. Something like this:
            I1             Identifier   N1       N2       N3       Number
start       letter         nothing      number   nothing  nothing  nothing
I1          letter+number  space        nothing  nothing  nothing  nothing
Identifier  nothing        SUCCESS      nothing  nothing  nothing  nothing
N1          nothing        nothing      number   dot      nothing  space
N2          nothing        nothing      nothing  nothing  number   nothing
N3          nothing        nothing      nothing  nothing  number   space
Number      nothing        nothing      nothing  nothing  nothing  SUCCESS
The way you'd run this is you store your starting state and move through the table as you read your input character by character. For example an input of "1.2" would parse as start->N1->N2->N3->Number->SUCCESS. If at any point you hit a "nothing" node, you have an error.
edit 2: the table should actually be node -> character -> node, not node -> node -> character, but it worked fine in this case regardless. It's been a while since I last wrote a compiler by hand.
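A hedged sketch of driving such a table in C++ (the states and character classes mirror the table above, stored node -> character -> node as per edit 2; like the diagram's DFA, the input needs a trailing delimiter such as a space):

#include <cctype>
#include <string>

enum { START, I1, IDENT, N1, N2, N3, NUMBER, ERR };

// Character classes: 0 = letter, 1 = digit, 2 = dot, 3 = space/other.
int charClass(char c) {
    if (std::isalpha(static_cast<unsigned char>(c))) return 0;
    if (std::isdigit(static_cast<unsigned char>(c))) return 1;
    return c == '.' ? 2 : 3;
}

// table[state][class] = next state.
const int table[7][4] = {
    /* START  */ { I1,  N1,  ERR, ERR },
    /* I1     */ { I1,  I1,  ERR, IDENT },
    /* IDENT  */ { ERR, ERR, ERR, ERR },   // accepting; never re-entered
    /* N1     */ { ERR, N1,  N2,  NUMBER },
    /* N2     */ { ERR, N3,  ERR, ERR },
    /* N3     */ { ERR, N3,  ERR, NUMBER },
    /* NUMBER */ { ERR, ERR, ERR, ERR },   // accepting; never re-entered
};

// Returns IDENT, NUMBER, or ERR for a delimiter-terminated input like "1.2 ".
int run(const std::string& input) {
    int state = START;
    for (char c : input) {
        if (state == IDENT || state == NUMBER) break;  // already accepted
        state = table[state][charClass(c)];
        if (state == ERR) return ERR;
    }
    return (state == IDENT || state == NUMBER) ? state : ERR;
}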
1- Yes, regexes are a good way to implement the tokenizer. If you're using a generated tokenizer like lex, you describe each token as a regex; see Mark's answer.
2- The lexer is what normally tracks line/column information. As tokens are consumed by the tokenizer, you track the line/column information with each token, or keep it as current state, so when a problem is found the tokenizer knows where you are. When processing comments, the tokenizer simply increments line_count as new lines are consumed.
In Lex you can also have parsing states. Multi-line comments are often implemented using these states, thus allowing simpler regexes. Once you find a match for the start of a comment, e.g. '/*', you change into the comment state, which you can set up to be exclusive of the normal state. Then, as you consume text looking for the end-of-comment marker '*/', you do not match normal tokens.
This state-based process is also useful for processing string literals that contain escaped end markers, e.g. "test\"more text".
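A small sketch of that last point in C++ (scanString is a hypothetical helper): once inside a string literal, an "in string" state consumes escaped characters, so an embedded \" doesn't end the token:

#include <string>

// pos points at the opening quote; returns the index just past the
// closing quote, or std::string::npos if the literal is unterminated.
size_t scanString(const std::string& src, size_t pos) {
    for (size_t i = pos + 1; i < src.size(); ++i) {
        if (src[i] == '\\') ++i;               // skip the escaped character
        else if (src[i] == '"') return i + 1;  // unescaped quote ends it
    }
    return std::string::npos;
}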