Difficulty parsing comments that come after ; in a line - c++

I'm parsing a file that has definitions of functions. Since functions may be written in multiple lines, I'm parsing until encountering a ;:
#include <fstream>
#include <iostream>
#include <string>
void removeLineBreaks(std::string &str)
{
    auto pos = str.find('\n');
    while (pos != std::string::npos)
    {
        str.replace(pos, 1, "");
        pos = str.find('\n', pos);
    }
}
int main()
{
    std::ifstream ifStream("a.pr");
    std::string sLine;
    const char sDelim(';');
    while (std::getline(ifStream, sLine, sDelim))
    {
        sLine += sDelim;
        removeLineBreaks(sLine);
        // process further
    }
}
The text can be something like this:
a=f(b,c); // comment
d=f(e,f);
Since I'm reading until ;, here I get two pieces:
a=f(b,c); and
// comment
\n d=f(e,f);.
If I call removeLineBreaks on the second piece, it'll become // comment d=f(e,f); so it'd be treated as a comment by my parser.
What options do I have to make this work correctly? I could think of this - before calling removeLineBreaks on the line, get the string until \n, and if it starts with //, cut that part from the line, and only then call removeLineBreaks.
Any other ideas?

You first need to remove the //comments from your input and only then you can split at the semicolons.
Consider the following input:
a=f(b,c); // Functions comments are not functions; a=F(b,c);
If you first split on semicolons and then remove the comments, then you would end up with two functions:
a=f(b,c);
a=F(b,c);
But you only want to have the first one.
The solution is to:
Read the file line by line (lines delimited by LF).
While doing that remove the line based // comments and also all line breaks.
Combine all input back into a single string.
Split the string on semicolons to extract all functions which are outside comments.
The steps do not have to be done sequentially. You can do all these steps simultaneously on the input stream of characters, and emit a stream of functions. In fact, this is what real parsers would do.
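The single-pass variant can be sketched roughly as follows. This is a minimal sketch, not a full parser: readStatements is a hypothetical helper name, it assumes // is the only comment syntax, that the language has no string literals that could contain // or ;, and it silently drops trailing text that has no closing ;.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads the stream one character at a time, skips
// `//` comments and line breaks, and emits one string per `;`.
std::vector<std::string> readStatements(std::istream &in)
{
    std::vector<std::string> statements;
    std::string current;
    char c;
    while (in.get(c))
    {
        if (c == '/' && in.peek() == '/')
        {
            // Line comment: discard everything up to the end of the line.
            while (in.get(c) && c != '\n') {}
        }
        else if (c == '\n' || c == '\r' ||
                 (current.empty() && std::isspace(static_cast<unsigned char>(c))))
        {
            // Line breaks are dropped so multi-line statements are joined;
            // whitespace before the first character of a statement is dropped too.
        }
        else if (c == ';')
        {
            current += c;
            statements.push_back(current);
            current.clear();
        }
        else
        {
            current += c;
        }
    }
    return statements;
}
```

Passing the std::ifstream from the question to readStatements would then yield a=f(b,c); and d=f(e,f); as separate entries, with the comment gone before any splitting happens.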
You are essentially writing a simple parser. As your language gets more and more complicated you will find it more and more difficult to parse a file in such a way. For example with the approach above it will be difficult to emit error messages with line number information.
If you want to write a proper parser I would recommend a recursive descent parser together with a PEG (parsing expression grammar). This approach is easy to learn, has fewer pitfalls than other approaches, and is still very powerful for computer languages. See here: https://en.wikipedia.org/wiki/Parsing_expression_grammar
Warning: If you hear people suggesting flex and bison (or lex and yacc), I strongly recommend against using them. They are complicated to use and are very limited in what they can parse and how it needs to be specified. I would rather suggest to use a light-weight and modern parsing framework like PEGTL: https://github.com/taocpp/PEGTL.

Related

D: split string by comma, but not quoted string

I need to split a string on commas that are not inside quotes, like:
foo, bar, "hello, user", baz
to get:
foo
bar
hello, user
baz
Using std.csv:
import std.csv;
import std.stdio;
void main()
{
    auto str = `foo,bar,"hello, user",baz`;
    foreach (row; csvReader(str))
    {
        writeln(row);
    }
}
Application output:
["foo", "bar", "hello, user", "baz"]
Note that I modified your CSV example data, because std.csv wouldn't parse it correctly with a space before the opening quote (").
You can use the following snippet to complete this task:
import std.regex;
import std.stdio;
void main()
{
    string fileFullName = `D:\code\test\example.csv`;
    auto fileContent = File(fileFullName, "r");
    auto r = regex(`(?!\B"[^"]*),(?![^"]*"\B)`);
    foreach (line; fileContent.byLine)
    {
        auto result = split(line, r);
        writeln(result);
    }
}
If you are parsing a specific file format, splitting by line and using regex often isn't correct, though it will work in many cases. I prefer to read it in character by character and keep a few flags for state (or use someone else's function where appropriate that does it for you for this format). D has std.csv: http://dlang.org/phobos/std_csv.html or my old old csv.d which is minimal but basically works too: https://github.com/adamdruppe/arsd/blob/master/csv.d (haha 5 years ago was my last change to it, but hey, it still works)
Similarly, you can kinda sorta "parse" html with regex... sometimes, but it breaks pretty quickly outside of simple cases and you are better off using an actual html parser (which probably is written to read char by char!)
Back to quoted commas, reading csv, for example, has a few rules with quoted content: first, of course, commas can appear inside quotes without going to the next field. Second, newlines can also appear inside quotes without going to the next row! Third, two quote characters in a row is an escaped quote that is in the content, not a closing quote.
foo,bar
"this item has
two lines, a comma, and a "" mark!",this is just bar
I'm not sure how to read that with regex (eyeballing, I'm pretty sure yours gets the escaped quote wrong at least), but it isn't too hard to do when reading one character at a time (my little csv reader is about fifty lines, doing it by hand). Splitting the lines ahead of time also complicates things compared to just reading the characters, because you might then have to recombine lines later when you find that a quoted field continues onto the next line! And then your beautiful byLine loop suddenly isn't so beautiful.
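To make the character-by-character idea concrete, here is a rough sketch of such a reader. It is written in C++ rather than D, it is not the csv.d code linked above, and parseCsv and Row are hypothetical names. It assumes the three quoting rules just described and nothing more (for instance, it does not reject malformed input such as a quote in the middle of an unquoted field).

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Hypothetical helper: parses CSV text one character at a time,
// keeping a couple of flags for state.
std::vector<Row> parseCsv(std::istream &in)
{
    std::vector<Row> rows;
    Row row;
    std::string field;
    bool inQuotes = false;   // currently inside a quoted field
    bool rowHasData = false; // something was read for the current row
    char c;
    while (in.get(c))
    {
        if (inQuotes)
        {
            if (c != '"')
            {
                field += c;        // commas and newlines are plain content here
            }
            else if (in.peek() == '"')
            {
                in.get(c);         // two quotes in a row: one escaped quote
                field += '"';
            }
            else
            {
                inQuotes = false;  // closing quote
            }
        }
        else if (c == '"')
        {
            inQuotes = true;
            rowHasData = true;
        }
        else if (c == ',')
        {
            row.push_back(field);  // end of field
            field.clear();
            rowHasData = true;
        }
        else if (c == '\n')
        {
            if (rowHasData)        // end of row; blank lines are skipped
            {
                row.push_back(field);
                rows.push_back(row);
            }
            row.clear();
            field.clear();
            rowHasData = false;
        }
        else if (c != '\r')
        {
            field += c;
            rowHasData = true;
        }
    }
    if (rowHasData)                // last row without a trailing newline
    {
        row.push_back(field);
        rows.push_back(row);
    }
    return rows;
}
```

Fed the three-line example above, this returns two rows, and the first field of the second row keeps its embedded newline, comma and quote mark.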
Besides, when looking back later, I find simple character readers and named functions to be more understandable than a regex anyway.
So, your answer is correct for the limited scope you asked about, but might be missing the big picture of other cases in the file format you are actually trying to read.
edit: one last thing I want to pontificate on, these corner cases in CSV are an example of why people often say "don't reinvent the wheel". It isn't that they are really hard to handle - look at my csv.d code, it is short, pretty simple, and works at everything I've thrown at it - but that's the rub, isn't it? "Everything I've thrown at it". To handle a file format, you need to be aware of what the corner cases are so you can handle them, at least if you want it to be generic and take arbitrary user input. Knowing these edge cases tends to come more from real world experience than just taking a quick glance. Once you know them though, writing the code again isn't terribly hard, you know what to test for! But if you don't know it, you can write beautiful code with hundreds of unittests... but miss the real world case your user just happens to try that one time it matters.

Differentiating between delimiter and newline in getline

ifstream file;
file.open("file.csv");
string str;
while(file.good())
{
    getline(file,str,',');
    if (___) // string was split from delimiter
    {
        [do this]
    }
    else // string was split from eol
    {
        [do that]
    }
}
file.close();
I'd like to read from a csv file, and differentiate between what happens when a string is split off due to a new line and what happens when it is split off due to the desired delimiter -- i.e. filling in the ___ in the sample code above.
The approaches I can think of are:
(1) manually adding a character to the end of each line in the original file,
(2) automatically adding a character to the end of each line by writing to another file,
(3) using getline without the delimiter and then making a function to split the resulting string by ','.
But is there a simpler or direct solution?
(I see that similar questions have been asked before, but I didn't see any solutions.)
My preference for clarity of the code would be to use your option 3) - use getline() with the standard '\n' delimiter to read the file into a buffer line by line and then use a tokenizer like strtok() (if you want to work on the C level) or boost::tokenizer to parse the string you read from the file.
You're really dealing with two distinct steps here, first read the line into the buffer, then take the buffer apart to extract the components you're after. Your code should reflect that and by doing so, you're also avoiding having to deal with odd states like the ones you describe where you end up having to do additional parsing anyway.
There is no easy way to determine "which delimiter terminated the string", and it gets "consumed" by getline, so it's lost to you.
Read the line, and split on commas yourself. You can use std::string::find() to find commas - however, if your file contains strings that in themselves contain commas, you will have to parse the string character by character, since you need to distinguish between commas in quoted text and commas in unquoted text.
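A minimal sketch of the find()-based split, assuming no field contains a quoted comma; splitOnCommas is a hypothetical name:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits one line (already read with plain getline)
// on commas using std::string::find.
std::vector<std::string> splitOnCommas(const std::string &line)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;)
    {
        auto comma = line.find(',', start);
        if (comma == std::string::npos)
        {
            fields.push_back(line.substr(start)); // this field ended at the end of the line
            break;
        }
        fields.push_back(line.substr(start, comma - start)); // this field ended at a comma
        start = comma + 1;
    }
    return fields;
}
```

Called once per line inside a while (std::getline(file, str)) loop, every element except the last was terminated by the delimiter and the last one by the end of the line, which is the distinction the question asks for.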
Your big problem is your code does not do what you think it does.
getline with a delimiter treats \n as just another character from my reading of the docs. It does not split on both the delimiter and newline.
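A small sketch that makes this visible; piecesFromGetline is a hypothetical helper that just collects whatever std::getline returns for a given delimiter:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects every piece std::getline yields for `delim`.
std::vector<std::string> piecesFromGetline(const std::string &text, char delim)
{
    std::istringstream in(text);
    std::vector<std::string> pieces;
    std::string piece;
    while (std::getline(in, piece, delim))
        pieces.push_back(piece);
    return pieces;
}
```

For the text "a,b\nc,d" with ',' as the delimiter, the middle piece comes back as "b\nc": the newline is kept inside the string instead of ending it.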
The efficient way to do this is to write your own custom splitting getline: cppreference has a pretty clear description of what getline does, so mimicking it should be easy (and safer than shooting from the hip; files are tricky).
Then return both the string, and information about why you finished your parse in a second channel.
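One possible shape for such a function, as a sketch only: getField and Terminator are hypothetical names, and this version does not mimic every detail of std::getline (it ignores max_size and does not handle '\r').

```cpp
#include <istream>
#include <sstream>
#include <string>

// The "second channel": why reading the field stopped.
enum class Terminator { Delimiter, Newline, EndOfInput };

// Hypothetical helper: reads up to the next `delim` or '\n', whichever
// comes first. Returns false only when nothing at all could be read.
bool getField(std::istream &in, std::string &field, char delim, Terminator &why)
{
    field.clear();
    why = Terminator::EndOfInput;
    bool readAnything = false;
    char c;
    while (in.get(c))
    {
        readAnything = true;
        if (c == delim)
        {
            why = Terminator::Delimiter;
            return true;
        }
        if (c == '\n')
        {
            why = Terminator::Newline;
            return true;
        }
        field += c;
    }
    return readAnything;
}
```

In the question's loop, the if (___) test would then become a comparison of why against Terminator::Delimiter.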
Now, using getline naively and then splitting is also viable, and will be much faster to write, and probably less error prone to boot.

Minify HTML with Boost regex in C++

Question
How to minify HTML using C++?
Resources
An external library could be the answer, but I'm more looking for improvements of my current code. Although I'm all ears for other possibilities.
Current code
This is my interpretation in c++ of the following answer.
The only part I had to change from the original post is this part on top: "(?ix)"
...and a few escape signs
#include <string>
#include <boost/regex.hpp>
void minifyhtml(std::string* s) {
    boost::regex nowhitespace(
        "(?ix)"
        "(?>" // Match all whitespace other than a single space.
        "[^\\S ]\\s*" // Either one [\t\r\n\f\v] and zero or more ws,
        "| \\s{2,}" // or two or more consecutive-any-whitespace.
        ")" // Note: The remaining regex consumes no text at all...
        "(?=" // Ensure we are not in a blacklist tag.
        "[^<]*+" // Either zero or more non-"<" {normal*}
        "(?:" // Begin {(special normal*)*} construct
        "<" // or a < starting a non-blacklist tag.
        "(?!/?(?:textarea|pre|script)\\b)"
        "[^<]*+" // more non-"<" {normal*}
        ")*+" // Finish "unrolling-the-loop"
        "(?:" // Begin alternation group.
        "<" // Either a blacklist start tag.
        "(?>textarea|pre|script)\\b"
        "| \\z" // or end of file.
        ")" // End alternation group.
        ")" // If we made it here, we are not in a blacklist tag.
    );
    // #todo Don't remove conditional html comments
    boost::regex nocomments("<!--(.*)-->");
    *s = boost::regex_replace(*s, nowhitespace, " ");
    *s = boost::regex_replace(*s, nocomments, "");
}
Only the first regex is from the original post, the other one is something I'm working on and should be considered far from complete. It should hopefully give a good idea of what I try to accomplish though.
Regexps are a powerful tool, but I think that using them in this case is a bad idea. For example, the regexp you provided is a maintenance nightmare. By looking at this regexp you can't quickly understand what the heck it is supposed to match.
You need an HTML parser that tokenizes the input file, or allows you to access tokens either as a stream or as an object tree. Basically: read tokens, discard those tokens and attributes you don't need, then write what remains into the output. Using something like this would allow you to develop a solution faster than if you tried to tackle it using regexps.
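To illustrate the read-discard-write idea without pulling in a library, here is a deliberately naive hand-written tokenizer. It is a sketch under strong assumptions, not a real HTML parser: minifyHtmlSketch and its helpers are hypothetical names, a tag is assumed to end at the first >, attribute values containing > are not handled, and every comment (including conditional comments) is dropped.

```cpp
#include <cctype>
#include <string>

// Lower-cased name of the tag starting at html[pos] (a '<');
// `closing` is set for "</name".
static std::string tagNameAt(const std::string &html, std::size_t pos, bool &closing)
{
    std::size_t i = pos + 1;
    closing = i < html.size() && html[i] == '/';
    if (closing) ++i;
    std::string name;
    while (i < html.size() && std::isalnum(static_cast<unsigned char>(html[i])))
        name += static_cast<char>(std::tolower(static_cast<unsigned char>(html[i++])));
    return name;
}

// Collapses each run of whitespace in a text token to a single space.
static std::string collapseWhitespace(const std::string &text)
{
    std::string out;
    bool lastWasSpace = false;
    for (char c : text)
    {
        bool isSpace = std::isspace(static_cast<unsigned char>(c)) != 0;
        if (!isSpace) out += c;
        else if (!lastWasSpace) out += ' ';
        lastWasSpace = isSpace;
    }
    return out;
}

std::string minifyHtmlSketch(const std::string &html)
{
    std::string out;
    std::string protectedTag; // non-empty while inside <pre>, <textarea> or <script>
    std::size_t pos = 0;
    while (pos < html.size())
    {
        if (html[pos] != '<')
        {
            // Text token: everything up to the next '<'.
            std::size_t end = html.find('<', pos);
            if (end == std::string::npos) end = html.size();
            std::string text = html.substr(pos, end - pos);
            out += protectedTag.empty() ? collapseWhitespace(text) : text;
            pos = end;
            continue;
        }
        bool closing = false;
        std::string name = tagNameAt(html, pos, closing);
        if (!protectedTag.empty() && !(closing && name == protectedTag))
        {
            out += html[pos++]; // a '<' inside protected content is plain text
            continue;
        }
        if (html.compare(pos, 4, "<!--") == 0)
        {
            // Comment token: discarded.
            std::size_t end = html.find("-->", pos + 4);
            pos = (end == std::string::npos) ? html.size() : end + 3;
            continue;
        }
        // Tag token: copied verbatim up to the closing '>'.
        std::size_t end = html.find('>', pos);
        if (end == std::string::npos) end = html.size() - 1;
        out.append(html, pos, end - pos + 1);
        pos = end + 1;
        if (closing)
            protectedTag.clear();
        else if (name == "pre" || name == "textarea" || name == "script")
            protectedTag = name;
    }
    return out;
}
```

Each branch handles exactly one kind of token, so a rule such as "keep conditional comments" would become one extra check in the comment branch rather than another change to a large regex.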
I think you might be able to use an XML parser, or you could search for an XML parser with HTML support.
In C++, libxml (which might have HTML support module), Qt 4, tinyxml, plus libstrophe uses some kind of xml parser that could work.
Please note that C++ (especially C++03) might not be the best language for this kind of program. Although I strongly dislike python, python has "Beautiful Soup" module that would work very well for this kind of problem.
Qt 4 might work because it provides a decent Unicode string type (and you'll need one if you're going to parse HTML).

How to read a message from a file, modifying only words?

Suppose I have the following text:
My name is myName. I love
stackoverflow .
Hi, Guys! There is more than one space after "Guys!" 123
And also after "123" there are 2 spaces and newline.
Now I need to read this text file as it is and perform some actions only on the alphanumeric words. Afterwards I have to print it with the words changed, but with spaces, newlines and punctuation unchanged and in the same positions. The length of each alphanumeric word stays the same when it is changed. I have tried this with a library that checks for alphanumeric values, but the code gets very messy. Is there any other way?
You can read your file line-by-line with fgets() function. It will fill char array and you can work with this array, e.g. iterate over this array, split it into alnum words; change the words and then write fixed string into new file with "fwrite()" function.
If you prefer C++ way of working with files (iostream), you can use istream::getline. It will save spaces; but it will consume "\n". If you need to save even "\n" (it can be '\r' and '\r\n' sometimes), you can use istream::get.
Maybe you should look at Boost Tokenizer. It can break a string into a series of tokens and iterate through them. The following sample breaks up a phrase into words:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
int main()
{
    std::string s = "Hi, Guys! There is more...";
    boost::tokenizer<> tok(s);
    for(boost::tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg)
    {
        std::cout << *beg << "\n";
    }
    return 0;
}
But in your case you need to provide a TokenizerFunc that will break up a string at alphanumeric/non-alphanumeric boundaries.
For more information see Boost Tokenizer documentation and implementation of an already provided char_separator, offset_separator and escaped_list_separator.
Code usually gets messy because the problem was not broken down into clear functions and classes. If you do that, you will have a few functions that each do precisely one thing (not messy). Your main function will then just call these simple functions. If the function names are well chosen, the main function will become short and clear, too.
In this case, your main function needs to do:
Loop: Read every line of a file
On every line, check if and where a "special" word occurs.
If a special word occurs, replace it
Extra hints: a line of text can be stored as a std::string and can be read by std::getline(std::cin, line)
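As one way to split the work into a small function, the sketch below changes only alphanumeric runs and copies every other character through untouched. transformWords is a hypothetical name, and "word" is assumed to mean a maximal run of characters for which std::isalnum is true in the default C locale.

```cpp
#include <cctype>
#include <functional>
#include <string>

// Hypothetical helper: applies `change` to every maximal run of alphanumeric
// characters and leaves all other characters exactly where they are.
std::string transformWords(const std::string &text,
                           const std::function<std::string(const std::string &)> &change)
{
    std::string out;
    std::size_t i = 0;
    while (i < text.size())
    {
        if (!std::isalnum(static_cast<unsigned char>(text[i])))
        {
            out += text[i++]; // spaces, newlines, punctuation: copied as-is
            continue;
        }
        std::size_t start = i;
        while (i < text.size() && std::isalnum(static_cast<unsigned char>(text[i])))
            ++i;
        out += change(text.substr(start, i - start));
    }
    return out;
}
```

The function works on a whole line or on the whole file contents alike; as long as change returns a string of the same length, every space, newline and punctuation mark stays at its original position.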

Search HTML lines and remove lines that don't start with </form></td><td><a

I have an HTML file with very badly formatted code that I get from a website, and I want to extract some very small pieces of information.
I am only interested in lines that start like this:
</form></td><td> <b>user897</b></td></tr><tr><td>HouseA</td><td>2</td><td class="entriesTableRow-gamename">HouseA Type12 <span class="entriesTableRow-moredetails"></span></td><td>1 of 2</td><td>user123</td><td>10</td><td>
and I want to extract 4 fields:
A:HouseA
B:HouseA Type12
C:user123
D:10
I know I've seen people recommend HTML Agility Pack and libxml2 but I really don't think I need all that. My app is in C/C++.
I am already using getline to start reading lines, I am just not sure what's the best way to proceed. Thanks!
std::ifstream data("Home.html");
std::string line;
int linenum = 0; // line counter (declaration was missing from the snippet)
while(std::getline(data,line))
{
    linenum++;
    std::stringstream lineStream(line);
    std::string user;
    if (strncmp(line.c_str(), "</form></td><td>",strlen("</form></td><td>")) == 0)
    {
        printf("found a wanted line in line:%d\n", linenum);
    }
}
In the general case, an XML/HTML parser is likely the best way here, as it will be robust against differing input. (Whatever you do, don't use regexps!)
Update
However, if you're targeting specific input, as it seems that you're doing, you can use sscanf (as you suggest) or cin.read() or regexp to scan manually.
Just beware that this code can break at any moment that the HTML changes (even just with whitespace).
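For this one fixed line layout, a manual scan could look roughly like the sketch below. tableCells and stripTags are hypothetical names, and the code assumes each cell is written as <td ...>...</td> on a single line with no nested cells. It is exactly as fragile as described above.

```cpp
#include <string>
#include <vector>

// Removes anything between '<' and '>' and trims surrounding whitespace.
static std::string stripTags(const std::string &s)
{
    std::string out;
    bool inTag = false;
    for (char c : s)
    {
        if (c == '<') inTag = true;
        else if (c == '>') inTag = false;
        else if (!inTag) out += c;
    }
    const char *ws = " \t\r\n";
    std::size_t first = out.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    std::size_t last = out.find_last_not_of(ws);
    return out.substr(first, last - first + 1);
}

// Hypothetical helper: returns the text of every complete <td>...</td> cell in `line`.
std::vector<std::string> tableCells(const std::string &line)
{
    std::vector<std::string> cells;
    std::size_t pos = 0;
    for (;;)
    {
        std::size_t open = line.find("<td", pos);
        if (open == std::string::npos) break;
        std::size_t contentStart = line.find('>', open);
        if (contentStart == std::string::npos) break;
        ++contentStart;
        std::size_t close = line.find("</td>", contentStart);
        if (close == std::string::npos) break; // unterminated cell: stop
        cells.push_back(stripTags(line.substr(contentStart, close - contentStart)));
        pos = close + 5; // skip past "</td>"
    }
    return cells;
}
```

On the sample line from the question, the cells at indexes 1, 3, 5 and 6 hold the fields A to D; any change to the page layout will shift those indexes.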
Therefore, my/our recommendation is to use a proper tool for the job. XML/HTML is not raw text, and should not be treated as such.
How about writing a python script instead? :)