Reading text file by scanning for keywords


As part of a bigger application I am working on a class for reading input from a text file for use in the initialization of the program. Now I am myself fairly new to programming, and I only started to learn C++ in December, so I would be very grateful for some hints and ideas on how to get started! I apologise in advance for a rather long wall of text.
The text file format is "keyword-driven" in the following way:
There are a rather small number of main/section keywords (currently 8) that need to be written in a given order. Some of them are optional, but if they are included they should adhere to the given ordering.
Example:
Suppose there are 3 potential keywords, ordered as follows:
"KEY1" (required)
"KEY2" (optional)
"KEY3" (required)
If the input file only includes the required ones, the ordering should be:
"KEY1"
"KEY3"
Otherwise it should be:
"KEY1"
"KEY2"
"KEY3"
If all the required keywords are present, and the total ordering is ok, the program should proceed by reading each section in the sequence given by the ordering.
Each section will include a (possibly large) amount of subkeywords, some of which are optional and some of which are not, but here the order does NOT matter.
Lines starting with characters '*' or '--' signify commented lines, and they should be ignored (as well as empty lines).
A line containing a keyword should (preferably) include nothing else than the keyword. At the very least, the keyword must be the first word appearing there.
I have already implemented parts of the framework, but I feel my approach so far has been rather ad hoc. Currently I have manually created one method per section/main keyword, and the first task of the program is to scan the file to locate these keywords and pass the necessary information on to the methods.
I first scan through the file using an std::ifstream object, removing empty and/or commented lines and storing the remaining lines in an object of type std::vector<std::string>.
Do you think this is an ok approach?
Moreover, I store the indices where each of the keywords start and stop (in two integer arrays) in this vector. This is the input to the above-mentioned methods, and it would look something like this:
bool readMAINKEY(int start, int stop);
Now I have already done this, and even though I do not find it very elegant, I guess I can keep it for the time being.
However, I feel that I need a better approach for handling the reading inside of each section, and my main issue is how should I store the keywords here? Should they be stored as arrays within a local namespace in the input class or maybe as static variables in the class? Or should they be defined locally inside relevant functions? Should I use enums? The questions are many!
Now I've started by defining the sub-keywords locally inside each readMAINKEY() method, but I found this to be less than optimal. Ideally I want to reuse as much code as possible inside each of these methods, calling upon a common readSECTION() method, and my current approach seems to lead to much code duplication and potential for error in programming. I guess the smartest thing to do would simply be to remove all the (currently 8) different readMAINKEY() methods, and use the same function for handling all kinds of keywords. There is also the possibility for having sub-sub-keywords etc. as well (i.e. a more general nested approach), so I think maybe this is the way to go, but I am unsure on how it would be best to implement it?
Once I've processed a keyword at the "bottom level", the program will expect a particular format of the following lines depending on the actual keyword. In principle each keyword will be handled differently, but here there is also potential for some code reuse by defining different "types" of keywords depending on what the program expects to do after triggering the reading of it. Common task include e.g. parsing an integer or a double array, but in principle it could be anything!
If a keyword for some reason cannot be correctly processed, the program should attempt as far as possible to use default values instead of terminating the program (if reasonable), but an error message should be written to a logfile. For optional keywords, default values will of course also be used.
In order to summarise, therefore, my main questions are the following:
1. Do you think my approach of storing the relevant lines in a std::vector<std::string> is reasonable?
This will of course require me to do a lot of "indexing work" to keep track of where in the vector the different keywords are located. Or should I work more "directly" with the original std::ifstream object? Or something else?
2. Given such a vector storing the lines of the text file, how can I best go about detecting the keywords and start reading the information following them?
Here I will need to take account of possible ordering and whether a keyword is required or not. Also, I need to check whether the lines following each "bottom level" keyword are in the format expected in each case.
One idea I've had is to store the keywords in different containers depending on whether they are optional or not (or maybe use object(s) of type std::map<std::string,bool>), and then remove them from the container(s) once correctly processed, but I am not sure exactly how I should go about it.
I guess there is really a thousand different ways one could answer these questions, but I would be grateful if someone more experienced could share some ideas on how to proceed. Is there e.g. a "standard" way of doing such things? Of course, a lot of details will also depend on the concrete application, but I think the general format indicated here can be used in a lot of different applications without a lot of tinkering if programmed in a good way!
UPDATE
Ok, so let me try to be more concrete. My current application is supposed to be a reservoir simulator, so as part of the input I need information about the grid/mesh, about rock and fluid properties, about wells/boundary conditions throughout the simulation and so on. At the moment I've been thinking about using (almost) the same set-up as the commercial Eclipse simulator when it comes to input; for details see
http://petrofaq.org/wiki/Eclipse_Input_Data.
However, I will probably change things a bit, so nothing is set in stone. Also, I am interested in making a more general "KeywordReader" class that with slight modifications can be adapted for use in other applications as well, at least if it can be done in a reasonable amount of time.
As an example, I can post the current code that does the initial scan of the text file and locates the positions of the main keywords. As I said, I don't really like my solution very much, but it seems to work for what it needs to do.
At the top of the .cpp file I have the following namespace:
//Keywords used for reading input:
namespace KEYWORDS{
/*
* Main keywords and corresponding boolean values to signify whether or not they are required as input.
*/
enum MKEY{RUNSPEC = 0, GRID = 1, EDIT = 2, PROPS = 3, REGIONS = 4, SOLUTION = 5, SUMMARY =6, SCHEDULE = 7};
std::string mainKeywords[] = {std::string("RUNSPEC"), std::string("GRID"), std::string("EDIT"), std::string("PROPS"),
std::string("REGIONS"), std::string("SOLUTION"), std::string("SUMMARY"), std::string("SCHEDULE")};
bool required[] = {true,true,false,true,false,true,false,true};
const int n_key = 8;
}//end KEYWORDS namespace
Then further down I have the following function. I am not sure how understandable it is though..
bool InputReader::scanForMainKeywords(){
logfile << "Opening file.." << std::endl;
std::ifstream infile(filename);
//Test if file was opened. If not, write error message:
if(!infile.is_open()){
logfile << "ERROR: Could not open file! Unable to proceed!" << std::endl;
std::cout << "ERROR: Could not open file! Unable to proceed!" << std::endl;
return false;
}
else{
logfile << "Scanning for main keywords..." << std::endl;
int nkey = KEYWORDS::n_key;
//Initially no keywords have been found:
startIndex = std::vector<int>(nkey, -1);
stopIndex = std::vector<int>(nkey, -1);
//Variable used to control that the keywords are written in the correct order:
int foundIndex = -1;
//STATISTICS:
int lineCount = 0;//number of non-comment lines in text file
int commentCount = 0;//number of commented lines in text file
int emptyCount = 0;//number of empty lines in text file
//Create lines vector:
lines = std::vector<std::string>();
//Remove comments and empty lines from text file and store the result in the member variable lines:
std::string str;
while(std::getline(infile,str)){
if(str.size()>=1 && str.at(0)=='*'){
commentCount++;
}
else if(str.size()>=2 && str.at(0)=='-' && str.at(1)=='-'){
commentCount++;
}
else if(str.size()==0){
emptyCount++;
}
else{
//Found a non-empty, non-comment line.
lines.push_back(str);//store in std::vector
//Start by checking if the first word of the line is one of the main keywords. If so, store the location of the keyword:
std::string fw = IO::getFirstWord(str);
for(int i=0;i<nkey;i++){
if(fw.compare(KEYWORDS::mainKeywords[i])==0){
if(i > foundIndex){
//Found a valid keyword!
foundIndex = i;
startIndex[i] = lineCount;//store where the keyword was found!
//logfile << "Keyword " << fw << " found at line " << lineCount << " in lines array!" << std::endl;
//std::cout << "Keyword " << fw << " found at line " << lineCount << " in lines array!" << std::endl;
break;//fw cannot equal several different keywords at the same time!
}
else{
//we have found a keyword, but in the wrong order... Terminate program:
std::cout << "ERROR: Keywords have been entered in the wrong order or been repeated! Cannot continue initialisation!" << std::endl;
logfile << "ERROR: Keywords have been entered in the wrong order or been repeated! Cannot continue initialisation!" << std::endl;
return false;
}
}
}//end for loop
lineCount++;
}//end else (found non-comment, non-empty line)
}//end while (reading ifstream)
logfile << "\n";
logfile << "FILE STATISTICS:" << std::endl;
logfile << "Number of commented lines: " << commentCount << std::endl;
logfile << "Number of non-commented lines: " << lineCount << std::endl;
logfile << "Number of empty lines: " << emptyCount << std::endl;
logfile << "\n";
/*
Print lines vector to screen:
for(int i=0;i<lines.size();i++){
std:: cout << "Line nr. " << i << " : " << lines[i] << std::endl;
}*/
/*
* So far, no keywords have been entered in the wrong order, but have all the necessary ones been found?
* Otherwise return false.
*/
for(int i=0;i<nkey;i++){
if(KEYWORDS::required[i] && startIndex[i] == -1){
logfile << "ERROR: Incorrect input of required keywords! At least " << KEYWORDS::mainKeywords[i] << " is missing!" << std::endl;
logfile << "Cannot proceed with initialisation!" << std::endl;
std::cout << "ERROR: Incorrect input of required keywords! At least " << KEYWORDS::mainKeywords[i] << " is missing!" << std::endl;
std::cout << "Cannot proceed with initialisation!" << std::endl;
return false;
}
}
//If everything is in order, we also initialise the stopIndex array correctly:
int counter = 0;
//Find first existing keyword:
while(counter < nkey && startIndex[counter] == -1){
//Keyword doesn't exist. Leave stopindex at -1!
counter++;
}
//Store stop index of each keyword:
while(counter<nkey){
int offset = 1;
//Find next existing keyword:
while(counter+offset < nkey && startIndex[counter+offset] == -1){
offset++;
}
if(counter+offset < nkey){
stopIndex[counter] = startIndex[counter+offset]-1;
}
else{
//reached the end of array!
stopIndex[counter] = lines.size()-1;
}
counter += offset;
}//end while
/*
//Print out start/stop-index arrays to screen:
for(int i=0;i<nkey;i++){
std::cout << "Start index of " << KEYWORDS::mainKeywords[i] << " is : " << startIndex[i] << std::endl;
std::cout << "Stop index of " << KEYWORDS::mainKeywords[i] << " is : " << stopIndex[i] << std::endl;
}
*/
return true;
}//end else (file opened properly)
}//end scanForMainKeywords()

You say your purpose is to read initialization data from a text file.
It seems you need to parse (syntactically analyse) this file and store the data under the right keys.
If the syntax is fixed and each construction starts with a keyword, you could write a recursive descent (LL(1)) parser that builds a tree (each node holding a std::vector of sub-branches) to store your data.
If the syntax is free, you might pick JSON or XML and use an existing parsing library.
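For the fixed-syntax case, the top level of such a parser can be driven by a table of keywords instead of one hand-written method per section. The following is only a sketch under assumptions: KeywordSpec, Section, firstWord and splitSections are hypothetical names, and the input is assumed to be the vector of lines with comments and empty lines already removed, as described in the question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one main keyword (illustrative names).
struct KeywordSpec {
    std::string name;
    bool required;
};

// Hypothetical result: where one section's body lives in the line vector.
struct Section {
    std::string name;
    std::size_t first;  // index of the first line after the keyword
    std::size_t last;   // one past the last line of the section
};

// Returns the first whitespace-delimited word of a line ("" if none).
std::string firstWord(const std::string& line) {
    const std::size_t b = line.find_first_not_of(" \t");
    if (b == std::string::npos) return "";
    const std::size_t e = line.find_first_of(" \t", b);
    return line.substr(b, e == std::string::npos ? e : e - b);
}

// Splits comment-free lines into sections, enforcing the order given by
// specs. Returns false and fills error on a misplaced, repeated or
// missing required keyword.
bool splitSections(const std::vector<std::string>& lines,
                   const std::vector<KeywordSpec>& specs,
                   std::vector<Section>& out, std::string& error) {
    out.clear();
    std::size_t next = 0;  // specs[next..] may still legally appear
    for (std::size_t i = 0; i < lines.size(); ++i) {
        const std::string fw = firstWord(lines[i]);
        std::size_t k = 0;
        while (k < specs.size() && specs[k].name != fw) ++k;
        if (k == specs.size()) continue;  // not a main keyword
        if (k < next) {
            error = "keyword out of order or repeated: " + fw;
            return false;
        }
        for (std::size_t s = next; s < k; ++s) {  // keywords skipped over
            if (specs[s].required) {
                error = "missing required keyword: " + specs[s].name;
                return false;
            }
        }
        if (!out.empty()) out.back().last = i;  // close the previous section
        out.push_back({fw, i + 1, lines.size()});
        next = k + 1;
    }
    for (std::size_t s = next; s < specs.size(); ++s) {
        if (specs[s].required) {
            error = "missing required keyword: " + specs[s].name;
            return false;
        }
    }
    return true;
}
```

Each Section could then be handed to one common reading routine together with its own table of sub-keywords, which is one way to get the nested behaviour the question asks about.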

Related

Is it possible to print this shape without printing spaces in C++ as described in the book "Think Like a Programmer" by Anton Spraul?

In a book named "Think Like a Programmer" by Anton Spraul, on chapter 2, exercise 1, page 53, this is what is written:
Using the same rule as the shapes programs from earlier in the chapter
(only two output statements — one that outputs the hash mark and one
that outputs an end-of-line), write a program that produces the
following shape:
########
######
####
##
So we can only use cout << "\n"; and cout << "#"; but not cout << " ";
Solving this problem by printing spaces is easy (see code below). But is it possible to print such shape without printing spaces in C++?
#include <iostream>
using std::cin;
using std::cout;
int main()
{
int shapeWidth = 8;
for (int row = 0; row < 4; row++) {
for(int spaceNum = 0; spaceNum < row; spaceNum++) {
cout << " ";
}
for(int hashNum = 0; hashNum < shapeWidth-2*row; hashNum++) {
cout << "#";
}
cout << "\n";
}
}
In one of the answers I read one can remove the for (int spaceNum. . . loop and rather just put cout << setw(row+1); to achieve that.
Clarification: The author never used a shape as example where one had to print spaces or indentations like the above. Interpreting the exercise above literally, he expects us to write that shape by printing "#" and "\n" only. Printing spaces or indentations by only printing "#" and "\n" seems not possible to me, so I thought maybe he was not careful when he wrote exercises. Or there's a way to achieve that but it's just me who doesn't know. This is why I asked this.

Yes, this is possible.
Set the stream width to the length of the row.
Set the stream formatting to right-justified.
Look up stream manipulators in your book or another C++ reference.
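A minimal sketch of that idea, assuming a hypothetical helper printShape that takes the output stream as a parameter so it can be tested. The only text passed to the stream is "#" and "\n"; the stream itself still emits fill characters (spaces) to honour the field width, so whether this satisfies the exercise depends on how literally it is read.

```cpp
#include <iomanip>
#include <ostream>

// Writes the inverted triangle using only "#" and "\n" as output text;
// the indentation comes from the stream's field width.
void printShape(std::ostream& out, int shapeWidth) {
    out << std::right;  // pad on the left (this is already the default)
    for (int row = 0; 2 * row < shapeWidth; ++row) {
        // setw applies to the next item only: the first "#" is padded
        // to row + 1 characters with the fill character (a space).
        out << std::setw(row + 1) << "#";
        for (int hashNum = 1; hashNum < shapeWidth - 2 * row; ++hashNum) {
            out << "#";
        }
        out << "\n";
    }
}
```

Calling printShape(std::cout, 8) after including <iostream> should produce the shape from the exercise.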

Possibly! But that is a question about the system you’re running on, not about C++. As far as the language is concerned, it’s just outputting characters. The fact that outputting a “space” character moves the cursor to the right on the screen — or even the existence of the screen — is a matter of how the operating system and the software running on it interpret the characters that have been output.
Most notably, you can use ANSI escape sequences to move the cursor around on most text consoles. There’s no particular reason to do this unless you’re writing a full-screen text mode UI.

Parsing Data of data from a file

I have this project due, but I am unsure how to parse the data into the word, its part of speech and its definition. I know that I should make use of the tab spacing to read it, but I have no idea how to implement it. Here is an example of the file:
Recollection n. The power of recalling ideas to the mind, or the period within which things can be recollected; remembrance; memory; as, an event within my recollection.
Nip n. A pinch with the nails or teeth.
Wodegeld n. A geld, or payment, for wood.
Xiphoid a. Of or pertaining to the xiphoid process; xiphoidian.
NB: Each word and part of speech and definition is one line in a text file.

If you can be sure that the definition always follows the first period on a line (the one that ends the part-of-speech abbreviation), you could use an implementation like this. It will break if the word itself ever contains a period.
string line;
vector<pair<string,string>> v; // <word,definition>
while (getline(fileStream, line)) { // one dictionary entry per line
    const size_t dot = line.find('.'); // first period ends "n", "a", etc.
    if (dot == string::npos) continue; // no period: not an entry, skip it
    const string head = line.substr(0, dot); // e.g. "Nip n"
    const size_t sep = head.find_last_of(" \t"); // gap between word and tag
    const string word = head.substr(0, sep); // get rid of n, v, etc. from word
    const size_t start = line.find_first_not_of(" \t", dot + 1);
    const string definition = (start == string::npos) ? "" : line.substr(start);
    v.push_back(make_pair(word, definition)); // store word and definition
}
for (const auto& x : v) { // check your results
    cout << "word -> " << x.first << endl;
    cout << "definition -> " << x.second << endl << endl;
}
The better solution would be to learn Regular Expressions. It's a complicated topic but absolutely necessary if you want to learn how to parse text efficiently and properly:
http://www.cplusplus.com/reference/regex/
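A minimal sketch of the regular-expression route using std::regex from the standard library. It assumes each line has the form word, whitespace, a one-token part-of-speech abbreviation ending in a period, whitespace, definition; the Entry struct and parseEntry function are hypothetical names, and real dictionary data may need a looser pattern.

```cpp
#include <regex>
#include <string>

// Hypothetical record for one dictionary line.
struct Entry {
    std::string word;
    std::string partOfSpeech;
    std::string definition;
};

// Splits "Nip n. A pinch ..." into its three parts.
// Returns false if the line does not have the assumed shape.
bool parseEntry(const std::string& line, Entry& entry) {
    // group 1: the word (shortest match), group 2: tag such as "n." or "a.",
    // group 3: the rest of the line.
    static const std::regex pattern(R"((\S.*?)\s+([A-Za-z]+\.)\s+(.*))");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) return false;
    entry.word = m[1].str();
    entry.partOfSpeech = m[2].str();
    entry.definition = m[3].str();
    return true;
}
```

Because the pattern only relies on whitespace, it works whether the fields are separated by tabs or by spaces.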

Formatting text in C++

So I'm trying to make a little text based game for my first real project in C++. When the game calls for a big block of text, several paragraphs worth, I'm having trouble getting it to look nice. I want to have uniform line lengths, and to do that I'm just having to manually enter line breaks in the appropriate places. Is there a command to do this for me? I have seen the setw() command, but that doesn't make the text wrap if it goes past the width. Any advice? Here is what I'm doing right now if that helps.
cout << " This is essentially what I do with large blocks of" << '\n';
cout << " descriptive text. It lets me see how long each of " << '\n';
cout << " the lines will be, but it's really a hassle, see? " << '\n';

Getting or writing a library function to automatically insert leading whitespace and newlines suitable for a particular width of output would be a good idea. This website considers library recommendations off topic, but I've included some code below - not particularly efficient but easy to understand. The basic logic should be to jump forwards in the string to the maximum width, then move backwards until you find whitespace (or perhaps a hyphen) at which you're prepared to break the line... then print the leading whitespace and the remaining part of the line. Continue until done.
#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>
std::string fmt(std::size_t margin, std::size_t width, std::string text)
{
    std::string result;
    while (!text.empty()) // while more text to move into result
    {
        result += std::string(margin, ' '); // add margin for this line
        if (width >= text.size()) // rest of text can fit... nice and easy
            return (result += text) += '\n';
        std::size_t n = width - 1; // start by assuming we can fit n characters
        // cast to unsigned char: std::isalnum is undefined for negative values
        while (n > width / 2 &&
               std::isalnum(static_cast<unsigned char>(text[n])) &&
               std::isalnum(static_cast<unsigned char>(text[n - 1])))
            --n; // between characters; reduce n until word breaks or 1/2 width left
        // move n characters from text to result...
        (result += text.substr(0, n)) += '\n';
        text.erase(0, n);
    }
    return result;
}
int main()
{
    std::cout << fmt(5, 70,
        "This is essentially what I do with large blocks of "
        "descriptive text. It lets me see how long each of "
        "the lines will be, but it's really a hassle, see?");
}
That hasn't been tested very thoroughly, but seems to work.
BTW, your original code can be simplified to...
cout << " This is essentially what I do with large blocks of\n"
" descriptive text. It lets me see how long each of\n"
" the lines will be, but it's really a hassle, see?\n";
...as C++ considers the statement unfinished until it hits the semicolon, and double-quoted string literals that appear next to each other in the code are concatenated automatically, as if the inner double quotes and whitespace were removed.

There is a nice library for text formatting, cppformat/cppformat on GitHub.
With it, you can manipulate text formatting easily.

You can write a very simple function:
#include <cstddef>
#include <iostream>
#include <string>
void print(const std::string& brute, std::size_t sizeline)
{
    // print the text in consecutive chunks of sizeline characters
    for (std::size_t i = 0; i < brute.length(); i += sizeline)
        std::cout << brute.substr(i, sizeline) << std::endl;
}
int main()
{
    std::string mytext = "This is essentially what I do with large blocks of "
                         "descriptive text. It lets me see how long each of "
                         "the lines will be, but it's really a hassle, see?";
    print(mytext, 20);
    return 0;
}
but words will be cut.
Output:
This is essentially
what I do with large
 blocks of descripti
ve text. It lets me
see how long each of
 the lines will be,
but it's really a ha
ssle, see?

sqlite table code manager?

Fairly new to sqlite (and SQL). I have several tables I need to generate with several column names, which can change as I code (in C++). How do I manage them? Am I doing it right? There must be utility code out there that does this much better.
Edit: Specifically, I'd like to avoid run-time errors by having an abstraction of the table and field names at compile time (e.g. using #defines, but maybe something else is better).
E.g. I'm currently thinking about creating a class TableHandler that will:
sqlite3 *db;
....
TableHandler tb("TableName");
tb.addField("FirstName", "TEXT");
tb.addField("Id", "INTEGER");
tb.createTable(db); //calls sqlite3_exec("create table TableName(FirstName TEXT, Id INTEGER)");
tb.setEntry("FirstName", "bob");
tb.addEntry(); //calls sqlite3_exec("insert ...");
tb.createCode(stdout);
//this will generate
/*
#define kTableName "TableName"
#define kFirstName "FirstName"
#define kId "Id"
...anything else useful?
*/

I asked a similar question and it was downvoted, so I deleted it. I wrote some code to do the insert, if you are interested. But I agreed with the negative comments that static SQL statements are less error prone, not to mention less CPU intensive.
For the insert I took a std::set of std::pair of std::string, the first string being the column name and the second string its value. The query returned a similar structure. I played with std::map, std::vector and std::unordered_set; each of them would have different benefits here.
What would be great, if you get around to it, is a small utility program that could read the definition of a class and write all the SQL for you. I started this and gave up because parsing the C++ header file got way too complicated for the time I would save on future projects.
Added
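#include <sstream>
#include <string>
#include <utility>
#include <vector>
// Builds an INSERT statement from (column name, value) pairs.
// Note: the values are pasted in unescaped, so only use this with trusted input.
std::string Insert(const std::string& table, const std::vector< std::pair<std::string,std::string> >& row)
{
    if (row.empty() || table.empty())
        return "";
    std::stringstream name, value;
    auto it = row.begin();
    // The first column and value are written without a leading comma...
    name << "INSERT INTO " << table << "('" << it->first << "'";
    value << "VALUES('" << it->second << "'";
    // ...so the loop has to start at the second element.
    for (++it; it != row.end(); ++it)
    {
        name << ", '" << it->first << "'";
        value << ", '" << it->second << "'";
    }
    name << ") ";
    value << ");";
    name << value.str();
    return name.str();
}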
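One way to keep most of the benefit of static SQL while still generating the column list is to emit "?" placeholders and bind the values afterwards. The sketch below only builds the statement text; insertStatement is a hypothetical helper, and it assumes table and column names come from your own code (embedded double quotes in identifiers are not escaped). With SQLite, the resulting string would typically be passed to sqlite3_prepare_v2 and the values attached with sqlite3_bind_text before calling sqlite3_step.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds e.g.: INSERT INTO "People"("FirstName", "Id") VALUES(?, ?);
// Values are not part of the text; they are meant to be bound later.
std::string insertStatement(const std::string& table,
                            const std::vector<std::string>& columns) {
    if (table.empty() || columns.empty()) return "";
    std::string names;
    std::string marks;
    for (std::size_t i = 0; i < columns.size(); ++i) {
        if (i > 0) {
            names += ", ";
            marks += ", ";
        }
        names += '"' + columns[i] + '"';  // double quotes mark an identifier
        marks += '?';                     // one placeholder per column
    }
    return "INSERT INTO \"" + table + "\"(" + names + ") VALUES(" + marks + ");";
}
```

Since the statement text no longer depends on the values, it can be prepared once and reused for every row.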

Read file and extract certain part only

ifstream toOpen;
openFile.open("sample.html", ios::in);
if(toOpen.is_open()){
while(!toOpen.eof()){
getline(toOpen,line);
if(line.find("href=") && !line.find(".pdf")){
start_pos = line.find("href");
tempString = line.substr(start_pos+1); // i dont want the quote
stop_pos = tempString .find("\"");
string testResult = tempString .substr(start_pos, stop_pos);
cout << testResult << endl;
}
}
toOpen.close();
}
What I am trying to do is to extract the "href" value. But I cant get it works.
EDIT:
Thanks to Tony's hint, I use this:
if(line.find("href=") != std::string::npos ){
// Process
}
it works!!

I'd advise against trying to parse HTML like this. Unless you know a lot about the source and are quite certain about how it'll be formatted, chances are that anything you do will have problems. HTML is an ugly language with an (almost) self-contradictory specification that (for example) says particular things are not allowed -- but then goes on to tell you how you're required to interpret them anyway.
Worse, almost any character can (at least potentially) be encoded in any of at least three or four different ways, so unless you scan for (and carry out) the right conversions (in the right order) first, you can end up missing legitimate links and/or including "phantom" links.
You might want to look at the answers to this previous question for suggestions about an HTML parser to use.

As a start, you might want to take some shortcuts in the way you write the loop over lines in order to make it clearer. Here is the conventional "read line at a time" loop using C++ iostreams:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
int main ( int, char ** )
{
std::ifstream file("sample.html");
if ( !file.is_open() ) {
std::cerr << "Failed to open file." << std::endl;
return (EXIT_FAILURE);
}
for ( std::string line; (std::getline(file,line)); )
{
// process line.
}
}
As for the inner part that processes the line, there are several problems.
It doesn't compile. I suppose this is what you meant with "I cant get it works". When asking a question, this is the kind of information you might want to provide in order to get good help.
There is confusion between variable names (toOpen and openFile, start_pos used as an index into tempString, etc.).
std::string::find() returns std::string::npos, a very large unsigned value, when there is no match, and any non-zero value converts to true. So line.find("href=") used as a condition is true unless the match starts at character position 0, and !line.find(".pdf") is true only when ".pdf" starts at position 0. Compare the result against std::string::npos instead.
Here is a simple test content for sample.html.
<html>
<a href="foo.pdf"/>
</html>
Sticking the following inside the loop:
if ((line.find("href=") != std::string::npos) &&
(line.find(".pdf" ) != std::string::npos))
{
const std::size_t start_pos = line.find("href");
std::string temp = line.substr(start_pos+6);
const std::size_t stop_pos = temp.find("\"");
std::string result = temp.substr(0, stop_pos);
std::cout << "'" << result << "'" << std::endl;
}
I actually get the output
'foo.pdf'
However, as Jerry pointed out, you might not want to use this in a production environment. If this is a simple homework or exercise on how to use the <string>, <iostream> and <fstream> libraries, then go ahead with such a procedure.
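For such an exercise, the same idea can be extended to lines that contain more than one link. This is a sketch only: pdfLinks is a hypothetical helper, and it assumes attributes are written exactly as href="..." with double quotes, which real-world HTML does not guarantee.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collects every double-quoted href value in line that ends in ".pdf".
std::vector<std::string> pdfLinks(const std::string& line) {
    static const std::string key = "href=\"";
    static const std::string ext = ".pdf";
    std::vector<std::string> links;
    std::size_t pos = 0;
    while ((pos = line.find(key, pos)) != std::string::npos) {
        const std::size_t start = pos + key.size();      // first char of the value
        const std::size_t stop = line.find('"', start);  // closing quote
        if (stop == std::string::npos) break;            // unterminated attribute
        const std::string value = line.substr(start, stop - start);
        if (value.size() >= ext.size() &&
            value.compare(value.size() - ext.size(), ext.size(), ext) == 0) {
            links.push_back(value);
        }
        pos = stop + 1;  // continue searching after this attribute
    }
    return links;
}
```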
