C++ Parsing a line out of a large file


I have read an entire file into a string through a memory-mapped file, using the Win API:
CreateFile( "WarandPeace.txt", GENERIC_READ, FILE_SHARE_READ, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0 )
etc...
Each line is terminated with a CRLF. I need to find something on a line, like "Spam" in the line "I love Spam and Eggs", and return the entire line (without the CRLF) in a string, or a pointer to its location in the string. The original string cannot be altered.
EDITED:
Something like this:
string ParseStr( string sIn, string sDelim, int nField )
{
int match, LenStr, LenDelim, ePos, sPos(0), count(0);
string sRet;
LenDelim = sDelim.length();
LenStr = sIn.length();
if( LenStr < 1 || LenDelim < 1 ) return ""; // Empty String
if( nField < 1 ) return "";
//=========== cout << "LenDelim=" << LenDelim << ", sIn.length=" << sIn.length() << endl;
for( ePos=0; ePos < LenStr; ePos++ ) // iterate through the string
{ // cout << "sPos=" << sPos << ", LenStr=" << LenStr << ", ePos=" << ePos << ", sIn[ePos]=" << sIn[ePos] << endl;
match = 1; // default = match found
for( int k=0; k < LenDelim; k++ ) // Byte value
{
if( ePos+k > LenStr ) // end of the string
break;
else if( sIn[ePos+k] != sDelim[k] ){ // match failed
match = 0; break; }
}
//===========
if( match || (ePos == LenStr-1) ) // process line
{
if( !match ) ePos = LenStr + LenDelim; // (ePos == LenStr-1)
count++; // cout << "sPos=" << sPos << ", ePos=" << ePos << " >" << sIn.substr(sPos, ePos-sPos) << endl;
if( count == nField ){ sRet = sIn.substr(sPos, ePos-sPos); break; }
ePos = ePos+LenDelim-1; // jump over Delim
sPos = ePos+1; // Begin after Delim
} // cout << "Final ePos=" << ePos << ", count=" << count << ", LenStr=" << LenStr << endl;
}// next
return sRet;
}
If you like it, vote it up. If not, let's see what you got.

If you are trying to match a more complex pattern then you can always fall back to boost's regex lib.
See: http://www.boost.org/doc/libs/1_41_0/libs/regex/doc/html/index.html
#include <fstream>
#include <iostream>
#include <string>
#include <boost/regex.hpp>
using namespace std;
int main( )
{
    std::string sre("Spam");
    boost::regex re;
    try
    {
        // Set up the regular expression for case-insensitivity (once, before the loop)
        re.assign(sre, boost::regex_constants::icase);
    }
    catch (boost::regex_error& e)
    {
        cout << sre << " is not a valid regular expression: \""
             << e.what() << "\"" << endl;
        return 1;
    }
    ifstream in("main.cpp");
    if (!in.is_open()) return 1;
    string line;
    while (getline(in,line))
    {
        // regex_search finds the pattern anywhere in the line;
        // regex_match would succeed only if the whole line were "Spam"
        if (boost::regex_search(line, re))
        {
            cout << sre << " matches " << line << endl;
        }
    }
}

Do you really have to do it in C++? Perhaps you could use a language which is more appropriate for text processing, like Perl, and apply a regular expression.
Anyway, if doing it in C++, a loop over Prev_delim_position = sIn.find(sDelim, Prev_delim_position) looks like a fine way to do it.
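A minimal sketch of that find-based approach, applied to the original question. It assumes the mapped file contents are already in a std::string, that lines end in CRLF, and that the search text itself contains no CR or LF. The name find_line is made up for this example. The input is only read, never modified, and the result is a copy of the matching line.

```cpp
#include <cstddef>
#include <string>

// Returns the line of `text` that contains the first occurrence of `needle`,
// without its trailing CRLF. Returns an empty string if there is no match.
// `text` is taken by const reference, so the original is never altered.
std::string find_line(const std::string& text, const std::string& needle)
{
    const std::string eol = "\r\n";
    if (needle.empty()) return "";

    const std::size_t hit = text.find(needle);
    if (hit == std::string::npos) return "";

    // Start of line: just past the previous CRLF, or 0 on the first line.
    const std::size_t prev = text.rfind(eol, hit);
    const std::size_t begin = (prev == std::string::npos) ? 0 : prev + eol.size();

    // End of line: the next CRLF, or the end of the buffer on the last line.
    const std::size_t next = text.find(eol, hit);
    const std::size_t end = (next == std::string::npos) ? text.size() : next;

    return text.substr(begin, end - begin);
}
```

Calling find_line(contents, "Spam") on a buffer that holds the line "I love Spam and Eggs" would return that line without the CRLF. If a copy is too expensive, the begin and end offsets could be returned instead.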

system("grep ....");

Related

Infinity Loop in Lexical Analyzer in C++

int main(int argc, char *argv[]) {
ifstream inFile;
int numOfLines = 0, numOfTokens = 0, numOfStrings = 0, maxStringLength = 0, l = 0, fileCount=0, mostCommonCount=0;
string inputFile, mostCommonList="", word;
for(int i = 1; i < argc; i++){
if(strpbrk(argv[i] , "-")){
if(flags.find(string(argv[i]))!=flags.end()) flags[string(argv[i])] = true;
else{
cerr << "INVALID FLAG " << argv[i] << endl;
exit(1);
}
}
else{
inFile.open(argv[i]);
fileCount++;
if(!inFile && fileCount==1){
cerr << "UNABLE TO OPEN " << argv[i] << endl;
exit(1);
}
else{
string line;
while(getline(inFile, line)) inputFile+=line+='\n';
if(fileCount>1){
cerr << "TOO MANY FILE NAMES" << endl;
exit(1);
}
}
}
}
int linenum = 0;
TType tt;
Token tok;
while((tok = getNextToken(&inFile, &linenum))!=DONE && tok != ERR){
tt = tok.GetTokenType();
word = tok.GetLexeme();
if(flags["-v"]==true){
(tt == ICONST||tt==SCONST||tt==IDENT) ? cout<<enumTypes[tok.GetTokenType()]<<"("<< tok.GetLexeme()<<")"<<endl : cout<< enumTypes[tok.GetTokenType()]<<endl;
}
if(flags["-mci"]==true){
if(tt==IDENT){
(identMap.find(word)!=identMap.end()) ? identMap[word]++ : identMap[word]=1;
if(identMap[word]>mostCommonCount) mostCommonCount = identMap[word];
}
}
if(flags["-sum"]==true){
numOfTokens++;
if(tt==SCONST){
numOfStrings++;
l = word.length();
if(l > maxStringLength) maxStringLength = l;
}
}
}
if(tok==ERR){
cout << "Error on line" << tok.GetLinenum()<<"("<<tok.GetLexeme()<<")"<<endl;
return 0;
}
if(flags["-mci"]==true){
cout << "Most Common Identifier: ";
if(!identMap.empty()){
word ="";
for(auto const& it : identMap){
if(it.second==mostCommonCount) word += it.first + ",";
}
word.pop_back();
cout << word << endl;
}
}
if(flags["-sum"]){
numOfLines = tok.GetLinenum();
cout << "Total lines: " << numOfLines << endl;
cout << "Total tokens: " << numOfTokens << endl;
cout << "Total strings: " << numOfStrings << endl;
cout << "Length of longest string: " << maxStringLength << endl;
}
inFile.close();
return 0;
}
For some reason this code runs forever, and I cannot find the source of the error. I also do not know whether this file or the other linked file is causing it, so I posted the main program code. I think one of the switch statements is causing it, but I am not sure. FYI: I am supposed to make a lexical analyzer, so I have three files: lexigh.h (contains all the data types and all the functions), getToken.cpp (the file that defines the functions from lexigh.h), and the main program, which calls the methods and tests them.
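One thing that is visible in the posted main: the getline loop consumes the whole file before getNextToken is called on the same stream, so the stream is already at end of file with its fail bit set when tokenizing starts. Whether that causes the endless loop depends on getNextToken, which is not shown, so this is only a guess. The sketch below uses std::istringstream in place of the file, and the helper names slurp and rewind_stream are made up for illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads every line of `in` into one string, the way the posted main does.
// When this returns, `in` has eofbit and failbit set.
std::string slurp(std::istream& in)
{
    std::string all, line;
    while (std::getline(in, line)) all += line + '\n';
    return all;
}

// Clears the error flags and moves the read position back to the start,
// so a later consumer (such as a tokenizer) sees the data again.
void rewind_stream(std::istream& in)
{
    in.clear();
    in.seekg(0, std::ios::beg);
}
```

Without the rewind, every read on the stream fails immediately. A tokenizer that loops until it sees a particular character, without checking the stream state, would then never finish.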

Where can I use OpenMP in my C++ code

I am writing C++ code to calculate code coverage, and I want to use OpenMP to reduce the overall run time by making the functions work in parallel.
Can someone please tell me how and where to use OpenMP?
int _tmain(int argc, _TCHAR* argv[])
{
std::clock_t start;
start = std::clock();
char inputFilename[] = "Test-Case-3.cs"; // Test Case File
char outputFilename[] = "Result.txt"; // Result File
int totalNumberOfLines = 0;
int numberOfBranches = 0;
int statementsCovered = 0;
float statementCoveragePercentage = 0;
double overallRuntime = 0;
ifstream inFile; // object for reading from a file
ofstream outFile; // object for writing to a file
inFile.open(inputFilename, ios::in);
if (!inFile) {
cerr << "Can't open input file " << inputFilename << endl;
exit(1);
}
totalNumberOfLines = NoOfLines(inFile);
inFile.clear(); // reset
inFile.seekg(0, ios::beg);
numberOfBranches = NoOfBranches(inFile);
inFile.close();
statementsCovered = totalNumberOfLines - numberOfBranches;
statementCoveragePercentage = (float)statementsCovered * 100/ totalNumberOfLines;
outFile.open(outputFilename, ios::out);
if (!outFile) {
cerr << "Can't open output file " << outputFilename << endl;
exit(1);
}
outFile << "Total Number of Lines" << " : " << totalNumberOfLines << endl;
outFile << "Number of Branches" << " : " << numberOfBranches << endl;
outFile << "Statements Covered" << " : " << statementsCovered << endl;
outFile << "Statement Coverage Percentage" << " : " << statementCoveragePercentage <<"%"<< endl;
overallRuntime = (std::clock() - start) / (double)CLOCKS_PER_SEC;
outFile << "Overall Runtime" << " : " << overallRuntime << " Seconds"<< endl;
outFile.close();
}
I want to minimize the time taken to count the number of branches by allowing multiple threads to work in parallel. How can I edit the code so that I can use OpenMP? Here are my functions:
bool is_only_ascii_whitespace(const std::string& str)
{
auto it = str.begin();
do {
if (it == str.end()) return true;
} while (*it >= 0 && *it <= 0x7f && std::isspace(*(it++)));
// one of these conditions will be optimized away by the compiler,
// which one depends on whether char is signed or not
return false;
}
// Function 1
int NoOfLines(ifstream& inFile)
{
//char line[1000];
string line;
int lines = 0;
while (!inFile.eof()) {
getline(inFile, line);
//cout << line << endl;
if ((line.find("//") == std::string::npos)) // Remove Comments
{
if (!is_only_ascii_whitespace(line)) // Remove Blank
{
lines++;
}
}
//cout << line << "~" <<endl;
}
return lines;
}
// Function 2
int NoOfBranches(ifstream& inFile)
{
//char line[1000];
string line;
int branches = 0;
while (!inFile.eof()) {
getline(inFile, line);
if ((line.find("if") != std::string::npos) || (line.find("else") != std::string::npos))
{
branches++;
}
}
return branches;
}
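A rough sketch of where OpenMP could fit, under the assumption that the lines are first read into memory: reading from the stream stays sequential, and only the per-line test is split across threads. The names read_lines and count_branches are made up for this example; the "if"/"else" test is the one from NoOfBranches. Build with OpenMP enabled (for example -fopenmp with GCC or Clang, /openmp with MSVC); without it the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads all lines up front; stream input itself stays sequential.
std::vector<std::string> read_lines(std::istream& in)
{
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) lines.push_back(line);
    return lines;
}

// Counts lines containing "if" or "else", the same test as NoOfBranches.
// Each thread sums its share of the lines; reduction(+) combines the sums.
int count_branches(const std::vector<std::string>& lines)
{
    int branches = 0;
    const long n = static_cast<long>(lines.size());
    #pragma omp parallel for reduction(+:branches)
    for (long i = 0; i < n; ++i) {
        const std::string& s = lines[static_cast<std::size_t>(i)];
        if (s.find("if") != std::string::npos ||
            s.find("else") != std::string::npos) {
            ++branches;
        }
    }
    return branches;
}
```

For a single small source file the time is likely dominated by file I/O, so the gain may be small. It is worth measuring before and after.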

FASTA reader written in C++?

Let me start off by stating that I am a beginner in C++. Anyway, the FASTA format goes as follows:
Any line starting with a '>' gives the name/id of the gene sequence right below it. This gene sequence can span one or multiple lines.
So what I want to do is print: id << " : " << gene_sequence << endl;
Here is my code:
#include <iostream>
#include <fstream>
int main(int argc, char **argv) {
if (argc < 2) {
std::cerr << " Wrong format: " << argv[0] << " [infile] " << std::endl;
return -1;
}
std::ifstream input(argv[1]);
if (!input.good()) {
std::cerr << "Error opening: " << argv[1] << " . You have failed." << std::endl;
return -1;
}
std::string line, id, DNA_sequence;
while (std::getline(input, line).good()) {
if (line[0] == '>') {
id = line.substr(1);
std::cout << id << " : " << DNA_sequence << std::endl;
DNA_sequence.clear();
}
else if (line[0] != '>'){
DNA_sequence += line;
}
}
}
Here is the content of the file I pass as the command-line argument:
>DNA_1
GATTACA
>DNA_2
TAGACCA
TAGACCA
>DNA_3
ATAC
>DNA_4
AT
Please copy and paste it into a text file.
Now, the problem: the code does not put the sequence of DNA_1 in its correct place; instead it prints DNA_1's sequence next to DNA_2, so all the results are shifted forward by one. Any assistance or tips would be greatly appreciated.
As I've said before, I am new to C++, and the semantics are quite hard to learn compared to Python.

I see a few problems with your code.
First, you loop on std::ifstream::good(), which doesn't work because good() also becomes false when End Of File is reached, and that can happen during a read that successfully returned the last line.
Then you access line[0] without checking whether the line is empty, which could cause a seg-fault.
Next, you output the "previous" entry before you have even collected it.
Finally, you never output the final entry, because output only happens when another > is found.
I added comments to my corrections to your code:
#include <iostream>
#include <fstream>
#include <string>
int main(int argc, char **argv) {
    if (argc < 2) {
        std::cerr << " Wrong format: " << argv[0] << " [infile] " << std::endl;
        return -1;
    }
    std::ifstream input(argv[1]);
    if (!input.good()) {
        std::cerr << "Error opening: " << argv[1] << " . You have failed." << std::endl;
        return -1;
    }
    std::string line, id, DNA_sequence;
    // Don't loop on good(), it doesn't allow for EOF!!
    // while (std::getline(input, line).good()) {
    while (std::getline(input, line)) {
        // line may be empty so you *must* ignore blank lines
        // or you have a crash waiting to happen with line[0]
        if(line.empty())
            continue;
        if (line[0] == '>') {
            // output previous line before overwriting id
            // but ONLY if id actually contains something
            if(!id.empty())
                std::cout << id << " : " << DNA_sequence << std::endl;
            id = line.substr(1);
            DNA_sequence.clear();
        }
        else {// if (line[0] != '>'){ // not needed because implicit
            DNA_sequence += line;
        }
    }
    // output final entry
    // but ONLY if id actually contains something
    if(!id.empty())
        std::cout << id << " : " << DNA_sequence << std::endl;
}
Output:
DNA_1 : GATTACA
DNA_2 : TAGACCATAGACCA
DNA_3 : ATAC
DNA_4 : AT
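The first point above (looping on good()) can be shown in isolation. The sketch below counts lines with both loop conditions, using std::istringstream as a stand-in for the file; the function names are made up for this example. The difference only shows when the input does not end with a newline, because then reading the last line also sets the EOF flag.

```cpp
#include <sstream>
#include <string>

// Counts lines using the loop condition from the question.
// Stops early if the read that fetched the last line also hit EOF.
int count_with_good(const std::string& text)
{
    std::istringstream in(text);
    std::string line;
    int n = 0;
    while (std::getline(in, line).good()) ++n;
    return n;
}

// Counts lines using the stream itself as the condition.
// The stream tests false only when a read actually failed.
int count_with_stream(const std::string& text)
{
    std::istringstream in(text);
    std::string line;
    int n = 0;
    while (std::getline(in, line)) ++n;
    return n;
}
```

For the input ">DNA_4\nAT" (no trailing newline) the good() version counts one line and the stream version counts two. With a trailing newline both count two.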

You're storing the new id before printing the old one:
id = line.substr(1);
std::cout << id << " : " << DNA_sequence << std::endl;
Swap the lines around for proper order. You probably also want to check if you have any id already present to skip the first entry.

A working implementation is here:
https://rosettacode.org/wiki/FASTA_format#C.2B.2B
I only corrected
while( std::getline( input, line ).good() ){
to
while( std::getline( input, line ) ){
Code
#include <iostream>
#include <fstream>
#include <string>
int main( int argc, char **argv ){
if( argc <= 1 ){
std::cerr << "Usage: "<<argv[0]<<" [infile]" << std::endl;
return -1;
}
std::ifstream input(argv[1]);
if(!input.good()){
std::cerr << "Error opening '"<<argv[1]<<"'. Bailing out." << std::endl;
return -1;
}
std::string line, name, content;
while( std::getline( input, line ) ){
if( line.empty() || line[0] == '>' ){ // Identifier marker
if( !name.empty() ){ // Print out what we read from the last entry
std::cout << name << " : " << content << std::endl;
name.clear();
}
if( !line.empty() ){
name = line.substr(1);
}
content.clear();
} else if( !name.empty() ){
if( line.find(' ') != std::string::npos ){ // Invalid sequence--no spaces allowed
name.clear();
content.clear();
} else {
content += line;
}
}
}
if( !name.empty() ){ // Print out what we read from the last entry
std::cout << name << " : " << content << std::endl;
}
return 0;
}
input:
>Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED
output:
Rosetta_Example_1 : THERECANBENOSPACE
Rosetta_Example_2 : THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

c++ Script that finds questions in string

I am new to coding (I have been coding for about a week) and I am trying to write a program that finds the "?" and the "." in a text, outputs their positions, and then uses those values to print the question to a text file.
Except it does not really work.
If you put the values in like this, it works:
myfile << test.substr( 18, 20 )
But like this it does not work; it just prints the whole text from the value of dot[0] until the end:
myfile << test.substr( dot[0], interrogation[0] )
The way that I find the "?" position in the string is also not very accurate.
Where there is the
if(x > 0){
I had a while loop, but I replaced it for debugging reasons.
This is the whole code.
If you can help me, I appreciate it.
Thanks.
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
int main (){
std::vector< int > interrogation ;
std::vector< int > dot;
string look = "?";
string look_again = ".";
string test = "ver. o que e isto? nao sei. ola? adeus. fghfghfhfghf";
string::size_type pos = test.find(look);
string::size_type sop = test.find(look_again);
string::size_type exc = test.find(look_again_again);
while (pos != std::string::npos)
{
int a = pos ;
int b = sop;
cout << " . found at : " << sop << std::endl;
cout << " ? found at : " << pos << std::endl;
interrogation.push_back(a);
dot.push_back(b);
string fragment = test.substr (0 , pos ); // works
//cout << fragment << endl ;
string fragment2 = test.substr (0 , sop ); // works
//cout << fragment2 << endl ;
pos = test.find(look, pos + 1);
sop = test.find(look_again, sop + 1);
}
int x = 1;
if(x > 0){
int a = 1;
int q = dot[a];
int w = interrogation[a];
// to save text
// to save text
string save = "saved_question.txt" ;
ofstream myfile;
myfile.open (save.c_str(), ios::app);
myfile << test.substr( 18, 20 ) + "\n" ;
myfile.close();
cout << "Question saved in text file" << endl;
}
}

The code is not finished yet, but I got it working with help.
Thanks.
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
int main (){
std::vector< int > interrogation ;
std::vector< int > dot;
//std::vector< int > exclamation;
string look = "?";
string look_again = ".";
string look_again_again = "!";
string test = " ver.o que e isto? nao sei. ola? adeus.";
string::size_type pos = test.find(look);
string::size_type sop = test.find(look_again);
while (pos != std::string::npos)
{
int a = pos ;
int b = sop;
cout << " . found at : " << sop << std::endl;
cout << " ? found at : " << pos << std::endl;
// cout << " ! found at : " << exc << std::endl;
interrogation.push_back(a);
dot.push_back(b);
//exclamation.push_back(c);
string fragment = test.substr (0 , pos ); // works
//cout << fragment << endl ;
string fragment2 = test.substr (0 , sop) ; // works
//cout << fragment2 << endl ;
string fragment3 = test.substr (dot.back() + 1, interrogation.back() - dot.back()); // works
cout << fragment3 << endl ;
pos = test.find(look, pos + 1);
sop = test.find(look_again, sop + 1);
}
}
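A note on why test.substr( dot[0], interrogation[0] ) returned more than expected: std::string::substr takes a start position and a length, not a start and an end position. That is why the working version above passes interrogation.back() - dot.back() as the second argument. The sketch below wraps this in a small helper and collects every sentence ending in '?'. The names slice and extract_questions are made up for this example, and treating '.', '?' and '!' as the only sentence terminators is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the text between two indices, end exclusive.
// substr wants (start, length), so the length is end - begin.
std::string slice(const std::string& s, std::size_t begin, std::size_t end)
{
    return s.substr(begin, end - begin);
}

// Collects every sentence that ends in '?', skipping leading spaces.
std::vector<std::string> extract_questions(const std::string& text)
{
    std::vector<std::string> out;
    std::size_t start = 0;
    for (;;) {
        // Next sentence terminator at or after `start`.
        const std::size_t stop = text.find_first_of(".?!", start);
        if (stop == std::string::npos) break;
        if (text[stop] == '?') {
            // First non-space character of this sentence (at most `stop`).
            const std::size_t first = text.find_first_not_of(' ', start);
            out.push_back(slice(text, first, stop + 1));
        }
        start = stop + 1;
    }
    return out;
}
```

For the test string from the question, this yields "o que e isto?" and "ola?". Each result can then be written to the ofstream in place of the hard-coded substr call.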

You can do something like this:
void function(string str) {
    string sentence = "";
    for(int i=0; i < str.length(); i++) {
        if(str[i] == '.' || str[i] == '!')
            sentence = ""; // the sentence is not a question so clear the sentence
        else if(str[i] == '?') {
            sentence.push_back(str[i]);
            cout << sentence << endl; // output the question - just replace cout with your ofstream
        }
        else
            sentence.push_back(str[i]);
    }
}
I think I've seen this question before though..

_stricmp returns wrong value while string DOES match. What to do?

I have this bit of code:
second = strtok (NULL,"\n");
logprintf(second);
if(_stricmp(second,"WINDOWS") == 0)
The logprintf prints data into a logfile, and it prints "WINDOWS" (without quotes).
But _stricmp somehow returns 13, so the if check never passes. I've tried sscanf/sprintf and other string approaches, but none work. I'm out of ideas.
The full code:
#ifdef WIN32
char buf[65535];
bool found = false;
bool install = false;
bool installing = false;
unsigned int installed = 0;
WIN32_FIND_DATA fd;
HANDLE h = FindFirstFile(L"./*.AIFPAK", &fd);
char * NAME = NULL;
char * ID = NULL;
char * AUTHOR = NULL;
int VERSION;
HZIP hz = OpenZip(fd.cFileName,0);
ZIPENTRY ze;
GetZipItem(hz,-1,&ze);
int numitems=ze.index;
for (int i=0; i<numitems; ++i)
{
GetZipItem(hz,i,&ze);
if(found == false)
{
if(_wcsicmp(ze.name,L"INSTALL.AIFLIST") == 0)
{
found = true;
UnzipItem(hz,i,buf,65535);
i = 0;
stringstream strx;
strx << buf;
string line;
while (getline(strx, line)) {
char * first = NULL;
char * second = NULL;
char * third = NULL;
first = strtok ((char *)line.c_str()," ");
if(install == false)
{
if(_stricmp(first,"NAME") == 0)
{
NAME = strtok (NULL,"\n");
}
if(_stricmp(first,"ID") == 0)
{
ID = strtok (NULL,"\n");
}
if(_stricmp(first,"AUTHOR") == 0)
{
AUTHOR = strtok (NULL,"\n");
}
if(_stricmp(first,"VERSION") == 0)
{
VERSION = atoi(strtok (NULL,"\n"));
}
if(_stricmp(first,"START_INSTALL") == 0)
{
second = strtok (NULL,"\n");
logprintf(second);
if(_stricmp(second,"WINDOWS") == 0)
{
cout << "INSTALLAH\n";
install = true;
cout << NAME << "|" << ID << "|" << AUTHOR << "|" << VERSION << "|\n";
}
}
}
else
{
cout << "ELSE FIRST: " << first << "\n";
if(_stricmp(first,"UNPACK") == 0)
{
second = strtok (NULL,">");
third = strtok (NULL,"\n");
cout << first << "|" << second << "|" << third << "|\n";
ToDoVec.push_back(ToDoInfo(0,second,third));
}
if(_stricmp(first,"PRINT") == 0)
{
second = strtok (NULL,"\n");
cout << first << "|" << second << "|" << third << "|\n";
ToDoVec.push_back(ToDoInfo(3,second,""));
}
if(_stricmp(first,"ADD_PLUGIN") == 0)
{
second = strtok (NULL,"\n");
cout << first << "|" << second << "|" << third << "|\n";
ToDoVec.push_back(ToDoInfo(1,second,""));
}
if(_stricmp(first,"ADD_FILTERSCRIPT") == 0)
{
second = strtok (NULL,"\n");
cout << first << "|" << second << "|" << third << "|\n";
ToDoVec.push_back(ToDoInfo(2,second,""));
}
if(_stricmp(first,"END_INSTALL") == 0)
{
break;
}
}
}
}
}
else
{
if(installing == false)
{
i = 0;
installing = true;
for(unsigned int ix = 0; ix < ToDoVec.size(); ++ix)
{
cout << "|" << ToDoVec.at(ix).action << "|" << ToDoVec.at(ix).string1 << "|" << ToDoVec.at(ix).string2 << "|\n";
}
}
else
{
}
}
}
found = false;
install = false;
installing = false;
CloseZip(hz);
while (FindNextFile(h, &fd))
{
//fnVec.push_back(fd.cFileName);
}
#else//assuming linux
#endif
The "INSTALL.AIFLIST" file looks like this:
NAME Route Connector Plugin
ID GAMER_GPS
VERSION 1733
AUTHOR Gamer_Z
START_INSTALL WINDOWS
UNPACK RouteConnector/plugins/RouteConnectorPlugin.dll>./plugins/RouteConnectorPlugin.dll
UNPACK RouteConnector/examples/other/filterscripts/Node_GPS.amx>./filterscripts/Node_GPS.amx
UNPACK RouteConnector/examples/other/filterscripts/Node_GPS.pwn>./filterscripts/Node_GPS.pwn
UNPACK RouteConnector/scriptfiles/GPS.dat>./scriptfiles/GPS.dat
UNPACK RouteConnector/sampGDK/EXTRACTED/libsampgdk-2.2.1-win32/bin/sampgdk2.dll>./sampgdk2.dll
ADD_PLUGIN RouteConnectorPlugin
ADD_FILTERSCRIPT Node_GPS
END_INSTALL
START_INSTALL LINUX
UNPACK RouteConnector/plugins/RouteConnectorPlugin.so>./plugins/RouteConnectorPlugin.so
UNPACK RouteConnector/examples/other/filterscripts/Node_GPS.amx>./filterscripts/Node_GPS.amx
UNPACK RouteConnector/examples/other/filterscripts/Node_GPS.pwn>./filterscripts/Node_GPS.pwn
UNPACK RouteConnector/scriptfiles/GPS.dat>./scriptfiles/GPS.dat
PRINT To make this plugin work you must install the sampgdk library from www.github.com/Zeex/ (if you don't have it installed)
ADD_PLUGIN RouteConnectorPlugin.so
ADD_FILTERSCRIPT Node_GPS
END_INSTALL
All the data is read correctly into 'buf'.
Can anybody suggest how to fix that problem?

It looks as if you're not accounting for CRLF line endings. Opening a file in text mode will translate "\r\n" to "\n", but there's nothing in your code that does so. If "WINDOWS" is followed by "\r\n", you're treating that as "WINDOWS\r" followed by a "\n", because "\n" is all you're passing to strtok. There are several possible solutions, but one is passing "\r\n" to strtok instead.

13 is the carriage return character '\r', so your input has a trailing '\r' which doesn't show when printing. To remove it, pass "\r\n" as separators to strtok.

The 13 indicates that your second string contains "WINDOWS\r". The \r is a carriage-return character - on Windows, lines in text files are terminated with a \r\n sequence, so it seems that your getline() function is terminating at a \n and returning the \r as part of the string.
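As an alternative to passing "\r\n" to strtok, the carriage return can be removed right after each getline call. A minimal sketch, assuming the buffer is split with std::getline as in the question; the helper names strip_cr and split_lines are made up for this example.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Removes one trailing '\r', if present, so "WINDOWS\r" becomes "WINDOWS".
void strip_cr(std::string& line)
{
    if (!line.empty() && line[line.size() - 1] == '\r') {
        line.erase(line.size() - 1);
    }
}

// Splits a buffer into lines, tolerating both LF and CRLF endings.
std::vector<std::string> split_lines(const std::string& buf)
{
    std::vector<std::string> lines;
    std::istringstream in(buf);
    std::string line;
    while (std::getline(in, line)) {
        strip_cr(line);
        lines.push_back(line);
    }
    return lines;
}
```

In the posted loop this would mean calling strip_cr(line) as the first statement inside while (getline(strx, line)), before any strtok call.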
