Reading in a FASTA file in C++

I'm trying to read in a FASTA file. I want to remove/ignore the header/info lines that begin with ">" and store the sequences that follow in separate strings. Below is the code I have for that (partially reworked from https://rosettacode.org/wiki/FASTA_format#C++, because what I had originally worked even less). They have a good example of what I want to do.
My problem is given this fasta file:
">sequence_1
MSTAGKVIKCKAAVLWELHKPFTIEDIEVAPPKAHEVRIKMVATGVCRSDDHVVSGTLVTPLPAVLGHE
GAGIVEGVTCVKPGDKVIPLFSPQCGECRICKHPESNFCSRSDLLMPRGTLREGTSRFSCKGKQIHNFI
STSTFSQYTVVDDIAVAKIDGASPLDKVCLIGCGFSTGYGSAVKVAKVTPGSTCAVFGLGGVGLSVIIG
CKAAGAARIIAVDINKDKFAKAKELGATECIYSKPIQEVLQEMTDGGVDFSFEVIGRLDTMTSALLSCH
AACGVSVVVGVPPNAQNLSMNPMLLLLGRTWKGAIFGGFKSKDSVPKLVAKKFPLDPLITHVLPFEKIN
EAFDLLRSGKSIRTVLTF
">sequence_2
MNQGKVIKCKAAVLWEVKKPFSIEDVEVAPPKAYEVRIKMVAVGICHTDDHVVSGNLVTPLPVILGHEA
AGIVESVGEGVTTVKPGDKVIPLFTCRVCKNPESNYCLKNDLGNPRGTLQDGTRRFTCRGKPIHHFLGT
STFSQYTVVDENAVAKIDAASPLEKVCLIGCGFSTGYGSAVNVAKVTPGSTCAVFGLGGVGLSAVMGCK
AAGAARIIAVDINKDKFAKAKELGATECINPQDYKLPIQEVLKEMTDGSTVIGRLDTMMASLLCCGTSV
IVEDTPASQNLSINPMLLLTGRTWKGAVYGGFKSKEGIPKLVADFMAKKFSLDALITHVLPFEKINEGF
DLLHSGKSIRTVLTF
My output:
Sequence 1: MSTAGKVIKCKAAVLWELHKPFTIEDIEVAPPKAHEVRIKMVATGVCRSDDHVVSGTLVTPLPAVLGHEGAGIVEGVTCVKPGDKVIPLFSPQCGECRICKHPESNFCSRSDLLMPRGTLREGTSRFSCKGKQIHNFISTSTFSQYTVVDDIAVAKIDGASPLDKVCLIGCGFSTGYGSAVKVAKVTPGSTCAVFGLGGVGLSVIIGCKAAGAARIIAVDINKDKFAKAKELGATECIYSKPIQEVLQEMTDGGVDFSFEVIGRLDTMTSALLSCHAACGVSVVVGVPPNAQNLSMNPMLLLLGRTWKGAIFGGFKSKDSVPKLVAKKFPLDPLITHVLPFEKINEAFDLLRSGKSIRTVLTF
Sequence 2: MNQGKVIKCKAAVLWEVKKPFSIEDVEVAPPKAYEVRIKMVAVGICHTDDHVVSGNLVTPLPVILGHEAAGIVESVGEGVTTVKPGDKVIPLFTCRVCKNPESNYCLKNDLGNPRGTLQDGTRRFTCRGKPIHHFLGTSTFSQYTVVDENAVAKIDAASPLEKVCLIGCGFSTGYGSAVNVAKVTPGSTCAVFGLGGVGLSAVMGCKAAGAARIIAVDINKDKFAKAKELGATECINPQDYKLPIQEVLKEMTDGSTVIGRLDTMMASLLCCGTSVIVEDTPASQNLSINPMLLLTGRTWKGAVYGGFKSKEGIPKLVADFMAKKFSLDALITHVLPFEKINEGF
The last line of Sequence 2 is cut off. Any help or solutions?
void read_in_Protein(string Protein_filename)
{ // read in the sequences
fstream myfile;
myfile.open(Protein_filename, ios::in);
if (!myfile.is_open()) {
cerr << "Error can not open file" << endl;
exit(1);
}
string Protein_Sequences{};
string Protein_Seq_names{};
// string temp{};
string Prot_Seq1{};
string Prot_Seq2{};
string line{};
while (getline(myfile, line).good()) {
//std::cout << "Line input received (" << line.length() << "): " << line << std::endl;
if (line.empty() || line[0] == '>') { // Identifier marker
if (!Protein_Seq_names.empty()) { // Print out what we read from the last entry
//std::cout << "\tReseting to new sequence" << std::endl;
// cout << Protein_Sequences << endl;
Protein_Seq_names.clear();
Prot_Seq1 = Protein_Sequences;
}
if (!line.empty()) {
//std::cout << "\tSetting sequence start" << std::endl;
Protein_Seq_names = line.substr(1);
}
// std::cout << "\tClearing sequences..." << std::endl;
Protein_Sequences.clear();
}
else if (!Protein_Seq_names.empty()) {
line = line.substr(0, line.length() - 1);
if (line.find(' ') != string::npos) { // Invalid sequence--no spaces allowed
//std::cout << "\tSpace found, clearing buffers..." << std::endl;
Protein_Seq_names.clear();
Protein_Sequences.clear();
}
else {
//std::cout << "\tAppending line to protein sequence..." << std::endl;
Protein_Sequences += line;
}
}
//std::cout << "Protein_Sequences: " << Protein_Sequences << std::endl;
}
if (!Protein_Seq_names.empty()) { // Print out what we read from the last entry
// cout << Protein_Sequences << endl;
Prot_Seq2 = Protein_Sequences;
}
cout << "\nSequence 1: " << Prot_Seq1 << endl;
cout << Prot_Seq1.length();
cout << "\nSequence 2: " << Prot_Seq2 << endl;
cout << Prot_Seq2.length();
}

Assuming your file doesn't end with a newline, the last call to std::getline sets the eof bit to indicate that it reached the end of the file before finding a line ending. Because you check .good() in your while loop, that last line is discarded. You should instead check !fail() (or just the boolean value of the stream itself, which is equivalent to !fail()):
while (getline(myfile, line))
After reading the final line the next iteration of the loop will try to read whilst the stream is in the eof state and immediately fail and break out of the loop.

Related

Debug Error Thrown When Checking !str.find()

I'm learning C++. I'm developing a simple "library management" application that allows users to create an account, check out books, etc. Each book is managed using its own text file. The text file contains three lines, but only the third is important here, as it contains the owner of the book.
The following code prints the contents of an additional text file that contains a list of all the books, but that shouldn't be relevant to the error. It converts the contents of the text file to a string, and then checks to see if "NA" is present. If "NA" is present, it is replaced with the current username. The file is then reopened using ios::trunc to wipe the file, and the new string is passed into the file. This works fine.
The issue is that when running the application, if a username is already there instead of "NA", I get a Debug Error that only reads abort() has been called. I've tried debugging, but I can't get any more information.
This is the error and the code:
void bookCheckout()
{
system("CLS");
string line;
string bookChoice;
ifstream checkBookList;
ofstream checkOutBook;
checkBookList.open("books/booklist.txt");
string sTotal;
string s;
cout << "<---Avaliable Books--->" << endl;
while (getline(checkBookList, line)) {
cout << line << endl;
}
checkBookList.close();
cout << "\nWhat Book Would You Like?:";
cin >> bookChoice;
checkBookList.open("books/" + bookChoice + ".txt");
while (!checkBookList.eof()) {
getline(checkBookList, s);
sTotal += s + "\n";
}
checkBookList.close();
if (sTotal.find("NA")) {
sTotal.replace(sTotal.find("NA"), 2, globalUsername);
checkOutBook.open("books/" + bookChoice + ".txt", ios::trunc);
checkOutBook << sTotal;
checkOutBook.close();
}
else if (!sTotal.find("NA")) {
cout << "Book already checked out!" << endl;
}
checkOutBook.close();
system("PAUSE");
}
There are a few issues with your code:
lack of error handling.
while (!checkBookList.eof()) - see Why is iostream::eof inside a loop condition (i.e. while (!stream.eof())) considered wrong?
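if (sTotal.find("NA")) - string::find() returns an INDEX, not a BOOLEAN. If 0 is returned, that will be evaluated as false. All other values will be evaluated as true, including string::npos (-1), which is what string::find() returns if a match is not found. In that case sTotal.replace() is called with npos as the position and throws std::out_of_range, which is most likely what triggers the abort().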
Also, your goal is to check if the 3RD LINE SPECIFICALLY is "NA", so using string::find() is not the best choice for that purpose. Think of what would happen if the 1st or 2nd line happened to contain the letters NA. Your code logic would not behave properly.
else if (!sTotal.find("NA")) - no need to call find() in the else at all. Just use else by itself.
With that said, try something more like this:
void bookCheckout()
{
system("CLS");
ifstream checkBookList;
checkBookList.open("books/booklist.txt");
if (!checkBookList.is_open()) {
cerr << "Unable to open booklist.txt" << endl;
return;
}
string line;
cout << "<---Available Books--->" << endl;
while (getline(checkBookList, line)) {
cout << line << endl;
}
checkBookList.close();
cout << "\nWhat Book Would You Like?:";
string bookChoice;
getline(cin, bookChoice);
checkBookList.open("books/" + bookChoice + ".txt");
if (!checkBookList.is_open()) {
cerr << "Unable to open " + bookChoice + ".txt for reading" << endl;
return;
}
string sTotal, sOwner;
int lineNum = 0;
string::size_type ownerIndex = string::npos;
while (getline(checkBookList, line)) {
++lineNum;
if (lineNum == 3) {
sOwner = line;
ownerIndex = sTotal.size();
}
sTotal += line + "\n";
}
checkBookList.close();
if (sOwner == "NA") {
sTotal.replace(ownerIndex, 2, globalUsername);
ofstream checkOutBook("books/" + bookChoice + ".txt", ios::trunc);
if (!checkOutBook.is_open()) {
cerr << "Unable to open " + bookChoice + ".txt for writing" << endl;
return;
}
checkOutBook << sTotal;
checkOutBook.close();
cout << "Book checked out!" << endl;
}
else {
cout << "Book already checked out by " << sOwner << "!" << endl;
}
system("PAUSE");
}
Alternatively, use a std::vector to gather the book contents, which gives you indexed access to each line:
#include <vector>
void bookCheckout()
{
system("CLS");
ifstream checkBookList;
checkBookList.open("books/booklist.txt");
if (!checkBookList.is_open()) {
cerr << "Unable to open booklist.txt" << endl;
return;
}
string line;
cout << "<---Available Books--->" << endl;
while (getline(checkBookList, line)) {
cout << line << endl;
}
checkBookList.close();
cout << "\nWhat Book Would You Like?:";
string bookChoice;
getline(cin, bookChoice);
checkBookList.open("books/" + bookChoice + ".txt");
if (!checkBookList.is_open()) {
cerr << "Unable to open " + bookChoice + ".txt for reading" << endl;
return;
}
vector<string> sTotal;
sTotal.reserve(3);
while (getline(checkBookList, line)) {
sTotal.push_back(line);
}
while (sTotal.size() < 3) {
sTotal.push_back("");
}
checkBookList.close();
if (sTotal[2] == "" || sTotal[2] == "NA") {
sTotal[2] = globalUsername;
ofstream checkOutBook("books/" + bookChoice + ".txt", ios::trunc);
if (!checkOutBook.is_open()) {
cerr << "Unable to open " + bookChoice + ".txt for writing" << endl;
return;
}
for(size_t i = 0; i < sTotal.size(); ++i) {
checkOutBook << sTotal[i] << '\n';
}
checkOutBook.close();
cout << "Book checked out!" << endl;
}
else {
cout << "Book already checked out by " << sTotal[2] << "!" << endl;
}
system("PAUSE");
}

Segmentation fault on getline while parsing a file

I'm making a very simple file parser, in CSV style. It compiles cleanly, but when I run it I get a segfault (core dumped). The only line printed is the "Done" message saying that the file was opened successfully. So my guess is that the segfault happens during while(getline(myfile, line)).
Here's my code (parser.cpp):
#include "parser.h"
vector<string> str_explode(string const & s, char delim)
{
vector<string> result;
istringstream iss(s);
for (string token; getline(iss, token, delim); )
{
result.push_back(move(token));
}
return result;
}
vector<vector<string>> getTokensFromFile(string fileName)
{
bool verbose = true;
if(verbose)
cout << "Entering getTokensFromFile(" << fileName << ")" << endl ;
/* declaring what we'll need :
* string line -> the line beeing parsed
* ifstream myfile -> the file that name has been given as parameter
* vector <vector <string> > tokens -> the return value
*
* Putting all line into tokens
*/
string line;
ifstream myfile(fileName);
vector< vector<string> > tokens;
if(verbose)
cout << "Opening file " << fileName << " ... ";
if (myfile.is_open())
{
if(verbose)
cout << "Done !" << endl;
while (getline (myfile,line))
{
if(verbose)
cout << "Parsing line '" << line << "'. ";
// If line is blank or start with # (comment)
// then we don't parse it
if((line.length() == 0) || (line.at(0) == '#'))
{
if(verbose)
cout << "Empty or comment, passing.";
continue;
}
else
{
vector <string> tmptokens;
if(verbose)
cout << "Adding token " << tmptokens[0] << " and its values.";
tokens.push_back(tmptokens);
}
cout << endl;
}
}
else
{
cout << "Unable to open file " << fileName << endl;
throw exception();
}
if(verbose)
cout << "Exiting getTokensFromFile(" << fileName << ")" << endl;
return tokens;
}
main.cpp
#include "parser.h"
int main()
{
getTokensFromFile("testfile.csv");
return 0;
}
And my testfile.csv
version;1.3
###### SPECIE ######
SpecieID;Value1
VariantID;Value2
####################
##### IDENTITY #####
Name;Value3
DOName;Value4
####################
All files are in the same folder.
Do you have any clue why I'm having this segfault?
Thanks
Here is one obvious error, where you are accessing a vector's element out-of-bounds. Accessing an out-of-bounds element is undefined behavior.
else
{
vector <string> tmptokens;
if(verbose)
cout << "Adding token " << tmptokens[0] << " and its values.";
tokens.push_back(tmptokens);
}
Since tmptokens is empty, there is no tmptokens[0].
If the vector is empty, you could have done this:
else
{
if(verbose)
cout << "Adding new token and its values.";
tokens.push_back({});
}
There is no need to manually create an empty vector starting with C++11.

getline and testing EOF at once

I read the "Getline and EOF" question, but it did not help with my problem.
I have no idea where the mistake could be.
Is there a problem with the functions I use (getline or the EOF check)?
If there is no text in the text.txt file, the program still says that something was found, and I cannot see why.
What I want is to search for a string and, if there is no text in the txt file, have the program report EOF or something similar. Instead, even if the file is empty, it says the string I was looking for was found in line one, position one.
Here is the code:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int openFile(void);
int closeFile(void);
int getTime(void);
int findTime();
int findDate();
int stringFind(string);
bool getOneLine(void);
string what;
bool ifound = false;
string foundstring;
string filename ;
fstream inputfile;
string sentence ;
size_t found ;
string foundTime ;
string foundDate ;
bool timeIsHere = false;
bool dateIsHere = false;
int iterTime = 0;
int iterDate = 0;
int line = 0;
int main (void){
sentence.clear();
cout << " Enter the file name:" << endl;
openFile();
while (getOneLine() != false) {
stringFind("Time");
}
cout << "END OF PROGRAM" << endl;
system("PAUSE");
///getTime();
closeFile();
system("PAUSE");
}
int closeFile(void) {
inputfile.close();
cout << " File: " << filename << " - was closed...";
return 0;
}
int openFile(void) {
cout << " Insert file name in program directory or full path to desired file you want to edit:"<<endl;
cout << " Do not use path with a space in directory address or filename ! " << endl;
cout<<" ";
getline(cin, filename);
inputfile.open(filename, ios::in);
cout <<" file_state: " << inputfile.fail();
if (inputfile.fail() == 1) {
cout << " - Cannot open your file" << endl;
}
else cout << " - File was openned sucesfully"<< endl;
return 0;
}
int stringFind(string what) {
cout << " I am looking for:" << what << endl;
found = what.find(sentence);
if (found == string::npos) {
cout << " I could not find this string " << endl;
}
else if(found != string::npos){
cout << " substring was found in line: " << line + 1 << " position: " << found + 1 << endl << endl;
ifound = true;
foundstring = sentence;
}
return 0;
}
bool getOneLine(void) {
if (inputfile.eof()) {
cout << "END OF FILE" << endl << endl;
return false;
}
else{
getline(inputfile, sentence);
cout << "next sentence is: "<< sentence << endl;
return true;
}
}
I am a newbie and I have no one to ask in person. I tried editing the while loop and the ifs to make sure I did not make a serious mistake, but I have no idea what is wrong.
I tried it with, for example, sample.txt, and this file was empty.
Always test whether input succeeded after the read attempt! The stream cannot know what you are attempting to do. It can only report whether the attempts were successful so far. So, you'd do something like
if (std::getline(stream, line)) {
// deal with the successful case
}
else {
// deal with the failure case
}
In the failure case you might want to use eof() to determine whether the failure was due to reaching the end of the stream: having reached the end of the file, and thus having std::ios_base::eofbit set, is often not an error but simply the indication that you are done. It may still be an error, e.g., when it is known how many lines are to be read but fewer lines are obtained.
The correct way to combine getline() with an EOF check would be like this:
bool getOneLine(void) {
if (getline(inputfile, sentence)) {
cout << "next sentence is: "<< sentence << endl;
return true;
}
if (inputfile.eof())
cout << "EOF reached" << endl;
else
cout << "Some IO error" << endl;
return false;
}
You have one mistake here:
found = what.find(sentence);
You are seeking inside of what for the sentence. If sentence is empty, it will be found.
Change it to
found = sentence.find(what);
You should definitely learn how to use a debugger. That way you would find such issues pretty fast!

fstream << operator not writing entire output

Everything goes well until the f << "string" << temp_int << endl; statement.
I get different results with different open modes: it either doesn't write at all or writes only the first two chars of "NumberSaves".
unsigned int temp_int = 0;
fstream f("resources/saveData/Player/savelog.txt");
if (!f)
{
cout << "error accessing savelist" << endl;
}
else
{
string skip;
std::stringstream iss;
string line;
readVarFromFile(f, iss, skip, line, { &temp_int }); //check how many saves currently
temp_int += 1; //increment number of saves by 1
f.seekp(ios_base::beg);
cout << "Write position: " << f.tellp() << endl; //check stream is at beginning
f << "<NumberSaves>" << temp_int << endl; //truncate <NumberSaves> 'x' with <NumberSaves> 'x + 1'
cout << "Write position: " << f.tellp() << endl; //position suggests the entire string has been written, only two characters have been
if (!f)
{
cout << "ERROR";
}
f.seekp(ios_base::end);
f << currentPlayer->getName(); //append players name to end of file
}
desired output is as follows
NumberSaves 2
player
anotherplayer
current output
Nu
player
Use seekp properly like this:
os.seekp(0, std::ios_base::end); // means bring me to 0 from the end of file.
look at the example code in
http://en.cppreference.com/w/cpp/io/basic_ostream/seekp
std::ios_base::end is a direction not an absolute position. It is just an enum value. The value is probably 2 and that is why it brings you to position 2 inside the file.

FASTA reader written in C++?

Let me start off by stating that I am a beginner in C++. Anyways, the FASTA format goes as follows:
Any line starting with a '>' indicates the name/id of the gene sequence right below it. There is a gene sequence right below the id. This gene sequence can be 1 or multiple lines.
So... what I want to do is print: id << " : " << gene_sequence << endl;
Here is my code:
#include <iostream>
#include <fstream>
int main(int argc, char **argv) {
if (argc < 2) {
std::cerr << " Wrong format: " << argv[0] << " [infile] " << std::endl;
return -1;
}
std::ifstream input(argv[1]);
if (!input.good()) {
std::cerr << "Error opening: " << argv[1] << " . You have failed." << std::endl;
return -1;
}
std::string line, id, DNA_sequence;
while (std::getline(input, line).good()) {
if (line[0] == '>') {
id = line.substr(1);
std::cout << id << " : " << DNA_sequence << std::endl;
DNA_sequence.clear();
}
else if (line[0] != '>'){
DNA_sequence += line;
}
}
}
For the second argument inputted into the command line, here is the content of my file:
>DNA_1
GATTACA
>DNA_2
TAGACCA
TAGACCA
>DNA_3
ATAC
>DNA_4
AT
Please copy and paste into text file.
After this has been done and the code has been executed, the problem shows up: the code does not print the sequence of DNA_1 next to its own id, and instead prints DNA_1's sequence next to DNA_2. All results get pushed forward by one as a result. Any assistance or tips would be greatly appreciated.
As I've said before, I am new to C++. And the semantics are quite hard to learn compared to Python.
I see a few problems with your code.
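First you loop on std::ifstream::good(), which doesn't work because it treats reaching End Of File as a failure, even though EOF can be set by a read that succeeded.
Then you access line[0] without checking whether the line is empty. Before C++11 that was undefined behaviour for a non-const string and could cause a seg-fault; since C++11 it yields the null character, but skipping empty lines explicitly is still clearer.
Next you output the "previous" entry before you have even collected it.
Finally you don't output the final entry, because output only happens when another > is found and the loop terminates without finding one.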
I added comments to my corrections to your code:
#include <iostream>
#include <fstream>
int main(int argc, char **argv) {
if (argc < 2) {
std::cerr << " Wrong format: " << argv[0] << " [infile] " << std::endl;
return -1;
}
std::ifstream input(argv[1]);
if (!input.good()) {
std::cerr << "Error opening: " << argv[1] << " . You have failed." << std::endl;
return -1;
}
std::string line, id, DNA_sequence;
// Don't loop on good(), it doesn't allow for EOF!!
// while (std::getline(input, line).good()) {
while (std::getline(input, line)) {
// line may be empty so you *must* ignore blank lines
// or you have a crash waiting to happen with line[0]
if(line.empty())
continue;
if (line[0] == '>') {
// output previous line before overwriting id
// but ONLY if id actually contains something
if(!id.empty())
std::cout << id << " : " << DNA_sequence << std::endl;
id = line.substr(1);
DNA_sequence.clear();
}
else {// if (line[0] != '>'){ // not needed because implicit
DNA_sequence += line;
}
}
// output final entry
// but ONLY if id actually contains something
if(!id.empty())
std::cout << id << " : " << DNA_sequence << std::endl;
}
Output:
DNA_1 : GATTACA
DNA_2 : TAGACCATAGACCA
DNA_3 : ATAC
DNA_4 : AT
You're storing the new id before printing the old one:
id = line.substr(1);
std::cout << id << " : " << DNA_sequence << std::endl;
Swap the lines around for proper order. You probably also want to check if you have any id already present to skip the first entry.
A working implementation is here:
https://rosettacode.org/wiki/FASTA_format#C.2B.2B
I only corrected
while( std::getline( input, line ).good() ){
to
while( std::getline( input, line ) ){
Code
#include <iostream>
#include <fstream>
int main( int argc, char **argv ){
if( argc <= 1 ){
std::cerr << "Usage: "<<argv[0]<<" [infile]" << std::endl;
return -1;
}
std::ifstream input(argv[1]);
if(!input.good()){
std::cerr << "Error opening '"<<argv[1]<<"'. Bailing out." << std::endl;
return -1;
}
std::string line, name, content;
while( std::getline( input, line ) ){
if( line.empty() || line[0] == '>' ){ // Identifier marker
if( !name.empty() ){ // Print out what we read from the last entry
std::cout << name << " : " << content << std::endl;
name.clear();
}
if( !line.empty() ){
name = line.substr(1);
}
content.clear();
} else if( !name.empty() ){
if( line.find(' ') != std::string::npos ){ // Invalid sequence--no spaces allowed
name.clear();
content.clear();
} else {
content += line;
}
}
}
if( !name.empty() ){ // Print out what we read from the last entry
std::cout << name << " : " << content << std::endl;
}
return 0;
}
input:
>Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED
output:
Rosetta_Example_1 : THERECANBENOSPACE
Rosetta_Example_2 : THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED