Traversing a Fasta file in C/C++ - c++

I'm looking to write a program in C/C++ to traverse a Fasta file formatted like:
>ID and header information
SEQUENCE1
>ID and header information
SEQUENCE2
and so on
in order to find all unique sequences (check if subset of any other sequence)
and write unique sequences (and all headers) to an output file.
My approach was:
Copy all sequences to an array/list at the beginning (more efficient way to do this?)
Grab header, append it to output file, compare sequence for that header to everything in the list/array. If unique, write it under the header, if not delete it.
However, I'm a little unsure as to how to approach reading the lines in properly. I need to read the top line for the header, and then "return?" to the next line to read the sequence. Sometimes the sequence spans more than two lines, so would I use > (from the example above) as a delimiter? If I use C++, I imagine I'd use iostreams to accomplish this?
If anybody could give me a nudge in the right direction as to how I would want to read the information I need to manipulate/how to carry out the comparison, it'd be greatly appreciated.

First, rather than write your own FASTA reading routine you probably want to use something that already exists; for example, see: http://lh3lh3.users.sourceforge.net/parsefastq.shtml
Internally you'll have the sequences without newlines, and that is probably helpful. I think the simplest approach, from a high level, is:
1. Loop over the FASTA input and write the sequences out to a file.
2. Sort that file.
3. With the sorted file it becomes easier to pick out subsequences, so write a program to find the "unique IDs".
4. Using the unique IDs, go back to the original FASTA and get whatever additional information you need.
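If you'd rather roll your own reader with iostreams, step 1 might be sketched as follows. This is only a sketch, not a library routine: FastaRecord and read_fasta are illustrative names, and it assumes every header line starts with '>'.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct FastaRecord {
    std::string header;   // the ">ID and header information" line
    std::string sequence; // all sequence lines concatenated, without newlines
};

// A '>' line starts a new record; every other line is appended to the
// current record's sequence, so sequences spanning several lines work.
std::vector<FastaRecord> read_fasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>')
            records.push_back({line, ""});
        else if (!records.empty())
            records.back().sequence += line;
    }
    return records;
}
```

Opening a std::ifstream and passing it to read_fasta gives you all records in memory for the later steps.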

Your approach is usable. Below is an implementation of it.
However, I'm a little unsure as to how to approach reading the lines
in properly. ... Sometimes the sequence spans more then two lines, so would I use > (from the example above) as a delimiter?
That's right; in addition, there's just the EOF which has to be checked.
I wrote the function getd() for that, which reads a single-line description or concatenated lines of sequence data and returns a pointer to the string it allocated.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *getd()
{
    char *s = NULL, *eol = NULL;
    size_t i = 0, size = 1; // 1 for null character
    int c;
#define MAXLINE 80 // recommended max. line length; longer lines are okay
    do // read single-line description or concatenated lines of sequence data
    {
        do // read a line (until '\n')
        {
            s = realloc(s, size += MAXLINE+1); // +1 for newline character
            if (!s) puts("out of memory"), exit(1);
            if (!fgets(s+i, MAXLINE+2, stdin)) break; // at most MAXLINE+1 chars per call
            eol = strchr(s+i, '\n');
            i += strlen(s+i);
        } while (!eol);
        if (!i) { free(s); return NULL; } // nothing read
        if (*s == '>') return s; // single-line description
        if (eol) i = eol-s; // next fgets overwrites the newline, concatenating lines
        ungetc(c = getchar(), stdin); // peek at next character
    } while (c != '>' && c != EOF);
    return s;
}
int main()
{
    char *s;
    struct seq { char *head, *data; } *seq = NULL;
    int n = 0, i, j;

    while ((s = getd()))
        if (*s == '>')
        {   // new sequence: make room, store header
            seq = realloc(seq, ++n * sizeof *seq);
            if (!seq) puts("out of memory"), exit(1);
            seq[n-1] = (struct seq){ s, "" };
        }
        else if (n) // store sequence data if at least one header present
            seq[n-1].data = s;

    for (i = 0; i < n; ++i)
    {
        const int max = 70; // reformat output data to that line length max.
        printf("%s", seq[i].head);
        for (s = seq[i].data, j = 0; j < n; ++j)
            if (j != i) // compare sequence to the others, delete if not unique
                if (strstr(seq[j].data, s)) { s = seq[i].data = ""; break; }
        for (; strlen(s) > (size_t)max && s[max] != '\n'; s += max)
            printf("%.*s\n", max, s);
        printf("%s", s);
    }
}

Related

My compressed file has a larger file size than the original file

I was able to write code for Huffman coding using only the queue library. But when I save the compressed file, it has a larger byte size than the original file.
For example:
filesize.txt is 17 bytes and contains the string "Stressed-desserts", while
compressedfile.bin is 44 bytes and contains the Huffman codes of the original file: "01111011000011110001001100100011110010010111".
This is my whole code
#include <iostream>
#include <queue>
#include <fstream>
using namespace std;

struct HuffNode {
    int my_Frequency;
    char my_Char;
    string my_Code;
    HuffNode* my_Left;
    HuffNode* my_Right;
};

//global variables
int freq[256] = {0};
string encoded = "";
string filename;

//Comparing the frequency in the priority queue
struct compare_freq {
    bool operator()(HuffNode* l, HuffNode* r) {
        return l->my_Frequency > r->my_Frequency;
    }
};

priority_queue<HuffNode*, vector<HuffNode*>, compare_freq> freq_queue;

//get the file name from the user
string get_file_name()
{
    cout << "Input file name to compress: ";
    cin >> filename;
    return filename;
}

//Scan the file to be compressed and tally the occurrences of all characters.
void file_getter()
{
    fstream fp;
    char c;
    fp.open(get_file_name(), ios::in);
    if (!fp)
    {
        cout << "Error: Couldn't open file " << endl;
        system("pause");
    }
    else
    {
        while (!fp.eof())
        {
            c = fp.get();
            freq[c]++;
        }
    }
    fp.close();
}

//HuffNode to create a newNode for the queue containing the letter and the frequency
HuffNode* set_Node(char ch, int count)
{
    HuffNode* newNode = new HuffNode;
    newNode->my_Frequency = count;
    newNode->my_Char = ch;
    newNode->my_Code = "";
    newNode->my_Right = nullptr;
    newNode->my_Left = nullptr;
    return newNode;
}

//Sort or prioritize characters based on number of occurrences in text.
void insert_Node(char ch, int count)
{
    //pass the ch and count to the newNodes for queueing
    freq_queue.push(set_Node(ch, count));
}

void create_Huffman_Tree()
{
    HuffNode* root;
    file_getter();
    //insert the characters and their frequencies into the priority queue
    for (int i = 0; i < 256; i++)
    {
        if (freq[i] > 0)
        {
            insert_Node(char(i), freq[i]);
        }
    }
    //build the huffman tree
    while (freq_queue.size() > 1)
    {
        //get the two highest priority nodes
        HuffNode* for_Left = freq_queue.top();
        freq_queue.pop();
        HuffNode* for_Right = freq_queue.top();
        freq_queue.pop();
        //Create a new HuffNode with the combined frequency of the left and right children
        int freq = for_Left->my_Frequency + for_Right->my_Frequency;
        char ch = '$';
        root = set_Node(ch, freq);
        root->my_Left = for_Left;
        root->my_Right = for_Right;
        //Insert the new node into the priority_queue.
        freq_queue.push(root);
    }
    // The remaining HuffNode in the queue is the root of the Huffman tree
    root = freq_queue.top();
}

void preOrderTraverse(HuffNode* root, char c, string code)
{
    if (root == nullptr) {
        // If the tree is empty, return
        return;
    }
    if (root->my_Char == c)
    {
        // If the current HuffNode is a leaf node, record the code for the character.
        root->my_Code = code;
        encoded += code;
        return;
    }
    // Otherwise, recurse on the left and right children
    preOrderTraverse(root->my_Left, c, code + "0");
    preOrderTraverse(root->my_Right, c, code + "1");
}

void encode_File(string ccode)
{
    HuffNode* root = freq_queue.top();
    for (int i = 0; i < ccode.length(); i++)
    {
        char c = ccode[i];
        string code = "";
        preOrderTraverse(root, c, code);
    }
}

void save_Huffman_Code()
{
    fstream fp, fp2;
    fp.open("Compressed_file.bin", ios::out);
    fp2.open(filename, ios::in);
    string ccode;
    getline(fp2, ccode);
    encode_File(ccode);
    fp << encoded;
    fp.close();
    fp2.close();
}

int main()
{
    create_Huffman_Tree();
    HuffNode* root = freq_queue.top();
    save_Huffman_Code();
}
I should get a compressed file with a smaller byte size than the original. I am trying to write the code without using bit operations, unordered_map or map; I only use priority_queue for the program.
You are writing eight bits per bit to your output, so it is eight times larger than it's supposed to be. You want to write one bit per bit. To write bits, you need to accumulate them, one by one, into a byte buffer until you have eight, then write that byte. At the end, write the remaining bits. Use the bit operators << and | to put the bits into the byte buffer. E.g. for each bit equal to 0 or 1:
unsigned buf = 0, n = 0;
...
buf |= bit << n;
if (++n == 8) {
    fp.put(buf);
    buf = n = 0;
}
...
if (n)
    fp.put(buf);
There are many other things wrong with your code.
Because c is a signed byte type, freq[c]++; will fail for input that has bytes larger than 127, as c will be negative. You need int c; instead of char c.
Using while(!fp.eof()) will result in getting a -1 as your last character, which is an EOF indication, and again indexing your array with a negative number. Do while ((c = fp.get()) != -1).
You use a series of get()'s the first time you read the file, which is correct. However the second time you read the file, you use a single getline(). This only gets the first line, and it omits the new line character. Read the file the same way both times, with a series of get()'s.
You are only writing the codes. There is no description of the Huffman code preceding them, so there is no way for a decoder to make any sense of the bits you send. Once you fix it to send a bit per bit instead of a byte per bit, your output will be smaller than what the data can actually be compressed to. When you add the tree, the input and output will be about the same length.
You are traversing the entire tree every time you want to encode one character! You need to make a table of codes by traversing the tree once, and then use the table to encode.
There is no way to know how many characters have been encoded, which will result in an ambiguity for any extra bits in the last byte. You need to either send the number of characters ahead of the encoded characters, or include one more symbol when coding for an end-of-stream indicator.
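To illustrate the "traverse once, then look up" point above: with a reduced node struct (only the fields needed here; the question's HuffNode works the same way through its my_Left/my_Right pointers), a table builder might look like this sketch.

```cpp
#include <array>
#include <string>

struct HuffNode {
    char my_Char;
    HuffNode* my_Left;
    HuffNode* my_Right;
};

// One pre-order walk of the finished tree fills a 256-entry code table;
// encoding a character is then a single table lookup instead of a
// whole-tree search per character.
void build_table(const HuffNode* node, const std::string& prefix,
                 std::array<std::string, 256>& table) {
    if (!node) return;
    if (!node->my_Left && !node->my_Right) { // leaf: store its code
        table[static_cast<unsigned char>(node->my_Char)] = prefix;
        return;
    }
    build_table(node->my_Left, prefix + "0", table);
    build_table(node->my_Right, prefix + "1", table);
}
```

Note that a leaf is identified by having no children, not by comparing my_Char, so a '$' appearing in the input cannot be confused with an internal node.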
What you have in encoded is a string of '0's and '1's. Those are themselves characters.
You may want to convert the string to binary before storing it.
If you use a character (a byte) to store each 0 or 1, it takes more space: one byte per digit instead of one bit. If you convert the data to bits, it should take (44/8)+1 = 6 bytes.
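That conversion might be sketched as follows (pack_bits is an illustrative name; it packs most-significant bit first, and the zero-padded last byte means the decoder also needs the original bit count):

```cpp
#include <string>
#include <vector>

// Pack a string of '0'/'1' characters into bytes, 8 bits per byte,
// first bit in the most significant position of each byte.
std::vector<unsigned char> pack_bits(const std::string& bits) {
    std::vector<unsigned char> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i)
        if (bits[i] == '1')
            bytes[i / 8] |= static_cast<unsigned char>(1u << (7 - i % 8));
    return bytes;
}
```

For the 44-character example string this produces (44+7)/8 = 6 bytes, matching the figure above.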

What is the reason behind the debugging getting stopped abruptly in the following code?

Here is the code to find the number of matches of a string, which is input from the user, can be found in the file temp.txt. If, for example, we want love to be counted, then matches like love, lovely, beloved should be considered. We also want to count the total number of words in temp.txt file.
I am doing a line by line reading here, not word by word.
Why does the debugging stop at totalwords += counting(line)?
/* this code is not working to count the words */
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int totalwords{0};

int counting(string line) {
    int wordcount{0};
    if (line.empty()) {
        return 1;
    }
    if (line.find(" ") == string::npos) {
        wordcount++;
    } else {
        while (line.find(" ") != string::npos) {
            int index = 0;
            index = line.find(" ");
            line.erase(0, index);
            wordcount++;
        }
    }
    return wordcount;
}

int main() {
    ifstream in_file;
    in_file.open("temp.txt");
    if (!in_file) {
        cerr << "PROBLEM OPENING THE FILE" << endl;
    }
    string line{};
    int counter{0};
    string word{};
    cout << "ENTER THE WORD YOU WANT TO COUNT IN THE FILE: ";
    cin >> word;
    int n{0};
    n = (word.length() - 1);
    while (getline(in_file >> ws, line)) {
        totalwords += counting(line);
        while (line.find(word) != string::npos) {
            counter++;
            int index{0};
            index = line.find(word);
            line.erase(0, (index + n));
        }
    }
    cout << endl;
    cout << counter << endl;
    cout << totalwords;
    return 0;
}
line.erase(0, index); doesn't erase the space, you need
line.erase(0, index + 1);
Your code reveals a few problems...
At very first, counting a single word for an empty line doesn't appear correct to me. Second, erasing again and again from the string is pretty inefficient: with every such operation, all of the subsequent characters are copied towards the front. If you indeed wanted to do so, you might rather search from the end of the string, avoiding that. But you can actually do it without ever modifying the string if you use the second parameter of std::string::find (which defaults to 0, so it has been transparent to you):
int index = line.find(' ' /*, 0 */); // first call; 0 is default, thus implicit
index = line.find(' ', index + 1);   // subsequent calls
Note that using the character overload is more efficient if you search for a single character anyway. However, this variant doesn't consider other whitespace like e. g. tabulators.
Additionally, the variant as posted in the question doesn't consider more than one subsequent whitespace! In your erasing variant – which erases one character too few, by the way – you would need to skip incrementing the word count if you find the space character at index 0.
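Put together, the find-based idea can be written without any erasing (count_words is an illustrative name; only spaces and tabs are treated as separators in this sketch):

```cpp
#include <string>

// Count words without modifying the string: jump to the start of the
// next word, count it, then jump past its end.
std::size_t count_words(const std::string& line) {
    std::size_t count = 0, pos = 0;
    while ((pos = line.find_first_not_of(" \t", pos)) != std::string::npos) {
        ++count;                              // found the start of a word
        pos = line.find_first_of(" \t", pos); // move past the word
        if (pos == std::string::npos)
            break;                            // word ran to the end of the line
    }
    return count;
}
```

Unlike the erasing variant, this handles empty lines and runs of consecutive spaces correctly.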
However I'd go with a totally new approach, looking at each character separately; you need a stateful loop in that case, though, i.e. you need to remember if you already are within a word or not. It might look e. g. like this:
size_t wordCount = 0; // note: prefer an unsigned type, negative values
                      // are meaningless anyway;
                      // size_t is especially fine as it is guaranteed to be
                      // large enough to hold any count of characters the
                      // string might ever contain
bool inWord = false;
for (char c : line)
{
    if (isspace(static_cast<unsigned char>(c)))
        // you can check for *any* white space that way...
        // note the cast to unsigned, which is necessary as isspace accepts
        // an int and a bare char *might* be signed, thus result in negative
        // values
    {
        // no word any more...
        inWord = false;
    }
    else if (inWord)
    {
        // well, nothing to do, we already discovered a word earlier!
        //
        // as we actually don't do anything here you might just skip
        // this block and check for the opposite: if(!inWord)
    }
    else
    {
        // OK, this is the start of a word!
        // so now we need to count a new one!
        ++wordCount;
        inWord = true;
    }
}
Now you might want to break words at punctuation characters as well, so you might actually want to check for:
if(isspace(static_cast<unsigned char>(c)) || ispunct(static_cast<unsigned char>(c)))
A bit shorter is the following variant:
if(/* space or punctuation */)
{
    inWord = false;
}
else
{
    wordCount += !inWord; // adds 1 only at the start of a word
    inWord = true;
}
Finally: All code is written freely, thus unchecked – if you find a bug, please fix yourself...
debugging getting stopped abruptly
Does debugging indeed stop at the indicated line? I observed instead that the program hangs within the while loop in counting. You may make this visible by inserting an indicator output (marked by HERE in following code):
int counting(string line){
    int wordcount{0};
    if(line.empty()){
        return 1;
    }
    if(line.find(" ")==string::npos){wordcount++;}
    else{
        while(line.find(" ")!=string::npos){
            int index=0;
            index = line.find(" ");
            line.erase(0,index);
            cout << '.'; // <--- HERE: indicator output
            wordcount++;
        }
    }
    return wordcount;
}
As Jarod42 pointed out, the erase call you are using misses the space itself. That's why you are finding spaces and “counting words” forever.
There is also an obvious misconception about words and separators of words visible in your code:
empty lines don't contain words
consecutive spaces don't indicate words
words may be separated by non-spaces (parentheses for example)
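For comparison, stream extraction already follows the first two of these rules (empty lines yield zero, runs of whitespace collapse), though it does not split at punctuation. A reference count might be sketched as:

```cpp
#include <sstream>
#include <string>

// Count whitespace-separated words exactly the way operator>> tokenizes.
std::size_t count_words_stream(const std::string& line) {
    std::istringstream iss(line);
    std::string word;
    std::size_t count = 0;
    while (iss >> word)
        ++count;
    return count;
}
```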
Finally, as already mentioned: if the problem is about counting total words, it's not necessary to discuss the other parts. And after the test (see HERE) above, it also appears to be independent on file input. So your code could be reduced to something like this:
#include <iostream>
#include <string>

int counting(std::string line) {
    int wordcount = 0;
    if (line.empty()) {
        return 1;
    }
    if (line.find(" ") == std::string::npos) {
        wordcount++;
    } else {
        while (line.find(" ") != std::string::npos) {
            int index = 0;
            index = line.find(" ");
            line.erase(0, index);
            wordcount++;
        }
    }
    return wordcount;
}

int main() {
    int totalwords = counting("bla bla");
    std::cout << totalwords;
    return 0;
}
And in this form, it's much easier to see if it works. We expect to see a 2 as output. To get there, it's possible to try correcting your erase call, but the result would then still be wrong (1) since you are actually counting spaces. So it's better to take the time and carefully read Aconcagua's insightful answer.

Split student list that has format like: 0001 William Bill Junior 8.5

I need to split a list of students like this into ID, name and score. This is an exercise, so no std::string is allowed, only char arrays.
0001 William Bob 8.5
0034 Howard Stark 9.5
0069 Natalia Long Young 8
Here's the code
int readFile(list& a) {
    char str[MAX];
    short i = 0;
    ifstream fi(IN);
    if (!fi)
        return 0;
    while (!fi.eof()) {
        fi.getline(str, MAX - 1);
        char* temp = strtok(str, " ");
        if (temp == NULL)
            continue;
        strcpy(a.sv[i].id, temp);
        temp = strtok(NULL, "0123456789");
        strcpy(a.sv[i].name, temp);
        temp = strtok(NULL, "\n");
        a.sv[i].grade = atof(temp);
        i++;
    }
    a.size = i;
    fi.close();
    return 1;
}
Using strtok() I have split the ID and name successfully, but the scores come out as
0.5
0.5
0
I know the problem is because of temp = strtok(NULL, "0123456789"); but I don't know how to fix it. Are there any delimiters besides "0123456789", or can I move the pointer back?
This is my attempt to fix the while(!file.eof()) issue and solve my problem.
Here's my heading and structs:
#include <iostream>
#include <fstream>
#include <string>
#define IN "D:\\Input.txt"
#define OUT "D:\\Output.txt"
#define MAX 40
using namespace std;

struct sv {
    char id[MAX], name[MAX], sex[MAX];
    float grade;
};

struct dssv {
    sv sv[MAX];
    short size;
};
And here's my function:
int readFile(dssv& a) {
    char str[MAX];
    short i = 0;
    ifstream fi(IN);
    if (!fi)
        return 0;
    while (fi >> a.sv[i].id && fi.getline(str, MAX)) {
        char* name = strchr(str, ' ');
        int pos = strrchr(name, ' ') - name;
        char* score = str + pos;
        strcpy(name + pos, "\0"); // null-terminate to remove the score
        strcpy(a.sv[i].name, name + 1);
        a.sv[i].grade = atof(score + 1);
        i++;
    }
    a.size = i;
    fi.close();
    return 1;
}
I'm still figuring out how to fix the eof() issue, and why I need the two pointers char* name and char* score instead of reusing one.
You have started off on the wrong foot. See Why is while ( !feof (file) ) always wrong?. While there are a number of ways to separate the information into id, name, score, probably the most basic is to simply read an entire line of data into a temporary buffer (character array), and then to use sscanf to separate id, name & score.
The parsing with sscanf is not difficult. The only caveat is that your name can contain whitespace, so you cannot simply use "%s" as the format specifier to extract the name. This is mitigated by your score field always starting with a digit while digits do not occur in names. (There are always exceptions to the rule -- it can also be handled with a simple parse using a pair of pointers, but for the basic example we will make this formatting assumption.)
To make data handling simpler, and to be able to coordinate all the information for one student as a single object (allowing you to create an array of them to hold all student information), you can use a simple struct. Declaring a few constants to set the sizes for everything avoids using magic numbers throughout your code. (Though for the sscanf field-width modifiers, actual numbers must be used, as you cannot use constants or variables for the width modifier.) For example, your struct could be:
#define MAXID 8 /* if you need a constant, #define one (or more) */
#define MAXNM 64
#define MAXSTD 128
#define MAXLN MAXSTD

typedef struct { /* simple struct to hold student data */
    char id[MAXID];
    char name[MAXNM];
    double score;
} student_t;
(POSIX reserves the "_t" suffix for extensions of its types; there won't be a conflicting "student_t" type, but in general be aware of the restriction, though you will see the "_t" suffix frequently.)
The basic approach is to read a line from your file into a buffer (with either fgets or POSIX getline) and then pass the line to sscanf. You condition your read loop on the successful read of each line so your read stops when EOF is reached. For separating the values with sscanf, it is convenient to use a temporary struct to hold the separated values. That way if the separation is successful, you simply add the temporary struct to your array. To read the students into an array of student_t you could do:
size_t readstudents (FILE *fp, student_t *s)
{
    char buf[MAXLN]; /* temporary array (buffer) to hold line */
    size_t n = 0; /* number of students read from file */

    /* read each line in file until file read or array full */
    while (n < MAXSTD && fgets (buf, MAXLN, fp)) {
        student_t tmp = { .id = "" }; /* temporary struct to fill */
        /* extract id, name and score from line, validate */
        if (sscanf (buf, "%7s %63[^0-9] %lf", tmp.id, tmp.name, &tmp.score) == 3) {
            char *p = strrchr (tmp.name, 0); /* pointer to end of name */
            /* backup overwriting trailing spaces with nul-terminating char */
            while (p && --p >= tmp.name && *p == ' ')
                *p = 0;
            s[n++] = tmp; /* add temp struct to array, increment count */
        }
    }
    return n; /* return number of students read from file */
}
Now let's take a minute and look at the sscanf format string used:
sscanf (buf, "%7s %63[^0-9] %lf", tmp.id, tmp.name, &tmp.score)
Above, with the line in buf, the format string used is "%7s %63[^0-9] %lf". Each character array type uses a field-width modifier to limit the number of characters stored in the associated array to one-less-than the number of characters available. This protects the array bounds and ensures that each string stored is nul-terminated. The "%7s" is self-explanatory - read at most 7-characters into what will be the id.
The next conversion specifier, for the name, is "%63[^0-9]", which is a bit more involved as it uses the "%[...]" character-class conversion specifier with the match inverted by use of '^' as the first character. The characters in the class are the digits 0-9, so the conversion specifier reads up to 63 characters that do not include digits. This has the side-effect of including the spaces between name and score in name. Thankfully they are simple enough to remove by getting a pointer to the end of the string with strrchr (tmp.name, 0); and then backing up, checking if the character is a ' ' (space) and overwriting it with a nul-terminating character (e.g. '\0' or its numeric equivalent 0).
The last part of the sscanf conversion, "%lf" is simply the conversion specifier for the double value for score.
Note: most importantly, the conversion is validated by checking the return of the call to sscanf is 3 -- the number of conversions requested. If all conversions succeed into the temporary struct tmp, then tmp is simply added to your array of struct.
To call the function from main() and read the student information, you simply declare an array of student_t to hold the information, open and validate your data file is open for reading, and make a call to readstudents capturing the return to validate that student information was actually read from the file. Then you can make use of the data as you wish (it is simply output below):
int main (int argc, char **argv) {
    student_t students[MAXSTD] = {{ .id = "" }}; /* array of students */
    size_t nstudents = 0; /* count of students */
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!fp) { /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }
    /* read students from file, validate return, if zero, handle error */
    if ((nstudents = readstudents (fp, students)) == 0) {
        fputs ("error: no students read from file.\n", stderr);
        return 1;
    }
    if (fp != stdin) /* close file if not stdin */
        fclose (fp);

    for (size_t i = 0; i < nstudents; i++) /* output each student data */
        printf ("%-8s %-24s %g\n",
                students[i].id, students[i].name, students[i].score);

    return 0;
}
All that remains is including the required headers, stdio.h and string.h and testing:
Example Use/Output
$ ./bin/read_stud_id_name_score dat/stud_id_name_no.txt
0001 William Bob 8.5
0034 Howard Stark 9.5
0069 Natalia Long Young 8
It works as needed.
Note, this is the most basic way of separating the values and only works based on the assumption that your score field starts with a digit.
You can eliminate that assumption by manually parsing the information you need by reading each line in the same manner, but instead of using sscanf, simply declare a pair of pointers to isolate id, name & score manually. The basic approach being to advance a pointer to the first whitespace and read id, skip the following whitespace and position the pointer at the beginning of name. Start from the end of the line with the other and backup to the first whitespace at the end and read score, then continue backing up positioning the pointer in the first space after name. Then just copy the characters between your start and end pointer to name and nul-terminate. It is more involved from a pointer-arithmetic standpoint, but just as simple. (that is left to you)
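The manual parse just described might be sketched as follows. This is only an illustration of the approach, not a drop-in replacement for the code above; split_student is a hypothetical helper that splits one line in place.

```cpp
#include <cstdlib>
#include <cstring>

// Split "0001 William Bob 8.5" in place into id, name and score.
// Returns false if the line doesn't have the expected shape.
bool split_student(char *line, char **id, char **name, double *score) {
    char *p = std::strchr(line, ' ');        // first space ends the id
    if (!p) return false;
    *p++ = '\0';
    *id = line;
    while (*p == ' ') ++p;                   // skip spaces before the name
    *name = p;
    char *end = std::strrchr(p, ' ');        // last space precedes the score
    if (!end) return false;
    *score = std::atof(end + 1);
    while (end > p && end[-1] == ' ') --end; // trim spaces after the name
    *end = '\0';                             // nul-terminate the name
    return true;
}
```

Because the score is found from the end, a name containing spaces (or even digits) is handled without the formatting assumption.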
Look things over and let me know if you have further questions. Normally, you would dynamically declare your array of students and allocate/reallocate as needed to handle any number of students from the file. (or from an actual C++ standpoint use the vector and string types that the standard template library provides and let the containers handle the memory allocation for you) That too is just one additional layer that you can add to add flexibility to your code.
C++ Implementation
I apologize for glossing over a C++ solution, but given your use of C string functions in your posted code, I provided a C solution in return. A C++ solution making use of std::string and std::vector is not that much different other than from a storage standpoint. The parsing of the three values is slightly different: the entire line is read into id and name, then the score is obtained from the portion of the line held in name, and those characters are erased from name.
Changing the C FILE* to std::ifstream and the array of student_t to a std::vector<student_t>, your readstudents() function could be written as:
void readstudents (std::ifstream& fp, std::vector<student_t>& s)
{
    student_t tmp; /* temporary struct to fill */

    /* read each line in file until the file is consumed */
    while (fp >> tmp.id && getline(fp, tmp.name)) {
        /* get offset to beginning digit within tmp.name */
        size_t offset = tmp.name.find_first_of("0123456789"),
               nchr; /* no. of chars converted with stod */
        if (offset == std::string::npos) /* validate digit found */
            continue;
        /* convert to double, save in tmp.score */
        tmp.score = std::stod(tmp.name.substr(offset), &nchr);
        if (!nchr) /* validate digits converted */
            continue;
        /* backup using offset to erase spaces after name */
        while (tmp.name.at(--offset) == ' ')
            tmp.name.erase(offset);
        s.push_back(tmp); /* add temporary struct to vector */
    }
}
(note: the return type is changed to void as the .size() of the student vector can be validated on return).
The complete example would be:
#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>
#include <vector>

struct student_t { /* simple struct to hold student data */
    std::string id;
    std::string name;
    double score;
};

void readstudents (std::ifstream& fp, std::vector<student_t>& s)
{
    student_t tmp; /* temporary struct to fill */

    /* read each line in file until the file is consumed */
    while (fp >> tmp.id && getline(fp, tmp.name)) {
        /* get offset to beginning digit within tmp.name */
        size_t offset = tmp.name.find_first_of("0123456789"),
               nchr; /* no. of chars converted with stod */
        if (offset == std::string::npos) /* validate digit found */
            continue;
        /* convert to double, save in tmp.score */
        tmp.score = std::stod(tmp.name.substr(offset), &nchr);
        if (!nchr) /* validate digits converted */
            continue;
        /* backup using offset to erase spaces after name */
        while (tmp.name.at(--offset) == ' ')
            tmp.name.erase(offset);
        s.push_back(tmp); /* add temporary struct to vector */
    }
}

int main (int argc, char **argv) {
    std::vector<student_t> students {}; /* vector of students */

    if (argc < 2) { /* validate one argument given for filename */
        std::cerr << "error: filename required as 1st argument.\n";
        return 1;
    }
    std::ifstream fp (argv[1]); /* use filename provided as 1st argument */
    if (!fp.good()) { /* validate file open for reading */
        std::cerr << "file open failed";
        return 1;
    }
    /* read students from file; if none were read, handle error */
    readstudents (fp, students);
    if (students.size() == 0) {
        std::cerr << "error: no students read from file.\n";
        return 1;
    }
    for (auto s : students) /* output each student data */
        std::cout << std::left << std::setw(8) << s.id
                  << std::left << std::setw(24) << s.name
                  << s.score << '\n';
}
(the output is the same -- aside from 2-spaces omitted between the values)
Look things over and let me know if you have questions.

How to read only some previously known lines using ifstream (C++)

By preprocessing the file I found some lines for further processing; now I want to read those lines. Is there any faster solution than reading lines one by one using ifstream::getline(...)?
For example, I know that I want only every fourth line (0-4-8-12-16-...), or specific line numbers stored in a vector...
Now I'm doing this:
string line;
int counter = 0;
while (getline(ifstr, line)) {
    if (counter++ % 4 == 0) {
        // some code working with line
    }
}
but I want something like this (if faster):
while (getline(ifstr, line)) {
    // some code working with line
    while (++counter % 4 != 0) { // or checking an index vector
        skipline(ifstr);
    }
}
Let me mention again that I have some line indices (sorted, but not as regular as this); I just use this multiples-of-4 example for simplicity.
Edit: I also want to jump over lines at the beginning. For example, if I know that I need to start reading from line number 2000, how do I skip the first 1999 lines quickly?
Thanks all
Because #caps said this left him with the feeling there's nothing in the standard library to help with this kind of task, I felt compelled to demonstrate otherwise :)
Live On Coliru
template <typename It, typename Out, typename Filter = std::vector<int> >
Out retrieve_lines(It begin, It const end, Filter lines, Out out, char const* delim = "\\n") {
    if (lines.empty())
        return out;

    // make sure input is orderly
    assert(std::is_sorted(lines.begin(), lines.end()));
    assert(lines.front() >= 0);

    std::regex re(delim);
    std::regex_token_iterator<It> line(begin, end, re, -1), eof;

    // make lines into incremental offsets
    std::adjacent_difference(lines.begin(), lines.end(), lines.begin());

    // iterate advancing by each offset requested
    auto advanced = [&line, eof](size_t n) { while (line != eof && n--) ++line; return line; };
    for (auto offset = lines.begin(); offset != lines.end() && advanced(*offset) != eof; ++offset) {
        *out++ = *line;
    }
    return out;
}
This is noticeably more generic. The trade-off (for now) is that the tokenizing iterator requires a random access iterator. I find this a good trade-off because "random access" on files really asks for memory mapped files anyway.
Live Demo 1: from string to vector<string>
Live On Coliru
int main() {
    std::vector<std::string> output_lines;
    std::string is(" a b c d e\nf g hijklmnop\nqrstuvw\nxyz");

    retrieve_lines(is.begin(), is.end(), {0,3,999}, back_inserter(output_lines));

    // for debug purposes
    for (auto& line : output_lines)
        std::cout << line << "\n";
}
Prints
a b c d e
xyz
Live Demo 2: From file to cout
Live On Coliru
#include <boost/iostreams/device/mapped_file.hpp>

int main() {
    boost::iostreams::mapped_file_source is("/etc/dictionaries-common/words");
    retrieve_lines(is.begin(), is.end(), {13,784,9996},
                   std::ostream_iterator<std::string>(std::cout, "\n"));
}
Prints e.g.
ASL's
Apennines
Mercer's
The use of boost::iostreams::mapped_file_source can easily be replaced with straight up ::mmap but I found it uglier in the presentation sample.
Store std::fstream::streampos instances corresponding to line beginnings of your file into a std::vector and then you can access a specific line using the index of this vector. A possible implementation follows,
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

class file_reader {
public:
    // load file streampos offsets during construction
    explicit file_reader(const std::string& filename)
        : fs(filename) { cache_file_streampos(); }

    std::size_t lines() const noexcept { return line_streampos_vec.size(); }

    // get a std::string representation of a specific line in the file
    std::string read_line(std::size_t n) {
        if (n >= line_streampos_vec.size())
            throw std::out_of_range("out of bounds");
        navigate_to_line(n);
        std::string rtn_str;
        std::getline(fs, rtn_str);
        return rtn_str;
    }
private:
    std::ifstream fs;
    std::vector<std::streampos> line_streampos_vec;
    const std::size_t max_line_length = 256; // some sensible value

    // store the streampos of the *beginning* of each line in the vector,
    // i.e. record the position before each getline() consumes the line
    void cache_file_streampos() {
        std::string s;
        s.reserve(max_line_length);
        for (std::streampos p = fs.tellg(); std::getline(fs, s); p = fs.tellg())
            line_streampos_vec.push_back(p);
    }

    // go to a specific line in the file stream
    void navigate_to_line(std::size_t n) {
        fs.clear(); // clear the eofbit left over from caching
        fs.seekg(line_streampos_vec[n]);
    }
};
Then you can read a specific line of your file via:
file_reader fr("filename.ext");
for (std::size_t i = 0; i < fr.lines(); ++i) {
    if (!(i % 4)) {
        std::string line_contents = fr.read_line(i); // then do something with the string
    }
}
ArchbishopOfBanterbury's answer is nice, and I agree that you will get cleaner code and better efficiency by just storing the character position of the beginning of each line during your preprocessing.
But, supposing that is not possible (perhaps the preprocessing is handled by some other API, or is from user input), there is a solution that should do the minimal amount of work necessary to read in only the specified lines.
The fundamental problem is that, given a file with variable line lengths, you cannot know in advance where each line begins and ends, since a line is just a sequence of characters terminated by '\n'. So you must examine every character to see whether it is '\n'; if it is, advance your line counter, and read the line in if the counter matches one of the requested line numbers.
auto retrieve_lines(std::ifstream& file_to_read, std::vector<int> line_numbers_to_read) -> std::vector<std::string>
{
    auto begin = std::istreambuf_iterator<char>(file_to_read);
    auto end = std::istreambuf_iterator<char>();

    auto current_line = 0;
    auto next_line_num = std::begin(line_numbers_to_read);

    auto output_lines = std::vector<std::string>();
    output_lines.reserve(line_numbers_to_read.size()); //this may be a silly "optimization," since all the strings are still separate unreserved buffers

    //we can bail if we've reached the end of the lines we want to read, even if there are lines remaining in the stream
    //we *must* bail if we've reached the end of the stream, even if there are supposedly lines left to read; that input must have been incorrect
    while (begin != end && next_line_num != std::end(line_numbers_to_read))
    {
        if (current_line == *next_line_num)
        {
            auto matching_line = std::string();
            if (*begin != '\n')
            {
                //potential optimization: reserve matching_line to something that you expect will fit most/all of your input lines
                while (begin != end && *begin != '\n')
                {
                    matching_line.push_back(*begin++);
                }
            }
            output_lines.emplace_back(matching_line);
            ++next_line_num;
        }
        else
        {
            //skip this "line" by finding the next '\n'
            while (begin != end && *begin != '\n')
            {
                ++begin;
            }
        }
        //either code path in the previous if/else leaves us staring at the '\n' at the end of a line,
        //which is not the right state for the next iteration of the loop.
        //So skip this '\n' to get to the beginning of the next line
        if (begin != end && *begin == '\n')
        {
            ++begin;
        }
        ++current_line;
    }
    return output_lines;
}
Here it is live on Coliru, along with the input I tested it with. As you can see, it correctly handles empty lines as well as correctly handling being told to grab more lines than are in the file.

How to tokenize (words) classifying punctuation as space

Based on this question which was closed rather quickly:
Trying to create a program to read a users input then break the array into seperate words are my pointers all valid?
Rather than closing it, I think some extra work could have gone into helping the OP clarify the question.
The Question:
I want to tokenize user input and store the tokens into an array of words.
I want to use punctuation (.,-) as delimiters and thus remove it from the token stream.
In C I would use strtok() to break an array into tokens and then manually build an array.
Like this:
The main Function:
#include <cstdio>   //getchar(), printf()
#include <cstdlib>  //malloc(), free()
#include <iostream>
using std::cin;
using std::cout;

char **findwords(char *str);

int main()
{
    int test;
    char words[100]; //an array of chars to hold the string given by the user
    char **word;     //pointer to a list of words
    int index = 0;   //index of the current word we are printing
    int c;           //int, not char, so getchar()'s EOF is representable

    cout << "die monster !";
    //a loop to place the characters that the user put in into the array
    do
    {
        c = getchar();
        words[index++] = c; //advance the index as each character is stored
    }
    while (c != '\n' && c != EOF && index < 99);
    words[index - 1] = '\0'; //overwrite the newline so strtok() sees a terminated string

    word = findwords(words);

    index = 0; //reset the index before walking the word list
    while (word[index] != 0) //loop through the list of words until the end of the list
    {
        printf("%s\n", word[index]); // while the words are going through the list print them out
        index++; //move on to the next word
    }
    //free it from the list since it was dynamically allocated
    free(word);
    cin >> test;
    return 0;
}
The line tokenizer:
#include <cstring>  //strtok()

char **findwords(char *str)
{
    int size = 20;  //original size of the list
    char *newword;  //pointer to the new word from strtok
    int index = 0;  //our current location in words
    char **words = (char **)malloc(sizeof(char *) * (size + 1)); //this is the actual list of words

    /* Get the initial word, and pass in the original string we want strtok() *
     * to work on. Here, we are separating words based on spaces, commas,     *
     * periods, and dashes. IE, if they are found, a new word is created.     */
    newword = strtok(str, " ,.-");

    while (newword != 0) //create a loop that goes through the string until it gets to the end
    {
        if (index == size)
        {
            //if the string is larger than the array increase the maximum size of the array
            size += 10;
            //resize the array in place, keeping the words already stored
            words = (char **)realloc(words, sizeof(char *) * (size + 1));
        }
        //assign words to its proper value
        words[index] = newword;
        //get the next word in the string
        newword = strtok(0, " ,.-");
        //increment the index to get to the next word
        ++index;
    }
    words[index] = 0;

    return words;
}
Any comments on the above code would be appreciated.
But, additionally, what is the best technique for achieving this goal in C++?
Have a look at boost tokenizer for something that's much better in a C++ context than strtok().
Already covered by a lot of questions is how to tokenize a stream in C++.
Example: How to read a file and get words in C++
But what is harder to find is how to get the same functionality as strtok():
Basically, strtok() allows you to split the string on a whole set of user-defined characters, while the C++ stream only allows you to use whitespace as a separator. Fortunately, the definition of whitespace is controlled by the locale, so we can modify the locale to treat other characters as space, and this will then allow us to tokenize the stream in a more natural fashion.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>
// This is my facet that will treat the ,.- as space characters and thus ignore them.
class WordSplitterFacet : public std::ctype<char>
{
public:
    typedef std::ctype<char> base;
    typedef base::char_type char_type;

    WordSplitterFacet(std::locale const& l)
        : base(table)
    {
        std::ctype<char> const& defaultCType = std::use_facet<std::ctype<char> >(l);

        // Copy the default values from the provided locale
        static char data[256];
        for (int loop = 0; loop < 256; ++loop) { data[loop] = loop; }
        defaultCType.is(data, data + 256, table);

        // Modifications to the default to include extra space types.
        table[','] |= base::space;
        table['.'] |= base::space;
        table['-'] |= base::space;
    }
private:
    base::mask table[256];
};
We can then use this facet in a locale like this:
std::ctype<char>* wordSplitter(new WordSplitterFacet(std::locale()));
<stream>.imbue(std::locale(std::locale(), wordSplitter));
The next part of your question is how you would store these words in an array. Well, in C++ you would not; you would delegate this functionality to std::vector/std::string. By reading your code you will see that it is doing two major things in the same part of the code.
It is managing memory.
It is tokenizing the data.
There is a basic principle, Separation of Concerns, under which a piece of code should only try to do one thing. It should either do resource management (memory management in this case) or it should do business logic (tokenization of the data). By separating these into different parts of the code you make the code generally easier to use and easier to write. Fortunately, in this example all the resource management is already done by std::vector/std::string, allowing us to concentrate on the business logic.
As has been shown many times, the easy way to tokenize a stream is with operator>> and a string. This will break the stream into words. You can then use stream iterators to loop across the stream, tokenizing it automatically:
std::vector<std::string> data;
for (std::istream_iterator<std::string> loop(<stream>); loop != std::istream_iterator<std::string>(); ++loop)
{
    // In here, loop is an iterator that has tokenized the stream using
    // operator>> (which for std::string reads one whitespace-separated word)
    data.push_back(*loop);
}
We can combine this with the standard algorithms to simplify the code:
std::copy(std::istream_iterator<std::string>(<stream>), std::istream_iterator<std::string>(), std::back_inserter(data));
Now, combining all of the above into a single application:
int main()
{
    // Create the facet.
    std::ctype<char>* wordSplitter(new WordSplitterFacet(std::locale()));

    // Here I am using a string stream.
    // But any stream can be used. Note you must imbue a stream before it is used.
    // Otherwise the imbue() will silently fail.
    std::stringstream teststr;
    teststr.imbue(std::locale(std::locale(), wordSplitter));

    // Now that it is imbued we can use it.
    // If this was a file stream then you could open it here.
    teststr << "This, stri,plop";

    std::vector<std::string> data;
    std::copy(std::istream_iterator<std::string>(teststr), std::istream_iterator<std::string>(), std::back_inserter(data));

    // Copy the array to cout one word per line
    std::copy(data.begin(), data.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
}