I apologize in advance if this gets long; the problem itself can be stated quickly, but I describe a good part of the environment in the hope that it helps.
I'm developing a tool in C++ that analyzes genomic data (FASTQ data specifically). In case you don't know, a FASTQ file contains raw reads of genetic sequence, where each set of 4 lines of this file represents a read of the genome.
Well, as the processing of a read does not depend on other reads (in my case), I intend to parallelize the work with threads to save processing time.
While searching, I found some ready-made libraries and I'm trying to adapt one. Right now I'm trying to use a thread pool (specifically this lib -> https://github.com/log4cplus/ThreadPool/) but I'm having some problems with the integration.
I will detail the general algorithm a little.
Each iteration in the FASTQ file takes a set of 4 lines from the file and stores it in this struct.
struct READ {
    std::string name;    // Small size
    std::string comment; // Small size
    std::string seq;     // String ranging between 0 and 1000 characters
    std::string qual;    // Same length as 'seq'
};
That struct is immediately passed by value to the function that performs the (non-trivial) search, which is managed by the thread pool.
ThreadPool pool(4); // Create a thread pool with 4 threads
while (exist_read)  // as long as there are reads to be processed -> read a set of 4 lines
{
    READ read = next_read(); // read will contain the 4 lines read in this iteration
    pool.enqueue([read] {
        Search(read); // Search performs the non-trivial search work
    });
}
As can be seen, I capture the read by value in the lambda. I tested it with several threads, but I ended up with worse performance than when running single-threaded.
So it was suggested that I use a movable object, and with a lot of help I arrived at this struct.
struct READ
{
    READ() = default;
    READ(READ &&) = default;
    READ &operator=(READ &&) = default;

    std::string name;
    std::string comment;
    std::string seq;
    std::string qual;
};
And the lambda function has changed to:
pool.enqueue([r{std::move(read)}] {
    Search(std::move(r));
});
In addition, I updated the reading of the FASTQ file from:
[...]
READ read;
read.name = lines[0];
read.seq = lines[1];
read.comment = lines[2];
read.qual = lines[3];
[...]
To:
[...]
READ read;
read.name = std::move(lines[0]);
read.seq = std::move(lines[1]);
read.comment = std::move(lines[2]);
read.qual = std::move(lines[3]);
[...]
However, I still haven't been able to test the performance, because I'm getting the following compile error:
error: function "READ::READ (const READ &)" (declared implicitly)
cannot be referenced - it is a deleted function C/C++ (1776)
I searched other forums but I couldn't find an answer that would suit my problem.
If anyone can help me with this problem I will be very grateful. I hope I have been clear.
I'm also open to suggestions for other approaches.
Thanks.
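A note on the error itself: READ(READ &&) = default; declares a move constructor, which suppresses the implicitly declared copy constructor, so anything that still tries to copy the captured READ (many thread-pool queues store tasks in std::function, which requires copyable callables) fails with exactly this message. Below is only a sketch of one common workaround, assuming next_read() returns a READ by value and that Search() can take the read by value or by reference: capture a std::shared_ptr<READ> so the closure stays copyable while the READ itself is never copied.
#include <memory>

// Sketch only: the shared_ptr wrapper is an assumption, not part of the original code.
while (exist_read)
{
    auto read = std::make_shared<READ>(next_read()); // one heap allocation per read, no READ copy
    pool.enqueue([read] {             // copying the closure only copies the shared_ptr
        Search(std::move(*read));     // each task owns its read, so moving out of it is safe
    });
}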
Related
I have these large pcap files of market tick data. On average they are 20gb each. The files are divided into packets. Packets are divided into a header and messages. Messages are divided into a header and fields. Fields are divided into a field code and field value.
I am reading the file a character at a time. I have a file reader class that reads the characters and passes them by const ref to 4 callback functions: on_packet_delimiter, on_header_char, on_message_delimiter, on_message_char. The message object uses a similar function to construct its fields.
Up to here I've noticed little loss of efficiency as compared to just reading the chars and not doing anything with them.
The part of my code where I'm processing the message header and extracting the instrument symbol slows down the process considerably.
void message::add_char(const char& c)
{
if (!message_header_complete) {
if (is_first_char) {
is_first_char = false;
if (is_lower_case(c)) {
first_prefix = c;
} else {
symbol_vector.push_back(c);
}
} else if (is_field_delimiter(c)) {
on_message_header_complete();
on_field_delimiter(c);
} else {
symbol_vector.push_back(c);
}
} else {
// header complete, collect field information
if (is_field_delimiter(c)) {
on_field_delimiter(c);
} else {
fp->add_char(c);
}
}
}
...
void message::on_message_header_complete()
{
message_header_complete = true;
symbol.assign(symbol_vector.begin(),symbol_vector.end());
}
...
Until on_message_header_complete() I am feeding the chars into symbol_vector. Once the header is complete, I convert it to a string using the vector iterators. Is this the most efficient way to do this?
In addition to The Quantum Physicist's answer: std::string should behave quite similarly to vector here. Even the reserve() function is available in the string class, if you intend to use it for efficiency.
Adding the characters is just as easy as it can get:
std::string s;
char c = 's';
s += c;
You could add the characters directly to your member, and you would be fine. But if you want to keep your member clean until the whole string is collected, you should still use a std::string object instead of the vector. You then add the characters to a temporary string and, upon completion, swap the contents. No copying, just a pointer exchange (plus some additional data such as capacity and size...).
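A small sketch of that "collect then swap" idea; the names temp and symbol are illustrative, not taken from the original code:
std::string temp;   // collects characters while the header is being parsed
temp.reserve(32);   // optional: avoids reallocations for typical symbol lengths

temp += 'A';        // add characters as they arrive
temp += 'B';

std::string symbol; // the "clean" member
symbol.swap(temp);  // O(1) buffer exchange, no character copy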
How about:
std::string myStr(myVec.begin(), myVec.end());
Although this works, I don't understand why you need to use vectors in the first place. Just use std::string from the beginning, and use myStr.append() to add characters or strings.
Here's an example:
std::string myStr = "abcd";
myStr.append(1,'e');
myStr.append(std::string("fghi"));
//now myStr is "abcdefghi"
This is a follow-up question to: C++ - Developing own version of std::count_if?
I have the following function:
// vector for storing the file names that contains sound
std::vector<std::string> FilesContainingSound;
void ContainsSound(const std::unique_ptr<Signal>& s)
{
// Open the Wav file
Wav waveFile = Wav("Samples/" + s->filename_);
// Copy the signal that contains the sufficient energy
std::copy_if(waveFile.Signal().begin(), waveFile.Signal().end(),
FilesContainingSound.begin(), [] (const Signal& s) {
// If the energy bin > threshold then store the
// file name inside FilesContaining
});
}
But it seems to me that I only need to capture the string "filename" inside the lambda expression, because I'll only be working with that. I just need access to waveFile.Signal() in order to do the analysis.
Anyone have any suggestions?
EDIT:
std::vector<std::string> FilesContainingSound;
std::copy_if(w.Signal().begin(), w.Signal().end(),
FilesContainingSound.begin(), [&] (const std::unique_ptr<Signal>& file) {
// If the energy bin > threshold then store the
// file name inside FilesContaining
});
You seem to be getting different levels of abstraction confused here. If you're going to work with file names, then you basically want something on this order:
std::vector<std::string> input_files;
std::vector<std::string> files_that_contain_sound;
bool file_contains_sound(std::string const &filename) {
Wav waveFile = Wav("Samples/" + filename);
return binned_energy_greater(waveFile, threshold);
}
std::copy_if(input_files.begin(), input_files.end(),
std::back_inserter(files_that_contain_sound),
file_contains_sound);
For the moment I've put the file_contains_sound in a separate function simply to make its type clear -- since you're dealing with file names, it must take a file name as a string, and return a bool indicating whether that file name is one of the group you want in your result set.
In reality, you almost never really want to implement that as an actual function though--you usually want it to be an object of some class that overloads operator() (and a lambda is an easy way to generate a class like that). The type involved must remain the same though: it still needs to take a file name (string) as a parameter, and return a bool to indicate whether that file name is one you want in your result set. Everything dealing with what's inside the file will happen inside of that function (or something it calls).
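For completeness, here is a sketch of the same predicate written inline as a lambda, reusing the names Wav, binned_energy_greater and threshold from the snippet above (and assuming threshold is a local variable that needs capturing):
std::copy_if(input_files.begin(), input_files.end(),
             std::back_inserter(files_that_contain_sound),
             [threshold](std::string const &filename) {
                 Wav waveFile = Wav("Samples/" + filename);
                 return binned_energy_greater(waveFile, threshold);
             });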
I am new to Threading Building Blocks (TBB); I need to implement the following logic with TBB nodes:
A node of type N receives two inputs; for instance:
1. std::vector // data
2. bool // flag
These inputs come asynchronously.
If the input is of type 1, process the data owned by the node of type N to produce two outputs, for instance:
a. std::vector
b. int
If the input is of type 2, process the data owned by the node of type N to produce one output, say a std::vector.
I have been trying to formulate the input part using a tbb::flow::or_node, and the output part using tbb::flow::multifunction_node.
If there is only one input and multiple outputs, this logic can be written with tbb::flow::multifunction_node (I tested it, it works). If there is one output and multiple inputs, I found examples of code illustrating solutions. However, it is not clear to me how the case of multiple asynchronous inputs and multiple outputs can be implemented with the TBB framework. Suggestions welcome.
You should be able to do what you want with the current implementation of or_node. (We are re-designing the output of the or_node to make it more friendly, but we need input from users like you on issues with the or_node Community Preview Feature.)
One thing to remember is to turn on the CPF when you are compiling code with the or_node. The switch is -DTBB_PREVIEW_GRAPH_NODES=1 .
#define TBB_PREVIEW_GRAPH_NODES 1 // necessary to turn on the or_node Community Preview Feature
#include "tbb/flow_graph.h"
#include <vector>
using namespace tbb::flow;
// The output format of the or_node is a struct that contains
// 1. the index of the input that the message appeared on, and
// 2. a tuple, the (i-1)th element of which is the message received
typedef or_node<tuple<std::vector<double>, bool> > my_or_node_type;
// it wasn't clear from the description if you wanted to differentiate between the vectors output with
// an input of type 1. or type 2. If you need to do that you can add an extra output port to the multifunction_node.
typedef multifunction_node<my_or_node_type::output_type, tuple<std::vector<double>, int> > my_mf_node_type;
struct mf_node_body {
void operator()(const my_or_node_type::output_type &in, my_mf_node_type::output_ports_type &op) {
switch(in.indx) {
case 0: {
// do the operation for the first input (the std::vector) The vector will be in
// get<0>(in.result). Remember you are copying vectors here, so if you have big
// vectors you will probably want to do some buffer management on your own and
// pass refs to the vector instead.
}
break;
case 1: {
// do the operation signaled by the second input (the bool.) The value of the
// input is in get<1>(in.result).
}
break;
}
}
};
int main() {
graph g;
my_or_node_type multi_in(g);
my_mf_node_type multi_out(g, unlimited, mf_node_body());
// if the vector-producing node is called vpn, you attach it to the 0-th input of the or_node with
// make_edge(vpn, input_port<0>(multi_in));
//
// the bool-producing node bn can be attached similarly:
// make_edge(bn, input_port<1>(multi_in);
//
// attach the multi-in to the multi-out:
// make_edge(multi_in, multi_out);
//
// attach the vector consumer node vcn
// make_edge(output_port<0>(multi_out), vcn);
//
// attach the integer output to the int consuming node icn
// make_edge(output_port<1>(multi_out), icn);
//
// start up the graph and make sure to do a wait_for_all() at the end.
}
Remember that the multifunction_node body is invoked in parallel, so the work it does should not have race conditions (unless you want race conditions for some reason.) You can make the node body execute serially by constructing it with serial instead of unlimited. And the only way to ensure you can safely destroy the graph is to make sure no tasks are executing any of the nodes. The best way to do this is to do a g.wait_for_all().
Regards,
Chris
P.S. - one addendum. If the multifunction_node is defined serial, it will have an input buffer unless you explicitly exclude it. This may change the behavior of your graph if you are not expecting the buffer to be there.
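A tiny sketch tying those two notes to code, reusing the typedefs and the graph g from the example above:
my_mf_node_type serial_out(g, serial, mf_node_body()); // body invocations are serialized; inputs may be buffered
// ... build edges as shown above ...
g.wait_for_all(); // ensure no node body is still running before g is destroyed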
Possible Duplicate:
Reading from text file until EOF repeats last line
I am writing data to a file using the following code:
//temp is class object
fstream f;
f.open("file", ios::in | ios::out | ios::binary);
for (int i = 0; i < number_of_employees; ++i)
{
    temp.getdata();
    f.write((char*)&temp, sizeof(temp));
}
f.close();
temp is the object of following class
class employee
{
char eno[20];
char ename[20];
char desg[20];
int bpay;
int ded;
public:
void getdata();
void displaydata();
};
But when I write data using this code, I find that the last object written to the file shows up two times when I read it back.
My function to read from the file is:
fstream f;
f.open ("file", ios::in|ios::out|ios::binary);
while(f)
{
f.read((char*)&temp, sizeof(temp));
temp.displaydata();
}
f.close();
The following shows my file when it is read till EOF:
Number :1
Name :seb
Designation:ceo
Basic Pay :1000
Deductions :100
Number :2
Name :sanoj
Designation:cto
Basic Pay :2000
Deductions :400
Number :2
Name :sanoj
Designation:cto
Basic Pay :2000
Deductions :400
What is the cause of this and how can I solve it?
If the problem is repeated output, it's very likely caused by the way you are looping. Please post the exact loop code.
If the loop is based on the data you receive from getdata(), you'll need to look closely at exactly what you input as well. You might not be receiving what you expect.
Of course, without real code, these are almost just guesses.
The reason for your problem is simple: you're not checking whether the
read has succeeded before using the results. The last read encounters
end of file, fails without changing the values in your variables, and
then you display the old values. The correct way to do exactly what
you're trying to do would be:
while ( f.read( reinterpret_cast<char*>( &temp ), sizeof( temp ) ) ) {
temp.displaydata();
}
Exactly what you're trying to do, however, is very fragile, and could
easily break with the next release of the compiler. The fact that your
code needs a reinterpret_cast should be a red flag, indicating that
what you're doing is extremely unportable and implementation dependent.
What you need to do is first, define a binary format (or use one that's
already defined, like XDR), then format your data according to it into a
char buffer (I'd use std::vector<char> for this), and finally use
f.write on this buffer. On reading, it's the reverse: you read a
block of char into a buffer, and then extract the data from it.
std::ostream::write and std::istream::read are not for writing and
reading raw data (which makes no sense anyway); if they were, they'd
take void*. They're for writing and reading pre-formatted data.
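As an illustration of the "format into a buffer, then write the buffer" idea, here is a small sketch; the fixed 4-byte big-endian layout is just an example format chosen for the sketch, not something prescribed above:
#include <cstdint>
#include <vector>

// Append a 32-bit value to the buffer in big-endian byte order.
void append_uint32(std::vector<char> &buf, std::uint32_t v) {
    buf.push_back(static_cast<char>((v >> 24) & 0xFF));
    buf.push_back(static_cast<char>((v >> 16) & 0xFF));
    buf.push_back(static_cast<char>((v >> 8) & 0xFF));
    buf.push_back(static_cast<char>(v & 0xFF));
}

// Usage: build the whole record in memory, then issue a single write.
// std::vector<char> buf;
// append_uint32(buf, static_cast<std::uint32_t>(bpay));
// f.write(buf.data(), static_cast<std::streamsize>(buf.size()));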
Writing an object to a file with write((char*)&object, sizeof(object)) is asking for trouble!
Rather write a dedicated write function for the class:
class employee {
    ...
public:
    void write(ostream &out) {
        out.write(eno, sizeof(eno));
        out.write(ename, sizeof(ename));
        out.write(desg, sizeof(desg));
        out.write((char*)&bpay, sizeof(bpay));
        out.write((char*)&ded, sizeof(ded));
    }
    void read(istream &in) {
        in.read(eno, sizeof(eno));
        in.read(ename, sizeof(ename));
        ...
        in.read((char*)&bpay, sizeof(bpay));
        in.read((char*)&ded, sizeof(ded));
    }
};
ostream &operator <<(ostream &out, employee &e) {
e.write(out);
return out;
}
istream &operator >>(istream &in, employee &e) {
e.read(in);
return in;
}
Once you've done that, you can use:
f << temp;
to write your employee record to the file.
But note that even this isn't great, because at least as far as the integers are concerned we're still very platform dependent, in terms of the size of an int and the endianness of the int.
I'm having trouble deserializing an object in C++ that I had serialized in C# and then sent over the network with ZMQ. I'm fairly certain the ZMQ part is working correctly because the C++ server application (Linux) successfully receives the serialized messages from C# (Windows) and sends them back to Windows where it can successfully deserialize the message, so I don't think I'm experiencing any sort of truncated or dropped packets in that regard.
However, when I receive the message on the Linux server, the C++ deserialize method does not deserialize it correctly: it dumps a bunch of binary data into the 6th field (I can see this in MyObject.DebugString()) and puts no data in any other field. The strange part is that a class I had with 5 fields works perfectly fine; C++ deserializes it correctly and all of the data comes through properly. Below are a few snippets of my code. Any help would be greatly appreciated.
C#:
MemoryStream stream = new MemoryStream();
ProtoBuf.Serializer.Serialize<TestType>(stream, (TestType)data);
_publisher.Send(stream.ToArray());
C++:
message_t data;
int64_t recv_more;
size_t recv_more_sz = sizeof(recv_more);
TestType t;
bool isProcessing = true;
while(isProcessing)
{
pSubscriber->recv(&data, 0);
t.ParseFromArray((void*)(data.data()),sizeof(t));
cout<<"Debug: "<<t.DebugString()<<endl;
pSubscriber->getsockopt(ZMQ_RCVMORE, &recv_more, &recv_more_sz);
isProcessing = recv_more;
}
The output looks like this:
Debug: f: "4\000\000\000\000\000\"
I'm having trouble copying and pasting, but the output continues like that for probably 3 or 4 more lines.
This is my TestType class (proto file):
package Base_Types;
enum Enumr {
Dog = 0;
Cat = 1;
Fish = 2;
}
message TestType {
required double a = 1;
required Enumr b = 2;
required string c = 3;
required string d = 4;
required double e = 5;
required bytes f = 6;
required string g = 7;
required string h = 8;
required string i = 9;
required string j = 10;
}
Field "f" is listed as bytes because when it was a string before it was giving me a warning about UTF-8 encoding, however, when this class worked with only 5 fields (the enum was one of them), it did not give me that error. It's almost like instead of deserializing, it's throwing the binary for the entire class into field "f" (field 6).
Solution: There ended up being an issue where the memory wasn't being copied before it was sent to a thread socket. When the publisher sent it back out, it was repackaging the data and changing what the router received. There needs to be a memcpy() on the C++ side in order to send out the data that is used internally. Thanks for all of the help.
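For reference, a sketch of what that can look like on the receiving side; the std::vector copy stands in for the memcpy() mentioned above, and parsing with data.size() instead of sizeof(t) is an assumption about the other change needed in the receive loop from the question:
zmq::message_t data;
pSubscriber->recv(&data, 0);

// Copy the payload out of the zmq::message_t so it can outlive the message object.
std::vector<char> buffer(static_cast<char*>(data.data()),
                         static_cast<char*>(data.data()) + data.size());

TestType t;
t.ParseFromArray(buffer.data(), static_cast<int>(buffer.size())); // parse the received size, not sizeof(t)
cout << "Debug: " << t.DebugString() << endl;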
I've parsed it through the reader in v2, and it seems to make perfect sense:
1=5
2=0
3=
4=yo
5=6
6=2 bytes, 68-69
7=how
8=are
9=you
10=sir
Note that I've done that purely from the hex data (not using the .proto), but it should be close to your original data. But most notably, it seems intact.
So: first thing to do; check that the binary you get at the C++ side is exactly the same as the binary you sent; this is doubly important if you are doing any translations along the way (binary => string, for example - which should be done via base-64).
second thing; if that doesn't work, it is possible that there is a problem in the C++ implementation. It seems unlikely since that is one of google's pets, but nothing is impossible. If the binary comes across intact, but it still behaves oddly, I can try speaking to the C++ folks, to see if one of us has gone cuckoo.