I have these large pcap files of market tick data. On average they are 20 GB each. The files are divided into packets. Packets are divided into a header and messages. Messages are divided into a header and fields. Fields are divided into a field code and field value.
I am reading the file a character at a time. I have a file reader class that reads the characters and passes each character by const reference to four callback functions: on_packet_delimiter, on_header_char, on_message_delimiter and on_message_char. The message object uses a similar function to construct its fields.
Up to here I've noticed little loss of efficiency as compared to just reading the chars and not doing anything with them.
The part of my code where I'm processing the message header and extracting the instrument symbol slows down the process considerably.
void message::add_char(const char& c)
{
    if (!message_header_complete) {
        if (is_first_char) {
            is_first_char = false;
            if (is_lower_case(c)) {
                first_prefix = c;
            } else {
                symbol_vector.push_back(c);
            }
        } else if (is_field_delimiter(c)) {
            on_message_header_complete();
            on_field_delimiter(c);
        } else {
            symbol_vector.push_back(c);
        }
    } else {
        // header complete, collect field information
        if (is_field_delimiter(c)) {
            on_field_delimiter(c);
        } else {
            fp->add_char(c);
        }
    }
}
...
void message::on_message_header_complete()
{
    message_header_complete = true;
    symbol.assign(symbol_vector.begin(), symbol_vector.end());
}
...
In add_char() I am feeding the chars into symbol_vector; once the header is complete, on_message_header_complete() converts the vector to a string via its iterators. Is this the most efficient way to do this?
In addition to The Quantum Physicist's answer: std::string should behave quite similarly to std::vector. Even the reserve function is available in the string class, if you intend to use it for efficiency.
Adding the characters is just as easy as it can get:
std::string s;
char c = 's';
s += c;
You could add the characters directly to your member, and you would be fine. But if you want to keep your member clean until the whole string is collected, you should still use a std::string object instead of the vector. You then add the characters to a temporary string and, upon completion, swap the contents. No copying, just a pointer exchange (plus some additional data such as capacity and size...).
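For illustration, a minimal sketch of that swap pattern, assuming the member names from the question (add_symbol_char is a made-up helper):

#include <string>

class message {
    std::string symbol;        // the member that stays clean until complete
    std::string symbol_buffer; // temporary collector

public:
    void add_symbol_char(char c) {
        symbol_buffer += c;         // amortized O(1) append
    }

    void on_message_header_complete() {
        symbol.swap(symbol_buffer); // pointer exchange, no character copying
        symbol_buffer.clear();      // keeps its capacity for the next message
    }
};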
How about:
std::string myStr(myVec.begin(), myVec.end());
Although this works, I don't understand why you need to use vectors in the first place. Just use std::string from the beginning, and use myStr.append() to add characters or strings.
Here's an example:
std::string myStr = "abcd";
myStr.append(1,'e');
myStr.append(std::string("fghi"));
//now myStr is "abcdefghi"
I am processing a really large text file in the following way:
class Loader{
    template<class READER>
    bool loadFile(READER &reader){
        /* for each line of the input file */ {
            processLine_(line);
        }
    }

    bool processLine_(std::string_view line){
        std::vector<std::string> set; // <-- here
        std::string buffer;           // <-- here
        // I can not do set.reserve(),
        // because I have no idea how many items I will put in.
        // do something...
    }

    void printResult(){
        // print aggregated result
    }
};
The processing of 143,000,000 records takes around 68 minutes.
So I decided to try some very tricky optimizations with several std::array buffers. The result was about 62 minutes.
However, the code became very unreadable, so I decided not to use them in production.
Then I decided to do a partial optimization, e.g. moving the buffers to class level:
class Loader{
    template<class READER>
    bool loadFile(READER &reader);

    std::vector<std::string> set; // <-- here
    std::string buffer;           // <-- here

    bool processLine_(std::string_view line){
        set.clear();
        // do something...
    }

    void printResult();
};
I was hoping this would reduce the malloc / free (new[] / delete[]) operations from buffer and from the set vector. I realize the strings inside the set vector still allocate memory dynamically.
However, the result went up to 83 minutes.
Note I did not change anything except moving set and buffer to class level. I use them only inside the processLine_ method.
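For context, the reuse I was hoping for relies on clear() keeping the allocated capacity, so later push_back calls should not reallocate. A minimal sketch of that assumption:

#include <cassert>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> set;
    set.reserve(100);
    const std::size_t cap = set.capacity();

    set.clear();                   // size drops to 0...
    assert(set.capacity() == cap); // ...but the capacity is retained
}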
Why is that?
Locality of reference?
The only explanation I can think of is that some strings are small enough to fit in the SSO (small string optimization), but that sounds unlikely.
I am using clang with -O3.
I did some profiling and found that most of the time is spent in a third-party C library.
I had supposed this library to be very fast, but that was not the case.
I am still puzzled by the slowdown, but even if I optimize that library, it won't make such a big difference.
I have an error in my code with strcpy.
I declared name, company, type, color and state as std::string.
void SaveList(void)
{
    ofstream pFile;
    pFile.open("Car.dat", ios::binary);
    if (pFile == NULL)
    {
        cout << "Cannot Open File \n updated Data not saved into file\n\n";
        exit(0);
    }
    struct Car_F NF;   // structure variable to hold data for file
    struct Car *CURR;
    for (CURR = HEAD; CURR != HEAD; CURR = CURR->forw)
    {   // copy record from linked list into file record structure
        NF.ID = CURR->ID;
        strcpy(NF.name, CURR->name);
        strcpy(NF.Company_of_car, CURR->Company_of_car);
        strcpy(NF.type_of_car, CURR->type_of_car);
        strcpy(NF.color_of_car, CURR->color_of_car);
        NF.model_of_car = CURR->model_of_car;
        NF.price_of_car = CURR->price_of_car;
        strcpy(NF.state_of_car, CURR->state_of_car);
        pFile.write((char*) &NF, sizeof(NF));   // write record into file
    }
    pFile.close();
}
Can somebody help me out, please? I would be really happy.
The strcpy function is a C function that operates on char* character arrays (which is what C uses as strings). See: http://www.cplusplus.com/reference/cstring/strcpy/
C++ string objects are completely different, and remove the need for calling C string functions. See here for a list of C++ string member functions: http://www.cplusplus.com/reference/string/string/
The operator= string assignment is what you're after. As a first step in fixing this, I would replace strcpy(NF.name,CURR->name); with NF.name = CURR->name; (and so on). C++ objects make it look a lot simpler and more natural, through the use of the operator function syntax.
Also, why are you copying the record completely before writing it out? Is that necessary? There are a number of other things that are not ideal with the code, such as casting the struct to a char* to write it out; that raw memory dump breaks as soon as the members are std::string. I would suggest studying some example code for iostream to see how it is used to serialise (write out and read in) objects.
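For example, a minimal sketch of such iostream serialisation. WriteCar is a hypothetical helper; the field names come from your struct, and I am assuming the members are std::string or numeric types:

#include <fstream>

// Text-based writer: one field per line, so std::string members are
// written correctly (unlike the raw memory dump above).
void WriteCar(std::ofstream &out, const Car &c) {
    out << c.ID << '\n'
        << c.name << '\n'
        << c.Company_of_car << '\n'
        << c.type_of_car << '\n'
        << c.color_of_car << '\n'
        << c.model_of_car << '\n'
        << c.price_of_car << '\n'
        << c.state_of_car << '\n';
}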
I've created an fstream object to write info to files.
I write strings to the new file like
fstreamObject << "New message.\n";
because I want each << to print a string to the next line.
I want to be able to set a property and make a call like
fstreamObject << "New message.";
which will write the string to the next line.
Are there flags/settings for fstream objects that allows this to be done?
I've seen the different file modes (e.g. ofstream::in, ofstream::out, etc.), but I couldn't find one that automatically writes to a new line. Also, I'm not looking to write my own solution; I want to use a built-in feature.
No, there are no readily configurable capabilities of that sort within the standard streams.
You may have to subclass the stream type and fiddle with operator<< to get this to work the way you want, or do it with a helper function of some description:
fstreamObject << nl("New message.");
(but that's hardly easier than just having the \n in there, for a string anyway).
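For completeness, such a helper could be as small as this (nl is a made-up name, not a standard facility):

#include <string>

// Hypothetical helper: returns the text with a trailing newline,
// so each wrapped insertion lands on its own line.
std::string nl(std::string s) {
    return s += '\n';
}

// usage: fstreamObject << nl("New message.");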
It depends on what you mean by "setting the stream". If we consider this fairly broadly, then the answer happens to be "yes"!
Here is how:
1. Create a stream buffer which inserts a newline every time it is flushed, i.e., when sync() is called. Otherwise it just forwards characters.
2. Replace the file stream's stream buffer with this filtering stream buffer, which forwards to the file stream's original stream buffer.
3. Set the flag std::ios_base::unitbuf, which causes a flush after every [properly written] output operation.
Here is the example code to do just that:
#include <iostream>

class newlinebuf
    : public std::streambuf {
    std::ostream*   stream;
    std::streambuf* sbuf;

    int overflow(int c) { return this->sbuf->sputc(c); }
    int sync() {
        return (this->sbuf->sputc('\n') == std::char_traits<char>::eof()
                || this->sbuf->pubsync() == -1) ? -1 : 0;
    }

public:
    newlinebuf(std::ostream& stream)
        : stream(&stream)
        , sbuf(stream.rdbuf(this)) {
        stream << std::unitbuf;
    }
    ~newlinebuf() { this->stream->rdbuf(this->sbuf); }
};

int main() {
    newlinebuf sbuf(std::cout);
    std::cout << "hello" << "world";
}
Although this approach works, I would recommend against using it! One problem is that all composite output operators, i.e., those using multiple output operators to do their work, will cause multiple newlines. I'm not aware of anything that can be done to prevent this behavior. There isn't anything in the standard library which lets you simply configure the stream to do this: you'll need to insert the newline somehow.
No, the C++ streams do not allow that.
There is no way to decide where one insertion stops and the next starts.
For example for custom types, their stream-inserters are often implemented as calls to other stream-inserters and member-functions.
The only thing you can do is write your own class, which delegates to a stream of your choosing and appends the newline itself.
That's of strictly limited utility though.
#include <ostream>
#include <utility>

struct alwaysenter {
    std::ostream& o;
    template<class X> alwaysenter& operator<<(X&& x) {
        o << std::forward<X>(x) << '\n'; // append a newline after every insertion
        return *this;
    }
};
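Usage would then look like this; every insertion ends up on its own line:

#include <iostream>

int main() {
    alwaysenter ae{std::cout};
    ae << "New message." << "Another message."; // two lines of output
}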
This is a follow-up question from here: C++ - Developing own version of std::count_if?
I have the following function:
// vector for storing the file names that contains sound
std::vector<std::string> FilesContainingSound;
void ContainsSound(const std::unique_ptr<Signal>& s)
{
    // Open the Wav file
    Wav waveFile = Wav("Samples/" + s->filename_);

    // Copy the signal that contains the sufficient energy
    std::copy_if(waveFile.Signal().begin(), waveFile.Signal().end(),
                 FilesContainingSound.begin(), [] (const Signal& s) {
                     // If the energy bin > threshold then store the
                     // file name inside FilesContaining
                 });
}
But it seems to me that I only need to capture the "filename" string inside the lambda expression, because that's all I'll be working with. I just need access to waveFile.Signal() in order to do the analysis.
Anyone have any suggestions?
EDIT:
std::vector<std::string> FilesContainingSound;

std::copy_if(w.Signal().begin(), w.Signal().end(),
             FilesContainingSound.begin(), [&] (const std::unique_ptr<Signal>& file) {
                 // If the energy bin > threshold then store the
                 // file name inside FilesContaining
             });
You seem to be getting different levels of abstraction confused here. If you're going to work with file names, then you basically want something on this order:
std::vector<std::string> input_files;
std::vector<std::string> files_that_contain_sound;

bool file_contains_sound(std::string const &filename) {
    Wav waveFile = Wav("Samples/" + filename);
    return binned_energy_greater(waveFile, threshold);
}

std::copy_if(input_files.begin(), input_files.end(),
             std::back_inserter(files_that_contain_sound),
             file_contains_sound);
For the moment I've put the file_contains_sound in a separate function simply to make its type clear -- since you're dealing with file names, it must take a file name as a string, and return a bool indicating whether that file name is one of the group you want in your result set.
In reality, you almost never want to implement that as an actual function though: you usually want an object of some class that overloads operator() (and a lambda is an easy way to generate such a class). The type involved must remain the same though: it still needs to take a file name (string) as a parameter, and return a bool to indicate whether that file name is one you want in your result set. Everything dealing with what's inside the file will happen inside of that function (or something it calls).
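If you prefer to write the predicate inline, the equivalent lambda would look something like this (Wav, binned_energy_greater and threshold as above):

std::copy_if(input_files.begin(), input_files.end(),
             std::back_inserter(files_that_contain_sound),
             [threshold](std::string const &filename) {
                 Wav waveFile = Wav("Samples/" + filename);
                 return binned_energy_greater(waveFile, threshold);
             });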
I am trying to parse the header of a SIP packet. SIP is a text-based protocol, similar to HTTP.
The fields in the header do not have an order.
For example: if there are 3 fields, f1, f2 and f3, they can come in any order, any number of times, say f3, f2, f1, f1.
This increases the complexity of my parser since I don't know which will come first.
What should I do to overcome this complexity?
Ultimately, you simply need to decouple your processing from the order of receipt. To do that, have a loop that repeats while fields are encountered; inside the loop, determine which field type it is, then dispatch to the processing for that field type. If you can process the fields immediately, great; but if you need to save the potentially multiple values given for a field type, you might, for example, put them into a vector, or even a shared multimap keyed on the field name or id.
Pseudo-code:
Field x;
while (x = get_next_field(input))
{
    switch (x.type())
    {
        case Type1: field1_values.push_back(x.value()); break;
        case Type2: field2 = x.value(); break; // just keep the last value seen...
        default:    throw std::runtime_error("unsupported field type");
    }
}
// use the field1_values / field2 etc. variables....
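If you do need to keep every occurrence of every field, a multimap keyed on the field name is one option. A hedged sketch, with on_field and use_f1 as made-up helper names:

#include <map>
#include <string>

std::multimap<std::string, std::string> fields;

void on_field(std::string const& name, std::string const& value) {
    fields.emplace(name, value); // duplicates (e.g. a repeated f1) are all kept
}

void use_f1() {
    // visit every value recorded for "f1"; equivalent keys keep insertion order
    auto range = fields.equal_range("f1");
    for (auto it = range.first; it != range.second; ++it) {
        // use it->second
    }
}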
Tony has already given the main idea; I'll get more specific.
The basic idea in parsing is that it is generally separated into several phases. In your case you need to separate the lexing part (extracting the tokens) from the semantic part (acting on them).
You can proceed in different fashions; since I prefer a structured approach, let us suppose that we have a simple struct representing the header:
struct SipHeader {
    int field1;
    std::string field2;
    std::vector<int> field3;
};
Now we create a function that takes a field name and its value, and fills the corresponding field of the SipHeader structure appropriately.
void parseField(std::string const& name, std::string const& value, SipHeader& sh) {
    if (name == "Field1") {
        sh.field1 = std::stoi(value);
        return;
    }
    if (name == "Field2") {
        sh.field2 = value;
        return;
    }
    if (name == "Field3") {
        // ...
        return;
    }
    throw std::runtime_error("Unknown field");
}
Then you iterate over the lines of the header and, for each line, separate the name from the value and call this function.
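For illustration, that driver loop could look like this. It's a sketch: parseHeader is a made-up name, and the "Name: value" line format with getline-based splitting is an assumption about the header:

#include <istream>
#include <stdexcept>
#include <string>

SipHeader parseHeader(std::istream& in) {
    SipHeader sh{};
    std::string line;
    while (std::getline(in, line) && !line.empty()) {
        std::size_t colon = line.find(':');
        if (colon == std::string::npos) {
            throw std::runtime_error("malformed header line");
        }
        std::string name  = line.substr(0, colon);
        std::string value = line.substr(colon + 1); // leading space left untrimmed for brevity
        parseField(name, value, sh);
    }
    return sh;
}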
There are obviously refinements:
- instead of an if-chain you can use a map of functors (see the sketch after the closing advice below), or you can fully tokenize the source and store the fields in a std::map<std::string, std::string>
- you can use a state-machine technique to act on the fields immediately, without copying
but the essential advice is the same:
To manage complexity you need to separate the task into orthogonal subtasks.
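As a sketch of the map-of-functors refinement, reusing SipHeader and the field names from the if-chain version (FieldSetter is an illustrative name):

#include <functional>
#include <map>
#include <stdexcept>
#include <string>

using FieldSetter = std::function<void(std::string const&, SipHeader&)>;

std::map<std::string, FieldSetter> const setters = {
    {"Field1", [](std::string const& v, SipHeader& sh) { sh.field1 = std::stoi(v); }},
    {"Field2", [](std::string const& v, SipHeader& sh) { sh.field2 = v; }},
    // "Field3" would push_back parsed values onto sh.field3, etc.
};

void parseField(std::string const& name, std::string const& value, SipHeader& sh) {
    auto it = setters.find(name);
    if (it == setters.end()) {
        throw std::runtime_error("Unknown field");
    }
    it->second(value, sh); // dispatch without an if-chain
}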