How to optimize parse data flow algorithm?

I need to implement a library in C++ that parses a client-server conversation in some abstract protocol. I don't have a file containing the whole conversation; I have to parse it on the fly. I have to implement the following interface:
class parsing_class
{
public:
void on_data( const char* data, size_t len );
//other functions
private:
size_t pos_;// current position in the data flow
bool first_part_parsed_;
bool second_part_parsed_;
//... some more bool markers or something like vector< bool >
};
The data is passed to my class through the on_data function. The chunk length varies from one call to the next. I know the protocol's packet format and how the conversation should be organized, so I can judge from the current pos_ whether I have enough data to parse the Nth part.
My current implementation looks like this:
void parsing_class::on_data( const char* data, size_t len )
{
pos_ += len;
if( pos_ > FIRST_PART_SIZE and !first_part_parsed_ )
parse_first_part( data, len );
if( pos_ > SECOND_PART_SIZE and !second_part_parsed_ )
parse_second_part( data, len );
//and so on..
}
What I want is some tips on how to optimize this algorithm, perhaps by avoiding these numerous ifs (on_data may be called a great many times, and each time it has to go through all the checks).

You don't need all those bools and pos_; they seem to only record how much of the conversation has passed so that you can continue with the next part.
How about the following: write a parse function for each part of the conversation
bool parse_part_one(const char *data) {
... // parse the data
next_fun = parse_part_two;
return true;
}
bool parse_part_two(const char *data) {
... // parse the data
next_fun = parse_part_three;
return true;
}
...
and in your class you add a pointer to the current parse function, starting at one. Now, in on_data all you do is to call the next parse function
bool success = next_fun(data);
Because each function sets the pointer to the next parse function, the next call of on_data will invoke the next parse function automatically. No tests required of where in the conversation you are.
If the value of len is critical (which I assume it would be) then pass that along as well and return false to indicate that the part could not be parsed (don't update next_fun in that case either).
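The idea above can be sketched as follows. This is only a minimal illustration under assumptions that are not in the question: the part sizes are made-up constants, each part has a fixed length, and incoming bytes are buffered in a std::string so that a part split across several on_data calls is handled. Member function pointers stand in for the free next_fun pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical part sizes; a real protocol would define its own.
constexpr std::size_t FIRST_PART_SIZE = 4;
constexpr std::size_t SECOND_PART_SIZE = 6;

class parsing_class {
public:
    void on_data(const char* data, std::size_t len) {
        assert(data != nullptr || len == 0);
        if (len == 0) return;
        buffer_.append(data, len);
        // Run the current handler until it needs more data or the conversation ends.
        while (next_ != nullptr && (this->*next_)()) {}
    }
    const std::vector<std::string>& parts() const { return parts_; }

private:
    using handler = bool (parsing_class::*)();

    // Consumes n buffered bytes if available and selects the following handler.
    bool take(std::size_t n, handler following) {
        if (buffer_.size() < n) return false;  // not enough data yet; state unchanged
        parts_.push_back(buffer_.substr(0, n));
        buffer_.erase(0, n);
        next_ = following;
        return true;
    }
    bool parse_part_one() { return take(FIRST_PART_SIZE, &parsing_class::parse_part_two); }
    bool parse_part_two() { return take(SECOND_PART_SIZE, nullptr); }

    std::string buffer_;              // bytes received but not yet parsed
    std::vector<std::string> parts_;  // parsed parts, kept only for demonstration
    handler next_ = &parsing_class::parse_part_one;
};
```

Each call to on_data performs a single indirect call (plus one per part that completes) instead of walking through every flag, and a handler that returns false leaves the state untouched until more data arrives.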

Related

Processing ASCII commands in a more robust and type safe way

I have a module which receives ASCII commands and then reacts to them accordingly. I am wondering if it is possible, to have a more robust and typesafe way of calling handler functions.
In the past, I had code like the following, which is also very similar to this answer: Processing ASCII commands via RS232 in embedded c
struct Command commands[] = {
{"command1", command1Handler},
{"command2", command2Handler},
...
};
//gets called when a new string has been received
void parseCmd(const char *input) {
//find the fitting command entry and call function pointer
}
bool command1Handler(const char *input) { }
bool command2Handler(const char *input) { }
I don't like that all handler functions have to do their own parsing. This seems needlessly repetitive and error prone.
It would be cool if we could instead have it the following way, where all parsing is done in the parseCmd function:
struct Command commands[] = {
{"command1", command1HandlerSafe},
{"command2", command2HandlerSafe},
...
};
void parseCmd(const char *input) {
//1. find fitting command entry
//2. check that parameter number fits the expected number for the target function
//3. parse parameters and validate the types
//4. call function with parameters in their correct types
}
bool command1HandlerSafe(bool param1, const char *param2) { }
bool command2HandlerSafe(int param1) {}
I think with old C-style varargs it would be possible to do the parsing in a central function, but that would not bring type safety.
Edit:
Meanwhile I came up with the following solution, which I thought somewhat balances the hackiness and modularization:
class ParameterSet{
struct Param{
const char *paramString;
bool isInt();
int toInt();
float toFloat();
..
};
ParameterSet(const char *input);
Param at(size_t index);
size_t length();
char m_buffer[100];
Param m_params[10];
};
bool command1HandlerMoreSafe(const ParameterSet *paramSet);

Building an abstraction layer around this might make things more complex and thereby bug prone. I wouldn't do that unless the amount of commands you are supposed to handle is vast, needs to be maintained, and this is one of the main tasks of your application.
With the pre-requisites to keep type safe and keep parsing separate from algorithms, you could build something similar to the following C-like pseudo code:
typedef enum
{
INT,
STR
} type_t; // for marking which type that is supported by the command
typedef struct
{
type_t type;
const char* text; // what kind of text that is expected in case of strings
} arg_t;
typedef struct
{
const char* name; // the name of the command
arg_t* args; // list of allowed arguments
size_t args_n; // size of that list
void (*callback)(void); // placeholder for callback functions of different types
} command_t;
You can then make callback handler functions that aren't concerned about parsing, but only about their dedicated task:
void cmd_branch (const char* str);
void cmd_kill (int n);
The array of commands might look something like this:
const command_t commands[] =
{
{ // BRANCH [FAST][SLOW]
.name="BRANCH",
.args=(arg_t[]){ {STR,"FAST"}, {STR,"SLOW"} },
.args_n=2,
.callback= (void(*)(void)) cmd_branch,
},
{ // KILL [7][9]
.name="KILL",
.args=(arg_t[]){ {INT, NULL} },
.args_n=1,
.callback= (void(*)(void)) cmd_kill,
}
};
The parse function will then do:
1. Find which command was received by searching the above list (bsearch if the list is large).
2. Check what type of arguments the received command supports.
3. Parse the arguments accordingly.
4. Call the relevant function with arguments of the appropriate type.
Since this example just uses a dummy function pointer type (void(*)(void)), you'll have to cast the callback to its correct type before calling it. This can be done with, for example, C11 _Generic:
call(commands[i], int_val);
where call is a macro defined as:
#define call(command, arg) _Generic((arg), \
int: (void(*)(int)) command.callback, \
const char*: (void(*)(const char*)) command.callback )(arg)

One way to keep the command handling interfaces the same is to fall back on the venerable argv / argc interface that main() receives. Assuming the received commands have some notion of words (perhaps whitespace separated), it could go like this:
1. Receive the input string.
2. Parse the input into words, where the first word is the name of the command and the remaining words are its arguments.
3. As the parsing proceeds, place a pointer to the string that contains each word in an array and keep count of the number of elements in the array.
4. Using the first word, look up a command function pointer. You can use something like bsearch() if the commands are all known at compile time. A hash table might also be appropriate. However you implement the mapping, the result is a pointer to a function that takes an array of pointers to the arguments and a count of the number of elements in the pointer array.
5. Invoke the command function via its pointer and pass the array of parsed words and the count, just like main() is invoked by startup code.
From there, each command function can deal with what its arguments specifically mean, converting strings representations to internal forms as necessary.
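As a rough illustration of the argv / argc approach, here is a small C++ sketch. The command names KILL and BRANCH are borrowed from the other answer; the handlers, the global variables they write to, and the use of std::map for the lookup are assumptions made for the example, not part of the original design. Note that std::atoi performs no validation, so a real handler would want stricter conversion.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Every handler shares the same main()-like signature.
using command_fn = bool (*)(int argc, const char* const* argv);

// Hypothetical state written by the handlers, for illustration only.
int g_killed = 0;
std::string g_branch;

bool cmd_kill(int argc, const char* const* argv) {
    assert(argc >= 1 && argv != nullptr);  // argv[0] is always the command name
    if (argc != 2) return false;           // expects exactly one argument
    g_killed = std::atoi(argv[1]);         // no validation; a sketch only
    return true;
}

bool cmd_branch(int argc, const char* const* argv) {
    assert(argc >= 1 && argv != nullptr);
    if (argc != 2) return false;
    g_branch = argv[1];
    return true;
}

// Splits the input on whitespace and dispatches on the first word.
bool parse_cmd(const std::string& input) {
    static const std::map<std::string, command_fn> commands = {
        {"KILL", cmd_kill},
        {"BRANCH", cmd_branch},
    };
    std::istringstream in(input);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    if (words.empty()) return false;

    const auto it = commands.find(words[0]);
    if (it == commands.end()) return false;  // unknown command

    // Build the pointer array that plays the role of argv.
    std::vector<const char*> argv;
    for (const std::string& w : words) argv.push_back(w.c_str());
    return it->second(static_cast<int>(argv.size()), argv.data());
}
```

With this layout, parse_cmd("KILL 7") reaches cmd_kill with argc equal to 2, and an unknown or empty command is rejected before any handler runs.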

Size of encoded avro message without encoding it

Is there a way to get the size of the encoded avro message without actually encoding it?
I'm using Avro 1.8.1 for C++.
I'm used to Google Protocol Buffers, where you can call ByteSize() on a message to get the encoded size, so I'm looking for something similar.
Since the message is in essence a raw struct, I understand that the size cannot be retrieved from the message itself, but perhaps there is a helper method that I'm not aware of?

There is no way around it unfortunately...
Here is an example showing how the size can be calculated by encoding the object:
MyAvroStruct obj;
avro::EncoderPtr encoder = avro::binaryEncoder();
std::auto_ptr<avro::OutputStream> out = avro::memoryOutputStream(1);
encoder->init(*out);
avro::encode(*encoder, obj);
out->flush();
uint32_t bufferSize = out->byteCount();

(Edit below shows a hacky way to shrink-to-fit an OutputStream after writing to it with a BinaryEncoder)
It's a shame that avro::encode() doesn't use backup on the OutputStream to free unused memory after encoding. Martin G's answer gives the best solution using only the tools avro provides, but it issues N memory allocations of 1 byte each if your serialized object is N bytes in size.
You could implement a custom avro::OutputStream that simply counts and discards all written bytes. This would get rid of the memory allocations. It's still not a great approach, as the actual encoder will have to "ask" for every single byte:
(Code untested, just for demonstration purposes)
#include <avro/Encoder.hh>
#include <cstdint>
class ByteCountOutputStream : public avro::OutputStream {
public:
size_t byteCount_ = 0;
uint8_t dummyWriteLocation_;
explicit ByteCountOutputStream() {};
bool next(uint8_t **data, size_t *len) final {
byteCount_ += 1;
*data = &dummyWriteLocation_;
*len = 1;
return true;
}
void backup(size_t len) final {
byteCount_ -= len;
}
uint64_t byteCount() const final {
return byteCount_;
}
void flush() final {}
};
this could then be used as:
MyAvroStruct obj;
avro::EncoderPtr encoder = avro::binaryEncoder();
ByteCountOutputStream out; // no parentheses: "out()" would declare a function
encoder->init(out);
avro::encode(*encoder, obj);
size_t bufferSize = out.byteCount();
Edit:
My initial question when stumbling upon this was: How can I tell how many bytes of the OutputStream are required (for storing / transmitting)? Or, equivalently, if OutputStream.byteCount() returns the count of bytes allocated by the encoder so far, how can I make the encoder "backup" / release the bytes it didn't use? Well, there is a hacky way:
The Encoder abstract class provides a init method. For the BinaryEncoder, this is currently implemented as:
void BinaryEncoder::init(OutputStream &os) {
out_.reset(os);
}
with out_ being the internal StreamWriter of the Encoder.
Now, the StreamWriter implements reset as:
void reset(OutputStream &os) {
if (out_ != nullptr && end_ != next_) {
out_->backup(end_ - next_);
}
out_ = &os;
next_ = end_;
}
which will return unused memory back to the "old" OutputStream before switching to the new one.
So, you can abuse the encoder's init method like this:
// setup as always
MyAvroStruct obj;
avro::EncoderPtr encoder = avro::binaryEncoder();
std::auto_ptr<avro::OutputStream> out = avro::memoryOutputStream();
// actual serialization
encoder->init(*out);
avro::encode(*encoder, obj);
// re-init on the same OutputStream. Happens to shrink the stream to fit
encoder->init(*out);
size_t bufferSize = out->byteCount();
However, this behavior is not documented, so it might break in the future.

most efficient way to convert char vector to string

I have these large pcap files of market tick data. On average they are 20gb each. The files are divided into packets. Packets are divided into a header and messages. Messages are divided into a header and fields. Fields are divided into a field code and field value.
I am reading the file a character at a time. I have a file reader class that reads the characters and passes the characters by const ref to 4 call back functions, on_packet_delimiter, on_header_char, on_message_delimiter, on_message_char. The message object uses a similar function to construct its fields.
Up to here I've noticed little loss of efficiency as compared to just reading the chars and not doing anything with them.
The part of my code where I process the message header and extract the instrument symbol of the message slows the process down considerably.
void message::add_char(const char& c)
{
if (!message_header_complete) {
if (is_first_char) {
is_first_char = false;
if (is_lower_case(c)) {
first_prefix = c;
} else {
symbol_vector.push_back(c);
}
} else if (is_field_delimiter(c)) {
on_message_header_complete();
on_field_delimiter(c);
} else {
symbol_vector.push_back(c);
}
} else {
// header complete, collect field information
if (is_field_delimiter(c)) {
on_field_delimiter(c);
} else {
fp->add_char(c);
}
}
}
...
void message::on_message_header_complete()
{
message_header_complete = true;
symbol.assign(symbol_vector.begin(),symbol_vector.end());
}
...
In add_char() I am feeding the chars to symbol_vector. Once the header is complete, on_message_header_complete() converts them to a string using the vector's iterators. Is this the most efficient way to do this?

In addition to The Quantum Physicist's answer: std::string should behave much like vector does. Even the 'reserve' function is available in the string class, if you intend to use it for efficiency.
Adding the characters is just as easy as it can get:
std::string s;
char c = 's';
s += c;
You could add the characters directly to your member, and you are fine. But if you want to keep your member clean until the whole string is collected, you still should use a std::string object instead of the vector. You then add the characters to the temporary string and upon completion, you can swap the contents then. No copying, just pointer exchange (and some additional data such as capacity and size...).
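A minimal sketch of the temporary-string-plus-swap idea follows. The class name message_header, its member names, and the reserve size of 16 are hypothetical choices for the example, not taken from the question's code.

```cpp
#include <cassert>
#include <string>

class message_header {
public:
    // 16 is an assumed typical symbol length, used only to pre-allocate.
    message_header() { pending_.reserve(16); }

    // Appends one character to the temporary string.
    void add_char(char c) { pending_ += c; }

    // Publishes the collected characters by swapping buffers instead of copying.
    void complete() {
        symbol_.swap(pending_);
        pending_.clear();  // discard the previous symbol's characters
        assert(pending_.empty());
    }

    const std::string& symbol() const { return symbol_; }

private:
    std::string pending_;  // characters collected so far
    std::string symbol_;   // stays untouched until complete() is called
};
```

Because clear() normally keeps the allocated capacity, the buffer swapped back into pending_ can be reused for the next symbol without a fresh allocation, although the standard does not strictly guarantee this.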

How about:
std::string myStr(myVec.begin(), myVec.end());
Although this works, I don't understand why you need to use vectors in the first place. Just use std::string from the beginning, and use myStr.append() to add characters or strings.
Here's an example:
std::string myStr = "abcd";
myStr.append(1,'e');
myStr.append(std::string("fghi"));
//now myStr is "abcdefghi"

What's a better way to store information than by using static ints? C++

I'm keeping track of a player's "job" by setting his job to a number, and incrementing it by one if he changes job, and determining which job he currently is by whether the number is even or odd. (Only two jobs right now). However, I know there are better ways of doing this, and soon I'll need to implement for a third and fourth job, so I cannot keep using the even/odd check.
Here's my code for reference: (Please note that I only include relevant code)
GameModeState.cpp
// If changeJob's parameter number is 1, it increments the job. If number is 2, it only returns the current job
int GameModeState::changeJob(int number)
{
// Default job is odd (landman)
static int job = 1;
if (number == 1)
{
job = (job+1);
return job;
}
else
{
return job;
}
}
int GameModeState::getJob()
{
int currentJob = (changeJob(2));
return currentJob;
}
// If the player opens the "stat sheet", it changes their job
void GameModeState::_statSheet(const String& message, const Awesomium::JSValue& input, Awesomium::JSValue& output)
{
changeJob(1);
}
GameModeState.h
class GameModeState : public GameState::State
{
public:
/// Changes the player's job if number is 1, or returns current job if number is 2
static int changeJob(int number);
/// Returns the current job number by calling changeJob appropriately
static int getJob();
private:
// Opening the player sheet will change the player's job
void _statSheet(const String& message, const Awesomium::JSValue& input, Awesomium::JSValue& output);
};
ZoneMovementState.cpp (This is where I check for current job)
#include "GameModeState.h"
#include <EnergyGraphics/ZoneParser.h>
void ZoneMovementState::_changeZone(const String& message, const Awesomium::JSValue& input, Awesomium::JSValue& output)
{
// If the number from getJob is even, the player is currently a geologist
if (GameModeState::getJob()%2 == 0)
{
ZoneParser::getSingleton().load("../media/zones/geology_zone.xml", false);
}
else //otherwise they are a landman
{
ZoneParser::getSingleton().load("../media/zones/landman_zone.xml", false);
}
transitionHandler->go();
}
I'm thinking either arrays or enums of the jobs will be the better way to deal with this, but I'm not sure how to implement this into my code. If you know a better way, please include examples or at least a point in the right direction. I will greatly appreciate it!

Don't use static local variables to store state like that inside a class. Use a member variable instead.
IMO the easiest way to do something like that and keep it extensible is to use an enum:
enum PlayerJob
{
JOB_NONE = 0,
JOB_GEOLOGIST,
JOB_LANDMAN,
...
NUM_JOBS // this element is optional but can be useful for range checking.
};
...
PlayerJob job = JOB_NONE;
...
switch(job)
{
case JOB_NONE:
break;
case JOB_GEOLOGIST:
...
break;
...
default:
error("Unhandled player job: %d", job);
break;
}
Also I'd think about somehow organizing such "job relevant" stuff into some kind of array or list or whatever to make it easier to call "job specific" things:
std::map<PlayerJob,std::string> jobzones;
jobzones[JOB_GEOLOGIST] = "geozone.xml";
...
transitToZone(jobzones[job]);
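Putting the enum, the member variable and the lookup table together might look like the sketch below. The Player class, nextJob() and zoneFor() are hypothetical names introduced for illustration; the zone file paths are the ones from the question.

```cpp
#include <cassert>
#include <map>
#include <string>

enum PlayerJob { JOB_NONE = 0, JOB_GEOLOGIST, JOB_LANDMAN, NUM_JOBS };

class Player {
public:
    PlayerJob job() const { return job_; }

    // Cycles through the real jobs, skipping JOB_NONE.
    void nextJob() {
        int next = static_cast<int>(job_) + 1;
        if (next >= NUM_JOBS) next = JOB_GEOLOGIST;
        assert(next > JOB_NONE && next < NUM_JOBS);  // range check via NUM_JOBS
        job_ = static_cast<PlayerJob>(next);
    }

private:
    PlayerJob job_ = JOB_LANDMAN;  // landman is the default, as in the question
};

// Returns the zone file for a job, or an empty string if none is registered.
std::string zoneFor(PlayerJob job) {
    static const std::map<PlayerJob, std::string> zones = {
        {JOB_GEOLOGIST, "../media/zones/geology_zone.xml"},
        {JOB_LANDMAN, "../media/zones/landman_zone.xml"},
    };
    const auto it = zones.find(job);
    return it == zones.end() ? std::string() : it->second;
}
```

Adding a third or fourth job then means adding one enumerator before NUM_JOBS and one entry in the table; the even/odd check disappears entirely.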

Enums are nice, you may also think about using a std::stack or something similar for the GameState, so that you can push/pop etc.

You may want to look at the State pattern.

Streaming data for state machine playback

I have a state machine design that needs to support playback. We have states that perform actions and sometimes need to generate random numbers. In case the program shuts down while in the middle of the FSM's execution, the program needs to playback the whole FSM using the same random numbers as before.
For a basic example, let's say I had three states: A, B, and C. The FSM will call a state's Execute() function. At the end of the function, the state will post an event, and the FSM will determine which state to go to next. In state A, it will call rand(). If the number is even, it will post an event to go to state B, otherwise state C should be the next state.
void StateA::Execute(IEventQueue& rQueue)
{
int num = rand();
if( num % 2 == 0 )
{
rQueue.PostEvent("GoToStateB");
}
else
{
rQueue.PostEvent("GoToStateC");
}
}
If the random number is 69, then it should go to state C. While in state C, it's possible that the program might quit. When the program starts up again, it should playback the state machine. Obviously, for this to work correctly, it can't generate a completely new random number, it needs to use 69 again for accurate playback.
I have a file stream interface that I can use for saving data to a file, but the code is a little ugly:
void StateA::Execute(IEventQueue& rQueue, IFileStream& rStream)
{
int num = 0;
// fails if there's no more data to read
bool bSuccess = rStream.ReadInt(num);
if (!bSuccess)
{
num = rand();
rStream.WriteInt(num);
}
// same code as before
}
My only problem with this solution is that I don't care for having to check the stream for data first and then conditionally write to the same stream.
I thought about hiding it like this:
void StateA::Execute(IEventQueue& rQueue, IStream& rStream)
{
int num = 0;
num = rand();
rStream & num;
// same code as before
}
Inside IStream, operator& (probably not the best use of overloading) would actually try to read an int from the stream. If that stream was empty, it would then write it instead. Like before, the behavior would be: read first until the end of stream, and then start appending.
So I guess my question is: is there a common idiom for this type of playback that I might be overlooking? Does anyone have alternate suggestions? I feel like I'm starting to over-complicate the design a bit.
Thanks!

Why have the states interact directly with the filestream? Single Responsibility says we should have a class whose job it is to provide the proper number based on some logic.
struct INumberSource {
virtual ~INumberSource() {}
virtual int GenNextNumber() = 0;
};
// My job is to provide numbers from an RNG
struct RNGNumberSource : public INumberSource {
virtual int GenNextNumber() {
return rand();
}
};
// My job is to write any numbers sourced through me to a file
// I delegate to another source to get an actual number
class FileStreamTrackingNumberSource : public INumberSource {
public:
FileStreamTrackingNumberSource(INumberSource& source, IFileStream& stream)
: altSource(source), fileStream(stream) { }
virtual int GenNextNumber() {
int num = altSource.GenNextNumber();
fileStream.WriteInt(num);
return num;
}
private:
INumberSource& altSource; // a reference: INumberSource is abstract
IFileStream& fileStream;
};
// My job is to source numbers from a file stream delegating to an
// alternate source when I run out
class FileStreamNumberSource : public INumberSource {
public:
FileStreamNumberSource(INumberSource& source, IFileStream& stream)
: altSource(source), fileStream(stream), failedRead(false) { }
virtual int GenNextNumber() {
int num = 0;
// failedRead becomes true the first time the stream has no more data
if(failedRead || (failedRead = !fileStream.ReadInt(num))) {
num = altSource.GenNextNumber();
}
return num;
}
private:
INumberSource& altSource;
IFileStream& fileStream;
bool failedRead;
};
So in your case you would provide an IFileStream and RNGNumberSource to a FileStreamTrackingNumberSource and provide that and the same IFileStream to a FileStreamNumberSource. That FileStreamNumberSource is what you would give to your State's INumberSource parameter.
Assuming you only needed the number to choose the next state then your state code could look like this:
void StateA::Execute(IEventQueue& rQueue, INumberSource& numberSource)
{
if( numberSource.GenNextNumber() % 2 == 0 )
{
rQueue.PostEvent("GoToStateB");
}
else
{
rQueue.PostEvent("GoToStateC");
}
}
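To show the wiring end to end, here is a condensed, self-contained sketch. It makes several assumptions: MemoryStream is a hypothetical in-memory stand-in for IFileStream (only ReadInt and WriteInt are modelled), CountingSource replaces rand() so the run is reproducible, and the class names are shortened versions of the ones above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory stand-in for the question's IFileStream.
struct MemoryStream {
    std::vector<int> data;
    std::size_t readPos = 0;
    bool ReadInt(int& out) {
        assert(readPos <= data.size());
        if (readPos == data.size()) return false;  // nothing left to replay
        out = data[readPos++];
        return true;
    }
    void WriteInt(int v) { data.push_back(v); }
};

struct INumberSource {
    virtual ~INumberSource() = default;
    virtual int GenNextNumber() = 0;
};

// Deterministic stand-in for rand(), so the example is reproducible.
struct CountingSource : INumberSource {
    int next = 69;
    int GenNextNumber() override { return next++; }
};

// Records every number it hands out.
struct TrackingSource : INumberSource {
    TrackingSource(INumberSource& s, MemoryStream& f) : source(s), stream(f) {}
    int GenNextNumber() override {
        const int num = source.GenNextNumber();
        stream.WriteInt(num);
        return num;
    }
    INumberSource& source;
    MemoryStream& stream;
};

// Replays recorded numbers first, then falls back to the alternate source.
struct ReplaySource : INumberSource {
    ReplaySource(INumberSource& s, MemoryStream& f) : source(s), stream(f) {}
    int GenNextNumber() override {
        int num = 0;
        if (!exhausted && stream.ReadInt(num)) return num;
        exhausted = true;  // never read again once the recording has run out
        return source.GenNextNumber();
    }
    INumberSource& source;
    MemoryStream& stream;
    bool exhausted = false;
};
```

A state only ever sees an INumberSource&. On a first run the ReplaySource finds the stream empty and every number comes from the tracked generator; after a restart with the stream rewound, the same numbers are replayed before any new ones are generated and recorded.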

I suspect you should have two files: one that records the events you are playing, and the other that you read "re-play" events from. If the re-play file is longer than the "recording" file, then that is the one you use for a re-play.
I also would not use operator overloading as you suggested. Perhaps just use a ternary operator.

I'm not sure I understand the rationale behind "playback", but can't you simply wrap the whole "random-number or read-from-file" logic behind a class or function?
UPDATE
On the subject of "playback" and your design in general, I'm not sure it's normal for a FSM to generate its own stimulus (i.e. the random numbers which in turn trigger state transitions). Normally, the stimulus is provided externally. If you re-factor with this in mind, then you no longer have this messy problem!
