I have a set of files with binary data. Each file is composed of blocks and each block has a header and then a set of events. Each event has a header and then a sequence of fields. My problem is with the sequence of fields.
These fields contain different lengths of ordered/structured data, but the fields do not come in any particular order. For example, in one event I might have 3 fields looking as follows:
Event Header (12 bytes, always, made of things like number of fields, size, etc)
Field Header (2 bytes, always, field type in the top 4 bits, size in the bottom 12)
Field Data (VDC data: signals from various wires in a vertical drift chamber)
Field Header ( '' )
Field Data (ADC-LAS data, Signals from various photo multiplier tubes)
Field Header ( '' )
Field Data (FERA data, Signals from a fast encoding readout adc system)
In another event I might have the same fields plus a few more, or a field removed and another added in, etc. It all depends on which pieces of the DAQ hardware had data to be recorded when the readout system triggered.
I have thought about a few possible solutions and honestly, none of them seem palatable to me.
Method 1:
Make an abstract base class Field and then for each field type (there are only 13) inherit from that.
Pros: Reading the data in from the file is easy, simply get the region id, allocate the appropriate type of field, read the data, and store a Field*. Also, this method appeals to my sense of a place for everything and everything in its place.
Cons: When I process the fields in an event to convert the data to the information that the analysis actually uses I am continuously needing to dynamic_cast<>() everything to the derived class. This is a bit tedious and ugly and I remember reading somewhere (a while ago) that if you are having to use dynamic_cast<>() then you are using polymorphism 'wrong'. Also this makes having object pools for the fields tricky as I would need a pool for every subclass of Field. Finally, if more field types are added later then in addition to modifying the processing code, additional subclasses of field need to be created.
Method 2:
Simply have a big array of bytes to store all the field headers and data. Then leave it up to the processing code to extract the structure as well as process the data.
Pros: This means that if data formats change in the future then the only changes that need to occur are in the event processing code. It's a pretty simple solution. It's more flexible.
Cons: Jobs in the processing/reading code are less compartmentalized. Feels less elegant.
I recognize that there is probably not a solution that is going to be 'elegant' in every way, and from the standpoint of KISS I am leaning towards method 2. Should I choose Method 1, Method 2, or is there some Method 3 that I have not thought of?
You are trying to choose between a struct, a tuple, or a MISRA-safe protocol handler.
// Example program
#include <iostream>
#include <string>
#include <tuple>
#include <cstdint>

// start ABI protocol signature
const int EVENT_HEADER_SZ = 12;
const int FIELD_HEADER_SZ = 2;
const int FIELD_DATA_SIZE = 1 << 12; // 4096 (note: 2^12 in C++ is XOR, not exponentiation)
// end ABI protocol

#ifdef _WIN32
#define __NOVT __declspec(novtable)
#else
#define __NOVT
#endif

struct __NOVT Protocole_Header {
    union {
        char pbody[EVENT_HEADER_SZ + 1];
        struct {            // anonymous struct keeps the fields sequential
            uint32_t ptype; // (in a bare union they would all overlap)
            uint32_t psize;
            uint32_t pmisc;
        };
    };
};

struct __NOVT Field_Header {
    union {
        char fbody[FIELD_HEADER_SZ + 1];
        struct {
            unsigned ftype : 4;  // type of data 0..15
            unsigned fsize : 12; // size of field data to follow, 0..4095
        };
    };
};

struct Field_Data {
    std::string _content;
};

typedef std::tuple<uint_fast32_t, int_fast32_t, uint_fast32_t> Protocole_Header2;

enum PHR {
    TPL_TYPE,
    TPL_SIZE,
    TPL_ETC
};

std::istream& operator>>(std::istream& is, Protocole_Header2& tpl)
{
    is >> std::get<TPL_TYPE>(tpl) >> std::get<TPL_SIZE>(tpl) >> std::get<TPL_ETC>(tpl);
    return is;
}

union Field_Header2 {
    char fbody[FIELD_HEADER_SZ];
    struct {
        unsigned ftype : 4;  // type of data 0..15
        unsigned fsize : 12; // size of field data to follow, 0..4095
    };
};

int main()
{
    Protocole_Header ph;
    Field_Header fh;
    Field_Data fd;
    long i;
    char protocole_buffer[FIELD_DATA_SIZE + 1];

    std::cin.read(ph.pbody, EVENT_HEADER_SZ); // read(), not get(): binary data may contain '\n'
    std::cin.read(fh.fbody, FIELD_HEADER_SZ);
    for (i = 0; i < ph.psize; ++i)
    {
        std::cin.read(protocole_buffer, fh.fsize);
        fd._content.assign(protocole_buffer, fh.fsize); // push somewhere else
        std::cin.read(fh.fbody, FIELD_HEADER_SZ);
    }
    // ...
    // ...
    Protocole_Header2 ph2;
    Field_Header2 fh2;
    std::cin >> ph2;
    std::cin.read(fh2.fbody, FIELD_HEADER_SZ);
    for (i = 0; i < std::get<TPL_SIZE>(ph2); ++i)
    {
        std::cin.read(protocole_buffer, fh2.fsize);
        fd._content.assign(protocole_buffer, fh2.fsize); // push somewhere else
        std::cin.read(fh2.fbody, FIELD_HEADER_SZ);
    }
}
Here you have both of your answers.
Note: using a metastructure instead of a plain structure is just as much a burden as finding the code and recompiling it whenever the protocol breaks.
Usually you do not define an ABI for a protocol structure; that is why Boost.Spirit was made.
A parser should be used to handle a protocol (always, because a protocol is a grammar in its own right; define an EBNF for it and your code will run for decades without anyone recompiling it).
The only exception to using a parser is when you need to pass MISRA, health-care, or other regulated-sector audits. The rest of the time, don't bind external data to ABI structures in C or C++; it is a reliable source of bugs.
Related
I want to generate a unique identifier "ident" for my complex structure. How can I do that?
In my header, the complex structure is:
struct Complexe {
float x;
float y;
static unsigned int ident;
};
void Init(Complexe&);
etc...
and in the cpp file I need to assign ident a unique int:
void Init(Complexe& z) {
    z.ident = 0;
    z.y = 0;
    z.x = 0;
}
May I recommend you std::hash?
std::size_t ident = std::hash<Complexe>()(complexVar);
Writing it from memory, but it should return a unique value (with a very small chance of collision) for each Complexe object.
Consider UUID, specifically uuid_generate on GNU/Linux, or (it seems) UuidCreate on Windows.
Generating a unique id is easy, even if you want to write your own algorithm. Your algorithm will need some thought, though, if the environment you are working in is multi-threaded; in that case you will need to write thread-safe code. For example, the code below generates a unique id and is also thread safe:
#include <atomic>
#include <cstdint>

class Utility {
public:
    static int getUniqueId();
};

int Utility::getUniqueId() {
    static std::atomic<std::uint32_t> uid { 0 };
    return ++uid;
}
One simple way is to make a free-list that helps you reuse IDs. You start the ID counter at 0, and whenever you create a structure, you first check the free-list for any released IDs. It will be empty the first time through, so you increment the ID counter and use that value for your new structure. When you destroy a structure, its ID goes back on the free-list (which can be implemented as a stack), so that the next time you create a structure that ID is reused. If the free-list runs out of IDs, you just continue incrementing the ID counter from where you left off... wash, rinse, and repeat. The nice thing about this method is that you will never wrap the integer range and accidentally hand out an ID that is still in use if your program ends up running a long time. The main downside is the extra storage for the stack, but the stack can grow as you need it.
I need to represent in a AST a structure like this:
struct {
int data;
double doubleDataArray[10];
struct {
int nestedData;
};
};
I'm creating an AST like this one:
I need to retrieve data from the leaves. The problem is that the leaves contain heterogeneous data: a leaf can represent an integer value, a double, a string, and so on.
I can create classes like IntValue, DoubleValue that inherit from Value and store respective data, perform a dynamic_cast to convert Value to the type referred in its type attribute. Something like
switch (value->getType()) {
    case Type::Int: {
        auto iv = dynamic_cast<IntValue *>(value);
        int i = iv->getValue();
    } break;
    case Type::Double: {
        auto dv = dynamic_cast<DoubleValue *>(value);
        double d = dv->getValue();
    } break;
    //…
}
but I'd like to know if there's a better way, because a switch like that one is not easily maintainable or readable.
I've seen some example, like in boost::program_options, something like:
int value = value->getValue().as<int>();
Is this a better way? How can I reproduce it?
You could do something like this using C++17:
struct node {
    //... other stuff
    std::variant</*your types of nodes here*/> type;
};
then call this visitor on your nodes
std::visit([](auto&& node) {
if constexpr(std::is_same_v<std::decay_t<decltype(node)>, /* your type here */>) {
// ...
}
else if constexpr(/* ... */) {
// ...
}
}, node0.type);
Going on a tangent for a slightly different flavor of a solution, how about doing it the way capnproto does it? Capnproto's own schema compiler represents the AST in memory using the Capnproto wire encoding. The schema supports tagged unions. The lexer and parser for the schema are built using combinators (although I presume that you already have a good parser in place that produces the AST).
The structure could be expressed as follows using capnp schema:
# MyAst.capnp
struct Struct {
  fields @0 :List(Field);
}

struct Field {
  name @4 :Text;
  union {
    integer @0 :List(Int32);
    fpoint @1 :List(Double);
    text @2 :List(Text);
    structure @3 :Struct;
  }
}
The schema compiler would generate C++ code for this, with the following important classes Struct::Reader, Struct::Builder, Field::Reader and Field::Builder. Whatever makes the AST would use the Struct::Builder type to make a structure instance, with its data. Then, you'd traverse the structure as follows:
void processData(Struct::Reader reader) {
auto fields = reader.getFields();
for (auto &field : fields) {
if (field.hasInteger()) {
int32_t val = field.getInteger();
...
} else if (field.hasFpoint()) {
double val = field.getFpoint();
...
} else if (field.hasText()) {
kj::StringPtr val = field.getText();
...
} else if (field.hasStructure()) {
processData(field.getStructure());
}
}
}
The kj framework (included in capnproto) has quite a few compiler-building goodies, such as memory arenas. A Foo::Builder would then be obtained from an Orphan<Foo>, and the orphan is produced by an orphanage that carves out memory from an arena allocator. With your entire AST built in an arena with one or a few large, contiguous segments, this performs better than allocating all those types on the general-purpose heap (assuming your AST is not tiny).
This representation also serializes directly to disk or network with no transcoding: you can do a binary dump of an orphanage's arena, then later load it directly and get all your data back with zero effort and zero transcoding. The Foo::Reader and Foo::Builder types provide very fast accessors that do no data decoding or translation - that is the advantage of the capnproto encoding.
If you modify the data in the AST, the orphanage may grow, but it also provides a copy operation that copies only the referenced areas (a copying GC, if you will) - and that is blazing fast, too, since no transcoding is done. Chunks of verbatim binary data are copied with very little traversal overhead.
We have a client/server application where older servers are supported by newer clients. Both support a shared set of icons.
We're representing the icons as an implicit enum that's included in the server and client builds:
enum icons_t { // rev 1.0
ICON_A, // 0
ICON_B, // 1
ICON_C // 2
};
Sometimes we retire icons (weren't being used, or used internally and weren't listed in our API), which led to the following code being committed:
enum icons_t { // rev 2.0
ICON_B, // 0
ICON_C // 1 (now if a rev 1.0 server uses ICON_B, it will get ICON_C instead)
};
I've changed our enum to the following to try and work around this:
// Big scary header about commenting out old icons
enum icons_t { // rev 2.1
// Removed: ICON_A = 0,
ICON_B = 1,
ICON_C = 2
};
Now my worry is a bad merge when multiple people add new icons:
// Big scary header about commenting out old icons
enum icons_t { // rev 30
// Removed: ICON_A = 0,
ICON_B = 1,
ICON_C = 2,
ICON_D = 3,
ICON_E = 3 // Bad merge leaves 2 icons with same value
};
Since it's an enum, we don't really have a way to assert that the values are unique.
Is there a better data structure to manage this data, or a design change that wouldn't be open to mistakes like this? My thoughts have been going towards a tool to analyze pull requests and block merges if this issue is detected.
I have previously done tests that check out previous builds and scan header files for this type of version-breaking behaviour. You can use diff to generate a report of any changes, grep that for the common pattern, and identify the difference between deleting a fixed-index entry, changing the index of an entry, and deleting or inserting a floating index entry.
The one obvious way to avoid it is to NOT remove the dead indices but rename them, i.e. ICON_A becomes ICON_A_RETIRED, and its slot is reserved forever. It is inevitable that someone will change an index, though, so a good unit test would also help. Forcing a boilerplate style means the test is simpler than coping with the generic case.
Another trick might be to accept that the issue will occur but confine it: at each software release/revision, update a base number for the range, release the software, and update it again, so the dev version is never compatible with the release, e.g.
#define ICON_RANGE 0x1000
#define ICON_REVISION_BASE ((RELEASENUM+ISDEVFLAG)*ICON_RANGE)
enum icon_t {
    iconTMax = ICON_REVISION_BASE+ICON_RANGE,
    iconTBase = ICON_REVISION_BASE,
    icon_A,
    icon_B,
    // ...
};
Then, at run-time, any icons not in the current range are easily rejected, or you might provide a special look-up between versions, perhaps generated by trawling your version control revisions. Note that you can only provide backward compatibility this way, not forward compatibility. It would be up to newer code to preemptively back-translate their icon numbers to send to older modules, which may be more effort than it is worth.
This thought just crossed my mind: if we keep a literal at the end for the enum size, our unit tests can use that to assert if we haven't verified each enum literal:
enum icons_t {
ICON_A_DEPRECATED,
ICON_B,
ICON_C,
ICON_COUNT // ALWAYS KEEP THIS LAST
};
Then in testing:
unsigned int verifyCount = 0;
verify(0, ICON_A_DEPRECATED); // verifyCount++, assert 0 was not verified before
verify(1, ICON_B); // verifyCount++, assert 1 was never verified before
assert(ICON_COUNT == verifyCount, "Not all icons verified");
Then our only problem is ensuring tests pass before releasing, which we should be doing anyway.
Since the question has been tagged C++11, this could be better handled with Scoped enumerations.
Read about it here : http://en.cppreference.com/w/cpp/language/enum
Since the same enum file is included in both client and server builds, removing any entry leads to a compilation failure wherever the missing entry is used.
All that needs to change is your icon_t: upgrade it from enum to enum class.
enum class icon_t
{
ICON_A,
ICON_B,
};
Now you can't blatantly pass an int instead of an icon_t. This drastically reduces the probability of mistakes.
So the calling side
#include <iostream>
enum class icon_t
{
ICON_A,
ICON_B,
};
void test_icon(icon_t const & icon)
{
if (icon == icon_t::ICON_A)
std::cout << "icon_t::ICON_A";
if (icon == icon_t::ICON_B)
std::cout << "icon_t::ICON_B";
}
int main()
{
auto icon = icon_t::ICON_A;
test_icon(icon); // this is ok
test_icon(1); // Fails at compile time : no known conversion from 'int' to 'const icon_t' for 1st argument
return 0;
}
Moreover, extracting numerical values from scoped enumerators is still possible via static_cast to int, if required:
int n = static_cast<int>(icon); // Would return 0, the underlying value of icon_t::ICON_A
We are rewriting our legacy C code in C++. At the core of our system we have a TCP client connected to a master. The master streams messages continuously, and each socket read yields some number N of messages of the format {type, size, data[0]}.
We don't copy these messages into individual buffers; we just pass a pointer to the beginning of each message, its length, and a shared_ptr to the underlying buffer to the workers.
The legacy C version was single-threaded and did an in-place network-to-host conversion like below:
struct Message {
    uint32_t something1;
    uint16_t something2;
};

void process(char *message) {
    Message *m = (Message *) message;
    m->something1 = ntohl(m->something1);
    m->something2 = ntohs(m->something2);
}
And then use the Message.
There are a couple of issues with carrying this logic over to the new code.
Since we dispatch the messages to different workers, each worker doing its own ntoh conversion causes cache-miss issues, because the messages are not cache-aligned; there is no padding between the messages.
The same message can be handled by different workers. This is the case where a message must be processed locally and also relayed to another process: the relay worker needs the message in the original network order, while the local worker needs it converted to host order. Since the message is not duplicated, both cannot be satisfied.
The solutions that come to my mind are:
1. Duplicate the message and send one copy to the relay workers, if any. Do the ntoh conversion of all messages belonging to the same buffer in the dispatcher itself before dispatching (say, by calling handler->ntoh(message)), so that the cache-miss issue is solved.
2. Send each worker the original copy. Each worker copies the message to a local buffer, then does the ntoh conversion and uses it. Here each worker can use a thread-local static buffer as a scratch pad to copy the message into.
Now my questions are:
Is the option-1 way of doing the ntoh conversion idiomatic C++? The alignment requirement of the structure differs from that of the char buffer (we haven't had any issue with this yet). Scheme 2 should be fine in this regard, since the scratch buffer can have the alignment of max_align_t and hence be cast to any structure, but it incurs copying the entire message, which can be quite big (a few KB).
Is there a better way to handle the situation?
Your primary issue seems to be how to handle messages that come in misaligned. That is, if each message structure doesn't have enough padding on the end of it so that the following message is properly aligned, you can trigger misaligned reads by reinterpreting a pointer to the beginning of a message as an object.
We can get around this a number of ways, perhaps the simplest would be to ntoh based on a single-byte pointer, which is effectively always aligned.
We can hide the nasty details behind wrapper classes, which will take a pointer to the start of a message and have accessors that will ntoh the appropriate field.
As indicated in the comments, it's a requirement that offsets be determined by a C++ struct, since that's how the message is initially created, and it may not be packed.
First, our ntoh implementation, templated so we can select one by type:
#include <cstddef>      // offsetof (used by the accessor macros below)
#include <cstdint>
#include <type_traits>  // std::decay
#include <utility>      // std::declval

template <typename R>
struct ntoh_impl;

template <>
struct ntoh_impl<uint16_t>
{
    static uint16_t ntoh(uint8_t const *d)
    {
        return (static_cast<uint16_t>(d[0]) << 8) |
                d[1];
    }
};

template <>
struct ntoh_impl<uint32_t>
{
    static uint32_t ntoh(uint8_t const *d)
    {
        return (static_cast<uint32_t>(d[0]) << 24) |
               (static_cast<uint32_t>(d[1]) << 16) |
               (static_cast<uint32_t>(d[2]) << 8) |
                d[3];
    }
};

template <>
struct ntoh_impl<uint64_t>
{
    static uint64_t ntoh(uint8_t const *d)
    {
        return (static_cast<uint64_t>(d[0]) << 56) |
               (static_cast<uint64_t>(d[1]) << 48) |
               (static_cast<uint64_t>(d[2]) << 40) |
               (static_cast<uint64_t>(d[3]) << 32) |
               (static_cast<uint64_t>(d[4]) << 24) |
               (static_cast<uint64_t>(d[5]) << 16) |
               (static_cast<uint64_t>(d[6]) << 8) |
                d[7];
    }
};
Now we'll define a set of nasty macros that will automatically implement accessors for a given name by looking up the member with the matching name in the struct proto (a private struct to each class):
#define MEMBER_TYPE(MEMBER) typename std::decay<decltype(std::declval<proto>().MEMBER)>::type
#define IMPL_GETTER(MEMBER) MEMBER_TYPE(MEMBER) MEMBER() const { return ntoh_impl<MEMBER_TYPE(MEMBER)>::ntoh(data + offsetof(proto, MEMBER)); }
Finally, we have an example implementation of the message structure you have given:
class Message
{
private:
    struct proto
    {
        uint32_t something1;
        uint16_t something2;
    };

public:
    explicit Message(uint8_t const *p) : data(p) {}
    explicit Message(char const *p) : data(reinterpret_cast<uint8_t const *>(p)) {}

    IMPL_GETTER(something1)
    IMPL_GETTER(something2)

private:
    uint8_t const *data;
};
Now Message::something1() and Message::something2() are implemented and will read from the data pointer at the same offsets they wind up being in Message::proto.
Providing the implementation in the header (effectively inline) has the potential to inline the entire ntoh sequence at the call site of each accessor!
This class does not own the data allocation it is constructed from. Presumably you could write a base class if there's ownership-maintaining details here.
I am trying to parse the header packet of the SIP protocol (similar to HTTP), which is a text-based protocol.
The fields in the header do not have an order.
For example: if there are 3 fields f1, f2, and f3, they can come in any order, any number of times, say f3, f2, f1, f1.
This is increasing the complexity of my parser since I don't know which will come first.
What should I do to overcome this complexity?
Ultimately, you simply need to decouple your processing from the order of receipt. To do that, have a loop that repeats while fields are encountered, and inside the loop determine which field type it is, then dispatch to the processing for that field type. If you can process the fields immediately great, but if you need to save the potentially multiple values given for a field type you might - for example - put them into a vector or even a shared multimap keyed on the field name or id.
Pseudo-code:
Field x;
while (x = get_next_field(input))
{
switch (x.type())
{
case Type1: field1_values.push_back(x.value()); break;
case Type2: field2 = x.value(); break; // just keep the last value seen...
default: throw std::runtime_error("unsupported field type");
}
}
// use the field1_values / field2 etc. variables....
Tony already gave the main idea; I'll be more specific.
The basic idea is that parsing is generally separated into several phases. In your case you need to separate the lexing part (extracting the tokens) from the semantic part (acting on them).
You can proceed in different fashions; since I prefer a structured approach, let us suppose that we have a simple struct representing the header:
struct SipHeader {
int field1;
std::string field2;
std::vector<int> field3;
};
Now we create a function that takes a field name and its value, and fills the corresponding field of the SipHeader structure appropriately.
void parseField(std::string const& name, std::string const& value, SipHeader& sh) {
if (name == "Field1") {
sh.field1 = std::stoi(value);
return;
}
if (name == "Field2") {
sh.field2 = value;
return;
}
if (name == "Field3") {
// ...
return;
}
throw std::runtime_error("Unknown field");
}
Then you iterate over the lines of the header and, for each line, separate the name from the value and call this function.
There are obviously refinements:
instead of an if-chain you can use a map of functors, or you can fully tokenize the source and store the fields in a std::map<std::string, std::string>
you can use a state-machine technique to act on the fields immediately, without copying
but the essential advice is the same:
To manage complexity you need to separate the task in orthogonal subtasks.