Size of encoded avro message without encoding it - c++

Is there a way to get the size of the encoded avro message without actually encoding it?
I'm using Avro 1.8.1 for C++.
I'm used to Google Protocol Buffers, where you can call ByteSize() on a protobuf to get the encoded size, so I'm looking for something similar.
Since the message is in essence a raw struct, I understand that the size cannot be retrieved from the message itself, but perhaps there is a helper method that I'm not aware of?

There is no way around it unfortunately...
Here is an example showing how the size can be calculated by encoding the object:
#include <avro/Encoder.hh>   // avro::binaryEncoder, avro::EncoderPtr
#include <avro/Stream.hh>    // avro::memoryOutputStream
#include <avro/Specific.hh>  // avro::encode for generated types

MyAvroStruct obj;
avro::EncoderPtr encoder = avro::binaryEncoder();
// Chunk size of 1 so that byteCount() reports exactly the bytes written.
// Avro 1.8.1 returns std::auto_ptr here.
std::auto_ptr<avro::OutputStream> out = avro::memoryOutputStream(1);
encoder->init(*out);
avro::encode(*encoder, obj);
out->flush();
uint64_t bufferSize = out->byteCount();

(Edit below shows a hacky way to shrink-to-fit an OutputStream after writing to it with a BinaryEncoder)
It's a shame that avro::encode() doesn't call backup() on the OutputStream to release unused memory after encoding. Martin G's answer gives the best solution using only the tools avro provides, but it issues N memory allocations of 1 byte each if your serialized object is N bytes in size.
You could implement a custom avro::OutputStream that simply counts and discards all written bytes. This would get rid of the memory allocations. It's still not a great approach, as the actual encoder will have to "ask" for every single byte:
(Code untested, just for demonstration purposes)
#include <avro/Stream.hh>  // avro::OutputStream
#include <cstddef>
#include <cstdint>

// An OutputStream that merely counts the bytes written to it and discards them.
class ByteCountOutputStream : public avro::OutputStream {
public:
    uint64_t byteCount_ = 0;
    uint8_t dummyWriteLocation_;

    ByteCountOutputStream() = default;

    bool next(uint8_t **data, size_t *len) final {
        // Hand out the same single dummy byte over and over, counting each request.
        byteCount_ += 1;
        *data = &dummyWriteLocation_;
        *len = 1;
        return true;
    }
    void backup(size_t len) final {
        byteCount_ -= len;
    }
    uint64_t byteCount() const final {
        return byteCount_;
    }
    void flush() final {}
};
This could then be used as:
MyAvroStruct obj;
avro::EncoderPtr encoder = avro::binaryEncoder();
ByteCountOutputStream out;  // note: "ByteCountOutputStream out();" would declare a function instead
encoder->init(out);
avro::encode(*encoder, obj);
size_t bufferSize = out.byteCount();
Edit:
My initial question when stumbling upon this was: how can I tell how many bytes of the OutputStream are actually required (for storing / transmitting)? Or, equivalently, if OutputStream::byteCount() returns the number of bytes allocated by the encoder so far, how can I make the encoder "backup" / release the bytes it didn't use? Well, there is a hacky way:
The Encoder abstract class provides an init method. For the BinaryEncoder, this is currently implemented as:
void BinaryEncoder::init(OutputStream &os) {
    out_.reset(os);
}
with out_ being the internal StreamWriter of the Encoder.
Now, the StreamWriter implements reset as:
void reset(OutputStream &os) {
    if (out_ != nullptr && end_ != next_) {
        out_->backup(end_ - next_);
    }
    out_ = &os;
    next_ = end_;
}
which will return unused memory back to the "old" OutputStream before switching to the new one.
So, you can abuse the encoder's init method like this:
// setup as always
MyAvroStruct obj;
avro::EncoderPtr encoder = avro::binaryEncoder();
std::auto_ptr<avro::OutputStream> out = avro::memoryOutputStream();
// actual serialization
encoder->init(*out);
avro::encode(*encoder, obj);
// re-init on the same OutputStream. Happens to shrink the stream to fit
encoder->init(*out);
size_t bufferSize = out->byteCount();
However, this behavior is not documented, so it might break in the future.

Related

LibTorch's save_to & load_from function - Model Archiving

I'm fairly new to LibTorch and its model archiving system.
At the moment, I'm trying to save my model configuration from one module and load it into a different module, but I'm hitting an error in LibTorch that I don't quite understand.
To do this, I've been reading the API documentation here: https://pytorch.org/cppdocs/api/classtorch_1_1serialize_1_1_output_archive.html#class-documentation
which doesn't seem to be all that helpful on the matter.
I've been trying to utilise as much of LibTorch as possible here, but I suspect a JSON or similar storage structure might in fact be easier. I'm doing it this way (rather than using a .clone() or similar) because I intend to send the data at some point in the future.
I've simplified my code below:
torch::serialize::OutputArchive
NewModule::getArchive()
{
    torch::serialize::OutputArchive archive;
    auto params = named_parameters(true /*recurse*/);
    auto buffers = named_buffers(true /*recurse*/);
    for (const auto& val : params)
    {
        if (!is_empty(val.value()))
            archive.write(val.key(), val.value());
    }
    // Same again with a write for the buffers.
    return archive;
}
This function aims to copy the contents into a torch::serialize::OutputArchive, which can then be saved to disk, passed into an ostream, or passed to a "writer function". It's the last of these I'm struggling to get working successfully.
Torch specifies that the writer function must be of type std::function<size_t(const void*, size_t)>. I'm assuming (as the docs don't specify!) that the const void* is actually an array of bytes whose length is given by the second parameter, size_t. I'm unsure why the return value is also a size_t here.
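For what it's worth, a writer that simply buffers everything it is given would look roughly like the sketch below (the blob vector is made up for illustration, and treating the return value as "number of bytes accepted" is an assumption, not something the docs state):
#include <cstddef>
#include <functional>
#include <vector>

std::vector<char> blob;  // hypothetical buffer collecting the whole archive
std::function<size_t(const void*, size_t)> writer =
    [&blob](const void* data, size_t size) -> size_t {
        const char* bytes = static_cast<const char*>(data);
        blob.insert(blob.end(), bytes, bytes + size);  // append this chunk
        return size;  // assumption: report how many bytes were accepted
    };
archive.save_to(writer);  // 'archive' as returned by getArchive() above
// blob now holds the complete archive in one contiguous piece.
If save_to happens to call the writer more than once, each call only sees a fragment of the archive, so feeding each fragment straight into load_from on its own (as copyArchive below does) could plausibly produce the "failed finding central directory" error.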
My next block of code takes this data blob and attempts to read it using torch::serialize::InputArchive. Calling load_from here produces the error: "PytorchStreamReader failed reading zip archive: failed finding central directory"
Can anyone help resolve why this is the case?
Code below:
void
NewModule::LoadFromData(const char* data, size_t data_size)
{
    torch::serialize::InputArchive archive;
    archive.load_from(data, data_size);
    auto params = named_parameters(true);
    auto buffers = named_buffers(true);
    for (auto& val : params)
    {
        archive.read(val.key(), val.value());
    }
    // Same again with a read for the buffers
}
torch::serialize::OutputArchive
NewModule::copyArchive()
{
    NewModule other_module;
    auto archive = getArchive();
    std::function<size_t(const void*, size_t)> writer_lambda =
        [this, other_module](const void* data, size_t size) mutable -> size_t {
            other_module.LoadFromData(reinterpret_cast<const char*>(data), size);
            return size;
        };
    archive.save_to(writer_lambda);
    return archive;
}

How to access serialized data of Cap'n'Proto?

I'm working with Cap'n'Proto and my understanding is that there is no need to do serialization, as it's already being done. So my question is: how would I access the serialized data and get its size, so that I can pass it as a byte array to another library?
// person.capnp
struct Person {
    name @0 :Text;
    age @1 :Int16;
}
// ...
::capnp::MallocMessageBuilder message;
Person::Builder person = message.initRoot<Person>();
person.setName("me");
person.setAge(20);
// at this point, how do I get some sort of handle to
// the serialized data of 'person' as well as its size?
I've seen the writePackedMessageToFd(fd, message); call, but didn't quite understand what was being passed and couldn't find any API docs on it. I also don't want to write to a file descriptor, as I need the serialized data returned as a const void*.
Looking in Cap'n'Proto's message.h file, there is this function in the base class of MallocMessageBuilder, which says it gets the raw data making up the message:
kj::ArrayPtr<const kj::ArrayPtr<const word>> getSegmentsForOutput();
// Get the raw data that makes up the message.
But even then, I'm not sure how to get it as a const void*.
Thoughts?
::capnp::MallocMessageBuilder message;
is your binary message, and its size is
message.sizeInWords()
(size in bytes divided by 8).
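In other words (taking sizeInWords() from the answer above at face value), the byte size would be:
size_t sizeBytes = message.sizeInWords() * sizeof(capnp::word);  // capnp::word is 8 bytes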
The following appears to be what's needed:
// ...
::capnp::MallocMessageBuilder message;
Person::Builder person = message.initRoot<Person>();
person.setName("me");
person.setAge(20);
kj::Array<capnp::word> dataArr = capnp::messageToFlatArray(message);
kj::ArrayPtr<kj::byte> bytes = dataArr.asBytes();
std::string data(bytes.begin(), bytes.end());
const void* dataPtr = data.c_str();
At this point, I have a const void* dataPtr and size using data.size().
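As a side note, the intermediate std::string copy isn't strictly required: the kj::Array returned by messageToFlatArray already owns a contiguous buffer, so as long as dataArr stays alive while the pointer is in use, something like this should also work:
kj::Array<capnp::word> dataArr = capnp::messageToFlatArray(message);
kj::ArrayPtr<kj::byte> bytes = dataArr.asBytes();
const void* dataPtr = bytes.begin();   // points into dataArr's own buffer
size_t dataSize = bytes.size();        // size in bytes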

C++/WinRT runtime component to return large array

I am trying to write a Windows Runtime component using C++/WinRT that will be consumed from C# in a Unity program. I am completely new to C++, but after spending a long time on it I managed to make the code work by returning the Windows.Graphics.Imaging.SoftwareBitmap object that came through the Windows API. However, returning the SoftwareBitmap is slow, so I changed my code to return a com_array instead.
Returning the com_array does speed things up significantly, but now I frequently run into an "Access violation reading location xxxxx" error. As someone with limited experience in C++, I am struggling to do this properly. Below is a simplified version of my code; I've removed the majority of it for simplicity.
1. FrameData
#FrameData.idl
runtimeclass RMFrameData
{
    RMFrameData();
    UInt8[] imageArray;
}
#FrameData.h
struct RMFrameData : RMFrameDataT<RMFrameData>
{
    RMFrameData() = default;
    ~RMFrameData();
    com_array<uint8_t>& imageArray();
    void imageArray(array_view<uint8_t const> value);
private:
    com_array<uint8_t> m_imageArray{};
};
#FrameData.cpp
RMFrameData::~RMFrameData()
{
    m_imageArray.clear();
}
com_array<uint8_t>& RMFrameData::imageArray()
{
    return m_imageArray;
}
void RMFrameData::imageArray(array_view<uint8_t const> value)
{
    m_imageArray = winrt::com_array<BYTE>(std::move_iterator(value.begin()), std::move_iterator(value.end()));
}
The method that uses RMFrameData to send it is below.
void PVProcessor::OnFrameArrived(const MediaFrameReader& sender, const MediaFrameArrivedEventArgs& args)
{
    if (MediaFrameReference frame = sender.TryAcquireLatestFrame())
    {
        auto frame_data = winrt::make_self<winrt::HL2CV_WRT::implementation::RMFrameData>();
        if (frame != nullptr && frame.VideoMediaFrame() != nullptr) {
            uint32_t pixelBufferDataLength = 0;
            uint8_t* pixelBufferData;
            SoftwareBitmap bitmap = SoftwareBitmap::Convert(frame.VideoMediaFrame().SoftwareBitmap(), BitmapPixelFormat::Bgra8);
            if (bitmap != nullptr) {
                BitmapBuffer bitmapBuffer = bitmap.LockBuffer(BitmapBufferAccessMode::Read);
                auto spMemoryBufferByteAccess{ bitmapBuffer.CreateReference().as<::Windows::Foundation::IMemoryBufferByteAccess>() };
                winrt::check_hresult(spMemoryBufferByteAccess->GetBuffer(&pixelBufferData, &pixelBufferDataLength));
                std::vector<BYTE> pixelVectorData{};
                pixelVectorData.insert(pixelVectorData.end(), std::make_move_iterator(pixelBufferData), std::make_move_iterator(pixelBufferData + pixelBufferDataLength));
                frame_data->imageArray(pixelVectorData); // set the com_array here
                bitmap.Close();
            }
        }
        frame.Close();
        m_frame_data_available(*this, *frame_data);
    }
}
The method consuming RMFrameData from C# is below.
void OneFrameAvailable(object sender, RMFrameData e)
{
    var array = e.imageArray;
    /*
    use array data here
    */
}
The line var array = e.imageArray; is throwing the "Access violation reading location xxxx" error.
My questions are:
How do I properly initialize m_imageArray in FrameData.h?
Is the destructor correctly cleaning up m_imageArray after it has been used from C#?
Do I need to explicitly call the destructor from the C# code after using the frame data?
I wrote the C++ code thinking like Java, so I am sure there are many mistakes. Please let me know anything else that needs to be corrected.
The goal is to be able to access the SoftwareBitmap's data from C# without making copies along the way. How?
Thanks.

Using stream to treat received data

I am receiving messages from a socket.
Each message is wrapped in a header (which is basically the size of the message) and a footer that is a CRC (a checksum used to detect whether the message was corrupted).
So the layout is something like:
size (2 bytes) | message (240 bytes) | crc (4 bytes)
I wrote an operator>>.
The operator>> is as follows:
std::istream &operator>>(std::istream &stream, Message &msg) {
    std::int16_t size;
    stream >> size;
    stream.read(reinterpret_cast<char*>(&msg), size);
    // Not enough data received yet
    if (stream.rdbuf()->in_avail() < size + 4) {
        stream.setstate(std::ios_base::failbit);
        return stream;
    }
    std::int32_t gotCrc;
    stream >> gotCrc;
    // Data not received correctly
    if (gotCrc != computeCrc(msg)) {
        stream.setstate(std::ios_base::failbit);
    }
    return stream;
}
A message can arrive byte by byte, or it can arrive in one piece. We can even receive several messages at once.
Basically, what I did is something like this:
struct MessageReceiver {
    std::string totalDataReceived;
    void messageArrived(std::string data) {
        // We add the data to totalDataReceived
        totalDataReceived += data;
        std::stringbuf buf(totalDataReceived);
        std::istream stream(&buf);
        std::vector<Message> messages(
            std::istream_iterator<Message>(stream),
            std::istream_iterator<Message>{});
        std::for_each(begin(messages), end(messages), processMessage);
        // +4 for the crc and +2 for the size field
        auto sizeToRemove = [](auto init, auto message) { return init + message.size + 4 + 2; };
        // remove the processed messages
        totalDataReceived.erase(0, std::accumulate(begin(messages), end(messages), 0, sizeToRemove));
    }
};
So basically, we receive data and append it to a running buffer of everything received so far. We stream it, and if we got at least one complete message, we remove it from totalDataReceived.
However, I am not sure this is a good way to go. Indeed, this code does not work when I compute a bad CRC: the message is not created, so we don't iterate over it, and on every call I will keep retrying the same message with the bad CRC.
How can I handle this? I cannot keep all the data in totalDataReceived, because I will receive a lot of messages over the program's lifetime.
Should I implement my own streambuf?
It sounds like what you want is a class which acts like a std::istream. You could of course write your own class, but I would rather implement a std::streambuf, for a couple of reasons.
First, people using your class will already know how to use it, since it behaves the same as any other std::istream if you inherit from and implement std::streambuf.
Second, you don't need to create extra methods or override operators; they are already available at std::istream's level.
All you have to do to implement a std::streambuf is inherit from it, override underflow(), and set the get pointers using setg().
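A minimal sketch of that idea (names like ReceiveBuffer and append are made up for illustration; trimming already-consumed bytes is left out for brevity):
#include <streambuf>
#include <istream>
#include <string>

// Owns the bytes received so far and exposes them through the standard
// get-area interface, so a std::istream (and your operator>>) can read from it.
class ReceiveBuffer : public std::streambuf {
public:
    void append(const char* data, std::size_t len) {
        // Remember how far the reader got before the string possibly reallocates.
        std::size_t consumed = gptr() ? static_cast<std::size_t>(gptr() - eback()) : 0;
        buffer_.append(data, len);
        char* base = &buffer_[0];
        setg(base, base + consumed, base + buffer_.size());
    }
protected:
    int_type underflow() override {
        // Nothing more to hand out until the next chunk arrives from the socket.
        return traits_type::eof();
    }
private:
    std::string buffer_;
};
It could then be used roughly like this:
ReceiveBuffer buf;
std::istream stream(&buf);
// for every chunk received from the socket:
//     buf.append(data, len);
//     Message msg;
//     if (stream >> msg) processMessage(msg);
//     // remember to stream.clear() before retrying once more data has arrived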

How to optimize parse data flow algorithm?

I need to implement a library in C++ that parses an abstract client-server protocol conversation. I don't have a file containing the whole conversation; I have to parse it on the fly. I have to implement the following interface:
class parsing_class
{
public:
    void on_data( const char* data, size_t len );
    //other functions
private:
    size_t pos_; // current position in the data flow
    bool first_part_parsed_;
    bool second_part_parsed_;
    //... some more bool markers or something like vector< bool >
};
The data is passed to my class through the on_data function. The chunk length varies from one call to another. I know the protocol's packet format and how the conversation is organized, so I can judge from the current pos_ whether I have enough data to parse the Nth part.
Right now the implementation looks like this:
void parsing_class::on_data( const char* data, size_t len )
{
    pos_ += len;
    if( pos_ > FIRST_PART_SIZE and !first_part_parsed_ )
        parse_first_part( data, len );
    if( pos_ > SECOND_PART_SIZE and !second_part_parsed_ )
        parse_second_part( data, len );
    //and so on..
}
What I want is some tips on how to optimize this algorithm, and in particular how to avoid these numerous ifs (on_data may be called very many times, and each call has to walk through all the checks).
You don't need all those bools and pos_, as they only seem to keep track of how much of the conversation has already passed so that you can continue with the next part.
How about the following: write yourself a parse function for each part of the conversation
bool parse_part_one(const char *data) {
    ... // parse the data
    next_fun = parse_part_two;
    return true;
}
bool parse_part_two(const char *data) {
    ... // parse the data
    next_fun = parse_part_three;
    return true;
}
...
...
and in your class you add a pointer to the current parse function, starting at parse_part_one. Now, in on_data, all you do is call that parse function:
bool success = next_fun(data);
Because each function sets the pointer to the next parse function, the next call of on_data will invoke the next parse function automatically. No checks are required to figure out where in the conversation you are.
If the value of len is critical (which I assume it is), pass it along as well, and return false to indicate that the part could not be parsed (and don't update next_fun in that case either).
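Putting it together, a minimal sketch of how this could look inside the class (member and function names are made up; handling of partial parts and error returns is omitted):
#include <cstddef>

class parsing_class
{
public:
    void on_data(const char* data, std::size_t len)
    {
        // No flags to check: just run whatever stage of the conversation we are in.
        (this->*next_fun_)(data, len);
    }

private:
    bool parse_first_part(const char* data, std::size_t len)
    {
        // ... parse the first part ...
        next_fun_ = &parsing_class::parse_second_part;  // advance to the next stage
        return true;
    }

    bool parse_second_part(const char* data, std::size_t len)
    {
        // ... parse the second part ...
        return true;
    }

    // Pointer to the parse function for the current part, starting at the first one.
    bool (parsing_class::*next_fun_)(const char*, std::size_t) = &parsing_class::parse_first_part;
};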