Is Arrow Streaming end-to-end copy-free - apache-arrow

I am confused about Arrow Streaming. Many sources describing Arrow just paraphrase the following
The Arrow memory format supports zero-copy reads
and say that Arrow is zero-copy tool.
However, as far as I understand these paragraphs:
The primitive unit of serialized data in the columnar format is the “record batch”. Semantically, a record batch is an ordered collection of arrays, known as its fields, each having the same length as one another but potentially different data types. A record batch’s field names and types collectively form the batch’s schema.
In this section we define a protocol for serializing record batches into a stream of binary payloads and reconstructing record batches from these payloads without need for memory copying.
the description of the IPC Streaming Format and, my limited understanding, the source, data is serialized, and only deserialization is zero-copy.
In other words - systems that use Arrow Streaming actually copy the data on the way.
Is that correct?

As you said, Arrow is always zero-copy on the receiver side.
systems that use Arrow Streaming actually copy the data on the way.
It depends what you mean by "copy". Is data duplicated in-memory in the same process? No. The bytes have to be transported somehow from one virtual address space to another, whether you believe that technically constitutes a "copy" or not depends on your application (and point of view, perhaps).
Here is the actual C++ code (as of this writing) where the data written by the sender into an OutputStream which is a proxy for the receiver
// Now write the buffers
for (size_t i = 0; i < payload.body_buffers.size(); ++i) {
const std::shared_ptr<Buffer>& buffer = payload.body_buffers[i];
int64_t size = 0;
int64_t padding = 0;
// The buffer might be null if we are handling zero row lengths.
if (buffer) {
size = buffer->size();
padding = BitUtil::RoundUpToMultipleOf8(size) - size;
}
if (size > 0) {
RETURN_NOT_OK(dst->Write(buffer));
}
if (padding > 0) {
RETURN_NOT_OK(dst->Write(kPaddingBytes, padding));
}
}
Nothing is being forcibly copied here. If dst is standing in front of a socket-like interface then the bytes go on the wire to the receiver immediately (or with buffering, or whatever the OutputStream is doing). If dst is a file handle then the bytes are written to the file, etc.

Related

How could I send message of many fields with c++ grpc stream?

Suppose the message that I need to send from server to client is like this:
message BatchReply {
bytes data = 1;
repeated int32 shape = 2;
string dtype = 3;
repeated int64 labels = 4;
}
Here shape/dtype are only small variables and can be represented with a few int space, while data/labels are large memory buffers that can takes as much as 1G memory.
I am trying to send this message with stream:
service ImageService {
rpc get_batch (BatchRequest) returns (stream BatchReply) {}
}
My question is that the examples I could find to send message through stream are all about messages with only one field in the message struct, such as:
service TransferFile {
rpc Upload(stream Chunk) returns (Reply) {}
}
message Chunk {
bytes buffer = 1; // here is only on field of buffer, what if there is a field of int val = 2; ?
}
What if there are two fields in the struct of Chunk. Do I need to call set_val() each time when I call set_buffer() during the same stream feeding process ?
You can just send a message with multiple fields over grpc. That is the advantage of using protobuf.
I do not know if the transmission layer you are using is able to deal with such a large message as you are specifying. You could test that. If this does not work you can use the example you give of TransferFile.
When looking at Chunk I get the impression that they are sending segments of the whole data. On the other side they then reconstruct the segments into the complete set. The type used for Chunk is bytes. Those are just raw bytes and can represent anything you would like.
To send your BatchReply in chunks you can use the following steps:
Set the data in a BatchReply object.
Serialize the object to a byte array.
Set a chunk of the byte array in a Chunk object. For example 100 bytes each time.
Send the Chunk object using the TransferFile interface.
Repeat the process from step 3. until you reach the end of the byte array.
On the reception side you concatenate the chunks into one array and deserialize the array back to a BatchReply object.

Monitor buffers in GNU Radio

I have a question regarding buffering in between blocks in GNU Radio. I know that each block in GNU (including custom blocks) have buffers to store items that are going to be sent or received items. In my project, there is a certain sequence I have to maintain to synchronize events between blocks. I am using GNU radio on the Xilinx ZC706 FPGA platform with the FMCOMMS5.
In the GNU radio companion I created a custom block that controls a GPIO Output port on the board. In addition, I have an independent source block that is feeding information into the FMCOMMS GNU block. The sequence I am trying to maintain is that, in GNU radio, I first send data to the FMCOMMS block, second I want to make sure that the data got consumed by the FMCOMMS block (essentially by checking buffer), then finally I want to control the GPIO output.
From my observations, the source block buffer doesn’t seem to send the items until it’s full. This will cause a major issue in my project because this means that the GPIO data will be sent before or in parallel with sending the items to the other GNU blocks. That’s because I’m setting the GPIO value through direct access to its address in the ‘work’ function of my custom block.
I tried to use pc_output_buffers_full() in the ‘work’ function of my custom source in order to monitor the buffer, but I’m always getting 0.00. I’m not sure if it’s supposed to be used in custom blocks or if the ‘buffer’ in this case is something different from where the output items are stored. Here's a small code snippet which shows the problem:
char level_count = 0, level_val = 1;
vector<float> buff (1, 0.0000);
for(int i=0; i< noutput_items; i++)
{
if(level_count < 20 && i< noutput_items)
{
out[i] = gr_complex((float)level_val,0);
level_count++;
}
else if(i<noutput_items)
{
level_count = 0;
level_val ^=1;
out[i] = gr_complex((float)level_val,0);
}
buff = pc_output_buffers_full();
for (int n = 0; n < buff.size(); n++)
cout << fixed << setw(5) << setprecision(2) << setfill('0') << buff[n] << " ";
cout << "\n";
}
Is there a way to monitor the buffer so that I can determine when my first part of data bits have been sent? Or is there a way to make sure that the each single output item is being sent like a continuous stream to the next block(s)?
GNU Radio Companion version: 3.7.8
OS: Linaro 14.04 image running on the FPGA
Or is there a way to make sure that the each single output item is being sent like a continuous stream to the next block(s)?
Nope, that's not how GNU Radio works (at all!):
A while back I wrote an article that explains how GNU Radio deals with buffers, and what these actually are. While the in-memory architecture of GNU Radio buffers might be of lesser interest to you, let me quickly summarize the dynamics of it:
The buffers that (general_)work functions are called with behave for all that's practical like linearly addressable ring buffers. You get a random number of samples at once (restrictable to minimum numbers, multiples of numbers), and all that you not consume will be handed to you the next time work is called.
These buffers hence keep track of how much you've consumed, and thus, how much free space is in a buffer.
The input buffer a block sees is actually the output buffer of the "upstream" block in the flow graph.
GNU Radio's computation is backpressure-controlled: Any block's work method will immediately be called in an endless loop given that:
There's enough input for the block to do work,
There's enough output buffer space to write to.
Therefore, as soon as one block finishes its work call, the upstream block is informed that there's new free output space, thus typically leading to it running
That leads to high parallelity, since even adjacent blocks can run simultaneously without conflicting
This architecture favors large chunks of input items, especially for blocks that take a relative long time to computer: while the block is still working, its input buffer is already being filled with chunks of samples; when it's finished, chances are it's immediately called again with all the available input buffer being already filled with new samples.
This architecture is asynchronous: even if two blocks are "parallel" in your flow graph, there's no defined temporal relation between the numbers of items they produce.
I'm not even convinced switching GPIOs at times based on the speed computation in this completely non-deterministic timing data flow graph model is a good idea to start with. Maybe you'd rather want to calculate "timestamps" at which GPIOs should be switched, and send (timestamp, gpio state) command tuples to some entity in your FPGA that keeps absolute time? On the scale of radio propagation and high-rate signal processing, CPU timing is really inaccurate, and you should use the fact that you have an FPGA to actually implement deterministic timing, and use the software running on the CPU (i.e. GNU Radio) to determine when that should happen.
Is there a way to monitor the buffer so that I can determine when my first part of data bits have been sent?
Other than that, a method to asynchronously tell another another block that, yes, you've processed N samples, would be either to have a single block that just observes the outputs of both blocks that you want to synchronize and consumes an identical number of samples from both inputs, or to implement something using message passing. Again, my suspicion is that this is not a solution to your actual problem.

How to mix audio input devices in Qt

I'm new to Qt's multimedia library and in my application I want to mix audio from multiple input devices (e.g. microphone), in order to stream it via TCP.
As far as I know I have to obtain the specific QAudioDeviceInfo for all needed devices first - together with an according QAudioFormat object - and use this with QAudioInput. Then I simply call start() for every created QAudioInput object and read out pending bytes with readLine().
But how can I mix audio data of multiple devices to one buffer?
I am not sure if there is any Qt specific method / class to do this. However it's pretty simple to do it yourself.
The most basic way (assuming you are using PCM), you can simply add the two streams/buffers together word by word (if I recall they are 16-bit PCM words).
So if you have two input buffers:
int16 buff1[10];
int16 buff2[10];
int16 mixBuff[10];
// Fill them...
//... code goes here to read from the buffers ....
// Add them (effectively mix them)
for (int i = 0; i < 10; i++)
{
mixBuff[i] = buff1[i] + buff2[i];
}
Now, this is very crude and does not take any scaling into consideration. So imagine buff1 and buff2 both use 80% of the dynamic range (call this full volume, beyond which you get distortion), then when you add them together you will get number overrun (i.e. 16-bit max is 65535 so 50000 + 50000 will be a over run).
Each time you mix you effectively need half the two inputs (so 65535 / 2 + 65535 / 2 = 65535... i.e. when you add them up you can't overrun). So your mix code is like this:
for (int i = 0; i < 10; i++)
{
mixBuff[i] = (buff1[i] >> 1) + (buff2[i] >> 1);
}
There is much more you can do (noise removal etc...) but then the maths start getting a bit hairy. This is very simple. You can use the shift afterwards to increase / decrease volume as simple volume control if you want.
EDIT
One thing to note... you are using readline() (which the docs say reads the data out as ASCII). I always use read() which it does not state the "format" it is read out, but I am assuming binary. So this code may not work if you use readline() but I have never tried it. It works well for read(), you don't really want to be working in ASCII if you want to manipulate the data.

avoiding collisions when collapsing infinity lock-free buffer to circular-buffer

I'm solving two feeds arbitrate problem of FAST protocol.
Please don't worry if you not familar with it, my question is pretty general actually. But i'm adding problem description for those who interested (you can skip it).
Data in all UDP Feeds are disseminated in two identical feeds (A and B) on two different multicast IPs. It is strongly recommended that client receive and process both feeds because of possible UDP packet loss. Processing two identical feeds allows one to statistically decrease the probability of packet loss.
It is not specified in what particular feed (A or B) the message appears for the first time. To arbitrate these feeds one should use the message sequence number found in Preamble or in tag 34-MsgSeqNum. Utilization of the Preamble allows one to determine message sequence number without decoding of FAST message.
Processing messages from feeds A and B should be performed using the following algorithm:
Listen feeds A and B
Process messages according to their sequence numbers.
Ignore a message if one with the same sequence number was already processed before.
If the gap in sequence number appears, this indicates packet loss in both feeds (A and B). Client should initiate one of the Recovery process. But first of all client should wait a reasonable time, perhaps the lost packet will come a bit later due to packet reordering. UDP protocol can’t guarantee the delivery of packets in a sequence.
// tcp recover algorithm further
I wrote such very simple class. It preallocates all required classes and then first thread that receive particular seqNum can process it. Another thread will drop it later:
class MsgQueue
{
public:
MsgQueue();
~MsgQueue(void);
bool Lock(uint32_t msgSeqNum);
Msg& Get(uint32_t msgSeqNum);
void Commit(uint32_t msgSeqNum);
private:
void Process();
static const int QUEUE_LENGTH = 1000000;
// 0 - available for use; 1 - processing; 2 - ready
std::atomic<uint16_t> status[QUEUE_LENGTH];
Msg updates[QUEUE_LENGTH];
};
Implementation:
MsgQueue::MsgQueue()
{
memset(status, 0, sizeof(status));
}
MsgQueue::~MsgQueue(void)
{
}
// For the same msgSeqNum should return true to only one thread
bool MsgQueue::Lock(uint32_t msgSeqNum)
{
uint16_t expected = 0;
return status[msgSeqNum].compare_exchange_strong(expected, 1);
}
void MsgQueue::Commit(uint32_t msgSeqNum)
{
status[msgSeqNum] = 2;
Process();
}
// this method probably should be combined with "Lock" but please ignore! :)
Msg& MsgQueue::Get(uint32_t msgSeqNum)
{
return updates[msgSeqNum];
}
void MsgQueue::Process()
{
// ready packets must be processed,
}
Usage:
if (!msgQueue.Lock(seq)) {
return;
}
Msg msg = msgQueue.Get(seq);
msg.Ticker = "HP"
msg.Bid = 100;
msg.Offer = 101;
msgQueue.Commit(seq);
This works fine if we assume that QUEUE_LENGTH is infinity. Because in this case one msgSeqNum = one updates array item.
But I have to make buffer circular because it is not possible to store entire history (many millions of packets) and there are no reason to do so. Actually I need to buffer enough packets to reconstruct the session, and once session is reconstructed i can drop them.
But having circular buffer significantly complicates algorithm. For example assume that we have circular buffer of length 1000. And at the same time we try to process seqNum = 10 000 and seqNum = 11 000 (this is VERY unlikely but still possible). Both these packets will map to the array updates at index 0 and so collision occur. In such case buffer should 'drop' old packets and process new packets.
It's trivial to implement what I want using locks but writing lock-free code on circular-buffer that used from different threads is really complicated. So I welcome any suggestions and advice how to do that. Thanks!
I don't believe you can use a ring buffer. A hashed index can be used in the status[] array. Ie, hash = seq % 1000. The issue is that the sequence number is dictated by the network and you have no control over it's ordering. You wish to lock based on this sequence number. Your array doesn't need to be infinite, just the range of the sequence number; but that is probably larger than practical.
I am not sure what is happening when the sequence number is locked. Does this mean another thread is processing it? If so, you must maintain a sub-list for hash collisions to resolve the particular sequence number.
You may also consider an array size as a power of 2. For example, 1024 will allow hash = seq & 1023; which should be quite efficient.

Sending binary data over socket in c++

In C++ I have data structure something like this:
struct Data
{
int N;
double R;
char Name[20];
};
This Data I have to send over from a client to server on a different system (I have to send an array of Data structs, but i could send it just one by one). I would like to send it over as binary data, so that I could extract the data on the other end put it inside the same struct type.
If both (client and sever) are compiled with the same compiler the sizeof(Data) and all bit paddings within structure would be the same. But as server is 64bit running Linux and client could be even 32bit windows, the ordering of data within Data could be different.
Am I Right? What would be the best way of dealing with this the problem?
Assuming that all clients and servers will always be built with the same compiler for same architecture and OS is almost always a bad idea. It is better to write code that explicitly packs and unpacks the member of the structure as a stream of bytes with a specified ordering, or converts the data to non-binary formats (e.g. JSON or XML) that can be parsed on the other side.
You could use Boost.Serialization to marshal/unmarshal (i.e. transform your data from your source computer format to one that is suitable for transmission, and from that format to the one used by your destination computer) your data and Boost.Asio to handle communication.
Proper protocol definition would help you to solve this issue. Have a fixed length header followed by variable length data.
Something like this
Header***
[DataSize] - 2 bytes
[MSG ID] - 1 byte
Packet**
[N] - 4 bytes
[R] - 8 bytes
[Len] - 2 bytes
[Name] - Len bytes
read the bytes and fill the structure