Unable to find the reason for a "Broken pipe" error while sending continuous data chunks through a Beast websocket - C++

I am working on streaming audio recognition with the IBM Watson speech-to-text web service API. I have created a websocket with the Boost.Beast library (Beast 1.68.0) in C++11.
I have successfully connected to the IBM server, and I want to send 231,296 bytes of raw audio data to the server in the following manner:
{
"action": "start",
"content-type": "audio/l16;rate=44100"
}
websocket.binary(true);
<bytes of binary audio data 50,000 bytes>
<bytes of binary audio data 50,000 bytes>
<bytes of binary audio data 50,000 bytes>
<bytes of binary audio data 50,000 bytes>
<bytes of binary audio data 31,296 bytes>
websocket.binary(false);
{
"action": "stop"
}
The expected result from the IBM server is:
{"results": [
{"alternatives": [
{ "confidence": xxxx,
"transcript": "call Rohan Chauhan "
}],"final": true
}], "result_index": 0
}
But I am not getting the desired result; instead, the error says "Broken pipe":
DataSize is: 50000 | mIsLast is : 0
DataSize is: 50000 | mIsLast is : 0
what : Broken pipe
DataSize is: 50000 | mIsLast is : 0
what : Operation canceled
DataSize is: 50000 | mIsLast is : 0
what : Operation canceled
DataSize is: 31296 | mIsLast is : 0
what : Operation canceled
Here is my code, which is an adaptation of the example given in the Beast library.
Foo.hpp
class IbmWebsocketSession: public std::enable_shared_from_this<IbmWebsocketSession> {
protected:
char binarydata[50000];
std::string TextStart;
std::string TextStop;
public:
explicit IbmWebsocketSession(net::io_context& ioc, ssl::context& ctx, SttService* ibmWatsonobj) :
mResolver(ioc), mWebSocket(ioc, ctx) {
TextStart = "{\"action\":\"start\",\"content-type\": \"audio/l16;rate=44100\"}";
TextStop = "{\"action\":\"stop\"}";
}
/**********************************************************************
* Desc : Send start frame
**********************************************************************/
void send_start(beast::error_code ec);
/**********************************************************************
* Desc : Send Binary data
**********************************************************************/
void send_binary(beast::error_code ec);
/**********************************************************************
* Desc : Send Stop frame
**********************************************************************/
void send_stop(beast::error_code ec);
/**********************************************************************
* Desc : Read the file for binary data to be sent
**********************************************************************/
void readFile(char *bdata, unsigned int *Len, unsigned int *start_pos,bool *ReachedEOF);
};
Foo.cpp
void IbmWebsocketSession::on_ssl_handshake(beast::error_code ec) {
if(ec)
return fail(ec, "connect");
// Perform the websocket handshake
ws_.async_handshake_ex(host, "/speech-to-text/api/v1/recognize",
[Token](request_type& reqHead) { reqHead.insert(http::field::authorization, Token); },
bind(&IbmWebsocketSession::send_start, shared_from_this(), placeholders::_1));
}
void IbmWebsocketSession::send_start(beast::error_code ec){
if(ec)
return fail(ec, "ssl_handshake");
ws_.async_write(net::buffer(TextStart),
bind(&IbmWebsocketSession::send_binary, shared_from_this(),placeholders::_1));
}
void IbmWebsocketSession::send_binary(beast::error_code ec) {
if(ec)
return fail(ec, "send_start");
readFile(binarydata, &Datasize, &StartPos, &IsLast);
ws_.binary(true);
if (!IsLast) {
ws_.async_write(net::buffer(binarydata, Datasize),
bind(&IbmWebsocketSession::send_binary, shared_from_this(),
placeholders::_1));
} else {
IbmWebsocketSession::on_binarysent(ec);
}
}
void IbmWebsocketSession::on_binarysent(beast::error_code ec) {
if(ec)
return fail(ec, "send_binary");
ws_.binary(false);
ws_.async_write(net::buffer(TextStop),
bind(&IbmWebsocketSession::read_response, shared_from_this(), placeholders::_1));
}
void IbmWebsocketSession::readFile(char *bdata, unsigned int *Len, unsigned int *start_pos,bool *ReachedEOF) {
unsigned int end = 0;
unsigned int start = 0;
unsigned int length = 0;
// Creation of ifstream class object to read the file
ifstream infile(filepath, ifstream::binary);
if (infile) {
// Get the size of the file
infile.seekg(0, ios::end);
end = infile.tellg();
infile.seekg(*start_pos, ios::beg);
start = infile.tellg();
length = end - start;
}
if ((size_t) length < 150) {
*Len = (size_t) length;
*ReachedEOF = true;
// cout << "Reached end of File (last 150 bytes)" << endl;
} else if ((size_t) length <= 50000) { //Maximumbytes to send are 50000
*Len = (size_t) length;
*start_pos += (size_t) length;
*ReachedEOF = false;
infile.read(bdata, length);
} else {
*Len = 50000;
*start_pos += 50000;
*ReachedEOF = false;
infile.read(bdata, 50000);
}
infile.close();
}
Any suggestions here?

From Boost's documentation we have the following excerpt on websocket::async_write:
This function is used to asynchronously write a complete message. This
call always returns immediately. The asynchronous operation will
continue until one of the following conditions is true:
The complete message is written.
An error occurs.
So when you create the buffer object you pass in, net::buffer(TextStart) for example, only the buffer wrapper is copied; the lifetime of the underlying storage is your responsibility. Even after the initiating function returns, the async write may still be operating on that storage, as per the documentation, but its contents are no longer valid if it was a local variable.
To remedy this you could make your TextStart static, or declare it as a member of your class and pass that to boost::asio::buffer; there are plenty of examples of how to do that. Note that I only mention TextStart from the IbmWebsocketSession::send_start function, but the problem is pretty much the same throughout your code.
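To make the lifetime requirement concrete, here is a minimal sketch (the class name, member name and handler body are illustrative, not taken from the question): the bytes handed to net::buffer() live in the session object itself, and shared_from_this() keeps that object, and therefore the string, alive until the completion handler runs.
#include <boost/asio.hpp>
#include <boost/asio/ssl.hpp>
#include <boost/beast/websocket.hpp>
#include <boost/beast/websocket/ssl.hpp>
#include <memory>
#include <string>

namespace beast = boost::beast;
namespace net = boost::asio;
namespace ssl = boost::asio::ssl;
using tcp = net::ip::tcp;

class Session : public std::enable_shared_from_this<Session> {
    beast::websocket::stream<ssl::stream<tcp::socket>> ws_;
    // The message lives as long as the session, so the async_write below
    // never outlives the storage it refers to.
    std::string start_msg_ =
        "{\"action\":\"start\",\"content-type\":\"audio/l16;rate=44100\"}";

public:
    Session(net::io_context& ioc, ssl::context& ctx) : ws_(ioc, ctx) {}

    void send_start() {
        auto self = shared_from_this();          // keeps *this (and start_msg_) alive
        ws_.async_write(net::buffer(start_msg_), // refers to the member, not a temporary
            [self](beast::error_code ec, std::size_t /*bytes_transferred*/) {
                if (ec)
                    return;                      // report/handle the error here
                // start_msg_ stayed valid for the whole write; continue with the audio data
            });
    }
};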
From IBM Watson's API definition, the "Initiate a connection" request requires a certain format, which can then be represented as a string. You have the string but not the proper format, because of which the connection is being closed by the peer; you then keep writing to a closed socket, hence the broken pipe.
The initiate-connection message requires:
var message = {
action: 'start',
content-type: 'audio/l16;rate=22050'
};
This can be represented as the string TextStart = "action: 'start',\r\ncontent-type: 'audio\/l16;rate=44100'" according to your requirements.
Following on from the discussion in the chat, the OP resolved the issue by adding the code:
if (!IsLast ) {
ws_.async_write(net::buffer(binarydata, Datasize),
bind(&IbmWebsocketSession::send_binary, shared_from_this(),
placeholders::_1));
}
else {
if (mIbmWatsonobj->IsGstFileWriteDone()) { //checks for the file write completion
IbmWebsocketSession::on_binarysent(ec);
} else {
std::this_thread::sleep_for(std::chrono::seconds(1));
IbmWebsocketSession::send_binary(ec);
}
}
This, as established in the discussion, stems from the fact that more bytes were being sent before the file write of those same bytes had completed. The OP now verifies this before attempting to send more bytes.

Related

Boost C++ UDP Socket stops receive after N packages

I am sending UDP packages from a server to a client. On the server side I split the data into packages of 500 bytes and send them to the client. The client receives the packages, accumulates the received data and deserializes an object.
The problem is that the client receives 133 packages at most and then stops, as if nothing else was sent to the socket, even though the server sends the whole object (1238 packages). This problem exists only on Windows; it works perfectly under OSX.
Here is the server code sending packages:
// sends #buffer of size #length to #endpoint
// #buffer already contains a header, and the method splits #buffer into chunks and send it one by one
void server::send_package(char* buffer, int length, udp::endpoint endpoint){
if (length > BUFFER){
protocol::header header;
int dataLength = length - sizeof (header);
// copy header from buffer
memcpy(&header, buffer, sizeof(header));
header.isEnd = false;
int position = 0;
// allocate memory to collect data to send
char* data_to_send = new char[dataLength];
// copy data
memcpy(data_to_send, &buffer[sizeof(header)], dataLength);
header.totalPackages = dataLength/(BUFFER-sizeof (header));
// create chucks of data and send
while (position < dataLength){
int frame_size = BUFFER;
header.currentPackage++;
if (dataLength-position+sizeof (header) <= BUFFER) {
header.isEnd = true;
frame_size = dataLength-position+sizeof (header);
}
char* temp_buffer = new char[frame_size];
header.length = frame_size-sizeof(header);
// set the header of a chunk
memcpy(temp_buffer, &header, sizeof(header));
// set data to chunk
memcpy(&temp_buffer[sizeof (header)], &data_to_send[position], frame_size-sizeof(header));
// send chunk
socket->send_to(boost::asio::buffer(temp_buffer, frame_size), endpoint);
socket->wait(boost::asio::ip::tcp::socket::wait_write);
position += frame_size-sizeof(header);
}
} else {
socket->async_send_to(boost::asio::buffer(buffer, length), endpoint,
boost::bind(&server::release_sent_buffer,
this,
buffer, length)
);
}
}
Here is the client code receiving packages:
void connectionManager::handle_receive( const boost::system::error_code &error,
std::size_t size,
udp::endpoint* ep) {
if (size > 0) {
// _lock.try_lock();
protocol::header header;
memcpy(&header, &recv_buffer, sizeof(header));
logg("response from server received " + boost::asio::ip::address_v4(header.ip).to_string());
logg("received header:");
logg(protocol::getHeaderInfo(header));
std::stringstream ss;
ss << "header.length = " << header.length;
logg(ss.str().c_str());
udp::endpoint endpoint(boost::asio::ip::address_v4(header.ip), _server_port);
switch (header.command) {
case protocol::commands::server_instance_instruments_state_response: {
package_chain chain(size-sizeof(header));
memcpy(chain.data, &recv_buffer[sizeof(header)], size-sizeof(header));
packages[header.id].push_back(chain);
// at Windows machine the last package is #133. But 1248 packages expected.
// WHY????...
int packs = (packages[header.id].size());
if (header.isEnd) {
char* buf = getDataFromPackages(header.id, header.length);
std::stringstream str;
str << buf;
boost::archive::text_iarchive ar(str);
instance_plugin_information* inst_inf;
inst_inf = new instance_plugin_information();
try {
ar & inst_inf;
if (onPluginStateResponse != nullptr) {
onPluginStateResponse(*inst_inf);
}
} catch (const std::exception& e) {
}
}
break;
}
}
// We will hang on this line when package #133 received.
socket->wait(boost::asio::ip::tcp::socket::wait_read);
connectionManager::start_receive();
}
I just don't understand what I am missing. Why does the client receive exactly 133 packages (133 x 500 bytes) and then stop?
I have changed the code in many ways, but with no luck. The last thing I added is
socket->wait(boost::asio::ip::tcp::socket::wait_read);
before I call start_receive() again, and the program hangs on this line exactly when package #133 is received.
Please help. I am close to giving up and becoming a pizza delivery guy.

boost::asio: register fd for EPOLLIN / EPOLLOUT once and leave it registered

I have a tcp client which is serviced by a boost::asio::io_context running on a single thread. It is configured non-blocking.
Reads/writes to this client are only ever done within the context of this thread.
I am using async_wait to wait for the socket to become readable/writeable.
void Client::awaitReadable()
{
_socket.async_wait(tcp::socket::wait_read, std::bind_front(&Client::onReadable, this));
}
Whenever the socket becomes readable, my onReadable callback is fired, and I read all available data until I receive asio::error::would_block.
void Client::onReadable(boost::system::error_code ec)
{
if (!ec)
{
while (1) // drain the socket
{
const std::size_t len = _socket.read_some(_read_buf.writeBuf(), ec);
if (ec)
break;
else
_read_buf.advance(len);
}
}
if (ec == asio::error::would_block)
{
const std::size_t read = _read_cb(*this, _read_buf.readBuf());
_read_buf.dataRead(read);
awaitReadable(); // I have to await readable again
}
else
{
onDisconnected(ec);
}
}
Once I've drained the socket I then need to call awaitReadable again to re-register my onReadable callback.
This necessarily involves a call to epoll_ctl, which effectively changes absolutely nothing.
When writing to the socket, the process is similar.
First, if the socket is currently writeable, I attempt to send the data immediately. If, during the write, I receive asio::error::would_block, I buffer the remaining unsent data and call my awaitWriteable function:
void Client::write(Data buf)
{
if (_writeable)
{
const auto [ sent, ec ] = doWrite(buf); // calls awaitWriteable if would_block
if (ec == asio::error::would_block)
_write_buf.add(buf.data() + sent, buf.size() - sent);
}
else
{
_write_buf.add(buf); // will be sent when socket becomes writeable
}
}
The awaitWriteable function is very similar to the awaitReadable version
void Client::awaitWriteable()
{
_socket.async_wait(tcp::socket::wait_write, std::bind_front(&Client::onWriteable, this));
}
When the socket becomes writeable again I will be notified, and I will write more data to the socket.
void Client::onWriteable(boost::system::error_code ec)
{
if (!ec)
{
_writeable = true;
if (!_write_buf.empty())
{
const auto [ sent, ec ] = doWrite(_write_buf.writeBuf());
if (!ec)
_write_buf.sent(sent);
}
}
else
{
onDisconnected(ec);
}
}
The actual writing is factored out into a separate function, as it is called both by the "synchronous write" function and from the onWriteable callback:
std::pair<std::size_t, boost::system::error_code> Client::doWrite(Data buf)
{
boost::system::error_code ec;
std::size_t sent = _socket.write_some(buf, ec);
if (ec)
{
if (ec == asio::error::would_block)
awaitWriteable();
else
onDisconnected(ec);
}
return {sent, ec};
}
So the way reads work is
awaitReadable.
when readable, read everything until would_block.
repeat.
and the way writes work is
once connected awaitWriteable.
when writeable, set a flag true, and if any data is pending, send as much as possible.
if the send results in would_block then awaitWriteable again.
when a client wants to send data, if the socket is currently writeable then "synchronously" send as much as possible.
if the send results in would_block then buffer any unsent data and awaitWriteable again.
Question:
I would like to register my socket file descriptor with epoll, and leave it registered forever.
Is there any way to side-step this need to continually call awaitReadable/awaitWriteable?
You're mixing sync/async primitives. So at least the blanket claim "It is configured non-blocking" is inaccurate, because Asio has to switch it for you when you mix sync primitives.
Note: not all Asio-aware IO objects support this. E.g. Beast's tcp_stream (and ssl_stream) objects explicitly do not support mixing synchronous and asynchronous operations.
This necessarily involves a call to epoll_ctl, which effectively changes absolutely nothing.
Have you checked? Because it's up to the service implementation to decide how your handlers are serviced. It might be the case that fds are added and removed from the pollfd set. It might do cleverer things. It might not even use (e)poll on a particular system.
Regardless, is there something stopping you from using read operations directly in a loop? You can even use composed read operations, such as asio::async_read_until or asio::async_read with a CompletionCondition.
E.g. to read incoming data in a loop, returning whenever 1024 bytes or more have been received:
void read_loop() {
net::async_read(
_socket, _read_buf, net::transfer_at_least(1024),
[this](error_code ec, size_t xferred) {
std::cout << "Received " << xferred //
<< " (" << ec.message() << ")" << std::endl;
if (!ec)
read_loop();
});
}
Here's a live demo reading itself:
Live On Coliru
#include <boost/asio.hpp>
#include <iostream>
namespace net = boost::asio;
using boost::system::error_code;
using net::ip::tcp;
using namespace std::chrono_literals;
struct Client {
Client(net::any_io_executor ex, tcp::endpoint ep) : _socket(ex) {
_socket.connect(ep);
assert(_socket.is_open());
std::cout << "Connected " << ep << " from " << _socket.local_endpoint() << "\n";
}
void read_loop() {
net::async_read(
_socket, _read_buf, net::transfer_at_least(1024),
[this](error_code ec, size_t xferred) {
std::cout << "Received " << xferred //
<< " (" << ec.message() << ")" << std::endl;
if (!ec)
read_loop();
});
}
auto get_histo() const {
std::array<unsigned, 256> histo {0};
auto f = buffers_begin(_read_buf.data()),
l = buffers_end(_read_buf.data());
while (f != l)
++histo[uint8_t(*f++)];
return histo;
}
private:
net::streambuf _read_buf;
tcp::socket _socket;
};
int main() {
net::io_context ioc;
Client c(ioc.get_executor(), {{}, 8989});
c.read_loop();
ioc.run_for(10s); // time limit for online compilers
// do something witty with the result
auto histo = c.get_histo();
for (uint8_t ch : {'a','q','e','x'})
std::cout << "Frequency of '" << ch << "' was " << histo[ch] << "\n";
}
Prints
Connected 0.0.0.0:8989 from 127.0.0.1:48730
Received 1024 (Success)
Received 447 (End of file)
Frequency of 'a' was 38
Frequency of 'q' was 2
Frequency of 'e' was 92
Frequency of 'x' was 8
In about 10ms.
BONUS: Profiling epoll_ctl calls
Here is the same program eating a dictionary on my machine, while counting calls to epoll_ctl:
Note how only 3 epoll_ctl calls are ever issued:
Connected 0.0.0.0:8989 from 127.0.0.1:52974
Received 1024 (Success)
Received 1024 (Success)
Received 2048 (Success)
Received 4096 (Success)
Received 8192 (Success)
Received 16384 (Success)
Received 16384 (Success)
Received 16384 (Success)
Received 49152 (Success)
...
Received 65536 (Success)
Received 53562 (Success)
Received 0 (End of file)
Frequency of 'a' was 65630
Frequency of 'q' was 1492
Frequency of 'e' was 90579
Frequency of 'x' was 2139
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
0.00 0.000000 0 3 epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00 0.000000 3 total
Summary
Measure. Use async primitives to do the scheduling for you. The only reason to use async_wait in principle is when you have to call third-party code using the native_handle of the socket in response.
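For the write side, the same principle would look roughly like this: queue outgoing buffers and let a chain of asio::async_write calls drive the socket, instead of await-writeable plus write_some. A minimal sketch, not the code from the question; the _write_queue member and its type are assumptions:
// Assumed members: tcp::socket _socket; std::deque<std::vector<char>> _write_queue;
void Client::write(std::vector<char> data) {
    bool idle = _write_queue.empty();
    _write_queue.push_back(std::move(data));
    if (idle)                  // no async_write in flight, start the chain
        do_write();
}

void Client::do_write() {
    net::async_write(_socket, net::buffer(_write_queue.front()),
        [this](error_code ec, std::size_t /*bytes_transferred*/) {
            if (ec)
                return onDisconnected(ec);
            _write_queue.pop_front();   // that message is fully written
            if (!_write_queue.empty())
                do_write();             // keep draining queued messages
        });
}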

Boost.Asio async_write and strands

I'm using a strand to avoid concurrent writes on a TCP server using Boost.Asio. But it seems it only prevents concurrent execution of handlers.
Indeed, if I do two successive async_write calls, one with a very big packet and the other with a very small one, Wireshark shows interleaving. As async_write is composed of multiple calls to async_write_some, it seems that the handler of my second write is allowed to execute between two handlers of the first call, which is very bad for me.
Wireshark output: [Packet 1.1] [Packet 1.2] [Packet 2] [Packet 1.3] ... [Packet 1.x]
struct Command
{
// Header
uint64_t ticket_id; // UUID
uint32_t data_size; // size of data
// data
std::vector<unsigned char> m_internal_buffer;
};
typedef std::shared_ptr<Command> command_type;
void tcp_server::write(command_type cmd)
{
boost::asio::async_write(m_socket, boost::asio::buffer(cmd->getData(), cmd->getTotalPacketSize()),
boost::asio::bind_executor(m_write_strand,
[this, cmd](const boost::system::error_code& error, std::size_t bytes_transferred)
{
if (error)
{
// report
}
}
)
);
}
and the main:
int main()
{
tcp_server.write(big_packet); // Packet 1 = 10 MBytes !
tcp_server.write(small_packet); // Packet 2 = 64 kbytes
}
Is the strand not appropriate in my case?
P.S.: I saw this closely related topic here, but it does not cover the same use case in my opinion.
You have to make sure your async operation is initiated from the strand. Your code currently doesn't show this to be the case. Hopefully this helps; otherwise, post an MCVE.
So e.g.
void tcp_server::write(command_type cmd)
{
post(m_write_strand, [this, cmd] { this->do_write(cmd); });
}
Making up an MCVE from your question code:
Live On Coliru
#include <boost/asio.hpp>
using boost::asio::ip::tcp;
using Executor = boost::asio::thread_pool::executor_type;
struct command {
char const* getData() const { return ""; }
size_t getTotalPacketSize() const { return 1; }
};
using command_type = command*;
struct tcp_server {
tcp_server(Executor ex) : m_socket(ex), m_write_strand(ex)
{
// more?
}
void write(command_type cmd);
void do_write(command_type cmd);
tcp::socket m_socket;
boost::asio::strand<Executor> m_write_strand;
};
void tcp_server::write(command_type cmd)
{
post(m_write_strand, [this, cmd] { this->do_write(cmd); });
}
void tcp_server::do_write(command_type cmd)
{
boost::asio::async_write(
m_socket,
boost::asio::buffer(cmd->getData(), cmd->getTotalPacketSize()),
bind_executor(m_write_strand,
[/*this, cmd*/](boost::system::error_code error,
size_t bytes_transferred) {
if (error) {
// report
}
}));
}
int main() {
boost::asio::thread_pool ioc;
tcp_server tcp_server(ioc.get_executor());
command_type big_packet{}, small_packet{};
tcp_server.write(big_packet); // Packet 1 = 10 MBytes !
tcp_server.write(small_packet); // Packet 2 = 64 kbytes
ioc.join();
}

Sockets - keeping a socket open after data transfer

I have written simple server/client programs, in which the client sends some hardcoded data in small chunks to the server program, which is waiting for the data so that it can print it to the terminal. In the client, I'm calling send() in a loop while there is more data to send, and on the server, I'm doing the same with read(), that is, while the number of bytes returned is > 0, I continue to read.
This example works perfectly if I specifically call close() on the client's socket after I've finished sending, but if I don't, the server won't actually exit the read() loop until I close the client and break the connection. On the server side, I'm using:
while((bytesRead = read(socket, buffer, BUFFER_SIZE)) > 0)
Shouldn't bytesRead be 0 when all the data has been received? And if so, why will it not exit this loop until I close the socket? In my final application, it will be beneficial to keep the socket open between requests, but all of the sample code and information I can find calls close() immediately after sending data, which is not what I want.
What am I missing?
When the other end of the socket is connected to some other network system halfway around the world, the only way the receiving socket knows "when all the data has been received" is when the other side of the socket is closed. That's what tells the receiving side that "all the data has been received".
All that a socket knows about is that it's connected to some other socket endpoint. That's it. End of story. The socket has no special knowledge of the inner workings of the program that has the other side of the socket connection. Nor should it know. That happens to be the responsibility of the program that has the socket open, and not the socket itself.
If your program, on the receiving side, has knowledge -- by the virtue of knowing what data it is expected to receive -- that it has now received everything that it needs to receive, then it can close its end of the socket, and move on to the next task at hand.
You will have to incorporate in your program's logic, a way to determine, in some form or fashion, that all the data has been transmitted. The exact nature of that is going to be up to you to define. Perhaps, before sending all the data on the socket, your sending program will send in advance, on the same socket, the number of bytes that will be in the data to follow. Then, your receiving program reads the number of bytes first, followed by the data itself, and then knows that it has received everything, and can move on.
That's one simplistic approach. The exact details are up to you. Alternatively, you can also implement a timeout: set a timer, and if no data is received within some prescribed period of time, assume that there is no more.
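For instance, a minimal sketch of the length-prefix idea with plain POSIX sockets (the 4-byte network-order header and the helper names are just one way of doing it, not something your code already has):
#include <arpa/inet.h>   // htonl, ntohl
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <string>

// send exactly len bytes, looping over partial send() results
static bool send_all(int fd, const void* buf, size_t len) {
    const char* p = static_cast<const char*>(buf);
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n <= 0) return false;
        p += n;
        len -= n;
    }
    return true;
}

// read exactly len bytes, looping over partial read() results
static bool read_all(int fd, void* buf, size_t len) {
    char* p = static_cast<char*>(buf);
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0) return false;   // 0 means the peer closed, <0 means error
        p += n;
        len -= n;
    }
    return true;
}

// write a 4-byte size first, then the payload
bool send_message(int fd, const std::string& payload) {
    uint32_t len = htonl(static_cast<uint32_t>(payload.size()));
    return send_all(fd, &len, sizeof len) &&
           send_all(fd, payload.data(), payload.size());
}

// read the 4-byte size, then exactly that many bytes
bool recv_message(int fd, std::string& payload) {
    uint32_t len = 0;
    if (!read_all(fd, &len, sizeof len)) return false;
    payload.resize(ntohl(len));
    return payload.empty() || read_all(fd, &payload[0], payload.size());
}
The receiver never has to wait for the connection to close: once it has read the announced number of bytes, it knows the message is complete and can keep the socket open for the next one.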
You can set a flag on the recv call to prevent blocking.
One way to detect this easily is to wrap the recv call:
enum class read_result
{
// note: numerically in increasing order of severity
ok,
would_block,
end_of_file,
error,
};
template<std::size_t BufferLength>
read_result read(int socket_fd, char (&buffer)[BufferLength], int& bytes_read)
{
auto result = recv(socket_fd, buffer, BufferLength, MSG_DONTWAIT);
if (result > 0)
{
bytes_read = static_cast<int>(result);
return read_result::ok;
}
else if (result == 0)
{
return read_result::end_of_file;
}
else {
auto err = errno;
if (err == EAGAIN or err == EWOULDBLOCK) {
return read_result::would_block;
}
else {
return read_result ::error;
}
}
}
One use case might be:
#include <unistd.h>
#include <sys/socket.h>
#include <cstdlib>
#include <cerrno>
#include <iostream>
enum class read_result
{
// note: numerically in increasing order of severity
ok,
would_block,
end_of_file,
error,
};
template<std::size_t BufferLength>
read_result read(int socket_fd, char (&buffer)[BufferLength], int& bytes_read)
{
auto result = recv(socket_fd, buffer, BufferLength, MSG_DONTWAIT);
if (result > 0)
{
bytes_read = static_cast<int>(result);
return read_result::ok;
}
else if (result == 0)
{
return read_result::end_of_file;
}
else {
auto err = errno;
if (err == EAGAIN or err == EWOULDBLOCK) {
return read_result::would_block;
}
else {
return read_result ::error;
}
}
}
struct keep_reading
{
keep_reading& operator=(read_result result)
{
result_ = result;
return *this;
}
operator bool() const {
return result_ < read_result::end_of_file;
}
auto get_result() const -> read_result { return result_; }
private:
read_result result_ = read_result::ok;
};
int main()
{
int socket; // = open my socket and wait for it to be connected etc
char buffer [1024];
int bytes_read = 0;
keep_reading should_keep_reading;
while(should_keep_reading = read(socket, buffer, bytes_read))
{
if (should_keep_reading.get_result() != read_result::would_block) {
// read things here
}
else {
// idle processing here
}
}
std::cout << "reason for stopping: " << should_keep_reading.get_result() << std::endl;
}

c++ Protocol buffer sending over network [duplicate]

I'm trying to read / write multiple Protocol Buffers messages from files, in both C++ and Java. Google suggests writing length prefixes before the messages, but there's no way to do that by default (that I could see).
However, the Java API in version 2.1.0 received a set of "Delimited" I/O functions which apparently do that job:
parseDelimitedFrom
mergeDelimitedFrom
writeDelimitedTo
Are there C++ equivalents? And if not, what's the wire format for the size prefixes the Java API attaches, so I can parse those messages in C++?
Update:
These now exist in google/protobuf/util/delimited_message_util.h as of v3.3.0.
I'm a bit late to the party here, but the below implementations include some optimizations missing from the other answers and will not fail after 64MB of input (though it still enforces the 64MB limit on each individual message, just not on the whole stream).
(I am the author of the C++ and Java protobuf libraries, but I no longer work for Google. Sorry that this code never made it into the official lib. This is what it would look like if it had.)
bool writeDelimitedTo(
const google::protobuf::MessageLite& message,
google::protobuf::io::ZeroCopyOutputStream* rawOutput) {
// We create a new coded stream for each message. Don't worry, this is fast.
google::protobuf::io::CodedOutputStream output(rawOutput);
// Write the size.
const int size = message.ByteSize();
output.WriteVarint32(size);
uint8_t* buffer = output.GetDirectBufferForNBytesAndAdvance(size);
if (buffer != NULL) {
// Optimization: The message fits in one buffer, so use the faster
// direct-to-array serialization path.
message.SerializeWithCachedSizesToArray(buffer);
} else {
// Slightly-slower path when the message is multiple buffers.
message.SerializeWithCachedSizes(&output);
if (output.HadError()) return false;
}
return true;
}
bool readDelimitedFrom(
google::protobuf::io::ZeroCopyInputStream* rawInput,
google::protobuf::MessageLite* message) {
// We create a new coded stream for each message. Don't worry, this is fast,
// and it makes sure the 64MB total size limit is imposed per-message rather
// than on the whole stream. (See the CodedInputStream interface for more
// info on this limit.)
google::protobuf::io::CodedInputStream input(rawInput);
// Read the size.
uint32_t size;
if (!input.ReadVarint32(&size)) return false;
// Tell the stream not to read beyond that size.
google::protobuf::io::CodedInputStream::Limit limit =
input.PushLimit(size);
// Parse the message.
if (!message->MergeFromCodedStream(&input)) return false;
if (!input.ConsumedEntireMessage()) return false;
// Release the limit.
input.PopLimit(limit);
return true;
}
Okay, so I haven't been able to find top-level C++ functions implementing what I need, but some spelunking through the Java API reference turned up the following, inside the MessageLite interface:
void writeDelimitedTo(OutputStream output)
/* Like writeTo(OutputStream), but writes the size of
the message as a varint before writing the data. */
So the Java size prefix is a (Protocol Buffers) varint!
Armed with that information, I went digging through the C++ API and found the CodedStream header, which has these:
bool CodedInputStream::ReadVarint32(uint32 * value)
void CodedOutputStream::WriteVarint32(uint32 value)
Using those, I should be able to roll my own C++ functions that do the job.
They should really add this to the main Message API though; it's missing functionality considering Java has it, and so does Marc Gravell's excellent protobuf-net C# port (via SerializeWithLengthPrefix and DeserializeWithLengthPrefix).
I solved the same problem using CodedOutputStream/ArrayOutputStream to write the message (with the size) and CodedInputStream/ArrayInputStream to read the message (with the size).
For example, the following pseudo-code writes the message size followed by the message:
const unsigned bufLength = 256;
unsigned char buffer[bufLength];
Message protoMessage;
google::protobuf::io::ArrayOutputStream arrayOutput(buffer, bufLength);
google::protobuf::io::CodedOutputStream codedOutput(&arrayOutput);
codedOutput.WriteLittleEndian32(protoMessage.ByteSize());
protoMessage.SerializeToCodedStream(&codedOutput);
When writing you should also check that your buffer is large enough to fit the message (including the size). And when reading, you should check that your buffer contains a whole message (including the size).
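The matching read side, in the same pseudo-code style (assuming the buffer already holds one complete size-prefixed message), would look roughly like this:
google::protobuf::io::ArrayInputStream arrayInput(buffer, bufLength);
google::protobuf::io::CodedInputStream codedInput(&arrayInput);

uint32_t messageSize = 0;
codedInput.ReadLittleEndian32(&messageSize);   // the fixed 4-byte size written above

// Restrict parsing to exactly messageSize bytes so the parser does not
// run into whatever follows in the buffer.
google::protobuf::io::CodedInputStream::Limit limit = codedInput.PushLimit(messageSize);
Message protoMessage;
protoMessage.ParseFromCodedStream(&codedInput);
codedInput.PopLimit(limit);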
It definitely would be handy if they added convenience methods to C++ API similar to those provided by the Java API.
IstreamInputStream is very fragile to EOFs and other errors that easily occur when used together with std::istream. After this, the protobuf streams are permanently damaged and any already-used buffer data is destroyed. There is proper support for reading from traditional streams in protobuf.
Implement google::protobuf::io::CopyingInputStream and use that together with CopyingInputStreamAdaptor. Do the same for the output variants.
In practice a parsing call ends up in google::protobuf::io::CopyingInputStream::Read(void* buffer, int size), where a buffer is given. The only thing left to do is read into it somehow.
Here's an example for use with Asio synchronized streams (SyncReadStream/SyncWriteStream):
#include <google/protobuf/io/zero_copy_stream_impl_lite.h>
using namespace google::protobuf::io;
template <typename SyncReadStream>
class AsioInputStream : public CopyingInputStream {
public:
AsioInputStream(SyncReadStream& sock);
int Read(void* buffer, int size);
private:
SyncReadStream& m_Socket;
};
template <typename SyncReadStream>
AsioInputStream<SyncReadStream>::AsioInputStream(SyncReadStream& sock) :
m_Socket(sock) {}
template <typename SyncReadStream>
int
AsioInputStream<SyncReadStream>::Read(void* buffer, int size)
{
std::size_t bytes_read;
boost::system::error_code ec;
bytes_read = m_Socket.read_some(boost::asio::buffer(buffer, size), ec);
if(!ec) {
return bytes_read;
} else if (ec == boost::asio::error::eof) {
return 0;
} else {
return -1;
}
}
template <typename SyncWriteStream>
class AsioOutputStream : public CopyingOutputStream {
public:
AsioOutputStream(SyncWriteStream& sock);
bool Write(const void* buffer, int size);
private:
SyncWriteStream& m_Socket;
};
template <typename SyncWriteStream>
AsioOutputStream<SyncWriteStream>::AsioOutputStream(SyncWriteStream& sock) :
m_Socket(sock) {}
template <typename SyncWriteStream>
bool
AsioOutputStream<SyncWriteStream>::Write(const void* buffer, int size)
{
boost::system::error_code ec;
m_Socket.write_some(boost::asio::buffer(buffer, size), ec);
return !ec;
}
Usage:
AsioInputStream<boost::asio::ip::tcp::socket> ais(m_Socket); // Where m_Socket is an instance of boost::asio::ip::tcp::socket
CopyingInputStreamAdaptor cis_adp(&ais);
CodedInputStream cis(&cis_adp);
Message protoMessage;
uint32_t msg_size;
/* Read message size */
if(!cis.ReadVarint32(&msg_size)) {
// Handle error
}
/* Make sure not to read beyond limit of message */
CodedInputStream::Limit msg_limit = cis.PushLimit(msg_size);
if(!protoMessage.ParseFromCodedStream(&cis)) {
// Handle error
}
/* Remove limit */
cis.PopLimit(msg_limit);
Here you go:
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <google/protobuf/io/coded_stream.h>
using namespace google::protobuf::io;
class FASWriter
{
std::ofstream mFs;
OstreamOutputStream *_OstreamOutputStream;
CodedOutputStream *_CodedOutputStream;
public:
FASWriter(const std::string &file) : mFs(file,std::ios::out | std::ios::binary)
{
assert(mFs.good());
_OstreamOutputStream = new OstreamOutputStream(&mFs);
_CodedOutputStream = new CodedOutputStream(_OstreamOutputStream);
}
inline void operator()(const ::google::protobuf::Message &msg)
{
_CodedOutputStream->WriteVarint32(msg.ByteSize());
if ( !msg.SerializeToCodedStream(_CodedOutputStream) )
std::cout << "SerializeToCodedStream error " << std::endl;
}
~FASWriter()
{
delete _CodedOutputStream;
delete _OstreamOutputStream;
mFs.close();
}
};
class FASReader
{
std::ifstream mFs;
IstreamInputStream *_IstreamInputStream;
CodedInputStream *_CodedInputStream;
public:
FASReader(const std::string &file) : mFs(file,std::ios::in | std::ios::binary)
{
assert(mFs.good());
_IstreamInputStream = new IstreamInputStream(&mFs);
_CodedInputStream = new CodedInputStream(_IstreamInputStream);
}
template<class T>
bool ReadNext()
{
T msg;
uint32_t size;
bool ret;
if ( ret = _CodedInputStream->ReadVarint32(&size) )
{
CodedInputStream::Limit msgLimit = _CodedInputStream->PushLimit(size);
if ( ret = msg.ParseFromCodedStream(_CodedInputStream) )
{
_CodedInputStream->PopLimit(msgLimit);
std::cout << "FASReader ReadNext: " << msg.DebugString() << std::endl;
}
}
return ret;
}
~FASReader()
{
delete _CodedInputStream;
delete _IstreamInputStream;
mFs.close();
}
};
I ran into the same issue in both C++ and Python.
For the C++ version, I used a mix of the code Kenton Varda posted on this thread and the code from the pull request he sent to the protobuf team (because the version posted here doesn't handle EOF while the one he sent to github does).
#include <google/protobuf/message_lite.h>
#include <google/protobuf/io/zero_copy_stream.h>
#include <google/protobuf/io/coded_stream.h>
bool writeDelimitedTo(const google::protobuf::MessageLite& message,
google::protobuf::io::ZeroCopyOutputStream* rawOutput)
{
// We create a new coded stream for each message. Don't worry, this is fast.
google::protobuf::io::CodedOutputStream output(rawOutput);
// Write the size.
const int size = message.ByteSize();
output.WriteVarint32(size);
uint8_t* buffer = output.GetDirectBufferForNBytesAndAdvance(size);
if (buffer != NULL)
{
// Optimization: The message fits in one buffer, so use the faster
// direct-to-array serialization path.
message.SerializeWithCachedSizesToArray(buffer);
}
else
{
// Slightly-slower path when the message is multiple buffers.
message.SerializeWithCachedSizes(&output);
if (output.HadError())
return false;
}
return true;
}
bool readDelimitedFrom(google::protobuf::io::ZeroCopyInputStream* rawInput, google::protobuf::MessageLite* message, bool* clean_eof)
{
// We create a new coded stream for each message. Don't worry, this is fast,
// and it makes sure the 64MB total size limit is imposed per-message rather
// than on the whole stream. (See the CodedInputStream interface for more
// info on this limit.)
google::protobuf::io::CodedInputStream input(rawInput);
const int start = input.CurrentPosition();
if (clean_eof)
*clean_eof = false;
// Read the size.
uint32_t size;
if (!input.ReadVarint32(&size))
{
if (clean_eof)
*clean_eof = input.CurrentPosition() == start;
return false;
}
// Tell the stream not to read beyond that size.
google::protobuf::io::CodedInputStream::Limit limit = input.PushLimit(size);
// Parse the message.
if (!message->MergeFromCodedStream(&input)) return false;
if (!input.ConsumedEntireMessage()) return false;
// Release the limit.
input.PopLimit(limit);
return true;
}
And here is my python2 implementation:
from google.protobuf.internal import encoder
from google.protobuf.internal import decoder
#I had to implement this because the tools in google.protobuf.internal.decoder
#read from a buffer, not from a file-like object
def readRawVarint32(stream):
    mask = 0x80 # (1 << 7)
    raw_varint32 = []
    while 1:
        b = stream.read(1)
        #eof
        if b == "":
            break
        raw_varint32.append(b)
        if not (ord(b) & mask):
            #we found a byte starting with a 0, which means it's the last byte of this varint
            break
    return raw_varint32
def writeDelimitedTo(message, stream):
    message_str = message.SerializeToString()
    delimiter = encoder._VarintBytes(len(message_str))
    stream.write(delimiter + message_str)

def readDelimitedFrom(MessageType, stream):
    raw_varint32 = readRawVarint32(stream)
    message = None
    if raw_varint32:
        size, _ = decoder._DecodeVarint32(raw_varint32, 0)
        data = stream.read(size)
        if len(data) < size:
            raise Exception("Unexpected end of file")
        message = MessageType()
        message.ParseFromString(data)
    return message

#In place version that takes an already built protobuf object
#In my tests, this is around 20% faster than the other version
#of readDelimitedFrom()
def readDelimitedFrom_inplace(message, stream):
    raw_varint32 = readRawVarint32(stream)
    if raw_varint32:
        size, _ = decoder._DecodeVarint32(raw_varint32, 0)
        data = stream.read(size)
        if len(data) < size:
            raise Exception("Unexpected end of file")
        message.ParseFromString(data)
        return message
    else:
        return None
It might not be the best looking code and I'm sure it can be refactored a fair bit, but at least that should show you one way to do it.
Now the big problem: It's SLOW.
Even when using the C++ implementation of python-protobuf, it's an order of magnitude slower than pure C++. I have a benchmark where I read 10M protobuf messages of ~30 bytes each from a file. It takes ~0.9s in C++, and 35s in Python.
One way to make it a bit faster would be to re-implement the varint decoder so it reads from a file and decodes in one go, instead of reading from a file and then decoding as this code currently does (profiling shows that a significant amount of time is spent in the varint encoder/decoder). But needless to say, that alone is not enough to close the gap between the Python version and the C++ version.
Any idea to make it faster is very welcome :)
Just for completeness, I post here an up-to-date version that works with the master version of protobuf and Python 3.
For the C++ version it is sufficient to use the utils in delimited_message_util.h; here is an MWE:
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <google/protobuf/util/delimited_message_util.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
template <typename T>
bool writeManyToFile(std::deque<T> messages, std::string filename) {
int outfd = open(filename.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
google::protobuf::io::FileOutputStream fout(outfd);
bool success = true;
for (auto msg: messages) {
success = google::protobuf::util::SerializeDelimitedToZeroCopyStream(
msg, &fout);
if (! success) {
std::cout << "Writing Failed" << std::endl;
break;
}
}
fout.Close();
close(outfd);
return success;
}
template <typename T>
std::deque<T> readManyFromFile(std::string filename) {
int infd = open(filename.c_str(), O_RDONLY);
google::protobuf::io::FileInputStream fin(infd);
bool keep = true;
bool clean_eof = true;
std::deque<T> out;
while (keep) {
T msg;
keep = google::protobuf::util::ParseDelimitedFromZeroCopyStream(
&msg, &fin, &clean_eof);
if (keep)
out.push_back(msg);
}
fin.Close();
close(infd);
return out;
}
For the Python 3 version, building on @fireboot's answer, the only thing that needed modification is the decoding of raw_varint32:
def getSize(raw_varint32):
    result = 0
    shift = 0
    for b in raw_varint32:
        result |= ((ord(b) & 0x7f) << shift)
        shift += 7
    return result

def readDelimitedFrom(MessageType, stream):
    raw_varint32 = readRawVarint32(stream)
    message = None
    if raw_varint32:
        size = getSize(raw_varint32)
        data = stream.read(size)
        if len(data) < size:
            raise Exception("Unexpected end of file")
        message = MessageType()
        message.ParseFromString(data)
    return message
I was also looking for a solution for this. Here's the core of our solution, assuming some Java code wrote many MyRecord messages with writeDelimitedTo into a file. Open the file and loop, doing:
if(someCodedInputStream->ReadVarint32(&bytes)) {
CodedInputStream::Limit msgLimit = someCodedInputStream->PushLimit(bytes);
if(myRecord->ParseFromCodedStream(someCodedInputStream)) {
//do your stuff with the parsed MyRecord instance
} else {
//handle parse error
}
someCodedInputStream->PopLimit(msgLimit);
} else {
//maybe end of file
}
Hope it helps.
Working with an Objective-C version of protocol buffers, I ran into this exact issue. When sending from the iOS client to a Java-based server that uses parseDelimitedFrom, which expects the length as the first byte, I needed to call writeRawByte on the CodedOutputStream first. Posting here to hopefully help others that run into this issue. While working through this issue, one would think that Google's protobufs would come with a simple flag which does this for you...
Request* request = [rBuild build];
[self sendMessage:request];
}
- (void) sendMessage:(Request *) request {
//** get length
NSData* n = [request data];
uint8_t len = [n length];
PBCodedOutputStream* os = [PBCodedOutputStream streamWithOutputStream:outputStream];
//** prepend it to message, such that Request.parseDelimitedFrom(in) can parse it properly
[os writeRawByte:len];
[request writeToCodedOutputStream:os];
[os flush];
}
Since I'm not allowed to write this as a comment to Kenton Varda's answer above: I believe there is a bug in the code he posted (as well as in other answers which have been provided). The following code:
...
google::protobuf::io::CodedInputStream input(rawInput);
// Read the size.
uint32_t size;
if (!input.ReadVarint32(&size)) return false;
// Tell the stream not to read beyond that size.
google::protobuf::io::CodedInputStream::Limit limit =
input.PushLimit(size);
...
sets an incorrect limit because it does not take into account the size of the varint32 which has already been read from input. This can result in data loss/corruption as additional bytes are read from the stream which may be part of the next message. The usual way of handling this correctly is to delete the CodedInputStream used to read the size and create a new one for reading the payload:
...
uint32_t size;
{
google::protobuf::io::CodedInputStream input(rawInput);
// Read the size.
if (!input.ReadVarint32(&size)) return false;
}
google::protobuf::io::CodedInputStream input(rawInput);
// Tell the stream not to read beyond that size.
google::protobuf::io::CodedInputStream::Limit limit =
input.PushLimit(size);
...
You can use getline for reading a string from a stream, using the specified delimiter:
istream& getline ( istream& is, string& str, char delim );
(defined in the <string> header)