Pack the contents of arr into a binary sequence in Crystal - crystal-lang

Is there a standard way to achieve the same result as in Ruby with Array#pack method:
[1,2].pack "LL"
=> "\x01\x00\x00\x00\x02\x00\x00\x00"

Not yet, and most likely there won't be. The reason is that the result of pack is usually sent to an IO (do you have any other use case in mind?), so instead of allocating the result in memory, we're thinking about providing equivalent methods on IO itself to send the data directly to the socket, file, etc...
It's not written in stone and still on the drawing board, but avoiding unnecessary intermediate objects in memory is one of the design principles in Crystal.

Related

Should I use a stream or a container when working with network, binary data and serialization?

I am working on a TCP server using boost asio and I got lost with choosing the best data type to work with when dealing with byte buffers.
Currently I am using std::vector<char> for everything. One of the reasons is that most asio examples use vectors or arrays. I receive data from the network and put it in a buffer vector. Once a packet is available, it is extracted from the buffer and decrypted/decompressed if needed (both operations may increase the amount of data). Then multiple messages are extracted from the payload.
I am not happy with this solution because it involves inserting and removing data from vectors constantly, but it does the job.
Now I need to work on data serialization. There is no easy way to read or write arbitrary data types from a char vector, so I ended up implementing a "buffer" that hides a vector inside and allows writing (a wrapper for insert) and reading (a wrapper for casting) from it. Then I can write uint16 code; buffer >> code; and also add serialization/deserialization methods to other objects while keeping things simple.
The thing is that every time I think about this I feel like I am using the wrong data type as container for the binary data. Reasons are:
Streams already do a good job as a potentially endless source of data input or sink of data output. While behind the scenes this may still involve inserting and removing data, it probably does a better job than a char vector.
Streams already allow reading and writing basic data types, so I don't have to reinvent the wheel.
There is no need to access a specific position in the data. Usually I need to read or write sequentially.
In this case, are streams the best choice, or is there something I am not seeing? And if so, is stringstream the one I should use?
Any reasons to avoid streams and work only with containers?
PS: I cannot use boost serialization or any other existing solution because I don't have control over the network protocol.
Your approach seems fine. You might consider a deque instead of a vector if you're doing a lot of appending at the end and erasing from the front, but if you use circular-buffer logic while iterating then this doesn't matter either.
You could switch to a stream, but then you're completely at the mercy of the standard library, its annoyances/oddities, and the semantics of its formatted extraction routines — if these are insufficient then you have to extract N bytes and do your own reinterpretation anyway, so you're back to square one but with added copying and indirection.
You say you don't need random access, so that's another reason not to care either way. Personally, I like to have random access in case I need to resync, seek ahead, or seek behind, or even just want better capabilities during debugging without suddenly having to refactor all my buffer code.
I don't think there's any more specific answer to this in the general case.

std::string vs. byte buffer (difference in c++)

I have a project where I transfer data between client and server using boost.asio sockets. Once one side of the connection receives data, it converts it into a std::vector of std::strings, which then gets passed on to the actual recipient object of the data via previously defined "callback" functions. That works fine so far; only, at this point I am using methods like atoi() and to_string to convert data types other than strings into a sendable format and back. This method is of course a bit wasteful in terms of network usage (especially when transferring bigger amounts of data than just single ints and floats). Therefore I'd like to serialize and deserialize the data. Since, effectively, any serialization method will produce a byte array or buffer, it would be convenient for me to just use std::string instead. Is there any disadvantage to doing that? I don't see why there should be one, since strings should be nothing more than byte arrays.
In terms of functionality, there's no real difference.
Both for performance reasons and for code clarity reasons, however, I would recommend using std::vector<uint8_t> instead, as it makes it far more clear to anyone maintaining the code that it's a sequence of bytes, not a String.
You should use std::string when you work with strings; when you work with binary blobs you are better off with std::vector<uint8_t>. There are many benefits:
your intention is clear, so the code is less error-prone
you will not pass your binary buffer by mistake to a function that expects a std::string
you can overload operator<<(std::ostream&) for this type to print the blob in a proper format (usually a hex dump). It is very unlikely that you would want to print a binary blob as a string.
there could be more. The only benefit of std::string that I can see is that you do not need a typedef.
You're right. Strings are nothing more than byte arrays. std::string is just a convenient way to manage the buffer array that represents the string. That's it!
There's no disadvantage to using std::string unless you are working on something REALLY REALLY performance-critical, like a kernel, for example... then working with std::string would have considerable overhead. Besides that, feel free to use it.
--
Behind the scenes, a std::string needs to do a bunch of checks about the state of the string in order to decide whether it will use the small-string optimization or not. Today pretty much all compilers implement small-string optimization. They all use different techniques, but basically the string needs to test bit flags that tell whether its data lives on the stack or the heap. This overhead doesn't exist if you use a plain char[]. But again, unless you are working on something REALLY critical, like a kernel, you won't notice anything, and std::string is much more convenient.
Again, this is just ONE of the things that happens under the hood, just as an example to show the difference of them.
Depending on how often you're firing network messages, std::string should be fine. It's a convenience class that handles a lot of char work for you. If you have a lot of data to push, though, it might be worth using a plain char array and converting it to bytes directly, just to minimise the extra overhead std::string has.
Edit: if someone could comment and point out why you think my answer is bad, that'd be great and help me learn too.

C++ - Managing References in Disk Based Vector

I am developing a set of vector classes that all derive from an abstract vector. I am doing this so that, in the software that makes use of these vectors, we can quickly switch between them without any code breaking (or at least minimize failures, but my goal is full compatibility). All of the vectors share the same interface.
I am working on a Disk Based Vector that mostly matches the STL vector interface. I am doing this because we need to handle large files that do not fit in memory and contain various formats of data. The Disk Vector handles data read/write to disk by using template specialization/polymorphism of serialization and deserialization classes. The data serialization and deserialization has been tested, and it works (so far). My problem occurs when dealing with references to the data.
For example,
Given a DiskVector dv, a call to dv[10] would get a pointer to a spot on disk, seek there, and read out the char stream. This stream gets passed to a deserializer which converts the byte stream into the appropriate data type. Once I have the value, I return it.
This is where I run into a problem. In the STL, operator[] returns a reference, so in order to match that style, I need to return a reference too. What I do is store the value in an unordered_map under the given index (in this example, 10), then return a reference to the value in the unordered_map.
If this continued without cleanup, the purpose of the DiskVector would be lost, because all the data would just end up loaded into memory, which is bad due to the data size. So I clean up this map by deleting entries later on when other calls are made. Unfortunately, if a user decides to hold on to such a reference for a long time and it then gets deleted by the DiskVector, we have a problem.
So my questions
Is there a way to see if any other references to a certain instance are in use?
Is there a better way to solve this while still maintaining the polymorphic style for reasons described at the beginning?
Is it possible to construct a special class that would behave as a reference, but handle the disk IO dynamically so I could just return that instead?
Any other ideas?
So a better solution to what I was trying to do is to use SQLite as the backend for the storage: use BLOBs as the column types for the key and value columns. This is the approach I am taking now. That said, in order to get it to work well, I need to use what cdhowie posted in the comments on my question.

asio: best way to store a message to be broadcast

I want to make a buffer of characters, write to it using sprintf, then pass it to multiple calls of async_write() (i.e. distribute it to a set of clients). My question is what is the best data structure to use for this? If there are compromises then I guess the priorities for defining "best" would be:
fewer CPU cycles
code clarity
less memory usage
Here is what I have currently, that appears to work:
void broadcast(){
    char buf[512];
    sprintf(buf,"Hello %s","World!");
    boost::shared_ptr<std::string> msg(new std::string(buf));
    msg->append(1,0); //NUL byte at the end
    for(std::vector< boost::shared_ptr<client_session> >::iterator i=clients.begin();
        i!=clients.end();++i) (*i)->write(buf);
}
Then:
void client_session::write(boost::shared_ptr<std::string> msg){
    if(!socket->is_open())return;
    boost::asio::async_write(*socket,
        boost::asio::buffer(*msg),
        boost::bind(&client_session::handle_write, shared_from_this(),_1,_2,msg)
    );
}
NOTES:
Typical message size is going to be less than 64 bytes; the 512 buffer size is just paranoia.
I pass a NUL byte to mark the end of each message; this is part of the protocol.
msg has to outlive my first code snippet (an asio requirement), hence the use of a shared pointer.
I think I can do better than this on all my criteria. I wondered about using boost::shared_array? Or creating an asio::buffer (wrapped in a smart pointer) directly from my char buf[512]? But reading the docs on these and other choices left me overwhelmed with all the possibilities.
Also, in my current code I pass msg as a parameter to handle_write(), to ensure the smart pointer is not released until handle_write() is reached. That is required isn't it?
UPDATE: If you can argue that it is better overall, I'm open to replacing sprintf with a std::stringstream or similar. The point of the question is that I need to compose a message and then broadcast it, and I want to do this efficiently.
UPDATE #2 (Feb 26 2012): I appreciate the trouble people have gone to post answers, but I feel none of them has really answered the question. No-one has posted code showing a better way, nor given any numbers to support them. In fact I'm getting the impression that people think the current approach is as good as it gets.
First of all, note that you are passing your raw buffer instead of your message to the write function; I assume you did not mean to do that?
If you're planning to send plain-text messages, you could simply use std::string and std::stringstream to begin with, no need to pass fixed-size arrays.
If you need to do more binary/bytewise formatting, I would certainly start by replacing that fixed-size array with a vector of chars. In that case I also wouldn't take the detour of converting it to a string first, but construct the asio buffer directly from the byte vector. If you do not have to work with a predefined protocol, an even better solution is to use something like Protocol Buffers or Thrift or any viable alternative. That way you do not have to worry about things like endianness, repetition, variable-length items, backwards compatibility, ... .
The shared_ptr trick is indeed necessary: you do need to store the data referenced by the buffer somewhere until the buffer is consumed. Don't forget there are alternatives that may be clearer, like simply storing it in the client_session object itself. Whether this is feasible depends a bit on how your messaging objects are constructed ;).
You could store a std::list<boost::shared_ptr<std::string> > in your client_session object, and have client_session::write() do a push_back() on it. I think that is cleverly avoiding the functionality of boost.asio, though.
As I understand it, you need to send the same messages to many clients, so the implementation will be a bit more involved.
I would recommend preparing a message as a boost::shared_ptr<std::string> (as @KillianDS recommended) to avoid additional memory usage and copying from your char buf[512]; (that buffer is not safe in any case: you cannot be sure how your program will evolve in the future and whether this capacity will always be sufficient).
Then push this message onto each client's internal std::queue. If the queue is empty and no write is pending (for this particular client; use a boolean flag to track this), pop the message from the queue and async_write it to the socket, passing the shared_ptr as a parameter to the completion handler (the functor you pass to async_write). Once the completion handler is called, you can take the next message from the queue. The shared_ptr reference counter will keep the message alive until the last client has successfully sent it to its socket.
In addition, I would recommend limiting the maximum queue size to slow down message creation when the network is too slow.
EDIT
Usually sprintf is more efficient, at the cost of safety. If performance is critical and std::stringstream is a bottleneck, you can still use sprintf with std::string:
std::string buf(512, '\0');
sprintf(&buf[0],"Hello %s","World!");
Please note that before C++11, std::string was not guaranteed to store its data in a contiguous memory block, as opposed to std::vector (C++11 added that guarantee). In practice, all popular implementations of std::string use contiguous memory anyway. Alternatively, you can use std::vector in the example above.

Is there a way to reduce ostringstream malloc/free's?

I am writing an embedded app. In some places, I use std::ostringstream a lot, since it is very convenient for my purposes. However, I just discovered that the performance hit is extreme since adding data to the stream results in a lot of calls to malloc and free. Is there any way to avoid it?
My first thought was making the ostringstream static and resetting it using ostringstream::str(""). However, this can't be done, as I need the functions to be reentrant.
Well, Booger's solution would be to switch to sprintf(). It's unsafe, and error-prone, but it is often faster.
Not always though. We can't use it (or ostringstream) on my real-time job after initialization because both perform memory allocations and deallocations.
Our way around the problem is to jump through a lot of hoops to make sure that we perform all string conversions at startup (when we don't have to be real-time yet). I do think there was one situation where we wrote our own converter into a fixed-sized stack-allocated array. We have some constraints on size we can count on for the specific conversions in question.
For a more general solution, you may consider writing your own version of ostringstream that uses a fixed-size buffer (with error-checking that the bounds are respected, of course). It would be a bit of work, but if you have a lot of those stream operations it might be worth it.
If you know how big the data is before creating the stream you could use ostrstream whose constructor can take a buffer as a parameter. Thus there will be no memory management of the data.
Probably the approved way of dealing with this would be to create your own basic_stringbuf object to use with your ostringstream. For that, you have a couple of choices. One would be to use a fixed-size buffer, and have overflow simply fail when/if you try to create output that's too long. Another possibility would be to use a vector as the buffer. Unlike std::string, vector guarantees that appending data will have amortized constant complexity. It also never releases data from the buffer unless you force it to, so it'll normally grow to the maximum size you're dealing with. From that point, it shouldn't allocate or free memory unless you create a string that's beyond the length it currently has available.
std::ostringstream is a convenience interface. It links a std::string to a std::ostream by providing a custom std::streambuf. You can implement your own std::streambuf; that allows you to do the entire memory management yourself. You still get the nice formatting of std::ostream, but you have full control over the memory management. Of course, the consequence is that you get your formatted output in a char[] - but that's probably no big problem if you're an embedded developer.