I am repeatedly sending data over a UDP socket. The data has this format:
int32_t bid
int32_t ask
int32_t price
All fields are optional, so I am using 3 additional bits to track which fields are present in the message, and I am writing these fields to a char array before sending.
So my format becomes:
[ first 3 bits | 5 bits] [ ? 32 ] [ ? 32 ] [ ? 32 ]
The problem is that I am wasting 5 bits. I could do dirty pointer arithmetic with binary operations to save those bits, but it may cost processing speed. How can I do this cleanly and efficiently in C++?
Please provide a simple code snippet for this.
If you care so much about those 5 bits, then you can very probably save much more by dynamically reducing the size of the bid, ask and price fields. In your header you can then allocate two bits per payload field, holding one of three possible values (a sketch follows the list):
0 - the field is not present
1 - the field is present and is encoded in 16 bits
2 - the field is present and is encoded in 32 bits
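A minimal sketch of that scheme (the function names and the exact header layout are my own illustration, not anything from your code):

#include <cstdint>
#include <cstddef>
#include <cstring>
#include <limits>

// Header byte: bits [1:0] = bid, [3:2] = ask, [5:4] = price.
// 0 = absent, 1 = present as int16_t, 2 = present as int32_t.
static size_t encode_field(uint8_t* out, uint8_t& header, int shift, const int32_t* value)
{
    if (!value) return 0;                        // absent -> code 0
    if (*value >= std::numeric_limits<int16_t>::min() &&
        *value <= std::numeric_limits<int16_t>::max()) {
        header |= uint8_t(1) << shift;           // code 1: 16-bit encoding
        int16_t v = static_cast<int16_t>(*value);
        std::memcpy(out, &v, sizeof v);
        return sizeof v;
    }
    header |= uint8_t(2) << shift;               // code 2: 32-bit encoding
    std::memcpy(out, value, sizeof *value);
    return sizeof *value;
}

size_t encode_message(uint8_t* buf, const int32_t* bid, const int32_t* ask, const int32_t* price)
{
    uint8_t header = 0;
    size_t n = 1;                                // byte 0 is reserved for the header
    n += encode_field(buf + n, header, 0, bid);
    n += encode_field(buf + n, header, 2, ask);
    n += encode_field(buf + n, header, 4, price);
    buf[0] = header;
    return n;                                    // total bytes to send
}

The decoder reads the header byte first, then consumes 0, 2 or 4 bytes per field according to each 2-bit code. Note that the values are copied in native byte order; across machines you would want to fix an order explicitly.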
I think the devil is in the detail in these things. It might be wise to study the statistics of the messages you send, since it might be possible to devise a scheme that improves the average performance.
For example, if you are sending a bunch of these at a time (so that the comments about the size of the headers could be misplaced), then you could arrange the message block as eight arrays (each with a count): first the messages with all 3 fields, then the messages with just, say, bid and ask, and so on. This does add 8 counts to the message, but means you don't send fields that aren't there; whether you save on average will depend on the size of the blocks and the statistics of the messages. If some of the combinations were rare, you could have a type field before the arrays that specifies which types are present.
Another thing to consider is whether you could steal some bits from the fields. For example, if you have the bid field, do you need the full 32 bits for the others? Could, say, the ask be encoded as a 30-bit difference from the bid?
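To make the bit-stealing idea concrete, here is a sketch that packs the bid plus a 30-bit signed ask delta into one 64-bit word (my own illustration; it assumes the ask never differs from the bid by more than ±2^29):

#include <cstdint>
#include <cassert>

// Pack bid (32 bits) and ask (as a 30-bit signed delta from bid) into one
// 64-bit word, leaving 2 spare bits for flags.
uint64_t pack_bid_ask(int32_t bid, int32_t ask)
{
    int32_t delta = ask - bid;
    assert(delta >= -(1 << 29) && delta < (1 << 29));  // must fit in 30 bits
    uint64_t word = static_cast<uint32_t>(bid);
    word |= (static_cast<uint64_t>(static_cast<uint32_t>(delta)) & 0x3FFFFFFFull) << 32;
    return word;
}

void unpack_bid_ask(uint64_t word, int32_t& bid, int32_t& ask)
{
    bid = static_cast<int32_t>(word & 0xFFFFFFFFull);
    uint32_t raw = static_cast<uint32_t>(word >> 32) & 0x3FFFFFFFu;
    int32_t delta = static_cast<int32_t>(raw << 2) >> 2;  // sign-extend 30 -> 32 bits
    ask = bid + delta;
}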
As already discussed in the comments, the first thing is to combine multiple messages into a single UDP packet. And it's not as simple as it might look. The biggest challenge here is deciding the packet size.
The maximum UDP payload size is 65,507 bytes (I assume UDP over IPv4 by default). If the UDP payload is bigger than the MTU, the packet will be silently fragmented at the IP layer. In real life, UDP packet sizes are usually equal to or less than the MTU. But even the MTU size (~1500 bytes) can be too big. There was some research for a multimedia streaming application which found that large UDP packets were dropped more often when the network was congested, and which recommended something like a 400-byte payload as a good balance between the chance of being dropped and not wasting bandwidth on UDP/IP headers. Again, it depends on your application, mainly on your data traffic.
Then you can apply different compression techniques. bid can be compressed by e.g. a variable-length encoder or a run-length encoder, depending on the nature of your bid.
I have no idea what the ask field is, but price looks like a good candidate for a fixed-point number. If ask is related to price, it may be worth sending the difference between them and saving some bits.
First of all, decide how many bits per field you really need, then decide how to arrange them to minimise gaps or to optimise performance. Bit manipulation is quite expensive; minimising data copying can also improve performance.
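For the variable-length idea mentioned above, a minimal LEB128-style sketch (my own illustration; protobuf-style varints work the same way, and you would ZigZag-encode first if values can be negative):

#include <cstdint>
#include <cstddef>

// Encode v in 7-bit groups, low group first; the high bit of each byte
// flags whether another byte follows. Small values take 1-2 bytes.
size_t encode_varint(uint8_t* out, uint32_t v)
{
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = static_cast<uint8_t>(v) | 0x80;  // 7 data bits + continuation bit
        v >>= 7;
    }
    out[n++] = static_cast<uint8_t>(v);             // final byte, continuation bit clear
    return n;
}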
I'm trying to understand hardware caches. I have a rough idea, but I would like to ask here whether my understanding is correct.
So I understand that there are 3 types of cache mapping: direct, fully associative, and set associative.
I would like to know: is the type of mapping implemented with logic gates in hardware, specific to a given computer system, and would changing the mapping require changing the electrical connections?
My current understanding is that in RAM there exists a memory address to refer to each block of memory. Each block contains words, and each word contains a number of bytes. We can represent the number of options with a number of bits.
So, for example, with 4096 memory locations, each containing 16 bytes, referring to each individual byte requires 2^12 * 2^4 = 2^16 addresses, so a 16-bit memory address is required to refer to each byte.
The cache also has a memory address, a valid bit, a tag, and some data capable of storing a block of main memory of n words and thus m bytes, where m = n * i (with i bytes per word).
For example, direct mapping: one block of main memory can only be at one particular location in the cache. When the CPU requests some data using a 16-bit RAM address, it checks the cache first.
How does it know that this particular 16-bit memory address can only be in a few places?
My thought is that there could be some electrical connection between every RAM address and a cache address. The 16-bit address could then be split into parts: for example, only compare the left 8 bits with every cache memory address, then if they match compare the byte bits, then the tag bits, then the valid bit.
Is my understanding correct? Thank you!
I really appreciate anyone who reads this long post!
You may want to read 3.3.1 Associativity in What Every Programmer Should Know About Memory by Ulrich Drepper.
https://people.freebsd.org/~lstewart/articles/cpumemory.pdf#subsubsection.3.3.1
The title is a little bit catchy, but it explains everything you ask in detail.
In short:
The problem with caches is the number of comparisons. If your cache holds 100 blocks, you need to perform 100 comparisons in one cycle. You can reduce this number by introducing sets. If a specific memory region can only be placed in slots 1-10, you reduce the number of comparisons to 10.
The sets are addressed by an additional bit field inside the memory address, called the index.
So, for instance, your 16-bit address (from your example) could be split into:
[15:6] block address; stored in the `cache` as the `tag` to identify the block
[5:4] index bits; 2 bits -> 4 sets
[3:0] block offset; byte position inside the block
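A small sketch of that split in code, with the widths taken from the example above:

#include <cstdint>
#include <cstdio>

// 16-bit address, 16-byte blocks (4 offset bits), 4 sets (2 index bits),
// so the remaining 10 bits are the tag.
struct CacheAddr {
    uint16_t tag;     // bits [15:6]
    uint8_t  index;   // bits [5:4], selects one of the 4 sets
    uint8_t  offset;  // bits [3:0], byte position inside the block
};

CacheAddr split(uint16_t addr)
{
    CacheAddr a;
    a.offset = addr & 0x0F;
    a.index  = (addr >> 4) & 0x03;
    a.tag    = addr >> 6;
    return a;
}

int main()
{
    CacheAddr a = split(0xBEEF);
    std::printf("tag=0x%X index=%u offset=%u\n",
                static_cast<unsigned>(a.tag),
                static_cast<unsigned>(a.index),
                static_cast<unsigned>(a.offset));
    return 0;
}

The hardware does the same decomposition with wiring rather than arithmetic: the index bits directly select a set, and only the tags within that set are compared.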
So the choice of method depends on the available hardware resources and the access time you want to achieve. It's pretty much hardwired, since you want to reduce the comparison logic.
There are a few mapping functions used to map cache lines to main memory:
Direct Mapping
Associative Mapping
Set-Associative Mapping
You should have at least a basic idea of these three mapping functions.
I'm writing a C++ program that simply receives data from another computer and writes it to an SSD RAID at high throughput (about 100 MB/s, arriving over gigabit Ethernet).
I have set up two overlapped I/O operations each, for receiving from Ethernet and for writing to the SSD. When a receive is done, it posts a message to the writer.
And I use FILE_FLAG_NO_BUFFERING when creating the file on disk.
On the network sender's side, I am using overlapped I/O to send data.
I am stuck on this problem: the amount received from the socket, rv = recv(), is not aligned to the disk sector size (a multiple of 4096?).
What should I do?
recv and unbuffered writes are not really very compatible with each other. It is possible to get that working, but it will take a little extra work.
When doing unbuffered writes, both the start address of your buffer and the amount to write must be multiples of the sector size (see MSDN). Aligning the buffer is trivial, but dealing with the fact that recv can return pretty much any amount of data (up to the amount you ask for, but in theory it could be just 1 byte) is a bit of work.
Another problem is that while it is pretty much guaranteed that the sector size is a power of two (there used to exist hard disks with non-power-of-two sectors in the 1990s, but this fact was hidden by the controller), you do not know what it is. And even if you did know, it might be different on the next computer. It might be 512 or 1024 or something else.
How to handle this? Most programmers resort to simply using a function that allocates complete memory pages, such as VirtualAlloc, or an anonymous memory mapping. Since these operate on pages, they are necessarily page-size aligned, which (usually) means 4096 bytes.¹
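For example (a sketch assuming Windows, as in the question; error handling omitted):

#include <windows.h>
#include <cstdint>

// The returned buffer starts on the allocation granularity (64k), which is
// aligned enough for any realistic sector size.
uint8_t* alloc_receive_buffer(SIZE_T size)
{
    return static_cast<uint8_t*>(
        VirtualAlloc(nullptr, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE));
}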
Since the amount of data to write must also be a multiple of the sector size (but the amount of data received probably isn't), you have to round down, do a partial write, and save the rest for the next write.
Again, the problem is that you don't know the sector size, so the best thing you can do is round down to the same granularity that you're using for the buffer start (anything else would be nonsensical). In other words, you conceptually have to do something like this:
while (rv < 0x10000)                    // don't have a full 64k chunk yet
    receive_more_and_append();
num_write = rv & ~0xffff;               // round down to a multiple of 64k
rv -= num_write;                        // bytes left over after this write
memcpy(other_buf, buf + num_write, rv); // carry the remainder to the next round
WriteFileEx(...);                       // write num_write bytes from buf
¹ That is only half the truth, since Windows has a minimum allocation granularity of 64 kB. You can't allocate anything smaller than 64 kB, and it can't be aligned to less than 64 kB. So in fact you are good for sectors up to 64 kB, which is bigger than anything you are realistically likely to encounter.
Also, as a small nitpick, Itanium has 8k pages, not 4k -- but that is no problem, it's actually better.
I have a function which prepares a data buffer and then sends it via an externally provided API function which looks like:
send(uint8_t* data_buf, uint32_t length)
In my particular case, my code is always sending exactly 8 bytes and the first 7 bytes are always the same (I can't change this fact; it's some sort of message header).
Because I am in a limited, embedded environment, I would like to optimize the size and performance of my code, or at least choose the best tradeoff between the two.
Currently, I see two options:
Create a global array. Initialize the first 7 bytes once, then just overwrite the last byte before sending the array.
Create a local array, write all 8 bytes, and then send it.
Are there any better solutions than the two mentioned above?
Even on an embedded system, you can count on having more than 8 bytes of cache, so performance really doesn't matter: databuf[8] will be fully in cache in both cases. The code size for the first case might be smaller by 2-3 instructions.
(There's often another issue: code/flash size and RAM size are distinct constraints when you can XIP, i.e. execute in place.)
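For illustration, both options side by side (the header bytes here are made up, and the return type of send is assumed to be void):

#include <cstdint>

extern void send(uint8_t* data_buf, uint32_t length);  // the provided API

// Option 1: global array, header initialized once.
static uint8_t g_msg[8] = {0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF, 0x00, 0x00};

void send_global(uint8_t last_byte)
{
    g_msg[7] = last_byte;          // only the payload byte changes
    send(g_msg, sizeof g_msg);
}

// Option 2: local array, all 8 bytes written on every call.
void send_local(uint8_t last_byte)
{
    uint8_t msg[8] = {0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF, 0x00, last_byte};
    send(msg, sizeof msg);
}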
I am building an MPI application. In order to reduce the size of the messages being transferred, I am thinking of using tables of bits to represent bool tables (since a bool value can take only one of two values: true or false). This is important in my case since communication is the main performance bottleneck in my application.
Is it possible to create this kind of table? Does this datatype exist in the MPI API?
In C++, std::bitset or boost::dynamic_bitset can be useful for managing a number of bits; choose the latter if the size of the bitset isn't fixed. AFAIK MPI uses MPI_Send and MPI_Recv for inter-process communication. How you serialize your output and send it through those interfaces is another matter, as neither of the two types is supported by Boost.Serialization.
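A sketch of that approach, assuming a fixed-size std::bitset packed into bytes and sent as MPI_BYTE (the size, rank and tag are illustrative):

#include <mpi.h>
#include <bitset>
#include <vector>
#include <cstdint>

constexpr std::size_t kBits = 1000;  // illustrative table size

// Round up to whole bytes and pack one bit per flag.
std::vector<uint8_t> pack(const std::bitset<kBits>& b)
{
    std::vector<uint8_t> bytes((kBits + 7) / 8, 0);
    for (std::size_t i = 0; i < kBits; ++i)
        if (b[i]) bytes[i / 8] |= uint8_t(1) << (i % 8);
    return bytes;
}

void send_bits(const std::bitset<kBits>& b, int dest, int tag)
{
    std::vector<uint8_t> bytes = pack(b);
    MPI_Send(bytes.data(), static_cast<int>(bytes.size()),
             MPI_BYTE, dest, tag, MPI_COMM_WORLD);
}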
Based on the tags in the original question, I assume you are using a mix of Fortran and C++. The MPI Fortran binding has the datatype MPI_LOGICAL, which you can readily use in your message-passing calls. I am not aware of such a type in the MPI C binding. As suggested by PlasmaHH, sending integers might work for you in that case.
https://computing.llnl.gov/tutorials/mpi/#Derived_Data_Types
Short answer: no, the smallest MPI datatype is MPI_BYTE; you can't create a type that is just a bit. (The Fortran bindings have MPI_LOGICAL, which corresponds to the local logical type, but that almost always corresponds to an int or maybe a byte, not a bit.)
Now, that's not necessarily a problem; if you had a bit array, you could just round up to the next whole number of bytes, send that, and ignore the last few bits (which is pretty much what you'd have to do when creating your table anyway). But I have some questions.
How large are your messages? And what's your networking? Are you sure you're bandwidth limited, rather than latency limited?
If your messages are smallish (say, under a MB), then you're likely dominated by message latency, not bandwidth, and reducing message size won't help. (You can estimate this using ping-pong tests, say in the Intel MPI Benchmarks, to see at what sizes your effective bandwidth levels off.) If that's the regime you're in, then this will likely make things worse, not better: communication won't speed up, but the additional cost of indexing into a bit array will likely slow things down.
On the other hand, if you're sending large messages (say MB sized) and/or you're memory limited, this could be a good thing.
I would transfer your bits into an array of integers.
I will answer with regard to Fortran.
You can use the intrinsic bit operations to move the bits back and forth.
Also, to clarify: you should not use the Fortran type LOGICAL, as it is a 4-byte variable, just like a regular integer.
Use these functions:
BIT_SIZE(I)   ! Number of bits in variable I
IBCLR(I, POS) ! Set the bit at position POS in variable I to 0
IBSET(I, POS) ! Set the bit at position POS in variable I to 1
BTEST(I, POS) ! Test whether the bit at position POS in I is 1
Then do a normal transfer in whatever type you are dealing with. You can add tags to the MPI communication to let the receiver know that the variable should be handled bitwise.
This should limit your communication, but it requires packing and unpacking of the data. In any case, you could convert all your bool tables to this scheme.
But note that your bool tables would have to be quite large to see any effect; how large are we talking?
My (DSP) application produces data at a constant rate. The rate depends on the configuration that is selected by the user. I would like to know how many bytes are generated per second. The data structure contains a repeated (packed) floating point field. The length of the field is constant, but can be changed by the user.
Is there a protocol buffers function that will calculate the message size before serialization?
If you have built the message object, you can call ByteSize() on the message, which returns the number of bytes the serialized message would take up. See the C++ docs for ByteSize.
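For example (a sketch; the message type MyData and its repeated packed float field samples are hypothetical stand-ins for your schema):

#include <cstddef>
#include "my_data.pb.h"  // hypothetical generated header

// Build one message shaped like the configured output and ask protobuf
// for the wire size without serializing.
std::size_t bytes_per_message(int samples_per_message)
{
    MyData msg;
    for (int i = 0; i < samples_per_message; ++i)
        msg.add_samples(0.0f);  // repeated packed float field
    return msg.ByteSize();      // serialized size in bytes
}

Multiply by the message rate to get bytes per second. Note that newer protobuf versions deprecate ByteSize() in favour of ByteSizeLong().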
It's impossible to know ahead of time, because protobuf packs the structures it is given into the fewest bytes possible (it won't use four bytes for int x = 1;, for example), so the library would have to walk the entire graph to know the output size.
I believe you could find this out by doing a serialize operation to a protobuf-compliant stream of your own design that just counts the bytes it is given. That could be costly, but no more costly than it would be for the library to do that work.
You could fill out the message without sending it and then call CalculateSize() on it.