Is this the proper way to iterate over a read on a socket? I am having a hard time getting this to work properly. data.size is an unsigned int that is also populated from the socket; its value is correct. data.data is an unsigned char *.
if ( data.size > 0 ) {
    data.data = (unsigned char*)malloc(data.size);
    memset(&data.data, 0, data.size);
    int remainingSize = data.size;
    unsigned char *iter = data.data;
    int count = 0;
    do {
        count = read(connect_fd, iter, remainingSize);
        iter += count;
        remainingSize -= count;
    } while (count > 0 && remainingSize > 0);
}
else {
    data.data = 0;
}
Thanks in advance.
You need to check the return value from read before you start adding it to other values.
You'll get a zero when the socket reports EOF, and -1 on error. Keep in mind that for a socket EOF is not the same as closed.
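A minimal sketch of the question's loop with the return value checked before it is used (same variable names as above; assumes <unistd.h> and <cerrno> are available):
int remainingSize = data.size;
unsigned char *iter = data.data;
while (remainingSize > 0) {
    ssize_t count = read(connect_fd, iter, remainingSize);
    if (count < 0) {
        if (errno == EINTR) continue;   // interrupted by a signal, just retry
        /* real error: handle/report as required, then stop */
        break;
    }
    if (count == 0) break;              // EOF: the peer shut down its side
    iter += count;
    remainingSize -= count;
}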
Low-level socket programming is very tedious and error-prone. If you use C++ you should try higher-level libraries like Boost or ACE.
I would also suggest reading C++ Network Programming: Mastering Complexity Using ACE and Patterns and C++ Network Programming: Systematic Reuse with ACE and Frameworks.
Put the read as part of the while condition.
while((remainingSize > 0) && (count = read(connect_fd, iter, remainingSize)) > 0)
{
    iter += count;
    remainingSize -= count;
}
This way if it fails you immediately stop the loop.
It is a very common pattern to use the read as part of the loop condition; otherwise you need to check the state inside the loop, which makes the code uglier.
Personally:
I would move the whole test above into a separate function for readability, but your mileage may vary.
Also, using malloc (and company) is going to lead to a whole boatload of memory management issues. I would use a std::vector. This also future-proofs the code: when you modify it to start throwing exceptions, it will also be exception safe.
So assuming you change data.data to have a type of std::vector<unsigned char> then
if ( data.size > 0 )
{
    std::vector<unsigned char> buffer(data.size);
    unsigned char *iter = &buffer[0];
    int remainingSize = data.size;
    int count = 0;
    while((remainingSize > 0) && (count = read(connect_fd, iter, remainingSize)) > 0)
    {
        iter += count;
        remainingSize -= count;
    }
    if (count < 0)
    {
        // handle error as required
    }
    buffer.resize(buffer.size() - remainingSize);
    data.data.swap(buffer);
}
Keep in mind that read() calls are system calls and thus a source of possible blocking, and even if you use non-blocking I/O, are inherently heavyweight. I would recommend minimising them.
A good way to go, which has served me well in over a decade of BSD socket programming in C, is to use non-blocking I/O and, at each polling interval (assuming you're using some sort of synchronous I/O mux like select()), issue a FIONREAD ioctl() to get the total amount of data waiting. Then read() that amount in as many calls as necessary to capture all of it, and return from the function until the next timer tick.
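A rough sketch of that pattern, assuming a non-blocking socket that select() has already reported readable (the helper name and the std::vector output are mine, not from the question):
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// Append everything currently queued on connect_fd to 'out'.
void drain_socket(int connect_fd, std::vector<unsigned char>& out)
{
    int avail = 0;
    if (ioctl(connect_fd, FIONREAD, &avail) < 0 || avail <= 0)
        return;                            // nothing waiting (or error) this tick

    std::size_t old_size = out.size();
    out.resize(old_size + avail);

    int got = 0;
    while (got < avail) {
        ssize_t n = read(connect_fd, out.data() + old_size + got, avail - got);
        if (n <= 0) break;                 // EOF, EWOULDBLOCK, or error: stop for now
        got += static_cast<int>(n);
    }
    out.resize(old_size + got);            // keep only what was actually read
}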
I have a super fast M.2 drive. How fast is it? It doesn’t matter because I cannot utilize this speed anyway. That’s why I’m asking this question.
I have an app that needs a very large amount of memory, so much that it won't fit in RAM. Fortunately it is not needed all at once; instead it is used to save intermediate results from computations.
Unfortunately the application is not able to write and read this data fast enough. I tried using multiple reader and writer threads, but it only made it worse (later I read that it is because of this).
So my question is: is it possible to have truly asynchronous file IO in C++ to fully exploit those advertised gigabytes per second? If it is, then how (in a cross-platform way)?
You could also recommend a library that's good at tasks like this if you know one, because I believe there is no point in reinventing the wheel.
Edit:
Here is code that shows how I do file IO in my program. It isn't from the mentioned program because it wouldn't be that minimal; it illustrates the problem nevertheless. Do not mind Windows.h; it is used only to set thread affinity. In the actual program I also set affinity, so that's why I included it.
#include <fstream>
#include <thread>
#include <memory>
#include <string>
#include <Windows.h> // for SetThreadAffinityMask()

void stress_write(unsigned bytes, int num)
{
    std::ofstream out("temp" + std::to_string(num));
    for (unsigned i = 0; i < bytes; ++i)
    {
        out << char(i);
    }
}

void lock_thread(unsigned core_idx)
{
    SetThreadAffinityMask(GetCurrentThread(), 1LL << core_idx);
}

int main()
{
    std::ios_base::sync_with_stdio(false);
    lock_thread(0);

    auto worker_count = std::thread::hardware_concurrency() - 1;
    std::unique_ptr<std::thread[]> threads = std::make_unique<std::thread[]>(worker_count); // faster than std::vector
    for (int i = 0; i < worker_count; ++i)
    {
        threads[i] = std::thread(
            [](unsigned idx) {
                lock_thread(idx);
                stress_write(1'000'000'000, idx);
            },
            i + 1
        );
    }

    stress_write(1'000'000'000, 0);

    for (int i = 0; i < worker_count; ++i)
    {
        threads[i].join();
    }
}
As you can see, it's just plain old fstream. On my machine this uses 100% CPU, but only 7-9% disk (around 190 MB/s). I am wondering if it could be increased.
The easiest thing to get (up to) a 10x speed up is to change this:
void stress_write(unsigned bytes, int num)
{
    std::ofstream out("temp" + std::to_string(num));
    for (unsigned i = 0; i < bytes; ++i)
    {
        out << char(i);
    }
}
to this:
void stress_write(unsigned bytes, int num)
{
    constexpr auto chunk_size = (1u << 12u); // tune as needed
    std::ofstream out("temp" + std::to_string(num));
    for (unsigned chunk = 0; chunk < (bytes+chunk_size-1)/chunk_size; ++chunk)
    {
        char chunk_buff[chunk_size];
        auto count = (std::min)( bytes - chunk_size*chunk, chunk_size );
        for (unsigned j = 0; j < count; ++j)
        {
            unsigned i = j + chunk_size*chunk;
            chunk_buff[j] = char(i); // processing
        }
        out.write( chunk_buff, count );
    }
}
where we group writes up to 4096 bytes before sending to the std ofstream.
The streaming operations have a number of annoying, hard for compilers to elide, virtual calls that dominate performance when you are writing only a handful of bytes at a time.
By chunking data into larger pieces we make the vtable lookups rare enough that they no longer dominate.
See this SO post for more details as to why.
To get the last iota of performance, you may have to use something like boost.asio or access your platform's raw async file I/O libraries.
But when you are working at < 10% of the drive bandwidth while railing your CPU, aim at low-hanging fruit first.
Chunking the I/O is indeed the most important optimization here and should suffice in most cases. However, the direct answer to the exact question asked about asynchronous IO is the following.
Boost::Asio added support for file operations in version 1.21.0. The interface is similar to the rest of Asio.
First, we need to create an object representing a file. The most common use cases would use either a random_access_file or a stream_file. For this example code, a stream_file is enough.
Reading is done through async_read_some, but the usual async_read helper function can be used to read a specific number of bytes at once.
If the operating system supports that, these operations do indeed run in the background and use little processor time. Both Windows and Linux do support this.
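A minimal sketch of what that looks like (assumes Boost 1.78+ built with file support enabled, e.g. io_uring on Linux; the file name is just a placeholder):
#include <boost/asio.hpp>
#include <iostream>
#include <vector>

int main()
{
    boost::asio::io_context ctx;
    boost::asio::stream_file file(ctx, "temp0",
                                  boost::asio::stream_file::read_only);

    std::vector<char> buf(1 << 20);  // 1 MiB destination buffer
    boost::asio::async_read(file, boost::asio::buffer(buf),
        [](const boost::system::error_code& ec, std::size_t n) {
            std::cout << "read " << n << " bytes (" << ec.message() << ")\n";
        });

    ctx.run();  // the read proceeds asynchronously; the handler runs here
}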
I'm trying to gain better understanding of controlling memory order when coding for multiple threads. I've used mutexes a lot in the past to serialize variable access, but I'm trying to avoid those where possible to improve performance.
I have a queue of pointers that may be filled by many threads and consumed by many threads. It works fine with a single thread, but crashes when I run with multiple threads. It looks like the consumers may be getting duplicates of the pointers which causes them to be freed twice. It's a little hard to tell since when I put in any print statements, it runs fine without crashing.
To start with, I'm using a pre-allocated vector to hold the pointers. I keep 3 atomic index variables to keep track of which elements in the vector need processing. It may be worth noting that I tried using a _queue type where the elements themselves were atomic, but that did not seem to help. Here is the simpler version:
std::atomic<uint32_t> iread;
std::atomic<uint32_t> iwrite;
std::atomic<uint32_t> iend;
std::vector<JEvent*> _queue;

// Write to _queue (may be thread 1,2,3,...)
while(!_done){
    uint32_t idx = iwrite.load();
    uint32_t inext = (idx+1)%_queue.size();
    if( inext == iread.load() ) return kQUEUE_FULL;
    if( iwrite.compare_exchange_weak(idx, inext) ){
        _queue[idx] = jevent; // jevent is JEvent* passed into this method
        while( !_done ){
            if( iend.compare_exchange_weak(idx, inext) ) break;
        }
        break;
    }
}
and from the same class
// Read from _queue (may be thread 1,2,3,...)
while(!_done){
    uint32_t idx = iread.load();
    if(idx == iend.load()) return NULL;
    JEvent *Event = _queue[idx];
    uint32_t inext = (idx+1)%_queue.size();
    if( iread.compare_exchange_weak(idx, inext) ){
        _nevents_processed++;
        return Event;
    }
}
I should emphasize that I am really interested in understanding why this doesn't work. Implementing some other pre-made package would get me past this problem, but would not help me avoid making the same type of mistakes again later.
UPDATE
I'm marking Alexandr Konovalov's answer as correct (see my comment in his answer below). In case anyone comes across this page, the corrected code for the "Write" section is:
std::atomic<uint32_t> iread;
std::atomic<uint32_t> iwrite;
std::atomic<uint32_t> iend;
std::vector<JEvent*> _queue;

// Write to _queue (may be thread 1,2,3,...)
while(!_done){
    uint32_t idx = iwrite.load();
    uint32_t inext = (idx+1)%_queue.size();
    if( inext == iread.load() ) return kQUEUE_FULL;
    if( iwrite.compare_exchange_weak(idx, inext) ){
        _queue[idx] = jevent; // jevent is JEvent* passed into this method
        uint32_t save_idx = idx;
        while( !_done ){
            if( iend.compare_exchange_weak(idx, inext) ) break;
            idx = save_idx;
        }
        break;
    }
}
To me, one possible issue can occur when there are 2 writers and 1 reader. Suppose that the 1st writer claims slot 0 but stalls just before
_queue[0] = jevent;
while the 2nd writer fills _queue[1] and then tries to publish it via iend. Its compare_exchange_weak(idx, inext) fails because iend is still 0, and on failure compare_exchange_weak overwrites idx with the value it actually found (0). On the next pass the exchange succeeds, moving iend straight from 0 to 2, which signals that _queue[0] is also ready even though the 1st writer has not written it yet. The reader then picks up _queue[0], so we have a data race. Restoring idx from save_idx on every retry, as in the corrected code above, prevents this.
I recommend you try the Relacy Race Detector, which is ideally suited to this kind of analysis.
I have a socket program which acts like both client and server.
It initiates a connection on an input port and reads data from it. In a real-time scenario it reads data on the input port and sends the data (record by record) on to the output port.
The problem here is that while sending data to the output port, CPU usage increases to 50%, which is not permissible.
while(1)
{
    if(IsInputDataAvail==1) //check if data is available on input port
    {
        //condition to avoid duplications while sending
        if( LastRecordSent < LastRecordRecvd )
        {
            record_time temprt;
            list<record_time> BufferList;
            list<record_time>::iterator j;
            list<record_time>::iterator i;

            // Storing into a temp list
            for(i=L.begin(); i != L.end(); ++i)
            {
                if((i->recordId > LastRecordSent) && (i->recordId <= LastRecordRecvd))
                {
                    temprt.listrec = i->listrec;
                    temprt.recordId = i->recordId;
                    temprt.timestamp = i->timestamp;
                    BufferList.push_back(temprt);
                }
            }

            //Sending to output port
            for(j=BufferList.begin(); j != BufferList.end(); ++j)
            {
                LastRecordSent = j->recordId;
                std::string newlistrecord = j->listrec;
                newlistrecord.append("\n");
                char* newrecord = new char [newlistrecord.size()+1];
                strcpy (newrecord, newlistrecord.c_str());
                if ( s.OutputClientAvail() == 1) //check if output client is available
                {
                    int ret = s.SendBytes(newrecord, strlen(newrecord));
                    if ( ret < 0)
                    {
                        log1.AddLogFormatFatal("Nice Send Thread : Nice Client Disconnected");
                        --connected;
                        return;
                    }
                }
                else
                {
                    log1.AddLogFormatFatal("Nice Send Thread : Nice Client Timedout..connection closed");
                    --connected; //if output client not available disconnect after a timeout
                    return;
                }
            }
        }
    }
    // Sleep(100); if we include sleep here CPU usage is less..but to send data real time I need to remove this sleep.
}//End of while loop
If I remove the Sleep(), CPU usage goes very high while sending data to the output port.
Are there any ways to maintain real-time data transfer while reducing CPU usage? Please suggest.
There are two potential CPU sinks in the listed code. First, the outer loop:
while (1)
{
    if (IsInputDataAvail == 1)
    {
        // Not run most of the time
    }
    // Sleep(100);
}
Given that the Sleep call significantly reduces your CPU usage, this spin-loop is the most likely culprit. It looks like IsInputDataAvail is a variable set by another thread (though it could be a preprocessor macro), which would mean that almost all of that CPU is being used to run this one comparison instruction and a couple of jumps.
The way to reclaim that wasted power is to block until input is available. Your reading thread probably does so already, so you just need some sort of semaphore to communicate between the two, with a system call to block the output thread. Where available, the ideal option would be sem_wait() in the output thread, right at the top of your loop, and sem_post() in the input thread, where it currently sets IsInputDataAvail. If that's not possible, the self-pipe trick might work in its place.
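A minimal sketch of that handshake with an unnamed POSIX semaphore (the thread bodies and record handling are placeholders; in the real program, sem_post() would go where IsInputDataAvail is currently set):
#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

static sem_t records_ready;          // counts records queued but not yet sent

static void* input_thread(void*)
{
    for (int i = 0; i < 5; ++i) {
        /* ...append a record to the shared list here... */
        sem_post(&records_ready);    // wake the output thread, one post per record
    }
    return NULL;
}

static void* output_thread(void*)
{
    for (int i = 0; i < 5; ++i) {
        sem_wait(&records_ready);    // blocks without burning CPU until a record exists
        /* ...pop the record and send it here... */
        printf("sent record %d\n", i);
    }
    return NULL;
}

int main()
{
    sem_init(&records_ready, 0, 0);
    pthread_t in, out;
    pthread_create(&in, NULL, input_thread, NULL);
    pthread_create(&out, NULL, output_thread, NULL);
    pthread_join(in, NULL);
    pthread_join(out, NULL);
    sem_destroy(&records_ready);
}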
The second potential CPU sink is in s.SendBytes(). If a positive result indicates that the record was fully sent, then that method must be using a loop. It probably uses a blocking call to write the record; if it doesn't, then it could be rewritten to do so.
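For reference, the usual shape of such a loop, sketched over a raw socket descriptor since SendBytes()'s internals aren't shown here:
#include <sys/socket.h>
#include <cstddef>

// Block until every byte has been handed to the kernel; return -1 on error.
static int send_all(int sock_fd, const char* buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(sock_fd, buf + sent, len - sent, 0);
        if (n < 0)
            return -1;                     // real code would inspect errno (EINTR, ...)
        sent += static_cast<size_t>(n);
    }
    return 0;
}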
Alternatively, you could rewrite half the application to use select(), poll(), or a similar method to merge reading and writing into the same thread, but that's far too much work if your program is already mostly complete.
if(IsInputDataAvail==1)//check if data is available on input port
Get rid of that. Just read from the input port. It will block until data is available. This is where most of your CPU time is going. However there are other problems:
std::string newlistrecord = j->listrec;
Here you are copying data.
newlistrecord.append("\n");
char* newrecord= new char [newlistrecord.size()+1];
strcpy (newrecord, newlistrecord.c_str());
Here you are copying the same data again. You are also dynamically allocating memory, and you are also leaking it.
if ( s.OutputClientAvail() == 1) //check if output client is available
I don't know what this does but you should delete it. The following send is the time to check for errors. Don't try to guess the future.
int ret = s.SendBytes(newrecord,strlen(newrecord));
Here you are recomputing the length of the string, which you probably already knew back at the time you set j->listrec. It would be much more efficient to just call s.SendBytes() directly with j->listrec and then again with "\n" than to do all this. TCP will coalesce the data anyway.
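Putting that together, the sending loop could shrink to something like this (a sketch that assumes listrec is a std::string, as the existing copy into one suggests, and that SendBytes accepts a const char* and a length):
for (j = BufferList.begin(); j != BufferList.end(); ++j)
{
    LastRecordSent = j->recordId;
    // Send the record and the terminator directly: no std::string copy,
    // no new[] (which was also leaked), no strlen() over a known length.
    if (s.SendBytes(j->listrec.c_str(), j->listrec.size()) < 0 ||
        s.SendBytes("\n", 1) < 0)
    {
        log1.AddLogFormatFatal("Nice Send Thread : Nice Client Disconnected");
        --connected;
        return;
    }
}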
I'm wondering if there's an easy way to iterate through an fd_set. The reason I want to do this is to avoid having to loop through all connected sockets, since select() alters these fd_sets to include only the ones I'm interested in. I also know that relying on the internals of a type that is not meant to be accessed directly is generally a bad idea, since it may vary across different systems. However, I need some way to do this, and I'm running out of ideas. So, my question is:
How do I iterate through an fd_set? If this is a really bad practice, are there any other ways to solve my "problem" except from looping through all connected sockets?
Thanks
You have to fill in an fd_set struct before calling select(); you cannot pass in your original std::set of sockets directly. select() then modifies the fd_set accordingly, removing any sockets that are not ready, and returns how many sockets are remaining. You have to loop through the resulting fd_set, not your std::set. There is no need to call FD_ISSET() because the resulting fd_set only contains sockets that are ready, eg:
fd_set read_fds;
FD_ZERO(&read_fds);

int max_fd = 0;
read_fds.fd_count = connected_sockets.size();
for( int i = 0; i < read_fds.fd_count; ++i )
{
    read_fds.fd_array[i] = connected_sockets[i];
    if (read_fds.fd_array[i] > max_fd)
        max_fd = read_fds.fd_array[i];
}

if (select(max_fd+1, &read_fds, NULL, NULL, NULL) > 0)
{
    for( int i = 0; i < read_fds.fd_count; ++i )
        do_socket_operation( read_fds.fd_array[i] );
}
Where FD_ISSET() comes into play more often is when using error checking with select(), eg:
fd_set read_fds;
FD_ZERO(&read_fds);
fd_set error_fds;
FD_ZERO(&error_fds);

int max_fd = 0;
read_fds.fd_count = connected_sockets.size();
for( int i = 0; i < read_fds.fd_count; ++i )
{
    read_fds.fd_array[i] = connected_sockets[i];
    if (read_fds.fd_array[i] > max_fd)
        max_fd = read_fds.fd_array[i];
}

error_fds.fd_count = read_fds.fd_count;
for( int i = 0; i < read_fds.fd_count; ++i )
{
    error_fds.fd_array[i] = read_fds.fd_array[i];
}

if (select(max_fd+1, &read_fds, NULL, &error_fds, NULL) > 0)
{
    for( int i = 0; i < read_fds.fd_count; ++i )
    {
        if( !FD_ISSET(read_fds.fd_array[i], &error_fds) )
            do_socket_operation( read_fds.fd_array[i] );
    }
    for( int i = 0; i < error_fds.fd_count; ++i )
    {
        do_socket_error( error_fds.fd_array[i] );
    }
}
select() sets the bit corresponding to each ready file descriptor in the set, so you need not iterate through all the fds if you are interested in only a few (and can ignore the others); just test those file descriptors you are interested in.
if (select(fdmax+1, &read_fds, NULL, NULL, NULL) == -1) {
    perror("select");
    exit(4);
}

if(FD_ISSET(fd0, &read_fds))
{
    //do things
}

if(FD_ISSET(fd1, &read_fds))
{
    //do more things
}
EDIT
Here is the fd_set struct:
typedef struct fd_set {
    u_int fd_count;                /* how many are SET? */
    SOCKET fd_array[FD_SETSIZE];   /* an array of SOCKETs */
} fd_set;
Here, fd_count is the number of sockets set (so you can add an optimization using this) and fd_array is a bit vector (of size FD_SETSIZE * sizeof(int), which is machine dependent). On my machine, it is 64 * 64 = 4096.
So, your question is essentially: what is the most efficient way to find the bit positions of 1s in a bit-vector (of size around 4096 bits)?
I want to clear one thing up here:
"Looping through all the connected sockets" doesn't mean that you are actually reading from or doing work on each connection. FD_ISSET() only checks whether the bit in the fd_set at the connection's file descriptor number is set or not. If efficiency is your aim, isn't this already about as efficient as it gets, short of heuristics?
Please tell us what's wrong with this method, and what you are trying to achieve with the alternative.
It's fairly straight-forward:
for( int fd = 0; fd <= max_fd; fd++ )
    if ( FD_ISSET(fd, &my_fd_set) )
        do_socket_operation( fd );
This looping is a limitation of the select() interface. The underlying implementations of fd_set are usually a bit set, which obviously means that looking for a socket requires scanning over the bits.
It is for precisely this reason that several alternative interfaces have been created - unfortunately, they are all OS-specific. For example, Linux provides epoll, which returns a list of only the file descriptors that are active. FreeBSD and Mac OS X both provide kqueue, which accomplishes the same result.
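For comparison, a bare-bones epoll loop looks like this (Linux-specific; do_socket_operation is the same placeholder used elsewhere on this page, and error handling is elided):
#include <sys/epoll.h>

void do_socket_operation(int fd);   // placeholder, as in the select() examples

void event_loop(int listen_fd)
{
    int ep = epoll_create1(0);

    epoll_event ev = {};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);   // register each socket once

    epoll_event ready[64];
    for (;;) {
        int n = epoll_wait(ep, ready, 64, -1);      // returns only active fds
        for (int i = 0; i < n; ++i)
            do_socket_operation(ready[i].data.fd);  // no scan over the whole set
    }
}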
See section 7.2 of Beej's guide to networking, "7.2. select() — Synchronous I/O Multiplexing", which uses FD_ISSET.
In short, you must iterate through the fd_set in order to determine whether each file descriptor is ready for reading/writing.
I don't think what you are trying to do is a good idea.
Firstly, it's system dependent, but I believe you already know that.
Secondly, at the internal level these sets are stored as an array of integers, with fds stored as set bits. According to the man page for select, FD_SETSIZE is 1024.
Even if you wanted to iterate over the set and pick out the fds you are interested in, you would have to loop over that whole range, along with the mess of bit manipulation.
So unless you are waiting on more than FD_SETSIZE fds in select, which I don't think is even possible, it's not a good idea.
Oh wait, in any case it's not a good idea.
I don't think you can do this efficiently using the select() call. The information in "The C10K problem" is still valid.
You will need some platform specific solutions:
Linux => epoll
FreeBSD => kqueue
Or you could use an event library such as libev to hide the platform details for you.
ffs() may be used on POSIX or 4.3BSD for bit iteration, though it expects an int (the long and long long versions are glibc extensions). Of course, you have to check whether ffs() is optimized as well as e.g. strlen and strchr.
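For illustration, here is ffs() walking the set bits of a single word; a real fd_set scan would apply the same idea to each word of the set's storage, whose layout is platform specific:
#include <strings.h>   // ffs() lives here on POSIX
#include <stdio.h>

int main()
{
    unsigned mask = 0x8842u;              // bits 1, 6, 11 and 15 are set
    while (mask != 0) {
        int pos = ffs((int)mask);         // 1-based index of the lowest set bit
        printf("bit %d is set\n", pos - 1);
        mask &= mask - 1;                 // clear that lowest set bit
    }
    return 0;
}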
I've implemented a simple socket wrapper class. It includes a non-blocking function:
void Socket::set_non_blocking(const bool b) {
    mNonBlocking = b; // class member for reference elsewhere
    int opts = fcntl(m_sock, F_GETFL);
    if (opts < 0) return;

    if (b)
        opts |= O_NONBLOCK;
    else
        opts &= ~O_NONBLOCK;

    fcntl(m_sock, F_SETFL, opts);
}
The class also contains a simple receive function:
int Socket::recv(std::string& s) const {
    char buffer[MAXRECV + 1];
    s = "";
    memset(buffer, 0, MAXRECV + 1);

    int status = ::recv(m_sock, buffer, MAXRECV, 0);
    if (status == -1) {
        if (!mNonBlocking)
            std::cout << "Socket, error receiving data\n";
        return 0;
    } else if (status == 0) {
        return 0;
    } else {
        s = buffer;
        return status;
    }
}
In practice, there seems to be a ~15ms delay when Socket::recv() is called. Is this delay avoidable? I've seen some non-blocking examples that use select(), but don't understand how that might help.
It depends on how you are using the sockets. If you have multiple sockets and you loop over all of them checking for data, that may account for the delay.
With non-blocking recv you are depending on data being there. If your application needs to use more than one socket, you will have to constantly poll each socket in turn to find out if any of them have data available.
This is bad for system resources because it means your application is constantly running even when there is nothing to do.
You can avoid that with select. You basically set up your sockets, add them to a group, and select on the group. When anything happens on any of the selected sockets, select returns, specifying what happened and on which socket.
For some code showing how to use select, look at Beej's guide to network programming.
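A stripped-down sketch of that setup (the helper and its std::vector interface are mine; error and EOF handling are omitted):
#include <sys/select.h>
#include <vector>

// Block until at least one of the given sockets is readable, then return the ready ones.
std::vector<int> wait_for_readable(const std::vector<int>& socks)
{
    fd_set read_fds;
    FD_ZERO(&read_fds);
    int max_fd = 0;
    for (int fd : socks) {
        FD_SET(fd, &read_fds);
        if (fd > max_fd) max_fd = fd;
    }

    std::vector<int> ready;
    if (select(max_fd + 1, &read_fds, NULL, NULL, NULL) > 0) {  // sleeps, no busy loop
        for (int fd : socks)
            if (FD_ISSET(fd, &read_fds))
                ready.push_back(fd);
    }
    return ready;
}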
select will let you specify a timeout, and can test whether the socket is ready to be read from, so you can use something smaller than 15 ms. Incidentally, you need to be careful with that code you have: if the data on the wire can contain embedded NULs, s won't contain all the read data. You should use something like s.assign(buffer, status);.
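For a single socket, the timeout form is roughly this (a sketch around the wrapper's m_sock; the millisecond value passed in is arbitrary):
#include <sys/select.h>

// Return true if the socket becomes readable within timeout_ms milliseconds.
bool readable_within(int m_sock, int timeout_ms)
{
    fd_set read_fds;
    FD_ZERO(&read_fds);
    FD_SET(m_sock, &read_fds);

    struct timeval tv;
    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    return select(m_sock + 1, &read_fds, NULL, NULL, &tv) > 0;
}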
In addition to stefanB, I see that you are zeroing out your buffer every time. Why bother? recv returns how many bytes were actually read. Just zero out the one byte just past the data ( buffer[status] = '\0' ).
How big is your MAXRECV? It might just be that you incur a page fault on the stack growth. Others have already mentioned that zeroing out the receive buffer is completely unnecessary. You also take a memory allocation and copy hit when you create a std::string out of the received character data.