Elegant way to add/remove descriptors to/from poll - C++

I have to handle around 1000 descriptors in one poll() call (I can't use epoll, as it's Linux-specific) and I have to be able to dynamically add and remove them (handle new connections and remove closed ones).
This means I would have to rebuild the descriptor array on each iteration.
That is rather obvious from a technical point of view, but does anybody know a beautiful way to do it?

I'd keep the dead descriptors in the array, and purge once in a while.
I'd also maintain the location of each descriptor, for easy removal, but this can be optimized further.
The trick is to keep invalid descriptors in the array instead of rearranging the array every time.
For instance:
struct pollfd pfds[MY_MAX_FDS];
int nfds = 0;
enum { INVALID_FD = -1 };
....
typedef std::map<int,int> desc_to_index_t;
desc_to_index_t d2i;
....
// add descriptor
if (nfds == MY_MAX_FDS) {
    // purge old fds:
    // go through pfds and remove everything that has an invalid fd;
    // pfds should end up as a condensed array of valid entries, and nfds
    // should be updated accordingly, as well as d2i.
}
pfds[nfds] = { desc, events, revents };
d2i.insert(std::make_pair(desc, nfds));
++nfds;
....
// remove descriptor
desc_to_index_t::iterator it = d2i.find(desc);
assert(it != d2i.end());
pfds[it->second] = { INVALID_FD, 0, 0 };
d2i.erase(it);
This way you only need to purge once a certain threshold is crossed, and you don't need to build the array every time.
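For illustration, the purge step mentioned in the comment could be a simple in-place compaction; this is only a sketch built on the names above (compact_pfds itself is made up):
void compact_pfds(struct pollfd* pfds, int& nfds, desc_to_index_t& d2i)
{
    int out = 0;
    for (int in = 0; in < nfds; ++in)
    {
        if (pfds[in].fd == INVALID_FD)
            continue;                    // drop purged slots
        pfds[out] = pfds[in];            // keep live descriptors contiguous
        d2i[pfds[out].fd] = out;         // point the map at the new index
        ++out;
    }
    nfds = out;                          // array is now condensed
}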

Deleting and re-instantiating an arrayed struct globally from within a function without knowing the size of the array at compile time

Full code: Pastebin
Full code with your comments enabled (Google Drive): SerialGrapherV0.9
Code-in-progress is near the bottom.
YouTube example of the graphing code running: Grapher
Background: My goal is to write a library that allows a caller Arduino to drive a callee Arduino via serial and print to a master-defined graph or graphs on an SSD1306 I2C display (I have no SPI version to test with). The graphing code is finished. Currently I can have 4 graphs that update synchronously or asynchronously; there is no blanking, and only the portions that need updating are written.
Both Arduinos currently run the same sketch and determine their role via a pullup input tied to ground; however, in later versions the sketch will be compiled using if statements with a #defined boolean, to greatly save on program space for the caller Arduino.
So far:
The actual graphing works, and the graph updates whenever graphAdd(graphNumber, newVal); is called.
The xpos, ypos, xlength, and ylength of each graph can be defined on the caller side like this:
#define masterGraphNum 4 //Set this to the number of Graphs you want.
graphStruct graph[masterGraphNum] = { // Each set of braces is one graph instance, one for each specified in masterGraphNum; set the array size on the receiver to the maximum number of graphs you can use.
//Graph 1 //Usage: {LeftX, TopY, width, height}
{0, 0, 31, 32},
//Graph 2
{32, 0, 31, 32},
//Graph 3
{64, 0, 31, 32},
//Graph 4
{96, 0, 31, 32},
};
Currently I am trying to use delete[] (graph); followed by graphStruct *graph = new graphStruct[incomingGraphNum];, where incomingGraphNum is an int sent by the caller and received by the callee. This seems to work at first; however, after a short time of graphing (~15 seconds) the Arduino crashes and restarts.
FLOW:
Callee awaits connection indefinitely
Caller sends ready byte
Callee acks
Caller sends number of graphs wanted
NOT WORKING: Re-initializing graph
Graph adds data via called function
NYI: Sending graph number and new value over serial
My problem now is instantiating a globally accessible array of structs from within a function, as I don't want to have to pre-code the number of graphs into the callee, nor the size of the buffer array within the struct.
For the functions to work, graph[] needs to be declared globally. I would like to size graph[number of graphs] globally from within a function during the callee setup, as I want to make this into a plug-and-play diagnostic tool for my future projects.
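For reference, the pattern being described is roughly this (only a sketch; initGraphs and the exact names are placeholders, not code from the project):
graphStruct* graph = nullptr;   // global, visible to all graphing functions
int graphCount = 0;

void initGraphs(int incomingGraphNum)
{
    delete[] graph;                              // safe on nullptr; frees any previous array
    graph = new graphStruct[incomingGraphNum];   // assign the global pointer, don't redeclare it
    graphCount = incomingGraphNum;
}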
Next Steps:
Setting up packets to send the graph data over. Not too hard, essentially sending two ints like (graph#, graphData)
Add graph "titling" (like "ACC" or "Light intensity")
Implemented:
Graphing system
Simple serial "call - response" and acknowledgement system. (Just discovered the Stream functions included with the Arduino IDE; currently rewriting a few sections to use Serial.parseInt() instead of a modified serialEvent().)
Basic Error Handling
Loops/Second counter
A couple ideas that may help.
The display is 128 pixels wide, so you need a buffer no larger than that. I suggest you make that a single, global buffer (instead of each struct having its own buffer). This will never need to be resized with new/delete, no matter how many graphs you have.
uint8_t global_graph_buffer[128];
Notice that I have changed it from int to a byte (uint8_t). The height of the display is only 32 or 64 pixels, so there's no need to store any number bigger than that. Just scale the value down when it comes in on the port.
graph_buffer[x] = map(incoming_data, 0, max_input, 0, height_of_graph);
See the Arduino map() function.
Next, do you really need the y_pos and height of the graphs? Do you plan on having more than 1 row of graphs? If not, get rid of those struct members. Also, you can get rid of the x_pos and width fields as well. These can be calculated based on the indexes.
struct graphStruct {
uint8_t gx; // Removed
uint8_t gy; // Removed
uint8_t gw; // Removed
uint8_t gh; // Removed
int gVal; // What is this for?
//int graphBuffer[graphBufferSize]; This is gone
uint8_t start_index; // First index in the global array
uint8_t end_index; // Last index
bool isReady; // What is this for?
};
To calc x_pos and width:
x_pos = start_index
width = end_index - start_index
To handle incoming data, shift just the part of the buffer for the given graph and add the value:
int incoming_data = some_value_from_serial;
// Shift this graph's slice left by one so the newest sample goes at the end
for (byte i = graph[graphNumber].start_index; i < graph[graphNumber].end_index; i++) {
    global_graph_buffer[i] = global_graph_buffer[i + 1];
}
// Store the newest sample, scaled to the graph height
global_graph_buffer[graph[graphNumber].end_index] = map(incoming_data, 0, graphMax, 0, height_of_graph);
Lastly, you need to consider: how many graphs can you realistically display at one time? Set a max, and create only that many structs at the start. If you use a global buffer, as I suggest, you can re-use a struct multiple times (without having to use new/delete). Just change the start_index and end_index fields.
Not sure if any of this helps, but maybe you can get some ideas from it.

How to manage the life cycle of a Connection using epoll (void* event.data.ptr)

I'm using epoll from C++.
I want to follow nginx's approach of using event.data.ptr (a void*) to associate each event with a Connection.
Connection is held by a smart pointer, but to store it in the void* I have to convert the smart pointer to a raw pointer.
The reference count is not increased during this conversion and retrieval, so the program will obviously crash.
So I stopped using a smart pointer and used new and delete instead. But that is terrible.
How should I manage the life cycle of a Connection?
int nfds = epoll_wait(epollfd, event_list, 100, -1);
for (int n = 0; n < nfds; ++n) {
    auto event = event_list[n];
    auto revents = event.events;
    if (revents & EPOLLIN) {
        if (event.data.fd == listen_fd_) {
            int fd = handleAccept(listen_fd_);
            auto conn_ptr = std::make_shared<Connection>(fd, connections_manager_);
            // connections_manager_ is a std::set<std::shared_ptr<Connection>>;
            // it uses start, stop and stop_all to manage Connections.
            connections_manager_.start(conn_ptr);
            struct epoll_event ev;
            ev.events = EPOLLIN | EPOLLET;
            ev.data.ptr = static_cast<void*>(conn_ptr.get());
            if (epoll_ctl(event_.getEpollFd(), EPOLL_CTL_ADD, fd, &ev) == -1) {
                std::cout << "epoll_ctl failed. fd is " << fd << '\n';
                perror("epoll_ctl: fd_");
                exit(EXIT_FAILURE);
            }
            continue;
        }
        auto conn = static_cast<Connection *>(event.data.ptr);
        conn->start();
    }
}
Here's a general idea that you might want to implement:
Keep a list of all the Connection objects that you create. A simple C-style array would do but you can check the possibility of using an std::vector, std::map or std::unordered_map for the bookkeeping of Connection objects.
Whenever you create a new Connection, you would add it to your list before adding it to your epoll pool. I'd suggest a separate method add() for this.
Similarly, on closing a Connection (from host or client), you need to remove it from the epoll pool, but it will still be in your list of connections. You might need a flag, e.g. isClosed, and set it to true for a closed connection. DON'T delete your Connection objects in your epoll main event loop. Also, you need something to figure out whether there's data left to send before closing a connection from your side; the client might be waiting on data from the server.
For garbage collection of these closed connections, you may use a separate thread that goes through your list at a fixed time interval and removes the closed connections.
In general, you might need some wrapper methods for wait, listen, add, remove, etc. to decompose your operations. It'd be easier to manage. Additionally, you may have separate thread pools dedicated for read / write operations. Accepting connections in the main event loop is just fine.
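As a rough sketch of that bookkeeping (ConnectionManager, add(), markClosed() and reap() are illustrative names, not an existing API, and isClosed is the flag suggested above):
#include <map>
#include <memory>
#include <mutex>

class ConnectionManager {
public:
    // Take ownership and return the raw pointer to store in event.data.ptr.
    Connection* add(std::shared_ptr<Connection> conn, int fd) {
        std::lock_guard<std::mutex> lock(mutex_);
        connections_[fd] = std::move(conn);
        return connections_[fd].get();
    }
    // Called from the event loop on close; never deletes there.
    void markClosed(int fd) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = connections_.find(fd);
        if (it != connections_.end()) it->second->isClosed = true;
    }
    // Called periodically from a cleanup thread; dropping the shared_ptr frees the Connection.
    void reap() {
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto it = connections_.begin(); it != connections_.end(); ) {
            if (it->second->isClosed)
                it = connections_.erase(it);
            else
                ++it;
        }
    }
private:
    std::map<int, std::shared_ptr<Connection>> connections_;
    std::mutex mutex_;
};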
Hope that helps!
Let me know if you want to discuss further.

Multiple constant buffers - registers - DX12

I'm learning DX12 with this tutorial:
https://www.braynzarsoft.net/viewtutorial/q16390-directx-12-constant-buffers-root-descriptor-tables#c0
I tried to modify this step to get 2 constant buffers (so registers b0 and b1, if I understood correctly).
To do that, I started by declaring 2 parameters in my root signature:
// create root signature
// create a descriptor range (descriptor table) and fill it out
// this is a range of descriptors inside a descriptor heap
D3D12_DESCRIPTOR_RANGE descriptorTableRanges[1]; // only one range right now
descriptorTableRanges[0].RangeType = D3D12_DESCRIPTOR_RANGE_TYPE_CBV; // this is a range of constant buffer views (descriptors)
descriptorTableRanges[0].NumDescriptors = 2; // we only have one constant buffer, so the range is only 1
descriptorTableRanges[0].BaseShaderRegister = 0; // start index of the shader registers in the range
descriptorTableRanges[0].RegisterSpace = 0; // space 0. can usually be zero
descriptorTableRanges[0].OffsetInDescriptorsFromTableStart = D3D12_DESCRIPTOR_RANGE_OFFSET_APPEND; // this appends the range to the end of the root signature descriptor tables
// create a descriptor table
D3D12_ROOT_DESCRIPTOR_TABLE descriptorTable;
descriptorTable.NumDescriptorRanges = 0;// _countof(descriptorTableRanges); // we only have one range
descriptorTable.pDescriptorRanges = &descriptorTableRanges[0]; // the pointer to the beginning of our ranges array
D3D12_ROOT_DESCRIPTOR_TABLE descriptorTable2;
descriptorTable2.NumDescriptorRanges = 1;// _countof(descriptorTableRanges); // we only have one range
descriptorTable2.pDescriptorRanges = &descriptorTableRanges[0]; // the pointer to the beginning of our ranges array
// create a root parameter and fill it out
D3D12_ROOT_PARAMETER rootParameters[2]; // only one parameter right now
rootParameters[0].ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE; // this is a descriptor table
rootParameters[0].DescriptorTable = descriptorTable; // this is our descriptor table for this root parameter
rootParameters[0].ShaderVisibility = D3D12_SHADER_VISIBILITY_VERTEX; // our pixel shader will be the only shader accessing this parameter for now
rootParameters[1].ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE; // this is a descriptor table
rootParameters[1].DescriptorTable = descriptorTable2; // this is our descriptor table for this root parameter
rootParameters[1].ShaderVisibility = D3D12_SHADER_VISIBILITY_VERTEX; // our pixel shader will be the only shader accessing this parameter for now
But now I am failing to link the constant buffers to variables. I tried to modify this part of the code:
// Create a constant buffer descriptor heap for each frame
// this is the descriptor heap that will store our constant buffer descriptor
for (int i = 0; i < frameBufferCount; ++i)
{
D3D12_DESCRIPTOR_HEAP_DESC heapDesc = {};
heapDesc.NumDescriptors = 1;
heapDesc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
heapDesc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
hr = device->CreateDescriptorHeap(&heapDesc, IID_PPV_ARGS(&mainDescriptorHeap[i]));
if (FAILED(hr))
{
Running = false;
}
}
// create the constant buffer resource heap
// We will update the constant buffer one or more times per frame, so we will use only an upload heap
// unlike previously we used an upload heap to upload the vertex and index data, and then copied over
// to a default heap. If you plan to use a resource for more than a couple frames, it is usually more
// efficient to copy to a default heap where it stays on the gpu. In this case, our constant buffer
// will be modified and uploaded at least once per frame, so we only use an upload heap
// create a resource heap, descriptor heap, and pointer to cbv for each frame
for (int i = 0; i < frameBufferCount; ++i)
{
hr = device->CreateCommittedResource(
&CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_UPLOAD), // this heap will be used to upload the constant buffer data
D3D12_HEAP_FLAG_NONE, // no flags
&CD3DX12_RESOURCE_DESC::Buffer(1024 * 64), // size of the resource heap. Must be a multiple of 64KB for single-textures and constant buffers
D3D12_RESOURCE_STATE_GENERIC_READ, // will be data that is read from so we keep it in the generic read state
nullptr, // we do not use an optimized clear value for constant buffers
IID_PPV_ARGS(&constantBufferUploadHeap[i]));
constantBufferUploadHeap[i]->SetName(L"Constant Buffer Upload Resource Heap");
D3D12_CONSTANT_BUFFER_VIEW_DESC cbvDesc = {};
cbvDesc.BufferLocation = constantBufferUploadHeap[i]->GetGPUVirtualAddress();
cbvDesc.SizeInBytes = (sizeof(ConstantBuffer) + 255) & ~255; // CB size is required to be 256-byte aligned.
device->CreateConstantBufferView(&cbvDesc, mainDescriptorHeap[i]->GetCPUDescriptorHandleForHeapStart());
ZeroMemory(&cbColorMultiplierData, sizeof(cbColorMultiplierData));
CD3DX12_RANGE readRange(0, 0); // We do not intend to read from this resource on the CPU. (End is less than or equal to begin)
hr = constantBufferUploadHeap[i]->Map(0, &readRange, reinterpret_cast<void**>(&cbColorMultiplierGPUAddress[i]));
memcpy(cbColorMultiplierGPUAddress[i], &cbColorMultiplierData, sizeof(cbColorMultiplierData));
}
Thanks
Your root signature is incorrect: you are trying to set a descriptor table with no ranges.
You have 3 ways to bind a constant buffer in a root signature: with root constants, with a root constant buffer view, and with a descriptor table. The first two connect one constant buffer per root parameter, while the third allows you to set multiple constant buffers in a single table.
In your case, a single root parameter of type descriptor table, with a single range referring to an array of 2 descriptors, is enough to let you bind 2 constant buffers.
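A minimal sketch of that layout, using the same API as the question (variable names here are illustrative):
D3D12_DESCRIPTOR_RANGE cbvRange = {};
cbvRange.RangeType = D3D12_DESCRIPTOR_RANGE_TYPE_CBV;
cbvRange.NumDescriptors = 2;                       // covers registers b0 and b1
cbvRange.BaseShaderRegister = 0;                   // starts at b0
cbvRange.RegisterSpace = 0;
cbvRange.OffsetInDescriptorsFromTableStart = D3D12_DESCRIPTOR_RANGE_OFFSET_APPEND;

D3D12_ROOT_DESCRIPTOR_TABLE cbvTable = {};
cbvTable.NumDescriptorRanges = 1;                  // exactly one range, never zero
cbvTable.pDescriptorRanges = &cbvRange;

D3D12_ROOT_PARAMETER rootParameters[1] = {};
rootParameters[0].ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
rootParameters[0].DescriptorTable = cbvTable;
rootParameters[0].ShaderVisibility = D3D12_SHADER_VISIBILITY_VERTEX;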
I recommend you read how root signatures are declared in HLSL, to better understand the concept and how it translates to the C++ declaration.
As for the runtime portion of manipulating constant buffers, you will have to be very careful again: there is no lifetime management in D3D12 or the driver like there was with D3D11, and you cannot update a constant buffer's memory in place without making sure the GPU is already done using the previous content. The usual solution is to allocate your per-frame constant buffers from a ring buffer and to use fences to keep from overwriting them too soon.
I highly recommend you stick with D3D11. D3D12 is not a replacement for it; it is made to overcome performance issues that only appear in extremely complex renderers, and it is meant for people who already have expert knowledge of the GPU and of D3D11. If your application is not at the level of complexity of a GTA V (just an example), you are only shooting yourself in the foot by switching to D3D12.
Your real problem is: you defined 2 CBV descriptors in one range, and then defined 2 descriptor tables using that range. So you declared 4 CBVs instead of 2, and when you define the descriptor heap you set heapDesc.NumDescriptors to 1 instead of 4, even though your code declares 4 constant-buffer descriptors, not 2.
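With a single-table layout (one range of 2 CBVs), the heap needs one slot per view, and the second view is created at an offset handle. A rough sketch (cbvDesc0, cbvDesc1 and the surrounding variables stand in for the question's own, and CD3DX12_CPU_DESCRIPTOR_HANDLE comes from d3dx12.h, which the question already uses):
D3D12_DESCRIPTOR_HEAP_DESC heapDesc = {};
heapDesc.NumDescriptors = 2;                       // one slot per constant buffer view
heapDesc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
heapDesc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
hr = device->CreateDescriptorHeap(&heapDesc, IID_PPV_ARGS(&mainDescriptorHeap[i]));

UINT increment = device->GetDescriptorHandleIncrementSize(D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
CD3DX12_CPU_DESCRIPTOR_HANDLE handle(mainDescriptorHeap[i]->GetCPUDescriptorHandleForHeapStart());
device->CreateConstantBufferView(&cbvDesc0, handle);   // lands in the table slot mapped to b0
handle.Offset(1, increment);
device->CreateConstantBufferView(&cbvDesc1, handle);   // lands in the slot mapped to b1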

How to reduce CPU usage during real-time data transfer on TCP ports

I have a socket program which acts as both client and server.
It initiates a connection on an input port and reads data from it. In a real-time scenario it reads data on the input port and sends the data (record by record) to the output port.
The problem is that while sending data to the output port, CPU usage increases to 50%, which is not permissible.
while(1)
{
    if(IsInputDataAvail==1)//check if data is available on input port
    {
        //condition to avoid duplications while sending
        if( LastRecordSent < LastRecordRecvd )
        {
            record_time temprt;
            list<record_time> BufferList;
            list<record_time>::iterator j;
            list<record_time>::iterator i;

            // Storing into a temp list
            for(i=L.begin(); i != L.end(); ++i)
            {
                if((i->recordId > LastRecordSent) && (i->recordId <= LastRecordRecvd))
                {
                    temprt.listrec = i->listrec;
                    temprt.recordId = i->recordId;
                    temprt.timestamp = i->timestamp;
                    BufferList.push_back(temprt);
                }
            }

            //Sending to output port
            for(j=BufferList.begin(); j != BufferList.end(); ++j)
            {
                LastRecordSent = j->recordId;
                std::string newlistrecord = j->listrec;
                newlistrecord.append("\n");
                char* newrecord= new char [newlistrecord.size()+1];
                strcpy (newrecord, newlistrecord.c_str());
                if ( s.OutputClientAvail() == 1) //check if output client is available
                {
                    int ret = s.SendBytes(newrecord,strlen(newrecord));
                    if ( ret < 0)
                    {
                        log1.AddLogFormatFatal("Nice Send Thread : Nice Client Disconnected");
                        --connected;
                        return;
                    }
                }
                else
                {
                    log1.AddLogFormatFatal("Nice Send Thread : Nice Client Timedout..connection closed");
                    --connected; //if output client not available disconnect after a timeout
                    return;
                }
            }
        }
    }
    // Sleep(100); if we include a sleep here, CPU usage is lower, but to send data in real time I need to remove this sleep.
}//End of while loop
If I remove the Sleep(), CPU usage goes very high while sending data to the output port.
Are there any ways to maintain real-time data transfer while reducing CPU usage? Please suggest.
There are two potential CPU sinks in the listed code. First, the outer loop:
while (1)
{
    if (IsInputDataAvail == 1)
    {
        // Not run most of the time
    }
    // Sleep(100);
}
Given that the Sleep call significantly reduces your CPU usage, this spin-loop is the most likely culprit. It looks like IsInputDataAvail is a variable set by another thread (though it could be a preprocessor macro), which would mean that almost all of that CPU is being used to run this one comparison instruction and a couple of jumps.
The way to reclaim that wasted power is to block until input is available. Your reading thread probably does so already, so you just need some sort of semaphore to communicate between the two, with a system call to block the output thread. Where available, the ideal option would be sem_wait() in the output thread, right at the top of your loop, and sem_post() in the input thread, where it currently sets IsInputDataAvail. If that's not possible, the self-pipe trick might work in its place.
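A minimal sketch of that signalling, assuming a POSIX semaphore and the same reader/sender thread split as the question (names are placeholders):
#include <semaphore.h>

sem_t records_available;   // initialise once: sem_init(&records_available, 0, 0);

// Input (reader) thread: after appending a record to the shared list
//     sem_post(&records_available);

// Output (sender) thread: replaces the IsInputDataAvail spin
void sender_loop()
{
    while (true)
    {
        sem_wait(&records_available);   // blocks until the reader posts; no busy-waiting
        // ... copy the newly received records and send them, as in the original loop ...
    }
}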
The second potential CPU sink is in s.SendBytes(). If a positive result indicates that the record was fully sent, then that method must be using a loop. It probably uses a blocking call to write the record; if it doesn't, then it could be rewritten to do so.
Alternatively, you could rewrite half the application to use select(), poll(), or a similar method to merge reading and writing into the same thread, but that's far too much work if your program is already mostly complete.
if(IsInputDataAvail==1)//check if data is available on input port
Get rid of that. Just read from the input port. It will block until data is available. This is where most of your CPU time is going. However there are other problems:
std::string newlistrecord = j->listrec;
Here you are copying data.
newlistrecord.append("\n");
char* newrecord= new char [newlistrecord.size()+1];
strcpy (newrecord, newlistrecord.c_str());
Here you are copying the same data again. You are also dynamically allocating memory, and you are also leaking it.
if ( s.OutputClientAvail() == 1) //check if output client is available
I don't know what this does but you should delete it. The following send is the time to check for errors. Don't try to guess the future.
int ret = s.SendBytes(newrecord,strlen(newrecord));
Here you are recomputing the length of the string, which you probably already knew back when you set j->listrec. It would be much more efficient to just call s.SendBytes() directly with j->listrec and then again with "\n" than to do all this. TCP will coalesce the data anyway.
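Putting that together, the inner send loop could shrink to something like this (a sketch that assumes listrec is a std::string and that s.SendBytes() accepts a const pointer and a length, as in the question):
for (j = BufferList.begin(); j != BufferList.end(); ++j)
{
    LastRecordSent = j->recordId;
    // send the record and the newline separately; no copies, no new[], no leak
    if (s.SendBytes(j->listrec.c_str(), j->listrec.size()) < 0 ||
        s.SendBytes("\n", 1) < 0)
    {
        log1.AddLogFormatFatal("Nice Send Thread : Nice Client Disconnected");
        --connected;
        return;
    }
}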

How to iterate through a fd_set

I'm wondering if there's an easy way to iterate through an fd_set. The reason I want to do this is to avoid looping through all connected sockets, since select() alters these fd_sets to only include the ones I'm interested in. I also know that relying on the internals of a type that is not meant to be accessed directly is generally a bad idea, since it may vary across different systems. However, I need some way to do this, and I'm running out of ideas. So, my question is:
How do I iterate through an fd_set? If this is really bad practice, are there any other ways to solve my "problem" besides looping through all connected sockets?
Thanks
You have to fill in an fd_set struct before calling select(); you cannot pass in your original std::set of sockets directly. select() then modifies the fd_set accordingly, removing any sockets that are not "set", and returns how many sockets remain. You have to loop through the resulting fd_set, not your std::set. There is no need to call FD_ISSET(), because the resulting fd_set only contains the sockets that are ready (this relies on Winsock's fd_set layout, which exposes fd_count and fd_array), eg:
fd_set read_fds;
FD_ZERO(&read_fds);

int max_fd = 0;
read_fds.fd_count = connected_sockets.size();
for( int i = 0; i < read_fds.fd_count; ++i )
{
    read_fds.fd_array[i] = connected_sockets[i];
    if (read_fds.fd_array[i] > max_fd)
        max_fd = read_fds.fd_array[i];
}

if (select(max_fd+1, &read_fds, NULL, NULL, NULL) > 0)
{
    for( int i = 0; i < read_fds.fd_count; ++i )
        do_socket_operation( read_fds.fd_array[i] );
}
Where FD_ISSET() comes into play more often is when using error checking with select(), eg:
fd_set read_fds;
FD_ZERO(&read_fds);

fd_set error_fds;
FD_ZERO(&error_fds);

int max_fd = 0;
read_fds.fd_count = connected_sockets.size();
for( int i = 0; i < read_fds.fd_count; ++i )
{
    read_fds.fd_array[i] = connected_sockets[i];
    if (read_fds.fd_array[i] > max_fd)
        max_fd = read_fds.fd_array[i];
}

error_fds.fd_count = read_fds.fd_count;
for( int i = 0; i < read_fds.fd_count; ++i )
{
    error_fds.fd_array[i] = read_fds.fd_array[i];
}

if (select(max_fd+1, &read_fds, NULL, &error_fds, NULL) > 0)
{
    for( int i = 0; i < read_fds.fd_count; ++i )
    {
        if( !FD_ISSET(read_fds.fd_array[i], &error_fds) )
            do_socket_operation( read_fds.fd_array[i] );
    }
    for( int i = 0; i < error_fds.fd_count; ++i )
    {
        do_socket_error( error_fds.fd_array[i] );
    }
}
select() sets the bit corresponding to each ready file descriptor in the set, so you need not iterate through all the fds if you are only interested in a few (and can ignore the others); just test the file descriptors you are interested in.
if (select(fdmax+1, &read_fds, NULL, NULL, NULL) == -1) {
    perror("select");
    exit(4);
}

if(FD_ISSET(fd0, &read_fds))
{
    //do things
}

if(FD_ISSET(fd1, &read_fds))
{
    //do more things
}
EDIT
Here is the fd_set struct (this is the Windows/Winsock definition):
typedef struct fd_set {
    u_int  fd_count;                 /* how many are SET? */
    SOCKET fd_array[FD_SETSIZE];     /* an array of SOCKETs */
} fd_set;
Here, fd_count is the number of sockets that are set (so you can add an optimization using this) and fd_array is an array of socket handles. On POSIX systems, fd_set is instead a bit vector of FD_SETSIZE bits, which is typically 1024.
So your question essentially becomes: what is the most efficient way to find the positions of the set bits in a bit vector of around 1024 bits?
I want to clear one thing up here:
"Looping through all the connected sockets" doesn't mean that you are actually reading from or doing work on a connection. FD_ISSET() only checks whether the bit positioned at the connection's file descriptor number is set in the fd_set. If efficiency is your aim, isn't this already the most efficient approach, short of heuristics?
Please tell us what's wrong with this method, and what you are trying to achieve with the alternate method.
It's fairly straight-forward:
for( int fd = 0; fd <= max_fd; fd++ )
    if ( FD_ISSET(fd, &my_fd_set) )
        do_socket_operation( fd );
This looping is a limitation of the select() interface. The underlying implementations of fd_set are usually a bit set, which obviously means that looking for a socket requires scanning over the bits.
It is for precisely this reason that several alternative interfaces have been created - unfortunately, they are all OS-specific. For example, Linux provides epoll, which returns a list of only the file descriptors that are active. FreeBSD and Mac OS X both provide kqueue, which accomplishes the same result.
See section 7.2 of Beej's Guide to Network Programming - '7.2. select()—Synchronous I/O Multiplexing' - which uses FD_ISSET.
In short, you must iterate through the fd_set in order to determine whether each file descriptor is ready for reading/writing.
I don't think what you are trying to do is a good idea.
Firstly, it's system-dependent; but I believe you already know that.
Secondly, at the internal level these sets are stored as an array of integers, with fds stored as set bits. According to the select() man pages, FD_SETSIZE is 1024.
Even if you only wanted to pull out the fds you are interested in, you would have to loop over that whole range, along with the mess of bit manipulation.
So unless you are waiting on more than FD_SETSIZE fds in select(), which I don't think is possible, it's not a good idea. In any case, it's not a good idea.
I don't think you can do much with the select() call efficiently. The information at "The C10K problem" is still valid.
You will need some platform-specific solutions:
Linux => epoll
FreeBSD => kqueue
Or you could use an event library such as libev to hide the platform details for you.
ffs() may be used on POSIX or 4.3BSD for bit iteration, though it expects an int (the long and long long versions are glibc extensions). Of course, you have to check whether ffs() is optimized as well as e.g. strlen and strchr.
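As a deliberately non-portable sketch of that idea, assuming a glibc-style fd_set whose bits live in the array of longs exposed by the __FDS_BITS() macro, and using the glibc ffsl() extension:
#include <string.h>        // ffsl() (glibc extension)
#include <sys/select.h>    // fd_set, __FDS_BITS()

// Calls cb(fd) for every descriptor that is set in *set, scanning word by word.
void for_each_set_fd(fd_set* set, int max_fd, void (*cb)(int fd))
{
    const int bits_per_word = 8 * (int)sizeof(long);
    const int nwords = max_fd / bits_per_word + 1;
    for (int w = 0; w < nwords; ++w)
    {
        long word = __FDS_BITS(set)[w];
        while (word != 0)
        {
            int bit = ffsl(word) - 1;            // index of the lowest set bit
            cb(w * bits_per_word + bit);
            word &= word - 1;                    // clear that bit and continue
        }
    }
}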