Heap solution outperforming stack?

Heap solution outperforming stack? - c++

I am coding a C simulation, in which, given a sequence of rules to verify, we break it up into 'slices' and verify each slice. (The basic idea is that the order is important, and the actual meaning of a rule is affected by some rules above it; we can make a 'slice' with each rule and only those rules above it which overlap it. We then verify the slices, which are usually much smaller than the whole sequence was.)
My problem is as follows.
I have a struct (policy) which contains an array of structs (rules), and an int (length).
My original implementation used malloc and realloc liberally:
struct{
struct rule *rules;
int length;
}policy;
...
struct policy makePolicy(int length)
{
struct policy newPolicy;
newPolicy.rules = malloc(length * sizeof(struct rule));
newPolicy.length = length;
return newPolicy;
}
...
struct policy makeSlice(struct policy inPol, int rulePos)
{
if(rulePos > inPol.length - 1){
printf("Slice base outside policy \n");
exit(1);
}
struct slice = makePolicy(inPol.length);
//create slice, loop counter gets stored in sliceLength
slice.rules = realloc(slice.rules, sliceLength * sizeof(struct rule));
slice.length = sliceLength;
return slice;
}
As this uses malloc'ed memory, I'm assuming it makes heavy use of heap.
Now I'm trying to port to an experimental parallel machine, which has no malloc.
I sadly went and allocated everything with fixed size arrays.
Now here's the shocker.
The new code runs slower. Much slower.
(The original code used to wait for minutes on end when the slice length was say 200, and maybe an hour at over 300 ... now it does that when the slice length is 70, 80 ... and has been taking hours for say 120. Still not 200.)
The only thing is that now the slices are given the same memory as a full policy (MAXBUFLEN is 10000), but the whole doesn't seem to be running out of memory at all. 'top' shows that the total memory consumed is quite modest, tens-of-megabytes range, as before. (And of course as I'm storing the length, I'm not looping over the whole thing, just the part with real rules.)
Could anyone please help explain why it suddenly got so much slower?

It seems that when you fixed the size of the struct to a larger size (say 10000 rules), your cache locality could become much worse than the original one. You can use a profiler (oprofile or cachegrind in Valgrind) to see if cache is a problem.
In the original program, one cache line can hold at most 8 struct policy (on a 32bit machine with 64byte cache line). But in the modified verison it can only hold one since it is now much larger than the cache line size.
Move the length field up can improve performance in this case since now the length and the first few struct rules can fit into a single cache line.
struct policy{
int length;
struct rule rules[10000];
};
To solve this problem you need to write your own custom allocator to ensure cache locality. If you are writing a parallel version of this program, also remember to isolate memory used by different threads into different cache lines to avoid cache line contention.

Related

Is std::hardware_constructive_interference_size ever useful?

Existing questions in this area that still don't ask specifically my question:
Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size
Correct way to ensure sharing with std::hardware_constructive_interference_size
The answer to second one actually makes me ask this question.
So, assuming I want to have constructive interference. And I'm putting few variables into a single struct that fits std::hardware_constructive_interference_size:
struct together
{
int a;
int b;
};
Advantage seems too weak to ban compilation with static_assert if it does not fit:
// Not going to do the below:
static_assert(sizeof(together) <= std::hardware_constructive_interference_size);
Still aligning is helpful to avoid structure span:
struct alignas(std::hardware_constructive_interference_size) together
{
int a;
int b;
};
However the same effect can be achieved with aligning just on structure size:
struct alignas(std::bit_ceil(2*sizeof(int))) together
{
int a;
int b;
};
If structure size is larger than std::hardware_constructive_interference_size, it may still be helpful to align it on structure size, because:
It is compile-time hint that may become obsolete with later CPUs the compiled program run on
It is one of cache levels cache line size, if there are more than one, exceeding one cache level cache line may still give useful sharing of other level cache line
Aligning on structure size is not going to make much more than twice overhead. Aligning on cache line size may potentially cause more overhead, if cache line size becomes way more than structure size.
So, is there any point left for std::hardware_constructive_interference_size?

Consider a std::deque<T>. It's often implemented using chunks of a given size. But how many T's do you store per chunk? A reasonable answer is std::hardware_constructive_interference_size/sizeof(T), if sizeof(T) is small.
Similarly, a string class with the Small String Optimization may aim for a size of std::hardware_constructive_interference_size. In general, the size is useful when you can have a run-time variable amount of data with high locality of reference.

Fast synchronized access to shared array with changing base address (in C11)

I am currently designing a user space scheduler in C11 for a custom co-processor under Linux (user space, because the co-processor does not run its own OS, but is controlled by software running on the host CPU). It keeps track of all the tasks' states with an array. Task states are regular integers in this case. The array is dynamically allocated and each time a new task is submitted whose state does not fit into the array anymore, the array is reallocated to twice its current size. The scheduler uses multiple threads and thus needs to synchronize its data structures.
Now, the problem is that I very often need to read entries in that array, since I need to know the states of tasks for scheduling decisions and resource management. If the base address was guaranteed to always be the same after each reallocation, I would simply use C11 atomics for accessing it. Unfortunately, realloc obviously cannot give such a guarantee. So my current approach is wrapping each access (reads AND writes) with one big lock in the form of a pthread mutex. Obviously, this is really slow, since there is locking overhead for each read, and the read is really small, since it only consists of a single integer.
To clarify the problem, I give some code here showing the relevant passages:
Writing:
// pthread_mutex_t mut;
// size_t len_arr;
// int *array, idx, x;
pthread_mutex_lock(&mut);
if (idx >= len_arr) {
len_arr *= 2;
array = realloc(array, len_arr*sizeof(int));
if (array == NULL)
abort();
}
array[idx] = x;
pthread_mutex_unlock(&mut);
Reading:
// pthread_mutex_t mut;
// int *array, idx;
pthread_mutex_lock(&mut);
int x = array[idx];
pthread_mutex_unlock(&mut);
I have already used C11 atomics for efficient synchronization elsewhere in the implementation and would love to use them to solve this problem as well, but I could not find an efficient way to do so. In a perfect world, there would be an atomic accessor for arrays which performs address calculation and memory read/write in a single atomic operation. Unfortunately, I could not find such an operation. But maybe there is a similarly fast or even faster way of achieving synchronization in this situation?
EDIT:
I forgot to specify that I cannot reuse slots in the array when tasks terminate. Since I guarantee access to the state of every task ever submitted since the scheduler was started, I need to store the final state of each task until the application terminates. Thus, static allocation is not really an option.

Do you need to be so economical with virtual address space? Can't you just set a very big upper limit and allocate enough address space for it (maybe even a static array, or dynamic if you want the upper limit to be set at startup from command-line options).
Linux does lazy memory allocation so virtual pages that you never touch aren't actually using any physical memory. See Why is iterating though `std::vector` faster than iterating though `std::array`? that show by example that reading or writing an anonymous page for the first time causes a page fault. If it was a read access, it gets the kernel to CoW (copy-on-write) map it to a shared physical zero page. Only an initial write, or a write to a CoW page, triggers actual allocation of a physical page.
Leaving virtual pages completely untouched avoids even the overhead of wiring them into the hardware page tables.
If you're targeting a 64-bit ISA like x86-64, you have boatloads of virtual address space. Using up more virtual address space (as long as you aren't wasting physical pages) is basically fine.
Practical example of allocating more address virtual space than you can use:
If you allocate more memory than you could ever practically use (touching it all would definitely segfault or invoke the kernel's OOM killer), that will be as large or larger than you could ever grow via realloc.
To allocate this much, you may need to globally set /proc/sys/vm/overcommit_memory to 1 (no checking) instead of the default 0 (heuristic which makes extremely large allocations fail). Or use mmap(MAP_NORESERVE) to allocate it, making that one mapping just best-effort growth on page-faults.
The documentation says you might get a SIGSEGV on touching memory allocated with MAP_NORESERVE, which is different than invoking the OOM killer. But I think once you've already successfully touched memory, it is yours and won't get discarded. I think it's also not going to spuriously fail unless you're actually running out of RAM + swap space. IDK how you plan to detect that in your current design (which sounds pretty sketchy if you have no way to ever deallocate).
Test program:
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
int main(void) {
size_t sz = 1ULL << 46; // 2**46 = 64 TiB = max power of 2 for x86-64 with 48-bit virtual addresses
// in practice 1ULL << 40 (1TiB) should be more than enough.
// the smaller you pick, the less impact if multiple things use this trick in the same program
//int *p = aligned_alloc(64, sz); // doesn't use NORESERVE so it will be limited by overcommit settings
int *p = mmap(NULL, sz, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
madvise(p, sz, MADV_HUGEPAGE); // for good measure to reduce page-faults and TLB misses, since you're using large contiguous chunks of this array
p[1000000000] = 1234; // or sz/sizeof(int) - 1 will also work; this is only touching 1 page somewhere in the array.
printf("%p\n", p);
}
$ gcc -Og -g -Wall alloc.c
$ strace ./a.out
... process startup
mmap(NULL, 70368744177664, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x15c71ef7c000
madvise(0x15c71ef7c000, 70368744177664, MADV_HUGEPAGE) = 0
... stdio stuff
write(1, "0x15c71ef7c000\n", 15) = 15
0x15c71ef7c000
exit_group(0) = ?
+++ exited with 0 +++
My desktop has 16GiB of RAM (a lot of it in use by Chromium and some big files in /tmp) + 2GiB of swap. Yet this program allocated 64 TiB of virtual address space and touched 1 int of it nearly instantly. Not measurably slower than if it had only allocated 1MiB. (And future performance from actually using that memory should also be unaffected.)
The largest power-of-2 you can expect to work on current x86-64 hardware is 1ULL << 46. The total lower canonical range of the 48-bit virtual address space is 47 bits (user-space virtual address space on Linux), and some of that is already allocated for stack/code/data. Allocating a contiguous 64 TiB chunk of that still leaves plenty for other allocations.
(If you do actually have that much RAM + swap, you're probably waiting for a new CPU with 5-level page tables so you can use even more virtual address space.)
Speaking of page tables, the larger the array the more chance of putting some other future allocations very very far from existing blocks. This can have a minor cost in TLB-miss (page walk) time, if your actual in-use pages end up more scattered around your address space in more different sub-trees of the multi-level page tables. That's more page-table memory to keep in cache (including cached within the page-walk hardware).
The allocation size doesn't have to be a power of 2 but it might as well be. There's also no reason to make it that big. 1ULL << 40 (1TiB) should be fine on most systems. IDK if having more than half the available address space for a process allocated could slow future allocations; bookkeeping is I think based on extents (ptr + length) not bitmaps.
Keep in mind that if everyone starts doing this for random arrays in libraries, that could use up a lot of address space. This is great for the main array in a program that spends a lot of time using it. Keep it as small as you can while still being big enough to always be more than you need. (Optionally make it a config parameter if you want to avoid a "640kiB is enough for everyone" situation). Using up virtual address space is very low-cost, but it's probably better to use less.
Think of this as reserving space for future growth but not actually using it until you touch it. Even though by some ways of looking at it, the memory already is "allocated". But in Linux it really isn't. Linux defaults to allowing "overcommit": processes can have more total anonymous memory mapped than the system has physical RAM + swap. If too many processes try to use too much by actually touching all that allocated memory, the OOM killer has to kill something (because the "allocate" system calls like mmap have already returned success). See https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
(With MAP_NORESERVE, it's only reserving address space which is shared between threads, but not reserving any physical pages until you touch them.)
You probably want your array to be page-aligned: #include <stdalign.h> so you can use something like
alignas(4096) struct entry process_array[MAX_LEN];
Or for non-static, allocate it with C11 aligned_alloc().
Give back early parts of the array when you're sure all threads are done with it
Page alignment makes it easy do the calculations to "give back" a memory page (4kiB on x86) if your array's logical size shrinks enough. madvise(addr, 4096*n, MADV_FREE); (Linux 4.5 and later). This is kind of like mmap(MAP_FIXED) to replace some pages with new untouched anonymous pages (that will read as zeroes), except it doesn't split up the logical mapping extents and create more bookkeeping for the kernel.
Don't bother with this unless you're returning multiple pages, and leave at least one page unfreed above the current top to avoid page faults if you grow again soon. Like maybe maintain a high-water mark that you've ever touched (without giving back) and a current logical size. If high_water - logical_size > 16 pages give back all page from 4 past the logical size up to the high water mark.
If you will typically be actually using/touching at least 2MiB of your array, use madvise(MADV_HUGEPAGE) when you allocate it to get the kernel to prefer using transparent hugepages. This will reduce TLB misses.
(Use strace to see return values from your madvise system calls, and look at /proc/PID/smaps, to see if your calls are having the desired effect.)
If up-front allocation is unacceptable, RCU (read-copy-update) might be viable if it's read-mostly. https://en.wikipedia.org/wiki/Read-copy-update. But copying a gigantic array every time an element changes isn't going to work.
You'd want a different data-structure entirely where only small parts need to be copied. Or something other than RCU; like your answer, you might not need the read side being always wait-free. The choice will depend on acceptable worst-case latency and/or average throughput, and also how much contention there is for any kind of ref counter that has to bounce around between all threads.
Too bad there isn't a realloc variant that attempts to grow without copying so you could attempt that before bothering other threads. (e.g. have threads with idx>len spin-wait on len in case it increases without the array address changing.)

So, I came up with a solution:
Reading:
while(true) {
cnt++;
if (wait) {
cnt--;
yield();
} else {
break;
}
}
int x = array[idx];
cnt--;
Writing:
if (idx == len) {
wait = true;
while (cnt > 0); // busy wait to minimize latency of reallocation
array = realloc(array, 2*len*sizeof(int));
if (!array) abort(); // shit happens
len *= 2; // must not be updated before reallocation completed
wait = false;
}
// this is why len must be updated after realloc,
// it serves for synchronization with other writers
// exceeding the current length limit
while (idx > len) {yield();}
while(true) {
cnt++;
if (wait) {
cnt--;
yield();
} else {
break;
}
}
array[idx] = x;
cnt--;
wait is an atomic bool initialized as false, cnt is an atomic int initialized as zero.
This only works because I know that task IDs are chosen ascendingly without gaps and that no task state is read before it is initialized by the write operation. So I can always rely on the one thread which pulls the ID which only exceeds current array length by 1. New tasks created concurrently will block their thread until the responsible thread performed reallocation. Hence the busy wait, since the reallocation should happen quickly so the other threads do not have to wait for too long.
This way, I eliminate the bottlenecking big lock. Array accesses can be made concurrently at the cost of two atomic additions. Since reallocation occurs seldom (due to exponential growth), array access is practically block-free.
EDIT:
After taking a second look, I noticed that one has to be careful about reordering of stores around the length update. Also, the whole thing only works if concurrent writes always use different indices. This is the case for my implementation, but might not generally be. Thus, this is not as elegant as I thought and the solution presented in the accepted answer should be preferred.

Keep vector members of class contiguous with class instance

I have a class that implements two simple, pre-sized stacks; those are stored as members of the class of type vector pre-sized by the constructor. They are small and cache line size friendly objects.
Those two stacks are constant in size, persisted and updated lazily, and are often accessed together by some computationally cheap methods that, however, can be called a large number of times (tens to hundred of thousands of times per second).
All objects are already in good state (code is clean and does what it's supposed to do), all sizes kept under control (64k to 128K most cases for the whole chain of ops including results, rarely they get close to 256k, so at worse an L2 look-up and often L1).
some auto-vectorization comes into play, but other than that it's single threaded code throughout.
The class, minus some minor things and padding, looks like this:
class Curve{
private:
std::vector<ControlPoint> m_controls;
std::vector<Segment> m_segments;
unsigned int m_cvCount;
unsigned int m_sgCount;
std::vector<unsigned int> m_sgSampleCount;
unsigned int m_maxIter;
unsigned int m_iterSamples;
float m_lengthTolerance;
float m_length;
}
Curve::Curve(){
m_controls = std::vector<ControlPoint>(CONTROL_CAP);
m_segments = std::vector<Segment>( (CONTROL_CAP-3) );
m_cvCount = 0;
m_sgCount = 0;
std::vector<unsigned int> m_sgSampleCount(CONTROL_CAP-3);
m_maxIter = 3;
m_iterSamples = 20;
m_lengthTolerance = 0.001;
m_length = 0.0;
}
Curve::~Curve(){}
Bear with the verbosity, please, I'm trying to educate myself and make sure I'm not operating by some half-arsed knowledge:
Given the operations that are run on those and their actual use, performance is largely memory I/O bound.
I have a few questions related to optimal positioning of the data, keep in mind this is on Intel CPUs (Ivy and a few Haswell) and with GCC 4.4, I have no other use cases for this:
I'm assuming that if the actual storage of controls and segments are contiguous to the instance of Curve that's an ideal scenario for the cache (size wise the lot can easily fit on my target CPUs).
A related assumption is that if the vectors are distant from the instance of the Curve , and between themselves, as methods alternatively access the contents of those two members, there will be more frequent eviction and re-populating the L1 cache.
1) Is that correct (data is pulled for the entire stretch of cache size from the address first looked up on a new operation, and not in convenient multiple segments of appropriate size), or am I mis-understanding the caching mechanism and the cache can pull and preserve multiple smaller stretches of ram?
2) Following the above, insofar by pure circumstance all my test always end up with the class' instance and the vectors contiguous, but I assume that's just dumb luck, however statistically probable. Normally instancing the class reserves only the space for that object, and then the vectors are allocated in the next free contiguous chunk available, which is not guaranteed to be anywhere near my Curve instance if that previously found a small emptier niche in memory.
Is this correct?
3) Assuming 1 and 2 are correct, or close enough functionally speaking, I understand to guarantee performance I'd have to write an allocator of sorts to make sure the class object itself is large enough on instancing, and then copy the vectors in there myself and from there on refer to those.
I can probably hack my way to something like that if it's the only way to work through the problem, but I'd rather not hack it horribly if there are nice/smart ways to go about something like that. Any pointers on best practices and suggested methods would be hugely helpful (beyond "don't use malloc it's not guaranteed contiguous", that one I already have down :) ).

If the Curve instances fit into a cache line and the data of the two vectors also fit a cachline each, the situation is not that bad, because you have four constant cachelines then. If every element was accessed indirectly and randomly positioned in memory, every access to an element might cost you a fetch operation, which is avoided in that case. In the case that both Curve and its elements fit into less than four cachelines, you would reap benefits from putting them into contiguous storage.
True.
If you used std::array, you would have the guarantee that the elements are embedded in the owning class and not have the dynamic allocation (which in and of itself costs you memory space and bandwidth). You would then even avoid the indirect access that you would still have if you used a special allocator that puts the vector content in contiguous storage with the Curve instance.
BTW: Short style remark:
Curve::Curve()
{
m_controls = std::vector<ControlPoint>(CONTROL_CAP, ControlPoint());
m_segments = std::vector<Segment>(CONTROL_CAP - 3, Segment());
...
}
...should be written like this:
Curve::Curve():
m_controls(CONTROL_CAP),
m_segments(CONTROL_CAP - 3)
{
...
}
This is called "initializer list", search for that term for further explanations. Also, a default-initialized element, which you provide as second parameter, is already the default, so no need to specify that explicitly.

What is the theoretical impact of direct index access with "high" memory usage vs. "shifted" index access with "low" memory usage?

Well I am really curious as to what practice is better to keep, I know it (probably?) does not make any performance difference at all (even in performance critical applications?) but I am more curious about the impact on the generated code with optimization in mind (and for the sake of completeness, also "performance", if it makes any difference).
So the problem is as following:
element indexes range from A to B where A > 0 and B > A (eg, A = 1000 and B = 2000).
To store information about each element there are a few possible solutions, two of those which use plain arrays include direct index access and access by manipulating the index:
example 1
//declare the array with less memory, "just" 1000 elements, all elements used
std::array<T, B-A> Foo;
//but make accessing by index slower?
//accessing index N where B > N >= A
Foo[N-A];
example 2
//or declare the array with more memory, 2000 elements, 50% elements not used, not very "efficient" for memory
std::array<T, B> Foo;
//but make accessing by index faster?
//accessing index N where B > N >= A
Foo[N];
I'd personally go for #2 because I really like performance, but I think in reality:
the compiler will take care of both situations?
What is the impact on optimizations?
What about performance?
does it matter at all?
Or is this just the next "micro optimization" thing that no human being should worry about?
Is there some Tradeoff ratio between memory usage : speed which is recommended?

Accessing any array with an index involves adding an index multiplied by element size and adding it to the base-address of the array itself.
Since we are already adding one number to another, making the adjustment for foo[N-A] could easily be done by adjusting the base-address down by N * sizeof(T) before adding A * sizeof(T), rather than actually calculating (A-N)*sizeof(T).
In other words, any decent compiler should comletely hide this subtraction, assuming it is a constant value.
If it's not a constant [say you are using std::vector instread of std::array, then you will indeed subtract A from N at some point in the code. It is still pretty cheap to do this. Most modern processors can do this in one cycle with no latency for the result, so at worst adds a single clock-cycle to the access.
Of course, if the numbers are 1000-2000, probably makes really little difference in the whole scheme of things - either the total time to process that is nearly nothing, or it's a lot becuase you do complicated stuff. But if you were to make it a million elements, offset by half a million, it may make the difference between a simple or complex method of allocating them, or some such.
Also, as Hans Passant implies: Modern OS's with virutal memory handling, memory that isn't actually used doesn't get populated with "real memory". At work I was investigating a strange crash on a board that has 2GB of RAM, and when viewing the memory usage, it showed that this one applciation had allocated 3GB of virtual memory. This board does not have a swap-disk (it's an embedded system). It turns out that some code was simply allocating large chunks of memory that wasn't filled with anything, and it only stopped working when it reached 3GB (32-bit processor, 3+1GB memory split between user/kernel space). So even for LARGE lumps of memory, if you only have half of it, it won't actually take up any RAM, if you do not actually access it.
As ALWAYS when it comes to performance, compilers and such, if it's important, do not trust "the internet" to tell you the answer. Set up a test with the code you actually intend to use, using the actual compiler(s) and processor type(s) that you plan to produce your code with/for, and run benchmarks. Some compiler may well have a misfeature (on processor type XYZ9278) that makes it produce horrible code for a case that most other compilers do this "with no overhead at all".

C++ structure: more members, MUCH slower member access time?

I have a linked list of structures. Lets say I insert x million nodes into the linked list,
then I iterate trough all nodes to find a given value.
The strange thing is (for me at least), if I have a structure like this:
struct node
{
int a;
node *nxt;
};
Then I can iterate trough the list and check the value of a ten times faster compared to when I have another member in the struct, like this:
struct node_complex
{
int a;
string b;
node_complex *nxt;
};
I also tried it with C style strings (char array), the result was the same: just because I had another member (string), the whole iteration (+ value check) was 10 times slower, even if I did not even touched that member ever! Now, I do not know how the internals of structures work, but it looks like a high price to pay...
What is the catch?
Edit:
I am a beginner and this is the first time I use pointers, so chances are, the mistake is on my part. I will post the code ASAP (not being at home now).
Update:
I checked the values again, and I know see a much smaller difference: 2x instead of 10x.
It is much more reasonable for sure.
While it is certainly possible it was the case yesterday too and I was just so freaking tired last night I could not divide two numbers, I have just made more tests and the results are mind blowing.
The times for a the same number of nodes is:
One int and a pointer the time to iterate trough is 0.101
One int and a string: 0.196
One int and 2 strings: 0.274
One int and 3 strings: 0.147 (!!!)
For two ints it is: 0.107
Look what happens when there is more than two strings in the structure! It gets faster! Did somebody drop LSD into my coffee? No! I do not drink coffee.
It is way too fckd up for my brain at the mo' so I think I will just figure it out on my own instead of draining public resources here at SO.
(Ad: I do not think my profiling class is buggy, and anyway I can see the time difference with my own eyes).
Anyhow, thanks for the help.
Cheers.

I must be related to memory access. You speak of a million linked elements. With just an int and a pointer in the node, it takes 8 bytes (assuming 32 bits pointers). This takes up 8 MB memory, which is around the size of cache memory sizes.
When you add other members, you increase the overall size of your data. It does not fit anymore entirely in the cache memory. You revert to plain memory accesses that are much slower.

This may also be caused because during the iteration you may create a copy of your structures. That is:
node* pHead;
// ...
for (node* p = pHead; p; p = p->nxt)
{
node myNode = *p; // here you create a copy!
// ...
}
Copying a simple structure very fast. But the member you've added is a string, which is a complex object. Copying it is a relatively complex operation, with heap access.

Most likely, the issue is that your larger struct no longer fits inside a single cache line.
As I recall, mainstream CPUs typically use a cache line of 32 bytes. This means that data is read into the cache in chunks of 32 bytes at a time, and if you move past these 32 bytes, a second memory fetch is required.
Looking at your struct, it starts with an int, accounting for 4 bytes (usually), and then std::string (I assume, even though the namespace isn't specified), which in my standard library implementation (from VS2010) takes up 28 bytes, which gives us 32 bytes total. Which means that the initial int and the the next pointer will be placed in different cache lines, using twice as much cache space, and requiring twice as many memory accesses if both members are accessed during iteration.
If only the pointer is accessed, this shouldn't make a difference, though, as only the second cache line then has to be retrieved from memory.
If you always access the int and the pointer, and the string is required less often, reordering the members may help:
struct node_complex
{
int a;
node_complex *nxt;
string b;
};
In this case, the next pointer and the int are located next to each others, on the same cache line, so they can be read without requiring additional memory reads. But then you incur the additional cost once you need to read the string.
Of course, it's also possible that your benchmarking code includes creation of the nodes, or (intentional or otherwise) copies being created of the nodes, which would obviously also affect performance.

I'm not a spacialist at all, but the "cache miss" problem rings in my head while reading your problem.
When you had a member, as it makes the size of the structure get bigger, it also might cache misses when going throught the linked list (that is naturally cache-unfriendly if you don't have nodes allocated in one bloc and not far from each other in memory).
I can't find another explaination.
However, we don't have the creation and the loop provided so it's still hard to guess if you're not just having code that don't perform the list exploration in an efficient way.

Perhaps a solution would be a linked list of pointers to your object. It may make things more complicated (unless you use smart pointers, ect.) but it may increase search time.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js