I've got a problem with a really tight memory limit. I'm a C++ geek and I want to reduce my memory usage. Please give me some tips.
One of my friends recommended taking the functions inside my structs out of them.
For example, instead of using:
struct node {
    int f()
    {}
};
he recommended using:
int f(node x)
{}
Does this really help?
Note: I have lots of copies of my struct.
Here's some more information:
I'm coding some sort of segment tree for a practice problem on an online judge. I keep the tree nodes in a struct, which has these members:
int start;
int end;
bool flag;
node* left;
node* right;
The memory limit is 16 MB and I'm using 16.38 MB.
I'm guessing by the subtext of your question that the majority of your memory usage is data, not code. Here are a couple of tips:
If your data ranges are limited, take advantage of it. If the range of an integer is -128 to 127, use char instead of int, or unsigned char if it's 0 to 255. Likewise use int16_t or uint16_t for ranges of -32768..32767 and 0..65535.
Rearrange the structure elements so the larger items come first, so that data alignment doesn't leave dead space in the middle of the structure. You can also usually control padding via compiler options, but it's better just to make the layout optimal in the first place. (A sketch combining these tips follows this list.)
Use containers that don't have a lot of overhead. Use vector instead of list, for example. Use boost::ptr_vector instead of std::vector containing shared_ptr.
Avoid virtual methods. The first virtual method you add to a struct or class adds a hidden pointer to a vtable.
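As an illustration (my sketch, not from the answer above): the asker's node reworked along these lines, assuming the coordinates fit in 32 bits and the tree has fewer than 2^31 nodes. Exact sizes depend on the platform and compiler.

#include <cstdint>

// Original layout on a typical 64-bit platform:
// int(4) + int(4) + bool(1) + 7 bytes of padding + 2 pointers(16) = 32 bytes.
struct node {
    int   start;
    int   end;
    bool  flag;
    node* left;
    node* right;
};

// Reworked layout: 32-bit indices into a node pool replace the pointers,
// and the flag is folded into a spare bit. 4 + 4 + 4 + 4 = 16 bytes,
// half the original size on mainstream compilers.
struct packed_node {
    int32_t  start;
    int32_t  end;
    uint32_t left;        // index into a pool, not a pointer
    uint32_t right : 31;  // index into a pool
    uint32_t flag  :  1;  // boolean folded into the remaining bit
};

static_assert(sizeof(packed_node) == 16, "expected a 16-byte node");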
No, regular member functions don't make the class or struct larger. Introducing a virtual function might (on many platforms) add a vtable pointer to the class. On x86 that would increase the size by four bytes. No more memory will be required as you add further virtual functions, though: one pointer is sufficient. The size of a class or struct type is never zero (regardless of whether it has any member variables or virtual functions); this ensures that each instance occupies its own memory space (see the C++ standard, clause 9, paragraph 3).
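A quick way to see this on your own compiler (my sketch, not part of the answer); the exact numbers are platform-dependent:

#include <cstdio>

struct Plain   { int x; int f() { return x; } };          // non-virtual member function
struct Virtual { int x; virtual int f() { return x; } };  // first virtual function

int main() {
    // Typical 64-bit output: 4 and 16 (int, then vtable pointer + padding).
    // Typical 32-bit output: 4 and 8.
    std::printf("%zu %zu\n", sizeof(Plain), sizeof(Virtual));
}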
In my opinion, the best way to reduce memory is to consider your algorithm's space complexity instead of just doing fine-grained code optimizations. Reconsider things like dynamic programming tables, unnecessary copies, and generally anything that is questionable in terms of memory efficiency. Also, try to free memory resources early, as soon as they are no longer needed.
For your final example (the tree), you can use a clever hack with XOR to replace the two node pointers with a single node pointer, as described here. This only works if you traverse the tree in the right order, however. Obviously this hurts code readability, so should be something of a last resort.
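For reference, here is a sketch of the XOR mechanics (mine, not from the linked description), shown on a doubly linked list where it is easiest to see. The tree version relies on the same recover-one-given-the-other property, which is why the traversal order matters:

#include <cstdint>

struct xlist_node {
    int            value;
    std::uintptr_t link;  // prev XOR next, packed into one pointer-sized field
};

// Recover one neighbour, given the other: link == prev ^ next, so
// XORing the known neighbour back out yields the unknown one.
inline xlist_node* other_neighbour(xlist_node* known, const xlist_node* cur) {
    return reinterpret_cast<xlist_node*>(
        reinterpret_cast<std::uintptr_t>(known) ^ cur->link);
}

// Walk forward: 'prev' is the node we came from (nullptr at the head).
inline xlist_node* next_node(xlist_node* prev, const xlist_node* cur) {
    return other_neighbour(prev, cur);
}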
You could use compilation flags to do some optimization. If you are using g++ you could test with: -O2
There are great threads about the subject:
C++ Optimization Techniques
Should we still be optimizing "in the small"?
Constants and compiler optimization in C++
What are the known C/C++ optimizations for GCC
The two possibilities are not at all equivalent:
In the first, f() is a member function of node.
In the second, f() is a free (or namespace-scope) function. (Note also that the signatures of the two f()s are different.)
Now note that, in the first style, f() is an inline member function: defining a member function inside the class body makes it inline. Inlining is not guaranteed, though; it is just a hint to the compiler. For functions with small bodies, inlining may be good, as it avoids function call overhead. However, I have never seen it be a make-or-break factor.
If you do not want f() inlined, or if it does not qualify for inlining, you should define it outside the class body (probably in a .cpp file) as:
int node::f() { /* stuff here */ }
If memory usage is a problem in your code, then most probably the above topics are not relevant. Exploring the following might give you some hints:
Find the sizes of all classes in your program. Use sizeof to get this information, e.g. sizeof(node).
Find the maximum number of objects of each class that your program creates.
Using these two pieces of information, estimate the worst-case memory usage of your program:
Worst-case memory usage = n1 * sizeof(node1) + n2 * sizeof(node2) + ...
If the above number is too high, then you have the following choices:
Reduce the maximum number of instances of each class. This probably won't be possible, because it depends on the input to the program, which is beyond your control.
Reduce the size of each class. This is in your control, though.
How to reduce the size of the classes? Try laying out the class members compactly to avoid padding.
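For example (a sketch using hypothetical numbers, with the struct from the question; a segment tree over n leaves needs at most about 2n - 1 nodes, ignoring allocator overhead):

#include <cstddef>
#include <cstdio>

struct node {
    int   start;
    int   end;
    bool  flag;
    node* left;
    node* right;
};

int main() {
    const std::size_t n = 200000;            // hypothetical input size
    const std::size_t max_nodes = 2 * n - 1; // nodes in a segment tree over n leaves
    std::printf("sizeof(node) = %zu\n", sizeof(node));
    std::printf("worst case   = %zu bytes\n", max_nodes * sizeof(node));
}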
As others have said, having methods doesn't increase the size of the struct unless one of them is virtual.
You can use bitfields to effectively compress the data (this is especially effective with your boolean...). Also, you can use indices instead of pointers, to save some bytes.
Remember to allocate your nodes in big chunks rather than individually (e.g., using new[] once, not regular new many times) to avoid memory management overhead.
If you don't need the full flexibility your node pointers provide, you may be able to reduce or eliminate them. For example, heapsort always has a near-full binary tree, so the standard implementation uses an implicit tree, which doesn't need any pointers at all.
Above all, finding a different algorithm may change the game completely...
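To make the last two points concrete for the asker's segment tree (my sketch, assuming the tree is built over a fixed range known up front): with the usual implicit layout, node i has children 2i and 2i+1, each node's start/end is implied by the recursion, and the pointers vanish entirely.

#include <cstddef>
#include <cstdint>
#include <vector>

struct implicit_segtree {
    std::vector<std::uint8_t> flag;  // 1 byte per node instead of a 25+ byte struct

    explicit implicit_segtree(std::size_t leaves)
        : flag(4 * leaves, 0) {}     // 4*n is a simple, safe upper bound

    // Children of node i in the implicit layout.
    static std::size_t left(std::size_t i)  { return 2 * i; }
    static std::size_t right(std::size_t i) { return 2 * i + 1; }
};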
I have seen char* vs std::string in C++, but am still wondering if accessing the elements of a char* is faster than those of a std::string.
If you need to know, the char*/std::string will contain fewer than 80 characters, but I would like to know the cutoff if there is one.
I would also like to know the answer to this question for different compilers and different operating systems, if there is a difference.
Thanks in advance!
Edit: I would be accessing the elements using array[n], and would set the values once.
(Note: If this doesn't meet the help center, please let me know how I can reword it before down-voting)
They should be equivalent in general, though std::string might be a tiny bit slower. Why? Because of short-string optimization.
Short-string optimization is a trick some implementations use to store short strings in std::string without allocating any memory. Usually this is done by doing something like this (though different variations exist):
union {
    char* data_ptr;                   // long strings: pointer to a heap buffer
    char short_string[sizeof(char*)]; // short strings: stored in place
};
Then std::string can use the short_string array to store the data, but only if the size of the string is short enough to fit in there. If not, then it will need to allocate memory and use data_ptr to store that pointer.
Depending on how short-string optimization is implemented, whenever you access data in a std::string, it needs to check its length and determine if it's using the short_string or the data_ptr. This check is not totally free: it takes at least a couple instructions and might cause some branch misprediction or inhibit prefetching in the CPU.
libc++ implements short-string optimization roughly like this, and therefore has to check whether the string is short or long on every access.
libstdc++ uses short-string optimization, but they implement it slightly differently and actually avoid any extra access costs. Their union is between a short_string array and an allocated_capacity integer, which means their data_ptr can always point to the real data (whether it's in short_string or in an allocated buffer), so there aren't any extra steps needed when accessing it.
If std::string doesn't use short-string optimization (or if it's implemented like in libstdc++), then it should be the same as using a char*. I disagree with black's statement that there is an extra level of indirection in this situation. The compiler should be able to inline operator[] and it should be the same as directly accessing the internal data pointer in the std::string.
Since you don't have direct access to the underlying CharT sequence, accessing it will require an extra layer through the public interface. So it could be slower, perhaps by 20-30 cycles. Even then, you would only see a difference in a tight loop.
However, it's extremely easy to optimize this out considering the large range of techniques a compiler can employ (caching, inlining, non-standard function calls and so on) if you instruct it to.
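If you want numbers for your own compiler and OS, a rough micro-benchmark along these lines may help (my sketch; the optimizer dominates the result, so treat the output with care):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <string>

volatile unsigned sink;  // keeps the loops from being optimized away

int main() {
    const char raw[] = "a string of less than 80 characters";
    const std::string str = raw;
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    unsigned sum = 0;
    for (int rep = 0; rep < 1000000; ++rep)
        for (std::size_t i = 0; raw[i]; ++i) sum += raw[i];   // char* access
    sink = sum;

    auto t1 = clock::now();
    sum = 0;
    for (int rep = 0; rep < 1000000; ++rep)
        for (std::size_t i = 0; i < str.size(); ++i) sum += str[i];  // std::string access
    sink = sum;
    auto t2 = clock::now();

    using us = std::chrono::microseconds;
    std::printf("char*: %lld us, std::string: %lld us\n",
                (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<us>(t2 - t1).count());
}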
I want to set the padding bytes of a class to 0, since I am saving/loading/comparing/hashing instances at a byte level, and garbage-initialised padding introduces non-determinism in each of those operations.
I know that this will achieve what I want (for trivially copyable types):
struct Example
{
    Example(char a_, int b_)
    {
        memset(this, 0, sizeof(*this)); // zeroes members and padding alike
        a = a_;
        b = b_;
    }
    char a;
    int b;
};
I don't like doing that though, for two reasons: I like constructor initialiser lists, and I know that setting the bits to 0 isn't always the same as zero-initialisation (e.g. pointers and floats don't necessarily have zero values that are all 0 bits).
As an aside, it's obviously limited to types that are trivially copyable, but that's not an issue for me since the operations I listed above (loading/saving/comparing/hashing at a byte level) require trivially copyable types anyway.
What I would like is something like this [magical] snippet:
struct Example
{
    Example(char a_, int b_) : a(a_), b(b_)
    {
        // Leaves all members alone, and sets all padding bytes to 0.
        memset_only_padding_bytes(this, 0);
    }
    char a;
    int b;
};
I doubt such a thing is possible, so if anyone can suggest a non-ugly alternative... I'm all ears :)
There's no way I know of to do this fully automatically in pure C++. We use a custom code generation system to accomplish this (among other things). You could potentially accomplish this with a macro to which you fed all your member variable names; it would simply look for holes between offsetof(memberA)+sizeof(memberA) and offsetof(memberB).
Alternatively, serialize/hash on a memberwise basis, rather than as a binary blob. That's ten kinds of cleaner.
Oh, one other option: you could provide an operator new which explicitly cleared the memory before returning it. I'm not a fan of that approach, though; it doesn't work for stack allocation.
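For the memberwise route, a minimal sketch (hash_combine here is the usual boost-style helper, not a standard function); padding bytes never enter the computation:

#include <cstddef>
#include <functional>

struct Example {
    char a;
    int  b;
};

// Boost-style combiner: folds one member's hash into a running seed.
inline void hash_combine(std::size_t& seed, std::size_t value) {
    seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

inline std::size_t hash_example(const Example& e) {
    std::size_t seed = 0;
    hash_combine(seed, std::hash<char>()(e.a));
    hash_combine(seed, std::hash<int>()(e.b));
    return seed;  // identical for equal members, regardless of padding contents
}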
You should never use padded structs when binary-writing/reading them, simply because the padding can vary from one platform to another, which leads to binary incompatibility.
Use compiler directives, like #pragma pack(push, 1), to disable padding when defining those writable structs, and restore it with #pragma pack(pop).
This sadly means you lose the optimization that padding provides. If that is a concern, you can, by carefully designing your structs, "pad" them manually by inserting dummy variables. Then zero-initialization becomes obvious: you just assign zeros to those dummies. I don't recommend that manual approach, as it's very error-prone, but since you're writing binary blobs you're probably concerned already. By all means, benchmark unpadded structs first.
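A small demonstration of the directive (my sketch; the exact sizes assume a 4-byte int):

#include <cstdio>

struct Padded {    // default layout: char + 3 padding bytes + int = 8 bytes
    char a;
    int  b;
};

#pragma pack(push, 1)
struct Unpadded {  // packed layout: char + int = 5 bytes, no padding at all
    char a;
    int  b;
};
#pragma pack(pop)

int main() {
    // Typical output: 8 and 5. Note that unaligned access to 'b' can be
    // slower, and on some architectures is not supported at all.
    std::printf("%zu %zu\n", sizeof(Padded), sizeof(Unpadded));
}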
I faced a similar problem - and simply saying that this is a poor design decision (as per dasblinkenlight's comment) doesn't necessarily help as you may have no control over the hashing code (in my case I was using an external library).
One solution is to write a custom iterator for your class, which iterates through the bytes of the data and skips the padding. You then modify your hashing algorithm to use your custom iterator instead of a pointer. One simple way to do this is to templatize the pointer so that it can take an iterator - since the semantics of a pointer and an iterator are the same, you shouldn't have to modify any code beyond the templatizing.
EDIT: Boost provides a nice library which makes it simple to add custom iterators to your container: Boost.Iterator.
Whichever solution you go for, it is highly preferable to avoid hashing the padding as doing so means that your hashing algorithm is highly coupled with your data structure. If you switch data structures (or as Agent_L mentions, use the same data structure on a different platform which pads differently), then it will produce different hashes. On the other hand, if you only hash the actual data itself, then you will always produce the same hash values no matter what data structure you use later.
Let me just say up front that I'm aware that what I'm about to propose is a mortal sin, and that I will probably burn in Programming Hell for even considering it.
That said, I'm still interested in knowing if there's any reason why this wouldn't work.
The situation is: I have a reference-counting smart-pointer class that I use everywhere. It currently looks something like this (note: incomplete/simplified pseudocode):
class IRefCountable
{
public:
    IRefCountable() : _refCount(0) {}
    virtual ~IRefCountable() {}
    void Ref() {_refCount++;}
    bool Unref() {return (--_refCount==0);}
private:
    unsigned int _refCount;
};

class Ref
{
public:
    Ref(IRefCountable * ptr, bool isObjectOnHeap) : _ptr(ptr), _isObjectOnHeap(isObjectOnHeap)
    {
        _ptr->Ref();
    }
    ~Ref()
    {
        if ((_ptr->Unref())&&(_isObjectOnHeap)) delete _ptr;
    }
private:
    IRefCountable * _ptr;
    bool _isObjectOnHeap;
};
Today I noticed that sizeof(Ref)=16. However, if I remove the boolean member variable _isObjectOnHeap, sizeof(Ref) is reduced to 8. That means that for every Ref in my program, there are 7.875 wasted bytes of RAM... and there are many, many Refs in my program.
Well, that seems like a waste of some RAM. But I really need that extra bit of information (okay, humor me and assume for the sake of the discussion that I really do). And I notice that since IRefCountable is a non-POD class, it will (presumably) always be allocated on a word-aligned memory address. Therefore, the least significant bit of (_ptr) should always be zero.
Which makes me wonder... is there any reason why I can't OR my one bit of boolean data into the least-significant bit of the pointer, and thus reduce sizeof(Ref) by half without sacrificing any functionality? I'd have to be careful to AND out that bit before dereferencing the pointer, of course, which would make pointer dereferences less efficient, but that might be made up for by the fact that the Refs are now smaller, and thus more of them can fit into the processor's cache at once, and so on.
Is this a reasonable thing to do? Or am I setting myself up for a world of hurt? And if the latter, how exactly would that hurt be visited upon me? (Note that this is code that needs to run correctly in all reasonably modern desktop environments, but it doesn't need to run in embedded machines or supercomputers or anything exotic like that)
If you want to use only the standard facilities and not rely on any implementation then with C++0x there are ways to express alignment (here is a recent question I answered). There's also std::uintptr_t to reliably get an unsigned integral type large enough to hold a pointer. Now the one thing guaranteed is that a conversion from the pointer type to std::[u]intptr_t and back to that same type yields the original pointer.
I suppose you could argue that if you can get back the original std::intptr_t (with masking), then you can get the original pointer. I don't know how solid this reasoning would be.
[Edit: thinking about it, there's no guarantee that an aligned pointer takes any particular form when converted to an integral type, e.g. one with some bits unset. Probably too much of a stretch here.]
The problem here is that it is entirely machine-dependent. It isn't something one often sees in C or C++ code, but it has certainly been done many times in assembly. Old Lisp interpreters almost always used this trick to store type information in the low bit(s). (I have seen it in C code too, but only in projects targeting a specific platform.)
Personally, if I were trying to write portable code, I probably wouldn't do this. The fact is that it will almost certainly work on "all reasonably modern desktop environments". (Certainly, it will work on every one I can think of.)
A lot depends on the nature of your code. If you are maintaining it, and nobody else will ever have to deal with the "world of hurt", then it might be ok. You will have to add ifdef's for any odd architecture that you might need to support later on. On the other hand, if you are releasing it to the world as "portable" code, that would be cause for concern.
Another way to handle this is to write two versions of your smart pointer, one for machines on which this will work and one for machines where it won't. That way, as long as you maintain both versions, it won't be that big a deal to change a config file to use the 16-byte version.
It goes without saying that you would have to avoid writing any other code that assumes sizeof(Ref) is 8 rather than 16. If you are using unit tests, run them with both versions.
Any reason? Unless things have changed in the standard lately, the value representation of a pointer is implementation-defined. It is certainly possible that some implementation somewhere may pull the same trick, defining these otherwise-unused low bits for its own purposes. It's even more possible that some implementation might use word-pointers rather than byte-pointers, so instead of two adjacent words being at "addresses" 0x8640 and 0x8642, they would be at "addresses" 0x4320 and 0x4321.
One tricky way around the problem would be to make Ref a (de facto) abstract class, and all instances would actually be instances of RefOnHeap and RefNotOnHeap. If there are that many Refs around, the extra space used to store the code and metadata for three classes rather than one would be made up for by the space savings of having each Ref be half the size. (This won't work too well, though: the compiler can only omit the vtable pointer if there are no virtual methods, and introducing virtual methods adds the 4 or 8 bytes right back to the class.)
You always have at least one free bit to use in the pointer as long as
you're not pointing to arbitrary positions inside a struct or array with alignment of 1, or
the platform gives you a free bit
Since IRefCountable has an alignment of at least 4, you'll have 2 free bottom bits in IRefCountable* to use.
Regarding the first point, storing data in the least significant bit is always reliable if the pointer is aligned to a power of 2 larger than 1. That means it'll work for everything apart from char*/bool* or a pointer to a struct containing only char/bool members, and it'll obviously work for IRefCountable* in your case. In C++11 you can use alignof or std::alignment_of to ensure that you have the required alignment, like this:
static_assert(alignof(Ref) > 1, "Ref must be aligned to more than 1 byte");
static_assert(alignof(IRefCountable) > 1, "IRefCountable must be aligned to more than 1 byte");
// This check for power of 2 is likely redundant
static_assert((alignof(Ref) & (alignof(Ref) - 1)) == 0, "alignment must be a power of 2");
// Now IRefCountable* is always aligned,
// so its least significant bit can be used freely
Even if you have some object with only 1-byte alignment, for example if you change the _refCount in IRefCountable to uint8_t, you can still enforce an alignment requirement with alignas, or with other extensions in older C++ like __declspec(align). Dynamically allocated memory is already aligned to max_align_t, or you can use aligned_alloc() for stricter alignment.
My second bullet point means that if you really do need to store arbitrary pointers to objects with 1-byte alignment, most of the time you can still utilize a feature of the platform:
On many 32-bit platforms the address space is split in half for user and kernel processes. User pointers will always have the most significant bit unset so you can use that to store data. Of course it won't work on platforms with more than 2GB of user address space, like when the split is 3/1 or 4/4
On 64-bit platforms, most currently have only 48-bit virtual addresses, and a few newer high-end CPUs may have 57-bit virtual addresses, which is still far from the full 64 bits. Therefore you'll have lots of bits to spare. In practice this essentially always works in personal computing, since you'll never be able to fill that vast address space.
This is called a tagged pointer.
If the data is always heap-allocated then you can tell the OS to limit the range of address space to use to get more bits
For more information read Using the extra 16 bits in 64-bit pointers
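Applied to the Ref class from the question, the whole trick might look like this (a sketch, relying on IRefCountable's alignment being greater than 1, which holds here because of the vtable pointer):

#include <cstdint>

class IRefCountable {  // unchanged from the question
public:
    IRefCountable() : _refCount(0) {}
    virtual ~IRefCountable() {}
    void Ref() {_refCount++;}
    bool Unref() {return (--_refCount==0);}
private:
    unsigned int _refCount;
};

class Ref {
public:
    Ref(IRefCountable* ptr, bool isObjectOnHeap)
        : _bits(reinterpret_cast<std::uintptr_t>(ptr)
                | static_cast<std::uintptr_t>(isObjectOnHeap))
    {
        Get()->Ref();
    }
    ~Ref()
    {
        if ((Get()->Unref()) && (_bits & 1)) delete Get();
    }
private:
    IRefCountable* Get() const
    {
        // Mask the tag bit out before dereferencing.
        return reinterpret_cast<IRefCountable*>(_bits & ~static_cast<std::uintptr_t>(1));
    }
    std::uintptr_t _bits;  // pointer and boolean share one word
};

static_assert(sizeof(Ref) == sizeof(void*), "Ref is back to pointer size");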
Yes, this can work reliably. This is, in fact, used by the Linux kernel as part of its red-black tree implementation. Instead of storing an extra boolean to indicate whether a node is red or black (which can take up quite a bit of additional space), the kernel uses the low-order bit of the parent node address.
From rbtree_types.h:
struct rb_node {
    unsigned long __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));
The __rb_parent_color field stores both the address of the node's parent and the color of the node (in the least significant bit).
Getting The Pointer
To retrieve the parent address from this field, you just mask off the low-order bits (the & ~3 below clears the lowest 2 bits).
From rbtree.h:
#define rb_parent(r) ((struct rb_node *)((r)->__rb_parent_color & ~3))
Getting The Boolean
To retrieve the color you just extract the lower bit and treat it like a boolean.
From rbtree_augmented.h:
#define __rb_color(pc) ((pc) & 1)
#define __rb_is_black(pc) __rb_color(pc)
#define __rb_is_red(pc) (!__rb_color(pc))
#define rb_color(rb) __rb_color((rb)->__rb_parent_color)
#define rb_is_red(rb) __rb_is_red((rb)->__rb_parent_color)
#define rb_is_black(rb) __rb_is_black((rb)->__rb_parent_color)
Setting The Pointer And Boolean
You set the pointer and boolean value using standard bit manipulation operations (making sure to preserve each part of the final value).
From rbtree_augmented.h:
static inline void rb_set_parent(struct rb_node *rb, struct rb_node *p)
{
    rb->__rb_parent_color = rb_color(rb) | (unsigned long)p;
}

static inline void rb_set_parent_color(struct rb_node *rb,
                                       struct rb_node *p, int color)
{
    rb->__rb_parent_color = (unsigned long)p | color;
}
You can also clear the boolean value setting it to false via (unsigned long)p & ~1.
There will always be a sense of uncertainty in your mind even if this method works, because ultimately you are playing with internal machine details that may or may not be portable.
On the other hand, to solve this problem, if you want to avoid the bool variable, I would suggest a simple constructor such as:
Ref(IRefCountable * ptr) : _ptr(ptr)
{
    if(ptr != 0)
        _ptr->Ref();
}
From the code, I smell that the reference counting is needed only when the object is on the heap. For automatic objects, you can simply pass 0 to the class Ref and put appropriate null checks in the constructor/destructor.
Have you thought about out-of-class storage?
Depending on whether or not you have to worry about multi-threading, and whether you control the implementation of new/delete/malloc/free, it might be worth a try.
The point would be that instead of incrementing a counter local to the object, you would maintain a "counter" map (address -> count) that would haughtily ignore addresses outside the allocated area (the stack, for example).
It may seem silly (there is room for contention in MT), but it also plays rather nicely with read-only data, since the object is not "modified" merely for counting.
Of course, I have no idea of the performance you might hope to achieve with this :p
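A bare-bones single-threaded sketch of that side-table idea (mine, not from the answer; a real version would need locking and a policy for skipping stack addresses):

#include <unordered_map>

class RefCountTable {
public:
    void ref(const void* p) { ++_counts[p]; }  // the object itself stays untouched
    bool unref(const void* p) {
        auto it = _counts.find(p);
        if (it == _counts.end() || --it->second > 0) return false;
        _counts.erase(it);
        return true;  // count hit zero: the caller may delete the object
    }
private:
    std::unordered_map<const void*, unsigned> _counts;  // address -> count
};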
I have a linked list of structures. Let's say I insert x million nodes into the linked list,
then I iterate through all nodes to find a given value.
The strange thing is (for me at least), if I have a structure like this:
struct node
{
    int a;
    node *nxt;
};
Then I can iterate through the list and check the value of a ten times faster compared to when I have another member in the struct, like this:
struct node_complex
{
    int a;
    string b;
    node_complex *nxt;
};
I also tried it with C-style strings (char arrays); the result was the same: just because I had another member (a string), the whole iteration (+ value check) was 10 times slower, even though I never even touched that member! Now, I do not know how the internals of structures work, but it looks like a high price to pay...
What is the catch?
Edit:
I am a beginner and this is the first time I have used pointers, so chances are the mistake is on my part. I will post the code ASAP (I'm not at home now).
Update:
I checked the values again, and now I see a much smaller difference: 2x instead of 10x.
It is much more reasonable for sure.
While it is certainly possible it was the case yesterday too and I was just so freaking tired last night that I could not divide two numbers, I have just made more tests and the results are mind-blowing.
The times to iterate through the same number of nodes are:
One int and a pointer: 0.101
One int and a string: 0.196
One int and 2 strings: 0.274
One int and 3 strings: 0.147 (!!!)
Two ints: 0.107
Look what happens when there is more than two strings in the structure! It gets faster! Did somebody drop LSD into my coffee? No! I do not drink coffee.
It is way too fckd up for my brain at the mo' so I think I will just figure it out on my own instead of draining public resources here at SO.
(Ad: I do not think my profiling class is buggy, and anyway I can see the time difference with my own eyes).
Anyhow, thanks for the help.
Cheers.
It must be related to memory access. You speak of a million linked elements. With just an int and a pointer in the node, each node takes 8 bytes (assuming 32-bit pointers). A million of them take up 8 MB of memory, which is around the size of a typical cache.
When you add other members, you increase the overall size of your data. It no longer fits entirely in the cache memory, and you revert to plain memory accesses, which are much slower.
This may also be caused by the iteration creating a copy of your structures. That is:
node* pHead;
// ...
for (node* p = pHead; p; p = p->nxt)
{
    node myNode = *p; // here you create a copy!
    // ...
}
Copying a simple structure is very fast. But the member you've added is a string, which is a complex object; copying it is a relatively expensive operation, with heap access.
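If that is what's happening, binding a reference instead of copying avoids the cost entirely:

node* pHead;
// ...
for (node* p = pHead; p; p = p->nxt)
{
    const node& myNode = *p; // a reference: no copy, no string allocation
    // ...
}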
Most likely, the issue is that your larger struct no longer fits inside a single cache line.
As I recall, mainstream CPUs typically use a cache line of 32 bytes (64 on many newer processors). This means that data is read into the cache in chunks of 32 bytes at a time, and if you move past these 32 bytes, a second memory fetch is required.
Looking at your struct, it starts with an int, accounting for 4 bytes (usually), and then a std::string (I assume, even though the namespace isn't specified), which in my standard library implementation (from VS2010) takes up 28 bytes, which gives us 32 bytes total. That means the initial int and the next pointer will be placed in different cache lines, using twice as much cache space, and requiring twice as many memory accesses if both members are accessed during iteration.
If only the pointer is accessed, this shouldn't make a difference, though, as only the second cache line then has to be retrieved from memory.
If you always access the int and the pointer, and the string is required less often, reordering the members may help:
struct node_complex
{
    int a;
    node_complex *nxt;
    string b;
};
In this case, the next pointer and the int are located next to each other, on the same cache line, so they can be read without requiring additional memory reads. You then incur the extra cost only when you need to read the string.
Of course, it's also possible that your benchmarking code includes creation of the nodes, or (intentional or otherwise) copies being created of the nodes, which would obviously also affect performance.
I'm not a specialist at all, but the "cache miss" problem rang in my head while reading your problem.
When you add a member, the structure gets bigger, which might cause cache misses when going through the linked list (which is naturally cache-unfriendly if the nodes aren't allocated in one block, close to each other in memory).
I can't find another explanation.
However, we don't have the creation and loop code, so it's still hard to tell whether your code simply doesn't traverse the list efficiently.
Perhaps a solution would be a linked list of pointers to your objects. It may make things more complicated (unless you use smart pointers, etc.), but it may speed up the search.
In C and C++, there is no bounds checking of arrays. One way to work around this is to package the array with a struct:
struct array_of_foo{
    int length;
    foo *arr; //array with variable length.
};
Then, it can be initialized:
array_of_foo *ar(int length){
    array_of_foo *out = (array_of_foo*) malloc(sizeof(array_of_foo));
    out->arr = (foo*) malloc(length*sizeof(foo));
    return out;
}
And then accessed:
foo I(array_of_foo *ar, int ix){ //may need to be foo* I(...
    if(ix > ar->length - 1){ printf("out of range!\n"); } //error
    return ar->arr[ix];
}
And finally freed:
void freeFoo(array_of_foo *ar){ //is it necessary to free both ar->arr and ar?
free(ar->arr); free(ar);
}
This way it can warn programmers about out-of-bounds accesses. But will this packaging slow down the performance substantially?
I agree with the std::vector recommendation. Additionally you might try the boost::array library, which includes a complete (and tested) implementation of a fixed-size array container:
http://svn.boost.org/svn/boost/trunk/boost/array.hpp
In C++, there's no need to come up with your own incomplete version of vector. (To get bounds checking on vector, use .at() instead of []. It'll throw an exception if you get out of bounds.)
In C, this isn't necessarily a bad idea, but I'd drop the pointer in your initialization function and just return the struct by value. It's got an int and a pointer, and won't be very big, typically no more than twice the size of a pointer. You probably don't want random printfs in your access functions anyway; if you do go out of bounds, you'll get random messages that won't be very helpful even if you look for them.
Most likely the major performance hit will come from checking the index for every access, thus breaking pipelining in the processor, rather than the extra indirection. It seems to me unlikely that an optimizer would find a way to optimize away the check when it's definitely not necessary.
For example, this will be very noticeable in long loops traversing the entire array, which is a relatively common pattern.
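For completeness, the checked-access idiom in C++ looks like this (.at() throws std::out_of_range instead of printing and carrying on):

#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> v(10, 0);
    v[3] = 42;                          // unchecked access: no bounds test
    try {
        std::cout << v.at(12) << '\n';  // checked access: index 12 is out of range
    } catch (const std::out_of_range& e) {
        std::cout << "out of range: " << e.what() << '\n';
    }
}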
And just for the sake of it:
- You should initialize the length field too in ar()
- You should check for ix < 0 in I()
I don't have any formal studies to cite, but the echoes I've heard from languages where array bounds checking is optional are that turning it off rarely speeds up a program perceptibly.
If you have C code that you'd like to make safer, you may be interested in Cyclone.
You can test it yourself, but on certain machines you may have serious performance issues under different scenarios. If you are looping over millions of elements, then checking the bounds every time will lead to numerous cache misses. How much of an impact that will have depends on what your code is doing. Again, you could test this pretty quickly.