Why is memory access so slow? - C++

I've got a very strange performance issue related to memory access.
The code snippet is:
#include <vector>
using namespace std;
vector<int> arrx(M,-1);
vector< vector<int> > arr(N,arrx);
...
for(i=0;i<N;i++){
    for(j=0;j<M;j++){
        //>>>>>>>>>>>> Part 1 <<<<<<<<<<<<<<
        // Simple arithmetic operations
        int n1 = 1 + 2; // does not matter what (actually more complicated)
        // Integer assignment, without access to array
        int n2 = n1;
        //>>>>>>>>>>>> Part 2 <<<<<<<<<<<<<<
        // This turns out to be most expensive part
        arr[i][j] = n1;
    }
}
N and M are constants on the order of 1000 to 10000 or so.
When I compile this code (release version) it takes approximately 15 clocks to finish if Part 2 is commented out. With this part the execution time goes up to 100+ clocks, so it is almost 10 times slower. I expected the assignment operation to be much cheaper than even simple arithmetic operations. This is actually true if we do not use arrays, but with the array the assignment seems to be much more expensive.
I also tried a 1-D array instead of the 2-D one - the same result (the 2-D version is obviously slower).
I also used int** or int* instead of vector< vector< int > > or vector< int > - again the result is the same.
Why do I get such poor performance in array assignment and can I fix it?
Edit:
One more observation: in Part 2 of the code above, if we change the assignment from
arr[i][j] = n1; // 172 clocks
to
n1 = arr[i][j]; // 16 clocks
the speed goes up (the clock counts are shown in the comments). More interestingly, if we change the line:
arr[i][j] = n1; // 172 clocks
to
arr[i][j] = arr[i][j] * arr[i][j]; // 110 clocks
the speed is also higher than for the simple assignment.
Is there any difference between reading from and writing to memory? Why do I get such strange performance?
Thanks in advance!

Your assumptions are really wrong...
Writing to main memory is much slower than doing a simple addition.
If you don't do the write, it is likely that the loop will be optimized away entirely.
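One quick way to check this is to force the compiler to keep the result of "part 1" alive, for example by writing it to a volatile sink. This is only a rough sketch (the timing harness and the real "part 1" are left out), not the original poster's code:

#include <vector>

volatile int sink;   // a write to a volatile cannot be optimized away

void run(std::vector< std::vector<int> >& arr, int N, int M, bool do_store)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            int n1 = 1 + 2;           // "part 1"
            if (do_store)
                arr[i][j] = n1;       // "part 2": the memory write
            else
                sink = n1;            // keeps part 1 alive without touching the array
        }
    }
}

With the sink in place the two variants differ only in whether they touch the array, which makes the comparison more meaningful.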

Unless your actual "part 1" is significantly more complex than your example would lead us to believe, there's no surprise here -- memory access is slow compared to basic arithmetic. Additionally, if you're compiling with optimizations, most or all of "part 1" may be getting optimized away because the results are never used.

You have discovered a sad truth about modern computers: memory access is really slow compared to arithmetic. There's nothing you can do about it. It is because electric fields in a copper wire only travel at about two-thirds of the speed of light.

Your nested arrays are going to be anywhere from a few megabytes up to a few hundred megabytes of memory (4 bytes x N x M). It takes time to write that much memory, and no amount of clever hardware memory caching is going to help that much. Moreover, even a single memory write will take time as it has to make its way over some copper wires on a circuit board to a lump of silicon some distance away; physics wins.
But if you want to dig in more, try the cachegrind tool (assuming it's present on your platform). Just be aware that the code you're using above doesn't really permit a lot of optimization; its data access pattern doesn't have enormous amounts of reuse potential.
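For reference, a typical cachegrind run looks something like this (assuming valgrind is installed; the program name is just a placeholder):

valgrind --tool=cachegrind ./your_program
cg_annotate cachegrind.out.<pid>

cg_annotate then breaks the cache-miss counts down per function and per source line.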

Let's make a simple estimate. A typical CPU clock speed nowadays is about 1-2 GHz (giga = 10^9), so, simplifying a (really) great deal, a single processor operation takes about 1 ns (nano = 10^-9 s). A simple arithmetic operation like an int addition takes several CPU cycles, of order ten.
Now, memory: a typical memory access time is about 50 ns (again, it's not necessary just now to go into the gory details, of which there are plenty).
You see that even in the very best case scenario, the memory is slower than CPU by a factor of about 5 to 10.
In fact, I'm sweeping an enormous amount of detail under the rug, but I hope the basic idea is clear. If you're interested, there are books around (keywords: cache, cache misses, data locality, etc.). They are somewhat dated, but still very good at explaining the general concepts.

Related

Using one loop vs two loops

I was reading this blog: https://developerinsider.co/why-is-one-loop-so-much-slower-than-two-loops/, and I decided to check it out using C++ and Xcode. So, I wrote the simple program given below, and when I executed it I was surprised by the result: the 2nd function was slower than the first function, contrary to what is stated in the article. Can anyone please help me figure out why this is the case?
#include <iostream>
#include <vector>
#include <chrono>
using namespace std::chrono;
void function1() {
    const int n = 100000;
    int a1[n], b1[n], c1[n], d1[n];
    for (int j = 0; j < n; j++) {
        a1[j] = 0;
        b1[j] = 0;
        c1[j] = 0;
        d1[j] = 0;
    }
    auto start = high_resolution_clock::now();
    for (int j = 0; j < n; j++) {
        a1[j] += b1[j];
        c1[j] += d1[j];
    }
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
    std::cout << duration.count() << " Microseconds." << std::endl;
}
void function2() {
    const int n = 100000;
    int a1[n], b1[n], c1[n], d1[n];
    for (int j = 0; j < n; j++) {
        a1[j] = 0;
        b1[j] = 0;
        c1[j] = 0;
        d1[j] = 0;
    }
    auto start = high_resolution_clock::now();
    for (int j = 0; j < n; j++) {
        a1[j] += b1[j];
    }
    for (int j = 0; j < n; j++) {
        c1[j] += d1[j];
    }
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
    std::cout << duration.count() << " Microseconds." << std::endl;
}
int main(int argc, const char * argv[]) {
    function1();
    function2();
    return 0;
}
TL;DR: The loops are basically the same, and if you are seeing differences, then your measurement is wrong. Performance measurement and more importantly, reasoning about performance requires a lot of computer knowledge, some scientific rigor, and much engineering acumen. Now for the long version...
Unfortunately, there is some very inaccurate information in the article to which you've linked, as well as in the answers and some comments here.
Let's start with the article. There won't be any disk caching that has any effect on the performance of these functions. It is true that virtual memory is paged to disk, when demand on physical memory exceeds what's available, but that's not a factor that you have to consider for programs that touch 1.6MB of memory (4 * 4 * 100K).
And if paging comes into play, the performance difference won't exactly be subtle either. If these arrays were paged to disk and back, the performance difference would be on the order of 1000x for the fastest disks, not 10% or 100%.
Paging and page faults and their effect on performance are neither trivial nor intuitive. You need to read about it, and experiment with it seriously. What little information that article has is completely inaccurate to the point of being misleading.
The second issue is your profiling strategy and the micro-benchmark itself. Clearly, with such simple operations on the data (an add), the bottleneck will be memory bandwidth itself (or maybe instruction retire limits or something like that with such a simple loop). And since you only read memory linearly, and use everything you read, whether it's in 4 interleaved streams or 2, you are making use of all the bandwidth that is available.
However, if you call your function1 or function2 in a loop, you will be measuring the bandwidth of different parts of the memory hierarchy depending on N, from L1 all the way to L3 and main memory. (You should know the size of all levels of cache on your machine, and how they work.) This is obvious if you know how CPU caches work, and really mystifying otherwise. Do you want to know how fast this is when you do it the first time, when the arrays are cold, or do you want to measure the hot access?
Is your real use case copying the same mid-sized array over and over again?
If not, what is it? What are you benchmarking? Are you trying to measure something or just experimenting?
Shouldn't you be measuring the fastest run through a loop, rather than the average since that can be massively affected by a (basically random) context switch or an interrupt?
Have you made sure you are using the correct compiler switches? Have you looked at the generated assembly code to make sure the compiler is not adding debug checks and what not, and is not optimizing stuff away that it shouldn't (after all, you are just executing useless loops, and an optimizing compiler wants nothing more than to avoid generating code that is not needed).
Have you looked at the theoretical memory/cache bandwidth numbers for your hardware? Your specific CPU and RAM combination will have theoretical limits. And be it 5, 50, or 500 GiB/s, it will give you an upper bound on how much data you can move around and work with. The same goes for the number of execution units, the IPC of your CPU, and a few dozen other numbers that will affect the performance of this kind of micro-benchmark.
If you are reading 4 integers (4 bytes each, from a, b, c, and d) and then doing two adds and writing the two results back, and doing it 100'000 times, then you are - roughly - looking at 2.4MB of memory read and write. If you do it 10 times in 300 micro-seconds, then your program's memory (well, store buffer/L1) throughput is about 80 GB/s. Is that low? Is that high? Do you know? (You should have a rough idea.)
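As an aside, here is a minimal sketch of the "measure the fastest run" idea from a few paragraphs up (the repeat count of 20 is arbitrary, and function1 below stands for whatever you want to time):

#include <algorithm>
#include <chrono>

// Time a callable several times and report the minimum, which is usually more
// stable than the average because the minimum is least affected by interrupts
// and context switches.
template <class F>
double min_seconds(F f, int repeats = 20)
{
    double best = 1e300;
    for (int r = 0; r < repeats; ++r) {
        auto t0 = std::chrono::high_resolution_clock::now();
        f();
        auto t1 = std::chrono::high_resolution_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// usage: double t = min_seconds([] { function1(); });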
And let me tell you that the other two answers here at the time of this writing do not make sense. I can't make heads nor tails of the first one, and the second one is almost completely wrong (conditional branches in a 100'000-iteration for loop are bad? allocating an additional iterator variable is costly? cold access to an array on the stack vs. on the heap has "serious performance implications"?)
And finally, as written, the two functions have very similar performance. It is really hard to separate the two, and unless you can measure a real difference in a real use case, I'd say write whichever one makes you happier.
If you really, really want a theoretical difference between them, I'd say the one with two separate loops is very slightly better, because it is usually not a good idea to interleave access to unrelated data.
This has nothing to do with caching or instruction efficiency. Simple iterations over long vectors are purely a matter of bandwidth. (Google: STREAM benchmark.) And modern CPUs have enough bandwidth to satisfy, if not all of their cores, then at least a good fraction of them.
So if you combine the two loops, executing them on a single core, there is probably enough bandwidth for all loads and stores at the rate that memory can sustain. But if you use two loops, you leave bandwidth unused, and the runtime will be a little less than double.
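If you want to see what your machine's sustained bandwidth actually is, a minimal STREAM-style triad sketch looks like this (the array size is an arbitrary value chosen to be much larger than a typical L3 cache; this is not the official STREAM benchmark, just the idea):

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    // STREAM-style triad: a[i] = b[i] + s * c[i].
    const std::size_t N = 20 * 1000 * 1000;   // ~160MB per array, well past L3
    std::vector<double> a(N, 0.0), b(N, 1.0), c(N, 2.0);
    const double s = 3.0;

    auto t0 = std::chrono::high_resolution_clock::now();
    for (std::size_t i = 0; i < N; ++i)
        a[i] = b[i] + s * c[i];
    auto t1 = std::chrono::high_resolution_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double bytes = 3.0 * N * sizeof(double);   // read b, read c, write a
    std::printf("%.2f GB/s\n", bytes / seconds / 1e9);
    return 0;
}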
The reason why the second is faster in your case (I do not think that this works on every machine) is better CPU caching; at the point at which your CPU has enough cache to store the arrays, the stuff your OS requires and so on, the second function will probably be much slower than the first from a performance standpoint.
I doubt that the two-loop code will give better performance if there are enough other programs running as well, because the second function has obviously worse efficiency than the first, and if there is enough other stuff cached the performance lead through caching will be eliminated.
I'll just chime in here with a little something to keep in mind when looking into performance - unless you are writing embedded software for a real-time device, the performance of such low level code as this should not be a concern.
In 99.9% of all other cases, they will be fast enough.

OpenMP first kernel much slower than the second kernel

I have a huge 98306 by 98306 2D array initialized. I created a kernel function that counts the total number of elements below a certain threshold.
#pragma omp parallel for reduction(+:num_below_threshold)
for (row)
    for (col)
        index = get_corresponding_index(row, col);
        if (array[index] < threshold)
            num_below_threshold++;
For benchmarking purposes I measured the execution time of the kernel with the number of threads set to 1. I noticed that the first time the kernel executed it took around 11 seconds. The next call to the kernel, executing on the same array with only one thread, took around 3 seconds. I thought it might be a problem related to cache, but it doesn't seem to be. What are the possible reasons for this?
This array is initialized as:
float *array = malloc(sizeof(float) * 98306 * 98306);
for (int i = 0; i < 98306 * 98306; i++) {
    array[i] = rand() % 10;
}
The same kernel is applied to this array twice, and the second execution is much faster than the first. I thought of lazy allocation on Linux, but that shouldn't be a problem because of the initialization loop. Any explanations would be helpful. Thanks!
Since you don't provide any Minimal, Complete and Verifiable Example, I'll have to make some wild guesses here, but I'm pretty confident I have the gist of the issue.
First, you have to notice that 98,306 x 98,306 is 9,664,069,636 which is way larger than the maximum value a signed 32 bit integer can store (which is 2,147,483,647). Therefore, the upper limit of your for initialization loop, after overflowing, could become 1,074,135,044 (as on my machines, although it is undefined behavior so strictly speaking, anything could happen), which is roughly 9 times smaller than what you expected.
So now, after the initialization loop, only about 11% of the memory you thought you allocated has actually been allocated and touched by the operating system. However, your first reduction loop does a good job of going over the various elements of the array, and since for about 89% of them it's the first time, the OS does the actual memory allocation there and then, which takes a significant amount of time.
And now, for your second reduction loop, all memory has been properly allocated and touched, which makes it much faster.
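For reference, here is a minimal sketch of an overflow-free version of the initialization (using a 64-bit size type for both the allocation and the loop index; the reduction itself is left out):

#include <cstdlib>   // std::malloc, std::free, std::rand

int main()
{
    const std::size_t n = static_cast<std::size_t>(98306) * 98306;       // 9,664,069,636 elements
    float *array = static_cast<float*>(std::malloc(sizeof(float) * n));  // ~36 GiB - check the result!
    if (array == nullptr)
        return 1;
    for (std::size_t i = 0; i < n; i++)   // 64-bit index, so no overflow
        array[i] = std::rand() % 10;
    // ... run the reduction kernel(s) here ...
    std::free(array);
    return 0;
}

With this fix every element is touched during initialization, so the first and second kernel runs should take roughly the same time, assuming the machine actually has the memory to back it.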
So that's what I believe happened. That said, many other parameters can enter into play here, such as:
Swapping: the array you try to allocate represents about 36GB of memory. If your machine doesn't have that much memory available, then your code might swap, which will potentially make a big mess of whatever performance measurement you can come up with
NUMA effect: if your machine has multiple NUMA nodes, then thread pinning and memory affinity, when not managed properly, can have a large impact on performance between loop occurrences
Compiler optimization: you didn't mention which compiler you used and which level of optimization you requested. Depending on that, you'd be amazed at how much of your code could be optimized away. For example, the compiler could remove the second loop entirely, since it does the same thing as the first and is therefore useless (the result will be the same)... and many other interesting and unexpected things can happen which render your benchmark meaningless.

Efficiency in C and C++

So my teacher tells me that I should compute intermediate results as needed on the fly rather than storing them, because the speed of processors nowadays is much faster than the speed of memory.
So when we compute an intermediate result, we also need to use some memory, right? Can anyone please explain it to me?
Your teacher is right: the speed of processors nowadays is much faster than the speed of memory, and access to RAM is slower than access to the CPU's internal memory: caches, registers, etc.
Suppose you want to compute a trigonometric function: sin(x). To do this you can either call a function that computes the value (the math library offers one, or you can implement your own), or you can use a lookup table stored in memory to get the result, which means storing the intermediate values (sort of).
Calling a function will result in executing a number of instructions, while using a lookup table will result in fewer instructions (getting the address of the LUT, getting the offset to the desired element, reading from address+offset). In this case, storing the intermediate values is faster.
But if you were to do c = a+b, computing the value will be much faster than reading it from somewhere in RAM. Notice that in this case the number of instructions to be executed would be similar.
So while it is true that access to RAM is slower, whether it's worth reading from RAM instead of doing the computation is a sensible question, and several things need to be considered: the number of instructions to be executed, whether the computation happens in a loop, whether you can take advantage of the architecture's pipeline, cache memory, etc.
There is no one answer, you need to analyze each situation individually.
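As a toy illustration of the sin(x) trade-off described above, here is a sketch of the two approaches (the table size of 1024 is an arbitrary choice, and the crude nearest-entry lookup trades accuracy for speed):

#include <array>
#include <cmath>

constexpr int kTableSize = 1024;                 // arbitrary power of two
const double kTwoPi = 2.0 * std::acos(-1.0);

// Build the table once, up front.
static std::array<float, kTableSize> build_sin_table() {
    std::array<float, kTableSize> t{};
    for (int i = 0; i < kTableSize; ++i)
        t[i] = static_cast<float>(std::sin(kTwoPi * i / kTableSize));
    return t;
}
static const std::array<float, kTableSize> sin_table = build_sin_table();

// LUT version: a multiply, a mask, and one (likely cached) memory read.
inline float sin_lut(float x) {                  // x assumed to be in [0, 2*pi)
    int idx = static_cast<int>(x / kTwoPi * kTableSize) & (kTableSize - 1);
    return sin_table[idx];
}

// Direct version: one library call, typically dozens of instructions.
inline float sin_direct(float x) { return std::sin(x); }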
Your teacher's advice is an oversimplification of a complex topic.
If you think of an "intermediate" as a single term (in the arithmetical sense of the word), then ask yourself: is your code re-using that term anywhere else? I.e. if you have code like:
void calculate_sphere_parameters(double radius, double & area, double & volume)
{
    area = 4 * (4 * atan(1)) * radius * radius;
    volume = 4 * (4 * atan(1)) * radius * radius * radius / 3;
}
should you instead write:
void calculate_sphere_parameters(double radius, double & area, double & volume)
{
    double quarter_pi = atan(1);
    double pi = 4 * quarter_pi;
    double four_pi = 4 * pi;
    double four_thirds_pi = four_pi / 3;
    double radius_squared = radius * radius;
    double radius_cubed = radius_squared * radius;
    area = four_pi * radius_squared;
    volume = four_thirds_pi * radius_cubed; // maybe use "(area * radius) / 3" ?
}
It's not unlikely that a modern optimizing compiler will emit the same binary code for these two. I leave it to the reader to decide which they prefer to see in the source code...
The same is true for a lot of simple arithmetics (at the very least, if no function calls are involved in the calculation). In addition to that, modern compilers and/or CPU instruction sets might have the ability to do "offset" calculations for free, i.e. something like:
for (int i = 0; i < N; i++) {
    do_something_with(i, i + 25, i + 314159);
}
will turn out the same as:
for (int i = 0; i < N; i++) {
    int j = i + 25;
    int k = i + 314159;
    do_something_with(i, j, k);
}
So the main rule should be, if your code's readability doesn't benefit from creating a new variable to hold the result of a "temporary" calculation, it's probably overkill to use one.
If, on the other hand, you're using i + 12345 a dozen times in ten lines of code ... name it, and comment why this strange hardcoded offset is so important.
Remember, just because your source code contains a variable doesn't mean the binary code emitted by the compiler will allocate memory for that variable. The compiler might come to the conclusion that the value isn't even used (and completely discard the calculation assigning it), or it might conclude that it's "only an intermediate" (never used later, where it would have to be retrieved from memory) and so keep it in a register, to be overwritten after its last use. It's far more efficient to calculate a value like i + 1 each time you need it than to retrieve it from a memory location.
My advice would be:
keep your code readable first and foremost - too many variables obscure rather than help.
don't bother saving "simple" intermediates - addition/subtraction or scaling by powers of two is pretty much a "free" operation
if you reuse the same value ("arithmetic term") in multiple places, save it if it is expensive to calculate (for example involves function calls, a long sequence of arithmetics, or a lot of memory accesses like an array checksum).
So when we compute an intermediate result, we also need to use some memory, right? Can anyone please explain it to me?
There are several levels of memory in a computer. The layers look like this:
registers - the CPU does all its calculations in these, and access is essentially instant
caches - memory tightly coupled to the CPU core; all accesses to main system memory actually go through the cache, and to the program it looks as if the data comes from and goes to system memory. If the data is present in the cache and the access is well aligned, the access is almost instant as well and hence very fast.
main system memory - connected to the CPU through a memory controller and shared by the CPU cores in a system. Accessing main memory introduces latencies through addressing and the limited bandwidth between memory and the CPUs.
When you work with in-situ calculated intermediary results those often never leave the registers or may go only as far as the cache and thus are not limited by the available system memory bandwidth or blocked by memory bus arbitration or address generation interlock.
This hurts me.
Ask your teacher (or better, don't, because with his level of competence in programming I wouldn't trust him), whether he has measured it, and what the difference was. The rule when you are programming for speed is: If you haven't measured it, and measured it before and after a change, then what you are doing is purely based on presumption and worthless.
In reality, an optimising compiler will take the code that you write and translate it to the fastest possible machine code. As a result, it is unlikely that there is any difference in code or speed.
On the other hand, using intermediate variables will make complex expressions easier to understand and easier to get right, and it makes debugging a lot easier. If your huge complex expression gives what looks like the wrong result, intermediate variables make it possible to check the calculation bit by bit and find where the error is.
Now even if he was right and removing intermediate variables made your code faster, and even if anyone cared about the speed difference, he would be wrong: Making your code readable and easier to debug gets you to a correctly working version of the code quicker (and if it doesn't work, nobody cares how fast it is). Now if it turns out that the code needs to be faster, the time you saved will allow you to make changes that make it really faster.

Structure of arrays and array of structures - performance difference

I have a class like this:
//Array of Structures
class Unit
{
public:
    float v;
    float u;
    //And similarly many other variables of float type, up to 10-12 of them.
    void update()
    {
        v += u;
        v = v * i * t;
        //And many other equations
    }
};
I create an array of objects of Unit type and call update() on them.
int NUM_UNITS = 10000;
void ProcessUpdate()
{
    Unit *units = new Unit[NUM_UNITS];
    for (int i = 0; i < NUM_UNITS; i++)
    {
        units[i].update();
    }
}
In order to speed up things, and possibly autovectorize the loop, I converted AoS to structure of arrays.
//Structure of Arrays:
class Unit
{
public:
    Unit(int NUM_UNITS)
    {
        v = new float[NUM_UNITS];
        u = new float[NUM_UNITS];
        //And the other arrays
    }
    float *v;
    float *u;
    //Many other variables
    void update()
    {
        for (int i = 0; i < NUM_UNITS; i++)
        {
            v[i] += u[i];
            //Many other equations
        }
    }
};
When the loop fails to autovectorize, I am getting very bad performance for the structure of arrays. For 50 units, SoA's update is slightly faster than AoS, but from 100 units onwards SoA is slower than AoS. At 300 units, SoA is almost twice as slow. At 100K units, SoA is 4x slower than AoS. While cache might be an issue for SoA, I didn't expect the performance difference to be this high. Profiling with cachegrind shows a similar number of misses for both approaches. The size of a Unit object is 48 bytes. L1 cache is 256K, L2 is 1MB and L3 is 8MB. What am I missing here? Is this really a cache issue?
Edit:
I am using gcc 4.5.2. Compiler options are -O3 -msse4 -ftree-vectorize.
I did another experiment with SoA. Instead of dynamically allocating the arrays, I allocated "v" and "u" at compile time. When there are 100K units, this gives performance which is 10x faster than the SoA with dynamically allocated arrays. What's happening here? Why is there such a performance difference between statically and dynamically allocated memory?
Structure of arrays is not cache friendly in this case.
You use both u and v together, but with two separate arrays for them they will not be loaded together into one cache line, and the cache misses will cost a huge performance penalty.
_mm_prefetch can be used to make AoS representation even faster.
Prefetches are critical to code that spends most of its execution time waiting for data to show up. Modern front side busses have enough bandwidth that prefetches should be safe to do, provided that your program isn't going too far ahead of its current set of loads.
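As an illustration, here is a minimal sketch of what an explicit prefetch might look like in the AoS update loop (the Unit struct is abbreviated from the question, and the prefetch distance of 8 is a made-up value that would need tuning):

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

struct Unit { float v, u; void update() { v += u; } };   // abbreviated from the question

void update_all(Unit *units, int n)
{
    const int prefetch_distance = 8;   // assumption: tune for your CPU and memory latency
    for (int i = 0; i < n; ++i) {
        if (i + prefetch_distance < n)
            _mm_prefetch(reinterpret_cast<const char*>(&units[i + prefetch_distance]),
                         _MM_HINT_T0);
        units[i].update();
    }
}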
For various reasons, structures and classes can create numerous performance issues in C++, and may require more tweaking to get acceptable levels of performance. When code is large, use object-oriented programming. When data is large (and performance is important), don't.
float v[N];
float u[N];
//And similarly many other variables of float type, up to 10-12 of them.
//Either using an inlined function or just adding this text in main()
v[j] += u[j];
v[j] = v[j] * i[j] * t[j];
Two things you should be aware of that can make a huge difference, depending on your CPU:
alignment
cache line aliasing
Since you are using SSE4, using a specialized memory allocation function that returns an address aligned to a 16-byte boundary instead of new may give you a boost, since you or the compiler will be able to use aligned loads and stores. I have not noticed much difference on newer CPUs, but using unaligned loads and stores on older CPUs may be a little bit slower.
As for cache line aliasing, Intel explicitly mentions it in its reference manuals (search for "Intel® 64 and IA-32 Architectures Optimization Reference Manual"). Intel says it is something you should be aware of, especially when using SoA. So, one thing you can try is to pad your arrays so that the lower 6 bits of their addresses are different. The idea is to avoid having them fight for the same cache line.
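A small sketch of the aligned-allocation suggestion (using _mm_malloc here; posix_memalign or C++17 aligned operator new would work just as well, and the array size is simply the 100K from the question):

#include <xmmintrin.h>   // _mm_malloc / _mm_free (on some compilers via <mm_malloc.h>)

const int NUM_UNITS = 100000;

void allocate_soa(float *&v, float *&u)
{
    // 16-byte aligned allocations so SSE aligned loads/stores can be used.
    v = static_cast<float*>(_mm_malloc(NUM_UNITS * sizeof(float), 16));
    u = static_cast<float*>(_mm_malloc(NUM_UNITS * sizeof(float), 16));
}

void free_soa(float *v, float *u)
{
    _mm_free(v);
    _mm_free(u);
}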
Certainly, if you don't achieve vectorization, there's not much incentive to make an SoA transformation.
Besides the fairly wide de facto acceptance of __RESTRICT, gcc 4.9 has adopted #pragma GCC ivdep to break assumed aliasing dependencies.
As to use of explicit prefetch, if it is useful, of course you may need more of them with SoA. The primary point might be to accelerate DTLB miss resolution by fetching pages ahead, so your algorithm could become more cache hungry.
I don't think intelligent comments could be made about whatever you call "compile time" allocation without more details, including specifics about your OS. There's no doubt that the tradition of allocating at a high level and re-using the allocation is important.

optimizing `std::vector operator []` (vector access) when it becomes a bottleneck

gprof says that my compute-heavy app spends 53% of its time inside std::vector<...>::operator[](unsigned long), 32% of which goes to one heavily used vector. Worse, I suspect that my parallel code's failure to scale beyond 3-6 cores is due to a related memory bottleneck. While my app does spend a lot of time accessing and writing memory, it seems like I should be able (or at least try) to do better than 52%. Should I try using dynamic arrays instead (the size remains constant in most cases)? Would that be likely to help with possible bottlenecks?
Actually, my preferred solution would be to solve the bottleneck and leave the vectors as is for convenience. Based on the above, are there any likely culprits or solutions (tcmalloc is out)?
Did you examine your memory access pattern itself? It might be inefficient - cache unfriendly.
Did you try using a raw pointer for the array access?
// regular place
for (int i = 0; i < arr.size(); ++i)
    wcout << arr[i];

// In bottleneck
int *pArr = &arr.front();
for (int i = 0; i < arr.size(); ++i)
    wcout << pArr[i];
I suspect that gprof prevents functions from being inlined. Try another profiling method. std::vector's operator[] cannot be the bottleneck, because it doesn't differ much from raw array access. The SGI implementation is shown below:
reference operator[](size_type __n) { return *(begin() + __n); }
iterator begin() { return _M_start; }
You cannot trust gprof for high-speed code profiling, you should instead use a passive profiling method like oprofile to get the real picture.
As an alternative you could profile by manual code alteration (e.g. calling a computation 10 times instead of one and checking how much the execution time increases). Note that this is however going to be influenced by cache issues so YMMV.
The vector class is well liked and provides a certain amount of convenience, at the expense of performance, which is fine when you don't particularly need performance.
If you really need performance, it won't hurt you too much to bypass the vector class and go directly to a simple old hand-made array, whether statically or dynamically allocated. Then 1) the time you currently spend indexing should essentially disappear, speeding up your app by that amount, and 2) you can move on to whatever the "next big thing" is that takes time in your app.
EDIT:
Most programs have a lot more room for speedup than you might suppose. I made a walk-through project to illustrate this. If I can summarize it really quickly, it goes like this:
Original time is 2.7 msec per "job" (the number of "jobs" can be varied to get enough run-time to analyze it).
First cut showed roughly 60% of time was spent in vector operations, including indexing, appending, and removing. I replaced with a similar vector class from MFC, and time decreased to 1.8 msec/job. (That's a 1.5x or 50% speedup.)
Even with that array class, roughly 40% of time was spent in the [] indexing operator. I wanted it to index directly, so I forced it to index directly, not through the operator function. That reduced time to 1.5 msec/job, a 1.2x speedup.
Now roughly 60% of the time is adding/removing items in arrays. An additional fraction was spent in "new" and "delete". I decided to chuck the arrays and do two things. One was to use do-it-yourself linked lists, and to pool used objects. The first reduced time to 1.3 msec (1.15x). The second reduced it to 0.44 msec (2.95x).
Of that time, I found that about 60% of the time was in code I had written to do indexing into the list (as if it were an array). I decided that could be done instead just by having a pointer directly into the list. Result: 0.14 msec (3.14x).
Now I found that nearly all the time was being spent in a line of diagnostic I/O I was printing to the console. I decided to get rid of that: 0.0037 msec (38x).
I could have kept going, but I stopped.
The overall time per job was reduced by a compounded factor of about 700x.
What I want you to take away is that if you need performance badly enough to deviate from what might be considered the accepted way of doing things, you don't have to stop after one "bottleneck".
Just because you got a big speedup doesn't mean there are no more.
In fact the next "bottleneck" might be bigger than the first, in terms of speedup factor.
So raise your expectations of speedup you can get, and go for broke.