Type safe physics operations in C++ - c++

Does it make sens in C++ to define physics units as separate types and define valid operations between those types?
Is there any advantage in introducing a lot of types and a lot of operator overloading instead of using just plain floating point values to represent them?
Example:
class Time{...};
class Length{...};
class Speed{...};
...
Time operator""_s(long double val){...}
Length operator""_m(long double val){...}
...
Speed operator/(const Length&, const Time&){...}
Where Time, Length and Speed can be created only as a return type from different operators?

Does it make sens in C++ to define physics units as separate types and define valid operations between those types?
Absolutely. The standard Chrono library already does this for time points and durations.
Is there any advantage in introducing a lot of types and a lot of operator overloading instead of using just plain floating point values to represent them?
Yes: you can use the type system to catch errors like adding a mass to a distance at compile time, without adding any runtime overhead.
If you don't feel like defining the types and operators yourself, Boost has a Units library for that.

I would really recommend boost::units for this. It does all the conversion compile-time and also it gives you a compile time error if you're trying using erroneous dimensions
psuedo code example:
length l1, l2, l3;
area a1 = l1 * l2; // Compiles
area a2 = l1 * l2 * l3; // Compile time error, an area can't be the product of three lengths.
volume v1 = l1 * l2 * l3; // Compiles

I've gone down this road. The advantages are all the normal numerous and good advantages of type safety. The disadvantages I've run into:
You'll want to save off intermediate values in calculations... such as seconds squared. Having these values be a type is somewhat meaningless (seconds^2 obviously isn't a type like velocity is).
You'll want to do increasingly complex calculations which will require more and more overloads/operator defines to achieve.
At the end of the day, it's extremely clean for simple calculations and simple purposes. But when math gets complicated, it's hard to have a typed unit system play nice.

Everyone has mentioned the type-safety guarantees as a plus. Another HUGE plus is the ability to abstract the concept (length) from the units (meter).
So for example, a common issue when dealing with units is to mix SI with metric. When the concepts are abstracted as classes, this is no longer an issue:
Length width = Length::fromMeters(2.0);
Length height = Length::fromFeet(6.5);
Area area = width * height; //Area is computed correctly!
cout << "The total area is " << area.toInches() << " inches squared.";
The user of the class doesn't need to know what units the internal-representation uses... at least, as long as there are no severe rounding issues.
I really wish more trigonometry libraries did this with angles, because I always have to look up whether they're expecting degrees or radians...

For those looking for a powerful compile-time type-safe unit library, but are hesitant about dragging in a boost dependency, check out units. The library is implemented as a single .h file with no dependencies, and comes with a project to build unit tests/documentation. It's tested with msvc2013, 2015, and gcc-4.9.2, and should work with later versions of those compilers as well.
Full Disclosure: I'm the author of the library

Yes, it makes sense. Not only in physics, but in any discipline. In finance, e.g. interest rates are in units of inverse time intervals (typically express per year). Money has many different units. Converting between them can only be done with a cross-rate, has dimensions of one currency divided by another. Interest payments, dividend payments, principal payments, etc. ordinarily occur at a frequency.
It can prevent multiplying two values and ending up with an illegal value. It can prevent summing dollars and euros, etc.

I'm not saying you're wrong to do so, but we've gone overboard with that on the project I'm working on and frankly I doubt its benefits outweigh its hassle. Particularly if you're on a team, good variable naming (just spell the darn things out), code reviews, and unit testing will prevent any problems. On the other hand, if you can use Boost, units might be something to check into (I haven't).

To check for type safety, you can use a dedicated library.
The most wiedly use is boost::units, it works perfertly with no execution time overhead, a lot of features. If this library theoritically solve your problem. From a more practical point of vew, the interface is so awkward and badly documented that you may have problems. Morever the compilation time increase drastically with the number of dimensions, so clearly check that you can compile in a reasonable time a large project before using it.
doc : http://www.boost.org/doc/libs/1_56_0/doc/html/boost_units.html
An alternative is to use unit_lite. There are less features than the boost library but the compilation is faster, the interface simpler and errors messages are readables. This lib requires C++11.
code : https://github.com/pierreblavy2/unit_lite
The link to the doc is in the github description (I'm not allowed to post more than 2 links here !!!).

I gave a tutorial presentation at CPPcon 2015 on the Boost.Units library. It's a powerful library that every scientific application should be using. But it's hard to use due to poor documentation. Hopefully my tutorial helps with this. You can find the slides/code here:

If you want to use a very light weight header-only library based on c++20, you can use TU (Typesafe Units). It supports manipulation (*, /, +, -, sqrt, pow to arbitrary floating point numbers, unary operations on scalar units) for all SI units. It supports units with floating point dimensions like length^(d) where d is a decimal number. Moreover it is simple to define your own units. It also comes with a test suite...
...and yes, I'm the author of this library.

Related

Is there a library to convert C++ expressions to MathML or similar?

I have many calculations in my program. Now I want to convert these caculations to formulas that can be part of reports and documentation (PDF). I want to be 100% certain that the reports match the actual code. What I don't want to do is parse the code myself.
In proof-of-concepts I created classes that contain both the value and the string of an expression.
Expression a("a", 7);
Expression b("b", 3);
Expression c("c");
c = a * b;
std::cout << c.formula() << std::endl; // would print "c = a * b"
I don't want to handle all cases, such as if, loops, ... myself. So is there a library that can do the job?
Most ways of representing a formula don't have concise representations of every kind of C++ control flow structure. What more, C++ "math" isn't the same as abstract mathematical equations; C++ default integer math is bounded, floating point math isn't real numbers, etc.
You'd probably be better off removing the mathematics from the C++ code entirely. There are computing platforms that can produce formatted representations of their mathematics as part of their feature set.
This also depends on the kind of mathematics you are doing. If you are doing abstract algebra, you'd probably use a different program or engine than if you where doing number crunching simulations.
If you don't want to use a 3rd party program, what I would do in your situation is write my own domain specific language, and either parse it or use expression templates to build it, and then use the expression tree in two ways (one for computation and the other for display). My approach would probably be a bit different than what you are describing, but that is probably my abstract mathematics bias showing through.
Unless I am doing a LOT of calculation (like, many kilobytes of equations), the reliability of this would probably be lower than just transcribing the formulas.

What is the standard way to maintain accuracy when dealing with incredibly precise floating point calculations in C++?

I'm in the process of converting a program to C++ from Scilab (similar to Matlab) and I'm required to maintain the same level of precision that is kept by the previous code.
Note: Although maintaining the same level of precision would be ideal. It's acceptable if there is some error with the finished result. The problem I'm facing (as I'll show below) is due to looping, so the calculation error compounds rather quickly. But if the final result is only a thousandth or so off (e.g. 1/1000 vs 1/1001) it won't be a problem.
I've briefly looked into a number of different ways to do this including:
GMP (A Multiple Precision
Arithmetic Library)
Using integers instead of floats (see example below)
Int vs Float Example: Instead of using the float 12.45, store it as an integer being 124,500. Then simply convert everything back when appropriate to do so. Note: I'm not exactly sure how this will work with the code I'm working with (more detail below).
An example of how my program is producing incorrect results:
for (int i = 0; i <= 1000; i++)
{
for (int j = 0; j <= 10000; j++)
{
// This calculation will be computed with less precision than in Scilab
float1 = (1.0 / 100000.0);
// The above error of float2 will become significant by the end of the loop
float2 = (float1 + float2);
}
}
My question is:
Is there a generally accepted way to go about retaining accuracy in floating point arithmetic OR will one of the above methods suffice?
Maintaining precision when porting code like this is very difficult to do. Not because the languages have implicitly different perspectives on what a float is, but because of what the different algorithms or assumptions of accuracy limits are. For example, when performing numerical integration in Scilab, it may use a Gaussian quadrature method. Whereas you might try using a trapezoidal method. The two may both be working on identical IEEE754 single-precision floating point numbers, but you will get different answers due to the convergence characteristics of the two algorithms. So how do you get around this?
Well, you can go through the Scilab source code and look at all of the algorithms it uses for each thing you need. You can then replicate these algorithms taking care of any pre- or post-conditioning of the data that Scilab implicitly does (if any at all). That's a lot of work. And, frankly, probably not the best way to spend your time. Rather, I would look into using the Interfacing with Other Languages section from the developer's documentation to see how you can call the Scilab functions directly from your C, C++, Java, or Fortran code.
Of course, with the second option, you have to consider how you are going to distribute your code (if you need to).Scilab has a GPL-compatible license, so you can just bundle it with your code. However, it is quite big (~180MB) and you may want to just bundle the pieces you need (e.g., you don't need the whole interpreter system). This is more work in a different way, but guarantees numerical-compatibility with your current Scilab solutions.
Is there a generally accepted way to go about retaining accuracy in floating
point arithmetic
"Generally accepted" is too broad, so no.
will one of the above methods suffice?
Yes. Particularly gmp seems to be a standard choice. I would also have a look at the Boost Multiprecision library.
A hand-coded integer approach can work as well, but is surely not the method of choice: it requires much more coding, and more severe a means to store and process aritrarily precise integers.
If your compiler supports it use BCD (Binary-coded decimal)
Sam
Well, another alternative if you use GCC compilers is to go with quadmath/__float128 types.

Retrofitting existing code with a floating point arbitrary precision C++ library, any chance of success?

Let say I have a snippet of code like this:
typedef double My_fp_t;
My_fp_t my_fun( My_fp_t input )
{
// some fp computation, it uses operator+, operator- and so on for type My_fp_t
}
My_fp_t input = 0.;
My_fp_t output = my_fun( input );
Is it possible to retrofit my existing code with a floating point arbitrary precision C++ library?
I would like to simple add #include <cpp_arbitrary_precision_fp>, change my typedef double My_fp_t; into typedef arbitrary_double_t My_fp_t; and let the operator overloading of C++ doing its job...
My main problem is that actually my code do NOT have the typedef :-( and so maybe my plan is doomed to failure.
Assuming that my code had the typedef, what other problems would I face?
This might be tough. I used a template approach in my PhD thesis code do deal with different numerical types. You might want to take a look at it to see the problems I encountered.
The thing is you are fine if all you do with your numbers is use the standard arithmetic operators. However, as soon as you use a square root or some other non operator function you need to create helper objects to detect your object's type (at compile time as it is too slow to do this at run time; see the boost metaprogramming library for help on that) and then call the correct function and return it as the correct type. It is all totally doable, but is likely to take longer than you think and will add considerably to the complexity of your code.
In my experience, (I was using GMP which must be the fastest arbitrary precision library available for C++) after all of the effort and complexity I had introduced, I found that GMP was just too slow for the sorts of computation that I was doing; so it was academically interesting, but practically useless. Before you start on this do some speed tests to see whether your library will still be usable if you use arbitrary precision arithmetic.
If the library defines a type that correctly overloads the operators you use, I don't see any problem...

Definitions of sqrt, sin, cos, pow etc. in cmath

Are there any definitions of functions like sqrt(), sin(), cos(), tan(), log(), exp() (these from math.h/cmath) available ?
I just wanted to know how do they work.
This is an interesting question, but reading sources of efficient libraries won't get you very far unless you happen to know the method used.
Here are some pointers to help you understand the classical methods. My information is by no means accurate. The following methods are only the classical ones, particular implementations can use other methods.
Lookup tables are frequently used
Trigonometric functions are often implemented via the CORDIC algorithm (either on the CPU or with a library). Note that usually sine and cosine are computed together, I always wondered why the standard C library doesn't provide a sincos function.
Square roots use Newton's method with some clever implementation tricks: you may find somewhere on the web an extract from the Quake source code with a mind boggling 1 / sqrt(x) implementation.
Exponential and logarithms use exp(2^n x) = exp(x)^(2^n) and log2(2^n x) = n + log2(x) to have an argument close to zero (to one for log) and use rational function approximation (usually Padé approximants). Note that this exact same trick can get you matrix exponentials and logarithms. According to #Stephen Canon, modern implementations favor Taylor expansion over rational function approximation where division is much slower than multiplication.
The other functions can be deduced from these ones. Implementations may provide specialized routines.
pow(x, y) = exp(y * log(x)), so pow is not to be used when y is an integer
hypot(x, y) = abs(x) sqrt(1 + (y/x)^2) if x > y (hypot(y, x) otherwise) to avoid overflow. atan2 is computed with a call to sincos and a little logic. These functions are the building blocks for complex arithmetic.
For other transcendental functions (gamma, erf, bessel, ...), please consult the excellent book Numerical Recipes, 3rd edition for some ideas. The good'old Abramowitz & Stegun is also useful. There is a new version at http://dlmf.nist.gov/.
Techniques like Chebyshev approximation, continued fraction expansion (actually related to Padé approximants) or power series economization are used in more complex functions (if you happen to read source code for erf, bessel or gamma for instance). I doubt they have a real use in bare-metal simple math functions, but who knows. Consult Numerical Recipes for an overview.
Every implementation may be different, but you can check out one implementation from glibc's (the GNU C library) source code.
edit: Google Code Search has been taken offline, so the old link I had goes nowhere.
The sources for glibc's math library are located here:
http://sourceware.org/git/?p=glibc.git;a=tree;f=math;h=3d5233a292f12cd9e9b9c67c3a114c64564d72ab;hb=HEAD
Have a look at how glibc implements various math functions, full of magic, approximation and assembly.
Definitely take a look at the fdlibm sources. They're nice because the fdlibm library is self-contained, each function is well-documented with detailed explanations of the mathematics involved, and the code is immensely clear to read.
Having looked a lot at math code, I would advise against looking at glibc - the code is often quite difficult to follow, and depends a lot on glibc magic. The math lib in FreeBSD is much easier to read, if somehow sometimes slower (but not by much).
For complex functions, the main difficulty is border cases - correct nan/inf/0 handling is already difficult for real functions, but it is a nightmare for complex functions. C99 standard defines many corner cases, some functions have easily 10-20 corner cases. You can look at the annex G of the up to date C99 standard document to get an idea. There is also a difficult with long double, because its format is not standardized - in my experience, you should expect quite a few bugs with long double. Hopefully, the upcoming revised version of IEEE754 with extended precision will improve the situation.
Most modern hardware include floating point units that implement these functions very efficiently.
Usage: root(number,root,depth)
Example: root(16,2) == sqrt(16) == 4
Example: root(16,2,2) == sqrt(sqrt(16)) == 2
Example: root(64,3) == 4
Implementation in C#:
double root(double number, double root, double depth = 1f)
{
return Math.Pow(number, Math.Pow(root, -depth));
}
Usage: Sqrt(Number,depth)
Example: Sqrt(16) == 4
Example: Sqrt(8,2) == sqrt(sqrt(8))
double Sqrt(double number, double depth = 1) return root(number,2,depth);
By: Imk0tter

Speedup C++ code

I am writing a C++ number crunching application, where the bottleneck is a function that has to calculate for double:
template<class T> inline T sqr(const T& x){return x*x;}
and another one that calculates
Base dist2(const Point& p) const
{ return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z); }
These operations take 80% of the computation time. I wonder if you can suggest approaches to make it faster, even if there is some sort of accuracy loss
Thanks
First, make sure dist2 can be inlined (it's not clear from your post whether or not this is the case), having it defined in a header file if necessary (generally you'll need to do this - but if your compiler generates code at link time, then that's not necessarily the case).
Assuming x86 architecture, be sure to allow your compiler to generate code using SSE2 instructions (an example of an SIMD instruction set) if they are available on the target architecture. To give the compiler the best opportunity to optimize these, you can try to batch your sqr operations together (SSE2 instructions should be able to do up to 4 float or 2 double operations at a time depending on the instruction.. but of course it can only do this if you have the inputs to more than one operation on the ready). I wouldn't be too optimistic about the compiler's ability to figure out that it can batch them.. but you can at least set up your code so that it would be possible in theory.
If you're still not satisfied with the speed and you don't trust that your compiler is doing it best, you should look into using compiler intrinsics which will allow you to write potential parallel instructions explicitly.. or alternatively, you can go right ahead and write architecture-specific assembly code to take advantage of SSE2 or whichever instructions are most appropriate on your architecture. (Warning: if you hand-code the assembly, either take extra care that it still gets inlined, or make it into a large batch operation)
To take it even further, (and as glowcoder has already mentioned) you could perform these operations on a GPU. For your specific case, bear in mind that GPU's often don't support double precision floating point.. though if it's a good fit for what you're doing, you'll get orders of magnitude better performance this way. Google for GPGPU or whatnot and see what's best for you.
What is Base?
Is it a class with a non-explicit constructor? It's possible that you're creating a fair amount of temporary Base objects. That could be a big CPU hog.
template<class T> inline T sqr(const T& x){return x*x;}
Base dist2(const Point& p) const {
return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z);
}
If p's member variables are of type Base, you could be calling sqr on Base objects, which will be creating temporaries for the subtracted coordinates, in sqr, and then for each added component.
(We can't tell without the class definitions)
You could probably speed it up by forcing the sqr calls to be on primitves and not using Base until you get to the return type of dist2.
Other performance improvement opportunities are to:
Use non-floating point operations, if you're ok with less precision.
Use algorithms which don't need to call dist2 so much, possibly caching or using the transitive property.
(this is probably obvious, but) Make sure you're compiling with optimization turned on.
I think optimising these functions might be difficult, you might be better off optimising the code that calls these functions to call them less, or to do things differently.
You don't say whether the calls to dist2 can be parallelised or not. If they can, then you could build a thread pool and split this work up into smaller chunks per thread.
What does your profiler tell you is happening inside dist2. Are you actually using 100% CPU all the time or are you cache missing and waiting for data to load?
To be honest, we really need more details to give you a definitive answer.
If sqr() is being used only on primitive types, you might try taking the argument by value instead of reference. That would save you an indirection.
If you can organise your data suitably then you may well be able to use SIMD optimisation here. For an efficient implementation you would probably want to pad your Point struct so that it has 4 elements (i.e. add a fourth dummy element for padding).
If you have a number of these to do, and you're doing graphics or "graphic like" tasks (thermal modeling, almost any 3d modeling) you might consider using OpenGL and offloading the tasks to a GPU. This would allow the computations to run in parallel, with highly optimized operational capacity. After all, you would expect something like distance or distancesq to have its own opcode on a GPU.
A researcher at a local univeristy offload almost all of his 3d-calculations for AI work to the GPU and achieved much faster results.
There are a lot of answers mentioning SSE already… but since nobody has mentioned how to use it, I'll throw another in…
Your code has most everything a vectorizer needs to work, except two constraints: aliasing and alignment.
Aliasing is the problem of two names referring two the same object. For example, my_point.dist2( my_point ) would operate on two copies of my_point. This messes with the vectorizer.
C99 defines the keyword restrict for pointers to specify that the referenced object is referenced uniquely: there will be no other restrict pointer to that object in the current scope. Most decent C++ compilers implement C99 as well, and import this feature somehow.
GCC calls it __restrict__. It may be applied to references or this.
MSVC calls it __restrict. I'd be surprised if support were any different from GCC.
(It is not in C++0x, though.)
#ifdef __GCC__
#define restrict __restrict__
#elif defined _MSC_VER
#define restrict __restrict
#endif
 
Base dist2(const Point& restrict p) const restrict
Most SIMD units require alignment to the size of the vector. C++ and C99 leave alignment implementation-defined, but C++0x wins this race by introducing [[align(16)]]. As that's still a bit in the future, you probably want your compiler's semi-portable support, a la restrict:
#ifdef __GCC__
#define align16 __attribute__((aligned (16)))
#elif defined _MSC_VER
#define align16 __declspec(align (16))
#endif
 
struct Point {
double align16 xyz[ 3 ]; // separate x,y,z might work; dunno
…
};
This isn't guaranteed to produce results; both GCC and MSVC implement helpful feedback to tell you what wasn't vectorized and why. Google your vectorizer to learn more.
If you really need all the dist2 values, then you have to compute them. It's already low level and cannot imagine speedups apart from distributing on multiple cores.
On the other side, if you're searching for closeness, then you can supply to the dist2() function your current miminum value. This way, if sqr(x-p.x) is already larger than your current minimum, you can avoid computing the remaining 2 squares.
Furthermore, you can avoid the first square by going deeper in the double representation. Comparing directly on the exponent value with your current miminum can save even more cycles.
Are you using Visual Studio? If so you may want to look at specifying the floating point unit control using /fp fast as a compile switch. Have a look at The fp:fast Mode for Floating-Point Semantics. GCC has a host of -fOPTION floating point optimisations you might want to consider (if, as you say, accuracy is not a huge concern).
I suggest two techniques:
Move the structure members into
local variables at the beginning.
Perform like operations together.
These techniques may not make a difference, but they are worth trying. Before making any changes, print the assembly language first. This will give you a baseline for comparison.
Here's the code:
Base dist2(const Point& p) const
{
// Load the cache with data values.
register x1 = p.x;
register y1 = p.y;
register z1 = p.z;
// Perform subtraction together
x1 = x - x1;
y1 = y - y1;
z1 = z - z2;
// Perform multiplication together
x1 *= x1;
y1 *= y1;
z1 *= z1;
// Perform final sum
x1 += y1;
x1 += z1;
// Return the final value
return x1;
}
The other alternative is to group by dimension. For example, perform all 'X' operations first, then Y and followed by Z. This may show the compiler that pieces are independent and it can delegate to another core or processor.
If you can't get any more performance out of this function, you should look elsewhere as other people have suggested. Also read up on Data Driven Design. There are examples where reorganizing the loading of data can speed up performance over 25%.
Also, you may want to investigate using other processors in the system. For example, the BOINC Project can delegate calculations to a graphics processor.
Hope this helps.
From an operation count, I don't see how this can be sped up without delving into hardware optimizations (like SSE) as others have pointed out. An alternative is to use a different norm, like the 1-norm is just the sum of the absolute values of the terms. Then no multiplications are necessary. However, this changes the underlying geometry of your space by rearranging the apparent spacing of the objects, but it may not matter for your application.
Floating point operations are quite often slower, maybe you can think about modifying the code to use only integer arithmetic and see if this helps?
EDIT: After the point made by Paul R I reworded my advice not to claim that floating point operations are always slower. Thanks.
Your best hope is to double-check that every dist2 call is actually needed: maybe the algorithm that calls it can be refactored to be more efficient? If some distances are computed multiple times, maybe they can be cached?
If you're sure all of the calls are necessary, you may be able to squeeze out a last drop of performance by using an architecture-aware compiler. I've had good results using Intel's compiler on x86s, for instance.
Just a few thoughts, however unlikely that I will add anything of value after 18 answers :)
If you are spending 80% time in these two functions I can imagine two typical scenarios:
Your algorithm is at least polynomial
As your data seem to be spatial maybe you can bring the O(n) down by introducing spatial indexes?
You are looping over certain set
If this set comes either from data on disk (sorted?) or from loop there might be possibility to cache, or use previous computations to calculate sqrt faster.
Also regarding the cache, you should define the required precision (and the input range) - maybe some sort of lookup/cache can be used?
(scratch that!!! sqr != sqrt )
See if the "Fast sqrt" is applicable in your case :
http://en.wikipedia.org/wiki/Fast_inverse_square_root
Look at the context. There's nothing you can do to optimize an operation as simple as x*x.
Instead you should look at a higher level: where is the function called from? How often? Why? Can you reduce the number of calls? Can you use SIMD instructions to perform the multiplication on multiple elements at a time?
Can you perhaps offload entire parts of the algorithm to the GPU?
Is the function defined so that it can be inlined? (basically, is its definition visible at the call sites)
Is the result needed immediately after the computation? If so, the latency of FP operations might hurt you. Try to arrange your code so dependency chains are broken up or interleaved with unrelated instructions.
And of course, examine the generated assembly and see if it's what you expect.
Is there a reason you are implementing your own sqr operator?
Have you tried the one in libm it should be highly optimized.
The first thing that occurs to me is memoization ( on-the-fly caching of function calls ), but both sqr and dist2 it would seem like they are too low level for the overhead associated with memoization to be made up for in savings due to memoization. However at a higher level, you may find it may work well for you.
I think a more detailed analysis of you data is called for. Saying that most of the time in the program is spent executing MOV and JUMp commands may be accurate, but it's not going to help yhou optimise much. The information is too low level. For example, if you know that integer arguments are good enough for dist2, and the values are between 0 and 9, then a pre-cached tabled would be 1000 elements--not to big. You can always use code to generate it.
Have you unrolled loops? Broken down matrix opration? Looked for places where you can get by with table lookup instead of actual calculation.
Most drastic would be to adopt the techniques described in:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115.8660&rep=rep1&type=pdf
though it is admittedly a hard read and you should get some help from someone who knows Common Lisp if you don't.
I'm curious why you made this a template when you said the computation is done using doubles?
Why not write a standard method, function, or just 'x * x' ?
If your inputs can be predictably constrained and you really need speed create an array that contains all the outputs your function can produce. Use the input as the index into the array (A sparse hash). A function evaluation then becomes a comparison (to test for array bounds), an addition, and a memory reference. It won't get a lot faster than that.
See the SUBPD, MULPD and DPPD instructions. (DPPD required SSE4)
Depends on your code, but in some cases a stucture-of-arrays layout might be more friendly to vectorization than an array-of-structures layout.