What's a good PPC-based machine for profiling code for in-order processors? (C++)

I know that older Macs have PPC processors in them, which is perfect, but which specific models are suitable for dropping a Linux distribution onto? I haven't used a Mac in over 10 years, so I have no idea which to go for. In particular, I ask about ones that accept Linux because I believe Apple asks you to pay to develop on their machines, or is it possible to use C++ with GCC and LLVM for free on the Mac?
I need to be able to profile code on an in-order RISC processor, and the PPC seems like the best place to start, but what other CPUs offer a similar coding experience? That is, with a much reduced instruction set, stalls when branching, microcoded instructions, and load-hit-store problems when switching between float/int/vector representations.

There is no charge to develop on the Mac. There is a charge to install iOS products on an iPhone, and there is a charge to sell Mac products through the App Store, but you can build C++ apps for free on a Mac. Xcode itself is free.
Any PowerBook G4 is fine for this kind of work, and there are many pages on installing Linux on a PowerBook G4 if you wanted to do that (though I'd probably just use Xcode rather than go through the hassle).

Use Mac OS X and get the free Xcode developer tools from Apple (Xcode 3.x), and also the free CHUD performance tools package, which includes Shark, a very good sampling profiler that you will find extremely useful.

Slightly off-topic, but
in-order
It depends on precisely what you mean by in-order! PowerPC has a variety of synchronizing instructions, like sync, lwsync, and eieio, to enforce (different types of!) memory ordering, and isync, which flushes the instruction pipeline. IBM has a decent summary. (There's a small barrier sketch at the end of this answer.)
risc processor
I really wouldn't call the PPC "reduced" ;)
stalls when branching
IIRC, a correctly-predicted branch with its target in the instruction cache does not stall the G4 (I forget how the different models of G4 differ). OTOH, the G5 performs better if branch targets are 16-byte aligned (something about the branch target buffer).
microcode instructions
I thought half the point of RISC was to avoid microcode? I'm not aware of microcode updates, at any rate.
load-hit-store problems when switching between float/int/vector representations
I'm not sure what this means...
"Traditional" ARM might is probably closer to what you're looking for, but I suspect the more recent processors have some of the more "modern" processor features. My ARM box of choice is probably the SheevaPlug or similar, though the WZR-HP-G300NH router is cheaper (and comes with Wi-Fi) if you don't mind being constrained to 64 MB.

Related

Run OpenCL without compatible hardware?

I have two PCs:
a new high-end desktop PC, OpenCL compatible CPU and GPU, 32GB RAM
a very old laptop, Intel Celeron CPU, 512MB RAM, Ati M200 GPU
I am writing OpenCL/C++ software on my desktop PC, but when I travel somewhere, I continue the work on my old laptop. Programming C++ on this laptop is fine, but I can't try the OpenCL parts of my code, so at the moment I'm writing OpenCL code without knowing whether it is good or not.
Is there a way to virtualize an OpenCL-compatible CPU/GPU? I don't need high performance; I just want to try my code, and it doesn't matter if it is very slow (even slower than running it single-threaded on my Celeron CPU).
I guess, the answer is no.
(BTW, my plan is that there will be an option in my program so you can run it with or without OpenCL. This is also needed to measure performance and compare OpenCL CPU/GPU against the CPU running single-threaded without OpenCL.)
almost an answer, but not completely what I am looking for: http://www.acooke.org/cute/Developing0.html
For all existing OpenCL implementations, you need some form of SSE.
A website gathering all this info is here.
The lowest requirements are provided by the AMD OpenCL drivers, which require SSE3. As the list shows, that goes all the way back to late Pentium 4's.
In order to be sure about your CPU's capabilities, you'll need to use something like CPU-Z which can show the capabilities of your processor.
All that aside, I searched for laptops with your GPU, and ended up with processors like the Intel Celeron M 420, which according to Intel doesn't even have 64-bit support (which would imply SSE2).
I currently know of no other OpenCL implementations that are worth anything, so the answer would be no.
On the other hand, some websites claim that processor has SSE3 support, so that would mean AMD's OpenCL SDK is your option of choice.
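Whichever SDK you end up with, the run-with-or-without-OpenCL option you mention is easy to wire up: a minimal detection sketch (assumes the OpenCL headers, CL/cl.h, and an installed ICD loader; opencl_available() is just an illustrative helper name):

    #include <CL/cl.h>
    #include <vector>
    #include <iostream>

    // Returns true if at least one OpenCL platform exposes at least one device.
    bool opencl_available() {
        cl_uint num_platforms = 0;
        if (clGetPlatformIDs(0, NULL, &num_platforms) != CL_SUCCESS || num_platforms == 0)
            return false;                                   // no OpenCL runtime installed at all

        std::vector<cl_platform_id> platforms(num_platforms);
        clGetPlatformIDs(num_platforms, &platforms[0], NULL);

        for (cl_uint i = 0; i < num_platforms; ++i) {
            cl_uint num_devices = 0;
            // CL_DEVICE_TYPE_ALL: a CPU or a GPU device is equally fine for testing
            if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices) == CL_SUCCESS
                && num_devices > 0)
                return true;
        }
        return false;
    }

    int main() {
        if (opencl_available())
            std::cout << "OpenCL path enabled\n";
        else
            std::cout << "Falling back to the plain single-threaded CPU path\n";
    }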

How to profile a C++ function at assembly level?

I have a function that is the bottleneck of my program. It requires no memory access, only calculation. It is the inner loop and is called many times, so any small gain in this function is a big win for my program.
I come from a background of optimizing SPU code on the PS3, where you take an SPU program and run it through a pipeline analyzer that puts each assembly statement in its own column, and you minimize the number of cycles the function takes. Then you overlay loops so you can minimize pipeline dependencies even further. With that program and a list of the cycles each assembly instruction takes, I could optimize much better than the compiler ever could.
On a different platform there were events I could register (cache misses, cycles, etc.), and I could run the function and track CPU events. That was pretty nice as well.
Now I'm doing a hobby project on Windows using Visual Studio C++ 2010 with a Core i7 Intel processor. I don't have the money to justify paying the large cost of VTune.
My question:
How do I profile a function at the assembly level for an Intel processor on Windows?
I want to compile, view disassembly, get performance metrics, adjust my code and repeat.
There are some great free tools available, mainly AMD's CodeAnalyst (from my experience on my i7 vs. my Phenom II, it's a bit handicapped on the Intel processor because it doesn't have access to the hardware-specific counters, though that might have been a bad configuration).
However, a lesser-known tool is the Intel Architecture Code Analyzer (which is free like CodeAnalyst), which is similar to the SPU tool you described, as it details latency, throughput and port pressure (basically the dispatches to the ALUs, MMU and the like) line by line for your program's assembly. Stan Melax gave a nice talk on it and x86 optimization at this year's GDC, under the title "hotspots, flops and uops: to-the-metal cpu optimization".
Intel also has a few more tools in the same vein as IACA, available under the performance tuning section of their experimental/what-if code site, such as PTU, which is (or was) an experimental evolution of VTune and, from what I can see, is free.
It's also a good idea to read the Intel optimization manual before diving into this.
EDIT: as Ben pointed out, the timings might not be correct for older processors, but that can easily be made up for using Agner Fog's optimization manuals, which also contain many other gems.
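If you just want rough cycle numbers while you iterate on the disassembly, a tiny __rdtsc harness is often enough. A minimal sketch, assuming MSVC's <intrin.h> (hot_function is a hypothetical stand-in for the routine being tuned; on recent CPUs the TSC counts reference cycles, so treat the numbers as relative):

    #include <intrin.h>
    #include <cstdio>

    static volatile double sink = 0.0;
    static void hot_function() { sink = sink * 1.000001 + 0.5; }   // stand-in for the real inner loop

    int main() {
        const int iterations = 1000000;
        hot_function();                                  // warm caches and branch predictors
        unsigned long long start = __rdtsc();
        for (int i = 0; i < iterations; ++i)
            hot_function();
        unsigned long long elapsed = __rdtsc() - start;
        std::printf("~%.1f reference cycles per call\n",
                    static_cast<double>(elapsed) / iterations);
    }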
You might want to try some of the utilities included in Valgrind, like callgrind or cachegrind.
Callgrind can do profiling and dump assembly.
And KCachegrind is a nice GUI that will show the dumps, including assembly and the number of hits per instruction, etc.
From your description it sounds like your problem may be embarrassingly parallel; have you considered using PPL's parallel_for?
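A minimal sketch of what that looks like with Concurrency::parallel_for from <ppl.h> (VS2010 and later; compute() is a hypothetical stand-in for the independent per-element work):

    #include <ppl.h>
    #include <vector>

    static double compute(int i) { return i * 0.5; }   // stand-in: each element is independent

    int main() {
        const int n = 1 << 20;
        std::vector<double> out(n);
        Concurrency::parallel_for(0, n, [&](int i) {
            out[i] = compute(i);                        // iterations are distributed across cores
        });
        return 0;
    }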

Fastest way to run a program in a 64 bit environment?

It's been a couple of decades since I've done any programming. As a matter of fact, the last time I programmed was in an MS-DOS environment before Windows came out. I've had a programming idea that I have wanted to try for a few years now, and I thought I would give it a try. The amount of calculation is enormous. Consequently I want to run it in the fastest environment available to a general hobby programmer.
I'll be using a 64-bit machine. Currently it is running Windows 7. Years ago a program ran much slower in the Windows environment than it did in MS-DOS mode. My personal programming experience has been in Fortran, Pascal, Basic, and machine language for the Motorola 6800-series processors. I'm basically willing to try anything. I've fooled around with Ubuntu also. No objection to learning something new; I just want to take advantage of speed. I'd prefer to spend no money on this project, so I'm looking for a free or very nearly free compiler. I've downloaded Microsoft Visual Studio C++ Express, but I've got a feeling that the compiled code will have to be run in the Windows environment, which I'm sure slows the processing speed considerably.
So I'm looking for ideas or pointers to what is available.
Thank you,
Have a Great Day!
Jim
Speed generally comes at the price of either portability or complexity.
If your programming idea involves lots of computation and you're using an Intel CPU, you might want to use Intel's compiler, which can benefit from processor features that make your program faster. Otherwise, if portability is your goal, use GCC (the GNU Compiler Collection), which can cross-compile well-optimized executables for practically any platform on earth. If your computation can be parallelized, then you might want to look at SIMD (Single Instruction, Multiple Data) and GPGPU/CUDA/OpenCL (using the graphics card for computation) techniques.
However, I'd recommend you just try your idea in a simpler language first, e.g. Python, Java, C# or Basic, and see if the speed is good enough. Since you haven't programmed for decades, it's likely that what you once perceived as an enormous computation is now minuscule, thanks to increased processor speeds and RAM. Nowadays there is not much noticeable difference between running in a GUI environment and on the command line.
There is no substantial performance penalty to operating under Windows, and a large number of extremely high-performance applications do so. With new compiler advances and new optimization techniques, Windows is no longer the up-and-coming, new, poorly optimized technology it was twenty years ago.
The simple fact is that if you haven't programmed for 20 years, you won't have any realistic picture of performance at all. You should do what most people do: start with an easy-to-learn but not very fast language like C#, create the program, then prove that it runs too slowly, then make several optimization passes with tools such as profilers, and only then decide whether the language is too slow. If you haven't written a line of code in two decades, the overwhelming probability is that any program you write will be slow because you're a novice programmer by modern standards, not because of your choice of language or environment. Creating very high-performance applications requires a detailed understanding of the target platform as well as the language of choice, AND the operation of the program.
I'd definitely recommend Visual C++. The Express Edition is free and Visual Studio 2010 can produce some unreasonably fast code. Windows is not a slow platform - even if you handwrote your own OS, it'd probably be slower, and even if you produced one that was faster, the performance gain would be negligible unless your program takes days or weeks to complete a single execution.
The OS does not make your program magically run slower. True, the OS does eat a few clock cycles here and there, but it's really not enough to be at all noticeable (and it does so in order to provide you with services you most likely need and would otherwise have to re-implement yourself).
Windows doesn't, as some people seem to believe, eat 50% of your CPU. It might eat 0.5%, but so do Linux and OS X. And if you were to ditch all existing OSes and instead write your own from scratch, you'd end up with a buggy, less capable OS which also eats a bit of CPU time.
So really, the environment doesn't matter.
What does matter is what hardware you run the program on (and here, running it on the GPU might be worth considering) and how well you utilize the hardware (concurrency is pretty much a must if you want to exploit modern hardware).
What code you write, and how you compile it does make a difference. The hardware you're running on makes a difference. The choice of OS does not.
A digression: the claim that the OS doesn't matter for performance is, in general, obviously false. Citing CPU utilization when idle seems a rather peculiar idea to me: of course one hopes that when no jobs are running the OS is not wasting energy. Otherwise one should measure the speed/throughput of an OS while it is providing a service (i.e. mediating access to hardware/resources).
To avoid an annoying MS Windows vs. Linux vs. Mac OS X battle, I will refer to a research OS concept: exokernels. The point of exokernels is that a traditional OS is not just a mediator for resource access; it also implements policies. Such policies do not always favor the performance of your application-specific access pattern for a resource. With the exokernel concept, researchers proposed to "exterminate all operating system abstractions" (.pdf), retaining only the multiplexer role. In this way:
… The results show that common unmodified UNIX applications can enjoy the benefits of exokernels: applications either perform comparably on Xok/ExOS and the BSD UNIXes, or perform significantly better. In addition, the results show that customized applications can benefit substantially from control over their resources (e.g., a factor of eight for a Web server). …
So, by bypassing the usual OS access policies, they gained roughly a factor of eight in performance for a customized web server.
Returning to the original question: it's generally true that an application is executed with no or negligible OS overhead when:
it has a compute-intensive kernel, where such kernel does not call the OS API;
there is enough memory, or data is accessed in a way that does not cause excessive paging;
all inessential services running on the same systems are switched off.
There are possibly other factors, depending on hardware/OS/application.
I assume that the OP is correct in his rough estimate of the computing power required. The OP does not specify the nature of the intensive computation, so it's difficult to give suggestions. But he wrote:
The amount of calculations are enormous
"Calculations" seems to allude to compute-intensive kernels, for which I think is required a compiled language or a fast interpreted language with native array operators, like APL, or modern variant such as J, A+ or K (potentially, at least: I do not know if they are taking advantage of modern hardware).
Anyway, the first piece of advice is to spend some time researching fast algorithms for your specific problem (but when comparing algorithms, remember that asymptotic notation disregards constant factors, which are sometimes not negligible).
For the sequential part of your program, good utilization of the CPU caches is crucial for speed. Look into cache-conscious algorithms and data structures.
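A tiny sketch of what cache-conscious means in practice: the two loops below do identical arithmetic, but the first walks memory contiguously while the second strides across it, and on most hardware the strided version is markedly slower:

    #include <vector>
    #include <cstddef>
    #include <cstdio>

    int main() {
        const int n = 2048;
        std::vector<double> m(static_cast<std::size_t>(n) * n, 1.0);

        double row_sum = 0.0;
        for (int i = 0; i < n; ++i)            // row-major traversal: consecutive addresses
            for (int j = 0; j < n; ++j)
                row_sum += m[static_cast<std::size_t>(i) * n + j];

        double col_sum = 0.0;
        for (int j = 0; j < n; ++j)            // column traversal: a stride of n doubles per access
            for (int i = 0; i < n; ++i)
                col_sum += m[static_cast<std::size_t>(i) * n + j];

        std::printf("%f %f\n", row_sum, col_sum);
    }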
For the parallel part, if the program is amenable to parallelization (remember both Amdahl's law and Gustafson's law), there are different kinds of parallelism to consider (they are not mutually exclusive):
Instruction-level parallelism: it is taken care of by the hardware/compiler;
data parallelism:
bit-level: sometimes the acronym SWAR (SIMD Within A Register) is used for this kind of parallelism. It applies to problems (or parts of them) for which you can formulate a data representation that maps to bit vectors (where a value is represented by one or more bits); each instruction from the instruction set is then potentially a parallel instruction operating on multiple data items (SIMD). Especially interesting on a machine with 64-bit (or larger) registers. Possible on CPUs and some GPUs. No compiler support required;
fine-grain medium parallelism: ~10 operations in parallel on x86 CPUs with SIMD instruction-set extensions like SSE, its predecessors and successors, and similar; compiler support required;
fine-grain massive parallelism: hundreds of operations in parallel on GPGPUs (using common graphics cards for general-purpose computations), programmed with OpenCL (open standard), CUDA (NVIDIA), DirectCompute (Microsoft), BrookGPU (Stanford University) or Intel Array Building Blocks. Compiler support or a dedicated API is required. Note that some of these also have back-ends for SSE instructions;
coarse-grain modest parallelism (at the level of threads, not single instructions): it's not unusual for CPUs in current desktops/laptops to have more than one core (2/4) sharing the same memory pool (shared memory). The standard for shared-memory parallel programming is the OpenMP API, where, for example in C/C++, #pragma directives are used around loops (see the small sketch after this list). If I am not mistaken, this can be considered data parallelism emulated on top of task parallelism;
task parallelism: each core in one (or multiple) CPU(s) has its own independent flow of execution and possibly operates on different data. Here one can use the concept of "thread" directly or a higher-level programming model which masks threads.
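As a concrete illustration of the coarse-grain shared-memory kind, a minimal OpenMP sketch (assumes a compiler with OpenMP enabled, e.g. -fopenmp for GCC or /openmp for MSVC):

    #include <omp.h>
    #include <vector>
    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        std::vector<double> data(n, 1.0);
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)   // iterations split across threads, partial sums combined
        for (int i = 0; i < n; ++i)
            sum += data[i] * data[i];

        std::printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    }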
I will not go into details of these programming models here because apparently it is not what the OP needs.
I think this is enough for the OP to evaluate for himself how various languages and their compilers/run-times/interpreters/libraries support these forms of parallelism.
Just my two cents about DOS vs. Windows.
Years ago (something like 1998?), I had the same assumption.
I had a program written in QBasic (this was before I discovered C) which did intense calculations (neural network back-propagation). And it took time.
A friend offered to rewrite the thing in Visual Basic. I objected, because, you know, all those gizmos, widgets and fancy windows, you know, would slow down the execution of, you know, the important code.
The Visual Basic version outperformed the QBasic one by so much that it became the default application (I won't mention the "hey, even Excel's VBA outperforms you" part, because of my wounded pride, but...).
The point here, is the "you know" part.
You don't know.
The OS here is not important. As others explained in their answers, choose your hardware and choose your language. And write your code in a clear way, because nowadays compilers are better at optimizing code than developers are, unless you're John Carmack (premature optimization is the root of all evil).
Then, if you're not happy with the result, use a profiler to test your code. Consider multithreading (which will help you if you have multiple cores... TBB comes to mind).
What are you trying to do? I believe all the stuff gets compiled in 64-bit mode by default. Computers have gotten a lot faster. Speed should not be a problem for the most part.
Side note: for computation-intensive stuff you may want to look into OpenCL or CUDA. OpenCL and CUDA take advantage of the GPU, which can process far more data at a time than the CPU.
If your last points of reference are the M68K and PCs running DOS, then I'd suggest that you start with C/C++ on a modern processor and OS. If you run into performance problems and can prove that they are caused by running on Linux/Windows, or that the compiler-generated code isn't sufficient, then you could look at other OSes and/or hand-coded assembly. If you're looking for free, Linux/GCC is a good place to start.
I am the original poster of this thread.
I am once again emphasizing that this program will involve an enormous number of calculations.
Windows and Ubuntu are multi-tasking environments. There are processes running, and many of them are using processor resources. True, many of them appear inactive, but the Windows environment, by the nature of multi-tasking, is constantly checking whether each process needs to run. For example, there are currently 62 processes showing in the Windows Task Manager. According to the Task Manager, three are consuming CPU resources and an additional 59 show as active but consume no CPU. So that is 62 processes being monitored, and then Windows itself is also checking on various things.
I was hoping to find some way to run a program at the bare-machine level, sidestepping all the Windows (Ubuntu) involvement.
The idea is very calculation intensive.
Thank you all for taking the time to respond.
Have a Great Day,
Jim

Is there any real point compiling a Windows application as 64-bit?

I'd confidently say 99% of the applications we write don't need to address more than 2 GB of memory. Of course, there's a lot of obvious benefit to the OS running 64-bit so it can address more RAM, but is there any particular reason a typical application should be compiled as 64-bit?
There are performance improvements that you might see with 64-bit. A good example is that some parameters in function calls are passed via registers (fewer things to push on the stack).
Edit
I looked up some of my old notes from when I was studying the differences between running our product as a 64-bit build versus a 32-bit build. I ran the tests on a quad-core 64-bit machine, so there is the question of comparing apples to oranges, since the 32-bit build was obviously running under emulation (WOW64). However, many things I have read, such as this, consistently say that the speed hit for WOW64 is not significant. And even if that statement is not true, your application will almost certainly be run on a 64-bit OS, so a comparison of a 32-bit build versus a 64-bit build on a 64-bit machine has value.
In the testing I performed (certainly not comprehensive), I did not find any cases where the 32-bit build was faster. However, many of the SQL-intensive operations I ran (high CPU and high I/O) were 20% to 50% faster with the 64-bit build. These tests involved some fairly "ugly" SQL statements and also some TPC-C tests with high concurrency. Of course, a lot depends on compiler switches, so you need to do your own testing.
Building them as 64-bit now, even if you never release the build, can help you find and repair problems that you will encounter later when you're forced to build and release as 64-bit.
x64 has eight more general purpose registers that aren't available when running 32-bit code. That's three times as many (twice as many if you count ESI, EDI, EBP and ESP as general purpose; I don't). That can save a lot of loads and stores in functions that use more than four variables.
Don't underestimate the marketing value of offering a native 64-bit version of your product.
Also, you might be surprised just how many people work on apps that require as much memory as they can get.
I'd say only do it if you need more than 2 GB.
One thing is that 64-bit compilation means (obviously) 64-bit pointers. That means the code and data structures get a bit bigger, so the app will benefit a little less from the cache and will hit virtual memory a bit more often, etc.
So, if you don't need it, the basic effect is to make your app a bit slower and more bloated for no reason.
That said, as time goes on you'll care more about 64-bit anyway, just because that's what all the tools and libraries will be written for. Even if your app could live quite happily in 64K, you're unlikely to use 16-bit code: the gains don't really matter (it's a small, fast app anyway) and are certainly outweighed by the hassle involved. In time, we'll see 32-bit much the same way.
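To make the "a bit bigger" point above concrete, a small sizeof sketch (sizes assume typical 32-bit vs. LP64/LLP64 layouts):

    #include <cstdio>

    struct Node {
        Node *next;   // 4 bytes on a 32-bit build, 8 bytes on a 64-bit build
        Node *prev;
        int   value;  // padded so the struct stays pointer-aligned
    };
    // Typically 12 bytes when built as 32-bit, 24 bytes when built as 64-bit,
    // so fewer nodes fit in each cache line.

    int main() {
        std::printf("sizeof(Node) = %u bytes\n", static_cast<unsigned>(sizeof(Node)));
    }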
You could consider it future-proofing. It may be a long way off, but consider some years into the future, when 64-bit OSes and CPUs are ubiquitous (consider how 16-bit faded away when 32-bit took over). If your application is 32-bit and all your competitors have moved on to 64-bit by then, your application could be seen as (or accused by your competitors of being) out of date, slower, or incapable of change. Maybe one day support for 32-bit applications will even be dropped or incomplete (can Windows 7 run 16-bit apps properly?). If you're already building a 64-bit version of your application today, you avoid these problems. If you put it off till later, you might write a lot more code between now and when you port, and then your port will be even harder.
For a lot of applications there aren't many compelling technical reasons, but if it's easy, porting now might save you effort in future.
If you don't need the extended address space, delivering in 64-bit mode offers nothing and has some disadvantages, like increased memory consumption and cache pressure.
While we offer 64-bit builds, our customers who are at the limit are pushing us to reduce memory consumption so that they get those advantages.
All applications that may need lots of memory: database servers that want to cache lots of data in memory, scientific applications that handle lots of data, ...
I recently read this article, Optimizing software in C++. In chapter 2.3, Choice of operating system, there is a comparison of the advantages and disadvantages of 64-bit and 32-bit systems, with some specific observations regarding Windows.
Mark Wilkins already noted in this thread the extra registers for function calls. Another interesting property of a 64-bit system is this:
The SSE2 instruction set is supported on all 64-bit CPUs and operating systems.
SSE2 instructions can provide excellent optimizations and they are being increasingly used, so in my opinion this is a notable feature.
The x64 calling convention (a fastcall-style convention) also makes calling subroutines faster by keeping the first four parameters in registers.
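A small illustration of why the SSE2 guarantee is convenient: in a 64-bit build you can use the <emmintrin.h> intrinsics unconditionally, with no runtime CPU check (sketch only):

    #include <emmintrin.h>   // SSE2 intrinsics, always present on x64
    #include <cstdio>

    int main() {
        double a[4] = {1.0, 2.0, 3.0, 4.0};
        double b[4] = {10.0, 20.0, 30.0, 40.0};
        double c[4];

        for (int i = 0; i < 4; i += 2) {
            __m128d va = _mm_loadu_pd(&a[i]);              // load two packed doubles
            __m128d vb = _mm_loadu_pd(&b[i]);
            _mm_storeu_pd(&c[i], _mm_add_pd(va, vb));      // add both lanes in one instruction
        }
        std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    }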
When you say that 99% of apps won't benefit from 64-bit, that may well be true for you personally, but during the day I use Visual Studio and Xcode to compile C++ in a large codebase and search multi-GB repositories with Google Desktop and Spotlight. Then I come home to write music in a sequencer using several GB of sound libraries, do some Photoshop work on my 20 GB of photos, and maybe a bit of video editing with my holiday clips.
So for me (and I dare say many other users), having 64-bit versions of many of these apps will be a great advantage. Word processor, web browser, email client: maybe not. But anything involved with large media will really benefit.
More data can be processed per clock cycle, which can deliver performance improvements to applications such as crypto, video encoding, etc.

Porting 32 bit C++ code to 64 bit - is it worth it? Why?

I am aware of some of the obvious gains of the x64 architecture (higher addressable RAM, etc.)... but:
What if my program has no real need to run in native 64-bit mode? Should I port it anyway?
Are there any foreseeable deadlines for ending 32 bit support?
Would my application run faster / better / more secure as native x64 code?
x86-64 is a bit of a special case - for many architectures (eg. SPARC), compiling an application for 64 bit mode doesn't give it any benefit unless it can profitably use more than 4GB of memory. All it does is increase the size of the binary, which can actually make the code slower if it impacts on cache behaviour.
However, x86-64 gives you more than just a 64 bit address space and 64 bit integer registers - it also doubles the number of general purpose registers, which on a register-deficient architecture like x86 can result in a significant performance increase, with just a recompile.
It also lets the compiler assume that many extensions, like SSE and SSE2, are present, which can also significantly improve code optimisation.
Another benefit is that x86-64 adds PC-relative addressing, which can significantly simplify position-independent code.
However, if the app isn't performance sensitive, then none of this is really important either.
One possible benefit I haven't seen mentioned yet is that it might uncover latent bugs. Once you port it to 64-bit, a number of changes are made. The sizes of some datatypes change, the calling convention changes, the exception handling mechanism (at least on Windows) changes.
All of this might lead to otherwise hidden bugs surfacing, which means that you can fix them.
Assuming your code is correct and bug-free, porting to 64-bit should in theory be as simple as flipping a compiler switch. If that fails, it is because you're relying on things not guaranteed by the language, and those are potential sources of errors.
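A typical example of such an unguaranteed assumption is stuffing a pointer into an int; a sketch of the bug and the portable form (the commented-out line is the broken version):

    #include <cstdint>
    #include <cstdio>

    int main() {
        int value = 42;

        // int bad = (int)&value;   // compiles on 32-bit; on a 64-bit build it truncates or fails to compile

        std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(&value);  // guaranteed wide enough
        int *back = reinterpret_cast<int *>(bits);
        std::printf("%d\n", *back);                                      // prints 42
    }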
Here's what 64-bit does for you:
64-bit allows you to use more memory than a 32-bit app.
64-bit makes all pointers 64-bits, which makes your code footprint larger.
64-bit gives you more integer and floating-point registers, which means less spilling of registers to memory and should speed up your app somewhat.
64-bit can make 64-bit ALU operations faster (only helpful if you're using 64-bit data types).
You DO NOT get any extra security (another answer mentioned security, I'm not aware of any benefits like that).
You're limited to only running on 64-bit operating systems.
I've ported a number of C++ apps and seen about a 10% speedup with 64-bit code (same system, same compiler, the only change was a 32-bit vs 64-bit compiler mode), but most of those apps were doing a fair amount of 64-bit math. YMMV.
I wouldn't worry about 32-bit support going away any time soon.
(Edited to include notes from comments - thanks!)
Although it's true that 32-bit will be around for a while in some form or another, Windows Server 2008 R2 ships with a 64-bit SKU only. I would not be surprised to see WOW64 become an optional install as early as Windows 8 as more software migrates to 64-bit. WOW64 is an install, memory and performance overhead. The 3.5 GB RAM limit in 32-bit Windows, along with increasing RAM densities, will encourage this migration. I'd rather have more RAM than CPU...
Embrace 64-bit! Take the time to make your 32-bit code 64-bit compatible; it's a no-brainer and straightforward. For normal applications the changes are more accurately described as code corrections. For drivers the choice is: adapt or lose users. When the time comes you'll be ready to deploy on any platform with a recompile.
IMO the current cache related issues are moot; silicon improvements in this area and further 64-bit optimisation will be forthcoming.
If your program has no need to run under 64-bit, why would you? If you are not memory bound, and you don't have huge datasets, there is no point. The new Miata doesn't have bigger tires, because it doesn't NEED them.
32-bit support (even if only via emulation) will extend long past when your software ceases to be useful. We still emulate Atari 2600s, right?
No; in all likelihood, your application will be slower in 64-bit mode, simply because less of it will fit in the processor's cache. It might be slightly more secure, but good coders don't need that crutch :)
Rico Mariani's post on why Microsoft isn't porting Visual Studio to 64-bit really sums it up: Visual Studio: Why is there no 64 bit version? (yet)
It depends on whether your code is an application or a reusable library. For a library, keep in mind that the client of that library may have good reasons to run in 64-bit mode, so you have to ensure that your scenario works. This may also apply to applications when they are extensible via plugins.
If you don't have any real need now, and likely never will, for 64-bit mode, you shouldn't do porting.
If you don't have the need now, but may have it some day, you should try to estimate how much effort it will be (e.g. by turning on all the relevant compiler warnings and attempting a 64-bit compilation). Expect that some things won't be trivial, so it is useful to know what problems you would likely encounter and how long it would likely take to fix them.
Notice that a need may also arise from dependencies: if your program is a library (e.g. a DLL), it may be necessary to port it to 64-bit mode just because some host application gets ported.
For the foreseeable future, 32-bit applications will continue to be supported.
Unless there's a business reason to go to 64 bit, then there's no real "need" to support 64 bit.
However, there are some good reasons for going to 64 bit at some point, aside from all those that others have already mentioned.
It's getting harder to buy PCs that aren't 64 bit. Even though 32 bit apps will run in compatibility mode for years to come, any new PCs being sold today or in the future are likely to be 64 bit. If I have a shiny 64 bit operating system I don't really want to run "smelly old 32 bit apps" in compatibility mode!
Some things just don't run properly in compatibility mode; it's not the same thing as running on a 32-bit OS on 32-bit hardware. I've run into a few issues (e.g. registry access across the 32/64-bit registry hives, programs that fail because they're not in the folder they expect to be in, etc.) when running in compatibility mode. I always feel nervous about running my code in compatibility mode: it's simply "not the real thing", and it often shows.
If you have written your code cleanly, then chances are you only have to recompile it as a 64 bit exe and it'll work fine, so there's no real reason not to give it a try.
The earlier you build a native 64-bit version, the easier it will be to keep it working on 64-bit as you add new features. That's a much better plan than continuing to develop in the dark ages for another 'n' years and then trying to jump out into the light.
When you go for your next job interview, you will be able to say that you have 64-bit experience and 32-to-64 porting experience.
If you are already aware of the x64 advantages (most importantly the increased RAM limit) and you are not interested in any of them, then don't port an executable (exe). Performance usually degrades after a port, mainly due to the increase in size of an x64 module over x86: all pointers now require double the length, and this percolates everywhere, including code size (some jumps, function calls, vtables, virtual invocations, global symbols, etc.). It's not a significant degradation, but it is usually measurable (a 3-5% speed decrease, depending on many factors).
DLLs are worth porting because you gain a new audience: x64 apps that are able to consume your DLL.
Some OSs or configurations are unable to run 32-bit programs. A minimal Linux without 32-bit libc installed for example. Also IIRC I usually compile out the 32-bit support from the kernel.
If these OSs or configurations are part of your potential user base then yes, you should port it.
If you need more speed, then you should also port it (as others have said, x86-64 has more registers and cool instructions that speed it up).
Or, of course, if you want to mmap() or otherwise map a large file or lots of memory. Then 64-bit helps.
For example, if you had written 32-bit code (GNU C/C++) like the following:
struct packet {
    unsigned long name_id;
    unsigned short age;
};
for network messaging, then you will need to do some porting when recompiling on a 64-bit system, because of htonl/ntohl and the like: communication gets broken if the network peer is still on a 32-bit system (using the same code as yours), since sizeof(long) changes from 32 to 64 bits on your side.
See more notes about 32/64 porting at http://code.google.com/p/effocore/downloads/list, document name EffoCoreRef.pdf.
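For completeness, a sketch of the same struct written with fixed-width types so the wire layout stays identical across 32-bit and 64-bit builds (htonl/htons as declared in <arpa/inet.h> on POSIX; on Windows they come from <winsock2.h>):

    #include <cstdint>
    #include <arpa/inet.h>   // htonl/htons; use <winsock2.h> on Windows

    struct packet {
        std::uint32_t name_id;   // always 4 bytes, unlike unsigned long on LP64
        std::uint16_t age;       // always 2 bytes (the struct keeps 2 bytes of tail padding)
    };

    packet to_network_order(packet p) {
        p.name_id = htonl(p.name_id);   // byte-swap the fixed 32/16-bit fields for the wire
        p.age     = htons(p.age);
        return p;
    }

    int main() {
        packet p = { 1234u, 56u };
        packet wire = to_network_order(p);
        (void)wire;                      // would be written to the socket here
    }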
It's pretty unlikely that you'd see any benefit unless you're in need of extreme security measures or obscene amounts of RAM.
Basically, you'd most likely know intuitively if your code was a good candidate for 64-bit porting.
Regarding deadlines: I would not worry. 32-bit will be around natively for a good while, and in some other form for the foreseeable future.
See my answer to this question here. I closed out that post saying that a 64-bit computer can store and retrieve much more information than a 32-bit computer. For most users this really doesn't mean a whole lot because things like browsing the web, checking email and playing Solitaire all work comfortably within the confines of 32-bit addressing. Where the 64-bit benefit will really shine is in areas where you have a lot of data the computer will have to churn through. Digital signal processing, gigapixel photography and advanced 3D gaming are all areas where their massive amounts of data processing would see a big boost in a 64-bit environment.
As for your code running faster/better, it's entirely up to your code and the requirements imposed on it.
As for performance, it depends on your program. If your program is pointer-intensive, porting to 64-bit may degrade performance, since for a CPU cache of the same size each 64-bit pointer occupies more space in the cache, and the virtual-to-physical mappings also occupy more TLB space. Otherwise, if your program is not pointer-intensive, its performance will benefit from x64.
Of course performance is not the only reason for porting; other issues like porting effort and time scheduling should also be considered.
I would recommend porting it to 64-bit just so you are running "native". (Also, I use OpenBSD; in their AMD64 port, they do not provide any 32-bit emulation support, so everything must be 64-bit.)
Also, stdint.h is your best friend! By porting your application, you will learn how to code portably, which will make your code work right when we have 128-bit processors too (in a few decades, hopefully).
I've ported 2 or 3 things to 64-bit and now develop for both (which is very easy if you use stdint.h). On my first project the port to 64-bit caused 2 or 3 bugs to show up, but that was it. Most of it was a simple recompile, and now I don't worry about the differences between 32-bit and 64-bit when writing new code, because I just automatically code portably (using intptr_t and size_t and such).
If a DLL is being called from a 64-bit process, then the DLL has to be 64-bit as well.
In that case it does not matter whether it's worth it; you simply have no choice.
One issue to keep in mind is the software libraries available. For instance, my company develops an application that uses multiple OpenGL libraries, and we do so on the openSUSE OS. In older versions of the OS, one could download 32-bit versions of these libraries for the x86_64 architecture. Newer versions, however, don't have this. That made it easier to just compile in 64-bit mode.
64-bit will run a lot faster once 64-bit compilers become mature, but when that will happen I don't know.