Back to Simugraph
Back to the Simugraph library
|
A guide written by Hansjörg Malthaner ©2001, 2002, 2003, 2004 |
|
Initial version: 12-Nov-2001 Last update: 12-Nov-2004 |
The intention of this article is not to explain all the mentioned optimizations in detail, but rather to list a lot of possible optimizations. It is a high level guide to learn which aspects can be considered in optimizing code but probably you need other literature to learn all details of the mentioned optimizations.
You'll also find that most optimizations are tradeoffs. You trade one benefit for another drawback. Depending on what you need and what drawbacks you cannot accept, you need to choose the 'best' optimization for your problem very carefully. I'll try to list all drawbacks which I know for each proposed optimization, but I might miss some. It's also possible that the drawback does not exist right now, but will be introduced by a yet unknown processor/mainboard/memory architecture that future will bring.
Last, but not least, if you find something wrong or missing, just drop me a message. Thanks!
Examples are always C or C++ unless otherwise stated. But I try to keep all suggestion as general as possible to make them useful for other similar languages, too. Namely Java should benefit from nearly all mentioned optimizations as well, except from the low level things.
Many of the optimizations do not work in general. It's always a good idea to make measurements of the impact of a optimization. Measurements tell precise timing. You won't be able to tell a difference of less than 10% between to runs by yourself. Only exact numbers will tell you the truth. Along with this, I have a request: If you make measurements of one or the other optimization, please send me the numbers. This document needs some serious backing of the proposed optimizations, and numbers from real experiments are probably very valuable for other readers.
There are two goals for optimizing code:
In some cases both optimizations go hand in hand, in other cases you trade in one for the other. Using less memory means to transfer less memory which reduces the time needed for memory transfers. But often memory is used to store precalculated values avoid the actual calculation at runtime. In this case you trade space consumption for runtime efficiency.
This is most often the best way to optimize a program, but it is a fairly high level optimization and might be difficult to apply to existing programs. Basically this is a design stage optimization step.
Example: Unsorted list vs. sorted listAssuming, you are storing data in a unsorted list. The key is a numeric value, and lookup is the most frequent operation.
In this case it is better to use a sorted list, because in a sorted list you can look up elements in O(log(n)) while in unsorted lists lookups are done in O(n). That's because in unsorted lists each element must be checked until the searched is found, in average testing n/2 elements. For a list of 1000 elements this means a lookup in a unsorted list needs in average 500 steps, while in a sorted list it can be done in about 10 steps. This is 50 times faster! In practice it might be a little less than 50 times faster because the memory is not accessed linearly when searching sorted lists, which is bad for the processor cache.
Conclusion:If you are storing data, think about which data structure offers the most efficient implementation of the operations you need frequently. Compare at least arrays, vectors (dynamic resizeable arrays), list-like structures, queues, stacks, sets, tree-like structures and hashtables. Consider ordered and unordered versions, if both exist. If you consider using trees think of balanced and unbalanced trees.
This includes moving constant subexpressions out of a loop as well as using precalculated tables of values.
Example: precalculated dataAssuming you need to call sin(x) frequently, but only on a limited set of distinct values of x. In this case you could just calculate all sin(x) for all x at program start and store the results in a table. Later on you just look up the entries of that table. On most architectures a table lookup is faster than calling sin(). If memory bandwidth is a issue, calling sin(x) might still be better than a table lookup that inludes meory accesses.
But why calculate the table at program start? It will slow down the start! Even better than calculating the table at program start is to calculate the values only once, store them in a file and read the file on program start. If disk space consumption is a problem, calculating the table at program start might be better, though.
Example: constant subexpressionIf you have a fake 2D array and iterate over all entries, you often find loops like this:
int x,y;
color_t colors[width*height];
// ... fill color table ...
for(y=0; y<height; y++) {
for(x=0; x<width; x++) {
set_pixel(x,y, colors[y*width+x];
}
}
In this case y*width is constant and can be moved outside of the for(x=0; x<width; x++) loop:
int x,y;
color_t colors[width*height];
// ... fill color table ...
for(y=0; y<height; y++) {
const int line = y*width;
for(x=0; x<width; x++) {
set_pixel(x,y, colors[line+x]);
}
}
In C and C++ where arrays behave a bit like pointers you could actually move out more constant expressions:
int x,y;
color_t colors[width*height];
// ... fill color table ...
for(y=0; y<height; y++) {
const color_t * color = colors+y*width;
for(x=0; x<width; x++) {
set_pixel(x, y, *color++);
}
}
Test results
Compiled with egcs 2.91.66 on a Pentium III at 666MHz running NetBSD, 1.000 executions took:
| Compilerflags | What | Time |
| -O0 | Loop | 5.93 seconds in average |
| -O0 | Loop after CSE | 5.28 seconds in average |
| -O1 | Loop | 4.93 seconds in average |
| -O1 | Loop after CSE | 4.93 seconds in average |
| -O2 | Loop | 4.93 seconds in average |
| -O2 | Loop after CSE | 4.93 seconds in average |
Egcs' built-in optimizer obviously does constant subexpression elimination. I guess all modern compilers perform such optimizations, so there is no need to do this optimization by hand.
To measure your own system you can use my free subexpr.c.
The loop exit checks cost CPU time, too. Loop unrolling tries to get rid of the checks completely or to reduce the number of checks.
If you know a loop is only perfomed a certain number of times, or if you know the number of times it will be reperated is a multiple of a constant you can unroll this loop.
Example: fixed count
// old loop
for(int i=0; i<3; i++) {
color_map[n+i] = i;
}
// unrolled version
int i = 0;
colormap[n+i] = i;
i++;
colormap[n+i] = i;
i++;
colormap[n+i] = i;
Example: count is multiple of four
// old loop
for(int i=0; i<n; i++) {
*dest++ = *source++;
}
// unrolled version
// knowing n ist always a multiple of four
for(int i=0; i<n; i+=4) {
*dest++ = *source++;
*dest++ = *source++;
*dest++ = *source++;
*dest++ = *source++;
}
In the first example, the loop checks could be eliminated completely. In the second example, the amount of loop checks was reduced to 25% of the original code.
Excessively unrolling of loops will increase code size, and it some cases this results in slower execution. If the loop fit into the processor cache and the unrolled version does not, chances are good that the tight loop is executing faster.
Test resultsCompiled with egcs 2.91.66 on a Pentium III at 666MHz running NetBSD, 10.000.000 executions took:
| Compilerflags | What | Time |
| -O0 | Loop | 0.58 seconds in average |
| -O0 | Unrolled loop | 0.38 seconds in average |
| -O1 | Loop | 0.10 seconds in average |
| -O1 | Unrolled loop | 0.05 seconds in average |
| -O2 | Loop | 0.09 seconds in average |
| -O2 | Unrolled loop | 0.04 seconds in average |
I also timed the example of the partly unrolled loop. Still, on my setup the loop is significantly faster than the unrolled variant. Compiled with egcs 2.91.66 for NetBSD, on a Pentium III at 666MHz, 10.000.000 executions took:
| Compilerflags | What | Time |
| -O0 | Loop | 7.59 seconds in average |
| -O0 | Partly unrolled loop | 7.17 seconds in average |
| -O1 | Loop | 2.04 seconds in average |
| -O1 | Partly unrolled loop | 1.72 seconds in average |
| -O2 | Loop | 3.21 seconds in average |
| -O2 | Partly unrolled loop | 1.72 seconds in average |
The fully unrolled loop was about twice as fast as the original loop. The partly unrolled loop comparison gave a strange result: with flags -O2 the loop was 50% slower than the version compiled with -O1 (less optimization). I guess egcs-2.91.66 optimizer has a bug here. Nevertlkess the partly unrolled loop was always faster than its inrolled ancestor.
To measure your own system you can use my free looptest.c.
Usually the programmer knows much more about the constraints of the program than the compiler can deduce from the source code. At times the programmer can help the compiler to produce better code by adding information about the constraints to the program code. But not all languages allow this and not all compilers use the information. I'll refer to C++ in the following subsections.
In languages like C++ which offer value and reference types there are three different types of constants:
Please note that the latter two can be combined to make constant references to constant values. In object oriented languages you might also find the option to declare methods of objects constant, that means they only read object attributes but do not change them.
In C++ you declare them like this:
No measurements exist yet. But in my experience, constant values help gcc and egcs compilers to produce much better code. The impact of declaring references to point to constant values is less and constant references seem not to make any difference to non-constant references. I suggest to declare all methods constant, that can be declared constant, because only constant methods can be called by constant references and on constant value objects.
The code of the function body is moved right at the place of the function call, thus avoiding the call while providing the same functionality. You can either rely on your compiler or langauge runtime to do inlining for you automatically or use macros or just copy&paste to inline the code.
But be warned, the copy&paste approach will leave you probably with a unmaintainable program, and macros may have unexpected side effects. Hand-made inlining is dangerous.
Test resultsI timed calling a empty function void dummy() {return;}
Compiled with egcs 2.91.66 for NetBSD, on a Pentium III at 666MHz, and optimization -O1, 10.000.000 function calls took 0.07 seconds. Inlining only wins on very large numbers of function calls. Probably the win is bigger if the function takes parameters or returns a value.
Conclusion:Function calls are very fast according to my measurements. Use inlining if your compiler provides support for inlining or if you can use makros. But don't give up the single source principle for this oprimization, it's not worth the drawbacks, unless you need the tiniest bit of additional performance desperately.
If you called the function before with the same arguments, and it is a side effect free function, you can as well use the old result. If the function is very time consuming it might be worth to cache the results. If a call is to be done, look up the cache first and only call the function if the data is not yet present in the cache.
This approach works only with side effect free functions, and the cache needs some memory. Cache lookup doesn't come for free either, so depending on the function it may or may not worth all the overhead just to save a few calls. On the other hand if the function is called often, and time consuming a cache can speed up things a lot. Another problem is to derive good keys from the parameters for cache lookup. You must derive a unique key for chache lookup from the parameters.
Calls by function pointers and calls of virtual methods (in case of OOP) are slower than direct calls, but according to my measurements the difference is very small and will only sum up to a noticeable amount if the function is called many hundred thousand times. I suggest to make measurements to determine the difference for your compiler/runtime system, before you consider wrecking your OO design to get rid of virtual methods.
Test resultsI timed calling a empty function void dummy() {return;} by a pointer.
I used the same setup like I did when timing calls to dummy() directly. Results:
| Call type: | Time: |
| 10.000.000 direct calls | 0.07 seconds |
| 10.000.000 calls by function pointer | 0.08 seconds |
The difference between calling a function directly or indirectly by a pointer is too small to be worthy optimization in most cases. I'd not even bother thinking about it after theese results. 1/100 of a second difference for 10 million calls, this far too less to make a difference in most programs.
Virtual methods are indirectly called functions, but the penalty for calling virtual methods might be unexpected high: non virtual methods can be inlined by the compiler. Virtual methods cannot be inlined. Even if calling a virtual method is about as fast as calling a plain method, you loose the option to inline this method, which can give quite some speedup if the inlined method is called often. But the tradeoff isn't between virtual vs. nonvirtual, but calling vs. inlining in fact. Declaring methods virtual prohibits inlining with most compilers.
To measure your own system you can use my free calltest.c.
Copying data is a relatively slow operation which should be avoided if possible. The idea of this optimization strategy is to share a value instead of copying it. Sharing is usually done by using pointers or references to the value.
Sharing includes an indirection, which also has drawbacks for efficiency. Very small values, of about 8 bytes size might be equally well copied as shared. 4 byte sized values may be more efficiently used as copies. But everything above 8 bytes size should benefit from this optimization.
ExampleOne program part fills a buffer with data. This data is needed by many other parts of the program. Instead of copying the contents of the buffer, it is more efficient to hand pointers to the buffer to all program parts that want to access or process the data in the buffer.
If data is assembled from several blocks of fixed size, allocate a buffer big enough for all the data. Then hand pointers to subsections of this buffer to the routines that provide/calculate that part of the data. This is more efficient than copying the result of the routines into the buffer.
C++ at times copies memory without warning you. Initalizing objects with other objects (copy constructor) is a copy operation. Assignment of value types is a copy operation. Passing value parameters is a copy operation. In each case think twice if you need a copy, or if it is better to use a reference or pointer to the original data instead.
This part of the guide applies mainly to OOP, but it might be useful for other methodologies as well.
The idea of this optimizations is to call new() or malloc() as seldom as possible and to avoid calls to free() or delete() at all if possible. This assumes that those functions are slow or have other drawbacks.
If your OS or runtime system has very fast methods for allocating and deallocationg memory theese optimizations will probably gain no benefit, but most Java runtimes and C as well as C++ programs on most systems can benefit from theese optimizations.
Example: FreelistContainer classes often use nodes, which have pointers to nodes again, like this:
struct node {
node *next;
// ... payload data here ...
};
If the container grows and shrinks, nodes are allocated from the heap and handed back to the heap. This can be avoided by using a freelist:
class freelist {
private:
node * list;
public:
freelist() {
list = 0;
};
node * gimme_node() {
if(list) {
node *p = list;
list = list->next;
return p;
} else {
return new node();
}
};
void putback_node(node *nd) {
nd->next = list;
list = nd;
};
};
Instead of using new node() and delete() you use my_freelist.gimme_node() and my_freelist.putback_node(). This way you just implicitely call new if the freelist is empty. Calls to delete are optimized out completely.
The drawback is that a freelist conserves unused memory. If you need lots of nodes, but only for a very short time, a freeelist is not the right thing for you.
Test resultsI timed allocating 10.000.000 times a node and free'ing it directly after allocation. Compiled with egcs-2.91.66, measured on a Pentium III at 666MHz running NetBSD.
| Alloc type: | Time for 10.000.000 allocations: |
| Single node, using malloc() and free() | 9.10 seconds |
|
Single node, using gimme_node() and putback_node() |
0.29 seconds |
It might be unfair to compare allocations of single objects which are handed back to the runtime right after allocation and use. But I think the freelist implementation shows their benefits very clearly here, being 30 times faster than the malloc()/free() combo.
To measure your own system you can use my free freelist.c.
Example: Object cacheIf your program alloctes and frees lots of objects, caching might help. The idea is to allocate new objects if the cache is empty, but never to free them, but instead storing them in the cache to reuse them later on. This is a more generalized case of the freelist example.
#include
template class cache {
private:
stack the_cache;
public:
T* get() {
T* p;
if(stack.emtpy()) {
p = new T();
} else {
p = the_cache.pop();
}
};
void put(T* p) {
if(p) {
the_cache.push(p);
}
};
};
An object cache has the same benefits and drawbacks as a freelist.
Allcoating objects on the stack has a few benefits over allocating objects from the heap. The stack never fragmentates like the heap can, and the top of the stack is most often in the processor or second level cache and therefore pretty quick to access.
One drawback is that C++ does not allow polymorph object to be allocated on the stack, you can just allocate objects of a distinct subclass. Another drawback is that on some systems there seems to be a size limit for stack allocated objects.
Example: Stack allocated array
// simple example
void dummy() {
int values [1000];
};
Static variables, objects and arrays have known address offsets at compile time. This allows the compiler to generate more efficient access code than for heap or stack allocated objects. The benefit is small, and it enlarges the scope of the objects. Think twice before you allocate everything static, but in some cases the benefits might outweight the drawbacks.
Example: Static array
// simple example
static int values [1000];
Often a program does not always use 100% CPU time but only sometimes. To get a better performance in theese hot spots, you can try to move calculations to another time. There are two options: move calculations to be calculated beforehand or move calculation to be calculated later on. If you need the result right in he hot spot, the only choice is to move calculations to a earlier executed location, otherwise you can try top defer the actual calculation untilthe result is needed.
If there is no time to calculate the data now, but the result is needed now, the calculation can be moved to a earlier executed part of the program. Often it is moved in front of a loop, to program start or even before program start to a offline calculation which leaves its results on permanent storage to be read by the program at program start. A example was given in section "Moving calculations out of a loop".
Sometimes a calculation can be deffered, until the result is actually accessed. In this case accessing the result triggers the calculation.
Example: Lazy tableYou remember the example about storing the results of sin(x) in a table top avoid calling sin(x) every time? One of the proposed solutions was to calculate a table of sin(x) fro all x. But if you do not need all sin(x) fro all x you have spend time on calculations which were not needed at all. Lazy evaluations helps here. The idea works like this:
Fill the table with invalid, but checkable values on program start. Write a accesor function like this:
// This table is supposed to hold sin(x) in precision of 1 degree,
// each entry initialized to -10000.0 on program start
float table[360];
//
// ... program code ...
//
// table accessor, implements lazy evaluation scheme
float table_sin(int x) {
if(table[x] == -10000.0) {
table[x] = sin(x);
}
return table[x];
}
This way you can ensure that only sin(x) is calculated if it is actaully needed and at the latest possible moment. This combines the speed of table lookup for already calculated sin(x) with the benefits of only calculating actually needed values.
If the program is waiting for something, it can start calculating data that might be needed later on. This is the opposite of lazy evaluation. Instead of waiting until a value is actually needed you use the spare time to prepare data which you assume to be needed later. Extending the example of the sin(x) table the program could scan the table for not yet calculated entries while waiting for user input and calculationg the entries so that they are ready for use later on. It works well in combination with lazy evaluation.
At times, you cannot reduce the total number of calculations, but can get away with calculations that are faster to process. This are fairly low-level calculation that heavily depend on your hardware.
You need to know quite precisely which code your compiler produces from which calulation, but I try to give some general hints here:
| Calculation: | Possible replacements: |
| x*2 | x+x or x<<1 |
| x*n where n is 2^i, i in 1,2,3 ... | x<<i |
| x/n where n is 2^i, i in 1,2,3 ... | x>>i |
| Calculation: | Possible replacements: |
| x/constant |
If you can compute R=(1.0/constant) before compiling then it's faster to use (x*R) instead of (x/constant) |
But usually you don't need to bother with this transformations. Compilers like egcs or gcc are able to do this tranformations, and they usually do them really well.
Modern processor and mainboard architectures favor some types of memory access. I'll list the most important aspects here in this section.
The processor cache works on data lines of a certain size. Currently most processors have cache lines of 32 bytes. A cache line is read or written from/to memory in burst transfers. Although burst transfers are very efficient, if you only use 4 of the transferred 32 bytes, you have lost a lot of memory bandwidth. Try to read write data linearly to get most benefit from the burst memory access.
Test resultsI did some tests with a Pentium III at 666MHz running NetBSD:
| Linear byte read: | 125.000000 MB/s |
| Linear short read: | 166.666667 MB/s |
| Linear int read: | 200.000000 MB/s |
| Random byte read: | 4.672897 MB/s |
| Random short read: | 9.174312 MB/s |
| Random int read: | 17.857143 MB/s |
Try always to read or write data linearly. The drawbacks of scatter reads and writes are big. According to my measurements, linear reads are up to 25 times faster than scatter reads!
To measure your own system you can use my free memaccess.c. Make sure to compile it without optimization, to avoid biased results.
If you process data, try to access it in chunks of cache line size, because it's then possible to transfer it with fast burst transfers. Try to organize your data that in is read, processed and stored in cache line sized amounts or multiples thereof.
Most processors prefer to read certain data types from certain boundaries. I.e. a Pentium III reads 32 bit integers most efficiently from 32 bit aligned addresses and 64 bit floats from 64 bit aligned addresses. Usually the compiler will align data structures best for your processor architecture but if you code very low level you might consider reading the processors manual to learn how to align your data best.
The working set of the program is the data it is currently working on. Usually a program holds much data, but most of the time processes only subsets of the data.
CPU caches and second level caches try to hold the current working set of a program, to allow fast access to the currently processed data.
To get most benefit from caches you should try to arrange your program data that way, that the working set fits into the cache. Caches are often organized in lines or pages. Try to bundle data that is processed together in bundels of page or cache line size. Avoid holes of unprocessed data in your data, they waste cache space. The system can only separate blocks of page or cache line size, avoiding unprocessed data withing this quantities is the programmers task, the system can't work this out.
This rule is pretty imprecise, and the details depend much on CPU, 2nd level cache and your program. But I hope you get the basic idea how to make best use of the caches.
If some parts of the program work on data of the same type and size, but a strictly disjunct times, they can all store their data at the same location. Drawbacks are, that this is likely to introduce errors if the program is changed, so that the usage intervals start to overlap. Also initialisation and/or cleanup of shared areas need to be considered carefully.
C and C++ offer the data type union to allow to store data of different types at the same memory location. This is a expansion of the concept of shared use of the same location.
Shared use of the same memory location is potentially dangerous, but with good documentation of the intention and usage of shared memory areas the program should stay maintainable.
If you have a structured data type with fields, that are only used for a small number of allocated objects, those fields can be removed from the object and accessed with a lookup table. The key for the lookup is usually the objects address. Removing the fields will shrink the objects size, but the tradeoff is a slower access to the off-loaded fields.
Example: off-loaded description stringLet's assume, you have a structure like this:
struct example_t
{
char * description;
/* more fields here */
};
and description is used only for a small fraction of all created example_t structures, then you're wasting 4 bytes for most of them. You can save those 4 bytes at the expense of a slower lookup and a lookup table. Overall the savings will only pay off if the lookup table needs less memory than the saved memory (4 bytes * number of example_t without description).
The lookup could be done with a hashtable, that maps the object to the description string:
static Hashtable description_table;
void store_description(struct example_t *ex, char * description)
{
description_table.put(ex, description);
}
char * lookup_description(const struct example_t *ex)
{
return description_table.get(ex);
}
This way, you can save 4 bytes for each allocated example_t. If the sum of those savings outweights the size of the lookup table (i.e. because a huge number of example_t get allocated and only very few have description strings) then overall this construct will reduce the memory consumption of your program.
If you need to conserve memory you should use the smallest possible data type.
On my Linux system, using g++, I measured the following sizes:
| Type: | sizeof(type): |
| char | 1 byte |
| short | 2 bytes |
| int | 4 bytes |
| long | 4 bytes |
| void * | 4 bytes |
With 8 bits you can store a value range of 0..255 or -128..127 (assuming 2s complement). Often larger values are needed, and you are forced to use larger data types.
But sometimes values come in discrete amounts, i.e. in steps of 10. In that case you can still only store 255 discrete values in 8 bits but given the fact that the step from one value to the next is 10 you can encode a range of 0..2550 in 8 bits in that case.
Assuming variable b can hold 8 bits, unsigned:
Store a value: b = value / 10;
Retrieve a value: value = b * 10;
The division and multiplication will cause a noticeable slowdown but this scheme will help to save a few bytes of memory for each value. If the scale is a power of 2 you can use shift operations instead of division/multiplication, the shifts usually execute faster.
This scheme also works with offsets, i.e. to store a range of values from 1000 to 3550 with a step range of 10 you can use:
Assuming variable b can hold 8 bits, unsigned:
Store a value: b = (value - 1000) / 10;
Retrieve a value: value = b * 10 + 1000;
In generic form:
Store a value: b = (value - offset) / scale;
Retrieve a value: value = b * scale + offset;
This trick can be used to hold fractional values as well. I.e. assuming the value range is 0..1.0 and the step range is 0.01 then an offset of 0 and a scale of 0.01 will fit that nicely into 8 bits of data (7 actually - so you can still use one bit for something else).
Pentium class Processors are fast in reading 32 bit sized types and 8 bit sized types. 16 bit types are read using a 32 bit transfer and discarding 16 bits. This means reading two 16 bit values is slower than reading one 32 bit value on those processors. You can play tricks and read a pair of 16 bit values with one 32 bit access, and split them after reading. This needs additional calculation steps, but if memory bandwidth is a problem in your application, this might still be faster than two 16 bit reads.
On many systems malloc() and new() allocate a minimum of 8 bytes or even more. If you have objects smaller than 8 bytes or want to allocate memory of less than 8 bytes, the runtime still reserves 8 bytes of memory for this object.
If you need many of those objects, it's better to allocate a rather large block of memory, enough for many of those objects, and then allocate individual objects from this block. In C++ you can override the new() operator for this purpose.
Such a block is usually referred to as a 'pool'. Depending on your malloc() implementation, allocation from pools not only saves memory, but also might be faster. See below.
In one of my programs I used lots of objects of 4 bytes size. I used a linked list of memory blocks of 4096 bytes each, and allocated the small objects from those blocks. This way I could conserve 4 bytes for each allocated object.
Using a linked list allowed to allocate additional blocks if a block was used up, and it also allows to hand back empty blocks to the runtime system. Overall I found that with Linux, allocation from a block was about 30% fater than the native memory allcoation routine. I used bitmaps to determine free slots in a block and a linear search within the bitmap.
Surprisingly managing object allocation with this linked list of blocks turned out to be 30% faster than using malloc() and free(). (Measured with gcc 2.95.2, optimization level -O1, on Linux.)
Back to Simugraph
Back to the Simugraph library