Слайд 1Common C++ Performance Mistakes in Games
Pete Isensee
Xbox Advanced Technology Group
Слайд 2About the Data
ATG reviews code to find bottlenecks and make perf
recommendations
50 titles per year
96% use C++
1 in 3 use “advanced” features like templates or generics
Слайд 3Why This Talk Is Important
The majority of Xbox games are CPU
bound
The CPU bottleneck is often a language or C++ library issue
These issues are not usually specific to the platform
Слайд 4Format
Definition of the problem
Examples
Recommendation
For reference
A frame is 17 or 33 ms
(60fps / 30fps)
Bottlenecks given in ms per frame
Слайд 5Issue: STL
Game using std::list
Adding ~20,000 objects every frame
Rebuilding the list every
frame
Time spent: 6.5 ms/frame!
~156K overhead (2 pointers per node)
Objects spread all over the heap
Слайд 6std::set and map
Many games use set/map as sorted lists
Inserts are slow
(log(N))
Memory overhead: 3 ptrs + color
Worst case in game: 3.8 ms/frame
Слайд 7std::vector
Hundreds of push_back()s per frame
VS7.1 expands vector by 50%
Question: How many
reallocations for 100 push_back()s?
Answer: 13! (1,2,3,4,5,7,10,14,20,29,43,64,95)
Слайд 9Use the Right Tool for the Job
The STL is powerful, but
it’s not free
Filling any container is expensive
Be aware of container overhead
Be aware of heap fragmentation and cache coherency
Prefer vector, vector::reserve()
Слайд 10The STL is Evil, Sometimes
The STL doesn’t solve every problem
The STL
solves some problems poorly
Sometimes good old C-arrays are the perfect container
Mike Abrash puts it well:
“The best optimizer is between your ears”
Слайд 11Issue: NIH Syndrome
Example: Custom binary tree
Sorted list of transparent objects
Badly unbalanced
1
ms/frame to add only 400 items
Example: Custom dynamic array class
Poorer performance than std::vector
Fewer features
Слайд 12Optimizations that Aren’t
void appMemcpy( void* d, const void* s, size_t b
)
{
// lots of assembly code here ...
}
appMemcpy( pDest, pSrc, 100 ); // bottleneck
appMemcpy was slower than memcpy for anything under 64K
Слайд 13Invent Only What You Need
std::set/map more efficient than the custom tree
by 10X
Tested and proven
Still high overhead
An even better solution
Unsorted vector or array
Sort once
20X improvement
Слайд 14Profile
Run your profiler
Rinse. Repeat.
Prove the improvement.
Don’t rewrite the C runtime or
STL just because you can. There are more interesting places to spend your time.
Слайд 15Issue: Tool Knowledge
If you’re a programmer, you use C/C++ every day
C++
is complex
CRT and STL libraries are complex
The complexities matter
Sometimes they really matter
Слайд 16vector::clear
Game reused global vector in frame loop
clear() called every frame to
empty the vector
C++ Standard
clear() erases all elements (size() goes to 0)
No mention of what happens to vector capacity
On VS7.1/Dinkumware, frees the memory
Every frame reallocated memory
Слайд 17Zero-Initialization
struct Array { int x[1000]; };
struct Container {
Array arr;
Container() : arr() { }
};
Container x; // bottleneck
Costing 3.5 ms/frame
Removing : arr() speeds this by 20X
Слайд 18Know Thine Holy Standard
Use resize(0) to reduce container size without affecting
capacity
T() means zero-initialize PODs. Don’t use T() unless you mean it.
Get a copy of the C++ Standard. Really.
www.techstreet.com; search on 14882
Only $18 for the PDF
Слайд 19Issue: C Runtime
void BuildScore( char* s, int n )
{
if(
n > 0 )
sprintf( s, “%d”, n );
else
sprintf( s, “” );
}
n was often zero
sprintf was a hotspot
Слайд 20qsort
Sorting is important in games
qsort is not an ideal sorting function
No
type safety
Comparison function call overhead
No opportunity for compiler inlining
There are faster options
Слайд 22Understand Your Options
itoa() can replace sprintf( s, “%d”, n )
*s
= ‘\0’ can replace sprintf( s, “” )
std::sort can replace qsort
Type safe
Comparison can be inlined
Other sorting options can be even faster: partial_sort, partition
Слайд 23Issue: Function Calls
50,000-100,000 calls/frame is normal
At 60Hz, Xbox has 12.2M cycles/frame
Function
call/return averages 20 cycles
A game calling 61,000 functions/frame spends 10% CPU (1.7 ms/frame) in function call overhead
Слайд 24Extreme Function-ality
120,000 functions/frame
140,000 functions/frame
130,000 calls to a single function/frame (ColumnVec::operator[])
And the
winner:
340,000 calls per frame!
9 ms/frame of call overhead
Слайд 25Beware Elegance
Elegance → levels of indirection → more functions → perf
impact
Use algorithmic solutions first
One pass through the world
Better object rejection
Do AI/physics/networking less often than once/frame
Слайд 26Inline Judiciously
Remember: inline is a suggestion
Try “inline any suitable” compiler option
15
to 20 fps
68,000 calls down to 47,000
Try __forceinline or similar keyword
Adding to 5 funcs shaved 1.5 ms/frame
Don’t over-inline
Слайд 27Issue: for loops
// Example 1: Copy indices to push buffer
for( DWORD
i = 0; i < dwIndexCnt; ++i )
*pPushBuffer++ = arrIndices[ i ];
// Example 2: Initialize vector array
for( DWORD i = 0; i < dwMax; ++i )
mVectorArr[i] = XGVECTOR4(0,0,0,0);
// Example 3: Process items in world
for( itr i = c.begin(); i < c.end(); ++i )
Process( *i );
Слайд 28Watch Out For For
Never copy/clear a POD with a for loop
std::algorithms
are optimized; use them
memcpy( pPushBuffer, arrIndices,
dwIndexCnt * sizeof(DWORD) );
memset( mVectorArr, 0, dwMax * sizeof(XGVECTOR4) );
for_each( c.begin(), c.end(), Process );
Слайд 29Issue: Exception Handling
Most games never throw
Most games never catch
Yet, most games
enable EH
EH adds code to do stack unwinding
A little bit of overhead to a lot of code
10% size increase is common
2 ms/frame in worst case
Слайд 30Disable Exception Handling
Don’t throw or catch exceptions
Turn off the C++ EH
compiler option
For Dinkumware STL
Define “_HAS_EXCEPTIONS=0”
Write empty _Throw and _Raise_handler; see stdthrow.cpp and raisehan.cpp in crt folder
Add #pragma warning(disable: 4530)
Слайд 31Issue: Strings
Programmers love strings
Love hurts
~7000 calls to stricmp in frame loop
1.5
ms/frame
Binary search of a string table
2 ms/frame
Слайд 32Avoid strings
String comparisons don’t belong in the frame loop
Put strings in
an table and compare indices
At least optimize the comparison
Compare pointers only
Prefer strcmp to stricmp
Слайд 33Issue: Memory Allocation
Memory overhead
Xbox granularity/overhead is 16/16 bytes
Overhead alone is often
1+ MB
Too many allocations
Games commonly do thousands of allocations per frame
Cost: 1-5 ms/frame
Слайд 34Hidden Allocations
push_back(), insert() and friends typically allocate memory
String constructors allocate
Init-style calls
often allocate
Temporary objects, particularly string constants that convert to string objects
Слайд 35Minimize Per-Frame Allocations
Use memory-friendly data structures, e.g. arrays, vectors
Reserve memory in
advance
Use custom allocators
Pool same-size allocations in a single block of memory to avoid overhead
Use the explicit keyword to avoid hidden temporaries
Avoid strings
Слайд 36Other Tidbits
Compiler settings: experiment
dynamic_cast: just say no
Constructors: performance killers
Unused static array
space: track this
Loop unrolling: huge wins, sometimes
Suspicious comments: watch out
“Immensely slow matrix multiplication”
Слайд 37Wrap Up
Use the Right Tool for the Job
The STL is Evil,
Sometimes
Invent Only What You Need
Profile
Know Thine Holy Standard
Understand Your Options
Beware Elegance
Inline Judiciously
Watch Out For For
Disable Exception Handling
Avoid Strings
Minimize Per-frame Allocations
Слайд 38Call to Action: Evolve!
Pass the rubber chicken
Share your C++ performance mistakes
with your team
Mentor junior programmers
So they only make new mistakes
Don’t stop learning
You can never know enough C++
Слайд 39Questions
Fill out your feedback forms
Email: pkisensee@msn.com
This presentation: www.tantalon.com/pete.htm