Adobe has a new name for an old friend, its called “Ahead of Time (AOT) compilation”.

from http://www.adobe.com/devnet/logged_in/abansod_iphone.html

Wow, to think I have been compiling code for the last 20 years or so and I instead could have been “ahead of time” compiling it. 20 years wasted…..

mtm


Filed under: Uncategorized — admin @ February 18, 2010 6:15 pm

And while I am learning ObjectiveC/ObjectiveC++….

This guy eloquently enumerates some reasons I like C++.
Although I do imagine once I get a C++0x compiler, I will like it the most..
D has a lot of promise also…..


Filed under: Uncategorized — admin @ February 15, 2010 10:44 am

Currently learning ObjectiveC/ObjectiveC++

My new employer bought me a shiny new 17″ MacBook Pro.
Currently developing an iPad application.
Needless to say, my biggest obstacles are ObjectiveC/ObjectiveC++.
I am guessing the best way to learn would be re-implementing Orkid’s C++ static introspection/reflection system in terms of ObjectiveC++ to see what ObjectiveC++ actually buys me in dynamic features.
Shame It wont be portable to MSVC++.


Filed under: Uncategorized — Tags: — admin @ February 13, 2010 4:32 pm

Thoughts on multithreaded pipeline game engine architecture

Introduction

Today I am going to share my current thoughts on a scene design supporting concurrency evolved from experiences on the last game I worked on. My previous design had at least some parallelism in that it had separate threads for update and render.

New is my formal realization that this previous method was in fact a 2 stage pipeline and that more pipeline stages will promote more task and data parallelism at the cost of more latency. It mimics a CPU/GPU multi staged pipeline where all stages execute concurrently. Every stage of this pipeline produces a result which is usable in the next stage. These results must be multi-buffered for threaded access (similar to a double buffered GPU commandlist). Only one stage can touch a result buffer at a time. When the previous stage is done writing to a result buffer the next stage may begin reading and acting upon it.

This model attempts to combine traits of both the “Synchronous function parallel model” and “Data parallel model” as described in Multithreaded Game Engine Architectures( Ville Mönkkönen: gamasutra.com-sep-2006). Our model differs from the Synchronous function parallel model in that it executes input and control concurrently with physics and animation, with each stage operating on data sets from different frames. There is also a pipeline model described in Multi Threading Models (GeorgiaTech). Under “Disadvantages”, it states “limits degree of parallelism, throughput driven by slowest stage”. Our new pipeline model described here at least attempts to address that disadvantage by opening up the possibility for per stage data-parallelism.

Review current system

Lets illustrate how our game entities are structured. This will aid in explaining some of the design decisions we will have to make later. Here is some c++ pseudo-code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
////////////////////////////////////////////////////////////////
// copyright 2010, Michael T. Mayers : michael@tweakoz.com
////////////////////////////////////////////////////////////////

typedef TableString ComponentFamily; // TableString is a flyweighted string that sorts very quickly

//ork::Object is a class that implements editing, serialization and rtti via a c++ reflection mechanism

////////////////////////////////////////////////////////////////
// Editable Entity Component Data Abstract Base Class
// Examples include: ModelComponentData,
//   BulletPhysicsObjectComponentData, ParticleControllerComponentData, etc..
////////////////////////////////////////////////////////////////
class EntityComponentData : public ork::Object
{
public:
    virtual EntityComponentInst* CreateComponentInst() const = 0;
    const ComponentFamily& GetFamily() const = 0;
private:
    const EntityData* mpEntData;
};
////////////////////////////////////////////////////////////////
// Runtime Instantiated(Mutable) Entity Component Instance Abstract Base Class
// Examples include: ModelComponentInst,
//   BulletPhysicsObjectComponentInst, ParticleControllerComponentInst, etc..
////////////////////////////////////////////////////////////////
class EntityComponentInst : public ork::Object
{
public:
    EntityComponentInst( const EntityComponentData& edata );
    //void Update( SceneInst* psi ); // TODO: old method, needs formalized multibuffer data
private:
    //virtual void DoUpdate( SceneInst* psi ) = 0; // TODO: old method, needs formalized multibuffer data
    const EntityComponentData& mData;
};
////////////////////////////////////////////////////////////////
// Editable Scene Component Data Abstract Base Class
// Scene Components are like Entity Components,
//  with the added restriction that there is only one per type per scene.
// Examples include: BulletPhysicsManagerComponentData, AudioManagerComponentData, etc..
////////////////////////////////////////////////////////////////
class SceneComponentData : public ork::Object
{
};
////////////////////////////////////////////////////////////////
// Runtime Instantiated(Mutable) Scene Component Instance Abstract Base Class
// Examples include: BulletPhysicsManagerComponentInst, AudioManagerComponentInst, etc..
////////////////////////////////////////////////////////////////
class SceneComponentInst : public ork::Object
{
public:
    SceneComponentInst( const SceneComponentData& scd );
    //void Update( SceneInst* psi ); // TODO: old method, needs formalized multibuffer data
private:
    //virtual void DoUpdate( SceneInst* psi ) = 0; // TODO: old method, needs formalized multibuffer data
    const SceneComponentData& mData;
};
////////////////////////////////////////////////////////////////
// Scene Object Base
////////////////////////////////////////////////////////////////
class SceneObject : public ork::Object
{
};
////////////////////////////////////////////////////////////////
// an Archetype is a data-driven composite "builder" design pattern
////////////////////////////////////////////////////////////////
class Archetype : public SceneObject
{
public:
    void ComposeEntity( Entity* pent );                 // compose an entity, create its component instances
    void DeComposeEntity( Entity* pent );               // decompose an entity, destroy its component instances
    void LinkEntity( SceneInst* psi, Entity* pent );    //  link this entity's components with other entities/components in the scene
    void UnLinkEntity( Entity* pent );                  // unlink ......
private:
    virtual void DoComposeEntity( Entity* pent ) = 0;
    virtual void DoDeComposeEntity( Entity* pent ) = 0;
    virtual void DoLinkEntity( SceneInst* psi, Entity* pent ) = 0;
    virtual void DoUnLinkEntity( SceneInst* psi, Entity* pent ) = 0;

   orklut < ComponentFamily,EntityComponentData* > mComponentDatas;

};
////////////////////////////////////////////////////////////////
// an EntData is a spawn record for an entity, with
// a spawn location and any spawn modifiers for
// the entity
////////////////////////////////////////////////////////////////
class EntData : public SceneObject
{
private:
    Archetype* mpArchetype;    // what kind of entity are we spawning ?
    SpawnInfo  mSpawnInfo;     // spawn location, spawn modifiers, etc...
};
////////////////////////////////////////////////////////////////
// Runtime Entity (Instantiated, mutable EntData)
////////////////////////////////////////////////////////////////
class Entity
{
public:
    Entity( const EntData& ed );
private:
    const EntData& mEntData;
    orklut < ComponentFamily, EntityComponentInst* > mComponents;
};
////////////////////////////////////////////////////////////////
// root container of all data driven scene objects
// source of a SceneInstance
////////////////////////////////////////////////////////////////
class SceneData : public ork::Object
{
private:
    orklut < String, SceneObject* >                 mSceneObjects;
    orklut < ComponentFamily, SceneComponentData* > mSceneComponentDatas;
};
////////////////////////////////////////////////////////////////
// collection of Runtime-Mutable Entity Components
////////////////////////////////////////////////////////////////
class EntityComponentInstGroup
{
private:
    orklist< EntityComponentInst* > mEntityComponents;
};
////////////////////////////////////////////////////////////////
// collection of Runtime-Mutable Scene Components
////////////////////////////////////////////////////////////////
class SceneComponentInstGroup
{
private:
    orklist< SceneComponentInst* > mSceneComponents;
};
////////////////////////////////////////////////////////////////
// Runtime-Mutable Scene Instance
////////////////////////////////////////////////////////////////
class SceneInst
{
public:
    void Update();
    SceneInst( const SceneData& sd );
private:
    void ProcessSpawnDespawnQueues();
    const SceneData& mSceneData;
    orklut< ComponentFamily, EntityComponentInstGroup* > mActiveEntityComponents;
    orklut< ComponentFamily, SceneComponentInstGroup* > mActiveSceneComponents;
    orkqueue< Entity* > mEntitySpawnQueue;
    orkqueue< Entity* > mEntityDeSpawnQueue;
    void ActivateEntity( Entity* pent );
    void DeactivateEntity( Entity* pent );
};

What Next?

From the above code we can see that our game entities (class Entity) are composites made from a collection of EntityComponentInst objects, with one EntityComponentInst per family. A family is just a grouping of components that have similar functions and are executed together as one. Examples of a family would be “input”, “control”, “camera”, “physics”, “animation”, “particles”, “audio”, etc… A given family may have many component types (classes) assigned to it, for example the “control” family may include player control components and AI control components among others. The scene maintains a list of active components per family and families are usually updated in a specific order. Originally components were executed in family order to improve instruction cache behavior. This will also give us some properties beneficial to parallelism.

Take a moment to look at the pictures below to get an idea of where we are headed. It is not import yet to fully grok what they are portraying.

To map this entity/component/family architecture to the pipeline model, we will constrain certain families to certain stages. On a given stage, components assigned to that stage can optionally be updated in family order. As an example, if the “input” component family requires updating before the “control” component family, then all “input” components can optionally be guaranteed to finish before any “control” components. To accomplish this we will use a family barrier. This family barrier will become important when we use inner-stage data parallelism. For now, realize that not all families will require this and since it will block until all threads for a given family have finished, we will only use it when absolutely necessary. Family barriers will never be necessary when a component in a later stage depends upon data calculated in a component assigned to an earlier stage.

Weird issues

If any given component needs to mutate the state of another component, it will likely require the use of a queued message. If some conditions are met then this message will be dequeued and processed the same frame, otherwise it will be dequeued and processed on the next frame. Conditions that allow same frame processing are:

  • If the receiving component’s family is assigned to a later stage than the sending component’s family.
  • Both components are assigned to the same stage, the receiving component’s family sorts after the sender’s family, and there is a family barrier between the two component types.
  • Both components are assigned to the same stage, and that stage is currently single-threaded for that frame.

Symmetrically, if a component needs to query the state of another component, some constraints need to be put in place. If certain conditions are met, then you can query a component’s post update state this frame. Otherwise you will get that component’s post update state from the previous frame. Conditions that enable same-frame querying are:

  • If the querying component’s family is assigned to a later stage than the queried component’s family.
  • Both components are assigned to the same stage, the querying component’s family sorts after the queried component’s family, and there is a family barrier between the two component types.
  • Both components are assigned to the same stage and that stage is currently single-threaded for that frame.

These two methods are necessary to guarantee consistency when dealing with data parallelism. They promote more data parallelism within the pipeline stage by hiding inner-stage data dependencies and preventing pipeline data hazard stalls. One technique you can use to help deal with the strange (but consistent) delivery timing of messages is: In a component’s update method, always process queued messages before doing anything else. This will at least enforce the correct ordering of state mutation.

Another method that will likely prove useful is the dual use of predictive and reactive distribution of work. Pipeline stages will attempt to predict how many data-parallel subtasks to split their workload into. Execution timing history may be kept to aid in the prediction process.

Latency

There is a 1 frame latency hit per pipeline stage. With 4 stages @ 60hz this equates to 66.6ms (4/60ths or 1/15th of a second) total latency – not including device driver command buffering or display latency. This seems reasonable to me (depending on the game of course). For some reference on latency, see this article. Obviously, at 120hz+ the latency gets better.

A picture is worth a thousand words

Here is a block diagram:




Notes:

  • Only 1 job each may be executed concurrently for stages 1, 2 and 4. For stages 1 and 2, jobs may be split into subtasks for data parallel execution. Stage 4 communicates with hardware and should have no concurrency. If you are using Direct3D 11 and want multithreaded deferred contexts then they would execute as subtasks on stage 3 (AV preprocessing).
  • This will support particles emitted from or influenced by post physics/skeletal animated data if the “particle” family sorts after the “physics” and “animation” families.
  • There is a “Frame Persistant Data Boundary” that marks the boundary between data retained for further updating and data that dead ends at the AV renderers.

Thread Pools

I was thinking a Thread Pool would be good for servicing the pipeline stages. This is vaguely similar to the pattern I use in my mulithreaded dataflow graphs. The primary difference being the dataflow graphs have a lot more unhidden data dependencies than this pipeline model. The dataflow graph scheduler does a lot of dependency tracking. The graphs have static topology and are for the most part statically scheduled. The dataflow scheduler tries to schedule modules in parallel, but the data dependencies fight it. Multiple dataflow graphs have more freedom to execute concurrently. In contrast, the jobs processed by the pipeline model’s stages will vary more in number and will have minimal data dependencies that would benefit from a more dynamic scheduler.

Visualize the timing

Here is an example timing diagram: (estimated)


Other desired properties

It would be nice if we could maintain a few properties of the last system I worked on:

  • ZERO heap memory allocations per frame. Use fine-grain-locked object pools and objects on the stack instead of the heap.
  • Object data memory locality. Pools can be contiguous, strategic object placement within the pools can help data cache locality.
  • Instruction cache locality. The old system updated entire families in whole and in sequence via per family lists. Since the update was single threaded and like objects with like v-tables were updated in sequence, it had decent instruction cache locality. When multiple threads are involved (in data parallel mode) I could see this property degenerating if not addressed. Perhaps by aggregating components into pairs or triples (keeping an eye on instruction cache usage and keeping the overall loop iteration code footprint down), we can get the best of instruction cache locality and data parallelism. Fortunately newer multicore CPU’s tend to have separate L1/L2 instruction caches per core which we can utilize.

References

Here is some reference material: (though I do not see much mention of the generalized “pipeline” model)
Multithreaded Game Engine Architectures( Ville Mönkkönen: gamasutra.com-sep-2006)
Multithreaded game design – help(gamedev.net-feb-2008)
Multithreading Models(GeorgiaTech)
Concurrency Patterns (WikiPedia>


Filed under: Game Engines — admin @ December 31, 2009 2:23 am

Whew

I am in the process of moving my portfolio site over to WordPress. I expect it will take a while, but I think it will be worth it. For now, the site will be mixed until I can get it all moved over.


Filed under: Uncategorized — admin @ October 12, 2009 12:35 am