Former D3D Program Manager on the Geometry Instancing Issue [Archive]

Demirug

2004-09-10, 22:28:28

From the DX developer mailing list:

I was the program manager for Direct3D at the time we created the instancing feature. I've moved on to a different role, but there's a short hiatus right now before my successor can step in here, and I can still give you and the others on this thread the benefit of my history with this issue.

You are dead-on with regard to the vertex stream frequency divider feature in d3d9 shader 3. Indeed, there is no new "instancing API" per se; instead, the behavior of the DrawIndexedPrimitive[UP] API is changed to incorporate instancing semantics according to the control information specified through the SetStreamSourceFreq API.
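To make the "control information" concrete, here is a minimal sketch of the two values an application would typically build for SetStreamSourceFreq. The two constants mirror D3DSTREAMSOURCE_INDEXEDDATA and D3DSTREAMSOURCE_INSTANCEDATA from d3d9types.h and are redefined locally so the snippet builds without the DirectX SDK; the helper names are hypothetical and no device is created.

```cpp
#include <cassert>
#include <cstdint>

// Local copies of the flag values from d3d9types.h (assumption: redefined
// here only so the sketch compiles without the DirectX SDK headers).
constexpr std::uint32_t kIndexedData  = 1u << 30; // D3DSTREAMSOURCE_INDEXEDDATA
constexpr std::uint32_t kInstanceData = 2u << 30; // D3DSTREAMSOURCE_INSTANCEDATA

// Hypothetical helper: frequency value for the shared geometry stream,
// asking for the geometry to be traversed instanceCount times.
std::uint32_t geometryStreamFreq(std::uint32_t instanceCount) {
    assert(instanceCount > 0 && instanceCount < kIndexedData);
    return kIndexedData | instanceCount;
}

// Hypothetical helper: frequency value for the per-instance stream,
// advancing one element every `divider` instances.
std::uint32_t instanceStreamFreq(std::uint32_t divider) {
    assert(divider > 0 && divider < kIndexedData);
    return kInstanceData | divider;
}
```

With a real IDirect3DDevice9, these values would presumably be passed as SetStreamSourceFreq(0, geometryStreamFreq(n)) and SetStreamSourceFreq(1, instanceStreamFreq(1)) before the DrawIndexedPrimitive call, and both streams reset to 1 afterwards.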

Emil and others down the thread are also correct that this does not tie directly into any other shader 3 hardware features. However, at the time we first proposed the stream frequency capability (going back to 2002), we received consistent feedback from hardware vendors that frequency divider and modulo vertex fetching controls were beyond the scope of any planned shader 2 parts. (I believe this is still true.) Since we had also received consistent feedback from developers that we needed to rein in the proliferation of caps and drive more standardization of features, we made this a required feature for shader 3.

More detail: instancing allows the application to factor highly repetitive scene elements into the common elements (instanced data), and individual variations (instancing data). A scene graph architecture (where this technique originally comes from) would create a node or segment representing the instanced data and repeatedly invoke it with different state values, modeling transforms, etc. In an immediate mode (with vertex buffers) architecture, we can achieve similar factoring by putting the indexed data in one vertex buffer and the varying data in a second vb. The hardware ideally has to support the ability to traverse the second vb at a reduced rate (the divided frequency) compared with the first vb - this allows one block of instancing data to apply to the complete traversal of the instanced data. And there also should be some mechanism to repeatedly traverse the contents of the instanced vb (like a modulo applied to the vb index).
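The divider and modulo described above can be modelled on the CPU as simple index arithmetic. The following is only an illustrative sketch, not a description of any hardware fetch unit; the names FetchIndices and fetchFor are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Which element of each vertex buffer is read for one fetched vertex.
struct FetchIndices {
    std::size_t geometry; // element of the shared (instanced) vb
    std::size_t instance; // element of the per-instance (instancing) vb
};

// i is the running vertex counter across the whole draw;
// vertsPerInstance is the length of the shared geometry.
FetchIndices fetchFor(std::size_t i, std::size_t vertsPerInstance) {
    assert(vertsPerInstance > 0);
    return {
        i % vertsPerInstance, // modulo: restart the shared geometry
        i / vertsPerInstance  // divider: step per-instance data slowly
    };
}
```

For a three-vertex mesh, vertex counter 7 reads shared element 1 and per-instance element 2: the shared buffer wraps around while the per-instance buffer advances once per complete traversal.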

As originally shipped in d3d9 "gold", the stream frequency controls only included divider. We got very strong feedback after d3d9 shipped that instancing would be much more interesting with full modulo support, and that support for index buffers was so much more important that if necessary we could drop support for the non-indexed case. We proposed these changes for the 9.0c runtime update and worked closely with a number of software and hardware vendors to validate that this feature would meet their requirements and could be implemented on the coming shader 3 parts. One additional ask we incorporated was to enable some downlevel shader programs to take advantage of instancing on hardware that supported the stream frequency controls. But the feature was *always* specified (going all the way back to d3d9 "gold" in December 2002) as a shader 3 feature because of the expected hardware implementation.

The nub of the current issue is that some vendors have found an interesting and reasonably performant way to implement the *effect* of the modulo capability, without requiring native support for modulo in the vertex fetch engine. One naïve approach would be for the driver to simply replicate the instanced data and create a much larger vb that can be traversed one time end to end. No doubt there are other techniques that might rely on more hardware support; we don't know for sure what any particular driver does, nor whether their techniques are generally applicable with similar performance benefits across other vendors' architectures.
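The naïve replication idea can be sketched as follows. This is purely an illustration of the approach named above and makes no claim about what any actual driver does; Vertex and replicateMesh are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vertex { float x, y, z; };

// Copy the shared geometry instanceCount times into one large buffer.
// The result can be read linearly from start to end, so no modulo is
// needed on the geometry stream; only the per-instance stream still
// needs a divider.
std::vector<Vertex> replicateMesh(const std::vector<Vertex>& mesh,
                                  std::size_t instanceCount) {
    std::vector<Vertex> out;
    out.reserve(mesh.size() * instanceCount);
    for (std::size_t inst = 0; inst < instanceCount; ++inst) {
        out.insert(out.end(), mesh.begin(), mesh.end());
    }
    assert(out.size() == mesh.size() * instanceCount);
    return out;
}
```

The cost is visible in the size of the result: memory and copy time grow with the instance count, which is the kind of hidden tradeoff the message goes on to discuss.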

That said, these techniques do offer significant performance improvements on hardware that lacks the modulo capability. They may not meet the hardware intent of the feature, but given the perf benefits they do raise the question: could we have provided a shader 2 cap to expose such an implementation on older hardware/drivers? We took a hard look at this, but there were a number of factors that made this extremely difficult:

+ Shader 2 was well over a year "old" in terms of shipping implementations, and over 2 years old in terms of discussions with hardware vendors; ex post facto changes like this are not undertaken lightly. We *did* add some support to shader 2 in 9.0c (e.g., centroid sampling support), but this had very broad hardware support lined up well before (~1 year) 9.0c was to ship.

+ In marked contrast, we did not understand until very late in the 9.0c cycle that hardware and software vendors who had reviewed the spec had issues with the shader 3 limitation, and/or had alternate implementations on shader 2 parts to bring to the table. By the time this became clear, we had no time to review this proposal with other vendors.

+ We also had no beta cycles left to get customer exposure to and feedback on the proposed change. And worse: we had integrated 9.0c into XP SP2, and XP SP2 was essentially locked down, so taking a change would have also required creating a future runtime update 9.0d to install on XP SP2.

+ The DX graphics and DRG teams have received consistent feedback over the past several years that driver implementations of features intended for hardware tend to create hidden hw- or driver-specific resource or performance tradeoffs that confuse developers. In cases where there are ways for developers to implement the feature (and control the tradeoffs) with application-side work, we try to encourage those techniques and discourage driver implementations.

+ In this case, I believe the performance benefits of the driver implementations I have seen are also generally achievable by application techniques. The SDK team has a sample program that compares several application instancing techniques; I understand it should be available in some form (perhaps a web SDK update) in October. The team has provided an early version of this to a number of hw and sw vendors for review and has not received any negative feedback, apart from the general and obvious downside that it necessarily requires more work from the application developer.

I hope this helps. As Richard Huddy suggests on another branch of the thread, this is a complicated issue. If you want to keep it simple, my recommendation would be to take a look at application techniques, and particularly the sample program when it is published next month. (If these approaches don't work for you, we'd like to know about it!) If not, please bear in mind these "caveat developer"'s:

1) Naïve or overt efforts to expose the frequency divider API on shader 2 hardware will fail WHQL, so you'll have to live with whatever circumlocutions are required to covertly enable the feature.

2) The debug runtime will simply fail the frequency divider API on shader 2 hardware.

3) There is no telling what other hw vendors will do or when; you may need application-side code after all.

4) Bottom line: for better or worse, it's not part of the shader 2 spec, it's against the letter and spirit of the frequency divider spec, it's not supported d3d behavior.

Keep us posted on what does and doesn't work for you, here and elsewhere!

Steve Wright

It's pretty clear who messed up more than once here.

Benedikt

2004-09-10, 23:02:41

<snip>
The nub of the current issue is that some vendors have found an interesting and reasonably performant way to implement the *effect* of the modulo capability, without requiring native support for modulo in the vertex fetch engine. One naïve approach would be for the driver to simply replicate the instanced data and create a much larger vb that can be traversed one time end to end. No doubt there are other techniques that might rely on more hardware support; we don't know for sure what any particular driver does, nor whether their techniques are generally applicable with similar performance benefits across other vendors' architectures. <...>

Well, who might that be? :)

Regards,
B. W.

Coda

2004-09-10, 23:47:01

If you mean ATi: I don't think they do that; they are probably just bypassing DirectX's call overhead.

Guest

2004-09-11, 10:26:35

No reason to get worked up. The text clearly refers to ATI, but in no way negatively.
Evidently ATI has really good driver programmers who can implement such a performant form of GI (geometry instancing) even though the hardware was not designed for it. I take my hat off to that, because it shows real know-how.