We all know that SM5 brought the possibility to scatter stuff in pixel shaders, too (not only compute shaders). I can only see that I can use Interlocked*() instructions in both PS and CS. DeviceMemoryBarrier() seems to work in both PS and CS, and it seems to be the only barrier instruction usable in PS. My question now is whether it's principally impossible to take advantage of the group shared memory in PS too. I don't see the API for that, and maybe that makes sense. Also, I see virtually nobody discussing using the atomic instructions outside compute shaders and wonder why.

I did implement various parallel algorithms in OpenCL, so although I might seem a little confused now, I'm very much aware which memory is which and what it's good for in GPGPU via CUDA/OpenCL/DX11 CS. In GL4.2 I noticed they released the GL_ARB_shader_image_load_store extension, which obviously supports the same stuff, but still nothing for the scarce but fast shared memory manipulation. I see some OIT and Bokehs around which use Append Buffers. But I have a scenario where I need to rasterise normal geometry with a lot of textures and where I might benefit from being able to reduce a lot of info from pixel shaders using atomic operations on global (device) buffers, instead of writing out shitloads of texture data and reducing it in parallel afterwards. I'm not going to elaborate on my scenario further; I just state that I'll need to analyse what has been rasterised. I don't know how the performance will suffer if all units (fragments) try to write to the same memory location using InterlockedMax() or similar.

Any thoughts on pixel shaders (not compute shaders!) and shared and atomic stuff in DX11?

---

Building on what the others have said, there is no access to the group shared memory in pixel shaders. If you consider for a moment how it is used in compute shaders, I think it will be clear why. In the compute shader, you specify how large the thread groups are that you will be working with, and how many of them will be executed in a particular dispatch. Part of your thread group size declaration is the declaration of how much shared memory it will be using. This gives very fine control over how many threads will be needing to access the memory, and you can design your algorithm very precisely to coordinate access to it.

In a pixel shader, on the other hand, there is currently no method or concept of a thread group. Instead, it is up to the vendors to determine the optimal split size to be used when rasterizing a primitive, and then it is done more or less behind the scenes. This makes it impossible for a developer to write a shader that will have a coherent access strategy to the shared memory.

Who knows what will be coming in the next versions of D3D, but this seems like a logical extension of the possibilities. People have been talking about programmable rasterization for a while too, so perhaps sometime down the road there could be selectable group sizes for rasterization. That is just pure speculation though - I would be happy with a programmable rasterizer, but I don't know if one would ever come around and/or be useful.
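For reference, the coupling between thread-group size and shared memory looks like this in HLSL: the `[numthreads(...)]` attribute fixes the group size, and the `groupshared` declaration sizes the memory the group will use. This is a minimal sketch (buffer names and the reduction itself are illustrative, not from the thread):

```hlsl
// Illustrative compute shader: a 256-thread group reduces 256 input
// values to a single per-group maximum using groupshared memory.
#define GROUP_SIZE 256

StructuredBuffer<float>   gInput  : register(t0); // hypothetical input buffer
RWStructuredBuffer<float> gOutput : register(u0); // one result per group

// The shared-memory declaration is sized against the declared group size,
// which is what lets the driver reserve it per group up front.
groupshared float sharedData[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID,
            uint3 gid  : SV_GroupID,
            uint3 dtid : SV_DispatchThreadID)
{
    sharedData[gtid.x] = gInput[dtid.x];
    GroupMemoryBarrierWithGroupSync(); // legal in CS, unavailable in PS

    // Tree reduction within the group, synchronized at each step.
    for (uint stride = GROUP_SIZE / 2; stride > 0; stride >>= 1)
    {
        if (gtid.x < stride)
            sharedData[gtid.x] = max(sharedData[gtid.x],
                                     sharedData[gtid.x + stride]);
        GroupMemoryBarrierWithGroupSync();
    }

    if (gtid.x == 0)
        gOutput[gid.x] = sharedData[0];
}
```

A pixel shader has no equivalent of `[numthreads]` or `GroupMemoryBarrierWithGroupSync()`, which is exactly why `groupshared` cannot be exposed there.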
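The kind of pixel-shader reduction the question describes can still be sketched with a global UAV bound alongside the render target via `OMSetRenderTargetsAndUnorderedAccessViews` (buffer and texture names here are hypothetical). Note that `InterlockedMax` operates on integer data, so a float quantity has to be quantized first:

```hlsl
// Illustrative SM5 pixel shader: per-fragment atomic reduction into a
// global (device-memory) UAV instead of writing texture data out for a
// later reduction pass. When UAVs are bound together with render targets,
// their slots start after the render-target count, hence u1 here.
RWStructuredBuffer<uint> gMaxLuma : register(u1); // hypothetical result buffer

Texture2D    gAlbedo  : register(t0);
SamplerState gSampler : register(s0);

float4 PSMain(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    float4 color = gAlbedo.Sample(gSampler, uv);

    // Integer atomics only: quantize luminance to a uint before the max.
    float luma      = dot(color.rgb, float3(0.299, 0.587, 0.114));
    uint  quantized = (uint)(saturate(luma) * 65535.0);

    // Every rasterized fragment contends for the same location; this
    // contention is the performance concern raised in the question.
    InterlockedMax(gMaxLuma[0], quantized);

    return color;
}
```

How badly the single-location contention hurts is hardware-dependent; a common mitigation is to reduce into a small array of slots (e.g. indexed by `SV_Position` bits) and combine them in a cheap follow-up pass.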