There are a lot of challenges in developing a 2D game that appears 3D on screen.

Recently I added multi-floor buildings, which transitioned the game into 3D space (whereas before, flat terrain and single-story buildings meant the Z (height) position of everything was 0). Some of the game's functionality relies on looking up which pixel (object) is under the mouse. This was relatively easy before buildings had height: I looked up all the objects in a cone moving toward the camera (so downward in screen space) and determined which one was "closest" to the camera by sampling each sprite image and comparing the world.y position.
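The old pass can be sketched roughly like this (my own names and structure, not the game's actual code): among the candidate objects whose sprite is opaque under the cursor, the one with the greatest world.y wins, since a larger y means further down the map and therefore closer to the camera in this projection.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical candidate produced by the cone query: an index into the
// game's object vector, the object's world-space Y, and whether sampling
// the sprite found an opaque pixel under the cursor.
struct Candidate {
    std::size_t id;
    float worldY;          // larger = further down the map = closer to camera
    bool opaqueAtCursor;   // result of sampling the sprite's alpha
};

// Pick the candidate "closest" to the camera by comparing world.y.
std::optional<std::size_t> pickUnderMouse(const std::vector<Candidate>& candidates) {
    std::optional<std::size_t> best;
    float bestY = -1e30f;
    for (const auto& c : candidates) {
        if (!c.opaqueAtCursor) continue;  // cursor is over a transparent pixel
        if (c.worldY > bestY) {           // new closest object so far
            bestY = c.worldY;
            best = c.id;
        }
    }
    return best;
}
```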

This method falls apart with multi-story buildings, since a really tall building can block the view of something further up the map. My solution uses a second buffer that draws everything to a tiny 3x3 texture, but instead of drawing each sprite, it draws the object's unique ID encoded as pure color (via the vertex color property). On the CPU side of things, I take this color, convert it back into an unsigned int, look up the element at that position in a (C++) vector, and grab the basic properties of the object from there. The texture has to be so small because it is stored on the GPU and needs to be read back into an image format for the CPU; sending the entire screen back as an image every frame would be far too expensive.
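The ID-to-color round trip might look something like this (a sketch with my own names, assuming IDs fit in 24 bits packed into three 8-bit color channels): the ID is packed into RGB on the way into the vertex color, and the pixel read back from the tiny buffer is unpacked into an index on the CPU.

```cpp
#include <array>
#include <cstdint>

// Pack an object index into an 8-bit-per-channel RGB color. With three
// channels this supports up to 2^24 distinct IDs.
std::array<std::uint8_t, 3> idToColor(std::uint32_t id) {
    return {
        static_cast<std::uint8_t>(id & 0xFF),          // low byte -> red
        static_cast<std::uint8_t>((id >> 8) & 0xFF),   // middle byte -> green
        static_cast<std::uint8_t>((id >> 16) & 0xFF),  // high byte -> blue
    };
}

// Reverse the packing on a pixel read back from the picking buffer,
// yielding an index into the object vector.
std::uint32_t colorToId(const std::array<std::uint8_t, 3>& rgb) {
    return static_cast<std::uint32_t>(rgb[0])
         | (static_cast<std::uint32_t>(rgb[1]) << 8)
         | (static_cast<std::uint32_t>(rgb[2]) << 16);
}
```

One practical caveat: for this to survive the GPU round trip, the picking buffer must be drawn without blending, anti-aliasing, or any color-space conversion, otherwise the read-back bytes won't match the encoded ID exactly.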