Before diving into solutions, I'm curious about what problem you're trying to solve. Are you hoping to generate actual GUI code from a mock-up image? Is this more of an academic exercise? Something else entirely?
I have done something like this in the past to find rendering differences between two images. To accomplish that, we used connected components to break the image into segments, then applied heuristics about what different GUI components should look like. That let us ignore a set of allowable differences while still flagging any other per-pixel differences as well as higher-level layout issues.
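To make the connected-components idea concrete, here is a minimal sketch of the labeling step (a dependency-free breadth-first flood fill over a binary image; in practice you would use a library routine such as OpenCV's connectedComponents or scipy.ndimage.label). The function name and the list-of-lists image representation are my own illustration, not from any particular library:

```python
from collections import deque

def connected_components(binary, connectivity=4):
    """Label connected components in a binary image (list of lists of 0/1).

    Returns (labels, count): a label grid of the same shape, where the
    background stays 0 and each foreground component gets a distinct
    positive label, plus the number of components found.
    """
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    neighbors = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if connectivity == 8:
        neighbors += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    count = 0
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not labels[y][x]:
                # New component: flood-fill everything reachable from (y, x).
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for dy, dx in neighbors:
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count
```

Once you have the label grid, each component's bounding box and pixel statistics become the inputs to your per-widget heuristics (e.g. "a long thin component with a uniform border is probably a text field").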
Another approach would be to define templates for what these different GUI objects should look like (with some parameterization to allow for sizing, etc.), and then search for instances within your image that match those templates.
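In its simplest form, template search is a sliding-window comparison. The sketch below does exact matching on raw pixel values, which is a deliberate simplification: a real system would add the parameterization mentioned above (scale, tolerance, a similarity score such as normalized cross-correlation, as in OpenCV's matchTemplate). The function name and 2D-list representation are illustrative assumptions:

```python
def find_template(image, template):
    """Return (row, col) offsets where `template` exactly matches a
    sub-window of `image` (both 2D lists of pixel values)."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    hits = []
    # Slide the template over every position where it fits entirely.
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            if all(image[y + dy][x + dx] == template[dy][dx]
                   for dy in range(th) for dx in range(tw)):
                hits.append((y, x))
    return hits
```

The hard part is not the search itself but building a template set that covers the rendering variations you actually see, which leads to the considerations below.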
Here are some considerations, though. How well can you describe the GUI elements in your input images? If they all come from one mock-up program, you may have only a small set to search for. But if you are using screenshots from different platforms, from different browsers on the same platform, or from the same platform with different display settings, you will find differences ranging from subtle to dramatic. So constraining your input will be critical.
If you use connected components, you will probably find that you either need to look for relationships between different components (because they make up parts of the same GUI object) or need to control how the algorithm connects pixels so that you end up with the component set you want. I suspect you will find a lot of ambiguity here. For instance, in one GUI object you may want two shades of gray treated as part of the same component, while in another situation you want those same two grays kept in separate components.
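One common way to expose that control knob is to make the connectivity rule itself a parameter, e.g. join adjacent pixels only when their gray values differ by at most some tolerance. A sketch of that idea (names and representation are my own illustration):

```python
from collections import deque

def label_by_tolerance(gray, tol):
    """Label 4-connected components in a grayscale image (2D list of
    intensities), joining adjacent pixels only if their values differ
    by at most `tol`. Returns (labels, count)."""
    h, w = len(gray), len(gray[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if not labels[y][x]:
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and not labels[ny][nx]
                                and abs(gray[ny][nx] - gray[cy][cx]) <= tol):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count
```

Note how this makes the ambiguity explicit: with a 10-level difference between two grays, `tol=5` keeps them in separate components while `tol=15` merges them. It also has the classic weakness that a gradual gradient can chain many slightly-different pixels into one component, so a single global tolerance rarely works for a whole screenshot.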