Vision should provide an explanation of the scene in terms of a causal semantics: what affects what, and why. An important part of the causal explanation of a static scene is what supports what, or, counterfactually, why aren't things moving? We use simple naive physical knowledge as the basis of a vertically integrated vision system that explains arbitrarily complex stacked-block structures. The semantics provides a basis for controlling the application of visual attention and forms a framework for the explanation that is generated. We show how the program sequentially explores scenes of complex block structures, identifies functional substructures such as arches and cantilevers, and develops an explanation of why the whole construction stands and of the role each block plays in its stability.
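To make the flavour of such naive-physics support reasoning concrete, the following is a minimal sketch (not the system described above, whose representation and control structure are not given here) of a counterfactual support query for 2-D stacked blocks: a block is judged stable when its centre of mass lies over its contact region with the blocks beneath it. The `Block` rectangle representation and all names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    x: float       # left edge
    y: float       # bottom edge
    width: float
    height: float

    @property
    def right(self): return self.x + self.width
    @property
    def top(self): return self.y + self.height
    @property
    def cx(self): return self.x + self.width / 2.0   # centre of mass (uniform density)

def supporters(block, scene, eps=1e-9):
    """Blocks whose top face meets this block's bottom and overlaps it horizontally."""
    return [b for b in scene
            if b is not block
            and abs(b.top - block.y) < eps
            and b.x < block.right and block.x < b.right]

def explain(block, scene):
    """A counterfactual-style answer to 'why isn't this block falling?'."""
    if block.y == 0:
        return f"{block.name} rests on the ground."
    sup = supporters(block, scene)
    if not sup:
        return f"{block.name} has no support and would fall."
    # Contact region with the supporters (assumed contiguous in this sketch).
    left = min(max(b.x, block.x) for b in sup)
    right = max(min(b.right, block.right) for b in sup)
    names = ", ".join(b.name for b in sup)
    if left <= block.cx <= right:
        return (f"{block.name} is supported by {names}: "
                f"its centre of mass lies over the contact region.")
    return (f"{block.name} touches {names}, but its centre of mass "
            f"overhangs the contact region, so it would topple.")

scene = [
    Block("base",  0, 0, 4, 1),
    Block("top",   1, 1, 2, 1),   # centred over base: stable
    Block("ledge", 3, 1, 4, 1),   # mostly overhangs base: would topple
]
for b in scene:
    print(explain(b, scene))
```

A full system along the lines described above would of course need richer geometry (friction, counterweights, multi-block cantilevers) and a perceptual front end; this sketch only illustrates the "why isn't it moving?" query at the heart of the semantics.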