We present a method for autonomously learning representations of visual disparity between the images of the left and right eyes, together with the vergence movements needed to fixate objects with both eyes. A sparse coding model (perception) encodes sensory information using binocular basis functions, while a reinforcement learner (behavior) generates eye movements according to the sensed disparity. Perception and behavior develop in parallel by minimizing the same cost function: the reconstruction error of the stimulus under the generative model. To cope efficiently with multiple disparity ranges, sparse coding models are learned at multiple scales, encoding disparities at various resolutions. Similarly, vergence commands are defined on a logarithmic scale to allow both coarse and fine actions. We demonstrate the efficacy of the proposed method on the humanoid robot iCub and show that the model is fully self-calibrating, requiring no prior information about the camera parameters or the system dynamics.
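To make the shared cost function concrete, the following is a minimal sketch, not the paper's implementation: it uses greedy matching pursuit as a stand-in sparse coder over a random dictionary of "binocular" basis functions, and derives the reinforcement learner's reward as the negative reconstruction error. All names, dimensions, and the choice of matching pursuit are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not from the paper):
PATCH_DIM = 32   # length of a concatenated left+right image patch vector
N_BASES = 16     # number of binocular basis functions in the dictionary
N_ACTIVE = 4     # sparsity level: number of active coefficients

# Random unit-norm dictionary standing in for learned binocular bases.
bases = rng.standard_normal((PATCH_DIM, N_BASES))
bases /= np.linalg.norm(bases, axis=0)

def sparse_code(x, dictionary, n_active):
    """Greedy matching pursuit: encode x with at most n_active atoms."""
    residual = x.copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_active):
        scores = dictionary.T @ residual        # correlation with each atom
        k = int(np.argmax(np.abs(scores)))      # best-matching atom
        coeffs[k] += scores[k]
        residual -= scores[k] * dictionary[:, k]
    return coeffs, residual

def reconstruction_error(x, dictionary, n_active=N_ACTIVE):
    """Squared norm of the residual left after sparse reconstruction."""
    _, residual = sparse_code(x, dictionary, n_active)
    return float(residual @ residual)

def reward(x, dictionary):
    """Reward for the behavior module: negative reconstruction cost.

    The intuition from the abstract: a well-verged fixation yields
    left/right patches that the binocular bases reconstruct well, so
    the residual (and hence the cost) is small and the reward is high.
    """
    return -reconstruction_error(x, dictionary)
```

In this sketch, perception would adapt `bases` to lower the same residual that defines `reward`, which is the sense in which both modules minimize one cost.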