My self-supervised model supports psychophysical findings that relating dimensions within a sensory modality requires different processing from relating dimensions between modalities. The algorithm performs much better when the visual inputs are treated as one group and processed separately from the auditory inputs (before feedback) than when all the input dimensions are randomly divided into two ``pseudo-modalities''. In fact, my current research shows that placing dimensions from the same modality on opposite sides of a ``self-teaching'' network is actively harmful. This indicates that there are good computational reasons for separating the modalities. To understand this result, I am mathematically analysing the statistical relationship between dimensions in a pattern ensemble to determine how they should best be integrated in learning.
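The statistical intuition can be illustrated with a minimal sketch (not the model itself, and all names and parameters here are hypothetical): if dimensions within a modality share correlated noise while the only dependence between modalities is the common underlying category, then a random pseudo-modality split places strongly correlated same-modality dimensions on opposite sides of the self-teaching boundary, whereas the true modality split leaves only the category-driven dependence across it.

```python
import numpy as np

# Illustrative toy ensemble: dimensions within a modality share a noise
# source; dependence *between* modalities comes only from the category.
rng = np.random.default_rng(0)
n = 2000
category = rng.integers(0, 2, n).astype(float)

# 4 "visual" and 3 "auditory" dimensions; the sizes are arbitrary.
visual = (category[:, None]
          + rng.standard_normal((n, 1))          # noise shared within vision
          + 0.3 * rng.standard_normal((n, 4)))
auditory = (category[:, None]
            + rng.standard_normal((n, 1))        # noise shared within audition
            + 0.3 * rng.standard_normal((n, 3)))
X = np.hstack([visual, auditory])                # dims 0-3 visual, 4-6 auditory

def mean_cross_corr(X, side_a, side_b):
    """Mean |correlation| between dimensions on opposite sides of a split."""
    C = np.abs(np.corrcoef(X, rowvar=False))
    return C[np.ix_(side_a, side_b)].mean()

# True modality split vs one example pseudo-modality split mixing the senses.
true_split = ([0, 1, 2, 3], [4, 5, 6])
pseudo_split = ([0, 1, 4, 5], [2, 3, 6])

true_cross = mean_cross_corr(X, *true_split)
pseudo_cross = mean_cross_corr(X, *pseudo_split)
# The pseudo split carries the large within-modality correlations across
# the boundary, so its cross-boundary dependence is much higher.
```

Under these assumptions the cross-boundary correlation for the pseudo-modality split is several times that of the true split, which is one way same-modality dimensions on the other side of the network could distort the self-supervised signal.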