Figures and data

Description of models.
a, Each area of the model receives driving feedforward input and modulatory feedback input. Feedback input alters the gain of a neuron but does not affect its threshold of activation (multiplicative feedback). In later experiments, we explore an alternative mechanism in which feedback also weakly affects the threshold of activation (composite feedback). b, Modeled regions and their externopyramidisation values (i.e., thickness and relative differentiation of the supragranular layers, used as a proxy measure for sensory-associational hierarchical position). Note the higher overall externopyramidisation values in the occipital lobe compared to the temporal lobe. c, Using the above hierarchical measures, we constructed models in which each connection has a direction (i.e., regions send either feedforward or feedback connections to other regions). In the brainlike model, based on human cytoarchitectural data, all visual regions send feedforward connections to, and receive feedback from, the auditory regions. In the reverse model, all auditory regions send feedforward connections to, and receive feedback from, the visual regions, while connections within a modality remain the same. d, The resulting ANN. Outputs of image-identification tasks are read out from IT; outputs of audio-identification tasks are read out from A4, an auditory associational area. Connections between modules are simplified for illustration.
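For concreteness, the two feedback mechanisms in panel a can be sketched as follows. This is a minimal illustration assuming a rate-style ReLU unit with feedforward drive ff and feedback input fb; the particular gain function and the mixing weight alpha are assumptions for illustration, not the model's exact parameterisation.

```python
import torch
import torch.nn.functional as F

def multiplicative_feedback(ff, fb):
    # Feedback rescales the gain of the feedforward drive; with no
    # feedforward input the unit stays silent, so the activation
    # threshold itself is unchanged.
    gain = 1.0 + torch.sigmoid(fb)  # illustrative gain function (assumption)
    return F.relu(gain * ff)

def composite_feedback(ff, fb, alpha=0.1):
    # Mostly multiplicative, but a weak additive term lets feedback
    # slightly shift the effective activation threshold.
    gain = 1.0 + torch.sigmoid(fb)
    return F.relu(gain * ff + alpha * fb)  # alpha is an illustrative constant
```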

Multimodal visual tasks.
a, Training conditions. Models must identify the visual stimulus given an ambiguous image and a matching audio clue (VS1) or an unambiguous image and distracting audio (VS2). b-c, Accuracy across epochs for tasks VS1 and VS2 on holdout datasets. d, Trained models were given an ambiguous visual stimulus and a nonmatching audio stimulus (VS3) to assess which modality they align most closely with. e, Alignment of trained models across epochs based on task VS3. f, Models were additionally trained and tested on image stimuli only to assess their baseline performance (VS4). g, Accuracy of models across epochs on task VS4.
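One way to quantify the alignment reported in panels d-e is the fraction of conflicting-cue trials on which the model's prediction follows each modality's label. The sketch below is an assumed operationalisation (the function name modality_alignment is hypothetical), not necessarily the exact definition used for the figure.

```python
import numpy as np

def modality_alignment(preds, visual_labels, audio_labels):
    # Fraction of task-VS3 trials (ambiguous image, nonmatching audio) on
    # which the prediction matches the visual label versus the audio label.
    preds = np.asarray(preds)
    return {
        "visual": float(np.mean(preds == np.asarray(visual_labels))),
        "auditory": float(np.mean(preds == np.asarray(audio_labels))),
    }
```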

Multimodal auditory tasks.
a, Training conditions. Models must identify the auditory stimulus given ambiguous audio and a matching visual clue (AS1) or unambiguous audio and a distracting image (AS2). b-c, Accuracy across epochs for tasks AS1 and AS2 on holdout datasets. d, Trained models were given an ambiguous audio stimulus and a nonmatching visual stimulus (AS3) to assess which modality they align most closely with. e, Alignment of trained models across epochs based on task AS3. f, Models were additionally trained and tested on audio stimuli only to assess their baseline performance (AS4). g, Accuracy of models across epochs on task AS4.

Composite versus multiplicative feedback in multimodal tasks.
a, c, e, Test performances of models with composite feedback and feedforward-only models trained on visual tasks (VS1 and VS2). The final-epoch accuracy of models with composite feedback (C) is compared to that of models with multiplicative feedback (M) shown in previous figures. b, d, f, Test performances of models trained on auditory tasks (AS1 and AS2).

Audiovisual switching task.
a, All models were given a new audiovisual output area (AV) connecting to IT and A4. b, The output area receives an attention flag indicating which stream of information to attend to (visual or auditory). The models with feedback use composite feedback. c-d, Test performance of models with composite feedback and feedforward-only models on all tasks. The models were trained simultaneously on all tasks. e, Alignment of models given an ambiguous visual and an ambiguous audio input with differing labels.
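A minimal sketch of how the attention flag in panels a-b could enter the AV readout, assuming the flag is simply concatenated with IT and A4 activity before a linear readout; the class name AVReadout and the concatenation scheme are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class AVReadout(nn.Module):
    # Hypothetical audiovisual output area reading from IT and A4.
    # A two-element one-hot attention flag tells it which stream to report.
    def __init__(self, it_dim, a4_dim, n_classes):
        super().__init__()
        self.fc = nn.Linear(it_dim + a4_dim + 2, n_classes)

    def forward(self, it_act, a4_act, attend_visual: bool):
        batch = it_act.shape[0]
        flag = torch.tensor([1.0, 0.0] if attend_visual else [0.0, 1.0],
                            device=it_act.device).expand(batch, 2)
        return self.fc(torch.cat([it_act, a4_act, flag], dim=-1))
```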

Model activity during multimodal tasks.
a, Information flows through the model from area to area across time. At the first time step, only the primary visual and auditory areas process information. The areas they feed forward to are activated at the next time step, incorporating top-down information if any is available. b, Comparison of t-SNE-reduced latent spaces and the clustering metric in three areas of the brainlike model at different time steps on task VS2 (ignore the audio stimulus). c, Neighborhood Hit scores in all areas of the trained models across time. Trained models were taken from the experiments in Fig. 4.
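The Neighborhood Hit score in panel c is a standard clustering metric: for each sample, the fraction of its k nearest neighbours in the latent space that share its class label, averaged over all samples. A minimal sketch follows; the value k=6 is an arbitrary illustrative choice, not necessarily the one used for the figure.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_hit(embeddings, labels, k=6):
    # For each sample, the fraction of its k nearest neighbours (excluding
    # itself) sharing its label, averaged over all samples.
    labels = np.asarray(labels)
    nn_idx = (NearestNeighbors(n_neighbors=k + 1)
              .fit(embeddings)
              .kneighbors(embeddings, return_distance=False)[:, 1:])
    return float(np.mean(labels[nn_idx] == labels[:, None]))
```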