Disrupting adversarial transferability in deep neural networks

Patterns (N Y). 2022 Mar 24;3(5):100472. doi: 10.1016/j.patter.2022.100472. eCollection 2022 May 13.

Abstract

Adversarial attack transferability is well recognized in deep learning. Previous work has partially explained transferability by recognizing common adversarial subspaces and correlations between decision boundaries, but little is known beyond that. We propose that transferability between seemingly different models is due to a high linear correlation between the feature sets that different networks extract. In other words, two models trained on the same task that are distant in the parameter space likely extract features in the same fashion, linked by trivial affine transformations between the latent spaces. Furthermore, we show how applying a feature correlation loss, which decorrelates the extracted features in corresponding latent spaces, can reduce the transferability of adversarial attacks between models, suggesting that the models complete tasks in semantically different ways. Finally, we propose a dual-neck autoencoder (DNA), which leverages this feature correlation loss to create two meaningfully different encodings of input information with reduced transferability.
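
The feature correlation loss mentioned above is, at its core, a penalty on the linear correlation between the features two networks extract for the same inputs. The sketch below is one plausible way such a loss could be written in PyTorch, assuming batched latent tensors `z1` and `z2` from two encoders (e.g., the two necks of the DNA); the function name and the specific penalty (mean squared entry of the cross-correlation matrix) are illustrative assumptions, not the paper's stated formulation.

```python
import torch

def feature_correlation_loss(z1: torch.Tensor, z2: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """Penalize linear correlation between two latent representations.

    z1, z2: (batch, dim) feature tensors from two encoders or bottlenecks.
    Returns the mean squared entry of their cross-correlation matrix,
    which is zero only when every feature in z1 is linearly uncorrelated
    with every feature in z2 over the batch.
    """
    # Standardize each feature dimension over the batch.
    z1 = (z1 - z1.mean(dim=0)) / (z1.std(dim=0) + eps)
    z2 = (z2 - z2.mean(dim=0)) / (z2.std(dim=0) + eps)

    # Cross-correlation matrix between the two latent spaces: (dim1, dim2).
    n = z1.shape[0]
    cross_corr = z1.T @ z2 / n

    # Driving all entries toward zero discourages any trivial affine map
    # from one latent space onto the other.
    return cross_corr.pow(2).mean()


if __name__ == "__main__":
    # Example: random features are nearly uncorrelated, so the loss is small.
    z1, z2 = torch.randn(64, 128), torch.randn(64, 128)
    print(feature_correlation_loss(z1, z2))
```

In practice, a term like this would be added to each model's task loss during joint training, trading some task accuracy for reduced cross-model feature correlation and, per the abstract, reduced attack transferability.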

Keywords: adversarial attacks; artificial intelligence; attack transferability; computer vision; decorrelation; deep learning; radiomics.