Non-intrusive detection of adversarial deep learning attacks via observer networks

Recent studies have shown that deep learning models
are vulnerable to specifically crafted adversarial inputs that are
quasi-imperceptible to humans. In this letter, we propose a novel
method to detect adversarial inputs by augmenting the main
classification network with multiple binary detectors (observer
networks) that take their inputs from the hidden layers of the
original network (convolutional kernel outputs) and classify the
input as clean or adversarial. During inference, the detectors
are treated as part of an ensemble, and the input
is deemed adversarial if at least half of the detectors classify
it as such. The proposed method addresses the trade-off between
classification accuracy on clean and adversarial samples, since
the original classification network is not modified during the
detection process. The use of multiple observer networks makes
attacking the detection mechanism non-trivial even when the
attacker is aware of the victim classifier. We achieve a 99.5%
detection accuracy on the MNIST dataset and 97.5% on the
CIFAR-10 dataset using the Fast Gradient Sign Attack in a
semi-white-box setup. The false-positive rate is a mere 0.12%
in the worst case.
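To make the mechanism concrete, the sketch below illustrates the observer-network idea in PyTorch: binary detectors tap hidden-layer activations of a frozen classifier and a majority vote flags adversarial inputs. The classifier and observer architectures, the tapped layers, and the voting threshold here are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch of observer-network detection (illustrative assumptions only).
import torch
import torch.nn as nn


class MainClassifier(nn.Module):
    """Victim classifier; its convolutional outputs are tapped by observers."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(4)
        self.fc = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, x):
        h1 = self.conv1(x)   # hidden layer tapped by observer 1
        h2 = self.conv2(h1)  # hidden layer tapped by observer 2
        logits = self.fc(self.pool(h2).flatten(1))
        return logits, [h1, h2]


class Observer(nn.Module):
    """Binary detector: labels a feature map as clean (0) or adversarial (1)."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, 2),
        )

    def forward(self, feature_map):
        return self.net(feature_map)


def detect_adversarial(classifier, observers, x):
    """Flag x as adversarial if at least half of the observers vote 'adversarial'.

    The classifier itself is never modified, so clean accuracy is unaffected.
    """
    with torch.no_grad():
        logits, hidden = classifier(x)
        votes = torch.stack(
            [obs(h).argmax(dim=1) for obs, h in zip(observers, hidden)]
        )  # shape: (num_observers, batch)
        is_adversarial = votes.float().mean(dim=0) >= 0.5
    return logits.argmax(dim=1), is_adversarial


if __name__ == "__main__":
    clf = MainClassifier()
    observers = [Observer(16), Observer(32)]  # one observer per tapped layer
    x = torch.randn(4, 1, 28, 28)             # e.g. a batch of MNIST-sized inputs
    preds, flags = detect_adversarial(clf, observers, x)
    print(preds, flags)
```

In this sketch the observers would be trained separately on clean and adversarially perturbed inputs while the main classifier's weights stay frozen, which is what keeps the detection non-intrusive.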