How Oscillatory Activation Functions Overcome the Vanishing Gradient and XOR Problems
In a neural network modeled after the human brain, the activation function is one of the most important components: it transforms a neuron's weighted input into an output and determines whether the neuron activates and passes its signal to the next layer. The main job of an activation function is to introduce nonlinearity into the network. Nonlinear activation functions have been the key transformation allowing CNNs to learn complex high-dimensional functions. Moreover, the introduction of the rectified linear unit activation function, or ReLU, played a huge role in alleviating the vanishing gradient problem. However, all popular activation functions increase monotonically and have at most a single zero, at the origin.
The famous XOR problem involves training a neural network to learn the XOR gate function. Minsky and Papert first pointed out that a single neuron cannot learn the XOR function, since a single hyperplane (a line, in this case) cannot separate the two output classes. This fundamental limitation of single neurons (and single-layer networks) led to pessimistic predictions about the future of neural network research and contributed to a brief hiatus in AI history. A recent paper shows that this limitation does not hold for some oscillatory activation functions. The XOR problem consists of learning the following set of data:

x1  x2 | XOR(x1, x2)
 0   0 |  0
 0   1 |  1
 1   0 |  1
 1   1 |  0
The paper, “Biologically Inspired Oscillating Activation Functions Can Bridge the Performance Gap Between Biological and Artificial Neurons”, proposes oscillating activation functions to overcome both the gradient flow problem and the XOR problem, essentially solving “classification problems with fewer neurons and reducing training time”. Analytics India Magazine spoke to Dr. Matthew Mithra Noel, Dean of the School of Electrical Engineering at VIT, and to Shubham Bharadwaj about the research. Additionally, Praneet Dutta, an alumnus of the university, volunteered high-level advice on the proposal as an independent researcher.
Looking for activation functions better than ReLU
Dr. Noel talked about ReLU and the need to research better activation functions. “Neural layers with nonlinear activation functions are essential in real-world applications of neural networks, because a composition of any finite number of linear functions is itself a single linear function. Therefore, an ANN composed purely of linear neurons is equivalent to a single linear layer, capable of learning only linear relations and solving only linearly separable problems,” he explained. “Despite the crucial importance of the activation function in determining the performance of neural networks, simple monotonic non-decreasing nonlinear activation functions are universally used. We explored the effects of using non-monotonic and oscillatory nonlinear activation functions in deep neural networks. In the past, sigmoidal s-shaped saturating activation functions were popular because they approximated the step or sign function while still being differentiable.
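The point that stacking purely linear layers collapses into one linear map can be verified directly. A minimal NumPy sketch (all variable names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no nonlinearity: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2

# The same function as a single linear layer: W = W2 @ W1, b = W2 @ b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

# Identical outputs: extra depth adds nothing without a nonlinear activation
assert np.allclose(two_layer, one_layer)
```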
Moreover, the outputs of s-shaped saturating activations have the important property of being interpretable as a binary yes/no decision and are therefore useful. However, deep ANNs composed purely of sigmoidal activation functions cannot be trained due to the vanishing gradient problem that arises when saturating activation functions are used. The adoption of the non-saturating, non-sigmoidal rectified linear unit (ReLU) activation function to alleviate the vanishing gradient problem is considered an important step in the evolution of deep neural networks.”
Oscillatory and non-monotonic activation functions have been largely ignored in the past, perhaps due to perceived biological implausibility. “Our research explores a variety of complex oscillatory activation functions. Oscillating, non-monotonic activation functions could be advantageous in solving the vanishing gradient problem, because these functions have non-zero derivatives throughout their domain except at isolated points,” Dr. Noel said.
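The paper's flagship example of such a function is the Growing Cosine Unit, GCU(z) = z·cos(z). A short sketch of the oscillation described above: unlike tanh or ReLU, which cross zero only at the origin, GCU has zeros at z = 0 and wherever cos(z) = 0, with alternating sign in between:

```python
import numpy as np

def gcu(z):
    """Growing Cosine Unit, GCU(z) = z * cos(z)."""
    return z * np.cos(z)

# Multiple zeros: at the origin and at every odd multiple of pi/2
zeros = np.array([0.0, np.pi / 2, -np.pi / 2, 3 * np.pi / 2, -3 * np.pi / 2])
assert np.allclose(gcu(zeros), 0.0)

# Between consecutive zeros the sign alternates: positive, negative, positive
assert gcu(np.pi / 4) > 0 and gcu(np.pi) < 0 and gcu(2 * np.pi) > 0
```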
“During our research, we discovered a variety of new activation functions that outperform all known activation functions on the Imagenette, CIFAR-100, CIFAR-10, and MNIST datasets. Moreover, these new activation functions appear to reduce network size and training time. For example, the XOR function, which previously required a minimum of three sigmoid or ReLU neurons, can be learned with a single GCU neuron.”
Overcoming the vanishing gradient problem with oscillatory activation functions
“Back in 2017, it was clear that if you replaced saturating sigmoidal activation functions with non-saturating activation functions like ReLU, the performance was significantly better,” Dr. Noel said. “The only way to train very deep neural networks is to replace saturating sigmoidal activation functions with activation functions that only partially saturate.” This realization led the team to question whether ReLU is really the best activation function, and whether performance could be improved beyond it.
“Neural network learning works on the principle of gradient descent, and the parameters are updated based on the derivatives. Thus, the vanishing gradient problem is a fundamental problem that all deep nets must overcome,” Dr. Noel continued. “The solution that seemed obvious was to improve on ReLU by exploring activation functions that never saturate, regardless of input value. The problem of vanishing and exploding gradients can be mitigated by using oscillatory activation functions with derivatives that go to neither zero nor infinity and that resemble classical bipolar sigmoidal activations near zero.”
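The contrast in gradient behavior is easy to check numerically. A sketch comparing the derivative of the sigmoid, which collapses toward zero for large pre-activations, with the derivative of GCU(z) = z·cos(z), namely cos(z) − z·sin(z), which vanishes only at isolated points:

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid: s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def gcu_grad(z):
    """Derivative of GCU(z) = z*cos(z): oscillates, amplitude grows with |z|."""
    return np.cos(z) - z * np.sin(z)

z = 10.0  # a strongly "saturated" pre-activation value
assert sigmoid_grad(z) < 1e-4     # sigmoid gradient has effectively vanished
assert abs(gcu_grad(z)) > 1.0     # GCU gradient is still large
```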
Oscillating activation functions and the XOR problem
Each neuron in a neural network makes a simple yes/no decision, a binary classification. Because all classical activation functions have only one zero, the decision boundary of a single neuron is a single hyperplane; but two hyperplanes are needed to separate the classes in the XOR dataset, and two hyperplanes require an activation function with two zeros. “Essentially, you need the activation function to be positive, then negative, and positive again,” Dr. Noel explained; in other words, it needs to oscillate. Measurements of biological neurons show the same shape: for small input values the output increases, then it saturates, and then “the output must decay to another zero if a biological neuron is able to learn the XOR function,” the paper explains. The proposed model captures this rising-and-falling oscillation with multiple zeros, so multiple hyperplanes form the decision boundary. A single neuron with an oscillatory activation such as the Growing Cosine Unit (GCU) thus replaces the two layers traditionally needed to learn the XOR function. In the paper, the researchers introduce many oscillating functions that can solve the XOR problem with a single neuron.
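A single GCU neuron solving XOR can be demonstrated in a few lines. The weights below are hand-picked for illustration (the paper's point is that such a solution can also be found by gradient descent); the key is that the pre-activations of the two classes land in regions where GCU has opposite sign:

```python
import numpy as np

def gcu(z):
    """Growing Cosine Unit, GCU(z) = z * cos(z)."""
    return z * np.cos(z)

# One neuron: y = gcu(w . x + b), predicted class = 1 if y > 0.
# Weights chosen by hand for illustration, not taken from the paper.
w, b = np.array([np.pi / 2, np.pi / 2]), -np.pi / 4

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR truth table

# Pre-activations: -pi/4, pi/4, pi/4, 3*pi/4.
# GCU is negative on (-pi/2, 0) and (pi/2, 3*pi/2), positive on (0, pi/2),
# so the two XOR classes fall on opposite sides of zero.
pred = (gcu(X @ w + b) > 0).astype(int)
assert np.array_equal(pred, y)  # one neuron, XOR solved
```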
Discovery of unique biological neurons capable of learning XOR function
“The human brain demonstrates the highest levels of general intelligence, a quality not found in any other animal,” Dr. Noel explained. “If these XOR neurons in the human brain behave similarly to the models we have proposed, independent of biology, we are on the right track.” Indeed, individual biological neurons in the human brain can solve the XOR learning problem. A 2020 study by Albert Gidon et al. identified new classes of neurons in the human cortex with the potential to enable single neurons to solve computational problems, including XOR, that typically require multi-layered neural networks. The activation measured in that study increases and then decreases, essentially oscillating, making it a biological counterpart of the proposed theory.
“If we want to bridge the gap between human intelligence and artificial intelligence, we have to bridge the gap between biological neurons and artificial neurons,” Dr. Noel concluded. The oscillatory activation functions were tested on a range of neural network architectures, datasets, and benchmarks, and on every model evaluated at least one of the new activation functions outperformed the previous ones.
Currently, the team is testing the approach on various practical problems with VIT students; the use cases explored include cryptocurrency price estimation and image tasks involving retinal scans. For the retinopathy task, Shubham described an experimental test of 52 activation-function combinations across the convolutional (conv) and dense layers: the top five combinations by AUC score included three with oscillatory activation functions in the feature-extraction layers. “The total number of combinations we are trying on a simple VGG network for the retinopathy task is about 676, and the potential for treating the activation function as a hyperparameter is huge,” he explained. Combinations with oscillating activations in the dense layer are also being explored, a different angle from the originally proposed use of this new family of oscillating activation functions.