AlexNet, a convolutional neural network (CNN), competed in the ImageNet Large Scale Visual Recognition Challenge on September 30, 2012. The network achieved a top-5 error of 15.3%, more than 10.8 percentage points lower than that of the runner-up. The original paper's primary result was that the depth of the model was essential for its high performance. Training such a deep model was computationally expensive, but was made feasible by using graphics processing units (GPUs).
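To make the top-5 metric concrete, the following is a minimal sketch of how a top-k error rate can be computed from per-class scores. The function name `top_k_error` and the toy scores are illustrative assumptions, not part of the ImageNet evaluation tooling.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k highest-scoring classes."""
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and of equal length")
    misses = 0
    for class_scores, true_label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if true_label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy data, made up for illustration: 3 examples, 6 classes each.
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],  # true class 1 has the highest score
    [0.30, 0.25, 0.20, 0.15, 0.08, 0.02],  # true class 5 has the lowest score
    [0.05, 0.10, 0.15, 0.20, 0.20, 0.30],  # true class 2 has the 4th-highest score
]
labels = [1, 5, 2]

print("top-1 error:", top_k_error(scores, labels, k=1))
print("top-5 error:", top_k_error(scores, labels, k=5))
```

A prediction counts as correct under top-5 if the true label appears anywhere among the five best guesses, so the top-5 error is never higher than the top-1 error on the same data.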
Historic context
AlexNet was not the first fast GPU implementation of a CNN to win an image recognition contest. A CNN on GPU by K. Chellapilla et al. (2006) was 4 times faster than an equivalent implementation on CPU. A deep CNN by Dan Cireșan et al. (2011) at IDSIA was already 60 times faster and outperformed predecessors in August 2011. Between May 15, 2011, and September 10, 2012, their CNN won no fewer than four image competitions. They also significantly improved on the best performance in the literature for multiple image databases.
According to the AlexNet paper, Cireșan's earlier net is "somewhat similar." Both were originally written with CUDA to run with GPU support. Both are variants of the CNN designs introduced by Yann LeCun et al. (1989), who applied the backpropagation algorithm to a variant of Kunihiko Fukushima's original CNN architecture called "neocognitron." The architecture was later modified by J. Weng's max-pooling method (Weng et al., 1993).
In 2015, AlexNet was outperformed by a Microsoft Research Asia project with over 100 layers, which won the ImageNet 2015 contest.
Network design
AlexNet contains eight layers: the first five are convolutional layers, some of them followed by max-pooling layers, and the last three are fully connected layers. The network, except the last layer, is split into two copies, each run on one GPU. The entire structure can be written as:
(CNN → RN → MP)^2 → (CNN^3 → MP) → (FC → DO)^2 → Linear → softmax
where a superscript gives the number of times a block is repeated, and
CNN = convolutional layer (with ReLU activation)
RN = local response normalization
MP = max-pooling
FC = fully connected layer (with ReLU activation)
Linear = fully connected layer (without activation)
DO = dropout
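The layer sequence can be sanity-checked by tracing feature-map sizes through the convolutional and pooling stages. The sketch below is illustrative only. The kernel sizes, strides, paddings and channel counts are the commonly cited AlexNet hyperparameters and are not stated in this article. The 227×227 input is likewise an assumption chosen so that the first convolution divides evenly without padding (the paper states 224×224). The names `conv_out`, `trace_shapes` and `flattened_size` are hypothetical helpers, and the two-GPU split is ignored.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling window on a square input."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, out_channels or None for pooling, kernel, stride, padding).
# All values are assumptions taken from the commonly cited configuration.
# RN (local response normalization) is omitted because it does not change shapes.
LAYERS = [
    ("conv1", 96, 11, 4, 0),
    ("pool1", None, 3, 2, 0),
    ("conv2", 256, 5, 1, 2),
    ("pool2", None, 3, 2, 0),
    ("conv3", 384, 3, 1, 1),
    ("conv4", 384, 3, 1, 1),
    ("conv5", 256, 3, 1, 1),
    ("pool5", None, 3, 2, 0),
]
FC_WIDTHS = [4096, 4096, 1000]  # FC, FC, Linear (1000 ImageNet classes)


def trace_shapes(input_size=227, in_channels=3):
    """Return (name, channels, height, width) after each conv/pool layer."""
    size, channels = input_size, in_channels
    shapes = []
    for name, out_channels, kernel, stride, padding in LAYERS:
        size = conv_out(size, kernel, stride, padding)
        if out_channels is not None:  # pooling keeps the channel count
            channels = out_channels
        shapes.append((name, channels, size, size))
    return shapes


def flattened_size(shapes):
    """Number of features fed into the first fully connected layer."""
    _, channels, height, width = shapes[-1]
    return channels * height * width


shapes = trace_shapes()
for name, channels, height, width in shapes:
    print(f"{name}: {channels} x {height} x {width}")
print("flattened:", flattened_size(shapes), "-> FC widths:", FC_WIDTHS)
```

Under these assumptions the overlapping 3×3, stride-2 pooling reduces the spatial size at each MP step, while the padded 3×3 convolutions in the CNN^3 block leave it unchanged.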
AlexNet used the non-saturating ReLU activation function, which showed improved training performance over the saturating tanh and sigmoid functions.
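What "non-saturating" means can be illustrated by comparing derivatives. The sketch below uses hypothetical helper names and the standard definitions of the three functions. For large positive inputs, the derivatives of tanh and sigmoid shrink toward zero, while the derivative of ReLU stays at 1.

```python
import math


def relu_grad(x):
    """Derivative of ReLU(x) = max(0, x); taken as 0 at x = 0 by convention."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh(x): 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def sigmoid_grad(x):
    """Derivative of the logistic sigmoid s(x): s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


for x in (0.5, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  relu'={relu_grad(x):.1f}  "
          f"tanh'={tanh_grad(x):.2e}  sigmoid'={sigmoid_grad(x):.2e}")
```

Small derivatives mean small gradient updates during backpropagation, which is one common explanation for why saturating units train more slowly in deep networks.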
Influence
The AlexNet paper is considered one of the most influential in computer vision, having spurred many subsequent papers that use CNNs and GPUs to accelerate deep learning. As of early 2023, it has been cited over 120,000 times according to Google Scholar.
Weng, J.; Ahuja, N.; Huang, T. S. (1993). "Learning recognition and segmentation of 3-D objects from 2-D images". Proc. 4th International Conf. Computer Vision: 121–128.