Description
Rice is a staple food for over half of the world's population and one of the most produced agricultural commodities worldwide. Rice comes in numerous genetic varieties, which differ from one another in features such as texture, shape, and color. Using these visual features, it is possible to classify rice grains and evaluate their quality. Using the Rice Image Dataset from Kaggle, we explore various transfer learning approaches with neural network models to see how well they generalize to classifying rice varieties. We apply transfer learning to the following neural network models, using their PyTorch implementations:
- AlexNet (a classic model)
- ResNet18 (a classic model)
- SqueezeNet (a more recent-ish model)
We also use three different optimization strategies (SGD, Adadelta, RMSprop) for each model to see how they impact model performance for transfer learning.
Dataset
The Rice Image Dataset from Kaggle is provided by Murat Koklu. The dataset consists of 75,000 images of 5 rice varieties (often grown in Turkey):
There are 15,000 images for each variety in the dataset. Each image consists of a single grain of rice of the appropriate variety. Here are some examples of images from the dataset:
Preprocessing
The data came pre-labeled, so we did not have to label or re-label it manually. When loading the images, we resize them to 256x256x3 and then take a 224x224x3 center crop, which is the input size we use for all three neural network models. We then normalize the images and split the data into training, validation, and testing sets using an 80:10:10 split (60,000 training images, 7,500 validation images, and 7,500 testing images). The code for data preprocessing (and the rest of the code for the experiments) can be found in our GitHub repository.
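The 80:10:10 split can be illustrated with plain index arithmetic. This is a minimal sketch rather than the code from the repository: the function name `split_indices`, the shuffling strategy, and the seed are assumptions made for illustration.

```python
import random

def split_indices(num_images, train_pct=80, val_pct=10, seed=0):
    """Shuffle image indices and cut them into train/val/test lists."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)  # the seed is an arbitrary, assumed value
    num_train = num_images * train_pct // 100
    num_val = num_images * val_pct // 100
    train = indices[:num_train]
    val = indices[num_train:num_train + num_val]
    test = indices[num_train + num_val:]  # whatever remains goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(75_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 60000 7500 7500
```

Each index lands in exactly one subset, so no image is shared between the training, validation, and testing sets.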
Experiments
The three neural network models we apply transfer learning to were all trained on the ImageNet dataset. We experiment to see how well the "knowledge" these models learned from ImageNet transfers to classifying these rice varieties. The models' PyTorch implementations are loaded from the torchvision.models subpackage, and the optimizers' implementations from the torch.optim subpackage. We use cross entropy loss as the loss criterion during training.
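For readers unfamiliar with the loss criterion, here is a minimal pure-Python sketch of the cross entropy loss for a single example, computed from raw class scores (logits) and the index of the correct class. It illustrates the formula only; the experiments use PyTorch's built-in criterion, and the function name `cross_entropy` below is chosen for this sketch.

```python
import math

def cross_entropy(logits, target):
    """Cross entropy of one example: -log(softmax(logits)[target])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]

# With 5 classes and identical scores, every class has probability 1/5.
uniform_loss = cross_entropy([0.0] * 5, target=2)
print(uniform_loss)  # log(5), about 1.609

# A confident, correct prediction gives a loss close to zero.
confident_loss = cross_entropy([10.0, 0.0, 0.0, 0.0, 0.0], target=0)
```

A model that has learned nothing about the five varieties would therefore start with a loss of about 1.609.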
The models are trained for 25 epochs. After each training epoch, the accuracy of the model on the validation set is computed. The model weights from the epoch with the highest validation accuracy are saved as the best-performing model. After training, the accuracy of the model on the testing set is computed.
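The "keep the best epoch" logic can be sketched independently of any deep learning library. In the sketch below, `train_one_epoch` and `evaluate` are hypothetical callables standing in for the real training and validation passes, the model state is a plain dictionary, and the accuracy values are made up purely to exercise the code. Ties are resolved in favor of the earliest epoch, which is an assumption.

```python
import copy

def train_with_checkpointing(state, train_one_epoch, evaluate, num_epochs=25):
    """Train for num_epochs and keep a copy of the state with the best validation accuracy."""
    best_acc = float("-inf")
    best_state = None
    best_epoch = None
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(state)
        val_acc = evaluate(state)
        if val_acc > best_acc:  # strict comparison: the earliest best epoch wins ties
            best_acc, best_epoch = val_acc, epoch
            best_state = copy.deepcopy(state)  # snapshot, so later epochs cannot overwrite it
    return best_state, best_epoch, best_acc

# Toy stand-ins: "training" bumps a counter, "validation" reads made-up accuracies.
fake_val_accs = [0.5, 0.7, 0.9, 0.8]

def fake_train(state):
    state["epochs_done"] += 1

def fake_eval(state):
    return fake_val_accs[state["epochs_done"] - 1]

best_state, best_epoch, best_acc = train_with_checkpointing(
    {"epochs_done": 0}, fake_train, fake_eval, num_epochs=4
)
print(best_epoch, best_acc)  # 3 0.9
```

The deep copy matters: without it, the saved "best" state would keep changing as training continues.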
Models
For all the models, we want to take advantage of what was already learned by training on the ImageNet dataset. We therefore load pretrained versions of the models and finetune only the weights of the output layer, which we replace so that its output size matches the number of classes in our dataset. The architectures for each model are available in the source code linked in the title of each model below.
AlexNet
AlexNet is a landmark CNN model with 8 learned layers (5 convolutional and 3 fully connected). It was proposed by Alex Krizhevsky and his colleagues and won the ImageNet Large Scale Visual Recognition Challenge in 2012. We load the pretrained version of this model and then replace the fully connected output layer with a newly initialized fully connected layer that has 5 output nodes (for the 5 rice varieties in the dataset). We then initialize the optimizer such that only the parameters of this newly initialized layer are optimized, so the weights of the earlier layers loaded from pretraining are not finetuned.
ResNet18
ResNet is another landmark CNN model that won the ImageNet challenge in 2015, and the paper introducing it is among the most cited works in deep learning. The model was proposed by Kaiming He and his colleagues. We load the pretrained version of this model and then replace the fully connected output layer with a newly initialized fully connected layer that has 5 output nodes (for the 5 rice varieties in the dataset). We then initialize the optimizer such that only the parameters of this newly initialized layer are optimized, so the weights of the earlier layers loaded from pretraining are not finetuned. There are many variants of ResNet, such as ResNet18, ResNet34, and ResNet50, but we chose ResNet18 because its 18 layers make it the closest of these variants to AlexNet in depth.
SqueezeNet (1.1)
SqueezeNet is a smaller CNN model that was designed as a more compact replacement for AlexNet. It has almost 50x fewer parameters, reportedly runs about 3x faster, and achieves accuracy comparable to AlexNet on the ImageNet dataset. SqueezeNet was developed by researchers at DeepScale, Stanford University, and the University of California, Berkeley. It was proposed in a paper called SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. SqueezeNet has no fully connected output layer, so we load the pretrained version of this model and replace the last convolutional layer with a newly initialized convolutional layer that has 5 output channels (for the 5 rice varieties in the dataset). We then initialize the optimizer such that only the parameters of this newly initialized convolutional layer are optimized, so the weights of the earlier layers loaded from pretraining are not finetuned.
Optimizers
For all the optimizers, we used the same initial learning rate of 0.01. For SGD and RMSprop, we also use a momentum value of 0.9 as is standard practice.
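The three optimizer configurations can be sketched with torch.optim as below. The helper `make_optimizer` and the stand-in `head` layer are hypothetical names for this example, and every hyperparameter other than the learning rate and momentum is left at the library default, which is an assumption.

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in for a newly initialized output layer (for example, the ResNet18 head).
head = nn.Linear(512, 5)

def make_optimizer(name, params, lr=0.01, momentum=0.9):
    """Build one of the three optimizers with the settings described above."""
    if name == "sgd":
        return optim.SGD(params, lr=lr, momentum=momentum)
    if name == "adadelta":
        return optim.Adadelta(params, lr=lr)  # Adadelta has no momentum argument
    if name == "rmsprop":
        return optim.RMSprop(params, lr=lr, momentum=momentum)
    raise ValueError(f"unknown optimizer: {name}")

optimizers = {
    name: make_optimizer(name, head.parameters())
    for name in ("sgd", "adadelta", "rmsprop")
}
```

Note that PyTorch's Adadelta defaults to a learning rate of 1.0, so passing 0.01 scales its updates down considerably.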
SGD
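In contrast to batch gradient descent, which computes the gradient over the entire training set before each update, mini-batch stochastic gradient descent (SGD) performs a parameter update for each batch of training examples. Compared with updating on a single example at a time, averaging the gradient over a batch reduces the variance of the parameter updates, which tends to give more stable convergence. Mini-batch updates are typically the method of choice when training neural networks.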
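To make the per-batch update concrete, here is a minimal pure-Python sketch that fits a single weight on a toy regression problem. The data, learning rate, and batch size are made up for illustration and differ from the experimental settings, and momentum is omitted to keep the update rule visible.

```python
import random

def minibatch_sgd(xs, ys, lr=0.1, batch_size=2, epochs=20, seed=0):
    """Fit y = w * x by mini-batch SGD on half the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new order each epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Gradient of 0.5 * mean((w * x - y)^2), computed on this batch only.
            grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # one parameter update per batch
    return w

xs = [0.5, 1.0, 1.5, 2.0, 0.5, 1.0, 1.5, 2.0]
ys = [3.0 * x for x in xs]  # noise-free targets generated with w = 3
w_hat = minibatch_sgd(xs, ys)
print(w_hat)  # very close to 3.0
```

With 8 examples and a batch size of 2, the weight is updated 4 times per epoch instead of once.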
Adadelta
Adadelta is an extension of Adagrad. Adagrad adapts the learning rate per parameter, making smaller updates for parameters associated with frequently occurring features and larger updates for parameters associated with infrequent features. Because Adagrad accumulates all past squared gradients, its learning rate decreases monotonically. Adadelta counters this by restricting the window of accumulated past gradients to a fixed size.
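A minimal pure-Python sketch of the Adadelta update for a single parameter, following the update rule in Zeiler (2012), is shown below. In practice the fixed-size window is implemented as an exponentially decaying average rather than a stored list of past gradients. The decay rate `rho`, the constant `eps`, and the toy objective f(x) = x^2 are assumptions made for this example, and the sketch omits the extra learning rate factor that PyTorch's implementation applies.

```python
import math

def adadelta_minimize(grad_fn, x, rho=0.9, eps=1e-6, steps=100):
    """Run Adadelta on one scalar parameter and return its final value."""
    avg_sq_grad = 0.0   # decaying average of squared gradients
    avg_sq_delta = 0.0  # decaying average of squared updates
    for _ in range(steps):
        g = grad_fn(x)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # Step size is the ratio of the two running RMS values; no global learning rate.
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        x += delta
    return x

# Toy objective f(x) = x^2 with gradient 2x, starting from x = 1.
x_final = adadelta_minimize(lambda x: 2.0 * x, x=1.0)
```

Because the running average of updates starts at zero, the first steps are tiny and the parameter moves toward the minimum slowly at first.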
RMSprop
RMSprop addresses the same issue with Adagrad that Adadelta does, namely its rapidly diminishing learning rates. Like Adadelta, RMSprop divides the learning rate by the square root of an exponentially decaying average of squared gradients.
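The RMSprop update for a single parameter can be sketched in a few lines of pure Python. The decay rate `alpha`, the constant `eps`, the step count, and the toy objective f(x) = x^2 are assumptions for this example. The momentum term used in the experiments is left out so that the scaling by the running average stays visible.

```python
import math

def rmsprop_minimize(grad_fn, x, lr=0.01, alpha=0.9, eps=1e-8, steps=20):
    """Run RMSprop (without momentum) on one scalar parameter."""
    avg_sq_grad = 0.0  # exponentially decaying average of squared gradients
    for _ in range(steps):
        g = grad_fn(x)
        avg_sq_grad = alpha * avg_sq_grad + (1 - alpha) * g * g
        # The learning rate is divided by the root of the running average.
        x -= lr * g / (math.sqrt(avg_sq_grad) + eps)
    return x

# Toy objective f(x) = x^2 with gradient 2x, starting from x = 1.
x_final = rmsprop_minimize(lambda x: 2.0 * x, x=1.0)
```

Since the gradient is divided by a running estimate of its own magnitude, each step has a size close to the learning rate regardless of how large the raw gradient is.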
Results
Here are the final test accuracies for the models.
AlexNet
Here are the loss and accuracy plots for AlexNet using the three different optimizers.
Loss plots:
Accuracy plots:
ResNet18
Here are the loss and accuracy plots for ResNet18 using the three different optimizers.
Loss plots:
Accuracy plots:
SqueezeNet
Here are the loss and accuracy plots for SqueezeNet using the three different optimizers.
Loss plots:
Accuracy plots:
Discussion
- Across the board, AlexNet is the best-performing model and Adadelta is the best-performing optimizer.
- SqueezeNet performs better than ResNet18 with SGD and Adadelta, but not with RMSprop.
  - Perhaps the learning rate decay of RMSprop is too aggressive.
- AlexNet has roughly 61M parameters, ResNet18 roughly 11M, and SqueezeNet roughly 1M.
- The loss and accuracy curves for SqueezeNet with Adadelta and RMSprop are not smooth.
  - Perhaps the lower number of parameters contributes to less stable learning when combined with learning rate decay.
References
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
- Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360.
- Krizhevsky, A. (2014). One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.
- Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
- https://ruder.io/optimizing-gradient-descent/
- https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf