An Intro to Deep Learning

Last month, I officially graduated from the University of Texas at Austin with a Master of Science in Data Science. Part of my final semester entailed taking a course about deep learning, a subset of machine learning that uses neural networks to model a wide variety of complex tasks. The course avoided a meandering, convoluted lesson plan by primarily focusing on computer vision to motivate and build intuition for fundamental deep learning concepts. Computer vision serves as an excellent introduction to deep learning because it requires operations on high-dimensional data, specifically images or video, emphasizing the implications of dimensionality and the structure of a deep neural network.

This was one of the most pragmatic courses in the program: many lectures directly discussed data collection pipelines, training methods, and neural network architectures from influential research papers within computer vision literature. These concepts then became focal points for homework assignments, presenting the opportunity - and the challenge - to implement several popular and practical deep learning techniques. Unfortunately, I can’t share any code from these assignments; however, I can still summarize their objectives to showcase a few capabilities of deep learning and explain some of the interesting things I learned this semester.

The course had a total of five homeworks and one final project, each focusing on the implementation of a distinct computer vision task. All assignments were coded in Python, using the PyTorch library as a framework for deep learning operations and an open-source racing video game called SuperTuxKart to motivate the tasks and generate data.

Image Classification 

The first two homeworks were dedicated to image classification: a fundamental computer vision task that seeks to classify an entire image under a single category. For Homework 1, our first objective was to coalesce image files into a dataset compatible with PyTorch’s DataLoader class. This step ensured that the images were converted to appropriate data types (PyTorch tensors) and easily iterable when eventually training the deep neural networks. The second part of the assignment was to define classes for two different types of fully connected neural networks (FCNNs): a simple linear classifier and a multilayer perceptron model. Although these were relatively basic models, once trained, they achieved classification accuracies of 75% and 80%, respectively.
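
Since I can’t share the assignment code itself, here is a generic sketch of what a PyTorch-compatible dataset and a small multilayer perceptron might look like. The image size, number of classes, and file-loading details are placeholders rather than the course’s actual setup.

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image

class ImageDataset(Dataset):
    """Minimal dataset wrapping a list of (image path, integer label) pairs."""
    def __init__(self, samples):
        self.samples = samples                      # e.g. [("frame_0001.png", 2), ...]
        self.to_tensor = transforms.ToTensor()      # PIL image -> float tensor in [0, 1]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        image = Image.open(path).convert("RGB")
        return self.to_tensor(image), label

class MLPClassifier(nn.Module):
    """A small multilayer perceptron: flatten the image, then two linear layers."""
    def __init__(self, in_features=3 * 64 * 64, hidden=128, num_classes=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)

# Batches then come straight from the DataLoader during training:
# loader = DataLoader(ImageDataset(samples), batch_size=64, shuffle=True)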

Despite achieving moderate success in this particular image classification context, FCNNs are typically a poor choice for computer vision tasks: when working with image data, they are computationally inefficient, overparameterized, and fail to learn spatial patterns. For Homework 2, we circumvented the shortcomings of FCNNs by utilizing convolution operations to implement a convolutional neural network (CNN). Convolutions are ubiquitous in image processing and computer vision; in short, they are versatile operations capable of capturing spatial patterns while simultaneously transforming the dimensionality of an input. As a result, one of the most tedious processes of designing a convolutional neural network is ensuring that the dimensions of the network’s layers properly accommodate the data as it is passed through a series of convolution operations.
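
As a quick illustration of that bookkeeping, a convolution’s spatial output size follows floor((W + 2*padding - kernel) / stride) + 1, which is easy to sanity-check in PyTorch with a dummy tensor (the sizes below are arbitrary):

import torch
from torch import nn

x = torch.randn(1, 3, 64, 64)                                  # one 3-channel, 64x64 image
conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)    # halves the spatial resolution
print(conv(x).shape)                                           # torch.Size([1, 16, 32, 32])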

After defining the architecture and implementing several optimizations, such as input normalization, batch normalization, data augmentations, and residual blocks, the CNN achieved a classification accuracy of 94%, a much better result than the FCNNs.
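
For reference, a residual block in PyTorch can be as simple as the sketch below; this is a generic version of the idea rather than the exact block I used.

import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with batch normalization, plus an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.block(x))   # add the input back in before the final activation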

Image Segmentation

In a natural progression from the first two homeworks, Homework 3 focused on image segmentation: a paradigm that seeks to classify each pixel within an image rather than the entire image. This task requires the output of the image segmentation network to have the same resolution (width and height in pixels) as the input image. Although this seems simple, returning an input to its original resolution after a series of convolution layers requires a considerable amount of forethought. For my network, I emulated ideas from the U-Net architecture and used a series of convolution blocks, max pooling operations, and skip connections to achieve a valid network structure.
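
To give a sense of how skip connections let the output return to the input resolution, here is a heavily simplified U-Net-style sketch with a single downsampling stage; the channel counts are placeholders, and this is not my actual network.

import torch
from torch import nn

class TinyUNet(nn.Module):
    """One encoder level, one decoder level: max pooling down, transposed convolution up,
    with the encoder features concatenated back in (a skip connection)."""
    def __init__(self, in_channels=3, num_classes=5):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, kernel_size=1)    # per-pixel class scores

    def forward(self, x):                          # assumes even input width and height
        e1 = self.enc1(x)                          # full-resolution features
        e2 = self.enc2(self.pool(e1))              # half-resolution features
        d = self.up(e2)                            # back up to full resolution
        d = self.dec(torch.cat([d, e1], dim=1))    # skip connection: reuse encoder features
        return self.head(d)                        # same width and height as the input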

Many image segmentation problems face the issue of large class imbalances. In SuperTuxKart, interesting objects like karts, hazards, and pickup items only appear in about 4% of the total pixels in a frame on average, so most of the frame consists of just the racetrack and background details. This class imbalance necessitated a weighted loss function, meaning that uncommon classes had a greater influence when updating the network’s parameters during training. After implementing my U-Net-inspired architecture and training optimizations, the segmentation network was quite successful.
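
In PyTorch, this kind of weighting can be expressed by passing per-class weights to the cross-entropy criterion. The weights and class ordering below are purely hypothetical, just to show the mechanics:

import torch
from torch import nn

# Suppose classes 0-1 are the dominant track/background classes and classes 2-4 are
# the rare karts, hazards, and pickup items; the rare classes get larger weights.
class_weights = torch.tensor([0.2, 0.2, 1.0, 1.0, 1.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(4, 5, 96, 128)                 # (batch, num_classes, height, width)
targets = torch.randint(0, 5, (4, 96, 128))         # (batch, height, width) of class ids
loss = criterion(logits, targets)                   # rare-class pixels contribute more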

Object Detection

In Homework 4, we repurposed the segmentation network into a point-based object detector. This time we focused specifically on detecting the three most interesting SuperTuxKart classes: karts, hazards, and pickup items. Instead of predicting a class for each pixel, this network took a SuperTuxKart frame as an input and predicted a three-channel heatmap (one channel per class) corresponding to the probability of each pixel being an object center for its respective class. The next step was to create a few helper functions that took the heatmap and returned the most probable object centers and their bounding boxes. Since the network architecture was very similar to the segmentation network, the toughest challenge of this homework was determining the appropriate data processing to perform in the helper functions and conceptualizing the end-to-end procedure of transforming an image into a meaningful set of coordinates and bounding boxes representing detected objects.

TensorBoard output including the original SuperTuxKart frame (image), the ground truth object locations (label), and the model’s prediction of object locations (pred).
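
One common way to turn such a heatmap into object centers is to keep only the pixels that are local maxima of their neighborhood, which can be done compactly with a max-pooling trick. The helper below is a generic sketch along those lines, not the assignment’s implementation:

import torch
from torch.nn import functional as F

def extract_peaks(heatmap, max_det=30, min_score=0.0):
    """Return up to max_det local maxima of a single-channel heatmap as (score, x, y) tuples."""
    # A pixel is a peak if it equals the maximum of its 3x3 neighborhood.
    pooled = F.max_pool2d(heatmap[None, None], kernel_size=3, stride=1, padding=1)[0, 0]
    is_peak = (heatmap == pooled) & (heatmap > min_score)
    scores = heatmap[is_peak]
    ys, xs = torch.nonzero(is_peak, as_tuple=True)
    order = scores.argsort(descending=True)[:max_det]    # keep the highest-scoring peaks
    return [(scores[i].item(), xs[i].item(), ys[i].item()) for i in order]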

Vision-Based Self-Driving Kart

My favorite assignment of the course was Homework 5, where we were tasked with designing a self-driving kart in SuperTuxKart. The first step of this process was to implement an automated controller that would determine inputs for the kart, such as steering and acceleration percentages, based on the screen coordinates of the center of the road. With an effective automated controller, we generated training data by simulating races on several tracks, using the game engine to record each frame along with the aforementioned screen coordinates. This approach allowed us to use a CNN with a linear output layer to predict the screen coordinates using only a frame from the game, enabling a completely autonomous self-driving kart when paired with the controller. Once the CNN was trained, my self-driving kart completed every track within the assignment-defined time limit, indicating a successful system.

This is my self-driving kart completing a racetrack. The red circle represents the true center of the road, while the green circle represents the neural network's prediction of the center of the road.
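
Conceptually, the prediction network is just a CNN regressor: convolutional features followed by a linear output layer that produces a single (x, y) coordinate. The sketch below is a minimal stand-in with made-up layer sizes, not my submitted model:

import torch
from torch import nn

class AimPointNet(nn.Module):
    """CNN that maps a game frame to one (x, y) screen coordinate for the road center."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # global average pooling
        )
        self.regressor = nn.Linear(64, 2)            # linear output layer -> (x, y)

    def forward(self, frame):
        return self.regressor(self.features(frame).flatten(1))

# Training would minimize a regression loss (e.g. MSE) between the predicted
# coordinates and the coordinates recorded by the game engine.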

Imitation Learning

For the course's final project, we were tasked with programming a system to score as many goals as possible in SuperTuxKart’s ice hockey game mode. This project was almost entirely open-ended, and everything from brainstorming to data collection to network design needed to be completed from scratch. The single constraint was that the karts’ movements had to be determined exclusively by a deep neural network.

There were two possible approaches to the project: a vision-based system or a state-based system. Since the game state data contained information about everything that seemed relevant to the task, like puck position and velocity, goal line coordinates, and comprehensive kart information, I decided to go with the state-based approach. 

This is a SuperTuxKart ice hockey match played between my model (red team) and a computer opponent (blue team). My model won this match 2-1.
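
With a state-based approach, the policy network itself can be quite small: an MLP that maps a flattened game-state vector to continuous kart controls. The sketch below is purely illustrative, with placeholder input and output dimensions:

import torch
from torch import nn

class StatePolicy(nn.Module):
    """MLP mapping game-state features (puck, goal, and kart information)
    to kart controls such as acceleration and steering."""
    def __init__(self, state_dim=17, hidden=128, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return self.net(state)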

In addition to writing code, the project required a formal, journal publication-style research paper to provide background information and report the project’s results. With the course staff’s permission, I’m able to share the report, so if you are interested in reading about some of the more technical details and insights from the project, you can check it out here.

I am extremely happy that one of the final classes in the program was among my favorites. The learning-by-doing approach in this class helped build invaluable intuition about deep learning - from the process of optimizing and debugging to designing and interacting with complex systems. Although my formal education has come to an end for the foreseeable future, I will definitely continue learning and am excited to begin new projects.

Resources:

U-Net: Convolutional Networks for Biomedical Image Segmentation

What is a Convolutional Neural Network?

Convolution Visualizer by Edward Z. Yang

CNN Visualizer by Adam Harley
