Advanced Regression Techniques

Going into my first semester of graduate school, I was excited to see how my new courses would extend and expand upon the topics covered in my undergraduate Statistics coursework. I had worked with linear regression and statistical models pretty frequently during my undergrad, so I was relieved when my “Foundations of Regression & Modeling” course exposed me to many new concepts and techniques. This proof-based course showcased plenty of challenging derivations in different regression settings and was very rewarding due to its rigorous nature. Now that I’ve completed the course, I’d like to share some of the most interesting and useful concepts I learned throughout the semester. 

Ridge Regression

In October 2022, I published How Linear Regression Works, where I explained the fundamental ideas and motivations of linear regression. In that post, I briefly mentioned the concept of multicollinearity and how it can inflate the variance of a model's coefficient estimates to the point that they become unreliable. At the time, my undergraduate-level statistics courses explained how to identify collinearity and understand its implications, but did not offer any reasonable way to address the problem.

Very early in the semester, my graduate course introduced the idea of ridge regression: a modification of the traditional least squares estimator that reduces the estimators' variances by adding a penalty constant to the diagonal of the X'X matrix, which shifts its small eigenvalues away from zero.* The reduction in variance comes at the expense of a slight bias, an instance of the bias-variance tradeoff, but ultimately the ridge estimates are expected to have a lower mean squared error than the least squares estimates.

*Since many of these concepts require further mathematical reasoning, some ideas may seem slightly abstract without an example. In hopes of elucidating each concept, I made an R Markdown document providing derivations and simulation studies.
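To see the effect concretely, here is a minimal sketch (separate from that R Markdown document) using the glmnet package, where alpha = 0 requests the ridge penalty. The two simulated predictors are deliberately made nearly collinear, and cv.glmnet picks the penalty by cross-validation; the data and coefficient values are arbitrary illustrations, not anything from the course.

```r
# Ridge regression sketch with glmnet (alpha = 0 gives the ridge penalty)
library(glmnet)

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)          # nearly collinear with x1
y  <- 2 * x1 - 1 * x2 + rnorm(n)
X  <- cbind(x1, x2)

# Ordinary least squares: estimates are unstable when x1 and x2 are nearly collinear
coef(lm(y ~ x1 + x2))

# Ridge regression: cv.glmnet chooses the penalty (lambda) by cross-validation
cv_ridge <- cv.glmnet(X, y, alpha = 0)
coef(cv_ridge, s = "lambda.min")        # shrunken, more stable coefficient estimates
```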

Variable Selection: LASSO

Variable selection is one of the most essential components of creating a regression model. In my undergrad, forward or backward stepwise selection and best subset regression were the typical methods of variable selection, but each of those quickly becomes computationally impractical as the number of predictor variables in a model grows.

The least absolute shrinkage and selection operator, more commonly known as LASSO, is one way to solve this problem. It is similar to ridge regression in the sense that it reduces the variance of each estimator, but it can also set coefficient estimates to exactly zero.* For LASSO to be most effective, it needs a well-chosen tuning parameter: if the parameter is too high, all of the estimates in the model will be zero, and if it is too low, you will just get the least squares estimates back, both of which are undesirable outcomes.
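As a rough sketch of how that tuning parameter gets chosen in practice (not the course's derivation), cv.glmnet with alpha = 1 requests the lasso penalty and selects the tuning parameter by cross-validation. With simulated data in which only a few predictors matter, most of the resulting coefficients come out exactly zero.

```r
# LASSO sketch with glmnet (alpha = 1 gives the lasso penalty)
library(glmnet)

set.seed(2)
n <- 200
p <- 20
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))     # only the first three predictors matter
y <- drop(X %*% beta + rnorm(n))

# Cross-validation chooses the tuning parameter (lambda)
cv_lasso <- cv.glmnet(X, y, alpha = 1)

cv_lasso$lambda.min                      # the selected tuning parameter
coef(cv_lasso, s = "lambda.min")         # most coefficients are exactly zero
```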

Since the ideas behind LASSO are similar to those of ridge regression, combining the two has become popular in the form of elastic net regularization, which mitigates the shortcomings of each technique applied individually. We didn't go into the details of elastic nets in the course, but after researching them for this post, I'm certain I will see them crop up soon.
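For completeness, the same glmnet interface covers the elastic net: an alpha value strictly between 0 and 1 blends the ridge and lasso penalties. The snippet below continues the lasso sketch above; alpha = 0.5 is an arbitrary illustrative choice, not a recommendation.

```r
# Elastic net: alpha between 0 and 1 mixes the ridge and lasso penalties
# (reuses X and y from the lasso sketch above)
library(glmnet)

cv_enet <- cv.glmnet(X, y, alpha = 0.5)
coef(cv_enet, s = "lambda.min")
```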

*A coefficient of exactly zero drops that variable from the model, implying it is not important for prediction.

Resulting coefficient estimates using several different regression techniques. See my R Markdown File for further details.

Mediation and the Sobel Test

All regression techniques are based on the assumption that one or more independent variables directly influence a dependent variable. Mediation refers to a phenomenon where one independent variable influences another independent variable, which in turn influences the dependent variable. For example, consider a regression model for a financial transaction dataset that includes a consumer's spending, education level, and disposable income. Both education level and disposable income are likely statistically significant predictors of consumer spending, but it may be that education level influences spending mostly indirectly, through its effect on disposable income. In this scenario, you can perform a Sobel test to formally evaluate whether disposable income is a statistically significant mediator variable. Identifying mediation can be quite difficult, even in scientific settings, so it often requires the discretion of an expert in the relevant field of study.
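The Sobel statistic itself is easy to compute from two fitted regressions. The sketch below uses simulated, hypothetical data standing in for the spending example (the variable names and effect sizes are made up): path a is the effect of the predictor on the mediator, path b is the effect of the mediator on the response after adjusting for the predictor, and the test asks whether the indirect effect a*b is significantly different from zero.

```r
# Sobel test sketch with simulated (hypothetical) data:
# education -> disposable income (mediator) -> consumer spending
set.seed(3)
n <- 500
education <- rnorm(n)
income    <- 0.6 * education + rnorm(n)                 # hypothetical mediator
spending  <- 0.8 * income + 0.1 * education + rnorm(n)

# Path a: predictor -> mediator; Path b: mediator -> response, adjusting for the predictor
fit_a <- lm(income ~ education)
fit_b <- lm(spending ~ income + education)

a   <- coef(fit_a)["education"]
s_a <- coef(summary(fit_a))["education", "Std. Error"]
b   <- coef(fit_b)["income"]
s_b <- coef(summary(fit_b))["income", "Std. Error"]

# Sobel z-statistic for the indirect effect a * b
z <- (a * b) / sqrt(b^2 * s_a^2 + a^2 * s_b^2)
c(z = unname(z), p_value = unname(2 * pnorm(-abs(z))))
```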

Iterative Algorithms

When working with large datasets and advanced regression techniques, a closed-form solution like that of ordinary least squares may not exist or may be too expensive to compute, so iterative algorithms are required to calculate estimates. Iterative algorithms, also known as numerical methods, rely on first-order (and often second-order) derivatives to find a local minimum or maximum of a function of interest within the regression method. You know that you've found optimal estimates when the algorithm converges, meaning successive iterations no longer change the result beyond some small tolerance. This is a fundamental concept in machine learning, and I discuss it further in my How Neural Networks Learn post.
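As a toy illustration of convergence (not an algorithm from the course), the sketch below minimizes an ordinary least squares objective with plain gradient descent, stopping once successive iterates stop changing, and compares the result with the closed-form answer from lm(). The step size and tolerance are arbitrary illustrative choices.

```r
# Gradient descent sketch for a least squares objective: minimize sum((y - X b)^2)
set.seed(4)
n <- 100
X <- cbind(1, rnorm(n))                    # intercept column plus one predictor
y <- drop(X %*% c(1, 2) + rnorm(n))        # true coefficients are (1, 2)

b    <- c(0, 0)                            # starting values
rate <- 0.001                              # step size
for (iter in 1:10000) {
  grad  <- -2 * t(X) %*% (y - X %*% b)     # first-order derivative of the objective
  b_new <- b - rate * grad
  if (max(abs(b_new - b)) < 1e-8) break    # converged: the iterates no longer change
  b <- b_new
}

cbind(gradient_descent = as.numeric(b), closed_form = coef(lm(y ~ X[, 2])))
```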

Poisson Regression and Generalized Linear Models

In most regression settings, the response variable is continuous and assumed to be normally distributed. When either of those conditions is not met, you must be strategic about your regression model. Poisson regression is one method that works well when your response variable consists of nonnegative counts, such as the number of events occurring in a fixed time interval. The Poisson distribution is a natural regression candidate because it belongs to the exponential family and has only one parameter to estimate. The math behind the calculations is also very similar to other regression techniques, with the caveat that you estimate the coefficients by maximizing the Poisson likelihood rather than by minimizing a residual sum of squares.
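To make that likelihood step concrete, here is a sketch that writes out the negative log-likelihood of a one-predictor Poisson model with a log link and hands it to R's general-purpose optimizer; the data are simulated and the coefficient values are arbitrary.

```r
# Poisson regression by direct maximum likelihood (log link, one predictor)
set.seed(5)
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.8 * x))   # simulated counts; true coefficients are (0.5, 0.8)

# Negative log-likelihood of a Poisson model with log link: mean = exp(b0 + b1 * x)
neg_loglik <- function(beta) {
  mu <- exp(beta[1] + beta[2] * x)
  -sum(dpois(y, lambda = mu, log = TRUE))
}

# Maximizing the likelihood is the same as minimizing its negative; optim() does this iteratively
mle <- optim(c(0, 0), neg_loglik)
mle$par                                      # estimated (b0, b1)
```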

Another approach to discrete data is a generalized linear model (GLM): a versatile regression model designed to work with data from any distribution in the exponential family. It uses a link function to relate the mean of the response to a linear combination of the predictors, and then uses maximum likelihood estimation to find the model's parameters. By this definition, Poisson regression can be implemented as a GLM, and although the fitting process is slightly different, it arrives at the same estimates. Wikipedia has an excellent article about the history and overview of GLMs, so if you're a statistics nerd like me, I highly recommend that you check it out.
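Fitting the same simulated data with glm() and its built-in Poisson family (the log link is the default) should land on essentially the same estimates as the hand-rolled likelihood above.

```r
# The same Poisson model fit as a GLM (reuses x and y from the sketch above)
fit_glm <- glm(y ~ x, family = poisson)
coef(fit_glm)                                # should closely match mle$par
```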

Nonparametric Regression: Splines and Kernel Methods

The regression models discussed so far come with fairly strict assumptions: linear regression assumes normally distributed errors, and generalized linear models, while more flexible, assume that the data come from a distribution in the exponential family. When you cannot comfortably make any such assumption about your data, you may choose to perform nonparametric regression, which aims to find a function that fits your data well without imposing a predefined parametric form; this can be accomplished using splines or kernel methods.

Kernel smoothing uses a kernel function to compute a weighted average of the responses at nearby predictor values, producing a smooth function across the predictor space. Splines accomplish smoothing by fitting the data with piecewise polynomial functions, with the pieces joined at points called knots. Both of these methods are highly customizable: there are many kernel functions to choose from, and several ways to define the polynomial degrees and knot locations of splines.
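Base R happens to ship with simple versions of both tools: ksmooth() for kernel smoothing and smooth.spline() for smoothing splines. The sketch below fits both to the same simulated nonlinear data; the bandwidth is an illustrative choice rather than a tuned one.

```r
# Nonparametric regression sketch: kernel smoothing and a smoothing spline (base R)
set.seed(6)
n <- 200
x <- sort(runif(n, 0, 10))
y <- sin(x) + rnorm(n, sd = 0.3)             # nonlinear signal plus noise

fit_kernel <- ksmooth(x, y, kernel = "normal", bandwidth = 1)  # weighted local averages
fit_spline <- smooth.spline(x, y)            # piecewise cubic polynomials joined at knots

plot(x, y, col = "grey")
lines(fit_kernel, col = "blue", lwd = 2)     # kernel smoother
lines(fit_spline, col = "red",  lwd = 2)     # smoothing spline (smoothness chosen by GCV)
```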

Splines and kernel methods require a fine balance between the function's fit to the data and its smoothness. The dynamic is the bias-variance tradeoff again: if the fitted function follows the data too closely, it overfits and has high variance, and if it is too smooth, it underfits and has high bias.
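Continuing the sketch above, pushing the spline's smoothing parameter toward its extremes makes the tradeoff visible: a very small spar chases the noise (high variance), while a very large spar flattens out the signal (high bias).

```r
# Over- and under-smoothing (continues the sketch above; assumes its plot is still open)
lines(smooth.spline(x, y, spar = 0.2), col = "orange", lty = 2)  # wiggly fit: overfits the noise
lines(smooth.spline(x, y, spar = 1.0), col = "purple", lty = 2)  # overly smooth fit: misses the signal
```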

Over the semester, I gained a fresh perspective on regression and an understanding of some topics I had only heard about in passing during my undergrad. Many of the best moments of this course embodied my favorite parts of statistics and math: the sense of accomplishment after finally understanding an abstract concept, and the appreciation for the clever math that makes it work. This summer, I’ll take the second part of this course, “Advanced Predictive Models for Complex Data”, and I’m looking forward to seeing how learning more comprehensive models can improve my skills as a data scientist.
