## The theory

**Multinomial logistic regression** is a statistical technique used for predicting the outcome of a **categorical dependent variable** based on one or more independent variables. It is similar to binary logistic regression, but is used when the **dependent variable has more than two categories**.

The theoretical foundation of multinomial logistic regression is based on the idea of **using probability to predict the outcome of a categorical dependent variable**. The algorithm estimates the probability that an observation belongs to each category of the dependent variable, and then assigns the observation to the category with the highest probability.

To do this, the algorithm uses a **logistic function** to model the relationship between the dependent variable and the independent variables. The logistic function is used to transform the output of the model into probabilities, which can then be used to make predictions about the dependent variable.
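
In the multinomial case, the logistic function generalizes to the **softmax** function, which maps one linear score per class to a set of probabilities that sum to one. A minimal sketch in plain NumPy (the score values are made up for illustration):

```python
import numpy as np

def softmax(scores):
    """Map a vector of per-class linear scores to probabilities that sum to 1."""
    exp = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exp / exp.sum()

# hypothetical linear scores for three classes (in a fitted model these
# would come from x @ coefficients + intercept)
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)           # ~ [0.659 0.242 0.099]
print(probs.argmax())  # predicted class: 0
```

The observation is assigned to the class whose probability is largest, exactly as described above.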

The coefficients of the model are estimated using **maximum likelihood estimation**, which is a method for estimating the parameters of a statistical model based on the observed data. The goal of maximum likelihood estimation is to find the values of the coefficients that maximize the likelihood of the observed data, given the model.
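
Maximizing the likelihood is equivalent to minimizing the average negative log-likelihood (the cross-entropy loss), which is the quantity most optimizers actually work with. A small sketch with made-up predicted probabilities:

```python
import numpy as np

def neg_log_likelihood(probs, y):
    """Average negative log-likelihood of the true labels y under the
    predicted class probabilities probs (one row per observation)."""
    n = len(y)
    return -np.mean(np.log(probs[np.arange(n), y]))

# toy predicted probabilities for 3 observations over 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])
y = np.array([0, 1, 2])  # true classes

nll = neg_log_likelihood(probs, y)
print(nll)  # ~ 0.424; coefficients that lower this value fit the data better
```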

Once the model has been trained, it can be used to make predictions about the dependent variable by inputting new values for the independent variables and estimating the probability that the observation belongs to each category of the dependent variable. The observation is then assigned to the category with the highest probability.

Overall, multinomial logistic regression is a powerful and widely-used tool for predicting categorical outcomes in a wide range of applications.

## The code

To build a multinomial logistic regression model in **Python**, we can use the LogisticRegression class from the scikit-learn library. Here is an example:

```python
# import the necessary libraries
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# load the data (the iris dataset is used here as an example;
# replace with your own independent variables X and dependent variable y)
X, y = load_iris(return_X_y=True)

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# create the model
model = LogisticRegression(multi_class='multinomial', solver='newton-cg')

# fit the model on the training data
model.fit(X_train, y_train)

# make predictions on the test data
predictions = model.predict(X_test)

# evaluate the model performance
accuracy = model.score(X_test, y_test)
print(f'Test accuracy: {accuracy:.2f}')
```

To build a multinomial logistic regression model in **R**, we can use the multinom function from the nnet library. Here is an example:

```r
# install and load the necessary libraries
install.packages("nnet")
library(nnet)

# load the data (the iris data frame is used here as an example;
# replace with your own data frame and dependent variable)
data <- iris

# split the data into training and test sets
train_index <- sample(1:nrow(data), 0.8 * nrow(data))
train <- data[train_index, ]
test <- data[-train_index, ]

# create the model (Species is the dependent variable in iris)
model <- multinom(Species ~ ., data = train)

# make predictions on the test data
predictions <- predict(model, test)

# evaluate the model performance
accuracy <- mean(test$Species == predictions)
print(paste("Test accuracy:", accuracy))
```

## Things to pay attention to!

- It is important to note that multinomial logistic regression assumes that the **independent variables are independent of each other**, and that the log odds of the dependent variable are a linear combination of the independent variables. **Multicollinearity** is a common problem that can arise when working with logistic regression: it occurs when two or more independent variables are highly correlated with each other, which can lead to unstable and unreliable results.

**What is multicollinearity?**

Multicollinearity occurs when two or more independent variables are highly correlated with each other. This can be a problem because it can lead to unstable and unreliable results.

Imagine that you are using logistic regression to predict whether a customer will make a purchase based on their income and education level. If income and education level are highly correlated (e.g., people with higher education levels tend to have higher incomes), then it may be difficult to accurately determine the unique contribution of each variable to the prediction. This is because the two variables are highly dependent on each other, and it may be difficult to disentangle their individual effects.

**How does multicollinearity affect logistic regression?**

Multicollinearity can have several negative impacts on logistic regression:

- It can make the model difficult to interpret. If two or more independent variables are highly correlated, it may be difficult to determine the unique contribution of each variable to the prediction, which makes the results hard to interpret and meaningful conclusions hard to draw.

- It can lead to unstable and unreliable results. Multicollinearity can cause the coefficients of the model to change significantly when different subsets of the data are used, making the results difficult to replicate and potentially leading to incorrect conclusions.

- It can increase the variance of the model. Multicollinearity can inflate the variance of the coefficient estimates, which can lead to overfitting and poor generalization to new data.

**What can you do to address multicollinearity?**

There are several steps you can take to address multicollinearity in logistic regression:

- Identify correlated variables. The first step is to identify which variables are highly correlated with each other. You can use statistical methods, such as the variance inflation factor (VIF), to do this.

- Remove one of the correlated variables. If two variables are highly correlated with each other, you can remove one of them from the model to reduce multicollinearity.

- Combine correlated variables. Alternatively, you can combine correlated variables into a single composite variable. This can help reduce multicollinearity and improve the stability and reliability of the model.

- Use penalized regression methods. Penalized methods, such as ridge or lasso regression, reduce multicollinearity by adding a penalty term to the model that shrinks the coefficients of correlated variables toward zero.

Multicollinearity is a common problem that can arise when working with logistic regression. It can lead to unstable and unreliable results, as well as difficulty interpreting the results of the model. To address multicollinearity, you can identify correlated variables, remove one of the correlated variables, combine correlated variables, or use penalized regression methods. By taking these steps, you can improve the stability and reliability of your logistic regression model.

## How many model parameters will need to be estimated?

In a multinomial logistic regression analysis involving five outcomes (class labels) and three input variables, the number of model parameters that will need to be estimated depends on the number of categories in the dependent variable and the number of independent variables.

The number of model parameters that need to be estimated is equal to:

(number of categories in the dependent variable - 1) * (number of independent variables + 1)

For a multinomial logistic regression with five categories in the dependent variable and three independent variables, the number of model parameters that will need to be estimated is:

(5 - 1) * (3 + 1) = 16

Therefore, in this example, the model will need to estimate 16 parameters in order to make predictions about the dependent variable based on the independent variables.
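
The count given by the formula can be checked with a tiny helper (the function name is just for illustration):

```python
def n_multinomial_params(n_categories, n_predictors):
    """Parameters in a multinomial logistic regression with one reference
    category: (K - 1) sets of coefficients, each including an intercept."""
    return (n_categories - 1) * (n_predictors + 1)

print(n_multinomial_params(5, 3))  # 16
```

Note that some implementations (e.g. scikit-learn's multinomial solver) store a coefficient set for every one of the K classes rather than for K - 1 of them; that is an equivalent, over-parameterized form of the same model.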

It is also recommended to evaluate the model with metrics beyond accuracy, such as precision, recall, and F1-score. This will give a more complete picture of the model's performance.
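
In scikit-learn, the classification_report function prints per-class precision, recall, and F1-score in one call. A self-contained sketch, using the iris dataset as example data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# example data; replace with your own X and y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(multi_class='multinomial', solver='newton-cg')
model.fit(X_train, y_train)

# per-class precision, recall, and F1-score alongside overall accuracy
print(classification_report(y_test, model.predict(X_test)))
```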

## The books

*(Note: I participate in the affiliate amazon program. This post may contain affiliate links from Amazon or other publishers I trust (at no extra cost to you). I may receive a small commission when you buy using my links, this helps to keep the blog alive! See **disclosure** for details.)*

There are many excellent books that provide detailed descriptions of multinomial logistic regression. Here are a few that I would recommend:

- “Applied Multivariate Statistical Analysis” by Richard Johnson and Dean Wichern: This comprehensive textbook provides a thorough treatment of multinomial logistic regression, as well as other techniques for analyzing multivariate data. It is suitable for readers with a strong background in statistics and is often used as a textbook in advanced undergraduate and graduate-level courses.
- “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” by Trevor Hastie, Robert Tibshirani, and Jerome Friedman: This widely-acclaimed book is a must-read for anyone interested in statistical learning and data mining. It covers a wide range of topics, including multinomial logistic regression, and is written in a clear and accessible style.
- “Multivariate Analysis” by Joseph Hair, William Black, Barry Babin, and Rolph Anderson: This book provides a comprehensive overview of multivariate analysis techniques, including multinomial logistic regression. It is written in a clear and easy-to-understand style, making it suitable for both students and practitioners.

I hope this helps!

*This is a personal blog. My opinion on what I share with you is that “All models are wrong, but some are useful”. Improve the accuracy of any model I present and make it useful!*