Regression with Categorical Dependent Variables
Montserrat Guillén
- Frees, E.W. (2010). Regression modeling with actuarial and financial applications. Cambridge University Press. New York.
- Greene, W.H. (2011). Econometric analysis. 7th edition. Prentice Hall. New York.
DATA DESCRIPTION
Name | Content description
FullCoverage.csv | 4000 policy holders of motor insurance
VehOwned.csv | 2067 customers of an insurance firm who are offered unordered options
VehChoicePrice.csv | A situation similar to the previous example
LOGISTIC REGRESSION MODEL
In the logistic regression model the dependent variable is binary. This is the most popular model for binary dependent variables, and it is highly recommended to start from this setting before more sophisticated categorical modeling is carried out. The dependent variable yi can take only two possible outcomes. We assume yi follows a Bernoulli distribution with probability πi. The probability of the 'event' response, πi, depends on a set of individual characteristics xi, i = 1, . . ., n, where n is the number of observations.
- Specification
The logistic regression model specifies that:
\begin{equation} Pr(y_i=1|\mathbf{x}_i)=\pi_i=\frac{1}{1+\exp(-\mathbf{x}_i^\prime \beta)}=\frac{\exp(\mathbf{x}_i^\prime \beta)}{1+\exp(\mathbf{x}_i^\prime \beta)} \end{equation}
and the inverse of this relationship, called the link function in generalized linear models, expresses x'i β as a function of πi as:
\begin{equation} \mathbf{x}_i^\prime \beta=\ln \left(\frac{\pi_i}{1-\pi_i}\right)= logit(\pi_i). \end{equation}
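As a quick numerical check of this inverse relationship, the logistic function and the logit link undo each other; a minimal Python sketch (function names are our own, chosen for the illustration):

```python
import numpy as np

def logistic(z):
    """Inverse link: maps the linear predictor x'beta to a probability pi."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Link function: maps a probability pi back to the linear predictor."""
    return np.log(p / (1.0 - p))

# The two functions are inverses over the whole real line.
z = np.linspace(-3.0, 3.0, 7)
assert np.allclose(logit(logistic(z)), z)
```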
- Parameter estimation
The maximum likelihood method is the procedure used for parameter estimation and standard error estimation in logistic regression models. This method is based on the fact that responses from the observed units are independent and that the likelihood function can then be obtained from the product of the likelihood of single observations.
The likelihood of a single observation in a logistic regression model is simply the probability of the event that is observed, so it can be expressed as,
\begin{equation} Pr(y_i=1|\mathbf{x}_i)^{y_i}(1-Pr(y_i=1|\mathbf{x}_i))^{(1-{y_i})}=\pi_i^{y_i}(1-\pi_i)^{(1-y_i)}. \end{equation}
The log-likelihood function for a data set is a function of the vector of unknown parameters β. The observed values of yi and xi are given by the information in the data set on n individuals. Then, when observations are independent, we can write the log-likelihood function as, \begin{eqnarray*} L(\beta)&=&\ln \left[\prod_{i=1}^n Pr(y_i=1|\mathbf{x}_i)^{y_i}(1-Pr(y_i=1|\mathbf{x}_i))^{(1-{y_i})} \right] \\ &=&\sum_{i=1}^n \left[ {y_i}\ln Pr(y_i=1|\mathbf{x}_i) + (1-y_i)\ln(1-Pr(y_i=1|\mathbf{x}_i)) \right]. \end{eqnarray*}
Conventional software maximizes the log-likelihood function and provides the parameter estimates and their standard errors. Unless covariates are perfectly correlated, the parameter estimates exist and are unique.
- Script
- Results
MODELS FOR ORDINAL CATEGORICAL DEPENDENT VARIABLES
In ordinal categorical dependent variable models the responses have a natural ordering. This is quite common in insurance; an example is modeling the possible claiming outcomes as ordered categorical responses.
- Specification
Let us assume that an ordinal categorical variable has J possible choices. The most straightforward model in this case is the cumulative logit model, also known as the ordered logit. Let us denote by yi the choice of individual i for a categorical ordered response variable, and let πij be the probability that i chooses j, j=1,...,J, so that πi1+πi2+ ... + πiJ = 1. Response probabilities depend on the individual predictors; again, we assume they depend on x'i β. It is important to bear in mind that the ordered logit model concentrates on the cumulative probabilities Pr(yi ≤ j | xi ). Then,
\begin{equation} logit(Pr(y_i\le j|\mathbf{x}_i ))=\alpha_j+\mathbf{x}_i^\prime \beta. \end{equation} Note that, \begin{equation} logit(Pr(y_i\le j| \mathbf{x}_i))=\ln \left(\frac{Pr(y_i\le j|\mathbf{x}_i )} {1-Pr(y_i\le j|\mathbf{x}_i )}\right). \end{equation}
- Script
- Results
MODELS FOR NOMINAL CATEGORICAL DEPENDENT VARIABLES
Let us start with the generalized logit model. It is often called the multinomial logit model, even though the multinomial logit model, which we present later, is a bit more general; the generalized logit model is so widely used that the broader name is frequently applied to it. It is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables that measure individual risk factors.
- Specification
Let us denote by yi the choice of individual i for a nominal categorical response variable, and let πij be the probability that i chooses j, j=1,...,J and i=1,...,n, so that πi1+ πi2 + ... + πiJ = 1. The probabilities depend on the individual predictors, and we assume these choice probabilities depend on x'i βj.
We assume that the J-th alternative is the baseline choice. Then, the generalized logit regression model is specified as:
\begin{eqnarray} Pr(y_i=j|\mathbf{x}_i)&=&\frac{\exp(\mathbf{x}_i^\prime \beta_j)}{1+\sum_{k=1}^{J-1}\exp(\mathbf{x}_i^\prime \beta_k)}, \,\, j=1,...,J-1 \nonumber\\ Pr(y_i=J|\mathbf{x}_i)&=&\frac{1}{1+\sum_{k=1}^{J-1}\exp(\mathbf{x}_i^\prime \beta_k)}.\nonumber \end{eqnarray}
So there are J-1 vectors of parameters to be estimated, namely β1, β2, ... , βJ-1. We set vector βJ to zero for identification purposes.
- Script
- Results
MULTINOMIAL LOGISTIC REGRESSION MODEL
In the multinomial logistic regression model the characteristics can differ across the alternatives. This model is also known as the conditional logit model, because the covariates xij are specific to each alternative j rather than constant across choices.
- Specification
The multinomial logistic regression model specification is,
\begin{eqnarray} Pr(y_i=j|\mathbf{x}_{ij})=\frac{\exp(\mathbf{x}_{ij}^\prime \beta)}{\sum_{k=1}^{J}\exp(\mathbf{x}_{ik}^\prime \beta)}, \,\, j=1,...,J. \end{eqnarray}
There is only one vector of unknown parameters β, but we have J vectors of known characteristics xi1, xi2, ..., xiJ.
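Since the covariates xij vary over alternatives here, a minimal sketch can write down the conditional-logit log-likelihood directly and maximize it with SciPy (all data and coefficient values are invented for the illustration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, J, k = 2000, 3, 2
X = rng.normal(size=(n, J, k))          # alternative-specific characteristics x_ij
beta_true = np.array([1.0, -0.5])       # single shared coefficient vector (assumed)

util = X @ beta_true                    # n x J systematic utilities x_ij' beta
p = np.exp(util) / np.exp(util).sum(axis=1, keepdims=True)
y = (p.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)  # simulated choices

def neg_loglik(beta):
    u = X @ beta
    # log Pr(y_i = j) = u_ij - log sum_k exp(u_ik), summed over observations
    logsum = np.log(np.exp(u).sum(axis=1))
    return -(u[np.arange(n), y] - logsum).sum()

res = minimize(neg_loglik, np.zeros(k), method="BFGS")
print(res.x)    # estimates close to (1.0, -0.5)
```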
- Script
- Results
REFERENCES
[1] Allison, P. D. (1999). Logistic regression using
the SAS system: theory and application. Cary, NC: SAS
Institute.
[2] Cameron, A. C. and Trivedi, P. K. (2005)
Microeconometrics: methods and applications. Cambridge
University Press. New York.
[3] Frees, E. W. (2010). Regression modeling with
actuarial and financial applications. Cambridge
University Press. New York.
[4] Greene, W. H. (2011). Econometric analysis. 7th
edition. Prentice Hall. New York.
[5] Hilbe, J. M. (2009). Logistic regression models.
CRC Press, Chapman & Hall. Boca Raton, FL.
[6] Hosmer, D. W. and Lemeshow, S. (2000). Applied
logistic regression. 2nd edition. John Wiley & Sons.
New York.
[7] Long, J. S. (1997). Regression models for
categorical and limited dependent variables. Sage,
Thousand Oaks, CA.