Modelling Auto Insurance Claims in Singapore

Claim frequency data in general insurance may not follow the traditional Poisson distribution when there are many zeros. When the number of observed zeros exceeds the number expected under the Poisson distribution, extra dispersion appears. This paper summarizes several dispersed and zero-inflated count data models that are used to handle dispersion and excess zeros, and applies them to insurance claim count data with excess zeros. We use the chi-square goodness-of-fit test to check the assumed count distribution, and we fit count data regression models with predictors. We compare the fits through AIC and BIC. The generalized Poisson and negative binomial models provide a good fit to the data.


Introduction
The modeling of count data is of primary interest in many fields, such as insurance, public health, epidemiology and psychology. The Poisson model is the one most commonly used for count data. It assumes that the mean and variance are equal, but this assumption is violated in many applications. The situation in which the variance is larger (smaller) than the mean is known as over-dispersion (under-dispersion). Dispersion occurs when the single parameter of the Poisson distribution is unable to fully describe the event counts. For modelling dispersed data, common choices are the negative binomial (NB) model and the generalized Poisson (GP) model (introduced by Consul and Jain, 1973).
Generally, two sources of over-dispersion are identified: heterogeneity of the population and an excess of zeros. Heterogeneity is observed when the population can be divided into many homogeneous subpopulations. An excess of zeros is detected when the number of observed zeros is significantly higher than the number predicted by the fitted Poisson model.
In insurance, a precise ratemaking system allows insurers to cover expected losses and expenses and to make adequate provision for contingencies. The first step in ratemaking is to model the claim frequency distribution. Traditionally, the claim count distribution in general insurance is assumed to follow the Poisson or NB distribution (Thomas & Samson, 1987; Renshaw, 1994). Under the usual deductible agreement in general insurance policies, a claim is not created unless the loss exceeds the prescribed deductible limit. Furthermore, the no claim discount (NCD) system, which is widely adopted by automobile insurers, leads to excess zero claims because policyholders seldom make a claim if the amount to be claimed is small. Such practice results in excess zeros in the observed claim frequency distribution, even though the original accident count distribution follows a Poisson or NB distribution. Yip and Yau (2005) noted that claims in motor insurance data may contain excess zeros because deductibles and no claim discounts discourage insured drivers from reporting small claims.
Neyman (1939) and Feller (1943) introduced the concept of zero-inflation to address the problem of excess zeros. Zero-inflation is a particular type of over-dispersion that is handled specifically by zero-inflated count data models. It can be observed in many settings, including insurance, econometrics, medicine, engineering, manufacturing, public health, road safety and epidemiology; see, for example, Lambert (1992) and Bohning et al. (1999) (dental epidemiology). The zero-inflation phenomenon is common in general insurance practice, and appropriate modelling of such a data structure is necessary to fit the claim frequency distribution precisely. However, the use of the zero-inflated Poisson (ZIP) model has received little attention in the insurance and actuarial literature (Lambert, 1992). The most commonly discussed zero-inflated models are the zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB) and zero-inflated generalized Poisson (ZIGP) models. In this paper, an application of the ZIP, ZIGP and ZINB distributions to the modeling of claim frequency is discussed.
The paper is organized as follows. The automobile insurance dataset is described in Section 2. In Section 3, we describe the count data models and zero-inflated models most commonly used for over-dispersed and zero-inflated count data. In Section 4, we discuss goodness-of-fit measures, fit various count data models to the automobile insurance dataset and compare the fitted models. In Section 5, we demonstrate the performance of different zero-inflated models and compare the results with the Poisson, GP and NB models via the log-likelihood and related statistics. Finally, in Section 6, we provide concluding remarks.

IASSL ISSN-2424-6271

Insurance claim data from Singapore
The data come from a 1993 portfolio of 7,483 automobile insurance policies from a major insurance company in Singapore. Table 2 describes the data and gives the distribution of the claim counts. The variable of interest is the number of insurance claims per policyholder. For this dataset, the maximum number of accidents in a year was three, and there were on average 0.06989 accidents per person. Frees and Valdez (2008) investigated hierarchical models of Singapore driving experience. Here, we consider a subset of those data, focusing on counts of automobile accidents and claims in 1993. The purpose of the analysis is to understand the impact of vehicle and driver characteristics on accident experience. We compare the performance of different zero-inflated models with the Poisson, GP and NB models to find the most suitable fit for automobile insurance data.

Description of data
The number of observations in the present database is 7,483. Around 90.64% of the policyholders are male. Several characteristics are available in the database to explain the number of claims. These include vehicle variables, such as the type and age of the vehicle, as well as variables related to the driver, such as age, sex and prior driving experience.

Distribution of claim frequency

Dispersed and zero-inflated count data models
If the count data contain excess zeros along with over-dispersion, one may consider a ZIGP or ZINB model to fit the data. Researchers have introduced several forms of the GP model. The main reason for introducing different functional forms is to use an appropriate mean-variance relationship in a regression context. A further objective is to gain flexibility in developing test procedures and inferential advantages.
Zamani and Ismail (2012) introduced a functional form of the GP regression model, referred to as the GP-P model, that parametrically nests the Poisson and the two well-known GP regression models (GP-1 and GP-2). Zamani and Ismail (2014) proposed a functional form of the ZIGP regression model, which mixes a distribution degenerate at zero with the GP-P distribution. The ZIGP-1 and ZIGP-2 regression models are particular cases of the ZIGP-P model with $P = 1$ and $P = 2$, respectively. We summarize the functional forms of several dispersed and zero-inflated count data models in Table 3, along with their means and variances.
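As a concrete instance of the zero-inflated construction described above, the following sketch writes out the ZIP probability function, which mixes a point mass at zero with a Poisson distribution, together with its mean and variance. The symbols `mu` (Poisson mean) and `omega` (probability of a structural zero) and the function names are assumptions made for illustration; Table 3 may use a different notation.

```python
import math


def zip_pmf(y, mu, omega):
    """P(Y = y) for a zero-inflated Poisson with Poisson mean mu and
    structural-zero probability omega (both names are illustrative)."""
    poisson = math.exp(-mu) * mu ** y / math.factorial(y)
    if y == 0:
        # Zeros come from the point mass at zero or from the Poisson part.
        return omega + (1.0 - omega) * poisson
    return (1.0 - omega) * poisson


def zip_mean_var(mu, omega):
    """Closed-form mean and variance of the ZIP distribution."""
    mean = (1.0 - omega) * mu
    var = mean * (1.0 + omega * mu)
    return mean, var


if __name__ == "__main__":
    mu, omega = 0.5, 0.3  # arbitrary illustrative values
    probs = [zip_pmf(y, mu, omega) for y in range(50)]
    mean, var = zip_mean_var(mu, omega)
    print(sum(probs), mean, var)
```

Setting `omega = 0` recovers the Poisson distribution, and for `omega > 0` the variance exceeds the mean, which is how zero-inflation shows up as over-dispersion.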

Goodness-of-fit statistic and model selection criteria
Data are said to be over-dispersed if the conditional variance exceeds the conditional mean. An indication of the magnitude of over-dispersion or under-dispersion can be obtained by comparing the sample mean and variance of the dependent count variable through the dispersion index, which is the ratio of the variance to the mean. Here the dispersion index is 1.0832, which shows that the data are over-dispersed.
The zero-inflation index (ZI index) measures the departure from the Poisson distribution caused by excess zeros. If $X$ is a nonnegative integer random variable (count variable) with mean $\mu$ and $p_0$ is the proportion of zeros in a random sample of size $n$, then the ZI index for the sample is defined as $zi = 1 + \ln(p_0)/\mu$, with $\mu$ estimated by the sample mean. For a large sample from a Poisson distribution, $zi$ is close to zero with high probability. The proportion of zeros in the insurance claim count data is 0.95 and the ZI index is 0.037150.
We fit various count data models to the insurance claim count data to identify the best fit. The chi-squared statistic is used to assess the goodness of fit of each distribution. For large sample sizes, the distribution of the statistic is approximately chi-squared, with degrees of freedom determined by the number of observations and the number of parameters. A significant value indicates that the model does not fit the data well; a model with one or more additional parameters may then be a significant improvement over the nested model.
In assessing the performance of the models and for model selection, the zero-inflated models are compared with the Poisson, GP and NB models by means of the log-likelihood, Akaike's information criterion (AIC) and the Bayesian information criterion (BIC). In general, the smaller the AIC and BIC, the better the model fit. If $k$ is the number of parameters estimated, $n$ is the number of observations in the data and $L$ is the maximized likelihood, the AIC and BIC are defined as $\mathrm{AIC} = -2\ln L + 2k$ and $\mathrm{BIC} = -2\ln L + k\ln n$.
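The two indices above can be computed directly from a frequency table. The sketch below takes the dispersion index as the sample variance over the sample mean and the ZI index as 1 + ln(p0)/mean. The frequency table in it is an assumption: the counts are chosen only to agree with the reported total of 7,483 policies, the maximum of three claims and the mean of 0.06989, and should be replaced by the table from the data source. Under these assumed counts the zero proportion comes out near 0.935, so the 0.95 quoted above appears to be a rounded figure.

```python
import math

# Assumed frequency table (claims -> number of policies). These counts are an
# illustration chosen to agree with the reported total (7,483), maximum (3)
# and mean (0.06989); replace them with the table from the data source.
ASSUMED_FREQ = {0: 6996, 1: 455, 2: 28, 3: 4}


def dispersion_and_zi(freq):
    """Return (mean, dispersion index, zero proportion, ZI index)."""
    n = sum(freq.values())
    mean = sum(k * c for k, c in freq.items()) / n
    # Sample variance with the n - 1 divisor.
    var = sum(c * (k - mean) ** 2 for k, c in freq.items()) / (n - 1)
    p0 = freq.get(0, 0) / n
    zi = 1.0 + math.log(p0) / mean  # close to 0 for Poisson data
    return mean, var / mean, p0, zi


if __name__ == "__main__":
    mean, di, p0, zi = dispersion_and_zi(ASSUMED_FREQ)
    print(f"mean={mean:.5f} dispersion={di:.4f} p0={p0:.4f} ZI={zi:.5f}")
```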
Results of fitting the claim frequency distribution with the various models are given in Table 4. Based on the chi-square statistic, the Poisson distribution does not provide an adequate fit to the automobile insurance data and is therefore inappropriate for modeling the automobile claim frequency. Based on the chi-square statistic ($\chi^2 = 1.7599$), the ZIGP-1 distribution provides a good fit to the data. However, it should be noted that the Poisson variance is inflated by the dispersion parameter of the GP distribution (GPD). Therefore, the GPD and NB distributions can indicate an adequate fit according to the statistic even when the underlying count random variable is in fact zero-inflated.
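To illustrate the kind of goodness-of-fit check described above, the sketch below computes a Pearson chi-square statistic for a fitted Poisson distribution. It is not the computation behind Table 4. The frequency table is an assumption chosen to be consistent with the reported total and mean, and pooling the counts of two or more claims into one cell is a choice made here to avoid very small expected frequencies.

```python
import math

# Assumed frequency table (claims -> policies), an illustration consistent
# with the reported total of 7,483 policies and mean of 0.06989.
ASSUMED_FREQ = {0: 6996, 1: 455, 2: 28, 3: 4}


def poisson_chi_square(freq, pool_from=2):
    """Pearson chi-square for a Poisson fit, pooling counts >= pool_from.

    Returns (statistic, degrees of freedom)."""
    n = sum(freq.values())
    lam = sum(k * c for k, c in freq.items()) / n  # Poisson MLE of the mean
    probs = [math.exp(-lam) * lam ** k / math.factorial(k)
             for k in range(pool_from)]
    probs.append(1.0 - sum(probs))  # pooled upper tail
    observed = [freq.get(k, 0) for k in range(pool_from)]
    observed.append(n - sum(observed))
    stat = sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))
    df = len(probs) - 1 - 1  # cells - 1 - one estimated parameter
    return stat, df


if __name__ == "__main__":
    stat, df = poisson_chi_square(ASSUMED_FREQ)
    # 3.841 is the 5% critical value of chi-square with 1 degree of freedom.
    print(f"chi-square={stat:.2f} on {df} df; reject Poisson: {stat > 3.841}")
```

With the assumed counts the statistic exceeds the 5% critical value, which agrees with the conclusion that the Poisson distribution fits these data poorly.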

Regression model fitting results
The zero-inflated models can be extended to accommodate covariates in a regression setting. The estimation of the proportion of zeros and of the count parameters can be divided into two parts: a logit part models the odds of the structural-zero proportion, and a Poisson part models the counts that follow the Poisson distribution. Covariates can enter both the logit part and the Poisson part (Lambert, 1992). The parameters of the zero-inflated models are estimated in R and SAS by optimizing the relevant log-likelihood functions; in SAS this is done with the NLMIXED procedure.
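The two-part structure described above can be made concrete by writing out the ZIP regression log-likelihood that an optimizer would maximize. The sketch below is an illustration in plain Python, not the authors' R or SAS code. The function name, the argument names `X` (covariates of the Poisson part), `Z` (covariates of the logit part), `beta` and `gamma`, and the toy data are all assumptions.

```python
import math


def zip_regression_loglik(y, X, Z, beta, gamma):
    """Log-likelihood of a ZIP regression with a log link for the Poisson
    mean and a logit link for the structural-zero probability."""
    total = 0.0
    for yi, xi, zi in zip(y, X, Z):
        mu = math.exp(sum(b * x for b, x in zip(beta, xi)))   # Poisson part
        eta = sum(g * z for g, z in zip(gamma, zi))
        omega = 1.0 / (1.0 + math.exp(-eta))                  # logit part
        if yi == 0:
            # A zero is either structural or a Poisson zero.
            total += math.log(omega + (1.0 - omega) * math.exp(-mu))
        else:
            total += (math.log(1.0 - omega) - mu + yi * math.log(mu)
                      - math.lgamma(yi + 1))
    return total


if __name__ == "__main__":
    # Toy data: an intercept plus one binary covariate in the Poisson part,
    # and an intercept only in the logit part.
    y = [0, 0, 1, 0, 2]
    X = [[1, 0], [1, 1], [1, 0], [1, 1], [1, 1]]
    Z = [[1]] * 5
    print(zip_regression_loglik(y, X, Z, beta=[-1.0, 0.5], gamma=[-0.5]))
```

In practice the negative of this function would be handed to a general-purpose numerical optimizer to obtain the estimates of `beta` and `gamma`.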
Table 5 shows the count data models fitted to the claim frequency data with covariates. To further analyze the automobile insurance data, the use of zero-inflated regression models is examined. The variables are chosen through an exhaustive search of the database, retaining those that provide a significant improvement in the Poisson log-likelihood at convergence. Among these nine explanatory variables, NCD, VAgeCat and the gender of the policyholder are shown to be significant in the Poisson regression model. For comparison, the same set of variables is used in the other models. We consider the Poisson, GP, NB, ZIP, ZIGP and ZINB regression models and evaluate their performance via the log-likelihood, AIC and BIC. In Table 5, bracketed figures denote the standard errors of the parameter estimates, and * denotes estimates significant at the 5% level. Based on the AIC values, the GP-2, NB and zero-inflated regression models fit the automobile insurance data reasonably well.
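The model comparison above rests on the standard definitions AIC = -2 ln L + 2k and BIC = -2 ln L + k ln n. A minimal sketch of that comparison follows. The log-likelihoods and parameter counts in it are hypothetical placeholders, not the values in Table 5; only n = 7,483 is taken from the paper.

```python
import math


def aic(loglik, k):
    """Akaike information criterion for maximized log-likelihood loglik
    and k estimated parameters."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """Bayesian information criterion with n observations."""
    return -2.0 * loglik + k * math.log(n)


if __name__ == "__main__":
    n = 7483  # number of policies reported in the paper
    # Hypothetical (log-likelihood, number of parameters) pairs, NOT the
    # values in Table 5; they only illustrate the comparison.
    candidates = {"model_a": (-1900.0, 4), "model_b": (-1890.0, 5)}
    for name, (ll, k) in candidates.items():
        print(name, round(aic(ll, k), 1), round(bic(ll, k, n), 1))
    best = min(candidates, key=lambda m: aic(*candidates[m]))
    print("smallest AIC:", best)
```

Because the BIC penalty grows with ln n, it favors the smaller model more strongly than the AIC does for a sample of this size.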

Discussion and conclusions
Accurate modeling of the claim count distribution is one of the essential steps in calculating policy rates. Motivated by the dispersion and zero-inflation problem in the claim counts of the automobile insurance dataset, this study proposes the use of several count data models. The approach accommodates the extra zeros possibly caused by unreported minor losses. The Poisson, GP, NB, ZIP, ZIGP and ZINB models are considered, and their performance is evaluated via the log-likelihood, AIC and BIC. Based on the findings shown in the previous section, the GP-2, NB and zero-inflated regression models fit the automobile insurance data reasonably well. The ZIGP distribution provides the best fit to the data.
Other than the zero-inflated models, parametric methods such as mixtures of distributions can be used to model the claim frequency distribution with extra zeros. Hurlimann (1990) discussed the use of several pseudo compound Poisson distributions in modelling claim count data. Dobbie and Welsh (2001) considered the use of the Neyman type-A distribution to model zero-inflated counts. The Neyman type-A distribution models the count data via two Poisson parameters, which makes it more flexible in modelling multimodal data. Because of possible over-dispersion in the Poisson part, the baseline ZIP model may not be adequate; for this reason, the Poisson part of the ZIP model has been modified by using NB distributions. For this over-dispersion problem in the Poisson part when fitting claim count data, the quasi-likelihood (QL) model defined by Wedderburn (1974) can be an alternative for modelling the extra dispersion.