Simulate Individual Data

Gabriele Pittarello

2024-11-14

Introduction

In this vignette we show how to simulate the individual data we included in the simulation study of Hiabu, Hofman, and Pittarello (2023). The simulations are based on the SynthETIC package and can be used to replicate our results. In the manuscript, we named the \(5\) scenarios Alpha, Beta, Gamma, Delta, and Epsilon. All \(5\) scenarios share the data features described in the following table, while each has specific characteristics that we describe in the coming sections.

Covariates                                 Description
claim_number                               Policy identifier.
claim_type \(\in \left\{0, 1 \right\}\)    Type of claim.
AP                                         Accident month.
RP                                         Reporting month.

For each scenario we will show whether it satisfies the chain ladder assumptions (CL) and the proportionality assumption in Cox (1972) (PROP), and whether interactions are present (INT). Details on the simulation mechanism and the simulation parameters can be found in the manuscript.
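To relate the individual data to the chain ladder setting, the records can be aggregated into a run-off triangle of claim counts. The following is a minimal sketch in base R, not the aggregation used in the manuscript: the toy data frame and the helper name `counts_triangle` are hypothetical, and we assume the development period is measured as RP - AP.

```r
# Toy individual data with the columns described above (hypothetical values)
toy_claims <- data.frame(
  claim_number = 1:8,
  claim_type   = c(0, 0, 1, 1, 0, 1, 0, 1),
  AP           = c(1, 1, 1, 1, 2, 2, 3, 2),
  RP           = c(1, 2, 2, 3, 2, 3, 3, 3)
)

# Hypothetical helper: incremental claim counts by accident and development period
counts_triangle <- function(data) {
  n_periods <- max(data$RP)
  # Assumption: the development period is the reporting delay RP - AP
  development <- data$RP - data$AP
  table(
    AP = factor(data$AP, levels = seq_len(n_periods)),
    DP = factor(development, levels = 0:(n_periods - 1))
  )
}

incremental <- counts_triangle(toy_claims)

# Cumulate along development periods to obtain the usual chain ladder input
cumulative <- t(apply(incremental, 1, cumsum))

# Cells below the anti-diagonal lie in the future and are not yet observed
cumulative[row(cumulative) + col(cumulative) > nrow(cumulative) + 1] <- NA
```

Rows of `cumulative` are accident periods and columns are development periods; only the upper-left triangle carries observed counts.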

Scenario Alpha

This scenario is a mix of claim_type 0 and claim_type 1 with the same number of claims in each accident month (i.e. the same claims volume).

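# Packages used throughout this vignette
library(ReSurv)   # assumed source of data_generator()
library(dplyr)    # %>% and mutate()
library(ggplot2)  # plotting functions

# Input data

input_data_0 <- data_generator(
  random_seed = 1964,
  scenario = "alpha",
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)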
input_data_0 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays", x =
         "Notification delay (in days)", y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type", override.aes = list(
      color = c("royalblue", "#a71429"), size = 2
    )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()
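The claims volume can be inspected by counting claims per accident month and claim type. The sketch below uses a hypothetical helper `claims_volume` and toy data rather than the output of `data_generator`; in simulated data the counts may fluctuate randomly around a flat profile instead of being exactly equal.

```r
# Hypothetical helper: number of claims per accident month and claim type
claims_volume <- function(data) {
  table(AP = data$AP, claim_type = data$claim_type)
}

# Toy data for illustration only (not produced by data_generator)
toy_alpha <- data.frame(
  AP         = rep(1:3, each = 4),
  claim_type = rep(c(0, 0, 1, 1), times = 3)
)

volume <- claims_volume(toy_alpha)

# A constant volume means each claim type has the same count in every accident month
constant_volume <- all(apply(volume, 2, function(x) length(unique(x)) == 1))
```

Assuming the simulated data expose the AP and claim_type columns from the table above, the same helper could be applied to `input_data_0`.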

Scenario Beta

This scenario is similar to scenario Alpha, but the volume of claim_type 1 decreases in the most recent accident periods. When the longer-tailed bodily injury claims have a decreasing claim volume, aggregated chain ladder methods will overestimate reserves; see Ajne (1994).

# Input data

input_data_1 <- data_generator(
  random_seed = 1964,
  scenario = 1,
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)
input_data_1 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays", x =
         "Notification delay (in days)", y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type", override.aes = list(
      color = c("royalblue", "#a71429"), size = 2
    )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()
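The overestimation effect can be illustrated with a small deterministic example. This is a sketch under stated assumptions, not the simulation of the manuscript: the reporting patterns, the claim volumes, and the helper `chain_ladder_ultimate` are hypothetical, and expected counts are used in place of random ones.

```r
# Cumulative reporting proportions per development period (hypothetical values)
pattern_0 <- c(0.80, 0.95, 1.00)  # shorter-tailed claim type 0
pattern_1 <- c(0.20, 0.60, 1.00)  # longer-tailed claim type 1

# Ultimate claim counts per accident period (hypothetical values)
ultimate_0 <- c(100, 100, 100)    # stable volume
ultimate_1 <- c(100, 50, 10)      # decreasing volume

# Expected cumulative counts: rows are accident periods, columns development periods
triangle_0 <- outer(ultimate_0, pattern_0)
triangle_1 <- outer(ultimate_1, pattern_1)

# Volume-weighted chain ladder that uses only the observed upper-left triangle
chain_ladder_ultimate <- function(triangle) {
  n <- nrow(triangle)
  triangle[row(triangle) + col(triangle) > n + 1] <- NA  # hide future cells
  for (j in seq_len(n - 1)) {
    observed <- seq_len(n - j)        # accident periods seen at development j + 1
    factor_j <- sum(triangle[observed, j + 1]) / sum(triangle[observed, j])
    to_fill <- (n - j + 1):n          # accident periods to project forward
    triangle[to_fill, j + 1] <- triangle[to_fill, j] * factor_j
  }
  triangle[, n]
}

true_total       <- sum(ultimate_0 + ultimate_1)
aggregated_total <- sum(chain_ladder_ultimate(triangle_0 + triangle_1))
separate_total   <- sum(chain_ladder_ultimate(triangle_0)) +
  sum(chain_ladder_ultimate(triangle_1))
```

In this toy setting `aggregated_total` exceeds `true_total`, because the development factors are estimated on older accident periods with a larger share of long-tailed claims. Applying chain ladder to each claim type separately recovers the true total.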

Scenario Gamma

An interaction between claim_type 1 and accident period affects the claims occurrence. One could imagine a scenario where a change in consumer behavior or company policies results in different reporting patterns over time. For the last simulated accident month, the two reporting delay distributions will be identical.

# Input data

input_data_2 <- data_generator(
  random_seed = 1964,
  scenario = 2,
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)
input_data_2 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays", x =
         "Notification delay (in days)", y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type", override.aes = list(
      color = c("royalblue", "#a71429"), size = 2
    )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()
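An interaction of this kind can be made visible by summarising the reporting delay per claim type within each accident month. The sketch below uses hypothetical toy data with a precomputed `delay` column; with the simulated data one might instead use RT - AT, as in the plots.

```r
# Toy data (hypothetical): delays of claim type 1 shrink over accident months
toy_gamma <- data.frame(
  claim_type = rep(c(0, 1), each = 6),
  AP         = rep(rep(1:3, each = 2), times = 2),
  delay      = c(1, 3, 1, 3, 1, 3,   # claim type 0: same delays in every month
                 5, 7, 3, 5, 1, 3)   # claim type 1: delays decrease with AP
)

# Median delay per claim type and accident month
median_delay <- aggregate(delay ~ claim_type + AP, data = toy_gamma, FUN = median)

# One row per accident month, one column per claim type
wide <- reshape(median_delay, idvar = "AP", timevar = "claim_type",
                direction = "wide")

# Gap between the two claim types; a gap that changes with AP signals an interaction
wide$gap <- wide$delay.1 - wide$delay.0
```

In the toy data the gap narrows over the accident months and vanishes in the last one, mirroring the description above.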

Scenario Delta

A seasonality effect that depends on the accident month is present for claim_type 0 and claim_type 1. This could occur in a real-world setting with an increased workload during winter for certain claim types, or a reduced workforce during the summer holidays.

# Input data

input_data_3 <- data_generator(
  random_seed = 1964,
  scenario = 3,
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)
input_data_3 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays", x =
         "Notification delay (in days)", y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type", override.aes = list(
      color = c("royalblue", "#a71429"), size = 2
    )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()
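A seasonal pattern can be examined by averaging the reporting delay over the calendar month of the accident. This is a minimal sketch with hypothetical toy data; the helper `calendar_month` assumes that AP indexes months and that accident month 1 corresponds to January.

```r
# Hypothetical helper: calendar month (1 to 12) of an accident month index,
# assuming accident month 1 corresponds to January
calendar_month <- function(AP) (AP - 1) %% 12 + 1

# Toy data (hypothetical): two years of monthly delays with a seasonal shape
toy_delta <- data.frame(
  AP    = 1:24,
  delay = rep(c(4, 4, 3, 2, 2, 1, 1, 1, 2, 3, 4, 4), times = 2)
)

# Average delay per calendar month across years
seasonal_profile <- tapply(toy_delta$delay, calendar_month(toy_delta$AP), mean)
```

A profile that varies across calendar months and repeats from year to year is what a seasonality effect would look like in such a summary.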

Scenario Epsilon

The data generating process violates the proportionality assumption in Cox (1972). We generate the data assuming that a) there is an effect of the covariates on the baseline and b) the proportionality assumption is not valid.
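For reference, the proportionality assumption can be written out in standard notation. The symbols below (hazard \(\lambda\), baseline hazard \(\lambda_0\), covariates \(x\), coefficients \(\beta\)) are our own shorthand and are not taken from the manuscript. Under the model of Cox (1972),

\[
\lambda(t \mid x) = \lambda_0(t) \exp\left(x^\top \beta\right),
\]

so that for two covariate vectors \(x_1\) and \(x_2\)

\[
\frac{\lambda(t \mid x_1)}{\lambda(t \mid x_2)} = \exp\left\{(x_1 - x_2)^\top \beta\right\},
\]

which does not depend on \(t\). In a setting such as scenario Epsilon this ratio is allowed to vary with \(t\), so no single coefficient vector \(\beta\) describes the covariate effect at all times.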

# Input data

input_data_4 <- data_generator(
  random_seed = 1964,
  scenario = 4,
  time_unit = 1 / 360,
  years = 4,
  period_exposure = 200
)
input_data_4 %>%
  as.data.frame() %>%
  mutate(claim_type = as.factor(claim_type)) %>%
  ggplot(aes(x = RT - AT, color = claim_type)) +
  stat_ecdf(size = 1) +
  labs(title = "Empirical distribution of simulated notification delays", x =
         "Notification delay (in days)", y = "Cumulative Density") +
  xlim(0, 1500) +
  scale_color_manual(
    values = c("royalblue", "#a71429"),
    labels = c("Claim type 0", "Claim type 1")
  ) +
  scale_linetype_manual(values = c(1, 3),
                        labels = c("Claim type 0", "Claim type 1")) +
  guides(
    color = guide_legend(title = "Claim type", override.aes = list(
      color = c("royalblue", "#a71429"), size = 2
    )),
    linetype = guide_legend(
      title = "Claim type",
      override.aes = list(linetype = c(1, 3), size = 0.7)
    )
  ) +
  theme_bw()

Bibliography

Ajne, Björn. 1994. “Additivity of Chain-Ladder Projections.” ASTIN Bulletin 24 (2): 311–18.
Cox, David R. 1972. “Regression Models and Life-Tables.” Journal of the Royal Statistical Society: Series B (Methodological) 34 (2): 187–202.
Hiabu, Munir, Emil Hofman, and Gabriele Pittarello. 2023. “A Machine Learning Approach Based on Survival Analysis for IBNR Frequencies in Non-Life Reserving.” Preprint.