Births in Belgium between 1992 - 2020; or how you should NOT visualize the data.


Recently, one of my colleagues shared an open source dataset on the daily number of births in Belgium. Exploring the data, it quickly became clear birthdays are less random than you might think.

In the next few blocks, I hope to have shared some insights. —

Let’s start with the time series plot on the left. We can notice that there is 1. quite some variation in the number of births on any given moment 2. an indication for a change in time

The plot isn’t ideal though… you’ll see why :-)


Details on the data

The dataset is made available Statbel, the Belgian statistical office. You can access it here.

The number of births per day between 1 Jan 1992 and 31 Dec 2020 are reported.

Based on the dates, some extra features are included such as the weekday, day of year and Belgian historical holidays that were scraped here

Our first improvement


Clearly there is something going on…

If you hover over the scatter cloud you quickly see an indication for the reason of such a difference. Or… just plain reasoning about impacting factors should work perfectly as well in this simple case :)

The effect of weekdays on births


We could expect that births were lower in the weekend. However, that there is such a large difference might be surprising nonetheless.


Over the years, the average number of births during a weekend day is 215. For weekdays, it is 373. On average, one is 1.73 times more likely to be born during a weekday.*


*Simplified example which does not take year, holidays etc. into account.

An alternative representation: birthday heatmap


The tiles on the left represent a heatmap of birthday popularity by year. Purple tiles indicate more births on those days compared to the year average - the opposite holds for the red tiles.


Again, we can identify the weekend effect on this plot. The red tiles are mostly concentrated with two less common birthdays next to each other.

The heatmap, in motion


Similar to the previous visual, we now have a sequence of heatmaps for consecutive years. Playing the animation will also make clear that the red blocks just shift horizontally as weekdays shift each year.


Hovering over the tiles will give you some details. If you’re born in 1992 or later, you can check how common your birthday is during that year.


In one of the next blocks, this visualization will help us to avoid making false conclusions!


GIF available here

Patterns in births = patterns in conceptions. When do people conceive babies?



*The choice of a ‘spiral plot’ for the conceptions was non-intentional. This visualization was somewhat inspired by this visualization on COVID


---
title: "Births & conceptions in Belgium"
output: 
  flexdashboard::flex_dashboard:
    theme: lumen
    source: embed
    storyboard: true
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
library(flexdashboard)
library(dplyr)
library(ggplot2)
library(ggiraph)
library(plotly)

births_conceptions_be = readRDS('births_conceptions_be.rds')
theme_set(theme_minimal())

# helpers ----------------------------------------------------------

scale_date_custom = function() scale_x_date(
  date_breaks = '2 years', date_labels = "'%y", 
  limits = as.Date(c('1992-01-01', '2020-01-01'))
)

# knitr::opts_chunk$set(eval = FALSE)
```


### Births in Belgium between 1992 - 2020; or how you should **NOT** visualize the data.

```{r}
gg = ggplot(births_conceptions_be, 
            aes(x=date, y = nbirths))+
  geom_line()+
  geom_smooth()+
  scale_date_custom()+
  scale_y_continuous(limits = c(0, 500))+ 
  labs(
    x = 'Year', 
    y = 'Number of births', 
    title = 'Daily births in Belgium
Between 1992 and 2020', ) plotly::ggplotly(gg) %>% plotly::style(traces = 2:3, hoverinfo = 'none') ``` *** Recently, one of my colleagues shared an open source dataset on the daily number of births in Belgium. Exploring the data, it quickly became clear birthdays are less random than you might think. In the next few blocks, I hope to have shared some insights. --- Let's start with the time series plot on the left. We can notice that there is 1. quite some **variation** in the number of births on any given moment 2. an indication for a **change in time** The plot isn't ideal though... you'll see why :-) --- Details on the data The dataset is made available Statbel, the Belgian statistical office. You can access it [here](https://statbel.fgov.be/en/open-data/number-births-day). The number of births per day between 1 Jan 1992 and 31 Dec 2020 are reported. Based on the dates, some extra features are included such as the weekday, day of year and Belgian historical holidays that were scraped [here](https://www.kalender-365.be/feestdagen/1991.html) ### Our first improvement ```{r} gg = ggplot(births_conceptions_be, aes(x=date, y = nbirths))+ geom_point(aes(weekday=weekday), size=.1)+ geom_smooth(se = FALSE)+ scale_date_custom()+ scale_y_continuous(limits = c(0, 500))+ scale_color_discrete()+ labs( x = 'Year', y = 'Number of births', color = 'Weekday', title = 'Daily births in Belgium' ) plotly::ggplotly(gg) %>% style(hoverinfo = 'none', traces = 2) ``` *** Clearly there is something going on... If you hover over the scatter cloud you quickly see an indication for the reason of such a difference. Or... just plain reasoning about impacting factors should work perfectly as well in this simple case :) ### The effect of weekdays on births ```{r} gg = ggplot(births_conceptions_be, aes(x=date, y = nbirths, color = weekday))+ geom_point(size=.1)+ geom_smooth(se = FALSE)+ scale_date_custom()+ scale_y_continuous(limits = c(0, 500))+ scale_color_discrete()+ labs( x = 'Year', y = 'Number of births', color = 'Weekday', title = 'Daily births in Belgium
Presence of a weekend effect' ) plotly::ggplotly(gg) %>% style(hoverinfo = 'none', traces = 8:14) ``` *** We could expect that births were lower in the weekend. However, that there is such a large difference might be surprising nonetheless. --- ```{r, eval=FALSE, echo=FALSE} births_conceptions_be %>% filter(!is.na(nbirths)) %>% group_by(weekend = weekday %in% c('Sat', 'Sun')) %>% summarise(avg_daily_births = mean(nbirths)) ``` Over the years, the average number of births during a weekend day is 215. For weekdays, it is 373. On average, one is 1.73 times more likely to be born during a weekday.* --- *_Simplified example which does not take year, holidays etc. into account._ ### An alternative representation: birthday heatmap ```{r} hovertemplate = "{yy} {weekday} {dd} {month.abb[mm]} No. births: {nbirths} Birthday popularity: {round(bd_pop, 2)}{tooltip_holiday}" heatmapdata = births_conceptions_be %>% filter(!is.na(nbirths)) %>% group_by(yy) %>% mutate( bd_pop = nbirths / mean(nbirths), tooltip_holiday = ifelse(is.na(holiday), '', paste0('\nHoliday: ', holiday, '')), tooltip = glue::glue(hovertemplate), Year = yy ) gg = ggplot(heatmapdata %>% filter(yy == 1992), aes(fill=bd_pop, x = dd, y = mm, frame = Year, text=tooltip))+ geom_tile()+ scale_fill_gradient2(midpoint = 1, breaks = seq(.4, 1.4, .2))+ scale_y_discrete(limits = rev)+ labs( title = "Birthday popularity in 1992", x = 'Day', y = "Month", fill = 'Birthday\npopularity')+ theme( panel.background = element_blank(), panel.grid = element_blank(), axis.ticks = element_blank() ) plotly::ggplotly(gg, tooltip = 'text') %>% layout( hoverlabel=list(bgcolor="white") ) ``` *** The tiles on the left represent a heatmap of birthday popularity by year. Purple tiles indicate more births on those days compared to the year average - the opposite holds for the red tiles. --- Again, we can identify the weekend effect on this plot. The red tiles are mostly concentrated with two less common birthdays next to each other. ### The heatmap, in motion ```{r} gg = ggplot(heatmapdata, aes(fill=bd_pop, x = dd, y = mm, frame = Year, text=tooltip))+ geom_tile()+ scale_fill_gradient2(midpoint = 1, breaks = seq(.4, 1.4, .2))+ scale_x_discrete(position='top')+ scale_y_discrete(limits = rev)+ labs(x = 'Day', y = "Month", fill = 'Birthday\npopularity')+ theme( panel.background = element_blank(), panel.grid = element_blank(), axis.ticks = element_blank() ) plotly::ggplotly(gg, tooltip = 'text') %>% plotly::animation_slider( pad = list(t = 10) ) %>% layout( hoverlabel=list(bgcolor="white"), xaxis=list(side='top') ) %>% animation_button( visible = TRUE, pad = list(r = 20) ) ``` *** Similar to the previous visual, we now have a sequence of heatmaps for consecutive years. Playing the animation will also make clear that the red blocks just shift horizontally as weekdays shift each year. --- Hovering over the tiles will give you some details. If you're born in 1992 or later, you can check how common your birthday is during that year. --- In one of the next blocks, this visualization will help us to avoid making false conclusions! --- GIF available [here](file.gif) ### Patterns in births = patterns in conceptions. _When do people conceive babies?_ {data-commentary-width=500} ```{r} # make data spiral_data = births_conceptions_be %>% filter(!is.na(nconceptions)) %>% filter(!(mm == 2 & dd == 29)) %>% group_by(yy, mm) %>% # daily avg per yymm mutate( yymm_mean = round(mean(nconceptions)) ) %>% group_by(yy) %>% # comparison yymm daily average to yy daily average and create rank mutate( month_conc_popularity = 13 - dense_rank(yymm_mean), tooltip = glue::glue("{month.name[mm]} {yy}\n\nAvg daily conceptions: {yymm_mean}\nConception rank within year: {month_conc_popularity}"), ) %>% ungroup() %>% # helper variable for polar spiral mutate( day_of_year_cumsum = cumsum(day_of_year) ) # tickpos years; angular axis tickvals_years = spiral_data %>% filter(dd == 1, mm == 1, yy %in% c(1995, 2020)) %>% pull(day_of_year_cumsum) # tickpos months tickvals_months = as.integer(seq(as.Date('1970-01-15'), as.Date('1970-12-15'), by = 'month')) plot_ly( spiral_data, colors = colorRamp(c('darkblue', 'white', 'darkred')), type = 'scatterpolar', mode = 'markers', theta = ~day_of_year, r = ~day_of_year_cumsum, text = ~tooltip, color = ~factor(month_conc_popularity), hoverinfo = 'text', marker = list(size = 4) ) %>% layout( legend = list(title = list(text = 'Daily average\nof conceptions\nrelative to month within year\n\n1: most fertile\n12: least fertile\n')), polar = list( angularaxis = list( showline = FALSE, showgrid = FALSE, rotation = 90, direction = "clockwise", tickmode="array", tickvals = tickvals_months, ticktext = month.abb, ticks = '' ), radialaxis = list( # angle = 90, # tickangle = 90, showline = FALSE, showgrid = FALSE, ticks = '', tickmode = 'array', tickvals = tickvals_years, ticktext = c("1995", "2020"), tickfont = list(color = "black") ) ), xaxis = list( 'showgrid' = FALSE, # thin lines in the background 'zeroline' = FALSE, # thick line at x=0 'visible' = FALSE # numbers below ), yaxis = list( 'showgrid' = FALSE, # thin lines in the background 'zeroline' = FALSE, # thick line at x=0 'visible' = FALSE # numbers below ) ) ``` *** --- *_The choice of a 'spiral plot' for the conceptions was non-intentional. This visualization was somewhat inspired by this visualization on [COVID](https://www.nytimes.com/2022/01/06/opinion/omicron-covid-us.html)_ ---