--- title: "Histogram and Density" author: "Wayne Oldford and Zehao Xu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true geometry: margin=.75in urlcolor: blue graphics: yes vignette: > %\VignetteIndexEntry{Histogram and Density} %\VignetteEngine{knitr::knitr} \usepackage[utf8]{inputenc} header-includes: - \usepackage{graphicx} - \usepackage{epic} - \usepackage{color} - \usepackage{hyperref} - \usepackage{multimedia} - \PassOptionsToPackage{pdfmark}{hyperref}\RequirePackage{hyperref} - \newcommand{\code}[1]{\texttt{#1}} - \newcommand{\ve}[1]{\mathbf{#1}} - \newcommand{\pop}[1]{\mathcal{#1}} - \newcommand{\samp}[1]{\mathcal{#1}} - \newcommand{\subspace}[1]{\mathcal{#1}} - \newcommand{\sv}[1]{\boldsymbol{#1}} - \newcommand{\sm}[1]{\boldsymbol{#1}} - \newcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}} - \newcommand{\abs}[1]{\left\lvert ~{#1} ~\right\rvert} - \newcommand{\size}[1]{\left\lvert {#1} \right\rvert} - \newcommand{\norm}[1]{\left|\left|{#1}\right|\right|} - \newcommand{\field}[1]{\mathbb{#1}} - \newcommand{\Reals}{\field{R}} - \newcommand{\Integers}{\field{Z}} - \newcommand{\Naturals}{\field{N}} - \newcommand{\Complex}{\field{C}} - \newcommand{\Rationals}{\field{Q}} - \newcommand{\widebar}[1]{\overline{#1}} - \newcommand{\wig}[1]{\tilde{#1}} - \newcommand{\bigwig}[1]{\widetilde{#1}} - \newcommand{\leftgiven}{~\left\lvert~} - \newcommand{\given}{~\vert~} - \newcommand{\indep}{\bot\hspace{-.6em}\bot} - \newcommand{\notindep}{\bot\hspace{-.6em}\bot\hspace{-0.75em}/\hspace{.4em}} - \newcommand{\depend}{\Join} - \newcommand{\notdepend}{\Join\hspace{-0.9 em}/\hspace{.4em}} - \newcommand{\imply}{\Longrightarrow} - \newcommand{\notimply}{\Longrightarrow \hspace{-1.5em}/ \hspace{0.8em}} - \newcommand*{\intersect}{\cap} - \newcommand*{\union}{\cup} - \DeclareMathOperator*{\argmin}{arg\,min} - \DeclareMathOperator*{\argmax}{arg\,max} - \DeclareMathOperator*{\Ave}{Ave\,} - \newcommand{\permpause}{\pause} - \newcommand{\suchthat}{~:~} - \newcommand{\st}{~:~} --- ```{r setup, include=FALSE, warning=FALSE} knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.width = 4, fig.height = 3, fig.align = "center", out.width = "70%", collapse = TRUE, comment = "#>", tidy.opts = list(width.cutoff = 65), tidy = FALSE) library(knitr) set.seed(12314159) imageDirectory <- "./images/histogram-density-" dataDirectory <- "./data/histogram-density-" path_concat <- function(path1, ..., sep="/") { # The "/" is standard unix directory separator and so will # work on Macs and Linux. # In windows the separator might have to be sep = "\" or # even sep = "\\" or possibly something else. paste(path1, ..., sep = sep) } library(dplyr, quietly = TRUE) ``` Histograms (and bar plots) are common tools to visualize a single variable. The x axis is often used to locate the bins and the y axis is for the counts. Density plots can be considered as the smoothed version of the histogram. Boxplot is another method to visualize one dimensional data. Five summary statistics can be easily traced on the plot. However, compared with histograms and density plots, boxplot can accommodate two variables, `group`s (often on the `x` axis) and `y`s (on the `y` axis). In `ggplot2`, `geom_histogram` and `geom_density` only accept one variable, `x` or `y` (swapped). Providing both positions is forbidden. Inspired by the boxplot (`geom_boxplot` in `ggplot2`), we create functions `geom_histogram_`, `geom_bar_` and `geom_density_` which can accommodate both variables, just like the `geom_boxplot`! ## Hist (histogram and bar plot) ### Two dimensional bar plot: `geom_bar_` Consider the `mtcars` data set. ```{r set data} mtcars2 <- mtcars %>% mutate( cyl = factor(cyl), gear = factor(gear) ) ``` Suppose that we are interested in the relationship of number of gears given the `cyl` (number of cylinders). ```{r geom_bar_graph} library(ggmulti) ggplot(mtcars2, mapping = aes(x = cyl, y = gear)) + geom_bar_(as.mix = TRUE) + labs(caption = "Figure 1") ``` Though the Figure 1, we can tell that * Compare vertically: given the number of engines, tell the gears - Most V8 engine cars prefer 3 gear transmission. V8 cars do not use 4 gear transmission - Most V4 engine cars prefer 4 gears transmission. * Compare horizontally: given the number of gears, tell the engines - Most 3 gear transmission cars carry a V8 engine. - Most 4 gear transmission cars carry a V4 engine, then V6 engine, but never V8 engine. - Five gear transmission cars can carry either a V4, V6 or V8 engine. However, compared with other two transmissions, 5 gear is not a common choice. ### Two dimensional histogram: `geom_histogram_` Suppose now, we are interested in the distribution of `mpg` (miles per gallon) with the respect to the `cyl` (as "x" axis) and `gear` (as "fill"). Through the Figure 2, we can easily tell that as the number of cylinders rises, the miles/gallon drops significantly. Moreover, the number of six cylinder cars is much less that the other two in our data. In addition, the transmission of V8 cars is either 3 or 5 (identical to the conclusion we draw before). ```{r geom_histogram_graph} g <- ggplot(mtcars2, mapping = aes(x = cyl, y = mpg, fill = gear)) + geom_histogram_(as.mix = TRUE) + labs(caption = "Figure 2") g ``` ### Just call `geom_hist`! Function `geom_histogram_` is often used as one factor is categorical and the other is numerical, while function `geom_bar_` accommodate two categorical variables. The former one relies on **`stat = bin_`** and the latter one is on **`stat = count_`**. However, if we turn the factor of interest as numerical in `geom_bar_`, there would be no difference between the output of a bar plot and a histogram. Hence, function `geom_hist` is created by simplifying the process. It understands both cases and users can just call `geom_hist` to create either a bar plot or a histogram. ## Density We could also draw density plot side by side to better convey the data of interest. With `geom_density_`, both summaries can be displayed simultaneously in one chart. Note that for cylinder 4 and 6, the density representing 3 gear transmission and 5 gear transmission cars are missing respectively. The reason is that for these two subgroups, the number of observations is not big enough to be used to compute the density. ```{r geom_density_graph} g + # parameter "positive" controls where the summaries face to geom_density_(as.mix = TRUE, positive = FALSE, alpha = 0.2) + labs(caption = "Figure 3") ``` In Figure 3, an argument `as.mix` is set as `TRUE`. What does it mean? Before we introduce it, let us look at the total count of `cyl` in `mtcars`. ```{r geom_density_count} tab <- table(mtcars2$cyl) knitr::kable( data.frame( cyl = names(tab), count = unclass(tab), row.names = NULL ) ) ``` In the sample, the total number of cylinder 8 cars is approximately twice as much as the group cylinder 6. Within each group, if the `as.mix` is set as `FALSE` (default), shown as Figure 4, the area for each subgroup (in general, one color represents one subgroup) is 1 and the whole area is **3** in total. There is no problem to think it as a real "density", however, if we consider it as a "continuous histogram" (the binwidth is approaching 0), it may be misleading somehow. Instead, the `as.mix` could be set as `TRUE` so that the sum of the density area within each group is **1**. The area for each subgroup is proportional to the count, as Figure 5. ```{r geom_density, fig.width=10, fig.height=6} gd1 <- ggplot(mtcars2, mapping = aes(x = mpg, fill = cyl)) + # it is equivalent to call `geom_density()` geom_density_(alpha = 0.3) + scale_fill_brewer(palette = "Set3") + labs(caption = "Figure 4") gd2 <- ggplot(mtcars2, mapping = aes(x = mpg, fill = cyl)) + geom_density_(as.mix = TRUE, alpha = 0.3) + scale_fill_brewer(palette = "Set3") + labs(caption = "Figure 5") gridExtra::grid.arrange(gd1, gd2, nrow = 1) ``` Additionally, function `geom_density_` (so does `geom_histogram_`) provides another parameter `scale.y` to set the scales across different groups (different cylinder types). The default `data` indicates that the area of each density estimates is proportional to the overall count. If the `scale.y` is set as `group`, regardless of the other groups, the density estimate of each subgroup is scaled respecting by its own group, as Figure 6. The benefit is that, within each group, the pattern of the density is easier to be visualized and compared. However, across different groups, it is *meaningless* to compare. ```{r geom_density_graph scale.y} g + # parameter "positive" controls where the summaries face to geom_density_(positive = FALSE, alpha = 0.2, scale.y = "group") + labs(caption = "Figure 6") ``` ## Scaling Functions `geom_density_` and `geom_histogram_` provide two scaling controls, `as.mix` and `scale.y`. **DO NOT** be confused. The `as.mix` controls the scaling within each group and the `scale.y` controls the scaling across different groups. Well, ..., if you are still confused, the following graph may help you better understand the `as.mix` and `scale.y`. The data has two groups "1" and "2". Within each group, there are two subgroups "A" and "B". The count of each subgroup is shown as follows. ```{r arbitrary data} data <- data.frame(x = c(rep("1", 900), rep("2", 100)), y = rnorm(1000), z = c(rep("A", 100), rep("B", 800), rep("A", 10), rep("B", 90))) data %>% dplyr::group_by(x, z) %>% summarise(count = n()) %>% kable() ``` Figure 7 shows the all four combinations of `scale.y` and `as.mix`. + Compare the graphics vertically: * `scale.y = group`: within each group, either bins or densities are in a relatively large scale. For example, for group "2", the total count is only one tenth of the group "1". With such scaling strategy, the pattern of its bin/density can be visualized easily. However, through this figure, we cannot tell the ratio of the total number of group "1" to group "2". * `scale.y = data`: the area of each group is proportional to its count. Through the figure, we can easily tell that there is more observations in group "1". However, for the minority group "2", it is really hard to tell its distribution. + Compare the graphics horizontally: * `as.mix = TRUE`: within each group, the area of the subgroup is proportional to its count. For example, in group "1", the ratio of the count for subgroup "A" over "B" is $\frac{1}{8}$ so that the area of the "A" over "B" in group 1 is approximate $\frac{1}{8}$. * `as.mix = FALSE`: within each group, the area of the subgroup is identical. For group "1", the ratio of the count for subgroup "A" over "B" is $\frac{1}{8}$, but in the Figure, the area of "A" over "B" is approximate $1:1$. ```{r overall comparison, fig.width=10, fig.height=8} grobs <- list() i <- 0 position <- "stack_" prop <- 0.4 for(scale.y in c("data", "group")) { for(as.mix in c(TRUE, FALSE)) { i <- i + 1 g <- ggplot(data, mapping = aes(x = x, y = y, fill = z)) + geom_histogram_(scale.y = scale.y, as.mix = as.mix, position = position, prop = prop) + geom_density_(scale.y = scale.y, as.mix = as.mix, positive = FALSE, position = position, alpha = 0.4, prop = prop) + ggtitle( label = paste0("`scale.y` is ", scale.y, "\n", "`as.mix` is ", as.mix) ) if(i == 4) g <- g + labs(caption = "Figure 7") grobs <- c(grobs, list(g)) } } gridExtra::grid.arrange(grobs = grobs, nrow = 2) ``` ## Set Positions Note that when we set `position` in function `geom_histogram_()` or `geom_density`, we should use the underscore case, that is "stack_", "dodge_" or "dodge2_" (instead of "stack", "dodge" or "dodge2"). ### Position `stack_` We can stack the bin/density on top of each other by setting `position = 'stack_'` (default `position = 'identity_'`) ```{r set position stack} ggplot(mtcars, mapping = aes(x = factor(am), y = mpg, fill = factor(cyl))) + geom_density_(position = "stack_", prop = 0.75, as.mix = TRUE) + labs(caption = "Figure 8") ``` ### Position `dodge_`(`dodge2_`) Dodging preserves the vertical position of an geom while adjusting the horizontal position. ```{r set position dodge} ggplot(mtcars, mapping = aes(x = factor(am), y = mpg, fill = factor(cyl))) + # use more general function `geom_hist_` # `dodge2` works without a grouping variable in a layer geom_hist_(position = "dodge2_") + labs(caption = "Figure 7") ``` ## Conclusions Functions `geom_histogram_` and `geom_density_` give a general solution of histogram and density plot. Two variables can be provided to compactly display the distribution of a continuous variable. Besides, different scaling strategies are provided for users to tailor their own specific problems. If only one variable is provided in `geom_density_()`, `geom_histogram_()` or `geom_bar_()`, the function `ggplot2::geom_density()`, `ggplot2::geom_histogram()` and `ggplot2::geom_bar()` will be executed automatically.