--- title: "Classification by Tour" author: "Wayne Oldford and Zehao Xu" date: "`r Sys.Date()`" bibliography: references.bib fontsize: 12pt link-citations: yes linkcolor: blue output: rmarkdown::html_vignette: toc: true geometry: margin=.75in urlcolor: blue graphics: yes vignette: > %\VignetteIndexEntry{Classification by Tour} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} header-includes: - \usepackage{graphicx} - \usepackage{epic} - \usepackage{color} - \usepackage{hyperref} - \usepackage{multimedia} - \PassOptionsToPackage{pdfmark}{hyperref}\RequirePackage{hyperref} - \newcommand{\code}[1]{\texttt{#1}} - \newcommand{\ve}[1]{\mathbf{#1}} - \newcommand{\pop}[1]{\mathcal{#1}} - \newcommand{\samp}[1]{\mathcal{#1}} - \newcommand{\subspace}[1]{\mathcal{#1}} - \newcommand{\sv}[1]{\boldsymbol{#1}} - \newcommand{\sm}[1]{\boldsymbol{#1}} - \newcommand{\tr}[1]{{#1}^{\mkern-1.5mu\mathsf{T}}} - \newcommand{\abs}[1]{\left\lvert ~{#1} ~\right\rvert} - \newcommand{\size}[1]{\left\lvert {#1} \right\rvert} - \newcommand{\norm}[1]{\left|\left|{#1}\right|\right|} - \newcommand{\field}[1]{\mathbb{#1}} - \newcommand{\Reals}{\field{R}} - \newcommand{\Integers}{\field{Z}} - \newcommand{\Naturals}{\field{N}} - \newcommand{\Complex}{\field{C}} - \newcommand{\Rationals}{\field{Q}} - \newcommand{\widebar}[1]{\overline{#1}} - \newcommand{\wig}[1]{\tilde{#1}} - \newcommand{\bigwig}[1]{\widetilde{#1}} - \newcommand{\leftgiven}{~\left\lvert~} - \newcommand{\given}{~\vert~} - \newcommand{\indep}{\bot\hspace{-.6em}\bot} - \newcommand{\notindep}{\bot\hspace{-.6em}\bot\hspace{-0.75em}/\hspace{.4em}} - \newcommand{\depend}{\Join} - \newcommand{\notdepend}{\Join\hspace{-0.9 em}/\hspace{.4em}} - \newcommand{\imply}{\Longrightarrow} - \newcommand{\notimply}{\Longrightarrow \hspace{-1.5em}/ \hspace{0.8em}} - \newcommand*{\intersect}{\cap} - \newcommand*{\union}{\cup} - \DeclareMathOperator*{\argmin}{arg\,min} - \DeclareMathOperator*{\argmax}{arg\,max} - \DeclareMathOperator*{\Ave}{Ave\,} - \newcommand{\permpause}{\pause} - \newcommand{\suchthat}{~:~} - \newcommand{\st}{~:~} --- ```{r setup, include=FALSE, warning=FALSE} knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, fig.align = "center", fig.width = 6, fig.height = 5, out.width = "60%", collapse = TRUE, comment = "#>", tidy.opts = list(width.cutoff = 65), tidy = FALSE) library(knitr) library(magrittr) library(loon.tourr, quietly = TRUE) library(tidyverse, quietly = TRUE) library(class, quietly = TRUE) set.seed(12314159) imageDirectory <- "./images/classification" dataDirectory <- "./data/classification" path_concat <- function(path1, ..., sep="/") { # The "/" is standard unix directory separator and so will # work on Macs and Linux. # In windows the separator might have to be sep = "\" or # even sep = "\\" or possibly something else. paste(path1, ..., sep = sep) } ``` ## Introduction A grand tour "method" is an algorithm for assigning a sequence of projections onto a lower dimensional spaces. After the original multivariate dataset is projected onto some "interesting" plane, a question may be raised here, "what's next?" Well, one of the usage could be in "classification". Rather than putting the original data set into the classifier. Controlling other hyper-parameters, can an interesting projection improve the performance of the prediction? In this vignette, we will learn about it. 
## Data

The data set `olive` records the percentage composition of 8 fatty acids (`palmitic`, `palmitoleic`, `stearic`, etc.) found in the lipid fraction of 572 Italian olive oils. The oils were sampled from three Italian regions, with a varying number of areas within each region. The regions and their areas are recorded as shown in the following table [@loon]:

```{r table, echo = FALSE, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center"}
data.frame(
  Region = c("South", "North", "Sardinia"),
  Area = c("North-Apulia, South-Apulia, Calabria, Sicily",
           "East-Liguria, West-Liguria, Umbria",
           "Coastal-Sardinia, Inland-Sardinia")
) %>%
  kable()
```

First, we randomly select 80% of the observations as the training set and keep the remaining 20% as the test set.

```{r split data, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center"}
set.seed(123)
N <- nrow(olive)
trainId <- sample(seq(N), size = floor(0.8 * N))
testId <- setdiff(seq(N), trainId)
acids <- setdiff(colnames(olive), c("region", "area"))
trainX <- olive[trainId, acids]
testX <- olive[testId, acids]
trainY <- olive[trainId, "region"]
testY <- olive[testId, "region"]
```

Next, since the variables differ considerably in magnitude, the data are scaled so that no single variable dominates the projection. Here the `variable` scaling method is applied, which scales each variable to the interval $[0, 1]$ (detailed descriptions of the available scaling methods can be found in the `l_tour` documentation, `help("l_tour")`).

```{r scaling, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center"}
row.names(trainX) <- NULL
kable(head(trainX))
scalingTrainX <- loon::l_getScaledData(trainX, scaling = "variable")
scalingTestX <- loon::l_getScaledData(testX, scaling = "variable")
kable(head(scalingTrainX), digits = 2)
```

## Projection Methods

The classifier used in this vignette is k-nearest neighbours, `knn` [@altman1992introduction].

```{r knn helpers, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center"}
# knn_pred: kNN prediction accuracy on the test set, for each value of k
knn_pred <- function(trainX, trainY, testX, testY, k = c(5, 10, 20)) {
  len_test <- length(testY)
  vapply(k,
         function(num) {
           yhat <- class::knn(trainX, testX, trainY, k = num)
           sum(yhat == testY)/len_test
         }, numeric(1L))
}
# low_dim_knn_pred: reduce the dimension with `fun`, then compute the
# kNN accuracy for each target dimension in `dims`
low_dim_knn_pred <- function(dims = 2:5, fun, trainX, trainY,
                             testX, testY, k = c(5, 10, 20),
                             setNames = TRUE) {
  tab <- lapply(dims,
                function(d) {
                  knn_pred(
                    fun(trainX, d), trainY,
                    fun(testX, d), testY
                  )
                }) %>%
    as.data.frame() %>%
    as_tibble()
  if(setNames) {
    tab <- tab %>%
      setNames(nm = paste0("d = ", dims))
  }
  rownames(tab) <- paste0("k = ", k)
  tab
}
```

### Method I: p choose k

The most basic "projection" is simply to choose a $d$ dimensional subspace spanned by $d$ of the $p$ original variables. Since we have 8 variables, for $d = 2$ there are ${8 \choose 2} = 28$ possible pairs. To simplify the process, for each dimension we keep only the variable combination with the highest average prediction accuracy.
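As a quick check of how many variable subsets are involved for each dimension, here is a small illustration using base R's `choose()` and `combn()` (nothing here is specific to `loon.tourr`):

```{r subset counts, warning=FALSE, message=FALSE, error=FALSE}
# How many d-variable subsets of the 8 acids are there, for d = 2, ..., 5?
choose(8, 2:5)
# combn() enumerates the subsets themselves; each column indexes one subset
dim(combn(8, 2))
```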
```{r p choose k, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center"}
# the candidate subspace dimensions d
dims <- 2:5
var_names <- colnames(scalingTrainX)
low_dim_names <- c()
K <- ncol(trainX)
pChooseD <- lapply(dims,
                   function(d) {
                     com <- combn(K, d)
                     pred <- apply(com, 2,
                                   function(pair) {
                                     knn_pred(trainX[, pair], trainY,
                                              testX[, pair], testY)
                                   })
                     mean_pred <- apply(pred, 2, mean)
                     id <- which.max(mean_pred)
                     low_dim_names <<- c(low_dim_names,
                                         paste(var_names[com[, id]], collapse = ":"))
                     pred[, id]
                   }) %>%
  as.data.frame() %>%
  as_tibble() %>%
  setNames(nm = paste0("d = ", dims))
rownames(pChooseD) <- paste0("k = ", c(5, 10, 20))
```

The best variable combination for each dimension is

```{r pairs}
names(low_dim_names) <- paste0("d = ", dims)
low_dim_names
```

The prediction accuracy is

```{r p choose k prediction table}
kable(pChooseD, row.names = TRUE, digits = 3)
```

### Method II: PCA

PCA (principal component analysis) is an orthogonal linear transformation that maps the data to a new coordinate system so that the greatest variance of the data lies along the first coordinate (the first principal component, associated with the largest eigenvalue), the second greatest variance along the second coordinate (the second largest eigenvalue), and so on. The standard deviations of the principal components (the square roots of the eigenvalues) for our data set are

```{r PCA, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center"}
trainXPCA <- princomp(scalingTrainX)
testXPCA <- princomp(scalingTestX)
round(trainXPCA$sdev, 2)
```

The first five components are retained, as together they account for more than 85\% of the total variance.

```{r PCA knn, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center"}
PCA <- low_dim_knn_pred(2:5,
                        fun = function(princomp, d) {princomp$scores[, seq(d)]},
                        trainXPCA, trainY,
                        testXPCA, testY)
kable(PCA, row.names = TRUE, digits = 3)
```

### Method III: LLE

LLE (locally linear embedding) [@roweis2000nonlinear] begins by finding the nearest neighbours of each point, then computes, for each point, a set of weights that best describes the point as a linear combination of its neighbours. Finally, it uses an eigenvector-based optimization technique to find the low-dimensional embedding of the points.

```{r LLE, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center", eval = FALSE}
library(RDRToolbox)
lle <- low_dim_knn_pred(2:5,
                        fun = function(data, d) {
                          LLE(data, dim = d, k = 5)
                        },
                        scalingTrainX, trainY,
                        scalingTestX, testY)
kable(lle, row.names = TRUE, digits = 3)
```

```{r LLE readRDS, echo = FALSE}
lle <- readRDS(path_concat(dataDirectory, "lle.RDS"))
kable(lle, row.names = TRUE, digits = 3)
```

### Method IV: Random Tour

We begin with a simple call to `l_tour`:

```{r tour 2D, eval = FALSE, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center"}
p2 <- l_tour(scalingTrainX, color = trainY)
l <- l_layer_hull(p2, group = trainY)
```

Here, each group is assigned a different colour. In addition, a convex hull is drawn around each group (`l_layer_hull`) so that the separation of the groups is easier to see. As we scroll the bar, we eventually reach a random projection that splits the groups well (the hulls do not intersect).
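Under the hood, each frame of the tour corresponds to a matrix with orthonormal columns, and the display shows the scaled data multiplied by that matrix. Below is a minimal sketch of this idea, using a random orthonormal basis from `tourr` as a stand-in for the projection chosen interactively above:

```{r projection sketch, warning=FALSE, message=FALSE, error=FALSE}
# A 2-D tour frame for this data is an 8 x 2 matrix with orthonormal columns;
# basis_random() draws one at random (a stand-in for the interactively
# chosen projection).
proj <- tourr::basis_random(ncol(scalingTrainX), d = 2)
round(crossprod(proj), 3)      # approximately the identity: columns are orthonormal
head(scalingTrainX %*% proj)   # the projected (2-D) coordinates that get displayed
```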
```{r 2D projection, echo = FALSE, warning=FALSE, message=FALSE, error=FALSE, fig.width=3, fig.height=3, fig.align="center", out.width = "70%"}
include_graphics(path_concat(imageDirectory, "proj2D.PNG"))
proj2D <- readRDS(path_concat(dataDirectory, "proj2D.RDS")) %>%
  as.matrix()
```

The matrix of projection vectors is

```{r 2D show, eval = FALSE, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center", out.width = "70%"}
proj2D <- p2["projection"]
```

```{r 2D projection display, echo = FALSE, warning=FALSE, message=FALSE, error=FALSE, fig.width=3, fig.height=3, fig.align="center", out.width = "70%"}
kable(as.data.frame(proj2D, row.names = colnames(trainX)), digits = 2)
```

Next, we create 3-, 4- and 5-dimensional tour paths (by modifying `tour_path`). An "interesting" projection here is one in which the three groups are well separated on at least one axis. For example, in the 3D projection below, the gray group is distinguished from the others on axis V1, the pink group stands apart on axis V2, and the blue group stands out on axis V3.

```{r tour 3D, eval = FALSE, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center", out.width = "70%"}
p3 <- l_tour(scalingTrainX,
             tour_path = tourr::grand_tour(3),
             color = trainY,
             axesLayout = "parallel")
proj3D <- p3["projection"]
```

```{r tour 3D projection, echo = FALSE, warning=FALSE, message=FALSE, error=FALSE, fig.width=3, fig.height=3, fig.align="center", out.width = "70%"}
include_graphics(path_concat(imageDirectory, "proj3D.PNG"))
proj3D <- readRDS(path_concat(dataDirectory, "proj3D.RDS")) %>%
  as.matrix()
```

```{r tour 4D, eval = FALSE, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center", out.width = "70%"}
p4 <- l_tour(scalingTrainX,
             tour_path = tourr::grand_tour(4),
             color = trainY,
             axesLayout = "parallel")
proj4D <- p4["projection"]
```

```{r tour 4D projection, echo = FALSE, warning=FALSE, message=FALSE, error=FALSE, fig.width=3, fig.height=3, fig.align="center", out.width = "70%"}
include_graphics(path_concat(imageDirectory, "proj4D.PNG"))
proj4D <- readRDS(path_concat(dataDirectory, "proj4D.RDS")) %>%
  as.matrix()
```

```{r tour 5D, eval = FALSE, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center", out.width = "70%"}
p5 <- l_tour(scalingTrainX,
             tour_path = tourr::grand_tour(5),
             color = trainY,
             axesLayout = "parallel")
proj5D <- p5["projection"]
```

```{r tour 5D projection, echo = FALSE, warning=FALSE, message=FALSE, error=FALSE, fig.width=3, fig.height=3, fig.align="center", out.width = "70%"}
include_graphics(path_concat(imageDirectory, "proj5D.PNG"))
proj5D <- readRDS(path_concat(dataDirectory, "proj5D.RDS")) %>%
  as.matrix()
```

```{r tour compute, warning=FALSE, message=FALSE, error=FALSE, fig.width=4, fig.height=3, fig.align="center", out.width = "70%"}
tour <- low_dim_knn_pred(list(proj2D, proj3D, proj4D, proj5D),
                         fun = function(data, proj) {
                           data %*% as.matrix(proj)
                         },
                         scalingTrainX, trainY,
                         scalingTestX, testY,
                         setNames = FALSE)
colnames(tour) <- paste0("d = ", 2:5)
kable(tour, row.names = TRUE, digits = 3)
```
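For a single projection matrix, the same computation can be written out directly. For instance, for the 2-dimensional projection retrieved above, a sketch using the `knn_pred` helper defined earlier (this simply reproduces the "d = 2" column of the table) is:

```{r tour 2D accuracy, warning=FALSE, message=FALSE, error=FALSE}
# Project the scaled training and test data onto the chosen 2-D basis,
# then classify in that 2-D space.
knn_pred(scalingTrainX %*% proj2D, trainY,
         scalingTestX %*% proj2D, testY,
         k = c(5, 10, 20))
```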
However, it is sometimes not possible to find a matrix of projection vectors that separates every group perfectly, so we need a tool to monitor the classification performance of each projection. The function `loon::l_bind_state()` can be used for this. It takes three arguments, a `target`, an `event` and a `callback` function; whenever the given event is detected on the target, the `callback` function is fired. In our case, *suppose it is very difficult to separate the groups in the 5D tour*; a callback function could be built as

```{r bind_states, warning=FALSE, message=FALSE, eval = FALSE}
callback <- function(W) {
  # W is the widget path name
  # compute the kNN accuracy for the current 5-D projection
  pred <- knn_pred(scalingTrainX %*% p5["projection"], trainY,
                   scalingTestX %*% p5["projection"], testY,
                   k = c(5, 10, 20))
  cat(
    paste0(
      "k = 5: accuracy rate ", round(pred[1], 3), "\n",
      "k = 10: accuracy rate ", round(pred[2], 3), "\n",
      "k = 20: accuracy rate ", round(pred[3], 3), "\n"
    )
  )
}
l_bind_state(target = l_getPlots(p5),
             event = "all",
             callback = callback)
# [1] "stateBinding0"
```

As one scrolls the bar, the accuracy rates for each projection are printed in the console, so the performance of each projection can be monitored directly. In our scenario, the most "interesting" matrix of projection vectors is the one corresponding to the highest accuracy rate.

================= R console =================

\> ...

\> k = 5: accuracy rate 0.991

\> k = 10: accuracy rate 0.991

\> k = 20: accuracy rate 0.991

\> k = 5: accuracy rate 1

\> k = 10: accuracy rate 1

\> k = 20: accuracy rate 1

\> ...

============================================

## Graphical Summaries

```{r graphical summary, warning=FALSE, message=FALSE, error=FALSE, fig.width=12, fig.height=6, fig.align="center", out.width = "70%"}
rbind(tour, lle, PCA, pChooseD) %>%
  mutate(k = rep(c(5, 10, 20), 4),
         method = rep(c("tour", "LLE", "PCA", "pChooseD"), each = 3)) %>%
  pivot_longer(cols = -c(k, method),
               names_to = "Dimensions",
               values_to = "Accuracy") %>%
  mutate(Dimensions = parse_number(Dimensions)) %>%
  ggplot(mapping = aes(x = Dimensions, y = Accuracy, colour = method)) +
  geom_path() +
  facet_wrap(~k) +
  ggtitle("Facetted by the number of neighbours k")
```

From this chart we can tell that:

* In general, `tour` has the best performance. The accuracy of the three-dimensional tour with 10 or 20 neighbours reaches 100\%!

* `LLE` predicts well at `d = 2`; nevertheless, as the dimension rises, its performance falls behind that of `PCA`.

* `PCA` shows a clear monotone increasing trend: the more dimensions are included, the higher the accuracy.

## Conclusion

### Pros

On this data set, `tour` gives the best performance. Even in two-dimensional space, the accuracy can be as high as **98.3\%**, and the process is very intuitive. The `loon.tourr` package also provides several scaling methods: `data` (scale the whole data set to $[0, 1]$), `variable` (scale each column to $[0, 1]$), `observation` (scale each row to $[0, 1]$) and `sphere` (PCA). Additionally, users can supply their own scaling methods.

### Cons

* The process is hard to reproduce. Since the projections are randomly generated, finding a good projection is somewhat arbitrary. A *refresh* button is provided: if none of the existing projections is "interesting", pressing it creates a new set of random projections instantaneously.

* Computing speed. If the number of observations is large (say 10,000), the points do not rotate smoothly as the bar is scrolled, which can make interesting projections harder to spot.

* The difficulty of finding an interesting projection grows with the number of groups.

* The projections themselves are hard to interpret.

## References