Author: Alan O’Callaghan (alan.b.ocallaghan@gmail.com)
This vignette is intended to provide an overview of the use of heatmaply
in the analysis of biological data. In particular, this vignette deals with its use to visualize gene expression patterns and sample relationships in RNAseq data relating to breast cancer samples, as retrieved from from Genomic Data Commons.
Due to the size of the objects, this file is seperated to three fils. You can view this series in the following links:
This is file 1 in the series.
# Let's load the packages
library(heatmaply)
#> Loading required package: plotly
#> Loading required package: ggplot2
#>
#> Attaching package: 'plotly'
#> The following object is masked from 'package:ggplot2':
#>
#> last_plot
#> The following object is masked from 'package:stats':
#>
#> filter
#> The following object is masked from 'package:graphics':
#>
#> layout
#> Loading required package: viridis
#> Loading required package: viridisLite
#>
#> ---------------------
#> Welcome to heatmaply version 0.11.1
#> Type ?heatmaply for the main documentation.
#> The github page is: https://github.com/talgalili/heatmaply/
#>
#> Suggestions and bug-reports can be submitted at: https://github.com/talgalili/heatmaply/issues
#> Or contact: <tal.galili@gmail.com>
#>
#> To suppress this message use: suppressPackageStartupMessages(library(heatmaply))
#> ---------------------
library(heatmaplyExamples)
The data used in this vignette is available within the package.
In order to recreate the output you will need to manually run the code available here:
## This script downloads the expression data for all breast cancer samples
## from GDC/TCGA, and filters them to have only the genes in the
## PAM50 gene set
library('TCGAbiolinks')
library('SummarizedExperiment')
library('biomaRt')
library('voom')
query <- GDCquery(project = 'TCGA-BRCA',
data.category = 'Transcriptome Profiling',
data.type = 'Gene Expression Quantification',
workflow.type = 'HTSeq - Counts'
)
GDCdownload(query)
data <- GDCprepare(query)
## Retain only tumour samples
ind_tumor <- colData(data)[['definition']]== 'Primary solid Tumor'
data <- data[, ind_tumor]
## Annotate using biomaRt
mart <- useDataset('hsapiens_gene_ensembl', useMart('ensembl'))
genes <- rownames(data)
symbols <- getBM(filters= 'ensembl_gene_id',
attributes= c('ensembl_gene_id','hgnc_symbol'),
values = genes,
mart= mart)
## Remove those not annotated
ind_nchar <- as.logical(nchar(symbols[['hgnc_symbol']]))
symbols <- symbols[ind_nchar, ]
data <- data[symbols[['ensembl_gene_id']], ]
rownames(data) <- symbols[['hgnc_symbol']]
## Subset to those annotated with a symbol here
pam50_genes <- intersect(pam50_genes, symbols[['hgnc_symbol']])
clinical_cols <- c('subtype_Integrated.Clusters..with.PAM50.',
'subtype_ER.Status',
'subtype_PR.Status',
'subtype_HER2.Final.Status'
)
## Only interested in those which have all subtypes.
subtypes <- colData(data)[, clinical_cols]
ind_has_subtypes <- sapply(seq_len(nrow(subtypes)),
function(i) {
all(!is.na(subtypes[i, ]))
})
data <- data[, ind_has_subtypes]
tcga_brca_clinical <- colData(data)
tcga_brca_clinical <- tcga_brca_clinical[, clinical_cols]
colnames(tcga_brca_clinical) <- gsub('subtype_', '', colnames(tcga_brca_clinical))
stypes <- c('ER.Status', 'PR.Status', 'HER2.Final.Status')
tcga_brca_clinical[, stypes] <- lapply(tcga_brca_clinical[, stypes],
function(col) {
col <- as.character(col)
ifelse(col %in% c('Positive', 'Negative'), col, NA)
}
)
## Set up final objects
tcga_brca_clinical <- as.data.frame(tcga_brca_clinical)
expression <- assay(data, 'HTSeq - Counts')
expression <- expression[!duplicated(rownames(expression)), ]
voomed_expression <- as.matrix(voom(expression))
raw_expression <- expression
The R package limma is a commonly used statistical analysis tool using empirical Bayes methods. This package was originally developed for micro-array data. voom is a function within this package which transforms RNAseq count data in log2 counts per million reads, which allows it to be treated similarly. This function was used in the present example to provide normalized expression values. While pre-processing the data, quantile and TMM normalization were also applied. To compare these normalized values to raw read counts, the raw read counts were log2-transformed. 0.5 was added to prevent infinite values resulting from log2(0)
.
A common workflow for RNAseq differential expression analysis is to visualize gene expression measures and sample-sample correlations using a heatmap. This is useful to observe whether samples appear to cluster together, for the purposes of identifying poor quality samples or outliers. Furthermore, it may be useful to visualize differentially-expressed genes alongside row annotations which indicate patient or sample sub-group.
The following examples show sample-sample correlation based on all genes measured. Clustering in both instances appears to be related to membrane receptor subtypes, as shown in the row annotation.
When using the heatmaply
function, notice the use of the showticklabels
argument - by turning the labels off, the resulting heatmap is much lighter and allows fast zoom-in. The actual values can still be identified when hovering the mouse over the cells. Also, notice the use of the default color scheme chosen so to emphasize the correlation (which has low variability).
cor_mat_raw_logged <- cor(log2(raw_expression + 0.5))
heatmaply(cor_mat_raw_logged,
row_side_colors = tcga_brca_clinical,
main = 'Sample-sample correlation, log2 counts',
showticklabels = c(FALSE, FALSE),
plot_method = 'plotly')