bulkAnalyseR: An accessible, interactive pipeline for analysing and sharing bulk sequencing results

Bulk sequencing experiments are essential for exploring a wide range of biological questions. To bring the data analysis closer to its interpretation and facilitate both interactive, exploratory tasks and the sharing of (easily accessible) information, we present bulkAnalyseR an R package that offers a seamless, customisable solution for most bulk sequencing datasets. By integrating state-of-the-art approaches without relying on extensive computational support, and replacing static images with interactive panels, our aim is to further support and strengthen the reusability of data.

bulkAnalyseR enables standard analyses of bulk data, using an expression matrix as starting point. It presents the outputs of various steps in an interactive web-based interface, making it easy to generate, explore and verify hypotheses. The plots and tables generated can be easily downloaded and presented elsewhere. Moreover, the app can be easily shared and published, incentivising research reproducibility and allowing others to explore the same processed data and enhance the biological conclusions.

The example app described in this vignette can be found here

Installation

To install the package, first install all CRAN dependencies:

packages.cran <- c("ggplot2",
                   "shiny",
                   "shinythemes",
                   "gprofiler2",
                   "stats",
                   "ggrepel",
                   "utils",
                   "RColorBrewer",
                   "circlize",
                   "shinyWidgets",
                   "shinyjqui",
                   "dplyr",
                   "magrittr",
                   "ggforce",
                   "rlang",
                   "glue",
                   "matrixStats",
                   "noisyr",
                   "tibble",
                   "ggnewscale",
                   "ggrastr",
                   "visNetwork",
                   "shinyLP",
                   "grid",
                   "DT",
                   "scales",
                   "shinyjs",
                   "tidyr",
                   "UpSetR")
new.packages.cran <- packages.cran[!(packages.cran %in% installed.packages()[, "Package"])]
if(length(new.packages.cran))
  install.packages(new.packages.cran)

Then install bioconductor dependencies:

packages.bioc <- c("edgeR", 
                   "DESeq2",
                   "preprocessCore",
                   "GENIE3",
                   "ComplexHeatmap")

new.packages.bioc <- packages.bioc[!(packages.bioc %in% installed.packages()[,"Package"])]
if(length(new.packages.bioc)){
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install(new.packages.bioc)
}

Finally, you can install the package from CRAN, or install the latest stable version of bulkAnalyseR from GitHub:

install.packages("bulkAnalyseR")

### OR

if (!requireNamespace("devtools", quietly = TRUE))
  install.packages("devtools")

devtools::install_github("Core-Bioinformatics/bulkAnalyseR")

Preprocessing

First, load bulkAnalyseR (plus dplyr and ggplot2 for convenience):

library(bulkAnalyseR)
library(dplyr)
##> Warning: package 'dplyr' was built under R version 4.1.2
##> 
##> Attaching package: 'dplyr'
##> The following objects are masked from 'package:stats':
##> 
##>     filter, lag
##> The following objects are masked from 'package:base':
##> 
##>     intersect, setdiff, setequal, union
library(ggplot2)

Loading an expression matrix

For this vignette we are using a subset of the count matrix for an experiment included in a 2019 paper by Yang et al. Rows represent genes/features and columns represent samples:

counts.in <- system.file("extdata", "expression_matrix.csv", package = "bulkAnalyseR")
exp <- as.matrix(read.csv(counts.in, row.names = 1))
head(exp)
##>                    SRR7624365 SRR7624366 SRR7624371 SRR7624372 SRR7624375
##> ENSMUSG00000102693          2          0          0          0          0
##> ENSMUSG00000051951          6          4          2          0         47
##> ENSMUSG00000102851          0          0          0          0          0
##> ENSMUSG00000103377          0          0          2          0          8
##> ENSMUSG00000104017          0          0          0          2          0
##> ENSMUSG00000103025          0          0          0          0          0
##>                    SRR7624376
##> ENSMUSG00000102693          0
##> ENSMUSG00000051951         37
##> ENSMUSG00000102851          1
##> ENSMUSG00000103377          6
##> ENSMUSG00000104017          2
##> ENSMUSG00000103025          3

Defining metadata

The subset of the dataset we are using has samples from 3 timepoints: 0h, 12h and 36h, each with 2 biological replicates. We define a metadata table detailing which sample correspond to which timepoint:

meta <- data.frame(
  srr = colnames(exp), 
  timepoint = rep(c("0h", "12h", "36h"), each = 2)
)
meta
##>          srr timepoint
##> 1 SRR7624365        0h
##> 2 SRR7624366        0h
##> 3 SRR7624371       12h
##> 4 SRR7624372       12h
##> 5 SRR7624375       36h
##> 6 SRR7624376       36h

This metadata table should be a data frame containing at minimum two columns: the first column must contain the column names of the expression.matrix, while the last column is assumed to contain the experimental conditions that will be tested for differential expression.

Denoising and normalisation

Before using the expression matrix to create our shiny app, some preprocessing should be performed. bulkAnalyseR contains the function preprocessExpressionMatrix which takes the expression matrix as input then denoises the data using noisyR and normalises it (quantile normalisation by default, other methods can bespecified using the normalisation.method parameter). By specifying output.plot = TRUE, you can also print the expression-similarity line plots from noisyR to console and you can specify further parameters from the noisyR noisyr_counts.

exp.proc <- preprocessExpressionMatrix(exp, output.plot = TRUE)
##> >>> noisyR counts approach pipeline <<<
##> The input matrix has 38705 rows and 6 cols
##>     number of genes: 38705
##>     number of samples: 6
##> Calculating the number of elements per window
##>     the number of elements per window is 3870
##>     the step size is 193
##>     the selected similarity metric is correlation_pearson
##>   Working with sample 1
##>   Working with sample 2
##>   Working with sample 3
##>   Working with sample 4
##>   Working with sample 5
##>   Working with sample 6
##> Calculating noise thresholds for 6 samples...
##>     similarity.threshold = 0.25
##>     method.chosen = Boxplot-IQR
##> Denoising expression matrix...
##>     removing noisy genes
##>     adjusting matrix
##> >>> Done! <<<
##> Performing quantile normalisation...
##> Done!

It is not recommended to use data which has not been denoised and normalised as input to generateShinyApp. You can also perform your own preprocessing outside preprocessExpressionMatrix.

Creating a shiny app

The example app for the Yang 2019 data can be found here. Readers are encouraged to try out the different features described below in the live version of the app to familiarise themselves with the different features.

The central function in bulkAnalyseR is generateShinyApp. This function creates an app.R file and all required objects to run the app in .rda format in the target directory. The key inputs to generateShinyApp are expression.matrix (after being processed using preprocessExpressionMatrix) and metadata. You can also specify the title of the app (which will appear in the navigation bar at the top of the app) with app.title, the modality (RNA by default) that will be displayed as the name of the main tab, the directory where the app should be saved with shiny.dir and the shiny theme you wish to use (‘flatly’ is the default, you can find the other options here). It is also recommended that you specify the organism on which your data was generated, firstly using the organism parameter using the gprofiler2 naming convention e.g. ‘hsapiens’,‘mmusculus’ (see here for the full list of organisms and IDs), and secondly specifying the database for annotations to convert ENSEMBL IDs to gene names e.g. org.Hs.eg.db - the full list of bioconductor packaged databases can be seen using this command:

BiocManager::available("^org\\.")

The dataset in this example was generated on M. musculus so we would generate the app using this function call (note that the org.Mm.eg.db needs to be installed):

generateShinyApp(
  shiny.dir = "shiny_Yang2019",
  app.title = "Shiny app for visualisation of three timepoints from the Yang 2019 data",
  modality = "RNA",
  expression.matrix = exp.proc,
  metadata = meta,
  organism = "mmusculus",
  org.db = "org.Mm.eg.db"
)

If no organism is specified for g:profiler then the enrichment tab will be automatically excluded. If no model organism database is specified then the row names of the expression matrix will be used throughout. It is recommended to use ENSEMBL ids as the row names and supply the model organism through the org.db parameter if your organism is among the ones provided, to ensure compatibility.

This will create a folder called shiny_Yang2019 in which there will be 2 data files expression_matrix.rda and metadata.rda and app.R which defines the app. To see the app, you can call shiny::runApp(‘shiny_Yang2019’) and the app will start. The app generated is standalone and can be shared with collaborators or published online through a platform like . This provides an easy way for anyone to explore the data and verify the conclusions, increasing access and promoting reproducibility of the bioinformatics analysis.

By default, the app will have 10 panels: Home, Sample select, Quality checks, Differential expression, Volcano and MA plots, DE summary, Enrichment, Expression patterns, Cross plot, GRN inference. You can choose to remove one or more panels using the panels.default parameter.

generateShinyApp(
  shiny.dir = "shiny_Yang2019_onlyQC_DE",
  app.title = "Shiny app for visualisation of three timepoints from the Yang 2019 data",
  modality = "RNA",
  expression.matrix = exp.proc,
  metadata = meta,
  organism = "mmusculus",
  org.db = "org.Mm.eg.db",
  panels.default = c('QC','DE')
)

An app that contains multiple modalities, different datasets, or different subsets of the same dataset can also be generated. The modality, expression.matrix, metadata, organism, org.db, and panels.default parameters are all vectorised and accept multiple arguments (a vector or a list if the original argument was more complex). Note that all these parameters must be the same length (or length 1) to avoid errors. An example of creating an app with two different versions of the original app created above is presented below:

generateShinyApp(
  shiny.dir = "shiny_Yang2019_two_modalities",
  app.title = "Shiny app for visualisation of three timepoints from the Yang 2019 data",
  modality = c("RNA 1", "RNA 2"),
  expression.matrix = exp.proc,
  metadata = meta,
  organism = "mmusculus",
  org.db = "org.Mm.eg.db",
  panels.default = list(
    c("Landing", "SampleSelect", "QC", "DE", "DEplot", "DEsummary", 
      "Enrichment", "Patterns", "Cross", "GRN"),
    c('QC','DE')
  )
)

See the following sections for more details about the default panels. For each panel, you can also find example code of how you could generate similar results or plots to the ones displayed in the panel without using the app.

Home

This tab contains some useful links (including this vignette!) to get you started with bulkAnalyseR.

Sample Select

This tab allows you to select a subset of samples to use for the analysis. In particular this allows you to exclude low quality samples (which you could discover using QC tab) or focus on particular parts of the data. Once you have selected the samples you want, make sure to press the ‘Use the selected sample!’ button to confirm your choice. If you do not use this button or select samples then all samples will be used.

Quality check (QC)

The Quality Check (QC) tab includes a Jaccard Similarity Index (JSI) heatmap, and a PCA dimensionality reduction, with groups based on the metadata information. This enables a high-level overview of the similarity across samples, usually reflecting the experimental design quite closely and serving as a sanity check. Within this panel, there is also the option to show the MA plot between any 2 columns - this can be used to further investigate any sample similarity or differences found through the JSI and PCA plots.

Jaccard Similarity Index

Within the JSI plot, you can change the order of samples using an ordering of metadata columns as well as specifying the number of top abundance genes that should be used to calculate the JSI. Finally, you can choose whether JSI values should be shown on the heatmap.