This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
RNA sequencing (RNA-seq) is a powerful technique for quantifying gene expression across the transcriptome. It enables the identification of differentially expressed genes (DEGs), detection of novel transcripts, and characterization of transcript isoforms under various biological conditions.
In this analysis, we investigate transcriptomic differences between HS vs. CL lesions using publicly available RNA-seq data from GSE127831, which is part of the study by Amorim et al., 2019. The dataset comprises 28 samples,from human skin biopsies of healthy individuals(n=7) and from lesions of patients infected with Leishmania braziliensis (n=21).
Instead of starting from raw FASTQ files, we retrieved pre-aligned gene-level expression data from the ARCHS4 database. ARCHS4 provides uniformly processed RNA-seq data for human and mouse samples, enabling efficient reuse and reanalysis of public datasets.
Our primary goals are: - To perform quality control and normalization of RNA-seq count data - To identify genes that are significantly differentially expressed between the experimental groups -To interpret biological significance using gene set enrichment or pathway analysis
We use the limma-voom and edgeR frameworks for normalization and differential expression analysis, both of which are well-established methods for RNA-seq data with complex experimental designs.
This document provides a reproducible workflow combining code, output, and interpretation. All steps are performed in R and rendered using R Markdown and the Knitr package.
A variety of R packages was used for this analysis. All graphics and data wrangling were handled using the tidyverse suite of packages. All packages used are available from the Comprehensive R Archive Network (CRAN), Bioconductor.org, or Github.
Pre-aligned gene-level expression data from the ARCHS4 database was retreived for this dataset. ARCHS4 provides uniformly processed RNA-seq data for human and mouse samples, enabling efficient reuse and reanalysis of public datasets.
library(tidyverse)#for data wrangling
library(rhdf5)#for retreiving data from ARCHS4 database
library(edgeR)#for DGE analysis
library(gplots)#for heatmap.2
library(enrichplot)
archs4.human <- "human_gene_v2.latest.h5"
# use the h5 list (h5ls) function from the rhdf5 package to look at the contents of these databases
hdim <- h5ls(archs4.human)
# data for 67,186 HUMAN genes across 819,856 samples
all.samples.human <- h5read(archs4.human, name="meta/samples/geo_accession")
dim(all.samples.human)
## [1] 993191
#the sample identifiers are obtained from the GEO database for GSE 127831
mySamples <- c( "GSM3639530", #skin_HS1
"GSM3639531", #skin_HS2
"GSM3639532", #skin_HS3
"GSM3639533", #skin_HS4
"GSM3639534", #skin_HS5
"GSM3639535", #skin_HS6
"GSM3639536", #skin_HS7
"GSM3639537", #skin_CL1
"GSM3639538", #skin_CL2
"GSM3639539", #skin_CL3
"GSM3639540", #skin_CL4
"GSM3639541", #skin_CL5
"GSM3639542", #skin_CL6
"GSM3639543", #skin_CL7
"GSM3639544", #skin_CL8
"GSM3639545", #skin_CL9
"GSM3639546", #skin_CL10
"GSM3639547", #skin_CL11
"GSM3639548", #skin_CL12
"GSM3639549", #skin_CL13
"GSM3639550", #skin_CL14
"GSM3639551", #skin_CL15
"GSM3639552", #skin_CL16
"GSM3639553", #skin_CL17
"GSM3639554", #skin_CL18
"GSM3639555", #skin_CL19
"GSM3639556", #skin_CL20
"GSM3639557") #skin_CL21
# Identify columns to be extracted from ARCHS4 database
my.sample.locations <- which(all.samples.human %in% mySamples) # first time you've seen the %in% operator.
# extract gene symbols from the metadata
genes <- h5read(archs4.human, "meta/genes/symbol")
# Extract expression data from ARCHS4 ----
expression <- h5read(archs4.human, "data/expression",
index=list(my.sample.locations, NULL))
# transpose to get genes as rows and samples as columns
expression <- t(expression)
rownames(expression) <- genes
colnames(expression) <- all.samples.human[my.sample.locations]
colSums(expression) #this shows the sequencing depth for each of the samples you've extracted
## GSM3639530 GSM3639531 GSM3639532 GSM3639533 GSM3639534 GSM3639535 GSM3639536
## 53481826 64940474 52745249 20848486 2837407 51722656 37390057
## GSM3639537 GSM3639538 GSM3639539 GSM3639540 GSM3639541 GSM3639542 GSM3639543
## 50779924 35415106 101406579 18945879 66455004 84689579 40410575
## GSM3639544 GSM3639545 GSM3639546 GSM3639547 GSM3639548 GSM3639549 GSM3639550
## 73600594 74039897 39497323 44325744 52241136 88109937 89112596
## GSM3639551 GSM3639552 GSM3639553 GSM3639554 GSM3639555 GSM3639556 GSM3639557
## 84331463 64685799 56634215 56572087 69779095 38525553 78194883
archs4.dgelist <- DGEList(expression)
archs4.cpm <- cpm(archs4.dgelist)
colSums(archs4.cpm)
## GSM3639530 GSM3639531 GSM3639532 GSM3639533 GSM3639534 GSM3639535 GSM3639536
## 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
## GSM3639537 GSM3639538 GSM3639539 GSM3639540 GSM3639541 GSM3639542 GSM3639543
## 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
## GSM3639544 GSM3639545 GSM3639546 GSM3639547 GSM3639548 GSM3639549 GSM3639550
## 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
## GSM3639551 GSM3639552 GSM3639553 GSM3639554 GSM3639555 GSM3639556 GSM3639557
## 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
table(rowSums(archs4.dgelist$counts==0)==28)
##
## FALSE TRUE
## 55449 11737
keepers <- rowSums(archs4.cpm>1)>=7
archs4.dgelist.filtered <- archs4.dgelist[keepers,]
dim(archs4.dgelist.filtered)
## [1] 21135 28
archs4.dgelist.filtered.norm <- calcNormFactors(archs4.dgelist.filtered, method = "TMM")
archs4.filtered.norm.log2.cpm <- cpm(archs4.dgelist.filtered.norm, log=TRUE)
# Extract sample metadata from ARCHS4 to create a study design file ----
# extract the sample source
sample_source_name <- h5read(archs4.human, "meta/samples/source_name_ch1")
# extract sample title
sample_title <- h5read(archs4.human, name="meta/samples/title")
# extract sample characteristics
sample_characteristics<- h5read(archs4.human, name="meta/samples/characteristics_ch1")
# let's try putting this all together in a study design file
studyDesign <- tibble(Sample_title = sample_title[my.sample.locations],
Sample_source = sample_source_name[my.sample.locations],
Sample_characteristics = sample_characteristics[my.sample.locations])
#based on what we extracted from ARCHS4 above, lets customize and clean-up this study design file
studyDesign <- tibble(Sample_title = sample_title[my.sample.locations],
genotype = rep(c("HS", "CL"), times= c(7,21)),
treatment = rep(c("NInf", "Inf"), times= c(7,21)))
#capture experimental variables as factors from this study design
genotype <- factor(studyDesign$genotype)
treatment <- factor(studyDesign$treatment)
sampleName <- studyDesign$Sample_title
Filtering was carried out to remove lowly expressed genes. Genes with less than 1 count per million (CPM) in at least 7 or more samples filtered out. This reduced the number of genes from 67186 to 21135.
library(tidyverse)
library(DT)
# use dplyr 'mutate' function to add new columns based on existing data
targets <- read_tsv("GSE127831_studydesign.txt")
sampleLabels <- targets$sample
colnames(archs4.filtered.norm.log2.cpm) <- c(sampleLabels)
archs4.filtered.norm.log2.cpm.df <- as_tibble(archs4.filtered.norm.log2.cpm, rownames = "geneID")
archs4.filtered.norm.log2.cpm.df
## # A tibble: 21,135 × 29
## geneID HS1 HS2 HS3 HS4 HS5 HS6 HS7 CL1 CL2 CL3 CL4
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 TSPAN6 4.42 4.57 4.52 4.49 4.17 4.05 4.51 2.85 2.37 3.58 3.55
## 2 TNMD 3.18 2.62 3.61 2.79 3.25 2.06 2.57 -4.82 -4.82 -3.49 -4.82
## 3 DPM1 5.16 5.24 5.02 5.10 4.76 5.01 4.84 5.28 4.76 5.53 5.32
## 4 SCYL3 4.85 4.64 4.93 4.67 4.25 4.58 4.73 4.77 4.52 4.48 4.32
## 5 C1ORF112 2.83 2.77 3.01 3.05 2.31 2.97 3.05 3.13 3.39 2.72 3.22
## 6 FGR 0.574 2.43 0.610 1.86 1.51 2.50 0.792 5.87 6.57 5.80 5.68
## 7 CFH 7.14 7.87 7.68 8.58 9.14 8.66 8.12 8.19 7.26 8.31 8.35
## 8 FUCA2 3.95 5.01 4.56 5.20 4.73 5.13 4.76 5.46 4.73 5.61 6.44
## 9 GCLC 5.89 5.74 5.56 5.80 5.18 4.66 5.39 4.73 4.19 5.02 4.97
## 10 NFYA 4.55 4.46 4.35 4.88 4.69 4.62 5.23 4.86 4.64 4.65 5.12
## # ℹ 21,125 more rows
## # ℹ 17 more variables: CL5 <dbl>, CL6 <dbl>, CL7 <dbl>, CL8 <dbl>, CL9 <dbl>,
## # CL10 <dbl>, CL11 <dbl>, CL12 <dbl>, CL13 <dbl>, CL14 <dbl>, CL15 <dbl>,
## # CL16 <dbl>, CL17 <dbl>, CL18 <dbl>, CL19 <dbl>, CL20 <dbl>, CL21 <dbl>
mydata.df <- archs4.filtered.norm.log2.cpm.df %>%
mutate(healthy.AVG = (HS1 + HS2 + HS3 + HS4 + HS5 + HS6 +HS7)/7,
disease.AVG = (CL1 + CL2 + CL3 + CL4 + CL5 + CL6 + CL7 + CL8 + CL9 + CL10 + CL11 + CL12 + CL13 + CL14 + CL15 + CL16 + CL17 + CL18 + CL19 + CL20 + CL21)/21,
#now make columns comparing each of the averages above that you're interested in
LogFC = (disease.AVG - healthy.AVG)) %>%
mutate_if(is.numeric, round, 2)
#now look at this modified data table
mydata.df
## # A tibble: 21,135 × 32
## geneID HS1 HS2 HS3 HS4 HS5 HS6 HS7 CL1 CL2 CL3 CL4
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 TSPAN6 4.42 4.57 4.52 4.49 4.17 4.05 4.51 2.85 2.37 3.58 3.55
## 2 TNMD 3.18 2.62 3.61 2.79 3.25 2.06 2.57 -4.82 -4.82 -3.49 -4.82
## 3 DPM1 5.16 5.24 5.02 5.1 4.76 5.01 4.84 5.28 4.76 5.53 5.32
## 4 SCYL3 4.85 4.64 4.93 4.67 4.25 4.58 4.73 4.77 4.52 4.48 4.32
## 5 C1ORF112 2.83 2.77 3.01 3.05 2.31 2.97 3.05 3.13 3.39 2.72 3.22
## 6 FGR 0.57 2.43 0.61 1.86 1.51 2.5 0.79 5.87 6.57 5.8 5.68
## 7 CFH 7.14 7.87 7.68 8.58 9.14 8.66 8.12 8.19 7.26 8.31 8.35
## 8 FUCA2 3.95 5.01 4.56 5.2 4.73 5.13 4.76 5.46 4.73 5.61 6.44
## 9 GCLC 5.89 5.74 5.56 5.8 5.18 4.66 5.39 4.73 4.19 5.02 4.97
## 10 NFYA 4.55 4.46 4.35 4.88 4.69 4.62 5.23 4.86 4.64 4.65 5.12
## # ℹ 21,125 more rows
## # ℹ 20 more variables: CL5 <dbl>, CL6 <dbl>, CL7 <dbl>, CL8 <dbl>, CL9 <dbl>,
## # CL10 <dbl>, CL11 <dbl>, CL12 <dbl>, CL13 <dbl>, CL14 <dbl>, CL15 <dbl>,
## # CL16 <dbl>, CL17 <dbl>, CL18 <dbl>, CL19 <dbl>, CL20 <dbl>, CL21 <dbl>,
## # healthy.AVG <dbl>, disease.AVG <dbl>, LogFC <dbl>
datatable(mydata.df[,c(1,30:32)],
extensions = c('KeyTable', "FixedHeader"),
filter = 'top',
options = list(keys = TRUE,
searchHighlight = TRUE,
pageLength = 10,
lengthMenu = c("10", "25", "50", "100")))
The table shown below includes expression data for 21135 genes. You can sort and search the data directly from the table.
library(plotly)
# Principal component analysis (PCA) -------------
pca.res <- prcomp(t(archs4.filtered.norm.log2.cpm), scale.=F, retx=T)
#look at pca.res in environment
ls(pca.res)
## [1] "center" "rotation" "scale" "sdev" "x"
summary(pca.res) # Prints variance summary for all principal components.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 110.1039 41.57249 38.68737 36.96260 31.13845 28.89612
## Proportion of Variance 0.4786 0.06823 0.05909 0.05394 0.03828 0.03296
## Cumulative Proportion 0.4786 0.54680 0.60589 0.65983 0.69810 0.73107
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 24.92572 22.74729 21.30897 21.18538 20.21495 19.44713
## Proportion of Variance 0.02453 0.02043 0.01793 0.01772 0.01613 0.01493
## Cumulative Proportion 0.75559 0.77602 0.79394 0.81166 0.82779 0.84272
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 18.8979 18.39168 17.97927 17.46455 17.46077 16.93905
## Proportion of Variance 0.0141 0.01335 0.01276 0.01204 0.01204 0.01133
## Cumulative Proportion 0.8568 0.87018 0.88294 0.89498 0.90701 0.91834
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 16.53302 16.19797 15.68047 15.36846 15.16967 14.96179
## Proportion of Variance 0.01079 0.01036 0.00971 0.00932 0.00908 0.00884
## Cumulative Proportion 0.92913 0.93949 0.94920 0.95852 0.96761 0.97644
## PC25 PC26 PC27 PC28
## Standard deviation 14.30138 14.05055 13.95698 1.001e-13
## Proportion of Variance 0.00807 0.00779 0.00769 0.000e+00
## Cumulative Proportion 0.98452 0.99231 1.00000 1.000e+00
pc.var<-pca.res$sdev^2 #sdev^2 gives you the eigenvalues
pc.per<-round(pc.var/sum(pc.var)*100, 1)
pc.per
## [1] 47.9 6.8 5.9 5.4 3.8 3.3 2.5 2.0 1.8 1.8 1.6 1.5 1.4 1.3 1.3
## [16] 1.2 1.2 1.1 1.1 1.0 1.0 0.9 0.9 0.9 0.8 0.8 0.8 0.0
# Visualize your PCA result ------------------
#lets first plot any two PCs against each other
#We know how much each sample contributes to each PC (loadings), so let's plot
pca.res.df <- as_tibble(pca.res$x)
pca.plot <- ggplot(pca.res.df) +
aes(x=PC1, y=PC2, color=treatment, shape = genotype) +
geom_point(size=4) +
#geom_label() +
# stat_ellipse() +
xlab(paste0("PC1 (",pc.per[1],"%",")")) +
ylab(paste0("PC2 (",pc.per[2],"%",")")) +
labs(title="PCA plot",
caption=paste0("produced on ", Sys.time())) +
coord_fixed() +
theme_bw()
ggplotly(pca.plot)
#reading in the study design file
design <- read.csv(gzfile("GSE127831_studydesign.csv.gz"))
write.table(design, "GSE127831_studydesign.txt", sep = "\t", row.names = FALSE, quote = FALSE)
#HS vs. CL lesions as a factor
disease <- factor(targets$disease)
#preparing a model matrix for CL vs HS
design1 <- model.matrix(~0 + disease)
colnames.design1 <- levels(disease)
# Model mean-variance trend and fit linear model to data ----
# Use VOOM function from Limma package to model the mean-variance relationship
v.DEGList.filtered.norm <- voom(archs4.dgelist.filtered.norm, design1, plot = TRUE)
# fit a linear model to your data
fit <- lmFit(v.DEGList.filtered.norm, design1)
# Contrast matrix ----
contrast.matrix.disease <- makeContrasts("infection vs. control"= diseasecutaneous - diseasecontrol,
levels=design1)
# extract the linear model fit -----
fits <- contrasts.fit(fit, contrast.matrix.disease)
#get bayesian stats for your linear model fit
ebFit <- eBayes(fits)
#write.fit(ebFit, file="lmfit_results.txt")
# TopTable to view DEGs -----
myTopHits <- topTable(ebFit, adjust ="BH", coef=1, number=40000, sort.by="logFC")
# convert to a tibble
myTopHits.df <- myTopHits %>%
as_tibble(rownames = "geneID")
# now plot
vplot <- ggplot(myTopHits.df) +
aes(y=-log10(adj.P.Val), x=logFC, text = paste("Symbol:", ID)) +
geom_point(size=2) +
geom_hline(yintercept = -log10(0.01), linetype="longdash", colour="grey", linewidth=1) +
geom_vline(xintercept = 1, linetype="longdash", colour="#BE684D", linewidth=1) +
geom_vline(xintercept = -1, linetype="longdash", colour="#2C467A", linewidth=1) +
annotate("rect", xmin = 1, xmax = 12, ymin = -log10(0.01), ymax = 7.5, alpha=.2, fill="#BE684D") +
annotate("rect", xmin = -1, xmax = -12, ymin = -log10(0.01), ymax = 7.5, alpha=.2, fill="#2C467A") +
labs(title="Volcano plot",
subtitle = "Cutaneous leishmaniasis",
caption=paste0("produced on ", Sys.time())) +
theme_bw()
ggplotly(vplot)
To identify differentially expressed genes, precision weights were first applied to each gene based on its mean-variance relationship using VOOM, then data was normalized using the TMM method in EdgeR. Linear modeling and bayesian stats were employed via Limma to find genes that were up- or down-regulated in leishmania patients by 4-fold or more, with a false-discovery rate (FDR) of 0.01.
#extract the DEGs
results <- decideTests(ebFit, method="global", adjust.method="BH", p.value=0.01, lfc=1)
sampleLabels <- targets$sample
head(v.DEGList.filtered.norm$E)
## GSM3639530 GSM3639531 GSM3639532 GSM3639533 GSM3639534 GSM3639535
## TSPAN6 4.4215045 4.567633 4.5217375 4.493129 4.181086 4.048499
## TNMD 3.1711747 2.609734 3.6070871 2.786308 3.275945 2.050838
## DPM1 5.1582223 5.242238 5.0148733 5.096451 4.770710 5.004531
## SCYL3 4.8530178 4.634597 4.9248901 4.670525 4.262770 4.582371
## C1ORF112 2.8228494 2.765199 3.0018776 3.045262 2.352567 2.969108
## FGR 0.5488926 2.423070 0.5843514 1.857862 1.582048 2.488291
## GSM3639536 GSM3639537 GSM3639538 GSM3639539 GSM3639540 GSM3639541
## TSPAN6 4.5128208 2.844244 2.367165 3.577585 3.552950 3.183080
## TNMD 2.5661884 -6.653608 -6.404325 -4.086458 -5.205274 -5.469167
## DPM1 4.8399398 5.277499 4.758696 5.531426 5.319268 4.991289
## SCYL3 4.7276518 4.773181 4.515283 4.482361 4.316327 4.559199
## C1ORF112 3.0429112 3.122825 3.381945 2.712676 3.216791 3.218500
## FGR 0.7724368 5.871668 6.568476 5.800485 5.675075 5.940400
## GSM3639542 GSM3639543 GSM3639544 GSM3639545 GSM3639546 GSM3639547
## TSPAN6 2.967780 4.1684385 4.015413 2.710141 3.551772 3.793119
## TNMD -7.463718 0.7829547 -2.421751 -3.098569 -1.468298 -2.233914
## DPM1 5.093985 5.4922622 5.303504 5.078982 4.705077 5.418648
## SCYL3 4.692049 4.4714853 4.688136 4.612419 4.518426 4.980149
## C1ORF112 3.263351 2.6594248 2.276274 3.176085 3.406736 3.681808
## FGR 5.972343 4.7753634 5.800151 5.828897 5.926682 6.276590
## GSM3639548 GSM3639549 GSM3639550 GSM3639551 GSM3639552 GSM3639553
## TSPAN6 3.278385 3.366171 3.437352 3.004041 3.304448 3.528899789
## TNMD -4.352429 -3.798349 -1.557289 -3.868563 -4.703420 0.004827957
## DPM1 5.208668 5.135795 5.097529 4.807394 4.899836 4.983038471
## SCYL3 4.914826 4.596278 4.643880 4.365056 4.743250 4.939981026
## C1ORF112 3.346623 3.023793 2.953459 2.577392 3.240267 3.049577464
## FGR 6.136014 5.930487 6.383336 6.156954 6.447216 5.098143524
## GSM3639554 GSM3639555 GSM3639556 GSM3639557
## TSPAN6 2.760758 3.050437 4.604296 4.694747
## TNMD -6.989111 -4.758205 -6.029607 -1.451383
## DPM1 5.078659 5.016911 5.669531 4.915910
## SCYL3 4.881639 4.420211 4.578648 4.445319
## C1ORF112 3.515708 2.875516 3.044534 2.727420
## FGR 6.328160 6.593948 5.390879 2.974136
colnames(v.DEGList.filtered.norm$E) <- sampleLabels
diffGenes <- v.DEGList.filtered.norm$E[results[,1] !=0,]
head(diffGenes)
## HS1 HS2 HS3 HS4 HS5 HS6 HS7
## TSPAN6 4.4215045 4.567633 4.5217375 4.493129 4.181086 4.048499 4.5128208
## TNMD 3.1711747 2.609734 3.6070871 2.786308 3.275945 2.050838 2.5661884
## FGR 0.5488926 2.423070 0.5843514 1.857862 1.582048 2.488291 0.7724368
## SEMA3F 6.4256798 5.729345 5.7213567 5.965981 5.821015 5.597115 5.9773292
## CFTR 4.7814568 3.834730 4.4752119 3.718728 4.033744 4.264076 3.5245915
## LAP3 6.1812705 6.351662 5.5855963 5.872332 6.349454 6.191206 6.6269640
## CL1 CL2 CL3 CL4 CL5 CL6 CL7
## TSPAN6 2.844244 2.3671647 3.577585 3.552950 3.1830801 2.9677795 4.1684385
## TNMD -6.653608 -6.4043248 -4.086458 -5.205274 -5.4691673 -7.4637181 0.7829547
## FGR 5.871668 6.5684756 5.800485 5.675075 5.9404004 5.9723428 4.7753634
## SEMA3F 4.607899 4.7662263 5.726011 3.841850 5.6955301 6.0136607 5.2318355
## CFTR -1.099019 -0.7318994 -1.817970 -1.117811 0.3804984 -0.8638052 2.8762362
## LAP3 8.917641 8.5193277 9.413784 9.324462 8.7676193 9.0871640 8.8607082
## CL8 CL9 CL10 CL11 CL12 CL13
## TSPAN6 4.015413 2.710141 3.551772 3.7931190 3.2783846 3.3661709
## TNMD -2.421751 -3.098569 -1.468298 -2.2339139 -4.3524286 -3.7983493
## FGR 5.800151 5.828897 5.926682 6.2765900 6.1360141 5.9304870
## SEMA3F 4.779627 4.479655 4.459173 3.4414860 3.8907454 4.1428112
## CFTR 2.118082 1.024191 0.639505 0.8310415 0.5640481 0.6044988
## LAP3 9.119156 9.093318 8.553639 9.6550896 9.8108594 9.1633091
## CL14 CL15 CL16 CL17 CL18 CL19
## TSPAN6 3.4373524 3.0040415 3.304448 3.528899789 2.7607584 3.0504371
## TNMD -1.5572886 -3.8685634 -4.703420 0.004827957 -6.9891110 -4.7582053
## FGR 6.3833360 6.1569537 6.447216 5.098143524 6.3281603 6.5939485
## SEMA3F 4.3467463 5.1042858 4.326144 4.818452750 5.1411378 5.2886458
## CFTR 0.5544217 -0.7888362 1.046114 2.515584086 -0.1436209 -0.4802206
## LAP3 9.5488085 9.2699556 9.551564 9.442892888 8.8670470 9.5120170
## CL20 CL21
## TSPAN6 4.6042962 4.6947469
## TNMD -6.0296073 -1.4513833
## FGR 5.3908794 2.9741357
## SEMA3F 5.7845746 5.3132672
## CFTR 0.3102427 0.2366727
## LAP3 8.7840236 6.5096483
dim(diffGenes)
## [1] 5836 28
#convert your DEGs to a dataframe using as_tibble
diffGenes.df <- as_tibble(diffGenes, rownames = "geneID")
datatable(diffGenes.df,
extensions = c('KeyTable', "FixedHeader"),
caption = 'Table 1: DEGs in cutaneous leishmaniasis',
options = list(keys = TRUE, searchHighlight = TRUE, pageLength = 10, lengthMenu = c("10", "25", "50", "100"))) %>%
formatRound(columns=c(2:11), digits=2)
Pearson correlation was used to cluster 5836 differentially expressed genes, which were then represented as heatmap with the data scaled by Zscore for each row.
#clustering - using unsupervised method: correlation
clustRows <- hclust(as.dist(1-cor(t(diffGenes), method="pearson")), method="complete")
clustColumns <- hclust(as.dist(1-cor(diffGenes, method="spearman")), method="complete")
#we'll look at these clusters in more detail later
module.assign <- cutree(clustRows, k=2)
#now assign a color to each module (makes it easy to identify and manipulate)
module.color <- rainbow(length(unique(module.assign)), start=0.1, end=0.9)
module.color <- module.color[as.vector(module.assign)]
myheatcolors2 <- colorRampPalette(c("purple", "white", "darkgreen"))(50)
myheatcolors1 <- bluered(75)
# Produce a static heatmap of DEGs ----
#plot the hclust results as a heatmap
heatmap.2(diffGenes,
Rowv=as.dendrogram(clustRows),
Colv=as.dendrogram(clustColumns),
RowSideColors=module.color,
col=rev(myheatcolors1), scale='row', labRow=rownames(diffGenes),
density.info="none", trace="none",
cexRow = 0.8, # Slightly smaller gene labels
cexCol = 0.7, # Smaller column labels
srtCol = 45, # Rotate column labels
adjCol = c(1, 1), # Right-justify rotated labels
offsetCol = 0.5, # Better spacing
margins = c(10, 20)) # Increase bottom margin)
# View modules of co-regulated genes ----
# view your color assignments for the different clusters
names(module.color) <- names(module.assign)
module.assign.df <- tibble(
geneID = names(module.assign),
module = as.vector(module.assign)
)
module.assign.df <- module.assign.df %>%
mutate(moduleColor = case_when(
module == 1 ~ "#FF9900",
module == 2 ~ "#FF0099"))
ggplot(module.assign.df) +
aes(module) +
geom_bar(aes(fill=moduleColor)) +
theme_bw()
#choose a cluster(s) of interest by selecting the corresponding number based on the previous graph
modulePick <- 2 #use 'c()' to grab more than one cluster from the heatmap. e.g., c(1,2)
#now we pull out the genes from this module using a fancy subsetting operation on a named vector
myModule <- diffGenes[names(module.assign[module.assign %in% modulePick]),]
hrsub <- hclust(as.dist(1-cor(t(myModule), method="pearson")), method="complete")
# Create heatmap for chosen sub-cluster.
heatmap.2(myModule,
Rowv=as.dendrogram(hrsub),
Colv=NA,
labRow = NA,
col=rev(myheatcolors1), scale="row",
density.info="none", trace="none",
RowSideColors=module.color[module.assign%in%modulePick],
cexRow = 0.8, # Slightly smaller gene labels
cexCol = 0.7, # Smaller column labels
srtCol = 45, # Rotate column labels
adjCol = c(1, 1), # Right-justify rotated labels
offsetCol = 0.5, # Better spacing
margins = c(10, 20)) # Increase bottom margin)
modulePick <- 1
myModule_down <- diffGenes[names(module.assign[module.assign %in% modulePick]),]
hrsub_down <- hclust(as.dist(1-cor(t(myModule_down), method="pearson")), method="complete")
heatmap.2(myModule_down,
Rowv=as.dendrogram(hrsub_down),
Colv=NA,
labRow = NA,
col=rev(myheatcolors1), scale="row",
density.info="none", trace="none",
RowSideColors=module.color[module.assign%in%modulePick],
cexRow = 0.8, # Slightly smaller gene labels
cexCol = 0.7, # Smaller column labels
srtCol = 45, # Rotate column labels
adjCol = c(1, 1), # Right-justify rotated labels
offsetCol = 0.5, # Better spacing
margins = c(10, 20)) # Increase bottom margin
sortedupFC <- myTopHits[order(-myTopHits$logFC),]
sortedupFC <- sortedupFC[!duplicated(sortedupFC$ID) & !is.na(sortedupFC$ID), ] # avoid duplicates
rownames(sortedupFC) <- sortedupFC$ID
sorteduptop <- sortedupFC[-grep("IG", sortedupFC$ID),]
#Top 100 genes upregulated:
top100up <- sorteduptop[1:100,]$ID
TopUPtable100 <- archs4.filtered.norm.log2.cpm[c(top100up),]
TopUPmatrixcoding100 <- as.matrix(TopUPtable100)
colormapX <- colorRampPalette(colors=c("darkgreen","white","purple"))(50)
HeatmapUP100 <- heatmap.2(TopUPmatrixcoding100,
scale = "row", key=TRUE,
keysize = 1, key.title = NA,
col=colormapX, dendrogram = "none", Rowv = F,
margins=c(5,25),
labCol = NA, labRow = rownames(TopUPtable100),
main = "",
density.info="none", trace="none",
cexRow=0.8, cexCol=1)
# Save heatmap of top 100 upregulated coding genes to PNG file
png("Top100_UP_heatmap.png", width=1200, height=1800) #width/height in pixels
heatmap.2(TopUPmatrixcoding100,
scale = "row", key=TRUE,
keysize = 1, key.title = NA,
col=colormapX, dendrogram = "none", Rowv = F,
margins=c(5,25),
labCol = NA, labRow = rownames(TopUPmatrixcoding100),
main = "",
density.info="none", trace="none",
cexRow=0.7, cexCol=1)
dev.off()
## png
## 2
GO enrichment for the 3028 genes induced by infection
library(tidyverse)
library(limma)
library(gplots) #for heatmaps
library(DT) #interactive and searchable tables of our GSEA results
library(GSEABase) #functions and methods for Gene Set Enrichment Analysis
library(Biobase) #base functions for bioconductor; required by GSEABase
library(gprofiler2) #tools for accessing the GO enrichment results using g:Profiler web resources
library(clusterProfiler) # provides a suite of tools for functional enrichment analysis
library(msigdbr) # access to msigdb collections directly within R
library(enrichplot) # great for making the standard GSEA enrichment plots
gost.res_up <- gost(rownames(myModule), organism = "hsapiens", correction_method = "fdr")
gostplot(gost.res_up, interactive = T, capped = T) #set interactive=FALSE to get plot for publications
GO enrichment for the 2808 genes downregulated due to infection
gost.res_down <- gost(rownames(myModule_down), organism = "hsapiens", correction_method = "fdr")
gostplot(gost.res_down, interactive = T, capped = T) #set interactive=FALSE to get plot for publications
msigdbr_species()
## # A tibble: 20 × 2
## species_name species_common_name
## <chr> <chr>
## 1 Anolis carolinensis Carolina anole, green anole
## 2 Bos taurus bovine, cattle, cow, dairy cow, domestic cat…
## 3 Caenorhabditis elegans <NA>
## 4 Canis lupus familiaris dog, dogs
## 5 Danio rerio leopard danio, zebra danio, zebra fish, zebr…
## 6 Drosophila melanogaster fruit fly
## 7 Equus caballus domestic horse, equine, horse
## 8 Felis catus cat, cats, domestic cat
## 9 Gallus gallus bantam, chicken, chickens, Gallus domesticus
## 10 Homo sapiens human
## 11 Macaca mulatta rhesus macaque, rhesus macaques, Rhesus monk…
## 12 Monodelphis domestica gray short-tailed opossum
## 13 Mus musculus house mouse, mouse
## 14 Ornithorhynchus anatinus duck-billed platypus, duckbill platypus, pla…
## 15 Pan troglodytes chimpanzee
## 16 Rattus norvegicus brown rat, Norway rat, rat, rats
## 17 Saccharomyces cerevisiae baker's yeast, brewer's yeast, S. cerevisiae
## 18 Schizosaccharomyces pombe 972h- <NA>
## 19 Sus scrofa pig, pigs, swine, wild boar
## 20 Xenopus tropicalis tropical clawed frog, western clawed frog
hs_gsea <- msigdbr(species = "Homo sapiens") #gets all collections/signatures with human gene IDs
#take a look at the categories and subcategories of signatures available to you
hs_gsea %>%
dplyr::distinct(gs_collection, gs_subcollection) %>%
dplyr::arrange(gs_collection, gs_subcollection)
## # A tibble: 25 × 2
## gs_collection gs_subcollection
## <chr> <chr>
## 1 C1 ""
## 2 C2 "CGP"
## 3 C2 "CP"
## 4 C2 "CP:BIOCARTA"
## 5 C2 "CP:KEGG_LEGACY"
## 6 C2 "CP:KEGG_MEDICUS"
## 7 C2 "CP:PID"
## 8 C2 "CP:REACTOME"
## 9 C2 "CP:WIKIPATHWAYS"
## 10 C3 "MIR:MIRDB"
## # ℹ 15 more rows
# choose a specific msigdb collection/subcollection
# since msigdbr returns a tibble, we'll use dplyr to do a bit of wrangling
hs_gsea_c2 <- msigdbr(species = "Homo sapiens", # change depending on species your data came from
category = "C2") %>% # choose your msigdb collection of interest
dplyr::select(gs_name, gene_symbol) #just get the columns corresponding to signature name and gene symbols of genes in each signature
# Now that you have your msigdb collections ready, prepare your data
# grab the dataframe you made in and pull out just the columns corresponding to gene symbols and LogFC for at least one pairwise comparison for the enrichment analysis
# Pull out just the columns corresponding to gene symbols and LogFC for at least one pairwise comparison for the enrichment analysis
mydata.df.sub <- dplyr::select(mydata.df, geneID, LogFC)
# construct a named vector
mydata.gsea <- mydata.df.sub$LogFC
names(mydata.gsea) <- as.character(mydata.df.sub$geneID)
mydata.gsea <- sort(mydata.gsea, decreasing = TRUE)
mydata.gsea <- mydata.gsea[!duplicated(names(mydata.gsea))]
# run GSEA using the 'GSEA' function from clusterProfiler
set.seed(123) #set a random seed so that we can reproducible ordering for our GSEA results below
myGSEA.res <- GSEA(mydata.gsea, TERM2GENE=hs_gsea_c2, verbose=FALSE) #could replace C2CP with hs_gsea_c2 object you retrieved from msigdb above
myGSEA.df <- as_tibble(myGSEA.res@result) #using @ symbol - a slot in an S4 class object
# view results as an interactive table
datatable(myGSEA.df,
extensions = c('KeyTable', "FixedHeader"),
caption = 'Signatures enriched in leishmaniasis',
options = list(keys = TRUE, searchHighlight = TRUE, pageLength = 10, lengthMenu = c("10", "25", "50", "100"))) %>%
formatRound(columns=c(2:10), digits=2)
# create enrichment plots using the enrichplot package
gseaplot2(myGSEA.res,
geneSetID = c(94,91), #can choose multiple signatures to overlay in this plot
pvalue_table = FALSE, #can set this to FALSE for a cleaner plot
#title = myGSEA.res$Description[c(94,90)]
) #can also turn off this title
# add a variable to this result that matches enrichment direction with phenotype
myGSEA.df <- myGSEA.df %>%
mutate(phenotype = case_when(
NES > 0 ~ "disease",
NES < 0 ~ "healthy"))
# create 'bubble plot' to summarize y signatures across x phenotypes
myGSEA.sig <- myGSEA.df %>%
filter(p.adjust < 0.01)
# Get top 10 pathways per phenotype
top_10_each <- myGSEA.sig %>%
group_by(phenotype) %>%
arrange(p.adjust) %>%
slice_head(n = 10)
# Now: get unique pathways (to keep common y-axis)
top_IDs <- unique(top_10_each$ID)
# Subset full data for those IDs (to get both phenotypes for each)
top_df <- myGSEA.df %>%
filter(ID %in% top_IDs)
# Optional: shorten pathway names for display
top_df$ID <- gsub("REACTOME_", "", top_df$ID)
top_df$ID <- gsub("_", " ", top_df$ID)
# Plot
ggplot(top_df, aes(x = phenotype, y = ID)) +
geom_point(aes(size = setSize, color = NES, alpha = -log10(p.adjust))) +
scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
scale_alpha(range = c(0.4, 1)) +
theme_bw() +
labs(
title = "Top GSEA Pathways by Phenotype",
x = "Phenotype", y = "Pathway"
) +
theme(
axis.text.y = element_text(size = 7),
axis.text.x = element_text(size = 10),
plot.title = element_text(hjust = 0.5)
)
# Identification of highly variable gene expression signatures in CL lesions.
exprs_voom <- v.DEGList.filtered.norm$E
# 1. Get group-wise means
group_means <- data.frame(
gene = rownames(exprs_voom),
disease_mean = rowMeans(exprs_voom[, targets$disease == "cutaneous"]),
healthy_mean = rowMeans(exprs_voom[, targets$disease == "control"])
)
# 2. Compute log2 fold change
group_means$logFC <- group_means$disease_mean - group_means$healthy_mean
# 3. Filter for genes:
# - high disease expression (e.g. > 5)
# - low healthy expression (e.g. < 1)
# - large FC (e.g. logFC > 4)
filtered_genes <- group_means %>%
filter(disease_mean > 5, healthy_mean < 1, logFC > 4) %>%
arrange(desc(logFC))
# View top candidates
head(filtered_genes, 10)
## gene disease_mean healthy_mean logFC
## 1 IGLV1-51 5.756022 -5.964370 11.720392
## 2 IGHG3 7.840361 -3.124648 10.965009
## 3 IGHV3-7 5.531642 -5.380446 10.912088
## 4 IGHV3-21 6.176464 -4.491587 10.668051
## 5 IGHV1-18 5.732944 -4.912038 10.644982
## 6 IGHV4-34 5.389532 -4.848115 10.237647
## 7 IGHV3-74 5.519295 -4.685615 10.204910
## 8 IGLV3-1 5.999906 -3.833910 9.833816
## 9 IGHG2 5.562514 -3.981531 9.544045
## 10 IGKV1-5 6.717677 -2.617405 9.335082
#removing the genes that start with "IG" to identify other differentially expressed genes that can be potential treatment targets
# Remove genes that start with "IG"
filtered_genes_noIG <- filtered_genes %>% filter(!str_detect(gene,"^IG"))
#subset the expression matrix for genes of interest
# Keep only the first matching row for each gene
top5_genes <- head(filtered_genes_noIG, 5)
expr_subset <- exprs_voom[rownames(exprs_voom) %in% top5_genes$gene, ]
# Step 3: Convert to long format
df_long <- as.data.frame(expr_subset) %>%
rownames_to_column(var = "gene") %>%
pivot_longer(-gene, names_to = "sample", values_to = "expression")
# Step 4: Add sample metadata
df_long <- df_long %>%
left_join(targets, by = "sample")
fc_table <- df_long %>%
group_by(gene, disease) %>%
summarise(mean_expr = mean(expression), .groups = "drop") %>%
pivot_wider(names_from = disease, values_from = mean_expr) %>%
mutate(log2FC = cutaneous - control,
FC = round(2^log2FC)) %>%
arrange(desc(FC)) # Optional: sort
fc_labels <- paste0("FC ", fc_table$FC)
label_df <- df_long %>%
filter(disease == "cutaneous") %>%
group_by(gene) %>%
summarise(y = max(expression) + 0.5) %>%
mutate(label = fc_labels)
pd <- position_dodge(width = 0.75)
ggplot(df_long, aes(x = gene, y = expression, fill = disease)) +
# Boxplot with consistent width and dodge
geom_boxplot(position = pd, outlier.shape = NA, alpha = 0.8, width = 0.5, color = "black") +
# Jitter points, aligned using jitterdodge
geom_jitter(aes(color = disease),
size = 1.8, alpha = 0.7,
position = position_jitterdodge(jitter.width = 0.2, dodge.width = 0.75)) +
# FC labels above the boxes
geom_text(data = label_df, aes(x = gene, y = y, label = label),
inherit.aes = FALSE,
vjust = -0.5, size = 4.5, fontface = "bold") +
# Color & fill mapping
scale_fill_manual(values = c("control" = "gray30", "cutaneous" = "red")) +
scale_color_manual(values = c("control" = "gray30", "cutaneous" = "red")) +
# Labels and theme settings
labs(y = "CPM (log2 scale)", x = NULL) +
geom_hline(yintercept = 0, linetype = "dotted", color = "gray30") +
theme_minimal(base_size = 14) +
theme(
legend.position = "right",
legend.title = element_blank(),
axis.text.x = element_text(face = "italic", size = 13),
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
axis.title.y = element_text(size = 14, face = "bold")
) +
labs( title = "Top 5 most variable gene expression in CL vs. HS")
expr_matrix <- v.DEGList.filtered.norm$E
# Define list of known cytotoxicity genes
cytotoxic_genes <- c("GZMA", "GZMB", "PRF1", "NKG7", "GNLY",
"CD8A", "CD8B", "FASLG", "IFNG", "TNF",
"KLRK1", "CASP3", "BID", "IL2", "TRAIL","IL1B")
# Subset those present in your filtered gene list
cytotoxic_targets <- filtered_genes_noIG %>%
filter(gene %in% cytotoxic_genes)
# View them sorted by logFC (descending)
cytotoxic_targets <- cytotoxic_targets %>%
arrange(desc(logFC))
top5_cytotoxic_genes <- cytotoxic_targets %>%
arrange(desc(logFC)) %>%
slice(1:5) %>%
pull(gene)
expr_subset_cyto <- expr_matrix[rownames(expr_matrix) %in% top5_cytotoxic_genes, ]
df_long_cyto <- as.data.frame(expr_subset_cyto) %>%
rownames_to_column(var = "gene") %>%
pivot_longer(-gene, names_to = "sample", values_to = "expression")
# Step 4: Add sample metadata
df_long_cyto <- df_long_cyto %>%
left_join(targets, by = "sample")
# Calculate mean expression per gene per group
fc_table_cyto <- df_long_cyto %>%
group_by(gene, disease) %>%
summarise(mean_expr = mean(expression), .groups = "drop") %>%
pivot_wider(names_from = disease, values_from = mean_expr) %>%
mutate(log2FC = cutaneous - control,
FC = round(2^log2FC)) %>%
arrange(desc(FC)) # Optional: sort
fc_labels_cyto <- paste0("FC ", fc_table_cyto$FC)
label_df_cyto <- df_long_cyto %>%
filter(disease == "cutaneous") %>%
group_by(gene) %>%
summarise(y = max(expression) + 0.5) %>%
mutate(label = fc_labels_cyto)
pd <- position_dodge(width = 0.75)
ggplot(df_long_cyto, aes(x = gene, y = expression, fill = disease)) +
# Boxplot with consistent width and dodge
geom_boxplot(position = pd, outlier.shape = NA, alpha = 0.8, width = 0.5, color = "black") +
# Jitter points, aligned using jitterdodge
geom_jitter(aes(color = disease),
size = 1.8, alpha = 0.7,
position = position_jitterdodge(jitter.width = 0.2, dodge.width = 0.75)) +
# FC labels above the boxes
geom_text(data = label_df_cyto, aes(x = gene, y = y, label = label),
inherit.aes = FALSE,
vjust = -0.5, size = 4.5, fontface = "bold") +
# Color & fill mapping
scale_fill_manual(values = c("control" = "gray30", "cutaneous" = "red")) +
scale_color_manual(values = c("control" = "gray30", "cutaneous" = "red")) +
# Labels and theme settings
labs(y = "CPM (log2 scale)", x = NULL) +
geom_hline(yintercept = 0, linetype = "dotted", color = "gray30") +
theme_minimal(base_size = 14) +
theme(
legend.position = "right",
legend.title = element_blank(),
axis.text.x = element_text(face = "italic", size = 13),
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
axis.title.y = element_text(size = 14, face = "bold")
)+
labs( title = "Top 5 most variable cytotoxic genes expressed in CL vs. HS")
# Identify columns to be extracted from ARCHS4 database
my.sample.locations1 <- which(all.samples.human %in% mySamples[8:28]) # first time you've seen the %in% operator.
# extract gene symbols from the metadata
genes <- h5read(archs4.human, "meta/genes/symbol")
# Extract expression data from ARCHS4 ----
expression1 <- h5read(archs4.human, "data/expression",
index=list(my.sample.locations1, NULL))
# transpose to get genes as rows and samples as columns
expression1 <- t(expression1)
rownames(expression1) <- genes
colnames(expression1) <- all.samples.human[my.sample.locations1]
colSums(expression1) #this shows the sequencing depth for each of the samples you've extracted
## GSM3639537 GSM3639538 GSM3639539 GSM3639540 GSM3639541 GSM3639542 GSM3639543
## 50779924 35415106 101406579 18945879 66455004 84689579 40410575
## GSM3639544 GSM3639545 GSM3639546 GSM3639547 GSM3639548 GSM3639549 GSM3639550
## 73600594 74039897 39497323 44325744 52241136 88109937 89112596
## GSM3639551 GSM3639552 GSM3639553 GSM3639554 GSM3639555 GSM3639556 GSM3639557
## 84331463 64685799 56634215 56572087 69779095 38525553 78194883
archs4.dgelist_outcomes <- DGEList(expression1)
archs4.cpm_outcomes <- cpm(archs4.dgelist_outcomes)
colSums(archs4.cpm_outcomes)
## GSM3639537 GSM3639538 GSM3639539 GSM3639540 GSM3639541 GSM3639542 GSM3639543
## 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
## GSM3639544 GSM3639545 GSM3639546 GSM3639547 GSM3639548 GSM3639549 GSM3639550
## 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
## GSM3639551 GSM3639552 GSM3639553 GSM3639554 GSM3639555 GSM3639556 GSM3639557
## 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
# Filter and normalize the extracted data ----
table(rowSums(archs4.dgelist_outcomes$counts==0)==21)
##
## FALSE TRUE
## 54659 12527
keepers1 <- rowSums(archs4.cpm_outcomes>1)>=7
archs4.dgelist.filtered_outcomes <- archs4.dgelist_outcomes[keepers1,]
dim(archs4.dgelist.filtered_outcomes)
## [1] 19963 21
archs4.dgelist.filtered.norm_outcomes <- calcNormFactors(archs4.dgelist.filtered_outcomes, method = "TMM")
archs4.filtered.norm.log2.cpm_outcomes <- cpm(archs4.dgelist.filtered.norm_outcomes, log=TRUE)
# Extract sample metadata from ARCHS4 to create a study design file ----
# extract the sample source
sample_source_name <- h5read(archs4.human, "meta/samples/source_name_ch1")
# extract sample title
sample_title <- h5read(archs4.human, name="meta/samples/title")
# extract sample characteristics
sample_characteristics<- h5read(archs4.human, name="meta/samples/characteristics_ch1")
# let's try putting this all together in a study design file
studyDesign <- tibble(Sample_title = sample_title[my.sample.locations],
Sample_source = sample_source_name[my.sample.locations],
Sample_characteristics = sample_characteristics[my.sample.locations])
#based on what we extracted from ARCHS4 above, lets customize and clean-up this study design file
studyDesign <- tibble(Sample_title = sample_title[my.sample.locations],
genotype = rep(c("HS", "CL"), times= c(7,21)),
treatment = rep(c("NInf", "Inf"), times= c(7,21)))
#capture experimental variables as factors from this study design
genotype <- factor(studyDesign$genotype)
treatment <- factor(studyDesign$treatment)
sampleName <- studyDesign$Sample_title
targets.onlypatients <- targets[8:28,]
#failure vs cure as a factor
outcome <- factor(targets.onlypatients$Treat..outcome)
#preparing a model matrix for CL vs HS
design1 <- model.matrix(~0 + outcome)
colnames.design1 <- levels(outcome)
# Model mean-variance trend and fit linear model to data ----
# Use VOOM function from Limma package to model the mean-variance relationship
v.DEGList.filtered.norm_outcomes <- voom(archs4.dgelist.filtered.norm_outcomes, design1, plot = TRUE)
# fit a linear model to your data
fit1 <- lmFit(v.DEGList.filtered.norm_outcomes, design1)
# Contrast matrix ----
contrast.matrix.outcomes <- makeContrasts("failure vs. cure"= outcomefailure - outcomecure,
levels=design1)
# extract the linear model fit -----
fits1 <- contrasts.fit(fit1, contrast.matrix.outcomes)
#get bayesian stats for your linear model fit
ebFit1 <- eBayes(fits1)
#write.fit(ebFit, file="lmfit_results.txt")
# TopTable to view DEGs -----
myTopHitsOutcomes <- topTable(ebFit1, adjust ="BH", coef=1, number=40000, sort.by="logFC")
# convert to a tibble
myTopHitsOutcomes.df <- myTopHitsOutcomes %>%
as_tibble(rownames = "geneID")
# Add p-value significance categories
myTopHitsOutcomes_sig.df <- subset(myTopHitsOutcomes.df, P.Value < 0.05)
sigfailure <- myTopHitsOutcomes_sig.df$ID
myTopHitsOutcomes_sig.df$FC <- 2^(myTopHitsOutcomes_sig.df$logFC)
myTopHitsOutcomes_sig.df$neglog10P <- -log10(myTopHitsOutcomes_sig.df$P.Value)
cytotoxic_subset <- subset(myTopHitsOutcomes_sig.df, ID %in% cytotoxic_genes)
# View as table
# Display as interactive table
datatable(cytotoxic_subset,
options = list(pageLength = 10, scrollX = TRUE),
rownames = FALSE)
# Make sure this is the full DEG result (not just p < 0.05)
# myTopHitsOutcomes.df must contain columns: ID, FC, P.Value
myTopHitsOutcomes.df$neglog10P <- -log10(myTopHitsOutcomes.df$P.Value)
myTopHitsOutcomes.df$FC <- 2^(myTopHitsOutcomes.df$logFC)
library(ggplot2)
signature <- ggplot(myTopHitsOutcomes.df, aes(x = FC, y = neglog10P)) +
# Plot all genes (gray)
geom_point(color = "gray70") +
# Highlight cytotoxic genes with p < 0.05 in blue
geom_point(data = subset(myTopHitsOutcomes_sig.df, ID %in% cytotoxic_genes),
aes(x = FC, y = neglog10P), color = "blue", size = 3) +
# Label the cytotoxic genes
geom_text(data = subset(myTopHitsOutcomes_sig.df, ID %in% cytotoxic_genes),
aes(label = ID), color = "red", vjust = -1, size = 3.5) +
# Axes limits and p-value threshold lines
xlim(0, 4) + ylim(0, 4) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
x = "Fold Change (Failure vs. Cure)",
y = "-log10(P.Value)",
title = "Volcano Plot Highlighting Cytotoxic Genes (P < 0.05)"
) +
theme_minimal(base_size = 14)
ggplotly(signature)
Describe the results in your own words. Some things to think about:
The output from running ‘sessionInfo’ is shown below and details all packages and version necessary to reproduce the results in this report.
Note that the echo = FALSE
parameter was added to the
code chunk to prevent printing of the R code that generated the
plot.