I used Mac OS X along with a computer cluster to run the analysis. If you don’t have access to a computer cluster, a desktop will work if it has sufficient RAM (it is just much slower).
Decompressing fastq files (optional)
Open the Terminal app using Finder. Enter the folder where you have stored your files using the command cd (change directory). If the files end in the extension .gz, they need to be decompressed. You can do this by typing gunzip <filename> one file at a time, or by using the following script and instructions:
nano fastgun

#!/bin/bash
for i in *.gz; do
    gunzip $i
done
Hit Ctrl+X to exit and Y to save the script. Next, type chmod 700 fastgun into the command line to make the script executable. Execute the script by typing ./fastgun into the command line. This should gunzip all the files.
Quality Check fastq files
Homebrew is a package manager for Mac OS X, and I encourage using it since it makes life incredibly simple. It can be downloaded at https://brew.sh/. Use Homebrew to install FastQC (version 0.11.5), a quality-check program, by typing brew install fastqc. To see whether FastQC installed successfully, type which fastqc into the command line; it should print the directory (i.e., the location) where fastqc is installed. To run all of the fastq files through FastQC, make a script similar to the one above:
nano qc

#!/bin/bash
for i in *.fastq; do
    fastqc $i
done
Trimming fastq files
The output should include an HTML file containing your report. Based on this report you can decide what you would like to trim from your data. Install Trimmomatic (version 0.36) the same way you installed FastQC: brew install trimmomatic. To execute Trimmomatic you must enter the folder where it is installed, making sure it contains the jar file ending in .jar. Below is a sample command to trim the first 5 bases:
java -jar trimmomatic-0.36.jar SE -threads 4 <fastq file> <desired name of your output file> HEADCROP:5
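If you have many samples, the single-file command above can be wrapped in a loop like the earlier scripts. This is only a sketch: the jar name matches version 0.36 above, and the trimmed_ output prefix is just an example naming convention.

```shell
#!/bin/bash
# Sketch: trim the first 5 bases from every fastq file in the current folder.
# Adjust the path to trimmomatic-0.36.jar for your install; "trimmed_" is an
# example output prefix.
for f in *.fastq; do
  java -jar trimmomatic-0.36.jar SE -threads 4 "$f" "trimmed_$f" HEADCROP:5
done
```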
Downloading the Genome
Download the mm10 genome (check that it has the same genetic background as your samples; this one is C57BL/6). This can be done on the Illumina website:
Upload this to the cluster using Cyberduck. Upload all your fastq files as well. To access the cluster, type ssh <cnetID>@midway.rcc.uchicago.edu into the terminal. You will be prompted for your password. Change directories to the grove account using cd /project/egrove. To exit the cluster, simply type exit into the terminal. Decompress the genome on the cluster using the following script. To create a script, follow the same process as above using the nano command.
#!/bin/bash
#SBATCH --partition=westmere

tar -xzvf <path to file>/Mus_musculus_UCSC_mm10.tar.gz
To submit your job to the cluster, type sbatch <name of script>. To check the status of a job, type squeue --job <jobID#>.
Build your genome using Bowtie2 (version 2.2.9)
#!/bin/bash
#SBATCH --partition=sandyb
#SBATCH --mem-per-cpu=6000

module load samtools
module load bowtie2

bowtie2-build <path to genome.fa> mouse
The path to genome.fa is usually found under Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome.fa
Move all the generated .bt2 files to where the genome.fa file is located using the mv command.
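A minimal sketch of that move, assuming the iGenomes directory layout used above; INDEX_DIR is an assumed path, so adjust it to wherever your genome.fa actually lives.

```shell
#!/bin/bash
# Sketch: move the Bowtie2 index files (named mouse.*.bt2 above) next to
# genome.fa. INDEX_DIR is an assumed iGenomes layout -- adjust to yours.
INDEX_DIR="Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index"
for f in *.bt2; do
  [ -e "$f" ] || continue   # nothing to move if no .bt2 files are present
  mv "$f" "$INDEX_DIR/"
done
```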
TopHat Alignment (version 2.0.14): running time ~6 hours
#!/bin/bash
#SBATCH --partition=sandyb
#SBATCH --mem-per-cpu=8000

module load bowtie2
module load samtools
module load tophat

mkdir <directory where you would like your output; make a new directory for each sample>
tophat -o <directory> -G <path to annotation file .gtf> <path to the Bowtie2 index prefix, e.g. the "mouse" basename built above> <fastq file>

Note that tophat takes the index prefix (the basename you gave bowtie2-build, without the .bt2 extension), not the genome.fa file itself.
Do this for each fastq file you have. After the alignment completes, go into each output directory you made and rename its accepted_hits.bam file to the name of the corresponding sample.
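The renaming can be scripted. A sketch, assuming one TopHat output directory per sample and that each directory is named after its sample (an assumption; adjust to your layout):

```shell
#!/bin/bash
# Sketch: rename each TopHat output, assuming one directory per sample and
# that the directory name is the sample name.
for d in */; do
  sample=${d%/}                      # strip the trailing slash -> sample name
  if [ -f "$d/accepted_hits.bam" ]; then
    mv "$d/accepted_hits.bam" "$d/$sample.bam"
  fi
done
```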
Sort and index alignment files: samtools (version 1.2)
#!/bin/bash
#SBATCH --partition=westmere
#SBATCH --mem-per-cpu=4000

module load samtools

samtools sort <path to bam file> <desired name of output>
samtools index <desired name of output listed above>.bam
This should give you a sorted .bam file and a .bam.bai index file. Download these using Cyberduck to a folder on your local computer. You can now use these files to view the alignment in IGV (Integrative Genomics Viewer).
Creating a count table: featureCounts (version 1.5.2)
The next step is to create a table of raw read counts to give to the R package edgeR (edgeR expects raw counts rather than FPKM values). Download the annotation .gtf file from the cluster to your computer. Go to http://subread.sourceforge.net/ to download the featureCounts program. Installation instructions are on the website. Once installed, go into the subread folder under bin.
You can type which featureCounts to get the full directory. Now execute the following command:
featureCounts -T 4 -t exon -g gene_id -a <annotation file> -o <desired output name ending in .txt> <path to your bam file>
You can open the .txt file in excel to see the output.
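featureCounts can also accept several BAM files in one run, producing a single table with one count column per sample, which is the shape the edgeR script below expects. A sketch with example file names (genes.gtf, counts_all.txt, and the EAG*.bam names are placeholders for your own files):

```shell
#!/bin/bash
# Sketch: one combined count table for all samples.
# genes.gtf, counts_all.txt, and the BAM names are example placeholders.
featureCounts -T 4 -t exon -g gene_id \
  -a genes.gtf \
  -o counts_all.txt \
  EAG1.bam EAG3.bam EAG6.bam EAG9.bam
```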
Finding Differentially Expressed Genes: edgeR (version 3.18.1)
Here is a rough summary of what will be accomplished in edgeR:
The following count data was imported into RStudio and, using the edgeR package, fitted to a negative binomial generalized log-linear model: dispersions were estimated, the model was fitted with glmFit(), and differential expression was tested with a likelihood ratio test using the glmLRT() function. (edgeR also offers a quasi-likelihood pipeline via glmQLFit() and glmQLFTest().) Differentially expressed genes were obtained using the Benjamini-Hochberg p-adjustment method with a p-value cutoff of 0.05.
# Install and load packages
source("http://bioconductor.org/biocLite.R")
biocLite(c("edgeR", "gplots", "RColorBrewer", "ggplot2", "scales"))
install.packages("dplyr")
library(edgeR)
library(gplots)
library(RColorBrewer)
library(ggplot2)
library(scales)
library(dplyr)
library(matrixStats)

# Read in the featureCounts output and keep only the count columns
setwd("~/desktop/Counts/")
countdata <- read.table("counts1369.txt", header=TRUE, row.names=1)
countdata <- countdata[, 6:ncol(countdata)]
names(countdata) <- c("EAG1", "EAG3", "EAG6", "EAG9")

# Define the groups (first two samples WT, last two GM) and the design
group <- factor(c(rep("WT", 2), rep("GM", 2)))
y <- DGEList(countdata)
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)   # rename columns so the contrast below works
con <- makeContrasts(WT - GM, levels=design)

# Filter lowly expressed genes, normalize, and estimate dispersions
keep <- rowSums(cpm(y) > 1) >= 2
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

# Fit the model and run the likelihood ratio test
fit <- glmFit(y, design)
lrt <- glmLRT(fit, contrast=con)
topTags(lrt)

# Export differentially expressed genes (BH-adjusted p < 0.05)
output <- topTags(lrt, n=nrow(lrt), adjust.method="BH", p.value=0.05)
output <- data.frame(ID=rownames(output), output)
write.table(output, "RNAseq_edgeR_analysis.txt", sep="\t", row.names=FALSE)

# Export the fitted values
fitted <- data.frame(ID=rownames(lrt$fitted.values), lrt$fitted.values)
write.table(fitted, "RNAseq_edgeR_analysis_fittedvalues.txt", sep="\t", row.names=FALSE)

# Merge the DE table with the fitted values
values <- read.delim("RNAseq_edgeR_analysis_fittedvalues.txt", header=TRUE)
data <- read.delim("RNAseq_edgeR_analysis.txt", header=TRUE)
merged <- merge(data, values, by="ID")
head(merged)
write.table(merged, "RNAseq_edgeR_analysisWithFittedValues.txt", sep="\t", row.names=FALSE)

# Build the heatmap matrix: significant genes, row-scaled fitted values
heatmap_data <- filter(merged, FDR < 0.05)
rnames <- heatmap_data[, 1]
heatmap_data <- heatmap_data[, 7:ncol(heatmap_data)]
heatmap_data <- t(apply(heatmap_data, 1, scale))   # z-score each gene across samples
mat_data <- data.matrix(heatmap_data)
colnames(mat_data) <- c("EAG1", "EAG3", "EAG6", "EAG9")
rownames(mat_data) <- rnames

# Draw and save the heatmap
my_palette <- colorRampPalette(c("blue", "white", "red"))(n=99)
png("heatmap.png", width=5*300, height=5*300, res=300, pointsize=8)
heatmap.2(mat_data, notecol="black", density.info="none", trace="none",
          col=my_palette, cexCol=1.5, dendrogram="both",
          ColSideColors=c("darkblue", "darkred", "green", "orange"))
dev.off()
A list of your differentially expressed genes can be found in the file RNAseq_edgeR_analysisWithFittedValues.txt. With this you can begin your analysis of GO pathways and clustering using:
- STRING db: network and clustering
- GSEApy for GO analysis
- Gephi for visualization
- Cytoscape for visualization
- edgeR manual