A translator for the flags can be found here explain-flags. You also see that the first read has a mapping quality of 0 (this is phred-scale) and the "cigar" is "100M". You can see the sequence of the read and the qualities as well.
After the qualities there are additional fields that gives information on the alignment, eg.Remember that these reads were the ones we trimmed yesterday.You can copy your own trimmed data but you also need to copy the files here to get everything we need for the exercise Create index and then run the alignment.Lets look at the header and the alignment of the alignment we made before (-h tells samtools to also show the header).Header lines starts with "@", "@SQ" are all the sequences you mapped against (in your reference fasta file).
Bwa single end
Lets try to calculate some statistics on the coverage instead of basing it on eye-view, for this we are going to use bedtools. Finally let us look at some paired end Illumina reads, these reads are from a ~40X wgs of an asian individual.Try to read about bedtools at the link, it is a very useful suite of programs for working with SAM/BAM, BED, VCF and GFF files, files you will encouter many times doing NGS analysis. There are two different libraries - one called "_A_" and another library called "_B_".Calculate the number of reads that are mapped, note down the number of mapped reads. Here we use a "perl-oneliner" to extract the edit distance for each read: R raw = read.table("Paeruginosa.nm") cut = read.table("nm") trim = read.table("trim.nm") a = frame(table(raw[,1])) b = frame(table(cut[,1])) c = frame(table(trim[,1])) m = merge(merge(a, b, by=1), c, by=1, all=TRUE) m[is.na(m)] = 0 df = frame(m[,2:4]) rownames(df) = m[,1] colnames(df) = c("Raw", "Cut", "Trim") barplot(as.matrix(t(df)), beside=TRUE, col=rainbow(3), main="Effect of trimming", xlab="Edit distance") legend("topright", legend=c("raw", "cut", "cut&trim"), col=rainbow(3), pch=16) dev.print("trimming.effect.pdf", device=pdf) Sometimes it is nice to see things with you own eyes - we are therefore going to visualize the alignments using IGV. To do this we first need to sort the alignment based on mapping position and then index it using samtools.To do this we calculate the coverage in regions by adding the "-bga" option to bedtools genomecov. How many missing genes are there, and what are their function, you should see a lot of hypothetical proteins and eg. Recall that the columns in the bam-file are: Read name, flag, reference it mapped to, position, mapping quality, cigar, mate reference map, mate position, template length, read sequence, read qualities and the different tags, among here the read group (RG) and the edit distances (NM).
Afterwards we use grep to get all regions that are not covered. Open the bam-file for library A, the insert size is the number just before the read sequences.Lets calculate coverage using bedtools and after plot it using R Now open the plot. To save time we are not going to use all reads, we are only going to map them to chr21 and additionally the reads are already filtered so that we only have reads that map chr21.The left plot is a histogram showing the number of bases covered at a certain depth (reads), whereas the plot on the right is showing the percentage of the genome covered at X depth or more. How much of the genome is not covered with any read and how much of the genome is covered with eg. Of course we only used 1/4 of the data so on average we should have 4 times the coverage if we used all data. regions in the the PAO1 genome but not in our strain. Try to search for some of the genes in IGV, you should be able to see the regions of the genes that have been lost during adaptation. Create a directory for the exercise and copy the data to there. Lets start the alignment, start by indexing chr21 and then afterward align the paired end files. This should be actually be added all the time (we forgot to do it with the P.aeruginosa reads), this writes information on unique ID for the lane, which sample it is from, which platform was used etc We will try to calculate the insert sizes of the two different libraries from the bam files, sometimes when you get data you do not know the insert sizes used, then this can be very helpful.AS:i: means the alignment score for the read, NM:i:3 means that the edit distance for the alignment is 3, eg. We could also get this using the filtering option in samtools view.we need to change 3 bases in the read to have a perfect match to the reference sequence. Eg adding -f 4 would get us all unmapped reads, getting all reads that are not unmapped (ie.