Extracting interesting information from Fastq files with Bash

Let me start by stating something that everyone who knows me has heard over and over again: I love bash and am always looking for ways to use it. I love how fast it is, and I love that you just have to open your terminal and start hacking away. I always use it to extract information rather than go through the (at least for me) arduous task of opening a website, a program, R, or Python… Yesterday a colleague came to me wanting to know what sequencing strategy had been used for his last run. I knew the information had to be stored somewhere (since this was an Illumina machine, it was probably in their BaseSpace service). However, my colleague was not aware of such things, and it would have taken some time to find out who had access to that information. So I just SSH'ed into our server to access the raw fastq files from that run.

For those of you who are not familiar with it, fastq is a format for storing sequencing data (i.e. the DNA base composition of whatever you have sequenced, be it a whole genome, an exome, or a handful of genes). The bulk of the file consists of strings of the four letters A, T, C, and G, plus some N's wherever the sequencer was not able to resolve the correct base. Each such string is called a read, and its length is chosen by the user who starts the run, depending on the aim of the experiment. In addition, the file contains metadata such as the type of sequencer, the flow cell ID, the coordinates of each molecule on the flow cell, and the quality score of each base in the read. Each read occupies four rows in the file, which together hold all of the above. You can read more about the fastq format in the Wikipedia article on it.
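For illustration, here is the example record from that Wikipedia article. The first row (starting with @) holds the read identifier and metadata, the second the sequence itself, the third a separator (a +, optionally repeating the identifier), and the fourth the quality score of each base, encoded as one ASCII character per base:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65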

Unfortunately, finding out the run length is not as simple as taking a peek inside the file. For reasons that depend on, among other things, the fragment length, the file will contain reads of varying lengths. This is what I get from a fastq file when I query the lengths of 1% of the reads (the command itself is sketched right after the table):

| Read length (bp) | Number of reads |
| ---------------- | --------------- |
| 50 | 1 |
| 64 | 1 |
| 69 | 1 |
| 97 | 1 |
| 66 | 2 |
| 67 | 2 |
| 68 | 2 |
| 51 | 3 |
| 52 | 3 |
| 70 | 3 |
| 58 | 4 |
| 65 | 4 |
| 98 | 5 |
| 59 | 6 |
| 71 | 6 |
| 54 | 8 |
| 61 | 8 |
| 72 | 8 |
| 60 | 9 |
| 62 | 9 |
| 53 | 10 |
| 56 | 10 |
| 40 | 11 |
| 57 | 11 |
| 73 | 12 |
| 55 | 20 |
| 63 | 38 |
| 99 | 74 |
| 100 | 2,152 |
| 101 | 42,413 |
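The counts above came from something like this, essentially the same pipeline as the command further below, but with sort and uniq doing the counting instead of the stats step (the output is "count length" pairs, sorted by count):

gunzip -c file.fastq.gz | awk 'NR%4==2' | sample -r 1% | awk '{print length($0)}' | sort -n | uniq -c | sort -k1,1n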

As you can see, the read lengths vary from as short as 40 bp to as long as 101 bp, which, by the way, is the read length of the run. Since we don't need all this information and are only interested in the maximum read length, we can do something like this:

gunzip -c file.fastq.gz | awk 'NR%4==2' | sample -r 1% | awk '{print length($0)}' | stats | grep max

The output will show the following:

max: 101.000000000000

Thus, this is a 101-bp run. Whether it's a single-end or paired-end run can be gleaned from the file names: if only _R1 files exist, the run was single-end; if there are both _R1 and _R2 files, it was paired-end.
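Assuming the standard Illumina naming scheme (the same one used for the index files further below), a quick listing does it; the redirection just hides the error message for a pattern that matches nothing, and no _R2 files means single-end:

ls *_L001_R{1,2}_001.fastq.gz 2>/dev/null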

The above relies on the stats package, which can be downloaded from its author's website, and on the sample tool for random subsampling. There are other ways, of course, but I find this one requires the least typing. The command unzips the file on the fly, keeps only the rows containing sequence, randomly samples 1% of the reads, prints each read's length, computes summary statistics, and prints only the maximum value.
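If you don't have sample and stats installed, a rough plain-awk equivalent (no subsampling, so it scans every read, but it needs nothing beyond standard tools) would be:

gunzip -c file.fastq.gz | awk 'NR%4==2 { if (length($0) > max) max = length($0) } END { print "max: " max }'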

We can also determine the base quality composition of a whole file. The scenario came up that some colleagues wanted to know whether we had had any problems with the demultiplexing of our samples on the MiSeq. Now, I had previously hacked the MiSeq so that for every run it would generate not only the normal reads but also the index reads used for multiplexing. I had checked the composition of those files with the following command:

for file in *_L001_I{1,2}_001.fastq.gz; do echo "$file"; gunzip -c "$file" | awk 'NR%4==2' | sort | uniq -c; done
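The full list of observed index sequences can get long, so a variant of my own that shows only the ten most frequent ones per file is:

for file in *_L001_I{1,2}_001.fastq.gz; do echo "$file"; gunzip -c "$file" | awk 'NR%4==2' | sort | uniq -c | sort -rn | head; done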

This showed that the demultiplexing had gone as expected, with the greatest counts belonging to the expected index sequences. Because the demultiplexing program allows one mismatch, sequences one base away from a correct index were also included, but in far smaller numbers. I had checked the Hamming distances between all the indices included in our run, and the minimum distance was 3 bases, so there was no chance of a mix-up there. But the question arose whether the machine had had trouble reading the index sequences in the first place. That would be reflected in the Phred scores of the bases, which, as mentioned previously, are included in the fastq file. At least 3 bases would need to be of low quality to cause problems with the demultiplexing, so running the following should do the trick:

for file in *_L001_I{1,2}_001.fastq.gz; do echo "$file"; gunzip -c "$file" | awk 'NR%4==0' | grep -cE '[!-4]{3}'; done

The grep pattern looks for three consecutive bases with a Phred score below 20. Assuming the Phred+33 encoding, scores of 0 to 19 are represented by the ASCII characters from ! up to the digit 4, which the bracket range [!-4] covers, and the -c flag makes grep print the number of matching lines rather than the lines themselves. The loop therefore outputs, for each file, the number of indices that have at least three consecutive low-quality bases. We'll then need the total number of indices (that should be easy enough), and we can calculate what percentage of the indices is of low quality.
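To put those counts in context, a sketch that also tallies the total number of indices per file and prints the percentage (the output format here is my own choice) could look like this:

for file in *_L001_I{1,2}_001.fastq.gz; do
  total=$(gunzip -c "$file" | awk 'NR%4==0' | wc -l)              # one quality row per index read
  low=$(gunzip -c "$file" | awk 'NR%4==0' | grep -cE '[!-4]{3}')  # indices with 3 consecutive low-quality bases
  awk -v f="$file" -v l="$low" -v t="$total" 'BEGIN { printf "%s: %d/%d (%.2f%%) low-quality\n", f, l, t, (t ? 100*l/t : 0) }'
done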
