This guide is current as of mid-2020.
This post will discuss how to convert your genetic data from Dante Labs (or any other vendor) into a VCF file suitable for online analysis tools. It will also show how to convert the data once again into a SQL database for easy and fast analysis, with no need to pay anyone if you so decide.
Unfortunately, the VCF files provided by Dante Labs are missing some vital information. If it were included, this guide would be irrelevant. The provided BAM file won’t work either, as it can’t be directly analyzed.
Warning: I do not understand genetics in detail or depth. I’m just an average Joe who wants to use his computer skills to understand more about himself. Nothing more. I’m most definitely wrong all over the place in this guide, so don’t judge. Preparing it was extremely complex for me. I am trying to make it simple for you (and me), with varying degrees of success. Please correct me where I’m wrong.
The problem
I want to analyze my genes by looking up particular SNPs in SNPedia/Promethease and interpreting the results. For example, I want to know my genotype for this SNP. Is it (C;C), (C;T) or (T;T)? Uploading Dante Labs’ supplied ~150MB XXX.snp.vcf.gz into Sequencing/Promethease and searching for various SNPs shows that many of them are missing.
What do I need
Another VCF file which, when analyzed locally or submitted to Sequencing/Promethease/SelfDecode, will reveal every SNP I might search for. This way I would know whether I have a particular genetic predisposition to a genetic problem.
How it works
The sequencing machine reads your genetic data into one or more FASTQ files. Reading the DNA can be done in short reads (cheaper) or long reads (newer, better detection of some things). These files are then aligned to a reference human genome to produce a single BAM file. There are several versions of the reference human genome, with patches, and two naming standards. The previous version is from 2009 and is called GRCh37, or hg19 in the other naming standard. The latest one is called GRCh38.p13 or hg38. The „.p13“ means „patch 13“ – the latest as of January 2020.
Your personal DNA can differ from other humans (the reference human genome) in several ways:
- SNP – Single Nucleotide Polymorphism – a different nucleotide at a fixed position, defined by chromosome + locus (location). Known ones are described in a database (dbSNP) and given rsXXXXX numbers.
- INDEL – Insertion or deletion of several nucleotides at a particular location
- SV – Structural Variants – large DNA sequences that are inserted, inverted, deleted or duplicated
- CNV – Copy Number Variation – different number of copies of a particular gene
Each of these differences (called variants) can be described in a VCF (Variant Call Format) file. There can be a separate VCF file for each type, or all of them can be combined into one file.
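To get a feel for it: each data line in a VCF is tab-separated, with the columns CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO and, optionally, FORMAT plus one column per sample. A simplified SNP line (based on a real one you will see later in this guide) looks like this:

chr22  19963748  rs4680  G  A  1684.03  .  DP=43  GT  1/1

It reads: on chromosome 22, position 19963748, the SNP known as rs4680 has reference allele G and alternative allele A, and this sample’s genotype (GT) is 1/1, meaning both copies carry the alternative allele.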
The only differences (variants) interesting to us are SNPs. Promethease, SelfDecode, and most other services don’t care about the others (INDELs, SVs, CNVs, etc.).
A process called variant calling takes as input the BAM file and the reference human genome version it was aligned to. A computationally expensive process then finds the interesting differences (SNPs, INDELs, SVs, CNVs – just one kind or all of them). The result is a VCF (Variant Call Format) file.
Optionally, another input file can be used in this process – a database (dbSNP) of all SNPs known so far (for the particular reference human genome it is derived from). The database contains chromosome + location + rsXXXX number. In this case, the VCF file has its ID column filled with the rsXXXX number of each variant=difference=SNP. This is optional: online tools (like Promethease/SelfDecode) can infer the rsXXXXX number from the reference human genome of the VCF file (described inside it) and the position of each SNP, using an SNP database (dbSNP) internally.
Variant calling is a (very) computationally expensive and probabilistic operation. Different tools may yield different results, and there are tuning parameters. There are filters in this process, and some SNPs may not be discovered and/or may be wrong. I could be wrong on this one.
A typical VCF file like the one from Dante Labs contains only the differences you have from the reference genome. Another option during the variant calling process is to produce a VCF file that „calls all sites“. Such a VCF file explicitly contains your genotypes for all positions of the reference human genome, not only the differences. The file is huge (20GB+ compressed with gzip) compared to the „differences only“ Dante VCF, which is ~145MB.
Why does a VCF file with information for all positions matter? Can’t you just assume that an entry missing from a „differences only“ VCF file means your genotype at this position is equal to the reference genome? It turns out you can’t! An entry missing from a regular VCF doesn’t mean that your genotype at this position equals the reference genome. You cannot safely assume this. That’s why Promethease skips all SNPs that are missing from the Dante Labs VCF file, effectively filtering out the ones that should be „equal“ to the reference genome. That’s the problem I had with my Dante Labs VCF.
Why? Because an SNP might be missing due to „insufficient sequencing depth“ – as far as I understand, too few reads covering that position for a confident call – not because it is the same as the reference genome. This is the most critical fact to realize. Dante’s SNP VCF is incomplete in this sense. That’s why it is 145MB instead of 20GB.
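You can see the difference in scale yourself by counting the data records in a VCF (the file name below is a placeholder – use your own file):

# count the non-header records: a „differences only“ VCF has a few
# million, an „all sites“ one has around three billion (one per position)
zcat your.vcf.gz | grep -vc '^#'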
Note: SelfDecode takes the unsafe way of assuming that an SNP entry missing from the VCF you supplied means your genotype equals the reference genome at this position. That’s why uploading the VCF from Dante to SelfDecode seems to be all you need and shows all possible SNPs. I can’t comment on how safe this is without having each position (site) listed explicitly.
Dante Lab’s data
Check out this excellent guide. Dante supplies these files for download online (this changed through various stages last year):
- FASTQ (R1, R2)
- BAM – aligned to the outdated GRCh37/hg19
- SNP VCF – with an empty ID column where the rsXXXX numbers should go; doesn’t „call all sites“, i.e. doesn’t list your genotype for every chromosome + locus (position). See the quick check after this list.
- INDEL VCF
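You can verify the empty ID column for yourself (the file name below is a placeholder – use your actual Dante download):

# print the ID column (the third one) of the first few records;
# in Dante’s file it contains „.“ instead of rsXXXX numbers
zcat your.snp.vcf.gz | grep -v '^#' | head -5 | cut -f3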
Please note that as of now (Jan 2020) Dante Labs offers a new product (WholeGenomeL) using long reads (hence the L), which results in more accurate FASTQ file(s). Mine was done using the old „short reads“ method.
What do I need, elaborated
A VCF file annotated with rsXXXX SNPs (we don’t care about INDELs, SVs, CNVs), with an explicit entry for each position, even if my genotype is equal to the reference human genome.
Result: searching for any SNP in Promethease/SelfDecode, or in the file directly, will always yield a result, because the tool will be certain of my genotype even when it equals the reference genome.
We should also produce two VCFs: one using reference human genome version GRCh38.p13 and one using the older GRCh37. Why? Because in the future GRCh37 may become increasingly less relevant and some tools might stop supporting it. Why bother with GRCh37 at all, then? Simply because some other tools like SelfDecode currently work much better and yield more accurate results with GRCh37.
Also, most SNPedia SNPs now use GRCh38 as a reference, but some still use GRCh37. If a new SNP appears in the future (with a particular position in GRCh38) that is missing from my online reports (Promethease, etc.), I will be able to just look up chromosome X, locus (position) Y from the command line and see my genotype for this SNP without any effort. If I stumble upon one described relative to GRCh37, I want to be able to reference that one as well.
Let’s do it
We need to construct a pipeline. See one from 2014 (not so relevant anymore). This is a series of steps – from reading the DNA to producing the final VCF. The bird’s-eye view, without any details, is this:
„Raw“ DNA in several FASTQ files -> alignment to a version of the reference genome (these are in FASTA files) -> BAM file -> variant calling with „all sites“ -> VCF file.
I personally did this for both GRCh37 and GRCh38.
You can skip some steps. For example, Dante supplies both FASTQ and BAM files. Currently, the BAM file is aligned to an older reference genome (GRCh37).
At around the same time as this blog post, another one appeared describing almost the same thing. Take a look, it is good. And yeah, it’s complex, but just as detailed as what I’m about to present here.
The easy way ($$$)
Upload your FASTQ files to sequencing.com. Right now, in 2020, sequencing.com can fetch the FASTQ files directly from your Dante Labs account: click Dante Labs in the „Obtain Your Genetic Data from a Laboratory“ section of the Upload Center and provide your Dante Labs credentials.
If for some reason you want to upload the FASTQ files yourself, use their uploader desktop app – Big Yotta.
If you do not have FASTQ files, upload your BAM file. It will be aligned to either GRCh38 or GRCh37. Then use EvE Premium, an application operating on your data in sequencing.com, to run the same pipeline I built myself here – which took me weeks to understand and days to produce the result. EvE will quickly generate the final VCF with all possible SNPs, ready for interpretation in many online tools like Promethease and SelfDecode. This will cost you $20. If you have the FASTQ files, you can run it once more for the other version of the reference genome, so another $20. Those $40 will save you a lot of hassle.
In my case, I didn’t have enough time to research how to use my two FASTQ files. Everything takes tens of hours in this field. I decided to just use my BAM file from Dante (GRCh37 aligned). Then I selected the following EvE Premium parameters:
Target format: GVCF (Genome VCF with blocks), Reference Genome: Autodetect, Variant Calling: SAMtools, Annotation: SnpEff, Interpretation: Both ClinVar Report and Annotation
The generated GVCF file is 20GB compressed and has all the vital information for offline interpretation (actually all you need).
Unfortunately, Promethease crashed when I tried to upload it. This might be fixed by now, but I haven’t experimented again in the months since. I haven’t tried to upload it to SelfDecode; it would be super cool if that worked. I used another approach, read on.
The easy way ($$$), continued
Another option, although not optimal, is to just subscribe to SelfDecode and upload your Dante Labs supplied VCF. As far as I understand, SelfDecode will just assume that for each entry missing from the VCF your genome is equal to the reference one. Although this doesn’t seem correct, you will have all the data you need in a very easy-to-interpret interface. In the end, that’s what I did. Just to be on the safe side, I always compare offline on my machine whether the SNP genotype reported by SelfDecode is the same, and so far it always has been.
The hard way (DIY)
We need hardware, software, disk space, and time.
Hardware
For the upcoming task, I recommend at least 32GB of RAM, as many fast cores as possible, and a fast SSD with at least a couple of hundred GB available.
Do not forget that I never succeeded on the first try, made many mistakes, and many times waited for days just to run out of space or face a crash. Then it took me months to figure out what I was doing wrong. In the end, this guide took about a year to complete, plus another year of research before that.
The computer used in this guide has an Intel 4th generation 4C/8T CPU, 16GB RAM, and an SSD. Later I upgraded to a 16C/32T CPU, 64GB RAM, and an NVMe SSD which saved a lot of time.
Operating system
All tools require UNIX-like OS, so Linux and possibly macOS will be fine. If you are using macOS, do your research or fire up a virtual machine with Linux inside it. If you are using Linux, you are fine. If you are using another UNIX-like OS I suppose you will know what you are doing.
I’m using Windows at home, so the two easiest options are WSL2 or a virtual machine with Linux inside it. WSL2 is the easier one and became a viable alternative towards the end of writing this guide. Many of the tools require 10+GB of RAM and 100% CPU usage, and I found through trial and error that the virtual machine path is harder, slower, and prone to failure under memory pressure. Just use WSL2. How to set it up is beyond the scope of this article. You can steal some ideas from here.
Have a couple of hundred gigabytes free. Pause Windows updates for 7 days. A restart is the last thing you want after 5 days of computing.
Option 1: Automated pipeline
This web hosting platform doesn’t allow Bash shell files, so it is GZIP compressed. Download it here: dna_converter
I can’t get rid of the underscore after .sh (it gets added when I upload the file for no obvious reason), so you will have to live with it. Extract it, set the correct permissions, and run it:
mv dna_converter.sh_.gz dna_converter.sh.gz
gunzip dna_converter.sh.gz
chmod +x dna_converter.sh
./dna_converter.sh
This Bash script that I created does all the required steps automatically. It doesn’t even require any pre-installed software. The only things it assumes are that you have a lot of disk space, a robust Internet connection, and your 2 FASTQ files named 1.fastq.gz and 2.fastq.gz. Make sure you have disabled OS updates and other restart triggers, then download it, extract it, and run it. I used Ubuntu 20.04 LTS in a Windows 10 WSL2 environment.
The script is a long pipeline of steps. It tries to autodetect the next step it needs to execute and continues from there.
In the end, it produces 37genotype.vcf.gz and 38genotype.vcf.gz. They can be used for analyzing your genome. If for some reason it doesn’t work automagically, the step-by-step guide is below.
At the beginning of the script, there are two control variables that enable or disable GRCh37 and GRCh38. By default, both are enabled. Edit the script to disable either.
# Configuration variables
GRCH37_ENABLED=true
GRCH38_ENABLED=true
Note: If you don’t have enough disk space, disable GRCh38 first, generate the final GRCh37-based VCF file, delete all the intermediates, switch GRCh37 off and GRCh38 on, and run it again. The intermediate BAM and g.vcf files are over 100GB.
Option 2: Do all the steps manually
In case the script fails, you will need to understand what’s going on.
Software
I’m assuming you are using Ubuntu 20.04 LTS.
sudo apt update && sudo apt install -y samtools bcftools openjdk-11-jre unzip tabix minimap2 gcc
Installing gcc is important. Without it, GATK will run much slower.
Right now, in 2020, samtools and bcftools are both at version 1.10. Newer versions might have different syntax in the future – beware.
Download GATK version 4.1.4.1 and extract it. Preferably do not use a newer version as the syntax might have changed.
wget https://github.com/broadinstitute/gatk/releases/download/4.1.4.1/gatk-4.1.4.1.zip && unzip gatk-4.1.4.1.zip && rm gatk-4.1.4.1.zip
Use only one directory for downloading and processing files. Enter it, never leave it and issue all commands from there.
Let’s start!
Basically, the path is FASTQ -> BAM -> gVCF -> VCF. I will describe the needed steps for both GRCh37 and GRCh38. I will assume that you have your FASTQ files; they are the real „starting point“.
I’m prefixing each long-running command with:
/usr/bin/time --verbose ionice -c3 nice -n19
What it does is to:
- Measure the time the command takes
- Run it with low CPU priority
- Run it with low disk priority
This leaves the machine as responsive as possible and prints the time the command took when it finishes.
Step 0: If you only have a BAM file
If you have FASTQ files, ignore this step and go to Step 1.
If you happen to have only a BAM file and you don’t have FASTQ files, do this:
samtools view -H your_bam_file.bam | grep "\.fa"
If this shows nothing, try „\.fasta“. Look for something like „ucsc.hg19.fa“ – this is the FASTA file used to produce the BAM file. To make this even more complex, there are various sources of FASTA files for a particular reference version, and there are also patches. For example, my BAM file from Dante referenced „ucsc.hg19.fa.gz“, indicating version GRCh37/hg19 from UCSC (University of California, Santa Cruz). Go figure…
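For orientation, here is a hypothetical example of the header lines this grep can match (invented values – yours will differ):

@SQ  SN:chr1  LN:249250621  UR:file:/path/to/ucsc.hg19.fa.gz
@PG  ID:bwa  PN:bwa  CL:bwa mem ucsc.hg19.fa.gz …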
You should download the corresponding „.fai“ and „.dict“ files as well.
Once you download them, proceed with only the relevant GRCh37 or GRCh38 instructions below, depending on which one your BAM file is based on. If you downloaded your FASTA files, skip downloading one in Step 1 and only generate the indexes/dictionaries if you don’t have them.
Step 1: Download FASTA, generate indexes
Download the reference human genome. Let’s do it for both GRCh37 and GRCh38. There are several sources. I won’t go into detail about my selection. GRCh38.p13 is from March 2019.
Proper FASTA files are compressed with bgzip (indexed gzip, backward compatible with gzip, required by our tools). Unfortunately, both of these FASTA files are compressed with plain gzip instead. Let’s convert each one on-the-fly while downloading it, and generate the proper indexes/dictionaries.
For GRCh37:
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz -O - | gunzip -c | bgzip > ucsc.hg19.fasta.gz
samtools faidx ucsc.hg19.fasta.gz
./gatk-4.1.4.1/gatk CreateSequenceDictionary -R ucsc.hg19.fasta.gz -O ucsc.hg19.dict
For GRCh38:
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/GRCh38.p13.genome.fa.gz -O - | gunzip -c | bgzip > GRCh38.p13.genome.fa.gz
samtools faidx GRCh38.p13.genome.fa.gz
./gatk-4.1.4.1/gatk CreateSequenceDictionary -R GRCh38.p13.genome.fa.gz -O GRCh38.p13.genome.dict
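A quick optional sanity check: the „.fai“ index lists every sequence name and length, so you can confirm the chromosome naming scheme („chr22“ vs „22“) that later commands must match:

# first few contig names and lengths from the index
cut -f1,2 GRCh38.p13.genome.fa.gz.fai | head -5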
Step 2: Download dbSNP
For GRCh37 (needs to be indexed):
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.gz -O - | gunzip -c | bgzip > dbsnp_138.hg19.vcf.gz
/usr/bin/time --verbose ionice -c3 nice -n19 ./gatk-4.1.4.1/gatk --java-options "-Xmx12g" IndexFeatureFile -I dbsnp_138.hg19.vcf.gz
For GRCh38 (already indexed):
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz && wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz.tbi
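Optionally, peek at the first records to confirm you got a valid dbSNP VCF with rsXXXX IDs in the third column:

# chromosome, position and rsXXXX ID of the first three records
zcat dbsnp_146.hg38.vcf.gz | grep -v '^#' | head -3 | cut -f1-3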
Step 3: FASTQ to BAM
Download your FASTQ files (raw DNA data) from your Dante Labs account. Rename them to 1.fastq.gz and 2.fastq.gz.
Let’s align your FASTQ files to the reference genome. There are many aligners, and they give different results – genetics is a strange beast with a probabilistic nature. I tried to use the most popular one, bwa, waited for 5 days, and it failed with „[mem_sam_pe] paired reads have different names: …“. I tried to fix it for another couple of days and gave up. So let’s use minimap2. It is a lot faster, too!
The output from minimap2 is a SAM file, which must be sorted to produce the final BAM file. We won’t need the intermediate SAM file, so let’s pipe the output directly to the next program in our pipeline – samtools, which will sort the input from minimap2 and produce our final BAM file.
The -t8 and -@4 parameters set the number of threads in each program. Beware of memory consumption as well: minimap2 peaked at 13GB of RAM, and by default samtools uses 768MB of RAM per thread. Adjust both parameters for your CPU and RAM. I’m using fewer threads for samtools to conserve memory; minimap2 is the slower one and samtools waits for it most of the time, so many samtools threads don’t make much sense.
The option „-R @RG\tID:foo\tSM:bar“ adds a so-called „read group“ to the output file. The GATK tools won’t work without it later. I learned this the hard way.
For GRCh37:
/usr/bin/time --verbose ionice -c3 nice -n19 minimap2 -t8 -a ucsc.hg19.fasta.gz 1.fastq.gz 2.fastq.gz -R @RG\\tID:foo\\tSM:bar | samtools sort -@4 -o 37bam.bam
For GRCh38:
/usr/bin/time --verbose ionice -c3 nice -n19 minimap2 -t8 -a GRCh38.p13.genome.fa.gz 1.fastq.gz 2.fastq.gz -R @RG\\tID:foo\\tSM:bar | samtools sort -@4 -o 38bam.bam
Each operation took 12 hours.
The BAM file is ready. Now let’s build the BAM file index. This will create a .bai file.
For GRCh37:
/usr/bin/time --verbose ionice -c3 nice -n19 ./gatk-4.1.4.1/gatk BuildBamIndex -I 37bam.bam
For GRCh38:
/usr/bin/time --verbose ionice -c3 nice -n19 ./gatk-4.1.4.1/gatk BuildBamIndex -I 38bam.bam
Each operation took 45 minutes.
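Before committing to the long variant calling step, it’s worth a quick check that the alignment looks sane (do the same for 37bam.bam):

# alignment statistics: total reads, % mapped, % properly paired
samtools flagstat 38bam.bam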
Step 4: BAM to GVCF (variant calling)
This step is based on this guide, chapter „Working with BAM files“. I updated the syntax for GATK 4.x, removed some options, added a multithreading speedup, and so on.
Let’s finally do variant calling using GATK. It is very important that the output file’s suffix ends in „.g.vcf“; this tells GATK to tune some parameters for GVCF. This is an intermediate VCF representation that will be used by GATK in the final stage.
A Genomic VCF (GVCF) produced with the options „-ERC GVCF --output-mode EMIT_ALL_ACTIVE_SITES“ is an intermediate file that has information for every possible position (what we need), but it also optimizes ranges of positions that are equal to the reference genome by grouping them together. For quite some time I thought that this was the final file that could be used for manual interpretation or uploaded to various web sites for analysis. Well, I was wrong, and it was quite frustrating figuring out why I got such mixed results: Promethease was crashing, SelfDecode’s engineers couldn’t figure out why it didn’t work, and so on. The produced file is specific to GATK and must be transformed further.
Once again adjust the number of threads for your CPU.
For GRCh37:
/usr/bin/time --verbose ionice -c3 nice -n19 ./gatk-4.1.4.1/gatk --java-options "-Xmx12g" HaplotypeCaller --native-pair-hmm-threads 8 -R ucsc.hg19.fasta.gz -I 37bam.bam -D dbsnp_138.hg19.vcf.gz -O 37.g.vcf.gz -ERC GVCF --output-mode EMIT_ALL_ACTIVE_SITES
For GRCh38:
/usr/bin/time --verbose ionice -c3 nice -n19 ./gatk-4.1.4.1/gatk --java-options "-Xmx12g" HaplotypeCaller --native-pair-hmm-threads 8 -R GRCh38.p13.genome.fa.gz -I 38bam.bam -D dbsnp_146.hg38.vcf.gz -O 38.g.vcf.gz -ERC GVCF --output-mode EMIT_ALL_ACTIVE_SITES
Each operation took 36 hours.
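You can peek at the result to see why it’s only an intermediate: reference-equal ranges are collapsed into blocks with an <NON_REF> alternative allele and an END= position in the INFO column:

# the first three records of the GVCF
zcat 38.g.vcf.gz | grep -v '^#' | head -3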
Step 5: Genotyping
Now let’s use the generated intermediate GVCF (.g.vcf) file and convert it to our final VCF file. Please note that if you skip the „-all-sites“ option, the information about where your genome equals the reference genome will be lost and only the „differences“ will survive. The final file would be several hundred megabytes and pretty much identical to what Dante Labs provided anyway, making this whole blog post and all the transformations here pointless. The only difference would be the inclusion of the dbSNP rsXXXX numbers in the relevant column, which can be inferred anyway.
By adding „-all-sites“ in this step, combined with the GVCF from the previous step (produced with the parameters that preserve the „equal to reference genome“ positions), we can finally produce a VCF file that explicitly lists your genotype at every possible position. Yeah, it is quite big, because the „range optimizations“ for parts where your genome equals the reference one can’t be applied here.
For GRCh37:
/usr/bin/time --verbose ionice -c3 nice -n19 ./gatk-4.1.4.1/gatk --java-options "-Xmx12g" GenotypeGVCFs -R ucsc.hg19.fasta.gz -D dbsnp_138.hg19.vcf.gz -V 37.g.vcf.gz -O 37genotype.vcf.gz -all-sites
For GRCh38:
/usr/bin/time --verbose ionice -c3 nice -n19 ./gatk-4.1.4.1/gatk --java-options "-Xmx12g" GenotypeGVCFs -R GRCh38.p13.genome.fa.gz -D dbsnp_146.hg38.vcf.gz -V 38.g.vcf.gz -O 38genotype.vcf.gz -all-sites
Each operation took 30 hours.
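To confirm that „-all-sites“ actually worked, check that the output contains non-variant records – lines where the ALT column (the fifth one) is just „.“:

# print one reference-equal site; if nothing comes out,
# the all-sites option didn’t take effect
zcat 38genotype.vcf.gz | grep -v '^#' | awk -F'\t' '$5 == "."' | head -1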
Interpretation online
Congratulations! You now have the real file you can use to analyze your genome. The theoretical part precedes this blog post and is in Bulgarian; it is very detailed.
The easy part is to upload one of the VCF files to Promethease or SelfDecode, pay a small fee, and profit. I believe GRCh37 works better for SelfDecode and GRCh38 works better for Promethease. I might be wrong, and/or this could change in the future.
I will give just a quick example here of how to find your genotype for any SNP you might be interested in. I believe you will understand it if you were able to get this far.
Interpretation by yourself
If you don’t want to spend any money on online websites or you want to keep your data private, you can of course look up your genotype for each SNP you’re interested in by yourself.
Let’s take rs4680 as an example; you can substitute any SNP. The relevant page in SelfDecode or SNPedia gives you a lot of the data you need to interpret the results. Explaining alleles and the difference between homozygous and heterozygous mutations is out of scope for this guide – check the previous posts in this series. What you need to find out is which of these three your genotype is: (A;A), (A;G) or (G;G). Having this information gives you what you need.
The easy, slow way: grep
Check which reference genome is used here: GRCh38. This is very important! You need to look at the right VCF file! Some sources use GRCh37, some GRCh38! It’s very easy to make a mistake here.
For this example, we will be looking into the 38genotype.vcf.gz file.
Explaining the structure of the VCF file is outside the scope of this article. It is sufficient to say that a single line represents a single location: chromosome + position inside it. Some lines are annotated with an rsXXXXX number from dbSNP. What we need is the line in the VCF giving us the genotype for chromosome 22 at position 19963748, or alternatively the line for SNP rs4680.
Each column is separated by a tab character, and some other SNP might start with „rs4680“ but contain more digits afterward, so it makes sense to include a „tab“ character at the end of our grep search pattern to match only the SNP we care about. Also, because the file is gzip compressed, we should decompress it on-the-fly. This gives us the following command:
zgrep -P 'rs4680\t' 38genotype.vcf.gz
Substitute „rs4680“ with the SNP you care about, but keep the „\t“ at the end.
Another way to search inside the VCF file is by chromosome and position instead of the rsXXXXX number – some SNPs are newer than the database we’re using. For this example:
zgrep -P 'chr22\t19963748\t' 38genotype.vcf.gz
Once again, keep the „\t“ parts and replace the others.
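A much faster alternative: GATK should have left a „.tbi“ index next to the „.vcf.gz“ output (it creates one by default), and tabix can use it to jump straight to a position instead of scanning the whole 20GB file:

# fetch a single position by coordinates (GRCh38 here); near-instant
tabix 38genotype.vcf.gz chr22:19963748-19963748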
After some time spent decompressing and parsing the 20GB file, the result from both zgrep searches is:
chr22 19963748 rs4680 G A 1684.03 . AC=2;AF=1.00;AN=2;DB;DP=43;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.89;QD=28.04;SOR=1.179 GT:AD:DP:GQ:PL 1/1:0,43:43:99:1698,129,0
Always double check the resulting chromosome and location against the web site you are getting your information from. They should match – in this case chr22 and 19963748. It is easy to mix up GRCh37 and GRCh38!
The first letter after rs4680, „G“, is the reference allele – the letter at this position in the reference human genome (GRCh38). To construct one of the three possible combinations mentioned above, (A;A), (A;G) or (G;G), we use the letter after „G“ (the alternative allele), in this example „A“, together with the beginning of the last column, in this example „1/1“:
0/0 -> (G;G) -> no mutation; both your alleles equal the reference genome
0/1 -> (A;G) -> heterozygous mutation (hetero=different)
1/1 -> (A;A) -> homozygous mutation (homo=the same)
In our example we have 1/1, so the final genotype is (A;A).
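If you don’t want to decode this by hand every time, here is a minimal helper sketch that automates the mapping above. It is my own one-liner, not part of any tool, and it assumes a single-sample VCF with GT first in the sample column – which is what our pipeline produces. Feed it the matching line from zgrep or tabix:

zgrep -P 'rs4680\t' 38genotype.vcf.gz | awk -F'\t' '{
  split($10, s, ":");       # sample column; GT is its first field
  split(s[1], g, "[/|]");   # "0/1" or "0|1" -> the two allele indexes
  a[0] = $4;                # index 0 = the REF allele
  n = split($5, alt, ",");  # indexes 1.. = the ALT allele(s)
  for (i = 1; i <= n; i++) a[i] = alt[i];
  print "(" a[g[1]] ";" a[g[2]] ")";
}'

For our example line it prints (A;A).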
Please note that some SNPs in SNPedia or SelfDecode use different letters than the ones in the VCF file. I won’t go into details (the two DNA strands are complementary), but just remember:
- A=T
- C=G
An easy mnemonic is „Apple on the Tree, Car in the Garage“.
Substitute them whenever needed to construct the final genotype you need.
That’s it!
The faster alternative: a database
Using grep to search the VCF is fine if you don’t mind waiting from several minutes to an hour, depending on your computer’s CPU and disk speed.
But what about an instant result?
Enter databases.
I created a simple script that converts (yeah, once again) the data from the VCF file into a database. As the database is indexed, searching inside it is super fast – you will be able to find an SNP instantaneously.
I decided to use SQLite as it is lightweight, doesn’t require any configuration, and works really well for this scenario.
I’m assuming that you are using Linux or a similar environment. Windows’ WSL(2) works, and macOS should be OK as well.
Download the script and prepare it: snptodb
sudo apt update && sudo apt install nodejs npm sqlite3
tar zxvf snptodb.tar.gz
npm install
Now you have two options. One is to create a database with only the SNPs from dbSNP, which won’t cover everything, but the database for each reference genome will be only 100-200MB. The other is to include every possible position in the database for fast searching.
Only dbSNP SNPs:
node snptodb 37genotype.vcf.gz 37.db
node snptodb 38genotype.vcf.gz 38.db
Every possible position:
node snptodb 37genotype.vcf.gz 37_full.db --full
node snptodb 38genotype.vcf.gz 38_full.db --full
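If you’re curious what the conversion boils down to (or don’t want to run the Node script), here is a rough sketch of the same idea in plain shell + sqlite3. This is my reconstruction, not the actual snptodb source; it only assumes the table layout the queries below expect (a table „dna“ with chrom, pos, snpid, ref, alt, gt):

# extract chrom (without „chr“), pos, rs number (without „rs“),
# REF, ALT and genotype for every dbSNP-annotated record
zcat 38genotype.vcf.gz | grep -v '^#' | awk -F'\t' '$3 ~ /^rs/ {
  split($10, s, ":");
  print substr($1, 4) "|" $2 "|" substr($3, 3) "|" $4 "|" $5 "|" s[1];
}' > 38.psv
# import into SQLite and index by rs number and by position
sqlite3 38.db <<'EOF'
CREATE TABLE dna (chrom TEXT, pos INTEGER, snpid INTEGER, ref TEXT, alt TEXT, gt TEXT);
.separator "|"
.import 38.psv dna
CREATE INDEX idx_snpid ON dna (snpid);
CREATE INDEX idx_chrom_pos ON dna (chrom, pos);
EOF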
Once the database is ready, you can ask it for a particular SNP directly from the command line. Let’s use our rs4680 example once again and search the GRCh38 database for this SNP:
sqlite3 38.db "select * from dna where snpid=4680;"
Alternatively, we can search by chromosome and location inside it like this:
sqlite3 38.db "select * from dna where chrom=22 and pos=19963748;"
Both of those queries will return the result immediately:
22|19963748|4680|G|A|1/1
Refer to the explanation a few paragraphs above to interpret this line. Long story short, it means genotype (A;A) for SNP rs4680.
My personal motivation
For years I have suffered from anxiety and depression. This website is dedicated to explaining methods to deal with those disorders, in Bulgarian. Over the years, CBT, EMDR, and SSRIs/SNRIs helped a lot. I have mostly recovered, but I always wanted to understand better what happened to me and how to protect myself in the future. Several years ago Genecept Assay (now Genomind Professional PGx) started offering psychiatric tests. They used to test 18 genes (a bit more now, in 2020). The price seemed steep to me (750 Euros) just for a couple of genes. To be fair, the report is useful.
At the same time, Dante Labs offered WGS for 400 Euros. I figured I could pay less and have my whole genome scanned – not just 18 genes, but 22 000. I imagined I would get some kind of file that I could load into Sequencing, Promethease or SelfDecode, check the genotypes of the genes the psychiatric test looks for, and draw my own conclusions. Then later in my life, if I need any other genetic test, I will just look into my data, saving a lot of money in the long run.
Well, it turned out I needed almost 2 years to do it, but hey, compared to a lifetime, it’s never too late!
Conclusion
This post took several months. The research for this post series (the others are in Bulgarian) took almost a year, and another year went into editing and polishing. It was completely rewritten at least 4 times, from scratch. Right now this post has over 400 revisions! I went through many iterations, and many things didn’t work or didn’t make sense. At one point this blog post was probably 3x longer. I also spent more than $200 on various online tools while trying to make everything I’ve described here work. I failed so many times that I can’t even count them. I bought a much, much beefier PC near the end just to handle what I had to do here, as I got sick of waiting for weeks, unable to effectively run other software in between. I wanted to abandon this project too many times. Honestly, I don’t know how I finally succeeded. I have no idea. At all.