December 3, 2019

Using NCBI Data with Tools that Predict the Functional Impact of Genomic Variants


Good afternoon, and welcome to Using NCBI Databases with Tools that Predict Genomic Variant Effects. My name is Ben Busby, and this afternoon Kaitlin Williams, who is an NIH post-bac in the genomic outreach and dbSNP groups, will be presenting the webinar. I will be answering technical and specific questions as the webinar goes on, and other NCBI staff will be answering more substantive questions at the end. As for very specific questions about the tools: we neither maintain nor endorse any of the tools here, so we will likely encourage you to contact the actual tool developers. Additionally, if you are a tool developer and your tool works with NCBI variant data and is not shown here, please send us an email. You can send me an email at [email protected] or you can send to [email protected]. Without further ado, I’d like to introduce Kaitlin Williams, who will be presenting to you for the next 40 minutes or so.

I will be going over this webinar with you today. Here’s a quick overview of what we’re gonna cover. We’ll start off with an introduction to resources and some variant callers that you might be interested in, and then we’ll get into using NCBI databases with these tools. We’ll start with tools that take VCF input: ANNOVAR, SnpEff, VEP, VAAST and VAT. Then we’ll go on to a few tools that use protein FASTA: SNAP-2, PANTHER and PROVEAN.

Here’s a little blurb about some of our NCBI variation resources. Some of the more popular ones are dbSNP, a database of small genetic variations that are less than 50 nucleotides. dbVar holds larger structural variations that are greater than 50 nucleotides in length. There is Variation Viewer, a visualization tool that you can download data from. The 1000 Genomes Project has variant data along with population information. BLAST is an alignment search tool. dbGaP holds genotypes and phenotypes. ClinVar is human variation with clinical data.

If you’re running raw data you’ll probably want to first run a variant caller before you use any of these tools, just to get it into VCF format. These are some of the more popular variant callers. We also have a list of them at this website, github.com/NCBI-Hackathons/Community_Software_Tools_for_NGS. I’ve been working on that for a few months, so feel free to add more tools to it or email me with any suggestions. Okay, now we’ll get into some of these tools.

OK, so first off
we’ll want to go over where you can get NCBI data. One of the main places is going to be the NCBI FTP site. FTP is file transfer protocol, and that’s a site where you can download and access data available from NCBI. This is just a demonstration of how our FTP site is set up. If you go to the main FTP page you can find links to 1000 Genomes data, BLAST, SNP and SRA directly on that front page, but if you want to find ClinVar or dbVar data you’ll have to go into the pub directory to find those.

So here’s an example of what a dbSNP page would look like. As I said, this holds small genetic variations, and this is an example of a RefSNP report. This gives the information you would be able to find in the dbSNP files, just in more of a visual view. This is an example for rs328, and as you can see you can get some of the allele information as well as clinical significance and things like that. And also, as a side note, if you need to search dbSNP for specific variants, or if you’re interested in a type of variant or a specific organism (because dbSNP does hold information from many different organisms), you can go into the search bar and type in whatever type of functional class you want. For this example it’s missense, and Homo sapiens is the organism, and this will give you a list of all of the missense mutations in dbSNP. That might help if you’re looking for specific types of variants.

Okay, so now
if you’re going to download data from the dbSNP FTP site, specifically VCF files, you can go into the main NCBI FTP site, then go to organisms, and under organisms you’ll be able to find the human files. We have files that are mapped to both GRCh37 and GRCh38. You go to the one you want and then you can click on VCF, and you’ll get all of the VCF files available for that build. As you can see, there are a few different variations. You can do common or all; you can just download all if you’d like. So there are a few different files you can choose from.

Next up is ClinVar. ClinVar contains human variations with clinical data. This is the main homepage for ClinVar, and as you can see there are multiple links, and you can get straight to their FTP page from here. And here is an example of an FTP file that is on the ClinVar website. There is a good amount of information on this page, and from this ClinVar search bar you can do basically the same thing you did with dbSNP, as well as looking up specific genes, so you can find all the variants that are on one gene or related to one condition. For the ClinVar FTP site, you go to the FTP site / pub, and then there are links for vcf_GRCh37 or vcf_GRCh38. This will give you a list of all the VCF files, and you can download ones that are just common and clinical, or common with no medical impact. It just depends on what you’re studying.

Then there’s the Genome Reference Consortium, which houses all of our reference sequences. So that’s GRCh38 and GRCh37.
There are reference sequences for human, mouse and zebrafish. Obviously, for our purposes today we’ll be focusing on the human genome: GRCh37, which is also known as hg19, or GRCh38, which also goes by hg38. For GRCh37 or 38 you can either download the full FASTA file or download by chromosome. For the full download you go to the FTP site / genomes / all and then that / GCA_ directory, and that will take you to this directory where you can download the full report. If you’re going by chromosome you can go to / genomes and then into the H_sapiens directory. If you want GRCh37 you click on this archive link here; that’ll bring up past reference sequences, and you can get 37 through that.

For those of you who might not know what a VCF file is, it’s a standardized text format used for storing large amounts of genomic information, including SNPs, indels and structural variants. It has header lines at the beginning that have information about what is contained in the file, followed by the column headers, and then the record for each variant follows. Here’s an example of what a VCF header might look like; this is actually one for dbSNP. As you can see, there’s a ton of info at the top about the file itself and what all the tags mean, followed by that header with the columns, and the records are beneath that.

For people who might not be used to working on the command line, you’re going to want to remember to decompress any files you download. For files that end in .tar.gz use tar xvfz. For .gz files, gunzip and then the file name. For .zip files, unzip.
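To make the VCF layout described above concrete, here is a minimal Python sketch that separates the "##" header lines, the "#CHROM" column header and the variant records. It is only an illustration, not part of any of the tools covered here: the function names (open_vcf, parse_vcf) are hypothetical, and the two records and their rs IDs are made up. Python's gzip module can also read a .gz file directly, so running gunzip first is optional for this kind of quick look.

```python
import gzip
import io


def open_vcf(path):
    """Open a VCF as text, whether it is plain or gzip-compressed (.gz)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def parse_vcf(handle):
    """Split a VCF stream into meta lines, column names and variant records."""
    meta, columns, records = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # File-level information, e.g. what each INFO tag means
            meta.append(line)
        elif line.startswith("#"):
            # The single column-header line (#CHROM POS ID REF ALT ...)
            columns = line[1:].split("\t")
        else:
            # One tab-separated record per variant
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


# A tiny made-up VCF; the rs IDs and positions are placeholders, not real data.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "##INFO=<ID=RS,Number=1,Type=Integer,Description=\"dbSNP ID\">\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "22\t100\trs0000001\tG\tA\t.\t.\tRS=1\n"
    "22\t200\trs0000002\tC\tT\t.\t.\tRS=2\n"
)

meta, columns, records = parse_vcf(io.StringIO(EXAMPLE_VCF))
print(len(meta), "meta lines;", "columns:", columns)
for rec in records:
    print(rec["CHROM"], rec["POS"], rec["ID"], rec["REF"], rec["ALT"])
```

For a downloaded file you would pass open_vcf("yourfile.vcf.gz") to parse_vcf instead of the in-memory example. Holding every record in a list is fine for a small test file, but a whole-genome dbSNP VCF is large enough that you would want to process it line by line instead.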
I’d like to remind you to visit the GitHub site, and feel free to contact us with any suggestions. Now we’ll start off with our first tool, which is ANNOVAR. It’s a variant annotator, and it’s pretty popular and widely used.
For our purposes today we’re using a dbSNP file that is mapped to GRCh38, so you’ll want to make sure that you download reference databases that are also in the GRCh38 format, or they won’t map correctly when you try to match those up. For ANNOVAR you can download these additional databases. You can use this command right here at the top, perl annotate_variation.pl with -buildver hg38. For this demonstration I downloaded avsnp144, which is actually just their version of dbSNP, refGene, knownGene, a 1000 Genomes file and ClinVar from 2016. If you want a full list of all the extra databases that they have, you can look up avdblist, which is a specific database name, and it’ll pop out all of them.

To run the annotation you’ll do perl table_annovar.pl, and then you’ll put in the file, so ours is All_20151104, the whole dbSNP file; humandb; the reference build you’re running with, so hg38; and what you want the output to be, so I just named this one annovar.test. Then you type in the databases that you downloaded and that you want to get more information from, and you’ll want to include that -vcfinput flag, since that’s what the file is. It will give you a VCF file in return that has been annotated with all this different information. As you can see, there’s a lot of information in this INFO column. You can gain a lot more information about the specific SNPs in the file you’re looking
at by doing this.

Next we’re going to go into SnpEff, also a very popular VCF annotation tool. As a side note, this also works with SIFT and PolyPhen, which will give you variant effect scoring. When you download it, it will actually come preloaded with GRCh37, so you’ll want to go ahead and download GRCh38. They provide a pretty easy way to do that. What you’ll need to do is type in this java -Xmx4g -jar snpEff.jar databases command; for this I just grepped out GRCh38 because I knew that was what I wanted to download. Once you find the name of the database you want, you can just type in this last command, snpEff.jar download, and it will download that reference sequence, so you don’t have to worry about finding it anywhere else; they have it loaded. Once that database is downloaded you’ll run this command, snpEff.jar ann, so you’ll annotate with the GRCh38 reference sequence. For this one I ran it with the common_and_clinical SNPs file, and you get a VCF result as well as an HTML result.

So I’ll open that up. You’ll get a file that looks something like this. It will give you some details about the command you just put in; some variant rate details, so how many variants are on each chromosome; the number of variants by type; and then down here it gives you what the impact is. Since we’re running common and clinical SNPs, a lot of them came up as being modifier. Then it gives what kind of functional class they are, so the number of missense mutations, as well as the type and the region they’re in. The number of variations is the same thing, but now in a chart. If you scroll down some more, there is this one that gives you the number of base changes. As you can see, there are a lot more that go from a G to an A and a C to a T than, perhaps, an A to a T. Scroll down some more. What this gives you is the codon (three-nucleotide) changes that would lead to an amino acid change. As you can see, there are a lot right here for CGG to CAG, and if you go down a little more it actually gives you just the amino acid changes. So it can be very helpful for gaining more information, and the graphical view is much prettier.
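The base-change table mentioned above can be roughly reproduced from the REF and ALT columns of a VCF. The Python sketch below is an illustration only and is not how SnpEff itself computes its report: the function names are hypothetical and the input pairs are made up. G to A and C to T are both transitions (purine to purine, or pyrimidine to pyrimidine), while A to T is a transversion, so the sketch also totals each class.

```python
from collections import Counter

# Transitions swap a purine for a purine (A<->G) or a pyrimidine for a pyrimidine (C<->T);
# every other single-base change is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def count_base_changes(ref_alt_pairs):
    """Count single-base REF>ALT changes; indels and other multi-base alleles are skipped."""
    changes = Counter()
    for ref, alt_field in ref_alt_pairs:
        # The VCF ALT column may list several alleles separated by commas
        for alt in alt_field.split(","):
            if len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT" and ref != alt:
                changes[(ref, alt)] += 1
    return changes


def transitions_and_transversions(changes):
    """Return (number of transitions, number of transversions)."""
    ts = sum(n for pair, n in changes.items() if pair in TRANSITIONS)
    tv = sum(n for pair, n in changes.items() if pair not in TRANSITIONS)
    return ts, tv


# Made-up (REF, ALT) pairs standing in for the REF and ALT columns of a VCF
example_pairs = [("G", "A"), ("G", "A"), ("C", "T"), ("A", "T"), ("AT", "A"), ("C", "G,T")]

changes = count_base_changes(example_pairs)
for (ref, alt), n in sorted(changes.items()):
    print(f"{ref}>{alt}: {n}")
print("transitions, transversions:", transitions_and_transversions(changes))
```

Counting per allele rather than per record is a design choice here: a multi-allelic line such as C to G,T contributes one count to each change.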
So now that we’ve gone over SnpEff we’ll talk about VEP, the Variant Effect Predictor, and this is actually an Ensembl tool. The basics for downloading VEP are found on their site. You just download their zip file, unzip it, go to that directory and then install it. When you’re installing VEP you’ll get a prompt asking if you want to download any cache files. You want to say yes, and you’re able to download GRCh38 with RefSeq accessions. That is number 44 right there: it will ask you which one you want to download, and you’ll just type in 44. It will download it for you, and then you will have GRCh38 with RefSeq accessions, which will work with our data, already put into their tool. It’s important that you make sure that you’re using the right reference sequence and also that it has the right accessions. If it has the wrong accessions it might not match up correctly, and then you won’t get proper results.

Once you have that downloaded, you can go ahead and run the following command on the bottom. It will be variant_effect_predictor.pl with --cache and --refseq, so that will call that GRCh38 cache that you downloaded. For this I focused specifically on chromosome 22 and the common and clinical SNPs. Then you type what you want the output to be. As with SnpEff, this also gives you a summary page in HTML format that you’re able to open up in a browser. As you can see, this provides charts about the variant classes. For this specific chromosome and specific file, the majority of these were single nucleotide variants, with a smaller amount of deletions. Then it has consequences. A lot of these were missense, there were a few splice region variants, and you can scroll around on these and it’ll tell you the percentages of what everything was. And then here are all consequences versus the most severe ones. It’ll give you the coding consequences, and it’s the same: a lot of them were missense. And then variants by chromosome. I only looked at chromosome 22, but it’ll give you the distribution of these and the positions in the protein. If you scroll back up to the top, it will give you all of this in a text file as well, which you can also find on the command line.

Next we’re going to talk about VAAST
which is the Variant Annotation, Analysis and Search Tool. VAAST will actually score the variants and give you a score as to what their functional effects would be. I ran a dbSNP file with VAAST. The first thing that you’re going to do is this perl command. What this will do is remove the genotype information from the dbSNP file; it causes problems when you’re running VAAST with this information still in the file. Once you’ve done that you’ll have a new dbSNP VCF, and then you’ll want to convert it to GVF, because that is what VAAST takes for the annotation step. So you’ll run the VAAST converter. Make sure you put the build in. Also note that for this we’re actually using a dbSNP file that is mapped to GRCh37, or hg19. That’s why the reference for this is different. So then you’ll run that and get a GVF file.

You’ll then want to sort the GFF3 file. For this one I used a GFF3 file that VAAST has available on their FTP site, which you can find under their main website: you hit data links and you can find it under there. That made it easier to map with. Then you’ll want to do the same thing for the GVF. You just want both of those sorted, and then you’ll be able to go ahead and annotate. The first part is you’re going to run VAAST’s VAT command, and it’s going to be the GRCh37 FASTA file with that GFF3 file that you got from VAAST and then that sorted dbSNP GVF, and that command will give you an annotated GVF.

From there you’ll want to make this .cdr condenser file, so you use the VST command that’s available in VAAST. You’ll also want to specify that you are using the reference for hg19, and that will make that dbSNP CDR file. That will be the target CDR file that you want. VAAST also requires that you have a background .cdr file, and this is also something that VAAST has available. If you go to this link right here, which is through the VAAST website, you’ll download that, gunzip that file, and then you will have it available.

And then you can do the VAAST scoring. You’ll finally run the VAAST command, and you’ll want to make sure that you use these -ref and -no_max_allele_count options, because they basically say that VAAST will score variants that are found in both the background and target files. Otherwise it’s set up to only score variants found in the target file alone, and since we’re using a target and a background file that both have dbSNP data, it basically scores nothing. Then you’ll get a file with a bunch of scored variants that looks a little bit something like this. That top line is named for the variant itself, what chromosome it’s on and a bunch of information about the different permutations of that variant. The first column is the record’s chromosome or contig name; the second column designates the strand that the feature is on; the third column specifies the structural annotation for that exon. There are more numbers separated by colons there, and those are the start coordinate of the exon, the strand of the exon, and the chromosome number of the exon. Then down here you’ll be able to find the raw CLRT score for the specific variant. What this is is two times the natural log of the composite likelihood, minus two times the number of variant sites in the feature.

The last tool that takes VCF input that we’re going to cover is VAT, or Variant Annotation Tool. Before we start going over VAT I would like to say that this
requires a lot of external dependencies. I personally found it a little bit difficult to set up as a beginner, and it also requires permission for sudo commands, which I didn’t have, so I actually had to have a systems person set it up for me. So while it’s a great tool, if you don’t have sudo permissions it might be better to start with another tool, but we’re still going to demonstrate this one.

First, you’re going to want to download data to run on VAT. VAT has a start-up guide; I basically just followed those commands and updated them with data from GRCh38, because theirs is still GRCh37 data, so I just used newer data from NCBI. You’ll want to download this GENCODE annotation file as well as an hg38.2bit file from UCSC, and then you’ll download a VCF file; for this one I downloaded a chr22 phase3 file to test. And then you’ll want to get this all panel file too. Those are just background; we’re mostly just going to work with this VCF file. Once you have all those, you’ll want to go through some of these commands.

The first three just deal with that GENCODE annotation GTF file that you downloaded. The first command, and all of this can be found on the VAT website, will go ahead and filter that, and then once it’s filtered you’ll run gencode2interval and interval2sequences, and you’ll create these GENCODE annotation filtered interval and filtered FASTA files. Then you run this last command, which is basically just testing that 1000 Genomes chr22 file in VCF format that you downloaded. You’ll do snpMapper and run it with that interval file and the FASTA file, and you’ll get this genotypes annotated file at the end. That will look like this, and it has a ton of information in here regarding population and things like that, since that is what 1000 Genomes holds, as well as some information on whether it’s a SNP and other things of that nature.

So now that we’ve gone through some of the tools that take VCF,
we’ll go into tools that take protein FASTA input. One way to find FASTA files is actually in Variation Viewer. Variation Viewer is a way to visualize some of the variants and things like that that are held at NCBI. For the following examples I just looked at the gene COMT on chromosome 22. That is right here. If you click on this it’ll come down with some protein files and other things like that that are all on this gene, and if you right-click on any of these you can get to a BLAST search or you can download the FASTA. That is one way to download FASTA. You can also BLAST it and find the FASTA through that. This is an example of protein BLAST; you can also do blastx, which goes from the nucleotide to the amino acids. This is an example of what a protein BLAST page will look like, and if you scroll down it gives a ton of alignment information, similar proteins that might be in other organisms, a lot of really good stuff. And then you can also go to the Protein database on NCBI and search for the protein, and there you can see a graphical view or the FASTA file.

Once you have that FASTA file downloaded you can use some tools. First up is going to be SNAP-2. This is pretty simple. You’re just going to copy and paste the FASTA into SNAP-2 right there and just hit run prediction. Once this is done you’ll get something like this. This shows the amino acid sequence at the top here, and the potential amino acid changes are on the side. As you can see, more blue means it’s neutral, whereas red means there’s an effect. So this tryptophan getting changed to basically anything is going to have a higher effect, versus some of these like leucine. It also gives you this chart so you can jump to a residue position. Looking at residue position 52 here, which is glycine, when glycine gets changed to another, say, glycine, that’s obviously very neutral. The more negative the number the more neutral it is; the higher or more positive the number, the greater the effect would be. So let’s
say it gets changed to a Y here; there’s going to be a greater effect. So you can gain a lot of information about some of the amino acid changes and what they would do to the protein overall.

Next we’ll look at PANTHER, which has a tool that uses hidden Markov models. Basically, you’ll do the same thing. You’ll enter the protein sequence and hit submit, and this will give you an alignment and what that E-value score would be. If the E-value is greater than 1e-3, no hits are reported; a hit is reported on this one, since its E-value is 1e-179. PANTHER also offers a cSNP search tool. For this one you’re going to paste in the FASTA sequence followed by the different substitutions that you specifically want
to look at. For this one I just picked out four random ones: a glycine to a tryptophan at 52, an alanine to a histidine at 117, an isoleucine to a leucine at 222 and a cysteine to an isoleucine at 34. Once you type these in you can hit submit, and you will get scores like this. For this P-deleterious, the higher the number, the more deleterious it would be. So you can gain some information: this one would do a lot more damage to the protein than this bottom one with a much lower number.

Next up we’re going to look at running PROVEAN, which does something pretty similar to what the PANTHER cSNP tool did. PROVEAN works directly with NCBI BLAST and the BLAST nr database. If you want some information on running those, you can find that on our webinars site. PROVEAN will compare FASTA files and give possible effects for specific variants. You will need to download all of NCBI BLAST and the nr database. Then you’ll use the following command. This FASTA file is just the protein FASTA file from BLAST, and that .var file is just a text file that contains the variations that we used earlier, just a list of them. When you run this it will score those variations for you; it’s a delta alignment score. Basically, if you have a specific threshold score, let’s say it’s negative 2.5, this would mean that those first two variations would be deleterious and the last two would have more of a neutral effect.

So that’s about all I
have. Here are links to all of the NCBI resources that I was talking about, as well as our GitHub site. And now if anyone has any questions we can get to those, and I think Ben’s going to help me answer them. Does anybody have any questions?

Somebody asked what the criteria were for choosing these programs. We used programs that work with NCBI tools and are fairly popular in terms of how much play they get on various internet forums. That said, we don’t endorse any tools and we are not exclusive, so if you have other tools please let us know. We’d be happy to incorporate them into a future webinar of this kind. These webinars have been relatively popular, and in my opinion NCBI should be very committed to making databases that work with external tools.

Somebody else asked about variant impacts on 3D structure. That’s something we’re actively working on in the development space. We have new viewers we’re releasing, like iCn3D, and once we have more variants reconciled with those 3D viewers we’ll probably be putting together a webinar on that. My guess would be early fall.

Somebody asked about the program GEMINI. That’s potentially something we can add to a future webinar. Somebody else asked about ANNOVAR’s version of RefSeq and whether it’s directly from NCBI or whether they modified it in any way. I don’t think it’s modified, but we will double check and get back to you on that.

This webinar has been recorded to view later on. Once it’s closed captioned we will release it, and you can find that on the NCBI webinars page. One thing we can’t answer is questions like, what is the best program for XYZ? Once again, we can’t endorse any of these community programs, but it is great that they all support using dbSNP data. With that, unless there are any more burning questions, we’ll wrap up. Like I said, some of the questions are relatively specific; we’ll try to put some answers to those questions on our FTP site. And if you don’t get an answer to your question, please feel free to send us an email, particularly if that question is about NCBI databases and not about the individual tools. Thank you very much, and I hope everyone has a fantastic rest of the week. And thanks again to Kaitlin Williams for putting all this together.
