class: center, middle # Bioinformatics tools --- # Sequence search tools - BLAST * BLAST is by far the most taught tool in Bioinformatics. I am not going to rehash this for * See NCBI's [Introduction to BLAST](http://www.ncbi.nlm.nih.gov/books/NBK52639/) * One of 7 Million pages by Googling ["blast introduction tutorial"](https://www.google.com/webhp?#q=blast+introduction+tutorial) --- # BLAST on Biocluster There are multiple flavors of BLAST (implementations). Focus on the current latest version from NCBI. We will make links to two files which are ORFs from two yeast species ```bash # setup some files to do some searches $ mkdir BLAST_demo $ cd BLAST_demo $ ln -s /bigdata/gen220/shared/data_files/sequences/yeast_chr1_orfs.fa . $ ln -s /bigdata/gen220/shared/data_files/sequences/C_glabrata_orfs.fa . ``` Now we have some files, set them up for running BLAST. Our question is, what ORFs are similar at the DNA level between these two species. ```bash $ module load ncbi-blast # load the module on the biocluster $ makeblastdb -dbtyp nucl -in C_glabrata_orfs.fa $ ls C_glabrata_orfs.fa C_glabrata_orfs.fa.nhr C_glabrata_orfs.fa.nin C_glabrata_orfs.fa.nsq yeast_chr1_orfs.fa $ blastn -query yeast_chr1_orfs.fa -db C_glabrata_orfs.fa | more ``` --- #BLAST: what are the tools * makeblastdb - index a database (required to do once before searching) * blastn - DNA/RNA to DNA/RNA search * blastp - protein to protein search * blastx - translated query (DNA/RNA) against protein database * tblastn - protein query against translated (DNA/RNA) database * tblastx - translated query and database (both in DNA/RNA but search in protein space) * blastdbcmd - retrieve a sequence from a blast formatted DB --- #BLAST: what are the cmdline options? All the tools have documented command line options. Use -h or -help to get detailed info. Sometimes with no arguments will print documentation, other times will not. ```bash $ makeblastdb USAGE makeblastdb [-h] [-help] [-in input_file] [-input_type type] -dbtype molecule_type [-title database_title] [-parse_seqids] [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids] [-mask_desc mask_algo_descriptions] [-gi_mask] [-gi_mask_name gi_based_mask_names] [-out database_name] [-max_file_sz number_of_bytes] [-logfile File_Name] [-taxid TaxID] [-taxid_map TaxIDMapFile] [-version] DESCRIPTION Application to create BLAST databases, version 2.2.30+ Use '-help' to print detailed descriptions of command line arguments ======================================================================== ``` --- #BLAST: what are the cmdline options? ```bash $ blastn -h USAGE blastn [-h] [-help] [-import_search_strategy filename] [-export_search_strategy filename] [-task task_name] [-db database_name] [-dbsize num_letters] [-gilist filename] [-seqidlist filename] [-negative_gilist filename] [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm] [-subject subject_input_file] [-subject_loc range] [-query input_file] [-out output_file] [-evalue evalue] [-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty] [-perc_identity float_value] [-qcov_hsp_perc float_value] [-xdrop_ungap float_value] [-xdrop_gap float_value] [-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps int_value] [-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy] [-min_raw_gapped_score int_value] [-template_type type] [-template_length int_value] [-dust DUST_options] [-filtering_db filtering_database] [-window_masker_taxid window_masker_taxid] [-window_masker_db window_masker_db] [-soft_masking soft_masking] [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value] [-best_hit_score_edge float_value] [-window_size int_value] [-off_diagonal_range int_value] [-use_index boolean] [-index_name string] [-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines] [-outfmt format] [-show_gis] [-num_descriptions int_value] [-num_alignments int_value] [-line_length line_length] [-html] [-max_target_seqs num_sequences] [-num_threads int_value] [-remote] [-version] DESCRIPTION Nucleotide-Nucleotide BLAST 2.2.30+ Use '-help' to print detailed descriptions of command line arguments ``` --- #BLAST: some key arguments * -query - query file name (required) * -db - database file name (require) * -evalue - set the evalue cutoff * -max_target_seqs - max number of hit seqs to show * -num_alignments - max number of alignments to show * -num_threads - number of threads (parallel processing to run, 8 will be faster than 2) * -outfmt - specify a simpler format than the text format, try '-outfmt 6' for tabular format * -subject - instead of doing a DB search, search for alignments between query sequence and 1 to many subject sequences. Useful when want to just see the alignment of 2 sequences already picked out from other analyses --- #BLAST: Putting it all together ```bash #PBS -l nodes=1:ppn=2 module load ncbi-blast if [ ! $PBS_PPN ]; then PBS_PPN=1 fi if [ ! -f C_glabrata_orfs.fa.nhr ]; then makeblastdb -in C_glabrata_orfs.fa -dbtype nucl fi blastn -query yeast_chr1_orfs.fa -db C_glabrata_orfs.fa -evalue 1e-5 -outfmt 6 -out yeastORF-vs-CglabrataORF.BLASTN.tab -a $PBS_PPN ``` --- #Other types of search tools * HMMER + Pfam * Identify conserved domains in a protein * Sensitive searches for distant homologs * phmmer can be of comparable speed to BLASTP * HMMs are a way to not just match a single sequence but match a pattern * FASTA * Another tool like BLAST * Doesn't require formatting the database * FASTA/SSEARCH are more full length optimal alignments instead of individual scoring pairs, a single best alignment generated * Global alignment also with ggsearch * Exonerate * Another aligner useful for cDNA to genome alignment and protein to genome alignment * splice-site aware * output harder to parse but there is a GFF-flavor output and parsers in some toolkits --- # Git for Science Git is a version control system * Allows you to have a repository of files (code, data, images, etc) * You can "check-in" versions which will be archived with a version number/tag. Can always roll back to this version now. * Git repositories can be shared, synchronized to a central repository or shared among multiple distribited points * Key is that you can work on files, share your changes with collaborators. * You can also rewind your own changes, tag a version for a specific use (e.g. when you published the paper) and keep working on the code. --- # Git HowTo * [Github's Tutorial](https://guides.github.com/activities/hello-world/) on how to use their system * SW Carpentry [Git Novice](https://swcarpentry.github.io/git-novice/) * [Git for Scientists](http://nyuccl.org/pages/gittutorial/) --- #Git practice in class * Make a repository * add some files * checkin some file versions * make changes, check in another version * sync to github * collaborate, create a common repository for your team to work out of [GitHub collaboration](https://help.github.com/categories/collaborating/) Advanced concepts * Branches and Merging * Creating a tagged version * Resolving conflicts