GPU-based HOmology Search Tool for Metagenomics

GHOSTM is a homology search tool for the large volumes of short reads generated by next-generation sequencers. Like BLAST, GHOSTM can detect remote homologs, and by using GPU computing it is about 130 times more efficient than BLAST.


Ver. 1.2: ghostm.1.2.tar.gz


GHOSTM requires a CUDA-capable NVIDIA GPU, the CUDA Toolkit, and the GPU Computing SDK code samples. The toolkit and the SDK can both be downloaded from NVIDIA's website (http://developer.nvidia.com/object/cuda_download.html).


  1. Download the GHOSTM archive from the above link.
  2. Copy it to the C source directory of the "GPU Computing SDK code samples" ("$CUDASDK/C/src/" if you installed the SDK to $CUDASDK) and cd into that directory.
  3. Extract the archive and cd into the extracted directory.
  4. Run the make command.
  5. An executable file 'ghostm' is generated at "$CUDASDK/C/bin/$arc/release".
  6. Copy the 'ghostm' file to any directory you like.
$ cp ghostm-1.2.tar.gz $CUDASDK/C/src/
$ cd $CUDASDK/C/src/
$ tar xvzf ghostm-1.2.tar.gz
$ cd ghostm-1.2
$ make
$ cp $CUDASDK/C/bin/$arc/release/ghostm /AS/YOU/LIKE/


GHOSTM requires specially formatted database files and query files for homology search. These files are generated from FASTA-formatted DNA/protein sequence files.
Users first have to prepare a database file in FASTA format and convert it into GHOSTM-format database files with the GHOSTM "db" command. The "db" command requires 2 arguments ([-i dbFastaFile] and [-o dbName]) and generates several files (.inf, .ind, .nam, .pos, .seq). All generated files are needed for the search.
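Because all of the generated files are needed for the search, it can be useful to confirm that they are present before searching. The following is a minimal Python sketch, not part of GHOSTM. It assumes that each file is named by appending the extension to dbName (for example exdb.inf); that naming scheme and the helper name missing_db_files are assumptions made for illustration.

```python
import os
import tempfile

# Extensions listed above for the files produced by "ghostm db".
DB_EXTENSIONS = (".inf", ".ind", ".nam", ".pos", ".seq")


def missing_db_files(db_name, directory="."):
    """Return the expected database files that are absent from `directory`.

    Assumption: each file is named db_name + extension (e.g. exdb.inf).
    """
    expected = [db_name + ext for ext in DB_EXTENSIONS]
    return [name for name in expected
            if not os.path.isfile(os.path.join(directory, name))]


if __name__ == "__main__":
    # Demonstration with empty placeholder files in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        for ext in (".inf", ".ind", ".nam"):
            open(os.path.join(tmp, "exdb" + ext), "w").close()
        print(missing_db_files("exdb", tmp))
```

An empty list means that every expected file was found under the assumed naming scheme.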
Users also have to convert the query FASTA file into GHOSTM-format query files with another GHOSTM command, "qry". It also requires 2 arguments ([-i qryFastaFile] and [-o qryName]). Use the -t option to specify the sequence type of the queries (d for DNA, p for amino acids; the default is p).
To run a homology search, the GHOSTM "aln" command requires at least 2 arguments ([-i qryName] and [-d dbName]). GPU execution requires the -D option with a GPU device ID. If no GPU device is specified, CPU execution mode is used. GPU device IDs can be checked with the "deviceQuery" program in the CUDA SDK.

The archive includes small query and database files in "testset/" for testing the system. To test the system with a larger database, download one from the public databases (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/).
$ ghostm db  -i ./db.fasta -o exdb

$ ghostm qry -t d -i ./queries.fasta -o exqry

$ ghostm aln -D 0 -i exqry -d exdb -o exaln
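When the three steps are run repeatedly, they can be scripted. The following Python sketch only builds the three command lines shown above as argument lists; the function name ghostm_commands and its parameters are hypothetical, and it assumes the 'ghostm' executable is on the PATH. It uses only the options that appear in the examples above.

```python
def ghostm_commands(db_fasta, query_fasta, db_name, query_name, output,
                    query_type="d", device_id=None, ghostm="ghostm"):
    """Build the db, qry and aln command lines as argument lists."""
    db_cmd = [ghostm, "db", "-i", db_fasta, "-o", db_name]
    qry_cmd = [ghostm, "qry", "-t", query_type,
               "-i", query_fasta, "-o", query_name]
    aln_cmd = [ghostm, "aln"]
    if device_id is not None:
        # -D selects a GPU device; without it the search runs on the CPU.
        aln_cmd += ["-D", str(device_id)]
    aln_cmd += ["-i", query_name, "-d", db_name, "-o", output]
    return [db_cmd, qry_cmd, aln_cmd]


if __name__ == "__main__":
    commands = ghostm_commands("./db.fasta", "./queries.fasta",
                               "exdb", "exqry", "exaln", device_id=0)
    for cmd in commands:
        print(" ".join(cmd))
        # To actually run a step: subprocess.run(cmd, check=True)
```

Keeping the commands as lists means they can be passed to subprocess.run without shell quoting, and leaving device_id as None gives the CPU execution mode.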

Commands and Options
db:	convert a FASTA file to GHOSTM format database files

  ghostm db [-i dbFastaFile] [-o dbName] [-k kSize] [-l chunkSize]

	-i STR		Protein sequences in FASTA format for a database
	-o STR		The name of database

	-k INT		K-mer size (K) [4]
	-l INT		Chunk size of the database (MB) [128]

qry:	convert a FASTA file to GHOSTM format query files

  ghostm qry [-i qryFastaFile] [-o qryName] [-l maxLength] [-L chunkSize] [-t queryType]

	-i STR		Query sequences in FASTA format
	-o STR		The name of query

	-l INT		Max query sequence length [75]
	-L INT		Chunk size of the query (MB) [128]
	-t STR		Sequence type of the query FASTA file [p]
		d			DNA
		p			Amino acids
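Since -l sets the maximum query sequence length (default 75), it may help to check the longest sequence in a query FASTA file before converting it. The following Python sketch does that. It is not part of GHOSTM, the function names are hypothetical, and it makes no assumption about how GHOSTM treats sequences longer than the -l value.

```python
def fasta_lengths(lines):
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # The header text up to the first whitespace is used as the name.
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def longest_record(lines):
    """Return (name, length) of the longest record, or None if there are none."""
    return max(fasta_lengths(lines), key=lambda rec: rec[1], default=None)


if __name__ == "__main__":
    example = [">read1 sample", "ACGTACGTAC", "ACGT", ">read2", "ACGTA"]
    print(longest_record(example))
```

For a real file, pass an open file object, for example longest_record(open("queries.fasta")), and compare the reported length with the value intended for -l.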

aln:	search the database for homologs of the queries

  ghostm aln [-i queries] [-d database] [-D deviceId]
              [-v] [-o output] [-b best]
              [-G openGap] [-E extendGap] [-M scoreMatrix] 
              [-l CandidatesSize] [-s skipSize] [-t threshold] [-r regionSize] [-e extendSize]

	-i STR		Input query name (must be formatted)
	-d STR		Database name (must be formatted)
	-o STR		Output file
	-D INT		GPU device ID (run without a GPU if this option is not given)

	-v			Verbose mode
	-b INT		Number of hits to output per query [10]

	-M STR		Score matrix file [BLOSUM62]
	-G INT		Open gap penalty [11]
	-E INT		Extend gap penalty [1]

	-l INT		Maximum size of the candidates (MB) [128]
	-s INT		Skip size for the query's K-mers [2]
	-r INT		The size of the search regions [8]
	-t INT		Required minimum number of candidate seeds in a search region [2]
	-e INT		The width for extending an alignment region [2]
	-S INT		Start query chunk id [0]
	-E INT		Last query chunk id [last chunk id]
Search results
GHOSTM writes the search results to a tab-delimited file.

query0  subject0        100     25      25      1       25      2.75456e-15     60.4622 
query0  subject6        100     10      10      16      25      2.58417e-05     27.335  
query1  subject0        100     24      24      1       24      1.36707e-14     58.151  
query1  subject6        100     9       9       16      24      0.000128251     25.0238 
query2  subject5        100     25      25      1       25      4.55093e-10     43.1282 
query2  subject6        84.2105 19      16      1       19      1.15998e-05     28.4906 
query3  subject6        100     25      25      1       25      2.85052e-12     50.447  
query3  subject5        84.2105 19      16      7       25      1.15998e-05     28.4906 
query3  subject0        100     10      10      16      25      2.58417e-05     27.335  
query4  subject6        100     25      25      1       25      2.85052e-12     50.447  
query4  subject5        84.2105 19      16      7       25      1.15998e-05     28.4906 
query4  subject0        100     10      10      16      25      2.58417e-05     27.335  

The columns are:
1.      Name of a query sequence
2.      Name of a homolog sequence (subject)
3.      Sequence identity (%)
4.      Alignment length
5.      The number of matches in the alignment
6.      Start position of the subject in the alignment
7.      End position of the subject in the alignment
8.      E-value calculated using Karlin-Altschul statistics
9.      Bit score calculated using Karlin-Altschul statistics
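The column layout above can be read with a few lines of Python. The following sketch is not part of GHOSTM. The names Hit, parse_hit and filter_hits and the max_evalue threshold are hypothetical. In the sample rows above, column 3 equals 100 x column 5 / column 4 (for example 16 / 19 gives 84.2105), which the demonstration checks.

```python
from collections import namedtuple

# Field names follow the nine columns described above.
Hit = namedtuple(
    "Hit", "query subject identity length matches start end evalue bitscore")


def parse_hit(line):
    """Parse one tab-delimited result line into a Hit."""
    fields = [f.strip() for f in line.rstrip("\r\n").split("\t")]
    fields = [f for f in fields if f]  # tolerate a trailing tab
    if len(fields) != 9:
        raise ValueError("expected 9 columns, got %d" % len(fields))
    return Hit(fields[0], fields[1], float(fields[2]),
               int(fields[3]), int(fields[4]),
               int(fields[5]), int(fields[6]),
               float(fields[7]), float(fields[8]))


def filter_hits(lines, max_evalue):
    """Yield the hits whose E-value is at most max_evalue."""
    for line in lines:
        if line.strip():
            hit = parse_hit(line)
            if hit.evalue <= max_evalue:
                yield hit


if __name__ == "__main__":
    rows = [
        "query2\tsubject5\t100\t25\t25\t1\t25\t4.55093e-10\t43.1282",
        "query2\tsubject6\t84.2105\t19\t16\t1\t19\t1.15998e-05\t28.4906",
    ]
    for hit in filter_hits(rows, 1e-3):
        recomputed = 100.0 * hit.matches / hit.length
        print(hit.query, hit.subject, hit.identity, round(recomputed, 4))
```

For a real result file, pass an open file object as lines, for example filter_hits(open("exaln"), 1e-5).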


[1] S. Suzuki, T. Ishida, K. Kurokawa, and Y. Akiyama, "GHOSTM: A GPU-Accelerated Homology Search Tool for Metagenomics," PLoS ONE, vol. 7, no. 5, p. e36060, 2012.
Copyright © 2010-2012 Akiyama Laboratory, Tokyo Institute of Technology. All Rights Reserved.