GHOSTX

GHOSTX is a homology search tool that, like BLAST, can detect remote homologues. By using suffix arrays, it is about 100 times more efficient than BLAST. GHOSTX outputs search results in a format similar to the BLAST tabular format.

News

We've released GHOSTZ, another homology search tool. Its search is further accelerated by clustering database subsequences. [GHOSTZ Webpage]
GhostKOALA, a KEGG service for metagenome annotation, uses GHOSTX to assign K numbers to query sequences. [GhostKOALA Webpage]

Download

Kegg Analyzer

Requirements

gcc >4.3

Installation

Download the GHOSTX archive from the link above.
Extract the archive and cd into the src directory inside the extracted directory.
Run the make command.
Copy the 'ghostx' file to any directory you like.

      $ tar xvzf ghostx.tar.gz
      $ cd ghostx/src
      $ make
      $ cp ghostx /AS/YOU/LIKE/

For SPARC64 VIIIfx Users

GHOSTX can be used on a SPARC64 VIIIfx system, but users must install the Boost C++ library and compile with a dedicated makefile (Makefile.fcc).

      $ tar xvzf ghostx.tar.gz
      $ cd ghostx/src
      $ make -f Makefile.fcc
      $ cp ghostx /AS/YOU/LIKE/

Usage

GHOSTX requires specially formatted database files for homology search. These files are generated from DNA or protein sequence files in FASTA format.
First, prepare a database file in FASTA format and convert it into GHOSTX format database files with the GHOSTX "db" command. The "db" command requires two arguments ([-i dbFastaFile] and [-o dbName]). It divides the database FASTA file into several chunks and generates several files (.inf, .ind, .nam, .pos, .seq), all of which are needed for the search. Users can specify the size of each chunk: a smaller chunk size requires less memory but lowers the efficiency of the search. With the default chunk size (1 GB), GHOSTX requires about 10 GB of memory for database construction and about 13 GB for homology search.
To run a homology search, use the GHOSTX "aln" command, which requires at least two arguments ([-i qryName] and [-d dbName]).

Example

$ ghostx db  -i ./db.fasta -o exdb

$ ghostx aln -i exqry -d exdb -o exout
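
The two steps above can also be scripted. The following is a minimal Python sketch, not part of GHOSTX itself: the helper names (build_db_command, build_aln_command, run) are hypothetical, and it assumes the 'ghostx' binary is on PATH. It only uses flags documented on this page.

```python
import shutil
import subprocess


def build_db_command(fasta_path, db_name, chunk_size=None, seq_type=None):
    """Return the argument list for 'ghostx db' (hypothetical helper)."""
    cmd = ["ghostx", "db", "-i", fasta_path, "-o", db_name]
    if chunk_size is not None:
        cmd += ["-l", str(chunk_size)]  # chunk size in bytes
    if seq_type is not None:
        cmd += ["-t", seq_type]  # "p" (protein) or "d" (dna)
    return cmd


def build_aln_command(query_path, db_name, output_path, threads=None):
    """Return the argument list for 'ghostx aln' (hypothetical helper)."""
    cmd = ["ghostx", "aln", "-i", query_path, "-d", db_name, "-o", output_path]
    if threads is not None:
        cmd += ["-a", str(threads)]  # number of threads
    return cmd


def run(cmd):
    """Run the command if the binary is on PATH; otherwise only print it."""
    if shutil.which(cmd[0]) is None:
        print("ghostx not found on PATH; would run:", " ".join(cmd))
        return None
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Same file names as in the example above.
    run(build_db_command("./db.fasta", "exdb"))
    run(build_aln_command("exqry", "exdb", "exout"))
```

Building the argument list separately from running it makes it easy to inspect or log the exact command before a long database construction starts.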

Command and Options

db: convert a FASTA file to GHOSTX format database files

  ghostx db [-i dbFastaFile] [-o dbName] [-l chunkSize] [-t databaseType]

  Options:
  (Required)
    -i STR    DNA or protein sequences in FASTA format for a database
    -o STR    Name of the database

  (Optional)
    -l INT    Chunk size of the database (bytes) [1073741824 (=1GB)]
    -t STR    Database sequence type, p (protein) or d (dna) [p]


aln: search a database for homologues of the queries

  ghostx aln [-i queries] [-o output] [-d database] [-v maxNumAliSub]
             [-b maxNumAliQue] [-M scoreMatrix] [-G openGap] [-E extendGap]
             [-l CandidatesSize] [-s lowerCutoff] [-T UpperCutoff]
             [-S searchLength] [-q queryType] [-t databaseType]
             [-a numThreads] [-L maxNumHits] [-w maxAliLen]

  Options:
  (Required)
    -i STR    DNA or protein sequences in FASTA format for queries
    -o STR    Output file
    -d STR    Database name (must be formatted)

  (Optional)
    -v INT    Maximum number of alignments for each subject [1]
    -b INT    Maximum number of the output for a query [10]

    -M STR    Score matrix file [BLOSUM62]
    -G INT    Open gap penalty [11]
    -E INT    Extend gap penalty [1]

    -l INT    Maximum size of the candidates (bytes) [134217728 (=128MB)]
    -s INT    Lower limit cutoff score for seed search [4]
    -T INT    Upper limit cutoff score for seed search [30]
    -S INT    Maximum length of alignments in seed search [10]
    -q STR    Query sequence type, p (protein) or d (dna) [p]
    -t STR    Database sequence type, p (protein) or d (dna) [p]
    -F STR    Filter query sequence, T (enable) or F (disable) [T] 
    -a INT    The number of threads [1]
    -L INT    Maximum number of hits [67108864]
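
Both -l options take a size in bytes, and the defaults shown above use binary units (1 GB = 2^30 bytes, 128 MB = 2^27 bytes). As a small convenience sketch, the hypothetical helper to_bytes below converts a human-readable size into the integer to pass to -l. It is an illustration only and not part of GHOSTX.

```python
# Binary units, matching the defaults listed above
# (1073741824 = 1 GB, 134217728 = 128 MB).
UNITS = {"B": 1, "KB": 2**10, "MB": 2**20, "GB": 2**30}


def to_bytes(size):
    """Convert a string such as '128MB' or '1GB' to a number of bytes."""
    text = size.strip().upper()
    # Check the two-letter suffixes first, because "KB" also ends with "B".
    for suffix in ("KB", "MB", "GB", "B"):
        if text.endswith(suffix):
            number = text[: -len(suffix)]
            return int(float(number) * UNITS[suffix])
    return int(text)  # a plain number is taken as bytes


print(to_bytes("1GB"))    # 1073741824, the default chunk size of "db"
print(to_bytes("128MB"))  # 134217728, the default candidates size of "aln"
print(to_bytes("512MB"))  # 536870912, a smaller chunk for machines with less memory
```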

Search results

GHOSTX outputs search results as a tab-delimited file.

Example:
  hsa:124045...   hsa:124045...   100       139     0       0       1       139     1       139     2.04391e-76     283.878
  hsa:124045...   ptr:454320...   99.2126        127     1       0       13      139     14      140     5.96068e-68     255.758
  hsa:124045...   mcc:714360...   88.9764        127     14      0       13      139     14      140     5.05773e-59     226.098
  hsa:124045...   rno:292078...   58.6777        121     46      2       13      133     14      130     1.38697e-32     138.272
  hsa:124045...   mmu:320869...   55.9055        127     50      3       13      139     12      132     1.17414e-31     135.191
  hsa:124045...   pon:100434...   96.4912        57      2       0       13      69      14      70      3.65839e-25     113.62
  hsa:124045...   bta:100335...   44.9275        138     71      3       2       139     25      157     4.04482e-24     110.153
  hsa:124045...   aml:100464...   26.6667        75      46      2       13      81      1183    1254    0.820692        32.7278
  hsa:124045...   bfo:BRAFLD...   56        25      10      1       108     131     581     605     0.820692        32.7278
  hsa:124045...   tgu:100227...   26.1682        107     69      3       25      130     150     247     1.82831         31.5722

The columns are:
1.  Name of a query sequence
2.  Name of a homologue sequence (subject)
3.  Sequence identity (%)
4.  Alignment length
5.  Number of mismatches in the alignment
6.  Number of gap openings in the alignment
7.  Start position of the query in the alignment
8.  End position of the query in the alignment
9.  Start position of the subject in the alignment
10. End position of the subject in the alignment
11. E-value
12. Normalized score
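
Because the output is plain tab-delimited text with the twelve columns listed above, it can be read with a few lines of code. The following Python sketch is an illustration only: the Hit type, its field names, and the functions parse_line, parse_results, and filter_by_evalue are hypothetical and not part of GHOSTX. It assumes one hit per line with exactly twelve tab-separated fields.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One line of GHOSTX output; field names are illustrative."""
    query: str
    subject: str
    identity: float
    length: int
    mismatches: int
    gap_openings: int
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    evalue: float
    score: float


def parse_line(line: str) -> Hit:
    """Split one tab-delimited line into the twelve documented columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    return Hit(
        fields[0],
        fields[1],
        float(fields[2]),
        *map(int, fields[3:10]),  # columns 4 to 10 are integers
        float(fields[10]),
        float(fields[11]),
    )


def parse_results(lines: Iterable[str]) -> List[Hit]:
    """Parse all non-empty lines of an output file."""
    return [parse_line(line) for line in lines if line.strip()]


def filter_by_evalue(hits: Iterable[Hit], max_evalue: float) -> List[Hit]:
    """Keep only hits whose E-value is at most max_evalue."""
    return [hit for hit in hits if hit.evalue <= max_evalue]


if __name__ == "__main__":
    # Two lines taken from the example above, joined with tabs.
    sample = [
        "\t".join(["hsa:124045...", "hsa:124045...", "100", "139", "0", "0",
                   "1", "139", "1", "139", "2.04391e-76", "283.878"]),
        "\t".join(["hsa:124045...", "aml:100464...", "26.6667", "75", "46", "2",
                   "13", "81", "1183", "1254", "0.820692", "32.7278"]),
    ]
    for hit in filter_by_evalue(parse_results(sample), 1e-5):
        print(hit.subject, hit.identity, hit.evalue)
```

To read a real result file, pass an open file object to parse_results, for example the file written with the -o option of "aln".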

References

Shuji Suzuki, Masanori Kakuta, Takashi Ishida, and Yutaka Akiyama, GHOSTX: An Improved Sequence Homology Search Algorithm Using a Query Suffix Array and a Database Suffix Array, PLoS One. 2014; 9(8): e103833.