GHOSTX is a homology search tool that detects remote homologues as BLAST does and, by using suffix arrays, is about 100 times more efficient than BLAST. GHOSTX outputs search results in a format similar to the BLAST tabular format.
$ tar xvzf ghostx.tar.gz
$ cd ghostx/src
$ make
$ cp ghostx /AS/YOU/LIKE/
GHOSTX can be used on a SPARC64 VIIIfx system, but the user must install the Boost C++ library and compile with a dedicated makefile.
$ tar xvzf ghostx.tar.gz
$ cd ghostx/src
$ make -f Makefile.fcc
$ cp ghostx /AS/YOU/LIKE/
GHOSTX requires specifically formatted database files for homology search.
These files can be generated from FASTA formatted DNA/protein sequence files.
Users must first prepare a database file in FASTA format and convert it
into GHOSTX format database files with the GHOSTX "db" command.
The GHOSTX "db" command requires two arguments ([-i dbFastaFile] and [-o dbName]).
It divides the database FASTA file into several database chunks and
generates several files (.inf, .ind, .nam, .pos, .seq).
All generated files are needed for the search.
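
Because every generated file is needed, it can help to check that none is missing before starting a search. The following is a minimal sketch, not part of GHOSTX: the helper name missing_db_extensions is hypothetical, and since this README does not spell out how chunk numbers appear in the file names, the check only assumes that each file name starts with the database name and ends with one of the listed extensions.

```python
from pathlib import Path

# Extensions listed in this README for a formatted GHOSTX database.
GHOSTX_DB_EXTENSIONS = (".inf", ".ind", ".nam", ".pos", ".seq")


def missing_db_extensions(db_name):
    """Return the extensions for which no database file was found.

    Assumption: each generated file name starts with the database name
    and ends with one of the extensions above. Anything in between
    (for example a chunk number) is accepted.
    """
    db_path = Path(db_name)
    directory = db_path.parent
    if not directory.is_dir():
        return list(GHOSTX_DB_EXTENSIONS)
    # Collect the names that could belong to this database.
    names = [p.name for p in directory.iterdir()
             if p.name.startswith(db_path.name)]
    return [ext for ext in GHOSTX_DB_EXTENSIONS
            if not any(name.endswith(ext) for name in names)]
```

An empty list means at least one file was found for every extension. It does not prove that every chunk is complete.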
Users can specify the size of each chunk. A smaller chunk size requires less
memory but makes the search less efficient.
With the default chunk size (1 GB), GHOSTX requires about 10 GB of memory
for database construction and about 13 GB for homology search.
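
The chunk size is given to the "db" command in bytes (option -l, whose default is listed in the usage text as 1073741824, i.e. 1 GB). The following small sketch converts a human-readable size into that byte count. The helper name size_to_bytes is hypothetical, and it assumes binary units (1 GB = 2^30 bytes), which matches the defaults shown in this README.

```python
# Binary unit multipliers, consistent with "1073741824 (=1GB)" and
# "134217728 (=128MB)" in the usage text.
_UNITS = {"K": 2**10, "M": 2**20, "G": 2**30}


def size_to_bytes(text):
    """Convert strings such as '512M', '1GB' or '4096' to a byte count."""
    text = text.strip().upper()
    if text.endswith("B"):          # accept both '1G' and '1GB'
        text = text[:-1]
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)                # plain number of bytes


if __name__ == "__main__":
    # Value that could be passed to "ghostx db -l" for a 512 MB chunk.
    print(size_to_bytes("512M"))
```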
To run a homology search, use the GHOSTX "aln" command, which
requires at least two arguments ([-i qryName] and [-d dbName]).
$ ghostx db -i ./db.fasta -o exdb
$ ghostx aln -i exqry -d exdb -o exout
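
The two steps above can also be driven from a script. The following is a minimal sketch that assumes the ghostx binary is on PATH. The function names and the exe parameter are hypothetical helpers, and only options documented in this README (-i, -o, -d, -l, -t, -a, -b) are used.

```python
import subprocess


def build_db_command(fasta, db_name, chunk_size=None, seq_type=None,
                     exe="ghostx"):
    """Build the argument list for 'ghostx db'."""
    cmd = [exe, "db", "-i", fasta, "-o", db_name]
    if chunk_size is not None:
        cmd += ["-l", str(chunk_size)]      # chunk size in bytes
    if seq_type is not None:
        cmd += ["-t", seq_type]             # 'p' (protein) or 'd' (dna)
    return cmd


def build_aln_command(queries, db_name, output, threads=None,
                      max_hits_per_query=None, exe="ghostx"):
    """Build the argument list for 'ghostx aln'."""
    cmd = [exe, "aln", "-i", queries, "-d", db_name, "-o", output]
    if threads is not None:
        cmd += ["-a", str(threads)]
    if max_hits_per_query is not None:
        cmd += ["-b", str(max_hits_per_query)]
    return cmd


def run_search(fasta, db_name, queries, output, threads=None):
    """Format the database, then search it. Raises on a non-zero exit."""
    subprocess.run(build_db_command(fasta, db_name), check=True)
    subprocess.run(build_aln_command(queries, db_name, output,
                                     threads=threads), check=True)


if __name__ == "__main__":
    # Same file names as in the shell example above.
    run_search("./db.fasta", "exdb", "exqry", "exout")
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.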
db: convert a FASTA file to GHOSTX format database files

  ghostx db [-i dbFastaFile] [-o dbName] [-l chunkSize]

  Options:
  (Required)
    -i STR    Protein sequences in FASTA format for a database
    -o STR    The name of database

  (Optional)
    -l INT    Chunk size of the database (bytes) [1073741824 (=1GB)]
    -t STR    Database sequence type, p (protein) or d (dna) [p]

aln: Search homologues of queries from database

  ghostx aln [-i queries] [-o output] [-d database] [-v maxNumAliSub]
             [-b maxNumAliQue] [-M scoreMatrix] [-G openGap] [-E extendGap]
             [-l CandidatesSize] [-s lowerCutoff] [-T UpperCutoff]
             [-S searchLength] [-q queryType] [-t databaseType]
             [-a numThreads] [-L maxNumHits] [-w maxAliLen]

  Options:
  (Required)
    -i STR    DNA or protein sequences in FASTA format for queries
    -o STR    Output file
    -d STR    Database name (must be formatted)

  (Optional)
    -v INT    Maximum number of alignments for each subject [1]
    -b INT    Maximum number of the output for a query [10]
    -M STR    Score matrix file [BLOSUM62]
    -G INT    Open gap penalty [11]
    -E INT    Extend gap penalty [1]
    -l INT    Maximum size of the candidates (bytes) [134217728 (=128MB)]
    -s INT    Lower limit cutoff score for seed search [4]
    -T INT    Upper limit cutoff score for seed search [30]
    -S INT    Maximum length of alignments in seed search [10]
    -q STR    Query sequence type, p (protein) or d (dna) [p]
    -t STR    Database sequence type, p (protein) or d (dna) [p]
    -F STR    Filter query sequence, T (enable) or F (disable) [T]
    -a INT    The number of threads [1]
    -L INT    Maximum number of hits [67108864]
GHOSTX outputs search results as a tab-delimited file.

Example)
hsa:124045...  hsa:124045...  100      139  0   0  1    139  1     139   2.04391e-76  283.878
hsa:124045...  ptr:454320...  99.2126  127  1   0  13   139  14    140   5.96068e-68  255.758
hsa:124045...  mcc:714360...  88.9764  127  14  0  13   139  14    140   5.05773e-59  226.098
hsa:124045...  rno:292078...  58.6777  121  46  2  13   133  14    130   1.38697e-32  138.272
hsa:124045...  mmu:320869...  55.9055  127  50  3  13   139  12    132   1.17414e-31  135.191
hsa:124045...  pon:100434...  96.4912  57   2   0  13   69   14    70    3.65839e-25  113.62
hsa:124045...  bta:100335...  44.9275  138  71  3  2    139  25    157   4.04482e-24  110.153
hsa:124045...  aml:100464...  26.6667  75   46  2  13   81   1183  1254  0.820692     32.7278
hsa:124045...  bfo:BRAFLD...  56       25   10  1  108  131  581   605   0.820692     32.7278
hsa:124045...  tgu:100227...  26.1682  107  69  3  25   130  150   247   1.82831      31.5722

Each column shows:
 1. Name of a query sequence
 2. Name of a homologue sequence (subject)
 3. Sequence identity
 4. Alignment length
 5. The number of mismatches in the alignment
 6. The number of gap openings in the alignment
 7. Start position of the query in the alignment
 8. End position of the query in the alignment
 9. Start position of the subject in the alignment
10. End position of the subject in the alignment
11. E-value
12. Normalized score
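
The twelve columns can be read into named fields for downstream filtering. The following is a minimal sketch, not part of GHOSTX: the names Hit, parse_hits and filter_by_evalue are hypothetical, and it assumes every result line holds exactly the twelve tab-separated columns described above.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One line of GHOSTX output, in the column order listed above."""
    query: str
    subject: str
    identity: float
    length: int
    mismatches: int
    gap_openings: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    score: float


def parse_hits(lines):
    """Yield a Hit for every line that has exactly twelve columns."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 12:
            continue                      # skip blank or malformed lines
        yield Hit(fields[0], fields[1], float(fields[2]),
                  *(int(f) for f in fields[3:10]),
                  float(fields[10]), float(fields[11]))


def filter_by_evalue(hits, max_evalue):
    """Keep only hits whose E-value does not exceed max_evalue."""
    return [h for h in hits if h.evalue <= max_evalue]


if __name__ == "__main__":
    # "exout" is the output file name used in the example command.
    with open("exout") as handle:
        for hit in filter_by_evalue(parse_hits(handle), 1e-5):
            print(hit.query, hit.subject, hit.identity, hit.evalue)
```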