LCS-HIT

Longest Common Subsequence based sequence clustering tool with High IdenTity

LCS-HIT is a sequence clustering tool very similar to CD-HIT. By filtering candidate pairs using the longest common subsequence (LCS) of the two sequences, LCS-HIT is more than twice as efficient as CD-HIT.
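
Why an LCS can act as a filter: the identical positions in any alignment of two sequences form a common subsequence, so the LCS length is an upper bound on the number of identical bases. A pair whose LCS length, divided by the length of the shorter sequence, falls below the identity threshold can therefore be rejected without computing a full alignment. The Python sketch below illustrates this idea with a plain dynamic-programming LCS. The function names are hypothetical, and this is not LCS-HIT's actual implementation.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def passes_lcs_filter(a: str, b: str, threshold: float = 0.9) -> bool:
    """Return False when the pair cannot possibly reach the identity threshold."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return False
    # The LCS length bounds the number of identical bases from above.
    return lcs_length(a, b) >= threshold * shorter
```

Only pairs that pass such a filter would need the more expensive alignment step. Passing the filter does not guarantee that the pair meets the threshold.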

System Requirements

LCS-HIT runs on Linux with GCC (version 4 or later) and GNU make.

Download

Ver. 0.5.3 (2013/10/16) : lcs_hit-0.5.3.tar.gz

Installation

Download the archive of LCS-HIT from the above link and extract it. Then cd into the extracted directory and run make.

$ tar zxvf lcs_hit-0.5.3.tar.gz
$ cd lcs_hit-0.5.3
$ make

If successful, the make process will produce an executable file "lcs_hit" in the current directory.

You can also run the tests as follows to check that the compiled "lcs_hit" runs correctly.

$ make test

Usage

The usage of LCS-HIT is almost the same as that of CD-HIT.

Usage: lcs_hit [Options]

Options

    -i  input filename in FASTA format, required
    -o  output filename, required
    -c  sequence identity threshold, default 0.9
        this is lcs_hit's default "global sequence identity", calculated as:
        number of identical bases in the alignment
        divided by the full length of the shorter sequence
    -n  word_length, default 8
    -s  length difference cutoff, default 0.0
        if set to 0.9, the shorter sequence needs to be
        at least 90% of the length of the cluster's representative
    -g  1 or 0, default 0
        by cd-hit's default algorithm, a sequence is assigned to the first
        cluster that meets the threshold (fast mode). If set to 1, the program
        assigns it to the most similar cluster that meets the threshold
        (accurate but slow mode).
        Either setting yields the same representatives for the final clusters.
    -h  print this help
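
The interplay of -c, -s, and -g can be sketched as a greedy incremental clustering loop. The Python sketch below is an illustration under stated assumptions, not LCS-HIT's code: it assumes sequences are processed from longest to shortest, as CD-HIT does, and `toy_identity` is a hypothetical stand-in (an ungapped, position-by-position comparison) for the real alignment-based identity.

```python
from typing import Callable, Dict, List


def toy_identity(seq: str, rep: str) -> float:
    """Hypothetical stand-in: ungapped matches / length of the shorter sequence."""
    shorter = min(len(seq), len(rep))
    if shorter == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(seq, rep))
    return matches / shorter


def greedy_cluster(
    seqs: List[str],
    c: float = 0.9,
    s: float = 0.0,
    g: int = 0,
    identity: Callable[[str, str], float] = toy_identity,
) -> Dict[str, List[str]]:
    """Map each representative to its members (the representative included)."""
    clusters: Dict[str, List[str]] = {}
    # Longest first, so every representative is at least as long as its members.
    for seq in sorted(seqs, key=len, reverse=True):
        best_rep, best_id = None, -1.0
        for rep in clusters:
            if len(seq) < s * len(rep):         # -s: length difference cutoff
                continue
            ident = identity(seq, rep)
            if ident >= c and ident > best_id:  # -c: identity threshold
                best_rep, best_id = rep, ident
                if g == 0:                      # -g 0: stop at the first hit
                    break
        if best_rep is None:
            clusters[seq] = [seq]               # no hit: seq starts a new cluster
        else:
            clusters[best_rep].append(seq)
    return clusters
```

In this sketch a sequence becomes a representative only when it matches no existing representative, a condition that does not depend on -g. That is why switching between 0 and 1 changes cluster membership but not the set of representatives.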

History

0.5.3 (2013/10/16): Small modification for old GCCs (4.3 & 4.4)
0.5.2 (2013/10/16): Modification for the latest version of GCC (4.7.3)
0.5.1 (2011/01/11): BMC Bioinformatics version

References

Y. Namiki, T. Ishida, and Y. Akiyama, “Acceleration of sequence clustering using longest common subsequence filtering,” BMC Bioinformatics, vol. 14, no. Suppl 8, S7, 2013.

Y. Namiki, T. Ishida, and Y. Akiyama, “Fast DNA Sequence Clustering Based on Longest Common Subsequence,” Commun. Comput. Inf. Sci., vol. 304, pp. 453–460, 2012.