Clustering protein sequences based on their sequence patterns has attracted considerable research effort over the last decade. The central problem in most clustering systems is how to represent and interpret protein sequences, since this choice largely determines the performance of the classifier. In this paper, we propose a new methodology that defines a new descriptor to represent and interpret each sequence using its probability density function (PDF). The Hellinger distance is used to measure the similarity between sequences. A hierarchical algorithm is then applied to cluster the protein sequences using the Hellinger distance. Two protein data sets are used in the experiments: the first is a mixture of Influenza and Ebola virus sequences, and the second is a set of Influenza sequences. We compare two hierarchical clustering algorithms. The first bases its similarity measure on methods that use sequence alignment (HCAWSA). The second, the proposed approach, bases its similarity measure on methods that do not use sequence alignment (HCAWOSA). The experimental results show that the proposed methodology is feasible and achieves good accuracy.
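The alignment-free pipeline described in the abstract (a distribution per sequence, pairwise Hellinger distances, then hierarchical clustering) can be illustrated with a minimal sketch. The abstract does not specify how the PDF descriptor is built or which linkage is used, so this example assumes, purely for illustration, the empirical frequency of the 20 standard amino acids as the descriptor and average linkage for merging. The function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def residue_distribution(seq):
    """Empirical residue frequencies of seq (assumed stand-in for the PDF descriptor)."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[a] / total for a in AMINO_ACIDS]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions; lies in [0, 1]."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)


def average_linkage(dist, n_clusters):
    """Agglomerative clustering on a precomputed distance matrix (assumed linkage)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Mean pairwise distance between members of the two clusters.
                d = sum(dist[a][b] for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


def cluster_sequences(seqs, n_clusters):
    """Cluster sequences without alignment, using Hellinger distances."""
    pdfs = [residue_distribution(s) for s in seqs]
    dist = [[hellinger(p, q) for q in pdfs] for p in pdfs]
    return average_linkage(dist, n_clusters)


if __name__ == "__main__":
    # Toy strings, not real viral proteins.
    toy = ["AAAAAC", "AAAACC", "WWWWWY", "WWWWYY"]
    print(cluster_sequences(toy, 2))
```

Because the distance is computed from distributions rather than from aligned positions, sequences of different lengths can be compared directly, which is the practical appeal of an alignment-free measure. A residue-frequency descriptor discards residue order, so a real descriptor would likely need to capture more of the sequence pattern than this sketch does.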
Digital Object Identifier (DOI)
"New Hierarchical Clustering Algorithm for Protein Sequences Based on Hellinger Distance,"
Applied Mathematics & Information Sciences: Vol. 10, Article 32.
Available at: https://dc.naturalspublishing.com/amis/vol10/iss4/32