We are carrying out research and development on the second key technique,
i.e., a high-performance parallel database and parallel file system. The
ultimate goal of this research project is to build a research system for
parallel data mining. We will carry out research and development on the
following key techniques:
(1) New techniques for designing and building Linux clusters. We have built a
two-node prototype Linux cluster using the industry-standard OSCAR Linux
cluster software package. Adding a parallel file system to the cluster to
further improve its parallel I/O performance is now an urgent research and
development task. We have selected PVFS as our research prototype and want to
improve its performance, reliability, and fault tolerance.
(2) High-performance, large-scale parallel database. We have found it
difficult to find a free GNU parallel database for ordinary users, and
commercial parallel databases such as Oracle 9i and IBM DB2 are too expensive
to be used on a Linux cluster. We want to develop a GNU parallel database
system based on MySQL, PVFS, and the MPI message-passing environment.
Combining the database with a parallel file system will further improve the
I/O performance of the parallel database.
(3) High-performance parallel data mining algorithms, especially for
association rule mining and clustering. We are using the WEKA package
(http://www.cs.waikato.ac.nz/~ml/weka/) as the primary starting point for
building a parallel WEKA in the Java language. Although Java has the advantage
of portability, guaranteeing high performance with Java is the key technical
challenge in our research. The ultimate application area of our parallel data
mining system would be bioinformatics, especially data mining for protein
structure prediction.
(4) Data mining for protein structure prediction. Proteins fold spontaneously
and reproducibly into complex three-dimensional globules when placed in an
aqueous solution, and the sequence of amino acids making up a protein appears
to completely determine its three-dimensional structure. This
self-organization cannot occur by a random conformational search for the
lowest energy state, since such a search would take millions of years or
longer, whereas proteins fold in milliseconds (known as Levinthal's paradox).
The challenge that protein folding poses for data mining is how to predict the
three-dimensional tertiary structure of a protein given its linear amino acid
sequence. Some researchers use a hybrid approach that predicts local structure
with a Hidden Markov Model (HMM) and then infers contact rules by association
mining. The HMM models the interactions between adjacent short regions of the
protein sequence, and so attempts to model the propagation of structure along
the sequence. To detect long-range amino acid contacts, they discover rules
that predict whether a pair of residues is in contact. In the testing phase,
one can predict the contact map for an unknown protein, and from the contact
map one can recover the 3D shape.
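
As a minimal sketch of the contact map and contact rule ideas above, the
following Python fragment builds a contact map from known residue coordinates
and measures the support and confidence of one candidate rule of the form
"residue types X and Y imply contact". It does not implement the HMM or the
prediction step. The 8 angstrom cutoff, the minimum sequence separation of 3,
the toy sequence and coordinates, and the names contact_map and rule_stats are
all illustrative assumptions, not part of the approach described above.

```python
import math

# Hypothetical toy protein: six residues laid out in a U shape (assumed data).
TOY_SEQUENCE = "CAGGAC"
TOY_COORDS = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (8, 4, 0), (4, 4, 0), (0, 4, 0)]


def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def contact_map(coords, cutoff=8.0, min_separation=3):
    """Return a symmetric boolean matrix, True where residues i and j touch.

    Two residues count as in contact when they are at least min_separation
    apart along the sequence and at most cutoff apart in space. Both
    thresholds are assumed values chosen for illustration.
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if distance(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap


def rule_stats(sequence, cmap, pair, min_separation=3):
    """Support and confidence of the rule 'residue types == pair -> contact'.

    support    = contacting pairs matching the rule / all candidate pairs
    confidence = contacting pairs matching the rule / pairs matching the rule
    The residue pair is treated as unordered.
    """
    total = matches = hits = 0
    n = len(sequence)
    for i in range(n):
        for j in range(i + min_separation, n):
            total += 1
            if {sequence[i], sequence[j]} == set(pair):
                matches += 1
                hits += int(cmap[i][j])
    support = hits / total if total else 0.0
    confidence = hits / matches if matches else 0.0
    return support, confidence


if __name__ == "__main__":
    cmap = contact_map(TOY_COORDS)
    for pair in [("C", "C"), ("A", "C"), ("G", "C")]:
        print(pair, rule_stats(TOY_SEQUENCE, cmap, pair))
```

In a real setting the statistics would be gathered over many proteins of known
structure, and rules with high enough support and confidence would then be
applied to the sequence of an unknown protein to predict its contact map.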