We are currently carrying out research and development on the second of our key techniques, the high performance parallel database and parallel file system. The ultimate goal of this research project is to build a research system for parallel data mining. We will carry out research and development on the following key techniques:

(1) New techniques for designing and building Linux clusters. We have built a two-node prototype Linux cluster using the industry-standard OSCAR Linux cluster software package. Adding a parallel file system to the cluster, to further improve its parallel I/O performance, is our most urgent research and development task. We have selected PVFS as our research prototype and want to improve its performance, reliability, and fault tolerance.

(2) High performance, large-scale parallel database. We have found it difficult to find a free GNU parallel database for ordinary users, and commercial parallel databases such as Oracle 9i and IBM DB2 are too expensive to use on a Linux cluster. We want to develop a GNU parallel database system based on MySQL, PVFS, and the MPI message-passing environment. Combining the database with a parallel file system will further improve the I/O performance of the parallel database.

(3) High performance parallel data mining algorithms, especially for association rule mining and clustering. We are using the WEKA package (http://www.cs.waikato.ac.nz/~ml/weka/) as the primary starting point for building a parallel WEKA in Java. Although Java has the advantage of portability, guaranteeing high performance in Java is the key challenge in our research. The ultimate application area of our parallel data mining system is bioinformatics, especially data mining for protein structure prediction.

(4) Proteins fold spontaneously and reproducibly into complex three-dimensional globules when placed in an aqueous solution, and the sequence of amino acids making up a protein appears to completely determine its three-dimensional structure. This self-organization cannot occur by a random conformational search for the lowest energy state, since such a search would take millions of years, whereas proteins fold in milliseconds (this is known as Levinthal's paradox). The challenge that the protein folding problem poses for data mining is how to predict the three-dimensional tertiary structure of a protein given its linear amino acid sequence. Some researchers use a hybrid approach: they predict local structure using a Hidden Markov Model (HMM) and then infer contact rules using association mining. The HMM models the interactions between adjacent short regions of the protein sequence, and so attempts to model the propagation of structure along the sequence. To detect long-range amino acid contacts, they discover rules that predict whether a pair of residues is in contact or not. In the testing phase, one can predict the contact map for an unknown protein, and from the contact map one can recover the 3D shape.
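
To make the notion of a contact map concrete, the following minimal Java sketch computes one from known 3D coordinates. This is the reverse of the prediction task: it shows how contact labels could be derived from a solved structure, under our own assumptions. The class and method names, the toy coordinates, the 8.0 angstrom distance cutoff, and the minimum sequence separation of 3 are all illustrative assumptions, not part of the approach described above.

```java
import java.util.Arrays;

// Illustrative sketch only: the class name, method names and the 8.0 angstrom
// cutoff are assumptions, not taken from the project description.
class ContactMapSketch {

    /** Euclidean distance between two 3D points given as {x, y, z}. */
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /**
     * Builds a symmetric contact map: map[i][j] is true when residues i and j
     * are at least minSeparation positions apart in the sequence and their
     * representative atoms lie within cutoff of each other.
     */
    static boolean[][] contactMap(double[][] coords, double cutoff, int minSeparation) {
        int n = coords.length;
        boolean[][] map = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + minSeparation; j < n; j++) {
                if (distance(coords[i], coords[j]) <= cutoff) {
                    map[i][j] = true;
                    map[j][i] = true;
                }
            }
        }
        return map;
    }

    public static void main(String[] args) {
        // Made-up coordinates (in angstroms) for a five-residue toy chain.
        double[][] coords = {
            {0.0, 0.0, 0.0},
            {3.8, 0.0, 0.0},
            {7.6, 0.0, 0.0},
            {7.6, 3.8, 0.0},
            {3.8, 3.8, 0.0}
        };
        boolean[][] map = contactMap(coords, 8.0, 3);
        for (boolean[] row : map) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The minimum sequence separation excludes neighbouring residues, which are always close in space, so that the map records only the longer-range contacts that the mined rules are meant to predict.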