On-going  Research 

Previous Projects

  • SLAPI on DAWNING3000 and DAWNING4000(2001.9-2003.8):  developping an API for ScaLAPACK on DAWNING3000 and DAWNING4000, which can automatically select the optimal block size and processor grid to provide near optimal for common users.

  • Previous Projects Participated:

  • 1996-2000, Porting of ScaLAPACk on DAWNING 1000A, 2000II and 3000.

  • 1998-2000, International Corporation Project with HITACHI Co. Ltd.: Speedup version ScaLAPACK for HITACHI SR8000

  • 1998-2000, National 863 Plan, Distributed Large Scale Parallel Algorithms and Theory.

  • กก

  • 1996-1998, International Corporation Project with HITACHi Co. Ltd.: Speedup version ScaLAPACK for HITACHI SR2201

  • 1994-1998, Basic Research on Serveral Key Techniques for High Performance Computers, National Climbing Plan.

  • Previous Research

    Previous Research Participated:

  • 1999-2001, National Science Foundation (No. 69883006), Research on Several Techniques of Improving Distributed Parallel Numerical Software Performance.

  • กก

    Previous  Research (Ph.D. Thesis Work at IOS)

        With the more and more complicated architecture design of high performance computing systems, it becomes difficult things for common users to port their software and exploit potential performance on these systems. Thus the design and development of numerical software package became the research focus in the 70s of last 20th century. Up to now, according to the target computing system architectures, these numerical software packages can be classified into three generations. The first generation include Eispack and Linpack packages from 1973 to 1985, which targeted to high performance vector computers; The second generation is the LAPACK package from 1984 to 1992, which targeted to high performance vector computers, shared memory parallel computing systems and high performance computers with deep memory hierarchy; The third generation is the ScaLAPACK package began on 1990, which targeted to distributed memory parallel computing system with message passing. With the maturation of distributed shared memory computing systems, the forth generation numerical software package will become the future research focus.

        However, in the current popular parallel numerical linear algebra package ScaLAPACK, the performance critical problem on the automatic selection of processor grid and block size are still unresolved.  Thus when common user calls the ScaLAPACK subroutines, they are required to determine the performance suitable processor grid shape and block size. This would be a large burden for common users and thus obstacle the wide population and application of ScaLAPACK package.  At the same time, there are no  much changes on the time and space complexity of those classic algorithm adopted in the three generations of numerical software packages, but after modification on the implementation of these algorithms, they can still exploit most performance from its target computing platforms. The reason is the developer of these packages adopted the implementation forms that is suitable to its target platform's memory system characteristics.  The vector form for vector computers, the blocked forms algorithms for computers with memory hierarchy, and the processor grid and two dimensional block cycle data distribution forms, are the demonstration of these considerations. But if we analyzed these different implementation forms under RAM computing model, we can find out that they are not distinguishable under this model. How to distinguish these different implementation forms of one same algorithm, becomes one urgent problem that needs to be resolved. The resolution on this problem will encourage the common user to design their algorithm implementation form that suitable for the memory hierarchy system characteristics from the just beginning of their development stage, thus will save the performance tuning cost a lot.

        In this dissertation, we discussed the optimization techniques and analytical methods on five aspects of ScaLAPACK performance optimization, i.e., automatic selection of optimal processor grid shape, the automatic selection of near--optimal block size, user friendly application programming interface of ScaLAPACK, the memory complexity concept of different implementations of the same algorithm, and a new parallel computing model with memory hierarchy and instruction level parallelism -- DRAM(h,k).

        In order to realize the automatic selection of optimal processor grid shape, we extended the communication point definitions. Based on the analysis of the communication point set in one parallel program and the observation that the change of processor grid shape doesn't affect the load balance in ScaLAPACK, we proposed the minimum degree of communication point set method that can determine the performance suitable processor grid shape for one parallel program.  In this method, the parallel program with different processor grid will have different communication point set and the corresponding communication point weight, thus will have different final cost on communication. Through finding the communication point set with the minimum degree(cost), we can determine the processor grid shape with the best performance.  The analysis on two ScaLAPACK parallel performance timing drivers, LU and LLT factorization, and  corresponding experiments results on HITACHI SR2201, demonstrate that our method can correctly determine the optimal processor grid for specific number of processors, problem size and block size of one parallel program.

        The common method to select the optimal block size is based on the user's experience and large amount of experiments. However, the common user needs method that is simple and time saving, they have no such rich required experiences and are unwilling to pay for the large amount of experimental time.  Since the optimal block size is dependent on the loop characteristics, memory hierarchy and their interactions, the optimal block size selection problem of LAPACK package is still unresolved. The user has to model the detailed behavior of memory access pattern and floating point pipeline in order to accurately find out the changing rules of optimal block size with the problem size. But this requirement is often unrealistic and difficult.  However, this doesn't mean we can't solve the selection of optimal parallel data distribution block size problem.  In the domain of parallel program, the optimal data distribution block size selection becomes the interaction among local acceptable block size range problem, load balance problem, communication optimization problem and memory size constraints, and so on.  Based on the assumption that the local acceptable performance block size range can be determined through small

    experiments, we proposed a uniform framework that can unify all the problems mentioned. In this framework, we determine the possible range of near optimal block size through finding out all the non-equals that satisfy all the requirement of mentioned problems. And then through performing small experiments on small number of processors, we can determine one or more performance critical constraints that must be satisfied. Then after curve fitting on these experimental results and non-equals, we can finally give the theoretical formula for parallel near optimal data distribution block size selection. We successfully found out the theoretical selection formula for parallel LU and QR factorization of ScaLAPACK on SR2201 using our framework.  The average performance difference between predicted and measured optimal block size is 2.98% for LU factorization and 7.73% for QR factorization.  For even larger problem size, the performance difference is even small.

        Using the techniques we proposed in the former two chapters, we can automatilize the selection process of optimal processor grid and near-optimal block size. This will enable us to provide the common user of ScaLAPACK with a more friendly application programming interface. The burden of processor grid and block size selection can be totally hidden from the view of common user. Combined with the capasualization of matrix distribution information, the new API would give user a cleanly interface that is similar with LAPACK. 

       With the ever enlarging gap between processor processing speed and memory access speed, the impaction of memory access latency on performance becomes more and more large. One of current popular and easy way to improve the application performance is to improve and optimize on the memory access behavior, not the research on new algorithm that has less computing steps. Thus in the past decades, the upgradation of three generations numerical software packages are mostly the change of algorithm implementation forms, not the algorithm themselves. However, how to evaluate the memory access behavior of an implementation and quantify its characteristic, becomes one urgent problem needs to be resolved. In this dissertation, we proposed the memory complexity concept that can be used to evaluate the memory access behavior of one implementation based on previous work. We defined one metric named memory complexity degree and summarized its analysis methods. Under our definition, the  memory complexity degree of one implementation is its expected memory access latency of all its data accesses. Combined with the ratio of total number of data accesses and floating point operations, we finally gave the comparable metric Tfp, which is memory access cost per floating point operations. The smaller Tfp value, the higher performance of one implementation. The experimental results on several computing platforms matched with our memory complexity degree analysis results well.

       Based on the memory complexity concept, and with the observation on the recent development of parallel computing models, we proposed a new parallel computing model --- distributed RAM model with memory hierarchy[h] and instruction level parallelism[k](DRAM(h,k)). The unique features of this model are its two enhancement over traditional sequential computing model RAM: one changed the assumption of unit memory access cost to cost function that dependents on the data distribution status in memory hierarchy, the accessing order with other data, and the memory access pattern of the program; another changed the one instruction per cycle assumption  to  k instructions per cycle, in order to reflect the fact that most of popular processors have the superscalar function, and the exploitation on instruction level parallelism also greatly affects the final performance and sometime can be used to hide the memory access latency.  Although we didn't use this assumption on ILP in our current analysis and experiments since we have no matural and easy analysis framework. We still incorporate this feature in our proposed parallel computing model. In our parallel computing model, we look the remote memory systems of other processors as another level memory hierarchy of local processor and assign different cost for long and short message passing in the wish to encourage long message passing. Each node in our model is a processor of RAM with local memory hierarchy and ILP function and they are connected through interconnection at the far side memory hierarchy away from processor. These processors exchanged information and synchronization through message passing. The analytical results under our parallel model is thus the parallel memory complexity degree of one implementation of an algorithm. We performed analysis using our parallel model on eight forms of parallel LU factorization and four forms of lower triangular systems solvers. In most of the case we analyzed , The experimental validation on HITACHI SR2201 and PIII500 Linux PC Cluster matched with the analytical results well.