TY - GEN
T1 - Experiences with parallel N-body simulation
AU - Liu, Pangfeng
AU - Bhatt, Sandeep N.
N1 - Publisher Copyright:
© 1994 ACM.
PY - 1994/8/1
Y1 - 1994/8/1
N2 - This paper describes our experiences developing high- performance code for astrophysical iV-body simulations. Recent N-body methods are based on an adaptive tree structure. The tree must be built and maintained across physically distributed memory; moreover, the communication requirements are irregular and adaptive. Together with the need to balance the computational work-load among processors, these issues pose interesting challenges and tradeoffs for high-performance implementation. Our implementation was guided by the need to keep solutions simple and general. We use a technique for implicitly representing a dynamic global tree across multiple processors which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures. The contributions include methods to vectorize the computation and minimize communication time which are theoretically and experimentally justified. The code has been tested by varying the number and distribution of bodies on different configurations of the Connection Machine CM-5. The overall performance on instances with 10 million bodies is typically over 30% of the peak machine rate. Preliminary timings compare favorably with other approaches.
AB - This paper describes our experiences developing high- performance code for astrophysical iV-body simulations. Recent N-body methods are based on an adaptive tree structure. The tree must be built and maintained across physically distributed memory; moreover, the communication requirements are irregular and adaptive. Together with the need to balance the computational work-load among processors, these issues pose interesting challenges and tradeoffs for high-performance implementation. Our implementation was guided by the need to keep solutions simple and general. We use a technique for implicitly representing a dynamic global tree across multiple processors which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures. The contributions include methods to vectorize the computation and minimize communication time which are theoretically and experimentally justified. The code has been tested by varying the number and distribution of bodies on different configurations of the Connection Machine CM-5. The overall performance on instances with 10 million bodies is typically over 30% of the peak machine rate. Preliminary timings compare favorably with other approaches.
UR - http://www.scopus.com/inward/record.url?scp=84947725835&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84947725835&partnerID=8YFLogxK
U2 - 10.1145/181014.181081
DO - 10.1145/181014.181081
M3 - Conference contribution
AN - SCOPUS:84947725835
T3 - Proceedings of the 6th Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA 1994
SP - 122
EP - 131
BT - Proceedings of the 6th Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA 1994
T2 - 6th Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA 1994
Y2 - 27 June 1994 through 29 June 1994
ER -