The Application Kernel Performance Monitoring Module of XDMoD is designed to measure quality of service (QoS) and to preemptively identify underperforming hardware and software. It does so by deploying customized, computationally lightweight “application kernels” that run on a regular basis (daily to several times per week) to continuously monitor HPC system performance and reliability from the application user's point of view. The term “computationally lightweight” indicates that an application kernel requires relatively modest resources for a given run frequency. As a result, system managers can use XDMoD to monitor system performance proactively rather than relying on users to report failures or underperforming hardware and software.
XDMoD's application kernel performance monitoring consists of two parts: the application kernel remote runner (AKRR) and the XDMoD appkernel module (xdmod-appkernels).
The application kernel remote runner (AKRR) executes the scheduled jobs, monitors their execution, processes the output, extracts performance metrics, and exports the results to the database.
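To make the "processes the output, extracts performance metrics" step more concrete, here is a minimal Python sketch. It is not AKRR's actual parser: the output format, the `extract_metrics` helper, and the metric names are all hypothetical and were chosen only for illustration.

```python
import re

# Hypothetical job output; real application kernels each produce their own format.
SAMPLE_OUTPUT = """\
Wall Clock Time: 125.4 s
Memory Bandwidth: 9870.5 MB/s
Run completed successfully
"""

# Matches lines of the assumed form "Metric Name: <number> <optional unit>".
METRIC_LINE = re.compile(
    r"^(?P<name>[A-Za-z ]+):\s*(?P<value>[-+]?\d+(?:\.\d+)?)\s*(?P<unit>\S*)$"
)


def extract_metrics(output):
    """Return {metric name: (value, unit)} for every line that looks like a metric."""
    metrics = {}
    for line in output.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            name = match.group("name").strip()
            metrics[name] = (float(match.group("value")), match.group("unit"))
    return metrics


metrics = extract_metrics(SAMPLE_OUTPUT)
for name, (value, unit) in metrics.items():
    print(f"{name}: {value} {unit}")
```

In this sketch, lines that do not match the assumed pattern (such as the final status line) are simply ignored, so only numeric metrics would be passed on for export.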
The XDMoD appkernel module analyzes the results of application kernel runs and provides visualization tools and a web-based interface for controlling AKRR. Its analysis tools include an automatic anomaly detector that analyzes the performance of all application kernels executed on a resource and automatically recognizes poorly performing systems.
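The general idea of flagging poorly performing runs can be illustrated with a simple baseline comparison. The sketch below is not the detector that xdmod-appkernels implements. It assumes a hypothetical list of wall-clock times and flags any recent run that is more than a chosen number of baseline standard deviations slower than the baseline mean. The function name, the threshold, and the sample data are all illustrative assumptions.

```python
from statistics import mean, stdev


def find_underperforming_runs(baseline, recent, threshold=3.0):
    """Return indices of recent runs that are much slower than the baseline.

    A run is flagged when its wall time exceeds the baseline mean by more
    than `threshold` baseline standard deviations (an assumed rule, not
    XDMoD's actual algorithm).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Degenerate baseline: flag anything slower than the constant value.
        return [i for i, t in enumerate(recent) if t > mu]
    return [i for i, t in enumerate(recent) if (t - mu) / sigma > threshold]


# Hypothetical wall-clock times in seconds for one application kernel.
baseline_wall_times = [100.2, 99.8, 101.1, 100.5, 99.9, 100.7]
recent_wall_times = [100.4, 118.3, 100.1]

flagged = find_underperforming_runs(baseline_wall_times, recent_wall_times)
print("Flagged run indices:", flagged)
```

With these sample numbers only the second recent run stands out, because it is roughly 18 seconds slower than a baseline that varies by well under one second.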
XDMoD should be installed first, then AKRR, and then xdmod-appkernels. In addition, before installing the xdmod-appkernels module, add your HPC resource to AKRR and run a few appkernel jobs; this helps ensure that xdmod-appkernels is working properly.
Simakov, N. A., White, J. P., DeLeon, R. L., Ghadersohi, A., Furlani, T. R., Jones, M. D., Gallo, S. M., and Patra, A. K. (2015). “Application kernels: HPC resources performance monitoring and variance analysis.” Concurrency Computat.: Pract. Exper., 27: 5238–5260. doi: 10.1002/cpe.3564.
Nikolay A. Simakov, Robert L. DeLeon, Joseph P. White, Thomas R. Furlani, Martins Innus, Steven M. Gallo, Matthew D. Jones, Abani Patra, Benjamin D. Plessinger, Jeanette Sperhac, Thomas Yearke, Ryan Rathsam, and Jeffrey T. Palmer. 2016. “A Quantitative Analysis of Node Sharing on HPC Clusters Using XDMoD Application Kernels.” In Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale (XSEDE16). Association for Computing Machinery, New York, NY, USA, Article 32, 1–8. DOI: 10.1145/2949550.2949553
Nikolay Simakov, Martins D. Innus, Matthew D. Jones, Joseph P. White, Steven M. Gallo, Robert L. DeLeon, and Thomas R. Furlani. 2018. “Effect of Meltdown and Spectre Patches on the Performance of HPC Applications.” HPC Systems Professionals Workshop (HPCSYSPROS18) at SC18. arXiv: 1801.04329