Sparse Grid Regression for Performance Prediction Using High-Dimensional Run Time Data
We employ sparse grid regression to predict the run time in three types of numerical simulation: molecular dynamics (MD), weather and climate simulation. The impact of algorithmic, OpenMP/MPI and hardware-aware optimization parameters on performance is studied. We show that normalization of run time data via algorithmic complexity arguments significantly improves prediction accuracy. Mean relative prediction errors are in the range of a few percent; in MD, a five-dimensional parameter space exploration results in mean relative prediction errors of ca. 20% using ca. 178 run time samples.
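The normalization idea can be illustrated with a minimal sketch (not taken from the paper: the O(N log N) cost factor, function names, and sample values below are assumptions). Dividing each measured run time by the dominant complexity term leaves the regression model only the slowly varying prefactor to learn:

```python
import math

# Hypothetical sketch: normalize MD run times by an assumed O(N log N)
# force-computation cost before feeding them to the regression model.
# The samples below are made up for illustration.
def normalize_runtime(runtime_s, n_particles):
    """Divide measured run time by the dominant complexity term so the
    regression only needs to capture the remaining prefactor."""
    return runtime_s / (n_particles * math.log(n_particles))

samples = [(1000, 0.85), (8000, 8.2), (64000, 79.0)]  # (N, run time in s)
normalized = [normalize_runtime(t, n) for n, t in samples]
# The normalized values vary far less than the raw run times, which is
# what makes the subsequent regression problem easier.
```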
MPCDF HPC Performance Monitoring System: Enabling Insight via Job-Specific Analysis
Luka Stanisic and Klaus Reuter
This paper reports on the design and implementation of the HPC performance monitoring system deployed to continuously monitor performance metrics of all jobs on HPC systems at the Max Planck Computing and Data Facility (MPCDF). It thereby reveals important information to various stakeholders, in particular to users, application support, system administrators, and management. On each compute node, hardware and software performance monitoring data is collected by our newly developed lightweight open-source hpcmd middleware, which builds upon standard Linux tools. The data is transported via rsyslog, then aggregated and processed by a Splunk system, enabling detailed per-cluster and per-job interactive analysis in a web browser. Additionally, performance reports are provided to the users as PDF files. Finally, we report on practical experience and benefits from large-scale deployments on MPCDF HPC systems, demonstrating how our solution can be useful to any HPC center.
Towards a Predictive Energy Model for HPC Runtime Systems Using Supervised Learning
Gence Ozer, Sarthak Garg, Neda Davoudi, Gabrielle Poerwawinata, Matthias Maiterth, Alessio Netti, and Daniele Tafani
High-Performance Computing systems collect vast amounts of operational data through monitoring frameworks, often augmented with additional information from schedulers and runtime systems. This data can be turned into an operational benefit, rather than serving merely as a pool for post-mortem analysis. This work focuses on deriving a model with supervised learning which enables optimal selection of CPU frequency during the execution of a job, with the objective of minimizing the energy consumption of an HPC system. Our model is trained using sensor data and performance metrics collected with two distinct open-source frameworks for monitoring and runtime optimization. Our results show good prediction of CPU power draw and the number of instructions retired under realistic dynamic runtime settings, within a relatively low error margin.
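The general shape of such a frequency-selection model can be sketched as follows. This is illustrative only, not the paper's actual model: the linear power fit, the naive 1/f runtime assumption, and all numbers are made up.

```python
# Illustrative-only sketch: fit CPU power draw versus frequency with
# ordinary least squares, combine it with a crude time ~ 1/f runtime
# model, and pick the candidate frequency minimizing predicted energy.
def fit_line(xs, ys):
    """Closed-form least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

freqs = [1.2, 1.6, 2.0, 2.4]        # candidate CPU frequencies (GHz)
power = [45.0, 60.0, 78.0, 98.0]    # package power draw (W, synthetic)
a, b = fit_line(freqs, power)

def predicted_energy(f, instructions=1e9, rate_hz_per_ghz=1e9):
    runtime_s = instructions / (f * rate_hz_per_ghz)  # naive: time ~ 1/f
    return (a * f + b) * runtime_s                    # joules = W * s

best_freq = min(freqs, key=predicted_energy)
```

In a real runtime system, the instruction count and rate would themselves come from predicted performance metrics (such as instructions retired), which is where the supervised model enters.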
Resource Aware Scheduling for EDA Regression Jobs
Saurav Nanda, Ganapathy Parthasarathy, Parivesh Choudhary, and Arun Venkatachar
Typical Integrated Circuit (IC) design projects use Electronic Design Automation (EDA) tool flows to launch thousands of regressions every day on shared compute grids to complete the IC design verification process. These regressions in turn launch compute jobs with varied resource requirements and inter-job dependency constraints. Traditional grid schedulers, such as the Univa Grid Engine (UGE), prioritize fairness over performance to maximize the number of jobs run with an equal distribution of resources at any time. A constant challenge in day-to-day operations is to schedule these jobs for minimum overall job completion time, so that developers can expect predictable regression turn-around time (TAT).
We propose a resource-aware scheduling mechanism that balances performance and fairness for real-world EDA-centric workloads. We present an analysis of historical profile information from a set of regressions with complex inter-job dependencies and highly variable resource requirements to show that many of these regression jobs are well suited for efficient packing on grid machines.
We formulate the regression scheduling problem as a variant of the bin packing problem, where the sizes of bins and balls may vary according to job-resource requirements and differing server configurations on the grid. We propose using two analytic techniques, namely K-means clustering and adaptive histogram-based binning, to solve this problem. We then evaluate the performance of our proposed solution using real workloads from daily regressions on an enterprise compute grid.
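The bin-packing step can be illustrated with a minimal sketch. This uses the classic first-fit-decreasing heuristic on a single resource dimension with made-up job sizes; the paper's actual approach (K-means clustering and adaptive histogram-based binning over multiple resources and server configurations) is more involved.

```python
# Hedged sketch of the packing step: first-fit decreasing over a single
# resource (memory, GB) onto servers of a fixed capacity. Job sizes and
# the server capacity are illustrative, not from the paper's workloads.
def first_fit_decreasing(jobs_gb, server_gb):
    """Pack jobs onto the fewest servers a greedy FFD pass finds."""
    servers = []  # remaining free capacity of each opened server
    for job in sorted(jobs_gb, reverse=True):
        for i, free in enumerate(servers):
            if free >= job:
                servers[i] = free - job   # fits on an existing server
                break
        else:
            servers.append(server_gb - job)  # open a new server
    return len(servers)

jobs = [30, 10, 20, 50, 25, 15, 40, 10]   # per-job memory demand (GB)
n_servers = first_fit_decreasing(jobs, 100)
# The jobs total 200 GB, so 2 servers of 100 GB is optimal here.
```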