MATOG Auto-Tuning


MATOG Auto-Tuning on GPUs is a tool that automatically optimizes the performance of NVIDIA CUDA code. The aim is to optimize arbitrary CUDA applications with as few code adaptations and limitations as possible. MATOG is written in C++, is platform independent and requires only a few external tools.

The most recent version 5.0 (ACM TACO release) is fully functional and can be used to optimize nearly any CUDA application, independent of its application domain. In our evaluation we achieved significant speed-ups over hand-optimized code, with only a couple of minutes required to gather and analyze the necessary profiling data.

If you use MATOG in your publication, please cite our most recent paper.


Features

  • Optimization of arrays, arrays of structs, multi-dimensional arrays and multi-dimensional arrays of structs
  • Layouts: AoS, SoA, AoSoA (see the sketch after this list)
  • Transposition of multi-dimensional arrays
  • Memory placement: global vs. texture memory
  • Adjustment of the L1 cache size
  • User-defined optimizations
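
To make the layout options concrete, the following is a minimal sketch of the same three-component data stored as AoS, SoA and AoSoA. It is an illustration only; the struct and array names are made up and are not MATOG code:

    // The same N elements with components x, y, z in three layouts.
    constexpr int N = 1024;

    // Array of Structs (AoS): all components of one element are adjacent.
    struct ElemAoS { float x, y, z; };
    ElemAoS aos[N];                       // access: aos[i].x

    // Struct of Arrays (SoA): each component is stored contiguously, which
    // typically yields coalesced access when thread i reads element i.
    struct ElemSoA { float x[N], y[N], z[N]; };
    ElemSoA soa;                          // access: soa.x[i]

    // Array of Structs of Arrays (AoSoA): fixed-size SoA tiles (here 32
    // elements), combining SoA-style coalescing with AoS-style locality.
    struct Tile { float x[32], y[32], z[32]; };
    Tile aosoa[(N + 31) / 32];            // access: aosoa[i / 32].x[i % 32]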

External Dependencies

  • CUDA 6.0
  • GCC 4.8 (Linux) / Visual Studio 2012 (Windows)
  • OpenCV 3.0
  • Intel TBB 4.4
  • SQLite 3 (included in source code)
  • CTemplate (included in source code)
  • JsonCPP (included in source code)

Be advised: MATOG is still under development and is constantly being improved. Please inform us if you encounter any problems. (E-Mail: matog [at]


Publications

MATOG: Array Layout Auto-Tuning for CUDA

Nicolas Weber and Michael Goesele in ACM Transactions on Architecture and Code Optimization (ACM TACO), 2017

Optimal code performance is (besides correctness and accuracy) the most important objective in compute-intensive applications. In many of these applications, Graphics Processing Units (GPUs) are used because of their high compute power. However, due to their massively parallel architecture, the code has to be specifically adjusted to the underlying hardware to achieve optimal performance, and therefore has to be re-optimized for each new generation. In reality this is usually not done, as production code is normally at least several years old and nobody has the time to continuously adjust existing code to new hardware. In recent years, more and more approaches have emerged that automatically tune the performance of applications towards the underlying hardware. In this paper we present the MATOG auto-tuner and its concepts. It abstracts the array memory access in CUDA applications and automatically optimizes the code for the GPUs used. MATOG requires only a few profiling runs to analyze even complex applications, while achieving significant speed-ups over non-optimized code, independent of the GPU generation used and without the need to manually tune the code.

[Paper] [Evaluation Parameters] [BibTex]
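
The abstraction idea can be sketched as follows. This is a hypothetical illustration of the general principle, not MATOG's actual generated code: the kernel always addresses element i, component c, through one index function, and switching the layout only exchanges that function:

    #include <cstddef>

    // Hypothetical layout abstraction: the access pattern in the kernel stays
    // fixed; only the mapping from (element, component) to linear memory changes.
    enum Layout { AOS, SOA };

    template <Layout L>
    __device__ __forceinline__ size_t idx(size_t i, size_t c, size_t comps, size_t n) {
        return (L == AOS) ? (i * comps + c)   // AoS: components of element i adjacent
                          : (c * n + i);      // SoA: component c is one contiguous array
    }

    // The kernel is written once against the abstraction; the auto-tuner can
    // then select the layout parameter per GPU and per kernel.
    template <Layout L>
    __global__ void scale(float* data, size_t n, float s) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[idx<L>(i, 0, 3, n)] *= s;  // component 0 of element i
    }

    // Host side: launch the variant the tuner selected, e.g.
    // scale<SOA><<<(n + 255) / 256, 256>>>(d_data, n, 1.5f);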

Adaptive GPU Array Layout Auto-Tuning


Nicolas Weber and Michael Goesele in Proceedings of Software Engineering Methods for Parallel and High Performance Applications (SEM4HPC), Kyoto, Japan, 2016

Optimal performance is an important goal in compute-intensive applications. For GPU applications, achieving it requires a lot of experience and knowledge about the algorithms and the underlying hardware, making them an ideal target for auto-tuning approaches. We present an auto-tuner which optimizes array layouts in CUDA applications. Depending on the data and program parameters, kernels can have varying optimal configurations. We thus adjust array layouts adaptively at runtime and achieve or even exceed the performance of hand-optimized code. We automatically detect data characteristics to identify different performance scenarios, without user input or additional programming. We perform an empirical analysis of the application in order to construct our decision models. Our adaptive optimization in principle requires profiling data for an extremely high number of scenarios, which cannot be exhaustively evaluated for complex applications. We solve this by extending a previously published method that is able to efficiently profile single kernel calls, enhancing it to find application-wide optimal solutions. Our method is able to optimize applications in a few minutes, reaching speed-ups of up to 20% compared to hand-optimized code.

[Paper] [BibTex]
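
The adaptive part, i.e., that different data can favor different configurations so the choice must happen at run time, can be sketched like this. The decision function and its threshold are purely hypothetical (the paper builds such decision models from empirical profiling data), and the sketch reuses the templated scale kernel from the sketch above:

    // Hypothetical run-time decision: map observed data characteristics
    // (here only the problem size) to the best profiled configuration.
    Layout decide(size_t n) {
        // The threshold is an assumption for illustration; a learned decision
        // model would be derived from the profiling measurements.
        return (n < (1 << 20)) ? AOS : SOA;
    }

    void run(float* d_data, size_t n) {
        // Both variants are pre-compiled; only the dispatch happens at run time.
        if (decide(n) == AOS)
            scale<AOS><<<(n + 255) / 256, 256>>>(d_data, n, 1.5f);
        else
            scale<SOA><<<(n + 255) / 256, 256>>>(d_data, n, 1.5f);
    }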

Guided Profiling for Auto-Tuning Array Layouts on GPUs

Nicolas Weber, Sandra C. Amend and Michael Goesele in Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS15), Austin, Texas, USA, 2015

Auto-tuning for Graphics Processing Units (GPUs) has become very popular in recent years. It removes the necessity to hand-tune GPU code, especially when a new hardware architecture is released. Our auto-tuner optimizes memory access patterns, a key aspect of exploiting the full performance of modern GPUs. As the memory hierarchy has historically changed in nearly every GPU generation, it has been necessary to re-optimize the code for each of these new architectures. Unfortunately, the solution space for memory optimizations in large applications can easily reach millions of configurations for a single kernel. This vast number of implementations cannot be fully evaluated in a feasible time. In this paper we present an adaptive profiling algorithm that aims at finding a near-optimal configuration, within a fraction of the global optimum, while reducing the profiling time by several orders of magnitude compared to an exhaustive search. Our algorithm is aimed at and evaluated on large real-world applications.

[Paper] [BibTex]
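
To see why guided profiling is needed and what it buys: with d independent optimization dimensions of k choices each, the full cross product has k^d configurations, while a one-dimension-at-a-time search needs only about d*k measurements. The following generic hill-climbing sketch illustrates that principle; it is not the algorithm from the paper, and time_kernel is an assumed helper that runs and times one configuration:

    #include <vector>

    // A configuration assigns one choice per optimization dimension
    // (e.g. layout, transposition and memory placement for each array).
    using Config = std::vector<int>;

    // Assumed helper: runs the kernel once with the given configuration
    // and returns its measured execution time.
    double time_kernel(const Config& c);

    // Generic greedy search: optimize one dimension at a time. With d
    // dimensions of k choices each this takes O(d * k) measurements
    // instead of the O(k^d) an exhaustive search would need.
    Config greedy_search(const std::vector<int>& choices_per_dim) {
        Config best(choices_per_dim.size(), 0);
        double best_time = time_kernel(best);
        for (size_t d = 0; d < choices_per_dim.size(); ++d) {
            for (int c = 1; c < choices_per_dim[d]; ++c) {
                Config trial = best;
                trial[d] = c;                    // vary only dimension d
                double t = time_kernel(trial);
                if (t < best_time) { best_time = t; best = trial; }
            }
        }
        return best;    // near-optimal, not guaranteed to be the global optimum
    }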

Auto-Tuning Complex Array Layouts for GPUs

Nicolas Weber and Michael Goesele in Proceedings of Eurographics Symposium on Parallel Graphics and Visualization (EGPGV14), Swansea, Wales, UK, 2014

The continuing evolution of Graphics Processing Units (GPUs) has shown rapid performance increases over the years. But with each new hardware generation, the constraints for programming them efficiently have changed. Programs have to be tuned towards one specific piece of hardware to unleash its full potential. This is time consuming and costly, as vendors tend to release a new generation every 18 months. It is therefore important to auto-tune GPU code to achieve GPU-specific improvements, using either static or empirical profiling to adjust parameters or to change the kernel implementation. We introduce a new approach to automatically improve memory access on GPUs. Our system generates an application-specific library which abstracts the memory access for complex arrays on the host and GPU side. This allows the code to be optimized by exchanging the memory layout without recompiling the application, as all necessary layouts are pre-compiled into the library. Our implementation is able to speed up real-world applications by up to an order of magnitude and even outperforms hand-tuned implementations.

[Paper] [Supplemental Material] [BibTex]
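
The key property, exchanging the memory layout without recompiling the application, can be sketched with a table of pre-compiled kernel variants. This is a minimal illustration of the concept, reusing the hypothetical scale kernel from above; the libraries MATOG generates are of course more involved:

    // All layout variants are compiled into the library ahead of time; at run
    // time the application merely selects an entry, so no recompilation occurs.
    using Launcher = void (*)(float*, size_t, float);

    static void launch_aos(float* d, size_t n, float s) {
        scale<AOS><<<(n + 255) / 256, 256>>>(d, n, s);
    }
    static void launch_soa(float* d, size_t n, float s) {
        scale<SOA><<<(n + 255) / 256, 256>>>(d, n, s);
    }

    // Variant table indexed by the configuration the auto-tuner selected.
    static const Launcher variants[] = { launch_aos, launch_soa };

    void run_selected(int cfg, float* d, size_t n, float s) {
        variants[cfg](d, n, s);   // dispatch without recompiling the application
    }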

Source Code

Version   Date                 Description          File
5.0       June 8th, 2017       ACM TACO'17 release  ZIP
4.0       April 4th, 2016      SEM4HPC'16 release   ZIP
3.0       November 10th, 2015  PMBS'15 release      ZIP
1.1       April 14th, 2014     EGPGV'14 release     ZIP

Please refer to the documentation inside the source code archive for instructions.


The code on this website is distributed under both the New BSD License and a commercial license. If you want to license the code for commercial purposes, for example to incorporate parts of it into a proprietary project or to link against our libraries without the restrictions imposed by the New BSD License, please get in contact with us. For further information, please see the LICENSE.txt file contained in the code distribution.

The code on this website is free software; you can redistribute it and/or modify it under the terms of the New BSD License. A copy of the New BSD License is included in the code distribution. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.