NeoFlow: A Flexible Framework for Enabling Efficient Compilation for High Performance DNN Training

Abstract

Deep neural networks (DNNs) are increasingly deployed in a wide range of image recognition and natural language processing applications. The continuous demand for accuracy and high performance has led to innovations in DNN design and a proliferation of new operators. However, existing DNN training frameworks such as PyTorch and TensorFlow support only a limited range of operators and rely on hand-optimized libraries for their efficient implementations. To evaluate novel neural networks with new operators, programmers have to either emulate the new operators with compositions of existing operators or provide low-level implementations manually. Therefore, a critical requirement for DNN training frameworks is to automatically provide high-performance implementations for neural networks that contain new operators, even in the absence of efficient library support. In this article, we introduce NeoFlow, a flexible framework that enables efficient compilation for high-performance DNN training. NeoFlow allows programmers to write customized expressions directly as new operators, which are mapped to a graph representation and low-level implementations automatically, providing both high programming productivity and high performance. First, NeoFlow provides expression-based automatic differentiation to support customized model definitions with new operators. Then, NeoFlow proposes an efficient compilation system that partitions the neural network graph into subgraphs, explores optimized schedules, and generates high-performance libraries for the subgraphs automatically. Finally, NeoFlow develops an efficient runtime system that combines compilation and training by overlapping their execution. In the experiments, we examine the numerical accuracy and performance of NeoFlow. The results show that NeoFlow achieves similar or even better performance at both the operator and whole-graph levels compared to existing deep learning frameworks. In particular, for training novel networks, the geometric mean speedups of NeoFlow over PyTorch, TensorFlow, and cuDNN are 3.16X, 2.43X, and 1.92X, respectively.
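To make the workflow concrete, the sketch below uses JAX (not NeoFlow's actual API) to show the idea the abstract describes: a new operator is written directly as a tensor expression, and its gradient and a compiled implementation are derived automatically rather than hand-written. The fused operator `fused_gelu_linear` is a hypothetical example introduced here for illustration only.

```python
# Minimal sketch, assuming a JAX-style expression workflow (not NeoFlow itself).
import jax
import jax.numpy as jnp

def fused_gelu_linear(x, w):
    # Hypothetical "new operator": a linear layer fused with a tanh-based GELU,
    # written as a plain tensor expression instead of calling a library kernel.
    y = x @ w
    return 0.5 * y * (1.0 + jnp.tanh(0.79788456 * (y + 0.044715 * y ** 3)))

x = jnp.ones((4, 8))
w = jnp.full((8, 16), 0.1)
loss = lambda w: jnp.sum(fused_gelu_linear(x, w))

grad_w = jax.grad(loss)(w)   # gradient derived from the expression automatically
fast_loss = jax.jit(loss)    # compiled implementation, no hand-optimized library needed
print(grad_w.shape, fast_loss(w))
```

NeoFlow additionally partitions the resulting graph into subgraphs and searches optimized schedules for them while training proceeds, which the sketch above does not attempt to reproduce.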

Publication
In IEEE Transactions on Parallel and Distributed Systems
Bingyang Wu