Optimally Scheduling CNN Convolutions for Efficient Memory Access

Introduction

Problem

It is difficult to simultaneously buffer the input feature maps, the output feature maps, and the filter weights in the limited internal memory of the accelerator.

Previous method

Use large SRAM buffers (up to a few MBytes). Drawback: such large amounts of on-chip memory are not affordable.

Another direction:

  • data reuse, which is determined by the dataflow schedule of the computation

    • Previous work: ensure that the working set of the innermost loops fits the available internal storage

      Drawback: existing models have not been adapted to explicitly application-managed buffers, which are by far the most common memory-architecture template for CNN accelerators; as a result, they overestimate the internal storage requirements.

Contribution of this work:

  • a new analytical memory performance model for evaluating dataflow schedules
  • a method for finding the best dataflow schedules under a given buffer budget
  • a case study: the design of a flexible CNN accelerator for deeply embedded Systems-on-Chip

Background

Convolution loop-nest

// M output fmaps loop
LOF: for (m = 0; m < M; m++)
// C input fmaps loop
LIF: for (c = 0; c < C; c++)
// spatial loops (ExE)
LSY: for (y = 0; y < E; y++)
LSX: for (x = 0; x < E; x++)
// filter loops (RxR, stride S)
LFY: for (k = 0; k < R; k++)
LFX: for (l = 0; l < R; l++)
{
p = I[c][y*S+k][x*S+l];
w = W[m][c][k][l];
O[m][y][x] += p*w;
}

Goal

The problem is to minimize the number of accesses to off-accelerator memory, given a limited local buffer capacity.

Reuse in convolution

  • Input feature maps
  • Output feature maps
  • Weights

The reuse of a reference is carried by a loop if the same memory location is used by different iterations of that loop.

However, unless the entire working set fits the local buffer, this reuse cannot be taken full advantage of.

Reuse buffer

  • Data cache

  • Application-managed scratchpad memories

    the local reuse buffer is partitioned into a set of application-managed buffers (one per array)

Data Locality Optimization

  • loop reordering

  • loop tiling

    // output fmaps -- loop on tiles
    LTOF: for (mm = 0; mm < M; mm += mss)
    // input fmaps -- loop on tiles
    LTIF: for (cc = 0; cc < C; cc += css)
    // spatial -- loops on tiles
    LTSY: for (yy = 0; yy < E; yy += iss)
    LTSX: for (xx = 0; xx < E; xx += jss)
    // output fmaps -- tile loop
    LOF: for (m=mm; m<min(mm+mss,M); m++)
    // input fmaps -- tile loop
    LIF: for (c=cc; c<min(cc+css,C); c++)
    // spatial -- tile loops
    LSY: for (y=yy; y<min(yy+iss,E); y++)
    LSX: for (x=xx; x<min(xx+jss,E); x++)
    // kernel loops (not tiled)
    LFY: for (k=0; k<R; k++)
    LFX: for (l=0; l<R; l++)
    {
    p = I[c][y*S+k][x*S+l];
    w = W[m][c][k][l];
    O[m][y][x] += p*w;
    }

Method

Analytical model for evaluating the dataflow schedules

Distance $d_X(L)$

The number of iterations of loop L that separate two consecutive accesses to the same data element of X.

(i.e., the element touched by one access is not touched again until d iterations of L later)

footprint $F_X(L_i)$

The footprint of array X in loop $L_i$ measures the number of distinct elements of X used inside $L_i$:

$$F_X(L_i) = \prod_{\substack{0 \le j \le i \\ L_j \text{ does not carry reuse of } X}} n(L_j)$$

n(L): the number of iterations of loop L; $L_0$ denotes the innermost loop.

The actual required size of the application-managed buffer is computed as follows (the factor 2 accounts for double buffering, so DMA transfers of the next tile can overlap computation on the current one):

$$B_X(L_i) = 2 \cdot F_X(L_i)$$

memory traffic $T_X$

The number of bytes accessed in off-accelerator memory for array X:

$$T_X = P_X \cdot F_X(L_i) \cdot \prod_{j > i} n(L_j)$$

where $P_X$ is the numerical precision (in bytes) used for storing the elements of X.

Note that the level i is chosen as the maximal one for which $B(L_i)$ still fits the available buffer size.