difficult to simultaneously buffer the input feature maps, the output feature maps, and the filter weights in limited internal accelerator memory.
Previous method:
use large SRAM buffers (up to a few MBytes). Drawback: such large amounts of on-chip memory are not affordable.
Another direction:
exploit data reuse, which is determined by the dataflow schedule of the computation.
Previous work: choose the schedule so that the working set of the innermost loops fits the available internal storage.
Drawback: existing models have not been adapted to explicitly application-managed buffers, which constitute by far the most common memory-architecture template for CNN accelerators, and therefore overestimate internal storage requirements.
Contribution of this work:
a new analytical memory performance model to evaluate dataflow schedules
selection of the best dataflow schedules based on this model
application to the case study of the design of a flexible CNN accelerator for deeply embedded Systems-on-Chip
Background
Convolution loop-nest
// M output fmaps loop
LOF: for (m = 0; m < M; m++)
  // C input fmaps loop
  LIF: for (c = 0; c < C; c++)
    // spatial loops (ExE)
    LSY: for (y = 0; y < E; y++)
    LSX: for (x = 0; x < E; x++)
      // filter loops (RxR, stride S)
      LFY: for (k = 0; k < R; k++)
      LFX: for (l = 0; l < R; l++) {
        p = I[c][y*S+k][x*S+l];
        w = W[m][c][k][l];
        O[m][y][x] += p*w;
      }
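The data reuse that can actually be exploited depends on the dataflow schedule, i.e., on how this loop nest is reordered and tiled. Below is a minimal, purely illustrative sketch of one possible schedule (not the schedule selected in this work): the output-fmap and spatial loops are tiled with hypothetical tile sizes TM and TE, assumed to divide M and E, so that one output tile, the corresponding input window, and TM filters form the working set of the inner loops.

    // Illustrative tiled schedule (hypothetical tile sizes TM, TE).
    #define M  64   // output feature maps (example value)
    #define C  32   // input feature maps  (example value)
    #define E  28   // output width/height (example value)
    #define R  3    // filter width/height (example value)
    #define S  1    // stride              (example value)
    #define TM 8    // output-fmap tile    (hypothetical)
    #define TE 7    // spatial tile        (hypothetical)

    static float I[C][E*S + R][E*S + R];   // input fmaps (padded)
    static float W[M][C][R][R];            // filter weights
    static float O[M][E][E];               // output fmaps (zero-initialized)

    void conv_tiled(void)
    {
        for (int m0 = 0; m0 < M; m0 += TM)      // tile loops: each iteration
        for (int y0 = 0; y0 < E; y0 += TE)      // defines the working set that
        for (int x0 = 0; x0 < E; x0 += TE)      // should fit the local buffer
            for (int c = 0; c < C; c++)
            for (int m = m0; m < m0 + TM; m++)
            for (int y = y0; y < y0 + TE; y++)
            for (int x = x0; x < x0 + TE; x++)
            for (int k = 0; k < R; k++)
            for (int l = 0; l < R; l++)
                O[m][y][x] += I[c][y*S + k][x*S + l] * W[m][c][k][l];
    }

With this particular loop order, the TM x TE x TE tile of partial sums stays in the local buffer across the whole c loop, while input and weight tiles are streamed in once per c iteration; a different ordering or different tile sizes changes which operand is kept resident and hence the external traffic.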
Goal
The problem is to minimize the number of accesses to off-accelerator memory, given a limited local buffer capacity.
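As a concrete illustration (not the analytical model proposed in this work), a naive first-order count of off-accelerator accesses for the tiled schedule sketched above can be written down directly, under the assumptions that the working set of one (m0, y0, x0) tile fits the local buffer, that nothing is kept across tiles, and that M and E are multiples of the hypothetical tile sizes TM and TE:

    #include <stdio.h>

    // Naive first-order count of off-accelerator accesses for the tiled
    // schedule sketched in the Background section (NOT the model proposed
    // in this work; all layer and tile sizes are example values).
    static long ext_accesses(long M, long C, long E, long R, long S,
                             long TM, long TE)
    {
        long tiles    = (M / TM) * (E / TE) * (E / TE);
        long in_tile  = C * ((TE - 1) * S + R) * ((TE - 1) * S + R); // input reads per tile
        long w_tile   = C * TM * R * R;                              // weight reads per tile
        long out_tile = TM * TE * TE;                                // output writes per tile
        return tiles * (in_tile + w_tile + out_tile);
    }

    int main(void)
    {
        // Compare two hypothetical tilings of the same example layer.
        printf("%ld\n", ext_accesses(64, 32, 28, 3, 1, 8, 7));
        printf("%ld\n", ext_accesses(64, 32, 28, 3, 1, 64, 28));
        return 0;
    }

Minimizing such a count over the space of legal schedules, subject to the buffer-capacity constraint, is the optimization problem stated above.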
Reuse in convolution
Input feature maps: reused across output feature maps (loop LOF) and across overlapping filter windows (loops LSY, LSX together with LFY, LFX).
Output feature maps: each output element accumulates C*R*R partial products (loops LIF, LFY, LFX).
Weights: each filter weight is applied at every spatial position of the output (loops LSY, LSX).
The reuse of a reference is carried by a loop if the same memory location is accessed by different iterations of that loop.
However, unless the entire working set fits the local buffer, this reuse cannot be fully exploited.
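For the loop nest in the Background section, the amount of reuse carried by these loops can be quantified per operand (interior elements only; border elements are touched fewer times). The layer sizes below are example values:

    #include <stdio.h>

    // Reuse factors implied by the convolution loop nest: how many times
    // each element is accessed during one layer (example sizes, interior
    // elements; the input figure is an upper bound).
    int main(void)
    {
        long M = 64, C = 32, E = 28, R = 3, S = 1;

        long weight_reuse = E * E;                    // carried by LSY, LSX
        long output_reuse = C * R * R;                // carried by LIF, LFY, LFX
        long input_reuse  = M * ((R + S - 1) / S)     // carried by LOF and by the
                              * ((R + S - 1) / S);    // overlap of filter windows

        printf("weight: %ld  output: %ld  input (at most): %ld\n",
               weight_reuse, output_reuse, input_reuse);
        return 0;
    }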
Reuse buffer
Data cache
Application-managed scratchpad memories
Partitioning the local reuse buffer into a set of application-managed buffers.
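As a minimal sketch (hypothetical layout and tile sizes, not the organization chosen in this work), such a partitioned buffer can be pictured as one scratchpad region per operand, whose contents are moved in and out explicitly by the application rather than implicitly by cache hardware:

    // Partitioned, application-managed reuse buffer (hypothetical sizes).
    #define TM 8                     // output-fmap tile    (hypothetical)
    #define TC 32                    // input-fmap tile     (hypothetical)
    #define TE 7                     // spatial tile        (hypothetical)
    #define R  3                     // filter size         (example value)
    #define S  1                     // stride              (example value)
    #define TI ((TE - 1) * S + R)    // side of the input tile

    struct reuse_buffer {
        float in [TC][TI][TI];       // input-fmap tile
        float wgt[TM][TC][R][R];     // weight tile
        float out[TM][TE][TE];       // output tile (partial sums)
    };

Sizing each partition for the working set of one tile is precisely what a model adapted to application-managed buffers can account for, avoiding the overestimation of internal storage mentioned above.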