1 \documentclass[a4paper,11pt]{article}
3 \usepackage[latin1]{inputenc}
4 \usepackage[T1]{fontenc}
5 \usepackage[normalem]{ulem}
6 \usepackage[english]{babel}
7 \usepackage{listings,babel}
8 \lstset{breaklines=true,basicstyle=\ttfamily}
16 \title{Texture Mapping Unit}
17 \author{S\'ebastien Bourdeauducq}
20 \setlength{\parindent}{0pt}
21 \setlength{\parskip}{5pt}
24 \section{Presentation}
25 Milkymist has hardware acceleration for texture mapping on triangle strips. This process is used to implement the image warping effect in MilkDrop.
27 The texture mapping unit also supports fading the colors of the image to black (decay effect).
29 The core deals with 16-bit RGB565 progressive-scan framebuffers, accessed via FML links with a width of 64 bits and a burst length of 4.
31 The vertex data is fetched using a 32-bit WISHBONE master. Connecting this bus to the WISHBONE-to-FML caching bridge allows the mesh data to be stored in cost-effective DRAM.
33 For controlling the core, a CSR bus slave is also implemented.
35 \section{Configuration and Status Registers}
36 Registers can be read at any time, and written when the core is not busy. Write operations when the busy bit is set in register 0, including those to the control register, are illegal and can cause unpredictable behaviour.
38 Addresses are in bytes to match the addresses seen by the CPU when the CSR bus is bridged to Wishbone.
40 \subsection{Parameters and control}
41 \begin{tabularx}{\textwidth}{|l|l|l|X|}
43 \bf{Offset} & \bf{Access} & \bf{Default} & \bf{Description} \\
45 0x00 & RW & 0 & Control register. \\
46 & & & Bit 0: Start/Busy.\\
47 & & & Bit 1: Enable chroma key. \\
49 0x04 & RW & 32 & Number of mesh squares in the X direction (which is the number of mesh points minus one, and also the index of the last mesh point). \\
51 0x08 & RW & 24 & Number of mesh squares in the Y direction. \\
53 0x0C & RW & 63 & Brightness, between 0 and 63. The components of each pixel are multiplied by $ (n+1) \over 64 $ and rounded to the lowest integer. That means that a value of 0 in this register makes the destination picture completely black (because of the limited resolution of RGB565). \\
55 0x10 & RW & 0 & RGB565 color used as chroma key. Texture pixels with this color will not be drawn if the ``chroma key'' flag in the control register is set. \\
57 0x14 & RW & 0 & Address of the source mesh data (texture coordinates). Must be aligned on a 32-bit boundary. \\
59 0x18 & RW & 0 & Source framebuffer address. Must be aligned on a 16-bit boundary. \\
61 0x1C & RW & 640 & Source horizontal resolution. \\
63 0x20 & RW & 480 & Source vertical resolution. \\
65 0x24 & RW & 0 & Address of the destination mesh data (vertex coordinates). Must be aligned on a 32-bit boundary. \\
67 0x28 & RW & 0 & Destination framebuffer address. Must be aligned on a 16-bit boundary. \\
69 0x2C & RW & 640 & Destination horizontal resolution. \\
71 0x30 & RW & 480 & Destination vertical resolution. \\
75 \subsection{Performance counters}
76 In order to help tracking down performance bugs, the core has built-in performance counters.
78 Those counters are automatically reset when a new frame is submitted for processing, and must be read after the frame processing is finished.
80 These registers are read-only. Attempting to write them results in undefined behaviour.
82 \begin{tabularx}{\textwidth}{|l|l|l|X|}
84 \bf{Offset} & \bf{Access} & \bf{Default} & \bf{Description} \\
86 0x34 & R & 0 & Number of drawn pixels. Off-screen pixels are not counted. \\
88 0x38 & R & 0 & Processing time, in clock cycles. \\
90 0x3C & R & 0 & Number of stalled transactions detected at pipeline monitor 1. \\
92 0x40 & R & 0 & Number of completed transactions detected at pipeline monitor 1. \\
94 0x44 & R & 0 & Number of stalled transactions detected at pipeline monitor 2. \\
96 0x48 & R & 0 & Number of completed transactions detected at pipeline monitor 2. \\
98 0x4C & R & 0 & Number of misses in the source image (texture) cache. \\
103 The TMU is equipped with one active-high edge-sensitive interrupt line.
105 An interrupt is triggered when a texture mapping is done and all resulting data has been sent through the bus master.
107 \section{Encoding the vertex data}
108 The core supports a maximum mesh of 128x128 points. The address of the point at indices $ (x, y) $ in the mesh is, regardless of the actual the number of mesh points:
111 base + 4 \cdot (128 \cdot y + x)
114 This means that the mesh always has the same size in memory.
116 Each point is made up of 32 bits, with the 16 upper bits being the Y coordinates and the 16 lower bits the X coordinate.
118 Exactly 64KB are used by the mesh.
120 \section{Architecture}
124 \includegraphics[height=180mm]{architecture.eps}
125 \caption{Texture mapping unit architecture.}\label{fig:architecture}
128 \subsection{Handshake protocol between pipeline stages}
129 Because pipeline stages are not always ready to accept and/or to produce data (because, for example, of memory latencies), a flow control protocol must be implemented.
131 The situation is the same between all stages: an upstream stage is registering data into a downstream stage. During some cycles, the upstream stage cannot produce valid data and/or the downstream stage is processing the previous data and has no memory left to store the incoming data.
135 \includegraphics[height=30mm]{comm.eps}
136 \caption{Communication between two pipeline stages.}\label{fig:comm}
139 Appropriate handling of these cases is done using standardized \verb!stb! and \verb!ack! signals. The meaning of these is summarized in this table:\\
141 \begin{tabularx}{\textwidth}{|l|l|X|}
143 \bf stb & \bf ack & \bf Situation \\
145 0 & 0 & The upstream stage does not have data to send, and the downstream stage is not ready to accept data. \\
147 0 & 1 & The downstream stage is ready to accept data, but the upstream stage has currently no data to send. The downstream stage is not required to keep its ack signal asserted. \\
149 1 & 0 & The upstream stage is trying to send data to the downstream stage, which is currently not ready to accept it. The transaction is \textit{stalled}. The upstream stage must keep stb asserted and continue to present valid data until the transaction is completed. \\
151 1 & 1 & The upstream stage is sending data to the downstream stage which is ready to accept it. The transaction is \textit{completed}. The downstream stage must register the incoming data, as the upstream stage is not required to hold it valid at the next cycle. \\
155 It is not allowed to generate \verb!ack! combinatorially from \verb!stb!. The \verb!ack! signal must always represent the current state of the downstream stage, ie. whether or not it will accept whatever data we present to it.
157 \subsection{Triangle filling}
158 The triangle filler is inspired by \url{http://www.geocities.com/wronski12/3d_tutor/tri_fillers.html} and is based on the following algorithm:
160 if (B.y-A.y > 0) dx1=(B.x-A.x)/(B.y-A.y) else dx1=B.x - A.x;
161 if (C.y-A.y > 0) dx2=(C.x-A.x)/(C.y-A.y) else dx2=0;
162 if (C.y-B.y > 0) dx3=(C.x-B.x)/(C.y-B.y) else dx3=0;
166 for(;S.y<=B.y;S.y++,E.y++,S.x+=dx2,E.x+=dx1)
167 horizline(S.x,E.x,S.y);
169 for(;S.y<=C.y;S.y++,E.y++,S.x+=dx2,E.x+=dx3)
170 horizline(S.x,E.x,S.y);
172 for(;S.y<=B.y;S.y++,E.y++,S.x+=dx1,E.x+=dx2)
173 horizline(S.x,E.x,S.y);
175 for(;S.y<=C.y;S.y++,E.y++,S.x+=dx3,E.x+=dx2)
176 horizline(S.x,E.x,S.y);
180 The coordinates of vertices are (A.x,A.y), (B.x,B.y), (C.x,C.y); we assume that $A.y \leq B.y \leq C.y$ (the points are sorted first). \verb!dx1!, \verb!dx2!, \verb!dx3! are deltas used in interpolation; in the TMU implementation a variant of the Bresenham algorithm is used to avoid approximation errors associated with fixed point or floating point formats. The \verb!horizline! procedure draws an horizontal segment between points (S.x,Y), (E.x,Y).
182 Using a similar algorithm, the TMU interpolates image coordinates at the same time it fills the triangle.
184 \section{Bilinear filtering}
185 \subsection{Principle}
186 Bilinear filtering works by adding 5 extra bits of precision to the texture coordinates, which become fixed-point non-integer coordinates.
188 Those bits are used to compute a weighted average of 4 neighbouring texture pixels when the interpolated texture coordinates are not integer.
192 \includegraphics[height=32mm]{bilinear.eps}
193 \caption{Principle of bilinear texture filtering.}\label{fig:bilinear}
196 \subsection{Pixel distribution in the cache}
197 Because of performance requirements, all four texture pixels must be fetched in one clock cycle. It is therefore relevant to examine their distribution in the cache, to determine what cache architecture should be used.
199 \subsubsection{General case, middle of texture}
202 \includegraphics[height=55mm]{dist_common.eps}
203 \caption{Most common case, pixels 1 and 2 are in the same cache line.}\label{fig:case1}
208 \includegraphics[height=55mm]{dist_12lines.eps}
209 \caption{Pixels 1 and 2 are in different cache lines.}\label{fig:case2}
212 \subsubsection{With clamping}
213 When pixels go out of the texture, clamping can induce cases where two or four pixels merge.
217 \includegraphics[height=55mm]{dist_clamp.eps}
218 \caption{Example of pixels merging because Y1=-1.}\label{fig:caseclamp}
221 \subsubsection{With wrapping}
222 Wrapping can cause pixels split across the texture. This can happen horizontally, vertically or both.
226 \includegraphics[height=55mm]{dist_wrap1.eps}
227 \caption{Vertical wrapping.}\label{fig:casewrap1}
232 \includegraphics[height=55mm]{dist_wrap2.eps}
233 \caption{Horizontal wrapping.}\label{fig:casewrap2}
238 \includegraphics[height=55mm]{dist_wrap3.eps}
239 \caption{Wrapping in both directions.}\label{fig:casewrap3}
242 \subsection{Cache architecture}
243 \subsubsection{Design options}
244 We have two options for designing the cache:
246 \item four separate caches (one for pixel 1, one for pixel 2, etc.)
247 \item a shared cache capable of looking up 4 pixels at a time
250 Misses between the ways for pixels 1, 2, 3 and 4 are likely to be correlated, so a shared cache architecture is chosen to minimize memory bandwidth.
252 \subsubsection{Avoiding cache conflicts}
253 There could be replacement conflicts that have, at least, a detrimental impact of performance.
255 Such conflicts happen when:
257 \item (Figure~\ref{fig:case1}) lines A and B collide. This happens when:
259 \text{hres} \equiv 0 \pmod{\text{csize}}
261 \item (Figure~\ref{fig:case2}) lines A and B collide or lines C and D collide. This does not happen unless the cache only contains one line, which is not a practical case.
262 \item (Figure~\ref{fig:case2}) lines A and C or lines B and D collide. This is the same case as for Figure~\ref{fig:case1}.
263 \item (Figure~\ref{fig:casewrap1}) lines A and B collide. This happens when:
265 \text{hres}\cdot(\text{vres}-1) \equiv 0 \pmod{\text{csize}}
267 \item (Figure~\ref{fig:casewrap2}) lines A and B or lines B and C collide. This happens when:
269 \text{hres}-1 \equiv 0 \pmod{\text{csize}}
273 In the equations above:
275 \item hres is the horizontal resolution in pixels
276 \item vres is the vertical resolution in pixels
277 \item csize is the total number of pixels the cache can hold. It is equal to the cache size in bytes (not counting the tag memory) divided by 2.
280 \subsubsection{Chosen architecture}
281 The block diagram of the cache architecture is given in Figure~\ref{fig:carch}.
285 \includegraphics[height=95mm]{carch.eps}
286 \caption{TMU cache architecture for bilinear filtering.}\label{fig:carch}
289 Because cache conflicts can be easily be avoided by choosing the texture resolution carefully, they are not handled by this architecture. \textbf{Having a cache conflict results in a lockup of the cache controller}, that requires a reset of the core.
291 The four-way cache can be efficently implemented on FPGA targets by using two dual-port block RAMs.
293 \subsubsection{Summary}
294 In order to avoid cache conflicts which lock up the core, one must make sure that:
296 \text{hres} \not \equiv 0 \pmod{\text{csize}}
298 Additionally, if texture wrapping is enabled, one must also make sure that:
300 \text{hres}\cdot(\text{vres}-1) \not \equiv 0 \pmod{\text{csize}}
303 \text{hres}-1 \not \equiv 0 \pmod{\text{csize}}