In the commonly used line-based implementation of the two-dimensional discrete wavelet transform (2D DWT), a data buffer is required between the row DWT processor and the column DWT processor to reorder the data flow into a consistent sequence, which increases the on-chip memory size, output latency and control complexity. With the proposed decomposed lifting algorithm (DLA), image data is processed in raster-scan order in both the row processor and the column processor. Theoretical analysis indicates that DLA outperforms other lifting-based algorithms in precision, in terms of both round-off noise and internal word length. An efficient line-based architecture is designed to perform the 2D DWT based on DLA with high performance and low memory by eliminating the data buffer. For an N × N image, only 4N internal memory is required for the 9/7 filter, with an output latency of 2N clock cycles. Compared with related 2D DWT architectures, the on-chip memory size and output latency are reduced significantly under the same arithmetic cost, memory bandwidth and timing constraint. The design was implemented in an SMIC 0.18 μm CMOS logic process with 32K bits of dual-port RAM and 20K equivalent 2-input NAND gates in a 1.2 mm × 1.1 mm die. It can perform 5-level Mallat decomposition at 36.17 frames/s for image resolutions up to 1920 × 1080 pixels in YUV422 full-color format at 100 MHz.
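The quoted frame rate can be sanity-checked with a short back-of-the-envelope calculation. The sketch below is not taken from the design itself. It assumes that the datapath consumes two samples per clock cycle and that the multi-level workload is approximated by the geometric-series limit 4/3 of the first-level sample count. Both are assumptions, and the function name and parameters are hypothetical.

```python
def frames_per_second(width, height, samples_per_pixel, clock_hz,
                      samples_per_cycle, levels=None):
    """Estimate the frame rate of a multi-level 2D DWT engine.

    Each Mallat level processes one quarter of the samples of the previous
    level. With levels=None the workload uses the infinite-series limit 4/3;
    otherwise it sums exactly `levels` terms.
    """
    samples = width * height * samples_per_pixel
    if levels is None:
        workload_factor = 4.0 / 3.0
    else:
        workload_factor = sum(0.25 ** k for k in range(levels))
    return clock_hz * samples_per_cycle / (samples * workload_factor)


# YUV422 carries two samples per pixel on average (one luma, one chroma).
# ASSUMPTION: two samples enter the datapath per clock cycle.
fps_bound = frames_per_second(1920, 1080, 2, 100e6, 2)
fps_5_levels = frames_per_second(1920, 1080, 2, 100e6, 2, levels=5)

print(f"4/3 bound:     {fps_bound:.2f} frames/s")
print(f"5 levels only: {fps_5_levels:.2f} frames/s")
```

Under these assumptions the 4/3 bound evaluates to about 36.17 frames/s, which agrees with the figure reported above. Summing exactly five levels gives a marginally higher estimate.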