# Image Super Resolution: Literature Review (Building the CNN)

In general, the convolutional neural network is like the traditional neural network. They partake the same features such as objects are made into neuron forms, and each neuron unit can take inputs, then performs dot product, and it does not require to be a linear procedure. The advantage of which is that neurons have learnable bias and weights, so that it makes it easy for users to model them. The layers can be formed by neurons and input can be decomposed into the mannequin. And so, as the result, a single differentiable score function can be held from the pixels of raw images. In the backdrop, a fully connected layer exists for the grading of the loss function to interpret the class grade. The simplified example is depicted in Figure 1.

Still, compared to the traditional neural network, the convolutional neural network explicitly takes images as inputs. Then the architecture is fully encoded for the properties of images. Therefore, due to the simplicity, it is easier to connect neurons and requires less parameters for the network implementation.

To void the overfitting effect that the traditional neural network can have, we deploy our neurons in a 3D volume bucket. On each layer, we have the size defined as the width and height, which is directly related to the input. We call the third dimension as depth, which is the depth of the activation volume instead of the depth of the neural network.

Next, if we consider the layers from the input to the output. Let’s say we have an image as input with the resolution of A by A, then let’s break down the parameters for each layer:

- Input will be decomposed into RGB channels. Then the RAM usage would be [AxAx3].
- The convolution layer will compute the number of neurons which are connected to the local input. If we decide to use number of B filters, the RAM allotment would be [AxAxB].
- The next layer is defined as ReLu. It can perform elementwise activation functions, so that the max (0, x) has the threshold at 0. The size of the neuron volumes will keep at [AxAxB].
- Then, the next polling layer will simply perform down sampling in the spatial dimensions.
- Then the final fully-connected layer will compute the class grades, which is defined by users. If we categorize the image sets into C categories, we will have an output [1x1xC]. Per the nature of the convolutional network, each neuron cell in the final layer will be connected to all the coefficients in previous volumes.

Then, if we consider the convolutional layer, we know that it is defined as a set of learnable filters. It is the fundamentals of convolutional network firstly being applied in LeNet. It is the most computational heavy module for the machine learning part. Every single filter has a small planar size but a full depth matches with the 3D neuron block depth. Simply like human brains, certain layers will be activated if and only if similar features can be found in the object.

If we talk about the single image super resolution, there are 4 major categories: edge based modeling, prediction modeling, example patch modeling, and image statistical modeling. However, if we talk about the general performance, it seems that the example patch modeling can achieve the best overall quality. In addition, it can be derived into two distinct methods. The internal learning method was first proposed in 2009, and applied to the AlexNet in 2012. It utilizes the base similarity properties and generate the exemplar patch via the input image. The external one, which being used in this case, learns the mapping between low and high resolution image patches from the external libraries. With the help of nearest neighbor strategy, it makes the user easier to control the speed and the computation complexity, which later is being called the sparse coding formulation. There have been some applications for image restoration with the similar approaches, however, there is a key feature they did not focus on, which is denoising.

When we process the modeling for a specific patch set, an end to end mapping algorithm will be used with several network estimation parameters. These parameters are gained from minimizing the losses in between the related high resolution image and the reconstructed image. If we denote the reconstructed image as I, the high-resolution contradiction as k, the low-resolution image as i, the total pixel count is j, n is we can get the loss function by using the Mean Squared Error:

N is the number of images in the training patch

Then, we can achieve a high Peak Signal to Noise Ratio. This is the standard metric function being widely used for image restoration quality evaluation.

In addition, *MAX _{I}* is the maximum possible pixel value of the image.

Bibliography

Yang, Jianchao, and Thomas Thomas Huang. Image Super-resolution:

Historical Overview and Future Challenges (n.d.): n. pag. Web.

Yang, Chih-Yuan, Chao Ma, and Ming-Hsuan Yang. “Single-Image Super-Resolution: A Benchmark.” Computer Vision – ECCV 2014 Lecture Notes in Computer Science (2014): 372-86. Web.

Yang, Jianchao, John Wright, Thomas Huang, and Yi Ma. “Image Super-Resolution via Sparse Representation.” IEEE (n.d.): n. pag. Web