Link to Github here.
Report also included in the Github.
- Requires the VAMemory library
- Use VS to compile
We still have several things to address for our project. Most of the issues must do with run time and efficiency of the application of both segmentation and upscaling together. The current form of the implementation requires upscaling every segment individually and then merging them together. Given larger images especially for the CPU based implementation, for 1080p->2160p upscaling the runtime is already 44minutes for one segment. This means if we were to have a lot of segments, the of the city scene, this means 205*44=9020 minutes approximately 150 hours. This is unacceptable. If we were to incorporate segmentation again it would have to be inside the upscaling loop itself or somehow built into the models. In hindsight, we can completely disregard this problem when using a DGPU to do the processing as the runtime for one segment will take only several seconds meaning any image will take only several minutes to scale by 2x. However, this will completely nullify any notion of a low power (no DGPU) or mobile implementation (mobile) using image segmentation. Another way we could achieve a much better run time is to use an algorithm developed by someone else for super resolution. For example, we could use Milanfar’s algorithm.
The current denoising model developed for this project is somewhat peculiar. It introduces noise to a completely black image. Although the output appears to the human eye to be completely the same, the segmentation algorithm appears to be very sensitive to the noise (albeit visually appealing) added by the denoising process. It would be good to redevelop the model to add the exception for the case of RGB values of (0,0,0) to do nothing. This has led to the use of the original file to produce segments due to the inconsistencies produced by the model.
Currently, the current segmentation code uses a custom RGB class. Although simpler than the OpenCV matrix representation, it would be easier to read the code and easier to adapt the code if everything used was of type OpenCV. In addition to this, the segmentation, although completely dependent on the RGB class (can adapt the diff condition to fit YUV), can be implemented into the YUV format. This will make it unnecessary to use cv::cvtColor to convert between YUV and RGB. This will make it unnecessary to use cv::cvtColor to convert between YUV and RGB.
For improvements to the CNN, a better optimizing algorithm can be implemented in the future (when one is developed) to further reduce the required memory for computation. Currently when running the model, it takes several days and a server with 128Gb of RAM to finish. Using a rig with any lower amount of RAM will cause the memory to overflow. Although probably infeasible now there is a possiblility of moving the core algorithm (without segmentation).
As we know that, Android has native support for OpenCL, which will open the possibility to the device’s native GPU to implement the algorithm on mobile devices. This will significantly reduce the runtime as we have seen above due to the parallelization capabilities of GPUs. As the chips are getting faster and faster, it is possible to run the implementation in real-time, especially for 480P->720P, 720P->1080P. This makes it possible to make the most use out of the available cable bandwidth. This is very desirable to for streaming services (albeit at the dismay of the user) as they can stream compressed video and upscale it on the users’ end claiming “true HD resolution”.
Things to note: 1. We are using OpenCV version 3.0 RC1 for all the tests below. Prior versions 2.8 and 3.1 failed in one aspect or another. 2.8 failed to compile on Linux and 3.1 failed for some unknown reason when deallocating memory for matrices when a function went out of scope.
Results from the i7-5600U Processor (CPU version of the algorithm using filter2)
Again, we will be using the same 3 color bar as a control base case. Again, using the same input arguments of sigma=0.1, k=30, min_size=10. We chose the upscale ratio of 2.0x we see the following results. First, we take a look at the scaled segments. The runtime of scaling+segmentation+merge is 15.814 seconds. Here is the input image.
Here are the upscaled segments…
And now merged together…
Now we make the image more complex. We use the Undertale logo for this test. We used parameters sigma=0.8, k=900, min_size=550, and a scale ratio of 2.0x. First, we look at the input image. The runtime was 9 minutes 41 seconds. We can already see the incredible difference in run time. This is firstly due to the difference in the resolution, bit depth, and the number of segments produced (12).
Now we take a look at the upscaled sements (only a few shown).
And after a few merges…
And finally here is the output with all the segments merged together. Even given that we have upscaled it we can already see that the heart inside the R appears to have gained some sharpness and in general the whole image seems to have gained more contrast in addition to being upscaled. However, due to the filter used (filter2) the output seems a bit softer than it really should.
Below we look at the source image for our last example.
Given the computer’s CPU it took a total of 44 min 12 sec to upscale one segment (original image) to 2160p. Thus we concluded it would be infeasible to complete a complete upscaling of this image to 2160p. The output of the 4x upscaling is below.
Now we use a completely different setup to perform the upscaling. Given our previous knowledge of the multiplicative nature of runtime due to image segmentation, all of the next tests will be done without image segmentation. We first perform the same upscaling using an i7-5930k. This is a 6 core CPU capable of 12 threads and has a base frequency of 3.5GHz and turobos up to 3.70GHz. Using the same input image as above this CPU was able to perform the upscaling in 38 minutes and 42 seconds. Considering the price difference and core number compared to the Thinkpad, this is only marginally faster. This shows that using CPU to do computation is only producing diminision returns. Below is the output run on filter2.
Our third test case was an AMD FX 9590 Black an 8 core processor with 4.7GHz clock speed and turbo speed of 5.0GHz. There are two main issues to address when it comes to using AMD products to perform the computation. Firstly, because people working at Intel first developed OpenCV, it uses SSE (Streaming SIMD Extensions) optimizations. Although AMD later added support for SSE instructions starting with its Athlon XP line, it is still not up to par with Intel’s implementation. Thus computation is inherently slower than using an Intel product. As thus even with its superior core count and processor speed due to the limitations of the SSE implementation on this processor it is even slower than an Intel i7 dual core processor with a runtime of 1 hour 31 minutes and 9 seconds.
Running the same image using a GPU implementation enabling the use of CUDA and much more efficient parallelization. We used a Nvidia GTX 1070 (1920 Cuda Cores, Clock 1506 Mhz, Boost Clock 1683 Mhz), GTX 980Ti (2816 Cuda Cores, Clock 1075 Mhz, Boost Clock 1075 MHz.), and 2 GTX980Ti in SLI. This was significantly faster, on the order of hundreds of times faster than a CPU implementation.
All results have been summarized below for the city scene.
|Processor Type||Time to execute|
|AMD FX 9590 Black (filter2)||1 hour 31 minutes 9 seconds|
|Intel i7-5600U (filter2)||44 minutes 12 seconds|
|Intel i7-5930k (filter2)||38 minutes 42 seconds|
|Intel E5-2690v4 (filter2)||16 minutes 27 seconds|
|Nvidia GTX 1070 (filter2)||5.0823 seconds|
|Nvidia GTX 980Ti (filter2)||8.0292 seconds|
|Nvidia GTX 980Ti SLI (2) (filter2)||1.16984 seconds|
And the output
The hardware used to run the segementation is a i7-5600U Processor (due to the lack of a dGPU in this device, Thinkpad T450s). Using a more powerful processor will improve the runtime of the algorithm. However, as we will see the runtime of the segmentation algorithm (CPU implementation) is insignificant to the runtime of the upscaling therefore there is not much to be improved from a hardware point of view. Note: Other than the first test image, it appears that a sigma of 0.8 (for smoothing seems to work best.
Here is the source image of an image to segment. It is composed of 3 colors, a black background and 2 colored bars.
These calculations were run with the input arguments k=30, min_size=10, sigma=0.1 and had a run time of .153 seconds for 3 segments.
For the next image the calculation and output of the segments took 9.666 seconds and created 17 segments. In lieu of space, only the most significant segments will be shown.
Here is the source image.
As we increase the complexity of the image, we will notice that the number of segments to begin growing exponentially. Now we look at the final example of image segmentation.
Here is the source image.
The input arguments were sigma=0.8, k=750, min_size=2000, had a run time of 10.320 seconds and produced 27 segments. The input arguments were sigma=0.8, k=750, min_size=2000, had a run time of 10.320 seconds and produced 27 segments.
Upscaling Block Diagram
Discussion on Merge:
The process of merge took an extensive amount of time to make working. For some reason after denoising and upscaling there were many anomalies within the image datat. For instance a black pixel may have non zero RGB valuers or in YUV form it may have non 0 UV channels or very large Y values for 0 UV. We concluded that we cannot use ==0 as the threshold for computing whether a pixel is a part of a segment or not.
Originally each segment was passed as a map of RGB pixels to the merge portion of the code and we checked if the R,G,B channels had values <1. And if it did we would do nothing to the pixel. However this proved impossible as after denoising and upscaling some of the supposed black pixels weren’t truly black. Worst of all the values weren’t consistent sometimes they were (1,1,1) or (3,3,3) or (5,5,5) and we were not able to discern a pattern or upper bound on these values.
After that we decided to convert the image to a black and white format and then directly compare whether an image is “black” or not. The same problem occurred as above and some segments would end up overwriting parts of other segments due to blurring of pixels around the edges.
We tried the same thing going from RGB to YUV and performing the same computation. However we were faced with the same problem again with erratic Y values.
There are two things to consider about why the above three ways failed. Firstly, we were using OpenCV’s implementation of pixel conversion. We are not sure how it is converted and whether it will produce interpolation errors or computation errors. Secondly, the way we accessed the Y values of the YUV converted image were (probably) wrong. Thus, we were getting values as uint output instead of the float values the Y channel was supposed to be.
To resolve these issues, after much trial and error we discovered that the best way to correctly do this is to calculate the Y channel ourselves manually. This required scaling the output Y value by 255.0 and doing calculations in floating point to maintain as much accuracy as possible. Only with this way were we able to properly use a threshold to determine whether a pixel is black or not with a much smaller margin of error in the use of floating point.
Image Segmentation Block Diagram
Discussion on Disjoint-Set Forests as Segment Storage
Disjoint-set forests with union by rank and path compression. Let us define what these things mean. A disjoin-set data structure allows us to keep track of a set of elements made up of disjoint components. In our case, this would mean that each disjoint set would represent an individual segment. The two notable functions that this provides is the find and union (named join in the implementation). This allows us to determine if two representatives are part of the same root and to join two trees into one single tree building the segmentation. The two notable functions that this provides is the find and union (named join in the implementation). This allows us to determine if two representatives are part of the same root and to join two trees into one single tree building the segmentation.
In union-by-size, the determining factor for which set to join to the other is the number of elements in each set. Union by rank ensures that we can retain minimum depth for maximum efficiency. This is results in a complexity of ~O (log n). Rank is initially 0 when a one element set is formed. Then as we begin joining sets, sets of the same rank will be joined and then incrementing the rank variable of the new union set.
Now what is a disjoint-set forest? Instead of implementing a disjoint-set data structure using either an array or vector, in which a representative could be any element inside the set, a disjoint set forest uses a tree structure where the representative is the root of the tree. This alters the find and so that when called it reaturs root of each tree instead. In the case of find called on a node, it will follow parent nodes until it reaches the root. This means that we do not need to keep track of the actual set in which an element belongs to as we can easily search through a tree to determine if two elements are in the same set. In addition to this, when using path compression when find is called on any node it will attach itself and all its parent nodes directly to the root hence lowering the height of the tree making subsequent finds much faster.
Felenswalb and Huttenlocher’s algorithm has complexity O (m α(m)) Where α(m) is the inverse Ackerman function and m represents the number of vertices in the graph. Because there are at most three edges connected to each vertex (one for each RGB component) all operations can be done in constant time. Therefore, the only limiting factor is the rank (depth) which is bounded by O (m α(m)).
Optimization of the CNN was done using the algorithm ADAM developed by Diederik P. Kingma and Jimmy Lei Ba described in their paper Adam: A Method for Stochastic Optimization. The motivation for choosing this algorithm is the low computational complexity of the method. As stated in the paper it only requires first order gradients on the input objective function which has the same computational complexity as evaluating the original function. Some other advantages are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its stepsizes are bounded by the stepsize hyperparameter, and naturally performs a form of step size annealing. In any case this means the algorithm has a degree of automation and adaptivity that many other algorithms do not present.
We start with defining the pseudo code which defines how this functions. Adam uses the method of moments which is where unknown parameters are estimated using 0th, 1st, and higher order moments. Adam uses only the first two. For example in the case of a random vector, 1st order moment is the mean vector and second order is the covariance matrix, a measure of how related two variables are.
Inputs: α (step size), β1β2 (exponential decay rates for moment estimation ranging from [0,1] ), f(θ) (stochastic objective function), θ_0 (initial parameter vector). As advised by the paper the input values for α=0.001, β1=0.9, β2=.999 and ε=10^-8 (for update last step)
Initialization: m_0=0 (1st moment vector), v_0=0 (2nd moment vector), t=0 (Initialize timestep)
In general, we are trying to minimize a noisy objective function such that it matches our expected outoput (an image) with respect to the parameters. Moment estimates are used to compute a converging parameter using bias correction. Bias correction is required because moments are initialized to zero thus without the bias the estimate may or may not change at all.
Given the gradient (g) of the stochastic objective f we can estimate the moment based upon all previous moments. This is done using the following summation formula. Where the beta represents the decay rate of the function and gradient squared is the elementwise square of the vector. (second moment shown here).
The expected value of the moment can be calculated in the same manner where ζ represents an offset value to account for the moving average of the second moment.
Looking at the final line of the equation we are able to see how we can calculate the expected moment and the scalar value multiplying the expected second order gradient to obtain our biased second order moment. This in turn is what is used to scale to obtain the correct value for the newly defined timestep scalar below.
Kingma and Ba have stated that the following conditions are a good choice of how to choose alpha as a step size. |∆t| ≤ α · (1 − β1)/ √( 1 − β2) in the case (1 − β1) > √ (1 − β2), and |∆t| ≤ α. Recall α is the step size and essentially the bound on how many iterations the algorithm will have to go through before we are able to reach an estimate of our parameter. Often we are able to choose α based upon some physical limitation in the model we are building. However, although inefficient often choosing a small enough step size (thus large amount of iterations) will still be able to generate the same result as long as the algorithm is run long enough. In Adam’s case the step size will be automatically scaled as we reach our expected parameter values in the asymptotic case. Thus, larger step size values will still be relatively effective. We are able to see this in the equation for the update step where α is scaled by the moments
Kingma and Ba “define” a signal to noise ratio which describes the condition in which we have converged on the expected parameter when the SNR reaches zero. This is easy to see because we can note that a value multiplied by zero is zero.
Kingma, Diederik P., and Jimmy Lei Ba. “ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION.” ICLR (2015): 1-15. 23 July 2015. Web.
In general, the convolutional neural network is like the traditional neural network. They partake the same features such as objects are made into neuron forms, and each neuron unit can take inputs, then performs dot product, and it does not require to be a linear procedure. The advantage of which is that neurons have learnable bias and weights, so that it makes it easy for users to model them. The layers can be formed by neurons and input can be decomposed into the mannequin. And so, as the result, a single differentiable score function can be held from the pixels of raw images. In the backdrop, a fully connected layer exists for the grading of the loss function to interpret the class grade. The simplified example is depicted in Figure 1.
Still, compared to the traditional neural network, the convolutional neural network explicitly takes images as inputs. Then the architecture is fully encoded for the properties of images. Therefore, due to the simplicity, it is easier to connect neurons and requires less parameters for the network implementation.
To void the overfitting effect that the traditional neural network can have, we deploy our neurons in a 3D volume bucket. On each layer, we have the size defined as the width and height, which is directly related to the input. We call the third dimension as depth, which is the depth of the activation volume instead of the depth of the neural network.
Next, if we consider the layers from the input to the output. Let’s say we have an image as input with the resolution of A by A, then let’s break down the parameters for each layer:
- Input will be decomposed into RGB channels. Then the RAM usage would be [AxAx3].
- The convolution layer will compute the number of neurons which are connected to the local input. If we decide to use number of B filters, the RAM allotment would be [AxAxB].
- The next layer is defined as ReLu. It can perform elementwise activation functions, so that the max (0, x) has the threshold at 0. The size of the neuron volumes will keep at [AxAxB].
- Then, the next polling layer will simply perform down sampling in the spatial dimensions.
- Then the final fully-connected layer will compute the class grades, which is defined by users. If we categorize the image sets into C categories, we will have an output [1x1xC]. Per the nature of the convolutional network, each neuron cell in the final layer will be connected to all the coefficients in previous volumes.
Then, if we consider the convolutional layer, we know that it is defined as a set of learnable filters. It is the fundamentals of convolutional network firstly being applied in LeNet. It is the most computational heavy module for the machine learning part. Every single filter has a small planar size but a full depth matches with the 3D neuron block depth. Simply like human brains, certain layers will be activated if and only if similar features can be found in the object.
If we talk about the single image super resolution, there are 4 major categories: edge based modeling, prediction modeling, example patch modeling, and image statistical modeling. However, if we talk about the general performance, it seems that the example patch modeling can achieve the best overall quality. In addition, it can be derived into two distinct methods. The internal learning method was first proposed in 2009, and applied to the AlexNet in 2012. It utilizes the base similarity properties and generate the exemplar patch via the input image. The external one, which being used in this case, learns the mapping between low and high resolution image patches from the external libraries. With the help of nearest neighbor strategy, it makes the user easier to control the speed and the computation complexity, which later is being called the sparse coding formulation. There have been some applications for image restoration with the similar approaches, however, there is a key feature they did not focus on, which is denoising.
When we process the modeling for a specific patch set, an end to end mapping algorithm will be used with several network estimation parameters. These parameters are gained from minimizing the losses in between the related high resolution image and the reconstructed image. If we denote the reconstructed image as I, the high-resolution contradiction as k, the low-resolution image as i, the total pixel count is j, n is we can get the loss function by using the Mean Squared Error:
N is the number of images in the training patch
Then, we can achieve a high Peak Signal to Noise Ratio. This is the standard metric function being widely used for image restoration quality evaluation.
In addition, MAXI is the maximum possible pixel value of the image.
Yang, Jianchao, and Thomas Thomas Huang. Image Super-resolution:
Historical Overview and Future Challenges (n.d.): n. pag. Web.
Yang, Chih-Yuan, Chao Ma, and Ming-Hsuan Yang. “Single-Image Super-Resolution: A Benchmark.” Computer Vision – ECCV 2014 Lecture Notes in Computer Science (2014): 372-86. Web.
Yang, Jianchao, John Wright, Thomas Huang, and Yi Ma. “Image Super-Resolution via Sparse Representation.” IEEE (n.d.): n. pag. Web