Sunday, November 24, 2013

Topic 4: Sorting Algorithms

I will summarize several sorting algorithms, including selection sort, insertion sort, merge sort, heap sort, and quicksort. The notes are followed by code implementations, analysis of the sorting algorithms, and finally some typical questions that could be asked in settings such as exams or technical interviews.

I will also enumerate the other, less popular sorting algorithms at the end as an appendix.

1. Selection Sort:
Selection sort is the simplest and most straightforward sorting algorithm. It iterates through the elements to find the smallest one first (selecting the smallest element), puts it in the right place, then selects the second smallest element and puts it in the next sorted position, and so on. It runs in O(n^2).

Implementation:
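The original code is not included in this note, so the following is a minimal in-place sketch in Python (my own version):

def selection_sort(a):
    """In-place selection sort: O(n^2) comparisons, O(1) extra space."""
    n = len(a)
    for i in range(n - 1):
        # Find the index of the smallest element in a[i:].
        min_idx = i
        for j in range(i + 1, n):
            if a[j] < a[min_idx]:
                min_idx = j
        # Put it into its final position.
        a[i], a[min_idx] = a[min_idx], a[i]

For example, after xs = [5, 2, 4, 6, 1, 3] and selection_sort(xs), xs is [1, 2, 3, 4, 5, 6].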

2. Insertion Sort:
Insertion sort is the first sorting algorithm mentioned in the CLRS book, and also the first analyzed there in detail. The book interprets insertion sort as sorting a pile of cards lying face down on the desk: the right hand picks up cards one by one from the desk and inserts each into the left hand in order. What we are essentially doing in this process is going through all the cards already in the left hand, comparing them with the one in the right hand, and inserting the right-hand card into its place among the cards in the left hand.

Following the basic idea above, we come up with an in-place implementation, which rearranges the elements in memory within the original array. We iterate through the elements starting from the second one; by comparing each element with the elements before it, we find the position where it belongs. We then shift all the elements between the correct position (included) and its original position (excluded) one step to the right, leaving the correct position open, and move the element into it.

The following diagram shows how insertion sort works:

The pseudocode from CLRS is (pay attention to the while loop):

The implementation of Insertion Sort is:
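The code itself did not survive in this note, so here is a minimal sketch in Python following the CLRS pseudocode, with 0-based indices instead of the book's 1-based ones:

def insertion_sort(a):
    """In-place insertion sort, mirroring CLRS INSERTION-SORT."""
    for j in range(1, len(a)):
        key = a[j]
        # Shift elements of the sorted prefix a[0..j-1] that are
        # greater than key one position to the right.
        i = j - 1
        while i >= 0 and a[i] > key:
            a[i + 1] = a[i]
            i -= 1
        a[i + 1] = key  # the hole left behind is key's correct position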


3. Merge Sort:
Merge sort is used in CLRS to illustrate the divide-and-conquer methodology. The basic ingredient of merge sort is the MERGE() method. What MERGE() does is take two arrays, both already sorted, and merge them into a single array that is also sorted. The idea of merge sort is to take a given array and recursively divide it, merging the sub-arrays with the MERGE() method.

The MERGE() method pseudocode from CLRS:

The MERGE-SORT() pseudocode from CLRS:

The implementation of Merge sort:
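The original implementation is missing here, so the following is a sketch in Python. It returns a new list rather than sorting in place, which keeps the MERGE() step easy to read:

def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])   # append whatever remains of either list;
    out.extend(right[j:])  # at least one of these slices is empty
    return out

def merge_sort(a):
    """Recursively split, then merge the sorted halves. O(n lg n)."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    return merge(merge_sort(a[:mid]), merge_sort(a[mid:]))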

4. Heap Sort:
In introducing heap sort, the CLRS book first introduces the heap concept: basically a binary tree in which a parent's key is always greater than or equal to its children's keys (a max-heap). One example heap illustrating the heap property is shown in the following figure.

A heap is usually represented by an array. In this representation, given an index i of the (1-based) array, we can calculate the parent index with PARENT(i) = floor(i/2), the left child index with LEFT(i) = 2i, and the right child index with RIGHT(i) = 2i + 1.
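As a quick illustration, the same index arithmetic in Python for a 0-based array (most languages index from 0, unlike the book's 1-based arrays) shifts everything by one:

def parent(i):
    return (i - 1) // 2  # floor division

def left(i):
    return 2 * i + 1

def right(i):
    return 2 * i + 2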

To implement heap sort, we first need a method MAX-HEAPIFY to maintain the max-heap property. Another method, BUILD-MAX-HEAP, uses MAX-HEAPIFY to build a max-heap from the given nodes. HEAPSORT() calls BUILD-MAX-HEAP and MAX-HEAPIFY to sort in O(n lg n). The idea of HEAPSORT is to exchange the root (the maximum) with the last element in the heap, then reduce the heap size by one to remove the largest element and restore the heap property. Doing this iteratively does the job of heap sort.

Following are the methods mentioned above and a figure to illustrate how heap sort works.




Here is the heap sort implementation:
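The code itself is missing from this note; here is a sketch in Python using the 0-based index arithmetic shown earlier:

def max_heapify(a, i, heap_size):
    """Float a[i] down until the subtree rooted at i is a max-heap."""
    l, r = 2 * i + 1, 2 * i + 2
    largest = i
    if l < heap_size and a[l] > a[largest]:
        largest = l
    if r < heap_size and a[r] > a[largest]:
        largest = r
    if largest != i:
        a[i], a[largest] = a[largest], a[i]
        max_heapify(a, largest, heap_size)

def build_max_heap(a):
    # Leaves are already heaps, so heapify the internal nodes bottom-up.
    for i in range(len(a) // 2 - 1, -1, -1):
        max_heapify(a, i, len(a))

def heap_sort(a):
    """In-place heap sort, O(n lg n)."""
    build_max_heap(a)
    for end in range(len(a) - 1, 0, -1):
        a[0], a[end] = a[end], a[0]  # move the current maximum to the end
        max_heapify(a, 0, end)       # restore the heap on the prefix a[0:end]
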
Problems:

4.5 Priority Queues:
Priority queues are normally implemented using a heap. A max-priority queue is used in computer resource scheduling; a min-priority queue is used in event simulation, where events are processed in order of increasing time value. The methods a max-priority queue normally supports are:
HEAP-MAXIMUM, // returns the maximum element
HEAP-EXTRACT-MAX, // returns the maximum element and removes it from the queue
HEAP-INCREASE-KEY, // increases the key of an element in the queue
MAX-HEAP-INSERT, // inserts a new element into the queue




Implementation:
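The original code is missing; the following is a sketch of the max-priority-queue operations in Python, built on the max_heapify helper from the heap sort sketch above (in practice, Python's standard heapq module already provides a min-heap):

def heap_maximum(a):
    return a[0]  # the root of a max-heap is the maximum

def heap_extract_max(a):
    """Return and remove the maximum element; O(lg n)."""
    if not a:
        raise IndexError("heap underflow")
    maximum = a[0]
    a[0] = a[-1]
    a.pop()
    if a:
        max_heapify(a, 0, len(a))
    return maximum

def heap_increase_key(a, i, key):
    """Increase a[i] to key and float it up toward the root; O(lg n)."""
    if key < a[i]:
        raise ValueError("new key is smaller than current key")
    a[i] = key
    while i > 0 and a[(i - 1) // 2] < a[i]:
        a[i], a[(i - 1) // 2] = a[(i - 1) // 2], a[i]
        i = (i - 1) // 2

def max_heap_insert(a, key):
    """Insert key into the max-heap; O(lg n)."""
    a.append(float("-inf"))
    heap_increase_key(a, len(a) - 1, key)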

Problems:

5. Quick Sort
Quick Sort is the most interesting sorting algorithm reviewed so far. Its performance depends on the partitioning, and the analysis of it is nontrivial. I first present the idea of quicksort and some fundamental insights that come to most people's minds, and I will discuss it in depth at the end of this post. The idea can really be seen as an evolution of merge sort: it uses divide and conquer and recursion. Remember that in merge sort, the MERGE() method assumes the subarrays being merged are already sorted. Quicksort has to do the sorting from scratch. The way it does this is to select a pivot element; during the "divide" step (the divide in divide and conquer), each subarray element is compared to the pivot element and positioned correctly, so there is nothing left to do when the recursive calls "merge" back. Think about it; if you can understand what is said here, go back to the textbook for reference.

The Quick Sort pseudocode:


The figure that illustrates how Quick Sort works:
When writing quicksort code, keep this figure in mind, along with the loop invariant through initialization, maintenance, and termination: the elements from p to i are always less than or equal to the pivot, and the elements from i+1 to j-1 are always greater than the pivot A[r]. Don't forget that i is initialized to p-1, which indicates that initially the smaller subarray (elements no greater than the pivot) contains 0 elements.

The implementation of Quick Sort:
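The code is missing from this note; here is a sketch in Python following the CLRS scheme (last element as pivot, 0-based indices):

def partition(a, p, r):
    """Partition a[p..r] around the pivot a[r]; return the pivot's index.

    Loop invariant: a[p..i] <= pivot and a[i+1..j-1] > pivot.
    """
    pivot = a[r]
    i = p - 1  # the "smaller" region starts out empty
    for j in range(p, r):
        if a[j] <= pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[r] = a[r], a[i + 1]  # put the pivot in its final slot
    return i + 1

def quicksort(a, p=0, r=None):
    if r is None:
        r = len(a) - 1
    if p < r:
        q = partition(a, p, r)
        quicksort(a, p, q - 1)   # elements <= pivot
        quicksort(a, q + 1, r)   # elements > pivot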

Analysis of Quick Sort
We want to take a look at the worst-case performance, the best-case performance, and the average-case performance (expected running time).

Worst-case performance
If the partitioning routine produces one subproblem with n-1 elements and one with 0 elements, and we assume this happens in every recursive call, then each recursive call spends Theta(n) on partitioning, so the recurrence for the running time is T(n) = T(n-1) + Theta(n).
Using the substitution method gives T(n) = Theta(n^2). So in the worst case, when the partitioning is maximally unbalanced in every recursive call, quicksort runs no better than insertion sort. Unfortunately, this quadratic running time occurs when the input array is already sorted.

Best-case performance
When each of the subproblems has size no more than n/2, quicksort is much faster; the recurrence for the running time is then T(n) = 2T(n/2) + Theta(n).
The solution is T(n) = Theta(n lg n), which is much better than insertion sort.

Average-case performance
It is much closer to the best case than to the worst case. The insight really hides in how the quality of the partitioning is reflected in the recurrence that describes the running time. Assume partitioning always produces a 9-to-1 split, which seems quite unbalanced and easy to believe close to worst-case performance. In fact this intuition is wrong. We obtain the recurrence T(n) = T(9n/10) + T(n/10) + cn.
Solving this recurrence gives T(n) = Theta(n lg n). We can also take a look at the recursion tree of this T(n).
The recursion tree tells us there is always a Theta(lg n) bound on the depth of the tree (the deepest path, which keeps taking the 9/10 branch, has depth log_{10/9} n), and each level costs at most cn, so the running time is O(n lg n). This is quite counterintuitive when we first think about it. As a matter of fact, even a 99-to-1 split yields O(n lg n).

A Randomized version of Quicksort
The idea is to randomly select one element from the subarray and swap it with the last element, then do the partition as before. RANDOMIZED-QUICKSORT() is based on this new partitioning method.
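A sketch of that idea in Python, reusing partition() from the quicksort sketch above:

import random

def randomized_partition(a, p, r):
    """Swap a uniformly random element into the pivot slot, then partition."""
    k = random.randint(p, r)  # randint is inclusive on both ends
    a[k], a[r] = a[r], a[k]
    return partition(a, p, r)

def randomized_quicksort(a, p=0, r=None):
    if r is None:
        r = len(a) - 1
    if p < r:
        q = randomized_partition(a, p, r)
        randomized_quicksort(a, p, q - 1)
        randomized_quicksort(a, q + 1, r)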

The expected running-time analysis in the book assumes that, at each level of recursion, the split induced by RANDOMIZED-PARTITION puts any constant fraction of the elements on one side of the partition; the recursion tree then has depth Theta(lg n), and O(n) work (the running time of the partition method) is performed at each level, giving O(n lg n) expected time.

Tail-Recursive version of Quicksort and Stack Depth
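Problem 7-4 is still on the to-do list below, but the standard trick is to recurse on only one side and loop on the other; always recursing on the smaller side bounds the stack depth by O(lg n). A sketch reusing partition() from above:

def tail_recursive_quicksort(a, p=0, r=None):
    """Recurse on the smaller partition, loop on the larger one."""
    if r is None:
        r = len(a) - 1
    while p < r:
        q = partition(a, p, r)
        if q - p < r - q:
            tail_recursive_quicksort(a, p, q - 1)  # left side is smaller
            p = q + 1                              # loop on the right side
        else:
            tail_recursive_quicksort(a, q + 1, r)  # right side is smaller
            r = q - 1                              # loop on the left side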

ToDo: Solve problem 7-4 in the CLRS textbook.


Analysis of running time and comparisons

Implementation of quicksort: write a linked-list version and an array version.

Problems:
1. What is the running time of QUICKSORT when all elements of array A have the same value?
2. What is the running time of QUICKSORT when the elements of the array are already sorted in increasing order?
3. What is the running time of QUICKSORT when the elements of the array are sorted in decreasing order?
Answer: O(n^2) in all three cases. Because the comparison with the pivot is inclusive ("<=" in the quicksort pseudocode), when all elements are equal every comparison succeeds, so the partition is the worst-case partition: 0 and n-1. Whether the elements are sorted in increasing or decreasing order, the partition is also always the worst-case partition, which results in a quadratic running time. In other words, if the split is 0 and n-1, the recursion tree has linear height, and at each level (each call to PARTITION()) the work is linear, giving O(n^2) in all three cases. You can understand it this way: given a subarray A[p..r] of distinct elements in decreasing order, PARTITION produces an empty partition in A[p..q-1], puts the pivot (originally in A[r]) into A[p], and produces a partition A[p+1..r] with only one element fewer than A[p..r]. The recurrence for QUICKSORT becomes T(n) = T(n-1) + Theta(n), which has the solution T(n) = Theta(n^2).

Sunday, November 17, 2013

Topic 3: Binary Search Trees

Questions to be answered:
1. How to check whether a tree is balanced?
2. Number of internal nodes, external nodes, and height.

Todos:
1. binary search tree deletion.

Binary search tree property:

"Let x be a node in a binary search tree. If y is a node in the left subtree of x, then y:key <= x:key. If y is a node in the right subtree of x, then y:key >= x:key." 
Operations related to a binary search tree (a small code sketch follows this list):
  1. Inorder-Tree-Walk (recursive)
  2. Preorder-Tree-Walk 
    1. Print x.key before the recursive calls.
  3. Postorder-Tree-Walk 
    1. Print x.key after the recursive calls.
  4. Search(recursive)
  5. Search(iterative)
  6. Minimum
  7. Maximum
  8. Successor
    1. Two cases observed:
      1. If x's right subtree is not NIL, the successor is the minimum node of the right subtree.
      2. If x doesn't have a right subtree, its successor (if it exists) must be among the ancestors, above x in the tree. In this case the successor is the lowest ancestor (the first found when tracking back from the bottom up) whose left child is also an ancestor of x. (For instance, node 13's successor is 15 in the following tree.)
  9. Predecessor
    1. Similar to the successor, two cases:
      1. If the left subtree exists, the predecessor is the maximum of the left subtree.
      2. If the left subtree doesn't exist, the predecessor is the lowest ancestor whose right child is also an ancestor of the node.
  10. Insertion
    1. Two points to notice:
      1. y keeps a copy of x's parent pointer; it is also called a trailing pointer. By the time a NIL position is found for z, x has already become NIL, and only its previous value (saved in y) can be used to attach z.
      2. x is a pointer used to find the proper position in the tree. View it as a pointer used to walk the tree nodes looking for something.
  11. Deletion
    1. Deletion is non-trivial; there are three cases:
      1. z has at most one child: just replace z with that child (or with NIL); this does not affect the binary-search-tree property.
      2. z has two children and its successor y is z's right child: replace z by y.
      3. z has two children and its successor y is not z's right child. In this case, we still replace z by y, but it involves more operations: we need to rearrange nodes in z's right subtree to maintain the binary-search-tree property. It can be proved that z's successor y has no left subtree. So we do the following: (1) move y out of the right subtree by transplanting y's right subtree into y's place; note this does not violate the tree property. (2) Replace z by y, making z's right subtree y's right subtree and z's left subtree y's left subtree to complete the deletion. Note that this second step still keeps the tree a valid binary search tree. (If you don't understand, see CLRS page 297, Figure 12.4.)
  12. Transplant
    1. This function replaces the subtree rooted at node u with the subtree rooted at node v; node u's parent becomes node v's parent. Pay attention to updating v's parent with v.p = u.p when v is not NIL; if v is NIL, nothing more is needed, as replacing u with NIL does it all.
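To make the list above concrete, here is a small sketch of a few of these operations in Python (my own code; None plays the role of NIL):

class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.p = None  # p is the parent pointer

def tree_search(x, key):
    while x is not None and key != x.key:
        x = x.left if key < x.key else x.right
    return x

def tree_minimum(x):
    while x.left is not None:
        x = x.left
    return x

def tree_successor(x):
    if x.right is not None:  # case 1: minimum of the right subtree
        return tree_minimum(x.right)
    y = x.p                  # case 2: climb until we leave a left subtree
    while y is not None and x is y.right:
        x, y = y, y.p
    return y

def tree_insert(root, z):
    y, x = None, root        # y is the trailing pointer behind x
    while x is not None:
        y = x
        x = x.left if z.key < x.key else x.right
    z.p = y
    if y is None:
        return z             # the tree was empty; z is the new root
    if z.key < y.key:
        y.left = z
    else:
        y.right = z
    return root
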
How to build a balanced binary search tree?

Solving Real Tree Problems
  1. Write an iterative method to do an in-order traversal of a binary tree. Use two solutions: solution 1 can use a stack; in solution 2, a stack is not allowed. (A sketch of the stack-based solution follows this list.)
    Thoughts:

    1. Use a loop for the iteration; a stack serves as temporary storage.
    2. The first node to visit is the leftmost node in the tree; it has no left child.
    3. End the loop when the largest node has been visited (in order); the largest node is the rightmost.
    4. Algorithm: visit the leftmost node first; two possibilities exist: (1) it has no right child, (2) it has a right child. If it has no right child, visit it, then visit its parent, then go to the parent's right child.
    5. BE CAUTIOUS about the correctness of the if conditions.

    The non-stack solution can rely on a parent pointer, visiting a node's parent and grandparent after visiting its right child. This is straightforward to think of but the hardest to implement.

    The full solution to this problem can be found here: http://leetcode.com/2010/04/binary-search-tree-in-order-traversal.html
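A sketch of the stack-based solution in Python, assuming the Node class from the binary-search-tree sketch above (the linked page has the full discussion):

def inorder_iterative(root):
    """In-order traversal with an explicit stack; no recursion."""
    out, stack, node = [], [], root
    while stack or node is not None:
        if node is not None:
            stack.append(node)    # defer this node, go left first
            node = node.left
        else:
            node = stack.pop()    # the leftmost not-yet-visited node
            out.append(node.key)  # visit it
            node = node.right     # then traverse its right subtree
    return out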

Friday, November 15, 2013

Topic 2: Recursion, Divide and Conquer Methodology

Topic 1: Big O Notation and Complexity

Big O notation is an easy concept; its definition is the key to solving any problem related to it.

Another topic that needs to be well understood is the distinction between the best-case, worst-case, and average-case complexity of an algorithm. The following figure from Skiena's book is the best illustration of the point.


We can think about running an algorithm over all possible instances of data that can be fed to it. Each input instance can be abstracted as a point in the figure. The x-axis represents the size of the input problem (for sorting, the number of items to sort), and the y-axis denotes the number of steps taken by the algorithm on that instance.

Topic 0: Motivation of This Blog

Why did I start this blog?

For my qualification exam and technical interviews. When I prepared for my Ph.D. qualification exam, I found it very hard to keep indexable notes on the material I learned during preparation. I thought those notes would be valuable in the future when looking for an internship or a full-time job. So I came up with the idea of using this blog's label feature to manage all the posts, basically by the topics in each post.

Does it work?

My hands-on experience is limited. I know the ideas behind clever algorithms and the insights of most data structures, but I am really slow at solving technical interview problems. Why, I wondered? I finally figured it out: because I practiced too little. There is a proverb: practice makes perfect. Why not try to solve more problems and build this skill day by day? I finally came to this idea: I can increase my ability to solve technical questions the same way I work out to build muscle. After keeping up about two months of regular workouts with a proper diet, I am in much better shape. This inspired me a lot in pursuing the final goal of becoming a better algorithm problem solver. It will help a lot in the future.

How does this blog work?

Basically, I will keep all my daily study notes on data structures and algorithms in this blog, presented as clearly as possible. I will label them properly, for example with topics such as linked list or recursion, to facilitate the review process when needed. I also keep various learning resources in the top line. I should focus on only those resources: keep working, make notes, review notes, ... and finally grow my brain muscles. Believe it, you can do it, day after day, hour after hour. Just keep this as a habit.