This repository has been archived by the owner on Jan 5, 2021. It is now read-only.

loicteixeira/summer-school-2016

Summer School 2016

Summer School is a course in algorithms and data structures. The course is run via the WellingtonRuby Meetup and features weekly assignments and review sessions.

Assignments

  1. Array, Enumerable & Enumerator
  2. CircularBuffer & LinkedList
  3. Hash
  4. Binary Search Tree
  5. Binary Heap & Priority Queue
  6. Matrix & Page Rank
  7. Graph
  8. Algorithms
  9. Weighted Graph Algorithms

Notes

01 - Array, Enumerable & Enumerator

When an array reaches capacity, it must be extended: a new, larger chunk of memory is allocated and all existing elements are copied into it.

If the capacity were extended by exactly one slot each time, the very next append would trigger another copy; over n appends the copies form an arithmetic series, 1 + 2 + ... + n = (n^2 + n) / 2, i.e. O(n^2) total work. Doubling the capacity instead gives a geometric series, 1 + 2 + 4 + 8 + ... + 2^n = 2^(n+1) - 1, i.e. O(n) total work and amortised O(1) per append.
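The doubling strategy can be sketched as follows; the class and method names (`GrowableArray`, `push`) are illustrative, not the exercise's API:

```ruby
# A minimal sketch of geometric (doubling) growth, as used by dynamic arrays.
class GrowableArray
  attr_reader :capacity, :size

  def initialize
    @capacity = 1
    @size = 0
    @store = Array.new(@capacity)
  end

  def push(value)
    if @size == @capacity
      @capacity *= 2                                # double instead of adding one slot
      new_store = Array.new(@capacity)
      @size.times { |i| new_store[i] = @store[i] }  # copy existing elements over
      @store = new_store
    end
    @store[@size] = value
    @size += 1
    self
  end

  def [](index)
    @store[index]
  end
end
```

Pushing n elements copies at most 1 + 2 + 4 + ... < 2n elements in total, which is where the amortised O(1) append comes from.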

02 - Circular Buffer & Linked List

Linked list lookup is faster than array lookup if we assume:
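The CircularBuffer from the exercise can be sketched as a fixed-size array with wrapping read/write indices; the names (`write`, `read`) are illustrative, not the exercise's API:

```ruby
# A minimal circular (ring) buffer backed by a fixed-size array.
class CircularBuffer
  def initialize(capacity)
    @store = Array.new(capacity)
    @read_index = 0
    @write_index = 0
    @size = 0
  end

  def write(value)
    raise "buffer full" if @size == @store.length
    @store[@write_index] = value
    @write_index = (@write_index + 1) % @store.length  # wrap around
    @size += 1
  end

  def read
    raise "buffer empty" if @size.zero?
    value = @store[@read_index]
    @read_index = (@read_index + 1) % @store.length
    @size -= 1
    value
  end
end
```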

03 - Hashes

Use a hash when keys are sparse or are not integers at all.

For non-integer keys, Ruby calls hash on the key to obtain an integer. Ruby uses SipHash, which is collision resistant but not collision-free, so a collision resolution strategy is still needed.
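This is observable directly in Ruby: any object responds to `#hash`, and objects that are `eql?` must return the same hash value so they land in the same bucket:

```ruby
# Ruby obtains an integer for any key by calling `#hash` (SipHash-based for
# strings); equal keys must produce equal hash values.
a = "hello"
b = "hello"

raise unless a.hash.is_a?(Integer)
raise unless a.hash == b.hash   # eql? objects hash identically
```

Note that string hash values are seeded per process, so they differ between runs; only equality within a process is guaranteed.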

04 - Binary Tree

Binary Tree: each node has at most 2 children. Binary Search Tree: same, but each node holds a value, and every value in a node's left subtree is smaller than the node's value.

The shape of the tree depends on the order in which elements are inserted. Self-adjusting/balancing trees, however, keep the tree balanced regardless of insertion order.

A single insert or search is O(n) if the tree is not balanced (so building the whole tree is O(n^2)); it is O(log n) if the tree is balanced (building the whole tree is O(n log n)).

In-order traversal is easy with a recursive function (see each).

A normal implementation of eql? would only verify that the container holds the same values, but for the sake of this exercise it also checks the structure of the tree.
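A minimal sketch of recursive insert and in-order traversal (node and method names are illustrative):

```ruby
# A binary search tree node: smaller values go left, others go right.
class Node
  attr_accessor :value, :left, :right

  def initialize(value)
    @value = value
  end

  def insert(v)
    if v < @value
      @left ? @left.insert(v) : @left = Node.new(v)
    else
      @right ? @right.insert(v) : @right = Node.new(v)
    end
  end

  # In-order traversal yields the values in sorted order.
  def each(&block)
    @left&.each(&block)
    block.call(@value)
    @right&.each(&block)
  end
end
```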

05 - Binary Heap & Priority Queue

Index calculations for Binary Heaps:

  • In general the children of the value at index i are at index 2 * i + 1 and 2 * i + 2.
  • The parent of the value at index i is at index (i - 1) / 2.

The utility methods have a few preconditions:

  • sift_up assumes that the element at index j has a parent (i.e. that j >= 1).
  • sift_down assumes that the element at index i has 2 children.

Heapsort (O(n log n)) usually uses a max-heap: it repeatedly extracts the maximum from the front of the array and places it at the end.
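A heapsort sketch using the index formulas above (children of i at 2i+1 and 2i+2); it builds a max-heap in place, then repeatedly swaps the maximum to the end:

```ruby
# Restore the max-heap property below index i, considering only indices < upper.
def sift_down(a, i, upper)
  loop do
    l = 2 * i + 1
    r = 2 * i + 2
    largest = i
    largest = l if l < upper && a[l] > a[largest]
    largest = r if r < upper && a[r] > a[largest]
    break if largest == i
    a[i], a[largest] = a[largest], a[i]
    i = largest
  end
end

def heapsort(a)
  n = a.length
  # Build the max-heap, starting from the last parent node.
  ((n / 2) - 1).downto(0) { |i| sift_down(a, i, n) }
  # Swap the max to the end, shrink the heap, restore the heap property.
  (n - 1).downto(1) do |last|
    a[0], a[last] = a[last], a[0]
    sift_down(a, 0, last)
  end
  a
end
```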

06 - Matrix & Page Rank

Matrix

The inner product of two vectors is the sum of the products of corresponding elements, such as a•b = a1b1 + a2b2 + a3b3 + ... (e.g. with v1 = Vector[1, 2] and v2 = Vector[2, 3], v1 • v2 = 8).

The straightforward algorithm for calculating a floating-point dot product of vectors can suffer from catastrophic cancellation. To avoid this, approaches such as the Kahan summation algorithm are used. – Wikipedia

The product of two matrices is a matrix of the inner product of each row (of the first matrix) with each column (of the second matrix).
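Ruby's stdlib Matrix/Vector implement these definitions directly:

```ruby
require 'matrix'

# The inner product of two vectors: 1*2 + 2*3 = 8.
v1 = Vector[1, 2]
v2 = Vector[2, 3]
raise unless v1.inner_product(v2) == 8

# Each entry of a*b is the inner product of a row of `a` with a column of `b`.
a = Matrix[[1, 2], [3, 4]]
b = Matrix[[5, 6], [7, 8]]
raise unless a * b == Matrix[[19, 22], [43, 50]]
```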

Page Rank

Given the pages and links:

  • A links to B
  • B links to C
  • C links to B and D
  • D links to A

An adjacency matrix can be created to represent the pages links between each other. It will take n^2 space.

   A   B   C   D
A  0   1   0   0
B  0   0   1   0
C  0   1   0   1
D  1   0   0   0

Then it is transformed into a (right) stochastic matrix (a.k.a. probability matrix), where each value is a non-negative real number representing a probability and each row sums to 1. It gives the probability of moving from one page to another.

   A   B   C   D
A  0   1   0   0
B  0   0   1   0
C  0  0.5  0  0.5
D  1   0   0   0

The PageRank theory holds that an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is a damping factor d. Various studies have tested different damping factors, but it is generally assumed that the damping factor will be set around 0.85. – Wikipedia

The PageRank computation then multiplies this matrix by the damping factor and saves the result.

Separately, it creates another stochastic matrix of the same size, filled with 1/n, representing the probability of picking each page when the surfer stops following links. This second matrix is multiplied by the complement of the damping factor, (1 - d).

Both results are summed: follow_links_matrix * d + pick_new_page_matrix * (1 - d).

In addition to being stochastic, the resulting matrix is also irreducible (can get anywhere from anywhere) and aperiodic (it's possible to be on any page at any given step).
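The combination of the two matrices can be sketched with Ruby's stdlib Matrix, using the stochastic matrix above and d = 0.85 (variable names are illustrative):

```ruby
require 'matrix'

n = 4
d = 0.85

# Right stochastic matrix derived from the links A->B, B->C, C->B and C->D, D->A.
follow_links = Matrix[[0, 1,   0, 0],
                      [0, 0,   1, 0],
                      [0, 0.5, 0, 0.5],
                      [1, 0,   0, 0]]

# Uniform matrix: every page is equally likely when the surfer jumps.
pick_new_page = Matrix.build(n, n) { 1.0 / n }

m = follow_links * d + pick_new_page * (1 - d)
```

Each row of `m` still sums to 1, so the result remains stochastic, and every entry is now strictly positive, which makes the matrix irreducible and aperiodic.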

The PageRank is finally computed using power iteration (an eigenvalue algorithm) which, in short, starts from a non-zero vector, multiplies it by the matrix, and repeats the process with the resulting vector until the probability of being on any page reaches the stationary distribution.

# With `m` the matrix and `n` the number of pages in the matrix.
require 'matrix'

v0 = Vector.elements(Array.new(n, 1.0 / n))  # uniform starting vector
v1 = nil
loop do
  v1 = m * v0
  break if (v1 - v0).magnitude < 0.0001      # converged to the stationary distribution
  v0 = v1
end
# PageRank is the value of v1

Note: in this exercise each link is given the same weight, but the importance of a link could be weighted. Likewise, when the surfer stops following links and jumps to another page, every page has an equal chance of being selected, whereas in reality the surfer probably goes to a well-known page.

07 - Graph

Representing page links with a matrix takes n^2 space and has no explicit edges. A more efficient way to represent the relations between pages is a graph, which takes n + e space (nodes + edges, the edges being the links).

Recursive methods are simpler but might cause a stack overflow. Iterative methods can be used instead, but they then need an explicit stack to keep track of where you are conceptually.

It is not possible to produce a topological sort if the graph is cyclic (or if a portion of it is).

Note: in this exercise an array of size n is used to keep track of the visited nodes, which may not be practical when the nodes are not simple integers. Other solutions might involve hashing the node and checking membership in a set, or using a binary search tree.
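An adjacency-list graph with an iterative depth-first search can be sketched as follows; the explicit stack replaces the call stack of the recursive version, and a hash-backed Set tracks visited nodes so they need not be small integers:

```ruby
require 'set'

# Iterative DFS over an adjacency list (a hash of node => [neighbours]).
def dfs(adjacency, start)
  visited = Set.new
  order = []
  stack = [start]
  until stack.empty?
    node = stack.pop
    next if visited.include?(node)
    visited << node
    order << node
    # Push neighbours in reverse so they are visited in listed order.
    adjacency.fetch(node, []).reverse_each { |n| stack.push(n) }
  end
  order
end
```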

08 - Algorithms

Fisher–Yates shuffle

The Fisher–Yates shuffle is an algorithm for generating a random permutation of a finite set. [...] The algorithm effectively puts all the elements into a hat; it continually determines the next element by randomly drawing an element from the hat until no elements remain. The algorithm produces an unbiased permutation: every permutation is equally likely. – Wikipedia
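The modern in-place variant can be sketched in a few lines: walk the array from the end, swapping each element with a uniformly chosen element at or before it.

```ruby
# In-place Fisher–Yates shuffle; pass an explicit Random for reproducibility.
def fisher_yates_shuffle(array, rng = Random.new)
  (array.length - 1).downto(1) do |i|
    j = rng.rand(i + 1)                      # 0 <= j <= i, uniformly
    array[i], array[j] = array[j], array[i]
  end
  array
end
```

Drawing `j` from `0..i` (including `i` itself) is what makes every permutation equally likely; drawing from the whole array at each step would bias the result.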

Binary search algorithm

Binary search runs in O(log n) in the worst case (i.e. logarithmic time).
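An iterative sketch over a sorted array, halving the search interval at each step:

```ruby
# Returns the index of `target` in the sorted array, or nil if absent.
def binary_search(sorted, target)
  low = 0
  high = sorted.length - 1
  while low <= high
    mid = (low + high) / 2
    case sorted[mid] <=> target
    when 0  then return mid       # found
    when -1 then low = mid + 1    # target is in the right half
    else         high = mid - 1   # target is in the left half
    end
  end
  nil
end
```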

Knuth–Morris–Pratt algorithm

In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a word within a main text string by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters. – Wikipedia
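A sketch of the idea: precompute, for each prefix of the word, the length of its longest proper prefix that is also a suffix (the failure function), then scan the text without ever re-examining matched characters.

```ruby
# failure[i] = length of the longest proper prefix of word[0..i] that is
# also a suffix of it.
def kmp_failure(word)
  failure = Array.new(word.length, 0)
  k = 0
  (1...word.length).each do |i|
    k = failure[k - 1] while k > 0 && word[i] != word[k]
    k += 1 if word[i] == word[k]
    failure[i] = k
  end
  failure
end

# Index of the first occurrence of `word` in `text`, or nil.
def kmp_index(text, word)
  return 0 if word.empty?
  failure = kmp_failure(word)
  k = 0
  text.each_char.with_index do |c, i|
    # On a mismatch, fall back in the word instead of rewinding the text.
    k = failure[k - 1] while k > 0 && c != word[k]
    k += 1 if c == word[k]
    return i - word.length + 1 if k == word.length
  end
  nil
end
```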

09 - Weighted Graph

Floyd–Warshall

The Floyd–Warshall algorithm is an algorithm for finding shortest paths in a weighted graph – Wikipedia

Every combination of edges is tested by incrementally improving an estimate on the shortest path between two vertices, until the estimate is optimal.
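This can be sketched over a distance matrix (Float::INFINITY for missing edges): for each intermediate vertex k, improve the estimate dist[i][j] whenever going through k is shorter.

```ruby
# All-pairs shortest paths; mutates and returns the given distance matrix.
def floyd_warshall(dist)
  n = dist.length
  n.times do |k|
    n.times do |i|
      n.times do |j|
        through_k = dist[i][k] + dist[k][j]
        dist[i][j] = through_k if through_k < dist[i][j]
      end
    end
  end
  dist
end
```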

Dijkstra

Dijkstra's algorithm is an algorithm for finding the shortest paths between nodes in a graph – Wikipedia
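A sketch using a plain array for extraction instead of the priority queue from exercise 05 (Ruby's stdlib has no binary heap); the graph shape used here (a hash of node => { neighbour => weight }) is an assumption for illustration:

```ruby
# Shortest distances from `source` to every node of a weighted digraph.
def dijkstra(graph, source)
  dist = Hash.new(Float::INFINITY)
  dist[source] = 0
  unvisited = graph.keys.dup
  until unvisited.empty?
    # With a priority queue this extraction would be O(log n) instead of O(n).
    node = unvisited.min_by { |n| dist[n] }
    unvisited.delete(node)
    graph[node].each do |neighbour, weight|
      candidate = dist[node] + weight
      dist[neighbour] = candidate if candidate < dist[neighbour]
    end
  end
  dist
end
```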

Prim

Prim's algorithm is a greedy algorithm that finds a minimum spanning tree for a weighted undirected graph. This means it finds a subset of the edges that forms a tree that includes every vertex, where the total weight of all the edges in the tree is minimized. The algorithm operates by building this tree one vertex at a time, from an arbitrary starting vertex, at each step adding the cheapest possible connection from the tree to another vertex – Wikipedia
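The "cheapest possible connection" step can be sketched naively by scanning all edges leaving the tree at each iteration; the graph shape (a hash of node => { neighbour => weight }, with both directions listed) is an assumption for illustration:

```ruby
# Minimum spanning tree of a connected, undirected weighted graph,
# returned as a list of [from, to, weight] edges.
def prim(graph, start = graph.keys.first)
  in_tree = [start]
  edges = []
  until in_tree.length == graph.length
    # Cheapest edge from a node inside the tree to a node outside it.
    from, to, weight = in_tree.flat_map { |u|
      graph[u].reject { |v, _| in_tree.include?(v) }.map { |v, w| [u, v, w] }
    }.min_by { |_, _, w| w }
    in_tree << to
    edges << [from, to, weight]
  end
  edges
end
```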

Annex

Big O notation

In computer science, big O notation is used to classify algorithms according to how their running time or space requirements grow as the input size grows. – Wikipedia

Running time   Name
O(1)           constant time
O(log n)       logarithmic time
O(n)           linear time
O(n log n)     linearithmic time
O(n^2)         quadratic time
O(n^3)         cubic time

Random

Most default implementations of random number generation are only pseudo-random, and Ruby's Random is no different, although Ruby also provides SecureRandom.