Spaghetti Optimization

An informal and biased Tutorial on Kantorovich-Wasserstein distances

2018-12-31T13:13:00+01:00

Two years ago, I started to study Computational Optimal Transport (OT), and, now, it is time to wrap up informally the main ideas by using an Operations Research (OR) perspective with a Machine Learning (ML) motivation.

Yes, an OR perspective.

Why an OR perspective?

Well, because most of the current theoretical works on Optimal Transport have a strong functional analysis bias, and, hence, the are pretty far to be an “easy reading” for anyone working on a different research area. Since I’m more comfortable with “summations” than with “integrations”, in this post I focus only on Discrete Optimal Transport and on Kantorovich-Wasserstein distances between a pair of discrete measures.

Why an ML motivation?

Because measuring the similarity between complex objects is a crucial basic step in several #machinelearning tasks. Mathematically, in order to measure the similarity (or dissimilarity) between two objects we need a metric, i.e., a distance function. And Optimal Transport gives us a powerful similarity measure based on the solution of a Combinatorial Optimization problem, which can be formulated and solved with Linear Programming.

The main Inspirations for this post are:

My current favorite tutorial is Optimal Transport on Discrete Domains, by Justin Solomon.
Two years ago, I started to study this topic thanks to Giuseppe Savaré, and I wrote this post by looking also to his set of slides. He has several interesting papers and you can look at his publications list.
For a complete overview of this topic look at the Computational Optimal Transport book, by Gabriel Peyré and Marco Cuturi. These two researchers, together with Justin Solomon, where among the organizers of two #NeurIPS workshops, in 2014 and in 2017, on Optimal Transport and Machine Learning.
For a broader history of Combinatorial Optimization till 1960, see this manuscript by Alexander Schrijver.

DISCLAIMER 1: This is a long “ongoing” post, and, despite my efforts, it might contain errors of any type. If you have any suggestion for improving this post, please (!), let me know about: I will be more than happy to mention you (or any of your avatars) in the acknowledgement section. Otherwise, if you prefer, I can offer you a drink, whenever we will meet in real life.

DISCLAIMER 2: I wrote this post while reading the book Ready Player One, by Ernest Cline.

DISCLAIMER 3: I’m recruiting postdocs. If you like the topic of this post and you are looking for a postdoc position, write me an email.

Similarity measures and distance functions

A metric is a function, usually denoted by $d$ , between a pair of objects belonging to a space $X$ :

$d : X \times X \rightarrow \mathbb{R}_+$

Given any triple of points $x,y,z \in X$ , the conditions that $d$ must satisfy in order to be a metric are:

$d(x,y) \geq 0$ (non negativity)
$d(x,y) = 0 \Leftrightarrow x=y$ (identity of indiscernibles)
$d(x,y) = d(y,x)$ (symmetry)
$d(x,z) \leq d(x,y) + d(y,z)$ (triangle inequality or subadditivity)

If the space $X$ is $\mathbb{R}^k$ , then $\mathbf{x}, \mathbf{y}, \mathbf{z}$ are vectors of $k$ elements, and the most common distance is indeed the Euclidean function

$d(\mathbf{x},\mathbf{y}) = \sqrt{\sum_{i = 1}^k(x_i - y_i)^2}$

where $x_i$ is the $i$ -th component of the vector $\mathbf{x}$ . Clearly, the algorithmic complexity of computing (in finite precision) this distance is linear with the dimension of $X$ .

QUESTION: What if we want to compute the distance between a pair of clouds of $n$ points defined in $\mathbb{R}^k$ ?

If we want to compute the distance between the two vectors, that represent the two clouds of $n$ points, we need to define a distance function.

Let me fix the notation first. If $\mathbf{x}$ is a vector, then $x_i$ is the $i$ -th element. Suppose we have two matrices $\mathbf{X}$ and $\mathbf{Y}$ with $n \times k$ elements, which represent $n$ points in $\mathbb{R}^k$ . We denote by $\mathbf{x}_i$ the $i$ -th row of matrix $\mathbf{X}$ , and by $x_{ij}$ the $j$ -th element of row $i$ . Indeed, the rows $\mathbf{x}_i$ and $\mathbf{y}_i$ of the two matrices give the coordinates $x_{i1},\dots,x_{ik}$ and $y_{i1},\dots,y_{ik}$ of the two corresponding points.

Whenever $k=1$ , a simple choice is to consider the Minkowski Distance, which is a metric for normed vector spaces:

$M_p(\mathbf{x},\mathbf{y}) = \left( \sum_{i=1}^n \mid x_i - y_i\mid^p \right)^{\frac{1}{p}}$

where typical values of $p$ are:

$p=1$ (Manhattan distance)
$p=2$ (Euclidean distance, see above)
$p=\infty$ (Infinity distance)

We have also the Minkowski norm, that is a function

$\ell_p : \mathbb{R} \rightarrow \mathbb{R}_+$

computed as

$\ell_p(\mathbf{x}) = \left( \sum_{i=1}^n \mid x_i \mid^p\right)^{\frac{1}{p}}$

Whenever $k>1$ , we have to consider a more general distance function:

$D : \mathbb{R}^{n \times k} \times \mathbb{R}^{n \times k} \rightarrow \mathbb{R}_+$

such that the relations (1)-(4) are satisfied. We could use as distance function $D(\mathbf{X},\mathbf{Y})$ any matrix norm, but, to begin with, we can use the Minkowski distance twice in cascade as follows.

First, we compute the distance between a pair of points in $\mathbb{R}^k$ , which we call the ground distance, using $M_q$ , with $q\geq 1$ . By applying this function to all the $n$ pairs of points, we get a vector $\mathbf{z}$ of $n$ non-negative values:

$z_i = M_q(\mathbf{x}_i,\mathbf{y}_i), \quad i=1,\dots, n$

Second, we apply the Minkowski norm $\ell_p$ to the vector $\mathbf{z}$ .

Composing these two operations, we can define a distance function between a pair of vectors of points (i.e., pair of matrices) as follows:

$D_{p,q}(\mathbf{X},\mathbf{Y}) = \ell_p(\mathbf{z}) = \ell_p(\left< M_q(\mathbf{x}_i,\mathbf{y}_i)\right>) = \left( \sum_{i=1}^n \left( \sum_{j = 1}^n \left(x_{ij} - y_{ij}\right)^q \right)^{\frac{p}{q}} \right)^{\frac{1}{p}}$

Note that for $p=q=2$ , we get

$D_{2,2}(\mathbf{X},\mathbf{Y}) = \sqrt{\sum_{i=1}^n \sum_{j = 1}^n \left(x_{ij} - y_{ij}\right)^2} = \mid\mid \mathbf{X} - \mathbf{Y}\mid\mid_F$

which is the Frobenius Norm of the element wise difference $\mathbf{X} - \mathbf{Y}$ .

The main drawback of this distance function is that it implicitly relies on the order (position) of the single points in the two input vectors: any permutation of one (or both) of the two vectors will yield a different value of the ground distance. This happens because the distance function between the two input vectors considers only “interactions” between the $i$ -th pair of points stored at the same $i$ -th position in the two vectors.

IMPORTANT. Here is where Discrete Optimal Transport comes into action: it offers an alternative distance function based on the solution of a Combinatorial Optimization problem, which is, in the simplest case, formulated as the following Linear Program:

$\begin{aligned}\mathcal{W}_d(\mathbf{X},\mathbf{Y}) := \min \;\;& \sum_{i=1}^n\sum_{j=1}^n d(\mathbf{x}_i,\mathbf{y}_j) \pi_{ij} \\ & \sum_{i=1}^n \pi_{ij} = 1& j=1,\dots,n \\ & \sum_{j=1}^n \pi_{ij} = 1& i=1,\dots,n \\ & \pi_{ij} \geq 0 & i=1,\dots,n, j=1,\dots,n . \end{aligned}$

If you have a minimal #orms background at this point you should have recognized that this problem is a standard Assignment Problem: we have to assign each point of the first vector $\mathbf{x}$ to a single point of the second vector $\mathbf{y}$ , in such a way that the overall cost is minimum. From an optimal solution of this problem, we can select among all possible permutations of the rows of $\mathbf{Y}$ , the permutation that gives the minimal value of the Frobenius norm.

Whenever the ground distance $d(\mathbf{x},\mathbf{y})$ is a metric, then $\mathcal{W}_d(\mathbf{X},\mathbf{Y})$ is a metric as well. In other terms, the optimal value of this problem is a measure of distance between the two vectors, while the optimal values of the decision variables $\mathbf{\pi}$ gives a mapping from the rows of $\mathbf{X}$ to the rows of $\mathbf{Y}$ (in OT terminology, an optimal plan). This is possible because the LP problem has a Totally Unimodular coefficient matrix, and, hence, every basic optimal solution of the LP problem has integer values.

WAIT, LET ME STOP HERE FOR A SECOND!

I am being too technical, too early. Let me take a step back in History.

Once Upon a Time: from Monge to Kantorovich

The History of Optimal Transport is quite fascinating and it begins with the Mémoire sur la théorie des déblais et des remblais by Gaspard Monge (1746-1818). I like to think of Gaspard as visiting the Dune of Pilat, near Bordeaux, and then writing his Mémoire while going back to home… but this is only my imagination. Still, particles of sand give me the most concrete idea for passing from a continuous to a discrete problem.

In his Mémoire, Gaspard Monge considered the problem of transporting “des terres d’un lieu dans un autre” at minimal cost. The idea is that we have first to consider the cost of transporting a single molecule of “terre”, which is proportional to its weight and to the distance from its initial and final position. The total cost of transportation is given by summing up the transportation cost of each single molecule. Using the Lex Schrijver’s words, Monge’s transportation problem was camouflaged as a continuous problem.

The idea of Gaspard is to assign to each initial position a single final destination: it is not possible to split a molecule into smaller parts. This unsplittable version of the problem posed a very challenging problem where “the direct methods of the calculus of variations fail spectacularly” (Lawrence C. Evans, see link at page 5). This challenge stayed unsolved until the work of Leonid Kantorovich (1912-1986).

Curiously, Leonid did not arrive to the transportation problem while studying directly the work of Monge. Instead, he was asked to solve an industrial resource allocation problem, which is more general than Monge’s problem. Only a few years later, he reconsidered his contribution in terms of the continuous probabilistic version of Monge’s transportation problem. However, I am unable to state the true technical contributions of Leonid with respect to the work of Gaspard in a short post (well, honestly, I would be unable even in an infinite post), but I recommend you to read the Long History of the Monge-Kantorovich Transportation Problem.

Anyway, I have pretty clear the two main concepts that are the foundations of the work by Kantorovich:

RELAXATION: He relaxes the problem posed by Gaspard and he proves that an optimal solution of the relaxation equals an optimal solution of the original problem (Here is the link to the very unofficial soundtrack of his work: “Relax, take it easy!”). Indeed, Leonid relaxed formulation allows each molecule to be split across several destinations, differently from the formulation of Monge. In OR terms, he solves a Hitchcock Problem, not an Assignment Problem. For more details, see Chapter 1 of the Computational Optimal Transport.

DUALITY: Leonid uses a dual formulation of the relaxed problem. Indeed, he invented the dual potential functions and he sketched the first version of dual simplex algorithm. Unfortunately, I studied Linear Programming duality without having an historical perspective, and hence duality looks like obvious, but at the time is was clearly a new concept.

For the records, Leonid Kantorovich won the Nobel Prize, and his autobiography merits to be read more than once.

Well, I have still so much to learn from the past!

Two Fields medal winners: Alessio Figalli and Cedric Villani

If you think that Optimal Transport belongs to the past, you are wrong!

Last summer (2018), in Rio de Janeiro, Alessio Figalli won the Fields Medal with this citation:

“for his contributions to the theory of optimal transport, and its application to partial differential equations, metric geometry, and probability”

Alessio is not the first Fields medalist who worked on Optimal Transport. Already Cedric Villani, who wrote the most cited book on Optimal Transport [1], won the Fields Medal in 2010. I strongly suggest you to look any of his lectures available on Youtube. And … do you know that Villani spent a short period of time during his PhD in my current Math Dept. at University of Pavia?

As an extra bonus, if you don’t know what a Fields Medal is, you can have Robin Williams to explain the prize in this clip taken from Good Will Hunting.

Discrete Optimal Transport and Linear Programming

It is time of being technical again and to move from the Monge assignment problem, to the Kantorovich “relaxed” transportation problem. In the assignment model presented above, we are implicitly considering that all the positions are occupied by a single molecule of unitary weight. If we want to consider the more general setting of Discrete Optimal Transport, we need to consider the “mass” of each molecule, and to formulate the problem of transporting the total mass at minimum cost. Before presenting the model, we define formally the concept of a discrete measure and of the cost matrix between all pairs of molecule positions.

DISCRETE MEASURES: Given a of vector of $n$ positions $\mathbf{x}_i$ , and given the Dirac delta function, we can define the Dirac measures $\delta_{\mathbf{x}_i}$ as

$\delta_{\mathbf{x}_i}(A) = \left\{ \begin{array}{ll} 1 & \text{if} \;\;\mathbf{x}_i \in A \subseteq X \\ 0 & \text{if} \;\;\mathbf{x}_i \notin A \subseteq X \\ \end{array}\right.$

Given a vector of $n$ weights $\mu_i$ , one associated to each element of $\mathbf{x}$ , we can define the discrete measure $\mathbf{\mu}$ as

$\mu(A) = \sum_{i=1}^n \mu_i \delta_{\mathbf{x}_i}$

Note the $\mu$ is a function of type: $\mu : A \rightarrow \mathbb{R}_+$ , for any subset $A$ of $X$ . The vector $\mathbf{x}$ is called the support of the measure $\mathbf{\mu}$ . In Computer Science terms, a discrete measure is defined by a vector of pairs, where each pair contains a positive number $\mu_i$ (the measured value) and its support point $\mathbf{x}_i$ (the location where the measure occurred). Note that $\mathbf{x}_i$ is a (small?) vector storing the coordinates of the $i$ -th point.

COST MATRIX: Given two discrete measures $\mathbf{\mu}$ and $\mathbf{\nu}$ , the first with support $\mathbf{x}$ and the second with support $\mathbf{y}$ , we can define the following cost matrix:

$c_{ij} = d(\mathbf{x}_i, \mathbf{y}_j), \quad \text{for } i=1,\dots,n \text{ and } j= 1,\dots, m$

where $d$ is a distance function, such as, for instance, the Minkowski distance $M_q(\mathbf{x},\mathbf{y})$ defined before. Note that $\mathbf{\mu}$ has $n$ elements, and $\mathbf{\nu}$ has $m$ elements.

Kantorovich-Wasserstein distances between two discrete measures

At this point, we have all the basic elements to define the Kantorovich-Wasserstein distance function between discrete measures in terms of the solution of a (huge) Linear Program.

INPUT: Two discrete measures $\mathbf{\mu}$ and $\mathbf{\nu}$ defined on a metric space $X$ , and the corresponding supports $\mathbf{x}$ and $\mathbf{y}$ , having $n$ and $m$ elements, respectively. A distance function $d$ , which permits to compute the cost $c_{ij}$ .

OUTPUT: A transportation plan $\mathbf{\pi}$ and a value of distance of $\mathcal{W}(\mathbf{\mu}, \mathbf{\nu})$ that corresponds to an optimal solution of the following Linear Program:

$\begin{aligned} \mathcal{W}(\mathbf{\mu},\mathbf{\nu}) := \min \;\;& \sum_{i=1}^n\sum_{j=1}^m c_{ij} \pi_{ij} \\ & \sum_{j=1}^m \pi_{ij} \geq \mu_i & i=1,\dots,n \\ & \sum_{i=1}^n \pi_{ij} \leq \nu_j& j=1,\dots,m \\ & \pi_{ij} \geq 0 & i=1,\dots,n, j=1,\dots,m. \end{aligned}$

This Linear Program is indeed a special case of the Transportation Problem, known also as the Hitchcock-Koopmans problem. It is a special case because the cost vector has a strong structure that should be exploited as much as possible.

Computational Challenge: While the previous problem is polynomially solvable, the size of practical instances is very large. For instance, if you want to compute the distance between a pair of grey scale images of resolution $512 \times 512$ pixels, you end up with an LP with $512^4 = 68\;719\;476\;736$ cost coefficients. Hence, these problems must be handled with care. If you want to see how the solution time and memory requirement scale for grey scale image, please, have a look at the slides of my talk at Aussois (2018).

KANTOROVICH-WASSERSTEIN DISTANCE

Whenever

The two measure are discrete probability measures, that is, both $\sum_{i=1}^n \mu_i = 1$ and $\sum_{j = 1}^m \nu_j = 1$ (i.e., $\mathbf{\mu}$ and $\mathbf{\nu}$ belongs to the probability simplex), and,
The cost vector is defined as the $p$ -th power of a distance,

then we define the Kantorovich-Wasserstein distance of order $p$ as the following functional:

$\mathcal{W}_p(\mathbf{\mu},\mathbf{\nu}) := \left(\min_{\pi \in U} \sum_{i=1}^n\sum_{j=1}^m d(\mathbf{x}_i, \mathbf{y}_j)^p \pi_{ij} \right)^{\frac{1}{p}}$

where the set $U$ is defined as:

$U := \left\{\begin{array}{ll} \sum_{j=1,\dots,m} \pi_{ij} \leq \mu_i & i=1,\dots,n \\ \sum_{i=1,\dots,n} \pi_{ij} \geq \nu_j& j=1,\dots,m \\ \pi_{ij} \geq 0 & i=1,\dots,n, j=1,\dots,m. \\ \end{array} \right\}$

From a mathematical perspective, the most interesting case is the order $p=2$ , which generalizes the Euclidean distance to discrete probability vectors. Note that in this formulation, the two constraint sets defining $U$ could be replaced with equality constraints, since $\mathbf{\mu}$ and $\mathbf{\nu}$ belongs to the probability simplex. In addition, any Combinatorial Optimization algorithm must be used with care, since all the cost and constraint coefficients are not integer. Note: The $p$ power used for the ground distance must not be confused with the order (power) of a Kantorovich-Wasserstein distance.

EARTH MOVER DISTANCE

A particular case of the Kantorovich-Wasserstein distance very popular in the Computer Vision research community, is the so-called Earth Mover Distance (EMD) [2], which is used between a pair of $k$ -dimensional histograms obtained by preprocessing the images of interest. In this case, (i) we have $n=m$ , (ii) we do not require the discrete measures to belong to the probability simplex, and (iii) we do not even require that the two measures are balanced, that is, $\sum_{i=1}^n \mu_i = \sum_{j=1}^n \nu_j$ . In Optimal Transport terminology, the Earth Mover Distance solves an unbalanced optimal transport problem. For the EMD, the feasibility set $U$ is replaced by the set:

$V := \left\{\begin{array}{ll} \sum_{i=1}^n \sum_{j=1}^n \pi_{ij} = \min \{ \sum_{i=1}^n \mu_i, \sum_{j=1}^n \nu_j \}& \\ \sum_{j=1,\dots,n} \pi_{ij} \geq \mu_i & i=1,\dots,n \\ \sum_{i=1,\dots,n} \pi_{ij} \geq \nu_j& j=1,\dots,n \\ \pi_{ij} \geq 0 & i=1,\dots,n, j=1,\dots,n. \\ \end{array} \right\}$

The cost function is taken with the order $p=1$ :

$EMD(\mathbf{x},\mathbf{y}) := \min_{\pi \in V} \sum_{i=1}^n\sum_{j=1}^m d(x_i, y_j) \pi_{ij}$

The most used function $d$ is the Minkowski distance induced by the $\ell_p$ norm.

WORD MOVER DISTANCE

A very interesting application of discrete optimal transport is the definition of a metric for text documents [3]. The main idea is, first, to exploit a word embedding obtained, for instance, with the popular word2vec neural network [4], and, second, to formulate the problem of “transporting” a text document into another at minimal cost.

Yes, but … how we compute the ground distance between two words?

A word embedding associates to each word of a given vocabulary a vector of $\mathbb{R}^k$ . For the pre-trained embedding made available by Google at this archive, which contains the embedding of around 3 millions of words, $k$ is equals to 300. Indeed, given a vocabulary of $n$ words and fixed a dimension $k$ , a word embedding is given by a matrix $\mathbf{X}$ of dimension $n \times k$ : Row $\mathbf{x}_i$ gives the $k$ -dimensional vector representing the word embedding of word $i$ .

In this case, instead of having discrete measures, we deal with normalized bag-of-words (nBOW), which are vector of $\mathbb{R}^n$ , where $n$ denotes the number of words in the vocabulary. If a text document $\mathbf{\mu} \in \mathbb{R}^n$ contains $t_i$ times the word $i$ , then $\mathbf{\mu}_i = \frac{t_i}{\sum_{j=1}^n t_j}$ . At this point is clear that the ground distance between a pair of words is given by the distance between the corresponding embedding vectors in $\mathbb{R}^k$ , that is, given two words $i$ and $j$ , then

$c(i,j) = \ell_2(\mathbf{x}_i - \mathbf{x}_j)$

Finally, given two text documents $\mathbf{\mu}$ and $\mathbf{\nu}$ , we can formulate the Linear Program that gives the Word Mover Distance as:

$\text{WMD}(\mathbf{\mu},\mathbf{\nu}) := \min_{\pi \in U} \sum_{i=1}^n\sum_{j=1}^n c(\mathbf{x}_i, \mathbf{x}_j) \pi_{ij}$

where the set $U$ is defined as:

$U := \left\{\begin{array}{ll} \sum_{j=1,\dots,n} \pi_{ij} = \mu_i & i=1,\dots,n \\ \sum_{i=1,\dots,n} \pi_{ij} = \nu_j& j=1,\dots,n \\ \pi_{ij} \geq 0 & i=1,\dots,n, j=1,\dots,n. \\ \end{array} \right\}$

If you are serious reader, and you are still reading this post, then it is clear that the Word Mover Distance is exactly a Kantorovich-Wasserstein distance of order 1, with an Euclidean ground distance. If you are interested in the quality of this distance when used within a nearest neighbor heuristic for a text classification task, we refer to [3]. (While writing this post, I started to wonder how would perform a WMD of order 2, but this is another story…)

Interesting Research Directions

The following are the research topics I am interested in right now. Each topic deserves its own blog post, but let me write here just a short sketch.

Computational Challenges: The development of efficient algorithms for the solution of Optimal Transport problems is an active area of research. Currently, the preferred (heuristic) approach is based on so-called regularized optimal transport, introduced originally in [5]. Indeed, regularized optimal transport deserves its own blog post. In two recent works, together with my co-authors, we tried to revive the Network Simplex for two special cases: Kantorovich-Wasserstein distances of order 1 for $k$ -dimensional histograms [6] and Kantorovich-Wasserstein distances of order 2 for decomposable cost functions [7]. The second paper was presented as a poster at NeurIPS2018. (If you ever read any of the two papers, please, let me know what you think about them)
Unbalanced Optimal Transport: The computation of Kantorovich-Wasserstein distance for pair of unbalanced discrete measures is very challenging. Last year, a brilliant math student finished a nice project on this topic, which I hope to finalize during the next semester.
Barycenters of Discrete Measures: The Kantorovich-Wasserstein distance can be used to generalize the concept of barycenters, that is, the problem of finding a discrete measure that is the closest (in Kantorovich terms) to a given set of discrete measures. The problem of finding the barycenter can be formulated as a Linear Program, where the unknowns (the decision variables) are both the transport plan and the discrete measure representing the barycenter. For instance, the following images, taken from our last draft paper, represents the barycenters of each of the 10 digits of the MNIST data set of handwritten images (each image is the barycenter of other $3\;200$ images).

Distances between Discrete Measures defined on different metric spaces: This topic is at the top of my 2019 resolutions. There are a few papers by Facundo Mémoli on this topic, which are based on the so-called Gromov-Wasserstein distance and that requires the solution of a Quadratic Assignment Problem (QAP).

Now, it’s time to close this post with final remarks.

Optimal Transport: A brilliant future?

Given the number and the quality of results achieved on the Theory of Optimal Transport by “pure” mathematicians, it is the time to turn these theoretical results into a set of useful algorithms implemented in efficient and scalable solvers. So far, the only public library I am aware of is POT: Python Optimal Transport.

On Medium, C.E. Perez claims that Optimal Transport Theory (is) the New Math for Deep Learning. In his short post, he explains how Optimal Transport is used in Deep Learning algorithms, specifically in Generative Adversarial Networks (GANs), to replace the Kullback-Leibler (KL) divergence.

Honestly, I do not have a clear idea regarding the potential impact of Kantorovich distances on GANs and Deep Learning in general, but I think there are a lot of research opportunities for everyone with a strong passion for Computational Combinatorial Optimization.

And you, what do you think about the topics presented in this post?

As usual, I will be very happy to hear from you, in the meantime…

GAME OVER

Acknowledgement

I would like to thank Marco Chiarandini, Stefano Coniglio, and Federico Bassetti for constructive criticism of this blog post.

References

Villani, C. Optimal transport, old and new. Grundlehren der mathematischen Wissenschaften, Vol.338, Springer-Verlag, 2009. [pdf]
Rubner, Y., Tomasi, C. and Guibas, L.J., 2000. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2), pp.99-121. [pdf]
Kusner, M.J., Sun, Y., Kolkin, N.I., Weinberger, K.Q. From Word Embeddings To Document Distances. Proceedings of the 32 nd International Conference on Machine Learning, Lille, France, 2015. [pdf]
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J., 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119). [pdf]
Cuturi, M., 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems (pp. 2292-2300). [pdf]
Bassetti, F., Gualandi, S. and Veneroni, M., 2018. On the Computation of Kantorovich-Wasserstein Distances between 2D-Histograms by Uncapacitated Minimum Cost Flows. arXiv preprint arXiv:1804.00445. [pdf]
Auricchio, G., Bassetti, F., Gualandi, S. and Veneroni, M., 2018. Computing Kantorovich-Wasserstein Distances on d-dimensional histograms using (d+1)-partite graphs. NeurIPS, 2018.[pdf]

Exercise in Python: remove blanks from strings

2017-01-24T16:51:00+01:00

This morning, after reading this very nice post, I decided to challenge myself in Python and to have a look at the impact of mispredicted branches in a language different from C/C++. The basic idea was to use only Python builtins: external libraries are not allowed!

As a benchmark, I grabbed a large text file from P. Norvig’s website, which is 6’488’666 byte long.

The final answer? Yes, mispredicted branches have a huge impact in Python too.

The hidden answer? Python dictionaries ever stop to surprise me: they are REALLY efficient.

NOTE: The followig code snippets were executed in a Python 3.5 notebook, on a windows machine, running Windows 10 and Anaconda Python 3.5 64 bits. You can find my notebook on my Blog GitHub repo. Don’t ask me why, but this blog entry is better visualized directly on GitHub.

UPDATE: Well, most of the time I would use my first implementation based on the filter builtin function, and I would try for alternative implementations only after a profiler has shown that removing blanks is a true bottleneck of my whole program. As written in the title, this post is meant as a basic exercise in Python.

First attempt: Functional style

In Python, I prefer to write as much code in functional style as possible, relying on the 3 basic functions:

map
filter
reduce (this is in the functools module and it is not a true builtin)

Therefore, after few preliminaries, here is my first code snippet:

import cProfile

# Download file from: 'http://norvig.com/big.txt'
big = open('big.txt', 'r')

# Read the while file
test = big.read()

def RemoveBlanksFilter(in_str):
    result = filter(lambda c: c != '\r' and c != '\t' and c != ' ', in_str)
    return "".join(result)

test_result = RemoveBlanksFilter(test)

cProfile.run('RemoveBlanksFilter(test)')

         6488671 function calls in 1.956 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.955    1.955 :1(RemoveBlanksFilter)
  6488666    0.870    0.000    0.870    0.000 :2()
        1    0.000    0.000    1.956    1.956 :1()
        1    0.000    0.000    1.956    1.956 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    1.085    1.085    1.955    1.955 {method 'join' of 'str' objects}

Wow, I didn’t realize that I would have call the lambda function for every single byte of my input file. This is clearly too much overhead.

2nd attempt: remove function calls overhead

Let me drop my functional style, and write a plain old for-loop:

def RemoveBlanks(in_str):
    result = []
    for c in in_str:
        if c != '\r' and c != '\t' and c != ' ':
            result.append(c)
    return "".join(result)

print('Is test passed:', test_result == RemoveBlanks(test))

Is test passed: True

cProfile.run('RemoveBlanks(test)')

         5452148 function calls in 1.566 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.210    1.210    1.553    1.553 :1(RemoveBlanks)
        1    0.012    0.012    1.566    1.566 :1()
        1    0.000    0.000    1.566    1.566 {built-in method builtins.exec}
  5452143    0.310    0.000    0.310    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.033    0.033    0.033    0.033 {method 'join' of 'str' objects}

Mmm… we just shift the problem to the list append function calls. Maybe we can do better by working in place.

3rd attempt: work in place

Well, almost in place: Python string are immutable; therefore, we first copy the string into a list, and then we work in place over the copied list.

def RemoveBlanksInPlace(in_str):
    buffer = list(in_str)
    pos = 0
    for c in in_str:
        if c != '\r' and c != '\t' and c != ' ':
            buffer[pos] = c
            pos += 1
    return "".join(buffer[:pos])

print('Is test passed:', test_result == RemoveBlanksInPlace(test))

Is test passed: True

cProfile.run('RemoveBlanksInPlace(test)')

         5 function calls in 1.158 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.113    1.113    1.145    1.145 :1(RemoveBlanksInPlace)
        1    0.013    0.013    1.158    1.158 :1()
        1    0.000    0.000    1.158    1.158 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.032    0.032    0.032    0.032 {method 'join' of 'str' objects}

Ok, working in place does have an impact. Let me go on the true point: avoiding mispredicted branches.

4th attempt: to avoid mispredicted branches

As in the original blog post:

def RemoveBlanksNoBranch(in_str):
    # Build table
    table = []
    for ic in range(256):
        c = chr(ic)
        if c == '\r' or c == '\t' or c == ' ':
            table.append(0)
        else:
            table.append(1)

    # Removal
    buffer = list(in_str)
    pos = 0
    for c in in_str:
        buffer[pos] = c
        pos += table[ord(c)]  # ord() is a function --> bottleneck
    return "".join(buffer[:pos])


print('Is test passed:', test_result == RemoveBlanksNoBranch(test))

Is test passed: True

cProfile.run('RemoveBlanksNoBranch(test)')

         6489183 function calls in 1.474 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.235    1.235    1.460    1.460 :1(RemoveBlanksNoBranch)
        1    0.014    0.014    1.474    1.474 :1()
      256    0.000    0.000    0.000    0.000 {built-in method builtins.chr}
        1    0.000    0.000    1.474    1.474 {built-in method builtins.exec}
  6488666    0.192    0.000    0.192    0.000 {built-in method builtins.ord}
      256    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.033    0.033    0.033    0.033 {method 'join' of 'str' objects}

Ouch!!! These are getting even worse! Why? Well, ‘ord’ is a function, so we are getting back the overhead of function calls. Can we do better by using a dictionary instead of an array?

5th attempt: use a dictionary

Let me use a dictionary in order to avoid the ‘ord’ function calls.

def RemoveBlanksNoBranchDict(in_str):
    if type(in_str) != str:
        raise TypeError('This function works only for strings')

    # Build table
    table = {}
    for ic in range(256):
        c = chr(ic)
        if c == '\r' or c == '\t' or c == ' ':
            table[c] = 0
        else:
            table[c] = 1

    # Removal
    buffer = list(in_str)
    pos = 0
    for c in in_str:
        buffer[pos] = c
        pos += table[c]
    return "".join(buffer[:pos])


print('Is test passed:', test_result == RemoveBlanksNoBranchDict(test))

Is test passed: True

cProfile.run('RemoveBlanksNoBranchDict(test)')

         261 function calls in 0.771 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.724    0.724    0.758    0.758 :1(RemoveBlanksNoBranchDict)
        1    0.013    0.013    0.771    0.771 :1()
      256    0.000    0.000    0.000    0.000 {built-in method builtins.chr}
        1    0.000    0.000    0.771    0.771 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.034    0.034    0.034    0.034 {method 'join' of 'str' objects}

Oooh, yes! Now we can see that without mispredicted branches we can really speed up our algorithm.

Is this the best pythonic solution? No, surely not, but still it is an interesting remark to keep in mind when coding.

Final remark: a simple pythonic solution

Likely, the simplest pythonic solution is just to use the ‘replace’ string function as follows:

def RemoveBlanksBuiltin(in_str):
    s1 = in_str.replace('\r','')
    s2 = s1.replace('\t','')
    return s2.replace(' ','')


print('Is test passed:', test_result == RemoveBlanksBuiltin(test))

Is test passed: True

cProfile.run('RemoveBlanksBuiltin(test)')

         7 function calls in 0.065 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.064    0.064 :1(RemoveBlanksBuiltin)
        1    0.001    0.001    0.065    0.065 :1()
        1    0.000    0.000    0.065    0.065 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        3    0.063    0.021    0.063    0.021 {method 'replace' of 'str' objects}

Here we are, the best solution is indeed to use a builtin function, whenever it is possible, even if this was not the real aim of this exercise.

Please, let me know if you have some comments or a different solution in Python.

Graph Coloring: Column Generation or Column Enumeration?

2015-07-04T16:45:00+02:00

In this post, I like to share a simple idea on how to solve to optimality some hard instances of the Graph Coloring problem. This simple idea yields a “new time” record for a couple of hard instances.

To date, the best exact approach to solve Graph Coloring is based on Branch-and-Price [1, 2, 3]. The branch-and-price method is completely different from the Constraint Programming approach I discussed in a previous post. A key component of Branch-and-Price is the column generation phase, which is intuitively quite simple, but mathematically rather involved for a short blog post.

Here, I want to show you that a modern Mixed Integer Programming (MIP) solver, such as Gurobi or CPLEX, can solve a few hard instances of graph coloring with the following “null implementation effort”:

Enumerate all possible columns
Build a .mps instance with those columns
Use a MIP solver to solve the .mps instance

Indeed, in this post we try to answer to the following question:

Is there any hope to solve any hard graph coloring instances with this naive approach?

Formulation

Given an undirected graph $G=(V,E)$ and a set of colors $K$ , the minimum (vertex) graph coloring problem consists of assigning a color to each vertex, while every pair of adjacent vertices gets a different color. The objective is to minimize the number of colors used in a solution.

The branch-and-price approach to graph coloring is based on a set covering formulation. Let $S$ be the collection of all the maximal stable sets of $G$ , and let $S_i \subseteq S$ be the maximal stable sets that contain the vertex $i$ . Let $\lambda_s$ be a 0-1 variable equal to 1 if all the vertices in the maximal stable set $s \in S$ get assigned the same color. Hence, the set covering model is:

$\min \sum_{s \in S} \lambda_s \mbox{ such that } \sum_{s \in S_i} \lambda_s \geq 1, \forall i \in N, \lambda_s \in \{0,1\}, \forall s \in S.$

Indeed, we “cover” every vertex of $G$ with the minimal number of maximal stable sets. The issue with this model is the total number of maximal stable sets in $G$ , which is exponential in the number of vertices of G.

Column Generation is a “mathematically elegant” method to by-pass this issue: it lets you to solve the set covering model by generating a very small subset of the elements in $S$ . This happens by repeatedly solving an auxiliary problem, called the pricing subproblem. For graph coloring, the pricing subproblem consists of a Maximum Weighted Stable Set problem. If you are interested in Column Generation, I recommend you to look at the first chapter of the Column Generation book, which contains a nice tutorial on the topic, and I would strongly recommend reading the nice survey “Selected Topics in Column Generation”, [4].

How many maximal stable sets are in a hard graph coloring instance?

If this number were not so high, we could enumerate all the stable sets in $S$ and attempt to directly solve the set covering model without resorting to column generation. However, “high” is a subjective measure, so let me do some computations on my laptop and give you some precise numbers.

Hard instances

Among the DIMACS instances of Graph Coloring, there are a few instances proposed by David Johnson, which are still unsolved (in the sense that we have not a computational proof of optimality of the best known upper bounds).

The table below shows the dimensions of these instances. The name of instances are DSJC{n}.{d}, where {n} is the number of vertices and {d} gives the density of the graph (e.g., DSJC125.9 has 125 vertices and 0.9 of density).

Graph	Nodes	Edges	Max stable sets	Enumeration Time
DSJC125.9	125	6,961	524	0.00
DSJC250.9	250	27,897	2,580	0.01
DSJC500.9	500	112,437	14,560	0.12
DSJC1000.9	1,000	449,449	100,389	2.20
DSJC125.5	125	3,891	43,268	0.53
DSJC250.5	250	15,668	1,470,363	43.16
DSJC500.5	500	62,624	?	out of memory
DSJC1000.5	1,000	249,826	?	out of memory
DSJC125.1	125	736	?	out of memory
DSJC250.1	250	3,218	?	out of memory
DSJC500.1	500	12,458	?	out of memory
DSJC1000.1	1,000	49,629	?	out of memory

As you can see the number of maximal stable sets (i.e. the cardinality of $S$ ) of several instances is not so high, above all for very dense graphs, where the number of stables set is less than the number of edges. However, for sparse graphs, the number of maximal stable sets is too large for the memory available in my laptop.

Now, let me re-state the main question of this post:

Can we enumerate all the maximal stable sets of $G$ and use a MIP solver such as Gurobi or CPLEX to solve any Johnson’s instance of Graph Coloring?

Results

I have written a small script which uses Cliquer to enumerate all the maximal stable sets of a graph, and then I generate an .mps instance for each of the DSJC instance where I was able to store all maximal stable sets. The .mps file are on my public GitHub repository for this post.

The table below shows some numbers for the sparse instances obtained using Gurobi (v6.0.0) with a timeout of 10 minutes on my laptop. If you compare these numbers with the results published in the literature, you can see that they are not bad at all.

Believe me, these number are not bad at all, and establish a new TIME RECORD.

For example, the instance DSJC250.9 was solved to optimality only recently in 11094 seconds by [3], while the column enumeration approach solves the same instance on a similar hardware in only 23 seconds (!), and, honestly, our work in [2] did not solve this instance to optimality at all.

Graph	Best known	Enum. Time	Run time	LB	UB	Time [2]	LB[2]	UB [2]
DSJC125.9	44	0.00	0.44	44	44	44	44	44
DSJC250.9	72	0.01	23	72	72	timeout	71	72
DSJC500.9	128	0.12	timeout	123	128	timeout	123	136
DSJC1000.9	222	2.20	timeout	215	229	timeout	215	245
DSJC125.5	17	0.53	70.6	17	17	19033	17	17
DSJC250.5	28	43.16	timeout	26	33	timeout	26	31

Can we ever solve to optimality DSJC500.9 and DSJC1000.9 via Column Enumeration?

I would say:

“Yes, we can!”

… but likely we need to be smarter while branching on the decision variables, since the default branching strategy of a generic MIP solver does not exploit the structure of the problem. If I had the time to work again on Graph Coloring, I would likely use the same branching scheme used in [2], where we combined a Zykov’s branching rule with a randomized iterative deepening depth-first search (randomised because at each restart we were using a different initial pool of columns). Another interesting option would be to tighten the set covering formulation with valid inequalities, by starting with those studied in [5].

In conclusion, I believe that enumerating all columns can be a simple but good starting point to attempt to solve to optimality at least the instances DSJC500.9 and DSJC1000.9.

Do you have some spare time and are you willing to take up the challenge?

References

A Mehrotra, MA Trick. A column generation approach for graph coloring. INFORMS Journal on Computing. Fall 1996 vol. 8(4), pp.344-354. [pdf]
S. Gualandi and F. Malucelli. Exact Solution of Graph Coloring Problems via Constraint Programming and Column Generation. INFORMS Journal on Computing. Winter 2012 vol. 24(1), pp.81-100. [pdf] [preprint]
S. Held, W. Cook, E.C. Sewell. Maximum-weight stable sets and safe lower bounds for graph coloring. Mathematical Programming Computation. December 2012, Volume 4, Issue 4, pp 363-381. [pdf]
M. Lubbecke and J. Desrosiers. Selected topics in column generation. Operations Research. 2005, Volume 53, Issue 6, pp 1007-1023. [pdf]
Set covering and packing formulations of graph coloring: algorithms and first polyhedral results. Discrete Optimization. 2009, Volume 6, Issue 2, pp 135-147. [pdf]

Big Data and Convex Optimization

2014-09-27T16:38:00+02:00

In the last months, I came several times across different definitions of Big Data. However, when someone asks me what Big Data means in practice, I am never able to give a satisfactory explanation. Indeed, you can easily find a flood of posts on twitter, blogs, newspaper, and even scientific journals and conferences, but I always kept feeling that Big Data is a buzzword.

By sheer serendipity, this morning I came across three paragraphs clearly stating the importance of Big Data from a scientific standpoint, that I like to cross-post here (the following paragraphs appear in the introduction of [1]):

In all applied fields, it is now commonplace to attack problems through data analysis, particularly through the use of statistical and machine learning algorithms on what are often large datasets. In industry, this trend has been referred to as ‘Big Data’, and it has had a significant impact in areas as varied as artificial intelligence, internet applications, computational biology, medicine, finance, marketing, journalism, network analysis, and logistics.

Though these problems arise in diverse application domains, they share some key characteristics. First, the datasets are often extremely large, consisting of hundreds of millions or billions of training examples; second, the data is often very high-dimensional, because it is now possible to measure and store very detailed information about each example; and third, because of the large scale of many applications, the data is often stored or even collected in a distributed manner. As a result, it has become of central importance to develop algorithms that are both rich enough to capture the complexity of modern data, and scalable enough to process huge datasets in a parallelized or fully decentralized fashion. Indeed, some researchers have suggested that even highly complex and structured problems may succumb most easily to relatively simple models trained on vast datasets.

Many such problems can be posed in the framework of Convex Optimization.

Given the significant work on decomposition methods and decentralized algorithms in the optimization community, it is natural to look to parallel optimization algorithms as a mechanism for solving large-scale statistical tasks. This approach also has the benefit that one algorithm could be flexible enough to solve many problems.

Even if I am not an expert of Convex Optimization [2], I do have my own mathematical optimization bias. Likely, you may have a different opinion (that I am always happy to hear), but, honestly, the above paragraphs are the best content that I have read so far about Big Data.

References

[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning. Vol. 3, No. 1 (2010) 1–122. [pdf]

[2] If you like to have a speedy overview of Convex Optimization, you may read a J.F. Puget’s blog post.

The Impact of Preprocessing on the MIPLIB2003

2014-03-19T22:38:00+01:00

What do you know about preprocessing for Mixed Integer Programming (MIP) problems?

After a nice chat with Bo Jensen, CEO, founder, and co-owner (really, he is a Rocket Scientist!) at Sulum Optimization, I realised that I know barely anything.

By definition, we have that:

“Presolving is a way to transform the given problem instance into an equivalent instance that is (hopefully) easier to solve.” (see, chap. 10 in Tobias Achterberg’s Thesis)

All I know is that every MIP solver has a Presolve parameter, which can take different values. For instance, Gurobi has three possible values for that parameter (you can find more details on the Gurobi online manual):

Presolve=0: no presolve at all
Presolve=1: standard presolve
Presolve=2: aggressive presolve: “More aggressive application of presolve takes more time, but can sometimes lead to a significantly tighter model.”

However, I can’t tell you the real impact of that parameter on the overall solution process of a MIP instance. Thus, here we go: let me write a new post that addresses this basic question!

How to measure the Impact of Preprocessing?

To measure the impact of preprocessing we need four ingredients:

A MIP solver
A Data set
Computer power
A method to measure the impact of preprocessing

Changing one of the ingredients could give you different results, but, hopefully, the big picture will not change too much.

As a solver, I have selected the current release of Gurobi (i.e., version 5.6.2). For the data set, likely the most critical ingredient, I have used the MIPLIB2003, basically because I had already all the 60 instances on my server. For running the test I have used an old cluster from the Math Department of University of Pavia.

The measure of impact I have decided to use (after considering other alternatives) is quite conservative: the fraction of closed instances as a function of runtime.

During the last weekend, I have collected a bunch of logs for the 60 instances of the MIPLIB2003, and, then, using RStudio, I have draw the following cumulative plot:

The picture is as simple as clear:

Preprocessing does always pay-off and permits to solve around 10% more of the instances within the same time limit!

In this post, I will not discuss additional technical details, but I just want to add two observations:

Standard preprocessing has removed in average 20.3% of nonzero entries of the original model, while aggressive preprocessing has removed 22.5% of nonzero entries, only a few more.
The average MIP gaps as reported by Gurobi at timeout are: no-presolve = 13.44%, standard = 9.08%, and aggressive = 11.02%.

Likely, the aggressive presolve setting has been decided by Gurobi using a different, much larger, and customer-oriented dataset.

Open Questions

Indeed, preprocessing is a very important feature of a modern MIP solver as Gurobi. Investing few seconds before starting the branch-and-bound MIP search can save a significant amount of runtime. However, a more aggressive preprocessing strategy does not seem to payoff, in average, on the MIPLIB2003.

Unfortunately, preprocessing is somehow disregarded from the research community. There are few recent papers dealing with preprocessing (“ehi! if you do have one, please, let me know about it, ok?”). Most of papers are from the 90s and about Linear Programming, i.e., without integer variables, which mess up everything.

Here a list of basic questions I have in mind:

If cutting planes are used to approximate the convex hull of an Integer Problem, preprocessing for what is used for, really?
Preprocessing techniques have been designed considering a trade-off between efficiency and efficacy (see, MWP Savelsbergh, Preprocessing and Probing Techniques for MIP problems, Journal of Computing, vol6(4) 445-454, 1995). With recent progress in software and hardware technologies, can we revise this trade-off in favor of efficacy?
Preprocessing techniques used for Linear Programming are effective when applied to LP relaxations of Integer Problems?
Should preprocessing sparsify the coefficient matrix?
Using the more recent MIPLIB2010 should we expect much different results?
Which is a better method to measure the impact of preprocessing on a collection of instances?

If you want to share your idea, experience, or opinion, with respect to these questions, you could comment below or send me an email.

Now, to conclude, my bonus question:

Do you have any new smart idea for improving preprocessing?

Well, if you had, I guess you would at least write a paper about, but, do not go for a patent, please!

An Informal Report from the Combinatorial Optimization Workshop @ Aussois 2014

2014-01-13T18:30:00+01:00

It is very hard to report about the Combinatorial Optimization Workshop in Aussois. It was like an “informal” IPCO with Super Heroes researchers in the audience, leaded by Captain Egon, who appears at work in the following photo-tweet:

Egon talks intersection cuts at #aussois. Still the man. pic.twitter.com/7KMcNyJYV0
— Jeff Linderoth (@JeffLinderoth) January 8, 2014

The Captain gave an inspiring talk by questioning the recursive paradigm of cutting planes algorithms. With a very basic example, Balas has shown how a non basic vertex (solution) can produce a much deeper cut than a cut generated by an optimal basis. Around this intuition, Balas has presented a very nice generalization of Intersection Cuts… a new paper enters my “PAPERS-TO-BE-READ” folder.

To stay on the subject of cutting planes, the talk by Marco Molinaro in the first day of the workshop was really nice. He raises the fundamental question on how important are sparse cuts versus dense cuts. The importance of sparse cuts comes from linear algebra: when solving the simplex it is better to have small determinants in the coefficient matrix of the Linear Programming relaxation in order to avoid numerical issues; sparse cuts implicitly help in keeping small the determinants (intuitively, you have more zeros in the matrix). Dense cuts play the opposite role, but they can be really important to improve the bound of the LP relaxation. In his talk, Molinaro has shown and proofed, for three particular cases, when sparse cuts are enough, and when they are not. Another paper goes on the “PAPERS-TO-BE-READ” folder.

In the same day of Molinaro, it was really inspiring the talk by Sebastian Pokutta, who really gave a completely new (for me) perspective on Extended Formulations by using Information Theory. Sebastian is the author of a blog, and I hope he will post about his talk.

Andrea Lodi has discussed about an Optimization problem that arises in Supervised Learning. For this problem, the COIN-OR solver Couenne, developed by Pietro Belotti, significantly outperforms CPLEX. The issues seem to come from on a number of basic big-M (indicator) constraints. To make a long story short, if you have to solve a hard problem, it does pay off to try different solvers, since there is not a “win-all” solver.

Do you have an original new idea for developing solvers? Do not be intimidated by CPLEX or Gurobi and go for it!

The presentation by Marco Senatore was brilliant and his work looks very interesting. I have particularly enjoyed the application in Public Transport that he has mentioned at the end of his talk.

I recommend to have a look at the presentation of Stephan Held about the Reach-aware Steiner Tree Problem. He has an interesting Steiner tree-like problem with a very important application in chip design. The presentation has impressive pictures of what optimal solutions look like in chip design.

At the end of talk, Stephan announced the 11th DIMACS challenge on Steiner Tree Problems.

Eduardo Uchoa gave another impressive presentation on recent progresses on the classical Capacitated Vehicle Routing Problem (CVRP). He has a very sophisticated branch-and-price-and-cut algorithm, which comes with a very efficient implementation of every possible idea developed for CVRP, plus new ideas on solving efficiently the pricing sub problems (my understanding, but I might be wrong, is that they have a very efficient dominance rule for solving a shortest path sub problem). +1 item in the “PAPERS-TO-BE-READ” folder.

The last day of the workshop, I have enjoyed the two talks by Simge Kucukyavuz and Jim Luedtke on Stochastic Integer Programming: for me is a completely new topic, but the two presentations were really inspiring.

To conclude, Domenico Salvagnin has shown how far it is possible to go by carefully using MIP technologies such as cutting planes, symmetry handling, and problem decomposition. Unfortunately, it does happen too often that when someone (typically a non OR expert) has a difficult application problem, he writes down a more or less complicated Integer Programming model, tries a solver, sees it takes too much time, and gives up with exact methods. Domenico, by solving the largest unsolved instance for the 3-dimensional assignment problem, has shown that

there are potentially no limits for MIP solvers!

In this post, I have only mentioned a few talks, which somehow overlap with my research interests. However, every talk was really interesting. Fortunately, Francois Margot has strongly encouraged all of the speakers to upload their slides and/or papers, so you can find (almost) all of them on the program web page of the workshop. Visit the website and have a nice reading!

To conclude, let me steal another nice picture from twitter:

Goodbye #aussois2014 pic.twitter.com/ODupKKmGTZ
— Matteo Fischetti (@MFischetti) January 10, 2014

Public Transport and Big Data

2013-11-17T14:10:00+01:00

Big Data is nowadays a buzzword. A simple query for “Big Data” on Google gives about 26,700,000 results.

Public Transport is not really a buzzword, but still on Google you can get almost the same number as with “Big Data”: 26,400,000 results.

Why is Public Transport so important?

Because many of us use Public Transport every day, but most of us still use their own car to go to work, to bring child at school, and to go shopping. This has a negative impact on the quality of life of everyone and is clearly inefficient since it does cost more:

More money.
More pollution.
More time.

(Well, for time, it is not always true, but it happens more often than commonly perceived).

Thus, an important challenge is to improve the quality of Public Transport while keeping its cost competitive. The ultimate goal should be to increase the number of people that trust and use Public Transport.

How is it possible to achieve this goal?

Transport Operators are Big Data producers (are they?)

Modern transport operators have installed so called Automatic Vehicle Monitoring (AVM) systems that use several technologies to monitor the fleet of vehicles that operates the service (e.g., metro coaches, buses, metro trains, trains, …).

The stream of data produced by an AVM might be considered as Big Data because of its volume and velocity (see Big Data For Dummies, by J.F. Puget). Each vehicle produces at regular intervals (measured in seconds) data concerning its position and status. This information is stored in remote data centers. The data for a single day might not be considered as “Big”, however once you start to analyze the historical data, the volume increases significantly. For instance, a public transport operator could easily have around 2000 thousands vehicles that operate 24 hours a day, producing data potentially every second.

At the moment, this stream of data misses the third dimension of Big Data that is variety. However, new projects that aim at integrating this information with the stream of data coming from social networks are quickly reaching maturity. One of such project is SuperHub, a FP7 project that has recently won the best exhibit award in Cluster 2 “Smart and sustainable cities for 2020+”, at the ICT2013 Conference in Vilnius.

I don’t know whether transport operators are really Big Data producers or they are merely Small Data producers, but data collected using AVMs are nowadays mainly used to report and monitor the daily activities.

In my own opinion, the data produced by transport operators, integrated with input coming from social networks, should be used to improve the quality of the public transport, for instance, by trying to better tackle Disruption Management issues.

So, I am curious:

Do you know any project that uses AVM data, combined with Social Network inputs (e.g., from Twitter), to elaborate Disruption Management strategies for Public Transport? If yes, do they use Mathematical Optimization at all?

Reading Excuses

2013-11-01T15:56:00+01:00

I love reading!

I love reading about everything and I am glad that part of my work consists in reading.

Unfortunately, for researchers, reading is not always that easy, as clearly explained in The Researcher’s Bible:

Reading is difficult: The difficulty seems to depend on the stage of academic development. Initially it is hard to know what to read (many documents are unpublished), later reading becomes seductive and is used as an excuse to avoid research. Finally one lacks the time and patience to keep up with reading (and fears to find evidence that one’s own work is second rate or that one is slipping behind)

For my stage of academic development, reading is extremely seductive, and the situation became even worse after reading the answers to the following question raised by Michael Trick on OR-exchange:

What paper should everyone read?

If you are looking for excuses to avoid research, go through those answers and select any paper you like, you will have outstanding and authoritative excuses!

GeCol: a Graph Coloring solver on top of Gecode

2013-06-28T15:00:00+02:00

This post is about solving the classical Graph Coloring problem by using a simple solver, named here GeCol, that is built on top of the Constraint Programming (CP) solver Gecode. The approach of GeCol is based on the CP model described in [1]. Here, we want to explore some of the new features of the last version of Gecode (version 4.0.0), namely:

Lightweight Dynamic Symmetry Breaking (LDSB) [2]
Accumulated Failure Count (AFC) and Activity-based strategies for variable selection while branching, combined with Restart Based Search

We are going to present computational results using these features to solve the instances of the Graph Coloring DIMACS Challenge. However, this post is not going to describe in great details what these features are: please, for this purpose, refer to the Modeling and Programming with Gecode book.

As usual, all the sources used to write this post are publicly available on my GitHub repository.

Modeling Graph Coloring with Constraint Programming

To model this problem with CP, we can use for each vertex $i$ an integer variable $x_i$ with domain equals to $K$ : if $x_i=k$ , then color $k$ is assigned to vertex $i$ . Using (inclusion-wise) maximal cliques, it is possible to post constraints on subsets of adjacent vertices: every subset of vertices belonging to the same clique must get a different color. In CP, we can use the well-known alldifferent constraint for posting these constraints.

In practice, to build our CP model, first, we find a collection of maximal cliques $C$ , such that for every edge $(i,j) \in E$ there exists at least a clique $c \in C$ that contains both vertices $i$ and $j$ . Second, we post the following constraints:

$\mbox{alldifferent}([x_c]) \qquad \forall c \in C$

where $x_c$ denotes the subset of variables corresponding to the vertices that belong to the clique $c$ .

In order to minimize the number of colors, we use a simple iterative procedure. Every time we found a coloring with $k$ colors, we restart the search by restricting the cardinality of $K$ to $k-1$ . If no feasible coloring exists with $k-1$ colors, we have proved optimality for the last feasible coloring found, i.e. $\chi(G)=k$ .

In addition, we apply a few basic preprocessing steps that are described in [1]. The maximal cliques are computed using Cliquer v1.21 [5].

Lightweight Dynamic Symmetry Breaking

The Graph Coloring problem is an optimization problem that has several equivalent optimum solutions: for instance, given an optimal assignment of colors to vertices, any permutation of the colors, gives a solution with the same optimum value.

While this property is implicitly considered in Column Generation approaches to Graph Coloring (e.g., see [3], [1], and [4]), the CP model we have just presented, suffers from symmetries issues: the values of the domains of the integer variables are symmetric.

The Lightweight Dynamic Symmetry Breaking is a strategy for dealing with this issue [2]. In Gecode, you can define a set of values that are symmetric as follows:

Symmetries syms; syms << ValueSymmetry(IntArgs::create(k,1));

and then when posting the branching strategy you just write (just note that use of object syms):

branch(*this, x, INT_VAR_SIZE_MIN(), INT_VAL_MIN(), syms);

With three lines of code, you have solved (some of) the symmetry issues.

How efficient is Lightweight Dynamic Symmetry Breaking for Graph Coloring?

We try to answer to this question with the plot below that shows the results for two versions of GeCol:

(A) The first version without any breaking symmetry strategy
(B) The second version with the Lightweight Dynamic Breaking Symmetry

Both versions select for branching the variable with the smallest domain size. The plot reports the empirical cumulative distribution as function of run time (in log-scale). The tests were run with a timeout of 300 seconds on a quite old server. Note that at the timeout, the version with LDBS has solved around 55% of the instances, while the version without LDBS has solved only around 48% of the instances.

Accumulated Failure Count and Activity-based Branching

The second new feature of Gecode that we explore here is the Accumulated Failure Count and the Activity-based branching strategies.

While solving any CP model, the strategy used to select the next variable to branch over is very important. The Accumulated Failure Count strategy stores the cumulative number of failures for each variable (for details see Section 8.5 in MPG). The Activity-based search does something similar, but instead of counting failures, measures the activity of each variable. In a sense, these two strategies try to learn from failures and activities as they occur during the search process.

These two branching strategies are more effective when combined with Restart Based Search: the solver performs the search with increasing cutoff values on the number of failures. Gecode offers several optional strategies to improve the cutoff. In our tests, we have used a geometric cutoff sequence (Section 9.4 in MPG).

How effective are the Accumulated Failure Count and the Activity-based strategies for Graph Coloring when combined with Restart Based Search?

The second plot below shows a comparison of 3 versions of GeCol, with 3 different branching strategies:

(A) Select the variable with smallest domain size
(B) Select the variable with largest Activity Cumulated value
(C) Select the variable with largest Accumulated Failure Count (AFC) value

The last strategy is tremendously efficient: it dominates the other two strategies, and it is able to solve more of the 60% of the considered instances within the timeout of 300 seconds.

However, it is possible to do still slightly better. Likely, at the begging of the search phase, several variables have the same value of AFC. Therefore, it is possible to improve the branching strategy by breaking ties: we can divide the ACT or the AFC value of a variable by the its domain size. The next plot shows the results with these other branching strategies:

(A) Select the variable with largest ratio of variable degree vs. domain size
(B) Select the variable with largest ratio of Activity Cumulated value vs. domain size
(C) Select the variable with largest ratio of Accumulated Failure Count vs. domain size

Conclusions

The new features of Gecode are very interesting and offer plenty of options. The LDBS is very general, and it could be easily applied to several other combinatorial optimization problems. Also the new branching strategies gives important enhancements, above all when combined with restart based search.

”…with great power there must also come – great responsibility!” (Uncle Ben, The Amazing Spider-Man, n.660, Marvel Comics)

As a drawback, it is becoming harder and harder to find the best parameter configuration for solvers as Gecode (but this is true also for other type of solvers, e.g. Gurobi and Cplex).

Can you find or suggest a better parameter configuration for GeCol?

References

S. Gualandi and F. Malucelli. Exact Solution of Graph Coloring Problems via Constraint Programming and Column Generation. INFORMS Journal on Computing. Winter 2012 vol. 24(1), pp.81-100. [pdf] [preprint]
C. Mears, M.G. de la Banda, B. Demoen, M. Wallace. Lightweight dynamic symmetry breaking. In Eighth International Workshop on Symmetry in Constraint Satisfaction Problems, SymCon’08, 2008. [pdf]
A Mehrotra, MA Trick. A column generation approach for graph coloring. INFORMS Journal on Computing. Fall 1996 vol. 8(4), pp.344-354. [pdf]
S. Held, W. Cook, E.C. Sewell. Maximum-weight stable sets and safe lower bounds for graph coloring. Mathematical Programming Computation. December 2012, Volume 4, Issue 4, pp 363-381. [pdf]
Patric R.J. Ostergard. A fast algorithm for the maximum clique problem. Discrete Applied Mathematics, vol. 120(1-3), pp. 197–207, 2002 [pdf]

Backtrack Programming in c

2013-03-22T12:45:00+01:00

Recently, I have discovered a nice tiny library (1 file!) that supports Backtrack Programming in standard C. The library is called CBack and is developed by Keld Helsgaun, who is known in the Operations Research and Computer Science communities for his efficient implementation of the Lin-Kernighan heuristics for the Travelling Salesman Problem.

CBack offers basically two functions that are described in [1] as follows:

Choice(N): “is used when a choice is to be made among a number of alternatives, where N is a positive integer denoting the number of alternatives”.
Backtrack(): “causes the program to backtrack, that is to say, return to the most recent call of Choice, which has not yet returned all its values”.

With these two functions is pretty simple to develop exact enumeration algorithms. The CBack library comes with several examples, such as algorithms for the N-queens problem and the 15-puzzle. Below, I will show you how to use CBack to implement a simple algorithm that finds a Maximum Clique in an undirected graph.

As usual, the source code used to write this post is publicly available on my GitHub repository.

Basic use of CBack

The CBack documentation shows as first example the following code snippet:

Example

int i, j;
i = Choice(3);
j = Choice(2);
printf("i = %d, j = %d\n",i,j);
Backtrack();

The output produced by the snippet is:

Output

i = 1, j = 1
i = 1, j = 2
i = 2, j = 1
i = 2, j = 2
i = 3, j = 1
i = 3, j = 2

If you are familiar with backtrack programming (e.g., Prolog), you should not be surprised by the output, and you can jump to the next section. Otherwise, the Figure below sketches the program execution.

When the program executes the Choice(N=3) statement, that is the first call to the first choice (line 2), value 1 is assigned to variable i. Behind the scene, the Choice function stores the current execution state of the program in its own stack, and records the next possible choices (i.e. the other possible program branches), that are values 2 and 3. Next, the second Choice(N=2) assigns value 1 to j (line 3), and again the state of the program is stored for later use. Then, the printf outputs i = 1 , j = 1 (line 4 and first line of output). Now, it is time to backtrack (line 5).

What is happening here?

Look again at the figure above: When the Backtrack() function is invoked, the algorithm backtracks and continues the execution from the most recent Choice stored in its stack, i.e. it assigns to variable j value 2, and printf outputs i = 1, j = 2. Later, the Backtrack() is invoked again, and this time the algorithm backtracks until the previous possible choice that corresponds to the assignment of value 2 to variable i, and it executes i = 2. Once the second choice for variable i is performed, there are again two possible choices for variable j, since the program has backtracked to a point that precedes that statement. Thus, the program executes j = 1, and printf outputs i = 2, j = 1. At this point, the program backtracks again, and consider the next possible choice, j = 2. This is repeated until all possible choices for Choice(3) and Choice(2) are exhausted, yielding the 6 possible combinations of i and j that the problem gave as output.

Indeed, during the execution, the program has implicitly visited in a depth-first manner the search tree of the previous figure. CBack supports also different search strategy, such as best first, but I will not cover that topic here.

In order to store and restore the program execution state (well, more precisely the calling environment), Choice(N) and Backtrack use two threatening C standard functions, setjmp and longjmp. For the details of their use in CBack, see [1].

A Basic Maximum Clique Algorithm

The reason why I like this library, apart from remembering me the time I was programming with Mozart, is that it permits to implement quickly exact algorithms based on enumeration. While enumeration is usually disregarded as inefficient (“ehi, it is just brute force!”), it is still one of the best method to solve small instances of almost any combinatorial optimization problem. In addition, many sophisticated exact algorithms use plain enumeration as a subroutine, when during the search process the size of the problem becomes small enough.

Consider now the Maximum Clique Problem: Given an undirected graph $G=(V,E)$ , the problem is to find the largest complete subgraph of $G$ . More formally, you look for the largest subset $C$ of the vertex set $V$ such that for any pair of nodes ${i,j}$ in $C \times C$ there exists an arc ${i,j} \in E$ .

The well-known branch-and-bound algorithm of Carraghan and Pardalos [2] is based on enumeration. The implementation of Applegate and Johnson, called dfmax.c, is a very efficient implementation of that algorithm. Next, I show a basic implementation of the same algorithm that uses CBack for backtracking.

The Carraghan and Pardalos algorithm uses three sets: the current clique $C$ , the largest clique found so far $C^*$ , and the set of candidate vertices $P$ . The pseudo code of the algorithm is as follows (as described in [3]):

Basic Maximum Clique Algorithm

function findClique(C, P)
   if |C| > |C*| then C* <- C  // store the best clique
   if |C| + |P| > |C*| then
      for all v in P // the order does matter 
         P <- P \ {v}
         C' <- C u {v}
         P' <- P  \intersect N(v)  // neighbors of p
         findClique(C', P')

function main()
   C* <- {}  // empty set, C* global variable
   findClique({},V)
   return C*

As you can see, the backtracking is here described in terms of a recursive function. However, using CBack, we can implement the same algorithm without using recursion.

Maximum Clique with CBack

We use an array S of $n$ integers, one for each vertex of $V$ . If S[v]=0, then vertex $i$ belongs to the candidate set $P$ ; if S[v]=1, then vertex $i$ is in $C$ ; if S[v]=2, then vertex $i$ cannot be neither in $P$ nor in $C$ . The variable s stores the size of current clique.

Let me show you directly the C code:

Max Clique via Branch-and-Bound

for (v = 0; v < n; v++ ) {
   /// If the current clique cannot be extended to a clique
   /// larger than C*, where LB=|C*|, then backtrack
   if ( s + P <= LB )
      Backtrack();
   if ( S[v] < 2 ) {  /// Skip removed vertices  
      /// Choice: Either v is in C (S[v]=1) or is not (S[v]=2)
      S[v] = Choice(2);
      if ( S[v] == 2 ) { /// P <- P \ {v} 
         P--;  /// Decrease the size of the candidate set 
      } else { /// S[v]=1: C <- C u {v}
         s++;   /// Update current clique size 
         if ( s > LB ) {
            LB = s;  /// Store the new best clique
            for ( w = 0; w <= v; w++ )
               C[w] = S[w];
         }
         /// Restrict the candidate set 
         for ( w = V[v+1]; w > V[v] ; w-- )
            if ( S[E[w]] == 0 ) {
               S[E[w]] = 2;
               P--; /// Decrease the size of the candidate set 
            }
      }
   }
}
Backtrack();

Well, I like this code pretty much, despite being a “plain old” C program. The algorithm and code can be improved in several ways (ordering the vertices, improving the pruning, using upper bounds from heuristic vertex coloring, using induced degree as in [2]), but still, the main loop and the backtrack machinery is all there, in a few lines of code!

Maybe you wonder about the efficiency of this code, but at the moment I have not a precise answer. For sure, the ordering of the vertices is crucial, and can make a huge difference on solving the max-clique DIMACS instances. I have used CBack to implement my own version of the Ostengard’s max-clique algorithm [4], but my implementation is somehow slower. I suspect that the difference is due to data structure used to store the graph (Ostengard’s implementation relies on bitsets), but not in the way the backtracking is achieved. Although, to answer to such question could be a subject of another post.

In conclusion, if you need to implement an exact enumerative algorithm, CBack could be an option to consider.

References

Keld Helsgaun. CBack: A Simple Tool for Backtrack Programming in C. Software: Practice and Experience, vol. 25(8), pp. 905-934, 2006. [doi]
Carraghan and Pardalos. An exact algorithm for the maximum clique problem. Operations Research Letters, vol. 9(6), pp. 375-382, 1990, [pdf]
Torsten Fahle. Simple and Fast: Improving a Branch-and-Bound Algorithm. In Proc ESA 2002, LNCS 2461, pp. 485-498. [doi]
Patric R.J. Ostergard. A fast algorithm for the maximum clique problem. Discrete Applied Mathematics, vol. 120(1-3), pp. 197–207, 2002 [pdf]

From blackboard to code: Gomory Cuts using CPLEX

2013-02-05T15:22:00+01:00

Edited on May 16th, 2013: fixes due to M. Chiarandini

On the blackboard, to solve small Integer Linear Programs with 2 variables and less or equal constraints is easy, since they can be plotted in the plane and the linear relaxation can be solved geometrically. You can draw the lattice of integer points, and once you have found a new cutting plane, you show that it cuts off the optimum solution of the LP relaxation.

This post presents a naive (textbook) implementation of Fractional Gomory Cuts that uses the basic solution computed by CPLEX, the commercial Linear Programming solver used in our lab sessions. In practice, this post is an online supplement to one of my last exercise session.

In order to solve the “blackboard” examples with CPLEX, it is necessary to use a couple of functions that a few years ago were undocumented. GUROBI has very similar functions, but they are currently undocumented. (Edited May 16th, 2013: From version 5.5, Gurobi has documented its advanced simplex routines)

As usual, all the sources used to write this post are publicly available on my GitHub repository.

The basics

Given a Integer Linear Program in the form:

$(P) \qquad \min \{ cx \mid Ax \leq b, \, x \geq 0, \, x \mbox{ integer} \}$

it is possible to rewrite the problem in standard form by adding slack variables:

$(P) \qquad \min \{ cx \mid Ax + Ix_S = b, \, x \geq 0, \, x \mbox{ integer}, \, x_S \geq 0 \}$

where $I$ is the identity matrix and $x_S$ is a vector of slack variables, one for each constraint in $(P)$ . Let us denote by $(\bar{P})$ the linear relaxation of $(P)$ obtained by relaxing the integrality constraint.

The optimum solution vector of $(\bar{P})$ , if it exists and it is finite, it is used to derive a basis (for a formal definition of basis, see [1] or [3]). Indeed, the basis partitions the columns of matrix $A$ into two submatrices $B$ and $N$ , where $B$ is given by the columns corresponding to the basic variables, and $N$ by columns corresponding to variables out of the base (they are equal to zero in the optimal solution vector).

Remember that, by definition, $B$ is nonsingular and therefore is invertible. Using the matrices $B$ and $N$ , it is easy to derive the following inequalities (for details, see any OR textbook, e.g., [1]):

$\begin{eqnarray} &Ax = b & \\ &Bx_B + Nx_N = b& \\ &x_B + B^{-1}N\,x_N = B^{-1}\,b & \qquad B \mbox{ is nonsingular} \\ &x_B + \lfloor B^{-1}N \rfloor \,x_N \leq B^{-1}\,b &\qquad \mbox{since }x \geq 0 \\ &x_B + \lfloor B^{-1}N \rfloor \,x_N \leq \lfloor B^{-1}\,b \rfloor& \qquad x \mbox{ is integer} \end{eqnarray}$

where the operator $\lfloor \cdot \rfloor$ is applied component wise to the matrix elements. In practice, for each fractional basic variable, it is possible to generate a valid Gomory cut.

The key step to generate Gomory cuts is to get an optimal basis or, even better, the inverse of the basis matrix $B^{-1}$ multiplied by $A$ and by $b$ . Once we have that matrix, in order to generate a Gomory cut from a fractional basic variable, we just use the last equation in the previous derivation, applying it to each row of the system of inequalities

Given the optimal basis, the optimal basic vector is $x_B=B^{-1}b$ , since the non basic variables are equal to zero. Let $j$ be the index of a fractional basic variable, and let $i$ be the index of the constraint corresponding to variable $j$ in the equations $x_B=B^{-1}A$ , then the Gomory cut for variable $j$ is:

$x_j + \sum_{l \in N} \lfloor (B^{-1}N)_{il} \rfloor\,x_l \leq \lfloor (B^{-1}\,b)_i \rfloor$

Using the CPLEX callable library

The CPLEX callable library (written in C) has the following advanced functions:

CPXbinvarow computes the i-th row of the tableau
CPXbinvrow computes the i-th row of the basis inverse
CPXbinvacol computes the representation of the j-th column in terms of the basis
CPXbinvcol computes the j-th column of the basis inverse

Using the first two functions, Gomory cuts from an optimal base can be generated as follows:

Gomory cut Fork Me on GitHub

printf("\nGenerate Gomory cuts:\n");
idx = 0;
cut = 0;  /// Index of cut to be added
for ( i = 0; i < m-1; ++i )
if ( floor(b_bar[i]) != b_bar[i] ) {
   printf("Row %d gives cut ->   ", i+1);
   POST_CMD( CPXbinvarow(env, model, i, z) );
   rmatbeg[cut] = idx;
   for ( j = 0; j < n1; ++j ) {
      z[j] = floor(z[j]); /// DANGER!
      if ( z[j] != 0 ) {
         rmatind[idx] = j;
         rmatval[idx] = z[j];
         idx++;
      }
      /// Print the cut
      if ( z[j] >= 0 ) printf("+");
      printf("%.1f x%d ", z[j], j+1);
   }
   gc_rhs[cut] = floor(b_bar[i]); /// DANGER!
   gc_sense[cut] = 'L';
   printf("<= %.1f\n", gc_rhs[cut]);
   cut++;
}
/// Add the new cuts
POST_CMD( CPXaddrows (env, model, 0, n_cuts, idx, gc_rhs, gc_sense,
         rmatbeg, rmatind, rmatval, NULL, NULL) );

The code reads row by row (index i) the inverse basis matrix $B^{-1}$ multiplied by $A$ (line 7), which is temporally stored in vector z, and then the code stores the corresponding Gomory cut in the compact matrix given by vectors rmatbeg, rmatind, and rmatval (lines 8-15). The array b_bar contains the vector $B^{-1}b$ (line 21). In lines 26-27, all the cuts are added at once to the current LP data structure.

On GitHub you find a small program that I wrote to generate Gomory cuts for problems written as $(P)$ . The repository have an example of execution of my program.

The code is simple only because it is designed for small IPs in the form $\min\, \{\, cx \mid Ax\, \leq\,b,\, x\geq 0\}$ . Otherwise, the code must consider the effects of preprocessing, different sense of the constraints, and additional constraints introduced because of range constraints.

If you are interested in a real implementation of Mixed-Integer Gomory cuts, that are a generalization of Fractional Gomory cuts to mixed integer linear programs, please look at the SCIP source code.

Additional readings

The introduction of Mixed Integer Gomory cuts in CPLEX was The major breakthrough of CPLEX 6.5 and produced the version-to-version speed-up given by the blue bars in the chart below (source: Bixby’s slides available on the web):

Gomory cuts are still subject of research, since they pose a number of implementation challenges. These cuts suffer from severe numerical issues, mainly because the computation of the inverse matrix requires the division by its determinant.

“In 1959, […] We started to experience the unpredictability of the computational results rather steadily” (Gomory, see [4]).”

A recent paper by Cornuejols, Margot, and Nannicini deals with some of these issues [2].

If you like to learn more about how the basis are computed in the CPLEX LP solver, there is very nice paper by Bixby [3]. The paper explains different approaches to get the first basic feasible solution and gives some hints of the CPLEX implementation of that time, i.e., 1992. Though the paper does not deal with Gomory cuts directly, it is a very pleasant reading.

To conclude, for those of you interested in Optimization Stories there is a nice chapter by G. Cornuejols about the Ongoing Story of Gomory Cuts [4].

References

C.H. Papadimitriou, K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. 1998. [book]
G. Cornuejols, F. Margot and G. Nannicini. On the safety of Gomory cut generators. Submitted in 2012. Mathematical Programming Computation, under review. [preprint]
R.E. Bixby. Implementing the Simplex Method: The Initial Basis. Journal on Computing vol. 4(3), pages 267–284, 1992. [abstract]
G. Cornuejols. The Ongoing Story of Gomory Cuts. Documenta Mathematica - Optimization Stories. Pages 221-226, 2012. [preprint]

How Italian Commuters Discovered Operations Research

2012-12-16T10:45:00+01:00

Last week, more then 700,000 Italian commuters discovered the importance of Operations Research (OR). Nobody explicitly mentioned OR, but due to a horrible crew schedule of Trenord (a train operator with around 330 trains and 2700 employees), the commuters had a long, long, long nightmare. During the whole week, several trains were cancelled (1375 in total) and most of the trains were delayed. A newspaper wrote that a commuter waiting to go home had the painful record of 11 consecutive trains cancelled. The Italian online edition of Wired has an article about this horrible week. If you want to get an idea of the chaos you can search for “caos tilt software trenord” on google.it.

Trenord officially said that the software that planned the crew schedule is faulty. The software was bought last year from Goal Systems, a Spanish company. Rumors say that Trenord paid the Goal System around 1,500,000 Euro. Likely, the system is not faulty, but it “only” had bad input data.

What newspapers do not write

Before the Goal System, Trenord was using a different software, produced by Management Artificial Intelligence Operations Research srl (MAIOR) that is used by several public transportation companies in Italy, included ATM that operates the subway and buses in Milan. In addition, MAIOR collaborates with the Politecnico di Milano and the University of Pisa to improve continuously its software. Honestly, I am biased, since I collaborate with MAIOR. However, Trenord dismissed the software of MAOIR without any specific complaint, since the management had decided to buy the Goal System software.

Newspapers do not ask the following question:

Why to change a piece of software, if the previous one was working correctly?

In Italy, soccer players have a motto: “squadra che vince non si cambia”. Maybe at Trenord nobody plays soccer.

MAIOR is back

Likely, next week will be better for the 700,000 commuters, since OR experts from MAIOR are traveling to Milan to help Trenord to improve the situation.

Disclaimer (post edited on 18th December 2012)

I am a Pavia-Milano commuter disappointed of the chaotic week we had.
The information reported in this post were obtained with searches on google.it and published on Italian online magazines.
Surely, the Goal System is a piece of software as good as MAIOR software is.
This post does not intend to offend anyone.

Challenging MIPs instances

2012-12-13T18:11:00+01:00

Today, I share seven challenging MIP instances as .mps files along with the AMPL model and data files I used to generate them. While I like the MIPLIBs, I do prefer problem libraries similar to the CSPLIB where you get both a problem description and a set of data. This allows anyone to try with her new model and/or method.

The MIP instances I propose come from my formulation of the Machine Reassignment Problem proposed for the Roadef Challenge sponsored by Google last year. As I wrote in a previous post, the Challenge had huge instances and a micro time limit of 300 seconds. I said micro because I have in mind exact methods: there is little you can do in 300 seconds when you have a problem with potentially as many as $50000 \times 5000$ binary variables. If you want to use math programming and start with the solution of a linear programming relaxation of the problem, you have to be careful: it might happen that you cannot even solve the LP relaxation at the root node within 300 seconds.

That is why most of the participants tackled the Challenge mainly with heuristic algorithms. The only general purpose solver that qualified for the challenge is Local Solver, which has a nice abstraction (“somehow” similar to AMPL) to well-known local search algorithms and move operators. The Local Solver script used in the qualification phase is available here.

However, in my own opinion, it is interesting to try to solve at least the instances of the qualification phase with Integer Linear Programming (ILP) solvers such as Gurobi and CPLEX. Can these branch-and-cut commercial solvers be competitive on such problems?

Problem Overview

Consider you are given a set of processes $P$ , a set of machines $M$ , and an initial mapping $\pi$ of each process to a single machine (i.e., $\pi_p = i$ if process $p$ is initially assigned to machine $i$ ). Each process consumes several resources, e.g., CPU, memory, and bandwidth. In the challenge, some processes were defined to be transient: they consume resources both on the machine where they are initially located, and in the machine they are going to be after the reassignment. The problem asks to find a new assignment of processes to machines that minimizes a rather involved cost function.

A basic ILP model will have a 0-1 variable $x_{pi}$ equals to 1 if you (re)assign process $p$ to machine $i$ . The number of processes and the number of machines give a first clue on the size of the problem. The constraints on the resource capacities yield a multi-dimensional knapsack subproblem for each machine. The Machine Reassignment Problem has other constraints (kind of logical 0-1 constraints), but I do not want to bore you here with a full problem description. If you like to see my model, please read the AMPL model file.

A first attempt with Gurobi

In order to convince you that the proposed instances are challenging, I report some computational results.

The table below reports for each instance the best result obtained by the participants of the challenge (second column). The remaining four columns give the upper bound (UB), the lower bound (LB), the number of branch-and-bound nodes, and the computation time in seconds obtained with Gurobi 5.0.1, a timeout of 300 seconds, and the default parameter setting on a rather old desktop (single core, 2Gb of RAM).

Instance	Best Known UB	Upper Bound	Lower Bound	Nodes	Time
a1-1	44,306,501	44,306,501	44,306,501	0	0.05
a1-2	777,532,896	780,511,277	777,530,829	537	-
a1-3	583,005,717	583,005,720	583,005,715	15	48.76
a1-4	252,728,589	320,104,617	242,404,632	24	-
a1-5	727,578,309	727,578,316	727,578,296	221	2.43
a2-1	198	54,350,836	110	0	-
a2-2	816,523,983	1,876,768,120	559,888,659	0	-
a2-3	1,306,868,761	2,272,487,840	1,007,955,933	0	-
a2-4	1,681,353,943	3,223,516,130	1,680,231,407	0	-
a2-5	336,170,182	787,355,300	307,041,984	0	-

Instances a1-1, a1-3, a1-5 are solved to optimality within 300 seconds and hence they are not further considered.

The remaining seven instances are the challenging instances mentioned at the begging of this post. The instances a2-x are embarrassing: they have an UB that is far away from both the best known UB and the computed LB. Specifically, look at the instance a2-1: the best result of the challenge has value 198, Gurobi (using my model) finds a solution with cost 54,350,836: you may agree that this is “slightly” more than 198. At the same time the LB is only 110.

Note that for all the a2-x instances the number of branch-and-bound nodes is zero. After 300 seconds the solver is still at the root node trying to generate cutting planes and/or running their primal heuristics. Using CPLEX 12.5 we got pretty similar results.

This is why I think these instances are challenging for branch-and-cut solvers.

Search Strategies: Feasibility vs Optimality

Commercial solvers have usually a meta-parameter that controls the search focus by setting other parameters (how they are precisely set is undocumented: do you know more about?). The two basic options of this parameter are (1) to focus on looking for feasible solution or (2) to focus on proving optimality. The name of this parameter is MipEmphasis in CPLEX and MipFocus in Gurobi. Since the LPs are quite time consuming and after 300 seconds the solver is still at the root node, we can wonder whether generating cuts is of any help on these instances.

If we set the MipFocus to feasibility and we explicitly disable all cut generators, would we get better results?

Look at the table below: the values of the upper bounds of instances a1-2, a1-4, and a2-3 are slightly better than before: this is a good news. However, for instance a2-1 the upper bound is worse, and for the other three instances there is no difference. Moreover, the LBs are always weaker: as expected, there is no free lunch!

Instance	Upper Bound	Lower Bound	Gap	Nodes
a1-2	779,876,897	777,530,808	0.30%	324
a1-4	317,802,133	242,398,325	23.72%	48
a2-1	65,866,574	66	99.99%	81
a2-2	1,876,768,120	505,443,999	73.06%	0
a2-3	1,428,873,892	1,007,955,933	29.45%	0
a2-4	3,223,516,130	1,680,230,915	47.87%	0
a2-5	787,355,300	307,040,989	61.00%	0

If we want to keep a timeout of 300 seconds, there is little we can do, unless we develop an ad-hoc decomposition approach.

Can we improve those results with a branch-and-cut solver using a longer timeout?

Most of the papers that uses branch-and-cut to solve hard problems have a timeout of at least one hour, and they start by running an heuristic for around 5 minutes. Therefore, we can think of using the best results obtained by the participants of the challenge as starting solution.

So, let us make a step backward: we enable all cut generators and we set all parameters at the default value. In addition we set the time limit to one hour. The table below gives the new results. With this setting we are able to “prove” near-optimality of instance a1-2, and we reduce significantly the gap of instance a2-4. However, the solver never improves the primal solutions: this means that we have not improved the results obtained in the qualification phase of the challenge. Note also that the number of nodes explored is still rather small despite the longer timeout.

Instance	Upper Bound	Lower Bound	Gap	Nodes
a1-2	777,532,896	777,530,807	~0.001%	0
a1-4	252,728,589	242,404,642	4.09%	427
a2-1	198	120	39.39%	2113
a2-2	816,523,983	572,213,976	29.92%	18
a2-3	1,306,868,761	1,068,028,987	18.27%	69
a2-4	1,681,353,943	1,680,231,594	0.06%	133
a2-5	336,170,182	307,042,542	8.66%	187

What if we disable all cuts and set the MipFocus to feasibility again?

Instance	Upper Bound	Lower Bound	Gap	Nodes
a1-2	777,532,896	777,530,807	~0.001%	0
a1-4	252,728,589	242,398,708	4.09%	1359
a2-1	196	70	64.28%	818
a2-2	816,523,983	505,467,074	38.09%	81
a2-3	1,303,662,728	1,008,286,290	22.66%	56
a2-4	1,681,353,943	1,680,230,918	0.07%	108
a2-5	336,158,091	307,040,989	8.67%	135

With this parameter setting, we improve the UB for 3 instances: a2-1, a2-3, and a2-5. However, the lower bounds are again much weaker. Look at instance a2-1: the lower bound is now 70 while before it was 120. If you look at instance a2-3 you can see that even if we got a better primal solution, the gap is weaker, since the lower bound is worse.

RFC: Any idea?

With the focus on feasibility you get better results, but you might miss the ability to prove optimality. With the focus on optimality you get better lower bounds, but you might not improve the primal bounds.

1) How to balance feasibility with optimality?

To use branch-and-cut solver and to disable cut generators is counterintuitive, but if you do you, you get better primal bounds.

2) Why should I use a branch-and-cut solver then?

Do you have any idea out there?

Minor Remark

While writing this post, we got 3 solutions that are better than those obtained by the participants of the qualification phase: a2-1, a2-3, and a2-5 (the three links give the certificates of the solutions). We are almost there in proving optimality of a2-3, and we get better lower bounds than those published in [1].

References

Deepak Mehta, Barry O’Sullivan, Helmut Simonis. Comparing Solution Methods for the Machine Reassignment Problem. In Proc of CP 2012, Québec City, Canada, October 8-12, 2012.

Credits

Thanks to Stefano Coniglio and to Marco Chiarandini for their passionate discussions about the posts in this blog.

CP2012: Je me souviens

2012-10-19T13:24:00+02:00

Last week in Quebec City, there was the 18th International Conference on Principles and Practice of Constraint Programming. This year the conference had a record of submissions (186 in total) and the program committee made a vey nice job in organizing the plenary sessions and the tutorials. You can find very nice pictures of the conference on Helmut’s web page.

During the conference, the weather outside was pretty cold, but at the conference site the discussions were warm and the presentations were intriguing.

In this post, I share an informal report of the conference as “Je me souviens”.

Challenges in Smart Grids

The invited talks were excellent and my favorite one was given by Miguel F. Anjos on Optimization Challenges in Smart Grid Operations. Miguel is not exactly a CP programmer, he is more on discrete non linear optimization, but his talk was a perfect mixed of applications, modeling, and solution techniques. Please, read and enjoy his slides.

I like to mention just one of his observations. Nowadays, electric cars are becoming more and more present. What would happen when each of us will have an electric car? Likely, during the night, while sleeping, we will connect our car to the grid to recharge the car batteries. This will lead to high variability in night peaks of energy demand.

How to manage these peaks?

Well, what Miguel has reported as a possible challenging option is to think of the collection of cars connected to the grid as a kind of huge battery. This sort of collective battery could be used to better handle the peaks of energy demands. Each car would play the game with a double role: if there is not an energy demand peak, you can recharge the car battery; otherwise, the car battery could be used as a power source and it could supply energy to the grid. This is an oversimplification, but as you can image there would be great challenges and opportunities for any constraint optimizer in terms of modeling and solution techniques.

I am curious to read more about, do you?

Sessions and Talks

This year CP had the thicker conference proceedings, ever. Traditionally, the papers are presented in two parallel sessions. Two is not that much when you think that this year at ISMP there were 40 parallel sessions… but still, you always regret that you could not attend the talk in the other session. Argh!

Here I like to mention just two works. However, the program chair is trying to make all the slides available. Have a look at the program and at the slides: there are many good papers.

In the application track, Deepak Mehta gave a nice talk about a joint work with Barry O’Sullivan and Helmut Simonis on Comparing Solution Methods for the Machine Reassignment Problem, a problem that Google has to solve every day in its data centers and that was the subject of the Google/Roadef Challenge 2012. The true challenge is given by the HUGE size of the instances and the very short timeout (300 seconds). The work presented by Deepak is really interesting and they got excellent results using CP-based Large Neighborhood Search: they classified second at the challenge.

Related to the Machine Reassignment Problem there was a second interesting talk entitled Weibull-based Benchmarks for Bin Packing, by Ignacio Castineiras, Milan De Cauwer and Barry O’Sullivan. They have designed a parametric instance generator for bin packing problems based on the Weibull distribution. Having a parametric generator is crucial to perform exhaustive computational results and to identify those instances that are challenging for a particular solution technique. For instance, they have considered a CP-approach to bin packing problems and they have identified those Weibull shape values that yield challenging instances for such an approach. A nice feature is that their generator is able to create instances similar to those of the Google challenge… I hope they will release their generator soon!

The Doctoral Program

Differently from other conferences (as for instance IPCO), CP gives PhD students the opportunity to present their ongoing work within a Doctoral Program. The sponsors cover part of the costs for attending the conference. During the conference each student has a mentor who is supposed to help him. This year there were around 24 students and only very few of them had a paper accepted at the main conference. This means that without the Doctoral Program, most of these students would not had the opportunity to attend the conference.

Geoffrey Chu awarded the 2012 ACP Doctoral Research Award for his thesis Improving Combinatorial Optimization. To give you an idea about the amount of his contributions, consider that after his thesis presentation, someone in the audience asked:

“And you got only one PhD for all this work?”

Chapeau! Among other things, Chu has implemented Chuffed one of the most efficient CP solver that uses lazy clause generation and that ranked very well at the last MiniZinc Challenge, even if it was not one of the official competitors.

For the record, the winner of the MiniZinc challenge of this year is (again) the Gecode team. Congratulations!

Next Year

Next year CP will be held in Sweden, at Uppsala University on 16-20 September 2013. Will you be there? I hope so…

In the meantime, if you were at the conference, which was your favorite talk and/or paper?

Dijkstra, Dantzig, and Shortest Paths

2012-09-19T22:14:00+02:00

Here we go, my first blog entry, ever. Let’s start with two short quizzes.

1. The well known Dijkstra’s algorithm is:
[a] A greedy algorithm
[b] A dynamic programming algorithm
[c] A primal-dual algorithm
[d] It was discovered by Dantzig

2. Which is the best C++ implementation of Dijkstra’s algorithm among the following?
[a] The Boost Graph Library (BGL)
[b] The COIN-OR Lemon Graph Library
[c] The Google OrTools
[d] Hei dude! We can do better!!!

What is your answer for the first question? … well, the answers are all correct! And for the second question? To know the correct answer, sorry, you have to read this post to the end…

If you are curious to learn more about the classification of the Dijkstra’s algorithm proposed in the first three answers, please consider reading [1] and [2]. Honestly, I did not know that the algorithm was independently discovered by Dantzig [3] as a special case of Linear Programming. However, Dantzig is credited for the first version of the bidirectional Dijkstra’s algorithm (should we called it Dantzig’s algorithm?), which is nowadays the best performing algorithm on general graphs. The bidirectional Dijkstra’s algorithm is used as benchmark to measure the speed-up of modern specialized shortest path algorithms for road networks [4,5], those algorithms that are implemented, for instance, in our GPS navigation systems, in yours smartphones (I don’t have one, argh!), in Google Maps Directions, and Microsoft Bing Maps.

Why a first blog entry on Dijkstra’s algorithm? That’s simple.

Have you ever implemented an efficient version of this well-known and widely studied algorithm?
Have you ever used the version that is implemented in well-reputed graph libraries, such as, the Boost Graph Library (BGL), the COIN-OR Lemon, and/or Google OrTools?

I did while programming in C++, and I want to share with you my experience.

The Algorithm

The algorithm is quite simple. First partition the nodes of the input graph G=(N,A) in three sets: the sets of (1) scanned, (2) reachable, and (3) unvisited nodes. Every node has a distance label $d_i$ and a predecessor vertex $p_i$ . Initially, set the label of the source node $d_s=0$ , while set $d_i=+\infty$ for all other nodes. Moreover, the node s is placed in the set of reachable nodes, while all the other nodes are unvisited.

The algorithm proceedes as follows: select a reachable node i with minimum distance label, and move it in the set of scanned nodes, it will be never selected again. For each arc (i,j) in the forward star of node i check if node j has distance label $d_j > d_i + c_{ij}$ ; if it is the case, update the label $d_j = d_i + c_{ij}$ and the predecessor vertex $p_j=i$ . In addition, if the node was unvisited, move it in the set of reachable nodes. If the selected node i is the destination node t, stop the algorithm. Otherwise, continue by selecting the next node i with minimum distance label.

The algorithm stops either when it scans the destination node t or the set of reachable nodes is empty. For the nice properties of the algorithm consult any textbook in computer science or operations research.

At this point it should be clear why Dijkstra’s algorithm is greedy: it always select a reachable node with minimum distance label. It is a dynamic programming algorithm because it maintains the recursive relation $d_j = \min \{d_i + c_{ij} \mid (i,j) \in A \}$ for all $j \in N$ . If you are familiar with Linear Programming, you should recognize that the distance labels play the role of dual variable of a flow based formulation of the shortest path problem, and the Dijkstra’s algorithm costructs a primal solution (i.e. a path) that satisfies the dual constraints $d_j - d_i \leq c_{ij}$ .

Graphs and Heaps

The algorithm uses two data structures: the input graph G and the set of reachable nodes Q. The graph G can be stored with an adjacency list, but be sure that the arcs are stored in contiguous memory, in order to reduce the chance of cache misses when scanning the forward stars. In my implementation, I have used a std::vector to store the forward star of each node.

The second data structure, the most important, is the priority queue Q. The queue has to support three operations: push, update, and extract-min. The type of priority queue used determines the worst-case complexity of the Dijkstra’s algorithm. Theoretically, the best strongly polynomial worst-case complexity is achieved via a Fibonacci heap. On road networks, the Multi Bucket heap yields a weakly polynomial worst-case complexity that is more efficient in practice [4,5]. Unfortunately, the Fibonacci Heap is a rather complex data structure, and lazy implementations end up in using a simpler Binomial Heap.

The good news is that the Boost Library from version 1.49 has a Heap library. This library contains several type of heaps that share a common interface: d-ary-heap, binomial-heap, fibonacci-heap, pairing-heap, and skew-heap. The worst-case complexity of the basic operations are summarized in a nice table. Contrary to text-books, these heaps are ordered in non increasing order (they are max-heap instead of min-heap), that means that the top of the heap is always the element with highest priority. For implementing Dijkstra, where all arc lengths are non negative, this is not a problem: we can store the elements with the distance changed in sign (sorry for the rough explanation, but if you are really intrested it is better to read directly the source code).

The big advantage of boost::heap is that it allows to program Dijkstra once, and to compile it with different heaps via templates. If you wonder why the Boost Graph Library does not use boost::heap, well, the reason is that BGL was implemented a few years ago, while boost::heap appeared this year.

Benchmarking on Road Networks

Here is the point that maybe interests you the most: can we do better than well-reputed C++ graph libraries?

I have tried three graph libraries: Boost Graph Library (BGL) v1.51, COIN-OR Lemon v1.2.3, and Google OrTools cheked out from svn on Sep 7th, 2012. They all have a Dijkstra implementation, even if I don’t know the implementation details. As a plus, the three libraries have python wrappers (but I have not test it). The BGL is a header only library. Lemon came after BGL. BGL, Lemon, and my implementation use (different) Fibonacci Heaps, while I have not clear what type of priority queue is used by OrTools.

Disclaimer: Google OrTools is much more than a graph library: among others, it has a Constraint Programming solver with very nice features for Large Neighborhood Search; however, we are interested here only in its Dijkstra implementation. Constraint Programming will be the subject of another future post.

A few tests on instances taken from the last DIMACS challenge on Shortest Path problems show the pros and cons of each implementation. Three instances are generated using the rand graph generator, while 10 instances are road networks. The test are done on my late 2008 MacBookPro using the apple gcc-4.2 compiler. All the source code, scripts, and even this post text, are available on github.

RAND Graphs

The first test compares the four implementations on 3 graphs with different density d that is the ratio $\frac{2m}{n(n-1)}$ . The graphs are:

Rand 1: with n=10000, m=100000, d=0.001
Rand 2: with n=10000, m=1000000, d=0.01
Rand 3: with n=10000, m=10000000, d=0.1

For each graph, 50 queries between different pairs of source and destination nodes are performed. The table below reports the average of query times (total time divided by query numbers). The entries in bold highlight the shortest time per row.

Graph	MyGraph	BGL	Lemon	OrTools
Rand 1	0.0052	0.0059	0.0074	1.2722
Rand 2	0.0134	0.0535	0.0706	1.6128
Rand 3	0.0705	0.5276	0.7247	4.2535

In these tests, it looks like my implementation is the winner… wow! Although, the true winner is the boost::heap library, since the nasty implementation details are delegated to that library.

… but come on! These are artificial graphs: who is really interested in shortest paths on random graphs?

Road Networks

The second test uses road networks that are very sparse graphs. We report only average computation time in seconds over 50 different pair of source-destination nodes. We decided to leave out OrTools since it is not very performing on very sparse graphs.

This table below shows the average query time for the standard implementations that use Fibonacci Heaps.

Area	nodes	arcs	MyGraph	BGL	Lemon
Western USA	6,262,104	15,248,146	2.7215	2.7804	3.8181
Eastern USA	3,598,623	8,778,114	1.9425	1.4255	2.7147
Great Lakes	2,758,119	6,885,658	0.1808	0.8946	0.2602
California and Nevada	1,890,815	4,657,742	0.5078	0.5808	0.7083
Northeast USA	1,524,453	3,897,636	0.6061	0.5662	0.8335
Northwest USA	1,207,945	2,840,208	0.3652	0.3506	0.5152
Florida	1,070,376	2,712,798	0.1141	0.2753	0.1574
Colorado	435,666	1,057,066	0.1423	0.1117	0.1965
San Francisco Bay	321,270	800,172	0.1721	0.0836	0.2399
New York City	264,346	733,846	0.0121	0.0677	0.0176

From this table, BGL and my implementation are equally good, while Lemon comes after. What would happen if we use a diffent type of heap?

This second table shows the average query time for the Lemon graph library with a specialized Binary Heap implementation, and my own implementation with generic 2-Heap and 3-Heap (binary and ternary heaps) and with a Skew Heap. Note that in order to use a different heap I just modify a single line of code.

Area	nodes	arcs	2-Heap	3-Heap	Skew Heap	Lemon 2-Heap
Western USA	6,262,104	15,248,146	1.977	1.934	2.104	1.359
Eastern USA	3,598,623	8,778,114	1.406	1.372	1.492	0.938
Great Lakes	2,758,119	6,885,658	0.132	0.130	0.135	0.109
California and Nevada	1,890,815	4,657,742	0.361	0.353	0.372	0.241
Northeast USA	1,524,453	3,897,636	0.433	0.421	0.457	0.287
Northwest USA	1,207,945	2,840,208	0.257	0.252	0.256	0.166
Florida	1,070,376	2,712,798	0.083	0.081	0.080	0.059
Colorado	435,666	1,057,066	0.100	0.098	0.100	0.064
San Francisco Bay	321,270	800,172	0.121	0.117	0.122	0.075
New York City	264,346	733,846	0.009	0.009	0.009	0.007

Mmmm… I am no longer the winner: COIN-OR Lemon is!

This is likely due to the specialized binary heap implementation of the Lemon library. Instead, the boost::heap library has a d-ary-heap, that for d=2 is a generic binary heap.

So what?

Dijkstra’s algorithm is so beatiful because it has the elegance of simplicity.

Using an existing efficient heap data structure, it is easy to implement an “efficient” version of the algorithm.

However, if you have spare time, or you need to solve shortest path problems on a specific type of graphs (e.g., road networks), you might give a try with existing graph libraries, before investing developing time in your own implementation. In addition, be sure to read [4] and the references therein contained.

All the code I have used to write this post is available on github. If you have any comment or criticism, do not hesitate to comment below.

References

Pohl, I. Bi-directional and heuristic search in path problems. Department of Computer Science, Stanford University, 1969. [pdf]
Sniedovich, M. Dijkstra’s algorithm revisited: the dynamic programming connexion. Control and cybernetics vol. 35(3), pages 599-620, 2006. [pdf]
Dantzig, G.B. Linear Programming and Extensions. Princeton University Press, Princeton, NJ, 1962.
Delling, D. and Sanders, P. and Schultes, D. and Wagner, D. Engineering route planning algorithms. Algorithmics of large and complex networks Lecture Notes in Computer Science, Volume 5515, pages 117-139, 2009. [doi]
Goldberg, A.V. and Harrelson, C. Computing the shortest path: A-star search meets graph theory. Proc. of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, 156-165, 2005. [pdf]