Yes, an OR perspective.
Why an OR perspective?
Well, because most of the current theoretical work on Optimal Transport has a strong functional analysis bias and, hence, is pretty far from being an “easy read” for anyone working in a different research area. Since I’m more comfortable with “summations” than with “integrations”, in this post I focus only on Discrete Optimal Transport and on Kantorovich-Wasserstein distances between a pair of discrete measures.
Why an ML motivation?
Because measuring the similarity between complex objects is a crucial basic step in several #machinelearning tasks. Mathematically, in order to measure the similarity (or dissimilarity) between two objects we need a metric, i.e., a distance function. And Optimal Transport gives us a powerful similarity measure based on the solution of a Combinatorial Optimization problem, which can be formulated and solved with Linear Programming.
The main Inspirations for this post are:
DISCLAIMER 1: This is a long “ongoing” post, and, despite my efforts, it might contain errors of any type. If you have any suggestion for improving this post, please (!) let me know: I will be more than happy to mention you (or any of your avatars) in the acknowledgement section. Otherwise, if you prefer, I can offer you a drink, whenever we meet in real life.
DISCLAIMER 2: I wrote this post while reading the book Ready Player One, by Ernest Cline.
DISCLAIMER 3: I’m recruiting postdocs. If you like the topic of this post and you are looking for a postdoc position, write me an email.
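A metric is a function, usually denoted by $d$, between a pair of objects belonging to a space $\mathcal{X}$:

$$d : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}_+$$

Given any triple of points $x, y, z \in \mathcal{X}$, the conditions that $d$ must satisfy in order to be a metric are:

(1) $d(x, y) \geq 0$ (non-negativity)
(2) $d(x, y) = 0 \Leftrightarrow x = y$ (identity of indiscernibles)
(3) $d(x, y) = d(y, x)$ (symmetry)
(4) $d(x, y) \leq d(x, z) + d(z, y)$ (triangle inequality)

If the space is $\mathbb{R}^k$, then $x, y, z$ are vectors of $k$ elements, and the most common distance is indeed the Euclidean function

$$d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$$

where $x_i$ is the $i$-th component of the vector $x$. Clearly, the algorithmic complexity of computing (in finite precision) this distance is linear in the dimension $k$.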
QUESTION: What if we want to compute the distance between a pair of clouds of points defined in $\mathbb{R}^k$?
If we want to compute the distance between the two vectors that represent the two clouds of points, we need to define a distance function.
Let me fix the notation first. If $\mathbf{x}$ is a vector, then $x_i$ is its $i$-th element. Suppose we have two matrices $X$ and $Y$ with $n \times k$ elements, which represent $n$ points in $\mathbb{R}^k$. We denote by $\mathbf{x}_i$ the $i$-th row of matrix $X$, and by $x_{ij}$ the $j$-th element of row $i$. Indeed, the rows $\mathbf{x}_i$ and $\mathbf{y}_i$ of the two matrices give the coordinates $(x_{i1}, \dots, x_{ik})$ and $(y_{i1}, \dots, y_{ik})$ of the two corresponding points.
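Whenever $n = 1$, a simple choice is to consider the Minkowski Distance, which is a metric for normed vector spaces:

$$d_p(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{k} |x_i - y_i|^p \right)^{1/p}$$

where typical values of $p$ are:

- $p = 1$, the Manhattan distance
- $p = 2$, the Euclidean distance
- $p = \infty$, the Chebyshev (maximum) distance

We have also the Minkowski norm, that is a function

$$\| \cdot \|_p : \mathbb{R}^k \rightarrow \mathbb{R}_+$$

computed as

$$\| \mathbf{x} \|_p = \left( \sum_{i=1}^{k} |x_i|^p \right)^{1/p}$$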
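Whenever $n > 1$, we have to consider a more general distance function:

$$D : \mathbb{R}^{n \times k} \times \mathbb{R}^{n \times k} \rightarrow \mathbb{R}_+$$

such that the relations (1)–(4) are satisfied. We could use as distance function any matrix norm, but, to begin with, we can use the Minkowski distance twice in cascade as follows.

- First, we compute the Minkowski distance of order $p$ between each pair of corresponding rows: $z_i = d_p(\mathbf{x}_i, \mathbf{y}_i)$, for $i = 1, \dots, n$.
- Second, we compute the Minkowski norm of order $q$ of the resulting vector $\mathbf{z}$.

Composing these two operations, we can define a distance function between a pair of vectors of points (i.e., pair of matrices) as follows:

$$D_{p,q}(X, Y) = \left( \sum_{i=1}^{n} d_p(\mathbf{x}_i, \mathbf{y}_i)^q \right)^{1/q}$$

Note that for $p = q = 2$, we get

$$D_{2,2}(X, Y) = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{k} (x_{ij} - y_{ij})^2}$$

which is the Frobenius Norm of the element-wise difference $X - Y$.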
The main drawback of this distance function is that it implicitly relies on the order (position) of the single points in the two input vectors: any permutation of one (or both) of the two vectors will yield a different value of the ground distance. This happens because the distance function between the two input vectors considers only “interactions” between the $i$-th pair of points stored at the same $i$-th position in the two vectors.
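To make the drawback concrete, here is a minimal Python sketch of a row-by-row (cascade) distance. It is only an illustration under my own assumptions: the names `minkowski` and `cascade_distance` are made up for this example, and the two clouds are toy data.

```python
def minkowski(x, y, p=2):
    """Minkowski distance of order p between two points (sequences of floats)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def cascade_distance(X, Y, p=2, q=2):
    """Norm of order q of the vector of row-by-row Minkowski distances."""
    return sum(minkowski(x, y, p) ** q for x, y in zip(X, Y)) ** (1.0 / q)

# Two clouds made of exactly the same three points, stored in a different order.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
Y = [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]

same_order = cascade_distance(X, X)  # identical order: distance zero
shuffled = cascade_distance(X, Y)    # same points, permuted rows: distance > 0
print(same_order, shuffled)
```

Even though the two clouds contain the same points, the permuted version gets a strictly positive distance, which is exactly the order dependence described above.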
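IMPORTANT. Here is where Discrete Optimal Transport comes into action: it offers an alternative distance function based on the solution of a Combinatorial Optimization problem, which is, in the simplest case, formulated as the following Linear Program:

$$
\begin{aligned}
\min \quad & \sum_{i=1}^{n} \sum_{j=1}^{n} d(\mathbf{x}_i, \mathbf{y}_j) \, \pi_{ij} \\
\text{s.t.} \quad & \sum_{j=1}^{n} \pi_{ij} = 1, \quad i = 1, \dots, n \\
& \sum_{i=1}^{n} \pi_{ij} = 1, \quad j = 1, \dots, n \\
& \pi_{ij} \geq 0, \quad i, j = 1, \dots, n
\end{aligned}
$$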
If you have a minimal #orms background, at this point you should have recognized that this problem is a standard Assignment Problem: we have to assign each point of the first vector $X$ to a single point of the second vector $Y$, in such a way that the overall cost is minimum. From an optimal solution of this problem, we can select, among all possible permutations of the rows of $Y$, the permutation that gives the minimal value of the Frobenius norm.
Whenever the ground distance $d$ is a metric, then the optimal value of this problem, seen as a function of $X$ and $Y$, is a metric as well. In other terms, the optimal value of this problem is a measure of distance between the two vectors, while the optimal values of the decision variables $\pi_{ij}$ give a mapping from the rows of $X$ to the rows of $Y$ (in OT terminology, an optimal plan). This is possible because the LP problem has a Totally Unimodular coefficient matrix, and, hence, every basic optimal solution of the LP problem has integer values.
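Since every basic optimal solution is integral, on a tiny instance we can find an optimal plan by simply enumerating all the permutations, without any LP solver. The sketch below does exactly that; it is brute force (it enumerates $n!$ permutations), not the Linear Programming approach, and the names `euclidean` and `best_assignment` are hypothetical helpers introduced only for this example.

```python
import math
from itertools import permutations

def euclidean(x, y):
    """Euclidean ground distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def best_assignment(X, Y, dist=euclidean):
    """Return (cost, perm) minimizing sum_i dist(X[i], Y[perm[i]]).

    Enumerates all n! permutations, so it is meant for tiny n only.
    """
    n = len(X)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        cost = sum(dist(X[i], Y[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm

# Same three points, stored in a different order in the two clouds.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
Y = [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]

cost, perm = best_assignment(X, Y)
print(cost, perm)
```

Here the optimal plan matches each point of the first cloud with its copy in the second one, so the resulting distance no longer depends on how the rows are ordered.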
WAIT, LET ME STOP HERE FOR A SECOND!
I am being too technical, too early. Let me take a step back in History.
The History of Optimal Transport is quite fascinating and it begins with the Mémoire sur la théorie des déblais et des remblais by Gaspard Monge (1746–1818). I like to think of Gaspard as visiting the Dune of Pilat, near Bordeaux, and then writing his Mémoire on his way back home… but this is only my imagination. Still, particles of sand give me the most concrete idea for passing from a continuous to a discrete problem.
In his Mémoire, Gaspard Monge considered the problem of transporting “des terres d’un lieu dans un autre” at minimal cost. The idea is that we have first to consider the cost of transporting a single molecule of “terre”, which is proportional to its weight and to the distance between its initial and final position. The total cost of transportation is given by summing up the transportation cost of each single molecule. In Lex Schrijver’s words, Monge’s transportation problem was camouflaged as a continuous problem.
The idea of Gaspard is to assign to each initial position a single final destination: it is not possible to split a molecule into smaller parts. This unsplittable version of the problem posed a very challenging problem where “the direct methods of the calculus of variations fail spectacularly” (Lawrence C. Evans, see link at page 5). This challenge stayed unsolved until the work of Leonid Kantorovich (1912–1986).
Curiously, Leonid did not arrive at the transportation problem by directly studying the work of Monge. Instead, he was asked to solve an industrial resource allocation problem, which is more general than Monge’s problem. Only a few years later, he reconsidered his contribution in terms of the continuous probabilistic version of Monge’s transportation problem. However, I am unable to state the true technical contributions of Leonid with respect to the work of Gaspard in a short post (well, honestly, I would be unable even in an infinite post), but I recommend that you read the Long History of the Monge-Kantorovich Transportation Problem.
Anyway, the two main concepts at the foundation of the work by Kantorovich are pretty clear to me:
For the record, Leonid Kantorovich won the Nobel Prize, and his autobiography deserves to be read more than once.
Well, I have still so much to learn from the past!
If you think that Optimal Transport belongs to the past, you are wrong!
Last summer (2018), in Rio de Janeiro, Alessio Figalli won the Fields Medal with this citation:
“for his contributions to the theory of optimal transport, and its application to partial differential equations, metric geometry, and probability”
Alessio is not the first Fields medalist who worked on Optimal Transport. Cedric Villani, who wrote the most cited book on Optimal Transport [1], had already won the Fields Medal in 2010. I strongly suggest that you watch any of his lectures available on Youtube. And … did you know that Villani spent a short period of time during his PhD in my current Math Dept. at University of Pavia?
As an extra bonus, if you don’t know what a Fields Medal is, you can let Robin Williams explain the prize in this clip taken from Good Will Hunting.
It is time to be technical again and to move from the Monge assignment problem to the Kantorovich “relaxed” transportation problem. In the assignment model presented above, we are implicitly considering that all the positions are occupied by a single molecule of unitary weight. If we want to consider the more general setting of Discrete Optimal Transport, we need to consider the “mass” of each molecule, and to formulate the problem of transporting the total mass at minimum cost. Before presenting the model, we define formally the concept of a discrete measure and of the cost matrix between all pairs of molecule positions.
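DISCRETE MEASURES: Given a vector of positions $\mathbf{x} = (x_1, \dots, x_n)$, and given the Dirac delta function, we can define the Dirac measures as

$$\delta_{x_i}(A) = \begin{cases} 1 & \text{if } x_i \in A \\ 0 & \text{otherwise} \end{cases}$$

Given a vector of weights $\mu = (\mu_1, \dots, \mu_n)$, one associated to each element of $\mathbf{x}$, we can define the discrete measure as

$$\mu(\mathbf{x}) = \sum_{i=1}^{n} \mu_i \, \delta_{x_i}$$

Note that $\mu(\mathbf{x})$ is a function of type $A \mapsto \sum_{i : x_i \in A} \mu_i$, for any subset $A$ of the underlying space. The vector $\mathbf{x}$ is called the support of the measure $\mu(\mathbf{x})$. In Computer Science terms, a discrete measure is defined by a vector of pairs, where each pair contains a positive number $\mu_i$ (the measured value) and its support point $x_i$ (the location where the measure occurred). Note that $x_i$ is a (small?) vector storing the coordinates of the $i$-th point.
COST MATRIX: Given two discrete measures $\mu(\mathbf{x})$ and $\nu(\mathbf{y})$, the first with support $\mathbf{x}$ and the second with support $\mathbf{y}$, we can define the following cost matrix:

$$C(\mathbf{x}, \mathbf{y}) = [c_{ij}], \qquad c_{ij} = d(x_i, y_j)$$

where $d$ is a distance function, such as, for instance, the Minkowski distance defined before. Note that $\mathbf{x}$ has $n$ elements, and $\mathbf{y}$ has $m$ elements.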
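At this point, we have all the basic elements to define the Kantorovich-Wasserstein distance function between discrete measures in terms of the solution of a (huge) Linear Program.
INPUT: Two discrete measures $\mu(\mathbf{x})$ and $\nu(\mathbf{y})$ defined on a metric space $\mathcal{X}$, and the corresponding supports $\mathbf{x}$ and $\mathbf{y}$, having $n$ and $m$ elements, respectively. A distance function $d$, which permits to compute the cost $c_{ij} = d(x_i, y_j)$.
OUTPUT: A transportation plan $\pi$ and a distance value that corresponds to an optimal solution of the following Linear Program:

$$
\begin{aligned}
\min \quad & \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij} \, \pi_{ij} \\
\text{s.t.} \quad & \sum_{j=1}^{m} \pi_{ij} \leq \mu_i, \quad i = 1, \dots, n \\
& \sum_{i=1}^{n} \pi_{ij} \geq \nu_j, \quad j = 1, \dots, m \\
& \pi_{ij} \geq 0
\end{aligned}
$$

This Linear Program is indeed a special case of the Transportation Problem, known also as the Hitchcock-Koopmans problem. It is a special case because the cost vector has a strong structure that should be exploited as much as possible.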
Computational Challenge: While the previous problem is polynomially solvable, the size of practical instances is very large. For instance, if you want to compute the distance between a pair of grey scale images of resolution $N \times N$ pixels, you end up with an LP with $N^4$ cost coefficients. Hence, these problems must be handled with care. If you want to see how the solution time and memory requirements scale for grey scale images, please have a look at the slides of my talk at Aussois (2018).
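A quick back-of-the-envelope computation shows how fast the LP grows. The sketch below only counts variables and constraints for two square images; the function name `transport_lp_size` and the image sides in the loop are arbitrary choices made for illustration.

```python
def transport_lp_size(side):
    """Size of the transportation LP between two side x side grey scale images."""
    n = side * side      # one support point per pixel
    variables = n * n    # one flow variable (and cost coefficient) per pair of pixels
    constraints = 2 * n  # one constraint per source pixel and per destination pixel
    return variables, constraints

for side in (32, 64, 128):
    variables, constraints = transport_lp_size(side)
    print(f"{side}x{side}: {variables:,} variables, {constraints:,} constraints")
```

Doubling the side of the image multiplies the number of cost coefficients by 16, which is why the structure of the cost vector matters so much in practice.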
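Whenever

$$\sum_{i=1}^{n} \mu_i = \sum_{j=1}^{m} \nu_j = 1, \qquad \mu_i \geq 0, \quad \nu_j \geq 0$$

then we define the Kantorovich-Wasserstein distance of order $p$ as the following functional:

$$W_p(\mu, \nu) = \left( \min_{\pi \in \Pi(\mu, \nu)} \sum_{i=1}^{n} \sum_{j=1}^{m} d(x_i, y_j)^p \, \pi_{ij} \right)^{1/p}$$

where the set $\Pi(\mu, \nu)$ is defined as:

$$\Pi(\mu, \nu) = \left\{ \pi \in \mathbb{R}^{n \times m}_+ \; : \; \sum_{j=1}^{m} \pi_{ij} \leq \mu_i \;\; \forall i, \quad \sum_{i=1}^{n} \pi_{ij} \geq \nu_j \;\; \forall j \right\}$$

From a mathematical perspective, the most interesting case is the order $p = 2$, which generalizes the Euclidean distance to discrete probability vectors. Note that in this formulation, the two constraint sets defining $\Pi(\mu, \nu)$ could be replaced with equality constraints, since $\mu$ and $\nu$ belong to the probability simplex. In addition, any Combinatorial Optimization algorithm must be used with care, since the cost and constraint coefficients are, in general, not integer. Note: The power used for the ground distance $d$ must not be confused with the order (power) $p$ of a Kantorovich-Wasserstein distance.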
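For a first feeling of these distances without an LP solver, we can look at the special case of measures supported on the real line with ground distance $|x - y|$: there, for any order $p \geq 1$, an optimal plan is obtained by sorting the two supports and shipping mass greedily from left to right. The sketch below implements only this one-dimensional shortcut (it is not the general Linear Program), and the function name `wasserstein_1d` and the toy data are my own.

```python
def wasserstein_1d(xs, mu, ys, nu, p=1):
    """Kantorovich-Wasserstein distance of order p on the real line.

    xs, ys: support points; mu, nu: non-negative weights, each summing to 1.
    Uses the monotone (sorted, left-to-right) transport plan.
    """
    src = sorted(zip(xs, mu))
    dst = sorted(zip(ys, nu))
    i = j = 0
    left_i, left_j = src[0][1], dst[0][1]  # mass still to ship / to receive
    total = 0.0
    eps = 1e-12
    while i < len(src) and j < len(dst):
        flow = min(left_i, left_j)
        total += flow * abs(src[i][0] - dst[j][0]) ** p
        left_i -= flow
        left_j -= flow
        if left_i <= eps:  # source point exhausted: move to the next one
            i += 1
            left_i = src[i][1] if i < len(src) else 0.0
        if left_j <= eps:  # destination point filled: move to the next one
            j += 1
            left_j = dst[j][1] if j < len(dst) else 0.0
    return total ** (1.0 / p)

xs, mu = [0.0, 1.0, 3.0], [0.5, 0.25, 0.25]
ys, nu = [0.0, 2.0], [0.5, 0.5]
w1 = wasserstein_1d(xs, mu, ys, nu, p=1)
w2 = wasserstein_1d(xs, mu, ys, nu, p=2)
print(w1, w2)
```

Note how the order $p$ enters twice: as the exponent of the ground distance inside the sum and as the final root.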
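A particular case of the Kantorovich-Wasserstein distance, very popular in the Computer Vision research community, is the so-called Earth Mover Distance (EMD) [2], which is used between a pair of $k$-dimensional histograms obtained by preprocessing the images of interest. In this case, (i) we have $p = 1$, (ii) we do not require the discrete measures to belong to the probability simplex, and (iii) we do not even require that the two measures are balanced, that is, $\sum_{i} \mu_i = \sum_{j} \nu_j$. In Optimal Transport terminology, the Earth Mover Distance solves an unbalanced optimal transport problem. For the EMD, the feasibility set $\Pi(\mu, \nu)$ is replaced by the set:

$$\Pi_{EMD}(\mu, \nu) = \left\{ \pi \in \mathbb{R}^{n \times m}_+ \; : \; \sum_{j=1}^{m} \pi_{ij} \leq \mu_i \;\; \forall i, \quad \sum_{i=1}^{n} \pi_{ij} \leq \nu_j \;\; \forall j, \quad \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij} = \min \left\{ \sum_{i} \mu_i, \sum_{j} \nu_j \right\} \right\}$$

The cost function is taken with the order $p = 1$:

$$EMD(\mu, \nu) = \frac{\min_{\pi \in \Pi_{EMD}(\mu, \nu)} \sum_{i=1}^{n} \sum_{j=1}^{m} d(x_i, y_j) \, \pi_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij}}$$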
The most used function is the Minkowski distance induced by the norm.
A very interesting application of discrete optimal transport is the definition of a metric for text documents [3]. The main idea is, first, to exploit a word embedding obtained, for instance, with the popular word2vec neural network [4], and, second, to formulate the problem of “transporting” a text document into another at minimal cost.
Yes, but … how do we compute the ground distance between two words?
A word embedding associates to each word of a given vocabulary a vector of $\mathbb{R}^k$. For the pretrained embedding made available by Google at this archive, which contains the embedding of around 3 million words, $k$ is equal to 300. Indeed, given a vocabulary of $n$ words and a fixed dimension $k$, a word embedding is given by a matrix $V$ of dimension $n \times k$: Row $i$ gives the $k$-dimensional vector $\mathbf{v}_i$ representing the word embedding of word $i$.
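In this case, instead of having discrete measures, we deal with normalized bag-of-words (nBOW), which are vectors of $\mathbb{R}^n_+$, where $n$ denotes the number of words in the vocabulary. If a text document $\mathbf{x}$ contains $c_i$ times the word $i$, then $x_i = c_i / \sum_{j=1}^{n} c_j$. At this point it is clear that the ground distance between a pair of words is given by the distance between the corresponding embedding vectors in $\mathbb{R}^k$, that is, given two words $i$ and $j$ with embedding vectors $\mathbf{v}_i$ and $\mathbf{v}_j$, then

$$c(i, j) = \| \mathbf{v}_i - \mathbf{v}_j \|_2$$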
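As a small illustration of the nBOW representation, the following sketch turns a sentence into a normalized bag-of-words vector. The function name `nbow`, the whitespace tokenization, and the four-word vocabulary are simplifying assumptions of mine, not part of [3].

```python
from collections import Counter

def nbow(document, vocabulary):
    """Normalized bag-of-words vector of `document` over `vocabulary` (a list of words).

    Words are obtained by lowercasing and splitting on whitespace;
    words outside the vocabulary are ignored.
    """
    known = set(vocabulary)
    counts = Counter(w for w in document.lower().split() if w in known)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the document contains no word of the vocabulary")
    return [counts[w] / total for w in vocabulary]

vocabulary = ["optimal", "transport", "is", "fun"]
x = nbow("Optimal transport is optimal", vocabulary)
print(x)
```

The entries of the vector are non-negative and sum to one, so an nBOW vector is a point of the probability simplex, exactly like the weights of a discrete measure.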
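Finally, given two text documents $\mathbf{x}$ and $\mathbf{y}$, represented by their nBOW vectors, we can formulate the Linear Program that gives the Word Mover Distance as:

$$WMD(\mathbf{x}, \mathbf{y}) = \min_{\pi \in \Pi(\mathbf{x}, \mathbf{y})} \sum_{i=1}^{n} \sum_{j=1}^{n} c(i, j) \, \pi_{ij}$$

where $c(i, j)$ is the Euclidean distance between the embedding vectors of words $i$ and $j$, and the set $\Pi(\mathbf{x}, \mathbf{y})$ is defined as:

$$\Pi(\mathbf{x}, \mathbf{y}) = \left\{ \pi \in \mathbb{R}^{n \times n}_+ \; : \; \sum_{j=1}^{n} \pi_{ij} = x_i \;\; \forall i, \quad \sum_{i=1}^{n} \pi_{ij} = y_j \;\; \forall j \right\}$$

If you are a serious reader, and you are still reading this post, then it is clear that the Word Mover Distance is exactly a Kantorovich-Wasserstein distance of order 1, with a Euclidean ground distance. If you are interested in the quality of this distance when used within a nearest neighbor heuristic for a text classification task, I refer you to [3]. (While writing this post, I started to wonder how a WMD of order 2 would perform, but this is another story…)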
The following are the research topics I am interested in right now. Each topic deserves its own blog post, but let me write here just a short sketch.
Computational Challenges: The development of efficient algorithms for the solution of Optimal Transport problems is an active area of research. Currently, the preferred (heuristic) approach is based on so-called regularized optimal transport, introduced originally in [5]. Indeed, regularized optimal transport deserves its own blog post. In two recent works, together with my co-authors, we tried to revive the Network Simplex for two special cases: Kantorovich-Wasserstein distances of order 1 for 2-dimensional histograms [6] and Kantorovich-Wasserstein distances of order 2 for decomposable cost functions [7]. The second paper was presented as a poster at NeurIPS 2018. (If you ever read either of the two papers, please let me know what you think about them.)
Unbalanced Optimal Transport: The computation of Kantorovich-Wasserstein distances for pairs of unbalanced discrete measures is very challenging. Last year, a brilliant math student finished a nice project on this topic, which I hope to finalize during the next semester.
Barycenters of Discrete Measures: The Kantorovich-Wasserstein distance can be used to generalize the concept of barycenters, that is, the problem of finding a discrete measure that is the closest (in Kantorovich terms) to a given set of discrete measures. The problem of finding the barycenter can be formulated as a Linear Program, where the unknowns (the decision variables) are both the transport plan and the discrete measure representing the barycenter. For instance, the following images, taken from our last draft paper, represent the barycenters of each of the 10 digits of the MNIST data set of handwritten images (each image is the barycenter of other images).
Now, it’s time to close this post with final remarks.
Given the number and the quality of results achieved on the Theory of Optimal Transport by “pure” mathematicians, it is time to turn these theoretical results into a set of useful algorithms implemented in efficient and scalable solvers. So far, the only public library I am aware of is POT: Python Optimal Transport.
On Medium, C.E. Perez claims that Optimal Transport Theory (is) the New Math for Deep Learning. In his short post, he explains how Optimal Transport is used in Deep Learning algorithms, specifically in Generative Adversarial Networks (GANs), to replace the Kullback-Leibler (KL) divergence.
Honestly, I do not have a clear idea regarding the potential impact of Kantorovich distances on GANs and Deep Learning in general, but I think there are a lot of research opportunities for everyone with a strong passion for Computational Combinatorial Optimization.
And you, what do you think about the topics presented in this post?
As usual, I will be very happy to hear from you, in the meantime…
GAME OVER
I would like to thank Marco Chiarandini, Stefano Coniglio, and Federico Bassetti for constructive criticism of this blog post.
[1] Villani, C. Optimal transport, old and new. Grundlehren der mathematischen Wissenschaften, Vol. 338, Springer-Verlag, 2009.
[2] Rubner, Y., Tomasi, C. and Guibas, L.J., 2000. The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2), pp. 99–121.
[3] Kusner, M.J., Sun, Y., Kolkin, N.I., Weinberger, K.Q. From Word Embeddings To Document Distances. Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015.
[4] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J., 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111–3119).
[5] Cuturi, M., 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems (pp. 2292–2300).
[6] Bassetti, F., Gualandi, S. and Veneroni, M., 2018. On the Computation of Kantorovich-Wasserstein Distances between 2D-Histograms by Uncapacitated Minimum Cost Flows. arXiv preprint arXiv:1804.00445.
[7] Auricchio, G., Bassetti, F., Gualandi, S. and Veneroni, M., 2018. Computing Kantorovich-Wasserstein Distances on d-dimensional histograms using (d+1)-partite graphs. NeurIPS, 2018.
As a benchmark, I grabbed a large text file from P. Norvig’s website, which is 6’488’666 bytes long.
The final answer? Yes, mispredicted branches have a huge impact in Python too.
The hidden answer? Python dictionaries never cease to surprise me: they are REALLY efficient.
NOTE: The following code snippets were executed in a Python 3.5 notebook, on a machine running Windows 10 and Anaconda Python 3.5 (64 bit). You can find my notebook on my Blog GitHub repo. Don’t ask me why, but this blog entry is better visualized directly on GitHub.
UPDATE: Well, most of the time I would use my first implementation, based on the filter builtin function, and I would try alternative implementations only after a profiler has shown that removing blanks is a true bottleneck of my whole program. As written in the title, this post is meant as a basic exercise in Python.
In Python, I prefer to write as much code in functional style as possible, relying on the 3 basic functions:
Therefore, after a few preliminaries, here is my first code snippet:
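A minimal sketch of such a filter-based function is given below. It is an illustration consistent with the profile that follows, not necessarily the exact notebook code; in particular, I am assuming that “blanks” means spaces, newlines, and carriage returns (the constant `BLANKS` is my own name).

```python
# Assumption: "blanks" are spaces, newlines, and carriage returns.
BLANKS = " \n\r"

def RemoveBlanksFilter(ls):
    # The lambda is invoked once for every character of the input string.
    return "".join(filter(lambda c: c not in BLANKS, ls))

print(RemoveBlanksFilter("a b\nc\r d"))
```

With the benchmark file loaded into a string (say, a hypothetical variable `text`), a profile like the one below can be obtained with `cProfile.run('RemoveBlanksFilter(text)')`.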
         6488671 function calls in 1.956 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.955    1.955 <ipython-input-3-eeb7d3495697>:1(RemoveBlanksFilter)
  6488666    0.870    0.000    0.870    0.000 <ipython-input-3-eeb7d3495697>:2(<lambda>)
        1    0.000    0.000    1.956    1.956 <string>:1(<module>)
        1    0.000    0.000    1.956    1.956 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    1.085    1.085    1.955    1.955 {method 'join' of 'str' objects}
Wow, I didn’t realize that I would call the lambda function for every single byte of my input file. This is clearly too much overhead.
Let me drop my functional style, and write a plain old for-loop:
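A minimal sketch of the for-loop version, under the same assumption on what a “blank” is (spaces, newlines, and carriage returns); the body is an illustration consistent with the profile below, not necessarily the exact notebook code.

```python
# Assumption: "blanks" are spaces, newlines, and carriage returns.
BLANKS = " \n\r"

def RemoveBlanks(ls):
    result = []
    for c in ls:
        if c not in BLANKS:
            result.append(c)  # one 'append' call for every character we keep
    return "".join(result)

print(RemoveBlanks("a b\nc\r d"))
```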
Is test passed: True

         5452148 function calls in 1.566 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.210    1.210    1.553    1.553 <ipython-input-6-5e45e3056bc2>:1(RemoveBlanks)
        1    0.012    0.012    1.566    1.566 <string>:1(<module>)
        1    0.000    0.000    1.566    1.566 {built-in method builtins.exec}
  5452143    0.310    0.000    0.310    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.033    0.033    0.033    0.033 {method 'join' of 'str' objects}
Mmm… we just shifted the problem to the list append function calls. Maybe we can do better by working in place.
Well, almost in place: Python strings are immutable; therefore, we first copy the string into a list, and then we work in place over the copied list.
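A minimal sketch of the (almost) in-place version, again assuming that blanks are spaces, newlines, and carriage returns; it is an illustration consistent with the profile below, not necessarily the exact notebook code.

```python
# Assumption: "blanks" are spaces, newlines, and carriage returns.
BLANKS = " \n\r"

def RemoveBlanksInPlace(ls):
    buf = list(ls)  # strings are immutable: work on a mutable copy
    pos = 0         # next free slot for a character we keep
    for c in buf:
        if c not in BLANKS:
            buf[pos] = c  # pos never overtakes the read position, so this is safe
            pos += 1
    return "".join(buf[:pos])

print(RemoveBlanksInPlace("a b\nc\r d"))
```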
Is test passed: True

         5 function calls in 1.158 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.113    1.113    1.145    1.145 <ipython-input-9-99d36ae6359e>:1(RemoveBlanksInPlace)
        1    0.013    0.013    1.158    1.158 <string>:1(<module>)
        1    0.000    0.000    1.158    1.158 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.032    0.032    0.032    0.032 {method 'join' of 'str' objects}
Ok, working in place does have an impact. Let me get to the real point: avoiding mispredicted branches.
As in the original blog post:
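A minimal sketch of the branch-free idea: every character is always written, and a lookup table decides whether the write position advances. The sketch assumes that blanks are spaces, newlines, and carriage returns, and that every character of the input has a code below 256; it is an illustration consistent with the profile below, not necessarily the exact notebook code.

```python
# Assumption: "blanks" are spaces, newlines, and carriage returns.
BLANKS = " \n\r"

def RemoveBlanksNoBranch(ls):
    # Lookup table indexed by character code: 0 for a blank, 1 otherwise.
    step = []
    for code in range(256):
        step.append(0 if chr(code) in BLANKS else 1)
    buf = list(ls)
    pos = 0
    for c in buf:
        buf[pos] = c         # always write the character...
        pos += step[ord(c)]  # ...but advance only if it is not a blank
    return "".join(buf[:pos])

print(RemoveBlanksNoBranch("a b\nc\r d"))
```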
Is test passed: True

         6489183 function calls in 1.474 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.235    1.235    1.460    1.460 <ipython-input-12-1bd75a3de21d>:1(RemoveBlanksNoBranch)
        1    0.014    0.014    1.474    1.474 <string>:1(<module>)
      256    0.000    0.000    0.000    0.000 {built-in method builtins.chr}
        1    0.000    0.000    1.474    1.474 {built-in method builtins.exec}
  6488666    0.192    0.000    0.192    0.000 {built-in method builtins.ord}
      256    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.033    0.033    0.033    0.033 {method 'join' of 'str' objects}
Ouch!!! Things are getting even worse! Why? Well, ‘ord’ is a function, so we are getting back the overhead of function calls. Can we do better by using a dictionary instead of an array?
Let me use a dictionary in order to avoid the ‘ord’ function calls.
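A minimal sketch of the dictionary-based variant: the lookup table is keyed directly by the character, so no `ord` call is needed in the main loop. As before, I am assuming that blanks are spaces, newlines, and carriage returns, and that every input character has a code below 256 (any other character would raise a `KeyError` in this sketch); this is an illustration, not necessarily the exact notebook code.

```python
# Assumption: "blanks" are spaces, newlines, and carriage returns.
BLANKS = " \n\r"

def RemoveBlanksNoBranchDict(ls):
    # Lookup table keyed by character: 0 for a blank, 1 otherwise.
    step = {}
    for code in range(256):
        ch = chr(code)
        step[ch] = 0 if ch in BLANKS else 1
    buf = list(ls)
    pos = 0
    for c in buf:
        buf[pos] = c    # always write the character...
        pos += step[c]  # ...but advance only if it is not a blank
    return "".join(buf[:pos])

print(RemoveBlanksNoBranchDict("a b\nc\r d"))
```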
Is test passed: True

         261 function calls in 0.771 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.724    0.724    0.758    0.758 <ipython-input-15-46ad4c3f0b26>:1(RemoveBlanksNoBranchDict)
        1    0.013    0.013    0.771    0.771 <string>:1(<module>)
      256    0.000    0.000    0.000    0.000 {built-in method builtins.chr}
        1    0.000    0.000    0.771    0.771 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.034    0.034    0.034    0.034 {method 'join' of 'str' objects}
Oooh, yes! Now we can see that without mispredicted branches we can really speed up our algorithm.
Is this the best pythonic solution? No, surely not, but it is still an interesting remark to keep in mind when coding.
Likely, the simplest pythonic solution is just to use the ‘replace’ string method as follows:
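A minimal sketch based on `str.replace`, with one call per kind of blank. The three characters removed here (space, newline, carriage return) are my assumption; they match the three ‘replace’ calls reported in the profile below.

```python
def RemoveBlanksBuiltin(ls):
    # One 'replace' call per kind of blank (assumed: space, newline, carriage return).
    return ls.replace(" ", "").replace("\n", "").replace("\r", "")

print(RemoveBlanksBuiltin("a b\nc\r d"))
```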
Is test passed: True

         7 function calls in 0.065 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.064    0.064 <ipython-input-18-58fd6655cfba>:1(RemoveBlanksBuiltin)
        1    0.001    0.001    0.065    0.065 <string>:1(<module>)
        1    0.000    0.000    0.065    0.065 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        3    0.063    0.021    0.063    0.021 {method 'replace' of 'str' objects}
Here we are: the best solution is indeed to use a builtin function whenever possible, even if this was not the real aim of this exercise.
Please let me know if you have any comments or a different solution in Python.
In this post, I would like to share a simple idea on how to solve to optimality some hard instances of the Graph Coloring problem. This simple idea yields a “new time” record for a couple of hard instances.
To date, the best exact approach to solve Graph Coloring is based on Branch-and-Price [1, 2, 3]. The branch-and-price method is completely different from the Constraint Programming approach I discussed in a previous post. A key component of Branch-and-Price is the column generation phase, which is intuitively quite simple, but mathematically rather involved for a short blog post.
Here, I want to show you that a modern Mixed Integer Programming (MIP) solver, such as Gurobi or CPLEX, can solve a few hard instances of graph coloring with the following “null implementation effort”:
Indeed, in this post we try to answer the following question:
Is there any hope of solving any hard graph coloring instance with this naive approach?
Given an undirected graph $G = (V, E)$ and a set of colors $K$, the minimum (vertex) graph coloring problem consists of assigning a color to each vertex, such that every pair of adjacent vertices gets a different color. The objective is to minimize the number of colors used in a solution.
The branch-and-price approach to graph coloring is based on a set covering formulation. Let $\mathcal{S}$ be the collection of all the maximal stable sets of $G$, and let $\mathcal{S}_i \subseteq \mathcal{S}$ be the maximal stable sets that contain the vertex $i$. Let $\lambda_s$ be a 0-1 variable equal to 1 if all the vertices in the maximal stable set $s$ get assigned the same color. Hence, the set covering model is:

$$
\begin{aligned}
\min \quad & \sum_{s \in \mathcal{S}} \lambda_s \\
\text{s.t.} \quad & \sum_{s \in \mathcal{S}_i} \lambda_s \geq 1, \quad \forall i \in V \\
& \lambda_s \in \{0, 1\}, \quad \forall s \in \mathcal{S}
\end{aligned}
$$

Indeed, we “cover” every vertex of $G$ with the minimal number of maximal stable sets. The issue with this model is the total number of maximal stable sets in $G$, which is, in the worst case, exponential in the number of vertices of $G$.
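To see the set covering model at work on a toy graph, the sketch below enumerates the maximal stable sets of a cycle on 5 vertices and then looks for the smallest cover by plain enumeration. Everything here is brute force, meant only for tiny graphs; the function names `maximal_stable_sets` and `min_set_cover` are made up for this illustration and have nothing to do with the solvers and tools mentioned in this post.

```python
from itertools import combinations

def maximal_stable_sets(n, edges):
    """All maximal stable sets of a graph on vertices 0..n-1, by brute force."""
    adjacent = {frozenset(e) for e in edges}

    def is_stable(vertices):
        return all(frozenset(pair) not in adjacent
                   for pair in combinations(vertices, 2))

    stable = [frozenset(s)
              for r in range(1, n + 1)
              for s in combinations(range(n), r)
              if is_stable(s)]
    # Keep only the stable sets that are not strictly contained in another one.
    return [s for s in stable if not any(s < t for t in stable)]

def min_set_cover(n, sets):
    """Smallest collection of sets covering all the vertices, by enumeration."""
    vertices = set(range(n))
    for k in range(1, len(sets) + 1):
        for chosen in combinations(sets, k):
            if set().union(*chosen) == vertices:
                return k, chosen
    return None

# A cycle on 5 vertices: it needs 3 colors.
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
S = maximal_stable_sets(5, cycle5)
chromatic_number, cover = min_set_cover(5, S)
print(len(S), chromatic_number)
```

The odd cycle has only 5 maximal stable sets, each with 2 vertices, so at least 3 of them are needed to cover the 5 vertices.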
Column Generation is a “mathematically elegant” method to bypass this issue: it lets you solve the set covering model by generating a very small subset of the elements in $\mathcal{S}$. This happens by repeatedly solving an auxiliary problem, called the pricing subproblem. For graph coloring, the pricing subproblem consists of a Maximum Weighted Stable Set problem. If you are interested in Column Generation, I recommend that you look at the first chapter of the Column Generation book, which contains a nice tutorial on the topic, and I would strongly recommend reading the nice survey “Selected Topics in Column Generation” [4].
How many maximal stable sets are in a hard graph coloring instance?
If this number were not so high, we could enumerate all the stable sets in $\mathcal{S}$ and attempt to directly solve the set covering model without resorting to column generation. However, “high” is a subjective measure, so let me do some computations on my laptop and give you some precise numbers.
Among the DIMACS instances of Graph Coloring, there are a few instances proposed by David Johnson which are still unsolved (in the sense that we do not have a computational proof of optimality of the best known upper bounds).
The table below shows the dimensions of these instances. The names of the instances are DSJC{n}.{d}, where {n} is the number of vertices and {d} gives the density of the graph (e.g., DSJC125.9 has 125 vertices and a density of 0.9).
| Graph | Nodes | Edges | Max stable sets | Enumeration time (s) |
|---|---|---|---|---|
| DSJC125.9 | 125 | 6,961 | 524 | 0.00 |
| DSJC250.9 | 250 | 27,897 | 2,580 | 0.01 |
| DSJC500.9 | 500 | 112,437 | 14,560 | 0.12 |
| DSJC1000.9 | 1,000 | 449,449 | 100,389 | 2.20 |
| DSJC125.5 | 125 | 3,891 | 43,268 | 0.53 |
| DSJC250.5 | 250 | 15,668 | 1,470,363 | 43.16 |
| DSJC500.5 | 500 | 62,624 | ? | out of memory |
| DSJC1000.5 | 1,000 | 249,826 | ? | out of memory |
| DSJC125.1 | 125 | 736 | ? | out of memory |
| DSJC250.1 | 250 | 3,218 | ? | out of memory |
| DSJC500.1 | 500 | 12,458 | ? | out of memory |
| DSJC1000.1 | 1,000 | 49,629 | ? | out of memory |
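As a sanity check on the naming convention, the density suffix can be recomputed from the numbers of nodes and edges in the table, as the fraction of the $n(n-1)/2$ possible edges that are present. The helper name `density` is mine; the three instances are taken from the table above.

```python
def density(nodes, edges):
    """Fraction of the n*(n-1)/2 possible edges that are present in the graph."""
    return edges / (nodes * (nodes - 1) / 2)

# (nodes, edges) of three instances from the table above.
instances = {
    "DSJC125.9": (125, 6961),
    "DSJC250.5": (250, 15668),
    "DSJC500.1": (500, 12458),
}
for name, (n, m) in instances.items():
    print(name, round(density(n, m), 3))
```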
As you can see, the number of maximal stable sets (i.e., the cardinality of $\mathcal{S}$) of several instances is not so high, above all for very dense graphs, where the number of stable sets is less than the number of edges. However, for sparse graphs, the number of maximal stable sets is too large for the memory available on my laptop.
Now, let me restate the main question of this post:
Can we enumerate all the maximal stable sets of $G$ and use a MIP solver such as Gurobi or CPLEX to solve any of Johnson’s instances of Graph Coloring?
I have written a small script which uses Cliquer to enumerate all the maximal stable sets of a graph, and then I generate an .mps instance for each of the DSJC instances where I was able to store all the maximal stable sets. The .mps files are on my public GitHub repository for this post.
The table below shows some numbers for the instances whose maximal stable sets I could enumerate, obtained using Gurobi (v6.0.0) with a timeout of 10 minutes on my laptop. If you compare these numbers with the results published in the literature, you can see that they are not bad at all.
Believe me, these numbers are not bad at all, and establish a new TIME RECORD.
For example, the instance DSJC250.9 was solved to optimality only recently in 11094 seconds by [3], while the column enumeration approach solves the same instance on similar hardware in only 23 seconds (!), and, honestly, our work in [2] did not solve this instance to optimality at all.
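| Graph | Best known | Enum. time (s) | Run time (s) | LB | UB | Time [2] | LB [2] | UB [2] |
|---|---|---|---|---|---|---|---|---|
| DSJC125.9 | 44 | 0.00 | 0.44 | 44 | 44 | 44 | 44 | 44 |
| DSJC250.9 | 72 | 0.01 | 23 | 72 | 72 | timeout | 71 | 72 |
| DSJC500.9 | 128 | 0.12 | timeout | 123 | 128 | timeout | 123 | 136 |
| DSJC1000.9 | 222 | 2.20 | timeout | 215 | 229 | timeout | 215 | 245 |
| DSJC125.5 | 17 | 0.53 | 70.6 | 17 | 17 | 19033 | 17 | 17 |
| DSJC250.5 | 28 | 43.16 | timeout | 26 | 33 | timeout | 26 | 31 |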
Can we ever solve to optimality DSJC500.9 and DSJC1000.9 via Column Enumeration?
I would say:
“Yes, we can!”
… but likely we need to be smarter while branching on the decision variables, since the default branching strategy of a generic MIP solver does not exploit the structure of the problem. If I had the time to work again on Graph Coloring, I would likely use the same branching scheme used in [2], where we combined Zykov’s branching rule with a randomized iterative deepening depth-first search (randomized because at each restart we were using a different initial pool of columns). Another interesting option would be to tighten the set covering formulation with valid inequalities, starting with those studied in [5].
In conclusion, I believe that enumerating all columns can be a simple but good starting point to attempt to solve to optimality at least the instances DSJC500.9 and DSJC1000.9.
Do you have some spare time and are you willing to take up the challenge?
[1] A. Mehrotra, M.A. Trick. A column generation approach for graph coloring. INFORMS Journal on Computing. Fall 1996, vol. 8(4), pp. 344–354.
[2] S. Gualandi and F. Malucelli. Exact Solution of Graph Coloring Problems via Constraint Programming and Column Generation. INFORMS Journal on Computing. Winter 2012, vol. 24(1), pp. 81–100.
[3] S. Held, W. Cook, E.C. Sewell. Maximum-weight stable sets and safe lower bounds for graph coloring. Mathematical Programming Computation. December 2012, Volume 4, Issue 4, pp. 363–381.
[4] M. Lubbecke and J. Desrosiers. Selected topics in column generation. Operations Research. 2005, Volume 53, Issue 6, pp. 1007–1023.
[5] P. Hansen, M. Labbé, D. Schindl. Set covering and packing formulations of graph coloring: algorithms and first polyhedral results. Discrete Optimization. 2009, Volume 6, Issue 2, pp. 135–147.
By sheer serendipity, this morning I came across three paragraphs clearly stating the importance of Big Data from a scientific standpoint, which I would like to cross-post here (the following paragraphs appear in the introduction of [1]):
In all applied fields, it is now commonplace to attack problems through data analysis, particularly through the use of statistical and machine learning algorithms on what are often large datasets. In industry, this trend has been referred to as ‘Big Data’, and it has had a significant impact in areas as varied as artificial intelligence, internet applications, computational biology, medicine, finance, marketing, journalism, network analysis, and logistics.
Though these problems arise in diverse application domains, they share some key characteristics. First, the datasets are often extremely large, consisting of hundreds of millions or billions of training examples; second, the data is often very high-dimensional, because it is now possible to measure and store very detailed information about each example; and third, because of the large scale of many applications, the data is often stored or even collected in a distributed manner. As a result, it has become of central importance to develop algorithms that are both rich enough to capture the complexity of modern data, and scalable enough to process huge datasets in a parallelized or fully decentralized fashion. Indeed, some researchers have suggested that even highly complex and structured problems may succumb most easily to relatively simple models trained on vast datasets.
Many such problems can be posed in the framework of Convex Optimization.
Given the significant work on decomposition methods and decentralized algorithms in the optimization community, it is natural to look to parallel optimization algorithms as a mechanism for solving large-scale statistical tasks. This approach also has the benefit that one algorithm could be flexible enough to solve many problems.
Even if I am not an expert in Convex Optimization [2], I do have my own mathematical optimization bias. You may well have a different opinion (which I am always happy to hear), but, honestly, the above paragraphs are the best content that I have read so far about Big Data.
[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning. Vol. 3, No. 1 (2010) 1–122. [pdf]
[2] If you would like a speedy overview of Convex Optimization, you may read J.F. Puget’s blog post.
After a nice chat with Bo Jensen, CEO, founder, and co-owner (really, he is a Rocket Scientist!) at Sulum Optimization, I realised that I know barely anything about presolving.
By definition, we have that:
“Presolving is a way to transform the given problem instance into an equivalent instance that is (hopefully) easier to solve.” (see Chapter 10 of Tobias Achterberg’s Thesis)
All I know is that every MIP solver has a Presolve parameter, which can take different values. For instance, Gurobi has three possible values for that parameter (you can find more details on the Gurobi online manual):
However, I can’t tell you the real impact of that parameter on the overall solution process of a MIP instance. Thus, here we go: let me write a new post that addresses this basic question!
To measure the impact of preprocessing we need four ingredients:
Changing one of the ingredients could give you different results, but, hopefully, the big picture will not change too much.
As a solver, I have selected the current release of Gurobi (i.e., version 5.6.2). For the data set, likely the most critical ingredient, I have used the MIPLIB2003, basically because I already had all the 60 instances on my server. For running the tests I have used an old cluster from the Math Department of the University of Pavia.
The measure of impact I have decided to use (after considering other alternatives) is quite conservative: the fraction of closed instances as a function of runtime.
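To make the measure concrete, here is a minimal sketch in C of how one point of such a cumulative curve can be computed. The function name `fraction_closed` and the convention that a negative runtime marks an instance not solved within the time limit are my own assumptions for illustration, not part of the original experiment scripts.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of instances closed (solved to optimality) within t seconds.
 * Assumed encoding: runtime[i] < 0 means instance i hit the time limit. */
double fraction_closed(const double *runtime, size_t n, double t)
{
    assert(n > 0);
    size_t closed = 0;
    for (size_t i = 0; i < n; ++i)
        if (runtime[i] >= 0.0 && runtime[i] <= t)
            ++closed;
    return (double)closed / (double)n;
}
```

Evaluating this function on a grid of values of `t`, once per presolve setting, gives the curves of a cumulative plot.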
During the last weekend, I collected a bunch of logs for the 60 instances of the MIPLIB2003, and then, using RStudio, I drew the following cumulative plot:
The picture is as simple as it is clear:
Preprocessing always pays off and allows solving around 10% more instances within the same time limit!
In this post, I will not discuss additional technical details, but I just want to add two observations:
Likely, the aggressive presolve setting has been decided by Gurobi using a different, much larger, and customer-oriented dataset.
Indeed, preprocessing is a very important feature of a modern MIP solver such as Gurobi. Investing a few seconds before starting the branch-and-bound MIP search can save a significant amount of runtime. However, a more aggressive preprocessing strategy does not seem to pay off, on average, on the MIPLIB2003.
Unfortunately, preprocessing is somehow disregarded by the research community. There are few recent papers dealing with preprocessing (“hey! if you do have one, please, let me know about it, ok?”). Most of the papers are from the 90s and about Linear Programming, i.e., without integer variables, which mess up everything.
Here is a list of basic questions I have in mind:
If you want to share your idea, experience, or opinion, with respect to these questions, you could comment below or send me an email.
Now, to conclude, my bonus question:
Do you have any new smart idea for improving preprocessing?
Well, if you had, I guess you would at least write a paper about it, but, do not go for a patent, please!
Egon talks intersection cuts at #aussois. Still the man. pic.twitter.com/7KMcNyJYV0
— Jeff Linderoth (@JeffLinderoth) January 8, 2014
The Captain gave an inspiring talk by questioning the recursive paradigm of cutting plane algorithms. With a very basic example, Balas has shown how a nonbasic vertex (solution) can produce a much deeper cut than a cut generated by an optimal basis. Around this intuition, Balas has presented a very nice generalization of Intersection Cuts… a new paper enters my “PAPERS-TO-BE-READ” folder.
To stay on the subject of cutting planes, the talk by Marco Molinaro on the first day of the workshop was really nice. He raised the fundamental question of how important sparse cuts are versus dense cuts. The importance of sparse cuts comes from linear algebra: when running the simplex method it is better to have small determinants in the coefficient matrix of the Linear Programming relaxation in order to avoid numerical issues; sparse cuts implicitly help in keeping the determinants small (intuitively, you have more zeros in the matrix). Dense cuts play the opposite role, but they can be really important to improve the bound of the LP relaxation. In his talk, Molinaro has shown and proved, for three particular cases, when sparse cuts are enough, and when they are not. Another paper goes into the “PAPERS-TO-BE-READ” folder.
On the same day as Molinaro, Sebastian Pokutta gave a really inspiring talk, offering a completely new (for me) perspective on Extended Formulations by using Information Theory. Sebastian is the author of a blog, and I hope he will post about his talk.
Andrea Lodi discussed an Optimization problem that arises in Supervised Learning. For this problem, the COIN-OR solver Couenne, developed by Pietro Belotti, significantly outperforms CPLEX. The issues seem to come from a number of basic big-M (indicator) constraints. To make a long story short, if you have to solve a hard problem, it does pay off to try different solvers, since there is no “win-all” solver.
Do you have an original new idea for developing solvers? Do not be intimidated by CPLEX or Gurobi and go for it!
The presentation by Marco Senatore was brilliant and his work looks very interesting. I have particularly enjoyed the application in Public Transport that he has mentioned at the end of his talk.
I recommend having a look at the presentation of Stephan Held about the Reach-aware Steiner Tree Problem. He has an interesting Steiner tree-like problem with a very important application in chip design. The presentation has impressive pictures of what optimal solutions look like in chip design.
At the end of his talk, Stephan announced the 11th DIMACS challenge on Steiner Tree Problems.
Eduardo Uchoa gave another impressive presentation on recent progress on the classical Capacitated Vehicle Routing Problem (CVRP). He has a very sophisticated branch-and-price-and-cut algorithm, which comes with a very efficient implementation of every possible idea developed for CVRP, plus new ideas on efficiently solving the pricing subproblems (my understanding, but I might be wrong, is that they have a very efficient dominance rule for solving a shortest path subproblem). +1 item in the “PAPERS-TO-BE-READ” folder.
On the last day of the workshop, I enjoyed the two talks by Simge Kucukyavuz and Jim Luedtke on Stochastic Integer Programming: for me it is a completely new topic, but the two presentations were really inspiring.
To conclude, Domenico Salvagnin has shown how far it is possible to go by carefully using MIP technologies such as cutting planes, symmetry handling, and problem decomposition. Unfortunately, it happens too often that when someone (typically a non-OR expert) has a difficult application problem, he writes down a more or less complicated Integer Programming model, tries a solver, sees it takes too much time, and gives up on exact methods. Domenico, by solving the largest unsolved instance for the 3-dimensional assignment problem, has shown that
there are potentially no limits for MIP solvers!
In this post, I have only mentioned a few talks, which somehow overlap with my research interests. However, every talk was really interesting. Fortunately, Francois Margot has strongly encouraged all of the speakers to upload their slides and/or papers, so you can find (almost) all of them on the program web page of the workshop. Visit the website and have a nice reading!
To conclude, let me steal another nice picture from twitter:
— Matteo Fischetti (@MFischetti) January 10, 2014
Public Transport is not really a buzzword, but still on Google you can get almost the same number of results as with “Big Data”: 26,400,000.
Because many of us use Public Transport every day, but most of us still use their own car to go to work, to take their children to school, and to go shopping. This has a negative impact on the quality of life of everyone and is clearly inefficient since it does cost more:
(Well, for time, it is not always true, but it happens more often than commonly perceived).
Thus, an important challenge is to improve the quality of Public Transport while keeping its cost competitive. The ultimate goal should be to increase the number of people that trust and use Public Transport.
How is it possible to achieve this goal?
Modern transport operators have installed so-called Automatic Vehicle Monitoring (AVM) systems that use several technologies to monitor the fleet of vehicles that operates the service (e.g., metro coaches, buses, metro trains, trains, …).
The stream of data produced by an AVM might be considered as Big Data because of its volume and velocity (see Big Data For Dummies, by J.F. Puget). Each vehicle produces at regular intervals (measured in seconds) data concerning its position and status. This information is stored in remote data centers. The data for a single day might not be considered as “Big”, however once you start to analyze the historical data, the volume increases significantly. For instance, a public transport operator could easily have around 2,000 vehicles that operate 24 hours a day, producing data potentially every second, that is, up to 2,000 × 86,400 ≈ 173 million records per day.
At the moment, this stream of data misses the third dimension of Big Data, that is, variety. However, new projects that aim at integrating this information with the stream of data coming from social networks are quickly reaching maturity. One such project is SuperHub, an FP7 project that has recently won the best exhibit award in Cluster 2 “Smart and sustainable cities for 2020+”, at the ICT2013 Conference in Vilnius.
I don’t know whether transport operators are really Big Data producers or they are merely Small Data producers, but data collected using AVMs are nowadays mainly used to report and monitor the daily activities.
In my own opinion, the data produced by transport operators, integrated with input coming from social networks, should be used to improve the quality of the public transport, for instance, by trying to better tackle Disruption Management issues.
So, I am curious:
Do you know any project that uses AVM data, combined with Social Network inputs (e.g., from Twitter), to elaborate Disruption Management strategies for Public Transport? If yes, do they use Mathematical Optimization at all?
I love reading about everything and I am glad that part of my work consists in reading.
Unfortunately, for researchers, reading is not always that easy, as clearly explained in The Researcher’s Bible:
Reading is difficult: The difficulty seems to depend on the stage of academic development. Initially it is hard to know what to read (many documents are unpublished), later reading becomes seductive and is used as an excuse to avoid research. Finally one lacks the time and patience to keep up with reading (and fears to find evidence that one’s own work is second rate or that one is slipping behind)
For my stage of academic development, reading is extremely seductive, and the situation became even worse after reading the answers to the following question raised by Michael Trick on OR-Exchange:
If you are looking for excuses to avoid research, go through those answers and select any paper you like, you will have outstanding and authoritative excuses!
This post is about solving the classical Graph Coloring problem by using a simple solver, named here GeCol, that is built on top of the Constraint Programming (CP) solver Gecode. The approach of GeCol is based on the CP model described in [1]. Here, we want to explore some of the new features of the latest version of Gecode (version 4.0.0), namely:
We are going to present computational results using these features to solve the instances of the Graph Coloring DIMACS Challenge. However, this post is not going to describe in great detail what these features are: for this purpose, please refer to the Modeling and Programming with Gecode book.
As usual, all the sources used to write this post are publicly available on my GitHub repository.
Given an undirected graph and a set of colors , the minimum (vertex) graph coloring problem consists of assigning a color to each vertex, while every pair of adjacent vertices gets a different color. The objective is to minimize the number of colors.
To model this problem with CP, we can use for each vertex an integer variable with domain equal to : if , then color is assigned to vertex .
Using (inclusion-wise) maximal cliques, it is possible to post constraints on subsets of adjacent vertices: all the vertices belonging to the same clique must get pairwise different colors. In CP, we can use the well-known alldifferent constraint for posting these constraints.
In practice, to build our CP model, first, we find a collection of maximal cliques , such that for every edge there exists at least a clique that contains both vertices and . Second, we post the following constraints:
where denotes the subset of variables corresponding to the vertices that belong to the clique .
In order to minimize the number of colors, we use a simple iterative procedure. Every time we find a coloring with colors, we restart the search by restricting the cardinality of to . If no feasible coloring exists with colors, we have proved optimality for the last feasible coloring found, i.e. .
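The iterative procedure can be sketched in a few lines of plain C. This is only an illustration of the restart scheme: it replaces the CP model with a naive backtracking feasibility check, and the names `extend` and `min_colors`, as well as the flat adjacency-matrix layout, are assumptions made here and do not come from GeCol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* adj is a flat n*n adjacency matrix: adj[u * n + v] is true iff {u,v} is an edge.
 * Vertices 0..v-1 are already colored; try to color v..n-1 with k colors. */
static bool extend(int n, const bool *adj, int k, int v, int *color)
{
    if (v == n) return true;
    for (int c = 0; c < k; ++c) {
        bool ok = true;
        for (int u = 0; u < v; ++u)
            if (adj[v * n + u] && color[u] == c) { ok = false; break; }
        if (ok) {
            color[v] = c;
            if (extend(n, adj, k, v + 1, color)) return true;
        }
    }
    return false;
}

/* Start from the trivial coloring with n colors; each time a coloring with
 * fewer colors is found, restart with one color less. When the restricted
 * problem is infeasible, the last feasible value of k is optimal. */
int min_colors(int n, const bool *adj)
{
    assert(n >= 1);
    int *color = malloc((size_t)n * sizeof *color);
    if (color == NULL) abort();
    int k = n;
    while (k > 1 && extend(n, adj, k - 1, 0, color))
        --k;
    free(color);
    return k;
}
```

On a cycle with five vertices, for example, the loop finds colorings with 4 and 3 colors, fails with 2, and returns 3.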
In addition, we apply a few basic preprocessing steps that are described in [1]. The maximal cliques are computed using Cliquer v1.21 [5].
The Graph Coloring problem is an optimization problem that has several equivalent optimum solutions: for instance, given an optimal assignment of colors to vertices, any permutation of the colors gives a solution with the same optimum value.
While this property is implicitly considered in Column Generation approaches to Graph Coloring (e.g., see [3], [1], and [4]), the CP model we have just presented suffers from symmetry issues: the values of the domains of the integer variables are symmetric.
The Lightweight Dynamic Symmetry Breaking is a strategy for dealing with this issue [2]. In Gecode, you can define a set of values that are symmetric as follows:
Symmetries syms;
syms << ValueSymmetry(IntArgs::create(k,1));
and then, when posting the branching strategy, you just write (note the use of the object syms):
branch(*this, x, INT_VAR_SIZE_MIN(), INT_VAL_MIN(), syms);
With three lines of code, you have solved (some of) the symmetry issues.
How efficient is Lightweight Dynamic Symmetry Breaking for Graph Coloring?
We try to answer this question with the plot below that shows the results for two versions of GeCol:
Both versions select for branching the variable with the smallest domain size. The plot reports the empirical cumulative distribution as a function of run time (in log-scale). The tests were run with a timeout of 300 seconds on a quite old server. Note that at the timeout, the version with LDSB has solved around 55% of the instances, while the version without LDSB has solved only around 48% of the instances.
The second new feature of Gecode that we explore here is the Accumulated Failure Count and the Activity-based branching strategies.
While solving any CP model, the strategy used to select the next variable to branch over is very important. The Accumulated Failure Count strategy stores the cumulative number of failures for each variable (for details see Section 8.5 in MPG). The Activity-based search does something similar, but instead of counting failures, measures the activity of each variable. In a sense, these two strategies try to learn from failures and activities as they occur during the search process.
These two branching strategies are more effective when combined with Restart Based Search: the solver performs the search with increasing cutoff values on the number of failures. Gecode offers several optional strategies to improve the cutoff. In our tests, we have used a geometric cutoff sequence (Section 9.4 in MPG).
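As a small illustration of what a geometric cutoff sequence looks like, the sketch below computes the i-th cutoff as scale × base^i, which is the usual definition of a geometric restart sequence. The function name and the truncation to an integer are assumptions of this sketch, not Gecode code.

```c
#include <assert.h>

/* i-th cutoff (i = 0, 1, 2, ...) of a geometric restart sequence:
 * scale * base^i failures, truncated to an integer. */
unsigned long geometric_cutoff(unsigned long scale, double base, unsigned int i)
{
    assert(scale > 0 && base >= 1.0);
    double c = (double)scale;
    for (unsigned int j = 0; j < i; ++j)
        c *= base;
    return (unsigned long)c;
}
```

With scale 100 and base 2, the search is restarted after 100, 200, 400, 800, … failures.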
How effective are the Accumulated Failure Count and the Activity-based strategies for Graph Coloring when combined with Restart Based Search?
The second plot below shows a comparison of 3 versions of GeCol, with 3 different branching strategies:
The last strategy is tremendously efficient: it dominates the other two strategies, and it is able to solve more than 60% of the considered instances within the timeout of 300 seconds.
However, it is possible to do still slightly better. Likely, at the beginning of the search phase, several variables have the same value of AFC. Therefore, it is possible to improve the branching strategy by breaking ties: we can divide the ACT or the AFC value of a variable by its domain size. The next plot shows the results with these other branching strategies:
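The tie-breaking rule amounts to scoring each variable by its AFC (or activity) divided by its domain size and branching on the variable with the largest score. A minimal sketch, assuming the counters are available as plain arrays (the function name and the array layout are hypothetical and only serve the illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the unassigned variable (domain size > 1) with the
 * largest ratio afc[i] / dom_size[i]; return n if all variables are assigned.
 * Ties are broken in favor of the smallest index. */
size_t select_afc_over_size(const double *afc, const unsigned *dom_size, size_t n)
{
    assert(afc != NULL && dom_size != NULL);
    size_t best = n;
    double best_score = -1.0;
    for (size_t i = 0; i < n; ++i) {
        if (dom_size[i] <= 1) continue;          /* already assigned */
        double score = afc[i] / (double)dom_size[i];
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}
```

When all AFC values are equal, as at the beginning of the search, this rule reduces to selecting the variable with the smallest domain.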
The new features of Gecode are very interesting and offer plenty of options. The LDSB is very general, and it could be easily applied to several other combinatorial optimization problems. Also the new branching strategies give important enhancements, above all when combined with restart based search.
“…with great power there must also come – great responsibility!” (Uncle Ben, The Amazing Spider-Man, n.660, Marvel Comics)
As a drawback, it is becoming harder and harder to find the best parameter configuration for solvers such as Gecode (but this is true also for other types of solvers, e.g. Gurobi and Cplex).
Can you find or suggest a better parameter configuration for GeCol?
[1] S. Gualandi and F. Malucelli. Exact Solution of Graph Coloring Problems via Constraint Programming and Column Generation. INFORMS Journal on Computing. Winter 2012, vol. 24(1), pp. 81-100. [pdf] [preprint]
[2] C. Mears, M.G. de la Banda, B. Demoen, M. Wallace. Lightweight dynamic symmetry breaking. In Eighth International Workshop on Symmetry in Constraint Satisfaction Problems, SymCon’08, 2008. [pdf]
[3] A. Mehrotra, M.A. Trick. A column generation approach for graph coloring. INFORMS Journal on Computing. Fall 1996, vol. 8(4), pp. 344-354. [pdf]
[4] S. Held, W. Cook, E.C. Sewell. Maximum-weight stable sets and safe lower bounds for graph coloring. Mathematical Programming Computation. December 2012, vol. 4(4), pp. 363-381. [pdf]
[5] Patric R.J. Östergård. A fast algorithm for the maximum clique problem. Discrete Applied Mathematics, vol. 120(1-3), pp. 197-207, 2002. [pdf]
Recently, I have discovered a nice tiny library (1 file!) that supports Backtrack Programming in standard C. The library is called CBack and is developed by Keld Helsgaun, who is known in the Operations Research and Computer Science communities for his efficient implementation of the Lin-Kernighan heuristic for the Travelling Salesman Problem.
CBack offers basically two functions that are described in [1] as follows:
Choice(N): “is used when a choice is to be made among a number of alternatives, where N is a positive integer denoting the number of alternatives”.
Backtrack(): “causes the program to backtrack, that is to say, return to the most recent call of Choice, which has not yet returned all its values”.
With these two functions it is pretty simple to develop exact enumeration algorithms. The CBack library comes with several examples, such as algorithms for the N-queens problem and the 15-puzzle. Below, I will show you how to use CBack to implement a simple algorithm that finds a Maximum Clique in an undirected graph.
As usual, the source code used to write this post is publicly available on my GitHub repository.
The CBack documentation shows as first example the following code snippet:
int i, j;
i = Choice(3);
j = Choice(2);
printf("i = %d, j = %d\n", i, j);
Backtrack();
The output produced by the snippet is:
i = 1, j = 1
i = 1, j = 2
i = 2, j = 1
i = 2, j = 2
i = 3, j = 1
i = 3, j = 2
If you are familiar with backtrack programming (e.g., Prolog), you should not be surprised by the output, and you can jump to the next section. Otherwise, the Figure below sketches the program execution.
When the program executes the Choice(N=3) statement, that is, the first call to Choice (line 2), value 1 is assigned to variable i. Behind the scenes, the Choice function stores the current execution state of the program in its own stack, and records the next possible choices (i.e., the other possible program branches), which are values 2 and 3. Next, the second call, Choice(N=2), assigns value 1 to j (line 3), and again the state of the program is stored for later use. Then, the printf outputs i = 1, j = 1 (line 4 and first line of output). Now, it is time to backtrack (line 5).
What is happening here?
Look again at the figure above: when the Backtrack() function is invoked, the algorithm backtracks and continues the execution from the most recent Choice stored in its stack, i.e., it assigns value 2 to variable j, and printf outputs i = 1, j = 2. Later, Backtrack() is invoked again, and this time the algorithm backtracks to the previous possible choice, which corresponds to the assignment of value 2 to variable i, and it executes i = 2. Once the second choice for variable i is performed, there are again two possible choices for variable j, since the program has backtracked to a point that precedes that statement. Thus, the program executes j = 1, and printf outputs i = 2, j = 1. At this point, the program backtracks again, and considers the next possible choice, j = 2. This is repeated until all possible choices for Choice(3) and Choice(2) are exhausted, yielding the 6 possible combinations of i and j that the program gave as output.
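For readers who do not have CBack at hand, the same visiting order can be reproduced with two nested loops. The sketch below does not use CBack at all; `enumerate_choices` is a hypothetical helper that only mimics the depth-first order in which the pairs are generated, with the most recent choice (j) varying fastest.

```c
#include <assert.h>
#include <stdio.h>

/* Enumerate all pairs (i, j) with i in 1..ni and j in 1..nj in depth-first
 * order: j varies fastest, since the most recent choice is retried first.
 * Each pair is printed and stored in out[][2]; returns the number of pairs. */
int enumerate_choices(int ni, int nj, int out[][2])
{
    assert(ni > 0 && nj > 0);
    int count = 0;
    for (int i = 1; i <= ni; ++i)
        for (int j = 1; j <= nj; ++j) {
            printf("i = %d, j = %d\n", i, j);
            out[count][0] = i;
            out[count][1] = j;
            ++count;
        }
    return count;
}
```

The point of Choice and Backtrack is that the nesting depth does not have to be known when the code is written, which is what makes them convenient for enumeration algorithms.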
Indeed, during the execution, the program has implicitly visited in a depth-first manner the search tree of the previous figure. CBack also supports different search strategies, such as best-first, but I will not cover that topic here.
In order to store and restore the program execution state (well, more precisely the calling environment), Choice(N) and Backtrack() use two threatening standard C functions, setjmp and longjmp.
For the details of their use in CBack, see [1].
The reason why I like this library, apart from reminding me of the time I was programming with Mozart, is that it allows implementing exact algorithms based on enumeration quickly. While enumeration is usually disregarded as inefficient (“hey, it is just brute force!”), it is still one of the best methods to solve small instances of almost any combinatorial optimization problem. In addition, many sophisticated exact algorithms use plain enumeration as a subroutine, when during the search process the size of the problem becomes small enough.
Consider now the Maximum Clique Problem: Given an undirected graph , the problem is to find the largest complete subgraph of . More formally, you look for the largest subset of the vertex set such that for any pair of nodes in there exists an arc .
The well-known branch-and-bound algorithm of Carraghan and Pardalos [2] is based on enumeration. The implementation of Applegate and Johnson, called dfmax.c, is a very efficient implementation of that algorithm. Next, I show a basic implementation of the same algorithm that uses CBack for backtracking.
The Carraghan and Pardalos algorithm uses three sets: the current clique , the largest clique found so far , and the set of candidate vertices . The pseudo code of the algorithm is as follows (as described in [3]):
[Pseudocode listing, 13 lines: not preserved.]
As you can see, the backtracking is here described in terms of a recursive function. However, using CBack, we can implement the same algorithm without using recursion.
We use an array S of integers, one for each vertex of the graph. If S[v]=0, then vertex v belongs to the candidate set; if S[v]=1, then vertex v is in the current clique; if S[v]=2, then vertex v can be neither in the candidate set nor in the current clique. The variable s stores the size of the current clique.
Let me show you directly the C code:
[C code listing, 27 lines: not preserved.]
Well, I like this code pretty much, despite being a “plain old” C program. The algorithm and code can be improved in several ways (ordering the vertices, improving the pruning, using upper bounds from heuristic vertex coloring, using induced degree as in [2]), but still, the main loop and the backtrack machinery is all there, in a few lines of code!
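For comparison, here is a recursive sketch in plain C of the basic Carraghan and Pardalos enumeration with its simple size bound. It is not the CBack-based code discussed above: it uses explicit recursion and candidate arrays instead of the array S, and the names `expand` and `max_clique_size` and the flat adjacency-matrix layout are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int best_size;   /* size of the largest clique found so far */

/* cand[0..m-1] holds the vertices adjacent to every vertex of the current
 * clique, whose size is `size`. adj is a flat n*n adjacency matrix. */
static void expand(int n, const bool *adj, const int *cand, int m, int size)
{
    if (size > best_size) best_size = size;
    for (int a = 0; a < m; ++a) {
        /* Bound: even taking all remaining candidates cannot beat the incumbent. */
        if (size + (m - a) <= best_size) return;
        int v = cand[a];
        int *next = malloc((size_t)m * sizeof *next);
        if (next == NULL) abort();
        int k = 0;
        for (int b = a + 1; b < m; ++b)      /* keep only the neighbors of v */
            if (adj[v * n + cand[b]]) next[k++] = cand[b];
        expand(n, adj, next, k, size + 1);
        free(next);
    }
}

/* Size of a maximum clique of the graph given by adj (n vertices). */
int max_clique_size(int n, const bool *adj)
{
    assert(n >= 0);
    int *cand = malloc((size_t)(n > 0 ? n : 1) * sizeof *cand);
    if (cand == NULL) abort();
    for (int v = 0; v < n; ++v) cand[v] = v;
    best_size = 0;
    expand(n, adj, cand, n, 0);
    free(cand);
    return best_size;
}
```

The sketch only returns the clique size; storing the clique itself requires keeping the current and the best vertex sets, as the three sets of the algorithm suggest.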
Maybe you wonder about the efficiency of this code, but at the moment I do not have a precise answer. For sure, the ordering of the vertices is crucial, and can make a huge difference when solving the max-clique DIMACS instances. I have used CBack to implement my own version of Östergård’s max-clique algorithm [4], but my implementation is somewhat slower. I suspect that the difference is due to the data structure used to store the graph (Östergård’s implementation relies on bitsets), and not to the way the backtracking is achieved. Answering this question, however, could be the subject of another post.
In conclusion, if you need to implement an exact enumerative algorithm, CBack could be an option to consider.
[1] Keld Helsgaun. CBack: A Simple Tool for Backtrack Programming in C. Software: Practice and Experience, vol. 25(8), pp. 905-934, 1995. [doi]
[2] Carraghan and Pardalos. An exact algorithm for the maximum clique problem. Operations Research Letters, vol. 9(6), pp. 375-382, 1990. [pdf]
[3] Torsten Fahle. Simple and Fast: Improving a Branch-and-Bound Algorithm. In Proc. ESA 2002, LNCS 2461, pp. 485-498. [doi]
[4] Patric R.J. Östergård. A fast algorithm for the maximum clique problem. Discrete Applied Mathematics, vol. 120(1-3), pp. 197-207, 2002. [pdf]
On the blackboard, solving small Integer Linear Programs with 2 variables and less-than-or-equal constraints is easy, since they can be plotted in the plane and the linear relaxation can be solved geometrically. You can draw the lattice of integer points, and once you have found a new cutting plane, you show that it cuts off the optimum solution of the LP relaxation.
This post presents a naive (textbook) implementation of Fractional Gomory Cuts that uses the basic solution computed by CPLEX, the commercial Linear Programming solver used in our lab sessions. In practice, this post is an online supplement to one of my last exercise sessions.
In order to solve the “blackboard” examples with CPLEX, it is necessary to use a couple of functions that a few years ago were undocumented. GUROBI has very similar functions, but they are currently undocumented. (Edited May 16th, 2013: From version 5.5, Gurobi has documented its advanced simplex routines.)
As usual, all the sources used to write this post are publicly available on my GitHub repository.
Given an Integer Linear Program in the form:
it is possible to rewrite the problem in standard form by adding slack variables:
where is the identity matrix and is a vector of slack variables, one for each constraint in . Let us denote by the linear relaxation of obtained by relaxing the integrality constraint.
The optimum solution vector of , if it exists and is finite, is used to derive a basis (for a formal definition of basis, see [1] or [3]). Indeed, the basis partitions the columns of matrix into two submatrices and , where is given by the columns corresponding to the basic variables, and by the columns corresponding to the nonbasic variables (they are equal to zero in the optimal solution vector).
Remember that, by definition, is nonsingular and therefore is invertible. Using the matrices and , it is easy to derive the following inequalities (for details, see any OR textbook, e.g., [1]):
where the operator is applied component-wise to the matrix elements. In practice, for each fractional basic variable, it is possible to generate a valid Gomory cut.
The key step to generate Gomory cuts is to get an optimal basis or, even better, the inverse of the basis matrix multiplied by and by . Once we have that matrix, in order to generate a Gomory cut from a fractional basic variable, we just use the last equation in the previous derivation, applying it to each row of the system of inequalities.
Given the optimal basis, the optimal basic vector is , since the nonbasic variables are equal to zero. Let be the index of a fractional basic variable, and let be the index of the constraint corresponding to variable in the equations , then the Gomory cut for variable is:
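To fix ideas, here is a minimal sketch in C of the rounding step applied to a single tableau row, under the assumption that the row reads x_h + Σ_j ā_j x_j = b̄ over the nonbasic variables and that the cut is the usual textbook one, x_h + Σ_j ⌊ā_j⌋ x_j ≤ ⌊b̄⌋. The names `floor_d` and `gomory_cut_from_row` and the tolerance `eps` are hypothetical; this code is independent of CPLEX.

```c
#include <assert.h>
#include <stdbool.h>

/* Floor of x, written without <math.h>; assumes |x| fits in a long long. */
static double floor_d(double x)
{
    double t = (double)(long long)x;   /* truncation toward zero */
    return t > x ? t - 1.0 : t;
}

/* Given one tableau row  x_h + sum_j a_bar[j] * x_j = b_bar  over the n
 * nonbasic variables, write the cut  x_h + sum_j coef[j] * x_j <= *rhs
 * with coef[j] = floor(a_bar[j]) and *rhs = floor(b_bar).
 * Returns false (no cut) when b_bar is integral within the tolerance eps. */
bool gomory_cut_from_row(const double *a_bar, int n, double b_bar,
                         double eps, double *coef, double *rhs)
{
    assert(n >= 0 && eps >= 0.0);
    double frac = b_bar - floor_d(b_bar);
    if (frac < eps || frac > 1.0 - eps)
        return false;                  /* basic variable is (nearly) integer */
    for (int j = 0; j < n; ++j)
        coef[j] = floor_d(a_bar[j]);
    *rhs = floor_d(b_bar);
    return true;
}
```

The tolerance matters in practice: with finite precision, a value such as 2.9999999 should be treated as the integer 3, otherwise a numerically meaningless cut is generated.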
The CPLEX callable library (written in C) has the following advanced functions:
Using the first two functions, Gomory cuts from an optimal basis can be generated as follows:
[C code listing, 27 lines: not preserved.]
The code reads row by row (index i) the inverse basis matrix multiplied by (line 7), which is temporarily stored in vector z, and then the code stores the corresponding Gomory cut in the compact matrix given by vectors rmatbeg, rmatind, and rmatval (lines 8-15). The array b_bar contains the vector (line 21). In lines 26-27, all the cuts are added at once to the current LP data structure.
On GitHub you will find a small program that I wrote to generate Gomory cuts for problems written as . The repository has an example of execution of my program.
The code is simple only because it is designed for small IPs in the form . Otherwise, the code must consider the effects of preprocessing, different sense of the constraints, and additional constraints introduced because of range constraints.
If you are interested in a real implementation of Mixed-Integer Gomory cuts, which are a generalization of Fractional Gomory cuts to mixed integer linear programs, please look at the SCIP source code.
The introduction of Mixed-Integer Gomory cuts in CPLEX was the major breakthrough of CPLEX 6.5 and produced the version-to-version speedup given by the blue bars in the chart below (source: Bixby’s slides available on the web):
Gomory cuts are still a subject of research, since they pose a number of implementation challenges. These cuts suffer from severe numerical issues, mainly because the computation of the inverse matrix requires the division by its determinant.
“In 1959, […] We started to experience the unpredictability of the computational results rather steadily” (Gomory, see [4]).
A recent paper by Cornuejols, Margot, and Nannicini deals with some of these issues [2].
If you like to learn more about how the basis are computed in the CPLEX LP solver, there is very nice paper by Bixby [3]. The paper explains different approaches to get the first basic feasible solution and gives some hints of the CPLEX implementation of that time, i.e., 1992. Though the paper does not deal with Gomory cuts directly, it is a very pleasant reading.
To conclude, for those of you interested in Optimization Stories there is a nice chapter by G. Cornuejols about the Ongoing Story of Gomory Cuts [4].
[1] C.H. Papadimitriou, K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. 1998. [book]
[2] G. Cornuejols, F. Margot and G. Nannicini. On the safety of Gomory cut generators. Submitted in 2012. Mathematical Programming Computation, under review. [preprint]
[3] R.E. Bixby. Implementing the Simplex Method: The Initial Basis. Journal on Computing vol. 4(3), pages 267-284, 1992. [abstract]
[4] G. Cornuejols. The Ongoing Story of Gomory Cuts. Documenta Mathematica, Optimization Stories, pages 221-226, 2012. [preprint]
Trenord officially said that the software that planned the crew schedule is faulty. The software was bought last year from Goal Systems, a Spanish company. Rumors say that Trenord paid Goal Systems around 1,500,000 Euro. Likely, the system is not faulty, but it “only” had bad input data.
Before the Goal Systems software, Trenord was using different software, produced by Management Artificial Intelligence Operations Research srl (MAIOR), which is used by several public transportation companies in Italy, including ATM, which operates the subway and buses in Milan. In addition, MAIOR collaborates with the Politecnico di Milano and the University of Pisa to continuously improve its software. Honestly, I am biased, since I collaborate with MAIOR. However, Trenord dismissed the MAIOR software without any specific complaint, since the management had decided to buy the Goal Systems software.
Newspapers do not ask the following question:
Why change a piece of software, if the previous one was working correctly?
In Italy, soccer players have a motto: “squadra che vince non si cambia” (you do not change a winning team). Maybe at Trenord nobody plays soccer.
Likely, next week will be better for the 700,000 commuters, since OR experts from MAIOR are traveling to Milan to help Trenord to improve the situation.
The MIP instances I propose come from my formulation of the Machine Reassignment Problem proposed for the Roadef Challenge sponsored by Google last year. As I wrote in a previous post, the Challenge had huge instances and a micro time limit of 300 seconds. I said micro because I have in mind exact methods: there is little you can do in 300 seconds when you have a problem with potentially as many as binary variables. If you want to use math programming and start with the solution of a linear programming relaxation of the problem, you have to be careful: it might happen that you cannot even solve the LP relaxation at the root node within 300 seconds.
That is why most of the participants tackled the Challenge mainly with heuristic algorithms. The only general purpose solver that qualified for the challenge is Local Solver, which has a nice abstraction (“somehow” similar to AMPL) to well-known local search algorithms and move operators. The Local Solver script used in the qualification phase is available here.
However, in my own opinion, it is interesting to try to solve at least the instances of the qualification phase with Integer Linear Programming (ILP) solvers such as Gurobi and CPLEX. Can these branch-and-cut commercial solvers be competitive on such problems?
Suppose you are given a set of processes , a set of machines , and an initial mapping of each process to a single machine (i.e., if process is initially assigned to machine ). Each process consumes several resources, e.g., CPU, memory, and bandwidth. In the challenge, some processes were defined to be transient: they consume resources both on the machine where they are initially located and on the machine where they will be located after the reassignment. The problem asks to find a new assignment of processes to machines that minimizes a rather involved cost function.
A basic ILP model will have a 0-1 variable equal to 1 if you (re)assign process to machine . The number of processes and the number of machines give a first clue on the size of the problem. The constraints on the resource capacities yield a multidimensional knapsack subproblem for each machine. The Machine Reassignment Problem has other constraints (kind of logical 0-1 constraints), but I do not want to bore you here with a full problem description. If you would like to see my model, please read the AMPL model file.
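To make the capacity constraints concrete, the following Python sketch checks whether a candidate reassignment respects the machine capacities. It is only an illustration: the names is_feasible, demand, capacity, and transient are hypothetical, transience is treated as a per-process flag as in the description above, and the other constraints and the cost function of the challenge are ignored.

```python
def is_feasible(assign, initial, demand, capacity, transient):
    """Check the resource capacities of every machine.

    assign[p], initial[p]: machine of process p after / before reassignment.
    demand[p][r]: amount of resource r consumed by process p.
    capacity[m][r]: capacity of machine m for resource r.
    transient[p]: True if process p also keeps consuming resources
                  on its initial machine when it is moved.
    """
    n_machines = len(capacity)
    n_resources = len(capacity[0])
    load = [[0] * n_resources for _ in range(n_machines)]
    for p, m in enumerate(assign):
        for r in range(n_resources):
            load[m][r] += demand[p][r]
            # A moved transient process still occupies its origin machine.
            if transient[p] and initial[p] != m:
                load[initial[p]][r] += demand[p][r]
    return all(load[m][r] <= capacity[m][r]
               for m in range(n_machines)
               for r in range(n_resources))


# Toy data: two machines, one resource, two processes.
capacity = [[4], [4]]
demand = [[3], [2]]
initial = [0, 1]

print(is_feasible([1, 0], initial, demand, capacity, [False, False]))
print(is_feasible([1, 0], initial, demand, capacity, [True, True]))
```

In this toy instance, swapping the two processes is fine when nothing is transient, but it overloads both machines when both processes are transient, which hints at why each machine yields a knapsack-like subproblem.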
In order to convince you that the proposed instances are challenging, I report some computational results.
The table below reports for each instance the best result obtained by the participants of the challenge (second column). The remaining four columns give the upper bound (UB), the lower bound (LB), the number of branch-and-bound nodes, and the computation time in seconds obtained with Gurobi 5.0.1, a timeout of 300 seconds, and the default parameter setting on a rather old desktop (single core, 2 GB of RAM).
Instance  Best Known UB  Upper Bound  Lower Bound  Nodes  Time 

a11  44,306,501  44,306,501  44,306,501  0  0.05 
a12  777,532,896  780,511,277  777,530,829  537   
a13  583,005,717  583,005,720  583,005,715  15  48.76 
a14  252,728,589  320,104,617  242,404,632  24   
a15  727,578,309  727,578,316  727,578,296  221  2.43 
a21  198  54,350,836  110  0   
a22  816,523,983  1,876,768,120  559,888,659  0   
a23  1,306,868,761  2,272,487,840  1,007,955,933  0   
a24  1,681,353,943  3,223,516,130  1,680,231,407  0   
a25  336,170,182  787,355,300  307,041,984  0   
Instances a11, a13, a15 are solved to optimality within 300 seconds
and hence they are not further considered.
The remaining seven instances are the challenging instances mentioned at the beginning of this post. The instances a2x are embarrassing: they have a UB that is far away from both the best known UB and the computed LB. Specifically, look at the instance a21: the best result of the challenge has value 198, while Gurobi (using my model) finds a solution with cost 54,350,836: you may agree that this is “slightly” more than 198. At the same time the LB is only 110.
Note that for all the a2x instances the number of branch-and-bound nodes is zero. After 300 seconds the solver is still at the root node trying to generate cutting planes and/or running its primal heuristics. Using CPLEX 12.5 we got pretty similar results.
This is why I think these instances are challenging for branch-and-cut solvers.
Commercial solvers usually have a meta-parameter that controls the search focus by setting other parameters (how they are precisely set is undocumented: do you know more about it?). The two basic options of this parameter are (1) to focus on looking for feasible solutions or (2) to focus on proving optimality. The name of this parameter is MipEmphasis in CPLEX and MipFocus in Gurobi. Since the LPs are quite time consuming and after 300 seconds the solver is still at the root node, we can wonder whether generating cuts is of any help on these instances.
If we set the MipFocus to feasibility and we explicitly disable all cut generators, would we get better results?
Look at the table below: the values of the upper bounds of instances a12, a14, and a23 are better than before: this is good news. However, for instance a21 the upper bound is worse, and for the other three instances there is no difference. Moreover, the LBs are never stronger and are mostly weaker: as expected, there is no free lunch!
Instance  Upper Bound  Lower Bound  Gap  Nodes 

a12  779,876,897  777,530,808  0.30%  324 
a14  317,802,133  242,398,325  23.72%  48 
a21  65,866,574  66  99.99%  81 
a22  1,876,768,120  505,443,999  73.06%  0 
a23  1,428,873,892  1,007,955,933  29.45%  0 
a24  3,223,516,130  1,680,230,915  47.87%  0 
a25  787,355,300  307,040,989  61.00%  0 
If we want to keep a timeout of 300 seconds, there is little we can do, unless we develop an ad hoc decomposition approach.
Can we improve those results with a branch-and-cut solver using a longer timeout?
Most of the papers that use branch-and-cut to solve hard problems have a timeout of at least one hour, and they start by running a heuristic for around 5 minutes. Therefore, we can think of using the best results obtained by the participants of the challenge as starting solutions.
So, let us take a step back: we enable all cut generators and we set all parameters to their default values. In addition we set the time limit to one hour. The table below gives the new results. With this setting we are able to “prove” near-optimality of instance a12, and we significantly reduce the gap of instance a24. However, the solver never improves the primal solutions: this means that we have not improved the results obtained in the qualification phase of the challenge. Note also that the number of nodes explored is still rather small despite the longer timeout.
Instance  Upper Bound  Lower Bound  Gap  Nodes 

a12  777,532,896  777,530,807  ~0.001%  0 
a14  252,728,589  242,404,642  4.09%  427 
a21  198  120  39.39%  2113 
a22  816,523,983  572,213,976  29.92%  18 
a23  1,306,868,761  1,068,028,987  18.27%  69 
a24  1,681,353,943  1,680,231,594  0.06%  133 
a25  336,170,182  307,042,542  8.66%  187 
What if we disable all cuts and set the MipFocus to feasibility again?
Instance  Upper Bound  Lower Bound  Gap  Nodes 

a12  777,532,896  777,530,807  ~0.001%  0 
a14  252,728,589  242,398,708  4.09%  1359 
a21  196  70  64.28%  818 
a22  816,523,983  505,467,074  38.09%  81 
a23  1,303,662,728  1,008,286,290  22.66%  56 
a24  1,681,353,943  1,680,230,918  0.07%  108 
a25  336,158,091  307,040,989  8.67%  135 
With this parameter setting, we improve the UB for 3 instances: a21, a23, and a25. However, the lower bounds are again much weaker. Look at instance a21: the lower bound is now 70 while before it was 120. If you look at instance a23 you can see that even though we got a better primal solution, the gap is larger, since the lower bound is worse.
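The Gap column in these tables is consistent with the relative gap (UB - LB)/UB, up to rounding in the last digit. Assuming that formula, a short Python sketch reproduces the comparison for instance a23, using the bounds reported in the two one-hour tables above; the function name relative_gap is just a label chosen for this example.

```python
def relative_gap(ub, lb):
    """Relative optimality gap (UB - LB) / UB of a minimization problem."""
    return (ub - lb) / ub


# Instance a23 with a one-hour time limit, bounds copied from the tables.
gap_default = relative_gap(1_306_868_761, 1_068_028_987)  # default settings
gap_no_cuts = relative_gap(1_303_662_728, 1_008_286_290)  # no cuts, feasibility focus

print(f"default: {gap_default:.4f}, no cuts: {gap_no_cuts:.4f}")
```

The upper bound improves by about 3.2 million when cuts are disabled, but the lower bound drops by almost 60 million, so the gap grows.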
With the focus on feasibility you get better results, but you might miss the ability to prove optimality. With the focus on optimality you get better lower bounds, but you might not improve the primal bounds.
1) How to balance feasibility with optimality?
Using a branch-and-cut solver with its cut generators disabled is counterintuitive, but if you do so, you get better primal bounds.
2) Why should I use a branch-and-cut solver then?
Do you have any ideas out there?
While writing this post, we got 3 solutions that are better than those obtained by the participants of the qualification phase: a21, a23, and a25 (the three links give the certificates of the solutions). We are almost there in proving optimality of a23, and we get better lower bounds than those published in [1].
[1] Deepak Mehta, Barry O’Sullivan, Helmut Simonis. Comparing Solution Methods for the Machine Reassignment Problem. In Proc. of CP 2012, Québec City, Canada, October 8-12, 2012.
Thanks to Stefano Coniglio and to Marco Chiarandini for their passionate discussions about the posts in this blog.
During the conference, the weather outside was pretty cold, but at the conference site the discussions were warm and the presentations were intriguing.
In this post, I share an informal report of the conference as “Je me souviens”.
The invited talks were excellent and my favorite one was given by Miguel F. Anjos on Optimization Challenges in Smart Grid Operations. Miguel is not exactly a CP programmer, he works more on discrete nonlinear optimization, but his talk was a perfect mix of applications, modeling, and solution techniques. Please, read and enjoy his slides.
I would like to mention just one of his observations. Nowadays, electric cars are becoming more and more common. What will happen when each of us has an electric car? Likely, during the night, while sleeping, we will connect our cars to the grid to recharge their batteries. This will lead to high variability in night peaks of energy demand.
How can we manage these peaks?
Well, what Miguel reported as a possible challenging option is to think of the collection of cars connected to the grid as a kind of huge battery. This sort of collective battery could be used to better handle the peaks of energy demand. Each car would play a double role: if there is no energy demand peak, you can recharge the car battery; otherwise, the car battery could be used as a power source and it could supply energy to the grid. This is an oversimplification, but as you can imagine there would be great challenges and opportunities for any constraint optimizer in terms of modeling and solution techniques.
I am curious to read more about it, are you?
This year CP had the thickest conference proceedings ever. Traditionally, the papers are presented in two parallel sessions. Two is not that many when you think that this year at ISMP there were 40 parallel sessions… but still, you always regret that you could not attend the talk in the other session. Argh!
Here I would like to mention just two works. However, the program chair is trying to make all the slides available. Have a look at the program and at the slides: there are many good papers.
In the application track, Deepak Mehta gave a nice talk about joint work with Barry O’Sullivan and Helmut Simonis on Comparing Solution Methods for the Machine Reassignment Problem, a problem that Google has to solve every day in its data centers and that was the subject of the Google/Roadef Challenge 2012. The true challenge is given by the HUGE size of the instances and the very short timeout (300 seconds). The work presented by Deepak is really interesting and they got excellent results using CP-based Large Neighborhood Search: they placed second in the challenge.
Related to the Machine Reassignment Problem there was a second interesting talk entitled Weibull-based Benchmarks for Bin Packing, by Ignacio Castineiras, Milan De Cauwer and Barry O’Sullivan. They have designed a parametric instance generator for bin packing problems based on the Weibull distribution. Having a parametric generator is crucial to perform exhaustive computational experiments and to identify those instances that are challenging for a particular solution technique. For instance, they have considered a CP approach to bin packing problems and they have identified those Weibull shape values that yield challenging instances for such an approach. A nice feature is that their generator is able to create instances similar to those of the Google challenge… I hope they will release their generator soon!
Differently from other conferences (as for instance IPCO), CP gives PhD students the opportunity to present their ongoing work within a Doctoral Program. The sponsors cover part of the costs for attending the conference. During the conference each student has a mentor who is supposed to help them. This year there were around 24 students and only very few of them had a paper accepted at the main conference. This means that without the Doctoral Program, most of these students would not have had the opportunity to attend the conference.
Geoffrey Chu was awarded the 2012 ACP Doctoral Research Award for his thesis Improving Combinatorial Optimization. To give you an idea about the amount of his contributions, consider that after his thesis presentation, someone in the audience asked:
“And you got only one PhD for all this work?”
Chapeau! Among other things, Chu has implemented Chuffed, one of the most efficient CP solvers, which uses lazy clause generation and ranked very well at the last MiniZinc Challenge, even if it was not one of the official competitors.
For the record, the winner of the MiniZinc challenge of this year is (again) the Gecode team. Congratulations!
Next year CP will be held in Sweden, at Uppsala University on 16-20 September 2013. Will you be there? I hope so…
In the meantime, if you were at the conference, which was your favorite talk and/or paper?
Here we go, my first blog entry, ever. Let’s start with two short quizzes.
1. The well-known Dijkstra’s algorithm is:
[a] A greedy algorithm
[b] A dynamic programming algorithm
[c] A primal-dual algorithm
[d] It was discovered by Dantzig
2. Which is the best C++ implementation of Dijkstra’s algorithm among the following?
[a] The Boost Graph Library (BGL)
[b] The COIN-OR Lemon Graph Library
[c] The Google Or-Tools
[d] Hey dude! We can do better!!!
What is your answer for the first question? … well, the answers are all correct! And for the second question? To know the correct answer, sorry, you have to read this post to the end…
If you are curious to learn more about the classification of Dijkstra’s algorithm proposed in the first three answers, please consider reading [1] and [2]. Honestly, I did not know that the algorithm was independently discovered by Dantzig [3] as a special case of Linear Programming. However, Dantzig is credited with the first version of the bidirectional Dijkstra’s algorithm (should we call it Dantzig’s algorithm?), which is nowadays the best performing algorithm on general graphs. The bidirectional Dijkstra’s algorithm is used as a benchmark to measure the speedup of modern specialized shortest path algorithms for road networks [4,5], those algorithms that are implemented, for instance, in our GPS navigation systems, in your smartphones (I don’t have one, argh!), in Google Maps Directions, and Microsoft Bing Maps.
Why a first blog entry on Dijkstra’s algorithm? That’s simple.
I implemented it while programming in C++, and I want to share my experience with you.
The algorithm is quite simple. First partition the nodes of the input graph G=(N,A) into three sets: the sets of (1) scanned, (2) reachable, and (3) unvisited nodes. Every node $i$ has a distance label $d_i$ and a predecessor vertex $p_i$. Initially, set the label of the source node $d_s = 0$, while set $d_i = +\infty$ for all other nodes. Moreover, the node s is placed in the set of reachable nodes, while all the other nodes are unvisited.
The algorithm proceeds as follows: select a reachable node i with minimum distance label, and move it to the set of scanned nodes; it will never be selected again. For each arc (i,j) in the forward star of node i check if node j has distance label $d_j > d_i + c_{ij}$, where $c_{ij}$ is the length of arc (i,j); if it is the case, update the label $d_j = d_i + c_{ij}$ and the predecessor vertex $p_j = i$. In addition, if the node was unvisited, move it to the set of reachable nodes. If the selected node i is the destination node t, stop the algorithm. Otherwise, continue by selecting the next node i with minimum distance label.
The algorithm stops either when it scans the destination node t or the set of reachable nodes is empty. For the nice properties of the algorithm consult any textbook in computer science or operations research.
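The steps above can be sketched in a few lines of Python. This is not the C++ implementation discussed in this post: Python's heapq module offers push and extract-min but no update operation, so the sketch swaps in the common lazy-deletion variant, where a label update pushes a new entry and outdated entries are skipped when they are extracted. The function name dijkstra, the dictionary-based forward star, and the toy graph are assumptions made for the example.

```python
import heapq
from math import inf


def dijkstra(forward_star, s, t):
    """Shortest path from s to t.

    forward_star[i] is a list of (j, c_ij) pairs with c_ij >= 0.
    Returns (distance, path); (inf, []) if t cannot be reached.
    Node names must be comparable, since they break ties in the heap.
    """
    dist = {s: 0}        # distance labels of reachable and scanned nodes
    pred = {s: None}     # predecessor of each labeled node
    scanned = set()
    queue = [(0, s)]     # reachable nodes keyed by distance label
    while queue:
        d_i, i = heapq.heappop(queue)     # extract-min
        if i in scanned:
            continue                      # outdated entry, label was improved
        scanned.add(i)
        if i == t:
            break
        for j, c_ij in forward_star.get(i, ()):
            if d_i + c_ij < dist.get(j, inf):
                dist[j] = d_i + c_ij
                pred[j] = i
                heapq.heappush(queue, (dist[j], j))  # push instead of update
    if t not in scanned:
        return inf, []
    path, node = [], t
    while node is not None:
        path.append(node)
        node = pred[node]
    return dist[t], path[::-1]


graph = {
    "s": [("a", 2), ("b", 5)],
    "a": [("b", 1), ("t", 6)],
    "b": [("t", 2)],
}
print(dijkstra(graph, "s", "t"))
```

With lazy deletion the queue may hold several entries per node, which costs some memory but avoids the handle bookkeeping that a true update operation requires.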
At this point it should be clear why Dijkstra’s algorithm is greedy: it always selects a reachable node with minimum distance label. It is a dynamic programming algorithm because it maintains the recursive relation $d_j = \min \{ d_i + c_{ij} : (i,j) \in A \}$ for all $j \neq s$. If you are familiar with Linear Programming, you should recognize that the distance labels play the role of dual variables of a flow based formulation of the shortest path problem, and Dijkstra’s algorithm constructs a primal solution (i.e. a path) that satisfies the dual constraints $d_j - d_i \le c_{ij}$.
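The primal-dual reading can be checked numerically on a toy graph. The Python sketch below assumes the dual constraints have the usual form d_j - d_i <= c_ij for every arc (i,j); the helper names violated_arcs and tight_arcs and the graph itself are hypothetical. With correct shortest-path labels no arc is violated, and in this example the arcs where the constraint holds with equality are exactly those of the shortest path from s to t.

```python
def violated_arcs(forward_star, d):
    """Arcs (i, j) that break the dual constraint d[j] - d[i] <= c_ij."""
    return [(i, j)
            for i, arcs in forward_star.items()
            for j, c_ij in arcs
            if d[j] - d[i] > c_ij]


def tight_arcs(forward_star, d):
    """Arcs (i, j) where the dual constraint holds with equality."""
    return [(i, j)
            for i, arcs in forward_star.items()
            for j, c_ij in arcs
            if d[j] - d[i] == c_ij]


graph = {
    "s": [("a", 2), ("b", 5)],
    "a": [("b", 1), ("t", 6)],
    "b": [("t", 2)],
}
labels = {"s": 0, "a": 2, "b": 3, "t": 5}   # shortest-path distances from s

no_violation = violated_arcs(graph, labels)
path_arcs = tight_arcs(graph, labels)
# Overestimating the label of t breaks the constraints on the arcs entering t.
wrong = violated_arcs(graph, dict(labels, t=9))
print(no_violation, path_arcs, wrong)
```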
The algorithm uses two data structures: the input graph G and the set of reachable nodes Q. The graph G can be stored with an adjacency list, but be sure that the arcs are stored in contiguous memory, in order to reduce the chance of cache misses when scanning the forward stars. In my implementation, I have used a std::vector to store the forward star of each node.
The second data structure, the most important, is the priority queue Q. The queue has to support three operations: push, update, and extract-min. The type of priority queue used determines the worst-case complexity of Dijkstra’s algorithm. Theoretically, the best strongly polynomial worst-case complexity is achieved via a Fibonacci heap. On road networks, the Multi Bucket heap yields a weakly polynomial worst-case complexity that is more efficient in practice [4,5]. Unfortunately, the Fibonacci Heap is a rather complex data structure, and lazy implementations end up using a simpler Binomial Heap.
The good news is that the Boost Library from version 1.49 has a Heap library. This library contains several types of heaps that share a common interface: d_ary_heap, binomial_heap, fibonacci_heap, pairing_heap, and skew_heap. The worst-case complexities of the basic operations are summarized in a nice table. Contrary to textbooks, these heaps are ordered in non-increasing order (they are max-heaps instead of min-heaps), which means that the top of the heap is always the element with highest priority. For implementing Dijkstra, where all arc lengths are nonnegative, this is not a problem: we can store the elements with the distance changed in sign (sorry for the rough explanation, but if you are really interested it is better to read the source code directly).
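The sign trick is easy to see on a toy example. The Python sketch below uses a small MaxHeap class, which is a hypothetical stand-in for a max-ordered heap and does not reproduce the boost::heap interface; it is built on top of Python's heapq only for brevity. Pushing each node with priority equal to minus its distance makes the highest-priority element the node with the smallest distance.

```python
import heapq


class MaxHeap:
    """Toy max-heap: pop() returns the entry with the highest priority."""

    def __init__(self):
        self._data = []

    def push(self, priority, item):
        # heapq is a min-heap, so the priority is stored negated internally.
        heapq.heappush(self._data, (-priority, item))

    def pop(self):
        neg_priority, item = heapq.heappop(self._data)
        return -neg_priority, item


queue = MaxHeap()
for node, distance in [("a", 7), ("b", 2), ("c", 4)]:
    queue.push(-distance, node)       # priority = distance changed in sign

priority, node = queue.pop()
closest = (node, -priority)           # flip the sign back to get the distance
print(closest)
```

The only thing to remember is to flip the sign again whenever a distance is read back from the queue.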
The big advantage of boost::heap is that it allows you to program Dijkstra once, and to compile it with different heaps via templates. If you wonder why the Boost Graph Library does not use boost::heap, well, the reason is that BGL was implemented a few years ago, while boost::heap appeared this year.
Here is the point that maybe interests you the most: can we do better than well-reputed C++ graph libraries?
I have tried three graph libraries: Boost Graph Library (BGL) v1.51, COIN-OR Lemon v1.2.3, and Google Or-Tools checked out from svn on Sep 7th, 2012. They all have a Dijkstra implementation, even if I don’t know the implementation details. As a plus, the three libraries have Python wrappers (but I have not tested them). The BGL is a header-only library. Lemon came after BGL. BGL, Lemon, and my implementation use (different) Fibonacci Heaps, while it is not clear to me what type of priority queue is used by Or-Tools.
Disclaimer: Google Or-Tools is much more than a graph library: among other things, it has a Constraint Programming solver with very nice features for Large Neighborhood Search; however, we are interested here only in its Dijkstra implementation. Constraint Programming will be the subject of a future post.
A few tests on instances taken from the last DIMACS challenge on Shortest Path problems show the pros and cons of each implementation. Three instances are generated using the rand graph generator, while 10 instances are road networks. The tests were run on my late 2008 MacBookPro using the Apple gcc 4.2 compiler. All the source code, scripts, and even the text of this post are available on GitHub.
The first test compares the four implementations on 3 graphs with different density d that is the ratio . The graphs are:
For each graph, 50 queries between different pairs of source and destination nodes are performed. The table below reports the average of query times (total time divided by query numbers). The entries in bold highlight the shortest time per row.
Graph  MyGraph  BGL  Lemon  Or-Tools 

Rand 1  0.0052  0.0059  0.0074  1.2722 
Rand 2  0.0134  0.0535  0.0706  1.6128 
Rand 3  0.0705  0.5276  0.7247  4.2535 
In these tests, it looks like my implementation is the winner… wow!
However, the true winner is the boost::heap library, since the nasty implementation details are delegated to that library.
… but come on! These are artificial graphs: who is really interested in shortest paths on random graphs?
The second test uses road networks, which are very sparse graphs. We report only the average computation time in seconds over 50 different pairs of source-destination nodes. We decided to leave out Or-Tools since it does not perform well on very sparse graphs.
The table below shows the average query time for the standard implementations that use Fibonacci Heaps.
Area  nodes  arcs  MyGraph  BGL  Lemon 

Western USA  6,262,104  15,248,146  2.7215  2.7804  3.8181 
Eastern USA  3,598,623  8,778,114  1.9425  1.4255  2.7147 
Great Lakes  2,758,119  6,885,658  0.1808  0.8946  0.2602 
California and Nevada  1,890,815  4,657,742  0.5078  0.5808  0.7083 
Northeast USA  1,524,453  3,897,636  0.6061  0.5662  0.8335 
Northwest USA  1,207,945  2,840,208  0.3652  0.3506  0.5152 
Florida  1,070,376  2,712,798  0.1141  0.2753  0.1574 
Colorado  435,666  1,057,066  0.1423  0.1117  0.1965 
San Francisco Bay  321,270  800,172  0.1721  0.0836  0.2399 
New York City  264,346  733,846  0.0121  0.0677  0.0176 
From this table, BGL and my implementation are equally good, while Lemon comes after.
What would happen if we used a different type of heap?
This second table shows the average query time for the Lemon graph library with a specialized Binary Heap implementation, and my own implementation with generic 2Heap and 3Heap (binary and ternary heaps) and with a Skew Heap. Note that in order to use a different heap I just modify a single line of code.
Area  nodes  arcs  2Heap  3Heap  Skew Heap  Lemon 2Heap 

Western USA  6,262,104  15,248,146  1.977  1.934  2.104  1.359 
Eastern USA  3,598,623  8,778,114  1.406  1.372  1.492  0.938 
Great Lakes  2,758,119  6,885,658  0.132  0.130  0.135  0.109 
California and Nevada  1,890,815  4,657,742  0.361  0.353  0.372  0.241 
Northeast USA  1,524,453  3,897,636  0.433  0.421  0.457  0.287 
Northwest USA  1,207,945  2,840,208  0.257  0.252  0.256  0.166 
Florida  1,070,376  2,712,798  0.083  0.081  0.080  0.059 
Colorado  435,666  1,057,066  0.100  0.098  0.100  0.064 
San Francisco Bay  321,270  800,172  0.121  0.117  0.122  0.075 
New York City  264,346  733,846  0.009  0.009  0.009  0.007 
Mmmm… I am no longer the winner: COIN-OR Lemon is!
This is likely due to the specialized binary heap implementation of the Lemon library. Instead, the boost::heap library has a d_ary_heap, which for d=2 is a generic binary heap.
Dijkstra’s algorithm is so beautiful because it has the elegance of simplicity.
Using an existing efficient heap data structure, it is easy to implement an “efficient” version of the algorithm.
However, if you have spare time, or you need to solve shortest path problems on a specific type of graph (e.g., road networks), you might give existing graph libraries a try before investing development time in your own implementation. In addition, be sure to read [4] and the references contained therein.
All the code I have used to write this post is available on GitHub. If you have any comments or criticism, do not hesitate to comment below.
[1] Pohl, I. Bidirectional and heuristic search in path problems. Department of Computer Science, Stanford University, 1969. [pdf]
[2] Sniedovich, M. Dijkstra’s algorithm revisited: the dynamic programming connexion. Control and Cybernetics vol. 35(3), pages 599-620, 2006. [pdf]
[3] Dantzig, G.B. Linear Programming and Extensions. Princeton University Press, Princeton, NJ, 1962.
[4] Delling, D. and Sanders, P. and Schultes, D. and Wagner, D. Engineering route planning algorithms. Algorithmics of Large and Complex Networks, Lecture Notes in Computer Science, Volume 5515, pages 117-139, 2009. [doi]
[5] Goldberg, A.V. and Harrelson, C. Computing the shortest path: A* search meets graph theory. Proc. of the sixteenth annual ACM-SIAM Symposium on Discrete Algorithms, pages 156-165, 2005. [pdf]