Pooling in GNNs is a fairly complicated task that requires a solid understanding of a graph’s structure in order to work properly.

In our latest paper, we presented a new pooling method for GNNs, called **minCUT pooling**, which has a lot of desirable properties for pooling methods:

- it’s based on well-understood theoretical techniques for node clustering;
- it’s fully differentiable and learnable with gradient descent;
- it depends directly on the task-specific loss on which the GNN is being trained, but …
- … it can be trained on its own without a task-specific loss, if needed;
- it’s fast;

The method is based on the minCUT optimization problem from operational research. We considered a relaxed and differentiable version of minCUT, and implemented it as a neural network layer in order to provide a sound pooling method for GNNs.

In this post I’ll describe the working principles of minCUT pooling and show some applications of the layer.

## Background

The K-way normalized minCUT is an optimization problem to find K clusters on a graph by minimizing the overall intra-cluster edge weight. This is equivalent to solving:

where is the set of nodes, is the -th cluster of nodes, and indicates a weighted edge between two nodes.

If we define a **cluster assignment matrix** , which maps each of the nodes to one of the clusters, the problem can also be re-written as:

where is the adjacency matrix of the graph, and is the diagonal degree matrix.

While finding the optimal minCUT is a NP-hard problem, there exist relaxations that can be leveraged by spectral clustering (SC) to find near-optimal solutions in polynomial time. Still, the complexity of SC is in the order of for a graph of nodes, making it expensive to apply to large graphs.

A possible way of solving this scalability issue is to search for good cluster assignments using SGD, which is the idea on which we based our implementation of minCUT pooling.

## MinCUT pooling

We designed minCUT pooling to be used in-between message-passing layers of GNNs. The idea is that, like in standard convolutional networks, pooling layers should help the network capture broader patterns in the input data by summarizing local information.

At the core of minCUT pooling there is a MLP, which maps the node features to a **continuous** cluster assignment matrix (of size ):

We can then use the MLP to generate on the fly, and reduce the graphs with simple multiplications as:

At this point, we can already make a couple of considerations:

- Nodes with similar features will likely belong to the same cluster, because they will be “classified” similarly by the MLP. This is especially good when using message-passing layers before pooling, since they will cause the node features of connected nodes to become similar;
- depends only on the features of the graph, making the layer
**transferable**to new graphs once it has been trained.

This is already pretty good, and it covers some of the main desiderata of a GNN layer, but it still isn’t enough. We want to explicitly account for the connectivity of the graph in order to pool it.

This is where the minCUT optimization comes in.

By slightly adapting the minCUT formulation above, we can design an auxiliary loss to train the MLP, so that it will learn to solve the minCUT problem in an unsupervised way.

In practice, our unsupervised regularization loss encourages the MLP to cluster together nodes that are strongly connected with each other and weakly connected with the nodes in the other clusters.

The full unsupervised loss that we minimize in order to achieve this is:

where is the normalized adjacency matrix of the graph.

Let’s break this loss down and see how it works.

### Cut loss

The first term, , forces the MLP to find a cluster assignment to solve the minCUT problem (to see why, compare it with the minCUT maximization that we described above). We refer to this loss as the **cut loss**.

In particular, minimizing the numerator leads to clustering together nodes that are strongly connected on the graph, while the denominator prevents any of the clusters to be too small.

The cut loss is bounded between -1 and 0, which are **ideally** reached in the following situations:

- when there are disconnected components in the graph, and exactly maps the components to the clusters;
- when all pairs of connected nodes are assigned to different clusters;

The figure below shows what these situations might look like. Note that in the case the clustering corresponds to a bipartite grouping of the graph. This could be a desirable outcome for some applications, but the general assumption is that connected nodes should be clustered together, and not vice-versa.

We must also consider that is non-convex, and that there are spurious minima that can be found via SGD.

For example, for , the uniform assignment matrix

would cause the numerator and the denominator of to be equal, and the loss to be .

A similar situation occurs when all nodes in the graph are assigned to the same cluster.

This can be easily verified with Numpy:

```
In [1]: # Adjacency matrix
...: A = np.array([[1, 0, 1],
...: [0, 1, 0],
...: [1, 0, 1]])
In [2]: # Degree matrix
...: D = np.diag(A.sum(-1))
In [3]: # Perfect cluster assignment
...: S = np.array([[1, 0], [0, 1], [1, 0]])
In [4]: np.trace(S.T @ A @ S) / np.trace(S.T @ D @ S)
Out[4]: 1.0
In [5]: # All nodes uniformly distributed
...: S = np.ones((3, 2)) / 2
In [6]: np.trace(S.T @ A @ S) / np.trace(S.T @ D @ S)
Out[6]: 1.0
In [7]: # All nodes in the same cluster
...: S = np.array([[1, 0], [1, 0], [1, 0]])
In [8]: np.trace(S.T @ A @ S) / np.trace(S.T @ D @ S)
Out[8]: 1.0
```

### Orthogonality loss

The second term, , helps to avoid such degenerate minima of by encouraging the MLP to find clusters that are orthogonal between each other. We call this the **orthogonality loss**.

In other words, encourages the MLP to “make a decision” about which nodes belong to which clusters, avoiding those degenerate solutions where assigns all nodes equally to every cluster.

By adding the orthogonality constraint, we force the MLP to find a non-trivial assignment for the nodes.

Moreover, we can see that the perfect minimizer of is only attainable if we have nodes, because in general, given a dimensional vector space, we cannot find more than mutually orthogonal vectors. The only way to minimize given assignment vectors is therefore to distribute the nodes equally between the clusters. This causes the MLP to avoid the other type of spurious minima of , where all nodes are in a single cluster.

## Interaction of the two losses

We can see how the two loss terms interact with each other to find a good solution to the cluster assignment problem. The figure above shows the evolution of the unsupervised loss as the network is trained to cluster the nodes of Cora (plot on the left). We can see that as the network is trained, the normalized mutual information (NMI) score of the clustering improves, meaning that the layer is able to find meaningful clusters (plot on the right).

Note how starts from a trivial assignment (-1) due to the random initialization, and then moves away from the spurious minima as the orthogonality loss forces the MLP towards more sensible solutions.

### Pooled graph

As a further consideration, we can take a closer look at the pooled adjacency matrix .

First of all, we can see that it is a matrix that contains the number of links connecting each cluster. For example, the entry contains the number of links between the nodes in cluster 1 and cluster 2, while the entry is the number of links between the nodes in cluster 1.

We can also see that the trace of is exactly the numerator that is being minimized in . Therefore, we can expect the diagonal elements to be much larger than the other entries of .

For this reason, will represent a graph with very strong self-loops, and the message-passing layers after pooling will have a hard time propagating information on the graph (because the self-loops will keep sending the information of a node back onto itself, and not its neighbors).

To address this problem, a solution is to remove the diagonal of and renormalize the matrix by its degree, before giving it as output of the pooling layer:

Our reccomendation is to combine minCUT with message-passing layers with a built-in skip connection, in order to bring each node’s information forward (e.g., Spektral’s GraphConvSkip). However, if your GNN is based on the graph convolutional networks (GCN) of Kipf & Welling, you may want to manually re-compute the normalized Laplacian after pooling in order to add the self-loops back.

### Notes on gradient flow

A couple of notes for the gradient-heads out there.

The unsupervised loss can be optimized on its own, adapting the weights of the MLP to compute an that solves the minCUT problem under the orthogonality constraint.

However, given the multiplicative interaction between and , the gradient can also flow from the task-specific loss (i.e., whatever the GNN is being trained to do) through the MLP. We can see in the picture above how there is a path going from the input to the output , directly passing through the MLP.

This means that the overall solution found by the GNN will keep into account both the graph structure (to solve minCUT) and the final task.

## Code

Implementing minCUT in TensorFlow is fairly straightforward. Let’s start from some setup:

```
import tensorflow as tf
from tensorflow.keras.layers import Dense
A = ... # Adjacency matrix (N x N)
X = ... # Node features (N x F)
n_clusters = ... # Number of clusters to find with minCUT
```

First, the layer computes the cluster assignment matrix `S`

by applying a softmax MLP to the node features:

```
H = Dense(16, activation='relu')(X)
S = Dense(n_clusters, activation='softmax')(H) # Cluster assignment matrix
```

The cut loss is then implemented as:

```
# Cut loss
A_pool = tf.matmul(
tf.transpose(tf.matmul(A, S)), S
)
num = tf.trace(A_pool)
D = tf.reduce_sum(A, axis=-1)
D_pooled = tf.matmul(
tf.transpose(tf.matmul(D, S)), S
)
den = tf.trace(D_pooled)
mincut_loss = -(num / den)
```

And the orthogonality loss is implemented as:

```
# Orthogonality loss
St_S = tf.matmul(tf.transpose(S), S)
I_S = tf.eye(n_clusters)
ortho_loss = tf.norm(
St_S / tf.norm(St_S) - I_S / tf.norm(I_S)
)
```

Finally, the full unsupervised loss of the layer is obtained as the sum of the two auxiliary losses:

```
total_loss = mincut_loss + ortho_loss
```

The actual pooling step is simply implemented as a simple multiplication of `S`

with `A`

and `X`

, then we zero-out the diagonal of `A_pool`

and re-normalize the matrix. Since we already computed `A_pool`

for the numerator of , we only need to do:

```
# Pooling node features
X_pool = tf.matmul(tf.transpose(S), X)
# Zeroing out the diagonal
A_pool = tf.linalg.set_diag(A_pool, tf.zeros(tf.shape(A_pool)[:-1])) # Remove diagonal
# Normalizing A_pool
D_pool = tf.reduce_sum(A_pool, -1)
D_pool = tf.sqrt(D_pool)[:, None] + 1e-12 # Add epsilon to avoid division by 0
A_pool = (A_pool / D_pool) / tf.transpose(D_pool)
```

Wrap this up in a layer, and use the layer in a GNN. Done.

## Experiments

### Unsupervised clustering

Because the core of minCUT pooling is an unsupervised loss that does not require labeled data in order to be minimized, we can optimize on its own to test the clustering ability of minCUT.

A good first test is to check whether the layer is able to cluster a grid (the size of the clusters should be the same), and to isolate communities in a network. We see in the figure below that minCUT was able to do this perfectly.

To make things more interesting, we can also test minCUT on the task of graph-based image segmentation. We can build a region adjacency graph from a natural image, and cluster its nodes in order to see if regions with similar colors are clustered together.

The results look nice, and remember that this was obtained by only optimizing !

Finally, we also checked the clustering abilities of minCUT pooling on the popular citations datasets: Cora, Citeseer, and Pubmed. As mentioned before, we used the Normalized Mutual Information (NMI) score to test whether the layer was clustering together nodes of the same class. Note that the layer did not have access to the labels during training (meaning that we didn’t need to decide how to split the data into train and test sets, which is a known issue in the GNN community).

You can check the paper to see how minCUT fared in comparison to other methods, but in short: it did well, sometimes by a full order of magnitude better than previous methods.

### Autoencoder

Another interesting unsupervised test that we came up with was to check how much information is preserved in the coarsened graph after pooling. To do this, we built a simple graph autoencoder with the structure pictured below:

The “Unpool” layer is simply obtained by transposing the same found by minCUT, in order to upscale the graph instead of downscaling it:

We tested the graph AE on some very regular graphs, that should have been easy to reconstruct after pooling. Surprisingly, this turned out to be a difficult problem for some pooling layers from the GNN literature. MinCUT, on the other hand, was able to defend itself quite nicely.

### Supervised inductive tasks

Finally, we tested whether minCUT provides an improvement on the usual graph classification and graph regression tasks.

We picked a fixed GNN architecture, and tested several pooling strategies by swapping the pooling layers in the network.

The dataset that we used were:

- The Benchmark Data Sets for Graph Kernels;
- A synthetic dataset created by F. M. Bianchi to test GNNs;
- The QM9 dataset for the prediction of chemical properties of molecules.

I’m not gonna report the comparisons with other methods, but I will highlight an interesting sanity check that we performed in order to see whether using GNNs and graph pooling even made sense at all.

Among the various methods that we tested, we also included:

- A simple MLP which did not exploit the relational information carried by the graphs;
- The same GNN architecture without pooling layers.

We were once again surprised to see that, while minCUT yielded a consistent improvement over such simple baselines, other pooling methods did not.

## Conclusions

Working on minCUT pooling was an interesting experience that deepened my understanding of GNNs, and allowed me to see what is really necessary for a GNN to work.

We have put the paper on arXiv, and I’m going to release an official implementation of the layer on Spektral soon.

If you want to build on our work and use minCUT in your own GNNs, you can cite us with:

```
@article{bianchi2019mincut,
title={Mincut Pooling in Graph Neural Networks},
author={Filippo Maria Bianchi and Daniele Grattarola and Cesare Alippi},
journal={arXiv preprint arXiv:1907.00481},
year={2019}
}
```

Cheers!