<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Daniele Grattarola</title>
    <description>Artificial intelligence scientist</description>
    <link>https://danielegrattarola.github.io/</link>
    <atom:link href="https://danielegrattarola.github.io/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Fri, 07 Nov 2025 09:29:19 +0000</pubDate>
    <lastBuildDate>Fri, 07 Nov 2025 09:29:19 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
      
    
    <item>
        <title>My second interview on Machine Learning Street Talk</title>
        <description>&lt;div class=&quot;video-container&quot;&gt;
    &lt;iframe src=&quot;https://www.youtube-nocookie.com/embed/v5NysEyZkl0&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;I was featured for the second time on &lt;a href=&quot;https://www.youtube.com/channel/UCMLtBahI5DMrt0NPvDSoIRQ&quot;&gt;Machine Learning Street Talk&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This interview was shot at NeurIPS 2022 last year, where I was presenting our work on 
&lt;a href=&quot;https://arxiv.org/abs/2205.15674&quot;&gt;generalized implicit neural representations&lt;/a&gt; from my time at EPFL.&lt;/p&gt;

&lt;p&gt;Cheers!&lt;/p&gt;
</description>
        <pubDate>Sat, 16 Dec 2023 00:00:00 +0000</pubDate>
        
        <link>/posts/2023-12-16/MLST-2.html</link>
          
        
            <category>update</category>
        
          
        
            <category>posts</category>
        
          
      </item>
    
    <item>
        <title>My interview on Machine Learning Street Talk</title>
        <description>&lt;div class=&quot;video-container&quot;&gt;
    &lt;iframe src=&quot;https://www.youtube-nocookie.com/embed/MDt2e8XtUcA&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;I had the pleasure of being a guest on &lt;a href=&quot;https://www.youtube.com/channel/UCMLtBahI5DMrt0NPvDSoIRQ&quot;&gt;Machine Learning Street Talk&lt;/a&gt; to chat about cellular automata, emergence, life, the universe, and my own work on &lt;a href=&quot;https://danielegrattarola.github.io/posts/2021-11-08/graph-neural-cellular-automata.html&quot;&gt;graph neural cellular automata&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I had a great time with Tim and Keith; they are doing incredible work with the podcast and it’s really an honor to have been a part of it.&lt;/p&gt;

&lt;p&gt;Enjoy!&lt;/p&gt;

&lt;p&gt;&lt;sup style=&quot;font-size: 10px;&quot;&gt;P.S. I was so nervous and hyper-excited that I lost my own train of thought a couple of times, please be patient :D&lt;/sup&gt;&lt;/p&gt;
</description>
        <pubDate>Fri, 29 Apr 2022 00:00:00 +0000</pubDate>
        
        <link>/posts/2022-04-29/MLST.html</link>
          
        
            <category>update</category>
        
          
        
            <category>posts</category>
        
          
      </item>
    
    <item>
        <title>Graph Neural Cellular Automata</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/fixed_target_animation.gif&quot; alt=&quot;Graph Neural Cellular Automata for morphogenesis&quot; class=&quot;centered&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Cellular_automaton&quot;&gt;Cellular automata&lt;/a&gt; (or CA for short) are a fascinating computational model. 
They consist of a lattice of stateful cells and a transition rule that updates the state of each cell as a function of its neighbourhood configuration. 
By applying this local rule synchronously over time, we see interesting dynamics emerge.&lt;/p&gt;

&lt;p&gt;For example, here is the transition table of &lt;a href=&quot;https://en.wikipedia.org/wiki/Rule_110&quot;&gt;Rule 110&lt;/a&gt; in a 1-dimensional binary CA:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/Rule110-rule.png&quot; alt=&quot;Rule 110, transition table&quot; class=&quot;threeq-width&quot; /&gt;&lt;/p&gt;

&lt;p&gt;And here is the corresponding evolution of the states starting from a random initialization (time goes downwards):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/Rule110rand.png&quot; alt=&quot;Rule 110, evolution of the states&quot; class=&quot;quarter-width&quot; /&gt;&lt;/p&gt;
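
&lt;p&gt;As an aside, the rule is simple enough to simulate in a few lines of Python (my own minimal sketch, not code from any paper):&lt;/p&gt;

```python
import numpy as np

# Rule 110 lookup table: (left, centre, right) neighbourhood to next state.
# The binary expansion of 110, i.e. 01101110, lists the outputs for the
# neighbourhoods 111, 110, 101, 100, 011, 010, 001, 000.
RULE_110 = {(1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
            (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0}

def step(state):
    """Apply the rule synchronously to a 1-D binary state (periodic boundary)."""
    n = len(state)
    return [RULE_110[(state[(i - 1) % n], state[i], state[(i + 1) % n])]
            for i in range(n)]

# Evolve a random initial state; each row of `history` is one time step.
rng = np.random.default_rng(0)
state = list(rng.integers(0, 2, size=32))
history = [state]
for _ in range(16):
    state = step(state)
    history.append(state)
```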

&lt;p&gt;By changing the rule, we get different dynamics, some of which can be extremely interesting. One example of this is the 2-dimensional &lt;a href=&quot;https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life&quot;&gt;Game of Life&lt;/a&gt;, with its complex patterns that replicate and move around the grid.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/glider_gun.gif&quot; alt=&quot;Gosper glider gun&quot; class=&quot;half-width&quot; /&gt;&lt;/p&gt;
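
&lt;p&gt;The transition rule behind these patterns also fits in a few lines (a standard sketch with 0/1 states on a toroidal grid, again my own code):&lt;/p&gt;

```python
import numpy as np

def life_step(grid):
    """One synchronous Game of Life step on a toroidal grid of 0/1 states."""
    # Count alive neighbours by summing the eight shifted copies of the grid.
    n = sum(np.roll(np.roll(grid, di, axis=0), dj, axis=1)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0))
    # Birth on exactly 3 alive neighbours; survival on 2 or 3.
    alive = np.logical_or(n == 3, np.logical_and(grid == 1, n == 2))
    return alive.astype(int)
```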

&lt;p&gt;We can also take this idea of locality to the extreme, by keeping it as the only requirement and making everything else more complicated.&lt;/p&gt;

&lt;p&gt;For example, if we make the states continuous and change the size of the neighbourhood, we get &lt;a href=&quot;https://arxiv.org/abs/1812.05433&quot;&gt;the mesmerizing Lenia CA&lt;/a&gt; with its &lt;em&gt;insanely&lt;/em&gt; life-like creatures that move around smoothly, reproduce, and even organize themselves into higher-order organisms.&lt;/p&gt;

&lt;div class=&quot;video-container&quot;&gt;
    &lt;iframe src=&quot;https://www.youtube-nocookie.com/embed/iE46jKYcI4Y&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;By this principle, we can also derive an even more general version of CA, in which the neighbourhoods of the cells no longer have a fixed shape and size. Instead, the cells of the CA are organized in an arbitrary graph.&lt;/p&gt;

&lt;p&gt;Note that the central idea of locality that characterizes CA does not change at all: we’re just extending it to account for these more general neighbourhoods.&lt;/p&gt;

&lt;p&gt;These super-general CA are usually called &lt;strong&gt;Graph Cellular Automata (GCA)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/gca_transition.png&quot; alt=&quot;Example of GCA transition&quot; class=&quot;half-width&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The general form of GCA transition rules is a map from a cell and its neighbourhood to the next state, and we can also make it &lt;strong&gt;anisotropic&lt;/strong&gt; by introducing edge attributes that specify a relation between the cell and each neighbour.&lt;/p&gt;
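
&lt;p&gt;To fix ideas, here is what such a transition rule looks like as code (a sketch with hypothetical names, not the paper's implementation):&lt;/p&gt;

```python
def gca_step(states, neighbors, edge_attrs, rule):
    """One synchronous GCA update on an arbitrary graph.

    states:     dict mapping each cell to its state
    neighbors:  dict mapping each cell to the list of its neighbours
    edge_attrs: dict mapping (cell, neighbour) pairs to edge attributes,
                which is what makes the rule anisotropic
    rule:       maps (state, list of (neighbour_state, edge_attr)) to the next state
    """
    return {
        i: rule(states[i],
                [(states[j], edge_attrs.get((i, j))) for j in neighbors[i]])
        for i in states  # every cell reads the *current* states of its neighbours
    }

# Example: a simple isotropic majority rule that ignores the edge attributes.
majority = lambda s, nbrs: int(sum(ns for ns, _ in nbrs) * 2 > len(nbrs))
next_states = gca_step({0: 1, 1: 0, 2: 1},
                       {0: [1, 2], 1: [0, 2], 2: [0, 1]},
                       {}, majority)
```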

&lt;h2 id=&quot;learning-ca-rules&quot;&gt;Learning CA rules&lt;/h2&gt;

&lt;p&gt;The world of CA is fascinating but, unfortunately, they are almost always regarded as little more than pretty things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But can they be also useful? Can we design a rule to solve an interesting problem using the decentralized computation of CA?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer is yes, although manually designing such a rule can be hard. Being AI scientists, however, we can try to learn the rule instead.&lt;/p&gt;

&lt;p&gt;This is not a new idea.&lt;/p&gt;

&lt;p&gt;We can go back to NeurIPS 1992 to find a seminal work on &lt;a href=&quot;https://papers.nips.cc/paper/1992/hash/d6c651ddcd97183b2e40bc464231c962-Abstract.html&quot;&gt;learning CA rules with neural networks&lt;/a&gt; (they use convolutional neural networks, although back then they were called “sum-product networks with shared weights”).&lt;/p&gt;

&lt;p&gt;Since then, we’ve seen other approaches to learn CA rules, like these papers using &lt;a href=&quot;https://mobile.aau.at/~welmenre/papers/elmenreich-iwsos2011.pdf&quot;&gt;genetic algorithms&lt;/a&gt; or &lt;a href=&quot;https://ieeexplore.ieee.org/abstract/document/8004527&quot;&gt;compositional pattern-producing networks&lt;/a&gt; to find rules that lead to a desired configuration of states, a task called &lt;strong&gt;morphogenesis&lt;/strong&gt;.&lt;sup style=&quot;font-size: 10px;&quot;&gt; &lt;a href=&quot;https://sci-hub.se/&quot;&gt;Papers not on arXiv, sorry&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;More recently, convolutional networks have been shown to be extremely versatile in learning CA rules. 
&lt;a href=&quot;https://arxiv.org/abs/1809.02942&quot;&gt;This work by William Gilpin&lt;/a&gt;, for example, shows that we can implement any desired transition rule with CNNs by smartly setting their weights.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/planarian.jpg&quot; alt=&quot;A planarian flatworm&quot; class=&quot;half-width&quot; /&gt;&lt;/p&gt;

&lt;p&gt;CNNs have also been used for morphogenesis. Inspired by the regenerative abilities of the flatworm (pictured above), the authors of &lt;a href=&quot;https://distill.pub/2020/growing-ca/&quot;&gt;this visually-striking paper&lt;/a&gt; train a CNN to grow into a desired image and to regenerate it if it is perturbed.&lt;/p&gt;

&lt;h2 id=&quot;learning-gca-rules&quot;&gt;Learning GCA rules&lt;/h2&gt;

&lt;p&gt;So, can we do something similar in the more general setting of GCA?&lt;/p&gt;

&lt;p&gt;Well, let’s start with the model. 
Similar to how CNNs are the natural family of models to implement typical grid-based CA rules, the more general family of graph neural networks is the natural choice for GCA.&lt;/p&gt;

&lt;p&gt;We call this setting the &lt;strong&gt;Graph Neural Cellular Automata (GNCA)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/thumbnail_cut.png&quot; alt=&quot;Graph Neural Cellular Automata&quot; class=&quot;threeq-width&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We propose an architecture composed of a pre-processing MLP, a message-passing layer, and a post-processing MLP, which we use as the transition function.&lt;/p&gt;

&lt;p&gt;This model is universal: it can represent any GCA transition rule. We can prove this by making an argument similar to the one for CNNs that I mentioned above.&lt;/p&gt;

&lt;p&gt;I won’t go into the specific details here, but in short, we need to implement two operations:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;One-hot encoding of the states;&lt;/li&gt;
  &lt;li&gt;Pattern-matching for the desired rule.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first two blocks in our GNCA are more than enough to achieve this. 
The pre-processing MLP can compute the one-hot encoding, and by using edge attributes and &lt;a href=&quot;https://arxiv.org/abs/1704.02901&quot;&gt;edge-conditioned convolutions&lt;/a&gt; we can implement pattern matching easily.&lt;/p&gt;
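
&lt;p&gt;To give a taste of the argument, the one-hot encoding step really is within reach of a small ReLU network: for integer states, relu(1 - |s - i|) equals 1 exactly when s = i (a sketch assuming discrete integer states, not the paper's construction):&lt;/p&gt;

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def one_hot(s, n_states):
    """One-hot encode an integer state using only ReLU arithmetic.

    For integer s, relu(1 - |s - i|) equals 1 exactly when s == i, and
    |s - i| = relu(s - i) + relu(i - s), so the whole map is expressible
    as a small fixed-weight two-layer ReLU network.
    """
    i = np.arange(n_states)
    return relu(1 - (relu(s - i) + relu(i - s)))
```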

&lt;h2 id=&quot;experiments&quot;&gt;Experiments&lt;/h2&gt;

&lt;p&gt;However, regardless of what the theory says, we want to know whether we can learn a rule in practice. Let’s try a few experiments.&lt;/p&gt;

&lt;h3 id=&quot;voronoi-gca&quot;&gt;Voronoi GCA&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/voronoi.png&quot; alt=&quot;Voronoi GCA&quot; class=&quot;half-width&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can start from the simplest possible binary GCA, inspired by the 1992 NeurIPS paper I mentioned before. The difference is that our CA cells are given by the &lt;a href=&quot;https://en.wikipedia.org/wiki/Voronoi_diagram&quot;&gt;Voronoi tessellation&lt;/a&gt; of some random points. 
Alternatively, you can think of this GCA as being defined on the &lt;a href=&quot;https://en.wikipedia.org/wiki/Delaunay_triangulation&quot;&gt;Delaunay triangulation&lt;/a&gt; of the points.&lt;/p&gt;

&lt;p&gt;We use an &lt;a href=&quot;https://en.wikipedia.org/wiki/Life-like_cellular_automaton&quot;&gt;outer-totalistic rule&lt;/a&gt; that swaps the state of a cell if the density of its alive neighbours exceeds a certain threshold, not too different from the Game of Life.&lt;/p&gt;
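
&lt;p&gt;Concretely, the target rule can be sketched like this (the threshold below is illustrative; see the paper for the exact rule):&lt;/p&gt;

```python
def density_rule_step(alive, neighbors, threshold=0.5):
    """One step of a density-based outer-totalistic GCA.

    alive:     dict mapping each cell to 0 (dead) or 1 (alive)
    neighbors: dict mapping each cell to the list of its neighbours
    A cell swaps its state when the fraction of alive neighbours exceeds
    `threshold` (an illustrative value, not necessarily the paper's).
    """
    new_states = {}
    for i, s in alive.items():
        density = sum(alive[j] for j in neighbors[i]) / len(neighbors[i])
        new_states[i] = 1 - s if density > threshold else s
    return new_states
```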

&lt;p&gt;We try to see if our model can learn this kind of transition rule. In particular, we can train the model to approximate the 1-step dynamics in a supervised way, given that we know the true transition rule.&lt;/p&gt;

&lt;div style=&quot;text-align: center&quot;&gt;
&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/learn_gca_loss_v_epoch.svg&quot; width=&quot;30%&quot; style=&quot;display: inline-block; margin:auto;&quot; /&gt;&amp;nbsp;
&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/learn_gca_acc_v_epoch.svg&quot; width=&quot;30%&quot; style=&quot;display: inline-block; margin:auto;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;The results are encouraging. We see that the GNCA achieves 100% accuracy with no trouble and, if we let it evolve autonomously, it does not diverge from the real trajectory.&lt;/p&gt;

&lt;h3 id=&quot;boids&quot;&gt;Boids&lt;/h3&gt;
&lt;div style=&quot;text-align: center&quot;&gt;
&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/alignment.png&quot; width=&quot;30%&quot; style=&quot;display: inline-block; margin:auto;&quot; /&gt;&amp;nbsp;
&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/cohesion.png&quot; width=&quot;30%&quot; style=&quot;display: inline-block; margin:auto;&quot; /&gt;&amp;nbsp;
&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/separation.png&quot; width=&quot;30%&quot; style=&quot;display: inline-block; margin:auto;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;For our second experiment, we keep a similar setting but make the target GCA much more complicated. 
We consider the &lt;a href=&quot;https://en.wikipedia.org/wiki/Boids&quot;&gt;Boids&lt;/a&gt; algorithm, an agent-based model designed to simulate the flocking of birds. This can still be seen as a kind of GCA because the state of each bird (its position and velocity) is updated only locally as a function of its closest neighbours.
However, this means that the states of the GCA are continuous and multi-dimensional, and also that the graph changes over time.&lt;/p&gt;
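
&lt;p&gt;The locality is easy to see in code. Here is a bare-bones Boids update (the weights and radius are made up, and the real algorithm also caps speeds):&lt;/p&gt;

```python
import numpy as np

def boids_step(pos, vel, radius=1.0, w_align=0.05, w_cohere=0.01,
               w_separate=0.05, dt=1.0):
    """One step of a bare-bones Boids update; pos and vel are (n, 2) arrays."""
    new_vel = vel.copy()
    for i in range(len(pos)):
        # The neighbourhood of bird i: every bird within `radius` of it,
        # so the underlying graph changes as the birds move.
        d = np.linalg.norm(pos - pos[i], axis=1)
        nbrs = np.logical_and(radius > d, d > 0)
        if not nbrs.any():
            continue
        centre = pos[nbrs].mean(axis=0)
        new_vel[i] += w_align * (vel[nbrs].mean(axis=0) - vel[i])  # alignment
        new_vel[i] += w_cohere * (centre - pos[i])                 # cohesion
        new_vel[i] += w_separate * (pos[i] - centre)               # separation (crude)
    return pos + dt * new_vel, new_vel
```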

&lt;p&gt;Again, we can train the GNCA on the 1-step dynamics. We see that, although it’s hard to approximate the exact behaviour, we get very close to the true system. 
The GNCA (yellow) can form the same kind of flocks as the true system (purple), even if their trajectories diverge.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/boids_animation.gif&quot; alt=&quot;Boids GCA and trained GNCA&quot; class=&quot;centered&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;morphogenesis&quot;&gt;Morphogenesis&lt;/h3&gt;

&lt;p&gt;The final experiment is also the most interesting, and the one where we actually design a rule. 
As in previous works in the literature, here too we focus on morphogenesis. Our task is to find a GNCA rule that, starting from a given initial condition, converges to a desired point cloud (like a bunny) where the connectivity of the cells has a geometrical/spatial meaning.&lt;/p&gt;

&lt;p&gt;In this case, we don’t know the true rule, so we must train the model differently, by teaching it to arrive at the target state when evolving autonomously.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/gnca_training.png&quot; alt=&quot;Training scheme for GNCA&quot; class=&quot;threeq-width&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To do so, we let the model evolve for a given number of steps, then we compute the loss from the target, and we update the weights with backpropagation through time. 
To stabilise training, and to ensure that the target state becomes a stable attractor of the GNCA, we use a cache. This is a kind of replay memory from which we sample the initial conditions, so that we can reuse the states explored by the GNCA during training.
Crucially, this teaches the model to remain at the target state when starting from the target state.&lt;/p&gt;
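
&lt;p&gt;In pseudo-Python, the scheme looks roughly like this (helper names are hypothetical, and the loss and BPTT details are omitted):&lt;/p&gt;

```python
import random

def train_gnca(model_step, loss_and_update, x0, n_epochs=100,
               rollout_steps=10, cache_size=256, p_reset=0.2):
    """Sketch of the training scheme with a state cache (replay memory).

    model_step:      one application of the GNCA transition rule
    loss_and_update: computes the loss against the target and updates the
                     model with backpropagation through time (omitted here)
    With probability `p_reset` a rollout restarts from the initial condition
    x0; otherwise it resumes from a previously explored state, which teaches
    the model to stay at the target once it has reached it.
    """
    cache = [x0]
    for _ in range(n_epochs):
        x = x0 if p_reset > random.random() else random.choice(cache)
        for _ in range(rollout_steps):   # let the GNCA evolve autonomously
            x = model_step(x)
        loss_and_update(x)               # BPTT through the whole rollout
        cache.append(x)                  # reuse this state in later epochs
        if len(cache) > cache_size:
            cache.pop(0)
    return cache
```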

&lt;p&gt;And the results are pretty amazing… have you seen the gif at the &lt;a href=&quot;#&quot;&gt;top of the post&lt;/a&gt;? Let’s unroll the first few frames here.&lt;/p&gt;

&lt;p&gt;A 2-dimensional grid:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/clouds/Grid_10-20/evolution.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A bunny:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/clouds/Bunny_10-20/evolution.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://pygsp.readthedocs.io/en/stable/&quot;&gt;PyGSP&lt;/a&gt; logo:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/clouds/Logo_20/evolution.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We see that the GNCA has no trouble finding a stable rule that converges quickly to the target and then remains there.&lt;/p&gt;

&lt;p&gt;Even for complex and seemingly random graphs, like the Minnesota road network, the GNCA can learn a rule that quickly and stably converges to the target:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/clouds/Minnesota_20/anim.gif&quot; alt=&quot;&quot; class=&quot;third-width&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However, this is not the full story. Sometimes, instead of converging, the GNCA learns to remain in an orbit around the target state, giving us these oscillating point clouds.&lt;/p&gt;

&lt;p&gt;Grid:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/clouds/Grid_10/evolution.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Bunny:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/clouds/Bunny_10/evolution.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Logo:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-11-08/clouds/Logo_10/evolution.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;now-what&quot;&gt;Now what?&lt;/h2&gt;

&lt;p&gt;So, where do we go from here?&lt;/p&gt;

&lt;p&gt;We have seen that GNCA can reach global coherence through local computation, which is not that different from what we do in graph representation learning. In fact, &lt;a href=&quot;https://www.researchgate.net/profile/Franco_Scarselli/publication/4202380_A_new_model_for_earning_in_raph_domains/links/0c9605188cd580504f000000.pdf&quot;&gt;the first GNN paper&lt;/a&gt;, back in 2005, already contained this idea.&lt;/p&gt;

&lt;p&gt;But moving forward, it’s easy to see that the idea of emergent computation on graphs could apply to many scenarios, from swarm optimization and control to modelling epidemiological transmission; it could even improve our understanding of complex biological systems, like the brain.&lt;/p&gt;

&lt;p&gt;GNCAs enable the design of GCA transition rules, unlocking the power of decentralised and emergent computation to solve real-world problems.&lt;/p&gt;

&lt;p&gt;The code for the paper is available &lt;a href=&quot;https://github.com/danielegrattarola/GNCA&quot;&gt;on GitHub&lt;/a&gt;; feel free to reach out via email if you have any questions or comments.&lt;/p&gt;

&lt;h2 id=&quot;read-more&quot;&gt;Read more&lt;/h2&gt;

&lt;p&gt;This blog post is the short version of our NeurIPS 2021 paper:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2110.14237&quot;&gt;Learning Graph Cellular Automata&lt;/a&gt;&lt;br /&gt;
&lt;em&gt;D. Grattarola, L. Livi, C. Alippi&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can cite the paper as follows:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@inproceedings{grattarola2021learning,
  title={Learning Graph Cellular Automata},
  author={Grattarola, Daniele and Livi, Lorenzo and Alippi, Cesare},
  booktitle={Neural Information Processing Systems},
  year={2021}
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Mon, 08 Nov 2021 00:00:00 +0000</pubDate>
        
        <link>/posts/2021-11-08/graph-neural-cellular-automata.html</link>
          
        
            <category>GNN</category>
        
            <category>cellular-automata</category>
        
          
        
            <category>posts</category>
        
          
      </item>
    
    <item>
        <title>A practical introduction to GNNs - Part 2</title>
        <description>&lt;p&gt;&lt;em&gt;This is Part 2 of an introductory lecture on graph neural networks that I gave for the “Graph Deep Learning” course at the University of Lugano.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;After a practical introduction to GNNs in &lt;a href=&quot;https://danielegrattarola.github.io/posts/2021-03-03/gnn-lecture-part-1.html&quot;&gt;Part 1&lt;/a&gt;, here I show how we can formulate GNNs in a much more flexible way using the idea of message passing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;First, I introduce message passing. Then, I show how to implement message-passing networks in Jax/pseudocode using a paradigm called “gather-scatter”. Finally, I show how to implement a couple of more advanced GNN models.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://danielegrattarola.github.io/files/talks/2021-03-01-USI_GDL_GNNs.pdf&quot;&gt;The full slide deck is available here&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;In &lt;a href=&quot;https://danielegrattarola.github.io/posts/2021-03-03/gnn-lecture-part-1.html&quot;&gt;Part 1&lt;/a&gt; of this series we constructed our first kind of GNN by replicating the behavior of conventional CNNs on data supported by graphs.&lt;/p&gt;

&lt;p&gt;The core building block that we used in our simple GNNs looked like this:&lt;/p&gt;

\[\mathbf{X}&apos; = \mathbf{R}\mathbf{X}\mathbf{\Theta}\]

&lt;p&gt;which, as we saw, has two effects:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;All node attributes \(\mathbf{X}\) are transformed using the learnable matrix \(\mathbf{\Theta}\);&lt;/li&gt;
  &lt;li&gt;The attribute of each node gets replaced with a weighted sum of its neighbors via the reference operator \(\mathbf{R}\) (also, sometimes we can include the node itself in the sum);&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By combining these two ideas we were able to get a very good approximation of a CNN for graphs.&lt;/p&gt;
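
&lt;p&gt;On a toy graph, that propagation step is just two matrix products (a numpy sketch, where R is taken to be the row-normalised adjacency matrix):&lt;/p&gt;

```python
import numpy as np

# Toy graph: a path 0 - 1 - 2, with 2 features per node.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])

# Reference operator: row-normalised adjacency, i.e. averaging over neighbours.
R = A / A.sum(axis=1, keepdims=True)

# Learnable weights (random here; in practice they are trained).
rng = np.random.default_rng(0)
Theta = rng.normal(size=(2, 4))

# Transform all node features with Theta, then replace each node's
# attributes with the weighted sum of its neighbours' attributes.
X_new = R @ X @ Theta
```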

&lt;p&gt;In this part of the lecture, we will take these two ideas and describe them a little more formally, distilling the essential role that they have in a GNN.&lt;/p&gt;

&lt;p&gt;We will see a general framework called &lt;strong&gt;message passing&lt;/strong&gt;, which will allow us to describe more complex GNNs than those we have seen so far.&lt;/p&gt;

&lt;h2 id=&quot;message-passing-networks&quot;&gt;Message Passing Networks&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-03-03/presentation-14.svg&quot; width=&quot;100%&quot; style=&quot;border: solid 1px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The idea of message passing networks was introduced in a paper by &lt;a href=&quot;&quot;&gt;Gilmer et al.&lt;/a&gt; in 2017, and it essentially boils GNN layers down to three main steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Every node in the graph computes a &lt;strong&gt;message&lt;/strong&gt; for each of its neighbors. Messages are a function of the node, the neighbor, and the edge between them.&lt;/li&gt;
  &lt;li&gt;Messages are sent, and every node &lt;strong&gt;aggregates&lt;/strong&gt; the messages it receives, using a permutation-invariant function (i.e., it doesn’t matter in which order the messages are received). This function is usually a sum or an average.&lt;/li&gt;
  &lt;li&gt;After receiving the messages, each node &lt;strong&gt;updates&lt;/strong&gt; its attributes as a function of its current attributes and the aggregated messages.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This procedure happens synchronously for all nodes in the graph, so that at each message passing step all nodes are updated.&lt;/p&gt;

&lt;p&gt;If we look back at our super-simple GNN formulation \(\mathbf{X}&apos; = \mathbf{R}\mathbf{X}\mathbf{\Theta}\), we can easily see the three message-passing steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Message&lt;/strong&gt; - Each node \(i\) will receive the same kind of message \(\mathbf{\Theta}^\top\mathbf{x}_j\) from all its neighbors \(j \in \mathcal{N}(i)\).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Aggregate&lt;/strong&gt; - Messages are aggregated with a weighted sum, where weights are defined by the reference operator \(\mathbf{R}\).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Update&lt;/strong&gt; - Each node simply replaces its attributes with the aggregated messages. &lt;br /&gt;
If \(\mathbf{R}\) has a non-zero diagonal, then each node also computes a message “from itself to itself” using \(\mathbf{\Theta}\).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-03-03/presentation-15.svg&quot; width=&quot;100%&quot; style=&quot;border: solid 1px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Message passing is usually formalized with the equation in the slide above.&lt;/p&gt;

&lt;p&gt;While it may look complicated at first, the formula simply describes the three steps that we just saw, and if we wanted to write it in Python it would look something like this:&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# For every node in the graph
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_nodes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Compute messages from neighbors
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;message&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;neighbors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Aggregate messages
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;aggregated&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aggregate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Update node attributes
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aggregated&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As long as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;message&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aggregate&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;update&lt;/code&gt; are differentiable functions, we can train a neural network to transform its inputs like this. &lt;br /&gt;
In fact, this framework is so general that virtually all libraries that implement GNNs are based on it.&lt;/p&gt;

&lt;p&gt;For example, &lt;a href=&quot;https://graphneural.network&quot;&gt;Spektral&lt;/a&gt;, &lt;a href=&quot;https://pytorch-geometric.readthedocs.io/&quot;&gt;Pytorch Geometric&lt;/a&gt;, and &lt;a href=&quot;https://www.dgl.ai/&quot;&gt;DGL&lt;/a&gt; all have a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MessagePassing&lt;/code&gt; class that looks like this:&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MessagePassing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Layer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;call&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# This is the actual message-passing step
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;propagate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;propagate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Compute messages
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;message&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;msg_kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Aggregate messages
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;aggregated&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;aggregate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;agg_kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# Update self
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;aggregated&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;upd_kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;message&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;aggregate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aggregated&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
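&lt;p&gt;To make the skeleton concrete, here is a minimal sketch of such a layer in JAX (the class name, call signature, and sum aggregation are illustrative choices, not the actual Spektral API). It represents edges as two index arrays of senders and receivers, a sparse representation discussed in the next section: the message from a node is simply its attribute vector, and messages are summed at each receiver.&lt;/p&gt;

```python
import jax.numpy as jnp
from jax import ops


class SumMessagePassing:
    """Hypothetical minimal layer: the message sent along each edge is the
    sender's attributes, and messages are summed at each receiving node."""

    def __call__(self, x, senders, receivers):
        messages = self.message(x, senders)
        aggregated = self.aggregate(messages, receivers, n_nodes=x.shape[0])
        return self.update(aggregated)

    def message(self, x, senders):
        # Gather the attributes of the sending nodes, one row per edge
        return x[senders]

    def aggregate(self, messages, receivers, n_nodes):
        # Sum all messages that share the same receiver
        return ops.segment_sum(messages, receivers, num_segments=n_nodes)

    def update(self, aggregated):
        # No transformation in this sketch
        return aggregated
```

Each output row is the sum of the attributes of that node's in-neighbors; a real layer would add learnable transformations in the message and update steps.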

&lt;h2 id=&quot;gather-scatter&quot;&gt;Gather-Scatter&lt;/h2&gt;

&lt;p&gt;The cool thing about message passing is that it lets us define the operations that our GNN computes, without necessarily resorting to matrix multiplication.&lt;/p&gt;

&lt;p&gt;In fact, the only thing that we specify is how the GNN acts on a generic node \(i\) as a function of its generic neighbors \(j \in \mathcal{N}(i)\).&lt;/p&gt;

&lt;p&gt;For instance, let’s say that we wanted to implement the “Edge Convolution” operator from the paper &lt;a href=&quot;https://arxiv.org/abs/1801.07829&quot;&gt;“Dynamic Graph CNN for Learning on Point Clouds”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the message-passing framework, we write its effect as:&lt;/p&gt;

\[\mathbf{x}_i&apos; = \sum\limits_{j \in \mathcal{N}(i)} \textrm{MLP}\big( \mathbf{x}_i \| \mathbf{x}_j - \mathbf{x}_i \big)\]

&lt;p&gt;If we wanted to implement this as a matrix multiplication like we have done so far, we would run into trouble, because GNNs of the form \(\mathbf{R}\mathbf{X}\mathbf{\Theta}\) assume that every node sends the same message to each of its neighbors. Here, instead, messages are a function of each edge \(j \rightarrow i\).&lt;/p&gt;

&lt;p&gt;The same limitation applies to every GNN with edge-dependent messages: they cannot be expressed as a single multiplication by a reference operator.&lt;/p&gt;

&lt;p&gt;We could still implement our Edge Convolution using broadcasting operations, but it would not be efficient at all. Here’s one way we could do it:&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;jax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jnp&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Node attributes of shape [n, f]
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Adjacency matrix of shape [n, n]
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Compute all pairwise differences between nodes
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_diff&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# shape: (n, n, f)
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Repeat the nodes so that we can concatenate them to the differences
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_repeat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jnp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;repeat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# shape: (n, n, f)
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Concatenate the attributes so that, for each edge, we have x_i || (x_j - x_i)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_all&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jnp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;concatenate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_repeat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_diff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# shape: (n, n, 2 * f)
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Give x_i || (x_j - x_i) as input to an MLP
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mlp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_all&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# shape: (n, n, channels)
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Broadcast-multiply `a` to keep only &quot;real&quot; messages
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[...,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# shape: (n, n, channels)
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Sum along the &quot;neighbors&quot; axis.
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# shape: (n, channels)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that we had to compute messages for &lt;strong&gt;all possible edges&lt;/strong&gt; and then simply multiply some of the messages by zero using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is not ideal&lt;/strong&gt;, because it cost us \(O(N^2)\) to do something that should have a cost linear in the number of edges (this is a big difference when working with real-world graphs, which are usually very sparse).&lt;/p&gt;

&lt;p&gt;A much better way to achieve our goal is to exploit the advanced indexing features offered by all libraries for tensor manipulation, using a technique called &lt;strong&gt;gather-scatter&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The gather-scatter technique requires us to think a bit differently, using node indices to access &lt;strong&gt;only the nodes that we are interested in&lt;/strong&gt;, in a sparse way.&lt;/p&gt;

&lt;p&gt;This is much easier done than said, so let’s see an example.&lt;/p&gt;

&lt;p&gt;Let us consider an adjacency matrix &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This matrix is equivalently represented in the sparse COOrdinate format:&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;row&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Nodes that are sending a message
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;col&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Nodes that are receiving a message
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;which simply tells us the indices of the non-zero entries of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt; (we usually also have an extra array that tells us the actual values of the entries, but we won’t need it for now).&lt;/p&gt;
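&lt;p&gt;As a quick sanity check, the two index arrays can be recovered from the dense matrix with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jnp.nonzero&lt;/code&gt;:&lt;/p&gt;

```python
import jax.numpy as jnp

a = jnp.array([[1, 0, 1],
               [0, 0, 1],
               [1, 1, 0]])

# Indices of the non-zero entries, returned in row-major order
row, col = jnp.nonzero(a)
```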

&lt;p&gt;If we consider all edges \(j \rightarrow i\), then the attributes of all nodes that are &lt;em&gt;sending&lt;/em&gt; a message can be retrieved with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x[row]&lt;/code&gt;. 
Similarly, the attributes of nodes that are receiving a message can be retrieved with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x[col]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is called &lt;strong&gt;gathering&lt;/strong&gt; the nodes.&lt;/p&gt;

&lt;p&gt;In our case, if we want to take the difference of the nodes at the opposite side of an edge, we can simply do &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x[row] - x[col]&lt;/code&gt;. 
Instead of computing the difference &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x[j] - x[i]&lt;/code&gt; for all possible pairs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;j, i&lt;/code&gt;, like we did before, now we only compute the differences that we are really interested in.&lt;/p&gt;

&lt;p&gt;All these operations will give us matrices that have as many rows as there are edges. So for instance, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x[row]&lt;/code&gt; will look like this:&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
 &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
 &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
 &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
 &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# shape: (n_edges, f)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The other half of this story tells us how to aggregate the messages after we have gathered them. We call this &lt;strong&gt;scattering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For all nodes \(i\), we want to aggregate all messages that are being sent via edges that have index \(i\) on the &lt;strong&gt;receiving&lt;/strong&gt; end, i.e., all edges of the form \(j \rightarrow i\).
For instance, in the small example above we know that node 2 will receive a message from nodes 0 and 1.&lt;/p&gt;

&lt;p&gt;We can do this using some special operations available more or less in all libraries for tensor manipulation:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;In TensorFlow, we have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tf.math.segment_[sum|prod|mean|max|min]&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;For PyTorch, we have the &lt;a href=&quot;https://github.com/rusty1s/pytorch_scatter&quot;&gt;Torch Scatter&lt;/a&gt; library by Matthias Fey.&lt;/li&gt;
  &lt;li&gt;In JAX, we have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jax.ops.segment_sum&lt;/code&gt; (recent versions also provide &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;segment_max&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;segment_min&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;segment_prod&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These operations apply a reduction to “segments” of a tensor, where the segments are defined by integer indices. Something like this:&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Example: segment sum
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;# A tensor that we want to reduce
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;segments&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Segment indices (we have 4 segments)
&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;segments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;# One result for each segment
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;enumerate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;segments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;             &lt;span class=&quot;c1&quot;&gt;# It could also be a product, max, etc...
&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; 
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So for instance, if we want to sum all messages based on their intended recipient, we can do:&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;aggregated&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;segment_sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
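&lt;p&gt;As a check, the toy example from above gives the expected result in JAX:&lt;/p&gt;

```python
import jax.numpy as jnp
from jax import ops

data = jnp.array([5, 1, 7, 2, 3, 4, 1, 3])
segments = jnp.array([0, 0, 0, 1, 2, 2, 3, 3])

# Reduce each segment with a sum: one result per segment
output = ops.segment_sum(data, segments)
```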

&lt;p&gt;Now we can put all of this together to create our Edge Convolution layer with a gather-scatter implementation:&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;scipy.sparse&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Node attributes of shape [n, f]
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Adjacency matrix of shape [n, n]
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Get indices of the non-zero entries of the adjacency matrix
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;senders&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scipy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sparse&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;find&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Calculate difference of nodes for each edge j -&amp;gt; i
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_diff&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;senders&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# shape: (n_edges, f)
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Concatenate x_i with (x_j - x_i) for each edge j -&amp;gt; i
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_all&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jnp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;concatenate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_diff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# shape: (n_edges, 2 * f)
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Give x_i || (x_j - x_i) as input to an MLP
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mlp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_all&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# shape: (n_edges, channels)
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Aggregate all messages according to their intended receiver
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;segment_sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# shape: (n, channels)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Wrap this up in a layer and we’re done!&lt;/p&gt;
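&lt;p&gt;As a sketch of what the wrapped-up version could look like (a minimal function, assuming a generic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mlp&lt;/code&gt; callable rather than the actual Spektral or PyG layer):&lt;/p&gt;

```python
import jax.numpy as jnp
from jax import ops


def edge_conv(x, senders, receivers, mlp):
    """Minimal EdgeConv sketch. x: node attributes of shape (n, f);
    senders, receivers: COO indices of the adjacency matrix; mlp: maps
    the concatenated edge features of shape (n_edges, 2 * f) to
    (n_edges, channels)."""
    # Difference x_j - x_i for each edge j to i
    x_diff = x[senders] - x[receivers]
    # Concatenate x_i with (x_j - x_i) for each edge
    x_all = jnp.concatenate([x[receivers], x_diff], axis=-1)
    # Compute one message per edge, then sum them per receiver
    messages = mlp(x_all)
    return ops.segment_sum(messages, receivers, num_segments=x.shape[0])
```

With an identity in place of the MLP, each output row is just the sum of the concatenated edge features of that node's incoming edges, which makes the function easy to test.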

&lt;p&gt;Here’s what it looks like &lt;a href=&quot;https://github.com/danielegrattarola/spektral/blob/master/spektral/layers/convolutional/edge_conv.py&quot;&gt;in Spektral&lt;/a&gt; and &lt;a href=&quot;https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/nn/conv/edge_conv.html#EdgeConv&quot;&gt;in Pytorch Geometric&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;methods&quot;&gt;Methods&lt;/h2&gt;

&lt;p&gt;We have now moved past the simple GNNs based on a multiplication by the reference operator and with edge-independent messages that we saw in the first part of this series. Let’s look at some more advanced methods!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-03-03/presentation-17.svg&quot; width=&quot;100%&quot; style=&quot;border: solid 1px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For instance, the popular &lt;a href=&quot;https://arxiv.org/abs/1710.10903&quot;&gt;Graph Attention Networks&lt;/a&gt; by Veličković et al. can be implemented as a message-passing network using gather-scatter:&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Transform node attributes with a dense layer
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;h&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dense&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Concatenate attributes of receivers/senders
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;h_cat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jnp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;concatenate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;senders&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Compute attention logits with a dense layer (output dim = 1, LeakyReLU)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;logits&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dense&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;h_cat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Apply softmax only to the logits in the same segment, as defined by receivers
# i.e., normalize the scores only among the neighbors of each node.
# Note that segment_softmax does **not** reduce the tensor: `coef` has the same 
# shape as `logits`.
# This function is available in Spektral and PyG.
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;coef&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;segment_softmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;logits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Now we aggregate with a weighted sum (weights given by coef)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;segment_sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;coef&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;senders&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
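&lt;p&gt;The &lt;code&gt;segment_softmax&lt;/code&gt; above comes from a library (Spektral and PyG both ship one); as a sketch of what it computes, here is a minimal plain-numpy version of my own, not the library implementation:&lt;/p&gt;

```python
import numpy as np

def segment_softmax(logits, segment_ids, num_segments):
    """Softmax over the entries that share a segment id (here: a receiving node).

    Does not reduce the tensor: the output has the same shape as `logits`.
    """
    out = np.zeros_like(logits, dtype=float)
    for s in range(num_segments):
        mask = segment_ids == s
        if mask.any():
            e = np.exp(logits[mask] - logits[mask].max())  # numerically stable
            out[mask] = e / e.sum()
    return out

# Two edges pointing into node 0, one edge into node 1
logits = np.array([1.0, 1.0, 3.0])
receivers = np.array([0, 0, 1])
coef = segment_softmax(logits, receivers, num_segments=2)
# coef[0] == coef[1] == 0.5, and coef[2] == 1.0
```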

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-03-03/presentation-18.svg&quot; width=&quot;100%&quot; style=&quot;border: solid 1px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Easily enough, we can also define a message-passing network that includes edge attributes in the computation of messages. One of my favorite models is the &lt;a href=&quot;https://arxiv.org/abs/1704.02901&quot;&gt;Edge-Conditioned Convolution&lt;/a&gt; by Simonovsky &amp;amp; Komodakis, of which I’ve summarized the math in the slide above.&lt;/p&gt;

&lt;p&gt;To implement it with gather-scatter we can do:&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Use a Filter-Generating Network to create a feature of size (f * f_,) for each 
# edge
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kernel&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;filter_generating_network&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Reshape the weights so that we have a matrix of shape (f, f_) for each edge
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kernel&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jnp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reshape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kernel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Multiply the node attribute of each neighbor by the associated edge-dependent
# kernel. We can use einsum to do this efficiently.
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jnp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;einsum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ab,abc-&amp;gt;ac&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;senders&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Aggregate with a sum
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;segment_sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;messages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;receivers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once you get the hang of it, building GNNs becomes so intuitive that you’ll never want to go back to matrix-multiplication-based implementations. 
Although sometimes it still makes sense to use them. But that’s a story for another day.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;With the first two parts of this blog series in your arsenal, you should be able to go a long way in the world of GNNs.&lt;/p&gt;

&lt;p&gt;The next and final part will take a more historical and mathematical journey through the world of GNNs. We’ll cover spectral graph theory and how we can define the operation of &lt;strong&gt;convolution&lt;/strong&gt; on graphs.&lt;/p&gt;

&lt;p&gt;I have left this for last because it is not &lt;em&gt;essential&lt;/em&gt; for understanding and using GNNs in practice, although I think that understanding the historical perspective that led to the creation of modern GNNs is very important.&lt;/p&gt;

&lt;p&gt;Stay tuned.&lt;/p&gt;
</description>
        <pubDate>Fri, 12 Mar 2021 00:00:00 +0000</pubDate>
        
        <link>/posts/2021-03-12/gnn-lecture-part-2.html</link>
          
        
            <category>GNN</category>
        
            <category>lecture</category>
        
          
        
            <category>posts</category>
        
          
      </item>
    
    <item>
        <title>A practical introduction to GNNs - Part 1</title>
        <description>&lt;p&gt;&lt;em&gt;This is Part 1 of an introductory lecture on graph neural networks that I gave for the “Graph Deep Learning” course at the University of Lugano.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;At this point in the course, the students had already seen a high-level overview of GNNs and some of their applications. My goal was to give them a practical understanding of GNNs.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Here I show that, starting from traditional CNNs and changing a few underlying assumptions, we can create a neural network that processes graphs.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://danielegrattarola.github.io/files/talks/2021-03-01-USI_GDL_GNNs.pdf&quot;&gt;The full slide deck is available here&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;My goal for this lecture is to show you how Graph Neural Networks (GNNs) can be obtained as a generalization of traditional convolutional neural networks (CNNs), where instead of images we have graphs as input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But what does it mean that a CNN can be made more general? Why are graphs a more general version of images?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We know that CNNs are designed to process data that describe the world through a collection of discrete data points: time steps in a time series, pixels in an image, pixels in a video, etc.&lt;/p&gt;

&lt;p&gt;However, one aspect of images and time series that we rarely (if ever) consider explicitly is the fact that the collection of data points alone is not enough. The order in which pixels are arranged to form an image is possibly more important than the pixels themselves. &lt;br /&gt;
An image can be in color or in grayscale but, as long as the arrangement of pixels is the same, we’ll likely be able to recognize the image for what it is.&lt;/p&gt;

&lt;p&gt;We could go as far as saying that an image is only an image because its pixels are arranged in a particular structure: pixels that represent points close in space or time should also be next to each other in the collection. Change this structure, and the image loses meaning.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-03-03/presentation-4.svg&quot; width=&quot;100%&quot; style=&quot;border: solid 1px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;CNNs are designed to take this &lt;strong&gt;locality&lt;/strong&gt; into account. They are designed to transform the value of each pixel, not as a function of the whole image (like an MLP would do), but as a function of the pixel’s immediate surroundings. Its neighbors.&lt;/p&gt;

&lt;p&gt;Since &lt;strong&gt;locality is a kind of relation&lt;/strong&gt; between pixels, it is natural to represent the underlying structure of an image using a graph.
And, by requiring that each pixel is related only to the few other pixels that are closest to it, our graph will be a &lt;strong&gt;regular grid&lt;/strong&gt;. Every pixel has 8 neighbors (give or take boundary conditions), and the CNN uses this fact to compute a localized transformation.&lt;/p&gt;

&lt;p&gt;You can also interpret it the other way around. The kind of processing that the CNN does means that the transformation of each pixel will only depend on the few pixels that fall under the convolutional kernel. We can say that the grid structure emerges as a consequence of the CNN’s inductive bias.&lt;/p&gt;

&lt;p&gt;In any case, the important thing to note is that the grid structure does not depend on the specific pixel values. &lt;strong&gt;We separate the values of the data points from the underlying structure that supports them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-03-03/presentation-5.svg&quot; width=&quot;100%&quot; style=&quot;border: solid 1px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;With this perspective in mind, the question of “how to make CNNs work on graphs” becomes:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Can we create a neural network in which the structure of the data is no longer a regular grid, but an arbitrary graph that we give as input?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In other words, since we know that data and structure are different things, can we change the structure as we please?&lt;/p&gt;

&lt;p&gt;The only thing that we require is that the CNN does the same kind of local processing as it did for the regular grid: transform each node as a function of its neighbors.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-03-03/presentation-6.svg&quot; width=&quot;100%&quot; style=&quot;border: solid 1px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If we look at what this request entails, we immediately see some problems:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;In the “regular grid” case, the learnable kernel of the CNN is compact and has a fixed size: one set of weights for each possible neighbor of a pixel, plus one set for the pixel itself. In other words, the kernel is supported by a smaller grid. 
We can’t do that easily for an arbitrary graph. Since nodes can have a variable number of neighbors, we also need a kernel that varies in size. Possible, but not straightforward.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In the regular grids processed by CNNs, we have an implicit notion of directionality. We always know where up, down, left and right are. When we move to an arbitrary graph, we might not be able to define a direction. Direction is, in essence, a kind of attribute that we assign to the edges, but in our case we also allow graphs that have no edge attributes at all. Ask yourself: do you have an up-and-to-the-left follower on Twitter?&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To go from CNN to GNN we need to solve these problems.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-03-03/presentation-7.svg&quot; width=&quot;100%&quot; style=&quot;border: solid 1px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[I recall notation here because the students had already seen most of these things anyway, but the concept of “reference operator” gave me a nice segue into the next slide.]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;All this talking about edge attributes also made me remember that now is a good time to do a notation check. Briefly:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;We define a graph as a collection of nodes and edges.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Nodes can have vector attributes, which we represent in a neatly packed matrix \(\mathbf{X} \in \mathbb{R}^{N \times F}\) (sometimes called a &lt;em&gt;graph signal&lt;/em&gt;).
Same thing for edges, with attributes \(\mathbf{e}_{ij} \in \mathbb{R}^S\) for edge i-j.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then there are the characteristic matrices of a graph:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The adjacency matrix \(\mathbf{A}\) is binary and has a 1 in position i-j if there exists an edge from node i to node j. All entries are 0 otherwise.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The degree matrix \(\mathbf{D}\) counts the number of neighbors of each node. It’s a diagonal matrix so that the degree of node i is in position i-i.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The Laplacian, which we will use a lot later, is defined as \(\mathbf{L} = \mathbf{D} - \mathbf{A}\).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Finally, the normalized adjacency matrix is \(\mathbf{A}_n = \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}\).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that \(\mathbf{A}\), \(\mathbf{L}\), and \(\mathbf{A}_n\) &lt;strong&gt;share the same sparsity pattern, if you don’t count the diagonal&lt;/strong&gt;. They have non-zero entries in position i-j only if edge i-j exists.&lt;/p&gt;
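&lt;p&gt;As a quick sanity check of this property, here is a small numpy sketch (my own, not from the lecture) that builds all three matrices for a tiny graph:&lt;/p&gt;

```python
import numpy as np

# A path graph on three nodes: 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
D = np.diag(deg)                      # degree matrix
L = D - A                             # Laplacian
D_inv_sqrt = np.diag(1 / np.sqrt(deg))
A_n = D_inv_sqrt @ A @ D_inv_sqrt     # normalized adjacency

# Off the diagonal, all three share the same sparsity pattern
off_diag = ~np.eye(3, dtype=bool)
assert ((A != 0) == (L != 0))[off_diag].all()
assert ((A != 0) == (A_n != 0))[off_diag].all()
```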

&lt;p&gt;Since we’re more interested in this specific property than in the actual values that are stored in the non-zero entries, let’s give it a name: we call any matrix that has the same sparsity pattern of \(\mathbf{A}\) a &lt;strong&gt;reference operator&lt;/strong&gt; (sometimes a &lt;em&gt;structure&lt;/em&gt; operator, sometimes a &lt;em&gt;graph shift&lt;/em&gt; operator, it’s not important).&lt;/p&gt;

&lt;p&gt;Also note: so far we are considering graphs with undirected edges. This means that all reference operators will be symmetric (if edge i-j exists, then edge j-i exists).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-03-03/presentation-8.svg&quot; width=&quot;100%&quot; style=&quot;border: solid 1px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Reference operators are nice.&lt;/p&gt;

&lt;p&gt;First of all, they are operators. You multiply them by a graph signal and you get a new graph signal in return. Let’s look at the “shape” of the multiplication: N-by-N times N-by-F equals N-by-F. Checks out.&lt;/p&gt;

&lt;p&gt;But not only that. By their own definition, multiplying a reference operator by a graph signal will compute a weighted sum of each node’s neighborhood. Let’s expand the matrix multiplication from the slide above to see what happens to node 1 when we apply a reference operator.&lt;/p&gt;

&lt;p&gt;All values \(\mathbf{r}_{ij}\) that are not associated with an edge are 0, so we have:&lt;/p&gt;

\[(\mathbf{R}\mathbf{X})_1 = \mathbf{r}_{12}\cdot\mathbf{x}_2 + \mathbf{r}_{13}\cdot\mathbf{x}_3 + \mathbf{r}_{14}\cdot\mathbf{x}_4\]

&lt;p&gt;Look at that: with a simple matrix multiplication we can now do the same kind of local processing that the CNN does.&lt;/p&gt;
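&lt;p&gt;Here is the same computation as a small numpy sketch (my own illustration, with the node and its neighbors mapped to 0-based indices):&lt;/p&gt;

```python
import numpy as np

# Star graph: node 0 is connected to nodes 1, 2, 3
R = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)
X = np.arange(8, dtype=float).reshape(4, 2)  # graph signal with F = 2

out = R @ X  # (4, 4) @ (4, 2) -> (4, 2): still a graph signal

# Row 0 is exactly the sum of the neighbors' attributes
assert np.allclose(out[0], X[1] + X[2] + X[3])

# Permutation equivariance: relabeling the nodes just permutes the output rows
P = np.eye(4)[[2, 0, 3, 1]]
assert np.allclose((P @ R @ P.T) @ (P @ X), P @ out)
```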

&lt;p&gt;Since applying a reference operator results in a simple sum-product, the result will not depend on the particular order in which we consider the nodes. As long as row \(i\) of the reference operator describes the connections of the node with attributes \(\mathbf{x}_i\), the result will be the same. We say that this kind of operation is &lt;strong&gt;equivariant to permutations of the nodes&lt;/strong&gt;. &lt;br /&gt;
This is good, because the particular order with which we consider the nodes is not important. Remember: we’re only interested in the structure – which nodes are connected to which.&lt;/p&gt;

&lt;p&gt;Now that we are able to aggregate information from a node’s neighborhood, we only need to solve the issue of how to create the learnable kernel and we will have a good first approximation of a CNN for graphs. Remember the two issues that we have:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Neighborhoods vary in size;&lt;/li&gt;
  &lt;li&gt;We don’t know how to orient the kernel (i.e., we may not have attributes that allow us to distinguish a node’s neighbors);&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These problems are also related to our request that the GNN must be equivariant to permutations. We cannot simply assign a different weight to each neighbor because we would need to train the GNN on all possible permutations of the nodes in order to make it equivariant.&lt;/p&gt;

&lt;p&gt;However, there is a simple solution: &lt;strong&gt;use the same set of weights for each node in the neighborhood.&lt;/strong&gt;&lt;br /&gt;
Let our weights be a matrix \(\mathbf{\Theta} \in \mathbb{R}^{F \times F&apos;}\), so that the output will have \(F&apos;\) “feature maps”.&lt;/p&gt;

&lt;p&gt;Now, we simply use \(\mathbf{\Theta}\) to transform the node attributes, then sum them over using a reference operator.&lt;/p&gt;

&lt;p&gt;Let’s check the shapes to make sure that it works out: N-by-N times N-by-F times F-by-F’ equals N-by-F’.&lt;br /&gt;
We went from graph signal to graph signal, with new node attributes that we obtain as a local, learnable, and differentiable transformation.&lt;/p&gt;

&lt;p&gt;Done! We have our first GNN: \(\mathbf{X}&apos; = \mathbf{R} \mathbf{X} \mathbf{\Theta}\).&lt;/p&gt;
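&lt;p&gt;The shape check above can be verified in a couple of lines of numpy (an illustrative sketch, with a random operator and random weights):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, F_out = 5, 3, 4

# A random symmetric binary reference operator (here, an adjacency matrix)
R = (rng.random((N, N)) < 0.4).astype(float)
R = np.triu(R, 1)
R = R + R.T

X = rng.normal(size=(N, F))          # graph signal
Theta = rng.normal(size=(F, F_out))  # one set of weights, shared by all neighbors

X_prime = R @ X @ Theta              # (N, N) @ (N, F) @ (F, F_out)
assert X_prime.shape == (N, F_out)   # graph signal in, graph signal out
```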

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-03-03/presentation-9.svg&quot; width=&quot;100%&quot; style=&quot;border: solid 1px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;One thing that is still missing from our relatively simple implementation is the ability to have kernels that span more than the immediate neighborhood of a node. In fact, in a CNN this is usually a hyperparameter. Also, depending on the reference operator that we use, we may or may not consider a node itself when computing its transformation: it depends on whether \(\mathbf{R}\) has a non-zero diagonal.&lt;/p&gt;

&lt;p&gt;Luckily we can generalize the idea of a bigger kernel to the graph domain: we simply process each node as a function of its neighbors up to \(K\) steps away from it.&lt;/p&gt;

&lt;p&gt;We can achieve this by considering that applying a reference operator to a graph signal has the effect of making node attributes &lt;em&gt;flow&lt;/em&gt; through the graph. 
Apply a reference operator once, and all nodes will “read” from their immediate neighbors to update themselves. Apply it again, and all nodes will read again from their neighbors, except that this time the information that they read will be whatever the neighbors computed at the previous step.&lt;/p&gt;

&lt;p&gt;In other words: if we multiply a graph signal by \(\mathbf{R}^{K}\), each node will update itself with the node attributes of nodes \(K\) steps away.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-03-03/presentation-10.svg&quot; width=&quot;100%&quot; style=&quot;border: solid 1px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In a CNN, this would be equivalent to having a kernel shaped like an empty square.
To make the kernel full, we simply sum all “empty square” kernels up to the desired size. In our case, instead of considering \(\mathbf{R}^{K}\), we consider a polynomial of \(\mathbf{R}\) up to order \(K\).&lt;/p&gt;

&lt;p&gt;This is called a &lt;strong&gt;polynomial graph filter&lt;/strong&gt;, and we will see a different interpretation of it in Part 3 of this series.&lt;/p&gt;

&lt;p&gt;Note that this filter solves both problems that we had before, and also makes our GNN more expressive:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The value of a node itself is always included in the transformation, since \(\mathbf{R}^{0} = \mathbf{I}\);&lt;/li&gt;
  &lt;li&gt;The sum of polynomials up to order \(K\) will necessarily cover all neighbors in a radius of \(K\) steps;&lt;/li&gt;
  &lt;li&gt;Since we can treat neighborhoods separately, we can also have different weights \(\mathbf{\Theta}^{(k)}\) for each \(k\)-hop neighborhood. This is like having a radial filter, a function that only depends on the radius from the origin.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-03-03/presentation-11.svg&quot; width=&quot;100%&quot; style=&quot;border: solid 1px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This idea of using a polynomial filter to create a GNN was first introduced in a paper by &lt;a href=&quot;https://arxiv.org/abs/1606.09375&quot;&gt;Defferrard et al.&lt;/a&gt;, which can be seen as the first scalable and practical implementation of a GNN ever proposed.&lt;/p&gt;

&lt;p&gt;In that paper they used a particular choice of polynomial, namely one for which different powers are defined in a recursive manner, called a &lt;strong&gt;Chebyshev polynomial&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In particular, as reference operator they use a version of the graph Laplacian that is first normalized and then rescaled so that its eigenvalues are between -1 and 1.
Then, using the recursive formulation of Chebyshev polynomials, they build a polynomial graph filter.&lt;/p&gt;

&lt;p&gt;The reason why they use these polynomials and not the simple ones we saw above is not important, for now. Let us just say: they have some desirable properties and they are fast to compute.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2021-03-03/presentation-12.svg&quot; width=&quot;100%&quot; style=&quot;border: solid 1px;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Just a few months after the paper by Defferrard et al. was published on ArXiv, a new paper by &lt;a href=&quot;https://arxiv.org/abs/1609.02907&quot;&gt;Kipf &amp;amp; Welling&lt;/a&gt; also appeared online.&lt;/p&gt;

&lt;p&gt;In that paper, the authors looked at the Chebyshev filter proposed by Defferrard et al. and introduced a few key changes to make the layer simpler and more scalable.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;They changed the reference operator. Instead of the rescaled and normalized Laplacian, they assumed that \(\lambda_{max} = 2\) so that the whole formulation of the operator was simplified to \(-\mathbf{A}_n\).&lt;/li&gt;
  &lt;li&gt;They proposed to use polynomials of order 1, following the intuition that \(K\) layers of order 1 would be equivalent to 1 layer of order \(K\). In particular, they also added non-linearities between successive layers, leading to more complex transformations of the nodes at each propagation step.&lt;/li&gt;
  &lt;li&gt;They observed that the same set of weights could be used both for a node itself and its neighbors. No need to have \(\mathbf{\Theta}^{(0)}\) and \(\mathbf{\Theta}^{(1)}\) as different weights.&lt;/li&gt;
  &lt;li&gt;After simplifying the layer down to 
\(\mathbf{X}&apos; = ( \mathbf{I} + \mathbf{A}_n) \mathbf{X} \mathbf{\Theta},\)
they observed that a more stable behavior could be obtained by instead using \(\mathbf{R} = \mathbf{D}^{-1/2} (\mathbf{I} + \mathbf{A}) \mathbf{D}^{-1/2}\) as reference operator.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Putting this all together, we get to what is commonly known as the Graph Convolutional Network (GCN):&lt;/p&gt;

\[\mathbf{X}&apos; = \mathbf{D}^{-1/2} (\mathbf{I} + \mathbf{A}) \mathbf{D}^{-1/2} \mathbf{X} \mathbf{\Theta}\]
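&lt;p&gt;A minimal numpy sketch of this layer (my own, not the authors’ code; following the paper, the degrees are computed from \(\mathbf{I} + \mathbf{A}\), and a non-linearity would normally follow):&lt;/p&gt;

```python
import numpy as np

def gcn_layer(A, X, Theta):
    # R = D^{-1/2} (I + A) D^{-1/2}, with D the degree matrix of I + A
    A_hat = np.eye(A.shape[0]) + A
    d_inv_sqrt = 1 / np.sqrt(A_hat.sum(axis=1))
    R = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return R @ X @ Theta

# Two connected nodes: each output row averages a node with its neighbor
A = np.array([[0.0, 1.0], [1.0, 0.0]])
X = np.array([[2.0], [4.0]])
out = gcn_layer(A, X, np.eye(1))
# out == [[3.0], [3.0]]
```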

&lt;hr /&gt;

&lt;p&gt;What we have seen so far is a very simple construction that takes the general concepts behind CNNs and, by changing a few assumptions, extends them to the case in which the input is an arbitrary graph instead of a grid.&lt;/p&gt;

&lt;p&gt;This is far from the whole story, but it should give you a good starting point to learn about GNNs.&lt;/p&gt;

&lt;p&gt;In the &lt;a href=&quot;https://danielegrattarola.github.io/posts/2021-03-12/gnn-lecture-part-2.html&quot;&gt;next part of this series&lt;/a&gt; we will see:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;How to describe what we just saw as a general algorithm that allows us to describe a much richer family of operations on graphs.&lt;/li&gt;
  &lt;li&gt;How to throw edge attributes in the mix and create GNNs that can treat neighbors differently.&lt;/li&gt;
  &lt;li&gt;How to make the entries of a reference operator a learnable function.&lt;/li&gt;
  &lt;li&gt;A general recipe for a GNN that &lt;em&gt;should&lt;/em&gt; work well for many problems.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Stay tuned.&lt;/p&gt;
</description>
        <pubDate>Wed, 03 Mar 2021 00:00:00 +0000</pubDate>
        
        <link>/posts/2021-03-03/gnn-lecture-part-1.html</link>
          
        
            <category>GNN</category>
        
            <category>lecture</category>
        
          
        
            <category>posts</category>
        
          
      </item>
    
    <item>
        <title>Telestrations Neural Networks</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2020-01-21/telestrations.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Yesterday, it was board game day at &lt;a href=&quot;http://www.neurontobrainlaboratory.ca/&quot;&gt;the lab&lt;/a&gt; where I have been working recently. 
Everyone got together for lunch at Snakes &amp;amp; Lattes, a Torontonian board game café chain, and we spent a couple of hours laughing and chatting and, obviously, playing board games.&lt;/p&gt;

&lt;p&gt;The lab has a go-to traditional game for the occasion: &lt;a href=&quot;https://en.wikipedia.org/wiki/Telestrations&quot;&gt;Telestrations&lt;/a&gt;.
The game is inspired by the classic childhood game of &lt;a href=&quot;https://en.wikipedia.org/wiki/Chinese_whispers&quot;&gt;Chinese whispers&lt;/a&gt; (or &lt;em&gt;Telephone&lt;/em&gt;, or &lt;em&gt;Wireless phone&lt;/em&gt;, or &lt;em&gt;Gossip&lt;/em&gt;; there’s a bunch of different names in different countries) and its rules are pretty simple.&lt;/p&gt;

&lt;p&gt;Everyone gets a booklet, an erasable sharpie, and a list of random terms like “flamingo” or “pipe dream” or “treehouse”. Everyone picks a word and writes it on the first page of the booklet: that’s the secret source word.&lt;/p&gt;

&lt;p&gt;At each turn, players pass their booklet to the person on their right, and the rules are as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;When you see a word, you turn the page and you have sixty seconds to &lt;em&gt;draw&lt;/em&gt; whatever the word is;&lt;/li&gt;
  &lt;li&gt;When you see a drawing, you turn the page and you write your best guess for what is pictured.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Players keep alternating between guessing, drawing, and passing down the booklets until every booklet has done a full round of the table and is back in the hands of the original owner. 
For extra fun, everybody gets to draw their secret source word at the very beginning.&lt;/p&gt;

&lt;p&gt;In other words, it’s a written game of Chinese whispers where every other word is drawn instead of written.&lt;/p&gt;

&lt;p&gt;There are some rules to decide who wins at the end, but the obvious source of entertainment is the complete chaos that ensues as information gets corrupted drawing after drawing. At the end of a round, not one of the original secret words ever survives.&lt;/p&gt;

&lt;p&gt;So now the obvious, rational, almost trivial question is: what happens when you use a GAN to draw, and an image classifier to guess? &lt;br /&gt;
Well, here I am to show you!&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;how-to-in-three-paragraphs&quot;&gt;How-to in three paragraphs&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://openreview.net/forum?id=B1xsqj09Fm&quot;&gt;BigGAN&lt;/a&gt; can generate images conditioned on an ImageNet label. So if you give it label 1, it will generate goldfish, if you give it label 42, it will generate an agama, and so on.&lt;/p&gt;

&lt;p&gt;ResNet does the opposite: if you show it a goldfish, it will try to guess what it is. To make things more interesting, I added a bit of noise to the guessing procedure, so that sometimes we get a random one out of the top-5 guesses. If you think that this is unreasonable, try and play a game with real humans, I dare you.&lt;/p&gt;

&lt;p&gt;The idea now is to play the game using BigGAN to draw, and ResNet to guess: you start with a label, you have BigGAN generate an image of that label, you classify that image to get a new label, and so on.&lt;/p&gt;
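&lt;p&gt;The game loop can be outlined like this (a hypothetical sketch: &lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;classify&lt;/code&gt; are stand-ins for BigGAN and the noisy top-5 ResNet, which appear in the real code further down):&lt;/p&gt;

```python
import random

def play_round(source_label, n_turns, generate, classify, top_k=5):
    # generate(label) -> image; classify(image) -> ranked list of labels.
    # These are hypothetical stand-ins for BigGAN and ResNet.
    sequence = [source_label]
    label = source_label
    for _ in range(n_turns):
        image = generate(label)                 # "draw" the current word
        guesses = classify(image)               # ranked guesses
        label = random.choice(guesses[:top_k])  # a random top-5 guess
        sequence.append(label)
    return sequence
```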

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;p&gt;I’ll start with my favourite sequence: honeycomb to cheeseburger. 
The images below are read top-to-bottom, left-to-right. At the very top you see the source class, then the first generated image, then what that image was classified as, then the next generated image, and so on.&lt;/p&gt;

&lt;p&gt;The first image is generated from class 599 of ImageNet, “honeycomb”. It looked a lot like a bagel, I guess because of that bright spot in the middle (?), so the ResNet classified it as such. From that classification, we get a couple of bagel-y looking pieces of bread, which soon become French loaves, then dough.&lt;br /&gt;
Then, that perfect-looking dough in image 6 gets classified as a wooden spoon (probably because of the extra noise that I mentioned). 
Finally, the green spot on the wooden spoon confuses ResNet into thinking it’s a cheeseburger, and we get juicy burgers until the end. That burger generation is impressive, not gonna lie.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2020-01-21/bagel.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Moving on: trilobite to long-horned beetle. 
The first two trilobites look really good, but then get classified as isopods after two turns (curiously, isopods and trilobites look very similar but are not that closely related, according &lt;a href=&quot;https://www.reddit.com/r/geology/comments/lt9so/how_closely_related_are_isopods_to_trilobites/&quot;&gt;to Reddit&lt;/a&gt;).
From the isopod label, we get what is clearly a marine creature (look at the background), which unfortunately gets classified as a cockroach. From there, we stay on dry land and just get more and more specialized bugs until the end.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2020-01-21/trilobite.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The next one is &lt;strong&gt;REALLY&lt;/strong&gt; good because it’s remarkably similar to a real game of Telestrations. It could happen. Hell, it probably happened.&lt;/p&gt;

&lt;p&gt;We start with a coffeepot. At image three, the coffeepot is a bit ambiguous and becomes a teapot. Understandable, I would probably have made that mistake myself. Then we get a proper teapot, that gets recognized as such. 
The next image, however, is half-assed by the player and it’s not clear at all what it is. The next player guesses that it’s a pitcher. The next guy tries his best but eventually, the pitcher becomes a vase.&lt;/p&gt;

&lt;p&gt;Nothing more to say, I can see this happening in real life.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2020-01-21/coffeepot.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Our next and last one is also a likely sequence.&lt;/p&gt;

&lt;p&gt;A volcano. Easy. We get two perfect volcano drawings. Except that the last one gets classified as a type of tent.&lt;/p&gt;

&lt;p&gt;The next player over-does it, and draws a full camping spot with caravans instead of a tent. Curiously, we still have a volcano-looking thing in the background, but that’s just a coincidence (no information from previous images or labels is preserved between turns).&lt;/p&gt;

&lt;p&gt;The camp is seen as a bee house. Next thing we know, there’s a weird-looking BigGAN human harvesting honey. 
But ResNet doesn’t care about the human and focuses on the crate in the middle, instead.&lt;/p&gt;

&lt;p&gt;We get a good-looking crate, that becomes a chest, and we stay with chests until the end.&lt;/p&gt;

&lt;p&gt;The yurt and the apiary are the only weird ones in this sequence, and the least likely to appear in a human game. I can see someone drawing a full camping spot instead of a single yurt, and I can see how one would mistake a poorly-drawn volcano for a tent, but no human would ignore the beekeeper in image 4.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2020-01-21/volcano.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I have generated a bunch of these sequences on my laptop, and these are just four random ones that I got. It’s really easy to get fun sequences. So here’s how I did it.&lt;/p&gt;

&lt;h2 id=&quot;code&quot;&gt;Code&lt;/h2&gt;

&lt;p&gt;First of all, I was not going to spend a single € to train anything involved in this project because, like, let’s be real…&lt;/p&gt;

&lt;p&gt;So I turned to Google and I found:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/huggingface/pytorch-pretrained-BigGAN&quot;&gt;A pre-trained BigGAN&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://pytorch.org/docs/stable/torchvision/index.html&quot;&gt;Torchvision’s pre-trained ResNet50&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I usually write my stuff in TensorFlow but whatever, let’s PyTorch this one.&lt;/p&gt;

&lt;p&gt;We start with some essential imports:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torch&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;PIL&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Image&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pytorch_pretrained_biggan&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BigGAN&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;one_hot_from_int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;truncated_noise_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;convert_to_images&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torchvision&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;models&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;torchvision&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transforms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and we define a couple of useful variables:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;iterations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# How many players there are
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;standard_noise&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.3&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Some random noise because people are not perfect
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current_class&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# The secret source word is a random ImageNet class (0-999)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# How many top guesses each player considers
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Load ImageNet class list
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;imagenet_classes.txt&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;readlines&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;imagenet_classes.txt&lt;/code&gt; &lt;a href=&quot;https://github.com/Lasagne/Recipes/blob/master/examples/resnet50/imagenet_classes.txt&quot;&gt;can be found online&lt;/a&gt;; it’s just a list of ImageNet class names, one per line.&lt;/p&gt;

&lt;p&gt;Now, let’s create the models that we will use. First we create the GAN:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;gan&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BigGAN&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;biggan-deep-256&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;gan&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;cuda&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then, we create the ResNet50 ImageNet classifier:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;models&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;resnet50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pretrained&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;eval&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Do this to set the model to inference mode
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and its image pre-processor:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transforms&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Compose&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;transforms&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CenterCrop&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;224&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;transforms&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ToTensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;transforms&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Normalize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.485&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.456&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.406&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.229&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.224&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.225&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
 &lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We will be drawing and guessing images of 256 x 256 pixels (center-cropped to 224 x 224 for ResNet50). The hard-coded normalization is just something that you have to do for Torchvision models, no biggie.&lt;/p&gt;
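&lt;p&gt;If you’re wondering what those magic numbers do: they are the ImageNet channel statistics, and the transform is just per-channel standardization. In plain NumPy terms (using a random array in place of a real image), it amounts to this:&lt;/p&gt;

```python
import numpy as np

# ImageNet channel statistics used by the Torchvision classifiers
mean = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
std = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

# A stand-in for a real image: values in [0, 1], shape (channels, height, width)
img = np.random.rand(3, 224, 224)

# transforms.Normalize does per-channel standardization:
normalized = (img - mean) / std
```

&lt;p&gt;That is all &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;transforms.Normalize&lt;/code&gt; does under the hood.&lt;/p&gt;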

&lt;p&gt;So now we have loaded the networks. Let’s define some helper functions that will compute the main steps of the game for us:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;draw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;truncation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Create the inputs for the GAN
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;class_vector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;one_hot_from_int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;class_vector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;class_vector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;class_vector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;class_vector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;cuda&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;noise_vector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;truncated_noise_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;truncation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;truncation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;noise_vector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;noise_vector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;noise_vector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;noise_vector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;cuda&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Generate image
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;no_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gan&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;noise_vector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;class_vector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;truncation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;cpu&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Get a PIL image from a Torch tensor
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;img&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;convert_to_images&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;img&lt;/span&gt;
    

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;guess&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;img&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Pre-process image
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;img&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;img&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Classify image
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;classification&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;img&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unsqueeze&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;descending&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;percentage&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;functional&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;softmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Get the global ImageNet class, labels, and the predicted probabilities
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;idxs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]][:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;labs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]][:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;percentage&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;item&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idx&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indices&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]][:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idxs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;labs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now we can start playing!&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;output_imgs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Stores the drawings
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_labels&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Stores the guesses
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_labels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current_class&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Main game loop
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iterations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Draw an image
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;img&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;draw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current_class&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;output_imgs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;img&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Guess what the image is
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;idxs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;labs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;guess&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;img&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Add noise
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uniform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;standard_noise&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Re-normalize because of noise
&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Choose from the predictions
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;choice&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;choice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arange&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;labs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;current_class&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;idxs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;choice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;output_labels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;labs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;choice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;At the end of the game, we will have the generated drawings in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;output_imgs&lt;/code&gt; and the guesses in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;output_labels&lt;/code&gt;.&lt;/p&gt;
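&lt;p&gt;The “people are not perfect” trick in the loop above is worth spelling out: instead of always taking the classifier’s top guess, we flatten the top-5 distribution with uniform noise and sample from it. As a standalone sketch (the probabilities here are made up):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
standard_noise = 0.3

# Hypothetical top-5 softmax scores from the classifier, highest first
probs = np.array([0.6, 0.2, 0.1, 0.06, 0.04])

# Flatten the distribution with uniform noise, then re-normalize
noisy = probs + rng.uniform(0, standard_noise, size=probs.shape)
noisy /= noisy.sum()

# The "player" picks one of the five guesses at random
choice = rng.choice(np.arange(len(probs)), p=noisy)
```

&lt;p&gt;With more noise, the player gets sloppier and lower-ranked guesses become more likely.&lt;/p&gt;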

&lt;p&gt;Instead of copy-pasting from the cells above, you can just look at &lt;a href=&quot;https://gist.github.com/danielegrattarola/8296b9fd29116443da74d0aa2519d7c3&quot;&gt;the full gist&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;What can I say? It’s neural networks playing Telestrations.&lt;/p&gt;

&lt;p&gt;“No new knowledge can be extracted from my telling. This confession has meant nothing.”&lt;/p&gt;

&lt;p&gt;Cheers!&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;In case you didn’t know: agamas (label 42 of ImageNet) are extra-fucking-cool lizards.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2020-01-21/agama.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
</description>
        <pubDate>Tue, 21 Jan 2020 00:00:00 +0000</pubDate>
        
        <link>/posts/2020-01-21/telestrations.html</link>
          
        
            <category>AI</category>
        
            <category>random</category>
        
            <category>code</category>
        
          
        
            <category>posts</category>
        
          
      </item>
    
    <item>
        <title>Pitfalls of Graph Neural Network Evaluation 2.0</title>
<description>&lt;p&gt;In this post, I’m going to summarize some conceptual problems that I have found when comparing different graph neural networks (GNNs) with one another.&lt;/p&gt;

&lt;p&gt;I’m going to argue that it is extremely difficult to make an objectively fair comparison between structurally different models and that the experimental comparisons found in the literature are not always sound.&lt;/p&gt;

&lt;p&gt;I will try to suggest reasonable solutions whenever possible, but the goal of this post is simply to make these issues appear on your radar and maybe spark a conversation on the matter.&lt;/p&gt;

&lt;p&gt;Some of the things that I’ll say are also addressed in the original &lt;a href=&quot;https://arxiv.org/abs/1811.05868&quot;&gt;Pitfalls of Graph Neural Network Evaluation (Shchur et al., 2018)&lt;/a&gt;, which I warmly suggest you read.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;neighbourhoods&quot;&gt;Neighbourhoods&lt;/h2&gt;

&lt;p&gt;The first source of inconsistency when comparing GNNs comes from the fact that different layers are designed to take into account neighbourhoods of different sizes.&lt;br /&gt;
Usually, a layer either looks at the 1-hop neighbours of each node, or it has a hyperparameter K that controls the size of the neighbourhood. Some examples of popular methods (implemented in both Spektral and PyTorch Geometric) in either category:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;1-hop: &lt;a href=&quot;https://arxiv.org/abs/1609.02907&quot;&gt;GCN&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1710.10903&quot;&gt;GAT&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1706.02216&quot;&gt;GraphSage&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1810.00826&quot;&gt;GIN&lt;/a&gt;;&lt;/li&gt;
  &lt;li&gt;K-hop: &lt;a href=&quot;https://arxiv.org/abs/1606.09375&quot;&gt;Cheby&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1901.01343&quot;&gt;ARMA&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1810.05997&quot;&gt;APPNP&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/1902.07153&quot;&gt;SGC&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
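&lt;p&gt;To make the difference concrete: a 1-hop layer propagates information along edges once, while a K-hop layer reaches everything within K edges in a single shot. On a toy graph, the K-hop neighbourhood can be computed by repeated propagation over the adjacency matrix (a hypothetical helper, just for illustration):&lt;/p&gt;

```python
import numpy as np

# A toy path graph: 0 - 1 - 2 - 3
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

def k_hop_neighbours(A, node, k):
    """Nodes reachable from `node` in at most k hops (excluding the node itself)."""
    reach = np.zeros(len(A), dtype=int)
    reach[node] = 1
    for _ in range(k):
        reach = np.minimum(reach + reach @ A, 1)  # propagate one hop, clip to 0/1
    reach[node] = 0
    return np.flatnonzero(reach)
```

&lt;p&gt;Stacking L layers of a 1-hop method covers the same L-hop neighbourhood as a single layer with K=L, which is exactly what makes the comparisons tricky.&lt;/p&gt;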

&lt;p&gt;A fair evaluation should take these differences into account and allow each GNN to look at the same neighbourhoods but, at the same time, it could be argued that a layer designed to operate on larger neighbourhoods is inherently more expressive. How can we tell which is better?&lt;/p&gt;

&lt;p&gt;Let’s say we are comparing GCN with Cheby. The equivalent of a 2-layer GCN could be a 2-layer Cheby with K=1, or a 1-layer Cheby with K=2. In the GCN paper, they use a 2-layer Cheby with K=3. Should they have compared with a 6-layer GCN?&lt;/p&gt;

&lt;p&gt;Moreover, this difference between methods may have an impact on the number of parameters, nonlinearity, and overall amount of regularization in a GNN. &lt;br /&gt;
For instance, a GCN that reaches a neighborhood of order 3 may have 3 dropout layers, while the equivalent Cheby with K=3 will have only one.  &lt;br /&gt;
Another example: an SGC architecture can reach any neighborhood with a constant number of parameters, while other methods can’t.&lt;/p&gt;
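&lt;p&gt;To make the last point concrete, here is a toy sketch (my own illustration, not the official SGC code) of an SGC-style propagation: the features are propagated K times with the normalized adjacency before a single linear layer, so the receptive field grows with K while the number of trainable parameters stays constant:&lt;/p&gt;

```python
import numpy as np

# Toy sketch of SGC-style propagation (illustrative, not the official SGC code):
# propagate the node features K times with the normalized adjacency, then apply
# a single linear layer. The receptive field grows with K, the weights do not.
rng = np.random.default_rng(0)
N, F, K = 5, 3, 3
A = rng.integers(0, 2, (N, N))
A = np.maximum(A, A.T)            # make the graph undirected
np.fill_diagonal(A, 1)            # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
A_hat = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # symmetric normalization

X = rng.standard_normal((N, F))
Z = X
for _ in range(K):
    Z = A_hat @ Z                 # each product widens the receptive field by one hop
W = rng.standard_normal((F, 2))   # the only trainable weights, independent of K
out = Z @ W
```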

&lt;p&gt;We’re only looking at one simple issue, and it is already difficult to say how to fairly evaluate different methods. It gets worse.&lt;/p&gt;

&lt;h2 id=&quot;regularization-and-training&quot;&gt;Regularization and training&lt;/h2&gt;

&lt;p&gt;Regularization is particularly important in GNNs, because the community uses very small benchmark datasets and most GNNs tend to overfit like crazy (more on this later).
For these reasons, the performance of a GNN can vary wildly depending on how the model is regularized. The same holds for hyperparameters in general, because things like the learning rate and batch size can act as a form of implicit regularization.&lt;/p&gt;

&lt;p&gt;The literature is largely inconsistent with how regularization is applied across different papers, making it difficult to say whether the performance improvements reported for a model are due to the actual contribution or to a different regularization scheme.&lt;/p&gt;

&lt;p&gt;The following are often found in the literature:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;High learning rates;&lt;/li&gt;
  &lt;li&gt;High L2 penalty;&lt;/li&gt;
  &lt;li&gt;Extremely high dropout rates on node features and adjacency matrix;&lt;/li&gt;
  &lt;li&gt;Low number of training epochs;&lt;/li&gt;
  &lt;li&gt;Low patience for early stopping.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m going to focus on a few of these.&lt;/p&gt;

&lt;p&gt;First, I argue that setting a fixed number of training epochs is a form of alchemy that should be avoided if possible, because it’s incredibly task-specific. Letting a model train to convergence is almost always a better approach, because it’s less dependent on the initialization of the weights. If the validation performance is not indicative of the test performance and we need to stop the training without a good criterion, then something is probably wrong.&lt;/p&gt;
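&lt;p&gt;As a minimal, framework-agnostic sketch of the alternative (all names here are illustrative): keep training until the validation loss has not improved for a fixed number of epochs, rather than hard-coding the epoch count:&lt;/p&gt;

```python
# Minimal sketch of patience-based early stopping (illustrative names, not any
# specific framework's API): train until the validation loss has not improved
# for `patience` consecutive epochs, instead of fixing the number of epochs.
def train_to_convergence(train_one_epoch, evaluate, patience=50, max_epochs=10000):
    best_val, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()          # one pass over the training data
        val_loss = evaluate()      # loss on the validation set
        if min(val_loss, best_val) != best_val:
            best_val, best_epoch = val_loss, epoch
        if epoch - best_epoch == patience:
            break                  # no improvement for `patience` epochs
    return best_val, best_epoch
```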

&lt;p&gt;A second important aspect that I feel gets overlooked often is dropout. &lt;br /&gt;
In particular, applying dropout to the adjacency matrix can lead to big performance improvements, because the GNN is exposed to very noisy instances of the graph at each training step and is forced to generalize well. &lt;br /&gt;
When comparing different models, if one is using dropout on the adjacency matrix then all the others should do the same. However, the common practice of comparing methods using the “same architecture from the original paper” means that some methods will be tested with dropout on A, and some without, as if dropout were a characteristic of only some methods.&lt;/p&gt;
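&lt;p&gt;For illustration, dropout on the adjacency matrix can be sketched as follows (a hypothetical helper, not any specific library’s API):&lt;/p&gt;

```python
import numpy as np

# Illustrative sketch (hypothetical helper, not any library's API): dropout on
# the adjacency matrix zeroes out a random subset of edges at every training
# step, so the GNN sees a different noisy version of the graph each time.
def dropout_adjacency(A, rate, rng):
    mask = np.greater_equal(rng.random(A.shape), rate)  # keep each edge w.p. 1 - rate
    return A * mask / (1.0 - rate)                      # inverted-dropout rescaling

rng = np.random.default_rng(0)
A = np.ones((4, 4))
A_drop = dropout_adjacency(A, rate=0.5, rng=rng)  # surviving edges are rescaled to 2
```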

&lt;p&gt;Finally, the remaining key factors in training are the learning rate and weight decay. 
These are often given as-is in the literature, but it is a good idea to tune them whenever possible. For what it’s worth, I can personally confirm that searching for a good learning rate, in particular, can lead to unexpected results, even for well-established methods (if the model is trained to convergence).&lt;/p&gt;

&lt;h2 id=&quot;parallel-heads&quot;&gt;Parallel heads&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Heads&lt;/em&gt; are parallel computational units that perform the same calculation with different weights and then merge the results to produce the output. To give a sense of the problems that one may encounter when comparing methods that use heads, I will focus on two methods: GAT and ARMA.&lt;/p&gt;

&lt;p&gt;Having parallel attention heads is fairly common in the NLP literature, where the very concept of attention comes from, and therefore it was natural to do the same in GAT.&lt;/p&gt;

&lt;p&gt;In ARMA, using parallel &lt;em&gt;stacks&lt;/em&gt; is theoretically motivated by the fact that ARMA filters of order H can be computed by summing H ARMA filters of order 1. While similar in practice to the heads in GAT, in this case having parallel heads is key to the implementation of this particular graph filter.&lt;/p&gt;

&lt;p&gt;Because of these fundamental semantic differences, it is impossible to say whether a comparison between GAT with H heads and an ARMA layer of order H is fair.&lt;/p&gt;

&lt;p&gt;Extending this reasoning to the other models, it is not guaranteed that having parallel heads will necessarily lead to practical improvements for a given model. Some methods can, in fact, benefit from a simpler architecture. 
It is therefore difficult to say whether a comparison between monolithic and parallel architectures is fair.&lt;/p&gt;

&lt;h2 id=&quot;datasets&quot;&gt;Datasets&lt;/h2&gt;

&lt;p&gt;Finally, I’m going to spend a few words on datasets, because there is no chance of having a fair evaluation if the datasets on which we test our models are not good. And in truth, the benchmark datasets that we use for evaluating GNNs are not that good.&lt;/p&gt;

&lt;p&gt;Cora, CiteSeer, PubMed, and the Dortmund benchmark datasets for graph kernels: these are, collectively, the Iris dataset of GNNs, and should be treated carefully. While a model should work on these in order to be considered usable, they cannot be the only criterion in a fair evaluation.&lt;/p&gt;

&lt;p&gt;Recently, the community has moved towards a more sensible use of the datasets (ok, maybe I was exaggerating a bit about Iris), thanks to papers like &lt;a href=&quot;https://arxiv.org/abs/1811.05868&quot;&gt;this&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/1910.12091&quot;&gt;this&lt;/a&gt;. However, many experiments in the literature still need to be repeated hundreds of times in order to give significant results, and that is bad for three reasons: time, money, and the environment, in no particular order.&lt;br /&gt;
Especially when running a grid search over hyperparameters, it just doesn’t make sense to use datasets that require that much computation to give reliable outcomes, more so if we consider that these are supposed to be &lt;em&gt;easy&lt;/em&gt; datasets.&lt;/p&gt;

&lt;p&gt;Personally, I find that there are better alternatives out there, although they are rarely considered. For node classification, the GraphSage datasets (PPI and Reddit) are significantly better benchmarks than the citation networks (although they’re inductive tasks). 
For graph-level learning, QM9 has 134k small graphs of variable order, and will lead to minuscule uncertainty about the results after a few runs. I realize that it is a dataset for regression, but it is still a better alternative to PROTEINS. 
For classification, Filippo Bianchi, with whom I’ve recently worked a lot, released a dataset that simply cannot be classified without using a GNN. You can find it &lt;a href=&quot;https://github.com/FilippoMB/Benchmark_dataset_for_graph_classification&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I will admit that I am as guilty as the next person when it comes to using the “bad” datasets mentioned above. One reason is that it is easy not to move away from what everybody else is doing. Another is that reviewers outright ask for these datasets if you don’t include them, caring little for anything else.&lt;/p&gt;

&lt;p&gt;I think we can do better, as a community.&lt;/p&gt;

&lt;h2 id=&quot;in-conclusion&quot;&gt;In conclusion&lt;/h2&gt;

&lt;p&gt;I started thinking seriously about these issues as I was preparing a paper that required me to compare several models for the experiments. 
I am not sure whether the few solutions that I have outlined here are definitive, or even correct, but I feel that this is a conversation that needs to be had in the field of GNNs.&lt;/p&gt;

&lt;p&gt;Many of the comparisons that are found in the wild do not take any of this stuff into account, and I think that this may ultimately slow the progress of GNN research and its propagation to other fields of science.&lt;/p&gt;

&lt;p&gt;If you want to continue this conversation, or if you have any ideas that could complement this post, shoot me an email or look for me on &lt;a href=&quot;https://twitter.com/riceasphait&quot;&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Cheers!&lt;/p&gt;
</description>
        <pubDate>Fri, 13 Dec 2019 00:00:00 +0000</pubDate>
        
        <link>/posts/2019-12-13/pitfalls.html</link>
          
        
            <category>AI</category>
        
            <category>GNN</category>
        
          
        
            <category>posts</category>
        
          
      </item>
    
    <item>
        <title>Implementing a Network-based Model of Epilepsy with Numpy and Numba</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2019-10-03/2_nodes_complex_plane.png&quot; alt=&quot;&quot; class=&quot;full-width&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Mathematically modeling how epilepsy acts on the brain is one of the major topics of research in neuroscience. 
Recently I came across &lt;a href=&quot;https://mathematical-neuroscience.springeropen.com/articles/10.1186/2190-8567-2-1&quot;&gt;this paper&lt;/a&gt; by Oscar Benjamin et al., which I thought would be cool to implement and experiment with.&lt;/p&gt;

&lt;p&gt;The idea behind the paper is simple enough. First, they formulate a mathematical model of how a seizure might happen in a single region of the brain. Then, they expand this model to consider the interplay between different areas of the brain, effectively modeling it as a network.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;single-system&quot;&gt;Single system&lt;/h2&gt;

&lt;p&gt;We start from a complex dynamical system defined as follows:&lt;/p&gt;

\[\dot{z} = f(z) = (\lambda - 1 + i \omega)z + 2z|z|^2 - z|z|^4\]

&lt;p&gt;where \( z \in \mathbb{C} \) and \(\lambda\) controls the possible attractors of the system. 
For \( 0 &amp;lt; \lambda &amp;lt; 1 \), the system has two stable attractors: one fixed point and one attractor that oscillates with an angular velocity of \(\omega\) rad/s.&lt;br /&gt;
We can consider the stable attractor as a simplification of the brain in its resting state, while the oscillating attractor is taken to be the &lt;em&gt;ictal&lt;/em&gt; state (i.e., when the brain is having a seizure).&lt;/p&gt;

&lt;p&gt;We can also consider a &lt;em&gt;noise-driven&lt;/em&gt; version of the system:&lt;/p&gt;

\[dz(t) = f(z)\,dt + \alpha\,dW(t)\]

&lt;p&gt;where \( W(t) \) is a Wiener process rescaled by a factor \( \alpha \).&lt;br /&gt;
A Wiener process \( W(t)_{t\ge0} \), sometimes called &lt;em&gt;Brownian motion&lt;/em&gt;, is a stochastic process with the following properties:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;\(W(0) = 0\);&lt;/li&gt;
  &lt;li&gt;the increments between two consecutive observations are normally distributed with a variance equal to the time between the observations:&lt;/li&gt;
&lt;/ul&gt;

\[W(t + \tau) - W(t) \sim \mathcal{N}(0, \tau).\]

&lt;p&gt;In the noise-driven version of the system, it is guaranteed that the system will eventually &lt;em&gt;escape&lt;/em&gt; any region of phase space, moving from one attractor to the other.&lt;/p&gt;

&lt;p&gt;In short, we have a system that, due to external, unpredictable inputs (the noise), will randomly switch from a state of rest to a state of oscillation, which we interpret as a seizure.&lt;/p&gt;
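&lt;p&gt;As a sketch of how one might simulate this (using the parameter values given later in this post), the noise-driven system can be integrated with a simple Euler–Maruyama scheme, \( z(t + dt) = z(t) + f(z)\,dt + \alpha\,dW \):&lt;/p&gt;

```python
import numpy as np

# Euler-Maruyama integration of the single noise-driven node (illustrative
# sketch; parameter values match the ones used later in the post).
def f(z, lamb=0.5, omega=20.0):
    return (lamb - 1 + 1j * omega) * z + 2 * z * abs(z) ** 2 - z * abs(z) ** 4

rng = np.random.default_rng(0)
alpha, dt, steps = 0.2, 1e-4, 10000
z = 0.0 + 0.0j  # start at the resting fixed point
trajectory = np.empty(steps, dtype=np.complex128)
for t in range(steps):
    # Complex Wiener increment: independent real and imaginary parts
    dW = np.sqrt(dt) * complex(rng.standard_normal(), rng.standard_normal())
    z = z + f(z) * dt + alpha * dW
    trajectory[t] = z
```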

&lt;p&gt;The two figures below show an example of the system starting from the stable attractor and then moving to the oscillator. 
Since the system is complex, we can observe its dynamics in phase space:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2019-10-03/1_nodes_complex_plane.png&quot; alt=&quot;&quot; class=&quot;centered&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Or we can observe the real part of \( z(t) \) as if we were reading an EEG of brain activity:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2019-10-03/1_nodes_re_v_time.png&quot; alt=&quot;&quot; class=&quot;centered&quot; /&gt;&lt;/p&gt;

&lt;p&gt;See how the change of attractor almost looks like an epileptic seizure?&lt;/p&gt;

&lt;h2 id=&quot;network-model&quot;&gt;Network model&lt;/h2&gt;

&lt;p&gt;While this simple model of seizure initiation is interesting on its own, we can also take our modeling a step further and explicitly represent the connections between different areas of the brain (or sub-systems, if you will) and how they might affect the propagation of seizures from one area to the other.&lt;/p&gt;

&lt;p&gt;We do this by defining a connectivity matrix \( A \) where \( A_{ij} = 1 \) if sub-system \( i \) has a direct influence on sub-system \( j \), and \( A_{ij} = 0 \) otherwise. In practice, we also normalize the matrix symmetrically, dividing each entry \( A_{ij} \) by the square root of the product of the degrees of nodes \( i \) and \( j \).&lt;/p&gt;

&lt;p&gt;Starting from the system described above, the dynamics of one node in the networked system are described by:&lt;/p&gt;

\[dz_{i}(t) = \big( f(z_i) + \beta \sum\limits_{j \ne i} A_{ji} (z_j - z_i) \big)\,dt + \alpha\,dW_{i}(t)\]

&lt;p&gt;If we look at the individual nodes, their behavior may not seem different from what we had with the single sub-system, but in reality, the attractors of these networked systems are determined by the connectivity \( A \) and the coupling strength \( \beta \).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2019-10-03/4_graph.png&quot; alt=&quot;&quot; class=&quot;centered&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here’s what the networked system of 4 nodes pictured above looks like in phase space:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2019-10-03/4_nodes_complex_plane.png&quot; alt=&quot;&quot; class=&quot;centered&quot; /&gt;&lt;/p&gt;

&lt;p&gt;And again we can also look at the real part of each node:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2019-10-03/4_nodes_re_v_time.png&quot; alt=&quot;&quot; class=&quot;centered&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If you want more details on how to control the different attractors of the system, I suggest you look at the &lt;a href=&quot;https://mathematical-neuroscience.springeropen.com/articles/10.1186/2190-8567-2-1&quot;&gt;original paper&lt;/a&gt;. They analyze in depth the attractors and &lt;em&gt;escape times&lt;/em&gt; of all possible 2-node and 3-node networks, and also give an overview of higher-order networks.&lt;/p&gt;

&lt;h2 id=&quot;implementing-the-system-with-numpy-and-numba&quot;&gt;Implementing the system with Numpy and Numba&lt;/h2&gt;

&lt;p&gt;Now that we have the math sorted out, let’s look at how to translate this system into Numpy.&lt;/p&gt;

&lt;p&gt;Since the system is so precisely defined, we only need to convert the mathematical formulation into code. In short, we will need:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The core functions to compute the complex dynamical system;&lt;/li&gt;
  &lt;li&gt;The main loop to compute the evolution of the system starting from an initial condition.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While developing this, I quickly realized that my original, kinda straightforward implementation was painfully slow and would require some optimization to be usable.&lt;/p&gt;

&lt;p&gt;This was the perfect occasion to use &lt;a href=&quot;http://numba.pydata.org/&quot;&gt;Numba&lt;/a&gt;, a JIT compiler for Python that claims to yield speedups of up to two orders of magnitude.&lt;br /&gt;
Numba can be used to JIT compile any function implemented in pure Python, and natively supports a vast number of Numpy operations as well. 
The juicy part of Numba consists of compiling functions in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nopython&lt;/code&gt; mode, meaning that the code will run without ever using the Python interpreter. 
To achieve this, it is sufficient to decorate your functions with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@njit&lt;/code&gt; decorator and then simply run your script as usual.&lt;/p&gt;

&lt;h2 id=&quot;code&quot;&gt;Code&lt;/h2&gt;

&lt;p&gt;At the very start, let’s deal with imports and define a couple of helper functions that we are going to use only once:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numba&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;njit&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;degree_power&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;pow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;
    Computes D^{p} from the given adjacency matrix.

    :param adj: rank 2 array.
    :param pow: exponent to which elevate the degree matrix.
    :return: the exponentiated degree matrix.
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;degrees&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;power&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;pow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reshape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;degrees&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;isinf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;degrees&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;degrees&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;normalized_adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;
    Normalizes the given adjacency matrix using the degree matrix as
    D^{-1/2}AD^{-1/2} (symmetric normalization).

    :param adj: rank 2 array.
    :return: the normalized adjacency matrix.
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;normalized_D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;degree_power&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalized_D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalized_D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The code for these functions was copy-pasted from &lt;a href=&quot;https://danielegrattarola.github.io/spektral/&quot;&gt;Spektral&lt;/a&gt; and slightly adapted so that we don’t need to import the entire library just for two functions. Note that there’s no need to JIT compile these two functions, because they will run only once and it is not guaranteed that compiling them would be cheaper than simply executing them with Python. Moreover, both functions are heavily Numpy-based already, so they should run at C-like speed anyway.&lt;/p&gt;

&lt;p&gt;Moving on to the actual system, let’s first define the fixed hyper-parameters of the model:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;omega&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;               &lt;span class=&quot;c1&quot;&gt;# Frequency of oscillations in rad/s
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.2&lt;/span&gt;              &lt;span class=&quot;c1&quot;&gt;# Intensity of the noise
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lamb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;               &lt;span class=&quot;c1&quot;&gt;# Controls the possible attractors of each node
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;beta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.1&lt;/span&gt;               &lt;span class=&quot;c1&quot;&gt;# Coupling strength b/w nodes
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;                    &lt;span class=&quot;c1&quot;&gt;# Number of nodes in the system
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seconds_to_generate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Number of seconds to evolve the system for
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.0001&lt;/span&gt;              &lt;span class=&quot;c1&quot;&gt;# Time interval between consecutive states
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Random connectivity matrix
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill_diagonal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;A_norm&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalized_adjacency&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;astype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;complex128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The core of the dynamical system is the update function \( f(z) \), which in code looks like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;njit&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lamb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;omega&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;The deterministic update function of each node.

    :param z: complex, the current state.
    :param lamb: float, hyper-parameter to control the attractors of each node.
    :param omega: float, frequency of oscillations in rad/s.
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lamb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;complex&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;omega&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There’s not much to say here, except that using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;complex&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;np.complex&lt;/code&gt; seems to be slightly faster (157 ns vs. 178 ns), although the performance impact on the overall function is clearly negligible.&lt;/p&gt;

&lt;p&gt;To compute the noise-driven system, we need to define the increment function of a complex Wiener process. We can start by implementing the increment function of a simple Wiener process, first:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;njit&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;delta_wiener&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Returns the random delta between two consecutive steps of a Wiener
    process (Brownian motion).

    :param size: tuple, desired shape of the output array.
    :param dt: float, time increment in seconds.
    :return: numpy array with shape &apos;size&apos;.
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;At the time of writing, Numba &lt;a href=&quot;https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html#distributions&quot;&gt;does not support&lt;/a&gt; the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; argument of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;np.random.normal&lt;/code&gt;, but it does support &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;np.random.randn&lt;/code&gt;. Instead of setting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scale&lt;/code&gt; parameter explicitly, we simply multiply the sampled values by the scale.&lt;br /&gt;
Since the scale is a standard deviation, not a variance, and the Wiener increment has variance equal to the time step, we multiply by the square root of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dt&lt;/code&gt;.&lt;/p&gt;
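As a quick sanity check (my addition, not from the original post), the increments produced this way indeed have variance close to `dt`:

```python
import numpy as np

def delta_wiener(size, dt):
    # Gaussian increments with standard deviation sqrt(dt), i.e. variance dt
    return np.sqrt(dt) * np.random.randn(*size)

np.random.seed(0)
dt = 1e-4
samples = delta_wiener((100000,), dt)
empirical_var = samples.var()
```

With 100,000 samples, the empirical variance lands well within a few percent of `dt`.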

&lt;p&gt;Finally, we can compute the increment of a complex Wiener process \( W(t) = U(t) + jV(t) \), where both \( U \) and \( V \) are real-valued Wiener processes:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;njit&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;complex_delta_wiener&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Returns the random delta between two consecutive steps of a complex
    Wiener process (Brownian motion). The process is calculated as u(t) + jv(t)
    where u and v are simple Wiener processes.

    :param size: tuple, the desired shape of the output array.
    :param dt: float, time increment in seconds.
    :return: numpy array of np.complex128 with shape &apos;size&apos;.
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;delta_wiener&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;delta_wiener&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1j&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now that we have all the necessary components to define the noise-driven system, let’s implement the main step function:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;njit&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;step&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;
    Compute one time step of the system, s.t. z[t+1] = z[t] + step(z[t]).

    :param z: numpy array of np.complex128, the current state.
    :return: numpy array of np.complex128.
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Matrix with pairwise differences of nodes
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;delta_z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reshape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reshape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Compute diffusive coupling
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;diffusive_coupling&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A_norm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;delta_z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Compute change in state
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;update_from_self&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lamb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lamb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;omega&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;omega&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;update_from_others&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;beta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;diffusive_coupling&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;noise&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;complex_delta_wiener&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;update_from_self&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;update_from_others&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;noise&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dz&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Originally, I had implemented the following line&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;delta_z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reshape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reshape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;as&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;delta_z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[...,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;but Numba does not support adding new axes with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;None&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;np.newaxis&lt;/code&gt;.&lt;/p&gt;
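Outside of nopython mode the two spellings are interchangeable; a minimal check of the equivalence (my illustration):

```python
import numpy as np

z = np.arange(4).astype(np.complex128)

# Numba-friendly version: broadcast a column vector against a row vector
delta_a = z.reshape(-1, 1) - z.reshape(1, -1)

# Plain-NumPy version with new axes (rejected by Numba's nopython mode)
delta_b = z[..., None] - z[None, ...]
```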

&lt;p&gt;Also, when computing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;diffusive_coupling&lt;/code&gt;, a more efficient way of doing&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;B&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;would have been&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;einsum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;ij,ij-&amp;gt;j&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;for reasons which I still fail to understand (3.48 µs vs. 2.57 µs, when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; are 3 by 3 float matrices). However, Numba does not support &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;np.einsum&lt;/code&gt;.&lt;/p&gt;
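For what it's worth, one plausible explanation (mine, not the author's) is that `einsum` computes only the diagonal entries as column-wise dot products, whereas `np.diag(A.T.dot(B))` first materializes the full matrix product and then discards everything off the diagonal. The two are equivalent in value:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# Full 3x3 matrix product, then keep only the diagonal
via_diag = np.diag(A.T.dot(B))

# Only the 3 diagonal entries: column-wise dot products of A and B
via_einsum = np.einsum('ij,ij->j', A, B)
```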

&lt;p&gt;Finally, we can implement the main loop function that starts from a given initial state &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;z0&lt;/code&gt; and computes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;steps&lt;/code&gt; updates at time intervals of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dt&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;njit&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;evolve_system&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;
    Evolve the system starting from the given initial state (z0) for a given
    number of time steps (steps).

    :param z0: numpy array of np.complex128, the initial state.
    :param steps: int, number of steps to evolve the system for.
    :return: list, the sequence of states.
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;steps_in_percent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;steps&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;steps_in_percent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;steps_in_percent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;%&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;dz&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;step&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I had originally wrapped the loop in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tqdm&lt;/code&gt; progress bar, but an old-fashioned &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;print&lt;/code&gt; can reduce the overhead by 50% (2.29s vs. 1.23s, tested on a simple &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;for&lt;/code&gt; loop with 1e7 iterations). Pre-computing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;steps_in_percent&lt;/code&gt; also reduces the overhead by 30% compared to computing it every time.&lt;br /&gt;
(You’ll notice that at some point it just became a matter of optimizing every possible aspect of this :D)&lt;/p&gt;

&lt;p&gt;The only thing left to do is to evolve the system starting from a given initial state:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;z0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zeros&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;astype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;complex128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Starting conditions
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;steps&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seconds_to_generate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;# Number of steps to generate
&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;timesteps&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;evolve_system&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;timesteps&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;timesteps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You can now run any analysis on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timesteps&lt;/code&gt;, which will be a Numpy array of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;np.complex128&lt;/code&gt;. Note also how we had to cast the initial conditions &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;z0&lt;/code&gt; to this &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dtype&lt;/code&gt;, in order to have strict typing in the JIT-compiled code.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gist.github.com/danielegrattarola/c663346b529e758f0224c8313818ad77&quot;&gt;I published the full code as a Gist, including the code I used to make the plots.&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;general-notes-on-performance&quot;&gt;General notes on performance&lt;/h2&gt;

&lt;p&gt;My original implementation was based on a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Simulator&lt;/code&gt; class that implemented all the same methods in a compact abstraction:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Simulator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;object&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1e-4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;omega&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.05&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lamb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;beta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;

    &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;staticmethod&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lamb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;omega&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;

    &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;staticmethod&lt;/span&gt;    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;delta_wiener&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;

    &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;staticmethod&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;complex_delta_wiener&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;step&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;evolve_system&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;steps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There were some issues with this implementation, the biggest one being that JIT-compiling an entire class with Numba is much messier than compiling standalone functions (the substance of the code did not change much, and I’ve explicitly highlighted all implementation changes above).&lt;/p&gt;

&lt;p&gt;Moving to a more functional style feels cleaner and, honestly, more elegant (opinions, I know). Crucially, it also allowed me to optimize each function to work flawlessly with Numba.&lt;/p&gt;

&lt;p&gt;After optimizing all that was optimizable, I tested the old code against the new one and the speedup was about 31x, going from ~8k iterations/s to ~250k iterations/s.&lt;/p&gt;

&lt;p&gt;Most of the improvement came from Numba and removing the overhead of Python’s interpreter, but it must be said that the true core of the system is dealt with by Numpy. In fact, as we increase the number of nodes the bottleneck becomes the matrix multiplication in Numpy, eventually leading to virtually no performance difference between using Numba or not (verified for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N=1000&lt;/code&gt; - the 31x speedup was for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N=2&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
I hope that you enjoyed this post and hopefully learned something new, be it about models of the epileptic brain or Python optimization.&lt;/p&gt;

&lt;p&gt;Cheers!&lt;/p&gt;
</description>
        <pubDate>Thu, 03 Oct 2019 00:00:00 +0000</pubDate>
        
        <link>/posts/2019-10-03/epilepsy-model.html</link>
          
        
            <category>tutorial</category>
        
            <category>code</category>
        
            <category>epilepsy</category>
        
          
        
            <category>posts</category>
        
          
      </item>
    
    <item>
        <title>MinCUT Pooling in Graph Neural Networks</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2019-07-25/horses.png&quot; alt=&quot;Embeddings&quot; class=&quot;full-width&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;https://arxiv.org/abs/1907.00481&quot;&gt;our latest paper&lt;/a&gt;, we presented a new pooling method for GNNs, called &lt;strong&gt;MinCutPool&lt;/strong&gt;, which has a lot of desirable properties as far as pooling goes:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;It’s based on well-understood theoretical techniques for node clustering;&lt;/li&gt;
  &lt;li&gt;It’s fully differentiable and learnable with gradient descent;&lt;/li&gt;
  &lt;li&gt;It depends directly on the task-specific loss on which the GNN is being trained, but …&lt;/li&gt;
  &lt;li&gt;It can be trained on its own without a task-specific loss if needed;&lt;/li&gt;
  &lt;li&gt;It’s fast.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The method is based on the minCUT optimization problem, which consists of finding a cut on a weighted graph in such a way that the overall weight of the cut is minimized. We considered a continuous relaxation of the minCUT problem and implemented it as a neural network layer to provide a sound pooling method for GNNs.&lt;/p&gt;

&lt;p&gt;In this post, I’ll describe the working principles of minCUT pooling and show some applications of the layer.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2019-07-25/mincut_problem.png&quot; alt=&quot;Embeddings&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Minimum_k-cut&quot;&gt;K-way normalized minCUT&lt;/a&gt; is an optimization problem that finds K clusters on a graph by minimizing the total weight of the edges removed by the cut or, equivalently, by maximizing the normalized intra-cluster edge weight. This is equivalent to solving:&lt;/p&gt;

\[\text{maximize} \;\; \frac{1}{K} \sum_{k=1}^K \frac{\sum_{i,j \in \mathcal{V}_k} \mathcal{E}_{i,j} }{\sum_{i \in \mathcal{V}_k, j \in \mathcal{V}} \mathcal{E}_{i,j}},\]

&lt;p&gt;where \(\mathcal{V}\) is the set of nodes, \(\mathcal{V}_k\) is the \(k\)-th cluster of nodes, and \(\mathcal{E}_{i,j}\) is the weight of the edge between nodes \(i\) and \(j\).&lt;/p&gt;

&lt;p&gt;If we define a &lt;strong&gt;cluster assignment matrix&lt;/strong&gt; \(C \in \{0,1\}^{N \times K}\), which maps each of the \(N\) nodes to one of the \(K\) clusters, the problem can also be re-written as:&lt;/p&gt;

\[\text{maximize} \;\;  \frac{1}{K} \sum_{k=1}^K \frac{C_k^T A C_k}{C_k^T D C_k}\]

&lt;p&gt;where \(A\) is the adjacency matrix of the graph, and \(D\) is the diagonal degree matrix.&lt;/p&gt;

&lt;p&gt;While finding the optimal minCUT is an NP-hard problem, there exist relaxations that can find near-optimal solutions in polynomial time. These relaxations, however, are still very expensive and are not able to generalize to unseen samples.&lt;/p&gt;

&lt;h2 id=&quot;mincut-pooling&quot;&gt;MinCUT pooling&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2019-07-25/GNN_pooling.png&quot; alt=&quot;Embeddings&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The idea behind minCUT pooling is to take a continuous relaxation of the minCUT problem and implement it as a GNN layer with a custom loss function. By minimizing the custom loss, the GNN learns to find minCUT clusters on any given graph and aggregates the clusters to reduce the graph’s size. &lt;br /&gt;
At the same time, because the layer can be used as a part of a larger architecture, any other loss that is being minimized during training will influence the clusters found by MinCutPool, making them optimal for the particular task at hand.&lt;/p&gt;

&lt;p&gt;At the core of minCUT pooling there is an MLP, which maps the node features \(\mathbf{X}\) to a &lt;strong&gt;continuous&lt;/strong&gt; cluster assignment matrix \(\mathbf{S}\) (of size \(N \times K\)):&lt;/p&gt;

\[\mathbf{S} = \textrm{softmax}(\text{ReLU}(\mathbf{X}\mathbf{W}_1)\mathbf{W}_2)\]

&lt;p&gt;We can then use the MLP to generate \(\mathbf{S}\) on the fly, and reduce the graphs with simple multiplications as:&lt;/p&gt;

\[\mathbf{A}^{pool} = \mathbf{S}^T \mathbf{A} \mathbf{S}; \;\;\; \mathbf{X}^{pool} = \mathbf{S}^T \mathbf{X}.\]
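The two equations above can be sketched in a few lines of NumPy (my illustration; the shapes, random data, and two-layer MLP weights are assumptions for the example, not the paper's reference implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N, F, H, K = 6, 4, 8, 2  # nodes, features, hidden units, clusters
rng = np.random.default_rng(0)

X = rng.standard_normal((N, F))               # node features
A = (rng.random((N, N)) < 0.5).astype(float)  # random adjacency...
A = np.triu(A, 1)
A = A + A.T                                   # ...made symmetric, no self-loops

W1 = rng.standard_normal((F, H))
W2 = rng.standard_normal((H, K))

# Soft cluster assignments: S = softmax(ReLU(X W1) W2), shape (N, K)
S = softmax(np.maximum(X @ W1, 0) @ W2)

# Coarsened graph: A_pool = S^T A S (K x K), X_pool = S^T X (K x F)
A_pool = S.T @ A @ S
X_pool = S.T @ X
```

Each row of `S` is a probability distribution over the `K` clusters, so the coarsening is just two matrix products.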

&lt;p&gt;At this point, we can already make a couple of considerations:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Nodes with similar features will likely belong to the same cluster because they will be “classified” similarly by the MLP. This is especially true when using message-passing layers before pooling, since they will cause the node features of connected nodes to become similar;&lt;/li&gt;
  &lt;li&gt;Because of the MLP, \(\mathbf{S}\) is pretty fast to compute and the layer can generalize to new graphs once it has been trained.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is already pretty good, and it covers some of the main desiderata of a GNN layer, but we also want to explicitly account for the connectivity of the graph in order to pool it.&lt;/p&gt;

&lt;p&gt;This is where the minCUT optimization comes in.&lt;/p&gt;

&lt;p&gt;By slightly adapting the minCUT formulation above, we can design an auxiliary loss to train the MLP, so that it will learn to solve the minCUT problem in an unsupervised way. &lt;br /&gt;
In practice, our unsupervised regularization loss encourages the MLP to cluster together nodes that are strongly connected with each other and weakly connected with the nodes in the other clusters.&lt;/p&gt;

&lt;p&gt;The full unsupervised loss that we minimize in order to achieve this is:&lt;/p&gt;

\[\mathcal{L}_u = \mathcal{L}_c + \mathcal{L}_o = 
    \underbrace{- \frac{Tr ( \mathbf{S}^T \mathbf{A} \mathbf{S} )}{Tr ( \mathbf{S}^T\mathbf{D} \mathbf{S})}}_{\mathcal{L}_c} + 
    \underbrace{\bigg{\lVert} \frac{\mathbf{S}^T\mathbf{S}}{\|\mathbf{S}^T\mathbf{S}\|_F} - \frac{\mathbf{I}_K}{\sqrt{K}}\bigg{\rVert}_F}_{\mathcal{L}_o},\]

&lt;p&gt;where \(\mathbf{A}\) is the &lt;a href=&quot;https://danielegrattarola.github.io/spektral/utils/convolution/#normalized_adjacency&quot;&gt;normalized&lt;/a&gt; adjacency matrix of the graph.&lt;/p&gt;

&lt;p&gt;Let’s break this loss down and see how it works.&lt;/p&gt;

&lt;h3 id=&quot;cut-loss&quot;&gt;Cut loss&lt;/h3&gt;
&lt;p&gt;The first term, \(\mathcal{L}_c\), encourages the MLP to find cluster assignments that solve the minCUT problem (to see why, compare it with the minCUT maximization that I described above). We refer to this loss as the &lt;strong&gt;cut loss&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In particular, minimizing the numerator leads to clustering together nodes that are strongly connected on the graph, while the denominator prevents any of the clusters from becoming too small.&lt;/p&gt;

&lt;p&gt;The cut loss is bounded between -1 and 0, which are &lt;strong&gt;ideally&lt;/strong&gt; reached in the following situations:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;\(\mathcal{L}_c = 0\) when all pairs of connected nodes are assigned to different clusters;&lt;/li&gt;
  &lt;li&gt;\(\mathcal{L}_c = -1\) when there are \(K\) disconnected components in the graph, and \(\mathbf{S}\) exactly maps the \(K\) components to the \(K\) clusters;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The figure below shows what these situations might look like. Note that both cases can only happen if \(\mathbf{S}\) is binary.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2019-07-25/loss_bounds.png&quot; alt=&quot;L_c bounds&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However, because of the continuous relaxation, \(\mathcal{L}_c\) is non-convex and there are spurious minima that can be found by SGD.&lt;br /&gt;
For example, for \(K = 4\), the uniform assignment matrix&lt;/p&gt;

\[\mathbf{S}_i = (0.25, 0.25, 0.25, 0.25) \;\; \forall i,\]

&lt;p&gt;would cause the numerator and the denominator of \(\mathcal{L}_c\) to be equal, and the loss to be \(-1\).&lt;br /&gt;
A similar situation occurs when all nodes in the graph are assigned to the same cluster.&lt;/p&gt;

&lt;p&gt;This can be easily verified with NumPy (assuming &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;import numpy as np&lt;/code&gt;):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;In&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Adjacency matrix
&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;...:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;  
   &lt;span class=&quot;p&quot;&gt;...:&lt;/span&gt;               &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;  
   &lt;span class=&quot;p&quot;&gt;...:&lt;/span&gt;               &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;In&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Degree matrix
&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;...:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;In&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Perfect cluster assignment
&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;...:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;In&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;trace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;trace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.0&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;In&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# All nodes uniformly distributed 
&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;...:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ones&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;In&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;trace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;trace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.0&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;In&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# All nodes in the same cluster 
&lt;/span&gt;   &lt;span class=&quot;p&quot;&gt;...:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt; 

&lt;span class=&quot;n&quot;&gt;In&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;trace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;trace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;orthogonality-loss&quot;&gt;Orthogonality loss&lt;/h3&gt;
&lt;p&gt;The second term, \(\mathcal{L}_o\), helps to avoid such degenerate minima of \(\mathcal{L}_c\) by encouraging the MLP to find clusters that are orthogonal to each other. We call this the &lt;strong&gt;orthogonality loss&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In other words, \(\mathcal{L}_o\) encourages the MLP to “make a decision” about which nodes belong to which clusters, avoiding those degenerate solutions where \(\mathbf{S}\) assigns one \(K\)-th of a node to each cluster.&lt;/p&gt;

&lt;p&gt;Moreover, we can see that the perfect minimizer of \(\mathcal{L}_o\) is only reached if we have \(N \le K\) nodes, because in general, given a \(K\)-dimensional vector space, we cannot find more than \(K\) mutually orthogonal vectors. 
The only way to minimize \(\mathcal{L}_o\) given \(N\) assignment vectors is, therefore, to distribute the nodes among the \(K\) clusters. This causes the MLP to avoid the other type of spurious minimum of \(\mathcal{L}_c\), where all nodes are assigned to a single cluster.&lt;/p&gt;
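&lt;p&gt;A quick NumPy check on some toy assignment matrices (made up for illustration) shows how \(\mathcal{L}_o\) separates a balanced hard assignment from the two degenerate solutions described above:&lt;/p&gt;

```python
import numpy as np

def ortho_loss(S):
    """Orthogonality loss L_o for an N x K assignment matrix S."""
    K = S.shape[1]
    StS = S.T @ S
    return np.linalg.norm(StS / np.linalg.norm(StS) - np.eye(K) / np.sqrt(K))

# Balanced hard assignment of 4 nodes to 2 clusters: loss is (numerically) zero
S_balanced = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])

# Degenerate assignments that fool the cut loss: the loss is strictly positive
S_uniform = np.full((4, 2), 0.5)       # each node split evenly across clusters
S_single = np.array([[1., 0.]] * 4)    # every node in the same cluster

print(ortho_loss(S_balanced), ortho_loss(S_uniform), ortho_loss(S_single))
```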

&lt;h2 id=&quot;interaction-of-the-two-losses&quot;&gt;Interaction of the two losses&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/images/2019-07-25/cora_mc_loss+nmi.png&quot; alt=&quot;Loss terms&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can see how the two loss terms interact to find a good solution to the cluster assignment problem. 
The figure above shows the evolution of the unsupervised loss as the network is trained to cluster the nodes of Cora (plot on the left). As training progresses, the normalized mutual information (NMI) between the cluster assignments and the true labels improves, meaning that the layer is learning to find meaningful clusters (plot on the right).&lt;/p&gt;

&lt;p&gt;Note how \(\mathcal{L}_c\) starts from a trivial assignment (-1) due to the random initialization and then moves away from the spurious minima as the orthogonality loss forces the MLP towards more sensible solutions.&lt;/p&gt;

&lt;h3 id=&quot;pooled-graph&quot;&gt;Pooled graph&lt;/h3&gt;
&lt;p&gt;As a further consideration, we can take a closer look at the pooled adjacency matrix \(\mathbf{A}^{pool}\).  &lt;br /&gt;
First of all, we can see that it is a \(K \times K\) matrix that counts the links between each pair of clusters. For example, the entry \(\mathbf{A}^{pool}_{1,\;2}\) contains the number of links between the nodes in cluster 1 and the nodes in cluster 2. 
We can also see that the trace of \(\mathbf{A}^{pool}\) is being maximized in \(\mathcal{L}_c\). Therefore, we can expect the diagonal elements \(\mathbf{A}^{pool}_{i,\;i}\) to be much larger than the other entries of \(\mathbf{A}^{pool}\).&lt;/p&gt;

&lt;p&gt;For this reason, \(\mathbf{A}^{pool}\) will represent a graph with very strong self-loops, and the message-passing layers after pooling will have a hard time propagating information on the graph (because the self-loops will keep sending the information of a node back onto itself, and not its neighbors).&lt;/p&gt;

&lt;p&gt;To address this problem, a solution is to remove the diagonal of \(\mathbf{A}^{pool}\) and renormalize the matrix by its degree, before returning it as the output of the pooling layer:&lt;/p&gt;

\[\hat{\mathbf{A}} =  \mathbf{A}^{pool} - \mathbf{I}_K \cdot diag(\mathbf{A}^{pool}); \;\; \tilde{\mathbf{A}}^{pool} = \hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-\frac{1}{2}}\]
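&lt;p&gt;In NumPy, this post-processing looks as follows (the coarsened adjacency matrix below is a made-up example):&lt;/p&gt;

```python
import numpy as np

A_pool = np.array([[8., 2.],
                   [2., 6.]])                  # toy K x K coarsened adjacency

A_hat = A_pool - np.diag(np.diag(A_pool))      # zero out the self-loops
d = A_hat.sum(-1)                              # degrees of the coarsened graph
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
A_tilde = D_inv_sqrt @ A_hat @ D_inv_sqrt      # symmetric degree normalization

print(A_tilde)                                 # zero diagonal, normalized links
```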

&lt;p&gt;In the paper, we combined minCUT with message-passing layers that have a built-in skip connection, in order to bring each node’s information forward (e.g., Spektral’s &lt;a href=&quot;https://danielegrattarola.github.io/spektral/layers/convolution/#graphconvskip&quot;&gt;GraphConvSkip&lt;/a&gt;). 
However, if your GNN is based on the &lt;a href=&quot;https://danielegrattarola.github.io/spektral/layers/convolution/#graphconv&quot;&gt;graph convolutional networks (GCN)&lt;/a&gt; of &lt;a href=&quot;https://arxiv.org/abs/1609.02907&quot;&gt;Kipf &amp;amp; Welling&lt;/a&gt;, you may want to manually add the self-loops back after pooling.&lt;/p&gt;

&lt;h3 id=&quot;notes-on-gradient-flow&quot;&gt;Notes on gradient flow&lt;/h3&gt;

&lt;p&gt;&lt;img src=&quot;/images/2019-07-25/mincut_layer.png&quot; alt=&quot;mincut scheme&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The unsupervised loss \(\mathcal{L}_u\) can be optimized on its own, adapting the weights of the MLP to compute an \(\mathbf{S}\) that solves the minCUT problem under the orthogonality constraint.&lt;/p&gt;

&lt;p&gt;However, given the multiplicative interaction between \(\mathbf{S}\) and \(\mathbf{X}\), the gradient of the task-specific loss (i.e., whatever the GNN is being trained to do) can flow through the MLP. We can see in the picture above how there is a path going from the input \(\mathbf{X}^{(t+1)}\) to the output \(\mathbf{X}_{\textrm{pool}}^{(t+1)}\), directly passing through the MLP.&lt;/p&gt;

&lt;p&gt;This means that the overall solution found by the GNN will take into account both the graph structure (to solve minCUT) and the final task.&lt;/p&gt;

&lt;h2 id=&quot;code&quot;&gt;Code&lt;/h2&gt;

&lt;p&gt;Implementing minCUT in TensorFlow is fairly straightforward. Let’s start from some setup:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tensorflow&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;
  &lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Dense&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Adjacency matrix (N x N)
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Node features (N x F)
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;n_clusters&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Number of clusters to find with minCUT
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;First, the layer computes the cluster assignment matrix &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; by applying a softmax MLP to the node features:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;H&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Dense&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;activation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;relu&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Dense&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_clusters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;activation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;softmax&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;H&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Cluster assignment matrix
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The cut loss is then implemented as:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Cut loss
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A_pool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matmul&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transpose&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matmul&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;num&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;trace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A_pool&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reduce_sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Diagonal degree matrix&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;D_pooled&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matmul&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transpose&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matmul&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;den&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;trace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;D_pooled&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;mincut_loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;den&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And the orthogonality loss is implemented as:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Orthogonality loss
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;St_S&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matmul&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transpose&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;I_S&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eye&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_clusters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;ortho_loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;norm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;St_S&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;norm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;St_S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;I_S&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;norm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;I_S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Finally, the full unsupervised loss of the layer is obtained as the sum of the two auxiliary losses:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;total_loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mincut_loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ortho_loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The actual pooling step is a simple multiplication of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;X&lt;/code&gt;; we then zero out the diagonal of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A_pool&lt;/code&gt; and re-normalize the matrix. Since we already computed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A_pool&lt;/code&gt; for the numerator of \(\mathcal{L}_c\), we only need to do:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Pooling node features
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X_pool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matmul&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transpose&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Zeroing out the diagonal
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A_pool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linalg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_diag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A_pool&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zeros&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A_pool&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[:&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Remove diagonal
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Normalizing A_pool
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;D_pool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reduce_sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A_pool&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;D_pool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;D_pool&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[:,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1e-12&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Add epsilon to avoid division by 0
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A_pool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A_pool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D_pool&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transpose&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;D_pool&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Wrap this up in a layer, and use the layer in a GNN. Done.&lt;/p&gt;

&lt;p&gt;You can find minCUT pooling implementations both in &lt;a href=&quot;https://danielegrattarola.github.io/spektral/layers/pooling/#mincutpool&quot;&gt;Spektral&lt;/a&gt; and &lt;a href=&quot;https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#module-torch_geometric.nn.dense.mincut_pool&quot;&gt;Pytorch Geometric&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;experiments&quot;&gt;Experiments&lt;/h2&gt;

&lt;h3 id=&quot;unsupervised-clustering&quot;&gt;Unsupervised clustering&lt;/h3&gt;
&lt;p&gt;Because the core of MinCutPool is an unsupervised loss that does not require labeled data in order to be minimized, we can optimize \(\mathcal{L}_u\) on its own to test the clustering ability of minCUT.&lt;/p&gt;
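
&lt;p&gt;For reference, here is a minimal NumPy sketch of the two terms that make up the unsupervised loss: a cut term that rewards clustering strongly-connected nodes together, and an orthogonality term that pushes the assignments towards balanced clusters. This is an illustrative re-implementation under my own simplifying assumptions (dense matrices, single graph), not the official code; see the paper for the exact definition.&lt;/p&gt;

```python
import numpy as np

# Illustrative sketch (not the official implementation) of the
# unsupervised minCUT loss for a dense adjacency matrix A and a soft
# cluster assignment matrix S whose rows sum to 1.
def mincut_loss(A, S):
    D = np.diag(A.sum(-1))  # degree matrix
    # Cut term: maximize intra-cluster edges relative to cluster degrees
    cut = -np.trace(S.T @ A @ S) / np.trace(S.T @ D @ S)
    # Orthogonality term: push S towards balanced, near-orthogonal columns
    SS = S.T @ S
    K = S.shape[-1]
    ortho = np.linalg.norm(SS / np.linalg.norm(SS) - np.eye(K) / np.sqrt(K))
    return cut + ortho
```

&lt;p&gt;A perfect hard assignment on a graph with well-separated communities drives both terms to their minimum, which is what lets us use the loss on its own as a clustering objective.&lt;/p&gt;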

&lt;p&gt;A good first test is to check whether the layer can partition a regular grid into clusters of equal size, and whether it can isolate the communities of a network. 
We see in the figure below that minCUT was able to do both perfectly.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2019-07-25/regular_clustering.png&quot; alt=&quot;Clustering with minCUT pooling&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To make things more interesting, we can also test minCUT on the task of graph-based image segmentation. We can build a &lt;a href=&quot;https://scikit-image.org/docs/dev/auto_examples/segmentation/plot_rag.html&quot;&gt;region adjacency graph&lt;/a&gt; from a natural image, and cluster its nodes in order to see if regions with similar colors are clustered together. &lt;br /&gt;
The results look nice; keep in mind that they were obtained by optimizing \(\mathcal{L}_u\) alone!&lt;/p&gt;
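
&lt;p&gt;For intuition, here is a small NumPy sketch (my own illustration, not the exact pipeline used for the figure) that builds the adjacency matrix of a region adjacency graph from a label image in which each pixel stores the id of its region:&lt;/p&gt;

```python
import numpy as np

# Illustrative sketch: two regions become connected nodes of the RAG
# whenever their pixels touch horizontally or vertically.
def region_adjacency(labels):
    n = labels.max() + 1
    A = np.zeros((n, n))
    # Pairs of region ids for horizontally and vertically adjacent pixels
    h = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()])
    v = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()])
    pairs = np.concatenate([h, v], axis=1)
    A[pairs[0], pairs[1]] = 1
    A[pairs[1], pairs[0]] = 1
    np.fill_diagonal(A, 0)  # no self-loops
    return A
```

&lt;p&gt;Each node can then carry the mean color of its region as a feature, and the RAG is ready to be pooled.&lt;/p&gt;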

&lt;p&gt;&lt;img src=&quot;/images/2019-07-25/horses.png&quot; alt=&quot;Horse segmentation with minCUT pooling&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Finally, we also checked the clustering abilities of MinCutPool on the popular citations datasets: Cora, Citeseer, and Pubmed. 
As mentioned before, we used the NMI score to see whether the layer was clustering together nodes of the same class. Note that the layer did not have access to the labels during training.&lt;/p&gt;

&lt;p&gt;You can check &lt;a href=&quot;https://arxiv.org/abs/1907.00481&quot;&gt;the paper&lt;/a&gt; to see how minCUT fared in comparison to other methods, but in short: it did well, sometimes by a full order of magnitude better than other methods.&lt;/p&gt;

&lt;h3 id=&quot;autoencoder&quot;&gt;Autoencoder&lt;/h3&gt;
&lt;p&gt;Another interesting unsupervised test that we did was to check how much information is preserved in the coarsened graph after pooling.
To do this, we built a simple graph autoencoder with the structure pictured below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2019-07-25/ae.png&quot; alt=&quot;unsupervised reconstruction with AE&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The “Unpool” layer is simply obtained by transposing the same \(\mathbf{S}\) found by minCUT, in order to upscale the graph instead of downscaling it:&lt;/p&gt;

\[\mathbf{A}^\text{unpool} = \mathbf{S} \mathbf{A}^\text{pool} \mathbf{S}^T; \;\; \mathbf{X}^\text{unpool} = \mathbf{S}\mathbf{X}^\text{pool}.\]
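
&lt;p&gt;In code, the unpooling step amounts to two matrix products (NumPy sketch; in the actual model \(\mathbf{S}\) is the assignment matrix produced by the minCUT layer):&lt;/p&gt;

```python
import numpy as np

# Sketch of the "Unpool" layer: the cluster assignment matrix S
# (n nodes x k clusters) upscales the pooled graph back to n nodes.
def unpool(S, A_pool, X_pool):
    A_unpool = S @ A_pool @ S.T  # (n, n) adjacency
    X_unpool = S @ X_pool        # (n, features) node features
    return A_unpool, X_unpool
```

&lt;p&gt;Note that all nodes assigned to the same cluster receive the same reconstructed features, which is exactly why this test measures how much information survives the coarsening.&lt;/p&gt;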

&lt;p&gt;We tested the graph AE on some very regular graphs that should have been easy to reconstruct after pooling. Surprisingly, this turned out to be a difficult problem for some pooling layers from the GNN literature. MinCUT, on the other hand, held up quite nicely.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2019-07-25/reconstructions.png&quot; alt=&quot;unsupervised reconstruction with AE&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;supervised-inductive-tasks&quot;&gt;Supervised inductive tasks&lt;/h3&gt;

&lt;p&gt;Finally, we tested whether minCUT provides an improvement on the usual graph classification and graph regression tasks. &lt;br /&gt;
We picked a fixed GNN architecture and tested several pooling strategies by swapping the pooling layers in the network.&lt;/p&gt;

&lt;p&gt;The datasets that we used were:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets&quot;&gt;The Benchmark Data Sets for Graph Kernels&lt;/a&gt;;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/FilippoMB/Benchmark_dataset_for_graph_classification&quot;&gt;A synthetic dataset created by F. M. Bianchi to test GNNs&lt;/a&gt;;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://quantum-machine.org/datasets/&quot;&gt;The QM9 dataset for the prediction of chemical properties of molecules&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I’m not going to report the full comparisons with other methods here, but I will highlight an interesting sanity check that we performed to see whether using GNNs and graph pooling made sense at all.&lt;/p&gt;

&lt;p&gt;Among the various methods that we tested, we also included:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A simple MLP which did not exploit the relational information carried by the graphs;&lt;/li&gt;
  &lt;li&gt;The same GNN architecture without pooling layers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We were once again surprised to see that, while minCUT yielded a consistent improvement over such simple baselines, other pooling methods did not.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Working on minCUT pooling was an interesting experience that deepened my understanding of GNNs, and allowed me to see what is really necessary for a GNN to work.&lt;/p&gt;

&lt;p&gt;We have put the paper &lt;a href=&quot;https://arxiv.org/abs/1907.00481&quot;&gt;on arXiv&lt;/a&gt;, and you can check the official implementations of the method in &lt;a href=&quot;https://danielegrattarola.github.io/spektral/layers/pooling/#mincutpool&quot;&gt;Spektral&lt;/a&gt; and &lt;a href=&quot;https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#module-torch_geometric.nn.dense.mincut_pool&quot;&gt;Pytorch Geometric&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to use MinCutPool in your own work, you can cite us with:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@inproceedings{bianchi2019mincut,
  title={Spectral Clustering with Graph Neural Networks for Graph Pooling},
  author={Filippo Maria Bianchi and Daniele Grattarola and Cesare Alippi},
  booktitle={Proceedings of the 37th International Conference on Machine Learning (ICML)},
  year={2020}
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Cheers!&lt;/p&gt;
</description>
        <pubDate>Thu, 25 Jul 2019 00:00:00 +0000</pubDate>
        
        <link>/posts/2019-07-25/mincut-pooling.html</link>
          
        
            <category>AI</category>
        
            <category>GNN</category>
        
            <category>pooling</category>
        
          
        
            <category>posts</category>
        
          
      </item>
    
    <item>
        <title>Detecting Hostility from Skeletal Graphs Using Non-Euclidean Embeddings</title>
        <description>&lt;p&gt;The first paper on which I worked during my PhD is about &lt;a href=&quot;https://arxiv.org/abs/1805.06299&quot;&gt;detecting changes in sequences of graphs using non-Euclidean geometry and adversarial autoencoders&lt;/a&gt;. As a real-world application of the method presented in the paper, we showed that we could detect epileptic seizures in the brain, by monitoring a stream of functional connectivity brain networks.&lt;/p&gt;

&lt;p&gt;In general, the methodology presented in the paper can work for any data that:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;can be represented as graphs;&lt;/li&gt;
  &lt;li&gt;has a temporal dimension;&lt;/li&gt;
  &lt;li&gt;has a change that you want to identify somewhere along the stream of data;&lt;/li&gt;
  &lt;li&gt;has i.i.d. samples.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are &lt;a href=&quot;https://icon.colorado.edu/#!/networks&quot;&gt;a lot&lt;/a&gt; of temporal networks that can be found in the wild, but not many datasets respect all the requirements at the same time. What’s more, many public datasets have very few samples along the temporal axis.  &lt;!--more--&gt;
Recently, however, I was looking for some nice graph classification dataset on which to test &lt;a href=&quot;https://danielegrattarola.github.io/spektral&quot;&gt;Spektral&lt;/a&gt;, and I stumbled upon the &lt;a href=&quot;http://rose1.ntu.edu.sg/datasets/actionrecognition.asp&quot;&gt;NTU RGB+D&lt;/a&gt; dataset released by the Nanyang Technological University of Singapore.&lt;br /&gt;
The dataset consists of about 60 thousand video clips of people performing everyday actions, including mutual actions and some health-related ones. The reason why I found this dataset is that it contains skeletal annotations for each frame of each video clip, meaning lots and lots of graphs that &lt;a href=&quot;https://arxiv.org/abs/1801.07455&quot;&gt;can be used for graph classification&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;ntu-rgbd-for-change-detection&quot;&gt;NTU RGB+D for change detection&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2019-04-13/graphs.svg&quot; alt=&quot;graphs&quot; title=&quot;Figure 1: examples of hugging and punching graphs.&quot; class=&quot;threeq-width&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While reading through the website, however, I realized that this dataset could actually be a good playground for our change detection methodology as well, because it respects almost all requirements:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;it has graphs;&lt;/li&gt;
  &lt;li&gt;it has a temporal dimension;&lt;/li&gt;
  &lt;li&gt;it has classes, which can be easily converted to what we called the &lt;em&gt;regimes&lt;/em&gt; of our graph streams;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fourth requirement of having i.i.d. samples is due to the nature of the change detection test that we adopted in the paper. The test is able to detect changes in stationarity of a stochastic process, which means that it can tell whether the samples coming from the process have been drawn from a different distribution than the one observed during training. &lt;br /&gt;
In order to do so, the test needs to estimate whether a window of observations from the process is significantly different from what was observed in the nominal regime. This requires having i.i.d. samples in each window.&lt;/p&gt;
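
&lt;p&gt;To give an idea of how such windowed tests operate, here is a generic CUSUM-style accumulator. This is a simplified stand-in of my own, not the actual test from the paper: each window contributes a change statistic, the evidence accumulates over time, and an alarm is raised once a decision threshold is crossed.&lt;/p&gt;

```python
# Generic CUSUM-style accumulator (simplified stand-in, not the test
# from the paper): each element of `stats` is a per-window change
# statistic; evidence accumulates until it crosses the threshold.
def cusum_alarms(stats, drift=0.1, threshold=5.0):
    g = 0.0
    alarms = []
    for s in stats:
        g = max(0.0, g + s - drift)   # accumulate evidence of change
        alarms.append(g >= threshold)  # raise an alarm past the threshold
    return alarms
```

&lt;p&gt;The accumulators plotted in Figure 3 below follow the same general idea: flat under the nominal regime, growing steadily after the change.&lt;/p&gt;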

&lt;p&gt;By their very nature, however, the graphs in NTU RGB+D are definitely not i.i.d. (they would have been, had the subjects been recorded under a strobe light – dammit!).&lt;br /&gt;
There are several ways of converting a heavily autocorrelated signal to a stationary one, the simplest being shuffling the samples along the time axis.
The piece-wise stationarity requirement is a very strong one, and we are looking into relaxing it, but for testing the method on NTU RGB+D we had to stick with it.&lt;/p&gt;
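
&lt;p&gt;This randomization step is trivial to implement (illustrative sketch):&lt;/p&gt;

```python
import numpy as np

# Sketch: shuffle the graphs of a stream along the time axis (within a
# single regime) to break the autocorrelation between consecutive frames.
def randomize_time_axis(graph_stream, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(graph_stream))
    return [graph_stream[i] for i in idx]
```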

&lt;h2 id=&quot;setting&quot;&gt;Setting&lt;/h2&gt;

&lt;p&gt;Defining the change detection problem is easy: have a nominal regime of neutral or positive actions like walking, reading, taking a selfie, or being at the computer, and try to detect when the regime changes to a negative action like falling down, getting in fights with people, or feeling sick (there are at least 5 action classes of people acting hurt or sick in NTU RGB+D).&lt;/p&gt;

&lt;p&gt;Applications of this could include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;monitoring children and elderly people when they are alone;&lt;/li&gt;
  &lt;li&gt;detecting violence in at-risk, crowded situations;&lt;/li&gt;
  &lt;li&gt;detecting when a driver is distracted;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all of these situations, you might have a pretty good idea of what you &lt;em&gt;want&lt;/em&gt; to be happening at a given time, but have no way of knowing how things could go wrong.&lt;/p&gt;

&lt;p&gt;We chose the “hugging” action for the nominal, all-is-well regime, and we took the “punching/slapping” class to symbolize any unexpected, undesirable behaviour that deviates from our concept of nominal.
Then, we trained our adversarial autoencoder to represent points on an ensemble of constant-curvature manifolds, and we ran the change detection test. 
At this point, it would probably help if one was familiar with the details of &lt;a href=&quot;https://arxiv.org/abs/1805.06299&quot;&gt;the paper&lt;/a&gt;. In short, what we do is:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;take an adversarial graph autoencoder (AAE);&lt;/li&gt;
  &lt;li&gt;train the AAE on the nominal samples that you have at training time;&lt;/li&gt;
  &lt;li&gt;impose a geometric regularization onto the latent space of the AAE, so that the embeddings will lie on a Riemannian constant-curvature manifold (CCM).&lt;br /&gt;
This happens in one of two ways:
    &lt;ol&gt;
      &lt;li&gt;use a prior distribution with support on the CCM to train the AAE;&lt;/li&gt;
      &lt;li&gt;make the encoder maximise the membership of its embeddings to the CCM (this is the one we use for this experiment);&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;use the trained AAE to represent incoming graphs on the CCM;&lt;/li&gt;
  &lt;li&gt;run the change detection test on the CCM;&lt;/li&gt;
&lt;/ol&gt;
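
&lt;p&gt;To make step 3.2 concrete, the membership maximisation can be pictured as penalising embeddings that drift away from the manifold. For a spherical CCM, a penalty of this flavour could look like the following (an assumed illustrative form, not the paper’s exact term):&lt;/p&gt;

```python
import numpy as np

# Illustrative penalty (assumed form): embeddings Z should lie on a
# sphere of radius r, i.e. a constant-curvature manifold of curvature
# 1 / r**2; the penalty grows as the embeddings leave the sphere.
def sphere_membership_penalty(Z, r=1.0):
    norms = np.linalg.norm(Z, axis=-1)
    return np.mean((norms - r) ** 2)
```

&lt;p&gt;In the actual model, a term like this is minimised jointly with the reconstruction and adversarial losses of the AAE.&lt;/p&gt;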

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2019-04-13/embeddings.svg&quot; alt=&quot;embeddings&quot; title=&quot;Figure 2: embeddings produced by the AAE on the three different CCMs. Blue for hugging, orange for punching.&quot; class=&quot;full-width&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This procedure can be adapted to learn a representation on more than one CCM at a time, by having parallel latent spaces for the AAE. This worked pretty well in the paper, so we tried the same here. 
We also chose one of the two types of change detection tests that we introduced in the paper, namely the one we called &lt;em&gt;Riemannian&lt;/em&gt;, because it gave us the best results on the seizure detection problem.&lt;/p&gt;

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;p&gt;Running the whole method on the stream of graphs gave us very nice results. We were able to recognize the change from friendly to violent interactions in most experiments, although sometimes the autoencoder failed to capture the differences between the two regimes (and consequently, the CDT couldn’t pick up the change).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://danielegrattarola.github.io/images/2019-04-13/accumulator.svg&quot; alt=&quot;accumulator&quot; title=&quot;Figure 3: accumulators of R-CDT (see the paper) for the three CCMs. The change is marked with the red line, the decision threshold with the green line. &quot; class=&quot;full-width&quot; /&gt;&lt;/p&gt;

&lt;p&gt;An interesting thing that we observed is that when using an ensemble of three different geometries, namely spherical, hyperbolic, and Euclidean, the change would only show up in the spherical CCM. 
This was a consistent result that gave us yet another confirmation of two things:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;assuming Euclidean geometry for the latent space is not always a good idea;&lt;/li&gt;
  &lt;li&gt;our idea of learning a representation on multiple CCMs at the same time worked as expected. Originally, we suggested this trick to potential adopters of our CDT methodology, in order to avoid having to guess the best geometry for the representation. Now, we have confirmation that it is indeed a good idea, because the AAE will choose the best geometry for the task on its own.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Figure 2 above (hover over the images to see the captions) shows the embeddings produced by the encoder on the test stream of graphs. Figure 3 shows the three &lt;em&gt;accumulators&lt;/em&gt; used in the change detection test to decide whether or not to raise an alarm indicating that a change occurred. 
In both pictures, the decision for raising an alarm is informed almost exclusively by the spherical CCM.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;That’s all, folks!&lt;br /&gt;
This was a pretty little experiment to run, and it gave us further insights into the world of non-Euclidean neural networks. We have actually &lt;a href=&quot;https://arxiv.org/abs/1805.06299&quot;&gt;updated the paper&lt;/a&gt; with the findings of this new experiment, and you can also try and play with our algorithm using the &lt;a href=&quot;https://github.com/danielegrattarola/cdt-ccm-aae&quot;&gt;code on Github&lt;/a&gt; (the code there is for the synthetic experiments of the paper, but you can adapt it to any dataset easily).&lt;/p&gt;

&lt;p&gt;If you want to mention our CDT strategy in your work, you can cite:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@article{grattarola2018change,
  title={Change Detection in Graph Streams by Learning Graph Embeddings on Constant-Curvature Manifolds},
  author={Grattarola, Daniele and Zambon, Daniele and Livi, Lorenzo and Alippi, Cesare},
  journal={IEEE Transactions on Neural Networks and Learning Systems},
  year={2019},
  doi={10.1109/TNNLS.2019.2927301}
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Cheers!&lt;/p&gt;
</description>
        <pubDate>Sat, 13 Apr 2019 00:00:00 +0000</pubDate>
        
        <link>/posts/2019-04-13/hostility-detection.html</link>
          
        
            <category>AI</category>
        
            <category>experiment</category>
        
            <category>non-euclidean</category>
        
          
        
            <category>posts</category>
        
          
      </item>
    
  </channel>
</rss>
