Pitfalls of Graph Neural Network Evaluation 2.0

In this post, I’m going to summarize some conceptual problems that I have found when comparing different graph neural networks (GNNs) between them.

I’m going to argue that it is extremely difficult to make an objectively fair comparison between structurally different models and that the experimental comparisons found in the literature are not always sound.

I will try to suggest reasonable solutions whenever possible, but the goal of this post is simply to make these issues appear on your radar and maybe spark a conversation on the matter.

Some of the things that I’ll say are also addressed in the original Pitfalls of Graph Neural Network Evaluation (Shchur et al., 2018), which I warmly suggest you read.


The first source of inconsistency when comparing GNNs comes from the fact that different layers are designed to take into account neighborhoods of different sizes.
We usually have that a layer either looks at the 1-neighbours of each node, or it has a hyperparameter K that controls the size of the neighbourhood. Some examples of popular methods (implemented both in Spektral and Pytorch Geometric) in either category:

A fair evaluation should keep these differences into account and allow each GNN to look at the same neighborhoods, but at the same time, it could be argued that a layer designed to operate on larger neighborhoods is more expressive. How can we tell what is better?

Let’s say we are comparing GCN with Cheby. The equivalent of a 2-layer GCN could be a 2-layer Cheby with K=1, or a 1-layer Cheby with K=2. In the GCN paper, they use a 2-layer Cheby with K=3. Should they have compared with a 6-layer GCN?

Moreover, this difference between methods may have an impact on the number of parameters, nonlinearity, and overall amount of regularization in a GNN.
For instance, a GCN that reaches a neighborhood of order 3 may have 3 dropout layers, while the equivalent Cheby with K=3 will have only one.
Another example: an SGC architecture can reach any neighborhood with a constant number of parameters, while other methods can’t.

We’re only looking at one simple issue, and it is already difficult to say how to fairly evaluate different methods. It gets worse.

Regularization and training

Regularization is an aspect that is particularly essential in GNNs, because the community uses very small benchmark datasets and most GNNs tend to overfit like crazy (more on this later). For these reasons, the performance of a GNN can vary wildly depending on how the model is regularized. This is true for all other hyperparameters in general, because things like the learning rate and batch size can be a form of implicit regularization.

The literature is largely inconsistent with how regularization is applied across different papers, making it difficult to say whether the performance improvements reported for a model are due to the actual contribution or to a different regularization scheme.

The following are often found in the literature:

  • High learning rates;
  • High L2 penalty;
  • Extremely high dropout rates on node features and adjacency matrix;
  • Low number of training epochs;
  • Low patience for early stopping.

I’m going to focus on a few of these.

First, I argue that setting a fixed number of training epochs is a form of alchemy that should be avoided if possible, because it’s incredibly task-specific. Letting a model train to convergence is almost always a better approach, because it’s less dependent on the initialization of the weights. If the validation performance is not indicative of the test performance and we need to stop the training without a good criterion, then something is probably wrong.

A second important aspect that I feel gets overlooked often is dropout.
In particular, when dropout is applied to the adjacency matrix it leads to big performance improvements, because the GNN is exposed to very noisy instances of the graphs at each training step and is forced to generalize well.
When comparing different models, if one is using dropout on the adjacency matrix then all the others should do the same. However, the common practice of comparing methods using the “same architecture from the original paper” means that some methods will be tested with dropout on A, and some without, as if the dropout is a particular characteristic of only some methods.

Finally, the remaining key factors in training are the learning rate and weight decay. These are often given as-is in the literature, but it is a good idea to tune them whenever possible. For what it’s worth, I can personally confirm that searching for a good learning rate, in particular, can lead to unexpected results, even for well-established methods (if the model is trained to convergence).

Parallel heads

Heads are parallel computational units that perform the same calculation with different weights and then merge the results to produce the output. To give a sense of the problems that one may encounter when comparing methods that use heads, I will focus on two methods: GAT and ARMA.

Having parallel attention heads is fairly common in NLP literature, from where the very concept of attention comes, and therefore it was natural to do the same in GAT.

In ARMA, using parallel stacks is theoretically motivated by the fact that ARMA filters of order H can be computed by summing H ARMA filters of order 1. While similar in practice to the heads in GAT, in this case having parallel heads is key to the implementation of this particular graph filter.

Because of these fundamental semantic differences, it is impossible to say whether a comparison between GAT with H heads and an ARMA layer of order H is fair.

Extending to the other models as well, it is not guaranteed that having parallel heads would necessarily lead to any practical improvements for a given model. Some methods can, in fact, benefit from a simpler architecture. It is therefore difficult to say whether a comparison between monolithic and parallel architectures is fair.


Finally, I’m going to spend a few words on datasets, because there is no chance of having a fair evaluation if the datasets on which we test our models are not good. And in truth, the benchmark datasets that we use for evaluating GNNs are not that good.

Cora, CiteSeer, PubMed, and the Dortmund benchmark datasets for graph kernels: these are, collectively, the Iris dataset of GNNs, and should be treated carefully. While a model should work on these in order to be considered usable, they cannot be the only criterion to run a fair evaluation.

Recently, the community has moved towards a more sensible use of the datasets (ok, maybe I was exaggerating a bit about Iris), thanks to papers like this and this. However, many experiments in the literature still had to be repeated hundreds of times in order to give significant results, and that is bad for three reasons: time, money, and the environment, in no particular order.
Especially if running a grid search of hyperparameters, it just doesn’t make sense to be using datasets that require that much computation to give reliable outcomes, more so if we consider that these are supposed to be easy datasets.

Personally, I find that there are better alternatives out there, that however are not considered often. For node classification, the GraphSage datasets (PPI and Reddit) are significantly better benchmarks than the citation networks (although they’re inductive tasks). For graph-level learning, QM9 has 134k small graphs, of variable order, and will lead to minuscule uncertainty about the results after a few runs. I realize that it is a dataset for regression, but it still is a better alternative to PROTEINS. For classification, Filippo Bianchi, with whom I’ve recently worked a lot, released a dataset that simply cannot be classified without using a GNN. You can find it here.

I will admit that I am as guilty as the next person when it comes to using the “bad” datasets mentioned above. One reason is that it is easy to not move away from what everybody else is doing. One reason is that reviewers outright ask for them if you don’t include them, caring little for anything else.

I think we can do better, as a community.

In conclusion

I started thinking seriously about these issues as I was preparing a paper that required me to compare several models for the experiments. I am not sure whether the few solutions that I have outlined here are definitive, or even correct, but I feel that this is a conversation that needs to be had in the field of GNNs.

Many of the comparisons that are found in the wild do not take any of this stuff into account, and I think that this may ultimately slow the progress of GNN research and its propagation to other fields of science.

If you want to continue this conversation, or if you have any ideas that could complement this post, shoot me an email or look for me on Twitter.