Daniele Grattarola: A blog about AI and other interesting stuff.
https://danielegrattarola.github.io/
Fri, 17 Jan 2020 15:32:54 +0000

<h1>Pitfalls of Graph Neural Network Evaluation 2.0</h1>
<p>In this post, I’m going to summarize some conceptual problems that I have found when comparing different graph neural networks (GNNs) with one another.</p>
<p>I’m going to argue that it is extremely difficult to make an objectively fair comparison between structurally different models and that the experimental comparisons found in the literature are not always sound.</p>
<p>I will try to suggest reasonable solutions whenever possible, but the goal of this post is simply to make these issues appear on your radar and maybe spark a conversation on the matter.</p>
<p>Some of the things that I’ll say are also addressed in the original <a href="https://arxiv.org/abs/1811.05868">Pitfalls of Graph Neural Network Evaluation (Shchur et al., 2018)</a>, which I warmly suggest you read.</p>
<!--more-->
<h2 id="neighbourhoods">Neighbourhoods</h2>
<p>The first source of inconsistency when comparing GNNs comes from the fact that different layers are designed to take into account neighbourhoods of different sizes.<br />
Typically, a layer either looks at the 1-hop neighbours of each node, or it has a hyperparameter K that controls the size of the neighbourhood. Some examples of popular methods (implemented in both Spektral and PyTorch Geometric) in either category:</p>
<ul>
<li>1-hop: <a href="https://arxiv.org/abs/1609.02907">GCN</a>, <a href="https://arxiv.org/abs/1710.10903">GAT</a>, <a href="https://arxiv.org/abs/1706.02216">GraphSage</a>, <a href="https://arxiv.org/abs/1810.00826">GIN</a>;</li>
<li>K-hop: <a href="https://arxiv.org/abs/1606.09375">Cheby</a>, <a href="https://arxiv.org/abs/1901.01343">ARMA</a>, <a href="https://arxiv.org/abs/1810.05997">APPNP</a>, <a href="https://arxiv.org/abs/1902.07153">SGC</a>.</li>
</ul>
<p>A fair evaluation should take these differences into account and allow each GNN to look at the same neighbourhoods but, at the same time, it could be argued that a layer designed to operate on larger neighbourhoods is more expressive. How can we tell which is better?</p>
<p>Let’s say we are comparing GCN with Cheby. The equivalent of a 2-layer GCN could be a 2-layer Cheby with K=1, or a 1-layer Cheby with K=2. In the GCN paper, they use a 2-layer Cheby with K=3. Should they have compared with a 6-layer GCN?</p>
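<p>A quick way to check this bookkeeping (my own sketch, not taken from any of the papers above) is to compare receptive fields directly: a K-hop layer stacked L times reaches L·K hops, which we can verify with reachability on a small graph:</p>

```python
import numpy as np

def receptive_field(A, hops):
    """Boolean reachability within `hops` steps (including the node itself)."""
    n = A.shape[0]
    R = np.eye(n, dtype=int)
    for _ in range(hops):
        R = ((R + R @ A) > 0).astype(int)
    return R

# Path graph with 7 nodes: 0 - 1 - 2 - ... - 6
n = 7
A = np.zeros((n, n), dtype=int)
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1

# A 2-layer Cheby with K=3 looks 2 * 3 = 6 hops away, like a 6-layer GCN...
deep = receptive_field(A, 2 * 3)
# ...while a 2-layer GCN only looks 2 hops away:
shallow = receptive_field(A, 2)

print(deep[0].sum(), shallow[0].sum())  # -> 7 3
```

From node 0, the 6-hop model covers the whole path while the 2-hop model only sees three nodes, even though both architectures have "2 layers".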
<p>Moreover, this difference between methods may have an impact on the number of parameters, the amount of nonlinearity, and the overall amount of regularization in a GNN. <br />
For instance, a GCN that reaches a neighbourhood of order 3 may have 3 dropout layers, while the equivalent Cheby with K=3 will have only one. <br />
Another example: an SGC architecture can reach any neighbourhood with a constant number of parameters, while other methods can’t.</p>
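<p>To make the parameter-count point concrete, here is a rough sketch with made-up dimensions. SGC-style propagation applies the normalized adjacency K times and then a single linear map, so its parameter count does not depend on K; a stack of K 1-hop layers needs a weight matrix per layer:</p>

```python
import numpy as np

n, f, c = 100, 16, 4                 # nodes, features, classes (arbitrary sizes)
S = np.random.rand(n, n)             # stand-in for the normalized adjacency
X = np.random.randn(n, f)
W = np.random.randn(f, c)            # the ONLY trainable weights, for any K

def sgc_forward(S, X, W, K):
    """SGC-style layer: propagate K times, then apply one linear map."""
    out = X
    for _ in range(K):
        out = S @ out
    return out @ W

def stacked_params(K, hidden=16):
    # One weight matrix per layer: f->hidden, hidden->hidden ..., hidden->c
    return f * hidden + max(K - 2, 0) * hidden * hidden + hidden * c

# Same output shape whether K=1 or K=10, with W.size parameters either way:
assert sgc_forward(S, X, W, K=1).shape == sgc_forward(S, X, W, K=10).shape == (n, c)
print(W.size, stacked_params(2), stacked_params(6))  # -> 64 320 1344
```

The stacked architecture's parameter count grows with the neighbourhood it reaches, while SGC's stays constant, which is exactly why comparing them at a fixed K is not obviously fair.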
<p>We’re only looking at one simple issue, and it is already difficult to say how to fairly evaluate different methods. It gets worse.</p>
<h2 id="regularization-and-training">Regularization and training</h2>
<p>Regularization is particularly important in GNNs, because the community uses very small benchmark datasets and most GNNs tend to overfit like crazy (more on this later).
For this reason, the performance of a GNN can vary wildly depending on how the model is regularized. The same is true of all other hyperparameters in general, because things like the learning rate and batch size can act as a form of implicit regularization.</p>
<p>The literature is largely inconsistent with how regularization is applied across different papers, making it difficult to say whether the performance improvements reported for a model are due to the actual contribution or to a different regularization scheme.</p>
<p>The following are often found in the literature:</p>
<ul>
<li>High learning rates;</li>
<li>High L2 penalty;</li>
<li>Extremely high dropout rates on node features and adjacency matrix;</li>
<li>Low number of training epochs;</li>
<li>Low patience for early stopping.</li>
</ul>
<p>I’m going to focus on a few of these.</p>
<p>First, I argue that setting a fixed number of training epochs is a form of alchemy that should be avoided if possible, because it’s incredibly task-specific. Letting a model train to convergence is almost always a better approach, because it’s less dependent on the initialization of the weights. If the validation performance is not indicative of the test performance and we need to stop the training without a good criterion, then something is probably wrong.</p>
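<p>A minimal sketch of what I mean by training to convergence, with a patience-based stopping criterion instead of a fixed epoch budget (<code>train_step</code> is a hypothetical callable that I made up for illustration, returning the validation loss for one epoch):</p>

```python
import numpy as np

def train_to_convergence(train_step, patience=50, max_epochs=100000):
    """Train until the validation loss stops improving for `patience`
    consecutive epochs, instead of fixing the number of epochs in advance."""
    best, wait = np.inf, 0
    for epoch in range(max_epochs):
        val_loss = train_step(epoch)
        if val_loss < best:
            best, wait = val_loss, 0   # improvement: reset the counter
        else:
            wait += 1                  # no improvement: consume patience
            if wait >= patience:
                break
    return best, epoch

# Toy example: the "loss" decreases linearly and plateaus after epoch 100.
best, stopped_at = train_to_convergence(lambda e: max(1.0 - e / 100, 0.0))
print(best, stopped_at)  # -> 0.0 150
```

The stopping epoch is determined by the data and the initialization, not chosen in advance, which is the point of the argument above.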
<p>A second important aspect that I feel gets overlooked often is dropout. <br />
In particular, when dropout is applied to the adjacency matrix it leads to big performance improvements, because the GNN is exposed to very noisy instances of the graphs at each training step and is forced to generalize well. <br />
When comparing different models, if one is using dropout on the adjacency matrix, then all the others should do the same. However, the common practice of comparing methods using the “same architecture from the original paper” means that some methods will be tested with dropout on A and some without, as if dropout were a particular characteristic of only some methods.</p>
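<p>For reference, dropout on the adjacency matrix amounts to randomly deleting edges at each training step. A minimal Numpy sketch (my own, not taken from any specific implementation):</p>

```python
import numpy as np

def drop_edges(A, rate, rng=np.random.default_rng(0)):
    """Zero out entries of the adjacency matrix with probability `rate`,
    so that the GNN sees a different noisy graph at every training step."""
    mask = rng.random(A.shape) >= rate
    return A * mask

A = np.ones((5, 5)) - np.eye(5)   # complete graph on 5 nodes
A_noisy = drop_edges(A, rate=0.5)

assert A_noisy.shape == A.shape
assert A_noisy.sum() <= A.sum()   # edges can only be removed, never added
```

Real implementations may also rescale the surviving entries by 1/(1 - rate), as standard dropout does, and keep the matrix symmetric for undirected graphs.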
<p>Finally, the remaining key factors in training are the learning rate and weight decay.
These are often given as-is in the literature, but it is a good idea to tune them whenever possible. For what it’s worth, I can personally confirm that searching for a good learning rate, in particular, can lead to unexpected results, even for well-established methods (if the model is trained to convergence).</p>
<h2 id="parallel-heads">Parallel heads</h2>
<p><em>Heads</em> are parallel computational units that perform the same calculation with different weights and then merge the results to produce the output. To give a sense of the problems that one may encounter when comparing methods that use heads, I will focus on two methods: GAT and ARMA.</p>
<p>Having parallel attention heads is fairly common in the NLP literature, which is where the very concept of attention comes from, and therefore it was natural to do the same in GAT.</p>
<p>In ARMA, using parallel <em>stacks</em> is theoretically motivated by the fact that ARMA filters of order H can be computed by summing H ARMA filters of order 1. While similar in practice to the heads in GAT, in this case having parallel heads is key to the implementation of this particular graph filter.</p>
<p>Because of these fundamental semantic differences, it is impossible to say whether a comparison between GAT with H heads and an ARMA layer of order H is fair.</p>
<p>Extending this reasoning to other models, it is not guaranteed that adding parallel heads will necessarily lead to any practical improvement for a given model. Some methods can, in fact, benefit from a simpler architecture.
It is therefore difficult to say whether a comparison between monolithic and parallel architectures is fair.</p>
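<p>Concretely, with H heads the same input is processed H times with different weights and the results are merged, typically by concatenation or averaging. A generic linear sketch (deliberately not specific to GAT or ARMA):</p>

```python
import numpy as np

def multi_head(X, weights, merge="concat"):
    """Apply H parallel linear 'heads' and merge their outputs."""
    outs = [X @ W for W in weights]        # one output per head
    if merge == "concat":
        return np.concatenate(outs, axis=-1)
    return np.mean(outs, axis=0)           # "average" merge

X = np.random.randn(10, 8)                         # 10 nodes, 8 features
heads = [np.random.randn(8, 4) for _ in range(3)]  # H=3 heads

print(multi_head(X, heads, "concat").shape)   # -> (10, 12)
print(multi_head(X, heads, "average").shape)  # -> (10, 4)
```

Note that even the merge strategy changes the output size, and therefore the parameter count of the next layer, which is yet another degree of freedom in these comparisons.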
<h2 id="datasets">Datasets</h2>
<p>Finally, I’m going to spend a few words on datasets, because there is no chance of having a fair evaluation if the datasets on which we test our models are not good. And in truth, the benchmark datasets that we use for evaluating GNNs are not that good.</p>
<p>Cora, CiteSeer, PubMed, and the Dortmund benchmark datasets for graph kernels: these are, collectively, the Iris dataset of GNNs, and should be treated carefully. While a model should work on these in order to be considered usable, they cannot be the only criterion to run a fair evaluation.</p>
<p>Recently, the community has moved towards a more sensible use of the datasets (ok, maybe I was exaggerating a bit about Iris), thanks to papers like <a href="https://arxiv.org/abs/1811.05868">this</a> and <a href="https://arxiv.org/abs/1910.12091">this</a>. However, many experiments in the literature still had to be repeated hundreds of times in order to give significant results, and that is bad for three reasons: time, money, and the environment, in no particular order.<br />
Especially if running a grid search of hyperparameters, it just doesn’t make sense to be using datasets that require that much computation to give reliable outcomes, more so if we consider that these are supposed to be <em>easy</em> datasets.</p>
<p>Personally, I find that there are better alternatives out there that, however, are not often considered. For node classification, the GraphSage datasets (PPI and Reddit) are significantly better benchmarks than the citation networks (although they are inductive tasks).
For graph-level learning, QM9 has 134k small graphs, of variable order, and will lead to minuscule uncertainty about the results after a few runs. I realize that it is a dataset for regression, but it still is a better alternative to PROTEINS.
For classification, Filippo Bianchi, with whom I’ve recently worked a lot, released a dataset that simply cannot be classified without using a GNN. You can find it <a href="https://github.com/FilippoMB/Benchmark_dataset_for_graph_classification">here</a>.</p>
<p>I will admit that I am as guilty as the next person when it comes to using the “bad” datasets mentioned above. One reason is that it is easy not to move away from what everybody else is doing. Another is that reviewers outright ask for these datasets if you don’t include them, caring little for anything else.</p>
<p>I think we can do better, as a community.</p>
<h2 id="in-conclusion">In conclusion</h2>
<p>I started thinking seriously about these issues as I was preparing a paper that required me to compare several models for the experiments.
I am not sure whether the few solutions that I have outlined here are definitive, or even correct, but I feel that this is a conversation that needs to be had in the field of GNNs.</p>
<p>Many of the comparisons that are found in the wild do not take any of this stuff into account, and I think that this may ultimately slow the progress of GNN research and its propagation to other fields of science.</p>
<p>If you want to continue this conversation, or if you have any ideas that could complement this post, shoot me an email or look for me on <a href="https://twitter.com/riceasphait">Twitter</a>.</p>
<p>Cheers!</p>
Fri, 13 Dec 2019 00:00:00 +0000
/posts/2019-12-13/pitfalls.html
<h1>Implementing a Network-based Model of Epilepsy with Numpy and Numba</h1>
<p><img src="https://danielegrattarola.github.io/images/2019-10-03/2_nodes_complex_plane.png" alt="" class="full-width" /></p>
<p>Mathematically modeling how epilepsy acts on the brain is one of the major topics of research in neuroscience.
Recently I came across <a href="https://mathematical-neuroscience.springeropen.com/articles/10.1186/2190-8567-2-1">this paper</a> by Oscar Benjamin et al., which I thought would be cool to implement and experiment with.</p>
<p>The idea behind the paper is simple enough. First, they formulate a mathematical model of how a seizure might happen in a single region of the brain. Then, they expand this model to consider the interplay between different areas of the brain, effectively modeling it as a network.</p>
<!--more-->
<h2 id="single-system">Single system</h2>
<p>We start from a complex dynamical system defined as follows:</p>
<script type="math/tex; mode=display">\dot{z} = f(z) = (\lambda - 1 + i \omega)z + 2z|z|^2 - z|z|^4</script>
<p>where \( z \in \mathbb{C} \) and \(\lambda\) controls the possible attractors of the system.
For \( 0 < \lambda < 1 \), the system has two stable attractors: one fixed point and one attractor that oscillates with an angular velocity of \(\omega\) rad/s.<br />
We can consider the fixed point as a simplification of the brain in its resting state, while the oscillating attractor is taken to be the <em>ictal</em> state (i.e., when the brain is having a seizure).</p>
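<p>Where do these two attractors come from? Writing \( z = re^{i\theta} \), the angular part decouples and the radial part of the dynamics becomes \( \dot{r} = (\lambda - 1)r + 2r^3 - r^5 \), whose nonzero equilibria satisfy \( r^2 = 1 \pm \sqrt{\lambda} \): the origin and the outer cycle at \( r^2 = 1 + \sqrt{\lambda} \) are the two stable attractors, separated by an unstable ring. This is my own derivation, so here is a quick numerical check:</p>

```python
import numpy as np

lamb = 0.5

def radial_velocity(r, lamb):
    # dr/dt for z = r * exp(i*theta); the angular part decouples.
    return (lamb - 1) * r + 2 * r**3 - r**5

# Nonzero equilibria at r^2 = 1 +/- sqrt(lambda):
r_cycle = np.sqrt(1 + np.sqrt(lamb))     # stable limit cycle (ictal state)
r_unstable = np.sqrt(1 - np.sqrt(lamb))  # unstable ring between the attractors

assert abs(radial_velocity(r_cycle, lamb)) < 1e-12
assert abs(radial_velocity(r_unstable, lamb)) < 1e-12
# The origin is attracting: small radii shrink further.
assert radial_velocity(0.1, lamb) < 0
```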
<p>We can also consider a <em>noise-driven</em> version of the system:</p>
<script type="math/tex; mode=display">dz(t) = f(z)\,dt + \alpha\,dW(t)</script>
<p>where \( W(t) \) is a Wiener process rescaled by a factor \( \alpha \).<br />
A Wiener process \( W(t)_{t\ge0} \), sometimes called <em>Brownian motion</em>, is a stochastic process with the following properties:</p>
<ul>
<li>\(W(0) = 0\);</li>
<li>the increments between two consecutive observations are normally distributed with a variance equal to the time between the observations:</li>
</ul>
<script type="math/tex; mode=display">W(t + \tau) - W(t) \sim \mathcal{N}(0, \tau).</script>
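<p>This variance property is easy to verify empirically by sampling many increments for a fixed \( \tau \) (just a sanity check, not part of the implementation below):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.25
# Increments of a Wiener process over a window tau: N(0, tau),
# i.e., standard normals scaled by sqrt(tau).
increments = np.sqrt(tau) * rng.standard_normal(1_000_000)

assert abs(increments.mean()) < 0.01         # sample mean ~ 0
assert abs(increments.var() - tau) < 0.01    # sample variance ~ tau
```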
<p>In the noise-driven version of the system, it is guaranteed that the system will eventually <em>escape</em> any region of phase space, moving from one attractor to the other.</p>
<p>In short, we have a system that due to external, unpredictable inputs (the noise), will randomly switch from a state of rest to a state of oscillation, which we consider as a seizure.</p>
<p>The two figures below show an example of the system starting from the stable attractor and then moving to the oscillator.
Since the system is complex, we can observe its dynamics in phase space:</p>
<p><img src="https://danielegrattarola.github.io/images/2019-10-03/1_nodes_complex_plane.png" alt="" class="centered" /></p>
<p>Or we can observe the real part of \( z(t) \) as if we were reading an EEG of brain activity:</p>
<p><img src="https://danielegrattarola.github.io/images/2019-10-03/1_nodes_re_v_time.png" alt="" class="centered" /></p>
<p>See how the change of attractor almost looks like an epileptic seizure?</p>
<h2 id="network-model">Network model</h2>
<p>While this simple model of seizure initiation is interesting on its own, we can also take our modeling a step further and explicitly represent the connections between different areas of the brain (or sub-systems, if you will) and how they might affect the propagation of seizures from one area to the other.</p>
<p>We do this by defining a connectivity matrix \( A \) where \( A_{ij} = 1 \) if sub-system \( i \) has a direct influence on sub-system \( j \), and \( A_{ij} = 0 \) otherwise. In practice, we also normalize the matrix by dividing each row element-wise by the product of the square roots of the node’s out-degree and in-degree.</p>
<p>Starting from the system described above, the dynamics of one node in the networked system are described by:</p>
<script type="math/tex; mode=display">dz_{i}(t) = \big( f(z_i) + \beta \sum\limits_{j \ne i} A_{ji} (z_j - z_i) \big)\,dt + \alpha\,dW_{i}(t)</script>
<p>If we look at the individual nodes, their behavior may not seem different than what we had with the single sub-system, but in reality, the attractors of these networked systems are determined by the connectivity \( A \) and the coupling strength \( \beta \).</p>
<p><img src="https://danielegrattarola.github.io/images/2019-10-03/4_graph.png" alt="" class="centered" /></p>
<p>Here’s what the networked system of 4 nodes pictured above looks like in phase space:</p>
<p><img src="https://danielegrattarola.github.io/images/2019-10-03/4_nodes_complex_plane.png" alt="" class="centered" /></p>
<p>And again we can also look at the real part of each node:</p>
<p><img src="https://danielegrattarola.github.io/images/2019-10-03/4_nodes_re_v_time.png" alt="" class="centered" /></p>
<p>If you want more details on how to control the different attractors of the system, I suggest you look at the <a href="https://mathematical-neuroscience.springeropen.com/articles/10.1186/2190-8567-2-1">original paper</a>. The authors analyze in depth the attractors and <em>escape times</em> of all possible 2-node and 3-node networks, and also give an overview of higher-order networks.</p>
<h2 id="implementing-the-system-with-numpy-and-numba">Implementing the system with Numpy and Numba</h2>
<p>Now that we have the math sorted out, let’s look at how to translate this system into Numpy.</p>
<p>Since the system is so precisely defined, we only need to convert the mathematical formulation into code. In short, we will need:</p>
<ol>
<li>The core functions to compute the complex dynamical system;</li>
<li>The main loop to compute the evolution of the system starting from an initial condition.</li>
</ol>
<p>While developing this, I quickly realized that my original, kinda straightforward implementation was painfully slow and that it would have required some optimization to be usable.</p>
<p>This was the perfect occasion to use <a href="http://numba.pydata.org/">Numba</a>, a JIT compiler for Python that claims to yield speedups of up to two orders of magnitude.<br />
Numba can be used to JIT compile any function implemented in pure Python, and natively supports a vast number of Numpy operations as well.
The juicy part of Numba consists of compiling functions in <code class="language-plaintext highlighter-rouge">nopython</code> mode, meaning that the code will run without ever using the Python interpreter.
To achieve this, it is sufficient to decorate your functions with the <code class="language-plaintext highlighter-rouge">@njit</code> decorator and then simply run your script as usual.</p>
<h2 id="code">Code</h2>
<p>At the very start, let’s deal with imports and define a couple of helper functions that we are going to use only once:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">numba</span> <span class="kn">import</span> <span class="n">njit</span>
<span class="k">def</span> <span class="nf">degree_power</span><span class="p">(</span><span class="n">adj</span><span class="p">,</span> <span class="nb">pow</span><span class="p">):</span>
<span class="s">"""
Computes D^{p} from the given adjacency matrix.
:param adj: rank 2 array.
:param pow: exponent to which elevate the degree matrix.
:return: the exponentiated degree matrix.
"""</span>
<span class="n">degrees</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">power</span><span class="p">(</span><span class="n">adj</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="nb">pow</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">degrees</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">isinf</span><span class="p">(</span><span class="n">degrees</span><span class="p">)]</span> <span class="o">=</span> <span class="mf">0.</span>
<span class="n">D</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">diag</span><span class="p">(</span><span class="n">degrees</span><span class="p">)</span>
<span class="k">return</span> <span class="n">D</span>
<span class="k">def</span> <span class="nf">normalized_adjacency</span><span class="p">(</span><span class="n">adj</span><span class="p">):</span>
<span class="s">"""
Normalizes the given adjacency matrix using the degree matrix as
D^{-1/2}AD^{-1/2} (symmetric normalization).
:param adj: rank 2 array.
:return: the normalized adjacency matrix.
"""</span>
<span class="n">normalized_D</span> <span class="o">=</span> <span class="n">degree_power</span><span class="p">(</span><span class="n">adj</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">normalized_D</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">adj</span><span class="p">)</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">normalized_D</span><span class="p">)</span>
<span class="k">return</span> <span class="n">output</span>
</code></pre></div></div>
<p>The code for these functions was copy-pasted from <a href="https://danielegrattarola.github.io/spektral/">Spektral</a> and slightly adapted so that we don’t need to import the entire library just for two functions. Note that there is no need to JIT compile these two functions, because they will run only once; in fact, it is not guaranteed that compiling them would be cheaper than simply executing them with Python. Moreover, both functions are heavily Numpy-based already, so they should run at C-like speed.</p>
<p>Moving forward to implementing the actual system. Let’s first define the fixed hyper-parameters of the model:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">omega</span> <span class="o">=</span> <span class="mi">20</span> <span class="c1"># Frequency of oscillations in rad/s
</span><span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.2</span> <span class="c1"># Intensity of the noise
</span><span class="n">lamb</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="c1"># Controls the possible attractors of each node
</span><span class="n">beta</span> <span class="o">=</span> <span class="mf">0.1</span> <span class="c1"># Coupling strength b/w nodes
</span><span class="n">N</span> <span class="o">=</span> <span class="mi">4</span> <span class="c1"># Number of nodes in the system
</span><span class="n">seconds_to_generate</span> <span class="o">=</span> <span class="mi">1</span> <span class="c1"># Number of seconds to evolve the system for
</span><span class="n">dt</span> <span class="o">=</span> <span class="mf">0.0001</span> <span class="c1"># Time interval between consecutive states
</span>
<span class="c1"># Random connectivity matrix
</span><span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">N</span><span class="p">))</span>
<span class="n">np</span><span class="o">.</span><span class="n">fill_diagonal</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">A_norm</span> <span class="o">=</span> <span class="n">normalized_adjacency</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">complex128</span><span class="p">)</span>
</code></pre></div></div>
<p>The core of the dynamical system is the update function \( f(z) \), which in code looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">njit</span>
<span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="n">lamb</span><span class="o">=</span><span class="mf">0.</span><span class="p">,</span> <span class="n">omega</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
<span class="s">"""The deterministic update function of each node.
:param z: complex, the current state.
:param lamb: float, hyper-parameter to control the attractors of each node.
:param omega: float, frequency of oscillations in rad/s.
"""</span>
<span class="k">return</span> <span class="p">((</span><span class="n">lamb</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">+</span> <span class="nb">complex</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">omega</span><span class="p">))</span> <span class="o">*</span> <span class="n">z</span>
<span class="o">+</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">z</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">z</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>
<span class="o">-</span> <span class="p">(</span><span class="n">z</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">z</span><span class="p">)</span> <span class="o">**</span> <span class="mi">4</span><span class="p">))</span>
</code></pre></div></div>
<p>There’s not much to say here, except that using <code class="language-plaintext highlighter-rouge">complex</code> instead of <code class="language-plaintext highlighter-rouge">np.complex</code> seems to be slightly faster (157 ns vs. 178 ns), although the performance impact on the overall function is clearly negligible.</p>
<p>To compute the noise-driven system, we need to define the increment function of a complex Wiener process. We can start by implementing the increment function of a simple Wiener process, first:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">njit</span>
<span class="k">def</span> <span class="nf">delta_wiener</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">dt</span><span class="p">):</span>
<span class="s">"""Returns the random delta between two consecutive steps of a Wiener
process (Brownian motion).
:param size: tuple, desired shape of the output array.
:param dt: float, time increment in seconds.
:return: numpy array with shape 'size'.
"""</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">dt</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="o">*</span><span class="n">size</span><span class="p">)</span>
</code></pre></div></div>
<p>At the time of writing this, Numba <a href="https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html#distributions">does not support</a> the <code class="language-plaintext highlighter-rouge">size</code> argument in <code class="language-plaintext highlighter-rouge">np.random.normal</code> but it does support <code class="language-plaintext highlighter-rouge">np.random.randn</code>. Instead of setting the <code class="language-plaintext highlighter-rouge">scale</code> parameter explicitly, we simply multiply the sampled values by the scale.<br />
Since we are using the scale, and not the variance, we have to take the square root of the time increment <code class="language-plaintext highlighter-rouge">dt</code>.</p>
<p>Finally, we can compute the increment of a complex Wiener process as \( U(t) + jV(t) \), where both \( U \) and \( V \) are simple Wiener processes:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">njit</span>
<span class="k">def</span> <span class="nf">complex_delta_wiener</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">dt</span><span class="p">):</span>
<span class="s">"""Returns the random delta between two consecutive steps of a complex
Wiener process (Brownian motion). The process is calculated as u(t) + jv(t)
where u and v are simple Wiener processes.
:param size: tuple, the desired shape of the output array.
:param dt: float, time increment in seconds.
:return: numpy array of np.complex128 with shape 'size'.
"""</span>
<span class="n">u</span> <span class="o">=</span> <span class="n">delta_wiener</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">dt</span><span class="p">)</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">delta_wiener</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">dt</span><span class="p">)</span>
<span class="k">return</span> <span class="n">u</span> <span class="o">+</span> <span class="n">v</span> <span class="o">*</span> <span class="mf">1j</span>
</code></pre></div></div>
<p>Now that we have all the necessary components to define the noise-driven system, let’s implement the main step function:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">njit</span>
<span class="k">def</span> <span class="nf">step</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
<span class="s">"""
Compute one time step of the system, s.t. z[t+1] = z[t] + step(z[t]).
:param z: numpy array of np.complex128, the current state.
:return: numpy array of np.complex128.
"""</span>
<span class="c1"># Matrix with pairwise differences of nodes
</span> <span class="n">delta_z</span> <span class="o">=</span> <span class="n">z</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="n">z</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># Compute diffusive coupling
</span> <span class="n">diffusive_coupling</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">diag</span><span class="p">(</span><span class="n">A_norm</span><span class="o">.</span><span class="n">T</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">delta_z</span><span class="p">))</span>
<span class="c1"># Compute change in state
</span> <span class="n">update_from_self</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="n">lamb</span><span class="o">=</span><span class="n">lamb</span><span class="p">,</span> <span class="n">omega</span><span class="o">=</span><span class="n">omega</span><span class="p">)</span>
<span class="n">update_from_others</span> <span class="o">=</span> <span class="n">beta</span> <span class="o">*</span> <span class="n">diffusive_coupling</span>
<span class="n">noise</span> <span class="o">=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">complex_delta_wiener</span><span class="p">(</span><span class="n">z</span><span class="o">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">dt</span><span class="p">)</span>
<span class="n">dz</span> <span class="o">=</span> <span class="p">(</span><span class="n">update_from_self</span> <span class="o">+</span> <span class="n">update_from_others</span><span class="p">)</span> <span class="o">*</span> <span class="n">dt</span> <span class="o">+</span> <span class="n">noise</span>
<span class="k">return</span> <span class="n">dz</span>
</code></pre></div></div>
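<p>With <code>step</code> in place, the main loop from the outline above is just repeated accumulation of increments, <code>z[t+1] = z[t] + step(z[t])</code>. For reference, here is a self-contained single-node version of the whole Euler–Maruyama scheme (no coupling and no Numba, simplified from the code above):</p>

```python
import numpy as np

omega, alpha, lamb, dt = 20, 0.2, 0.5, 1e-4
n_steps = 10000                           # one second of simulated activity

def f(z):
    # Deterministic update of a single node (same formula as above).
    return (lamb - 1 + 1j * omega) * z + 2 * z * abs(z)**2 - z * abs(z)**4

rng = np.random.default_rng(0)
z = 0j                                    # start at the resting fixed point
trajectory = np.empty(n_steps, dtype=np.complex128)
for t in range(n_steps):
    # Complex Wiener increment: u(t) + j*v(t), each scaled by sqrt(dt).
    noise = alpha * np.sqrt(dt) * (rng.standard_normal() + 1j * rng.standard_normal())
    z = z + f(z) * dt + noise             # Euler-Maruyama update
    trajectory[t] = z

# The state stays bounded: the -z|z|^4 term pulls large |z| back in.
assert np.abs(trajectory).max() < 2.0
```

Plotting <code>trajectory.real</code> against time reproduces the EEG-like traces shown earlier.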
<p>Originally, I had implemented the following line</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">delta_z</span> <span class="o">=</span> <span class="n">z</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="n">z</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>as</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">delta_z</span> <span class="o">=</span> <span class="n">z</span><span class="p">[</span><span class="o">...</span><span class="p">,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">-</span> <span class="n">z</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="o">...</span><span class="p">]</span>
</code></pre></div></div>
<p>but Numba does not support adding new axes with <code class="language-plaintext highlighter-rouge">None</code> or <code class="language-plaintext highlighter-rouge">np.newaxis</code>.</p>
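<p>As a quick sanity check (outside of Numba), the reshape-based version produces exactly the same pairwise difference matrix as the <code class="language-plaintext highlighter-rouge">None</code>-indexing version; a minimal NumPy sketch:</p>

```python
import numpy as np

z = np.array([1 + 1j, 2 - 1j, 3 + 0j])

# Numba-friendly version: build column and row vectors via reshape, then broadcast
delta_reshape = z.reshape(-1, 1) - z.reshape(1, -1)

# Equivalent None-indexing version (not supported inside an @njit function)
delta_none = z[:, None] - z[None, :]

assert delta_reshape.shape == (3, 3)       # entry [i, j] is z[i] - z[j]
assert np.allclose(delta_reshape, delta_none)
```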
<p>Also, when computing <code class="language-plaintext highlighter-rouge">diffusive_coupling</code>, a more efficient way of doing</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="o">.</span><span class="n">diag</span><span class="p">(</span><span class="n">A</span><span class="o">.</span><span class="n">T</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">B</span><span class="p">))</span>
</code></pre></div></div>
<p>would have been</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="o">.</span><span class="n">einsum</span><span class="p">(</span><span class="s">'ij,ij->j'</span><span class="p">,</span> <span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">)</span>
</code></pre></div></div>
<p>for reasons which I still fail to understand (3.48 µs vs. 2.57 µs, when <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code> are 3 by 3 float matrices). However, Numba does not support <code class="language-plaintext highlighter-rouge">np.einsum</code>.</p>
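<p>To double-check that the two expressions are equivalent, and to show a possible middle ground that (to my understanding) should be Numba-friendly, here is a quick NumPy sanity check:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 3))
B = rng.random((3, 3))

# diag(A^T B)[j] = sum_i A[i, j] * B[i, j]; einsum computes only the diagonal,
# without materializing the full matrix product
slow = np.diag(A.T.dot(B))
fast = np.einsum('ij,ij->j', A, B)

# Element-wise product and sum over rows gives the same result, using only
# operations that Numba supports
elementwise = (A * B).sum(0)

assert np.allclose(slow, fast)
assert np.allclose(slow, elementwise)
```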
<p>Finally, we can implement the main loop function that starts from a given initial state <code class="language-plaintext highlighter-rouge">z0</code> and computes <code class="language-plaintext highlighter-rouge">steps</code> number of updates at time intervals of <code class="language-plaintext highlighter-rouge">dt</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">njit</span>
<span class="k">def</span> <span class="nf">evolve_system</span><span class="p">(</span><span class="n">z0</span><span class="p">,</span> <span class="n">steps</span><span class="p">):</span>
<span class="s">"""
Evolve the system starting from the given initial state (z0) for a given
number of time steps (steps).
:param z0: numpy array of np.complex128, the initial state.
:param steps: int, number of steps to evolve the system for.
:return: list, the sequence of states.
"""</span>
<span class="n">steps_in_percent</span> <span class="o">=</span> <span class="n">steps</span> <span class="o">/</span> <span class="mi">100</span>
<span class="n">z</span> <span class="o">=</span> <span class="p">[</span><span class="n">z0</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">steps</span><span class="p">):</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">i</span> <span class="o">%</span> <span class="n">steps_in_percent</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">i</span> <span class="o">/</span> <span class="n">steps_in_percent</span><span class="p">,</span> <span class="s">'</span><span class="si">%</span><span class="s">'</span><span class="p">)</span>
<span class="n">dz</span> <span class="o">=</span> <span class="n">step</span><span class="p">(</span><span class="n">z</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">z</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">z</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">dz</span><span class="p">)</span>
<span class="k">return</span> <span class="n">z</span>
</code></pre></div></div>
<p>I had originally wrapped the loop in a <code class="language-plaintext highlighter-rouge">tqdm</code> progress bar, but an old-fashioned <code class="language-plaintext highlighter-rouge">if</code> and <code class="language-plaintext highlighter-rouge">print</code> can reduce the overhead by 50% (2.29s vs. 1.23s, tested on a simple <code class="language-plaintext highlighter-rouge">for</code> loop with 1e7 iterations). Pre-computing <code class="language-plaintext highlighter-rouge">steps_in_percent</code> also reduces the overhead by 30% compared to computing it every time.<br />
(You’ll notice that at some point it just became a matter of optimizing every possible aspect of this :D)</p>
<p>The only thing left to do is to evolve the system starting from a given initial state:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">z0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">complex128</span><span class="p">)</span> <span class="c1"># Starting conditions
</span><span class="n">steps</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">seconds_to_generate</span> <span class="o">/</span> <span class="n">dt</span><span class="p">)</span> <span class="c1"># Number of steps to generate
</span>
<span class="n">timesteps</span> <span class="o">=</span> <span class="n">evolve_system</span><span class="p">(</span><span class="n">z0</span><span class="p">,</span> <span class="n">steps</span><span class="p">)</span>
<span class="n">timesteps</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">timesteps</span><span class="p">)</span>
</code></pre></div></div>
<p>You can now run any analysis on <code class="language-plaintext highlighter-rouge">timesteps</code>, which will be a Numpy array of <code class="language-plaintext highlighter-rouge">np.complex128</code>. Note also how we had to cast the initial conditions <code class="language-plaintext highlighter-rouge">z0</code> to this <code class="language-plaintext highlighter-rouge">dtype</code>, in order to have strict typing in the JIT-compiled code.</p>
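<p>To see why the cast matters, compare the dtypes (an illustrative sketch; the point is that the JIT-compiled function should always see the same type):</p>

```python
import numpy as np

N = 4
z0 = np.zeros(N).astype(np.complex128)
assert z0.dtype == np.complex128

# Without the cast, np.zeros defaults to float64, and the first complex
# increment would change the dtype mid-loop, confusing Numba's type inference
assert np.zeros(N).dtype == np.float64
assert (np.zeros(N) + 1j).dtype == np.complex128
```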
<p><a href="https://gist.github.com/danielegrattarola/c663346b529e758f0224c8313818ad77">I published the full code as a Gist, including the code I used to make the plots.</a></p>
<h2 id="general-notes-on-performance">General notes on performance</h2>
<p>My original implementation was based on a <code class="language-plaintext highlighter-rouge">Simulator</code> class that implemented all the same methods in a compact abstraction:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Simulator</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">A</span><span class="p">,</span> <span class="n">dt</span><span class="o">=</span><span class="mf">1e-4</span><span class="p">,</span> <span class="n">omega</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">lamb</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">beta</span><span class="o">=</span><span class="mf">0.1</span><span class="p">):</span>
<span class="o">...</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="n">lamb</span><span class="o">=</span><span class="mf">0.</span><span class="p">,</span> <span class="n">omega</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
<span class="o">...</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">delta_wiener</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">dt</span><span class="p">):</span>
<span class="o">...</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">complex_delta_wiener</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">dt</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">step</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">evolve_system</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">z0</span><span class="p">,</span> <span class="n">steps</span><span class="p">):</span>
<span class="o">...</span>
</code></pre></div></div>
<p>There were some issues with this implementation, the biggest one being that it is much messier to JIT-compile an entire class with Numba (the substance of the code did not change much, and I’ve explicitly highlighted all implementation changes above).</p>
<p>Moving to a more functional style feels cleaner and honestly more elegant (opinions, I know). Crucially, it also allowed me to optimize each function to work flawlessly with Numba.</p>
<p>After optimizing all that was optimizable, I tested the old code against the new one and the speedup was about 31x, going from ~8k iterations/s to ~250k iterations/s.</p>
<p>Most of the improvement came from Numba and removing the overhead of Python’s interpreter, but it must be said that the true core of the system is dealt with by Numpy. In fact, as we increase the number of nodes the bottleneck becomes the matrix multiplication in Numpy, eventually leading to virtually no performance difference between using Numba or not (verified for <code class="language-plaintext highlighter-rouge">N=1000</code> - the 31x speedup was for <code class="language-plaintext highlighter-rouge">N=2</code>).</p>
<p><br />
I hope that you enjoyed this post and hopefully learned something new, be it about models of the epileptic brain or Python optimization.</p>
<p>Cheers!</p>
Thu, 03 Oct 2019 00:00:00 +0000
/posts/2019-10-03/epilepsy-model.html
tutorial, code, epilepsy, posts
MinCUT Pooling in Graph Neural Networks
<p><img src="https://danielegrattarola.github.io/images/2019-07-25/horses.png" alt="Embeddings" class="full-width" /></p>
<p>Pooling in GNNs is a fairly complicated task that requires a solid understanding of a graph’s structure in order to work properly.</p>
<p>In <a href="https://arxiv.org/abs/1907.00481">our latest paper</a>, we presented a new pooling method for GNNs, called <strong>minCUT pooling</strong>, which has a lot of desirable properties for pooling methods:</p>
<ol>
<li>it’s based on well-understood theoretical techniques for node clustering;</li>
<li>it’s fully differentiable and learnable with gradient descent;</li>
<li>it depends directly on the task-specific loss on which the GNN is being trained, but …</li>
<li>… it can be trained on its own without a task-specific loss, if needed;</li>
<li>it’s fast.</li>
</ol>
<p>The method is based on the minCUT optimization problem from operations research. We considered a relaxed and differentiable version of minCUT, and implemented it as a neural network layer in order to provide a sound pooling method for GNNs.</p>
<p>In this post I’ll describe the working principles of minCUT pooling and show some applications of the layer.</p>
<!--more-->
<h2 id="background">Background</h2>
<p><img src="https://danielegrattarola.github.io/images/2019-07-25/mincut_problem.png" alt="Embeddings" /></p>
<p>The <a href="https://en.wikipedia.org/wiki/Minimum_k-cut">K-way normalized minCUT</a> is an optimization problem to find K clusters on a graph by minimizing the weight of the edges between different clusters (the cut), normalized by the connectivity within each cluster. This is equivalent to solving:</p>
<script type="math/tex; mode=display">\text{maximize} \;\; \frac{1}{K} \sum_{k=1}^K \frac{\sum_{i,j \in \mathcal{V}_k} \mathcal{E}_{i,j} }{\sum_{i \in \mathcal{V}_k, j \in \mathcal{V} \backslash \mathcal{V}_k} \mathcal{E}_{i,j}},</script>
<p>where <script type="math/tex">\mathcal{V}</script> is the set of nodes, <script type="math/tex">\mathcal{V_k}</script> is the <script type="math/tex">k</script>-th cluster of nodes, and <script type="math/tex">\mathcal{E_{i, j}}</script> indicates a weighted edge between two nodes.</p>
<p>If we define a <strong>cluster assignment matrix</strong> <script type="math/tex">C \in \{0,1\}^{N \times K}</script>, which maps each of the <script type="math/tex">N</script> nodes to one of the <script type="math/tex">K</script> clusters, the problem can also be re-written as:</p>
<script type="math/tex; mode=display">\text{maximize} \;\; \frac{1}{K} \sum_{k=1}^K \frac{C_k^T A C_k}{C_k^T D C_k}</script>
<p>where <script type="math/tex">A</script> is the adjacency matrix of the graph, and <script type="math/tex">D</script> is the diagonal degree matrix.</p>
<p>While finding the optimal minCUT is an NP-hard problem, there exist relaxations that can be leveraged by <a href="https://en.wikipedia.org/wiki/Spectral_clustering">spectral clustering (SC)</a> to find near-optimal solutions in polynomial time. Still, the complexity of SC is on the order of <script type="math/tex">O(N^3)</script> for a graph of <script type="math/tex">N</script> nodes, making it expensive to apply to large graphs.</p>
<p>A possible way of solving this scalability issue is to search for good cluster assignments using SGD, which is the idea on which we based our implementation of minCUT pooling.</p>
<h2 id="mincut-pooling">MinCUT pooling</h2>
<p><img src="https://danielegrattarola.github.io/images/2019-07-25/GNN_pooling.png" alt="Embeddings" /></p>
<p>We designed minCUT pooling to be used in-between message-passing layers of GNNs. The idea is that, like in standard convolutional networks, pooling layers should help the network capture broader patterns in the input data by summarizing local information.</p>
<p>At the core of minCUT pooling there is a MLP, which maps the node features <script type="math/tex">\mathbf{X}</script> to a <strong>continuous</strong> cluster assignment matrix <script type="math/tex">\mathbf{S}</script> (of size <script type="math/tex">N \times K</script>):</p>
<script type="math/tex; mode=display">\mathbf{S} = \textrm{softmax}(\text{ReLU}(\mathbf{X}\mathbf{W}_1)\mathbf{W}_2)</script>
<p>We can then use the MLP to generate <script type="math/tex">\mathbf{S}</script> on the fly, and reduce the graphs with simple multiplications as:</p>
<script type="math/tex; mode=display">\mathbf{A}^{pool} = \mathbf{S}^T \mathbf{A} \mathbf{S}; \;\;\; \mathbf{X}^{pool} = \mathbf{S}^T \mathbf{X}.</script>
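<p>As a concrete illustration of these two multiplications, here is a minimal NumPy sketch, with a random soft assignment matrix <code class="language-plaintext highlighter-rouge">S</code> standing in for the output of the MLP:</p>

```python
import numpy as np

N, F, K = 6, 4, 2                      # nodes, features, clusters
rng = np.random.default_rng(42)

A = rng.integers(0, 2, (N, N))
A = np.triu(A, 1)
A = A + A.T                            # symmetric adjacency, no self-loops
X = rng.random((N, F))                 # node features

logits = rng.random((N, K))
S = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # row-wise softmax

A_pool = S.T @ A @ S                   # K x K pooled adjacency
X_pool = S.T @ X                       # K x F pooled features

assert A_pool.shape == (K, K)
assert X_pool.shape == (K, F)
```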
<p>At this point, we can already make a couple of considerations:</p>
<ol>
<li>Nodes with similar features will likely belong to the same cluster, because they will be “classified” similarly by the MLP. This is especially good when using message-passing layers before pooling, since they will cause the node features of connected nodes to become similar;</li>
<li><script type="math/tex">\mathbf{S}</script> depends only on the features of the graph, making the layer <strong>transferable</strong> to new graphs once it has been trained.</li>
</ol>
<p>This is already pretty good, and it covers some of the main desiderata of a GNN layer, but it still isn’t enough. We want to explicitly account for the connectivity of the graph in order to pool it.</p>
<p>This is where the minCUT optimization comes in.</p>
<p>By slightly adapting the minCUT formulation above, we can design an auxiliary loss to train the MLP, so that it will learn to solve the minCUT problem in an unsupervised way. <br />
In practice, our unsupervised regularization loss encourages the MLP to cluster together nodes that are strongly connected with each other and weakly connected with the nodes in the other clusters.</p>
<p>The full unsupervised loss that we minimize in order to achieve this is:</p>
<script type="math/tex; mode=display">\mathcal{L}_u = \mathcal{L}_c + \mathcal{L}_o =
\underbrace{- \frac{Tr ( \mathbf{S}^T \mathbf{A} \mathbf{S} )}{Tr ( \mathbf{S}^T\mathbf{D} \mathbf{S})}}_{\mathcal{L}_c} +
\underbrace{\bigg{\lVert} \frac{\mathbf{S}^T\mathbf{S}}{\|\mathbf{S}^T\mathbf{S}\|_F} - \frac{\mathbf{I}_K}{\sqrt{K}}\bigg{\rVert}_F}_{\mathcal{L}_o},</script>
<p>where <script type="math/tex">\mathbf{A}</script> is the <a href="https://danielegrattarola.github.io/spektral/utils/convolution/#normalized_adjacency">normalized</a> adjacency matrix of the graph.</p>
<p>Let’s break this loss down and see how it works.</p>
<h3 id="cut-loss">Cut loss</h3>
<p>The first term, <script type="math/tex">\mathcal{L}_c</script>, forces the MLP to find a cluster assignment to solve the minCUT problem (to see why, compare it with the minCUT maximization that we described above). We refer to this loss as the <strong>cut loss</strong>.</p>
<p>In particular, minimizing the numerator leads to clustering together nodes that are strongly connected on the graph, while the denominator prevents any of the clusters from becoming too small.</p>
<p>The cut loss is bounded between -1 and 0, which are <strong>ideally</strong> reached in the following situations:</p>
<ul>
<li><script type="math/tex">\mathcal{L}_c = -1</script> when there are <script type="math/tex">K</script> disconnected components in the graph, and <script type="math/tex">\mathbf{S}</script> exactly maps the <script type="math/tex">K</script> components to the <script type="math/tex">K</script> clusters;</li>
<li><script type="math/tex">\mathcal{L}_c = 0</script> when all pairs of connected nodes are assigned to different clusters;</li>
</ul>
<p>The figure below shows what these situations might look like. Note that in the case <script type="math/tex">\mathcal{L}_c = 0</script> the clustering corresponds to a bipartite grouping of the graph. This could be a desirable outcome for some applications, but the general assumption is that connected nodes should be clustered together, not separated.</p>
<p><img src="/images/2019-07-25/loss_bounds.png" alt="L_c bounds" /></p>
<p>We must also consider that <script type="math/tex">\mathcal{L}_c</script> is non-convex, and that there are spurious minima into which SGD can fall.<br />
For example, for <script type="math/tex">K = 4</script>, the uniform assignment matrix</p>
<script type="math/tex; mode=display">\mathbf{S}_i = (0.25, 0.25, 0.25, 0.25) \;\; \forall i,</script>
<p>would cause the numerator and the denominator of <script type="math/tex">\mathcal{L}_c</script> to be equal, and the loss to be <script type="math/tex">-1</script>.<br />
A similar situation occurs when all nodes in the graph are assigned to the same cluster.</p>
<p>This can be easily verified with Numpy:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">In</span> <span class="p">[</span><span class="mi">1</span><span class="p">]:</span> <span class="c1"># Adjacency matrix
</span> <span class="o">...</span><span class="p">:</span> <span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
<span class="o">...</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span>
<span class="o">...</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]])</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="c1"># Degree matrix
</span> <span class="o">...</span><span class="p">:</span> <span class="n">D</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">diag</span><span class="p">(</span><span class="n">A</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">3</span><span class="p">]:</span> <span class="c1"># Perfect cluster assignment
</span> <span class="o">...</span><span class="p">:</span> <span class="n">S</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]])</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">4</span><span class="p">]:</span> <span class="n">np</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="n">S</span><span class="o">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">A</span> <span class="o">@</span> <span class="n">S</span><span class="p">)</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="n">S</span><span class="o">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">D</span> <span class="o">@</span> <span class="n">S</span><span class="p">)</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">4</span><span class="p">]:</span> <span class="mf">1.0</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">5</span><span class="p">]:</span> <span class="c1"># All nodes uniformly distributed
</span> <span class="o">...</span><span class="p">:</span> <span class="n">S</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span> <span class="o">/</span> <span class="mi">2</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">6</span><span class="p">]:</span> <span class="n">np</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="n">S</span><span class="o">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">A</span> <span class="o">@</span> <span class="n">S</span><span class="p">)</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="n">S</span><span class="o">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">D</span> <span class="o">@</span> <span class="n">S</span><span class="p">)</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">6</span><span class="p">]:</span> <span class="mf">1.0</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">7</span><span class="p">]:</span> <span class="c1"># All nodes in the same cluster
</span> <span class="o">...</span><span class="p">:</span> <span class="n">S</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]])</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">8</span><span class="p">]:</span> <span class="n">np</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="n">S</span><span class="o">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">A</span> <span class="o">@</span> <span class="n">S</span><span class="p">)</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="n">S</span><span class="o">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">D</span> <span class="o">@</span> <span class="n">S</span><span class="p">)</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">8</span><span class="p">]:</span> <span class="mf">1.0</span>
</code></pre></div></div>
<h3 id="orthogonality-loss">Orthogonality loss</h3>
<p>The second term, <script type="math/tex">\mathcal{L}_o</script>, helps to avoid such degenerate minima of <script type="math/tex">\mathcal{L}_c</script> by encouraging the MLP to find clusters that are orthogonal between each other. We call this the <strong>orthogonality loss</strong>.</p>
<p>In other words, <script type="math/tex">\mathcal{L}_o</script> encourages the MLP to “make a decision” about which nodes belong to which clusters, avoiding those degenerate solutions where <script type="math/tex">\mathbf{S}</script> assigns all nodes equally to every cluster. <br />
By adding the orthogonality constraint, we force the MLP to find a non-trivial assignment for the nodes.</p>
<p>Moreover, we can see that <script type="math/tex">\mathcal{L}_o</script> is minimized exactly when <script type="math/tex">\mathbf{S}^T\mathbf{S}</script> is proportional to the identity, i.e., when the columns of <script type="math/tex">\mathbf{S}</script> are mutually orthogonal and have equal norm. Since each row of <script type="math/tex">\mathbf{S}</script> sums to one, this happens when the assignments are (close to) one-hot and the nodes are distributed equally between the <script type="math/tex">K</script> clusters. This causes the MLP to avoid the other type of spurious minima of <script type="math/tex">\mathcal{L}_c</script>, where all nodes are in a single cluster.</p>
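<p>A quick NumPy check of <script type="math/tex">\mathcal{L}_o</script> on a balanced one-hot assignment and on the degenerate assignments discussed above:</p>

```python
import numpy as np

def ortho_loss(S):
    """Orthogonality loss: || S^T S / ||S^T S||_F - I_K / sqrt(K) ||_F."""
    K = S.shape[1]
    SS = S.T @ S
    return np.linalg.norm(SS / np.linalg.norm(SS) - np.eye(K) / np.sqrt(K))

# Balanced one-hot assignment (2 nodes per cluster): S^T S is proportional
# to the identity, so the loss is exactly zero
S_balanced = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])

# Degenerate assignments that fool L_c but are penalized by L_o
S_uniform = np.ones((4, 2)) / 2                               # uniform spread
S_collapsed = np.array([[1., 0.], [1., 0.], [1., 0.], [1., 0.]])  # one cluster

assert np.isclose(ortho_loss(S_balanced), 0.0)
assert ortho_loss(S_uniform) > 0.5
assert ortho_loss(S_collapsed) > 0.5
```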
<h2 id="interaction-of-the-two-losses">Interaction of the two losses</h2>
<p><img src="/images/2019-07-25/cora_mc_loss+nmi.png" alt="Loss terms" /></p>
<p>We can see how the two loss terms interact with each other to find a good solution to the cluster assignment problem.
The figure above shows the evolution of the unsupervised loss as the network is trained to cluster the nodes of Cora (plot on the left). We can see that as the network is trained, the normalized mutual information (NMI) score of the clustering improves, meaning that the layer is able to find meaningful clusters (plot on the right).</p>
<p>Note how <script type="math/tex">\mathcal{L}_c</script> starts from a trivial assignment (-1) due to the random initialization, and then moves away from the spurious minima as the orthogonality loss forces the MLP towards more sensible solutions.</p>
<h3 id="pooled-graph">Pooled graph</h3>
<p>As a further consideration, we can take a closer look at the pooled adjacency matrix <script type="math/tex">\mathbf{A}^{pool}</script>. <br />
First of all, we can see that it is a <script type="math/tex">K \times K</script> matrix that contains the number of links connecting each cluster. For example, the entry <script type="math/tex">\mathbf{A}^{pool}_{1,\;2}</script> contains the number of links between the nodes in cluster 1 and cluster 2, while the entry <script type="math/tex">\mathbf{A}^{pool}_{1,\;1}</script> is the number of links between the nodes in cluster 1. <br />
We can also see that the trace of <script type="math/tex">\mathbf{A}^{pool}</script> is exactly the numerator that is being minimized in <script type="math/tex">\mathcal{L}_c</script>. Therefore, we can expect the diagonal elements <script type="math/tex">\mathbf{A}^{pool}_{i,\;i}</script> to be much larger than the other entries of <script type="math/tex">\mathbf{A}^{pool}</script>.</p>
<p>For this reason, <script type="math/tex">\mathbf{A}^{pool}</script> will represent a graph with very strong self-loops, and the message-passing layers after pooling will have a hard time propagating information on the graph (because the self-loops will keep sending the information of a node back onto itself, and not its neighbors).</p>
<p>To address this problem, a solution is to remove the diagonal of <script type="math/tex">\mathbf{A}^{pool}</script> and renormalize the matrix by its degree, before giving it as output of the pooling layer:</p>
<script type="math/tex; mode=display">\hat{\mathbf{A}} = \mathbf{A}^{pool} - \mathbf{I}_K \cdot diag(\mathbf{A}^{pool}); \;\; \tilde{\mathbf{A}}^{pool} = \hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-\frac{1}{2}}</script>
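<p>A minimal NumPy sketch of this post-processing step (the <code class="language-plaintext highlighter-rouge">eps</code> guard against empty clusters is my own addition, not part of the formula above):</p>

```python
import numpy as np

def postprocess_pooled(A_pool, eps=1e-9):
    # Zero out the (dominant) diagonal, then symmetrically normalize by degree
    A_hat = A_pool - np.diag(np.diag(A_pool))
    d = A_hat.sum(-1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + eps))  # eps guards isolated clusters
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

A_pool = np.array([[10., 1.], [1., 8.]])  # strong self-loops on the diagonal
A_out = postprocess_pooled(A_pool)

assert np.allclose(np.diag(A_out), 0.0)   # self-loops removed
assert np.allclose(A_out, [[0., 1.], [1., 0.]], atol=1e-4)
```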
<p>Our recommendation is to combine minCUT with message-passing layers that have a built-in skip connection, in order to bring each node’s information forward (e.g., Spektral’s <a href="https://danielegrattarola.github.io/spektral/layers/convolution/#graphconvskip">GraphConvSkip</a>).
However, if your GNN is based on the <a href="https://danielegrattarola.github.io/spektral/layers/convolution/#graphconv">graph convolutional networks (GCN)</a> of <a href="https://arxiv.org/abs/1609.02907">Kipf & Welling</a>, you may want to manually re-compute the normalized Laplacian after pooling in order to add the self-loops back.</p>
<h3 id="notes-on-gradient-flow">Notes on gradient flow</h3>
<p><img src="/images/2019-07-25/mincut_layer.png" alt="mincut scheme" /></p>
<p>A couple of notes for the gradient-heads out there.</p>
<p>The unsupervised loss <script type="math/tex">\mathcal{L}_u</script> can be optimized on its own, adapting the weights of the MLP to compute an <script type="math/tex">\mathbf{S}</script> that solves the minCUT problem under the orthogonality constraint.</p>
<p>However, given the multiplicative interaction between <script type="math/tex">\mathbf{S}</script> and <script type="math/tex">\mathbf{X}</script>, the gradient can also flow from the task-specific loss (i.e., whatever the GNN is being trained to do) through the MLP. We can see in the picture above how there is a path going from the input <script type="math/tex">\mathbf{X}^{(t+1)}</script> to the output <script type="math/tex">\mathbf{X}_{\textrm{pool}}^{(t+1)}</script>, directly passing through the MLP.</p>
<p>This means that the overall solution found by the GNN will take into account both the graph structure (to solve minCUT) and the final task.</p>
<h2 id="code">Code</h2>
<p>Implementing minCUT in TensorFlow is fairly straightforward. Let’s start with some setup:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">from</span> <span class="nn">tensorflow.keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span>
<span class="n">A</span> <span class="o">=</span> <span class="o">...</span> <span class="c1"># Adjacency matrix (N x N)
</span> <span class="n">X</span> <span class="o">=</span> <span class="o">...</span> <span class="c1"># Node features (N x F)
</span> <span class="n">n_clusters</span> <span class="o">=</span> <span class="o">...</span> <span class="c1"># Number of clusters to find with minCUT
</span></code></pre></div></div>
<p>First, the layer computes the cluster assignment matrix <code class="language-plaintext highlighter-rouge">S</code> by applying a softmax MLP to the node features:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">H</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)(</span><span class="n">X</span><span class="p">)</span>
<span class="n">S</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="n">n_clusters</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">)(</span><span class="n">H</span><span class="p">)</span> <span class="c1"># Cluster assignment matrix
</span></code></pre></div></div>
<p>The cut loss is then implemented as:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Cut loss
</span><span class="n">A_pool</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span>
<span class="n">tf</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">S</span><span class="p">)),</span> <span class="n">S</span>
<span class="p">)</span>
<span class="n">num</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="n">A_pool</span><span class="p">)</span>
<span class="n">D</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">diag</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">reduce_sum</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">))</span> <span class="c1"># Degree matrix (N x N); must be a matrix for matmul below</span>
<span class="n">D_pooled</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span>
<span class="n">tf</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">D</span><span class="p">,</span> <span class="n">S</span><span class="p">)),</span> <span class="n">S</span>
<span class="p">)</span>
<span class="n">den</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="n">D_pooled</span><span class="p">)</span>
<span class="n">mincut_loss</span> <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="n">num</span> <span class="o">/</span> <span class="n">den</span><span class="p">)</span>
</code></pre></div></div>
<p>And the orthogonality loss is implemented as:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Orthogonality loss
</span><span class="n">St_S</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="n">S</span><span class="p">),</span> <span class="n">S</span><span class="p">)</span>
<span class="n">I_S</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">eye</span><span class="p">(</span><span class="n">n_clusters</span><span class="p">)</span>
<span class="n">ortho_loss</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span>
<span class="n">St_S</span> <span class="o">/</span> <span class="n">tf</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">St_S</span><span class="p">)</span> <span class="o">-</span> <span class="n">I_S</span> <span class="o">/</span> <span class="n">tf</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">I_S</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Finally, the full unsupervised loss of the layer is obtained as the sum of the two auxiliary losses:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">total_loss</span> <span class="o">=</span> <span class="n">mincut_loss</span> <span class="o">+</span> <span class="n">ortho_loss</span>
</code></pre></div></div>
<p>The actual pooling step is a simple multiplication of <code class="language-plaintext highlighter-rouge">S</code> with <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">X</code>; we then zero out the diagonal of <code class="language-plaintext highlighter-rouge">A_pool</code> and re-normalize the matrix. Since we already computed <code class="language-plaintext highlighter-rouge">A_pool</code> for the numerator of <script type="math/tex">\mathcal{L}_c</script>, we only need to do:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Pooling node features
</span><span class="n">X_pool</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="n">S</span><span class="p">),</span> <span class="n">X</span><span class="p">)</span>
<span class="c1"># Zeroing out the diagonal
</span><span class="n">A_pool</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">set_diag</span><span class="p">(</span><span class="n">A_pool</span><span class="p">,</span> <span class="n">tf</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">A_pool</span><span class="p">)[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]))</span> <span class="c1"># Remove diagonal
</span>
<span class="c1"># Normalizing A_pool
</span><span class="n">D_pool</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">reduce_sum</span><span class="p">(</span><span class="n">A_pool</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">D_pool</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">D_pool</span><span class="p">)[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">+</span> <span class="mf">1e-12</span> <span class="c1"># Add epsilon to avoid division by 0
</span><span class="n">A_pool</span> <span class="o">=</span> <span class="p">(</span><span class="n">A_pool</span> <span class="o">/</span> <span class="n">D_pool</span><span class="p">)</span> <span class="o">/</span> <span class="n">tf</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="n">D_pool</span><span class="p">)</span>
</code></pre></div></div>
<p>Wrap this up in a layer, and use the layer in a GNN. Done.</p>
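To make the "wrap this up" step concrete, here is a hedged NumPy sketch that chains the snippets above into a single forward function. The helper name, shapes, and dense-matrix assumption are mine; a real layer would of course be differentiable (e.g., a tf.keras layer registering the loss via add_loss):

```python
import numpy as np

def mincut_pool(A, X, S):
    """One minCUT pooling step (NumPy sketch of the TensorFlow snippets above).

    A: (N, N) adjacency, X: (N, F) node features, S: (N, K) soft assignments.
    Returns pooled adjacency, pooled features, and the unsupervised loss.
    """
    A_pool = S.T @ A @ S                                   # pooled adjacency
    D = np.diag(A.sum(-1))                                 # degree matrix
    mincut_loss = -np.trace(A_pool) / np.trace(S.T @ D @ S)
    StS = S.T @ S
    I_K = np.eye(S.shape[1])
    ortho_loss = np.linalg.norm(StS / np.linalg.norm(StS) - I_K / np.linalg.norm(I_K))
    X_pool = S.T @ X                                       # pooled features
    A_out = A_pool.copy()
    np.fill_diagonal(A_out, 0.0)                           # zero out the diagonal
    d = np.sqrt(A_out.sum(-1))[:, None] + 1e-12            # sqrt degrees + epsilon
    A_out = A_out / d / d.T                                # symmetric re-normalization
    return A_out, X_pool, mincut_loss + ortho_loss
```

As a sanity check, pooling two disconnected triangles with a perfect hard assignment should give a minCUT loss of exactly -1 (all edges are within clusters) and zero orthogonality loss.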
<h2 id="experiments">Experiments</h2>
<h3 id="unsupervised-clustering">Unsupervised clustering</h3>
<p>Because the core of minCUT pooling is an unsupervised loss that does not require labeled data in order to be minimized, we can optimize <script type="math/tex">\mathcal{L}_u</script> on its own to test the clustering ability of minCUT.</p>
<p>A good first test is to check whether the layer is able to cluster a grid (the clusters should all turn out to have the same size), and to isolate communities in a network.
We see in the figure below that minCUT was able to do this perfectly.</p>
<p><img src="/images/2019-07-25/regular_clustering.png" alt="Clustering with minCUT pooling" /></p>
<p>To make things more interesting, we can also test minCUT on the task of graph-based image segmentation. We can build a <a href="https://scikit-image.org/docs/dev/auto_examples/segmentation/plot_rag.html">region adjacency graph</a> from a natural image, and cluster its nodes to see whether regions with similar colors end up grouped together. <br />
The results look nice, and remember that this was obtained by only optimizing <script type="math/tex">\mathcal{L}_u</script>!</p>
<p><img src="/images/2019-07-25/horses.png" alt="Horse segmentation with minCUT pooling" /></p>
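For intuition, the region adjacency graph itself is simple to build: nodes are segments, and edges connect segments whose pixels touch. A toy NumPy sketch from a label image (the actual pipeline would use scikit-image's RAG utilities; this helper is mine):

```python
import numpy as np

def region_adjacency(labels):
    """Region-adjacency matrix from a 2-D array of segment labels.

    Two segments are connected if any of their pixels are 4-neighbours.
    """
    labels = np.asarray(labels)
    n = labels.max() + 1
    A = np.zeros((n, n), dtype=int)
    # Compare horizontally and vertically adjacent pixel pairs
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        mask = a != b
        A[a[mask], b[mask]] = 1
        A[b[mask], a[mask]] = 1
    return A

segments = np.array([[0, 0, 1],
                     [0, 2, 1],
                     [2, 2, 1]])
A = region_adjacency(segments)
```

In a real pipeline the segments would come from an over-segmentation such as SLIC, with each region's mean color attached as its node features.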
<p>Finally, we also checked the clustering abilities of minCUT pooling on the popular citations datasets: Cora, Citeseer, and Pubmed.
As mentioned before, we used the Normalized Mutual Information (NMI) score to test whether the layer was clustering together nodes of the same class. Note that the layer did not have access to the labels during training (meaning that we didn’t need to decide how to split the data into train and test sets, which is a known issue in the GNN community).</p>
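For reference, NMI can be computed directly from the contingency table between true classes and predicted clusters. Here is a NumPy sketch using the geometric-mean normalization (one common convention; the helper name is mine):

```python
import numpy as np

def nmi(y_true, y_pred):
    """NMI between two label assignments (geometric-mean normalization)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = y_true.size
    # Contingency table between classes and clusters
    _, ci = np.unique(y_true, return_inverse=True)
    _, cj = np.unique(y_pred, return_inverse=True)
    C = np.zeros((ci.max() + 1, cj.max() + 1))
    np.add.at(C, (ci, cj), 1)
    P = C / n                            # joint distribution
    pi = P.sum(axis=1, keepdims=True)    # marginal of the true labels
    pj = P.sum(axis=0, keepdims=True)    # marginal of the clusters
    nz = P > 0
    mi = np.sum(P[nz] * np.log(P[nz] / (pi @ pj)[nz]))  # mutual information
    h_i = -np.sum(pi[pi > 0] * np.log(pi[pi > 0]))      # entropy of labels
    h_j = -np.sum(pj[pj > 0] * np.log(pj[pj > 0]))      # entropy of clusters
    return mi / np.sqrt(h_i * h_j)
```

A perfect clustering scores 1 even if the cluster indices are permuted, which is exactly why NMI is convenient for unsupervised evaluation.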
<p>You can check <a href="https://arxiv.org/abs/1907.00481">the paper</a> to see how minCUT fared in comparison to other methods, but in short: it did well, sometimes by a full order of magnitude better than previous methods.</p>
<h3 id="autoencoder">Autoencoder</h3>
<p>Another interesting unsupervised test that we came up with was to check how much information is preserved in the coarsened graph after pooling.
To do this, we built a simple graph autoencoder with the structure pictured below:</p>
<p><img src="/images/2019-07-25/ae.png" alt="unsupervised reconstruction with AE" /></p>
<p>The “Unpool” layer is simply obtained by transposing the same <script type="math/tex">\mathbf{S}</script> found by minCUT, in order to upscale the graph instead of downscaling it:</p>
<script type="math/tex; mode=display">\mathbf{A}^\text{unpool} = \mathbf{S} \mathbf{A}^\text{pool} \mathbf{S}^T; \;\; \mathbf{X}^\text{unpool} = \mathbf{S}\mathbf{X}^\text{pool}.</script>
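A minimal NumPy sketch of the two unpooling equations, with random placeholder matrices standing in for the real pooled outputs and assignment matrix:

```python
import numpy as np

# Hypothetical sizes: N nodes, K clusters, F features
N, K, F = 6, 2, 3
rng = np.random.default_rng(0)

# Placeholder soft assignment matrix S (rows sum to 1) and pooled outputs
S = rng.random((N, K))
S /= S.sum(-1, keepdims=True)
A_pool = rng.random((K, K))
X_pool = rng.random((K, F))

# Unpooling: broadcast each cluster's connectivity and features back to its nodes
A_unpool = S @ A_pool @ S.T   # (N x N)
X_unpool = S @ X_pool         # (N x F)
```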
<p>We tested the graph AE on some very regular graphs, which should have been easy to reconstruct after pooling. Surprisingly, this turned out to be a difficult problem for some pooling layers from the GNN literature. MinCUT, on the other hand, held up quite nicely.</p>
<p><img src="/images/2019-07-25/reconstructions.png" alt="unsupervised reconstruction with AE" /></p>
<h3 id="supervised-inductive-tasks">Supervised inductive tasks</h3>
<p>Finally, we tested whether minCUT provides an improvement on the usual graph classification and graph regression tasks. <br />
We picked a fixed GNN architecture, and tested several pooling strategies by swapping the pooling layers in the network.</p>
<p>The datasets that we used were:</p>
<ol>
<li><a href="https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets">The Benchmark Data Sets for Graph Kernels</a>;</li>
<li><a href="https://github.com/FilippoMB/Benchmark_dataset_for_graph_classification">A synthetic dataset created by F. M. Bianchi to test GNNs</a>;</li>
<li><a href="http://quantum-machine.org/datasets/">The QM9 dataset for the prediction of chemical properties of molecules</a>.</li>
</ol>
<p>I’m not going to report the comparisons with other methods, but I will highlight an interesting sanity check that we performed in order to see whether using GNNs and graph pooling made sense at all.</p>
<p>Among the various methods that we tested, we also included:</p>
<ol>
<li>A simple MLP which did not exploit the relational information carried by the graphs;</li>
<li>The same GNN architecture without pooling layers.</li>
</ol>
<p>We were once again surprised to see that, while minCUT yielded a consistent improvement over such simple baselines, other pooling methods did not.</p>
<h2 id="conclusions">Conclusions</h2>
<p>Working on minCUT pooling was an interesting experience that deepened my understanding of GNNs, and allowed me to see what is really necessary for a GNN to work.</p>
<p>We have put the paper <a href="https://arxiv.org/abs/1907.00481">on arXiv</a>, and I’m going to release an official implementation of the layer on <a href="https://danielegrattarola.github.io/spektral/layers/pooling/">Spektral</a> soon.</p>
<p>If you want to build on our work and use minCUT in your own GNNs, you can cite us with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{bianchi2019mincut,
title={Mincut Pooling in Graph Neural Networks},
author={Filippo Maria Bianchi and Daniele Grattarola and Cesare Alippi},
journal={arXiv preprint arXiv:1907.00481},
year={2019}
}
</code></pre></div></div>
<p>Cheers!</p>
Thu, 25 Jul 2019 00:00:00 +0000
/posts/2019-07-25/mincut-pooling.html
AIGNNpoolingpostsDetecting Hostility from Skeletal Graphs Using Non-Euclidean Embeddings<p>The first paper on which I worked during my PhD is about <a href="https://arxiv.org/abs/1805.06299">detecting changes in sequences of graphs using non-Euclidean geometry and adversarial autoencoders</a>. As a real-world application of the method presented in the paper, we showed that we could detect epileptic seizures in the brain, by monitoring a stream of functional connectivity brain networks.</p>
<p>In general, the methodology presented in the paper can work for any data that:</p>
<ol>
<li>can be represented as graphs;</li>
<li>has a temporal dimension;</li>
<li>has a change that you want to identify somewhere along the stream of data;</li>
<li>has i.i.d. samples.</li>
</ol>
<p>There are <a href="https://icon.colorado.edu/#!/networks">a lot</a> of temporal networks that can be found in the wild, but not many datasets respect all the requirements at the same time. What’s more, many public datasets have very few samples along the temporal axis. <!--more-->
Recently, however, I was looking for some nice graph classification dataset on which to test <a href="https://danielegrattarola.github.io/spektral">Spektral</a>, and I stumbled upon the <a href="http://rose1.ntu.edu.sg/datasets/actionrecognition.asp">NTU RGB+D</a> dataset released by the Nanyang Technological University of Singapore.<br />
The dataset consists of about 60 thousand video clips of people performing everyday actions, including mutual actions and some health-related ones. The reason why I found this dataset is that it contains skeletal annotations for each frame of each video clip, meaning lots and lots of graphs that <a href="https://arxiv.org/abs/1801.07455">can be used for graph classification</a>.</p>
<h2 id="ntu-rgbd-for-change-detection">NTU RGB+D for change detection</h2>
<p><img src="https://danielegrattarola.github.io/images/2019-04-13/graphs.svg" alt="graphs" title="Figure 1: examples of hugging and punching graphs." class="threeq-width" /></p>
<p>While reading through the website, however, I realized that this dataset could actually be a good playground for our change detection methodology as well, because it respects almost all requirements:</p>
<ol>
<li>it has graphs;</li>
<li>it has a temporal dimension;</li>
<li>it has classes, which can be easily converted to what we called the <em>regimes</em> of our graph streams;</li>
</ol>
<p>The fourth requirement of having i.i.d. samples is due to the nature of the change detection test that we adopted in the paper. The test is able to detect changes in stationarity of a stochastic process, which means that it can tell whether the samples coming from the process have been drawn from a different distribution than the one observed during training. <br />
In order to do so, the test needs to estimate whether a window of observations from the process is significantly different from what was observed in the nominal regime. This requires having i.i.d. samples in each window.</p>
<p>By their very nature, however, the graphs in NTU RGB+D are definitely not i.i.d. (they would have been, had the subjects been recorded under a strobe light – dammit!).<br />
There are several ways of converting a heavily autocorrelated signal to a stationary one, the simplest being to shuffle the samples along the time axis.
The piece-wise stationarity requirement is a very strong one, and we are looking into relaxing it, but for testing the method on NTU RGB+D we had to stick with it.</p>
<h2 id="setting">Setting</h2>
<p>Defining the change detection problem is easy: have a nominal regime of neutral or positive actions like walking, reading, taking a selfie, or being at the computer, and try to detect when the regime changes to a negative action like falling down, getting in fights with people, or feeling sick (there are at least 5 action classes of people acting hurt or sick in NTU RGB+D).</p>
<p>Applications of this could include:</p>
<ul>
<li>monitoring children and elderly people when they are alone;</li>
<li>detecting violence in at-risk, crowded situations;</li>
<li>detecting when a driver is distracted;</li>
</ul>
<p>In all of these situations, you might have a pretty good idea of what you <em>want</em> to be happening at a given time, but have no way of knowing how things could go wrong.</p>
<p>We chose the “hugging” action for the nominal, all-is-well regime, and we took the “punching/slapping” class to symbolize any unexpected, undesirable behaviour that deviates from our concept of nominal.
Then, we trained our adversarial autoencoder to represent points on an ensemble of constant-curvature manifolds, and we ran the change detection test.
At this point, it would probably help if one was familiar with the details of <a href="https://arxiv.org/abs/1805.06299">the paper</a>. In short, what we do is:</p>
<ol>
<li>take an adversarial graph autoencoder (AAE);</li>
<li>train the AAE on the nominal samples that you have at training time;</li>
<li>impose a geometric regularization onto the latent space of the AAE, so that the embeddings will lie on a Riemannian constant-curvature manifold (CCM).<br />
This happens in one of two ways:
<ol>
<li>use a prior distribution with support on the CCM to train the AAE;</li>
<li>make the encoder maximise the membership of its embeddings to the CCM (this is the one we use for this experiment);</li>
</ol>
</li>
<li>use the trained AAE to represent incoming graphs on the CCM;</li>
<li>run the change detection test on the CCM;</li>
</ol>
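The last step is easiest to picture with a toy accumulator. This is not the R-CDT from the paper, just a generic CUSUM-flavored sketch on a stream of 1-D statistics (e.g., distances of incoming embeddings from the nominal mean on the CCM); the parameter values are made up:

```python
def cusum(stats, drift, threshold):
    """Toy CUSUM accumulator: raise an alarm when the cumulative excess
    of the statistic over `drift` crosses `threshold`."""
    g = 0.0
    for t, s in enumerate(stats):
        g = max(0.0, g + s - drift)   # accumulate only positive excess
        if g > threshold:
            return t                  # time step at which the alarm is raised
    return None                       # no change detected

# Small statistics in the nominal regime, larger ones after a change at t=20
stream = [0.1] * 20 + [1.0] * 20
alarm = cusum(stream, drift=0.3, threshold=2.0)
```

The accumulator stays at zero while the statistic is below the drift term, then grows steadily after the change, which is the behaviour shown in Figure 3.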
<p><img src="https://danielegrattarola.github.io/images/2019-04-13/embeddings.svg" alt="embeddings" title="Figure 2: embeddings produced by the AAE on the three different CCMs. Blue for hugging, orange for punching." class="full-width" /></p>
<p>This procedure can be adapted to learn a representation on more than one CCM at a time, by having parallel latent spaces for the AAE. This worked pretty well in the paper, so we tried the same here.
We also chose one of the two types of change detection tests that we introduced in the paper, namely the one we called <em>Riemannian</em>, because it gave us the best results on the seizure detection problem.</p>
<h2 id="results">Results</h2>
<p>Running the whole method on the stream of graphs gave us very nice results. We were able to recognize the change from friendly to violent interactions in most experiments, although sometimes the autoencoder failed to capture the differences between the two regimes (and consequently, the CDT couldn’t pick up the change).</p>
<p><img src="https://danielegrattarola.github.io/images/2019-04-13/accumulator.svg" alt="accumulator" title="Figure 3: accumulators of R-CDT (see the paper) for the three CCMs. The change is marked with the red line, the decision threshold with the green line. " class="full-width" /></p>
<p>An interesting thing that we observed is that when using an ensemble of three different geometries, namely spherical, hyperbolic, and Euclidean, the change would only show up in the spherical CCM.
This was a consistent result that gave us yet another confirmation of two things:</p>
<ol>
<li>assuming Euclidean geometry for the latent space is not always a good idea;</li>
<li>our idea of learning a representation on multiple CCMs at the same time worked as expected. Originally, we suggested this trick to potential adopters of our CDT methodology, so that they would not have to guess the best geometry for the representation. Now, we have the confirmation that it is indeed a good idea, because the AAE will choose the best geometry for the task on its own.</li>
</ol>
<p>Figure 2 above (hover over the images to see the captions) shows the embeddings produced by the encoder on the test stream of graphs. Figure 3 shows the three <em>accumulators</em> used in the change detection test to decide whether or not to raise an alarm indicating that a change occurred.
In both pictures, the decision for raising an alarm is informed almost exclusively by the spherical CCM.</p>
<h2 id="conclusions">Conclusions</h2>
<p>That’s all, folks!<br />
This was a pretty little experiment to run, and it gave us further insights into the world of non-Euclidean neural networks. We have actually <a href="https://arxiv.org/abs/1805.06299">updated the paper</a> with the findings of this new experiment, and you can also try and play with our algorithm using the <a href="https://github.com/danielegrattarola/cdt-ccm-aae">code on Github</a> (the code there is for the synthetic experiments of the paper, but you can adapt it to any dataset easily).</p>
<p>If you want to mention our CDT strategy in your work, you can cite:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{grattarola2018change,
title={Change Detection in Graph Streams by Learning Graph Embeddings on Constant-Curvature Manifolds},
author={Grattarola, Daniele and Zambon, Daniele and Livi, Lorenzo and Alippi, Cesare},
journal={IEEE Transactions on Neural Networks and Learning Systems},
year={2019},
doi={10.1109/TNNLS.2019.2927301}
}
</code></pre></div></div>
<p>Cheers!</p>
Sat, 13 Apr 2019 00:00:00 +0000
/posts/2019-04-13/hostility-detection.html
AIexperimentnon-euclideanpostsLas Torres de Eivissa<p><img src="https://danielegrattarola.github.io/images/2018-10-14/1.jpeg" alt="Torre d'En Rovira" class="full-width" /></p>
<p><strong>La Torre d’En Rovira.</strong></p>
<p>We walk under the scorching sun for two hours, in and out of the pine groves where old hippies live in old trucks, not knowing where we’re going except for the fact that we’re moving North. We’re looking for a place to escape the crowded August of the island.<br />
<!--more-->
At the end of the forest, we come to a clearing on the cliff. I realize that we can’t hear the cicadas anymore, just the sea against the rocks, and our shoes against the ground. The sun is a physical presence upon us.<br />
The tower stands 500 meters away. I’d seen it as a kid, when my mother used to take my brother and me on long adventures on the cliffs, but this time is different.<br />
I feel its presence. I feel nature around me. The smell of the trees, which is the first thing that I notice when I get here, the first thing that I miss when I’m gone. The three southernmost Illas, watching us silently from afar.</p>
<p><img src="https://danielegrattarola.github.io/images/2018-10-14/2.jpeg" alt="Torre del Carregador de sa Sal Rossa" /></p>
<p><strong>La Torre del Carregador de sa Sal Rossa.</strong></p>
<p>We’re now on a quest. See them all, complete the circle around the Island, and claim its most intimate knowledge as ours. <br />
We go near Eivissa, floating above the parties, and the crowds, and the cars, and the hotels that were made to resemble Miami. The tower is a touristic site, guarded by a man with sad eyes. We climb at the top, look down at the beach. <br />
This place has taught me that everything is dual.
The sky and the sea. The chaos of the day and the silence of the night. The rush of humanity and the peace below the water. The true nature of this island lies in accepting the duality as a whole.</p>
<p><img src="https://danielegrattarola.github.io/images/2018-10-14/3.jpeg" alt="Torre del Cap de Campanitx" /></p>
<p><strong>La Torre del Cap de Campanitx.</strong></p>
<p>We walk for one minute on a well-kept dirt road, there’s a house built so far over the cliff that it looks like it’s about to dive in the foam below. The easiest conquest so far. <br />
There was a ladder to access the tower, long ago. Soldiers would retract it when under attack, even though not a single enemy was ever seen off the coasts of the Island since the towers were built.<br />
We try to climb in, but the door is locked and rusty. It’s more like a gentle request to not get in, rather than a prohibition. <br />
I turn my head, and see an island on the sea in front of us. Quiet. Covered in trees. Treasures are probably hidden in every corner of it, buried by some old pirate in the 1600s. Looking it up, I find out that the island is owned by a politician, he built a resort for celebrities on the side that we cannot see from the coast. Duality.</p>
<p><img src="https://danielegrattarola.github.io/images/2018-10-14/4.jpeg" alt="Torre del Port de Portinatx" /></p>
<p><strong>La Torre del Port de Portinatx.</strong></p>
<p>The northernmost point of the island. I go around the perimeter of the tower, assessing the damage, the imminent collapse. But the tower stands, for now. It stands, forgotten by most, hidden by the forest, surrounded by civilization.
The missing stone bricks make a perfect ladder to get inside, but we are afraid that the tower may not tolerate visitors.</p>
<p><img src="https://danielegrattarola.github.io/images/2018-10-14/5.jpeg" alt="Torre des Molar" /></p>
<p><strong>La Torre des Molar.</strong></p>
<p>From the northwestern town of San Miguel, we drive atop a low mountain. <em>“At what point does a hill become a mountain?”</em><br />
The road goes from asphalt to dirt, then from dirt to mud, then from mud to rocks and trees, until our car is about to roll over. We hear dogs barking nearby, someone must be living on the cliff, like in Cala Conta.
We go up, take in the scenery. A lone lizard comes to us to say hi.<br />
The tower stands guard over the port of San Miguel, far below us.</p>
<p><img src="https://danielegrattarola.github.io/images/2018-10-14/6.jpeg" alt="Torre de Ses Portes" /></p>
<p><strong>La Torre de Ses Portes.</strong></p>
<p>At the back of the beach there are sand dunes, covered in sea daffodils and rosemary.
Each step ends up twenty centimeters lower than you’d expect, swallowed by the soft ground. I am hypnotized as the daffodils slowly go by.
We sweat our way to the rocks at the end of the beach. As the sun rises from the Mediterranean to our left, the tower sits solemnly in the distance. I feel the wind coming from the land.<br />
Four stories high, the biggest one. At one point in time, this tower hosted eight military men, with cannons, bullets, guns, and food to last months. Once you got in, there was no reason to get out and abandon its safety.<br />
We dive in the crystal clear water for a while. Only one left.</p>
<p><img src="https://danielegrattarola.github.io/images/2018-10-14/7.jpeg" alt="Torre des Savinar" /></p>
<p><strong>La Torre des Savinar.</strong></p>
<p>My mind palace, my home, my dream, my safe harbor. The tower stands <em>below</em> us.
We see Es Vedrà. There is no place for sadness, or fear, or indifference. <br />
I go and complete the mission, crawling under the plants and stepping down the rocks. I circle around the base, watching our final conquest as if it were the first time. I hesitate a full minute before touching its sand-colored bricks. What if the world can exist only as long as nobody has touched all the seven towers of Eivissa?</p>
<p>The sun is setting.</p>
<hr />
<p><strong>Ninth, fourteenth, sixty-second</strong></p>
<p>There are actually nine towers in Eivissa. Two were integrated in the churches of Sant Antoni and Santa Eularia as the years passed. The one in Sant Antoni is the only one with a square base.
Four more in Formentera. One in S’Espalmador, but the island is a national park and cannot be explored.
Forty-eight more were spread around the Pitiüses, built by natives in the inner parts of the islands. Some of them don’t exist anymore, but the majority is still there. They are barely big enough to fit a family of four inside.</p>
<p><img src="https://danielegrattarola.github.io/images/2018-10-14/8.jpeg" alt="Sea daffodils" /></p>
Sun, 14 Oct 2018 00:00:00 +0000
/posts/2018-10-14/torres.html
travelpostsGraph Embeddings on Constant-Curvature Manifolds for Change Detection<p><img src="https://danielegrattarola.github.io/images/2018-06-07/embeddings_plot.png" alt="Embeddings" class="full-width" /></p>
<p>When considering relational problems, the temporal dimension is often crucial to understand whether the process behind the relational graph is evolving, and how; think how often people follow and unfollow each other on Instagram, how the type of content in one’s posts may change over time, and how all of these aspects are echoed throughout the network, interacting with one another in complex ways.</p>
<p>While most works that apply deep learning to temporal networks are focused on the evolution of the graph at the node or edge level, it is extremely interesting to study a graph-based process from a global perspective, at the graph level, to detect trends and changes in the process itself.
<!--more--></p>
<p>This post is a simplified version of <a href="https://arxiv.org/abs/1805.06299">this paper</a>, so you might want to have a look at either one before reading the other.</p>
<h2 id="be-the-change-you-want-to-see-in-the-process">Be the change you want to see in the process</h2>
<p>We consider a process generating a stream of graphs (e.g. hourly snapshots of a power supply grid), and we make the assumption of having two <em>states</em>: a nominal regime (when the grid is stable) and a non-nominal regime (when there’s about to be a blackout). We know what the graphs look like in the nominal state, and we can use a dataset of nominal graphs to train our models, but we cannot know what the non-nominal state will look like: the goal is to discriminate between the two regimes, regardless.<br />
If we consider this as a one-class classification problem, then we have <em>anomaly detection</em>; if we take into account the temporal dimension, then the task is to detect whether the process has permanently shifted to non-nominal, and we have the <em>change detection</em> problem, on which we focus here.</p>
<p>Change detection can be easily formulated in statistical terms by considering two unknown nominal and non-nominal distributions driving a process, and running tests that can tell us whether the process is operating in one regime or the other.
When dealing with graphs, however, things get a bit more complicated.<br />
While detecting changes in stationarity directly on the graph space is possible, it is also analytically complex. In particular, since most graph distances are non-metric, the resulting non-Euclidean geometry of the space is often unknown, making it much harder to apply standard statistical tools. Even if we consider better-behaved metric distances, the computational complexity of dealing with the graph space is often intractable.<br />
A common approach to circumvent this issue, then, is to represent the graphs in a simpler space via graph embedding.</p>
<h2 id="enter-representation-learning">Enter representation learning</h2>
<p>The key idea behind our approach is the following: we train a <a href="https://arxiv.org/abs/1802.03480">graph autoencoder</a> to extract a representation of the graphs on a somewhat simpler space, so that all the well known statistical tools for change detection become available for us to use.<br />
However, since we already noted that graphs do not naturally lie in Euclidean spaces, we can look for a better embedding space, which can intrinsically represent some of the non-trivial properties of the space of graphs.</p>
<p>Since non-Euclidean geometry is basically any relaxation of the Euclidean one, we can freely pick our favorite non-Euclidean embedding space. Desirable properties for this space are computationally tractable metric distances and a simple analytical form that makes calculations easier.</p>
<p>A good family of spaces that reflect these characteristics is the family of <em>constant curvature Riemannian manifolds</em> (CCMs): hyperspheres and hyperboloids.</p>
<h2 id="algorithm-overview">Algorithm overview</h2>
<p>Let’s take a global view of the algorithm before diving into the details. The important steps are:</p>
<ol>
<li>Take a sequence of nominal graphs</li>
<li>Train the AE to embed the graphs on a CCM</li>
<li>Take a stream of graphs that <em>may</em> eventually change to the non-nominal regime</li>
<li>Use the encoder to map the stream to the CCM</li>
<li>Run a change detection test on the CCM to identify changes in the stream</li>
</ol>
<p>The sequence of graphs in step 1 is also mapped to the CCM and used to configure the change detection test (more on that later).<br />
In a real-world scenario, step 3 is the stream of graphs observed by the algorithm after being deployed, where we have no information on the real state of the system. To test our methodology, however, we consider a stream of graphs with a known change point, and use it as ground truth to evaluate performance.</p>
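<p>The five steps above can be sketched as a few lines of hypothetical glue code (the <code>encoder</code> and <code>cdt</code> objects are placeholders, not the actual implementation from the paper):</p>

```python
import numpy as np

def run_pipeline(train_graphs, stream, encoder, cdt):
    """Hypothetical glue code for the five steps above."""
    # Steps 1-2: `encoder` is assumed to be the already-trained AE encoder,
    # mapping each graph to a point on the CCM.
    train_embeddings = np.stack([encoder(g) for g in train_graphs])
    # The embedded nominal sequence configures the change detection test
    # (e.g. its detection threshold).
    cdt.configure(train_embeddings)
    # Steps 3-5: embed the operational stream and monitor it on the CCM.
    alarms = []
    for t, graph in enumerate(stream):
        z = encoder(graph)
        if cdt.update(z):  # True when the accumulator exceeds the threshold
            alarms.append(t)
    return alarms
```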
<h2 id="adversarial-graph-autoencoder">Adversarial Graph Autoencoder</h2>
<p><img src="https://danielegrattarola.github.io/images/2018-06-07/scheme.png" alt="Full architecture" class="full-width" /></p>
<p>Building an autoencoder that maps the data distribution to a CCM requires imposing some sort of constraint on the latent space of the network, either by explicitly constraining the representation (e.g. by projecting the embeddings onto the CCM), or by letting the AE learn a representation on a CCM autonomously.<br />
In our approach, we choose a mix of the two solutions: first, we let the AE learn a representation that lies as close as possible to the CCM, and then (once we’re sure that the projection will not introduce too much bias) we rectify the embeddings by clipping them onto the surface of the CCM.</p>
<p>To impose an implicit constraint on the representation, we resort to the <a href="https://arxiv.org/abs/1511.05644">adversarial autoencoder</a> framework, where we take a more GAN-like approach and only use the encoder as the generator, ignoring the decoder.
We define a prior distribution with support on the CCM, by mapping an equivalent Euclidean distribution onto the CCM via the <a href="https://en.wikipedia.org/wiki/Exponential_map_(Riemannian_geometry)">Riemannian exponential map</a>, and we then match the aggregated posterior of the AE with this prior.</p>
<p>This has the twofold effect of 1) implicitly defining the embedding surface that the AE has to learn in order to confuse the discriminator network, and 2) making the AE use all of the latent space uniformly.</p>
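<p>As a concrete illustration of the CCM prior, here is how one might sample from a prior with support on the unit hypersphere (curvature +1) by pushing a Euclidean Gaussian through the exponential map at a base point; the hyperbolic case is analogous, with hyperbolic functions in place of the trigonometric ones. This is a minimal sketch, not the paper’s code:</p>

```python
import numpy as np

def sphere_exp_map(p, v, eps=1e-12):
    """Exponential map on the unit hypersphere (curvature +1) at base point p.

    p: array of shape (d,), with ||p|| = 1.
    v: array of shape (n, d), tangent vectors at p (assumed orthogonal to p).
    """
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return np.cos(norm) * p + np.sin(norm) * v / np.maximum(norm, eps)

def sample_spherical_prior(n, d, scale=1.0, seed=None):
    """Map an isotropic Gaussian in the tangent plane at the 'north pole'
    onto the sphere, giving a prior with support on the CCM."""
    rng = np.random.default_rng(seed)
    p = np.zeros(d)
    p[-1] = 1.0                  # base point: north pole
    v = scale * rng.standard_normal((n, d))
    v[:, -1] = 0.0               # project into the tangent plane at p
    return sphere_exp_map(p, v)
```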
<p>Using this <em>CCM prior</em> is the simplest modification to the standard AAE framework that we can make to impose the geometric constraint on the latent space, but in general we may want to drop the statistical conditioning of the posterior and find ways to let the AE learn the representation on the CCM freely.
To do this, we introduce an analytical <em>geometric discriminator</em>.</p>
<h2 id="geometric-discriminator">Geometric discriminator</h2>
<p>If we ignore the statistical conditioning of the AE’s posterior, we are left with the task of simply placing the embeddings onto the CCM.
Since we already defined a training framework for the AE that relies on adversarial learning, we can stick to this methodology and slightly change it to fit our new, relaxed requirements.</p>
<p>The key idea behind adversarial networks is that both the generator and the discriminator strive to be better against each other, but what if the discriminator were already the best possible discriminator that may ever exist? What if the discriminator only had to compute a known classification function, without learning it? <br />
This is the idea behind the geometric discriminator.</p>
<p><img src="https://danielegrattarola.github.io/images/2018-06-07/geom_critic.png" alt="Geometric critic" class="full-width" /></p>
<p>We consider a function <script type="math/tex">D_\kappa(\vec z)</script> depending on the curvature <script type="math/tex">\kappa \ne 0</script> of the CCM, where:</p>
<script type="math/tex; mode=display">D_{\kappa}(\vec z) =
\mathrm{exp}\left(\cfrac{-\big( \langle \vec z, \vec z \rangle - \frac{1}{\kappa} \big)^2}{2\varsigma^2}\right)</script>
<p>which intuitively takes samples <script type="math/tex">\vec z</script> from the latent space and computes their <em>membership</em> to the CCM.<br />
When optimized to fool the geometric discriminator, the AE will learn to place its codes on the CCM, while at the same time being free to choose the best latent representation to optimize the reconstruction loss. <br />
In principle, we could argue that this formulation is equivalent to imposing a regularization term during the optimization of the AE, but experimental results showed us that separating the reconstruction and regularization phases yielded more stable and more effective results.</p>
<h2 id="change-detection-tests-for-ccms">Change detection tests for CCMs</h2>
<p>Having defined a way to represent our graph stream on a manifold with better geometrical properties than the simple Euclidean space, we now have to run the change detection test on the embedded stream of graphs.</p>
<p>Our change detection test is built upon the CUmulative SUM (CUSUM) algorithm (dating back to the 1950s), which basically consists in monitoring a generic stream of points by taking sequential windows of them, computing some <em>local statistic</em> across each window, and summing up the local statistics in a <em>global accumulator</em>.<br />
The algorithm raises an <em>alarm</em> every time that the accumulator exceeds a particular <em>detection threshold</em> (and the accumulator is reset to 0 after that).<br />
Using the (embedded) training graphs, we set the threshold such that the probability of the accumulator exceeding the threshold in the nominal regime is a given value <script type="math/tex">\alpha</script>.
Once the threshold is set, we monitor the operational stream, knowing that any detection rate above <script type="math/tex">\alpha</script> will likely be associated with a change in the process.</p>
<p>Since the detection threshold is set by statistically studying the accumulator, we can estimate it by knowing the distribution of the local statistics that make up the accumulator. To do this, we consider as local statistic the Mahalanobis distance between the mean of the training samples and the mean of the operational window, which thanks to the central limit theorem has a known distribution.</p>
<p>So now we have outlined a change detection test for a generic stream, but where does non-Euclidean geometry come into play? In the paper we propose two different approaches to exploit it, both consisting in picking different ways to build the stream that is monitored by the CUSUM test.</p>
<p><strong>Distance-based CDT (D-CDT)</strong>: we take the training stream of nominal graphs and compute the <em>Fréchet mean</em> of the points on the CCM; for each embedding in the operational stream, then, we compute the geodesic distance between the mean and the embedding itself. This results in the stream of embeddings being mapped to a stream of distances, which we then monitor with the CUSUM-based algorithm described above.</p>
<p><strong>Riemannian CLT-based CDT (R-CDT)</strong>: here we take a slightly more geometrical approach, where instead of considering the Euclidean CLT we take the <em>Riemannian CLT</em> proposed by <a href="https://arxiv.org/abs/1801.00898">Bhattacharya and Lin</a>, which works directly for points on a Riemannian manifold and modifies the Mahalanobis distance to deal with the non-Euclidean geometry. In short, the approach considers a stream of points obtained by mapping the CCM-embedded graphs to a tangent plane using the Riemannian log-map, and computes the detection threshold using the modified local statistic.</p>
<p>This might seem like a lot to deal with, but worry not: <a href="https://github.com/dan-zam/cdg">there’s a public repo to do this stuff for you</a>.</p>
<h2 id="combined-ccms">Combined CCMs</h2>
<p>As a final touch, some considerations on which CCM to pick for embedding the graph stream.<br />
In general, there are infinite curvatures to choose from, but we really only need to worry about the sign of the curvature, because that’s what determines whether the space is spherical or hyperbolic (or even Euclidean, if we set the curvature to 0).</p>
<p>Different problems may benefit from different geometries, depending on the task-specific distances that determine the geometry of the original space of graphs (for instance, MNIST - yes, images are graphs too - <a href="https://arxiv.org/abs/1804.00891">has been shown</a> to do well on spherical manifolds).<br />
But how can we know whether a sphere or a hyperboloid is the best fit for a problem? How do we know that the Euclidean space isn’t actually the best one?
In principle, we could train an AE for each manifold and test the performance of the algorithm, but what if we don’t have enough data to get reliable results? What if we have too much, and training is expensive?</p>
<p>A fairly trivial, but effective solution is to not pick just <em>one</em> manifold (pfffft!), but pick ALL of them at the same time and learn a joint representation.
Formally, we consider an <em>ensemble manifold</em> as the Cartesian product of different CCMs, and slightly adapt our architecture accordingly (essentially we take the relevant building blocks of our pipeline and put them in parallel, with some shared convolutions here and there - check Section 3.3 of the paper for details).<br />
Since the actual values of the curvatures are less important than their signs, we can take only three CCMs to build our ensemble: a spherical CCM of curvature 1, a hyperbolic CCM of curvature -1, and a Euclidean CCM of curvature 0.</p>
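<p>The clipping of the three latent blocks onto their respective component manifolds can be sketched as follows (a toy NumPy illustration of the product-manifold representation, not the paper’s architecture; dimensions and names are made up):</p>

```python
import numpy as np

def to_ensemble(z_s, z_h, z_e):
    """Clip three latent blocks onto a sphere (kappa=+1), a hyperboloid
    (kappa=-1) and a Euclidean space (kappa=0), then concatenate them into
    a single point on the product manifold."""
    # Spherical block: normalize onto the unit sphere.
    sphere = z_s / np.linalg.norm(z_s, axis=-1, keepdims=True)
    # Hyperboloid model: the last coordinate is determined by the others.
    last = np.sqrt(1.0 + np.sum(z_h ** 2, axis=-1, keepdims=True))
    hyperboloid = np.concatenate([z_h, last], axis=-1)
    # Euclidean block: left as-is.
    return np.concatenate([sphere, hyperboloid, z_e], axis=-1)
```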
<h2 id="experiments">Experiments</h2>
<p>To validate our methodology, we ran experiments in two different settings: a synthetic, controlled one, and a real-world scenario of epileptic seizure detection.</p>
<p><img src="https://danielegrattarola.github.io/images/2018-06-07/delaunay.png" alt="Delaunay triangulations" class="full-width" /></p>
<p>For the synthetic scenario, we considered graphs obtained as the <a href="https://en.wikipedia.org/wiki/Delaunay_triangulation">Delaunay triangulations</a> of points in a plane (pictured above), where we controlled the change in the stream by adding perturbations of different intensity to the support points of the graphs.</p>
<p><img src="https://danielegrattarola.github.io/images/2018-06-07/ieeg.png" alt="iEEG data" class="full-width" /></p>
<p>For the seizure detection scenario, we considered Kaggle’s <a href="https://www.kaggle.com/c/seizure-detection">UPenn and Mayo Clinic’s Seizure Detection Challenge</a> and <a href="https://www.kaggle.com/c/seizure-prediction">American Epilepsy Society Seizure Prediction Challenge</a> datasets, composed of iEEG signals for different human and dog patients, with a different number of electrodes attached to each patient resulting in different multivariate signals. <br />
The signals are provided in 1-second i.i.d. clips of different classes for each patient (the nominal <em>interictal</em> states where the patient is fine, and the non-nominal <em>ictal</em> states where the patient is having a seizure), and the original task of the challenge is to classify the clips correctly.<br />
Since a common approach in neuroscience to deal with iEEG data is to build <a href="https://www.frontiersin.org/articles/10.3389/fnsys.2015.00175/full">functional connectivity networks</a> to study the relationships between different areas of the brain, especially during rare episodes like seizures, this task was the perfect playground to test our complex methodology.
We converted each 1-second clip to a graph using Pearson’s correlation as functional connectivity measure, and the topmost 4 wavelet coefficients of each signal as node attributes.<br />
To simulate the graph streams, we used the labeled training data from the challenges to build the training and operational streams for each patient, where a change in the stream simply consisted in sampling graphs from the ictal class instead of the nominal.</p>
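<p>The clip-to-graph conversion is straightforward to sketch. The version below only builds the adjacency matrix from Pearson’s correlation, leaving out the wavelet node attributes:</p>

```python
import numpy as np

def clip_to_graph(clip, threshold=0.0):
    """Convert an iEEG clip to a functional connectivity graph.

    clip: array of shape (n_channels, n_timesteps).
    Returns the (n_channels, n_channels) adjacency matrix given by the
    absolute Pearson correlation between channels, with self-loops removed
    and weak edges optionally thresholded away.
    """
    adj = np.abs(np.corrcoef(clip))  # pairwise channel correlations
    np.fill_diagonal(adj, 0.0)       # no self-loops
    adj[adj < threshold] = 0.0       # optional sparsification
    return adj
```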
<h3 id="results-in-short">Results in short</h3>
<p>The important aspects that emerged after the experimental phase are the following:</p>
<ol>
<li>The ensemble of CCMs, with the geometric critic, and R-CDT is the most effective change detection architecture among the ones tested (which included a purely Euclidean AE and a non-neural baseline for embedding graphs). This highlights how the AE is encoding different, yet useful information on the different CCMs;</li>
<li>Exclusively spherical and hyperbolic AEs are relevant in some rare cases;</li>
<li>Using the geometric discriminator often yields better performance than the standard discriminator, while reducing the number of trainable parameters by up to 13%;</li>
<li>We are able to detect extremely small changes (in the order of <script type="math/tex">10^{-3}</script>) in the distribution driving the Delaunay stream;</li>
<li>We are able to detect changes in both the iEEG detection and prediction challenges with good accuracy in most cases, except for a couple of patients for which we see an accuracy drop;</li>
<li>The model does not require excessive hyperparameter tuning in order to perform well; a single configuration is good in most cases.</li>
</ol>
<h2 id="conclusion">Conclusion</h2>
<p>All methods introduced in this work can go beyond the limited application scenario that we reported in the paper.
Our aim was to introduce a new framework to deal with graphs on a global level, so as to make it possible to study the process underlying a graph-based problem as a whole.<br />
The proposed techniques are modular and fairly generic: the adversarially regularized graph AE can be used to map graphs on CCMs for other tasks, and the embedding technique for CCMs can be used with other autoencoders and other data distributions. The change detection tests are a bit more specific, but represent a nice application of our new framework on relevant use cases.</p>
<p>We’re already working on new applications of this framework, to showcase what we believe to be a great potential, so stay tuned!</p>
<h2 id="credits">Credits</h2>
<p>The code for replicating our experiments is available on <a href="https://github.com/danielegrattarola/cdt-ccm-aae">my Github</a>.<br />
If you wish to reference our paper, you can cite:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{grattarola2018change,
title={Change Detection in Graph Streams by Learning Graph Embeddings on Constant-Curvature Manifolds},
author={Grattarola, Daniele and Zambon, Daniele and Livi, Lorenzo and Alippi, Cesare},
journal={IEEE Transactions on Neural Networks and Learning Systems},
year={2019},
doi={10.1109/TNNLS.2019.2927301}
}
</code></pre></div></div>
Thu, 07 Jun 2018 00:00:00 +0000
/posts/2018-06-07/ccm-paper.html
Machine Learning Team Projects: a Survival Guide<p><img src="https://danielegrattarola.github.io/images/2018-03-20/cover.jpg" alt="Training a neural network" class="full-width" /></p>
<p>Ever since I started getting closer to machine learning, well before I started my PhD, I have always found it extremely annoying to keep track of experiments, parameters, and minor variations of code that may or may not be of utmost importance to the success of your project.<br />
This gets incredibly uglier as you wander into uncharted territory, when best practices start to fail you (or have never been defined at all) and the amount of details to keep in mind becomes quickly overwhelming.<br />
However, nothing increases the entropy of a project like introducing new people into the equation, each one with a different skillset, coding style, and amount of experience.</p>
<p>In this post I’ll try to sum up some of the problems that I have encountered when doing ML projects in teams (both for research and competitions), and some of the things that have helped me make my life easier when working on a ML project in general.<br />
<!--more-->
Some of these require people to drop their ancient, stone-engraved practices and beliefs: they will hate you for enforcing change, but after a while you’ll all be laughing back at when Bob used to store 40GB of <code class="language-plaintext highlighter-rouge">.csv</code> datasets on a Telegram chat.</p>
<p>The three main areas that I’ll cover are:</p>
<ul>
<li>How to deal with code, so that anyone will be able to reproduce the stuff you did and understand what you did by looking at the code;</li>
<li>How to deal with data, so that good ol’ Bob will not only stop using Telegram as a storage server, but will also stop storing data in that obscure standard from 1997;</li>
<li>How to deal with logs, so that every piece of information needed to replicate an experiment will be stored somewhere, and you won’t need to run a mental Git tree to remember every little change that the project underwent in the previous 6 months.</li>
</ul>
<hr />
<h2 id="code">Code</h2>
<p>In this post I’ll be mostly talking about Python.<br />
That’s because 99% of the ML projects I’ve worked on have been in Python, and the remaining 1% is what Rule 1 of this section is about. I’ll try to keep it as general as possible, but in the end I’m a simple ML PhD student who goes with the trend, so Python it is.<br />
Let’s start from two basic rules (which, I assure you, have been made necessary by experience):</p>
<p><strong>1. Use a single programming language</strong><br />
Your team members may come from different backgrounds, have different skills, and different degrees of experience. This can become particularly problematic when coding for a project, as people will try to stick to the languages they know best (usually the ones they used during their education) because they rightfully feel that their performance may suffer from using a different language.<br />
Democratically deciding on which language to use may be a hard task, but you must never be tempted to tolerate a mixed codebase if you are serious about being a team.<br />
Eventually, someone might have to put their fist down and resort to threat: don’t push that <code class="language-plaintext highlighter-rouge">.r</code> file on my Python repo ever again if you wish to live.</p>
<p><strong>2. Everybody must be using the same version of everything</strong> <br />
This should be pretty obvious, but I’ve witnessed precious hours being thrown to the wind because OpenAI’s <code class="language-plaintext highlighter-rouge">gym</code> (just to name one) was changed in the backend between versions and nobody had a clue why the algorithms were running differently on different machines. <br />
Another undesirable situation may present itself when integrating existing codebases written in different versions of the same language. This is obviously more relevant with Python 2/3, where the code is backwards compatible enough between versions for the integration to go smoothly, but <code class="language-plaintext highlighter-rouge">2/3</code> is sneakily equal to 0 in Python 2 and 0.66 in Python 3 (and this may not always be apparent immediately).</p>
<p>To make it short:</p>
<ul>
<li>check your Pythons.</li>
<li><code class="language-plaintext highlighter-rouge">pip install -U</code> at least once a week (or never at all until you’re done).</li>
</ul>
<p>Going a bit more in depth into the realm of crap that one may find oneself in, even once you’re sure that everyone is synced on the basics, there are some additional rules that can greatly improve the overall project experience and will prepare you for more advanced situations in any team project.</p>
<p><strong>3. Write documentation for at least input and output</strong><br />
You have to work with the sacred knowledge that people may not want to read your code.<br />
Good documentation is the obvious way to avoid most issues when it comes to working on a team project, but codebases tend to get really big and deadlines tend to get really close, so it may not always be possible to spend time documenting in detail every function. <br />
A simple trade-off for the sake of sanity is to limit documentation to a single sentence describing what functions do, but clearly describing what are the expected input and output formats. A big plus here is to perform runtime checks, and fail early when the inputs are wrong.<br />
For instance, one could do something like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">foo</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
<span class="s">""" Does a thing.
:param a: np.ndarray of shape (n_samples, n_channels); the data to be processed.
:param b: None or int; the amount of this in that (if None, it will be inferred).
:return: np.ndarray of shape (n_samples, n_channels + b); the result.
"""</span>
<span class="k">if</span> <span class="n">a</span><span class="o">.</span><span class="n">ndim</span> <span class="o">!=</span> <span class="mi">2</span><span class="p">:</span>
<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">'Expected rank 2 array, got rank {}'</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">ndim</span><span class="p">))</span>
<span class="k">if</span> <span class="n">b</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="nb">int</span><span class="p">):</span>
<span class="k">raise</span> <span class="nb">TypeError</span><span class="p">(</span><span class="s">'b should be int or None'</span><span class="p">)</span>
</code></pre></div></div>
<p><strong>4. <code class="language-plaintext highlighter-rouge">git branch && git gud</code></strong><br />
This is actually a good general practice that should be applied in any coding project.<br />
Do not test stuff on <code class="language-plaintext highlighter-rouge">master</code>, learn to use the tools of the trade, and read the <a href="https://www.git-tower.com/blog/git-cheat-sheet/">Git cheat-sheet</a>.<br />
Do not be afraid to create a branch to test a small idea (fortunately they come cheap), and your teammates will appreciate you for not messing up the codebase.</p>
<p><strong>5. Stick to one programming paradigm and style</strong><br />
This may be the hardest rule of all, especially because it’s fairly generic.
It’s difficult to formalize this rule properly, so here are some examples:</p>
<ul>
<li>write PEP8 compliant code (or the PEP8 equivalent for other languages);</li>
<li>don’t use single letters for variables that have a specific semantic meaning (e.g. don’t use <code class="language-plaintext highlighter-rouge">W</code> when you can use <code class="language-plaintext highlighter-rouge">weights</code>);</li>
<li>keep function signatures coherent;</li>
<li>don’t write cryptic one-liners to show off your power level;</li>
<li>don’t use a <code class="language-plaintext highlighter-rouge">for</code> loop if everything else is vectorized;</li>
<li>don’t define classes if everything else is done with functions in modules (e.g. don’t create a <code class="language-plaintext highlighter-rouge">Logger</code> class that exposes a <code class="language-plaintext highlighter-rouge">log()</code> method, but create a <code class="language-plaintext highlighter-rouge">logging.py</code> module and <code class="language-plaintext highlighter-rouge">import log</code> from it);</li>
<li>don’t use sparse matrices if everything else is dense (unless absolutely necessary, and always remember Rule 3 anyway).</li>
</ul>
<p>I realize this is all a bit vague, so I’ll just summarize it as “stick to the plan” and shamelessly leave you to learn from experience.</p>
<p><strong>6. Don’t add a dependency if you’ll only use it once</strong><br />
This could have actually been an example of Rule 5, but I’ve seen too many atrocities in this regard to not make it into a rule.<br />
Sometimes it will be absolutely tempting to use a library with which you have experience to do a single task, and you will want to import that library “just this once” to get done with it.<br />
This quickly leads to <a href="https://en.wikipedia.org/wiki/Dependency_hell">dependency hell</a> and puts Rule 2 in danger, so try to avoid it at all costs. <br />
Examples of this include using Pandas because you are not confident enough with Numpy’s slicing, or importing Seaborn because Matplotlib will require some grinding, or copy-pasting that two-LOC solution from StackOverflow.<br />
Of course, this is gray territory and you should proceed with common sense: sometimes it’s really useless to reinvent the wheel, in which case you can <code class="language-plaintext highlighter-rouge">import</code> away without guilt, but most times a quick Google search will provide you with native solutions within the existing requirements of the project.</p>
<p><strong>7. Comment non-trivial code, but do not over-commit to the cause</strong><br />
Comments should be a central element of any codebase, because they are the most effective way of allowing others (especially the less skilled) to understand what you did; they are the only ones that can save your code’s understandability should Rule 5 come less.<br />
Especially in ML projects, where complex ideas may lead to complex architectures, and most stuff is usually vectorized (i.e. ugly, ugly code may happen more frequently than not), leaving a good trail of comments behind you may be crucial for the sake of the project, especially when you find yourself debugging a piece of code that was written six months before.<br />
At the same time, you should avoid commenting every single line of code that you write, in order to keep the code as tidy as possible, reduce redundancy, and improve readability.<br />
So for instance, a good comment would be:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">inputs</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span> <span class="c1"># Compute the model's output as WX + b
</span></code></pre></div></div>
<p>where the information conveyed is as minimal and as exact as possible (maybe this specific example shouldn’t even require a comment, but you get the idea). Note that in this case the comment refers to variables by other names: this is not necessarily a good practice, but I find it helpful to link what you are doing in the code with what you did in the associated paper.<br />
On the other hand, a comment like the following (actually found in the wild):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Set the model's flag to freeze the weights and prevent training
</span><span class="n">model</span><span class="o">.</span><span class="n">trainable</span> <span class="o">=</span> <span class="bp">False</span>
</code></pre></div></div>
<p>should be avoided at all costs. But you knew that already.</p>
<hr />
<h2 id="data">Data</h2>
<p>Data management is a field that is so vast and so complex that it’s basically impossible for laymen (such as myself) to do a comprehensive review of the best practices and tools.<br />
Here I’ll try to give a few pointers that are available to anyone with basic command line and programming knowledge, as well as some low-hassle tricks to simplify the life of the team.<br />
You should probably note, as a disclaimer, that I’ve never worked with anything bigger than 50GB, so there’s that. But anyway, here we go.</p>
<p><strong>1. Standardize and modernize data formats</strong><br />
Yes, I know. I know that in 1995, IEEE published an extremely well defined standard to encode an incredibly specific type of information, and that this is exactly the type of data that we’re using right now.
And I know that XML was the semantic language of the future, in 2004.
I know that you searched the entire Internet for that dataset, and that the Internet only gave you a <code class="language-plaintext highlighter-rouge">.mat</code> file in return.<br />
But, this is what we should do instead:</p>
<ol>
<li>use <code class="language-plaintext highlighter-rouge">.npz</code> for matrices;</li>
<li>use <code class="language-plaintext highlighter-rouge">.json</code> for structured data;</li>
<li>use <code class="language-plaintext highlighter-rouge">.csv</code> for classic relational data (e.g. the Iris dataset, stuff with well defined categories);</li>
<li>serialize everything else with libraries like Pickle or H5py.</li>
</ol>
<p>Keep it as simple, as standard, and as modern as possible.<br />
And remember: it’s better to convert data once, and then read from the chosen standard format, rather than converting at runtime, every time.</p>
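<p>The convert-once pattern is only a few lines of code; for instance (file names are hypothetical, and the temporary directory just keeps this example self-contained):</p>

```python
import json
import os
import tempfile

import numpy as np

workdir = tempfile.mkdtemp()

# Convert once into the chosen standard formats...
features = np.random.rand(100, 16)
np.savez_compressed(os.path.join(workdir, 'dataset.npz'), features=features)
with open(os.path.join(workdir, 'meta.json'), 'w') as f:
    json.dump({'n_samples': 100, 'source': 'converted from a legacy format'}, f)

# ...and from then on, everyone on the team reads the same thing:
data = np.load(os.path.join(workdir, 'dataset.npz'))['features']
with open(os.path.join(workdir, 'meta.json')) as f:
    meta = json.load(f)
```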
<p><strong>2. Drop the Dropbox</strong> <br />
Dropbox and Google Drive are consumer-oriented platforms, specifically designed to give the average user a simple and effective experience with cloud storage. They can surely be used as a backend for more technical situations through the command line, but in the end they will bring you down to hell and keep you there forever.<br />
Here’s a short list of tools and tips for cloud storage and data handling that I have used in the past as alternatives to the big D (no pun intended).</p>
<p>Data storage:</p>
<ul>
<li>Set up a centralized server (as you most likely do anyway to run heavy computations) and keep everything there;</li>
<li>Set up an S3 bucket and add a <code class="language-plaintext highlighter-rouge">dataset_downloader.py</code> to your code;</li>
<li>Set up a NAS (good for offices, less so for remote development).</li>
</ul>
<p>Data transfers:</p>
<ul>
<li>Use the amazing <a href="https://transfer.sh">transfer.sh</a>, a free service that allows you to upload and download files up to 10GB for up to 30 days;</li>
<li>Use <code class="language-plaintext highlighter-rouge">rsync</code>;</li>
<li>Use <code class="language-plaintext highlighter-rouge">sftp</code>;</li>
<li>Use FileZilla or an equivalent <code class="language-plaintext highlighter-rouge">sftp</code> client.</li>
</ul>
<p><strong>3. Don’t use Git to move source files between machines</strong><br />
This is once again an extension of the previous rule.<br />
The situation is the following: you’re debugging a script, testing out hyperparameters, or developing a new feature of your architecture. You need to run the microscopically different script on the department’s server, because your laptop can’t deal with it. You <code class="language-plaintext highlighter-rouge">git commit -m 'fix' && git push origin master</code>. Linus Torvalds dies (and also you broke Rule 4 of the coding section).<br />
Quick fix: keep a <code class="language-plaintext highlighter-rouge">sftp</code> session open and <code class="language-plaintext highlighter-rouge">put</code> the script, instead. Once you’re sure that the code works, then you can roll back the changes on the remote machine, commit from the local machine just once, and then pull on the remote to finish.</p>
<p>This will make life easier for someone who has to roll back the code or browse commits for any other reason, because they won’t have to guess which one of the ten ‘fix’ commits is the right one.</p>
<p><strong>4. Don’t push data to GitHub</strong><br />
On a similar note, avoid using GitHub to keep track of your data, especially if the data is subject to frequent changes. GitHub will block you if you exceed a certain file size but, in general, this is a solution that doesn’t scale.<br />
There is one exception to this rule: small, public benchmark datasets. Those are fine and may help people reproduce your work by conveniently providing them with a working out-of-the-box environment, but everything else should be handled properly.</p>
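<p>In practice, this usually amounts to a few lines in your <code class="language-plaintext highlighter-rouge">.gitignore</code>. The patterns below are illustrative (your folder names and formats will differ): heavy data stays out, and a small benchmark file can be re-included explicitly.</p>

```
# Keep heavyweight data out of version control
data/*
*.npz
*.h5
*.pkl
# Small, public benchmark files can be re-included explicitly
# (note: this works because we ignored data/*, not the data/ directory itself)
!data/iris.csv
```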
<p><strong>5. Test small, run big</strong> <br />
Keep a small subset of your data on your development machine, big enough to cover all possible use cases (e.g. train/test splits or cross validation), but small enough to keep your runtimes in the order of seconds.<br />
Once you’re ready to run experiments for good, you can use the whole dataset and leave the machine to do its work.</p>
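<p>One way to build such a subset (the helper and class counts below are made up for illustration) is to keep a handful of examples per class, so that every label, and therefore every split, is still exercised:</p>

```python
import random
from collections import defaultdict

def small_subset(samples, labels, per_class=5, seed=42):
    """Keep a few samples per class, so the dev subset covers every label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    keep = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        keep.extend(idxs[:per_class])
    return [samples[i] for i in keep], [labels[i] for i in keep]

# Hypothetical dataset: 100 samples, 4 classes
X = list(range(100))
y = [i % 4 for i in range(100)]
X_small, y_small = small_subset(X, y, per_class=3)
# 12 samples keep runtimes in the order of seconds,
# but all 4 classes are still represented
```

<p>Fixing the seed keeps the subset stable across runs, so debugging sessions stay comparable.</p>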
<hr />
<h2 id="experiments">Experiments</h2>
<p>Experiments, runs, call them whatever you like: the act of taking a piece of code that implements a learning algorithm, throwing data at it, and getting information in return.<br />
I’ve wasted long hours trying to come up with the perfect Excel sheet to keep track of every nuance of my experiments, only to realize that it’s basically impossible to do so effectively.<br />
In the end, I’ve found that the best solutions are to either have your script output a dedicated folder for each run, or to keep an old-school paper notebook in which you record your methodology as you would take notes in class. Since the latter is more time-consuming and personal, I’ll focus on the former.</p>
<p><strong>1. Keep hyperparameters together and logged</strong><br />
By my very own, extremely informal definition, hyperparameters are those things that you have to pick by hand (or cross-validation) and that will FUCK! YOU! UP! whenever they feel like it. You might think that the success of your paper depends on your hard work, but it really doesn’t: it’s how you pick hyperparameters.<br />
But asides aside, you really should keep track of the hyperparameters for every experiment that you run, for two simple reasons:</p>
<ol>
<li>They will be there when you need to replicate results or publish your code with the best defaults;</li>
<li>They will be there when you need to write the Experiments section of the paper, so you will be sure that result A corresponds to hyperparameters set B, without having to rely on your source code to keep track of hyperparameters for you.</li>
</ol>
<p>In general, it’s also a good idea to log every possible choice and assumption that you have to make for an experiment, and that also includes meta-information like which optimization algorithm or loss you used in the run.</p>
<p>By logging everything properly, you’ll ensure that every team member will know where to look for information, and they will not need to assume anything other than what is written in the logs.</p>
<p>A cool code snippet that I like to run after the prologue of every script is the following (taken from my current project):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Defined somewhere at some point
</span><span class="k">def</span> <span class="nf">log</span><span class="p">(</span><span class="n">string</span><span class="p">,</span> <span class="n">print_string</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
    <span class="k">global</span> <span class="n">LOGFILE</span>
    <span class="n">string</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">string</span><span class="p">)</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">string</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">):</span>
        <span class="n">string</span> <span class="o">+=</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span>
    <span class="k">if</span> <span class="n">print_string</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">string</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">LOGFILE</span><span class="p">:</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">LOGFILE</span><span class="p">,</span> <span class="s">'a'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">string</span><span class="p">)</span>
<span class="c1"># Define all hyperparameters here
# ...
</span>
<span class="c1"># Log hyperparameters
</span><span class="n">log</span><span class="p">(</span><span class="n">__file__</span><span class="p">)</span>
<span class="n">vars_to_log</span> <span class="o">=</span> <span class="p">[</span><span class="s">'learning_rate'</span><span class="p">,</span> <span class="s">'epochs'</span><span class="p">,</span> <span class="s">'batch_size'</span><span class="p">,</span> <span class="s">'optimizer'</span><span class="p">,</span> <span class="s">'loss'</span><span class="p">]</span>
<span class="n">log</span><span class="p">(</span><span class="s">''</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s">'- {}: {}</span><span class="se">\n</span><span class="s">'</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="nb">eval</span><span class="p">(</span><span class="n">v</span><span class="p">)))</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vars_to_log</span><span class="p">))</span>
</code></pre></div></div>
<p>which will give you a neat and tidy:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/path/to/file/run.py
- learning_rate: 1e-3
- epochs: 100
- batch_size: 32
- optimizer: 'adam'
- loss: 'binary_crossentropy'
</code></pre></div></div>
<p><strong>2. Log architectural details</strong> <br />
This one is an extension of Rule 1, but I just wanted to show off this extremely useful function to convert a Keras model to a string:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">model_to_str</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">to_str</span><span class="p">(</span><span class="n">line</span><span class="p">):</span>
        <span class="n">model_to_str</span><span class="o">.</span><span class="n">output</span> <span class="o">+=</span> <span class="nb">str</span><span class="p">(</span><span class="n">line</span><span class="p">)</span> <span class="o">+</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span>
    <span class="n">model_to_str</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="s">''</span>
    <span class="n">model</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="n">print_fn</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">to_str</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">model_to_str</span><span class="o">.</span><span class="n">output</span>
</code></pre></div></div>
<p>Keep track of how your model is structured, and save this information for every experiment so that you will be able to remember changes in time.<br />
Sometimes, I’ve seen people copy-pasting entire scripts in the output folder of an experiment in order to remember what architecture they used: don’t.</p>
<p><strong>3. Plots before logs</strong><br />
We do science to show our findings to the world, the other members of our team, or at the very least to our bosses and supervisors.<br />
This means that the best results that you may obtain in a project instantly lose their value if you cannot communicate properly what you found, and in 2018 that means that you have to learn how to use data visualization techniques. <br />
<a href="https://www.edwardtufte.com/tufte/books_vdqi">Books</a> have been written on the subject, so I won’t go into details here.
Just remember that a good visualization always trumps a series of unfriendly floats floating around.<br />
Some general tips on how to do data viz:</p>
<ul>
<li>label your axes;</li>
<li>don’t be scared of 3D plots;</li>
<li>time is a powerful dimension that should always be taken into consideration: create animated plots whenever possible (use <code class="language-plaintext highlighter-rouge">matplotlib.animation</code> or <code class="language-plaintext highlighter-rouge">imageio</code> to create gifs in Python);</li>
<li>if you have an important metric of interest (e.g. best accuracy) and you’ve already saturated your plot’s dimensions, print it somewhere on the plot rather than storing it in a separate file.</li>
</ul>
<p><strong>4. Keep different experiments in different scripts</strong><br />
This should probably go in the Code section of this post, but I’ll put it here as it relates more to experiments than to code.<br />
Even with Rules 1 and 2 accounted for, sometimes you will have to make changes that are difficult to log.
In this case, I find it a lot more helpful to clone the current script (or create a new branch) and implement all variations on the new file.<br />
This will prevent things like """temporarily""" hardcoding stuff to quickly test out a new idea, or having <code class="language-plaintext highlighter-rouge">if</code> statements in every other code block to account for the two different methodologies, and it will only add a bit of overhead to your development time.<br />
The only downside of this rule is that sometimes you’ll find a bug or implement a cool plot in the new script, and then you’ll have to sync the old file with the new one. However, editors like PyCharm make it easy to keep files synced: just select the two scripts and hit <code class="language-plaintext highlighter-rouge">ctrl + d</code> to open the split-view editor which conveniently highlights the differences and lets you move the code around easily.</p>
<hr />
<p>This is far from a complete guide (probably far from a guide at all), and I realize that some of the rules are not even related to working in teams. I just wanted to put together a small set of practices that I picked up from people way more skilled than me, in the hope of making collaborations easier, simplifying the workflow of other fellow PhD students who are just beginning to work with code seriously, and eventually, hopefully, leading to a more standardized way of publishing ML research for the sake of reproducibility and democratization.<br />
I am sure that many people will know better, smarter, more common practices that I am unaware of, so please do <a href="https://danielegrattarola.github.io/about/">contact me</a> if you want to share some of your knowledge.</p>
<p>Cheers!</p>
Tue, 20 Mar 2018 00:00:00 +0000
/posts/2018-03-20/ml-team-projects.html
codetutorialpostsOverthinking Japan<p><img src="https://danielegrattarola.github.io/images/2017-10-31/shinjuku.jpeg" alt="Piss Alley in Shinjuku, Tokyo (ph. Daniele Grattarola)" class="full-width" /></p>
<p>I went to Japan with the expectation of finding a culture perfectly balanced
between the immovable certainty of the past and the unforgiving, unstoppable
forward pull of the future. These are, after all, the two forces that I find myself subject to every day of my life: a hard, consolidated core of ground beliefs and values (like family, loyalty, tradition), and a constant attraction towards the bleeding edge, the unknown, the new.</p>
<p>I won’t hide, however, that there are other reasons why this unique country exerts a heavy charisma on me, reasons that I think can easily be shared by many other people of my generation.<br />
<!--more--></p>
<p>For one, childhood. I grew up with a substantial percentage of my entertainment, and by extension the center of mass around which my happiness gravitated (read: the games that I played as a kid), coming from Japan.<br />
Cartoons, comic books, games, video games: Japanese, or Japanese representations of the western world.
It’s not a mystery that in times of uncertainty, or difficulty, we turn to the past to seek happiness, and I think that even for someone with a healthy and (semi) successful life there are factors for which this mechanism could become relevant. The constant competition of your peers, the <a href="https://ribbonfarm.com/2017/08/17/the-premium-mediocre-life-of-maya-millenial/">premium mediocrity</a> of our millennial lives, the melting economic glacier: all potential reasons why one could want to look back, rather than ahead.
This does not mean that I consciously weighed this aspect when planning my trip, but it could explain a bias when I spun the globe and pointed to the destination for a self-searching, once-in-a-lifetime solo travel after graduation.</p>
<p><img src="https://danielegrattarola.github.io/images/2017-10-31/osaka.jpeg" alt="Osaka seen from the Tsutenkaku tower (ph. Daniele Grattarola)" class="full-width" /></p>
<p>And another important component of this bias is, unsurprisingly, the Internet.
I don’t mean the internet of social media, forums, “AI” journalism, and porn. I mean the Internet as an entity, with a capital “I”. The omnipotent and omniscient God of the Internet Kvlt, the powerful, inaccessible aesthetic phenomenon at the center of Vaporwave, the enabling technology of cyberpunk, the tyrant freedom and freeing tyranny of our lives. It’s that thing that you can’t explain to your parents when they ask you why memes are a thing.<br />
The blast of the Internet was heavily fueled by Japanese culture and aesthetics, and we, as a collective, elected Japan as the physical embodiment of our intangible universe of information.<br />
There are interesting and complex aspects to this, because while Japan was being brought forth as the de facto land of the Internet, Japanese millennials were diving deeper and deeper into the Internet culture, tumbling in a downward (or upward) spiral until the country became the conceptual caricature of itself, an entity that doesn’t need to stress its characteristics in order to make them evident, because it already managed to erase the barrier between reality and its representation. If you think about it, this is the same exact phenomenon that characterizes the Internet (again, capital “I”), where information transcends the real world and becomes meta-information.
And I find in this meta-information, in the post-post-post-ironic memes, in monospaced fonts, classic statues over pink backgrounds, 80’s aesthetics, 8-bit music, and glitch art the most fascinating cultural movement of our time.</p>
<p>Given these (at least two) sources of bias in my choice, I booked a flight and headed to Japan.<br />
As expected, it was exactly what I thought it would be. But it was even more.</p>
<p>When walking alone in the forests, the firework-like cities, or the wooden temples from which you could see all the way down to your Self, I was humming the mono themes from Pokemon Gold because that’s where my mind would go more often than not. I found myself spending six hours in Akihabara mesmerized by the otaku culture even if I don’t even watch anime anymore (if I ever did at all). I almost cried at the neon covered buildings because there I realized that cyberpunk aesthetic is real, and that I was there to witness it. The barrier between reality and its representation in my mind shattered, as if by being there I became part of the meta-universe myself.<br />
A big part of what enabled this to happen is the almost constant solitude in which I lived for 15 days, with a linguistic barrier there to isolate me even from the passive communication around me and facilitating the deconstruction of reality which was going on in my mind.</p>
<p><img src="https://danielegrattarola.github.io/images/2017-10-31/buddha.jpeg" alt="Golden Buddha in Shitenno-ji temple, Osaka (ph. Daniele Grattarola)" class="full-width" /></p>
<p>Then there was the introspection, because you can’t help but think about your existence when you find yourself staring at a ten meter golden Buddha who, in turn, stares back at you with a mystical power so strong that the statue itself seems to be alive. And in those moments, armed with a few simple spiritual axioms that I picked up over the years, I could make sense of religion as a whole, not by embracing it, but by tearing it apart and understanding the mechanism underneath, with the same level of comprehension that you would get by opening up a clock and seeing the cogs, rather than simply using it to read the time (or, for me, seeing other people using it, as luckily I never felt the need of an outside force to explain my unknowns).<br />
By looking past the religious wall, I could then easily navigate philosophy (again, armed with a few authors that I have in my cultural baggage) and try to make sense of existence, or what derives thereof (<a href="http://exsubstantia.com/about">ex substantia</a>), because only in those specific situations could the Search yield results.<br />
This last level is what, for me, completed the unification, what finally removed the Veil of Maya and made me experience the metaphysical through the senses, and in turn explained the aristotelian Act by deconstructing the Potency, by stepping into Plato’s cave and forcing the true concepts to come out to the world.</p>
<p>Eventually, what this trip meant for me was a step forward to a view of existence in which all barriers between the real and the conceptualized are pointless, where either can be used to convey meaning to one’s life and where substance (i.e. the true essence of things) can be found in either.</p>
<p>Besides overthinking, however, getting to know a different culture was amazing, the food was mind-blowing, I met amazing people, and learning Japanese is a lot of fun. I feel like I’m now ready to start a new adventure, like this trip was the perfect full stop to my first chapter and the start of a new one, in which I will test myself in the international arena of research and maybe, hopefully, leave a trace in the advances of humanity.</p>
<p>I warmly suggest you check out <a href="https://insidekyoto.com">Inside Kyoto</a> to get around Japan; it made the whole trip a lot more enjoyable for me.</p>
<p><img src="https://danielegrattarola.github.io/images/2017-10-31/bridge.jpeg" alt="Shinkyo brigde, Nikko (ph. Daniele Grattarola)" class="full-width" /></p>
Tue, 31 Oct 2017 00:00:00 +0000
/posts/2017-10-31/overthinking-japan.html
travelphilosophypostsThe Fermi Paradox of Superintelligence<p><img src="https://danielegrattarola.github.io/images/2017-09-25/seti.jpg" alt="The Allen Telescope Array (Public domain)" class="full-width" /></p>
<p>Attributed to Enrico Fermi as a back-of-the-envelope astrobiological philosophy
exercise, Fermi’s paradox is a simply put question: <em>where is everybody?</em><br />
In other words, if life is a truly common phenomenon in the universe, then
the probability of a civilization solving the problem of interstellar travel
should be pretty high, and the effects of such a civilization on the galaxy
should be extremely evident to an observer (think entire stars being instantly
harvested for power).<br />
However, SETI remains unsuccessful; so, where is everyone?
<!--more--></p>
<p><a href="https://scholar.google.it/scholar?q=fermi+paradox">A quick search on Scholar</a>
will give you literally thousands of reasons not to panic (or maybe do the
opposite, depending on how much you like aliens), and provide you with many
logical reasons why the presence of ETI could go unnoticed, leaving us in our
quiet and lonely neighborhood.<br />
Having sorted that out, we can safely go back to our Twitter feeds to discuss
<a href="https://twitter.com/dog_rates/status/775410014383026176">serious business</a>
and <a href="https://www.theguardian.com/technology/2017/aug/14/elon-musk-ai-vastly-more-risky-north-korea">Elon Musk’s latest views on AI</a>.<br />
We can also go back to our comfy evening reads, which in my case mean Hofstadter’s
<em>GEB</em> (for the sixth time or so) and Nick Bostrom’s <em>Superintelligence</em>,
while listening to the sound of the falling rain.
And then, when the words are stirring inside your head, and the rain has
flushed the world clean, and the only sound you can hear is the quiet whir
of the GPU fan, while it’s training that 10-layer net that you’ve been recently
working on; only then, you might find yourself asking the question: <em>where the hell is superintelligence?</em></p>
<p>That’s a reasonable question, isn’t it? Just look at all the informed opinions
on the subject from the people I cited above, all better and wiser than me.
They may disagree on its nature, but no one (<em>no one</em>) disagrees that AGI will
have an almost infinite growth rate.<br />
Go back to the second sentence of this post, and replace “interstellar travel”
with “artificial intelligence” to have an idea of what that may look like.
And we’re not talking of a simple Kardashev scale boost; a superintelligence would
be well aware of its physical limitations, so we would likely be looking at an
end-of-the-galaxy scenario, with all available matter being converted to
computing infrastructure, <em>à la</em> Charles Stross.</p>
<p>A phenomenon so powerful that its mere existence would change the scale of
reference for every other phenomenon in the universe, something so good at
self-improvement that it would only be limited by physical laws.</p>
<p>If the probability of life in the universe is high, then so is the probability
of a civilization developing a superintelligence, with all its extremely evident
effects.<br />
So where is it?</p>
<p>So far, I only talked about a catastrophic scenario following the creation of a
superintelligence, but we have to consider the positive and neutral outcomes,
before denying the existence of a super AI.</p>
<p>If the effects of a superintelligence on its universe were not so devastating,
if we look at the brightest end of the spectrum of possibility, think about the
advances that such an entity would bring to its creators. All the technologically
motivated solutions to the Fermi paradox, at that point, would be null, leaving
us with a whole lot less formal analysis, and a whole lot more speculation on
alien sociology and superintelligent motives.<br />
What reason could a civilization with a superintelligence in its toolbox have
to not reach out of its planet?</p>
<p>Moreover, we still couldn’t exclude a catastrophic end of the galaxy, if the
computational necessities of the AI required it to favor an absolute good
(its existence) to a relative one (our existence).
Therefore, if we allow for a truly <em>good</em> superintelligence to exist somewhere
in the universe right now, we have to imagine natural, moral, or logical
impediments that prevent it from communicating outwards and spreading infinite
progress.</p>
<p>From another perspective, even if talking about <em>likelihood</em> in this context
is a serious gnoseological gamble, it seems that the neutral scenarios would
likely be the less noticeable: superintelligence is there, but it has no drive
to manifest itself.<br />
That could either be for a lack of necessity (it wouldn’t need energy, or matter,
or information), or a lack of advantage (it wouldn’t want to reveal its presence
to avoid danger, or to avoid pointless expenses of limited resources), and it
would be a fairly easy and rational explanation to the superintelligence paradox.<br />
A system in such a perfect equilibrium would probably exist in a super-state,
free from the unforgiving grip of entropy and eternally undetected (short of
using another superintelligence, at which point the paradox would already be
solved).</p>
<p>I stop here with the examples, because it’s out of my capabilities to summarize
all possible scenarios, especially when we consider that the Fermi paradox has
inspired fifty years of debate.
And if we think about this debate, we see that it extends back in the ages,
that “alien civilization”, “AI”, or “God” have been used interchangeably without
changing the essence of the discourse: why, if <em>they</em> exist, are they not
manifest?</p>
<p>As hard as we try to rationalize this question, we perfectly know that there
are only two possible outcomes, that we are looking for a black swan that we will
either find, or keep looking for. At the same time, we’ve learned very well
how to coexist with this uncertainty, because we only need the possibility of a
forced ignorance, in order to accept an ignorance undesired.</p>
<p>And so, as men of rational intellects we can be crushed by the lack of knowledge,
or be incessantly driven by it, knowing that every second spent in the search
is a second that could only have been wasted otherwise. And those who are crushed,
may turn to faith, and find peace in knowing that their happiness resides with
the black swan, where it can’t be touched by mortal sight.</p>
<p>In the end, while telling us a lot about human condition, this thought leaves
us back in our quiet neighborhood of probable but unverifiable truths.<br />
However, when considering the practically null amount of time that constitutes our
lives, a question may come to one’s mind: is it closer to human nature to
think of a God, or to seek one out?</p>
Mon, 25 Sep 2017 00:00:00 +0000
/posts/2017-09-25/fermi-paradox-ai.html
AIpostsNew Blog<p><img src="https://danielegrattarola.github.io/images/header/oz.jpg" alt="Australian Outback (ph. Daniele Grattarola)" class="full-width" /></p>
<p>This blog is an updated version of my old edgy, teenage-years blog at <a href="http://exsubstantia.com">exsubstantia.com</a>.
I kept a somewhat similar style because I like the way Exsubstantia looks, but I hope that the difference in content
will speak for itself.</p>
<p>I will use this blog to talk mostly about two things:</p>
<ol>
<li>interesting concepts that I come across when working on my research and projects</li>
<li>interesting stuff that I come across when traveling around</li>
</ol>
<p>but I’ll try to keep it varied, and also talk about other interesting things.
I will try to include code whenever necessary, but for more complex projects you can check out <a href="https://github.com/danielegrattarola">my GitHub</a>.
You can also find me on <a href="https://twitter.com/riceasphait">Twitter</a> and <a href="https://www.instagram.com/riceasphait/">Instagram</a> as @riceasphait.</p>
<p>Cheers!</p>
Tue, 19 Sep 2017 00:00:00 +0000
/posts/2017-09-19/new-blog.html
updateposts