segunda-feira, 6 de julho de 2009

Google Summer of Code Midterm update

This last semester I took a software engineer course and there are those design patterns that seem really cool at first but quickly become a pain, one of which is the template method pattern, I've learned the hard way that it's not always a good idea to use it. I'll explain why as I talk about what I've done so far in Gephi.

I'm working on the implementation of force directed graph spatialization algorithms, specifically Yifan Hu's Multilevel algorithm. There is an enormous amount of force directed spatialization algorithms out there, but the idea of a force directed algorithm is simple: you define a set of force laws in your graph and then find a minimum energy state (the position of the vertex in the space defines the state of the layout).

A classic example is the Spring Electrical model: you put an electrical charge of the same sign on every vertex so that they repel each other, but then you treat every edge as a spring that attracts neighbors so that they don't sit far from each other. Of course this could seem like a simple model, but there's a lot of variations that can make the final layout be really bad or really good (e.g. what's the natural spring length? what's their elastic constant? what would be the electrical constant? should the electrical field be quadratic inverse?).

Also, once you have defined the force law on the edges and between nodes, how can one compute the minimum energy state? The systems of equations that relates vertex positions and layout energy is not usually something pretty, so a really common method is to calculate the resulting force on each vertex and then displace each vertex based on the force acting on it and repeat it several times in hope that the layout will evolve to a local minimum. It's basically an integration to simulate the dynamics of the system.

So how about the complexity of this integration? On each iteration it must evaluate the resulting force on every vertex, so if we have N nodes and E edges, the straight-forward algorithm will evaluate E edge forces and N choose 2 = N*(N-1)/2 node forces, so we have O(E + N^2) force calculations on every integration step, which isn't really good if we are going to work on graphs as large as 10k, 100k or even +1M nodes. Luckily Physic researchers that were interested in simulating n body systems have worked on this problem (the n-body force calculation problem), and there's a O(N log N) approximate algorithm by Barnes-Hut. In short, it uses a quadtree structure for log N retrieval of far away clusters, so that the force applied on a node can be approximated if there's a sufficiently far away cluster of nodes. Of course, there's the problem of finding a suitable definition of "far away cluster" and so on, which can easy vary between graphs or layouts of the same graph.

Ok, so we can calculate the forces in O(N log N + E) now, which is much better, so what's the step size we should use on the integration? Given that a node has a resulting force, by how much should we displace it on the space? If we use a big displacement step the integration can take longer because the nodes keep jumping their minima. What's even worse than missing the minimum is that a node might jump right into a pole (in case you have an energy law with poles, such as any inverse distance force), and this can be a disaster to the system. One might think about a way to limit the displacement step and reduce it so that the system eventually converges to a layout (not necessarily of minimum energy, because the minimization of energy it's not really really mandatory), for instance, Yifan Hu's algorithm uses an adaptive cooling scheme that resembles the simulated annealing method.

There are several improvements that could be made to these models and I hope to test then using Gephi, e.g. the displacement step could be defined individually, instead of a global displacement step (which is used by Yifan Hu's algorithm), moreover the individual displacement could be defined by its mass. Nodes could then have inertia (velocity) and then the cooling scheme would be something like a friction force, which sounds to be a more natural way to reduce the energy from the system. Again, this all depends on testing the algorithms and models to see if the resulting layouts are nice or not.

?????
I started this post talking about the template method pattern and ended up talking a lot about graph spatialization (maybe too much, I hope I didn't bore you to death). So, what about the template method?
I started implementing the algorithm in a generic way: I have a force law for edges, let's take this force and calculate over the edges. I have a force law for nodes, let me take this and calculate the force, now let's integrate, and so on. So I ended up implementing the algorithm in an AbstractForceDirectedLayout class that had abstract methods for the forces and the cooling scheme. Ok, now it's flexible enough so I can just inherit from it and override those methods to define whatever force and cooling scheme I want. It's simply the template method pattern.

At first it was all good and simple, but then it got messy really quickly. What if I want to use the same force law on two different Layout implementations? To avoid code duplication I defined the force in its own class and then compose it in each concrete Layout, so the actual template method just delegated to the composed force. What a nuisance! Why didn't it delegate it on the first opportunity? The whole template method pattern was just obscuring the code in the end. See http://tech.puredanger.com/2007/07/03/pattern-hate-template and Object Composition vs. Inheritance for more.

I'm currently refactoring the whole code so I can be productive again.
To learn more about force-directed layout algorithms there's the chapter 5 of Roberto Tamassia's Handbook of Graph Drawing and Visualization (the book is still not published, but it's looking good).

Oh, and I'm pleased to announce that now I have a cool heldersuzuki at gephi.org e-mail, so you can contact me if you feel like it. I should be posting more about the project as soon as possible, so stay tuned!