Self-Organisation of Hypercolumns based on Force-Directed Clustering

MORTEN DIESEN

Master’s Degree Project Stockholm, Sweden 2005

TRITA-NA-E05103

Numerisk analys och datalogi KTH 100 44 Stockholm

Department of Numerical Analysis and Computer Science Royal Institute of Technology SE-100 44 Stockholm, Sweden


Master's Thesis in Computer Science (20 credits) at the School of Computer Science and Engineering, Royal Institute of Technology, year 2005. Supervisor at Nada was Anders Lansner. Examiner was Anders Lansner.

Abstract

Attractor neural networks are often evaluated and found to perform well on sparse random patterns. However, real world data often contain a large amount of correlation, which tends to decrease the performance of these neural networks dramatically. This thesis describes the first steps in the creation of a hidden layer for preprocessing data in order to address these problems. We describe a novel algorithm for non-parametric clustering, which we apply in the preprocessing step in order to find correlations in the input data. The algorithm uses mutual information as a measure and interprets this measure as a force, which is then used in a force-directed clustering. The algorithm is found to work as expected and is ported to a parallel cluster computer. The parallel performance is very good and the algorithm scales almost linearly with the number of processors used.

Självorganisation av hyperkolumner med hjälp av kraftbaserad klustring

Examensarbete

Sammanfattning

Attraktornätverk har ofta utvärderats och funnits fungera väl för glesa, slumpvisa mönster. Riktiga data innehåller dock oftast mycket korrelation, vilket leder till en dramatisk försämring av neuronnätens kapacitet. Den här rapporten beskriver de första stegen i skapandet av ett gömt förbearbetningslager för att komma till rätta med dessa problem. Vi beskriver en algoritm för icke-parametrisk klustring som vi applicerar i förbearbetningssteget, för att finna korrelation i indata. Algoritmen använder mutual information som mått och vi använder detta som en kraft i den påföljande kraftbaserade klustringen. Algoritmen visar sig fungera som förväntat och en parallell version för klusterdatorer utvecklas. Den parallella prestandan är mycket bra och körtiden skalar nästan linjärt med antalet processorer som används.

Contents

1 Introduction
  1.1 Objectives
  1.2 Scope
  1.3 Overview of the thesis

2 Background
  2.1 Preliminaries
    2.1.1 Attractor neural networks and hypercolumns
    2.1.2 The need for a hidden layer
    2.1.3 Feature extraction
    2.1.4 Information theory
  2.2 Non-parametric clustering
    2.2.1 Introduction to non-parametric clustering
    2.2.2 The research field of clustering
    2.2.3 Gravity based graph layout
  2.3 N-body simulation
    2.3.1 Description of the N-body problem
    2.3.2 History
    2.3.3 Effective implementations of N-body simulations
  2.4 Competitive and selective learning

3 Our model
  3.1 Problem definition
  3.2 The model
    3.2.1 Transformation from color pictures to hypercolumns
    3.2.2 Mutual information between hypercolumns
    3.2.3 Force directed clustering of the hypercolumns
    3.2.4 Partitioning of the hypercolumns
  3.3 Explanation of the approach

4 Implementation of the model
  4.1 Design choices
    4.1.1 Winner-takes-all hypercolumn dynamics in the color coding
    4.1.2 Integration algorithm in the N-body simulation
    4.1.3 Brute force N-body force calculations
    4.1.4 Parameters
  4.2 Parallel design
    4.2.1 Parallelizing the calculations
    4.2.2 Message-passing paradigm

5 Evaluation and performance
  5.1 Performance in terms of correctness
  5.2 Complexity
    5.2.1 Complexity of the sequential algorithm
    5.2.2 Complexity of the parallel algorithm
  5.3 Measured performance in terms of speed
    5.3.1 Performance of the sequential program
    5.3.2 Parallel performance
  5.4 Reducing the number of calculations needed
  5.5 Summary of the evaluation chapter

6 Discussion
  6.1 Generalizing the algorithm to other domains
  6.2 Generalizing the algorithm to different tasks
    6.2.1 Performance on other data
    6.2.2 Further possibilities
  6.3 Future work

7 Conclusion and summary

Bibliography

A Complete list of the parameters and their values

Chapter 1

Introduction

Attractor neural networks are currently the most reasonable models of cortical associative memory. However, these models are normally highly abstract, and they are evaluated on random patterns, since their performance decreases drastically when the input patterns are correlated. Real world data are almost always highly correlated, and thus there is a need for a transformation from highly correlated patterns to patterns that can be stored efficiently in the attractor neural networks. This thesis describes a novel clustering algorithm for the first step in the creation of a hidden layer for preprocessing of real world data, before providing them as input to a neural network. The complete transformation will need additional steps, like feature extraction, that will not be described in this thesis. The algorithm we are proposing is a general non-parametric clustering algorithm that can be applied to quite different problems as well. At the end of the thesis we briefly discuss the performance of the algorithm in general.

1.1 Objectives

The goal of this thesis is to implement, optimize and evaluate an algorithm for the first step in the creation of a hidden layer for the preprocessing of input data to an attractor neural network. The algorithm will be parallelized and optimized to run efficiently on a parallel cluster computer. The algorithm is general and could be applied to input data from any domain. In this thesis we will focus on the case of pictures, since this gives us the ability to visualize the performance of the algorithm in a natural way.

1.2 Scope

The preprocessing requires additional steps that will not be a subject of investigation in this thesis. The final output from the algorithm investigated here is a partitioning of the pixel-positions in the pictures into different parts, where the pixel-positions within the same partition have a higher degree of mutual information than pixel-positions across partitions. The feature extraction that will follow is not dealt with in this work. The algorithm can be applied to any domain of input data, but it is not within the scope of this thesis to evaluate the performance of the algorithm when used for preprocessing data other than pictures. The algorithm is also general in the sense that it could be used for clustering tasks that have nothing to do with preprocessing input to a neural network. This subject is not our main concern in this thesis, although some aspects of it are discussed in chapter 6. The purpose of this thesis is not to compare our approach with other approaches for solving the task, but to evaluate how well our algorithm works and to make an efficient parallel version of it.

1.3 Overview of the thesis

Chapter 2 provides a brief introduction to attractor neural networks and the problems that arise when using them with real world input data. The chapter also provides an overview of the research field of non-parametric clustering, showing what has been achieved in other studies. In chapter 3 we present our model and the algorithms we are using in detail. Chapter 4 describes the implementation of the model, the design choices and the parameters. This chapter also covers the parallelization of the algorithm and the design choices that have to be made in order to make the algorithm run fast on a parallel computer with distributed memory. Chapter 5 provides a theoretical and a practical evaluation, both of correctness and of performance in terms of speed and complexity. In chapter 6 we discuss the algorithm from a broader perspective, both as a preprocessing algorithm and as a general non-parametric clustering algorithm. Finally, in chapter 7 we summarize the work.


Chapter 2

Background

2.1 Preliminaries

2.1.1 Attractor neural networks and hypercolumns

Attractor neural networks are a special type of recurrent neural networks. These neural networks (NNs) are used in tasks like optimization, associative memory, pattern recognition and noise reduction. The most widely known attractor NN is the Hopfield net, developed by Hopfield (1982). Attractor NNs are often used as autoassociative memories, and the task of the NN is then to store patterns and recall them from a partial or noisy version of the original pattern (Johansson, 2004). A problem that arises with attractor NNs is that the network activity is not kept at an appropriate level, i.e. too many, or too few, units are active at a given time. To overcome this problem, local winner-take-all modules have been proposed (Holst and Lansner, 1996). This form of activation control is easy to implement and is somewhat biologically plausible. Such a winner-take-all module is called a hypercolumn, in analogy with the hypercolumns in cortex. The units of a hypercolumn typically represent discrete, mutually exclusive values of a feature, like yellow, green, pink and so on in the case of color.
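As a small illustration of such a winner-take-all module, the following Python sketch (not code from our implementation; the array layout and the names are chosen only for this example) reduces a real-valued activity vector, divided into hypercolumn modules, to one active unit per module:

import numpy as np

def winner_take_all(activity, units_per_hypercolumn):
    """Keep only the most active unit in each hypercolumn module.

    activity is a 1-D array whose length is a multiple of units_per_hypercolumn;
    the result is a binary array of the same shape with exactly one 1 per module.
    """
    modules = activity.reshape(-1, units_per_hypercolumn)
    winners = modules.argmax(axis=1)                # index of the winner in each module
    out = np.zeros_like(modules)
    out[np.arange(len(modules)), winners] = 1       # activate only the winners
    return out.reshape(activity.shape)

# Example: two hypercolumns with four units each.
a = np.array([0.1, 0.7, 0.2, 0.0, 0.3, 0.1, 0.9, 0.2])
print(winner_take_all(a, 4))                        # -> [0. 1. 0. 0. 0. 0. 1. 0.]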

2.1.2 The need for a hidden layer

The storage capacity of an attractor neural network is almost always measured with sparse random patterns. The reason is that the storage capacity is drastically reduced if the patterns that are to be stored are correlated (Johansson et al., 2001). In practice this is a huge problem, since almost all real world data are highly correlated. If one, for example, wants to store pictures, where each pixel in the picture is represented by a hypercolumn, the hypercolumns that are close to each other will have a high correlation, which will lead to bad performance of the network. When attractor neural networks have been used with real world data, the data have been preprocessed to overcome this limitation of the networks. The result of this preprocessing is a hidden layer. There probably exists a corresponding preprocessing function in the brain. The preprocessing we propose will be done in two steps. The first step is described in this thesis, and the output from this step is a partitioning of the hypercolumns. The second step is a local feature extraction that will be performed over the partitions created by this first step.

2.1.3 Feature extraction

The feature extraction step is beyond the scope of this thesis. However, we believe that in order to understand the problem we deal with in this work, one has to know a little about the feature extraction step that follows. The feature extraction will be performed over the output data of our algorithm in order to find a representation of the data that is suitable for an attractor neural network to store. Feature extraction is the process of finding features describing the input data. These features can be found by using one of several well known algorithms. The problem of feature extraction in pictures is especially well studied, since most computer vision tasks include some kind of feature extraction. Two of the most commonly used methods are PCA (Principal Component Analysis), a classic statistical method (Srivastava, 2002), and ICA (Independent Component Analysis), see Hyvarinen and Karhunen (2001). For pictures the most common method is to split the pictures into small squares and do a local feature extraction in each of these so-called patches (Simoncelli and Olshausen, 2001). Since there is a large amount of locality in pictures, i.e. pixels close to each other often describe the same object, this is a very effective way to reduce the complexity of the feature extraction. What our algorithm does is to create these patches dynamically. The main advantage of this approach is that the local feature extraction algorithms can be applied to data other than pictures, where there is no trivial way to split the input data into patches, i.e. the data do not have locality in the way that pictures do. Another advantage is that the dynamically created patches produced by our algorithm are more efficient to use, since their shape will depend on the actual input data.
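For comparison with the dynamically created patches discussed above, the conventional fixed-patch split can be sketched in a few lines of Python; the patch size of 8 is an arbitrary value for this example and the code is not part of our implementation:

import numpy as np

def fixed_patches(image, patch_size=8):
    """Split an image (height x width x channels) into non-overlapping square patches."""
    h, w = image.shape[0], image.shape[1]
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches

# A 32x32 RGB picture gives 16 fixed 8x8 patches; our algorithm instead lets the
# shape of the patches follow the statistics of the input data.
print(len(fixed_patches(np.zeros((32, 32, 3)), patch_size=8)))   # -> 16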

2.1.4 Information theory

In order to understand the rest of this thesis, some elementary concepts from information theory are needed.

Mutual information

Mutual information between two variables X and Y is a measure of how much information about Y we get from knowing X (Jones, 1979). More formally, the mutual information I between two discrete variables X and Y is defined as

I(X, Y) = \sum_{j=1}^{n} \sum_{k=1}^{m} p_{jk} \log \frac{p_{jk}}{p_j q_k},    (2.1)

where X and Y are variables with n and m discrete values respectively. The probability p_j is the probability that X takes value number j, q_k is the probability that Y takes value number k, and p_{jk} is the probability that X takes value number j at the same time as Y takes value number k. Alternative ways of defining mutual information have been proposed, but the above definition is the most common one, and the one we will use throughout the rest of this thesis.

Second-order statistics

In first-order statistics one uses the values of the discrete variables directly to calculate, for example, mean value and variance. In second-order statistics one does not use these values directly; instead one compares the values of the variables pairwise and uses the result to find the interesting features. Since we are primarily interested in the correlation between our discrete variables (the pixels), we will use second-order statistics in our algorithm.
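A direct transcription of equation (2.1) into Python could look like the sketch below; the probabilities are estimated by counting, the logarithm is taken to base 2 (giving the result in bits), and the code is meant only as an illustration of the definition, not as code from our implementation:

import numpy as np

def mutual_information(x, y, n, m):
    """Mutual information between two discrete sequences of value indices.

    x[i] is in {0, ..., n-1} and y[i] is in {0, ..., m-1}; the probabilities in
    equation (2.1) are estimated from the joint counts over the sequences.
    """
    p_jk = np.zeros((n, m))
    for xi, yi in zip(x, y):                        # joint counts
        p_jk[xi, yi] += 1
    p_jk /= len(x)
    p_j = p_jk.sum(axis=1)                          # marginal distribution of X
    q_k = p_jk.sum(axis=0)                          # marginal distribution of Y
    mi = 0.0
    for j in range(n):
        for k in range(m):
            if p_jk[j, k] > 0:                      # 0 * log 0 is treated as 0
                mi += p_jk[j, k] * np.log2(p_jk[j, k] / (p_j[j] * q_k[k]))
    return mi

# Two identical, equally distributed binary sequences share exactly 1 bit.
print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1], 2, 2))      # -> 1.0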

2.2 Non-parametric clustering

2.2.1 Introduction to non-parametric clustering

Cluster analysis is an important technique in exploratory data analysis, and clustering methods have been used in a large number of scientific disciplines (Jain and Dubes, 1988). In statistical clustering methods the probability distribution of the points needs to be estimated. Three different kinds of models are used. When a priori knowledge of the structure of the clusters is provided (i.e. one knows the kind of distribution), one can use parametric clustering models. In this case only a few parameters have to be found (e.g. mean and variance) in order to find the appropriate model. When no a priori knowledge is provided, one has to estimate a very large number of parameters, and this is called non-parametric clustering. There is also a third kind, called mixture models, where the regarded distribution is made up of a weighted sum of several parametric distributions. In this case there are more parameters to find than in parametric clustering, since there are several distributions, and all of them have some parameters that need to be estimated.

2.2.2 The research field of clustering

A model granular magnet

One approach to pairwise clustering is to use a similarity threshold Θ, remove edges with weight less than Θ, and then identify the connected components that remain and call them clusters (Gdalyahu et al., 1999). This approach has been further developed by Blatt et al. The idea is to use a model from physics, a granular magnet, to cluster data. The model uses a ferromagnetic Potts model, where each point v in the data is assigned a so-called spin. The configuration S has the energy defined by

H(S) = \sum_{\langle i, j \rangle} J_{ij} \left(1 - \delta_{S_i, S_j}\right),    (2.2)

where \langle i, j \rangle stands for neighboring nodes v_i and v_j. In order to make this algorithm general, one has to specify all nodes v as neighbors of each other. The property of a granular magnet used in this algorithm is that between certain temperatures there is a so-called super-paramagnetic phase, where spin waves occur. A Monte Carlo simulation of the system is performed, and the spin waves found are interpreted as clusters. The algorithm was found to perform very well on some datasets, but it is rather slow, since all points are adjacent to all others in the general case and one has to run a Monte Carlo simulation in order to find the clusters (Blatt et al., 1997).

Clustering with deterministic annealing

Another clustering algorithm that is inspired by a model from physics was proposed by Hofmann and Buhmann (1997). They use a variant of the optimization algorithm simulated annealing (Kirkpatrick et al., 1983). Simulated annealing is based upon an analogy with the experimental annealing procedure, and the algorithm has been successful in many branches of artificial intelligence. One defines an energy function of the system and uses a Markov process which stochastically samples the solution space Ω. A new solution is accepted according to the Metropolis principle: it is always accepted if the new energy is lower than the energy of the earlier solution, and if the new solution has a higher energy it is accepted with a probability that decreases with the energy increase and grows with the temperature T. The probability of a transition between a configuration w_old and a new configuration w_new can therefore be written

P(w_{\text{old}} \to w_{\text{new}}) =
\begin{cases}
1 & \text{if } \Delta H \le 0 \\
e^{-\Delta H / T} & \text{otherwise,}
\end{cases}    (2.3)

where H is the energy of the system and \Delta H \equiv H(w_{\text{new}}) - H(w_{\text{old}}). A Markov process with the transition matrix given by equation (2.3) converges to an equilibrium probability distribution known as the Gibbs distribution (Gardiner, 1983). A stochastic search according to a Markov process with transition matrix (2.3) allows one to estimate expectation values of system parameters by computing time averages in a Monte Carlo simulation. These parameters are the variables in the optimization problem, and their values constitute the solution to the clustering problem. However, a Monte Carlo simulation is a very slow method, and Hofmann and Buhmann proposed a variant called deterministic annealing instead. The idea is to analytically estimate the relevant expectation values of the system instead of using the Monte Carlo simulation.
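The Metropolis acceptance rule in equation (2.3) is easy to state in code. The sketch below illustrates the rule in isolation (it is not the deterministic annealing variant, and the energy difference is supplied from the outside):

import math
import random

def metropolis_accept(delta_H, T):
    """Accept a move with probability 1 if it lowers the energy,
    and with probability exp(-delta_H / T) otherwise (equation 2.3)."""
    if delta_H <= 0:
        return True
    return random.random() < math.exp(-delta_H / T)

# At a high temperature almost every uphill move is accepted; at a low
# temperature only moves that lower the energy get through.
print(sum(metropolis_accept(1.0, 10.0) for _ in range(1000)))    # roughly 900
print(sum(metropolis_accept(1.0, 0.01) for _ in range(1000)))    # close to 0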

The actual clustering is based on a boolean matrix, where M_{iv} = 1 if the point i is in cluster v and 0 otherwise. The optimization problem is to minimize the sum of dissimilarities between data in the same cluster.

A randomized algorithm for pairwise clustering

Another pairwise clustering strategy is used by agglomerative algorithms. These algorithms start with N clusters (one cluster for every point), and in every step the two clusters that are most similar are merged. This procedure is repeated until the similarity of the closest clusters is lower than some given limit (Gdalyahu et al., 1999). One can also approach pairwise clustering in a graph theoretic way, using the notion of cuts in graphs. Let G(V, E) be a connected directed graph with a source s and a sink t. A cut (S, T) partitions the vertices V of G into two disjoint sets S and T so that s ∈ S and t ∈ T. The capacity of the cut, cap(S, T), is the sum of the weights of all edges that cross the cut. The minimal cut is the cut that has the smallest capacity among all cuts. The minimal cut clustering algorithm uses a cascade of minimal cuts to divide the complete graph (all points) into components (clusters). See Wu and Leahy (1993) for details. The two approaches described above are combined by Gdalyahu et al. (1999). They describe a stochastic version of an agglomerative algorithm that interprets the similarities between points (the weights between the nodes) as the probabilities that these points are in the same cluster. This algorithm has many of the desired properties in common with the two approaches described above and with the granular magnet approach, but it has a better noise tolerance.

Markovian relaxation

Another approach described in the literature is a method based on a matrix of dissimilarities. The idea is to turn this matrix into a Markov process and then examine the decay of mutual information during the relaxation of this process. The clusters emerge as quasi-stable structures during the relaxation process and are extracted by the so-called information bottleneck method. The first step in this algorithm is to turn the pairwise dissimilarity matrix into a Markov process by assigning a state of a Markov chain to each of the data points and assigning transition probabilities between states as a function of the points' pairwise distances. Since distances are additive and probabilities are multiplicative, the following formula is used

p(x_i(t+1) \mid x_j(t)) \propto e^{-\lambda d(x_i, x_j)},    (2.4)

where d(x_i, x_j) is the distance between point x_i and point x_j. The left hand side is the probability that the Markov chain moves from x_j at time t to x_i at time t+1, and λ^{-1} is a length scaling factor that equals the mean pairwise distance of the k nearest neighbors to the point x_i. One can imagine a random walk starting at every point in the Markov chain. If the pairwise distances between the points are all finite, one obtains an ergodic Markov process with a single stationary distribution. For every step taken a little more information is lost, and when the final stationary distribution is reached the information about the initial state is completely lost. By measuring the information loss during the relaxation process, one finds that there are times when the loss rate is low and times when the loss rate is high. This is measured by comparing the mutual information between the current state and the initial state. Low information loss indicates the formation of quasi-stable structures in the matrix. At these relaxation times the transition probability matrix is approximately a projection matrix, i.e. it satisfies P^{2t} ≈ P^t, and the most invariant subgraphs correspond to clusters. By using the information bottleneck method these clusters can be found, given no additional information. See Tishby and Slonim (2000).
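To make the construction in equation (2.4) concrete, the sketch below builds the transition matrix of the Markov process from a pairwise distance matrix and relaxes it by taking matrix powers; for simplicity a single global λ is used here instead of the k-nearest-neighbour length scale described above:

import numpy as np

def relaxation_matrix(D, lam, t):
    """Markovian relaxation of a pairwise distance matrix D (equation 2.4).

    P[i, j] is the probability of moving from point j to point i in one step;
    lam plays the role of the length scale lambda.  The function returns P^t.
    """
    P = np.exp(-lam * D)
    P /= P.sum(axis=0, keepdims=True)      # normalize each column to a distribution
    return np.linalg.matrix_power(P, t)

# Two well separated pairs of points: after a few steps the random walk has
# mixed within each pair long before it mixes between the pairs.
D = np.array([[0., 1., 9., 9.],
              [1., 0., 9., 9.],
              [9., 9., 0., 1.],
              [9., 9., 1., 0.]])
print(np.round(relaxation_matrix(D, lam=1.0, t=4), 3))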

2.2.3 Gravity based graph layout

The problem of non-parametric clustering using second-order statistics is somewhat similar to that of graph layout. There are several algorithms for laying out a graph with respect to forces between the nodes, and some of them use approximative N-body simulation algorithms (Quigley and Eades, 1999). However, since we do not want to make any assumptions regarding our data, we have forces between all pairs of points in our case. This corresponds to a complete graph, which makes the algorithms and techniques for graph layout unusable, since their very idea is to make use of the graph structure. The structure is used in order to decrease the number of force calculations in clever ways, but if the graph is complete one has to do all the force calculations anyway.

2.3 N-body simulation

An N-body simulation will be used in our algorithm to cluster the hypercolumns. N-body simulations are used to simulate, for example, the movements of planets and stars, but it is common to use models from physics in other branches of science as well.

2.3.1 Description of the N-body problem

The N-body problem occurs when several bodies act with a force on each other. The problem can be solved analytically only for N = 2; for all larger values of N a computer simulation is needed. The problem arises in a wide range of sciences like astronomy, chemistry and physics. Since the problem is well known, hard to solve and frequently arising, there has been a lot of research on the topic. In almost all applications in science the forces between the bodies are distance dependent, and it is therefore quite easy to come up with approximation algorithms to speed up the calculations. Generally there can be three forces present in an N-body simulation: a long-range gravitational force, a short-range repelling force (or collisions must be handled), and sometimes also a global force that acts on all bodies in the system.

2.3.2 History

N-body simulations were first studied by von Hoerner in 1960. The computer facilities at that time allowed no more than 16 particles in the simulations. Nevertheless, the problems that arise and the basic formulas used are the same as in today's large scale simulations (Aarseth, 2003). Both hardware and software have evolved dramatically since the 1960s, and today simulations with several hundred million particles can be executed (Dubinski et al., 2003).

2.3.3 Effective implementations of N-body simulations

There are two ways of achieving high speed in N-body simulations: approximation algorithms or dedicated hardware. Whenever the need for accuracy is not too demanding, one can use one of the several approximation algorithms available. However, when high accuracy is needed, there is no other way to do the simulation than to calculate all the bodies' forces acting on all the other bodies. In that case one can use dedicated hardware to speed up the simulation.

Algorithms

The complexity of the naive approach is O(N^2) for each step in the simulation, since all particles act with a force on all other particles and one has to add the forces together to calculate the total force acting on a specific body. There are however several approximation techniques that can be used to reduce the complexity of the problem at the cost of accuracy. The two most common approximation algorithms are Barnes-Hut and the Fast Multipole Method. These algorithms are built upon Newton's observation that if several bodies are close together and far enough from the examined body, they will act on the examined body approximately like one massive body. By building a tree of the bodies, one can find clever ways of approximating the forces from several distant bodies at once, thereby reducing the complexity of the problem (Aarseth, 2003).

Hardware

To perform large N-body simulations, both general purpose supercomputers (Dubinski et al., 2003) and specially built hardware solutions are used. The most famous special purpose computers are the family of GRAPE computers from Tokyo, Japan (Makino et al., 1994). They have managed to keep a low cost and at the same time increase performance dramatically in N-body simulations. This is done by using specially built pipelines for the force calculations but an ordinary computer for all other calculations. Since the force calculations are the heaviest part of an N-body simulation, the speed-up is dramatic with this solution. In the simplest case, the host computer sends the positions and the masses of all particles to the specialized hardware and gets back the forces acting on all particles. The GRAPE system has achieved high performance on N-body simulations through a pipelined and highly parallel architecture. Copies of GRAPE are used at many institutions around the world, and GRAPE has also received several awards because of its cost efficiency (Kawai et al., 2000).

2.4 Competitive and selective learning

Competitive learning will be performed by our algorithm both at the beginning and at the end. At the beginning we will use a competitive learning algorithm to find a suitable coding in the hypercolumns, and after the N-body simulation we will use a competitive learning algorithm to partition the hypercolumns. Competitive learning is a compression tool that is popular in the NN field, since compression is often needed and the algorithm is rather fast and easy to implement. The closely related vector quantization (VQ) algorithms can always obtain better compression performance than any other coding method based on the encoding of scalars, according to Shannon's rate distortion theory (Ueda and Nakano, 1994). The main idea is to compress the data by representing them by so-called units that are placed in the problem space at positions that best resemble the data. The problem is to place the units so that they cover the data points in an optimal way, i.e. they should be spread in the problem space in a way that describes the points as accurately as possible. A point belongs to the unit that is closest, i.e. the units have winner-take-all dynamics. In practice one runs an incremental algorithm that moves the units a little step towards each point that belongs to them. Often one runs into what are called dead units, which happens whenever a unit is placed too far away from the points and has no points belonging to it. The distortion is a measure of how well the units describe the points. Ueda and Nakano (1994) use this distortion to overcome the problem of dead units. Once in a while they remove the unit with the lowest distortion and place a new unit close to the unit with the largest distortion. In this way the problem of dead units is reduced, and the distribution of points between the units becomes more equal. This approach is also somewhat similar to biological processes where non-stimulated cells die.
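A minimal Python sketch of the competitive step and of the distortion measure is given below; the selection step that relocates dead units is left out here (it appears in algorithm 1 in chapter 3), and the names and parameter values are illustrative only:

import numpy as np

def competitive_step(points, units, eta=0.05):
    """One pass of plain competitive learning with winner-take-all units.

    points is an (N, d) array of data points and units an (M, d) array of unit
    positions; each point pulls its closest unit a small step eta towards itself.
    """
    for p in points:
        dists = np.linalg.norm(units - p, axis=1)
        w = dists.argmin()                       # the winning unit
        units[w] += eta * (p - units[w])         # move the winner towards the point
    return units

def distortion(points, units):
    """Sum of squared distances from each point to its closest unit."""
    d = np.linalg.norm(points[:, None, :] - units[None, :, :], axis=2)
    return (d.min(axis=1) ** 2).sum()

# Usage: repeat competitive_step until the change in distortion is small.
pts = np.random.rand(200, 3)                     # e.g. 200 pixels in RGB space
us = np.random.rand(8, 3)                        # 8 color-coding units
for _ in range(50):
    us = competitive_step(pts, us)
print(distortion(pts, us))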


Chapter 3

Our model

3.1 Problem definition

The problem we are facing can formally be described as follows. We have a set of color pictures, all of the same size (the same number of pixels). The task is to find a compact representation of these pictures by partitioning the hypercolumns into several different partitions according to the similarity of the pixels (i.e. the similarity of the hypercolumns' internal activity) over all the pictures. The algorithm is actually more general than it may seem from this description, but the first step described in the next section is tied to the input format. Therefore, it is most convenient to begin by describing the actual task we have been engaged in, which concerns pictures.

3.2 The model

To achieve the partitioning of the hypercolumns that we are looking for, our algorithm has to perform four steps:

1. Transform the pixels of a color picture into entries in hypercolumns.
2. Calculate the mutual information between the hypercolumns.
3. Do a force-directed clustering using an N-body simulation.
4. Find clusters, i.e. partition the hypercolumns.

A schematic figure that shows how the different steps in the algorithm are connected can be found in figure 3.1.

Figure 3.1. An illustration of the different steps in our algorithm. The first row describes the color coding: the color pictures become points in the RGB space, and competitive selective learning gives the partitions for the color coding, which turn the pixels into hypercolumns (binary vectors). The second row begins with the creation of the mutual information matrix and continues with the N-body simulation on randomly located bodies. The last step is the partitioning of the sphere by competitive selective learning.

3.2.1 Transformation from color pictures to hypercolumns

The first step is not needed if we are given data that are already in a format suitable for hypercolumnar representation and mutual information calculations. However, in our pictures we have 256 different values for each of the 3 colors (red, green and blue). Each hypercolumn should have an entry for each of these values, which means that if we did not somehow preprocess the pictures our hypercolumns would need to be 256^3 ≈ 16.8 million entries long. That is very unrealistic and impossible to use in our calculations of mutual information. Therefore, we have to do something to decrease the number of entries needed. We choose to do this preprocessing by competitive and selective learning as described in section 2.4. In our case the points are placed in a three dimensional space, since we have color pictures with three channels (i.e. RGB). We place a number of units among the points and run the competitive and selective learning algorithm until the distortion has become acceptable. The distortion is a measure of how well the units represent the points. When the algorithm is done, we have found a partitioning of the points into a number of partitions, where no partition is too small compared to the others. This partitioning can be seen as a compression, since all points can now be represented by the unit they belong to. This is exactly what we want, since we can adjust the number of units to increase or decrease the level of granularity.

The algorithm for this first step is outlined in algorithm 1.

Algorithm 1 Competitive and selective learning for representation of color pixels
1: procedure ColorCSL(an array pictures of RGB pictures, desired number of partitions m, a threshold ε, a learning rate parameter η)
2:   for all pictures pic ∈ pictures do
3:     for all pixels pix ∈ pic do
4:       Place a point p at (pix.red, pix.green, pix.blue) in the RGB space
5:       Add p to P
6:     end for
7:   end for
8:   Randomly place m units u ∈ U in the RGB space
9:   i ← 0; diff ← ∞                              /* iteration counter and initial distortion change */
10:  while diff > ε do
11:    distBefore ← calcDistortion(U, P)
12:    /* Do the selection step once in a while */
13:    if modulo(i, selectionInterval) = 0 then
14:      u_highestDist ← findHighestDistUnit(U)
15:      u_lowestDist ← findLowestDistUnit(U)
16:      randomVector ← getRandomVectorOfLen(3)   /* 3 for the RGB space */
17:      u_lowestDist.pos ← u_highestDist.pos + randomVector
18:    end if
19:    /* Do the competitive step */
20:    for all points p ∈ P do
21:      u ← findClosestUnit(p)
22:      u.pos ← u.pos + η · (p.pos − u.pos)      /* move the winner towards the point */
23:    end for
24:    distAfter ← calcDistortion(U, P)
25:    diff ← abs(distBefore − distAfter)
26:    i ← i + 1
27:  end while
28: end procedure

Once we have found the partitions of the pixels we can code our hypercolumns. Each pixel in the picture has a corresponding hypercolumn that consists of a number of entries, and each entry corresponds to a partition. If a point belongs to unit 3 in the competitive learning, the hypercolumn has the values [0, 0, 1, 0, ...]. A hypercolumn is therefore a binary vector in our representation, since we use a so-called winner-take-all approach, i.e. the unit that is closest to the point gets the point. One could of course think of other, smoother functions, where the unit that is second closest gets a little bit of the representation. In that case one would not get a binary vector, and if the values in the vector are allowed to be continuous, the calculation of mutual information becomes much harder. In our model we have chosen to work only with binary vectors, i.e. all our hypercolumns have winner-take-all dynamics.
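The coding of a single pixel can be summarized by the following sketch, where the unit positions are assumed to come from the color CSL; the example units and values are made up for the illustration:

import numpy as np

def encode_pixel(rgb, units):
    """Code one pixel as a binary hypercolumn vector.

    rgb is a length-3 array (the pixel in RGB space) and units a (k, 3) array of
    color-coding units; the entry of the closest unit is 1 and all others are 0.
    """
    dists = np.linalg.norm(units - rgb, axis=1)
    code = np.zeros(len(units))
    code[dists.argmin()] = 1                     # winner-take-all
    return code

units = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255], [128, 128, 128]], dtype=float)
print(encode_pixel(np.array([200.0, 30.0, 30.0]), units))        # -> [1. 0. 0. 0.]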

3.2.2 Mutual information between hypercolumns

Once we have created a representation of the pictures in the hypercolumns, we can calculate the mutual information between them. The mutual information between two hypercolumns is calculated by comparing the values of the hypercolumns over all pictures in the input data. Formally, the formula we are using is

I(X, Y) = \sum_{j=1}^{n} \sum_{k=1}^{m} p_{jk} \log \frac{p_{jk}}{p_j q_k},    (3.1)

where X and Y are variables with n and m possible discrete values respectively. In our case we have n = m. The probability p_{jk} is interpreted as the probability that the first hypercolumn is active (has value 1) at position j at the same time as the second hypercolumn is active at position k. This probability is calculated by going through all the pictures, counting how often this happens, and then dividing this number by the total number of pictures. When we have found the mutual information between all hypercolumns, we have a triangular matrix with side-length equal to the number of pixels in the original pictures. This matrix is used in the next step to set the long-range forces between the hypercolumns. However, in order to have more control over the strength of the forces in the next step, we first normalize the values in the matrix, so that all values are between 0 and 1.
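Written out for binary winner-take-all hypercolumns, the computation of the normalized mutual information matrix could look like the following sketch. The variable names are chosen for the example, natural logarithms are used, and the code does not exploit the symmetry of the matrix (cf. chapter 4):

import numpy as np

def mi_matrix(winners, k):
    """Mutual information between all pairs of hypercolumns, normalized to [0, 1].

    winners is a (num_pictures, num_hypercolumns) array where winners[p, h] is
    the index (0..k-1) of the active unit of hypercolumn h in picture p.
    """
    num_pics, n = winners.shape
    M = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            joint = np.zeros((k, k))
            for p in range(num_pics):            # estimate p_jk by counting
                joint[winners[p, a], winners[p, b]] += 1
            joint /= num_pics
            pa, pb = joint.sum(axis=1), joint.sum(axis=0)
            nz = joint > 0
            M[a, b] = (joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz])).sum()
    return M / M.max()                           # normalize so that all values are in [0, 1]

# Three hypercolumns over four pictures; the first two always agree.
w = np.array([[0, 0, 1], [1, 1, 0], [2, 2, 0], [0, 0, 1]])
print(np.round(mi_matrix(w, k=3), 2))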

3.2.3 Force directed clustering of the hypercolumns

First the hypercolumns are positioned randomly in a number of dimensions; in our simulations the number of dimensions has most of the time been 3 (more about this design choice can be found in section 4.1.4). We are now all set to perform the N-body simulation. Each body is affected by two forces: one long-range force that is proportional to the mutual information between the hypercolumns, and one short-range force that acts repelling at short distances. The attracting long-range force is given by the mutual information matrix, and the repelling short-range force is calculated from the hypercolumns' positions in the space; it decreases with the square of the distance between the hypercolumns. In every time-step these two forces from all other hypercolumns are added together and the hypercolumns are moved accordingly. The acceleration is proportional to the force, which corresponds to a physical system in which massless bodies are moving in a viscous medium. During a small time-step dt the acceleration is approximately constant, which we allow ourselves to use in order to decrease the computational requirements (more on this design choice in section 4.1.2). The long-range force is always positive, thus the hypercolumns will move closer together as the algorithm runs. However, the repelling short-range force will prevent the bodies from coming too close to each other, and after a while the system will find a stable configuration. This stable configuration is found by calculating the mean movement of the bodies, and when this movement is sufficiently small we are done with the simulation. The outline of the algorithm is described in algorithm 2.

Algorithm 2 N-body simulation to cluster hypercolumns
1: procedure N-bodySim(a vector of points P, a matrix M with the mutual information between the hypercolumns, a threshold ε)
2:   meanMov ← ∞
3:   for all p ∈ P do
4:     Create a body b and add b to B
5:     b.pos ← p
6:     b.f ← 0
7:   end for
8:   while meanMov > ε do
9:     for all b1 ∈ B do
10:      for all b2 ∈ B, b2 ≠ b1 do
11:        dir ← calcDirection(b1, b2)          /* unit vector from b1 towards b2 */
12:        dist ← calcDistance(b1, b2)
13:        f_long ← dir · M.getMI(b1, b2)        /* attraction set by the mutual information */
14:        f_short ← (−1) · dir · dist^(−2)      /* repulsion at short distances */
15:        b1.f ← b1.f + f_long + f_short        /* accumulate the total force on b1 */
16:      end for
17:    end for
18:    meanMov ← 0
19:    for all b ∈ B do
20:      b.pos ← b.pos + b.f · dt                /* see the note below */
21:      meanMov ← meanMov + ||b.f · dt|| / |B|  /* mean movement during this time-step */
22:      b.f ← 0
23:    end for
24:  end while
25: end procedure


Note that at line 20 in the algorithm we are doing the integration under the assumption that the acceleration during a time step dt is constant. A motivation for this is provided in section 4.1.2. Actually, the updating value (b.f · dt) should be divided by 2, but since this is just a constant scale parameter we can ignore that.
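For illustration, one time-step of algorithm 2 can also be written in vectorized form. The sketch below is not our implementation; the time-step dt and the random stand-in for the mutual information matrix are chosen only for the example:

import numpy as np

def nbody_step(pos, MI, dt=0.01, eps=1e-9):
    """One time-step of the force-directed clustering.

    pos is an (n, d) array of body positions and MI an (n, n) normalized mutual
    information matrix.  The long-range attraction is proportional to the mutual
    information, the short-range repulsion falls off with the square of the
    distance, and the acceleration is treated as constant during dt.
    """
    diff = pos[None, :, :] - pos[:, None, :]     # diff[i, j] = pos[j] - pos[i]
    dist = np.linalg.norm(diff, axis=2) + eps
    np.fill_diagonal(dist, np.inf)               # no self-interaction
    direction = diff / dist[:, :, None]
    strength = MI - dist ** (-2)                 # attraction minus repulsion
    force = (direction * strength[:, :, None]).sum(axis=1)
    mean_mov = np.linalg.norm(force * dt, axis=1).mean()
    return pos + force * dt, mean_mov

# Usage: iterate until the mean movement falls below a threshold.
n = 100
pos = np.random.rand(n, 3) * 100
MI = np.random.rand(n, n); MI = (MI + MI.T) / 2  # symmetric stand-in for the real matrix
for _ in range(10):
    pos, mov = nbody_step(pos, MI)
print(mov)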

3.2.4 Partitioning of the hypercolumns

The preceding step gives us a sphere of hypercolumns (if we are working in three dimensions). The hypercolumns that are close to each other in this sphere have the highest mutual information. In order to use this to find the representation we are looking for, we need to do another competitive learning phase. This time we place the units inside the sphere of hypercolumns. The competitive and selective learning will adjust these partition units so that we get a fairly even partitioning of the hypercolumns among them. Again we use the winner-take-all approach, so that each hypercolumn belongs to exactly one unit, i.e. we split the hypercolumns into partitions.

3.3 Explanation of the approach

The problem of partitioning the input data that is solved by our algorithm could be solved in quite different ways as well. If one calculates the mutual information between the data points and interprets it as a distance measure, what is left is an optimization problem that can be attacked in many different ways. It is far beyond the scope of this thesis to compare our algorithm with other approaches, but we would nevertheless like to point out some key features of our algorithm. First, it is incremental, so it could be used online to adapt to changes in the input data dynamically. This could for example be useful if the algorithm is used to preprocess data from an industrial process and some components are suddenly exchanged. Second, the algorithm has some biological properties that are interesting and useful. The force-directed self-organization can be seen as a way to model the growth of axons from one neural structure to another.


Chapter 4

Implementation of the model

A prototype of the model was first implemented in Python. This prototype gave evidence that the model was working as expected and gave a fairly good understanding of the parameters involved. The actual implementation was then made in standard C, and finally a parallel version for cluster computers was implemented using MPI.

4.1 Design choices

4.1.1 Winner-takes-all hypercolumn dynamics in the color coding

We have chosen to let the hypercolumns have winner-take-all dynamics. This means that the hypercolumns can have only one active unit at a time. The very idea of hypercolumns is that they should have this property to some extent. In our setting the introduction of a smoother function, with one unit very active and some of the surrounding units to some extent active, would just introduce noise to the system and complicate the calculation of mutual information between the hypercolumns.

4.1.2 Integration algorithm in the N-body simulation

There are many integration algorithms that could have been used for the integration of the bodies' positions in our simulation. We found, however, that we could use a fairly rough approximation and still achieve the desired results. In our application we are not sensitive to small errors in the positions of the bodies, since all we want is that the particles with the highest amount of mutual information shall be close to each other when we are done. In other applications of N-body simulations, such as simulations of colliding galaxies, it is crucial to keep the errors as small as possible, since the errors propagate and can alter the result completely. Our long-range force is not distance dependent, which means that our errors will not propagate from one time-step to another. We can therefore use the approximation that the acceleration during a time-step dt is constant, which saves a lot of calculations in the simulation. This kind of approximation is often called the leap-frog integration scheme.

4.1.3 Brute force N-body force calculations

As mentioned in section 2.3.3, there exist several approximation algorithms that, instead of calculating the forces between all bodies in every time-step, take advantage of the observation that bodies far away from the examined body act on it like one massive body. This observation depends on the fact that the gravitational force is distance dependent, since we have to define a threshold for what is meant by far away. In our setting the long-range force is not distance dependent, and therefore this whole family of algorithms cannot be used. As far as we know, there is no approximation algorithm for N-body simulations that can be applied to our problem. It is therefore necessary to do the force calculations by brute force (no pun intended).

4.1.4 Parameters

Number of dimensions

Each point (pixel) in the input data has 3 dimensions when working with full color pictures in RGB format. If the algorithm were applied to some other kind of data, the number of dimensions of the input data points could be very different. However, after the first step the number of dimensions is just one, since the competitive learning algorithm yields just a vector of units. The number of dimensions in the N-body simulation, however, is not dependent on the input data, since the first thing done there is to randomize the initial positions of the particles in all dimensions. The number of dimensions in our simulations has mostly been 3. The only reason for choosing this value is that the real world is in 3 dimensions and that 2 dimensions did not give equally good results in our tests. Generally, it is considered bad to introduce more complexity than needed in an algorithm (Occam's razor), and since the execution time increases with the number of dimensions, we have chosen to work with 3 dimensions in this application. However, if one finds that the performance in 3 dimensions is not satisfactory, there is no problem in increasing the number of dimensions used in the N-body simulation.

Number of units in the color coding

The number of units in the color coding is also a parameter one has to deal with. If the number is too high, the compression rate of the color coding is lower, which means that we will have to spend more time calculating the mutual information and that the mutual information between the particles will be more sensitive, i.e. only particles that are extremely similar will contribute to the long-range force. If the number of color coding units is too small, the mutual information will be too rough, and particles that are not very similar will nevertheless seem similar. In our implementation we have mostly used 8 units for the color coding.

All other parameters

There are other parameters in the algorithm as well, for example the ε (error tolerance) and η (learning rate) in the competitive learning algorithms. These parameters have been set to their current values by trial and error. The algorithm is not that sensitive to their values, but one has to make a trade-off between accuracy and speed. A complete list of the parameters and their values can be found in appendix A.

4.2 Parallel design

4.2.1 Parallelizing the calculations

In order to make the algorithm able to handle large pictures, we have to parallelize it so that it can be run on a parallel cluster computer. There are some design choices to be made in order to achieve good parallel performance. The main issues are that the workload of the processors should be as even as possible and that the communication between them should be kept at a minimum. The algorithm is not a so-called embarrassingly parallel one, in which it is easy to split the data between several processors and then combine their answers at the end. One can easily see that in order to calculate the movement of one body in the N-body simulation, we have to know the positions of all other bodies, since all bodies act with a force on our body. Therefore, messages must be sent between the processors every time-step to tell where all the bodies are located. However, the calculation of the mutual information is embarrassingly parallel, since we have a triangular matrix (because of the symmetry property of the mutual information) of values, and we can divide this matrix into pieces and have a processor work on each piece. The competitive and selective learning steps are not computationally expensive, and we found that there is no need to run them in parallel at all. If we calculated the mutual information by dividing the matrix into pieces, we would have to combine the results in a large matrix and then distribute it to all processors before the next step, the N-body simulation. All processors would then know the mutual information between all bodies. It turns out that this is not at all necessary, since we will parallelize the N-body simulation too. The N-body simulation is parallelized by assigning each processor a number of bodies to be responsible for, i.e. to calculate how these bodies will move when affected by the other bodies. More details about this are given in section 4.2.2. This means that each processor only has to know the mutual information between its own bodies and the rest, i.e. a number of rows in the mutual information matrix. Therefore, it is a better choice not to divide the calculation of the mutual information matrix into pieces and then combine the result, but instead have every processor compute a small stripe (a number of rows) of the mutual information matrix. In this way we have eliminated the need to gather the pieces of the matrix and distribute the whole matrix to all processors. Since this matrix has a side-length equal to the number of pixels in a picture, we save a lot of communication by this approach. However, additional computations are needed, since we cannot make use of the symmetry property of the mutual information matrix. As a result we have to perform about twice as many calculations as in the first approach. Since the communication between step 2 and step 3 is eliminated, the extra computations are well worth their cost. An illustration of the two different approaches is provided in figure 4.1.

Figure 4.1. The figure shows how the computations of the mutual information would be split among the processors in the first and the second approach respectively. The areas of the same color should be calculated by the same processor. In the left figure, only half of the matrix has to be calculated. Nevertheless, the second approach is chosen, since it minimizes the communication.
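The stripe decomposition itself can be summarized in a few lines. The sketch below (our implementation is in C; this is only an illustration) computes which rows of the mutual information matrix, and thereby which bodies, a given processor is responsible for:

def my_rows(num_bodies, num_proc, my_rank):
    """Rows of the mutual information matrix owned by processor my_rank.

    The rows are split into contiguous stripes of almost equal size, so the
    workload is balanced; every processor computes the full rows for its own
    bodies and never needs the rows belonging to the other processors.
    """
    base, rest = divmod(num_bodies, num_proc)
    start = my_rank * base + min(my_rank, rest)
    stop = start + base + (1 if my_rank < rest else 0)
    return range(start, stop)

# 10 bodies on 3 processors: ranks 0, 1 and 2 get 4, 3 and 3 rows respectively.
print([list(my_rows(10, 3, r)) for r in range(3)])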

4.2.2 Message-passing paradigm

There are several possible message-passing paradigms that can be applied to the communication between the processors in an N-body simulation; see Andrews (2000), pp. 553-569, for a review. We have chosen to implement the pipeline approach, which means that every processor is responsible for a number of bodies and that the forces between these bodies and the other bodies are calculated by this processor. The positions of the bodies are sent to the next processor in a circular manner, and when a message has visited all processors they have all completed one time-step. The outline of the algorithm is provided in algorithm 3, and figure 4.2 is a graphical illustration of the same algorithm.


Algorithm 3 Outline of the message-passing
1: procedure Message-passing
2:   next ← processors[mod(myRank + 1, numProc)]
3:   prev ← processors[mod(numProc + myRank − 1, numProc)]
4:   Send all my bodies' positions to next
5:   Calculate the internal forces among my bodies
6:   for all i ∈ [0 : numProc − 2] do
7:     Receive a block of bodies' positions from prev
8:     Calculate the forces these bodies exert on my bodies
9:     if i ≠ numProc − 2 then                 /* the last block does not need to be forwarded */
10:      Send the block of bodies to next
11:    end if
12:  end for
13:  Move my bodies according to the forces
14:  Reset the forces of my bodies to 0
15: end procedure

Figure 4.2. The figure shows how the message-passing works in the case of 4 processors, A, B, C and D. The set of bodies that each processor has been assigned is named by the processor’s letter in lowercase. First, the forces among the processors’ own bodies are calculated, then the processors start to pass their bodies around in a circular manner.
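The same circular pipeline can be sketched in Python with the mpi4py package; this is an illustration of the communication pattern only, not our C implementation, and add_forces is a placeholder for the force calculation:

# Run with e.g. "mpiexec -n 4 python pipeline.py".
from mpi4py import MPI

def pipeline_step(my_bodies, add_forces):
    """One time-step of the pipeline: pass blocks of body positions around the ring.

    my_bodies is any picklable object holding this rank's bodies, and
    add_forces(mine, others) accumulates on mine the forces from others.
    """
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    nxt, prev = (rank + 1) % size, (rank - 1) % size

    add_forces(my_bodies, my_bodies)             # internal forces among my own bodies
    block = my_bodies
    for _ in range(size - 1):
        # send the current block forward and receive the next one from behind
        block = comm.sendrecv(block, dest=nxt, source=prev)
        add_forces(my_bodies, block)             # forces from the received bodies
    # after size - 1 exchanges every rank has seen all bodies and can move its own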


Chapter 5

Evaluation and performance

This chapter describes the performance of our algorithm. We investigate the correctness, the theoretical complexity and the execution time for both the sequential and the parallel version. The algorithm was tested with three sizes of pictures: 32x32, 64x64 and 128x128 pixels. The pictures were mostly outdoor, natural scenery pictures. For the 32x32 and the 64x64 sizes, the pictures used were actually small parts of a larger picture. We found that if a large picture was resized to these small sizes, it became too blurred and the performance of the algorithm decreased. Most of the pictures used were collected from the public domain photography site Bigfoto (www.bigfoto.com).

5.1 Performance in terms of correctness

In our application it is not easy to measure the correctness of the output. The output of the algorithm is a partitioning of the hypercolumns, and there is no way to tell whether the partitioning is correct, since there is no correct answer to compare it with. However, what one can do is to visualize the partitions using different colors. Each point is assigned a color based on which partition the point is placed in. The hypercolumns that are close to each other should in general have more mutual information than hypercolumns far from each other, since objects in the pictures often have the same color in several pixels close to each other. This means that we can expect the colored hypercolumns to form connected areas of colors. By looking at these visualizations one can see whether the algorithm works as expected or not. Throughout the work this has been our main tool for comparing the performance of different configurations of our algorithm. If input data other than pictures were used this would not be possible, thus making it very hard to measure the correctness of the algorithm. One could of course use one of the standard databases for machine learning as a performance measure. The problem with these is that the classes provided would probably not be found by our algorithm. Our algorithm would cluster together the points with the highest degree of mutual information, but in the last step pieces of the resulting sphere would be interpreted as partitions, and these would probably not correspond to the classes. A more thorough discussion about this matter is provided in section 6.2. We are therefore forced to use only the connected areas in the plotted figures as evidence showing that the algorithm works. In figure 5.2 the sphere created by the N-body simulation is shown, with the partitions plotted in different colors. In figure 5.1, figure 5.3 and figure 5.4 the partitioning of the hypercolumns is shown by their positions in the pictures. Note that in these plots we have partitioned the hypercolumns into 8 or 9 partitions. The reason is simply that this allows the reader to see the partitions clearly. When preprocessing data for a neural network, we will not use such large partitions, but rather have partitions of sizes like 40 hypercolumns, corresponding to about 400 partitions for pictures of size 128x128. In figure 5.5 we have partitioned the bodies into 20 different partitions.

Figure 5.1. The colored hypercolumns (pixel-positions) show the resulting partitions. The color of a point shows what partition the point belongs to. The connected areas show that the algorithm works as expected. This is the result from a set of 32x32 pixel sized pictures.

5.2 Complexity

Since our main focus is to have an easily parallelized algorithm meeting some computational and biological principles, the strength of the algorithm is not that it is fast on an ordinary desktop computer. One can easily see that since we are using second-order statistics, our algorithm must compare all bodies with each other, which leads to quadratic complexity. Below we investigate the algorithm's complexity in more detail, as well as the complexity of the parallelized version.


Figure 5.2. The sphere of the hypercolumns. The colors of the points show the resulting partitions. The sphere is almost empty inside, and it has a hole at the bottom. This hole comes from the fact that there are edges on the input pictures where the correlation, and therefore the mutual information, is lower than for other positions.

Figure 5.3. This figure shows the result for a picture set with picture sizes of 64x64 pixels. The connected areas show that the algorithm works as expected.


Figure 5.4. The result for a picture set with 128x128 pixel sized pictures. The color of a point shows which partition it belongs to. We have not been able to perform as many executions with these large pictures as with the smaller ones, and the parameter values are therefore not as well tuned as in the cases with smaller pictures. This leads to a somewhat less satisfactory result, where not all of the partitions form connected areas. Nevertheless, one can clearly see that most of the partitions are well formed. The upper part of the figure is more correctly partitioned, and we believe that this can be explained by the fact that we are using pictures of natural scenes, where the upper part often contains a part of the sky. The lower part of the figure is more cluttered, which probably comes from the fact that there are often a lot of small objects in these positions in the pictures.


Figure 5.5. This figure shows the result when the 128x128 pixel sized pictures were partitioned into 20 different partitions. Again the upper part of the figure is more correctly partitioned than the lower part.


5.2.1 Complexity of the sequential algorithm

In order to analyze the complexity of our algorithm we need to define some variables. These variables are explained in table 5.1.

Name   Explanation
n      number of bodies (the number of pixels in a picture)
m      number of pictures used
i      number of iterations in the N-body simulation
d      number of dimensions used in the N-body simulation
b      number of pictures used to find the color representation in step 1
k      number of intervals (number of units used in the color representation)
j_c    number of iterations needed in the CSL in step 1
j_f    number of iterations needed in the CSL in step 4

Table 5.1. Definition of variables for the complexity analysis.

Step 1, the competitive learning for the color coding, has complexity O(nm + n j_c b k) ∈ O(nm), since j_c, b and k are all constants. The same reasoning can be applied to the competitive learning in step 4, which has complexity O(nm + n j_f k) ∈ O(nm). In step 2, the calculation of the mutual information, one needs to compare all pixels in all pictures with each other. The complexity of this step also depends on the number of intervals in the color coding, and can therefore be written O(n²(m + k²)) ∈ O(n²m), since the number of intervals, k, is a constant. In the N-body simulation, step 3, we calculate the force between all body pairs in every iteration and in all dimensions. This gives the complexity O(n²id) ∈ O(n²) for this step. The last inclusion is possible since the number of dimensions is a constant and the number of iterations is independent of the number of bodies. The last assumption is not necessarily true, since the number of iterations depends on how fast the system becomes stable, which might depend on the number of bodies. However, in our tests we have not found any clear connection between these parameters, and there are other factors that influence the number of iterations more clearly. Therefore we feel that this assumption is justified. From the analysis above we can conclude that the total complexity of our algorithm is O(n²m).
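To make the origin of the dominating O(n²(m + k²)) term concrete, the following schematic sketch (not the thesis implementation) computes the mutual information for all body pairs from discretized color codes. The array name `color_codes` and its layout (one color-interval index per body and picture) are our own assumptions.

    import numpy as np

    def mutual_information_matrix(color_codes, k):
        # color_codes: (n x m) array of color-interval indices in [0, k).
        n, m = color_codes.shape
        mi = np.zeros((n, n))
        for a in range(n):                     # all pairs of bodies -> factor n^2
            for b in range(a + 1, n):
                joint = np.zeros((k, k))
                for p in range(m):             # all pictures -> factor m
                    joint[color_codes[a, p], color_codes[b, p]] += 1
                joint /= m
                pa, pb = joint.sum(1), joint.sum(0)   # marginals -> factor k^2
                nz = joint > 0
                mi[a, b] = mi[b, a] = np.sum(
                    joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
        return mi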

5.2.2 Complexity of the parallel algorithm

The total complexity of the algorithm is of course not affected by the parallelization. Instead, the interesting question is how much work each processor has to perform. The first and the fourth step of the algorithm have linear complexity, and there is no need to parallelize them at all. Step 2 and step 3 are the heavy parts, and both of these steps were parallelized. The complexity for each processor is thereby decreased, since the processors share the work among them. Instead of computing the mutual information and the forces for all body pairs, each processor performs the computation only for the pairs in which one of its own bodies takes part. This means that the complexity for each processor is decreased from O(n²m) to O((n/p)·nm) = O(n²m/p), where p is the number of processors.
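A sketch of how the pairwise force computation can be shared among p MPI processes is given below. The cyclic assignment of bodies to processes and the particular force law (attraction proportional to mutual information, a generic 1/r repulsion) are illustrative assumptions, not necessarily the scheme used in our implementation.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, p = comm.Get_rank(), comm.Get_size()

    def local_forces(pos, mi):
        # pos: (n x d) body positions, mi: (n x n) normalized mutual information.
        n, d = pos.shape
        forces = np.zeros((n, d))
        for a in range(rank, n, p):                # only this process's bodies
            diff = pos - pos[a]                    # vectors from body a to all others
            dist = np.linalg.norm(diff, axis=1) + 1e-9
            strength = mi[a] - 1.0 / dist          # attraction minus repulsion (illustrative)
            contrib = strength[:, None] * diff / dist[:, None]
            contrib[a] = 0.0                       # no self-interaction
            forces[a] = contrib.sum(axis=0)
        return forces

    # Summing the per-process arrays gives the full force array on every process:
    # forces = comm.allreduce(local_forces(pos, mi))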

5.3 Measured performance in terms of speed

From the previous section we know that the execution time of the algorithm increases strongly with the number of bodies, i.e. with the size of the pictures. The most costly step in the algorithm is the force calculation in the N-body simulation, which is repeated many times during the execution. In this section we report some tests to see how fast the algorithm is in practice. The purpose of these tests was primarily to compare the practical results with the theoretical analysis in the previous section. The actual figures are not especially interesting in themselves; it is the relative increase in execution time that we are interested in.

5.3.1 Performance of the sequential program

The sequential algorithm was tested on a 2.4 GHz Pentium 4 processor with 1 GB of RAM. Two sizes of pictures were used: 32x32 and 64x64 pixels.

Picture size     128 pictures   1024 pictures
32x32 pixels     49.8 s         N/A
64x64 pixels     44 m 35 s      28 m 15 s

Table 5.2. This table shows the execution time of the sequential version of the algorithm on a 2.4 GHz Pentium 4 processor with 1 GB RAM.

In table 5.2 one can see that the execution time is actually shorter when we use more pictures. This might seem surprising, since the complexity analysis in section 5.2.1 indicates that the execution time will increase with the number of pictures, as the calculation of mutual information (the second step of the algorithm) depends on this parameter. However, there is a simple, yet interesting, explanation: the N-body simulation does not run a predefined number of steps; instead the termination is determined by the stability of the system, i.e. by the size of the movements of the bodies. When the simulation begins the bodies move large distances, but towards the end they only move a tiny distance at each step. This fact is used as a termination condition. The execution time decreases when we use more pictures simply because more pictures mean that less noise affects the simulation, i.e. there is a more distinct difference between the bodies that have a large amount of mutual information and the bodies that do not. This means that the system finds a stable configuration faster, so fewer simulation steps are needed until the threshold for termination is reached. Therefore we actually decrease the execution time by using more pictures. Note that since we are normalizing the mutual information matrix (see section 3.2.2), it is really the diffuseness that is decreased, and not the absolute value of the long-range force that is increased. There must of course be a limit to how many pictures it is worth using. In a real application of the algorithm one probably does not even have the ability to choose.
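The stability-based termination rule described above can be sketched as follows. The parameter names mirror those in Appendix A (checkInterval and meanMovementBreakThreashold), but the exact way they are combined here is an assumption on our part.

    import numpy as np

    def run_until_stable(step, pos, check_interval=50, break_threshold=100.0):
        # Advance the simulation until the bodies move, on average, less than
        # the threshold over one check interval.
        while True:
            old = pos.copy()
            for _ in range(check_interval):
                pos = step(pos)                    # one N-body integration step
            mean_movement = np.linalg.norm(pos - old, axis=1).mean()
            if mean_movement < break_threshold:
                return pos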

5.3.2 Parallel performance

Environment

The parallel implementation has been tested and executed on the cluster computer Lucidor at PDC at KTH. Lucidor is a distributed memory computer from HP. It consists of 90 nodes, each with two 900 MHz Itanium 2 processors and 6 GB of main memory. The interconnect is Myrinet with a card and port rate of 2+2 Gbit/s (PDC, 2005). Our largest simulations have used 64 of these nodes at the same time. The execution times were measured with MPI_Wtime in the master process (the process with rank 0).

Execution time

The execution times are given in table 5.3. The set of small pictures, with pictures of size 32x32 pixels, was run with 128 pictures. The two larger problem sizes were run with 1024 pictures. More pictures are needed for the larger pictures in order to make the noise level low enough when there are many pixels, as described in the section about sequential performance (section 5.3.1).

Picture size      2 proc.    4 proc.    8 proc.    16 proc.   32 proc.   64 proc.
32x32 pixels      81 s       55 s       22 s       17 s       N/A        N/A
64x64 pixels      21 m       14 m       7 m        3 m 38 s   1 m 47 s   1 m 7 s
128x128 pixels    7 h 18 m   4 h 24 m   2 h 28 m   47 m       29 m       N/A
Table 5.3. Execution time for the parallel version of the algorithm on the Lucidor cluster with 2-64 processors. The values for the largest pictures are not all measured values, but approximated values based upon executions with different numbers of iterations in the N-body simulation. This adjustment makes comparisons between the executions of different picture sizes possible. For the small pictures, the overhead becomes large with more than 8 processors, and the values for 32 and 64 processors are therefore irrelevant. The missing value for the large pictures is due to the heavy load of Lucidor and the limited time for this project.

In figures 5.6, 5.7 and 5.8 we have plotted the execution time against the number of processors. One can see that the execution time scales almost linearly with the number of processors. The execution time for the smallest pictures (32x32) at 16 processors is much larger than expected. However, this is quite natural, since the data blocks on each processor are here too small, and the communication overhead becomes comparatively high. In figure 5.7 (pictures of size 64x64 pixels) this effect is not present, since the data blocks in this case are much larger. For the picture set with pictures of size 128x128 pixels the speed-up between 8 and 16 processors is super-linear, which can probably be explained by some memory or cache related issue. When using 16 instead of 8 processors the data blocks become half the size, so the cache memory may suddenly become sufficiently large for all data. When executed on 16 and 32 processors, the larger pictures (128x128 pixels) have execution times that are 13 and 16 times as long as those for the medium sized pictures (64x64 pixels). This is what one might expect, since the quadratic complexity in the number of pixels means that 4 times as many pixels take about 16 times as long to process. For both the medium sized pictures and the large pictures the figures show that the speed-up scales almost linearly. Since linear scaling is the upper limit of how well a parallel code can scale, the diagrams clearly show that we have parallelized our algorithm in a very effective way.

Figure 5.6. The execution time for pictures of size 32x32 pixels plotted for different numbers of processors. One can see that the execution time for 16 processors is longer than expected. This is due to the fact that the computation assigned to each processor is too small, so the communication overhead becomes too high.

Speed-Up

To see the actual speed-up achieved by the parallelization, one usually measures the absolute speed-up. The absolute speed-up is defined as

S_p = T_s / T_p,    (5.1)

where T_s is the execution time of the best sequential algorithm and T_p is the execution time of the parallel algorithm. The absolute speed-up measure is used to evaluate how much time one saves by parallelizing an algorithm, compared to running a sequential version. Since we have already established, in the section above, that the parallelization as such was very effective, i.e. that the speed-up scales very well with the number of processors, it is interesting to compare the sequential version with the parallelized one. In figure 5.9 one can see the absolute speed-up for

Figure 5.7. The execution time for pictures of size 64x64 pixels. Here one can see that the algorithm scales very well with the number of processors, since the logarithmic plot at the right almost forms a straight line.

the pictures of size 64x64 pixels. The speed-up was, as far as we have been able to measure, about the same for all problem sizes, and therefore we only include figures for one of the picture sizes. The speed-up is roughly half the optimal value, e.g. 4 processors make the algorithm twice as fast compared to running the sequential program. With more processors the speed-up becomes more pronounced, and for 16 processors the absolute speed-up is 7.0. Since we have already seen that the algorithm scales very well with the number of processors, we expect the absolute speed-up to be much greater with even more processors. Even though the absolute speed-up is not more than half of the optimum, this is relatively good; it is impossible to reach the optimal absolute speed-up since there will always be some communication and synchronization added in a parallel program.
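As a small worked example of equation (5.1), the snippet below computes absolute speed-up and parallel efficiency for a few processor counts. The timing values are purely illustrative and are not the measured values from table 5.3.

    t_seq = 1680.0                                 # hypothetical sequential time in seconds
    t_par = {2: 1000.0, 4: 840.0, 8: 480.0, 16: 240.0}

    for p, t_p in t_par.items():
        speedup = t_seq / t_p                      # S_p = T_s / T_p, equation (5.1)
        efficiency = speedup / p                   # fraction of ideal linear speed-up
        print(f"{p:2d} processors: S_p = {speedup:.1f}, efficiency = {efficiency:.2f}")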

5.4 Reducing the number of calculations needed

The algorithm has quadratic complexity in the number of pixels in the pictures. However, for large pictures one can use the property of mutual information decay in order to decrease the number of calculations needed. The mutual information decreases with the distance between the pixels compared. For large pictures, pixels that are far apart will have a very small amount of mutual information. By setting a threshold at a suitable value, we can ignore the small values in the mutual information matrix, thus decreasing the number of force calculations needed in our simulation. The threshold must not be set too high, since that would affect the result of the simulation. Figure 5.10 shows how the mutual information decreases with the distance between the pixels compared. In order to use the technique suggested above, we need to work with large pictures, since the extra work for sorting out small values will not pay off unless many values in the mutual information matrix can be ignored. In pictures of size 64x64 pixels the extra work will not pay off, and the algorithm did not become any

Figure 5.8. The execution time for pictures of size 128x128 pixels. The super-linear speed-up between 8 and 16 processors is probably due to some memory or cache related issue, since each processor uses only about half as much memory when we double the number of processors. The values in these plots are not all measured values, but approximated values based upon executions with different numbers of iterations. The values have been adjusted in this way in order to enable comparisons between these executions and the executions with 64x64 pixel sized pictures.

Figure 5.9. The absolute speed-up for the parallel version of our algorithm, in the case of pictures of size 64x64 pixels.

faster in our tests when the threshold was introduced. However, for pictures of size 128x128 pixels and larger the extra work will probably pay off, and the execution

Figure 5.10. The mutual information between the pixels decreases with the distance between them. This figure shows the mean mutual information between pixels at different distances in pictures of size 64x64 pixels. The number of color coding levels used was 8. The figure includes only distances with more than 1000 pairs.

time will probably become shorter. Due to the limited time for this project we have not been able to confirm this hypothesis with exact figures, since many executions are needed before a suitable value for the threshold can be found: a value not so large that it affects the result, but large enough to decrease the execution time. It is worth noting that the effect of the threshold will increase quickly with the size of the pictures; it is therefore quite possible that the execution time for pictures larger than 128x128 pixels will not be much longer than it is for that size. The idea of introducing a threshold is based on our a priori knowledge about pictures. There is no straightforward generalization of this technique to other domains where no a priori knowledge is provided. In many domains similar knowledge does exist, but definitely not in all possible domains where our algorithm can be applied.
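The thresholding idea can be sketched as follows: keep only the body pairs whose normalized mutual information exceeds the threshold and restrict the force calculations to those pairs. The sparse pair-list representation and the threshold value in the example are assumptions.

    import numpy as np

    def significant_pairs(mi, threshold):
        # Return index pairs (a, b) with a < b whose mutual information exceeds
        # the threshold; forces need only be computed for these pairs.
        a, b = np.nonzero(np.triu(mi > threshold, k=1))
        return list(zip(a, b))

    # Example (illustrative threshold): pairs = significant_pairs(mi, 0.05)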

5.5 Summary of the evaluation chapter

In this chapter we have seen that our algorithm works as expected. There is no easy way to measure the correctness in figures, but by visualizing the partitions found by the algorithm, one can see that the partitions are made up of connected areas of points. This is what we expect to happen if the algorithm works. We have also shown that the complexity of the algorithm is O(n²m), where n is the number of pixels and m is the number of pictures. However, we found that the term m does not affect the real execution time in this linear way. This is because the algorithm runs until a stable configuration has been found, and the time needed for this is, to some extent, inversely proportional to m. The optimal number of pictures to use depends on the number of pixels, since larger pictures require more accurate statistics. Finally, the execution time of the parallel program was investigated. The added communication overhead made the absolute speed-up (speed-up compared to the sequential program) small for small numbers of processors. However, the algorithm is meant to be run on a large number of processors, and the speed-up compared to the sequential algorithm is not very interesting, since there is really no option to run the algorithm on a sequential computer. What is more interesting is how the algorithm scales with the number of processors. We found that our algorithm scaled very well, in fact almost optimally, with the number of processors.


Chapter 6
Discussion

6.1 Generalizing the algorithm to other domains

As a preprocessing step to a neural network, our algorithm is far more general than described so far in this thesis, since the input data are not restricted to pictures. For example, one can find correlations in data from an industrial process, or find correlations in time series that would otherwise decrease the storage capacity of the neural network. Different domains require tuning of the parameters, but apart from that we expect our algorithm to perform well on other domains as well, without any, or at least only very minor, changes in the code. The first step, the competitive learning for the color coding, can look quite different for different domains. Full color pictures are made up of three color channels, which means that the learning in this case is done in three dimensions. Other data could have one or many dimensions, but the overall procedure would be the same nonetheless. One can also imagine cases where we are already provided with mutual information data between points. In such a case one can simply skip the first and second step (except the normalization of the mutual information), and use the mutual information directly in the N-body simulation in step 3. Due to the limited time for this project we have concentrated on the case of pictures. Further research will have to confirm the generality of the algorithm.

6.2 Generalizing the algorithm to different tasks

One could generalize the algorithm further and treat it like other clustering algorithms. Our algorithm has up to this point been discussed mainly within its original context, as a preprocessing step to a neural network. However, the algorithm can be seen as a general clustering algorithm that clusters data points according to a force acting between them. Since any distance measure can be mapped to a force, one can find applications of the algorithm in a wide range of domains.

6.2.1 Performance on other data

In order to find out how well our algorithm performs in ordinary classification tasks, we have run some tests with public domain databases for machine learning. When applying our algorithm to these domains, we omit step 1, the color coding, and step 2, the calculation of mutual information. Instead we use the distances given by the attributes of the instances in the data. We have tested our algorithm on two small datasets from the machine learning repository at the University of California, Irvine (www.ics.uci.edu/~mlearn/MLRepository.html). The first dataset contains 7 different categories of animals. Each animal has 16 attributes (binary and integer) and belongs to one class. Without too much tweaking of the parameters our algorithm classified 89 of 100 animals correctly. The second dataset contains 178 different wines of 3 different classes. Each instance has 13 attributes (11 floats and 2 integers). Our algorithm classified 169 of 178 (95 %) instances correctly in this test. Although our algorithm is not best suited for these kinds of machine learning tasks, the results show that it works well for them too. In the original setting, where we clustered hypercolumns from pictures, we had 4096 instances (hypercolumns) with 1024 attributes (the number of pictures) each for the medium sized pictures (64x64 pixels). Since the two machine learning datasets we tested contained only a few hundred instances with fewer than 20 attributes each, the algorithm was quite fast in these tests. For example, the classification of the wines took only about 2 seconds on a modern PC.
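For these attribute-based tests, the distances given by the attributes replace the mutual information. A sketch of such a mapping is shown below; the normalization and the exponential distance-to-affinity mapping are our own illustrative choices, not necessarily those used in the tests.

    import numpy as np

    def affinity_from_attributes(X):
        # X: (instances x attributes). Returns a normalized affinity matrix that
        # plays the same role as the normalized mutual information matrix.
        X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)   # equalize attribute scales
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        affinity = np.exp(-d / d.mean())                    # closer instances attract more
        np.fill_diagonal(affinity, 0.0)
        return affinity / affinity.max()

    # The resulting matrix can be fed directly into the N-body step (step 3).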

6.2.2 Further possibilities

The possible applications of the algorithm are many, even outside the original context of preprocessing data for a neural network. Even though the performance on the tested classification datasets was quite satisfactory, the algorithm is probably even more suitable for domains where the instances do not have attributes, but do have qualities that can only be measured using second-order statistics (mutual information). One such domain is document clustering, where one wants to cluster documents that are in some sense similar. In this case there are no obvious attributes, but by calculating mutual information between the documents our algorithm can be used to cluster them. One could also use the information provided by step 3 of our algorithm, the N-body simulation, as a kind of topography map in order to see which data points are most related to each other. This could also be applied to the document clustering problem, in which case one could extract information about similar documents directly from the sphere (or hypersphere) of the third step. The topography map described above can be compared to Kohonen's self-organizing feature maps (SOFM or SOM), see Kohonen (1990). Further analysis and comparison between our approach and Kohonen's feature maps are beyond the scope of this thesis.

6.3 Future work

The algorithm described in this thesis comprises just the first steps of the hidden layer proposed for preprocessing the input to a neural network. In order to really test the performance of our algorithm we need the other steps as well. Only then can we run the whole system and measure how the memory capacity is affected. This is beyond the scope of this thesis, but it is the primary task for further investigations in later works. It would also be interesting to investigate how well our algorithm performs in a preprocessing task for some domain other than pictures, as described in section 6.1. However, in order to evaluate this we need the whole system, since there is no general way to visualize other domains in the way that we have done with pictures. The algorithm was found to perform quite well in some ordinary classification tasks (section 6.2), and it would be interesting to further investigate its performance and its similarities to and differences from other approaches. Our algorithm is probably well suited for clustering data that have mutual information but lack a set of attributes that can be measured. Some possible examples are text documents (in problems like text mining), pictures in some recognition tasks, and songs in a music collection. More formally, such domains have in common that they are more suitable for analysis with second-order than with first-order statistics. They also have in common that they lack well-defined natural classes, which makes any partitioning based on similarity interesting.


Chapter 7
Conclusion and summary

Attractor neural networks are usually evaluated on sparse random patterns, since the correlations almost always present in real world data decrease their performance dramatically. To deal with this problem a preprocessing hidden layer has been proposed. This thesis presents the first steps in the creation of such a hidden layer. We have presented an algorithm that uses second-order statistics to self-organize the information in a way that exposes the correlations. This is done by a force-directed clustering based on a model from physics, the N-body simulation. Finally, the algorithm splits the information into partitions, which will be used in the feature extraction in the next step of the hidden layer. The algorithm was found to perform as expected. However, there is no absolute measure of this performance until the whole hidden layer has been created and the whole system can be tested. In this thesis we have used plots of the resulting partitions to visualize the correctness of the algorithm. The algorithm was parallelized and run on a cluster computer. The parallelization was found to be very effective, and, as far as we have been able to measure, the algorithm scales almost perfectly with the number of processors. Future work will include tests of the complete system, as mentioned above. It will also include tests of the algorithm on domains other than pictures. In this thesis we have also found some interesting topics outside the original context of the algorithm (the hidden layer), and it would be interesting in future work to see how well the algorithm can be applied to some of these problems.


Bibliography

S. J. Aarseth. Gravitational N-body Simulations. Cambridge University Press, 2003.

G. R. Andrews. Foundations of Multithreaded, Parallel, and Distributed Programming. Addison Wesley Longman, Inc., 2000.

M. Blatt, S. Wiseman, and E. Domany. Data clustering using a model granular magnet. Neural Computation, 9(8):1805–1842, November 1997.

J. Dubinski, R. Humble, U. Pen, C. Loken, and P. Martin. High performance commodity networking in a 512-CPU teraflop Beowulf cluster for computational astrophysics. ArXiv Astrophysics e-prints, May 2003.

C. Gardiner. Handbook of Stochastic Methods. Springer, 1983.

Y. Gdalyahu, D. Weinshall, and M. Werman. A randomized algorithm for pairwise clustering. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 424–430. MIT Press, 1999.

T. Hofmann and J. M. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1–14, January 1997.

A. Holst and A. Lansner. A higher order Bayesian neural network for classification and diagnosis. In A. Gammerman, editor, Computational Learning and Probabilistic Reasoning, chapter Applied Decision Technologies, pages 251–260. John Wiley & Sons Ltd., 1996.

J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. In Proceedings of the National Academy of Sciences, volume 79, pages 2554–2558, 1982.

A. Hyvarinen and J. Karhunen. Independent Component Analysis. Wiley-Interscience, 2001.

A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

C. Johansson. Towards Cortex Isomorphic Attractor Neural Networks. Licentiate thesis, Department of Numerical Analysis and Computer Science, Royal Institute of Technology, 2004.

C. Johansson, A. Sandberg, and A. Lansner. A Capacity Study of a Bayesian Neural Network with Hypercolumns. Technical Report TRITA-NA-P0120, Department of Numerical Analysis and Computer Science, Royal Institute of Technology, 2001.

D. Jones. Elementary Information Theory. Oxford University Press, 1979.

A. Kawai, T. Fukushige, J. Makino, and M. Taiji. GRAPE-5: A special-purpose computer for N-body simulations. Publications of the Astronomical Society of Japan, 52:659–676, August 2000.

S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.

T. Kohonen. The self-organizing map. In Proceedings of the IEEE, volume 78, pages 1464–1480, 1990.

J. Makino, M. Taiji, T. Ebisuzaki, and D. Sugimoto. GRAPE-4: a one-Tflops special-purpose computer for the astrophysical N-body problem. In Conference on High Performance Networking and Computing: Proceedings of the 1994 ACM/IEEE Conference on Supercomputing, pages 429–438. ACM Press, 1994.

PDC. Computing Resources: Lucidor. Parallelldatorcenter at KTH, Stockholm, www.pdc.kth.se/compresc/machines/lucidor.html, May 2005.

A. J. Quigley and P. Eades. ProVEDA: A scheme for Progressive Visualization and Exploratory Data Analysis of clusters. In Proceedings of the 3rd Software Visualization Workshop, December 1999.

E. Simoncelli and B. Olshausen. Natural image statistics and neural representation. Annual Review of Neuroscience, 24:1193–1215, 2001.

M. S. Srivastava. Methods of Multivariate Statistics. Wiley-Interscience, 2002.

N. Tishby and N. Slonim. Data clustering by Markovian relaxation and the information bottleneck method. In NIPS, pages 640–646, 2000.

N. Ueda and R. Nakano. A new competitive learning approach based on an equidistortion principle for designing optimal vector quantizers. Neural Networks, pages 1211–1227, 1994.

Z. Wu and R. Leahy. An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11):1101–1113, 1993.

Appendix A
Complete list of the parameters and their values

In order to provide enough information to readers who want to repeat our experiments, we include a list of all our parameters and some comments about their values. The parameters here are an example and are not to be thought of as the optimal parameter set. The parameter values below are for the case of 64x64 pixel sized pictures. Most of the parameters have the same values for all picture sizes; in the other cases we have written a comment about it.

Parameter name                    Value        Comment
numPictures                       1024         128 enough with small pictures
numDim                            3            see discussion in section 4.1.4
universeLength                    10000        increase for really large pictures
meanMovementBreakThreashold       100.0        depends upon checkInterval
dt                                10.0
checkInterval                     50
numChannels                       3            3 if RGB, 1 if gray-scale
numIntervals (numColorUnits)      8
numPicsInColorLearn               10
selectIntervalColorCSL            10
selectMaxIterColorCSL             50
epsilonColorCSL                   0.00000001
etaColorCSL                       0.005
numUnits (number of partitions)   8-400        8 used for visualizations
selectInterval                    10
selectMaxIter                     50
epsilonFinal                      0.00000001
etaFinal                          0.005
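For convenience, the same parameter set can be written as a configuration mapping; the values below simply restate the table above (64x64 pixel sized pictures), and the dictionary form itself is our own convention rather than part of the implementation.

    PARAMS = {
        "numPictures": 1024,                  # 128 enough with small pictures
        "numDim": 3,                          # see discussion in section 4.1.4
        "universeLength": 10000,              # increase for really large pictures
        "meanMovementBreakThreashold": 100.0, # depends upon checkInterval
        "dt": 10.0,
        "checkInterval": 50,
        "numChannels": 3,                     # 3 if RGB, 1 if gray-scale
        "numIntervals": 8,                    # numColorUnits
        "numPicsInColorLearn": 10,
        "selectIntervalColorCSL": 10,
        "selectMaxIterColorCSL": 50,
        "epsilonColorCSL": 1e-8,
        "etaColorCSL": 0.005,
        "numUnits": 8,                        # number of partitions, 8-400; 8 used for visualizations
        "selectInterval": 10,
        "selectMaxIter": 50,
        "epsilonFinal": 1e-8,
        "etaFinal": 0.005,
    }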