## Abstract

Many real-world datasets can be naturally described by multiple views. Due to this, multi-view learning has drawn much attention from both academia and industry. Compared to single-view learning, multi-view learning has demonstrated plenty of advantages. Clustering has long been serving as a critical technique in data mining and machine learning. Recently, multi-view clustering has achieved great success in various applications. To provide a comprehensive review of the typical multi-view clustering methods and their corresponding recent developments, this chapter summarizes five kinds of popular clustering methods and their multi-view learning versions, which include k-means, spectral clustering, matrix factorization, tensor decomposition, and deep learning. These clustering methods are the most widely employed algorithms for single-view data, and lots of efforts have been devoted to extending them for multi-view clustering. Besides, many other multi-view clustering methods can be unified into the frameworks of these five methods. To promote further research and development of multi-view clustering, some popular and open datasets are summarized in two categories. Furthermore, several open issues that deserve more exploration are pointed out in the end.

### Keywords

- clustering
- multi-view clustering
- multi-view k-means
- multi-view spectral clustering
- multi-view matrix factorization
- tensor decomposition
- deep learning

## 1. Introduction

Clustering is one of the most critical unsupervised learning techniques, which has been widely applied for data analysis, such as social network analysis, gene expression analysis, heterogeneous data analysis, and market analysis. The goal of clustering is to partition a dataset into several groups such that data samples in the same group are more similar than those in different groups. Clustering plays an important role in mining the hidden patterns. However, most of the existing clustering algorithms are designed for single-view data.

With the rapid development of information and communication technology (ICT), the channels through which data can be collected have expanded dramatically. That is, data can be collected from multiple sources or multiple facets. In such a setting, each datum is associated with much richer information, so mining the intrinsic and valuable patterns hidden in the data requires taking full advantage of the information contained in multiple sources. This problem is formally referred to as *multi-view learning*. To be more specific, each view corresponds to one source of information. For example, web pages can be described by both the page contents (one view) and the hyperlink information (another view). Besides, different facets of a datum can also be treated as different views. For instance, an image can be characterized by its shape, color, and location.

Obviously, integrating the information contained in multiple views can bring great benefits for data clustering. The most straightforward way to utilize the information of all views is to concatenate the data features of each view together and then perform the traditional clustering methods such as *multi-view clustering*.
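The concatenation baseline described above can be sketched as follows. This is a minimal illustration under stated assumptions: plain Lloyd-style k-means stands in for the traditional clustering method, each feature is standardized before concatenation so that no view dominates by scale, and the initialization is a deterministic farthest-point rule. The function names `concat_views` and `kmeans` are hypothetical and do not come from the chapter.

```python
import numpy as np


def concat_views(views):
    """Standardize each view per feature, then concatenate along the feature axis."""
    scaled = []
    for V in views:
        V = np.asarray(V, dtype=float)
        std = V.std(axis=0)
        std[std == 0] = 1.0  # avoid division by zero for constant features
        scaled.append((V - V.mean(axis=0)) / std)
    return np.hstack(scaled)


def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm with farthest-point initialization (an illustrative choice)."""
    X = np.asarray(X, dtype=float)
    centroids = [X[0]]
    for _ in range(1, k):
        # Squared distance from each sample to its nearest chosen centroid.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        # Assignment step: nearest centroid under squared Euclidean distance.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans(concat_views([view_1, view_2]), k)` treats all views as one long feature vector. Every view then contributes with the same fixed weight, which is the limitation that dedicated multi-view methods aim to remove.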

As an emerging and effective paradigm in data mining and machine learning, multi-view clustering refers to the clustering of the same class of data samples with multi-view representations, either from various information sources or from different feature generators. If a clustering method cannot cope appropriately with multiple views, the additional views may even degrade clustering performance. To make use of multi-view information to improve clustering results, there are two main challenges to overcome. The first is how to naturally ensemble the clustering results of all the views. The second is how to learn the importance of different views to the clustering task. Moreover, these two issues should be addressed simultaneously. To achieve these goals, a new clustering objective function must be designed, together with a new method for solving it.

Multi-view clustering was first studied by Bickel and Scheffer [1] in 2004. They extended the classic

The remainder of this chapter is organized as follows. Section 2 describes the benefits of multi-view clustering. Section 3 details the aforementioned five multi-view clustering methods. Section 4 summarizes two kinds of popular open datasets. Several open issues are illustrated in Section 5. Section 6 concludes this chapter.

## 2. Benefits of multi-view clustering

Compared with the clustering methods that are implemented on single-view data, multi-view clustering is expected to obtain more robust and novel partitioning results by exploiting the redundant and complementary information in different views [5], as stated in the following sections.

### 2.1. Benefit one: accurate description of data

Single-view data may contain incomplete knowledge, while multi-view data usually contains complementary and redundant information, which results in a more accurate description of the data. For example, one may fail to identify the intrinsic community structures of a social network by leveraging friendships alone. However, if more information such as users’ demographics can be obtained, the implicit relationships between users are more likely to be discovered.

### 2.2. Benefit two: reducing noises of data

Even when the information contained in single-view data is complete, some noise may be unavoidable. Data cleaning is a critical issue in data analysis and can greatly affect the performance of clustering algorithms. It is hard and costly to remove all the noise from data, and thus noisy single-view data usually leads to unsatisfactory clustering results. Multi-view clustering, on the other hand, is able to circumvent the side effects of noisy or corrupted data in each view and emphasize the common patterns shared by multi-view data.

### 2.3. Benefit three: wider range of applications

All multi-view clustering methods can be applied to single-view data. However, many clustering tasks cannot be carried out by single-view clustering because of its limitations. For example, data with multiple modalities is becoming more and more common, and heterogeneous information networks are gaining popularity as well. These types of data fit naturally into multi-view learning but cannot be handled appropriately by single-view learning methods. In all, the complementary property of multi-view data can overcome the limitations of single-view data and expand the range of applications.

## 3. Multi-view clustering methods

Due to the widespread use of multi-view datasets in practice, many realistic applications are accomplished by multi-view learning methods, such as community detection in social networks, image annotation in computer vision, and cross-domain user modeling in recommendation systems [6]. Meanwhile, based on the seminal work of Bickel and Scheffer [1], plenty of multi-view clustering methods have been proposed [2, 3, 5]. As explained in Section 1, this chapter seeks to review five kinds of typical clustering methods and their multi-view versions, which include k-means, spectral clustering, matrix factorization, tensor decomposition, and deep learning.

### 3.1. Multi-view clustering via k-means

#### 3.1.1. Preliminaries of k-means

As a classic clustering algorithm,

where

As can be seen, Eq. (1) adopts the Euclidean distance to measure the similarities between data samples. However, there are many data structures or data distributions in real world. Thus, it is not always suitable to apply this basic form of

where

With the aid of the kernel function, there is no need to explicitly provide the transformation function

#### 3.1.2. Basic form of multi-view k-means

Both the

where

Similarly, the objective function of the multi-view kernel

Denote

where

#### 3.1.3. Variants of multi-view k-means

The basic formulations of multi-view

As aforementioned, it is NP-hard to find the optimal solution of the multi-view

where

### 3.2. Multi-view clustering via spectral clustering

Spectral clustering is built upon spectral graph theory. In recent years, spectral clustering has become one of the most popular clustering algorithms and has shown its effectiveness in various real-world applications ranging from statistics and computer science to bioinformatics. Because it adapts to the data distribution, spectral clustering often outperforms traditional clustering algorithms such as

#### 3.2.1. Preliminaries of spectral clustering

Spectral clustering is closely related to the minimum cut problem of graphs. It first performs dimensionality reduction on the original data space by leveraging the spectrum of the similarity matrix of data samples and then performs

According to the above graph cut theory, two popular versions of spectral clustering are developed, i.e., the ratio cut (RatioCut) and the normalized cut (Ncut). The classical relaxed form of the RatioCut [26] is shown as below:

where

#### 3.2.2. Basic form of multi-view spectral clustering

Multi-view spectral clustering is able to learn the latent cluster structures by fusing the information contained in multiple graphs. Similar to multi-view

Kumar et al. [28] firstly present a multi-view spectral clustering approach, which has a flavor of co-training idea widely used in semi-supervised learning. It follows the consistency of multi-view learning that each view gives the same labels for all data samples. So it can use the eigenvector of one view to “label” another view and vice versa. For example, via computing two views’ eigenvectors, say

where

where

#### 3.2.3. Variants of multi-view spectral clustering

The basic form of multi-view spectral clustering achieves the basic goals of multi-view learning. However, some issues have not yet been considered. For instance, the weight parameter

The model above needs a manually specified parameter

where

### 3.3. Multi-view clustering via matrix factorization

In the fields of data mining and machine learning, matrix factorization (MF) is an effective latent factor learning model. Given a data matrix

#### 3.3.1. Preliminaries of matrix factorization

As is well known, there are many matrix factorization models, including the singular value decomposition, Cholesky decomposition, LU decomposition, QR decomposition, and Schur decomposition. These factorization models either impose overly strict restrictions on the factor matrices or are ill-suited to data analysis. Owing to its wide application in recommender systems, nonnegative matrix factorization (NMF) has drawn much attention in both academia and industry. In fact, NMF can be regarded as an extension of the standard

where

where

#### 3.3.2. Basic form of multi-view matrix factorization

The hypothesis behind multi-view clustering is that different views should admit the same underlying clustering structures of the datasets. That is, the coefficient matrices learned from different views should be as consistent as possible. To this end, a soft regularization term is introduced to enforce the coefficient matrices of different views toward a common consensus [37]. For a given dataset

where

Instead of enforcing a rigid common consensus constraint on all the views as in Eq. (15), another form of basic multi-view NMF for clustering is the pair-wise CoNMF model [38], which imposes similarity constraints on each pair of views. Through the pair-wise co-regularization, it is expected that the coefficient matrices learned from two views can complement with each other during the factorization process. And therefore, high-quality clustering results can be yielded. The co-regularization objective function of the pair-wise CoNMF model is defined intuitively as follows:

where

Similar to the optimization of the standard single-view NMF model, all the three basic multi-view NMF clustering models can be optimized via the multiplicative updating rules.
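For reference, the multiplicative updating rules for the standard single-view NMF model can be sketched as below. This is a minimal illustration, assuming the squared Frobenius reconstruction error and the classical Lee and Seung updates; it does not include the consensus or co-regularization terms of the multi-view models, whose update rules contain additional terms. The names `nmf_multiplicative`, `W` (basis matrix), and `H` (coefficient matrix) are choices made for this example and may differ from the chapter's notation.

```python
import numpy as np


def nmf_multiplicative(X, rank, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative matrix X (n x m) as W @ H with W, H >= 0.

    W is the n x rank basis matrix and H is the rank x m coefficient matrix.
    Returns W, H, and the squared Frobenius error after every iteration.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive random initialization keeps the updates well defined.
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps

    errors = []
    for _ in range(n_iter):
        # Element-wise multiplicative updates; eps guards against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H, "fro") ** 2)
    return W, H, errors
```

Because every factor in the updates is nonnegative, `W` and `H` stay nonnegative without any explicit projection. When the columns of `X` are data samples, one common way to read off clusters is to assign sample `j` to `H[:, j].argmax()`.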

#### 3.3.3. Variants of multi-view matrix factorization

Since locality-preserving learning and manifold learning have been shown to be very important for improving the performance of clustering algorithms, Cai et al. [39] propose a graph (or manifold) regularized NMF model, GNMF, for single-view clustering with satisfactory performance. Note that the aforementioned multi-view NMF models cannot preserve the local geometrical structures of the samples. To overcome this limitation, a multi-manifold regularized NMF model (MMNMF) is proposed in [40]. MMNMF incorporates a consensus manifold and a consensus coefficient matrix with multi-manifold regularization to preserve the local geometrical structures of the multi-view data space. The multi-manifold regularization has also been considered in [41]. Moreover, the correntropy-induced metric (CIM) is adopted to measure the reconstruction error, since CIM has achieved excellent performance in many applications. CIM is also insensitive to large errors that are mainly introduced by heavy noise. A much simpler formulation of the manifold regularized multi-view NMF model is developed in [42]. Without the explicit constraint that enforces a rigid common manifold consensus, an auxiliary matrix is involved to add constraints on the column sums of the basis matrix

Due to its great interpretability and high efficacy, NMF has been widely employed for graph clustering [48]. In such setting, the data matrix

### 3.4. Multi-view clustering via tensor decomposition

In this part, we analyze multi-view clustering from a multilinear algebra perspective and present several novel multi-view clustering algorithms (note that the notations used in this part are self-contained). A tensor is a multidimensional matrix or multiway array [51]. In the multi-view research field, data can be naturally modeled as a third-order tensor with object, feature, and view dimensions. An intuitive way to build it is to stack the different views along the view dimension of the tensor (see Figure 1). Another widely adopted way is to transform each feature matrix into a similarity matrix before stacking them.
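The second construction, in which each view is first turned into a similarity matrix, can be sketched as follows. This is only an illustration: the Gaussian (RBF) kernel, the bandwidth parameter `gamma`, and the function names are assumptions made for the example, and any other similarity measure could be substituted.

```python
import numpy as np


def rbf_similarity(V, gamma=1.0):
    """Pairwise Gaussian similarity between the rows (objects) of one view."""
    V = np.asarray(V, dtype=float)
    sq_dists = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)


def stack_similarity_tensor(views, gamma=1.0):
    """Build an (n_objects, n_objects, n_views) tensor, one similarity slice per view."""
    return np.stack([rbf_similarity(V, gamma) for V in views], axis=2)
```

Stacking similarity matrices has a practical advantage: every slice is `n_objects x n_objects`, so views with different numbers of features still fit into one tensor. Stacking raw feature matrices directly only works when all views share the same feature dimension.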

#### 3.4.1. Preliminaries of tensor decomposition

In the field of data mining and machine learning, tensor decomposition is an emerging and effective tool for processing multi-view data. In this section, some basic knowledge on tensors and tensor decomposition methods is provided. We refer the readers to [51, 52] for a comprehensive understanding of these topics.

##### 3.4.1.1. Notations

Let

##### 3.4.1.2. CP decomposition

The idea of expressing tensor as the sum of a number of rank-one tensors comes from the study of Hitchcock [53]. Then, Cattell [54] proposed the idea of parallel proportional analysis. The popular CP decomposition comes from the ideas of Carroll and Chang [55] (canonical decomposition) and Harshman [56] (parallel factors). Taking a third-order tensor

where

##### 3.4.1.3. Tucker decomposition

The idea of Tucker decomposition is introduced by Tucker [57]. The Tucker decomposition is a form of higher-order singular value decomposition (HOSVD) [58]. It decomposes a tensor

The cutting-edge technique for calculating the factor matrices is proposed in [59].
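As a point of reference for the Tucker model in this subsection, the classical truncated HOSVD can be sketched as below. This is a minimal illustration and not necessarily the technique of [59]: each factor matrix is taken as the leading left singular vectors of the corresponding mode unfolding, and the core tensor is obtained by projecting the data tensor onto those factors. All function names are hypothetical.

```python
import numpy as np


def unfold(T, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the other axes."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)


def mode_product(T, M, mode):
    """Multiply tensor T by matrix M along `mode`; M has shape (new_dim, T.shape[mode])."""
    out = np.tensordot(M, T, axes=(1, mode))  # the new axis comes out first
    return np.moveaxis(out, 0, mode)


def hosvd(T, ranks):
    """Truncated HOSVD: returns the core tensor and one factor matrix per mode."""
    T = np.asarray(T, dtype=float)
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])  # leading r left singular vectors
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)
    return core, factors


def reconstruct(core, factors):
    """Rebuild the (approximate) tensor from the core and the factor matrices."""
    T = core
    for mode, U in enumerate(factors):
        T = mode_product(T, U, mode)
    return T
```

With full ranks the reconstruction is exact up to floating-point error. With smaller ranks, `reconstruct(*hosvd(T, ranks))` gives a low-multilinear-rank approximation of `T`.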

#### 3.4.2. Tensor decomposition-based multi-view clustering

In multi-view clustering, the goal is to find out some meaningful group of objects from the data. The above CP decomposition naturally divides the multi-view data into several components, which can be seen as the clusters. Thus, it can be directly applied to solve multi-view clustering problems. For a given dataset

##### 3.4.2.1. Total variation based CP (TVCP)

In some clustering problems, a consecutive range of time points is non-negligible. For example, in a dataset with authors, publications, and a sequence of time points, we are interested in figuring out which groups of authors work on the same topics during a period of time. Chen et al. [60] propose a total variation based tensor decomposition method (TVCP) that imposes a constraint over a period of consecutive time points. The total variation regularizes the time factor to obtain a piece-wise constant function w.r.t. time points. Owing to the piece-wise constant function, the decomposition can be relatively consistent within a cluster and separated between clusters. The TVCP model is formulated as follows:

where

##### 3.4.2.2. Relations between Tucker decomposition and spectral clustering

Liu et al. [61] propose a framework of multi-view clustering via tensor decomposition, mainly the Tucker decomposition. According to the framework, the common type of multi-view spectral clustering is equivalent to a Tucker decomposition problem as follows:

Another form of multi-view spectral clustering can also be written as a Tucker problem:

With this framework, a variety of spectral clustering problems can be solved by a tensor decomposition algorithm. This shows both the strong connection between the two and the capability of the tensor methodology.

Canonical correlation analysis is designed to inspect the linear relationship between two sets of variables [62]. In multi-view learning, a typical approach is to maximize the sum of pair-wise correlations between different views [63]. Without loss of high-order correlations, Luo et al. [64] propose a tensor canonical correlation analysis (TCCA), which is equivalent to CP decomposition of the correlation tensor. Khan et al. [65] propose a Bayesian extension of CP decomposition for multiple coupled tensors sharing common latent factors.

### 3.5. Multi-view clustering via deep learning

With the third wave of artificial intelligence, deep learning has gained increasing popularity in recent years. Deep learning has demonstrated excellent performance in many real-world applications, such as face recognition, image annotation, natural language processing, object detection, customer relationship management, and mobile advertising. Typically, deep learning models are composed of multiple nonlinear transformations and thus can learn a better feature representation than traditional shallow models [66]. However, deep learning requires labeled training data to learn the models, which limits its application in data clustering because training data with cluster labels are not available in many cases. Despite this difficulty, some works have been devoted to adapting shallow clustering models to deep learning. Here, we introduce two popular deep clustering models and their extensions to the multi-view environment.

#### 3.5.1. Deep auto-encoder

An auto-encoder [67] is an artificial neural network for unsupervised learning whose goal is to learn a representation for each data sample. An auto-encoder consists of two parts: the encoder and the decoder. The encoder acts as a nonlinear mapping function that maps each data sample to a representation space. The decoder is required to reconstruct the data accurately from the representation generated by the encoder. The auto-encoder has been shown to be similar to spectral clustering in theory; however, it is more efficient and flexible in practice. It can easily be deepened by adding more encoder layers and corresponding decoder layers. Figure 2(a) gives an example of the framework of the deep auto-encoder.
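The encoder and decoder roles can be made concrete with a small sketch. The following is a minimal, shallow auto-encoder written directly in NumPy, assuming a single `tanh` encoder layer, a linear decoder, a mean squared reconstruction loss, and plain full-batch gradient descent. The function names, layer sizes, and hyperparameters are illustrative assumptions rather than the models of the cited works.

```python
import numpy as np


def train_autoencoder(X, code_dim, lr=0.01, n_epochs=500, seed=0):
    """Train a one-hidden-layer auto-encoder; returns (params, per-epoch losses)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, code_dim))  # encoder weights
    b1 = np.zeros(code_dim)
    W2 = rng.normal(scale=0.1, size=(code_dim, d))  # decoder weights
    b2 = np.zeros(d)

    losses = []
    for _ in range(n_epochs):
        # Forward pass: nonlinear encoder, linear decoder.
        H = np.tanh(X @ W1 + b1)
        X_hat = H @ W2 + b2
        diff = X_hat - X
        losses.append((diff ** 2).sum() / n)

        # Backward pass for the loss (1/n) * sum of squared errors.
        d_out = 2.0 * diff / n
        dW2 = H.T @ d_out
        db2 = d_out.sum(axis=0)
        d_pre = (d_out @ W2.T) * (1.0 - H ** 2)  # derivative of tanh
        dW1 = X.T @ d_pre
        db1 = d_pre.sum(axis=0)

        # Gradient descent step.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    params = {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
    return params, losses


def encode(X, params):
    """Map samples to the learned representation space."""
    return np.tanh(np.asarray(X, dtype=float) @ params["W1"] + params["b1"])
```

Deepening the model amounts to inserting further encoder layers before the code and mirrored decoder layers after it. The representations returned by `encode` can then be passed to a conventional clustering algorithm.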

Although auto-encoder can learn a compact representation for each data sample, it contributes little to clustering since it does not require that the representation vectors of similar data samples should also be similar. To make the learned feature representation better capture the cluster structures, many variants of deep auto-encoder models have been proposed. In [68], a novel regularization term that is similar to the objective function of

#### 3.5.2. Deep matrix factorization

Another line of work on deep clustering models deepens the MF models. As shown earlier, MF, especially NMF, has demonstrated outstanding performance in many applications. Thus, it is worth building a deep structure for MF in the hope that better feature representations can be obtained to facilitate clustering. Figure 2(b) illustrates an example of the framework of the deep MF models. Like deep auto-encoders, deep MF tries to minimize the reconstruction error. However, unlike deep auto-encoders, the mapping function of deep MF is linear.

The first nonnegative deep network based on NMF is proposed in [74] for speech separation. This architecture can be discriminatively trained for optimal separation performance. Then Li et al. [75] propose a novel weakly supervised deep MF model to uncover the latent image representations and tag representations embedded in the latent subspace by collaboratively exploring the weakly supervised tagging information, the visual structure, and the semantic structure. In [76], a deep semi-NMF model is further developed for learning latent attribute representations. Semi-NMF is a popular variant of NMF that relaxes the factorized basis matrix to be real-valued. This makes semi-NMF much more widely applicable than NMF, since real-world datasets may contain complex information; for instance, the attributes may be mix-signed. Considering that these deep MF models factorize only the basis matrix hierarchically, Qiu et al. [77] further propose a deep orthogonal NMF model which can decompose the coefficient matrix hierarchically. This model is able to learn higher-level representations for clusters. These deep MF models have achieved great success in clustering single-view data. However, they are seldom utilized for multi-view clustering. A recent work [78] attempts to extend the deep semi-NMF model for multi-view clustering, which can dissemble unimportant factors layer by layer and generate an effective consensus representation in the last layer. Another work [79] proposes to address the incomplete multi-view clustering problem via deep semantic mapping. The proposed model first projects all incomplete multi-view data to a unified representation in a common subspace, which is then clustered by standard shallow NMF.

## 4. Open datasets

No one can make bricks without straw. In this section we will first list two kinds of open datasets that can be used in multi-view clustering, i.e., feature-based and graph-based datasets. Then we will discuss the performance of multi-view clustering on them briefly.

### 4.1. Feature-based datasets

Audio genre [80] consists of 1886 audio tracks classified into 9 music genres, which are Blues, Electronic, Jazz, Pop, Rap/HipHop, Rock, Folk/Country, Alternative, and Funk/Soul. Forty-nine low-level audio features have been extracted and they are grouped into 15 vector spaces.

NUS-WIDE [81] is a web image dataset composed of 269,648 images, 5018 related tags, and 81 ground-truth concepts. Six types of low-level features have been extracted: 64-D color histogram, 144-D color correlogram, 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments extracted over 5

UCF101 [82] consists of 101 human action classes. These actions can be divided into five types: human-object interaction, body-motion only, human-human interaction, playing musical instruments, and sports. There are over 13,000 clips and 27 hours of video data in it.

Handwritten numerals [83] is composed of 2000 handwritten digits which are divided into 10 classes. Four types of feature sets have been extracted: Zernike moments, Karhunen-Loeve features, Fourier descriptors, and image vectors. The Zernike set has 47 rotation-invariant Zernike moments and 6 morphological features. The Fourier set has 76 two-dimensional shape descriptors. Both the Zernike and Fourier feature sets are rotation invariant. The Karhunen-Loeve set has 64 Karhunen-Loeve transform coefficients, which correspond to the projection of the images onto the eigenvectors of a covariance matrix.

### 4.2. Graph-based datasets

DBLP coauthorship [84] is a coauthorship network composed of 10,305 authors. It has 617 layers, each representing a different publication category.

Facebook [85] is a three-layer social network composed of 1640 users with multiple types of ties. The first layer shows whether two users are friends. The second layer shows whether users are in the same group. The third layer shows whether users appear in the same uploaded photos.

CiteSeer [86] consists of 3312 scientific publications classified into 6 classes, which are Agents, AI, DB, IR, ML, and HCI. It can be represented as an annotated network, where nodes represent scientific publications and links represent the citation relationships. Each node has a 3703-dimensional binary vector indicating the absence or presence of each keyword.

Enron e-mail [87] consists of 184 users and 44 layers. Although it is a temporal network, it can be considered as a multi-layer network in which each layer represents the communication in a different month.

### 4.3. Performance on different datasets

For feature-based datasets, when the views need to be reconstructed, the performance of classical methods, such as deep learning, is not promising, whereas multi-view clustering can give satisfactory results under this condition. In some cases, classical methods can also perform well on feature-based datasets where all features describe the same object from different perspectives. For graph-based datasets, multi-view clustering is a natural fit, since different graphs can be treated as different views.

For both feature-based and graph-based datasets, when the datasets become very large, most multi-view clustering methods have the potential to outperform other clustering methods in speed. For example, multi-view matrix factorization is well suited to parallel processing.

## 5. Open issues

Although multi-view clustering has demonstrated its superiority over single-view clustering in many applications, there are still many open issues deserving much more attention from both academia and industry. Several vital open issues are summarized in this part.

### 5.1. View construction

Although there are many typical methods to construct views, they all have their own drawbacks. If valuable information cannot be extracted from the original data and distributed across the views appropriately, performance will be severely limited no matter how sophisticated the algorithm is. It is therefore important to find efficient ways of constructing and evaluating multiple views.

### 5.2. Incomplete view

When constructing different views, we may find that the information for some views is incomplete. In other words, even though we know how to construct views appropriately, we do not have enough information to do so, which is very common in practical problems. In the real world, it is very difficult to ensure the completeness of data. The imbalance between complete and incomplete views can cause serious problems. Moreover, incomplete views may adversely influence the views with complete information. One possible solution is to reconstruct the missing information from other views.

### 5.3. Single-view to multi-view

In multi-view learning, researchers sometimes convert single-view data into multiple views and apply the relevant algorithms to them. In practice, this may give good performance, but there is little theoretical work establishing its reliability. Since the original data is single-view, it is important to ask whether it is necessary to complicate a simple task. We should not focus only on the final performance; the trade-off between cost and benefit is also important.

### 5.4. Deep learning in multi-view

Deep learning has shown remarkable performance in many fields. One common way to deal with data composed of different types of sources is to combine them and then feed them into a deep learning model, which often works well. Although multi-view learning seems to be a more reasonable way to deal with such data, there is no evidence that multi-view learning has an obvious advantage over deep learning. Another issue is that when deep learning is used in multi-view learning, a separate neural network has to be trained for each view. This approach has two drawbacks. One is that the number of neural networks depends on the number of views, so the computational cost becomes very high when there are many views. The other is that it fails to unify the different views during training.

## 6. Conclusion

Multi-view clustering has been applied in a variety of real-world applications, such as community detection in social networks, image annotation in computer vision, cross-domain user modeling in recommendation systems, and protein interaction analysis in bioinformatics. This chapter provides a comprehensive review of the typical multi-view clustering methods and their corresponding recent developments by focusing on the five most typical and popular clustering methods, which include k-means, spectral clustering, matrix factorization, tensor decomposition, and deep learning.

In the field of multi-view clustering, the source code of many algorithms has been released by their authors. For example, the co-training^{1} and co-regularization^{2} methods of classical multi-view spectral clustering are available on GitHub as MATLAB implementations. The variants ^{3} and ^{4} are also implemented in MATLAB.

## Notes

1. https://github.com/areslp/matlab/tree/master/code_cospectral
2. https://github.com/areslp/matlab/tree/master/code_coregspectral
3. https://github.com/rciszek/mse
4. http://www.escience.cn/people/fpnie/index.html