Optimal Image Transport on Sparse Dictionaries

Figure 1. An illustration of optimal image transfer on sparse dictionaries. Given a content image (Fig. 1a) and a reference image (Fig. 1b), the proposed method learns their individual dictionaries (Fig. 1c) and then derives an optimal transport plan over the learned dictionaries, giving the respective transferred results.

Abstract:

In this paper, we derive a novel optimal image transport algorithm over sparse dictionaries by taking advantage of Sparse Representation (SR) and Optimal Transport (OT). Concisely, we design a unified optimization framework in which the individual image features (color, textures, styles, etc.) are compactly encoded using sparse representation, and an optimal transport plan is then inferred between the two learned dictionaries in accordance with the encoding process. This paradigm gives rise to a simple but effective way to perform image representation and transformation simultaneously, and it remains empirically solvable thanks to the moderate size of the sparse coding and optimal transport sub-problems. We demonstrate its versatility and benefits on different image-to-image translation tasks, in particular image color transfer and artistic style transfer, and show plausible photo-realistic transfer results.

Introduction:

Image-to-image translation is an interesting but rather challenging image synthesis problem in image processing and computer vision. Recent studies have shown that many vision-based tasks can be posed as a special style transfer problem within the image-to-image translation framework, for example, image color matching and transfer, image super-resolution, texture synthesis, non-photorealistic rendering and artistic stylization. Despite their varying backgrounds and formulations, they share a very similar goal: automatically converting an image from one domain to another while preserving the semantic styles or contents, either for better interpretation or for visually pleasing results.

In a general sense, the image translation problem can be formulated as an admissible map in latent feature spaces that preserves the content or information of interest (color, texture, styles, etc.). In essence, it requires solving two fundamental sub-problems: image encoding and feature transformation. The former seeks a useful image representation tool to extract the discriminative or individual image styles, while the latter aims to infer an appropriate mapping for the encoded image styles while maintaining abstract information such as image structures, textures and high-level semantic characteristics. As illustrated hereafter, many vision-based tasks can be understood through these two sub-problems; the difference mainly lies in how the image encoding and feature transformation are carried out. Image color matching, for example, tends to take image intensities (or hue and saturation) as a meta-representation and solves a mapping problem between color palettes to determine the transferred results. In contrast, artistic image stylization requires both carefully designed image encoding and feature transformation in view of the complexity of abstract styles. In many scenarios, it is also important for the encoding to have a compact form in order to reduce computational cost, while the translation problem involves a special transport map between two distributions of individual image features. Recent efforts have addressed the two sub-problems for different image style transfer applications, including the ever-increasing number of deep learning-based methods. Despite their great success, there is still considerable interest in exploiting more easily configured and powerful tools to achieve more visually appealing results.

In this paper, we propose a novel approach for image-to-image translation, in which an optimal transport map is posed directly on sparse dictionaries learned via sparse image coding. Specifically, sparse representation is applied as a feature extractor to encode the latent features of images. Optimal transport is subsequently inferred over the learned dictionaries to provide an optimal style-swapping plan in accordance with the style encoding process. This new model, as shown in Fig. 1, inherits two-fold benefits from sparse representation and optimal transport. On the one hand, sparse representation provides a flexible and easily understood image editing tool to encode the semantic features of images, as demonstrated in a variety of successful applications such as image denoising and super-resolution. On the other hand, optimal transport allows us to swap the encoded features or styles through a linear mapping. Moreover, the size of the learned dictionaries is moderate in practice, which helps to alleviate the high computational cost of native optimal transport. We will illustrate that this new paradigm, with a slight relaxation, is empirically solvable and gives rise to a closed-form solution for many image translation problems. We also demonstrate its versatility and benefits on image-to-image translation with two typical tasks, color transfer and artistic style transfer, and show their high-quality transferred results.

Image-to-image translation, as mentioned above, covers a wide range of vision-based tasks despite their different backgrounds and formulations. We briefly review some existing color transfer and artistic style transfer methods, in particular those aiming at photo-realistic results, because of their close connections to the proposed optimal style transfer on sparse dictionaries.

Color matching or transfer is a typical image-to-image translation application focused on photo-realistic results. The purpose is straightforward: to alter the color appearance of an image based on a reference image. In early work, color transfer was usually posed as a one-dimensional histogram matching problem between two color distributions, for example, histogram equalization or specification. As suggested in the pioneering work, color transfer is implemented by matching the global mean and covariance statistics of the two images. Such a strategy was later extended to other color spaces or combined with lightness and brightness information. These methods, however, may produce non-harmonic results because of the inconsistent color distributions of natural images. Other methods resort to color segmentation techniques for local color matching, but such a strategy depends heavily on semantic constraints for color segmentation in practice.

To reduce the notorious non-harmonic artifacts, recent advances based on optimal transport have gained great attention in color transfer applications. This line of work dates back to the study of histogram-matching problems and their relation to optimal transport for gray-scale images. The strategy was then extended to color images and videos. The assigned colors can largely avoid the undesired visual artifacts. It is worth noting, however, that the transport map is mostly deduced from discrete samples, so the solution may be out of reach for very large-scale problems: a naive optimal transport solver has \(\mathcal{O}(n^3)\) complexity for \(n\) sample pairs. More recently, relaxed and regularized OT methods have also been explored for color transfer to tackle the high computational cost. However, as pointed out in the literature, an exact transfer of color distributions is not enough in practical applications because color densities may have very different shapes and outliers. As a consequence, the transfer performance may be limited by the sampling and interpolation processes.

Meanwhile, image style transfer, which is mostly dedicated to non-photorealistic image rendering for artistic effects, has been extensively studied in pursuit of the long-standing dream of automatically generating attractive artworks. Most traditional methods are based on either line-drawing or stroke-based rendering techniques to produce prescribed effects, including image stippling, pencil sketching, watercolor, oil painting, and so on. As shown with the difference-of-Gaussians (DoG) operator and flow-based filtering, such methods boost the salient line features or main structures of images and help to yield aesthetically pleasing lines when synthesizing line drawings and cartoon-like art effects. Stroke-based rendering is another prevalent strategy for artistic image stylization, in which brush strokes are iteratively aligned according to variations in local color, size, and orientation. With careful design, it is possible to generate high-quality results for some prescribed styles, but the style diversity may be limited. The reader is referred to the surveys for more details.

More recently, neural style transfer has achieved great success with the renaissance of deep learning methods. In the pioneering work, for example, a novel iterative optimization scheme over a convolutional neural network is proposed to match the learned features within a pre-trained classification network. The idea has subsequently been developed by many deep learning methods for more efficient stylization, stroke-based paintings and universal style transfer. In the last few years, there have also been many efforts toward photo-realistic style transfer. Despite the impressive artistic effects, these methods may suffer from unpredictable spatial distortions and artifacts that are inconsistent with semantic interpretations or should not appear in real photographs, since the mechanism of the deep encoding process is still not comprehensively understood. Recent studies have shown that the matching of features can be formulated as an optimal transport problem between the learned features for more favorable results. The achievements of deep learning methods are largely due to the two-fold benefits of many deep architectures: the encoding ability of neural networks offers a powerful and ubiquitous tool for high-level visual features, and the transformation map between deep features is learnable during training. However, many deep learning methods are limited by the availability of very large-scale training datasets. The reader is also referred to the related work for more details of deep learning-based techniques.

Preliminaries

In this section, we briefly introduce sparse representation and optimal transport, as they form the key ingredients of the proposed model for image style (feature) encoding and transformation.

Figure 2. Sparse representation of an image and the resulting distribution over dictionary atoms.

Sparse Representation:

Sparse and redundant representation has become a simple and important method for signal/image analysis and processing, in which a signal/image is assumed to be compactly approximated by a linear combination of a few fundamental elements, known as a basis set or a dictionary. One of the main benefits of sparse representation is reducing the size of large-scale problems in signal/image processing, since the majority of the information is encoded by a small set of basis functions weighted by sparse coefficients. In image processing and computer vision, sparse representation provides an effective image editing tool to encode the latent middle-level vision features of images. The basis or dictionary can either be selected from a group of pre-defined functions, such as the discrete cosine transform (DCT) and wavelets, or be learned from training data. We consider learned dictionaries here for better performance.

Mathematically, let \(\mathbf{X}=\{\boldsymbol{x}_i\}_{i=1}^N\), \(\boldsymbol{x}_i \in \mathbb{R}^{d}\), be a set of data samples, for example, vectorized image patches. Sparse representation then aims to discover a group of dictionary vectors \(\{\boldsymbol{d}_j\}_{j=1}^n, \boldsymbol{d}_j \in \mathbb{R}^{d}, n \ll N\) (denoted as \(\mathbf{D}= [\boldsymbol{d}_1, \cdots, \boldsymbol{d}_n]\)) together with the coefficient matrix \(\mathbf{A}\in \mathbb{R}^{n \times N}\), which can be written as,

\[\begin{equation} \begin{aligned} \mathbf{X} = \mathbf{D}\mathbf{A} \quad \text{s.t.} \quad \left\|\boldsymbol{\alpha}_i\right\|_{0} \leq K, \end{aligned} \label{eq1} \end{equation}\]

where \(\boldsymbol{\alpha}_i \in \mathbb{R}^{n}\) denotes the representation coefficients of sample \(\boldsymbol{x}_i\), corresponding to the \(i\)-th column of the coefficient matrix \(\mathbf{A}\), and \(\left\|\cdot\right\|_{0}\) is the pseudo \(L_0\)-norm counting the non-zero elements of a vector. The constraint \(\left\|\boldsymbol{\alpha}_i\right\|_{0} \leq K\) states that the number of non-zero entries in \(\boldsymbol{\alpha}_i\) is no more than \(K\). In other words, the coefficient matrix \(\mathbf{A}\) is sparse, an assumption that forms the essence of sparse representation.

Despite its simple form, sparse representation is generally challenging to solve directly because both the dictionary \(\mathbf{D}\) and the sparse coefficient matrix \(\mathbf{A}\) in Eq. \ref{eq1} are unknown in advance. Moreover, the solution is generally not unique when solving for one while fixing the other. The problem is also known to be NP-hard in view of the non-convex \(L_0\)-norm constraint. In practice, one usually resorts to approximate algorithms such as the method of optimal directions (MOD), generalized PCA or the K-SVD algorithm. Other approximate methods are based on relaxation techniques, for example, replacing the non-convex \(L_0\) constraint with its convex \(L_1\) approximation.

The choice of an appropriate dictionary is also crucial to sparse representation in many practical applications. We consider the example in Fig. 2 for interpretation. The process starts with image patches that are randomly cropped from an image or a dataset. The data matrix \(\mathbf{X}\) is formed by concatenating these vectorized patches, and the dictionary \(\mathbf{D}\) and coefficients \(\mathbf{A}\) are obtained by solving Eq. \ref{eq1}. Before diving deeper, we point out here that the row sums of the coefficient matrix are important to the proposed method, as they measure the frequency with which each dictionary atom occurs in an image. Mathematically, they provide a probability distribution over the dictionary atoms of an image. We will illustrate how to learn a pair of coupled dictionaries that represent the abstract styles of images in accordance with the optimal transport between the learned dictionaries.
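To make the encoding step concrete, the following sketch extracts random patches, learns a dictionary with sparse codes, and forms the row-sum distribution over the atoms. It is a minimal illustration only: scikit-learn's `MiniBatchDictionaryLearning` is used here as a stand-in for the dictionary learner (the solver actually used in this paper is discussed later), and the patch size, dictionary size and sparsity level are placeholder values.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def sample_patches(img, patch=8, n=20000, seed=0):
    """Randomly crop and vectorize square patches from a grayscale image."""
    rng = np.random.default_rng(seed)
    H, W = img.shape
    ys = rng.integers(0, H - patch, n)
    xs = rng.integers(0, W - patch, n)
    return np.stack([img[y:y + patch, x:x + patch].ravel() for y, x in zip(ys, xs)])

# X: (N, d) matrix of vectorized patches (samples are rows, as scikit-learn expects).
X = sample_patches(np.random.rand(256, 256))        # placeholder image
learner = MiniBatchDictionaryLearning(n_components=256,             # n atoms
                                      transform_algorithm="omp",
                                      transform_n_nonzero_coefs=4,  # sparsity K
                                      random_state=0)
A = learner.fit(X).transform(X)      # sparse codes, shape (N, n)
D = learner.components_              # dictionary,  shape (n, d)

# Summing the code magnitudes over all patches measures how often each atom is used,
# i.e. a discrete probability distribution over dictionary atoms.
a = np.abs(A).sum(axis=0)
a = a / a.sum()
```

In the paper's notation the samples are columns of \(\mathbf{X}\) and the atom usage is the row sum of \(\mathbf{A}\); the transposed layout above only follows the library convention.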

Optimal Transport:

Optimal transport is a well-developed mathematical theory, which can be traced back to Monge's problem and has since been rediscovered in different contexts. We review the Monge problem and its Kantorovitch relaxation here for the sake of completeness.

The Monge Problem: Let \(\mu, \nu\) be two probability measures on two metric spaces \(\mathcal{X} \subseteq \mathbb{R}^{n},\mathcal{Y} \subseteq \mathbb{R}^{m}\), and let \(c(x, y): \mathcal{X} \times \mathcal{Y} \rightarrow [0, \infty]\) be a cost function representing the effort of transporting mass from \(x \in \mathcal{X}\) to \(y \in \mathcal{Y}\). The Monge formulation aims to find a transport map \(T: \mathcal{X} \rightarrow \mathcal{Y}\) attaining the infimum of the functional:

\(\begin{equation} \begin{aligned} \inf_{T_{\sharp}\mu=\nu}{\int_{\mathcal{X}}{c(x,T(x)) \, d \mu(x)}}, \end{aligned} \label{eq2} \end{equation}\) where \(T_{\sharp}\mu = \nu\) is the \textit{push-forward} constraint, i.e., \(T\) pushes the mass of \(\mu\) forward onto \(\nu\). The optimal transport map \(T\) is the one attaining the infimum; its existence, however, is not always guaranteed in practice, for example, when only one of \(\mu\) and \(\nu\) is a Dirac measure.

The Kantorovitch Relaxation: Alternatively, a simple relaxation of Monge's problem initiated by Kantorovitch is guaranteed to have a solution. The key idea is that the mass at any point \(x\) can potentially be dispatched across several locations \(y\). The equivalent Kantorovitch formulation of optimal transport seeks a probabilistic coupling \(\pi \in \mathcal{P}\left(\mathcal{X} \times \mathcal{Y} \right)\) between \(\mathcal{X}\) and \(\mathcal{Y}\):

\[\begin{equation} \begin{aligned} \inf_{\pi \in \Pi} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \mathrm{d} \pi(x, y), \end{aligned} \label{eq3} \end{equation}\]

where \(\Pi\stackrel{\text{def}}{=}\left\{\pi \in\left(\mathbb{R}^{+}\right)^{\mathcal{X} \times \mathcal{Y}} \mid \pi_{\mathcal{X} }=\mu, \pi_{\mathcal{Y} }=\nu\right\}\) is the set of transport plans, that is, joint distributions with marginals \(\mu\) and \(\nu\). In this formulation, \(\pi\) can be understood as a joint probability measure with marginals \(\mu\) and \(\nu\). The cost function \(c(x,y)\) can be chosen, for instance, as the Euclidean distance between two locations \(x\) and \(y\), while other types of metrics could be considered, such as Riemannian distances over a manifold.

Discrete Case: In practice, the distributions \(\mu\) and \(\nu\) are typically only accessible through discrete samples, which leads to a discrete optimal transport problem. Considering two discrete probability measures \(\mu := \sum^m_{i=1} \boldsymbol{a}_i {\delta}_{\boldsymbol{x}_i}\) and \(\nu := \sum^n_{j=1} \boldsymbol{b}_j {\delta}_{\boldsymbol{y}_j}\) built from the source and target samples \(\left\{\boldsymbol{x}_{i}\right\}_{i=1}^{m}\), \(\left\{\boldsymbol{y}_{j}\right\}_{j=1}^{n}\) with \(\boldsymbol{x}_{i}, \boldsymbol{y}_{j} \in \mathbb{R}^d\), it is straightforward to rewrite the discrete Kantorovitch optimal transport as,

\[\begin{equation} \begin{aligned} \min_{\mathbf{T} \in \Pi(\boldsymbol{a},\boldsymbol{b})} \langle \mathbf{C}, \mathbf{T} \rangle \stackrel{\text{def}}{=} \min_{\mathbf{T} \in \Pi(\boldsymbol{a},\boldsymbol{b})}{\sum_{i,j} \mathbf{C}_{i,j} \mathbf{T}_{i,j}} \end{aligned} \label{eq4} \end{equation}\]

where \(\mathbf{T}\) is a coupling matrix whose entry \(\mathbf{T}_{i,j}\) describes the amount of mass flowing from \(\boldsymbol{x}_i\) to \(\boldsymbol{y}_j\), and \(\Pi(\boldsymbol{a},\boldsymbol{b}) \stackrel{\text{def}}{=} \{ \mathbf{T} \in \mathbb{R}_+^{m\times n} \vert \mathbf{T}\boldsymbol{1}_n =\boldsymbol{a}, \mathbf{T}^\top\boldsymbol{1}_m = \boldsymbol{b} \}\). The cost matrix \(\mathbf{C}\in \mathbb{R}^{m \times n}\) has entries \(\mathbf{C}_{i,j} = c(\boldsymbol{x}_i,\boldsymbol{y}_j)\) specifying the transport effort between the location pair \((\boldsymbol{x}_i, \boldsymbol{y}_j)\).

It is well known that Eq. \ref{eq4} can be rewritten as a special linear program and solved with linear solvers such as network flow or transportation simplex methods. Despite the simple form, these linear solvers are computationally expensive for large-scale cases, for example, \(\mathcal{O}{(n^3)}\) complexity for \(n\) sample pairs with network flow solvers. In many practical tasks, Sinkhorn's algorithm is therefore chosen as a faster alternative for solving such discrete optimal transport problems approximately.
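For dictionary-sized problems the exact linear program is perfectly tractable. The sketch below is one way to solve Eq. \ref{eq4} by vectorizing the coupling \(\mathbf{T}\) and handing the marginal constraints to a generic LP solver; the function name and the use of SciPy are illustrative choices, not part of the proposed method.

```python
import numpy as np
from scipy.optimize import linprog

def discrete_ot(a, b, C):
    """Exact discrete OT: min <C, T> s.t. T 1_n = a, T^T 1_m = b, T >= 0,
    solved as a linear program over the row-major vectorization of T."""
    m, n = C.shape
    # Row-marginal constraints: sum_j T[i, j] = a[i].
    A_row = np.kron(np.eye(m), np.ones((1, n)))
    # Column-marginal constraints: sum_i T[i, j] = b[j].
    A_col = np.kron(np.ones((1, m)), np.eye(n))
    res = linprog(C.ravel(),
                  A_eq=np.vstack([A_row, A_col]),
                  b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.x.reshape(m, n)
```

With a few hundred atoms on each side this solves almost instantly; the cubic cost only becomes a concern when the measures are supported on the raw pixel or patch samples.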

Figure 3. An illustration of optimal transport over two dictionaries. Given the content and reference images, the proposed method first learns the individual dictionaries and then derives an optimal transport map between the learned dictionaries.

Optimal Image Transfer

We suggest that image-to-image translation problems can be implemented by means of sparse representation and optimal transport. Image transfer, as pointed out, requires solving the feature encoding and transformation problems. We illustrate that sparse representation provides an effective tool to encode the individual and discriminative features of different images, as it has been widely used as a feature extractor in many image processing tasks. The sparse coefficients can be viewed as a counting process over dictionary elements, as shown in Fig. 2, which gives a probability measure weighing the importance of each style element. Consequently, a transport plan between the encoded features is attainable and can be computed efficiently with optimal transport thanks to the small size of the learned dictionaries (see Fig. 3).

Problem Formulation:

The proposed method starts with sparse representation to encode the latent styles of images and then transports the styles to reproduce a stylized image. For simplicity, we consider an image transfer problem between two images \(x\) and \(y\) with features or styles \(s_x\) and \(s_y\), respectively. Without loss of generality, let \(\mathbf{X}=\{\boldsymbol{x}_i\}_{i=1}^M, \mathbf{Y}=\{\boldsymbol{y}_j\}_{j=1}^N, \boldsymbol{x}_i,\boldsymbol {y}_j \in \mathbb{R}^{d}\) be the vectorized patches sampled from images \(x\) and \(y\); the latent styles \(s_x\) and \(s_y\) can then be expressed through sparse dictionaries given by,

\[\begin{equation} \begin{aligned} \begin{cases} &\mathbf{X} = \mathbf{D}^x\mathbf{A}, \qquad s.t. \quad {\lvert\lvert \boldsymbol{\alpha}_i \rvert\rvert}_0 \leq K_1,\\ &\mathbf{Y} = \mathbf{D}^y\mathbf{B}, \qquad s.t. \quad {\lvert\lvert \boldsymbol{\beta}_j \rvert\rvert}_0 \leq K_2, \end{cases} \end{aligned} \label{eq5} \end{equation}\]

where \(\mathbf{D}^x = [\boldsymbol{d}_1^x, \cdots, \boldsymbol{d}_m^x]\) and \(\mathbf{D}^y= [\boldsymbol{d}_1^y, \cdots, \boldsymbol{d}_n^y]\) are the style dictionaries. We assume the atoms \(\boldsymbol{d}_i^x,\boldsymbol{d}_j^y \in \mathbb{R}^{d}\) lie in the same space. \(\mathbf{A} \in \mathbb{R}^{m\times M}\) and \(\mathbf{B} \in \mathbb{R}^{n\times N}\) contain the coefficient vectors \(\boldsymbol{\alpha}_i\) and \(\boldsymbol{\beta}_j\) of the \(i\)-th and \(j\)-th samples \(\boldsymbol{x}_i\) and \(\boldsymbol{y}_j\). The constraints require the weights \(\boldsymbol{\alpha}_i\) and \(\boldsymbol{\beta}_j\) to be sparse, that is, the number of non-zero entries is no more than the positive integer \(K_1\) (or \(K_2\)).

The dictionaries \(\mathbf{D}^x\) and \(\mathbf{D}^y\), in general, may not exactly encode the styles \(s_x\) and \(s_y\); nevertheless, as demonstrated hereafter, they are sufficient to provide favorable results for image transfer tasks. On the one hand, sparse/redundant representation offers a compact and effective image editing tool for the latent styles. Notice that the size of the dictionaries is always much smaller than the number of samples in practice, so optimal transport between the two dictionaries \(\mathbf{D}^x\) and \(\mathbf{D}^y\) is affordable even with a linear programming solver, which significantly reduces the computational cost compared with the naive case over the samples \(\mathbf{X}\) and \(\mathbf{Y}\). On the other hand, it is reasonable to assume that the weight of each element in the learned dictionaries indicates the contribution of that latent style element to an image. Notice also that each row sum of the coefficient matrices counts the total contribution of the corresponding style element. Let \(\boldsymbol{a}, \boldsymbol{b}\) be the row sums of the coefficient matrices; we have \(\boldsymbol{a} = \boldsymbol{A}\boldsymbol{1}_M, \boldsymbol{b} =\boldsymbol{B}\boldsymbol{1}_N\), where \(\boldsymbol{1}_M\) (resp. \(\boldsymbol{1}_N\)) is the all-ones vector of length \(M\) (resp. \(N\)), and two discrete probability distributions \(\mu := \sum^m_{i=1} \boldsymbol{a}_i {\delta}_{\boldsymbol{d}^x_i}\) and \(\nu := \sum^n_{j=1}\boldsymbol{b}_j {\delta}_{\boldsymbol{d}^y_j}\) over the learned dictionaries, where \(\boldsymbol{a}_k\) and \(\boldsymbol{b}_k\) are the \(k\)-th elements of \(\boldsymbol{a}\) and \(\boldsymbol{b}\), and \(\delta_{(\cdot)}\) is the Dirac measure. Recalling the Kantorovitch relaxation of the transport problem in Sec. \ref{subsec3:optimal_transport}, we obtain an optimal transport problem on the learned dictionaries,

\[\begin{equation} \begin{aligned} \min_{\mathbf{T} \in \Pi(\boldsymbol{a},\boldsymbol{b})} \langle \mathbf{C}, \mathbf{T} \rangle \stackrel{\text{def}}{=} {\sum_{i,j} \mathbf{C}_{i,j} \mathbf{T}_{i,j}}, \end{aligned} \label{eq6} \end{equation}\]

where \(\mathbf{C}_{i,j} \stackrel{\text{def}}{=} c(\boldsymbol{d}^x_i, \boldsymbol{d}^y_j)\) is the ground cost of moving the dictionary element \(\boldsymbol{d}^x_i\) to \(\boldsymbol{d}^y_j\), and the transport plan \(\mathbf{T}\) satisfies,

\[\begin{equation} \begin{aligned} \Pi(\boldsymbol{a},\boldsymbol{b}) \stackrel{\text{def}}{=} \{ \mathbf{T} \in \mathbb{R}_+^{m\times n} \vert \mathbf{T}\boldsymbol{1}_n = \boldsymbol{a}, \mathbf{T}^\top\boldsymbol{1}_m = \boldsymbol{b}\}. \end{aligned} \label{eq7} \end{equation}\]

With the above notation, the optimal style transfer over the learned style dictionaries, parameterized by \(\mathbf{T} \in \mathbb{R}_{+}^{m \times n}\), can now be formulated as,

\[\begin{equation} \begin{array}{rllc} & \displaystyle \min_{\mathbf{T}} \langle \mathbf{C}, \mathbf{T} \rangle \stackrel{\text{def}}{=}\displaystyle \min_{\mathbf{T}} \sum_{i,j} \mathbf{C}_{i,j} \mathbf{T}_{i,j} & &\\ \text{s.t.} & \mathbf{D}^x\mathbf{A} =\mathbf{X}, & {\lvert\lvert \boldsymbol{\alpha}_i \rvert\rvert}_0 \leq K_1, & \\ & \mathbf{D}^y\mathbf{B} = \mathbf{Y}, & {\lvert\lvert \boldsymbol{\beta}_j \rvert\rvert}_0 \leq K_2, & \\ & \mathbf{A}\boldsymbol{1}_M=\boldsymbol{a}, & \mathbf{B}\boldsymbol{1}_N =\boldsymbol{b}, & \\ & \mathbf{T}\boldsymbol{1}_n = \boldsymbol{a}, & \mathbf{T}^\top\boldsymbol{1}_m = \boldsymbol{b}. & \end{array} \label{eq8} \end{equation}\]

It is clear from Eq. \ref{eq8} that the objective aims to infer an optimal transport plan between the learned style dictionaries, in which the first and second lines of constraints are the sparse representations of the images, and the last two lines specify the probability distributions over the dictionaries and the necessary conditions on the transport plan. Despite its simple form, Eq. \ref{eq8} is not easy to solve due to the \(L_0\)-norm constraints. In what follows, a relaxed version of Eq. \ref{eq8} is discussed, together with an approximate solution under some mild assumptions.

Relaxed Model: As illustrated, image style transfer is formulated as an optimal transport over sparse dictionaries with both sparse representation and optimal transport constraints. However, a direct solution to Eq. \ref{eq8} is not available for two reasons: (1) the sparsity constraints \({\lvert\lvert \boldsymbol{\alpha}_i\rvert\rvert}_0 \leq K_1\) and \({\lvert\lvert \boldsymbol{\beta}_j\rvert\rvert}_0 \leq K_2\) are non-convex and difficult to handle in practice; and (2) the sub-problem with respect to the variable \(\mathbf{A}\) is a typical Sylvester equation constrained by \(\mathbf{D}^x\mathbf{A} =\mathbf{X}\) and \(\mathbf{A}\boldsymbol{1}_M = \boldsymbol{a}\)\footnote{The Sylvester equation has the form \(AX+XB=C\), whose solution is computationally expensive for large-scale problems.}. As a result, it is necessary to relax the problem for a more efficient solution.

Noting that \(\mathbf{A}\boldsymbol{1}_M=\boldsymbol{a}\) and \(\mathbf{B}\boldsymbol{1}_N = \boldsymbol{b}\), we have \(\mathbf{X}\boldsymbol{1}_M = \mathbf{D}^x \boldsymbol{a}\) and \(\mathbf{Y}\boldsymbol{1}_N = \mathbf{D}^y \boldsymbol{b}\) by multiplying both sides of the third-line constraints in Eq. \ref{eq8} by \(\mathbf{D}^x\) and \(\mathbf{D}^y\), respectively. As a result, a relaxed minimization problem can be written in the form,

\[\begin{equation} \begin{array}{c l l r} & \displaystyle \min_{\mathbf{T}} \langle \mathbf{C}, \mathbf{T} \rangle \stackrel{\text{def}}{=}\displaystyle \min_{\mathbf{T}} \sum_{i,j} \mathbf{C}_{i,j} \mathbf{T}_{i,j} & & \\ \text{s.t.} & \mathbf{D}^x\mathbf{A} =\mathbf{X}, & {\lvert\lvert \boldsymbol{\alpha}_i \rvert\rvert}_0 \leq K_1, & \\ & \mathbf{D}^y\mathbf{B} = \mathbf{Y}, & {\lvert\lvert \boldsymbol{\beta}_j \rvert\rvert}_0 \leq K_2, & \\ & \mathbf{D}^x\boldsymbol{a}=\mathbf{X}\boldsymbol{1}_M, & \mathbf{D}^y\boldsymbol{b}=\mathbf{Y}\boldsymbol{1}_N, & \\ & \mathbf{T}\boldsymbol{1}_n = \boldsymbol{a}, & \mathbf{T}^\top\boldsymbol{1}_m = \boldsymbol{b}. & \end{array} \label{eq9} \end{equation}\]

This relaxation places the constraints on the sparse dictionaries instead of the coefficients, leading to two-fold benefits. On the one hand, the constraints \(\mathbf{D}^x\boldsymbol{a}=\mathbf{X}\boldsymbol{1}_M\) and \(\mathbf{D}^y\boldsymbol{b}=\mathbf{Y}\boldsymbol{1}_N\) in Eq. \ref{eq9} turn the sparse coefficient constraints into style dictionary constraints, which helps to learn more favorable style dictionaries. On the other hand, it reduces the sub-problem with respect to \(\mathbf{A}\) (or \(\mathbf{B}\)) to a standard sparse coding form, thereby avoiding the complex Sylvester equation. As interpreted hereafter, the relaxed problem can be approximately optimized using an alternating variable-splitting method under a \(p\)-Wasserstein cost function.

The Solution in the p-Wasserstein Case:

The cost function \(\mathbf{C}_{i,j} \stackrel{\text{def}}{=} c(\boldsymbol{d}^x_i, \boldsymbol{d}^y_j)\) is important to the optimal transport plan. It turns out that the transport plan always exists when using the \(p\)-Wasserstein distance \(\boldsymbol{W}_p^p (\boldsymbol{\mu},\boldsymbol{\nu})\) with \(p \geq 1\). For simplicity, we consider the 2-Wasserstein distance, where the cost function is defined as \(\mathbf{C}_{i,j} = {\Vert\boldsymbol{d}^x_i-\boldsymbol{d}^y_j \Vert}^2_2\), measuring the distance between the dictionary elements \((\boldsymbol{d}^x_i, \boldsymbol{d}^y_j)\). Clearly, \(\mathbf{C}_{i,j}=0\) if \(\boldsymbol{d}^x_i = \boldsymbol{d}^y_j\).
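For completeness, the cost matrix for the 2-Wasserstein case can be assembled with the usual squared-norm expansion; a small sketch (column-wise dictionaries assumed) is given below.

```python
import numpy as np

def squared_euclidean_cost(Dx, Dy):
    """C[i, j] = ||d^x_i - d^y_j||_2^2 for dictionaries stored column-wise,
    Dx of shape (d, m) and Dy of shape (d, n)."""
    sq_x = (Dx ** 2).sum(axis=0)[:, None]     # (m, 1)
    sq_y = (Dy ** 2).sum(axis=0)[None, :]     # (1, n)
    # ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v, computed for all pairs at once.
    return sq_x + sq_y - 2.0 * Dx.T @ Dy      # (m, n)
```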

Recalling the relaxed model in Eq. \ref{eq9}, we first rewrite the constrained optimization problem as an unconstrained one using regularization techniques and then apply the well-known alternating variable-splitting algorithm to solve it approximately. By introducing the Lagrangian multiplier technique, the problem can be reformulated as the following unconstrained optimization problem and solved via an alternating minimization scheme:

\[\begin{equation} \begin{aligned} \begin{split} \operatorname*{argmin}_{\{\mathbf{D}^x,\mathbf{D}^y, \mathbf{A},\mathbf{B}, \mathbf{T}\}} & \gamma \sum_{i,j} \mathbf{T}_{i,j} {\Vert\boldsymbol{d}^x_i-\boldsymbol{d}^y_j \Vert}^2_2 + {\big\Vert \mathbf{X}- \mathbf{D}^x \mathbf{A} \big\Vert}_{F}^2+ {\big\Vert \mathbf{Y}- \mathbf{D}^y \mathbf{B} \big\Vert}_{F}^2 \\ + & \lambda_x {\big\Vert \mathbf{X}\boldsymbol{1}_M- \mathbf{D}^x \boldsymbol{a} \big\Vert}_{2}^2 + \lambda_y {\big\Vert \mathbf{Y}\boldsymbol{1}_N- \mathbf{D}^y\boldsymbol{b} \big\Vert}_{2}^2 \\ + & \tau_x {\big\Vert\mathbf{T}\boldsymbol{1}_n - \boldsymbol{a} \big\Vert}_{2}^2 + \tau_y {\big\Vert\mathbf{T}^\top\boldsymbol{1}_m - \boldsymbol{b} \big\Vert}_{2}^2 \end{split} \end{aligned} \label{eq10} \end{equation}\]

where \(\gamma, \lambda_{x(y)}\) and \(\tau_{x(y)}\) are positive Lagrangian multipliers. Accordingly, the solution of Eq. \ref{eq10} converges to that of Eq. \ref{eq9} when the Lagrangian multipliers go to infinity. The main idea of the alternating method is to solve the problem sequentially, updating one block of variables while fixing the others. It is easy to see that Eq. \ref{eq10} can be divided into three sub-problems: sparse coding, style dictionary learning and transport map inference. For brevity, we describe the solution for the variables \(\mathbf{T}, \mathbf{D}^x, \mathbf{A}\), and \(\boldsymbol{a}\); the variables \(\mathbf{D}^y, \mathbf{B}\), and \(\boldsymbol{b}\) can be handled analogously.

Sparse coding: Fixing \(\mathbf{T}, \mathbf{D}^x, \mathbf{D}^y\) and \(\boldsymbol{a}\), \(\boldsymbol{b}\) in Eq. \ref{eq10}, the optimization problem w.r.t. the variables \(\mathbf{A}\) and \(\mathbf{B}\) reduces to a standard sparse coding problem. Considering \(\mathbf{X}\boldsymbol{1}_M= \mathbf{D}^x \boldsymbol{a}\) and \(\mathbf{Y}\boldsymbol{1}_N = \mathbf{D}^y \boldsymbol{b}\), the variables \(\mathbf{A}\) and \(\mathbf{B}\) are only subject to the sparsity constraints. Taking the coefficient matrix \(\mathbf{A}\) as an example, the representation vector \(\boldsymbol{\alpha}_i\) of each sample \(\boldsymbol{x}_i\) in \(\mathbf{X}\) is the \(i\)-th column of \(\mathbf{A}\) and is obtained by solving the following problem,

\[\begin{equation} \begin{aligned} \min_{\boldsymbol{\alpha}_i} {\lvert\lvert \boldsymbol{\alpha}_i \rvert\rvert}_0, \quad s.t. \quad \boldsymbol{x}_i = \mathbf{D}^x\boldsymbol{\alpha}_i. \end{aligned} \label{eq11} \end{equation}\]

As aforementioned, the sparse coding problem of Eq. \ref{eq11} can be solved by many existing methods such as the matching pursuit (MP) or orthogonal matching pursuit (OMP) algorithms. We use the OMP method here for ease of implementation. Once the coefficients \(\mathbf{A}\) and \(\mathbf{B}\) are obtained, the row sums of the coefficients, that is, \(\boldsymbol{a}=\mathbf{A}\boldsymbol{1}_M\) and \(\boldsymbol{b}=\mathbf{B}\boldsymbol{1}_N\), provide discrete probability measures over the dictionary atoms. It is worth mentioning that the probability distributions \(\boldsymbol{a}\) and \(\boldsymbol{b}\) must be non-negative, whereas the learned coefficients may be negative. When the \(i\)-th row sum \(\boldsymbol{a}_i\) is negative, this can be remedied by flipping signs, i.e., setting \(\boldsymbol{d}^x_i \leftarrow -\boldsymbol{d}^x_i\) and \(\mathbf{A}(i,:) \leftarrow -\mathbf{A}(i,:)\), without affecting the sparse encoding. Non-negative sparse coding is an alternative remedy in practice. Without ambiguity, \(\boldsymbol{a}\) and \(\boldsymbol{b}\) also denote their normalized counterparts.
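A minimal OMP routine and the sign-flip remedy might look as follows. This is an illustrative sketch (the helper names `omp` and `atom_distribution` are ours); in practice any off-the-shelf OMP implementation can be substituted.

```python
import numpy as np

def omp(D, x, K):
    """Greedy orthogonal matching pursuit for Eq. (11): pick at most K atoms of D
    (d x n, roughly unit-norm columns) to approximate the patch x (length d)."""
    residual = x.copy()
    support = []
    alpha = np.zeros(D.shape[1])
    for _ in range(K):
        k = int(np.argmax(np.abs(D.T @ residual)))   # atom most correlated with residual
        if k not in support:
            support.append(k)
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)  # re-fit on support
        alpha[:] = 0.0
        alpha[support] = coeffs
        residual = x - D[:, support] @ coeffs
    return alpha

def atom_distribution(D, A):
    """Row sums of the coefficient matrix A (n x M) as a probability measure over atoms,
    with the sign-flip remedy for negative row sums described above."""
    a = A.sum(axis=1)
    neg = a < 0
    D[:, neg] *= -1.0        # d_i <- -d_i
    A[neg, :] *= -1.0        # A(i, :) <- -A(i, :)
    a = np.abs(a)
    return D, A, a / a.sum()
```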

Learning style dictionaries: By analogy, we then fix the coefficients \(\mathbf{A}, \mathbf{B}\) and the transport map \(\mathbf{T}\) and update the style dictionaries \(\mathbf{D}^x\) and \(\mathbf{D}^y\), respectively. Considering the relaxed constraints \(\mathbf{X}\boldsymbol{1}_M= \mathbf{D}^x \boldsymbol{a}, \mathbf{Y}\boldsymbol{1}_N = \mathbf{D}^y \boldsymbol{b}\), we write the Tikhonov-regularized form of the sub-problem of Eq. \ref{eq10} with respect to \(\mathbf{D}^x\) (or \(\mathbf{D}^y\)) as,

\[\begin{equation} \begin{aligned} \begin{split} \operatorname*{argmin}_{\{\boldsymbol{d}^x_i\}} &{\Vert \mathbf{D}^x\mathbf{A} - \mathbf{X} \Vert}_{F}^2 + \lambda_x {\Vert \mathbf{D}^x \boldsymbol{a} - \mathbf{X}\boldsymbol{1}_M \Vert}_{2}^2 \\+& \gamma \sum_{i,j} \mathbf{T}_{i,j} {\Vert\boldsymbol{d}^x_i-\boldsymbol{d}^y_j \Vert}^2_2 \end{split} \end{aligned} \label{eq12} \end{equation}\]

where \(\lambda_x\) and \(\gamma\) are positive weights. Accordingly, Eq. \ref{eq12} can be viewed as a regularized dictionary learning problem. To learn the redundant style dictionaries, we use the well-known K-SVD algorithm to update each element \(\boldsymbol{d}^x_i\) of dictionary \(\mathbf{D}^x\) sequentially. Notice that the original K-SVD algorithm is not directly applicable due to the regularization terms in Eq. \ref{eq12}. We instead introduce an extended K-SVD algorithm for updating the dictionaries. For clarity, we first review the original K-SVD algorithm and then show how to extend it to the proposed model.

The Extended K-SVD Algorithm: In the K-SVD algorithm, sparse representation factorizes an image \(\mathbf{X}\) into the product of the dictionary \(\mathbf{D}^x\) and the coefficient matrix \(\mathbf{A}\), that is, \(\mathbf{X}=\mathbf{D}^x\mathbf{A}\). We now focus on the dictionary-learning step by fixing the sparse coefficients \(\mathbf{A}\). The dictionary \(\mathbf{D}^x\) can be updated by minimizing the objective function,

\[\begin{equation} \begin{aligned} \begin{split} {\Vert \mathbf{X}-\mathbf{D}^x\mathbf{A} \Vert}_{F}^2 = & \left\| \mathbf{X}- \sum_{j=1}^n \boldsymbol{d}^x_j \boldsymbol{\alpha}_{T}^j \right\|_{F}^2 = \left\| \left( \mathbf{X}- \sum_{j \neq k} \boldsymbol{d}^x_j \boldsymbol{\alpha}_{T}^j \right) - \boldsymbol{d}^x_k \boldsymbol{\alpha}_{T}^k \right\|_{F}^2 \\ = & \left\| \mathbf{E}^k- \boldsymbol{d}^x_k \boldsymbol{\alpha}_{T}^k \right\|_{F}^2 \end{split} \end{aligned} \label{eq13} \end{equation}\]

where \(\mathbf{E}^k=\mathbf{X}- \sum_{j \neq k} \boldsymbol{d}^x_j \boldsymbol{\alpha}_{T}^j\) is the residual that does not involve the \(k\)-th dictionary element \(\boldsymbol{d}^x_k\), and \(\boldsymbol{\alpha}_{T}^k\) is the \(k\)-th row of the coefficient matrix \(\mathbf{A}\). With the above form, \(\boldsymbol{d}_k^x\) is updated as the first left singular vector given by the SVD, as in the K-SVD algorithm. The basic idea is to decompose the term \(\mathbf{D}^x \mathbf{A}\) into a sum of \(n\) rank-1 matrices, so that only one dictionary element \(\boldsymbol{d}^x_k\) is involved at a time and can be updated independently while fixing the residual \(\mathbf{E}^k\). This strategy helps to learn the redundant dictionaries more efficiently.

The process extends analogously to the regularized case. Let \(\mathbf{E}^k =\mathbf{X}-\sum_{j\neq k} \boldsymbol{d}_j^x \boldsymbol{\alpha}_{T}^{j}\) and \(\mathbf{F}^k= \mathbf{X}\boldsymbol{1}_{M}-\sum_{j\neq k} \boldsymbol{a}_{j}\boldsymbol{d}_j^x\) be the residuals of \({\Vert \mathbf{D}^x\mathbf{A} - \mathbf{X} \Vert}_{F}^2\) and \({\Vert \mathbf{D}^x \boldsymbol{a} - \mathbf{X}\boldsymbol{1}_M \Vert}_{2}^2\) excluding the element \(\boldsymbol{d}_k^x\), where \(\boldsymbol{\alpha}_{T}^{j}\) is the \(j\)-th row of \(\mathbf{A}\) and \(\boldsymbol{a}_{j}\) is the \(j\)-th element of \(\boldsymbol{a}\). With this notation, the first two terms in Eq. \ref{eq12} can be rewritten as \({\Vert \mathbf{E}^k -\boldsymbol{d}_k^x \boldsymbol{\alpha}_{T}^{k} \Vert}_{F}^2\) and \({\Vert \mathbf{F}^k - \boldsymbol{a}_{k}\boldsymbol{d}_k^x \Vert}_{2}^2\), respectively. The sub-problem of Eq. \ref{eq12} with respect to the \(k\)-th dictionary element \(\boldsymbol{d}^x_k\) then takes the form,

\[\begin{equation} \begin{aligned} \begin{split} \operatorname*{argmin}_{\boldsymbol{d}^x_k} & \left\| \mathbf{E}^k- \boldsymbol{d}^x_k \boldsymbol{\alpha}_{T}^k \right\|_{F}^2 + \lambda_x \left\| \mathbf{F}^k- \boldsymbol{a}_{k}\boldsymbol{d}^x_k \right\|_{2}^2 + \gamma \sum_{j} \mathbf{T}_{k,j} {\Vert\boldsymbol{d}^x_k-\boldsymbol{d}^y_j \Vert}^2_2. \end{split} \end{aligned} \label{eq14} \end{equation}\]

It is easy to verify that \(\boldsymbol{d}_k^x\) has a closed-form solution because the objective function in Eq. \ref{eq14} is quadratic with respect to \(\boldsymbol{d}^x_k\). Compared with Eq. \ref{eq13}, the last two terms in Eq. \ref{eq14} act as regularization terms, which also help to stabilize the numerical solution. The reader is referred to the K-SVD algorithm for more details.
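Under our reading of Eq. \ref{eq14}, setting its gradient with respect to \(\boldsymbol{d}^x_k\) to zero yields the atom update sketched below. The helper is written purely for illustration (names and shapes are our own conventions, with column-wise dictionaries), not as the paper's exact implementation.

```python
import numpy as np

def update_atom(k, X, D_x, D_y, A, a, T, lam_x, gamma):
    """Closed-form update of the k-th atom d^x_k from the quadratic objective in Eq. (14).
    Shapes: X (d, M), D_x (d, m), D_y (d, n), A (m, M), a (m,), T (m, n)."""
    alpha_k = A[k, :]                                    # k-th row of the coefficient matrix
    ones_M = np.ones(X.shape[1])
    # Residuals excluding the k-th atom (E^k and F^k in the text).
    E_k = X - D_x @ A + np.outer(D_x[:, k], alpha_k)
    F_k = X @ ones_M - D_x @ a + a[k] * D_x[:, k]
    # Zero-gradient condition: a single linear equation in d^x_k.
    numerator = E_k @ alpha_k + lam_x * a[k] * F_k + gamma * D_y @ T[k, :]
    denominator = alpha_k @ alpha_k + lam_x * a[k] ** 2 + gamma * T[k, :].sum()
    return numerator / max(denominator, 1e-12)
```

Sweeping \(k\) over all atoms, and doing the same for \(\mathbf{D}^y\) with the roles of the two dictionaries swapped, gives one pass of the extended K-SVD update.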

Optimal Transport: Inferring the transport map \(\mathbf{T}\) over the learned style dictionaries becomes a standard optimal transport problem once the style dictionaries \(\mathbf{D}^x, \mathbf{D}^y\) and coefficients \(\mathbf{A}, \mathbf{B}\) are fixed, that is,

\[\begin{equation} \begin{aligned} \begin{split} &\operatorname*{argmin}_{\mathbf{T}} \ \sum_{i,j} \mathbf{T}_{i,j} {\Vert\boldsymbol{d}^x_i-\boldsymbol{d}^y_j \Vert}^2_2, \\ & s.t. \quad \mathbf{T}\boldsymbol{1}_n = \boldsymbol{a}, \ \mathbf{T}^\top\boldsymbol{1}_m = \boldsymbol{b}. \end{split} \end{aligned} \label{eq15} \end{equation}\]

It is worth noting that a discrete optimal transport problem can be solved by linear programming, while the computational cost increases significantly with large-scale samplings. It is empirically solvable in our case due to the small size of the style dictionaries, which forms one of the core advantages of our method. Notice that \(\boldsymbol{a}\) and \(\boldsymbol{b}\) in Eq. \ref{eq15} represent the normalized counterparts.


Entropy-Regularized Optimal Transport: It is well known that an exact solution of optimal transport based on the network flow method has computational complexity \(O\left(n^3\right)\) for \(n\) samples. The solution may become unattainable on a typical PC once \(n\) reaches thousands of samples. Instead, we resort to the more efficient entropy-regularized optimal transport,

\[\begin{equation} \begin{aligned} \begin{split} \operatorname*{argmin}_{\mathbf{T}}& \ \sum_{i,j} \mathbf{C}_{i,j} \mathbf{T}_{i,j} + \eta H(\mathbf{T}) \\ s.t. \quad &\mathbf{T}\boldsymbol{1}_n = \boldsymbol{a}, \quad \mathbf{T}^\top\boldsymbol{1}_m = \boldsymbol{b}, \end{split} \end{aligned} \label{eq16} \end{equation}\]

where \(H(\mathbf{T}) = \sum_{i,j}\mathbf{T}_{i,j} (\log (\mathbf{T}_{i,j})-1)\) is the (negative) entropic regularization term and \(\eta\) is a positive regularization parameter. As interpreted in the literature, the regularized model of Eq. \ref{eq16} is a convex optimization problem and can be solved with the Sinkhorn-Knopp algorithm. A detailed solution is presented in Alg. \ref{alg1}. Note that the sub-problems in Alg. \ref{alg1} only involve the component-wise division operator \(\oslash\) and can be computed efficiently.
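A compact Sinkhorn-Knopp iteration for Eq. \ref{eq16} is sketched below; only matrix-vector products and component-wise divisions \(\oslash\) are involved. This is a generic textbook version rather than the exact Alg. \ref{alg1} of the paper, and an OT library can of course be used instead.

```python
import numpy as np

def sinkhorn(a, b, C, eta, n_iter=200):
    """Entropy-regularized OT (Eq. 16) via Sinkhorn-Knopp scaling.
    a (m,), b (n,): marginal distributions; C (m, n): cost matrix; eta: regularization."""
    K = np.exp(-C / eta)                  # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                   # component-wise division (the "o-slash" update)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]    # coupling T = diag(u) K diag(v)
```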

Figure 4. Visual comparison of color transfer on natural images. From left to right: (a) content images, (b) MKL, (c) Regularized OT, (d) PhotoNAS, (e) NeuralPreset, (f) our results, and (g) reference images. (Zoom in for better view).

Image Synthesis:

Once the dictionaries \(\mathbf{D}^{x}\) and \(\mathbf{D}^{y}\) and the transport map \(\mathbf{T}\) are obtained, given an image \(\mathbf{x}\) in style \(s_{x}\), it is easy to reconstruct an image \(\hat{\mathbf{y}}\) with style \(s_{y}\) by swapping the corresponding style dictionaries while keeping the sparse representation coefficients unchanged, that is,

\[\begin{equation} \begin{split} \hat{\mathbf{y}}_i= \hat{\mathbf{D}}^{x} \boldsymbol{\alpha}_i =\mathbf{T}({\mathbf{D}}^{y}) \boldsymbol{\alpha}_i, \end{split} \label{eq17} \end{equation}\]

where the \(i\)-th column of \(\hat{\mathbf{D}}^{x}\) is \(\hat{\boldsymbol{d}}^{x}_i=\mathbf{T}(\boldsymbol{d}^{y})_i=\frac{\sum_{j=1}^n \mathbf{T}_{i,j} \boldsymbol{d}^y_j}{\sum_{j=1}^n \mathbf{T}_{i,j}}\), which can be viewed as a posterior mean (barycentric) estimate defining a one-to-one transfer of dictionary atoms, and \(\boldsymbol{\alpha}_{i}\) is the sparse coefficient vector of patch \(\mathbf{x}_{i}\). In most cases, the image \(\mathbf{x}\) is not exactly the same as the training data but is sampled from an identical distribution, so the coefficients \(\boldsymbol{\alpha}_i\) for each patch \(\mathbf{x}_i\) can be obtained via the sparse coding problem in Eq. \ref{eq11}. For photo-realistic image transfer, it may be preferable to add constraints for more consistent local textures, for example, a simple gradient regularization for image synthesis,

\[\begin{equation} \begin{split} \operatorname*{argmin}_{\hat{\mathbf{y}}} \ & {\Vert \hat{\mathbf{y}} -\hat{\mathbf{D}}^{x} \boldsymbol{\alpha} \Vert}_{2}^2 + \rho {\Vert \nabla \hat{\mathbf{y}} - \nabla \mathbf{x} \Vert}_{2}^2\\ \end{split} \label{eq18} \end{equation}\]

where \(\nabla\) is the gradient operator and \(\rho\) weights the gradient regularization term. Eq. \ref{eq18} admits a closed-form solution and can therefore be computed efficiently.
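The synthesis step of Eq. \ref{eq17} reduces to one barycentric mapping of the reference dictionary followed by a plain reconstruction with the original codes. The sketch below assumes column-wise dictionaries and non-overlapping square patches, and omits the gradient-regularized refinement of Eq. \ref{eq18}, which is a standard linear least-squares solve.

```python
import numpy as np

def transfer_dictionary(T, D_y):
    """Barycentric mapping of Eq. (17): the i-th transferred atom is the T-weighted
    average of the reference atoms. D_y is (d, n), T is (m, n); returns (d, m)."""
    weights = T / np.maximum(T.sum(axis=1, keepdims=True), 1e-12)   # normalize each row of T
    return D_y @ weights.T

def synthesize_patches(D_hat, A, patch=8):
    """Reconstruct the transferred patches from the swapped dictionary and original codes.
    D_hat is (d, m), A is (m, M); returns M patches of size patch x patch."""
    Y_hat = D_hat @ A                     # (d, M), one column per transferred patch
    return Y_hat.T.reshape(-1, patch, patch)
```

Overlapping patches can be averaged back into the image plane, and the gradient term of Eq. \ref{eq18} then acts as a final least-squares polish toward the content gradients.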

Experimental Results

In this section, we extensively illustrate the performance of the proposed optimal style transfer and provide empirical evidence on two fundamental image-to-image translation tasks: color transfer and artistic style transfer. In each scenario, the sparse coefficients and individual dictionaries are first learned using sparse representation, and then an optimal transport map is derived on the learned dictionaries (see Fig. 1). The process is updated iteratively until a given stopping criterion is met. The learned dictionaries are treated as individual feature styles for image reconstruction.

Figure 7. Visual results of artistic style transfer of the proposed method. The content images in the first column are transferred according to the given styles in the first row. (Zoom in for better view).

As shown in Fig. 6, the OT-based MKL mapping and the ROT model faithfully convert the content colors in both cases according to the guidance of the reference images. The former reveals color degradation with over-smoothed details (Fig. \ref{fig6-b}), while the latter exhibits block artifacts (Fig. \ref{fig6-c}) due to its sub-sampling strategy, although these can be partially rectified by filtering methods. The neural transfer methods produce fine details but may suffer from inappropriate color mapping, leading to over-saturated or under-saturated effects. In contrast, the proposed model (Fig. \ref{fig6-f}) greatly alleviates these drawbacks thanks to the strategy of simultaneous representation and transformation, although the improvement comes at the expense of the higher computational cost of dictionary training and image reconstruction compared with the sub-sampling strategy.

Figure 8. Visual comparison of artistic style transfer with deep learning methods: (a) input content and style images, (b) AdaIN, (c) WCT, (d) PhotoWCT, (e) WCT$^2$, (f) StyTr$^2$, (g) QuantArt and (h) our results. (Zoom in for better view).

Photo-realistic Style Transfer:

We further show that the proposed model is applicable to artistic style transfer, especially for photo-realistic effects. As shown in Fig. 6, we present the transferred results with artworks created by famous artists. It is clear that all cases contain very complex content composed of sophisticated painting strokes, multiple colors, and textures for aesthetic effects. To obtain visually pleasing results, we randomly select \(100K\) patches from the content and reference images and use \(1024\) dictionary elements for training. The optimal transport is solved by the entropy-regularized method with regularization parameter \(\eta = 0.05\) and \(200\) iterations for each optimal transport step. The new model is able to produce visually pleasing transferred results with consistently aesthetic styles and well-preserved details. In Fig. 7, we additionally show the high-quality performance of the model with different reference styles, which demonstrates the benefits of sparse representation and optimal transport.

In Fig. 8, we compare the transferred results with deep learning-based methods. The AdaIN model and the WCT method are two well-known arbitrary style transfer methods. As shown in Fig. 8b and Fig. 8c, both of them are capable of transferring global features such as image colors and main structures for impressive results; however, they have limitations in suppressing inconsistent local details. Recently, transformer-based architectures have attracted great attention in many vision-based tasks. We therefore include the StyTr\(^2\) model for comparison; it also suffers from slightly inconsistent local details in Fig. 8f despite the powerful transformer architecture. In addition, we compare the transferred results with the photo-realistic methods PhotoWCT, WCT\(^2\) and QuantArt, in which the inconsistent local details are significantly suppressed, as verified in Fig. 8d, Fig. 8e and Fig. 8g. Similarly, the proposed method also gives rise to photo-realistic results with more consistent local textures and details in Fig. \ref{fig8-h}. The same behavior is observed in Fig. 9. In summary, the proposed method is applicable to both natural and artistic images, and the results further demonstrate its ability to recover consistent details.

Additionally, we present the quantitative performance against recent deep learning-based methods. Notice that an objective assessment is often difficult due to the lack of ground truth for aesthetic quality. As suggested in recent work, we adopt the structural similarity (SSIM) between the edge maps of the content and transferred images to indicate detail preservation. We also take structural fidelity into account using the intrinsic image transfer (IIT) algorithm, in view of its robust structure-preserving property under varying illumination (color and brightness) conditions. Meanwhile, we introduce three deep learning-based evaluation metrics: the LPIPS loss, the Gram loss (on VGG style features) and the FID metric, which measure the perceptual similarity between the content and generated images in terms of image content, style and visual fidelity, respectively.

The evaluation is conducted on a small subset of 21 paired content and reference images sampled from the WikiArt dataset. For fairness, all images are rescaled to \(512\times512\) resolution, and the statistical results are listed in Table \ref{tab1}. As we can see, the results are consistent with the visual effects in Fig. 8. The AdaIN method is efficient but shows low structural fidelity on both the SSIM-edge and IIT indices. The WCT achieves obvious improvements in structural similarity, but its perceptual similarity decreases due to over-smoothed local structures. The increasing trend of structural similarity is also observed for both the PhotoWCT and WCT\(^2\) methods. Moreover, they achieve much better perception-based LPIPS, Gram loss and FID scores, which is also evident from their photo-realistic transferred effects. The StyTr\(^2\) shows effects very similar to the AdaIN method, and QuantArt produces high-quality consistent structures but has limited perceptual similarity in the Gram loss and FID metrics, which may be caused by its limited color transform in some cases. In contrast, the proposed method strikes a fine balance between structural fidelity and perceptual similarity in content, style and visual fidelity, as also observed in the visual effects in Fig. 8. The benefits mainly stem from the fact that sparse representation provides a simple but effective tool that is well suited to low-level and middle-level feature extraction, particularly when pursuing photo-realistic image transfer effects.