Building an Energy-Based Model from Scratch with PyTorch


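Background reading:
[1] Overview of generative modeling
[2] Transformer I, Transformer II
[3] Variational autoencoders
[4] Generative adversarial networks, Advanced GANs I, Advanced GANs II
[5] Autoregressive models
[6] Normalizing flows
[7] Energy-based models
[8] Diffusion models I, Diffusion models II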

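In this tutorial, we look at deep energy-based models and focus on their application as generative models. Energy models were a popular tool before the deep learning hype that started around 2012. In recent years, they have been gaining attention again thanks to improved training methods and tricks. Although they are still at a research stage, they have been shown to outperform, in certain cases, strong generative adversarial networks, which had been the state of the art for image generation. It is therefore worth understanding energy-based models, and because the theory can be abstract at times, we illustrate the ideas with plenty of examples.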

First, let's import the standard libraries below.

## Standard libraries
import os
import json
import math
import numpy as np
import random

## Imports for plotting
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg', 'pdf') # For export
from matplotlib.colors import to_rgb
import matplotlib
from mpl_toolkits.mplot3d.axes3d import Axes3D
from mpl_toolkits.mplot3d import proj3d
matplotlib.rcParams['lines.linewidth'] = 2.0
import seaborn as sns
sns.reset_orig()

## PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torch.optim as optim
# Torchvision
import torchvision
from torchvision.datasets import MNIST
from torchvision import transforms
# PyTorch Lightning
try:
    import pytorch_lightning as pl
except ModuleNotFoundError: # Google Colab does not have PyTorch Lightning installed by default. Hence, we do it here if necessary
    !pip install --quiet "pytorch-lightning>=1.4"
    import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint

# Path to the folder where the datasets are/should be downloaded (e.g. CIFAR10)
DATASET_PATH = "../data"
# Path to the folder where the pretrained models are saved
CHECKPOINT_PATH = "../saved_models/tutorial8"

# Setting the seed
pl.seed_everything(42)

# Ensure that all operations are deterministic on GPU (if used) for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
print("Device:", device)

We also provide pretrained models, which can be downloaded below.

import urllib.request
from urllib.error import HTTPError
# Github URL where saved models are stored for this tutorial
base_url = "https://raw.githubusercontent.com/phlippe/saved_models/main/tutorial8/"
# Files to download
pretrained_files = ["MNIST.ckpt", "tensorboards/events.out.tfevents.MNIST"]

# Create checkpoint path if it doesn't exist yet
os.makedirs(CHECKPOINT_PATH, exist_ok=True)

# For each file, check whether it already exists. If not, try downloading it.
for file_name in pretrained_files:
    file_path = os.path.join(CHECKPOINT_PATH, file_name)
    if "/" in file_name:
        os.makedirs(file_path.rsplit("/",1)[0], exist_ok=True)
    if not os.path.isfile(file_path):
        file_url = base_url + file_name
        print(f"Downloading {file_url}...")
        try:
            urllib.request.urlretrieve(file_url, file_path)
        except HTTPError as e:
            print("Something went wrong. Please try to download the file from the GDrive folder, or contact the author with the full output including the following error:\n", e)

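Energy Models

In the first part of this tutorial, we review the theory of energy-based models (the same theory was discussed in Lecture 7). While most previous models had classification or regression as their goal, energy-based models are motivated from a different perspective: density estimation. Given a dataset with many elements, we want to estimate the probability distribution over the whole data space. For example, if we model images from CIFAR10, our goal is a probability distribution over all possible images of size 32×32×3 that assigns high likelihood to images that look realistic and belong to one of the 10 CIFAR classes. Simple methods such as interpolating between images do not work, because images are extremely high-dimensional (especially large HD images). Hence, we turn to deep learning methods, which have performed well on complex data.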

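However, how can a simple neural network predict a probability distribution $p(\mathbf{x})$ over so many dimensions? The problem is that we cannot just predict a score between 0 and 1, because a probability distribution over data needs to fulfill two properties:

The probability distribution must assign a non-negative value to every possible value of $\mathbf{x}$: $p(\mathbf{x}) \geq 0$.
The probability density must sum/integrate to 1 over all possible inputs: $\int_{\mathbf{x}} p(\mathbf{x}) d\mathbf{x} = 1$.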

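Luckily, there are actually many ways to achieve this, and one of them is energy-based models. The fundamental idea of energy-based models is that you can turn any function that predicts values larger than zero into a probability distribution by dividing by its volume. Imagine we have a neural network with a single output neuron, as in regression. We can call this network $E_{\theta}(\mathbf{x})$, where $\theta$ are the parameters of the network and $\mathbf{x}$ is the input data (e.g. an image). The output of $E_{\theta}$ is a scalar value between $-\infty$ and $\infty$. Now, we can use basic probability theory to normalize the scores over all possible inputs: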

$$q_{\theta}(\mathbf{x}) = \frac{\exp\left(-E_{\theta}(\mathbf{x})\right)}{Z_{\theta}} \hspace{5mm}\text{where}\hspace{5mm} Z_{\theta} = \begin{cases} \int_{\mathbf{x}}\exp\left(-E_{\theta}(\mathbf{x})\right) d\mathbf{x} & \text{if }\mathbf{x}\text{ is continuous}\\ \sum_{\mathbf{x}}\exp\left(-E_{\theta}(\mathbf{x})\right) & \text{if }\mathbf{x}\text{ is discrete} \end{cases}$$
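The $\exp$ function ensures that we assign a probability greater than zero to any possible input. We use a negative sign in front of $E$ because we call $E_{\theta}$ the energy function: data points with high likelihood have low energy, while data points with low likelihood have high energy. $Z_{\theta}$ is our normalization term, which ensures that the density integrates/sums to 1. We can show this by integrating over $q_{\theta}(\mathbf{x})$: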

$$\int_{\mathbf{x}}q_{\theta}(\mathbf{x})d\mathbf{x} = \int_{\mathbf{x}}\frac{\exp\left(-E_{\theta}(\mathbf{x})\right)}{\int_{\mathbf{\tilde{x}}}\exp\left(-E_{\theta}(\mathbf{\tilde{x}})\right) d\mathbf{\tilde{x}}}d\mathbf{x} = \frac{\int_{\mathbf{x}}\exp\left(-E_{\theta}(\mathbf{x})\right)d\mathbf{x}}{\int_{\mathbf{\tilde{x}}}\exp\left(-E_{\theta}(\mathbf{\tilde{x}})\right) d\mathbf{\tilde{x}}} = 1$$
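To make the normalization concrete, here is a minimal sketch for the discrete case. The four states and their energy values are made up for illustration, and plain Python is used instead of PyTorch so the arithmetic is easy to follow.

```python
import math

# Hypothetical energies for a toy discrete space with four states.
energies = {"a": 0.5, "b": 2.0, "c": -1.0, "d": 4.0}

def normalize(energies):
    """Turn energies into probabilities q(x) = exp(-E(x)) / Z."""
    unnormalized = {x: math.exp(-e) for x, e in energies.items()}
    z = sum(unnormalized.values())  # partition function Z (a sum, since x is discrete)
    return {x: u / z for x, u in unnormalized.items()}, z

probs, z = normalize(energies)
for x in energies:
    print(f"E({x}) = {energies[x]:5.2f}  ->  q({x}) = {probs[x]:.4f}")
print(f"Z = {z:.4f}, sum of q = {sum(probs.values()):.4f}")
```

The state with the lowest energy receives the highest probability, and the probabilities sum to 1. With only four states, $Z_{\theta}$ is a trivial sum; for images, the same sum would run over every possible pixel configuration, which is why it cannot be computed in practice.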
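Note that we call the probability distribution $q_{\theta}(\mathbf{x})$ because it is the distribution learned by the model, which is trained to be as close as possible to the true, unknown distribution $p(\mathbf{x})$.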

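The main benefit of this formulation of the probability distribution is its great flexibility: we can choose $E_{\theta}$ in any way we like, without constraints. However, looking at the equation above, we can see a fundamental issue: how do we calculate $Z_{\theta}$? For high-dimensional inputs and/or larger neural networks, there is no chance of computing $Z_{\theta}$ analytically, yet the task requires us to know $Z_{\theta}$. Although we cannot determine the exact likelihood of a point, there are methods with which we can still train energy-based models. Next, we therefore look at contrastive divergence for training the model.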

Contrastive Divergence

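When we train a generative model, it is usually done by maximum likelihood estimation. In other words, we try to maximize the likelihood of the examples in the training set. As the exact likelihood of a point cannot be determined due to the unknown normalization constant $Z_{\theta}$, we need to train energy-based models slightly differently. We cannot just maximize the unnormalized probability $\exp(-E_{\theta}(\mathbf{x}_{\text{train}}))$, because there is no guarantee that $Z_{\theta}$ stays constant, or that $\mathbf{x}_{\text{train}}$ becomes more likely than other points. However, if we base our training on comparing the likelihood of points, we can create a stable objective. Namely, we can rewrite our maximum likelihood objective so that we maximize the probability of $\mathbf{x}_{\text{train}}$ compared to a randomly sampled data point from our model: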

$$\begin{aligned} \nabla_{\theta}\mathcal{L}_{\text{MLE}}(\theta;p) & = -\mathbb{E}_{p(\mathbf{x})}\left[\nabla_{\theta}\log q_{\theta}(\mathbf{x})\right]\\[5pt] & = \mathbb{E}_{p(\mathbf{x})}\left[\nabla_{\theta}E_{\theta}(\mathbf{x})\right] - \mathbb{E}_{q_{\theta}(\mathbf{x})}\left[\nabla_{\theta}E_{\theta}(\mathbf{x})\right] \end{aligned}$$
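Note that the loss is still an objective that we want to minimize. Thus, we try to minimize the energy of data points from the dataset while maximizing the energy of randomly sampled data points from our model (how we sample is explained below). Although this objective sounds intuitive, how is it actually derived from our original distribution $q_{\theta}(\mathbf{x})$? The trick is that we approximate $Z_{\theta}$ by a single Monte Carlo sample. This gives exactly the objective written above.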
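For readers who want the intermediate steps, the gradient of the maximum likelihood objective can be worked out as follows. This is a sketch for the continuous case, assuming the gradient and the integral can be exchanged.

```latex
\begin{aligned}
\log q_{\theta}(\mathbf{x}) &= -E_{\theta}(\mathbf{x}) - \log Z_{\theta} \\[4pt]
\nabla_{\theta}\log Z_{\theta}
  &= \frac{1}{Z_{\theta}}\int_{\mathbf{x}} \nabla_{\theta}\exp\left(-E_{\theta}(\mathbf{x})\right) d\mathbf{x}
   = -\int_{\mathbf{x}} \frac{\exp\left(-E_{\theta}(\mathbf{x})\right)}{Z_{\theta}}\,\nabla_{\theta}E_{\theta}(\mathbf{x})\, d\mathbf{x}
   = -\mathbb{E}_{q_{\theta}(\mathbf{x})}\left[\nabla_{\theta}E_{\theta}(\mathbf{x})\right] \\[4pt]
-\mathbb{E}_{p(\mathbf{x})}\left[\nabla_{\theta}\log q_{\theta}(\mathbf{x})\right]
  &= \mathbb{E}_{p(\mathbf{x})}\left[\nabla_{\theta}E_{\theta}(\mathbf{x})\right] + \nabla_{\theta}\log Z_{\theta}
   = \mathbb{E}_{p(\mathbf{x})}\left[\nabla_{\theta}E_{\theta}(\mathbf{x})\right] - \mathbb{E}_{q_{\theta}(\mathbf{x})}\left[\nabla_{\theta}E_{\theta}(\mathbf{x})\right]
\end{aligned}
```

The intractable term $\nabla_{\theta}\log Z_{\theta}$ turns into an expectation under the model distribution, which in practice is estimated from model samples rather than computed exactly.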

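Visually, we can look at the objective as follows (figure credit: Stefano Ermon and Aditya Grover):
[Figure: illustration of the contrastive divergence objective]
In our case, $f_{\theta}$ represents $\exp(-E_{\theta}(\mathbf{x}))$. The point on the right, called the "correct answer", represents a data point from the dataset (i.e. $\mathbf{x}_{\text{train}}$), and the point on the left, the "wrong answer", is a sample from our model (i.e. $\mathbf{x}_{\text{sample}}$). Thus, we try to "pull up" the probability of data points in the dataset while "pushing down" randomly sampled points. The two forces of pulling and pushing are in balance exactly when $q_{\theta}(\mathbf{x}) = p(\mathbf{x})$.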

Sampling from Energy-Based Models

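To sample from an energy-based model, we can apply Markov Chain Monte Carlo using Langevin dynamics. The idea of the algorithm is to start from a random point and use gradients to move slowly in the direction of higher probability $\exp(-E_{\theta}(\mathbf{x}))$. However, this is not enough to fully capture the probability distribution. We need to add noise $\omega$ to the current sample at each gradient step. Under certain conditions, such as performing the gradient steps an infinite number of times, we would be able to create exact samples from the modeled distribution. As this is not possible in practice, we usually limit the chain to $K$ steps ($K$ being a hyperparameter that needs to be tuned). Overall, the sampling procedure can be summarized in the following algorithm:
[Figure: Langevin dynamics sampling algorithm]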
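The sampling procedure described above can be sketched on a one-dimensional toy problem. Everything in this snippet is an assumption made for illustration: the quadratic energy $E(x) = \frac{1}{2}(x - \mu)^2$ (so that $\exp(-E)$ is an unnormalized Gaussian), its hand-written gradient, and the textbook Langevin coupling between step size and noise scale. The image sampler later in this tutorial sets step size and noise independently instead.

```python
import math
import random

def energy_grad(x, mu=2.0):
    # Gradient of the toy energy E(x) = 0.5 * (x - mu)^2,
    # for which exp(-E) is an unnormalized Gaussian N(mu, 1).
    return x - mu

def langevin_chain(x0, steps=200, step_size=0.1, rng=random):
    """Run K = steps noisy gradient steps on the energy, starting from x0."""
    x = x0
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        # Move toward lower energy (higher probability), then add noise
        x = x - 0.5 * step_size * energy_grad(x) + math.sqrt(step_size) * noise
    return x

rng = random.Random(0)
# Start every chain from uniform noise in [-1, 1], as for the images later on
samples = [langevin_chain(rng.uniform(-1.0, 1.0), rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"sample mean: {mean:.2f}, sample std: {std:.2f}")
```

Although every chain starts far from the mode, the final samples have a mean close to $\mu = 2$ and a standard deviation close to 1. Without the noise term, all chains would collapse onto the single point of lowest energy instead of covering the distribution.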

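Applications of Energy-Based Models Beyond Generation

Modeling the probability distribution in order to sample new data is not the only application of energy-based models. Any application that requires us to compare two elements is much simpler to learn, because we only need to compare their energies. A couple of examples are shown below (figure credit: Stefano Ermon and Aditya Grover). A classification setup such as object recognition or sequence labeling can be considered an energy-based task, as we just need to find the input $Y$ that minimizes the output $E(X, Y)$ (and hence maximizes the probability). Similarly, a popular application of energy-based models is image denoising. Given an image $X$ with a lot of noise, we try to minimize the energy by finding the true input image $Y$.
[Figure: example applications of energy-based models]
Nonetheless, we will focus on generative modeling here.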

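Image Generation

As an example of energy-based models, we will train a model for image generation. Specifically, we will look at how to generate MNIST digits with a very simple CNN model. It should be noted, however, that energy models are not easy to train and often diverge if the hyperparameters are not well tuned. We will rely on the training tricks proposed in the paper "Implicit Generation and Generalization in Energy-Based Models" by Yilun Du and Igor Mordatch (blog). The important part of this notebook, however, is to see how the theory above can actually be used in a model.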

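Dataset

First, we load the MNIST dataset below. Note that we need to normalize the images to the range -1 to 1 rather than to mean 0 and std 1, because during sampling we have to limit the input space. Scaling between -1 and 1 makes this easier to implement.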

# Transformations applied on each image => make them a tensor and normalize between -1 and 1
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))
                               ])

# Loading the training dataset. We need to split it into a training and validation part
train_set = MNIST(root=DATASET_PATH, train=True, transform=transform, download=True)

# Loading the test set
test_set = MNIST(root=DATASET_PATH, train=False, transform=transform, download=True)

# We define a set of data loaders that we can use for various purposes later.
# Note that for actually training a model, we will use different data loaders
# with a lower batch size.
train_loader = data.DataLoader(train_set, batch_size=128, shuffle=True,  drop_last=True,  num_workers=4, pin_memory=True)
test_loader  = data.DataLoader(test_set,  batch_size=256, shuffle=False, drop_last=False, num_workers=4)
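As a quick check of the normalization used above: `ToTensor` scales pixel values to [0, 1], and `Normalize((0.5,), (0.5,))` then computes `(p - 0.5) / 0.5` for each pixel. The helper names below are illustrative only and are not part of torchvision.

```python
def normalize_pixel(p, mean=0.5, std=0.5):
    # Same arithmetic as transforms.Normalize((0.5,), (0.5,)) applied to one pixel
    return (p - mean) / std

def clamp(v, lo=-1.0, hi=1.0):
    # Keeps a value inside the bounded input space used during sampling
    return max(lo, min(hi, v))

for p in (0.0, 0.5, 1.0):
    print(f"pixel {p:.1f} -> {normalize_pixel(p):+.1f}")

# A white pixel pushed out of range by noise is simply clipped back
noisy = normalize_pixel(1.0) + 0.3
print(f"noisy value {noisy:.1f} -> clamped {clamp(noisy):.1f}")
```

Because the valid range is exactly [-1, 1], a single clamp call is enough to keep samples inside the input space after every noise or gradient step.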

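CNN Model

First, we implement our CNN model. MNIST images are of size 28×28, so we only need a small model. As an example, we apply several convolutions with stride 2 to downscale the images. If you are interested, you can also use a deeper model such as a small ResNet, but for simplicity we stick with the tiny network.

It is good practice in energy models to use a smooth activation function like Swish instead of ReLU. This is because we rely on the gradients we get with respect to the input image, which should not be sparse.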

class Swish(nn.Module):

    def forward(self, x):
        return x * torch.sigmoid(x)


class CNNModel(nn.Module):

    def __init__(self, hidden_features=32, out_dim=1, **kwargs):
        super().__init__()
        # We increase the hidden dimension over layers. Here pre-calculated for simplicity.
        c_hid1 = hidden_features//2
        c_hid2 = hidden_features
        c_hid3 = hidden_features*2

        # Series of convolutions and Swish activation functions
        self.cnn_layers = nn.Sequential(
                nn.Conv2d(1, c_hid1, kernel_size=5, stride=2, padding=4), # [16x16] - Larger padding to get 32x32 image
                Swish(),
                nn.Conv2d(c_hid1, c_hid2, kernel_size=3, stride=2, padding=1), #  [8x8]
                Swish(),
                nn.Conv2d(c_hid2, c_hid3, kernel_size=3, stride=2, padding=1), # [4x4]
                Swish(),
                nn.Conv2d(c_hid3, c_hid3, kernel_size=3, stride=2, padding=1), # [2x2]
                Swish(),
                nn.Flatten(),
                nn.Linear(c_hid3*4, c_hid3),
                Swish(),
                nn.Linear(c_hid3, out_dim)
        )

    def forward(self, x):
        x = self.cnn_layers(x).squeeze(dim=-1)
        return x
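The spatial sizes noted in the comments of `CNNModel` can be verified with the standard output-size formula for a convolution without dilation, floor((n + 2p - k) / s) + 1. The helper `conv_out` below is a hypothetical name used only for this check.

```python
def conv_out(size, kernel, stride, padding):
    # Output size of a convolution with dilation 1: floor((n + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

# (kernel_size, stride, padding) of the four Conv2d layers in CNNModel
layers = [(5, 2, 4), (3, 2, 1), (3, 2, 1), (3, 2, 1)]

size = 28  # MNIST images are 28x28
sizes = []
for kernel, stride, padding in layers:
    size = conv_out(size, kernel, stride, padding)
    sizes.append(size)
print(sizes)  # [16, 8, 4, 2]

hidden_features = 32
c_hid3 = hidden_features * 2
flat_features = c_hid3 * size * size
print(flat_features)  # 256
```

The final 2×2 feature map explains the `c_hid3*4` input size of the first linear layer: 64 channels times 4 spatial positions.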

In the rest of the notebook, the output of the model actually represents not $E_{\theta}(\mathbf{x})$ but $-E_{\theta}(\mathbf{x})$. This is a standard implementation practice for energy-based models, as some people also write the energy probability density as $q_{\theta}(\mathbf{x}) = \frac{\exp\left(f_{\theta}(\mathbf{x})\right)}{Z_{\theta}}$. In that case, the model actually represents $f_{\theta}(\mathbf{x})$. In the training loss and elsewhere, we need to be careful not to switch up the signs.

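Sampling Buffer

In the next part, we look at training with sampled elements. To use the contrastive divergence objective, we need to generate samples during training. Previous work has shown that, due to the high dimensionality of images, we need many iterations inside the MCMC sampling to obtain reasonable samples. However, there is a training trick that significantly reduces the sampling cost: using a sampling buffer. The idea is to store the samples of the last couple of batches in a buffer and reuse them as the starting point of the MCMC algorithm for the next batches. This reduces the sampling cost because the model requires significantly fewer steps to converge to reasonable samples. However, to avoid relying only on previous samples and to allow novel samples as well, we re-initialize 5% of our samples from scratch (random noise between -1 and 1).

Below, we implement the sampling buffer. The function sample_new_exmps returns a new batch of "fake" images. We call them fake images because they have been generated and are not actually part of the dataset. As mentioned before, we initialize 5% of the batch (on average) at random and pick the other 95% at random from the buffer. On this initial batch, we perform MCMC for 60 iterations to improve the image quality and come closer to samples from $q_{\theta}(\mathbf{x})$. In the function generate_samples, we implement the MCMC for images. Note that the hyperparameters step_size, steps, and the noise standard deviation $\sigma$ are specifically set for MNIST and need to be tuned again if you want to use a different dataset.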

class Sampler:

    def __init__(self, model, img_shape, sample_size, max_len=8192):
        """
        Inputs:
            model - Neural network to use for modeling E_theta
            img_shape - Shape of the images to model
            sample_size - Batch size of the samples
            max_len - Maximum number of data points to keep in the buffer
        """
        super().__init__()
        self.model = model
        self.img_shape = img_shape
        self.sample_size = sample_size
        self.max_len = max_len
        self.examples = [(torch.rand((1,)+img_shape)*2-1) for _ in range(self.sample_size)]

    def sample_new_exmps(self, steps=60, step_size=10):
        """
        Function for getting a new batch of "fake" images.
        Inputs:
            steps - Number of iterations in the MCMC algorithm
            step_size - Learning rate nu in the algorithm above
        """
        # Choose 95% of the batch from the buffer, 5% generate from scratch
        n_new = np.random.binomial(self.sample_size, 0.05)
        rand_imgs = torch.rand((n_new,) + self.img_shape) * 2 - 1
        old_imgs = torch.cat(random.choices(self.examples, k=self.sample_size-n_new), dim=0)
        inp_imgs = torch.cat([rand_imgs, old_imgs], dim=0).detach().to(device)

        # Perform MCMC sampling
        inp_imgs = Sampler.generate_samples(self.model, inp_imgs, steps=steps, step_size=step_size)

        # Add new images to the buffer and remove old ones if needed
        self.examples = list(inp_imgs.to(torch.device("cpu")).chunk(self.sample_size, dim=0)) + self.examples
        self.examples = self.examples[:self.max_len]
        return inp_imgs

    @staticmethod
    def generate_samples(model, inp_imgs, steps=60, step_size=10, return_img_per_step=False):
        """
        Function for sampling images for a given model.
        Inputs:
            model - Neural network to use for modeling E_theta
            inp_imgs - Images to start from for sampling. If you want to generate new images, enter noise between -1 and 1.
            steps - Number of iterations in the MCMC algorithm.
            step_size - Learning rate nu in the algorithm above
            return_img_per_step - If True, we return the sample at every iteration of the MCMC
        """
        # Before MCMC: set model parameters to "requires_grad=False"
        # because we are only interested in the gradients of the input.
        is_training = model.training
        model.eval()
        for p in model.parameters():
            p.requires_grad = False
        inp_imgs.requires_grad = True

        # Enable gradient calculation if not already the case
        had_gradients_enabled = torch.is_grad_enabled()
        torch.set_grad_enabled(True)

        # We use a buffer tensor in which we generate noise each loop iteration.
        # More efficient than creating a new tensor every iteration.
        noise = torch.randn(inp_imgs.shape, device=inp_imgs.device)

        # List for storing generations at each step (for later analysis)
        imgs_per_step = []

        # Loop over K (steps)
        for _ in range(steps):
            # Part 1: Add noise to the input.
            noise.normal_(0, 0.005)
            inp_imgs.data.add_(noise.data)
            inp_imgs.data.clamp_(min=-1.0, max=1.0)

            # Part 2: calculate gradients for the current input.
            out_imgs = -model(inp_imgs)
            out_imgs.sum().backward()
            inp_imgs.grad.data.clamp_(-0.03, 0.03) # For stabilizing and preventing too high gradients

            # Apply gradients to our current samples
            inp_imgs.data.add_(-step_size * inp_imgs.grad.data)
            inp_imgs.grad.detach_()
            inp_imgs.grad.zero_()
            inp_imgs.data.clamp_(min=-1.0, max=1.0)

            if return_img_per_step:
                imgs_per_step.append(inp_imgs.clone().detach())

        # Reactivate gradients for parameters for training
        for p in model.parameters():
            p.requires_grad = True
        model.train(is_training)

        # Reset gradient calculation to setting before this function
        torch.set_grad_enabled(had_gradients_enabled)

        if return_img_per_step:
            return torch.stack(imgs_per_step, dim=0)
        else:
            return inp_imgs
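Read off from the loop in `generate_samples`, a single MCMC iteration as implemented here can be written as follows. The notation ($\hat{\mathbf{x}}$, $\mathbf{g}$, $\mathrm{clip}$) is introduced only for this summary, and the constants are the defaults hard-coded above.

```latex
\begin{aligned}
\omega &\sim \mathcal{N}(0, \sigma^2 I), \qquad \sigma = 0.005 \\
\hat{\mathbf{x}} &= \mathrm{clip}\left(\mathbf{x}^{k} + \omega,\; -1,\; 1\right) \\
\mathbf{g} &= \mathrm{clip}\left(\nabla_{\hat{\mathbf{x}}} E_{\theta}(\hat{\mathbf{x}}),\; -0.03,\; 0.03\right), \qquad E_{\theta} = -f_{\theta} \\
\mathbf{x}^{k+1} &= \mathrm{clip}\left(\hat{\mathbf{x}} - \eta\, \mathbf{g},\; -1,\; 1\right), \qquad \eta = 10
\end{aligned}
```

With these values, the gradient step can move each pixel by at most $\eta \cdot 0.03 = 0.3$ per iteration, which is what keeps the large step size from destabilizing the samples.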

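The idea of the buffer becomes clearer in the algorithm below.

Training Algorithm

With the sampling buffer ready, we can complete our training algorithm. Below is a summary of the full training algorithm of an energy model for image modeling:
[Figure: training algorithm]
The first few statements in each training iteration concern the sampling of real and fake data, as we have seen above with the sampling buffer. Next, we calculate the contrastive divergence objective using our energy model $E_{\theta}$. However, one additional training trick we need is a regularization loss on the output of $E_{\theta}$. As the output of the network is not constrained, and adding a large bias to the output does not change the contrastive divergence loss, we need to ensure in some other way that the output values stay in a reasonable range. Without the regularization loss, the output values would fluctuate over a very large range. With it, we ensure that the values for real data are around 0 and those for fake data are likely slightly lower (for noise or outliers, the score can still be significantly lower). As the regularization loss is less important than the contrastive divergence, we use a weight factor $\alpha$ that is usually quite a bit smaller than 1. Finally, we perform an update step with an optimizer on the combined loss and add the new samples to the buffer.

Below, we put this training dynamic into a PyTorch Lightning module. Remember that, since we model $f_{\theta}(x) = -E_{\theta}(x)$, we need to be careful to switch all important signs, e.g. in the loss function.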

class DeepEnergyModel(pl.LightningModule):

    def __init__(self, img_shape, batch_size, alpha=0.1, lr=1e-4, beta1=0.0, **CNN_args):
        super().__init__()
        self.save_hyperparameters()

        self.cnn = CNNModel(**CNN_args)
        self.sampler = Sampler(self.cnn, img_shape=img_shape, sample_size=batch_size)
        self.example_input_array = torch.zeros(1, *img_shape)

    def forward(self, x):
        z = self.cnn(x)
        return z

    def configure_optimizers(self):
        # Energy models can have issues with momentum as the loss surfaces changes with its parameters.
        # Hence, we set it to 0 by default.
        optimizer = optim.Adam(self.parameters(), lr=self.hparams.lr, betas=(self.hparams.beta1, 0.999))
        scheduler = optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.97) # Exponential decay over epochs
        return [optimizer], [scheduler]

    def training_step(self, batch, batch_idx):
        # We add minimal noise to the original images to prevent the model from focusing on purely "clean" inputs
        real_imgs, _ = batch
        small_noise = torch.randn_like(real_imgs) * 0.005
        real_imgs.add_(small_noise).clamp_(min=-1.0, max=1.0)

        # Obtain samples
        fake_imgs = self.sampler.sample_new_exmps(steps=60, step_size=10)

        # Predict energy score for all images
        inp_imgs = torch.cat([real_imgs, fake_imgs], dim=0)
        real_out, fake_out = self.cnn(inp_imgs).chunk(2, dim=0)

        # Calculate losses
        reg_loss = self.hparams.alpha * (real_out ** 2 + fake_out ** 2).mean()
        cdiv_loss = fake_out.mean() - real_out.mean()
        loss = reg_loss + cdiv_loss

        # Logging
        self.log('loss', loss)
        self.log('loss_regularization', reg_loss)
        self.log('loss_contrastive_divergence', cdiv_loss)
        self.log('metrics_avg_real', real_out.mean())
        self.log('metrics_avg_fake', fake_out.mean())
        return loss

    def validation_step(self, batch, batch_idx):
        # For validating, we calculate the contrastive divergence between purely random images and unseen examples
        # Note that the validation/test step of energy-based models depends on what we are interested in the model
        real_imgs, _ = batch
        fake_imgs = torch.rand_like(real_imgs) * 2 - 1

        inp_imgs = torch.cat([real_imgs, fake_imgs], dim=0)
        real_out, fake_out = self.cnn(inp_imgs).chunk(2, dim=0)

        cdiv = fake_out.mean() - real_out.mean()
        self.log('val_contrastive_divergence', cdiv)
        self.log('val_fake_out', fake_out.mean())
        self.log('val_real_out', real_out.mean())
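The role of the regularization term in `training_step` can be checked with a small numeric sketch. It mirrors the loss arithmetic on plain Python lists; the output values and the shift of 100 are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def ebm_losses(real_out, fake_out, alpha=0.1):
    """Same arithmetic as training_step; the outputs are f(x) = -E(x)."""
    cdiv = mean(fake_out) - mean(real_out)
    reg = alpha * mean([r ** 2 + f ** 2 for r, f in zip(real_out, fake_out)])
    return cdiv, reg

# Hypothetical model outputs for three real and three fake images
real = [0.1, -0.2, 0.05]
fake = [-1.0, -0.5, -1.5]

cdiv, reg = ebm_losses(real, fake)

# Add the same large bias to every output
shift = 100.0
cdiv_s, reg_s = ebm_losses([r + shift for r in real], [f + shift for f in fake])

print(f"original: cdiv = {cdiv:.4f}, reg = {reg:.4f}")
print(f"shifted:  cdiv = {cdiv_s:.4f}, reg = {reg_s:.4f}")
```

The contrastive divergence term is identical in both cases because the bias cancels in the difference of means, while the regularization term grows sharply. This is why the regularizer is needed to pin the outputs near 0.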

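We do not implement a test step because energy-based generative models are usually not evaluated on a test set. The validation step, however, is used to get an idea of the difference in energy/likelihood between random images and unseen examples of the dataset. Alternative test steps would be to generate new images and evaluate how realistic they are using FID or Inception score, or to try denoising images.

Callbacks

To track the performance of our model during training, we make extensive use of PyTorch Lightning's callback framework. Remember that callbacks can be used to run small functions at any point of training, for instance after finishing an epoch. Here, we use three different callbacks that we define ourselves.

The first callback, GenerateCallback, is used to add image generations to the model during training. After every $N$ epochs (usually $N = 5$ to reduce the output to TensorBoard), we take a small batch of random images and perform many MCMC iterations until the model's generation converges. Compared to training, which uses 60 iterations, we use 256 here because (1) we only have to do it once, whereas in training it must be done in every iteration, and (2) we do not start from a buffer here but from scratch. It is implemented as follows: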

class GenerateCallback(pl.Callback):

    def __init__(self, batch_size=8, vis_steps=8, num_steps=256, every_n_epochs=5):
        super().__init__()
        self.batch_size = batch_size         # Number of images to generate
        self.vis_steps = vis_steps           # Number of steps within generation to visualize
        self.num_steps = num_steps           # Number of steps to take during generation
        self.every_n_epochs = every_n_epochs # Only save those images every N epochs (otherwise tensorboard gets quite large)

    def on_train_epoch_end(self, trainer, pl_module):
        # Skip for all other epochs
        if trainer.current_epoch % self.every_n_epochs == 0:
            # Generate images
            imgs_per_step = self.generate_imgs(pl_module)
            # Plot and add to tensorboard
            for i in range(imgs_per_step.shape[1]):
                step_size = self.num_steps // self.vis_steps
                imgs_to_plot = imgs_per_step[step_size-1::step_size,i]
                grid = torchvision.utils.make_grid(imgs_to_plot, nrow=imgs_to_plot.shape[0], normalize=True, value_range=(-1,1))
                trainer.logger.experiment.add_image(f"generation_{i}", grid, global_step=trainer.current_epoch)

    def generate_imgs(self, pl_module):
        pl_module.eval()
        start_imgs = torch.rand((self.batch_size,) + pl_module.hparams["img_shape"]).to(pl_module.device)
        start_imgs = start_imgs * 2 - 1
        torch.set_grad_enabled(True)  # Tracking gradients for sampling necessary
        imgs_per_step = Sampler.generate_samples(pl_module.cnn, start_imgs, steps=self.num_steps, step_size=10, return_img_per_step=True)
        torch.set_grad_enabled(False)
        pl_module.train()
        return imgs_per_step
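To see which MCMC iterations the callback actually visualizes, the slice `imgs_per_step[step_size-1::step_size]` can be reproduced on plain indices. Note that `step_size` here is the visualization stride, not the MCMC step size.

```python
num_steps, vis_steps = 256, 8  # defaults of GenerateCallback
step_size = num_steps // vis_steps  # stride between visualized iterations

# Same slicing as in the callback, applied to 0-based iteration indices
picked = list(range(num_steps))[step_size - 1::step_size]
print(picked)  # [31, 63, 95, 127, 159, 191, 223, 255]
```

In other words, every 32nd sample is kept, ending with the final iteration, so each generated image is logged as a row of 8 snapshots.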

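The second callback, SamplerCallback, simply adds a randomly picked subset of images from the sampling buffer to TensorBoard. This helps to understand which images are currently shown to the model as "fake".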

class SamplerCallback(pl.Callback):

    def __init__(self, num_imgs=32, every_n_epochs=5):
        super().__init__()
        self.num_imgs = num_imgs             # Number of images to plot
        self.every_n_epochs = every_n_epochs # Only save those images every N epochs (otherwise tensorboard gets quite large)

    def on_train_epoch_end(self, trainer, pl_module):
        if trainer.current_epoch % self.every_n_epochs == 0:
            exmp_imgs = torch.cat(random.choices(pl_module.sampler.examples, k=self.num_imgs), dim=0)
            grid = torchvision.utils.make_grid(exmp_imgs, nrow=4, normalize=True, value_range=(-1,1))
            trainer.logger.experiment.add_image("sampler", grid, global_step=trainer.current_epoch)

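Our last callback is OutlierCallback. This callback evaluates the model by recording the (negative) energy assigned to random noise. While our training loss is almost constant across iterations, this score is likely to show the progress of the model in detecting "outliers".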

class OutlierCallback(pl.Callback):

    def __init__(self, batch_size=1024):
        super().__init__()
        self.batch_size = batch_size

    def on_train_epoch_end(self, trainer, pl_module):
        with torch.no_grad():
            pl_module.eval()
            rand_imgs = torch.rand((self.batch_size,) + pl_module.hparams["img_shape"]).to(pl_module.device)
            rand_imgs = rand_imgs * 2 - 1.0
            rand_out = pl_module.cnn(rand_imgs).mean()
            pl_module.train()

        trainer.logger.experiment.add_scalar("rand_out", rand_out, global_step=trainer.current_epoch)

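Running the Model

Finally, we can put everything together to create our final training function. The function is very similar to the other PyTorch Lightning training functions we have seen so far. One small difference is that we do not test the model on a test set, because we analyze the model afterward by checking its predictions and its ability to perform outlier detection.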

def train_model(**kwargs):
    # Create a PyTorch Lightning trainer with the generation callback
    trainer = pl.Trainer(default_root_dir=os.path.join(CHECKPOINT_PATH, "MNIST"),
                         accelerator="gpu" if str(device).startswith("cuda") else "cpu",
                         devices=1,
                         max_epochs=60,
                         gradient_clip_val=0.1,
                         callbacks=[ModelCheckpoint(save_weights_only=True, mode="min", monitor='val_contrastive_divergence'),
                                    GenerateCallback(every_n_epochs=5),
                                    SamplerCallback(every_n_epochs=5),
                                    OutlierCallback(),
                                    LearningRateMonitor("epoch")
                                   ])
    # Check whether pretrained model exists. If yes, load it and skip training
    pretrained_filename = os.path.join(CHECKPOINT_PATH, "MNIST.ckpt")
    if os.path.isfile(pretrained_filename):
        print("Found pretrained model, loading...")
        model = DeepEnergyModel.load_from_checkpoint(pretrained_filename)
    else:
        pl.seed_everything(42)
        model = DeepEnergyModel(**kwargs)
        trainer.fit(model, train_loader, test_loader)
        model = DeepEnergyModel.load_from_checkpoint(trainer.checkpoint_callback.best_model_path)
    # No testing as we are more interested in other properties
    return model

model = train_model(img_shape=(1,28,28),
                    batch_size=train_loader.batch_size,
                    lr=1e-4,
                    beta1=0.0)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Found pretrained model, loading...

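Analysis

In the last part of the notebook, we take the trained energy-based generative model and analyze its properties.

TensorBoard

The first thing we can look at is the TensorBoard generated during training. It helps us understand the training dynamics better and reveals potential issues. Let's load the TensorBoard below: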

# Import tensorboard
%load_ext tensorboard

# Opens tensorboard in notebook. Adjust the path to your CHECKPOINT_PATH!
%tensorboard --logdir ../saved_models/tutorial8/tensorboards/

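[Figure: TensorBoard training curves]
We see that the contrastive divergence as well as the regularization converge quickly to 0. However, training continues even though the loss is always close to zero. This is because our "training" data changes with the model through sampling. The progress of training is best measured by looking at the samples across iterations and at the score for random images, which decreases steadily over time.

Image Generation

Another way of evaluating generative models is to sample a few generated images. Generative models need to be good at generating realistic images, as this truly shows that they have modeled the true data distribution. Thus, let's sample a few images from the model below: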

model.to(device)
pl.seed_everything(43)
callback = GenerateCallback(batch_size=4, vis_steps=8, num_steps=256)
imgs_per_step = callback.generate_imgs(model)
imgs_per_step = imgs_per_step.cpu()

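A characteristic of sampling with energy-based models is that it requires the iterative MCMC algorithm. To gain insight into how the images change over iterations, we also plot a few intermediate samples from the MCMC: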

for i in range(imgs_per_step.shape[1]):
    step_size = callback.num_steps // callback.vis_steps
    imgs_to_plot = imgs_per_step[step_size-1::step_size,i]
    imgs_to_plot = torch.cat([imgs_per_step[0:1,i],imgs_to_plot], dim=0)
    grid = torchvision.utils.make_grid(imgs_to_plot, nrow=imgs_to_plot.shape[0], normalize=True, value_range=(-1,1), pad_value=0.5, padding=2)
    grid = grid.permute(1, 2, 0)
    plt.figure(figsize=(8,8))
    plt.imshow(grid)
    plt.xlabel("Generation iteration")
    plt.xticks([(imgs_per_step.shape[-1]+2)*(0.5+j) for j in range(callback.vis_steps+1)],
               labels=[1] + list(range(step_size,imgs_per_step.shape[0]+1,step_size)))
    plt.yticks([])
    plt.show()

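[Figure: intermediate MCMC samples over the generation iterations]
We see that, although the first step starts from noise, the sampling algorithm obtains reasonable shapes after only 32 steps. Over the next 200 steps, the shapes become clearer and turn into realistic digits. The specific samples can differ when you run the code on Colab, so the following description is specific to the plots shown on the website. The first row shows an 8, where unnecessary white parts are removed over the iterations. The transformation across iterations is best seen in the second sample, which creates a digit 2. After 32 iterations the sample only vaguely resembles a digit, but it is transformed more and more into a typical image of the digit 2.

Out-of-Distribution Detection

A very common and powerful application of energy-based models is out-of-distribution detection (sometimes referred to as "anomaly" detection). As more and more deep learning models are deployed in production and applications, a crucial aspect of these models is to know what they do not know. Deep learning models are usually overconfident, meaning that they sometimes classify even random images with 100% probability. Clearly, this is not something we want to see in applications. Energy-based models can help with this problem because they are trained to detect images that do not fit the distribution of the training dataset. Thus, in those applications, you could train an energy-based model alongside the classifier and only output predictions if the energy-based model assigns the image an (unnormalized) probability higher than a chosen threshold. As proposed in this paper, you can actually combine the classifier and the energy-based objective in a single model.

In this part of the analysis, we want to test the out-of-distribution capability of our energy-based model. Remember that a lower output of the model denotes a lower probability. Thus, if we feed random noise to the model, we hope to see low scores: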

with torch.no_grad():
    rand_imgs = torch.rand((128,) + model.hparams.img_shape).to(model.device)
    rand_imgs = rand_imgs * 2 - 1.0
    rand_out = model.cnn(rand_imgs).mean()
    print(f"Average score for random images: {rand_out.item():4.2f}")

Average score for random images: -17.88

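As we hoped, the model assigns very low probability to these noisy images. As another reference, let's look at the predictions for a batch of images from the training set: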

with torch.no_grad():
    train_imgs,_ = next(iter(train_loader))
    train_imgs = train_imgs.to(model.device)
    train_out = model.cnn(train_imgs).mean()
    print(f"Average score for training images: {train_out.item():4.2f}")

Average score for training images: -0.00
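The gating idea described earlier, where a classifier's prediction is only released when the energy-based model scores the input above a threshold, can be sketched as follows. The function name, the threshold of -5.0, the third score, and the class predictions are all hypothetical; only the first two scores are the averages printed above.

```python
def filter_predictions(scores, predictions, threshold):
    """Keep a prediction only if its EBM score f(x) = -E(x) exceeds the threshold."""
    return [pred if score > threshold else None
            for score, pred in zip(scores, predictions)]

scores = [-0.00, -17.88, -0.4]  # training-like image, random noise, hypothetical third input
predictions = [7, 3, 1]         # hypothetical classifier outputs
threshold = -5.0                # hypothetical; would be tuned on validation data

print(filter_predictions(scores, predictions, threshold))  # [7, None, 1]
```

In a real system the threshold would have to be chosen on held-out data, trading off rejected in-distribution inputs against accepted outliers.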

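The scores are close to 0 because of the regularization objective that was added during training. So clearly, the model can distinguish between noise and real digits. However, what happens if we change the training images slightly and see which ones get a very low score?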

@torch.no_grad()
def compare_images(img1, img2):
    imgs = torch.stack([img1, img2], dim=0).to(model.device)
    score1, score2 = model.cnn(imgs).cpu().chunk(2, dim=0)
    grid = torchvision.utils.make_grid([img1.cpu(), img2.cpu()], nrow=2, normalize=True, value_range=(-1,1), pad_value=0.5, padding=2)
    grid = grid.permute(1, 2, 0)
    plt.figure(figsize=(4,4))
    plt.imshow(grid)
    plt.xticks([(img1.shape[2]+2)*(0.5+j) for j in range(2)],
               labels=["Original image", "Transformed image"])
    plt.yticks([])
    plt.show()
    print(f"Score original image: {score1.item():4.2f}")
    print(f"Score transformed image: {score2.item():4.2f}")

We use a random test image for this. Feel free to change it to experiment with the model yourself.

test_imgs, _ = next(iter(test_loader))
exmp_img = test_imgs[0].to(model.device)

The first transformation is to add some random noise to the image:

img_noisy = exmp_img + torch.randn_like(exmp_img) * 0.3
img_noisy.clamp_(min=-1.0, max=1.0)
compare_images(exmp_img, img_noisy)

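[Figure: original image vs. noisy image]
We can see that the score drops considerably. Hence, the model can detect random Gaussian noise on the image. This is also to be expected, since the "fake" samples are initially pure noise images.

Next, we flip the image and check how this influences the score: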

img_flipped = exmp_img.flip(dims=(1,2))
compare_images(exmp_img, img_flipped)

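[Figure: original image vs. flipped image]
If the digit can only be read one way around, as for example the 7, we can see that the score drops. However, the score only drops slightly. This is likely because of the small size of our model. Keep in mind that generative modeling is a much harder task than classification, as we not only need to distinguish between classes but also learn all details/characteristics of the digits. With a deeper model, this could eventually be captured better (but at the cost of greater training instability).

Finally, we check what happens if we significantly reduce the size of the digit: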

img_tiny = torch.zeros_like(exmp_img)-1
img_tiny[:,exmp_img.shape[1]//2:,exmp_img.shape[2]//2:] = exmp_img[:,::2,::2]
compare_images(exmp_img, img_tiny)

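[Figure: original image vs. downscaled image]
The score drops again, but not by a large margin, even though digits in the MNIST dataset are usually much larger.

Overall, we can conclude that our model is good at detecting Gaussian noise and smaller transformations of existing digits. Nonetheless, to obtain a very good out-of-distribution model, we would need to train deeper models for more iterations.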

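Instability

Finally, we should discuss the possible instabilities of energy-based models, in particular for the image generation example that we have implemented in this notebook. During the hyperparameter search for this notebook, several models diverged. Divergence in energy-based models means that the model assigns a high probability to examples of the training set, which is a good thing. At the same time, however, the sampling algorithm fails and only generates noise images that obtain minimal probability scores. This happens because the model has created many local maxima into which the generated noise images fall. The energy surface over which we calculate the gradients to reach data points with high probability has "diverged" and is no longer useful for our MCMC sampling.

Besides finding the optimal hyperparameters, a common trick in energy-based models is to reload stable checkpoints. If we detect that the model is diverging, we stop the training and load the model from one epoch earlier, where it had not yet diverged. Afterward, we continue training and hope that, with a different seed, the model does not diverge again. Nevertheless, this should be considered the "last hope" for stabilizing the models, and careful hyperparameter tuning is the better approach. Sensitive hyperparameters include step_size, steps, and the noise standard deviation in the sampler, as well as the learning rate and the feature dimensionality in the CNN model.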

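Conclusion

In this tutorial, we have discussed energy-based models for generative modeling. The concept relies on the idea that any strictly positive function can be turned into a probability distribution by normalizing over the whole input space. As this normalization is not feasible to compute for high-dimensional data like images, we train the model using contrastive divergence and sampling via MCMC. While the idea allows us to turn any neural network into an energy-based model, we have seen that multiple training tricks are needed to stabilize the training. Furthermore, the training time of these models is relatively long because, in every training iteration, we need to sample new "fake" images, even with a sampling buffer. In the next lectures and assignment, we will see different generative models (e.g. VAE, GAN, NF) that allow us to do generative modeling more stably, but at the cost of more parameters.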