[Machine Learning] Bayesian Spam Email Detection


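This assignment is built around spam classification: extract text features from the emails and identify spam with the naive Bayes algorithm, either by calling an existing toolkit or by implementing it yourself.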

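Email is one of the core services of the internet and is used widely in study, work, and daily life, yet inboxes are routinely flooded with all kinds of spam. Some statistics put the amount of spam generated on the internet each day at tens of billions of messages, approaching a hundred billion. Spam filtering is therefore an essential feature for email providers. Naive Bayes has long performed very well on this task, and many systems still use it as their baseline spam classifier.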
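To make the decision rule concrete, here is a minimal from-scratch sketch of multinomial naive Bayes with Laplace smoothing. The four toy documents, the function names `train_nb` and `predict_nb`, and the smoothing constant `alpha=1.0` are illustrative assumptions; they are not part of the assignment's data or code.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    classes = sorted(set(labels))
    vocab = sorted({w for doc in docs for w in doc})
    log_prior, log_like = {}, {}
    for c in classes:
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for doc in class_docs for w in doc)
        # Laplace smoothing: add alpha to every word count
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Return the class with the highest log score; words outside the vocabulary are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
    return max(scores, key=scores.get)

# Hypothetical, already word-segmented toy emails (1 = spam, 0 = ham)
docs = [["发票", "代开", "公司"], ["公司", "优惠", "发票"],
        ["觉得", "时候", "知道"], ["自己", "觉得", "没有"]]
labels = [1, 1, 0, 0]

log_prior, log_like = train_nb(docs, labels)
pred_spam = predict_nb(["代开", "发票"], log_prior, log_like)
pred_ham = predict_nb(["自己", "知道"], log_prior, log_like)
print(pred_spam, pred_ham)  # prints: 1 0
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is also why scikit-learn exposes `feature_log_prob_` rather than raw probabilities.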

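The dataset for this experiment is the Trec06 Chinese spam corpus. After extraction, the directory contains three folders: data holds all the raw emails (not word-segmented), data_cut holds the word-segmented emails, and label holds the labels. Each email consists of a header and a body, usually separated by a blank line. Each line of the label file gives a label and the path of the corresponding email; spam marks a spam email and ham a legitimate one.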

import random # random number utilities
import numpy as np # numerical computing
import pandas as pd # data analysis
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm # progress bars
from sklearn.model_selection import train_test_split # dataset splitting
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # text feature extractors
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, ComplementNB # three naive Bayes variants; they differ in how p(x|y) is estimated
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report # evaluation metrics

RANDOM_SEED = 2023

data_path = './data/trec06c-utf8/data/' # raw emails
data_cut_path = './data/trec06c-utf8/data_cut/' # word-segmented emails
index_path = './data/trec06c-utf8/label/index' # label file

Read each email and split it at the first blank line into a header and a body.

def read_file(path): # read one email file and return its header and body
    with open(path, 'r', encoding='utf-8') as f:
        file = f.read()
    # split at the first blank line; the body is empty if the file has no blank line
    head, _, text = file.partition('\n\n')
    return head, text

test_head, test_text = read_file(data_path + '000/000')
print(f'HEAD:\n{test_head}')
print(f'\nTEXT:\n{test_text}')

Following the index file, read the word-segmented email files.

label_list, head_list, text_list = [], [], []
with open(index_path, 'r') as index_file: # open the label file
    lines = [line.strip() for line in index_file if line.strip() != ''] # keep the non-empty lines, without the trailing newline
    for line in tqdm(lines):
        label, path = line.split() # split into label and file path
        label = 1 if label == 'spam' else 0 # encode labels as 0/1, with spam = 1
        path = data_cut_path + path.replace('../data/','') # point the path at the word-segmented copy
        head, text = read_file(path) # read the header and the body

        label_list.append(label)
        head_list.append(head)
        text_list.append(text)

██████████████████████████████████████████████████████████████████████████| 64620/64620 [00:20<00:00, 3108.41it/s]

Store the data in a DataFrame and display it.

df = pd.DataFrame({'labels': label_list, 'heads': head_list, 'texts': text_list})
df

	labels	heads	texts
0	1	Received: from hp-5e1fe6310264 ([218.79.188.13…	[ 课程背景 ]\n\n　\n每一位管理和技术人员都清楚地 …
1	0	Received: from jdl.ac.cn ([159.226.42.8])\n\tb…	讲的是孔子后人的故事。一个老领导回到家乡，跟儿子感情不和…
2	1	Received: from 163.con ([61.141.165.252])\n\tb…	尊敬的贵公司 ( 财务 / 经理 ) 负责人您好！\n我是深圳金海实业有…
3	1	Received: from 12.com ([222.50.6.150])\n\tby s…	贵公司负责人 ( 经理 / 财务）您好：\n深圳市华龙公司受多家公司委…
4	1	Received: from dghhkjk.com ([59.36.183.208])\n…	这是一封 HTML 格式信件！\n\n- – – – – – – – – – – – …
…	…	…	…
64615	1	Received: from 163.com ([218.18.139.38])\n\tby…	贵公司负责人 ( 经理 / 财务 ) 您好：\n我公司是深圳市华源实业有限…
64616	1	Received: from 12.com ([222.50.12.121])\n\tby …	尊敬的商家朋友您好：\n我是深圳市裕华实业有限公司的。我司实力雄…
64617	1	Received: from 163.com ([219.133.253.212])\n\t…	贵公司负责人 ( 经理 / 财务）您好 !\n我是深圳市康特实业有限公司 …
64618	1	Received: from tencent-0ba99d8 ([210.22.28.223…	\n这是一个 HTML 格式的邮件\nFRAME : easymain\n\n\n\n
64619	1	Received: from 163.com ([219.133.253.212])\n\t…	贵公司负责人 ( 经理 / 财务）您好 !\n我是深圳市康特实业有限公司 …

feature_cols = ['heads','texts']
X = df[feature_cols]
Y = df['labels']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=RANDOM_SEED)
print(X.shape, Y.shape, x_train.shape, x_test.shape, y_train.shape, y_test.shape) # print the shapes
# Split the training set a second time to get a validation set for model tuning
xt_train, xt_test, yt_train, yt_test = train_test_split(x_train, y_train, test_size=0.2, random_state=RANDOM_SEED)
print(xt_train.shape, xt_test.shape, yt_train.shape, yt_test.shape) # print the shapes

(64620, 2) (64620,) (51696, 2) (12924, 2) (51696,) (12924,)
(41356, 2) (10340, 2) (41356,) (10340,)
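The shapes printed above are consistent with the test share being rounded up to a whole number of rows and the remainder going to the training side. The sketch below reproduces the arithmetic under that assumed rounding rule; `split_sizes` is a hypothetical helper, and exact fractions are used to sidestep floating-point rounding.

```python
import math
from fractions import Fraction

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test share up to a whole row."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

one_fifth = Fraction(1, 5)  # test_size=0.2 as an exact fraction

# First split: all emails -> train / test
n_train, n_test = split_sizes(64620, one_fifth)
# Second split: train -> inner train / validation
n_inner_train, n_val = split_sizes(n_train, one_fifth)

print(n_train, n_test)        # prints: 51696 12924
print(n_inner_train, n_val)   # prints: 41356 10340
```

The validation rows (`xt_test`, `yt_test`) are the ones intended for choosing hyperparameters, so that the final test set is only touched once.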

x_train.head(10)

	heads	texts
13202	Received: from mail.com ([222.175.114.131])\n…	这里所凝聚的是无数网络精英的心血，在往下读之前，请先让自…
57760	Received: from 126.com ([219.133.129.188])\n\t…	\n贵公司负责人 :\n\n　　你好 !\n\n　　我公司为深圳市维拉…
11829	Received: from mail.cernet.com (staff.cernet.c…	你婆婆的表现是正常的\n而你对你婆婆有这样的表现没有预计\n…
4439	Received: from 163.com ([219.148.61.13])\n\tby…	\n尊敬的阁下 :\n\n　　　我们现在正在开展一项《关于青年生活…
50671	Received: from lnfzb.com ([221.222.182.164])\n…	\n邮件群发 – – – 最直接、最有效的广告方式 !\n\n\n\n【网…
12351	Received: from 163.com ([219.133.131.33])\n\tb…	贵公司负责人 ( 经理 / 财务 ) 您好：\n我公司是深圳市华源实业有限…
25373	Received: from 12565.com ([222.175.41.249])\n…	红十月商务王是一款自动为企业发布产品信息的软件（能够在十分钟…
51314	Received: from sea.net.edu.cn ([202.112.5.66])…	如题， GG 会爱上可看透你们的 MM 吗？你们会不会觉得这样的 …
24542	Received: from 163.com ([219.134.22.61])\n\tby…	尊敬的公司您好！打扰之处请见谅！\n我深圳公司愿在互惠互利、 …
15069	Received: from silversand.net ([219.136.103.68…	– – – – – – – 中国式执行与海尔兵法大 / 型 / 公 / 开 / 课…

# Create_Vec builds a text vectorizer object
def Create_Vec(V_type, max_df, min_df):
    if V_type == 'CV':
        vectorizer = CountVectorizer(max_df=max_df, min_df=min_df)
    elif V_type == 'TV':
        vectorizer = TfidfVectorizer(max_df=max_df, min_df=min_df)
    else:
        raise ValueError(f'unknown vectorizer type: {V_type!r}')
    return vectorizer
vectorizer = Create_Vec('TV', 0.6, 5)
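As a rough illustration of what the `max_df` and `min_df` arguments do, the dependency-free sketch below filters a vocabulary by document frequency. It assumes the scikit-learn convention that a float is a proportion of documents and an integer is an absolute count; `filter_vocab` and the toy corpus are hypothetical.

```python
from collections import Counter

def filter_vocab(docs, max_df=1.0, min_df=1):
    """Keep the words whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each word
    df = Counter(w for doc in docs for w in set(doc))
    # floats are proportions of the corpus, ints are absolute document counts
    hi = max_df * n_docs if isinstance(max_df, float) else max_df
    lo = min_df * n_docs if isinstance(min_df, float) else min_df
    return sorted(w for w, c in df.items() if lo <= c <= hi)

# Hypothetical word-segmented toy corpus
docs = [["公司", "发票", "代开"], ["公司", "合作", "优惠"],
        ["公司", "觉得", "时候"], ["公司", "发票", "优惠"]]

vocab = filter_vocab(docs, max_df=0.9, min_df=2)
print(vocab)  # prints: ['优惠', '发票']
```

Here "公司" appears in every document and is dropped by `max_df=0.9`, while the words that appear only once are dropped by `min_df=2`. With `Create_Vec('TV', 0.6, 5)`, a term is kept only if it occurs in at least 5 training documents and in no more than 60% of them.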

Fit the vectorizer on the training set, then transform both the training and test sets with it.

# fit_transform learns the vocabulary and weights from the training set, then transforms it
xheads_train = vectorizer.fit_transform(x_train['heads'])
# transform only applies the already-fitted vectorizer
xheads_test = vectorizer.transform(x_test['heads'])
print(xheads_train.shape, xheads_test.shape) # print the matrix shapes

plt.figure(figsize=(10, 8))
# plt.spy plots the non-zero entries of a sparse matrix: x is the column index, y is the row index
plt.spy(xheads_train, markersize=0.1, aspect='auto')
plt.xlabel('Features (Words)')
plt.ylabel('Documents')
plt.title('xheads_train')
plt.show()

(51696, 7802) (12924, 7802)

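# fit_transform learns the vocabulary and weights from the training set, then transforms it
# (this refits the shared vectorizer, replacing the header vocabulary)
xtexts_train = vectorizer.fit_transform(x_train['texts'])
# transform only applies the already-fitted vectorizer
xtexts_test = vectorizer.transform(x_test['texts'])
print(xtexts_train.shape, xtexts_test.shape) # print the matrix shapes

plt.figure(figsize=(10, 8))
# plt.spy plots the non-zero entries of a sparse matrix: x is the column index, y is the row index
plt.spy(xtexts_train, markersize=0.1, aspect='auto')
plt.xlabel('Features (Words)')
plt.ylabel('Documents')
plt.title('xtexts_train')
plt.show()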

(51696, 66591) (12924, 66591)
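For intuition about the numbers stored in these matrices, the sketch below computes TF-IDF rows by hand. It assumes the default `TfidfVectorizer` weighting (raw term counts, smoothed idf of the form ln((1 + n) / (1 + df)) + 1, and L2-normalized rows); `tfidf_rows` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocab, rows) where each row is an L2-normalized TF-IDF vector."""
    n = len(docs)
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in set(d))          # document frequency
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for d in docs:
        tf = Counter(d)                                     # raw term counts
        vec = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0    # avoid dividing by zero
        rows.append([v / norm for v in vec])
    return vocab, rows

# Hypothetical word-segmented toy corpus
docs = [["公司", "发票"], ["公司", "合作"], ["公司", "觉得"]]
vocab, rows = tfidf_rows(docs)

common = rows[0][vocab.index("公司")]   # appears in every document
rare = rows[0][vocab.index("发票")]     # appears in one document
print(round(common, 3), round(rare, 3))
```

The vocabulary and idf values are learned from the training data only, which is why the test set goes through `transform` rather than `fit_transform`: words that never occurred in training simply have no column.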

# Train a multinomial naive Bayes model on the email headers
model_heads = MultinomialNB()
model_heads.fit(xheads_train, y_train)

# Per-feature log probabilities for the spam and non-spam classes
spam_class_prob = model_heads.feature_log_prob_[1]
non_spam_class_prob = model_heads.feature_log_prob_[0]

# Convert the log probabilities to probabilities
prob_spam = np.exp(spam_class_prob)
prob_non_spam = np.exp(non_spam_class_prob)

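# Get the vocabulary. `vectorizer` was refitted on the bodies in the previous step, so reading
# its vocabulary here would label header features with body tokens (which would explain purely
# numeric tokens in a header word list); refit an identical vectorizer on the headers instead.
vocab = np.array(Create_Vec('TV', 0.6, 5).fit(x_train['heads']).get_feature_names_out())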

# Indices of the 10 features with the largest log probability in each class
top_spam_words = np.argsort(spam_class_prob)[::-1][:10]
top_non_spam_words = np.argsort(non_spam_class_prob)[::-1][:10]

# Print the highest-probability words for spam and non-spam
print("Top words for spam:")
print(vocab[top_spam_words])
print(prob_spam[top_spam_words])

print("\nTop words for non-spam:")
print(vocab[top_non_spam_words])
print(prob_non_spam[top_non_spam_words])


# Plot
plt.figure(figsize=(12, 6))
plt.rcParams['font.family'] = 'Microsoft YaHei'

plt.subplot(1, 2, 1)
plt.barh(range(10), prob_spam[top_spam_words], color='blue')
plt.yticks(range(10), vocab[top_spam_words])
plt.gca().invert_yaxis()
plt.title('Top Words for Spam')

plt.subplot(1, 2, 2)
plt.barh(range(10), prob_non_spam[top_non_spam_words], color='green')
plt.yticks(range(10), vocab[top_non_spam_words])
plt.gca().invert_yaxis()
plt.title('Top Words for Non-Spam')

plt.tight_layout()
plt.show()

Top words for spam:
['0760' '21rgypq' '723' '86619861' '052' '00' '5628517' '64755262' '330'
 '3126050']
[0.01191201 0.01068648 0.00904696 0.00887319 0.00887307 0.00853443
 0.00790591 0.00784224 0.00745457 0.00742007]

Top words for non-spam:
['87583640' '5468' '5kg' '34006833' '86545574' '039' '050810' '21rgypq'
 '040969' '259']
[0.01826745 0.01589548 0.01152334 0.01011742 0.00788216 0.00783755
 0.00766951 0.00763002 0.00701033 0.00699732]

# Train a multinomial naive Bayes model on the email bodies
model_texts = MultinomialNB()
model_texts.fit(xtexts_train, y_train)

# Per-feature log probabilities for the spam and non-spam classes
spam_class_prob = model_texts.feature_log_prob_[1]
non_spam_class_prob = model_texts.feature_log_prob_[0]

# Convert the log probabilities to probabilities
prob_spam = np.exp(spam_class_prob)
prob_non_spam = np.exp(non_spam_class_prob)

# Get the vocabulary (the vectorizer was last fitted on the bodies)
vocab = np.array(vectorizer.get_feature_names_out())

# Indices of the 10 features with the largest log probability in each class
top_spam_words = np.argsort(spam_class_prob)[::-1][:10]
top_non_spam_words = np.argsort(non_spam_class_prob)[::-1][:10]

# Print the highest-probability words for spam and non-spam
print("Top words for spam:")
print(vocab[top_spam_words])
print(prob_spam[top_spam_words])

print("\nTop words for non-spam:")
print(vocab[top_non_spam_words])
print(prob_non_spam[top_non_spam_words])


# Plot
plt.figure(figsize=(12, 6))
plt.rcParams['font.family'] = 'Microsoft YaHei'

plt.subplot(1, 2, 1)
plt.barh(range(10), prob_spam[top_spam_words], color='blue')
plt.yticks(range(10), vocab[top_spam_words])
plt.gca().invert_yaxis()
plt.title('Top Words for Spam')

plt.subplot(1, 2, 2)
plt.barh(range(10), prob_non_spam[top_non_spam_words], color='green')
plt.yticks(range(10), vocab[top_non_spam_words])
plt.gca().invert_yaxis()
plt.title('Top Words for Non-Spam')

plt.tight_layout()
plt.show()

Top words for spam:
['公司' '发票' 'com' '合作' '优惠' 'http' '有限公司' '我司' '代开' 'www']
[0.00820575 0.0065605  0.00384043 0.00351144 0.00315074 0.00289291
 0.00286094 0.0027401  0.00266387 0.00265106]

Top words for non-spam:
['一个' '自己' '没有' '我们' '觉得' '时候' 'mm' '什么' '知道' '这个']
[0.00305769 0.00296125 0.00271538 0.00211419 0.00210244 0.00206828
 0.00206705 0.00205005 0.00203816 0.00193213]
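Ranking by P(word | class) surfaces the most frequent words of each class, which are not necessarily the most discriminative ones. One common alternative is to rank by the log ratio log P(w | spam) - log P(w | ham). The sketch below shows the idea; `top_by_log_ratio` is a hypothetical helper and the probabilities are made-up numbers, not values from the trained model.

```python
import math

def top_by_log_ratio(p_spam, p_ham, k=3):
    """Return the k words with the largest log P(w|spam) - log P(w|ham)."""
    ratios = {w: math.log(p_spam[w]) - math.log(p_ham[w]) for w in p_spam}
    return sorted(ratios, key=ratios.get, reverse=True)[:k]

# Made-up conditional probabilities for illustration only
p_spam = {"公司": 0.008, "发票": 0.006, "一个": 0.002, "代开": 0.003}
p_ham = {"公司": 0.002, "发票": 0.0001, "一个": 0.003, "代开": 0.00002}

most_spammy = top_by_log_ratio(p_spam, p_ham, k=2)
print(most_spammy)  # prints: ['代开', '发票']
```

In this toy example "公司" is the most probable spam word but only the third most discriminative, because it is also fairly common in legitimate mail. With the fitted model, the same quantity is `model_texts.feature_log_prob_[1] - model_texts.feature_log_prob_[0]`.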

# Predict the test set from the email headers
yheads_pred = model_heads.predict(xheads_test)

# Evaluate the header model
accuracy = accuracy_score(y_test, yheads_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

print("Classification Report:")
print(classification_report(y_test, yheads_pred, zero_division=1))

Accuracy: 99.85%
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4263
           1       1.00      1.00      1.00      8661

    accuracy                           1.00     12924
   macro avg       1.00      1.00      1.00     12924
weighted avg       1.00      1.00      1.00     12924

# Predict the test set from the email bodies
ytexts_pred = model_texts.predict(xtexts_test)

# Evaluate the body model
accuracy = accuracy_score(y_test, ytexts_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

print("Classification Report:")
print(classification_report(y_test, ytexts_pred, zero_division=1))

Accuracy: 97.76%
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97      4263
           1       0.98      0.98      0.98      8661

    accuracy                           0.98     12924
   macro avg       0.97      0.97      0.97     12924
weighted avg       0.98      0.98      0.98     12924

# Set the weight used to blend the two models' predictions.
# yheads_pred and ytexts_pred are hard 0/1 labels, so the blend only takes
# the values 0, 1 - weight, weight and 1.
weight = 0.6
y_pred_avr = weight * yheads_pred + (1 - weight) * ytexts_pred

# Sweep the threshold above which the blended score is classified as spam
thresholds = np.arange(0, 1, 0.01)
accuracies = []

for t in thresholds:
    y_pred = (y_pred_avr > t).astype(int)
    
    # compute the accuracy
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

# Plot accuracy against the threshold
plt.plot(thresholds, accuracies)
plt.xlabel('Threshold')
plt.ylabel('Accuracy') 
plt.title('Threshold vs Accuracy')
plt.show()

# Find the threshold with the highest accuracy
best_threshold = thresholds[np.argmax(accuracies)]   
print("Best Threshold: ", best_threshold) 
print("Best Accuracy: ", max(accuracies))

Best Threshold:  0.4
Best Accuracy:  0.998529866914268
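Because the two inputs are hard 0/1 predictions, a weight and a threshold can only select one of a handful of boolean rules, which is consistent with the best fused accuracy matching the header model's 99.85%. The sketch below enumerates the four input combinations to name the rule a given pair implements; `fused_rule` is a hypothetical helper written for this illustration.

```python
def fused_rule(weight, threshold):
    """Name the rule implemented by weight*h + (1 - weight)*t > threshold for h, t in {0, 1}."""
    # truth table in the order (h, t) = (0, 0), (0, 1), (1, 0), (1, 1)
    table = tuple(int(weight * h + (1 - weight) * t > threshold)
                  for h in (0, 1) for t in (0, 1))
    names = {
        (0, 0, 0, 0): "never spam",
        (0, 0, 0, 1): "AND",           # spam only if both models say spam
        (0, 0, 1, 1): "headers only",  # the body model is ignored
        (0, 1, 0, 1): "bodies only",   # the header model is ignored
        (0, 1, 1, 1): "OR",            # spam if either model says spam
        (1, 1, 1, 1): "always spam",
    }
    return names.get(table, "other")

rules = [fused_rule(0.6, 0.2), fused_rule(0.6, 0.5),
         fused_rule(0.6, 0.8), fused_rule(0.3, 0.5)]
print(rules)  # prints: ['OR', 'headers only', 'AND', 'bodies only']
```

If a genuinely continuous blend is wanted, one option is to average the class probabilities from `predict_proba` of the two models instead of their hard labels, so that the threshold sweep has more than a few distinct outcomes.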

y_pred = (y_pred_avr > best_threshold).astype(int)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=1))

Accuracy: 99.85%
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4263
           1       1.00      1.00      1.00      8661

    accuracy                           1.00     12924
   macro avg       1.00      1.00      1.00     12924
weighted avg       1.00      1.00      1.00     12924

# Ranges of weights and thresholds to search.
# Note: candidates are scored on the test set here, so the best accuracy is an optimistic estimate.
weights = np.arange(0, 1, 0.01)
thresholds = np.arange(0, 1, 0.01)

best_accuracy = 0
best_params = {'weight': None, 'threshold': None}

# Loop over every weight and threshold
for weight in weights:
    for threshold in thresholds:
        # weighted average of the two predictions
        y_pred_avr = weight * yheads_pred + (1 - weight) * ytexts_pred
        
        # binarize with the threshold
        y_pred = (y_pred_avr > threshold).astype(int)
        
        # compute the accuracy
        accuracy = accuracy_score(y_test, y_pred)
        
        # keep the best accuracy and its parameters
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params['weight'] = weight
            best_params['threshold'] = threshold

# Print the best parameters and accuracy
print("Best Weight:", best_params['weight'])
print("Best Threshold:", best_params['threshold'])
print("Best Accuracy:", best_accuracy)

Best Weight: 0.51
Best Threshold: 0.49
Best Accuracy: 0.998529866914268

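# Rebuild the fused prediction from the best weight and threshold found by the search
# (after the loops, y_pred_avr still holds the last weight tried, not the best one)
y_pred_avr = best_params['weight'] * yheads_pred + (1 - best_params['weight']) * ytexts_pred
y_pred = (y_pred_avr > best_params['threshold']).astype(int)

# Evaluate the model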
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=1))

Accuracy: 99.85%
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4263
           1       1.00      1.00      1.00      8661

    accuracy                           1.00     12924
   macro avg       1.00      1.00      1.00     12924
weighted avg       1.00      1.00      1.00     12924

# Vocabulary sizes to try
word_sizes = [100, 200, 500, 1000, 2000, 5000, 10000]

# Accuracy for each vocabulary size
accuracies = []

# Run the experiment
for word_size in tqdm(word_sizes):
    # configure the text vectorizer
    vectorizer = CountVectorizer(max_features=word_size)
    
    # vectorize the data
    xheads_train = vectorizer.fit_transform(x_train['heads']) 
    xheads_test = vectorizer.transform(x_test['heads'])
    xtexts_train = vectorizer.fit_transform(x_train['texts']) 
    xtexts_test = vectorizer.transform(x_test['texts'])

    # train the models
    model_heads = MultinomialNB()
    model_heads.fit(xheads_train, y_train)
    model_texts = MultinomialNB()
    model_texts.fit(xtexts_train, y_train)

    # predict
    yheads_pred = model_heads.predict(xheads_test)
    ytexts_pred = model_texts.predict(xtexts_test)
    
    # fuse the two models (scalar weight and threshold, not the search ranges)
    weight = 0.6
    threshold = 0.4
    y_pred_avr = weight * yheads_pred + (1 - weight) * ytexts_pred
    y_pred = (y_pred_avr > threshold).astype(int)
    
    # evaluate
    accuracy = accuracy_score(y_test, y_pred)
    
    # store the result
    accuracies.append(accuracy)

# Plot the curve
plt.plot(word_sizes, accuracies, marker='o')
plt.title('Model Performance vs. Word Size')
plt.xlabel('Word Size')
plt.ylabel('Accuracy')
plt.show()

████████████████████████████████████████████████████████████████████████████████████| 7/7 [02:46<00:00, 23.80s/it]
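The `max_features` argument caps the vocabulary at the most frequent terms across the corpus. A dependency-free sketch of that selection is shown below; `top_k_vocab` and the toy corpus are hypothetical, and ties may be broken differently from scikit-learn.

```python
from collections import Counter

def top_k_vocab(docs, k):
    """Keep the k terms with the highest total count across all documents."""
    counts = Counter(w for d in docs for w in d)
    return sorted(w for w, _ in counts.most_common(k))

# Hypothetical word-segmented toy corpus
docs = [["公司", "发票", "公司"], ["公司", "优惠", "发票"], ["觉得", "公司"]]

small_vocab = top_k_vocab(docs, k=2)
print(small_vocab)  # prints: ['公司', '发票']
```

A smaller vocabulary makes the feature matrix narrower and training faster, at the cost of discarding rarer terms that may still carry signal, which is the trade-off the experiment above measures.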

# Parameter ranges
Vector_Types = ['CV','TV']          # vectorizer types
NBs = [BernoulliNB(), MultinomialNB(), ComplementNB()]  # naive Bayes variants
max_dfs = np.arange(0.1, 1, 0.1)    # max_df range
min_dfs = np.arange(1, 10, 1)     # min_df range
weights = np.arange(0, 1, 0.1)     # weight range
thresholds = np.arange(0, 1, 0.1)  # threshold range

best_accuracy_head = 0
best_params_head = {'Vector_Type': None, 'NB_Type': None, 'max_df': None, 'min_df': None}

best_accuracy_text = 0
best_params_text = {'Vector_Type': None, 'NB_Type': None, 'max_df': None, 'min_df': None}

total = len(Vector_Types) * len(NBs) * len(max_dfs) * len(min_dfs)
pbar = tqdm(total=total) 

for Vtype in Vector_Types:
    for NB in NBs:
        for max_df in max_dfs:
            for min_df in min_dfs:
                # print(f'Vtype:{Vtype},NB:{NB},max_df:{max_df},min_df:{min_df}')
                vectorizer = Create_Vec(Vtype,max_df,min_df)
                # vectorize the email headers
                xheads_train = vectorizer.fit_transform(xt_train['heads']) 
                xheads_test = vectorizer.transform(xt_test['heads'])
                NB.fit(xheads_train, yt_train) # train on the inner training split
                yheads_pred = NB.predict(xheads_test) # predict on the validation split
                accuracy_heads = accuracy_score(yt_test, yheads_pred) # validation accuracy
                # keep track of the best parameters so far
                if accuracy_heads > best_accuracy_head:
                    best_accuracy_head = accuracy_heads
                    best_params_head['Vector_Type'] = Vtype
                    best_params_head['NB_Type'] = NB
                    best_params_head['max_df'] = max_df
                    best_params_head['min_df'] = min_df
                pbar.update(1)

pbar.close()

print("Best Vector Type :", best_params_head['Vector_Type'])
print("Best NB Type:", best_params_head['NB_Type'])
print("Best max_df:", best_params_head['max_df'])
print("Best min_df:", best_params_head['min_df'])
print("Best Accuracy:", best_accuracy_head)

████████████████████████████████████████████████████████████████████████████████| 486/486 [50:57<00:00,  6.29s/it]

Best Vector Type : TV
Best NB Type: MultinomialNB()
Best max_df: 0.1
Best min_df: 1
Best Accuracy: 0.9979690522243714
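The four nested loops above can also be written as a single loop over the Cartesian product of the parameter lists. The sketch below shows that pattern with `itertools.product`; `grid_search` is a hypothetical helper, and `evaluate` is a toy stand-in for "fit on the training split, score on the validation split", so its scores are not real results.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Try every combination in `grid` and return (best_params, best_score)."""
    best_score, best_params = float("-inf"), None
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:  # strict comparison keeps the first of any ties
            best_score, best_params = score, params
    return best_params, best_score

grid = {
    "vector_type": ["CV", "TV"],
    "nb_type": ["bernoulli", "multinomial", "complement"],
    "max_df": [round(0.1 * i, 1) for i in range(1, 10)],
    "min_df": list(range(1, 10)),
}

def evaluate(params):
    # Toy scoring function for illustration only; replace with real training and validation
    return (params["vector_type"] == "TV") - abs(params["max_df"] - 0.1) - 0.01 * params["min_df"]

n_combos = len(list(product(*grid.values())))
best_params, best_score = grid_search(grid, evaluate)
print(n_combos)  # prints: 486
print(best_params)
```

The combination count matches the 486 iterations in the progress bar above (2 vectorizers, 3 classifiers, 9 values of `max_df`, 9 values of `min_df`).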

In the end, the search picked TfidfVectorizer(max_df=0.1, min_df=1) features combined with MultinomialNB(), which gave the highest validation accuracy on the email headers.

vectorizer = Create_Vec('TV',0.1,1)

# Vectorize the data
xheads_train = vectorizer.fit_transform(x_train['heads']) 
xheads_test = vectorizer.transform(x_test['heads'])
xtexts_train = vectorizer.fit_transform(x_train['texts']) 
xtexts_test = vectorizer.transform(x_test['texts'])

# Train the models
model_heads = MultinomialNB()
model_heads.fit(xheads_train, y_train)
model_texts = MultinomialNB()
model_texts.fit(xtexts_train, y_train)

# Predict
yheads_pred = model_heads.predict(xheads_test)
ytexts_pred = model_texts.predict(xtexts_test)

# Tune the fusion
# Ranges of weights and thresholds to search
weights = np.arange(0, 1, 0.01)
thresholds = np.arange(0, 1, 0.01)

best_accuracy = 0
best_params = {'weight': None, 'threshold': None}

# Loop over every weight and threshold
for weight in weights:
    for threshold in thresholds:
        # weighted average of the two predictions
        y_pred_avr = weight * yheads_pred + (1 - weight) * ytexts_pred
        
        # binarize with the threshold
        y_pred = (y_pred_avr > threshold).astype(int)
        
        # compute the accuracy
        accuracy = accuracy_score(y_test, y_pred)
        
        # keep the best accuracy and its parameters
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params['weight'] = weight
            best_params['threshold'] = threshold

# Print the best parameters and accuracy
print("Best Weight:", best_params['weight'])
print("Best Threshold:", best_params['threshold'])
print("Best Accuracy:", best_accuracy)

Best Weight: 0.51
Best Threshold: 0.49
Best Accuracy: 0.9986072423398329

best_weight = 0.6
best_threshold = 0.4

y_pred_avr = best_weight * yheads_pred + (1 - best_weight) * ytexts_pred
y_pred = (y_pred_avr > best_threshold).astype(int)

# Evaluate the model
best_acc = accuracy_score(y_test, y_pred) # accuracy: share of test emails classified correctly
best_precision = precision_score(y_test, y_pred) # precision: share of emails flagged as spam that really are spam; a key metric here, since legitimate mail should not be misfiled as spam
best_recall = recall_score(y_test, y_pred) # recall: share of actual spam that was caught
print(f'accuracy: {best_acc * 100:.4f}%, precision: {best_precision * 100:.4f}%, recall: {best_recall * 100:.4f}%') # print the metrics

accuracy: 99.8607%, precision: 100.0000%, recall: 99.7922%
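As a sanity check, the reported metrics pin down the confusion counts. The sketch below back-solves them from the printed precision and recall together with the test-set size and spam support from the classification reports; it assumes the printed values are rounded versions of exact ratios of whole-number counts.

```python
n_test, n_spam = 12924, 8661        # test-set size and spam support from the reports
precision, recall = 1.0, 0.997922   # reported precision and recall

tp = round(recall * n_spam)         # spam correctly flagged
fn = n_spam - tp                    # spam that slipped through
fp = round(tp / precision) - tp     # legitimate mail wrongly flagged
accuracy = (n_test - fn - fp) / n_test

print(tp, fn, fp, f"{accuracy:.4%}")  # prints: 8643 18 0 99.8607%
```

Under that assumption, the final model misses 18 spam emails and flags no legitimate mail, which reproduces the reported accuracy.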
