Unsupervised keyword extraction algorithms: TF-IDF, TextRank, RAKE, YAKE, KeyBERT


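TF-IDF is a classic statistical method. TF (term frequency) is the number of times a word occurs in a document; in general, the more often a word occurs in a document, the more important it is to that document. IDF (inverse document frequency) is the ratio of the total number of documents to the number of documents that contain the word; the fewer documents in the collection a word appears in, the better it characterizes the documents it does appear in, and the more important it is. The IDF is usually taken on a log scale. Let $N$ be the total number of documents and $df_t$ the number of documents that contain word $t$ (1 is usually added to the denominator so that it cannot be zero):

$$idf_t = \log \frac{N}{df_t + 1}$$

The TF-IDF score of word $t$ in a document is then the product $tf_t \times idf_t$.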
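
The definitions above can be checked by hand on a small example. The following is a minimal sketch, assuming a hypothetical three-document toy corpus that is already tokenized and using the raw count as TF; the names `docs`, `idf` and `tf_idf` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is already tokenized.
docs = [
    ["keyword", "extraction", "with", "tf", "idf"],
    ["keyword", "ranking", "with", "graphs"],
    ["graphs", "and", "embeddings"],
]

N = len(docs)

# df[t]: number of documents that contain term t at least once.
df = Counter(term for doc in docs for term in set(doc))


def idf(term):
    # log(N / (df_t + 1)), with 1 added to the denominator as in the text.
    return math.log(N / (df[term] + 1))


def tf_idf(doc):
    # Raw term count as TF, multiplied by the IDF of the term.
    tf = Counter(doc)
    return {term: count * idf(term) for term, count in tf.items()}


scores = tf_idf(docs[0])
# "extraction" occurs in one document, "keyword" in two,
# so "extraction" gets the higher score in the first document.
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Note that scikit-learn's `TfidfTransformer` with `smooth_idf=True` uses a slightly different smoothing, $\ln\frac{1 + N}{1 + df_t} + 1$, and L2-normalizes each row by default, so its numbers will not match this sketch exactly.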

import json

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

def sort_coo(coo_matrix):
    """Sort the (column index, score) pairs of a COO matrix by score, descending."""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)


def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Return the top-n feature names and their TF-IDF scores as a JSON string."""
    results = {}
    # Use only the top-n items of the sorted vector
    for idx, score in sorted_items[:topn]:
        # Map each feature name to its rounded score
        results[feature_names[idx]] = round(float(score), 3)
    return json.dumps(results)


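# The document collection. For Chinese, segment the text first or pass a tokenizer to CountVectorizer.
# Fill this list with your own documents; fit_transform raises an error on an empty list.
docs = []

# Build the vocabulary, ignoring words that appear in more than 85% of the documents
cv = CountVectorizer(max_df=0.85)
word_count_vector = cv.fit_transform(docs)
print(word_count_vector.shape)

# Learn the IDF weights from the count matrix
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)
print(tfidf_transformer.idf_)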

# Feature (word) names
feature_names = cv.get_feature_names_out()
# Compute the TF-IDF matrix
tf_idf_vector = tfidf_transformer.transform(word_count_vector)

results = []
for i in range(tf_idf_vector.shape[0]):
    # Get the vector of a single document
    curr_vector = tf_idf_vector[i]
    # Sort the vector entries by TF-IDF score
    sorted_items = sort_coo(curr_vector.tocoo())
    # Keep the top 10 keywords
    keywords = extract_topn_from_vector(feature_names, sorted_items, 10)
    results.append(keywords)

Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types.

Compatibility – systems – linear constraints – set – natural numbers – Criteria – compatibility – system – linear Diophantine equations – strict inequations – nonstrict inequations – Upper bounds – components – minimal set – solutions – algorithms – minimal generating sets – solutions – systems – criteria – corresponding algorithms – constructing – minimal supporting set – solving – systems – systems
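
The candidate list above appears to be what RAKE produces when it cuts the sample text at stopwords and punctuation. A minimal sketch of that splitting step and of RAKE's degree/frequency scoring follows. The stopword list here is a tiny hypothetical one chosen for this one sentence (real RAKE implementations use a much longer list), and `candidate_phrases` and `rake_scores` are illustrative names, not a library API.

```python
import re
from collections import defaultdict

# Tiny stopword list for illustration only (an assumption, not RAKE's real list).
STOPWORDS = {"of", "a", "the", "and", "for", "are", "is", "over", "in", "considered"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(phrases):
    """Score each phrase as the sum of deg(w) / freq(w) over its words."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences inside the phrase, including the word itself.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}


text = (
    "Criteria of compatibility of a system of linear Diophantine equations, "
    "strict inequations, and nonstrict inequations are considered."
)
phrases = candidate_phrases(text)
scores = rake_scores(phrases)
for phrase, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(phrase, score)
```

Longer phrases made of words that mostly occur inside long phrases score highest, which is why "linear diophantine equations" ranks above the single-word candidates here.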

from keybert import KeyBERT

# English example. No embedding model is specified, so KeyBERT uses its default, the sentence-transformers model all-MiniLM-L6-v2.
doc = """
When we want to understand key information from specific documents, we typically turn towards keyword extraction. Keyword extraction is the automated process of extracting the words and phrases that are most relevant to an input text.
      """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)

# Chinese example
# Chinese needs a custom CountVectorizer with its own tokenizer; the example below uses jieba for word segmentation
from sklearn.feature_extraction.text import CountVectorizer
import jieba
def tokenize_zh(text):
    words = jieba.lcut(text)
    return words
vectorizer = CountVectorizer(tokenizer=tokenize_zh)
kw_model = KeyBERT(model='paraphrase-multilingual-MiniLM-L12-v2')
doc = """
    强化学习是机器通过与环境交互来实现目标的一种计算方法。机器和环境的一轮交互是指，机器在环境的一个状态下做一个动作决策，把这个动作作用到环境当中，这个环境发生相应的改变并且将相应的奖励反馈和下一轮状态传回机器。这种交互是迭代进行的，机器的目标是最大化在多轮交互过程中获得的累积奖励的期望。"""
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)

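# Code taken from the KeyBERT source: https://github.com/MaartenGr/KeyBERT/blob/master/keybert/_mmr.py
from operator import itemgetter
from typing import List, Tuple

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
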
def mmr(
    doc_embedding: np.ndarray,
    word_embeddings: np.ndarray,
    words: List[str],
    top_n: int = 5,
    diversity: float = 0.8,
) -> List[Tuple[str, float]]:
    """Calculate Maximal Marginal Relevance (MMR)
    between candidate keywords and the document.


    MMR considers the similarity of keywords/keyphrases with the
    document, along with the similarity of already selected
    keywords and keyphrases. This results in a selection of keywords
    that maximize their within diversity with respect to the document.

    Arguments:
        doc_embedding: The document embeddings
        word_embeddings: The embeddings of the selected candidate keywords/phrases
        words: The selected candidate keywords/keyphrases
        top_n: The number of keywords/keyphrases to return
        diversity: How diverse the selected keywords/keyphrases are.
                   Values between 0 and 1 with 0 being not diverse at all
                   and 1 being most diverse.

    Returns:
         List[Tuple[str, float]]: The selected keywords/keyphrases with their similarity to the document

    """

    # Extract similarity within words, and between words and the document
    word_doc_similarity = cosine_similarity(word_embeddings, doc_embedding)
    word_similarity = cosine_similarity(word_embeddings)

    # Initialize candidates and select the best keyword/keyphrase first
    keywords_idx = [np.argmax(word_doc_similarity)]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]

    for _ in range(min(top_n - 1, len(words) - 1)):
        # Extract similarities within candidates and
        # between candidates and selected keywords/phrases
        candidate_similarities = word_doc_similarity[candidates_idx, :]
        target_similarities = np.max(
            word_similarity[candidates_idx][:, keywords_idx], axis=1
        )

        # Calculate MMR
        mmr = (
            1 - diversity
        ) * candidate_similarities - diversity * target_similarities.reshape(-1, 1)
        mmr_idx = candidates_idx[np.argmax(mmr)]

        # Update keywords & candidates
        keywords_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)

    # Extract and sort keywords in descending similarity
    keywords = [
        (words[idx], round(float(word_doc_similarity.reshape(1, -1)[0][idx]), 4))
        for idx in keywords_idx
    ]
    keywords = sorted(keywords, key=itemgetter(1), reverse=True)
    return keywords
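
To see what the `diversity` parameter does, the same selection rule can be run on a toy example. This is a minimal pure-Python sketch, not the KeyBERT implementation: the 2-D embeddings are hypothetical hand-picked vectors, `mmr_select` is an illustrative name, and it returns the keywords in selection order without scores, whereas the function above returns them sorted by similarity to the document.

```python
import math


def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr_select(doc_vec, word_vecs, words, top_n=2, diversity=0.5):
    """Greedy MMR: trade off relevance to the document against redundancy."""
    doc_sim = [cosine(w, doc_vec) for w in word_vecs]
    # Start with the candidate most similar to the document.
    selected = [max(range(len(words)), key=lambda i: doc_sim[i])]
    candidates = [i for i in range(len(words)) if i != selected[0]]

    while candidates and len(selected) < top_n:
        def score(i):
            # Redundancy: highest similarity to any already selected keyword.
            redundancy = max(cosine(word_vecs[i], word_vecs[j]) for j in selected)
            return (1 - diversity) * doc_sim[i] - diversity * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]


# Hypothetical 2-D embeddings: the first two phrases are near-duplicates.
doc_vec = [1.0, 1.0]
words = ["keyword extraction", "keyphrase extraction", "sentence embeddings"]
word_vecs = [[1.0, 0.9], [1.0, 0.8], [0.3, 1.0]]

low = mmr_select(doc_vec, word_vecs, words, top_n=2, diversity=0.0)
high = mmr_select(doc_vec, word_vecs, words, top_n=2, diversity=0.7)
print(low)
print(high)
```

With `diversity=0.0` the ranking is by document similarity alone, so the two near-duplicate phrases are both picked; with `diversity=0.7` the second pick is penalized for its similarity to the first, and the less redundant phrase is chosen instead.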