How to Implement Semantic Similarity with BERT in TensorFlow 2.10

Published: 2023-04-12 15:00:17  Source: 億速云  Author: iii  Category: Development Technology

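This article explains how to implement semantic similarity with BERT in TensorFlow 2.10. It walks through data processing, model building, training, fine-tuning, evaluation, and inference on sentence pairs from the SNLI dataset.

The main configuration is as follows: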

tensorflow-gpu == 2.10.0
python == 3.10
transformers == 4.26.1

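Data Processing

First, import the libraries used in the following steps (NumPy, Pandas, TensorFlow, and Transformers) and set a few key parameters. max_length is the maximum length of the input text in tokens, batch_size is the number of samples per training batch, epochs is the number of passes over the training set, and labels lists the three class labels: "contradiction", "entailment", and "neutral".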

import numpy as np
import pandas as pd
import tensorflow as tf
import transformers
max_length = 128   
batch_size = 32
epochs = 2
labels = ["contradiction", "entailment", "neutral"]

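Pandas reads the training, validation, and test splits of the SNLI dataset; only the first 300,000 rows of the training set are loaded. The code then prints the number of samples in each split, followed by three training samples, each consisting of two sentences and a class label.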

train_df = pd.read_csv("SNLI_Corpus/snli_1.0_train.csv", nrows=300000)
valid_df = pd.read_csv("SNLI_Corpus/snli_1.0_dev.csv")
test_df = pd.read_csv("SNLI_Corpus/snli_1.0_test.csv")
print(f"Training samples: {train_df.shape[0]}")
print(f"Validation samples: {valid_df.shape[0]}")
print(f"Test samples: {test_df.shape[0]}")
print()
print(f"Sentence 1: {train_df.loc[5, 'sentence1']}")
print(f"Sentence 2: {train_df.loc[5, 'sentence2']}")
print(f"Similarity: {train_df.loc[5, 'similarity']}")
print()
print(f"Sentence 1: {train_df.loc[3, 'sentence1']}")
print(f"Sentence 2: {train_df.loc[3, 'sentence2']}")
print(f"Similarity: {train_df.loc[3, 'similarity']}")
print()
print(f"Sentence 1: {train_df.loc[4, 'sentence1']}")
print(f"Sentence 2: {train_df.loc[4, 'sentence2']}")
print(f"Similarity: {train_df.loc[4, 'similarity']}")

Output:

Training samples: 300000
Validation samples: 10000
Test samples: 10000
Sentence 1: Children smiling and waving at camera
Sentence 2: The kids are frowning
Similarity: contradiction
Sentence 1: Children smiling and waving at camera
Sentence 2: They are smiling at their parents
Similarity: neutral
Sentence 1: Children smiling and waving at camera
Sentence 2: There are children present
Similarity: entailment

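First, dropna removes rows with missing values from the training set. Rows whose class label is "-" are then removed from the training, validation, and test sets. Each set is shuffled with sample, and reset_index rebuilds the index. Finally, the sample counts after processing are printed.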

train_df.dropna(axis=0, inplace=True)
train_df = ( train_df[train_df.similarity != "-"].sample(frac=1.0, random_state=30).reset_index(drop=True) )
valid_df = ( valid_df[valid_df.similarity != "-"].sample(frac=1.0, random_state=30).reset_index(drop=True) )
test_df  = ( test_df[test_df.similarity != "-"].sample(frac=1.0, random_state=30).reset_index(drop=True) )
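print(f"Training samples after processing: {train_df.shape[0]}")
print(f"Validation samples after processing: {valid_df.shape[0]}")
print(f"Test samples after processing: {test_df.shape[0]}")

Output:

Training samples after processing: 299616
Validation samples after processing: 9842
Test samples after processing: 9824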
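The cleaning steps above can be tried on a small in-memory table before running them on the full corpus. The sketch below is only an illustration: the rows in toy_df are made up, and it assumes the same three column names used in this article.

```python
import pandas as pd

# Hypothetical toy rows that mimic the SNLI columns used in this article.
toy_df = pd.DataFrame({
    "sentence1": ["A man sleeps", "A dog runs", "Kids play", None],
    "sentence2": ["A man is awake", "An animal moves", "Kids are outside", "Empty"],
    "similarity": ["contradiction", "entailment", "-", "neutral"],
})

# Drop rows with missing values, then rows whose label is "-".
cleaned = toy_df.dropna(axis=0)
cleaned = cleaned[cleaned.similarity != "-"]

# Shuffle all rows reproducibly and rebuild a contiguous 0..n-1 index.
cleaned = cleaned.sample(frac=1.0, random_state=30).reset_index(drop=True)

print(cleaned.shape[0])  # 2
```

Of the four toy rows, one is dropped for a missing sentence and one for the "-" label, which leaves two.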

Next, the class labels in the training, validation, and test sets are converted to integers and then to one-hot vectors. Specifically, apply maps "contradiction" to 0, "entailment" to 1, and "neutral" to 2, and to_categorical converts the integer labels to one-hot encoding. The results are stored in y_train, y_val, and y_test.

train_df["label"] = train_df["similarity"].apply(lambda x: 0 if x == "contradiction" else 1 if x == "entailment" else 2)
y_train = tf.keras.utils.to_categorical(train_df.label, num_classes=3)
valid_df["label"] = valid_df["similarity"].apply(lambda x: 0 if x == "contradiction" else 1 if x == "entailment" else 2)
y_val = tf.keras.utils.to_categorical(valid_df.label, num_classes=3)
test_df["label"] = test_df["similarity"].apply(lambda x: 0 if x == "contradiction" else 1 if x == "entailment" else 2)
y_test = tf.keras.utils.to_categorical(test_df.label, num_classes=3)
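To make the encoding concrete, here is a minimal NumPy-only sketch of the same label-to-one-hot mapping. The names label_to_id and one_hot_encode are hypothetical helpers introduced for illustration; the article itself uses apply and to_categorical.

```python
import numpy as np

labels = ["contradiction", "entailment", "neutral"]
# Hypothetical lookup table: label name -> integer id, in list order.
label_to_id = {name: i for i, name in enumerate(labels)}

def one_hot_encode(names):
    """Map label names to integer ids, then select rows of an identity matrix."""
    ids = np.array([label_to_id[name] for name in names])
    return np.eye(len(labels), dtype="float32")[ids]

encoded = one_hot_encode(["neutral", "contradiction", "entailment"])
print(encoded.tolist())
# [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```

One difference worth noting: the dictionary lookup raises a KeyError on an unexpected label, whereas the lambda in the article silently maps any other value to 2.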

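Model Building

The class BertSemanticDataGenerator, which inherits from tf.keras.utils.Sequence, generates the batches needed to train the BERT model.

Its constructor takes the array of sentence pairs, sentence_pairs, and the corresponding labels. The optional arguments are batch_size (the batch size), shuffle (whether to shuffle the data), and include_targets (whether to return labels along with the inputs). The class also creates a BERT tokenizer, tokenizer, from the pretrained bert-base-uncased model.

The class implements three methods: __len__, __getitem__, and on_epoch_end. __len__ returns the number of full batches, that is, the dataset size floor-divided by batch_size, so a final partial batch is dropped. __getitem__ uses the batch index to select sentence pairs from self.sentence_pairs, encodes them with the tokenizer into the inputs BERT expects, and returns the inputs together with the labels (or the inputs only when include_targets is False). on_epoch_end runs after each epoch and shuffles the data if shuffle is enabled.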

class BertSemanticDataGenerator(tf.keras.utils.Sequence):
    """Yields batches of BERT-encoded sentence pairs, optionally with labels."""

    def __init__(self, sentence_pairs, labels, batch_size=batch_size, shuffle=True, include_targets=True):
        self.sentence_pairs = sentence_pairs
        self.labels = labels
        self.shuffle = shuffle
        self.batch_size = batch_size
        self.include_targets = include_targets
        # Tokenizer that matches the pretrained bert-base-uncased checkpoint.
        self.tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
        self.indexes = np.arange(len(self.sentence_pairs))
        self.on_epoch_end()

    def __len__(self):
        # Number of full batches; a trailing partial batch is dropped.
        return len(self.sentence_pairs) // self.batch_size

    def __getitem__(self, idx):
        # Select the sample indexes that belong to batch `idx`.
        indexes = self.indexes[idx * self.batch_size : (idx + 1) * self.batch_size]
        sentence_pairs = self.sentence_pairs[indexes]
        # Encode each pair jointly, padded or truncated to max_length.
        # padding="max_length" replaces the deprecated pad_to_max_length=True.
        encoded = self.tokenizer.batch_encode_plus(
            sentence_pairs.tolist(), add_special_tokens=True,
            max_length=max_length, return_attention_mask=True, return_token_type_ids=True,
            padding="max_length", truncation=True, return_tensors="tf")
        input_ids = np.array(encoded["input_ids"], dtype="int32")
        attention_masks = np.array(encoded["attention_mask"], dtype="int32")
        token_type_ids = np.array(encoded["token_type_ids"], dtype="int32")
        if self.include_targets:
            labels = np.array(self.labels[indexes], dtype="int32")
            return [input_ids, attention_masks, token_type_ids], labels
        else:
            return [input_ids, attention_masks, token_type_ids]

    def on_epoch_end(self):
        # Reshuffle the sample order after each epoch when shuffling is enabled.
        if self.shuffle:
            np.random.RandomState(30).shuffle(self.indexes)
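To show what the three arrays returned by __getitem__ contain, the sketch below builds them by hand for a single pair. It is a toy illustration only: encode_pair is a hypothetical helper, it splits on whitespace instead of using BERT's WordPiece vocabulary, it returns token strings rather than ids, and it does not implement truncation.

```python
def encode_pair(sentence1, sentence2, max_length):
    """Toy BERT-style pair encoding: [CLS] A [SEP] B [SEP] plus padding."""
    tokens_a = sentence1.lower().split()
    tokens_b = sentence2.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence 1 and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Real tokens get mask 1; padding added below gets mask 0.
    attention_mask = [1] * len(tokens)
    pad = max_length - len(tokens)
    if pad < 0:
        raise ValueError("pair exceeds max_length; truncation is not sketched here")
    return (
        tokens + ["[PAD]"] * pad,
        attention_mask + [0] * pad,
        token_type_ids + [0] * pad,
    )

tokens, attention_mask, token_type_ids = encode_pair("Kids smile", "Kids frown", 8)
print(tokens)          # ['[CLS]', 'kids', 'smile', '[SEP]', 'kids', 'frown', '[SEP]', '[PAD]']
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 0]
print(token_type_ids)  # [0, 0, 0, 0, 1, 1, 1, 0]
```

The attention mask tells the model which positions are padding, and the token type ids tell it which sentence each token belongs to.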

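The following code uses TensorFlow 2 and the Transformers library to build a BERT-based text classification model. The main steps are as follows.

First, three input tensors are defined: input_ids, attention_masks, and token_type_ids. Each has shape (max_length,), where max_length is the maximum length of the preprocessed text sequence.

Next, the BERT model bert_model is created by calling TFBertModel.from_pretrained, which loads the weights of a pretrained BERT model. bert_model.trainable is set to False so that the BERT weights are not updated during this training stage.

Then input_ids, attention_masks, and token_type_ids are passed to bert_model to obtain bert_output. Its last hidden state (last_hidden_state) is stored in sequence_output and serves as the input to the LSTM layer.

A Bi-LSTM layer with 64 units per direction processes sequence_output and returns the full sequence, so each time step has 2 × 64 = 128 features. The Bi-LSTM output goes through global average pooling and global max pooling to produce avg_pool and max_pool, each of dimension 128. Concatenating the two yields a 256-dimensional vector, which passes through a Dropout layer and then a Dense layer that outputs the final classification.

Finally, tf.keras.models.Model defines the complete network with input_ids, attention_masks, and token_type_ids as inputs and output as the output. model.compile compiles the model with the Adam optimizer, the categorical_crossentropy loss, and acc as the evaluation metric.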

input_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="input_ids")
attention_masks = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="attention_masks")
token_type_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name="token_type_ids")
bert_model = transformers.TFBertModel.from_pretrained("bert-base-uncased")
bert_model.trainable = False
bert_output = bert_model.bert(input_ids, attention_mask=attention_masks, token_type_ids=token_type_ids)
sequence_output = bert_output.last_hidden_state
bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(sequence_output)
avg_pool = tf.keras.layers.GlobalAveragePooling1D()(bi_lstm)
max_pool = tf.keras.layers.GlobalMaxPooling1D()(bi_lstm)
concat = tf.keras.layers.concatenate([avg_pool, max_pool])
dropout = tf.keras.layers.Dropout(0.5)(concat)
output = tf.keras.layers.Dense(3, activation="softmax")(dropout)
model = tf.keras.models.Model(inputs=[input_ids, attention_masks, token_type_ids], outputs=output)
model.compile( optimizer=tf.keras.optimizers.Adam(), loss="categorical_crossentropy", metrics=["acc"],)
model.summary()

The printed model structure shows that all BERT parameters are frozen:

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_ids (InputLayer)         [(None, 128)]        0           []                               
 attention_masks (InputLayer)   [(None, 128)]        0           []                               
 token_type_ids (InputLayer)    [(None, 128)]        0           []                               
 bert (TFBertMainLayer)         TFBaseModelOutputWi  109482240   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_masks[0][0]',        
                                tentions(last_hidde               'token_type_ids[0][0]']         
                                n_state=(None, 128,                                               
                                 768),                                                            
                                 pooler_output=(Non                                               
                                e, 768),                                                          
                                 past_key_values=No                                               
                                ne, hidden_states=N                                               
                                one, attentions=Non                                               
                                e, cross_attentions                                               
                                =None)                                                            
 bidirectional (Bidirectional)  (None, 128, 128)     426496      ['bert[0][0]']                   
 global_average_pooling1d (Glob  (None, 128)         0           ['bidirectional[0][0]']          
 alAveragePooling1D)                                                                              
 global_max_pooling1d (GlobalMa  (None, 128)         0           ['bidirectional[0][0]']          
 xPooling1D)                                                                                      
 concatenate (Concatenate)      (None, 256)          0           ['global_average_pooling1d[0][0]'
                                                                 , 'global_max_pooling1d[0][0]']  
 dropout_37 (Dropout)           (None, 256)          0           ['concatenate[0][0]']            
 dense (Dense)                  (None, 3)            771         ['dropout_37[0][0]']             
==================================================================================================
Total params: 109,909,507
Trainable params: 427,267
Non-trainable params: 109,482,240
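The parameter counts in the summary can be checked by hand. The sketch below uses the standard parameter formulas for an LSTM layer and a Dense layer; lstm_params and dense_params are hypothetical helper names, and the BERT count is copied from the summary rather than derived.

```python
def lstm_params(input_dim, units):
    """Four gates, each with an input kernel, a recurrent kernel and a bias."""
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

hidden_size = 768    # width of last_hidden_state in the summary
lstm_units = 64      # units per direction
bert = 109_482_240   # copied from the summary above

bi_lstm = 2 * lstm_params(hidden_size, lstm_units)  # forward + backward
dense = dense_params(2 * 2 * lstm_units, 3)         # 256 pooled features -> 3 classes

print(bi_lstm, dense)          # 426496 771
print(bi_lstm + dense)         # 427267 trainable while BERT is frozen
print(bert + bi_lstm + dense)  # 109909507 in total
```

The results match the Bidirectional, Dense, trainable, and total figures reported in the summary.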

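Model Training

First, the training and validation sets are wrapped in BertSemanticDataGenerator objects to create the training data generator train_data and the validation data generator valid_data. The model is then trained by calling model.fit() with train_data as the training data and valid_data as the validation data. The use_multiprocessing and workers arguments control how many processes load data from the generators during training, with the aim of speeding it up. Keras documents workers as the maximum number of processes to spin up, so it is worth confirming that workers=-1 behaves as intended in your version.

The training history is stored in the history variable and can be used to analyze how training progressed.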

train_data = BertSemanticDataGenerator( train_df[["sentence1", "sentence2"]].values.astype("str"), y_train, batch_size=batch_size, shuffle=True)
valid_data = BertSemanticDataGenerator( valid_df[["sentence1", "sentence2"]].values.astype("str"), y_val, batch_size=batch_size, shuffle=False)
history = model.fit( train_data, validation_data=valid_data, epochs=epochs, use_multiprocessing=True,  workers=-1 )

Output:

Epoch 1/2
11/9363 [..............................] - ETA: 16:31 - loss: 1.1949 - acc: 0.3580
31/9363 [..............................] - ETA: 13:51 - loss: 1.1223 - acc: 0.3831
...
Epoch 2/2
...
9363/9363 [==============================] - ETA: 0s - loss: 0.5691 - acc: 0.7724
9363/9363 [==============================] - 791s 84ms/step - loss: 0.5691 - acc: 0.7724 - val_loss: 0.4635 - val_acc: 0.8226
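The 9363 steps per epoch in the log follow from the floor division in __len__. A small sketch, using the sample counts printed earlier in this article and a hypothetical helper name steps_per_epoch:

```python
def steps_per_epoch(num_samples, batch_size):
    """Return (number of full batches, samples left out of the last partial batch)."""
    full_batches = num_samples // batch_size
    dropped = num_samples - full_batches * batch_size
    return full_batches, dropped

batch_size = 32
results = {}
for name, n in [("train", 299616), ("valid", 9842), ("test", 9824)]:
    results[name] = steps_per_epoch(n, batch_size)
    print(name, *results[name])
# train 9363 0
# valid 307 18
# test 307 0
```

With this batch size the training and test sets divide evenly, while the last 18 validation samples in the generator's order are never used.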

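Fine-Tuning the Model

This step fine-tunes the pretrained BERT model so that it adapts to the new task. Setting bert_model.trainable to True allows the BERT weights to be updated during fine-tuning. The model is recompiled with tf.keras.optimizers.Adam(1e-5) as the optimizer, so fine-tuning uses a small learning rate, and with categorical_crossentropy as the loss function, which measures the difference between the predicted distribution and the true label distribution. Finally, model.summary() shows the model structure and parameter counts; all parameters are now trainable.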

bert_model.trainable = True
model.compile( optimizer=tf.keras.optimizers.Adam(1e-5), loss="categorical_crossentropy",  metrics=["accuracy"] )
model.summary()

Output:

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_ids (InputLayer)         [(None, 128)]        0           []                               
 attention_masks (InputLayer)   [(None, 128)]        0           []                               
 token_type_ids (InputLayer)    [(None, 128)]        0           []                               
 bert (TFBertMainLayer)         TFBaseModelOutputWi  109482240   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_masks[0][0]',        
                                tentions(last_hidde               'token_type_ids[0][0]']         
                                n_state=(None, 128,                                               
                                 768),                                                            
                                 pooler_output=(Non                                               
                                e, 768),                                                          
                                 past_key_values=No                                               
                                ne, hidden_states=N                                               
                                one, attentions=Non                                               
                                e, cross_attentions                                               
                                =None)                                                            
 bidirectional (Bidirectional)  (None, 128, 128)     426496      ['bert[0][0]']                   
 global_average_pooling1d (Glob  (None, 128)         0           ['bidirectional[0][0]']          
 alAveragePooling1D)                                                                              
 global_max_pooling1d (GlobalMa  (None, 128)         0           ['bidirectional[0][0]']          
 xPooling1D)                                                                                      
 concatenate (Concatenate)      (None, 256)          0           ['global_average_pooling1d[0][0]'
                                                                 , 'global_max_pooling1d[0][0]']  
 dropout_37 (Dropout)           (None, 256)          0           ['concatenate[0][0]']            
 dense (Dense)                  (None, 3)            771         ['dropout_37[0][0]']             
==================================================================================================
Total params: 109,909,507
Trainable params: 109,909,507
Non-trainable params: 0

Training continues from the model above in fine-tuning mode. Validation accuracy rises from 0.8226 to 0.8974.

history = model.fit( train_data, validation_data=valid_data, epochs=epochs, use_multiprocessing=True, workers=-1,)

Output:

Epoch 1/2
7/9363 [..............................] - ETA: 24:41 - loss: 0.5716 - accuracy: 0.7946
...
Epoch 2/2
...
9363/9363 [==============================] - 1500s 160ms/step - loss: 0.3201 - accuracy: 0.8845 - val_loss: 0.2933 - val_accuracy: 0.8974

Model Evaluation

Evaluate the model's performance on the test data.

test_data = BertSemanticDataGenerator(  test_df[["sentence1", "sentence2"]].values.astype("str"), y_test, batch_size=batch_size, shuffle=False)
model.evaluate(test_data, verbose=1)

Output:

307/307 [==============================] - 18s 57ms/step - loss: 0.2916 - accuracy: 0.8951
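With one-hot labels, the accuracy figure is the fraction of samples whose highest-probability class matches the true class. The sketch below computes that by hand; categorical_accuracy is a hypothetical helper, and the labels and probabilities are made-up values, not model output.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def categorical_accuracy(y_true, y_prob):
    """Share of samples where the predicted class equals the one-hot label."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical one-hot labels and predicted probabilities for four samples.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.2, 0.5, 0.3]]

accuracy = categorical_accuracy(y_true, y_prob)
print(accuracy)  # 0.75
```

Three of the four toy predictions are correct; the third sample is predicted as class 1 while its label is class 2.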

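Inference Test

The function check_similarity checks the semantic relationship between two sentences, sentence1 and sentence2. It first packs the two sentences into an np.array, then uses BertSemanticDataGenerator to create a data generator that produces input in the format the model expects. The trained model returns the predicted probabilities for the sentence pair, and the class with the highest probability is taken as the prediction.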

def check_similarity(sentence1, sentence2):
    # Wrap the pair in a 2-D array, the shape the generator expects.
    sentence_pairs = np.array([[str(sentence1), str(sentence2)]])
    test_data = BertSemanticDataGenerator( sentence_pairs, labels=None, batch_size=1, shuffle=False, include_targets=False )
    proba = model.predict(test_data[0])[0]
    idx = np.argmax(proba)
    # The softmax output is a probability in [0, 1]; format it to two decimals.
    proba = f"{proba[idx]:.2f}"
    pred = labels[idx]
    return pred, proba
sentence1 = "Male in a blue jacket decides to lay in the grass"
sentence2 = "The guy wearing a blue jacket is laying on the green grass"
check_similarity(sentence1, sentence2)

Output:

('entailment', '0.51')
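The score 0.51 in the output above is a probability, since softmax outputs lie in [0, 1]. If a percentage is preferred, Python's % format specifier scales the value by 100. A minimal sketch with a hypothetical helper top_prediction and made-up probabilities:

```python
labels = ["contradiction", "entailment", "neutral"]

def top_prediction(proba):
    """Return the most likely label and its probability formatted as a percentage."""
    idx = max(range(len(proba)), key=proba.__getitem__)
    return labels[idx], f"{proba[idx]:.2%}"

# Hypothetical softmax output for one sentence pair.
result = top_prediction([0.20, 0.51, 0.29])
print(result)  # ('entailment', '51.00%')
```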

That covers how to implement semantic similarity with BERT in TensorFlow 2.10. Try working through the steps yourself.
