The steps for normalizing text with the NLTK library are as follows:
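import nltk
# 'all' downloads every NLTK data package, which is large; the steps below
# only rely on the tokenizer models and the stopwords corpus.
nltk.download('all')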
Tokenize the text with word_tokenize and sent_tokenize. For example:
from nltk.tokenize import word_tokenize
text = "Hello, how are you?"
tokens = word_tokenize(text)
print(tokens)
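The text names sent_tokenize but only demonstrates word_tokenize. As a rough illustration of sentence splitting, the sketch below instantiates PunktSentenceTokenizer directly with its default, untrained parameters so that it runs without any downloaded data. That substitution is an assumption of this example: sent_tokenize loads a pretrained Punkt model instead, which copes better with abbreviations such as "Dr.". The sample string is made up.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Made-up sample text with three short sentences.
sample = "Hello there. How are you? I am fine."

# Untrained Punkt tokenizer (assumption: used here instead of sent_tokenize
# so the example needs no downloaded model).
splitter = PunktSentenceTokenizer()
sentences = splitter.tokenize(sample)

for sentence in sentences:
    print(sentence)
```

With the tokenizer data installed, sent_tokenize(sample) can be called in place of splitter.tokenize(sample).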
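# Remove English stopwords (compared case-insensitively).
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in tokens if word.lower() not in stop_words]
print(filtered_words)
# Reduce each remaining token to its stem with the Porter stemmer.
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_words]
print(stemmed_words)
# Lowercase, drop tokens that are not purely alphanumeric (such as punctuation),
# and join the result back into a single string.
normalized_text = ' '.join([word.lower() for word in stemmed_words if word.isalnum()])
print(normalized_text)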
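The steps above can be collected into a single reusable function. This is only a minimal sketch under stated assumptions: so that it runs without downloaded data, it swaps in wordpunct_tokenize (a regex-based tokenizer) for word_tokenize and uses a tiny hypothetical stopword set, DEMO_STOP_WORDS, in place of stopwords.words('english'). The function name normalize is also an illustrative choice.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical, deliberately small stopword set for illustration only.
# Pass set(stopwords.words('english')) to use NLTK's full English list.
DEMO_STOP_WORDS = {"how", "are", "you", "the", "a", "is"}


def normalize(text, stop_words=DEMO_STOP_WORDS):
    """Tokenize, drop punctuation and stopwords, stem, and rejoin the text."""
    stemmer = PorterStemmer()
    # wordpunct_tokenize splits on whitespace and punctuation without
    # needing any downloaded model (assumption: used instead of word_tokenize).
    tokens = wordpunct_tokenize(text)
    kept = [t for t in tokens if t.isalnum() and t.lower() not in stop_words]
    # PorterStemmer.stem lowercases its input by default.
    return " ".join(stemmer.stem(t) for t in kept)


print(normalize("The cats are running"))
```

Filtering before stemming avoids passing punctuation tokens to the stemmer; the final string is the same as filtering afterwards, as in the steps above.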
Following these steps, you can use the NLTK library to normalize text so that it is easier to analyze and process.