This article looks at how to implement Chinese word segmentation and word-frequency statistics in Python. Many people are unsure how to approach this, so I have pulled together a simple, practical method from various sources; I hope it helps clear things up. Let's get started!
First, I prepared a text file named abstract.txt.
Next, I downloaded a stopword file from the internet (the stopwords to be filtered out after jieba segmentation) and saved it as chinesestopwords.txt.
A few entries in it are words I added myself because I considered them noise.
I also built my own custom dictionary, extraDict.txt.
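For reference, both helper files are plain text. jieba's user dictionary takes one entry per line, in the form word, optional frequency, optional part-of-speech tag, separated by spaces, while the stopword file is simply one word per line. The entries below are hypothetical examples, not the actual contents of my files:

extraDict.txt:

自然語言處理 5 n
詞頻統計
深度學習

chinesestopwords.txt:

的
了
是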
With the preparation done, let's see how it all fits together.
First the imports:
import jieba
from jieba.analyse import extract_tags  # not used by the script below (see the keyword-extraction note further down)
from sklearn.feature_extraction.text import TfidfVectorizer  # likewise unused here
Then the main code:
jieba.load_userdict('extraDict.txt')  # load the custom dictionary built above
def stopwordlist():
    stopwords = [line.strip() for line in open('chinesestopwords.txt', encoding='UTF-8').readlines()]
    # --- supplement the stopword list as needed for your own data ---
    for i in range(19):
        stopwords.append(str(10 + i))  # treat the numbers 10 through 28 as stopwords
    # ----------------------------------------------------------------
    return stopwords
def seg_word(line):
    # seg = jieba.cut_for_search(line.strip())
    seg = jieba.cut(line.strip())
    temp = ""
    counts = {}
    wordstop = stopwordlist()
    for word in seg:
        if word not in wordstop:
            if word != ' ':
                temp += word
                temp += '\n'
                counts[word] = counts.get(word, 0) + 1  # tally how many times each word appears
    return temp  # return the segmentation result, one word per line
    # return str(sorted(counts.items(), key=lambda x: x[1], reverse=True)[:20])  # alternative: the 20 most frequent words and their counts
def output(inputfilename, outputfilename):
    inputfile = open(inputfilename, encoding='UTF-8', mode='r')
    outputfile = open(outputfilename, encoding='UTF-8', mode='w')
    for line in inputfile.readlines():
        line_seg = seg_word(line)
        outputfile.write(line_seg)  # write the segmented words, one per line
    inputfile.close()
    outputfile.close()
if __name__ == '__main__':
    inputfilename = 'abstract.txt'
    outputfilename = 'a1.txt'
    output(inputfilename, outputfilename)
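A side note: the imports at the top also bring in extract_tags and TfidfVectorizer, which the script above never uses. Here is a minimal sketch of what extract_tags can do, assuming the same abstract.txt and chinesestopwords.txt prepared earlier (topK and withWeight are standard jieba.analyse parameters):

import jieba.analyse

jieba.analyse.set_stop_words('chinesestopwords.txt')  # apply the stopword list during keyword extraction
text = open('abstract.txt', encoding='UTF-8').read()
# topK sets how many keywords to return; withWeight=True also returns each TF-IDF weight
for keyword, weight in jieba.analyse.extract_tags(text, topK=10, withWeight=True):
    print(keyword, round(weight, 4))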
Let me first explain the idea behind the counting itself. For example, given the following passage:
Love is more than a word
it says so much.
When I see these four letters,
I almost feel your touch.
This is only happened since
I fell in love with you.
Why this word does this,
I haven’t got a clue.
To count how many times each word appears, the idea is simple: iterate over the words in the string while maintaining an initially empty dictionary, count_dict. For each word, check whether it already exists as a key in count_dict; if not, add it as a new key with the value 1, and if it does, increment the value stored under that key by 1.
Here is the code:
# define the string (use triple quotes when the string spans multiple lines)
sentences = """
Love is more than a word
it says so much.
When I see these four letters,
I almost feel your touch.
This is only happened since
I fell in love with you.
Why this word does this,
I haven't got a clue.
"""

# strip the punctuation; if there were many kinds of symbols you would loop
# over them, but two replace() calls are enough here
sentences = sentences.replace(',', '')  # remove the commas
sentences = sentences.replace('.', '')  # remove the periods
sentences = sentences.split()  # split into individual words; the result is a list
# print(sentences)

count_dict = {}
for word in sentences:
    if word not in count_dict:  # not in the counting dictionary yet: add it with count 1
        count_dict[word] = 1
    else:                       # already in the dictionary: increment its count
        count_dict[word] += 1

for key, value in count_dict.items():
    print(f"{key} appeared {value} time(s)")
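For comparison, the standard library's collections.Counter implements exactly this counting pattern, so the whole loop can be replaced by it; a minimal sketch, reusing the sentences word list produced by split() above:

from collections import Counter

count_dict = Counter(sentences)  # counts every distinct word in a single pass
for key, value in count_dict.most_common():  # most_common() sorts by count, descending
    print(f"{key} appeared {value} time(s)")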
The output looks like this (abridged; note that the comparison is case-sensitive, so "This" and "this" are counted separately):
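Love appeared 1 time(s)
is appeared 2 time(s)
more appeared 1 time(s)
than appeared 1 time(s)
a appeared 2 time(s)
word appeared 2 time(s)
it appeared 1 time(s)
...
I appeared 4 time(s)
...
this appeared 2 time(s)
...
clue appeared 1 time(s)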
That wraps up our look at how to implement Chinese word segmentation and word-frequency statistics in Python. I hope it has cleared up any confusion; pairing theory with hands-on practice is the best way to learn, so do give it a try! For more practical articles like this one, keep following the Yisu Cloud (億速云) site.