How to run the apply function with multiple processes in pandas

Published: 2021-04-08 17:27:26 | Source: Yisu Cloud | Reads: 417 | Author: Leah | Category: Development

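This article explains how to run the pandas apply function across multiple processes. It covers grouping with DataFrame.groupby, the basics of joblib, and a stop-word removal experiment that compares single-process and multi-process runs.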

1. DataFrame.groupby: grouping and aggregation

import pandas as pd

# groupby operations
df1 = pd.DataFrame({'a': [1, 2, 1, 2, 1, 2], 'b': [3, 3, 3, 4, 4, 4], 'data': [12, 13, 11, 8, 10, 3]})
df1

Group by a single column

grouped = df1.groupby('b')
# Grouped by column 'b': name is the key value from 'b', group is the matching sub-DataFrame
for name, group in grouped:
    print(name, '->')
    print(group)

Group by multiple columns

grouped = df1.groupby(['a', 'b'])
# Grouped by columns 'a' and 'b': name is the (a, b) key tuple, group is the matching sub-DataFrame
for name, group in grouped:
    print(name, '->')
    print(group)

(1, 3) ->
 a b data
0 1 3 12
2 1 3 11
(1, 4) ->
 a b data
4 1 4 10
(2, 3) ->
 a b data
1 2 3 13
(2, 4) ->
 a b data
3 2 4  8
5 2 4  3
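
The loops above only print each group. As a minimal sketch of the aggregation step that the section title mentions, the same df1 can be reduced per group with sum() or mean(); the variable names sums and means are introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 1, 2, 1, 2],
                    'b': [3, 3, 3, 4, 4, 4],
                    'data': [12, 13, 11, 8, 10, 3]})

# Sum of 'data' within each value of 'b'.
sums = df1.groupby('b')['data'].sum()

# Mean of 'data' within each (a, b) pair; the result has a MultiIndex.
means = df1.groupby(['a', 'b'])['data'].mean()

print(sums)
print(means)
```

Selecting the column with ['data'] before aggregating keeps the result a Series rather than a DataFrame.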

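If df.index is a list such as [1,2,3…], grouping by df.index puts every row in its own group. The stop-word removal experiment below uses this trick to turn df_all into a list with one row per element, and then processes that list with multiple processes.

grouped = df1.groupby(df1.index)
# Grouped by index, so every row is its own group
print(len(grouped), type(grouped))
for name, group in grouped:
    print(name, '->')
    print(group)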

6 <class 'pandas.core.groupby.DataFrameGroupBy'>
0 ->
 a b data
0 1 3 12
1 ->
 a b data
1 2 3 13
2 ->
 a b data
2 1 3 11
3 ->
 a b data
3 2 4  8
4 ->
 a b data
4 1 4 10
5 ->
 a b data
5 2 4  3
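
The per-row grouping can also be collected into a plain list, which is the shape that a parallel map expects. This is a small sketch using the same df1; row_groups and restored are names introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 1, 2, 1, 2],
                    'b': [3, 3, 3, 4, 4, 4],
                    'data': [12, 13, 11, 8, 10, 3]})

# One single-row DataFrame per index value, in index order.
row_groups = [group for _, group in df1.groupby(df1.index)]

print(len(row_groups))   # number of groups equals the number of rows
print(row_groups[0])     # the first group holds only the first row

# Concatenating the pieces gives back a frame equal to the original.
restored = pd.concat(row_groups)
print(restored.equals(df1))
```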

2. Using joblib

Reference: https://pypi.python.org/pypi/joblib

# 1. Embarrassingly parallel helper: to make it easy to write readable parallel code and debug it quickly:
from joblib import Parallel, delayed
from math import sqrt

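For a small workload, multiple processes bring no benefit: the 8-process run below is slower than the single-process run.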

%time result1 = Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10000))
%time result2 = Parallel(n_jobs=8)(delayed(sqrt)(i**2) for i in range(10000))

CPU times: user 316 ms, sys: 0 ns, total: 316 ms
Wall time: 309 ms
CPU times: user 692 ms, sys: 384 ms, total: 1.08 s
Wall time: 1.03 s

With a large amount of data, parallel processing shows its advantage.

%time result = Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(1000000))

CPU times: user 3min 43s, sys: 5.66 s, total: 3min 49s
Wall time: 3min 33s

%time result = Parallel(n_jobs=8)(delayed(sqrt)(i**2) for i in range(1000000))

CPU times: user 50.9 s, sys: 12.6 s, total: 1min 3s
Wall time: 52 s
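
The timings above suggest that dispatch overhead dominates when each task is as tiny as one sqrt call. One common way to reduce that overhead, sketched here under the assumption that the work can be split into independent slices, is to hand each worker a whole chunk per task. The helpers sqrt_chunk and chunked are hypothetical names introduced for this example and are not part of joblib.

```python
from math import sqrt

from joblib import Parallel, delayed


def sqrt_chunk(values):
    """Compute sqrt for a whole chunk of numbers inside one task."""
    return [sqrt(v) for v in values]


def chunked(seq, size):
    """Split seq into consecutive slices of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


squares = [i ** 2 for i in range(10000)]

# Four tasks of 2500 numbers each instead of 10000 one-number tasks.
parts = Parallel(n_jobs=2)(delayed(sqrt_chunk)(c) for c in chunked(squares, 2500))

# Parallel returns results in submission order, so flattening keeps the order.
result = [x for part in parts for x in part]
print(len(parts), len(result))
```

Whether chunking actually pays off depends on the machine and the workload, so it is worth timing both variants.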

3. Running apply with multiple processes (stop-word removal)

The multi-process implementation mainly follows a Stack Overflow answer: Parallelize apply after pandas groupby

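The goal is to strip the stop words from AbstractText, producing something like AbstractText1. First, load the stop-word list.

import re

# Read the stop-word file and extract every double-quoted token
with open('stopwords.txt', 'r') as inp:
    lines = inp.read()
stopwords = re.findall('"(.*?)"', lines)
print(len(stopwords))
print(stopwords[:10])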

692
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across', 'actually', 'after']
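
To make the regular expression concrete, here is a small sketch on an in-memory string. The sample text is hypothetical; it only mimics the layout that the pattern implies, with each stop word wrapped in double quotes. Note that in Python 3 a str pattern only works on str input, so the file has to be read in text mode.

```python
import re

# Hypothetical sample mimicking the assumed file layout: words in double quotes.
sample = '"a", "a\'s", "able", "about"'

# The non-greedy group (.*?) captures the shortest text between two quotes,
# so each quoted word becomes one list element.
words = re.findall('"(.*?)"', sample)
print(words)
```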

# Remove stop words from AbstractText
# Method 1: brute force, check every word against the stop-word list
def remove_stopwords1(text):
    words = text.split(' ')
    new_words = list()
    for word in words:
        if word not in stopwords:
            new_words.append(word)
    return new_words

# Method 2: build a stop-word mapping first
# words_count is a pandas Series indexed by word (built elsewhere, not shown here)
for word in stopwords:
    if word in words_count.index:
        words_count[word] = -1

def remove_stopwords2(text):
    words = text.split(' ')
    new_words = list()
    for word in words:
        if words_count[word] != -1:
            new_words.append(word)
    return new_words

%time df_all['AbstractText1'] = df_all['AbstractText'].apply(remove_stopwords1)
%time df_all['AbstractText2'] = df_all['AbstractText'].apply(remove_stopwords2)

CPU times: user 8min 56s, sys: 2.72 s, total: 8min 59s
Wall time: 8min 48s
CPU times: user 1min 2s, sys: 4.12 s, total: 1min 6s
Wall time: 1min 2s

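Above I tried two different ways to remove stop words:

Method 1 is the brute-force approach: all stop words are stored in a list, and for every word in every text we check whether it appears in that list (complexity O(n), where n is the number of stop words). If it is a stop word, it is dropped.

Method 2 uses a Series (words_count) as a mapping over all words: if a word is a stop word, its value is set to -1. For each word w in a text, we then only need to check whether its value is -1 to decide whether it is a stop word (complexity O(1)).

Both methods run in a single process, and method 2 (1min 2s) is clearly faster than method 1 (8min 48s).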
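
The same O(1) lookup idea can be sketched with a plain Python set. This is an alternative to the words_count Series and not the method used in this article; the stop-word sample is the first few entries printed earlier, and remove_stopwords_set is a hypothetical name.

```python
# A few stop words taken from the list printed earlier in this article.
stopwords = ['a', "a's", 'able', 'about', 'above']

# Membership tests on a set take constant time on average,
# while the same test on a list scans the elements one by one.
stopword_set = set(stopwords)


def remove_stopwords_set(text):
    """Return the words of text that are not stop words."""
    return [word for word in text.split(' ') if word not in stopword_set]


print(remove_stopwords_set('a cat sat above the mat'))
```

Unlike the Series lookup, the set version does not require every word in the text to exist in a precomputed index.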

import time
import multiprocessing

import pandas as pd
from joblib import Parallel, delayed

# Method 3: method 1 with multiple processes
def tmp_func(df):
    df['AbstractText3'] = df['AbstractText'].apply(remove_stopwords1)
    return df

def apply_parallel(df_grouped, func):
    """Run func on every group in parallel using Parallel and delayed."""
    results = Parallel(n_jobs=-1)(delayed(func)(group) for name, group in df_grouped)
    return pd.concat(results)

if __name__ == '__main__':
    time0 = time.time()
    df_grouped = df_all.groupby(df_all.index)
    df_all = apply_parallel(df_grouped, tmp_func)
    print('time costed {0:.2f}'.format(time.time() - time0))

time costed 150.81

# Method 4: method 2 with multiple processes
def tmp_func(df):
    df['AbstractText3'] = df['AbstractText'].apply(remove_stopwords2)
    return df

def apply_parallel(df_grouped, func):
    """Run func on every group in parallel using Parallel and delayed."""
    results = Parallel(n_jobs=-1)(delayed(func)(group) for name, group in df_grouped)
    return pd.concat(results)

if __name__ == '__main__':
    time0 = time.time()
    df_grouped = df_all.groupby(df_all.index)
    df_all = apply_parallel(df_grouped, tmp_func)
    print('time costed {0:.2f}'.format(time.time() - time0))

time costed 123.80

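Methods 3 and 4 are the multi-process counterparts of methods 1 and 2. With multiple processes, method 1 became several times faster (from 8min 48s down to 150.81 s), but method 2 became slower rather than faster (from 1min 2s up to 123.80 s). Is something wrong? It does look odd, but we can at least be sure that the code itself is correct. The figures below show what the top command reported for each method. In both method 3 and method 4, all 12 CPU cores really are running; in method 4, however, the utilization of each core is fairly low.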

fig1. CPU usage with a single process

fig2. CPU usage for method 3
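
One plausible reason for the slowdown in method 4 (an assumption, not something measured in this article) is per-task overhead: grouping by df_all.index creates one task per row, and every task's DataFrame has to be pickled and sent to a worker. Below is a minimal sketch of an alternative that sends a few larger chunks instead. The names apply_parallel_chunks, process_chunk, n_chunks, STOPWORDS and the sample data are hypothetical and introduced only for illustration.

```python
import pandas as pd
from joblib import Parallel, delayed

# Small sample stop-word set, assumed for illustration only.
STOPWORDS = {'a', 'about', 'above'}


def remove_stopwords(text):
    """Return the words of text that are not in STOPWORDS."""
    return [w for w in text.split(' ') if w not in STOPWORDS]


def process_chunk(chunk):
    """Apply stop-word removal to one chunk and return the new chunk."""
    chunk = chunk.copy()  # avoid modifying a slice of the caller's frame
    chunk['AbstractText3'] = chunk['AbstractText'].apply(remove_stopwords)
    return chunk


def apply_parallel_chunks(df, func, n_chunks=4, n_jobs=2):
    """Split a non-empty df into row chunks and run func on each in parallel."""
    size = max(1, -(-len(df) // n_chunks))  # ceiling division
    chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    results = Parallel(n_jobs=n_jobs)(delayed(func)(c) for c in chunks)
    return pd.concat(results)


if __name__ == '__main__':
    sample = pd.DataFrame({'AbstractText': ['a cat', 'about time', 'above the sky',
                                            'plain text', 'a b c']})
    out = apply_parallel_chunks(sample, process_chunk)
    print(out['AbstractText3'].tolist())
```

With a handful of chunks per worker, each process spends most of its time inside apply rather than on task dispatch. Whether this recovers the expected speedup on the original df_all would still need to be timed.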

That concludes this walkthrough of running the pandas apply function with multiple processes.
