How to Clean Data in Python with Pandas and NumPy

Published: 2022-04-13 13:39:48 Source: Yisu Cloud (億速云) Views: 190 Author: iii Category: Development Technology

This article walks step by step through cleaning data in Python with Pandas and NumPy.

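Many data scientists estimate that the initial steps of acquiring and cleaning data account for 80% of the work. A great deal of time goes into cleaning datasets and reducing them to a usable form.

So if you are new to this field or plan to enter it, you need to be able to handle messy data, whether that means missing values, inconsistent formats, malformed records, or nonsensical outliers.

This article uses Python's Pandas and NumPy libraries to clean data.

Preparation

Import the modules, then start the actual preprocessing.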

import pandas as pd
import numpy as np

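Dropping DataFrame Columns

You will often find that not every category of data in a dataset is useful. For example, you might have a dataset of student information (name, grade, standard, parents' names, and address) but only want to analyze student grades. In that case the addresses and parents' names do not matter, and keeping them takes up unnecessary space.

The examples in this section use BL-Flickr-Images-Book.csv.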

df = pd.read_csv('數據科學必備Pandas、NumPy進行數據清洗/BL-Flickr-Images-Book.csv')
df.head()


The columns Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Contributors, Issuance type, and Shelfmarks carry no information that helps the analysis, so they can be dropped in one batch.

to_drop_column = [ 'Edition Statement',
                   'Corporate Author',
                   'Corporate Contributors',
                   'Former owner',
                   'Engraver',
                   'Contributors',
                   'Issuance type',
                   'Shelfmarks']

# columns=... is equivalent to passing the list with axis=1
df.drop(columns=to_drop_column, inplace=True)
df.head()
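
As a minimal sketch of the same step on invented data (the `books` frame and its values are hypothetical, not taken from the CSV), the drop can also be written without `inplace=True`, and `errors="ignore"` makes it safe to run again:

```python
import pandas as pd

# Toy data; the column names mirror a few of those in the book dataset.
books = pd.DataFrame({
    "Identifier": [206, 216],
    "Title": ["Book A", "Book B"],
    "Engraver": [None, None],
    "Shelfmarks": ["X 1", "X 2"],
})

unwanted = ["Engraver", "Shelfmarks"]

# Returning a new frame leaves the original untouched.
trimmed = books.drop(columns=unwanted)

# errors="ignore" skips names that are not present instead of raising KeyError.
trimmed_again = trimmed.drop(columns=unwanted, errors="ignore")

print(list(trimmed.columns))
```

Keeping the original frame around makes it easy to compare before and after while exploring.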


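Changing the DataFrame Index

A Pandas Index extends the functionality of NumPy arrays to allow more versatile slicing and labeling. In many cases it helps to use a uniquely valued identifying field of the data as its index.

First check that the Identifier column contains only unique values.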

df['Identifier'].is_unique
True

Replace the existing index with the Identifier column.

df = df.set_index('Identifier')
df.head()


206 is the first label in the index. That row can be accessed by label with df.loc[206], or by position with df.iloc[0].
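
The difference between label-based and position-based access can be sketched on a small invented frame (the `books` data here is hypothetical):

```python
import pandas as pd

books = pd.DataFrame({
    "Identifier": [206, 216, 218],
    "Title": ["Book A", "Book B", "Book C"],
})

# Only promote a column to the index if its values are unique.
if not books["Identifier"].is_unique:
    raise ValueError("Identifier contains duplicates")
indexed = books.set_index("Identifier")

by_label = indexed.loc[206]      # label-based lookup
by_position = indexed.iloc[0]    # position-based lookup
print(by_label.equals(by_position))
```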

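Tidying DataFrame Fields

Clean specific columns and convert them to a uniform format, both to understand the dataset better and to enforce consistency.

Start with the Date of Publication column, whose values are not in a consistent format.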

df.loc[1905:, 'Date of Publication'].head(10)

Identifier
1905           1888
1929    1839, 38-54
2836           1897
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object

A regular expression can extract the four consecutive digits at the start of each value.

extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
extr.head()

Identifier
206    1879
216    1868
218    1869
472    1851
480    1857
Name: Date of Publication, dtype: object

Finally, convert the result to a numeric column.

df['Date of Publication'] = pd.to_numeric(extr)
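
To see what the anchored pattern does with awkward values, here is a small sketch on an invented Series (the sample strings are made up in the style of the column, not copied from the CSV). Values that do not start with four digits come back as missing:

```python
import pandas as pd

dates = pd.Series(["1888", "1839, 38-54", "1860-63", "[1897?]", None],
                  name="Date of Publication")

# ^(\d{4}) only matches when the value starts with four digits.
leading = dates.str.extract(r"^(\d{4})", expand=False)

# Dropping the ^ anchor takes the first run of four digits anywhere.
anywhere = dates.str.extract(r"(\d{4})", expand=False)

# Unmatched rows are missing, so the numeric result is a float column.
years = pd.to_numeric(leading)
print(years.tolist())
```

Whether to keep the `^` anchor is a judgment call: it is stricter, but it discards bracketed years that the unanchored pattern would recover.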

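Cleaning Columns with str Methods and NumPy

df['Date of Publication'].str: this attribute gives access to fast string operations in Pandas. These operations largely mimic those of native Python strings or compiled regular expressions, such as .split(), .replace(), and .capitalize().

To clean the Place of Publication field, we can combine Pandas' str methods with NumPy's np.where function, which is essentially a vectorized form of Excel's IF() macro.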

np.where(condition, then, else)

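Here condition is either an array-like object or a Boolean mask. then is the value used where condition evaluates to True, and else is the value used otherwise.

In essence, np.where() takes each element of the object used as the condition, checks whether that element evaluates to True in the context of the condition, and returns an ndarray containing then or else, whichever applies. Calls can be nested into compound if-then statements, so values can be computed from multiple conditions.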
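
A minimal sketch of single and nested np.where calls, using an invented array of years:

```python
import numpy as np

years = np.array([1795, 1850, 1901, 1865])

# Single condition: the second argument is used where the mask is True,
# the third everywhere else.
is_old = np.where(years < 1800, "pre-1800", "1800 or later")

# Nested call: the "else" branch is itself another np.where.
century = np.where(years < 1800, "18th",
                   np.where(years < 1900, "19th", "20th"))
print(century.tolist())
```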

Now look at the Place of Publication data.

df['Place of Publication'].head(10)

Identifier
206                                  London
216                London; Virtue & Yorston
218                                  London
472                                  London
480                                  London
481                                  London
519                                  London
667     pp. 40. G. Bryan & Co: Oxford, 1898
874                                 London]
1143                                 London
Name: Place of Publication, dtype: object

Use str.contains to flag the rows that mention London.

pub = df['Place of Publication']
london = pub.str.contains('London')
london[:5]

Identifier
206    True
216    True
218    True
472    True
480    True
Name: Place of Publication, dtype: bool

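Then use np.where to set those rows to 'London' and replace hyphens with spaces in the rest.

df['Place of Publication'] = np.where(london, 'London',
                                      pub.str.replace('-', ' '))
df['Place of Publication']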

Identifier
206                     London
216                     London
218                     London
472                     London
480                     London
                  ...         
4158088                 London
4158128                  Derby
4159563                 London
4159587    Newcastle upon Tyne
4160339                 London
Name: Place of Publication, Length: 8287, dtype: object
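
If the column might contain missing values, the mask from str.contains will contain them too. The following sketch, on an invented Series, assumes that missing places should simply not count as London:

```python
import numpy as np
import pandas as pd

places = pd.Series(["London; Virtue & Yorston", "London]",
                    "Newcastle-upon-Tyne", None],
                   name="Place of Publication")

# na=False turns missing values into False, so the mask is strictly boolean.
is_london = places.str.contains("London", na=False)

# regex=False treats "-" as a literal character, not a pattern.
dehyphenated = places.str.replace("-", " ", regex=False)

# np.where returns a plain ndarray, so wrap it to keep the index and name.
cleaned = pd.Series(np.where(is_london, "London", dehyphenated),
                    index=places.index, name=places.name)
print(cleaned.tolist())
```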

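Cleaning the Whole Dataset with apply

In some cases you need to apply a custom function to every cell or element. The Series.apply() method is similar to the built-in map() function: it applies a function to each element of a Series. To apply a function to every element of an entire DataFrame, use DataFrame.applymap() (called DataFrame.map() in recent Pandas versions).

For example, to format the publication date as "xxxx年" (the year followed by the Chinese character for "year"), use apply.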

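def clean_date(text):
    # Numeric years become e.g. "1879年"; NaN and non-numeric values pass through unchanged.
    try:
        return str(int(text)) + "年"
    except (ValueError, TypeError):
        return text

df["new_date"] = df["Date of Publication"].apply(clean_date)
df["new_date"]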

Identifier
206        1879年
216        1868年
218        1869年
472        1851年
480        1857年
           ...  
4158088    1838年
4158128    1831年
4159563      NaN
4159587    1834年
4160339    1834年
Name: new_date, Length: 8287, dtype: object
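
The same formatting can be sketched on an invented Series. The helper name `format_year` is hypothetical, and the vectorized variant is one possible alternative that assumes every non-missing year is a whole number:

```python
import numpy as np
import pandas as pd

def format_year(value):
    """Return e.g. '1879年' for a numeric year; leave anything else untouched."""
    try:
        return f"{int(value)}年"
    except (ValueError, TypeError):
        return value

years = pd.Series([1879.0, 1868.0, np.nan])

# Series.apply calls format_year once per element.
labelled = years.apply(format_year)

# Vectorized alternative: nullable integers, then strings; missing stays missing.
vectorized = years.astype("Int64").astype("string") + "年"
print(labelled.tolist())
```

For a column of a few thousand rows either form is fine; the vectorized one avoids a Python-level function call per element.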

Skipping Rows in a DataFrame

olympics_df = pd.read_csv('數據科學必備Pandas、NumPy進行數據清洗/olympics.csv')
olympics_df.head()


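You can pass a parameter when reading the data to skip unwanted rows. For example, header=1 uses the row at index 1 as the header and discards the row at index 0.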

olympics_df = pd.read_csv('數據科學必備Pandas、NumPy進行數據清洗/olympics.csv',header=1)
olympics_df.head()
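
The effect of header=1 can be reproduced without the file. In this sketch the CSV text is invented and only imitates the shape of the problem: a filler first line followed by the real header.

```python
import io
import pandas as pd

# Invented CSV text: line 0 is numeric filler, line 1 holds the real header.
raw = "0,1,2\nCountry,Gold,Silver\nA,1,2\nB,3,4\n"

default = pd.read_csv(io.StringIO(raw))           # line 0 becomes the header
fixed = pd.read_csv(io.StringIO(raw), header=1)   # line 1 becomes the header, line 0 is dropped

print(list(fixed.columns))
```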


Renaming DataFrame Columns

new_names = {'Unnamed: 0': 'Country',
             '? Summer': 'Summer Olympics',
             '01 !': 'Gold',
             '02 !': 'Silver',
             '03 !': 'Bronze',
             '? Winter': 'Winter Olympics',
             '01 !.1': 'Gold.1',
             '02 !.1': 'Silver.1',
             '03 !.1': 'Bronze.1',
             '? Games': '# Games',
             '01 !.2': 'Gold.2',
             '02 !.2': 'Silver.2',
             '03 !.2': 'Bronze.2'}

olympics_df.rename(columns=new_names, inplace=True)

olympics_df.head()
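
One thing to watch with rename: keys that match no column are skipped silently by default, so a typo in the mapping goes unnoticed. A small sketch on an invented frame (the `medals` data and the deliberately wrong key are hypothetical):

```python
import pandas as pd

medals = pd.DataFrame({"Unnamed: 0": ["A", "B"], "01 !": [1, 3], "01 !.1": [0, 2]})

mapping = {"Unnamed: 0": "Country", "01 !": "Gold", "01 !.1": "Gold.1",
           "not a column": "ignored"}

# Keys missing from the frame are skipped silently by default.
renamed = medals.rename(columns=mapping)

# A quick check for mapping keys that matched nothing.
unmatched = [old for old in mapping if old not in medals.columns]
print(unmatched)
```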


That concludes this walkthrough of cleaning data in Python with Pandas and NumPy. To really absorb these techniques, try them out on data of your own.
