Python Data Analysis
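A missing value is an empty cell at some position in a data table. It can be handled by deleting the corresponding row or by filling (replacing) it. Either way, the missing values have to be located first. pandas displays a missing value as NaN. The isnull() method tests every cell and returns True where the value is missing and False otherwise.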
    import pandas as pd

    # Load the sample workbook
    df = pd.read_excel(r"D:\PythonTestFile\exam_nan.xlsx")
    print(df, '\n')
    # True marks a missing cell, False a cell that holds a value
    print(df.isnull())
The output is:

           Name     Sex  Chinese  English  Math
    0      Noah    male     90.0     50.0  66.0
    1      Emma     NaN     56.0     56.0  55.0
    2       NaN     NaN      NaN      NaN   NaN
    3    Olivia  female     86.0     87.0   NaN
    4      Liam    male     55.0      NaN  69.0
    5    Sophia  female     90.0     66.0  96.0
    6      Liam    male     55.0      NaN  69.0
    7  Isabella  female      NaN     85.0  55.0

        Name    Sex  Chinese  English   Math
    0  False  False    False    False  False
    1  False   True    False    False  False
    2   True   True     True     True   True
    3  False  False    False    False   True
    4  False  False    False     True  False
    5  False  False    False    False  False
    6  False  False    False     True  False
    7  False  False     True    False  False
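Once the boolean mask is available, it can be summarized to see where the gaps are. The following is a minimal sketch, assuming an in-memory copy of the sample table rebuilt from the printed output above (the Excel file itself is not reproduced here); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory copy of the sample table, rebuilt from the printed output
df = pd.DataFrame({
    "Name": ["Noah", "Emma", np.nan, "Olivia", "Liam", "Sophia", "Liam", "Isabella"],
    "Sex": ["male", np.nan, np.nan, "female", "male", "female", "male", "female"],
    "Chinese": [90, 56, np.nan, 86, 55, 90, 55, np.nan],
    "English": [50, 56, np.nan, 87, np.nan, 66, np.nan, 85],
    "Math": [66, 55, np.nan, np.nan, 69, 96, 69, 55],
})

mask = df.isnull()                        # same shape as df, True where a value is missing
missing_per_column = mask.sum()           # True counts as 1, so this counts gaps per column
missing_per_row = mask.sum(axis=1)        # count of gaps in each row
rows_with_missing = df[mask.any(axis=1)]  # rows that contain at least one gap

print(missing_per_column)
print(rows_with_missing)
```

Summing a boolean mask works because True is counted as 1, so the per-column totals show at a glance which columns need the most attention.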
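In a data table, some rows or columns contain a few missing values, while others consist entirely of missing values, that is, they are blank rows or blank columns. When handling missing values, you can delete every row or column that contains one, or delete only the blank rows or blank columns.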
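The dropna() method removes the rows (or columns) that contain missing values (NaN) and returns the result; the original DataFrame is left unchanged. The parameters how, axis, and subset control what is removed.
how: when set to 'all', only blank rows or blank columns are removed; when omitted (default 'any'), every row or column that contains at least one missing value is removed.
axis: when set to 1, the columns containing missing values are removed; when set to 0 (the default), the rows containing missing values are removed.
subset: a list of labels that restricts where missing values are looked for. With axis=0 it lists column labels; with axis=1 it lists row labels.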
    import pandas as pd

    df = pd.read_excel(r"D:\PythonTestFile\exam_nan.xlsx")
    print('DataFrame:')
    print(df, '\n')
    # Drop only the rows that are entirely missing
    print("how='all':")
    print(df.dropna(how='all'), '\n')
    # Defaults: drop every row that contains at least one missing value
    print('all parameters omitted (defaults):')
    print(df.dropna(), '\n')
    # Drop the columns that have a missing value in row 3 or row 4
    print('axis=1,subset=[3,4]:')
    print(df.dropna(axis=1, subset=[3, 4]), '\n')
    # Drop the rows that have a missing value in Chinese or Math
    print("axis=0,subset=['Chinese','Math']:")
    print(df.dropna(axis=0, subset=['Chinese', 'Math']))

The output is:

    DataFrame:
           Name     Sex  Chinese  English  Math
    0      Noah    male     90.0     50.0  66.0
    1      Emma     NaN     56.0     56.0  55.0
    2       NaN     NaN      NaN      NaN   NaN
    3    Olivia  female     86.0     87.0   NaN
    4      Liam    male     55.0      NaN  69.0
    5    Sophia  female     90.0     66.0  96.0
    6      Liam    male     55.0      NaN  69.0
    7  Isabella  female      NaN     85.0  55.0

    how='all':
           Name     Sex  Chinese  English  Math
    0      Noah    male     90.0     50.0  66.0
    1      Emma     NaN     56.0     56.0  55.0
    3    Olivia  female     86.0     87.0   NaN
    4      Liam    male     55.0      NaN  69.0
    5    Sophia  female     90.0     66.0  96.0
    6      Liam    male     55.0      NaN  69.0
    7  Isabella  female      NaN     85.0  55.0

    all parameters omitted (defaults):
         Name     Sex  Chinese  English  Math
    0    Noah    male     90.0     50.0  66.0
    5  Sophia  female     90.0     66.0  96.0

    axis=1,subset=[3,4]:
           Name     Sex  Chinese
    0      Noah    male     90.0
    1      Emma     NaN     56.0
    2       NaN     NaN      NaN
    3    Olivia  female     86.0
    4      Liam    male     55.0
    5    Sophia  female     90.0
    6      Liam    male     55.0
    7  Isabella  female      NaN

    axis=0,subset=['Chinese','Math']:
         Name     Sex  Chinese  English  Math
    0    Noah    male     90.0     50.0  66.0
    1    Emma     NaN     56.0     56.0  55.0
    4    Liam    male     55.0      NaN  69.0
    5  Sophia  female     90.0     66.0  96.0
    6    Liam    male     55.0      NaN  69.0
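The results above keep their original index labels, which leaves gaps after rows are dropped. The sketch below, on a small hypothetical table rather than the sample workbook, shows two things that may be useful alongside how, axis, and subset: dropna() leaves the original DataFrame untouched, and the result can be renumbered with reset_index(drop=True). It also uses thresh, a dropna() parameter not covered in the text above, which keeps only the rows having at least the given number of non-missing values.

```python
import numpy as np
import pandas as pd

# Small hypothetical table, not the sample workbook
df = pd.DataFrame({
    "Name": ["Noah", "Emma", np.nan, "Olivia"],
    "Chinese": [90.0, 56.0, np.nan, np.nan],
    "English": [50.0, np.nan, np.nan, np.nan],
    "Math": [66.0, 55.0, np.nan, 87.0],
})

complete_rows = df.dropna()              # rows without any missing value
non_blank_rows = df.dropna(how="all")    # everything except fully blank rows
at_least_three = df.dropna(thresh=3)     # rows with 3 or more non-missing values

# dropna() returned new objects; df still has all four rows
renumbered = at_least_three.reset_index(drop=True)  # index restarts at 0

print(df.shape)
print(renumbered)
```

Assigning the result to a new name, as here, keeps the raw data available for comparison while the cleaned copy is used for analysis.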
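The meaning of the axis parameter can be understood by analogy with how NumPy's concatenate() function, introduced earlier, merges arrays:
axis=0 means the data changes along the column direction (vertically): rows are added or removed.
axis=1 means the data changes along the row direction (horizontally): columns are added or removed.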
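The analogy can be checked by comparing shapes. This is a minimal sketch with made-up arrays and a made-up two-column table: concatenating along axis=0 adds rows and along axis=1 adds columns, while dropna() along axis=0 removes rows and along axis=1 removes columns.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2],
              [3, 4]])

# axis=0: the row count changes, the column count stays the same
rows_added = np.concatenate([a, np.array([[5, 6]])], axis=0)

# axis=1: the column count changes, the row count stays the same
cols_added = np.concatenate([a, np.array([[7], [8]])], axis=1)

# Hypothetical table with one missing value in column A
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

rows_dropped = df.dropna(axis=0)  # removes the row holding the NaN
cols_dropped = df.dropna(axis=1)  # removes column A, which holds the NaN

print(rows_added.shape, cols_added.shape)
print(rows_dropped.shape, cols_dropped.shape)
```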
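Besides deleting missing values, you can also fill them, that is, replace each missing value with a specified value. The fillna() method fills missing values; its argument is the new value to use.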
1. Fill all missing values.
    import pandas as pd

    df = pd.read_excel(r"D:\PythonTestFile\exam_nan.xlsx")
    print(df, '\n')
    # Replace every missing value with 0
    print(df.fillna(0))
The output is:

           Name     Sex  Chinese  English  Math
    0      Noah    male     90.0     50.0  66.0
    1      Emma     NaN     56.0     56.0  55.0
    2       NaN     NaN      NaN      NaN   NaN
    3    Olivia  female     86.0     87.0   NaN
    4      Liam    male     55.0      NaN  69.0
    5    Sophia  female     90.0     66.0  96.0
    6      Liam    male     55.0      NaN  69.0
    7  Isabella  female      NaN     85.0  55.0

           Name     Sex  Chinese  English  Math
    0      Noah    male     90.0     50.0  66.0
    1      Emma       0     56.0     56.0  55.0
    2         0       0      0.0      0.0   0.0
    3    Olivia  female     86.0     87.0   0.0
    4      Liam    male     55.0      0.0  69.0
    5    Sophia  female     90.0     66.0  96.0
    6      Liam    male     55.0      0.0  69.0
    7  Isabella  female      0.0     85.0  55.0
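All missing values above are filled with 0. The filled value is displayed according to each column's data type: it appears as 0 in the text columns (Name, Sex) and as 0.0 in the floating-point columns (Chinese, English, Math).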
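The difference in display comes from the column dtypes, which can be inspected after filling. The sketch below uses a small hypothetical table and assumes the text column is stored with the object dtype, which is what the output above suggests; other pandas string dtypes may treat a numeric fill value differently.

```python
import numpy as np
import pandas as pd

# Hypothetical table; the text column is explicitly stored as object dtype (assumption)
df = pd.DataFrame({
    "Sex": pd.Series(["male", np.nan, "female"], dtype=object),
    "Math": [66.0, np.nan, 96.0],
})

filled = df.fillna(0)

# Math stays float64, so the fill value is shown as 0.0;
# Sex stays object, so the fill value is stored as the integer 0
print(filled.dtypes)
print(filled)
```

Because a 0 inside a text column mixes types, filling per column with a type-appropriate value is often the cleaner choice.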
2. Fill selected columns.
    import pandas as pd

    df = pd.read_excel(r"D:\PythonTestFile\exam_nan.xlsx")
    print(df, '\n')
    # A dict maps each column label to its own fill value
    print(df.fillna({'Name': 'XXX', 'Sex': 'unknown'}))
The output is:

           Name     Sex  Chinese  English  Math
    0      Noah    male     90.0     50.0  66.0
    1      Emma     NaN     56.0     56.0  55.0
    2       NaN     NaN      NaN      NaN   NaN
    3    Olivia  female     86.0     87.0   NaN
    4      Liam    male     55.0      NaN  69.0
    5    Sophia  female     90.0     66.0  96.0
    6      Liam    male     55.0      NaN  69.0
    7  Isabella  female      NaN     85.0  55.0

           Name      Sex  Chinese  English  Math
    0      Noah     male     90.0     50.0  66.0
    1      Emma  unknown     56.0     56.0  55.0
    2       XXX  unknown      NaN      NaN   NaN
    3    Olivia   female     86.0     87.0   NaN
    4      Liam     male     55.0      NaN  69.0
    5    Sophia   female     90.0     66.0  96.0
    6      Liam     male     55.0      NaN  69.0
    7  Isabella   female      NaN     85.0  55.0
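Only the Name and Sex columns are filled above, each with a different value. Columns not named in the dict keep their missing values. This is the basic way to customize the fill value per column.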
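The same dict form also accepts computed values. As one possible extension, not taken from the example above, the sketch below fills a text column with a placeholder and each numeric column with its own mean; the table and the choice of the mean are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Small hypothetical table, not the sample workbook
df = pd.DataFrame({
    "Name": ["Noah", "Emma", np.nan],
    "Chinese": [90.0, 56.0, np.nan],
    "Math": [66.0, np.nan, 96.0],
})

# One fill value per column: a placeholder for text, the column mean for scores.
# mean() skips missing values by default.
fill_values = {
    "Name": "XXX",
    "Chinese": df["Chinese"].mean(),
    "Math": df["Math"].mean(),
}

filled = df.fillna(fill_values)
print(filled)
```

Whether the mean, the median, or a fixed value is appropriate depends on the data; the dict simply lets each column follow its own rule.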
Copyright © 2017-Now pnotes.cn. All Rights Reserved.