Python数据分析 知识量:13 - 56 - 232
所谓表的横向拼接就是指:在横向上将两个表依据某些公共列拼接到一起,这些公共列(称为连接键)在拼接过程中起到数据准确对接的关键作用。
在Python中实现横向拼接通常利用merge()函数。
横向拼接的连接方式可以分为:一对一、多对一和多对多。
当拼接的两个表的公共列数据都是唯一的,没有重复值时,拼接将是一对一的。使用merge()函数时,Python会自动寻找公共列作为连接键。下面示例中,Number列为公共列。
import pandas as pd df1=pd.read_excel(r"D:\PythonTestFile\pinjie1.xlsx") df2=pd.read_excel(r"D:\PythonTestFile\pinjie2.xlsx") print(df1,'\n') print(df2,'\n') print(pd.merge(df1,df2))
运行结果为:
Number Name Sex 0 101 Noah male 1 102 Emma female 2 103 Olivia female 3 104 Liam male 4 105 Sophia female Number Chinese English Math 0 101 90 50 66 1 102 56 56 55 2 103 86 87 44 3 104 55 88 69 4 105 90 66 96 Number Name Sex Chinese English Math 0 101 Noah male 90 50 66 1 102 Emma female 56 56 55 2 103 Olivia female 86 87 44 3 104 Liam male 55 88 69 4 105 Sophia female 90 66 96
当拼接的一个表的公共列有重复值,而另一个表没有重复值时,拼接将是多对一的。拼接的结果将保留所有重复值。
import pandas as pd df1=pd.read_excel(r"D:\PythonTestFile\pinjie1.xlsx") df2=pd.read_excel(r"D:\PythonTestFile\pinjie2.xlsx") print(df1,'\n') print(df2,'\n') print(pd.merge(df1,df2))
运行结果为:
Number Name Sex 0 101 Noah male 1 102 Emma female 2 103 Olivia female 3 104 Liam male 4 105 Sophia female Number Chinese English Math 0 101 90 50 66 1 102 56 56 55 2 103 86 87 44 3 104 55 88 69 4 105 90 66 96 5 103 77 97 68 6 101 88 68 72 Number Name Sex Chinese English Math 0 101 Noah male 90 50 66 1 101 Noah male 88 68 72 2 102 Emma female 56 56 55 3 103 Olivia female 86 87 44 4 103 Olivia female 77 97 68 5 104 Liam male 55 88 69 6 105 Sophia female 90 66 96
当拼接的两个表的公共列都有重复值时,拼接将是多对多的。多对多相当于多个多对一,所有重复值都将保留。
import pandas as pd df1=pd.read_excel(r"D:\PythonTestFile\pinjie1.xlsx") df2=pd.read_excel(r"D:\PythonTestFile\pinjie2.xlsx") print(df1,'\n') print(df2,'\n') print(pd.merge(df1,df2))
运行结果为:
Number Name Sex Chinese 0 101 Noah male 90 1 102 Emma female 56 2 103 Olivia female 86 3 104 Liam male 55 4 105 Sophia female 90 5 101 Noah male 77 6 104 Liam male 88 Number English Math 0 101 50 66 1 102 56 55 2 103 87 44 3 104 88 69 4 105 66 96 5 103 97 68 6 101 68 72 Number Name Sex Chinese English Math 0 101 Noah male 90 50 66 1 101 Noah male 90 68 72 2 101 Noah male 77 50 66 3 101 Noah male 77 68 72 4 102 Emma female 56 56 55 5 103 Olivia female 86 87 44 6 103 Olivia female 86 97 68 7 104 Liam male 55 88 69 8 104 Liam male 88 88 69 9 105 Sophia female 90 66 96
连接键是表拼接的关键,在没有指定连接键的情况下,merge()函数会默认寻找两个表的公共列。存在多个公共列时,可以使用参数on来指定连接键。
import pandas as pd df1=pd.read_excel(r"D:\PythonTestFile\pinjie1.xlsx") df2=pd.read_excel(r"D:\PythonTestFile\pinjie2.xlsx") print(df1,'\n') print(df2,'\n') print(pd.merge(df1,df2,on='Name'))
运行结果为:
Number Name Sex 0 101 Noah male 1 102 Emma female 2 103 Olivia female 3 104 Liam male 4 105 Sophia female Number Name Chinese English Math 0 101 Noah 90 50 66 1 102 Emma 56 56 55 2 103 Olivia 86 87 44 3 104 Liam 55 88 69 4 105 Sophia 90 66 96 Number_x Name Sex Number_y Chinese English Math 0 101 Noah male 101 90 50 66 1 102 Emma female 102 56 56 55 2 103 Olivia female 103 86 87 44 3 104 Liam male 104 55 88 69 4 105 Sophia female 105 90 66 96
因为两个表实际存在两个公共列,以上只将Name列指定为连接键,因此,Number列作为普通列在拼接结果中出现了两次。
也可以同时指定两个连接键,以列表的形式赋给参数on:
import pandas as pd df1=pd.read_excel(r"D:\PythonTestFile\pinjie1.xlsx") df2=pd.read_excel(r"D:\PythonTestFile\pinjie2.xlsx") print(df1,'\n') print(df2,'\n') print(pd.merge(df1,df2,on=['Name','Number']))
运行结果为:
Number Name Sex 0 101 Noah male 1 102 Emma female 2 103 Olivia female 3 104 Liam male 4 105 Sophia female Number Name Chinese English Math 0 101 Noah 90 50 66 1 102 Emma 56 56 55 2 103 Olivia 86 87 44 3 104 Liam 55 88 69 4 105 Sophia 90 66 96 Number Name Sex Chinese English Math 0 101 Noah male 90 50 66 1 102 Emma female 56 56 55 2 103 Olivia female 86 87 44 3 104 Liam male 55 88 69 4 105 Sophia female 90 66 96
当要拼接的两个表没有公共列,但又存在实质相关列时,仍然可以拼接,只不过因为没有名称相同的公共列,merge()函数不能自动查找确定连接键,需要人为指定连接键。
左连接键使用参数left_on指定左表的连接键的列名;右连接键使用参数right_on指定右表的连接键的列名。
import pandas as pd df1=pd.read_excel(r"D:\PythonTestFile\pinjie1.xlsx") df2=pd.read_excel(r"D:\PythonTestFile\pinjie2.xlsx") print(df1,'\n') print(df2,'\n') print(pd.merge(df1,df2,left_on='Id',right_on='Number'))
运行结果为:
Id Name Sex 0 101 Noah male 1 102 Emma female 2 103 Olivia female 3 104 Liam male 4 105 Sophia female Number Chinese English Math 0 101 90 50 66 1 102 56 56 55 2 103 86 87 44 3 104 55 88 69 4 105 90 66 96 Id Name Sex Number Chinese English Math 0 101 Noah male 101 90 50 66 1 102 Emma female 102 56 56 55 2 103 Olivia female 103 86 87 44 3 104 Liam male 104 55 88 69 4 105 Sophia female 105 90 66 96
索引列不算真正的列,但拼接时也可以将索引列作为连接键使用,参数分别是left_index和right_index。
import pandas as pd df1=pd.read_excel(r"D:\PythonTestFile\pinjie1.xlsx") df2=pd.read_excel(r"D:\PythonTestFile\pinjie2.xlsx") print(df1,'\n') print(df2,'\n') print(pd.merge(df1,df2,left_index=True,right_index=True))
运行结果为:
Id Name Sex 0 101 Noah male 1 102 Emma female 2 103 Olivia female 3 104 Liam male 4 105 Sophia female Number Chinese English Math 0 101 90 50 66 1 102 56 56 55 2 103 86 87 44 3 104 55 88 69 4 105 90 66 96 Id Name Sex Number Chinese English Math 0 101 Noah male 101 90 50 66 1 102 Emma female 102 56 56 55 2 103 Olivia female 103 86 87 44 3 104 Liam male 104 55 88 69 4 105 Sophia female 105 90 66 96
以上拼接的结果恰好是对的,但实际拼接时需要确保利用索引的拼接是可靠且有意义的。
当要拼接的两个表的公共列存在不能匹配的值时,就会出现在左表或右表中找不到值。依据对这种情况的处理方式,可以分为:内连接、左连接、右连接和外连接。merge()函数使用参数how来指定具体方式。
内连接就是只保留两个表中的共同部分,匹配不到的行会被舍弃。默认情况下,即不设置参数how时,拼接以内连接方式运行。
import pandas as pd df1=pd.read_excel(r"D:\PythonTestFile\pinjie1.xlsx") df2=pd.read_excel(r"D:\PythonTestFile\pinjie2.xlsx") print(df1,'\n') print(df2,'\n') print(pd.merge(df1,df2,how='inner'))
运行结果为:
Number Name Sex 0 101 Noah male 1 102 Emma female 2 104 Liam male 3 105 Sophia female 4 106 Bob male Number Chinese English Math 0 101 90 50 66 1 103 86 87 44 2 104 55 88 69 3 105 90 66 96 4 107 100 98 97 Number Name Sex Chinese English Math 0 101 Noah male 90 50 66 1 104 Liam male 55 88 69 2 105 Sophia female 90 66 96
左连接以左表为基础,保留左表的全部内容,在右表中找不到的值用NaN来填充。
import pandas as pd df1=pd.read_excel(r"D:\PythonTestFile\pinjie1.xlsx") df2=pd.read_excel(r"D:\PythonTestFile\pinjie2.xlsx") print(df1,'\n') print(df2,'\n') print(pd.merge(df1,df2,how='left'))
运行结果为:
Number Name Sex 0 101 Noah male 1 102 Emma female 2 104 Liam male 3 105 Sophia female 4 106 Bob male Number Chinese English Math 0 101 90 50 66 1 103 86 87 44 2 104 55 88 69 3 105 90 66 96 4 107 100 98 97 Number Name Sex Chinese English Math 0 101 Noah male 90.0 50.0 66.0 1 102 Emma female NaN NaN NaN 2 104 Liam male 55.0 88.0 69.0 3 105 Sophia female 90.0 66.0 96.0 4 106 Bob male NaN NaN NaN
右连接以右表为基础,保留右表的全部内容,在左表中找不到的值用NaN来填充。
import pandas as pd df1=pd.read_excel(r"D:\PythonTestFile\pinjie1.xlsx") df2=pd.read_excel(r"D:\PythonTestFile\pinjie2.xlsx") print(df1,'\n') print(df2,'\n') print(pd.merge(df1,df2,how='right'))
运行结果为:
Number Name Sex 0 101 Noah male 1 102 Emma female 2 104 Liam male 3 105 Sophia female 4 106 Bob male Number Chinese English Math 0 101 90 50 66 1 103 86 87 44 2 104 55 88 69 3 105 90 66 96 4 107 100 98 97 Number Name Sex Chinese English Math 0 101 Noah male 90 50 66 1 103 NaN NaN 86 87 44 2 104 Liam male 55 88 69 3 105 Sophia female 90 66 96 4 107 NaN NaN 100 98 97
外连接就是取两个表的并集,所有内容保留,不能匹配的值用NaN来填充。
import pandas as pd df1=pd.read_excel(r"D:\PythonTestFile\pinjie1.xlsx") df2=pd.read_excel(r"D:\PythonTestFile\pinjie2.xlsx") print(df1,'\n') print(df2,'\n') print(pd.merge(df1,df2,how='outer'))
运行结果为:
Number Name Sex 0 101 Noah male 1 102 Emma female 2 104 Liam male 3 105 Sophia female 4 106 Bob male Number Chinese English Math 0 101 90 50 66 1 103 86 87 44 2 104 55 88 69 3 105 90 66 96 4 107 100 98 97 Number Name Sex Chinese English Math 0 101 Noah male 90.0 50.0 66.0 1 102 Emma female NaN NaN NaN 2 104 Liam male 55.0 88.0 69.0 3 105 Sophia female 90.0 66.0 96.0 4 106 Bob male NaN NaN NaN 5 103 NaN NaN 86.0 87.0 44.0 6 107 NaN NaN 100.0 98.0 97.0
当拼接的表中存在重复列名(非公共列)时,merge()函数会自动为这些列添加后缀。默认情况下会将列名修改为XX_x、XX_y、XX_z等。也可以通过修改参数suffixes的值来自定义重复列的名称。
import pandas as pd df1=pd.read_excel(r"D:\PythonTestFile\pinjie1.xlsx") df2=pd.read_excel(r"D:\PythonTestFile\pinjie2.xlsx") print(df1,'\n') print(df2,'\n') print(pd.merge(df1,df2,on='Number',suffixes=['_1','_2']))
运行结果为:
Number Name Sex 0 101 Noah male 1 102 Emma female 2 104 Liam male 3 105 Sophia female 4 106 Bob male Number Name Chinese English Math 0 101 Noah 90 50 66 1 103 Emma 86 87 44 2 104 Liam 55 88 69 3 105 Sophia 90 66 96 4 107 Bob 100 98 97 Number Name_1 Sex Name_2 Chinese English Math 0 101 Noah male Noah 90 50 66 1 104 Liam male Liam 55 88 69 2 105 Sophia female Sophia 90 66 96
Copyright © 2017-Now pnotes.cn. All Rights Reserved.
编程学习笔记 保留所有权利
MARK:3.0.0.20240214.P35
From 2017.2.6