Pandas中联合与合并数据集的方法主要有以下几种:
pd.merge()
方法是根据一个或多个键进行连接,类似关系型数据库的连接操作。pd.concat()
方法是对数据对象在轴向上进行粘合或堆叠操作。pd.combine_first()
方法允许将重叠的数据拼接到一起,以使用一个对象中的值填充另一个对象中的缺失值。
1.数据库风格的数据连接:
使用pd.merge()
方法时,如果没哟指定连接的键的信息,系统会自动选择重叠的列名作为连接的键,不过显式的指定连接键是一个好习惯。
df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google']})
df2 = pd.DataFrame({'user_id':[1,2,3,4,5,6,8],'month_G':[5,9,8,10,15,30,2]})
df1
Out[5]:
device user_id
0 Iphone7 1
1 Huawei 2
2 HTC 3
3 oppo 4
4 vivo 5
5 xiaomi 6
6 google 7
df2
Out[6]:
month_G user_id
0 5 1
1 9 2
2 8 3
3 10 4
4 15 5
5 30 6
6 2 8
pd.merge(df1,df2)
Out[7]:
device user_id month_G
0 Iphone7 1 5
1 Huawei 2 9
2 HTC 3 8
3 oppo 4 10
4 vivo 5 15
5 xiaomi 6 30
pd.merge(df1,df2,on='user_id')
Out[8]:
device user_id month_G
0 Iphone7 1 5
1 Huawei 2 9
2 HTC 3 8
3 oppo 4 10
4 vivo 5 15
5 xiaomi 6 30
因为日常数据中很多会存在连接键名不一致的情况,此时就需要我们分别指定连接键。
df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google']})
df2 = pd.DataFrame({'id':[1,2,3,4,5,6,8],'month_G':[5,9,8,10,15,30,2]})
pd.merge(df1,df2,left_on='user_id',right_on='id')
Out[11]:
device user_id id month_G
0 Iphone7 1 1 5
1 Huawei 2 2 9
2 HTC 3 3 8
3 oppo 4 4 10
4 vivo 5 5 15
5 xiaomi 6 6 30
默认情况下,系统做的是内连接,也就是取两个数据的交集,可以通过指定连接模式达到所需效果。inner为内连接,outer为外连接,left为使用左表的键进行连接,right为使用右表的键进行连接。
df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google']})
df2 = pd.DataFrame({'user_id':[1,2,3,4,5,6,8],'month_G':[5,9,8,10,15,30,2]})
pd.merge(df1,df2,how='outer')
Out[15]:
device user_id month_G
0 Iphone7 1 5.0
1 Huawei 2 9.0
2 HTC 3 8.0
3 oppo 4 10.0
4 vivo 5 15.0
5 xiaomi 6 30.0
6 google 7 NaN
7 NaN 8 2.0
pd.merge(df1,df2,how='left')
Out[16]:
device user_id month_G
0 Iphone7 1 5.0
1 Huawei 2 9.0
2 HTC 3 8.0
3 oppo 4 10.0
4 vivo 5 15.0
5 xiaomi 6 30.0
6 google 7 NaN
当使用多个键进行连接时,需要传入一个键名的列表。
df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google'],'key':['shandong','shanghai','beijing','hebei','jiangsu','hunan','henan']})
df2 = pd.DataFrame({'user_id':[1,2,3,4,5,6,8],'month_G':[5,9,8,10,15,30,2],'key':['shandong','shanghai','beijing','hebei','jiangsu','hunan','jiangxi']})
pd.merge(df1,df2,on=['user_id','key'],how='outer')
Out[20]:
device key user_id month_G
0 Iphone7 shandong 1 5.0
1 Huawei shanghai 2 9.0
2 HTC beijing 3 8.0
3 oppo hebei 4 10.0
4 vivo jiangsu 5 15.0
5 xiaomi hunan 6 30.0
6 google henan 7 NaN
7 NaN jiangxi 8 2.0
merge()
方法的参数:
参数 | 说明 |
---|---|
left | 合并时操作左边的DataFrame |
right | 合并时操作右边的DataFrame |
how | 数据连接的方式 |
on | 需要连接的键 |
left_on | 左边的DataFrame中用作连接键的列 |
right_on | 右边的DataFrame中用作连接键的列 |
left_index | 使用左边数据的行索引作为连接键 |
right_index | 使用右边数据的行索引作为连接键 |
sort | 通过连接键按字母顺序对合并的数据进行排序,默认为True |
suffixes | 在列重叠情况下,用于指定添加到列名后的字符串元祖,默认为_x和_y |
indicator | 添加一个特殊的列_merge,用来指示每一行的来源 |
2.根据索引合并数据:
有时候需要使用数据的索引作为连接键,此时需要设定left_index=True
或right_index=True
来指定使用索引连接。
df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google']})
df2 = pd.DataFrame({'month_G':[5,9,8,10,15,30,2]},index=[1,2,3,4,5,6,8])
df1
Out[23]:
device user_id
0 Iphone7 1
1 Huawei 2
2 HTC 3
3 oppo 4
4 vivo 5
5 xiaomi 6
6 google 7
df2
Out[24]:
month_G
1 5
2 9
3 8
4 10
5 15
6 30
8 2
pd.merge(df1,df2,left_on='user_id',right_index=True,how='outer')
Out[25]:
device user_id month_G
0 Iphone7 1 5.0
1 Huawei 2 9.0
2 HTC 3 8.0
3 oppo 4 10.0
4 vivo 5 15.0
5 xiaomi 6 30.0
6 google 7 NaN
6 NaN 8 2.0
同样,当一方数据为多层索引时,传入一个键名的列表即可。
df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google'],'province':['shandong','shanghai','beijing','hebei','jiangsu','hunan','henan']})
df2 = pd.DataFrame({'month_G':[5,9,8,10,15,30,2]},index=[[1,2,3,4,5,6,8],['shandong','shanghai','beijing','hebei','jiangsu','hunan','jiangxi']])
df1
Out[28]:
device province user_id
0 Iphone7 shandong 1
1 Huawei shanghai 2
2 HTC beijing 3
3 oppo hebei 4
4 vivo jiangsu 5
5 xiaomi hunan 6
6 google henan 7
df2
Out[29]:
month_G
1 shandong 5
2 shanghai 9
3 beijing 8
4 hebei 10
5 jiangsu 15
6 hunan 30
8 jiangxi 2
pd.merge(df1,df2,left_on=['user_id','province'],right_index=True,how='outer')
Out[30]:
device province user_id month_G
0 Iphone7 shandong 1 5.0
1 Huawei shanghai 2 9.0
2 HTC beijing 3 8.0
3 oppo hebei 4 10.0
4 vivo jiangsu 5 15.0
5 xiaomi hunan 6 30.0
6 google henan 7 NaN
6 NaN jiangxi 8 2.0
此外,DataFrame中还有一个join实例方法完基于索引的数据连接,但是经常使用的还是merge()
方法,这里就不做过多介绍。
3.沿轴向连接数据:
沿轴向连接数据也称为数据拼接、堆叠操作,此时需使用concat()
方法。
df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google']})
df1
Out[40]:
device user_id
0 Iphone7 1
1 Huawei 2
2 HTC 3
3 oppo 4
4 vivo 5
5 xiaomi 6
6 google 7
df2 = pd.DataFrame({'month_G':[5,9,8,10,15,30],'province':['shandong','shanghai','beijing','hebei','jiangsu','hunan']})
df2
Out[42]:
month_G province
0 5 shandong
1 9 shanghai
2 8 beijing
3 10 hebei
4 15 jiangsu
5 30 hunan
默认情况下,是沿着axis=0的轴生效。
pd.concat([df1,df2])
Out[43]:
device month_G province user_id
0 Iphone7 NaN NaN 1.0
1 Huawei NaN NaN 2.0
2 HTC NaN NaN 3.0
3 oppo NaN NaN 4.0
4 vivo NaN NaN 5.0
5 xiaomi NaN NaN 6.0
6 google NaN NaN 7.0
0 NaN 5.0 shandong NaN
1 NaN 9.0 shanghai NaN
2 NaN 8.0 beijing NaN
3 NaN 10.0 hebei NaN
4 NaN 15.0 jiangsu NaN
5 NaN 30.0 hunan NaN
pd.concat([df1,df2],axis=1)
Out[44]:
device user_id month_G province
0 Iphone7 1 5.0 shandong
1 Huawei 2 9.0 shanghai
2 HTC 3 8.0 beijing
3 oppo 4 10.0 hebei
4 vivo 5 15.0 jiangsu
5 xiaomi 6 30.0 hunan
6 google 7 NaN NaN
通过设置join参数,指定连接方式。
pd.concat([df1,df2],axis=1,join='inner')
Out[45]:
device user_id month_G province
0 Iphone7 1 5 shandong
1 Huawei 2 9 shanghai
2 HTC 3 8 beijing
3 oppo 4 10 hebei
4 vivo 5 15 jiangsu
5 xiaomi 6 30 hunan
pd.concat([df1,df2],axis=1,join='outer')
Out[46]:
device user_id month_G province
0 Iphone7 1 5.0 shandong
1 Huawei 2 9.0 shanghai
2 HTC 3 8.0 beijing
3 oppo 4 10.0 hebei
4 vivo 5 15.0 jiangsu
5 xiaomi 6 30.0 hunan
6 google 7 NaN NaN
如果想在连接轴上创建一个多层索引,可以使用keys参数实现。
pd.concat([df1,df2],keys=['key1','key2'])
Out[52]:
device month_G province user_id
key1 0 Iphone7 NaN NaN 1.0
1 Huawei NaN NaN 2.0
2 HTC NaN NaN 3.0
3 oppo NaN NaN 4.0
4 vivo NaN NaN 5.0
5 xiaomi NaN NaN 6.0
6 google NaN NaN 7.0
key2 0 NaN 5.0 shandong NaN
1 NaN 9.0 shanghai NaN
2 NaN 8.0 beijing NaN
3 NaN 10.0 hebei NaN
4 NaN 15.0 jiangsu NaN
5 NaN 30.0 hunan NaN
还可以通过names参数来命名生成的轴层级。
pd.concat([df1,df2],axis=1,keys=['key1','key2'],names=['level1','level2'])
Out[54]:
level1 key1 key2
level2 device user_id month_G province
0 Iphone7 1 5.0 shandong
1 Huawei 2 9.0 shanghai
2 HTC 3 8.0 beijing
3 oppo 4 10.0 hebei
4 vivo 5 15.0 jiangsu
5 xiaomi 6 30.0 hunan
6 google 7 NaN NaN
有时候我们想连接生成的数据的索引不保留为原有索引值,此时需要设置ignore_index=True
。
df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google']})
df1
Out[65]:
device user_id
0 Iphone7 1
1 Huawei 2
2 HTC 3
3 oppo 4
4 vivo 5
5 xiaomi 6
6 google 7
df2 = pd.DataFrame({'month_G':[5,9,8,10,15,30]},index=['shandong','shanghai','beijing','hebei','jiangsu','hunan'])
df2
Out[67]:
month_G
shandong 5
shanghai 9
beijing 8
hebei 10
jiangsu 15
hunan 30
pd.concat([df1,df2])
Out[68]:
device month_G user_id
0 Iphone7 NaN 1.0
1 Huawei NaN 2.0
2 HTC NaN 3.0
3 oppo NaN 4.0
4 vivo NaN 5.0
5 xiaomi NaN 6.0
6 google NaN 7.0
shandong NaN 5.0 NaN
shanghai NaN 9.0 NaN
beijing NaN 8.0 NaN
hebei NaN 10.0 NaN
jiangsu NaN 15.0 NaN
hunan NaN 30.0 NaN
pd.concat([df1,df2],ignore_index=True)
Out[69]:
device month_G user_id
0 Iphone7 NaN 1.0
1 Huawei NaN 2.0
2 HTC NaN 3.0
3 oppo NaN 4.0
4 vivo NaN 5.0
5 xiaomi NaN 6.0
6 google NaN 7.0
7 NaN 5.0 NaN
8 NaN 9.0 NaN
9 NaN 8.0 NaN
10 NaN 10.0 NaN
11 NaN 15.0 NaN
12 NaN 30.0 NaN
concat()
方法的参数:
参数 | 说明 |
---|---|
axis | 用于指定连接的轴 |
join | 连接方式(outer/inner),默认为外连接outer |
join_axes | 用于指定其他n-1轴的特定索引,可以替代内外连接的逻辑 |
keys | 与要连接的对象关联的值,沿着连接轴形成分层索引 |
leels | 在键值传递时,该参数用于指定多层索引的层级 |
names | 如果传入keys/levels参数,该参数用于指定多层索引的层级名称 |
verity_integrity | 检查连接对象中的新轴是否重复,如果是,则发生异常,默认为False允许重复 |
ignore_index | 不沿着连接轴保留索引,而是产生一个新索引 |
4.联合重叠数据:
有时,两个数据集的索引全部或部分重叠,需要进行联合重叠值操作,进行数据修补,此时就需要使用combine_first()
方法。
s1 = pd.Series([1,2.2,5,6,np.nan],index=['a','b','c','d','e'])
s1
Out[92]:
a 1.0
b 2.2
c 5.0
d 6.0
e NaN
dtype: float64
s2 = pd.Series([9,np.nan,np.nan,7,8],index=['b','c','d','e','f'])
s2
Out[94]:
b 9.0
c NaN
d NaN
e 7.0
f 8.0
dtype: float64
s1.combine_first(s2)
Out[95]:
a 1.0
b 2.2
c 5.0
d 6.0
e 7.0
f 8.0
dtype: float64
s2.combine_first(s1)
Out[96]:
a 1.0
b 9.0
c 5.0
d 6.0
e 7.0
f 8.0
dtype: float64
在DataFrame中,combine_first()
方法逐列做跟在Series中相同的操作,可以看做是根据传入的对象来修补调用对象的数据。
df1 = pd.DataFrame({'a':[2,np.nan,6,np.nan],'b':[np.nan,5,np.nan,8],'c':[11,22,33,44]})
df1
Out[101]:
a b c
0 2.0 NaN 11
1 NaN 5.0 22
2 6.0 NaN 33
3 NaN 8.0 44
df2 = pd.DataFrame({'a':[5.5,4.4,np.nan,3.3,2.2],'b':[np.nan,3.3,4.4,5.5,6.6]})
df2
Out[103]:
a b
0 5.5 NaN
1 4.4 3.3
2 NaN 4.4
3 3.3 5.5
4 2.2 6.6
df1.combine_first(df2)
Out[104]:
a b c
0 2.0 NaN 11.0
1 4.4 5.0 22.0
2 6.0 4.4 33.0
3 3.3 8.0 44.0
4 2.2 6.6 NaN
df2.combine_first(df1)
Out[105]:
a b c
0 5.5 NaN 11.0
1 4.4 3.3 22.0
2 6.0 4.4 33.0
3 3.3 5.5 44.0
4 2.2 6.6 NaN
Reference:
《Python for Data Analysis:Data Wrangling with Pandas,Numpy,and IPython》