首页Python15.数据规整——联合与合...

15.数据规整——联合与合并数据集

Pandas中联合与合并数据集的方法主要有以下几种:

  • pd.merge()方法是根据一个或多个键进行连接,类似关系型数据库的连接操作。
  • pd.concat()方法是对数据对象在轴向上进行粘合或堆叠操作。
  • pd.combine_first()方法允许将重叠的数据拼接到一起,以使用一个对象中的值填充另一个对象中的缺失值。

1.数据库风格的数据连接:

使用pd.merge()方法时,如果没哟指定连接的键的信息,系统会自动选择重叠的列名作为连接的键,不过显式的指定连接键是一个好习惯。

df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google']})

df2 = pd.DataFrame({'user_id':[1,2,3,4,5,6,8],'month_G':[5,9,8,10,15,30,2]})

df1
Out[5]: 
    device  user_id
0  Iphone7        1
1   Huawei        2
2      HTC        3
3     oppo        4
4     vivo        5
5   xiaomi        6
6   google        7

df2
Out[6]: 
   month_G  user_id
0        5        1
1        9        2
2        8        3
3       10        4
4       15        5
5       30        6
6        2        8

pd.merge(df1,df2)
Out[7]: 
    device  user_id  month_G
0  Iphone7        1        5
1   Huawei        2        9
2      HTC        3        8
3     oppo        4       10
4     vivo        5       15
5   xiaomi        6       30

pd.merge(df1,df2,on='user_id')
Out[8]: 
    device  user_id  month_G
0  Iphone7        1        5
1   Huawei        2        9
2      HTC        3        8
3     oppo        4       10
4     vivo        5       15
5   xiaomi        6       30

因为日常数据中很多会存在连接键名不一致的情况,此时就需要我们分别指定连接键。

df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google']})

df2 = pd.DataFrame({'id':[1,2,3,4,5,6,8],'month_G':[5,9,8,10,15,30,2]})

pd.merge(df1,df2,left_on='user_id',right_on='id')
Out[11]: 
    device  user_id  id  month_G
0  Iphone7        1   1        5
1   Huawei        2   2        9
2      HTC        3   3        8
3     oppo        4   4       10
4     vivo        5   5       15
5   xiaomi        6   6       30

默认情况下,系统做的是内连接,也就是取两个数据的交集,可以通过指定连接模式达到所需效果。inner为内连接,outer为外连接,left为使用左表的键进行连接,right为使用右表的键进行连接。

df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google']})

df2 = pd.DataFrame({'user_id':[1,2,3,4,5,6,8],'month_G':[5,9,8,10,15,30,2]})

pd.merge(df1,df2,how='outer')
Out[15]: 
    device  user_id  month_G
0  Iphone7        1      5.0
1   Huawei        2      9.0
2      HTC        3      8.0
3     oppo        4     10.0
4     vivo        5     15.0
5   xiaomi        6     30.0
6   google        7      NaN
7      NaN        8      2.0

pd.merge(df1,df2,how='left')
Out[16]: 
    device  user_id  month_G
0  Iphone7        1      5.0
1   Huawei        2      9.0
2      HTC        3      8.0
3     oppo        4     10.0
4     vivo        5     15.0
5   xiaomi        6     30.0
6   google        7      NaN

当使用多个键进行连接时,需要传入一个键名的列表。

df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google'],'key':['shandong','shanghai','beijing','hebei','jiangsu','hunan','henan']})

df2 = pd.DataFrame({'user_id':[1,2,3,4,5,6,8],'month_G':[5,9,8,10,15,30,2],'key':['shandong','shanghai','beijing','hebei','jiangsu','hunan','jiangxi']})

pd.merge(df1,df2,on=['user_id','key'],how='outer')
Out[20]: 
    device       key  user_id  month_G
0  Iphone7  shandong        1      5.0
1   Huawei  shanghai        2      9.0
2      HTC   beijing        3      8.0
3     oppo     hebei        4     10.0
4     vivo   jiangsu        5     15.0
5   xiaomi     hunan        6     30.0
6   google     henan        7      NaN
7      NaN   jiangxi        8      2.0

merge()方法的参数:

参数说明
left合并时操作左边的DataFrame
right合并时操作右边的DataFrame
how数据连接的方式
on需要连接的键
left_on左边的DataFrame中用作连接键的列
right_on右边的DataFrame中用作连接键的列
left_index使用左边数据的行索引作为连接键
right_index使用右边数据的行索引作为连接键
sort通过连接键按字母顺序对合并的数据进行排序,默认为True
suffixes在列重叠情况下,用于指定添加到列名后的字符串元祖,默认为_x和_y
indicator添加一个特殊的列_merge,用来指示每一行的来源

2.根据索引合并数据:

有时候需要使用数据的索引作为连接键,此时需要设定left_index=Trueright_index=True来指定使用索引连接。

df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google']})

df2 = pd.DataFrame({'month_G':[5,9,8,10,15,30,2]},index=[1,2,3,4,5,6,8])

df1
Out[23]: 
    device  user_id
0  Iphone7        1
1   Huawei        2
2      HTC        3
3     oppo        4
4     vivo        5
5   xiaomi        6
6   google        7

df2
Out[24]: 
   month_G
1        5
2        9
3        8
4       10
5       15
6       30
8        2

pd.merge(df1,df2,left_on='user_id',right_index=True,how='outer')
Out[25]: 
    device  user_id  month_G
0  Iphone7        1      5.0
1   Huawei        2      9.0
2      HTC        3      8.0
3     oppo        4     10.0
4     vivo        5     15.0
5   xiaomi        6     30.0
6   google        7      NaN
6      NaN        8      2.0

同样,当一方数据为多层索引时,传入一个键名的列表即可。

df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google'],'province':['shandong','shanghai','beijing','hebei','jiangsu','hunan','henan']})

df2 = pd.DataFrame({'month_G':[5,9,8,10,15,30,2]},index=[[1,2,3,4,5,6,8],['shandong','shanghai','beijing','hebei','jiangsu','hunan','jiangxi']])

df1
Out[28]: 
    device  province  user_id
0  Iphone7  shandong        1
1   Huawei  shanghai        2
2      HTC   beijing        3
3     oppo     hebei        4
4     vivo   jiangsu        5
5   xiaomi     hunan        6
6   google     henan        7

df2
Out[29]: 
            month_G
1 shandong        5
2 shanghai        9
3 beijing         8
4 hebei          10
5 jiangsu        15
6 hunan          30
8 jiangxi         2

pd.merge(df1,df2,left_on=['user_id','province'],right_index=True,how='outer')
Out[30]: 
    device  province  user_id  month_G
0  Iphone7  shandong        1      5.0
1   Huawei  shanghai        2      9.0
2      HTC   beijing        3      8.0
3     oppo     hebei        4     10.0
4     vivo   jiangsu        5     15.0
5   xiaomi     hunan        6     30.0
6   google     henan        7      NaN
6      NaN   jiangxi        8      2.0

此外,DataFrame中还有一个join实例方法完基于索引的数据连接,但是经常使用的还是merge()方法,这里就不做过多介绍。

3.沿轴向连接数据:

沿轴向连接数据也称为数据拼接、堆叠操作,此时需使用concat()方法。

df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google']})

df1
Out[40]: 
    device  user_id
0  Iphone7        1
1   Huawei        2
2      HTC        3
3     oppo        4
4     vivo        5
5   xiaomi        6
6   google        7

df2 = pd.DataFrame({'month_G':[5,9,8,10,15,30],'province':['shandong','shanghai','beijing','hebei','jiangsu','hunan']})

df2
Out[42]: 
   month_G  province
0        5  shandong
1        9  shanghai
2        8   beijing
3       10     hebei
4       15   jiangsu
5       30     hunan

默认情况下,是沿着axis=0的轴生效。

pd.concat([df1,df2])
Out[43]: 
    device  month_G  province  user_id
0  Iphone7      NaN       NaN      1.0
1   Huawei      NaN       NaN      2.0
2      HTC      NaN       NaN      3.0
3     oppo      NaN       NaN      4.0
4     vivo      NaN       NaN      5.0
5   xiaomi      NaN       NaN      6.0
6   google      NaN       NaN      7.0
0      NaN      5.0  shandong      NaN
1      NaN      9.0  shanghai      NaN
2      NaN      8.0   beijing      NaN
3      NaN     10.0     hebei      NaN
4      NaN     15.0   jiangsu      NaN
5      NaN     30.0     hunan      NaN

pd.concat([df1,df2],axis=1)
Out[44]: 
    device  user_id  month_G  province
0  Iphone7        1      5.0  shandong
1   Huawei        2      9.0  shanghai
2      HTC        3      8.0   beijing
3     oppo        4     10.0     hebei
4     vivo        5     15.0   jiangsu
5   xiaomi        6     30.0     hunan
6   google        7      NaN       NaN

通过设置join参数,指定连接方式。

pd.concat([df1,df2],axis=1,join='inner')
Out[45]: 
    device  user_id  month_G  province
0  Iphone7        1        5  shandong
1   Huawei        2        9  shanghai
2      HTC        3        8   beijing
3     oppo        4       10     hebei
4     vivo        5       15   jiangsu
5   xiaomi        6       30     hunan

pd.concat([df1,df2],axis=1,join='outer')
Out[46]: 
    device  user_id  month_G  province
0  Iphone7        1      5.0  shandong
1   Huawei        2      9.0  shanghai
2      HTC        3      8.0   beijing
3     oppo        4     10.0     hebei
4     vivo        5     15.0   jiangsu
5   xiaomi        6     30.0     hunan
6   google        7      NaN       NaN

如果想在连接轴上创建一个多层索引,可以使用keys参数实现。

pd.concat([df1,df2],keys=['key1','key2'])
Out[52]: 
         device  month_G  province  user_id
key1 0  Iphone7      NaN       NaN      1.0
     1   Huawei      NaN       NaN      2.0
     2      HTC      NaN       NaN      3.0
     3     oppo      NaN       NaN      4.0
     4     vivo      NaN       NaN      5.0
     5   xiaomi      NaN       NaN      6.0
     6   google      NaN       NaN      7.0
key2 0      NaN      5.0  shandong      NaN
     1      NaN      9.0  shanghai      NaN
     2      NaN      8.0   beijing      NaN
     3      NaN     10.0     hebei      NaN
     4      NaN     15.0   jiangsu      NaN
     5      NaN     30.0     hunan      NaN

还可以通过names参数来命名生成的轴层级。

pd.concat([df1,df2],axis=1,keys=['key1','key2'],names=['level1','level2'])
Out[54]: 
level1     key1            key2          
level2   device user_id month_G  province
0       Iphone7       1     5.0  shandong
1        Huawei       2     9.0  shanghai
2           HTC       3     8.0   beijing
3          oppo       4    10.0     hebei
4          vivo       5    15.0   jiangsu
5        xiaomi       6    30.0     hunan
6        google       7     NaN       NaN

有时候我们想连接生成的数据的索引不保留为原有索引值,此时需要设置ignore_index=True

df1 = pd.DataFrame({'user_id':[1,2,3,4,5,6,7],'device':['Iphone7','Huawei','HTC','oppo','vivo','xiaomi','google']})

df1
Out[65]: 
    device  user_id
0  Iphone7        1
1   Huawei        2
2      HTC        3
3     oppo        4
4     vivo        5
5   xiaomi        6
6   google        7

df2 = pd.DataFrame({'month_G':[5,9,8,10,15,30]},index=['shandong','shanghai','beijing','hebei','jiangsu','hunan'])

df2
Out[67]: 
          month_G
shandong        5
shanghai        9
beijing         8
hebei          10
jiangsu        15
hunan          30

pd.concat([df1,df2])
Out[68]: 
           device  month_G  user_id
0         Iphone7      NaN      1.0
1          Huawei      NaN      2.0
2             HTC      NaN      3.0
3            oppo      NaN      4.0
4            vivo      NaN      5.0
5          xiaomi      NaN      6.0
6          google      NaN      7.0
shandong      NaN      5.0      NaN
shanghai      NaN      9.0      NaN
beijing       NaN      8.0      NaN
hebei         NaN     10.0      NaN
jiangsu       NaN     15.0      NaN
hunan         NaN     30.0      NaN

pd.concat([df1,df2],ignore_index=True)
Out[69]: 
     device  month_G  user_id
0   Iphone7      NaN      1.0
1    Huawei      NaN      2.0
2       HTC      NaN      3.0
3      oppo      NaN      4.0
4      vivo      NaN      5.0
5    xiaomi      NaN      6.0
6    google      NaN      7.0
7       NaN      5.0      NaN
8       NaN      9.0      NaN
9       NaN      8.0      NaN
10      NaN     10.0      NaN
11      NaN     15.0      NaN
12      NaN     30.0      NaN

concat()方法的参数:

参数说明
axis用于指定连接的轴
join连接方式(outer/inner),默认为外连接outer
join_axes用于指定其他n-1轴的特定索引,可以替代内外连接的逻辑
keys与要连接的对象关联的值,沿着连接轴形成分层索引
leels在键值传递时,该参数用于指定多层索引的层级
names如果传入keys/levels参数,该参数用于指定多层索引的层级名称
verity_integrity检查连接对象中的新轴是否重复,如果是,则发生异常,默认为False允许重复
ignore_index不沿着连接轴保留索引,而是产生一个新索引

4.联合重叠数据:

有时,两个数据集的索引全部或部分重叠,需要进行联合重叠值操作,进行数据修补,此时就需要使用combine_first()方法。

s1 = pd.Series([1,2.2,5,6,np.nan],index=['a','b','c','d','e'])

s1
Out[92]: 
a    1.0
b    2.2
c    5.0
d    6.0
e    NaN
dtype: float64

s2 = pd.Series([9,np.nan,np.nan,7,8],index=['b','c','d','e','f'])

s2
Out[94]: 
b    9.0
c    NaN
d    NaN
e    7.0
f    8.0
dtype: float64

s1.combine_first(s2)
Out[95]: 
a    1.0
b    2.2
c    5.0
d    6.0
e    7.0
f    8.0
dtype: float64

s2.combine_first(s1)
Out[96]: 
a    1.0
b    9.0
c    5.0
d    6.0
e    7.0
f    8.0
dtype: float64

在DataFrame中,combine_first()方法逐列做跟在Series中相同的操作,可以看做是根据传入的对象来修补调用对象的数据。

df1 = pd.DataFrame({'a':[2,np.nan,6,np.nan],'b':[np.nan,5,np.nan,8],'c':[11,22,33,44]})

df1
Out[101]: 
     a    b   c
0  2.0  NaN  11
1  NaN  5.0  22
2  6.0  NaN  33
3  NaN  8.0  44

df2 = pd.DataFrame({'a':[5.5,4.4,np.nan,3.3,2.2],'b':[np.nan,3.3,4.4,5.5,6.6]})

df2
Out[103]: 
     a    b
0  5.5  NaN
1  4.4  3.3
2  NaN  4.4
3  3.3  5.5
4  2.2  6.6

df1.combine_first(df2)
Out[104]: 
     a    b     c
0  2.0  NaN  11.0
1  4.4  5.0  22.0
2  6.0  4.4  33.0
3  3.3  8.0  44.0
4  2.2  6.6   NaN

df2.combine_first(df1)
Out[105]: 
     a    b     c
0  5.5  NaN  11.0
1  4.4  3.3  22.0
2  6.0  4.4  33.0
3  3.3  5.5  44.0
4  2.2  6.6   NaN

Reference:
《Python for Data Analysis:Data Wrangling with Pandas,Numpy,and IPython》

RELATED ARTICLES

欢迎留下您的宝贵建议

Please enter your comment!
Please enter your name here

- Advertisment -

Most Popular

Recent Comments