14. Data Wrangling: Hierarchical Indexing

June 13, 2019


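In day-to-day data analysis, the data you need is often spread across multiple files or databases, or arranged in a format that is not easy to analyze. The next few articles introduce a set of methods for combining, joining, and rearranging data.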

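This article covers hierarchical indexing, an important feature of pandas that allows an axis to have multiple index levels. In other words, hierarchical indexing provides a way to work with higher-dimensional data in a lower-dimensional form.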

1. Hierarchical indexing:

import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(9),
                 index=[['android', 'android', 'android', 'ios', 'ios', 'wp', 'wp', 'symbian', 'symbian'],
                        ['htc', 'google', 'huawei', 'iphone6', 'iphone7', 'htc', 'Nokia', 'Nokia', 'LG']])

data
Out[11]: 
android  htc       -0.251641
         google    -0.602151
         huawei    -1.169282
ios      iphone6    0.355161
         iphone7   -0.559844
wp       htc        0.321531
         Nokia     -0.252508
symbian  Nokia     -1.104941
         LG        -0.674391
dtype: float64

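The index attribute returns the Series' multi-level index, a MultiIndex. (Recent pandas versions print a MultiIndex differently, and the attribute shown here as labels is now called codes.)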

data.index
Out[12]: 
MultiIndex(levels=[['android', 'ios', 'symbian', 'wp'], ['LG', 'Nokia', 'google', 'htc', 'huawei', 'iphone6', 'iphone7']],
           labels=[[0, 0, 0, 1, 1, 3, 3, 2, 2], [3, 2, 4, 5, 6, 3, 1, 1, 0]])
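To make the structure of a MultiIndex more concrete, here is a minimal sketch that inspects one. The Series `s` and its values are made up for illustration and are not the article's random data.

```python
import pandas as pd

# Hypothetical small Series with a two-level index (values chosen arbitrarily).
s = pd.Series(
    [10, 20, 30, 40],
    index=[['android', 'android', 'ios', 'ios'],
           ['htc', 'google', 'iphone6', 'iphone7']],
)

n_levels = s.index.nlevels                   # number of index levels
outer_labels = s.index.get_level_values(0)   # outer label for every row
inner_unique = s.index.levels[1]             # sorted unique labels of the inner level

print(n_levels)
print(list(outer_labels))
print(list(inner_unique))
```

The levels hold the sorted unique labels, while `get_level_values` expands them back to one label per row.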

A hierarchically indexed object supports so-called partial indexing, which lets us concisely select subsets of the data.

data['ios']
Out[13]: 
iphone6    0.355161
iphone7   -0.559844
dtype: float64

data.loc[['ios','wp']]
Out[14]: 
ios  iphone6    0.355161
     iphone7   -0.559844
wp   htc        0.321531
     Nokia     -0.252508
dtype: float64

Selecting on an inner level is also possible.

data.loc[:,'iphone6']
Out[15]: 
ios    0.355161
dtype: float64
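The selections above can be checked more easily with fixed values. The following sketch assumes the same index as the article but replaces the random numbers with 0 to 8, so every result is predictable.

```python
import numpy as np
import pandas as pd

# Same index as in the article; deterministic values 0..8 are an assumption
# made here so the selections are easy to verify.
data = pd.Series(
    np.arange(9),
    index=[['android', 'android', 'android', 'ios', 'ios', 'wp', 'wp', 'symbian', 'symbian'],
           ['htc', 'google', 'huawei', 'iphone6', 'iphone7', 'htc', 'Nokia', 'Nokia', 'LG']],
)

outer = data['ios']                  # partial indexing on the outer level
inner = data.loc[:, 'htc']           # all outer labels that contain inner label 'htc'
single = data.loc[('wp', 'Nokia')]   # a full key returns a single value

print(outer)
print(inner)
print(single)
```

Partial selections drop the level that was fixed, so `outer` is indexed by device and `inner` by platform.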

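Hierarchical indexing plays an important role in reshaping data and in group-based operations such as building pivot tables. For example, the unstack() method rearranges the data into a DataFrame, moving the inner index level to the columns. The inverse operation of unstack() is stack().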

data.unstack()
Out[16]: 
               LG     Nokia    google       htc    huawei   iphone6   iphone7
android       NaN       NaN -0.602151 -0.251641 -1.169282       NaN       NaN
ios           NaN       NaN       NaN       NaN       NaN  0.355161 -0.559844
symbian -0.674391 -1.104941       NaN       NaN       NaN       NaN       NaN
wp            NaN -0.252508       NaN  0.321531       NaN       NaN       NaN

data.unstack().stack()
Out[17]: 
android  google    -0.602151
         htc       -0.251641
         huawei    -1.169282
ios      iphone6    0.355161
         iphone7   -0.559844
symbian  LG        -0.674391
         Nokia     -1.104941
wp       Nokia     -0.252508
         htc        0.321531
dtype: float64
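A small round trip may help show what unstack() and stack() do to missing combinations. This is a sketch with made-up values; the explicit dropna() is an assumption added here because whether stack() itself discards the NaN holes depends on the pandas version.

```python
import pandas as pd

# Hypothetical Series: each platform has its own set of devices.
s = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=[['android', 'android', 'ios', 'ios'],
           ['htc', 'google', 'iphone6', 'iphone7']],
)

wide = s.unstack()                  # inner level becomes the columns; gaps are NaN
restored = wide.stack().dropna()    # back to a Series, NaN entries removed explicitly

print(wide)
print(restored)
```

Note that the restored Series is ordered by the sorted labels, so the row order can differ from the original even though the values are the same.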

In a DataFrame, either axis can have a hierarchical index, and the hierarchical levels can have names.
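The article does not show how the DataFrame printed next was created. One construction that is consistent with the printed output, offered here as an assumption rather than the author's exact code, is:

```python
import numpy as np
import pandas as pd

# Assumed construction: 12 consecutive integers in a 4x3 grid,
# with two-level labels on both the rows and the columns.
data = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['android', 'android', 'ios'], ['HTC', 'Google', 'iphone']],
)

print(data)
```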

data
Out[22]: 
    android           ios
        HTC Google iphone
a 1       0      1      2
  2       3      4      5
b 1       6      7      8
  2       9     10     11

data.index
Out[23]: 
MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

data.index.names=['key1','key2']

data
Out[24]: 
          android           ios
              HTC Google iphone
key1 key2                      
a    1          0      1      2
     2          3      4      5
b    1          6      7      8
     2          9     10     11

data.columns
Out[25]: 
MultiIndex(levels=[['android', 'ios'], ['Google', 'HTC', 'iphone']],
           labels=[[0, 0, 1], [1, 0, 2]])

data.columns.names=['platform','device']

data
Out[26]: 
platform  android           ios
device        HTC Google iphone
key1 key2                      
a    1          0      1      2
     2          3      4      5
b    1          6      7      8
     2          9     10     11

Likewise, partial indexing on the columns of a DataFrame selects a group of columns.

data['android']
Out[30]: 
device     HTC  Google
key1 key2             
a    1       0       1
     2       3       4
b    1       6       7
     2       9      10
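Beyond selecting a whole column group, a full tuple key picks one column and partial indexing also works on the rows. The sketch below rebuilds a frame like the one above (the construction is an assumption, since the article does not show it).

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the article's DataFrame, with level names set.
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['android', 'android', 'ios'], ['HTC', 'Google', 'iphone']],
)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['platform', 'device']

android = frame['android']         # partial key: DataFrame with the android devices
htc = frame[('android', 'HTC')]    # full key: a single column as a Series
rows_a = frame.loc['a']            # partial indexing on the row index

print(android)
print(htc)
print(rows_a)
```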

2. Reordering and sorting levels:

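The swaplevel() method reorders the levels. It takes two level numbers or names and returns a new object with those levels interchanged; the data itself is otherwise unchanged.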

data
Out[35]: 
platform  android           ios
device        HTC Google iphone
key1 key2                      
a    1          0      1      2
     2          3      4      5
b    1          6      7      8
     2          9     10     11

data.swaplevel('key1','key2')
Out[36]: 
platform  android           ios
device        HTC Google iphone
key2 key1                      
1    a          0      1      2
2    a          3      4      5
1    b          6      7      8
2    b          9     10     11

data.swaplevel(0,1)
Out[37]: 
platform  android           ios
device        HTC Google iphone
key2 key1                      
1    a          0      1      2
2    a          3      4      5
1    b          6      7      8
2    b          9     10     11

The sort_index() method sorts the data by the labels of a given level, selected with the level option.

data.sort_index(level=1)
Out[38]: 
platform  android           ios
device        HTC Google iphone
key1 key2                      
a    1          0      1      2
b    1          6      7      8
a    2          3      4      5
b    2          9     10     11

data.sort_index(level=0)
Out[39]: 
platform  android           ios
device        HTC Google iphone
key1 key2                      
a    1          0      1      2
     2          3      4      5
b    1          6      7      8
     2          9     10     11
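Because swaplevel() leaves the row order untouched, it is often followed by sort_index() so that the new outer level is grouped together. A minimal sketch, again assuming a reconstruction of the article's DataFrame:

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the article's DataFrame.
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=[['android', 'android', 'ios'], ['HTC', 'Google', 'iphone']],
)

swapped = frame.swaplevel('key1', 'key2')      # levels exchanged, rows in original order
sorted_swapped = swapped.sort_index(level=0)   # ordered by key2, then key1

print(swapped.index.tolist())
print(sorted_swapped.index.tolist())
```

Selection on a hierarchical index generally performs better when the index is sorted starting from the outermost level, which is one reason to sort after swapping.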

3. Summary statistics by level:

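Many descriptive and summary statistics can be computed per level by passing the level option, and the axis option selects the axis to aggregate along. Note that the level argument of sum() was deprecated in pandas 1.3 and removed in pandas 2.0; on newer versions the equivalent is groupby(level=...).sum().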

data
Out[40]: 
platform  android           ios
device        HTC Google iphone
key1 key2                      
a    1          0      1      2
     2          3      4      5
b    1          6      7      8
     2          9     10     11

data.sum(level='key1')
Out[41]: 
platform android           ios
device       HTC Google iphone
key1                          
a              3      5      7
b             15     17     19

data.sum(level='platform',axis=1)
Out[42]: 
platform   android  ios
key1 key2              
a    1           1    2
     2           7    5
b    1          13    8
     2          19   11
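On pandas versions where sum() no longer accepts level, the same two results can be obtained with groupby. This is a sketch under the assumption that the DataFrame is built as reconstructed here; the transpose for the column-wise case is one possible approach, chosen to avoid relying on axis=1 in groupby.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the article's DataFrame, with level names set.
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays([['android', 'android', 'ios'],
                                       ['HTC', 'Google', 'iphone']],
                                      names=['platform', 'device']),
)

# Row-wise: sum the rows that share the same key1 label.
by_key1 = frame.groupby(level='key1').sum()

# Column-wise: transpose, group the former columns by platform, transpose back.
by_platform = frame.T.groupby(level='platform').sum().T

print(by_key1)
print(by_platform)
```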

Reference:
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
