16.数据规整——重塑与透视

By 进击的码农设计师

2019年6月13日

0

1104

数据的重塑与透视其实就是重新排列原有的表格型数据。

1.使用多层索引进行排序：

主要使用以下两种基本方法：

stack()堆叠，该方法会将列中的数据透视到行。
unstack()拆堆，该方法会将行中的数据透视到列。

df = pd.DataFrame(np.arange(6).reshape((2,3)),index=pd.Index(['ios','andiord'],name='style'),columns=pd.Index(['one','two','three'],name='number'))

df
Out[121]: 
number   one  two  three
style                   
ios        0    1      2
andiord    3    4      5

df_stack = df.stack()

df_stack
Out[123]: 
style    number
ios      one       0
         two       1
         three     2
andiord  one       3
         two       4
         three     5
dtype: int32

df_stack.unstack()
Out[124]: 
number   one  two  three
style                   
ios        0    1      2
andiord    3    4      5

默认情况下，unstack()拆堆最内层，可以传入一个层级序号或名称来拆分不同的层级。

df_stack
Out[125]: 
style    number
ios      one       0
         two       1
         three     2
andiord  one       3
         two       4
         three     5
dtype: int32

df_stack.unstack(1)
Out[126]: 
number   one  two  three
style                   
ios        0    1      2
andiord    3    4      5

df_stack.unstack(0)
Out[127]: 
style   ios  andiord
number              
one       0        3
two       1        4
three     2        5

df_stack.unstack('style')
Out[128]: 
style   ios  andiord
number              
one       0        3
two       1        4
three     2        5

如果层级中所有值并为包含与每个子分组中时，拆分会引入缺失值。

2.数据透视1——将“长”透视为“宽”：

pivot()方法中传入的前两个值分别用于生成结果中的行和列索引，然后以可选的数值列填充DataFrame。

df
Out[150]: 
   Style variable  value
0    one        A      1
1    two        A      2
2  three        A      3
3    one        B      4
4    two        B      5
5  three        B      6
6    one        C      7
7    two        C      8
8  three        C      9

df_pivoted = df.pivot('Style','variable','value')

df_pivoted
Out[152]: 
variable  A  B  C
Style            
one       1  4  7
three     3  6  9
two       2  5  8

df_pivoted.index
Out[153]: Index(['one', 'three', 'two'], dtype='object', name='Style')

df_pivoted.columns
Out[154]: Index(['A', 'B', 'C'], dtype='object', name='variable')

如果想重塑时保留两个数据列的话，去除最后一个参数即可。

df['value2'] = [10,11,12,13,14,15,16,17,18]

df
Out[156]: 
   Style variable  value  value2
0    one        A      1      10
1    two        A      2      11
2  three        A      3      12
3    one        B      4      13
4    two        B      5      14
5  three        B      6      15
6    one        C      7      16
7    two        C      8      17
8  three        C      9      18

df_pivoted = df.pivot('Style','variable')

df_pivoted
Out[158]: 
         value       value2        
variable     A  B  C      A   B   C
Style                              
one          1  4  7     10  13  16
three        3  6  9     12  15  18
two          2  5  8     11  14  17

pivot()方法等价于使用set_index()方法创建分层索引，然后再调用unstack()方法。

df
Out[162]: 
   Style variable  value
0    one        A      1
1    two        A      2
2  three        A      3
3    one        B      4
4    two        B      5
5  three        B      6
6    one        C      7
7    two        C      8
8  three        C      9

df.set_index(['Style','variable']).unstack('variable')
Out[163]: 
         value      
variable     A  B  C
Style               
one          1  4  7
three        3  6  9
two          2  5  8

df_pivoted = df.pivot('Style','variable','value')

df_pivoted
Out[165]: 
variable  A  B  C
Style            
one       1  4  7
three     3  6  9
two       2  5  8

3.数据透视2——将“宽”透视为“长”：

pivot()方法的反操作是melt()方法。

df = pd.DataFrame({'Style':['one','two','three'],'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})

df
Out[133]: 
   A  B  C  Style
0  1  4  7    one
1  2  5  8    two
2  3  6  9  three

df_melted = pd.melt(df,['Style'])

df_melted
Out[135]: 
   Style variable  value
0    one        A      1
1    two        A      2
2  three        A      3
3    one        B      4
4    two        B      5
5  three        B      6
6    one        C      7
7    two        C      8
8  three        C      9

使用pivot()方法，可以将数据重塑会原来的布局。

df_re = df_melted.pivot('Style','variable','value')

df_re
Out[138]: 
variable  A  B  C
Style            
one       1  4  7
three     3  6  9
two       2  5  8

由于pivot()之后的结果根据作为行标签的列生成索引，此时需要使用reset_index()方法将数据回移一列。

df_re.reset_index()
Out[139]: 
variable  Style  A  B  C
0           one  1  4  7
1         three  3  6  9
2           two  2  5  8

df_re.index
Out[140]: Index(['one', 'three', 'two'], dtype='object', name='Style')

df_re.reset_index().index
Out[141]: RangeIndex(start=0, stop=3, step=1)

也可以指定列的子集作为值列。

df
Out[143]: 
   A  B  C  Style
0  1  4  7    one
1  2  5  8    two
2  3  6  9  three

pd.melt(df,id_vars=['Style'],value_vars=['A','B'])
Out[144]: 
   Style variable  value
0    one        A      1
1    two        A      2
2  three        A      3
3    one        B      4
4    two        B      5
5  three        B      6

melt()方法也可以无须任何分组指标。

df
Out[145]: 
   A  B  C  Style
0  1  4  7    one
1  2  5  8    two
2  3  6  9  three

pd.melt(df,value_vars=['Style','A'])
Out[146]: 
  variable  value
0    Style    one
1    Style    two
2    Style  three
3        A      1
4        A      2
5        A      3

Reference：
《Python for Data Analysis:Data Wrangling with Pandas,Numpy,and IPython》

Previous article15.数据规整——联合与合并数据集

Next article17.Matplotlib库入门

欢迎留下您的宝贵建议 Cancel reply

Please enter your comment!

Please enter your name here

You have entered an incorrect email address!

Please enter your email address here

16.数据规整——重塑与透视

1.使用多层索引进行排序：

2.数据透视1——将“长”透视为“宽”：

3.数据透视2——将“宽”透视为“长”：

【Python小程序】自动解压并删除压缩包

【Python小程序】查找并删除重复文件

【Python小程序】实现文件自动归类

欢迎留下您的宝贵建议 Cancel reply

Most Popular

【ArcGIS工具箱】221.几何网络——重新构建...

【深度学习(PyTorch篇)】47.可视化工具—...

【深度学习(PyTorch篇)】46.可视化工具—...

【深度学习(PyTorch篇)】45.autogr...

Recent Comments

EDITOR PICKS

RSS

3D Map Generator Terrain

1.ENVI软件操作基础——窗口介绍及打开、浏览数...

POPULAR POSTS

16.使用Visdom可视化

3.数据变换——地理配准

【娱乐】万神殿第一季

POPULAR CATEGORY