首页Python16.数据规整——重塑与透...

16.数据规整——重塑与透视

数据的重塑与透视其实就是重新排列原有的表格型数据。

1.使用多层索引进行排序:

主要使用以下两种基本方法:

  • stack()堆叠,该方法会将列中的数据透视到行。
  • unstack()拆堆,该方法会将行中的数据透视到列。
df = pd.DataFrame(np.arange(6).reshape((2,3)),index=pd.Index(['ios','andiord'],name='style'),columns=pd.Index(['one','two','three'],name='number'))

df
Out[121]: 
number   one  two  three
style                   
ios        0    1      2
andiord    3    4      5

df_stack = df.stack()

df_stack
Out[123]: 
style    number
ios      one       0
         two       1
         three     2
andiord  one       3
         two       4
         three     5
dtype: int32

df_stack.unstack()
Out[124]: 
number   one  two  three
style                   
ios        0    1      2
andiord    3    4      5

默认情况下,unstack()拆堆最内层,可以传入一个层级序号或名称来拆分不同的层级。

df_stack
Out[125]: 
style    number
ios      one       0
         two       1
         three     2
andiord  one       3
         two       4
         three     5
dtype: int32

df_stack.unstack(1)
Out[126]: 
number   one  two  three
style                   
ios        0    1      2
andiord    3    4      5

df_stack.unstack(0)
Out[127]: 
style   ios  andiord
number              
one       0        3
two       1        4
three     2        5

df_stack.unstack('style')
Out[128]: 
style   ios  andiord
number              
one       0        3
two       1        4
three     2        5

如果层级中所有值并为包含与每个子分组中时,拆分会引入缺失值。

2.数据透视1——将“长”透视为“宽”:

pivot()方法中传入的前两个值分别用于生成结果中的行和列索引,然后以可选的数值列填充DataFrame。

df
Out[150]: 
   Style variable  value
0    one        A      1
1    two        A      2
2  three        A      3
3    one        B      4
4    two        B      5
5  three        B      6
6    one        C      7
7    two        C      8
8  three        C      9

df_pivoted = df.pivot('Style','variable','value')

df_pivoted
Out[152]: 
variable  A  B  C
Style            
one       1  4  7
three     3  6  9
two       2  5  8

df_pivoted.index
Out[153]: Index(['one', 'three', 'two'], dtype='object', name='Style')

df_pivoted.columns
Out[154]: Index(['A', 'B', 'C'], dtype='object', name='variable')

如果想重塑时保留两个数据列的话,去除最后一个参数即可。

df['value2'] = [10,11,12,13,14,15,16,17,18]

df
Out[156]: 
   Style variable  value  value2
0    one        A      1      10
1    two        A      2      11
2  three        A      3      12
3    one        B      4      13
4    two        B      5      14
5  three        B      6      15
6    one        C      7      16
7    two        C      8      17
8  three        C      9      18

df_pivoted = df.pivot('Style','variable')

df_pivoted
Out[158]: 
         value       value2        
variable     A  B  C      A   B   C
Style                              
one          1  4  7     10  13  16
three        3  6  9     12  15  18
two          2  5  8     11  14  17

pivot()方法等价于使用set_index()方法创建分层索引,然后再调用unstack()方法。

df
Out[162]: 
   Style variable  value
0    one        A      1
1    two        A      2
2  three        A      3
3    one        B      4
4    two        B      5
5  three        B      6
6    one        C      7
7    two        C      8
8  three        C      9

df.set_index(['Style','variable']).unstack('variable')
Out[163]: 
         value      
variable     A  B  C
Style               
one          1  4  7
three        3  6  9
two          2  5  8

df_pivoted = df.pivot('Style','variable','value')

df_pivoted
Out[165]: 
variable  A  B  C
Style            
one       1  4  7
three     3  6  9
two       2  5  8

3.数据透视2——将“宽”透视为“长”:

pivot()方法的反操作是melt()方法。

df = pd.DataFrame({'Style':['one','two','three'],'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})

df
Out[133]: 
   A  B  C  Style
0  1  4  7    one
1  2  5  8    two
2  3  6  9  three

df_melted = pd.melt(df,['Style'])

df_melted
Out[135]: 
   Style variable  value
0    one        A      1
1    two        A      2
2  three        A      3
3    one        B      4
4    two        B      5
5  three        B      6
6    one        C      7
7    two        C      8
8  three        C      9

使用pivot()方法,可以将数据重塑会原来的布局。

df_re = df_melted.pivot('Style','variable','value')

df_re
Out[138]: 
variable  A  B  C
Style            
one       1  4  7
three     3  6  9
two       2  5  8

由于pivot()之后的结果根据作为行标签的列生成索引,此时需要使用reset_index()方法将数据回移一列。

df_re.reset_index()
Out[139]: 
variable  Style  A  B  C
0           one  1  4  7
1         three  3  6  9
2           two  2  5  8

df_re.index
Out[140]: Index(['one', 'three', 'two'], dtype='object', name='Style')

df_re.reset_index().index
Out[141]: RangeIndex(start=0, stop=3, step=1)

也可以指定列的子集作为值列。

df
Out[143]: 
   A  B  C  Style
0  1  4  7    one
1  2  5  8    two
2  3  6  9  three

pd.melt(df,id_vars=['Style'],value_vars=['A','B'])
Out[144]: 
   Style variable  value
0    one        A      1
1    two        A      2
2  three        A      3
3    one        B      4
4    two        B      5
5  three        B      6

melt()方法也可以无须任何分组指标。

df
Out[145]: 
   A  B  C  Style
0  1  4  7    one
1  2  5  8    two
2  3  6  9  three

pd.melt(df,value_vars=['Style','A'])
Out[146]: 
  variable  value
0    Style    one
1    Style    two
2    Style  three
3        A      1
4        A      2
5        A      3

Reference:
《Python for Data Analysis:Data Wrangling with Pandas,Numpy,and IPython》

RELATED ARTICLES

欢迎留下您的宝贵建议

Please enter your comment!
Please enter your name here

- Advertisment -

Most Popular

Recent Comments