数据的重塑与透视其实就是重新排列原有的表格型数据。
1.使用多层索引进行排序:
主要使用以下两种基本方法:
stack()
堆叠,该方法会将列中的数据透视到行。unstack()
拆堆,该方法会将行中的数据透视到列。
df = pd.DataFrame(np.arange(6).reshape((2,3)),index=pd.Index(['ios','andiord'],name='style'),columns=pd.Index(['one','two','three'],name='number'))
df
Out[121]:
number one two three
style
ios 0 1 2
andiord 3 4 5
df_stack = df.stack()
df_stack
Out[123]:
style number
ios one 0
two 1
three 2
andiord one 3
two 4
three 5
dtype: int32
df_stack.unstack()
Out[124]:
number one two three
style
ios 0 1 2
andiord 3 4 5
默认情况下,unstack()
拆堆最内层,可以传入一个层级序号或名称来拆分不同的层级。
df_stack
Out[125]:
style number
ios one 0
two 1
three 2
andiord one 3
two 4
three 5
dtype: int32
df_stack.unstack(1)
Out[126]:
number one two three
style
ios 0 1 2
andiord 3 4 5
df_stack.unstack(0)
Out[127]:
style ios andiord
number
one 0 3
two 1 4
three 2 5
df_stack.unstack('style')
Out[128]:
style ios andiord
number
one 0 3
two 1 4
three 2 5
如果层级中所有值并为包含与每个子分组中时,拆分会引入缺失值。
2.数据透视1——将“长”透视为“宽”:
pivot()
方法中传入的前两个值分别用于生成结果中的行和列索引,然后以可选的数值列填充DataFrame。
df
Out[150]:
Style variable value
0 one A 1
1 two A 2
2 three A 3
3 one B 4
4 two B 5
5 three B 6
6 one C 7
7 two C 8
8 three C 9
df_pivoted = df.pivot('Style','variable','value')
df_pivoted
Out[152]:
variable A B C
Style
one 1 4 7
three 3 6 9
two 2 5 8
df_pivoted.index
Out[153]: Index(['one', 'three', 'two'], dtype='object', name='Style')
df_pivoted.columns
Out[154]: Index(['A', 'B', 'C'], dtype='object', name='variable')
如果想重塑时保留两个数据列的话,去除最后一个参数即可。
df['value2'] = [10,11,12,13,14,15,16,17,18]
df
Out[156]:
Style variable value value2
0 one A 1 10
1 two A 2 11
2 three A 3 12
3 one B 4 13
4 two B 5 14
5 three B 6 15
6 one C 7 16
7 two C 8 17
8 three C 9 18
df_pivoted = df.pivot('Style','variable')
df_pivoted
Out[158]:
value value2
variable A B C A B C
Style
one 1 4 7 10 13 16
three 3 6 9 12 15 18
two 2 5 8 11 14 17
pivot()
方法等价于使用set_index()
方法创建分层索引,然后再调用unstack()
方法。
df
Out[162]:
Style variable value
0 one A 1
1 two A 2
2 three A 3
3 one B 4
4 two B 5
5 three B 6
6 one C 7
7 two C 8
8 three C 9
df.set_index(['Style','variable']).unstack('variable')
Out[163]:
value
variable A B C
Style
one 1 4 7
three 3 6 9
two 2 5 8
df_pivoted = df.pivot('Style','variable','value')
df_pivoted
Out[165]:
variable A B C
Style
one 1 4 7
three 3 6 9
two 2 5 8
3.数据透视2——将“宽”透视为“长”:
pivot()
方法的反操作是melt()
方法。
df = pd.DataFrame({'Style':['one','two','three'],'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
df
Out[133]:
A B C Style
0 1 4 7 one
1 2 5 8 two
2 3 6 9 three
df_melted = pd.melt(df,['Style'])
df_melted
Out[135]:
Style variable value
0 one A 1
1 two A 2
2 three A 3
3 one B 4
4 two B 5
5 three B 6
6 one C 7
7 two C 8
8 three C 9
使用pivot()
方法,可以将数据重塑会原来的布局。
df_re = df_melted.pivot('Style','variable','value')
df_re
Out[138]:
variable A B C
Style
one 1 4 7
three 3 6 9
two 2 5 8
由于pivot()
之后的结果根据作为行标签的列生成索引,此时需要使用reset_index()
方法将数据回移一列。
df_re.reset_index()
Out[139]:
variable Style A B C
0 one 1 4 7
1 three 3 6 9
2 two 2 5 8
df_re.index
Out[140]: Index(['one', 'three', 'two'], dtype='object', name='Style')
df_re.reset_index().index
Out[141]: RangeIndex(start=0, stop=3, step=1)
也可以指定列的子集作为值列。
df
Out[143]:
A B C Style
0 1 4 7 one
1 2 5 8 two
2 3 6 9 three
pd.melt(df,id_vars=['Style'],value_vars=['A','B'])
Out[144]:
Style variable value
0 one A 1
1 two A 2
2 three A 3
3 one B 4
4 two B 5
5 three B 6
melt()
方法也可以无须任何分组指标。
df
Out[145]:
A B C Style
0 1 4 7 one
1 2 5 8 two
2 3 6 9 three
pd.melt(df,value_vars=['Style','A'])
Out[146]:
variable value
0 Style one
1 Style two
2 Style three
3 A 1
4 A 2
5 A 3
Reference:
《Python for Data Analysis:Data Wrangling with Pandas,Numpy,and IPython》