本系列文章配套代码获取有以下三种途径:
-
可以在以下网站查看,该网站是使用JupyterLite搭建的web端Jupyter环境,因此无需在本地安装运行环境即可使用,首次运行浏览器需要下载一些配置文件(大约20M):
https://returu.github.io/Python_Data_Analysis/lab/index.html
-
也可以通过百度网盘获取,需要在本地配置代码运行环境,环境配置可以查看【Python基础】2.搭建Python开发环境:
链接:https://pan.baidu.com/s/1MYkeYeVAIRqbxezQECHwcA?pwd=mnsj
提取码:mnsj
-
前往GitHub详情页面,单击 code 按钮,选择Download ZIP选项:
https://github.com/returu/Python_Data_Analysis
根据《Python for Data Analysis 3rd Edition》翻译整理
—————————————————–
1.根据索引合并:
1.1 merge()
有时候需要使用数据的索引作为连接键,此时需要设定left_index=True或right_index=True(或者两者)来指定使用索引连接。
1>>> df1 = pd.DataFrame({'key':["a", "b", "a", "a", "b", "c", "b"],'value':[11,22,33,44,55,66,77]})
2>>> df1
3 key value
40 a 11
51 b 22
62 a 33
73 a 44
84 b 55
95 c 66
106 b 77
11
12>>> df2 = pd.DataFrame({'value2':[99,100]},index=["a", "b"])
13>>> df2
14 value2
15a 99
16b 100
17
18>>> pd.merge(df1,df2,left_on='key',right_index=True)
19 key value value2
200 a 11 99
212 a 33 99
223 a 44 99
231 b 22 100
244 b 55 100
256 b 77 100
26
27>>> pd.merge(df1,df2,left_on='key',right_index=True,how='outer')
28 key value value2
290 a 11 99.0
302 a 33 99.0
313 a 44 99.0
321 b 22 100.0
334 b 55 100.0
346 b 77 100.0
355 c 66 NaN
同样,当一方数据为多层索引时,传入一个键名的列表即可。
1>>> df1 = pd.DataFrame({"key1": ["a", "a", "a","b", "b"],
2... "key2": [2000, 2001, 2002, 2001, 2002],
3... "value":[11,22,33,44,55]})
4>>> df1
5 key1 key2 value
60 a 2000 11
71 a 2001 22
82 a 2002 33
93 b 2001 44
104 b 2002 55
11
12>>> df2_index = pd.MultiIndex.from_arrays([
13... ["b", "b", "a", "a", "a", "a"],
14... [2001, 2000, 2000, 2000, 2001, 2002]
15... ])
16>>> df2_index
17MultiIndex([('b', 2001),
18 ('b', 2000),
19 ('a', 2000),
20 ('a', 2000),
21 ('a', 2001),
22 ('a', 2002)],
23 )
24
25>>> df2 = pd.DataFrame({"val1": pd.Series([0, 2, 4, 6, 8, 10],index=df2_index),
26... "val2": pd.Series([1, 3, 5, 7, 9, 11],index=df2_index)})
27>>> df2
28 val1 val2
29b 2001 0 1
30 2000 2 3
31a 2000 4 5
32 2000 6 7
33 2001 8 9
34 2002 10 11
35
36>>> pd.merge(df1,df2,left_on=["key1","key2"],right_index=True)
37 key1 key2 value val1 val2
380 a 2000 11 4 5
390 a 2000 11 6 7
401 a 2001 22 8 9
412 a 2002 33 10 11
423 b 2001 44 0 1
43
44>>> pd.merge(df1,df2,left_on=["key1","key2"],right_index=True,how="outer")
45 key1 key2 value val1 val2
460 a 2000 11.0 4.0 5.0
470 a 2000 11.0 6.0 7.0
481 a 2001 22.0 8.0 9.0
492 a 2002 33.0 10.0 11.0
503 b 2001 44.0 0.0 1.0
514 b 2002 55.0 NaN NaN
524 b 2000 NaN 2.0 3.0
使用两边的索引进行合并。
1>>> df1 = pd.DataFrame({'key1':[11,22,33],'key2':[99,88,77]},index={"a","b","c"})
2>>> df1
3 key1 key2
4a 11 99
5b 22 88
6c 33 77
7
8>>> df2 = pd.DataFrame({'value1':[9,10,11],'value2':[1,2,3]},index=["b", "d", "c"])
9>>> df2
10 value1 value2
11b 9 1
12d 10 2
13c 11 3
14
15>>> pd.merge(df1,df2,left_index=True,right_index=True)
16 key1 key2 value1 value2
17b 22 88 9 1
18c 33 77 11 3
19
20>>> pd.merge(df1,df2,left_index=True,right_index=True,how="outer")
21 key1 key2 value1 value2
22a 11.0 99.0 NaN NaN
23b 22.0 88.0 9.0 1.0
24c 33.0 77.0 11.0 3.0
25d NaN NaN 10.0 2.0
1.2 join()
DataFrame 有一个方便的join实例方法,用于按照索引合并。该方法也可以用于合并多个索引相同或相似但是没有重叠列的DataFrame 对象。
1>>> df1 = pd.DataFrame({'key1':[11,22,33],'key2':[99,88,77]},index={"a","b","c"})
2>>> df1
3 key1 key2
4a 11 99
5b 22 88
6c 33 77
7
8>>> df2 = pd.DataFrame({'value1':[9,10,11],'value2':[1,2,3]},index=["b", "d", "c"])
9>>> df2
10 value1 value2
11b 9 1
12d 10 2
13c 11 3
14
15>>> df1.join(df2,how="outer")
16 key1 key2 value1 value2
17a 11.0 99.0 NaN NaN
18b 22.0 88.0 9.0 1.0
19c 33.0 77.0 11.0 3.0
20d NaN NaN 10.0 2.0
与 pandas.merge 相比,DataFrame 的 join方法默认对连接键进行左连接。它还支持在调用 DataFrame 的某列上连接传递的 DataFrame 的索引。
1>>> df1 = pd.DataFrame({'key':["a", "b", "a", "a", "b", "c", "b"],'value':[11,22,33,44,55,66,77]})
2>>> df1
3 key value
40 a 11
51 b 22
62 a 33
73 a 44
84 b 55
95 c 66
106 b 77
11
12>>> df2 = pd.DataFrame({'value2':[99,100]},index=["a", "b"])
13>>> df2
14 value2
15a 99
16b 100
17
18>>> df1.join(df2,on="key",how="outer")
19 key value value2
200 a 11 99.0
212 a 33 99.0
223 a 44 99.0
231 b 22 100.0
244 b 55 100.0
256 b 77 100.0
265 c 66 NaN
最后,对于简单的 index-on-index 合并,可以传递要连接的 DataFrame 列表,可以作为下一节中介绍的更通用的 pandas.concat 函数的替代方法。
1>>> df1 = pd.DataFrame({'key1':[11,22,33],'key2':[99,88,77]},index={"a","b","c"})
2>>> df1
3 key1 key2
4a 11 99
5b 22 88
6c 33 77
7
8>>> df2 = pd.DataFrame({'value1':[9,10,11],'value2':[1,2,3]},index=["b", "d", "c"])
9>>> df2
10 value1 value2
11b 9 1
12d 10 2
13c 11 3
14
15>>> df3 = pd.DataFrame([[7, 8], [9, 10], [11, 12], [16, 17]],index=["a", "c", "e", "f"],columns=["score1", "score2"])
16>>> df3
17 score1 score2
18a 7 8
19c 9 10
20e 11 12
21f 16 17
22
23>>> df1.join([df2,df3])
24 key1 key2 value1 value2 score1 score2
25a 11.0 99.0 NaN NaN 7.0 8.0
26b 22.0 88.0 9.0 1.0 NaN NaN
27c 33.0 77.0 11.0 3.0 9.0 10.0
28
29>>> df1.join([df2,df3],how="outer")
30 key1 key2 value1 value2 score1 score2
31a 11.0 99.0 NaN NaN 7.0 8.0
32b 22.0 88.0 9.0 1.0 NaN NaN
33c 33.0 77.0 11.0 3.0 9.0 10.0
34d NaN NaN 10.0 2.0 NaN NaN
35e NaN NaN NaN NaN 11.0 12.0
36f NaN NaN NaN NaN 16.0 17.0
本篇文章来源于微信公众号: 码农设计师