[Python Data Analysis] 37. Data Cleaning and Preparation: Categorical Data, Part 1

January 15, 2023

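The companion code for this series is available in three ways:

It can be viewed on the following site, a web-based Jupyter environment built with JupyterLite, so no local runtime needs to be installed. On the first run the browser downloads some configuration files (about 20 MB):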

https://returu.github.io/Python_Data_Analysis/lab/index.html

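It can also be downloaded from Baidu Netdisk, in which case a local runtime is required; for setup, see [Python Basics] 2. Setting Up a Python Development Environment:

Link: https://pan.baidu.com/s/1MYkeYeVAIRqbxezQECHwcA?pwd=mnsj Extraction code: mnsj

Or go to the GitHub repository page, click the Code button, and choose Download ZIP: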

https://github.com/returu/Python_Data_Analysis

Translated and adapted from Python for Data Analysis, 3rd Edition.

—————————————————–

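This section introduces the Categorical type in pandas. Using it can give better performance and lower memory use in some pandas operations.

1. Introduction:

A column in a table often contains repeated instances of a small set of distinct values. Functions such as unique and value_counts extract the distinct values from an array and compute their frequencies, respectively.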

>>> import pandas as pd

>>> values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)
>>> values
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

>>> pd.unique(values)
array(['apple', 'orange'], dtype=object)

>>> # The top-level pd.value_counts(values) is deprecated in recent pandas;
>>> # the Series method is equivalent (pandas 2.x also prints "Name: count").
>>> values.value_counts()
apple     6
orange    2
dtype: int64

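Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches for representing data with repeated values, for more efficient storage and computation. In data warehousing, a best practice is to use so-called dimension tables, which contain the distinct values, and to store the primary observations as integer keys referencing the dimension table.

>>> values = pd.Series([0, 1, 0, 0] * 2)
>>> values
0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

>>> dim = pd.Series(['apple', 'orange'])
>>> dim
0     apple
1    orange
dtype: object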

We can use the take method to restore the original Series of strings.

>>> dim.take(values)
0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

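This representation as integers is called the categorical or dictionary-encoded representation. The array of distinct values can be called the categories, dictionary, or levels of the data. This article uses the terms categorical and categories. The integer values that reference the categories are called the category codes, or simply the codes.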
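The encoding described above can be sketched with pd.factorize, which is not used in the original text and is chosen here only for illustration. It returns the integer codes together with the distinct values, and take reverses the encoding:

```python
import pandas as pd

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# pd.factorize returns integer codes plus the distinct values,
# listed in order of first appearance.
codes, categories = pd.factorize(values)

# Decoding: look each code up in the categories to rebuild the strings.
decoded = pd.Series(categories.take(codes))

print(list(codes))        # integer codes, one per original value
print(list(categories))   # the distinct values ("categories")
print(decoded.equals(values))
```

Only the small categories array holds strings; the bulk of the data is stored as integers.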

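The categorical representation can yield significant performance improvements when you are doing analytics. You can also transform the categories while leaving the codes unmodified. Some example transformations that can be made at relatively low cost are:

Renaming categories;
Appending a new category without changing the order or position of the existing categories.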
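As a minimal sketch of these two transformations, the example below uses the rename_categories and add_categories methods of pandas.Categorical. The sample data and the new category name 'pear' are assumptions made for illustration. In both cases the integer codes stay the same; only the categories change:

```python
import pandas as pd

cat = pd.Categorical(['apple', 'orange', 'apple', 'apple'])
codes_before = cat.codes.copy()

# Rename the categories; every value keeps its integer code.
renamed = cat.rename_categories({'apple': 'Apple', 'orange': 'Orange'})

# Append a new, so far unused category after the existing ones.
extended = cat.add_categories(['pear'])

print(list(renamed.categories))
print(list(extended.categories))
print((renamed.codes == codes_before).all(),
      (extended.codes == codes_before).all())
```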

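2. The Categorical type in pandas:

pandas has a special Categorical type for holding data that uses the integer-based categorical representation or encoding. This is a popular data compression technique for data in which the same values occur many times, and it can significantly improve performance and reduce memory use, especially for string data.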
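The memory claim can be checked with a rough sketch like the one below. The Series size is an arbitrary assumption, and the exact byte counts depend on the pandas version and platform, so none are quoted here:

```python
import pandas as pd

# 100,000 strings drawn from only two distinct values.
labels = pd.Series(['apple', 'orange', 'apple', 'apple'] * 25_000)
as_category = labels.astype('category')

# deep=True also counts the memory held by the string objects themselves.
object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```

The categorical version stores one small integer code per row plus a single copy of each distinct string, which is why it should come out smaller.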

Consider the earlier example:

>>> import numpy as np

>>> fruits = ['apple', 'orange', 'apple', 'apple'] * 2

>>> N = len(fruits)

>>> rng = np.random.default_rng(seed=12345)

>>> df = pd.DataFrame({'fruit': fruits,
...                    'basket_id': np.arange(N),
...                    'count': rng.integers(3, 15, size=N),
...                    'weight': rng.uniform(0, 4, size=N)},
...                   columns=['basket_id', 'fruit', 'count', 'weight'])

>>> df
   basket_id   fruit  count    weight
0          0   apple     11  1.564438
1          1  orange      5  1.331256
2          2   apple     12  2.393235
3          3   apple      6  0.746937
4          4   apple      5  2.691024
5          5  orange     12  3.767211
6          6   apple     10  0.992983
7          7   apple     11  3.795525

df['fruit'] is an array of strings, and it can be converted to the categorical type.

>>> df['fruit'] = df['fruit'].astype('category')

>>> fruit_cat = df['fruit']
>>> fruit_cat
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

The values of fruit_cat are now an instance of pandas.Categorical, which you can access through the .array attribute.

>>> fruit_cat.array
['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
Categories (2, object): ['apple', 'orange']

>>> c = fruit_cat.array
>>> type(c)
pandas.core.arrays.categorical.Categorical

The Categorical object has categories and codes attributes.

>>> c.categories
Index(['apple', 'orange'], dtype='object')

>>> c.codes
array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

A useful trick to get the mapping between codes and categories is:

>>> dict(enumerate(c.categories))
{0: 'apple', 1: 'orange'}

You can also create a pandas.Categorical directly from other types of Python sequences.

>>> my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
>>> my_categories
['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']

If you have obtained categorical encoded data from another source, you can use the alternative from_codes constructor.

>>> categories = ['foo', 'bar', 'baz']

>>> codes = [0, 1, 2, 0, 0, 1]

>>> my_cats_2 = pd.Categorical.from_codes(codes, categories)
>>> my_cats_2
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']
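One detail worth knowing when importing codes from elsewhere, which the original text does not cover: from_codes treats the code -1 as a missing value. A minimal sketch, with made-up sample codes:

```python
import pandas as pd

categories = ['foo', 'bar', 'baz']
codes = [0, 1, -1, 2]  # -1 is the sentinel for a missing value

cat = pd.Categorical.from_codes(codes, categories)

print(cat)
print(cat.isna())  # True only at the position that held -1
```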

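Unless explicitly specified, categorical conversions assume no specific ordering of the categories, so the order of the categories array may differ depending on the order of the input data. When using from_codes or any of the other constructors, you can indicate that the categories have a meaningful ordering:

>>> ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
>>> ordered_cat
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']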

The output ['foo' < 'bar' < 'baz'] indicates that 'foo' precedes 'bar' in the ordering, and so on. An unordered categorical instance can be made ordered with as_ordered.

>>> my_cats_2
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']

>>> my_cats_2.as_ordered()
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
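To see what the ordering buys you, here is a small sketch. The size labels are hypothetical example data, not from the original. With an ordered categorical, sorting, min/max, and comparisons follow the category order rather than alphabetical order:

```python
import pandas as pd

sizes = pd.Categorical(['medium', 'small', 'large', 'small'],
                       categories=['small', 'medium', 'large'],
                       ordered=True)
s = pd.Series(sizes)

# Sorting follows small < medium < large, not the alphabet.
sorted_values = s.sort_values().tolist()

# min/max and comparisons are only defined for ordered categoricals.
smallest, largest = s.min(), s.max()
above_small = (s > 'small').tolist()

print(sorted_values, smallest, largest, above_small)
```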

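As a last note, categorical data need not be strings; a categorical array can consist of any immutable value types.

This article originally appeared on the WeChat official account 码农设计师.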
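For instance, a categorical built from integers works the same way. The rating values below are an assumption for illustration:

```python
import pandas as pd

ratings = pd.Categorical([3, 1, 2, 3, 3, 1])

# Categories are the sorted distinct integers; codes index into them.
print(ratings.categories)
print(ratings.codes)
```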
