MultiIndex / advanced indexing#

This section covers indexing with a MultiIndex and other advanced indexing features. See Indexing and selecting data for general indexing documentation.

Warning: Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a view versus a copy.

See the cookbook for some advanced strategies.

Hierarchical indexing (MultiIndex)#

Hierarchical / multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

In this section, we will show what exactly we mean by "hierarchical" indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we'll show non-trivial applications to illustrate how it aids in structuring data for analysis.

See the cookbook for some advanced strategies.

Creating a MultiIndex (hierarchical index) object#

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of a MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

In [1]: arrays = [ ...: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"], ...: ["one", "two", "one", "two", "one", "two", "one", "two"], ...: ] ...: In [2]: tuples = list(zip(*arrays)) In [3]: tuples Out[3]: [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')] In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"]) In [5]: index Out[5]: MultiIndex([('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')], names=['first', 'second']) In [6]: s = pd.Series(np.random.randn(8), index=index) In [7]: s Out[7]: first second bar one 0.469112 two -0.282863 baz one -1.509059 two -1.135632 foo one 1.212112 two -0.173215 qux one 0.119209 two -1.044236 dtype: float64

When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:

In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]] In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"]) Out[9]: MultiIndex([('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')], names=['first', 'second'])

You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().

In [10]: df = pd.DataFrame( ....: [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]], ....: columns=["first", "second"], ....: ) ....: In [11]: pd.MultiIndex.from_frame(df) Out[11]: MultiIndex([('bar', 'one'), ('bar', 'two'), ('foo', 'one'), ('foo', 'two')], names=['first', 'second'])

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [12]: arrays = [ ....: np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]), ....: np.array(["one", "two", "one", "two", "one", "two", "one", "two"]), ....: ] ....: In [13]: s = pd.Series(np.random.randn(8), index=arrays) In [14]: s Out[14]: bar one -0.861849 two -2.104569 baz one -0.494929 two 1.071804 foo one 0.721555 two -0.706771 qux one -1.039575 two 0.271860 dtype: float64 In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays) In [16]: df Out[16]: 0 1 2 3 bar one -0.424972 0.567020 0.276232 -1.087401 two -0.673690 0.113648 -1.478427 0.524988 baz one 0.404705 0.577046 -1.715002 -1.039268 two -0.370647 -1.157892 -1.344312 0.844885 foo one 1.075770 -0.109050 1.643563 -1.469388 two 0.357021 -0.674600 -1.776904 -0.968914 qux one -1.294524 0.413738 0.276662 -0.472035 two -0.013960 -0.362543 -0.006154 -0.923061

All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [17]: df.index.names Out[17]: FrozenList([None, None])
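The same index can also be built explicitly from the list of arrays with MultiIndex.from_arrays(), which lets you attach the level names up front. A minimal sketch (the variable mi below is introduced here for illustration only):

    # Sketch: construct the MultiIndex directly from the list of arrays
    # and name both levels at construction time.
    mi = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
    mi.names  # FrozenList(['first', 'second'])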
This index can back any axis of a pandas object, and the number of levels of the index is up to you:

In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index) In [19]: df Out[19]: first bar baz ... foo qux second one two one ... two one two A 0.895717 0.805244 -1.206412 ... 1.340309 -1.170299 -0.226169 B 0.410835 0.813850 0.132003 ... -1.187678 1.130127 -1.436737 C -1.413681 1.607920 1.024180 ... -2.211372 0.974466 -2.006747 [3 rows x 8 columns] In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6]) Out[20]: first bar baz foo second one two one two one two first second bar one -0.410001 -0.078638 0.545952 -1.219217 -1.226825 0.769804 two -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734 baz one 0.959726 -1.110336 -0.619976 0.149748 -0.732339 0.687738 two 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849 foo one -0.954208 1.462696 -1.743161 -0.826591 -0.345352 1.314232 two 0.690579 0.995761 2.396780 0.014871 3.357427 -0.317441

We've "sparsified" the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled using the multi_sparse option in pandas.set_option():

In [21]: with pd.option_context("display.multi_sparse", False): ....: df ....:

It's worth keeping in mind that there's nothing preventing you from using tuples as atomic labels on an axis:

In [22]: pd.Series(np.random.randn(8), index=tuples) Out[22]: (bar, one) -1.236269 (bar, two) 0.896171 (baz, one) -0.487602 (baz, two) -0.082240 (foo, one) -2.182937 (foo, two) 0.380396 (qux, one) 0.084844 (qux, two) 0.432390 dtype: float64

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.

Reconstructing the level labels#

The method get_level_values() will return a vector of the labels for each location at a particular level:

In [23]: index.get_level_values(0) Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first') In [24]: index.get_level_values("second") Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')

Basic indexing on axis with MultiIndex#

One of the important features of hierarchical indexing is that you can select data by a "partial" label identifying a subgroup in the data. Partial selection "drops" levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [25]: df["bar"] Out[25]: second one two A 0.895717 0.805244 B 0.410835 0.813850 C -1.413681 1.607920 In [26]: df["bar", "one"] Out[26]: A 0.895717 B 0.410835 C -1.413681 Name: (bar, one), dtype: float64 In [27]: df["bar"]["one"] Out[27]: A 0.895717 B 0.410835 C -1.413681 Name: one, dtype: float64 In [28]: s["qux"] Out[28]: one -1.039575 two 0.271860 dtype: float64

See Cross-section with hierarchical index for how to select on a deeper level.

Defined levels#

The MultiIndex keeps all the defined levels of an index, even if they are not actually used. You may notice this when slicing the index. For example:

In [29]: df.columns.levels # original MultiIndex Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']]) In [30]: df[["foo","qux"]].columns.levels # sliced Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.

In [31]: df[["foo", "qux"]].columns.to_numpy() Out[31]: array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')], dtype=object) # for a specific level In [32]: df[["foo", "qux"]].columns.get_level_values(0) Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.

In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels() In [34]: new_mi.levels Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])

Data alignment and using reindex#

Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

In [35]: s + s[:-2] Out[35]: bar one -1.723698 two -4.209138 baz one -0.989859 two 2.143608 foo one 1.443110 two -1.413542 qux one NaN two NaN dtype: float64 In [36]: s + s[::2] Out[36]: bar one -1.723698 two NaN baz one -0.989859 two NaN foo one 1.443110 two NaN qux one -2.079150 two NaN dtype: float64

The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:

In [37]: s.reindex(index[:3])
Out[37]: first second bar one -0.861849 two -2.104569 baz one -0.494929 dtype: float64 In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")]) Out[38]: foo two -0.706771 bar one -0.861849 qux one -1.039575 baz one -0.494929 dtype: float64

Advanced indexing with hierarchical index#

Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we've made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

In [39]: df = df.T In [40]: df Out[40]: A B C first second bar one 0.895717 0.410835 -1.413681 two 0.805244 0.813850 1.607920 baz one -1.206412 0.132003 1.024180 two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372 qux one -1.170299 1.130127 0.974466 two -0.226169 -1.436737 -2.006747 In [41]: df.loc[("bar", "two")] Out[41]: A 0.805244 B 0.813850 C 1.607920 Name: (bar, two), dtype: float64

Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.

If you also want to index a specific column with .loc, you must use a tuple like this:

In [42]: df.loc[("bar", "two"), "A"] Out[42]: 0.8052440253863785

You don't have to specify all levels of the MultiIndex by passing only the first elements of the tuple. For example, you can use "partial" indexing to get all elements with bar in the first level as follows:

In [43]: df.loc["bar"] Out[43]: A B C second one 0.895717 0.410835 -1.413681 two 0.805244 0.813850 1.607920

This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).

"Partial" slicing also works quite nicely.

In [44]: df.loc["baz":"foo"] Out[44]: A B C first second baz one -1.206412 0.132003 1.024180 two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372

You can slice with a 'range' of values, by providing a slice of tuples.

In [45]: df.loc[("baz", "two"):("qux", "one")] Out[45]: A B C first second baz two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372 qux one -1.170299 1.130127 0.974466 In [46]: df.loc[("baz", "two"):"foo"] Out[46]: A B C first second baz two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372

Passing a list of labels or tuples works similar to reindexing:

In [47]: df.loc[[("bar", "two"), ("qux", "one")]] Out[47]: A B C first second bar two 0.805244 0.813850 1.607920 qux one -1.170299 1.130127 0.974466

Note: It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).

Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:

In [48]: s = pd.Series( ....: [1, 2, 3, 4, 5, 6], ....: index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]), ....: ) ....: In [49]: s.loc[[("A", "c"), ("B", "d")]] # list of tuples Out[49]: A c 1 B d 5 dtype: int64 In [50]: s.loc[(["A", "B"], ["c", "d"])] # tuple of lists Out[50]: A c 1 d 2 B c 4 d 5 dtype: int64

Using slicers#

You can slice a MultiIndex by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.

Warning: You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.

You should do this:

df.loc[(slice("A1", "A3"), ...), :] # noqa: E999

You should not do this:

df.loc[(slice("A1", "A3"), ...)] # noqa: E999

In [51]: def mklbl(prefix, n): ....: return ["%s%s" % (prefix, i) for i in range(n)] ....: In [52]: miindex = pd.MultiIndex.from_product( ....: [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)] ....: ) ....: In [53]: micolumns = pd.MultiIndex.from_tuples( ....: [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"] ....: ) ....: In [54]: dfmi = ( ....: pd.DataFrame( ....: np.arange(len(miindex) * len(micolumns)).reshape( ....: (len(miindex), len(micolumns)) ....: ), ....: index=miindex, ....: columns=micolumns, ....: ) ....: .sort_index() ....: .sort_index(axis=1) ....: ) ....: In [55]: dfmi Out[55]: lvl0 a b lvl1 bar foo bah foo A0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 9 8 11 10
D1 13 12 15 14 C2 D0 17 16 19 18 ... ... ... ... ... A3 B1 C1 D1 237 236 239 238 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 249 248 251 250 D1 253 252 255 254 [64 rows x 4 columns]

Basic MultiIndex slicing using slices, lists, and labels.

In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :] Out[56]: lvl0 a b lvl1 bar foo bah foo A1 B0 C1 D0 73 72 75 74 D1 77 76 79 78 C3 D0 89 88 91 90 D1 93 92 95 94 B1 C1 D0 105 104 107 106 ... ... ... ... ... A3 B0 C3 D1 221 220 223 222 B1 C1 D0 233 232 235 234 D1 237 236 239 238 C3 D0 249 248 251 250 D1 253 252 255 254 [24 rows x 4 columns]

You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).

In [57]: idx = pd.IndexSlice In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]] Out[58]: lvl0 a b lvl1 foo foo A0 B0 C1 D0 8 10 D1 12 14 C3 D0 24 26 D1 28 30 B1 C1 D0 40 42 ... ... ... A3 B0 C3 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254 [32 rows x 2 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [59]: dfmi.loc["A1", (slice(None), "foo")] Out[59]: lvl0 a b lvl1 foo foo B0 C0 D0 64 66 D1 68 70 C1 D0 72 74 D1 76 78 C2 D0 80 82 ... ... ... B1 C1 D1 108 110 C2 D0 112 114 D1 116 118 C3 D0 120 122 D1 124 126 [16 rows x 2 columns] In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]] Out[60]: lvl0 a b lvl1 foo foo A0 B0 C1 D0 8 10 D1 12 14 C3 D0 24 26 D1 28 30 B1 C1 D0 40 42 ... ... ... A3 B0 C3 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254 [32 rows x 2 columns]

Using a boolean indexer you can provide selection related to the values.

In [61]: mask = dfmi[("a", "foo")] > 200 In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]] Out[62]: lvl0 a b lvl1 foo foo A3 B0 C1 D1 204 206 C3 D0 216 218 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]] Out[63]: lvl0 a b lvl1 bar foo bah foo A0 B0 C1 D0 9 8 11 10 D1 13 12 15 14 C3 D0 25 24 27 26 D1 29 28 31 30 B1 C1 D0 41 40 43 42 ... ... ... ... ... A3 B0 C3 D1 221 220 223 222 B1 C1 D0 233 232 235 234 D1 237 236 239 238 C3 D0 249 248 251 250 D1 253 252 255 254 [32 rows x 4 columns]

Furthermore, you can set the values using the following methods.

In [64]: df2 = dfmi.copy() In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10 In [66]: df2 Out[66]: lvl0 a b lvl1 bar foo bah foo A0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 -10 -10 -10 -10 D1 -10 -10 -10 -10 C2 D0 17 16 19 18 ... ... ... ... ... A3 B1 C1 D1 -10 -10 -10 -10 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 -10 -10 -10 -10 D1 -10 -10 -10 -10 [64 rows x 4 columns]

You can use a right-hand-side of an alignable object as well.

In [67]: df2 = dfmi.copy() In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000 In [69]: df2 Out[69]: lvl0 a b lvl1 bar foo bah foo A0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 9000 8000 11000 10000 D1 13000 12000 15000 14000 C2 D0 17 16 19 18 ... ... ... ... ...
A3 B1 C1 D1 237000 236000 239000 238000 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 249000 248000 251000 250000 D1 253000 252000 255000 254000 [64 rows x 4 columns]

Cross-section#

The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

In [70]: df Out[70]: A B C first second bar one 0.895717 0.410835 -1.413681 two 0.805244 0.813850 1.607920 baz one -1.206412 0.132003 1.024180 two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372 qux one -1.170299 1.130127 0.974466 two -0.226169 -1.436737 -2.006747 In [71]: df.xs("one", level="second") Out[71]: A B C first bar 0.895717 0.410835 -1.413681 baz -1.206412 0.132003 1.024180 foo 1.431256 -0.076467 0.875906 qux -1.170299 1.130127 0.974466 # using the slicers In [72]: df.loc[(slice(None), "one"), :] Out[72]: A B C first second bar one 0.895717 0.410835 -1.413681 baz one -1.206412 0.132003 1.024180 foo one 1.431256 -0.076467 0.875906 qux one -1.170299 1.130127 0.974466

You can also select on the columns with xs, by providing the axis argument.

In [73]: df = df.T In [74]: df.xs("one", level="second", axis=1) Out[74]: first bar baz foo qux A 0.895717 -1.206412 1.431256 -1.170299 B 0.410835 0.132003 -0.076467 1.130127 C -1.413681 1.024180 0.875906 0.974466 # using the slicers In [75]: df.loc[:, (slice(None), "one")] Out[75]: first bar baz foo qux second one one one one A 0.895717 -1.206412 1.431256 -1.170299 B 0.410835 0.132003 -0.076467 1.130127 C -1.413681 1.024180 0.875906 0.974466

xs also allows selection with multiple keys.

In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1) Out[76]: first bar second one A 0.895717 B 0.410835 C -1.413681 # using the slicers In [77]: df.loc[:, ("bar", "one")] Out[77]: A 0.895717 B 0.410835 C -1.413681 Name: (bar, one), dtype: float64

You can pass drop_level=False to xs to retain the level that was selected.

In [78]: df.xs("one", level="second", axis=1, drop_level=False) Out[78]: first bar baz foo qux second one one one one A 0.895717 -1.206412 1.431256 -1.170299 B 0.410835 0.132003 -0.076467 1.130127 C -1.413681 1.024180 0.875906 0.974466

Compare the above with the result using drop_level=True (the default value).

In [79]: df.xs("one", level="second", axis=1, drop_level=True) Out[79]: first bar baz foo qux A 0.895717 -1.206412 1.431256 -1.170299 B 0.410835 0.132003 -0.076467 1.130127 C -1.413681 1.024180 0.875906 0.974466

Advanced reindexing and alignment#

Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:

In [80]: midx = pd.MultiIndex( ....: levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]] ....: ) ....: In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx) In [82]: df Out[82]: 0 1 one y 1.519970 -0.493662 x 0.600178 0.274230 zero y 0.132885 -0.023688 x 2.410179 1.450520 In [83]: df2 = df.groupby(level=0).mean() In [84]: df2 Out[84]: 0 1 one 1.060074 -0.109716 zero 1.271532 0.713416 In [85]: df2.reindex(df.index, level=0) Out[85]: 0 1 one y 1.060074 -0.109716 x 1.060074 -0.109716 zero y 1.271532 0.713416 x 1.271532 0.713416 # aligning In [86]: df_aligned, df2_aligned = df.align(df2, level=0) In [87]: df_aligned Out[87]: 0 1 one y 1.519970 -0.493662 x 0.600178 0.274230 zero y 0.132885 -0.023688 x 2.410179 1.450520 In [88]: df2_aligned Out[88]: 0 1 one y 1.060074 -0.109716 x 1.060074 -0.109716 zero y 1.271532 0.713416 x 1.271532 0.713416

Swapping levels with swaplevel#

The swaplevel() method can switch the order of two levels:

In [89]: df[:5] Out[89]: 0 1 one y 1.519970 -0.493662 x 0.600178 0.274230 zero y 0.132885 -0.023688 x 2.410179 1.450520 In [90]: df[:5].swaplevel(0, 1, axis=0) Out[90]: 0 1 y one 1.519970 -0.493662 x one 0.600178 0.274230 y zero 0.132885 -0.023688 x zero 2.410179 1.450520
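Note that swaplevel() only swaps the levels; it keeps the existing row order, as the output above shows. If you intend to slice on the swapped index, a follow-up sort_index() restores lexical order. A small sketch (not part of the original example sequence):

    # Sketch: swap the two index levels, then re-sort so that label-based
    # slicing on the new outer level stays efficient.
    swapped = df[:5].swaplevel(0, 1, axis=0)
    swapped.sort_index()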
Reordering levels with reorder_levels#

The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:

In [91]: df[:5].reorder_levels([1, 0], axis=0) Out[91]: 0 1 y one 1.519970 -0.493662 x one 0.600178 0.274230 y zero 0.132885 -0.023688 x zero 2.410179 1.450520

Renaming names of an Index or MultiIndex#

The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.

In [92]: df.rename(columns={0: "col0", 1: "col1"}) Out[92]: col0 col1 one y 1.519970 -0.493662 x 0.600178 0.274230 zero y 0.132885 -0.023688 x 2.410179 1.450520

This method can also be used to rename specific labels of the main index of the DataFrame.

In [93]: df.rename(index={"one": "two", "y": "z"}) Out[93]: 0 1 two z 1.519970 -0.493662 x 0.600178 0.274230 zero z 0.132885 -0.023688 x 2.410179 1.450520

The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.

In [94]: df.rename_axis(index=["abc", "def"]) Out[94]: 0 1 abc def one y 1.519970 -0.493662 x 0.600178 0.274230 zero y 0.132885 -0.023688 x 2.410179 1.450520

Note that the columns of a DataFrame are an index, so using rename_axis with the columns argument will change the name of that index.

In [95]: df.rename_axis(columns="Cols").columns Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')

Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map labels/names to new values.

When working with an Index object directly, rather than via a DataFrame, Index.set_names() can be used to change the names.

In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"]) In [97]: mi.names Out[97]: FrozenList(['x', 'y']) In [98]: mi2 = mi.rename("new name", level=0) In [99]: mi2 Out[99]: MultiIndex([(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')], names=['new name', 'y'])

You cannot set the names of the MultiIndex via a level.

In [100]: mi.levels[0].name = "name via level" --------------------------------------------------------------------------- RuntimeError Traceback (most recent call last) Cell In[100], line 1 ----> 1 mi.levels[0].name = "name via level" File ~/work/pandas/pandas/pandas/core/indexes/base.py:1690, in Index.name(self, value) 1686 @name.setter 1687 def name(self, value: Hashable) -> None: 1688 if self._no_setting_name: 1689 # Used in MultiIndex.levels to avoid silently ignoring name updates. -> 1690 raise RuntimeError( 1691 "Cannot set name on a level of a MultiIndex. Use " 1692 "'MultiIndex.set_names' instead." 1693 ) 1694 maybe_extract_name(value, None, type(self)) 1695 self._name = value RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.

Use Index.set_names() instead.
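For example, a short sketch of the suggested alternative, using the mi object from above:

    # Sketch: set the name of a single level with set_names() instead of
    # assigning to mi.levels[0].name directly.
    mi.set_names("name via level", level=0)
    # -> MultiIndex([...], names=['name via level', 'y'])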
Sorting a MultiIndex#

For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().

In [101]: import random In [102]: random.shuffle(tuples) In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples)) In [104]: s Out[104]: qux two 0.206053 bar one -0.251905 foo one -2.213588 qux one 1.063327 foo two 1.266143 baz two 0.299368 bar two -0.863838 baz one 0.408204 dtype: float64 In [105]: s.sort_index() Out[105]: bar one -0.251905 two -0.863838 baz one 0.408204 two 0.299368 foo one -2.213588 two 1.266143 qux one 1.063327 two 0.206053 dtype: float64 In [106]: s.sort_index(level=0) Out[106]: bar one -0.251905 two -0.863838 baz one 0.408204 two 0.299368 foo one -2.213588 two 1.266143 qux one 1.063327 two 0.206053 dtype: float64 In [107]: s.sort_index(level=1) Out[107]: bar one -0.251905 baz one 0.408204 foo one -2.213588 qux one 1.063327 bar two -0.863838 baz two 0.299368 foo two 1.266143 qux two 0.206053 dtype: float64

You may also pass a level name to sort_index if the MultiIndex levels are named.

In [108]: s.index = s.index.set_names(["L1", "L2"]) In [109]: s.sort_index(level="L1") Out[109]: L1 L2 bar one -0.251905 two -0.863838 baz one 0.408204 two 0.299368 foo one -2.213588 two 1.266143 qux one 1.063327 two 0.206053 dtype: float64 In [110]: s.sort_index(level="L2") Out[110]: L1 L2 bar one -0.251905 baz one 0.408204 foo one -2.213588 qux one 1.063327 bar two -0.863838 baz two 0.299368 foo two 1.266143 qux two 0.206053 dtype: float64

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:

In [111]: df.T.sort_index(level=1, axis=1) Out[111]: one zero one zero x x y y 0 0.600178 2.410179 1.519970 0.132885 1 0.274230 1.450520 -0.493662 -0.023688

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

In [112]: dfm = pd.DataFrame( .....: {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)} .....: ) .....: In [113]: dfm = dfm.set_index(["jim", "joe"]) In [114]: dfm Out[114]: jolie jim joe 0 x 0.490671 x 0.120248 1 z 0.537020 y 0.110968 In [115]: dfm.loc[(1, 'z')] Out[115]: jolie jim joe 1 z 0.53702

Furthermore, if you try to index something that is not fully lexsorted, this can raise:

In [116]: dfm.loc[(0, 'y'):(1, 'z')] --------------------------------------------------------------------------- UnsortedIndexError Traceback (most recent call last) Cell In[116], line 1 ----> 1 dfm.loc[(0, 'y'):(1, 'z')] File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key) 1189 maybe_callable = com.apply_if_callable(key, self.obj) 1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable) -> 1191 return self._getitem_axis(maybe_callable, axis=axis) File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis) 1409 if isinstance(key, slice): 1410 self._validate_key(key, axis) -> 1411 return self._get_slice_axis(key, axis=axis) 1412 elif com.is_bool_indexer(key): 1413 return self._getbool_axis(key, axis=axis) File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis) 1440 return obj.copy(deep=False) 1442 labels = obj._get_axis(axis) -> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step) 1445 if isinstance(indexer, slice): 1446 return self.obj._slice(indexer, axis=axis) File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step) 6618 def slice_indexer( 6619 self, 6620 start: Hashable | None = None, 6621 end: Hashable | None = None, 6622 step: int | None = None, 6623 ) -> slice: 6624 """ 6625 Compute the slice indexer for input labels and step. 6626 (...)
6660 slice(1, 3, None) 6661 """ -> 6662 start_slice, end_slice = self.slice_locs(start, end, step=step) 6664 # return a slice 6665 if not is_scalar(start_slice): File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2904, in MultiIndex.slice_locs(self, start, end, step) 2852 """ 2853 For an ordered MultiIndex, compute the slice locations for input 2854 labels. (...) 2900 sequence of such. 2901 """ 2902 # This function adds nothing to its parent implementation (the magic 2903 # happens in get_slice_bound method), but it adds meaningful doc. -> 2904 return super().slice_locs(start, end, step) File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step) 6877 start_slice = None 6878 if start is not None: -> 6879 start_slice = self.get_slice_bound(start, "left") 6880 if start_slice is None: 6881 start_slice = 0 File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2848, in MultiIndex.get_slice_bound(self, label, side) 2846 if not isinstance(label, tuple): 2847 label = (label,) -> 2848 return self._partial_tup_index(label, side=side) File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2908, in MultiIndex._partial_tup_index(self, tup, side) 2906 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"): 2907 if len(tup) > self._lexsort_depth: -> 2908 raise UnsortedIndexError( 2909 f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth " 2910 f"({self._lexsort_depth})" 2911 ) 2913 n = len(tup) 2914 start, end = 0, len(self) UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'

The is_monotonic_increasing() method on a MultiIndex shows if the index is sorted:

In [117]: dfm.index.is_monotonic_increasing Out[117]: False In [118]: dfm = dfm.sort_index() In [119]: dfm Out[119]: jolie jim joe 0 x 0.490671 x 0.120248 1 y 0.110968 z 0.537020 In [120]: dfm.index.is_monotonic_increasing Out[120]: True

And now selection works as expected.

In [121]: dfm.loc[(0, "y"):(1, "z")] Out[121]: jolie jim joe 1 y 0.110968 z 0.537020

Take methods#

Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method that retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.

In [122]: index = pd.Index(np.random.randint(0, 1000, 10)) In [123]: index Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64') In [124]: positions = [0, 9, 3] In [125]: index[positions] Out[125]: Index([214, 329, 567], dtype='int64') In [126]: index.take(positions) Out[126]: Index([214, 329, 567], dtype='int64') In [127]: ser = pd.Series(np.random.randn(10)) In [128]: ser.iloc[positions] Out[128]: 0 -0.179666 9 1.824375 3 0.392149 dtype: float64 In [129]: ser.take(positions) Out[129]: 0 -0.179666 9 1.824375 3 0.392149 dtype: float64

For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.

In [130]: frm = pd.DataFrame(np.random.randn(5, 3)) In [131]: frm.take([1, 4, 3]) Out[131]: 0 1 2 1 -1.237881 0.106854 -1.276829 4 0.629675 -1.425966 1.857704 3 0.979542 -1.633678 0.615855 In [132]: frm.take([0, 2], axis=1) Out[132]: 0 2 0 0.595974 0.601544 1 -1.237881 -1.276829 2 -0.767101 1.499591 3 0.979542 0.615855 4 0.629675 1.857704

It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.

In [133]: arr = np.random.randn(10) In [134]: arr.take([False, False, True, True]) Out[134]: array([-1.1935, -1.1935, 0.6775, 0.6775]) In [135]: arr[[0, 1]] Out[135]: array([-1.1935, 0.6775]) In [136]: ser = pd.Series(np.random.randn(10)) In [137]: ser.take([False, False, True, True]) Out[137]: 0 0.233141 0 0.233141 1 -0.223540 1 -0.223540 dtype: float64 In [138]: ser.iloc[[0, 1]] Out[138]: 0 0.233141 1 -0.223540 dtype: float64
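If you do have a boolean mask, one way to combine it with take is to convert the mask to integer positions first. A small sketch (the use of np.flatnonzero here is our own suggestion, not from the original examples):

    # Sketch: turn a boolean mask into integer positions before calling take.
    mask = np.array([False, False, True, True])
    positions = np.flatnonzero(mask)  # array([2, 3])
    ser.take(positions)               # same rows as ser.iloc[positions]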
Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.

In [139]: arr = np.random.randn(10000, 5) In [140]: indexer = np.arange(10000) In [141]: random.shuffle(indexer) In [142]: %timeit arr[indexer] .....: %timeit arr.take(indexer, axis=0) .....: 257 us +- 4.44 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each) 79.7 us +- 1.15 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each) In [143]: ser = pd.Series(arr[:, 0]) In [144]: %timeit ser.iloc[indexer] .....: %timeit ser.take(indexer) .....: 144 us +- 3.69 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each) 129 us +- 2 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

Index types#

We have discussed MultiIndex in the previous sections pretty extensively. Documentation about DatetimeIndex and PeriodIndex is shown here, and documentation about TimedeltaIndex is found here.

In the following sub-sections we will highlight some other index types.

CategoricalIndex#

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

In [145]: from pandas.api.types import CategoricalDtype In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")}) In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab"))) In [148]: df Out[148]: A B 0 0 a 1 1 a 2 2 b 3 3 b 4 4 c 5 5 a In [149]: df.dtypes Out[149]: A int64 B category dtype: object In [150]: df["B"].cat.categories Out[150]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex.

In [151]: df2 = df.set_index("B") In [152]: df2.index Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the category or the operation will raise a KeyError.

In [153]: df2.loc["a"] Out[153]: A B a 0 a 1 a 5

The CategoricalIndex is preserved after indexing:

In [154]: df2.loc["a"].index Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).

In [155]: df2.sort_index() Out[155]: A B c 4 a 0 a 1 a 5 b 2 b 3

Groupby operations on the index will preserve the index nature as well.

In [156]: df2.groupby(level=0, observed=True).sum() Out[156]: A B c 4 a 6 b 5 In [157]: df2.groupby(level=0, observed=True).sum().index Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.

In [158]: df3 = pd.DataFrame( .....: {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")} .....: ) .....: In [159]: df3 = df3.set_index("B") In [160]: df3 Out[160]: A B a 0 b 1 c 2 In [161]: df3.reindex(["a", "e"]) Out[161]: A B a 0.0 e NaN In [162]: df3.reindex(["a", "e"]).index Out[162]: Index(['a', 'e'], dtype='object', name='B') In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))) Out[163]: A B a 0.0 e NaN In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B')

Warning: Reshaping and comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.

In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")}) In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab"))) In [167]: df4 = df4.set_index("B") In [168]: df4.index Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B') In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")}) In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc"))) In [171]: df5 = df5.set_index("B") In [172]:
df5.index Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B') In [173]: pd.concat([df4, df5]) Out[173]: A B b 0 a 1 b 0 c 1

RangeIndex#

RangeIndex is a sub-class of Index that provides the default index for all DataFrame and Series objects. RangeIndex is an optimized version of Index that can represent a monotonic ordered set. These are analogous to Python range types. A RangeIndex will always have an int64 dtype.

In [174]: idx = pd.RangeIndex(5) In [175]: idx Out[175]: RangeIndex(start=0, stop=5, step=1)

RangeIndex is the default index for all DataFrame and Series objects:

In [176]: ser = pd.Series([1, 2, 3]) In [177]: ser.index Out[177]: RangeIndex(start=0, stop=3, step=1) In [178]: df = pd.DataFrame([[1, 2], [3, 4]]) In [179]: df.index Out[179]: RangeIndex(start=0, stop=2, step=1) In [180]: df.columns Out[180]: RangeIndex(start=0, stop=2, step=1)

A RangeIndex behaves similarly to an Index with an int64 dtype, and operations on a RangeIndex whose result cannot be represented by a RangeIndex, but should have an integer dtype, will be converted to an Index with int64 dtype. For example:

In [181]: idx[[0, 2]] Out[181]: Index([0, 2], dtype='int64')

IntervalIndex#

IntervalIndex, together with its own dtype, IntervalDtype, as well as the Interval scalar type, allow first-class support in pandas for interval notation. The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().

Indexing with an IntervalIndex#

An IntervalIndex can be used in Series and in DataFrame as the index.

In [182]: df = pd.DataFrame( .....: {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]) .....: ) .....: In [183]: df Out[183]: A (0, 1] 1 (1, 2] 2 (2, 3] 3 (3, 4] 4

Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

In [184]: df.loc[2] Out[184]: A 2 Name: (1, 2], dtype: int64 In [185]: df.loc[[2, 3]] Out[185]: A (1, 2] 2 (2, 3] 3

If you select a label contained within an interval, this will also select the interval.

In [186]: df.loc[2.5] Out[186]: A 3 Name: (2, 3], dtype: int64 In [187]: df.loc[[2.5, 3.5]] Out[187]: A (2, 3] 3 (3, 4] 4

Selecting using an Interval will only return exact matches.

In [188]: df.loc[pd.Interval(1, 2)] Out[188]: A 2 Name: (1, 2], dtype: int64

Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.

In [189]: df.loc[pd.Interval(0.5, 2.5)] --------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[189], line 1 ----> 1 df.loc[pd.Interval(0.5, 2.5)] File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key) 1189 maybe_callable = com.apply_if_callable(key, self.obj) 1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable) -> 1191 return self._getitem_axis(maybe_callable, axis=axis) File ~/work/pandas/pandas/pandas/core/indexing.py:1431, in _LocIndexer._getitem_axis(self, key, axis) 1429 # fall thru to straight lookup 1430 self._validate_key(key, axis) -> 1431 return self._get_label(key, axis=axis) File ~/work/pandas/pandas/pandas/core/indexing.py:1381, in _LocIndexer._get_label(self, label, axis) 1379 def _get_label(self, label, axis: AxisInt): 1380 # GH#5567 this will fail if the label is not present in the axis.
-> 1381 return self.obj.xs(label, axis=axis) File ~/work/pandas/pandas/pandas/core/generic.py:4301, in NDFrame.xs(self, key, axis, level, drop_level) 4299 new_index = index[loc] 4300 else: -> 4301 loc = index.get_loc(key) 4303 if isinstance(loc, np.ndarray): 4304 if loc.dtype == np.bool_: File ~/work/pandas/pandas/pandas/core/indexes/interval.py:678, in IntervalIndex.get_loc(self, key) 676 matches = mask.sum() 677 if matches == 0: --> 678 raise KeyError(key) 679 if matches == 1: 680 return mask.argmax() KeyError: Interval(0.5, 2.5, closed='right')

Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.

In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5)) In [191]: idxr Out[191]: array([ True, True, True, False]) In [192]: df[idxr] Out[192]: A (0, 1] 1 (1, 2] 2 (2, 3] 3

Binning data with cut and qcut#

Both cut() and qcut() return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute.

In [193]: c = pd.cut(range(4), bins=2) In [194]: c Out[194]: [(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]] Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]] In [195]: c.categories Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]')

cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.

In [196]: pd.cut([0, 3, 5, 1], bins=c.categories) Out[196]: [(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]] Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

Any value which falls outside all bins will be assigned a NaN value.

Generating ranges of intervals#

If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is a 1 for numeric intervals, and calendar day for datetime-like intervals:

In [197]: pd.interval_range(start=0, end=5) Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]') In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4) Out[198]: IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00], (2017-01-02 00:00:00, 2017-01-03 00:00:00], (2017-01-03 00:00:00, 2017-01-04 00:00:00], (2017-01-04 00:00:00, 2017-01-05 00:00:00]], dtype='interval[datetime64[ns], right]') In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3) Out[199]: IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]], dtype='interval[timedelta64[ns], right]')

The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:

In [200]: pd.interval_range(start=0, periods=5, freq=1.5) Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]') In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W") Out[201]: IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00], (2017-01-08 00:00:00, 2017-01-15 00:00:00], (2017-01-15 00:00:00, 2017-01-22 00:00:00], (2017-01-22 00:00:00, 2017-01-29 00:00:00]], dtype='interval[datetime64[ns], right]') In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h") Out[202]: IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0 days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]], dtype='interval[timedelta64[ns], right]')

Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.

In [203]: pd.interval_range(start=0, end=4, closed="both") Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]') In [204]: pd.interval_range(start=0, end=4, closed="neither") Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]')
Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:

In [205]: pd.interval_range(start=0, end=6, periods=4) Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]') In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3) Out[206]: IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28 00:00:00]], dtype='interval[datetime64[ns], right]')

Miscellaneous indexing FAQ#

Integer indexing#

Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .loc. The following code will generate exceptions:

In [207]: s = pd.Series(range(5)) In [208]: s[-1] --------------------------------------------------------------------------- ValueError Traceback (most recent call last) File ~/work/pandas/pandas/pandas/core/indexes/range.py:413, in RangeIndex.get_loc(self, key) 412 try: --> 413 return self._range.index(new_key) 414 except ValueError as err: ValueError: -1 is not in range The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) Cell In[208], line 1 ----> 1 s[-1] File ~/work/pandas/pandas/pandas/core/series.py:1121, in Series.__getitem__(self, key) 1118 return self._values[key] 1120 elif key_is_scalar: -> 1121 return self._get_value(key) 1123 # Convert generator to list before going through hashable part 1124 # (We will iterate through the generator there to check for slices) 1125 if is_iterator(key): File ~/work/pandas/pandas/pandas/core/series.py:1237, in Series._get_value(self, label, takeable) 1234 return self._values[label] 1236 # Similar to Index.get_value, but we do not fall back to positional -> 1237 loc = self.index.get_loc(label) 1239 if is_integer(loc): 1240 return self._values[loc] File ~/work/pandas/pandas/pandas/core/indexes/range.py:415, in RangeIndex.get_loc(self, key) 413 return self._range.index(new_key) 414 except ValueError as err: --> 415 raise KeyError(key) from err 416 if isinstance(key, Hashable): 417 raise KeyError(key) KeyError: -1 In [209]: df = pd.DataFrame(np.random.randn(5, 4)) In [210]: df Out[210]: 0 1 2 3 0 -0.435772 -1.188928 -0.808286 -0.284634 1 -1.815703 1.347213 -0.243487 0.514704 2 1.162969 -0.287725 -0.179734 0.993962 3 -0.212673 0.909872 -0.733333 -0.349893 4 0.456434 -0.306735 0.553396 0.166221 In [211]: df.loc[-2:] Out[211]: 0 1 2 3 0 -0.435772 -1.188928 -0.808286 -0.284634 1 -1.815703 1.347213 -0.243487 0.514704 2 1.162969 -0.287725 -0.179734 0.993962 3 -0.212673 0.909872 -0.733333 -0.349893 4 0.456434 -0.306735 0.553396 0.166221

This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop "falling back" on position-based indexing).

Non-monotonic indexes require exact matches#

If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing() and is_monotonic_decreasing() attributes.

In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5))) In [213]: df.index.is_monotonic_increasing Out[213]: True # no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4: In [214]: df.loc[0:4, :] Out[214]: data 2 0 3 1 3 2 4 3 # slice is outside the index, so empty DataFrame is returned In [215]: df.loc[13:15, :] Out[215]: Empty DataFrame Columns: [data] Index: []

On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.

In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6))) In [217]: df.index.is_monotonic_increasing Out[217]: False # OK because 2 and 4 are in the index In [218]: df.loc[2:4, :] Out[218]: data 2 0 3 1 1 2 4 3 # 0 is not
in the index In [219]: df.loc[0:4, :] --------------------------------------------------------------------------- KeyError Traceback (most recent call last) File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key) 3804 try: -> 3805 return self._engine.get_loc(casted_key) 3806 except KeyError as err: File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc() File index.pyx:191, in pandas._libs.index.IndexEngine.get_loc() File index.pyx:234, in pandas._libs.index.IndexEngine._get_loc_duplicates() File index.pyx:242, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer() File index.pyx:134, in pandas._libs.index._unpack_bool_indexer() KeyError: 0 The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) Cell In[219], line 1 ----> 1 df.loc[0:4, :] File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key) 1182 if self._is_scalar_access(key): 1183 return self.obj._get_value(*key, takeable=self._takeable) -> 1184 return self._getitem_tuple(key) 1185 else: 1186 # we by definition only have the 0th axis 1187 axis = self.axis or 0 File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup) 1374 if self._multi_take_opportunity(tup): 1375 return self._multi_take(tup) -> 1377 return self._getitem_tuple_same_dim(tup) File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup) 1017 if com.is_null_slice(key): 1018 continue -> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i) 1021 # We should never have retval.ndim < self.ndim, as that should 1022 # be handled by the _getitem_lowerdim call above. 1023 assert retval.ndim == self.ndim File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis) 1409 if isinstance(key, slice): 1410 self._validate_key(key, axis) -> 1411 return self._get_slice_axis(key, axis=axis) 1412 elif com.is_bool_indexer(key): 1413 return self._getbool_axis(key, axis=axis) File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis) 1440 return obj.copy(deep=False) 1442 labels = obj._get_axis(axis) -> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step) 1445 if isinstance(indexer, slice): 1446 return self.obj._slice(indexer, axis=axis) File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step) 6618 def slice_indexer( 6619 self, 6620 start: Hashable | None = None, 6621 end: Hashable | None = None, 6622 step: int | None = None, 6623 ) -> slice: 6624 """ 6625 Compute the slice indexer for input labels and step. 6626 (...) 
6660 slice(1, 3, None) 6661 """ -> 6662 start_slice, end_slice = self.slice_locs(start, end, step=step) 6664 # return a slice 6665 if not is_scalar(start_slice): File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step) 6877 start_slice = None 6878 if start is not None: -> 6879 start_slice = self.get_slice_bound(start, "left") 6880 if start_slice is None: 6881 start_slice = 0 File ~/work/pandas/pandas/pandas/core/indexes/base.py:6804, in Index.get_slice_bound(self, label, side) 6801 return self._searchsorted_monotonic(label, side) 6802 except ValueError: 6803 # raise the original KeyError -> 6804 raise err 6806 if isinstance(slc, np.ndarray): 6807 # get_loc may return a boolean array, which 6808 # is OK as long as they are representable by a slice. 6809 assert is_bool_dtype(slc.dtype) File ~/work/pandas/pandas/pandas/core/indexes/base.py:6798, in Index.get_slice_bound(self, label, side) 6796 # we need to look up the label 6797 try: -> 6798 slc = self.get_loc(label) 6799 except KeyError as err: 6800 try: File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key) 3807 if isinstance(casted_key, slice) or ( 3808 isinstance(casted_key, abc.Iterable) 3809 and any(isinstance(x, slice) for x in casted_key) 3810 ): 3811 raise InvalidIndexError(key) -> 3812 raise KeyError(key) from err 3813 except TypeError: 3814 # If we have a listlike key, _check_indexing_error will raise 3815 # InvalidIndexError. Otherwise we fall through and re-raise 3816 # the TypeError. 3817 self._check_indexing_error(key) KeyError: 0 # 3 is not a unique label In [220]: df.loc[2:3, :] --------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[220], line 1 ----> 1 df.loc[2:3, :] File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key) 1182 if self._is_scalar_access(key): 1183 return self.obj._get_value(*key, takeable=self._takeable) -> 1184 return self._getitem_tuple(key) 1185 else: 1186 # we by definition only have the 0th axis 1187 axis = self.axis or 0 File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup) 1374 if self._multi_take_opportunity(tup): 1375 return self._multi_take(tup) -> 1377 return self._getitem_tuple_same_dim(tup) File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup) 1017 if com.is_null_slice(key): 1018 continue -> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i) 1021 # We should never have retval.ndim < self.ndim, as that should 1022 # be handled by the _getitem_lowerdim call above. 
1023 assert retval.ndim == self.ndim File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis) 1409 if isinstance(key, slice): 1410 self._validate_key(key, axis) -> 1411 return self._get_slice_axis(key, axis=axis) 1412 elif com.is_bool_indexer(key): 1413 return self._getbool_axis(key, axis=axis) File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis) 1440 return obj.copy(deep=False) 1442 labels = obj._get_axis(axis) -> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step) 1445 if isinstance(indexer, slice): 1446 return self.obj._slice(indexer, axis=axis) File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step) 6618 def slice_indexer( 6619 self, 6620 start: Hashable | None = None, 6621 end: Hashable | None = None, 6622 step: int | None = None, 6623 ) -> slice: 6624 """ 6625 Compute the slice indexer for input labels and step. 6626 (...) 6660 slice(1, 3, None) 6661 """ -> 6662 start_slice, end_slice = self.slice_locs(start, end, step=step) 6664 # return a slice 6665 if not is_scalar(start_slice): File ~/work/pandas/pandas/pandas/core/indexes/base.py:6885, in Index.slice_locs(self, start, end, step) 6883 end_slice = None 6884 if end is not None: -> 6885 end_slice = self.get_slice_bound(end, "right") 6886 if end_slice is None: 6887 end_slice = len(self) File ~/work/pandas/pandas/pandas/core/indexes/base.py:6812, in Index.get_slice_bound(self, label, side) 6810 slc = lib.maybe_booleans_to_slice(slc.view("u1")) 6811 if isinstance(slc, np.ndarray): -> 6812 raise KeyError( 6813 f"Cannot get {side} slice bound for non-unique " 6814 f"label: {repr(original_label)}" 6815 ) 6817 if isinstance(slc, slice): 6818 if side == "left": KeyError: 'Cannot get right slice bound for non-unique label: 3'

Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with the is_unique() attribute.

In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"]) In [222]: weakly_monotonic Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object') In [223]: weakly_monotonic.is_monotonic_increasing Out[223]: True In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique Out[224]: False

Endpoints are inclusive#

Compared with standard Python sequence slicing, in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the "successor" or next element after a particular label in an index. For example, consider the following Series:

In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef")) In [226]: s Out[226]: a -0.101684 b -0.734907 c -0.130121 d -0.476046 e 0.759104 f 0.213379 dtype: float64

Suppose we wished to slice from c to e, using integers this would be accomplished as such:

In [227]: s[2:5] Out[227]: c -0.130121 d -0.476046 e 0.759104 dtype: float64

However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

In [228]: s.loc['c':'e' + 1] --------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[228], line 1 ----> 1 s.loc['c':'e' + 1] TypeError: can only concatenate str (not "int") to str

A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:

In [229]: s.loc["c":"e"] Out[229]: c -0.130121 d -0.476046 e 0.759104 dtype: float64

This is most definitely a "practicality beats purity" sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.

Indexing potentially changes underlying Series dtype#

The different indexing operations can potentially change the dtype of a Series.
In [230]: series1 = pd.Series([1, 2, 3]) In [231]: series1.dtype Out[231]: dtype('int64') In [232]: res = series1.reindex([0, 4]) In [233]: res.dtype Out[233]: dtype('float64') In [234]: res Out[234]: 0 1.0 4 NaN dtype: float64 In [235]: series2 = pd.Series([True]) In [236]: series2.dtype Out[236]: dtype('bool') In [237]: res = series2.reindex_like(series1) In [238]: res.dtype Out[238]: dtype('O') In [239]: res Out[239]: 0 True 1 NaN 2 NaN dtype: object

This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.

See GH 2388 for a more detailed discussion.
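One way to sidestep the ufunc issue, sketched below under the assumption that missing entries should count as False, is to restore a boolean dtype before applying the ufunc:

    # Sketch: fill the inserted NaNs and cast back to bool so that numpy
    # ufuncs such as np.logical_and behave as expected again.
    res = series2.reindex_like(series1)
    cleaned = res.fillna(False).astype(bool)
    np.logical_and(cleaned, True)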