合并、连接、连接和比较#

pandas 提供了各种组合和比较Seriesor 的 方法DataFrame

concat()#

concat()函数沿一个轴连接任意数量的 SeriesDataFrame对象,同时对其他轴上的索引执行可选的集合逻辑(并集或交集)。像 一样numpy.concatenateconcat() 接受同构类型对象的列表或字典并将它们连接起来。

In [1]: df1 = pd.DataFrame(
   ...:     {
   ...:         "A": ["A0", "A1", "A2", "A3"],
   ...:         "B": ["B0", "B1", "B2", "B3"],
   ...:         "C": ["C0", "C1", "C2", "C3"],
   ...:         "D": ["D0", "D1", "D2", "D3"],
   ...:     },
   ...:     index=[0, 1, 2, 3],
   ...: )
   ...: 

In [2]: df2 = pd.DataFrame(
   ...:     {
   ...:         "A": ["A4", "A5", "A6", "A7"],
   ...:         "B": ["B4", "B5", "B6", "B7"],
   ...:         "C": ["C4", "C5", "C6", "C7"],
   ...:         "D": ["D4", "D5", "D6", "D7"],
   ...:     },
   ...:     index=[4, 5, 6, 7],
   ...: )
   ...: 

In [3]: df3 = pd.DataFrame(
   ...:     {
   ...:         "A": ["A8", "A9", "A10", "A11"],
   ...:         "B": ["B8", "B9", "B10", "B11"],
   ...:         "C": ["C8", "C9", "C10", "C11"],
   ...:         "D": ["D8", "D9", "D10", "D11"],
   ...:     },
   ...:     index=[8, 9, 10, 11],
   ...: )
   ...: 

In [4]: frames = [df1, df2, df3]

In [5]: result = pd.concat(frames)

In [6]: result
Out[6]: 
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

../_images/merging_concat_basic.png

笔记

concat()制作数据的完整副本,迭代重用concat()可能会创建不必要的副本。在使用之前收集列表中的 所有 DataFrame或对象。Seriesconcat()

frames = [process_your_file(f) for f in files]
result = pd.concat(frames)

笔记

当与命名轴连接时DataFrame,pandas 将尽可能尝试保留这些索引/列名称。如果所有输入共享一个通用名称,则该名称将分配给结果。当输入的名称不全部一致时,结果将是未命名的。也是如此MultiIndex,但逻辑是逐级单独应用的。

结果轴的连接逻辑#

join关键字指定如何处理第一个 中不存在的轴值 DataFrame

join='outer'取所有轴值的并集

In [7]: df4 = pd.DataFrame(
   ...:     {
   ...:         "B": ["B2", "B3", "B6", "B7"],
   ...:         "D": ["D2", "D3", "D6", "D7"],
   ...:         "F": ["F2", "F3", "F6", "F7"],
   ...:     },
   ...:     index=[2, 3, 6, 7],
   ...: )
   ...: 

In [8]: result = pd.concat([df1, df4], axis=1)

In [9]: result
Out[9]: 
     A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7

../_images/merging_concat_axis1.png

join='inner'取轴值的交集

In [10]: result = pd.concat([df1, df4], axis=1, join="inner")

In [11]: result
Out[11]: 
    A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3

../_images/merging_concat_axis1_inner.png

要使用原始 索引执行有效的“左”连接DataFrame,可以对结果重新索引。

In [12]: result = pd.concat([df1, df4], axis=1).reindex(df1.index)

In [13]: result
Out[13]: 
    A   B   C   D    B    D    F
0  A0  B0  C0  D0  NaN  NaN  NaN
1  A1  B1  C1  D1  NaN  NaN  NaN
2  A2  B2  C2  D2   B2   D2   F2
3  A3  B3  C3  D3   B3   D3   F3

../_images/merging_concat_axis1_join_axes.png

忽略串联轴上的索引#

对于DataFrame没有有意义索引的对象,会ignore_index 忽略重叠索引。

In [14]: result = pd.concat([df1, df4], ignore_index=True, sort=False)

In [15]: result
Out[15]: 
     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2  NaN
3   A3  B3   C3  D3  NaN
4  NaN  B2  NaN  D2   F2
5  NaN  B3  NaN  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7

../_images/merging_concat_ignore_index.png

连接SeriesDataFrame在一起#

Series您可以连接和对象的混合DataFrame。将 Series被转换为DataFrame以列名作为 的名称Series

In [16]: s1 = pd.Series(["X0", "X1", "X2", "X3"], name="X")

In [17]: result = pd.concat([df1, s1], axis=1)

In [18]: result
Out[18]: 
    A   B   C   D   X
0  A0  B0  C0  D0  X0
1  A1  B1  C1  D1  X1
2  A2  B2  C2  D2  X2
3  A3  B3  C3  D3  X3

../_images/merging_concat_mixed_ndim.png

未命名的Series将连续编号。

In [19]: s2 = pd.Series(["_0", "_1", "_2", "_3"])

In [20]: result = pd.concat([df1, s2, s2, s2], axis=1)

In [21]: result
Out[21]: 
    A   B   C   D   0   1   2
0  A0  B0  C0  D0  _0  _0  _0
1  A1  B1  C1  D1  _1  _1  _1
2  A2  B2  C2  D2  _2  _2  _2
3  A3  B3  C3  D3  _3  _3  _3

../_images/merging_concat_unnamed_series.png

ignore_index=True将删除所有名称引用。

In [22]: result = pd.concat([df1, s1], axis=1, ignore_index=True)

In [23]: result
Out[23]: 
    0   1   2   3   4
0  A0  B0  C0  D0  X0
1  A1  B1  C1  D1  X1
2  A2  B2  C2  D2  X2
3  A3  B3  C3  D3  X3

../_images/merging_concat_series_ignore_index.png

结果keys#

keys参数将另一个轴级别添加到结果索引或列(创建MultiIndex)将特定键与每个原始关联DataFrame

In [24]: result = pd.concat(frames, keys=["x", "y", "z"])

In [25]: result
Out[25]: 
        A    B    C    D
x 0    A0   B0   C0   D0
  1    A1   B1   C1   D1
  2    A2   B2   C2   D2
  3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
  5    A5   B5   C5   D5
  6    A6   B6   C6   D6
  7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
  9    A9   B9   C9   D9
  10  A10  B10  C10  D10
  11  A11  B11  C11  D11

In [26]: result.loc["y"]
Out[26]: 
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

../_images/merging_concat_keys.png

当基于现有的创建新的时,参数keys可以覆盖列名。DataFrameSeries

In [27]: s3 = pd.Series([0, 1, 2, 3], name="foo")

In [28]: s4 = pd.Series([0, 1, 2, 3])

In [29]: s5 = pd.Series([0, 1, 4, 5])

In [30]: pd.concat([s3, s4, s5], axis=1)
Out[30]: 
   foo  0  1
0    0  0  0
1    1  1  1
2    2  2  4
3    3  3  5

In [31]: pd.concat([s3, s4, s5], axis=1, keys=["red", "blue", "yellow"])
Out[31]: 
   red  blue  yellow
0    0     0       0
1    1     1       1
2    2     2       4
3    3     3       5

您还可以传递一个字典,在这种情况下,除非指定其他参数,否则concat()字典键将用于参数:keyskeys

In [32]: pieces = {"x": df1, "y": df2, "z": df3}

In [33]: result = pd.concat(pieces)

In [34]: result
Out[34]: 
        A    B    C    D
x 0    A0   B0   C0   D0
  1    A1   B1   C1   D1
  2    A2   B2   C2   D2
  3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
  5    A5   B5   C5   D5
  6    A6   B6   C6   D6
  7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
  9    A9   B9   C9   D9
  10  A10  B10  C10  D10
  11  A11  B11  C11  D11

../_images/merging_concat_dict.png
In [35]: result = pd.concat(pieces, keys=["z", "y"])

In [36]: result
Out[36]: 
        A    B    C    D
z 8    A8   B8   C8   D8
  9    A9   B9   C9   D9
  10  A10  B10  C10  D10
  11  A11  B11  C11  D11
y 4    A4   B4   C4   D4
  5    A5   B5   C5   D5
  6    A6   B6   C6   D6
  7    A7   B7   C7   D7

../_images/merging_concat_dict_keys.png

创建MultiIndex的级别是根据传递的键和DataFrame片段的索引构造的:

In [37]: result.index.levels
Out[37]: FrozenList([['z', 'y'], [4, 5, 6, 7, 8, 9, 10, 11]])

levels参数允许指定与相关的结果级别keys

In [38]: result = pd.concat(
   ....:     pieces, keys=["x", "y", "z"], levels=[["z", "y", "x", "w"]], names=["group_key"]
   ....: )
   ....: 

In [39]: result
Out[39]: 
                A    B    C    D
group_key                       
x         0    A0   B0   C0   D0
          1    A1   B1   C1   D1
          2    A2   B2   C2   D2
          3    A3   B3   C3   D3
y         4    A4   B4   C4   D4
          5    A5   B5   C5   D5
          6    A6   B6   C6   D6
          7    A7   B7   C7   D7
z         8    A8   B8   C8   D8
          9    A9   B9   C9   D9
          10  A10  B10  C10  D10
          11  A11  B11  C11  D11

../_images/merging_concat_dict_keys_names.png
In [40]: result.index.levels
Out[40]: FrozenList([['z', 'y', 'x', 'w'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])

将行附加到#DataFrame

如果您Series想将 a 作为单行附加到 a DataFrame,则可以将该行转换为 a DataFrame并使用concat()

In [41]: s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])

In [42]: result = pd.concat([df1, s2.to_frame().T], ignore_index=True)

In [43]: result
Out[43]: 
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  X0  X1  X2  X3

../_images/merging_append_series_as_row.png

merge()#

merge()执行类似于 SQL 等关系数据库的连接操作。熟悉 SQL 但刚接触 pandas 的用户可以参考 与 SQL 的比较

合并类型#

merge()实现常见的 SQL 样式连接操作。

  • 一对一:在其索引上连接两个DataFrame对象,该索引必须包含唯一值。

  • 多对一:将唯一索引连接到不同DataFrame.

  • 多对多:在列上连接列。

笔记

当在列上连接列时(可能是多对多连接),所传递DataFrame对象上的任何索引都将被丢弃

对于多对多联接,如果一个键组合在两个表中出现多次,则将DataFrame具有关联数据的笛卡尔积。

In [44]: left = pd.DataFrame(
   ....:     {
   ....:         "key": ["K0", "K1", "K2", "K3"],
   ....:         "A": ["A0", "A1", "A2", "A3"],
   ....:         "B": ["B0", "B1", "B2", "B3"],
   ....:     }
   ....: )
   ....: 

In [45]: right = pd.DataFrame(
   ....:     {
   ....:         "key": ["K0", "K1", "K2", "K3"],
   ....:         "C": ["C0", "C1", "C2", "C3"],
   ....:         "D": ["D0", "D1", "D2", "D3"],
   ....:     }
   ....: )
   ....: 

In [46]: result = pd.merge(left, right, on="key")

In [47]: result
Out[47]: 
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3

../_images/merging_merge_on_key.png

参数how指定merge()结果表中包含哪些键。如果组合键未出现在左表或右表中,则连接表中的值将为 NA。以下是how选项及其 SQL 等效名称的摘要:

合并方法

SQL 连接名称

描述

left

LEFT OUTER JOIN

仅使用左框架中的键

right

RIGHT OUTER JOIN

仅使用右框架中的关键点

outer

FULL OUTER JOIN

使用两个帧中的键的并集

inner

INNER JOIN

使用两个帧的关键点的交集

cross

CROSS JOIN

创建两个框架的行的笛卡尔积

In [48]: left = pd.DataFrame(
   ....:    {
   ....:       "key1": ["K0", "K0", "K1", "K2"],
   ....:       "key2": ["K0", "K1", "K0", "K1"],
   ....:       "A": ["A0", "A1", "A2", "A3"],
   ....:       "B": ["B0", "B1", "B2", "B3"],
   ....:    }
   ....: )
   ....: 

In [49]: right = pd.DataFrame(
   ....:    {
   ....:       "key1": ["K0", "K1", "K1", "K2"],
   ....:       "key2": ["K0", "K0", "K0", "K0"],
   ....:       "C": ["C0", "C1", "C2", "C3"],
   ....:       "D": ["D0", "D1", "D2", "D3"],
   ....:    }
   ....: )
   ....: 

In [50]: result = pd.merge(left, right, how="left", on=["key1", "key2"])

In [51]: result
Out[51]: 
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN

../_images/merging_merge_on_key_left.png
In [52]: result = pd.merge(left, right, how="right", on=["key1", "key2"])

In [53]: result
Out[53]: 
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3

../_images/merging_merge_on_key_right.png
In [54]: result = pd.merge(left, right, how="outer", on=["key1", "key2"])

In [55]: result
Out[55]: 
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K0  NaN  NaN   C3   D3
5   K2   K1   A3   B3  NaN  NaN

../_images/merging_merge_on_key_outer.png
In [56]: result = pd.merge(left, right, how="inner", on=["key1", "key2"])

In [57]: result
Out[57]: 
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2

../_images/merging_merge_on_key_inner.png
In [58]: result = pd.merge(left, right, how="cross")

In [59]: result
Out[59]: 
   key1_x key2_x   A   B key1_y key2_y   C   D
0      K0     K0  A0  B0     K0     K0  C0  D0
1      K0     K0  A0  B0     K1     K0  C1  D1
2      K0     K0  A0  B0     K1     K0  C2  D2
3      K0     K0  A0  B0     K2     K0  C3  D3
4      K0     K1  A1  B1     K0     K0  C0  D0
..    ...    ...  ..  ..    ...    ...  ..  ..
11     K1     K0  A2  B2     K2     K0  C3  D3
12     K2     K1  A3  B3     K0     K0  C0  D0
13     K2     K1  A3  B3     K1     K0  C1  D1
14     K2     K1  A3  B3     K1     K0  C2  D2
15     K2     K1  A3  B3     K2     K0  C3  D3

[16 rows x 8 columns]

../_images/merging_merge_cross.png

如果 的名称对应于 中的列,则可以Series将 aDataFrame与 a结合使用。合并前将 转换为usingMultiIndexMultiIndexDataFrameSeriesDataFrameSeries.reset_index()

In [60]: df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})

In [61]: df
Out[61]: 
  Let  Num
0   A    1
1   B    2
2   C    3

In [62]: ser = pd.Series(
   ....:     ["a", "b", "c", "d", "e", "f"],
   ....:     index=pd.MultiIndex.from_arrays(
   ....:         [["A", "B", "C"] * 2, [1, 2, 3, 4, 5, 6]], names=["Let", "Num"]
   ....:     ),
   ....: )
   ....: 

In [63]: ser
Out[63]: 
Let  Num
A    1      a
B    2      b
C    3      c
A    4      d
B    5      e
C    6      f
dtype: object

In [64]: pd.merge(df, ser.reset_index(), on=["Let", "Num"])
Out[64]: 
  Let  Num  0
0   A    1  a
1   B    2  b
2   C    3  c

使用重复的连接键执行外连接DataFrame

In [65]: left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})

In [66]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

In [67]: result = pd.merge(left, right, on="B", how="outer")

In [68]: result
Out[68]: 
   A_x  B  A_y
0    1  2    4
1    1  2    5
2    1  2    6
3    2  2    4
4    2  2    5
5    2  2    6

../_images/merging_merge_on_key_dup.png

警告

合并重复键会显着增加结果的维度,并可能导致内存溢出。

合并键唯一性#

validate参数检查合并键是否唯一。在合并操作之前检查密钥的唯一性,可以防止内存溢出和意外的密钥重复。

In [69]: left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})

In [70]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

In [71]: result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")
---------------------------------------------------------------------------
MergeError                                Traceback (most recent call last)
Cell In[71], line 1
----> 1 result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:170, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
    155     return _cross_merge(
    156         left_df,
    157         right_df,
   (...)
    167         copy=copy,
    168     )
    169 else:
--> 170     op = _MergeOperation(
    171         left_df,
    172         right_df,
    173         how=how,
    174         on=on,
    175         left_on=left_on,
    176         right_on=right_on,
    177         left_index=left_index,
    178         right_index=right_index,
    179         sort=sort,
    180         suffixes=suffixes,
    181         indicator=indicator,
    182         validate=validate,
    183     )
    184     return op.get_result(copy=copy)

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:813, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, indicator, validate)
    809 # If argument passed to validate,
    810 # check if columns specified as unique
    811 # are in fact unique.
    812 if validate is not None:
--> 813     self._validate_validate_kwd(validate)

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:1657, in _MergeOperation._validate_validate_kwd(self, validate)
   1653         raise MergeError(
   1654             "Merge keys are not unique in left dataset; not a one-to-one merge"
   1655         )
   1656     if not right_unique:
-> 1657         raise MergeError(
   1658             "Merge keys are not unique in right dataset; not a one-to-one merge"
   1659         )
   1661 elif validate in ["one_to_many", "1:m"]:
   1662     if not left_unique:

MergeError: Merge keys are not unique in right dataset; not a one-to-one merge

如果用户知道右侧有重复项,DataFrame但想确保左侧没有重复项DataFrame,则可以使用该 validate='one_to_many'参数,这不会引发异常。

In [72]: pd.merge(left, right, on="B", how="outer", validate="one_to_many")
Out[72]: 
   A_x  B  A_y
0    1  1  NaN
1    2  2  4.0
2    2  2  5.0
3    2  2  6.0

合并结果指示器#

merge()接受论点indicator。如果True,则名为 的分类类型列_merge将被添加到采用值的输出对象中:

观察原点

_merge价值

'left'仅在框架中合并关键点

left_only

'right'仅在框架中合并关键点

right_only

合并两个帧中的关键点

both

In [73]: df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})

In [74]: df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})

In [75]: pd.merge(df1, df2, on="col1", how="outer", indicator=True)
Out[75]: 
   col1 col_left  col_right      _merge
0     0        a        NaN   left_only
1     1        b        2.0        both
2     2      NaN        2.0  right_only
3     2      NaN        2.0  right_only

的字符串参数indicator将使用该值作为指示符列的名称。

In [76]: pd.merge(df1, df2, on="col1", how="outer", indicator="indicator_column")
Out[76]: 
   col1 col_left  col_right indicator_column
0     0        a        NaN        left_only
1     1        b        2.0             both
2     2      NaN        2.0       right_only
3     2      NaN        2.0       right_only

重叠值列#

mergesuffixes参数采用一个字符串列表元组来附加到输入中重叠的列名,DataFrame以消除结果列的歧义:

In [77]: left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})

In [78]: right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})

In [79]: result = pd.merge(left, right, on="k")

In [80]: result
Out[80]: 
    k  v_x  v_y
0  K0    1    4
1  K0    1    5

../_images/merging_merge_overlapped.png
In [81]: result = pd.merge(left, right, on="k", suffixes=("_l", "_r"))

In [82]: result
Out[82]: 
    k  v_l  v_r
0  K0    1    4
1  K0    1    5

../_images/merging_merge_overlapped_suffix.png

DataFrame.join()#

DataFrame.join()将多个可能具有不同索引的列组合DataFrame成一个结果 DataFrame

In [83]: left = pd.DataFrame(
   ....:     {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
   ....: )
   ....: 

In [84]: right = pd.DataFrame(
   ....:     {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
   ....: )
   ....: 

In [85]: result = left.join(right)

In [86]: result
Out[86]: 
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2

../_images/merging_join.png
In [87]: result = left.join(right, how="outer")

In [88]: result
Out[88]: 
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3

../_images/merging_join_outer.png
In [89]: result = left.join(right, how="inner")

In [90]: result
Out[90]: 
     A   B   C   D
K0  A0  B0  C0  D0
K2  A2  B2  C2  D2

../_images/merging_join_inner.png

DataFrame.join()接受一个可选on参数,该参数可以是DataFrame要对齐的列名或多个列名。

In [91]: left = pd.DataFrame(
   ....:     {
   ....:         "A": ["A0", "A1", "A2", "A3"],
   ....:         "B": ["B0", "B1", "B2", "B3"],
   ....:         "key": ["K0", "K1", "K0", "K1"],
   ....:     }
   ....: )
   ....: 

In [92]: right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])

In [93]: result = left.join(right, on="key")

In [94]: result
Out[94]: 
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1

../_images/merging_join_key_columns.png
In [95]: result = pd.merge(
   ....:     left, right, left_on="key", right_index=True, how="left", sort=False
   ....: )
   ....: 

In [96]: result
Out[96]: 
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1

../_images/merging_merge_key_columns.png

要连接多个键,传递的键DataFrame必须具有MultiIndex

In [97]: left = pd.DataFrame(
   ....:     {
   ....:         "A": ["A0", "A1", "A2", "A3"],
   ....:         "B": ["B0", "B1", "B2", "B3"],
   ....:         "key1": ["K0", "K0", "K1", "K2"],
   ....:         "key2": ["K0", "K1", "K0", "K1"],
   ....:     }
   ....: )
   ....: 

In [98]: index = pd.MultiIndex.from_tuples(
   ....:     [("K0", "K0"), ("K1", "K0"), ("K2", "K0"), ("K2", "K1")]
   ....: )
   ....: 

In [99]: right = pd.DataFrame(
   ....:     {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=index
   ....: )
   ....: 

In [100]: result = left.join(right, on=["key1", "key2"])

In [101]: result
Out[101]: 
    A   B key1 key2    C    D
0  A0  B0   K0   K0   C0   D0
1  A1  B1   K0   K1  NaN  NaN
2  A2  B2   K1   K0   C1   D1
3  A3  B3   K2   K1   C3   D3

../_images/merging_join_multikeys.png

默认情况下DataFrame.join,执行左连接,仅使用调用中找到的键