Merge, join, concatenate and compare#

pandas provides various methods for combining and comparing Series or DataFrame.

- concat(): Merge multiple Series or DataFrame objects along a shared index or column
- DataFrame.join(): Merge multiple DataFrame objects along the columns
- DataFrame.combine_first(): Update missing values with non-missing values in the same location
- merge(): Combine two Series or DataFrame objects with SQL-style joining
- merge_ordered(): Combine two Series or DataFrame objects along an ordered axis
- merge_asof(): Combine two Series or DataFrame objects by near instead of exact matching keys
- Series.compare() and DataFrame.compare(): Show differences in values between two Series or DataFrame objects

concat()#

The concat() function concatenates an arbitrary amount of Series or DataFrame objects along an axis while performing optional set logic (union or intersection) of the indexes on the other axes. Like numpy.concatenate, concat() takes a list or dict of homogeneously-typed objects and concatenates them.

In [1]: df1 = pd.DataFrame(
   ...:     {
   ...:         "A": ["A0", "A1", "A2", "A3"],
   ...:         "B": ["B0", "B1", "B2", "B3"],
   ...:         "C": ["C0", "C1", "C2", "C3"],
   ...:         "D": ["D0", "D1", "D2", "D3"],
   ...:     },
   ...:     index=[0, 1, 2, 3],
   ...: )

In [2]: df2 = pd.DataFrame(
   ...:     {
   ...:         "A": ["A4", "A5", "A6", "A7"],
   ...:         "B": ["B4", "B5", "B6", "B7"],
   ...:         "C": ["C4", "C5", "C6", "C7"],
   ...:         "D": ["D4", "D5", "D6", "D7"],
   ...:     },
   ...:     index=[4, 5, 6, 7],
   ...: )

In [3]: df3 = pd.DataFrame(
   ...:     {
   ...:         "A": ["A8", "A9", "A10", "A11"],
   ...:         "B": ["B8", "B9", "B10", "B11"],
   ...:         "C": ["C8", "C9", "C10", "C11"],
   ...:         "D": ["D8", "D9", "D10", "D11"],
   ...:     },
   ...:     index=[8, 9, 10, 11],
   ...: )

In [4]: frames = [df1, df2, df3]

In [5]: result = pd.concat(frames)

In [6]: result
Out[6]: 
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

Note

concat() makes a full copy of the data, and iteratively reusing concat() can create unnecessary copies. Collect all DataFrame or Series objects in a list before calling concat().

frames = [process_your_file(f) for f in files]
result = pd.concat(frames)

Note

When concatenating DataFrame objects with named axes, pandas will attempt to preserve these index/column names whenever possible. In the case where all inputs share a common name, this name will be assigned to the result. When the input names do not all agree, the result will be unnamed. The same is true for MultiIndex, but the logic is applied separately on a level-by-level basis.

Joining logic of the resulting axis#

The join keyword specifies how to handle axis values that don't exist in the first DataFrame.

join='outer' takes the union of all axis values.

In [7]: df4 = pd.DataFrame(
   ...:     {
   ...:         "B": ["B2", "B3", "B6", "B7"],
   ...:         "D": ["D2", "D3", "D6", "D7"],
   ...:         "F": ["F2", "F3", "F6", "F7"],
   ...:     },
   ...:     index=[2, 3, 6, 7],
   ...: )

In [8]: result = pd.concat([df1, df4], axis=1)

In [9]: result
Out[9]: 
     A    B    C    D    B    D    F
0   A0   B0   C0   D0  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN
2   A2   B2   C2   D2   B2   D2   F2
3   A3   B3   C3   D3   B3   D3   F3
6  NaN  NaN  NaN  NaN   B6   D6   F6
7  NaN  NaN  NaN  NaN   B7   D7   F7

join='inner' takes the intersection of axis values.

In [10]: result = pd.concat([df1, df4], axis=1, join="inner")

In [11]: result
Out[11]: 
    A   B   C   D   B   D   F
2  A2  B2  C2  D2  B2  D2  F2
3  A3  B3  C3  D3  B3  D3  F3

To perform an effective "left" join using the exact index from the original DataFrame, the result can be reindexed.

In [12]: result = pd.concat([df1, df4], axis=1).reindex(df1.index)

In [13]: result
Out[13]: 
    A   B   C   D    B    D    F
0  A0  B0  C0  D0  NaN  NaN  NaN
1  A1  B1  C1  D1  NaN  NaN  NaN
2  A2  B2  C2  D2   B2   D2   F2
3  A3  B3  C3  D3   B3   D3   F3

Ignoring indexes on the concatenation axis#

For DataFrame objects which don't have a meaningful index, ignore_index ignores overlapping indexes.

In [14]: result = pd.concat([df1, df4], ignore_index=True, sort=False)

In [15]: result
Out[15]: 
     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2  NaN
3   A3  B3   C3  D3  NaN
4  NaN  B2  NaN  D2   F2
5  NaN  B3  NaN  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7

Concatenating Series and DataFrame together#

You can concatenate a mix of Series and DataFrame objects. The Series will be transformed to DataFrame with the column name as the name of the Series.

In [16]: s1 = pd.Series(["X0", "X1", "X2", "X3"], name="X")

In [17]: result = pd.concat([df1, s1], axis=1)

In [18]: result
Out[18]: 
    A   B   C   D   X
0  A0  B0  C0  D0  X0
1  A1  B1  C1  D1  X1
2  A2  B2  C2  D2  X2
3  A3  B3  C3  D3  X3

Unnamed Series will be numbered consecutively.

In [19]: s2 = pd.Series(["_0", "_1", "_2", "_3"])

In [20]: result = pd.concat([df1, s2, s2, s2], axis=1)

In [21]: result
Out[21]: 
    A   B   C   D   0   1   2
0  A0  B0  C0  D0  _0  _0  _0
1  A1  B1  C1  D1  _1  _1  _1
2  A2  B2  C2  D2  _2  _2  _2
3  A3  B3  C3  D3  _3  _3  _3

ignore_index=True will drop all name references.

In [22]: result = pd.concat([df1, s1], axis=1, ignore_index=True)

In [23]: result
Out[23]: 
    0   1   2   3   4
0  A0  B0  C0  D0  X0
1  A1  B1  C1  D1  X1
2  A2  B2  C2  D2  X2
3  A3  B3  C3  D3  X3

Resulting keys#

The keys argument adds another axis level to the resulting index or columns (creating a MultiIndex), associating specific keys with each original DataFrame.

In [24]: result = pd.concat(frames, keys=["x", "y", "z"])

In [25]: result
Out[25]: 
        A    B    C    D
x 0    A0   B0   C0   D0
  1    A1   B1   C1   D1
  2    A2   B2   C2   D2
  3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
  5    A5   B5   C5   D5
  6    A6   B6   C6   D6
  7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
  9    A9   B9   C9   D9
  10  A10  B10  C10  D10
  11  A11  B11  C11  D11

In [26]: result.loc["y"]
Out[26]: 
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

The keys argument can override the column names when creating a new DataFrame based on existing Series.

In [27]: s3 = pd.Series([0, 1, 2, 3], name="foo")

In [28]: s4 = pd.Series([0, 1, 2, 3])

In [29]: s5 = pd.Series([0, 1, 4, 5])

In [30]: pd.concat([s3, s4, s5], axis=1)
Out[30]: 
   foo  0  1
0    0  0  0
1    1  1  1
2    2  2  4
3    3  3  5

In [31]: pd.concat([s3, s4, s5], axis=1, keys=["red", "blue", "yellow"])
Out[31]: 
   red  blue  yellow
0    0     0       0
1    1     1       1
2    2     2       4
3    3     3       5

You can also pass a dict to concat(), in which case the dict keys will be used for the keys argument, unless other keys are specified:

In [32]: pieces = {"x": df1, "y": df2, "z": df3}

In [33]: result = pd.concat(pieces)

In [34]: result
Out[34]: 
        A    B    C    D
x 0    A0   B0   C0   D0
  1    A1   B1   C1   D1
  2    A2   B2   C2   D2
  3    A3   B3   C3   D3
y 4    A4   B4   C4   D4
  5    A5   B5   C5   D5
  6    A6   B6   C6   D6
  7    A7   B7   C7   D7
z 8    A8   B8   C8   D8
  9    A9   B9   C9   D9
  10  A10  B10  C10  D10
  11  A11  B11  C11  D11

In [35]: result = pd.concat(pieces, keys=["z", "y"])

In [36]: result
Out[36]: 
        A    B    C    D
z 8    A8   B8   C8   D8
  9    A9   B9   C9   D9
  10  A10  B10  C10  D10
  11  A11  B11  C11  D11
y 4    A4   B4   C4   D4
  5    A5   B5   C5   D5
  6    A6   B6   C6   D6
  7    A7   B7   C7   D7

The levels of the resulting MultiIndex are constructed from the passed keys and the index of the DataFrame pieces:

In [37]: result.index.levels
Out[37]: FrozenList([['z', 'y'], [4, 5, 6, 7, 8, 9, 10, 11]])

The levels argument allows specifying the resulting levels associated with the keys.

In [38]: result = pd.concat(
   ....:     pieces, keys=["x", "y", "z"], levels=[["z", "y", "x", "w"]], names=["group_key"]
   ....: )

In [39]: result
Out[39]: 
                A    B    C    D
group_key                       
x         0    A0   B0   C0   D0
          1    A1   B1   C1   D1
          2    A2   B2   C2   D2
          3    A3   B3   C3   D3
y         4    A4   B4   C4   D4
          5    A5   B5   C5   D5
          6    A6   B6   C6   D6
          7    A7   B7   C7   D7
z         8    A8   B8   C8   D8
          9    A9   B9   C9   D9
          10  A10  B10  C10  D10
          11  A11  B11  C11  D11

In [40]: result.index.levels
Out[40]: FrozenList([['z', 'y', 'x', 'w'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])

Appending rows to a DataFrame#

If you have a Series that you want to append as a single row to a DataFrame, you can convert the row into a DataFrame and use concat().

In [41]: s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])

In [42]: result = pd.concat([df1, s2.to_frame().T], ignore_index=True)

In [43]: result
Out[43]: 
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  X0  X1  X2  X3

merge()#

merge() performs join operations similar to relational databases like SQL. Users who are familiar with SQL but new to pandas can reference the comparison with SQL.

Merge types#

merge() implements common SQL style joining operations.

- one-to-one: joining two DataFrame objects on their indexes, which must contain unique values.
- many-to-one: joining a unique index to one or more columns in a different DataFrame.
- many-to-many: joining columns on columns.

Note

When joining columns on columns, potentially a many-to-many join, any indexes on the passed DataFrame objects will be discarded.

For a many-to-many join, if a key combination appears more than once in both tables, the DataFrame will have the Cartesian product of the associated data.

In [44]: left = pd.DataFrame(
   ....:     {
   ....:         "key": ["K0", "K1", "K2", "K3"],
   ....:         "A": ["A0", "A1", "A2", "A3"],
   ....:         "B": ["B0", "B1", "B2", "B3"],
   ....:     }
   ....: )

In [45]: right = pd.DataFrame(
   ....:     {
   ....:         "key": ["K0", "K1", "K2", "K3"],
   ....:         "C": ["C0", "C1", "C2", "C3"],
   ....:         "D": ["D0", "D1", "D2", "D3"],
   ....:     }
   ....: )

In [46]: result = pd.merge(left, right, on="key")

In [47]: result
Out[47]: 
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3

The how argument to merge() specifies which keys are included in the resulting table. If a key combination does not appear in either the left or right tables, the values in the joined table will be NA. Here is a summary of the how options and their SQL equivalent names:

Merge method  SQL join name     Description
left          LEFT OUTER JOIN   Use keys from the left frame only
right         RIGHT OUTER JOIN  Use keys from the right frame only
outer         FULL OUTER JOIN   Use the union of keys from both frames
inner         INNER JOIN        Use the intersection of keys from both frames
cross         CROSS JOIN        Create the Cartesian product of the rows of both frames

In [48]: left = pd.DataFrame(
   ....:     {
   ....:         "key1": ["K0", "K0", "K1", "K2"],
   ....:         "key2": ["K0", "K1", "K0", "K1"],
   ....:         "A": ["A0", "A1", "A2", "A3"],
   ....:         "B": ["B0", "B1", "B2", "B3"],
   ....:     }
   ....: )

In [49]: right = pd.DataFrame(
   ....:     {
   ....:         "key1": ["K0", "K1", "K1", "K2"],
   ....:         "key2": ["K0", "K0", "K0", "K0"],
   ....:         "C": ["C0", "C1", "C2", "C3"],
   ....:         "D": ["D0", "D1", "D2", "D3"],
   ....:     }
   ....: )

In [50]: result = pd.merge(left, right, how="left", on=["key1", "key2"])

In [51]: result
Out[51]: 
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN

In [52]: result = pd.merge(left, right, how="right", on=["key1", "key2"])

In [53]: result
Out[53]: 
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3

In [54]: result = pd.merge(left, right, how="outer", on=["key1", "key2"])

In [55]: result
Out[55]: 
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K0  NaN  NaN   C3   D3
5   K2   K1   A3   B3  NaN  NaN

In [56]: result = pd.merge(left, right, how="inner", on=["key1", "key2"])

In [57]: result
Out[57]: 
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2

In [58]: result = pd.merge(left, right, how="cross")

In [59]: result
Out[59]: 
   key1_x key2_x   A   B key1_y key2_y   C   D
0      K0     K0  A0  B0     K0     K0  C0  D0
1      K0     K0  A0  B0     K1     K0  C1  D1
2      K0     K0  A0  B0     K1     K0  C2  D2
3      K0     K0  A0  B0     K2     K0  C3  D3
4      K0     K1  A1  B1     K0     K0  C0  D0
..    ...    ...  ..  ..    ...    ...  ..  ..
11     K1     K0  A2  B2     K2     K0  C3  D3
12     K2     K1  A3  B3     K0     K0  C0  D0
13     K2     K1  A3  B3     K1     K0  C1  D1
14     K2     K1  A3  B3     K1     K0  C2  D2
15     K2     K1  A3  B3     K2     K0  C3  D3

[16 rows x 8 columns]

You can merge a Series and a DataFrame with a MultiIndex if the names of the MultiIndex correspond to columns from the DataFrame. Transform the Series to a DataFrame using Series.reset_index() before merging.

In [60]: df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})

In [61]: df
Out[61]: 
  Let  Num
0   A    1
1   B    2
2   C    3

In [62]: ser = pd.Series(
   ....:     ["a", "b", "c", "d", "e", "f"],
   ....:     index=pd.MultiIndex.from_arrays(
   ....:         [["A", "B", "C"] * 2, [1, 2, 3, 4, 5, 6]], names=["Let", "Num"]
   ....:     ),
   ....: )

In [63]: ser
Out[63]: 
Let  Num
A    1      a
B    2      b
C    3      c
A    4      d
B    5      e
C    6      f
dtype: object

In [64]: pd.merge(df, ser.reset_index(), on=["Let", "Num"])
Out[64]: 
  Let  Num  0
0   A    1  a
1   B    2  b
2   C    3  c

Performing an outer join with duplicate join keys in DataFrame:

In [65]: left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})

In [66]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

In [67]: result = pd.merge(left, right, on="B", how="outer")

In [68]: result
Out[68]: 
   A_x  B  A_y
0    1  2    4
1    1  2    5
2    1  2    6
3    2  2    4
4    2  2    5
5    2  2    6

Warning

Merging on duplicate keys significantly increases the dimensions of the result and can cause a memory overflow.

Merge key uniqueness#

The validate argument checks whether the merge keys are unique. Key uniqueness is checked before the merge operation, which can protect against memory overflows and unexpected key duplication.

In [69]: left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})

In [70]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

In [71]: result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")
---------------------------------------------------------------------------
MergeError                                Traceback (most recent call last)
Cell In[71], line 1
----> 1 result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:170, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
    155     return _cross_merge(
    156         left_df,
    157         right_df,
   (...)
    167         copy=copy,
    168     )
    169 else:
--> 170     op = _MergeOperation(
    171         left_df,
    172         right_df,
    173         how=how,
    174         on=on,
    175         left_on=left_on,
    176         right_on=right_on,
    177         left_index=left_index,
    178         right_index=right_index,
    179         sort=sort,
    180         suffixes=suffixes,
    181         indicator=indicator,
    182         validate=validate,
    183     )
    184 return op.get_result(copy=copy)

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:813, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, indicator, validate)
    809 # If argument passed to validate,
    810 # check if columns specified as unique
    811 # are in fact unique.
    812 if validate is not None:
--> 813     self._validate_validate_kwd(validate)

File ~/work/pandas/pandas/pandas/core/reshape/merge.py:1657, in _MergeOperation._validate_validate_kwd(self, validate)
   1653         raise MergeError(
   1654             "Merge keys are not unique in left dataset; not a one-to-one merge"
   1655         )
   1656     if not right_unique:
-> 1657         raise MergeError(
   1658             "Merge keys are not unique in right dataset; not a one-to-one merge"
   1659         )
   1661 elif validate in ["one_to_many", "1:m"]:
   1662     if not left_unique:

MergeError: Merge keys are not unique in right dataset; not a one-to-one merge

If the user is aware of the duplicates in the right DataFrame but wants to ensure there are no duplicates in the left DataFrame, one can use the validate='one_to_many' argument instead, which will not raise an exception.

In [72]: pd.merge(left, right, on="B", how="outer", validate="one_to_many")
Out[72]: 
   A_x  B  A_y
0    1  1  NaN
1    2  2  4.0
2    2  2  5.0
3    2  2  6.0

Merge result indicator#

merge() accepts the argument indicator. If True, a Categorical-type column called _merge will be added to the output object, taking on the values:

Observation origin                  _merge value
Merge key only in 'left' frame      left_only
Merge key only in 'right' frame     right_only
Merge key in both frames            both

In [73]: df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})

In [74]: df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})

In [75]: pd.merge(df1, df2, on="col1", how="outer", indicator=True)
Out[75]: 
   col1 col_left  col_right      _merge
0     0        a        NaN   left_only
1     1        b        2.0        both
2     2      NaN        2.0  right_only
3     2      NaN        2.0  right_only

A string argument to indicator will use that value as the name for the indicator column.

In [76]: pd.merge(df1, df2, on="col1", how="outer", indicator="indicator_column")
Out[76]: 
   col1 col_left  col_right indicator_column
0     0        a        NaN        left_only
1     1        b        2.0             both
2     2      NaN        2.0       right_only
3     2      NaN        2.0       right_only

Overlapping value columns#

The merge suffixes argument takes a tuple of strings to append to overlapping column names in the input DataFrames to disambiguate the result columns:

In [77]: left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})

In [78]: right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})

In [79]: result = pd.merge(left, right, on="k")

In [80]: result
Out[80]: 
    k  v_x  v_y
0  K0    1    4
1  K0    1    5

In [81]: result = pd.merge(left, right, on="k", suffixes=("_l", "_r"))

In [82]: result
Out[82]: 
    k  v_l  v_r
0  K0    1    4
1  K0    1    5

DataFrame.join()#
DataFrame.join() combines the columns of multiple, potentially differently-indexed DataFrame objects into a single result DataFrame.

In [83]: left = pd.DataFrame(
   ....:     {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
   ....: )

In [84]: right = pd.DataFrame(
   ....:     {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
   ....: )

In [85]: result = left.join(right)

In [86]: result
Out[86]: 
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2

In [87]: result = left.join(right, how="outer")

In [88]: result
Out[88]: 
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3

In [89]: result = left.join(right, how="inner")

In [90]: result
Out[90]: 
     A   B   C   D
K0  A0  B0  C0  D0
K2  A2  B2  C2  D2

DataFrame.join() takes an optional on argument, which may be a column or multiple column names that the passed DataFrame is to be aligned on.

In [91]: left = pd.DataFrame(
   ....:     {
   ....:         "A": ["A0", "A1", "A2", "A3"],
   ....:         "B": ["B0", "B1", "B2", "B3"],
   ....:         "key": ["K0", "K1", "K0", "K1"],
   ....:     }
   ....: )

In [92]: right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])

In [93]: result = left.join(right, on="key")

In [94]: result
Out[94]: 
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1

In [95]: result = pd.merge(
   ....:     left, right, left_on="key", right_index=True, how="left", sort=False
   ....: )

In [96]: result
Out[96]: 
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1

To join on multiple keys, the passed DataFrame must have a MultiIndex:

In [97]: left = pd.DataFrame(
   ....:     {
   ....:         "A": ["A0", "A1", "A2", "A3"],
   ....:         "B": ["B0", "B1", "B2", "B3"],
   ....:         "key1": ["K0", "K0", "K1", "K2"],
   ....:         "key2": ["K0", "K1", "K0", "K1"],
   ....:     }
   ....: )

In [98]: index = pd.MultiIndex.from_tuples(
   ....:     [("K0", "K0"), ("K1", "K0"), ("K2", "K0"), ("K2", "K1")]
   ....: )

In [99]: right = pd.DataFrame(
   ....:     {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=index
   ....: )

In [100]: result = left.join(right, on=["key1", "key2"])

In [101]: result
Out[101]: 
    A   B key1 key2    C    D
0  A0  B0   K0   K0   C0   D0
1  A1  B1   K0   K1  NaN  NaN
2  A2  B2   K1   K0   C1   D1
3  A3  B3   K2   K1   C3   D3

The default for DataFrame.join is to perform a left join, which uses only the keys found in the calling DataFrame. Other join types can be specified with how.

In [102]: result = left.join(right, on=["key1", "key2"], how="inner")

In [103]: result
Out[103]: 
    A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
2  A2  B2   K1   K0  C1  D1
3  A3  B3   K2   K1  C3  D3

Joining a single Index to a MultiIndex#

You can join a DataFrame with an Index to a DataFrame with a MultiIndex on a level. The name of the Index will match the level name of the MultiIndex.

In [104]: left = pd.DataFrame(
   .....:     {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]},
   .....:     index=pd.Index(["K0", "K1", "K2"], name="key"),
   .....: )

In [105]: index = pd.MultiIndex.from_tuples(
   .....:     [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")],
   .....:     names=["key", "Y"],
   .....: )

In [106]: right = pd.DataFrame(
   .....:     {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]},
   .....:     index=index,
   .....: )

In [107]: result = left.join(right, how="inner")

In [108]: result
Out[108]: 
         A   B   C   D
key Y                 
K0  Y0  A0  B0  C0  D0
K1  Y1  A1  B1  C1  D1
K2  Y2  A2  B2  C2  D2
    Y3  A2  B2  C3  D3

Joining with two MultiIndex#

The MultiIndex of the input argument must be completely used in the join, and must be a subset of the indices in the left argument.

In [109]: leftindex = pd.MultiIndex.from_product(
   .....:     [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
   .....: )

In [110]: left = pd.DataFrame({"v1": range(12)}, index=leftindex)

In [111]: left
Out[111]: 
            v1
abc xy num    
a   x  1     0
       2     1
    y  1     2
       2     3
b   x  1     4
       2     5
    y  1     6
       2     7
c   x  1     8
       2     9
    y  1    10
       2    11

In [112]: rightindex = pd.MultiIndex.from_product(
   .....:     [list("abc"), list("xy")], names=["abc", "xy"]
   .....: )

In [113]: right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)

In [114]: right
Out[114]: 
         v2
abc xy     
a   x   100
    y   200
b   x   300
    y   400
c   x   500
    y   600

In [115]: left.join(right, on=["abc", "xy"], how="inner")
Out[115]: 
            v1   v2
abc xy num         
a   x  1     0  100
       2     1  100
    y  1     2  200
       2     3  200
b   x  1     4  300
       2     5  300
    y  1     6  400
       2     7  400
c   x  1     8  500
       2     9  500
    y  1    10  600
       2    11  600

In [116]: leftindex = pd.MultiIndex.from_tuples(
   .....:     [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"]
   .....: )

In [117]: left = pd.DataFrame(
   .....:     {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=leftindex
   .....: )

In [118]: rightindex = pd.MultiIndex.from_tuples(
   .....:     [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")], names=["key", "Y"]
   .....: )

In [119]: right = pd.DataFrame(
   .....:     {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=rightindex
   .....: )

In [120]: result = pd.merge(
   .....:     left.reset_index(), right.reset_index(), on=["key"], how="inner"
   .....: ).set_index(["key", "X", "Y"])

In [121]: result
Out[121]: 
            A   B   C   D
key X  Y                 
K0  X0 Y0  A0  B0  C0  D0
    X1 Y0  A1  B1  C0  D0
K1  X2 Y1  A2  B2  C1  D1

Merging on a combination of columns and index levels#

Strings passed as the on, left_on, and right_on parameters may refer to either column names or index level names. This enables merging DataFrame instances on a combination of index levels and columns without resetting indexes.

In [122]: left_index = pd.Index(["K0", "K0", "K1", "K2"], name="key1")

In [123]: left = pd.DataFrame(
   .....:     {
   .....:         "A": ["A0", "A1", "A2", "A3"],
   .....:         "B": ["B0", "B1", "B2", "B3"],
   .....:         "key2": ["K0", "K1", "K0", "K1"],
   .....:     },
   .....:     index=left_index,
   .....: )

In [124]: right_index = pd.Index(["K0", "K1", "K2", "K2"], name="key1")

In [125]: right = pd.DataFrame(
   .....:     {
   .....:         "C": ["C0", "C1", "C2", "C3"],
   .....:         "D": ["D0", "D1", "D2", "D3"],
   .....:         "key2": ["K0", "K0", "K0", "K1"],
   .....:     },
   .....:     index=right_index,
   .....: )

In [126]: result = left.merge(right, on=["key1", "key2"])

In [127]: result
Out[127]: 
       A   B key2   C   D
key1                     
K0    A0  B0   K0  C0  D0
K1    A2  B2   K0  C1  D1
K2    A3  B3   K1  C3  D3

Note

When DataFrames are joined on a string that matches an index level in both arguments, the index level is preserved as an index level in the resulting DataFrame.

Note

When DataFrames are joined using only some of the levels of a MultiIndex, the extra levels will be dropped from the resulting join. To preserve those levels, use DataFrame.reset_index() on those level names to move those levels to columns prior to the join.

Joining multiple DataFrame#

A list or tuple of DataFrame can also be passed to join() to join them together on their indexes.

In [128]: right2 = pd.DataFrame({"v": [7, 8, 9]}, index=["K1", "K1", "K2"])

In [129]: result = left.join([right, right2])

DataFrame.combine_first()#

DataFrame.combine_first() updates missing values in one DataFrame with the non-missing values from another DataFrame in the corresponding location.

In [130]: df1 = pd.DataFrame(
   .....:     [[np.nan, 3.0, 5.0], [-4.6, np.nan, np.nan], [np.nan, 7.0, np.nan]]
   .....: )

In [131]: df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])

In [132]: result = df1.combine_first(df2)

In [133]: result
Out[133]: 
     0    1    2
0  NaN  3.0  5.0
1 -4.6  NaN -8.2
2 -5.0  7.0  4.0

merge_ordered()#

merge_ordered() combines ordered data, such as numeric or time series data, with optional filling of missing data with fill_method.

In [134]: left = pd.DataFrame(
   .....:     {"k": ["K0", "K1", "K1", "K2"], "lv": [1, 2, 3, 4], "s": ["a", "b", "c", "d"]}
   .....: )

In [135]: right = pd.DataFrame({"k": ["K1", "K2", "K4"], "rv": [1, 2, 3]})

In [136]: pd.merge_ordered(left, right, fill_method="ffill", left_by="s")
Out[136]: 
     k   lv  s   rv
0   K0  1.0  a  NaN
1   K1  1.0  a  1.0
2   K2  1.0  a  2.0
3   K4  1.0  a  3.0
4   K1  2.0  b  1.0
5   K2  2.0  b  2.0
6   K4  2.0  b  3.0
7   K1  3.0  c  1.0
8   K2  3.0  c  2.0
9   K4  3.0  c  3.0
10  K1  NaN  d  1.0
11  K2  4.0  d  2.0
12  K4  4.0  d  3.0

merge_asof()#

merge_asof() is similar to an ordered left-join except that matches are on the nearest key rather than equal keys. For each row in the left DataFrame, the last row in the right DataFrame is selected where the on key is less than the left's key. Both DataFrame must be sorted by the key.

Optionally, merge_asof() can perform a group-wise merge by matching the by key in addition to the nearest match on the on key.

In [137]: trades = pd.DataFrame(
   .....:     {
   .....:         "time": pd.to_datetime(
   .....:             [
   .....:                 "20160525 13:30:00.023",
   .....:                 "20160525 13:30:00.038",
   .....:                 "20160525 13:30:00.048",
   .....:                 "20160525 13:30:00.048",
   .....:                 "20160525 13:30:00.048",
   .....:             ]
   .....:         ),
   .....:         "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
   .....:         "price": [51.95, 51.95, 720.77, 720.92, 98.00],
   .....:         "quantity": [75, 155, 100, 100, 100],
   .....:     },
   .....:     columns=["time", "ticker", "price", "quantity"],
   .....: )

In [138]: quotes = pd.DataFrame(
   .....:     {
   .....:         "time": pd.to_datetime(
   .....:             [
   .....:                 "20160525 13:30:00.023",
   .....:                 "20160525 13:30:00.023",
   .....:                 "20160525 13:30:00.030",
   .....:                 "20160525 13:30:00.041",
   .....:                 "20160525 13:30:00.048",
   .....:                 "20160525 13:30:00.049",
   .....:                 "20160525 13:30:00.072",
   .....:                 "20160525 13:30:00.075",
   .....:             ]
   .....:         ),
   .....:         "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"],
   .....:         "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
   .....:         "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03],
   .....:     },
   .....:     columns=["time", "ticker", "bid", "ask"],
   .....: )

In [139]: trades
Out[139]: 
                     time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100

In [140]: quotes
Out[140]: 
                     time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03

In [141]: pd.merge_asof(trades, quotes, on="time", by="ticker")
Out[141]: 
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

merge_asof() within 2ms between the quote time and the trade time.

In [142]: pd.merge_asof(trades, quotes, on="time", by="ticker", tolerance=pd.Timedelta("2ms"))
Out[142]: 
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155     NaN     NaN
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

merge_asof() within 10ms between the quote time and the trade time, excluding exact matches on time. Note that though we exclude the exact matches (of the quotes), prior quotes do propagate to that point in time.

In [143]: pd.merge_asof(
   .....:     trades,
   .....:     quotes,
   .....:     on="time",
   .....:     by="ticker",
   .....:     tolerance=pd.Timedelta("10ms"),
   .....:     allow_exact_matches=False,
   .....: )
Out[143]: 
                     time ticker   price  quantity    bid    ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75    NaN    NaN
1 2016-05-25 13:30:00.038   MSFT   51.95       155  51.97  51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100    NaN    NaN
3 2016-05-25 13:30:00.048   GOOG  720.92       100    NaN    NaN
4 2016-05-25 13:30:00.048   AAPL   98.00       100    NaN    NaN

compare()#

The Series.compare() and DataFrame.compare() methods allow you to compare two Series or DataFrame, respectively, and summarize their differences.

In [144]: df = pd.DataFrame(
   .....:     {
   .....:         "col1": ["a", "a", "b", "b", "a"],
   .....:         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
   .....:         "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
   .....:     },
   .....:     columns=["col1", "col2", "col3"],
   .....: )

In [145]: df
Out[145]: 
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0

In [146]: df2 = df.copy()

In [147]: df2.loc[0, "col1"] = "c"

In [148]: df2.loc[2, "col3"] = 4.0

In [149]: df2
Out[149]: 
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

In [150]: df.compare(df2)
Out[150]: 
  col1        col3      
  self other  self other
0    a     c   NaN   NaN
2  NaN   NaN   3.0   4.0

By default, if two corresponding values are equal, they will be shown as NaN. Furthermore, if all values in an entire row / column are equal, the row / column will be omitted from the result. The remaining differences will be aligned on columns.

Stack the differences on rows.

In [151]: df.compare(df2, align_axis=0)
Out[151]: 
         col1  col3
0 self      a   NaN
  other     c   NaN
2 self    NaN   3.0
  other   NaN   4.0

Keep all original rows and columns with keep_shape=True.

In [152]: df.compare(df2, keep_shape=True)
Out[152]: 
  col1        col2        col3      
  self other  self other  self other
0    a     c   NaN   NaN   NaN   NaN
1  NaN   NaN   NaN   NaN   NaN   NaN
2  NaN   NaN   NaN   NaN   3.0   4.0
3  NaN   NaN   NaN   NaN   NaN   NaN
4  NaN   NaN   NaN   NaN   NaN   NaN

Keep all original values, even if they are equal.

In [153]: df.compare(df2, keep_shape=True, keep_equal=True)
Out[153]: 
  col1        col2        col3      
  self other  self other  self other
0    a     c   1.0   1.0   1.0   1.0
1    a     a   2.0   2.0   2.0   2.0
2    b     b   3.0   3.0   3.0   4.0
3    b     b   NaN   NaN   4.0   4.0
4    a     a   5.0   5.0   5.0   5.0
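The examples above all use DataFrame.compare(); Series.compare() follows the same conventions. A minimal sketch (the Series values here are illustrative, not from the guide):

```python
import pandas as pd

# Two Series that differ only at position 1 (hypothetical example data).
s1 = pd.Series(["a", "b", "c"])
s2 = pd.Series(["a", "x", "c"])

# As with DataFrame.compare(), only the differing positions are kept,
# labelled "self" (the caller) and "other" (the argument).
diff = s1.compare(s2)
print(diff)
#   self other
# 1    b     x
```

Series.compare() also accepts keep_shape=True and keep_equal=True, with the same meanings as in the DataFrame examples above.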