与 SQL #的比较 由于许多潜在的 pandas 用户对 SQL有一定的熟悉,本页面旨在提供一些示例,说明如何使用 pandas 执行各种 SQL 操作。 如果您是 pandas 新手,您可能需要先阅读10 分钟了解 pandas 以熟悉该库。 按照惯例,我们导入 pandas 和 NumPy,如下所示: In [1]: import pandas as pd In [2]: import numpy as np 大多数示例将利用tipspandas 测试中找到的数据集。我们将数据读入名为 的 DataFrame 中tips,并假设我们有一个具有相同名称和结构的数据库表。 In [3]: url = ( ...: "https://raw.githubusercontent.com/pandas-dev" ...: "/pandas/main/pandas/tests/io/data/csv/tips.csv" ...: ) ...: In [4]: tips = pd.read_csv(url) In [5]: tips Out[5]: total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4 .. ... ... ... ... ... ... ... 239 29.03 5.92 Male No Sat Dinner 3 240 27.18 2.00 Female Yes Sat Dinner 2 241 22.67 2.00 Male Yes Sat Dinner 2 242 17.82 1.75 Male No Sat Dinner 2 243 18.78 3.00 Female No Thur Dinner 2 [244 rows x 7 columns] 副本与就地操作# Series大多数 pandas 操作都会返回/的副本DataFrame。要使更改“坚持”,您需要分配给一个新变量: sorted_df = df.sort_values("col1") 或者覆盖原来的: df = df.sort_values("col1") 笔记 您将看到某些方法可使用inplace=Trueor关键字参数:copy=False df.replace(5, inplace=True) inplace关于弃用和删除以及copy对于大多数方法(例如dropna),除了极小的方法子集(包括replace)之外,存在着积极的讨论。在 Copy-on-Write 上下文中,这两个关键字不再是必需的。该提案可以 在这里找到。 选择# 在 SQL 中,选择是使用您想要选择的列的逗号分隔列表(或* 选择所有列)来完成的: SELECT total_bill, tip, smoker, time FROM tips; 对于 pandas,列选择是通过将列名称列表传递到 DataFrame 来完成的: In [6]: tips[["total_bill", "tip", "smoker", "time"]] Out[6]: total_bill tip smoker time 0 16.99 1.01 No Dinner 1 10.34 1.66 No Dinner 2 21.01 3.50 No Dinner 3 23.68 3.31 No Dinner 4 24.59 3.61 No Dinner .. ... ... ... ... 239 29.03 5.92 No Dinner 240 27.18 2.00 Yes Dinner 241 22.67 2.00 Yes Dinner 242 17.82 1.75 No Dinner 243 18.78 3.00 No Dinner [244 rows x 4 columns] 调用不带列名列表的 DataFrame 将显示所有列(类似于 SQL *)。 在 SQL 中,您可以添加计算列: SELECT *, tip/total_bill as tip_rate FROM tips; 使用 pandas,您可以使用DataFrame.assign()DataFrame 的方法来附加新列: In [7]: tips.assign(tip_rate=tips["tip"] / tips["total_bill"]) Out[7]: total_bill tip sex smoker day time size tip_rate 0 16.99 1.01 Female No Sun Dinner 2 0.059447 1 10.34 1.66 Male No Sun Dinner 3 0.160542 2 21.01 3.50 Male No Sun Dinner 3 0.166587 3 23.68 3.31 Male No Sun Dinner 2 0.139780 4 24.59 3.61 Female No Sun Dinner 4 0.146808 .. ... ... ... ... ... ... ... ... 239 29.03 5.92 Male No Sat Dinner 3 0.203927 240 27.18 2.00 Female Yes Sat Dinner 2 0.073584 241 22.67 2.00 Male Yes Sat Dinner 2 0.088222 242 17.82 1.75 Male No Sat Dinner 2 0.098204 243 18.78 3.00 Female No Thur Dinner 2 0.159744 [244 rows x 8 columns] 在哪里# SQL 中的过滤是通过 WHERE 子句完成的。 SELECT * FROM tips WHERE time = 'Dinner'; DataFrames 可以通过多种方式进行过滤;其中最直观的是使用 布尔索引。 In [8]: tips[tips["total_bill"] > 10] Out[8]: total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4 .. ... ... ... ... ... ... ... 239 29.03 5.92 Male No Sat Dinner 3 240 27.18 2.00 Female Yes Sat Dinner 2 241 22.67 2.00 Male Yes Sat Dinner 2 242 17.82 1.75 Male No Sat Dinner 2 243 18.78 3.00 Female No Thur Dinner 2 [227 rows x 7 columns] 上面的语句只是将一个Seriesof True/False对象传递给 DataFrame,返回所有带有True. In [9]: is_dinner = tips["time"] == "Dinner" In [10]: is_dinner Out[10]: 0 True 1 True 2 True 3 True 4 True ... 239 True 240 True 241 True 242 True 243 True Name: time, Length: 244, dtype: bool In [11]: is_dinner.value_counts() Out[11]: time True 176 False 68 Name: count, dtype: int64 In [12]: tips[is_dinner] Out[12]: total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4 .. ... ... ... ... ... ... ... 239 29.03 5.92 Male No Sat Dinner 3 240 27.18 2.00 Female Yes Sat Dinner 2 241 22.67 2.00 Male Yes Sat Dinner 2 242 17.82 1.75 Male No Sat Dinner 2 243 18.78 3.00 Female No Thur Dinner 2 [176 rows x 7 columns] 就像 SQL 的ORand一样,可以使用 ( ) 和( )AND将多个条件传递到 DataFrame 。|OR&AND 晚餐超过 5 美元的小费: SELECT * FROM tips WHERE time = 'Dinner' AND tip > 5.00; In [13]: tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)] Out[13]: total_bill tip sex smoker day time size 23 39.42 7.58 Male No Sat Dinner 4 44 30.40 5.60 Male No Sun Dinner 4 47 32.40 6.00 Male No Sun Dinner 4 52 34.81 5.20 Female No Sun Dinner 4 59 48.27 6.73 Male No Sat Dinner 4 116 29.93 5.07 Male No Sun Dinner 4 155 29.85 5.14 Female No Sun Dinner 5 170 50.81 10.00 Male Yes Sat Dinner 3 172 7.25 5.15 Male Yes Sun Dinner 2 181 23.33 5.65 Male Yes Sun Dinner 2 183 23.17 6.50 Male Yes Sun Dinner 4 211 25.89 5.16 Male Yes Sat Dinner 4 212 48.33 9.00 Male No Sat Dinner 4 214 28.17 6.50 Female Yes Sat Dinner 3 239 29.03 5.92 Male No Sat Dinner 3 至少 5 名用餐者的团体小费或账单总额超过 45 美元: SELECT * FROM tips WHERE size >= 5 OR total_bill > 45; In [14]: tips[(tips["size"] >= 5) | (tips["total_bill"] > 45)] Out[14]: total_bill tip sex smoker day time size 59 48.27 6.73 Male No Sat Dinner 4 125 29.80 4.20 Female No Thur Lunch 6 141 34.30 6.70 Male No Thur Lunch 6 142 41.19 5.00 Male No Thur Lunch 5 143 27.05 5.00 Female No Thur Lunch 6 155 29.85 5.14 Female No Sun Dinner 5 156 48.17 5.00 Male No Sun Dinner 6 170 50.81 10.00 Male Yes Sat Dinner 3 182 45.35 3.50 Male Yes Sun Dinner 3 185 20.69 5.00 Male No Sun Dinner 5 187 30.46 2.00 Male Yes Sun Dinner 5 212 48.33 9.00 Male No Sat Dinner 4 216 28.15 3.00 Male Yes Sat Dinner 5 NULL 检查是使用notna()和isna() 方法完成的。 In [15]: frame = pd.DataFrame( ....: {"col1": ["A", "B", np.nan, "C", "D"], "col2": ["F", np.nan, "G", "H", "I"]} ....: ) ....: In [16]: frame Out[16]: col1 col2 0 A F 1 B NaN 2 NaN G 3 C H 4 D I 假设我们有一个与上面的 DataFrame 结构相同的表。col2通过以下查询,我们只能看到 IS NULL 的记录: SELECT * FROM frame WHERE col2 IS NULL; In [17]: frame[frame["col2"].isna()] Out[17]: col1 col2 1 B NaN col1可以使用 来获取 IS NOT NULL 的项目notna()。 SELECT * FROM frame WHERE col1 IS NOT NULL; In [18]: frame[frame["col1"].notna()] Out[18]: col1 col2 0 A F 1 B NaN 3 C H 4 D I 通过...分组# 在 pandas 中,SQL 的操作是使用类似名称的 方法执行的。通常是指我们希望将数据集分成组,应用某些函数(通常是聚合),然后将组组合在一起的过程。GROUP BYgroupby()groupby() 常见的 SQL 操作是获取整个数据集中每个组中的记录数。例如,一个查询让我们获得按性别留下的小费数量: SELECT sex, count(*) FROM tips GROUP BY sex; /* Female 87 Male 157 */ pandas 的等价物是: In [19]: tips.groupby("sex").size() Out[19]: sex Female 87 Male 157 dtype: int64 请注意,在 pandas 代码中我们使用了DataFrameGroupBy.size()而不是 DataFrameGroupBy.count().这是因为 DataFrameGroupBy.count()将该函数应用于每列,返回每列中的记录数。NOT NULL In [20]: tips.groupby("sex").count() Out[20]: total_bill tip smoker day time size sex Female 87 87 87 87 87 87 Male 157 157 157 157 157 157 或者,我们可以将该DataFrameGroupBy.count()方法应用于单个列: In [21]: tips.groupby("sex")["total_bill"].count() Out[21]: sex Female 87 Male 157 Name: total_bill, dtype: int64 也可以同时应用多个功能。例如,假设我们想查看小费金额在一周中的不同日期有何不同 -DataFrameGroupBy.agg()允许您将字典传递到分组的 DataFrame,指示哪些函数应用于特定列。 SELECT day, AVG(tip), COUNT(*) FROM tips GROUP BY day; /* Fri 2.734737 19 Sat 2.993103 87 Sun 3.255132 76 Thu 2.771452 62 */ In [22]: tips.groupby("day").agg({"tip": "mean", "day": "size"}) Out[22]: tip day day Fri 2.734737 19 Sat 2.993103 87 Sun 3.255132 76 Thur 2.771452 62 groupby()通过将列列表传递给该方法来完成按多个列进行分组 。 SELECT smoker, day, COUNT(*), AVG(tip) FROM tips GROUP BY smoker, day; /* smoker day No Fri 4 2.812500 Sat 45 3.102889 Sun 57 3.167895 Thu 45 2.673778 Yes Fri 15 2.714000 Sat 42 2.875476 Sun 19 3.516842 Thu 17 3.030000 */ In [23]: tips.groupby(["smoker", "day"]).agg({"tip": ["size", "mean"]}) Out[23]: tip size mean smoker day No Fri 4 2.812500 Sat 45 3.102889 Sun 57 3.167895 Thur 45 2.673778 Yes Fri 15 2.714000 Sat 42 2.875476 Sun 19 3.516842 Thur 17 3.030000 加入# JOINs 可以用join()或 来执行merge()。默认情况下,join()将在其索引上加入 DataFrame。每个方法都有参数,允许您指定要执行的联接类型 ( LEFT、RIGHT、INNER、 FULL) 或要联接的列(列名称或索引)。 警告 如果两个键列都包含键为空值的行,则这些行将相互匹配。这与通常的 SQL 连接行为不同,可能会导致意外结果。 In [24]: df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)}) In [25]: df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)}) 假设我们有两个与 DataFrame 具有相同名称和结构的数据库表。 现在让我们回顾一下 s 的各种类型JOIN。 内部联接# SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key; # merge performs an INNER JOIN by default In [26]: pd.merge(df1, df2, on="key") Out[26]: key value_x value_y 0 B -0.282863 1.212112 1 D -1.135632 -0.173215 2 D -1.135632 0.119209 merge()当您想要将一个 DataFrame 的列与另一个 DataFrame 的索引连接起来时,还提供参数。 In [27]: indexed_df2 = df2.set_index("key") In [28]: pd.merge(df1, indexed_df2, left_on="key", right_index=True) Out[28]: key value_x value_y 1 B -0.282863 1.212112 3 D -1.135632 -0.173215 3 D -1.135632 0.119209 左外连接# 显示来自 的所有记录df1。 SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.key = df2.key; In [29]: pd.merge(df1, df2, on="key", how="left") Out[29]: key value_x value_y 0 A 0.469112 NaN 1 B -0.282863 1.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.135632 0.119209 右加入# 显示来自 的所有记录df2。 SELECT * FROM df1 RIGHT OUTER JOIN df2 ON df1.key = df2.key; In [30]: pd.merge(df1, df2, on="key", how="right") Out[30]: key value_x value_y 0 B -0.282863 1.212112 1 D -1.135632 -0.173215 2 D -1.135632 0.119209 3 E NaN -1.044236 完全加入# pandas 还允许s,它显示数据集的两侧,无论连接的列是否找到匹配项。截至撰写本文时,并非所有 RDBMS (MySQL) 都支持 s。FULL JOINFULL JOIN 显示两个表中的所有记录。 SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key; In [31]: pd.merge(df1, df2, on="key", how="outer") Out[31]: key value_x value_y 0 A 0.469112 NaN 1 B -0.282863 1.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.135632 0.119209 5 E NaN -1.044236 联盟# UNION ALL可以使用 执行concat()。 In [32]: df1 = pd.DataFrame( ....: {"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)} ....: ) ....: In [33]: df2 = pd.DataFrame( ....: {"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]} ....: ) ....: SELECT city, rank FROM df1 UNION ALL SELECT city, rank FROM df2; /* city rank Chicago 1 San Francisco 2 New York City 3 Chicago 1 Boston 4 Los Angeles 5 */ In [34]: pd.concat([df1, df2]) Out[34]: city rank 0 Chicago 1 1 San Francisco 2 2 New York City 3 0 Chicago 1 1 Boston 4 2 Los Angeles 5 SQLUNION类似于,但会删除重复的行。UNION ALLUNION SELECT city, rank FROM df1 UNION SELECT city, rank FROM df2; -- notice that there is only one Chicago record this time /* city rank Chicago 1 San Francisco 2 New York City 3 Boston 4 Los Angeles 5 */ 在 pandas 中,您可以concat()与 结合 使用drop_duplicates()。 In [35]: pd.concat([df1, df2]).drop_duplicates() Out[35]: city rank 0 Chicago 1 1 San Francisco 2 2 New York City 3 1 Boston 4 2 Los Angeles 5 限制# SELECT * FROM tips LIMIT 10; In [36]: tips.head(10) Out[36]: total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4 5 25.29 4.71 Male No Sun Dinner 4 6 8.77 2.00 Male No Sun Dinner 2 7 26.88 3.12 Male No Sun Dinner 4 8 15.04 1.96 Male No Sun Dinner 2 9 14.78 3.23 Male No Sun Dinner 2 pandas 等价于一些 SQL 分析和聚合函数# 带偏移量的前 n 行# -- MySQL SELECT * FROM tips ORDER BY tip DESC LIMIT 10 OFFSET 5; In [37]: tips.nlargest(10 + 5, columns="tip").tail(10) Out[37]: total_bill tip sex smoker day time size 183 23.17 6.50 Male Yes Sun Dinner 4 214 28.17 6.50 Female Yes Sat Dinner 3 47 32.40 6.00 Male No Sun Dinner 4 239 29.03 5.92 Male No Sat Dinner 3 88 24.71 5.85 Male No Thur Lunch 2 181 23.33 5.65 Male Yes Sun Dinner 2 44 30.40 5.60 Male No Sun Dinner 4 52 34.81 5.20 Female No Sun Dinner 4 85 34.83 5.17 Female No Thur Lunch 4 211 25.89 5.16 Male Yes Sat Dinner 4 每组前 n 行# -- Oracle's ROW_NUMBER() analytic function SELECT * FROM ( SELECT t.*, ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn FROM tips t ) WHERE rn < 3 ORDER BY day, rn; In [38]: ( ....: tips.assign( ....: rn=tips.sort_values(["total_bill"], ascending=False) ....: .groupby(["day"]) ....: .cumcount() ....: + 1 ....: ) ....: .query("rn < 3") ....: .sort_values(["day", "rn"]) ....: ) ....: Out[38]: total_bill tip sex smoker day time size rn 95 40.17 4.73 Male Yes Fri Dinner 4 1 90 28.97 3.00 Male Yes Fri Dinner 2 2 170 50.81 10.00 Male Yes Sat Dinner 3 1 212 48.33 9.00 Male No Sat Dinner 4 2 156 48.17 5.00 Male No Sun Dinner 6 1 182 45.35 3.50 Male Yes Sun Dinner 3 2 197 43.11 5.00 Female Yes Thur Lunch 4 1 142 41.19 5.00 Male No Thur Lunch 5 2 相同的使用rank(method='first')功能 In [39]: ( ....: tips.assign( ....: rnk=tips.groupby(["day"])["total_bill"].rank( ....: method="first", ascending=False ....: ) ....: ) ....: .query("rnk < 3") ....: .sort_values(["day", "rnk"]) ....: ) ....: Out[39]: total_bill tip sex smoker day time size rnk 95 40.17 4.73 Male Yes Fri Dinner 4 1.0 90 28.97 3.00 Male Yes Fri Dinner 2 2.0 170 50.81 10.00 Male Yes Sat Dinner 3 1.0 212 48.33 9.00 Male No Sat Dinner 4 2.0 156 48.17 5.00 Male No Sun Dinner 6 1.0 182 45.35 3.50 Male Yes Sun Dinner 3 2.0 197 43.11 5.00 Female Yes Thur Lunch 4 1.0 142 41.19 5.00 Male No Thur Lunch 5 2.0 -- Oracle's RANK() analytic function SELECT * FROM ( SELECT t.*, RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk FROM tips t WHERE tip < 2 ) WHERE rnk < 3 ORDER BY sex, rnk; 让我们查找每个性别组的 (tips < 2) 的 (rank < 3) 提示。请注意,使用rank(method='min')函数 时rnk_min保持相同tip (如 Oracle 的RANK()函数) In [40]: ( ....: tips[tips["tip"] < 2] ....: .assign(rnk_min=tips.groupby(["sex"])["tip"].rank(method="min")) ....: .query("rnk_min < 3") ....: .sort_values(["sex", "rnk_min"]) ....: ) ....: Out[40]: total_bill tip sex smoker day time size rnk_min 67 3.07 1.00 Female Yes Sat Dinner 1 1.0 92 5.75 1.00 Female Yes Fri Dinner 2 1.0 111 7.25 1.00 Female No Sat Dinner 1 1.0 236 12.60 1.00 Male Yes Sat Dinner 2 1.0 237 32.83 1.17 Male Yes Sat Dinner 2 2.0 更新# UPDATE tips SET tip = tip*2 WHERE tip < 2; In [41]: tips.loc[tips["tip"] < 2, "tip"] *= 2 删除# DELETE FROM tips WHERE tip > 9; 在 pandas 中,我们选择应保留的行而不是删除它们: In [42]: tips = tips.loc[tips["tip"] <= 9]