DataFrame join operations in pandas, and the difference between merge and join

This article covers: DataFrame join operations in pandas, the difference between merge and join, and an explanation of the error ValueError: columns overlap but no suffix specified: Index(['dishes_id'], dtype='object').

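Preface:
For ease of maintenance, companies usually store their data across several database tables, for example one table for users' basic information and another for their purchase records.
As a result, everyday data processing often requires combining two tables. In SQL this is done with join; in pandas it is done with merge.
Because merge combines two tables, each user's row in one table has to be matched with the same user's row in the other.
The two tables therefore need a shared key that identifies the user, which is the column named by the on parameter.
In short, merge is a process of matching rows by key. It supports four join types: 'inner', 'left', 'right', and 'outer'.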
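
To make the four join types concrete, here is a minimal sketch. The `users` and `orders` tables and their values are hypothetical and are not the data used later in this article; only `pd.merge` and its `on` and `how` parameters come from the text above.

```python
import pandas as pd

# Hypothetical sample tables; names and values are made up for illustration.
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ann", "Bob", "Cid"]})
orders = pd.DataFrame({"user_id": [2, 3, 4], "amount": [20, 30, 40]})

# inner: only keys present in both tables (2, 3)
inner = pd.merge(users, orders, on="user_id", how="inner")
# left: every row of the left table (1, 2, 3); unmatched amounts become NaN
left = pd.merge(users, orders, on="user_id", how="left")
# right: every row of the right table (2, 3, 4); unmatched names become NaN
right = pd.merge(users, orders, on="user_id", how="right")
# outer: the union of keys from both tables (1, 2, 3, 4)
outer = pd.merge(users, orders, on="user_id", how="outer")

for label, df in [("inner", inner), ("left", left), ("right", right), ("outer", outer)]:
    print(label, sorted(df["user_id"].tolist()))
```

The choice of `how` only decides which keys survive; columns from the side with no matching row are filled with NaN.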

For an overview of the table join types, see: https://zhuanlan.zhihu.com/p/102274476

merge parameters explained

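merge(
    left,            # left table
    right,           # right table
    how="inner",     # join type: inner, left, right, or outer; defaults to inner
    on=None,
    # on: name(s) of the column(s) to join on.
    # Specifies the key column(s) used for the join (outer, inner, left, or right).
    # Defaults to None, in which case merge() uses the columns whose names appear in both DataFrames.
    # The columns given in on must exist in both DataFrames, otherwise an error is raised.
    # on can also list several columns; rows match only when the values in all of them are equal.
    left_on,         # column name(s) in the left table to join on
    right_on,        # column name(s) in the right table to join on
    # With on, the named columns must exist in both DataFrames.
    # left_on and right_on let each DataFrame name its own key column, so the names do not have to match.
    # When left_on and right_on name the same column, the result is the same as using on.
    left_index,      # whether to use the left table's row index as the join key; defaults to False
    right_index,     # whether to use the right table's row index as the join key; defaults to False
    sort,            # whether to sort the result by the join keys; defaults to False
    copy,            # defaults to True (always copy the data); setting it to False can improve performance
    suffixes,        # suffixes appended to overlapping column names; defaults to ('_x', '_y')
    indicator,       # adds a column showing which table each row came from
    validate=None,
    # validate: checks the relationship between the join keys of the two DataFrames.
    # Accepted values are one_to_one, one_to_many, many_to_one, and many_to_many.
    # Defaults to None, in which case no check is performed.
)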
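
The parameters `left_on`/`right_on`, `suffixes`, `indicator`, and `validate` are easier to follow on a small example. The sketch below uses hypothetical `dishes` and `orders` tables (the column names `dish` and `price` are assumptions made for illustration) and relies only on documented `pd.merge` behaviour.

```python
import pandas as pd

# Hypothetical tables whose key columns have different names.
dishes = pd.DataFrame({"dishes_id": [1, 2, 3], "price": [10, 20, 30]})
orders = pd.DataFrame({"dish": [1, 1, 4], "price": [9, 11, 40]})

# left_on/right_on join columns with different names; both key columns are kept.
# suffixes renames the overlapping non-key column "price".
# indicator=True adds a "_merge" column with left_only, right_only, or both.
merged = pd.merge(
    dishes, orders,
    how="outer",
    left_on="dishes_id", right_on="dish",
    suffixes=("_menu", "_order"),
    indicator=True,
)
print(merged)

# validate raises pandas.errors.MergeError when the declared key relationship does not hold.
# Dish 1 appears twice in the right table, so a one_to_one check fails.
validation_error = None
try:
    pd.merge(dishes, orders, left_on="dishes_id", right_on="dish", validate="one_to_one")
except pd.errors.MergeError as exc:
    validation_error = exc
    print("validation failed:", exc)
```

Passing `validate` is a cheap way to catch unexpected duplicate keys before they silently multiply rows in the result.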

Create two DataFrames

import pandas as pd

# Load the two tables from CSV files in the current directory
dishes_info = pd.read_csv("./dishes_info.csv")
order_sample = pd.read_csv("./order_sample.csv")
print(dishes_info)
print(order_sample)


# Left join on the shared dishes_id column
data = pd.merge(dishes_info, order_sample, on="dishes_id", how="left")

# Left join on the row indexes of both tables
data = pd.merge(dishes_info, order_sample, how="left", left_index=True, right_index=True)

join(
    other,            # DataFrame, Series, or list of DataFrames to join with
    on=None,          # column(s) of the calling DataFrame to match against the index of other; similar to ON in SQL
    how="left",       # one of 'left', 'right', 'outer', 'inner'; defaults to 'left', similar to the SQL join types
    lsuffix="",       # suffix for overlapping columns from the left DataFrame
    rsuffix="",       # suffix for overlapping columns from the right DataFrame
    sort=False        # whether to sort the result by the join keys; defaults to False
)
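
A short sketch of how `on` works in join: the named column of the calling DataFrame is matched against the index of `other`. The `info` and `orders` tables and the `dishes_name` and `counts` columns are hypothetical; the equivalent `pd.merge` call mirrors the one join makes internally, as seen in the traceback later in this article.

```python
import pandas as pd

# Hypothetical tables: "info" is indexed by dish id, "orders" keeps the id as a column.
info = pd.DataFrame({"dishes_name": ["soup", "rice"]}, index=[1, 2])
orders = pd.DataFrame({"dishes_id": [2, 1, 2], "counts": [1, 3, 2]})

# join with on=...: the caller's column is matched against the other table's index.
via_join = orders.join(info, on="dishes_id")

# The same left join written with merge.
via_merge = pd.merge(orders, info, how="left", left_on="dishes_id", right_index=True)

print(via_join)
print(via_join.equals(via_merge))
```

When neither table has the key as its index, merge with `on` is usually the more direct choice; join is most convenient when the right-hand table is already indexed by the key.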

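data = dishes_info.join(order_sample)         # raises an error because of the duplicate column name dishes_id

# The error message is shown below
"""
ValueError                                Traceback (most recent call last)
<ipython-input-18-8bc025c8fee6> in <module>()
      1 # DataFrame.join() behaves like merge() with left_index & right_index set to True
      2 # DataFrame.join() is better suited to index-based merging; it can combine several DataFrames that share an index but have different columns.
----> 3 data = dishes_info.join(order_sample)         # raises an error because of the duplicate column name dishes_id
      4 # join aligns on the row index by default, so we set dishes_id as the row index of both tables before joining
      5 # dishes_info.set_index("dishes_id", inplace=True)     # does not modify the original by default; inplace=True is needed to keep the change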

D:\Destination\lib\site-packages\pandas\core\frame.py in join(self, other, on, how, lsuffix, rsuffix, sort)
   6334         # For SparseDataFrame's benefit
   6335         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 6336                                  rsuffix=rsuffix, sort=sort)
   6337 
   6338     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',

D:\Destination\lib\site-packages\pandas\core\frame.py in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   6349             return merge(self, other, left_on=on, how=how,
   6350                          left_index=on is None, right_index=True,
-> 6351                          suffixes=(lsuffix, rsuffix), sort=sort)
   6352         else:
   6353             if on is not None:

D:\Destination\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     60                          copy=copy, indicator=indicator,
     61                          validate=validate)
---> 62     return op.get_result()
     63 
     64 

D:\Destination\lib\site-packages\pandas\core\reshape\merge.py in get_result(self)
    572 
    573         llabels, rlabels = items_overlap_with_suffix(ldata.items, lsuf,
--> 574                                                      rdata.items, rsuf)
    575 
    576         lindexers = {1: left_indexer} if left_indexer is not None else {}

D:\Destination\lib\site-packages\pandas\core\internals.py in items_overlap_with_suffix(left, lsuffix, right, rsuffix)
   5242         if not lsuffix and not rsuffix:
   5243             raise ValueError('columns overlap but no suffix specified: '
-> 5244                              '{rename}'.format(rename=to_rename))
   5245 
   5246         def lrenamer(x):

ValueError: columns overlap but no suffix specified: Index(['dishes_id'], dtype='object')
"""

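# join aligns on the row index by default, so make dishes_id the row index of both tables first
dishes_info.set_index("dishes_id", inplace=True)     # set_index does not modify the original by default; inplace=True is needed to keep the change
order_sample.set_index("dishes_id", inplace=True)

data = dishes_info.join(order_sample)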
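
The error and its two possible fixes can be reproduced without the CSV files. The sketch below uses small hypothetical stand-ins for `dishes_info` and `order_sample`; only the `dishes_id` column name comes from this article, and the other columns and values are assumptions.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files.
dishes_info = pd.DataFrame({"dishes_id": [1, 2, 3], "dishes_name": ["soup", "rice", "tea"]})
order_sample = pd.DataFrame({"dishes_id": [2, 1], "counts": [1, 3]})

# 1) Both tables have a dishes_id column and no suffix is given, so join raises ValueError.
try:
    dishes_info.join(order_sample)
except ValueError as exc:
    print("join failed:", exc)

# 2) Suffixes avoid the error, but rows are still paired by the default row index,
#    not by dishes_id, so the two id columns can disagree.
by_position = dishes_info.join(order_sample, lsuffix="_info", rsuffix="_order")
print(by_position)

# 3) Using dishes_id as the index on both sides pairs rows by dish.
#    set_index is called without inplace here, so the original frames are unchanged.
by_key = dishes_info.set_index("dishes_id").join(order_sample.set_index("dishes_id"))
print(by_key.reset_index())
```

Option 2 only silences the error, while option 3 actually matches rows by dish, which is why the article sets the index before calling join.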
