Merge, join, and concatenate in Pandas

Merge, Join, Concat in Pandas


First thing, go skim the official Pandas docs. It is pretty straightforward instructions on how to perform merge, join, concat using pandas. It has everything you need to know, but it is written very dryly and covers a lot of cases I rarely come across in my day to day, so I'll try to summarize quickly.

I'll use their DataFrame in my notes below.

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])
     

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[4, 5, 6, 7])
        

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                    index=[8, 9, 10, 11])
        
        
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                    index=[2, 3, 6, 7])

frames = [df1, df2, df3]

result = pd.concat(frames)

Below I'll talk a little about how and when to use each.

One of the tricky things for me when starting with pandas merge, join, concat was "when do I use each of these?". It seemed like you could do the same thing multiple ways. And you can. So whichever way makes most sense for you roll with that.

CONCAT

The default behavior you can think of as a union.

If you want to keep track of which DataFrame (or series) was which before you can add a 'keys' argument. This will add a second index to your new DataFrame. I won't get into multi-indexing here.

You can create either an inner join or outer join based on the INDEX using concat with the parameter axis=1. However note two things:

  1. If you specify axis=1 to do an inner join it will always use the index. You cannot specify a column to join on.
  2. No left or right join with concat

If you want to specify what columns to join or make a left or right join, use merge.

One more wrinkle is you can perform the equivalent of
>> pd.concat([df1, df2, df3]) 

using the append() method. The equivalent is
>> df1.append([df2,df3])

Its quick and easy but you don't have as much functionality.

MERGE

"pandas provides a single function, merge(), as the entry point for all standard database join operations between DataFrame objects":

Here's a quick example:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})


right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})


result = pd.merge(left, right, on='key')

Result is the inner join.

Here's something more typical you might see:

case_df = pd.DataFrame({'case': ['K0', 'K1', 'K4', 'K7'],
                        'A': ['foo', 'bar', 'boo', 'hoo'],
                        'B': ['B0', 'B1', 'B2', 'B3']})


name_df = pd.DataFrame({'name': ['K0', 'K1', 'K2', 'K3'],
                        'C': ['true', 'false', 'true', 'false'],
                        'D': [1, 5, 4, 7],
                        'A': ['NULL', 'NULL', 'NULL', 'NULL'],})


result = pd.merge(case_df, name_df, how='left',left_on='case',  
                  right_on='name',suffixes=('_case','_name'))

Here we are taking two data frames, telling it to perform a left join, specifying which column in the second DataFrame to use as a left join on the first DataFrame. There are overlapping column names so we specify the suffixes.

pd.merge parameters

  • left=left_dataframe, 
  • right=right_dataframe, 
  • how='inner' or 'outter' or 'left' or 'right.  Defaults to inner.
  • on=None if joining on index, otherwise specify column IF the column is in both dataframes, 
  • left_on=None by default. If you want to match columns of different names in the two dataframes you need to use left_on and right_on to specify column names. left_on is the column in the 'left=' dataframe
  • right_on=None by default. right_on is the column in the 'right=' dataframe,
  • left_index=False by default. can also join the right dataframe column (or index) to the left dataframe's index by specifying True, 
  • right_index=False by default. can also join the left dataframe column (or index) to the right dataframe's index by specifying True,
  • sort=True by default. If you are using really large dataframes I reccomend setting this to False to improve performance substantially in many cases
  • suffixes=('_x', '_y'),  A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x', '_y') 
  • copy=True, don't worry about this
  • indicator=False, don't worry about this
  • validate=None, don't worry about this

Another example from the docs below:

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})


right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, how='right', on=['key1',          
         'key2'],indicator=True)

pd.merge(left, right, how='right', on=['key1', 'key2'])

JOIN

Well what about join you ask? If we can do a union using merge and any type of join with merge, what else is there? Well, there's join() :)
From the docs "The related join() method, uses merge internally for the index-on-index (by default) and column(s)-on-index join."

If you are joining on index only, you may wish to use DataFrame.join to save yourself some typing.

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                     index=['K0', 'K1', 'K2'])


right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])


result = left.join(right)

Succinct! And thats a quick hit on merge, join, concat in Pandas!

Comments

Popular posts from this blog

Upcasting in python

Chart Types & Styles