Pandas Merging 101JOIN two dataframes on common column in pythonMapping columns from one dataframe to another to create a new columnMerge a list of dataframes to create one dataframepandas merge on columns with different names and avoid duplicatesPandas merge on index column?Merge multiple dataframes based on a common column“Merging” numpy arrays together with a common dimensionPandas join on columns with different namesMerge DataFrames with Matching Values From Two Different Columns - PandasPandas join/merge/concat two DataFrames and combine rows of identical key/indexHow to merge two dictionaries in a single expression?How to merge a specific commit in GitHow to join (merge) data frames (inner, outer, left, right)How do you merge two Git repositories?How to merge two arrays in JavaScript and de-duplicate itemsRenaming columns in pandasDelete column from pandas DataFrameHow to iterate over rows in a DataFrame in Pandas?Select rows from a DataFrame based on values in a column in pandasAggregation in pandas

How to disambiguate between various meditation practices?

If the UK government illegally doesn't ask for article 50 extension, can parliament do it instead?

Punishment in pacifist society

Is mathematics truth?

Why can't I summon my friend to help me?

What is "latex-dev"?

Given a specific computer system, is it possible to estimate the actual precise run time of a piece of Assembly code

Should one invest in a professional editor before querying?

Tiny image scraper for xkcd.com

Why is k-means used for non normally distributed data?

Why are Latin and Sanskrit called dead languages?

Why wasn't Linda Hamilton in T3?

Remove ads in Viber for PC

What happens if you just start drawing from the Deck of Many Things without declaring any number of cards?

Why do old games use flashing as means of showing damage?

Why KVM VPS is slower then OPENVZ

Can a Centaur utilize a mount?

What is the most likely cause of short, quick, and useless reviews?

What is the definition of Product

How do you manage to study and have a balance in your life at the same time?

Killing task by name - start menu shortcut

FHE: What is the difference between multiplicative depth and multiplicative level?

Playing boules... IN SPACE!

How to run a command 1 out of N times in Bash



Pandas Merging 101


JOIN two dataframes on common column in pythonMapping columns from one dataframe to another to create a new columnMerge a list of dataframes to create one dataframepandas merge on columns with different names and avoid duplicatesPandas merge on index column?Merge multiple dataframes based on a common column“Merging” numpy arrays together with a common dimensionPandas join on columns with different namesMerge DataFrames with Matching Values From Two Different Columns - PandasPandas join/merge/concat two DataFrames and combine rows of identical key/indexHow to merge two dictionaries in a single expression?How to merge a specific commit in GitHow to join (merge) data frames (inner, outer, left, right)How do you merge two Git repositories?How to merge two arrays in JavaScript and de-duplicate itemsRenaming columns in pandasDelete column from pandas DataFrameHow to iterate over rows in a DataFrame in Pandas?Select rows from a DataFrame based on values in a column in pandasAggregation in pandas






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;








226















  • How to perform a (LEFT|RIGHT|FULL) (INNER|OUTER) join with pandas?

  • How do I add NaNs for missing rows after merge?

  • How do I get rid of NaNs after merging?

  • Can I merge on the index?

  • Cross join with pandas?

  • How do I merge multiple DataFrames?


  • merge? join? concat? update? Who? What? Why?!

... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.



This QnA is meant to be the next installment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).



Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.










share|improve this question


























  • Whats mean 101 in the title? :)

    – Mikhail_Sam
    Aug 8 at 9:18






  • 1





    @Mikhail_Sam check this out: en.m.wikipedia.org/wiki/101_(topic)... "the number 101 is often used for an introductory course at a beginner's level in a department's subject area."

    – cs95
    Aug 8 at 14:43


















226















  • How to perform a (LEFT|RIGHT|FULL) (INNER|OUTER) join with pandas?

  • How do I add NaNs for missing rows after merge?

  • How do I get rid of NaNs after merging?

  • Can I merge on the index?

  • Cross join with pandas?

  • How do I merge multiple DataFrames?


  • merge? join? concat? update? Who? What? Why?!

... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.



This QnA is meant to be the next installment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).



Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.










share|improve this question


























  • Whats mean 101 in the title? :)

    – Mikhail_Sam
    Aug 8 at 9:18






  • 1





    @Mikhail_Sam check this out: en.m.wikipedia.org/wiki/101_(topic)... "the number 101 is often used for an introductory course at a beginner's level in a department's subject area."

    – cs95
    Aug 8 at 14:43














226












226








226


185






  • How to perform a (LEFT|RIGHT|FULL) (INNER|OUTER) join with pandas?

  • How do I add NaNs for missing rows after merge?

  • How do I get rid of NaNs after merging?

  • Can I merge on the index?

  • Cross join with pandas?

  • How do I merge multiple DataFrames?


  • merge? join? concat? update? Who? What? Why?!

... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.



This QnA is meant to be the next installment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).



Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.










share|improve this question
















  • How to perform a (LEFT|RIGHT|FULL) (INNER|OUTER) join with pandas?

  • How do I add NaNs for missing rows after merge?

  • How do I get rid of NaNs after merging?

  • Can I merge on the index?

  • Cross join with pandas?

  • How do I merge multiple DataFrames?


  • merge? join? concat? update? Who? What? Why?!

... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.



This QnA is meant to be the next installment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).



Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.







python pandas join merge






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited May 9 at 12:10









JohnE

16.3k7 gold badges42 silver badges67 bronze badges




16.3k7 gold badges42 silver badges67 bronze badges










asked Dec 6 '18 at 6:41









cs95cs95

165k32 gold badges236 silver badges303 bronze badges




165k32 gold badges236 silver badges303 bronze badges















  • Whats mean 101 in the title? :)

    – Mikhail_Sam
    Aug 8 at 9:18






  • 1





    @Mikhail_Sam check this out: en.m.wikipedia.org/wiki/101_(topic)... "the number 101 is often used for an introductory course at a beginner's level in a department's subject area."

    – cs95
    Aug 8 at 14:43


















  • Whats mean 101 in the title? :)

    – Mikhail_Sam
    Aug 8 at 9:18






  • 1





    @Mikhail_Sam check this out: en.m.wikipedia.org/wiki/101_(topic)... "the number 101 is often used for an introductory course at a beginner's level in a department's subject area."

    – cs95
    Aug 8 at 14:43

















Whats mean 101 in the title? :)

– Mikhail_Sam
Aug 8 at 9:18





Whats mean 101 in the title? :)

– Mikhail_Sam
Aug 8 at 9:18




1




1





@Mikhail_Sam check this out: en.m.wikipedia.org/wiki/101_(topic)... "the number 101 is often used for an introductory course at a beginner's level in a department's subject area."

– cs95
Aug 8 at 14:43






@Mikhail_Sam check this out: en.m.wikipedia.org/wiki/101_(topic)... "the number 101 is often used for an introductory course at a beginner's level in a department's subject area."

– cs95
Aug 8 at 14:43













2 Answers
2






active

oldest

votes


















300







+500








This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.



In particular, here's what this post will go through:




  • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)



    • merging with different column names

    • avoiding duplicate merge key column in output


  • Merging with index under different conditions

    • effectively using your named index

    • merge key as the index of one and column of another


  • Multiway merges on columns and indexes (unique and non-unique)

  • Notable alternatives to merge and join

What this post will not go through:



  • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.

  • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!


Note

Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.



Furthermore, all the DataFrames here can be copied and replicated so
you can play with them. Also, see this
post
on how to read DataFrames from your clipboard.



Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.




Enough Talk, just show me how to use merge!



Setup



np.random.seed(0)
left = pd.DataFrame('key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4))
right = pd.DataFrame('key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4))

left

key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893

right

key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357


For the sake of simplicity, the key column has the same name (for now).



An INNER JOIN is represented by






Note

This, along with the forthcoming figures all follow this convention:




  • blue indicates rows that are present in the merge result


  • red indicates rows that are excluded from the result (i.e., removed)


  • green indicates missing values that are replaced with NaNs in the result



To perform an INNER JOIN, call pd.merge specifying the left DataFrame, the right DataFrame, and the join key.



pd.merge(left, right, on='key')

key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278


This returns only rows from left and right which share a common key (in this example, "B" and "D).



In more recent versions of pandas (v0.21 or so), merge is now a first order function, so you can call DataFrame.merge.



left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')

key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278


A LEFT OUTER JOIN, or LEFT JOIN is represented by





This can be performed by specifying how='left'.



left.merge(right, on='key', how='left')

key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278


Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.



And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...





...specify how='right':



left.merge(right, on='key', how='right')

key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357


Here, keys from right are used, and missing data from left is replaced by NaN.



Finally, for the FULL OUTER JOIN, given by





specify how='outer'.



left.merge(right, on='key', how='outer')

key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357


This uses the keys from both frames, and NaNs are inserted for missing rows in both.



The documentation summarises these various merges nicely:



enter image description here



Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs



If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.



For LEFT-Excluding JOIN, represented as





Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from left only,



(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1))

key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN


Where,



left.merge(right, on='key', how='left', indicator=True)

key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both


And similarly, for a RIGHT-Excluding JOIN,





(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1))

key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357


Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),





You can do this in similar fashion—



(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1))

key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357


Different names for key columns



If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:



left2 = left.rename('key':'keyLeft', axis=1)
right2 = right.rename('key':'keyRight', axis=1)

left2

keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893

right2

keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357




left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278


Avoiding duplicate key column in output



When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.



left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')

value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278


Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.



Merging only a single column from one of the DataFrames



For example, consider



right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3


If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:



left.merge(right3[['key', 'newcol']], on='key')

key value newcol
0 B 0.400157 0
1 D 2.240893 1


If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:



# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0


As mentioned, this is similar to, but faster than



left.merge(right3[['key', 'newcol']], on='key', how='left')

key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0


Merging on multiple columns



To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).



left.merge(right, on=['key1', 'key2'] ...)


Or, in the event the names are different,



left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])


Other useful merge* operations and functions




  • Merging a DataFrame with Series on index: See this answer.

  • Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.


  • pd.merge_ordered is a useful function for ordered JOINs.


  • pd.merge_asof (read: merge_asOf) is useful for approximate joins.


This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.




Index-based *-JOIN (+ index-column merges)



Setup



np.random.seed([3, 14])
left = pd.DataFrame('value': np.random.randn(4), index=['A', 'B', 'C', 'D'])
right = pd.DataFrame('value': np.random.randn(4), index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'

left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349

right

value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076


Typically, a merge on index would look like this:



left.merge(right, left_index=True, right_index=True)


value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135


Support for index names



If your index is named, then v0.23 users can also specify the level name to on (or left_on and right_on as necessary).



left.merge(right, on='idxkey')

value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135


Merging on index of one, column(s) of another



It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,



left.merge(right, left_on='key1', right_index=True)


Or vice versa (right_on=... and left_index=True).



right2 = right.reset_index().rename('idxkey' : 'colkey', axis=1)
right2

colkey value
0 B 0.543843
1 D 0.013135
2 E -0.326498
3 F 1.385076

left.merge(right2, left_index=True, right_on='colkey')

value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135


In this special case, the index for left is named, so you can also use the index name with left_on, like this:



left.merge(right2, left_on='idxkey', right_on='colkey')

value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135


DataFrame.join

Besides these, there is another succinct option. You can use DataFrame.join which defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.



left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135


Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:



left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')


Since the column names are the same. This would not be a problem if they were differently named.



left.rename(columns='value':'leftvalue').join(right, how='inner')

leftvalue value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135


pd.concat

Lastly, as an alternative for index-based joins, you can use pd.concat:



pd.concat([left, right], axis=1, sort=False, join='inner')

value value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135


Omit join='inner' if you need a FULL OUTER JOIN (the default):



pd.concat([left, right], axis=1, sort=False)

value value
A -0.602923 NaN
B -0.402655 0.543843
C 0.302329 NaN
D -0.524349 0.013135
E NaN -0.326498
F NaN 1.385076


For more information, see this canonical post on pd.concat by @piRSquared.




Generalizing: mergeing multiple DataFrames



Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:



df1.merge(df2, ...).merge(df3, ...)


However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.



Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.



# Setup.
np.random.seed(0)
A = pd.DataFrame('key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4))
B = pd.DataFrame('key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4))
C = pd.DataFrame('key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4))
dfs = [A, B, C]

# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')

dfs2 = [A2, B2, C2]


Multiway merge on unique keys (or index)



If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.



# merge on `key` column, you'll need to set the index before concatenating
pd.concat([
df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()

key valueA valueB valueC
0 D 2.240893 -0.977278 1.0

# merge on `key` index
pd.concat(dfs2, axis=1, sort=False, join='inner')

valueA valueB valueC
key
D 2.240893 -0.977278 1.0


Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).



Multiway merge on keys with duplicates



concat is fast, but has its shortcomings. It cannot handle duplicates.



A3 = pd.DataFrame('key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5))




pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)


In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).



# join on `key` column, set as the index first
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join(
[df.set_index('key') for df in (B, C)], how='inner').reset_index()

key valueA valueB valueC
0 D 2.240893 -0.977278 1.0

# join on `key` index
A3.set_index('key').join([B2, C2], how='inner')

valueA valueB valueC
key
D 1.454274 -0.977278 1.0
D 0.761038 -0.977278 1.0





share|improve this answer


































    17















    A supplemental visual view of pd.concat([df0, df1], kwargs).
    Notice that, kwarg axis=0 or axis=1 's meaning is not as intuitive as df.mean() or df.apply(func)



    on pd.concat([df0, df1])






    share|improve this answer






















    • 4





      This is a nice diagram. May I ask how you produced it?

      – cs95
      May 20 at 17:27






    • 4





      google doc's built-in "insert ==> drawing... ==> new" (as of 2019-May). But, to be clear: the only reason I used google doc for this picture is because my notes is stored in google doc, and I would like a picture that can be modified quickly within google doc itself. Actually now you mentioned it, the google doc's drawing tool is pretty neat.

      – eliu
      May 21 at 18:19











    • Wow, this is great. Coming from the SQL world, "vertical" join is not a join in my head, as the table's structure is always fixed. Now even think pandas should consolidate concat and merge with a direction parameter being horizontal or vertical.

      – Ufos
      Aug 1 at 11:16











    • @Ufos Isn't that exactly what axis=1 and axis=0 is?

      – cs95
      Aug 6 at 17:31












    • yes, there're now merge and concat and axis and whatever. However, as @eliu shows, it's all just the same concept of merge with "left" and "right" and "horizontal" or "vertical". I, personally, have to look into the documentation every time I have to remember which "axis" is 0 and which is 1.

      – Ufos
      Aug 19 at 9:32













    Your Answer






    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "1"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53645882%2fpandas-merging-101%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    300







    +500








    This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.



    In particular, here's what this post will go through:




    • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)



      • merging with different column names

      • avoiding duplicate merge key column in output


    • Merging with index under different conditions

      • effectively using your named index

      • merge key as the index of one and column of another


    • Multiway merges on columns and indexes (unique and non-unique)

    • Notable alternatives to merge and join

    What this post will not go through:



    • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.

    • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!


    Note

    Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.



    Furthermore, all the DataFrames here can be copied and replicated so
    you can play with them. Also, see this
    post
    on how to read DataFrames from your clipboard.



    Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.




    Enough Talk, just show me how to use merge!



    Setup



    np.random.seed(0)
    left = pd.DataFrame('key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4))
    right = pd.DataFrame('key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4))

    left

    key value
    0 A 1.764052
    1 B 0.400157
    2 C 0.978738
    3 D 2.240893

    right

    key value
    0 B 1.867558
    1 D -0.977278
    2 E 0.950088
    3 F -0.151357


    For the sake of simplicity, the key column has the same name (for now).



    An INNER JOIN is represented by






    Note

    This, along with the forthcoming figures all follow this convention:




    • blue indicates rows that are present in the merge result


    • red indicates rows that are excluded from the result (i.e., removed)


    • green indicates missing values that are replaced with NaNs in the result



    To perform an INNER JOIN, call pd.merge specifying the left DataFrame, the right DataFrame, and the join key.



    pd.merge(left, right, on='key')

    key value_x value_y
    0 B 0.400157 1.867558
    1 D 2.240893 -0.977278


    This returns only rows from left and right which share a common key (in this example, "B" and "D).



    In more recent versions of pandas (v0.21 or so), merge is now a first order function, so you can call DataFrame.merge.



    left.merge(right, on='key')
    # Or, if you want to be explicit
    # left.merge(right, on='key', how='inner')

    key value_x value_y
    0 B 0.400157 1.867558
    1 D 2.240893 -0.977278


    A LEFT OUTER JOIN, or LEFT JOIN is represented by





    This can be performed by specifying how='left'.



    left.merge(right, on='key', how='left')

    key value_x value_y
    0 A 1.764052 NaN
    1 B 0.400157 1.867558
    2 C 0.978738 NaN
    3 D 2.240893 -0.977278


    Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.



    And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...





    ...specify how='right':



    left.merge(right, on='key', how='right')

    key value_x value_y
    0 B 0.400157 1.867558
    1 D 2.240893 -0.977278
    2 E NaN 0.950088
    3 F NaN -0.151357


    Here, keys from right are used, and missing data from left is replaced by NaN.



    Finally, for the FULL OUTER JOIN, given by





    specify how='outer'.



    left.merge(right, on='key', how='outer')

    key value_x value_y
    0 A 1.764052 NaN
    1 B 0.400157 1.867558
    2 C 0.978738 NaN
    3 D 2.240893 -0.977278
    4 E NaN 0.950088
    5 F NaN -0.151357


    This uses the keys from both frames, and NaNs are inserted for missing rows in both.



    The documentation summarises these various merges nicely:



    enter image description here



    Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs



    If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.



    For LEFT-Excluding JOIN, represented as





    Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from left only,



    (left.merge(right, on='key', how='left', indicator=True)
    .query('_merge == "left_only"')
    .drop('_merge', 1))

    key value_x value_y
    0 A 1.764052 NaN
    2 C 0.978738 NaN


    Where,



    left.merge(right, on='key', how='left', indicator=True)

    key value_x value_y _merge
    0 A 1.764052 NaN left_only
    1 B 0.400157 1.867558 both
    2 C 0.978738 NaN left_only
    3 D 2.240893 -0.977278 both


    And similarly, for a RIGHT-Excluding JOIN,





    (left.merge(right, on='key', how='right', indicator=True)
    .query('_merge == "right_only"')
    .drop('_merge', 1))

    key value_x value_y
    2 E NaN 0.950088
    3 F NaN -0.151357


    Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),





    You can do this in similar fashion—



    (left.merge(right, on='key', how='outer', indicator=True)
    .query('_merge != "both"')
    .drop('_merge', 1))

    key value_x value_y
    0 A 1.764052 NaN
    2 C 0.978738 NaN
    4 E NaN 0.950088
    5 F NaN -0.151357


    Different names for key columns



    If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:



    left2 = left.rename('key':'keyLeft', axis=1)
    right2 = right.rename('key':'keyRight', axis=1)

    left2

    keyLeft value
    0 A 1.764052
    1 B 0.400157
    2 C 0.978738
    3 D 2.240893

    right2

    keyRight value
    0 B 1.867558
    1 D -0.977278
    2 E 0.950088
    3 F -0.151357




    left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

    keyLeft value_x keyRight value_y
    0 B 0.400157 B 1.867558
    1 D 2.240893 D -0.977278


    Avoiding duplicate key column in output



    When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.



    left3 = left2.set_index('keyLeft')
    left3.merge(right2, left_index=True, right_on='keyRight')

    value_x keyRight value_y
    0 0.400157 B 1.867558
    1 2.240893 D -0.977278


    Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.



    Merging only a single column from one of the DataFrames



    For example, consider



    right3 = right.assign(newcol=np.arange(len(right)))
    right3
    key value newcol
    0 B 1.867558 0
    1 D -0.977278 1
    2 E 0.950088 2
    3 F -0.151357 3


    If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:



    left.merge(right3[['key', 'newcol']], on='key')

    key value newcol
    0 B 0.400157 0
    1 D 2.240893 1


    If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:



    # left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
    left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

    key value newcol
    0 A 1.764052 NaN
    1 B 0.400157 0.0
    2 C 0.978738 NaN
    3 D 2.240893 1.0


    As mentioned, this is similar to, but faster than



    left.merge(right3[['key', 'newcol']], on='key', how='left')

    key value newcol
    0 A 1.764052 NaN
    1 B 0.400157 0.0
    2 C 0.978738 NaN
    3 D 2.240893 1.0


    Merging on multiple columns



    To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).



    left.merge(right, on=['key1', 'key2'] ...)


    Or, in the event the names are different,



    left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])


    Other useful merge* operations and functions




    • Merging a DataFrame with Series on index: See this answer.

    • Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.


    • pd.merge_ordered is a useful function for ordered JOINs.


    • pd.merge_asof (read: merge_asOf) is useful for approximate joins.


    This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.




    Index-based *-JOIN (+ index-column merges)



    Setup



    np.random.seed([3, 14])
    left = pd.DataFrame('value': np.random.randn(4), index=['A', 'B', 'C', 'D'])
    right = pd.DataFrame('value': np.random.randn(4), index=['B', 'D', 'E', 'F'])
    left.index.name = right.index.name = 'idxkey'

    left
    value
    idxkey
    A -0.602923
    B -0.402655
    C 0.302329
    D -0.524349

    right

    value
    idxkey
    B 0.543843
    D 0.013135
    E -0.326498
    F 1.385076


    Typically, a merge on index would look like this:



    left.merge(right, left_index=True, right_index=True)


    value_x value_y
    idxkey
    B -0.402655 0.543843
    D -0.524349 0.013135


    Support for index names



    If your index is named, then v0.23 users can also specify the level name to on (or left_on and right_on as necessary).



    left.merge(right, on='idxkey')

    value_x value_y
    idxkey
    B -0.402655 0.543843
    D -0.524349 0.013135


    Merging on index of one, column(s) of another



    It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,



    left.merge(right, left_on='key1', right_index=True)


    Or vice versa (right_on=... and left_index=True).



    right2 = right.reset_index().rename('idxkey' : 'colkey', axis=1)
    right2

    colkey value
    0 B 0.543843
    1 D 0.013135
    2 E -0.326498
    3 F 1.385076

    left.merge(right2, left_index=True, right_on='colkey')

    value_x colkey value_y
    0 -0.402655 B 0.543843
    1 -0.524349 D 0.013135


    In this special case, the index for left is named, so you can also use the index name with left_on, like this:



    left.merge(right2, left_on='idxkey', right_on='colkey')

    value_x colkey value_y
    0 -0.402655 B 0.543843
    1 -0.524349 D 0.013135


    DataFrame.join

    Besides these, there is another succinct option. You can use DataFrame.join which defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.



    left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

    value_x value_y
    idxkey
    B -0.402655 0.543843
    D -0.524349 0.013135


    Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:



    left.join(right)
    ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')


    Since the column names are the same. This would not be a problem if they were differently named.



    left.rename(columns='value':'leftvalue').join(right, how='inner')

    leftvalue value
    idxkey
    B -0.402655 0.543843
    D -0.524349 0.013135


    pd.concat

    Lastly, as an alternative for index-based joins, you can use pd.concat:



    pd.concat([left, right], axis=1, sort=False, join='inner')

    value value
    idxkey
    B -0.402655 0.543843
    D -0.524349 0.013135


    Omit join='inner' if you need a FULL OUTER JOIN (the default):



    pd.concat([left, right], axis=1, sort=False)

    value value
    A -0.602923 NaN
    B -0.402655 0.543843
    C 0.302329 NaN
    D -0.524349 0.013135
    E NaN -0.326498
    F NaN 1.385076


    For more information, see this canonical post on pd.concat by @piRSquared.




    Generalizing: mergeing multiple DataFrames



    Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:



    df1.merge(df2, ...).merge(df3, ...)


    However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.



    Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.



    # Setup.
    np.random.seed(0)
    A = pd.DataFrame('key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4))
    B = pd.DataFrame('key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4))
    C = pd.DataFrame('key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4))
    dfs = [A, B, C]

    # Note, the "key" column values are unique, so the index is unique.
    A2 = A.set_index('key')
    B2 = B.set_index('key')
    C2 = C.set_index('key')

    dfs2 = [A2, B2, C2]


    Multiway merge on unique keys (or index)



    If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.



    # merge on `key` column, you'll need to set the index before concatenating
    pd.concat([
    df.set_index('key') for df in dfs], axis=1, join='inner'
    ).reset_index()

    key valueA valueB valueC
    0 D 2.240893 -0.977278 1.0

    # merge on `key` index
    pd.concat(dfs2, axis=1, sort=False, join='inner')

    valueA valueB valueC
    key
    D 2.240893 -0.977278 1.0


    Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).



    Multiway merge on keys with duplicates



    concat is fast, but has its shortcomings. It cannot handle duplicates.



    A3 = pd.DataFrame('key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5))




    pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
    ValueError: Shape of passed values is (3, 4), indices imply (3, 2)


    In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).



    # join on `key` column, set as the index first
    # For inner join. For left join, omit the "how" argument.
    A.set_index('key').join(
    [df.set_index('key') for df in (B, C)], how='inner').reset_index()

    key valueA valueB valueC
    0 D 2.240893 -0.977278 1.0

    # join on `key` index
    A3.set_index('key').join([B2, C2], how='inner')

    valueA valueB valueC
    key
    D 1.454274 -0.977278 1.0
    D 0.761038 -0.977278 1.0





    share|improve this answer































      300







      +500








      This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.



      In particular, here's what this post will go through:




      • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)



        • merging with different column names

        • avoiding duplicate merge key column in output


      • Merging with index under different conditions

        • effectively using your named index

        • merge key as the index of one and column of another


      • Multiway merges on columns and indexes (unique and non-unique)

      • Notable alternatives to merge and join

      What this post will not go through:



      • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.

      • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!


      Note

      Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.



      Furthermore, all the DataFrames here can be copied and replicated so
      you can play with them. Also, see this
      post
      on how to read DataFrames from your clipboard.



      Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.




      Enough Talk, just show me how to use merge!



      Setup



      np.random.seed(0)
      left = pd.DataFrame('key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4))
      right = pd.DataFrame('key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4))

      left

      key value
      0 A 1.764052
      1 B 0.400157
      2 C 0.978738
      3 D 2.240893

      right

      key value
      0 B 1.867558
      1 D -0.977278
      2 E 0.950088
      3 F -0.151357


      For the sake of simplicity, the key column has the same name (for now).



      An INNER JOIN is represented by






      Note

      This, along with the forthcoming figures all follow this convention:




      • blue indicates rows that are present in the merge result


      • red indicates rows that are excluded from the result (i.e., removed)


      • green indicates missing values that are replaced with NaNs in the result



      To perform an INNER JOIN, call pd.merge specifying the left DataFrame, the right DataFrame, and the join key.



      pd.merge(left, right, on='key')

      key value_x value_y
      0 B 0.400157 1.867558
      1 D 2.240893 -0.977278


      This returns only rows from left and right which share a common key (in this example, "B" and "D).



      In more recent versions of pandas (v0.21 or so), merge is now a first order function, so you can call DataFrame.merge.



      left.merge(right, on='key')
      # Or, if you want to be explicit
      # left.merge(right, on='key', how='inner')

      key value_x value_y
      0 B 0.400157 1.867558
      1 D 2.240893 -0.977278


      A LEFT OUTER JOIN, or LEFT JOIN is represented by





      This can be performed by specifying how='left'.



      left.merge(right, on='key', how='left')

      key value_x value_y
      0 A 1.764052 NaN
      1 B 0.400157 1.867558
      2 C 0.978738 NaN
      3 D 2.240893 -0.977278


      Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.



      And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...





      ...specify how='right':



      left.merge(right, on='key', how='right')

      key value_x value_y
      0 B 0.400157 1.867558
      1 D 2.240893 -0.977278
      2 E NaN 0.950088
      3 F NaN -0.151357


      Here, keys from right are used, and missing data from left is replaced by NaN.



      Finally, for the FULL OUTER JOIN, given by





      specify how='outer'.



      left.merge(right, on='key', how='outer')

      key value_x value_y
      0 A 1.764052 NaN
      1 B 0.400157 1.867558
      2 C 0.978738 NaN
      3 D 2.240893 -0.977278
      4 E NaN 0.950088
      5 F NaN -0.151357


      This uses the keys from both frames, and NaNs are inserted for missing rows in both.



      The documentation summarises these various merges nicely:



      enter image description here



      Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs



      If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.



      For LEFT-Excluding JOIN, represented as





      Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from left only,



      (left.merge(right, on='key', how='left', indicator=True)
      .query('_merge == "left_only"')
      .drop('_merge', 1))

      key value_x value_y
      0 A 1.764052 NaN
      2 C 0.978738 NaN


      Where,



      left.merge(right, on='key', how='left', indicator=True)

      key value_x value_y _merge
      0 A 1.764052 NaN left_only
      1 B 0.400157 1.867558 both
      2 C 0.978738 NaN left_only
      3 D 2.240893 -0.977278 both


      And similarly, for a RIGHT-Excluding JOIN,





      (left.merge(right, on='key', how='right', indicator=True)
      .query('_merge == "right_only"')
      .drop('_merge', 1))

      key value_x value_y
      2 E NaN 0.950088
      3 F NaN -0.151357


      Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),





      You can do this in similar fashion—



      (left.merge(right, on='key', how='outer', indicator=True)
      .query('_merge != "both"')
      .drop('_merge', 1))

      key value_x value_y
      0 A 1.764052 NaN
      2 C 0.978738 NaN
      4 E NaN 0.950088
      5 F NaN -0.151357


      Different names for key columns



      If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:



      left2 = left.rename('key':'keyLeft', axis=1)
      right2 = right.rename('key':'keyRight', axis=1)

      left2

      keyLeft value
      0 A 1.764052
      1 B 0.400157
      2 C 0.978738
      3 D 2.240893

      right2

      keyRight value
      0 B 1.867558
      1 D -0.977278
      2 E 0.950088
      3 F -0.151357




      left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

      keyLeft value_x keyRight value_y
      0 B 0.400157 B 1.867558
      1 D 2.240893 D -0.977278


      Avoiding duplicate key column in output



      When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.



      left3 = left2.set_index('keyLeft')
      left3.merge(right2, left_index=True, right_on='keyRight')

      value_x keyRight value_y
      0 0.400157 B 1.867558
      1 2.240893 D -0.977278


      Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.



      Merging only a single column from one of the DataFrames



      For example, consider



      right3 = right.assign(newcol=np.arange(len(right)))
      right3
      key value newcol
      0 B 1.867558 0
      1 D -0.977278 1
      2 E 0.950088 2
      3 F -0.151357 3


      If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:



      left.merge(right3[['key', 'newcol']], on='key')

      key value newcol
      0 B 0.400157 0
      1 D 2.240893 1


      If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:



      # left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
      left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

      key value newcol
      0 A 1.764052 NaN
      1 B 0.400157 0.0
      2 C 0.978738 NaN
      3 D 2.240893 1.0


      As mentioned, this is similar to, but faster than



      left.merge(right3[['key', 'newcol']], on='key', how='left')

      key value newcol
      0 A 1.764052 NaN
      1 B 0.400157 0.0
      2 C 0.978738 NaN
      3 D 2.240893 1.0


      Merging on multiple columns



      To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).



      left.merge(right, on=['key1', 'key2'] ...)


      Or, in the event the names are different,



      left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])


      Other useful merge* operations and functions




      • Merging a DataFrame with Series on index: See this answer.

      • Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.


      • pd.merge_ordered is a useful function for ordered JOINs.


      • pd.merge_asof (read: merge_asOf) is useful for approximate joins.


      This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.




      Index-based *-JOIN (+ index-column merges)



      Setup



      np.random.seed([3, 14])
      left = pd.DataFrame('value': np.random.randn(4), index=['A', 'B', 'C', 'D'])
      right = pd.DataFrame('value': np.random.randn(4), index=['B', 'D', 'E', 'F'])
      left.index.name = right.index.name = 'idxkey'

      left
      value
      idxkey
      A -0.602923
      B -0.402655
      C 0.302329
      D -0.524349

      right

      value
      idxkey
      B 0.543843
      D 0.013135
      E -0.326498
      F 1.385076


      Typically, a merge on index would look like this:



      left.merge(right, left_index=True, right_index=True)


      value_x value_y
      idxkey
      B -0.402655 0.543843
      D -0.524349 0.013135


      Support for index names



      If your index is named, then v0.23 users can also specify the level name to on (or left_on and right_on as necessary).



      left.merge(right, on='idxkey')

      value_x value_y
      idxkey
      B -0.402655 0.543843
      D -0.524349 0.013135


      Merging on index of one, column(s) of another



      It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,



      left.merge(right, left_on='key1', right_index=True)


      Or vice versa (right_on=... and left_index=True).



      right2 = right.reset_index().rename('idxkey' : 'colkey', axis=1)
      right2

      colkey value
      0 B 0.543843
      1 D 0.013135
      2 E -0.326498
      3 F 1.385076

      left.merge(right2, left_index=True, right_on='colkey')

      value_x colkey value_y
      0 -0.402655 B 0.543843
      1 -0.524349 D 0.013135


      In this special case, the index for left is named, so you can also use the index name with left_on, like this:



      left.merge(right2, left_on='idxkey', right_on='colkey')

      value_x colkey value_y
      0 -0.402655 B 0.543843
      1 -0.524349 D 0.013135


      DataFrame.join

      Besides these, there is another succinct option. You can use DataFrame.join which defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.



      left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

      value_x value_y
      idxkey
      B -0.402655 0.543843
      D -0.524349 0.013135


      Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:



      left.join(right)
      ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')


      Since the column names are the same. This would not be a problem if they were differently named.



      left.rename(columns='value':'leftvalue').join(right, how='inner')

      leftvalue value
      idxkey
      B -0.402655 0.543843
      D -0.524349 0.013135


      pd.concat

      Lastly, as an alternative for index-based joins, you can use pd.concat:



      pd.concat([left, right], axis=1, sort=False, join='inner')

      value value
      idxkey
      B -0.402655 0.543843
      D -0.524349 0.013135


      Omit join='inner' if you need a FULL OUTER JOIN (the default):



      pd.concat([left, right], axis=1, sort=False)

      value value
      A -0.602923 NaN
      B -0.402655 0.543843
      C 0.302329 NaN
      D -0.524349 0.013135
      E NaN -0.326498
      F NaN 1.385076


      For more information, see this canonical post on pd.concat by @piRSquared.




      Generalizing: mergeing multiple DataFrames



      Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:



      df1.merge(df2, ...).merge(df3, ...)


      However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.



      Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.



      # Setup.
      np.random.seed(0)
      A = pd.DataFrame('key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4))
      B = pd.DataFrame('key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4))
      C = pd.DataFrame('key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4))
      dfs = [A, B, C]

      # Note, the "key" column values are unique, so the index is unique.
      A2 = A.set_index('key')
      B2 = B.set_index('key')
      C2 = C.set_index('key')

      dfs2 = [A2, B2, C2]


      Multiway merge on unique keys (or index)



      If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.



      # merge on `key` column, you'll need to set the index before concatenating
      pd.concat([
      df.set_index('key') for df in dfs], axis=1, join='inner'
      ).reset_index()

      key valueA valueB valueC
      0 D 2.240893 -0.977278 1.0

      # merge on `key` index
      pd.concat(dfs2, axis=1, sort=False, join='inner')

      valueA valueB valueC
      key
      D 2.240893 -0.977278 1.0


      Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).



      Multiway merge on keys with duplicates



      concat is fast, but has its shortcomings. It cannot handle duplicates.



      A3 = pd.DataFrame('key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5))




      pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
      ValueError: Shape of passed values is (3, 4), indices imply (3, 2)


      In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).



      # join on `key` column, set as the index first
      # For inner join. For left join, omit the "how" argument.
      A.set_index('key').join(
      [df.set_index('key') for df in (B, C)], how='inner').reset_index()

      key valueA valueB valueC
      0 D 2.240893 -0.977278 1.0

      # join on `key` index
      A3.set_index('key').join([B2, C2], how='inner')

      valueA valueB valueC
      key
      D 1.454274 -0.977278 1.0
      D 0.761038 -0.977278 1.0





      share|improve this answer





























        300







        +500







        300







        +500



        300






        +500





        This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.



        In particular, here's what this post will go through:




        • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)



          • merging with different column names

          • avoiding duplicate merge key column in output


        • Merging with index under different conditions

          • effectively using your named index

          • merge key as the index of one and column of another


        • Multiway merges on columns and indexes (unique and non-unique)

        • Notable alternatives to merge and join

        What this post will not go through:



        • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.

        • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!


        Note

        Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.



        Furthermore, all the DataFrames here can be copied and replicated so
        you can play with them. Also, see this
        post
        on how to read DataFrames from your clipboard.



        Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.




        Enough Talk, just show me how to use merge!



        Setup



        np.random.seed(0)
        left = pd.DataFrame('key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4))
        right = pd.DataFrame('key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4))

        left

        key value
        0 A 1.764052
        1 B 0.400157
        2 C 0.978738
        3 D 2.240893

        right

        key value
        0 B 1.867558
        1 D -0.977278
        2 E 0.950088
        3 F -0.151357


        For the sake of simplicity, the key column has the same name (for now).



        An INNER JOIN is represented by






        Note

        This, along with the forthcoming figures all follow this convention:




        • blue indicates rows that are present in the merge result


        • red indicates rows that are excluded from the result (i.e., removed)


        • green indicates missing values that are replaced with NaNs in the result



        To perform an INNER JOIN, call pd.merge specifying the left DataFrame, the right DataFrame, and the join key.



        pd.merge(left, right, on='key')

        key value_x value_y
        0 B 0.400157 1.867558
        1 D 2.240893 -0.977278


        This returns only rows from left and right which share a common key (in this example, "B" and "D).



        In more recent versions of pandas (v0.21 or so), merge is now a first order function, so you can call DataFrame.merge.



        left.merge(right, on='key')
        # Or, if you want to be explicit
        # left.merge(right, on='key', how='inner')

        key value_x value_y
        0 B 0.400157 1.867558
        1 D 2.240893 -0.977278


        A LEFT OUTER JOIN, or LEFT JOIN is represented by





        This can be performed by specifying how='left'.



        left.merge(right, on='key', how='left')

        key value_x value_y
        0 A 1.764052 NaN
        1 B 0.400157 1.867558
        2 C 0.978738 NaN
        3 D 2.240893 -0.977278


        Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.



        And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...





        ...specify how='right':



        left.merge(right, on='key', how='right')

        key value_x value_y
        0 B 0.400157 1.867558
        1 D 2.240893 -0.977278
        2 E NaN 0.950088
        3 F NaN -0.151357


        Here, keys from right are used, and missing data from left is replaced by NaN.



        Finally, for the FULL OUTER JOIN, given by





        specify how='outer'.



        left.merge(right, on='key', how='outer')

        key value_x value_y
        0 A 1.764052 NaN
        1 B 0.400157 1.867558
        2 C 0.978738 NaN
        3 D 2.240893 -0.977278
        4 E NaN 0.950088
        5 F NaN -0.151357


        This uses the keys from both frames, and NaNs are inserted for missing rows in both.



        The documentation summarises these various merges nicely:



        enter image description here



        Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs



        If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.



        For LEFT-Excluding JOIN, represented as





        Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from left only,



        (left.merge(right, on='key', how='left', indicator=True)
        .query('_merge == "left_only"')
        .drop('_merge', 1))

        key value_x value_y
        0 A 1.764052 NaN
        2 C 0.978738 NaN


        Where,



        left.merge(right, on='key', how='left', indicator=True)

        key value_x value_y _merge
        0 A 1.764052 NaN left_only
        1 B 0.400157 1.867558 both
        2 C 0.978738 NaN left_only
        3 D 2.240893 -0.977278 both


        And similarly, for a RIGHT-Excluding JOIN,





        (left.merge(right, on='key', how='right', indicator=True)
        .query('_merge == "right_only"')
        .drop('_merge', 1))

        key value_x value_y
        2 E NaN 0.950088
        3 F NaN -0.151357


        Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),





        You can do this in similar fashion—



        (left.merge(right, on='key', how='outer', indicator=True)
        .query('_merge != "both"')
        .drop('_merge', 1))

        key value_x value_y
        0 A 1.764052 NaN
        2 C 0.978738 NaN
        4 E NaN 0.950088
        5 F NaN -0.151357


        Different names for key columns



        If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:



        left2 = left.rename('key':'keyLeft', axis=1)
        right2 = right.rename('key':'keyRight', axis=1)

        left2

        keyLeft value
        0 A 1.764052
        1 B 0.400157
        2 C 0.978738
        3 D 2.240893

        right2

        keyRight value
        0 B 1.867558
        1 D -0.977278
        2 E 0.950088
        3 F -0.151357




        left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

        keyLeft value_x keyRight value_y
        0 B 0.400157 B 1.867558
        1 D 2.240893 D -0.977278


        Avoiding duplicate key column in output



        When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.



        left3 = left2.set_index('keyLeft')
        left3.merge(right2, left_index=True, right_on='keyRight')

        value_x keyRight value_y
        0 0.400157 B 1.867558
        1 2.240893 D -0.977278


        Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.



        Merging only a single column from one of the DataFrames



        For example, consider



        right3 = right.assign(newcol=np.arange(len(right)))
        right3
        key value newcol
        0 B 1.867558 0
        1 D -0.977278 1
        2 E 0.950088 2
        3 F -0.151357 3


        If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:



        left.merge(right3[['key', 'newcol']], on='key')

        key value newcol
        0 B 0.400157 0
        1 D 2.240893 1


        If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:



        # left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
        left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

        key value newcol
        0 A 1.764052 NaN
        1 B 0.400157 0.0
        2 C 0.978738 NaN
        3 D 2.240893 1.0


        As mentioned, this is similar to, but faster than



        left.merge(right3[['key', 'newcol']], on='key', how='left')

        key value newcol
        0 A 1.764052 NaN
        1 B 0.400157 0.0
        2 C 0.978738 NaN
        3 D 2.240893 1.0


        Merging on multiple columns



        To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).



        left.merge(right, on=['key1', 'key2'] ...)


        Or, in the event the names are different,



        left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])


        Other useful merge* operations and functions




        • Merging a DataFrame with Series on index: See this answer.

        • Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.


        • pd.merge_ordered is a useful function for ordered JOINs.


        • pd.merge_asof (read: merge_asOf) is useful for approximate joins.


        This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.




        Index-based *-JOIN (+ index-column merges)



        Setup



        np.random.seed([3, 14])
        left = pd.DataFrame('value': np.random.randn(4), index=['A', 'B', 'C', 'D'])
        right = pd.DataFrame('value': np.random.randn(4), index=['B', 'D', 'E', 'F'])
        left.index.name = right.index.name = 'idxkey'

        left
        value
        idxkey
        A -0.602923
        B -0.402655
        C 0.302329
        D -0.524349

        right

        value
        idxkey
        B 0.543843
        D 0.013135
        E -0.326498
        F 1.385076


        Typically, a merge on index would look like this:



        left.merge(right, left_index=True, right_index=True)


        value_x value_y
        idxkey
        B -0.402655 0.543843
        D -0.524349 0.013135


        Support for index names



        If your index is named, then v0.23 users can also specify the level name to on (or left_on and right_on as necessary).



        left.merge(right, on='idxkey')

        value_x value_y
        idxkey
        B -0.402655 0.543843
        D -0.524349 0.013135


        Merging on index of one, column(s) of another



        It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,



        left.merge(right, left_on='key1', right_index=True)


        Or vice versa (right_on=... and left_index=True).



        right2 = right.reset_index().rename('idxkey' : 'colkey', axis=1)
        right2

        colkey value
        0 B 0.543843
        1 D 0.013135
        2 E -0.326498
        3 F 1.385076

        left.merge(right2, left_index=True, right_on='colkey')

        value_x colkey value_y
        0 -0.402655 B 0.543843
        1 -0.524349 D 0.013135


        In this special case, the index for left is named, so you can also use the index name with left_on, like this:



        left.merge(right2, left_on='idxkey', right_on='colkey')

        value_x colkey value_y
        0 -0.402655 B 0.543843
        1 -0.524349 D 0.013135


        DataFrame.join

        Besides these, there is another succinct option. You can use DataFrame.join which defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.



        left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

        value_x value_y
        idxkey
        B -0.402655 0.543843
        D -0.524349 0.013135


        Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:



        left.join(right)
        ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')


        Since the column names are the same. This would not be a problem if they were differently named.



        left.rename(columns='value':'leftvalue').join(right, how='inner')

        leftvalue value
        idxkey
        B -0.402655 0.543843
        D -0.524349 0.013135


        pd.concat

        Lastly, as an alternative for index-based joins, you can use pd.concat:



        pd.concat([left, right], axis=1, sort=False, join='inner')

        value value
        idxkey
        B -0.402655 0.543843
        D -0.524349 0.013135


        Omit join='inner' if you need a FULL OUTER JOIN (the default):



        pd.concat([left, right], axis=1, sort=False)

        value value
        A -0.602923 NaN
        B -0.402655 0.543843
        C 0.302329 NaN
        D -0.524349 0.013135
        E NaN -0.326498
        F NaN 1.385076


        For more information, see this canonical post on pd.concat by @piRSquared.




        Generalizing: mergeing multiple DataFrames



        Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:



        df1.merge(df2, ...).merge(df3, ...)


        However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.



        Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.



        # Setup.
        np.random.seed(0)
        A = pd.DataFrame('key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4))
        B = pd.DataFrame('key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4))
        C = pd.DataFrame('key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4))
        dfs = [A, B, C]

        # Note, the "key" column values are unique, so the index is unique.
        A2 = A.set_index('key')
        B2 = B.set_index('key')
        C2 = C.set_index('key')

        dfs2 = [A2, B2, C2]


        Multiway merge on unique keys (or index)



        If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.



        # merge on `key` column, you'll need to set the index before concatenating
        pd.concat([
        df.set_index('key') for df in dfs], axis=1, join='inner'
        ).reset_index()

        key valueA valueB valueC
        0 D 2.240893 -0.977278 1.0

        # merge on `key` index
        pd.concat(dfs2, axis=1, sort=False, join='inner')

        valueA valueB valueC
        key
        D 2.240893 -0.977278 1.0


        Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).



        Multiway merge on keys with duplicates



        concat is fast, but has its shortcomings. It cannot handle duplicates.



        A3 = pd.DataFrame('key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5))




        pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
        ValueError: Shape of passed values is (3, 4), indices imply (3, 2)


        In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).



        # join on `key` column, set as the index first
        # For inner join. For left join, omit the "how" argument.
        A.set_index('key').join(
        [df.set_index('key') for df in (B, C)], how='inner').reset_index()

        key valueA valueB valueC
        0 D 2.240893 -0.977278 1.0

        # join on `key` index
        A3.set_index('key').join([B2, C2], how='inner')

        valueA valueB valueC
        key
        D 1.454274 -0.977278 1.0
        D 0.761038 -0.977278 1.0





        share|improve this answer















        This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.



        In particular, here's what this post will go through:




        • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)



          • merging with different column names

          • avoiding duplicate merge key column in output


        • Merging with index under different conditions

          • effectively using your named index

          • merge key as the index of one and column of another


        • Multiway merges on columns and indexes (unique and non-unique)

        • Notable alternatives to merge and join

        What this post will not go through:



        • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.

        • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!


        Note

        Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.



        Furthermore, all the DataFrames here can be copied and replicated so
        you can play with them. Also, see this
        post
        on how to read DataFrames from your clipboard.



        Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.




        Enough Talk, just show me how to use merge!



        Setup



        np.random.seed(0)
        left = pd.DataFrame('key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4))
        right = pd.DataFrame('key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4))

        left

        key value
        0 A 1.764052
        1 B 0.400157
        2 C 0.978738
        3 D 2.240893

        right

        key value
        0 B 1.867558
        1 D -0.977278
        2 E 0.950088
        3 F -0.151357


        For the sake of simplicity, the key column has the same name (for now).



        An INNER JOIN is represented by






        Note

        This, along with the forthcoming figures all follow this convention:




        • blue indicates rows that are present in the merge result


        • red indicates rows that are excluded from the result (i.e., removed)


        • green indicates missing values that are replaced with NaNs in the result



        To perform an INNER JOIN, call pd.merge specifying the left DataFrame, the right DataFrame, and the join key.



        pd.merge(left, right, on='key')

        key value_x value_y
        0 B 0.400157 1.867558
        1 D 2.240893 -0.977278


        This returns only rows from left and right which share a common key (in this example, "B" and "D).



        In more recent versions of pandas (v0.21 or so), merge is now a first order function, so you can call DataFrame.merge.



        left.merge(right, on='key')
        # Or, if you want to be explicit
        # left.merge(right, on='key', how='inner')

        key value_x value_y
        0 B 0.400157 1.867558
        1 D 2.240893 -0.977278


        A LEFT OUTER JOIN, or LEFT JOIN is represented by





        This can be performed by specifying how='left'.



        left.merge(right, on='key', how='left')

        key value_x value_y
        0 A 1.764052 NaN
        1 B 0.400157 1.867558
        2 C 0.978738 NaN
        3 D 2.240893 -0.977278


        Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.



        And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...





        ...specify how='right':



        left.merge(right, on='key', how='right')

        key value_x value_y
        0 B 0.400157 1.867558
        1 D 2.240893 -0.977278
        2 E NaN 0.950088
        3 F NaN -0.151357


        Here, keys from right are used, and missing data from left is replaced by NaN.



        Finally, for the FULL OUTER JOIN, given by





        specify how='outer'.



        left.merge(right, on='key', how='outer')

        key value_x value_y
        0 A 1.764052 NaN
        1 B 0.400157 1.867558
        2 C 0.978738 NaN
        3 D 2.240893 -0.977278
        4 E NaN 0.950088
        5 F NaN -0.151357


        This uses the keys from both frames, and NaNs are inserted for missing rows in both.



        The documentation summarises these various merges nicely:



        enter image description here



        Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs



        If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.



        For LEFT-Excluding JOIN, represented as





        Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from left only,



        (left.merge(right, on='key', how='left', indicator=True)
        .query('_merge == "left_only"')
        .drop('_merge', 1))

        key value_x value_y
        0 A 1.764052 NaN
        2 C 0.978738 NaN


        Where,



        left.merge(right, on='key', how='left', indicator=True)

        key value_x value_y _merge
        0 A 1.764052 NaN left_only
        1 B 0.400157 1.867558 both
        2 C 0.978738 NaN left_only
        3 D 2.240893 -0.977278 both


        And similarly, for a RIGHT-Excluding JOIN,





        (left.merge(right, on='key', how='right', indicator=True)
        .query('_merge == "right_only"')
        .drop('_merge', 1))

        key value_x value_y
        2 E NaN 0.950088
        3 F NaN -0.151357


        Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),





        You can do this in similar fashion—



        (left.merge(right, on='key', how='outer', indicator=True)
        .query('_merge != "both"')
        .drop('_merge', 1))

        key value_x value_y
        0 A 1.764052 NaN
        2 C 0.978738 NaN
        4 E NaN 0.950088
        5 F NaN -0.151357


        Different names for key columns



        If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:



        left2 = left.rename('key':'keyLeft', axis=1)
        right2 = right.rename('key':'keyRight', axis=1)

        left2

        keyLeft value
        0 A 1.764052
        1 B 0.400157
        2 C 0.978738
        3 D 2.240893

        right2

        keyRight value
        0 B 1.867558
        1 D -0.977278
        2 E 0.950088
        3 F -0.151357




        left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

        keyLeft value_x keyRight value_y
        0 B 0.400157 B 1.867558
        1 D 2.240893 D -0.977278


        Avoiding duplicate key column in output



        When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.



        left3 = left2.set_index('keyLeft')
        left3.merge(right2, left_index=True, right_on='keyRight')

        value_x keyRight value_y
        0 0.400157 B 1.867558
        1 2.240893 D -0.977278


        Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.



        Merging only a single column from one of the DataFrames



        For example, consider



        right3 = right.assign(newcol=np.arange(len(right)))
        right3
        key value newcol
        0 B 1.867558 0
        1 D -0.977278 1
        2 E 0.950088 2
        3 F -0.151357 3


        If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:



        left.merge(right3[['key', 'newcol']], on='key')

        key value newcol
        0 B 0.400157 0
        1 D 2.240893 1


        If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:



        # left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
        left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

        key value newcol
        0 A 1.764052 NaN
        1 B 0.400157 0.0
        2 C 0.978738 NaN
        3 D 2.240893 1.0


        As mentioned, this is similar to, but faster than



        left.merge(right3[['key', 'newcol']], on='key', how='left')

        key value newcol
        0 A 1.764052 NaN
        1 B 0.400157 0.0
        2 C 0.978738 NaN
        3 D 2.240893 1.0


        Merging on multiple columns



        To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).



        left.merge(right, on=['key1', 'key2'] ...)


        Or, in the event the names are different,



        left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])


        Other useful merge* operations and functions




        • Merging a DataFrame with Series on index: See this answer.

        • Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.


        • pd.merge_ordered is a useful function for ordered JOINs.


        • pd.merge_asof (read: merge_asOf) is useful for approximate joins.


        This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.




        Index-based *-JOIN (+ index-column merges)



        Setup



        np.random.seed([3, 14])
        left = pd.DataFrame('value': np.random.randn(4), index=['A', 'B', 'C', 'D'])
        right = pd.DataFrame('value': np.random.randn(4), index=['B', 'D', 'E', 'F'])
        left.index.name = right.index.name = 'idxkey'

        left
        value
        idxkey
        A -0.602923
        B -0.402655
        C 0.302329
        D -0.524349

        right

        value
        idxkey
        B 0.543843
        D 0.013135
        E -0.326498
        F 1.385076


        Typically, a merge on index would look like this:



        left.merge(right, left_index=True, right_index=True)


        value_x value_y
        idxkey
        B -0.402655 0.543843
        D -0.524349 0.013135


        Support for index names



        If your index is named, then v0.23 users can also specify the level name to on (or left_on and right_on as necessary).



        left.merge(right, on='idxkey')

        value_x value_y
        idxkey
        B -0.402655 0.543843
        D -0.524349 0.013135


        Merging on index of one, column(s) of another



        It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,



        left.merge(right, left_on='key1', right_index=True)


        Or vice versa (right_on=... and left_index=True).



        right2 = right.reset_index().rename('idxkey' : 'colkey', axis=1)
        right2

        colkey value
        0 B 0.543843
        1 D 0.013135
        2 E -0.326498
        3 F 1.385076

        left.merge(right2, left_index=True, right_on='colkey')

        value_x colkey value_y
        0 -0.402655 B 0.543843
        1 -0.524349 D 0.013135


        In this special case, the index for left is named, so you can also use the index name with left_on, like this:



        left.merge(right2, left_on='idxkey', right_on='colkey')

        value_x colkey value_y
        0 -0.402655 B 0.543843
        1 -0.524349 D 0.013135


        DataFrame.join

        Besides these, there is another succinct option. You can use DataFrame.join which defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.



        left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

        value_x value_y
        idxkey
        B -0.402655 0.543843
        D -0.524349 0.013135


        Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:



        left.join(right)
        ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')


        Since the column names are the same. This would not be a problem if they were differently named.



        left.rename(columns='value':'leftvalue').join(right, how='inner')

        leftvalue value
        idxkey
        B -0.402655 0.543843
        D -0.524349 0.013135


        pd.concat

        Lastly, as an alternative for index-based joins, you can use pd.concat:



        pd.concat([left, right], axis=1, sort=False, join='inner')

        value value
        idxkey
        B -0.402655 0.543843
        D -0.524349 0.013135


        Omit join='inner' if you need a FULL OUTER JOIN (the default):



        pd.concat([left, right], axis=1, sort=False)

        value value
        A -0.602923 NaN
        B -0.402655 0.543843
        C 0.302329 NaN
        D -0.524349 0.013135
        E NaN -0.326498
        F NaN 1.385076


        For more information, see this canonical post on pd.concat by @piRSquared.




        Generalizing: mergeing multiple DataFrames



        Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:



        df1.merge(df2, ...).merge(df3, ...)


        However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.



        Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.



        # Setup.
        np.random.seed(0)
        A = pd.DataFrame('key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4))
        B = pd.DataFrame('key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4))
        C = pd.DataFrame('key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4))
        dfs = [A, B, C]

        # Note, the "key" column values are unique, so the index is unique.
        A2 = A.set_index('key')
        B2 = B.set_index('key')
        C2 = C.set_index('key')

        dfs2 = [A2, B2, C2]


        Multiway merge on unique keys (or index)



        If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.



        # merge on `key` column, you'll need to set the index before concatenating
        pd.concat([
        df.set_index('key') for df in dfs], axis=1, join='inner'
        ).reset_index()

        key valueA valueB valueC
        0 D 2.240893 -0.977278 1.0

        # merge on `key` index
        pd.concat(dfs2, axis=1, sort=False, join='inner')

        valueA valueB valueC
        key
        D 2.240893 -0.977278 1.0


        Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).



        Multiway merge on keys with duplicates



        concat is fast, but has its shortcomings. It cannot handle duplicates.



        A3 = pd.DataFrame('key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5))




        pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
        ValueError: Shape of passed values is (3, 4), indices imply (3, 2)


        In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).



        # join on `key` column, set as the index first
        # For inner join. For left join, omit the "how" argument.
        A.set_index('key').join(
        [df.set_index('key') for df in (B, C)], how='inner').reset_index()

        key valueA valueB valueC
        0 D 2.240893 -0.977278 1.0

        # join on `key` index
        A3.set_index('key').join([B2, C2], how='inner')

        valueA valueB valueC
        key
        D 1.454274 -0.977278 1.0
        D 0.761038 -0.977278 1.0






        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Jul 6 at 3:52

























        answered Dec 6 '18 at 6:41









        cs95cs95

        165k32 gold badges236 silver badges303 bronze badges




        165k32 gold badges236 silver badges303 bronze badges


























            17















            A supplemental visual view of pd.concat([df0, df1], kwargs).
            Notice that, kwarg axis=0 or axis=1 's meaning is not as intuitive as df.mean() or df.apply(func)



            on pd.concat([df0, df1])






            share|improve this answer






















            • 4





              This is a nice diagram. May I ask how you produced it?

              – cs95
              May 20 at 17:27






            • 4





              google doc's built-in "insert ==> drawing... ==> new" (as of 2019-May). But, to be clear: the only reason I used google doc for this picture is because my notes is stored in google doc, and I would like a picture that can be modified quickly within google doc itself. Actually now you mentioned it, the google doc's drawing tool is pretty neat.

              – eliu
              May 21 at 18:19











            • Wow, this is great. Coming from the SQL world, "vertical" join is not a join in my head, as the table's structure is always fixed. Now even think pandas should consolidate concat and merge with a direction parameter being horizontal or vertical.

              – Ufos
              Aug 1 at 11:16











            • @Ufos Isn't that exactly what axis=1 and axis=0 is?

              – cs95
              Aug 6 at 17:31












            • yes, there're now merge and concat and axis and whatever. However, as @eliu shows, it's all just the same concept of merge with "left" and "right" and "horizontal" or "vertical". I, personally, have to look into the documentation every time I have to remember which "axis" is 0 and which is 1.

              – Ufos
              Aug 19 at 9:32















            17















            A supplemental visual view of pd.concat([df0, df1], kwargs).
            Notice that, kwarg axis=0 or axis=1 's meaning is not as intuitive as df.mean() or df.apply(func)



            on pd.concat([df0, df1])






            share|improve this answer






















            • 4





              This is a nice diagram. May I ask how you produced it?

              – cs95
              May 20 at 17:27






            • 4





              google doc's built-in "insert ==> drawing... ==> new" (as of 2019-May). But, to be clear: the only reason I used google doc for this picture is because my notes is stored in google doc, and I would like a picture that can be modified quickly within google doc itself. Actually now you mentioned it, the google doc's drawing tool is pretty neat.

              – eliu
              May 21 at 18:19











            • Wow, this is great. Coming from the SQL world, "vertical" join is not a join in my head, as the table's structure is always fixed. Now even think pandas should consolidate concat and merge with a direction parameter being horizontal or vertical.

              – Ufos
              Aug 1 at 11:16











            • @Ufos Isn't that exactly what axis=1 and axis=0 is?

              – cs95
              Aug 6 at 17:31












            • yes, there're now merge and concat and axis and whatever. However, as @eliu shows, it's all just the same concept of merge with "left" and "right" and "horizontal" or "vertical". I, personally, have to look into the documentation every time I have to remember which "axis" is 0 and which is 1.

              – Ufos
              Aug 19 at 9:32













            17














            17










            17









            A supplemental visual view of pd.concat([df0, df1], kwargs).
            Notice that, kwarg axis=0 or axis=1 's meaning is not as intuitive as df.mean() or df.apply(func)



            on pd.concat([df0, df1])






            share|improve this answer















            A supplemental visual view of pd.concat([df0, df1], kwargs).
            Notice that, kwarg axis=0 or axis=1 's meaning is not as intuitive as df.mean() or df.apply(func)



            on pd.concat([df0, df1])







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Jun 28 at 21:45

























            answered Apr 25 at 23:43









            eliueliu

            5924 silver badges16 bronze badges




            5924 silver badges16 bronze badges










            • 4





              This is a nice diagram. May I ask how you produced it?

              – cs95
              May 20 at 17:27






            • 4





              google doc's built-in "insert ==> drawing... ==> new" (as of 2019-May). But, to be clear: the only reason I used google doc for this picture is because my notes is stored in google doc, and I would like a picture that can be modified quickly within google doc itself. Actually now you mentioned it, the google doc's drawing tool is pretty neat.

              – eliu
              May 21 at 18:19











            • Wow, this is great. Coming from the SQL world, "vertical" join is not a join in my head, as the table's structure is always fixed. Now even think pandas should consolidate concat and merge with a direction parameter being horizontal or vertical.

              – Ufos
              Aug 1 at 11:16











            • @Ufos Isn't that exactly what axis=1 and axis=0 is?

              – cs95
              Aug 6 at 17:31












            • yes, there're now merge and concat and axis and whatever. However, as @eliu shows, it's all just the same concept of merge with "left" and "right" and "horizontal" or "vertical". I, personally, have to look into the documentation every time I have to remember which "axis" is 0 and which is 1.

              – Ufos
              Aug 19 at 9:32












            • 4





              This is a nice diagram. May I ask how you produced it?

              – cs95
              May 20 at 17:27






            • 4





              google doc's built-in "insert ==> drawing... ==> new" (as of 2019-May). But, to be clear: the only reason I used google doc for this picture is because my notes is stored in google doc, and I would like a picture that can be modified quickly within google doc itself. Actually now you mentioned it, the google doc's drawing tool is pretty neat.

              – eliu
              May 21 at 18:19











            • Wow, this is great. Coming from the SQL world, "vertical" join is not a join in my head, as the table's structure is always fixed. Now even think pandas should consolidate concat and merge with a direction parameter being horizontal or vertical.

              – Ufos
              Aug 1 at 11:16











            • @Ufos Isn't that exactly what axis=1 and axis=0 is?

              – cs95
              Aug 6 at 17:31












            • yes, there're now merge and concat and axis and whatever. However, as @eliu shows, it's all just the same concept of merge with "left" and "right" and "horizontal" or "vertical". I, personally, have to look into the documentation every time I have to remember which "axis" is 0 and which is 1.

              – Ufos
              Aug 19 at 9:32







            4




            4





            This is a nice diagram. May I ask how you produced it?

            – cs95
            May 20 at 17:27





            This is a nice diagram. May I ask how you produced it?

            – cs95
            May 20 at 17:27




            4




            4





            google doc's built-in "insert ==> drawing... ==> new" (as of 2019-May). But, to be clear: the only reason I used google doc for this picture is because my notes is stored in google doc, and I would like a picture that can be modified quickly within google doc itself. Actually now you mentioned it, the google doc's drawing tool is pretty neat.

            – eliu
            May 21 at 18:19





            google doc's built-in "insert ==> drawing... ==> new" (as of 2019-May). But, to be clear: the only reason I used google doc for this picture is because my notes is stored in google doc, and I would like a picture that can be modified quickly within google doc itself. Actually now you mentioned it, the google doc's drawing tool is pretty neat.

            – eliu
            May 21 at 18:19













            Wow, this is great. Coming from the SQL world, "vertical" join is not a join in my head, as the table's structure is always fixed. Now even think pandas should consolidate concat and merge with a direction parameter being horizontal or vertical.

            – Ufos
            Aug 1 at 11:16





            Wow, this is great. Coming from the SQL world, "vertical" join is not a join in my head, as the table's structure is always fixed. Now even think pandas should consolidate concat and merge with a direction parameter being horizontal or vertical.

            – Ufos
            Aug 1 at 11:16













            @Ufos Isn't that exactly what axis=1 and axis=0 is?

            – cs95
            Aug 6 at 17:31






            @Ufos Isn't that exactly what axis=1 and axis=0 is?

            – cs95
            Aug 6 at 17:31














            yes, there're now merge and concat and axis and whatever. However, as @eliu shows, it's all just the same concept of merge with "left" and "right" and "horizontal" or "vertical". I, personally, have to look into the documentation every time I have to remember which "axis" is 0 and which is 1.

            – Ufos
            Aug 19 at 9:32





            yes, there're now merge and concat and axis and whatever. However, as @eliu shows, it's all just the same concept of merge with "left" and "right" and "horizontal" or "vertical". I, personally, have to look into the documentation every time I have to remember which "axis" is 0 and which is 1.

            – Ufos
            Aug 19 at 9:32

















            draft saved

            draft discarded
















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid


            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.

            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53645882%2fpandas-merging-101%23new-answer', 'question_page');

            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            SQL error code 1064 with creating Laravel foreign keysForeign key constraints: When to use ON UPDATE and ON DELETEDropping column with foreign key Laravel error: General error: 1025 Error on renameLaravel SQL Can't create tableLaravel Migration foreign key errorLaravel php artisan migrate:refresh giving a syntax errorSQLSTATE[42S01]: Base table or view already exists or Base table or view already exists: 1050 Tableerror in migrating laravel file to xampp serverSyntax error or access violation: 1064:syntax to use near 'unsigned not null, modelName varchar(191) not null, title varchar(191) not nLaravel cannot create new table field in mysqlLaravel 5.7:Last migration creates table but is not registered in the migration table

            용인 삼성생명 블루밍스 목차 통계 역대 감독 선수단 응원단 경기장 같이 보기 외부 링크 둘러보기 메뉴samsungblueminx.comeh선수 명단용인 삼성생명 블루밍스용인 삼성생명 블루밍스ehsamsungblueminx.comeheheheh

            155 수학 과학 기타 둘러보기 메뉴eh추가해eh문서를 완성해