The hidden rules of pandas.merge_asof()

Angwalt
5 min readMar 5, 2021

--

After years of merging dataframes on a field common to both the left and right dataframe being merged , that match exactly then keeping the resulting dataframe either based on whether you want to ensure all rows from the right dataframe , left dataframe or both are kept in the final result(using how parameter with either of the values ‘right’, ‘left’, ‘outer’ or ‘inner’). You may finally have run into a use-case whereby you need all rows from the left dataframe with a valid value for the field above to match to the right dataframe whether its an exact or inexact match . This is where we leave behind merge() and in comes its more complicated cousin merge_asof() which is a very handy method to know how to manipulate in the right situation.

Disclaimer:This is an article on some of the gotchas of the method that does require some knowlege about merge() and basics about merge_asof(), to simply learn the basics and get started I do recommend checking out the amply useful documentation from pandas themselves on the matter here.

Now with that out of the way we can begin.

#Rule №1: There is no ‘how’ parameter

This is because it is implied that we are doing a left-join where the result is that all rows from the left dataframe are preserved in the final result. In other words, by default behind the scenes, the how parameter we are used to in pandas.merge() is by force set to ‘left’ when using .merge_asof(). So any use cases requiring anything other than preserving all the left dataframe rows in the final result is not possible by only using the merge_asof(), but I am sure you’ll think of something.

#Rule №2:What ‘allow_exact_matches’ parameter means

This is an additional parameter unique to merge_asof() and it has to do with exact matches. Now, this depends on what you want to do and since it takes in a boolean, there really only are 2 options.

One, you either want all rows from the left dataframe(with valid value for the on field) to be matched inexactly to a row in the right-handside dataframe or you want all rows from left dataframe (with valid value for the on field) to be matched exactly, if possible, but if that is not possible you want an inexact match. For merge_asof() to enable the second scenario you have to set allow_exact_matches to True and it is worthwhile to note that if unset it defaults to False which might not be the functionality you intend to have.

It is also worth noting that an exact match from a left dataframe row can have more than one match while this allow_exact_matches is set to True but an inexact match only ever has one match from the right dataframe depending on the direction parameter.

#Rule №3:The difference between the ‘on’ and ‘by’ parameter

To the untrained eye, these 2 can seem like a bad case of duplication or a slight overlap but I am here to assure you that this is not the case. I will use the second scenario as an example mentioned above pertaining to the allow_exact_matches parameter.

In the case where you want an exact match between values from fields of the left and right dataframe then the cumulative columns listed in the on and by parameter serve the function that would have been served by the on parameter in the vanilla merge() method i.e. those columns must exist in both dataframes and must match exactly for the row from the left dataframe to merge with 1 or more rows from the right dataframe .

It should also be noted that the on parameter can only take one value while the by can take 1 or more in merge_asof().

But in the case of an inexact match, the possible contending rows from the right dataframe for an inexact match with the row from the left dataframe, must have valid values for all the fields listed in the by parameter that match exactly to the row from the left dataframe. That was indeed a mouthful, let’s see an example to futher illustrate this:

differentiate ‘on’ and ‘by’

As you can see the final row from trades did not match any value from quotes because no value from quotes had the same ticker value as its own of ‘PLAA’ even while using the direction parameter that incorporates both backwards and forwards for an inexact match.

It is also worth noting that when the columns listed in by parameter for left and right dataframes have different names and it is tedious to rename them then you can use the left_by and right_by in the same way you use right_on and left_on when it comes to the vanilla merge() method.

#Rule №4:The column listed under ‘on’ should have a numerical datatype.

This is a simple one but can cost you hours to debug if you’re not looking for it. To be on the safe side explicitly cast the column listed in the on parameter for both dataframes to a datetime-like or numerical type such as float64 or int32. Columns with datatypes of string, object or category are notorious for causing this error so watch out.

#Rule №5:The column listed under ‘on’ should have no nonetype values.

This can easily be checked using the .hasnans attribute of the column in question or dropping all nonetypes using dropna() with a subset parameter.

Moreover, another more subtle issue may be having a datatype for your column that can contain nonetypes for example floats because np.nan is actually considered a float or pandas.Int64Dtype() datatype with a capital ‘I’ that is basically the equivalent of int64 that can contain nonetypes. To learn more about this datatype visit here.

These kinds of datatypes are considered ambiguous and pandas will throw an error if they exist for the column listed for on parameter, not risking putting through a merge with possible nonetypes in the values.

#Rule №5: Sort both dataframes according to the column listed for the ‘on’ parameter.

This is also a rather hidden rule given its necessity especially if you data was sorted initially. Regardless, just like watching out for nonetypes before running a merge ensure you sort both dataframes according to the on column and all should be well. This is actually a rule that merge_asof shares with merge_ordered() so ensure you sort before that kind of merge also. To find out more about merge_ordered() go here.

That’s all for now but if I have left any hidden rules out be sure to add then via a comment. Otherwise, merge on.

--

--

Angwalt
Angwalt

Written by Angwalt

Just another coder like you

Responses (1)