# Data Wrangling: Hierarchical Indexing, Join, & Combine¶

Ha Khanh Nguyen (hknguyen)

## 1. Hierarchical Indexing¶

• Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis.
• It provides a way for you to work with higher dimensional data in a lower dimensional form.

### 1.1 Series¶

• What you’re seeing is a prettified view of a Series with a MultiIndex as its index. The “gaps” in the index display mean “use the label directly above”:
• With a hierarchically indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:
• Selection is even possible from an “inner” level:
• Hierarchical indexing plays an important role in reshaping data and group-based operations like forming a pivot table.
• For example, you could rearrange the data into a DataFrame using its unstack() method:
• The inverse operation of unstack() is stack():

### 1.2 DataFrame¶

• With a DataFrame, either axis can have a hierarchical index:
• The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output:
• With partial column indexing you can similarly select groups of columns:

### 1.2 Summary Statistics by Level¶

• Many descriptive and summary statistics on DataFrame and Series have a level option in which you can specify the level you want to aggregate by on a particular axis.
• Consider the above DataFrame; we can aggregate by level on either the rows or columns like so:

## 2. Combining and Merging Datasets¶

• Data contained in pandas objects can be combined together in a number of ways:
• pandas.merge() connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
• pandas.concat() concatenates or “stacks” together objects along an axis.
• The combine_first() instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

### 2.1 Database-style DataFrame Joins¶

• Merge or join operations combine datasets by linking rows using one or more keys.
• Note that we didn’t specify which column to join on. If that information is not specified, merge() uses the overlapping column names as the keys.
• It’s a good practice to specify explicitly, though:
• If the column names are different in each object, you can specify them separately:
• You may notice that the 'c' and 'd' values and associated data are missing from the result.
• By default merge() does an'inner' join; the keys in the result are the intersection, or the common set found in both tables.
• Other possible options are 'left', 'right', and 'outer'.
• The 'outer' join takes the union of the keys, combining the effect of applying both left and right joins:

#### c. Many-to-many join¶

• Many-to-many merges have well-defined, though not necessarily intuitive, behavior. Here’s an example:
• Many-to-many joins form the Cartesian product of the rows. Since there were three 'b' rows in the left DataFrame and two in the right one, there are six 'b' rows in the result.

### 2.2 Concatenating Along an Axis¶

• The concat() function in pandas provides a consistent way to concatenate DataFrame or Series.
• By default, concat() works along axis=0, producing another Series.
• If you pass axis=1, the result will instead be a DataFrame (axis=1 is the columns):
• The same logic extends to DataFrame objects:

### 2.3 Combining Data with Overlap¶

• You may have two datasets whose indexes overlap in full or part.
• As a motivating example, consider NumPy’s where() function, which performs the array-oriented equivalent of an if-else expression:
• Series has a combine_first() method, which performs the equivalent of this operation along with pandas’s usual data alignment logic:
• With DataFrames, combine_first()` does the same thing column by column, so you can think of it as “patching” missing data in the calling object with data from the object you pass:

This lecture notes reference materials from Chapter 8 of Wes McKinney's Python for Data Analysis 2nd Ed.