# 8.1 - pandas: DataFrame

Ha Khanh Nguyen (hknguyen)

## 1. Importing Data

• We will learn about importing data in depth next week.
• For the purpose of this lecture, we will only work with CSV (comma-separated values) files.
• Therefore, we will use read_csv() to import data into Python.
• The output of the read_csv() function is a pandas DataFrame.
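As a minimal sketch of the import step: the ramen file itself is not included in these notes, so the example below reads a few made-up rows from an in-memory string instead. The `csv_text` sample is hypothetical; only the five column names come from this lecture.

```python
import io
import pandas as pd

# Hypothetical stand-in for the ramen CSV file: made-up rows, real column names.
csv_text = """Brand,Variety,Style,Country,Stars
Nissin,Chicken,Cup,USA,1.5
Maruchan,Shrimp,Cup,USA,2.0
Nongshim,Spicy,Bowl,South Korea,4.5
"""

# read_csv() accepts a file path or any file-like object.
ramen = pd.read_csv(io.StringIO(csv_text))

print(type(ramen))   # the result is a pandas DataFrame
print(ramen.head())  # first rows of the table
```

With a real file on disk, the call would take the file's path in place of the `io.StringIO(...)` object.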

## 2. What is a DataFrame?

• A DataFrame is a tabular data structure in which:
  • Each row represents one observation.
  • Each column represents one variable/attribute.
• In our ramen dataset, there are 2575 observations and 5 attributes, which are:
  • Brand
  • Variety
  • Style
  • Country
  • Stars
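The number of observations and attributes can be read off a DataFrame directly. The sketch below uses a tiny made-up sample (hypothetical rows, not the real dataset), so it reports 3 rows rather than 2575.

```python
import pandas as pd

# Hypothetical sample with the same 5 attributes as the ramen dataset.
ramen = pd.DataFrame({
    "Brand": ["Nissin", "Maruchan", "Nongshim"],
    "Variety": ["Chicken", "Shrimp", "Spicy"],
    "Style": ["Cup", "Cup", "Bowl"],
    "Country": ["USA", "USA", "South Korea"],
    "Stars": [1.5, 2.0, 4.5],
})

# shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = ramen.shape
print(ramen.shape)           # (3, 5)
print(list(ramen.columns))   # the attribute names
```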

### 2.1 Creating a DataFrame

• There are many ways to construct a DataFrame. The most common method is to load the data from a file (using a function like read_csv()).
• Another common method is to use a dictionary of equal-length lists or NumPy arrays:
• A DataFrame represents a rectangular table of data and contains an ordered collection of columns of the same length, each of which can be a different value type (numeric, string, boolean, etc.).
• If you have learned R, recall the definition of a data frame in R!
• Similarly, in Python, a DataFrame can be thought of as a dictionary of Series all sharing the same index.
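A minimal sketch of the dictionary approach, using made-up values (the rows are hypothetical; only the column names come from this lecture). Each key becomes a column name, and each list or NumPy array becomes that column's values.

```python
import numpy as np
import pandas as pd

# Keys become column names; all values must have the same length.
data = {
    "Brand": ["Nissin", "Maruchan", "Nongshim"],
    "Variety": ["Chicken", "Shrimp", "Spicy"],
    "Style": ["Cup", "Cup", "Bowl"],
    "Country": ["USA", "USA", "South Korea"],
    "Stars": np.array([1.5, 2.0, 4.5]),  # a NumPy array works like a list here
}
ramen = pd.DataFrame(data)

# Each column is a Series, and all columns share the DataFrame's index.
brand = ramen["Brand"]
stars = ramen["Stars"]
print(type(stars))
print(brand.index.equals(stars.index))
```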

### 2.2 Summary statistics of DataFrame columns

• Use the describe() method of the DataFrame object to get summary statistics of the variables in the DataFrame.
• We also want to know the data type of each variable!
• You might wonder: How many different countries produce ramen? How many styles of ramen are there? Or how many different brands?
• Use the unique() method on a column (a Series) to find out!
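The sketch below runs these calls on a small made-up sample (hypothetical rows). The `dtypes` attribute and the `nunique()` method are additions not named in the notes: `dtypes` is one way to see each column's type, and `nunique()` counts distinct values directly.

```python
import pandas as pd

# Hypothetical sample of the ramen data.
ramen = pd.DataFrame({
    "Brand": ["Nissin", "Nissin", "Maruchan", "Nongshim", "Nissin"],
    "Variety": ["Chicken", "Beef", "Shrimp", "Spicy", "Seafood"],
    "Style": ["Cup", "Pack", "Cup", "Bowl", "Pack"],
    "Country": ["USA", "USA", "USA", "South Korea", "Japan"],
    "Stars": [1.5, 3.5, 2.0, 4.5, 1.5],
})

# With mixed column types, describe() summarizes the numeric columns by default.
summary = ramen.describe()
print(summary)

# Data type of each column.
print(ramen.dtypes)

# Distinct values in a column, and how many there are.
countries = ramen["Country"].unique()
n_countries = ramen["Country"].nunique()
print(countries, n_countries)
```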

## 3. Selecting Rows/Columns of a DataFrame

### 3.1 Using []

• [] is a very common operator in Python; we have already seen it used with various data structures.
• With DataFrame, we can use [] to select a column using the column's name:
• Another way to select a column is attribute access (dot notation), which works when the column name is a valid Python identifier.
• To select multiple columns at the same time, supply a list of column names inside []:
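A minimal sketch of these selections on a made-up sample (hypothetical rows). The attribute form `ramen.Stars` is assumed to be the "other way" of selecting a column; it is standard pandas behavior, but the notes do not show it explicitly.

```python
import pandas as pd

# Hypothetical sample of the ramen data.
ramen = pd.DataFrame({
    "Brand": ["Nissin", "Maruchan", "Nongshim"],
    "Variety": ["Chicken", "Shrimp", "Spicy"],
    "Style": ["Cup", "Cup", "Bowl"],
    "Country": ["USA", "USA", "South Korea"],
    "Stars": [1.5, 2.0, 4.5],
})

stars = ramen["Stars"]              # one column name -> Series
stars_attr = ramen.Stars            # attribute access -> the same Series
subset = ramen[["Brand", "Stars"]]  # list of column names -> DataFrame

print(type(stars))
print(type(subset))
```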

Notes:

• The output of selecting a column using [] is a pandas Series.
• The output of selecting multiple columns using [] is a pandas DataFrame.
• This is similar to how R sometimes returns a vector (or a tibble if you use dplyr) or a data frame.

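BIG NOTE: As a DataFrame is like a dict of Series (where each key is a column name and each value is a Series), we cannot access a single row by its position or label using []; a single value inside [] is looked up as a column name. (Slices and Boolean masks are special cases that do select rows.)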
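To make the note concrete, here is a small sketch on made-up data (hypothetical rows). A single value inside [] is treated as a column name, so asking for `ramen[0]` fails when no column is named 0. A slice inside [] is a special case that pandas applies to rows.

```python
import pandas as pd

# Hypothetical sample of the ramen data.
ramen = pd.DataFrame({
    "Brand": ["Nissin", "Maruchan", "Nongshim"],
    "Stars": [1.5, 2.0, 4.5],
})

try:
    ramen[0]  # looked up as a column named 0, which does not exist
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True
print(row_lookup_failed)

# A slice inside [] selects rows by position instead of columns.
first_two = ramen[0:2]
print(first_two)
```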

### 3.2 Using iloc

• So how do we access a row?
• One method is to use iloc, which is short for integer-location based indexing.
• iloc syntax: dataframe.iloc[<row index>, <column index>].
• To select the Stars rating of the first observation (index 0):
• To select multiple columns of the first observation:
• To select ALL columns of one row:
• Now, what is the type of the output of iloc?
• It's important to make a habit of knowing exactly what object a function returns! Always ask yourself this question when you program!
• Notice the appearance of NumPy data structures and types in pandas? There is a strong relationship between the 2 famous libraries.
• Similarly, we can select multiple rows of the same column:
• The following code segment returns the rating of the first 5 observations in this dataset.
• Another Series is returned!
• Let's try selecting multiple rows and multiple columns at the same time:
• The output is now a DataFrame! It makes sense, right?
  • One-dimensional output: Series
  • Two-dimensional output: DataFrame
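The iloc selections described in this section can be sketched as follows, on a made-up sample (hypothetical rows; Stars is the column at position 4). The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample of the ramen data.
ramen = pd.DataFrame({
    "Brand": ["Nissin", "Nissin", "Maruchan", "Nongshim", "Nissin"],
    "Variety": ["Chicken", "Beef", "Shrimp", "Spicy", "Seafood"],
    "Style": ["Cup", "Pack", "Cup", "Bowl", "Pack"],
    "Country": ["USA", "USA", "USA", "South Korea", "Japan"],
    "Stars": [1.5, 3.5, 2.0, 4.5, 1.5],
})

first_star = ramen.iloc[0, 4]            # one row, one column -> a single value
first_obs_cols = ramen.iloc[0, [0, 4]]   # one row, several columns -> Series
first_row = ramen.iloc[0, :]             # one row, all columns -> Series
first_five_stars = ramen.iloc[0:5, 4]    # several rows, one column -> Series
block = ramen.iloc[0:3, [0, 4]]          # several rows and columns -> DataFrame

# Check the type of each result (the single value is typically a NumPy float).
for result in (first_star, first_obs_cols, first_row, first_five_stars, block):
    print(type(result))
```

Note that an iloc slice such as `0:5` excludes its end position, just like slicing a Python list.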

### 3.3 Using loc

• The i in iloc stands for integer. That is why with iloc, we always use numbers for indexing.
• With loc, we use labels (names) or a Boolean list/array for indexing instead.
• For our dataset, our rows are labeled by integers, so we still use integers as our row labels.
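A short sketch of loc on a made-up sample (hypothetical rows). One detail worth knowing, which the notes do not state: a label slice in loc includes its end label, whereas an iloc slice excludes its end position.

```python
import pandas as pd

# Hypothetical sample of the ramen data; rows are labeled 0, 1, 2, ...
ramen = pd.DataFrame({
    "Brand": ["Nissin", "Nissin", "Maruchan", "Nongshim", "Nissin"],
    "Variety": ["Chicken", "Beef", "Shrimp", "Spicy", "Seafood"],
    "Style": ["Cup", "Pack", "Cup", "Bowl", "Pack"],
    "Country": ["USA", "USA", "USA", "South Korea", "Japan"],
    "Stars": [1.5, 3.5, 2.0, 4.5, 1.5],
})

one_value = ramen.loc[0, "Stars"]               # row label 0, column label "Stars"
one_row = ramen.loc[0, ["Brand", "Stars"]]      # several column labels -> Series

by_label = ramen.loc[0:2, "Stars"]              # labels 0, 1 and 2 (end included)
by_position = ramen.iloc[0:2, 4]                # positions 0 and 1 (end excluded)

high_rated = ramen.loc[ramen["Stars"] > 3, ["Brand", "Stars"]]  # Boolean rows

print(len(by_label), len(by_position))
print(high_rated)
```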

### Exercise

Try to use loc to select the following rows and columns:

• All columns except Brand for the 1st observation.
• All columns for the last observation.
• Stars for the first 5 observations.
• All columns for the first 5 observations.

## 4. Filtering DataFrame

• Let's say I only want observations of ramen from the USA!
• Then we need to filter ramen and keep only the rows where Country is USA.
• The comparison ramen['Country'] == 'USA' returns a Series of type Boolean. We can use this to filter the DataFrame!
• There are 2 ways to do this: pass the Boolean Series to [] or to loc.
• The 2nd method (loc) is PREFERRED when you will be modifying the output!
• Now, what if I want only the ones where Brand is Nissin?
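The filtering steps above can be sketched on a made-up sample (hypothetical rows). The two filtering forms shown, [] and loc, are assumed to be the "2 ways" the notes refer to. Combining conditions uses `&`, with each condition in its own parentheses.

```python
import pandas as pd

# Hypothetical sample of the ramen data.
ramen = pd.DataFrame({
    "Brand": ["Nissin", "Nissin", "Maruchan", "Nongshim", "Nissin"],
    "Variety": ["Chicken", "Beef", "Shrimp", "Spicy", "Seafood"],
    "Style": ["Cup", "Pack", "Cup", "Bowl", "Pack"],
    "Country": ["USA", "USA", "USA", "South Korea", "Japan"],
    "Stars": [1.5, 3.5, 2.0, 4.5, 1.5],
})

# Element-wise comparison: one True/False per row.
is_usa = ramen["Country"] == "USA"

usa_with_brackets = ramen[is_usa]    # way 1: Boolean Series inside []
usa_with_loc = ramen.loc[is_usa]     # way 2: Boolean Series inside loc

# Keep only the USA rows whose Brand is Nissin.
is_nissin = ramen["Brand"] == "Nissin"
nissin_usa = ramen.loc[is_usa & is_nissin]

print(usa_with_loc)
print(nissin_usa)
```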

## 5. Modifying DataFrame Values

• For example, suppose there was a data entry error: all Nissin-brand observations from the USA that received 1.5 stars should have received 2.5 stars instead.
• To fix this, we take the following steps:
  • Find all the observations that are of brand Nissin, from the USA, and have a 1.5-star rating.
  • Assign 2.5 as the new star rating.
• Let's try to do this using [] instead of loc:
• What causes this warning?
  • Chained assignment
• To discuss chaining, first, let's define some new terms:
  • Assignment: operations that set the value of a variable.
    • ramen = pd.read_csv(...)
  • Access: operations that return the value of a variable/object.
    • ramen['Stars']
  • Indexing: any assignment or access method that references a subset of the data.
    • ramen['Stars']
    • ramen['Stars'] = 0
  • Chaining: the use of more than one indexing operation back-to-back.
    • ramen[ramen['Brand'] == 'Nissin']['Stars']
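As a sketch of the correction described at the start of this section, the example below finds the matching rows and assigns the new rating in a single loc indexing operation. The data is a made-up sample (hypothetical rows), and the name `mask` is illustrative.

```python
import pandas as pd

# Hypothetical sample of the ramen data.
ramen = pd.DataFrame({
    "Brand": ["Nissin", "Nissin", "Maruchan", "Nongshim", "Nissin"],
    "Variety": ["Chicken", "Beef", "Shrimp", "Spicy", "Seafood"],
    "Style": ["Cup", "Pack", "Cup", "Bowl", "Pack"],
    "Country": ["USA", "USA", "USA", "South Korea", "Japan"],
    "Stars": [1.5, 3.5, 2.0, 4.5, 1.5],
})

# Step 1: find the Nissin observations from the USA with a 1.5-star rating.
mask = (
    (ramen["Brand"] == "Nissin")
    & (ramen["Country"] == "USA")
    & (ramen["Stars"] == 1.5)
)

# Step 2: assign 2.5 with ONE indexing operation (rows and column together).
ramen.loc[mask, "Stars"] = 2.5
print(ramen)

# The chained form below uses two indexing operations back-to-back.
# pandas warns about it, and the original DataFrame may be left unchanged:
#     ramen[mask]["Stars"] = 2.5
```

Only the first row matches all three conditions here; the Nissin row from Japan keeps its 1.5 rating.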

These lecture notes reference material from Chapter 5 of Wes McKinney's Python for Data Analysis, 2nd Ed.