Indexing Dataframes in Pandas

Vidya Menon
Dev Genius

What is an index? An index is like an address: it lets us access data points in a dataframe or a series. Both rows and columns have indexes. The row labels are called the index, and the column labels are simply the column names.

Thus, it is extremely important to understand how to index rows and/or columns. In this tutorial, let’s look at some of the common ways of indexing in Pandas.

Please install Pandas in case your system doesn’t have it installed.

! pip install pandas

Now let’s get started. Let’s read our data into the dataframe and take a look at it.

Using the data from https://www.espncricinfo.com/

The dataset has 14 rows and 12 columns (features). The leftmost column of integers (0, 1, 2, 3, …) is the index. Let’s look at some more information.

This summary tells us the number of rows and columns in our dataframe. In short, it shows us how our data looks.
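A summary of this kind can be produced with a few standard pandas calls. The sketch below does not use the article's dataset: it builds a small stand-in dataframe with placeholder team names and values, and only the column names Team, First and Total Matches are taken from the text.

```python
import pandas as pd

# Hypothetical stand-in for the World Cup table; all values are placeholders.
df = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C", "Team D", "Team E"],
    "First": [1975, 1975, 1983, 1992, 1975],
    "Total Matches": [80, 75, 60, 55, 70],
})

print(df.head())   # first rows, with the default integer index on the left
print(df.shape)    # (number of rows, number of columns)
df.info()          # column names, non-null counts and dtypes
```

On the real dataset, df.shape would report (14, 12).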

Our Indexing Methods:

  1. loc method
  2. iloc method

The first question that comes to mind is the difference between these two methods. Let’s take a small example to understand this. Assume five kids are standing in a line at positions 1 to 5. We have two ways of calling out to them: by name or by position in the line. Calling the kids by name corresponds to the loc method, and calling them by position corresponds to the iloc method. (One caveat: pandas counts positions from 0, not 1.)
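The analogy can be sketched with a small series. The names and ages here are made up for illustration. The names act as the index labels, while the position is simply where each kid stands in the line.

```python
import pandas as pd

# Hypothetical kids: the index holds their names (labels), the values their ages.
kids = pd.Series(
    [7, 8, 6, 9, 7],
    index=["Asha", "Ben", "Chitra", "Dev", "Esha"],
    name="age",
)

by_name = kids.loc["Chitra"]   # label-based: call the kid by name
by_position = kids.iloc[2]     # position-based: third in line (positions start at 0)

print(by_name, by_position)
```

Both lookups reach the same kid; only the way of addressing differs.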

loc Method:

loc is one of the most versatile ways to index a dataframe or a series in pandas. It is used to access a group of rows and columns by label(s) or with a boolean array. Note that loc is an indexer rather than a function, so it is used with square brackets: loc[].

The syntax is:

df.loc[specified rows, specified columns], where df is the name of our dataframe.

The allowed inputs are:

  • Single label
  • A list or array of labels
  • A slice object
  • A boolean array
  • A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above)
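Of these inputs, the callable form is the least self-explanatory, so here is a minimal sketch of it. The dataframe is a hypothetical stand-in with placeholder values, not the article's dataset. The callable receives the dataframe itself and must return something loc accepts, in this case a boolean series.

```python
import pandas as pd

# Hypothetical stand-in for the World Cup table; all values are placeholders.
df = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C", "Team D", "Team E"],
    "First": [1975, 1975, 1983, 1992, 1975],
    "Total Matches": [80, 75, 60, 55, 70],
})

# The lambda is called with df and returns a boolean Series used to pick rows.
rows_1975 = df.loc[lambda d: d["First"] == 1975]

print(rows_1975)
```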

Using Single label:

A single label specifies which row and/or column we want. For rows, the label is the index value of that row, and for columns, the label is the column name. For example, in our World Cup dataframe, if we want only the 3rd row along with all the columns, we would use the following:

Using a single label

Here we specified the label we want, which is 2 (the 3rd row has label 2 because the default index starts at 0), and the colon (:) indicates that we want all the columns in that row.
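A minimal sketch of this selection, using a hypothetical stand-in dataframe with placeholder values rather than the article's dataset:

```python
import pandas as pd

# Hypothetical stand-in for the World Cup table; all values are placeholders.
df = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C", "Team D", "Team E"],
    "First": [1975, 1975, 1983, 1992, 1975],
    "Total Matches": [80, 75, 60, 55, 70],
})

# Row with index label 2 (the 3rd row), all columns.
third_row = df.loc[2, :]

print(third_row)
```

Selecting a single row this way returns a series whose index is made up of the column names.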

List or Array of Labels:

What if we want multiple rows and/or columns instead of just one? With a list or array of labels, we can achieve that. Let’s take a look:

Using a list of labels

What if we want only specific columns for those rows?
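Both cases can be sketched as follows. The dataframe is again a hypothetical stand-in with placeholder values, and the particular row labels chosen are arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for the World Cup table; all values are placeholders.
df = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C", "Team D", "Team E"],
    "First": [1975, 1975, 1983, 1992, 1975],
    "Total Matches": [80, 75, 60, 55, 70],
})

# A list of row labels: rows 0, 2 and 4 with every column.
some_rows = df.loc[[0, 2, 4]]

# A list of row labels plus a list of column labels.
some_rows_and_cols = df.loc[[0, 2, 4], ["Team", "First"]]

print(some_rows)
print(some_rows_and_cols)
```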

Slice Object:

The slice notation specifies a start label and a stop label. With loc, both the start AND the stop labels are included in the output.

All the columns starting from Team to Total Matches are displayed
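A short sketch of label slicing on rows and on columns, using a hypothetical stand-in dataframe with placeholder values. The column order here is an assumption; the real table has more columns between Team and Total Matches.

```python
import pandas as pd

# Hypothetical stand-in for the World Cup table; all values are placeholders.
df = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C", "Team D", "Team E"],
    "First": [1975, 1975, 1983, 1992, 1975],
    "Total Matches": [80, 75, 60, 55, 70],
})

# Row labels 1 through 3: the stop label 3 is included.
rows_1_to_3 = df.loc[1:3]

# All rows, columns from "Team" through "Total Matches" (both ends included).
team_to_total = df.loc[:, "Team":"Total Matches"]

print(rows_1_to_3)
print(team_to_total)
```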

Boolean Array:

Last but not least, we can use an array of boolean values. However, this array must have the same length as the axis we are using it on. For example, our World Cup dataframe has a shape of (14, 12), meaning it has 14 rows and 12 columns. So a boolean array that specifies rows needs 14 elements, and a boolean array that specifies columns needs 12 elements.

For example, let’s say we wanted to select only the rows that have 1975 in the column First:

Note that this returns a pandas Series (an array-like object) with 14 elements, one per row, made up of boolean values (True or False). This is exactly the number of values we need in order to use this boolean array to specify our rows with the loc method. Another way to use this method is:
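The sketch below shows the boolean array being built first and then passed to loc, followed by the same comparison written inline. Treating the inline form as the "other way" is an assumption, since the original snippet is not available. The dataframe is a hypothetical stand-in with placeholder values, so the mask has 5 elements here instead of 14.

```python
import pandas as pd

# Hypothetical stand-in for the World Cup table; all values are placeholders.
df = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C", "Team D", "Team E"],
    "First": [1975, 1975, 1983, 1992, 1975],
    "Total Matches": [80, 75, 60, 55, 70],
})

# Step 1: the comparison yields one True/False value per row.
mask = df["First"] == 1975
print(mask)

# Step 2: pass the boolean Series to loc to keep only the True rows.
selected = df.loc[mask]

# Same selection with the comparison written inline (assumed alternative).
selected_inline = df.loc[df["First"] == 1975]

print(selected)
```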

iloc Method:

On a dataframe, iloc returns a pandas Series when a single row or a single column is selected with an integer, and a pandas DataFrame when multiple rows or columns are selected (for example, with a slice or a list). Like loc, iloc can be used with both a dataframe and a series. The i in iloc stands for integer: selection is by integer position rather than by label.

The syntax for iloc:

df.iloc[<row selection>, <column selection>]

iloc is used to select rows and columns by number, in the order that they appear in the dataframe. Each row has a row number from 0 to df.shape[0] - 1 (df.shape[0] is the number of rows), and iloc[] allows selections based on these numbers. The same applies to columns, which are numbered from 0 to df.shape[1] - 1 (df.shape[1] is the number of columns). There are two ‘arguments’ to iloc: a row selector and a column selector.
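The row and column selectors, and the type of object each selection returns, can be sketched like this. The dataframe is a hypothetical stand-in with placeholder values, not the article's dataset.

```python
import pandas as pd

# Hypothetical stand-in for the World Cup table; all values are placeholders.
df = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C", "Team D", "Team E"],
    "First": [1975, 1975, 1983, 1992, 1975],
    "Total Matches": [80, 75, 60, 55, 70],
})

one_row = df.iloc[0]          # single row by position -> Series
two_rows = df.iloc[0:2]       # slice of rows -> DataFrame
one_col = df.iloc[:, 0]       # single column by position -> Series
cell = df.iloc[1, 0]          # row 1, column 0 -> a single value

# The last valid row position is df.shape[0] - 1.
last_row = df.iloc[df.shape[0] - 1]

print(type(one_row).__name__, type(two_rows).__name__, cell)
```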

In our dataframe, we have not changed the index, so the default index is just the integer location of each row. So, let’s try using a slice object to specify our rows using the iloc method:

When using a slice object with the iloc method, the stop integer location is NOT included in the result, so we see only rows 6, 7 and 8. This is one of the differences from the loc method, where both the start and stop labels are included.
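The contrast can be sketched with the slice 6:9, which is inferred from the rows mentioned in the text rather than taken from the original snippet. The ten-row dataframe is a hypothetical stand-in with placeholder values.

```python
import pandas as pd

# Hypothetical ten-row stand-in with the default integer index; values are placeholders.
df = pd.DataFrame({
    "Team": [f"Team {i}" for i in range(10)],
    "Total Matches": list(range(10, 110, 10)),
})

by_position = df.iloc[6:9]   # positions 6, 7, 8: the stop position 9 is excluded
by_label = df.loc[6:9]       # labels 6, 7, 8, 9: the stop label 9 is included

print(by_position)
print(by_label)
```

With the default index the positions and labels coincide, so the only visible difference is the extra row at the stop value.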

Now let’s use the iloc method to specify the columns we want:
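A minimal sketch of column selection by position, using a hypothetical stand-in dataframe with placeholder values. The column positions chosen are arbitrary and not those of the original example.

```python
import pandas as pd

# Hypothetical stand-in for the World Cup table; all values are placeholders.
df = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C", "Team D", "Team E"],
    "First": [1975, 1975, 1983, 1992, 1975],
    "Total Matches": [80, 75, 60, 55, 70],
})

# All rows, columns at positions 0 and 1 (the stop position 2 is excluded).
first_two_cols = df.iloc[:, 0:2]

# All rows, columns at positions 0 and 2, given as a list.
picked_cols = df.iloc[:, [0, 2]]

# Row and column selectors combined: first two rows, columns 0 and 2.
corner = df.iloc[0:2, [0, 2]]

print(first_two_cols)
print(picked_cols)
print(corner)
```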

Conclusion

In this tutorial, we learnt how to index a dataframe with both the loc and iloc methods. We got to know that the loc method works with labels of the rows and columns, whereas the iloc method works with the integer locations.
