Key Concepts for Selecting Data from a DataFrame in Python

As a data scientist or analyst, you often use Python to manipulate data, and the pandas and numpy packages are the popular way to do it. However, as a beginner or intermediate Python user, you may be confused about how to get data out of a DataFrame and end up googling it, because there are many different ways to get the same result. At least, that was my situation. If it is not yours, congratulations, you're an expert! I'll introduce the key concepts to remember, using sample data.

First, import the two packages and load the data.

import pandas as pd
import numpy as np
marriage = pd.read_csv('data/Marriage Data.csv',index_col=0)
marriage = marriage.rename(columns= {'月份區域別':'area'})
>>> marriage
    area month  count  year
0    桃園市    一月    336   102
1    中壢市    一月    281   102
2    平鎮市    一月    159   102
3    八德市    一月    145   102
4    楊梅市    一月    113   102
..   ...   ...    ...   ...
151  龍潭區   十二月     92   109
152  平鎮區   十二月    167   109
153  新屋區   十二月     41   109
154  觀音區   十二月     45   109
155  復興區   十二月     10   109

[1248 rows x 4 columns]
>>> type(marriage)
<class 'pandas.core.frame.DataFrame'>
>>> marriage['area']
0      桃園市
1      中壢市
2      平鎮市
3      八德市
4      楊梅市
      ... 
151    龍潭區
152    平鎮區
153    新屋區
154    觀音區
155    復興區
Name: area, Length: 1248, dtype: object
>>> type(marriage['area'])
<class 'pandas.core.series.Series'>
>>> marriage.area
0      桃園市
1      中壢市
2      平鎮市
3      八德市
4      楊梅市
      ... 
151    龍潭區
152    平鎮區
153    新屋區
154    觀音區
155    復興區
Name: area, Length: 1248, dtype: object
>>> type(marriage.area)
<class 'pandas.core.series.Series'>

So marriage['area'] and marriage.area return the same Series.
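
One caveat is worth keeping in mind: attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method. Our data has a column named count, which is also the name of a DataFrame method, so the two spellings behave differently there. Here is a minimal sketch. The small frame `sample` is a stand-in I built from the first three rows printed above, not the real file.

```python
import pandas as pd

# Stand-in for the marriage data: first three rows from the output above.
sample = pd.DataFrame({
    "area": ["桃園市", "中壢市", "平鎮市"],
    "month": ["一月", "一月", "一月"],
    "count": [336, 281, 159],
    "year": [102, 102, 102],
})

# For an ordinary column name, both spellings return the same Series.
same_area = sample["area"].equals(sample.area)

# "count" collides with the DataFrame.count method, so attribute access
# returns the bound method, while brackets still return the column.
count_attr_is_method = callable(sample.count)
count_column = sample["count"]

print(same_area)
print(count_attr_is_method)
print(type(count_column))
```

This is one more reason to prefer an explicit, label-based way of selecting columns.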

To keep the Python code consistent and clear, we choose to use .loc and .iloc. .loc gets data by label and .iloc gets data by integer position. The following demos show each.
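
The difference between label and location is easiest to see when the row labels are not plain integers. Below is a minimal sketch on a small stand-in frame. The name `sample` and the row labels "a", "b", "c" are my own assumptions for illustration; the values come from the first rows of the output above.

```python
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ.
sample = pd.DataFrame(
    {"area": ["桃園市", "中壢市", "平鎮市"], "count": [336, 281, 159]},
    index=["a", "b", "c"],
)

# .loc takes labels: row label "b", column label "count".
by_label = sample.loc["b", "count"]

# .iloc takes integer positions: second row, second column.
by_position = sample.iloc[1, 1]

# Label slices include the end label; position slices exclude the end position.
label_slice = sample.loc["a":"b", "area"]
position_slice = sample.iloc[0:2, 0]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Both selections pick the same cell and the same two rows; only the way of addressing them differs.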

By a label value:

>>> marriage.loc[:,'area']
0      桃園市
1      中壢市
2      平鎮市
3      八德市
4      楊梅市
      ... 
151    龍潭區
152    平鎮區
153    新屋區
154    觀音區
155    復興區
Name: area, Length: 1248, dtype: object
>>> type(marriage.loc[:,'area'])
<class 'pandas.core.series.Series'>

By a location value:

>>> marriage.iloc[:,0]
0      桃園市
1      中壢市
2      平鎮市
3      八德市
4      楊梅市
      ... 
151    龍潭區
152    平鎮區
153    新屋區
154    觀音區
155    復興區
Name: area, Length: 1248, dtype: object
>>> type(marriage.iloc[:,0])
<class 'pandas.core.series.Series'>

By a label list:

>>> marriage.loc[:,['area']]
    area
0    桃園市
1    中壢市
2    平鎮市
3    八德市
4    楊梅市
..   ...
151  龍潭區
152  平鎮區
153  新屋區
154  觀音區
155  復興區

[1248 rows x 1 columns]
>>> type(marriage.loc[:,['area']])
<class 'pandas.core.frame.DataFrame'>

By a location slice:

>>> marriage.iloc[:,0:1]
    area
0    桃園市
1    中壢市
2    平鎮市
3    八德市
4    楊梅市
..   ...
151  龍潭區
152  平鎮區
153  新屋區
154  觀音區
155  復興區

[1248 rows x 1 columns]
>>> type(marriage.iloc[:,0:1])
<class 'pandas.core.frame.DataFrame'>

The key points to remember:

  • use .loc to select a Series or DataFrame by label
  • use .iloc to select a Series or DataFrame by location
  • pass a single value (one label or one position) to get a Series
  • pass a list (or a slice such as 0:1) to get a DataFrame
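
The points above can be checked in a few lines. This is only a sketch on a small stand-in frame (`sample` is a made-up name, with values taken from the first rows of the output above), and it also includes the list form of .iloc, which the demos did not show.

```python
import pandas as pd

# Stand-in frame with two of the columns from the marriage data.
sample = pd.DataFrame(
    {"area": ["桃園市", "中壢市", "平鎮市"], "count": [336, 281, 159]}
)

# A single label or a single position returns a Series.
series_by_label = sample.loc[:, "area"]
series_by_position = sample.iloc[:, 0]

# A list (or a slice) returns a DataFrame, even with only one column.
frame_by_label_list = sample.loc[:, ["area"]]
frame_by_position_list = sample.iloc[:, [0]]
frame_by_position_slice = sample.iloc[:, 0:1]

results = {
    "loc value": series_by_label,
    "iloc value": series_by_position,
    "loc list": frame_by_label_list,
    "iloc list": frame_by_position_list,
    "iloc slice": frame_by_position_slice,
}
for name, obj in results.items():
    print(name, type(obj).__name__)
```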

Now your Python code has a consistent format for getting the data you need.