Text files and CSV¶
Pandas¶
- Data in tabular form? Use Pandas package 
- It can easily read and write text/ascii/csv files 
- Data is messy? Read the messy data and clean in Pandas 
- Pandas can deal with column renaming, missing values, column datatypes, date and time settings and timezones 
- Want to plot all numerical columns in the same axis? Pandas plotting method got you covered 
- Pandas documentation page has a 10 min guide to get you started 
Exercise¶
We will use Pandas to do the following tasks
- Read rainfall data (accessible from data/ directory) for 36 subdivisions in India 
- plot summer monsoon rainfall for all divisions for a given year 
- tidy the data and index it by date 
- plot time series for selected subdivisions 
Data used here is provided by India Meteorological Department(IMD), Govt. of India and downloaded from here
import pandas as pd
import matplotlib.pyplot as plt
# read the data
df = pd.read_csv('data/rainfall_in_india_1901-2015.csv',na_values='nan')
# display first 5 rows
df.head()
| SUBDIVISION | YEAR | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | ANNUAL | Jan-Feb | Mar-May | Jun-Sep | Oct-Dec | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ANDAMAN & NICOBAR ISLANDS | 1901 | 49.2 | 87.1 | 29.2 | 2.3 | 528.8 | 517.5 | 365.1 | 481.1 | 332.6 | 388.5 | 558.2 | 33.6 | 3373.2 | 136.3 | 560.3 | 1696.3 | 980.3 | 
| 1 | ANDAMAN & NICOBAR ISLANDS | 1902 | 0.0 | 159.8 | 12.2 | 0.0 | 446.1 | 537.1 | 228.9 | 753.7 | 666.2 | 197.2 | 359.0 | 160.5 | 3520.7 | 159.8 | 458.3 | 2185.9 | 716.7 | 
| 2 | ANDAMAN & NICOBAR ISLANDS | 1903 | 12.7 | 144.0 | 0.0 | 1.0 | 235.1 | 479.9 | 728.4 | 326.7 | 339.0 | 181.2 | 284.4 | 225.0 | 2957.4 | 156.7 | 236.1 | 1874.0 | 690.6 | 
| 3 | ANDAMAN & NICOBAR ISLANDS | 1904 | 9.4 | 14.7 | 0.0 | 202.4 | 304.5 | 495.1 | 502.0 | 160.1 | 820.4 | 222.2 | 308.7 | 40.1 | 3079.6 | 24.1 | 506.9 | 1977.6 | 571.0 | 
| 4 | ANDAMAN & NICOBAR ISLANDS | 1905 | 1.3 | 0.0 | 3.3 | 26.9 | 279.5 | 628.7 | 368.7 | 330.5 | 297.0 | 260.7 | 25.4 | 344.7 | 2566.7 | 1.3 | 309.7 | 1624.9 | 630.8 | 
# display last 5 rows
df.tail()
| SUBDIVISION | YEAR | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | ANNUAL | Jan-Feb | Mar-May | Jun-Sep | Oct-Dec | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4111 | LAKSHADWEEP | 2011 | 5.1 | 2.8 | 3.1 | 85.9 | 107.2 | 153.6 | 350.2 | 254.0 | 255.2 | 117.4 | 184.3 | 14.9 | 1533.7 | 7.9 | 196.2 | 1013.0 | 316.6 | 
| 4112 | LAKSHADWEEP | 2012 | 19.2 | 0.1 | 1.6 | 76.8 | 21.2 | 327.0 | 231.5 | 381.2 | 179.8 | 145.9 | 12.4 | 8.8 | 1405.5 | 19.3 | 99.6 | 1119.5 | 167.1 | 
| 4113 | LAKSHADWEEP | 2013 | 26.2 | 34.4 | 37.5 | 5.3 | 88.3 | 426.2 | 296.4 | 154.4 | 180.0 | 72.8 | 78.1 | 26.7 | 1426.3 | 60.6 | 131.1 | 1057.0 | 177.6 | 
| 4114 | LAKSHADWEEP | 2014 | 53.2 | 16.1 | 4.4 | 14.9 | 57.4 | 244.1 | 116.1 | 466.1 | 132.2 | 169.2 | 59.0 | 62.3 | 1395.0 | 69.3 | 76.7 | 958.5 | 290.5 | 
| 4115 | LAKSHADWEEP | 2015 | 2.2 | 0.5 | 3.7 | 87.1 | 133.1 | 296.6 | 257.5 | 146.4 | 160.4 | 165.4 | 231.0 | 159.0 | 1642.9 | 2.7 | 223.9 | 860.9 | 555.4 | 
# prints info of row and column types
# and the number of non-null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4116 entries, 0 to 4115
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   SUBDIVISION  4116 non-null   object 
 1   YEAR         4116 non-null   int64  
 2   JAN          4112 non-null   float64
 3   FEB          4113 non-null   float64
 4   MAR          4110 non-null   float64
 5   APR          4112 non-null   float64
 6   MAY          4113 non-null   float64
 7   JUN          4111 non-null   float64
 8   JUL          4109 non-null   float64
 9   AUG          4112 non-null   float64
 10  SEP          4110 non-null   float64
 11  OCT          4109 non-null   float64
 12  NOV          4105 non-null   float64
 13  DEC          4106 non-null   float64
 14  ANNUAL       4090 non-null   float64
 15  Jan-Feb      4110 non-null   float64
 16  Mar-May      4107 non-null   float64
 17  Jun-Sep      4106 non-null   float64
 18  Oct-Dec      4103 non-null   float64
dtypes: float64(17), int64(1), object(1)
memory usage: 611.1+ KB
# get the number of null values
df.loc[:,'YEAR':'DEC'].isnull().sum()
YEAR     0
JAN      4
FEB      3
MAR      6
APR      4
MAY      3
JUN      5
JUL      7
AUG      4
SEP      6
OCT      7
NOV     11
DEC     10
dtype: int64
# unique values in a column
df['SUBDIVISION'].unique()
array(['ANDAMAN & NICOBAR ISLANDS', 'ARUNACHAL PRADESH',
       'ASSAM & MEGHALAYA', 'NAGA MANI MIZO TRIPURA',
       'SUB HIMALAYAN WEST BENGAL & SIKKIM', 'GANGETIC WEST BENGAL',
       'ORISSA', 'JHARKHAND', 'BIHAR', 'EAST UTTAR PRADESH',
       'WEST UTTAR PRADESH', 'UTTARAKHAND', 'HARYANA DELHI & CHANDIGARH',
       'PUNJAB', 'HIMACHAL PRADESH', 'JAMMU & KASHMIR', 'WEST RAJASTHAN',
       'EAST RAJASTHAN', 'WEST MADHYA PRADESH', 'EAST MADHYA PRADESH',
       'GUJARAT REGION', 'SAURASHTRA & KUTCH', 'KONKAN & GOA',
       'MADHYA MAHARASHTRA', 'MATATHWADA', 'VIDARBHA', 'CHHATTISGARH',
       'COASTAL ANDHRA PRADESH', 'TELANGANA', 'RAYALSEEMA', 'TAMIL NADU',
       'COASTAL KARNATAKA', 'NORTH INTERIOR KARNATAKA',
       'SOUTH INTERIOR KARNATAKA', 'KERALA', 'LAKSHADWEEP'], dtype=object)
# index by year
df_year = df.set_index('YEAR')
df_year.head(10)
| SUBDIVISION | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | ANNUAL | Jan-Feb | Mar-May | Jun-Sep | Oct-Dec | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YEAR | ||||||||||||||||||
| 1901 | ANDAMAN & NICOBAR ISLANDS | 49.2 | 87.1 | 29.2 | 2.3 | 528.8 | 517.5 | 365.1 | 481.1 | 332.6 | 388.5 | 558.2 | 33.6 | 3373.2 | 136.3 | 560.3 | 1696.3 | 980.3 | 
| 1902 | ANDAMAN & NICOBAR ISLANDS | 0.0 | 159.8 | 12.2 | 0.0 | 446.1 | 537.1 | 228.9 | 753.7 | 666.2 | 197.2 | 359.0 | 160.5 | 3520.7 | 159.8 | 458.3 | 2185.9 | 716.7 | 
| 1903 | ANDAMAN & NICOBAR ISLANDS | 12.7 | 144.0 | 0.0 | 1.0 | 235.1 | 479.9 | 728.4 | 326.7 | 339.0 | 181.2 | 284.4 | 225.0 | 2957.4 | 156.7 | 236.1 | 1874.0 | 690.6 | 
| 1904 | ANDAMAN & NICOBAR ISLANDS | 9.4 | 14.7 | 0.0 | 202.4 | 304.5 | 495.1 | 502.0 | 160.1 | 820.4 | 222.2 | 308.7 | 40.1 | 3079.6 | 24.1 | 506.9 | 1977.6 | 571.0 | 
| 1905 | ANDAMAN & NICOBAR ISLANDS | 1.3 | 0.0 | 3.3 | 26.9 | 279.5 | 628.7 | 368.7 | 330.5 | 297.0 | 260.7 | 25.4 | 344.7 | 2566.7 | 1.3 | 309.7 | 1624.9 | 630.8 | 
| 1906 | ANDAMAN & NICOBAR ISLANDS | 36.6 | 0.0 | 0.0 | 0.0 | 556.1 | 733.3 | 247.7 | 320.5 | 164.3 | 267.8 | 128.9 | 79.2 | 2534.4 | 36.6 | 556.1 | 1465.8 | 475.9 | 
| 1907 | ANDAMAN & NICOBAR ISLANDS | 110.7 | 0.0 | 113.3 | 21.6 | 616.3 | 305.2 | 443.9 | 377.6 | 200.4 | 264.4 | 648.9 | 245.6 | 3347.9 | 110.7 | 751.2 | 1327.1 | 1158.9 | 
| 1908 | ANDAMAN & NICOBAR ISLANDS | 20.9 | 85.1 | 0.0 | 29.0 | 562.0 | 693.6 | 481.4 | 699.9 | 428.8 | 170.7 | 208.1 | 196.9 | 3576.4 | 106.0 | 591.0 | 2303.7 | 575.7 | 
| 1910 | ANDAMAN & NICOBAR ISLANDS | 26.6 | 22.7 | 206.3 | 89.3 | 224.5 | 472.7 | 264.3 | 337.4 | 626.6 | 208.2 | 267.3 | 153.5 | 2899.4 | 49.3 | 520.1 | 1701.0 | 629.0 | 
| 1911 | ANDAMAN & NICOBAR ISLANDS | 0.0 | 8.4 | 0.0 | 122.5 | 327.3 | 649.0 | 253.0 | 187.1 | 464.5 | 333.8 | 94.5 | 247.1 | 2687.2 | 8.4 | 449.8 | 1553.6 | 675.4 | 
# cell magic command to get plots inline
%matplotlib inline 
# select a year and plot rainfall through jun-sep for all subdivision
df_year.loc[2003,['SUBDIVISION','Jun-Sep']].plot(x='SUBDIVISION',kind='bar',figsize=(12,4))
<AxesSubplot:xlabel='SUBDIVISION'>
 
# function to convert datetime from strings to python recognized type for datetime
def tidy(df):
    df = df.copy()
    df['date'] = pd.to_datetime(df['year'].astype(str)+df['month'].astype(str),format='%Y%b')
    return df.set_index('date')
# reorient the data so that dates become the index and 
# drop the columns to the right of "ANNUAL" column
df_clean= (df.loc[:,'SUBDIVISION':'DEC']
           .dropna(how='any')
          .set_index(['SUBDIVISION','YEAR'])
          .stack()
          .reset_index()
          .rename(columns={'YEAR':'year','level_2':'month',0:'precip','SUBDIVISION':'subdivision'})
          .reindex(columns=['subdivision','year','month','precip'])
          .pipe(tidy) # calls function tidy with preceeding dataframe as argument
          .drop(columns=['year','month'])
         )
df_clean.tail()
| subdivision | precip | |
|---|---|---|
| date | ||
| 2015-08-01 | LAKSHADWEEP | 146.4 | 
| 2015-09-01 | LAKSHADWEEP | 160.4 | 
| 2015-10-01 | LAKSHADWEEP | 165.4 | 
| 2015-11-01 | LAKSHADWEEP | 231.0 | 
| 2015-12-01 | LAKSHADWEEP | 159.0 | 
# select data for Kerala and plot
df_krl = df_clean.loc[df_clean['subdivision']=='KERALA',['precip']]
df_krl.loc['1990':'2015'].plot(figsize=(21,3))
plt.suptitle("Monthly rainfall in Kerala",fontweight='bold')
Text(0.5, 0.98, 'Monthly rainfall in Kerala')
 
# select rainfall for Telangana state and join it to the dataframe for Kerala
df_tel = df_clean.loc[df_clean['subdivision']=='TELANGANA',['precip']]
df_join = df_tel.join(df_krl,lsuffix='_telangana',rsuffix='_kerala')
df_join.head()
| precip_telangana | precip_kerala | |
|---|---|---|
| date | ||
| 1901-01-01 | 6.9 | 28.7 | 
| 1901-02-01 | 41.8 | 44.7 | 
| 1901-03-01 | 7.8 | 51.6 | 
| 1901-04-01 | 45.2 | 160.0 | 
| 1901-05-01 | 22.0 | 174.7 | 
df_join.plot(kind='box')
plt.suptitle("Rainfall distribution",fontweight='bold')
Text(0.5, 0.98, 'Rainfall distribution')
 
# groupby subdivisions and plot rainall time series for three selected subdivisions
selected_subd = ['KERALA','ARUNACHAL PRADESH','ASSAM & MEGHALAYA']
(df_clean.loc[df_clean['subdivision'].isin(selected_subd)]
         .groupby('subdivision').plot(figsize=(12,4),xlim=('1990','2010')))
subdivision
ARUNACHAL PRADESH       AxesSubplot(0.125,0.2;0.775x0.68)
ASSAM & MEGHALAYA    AxesSubplot(0.125,0.125;0.775x0.755)
KERALA               AxesSubplot(0.125,0.125;0.775x0.755)
dtype: object
 
 
 
Further references:¶
- Definitely checout Pandas documentation page. 
- A course of basic Pandas can be found in the realpython website. 
- Also checkout tomaugspurger blog on Modern Pandas 
