Text files and CSV

Pandas

  • Data in tabular form? Use Pandas package

  • It can easily read and write text/ascii/csv files

  • Data is messy? Read the messy data and clean in Pandas

  • Pandas can deal with column renaming, missing values, column datatypes, date and time settings and timezones

  • Want to plot all numerical columns in the same axis? Pandas plotting method got you covered

  • Pandas documentation page has a 10 min guide to get you started

Exercise

We will use Pandas to do the following tasks

  • Read rainfall data (accessible from data/ directory) for 36 subdivisions in India

  • plot summer monsoon rainfall for all divisions for a given year

  • tidy the data and index it by date

  • plot time series for selected subdivisions

Data used here is provided by India Meteorological Department(IMD), Govt. of India and downloaded from here

import pandas as pd
import matplotlib.pyplot as plt
# read the data
df = pd.read_csv('data/rainfall_in_india_1901-2015.csv',na_values='nan')
# display first 5 rows
df.head()
SUBDIVISION YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC ANNUAL Jan-Feb Mar-May Jun-Sep Oct-Dec
0 ANDAMAN & NICOBAR ISLANDS 1901 49.2 87.1 29.2 2.3 528.8 517.5 365.1 481.1 332.6 388.5 558.2 33.6 3373.2 136.3 560.3 1696.3 980.3
1 ANDAMAN & NICOBAR ISLANDS 1902 0.0 159.8 12.2 0.0 446.1 537.1 228.9 753.7 666.2 197.2 359.0 160.5 3520.7 159.8 458.3 2185.9 716.7
2 ANDAMAN & NICOBAR ISLANDS 1903 12.7 144.0 0.0 1.0 235.1 479.9 728.4 326.7 339.0 181.2 284.4 225.0 2957.4 156.7 236.1 1874.0 690.6
3 ANDAMAN & NICOBAR ISLANDS 1904 9.4 14.7 0.0 202.4 304.5 495.1 502.0 160.1 820.4 222.2 308.7 40.1 3079.6 24.1 506.9 1977.6 571.0
4 ANDAMAN & NICOBAR ISLANDS 1905 1.3 0.0 3.3 26.9 279.5 628.7 368.7 330.5 297.0 260.7 25.4 344.7 2566.7 1.3 309.7 1624.9 630.8
# display last 5 rows
df.tail()
SUBDIVISION YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC ANNUAL Jan-Feb Mar-May Jun-Sep Oct-Dec
4111 LAKSHADWEEP 2011 5.1 2.8 3.1 85.9 107.2 153.6 350.2 254.0 255.2 117.4 184.3 14.9 1533.7 7.9 196.2 1013.0 316.6
4112 LAKSHADWEEP 2012 19.2 0.1 1.6 76.8 21.2 327.0 231.5 381.2 179.8 145.9 12.4 8.8 1405.5 19.3 99.6 1119.5 167.1
4113 LAKSHADWEEP 2013 26.2 34.4 37.5 5.3 88.3 426.2 296.4 154.4 180.0 72.8 78.1 26.7 1426.3 60.6 131.1 1057.0 177.6
4114 LAKSHADWEEP 2014 53.2 16.1 4.4 14.9 57.4 244.1 116.1 466.1 132.2 169.2 59.0 62.3 1395.0 69.3 76.7 958.5 290.5
4115 LAKSHADWEEP 2015 2.2 0.5 3.7 87.1 133.1 296.6 257.5 146.4 160.4 165.4 231.0 159.0 1642.9 2.7 223.9 860.9 555.4
# prints info of row and column types
# and the number of non-null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4116 entries, 0 to 4115
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   SUBDIVISION  4116 non-null   object 
 1   YEAR         4116 non-null   int64  
 2   JAN          4112 non-null   float64
 3   FEB          4113 non-null   float64
 4   MAR          4110 non-null   float64
 5   APR          4112 non-null   float64
 6   MAY          4113 non-null   float64
 7   JUN          4111 non-null   float64
 8   JUL          4109 non-null   float64
 9   AUG          4112 non-null   float64
 10  SEP          4110 non-null   float64
 11  OCT          4109 non-null   float64
 12  NOV          4105 non-null   float64
 13  DEC          4106 non-null   float64
 14  ANNUAL       4090 non-null   float64
 15  Jan-Feb      4110 non-null   float64
 16  Mar-May      4107 non-null   float64
 17  Jun-Sep      4106 non-null   float64
 18  Oct-Dec      4103 non-null   float64
dtypes: float64(17), int64(1), object(1)
memory usage: 611.1+ KB
# get the number of null values
df.loc[:,'YEAR':'DEC'].isnull().sum()
YEAR     0
JAN      4
FEB      3
MAR      6
APR      4
MAY      3
JUN      5
JUL      7
AUG      4
SEP      6
OCT      7
NOV     11
DEC     10
dtype: int64
# unique values in a column
df['SUBDIVISION'].unique()
array(['ANDAMAN & NICOBAR ISLANDS', 'ARUNACHAL PRADESH',
       'ASSAM & MEGHALAYA', 'NAGA MANI MIZO TRIPURA',
       'SUB HIMALAYAN WEST BENGAL & SIKKIM', 'GANGETIC WEST BENGAL',
       'ORISSA', 'JHARKHAND', 'BIHAR', 'EAST UTTAR PRADESH',
       'WEST UTTAR PRADESH', 'UTTARAKHAND', 'HARYANA DELHI & CHANDIGARH',
       'PUNJAB', 'HIMACHAL PRADESH', 'JAMMU & KASHMIR', 'WEST RAJASTHAN',
       'EAST RAJASTHAN', 'WEST MADHYA PRADESH', 'EAST MADHYA PRADESH',
       'GUJARAT REGION', 'SAURASHTRA & KUTCH', 'KONKAN & GOA',
       'MADHYA MAHARASHTRA', 'MATATHWADA', 'VIDARBHA', 'CHHATTISGARH',
       'COASTAL ANDHRA PRADESH', 'TELANGANA', 'RAYALSEEMA', 'TAMIL NADU',
       'COASTAL KARNATAKA', 'NORTH INTERIOR KARNATAKA',
       'SOUTH INTERIOR KARNATAKA', 'KERALA', 'LAKSHADWEEP'], dtype=object)
# index by year
df_year = df.set_index('YEAR')
df_year.head(10)
SUBDIVISION JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC ANNUAL Jan-Feb Mar-May Jun-Sep Oct-Dec
YEAR
1901 ANDAMAN & NICOBAR ISLANDS 49.2 87.1 29.2 2.3 528.8 517.5 365.1 481.1 332.6 388.5 558.2 33.6 3373.2 136.3 560.3 1696.3 980.3
1902 ANDAMAN & NICOBAR ISLANDS 0.0 159.8 12.2 0.0 446.1 537.1 228.9 753.7 666.2 197.2 359.0 160.5 3520.7 159.8 458.3 2185.9 716.7
1903 ANDAMAN & NICOBAR ISLANDS 12.7 144.0 0.0 1.0 235.1 479.9 728.4 326.7 339.0 181.2 284.4 225.0 2957.4 156.7 236.1 1874.0 690.6
1904 ANDAMAN & NICOBAR ISLANDS 9.4 14.7 0.0 202.4 304.5 495.1 502.0 160.1 820.4 222.2 308.7 40.1 3079.6 24.1 506.9 1977.6 571.0
1905 ANDAMAN & NICOBAR ISLANDS 1.3 0.0 3.3 26.9 279.5 628.7 368.7 330.5 297.0 260.7 25.4 344.7 2566.7 1.3 309.7 1624.9 630.8
1906 ANDAMAN & NICOBAR ISLANDS 36.6 0.0 0.0 0.0 556.1 733.3 247.7 320.5 164.3 267.8 128.9 79.2 2534.4 36.6 556.1 1465.8 475.9
1907 ANDAMAN & NICOBAR ISLANDS 110.7 0.0 113.3 21.6 616.3 305.2 443.9 377.6 200.4 264.4 648.9 245.6 3347.9 110.7 751.2 1327.1 1158.9
1908 ANDAMAN & NICOBAR ISLANDS 20.9 85.1 0.0 29.0 562.0 693.6 481.4 699.9 428.8 170.7 208.1 196.9 3576.4 106.0 591.0 2303.7 575.7
1910 ANDAMAN & NICOBAR ISLANDS 26.6 22.7 206.3 89.3 224.5 472.7 264.3 337.4 626.6 208.2 267.3 153.5 2899.4 49.3 520.1 1701.0 629.0
1911 ANDAMAN & NICOBAR ISLANDS 0.0 8.4 0.0 122.5 327.3 649.0 253.0 187.1 464.5 333.8 94.5 247.1 2687.2 8.4 449.8 1553.6 675.4
# cell magic command to get plots inline
%matplotlib inline 
# select a year and plot rainfall through jun-sep for all subdivision
df_year.loc[2003,['SUBDIVISION','Jun-Sep']].plot(x='SUBDIVISION',kind='bar',figsize=(12,4))
<AxesSubplot:xlabel='SUBDIVISION'>
_images/L3-working_with_ASCII_csv_txt_11_1.png
# function to convert datetime from strings to python recognized type for datetime
def tidy(df):
    df = df.copy()
    df['date'] = pd.to_datetime(df['year'].astype(str)+df['month'].astype(str),format='%Y%b')
    return df.set_index('date')
# reorient the data so that dates become the index and 
# drop the columns to the right of "ANNUAL" column
df_clean= (df.loc[:,'SUBDIVISION':'DEC']
           .dropna(how='any')
          .set_index(['SUBDIVISION','YEAR'])
          .stack()
          .reset_index()
          .rename(columns={'YEAR':'year','level_2':'month',0:'precip','SUBDIVISION':'subdivision'})
          .reindex(columns=['subdivision','year','month','precip'])
          .pipe(tidy) # calls function tidy with preceeding dataframe as argument
          .drop(columns=['year','month'])
         )
df_clean.tail()
subdivision precip
date
2015-08-01 LAKSHADWEEP 146.4
2015-09-01 LAKSHADWEEP 160.4
2015-10-01 LAKSHADWEEP 165.4
2015-11-01 LAKSHADWEEP 231.0
2015-12-01 LAKSHADWEEP 159.0
# select data for Kerala and plot
df_krl = df_clean.loc[df_clean['subdivision']=='KERALA',['precip']]
df_krl.loc['1990':'2015'].plot(figsize=(21,3))
plt.suptitle("Monthly rainfall in Kerala",fontweight='bold')
Text(0.5, 0.98, 'Monthly rainfall in Kerala')
_images/L3-working_with_ASCII_csv_txt_14_1.png
# select rainfall for Telangana state and join it to the dataframe for Kerala
df_tel = df_clean.loc[df_clean['subdivision']=='TELANGANA',['precip']]
df_join = df_tel.join(df_krl,lsuffix='_telangana',rsuffix='_kerala')
df_join.head()
precip_telangana precip_kerala
date
1901-01-01 6.9 28.7
1901-02-01 41.8 44.7
1901-03-01 7.8 51.6
1901-04-01 45.2 160.0
1901-05-01 22.0 174.7
df_join.plot(kind='box')
plt.suptitle("Rainfall distribution",fontweight='bold')
Text(0.5, 0.98, 'Rainfall distribution')
_images/L3-working_with_ASCII_csv_txt_16_1.png
# groupby subdivisions and plot rainall time series for three selected subdivisions
selected_subd = ['KERALA','ARUNACHAL PRADESH','ASSAM & MEGHALAYA']
(df_clean.loc[df_clean['subdivision'].isin(selected_subd)]
         .groupby('subdivision').plot(figsize=(12,4),xlim=('1990','2010')))
subdivision
ARUNACHAL PRADESH       AxesSubplot(0.125,0.2;0.775x0.68)
ASSAM & MEGHALAYA    AxesSubplot(0.125,0.125;0.775x0.755)
KERALA               AxesSubplot(0.125,0.125;0.775x0.755)
dtype: object
_images/L3-working_with_ASCII_csv_txt_17_1.png _images/L3-working_with_ASCII_csv_txt_17_2.png _images/L3-working_with_ASCII_csv_txt_17_3.png

Further references:

  • Definitely checout Pandas documentation page.

  • A course of basic Pandas can be found in the realpython website.

  • Also checkout tomaugspurger blog on Modern Pandas