Text files and CSV¶

Pandas¶

Data in tabular form? Use Pandas package
It can easily read and write text/ascii/csv files
Data is messy? Read the messy data and clean in Pandas
Pandas can deal with column renaming, missing values, column datatypes, date and time settings and timezones
Want to plot all numerical columns in the same axis? Pandas plotting method got you covered
Pandas documentation page has a 10 min guide to get you started

Exercise¶

We will use Pandas to do the following tasks

Read rainfall data (accessible from data/ directory) for 36 subdivisions in India
plot summer monsoon rainfall for all divisions for a given year
tidy the data and index it by date
plot time series for selected subdivisions

Data used here is provided by India Meteorological Department(IMD), Govt. of India and downloaded from here

import pandas as pd
import matplotlib.pyplot as plt

# read the data
df = pd.read_csv('data/rainfall_in_india_1901-2015.csv',na_values='nan')

# display first 5 rows
df.head()

	SUBDIVISION	YEAR	JAN	FEB	MAR	APR	MAY	JUN	JUL	AUG	SEP	OCT	NOV	DEC	ANNUAL	Jan-Feb	Mar-May	Jun-Sep	Oct-Dec
0	ANDAMAN & NICOBAR ISLANDS	1901	49.2	87.1	29.2	2.3	528.8	517.5	365.1	481.1	332.6	388.5	558.2	33.6	3373.2	136.3	560.3	1696.3	980.3
1	ANDAMAN & NICOBAR ISLANDS	1902	0.0	159.8	12.2	0.0	446.1	537.1	228.9	753.7	666.2	197.2	359.0	160.5	3520.7	159.8	458.3	2185.9	716.7
2	ANDAMAN & NICOBAR ISLANDS	1903	12.7	144.0	0.0	1.0	235.1	479.9	728.4	326.7	339.0	181.2	284.4	225.0	2957.4	156.7	236.1	1874.0	690.6
3	ANDAMAN & NICOBAR ISLANDS	1904	9.4	14.7	0.0	202.4	304.5	495.1	502.0	160.1	820.4	222.2	308.7	40.1	3079.6	24.1	506.9	1977.6	571.0
4	ANDAMAN & NICOBAR ISLANDS	1905	1.3	0.0	3.3	26.9	279.5	628.7	368.7	330.5	297.0	260.7	25.4	344.7	2566.7	1.3	309.7	1624.9	630.8

# display last 5 rows
df.tail()

	SUBDIVISION	YEAR	JAN	FEB	MAR	APR	MAY	JUN	JUL	AUG	SEP	OCT	NOV	DEC	ANNUAL	Jan-Feb	Mar-May	Jun-Sep	Oct-Dec
4111	LAKSHADWEEP	2011	5.1	2.8	3.1	85.9	107.2	153.6	350.2	254.0	255.2	117.4	184.3	14.9	1533.7	7.9	196.2	1013.0	316.6
4112	LAKSHADWEEP	2012	19.2	0.1	1.6	76.8	21.2	327.0	231.5	381.2	179.8	145.9	12.4	8.8	1405.5	19.3	99.6	1119.5	167.1
4113	LAKSHADWEEP	2013	26.2	34.4	37.5	5.3	88.3	426.2	296.4	154.4	180.0	72.8	78.1	26.7	1426.3	60.6	131.1	1057.0	177.6
4114	LAKSHADWEEP	2014	53.2	16.1	4.4	14.9	57.4	244.1	116.1	466.1	132.2	169.2	59.0	62.3	1395.0	69.3	76.7	958.5	290.5
4115	LAKSHADWEEP	2015	2.2	0.5	3.7	87.1	133.1	296.6	257.5	146.4	160.4	165.4	231.0	159.0	1642.9	2.7	223.9	860.9	555.4

# prints info of row and column types
# and the number of non-null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4116 entries, 0 to 4115
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 SUBDIVISION  4116 non-null   object 
 YEAR         4116 non-null   int64  
 JAN          4112 non-null   float64
 FEB          4113 non-null   float64
 MAR          4110 non-null   float64
 APR          4112 non-null   float64
 MAY          4113 non-null   float64
 JUN          4111 non-null   float64
 JUL          4109 non-null   float64
 AUG          4112 non-null   float64
SEP          4110 non-null   float64
OCT          4109 non-null   float64
NOV          4105 non-null   float64
DEC          4106 non-null   float64
ANNUAL       4090 non-null   float64
Jan-Feb      4110 non-null   float64
Mar-May      4107 non-null   float64
Jun-Sep      4106 non-null   float64
Oct-Dec      4103 non-null   float64
dtypes: float64(17), int64(1), object(1)
memory usage: 611.1+ KB

# get the number of null values
df.loc[:,'YEAR':'DEC'].isnull().sum()

YEAR     0
JAN      4
FEB      3
MAR      6
APR      4
MAY      3
JUN      5
JUL      7
AUG      4
SEP      6
OCT      7
NOV     11
DEC     10
dtype: int64

# unique values in a column
df['SUBDIVISION'].unique()

array(['ANDAMAN & NICOBAR ISLANDS', 'ARUNACHAL PRADESH',
       'ASSAM & MEGHALAYA', 'NAGA MANI MIZO TRIPURA',
       'SUB HIMALAYAN WEST BENGAL & SIKKIM', 'GANGETIC WEST BENGAL',
       'ORISSA', 'JHARKHAND', 'BIHAR', 'EAST UTTAR PRADESH',
       'WEST UTTAR PRADESH', 'UTTARAKHAND', 'HARYANA DELHI & CHANDIGARH',
       'PUNJAB', 'HIMACHAL PRADESH', 'JAMMU & KASHMIR', 'WEST RAJASTHAN',
       'EAST RAJASTHAN', 'WEST MADHYA PRADESH', 'EAST MADHYA PRADESH',
       'GUJARAT REGION', 'SAURASHTRA & KUTCH', 'KONKAN & GOA',
       'MADHYA MAHARASHTRA', 'MATATHWADA', 'VIDARBHA', 'CHHATTISGARH',
       'COASTAL ANDHRA PRADESH', 'TELANGANA', 'RAYALSEEMA', 'TAMIL NADU',
       'COASTAL KARNATAKA', 'NORTH INTERIOR KARNATAKA',
       'SOUTH INTERIOR KARNATAKA', 'KERALA', 'LAKSHADWEEP'], dtype=object)

# index by year
df_year = df.set_index('YEAR')
df_year.head(10)

	SUBDIVISION	JAN	FEB	MAR	APR	MAY	JUN	JUL	AUG	SEP	OCT	NOV	DEC	ANNUAL	Jan-Feb	Mar-May	Jun-Sep	Oct-Dec
YEAR
1901	ANDAMAN & NICOBAR ISLANDS	49.2	87.1	29.2	2.3	528.8	517.5	365.1	481.1	332.6	388.5	558.2	33.6	3373.2	136.3	560.3	1696.3	980.3
1902	ANDAMAN & NICOBAR ISLANDS	0.0	159.8	12.2	0.0	446.1	537.1	228.9	753.7	666.2	197.2	359.0	160.5	3520.7	159.8	458.3	2185.9	716.7
1903	ANDAMAN & NICOBAR ISLANDS	12.7	144.0	0.0	1.0	235.1	479.9	728.4	326.7	339.0	181.2	284.4	225.0	2957.4	156.7	236.1	1874.0	690.6
1904	ANDAMAN & NICOBAR ISLANDS	9.4	14.7	0.0	202.4	304.5	495.1	502.0	160.1	820.4	222.2	308.7	40.1	3079.6	24.1	506.9	1977.6	571.0
1905	ANDAMAN & NICOBAR ISLANDS	1.3	0.0	3.3	26.9	279.5	628.7	368.7	330.5	297.0	260.7	25.4	344.7	2566.7	1.3	309.7	1624.9	630.8
1906	ANDAMAN & NICOBAR ISLANDS	36.6	0.0	0.0	0.0	556.1	733.3	247.7	320.5	164.3	267.8	128.9	79.2	2534.4	36.6	556.1	1465.8	475.9
1907	ANDAMAN & NICOBAR ISLANDS	110.7	0.0	113.3	21.6	616.3	305.2	443.9	377.6	200.4	264.4	648.9	245.6	3347.9	110.7	751.2	1327.1	1158.9
1908	ANDAMAN & NICOBAR ISLANDS	20.9	85.1	0.0	29.0	562.0	693.6	481.4	699.9	428.8	170.7	208.1	196.9	3576.4	106.0	591.0	2303.7	575.7
1910	ANDAMAN & NICOBAR ISLANDS	26.6	22.7	206.3	89.3	224.5	472.7	264.3	337.4	626.6	208.2	267.3	153.5	2899.4	49.3	520.1	1701.0	629.0
1911	ANDAMAN & NICOBAR ISLANDS	0.0	8.4	0.0	122.5	327.3	649.0	253.0	187.1	464.5	333.8	94.5	247.1	2687.2	8.4	449.8	1553.6	675.4

# cell magic command to get plots inline
%matplotlib inline 
# select a year and plot rainfall through jun-sep for all subdivision
df_year.loc[2003,['SUBDIVISION','Jun-Sep']].plot(x='SUBDIVISION',kind='bar',figsize=(12,4))

<AxesSubplot:xlabel='SUBDIVISION'>

_images/L3-working_with_ASCII_csv_txt_11_1.png

# function to convert datetime from strings to python recognized type for datetime
def tidy(df):
    df = df.copy()
    df['date'] = pd.to_datetime(df['year'].astype(str)+df['month'].astype(str),format='%Y%b')
    return df.set_index('date')

# reorient the data so that dates become the index and 
# drop the columns to the right of "ANNUAL" column
df_clean= (df.loc[:,'SUBDIVISION':'DEC']
           .dropna(how='any')
          .set_index(['SUBDIVISION','YEAR'])
          .stack()
          .reset_index()
          .rename(columns={'YEAR':'year','level_2':'month',0:'precip','SUBDIVISION':'subdivision'})
          .reindex(columns=['subdivision','year','month','precip'])
          .pipe(tidy) # calls function tidy with preceeding dataframe as argument
          .drop(columns=['year','month'])
         )
df_clean.tail()

	subdivision	precip
date
2015-08-01	LAKSHADWEEP	146.4
2015-09-01	LAKSHADWEEP	160.4
2015-10-01	LAKSHADWEEP	165.4
2015-11-01	LAKSHADWEEP	231.0
2015-12-01	LAKSHADWEEP	159.0

# select data for Kerala and plot
df_krl = df_clean.loc[df_clean['subdivision']=='KERALA',['precip']]
df_krl.loc['1990':'2015'].plot(figsize=(21,3))
plt.suptitle("Monthly rainfall in Kerala",fontweight='bold')

Text(0.5, 0.98, 'Monthly rainfall in Kerala')

_images/L3-working_with_ASCII_csv_txt_14_1.png

# select rainfall for Telangana state and join it to the dataframe for Kerala
df_tel = df_clean.loc[df_clean['subdivision']=='TELANGANA',['precip']]
df_join = df_tel.join(df_krl,lsuffix='_telangana',rsuffix='_kerala')
df_join.head()

	precip_telangana	precip_kerala
date
1901-01-01	6.9	28.7
1901-02-01	41.8	44.7
1901-03-01	7.8	51.6
1901-04-01	45.2	160.0
1901-05-01	22.0	174.7

df_join.plot(kind='box')
plt.suptitle("Rainfall distribution",fontweight='bold')

Text(0.5, 0.98, 'Rainfall distribution')

_images/L3-working_with_ASCII_csv_txt_16_1.png

# groupby subdivisions and plot rainall time series for three selected subdivisions
selected_subd = ['KERALA','ARUNACHAL PRADESH','ASSAM & MEGHALAYA']
(df_clean.loc[df_clean['subdivision'].isin(selected_subd)]
         .groupby('subdivision').plot(figsize=(12,4),xlim=('1990','2010')))

subdivision
ARUNACHAL PRADESH       AxesSubplot(0.125,0.2;0.775x0.68)
ASSAM & MEGHALAYA    AxesSubplot(0.125,0.125;0.775x0.755)
KERALA               AxesSubplot(0.125,0.125;0.775x0.755)
dtype: object

_images/L3-working_with_ASCII_csv_txt_17_1.png

_images/L3-working_with_ASCII_csv_txt_17_2.png

_images/L3-working_with_ASCII_csv_txt_17_3.png

Further references:¶

Definitely checout Pandas documentation page.
A course of basic Pandas can be found in the realpython website.
Also checkout tomaugspurger blog on Modern Pandas

Oceanography with python

Text files and CSV¶

Pandas¶

Exercise¶

Further references:¶