Rice Paddy Methane Emissions Estimation: Part 2

May 23, 2022

Methane Emissions Estimation Data Part 2: A Comparison between FAOSTAT and University of Malaysia Estimates

This post documents the data exploration phase of a project that determines whether global methane emissions produced by rice paddies are undercounted.

It is fairly code-heavy and relies mostly on Python and pandas.

The code and data exploration follow the summary below.

Hypothesis Testing the University of Malaysia Paper

Claims

That the distributions do not differ between 2020 and 2019
That the means do not differ between 2020 and 2019

What Will Be Tested

Shapiro-Wilk Test
Mann-Whitney U Test
Kruskal-Wallis Test
Friedman Test

Analysis

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

filepath = "/Users/jnapolitano/Projects/wattime-takehome/wattime-takehome/data/ch4_2015-2021.xlsx"

hypothesis_testing_df = pd.read_excel(filepath)

Drop the Total row from the data

# Copy to avoid modifying a slice of the original frame.
# In a production environment the old df should also be dropped from memory.
hypothesis_testing_df = hypothesis_testing_df.loc[hypothesis_testing_df['country_name'] != "Total"].copy()

hypothesis_testing_df

	iso3_country	country_name	tCH4_2015	tCH4_2016	tCH4_2017	tCH4_2018	tCH4_2019	tCH4_2020	tCH4_2021
0	BGD	Bangladesh	2.344420e+06	2.278158e+06	2.098958e+06	2.141231e+06	2.070985e+06	2.106781e+06	1.983974e+06
1	BRA	Brazil	3.410233e+05	3.104189e+05	3.725173e+05	3.717030e+05	3.294713e+05	4.902874e+05	4.544874e+05
2	CHN	China	6.133647e+06	5.859531e+06	6.355071e+06	5.413962e+06	5.603352e+06	6.402353e+06	6.068210e+06
3	ESP	Spain	1.141464e+04	1.334803e+04	1.217299e+04	1.405410e+04	1.148324e+04	1.305461e+04	8.531579e+03
4	IDN	Indonesia	1.283649e+06	1.023129e+06	9.615327e+05	1.176982e+06	1.266668e+06	1.188195e+06	1.009936e+06
5	IND	India	6.219887e+06	5.309413e+06	6.228451e+06	6.589798e+06	7.501556e+06	7.599764e+06	6.567960e+06
6	IRN	Iran (Islamic Republic of)	8.774407e+04	9.180121e+04	9.620217e+04	8.875744e+04	9.500199e+04	9.600254e+04	9.053525e+04
7	ITA	Italy	4.995968e+04	4.937785e+04	5.443679e+04	4.469902e+04	4.566914e+04	5.101547e+04	5.089759e+04
8	JPN	Japan	2.305465e+05	2.284133e+05	2.708935e+05	1.548252e+05	2.332056e+05	2.835167e+05	1.574007e+05
9	KHM	Cambodia	4.954698e+05	5.731698e+05	4.517045e+05	5.592610e+05	5.947277e+05	6.412802e+05	5.644891e+05
10	KOR	Korea (the Republic of)	1.451878e+05	1.274597e+05	1.463222e+05	1.293543e+05	1.327782e+05	1.165467e+05	1.013006e+05
11	LAO	Lao People's Democratic Republic (the)	1.661169e+04	1.696441e+04	1.168063e+04	1.009675e+04	1.461058e+04	2.136270e+04	1.475014e+04
12	LKA	Sri Lanka	8.305626e+04	1.011743e+05	5.911841e+04	9.018914e+04	8.476088e+04	9.248238e+04	8.466966e+04
13	MMR	Myanmar	1.132082e+06	1.290806e+06	1.205169e+06	1.372447e+06	1.256888e+06	1.221904e+06	1.289837e+06
14	MYS	Malaysia	1.057399e+05	1.110049e+05	1.111291e+05	1.066525e+05	1.056287e+05	1.127141e+05	1.069696e+05
15	NPL	Nepal	1.007479e+05	6.667161e+04	8.081300e+04	9.200752e+04	1.164235e+05	7.168401e+04	4.811408e+04
16	PAK	Pakistan	4.852431e+05	5.945922e+05	5.372641e+05	4.532297e+05	6.528548e+05	6.401201e+05	4.849205e+05
17	PHL	Philippines (the)	3.432021e+05	4.073554e+05	3.836830e+05	4.175210e+05	3.584550e+05	4.462836e+05	4.383270e+05
18	PRK	Korea (the Democratic People's Republic of)	1.143217e+05	9.177653e+04	1.085457e+05	8.662578e+04	9.655062e+04	8.581038e+04	7.735988e+04
19	THA	Thailand	1.393798e+06	1.780993e+06	1.164699e+06	9.166575e+05	1.305046e+06	1.520788e+06	8.528673e+05
20	TWN	Taiwan (Province of China)	7.866956e+04	8.089149e+04	8.705634e+04	8.138151e+04	8.990870e+04	8.333327e+04	6.619861e+04
21	USA	United States of America (the)	1.611324e+05	1.618576e+05	1.684799e+05	1.657254e+05	1.691351e+05	1.941455e+05	1.634842e+05
22	VNM	Viet Nam	1.346013e+06	1.483777e+06	1.406437e+06	1.317455e+06	1.269751e+06	1.374450e+06	1.502787e+06

Test for Normality: Shapiro-Wilk

2019

# Selecting the University of Malaysia 2019 data
data_2019 = hypothesis_testing_df['tCH4_2019']
data_2019

0     2.070985e+06
1     3.294713e+05
2     5.603352e+06
3     1.148324e+04
4     1.266668e+06
5     7.501556e+06
6     9.500199e+04
7     4.566914e+04
8     2.332056e+05
9     5.947277e+05
10    1.327782e+05
11    1.461058e+04
12    8.476088e+04
13    1.256888e+06
14    1.056287e+05
15    1.164235e+05
16    6.528548e+05
17    3.584550e+05
18    9.655062e+04
19    1.305046e+06
20    8.990870e+04
21    1.691351e+05
22    1.269751e+06
Name: tCH4_2019, dtype: float64

results = stats.shapiro(data_2019)
print('stat=%.3f, p=%.3f' % (results.statistic, results.pvalue))
if results.pvalue > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')

stat=0.567, p=0.000
Probably not Gaussian

Results

The distribution is not Gaussian, so a non-parametric test must be used. It is not necessary to perform this test on the 2020 data, but I will do so anyway for practice.

2020

# Selecting the University of Malaysia 2020 data
data_2020 = hypothesis_testing_df['tCH4_2020']

results = stats.shapiro(data_2020)
print('stat=%.3f, p=%.3f' % (results.statistic, results.pvalue))
if results.pvalue > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')

stat=0.565, p=0.000
Probably not Gaussian

Results

The 2020 data is not Gaussian either, which confirms that we will need to perform a non-parametric test.
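Since the same check is run once per year, the two cells above can be folded into a loop. The following is a minimal sketch rather than the code used for the results above: the helper name `shapiro_by_year` is hypothetical, and the values are the tCH4_2019 and tCH4_2020 columns copied from the table printed earlier (rounded as displayed there), so the statistics may differ in the last digits from those computed on the spreadsheet.

```python
import pandas as pd
import scipy.stats as stats

# Values copied from the table above (rounded as displayed there).
df = pd.DataFrame({
    "tCH4_2019": [
        2.070985e+06, 3.294713e+05, 5.603352e+06, 1.148324e+04, 1.266668e+06,
        7.501556e+06, 9.500199e+04, 4.566914e+04, 2.332056e+05, 5.947277e+05,
        1.327782e+05, 1.461058e+04, 8.476088e+04, 1.256888e+06, 1.056287e+05,
        1.164235e+05, 6.528548e+05, 3.584550e+05, 9.655062e+04, 1.305046e+06,
        8.990870e+04, 1.691351e+05, 1.269751e+06,
    ],
    "tCH4_2020": [
        2.106781e+06, 4.902874e+05, 6.402353e+06, 1.305461e+04, 1.188195e+06,
        7.599764e+06, 9.600254e+04, 5.101547e+04, 2.835167e+05, 6.412802e+05,
        1.165467e+05, 2.136270e+04, 9.248238e+04, 1.221904e+06, 1.127141e+05,
        7.168401e+04, 6.401201e+05, 4.462836e+05, 8.581038e+04, 1.520788e+06,
        8.333327e+04, 1.941455e+05, 1.374450e+06,
    ],
})


def shapiro_by_year(frame, alpha=0.05):
    """Run Shapiro-Wilk on every column and collect the results in a table."""
    rows = []
    for column in frame.columns:
        result = stats.shapiro(frame[column])
        rows.append({
            "year": column,
            "stat": result.statistic,
            "p": result.pvalue,
            # True means normality was not rejected at the chosen alpha.
            "gaussian": bool(result.pvalue > alpha),
        })
    return pd.DataFrame(rows).set_index("year")


normality = shapiro_by_year(df)
print(normality.round(3))
```

The same helper would work unchanged on the full frame if only the tCH4 columns are passed in.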

Independence of Samples

We have to assume that the samples are independent of each other, even though we know that both depend on the hectares of rice under cultivation.
The correlations between years are rather high, but this is due to the similarity of hectares from year to year, which in turn makes the amount of CH4 similar.
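The correlation between the two years can be checked directly. This is a minimal sketch, assuming a rank-based (Spearman) correlation is a reasonable choice because the data is not Gaussian; the values are the tCH4_2019 and tCH4_2020 columns copied from the table above.

```python
import scipy.stats as stats

# Values copied from the table above, in the same country order.
data_2019 = [
    2.070985e+06, 3.294713e+05, 5.603352e+06, 1.148324e+04, 1.266668e+06,
    7.501556e+06, 9.500199e+04, 4.566914e+04, 2.332056e+05, 5.947277e+05,
    1.327782e+05, 1.461058e+04, 8.476088e+04, 1.256888e+06, 1.056287e+05,
    1.164235e+05, 6.528548e+05, 3.584550e+05, 9.655062e+04, 1.305046e+06,
    8.990870e+04, 1.691351e+05, 1.269751e+06,
]
data_2020 = [
    2.106781e+06, 4.902874e+05, 6.402353e+06, 1.305461e+04, 1.188195e+06,
    7.599764e+06, 9.600254e+04, 5.101547e+04, 2.835167e+05, 6.412802e+05,
    1.165467e+05, 2.136270e+04, 9.248238e+04, 1.221904e+06, 1.127141e+05,
    7.168401e+04, 6.401201e+05, 4.462836e+05, 8.581038e+04, 1.520788e+06,
    8.333327e+04, 1.941455e+05, 1.374450e+06,
]

# Spearman compares the country rankings of the two years, not the raw values.
rho, p = stats.spearmanr(data_2019, data_2020)
print('rho=%.3f, p=%.3f' % (rho, p))
```

A rank correlation close to 1 means the countries keep nearly the same ordering from one year to the next, which is a reminder that the independence assumption is a simplification.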

Distribution Similarity

Mann-Whitney U Test

# Example of the Mann-Whitney U Test

stat, p = stats.mannwhitneyu(data_2019, data_2020)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')

stat=266.000, p=0.982
Probably the same distribution

Kruskal-Wallis Test

stat, p = stats.kruskal(data_2019, data_2020)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')

stat=0.001, p=0.974
Probably the same distribution
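With only two groups, the Kruskal-Wallis test is equivalent to the Mann-Whitney U test under the normal approximation, which is why the two p-values above are so close. The small remaining gap most likely comes from the continuity correction that `mannwhitneyu` applies by default. A minimal sketch of that check, assuming a SciPy version whose `mannwhitneyu` accepts the `method` argument, and using the 2019 and 2020 columns copied from the table above:

```python
import scipy.stats as stats

# Values copied from the table above, in the same country order.
data_2019 = [
    2.070985e+06, 3.294713e+05, 5.603352e+06, 1.148324e+04, 1.266668e+06,
    7.501556e+06, 9.500199e+04, 4.566914e+04, 2.332056e+05, 5.947277e+05,
    1.327782e+05, 1.461058e+04, 8.476088e+04, 1.256888e+06, 1.056287e+05,
    1.164235e+05, 6.528548e+05, 3.584550e+05, 9.655062e+04, 1.305046e+06,
    8.990870e+04, 1.691351e+05, 1.269751e+06,
]
data_2020 = [
    2.106781e+06, 4.902874e+05, 6.402353e+06, 1.305461e+04, 1.188195e+06,
    7.599764e+06, 9.600254e+04, 5.101547e+04, 2.835167e+05, 6.412802e+05,
    1.165467e+05, 2.136270e+04, 9.248238e+04, 1.221904e+06, 1.127141e+05,
    7.168401e+04, 6.401201e+05, 4.462836e+05, 8.581038e+04, 1.520788e+06,
    8.333327e+04, 1.941455e+05, 1.374450e+06,
]

# Mann-Whitney U with the normal approximation and no continuity correction.
u_stat, u_p = stats.mannwhitneyu(
    data_2019,
    data_2020,
    alternative="two-sided",
    method="asymptotic",
    use_continuity=False,
)

# Kruskal-Wallis on the same two samples.
h_stat, h_p = stats.kruskal(data_2019, data_2020)

print('Mann-Whitney p=%.3f, Kruskal-Wallis p=%.3f' % (u_p, h_p))
```

In practice this means the second test adds little information when only two years are compared; it becomes useful once three or more groups are involved.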

Friedman Test

Just for the sake of it, I will compare the data across all years from 2015 to 2020.

# Example of the Friedman Test
#data_2014 = hypothesis_testing_df['tCH4_2014']
data_2015 = hypothesis_testing_df['tCH4_2015']
data_2016 = hypothesis_testing_df['tCH4_2016']
data_2017 = hypothesis_testing_df['tCH4_2017']
data_2018 = hypothesis_testing_df['tCH4_2018']

stat, p = stats.friedmanchisquare(data_2015, data_2016, data_2017, data_2018, data_2019, data_2020)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')

stat=11.472, p=0.043
Probably different distributions

Results

At least one of the yearly distributions differs from the others. Which ones differ has yet to be determined. For the sake of this analysis I will not attempt to identify them.
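If the differing years did need to be identified, one common follow-up to a significant Friedman test is a set of pairwise Wilcoxon signed-rank tests with a Bonferroni correction. The sketch below is only an illustration of that idea and is not part of the analysis above. It assumes the countries are treated as paired observations across years, and it uses the 2015 to 2020 values copied from the table earlier in the post (rounded as displayed there). The names `table` and `posthoc` are hypothetical.

```python
import io
from itertools import combinations

import pandas as pd
import scipy.stats as stats

# iso3 code plus tCH4 for 2015-2020, copied from the table above.
table = """
iso3 tCH4_2015 tCH4_2016 tCH4_2017 tCH4_2018 tCH4_2019 tCH4_2020
BGD 2.344420e+06 2.278158e+06 2.098958e+06 2.141231e+06 2.070985e+06 2.106781e+06
BRA 3.410233e+05 3.104189e+05 3.725173e+05 3.717030e+05 3.294713e+05 4.902874e+05
CHN 6.133647e+06 5.859531e+06 6.355071e+06 5.413962e+06 5.603352e+06 6.402353e+06
ESP 1.141464e+04 1.334803e+04 1.217299e+04 1.405410e+04 1.148324e+04 1.305461e+04
IDN 1.283649e+06 1.023129e+06 9.615327e+05 1.176982e+06 1.266668e+06 1.188195e+06
IND 6.219887e+06 5.309413e+06 6.228451e+06 6.589798e+06 7.501556e+06 7.599764e+06
IRN 8.774407e+04 9.180121e+04 9.620217e+04 8.875744e+04 9.500199e+04 9.600254e+04
ITA 4.995968e+04 4.937785e+04 5.443679e+04 4.469902e+04 4.566914e+04 5.101547e+04
JPN 2.305465e+05 2.284133e+05 2.708935e+05 1.548252e+05 2.332056e+05 2.835167e+05
KHM 4.954698e+05 5.731698e+05 4.517045e+05 5.592610e+05 5.947277e+05 6.412802e+05
KOR 1.451878e+05 1.274597e+05 1.463222e+05 1.293543e+05 1.327782e+05 1.165467e+05
LAO 1.661169e+04 1.696441e+04 1.168063e+04 1.009675e+04 1.461058e+04 2.136270e+04
LKA 8.305626e+04 1.011743e+05 5.911841e+04 9.018914e+04 8.476088e+04 9.248238e+04
MMR 1.132082e+06 1.290806e+06 1.205169e+06 1.372447e+06 1.256888e+06 1.221904e+06
MYS 1.057399e+05 1.110049e+05 1.111291e+05 1.066525e+05 1.056287e+05 1.127141e+05
NPL 1.007479e+05 6.667161e+04 8.081300e+04 9.200752e+04 1.164235e+05 7.168401e+04
PAK 4.852431e+05 5.945922e+05 5.372641e+05 4.532297e+05 6.528548e+05 6.401201e+05
PHL 3.432021e+05 4.073554e+05 3.836830e+05 4.175210e+05 3.584550e+05 4.462836e+05
PRK 1.143217e+05 9.177653e+04 1.085457e+05 8.662578e+04 9.655062e+04 8.581038e+04
THA 1.393798e+06 1.780993e+06 1.164699e+06 9.166575e+05 1.305046e+06 1.520788e+06
TWN 7.866956e+04 8.089149e+04 8.705634e+04 8.138151e+04 8.990870e+04 8.333327e+04
USA 1.611324e+05 1.618576e+05 1.684799e+05 1.657254e+05 1.691351e+05 1.941455e+05
VNM 1.346013e+06 1.483777e+06 1.406437e+06 1.317455e+06 1.269751e+06 1.374450e+06
"""
df = pd.read_csv(io.StringIO(table.strip()), sep=r"\s+", index_col="iso3")

# Every unordered pair of years: 6 years give 15 comparisons.
pairs = list(combinations(df.columns, 2))

rows = []
for year_a, year_b in pairs:
    # Paired test: each country contributes one value per year.
    stat, p = stats.wilcoxon(df[year_a], df[year_b])
    rows.append({
        "pair": f"{year_a} vs {year_b}",
        "stat": stat,
        "p": p,
        # Bonferroni: scale by the number of comparisons, capped at 1.
        "p_bonferroni": min(p * len(pairs), 1.0),
    })

posthoc = pd.DataFrame(rows).set_index("pair")
print(posthoc.round(3))
```

Bonferroni is conservative with 15 comparisons, so it is possible for the Friedman test to reject while no single pair survives the correction.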

The statement that the distributions of the 2019 and 2020 data do not differ cannot be rejected. That said, we also cannot claim that the means are statistically equivalent, as the data is not normally distributed and only non-parametric tests were applied.