Parametric Tutorial

NOTE This notebook is not meant to teach statistics, but only demo how to run the py50 functions. There are plenty of resources available online. I particularly found introductory tutorials by DATAtab helpful.

The following page will show examples of parametric tests using the Stats() and Plots() modules of py50. There are many plot features available for py50, but they will not all be demoed here. Instead, to see all the plots available, take a look at the 006_statistics_quickstart. As py50 is built on top of Pingouin, additional examples with ANOVA tests can be found here.

[1]:

from py50 import Stats, Plots
import pingouin as pg

[2]:

## List Datasets available from Pingouin
# pg.list_dataset()

# Read Dataset
df = pg.read_dataset('penguins')

# Initialize Stats and Plots
stats = Stats(df)
plot = Plots(df)

print('List of Group names in Dataset:', df['species'].unique())
stats.show()

List of Group names in Dataset: ['Adelie' 'Chinstrap' 'Gentoo']

[2]:

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Biscoe	37.8	18.3	174.0	3400.0	female
1	Adelie	Biscoe	37.7	18.7	180.0	3600.0	male
2	Adelie	Biscoe	35.9	19.2	189.0	3800.0	female
3	Adelie	Biscoe	38.2	18.1	185.0	3950.0	male
4	Adelie	Biscoe	38.8	17.2	180.0	3800.0	male

Data Distribution

The data distribution is important to determine patterns that may show up in the dataset. For example, it can also tell you how often a value may occur, the shape of the data, variability of a value, etc.

py50 includes two tests that can help analyze the dataset.

Normality. This will test if the dataset follows normal distribution - i.e. a bell-shape curve.
Homoscedasticity. This will test for the “spread” or consistency of the dataset along a regression model.

[3]:

# Test for normality
stats.get_normality(value_col='body_mass_g', group_col='species').round(3)

[3]:

	W	pval	normal
species
Adelie	0.981	0.032	False
Chinstrap	0.984	0.561	True
Gentoo	0.986	0.234	True

[4]:

stats.get_homoscedasticity(value_col='body_mass_g', group_col='species').round(3)

[4]:

	W	pval	equal_var
levene	NaN	NaN	False

ANOVA

Several tests, parametric and non-parametric, are available in py50. The functions use the “get_x” format, where x is the name of said test. Here, the One-Way ANOVA script will be demoed. The final table will be a Pandas.DataFrame object.

[5]:

anova = stats.get_anova(value_col='body_mass_g', group_col='species')
anova

[5]:

	Source	ddof1	ddof2	F	p_unc	np2	significance
0	species	2	339	343.626275	2.892368e-82	0.669672	****

Tukey

py50 comes with several post-hoc tests. Again, this can be called using the same ‘get_x’ phrase. As the above ANOVA test show significance between the Hair color group, a Tukey test will be performed.

[6]:

tukey = stats.get_tukey(value_col='body_mass_g', group_col='species').round(4)
tukey

[6]:

	A	B	mean_A	mean_B	diff	se	T	p_tukey	hedges	significance
0	Adelie	Chinstrap	3700.6623	3733.0882	-32.426	67.5117	-0.4803	0.8807	-0.0739	n.s.
1	Adelie	Gentoo	3700.6623	5076.0163	-1375.354	56.1480	-24.4952	0.0000	-2.8602	****
2	Chinstrap	Gentoo	3733.0882	5076.0163	-1342.928	69.8569	-19.2240	0.0000	-2.8753	****

The tukey results can be plotted as follows:

The parameters for all plots in py50 are created to be as similar as possible. For each plot, there is also an option to “return_df”. This parameter will output the calculated dataframe that was used to create the plot. Users can use this to compare to the plot generated using the Stats() module to check for any inconsistencies.

[7]:

title = 'Tukey Results Bar Plot'

plot.barplot(test='tukey', value_col='body_mass_g', group_col='species', palette='RdYlGn', title=title, fontsize=30,
             return_df=True)

[7]:

(           A          B       mean_A       mean_B         diff         se  \
 0     Adelie  Chinstrap  3700.662252  3733.088235   -32.425984  67.511684
 1     Adelie     Gentoo  3700.662252  5076.016260 -1375.354009  56.147971
 2  Chinstrap     Gentoo  3733.088235  5076.016260 -1342.928025  69.856928

            T       p_tukey    hedges significance
 0  -0.480302  8.806666e-01 -0.073946         n.s.
 1 -24.495169  1.443290e-14 -2.860201         ****
 2 -19.223978  1.443290e-14 -2.875327         ****  ,
 <statannotations.Annotator.Annotator at 0x119fe7a60>)

../_images/tutorials_007_parametric_example_11_1.png

Welch ANOVA

Welch ANOVA is used for when the Homoscedasticity is false. In other words, if the datasets does not look consistent. The get_welch_anova() function can be tested usign the same Penguin dataset used above.

[8]:

# welch = stats.get_welch_anova(data, value_col='body_mass_g', group_col='species').round(4)
welch = stats.get_welch_anova(value_col='body_mass_g', group_col='species')
welch

[8]:

	Source	ddof1	ddof2	F	p_unc	np2	significance
0	species	2	189.478413	317.572267	3.093701e-61	0.669672	****

Games-Howell Post-Hoc Test

Again, after the ANOVA test, a post-hoc test should be used to tell significance between groups. Games-Howell can be used to handle situations with inconsistent datasets - i.e. Homoscedasticity is false.

[9]:

gameshowell = stats.get_gameshowell(value_col='body_mass_g', group_col='species')
gameshowell

[9]:

	A	B	mean_A	mean_B	diff	se	T	df	pval	hedges	significance
0	Adelie	Chinstrap	3700.662252	3733.088235	-32.425984	59.706437	-0.543090	152.454796	0.850154	-0.073946	n.s.
1	Adelie	Gentoo	3700.662252	5076.016260	-1375.354009	58.810929	-23.386028	249.642554	0.000000	-2.860201	****
2	Chinstrap	Gentoo	3733.088235	5076.016260	-1342.928025	65.102843	-20.627794	170.404362	0.000000	-2.875327	****

The post-hoc results can be plotted using py50. Her there are specific pairs denoted for plotting. This would remove the pairs that have no significance (n.s.). If the return_df=True, then the returned table will only contain the pairs annotated on the plot.

[10]:

title = 'Games-Howell Results Bar Plot'
pairs = [('Gentoo', 'Chinstrap'), ('Adelie', 'Gentoo')]

plot.barplot(test='gameshowell', value_col='body_mass_g', group_col='species', palette='RdYlGn', title=title,
             fontsize=30, return_df=True, pairs=pairs)

[10]:

(           A       B       mean_A      mean_B         diff         se  \
 1  Chinstrap  Gentoo  3733.088235  5076.01626 -1342.928025  65.102843
 0     Adelie  Gentoo  3700.662252  5076.01626 -1375.354009  58.810929

            T          df  pval    hedges significance
 1 -20.627794  170.404362   0.0 -2.875327         ****
 0 -23.386028  249.642554   0.0 -2.860201         ****  ,
 <statannotations.Annotator.Annotator at 0x11a29c5b0>)

../_images/tutorials_007_parametric_example_17_1.png

Repeated Measures

Repeated measures is used to analyze within-subjects or within-groups factors. Can be used when the subjects are measured under different conditions or time points. Examples for the get_rm_anova() function uses the ‘rm_anova’ dataset from pingouin as follows:

[11]:

repeated_m = pg.read_dataset('rm_anova')
repeated_m.dropna(subset=['Gender'], inplace=True)

rm_stats = Stats(repeated_m)
rm_plots = Plots(repeated_m)
rm_stats.show()

[11]:

	Subject	Gender	Region	Education	DesireToKill	Disgustingness	Frighteningness
0	1	Female	North	some	10.0	High	High
1	1	Female	North	some	9.0	High	Low
2	1	Female	North	some	6.0	Low	High
3	1	Female	North	some	6.0	Low	Low
4	2	Female	North	advance	10.0	High	High

[12]:

repeated_measures = rm_stats.get_rm_anova(value_col='DesireToKill', within_subject_col='Disgustingness',
                                          subject_col='Subject')
repeated_measures

[12]:

	Source	ddof1	ddof2	F	p_unc	ng2	eps	significance
0	Disgustingness	1	91	11.876684	0.000862	0.025827	1.0	***

Pairwise tests for repeated measures

The guidelines found on Pingouin suggest pairwise_tests(). py50 warps this function in the get_pairwise_rm() function. The results can also be used to plot hte results.

NOTE: Depending on the data, it may also be feasible to use the Tukey test.

[13]:

pairwise_test = rm_stats.get_pairwise_rm(value_col='DesireToKill', within_subject_col='Disgustingness',
                                         subject_col='Subject')
pairwise_test

[13]:

	Contrast	A	B	Paired	Parametric	T	dof	alternative	p_unc	BF10	hedges	significance
0	Disgustingness	High	Low	True	True	3.446257	91.0	two-sided	0.000862	26.262	0.322536	***

[14]:

title = 'Pairwise T-Test Results Bar Plot'

rm_plots.boxplot(test='pairwise-rm', value_col='DesireToKill', group_col='Disgustingness', subject_col='Subject',
                 palette='RdYlGn', title=title, fontsize=30, return_df=True)

[14]:

(         Contrast     A    B  Paired  Parametric         T        dof  \
 0  Disgustingness  High  Low   False        True  2.595165  360.18469

   alternative     p_unc   BF10    hedges significance
 0   two-sided  0.009841  2.899  0.271888           **  ,
 <statannotations.Annotator.Annotator at 0x119ce0c10>)

../_images/tutorials_007_parametric_example_23_1.png

Mixed ANOVA

Mixed ANOVA can be used to analyze data with mixed design - both within and between group factors. The function uses the py50 standard as seen in the above examples. The example here will use the ‘mixed_anova’ dataset from Pingouin.

[15]:

data = pg.read_dataset('mixed_anova')
mixed_stats = Stats(data)
mixed_plot = Plots(data)

mixed_stats.show()

[15]:

	Scores	Time	Group	Subject
0	5.971435	August	Control	0
1	4.309024	August	Control	1
2	6.932707	August	Control	2
3	5.187348	August	Control	3
4	4.779411	August	Control	4

[16]:

mixed = mixed_stats.get_mixed_anova(value_col='Scores', group_col='Group', within_subject_col='Time',
                                    subject_col='Subject')
mixed

[16]:

	Source	SS	DF1	DF2	MS	F	p_unc	np2	eps	significance
0	Group	5.459963	1	58	5.459963	5.051709	0.028420	0.080120	NaN	*
1	Time	7.628428	2	116	3.814214	4.027394	0.020369	0.064929	0.998751	*
2	Interaction	5.167192	2	116	2.583596	2.727996	0.069545	0.044922	NaN	n.s.

The guidelines found on Pingouin suggest pairwise_tests(). That can also be used. But depending on the dataset, it may be more prudent to use the tukey test. That is easy to use, as we have seen above. The Tukey test will be performed and the corresponding barplot() will be generated.

[17]:

pairwise_test = mixed_stats.get_tukey(value_col='Scores', group_col='Group')
pairwise_test

[17]:

	A	B	mean_A	mean_B	diff	se	T	p_tukey	hedges	significance
0	Control	Meditation	5.567851	5.91618	-0.348328	0.152115	-2.289903	0.0232	-0.339918	*

[18]:

title = 'Tukey on Mixed ANOVA Results Bar Plot'

mixed_plot.barplot(test='tukey', value_col='Scores', group_col='Group', palette='RdYlGn', title=title, titlesize=30,
                   return_df=True)

[18]:

(         A           B    mean_A   mean_B      diff        se         T  \
 0  Control  Meditation  5.567851  5.91618 -0.348328  0.152115 -2.289903

    p_tukey    hedges significance
 0   0.0232 -0.339918            *  ,
 <statannotations.Annotator.Annotator at 0x11a5aa020>)

../_images/tutorials_007_parametric_example_29_1.png