How to Plot Statistically Significant Letters on Bar Plots Using Tukey Test Results in Python
This post shows you how to create significance letters from a post hoc test, which is great if you want to learn more about Python's functionality.
Starting
Have you ever wondered how to add letters on top of a bar plot to represent statistically significant relationships between groups using Python? In this tutorial, I’ll guide you step by step through the process. We’ll keep things simple, using only a few essential libraries like statsmodels, along with Python’s built-in dictionaries and list comprehensions. No need for heavy external libraries—just straightforward Python code!
First, we need to generate some random data. For this I am going to use Python's random module together with NumPy.
The code below generates random car brand names and some hypothetical prices, making sure that some of the brands are statistically different.
import random

import numpy as np
import pandas as pd

# Define the list of car names and their respective price ranges (mean and std deviation)
car_names = ["toyota", "mercedes", "mazda", "chevy", "ram"]
car_price_stats = {
    "toyota": {"mean": 7000, "std": 500},
    "mercedes": {"mean": 9000, "std": 800},
    "mazda": {"mean": 6500, "std": 400},
    "chevy": {"mean": 7500, "std": 600},
    "ram": {"mean": 8000, "std": 700},
}

# Set the number of rows for the DataFrame
num_rows = 1000

# Generate random data: pick a brand per row, then draw a price for it,
# clipped to the range 5000 to 10000
random_car_names = [random.choice(car_names) for _ in range(num_rows)]
random_avg_prices = [
    int(np.clip(
        np.random.normal(car_price_stats[car]["mean"], car_price_stats[car]["std"]),
        5000,
        10000,
    ))
    for car in random_car_names
]

# Create the DataFrame
data = {"car_name": random_car_names, "avg_price": random_avg_prices}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

If we call df.head(), we see the first five rows of the data.
Now, we want to determine whether there is a statistically significant difference in price between the cars.
Before doing that, we can get some nice descriptive statistics using a function from researchpy. It gives a really clean view of what is happening in the data before jumping into any analysis.
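The researchpy call itself is not shown in the text, so here is a rough stand-in using plain pandas instead (a swapped-in technique, not the researchpy function). It is a minimal sketch that reports the count, mean, standard deviation, and standard error per brand. The toy data frame is made up so the snippet runs on its own; with the tutorial data you would use df directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data so the snippet runs by itself;
# in the tutorial, use the df built earlier instead.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "car_name": np.repeat(["toyota", "mazda", "ram"], 50),
    "avg_price": np.concatenate([
        rng.normal(7000, 500, 50),
        rng.normal(6500, 400, 50),
        rng.normal(8000, 700, 50),
    ]),
})

# Per-brand descriptive statistics: size, mean, standard deviation, standard error
summary = (
    toy.groupby("car_name")["avg_price"]
    .agg(["count", "mean", "std", "sem"])
    .round(2)
)
print(summary)
```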
Now we can run an ANOVA to see whether the groups are statistically different. ANOVA tells us that at least one group differs from another, but it does not tell us which one.
From the model p-value, we see that the groups are statistically different. Now we can go ahead and run a post hoc test to find out which ones differ from each other.
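The post hoc call is not shown in the text either. Below is a minimal sketch assuming statsmodels' pairwise_tukeyhsd, stored under the name tukey_car that the later code uses. The toy data is hypothetical and only there so the snippet runs by itself.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical toy data; replace with the tutorial's df
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "car_name": np.repeat(["toyota", "mazda", "ram"], 50),
    "avg_price": np.concatenate([
        rng.normal(7000, 500, 50),
        rng.normal(6500, 400, 50),
        rng.normal(8000, 700, 50),
    ]),
})

# Tukey HSD: compares every pair of brands while controlling the family-wise error rate
tukey_car = pairwise_tukeyhsd(endog=toy["avg_price"], groups=toy["car_name"], alpha=0.05)

# One row per pair: group1, group2, meandiff, p-adj, lower, upper, reject
print(tukey_car.summary())
```

The reject column is True for pairs whose means differ significantly at the chosen alpha.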
We observe that Chevy and Mazda are not statistically different, while the rest of the car brands show significant differences. Therefore, Chevy and Mazda should share the same letter, distinguishing them from the other car brands.
We can use a Tukey plot to see the differences.
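I am assuming the Tukey plot here is the one produced by plot_simultaneous on the Tukey results object; the text does not show the call. A minimal sketch on hypothetical toy data:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical toy data; replace with the tutorial's df
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "car_name": np.repeat(["toyota", "mazda", "ram"], 50),
    "avg_price": np.concatenate([
        rng.normal(7000, 500, 50),
        rng.normal(6500, 400, 50),
        rng.normal(8000, 700, 50),
    ]),
})

tukey_car = pairwise_tukeyhsd(endog=toy["avg_price"], groups=toy["car_name"], alpha=0.05)

# One horizontal interval per brand; brands whose intervals do not overlap differ significantly
fig = tukey_car.plot_simultaneous(figsize=(8, 5), xlabel="avg_price", ylabel="car_name")
plt.show()
```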
df_tukey_car = pd.DataFrame(
    data=tukey_car._results_table.data[1:],
    columns=tukey_car._results_table.data[0]
)
Now that we have converted the Tukey results into a DataFrame, we can extract the relevant information from it to generate the letters.
import string
import pandas as pd

def letters(df, alpha=0.05):
    df["p-adj"] = df["p-adj"].astype(float)

    # Creating a list of the different treatment groups from Tukey's
    group1 = set(df.group1.tolist())  # Dropping duplicates by creating a set
    group2 = set(df.group2.tolist())  # Dropping duplicates by creating a set
    groupSet = group1 | group2  # Set operation that creates a union of 2 sets
    groups = list(groupSet)  # removed sorted from here

    # Creating lists of letters that will be assigned to treatment groups
    letters = list(string.ascii_lowercase)[:len(groups)]
    cldgroups = letters

    # the following algorithm is a simplification of the classical cld
    # column 1: own letter, column 2: letters of non-different groups,
    # column 3: letters of significantly different groups
    cld = pd.DataFrame(list(zip(groups, letters, cldgroups)))
    cld[3] = ""

    for row in df.itertuples():
        if df["p-adj"][row[0]] > (alpha):
            cld.iat[groups.index(df["group1"][row[0]]), 2] += cld.iat[groups.index(df["group2"][row[0]]), 1]
            cld.iat[groups.index(df["group2"][row[0]]), 2] += cld.iat[groups.index(df["group1"][row[0]]), 1]

        if df["p-adj"][row[0]] < (alpha):
            cld.iat[groups.index(df["group1"][row[0]]), 3] += cld.iat[groups.index(df["group2"][row[0]]), 1]
            cld.iat[groups.index(df["group2"][row[0]]), 3] += cld.iat[groups.index(df["group1"][row[0]]), 1]

    cld[2] = cld[2].apply(lambda x: "".join(sorted(x)))
    cld[3] = cld[3].apply(lambda x: "".join(sorted(x)))
    cld.rename(columns={0: "groups"}, inplace=True)

    # this part will reassign the final name to the group
    # for sure there are more elegant ways of doing this
    cld = cld.sort_values(cld.columns[2], key=lambda x: x.str.len())
    cld["labels"] = ""
    letters = list(string.ascii_lowercase)
    unique = []
    for item in cld[2]:
        for fitem in cld["labels"].unique():
            for c in range(0, len(fitem)):
                if not set(unique).issuperset(set(fitem[c])):
                    unique.append(fitem[c])
        g = len(unique)

        for kitem in cld[1]:
            if kitem in item:
                # Assignments go through cld.loc[mask, "labels"] rather than
                # cld["labels"].loc[mask], because chained assignment may not
                # write back to cld in recent pandas versions
                if cld["labels"].loc[cld[1] == kitem].iloc[0] == "":
                    cld.loc[cld[1] == kitem, "labels"] += letters[g]

                # Checking if there are forbidden pairing (proposition of solution to the imperfect script)
                if kitem in ' '.join(cld[3][cld["labels"] == letters[g]]):
                    g = len(unique) + 1

                # Checking if columns 1 & 2 of cld share at least 1 letter
                if len(set(cld["labels"].loc[cld[1] == kitem].iloc[0]).intersection(cld.loc[cld[2] == item, "labels"].iloc[0])) <= 0:
                    if letters[g] not in list(cld["labels"].loc[cld[1] == kitem].iloc[0]):
                        cld.loc[cld[1] == kitem, "labels"] += letters[g]
                    if letters[g] not in list(cld["labels"].loc[cld[2] == item].iloc[0]):
                        cld.loc[cld[2] == item, "labels"] += letters[g]

    cld = cld.sort_values("labels")
    cld.drop(columns=[1, 2, 3], inplace=True)
    cld = dict(zip(cld["groups"], cld["labels"]))
    return(cld)

This function takes the Tukey DataFrame we generated as input and returns a dictionary whose keys are the car names and whose values are the letters. The plotting code further down expects this dictionary under the name group_labels.
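Since the letter assignment is a simplified algorithm, it can be worth sanity-checking its output. The sketch below is my own hypothetical helper, check_letters, and is not part of the original code. It encodes the rule the letters are meant to express: pairs that are not significantly different should share at least one letter, and significantly different pairs should share none. The small Tukey table and both dictionaries are made up for illustration.

```python
import pandas as pd

def check_letters(df_tukey, labels, alpha=0.05):
    """Return the pairs whose letters contradict the Tukey table (hypothetical helper)."""
    problems = []
    pairs = zip(df_tukey["group1"], df_tukey["group2"], df_tukey["p-adj"].astype(float))
    for g1, g2, p in pairs:
        shared = set(labels[g1]) & set(labels[g2])
        significant = p < alpha
        if significant and shared:
            problems.append((g1, g2, "significant but share a letter"))
        elif not significant and not shared:
            problems.append((g1, g2, "not significant but share no letter"))
    return problems

# Made-up Tukey results: only chevy and mazda are NOT significantly different
toy_tukey = pd.DataFrame({
    "group1": ["chevy", "chevy", "mazda"],
    "group2": ["mazda", "ram", "ram"],
    "p-adj": [0.62, 0.001, 0.001],
})

good = {"chevy": "a", "mazda": "a", "ram": "b"}  # consistent with the table
bad = {"chevy": "a", "mazda": "b", "ram": "b"}   # contradicts two of the pairs

print(check_letters(toy_tukey, good))
print(check_letters(toy_tukey, bad))
```

An empty list means the letters agree with every pairwise comparison in the table.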
Now that we have successfully generated the letters, we can plot them on a bar chart. Before that, we need to compute the mean and standard error for each car brand.
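The code for this step is not shown in the text, so here is a minimal sketch of one way to do it with a pandas groupby. The name df_plot_car and the columns car_name, mean, and sem match what the plotting code expects; the toy data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; replace with the tutorial's df
rng = np.random.default_rng(4)
toy = pd.DataFrame({
    "car_name": np.repeat(["toyota", "mazda", "ram"], 50),
    "avg_price": np.concatenate([
        rng.normal(7000, 500, 50),
        rng.normal(6500, 400, 50),
        rng.normal(8000, 700, 50),
    ]),
})

# Mean and standard error of the mean per brand, with car_name back as a regular column
df_plot_car = (
    toy.groupby("car_name")["avg_price"]
    .agg(["mean", "sem"])
    .reset_index()
)
print(df_plot_car)
```

The sem aggregation is the sample standard deviation divided by the square root of the group size, which is what the error bars in the plot represent.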
Now that we have the data and the corresponding letters, we can plot them together using this code block:
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(12, 10), dpi=200)
error = np.full(len(df_plot_car), df_plot_car['sem'])
custom_letters = group_labels

# Create the bar plot
bars = plt.bar(df_plot_car['car_name'], df_plot_car['mean'], yerr=error, capsize=5)

# Add annotations above bars
for bar, car_name in zip(bars, df_plot_car['car_name']):
    height = bar.get_height()
    plt.annotate(
        custom_letters[car_name],
        xy=(bar.get_x() + bar.get_width() / 2, height + 0.8),
        xytext=(0, 5),  # 5 points vertical offset
        textcoords="offset points",
        ha='center',
        va='bottom'
    )

# Set x-ticks with rotation
plt.xticks(
    ticks=range(len(df_plot_car['car_name'])),
    labels=df_plot_car['car_name'],
    rotation=45,
    ha='right'
)

# Add labels and title with larger font sizes and spacing
plt.xlabel('Car Names', fontsize=14)
plt.ylabel('Average Price in USD', fontsize=14)
plt.title('Average Price By Different Car Names', fontsize=16, pad=20)  # Increased title font size and added padding

# Adjust layout for better spacing
plt.tight_layout()
plt.show()
Now we have successfully completed our task. There are other ways to achieve this, such as dedicated compact letter display packages. However, I find this approach to be the most reliable, because it avoids compatibility issues between libraries.
Thanks for reading until the end. Here is a Google Colab notebook with all the code for this tutorial:
https://colab.research.google.com/drive/15agod0EqGeH3v9uVVIn8fq_CnhdH6huf?usp=sharing
