Visualizing Connections Using Chord Diagrams in Python
It is often helpful to visualize the connections between categorical data points. This could help identify a significant amount of overlap between two types or types that are typically associated with each other.
There are a few different ways we can visualize this, but the chord diagram is one I have started using for data sets with limited options for the categorical data point.
What is a Chord Diagram?
A chord diagram shows all the possible options for a categorical value and the number of connections between each option. The chord diagram is a great way to analyze and view the connections.
For example, if you have a dataset of posts with different tags or movies that are in multiple categories, using a chord diagram would be a helpful way to identify data points with either a high number of connections or very few connections compared to the average.
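To make the idea of "connections" concrete, here is a minimal sketch of how co-occurrence counts could be tallied for posts with multiple tags. The posts list and its tag names are made up for illustration, and count_tag_pairs is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

# Hypothetical example data: each post carries one or more tags.
posts = [
    {"title": "Post A", "tags": ["python", "pandas"]},
    {"title": "Post B", "tags": ["python", "pandas", "plotting"]},
    {"title": "Post C", "tags": ["plotting", "python"]},
    {"title": "Post D", "tags": ["pandas"]},
]


def count_tag_pairs(records):
    """Count how often each unordered pair of tags appears on the same record."""
    pair_counts = Counter()
    for record in records:
        # Sorting makes ('pandas', 'python') and ('python', 'pandas') the same key.
        unique_tags = sorted(set(record["tags"]))
        pair_counts.update(combinations(unique_tags, 2))
    return pair_counts


pair_counts = count_tag_pairs(posts)
for (tag_a, tag_b), count in sorted(pair_counts.items()):
    print(f"{tag_a} <-> {tag_b}: {count}")
```

Each pair and its count corresponds to one chord in the diagram, and each distinct tag corresponds to one node around the circle.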
Creating a Chord Diagram in Python
We can use the Python library HoloViews to help us create our diagram. HoloViews is a library that wraps an underlying plotting library, such as Matplotlib. It offers a variety of graphs, and you can switch which backend it uses (such as Matplotlib or Bokeh) so you can use the one that works best for your needs.
HoloViews is designed to work well with pandas, which we'll use below. However, it also accepts a few other data types.
First, we need to install HoloViews. I'd suggest installing the recommended setup using:
pip install "holoviews[recommended]"
Once installed, we can create our Python script or Notebook and get started.
To start, we import the modules we need and tell HoloViews which backend to use. I normally use Matplotlib as the backend for this.
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
hv.extension('matplotlib')
hv.output(fig='svg', size=500)
From there, let's create a very basic example data set first.
# Create a Pandas DataFrame from an example list of dicts.
connections_df = pd.DataFrame.from_records([
    {'source': 0, 'target': 1, 'value': 5},
    {'source': 0, 'target': 2, 'value': 15},
    {'source': 0, 'target': 3, 'value': 8},
    {'source': 0, 'target': 4, 'value': 2},
    {'source': 1, 'target': 4, 'value': 45},
    {'source': 1, 'target': 3, 'value': 12},
    {'source': 1, 'target': 2, 'value': 1},
    {'source': 2, 'target': 3, 'value': 19},
    {'source': 2, 'target': 4, 'value': 13},
    {'source': 3, 'target': 4, 'value': 27},
])
The Chord element accepts a dataframe with source, target, and value columns, where source and target are numerical representations of the "from" and "to" categorical options and value is how many connections the pair has.
We can pass this dataframe to Chord to get our first diagram:
hv.Chord(connections_df)
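As a sanity check on what the diagram should show, the sketch below totals the value of every chord touching each node, using the same example records as plain dicts. The node_totals helper is a hypothetical name for illustration and is not part of HoloViews.

```python
from collections import defaultdict

# Same example records as connections_df, kept as plain dicts.
records = [
    {'source': 0, 'target': 1, 'value': 5},
    {'source': 0, 'target': 2, 'value': 15},
    {'source': 0, 'target': 3, 'value': 8},
    {'source': 0, 'target': 4, 'value': 2},
    {'source': 1, 'target': 4, 'value': 45},
    {'source': 1, 'target': 3, 'value': 12},
    {'source': 1, 'target': 2, 'value': 1},
    {'source': 2, 'target': 3, 'value': 19},
    {'source': 2, 'target': 4, 'value': 13},
    {'source': 3, 'target': 4, 'value': 27},
]


def node_totals(edges):
    """Sum the value of every edge that touches each node (as source or target)."""
    totals = defaultdict(int)
    for edge in edges:
        totals[edge['source']] += edge['value']
        totals[edge['target']] += edge['value']
    return dict(totals)


totals = node_totals(records)
busiest = max(totals, key=totals.get)
print(totals)
print(f"Node {busiest} has the largest total.")
```

Comparing these totals against the rendered diagram is a quick way to confirm the dataframe was built the way you intended.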
We can add labels to the diagram by passing a "nodes" data set. The nodes will have two columns: index for the numerical representation of a value and name for the label.
You can name these columns anything as long as you update the second argument to hv.Dataset(), the labels parameter in opts, and any dim('index') references to match your column names.
nodes_ds = hv.Dataset(pd.DataFrame.from_records([
    {'index': 0, 'name': "Stuff"},
    {'index': 1, 'name': "Things"},
    {'index': 2, 'name': "Whatnots"},
    {'index': 3, 'name': "Odds & Ends"},
    {'index': 4, 'name': "Cups"},
]), 'index')
hv.Chord((connections_df, nodes_ds)).opts(opts.Chord(labels='name'))
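If your labels start out as a plain list, the node records can be generated rather than typed by hand. This is a small sketch under that assumption; build_node_records is a hypothetical helper, and its output has the same index/name shape as the records passed to pd.DataFrame.from_records for the nodes data set.

```python
def build_node_records(labels, index_col='index', label_col='name'):
    """Return one dict per label, numbered from 0 in list order."""
    return [{index_col: i, label_col: label} for i, label in enumerate(labels)]


labels = ["Stuff", "Things", "Whatnots", "Odds & Ends", "Cups"]
node_records = build_node_records(labels)
print(node_records[3])
```

The index_col and label_col arguments are there so the column names can be changed in one place if you rename them.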
The opts method allows us to pass a variety of options and settings to create our diagram, such as the labels column.
We're able to start seeing the connections, but it's challenging to evaluate with all the lines being the same color. We can use the opts method to pass some settings for adjusting edge and node colors.
hv.Chord((connections_df, nodes_ds)).opts(
    opts.Chord(
        cmap='Category20',
        edge_color=dim('source').astype(str),
        labels='name',
        node_color=dim('index').astype(str)
    )
)
The diagram is looking much better now. We can start to see which nodes have the most connections between them.
Lastly, displaying the diagram inline works if we are just outputting in a Jupyter Notebook, but we will probably need to save the image to use it somewhere else. HoloViews has a save() function for this:
# Create our chord diagram in the same way but saving it to a variable.
chord_example_3 = hv.Chord((connections_df, nodes_ds)).opts(
    opts.Chord(
        cmap='Category20',
        edge_color=dim('source').astype(str),
        labels='name',
        node_color=dim('index').astype(str)))
# Use hv.save() to save the diagram to a file.
hv.save(chord_example_3, 'chord-example-3.svg')
Creating a Chord Diagram with Real Data
Now that we have created a basic diagram, let's look at how this would work for an actual data set.
Your categorical data could be in a variety of formats. For this example, we are looking at a data set that has 2 "types" per entity, and we'll visualize connections between these different types.
I found a dataset on Kaggle of all the Pokémon and their types. Pokémon is a video game with hundreds of animal-like creatures with different "types." Pokémon can have 1 or 2 types, and there are 18 potential types.
To make this simple, I'll only look at Pokémon that have 2 types and use the pandas value_counts() method to quickly extract the main connection counts.
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
hv.extension('matplotlib')
hv.output(fig='svg', size=500)
# Load our DataFrame.
pokemon_df = pd.read_csv('pokemon.csv')
# Only use base forms to make this analysis more straightforward for this example.
pokemon_regular_forms_only_df = pokemon_df[pokemon_df['Alternate Form Name'].isnull()]
# To make this analysis simple, only look at Pokémon that have two types.
two_types_df = pokemon_regular_forms_only_df[~pokemon_regular_forms_only_df['Secondary Type'].isnull()]
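# Create a dict of the type combinations and a frequency count.
# The [1:-1] slices drop the first and last character of each type value
# (this assumes the CSV wraps each type in a pair of extra characters, such as quotes).
type_connections = two_types_df.apply(lambda x: f'{x["Primary Type"][1:-1]},{x["Secondary Type"][1:-1]}', axis=1).value_counts().to_dict()
"""
type_connections is in the format of:
{
    'Normal,Flying': 26,
    'Ghost,Dark': 12
}
"""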
Now, we have a dict of type combinations and counts. If this were a larger and more complex dataset, we'd have to approach this differently. However, for this, I'll loop over the type combinations and convert them into a dictionary of source types with their accompanying target types.
"""
Cycle over our type combinations, split each, and add it to our connections
dict to end up with a format like:
connections = {
    'normal': {
        'targets': {
            'flying': 26,
            'water': 12
        }
    }
}
"""
from collections import defaultdict
connections = defaultdict(lambda: {'targets': defaultdict(int)})
for type_combo, value in type_connections.items():
    pk_types = type_combo.split(',')
    for pk_type in pk_types:
        for target in pk_types:
            if target != pk_type:
                connections[pk_type]['targets'][target] += value
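To see what this loop produces, here is the same logic run on a tiny, made-up type_connections dict (the counts are invented for illustration and are not taken from the Kaggle data). Note that each pairing ends up recorded in both directions with the combined count.

```python
from collections import defaultdict

# Hypothetical counts, including one pair that appears in both orders.
type_connections = {
    'normal,flying': 26,
    'flying,normal': 1,
    'ghost,dark': 12,
}

connections = defaultdict(lambda: {'targets': defaultdict(int)})
for type_combo, value in type_connections.items():
    pk_types = type_combo.split(',')
    for pk_type in pk_types:
        for target in pk_types:
            if target != pk_type:
                connections[pk_type]['targets'][target] += value

# Convert the nested defaultdicts to plain dicts for readable printing.
plain = {source: dict(data['targets']) for source, data in connections.items()}
print(plain)
```

Here normal/flying and flying/normal are merged into a single total of 27 that appears under both 'normal' and 'flying', which is why a later step needs to drop one of the two copies.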
Now, we need to convert this to our chords and nodes format. Because the loop above records every pairing in both directions (normal/flying and flying/normal each hold the combined count), we also want to keep just one chord record per pair.
# Create a sorted list of unique nodes first (sorting keeps the node order stable between runs)
nodes = sorted(set(list(connections.keys()) + [target for d in connections.values() for target in d['targets'].keys()]))
nodes_df = pd.DataFrame({'node': nodes}, index=range(len(nodes)))
# Create the chords dataframe
chord_data = []
node_to_id = {node: idx for idx, node in enumerate(nodes)}
seen_pairs = set()
for source, target_data in connections.items():
    source_id = node_to_id[source]
    for target, count in target_data['targets'].items():
        target_id = node_to_id[target]
        # The original connections hold each pairing twice: once where a type
        # is the source and once where it is the target.
        # These are order-independent, so use a frozenset of the pair to skip duplicates.
        pair = frozenset([source_id, target_id])
        if pair not in seen_pairs:
            seen_pairs.add(pair)
            chord_data.append([source_id, target_id, count])
chords_df = pd.DataFrame(chord_data, columns=['source', 'target', 'value'])
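The duplicate check can be isolated into a small function to make its behavior easier to verify. This is a sketch that uses node names instead of integer ids and a made-up connections dict; unique_pairs is a hypothetical helper name.

```python
def unique_pairs(connections):
    """Keep one (source, target, count) record per unordered pair of nodes."""
    seen_pairs = set()
    records = []
    for source, target_data in connections.items():
        for target, count in target_data['targets'].items():
            pair = frozenset([source, target])  # order-independent key
            if pair not in seen_pairs:
                seen_pairs.add(pair)
                records.append((source, target, count))
    return records


# Hypothetical input where every pairing is stored in both directions.
connections = {
    'normal': {'targets': {'flying': 27}},
    'flying': {'targets': {'normal': 27, 'bug': 14}},
    'bug': {'targets': {'flying': 14}},
}
edges = unique_pairs(connections)
print(edges)
```

Whichever direction is encountered first is the one that is kept, so the source and target of a chord depend on the iteration order of the dict.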
Now, we can pass our nodes dataframe to hv.Dataset and then create our diagram.
# We use .reset_index() here to create the `index` column used in the HoloViews dataset.
nodes_ds = hv.Dataset(nodes_df.reset_index(), 'index')
hv.Chord((chords_df, nodes_ds)).opts(
    opts.Chord(
        cmap='Category20',
        edge_color=dim('source').astype(str),
        labels='node',  # Make sure this matches the column name from nodes_df
        node_color=dim('index').astype(str)
    )
)
Now, you may have noticed in the examples with labels that the labels along the left side were upside down. By default, HoloViews rotates each label to follow its position around the diagram, which leaves many of them upside down. There are several GitHub issues and Stack Overflow questions about this, but it has not been changed as of now. Luckily, HoloViews supports hooks: functions we can attach to a plot that adjust it after it is rendered, which lets us correct this.
First, let's create our function that will determine the rotation of the text.
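def rotate_label(plot, element):
    # With the Matplotlib backend, the label text objects are available in plot.handles.
    labels = plot.handles["labels"]
    for annotation in labels:
        angle = annotation.get_rotation()
        # Labels on the left half of the circle are the ones drawn upside down.
        if 90 < angle < 270:
            annotation.set_rotation(180 + angle)
            annotation.set_horizontalalignment("right")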
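The angle check is the core of this hook, so it can help to look at it on its own. The sketch below applies the same rule to a few sample angles in plain Python. The flipped_rotation helper is hypothetical, and the modulo is only there to keep the printed angle within 0 to 360; it assumes angles are in degrees in that range, as the 90 < angle < 270 test implies.

```python
def flipped_rotation(angle):
    """Return (rotation, flipped) using the same rule as rotate_label."""
    if 90 < angle < 270:
        # Turn the label a further 180 degrees so it reads left to right.
        return (180 + angle) % 360, True
    return angle, False


for angle in (0, 45, 135, 180, 225, 300):
    rotation, flipped = flipped_rotation(angle)
    status = 'flipped' if flipped else 'unchanged'
    print(f"{angle:>3} -> {rotation:>3} ({status})")
```

Only the angles strictly between 90 and 270 are changed, which matches the labels on the left half of the circle.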
Now, we can create our diagram as before but, this time, pass our new function to the hooks parameter.
hv.Chord((chords_df, nodes_ds)).opts(
    opts.Chord(
        cmap='Category20',
        edge_color=dim('source').astype(str),
        labels='node',
        node_color=dim('index').astype(str),
        hooks=[rotate_label]
    )
)
We now have our finished chord diagram! We can quickly spot that there are a lot of Pokémon with normal and flying. There are also quite a few connections between bug and poison, between grass and poison, and between flying and bug.
Next Steps
Once you are comfortable with chord diagrams, there are a few more things you can do, such as:
- Use Bokeh as the backend instead to get an interactive chord diagram
- Use the select method on the Chord object to filter which data in the chords dataframe gets visualized
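As a rough illustration of the filtering idea, the sketch below drops weak connections from a list of chord records before they would be plotted. It is plain Python rather than the HoloViews select method itself; filter_edges is a hypothetical helper, and min_value is an arbitrary threshold chosen for the example.

```python
def filter_edges(edges, min_value):
    """Keep only the edge records whose value is at least min_value."""
    return [edge for edge in edges if edge['value'] >= min_value]


# A few of the example records from earlier, as plain dicts.
edges = [
    {'source': 0, 'target': 1, 'value': 5},
    {'source': 0, 'target': 2, 'value': 15},
    {'source': 1, 'target': 4, 'value': 45},
    {'source': 1, 'target': 2, 'value': 1},
]
strong_edges = filter_edges(edges, min_value=10)
print(strong_edges)
```

Trimming low-value chords like this can make a crowded diagram easier to read, at the cost of hiding the rarer connections.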
If you create any fun Chord diagrams, let me know!