This is an attempt to use Spotify's audio feature data and k-means clustering to algorithmically generate playlists of similar songs.
Back in 2016, Spotify rolled out their Daily Mixes feature, which automatically generates playlists of songs that you've already saved to your library. Spotify had previously released a number of auto-generated playlists (Discover Weekly, Release Radar), but as the names imply, these playlists were intentionally filled with tracks that you had not already saved. The Daily Mixes were different, instead focusing on creating playlists made up of songs saved in your library. Each mix is intended to hit on a different "listening mode or grouping" specific to each person, which means you might have a lo-fi hip hop mix and a stomp-and-holler folk mix both show up.
I appreciated the concept behind the Daily Mixes, but I often found that the "listening mode" I was into at the moment was not always represented in the Daily Mixes. This made me wonder how difficult it would be to create my own generated mixes so that I could find a playlist of my own music that truly matched the vibe of a song or album I was into. This gave birth to the concept of using clustering to quickly generate a bunch of (hopefully) representative mixes.
I have previously written about and used both k-means clustering and Spotify's API data before, so the two were a natural combination to try for this experiment. Spotify obviously has substantially more data available for their analyses, and they have teams of exceptionally qualified data scientists working on their algorithms, but maybe, just maybe, I would be able to crack into one of their secrets...
Let's start with some basic code setup and then get into the meat of the analysis.
I'm going to use spotipy for interfacing with Spotify and scikit-learn for the clustering analysis.
import os, json
import pandas as pd
import numpy as np
import spotipy
import spotipy.util as util
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA # possibly going to try some PCA
# set API keys
apikeys = json.load(open("data/api-keys.json"))
os.environ["SPOTIPY_CLIENT_ID"] = apikeys["spotipy-client-id"]
os.environ["SPOTIPY_CLIENT_SECRET"] = apikeys["spotipy-client-secret"]
os.environ["SPOTIPY_REDIRECT_URI"] = apikeys["redirect-url"]
# set my user_id
user_id = '129874447'
# connect to spotify
token = util.prompt_for_user_token(user_id, \
    scope = 'user-library-read playlist-modify-public playlist-modify-private')
sp = spotipy.Spotify(auth = token)
### function to get the current user's saved tracks (track name, artist, id)
def get_saved_tracks(limit = 50, offset = 0):
    saved_tracks = [ ]
    # get initial list of tracks to determine total length
    saved_tracks_obj = sp.current_user_saved_tracks(limit = limit, offset = offset)
    num_saved_tracks = saved_tracks_obj['total']
    # loop through to get all saved tracks
    while (offset < num_saved_tracks):
        saved_tracks_obj = sp.current_user_saved_tracks(limit = limit, offset = offset)
        # add track information to running list
        for track_obj in saved_tracks_obj['items']:
            saved_tracks.append({
                'name': track_obj['track']['name'],
                'artists': ', '.join([artist['name'] for artist in track_obj['track']['artists']]),
                'track_id': track_obj['track']['id']
            })
        offset += limit
    return saved_tracks
### function to get tracks from a specified playlist (track name, artist, id)
def get_playlist_tracks(user_id, playlist_id, limit = 100, offset = 0):
    playlist_tracks = [ ]
    # get initial list of tracks in playlist to determine total length
    playlist_obj = sp.user_playlist_tracks(user = user_id, playlist_id = playlist_id, \
                                           limit = limit, offset = offset)
    num_playlist_tracks = playlist_obj['total']
    # loop through to get all playlist tracks
    while (offset < num_playlist_tracks):
        playlist_obj = sp.user_playlist_tracks(user = user_id, playlist_id = playlist_id, \
                                               limit = limit, offset = offset)
        # add track information to running list
        for track_obj in playlist_obj['items']:
            playlist_tracks.append({
                'name': track_obj['track']['name'],
                'artists': ', '.join([artist['name'] for artist in track_obj['track']['artists']]),
                'track_id': track_obj['track']['id']
            })
        offset += limit
    return playlist_tracks
### function to get spotify audio features when given a list of track ids
def get_audio_features(track_ids):
    saved_tracks_audiofeat = [ ]
    # iterate through track_ids in batches of 50
    for ix in range(0, len(track_ids), 50):
        audio_feats = sp.audio_features(track_ids[ix:ix + 50])
        saved_tracks_audiofeat += audio_feats
    return saved_tracks_audiofeat
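As a quick aside, the chunking above leans on Python slices being forgiving past the end of a list, so the final (shorter) batch needs no special handling. A tiny sketch with made-up ids:

```python
# made-up ids, just to show the batching pattern used in get_audio_features
track_ids = [f"id{n}" for n in range(7)]

# step through in batches of 3; slicing past the end just returns the short tail
batches = [track_ids[ix:ix + 3] for ix in range(0, len(track_ids), 3)]
print(batches)  # [['id0', 'id1', 'id2'], ['id3', 'id4', 'id5'], ['id6']]
```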
### function to get all of the current user's playlists (playlist names, ids)
def get_all_user_playlists(playlist_limit = 50, playlist_offset = 0):
    # get initial batch of playlists (first n = playlist_limit) and the total playlist count
    playlists_obj = sp.user_playlists(user_id, limit = playlist_limit, offset = playlist_offset)
    num_playlists = playlists_obj['total']
    # start accumulating playlist names and ids
    all_playlists = [{'name': playlist['name'], 'id': playlist['id']} for playlist in playlists_obj['items']]
    playlist_offset += playlist_limit
    # continue accumulating through all playlists
    while (playlist_offset < num_playlists):
        playlists_obj = sp.user_playlists(user_id, limit = playlist_limit, offset = playlist_offset)
        all_playlists += [{'name': playlist['name'], 'id': playlist['id']} for playlist in playlists_obj['items']]
        playlist_offset += playlist_limit
    return(all_playlists)
With this initial code set up, let's get started! First, we'll pull in a list of all of my saved tracks and then merge on the audio feature data associated with these songs. From there, we should be able to let the clustering algorithm loose, right?
# get list of saved songs
saved_tracks = get_saved_tracks()
saved_tracks_df = pd.DataFrame(saved_tracks)
print("tracks: %d" % saved_tracks_df.shape[0])
saved_tracks_df.head()
# get audio features for saved songs
saved_tracks_audiofeat = get_audio_features(track_ids = list(saved_tracks_df['track_id']))
saved_tracks_audiofeat_df = pd.DataFrame(saved_tracks_audiofeat).drop(['analysis_url', 'track_href', \
'type', 'uri'], axis = 1)
# merge audio features onto tracks df
saved_tracks_plus_df = saved_tracks_df.merge(saved_tracks_audiofeat_df, how = 'left', \
left_on = 'track_id', right_on = 'id').drop('id', axis = 1)
saved_tracks_plus_df.head()
With a full table of songs and hopefully meaningful audio features, we should be good to let the scikit-learn function do its thing.
The goal of this exercise is to make mixes similar to Spotify's Daily Mixes. Their mixes are technically endless (they grow as you listen), but for now let's shoot for playlists of 10 - 20 songs, which should be small enough to tell whether we are getting meaningful results. As of writing this, I have more than 2,500 tracks saved, so it would make sense to create k = 200 clusters.
# try clustering on the full dataset, excluding the non-numeric variables
kmeans = KMeans(n_clusters = 200).fit(saved_tracks_plus_df.drop(['track_id', 'name', 'artists'], axis = 1))
# add results to df
saved_tracks_plus_df['cluster'] = pd.Series(kmeans.labels_) + 1
With our tracks clustered together, let's take a look at a few and see what we've got!
saved_tracks_plus_df[saved_tracks_plus_df['cluster'] == 1].head(10)
saved_tracks_plus_df[saved_tracks_plus_df['cluster'] == 94]
These clusters don't look that great. The songs all seem pretty different from each other, or at least no more similar than if I were to just take a random sample from my saved library. I don't think I would be in the mood to listen to Neutral Milk Hotel and clipping at the same time. What could be going on?
Just from scanning the data, it seems like all of the songs in a cluster do have one thing in common: song length. The songs being grouped together share a relatively similar duration_ms, which makes sense because that variable has far more variance than the other metrics, most of which range from 0 - 1. The k-means algorithm will be driven by the highest-variance variables, even when that is not our intention. This is why you are supposed to normalize your data first!
While this is a slightly interesting result - maybe I want to listen to exactly 13 songs in 45 minutes, so I need all of my songs to be 3:28 long - it was not exactly what I was shooting for. It does highlight the importance of normalizing and centering my data though (silly mistake on my part)!
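To make the problem concrete, here's a toy sketch (with made-up numbers, not my actual library) of the Euclidean distances k-means actually works with, before and after standardizing:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up feature rows: [danceability (0-1), duration_ms]
X = np.array([
    [0.1, 210_000.0],
    [0.9, 211_000.0],   # very different vibe, nearly identical length
    [0.1, 480_000.0],   # same vibe as the first row, very different length
])

# raw Euclidean distances: duration_ms swamps danceability entirely
raw_d01 = np.linalg.norm(X[0] - X[1])
raw_d02 = np.linalg.norm(X[0] - X[2])
print(raw_d01 < raw_d02)  # True: raw k-means would pair the first two rows

# after standardizing, both features contribute on a comparable scale
Xs = StandardScaler().fit_transform(X)
scaled_d01 = np.linalg.norm(Xs[0] - Xs[1])
scaled_d02 = np.linalg.norm(Xs[0] - Xs[2])
print(abs(scaled_d01 - scaled_d02) < 0.1)  # True: length no longer dominates
```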
Before rushing into another attempt (with normalized data), it may also be worth taking a step back and seeing if clustering based on this audio feature data even produces meaningful results. One of the difficulties of any clustering analysis is validating your results and knowing if the clusters that are output even make sense. Before getting too ahead of myself, I want to try clustering the songs of two very different pre-made playlists and see if I have any luck. If that works, great; if not, I might be SOL.
For ease, I can use Spotify's mass selection of playlists to find two drastically different playlists. To start, let's compare the smooth and calm sounds of the "Ambient Chill" playlist to the more "lit" musings of "Get Turnt".
# get tracks for "ambient chill" playlist
testA_tracks = get_playlist_tracks(user_id = 'spotify', playlist_id = '37i9dQZF1DX3Ogo9pFvBkY')
testA_tracks_df = pd.DataFrame(testA_tracks)
testA_tracks_df['playlist'] = "ambient chill"
# get tracks for "get turnt" playlist
testB_tracks = get_playlist_tracks(user_id = 'spotify', playlist_id = '37i9dQZF1DWY4xHQp97fN6')
testB_tracks_df = pd.DataFrame(testB_tracks)
testB_tracks_df['playlist'] = "get turnt"
# stack all tracks together
testAB_tracks_df = pd.concat([testA_tracks_df, testB_tracks_df]).sort_values(by = "track_id")
testAB_tracks_df.head()
_testAB_audiofeat = get_audio_features(track_ids = list(testAB_tracks_df['track_id']))
_testAB_audiofeat_df = pd.DataFrame(_testAB_audiofeat).drop(['analysis_url', 'track_href', 'type', 'uri'], axis = 1)
_testAB_audiofeat_df.head()
testAB_audiofeat_scaler = StandardScaler()
testAB_audiofeat = testAB_audiofeat_scaler.fit_transform(_testAB_audiofeat_df.drop(['id'], axis = 1))
testAB_audiofeat_df = pd.DataFrame(testAB_audiofeat, columns = _testAB_audiofeat_df.drop('id', axis = 1).columns)
testAB_audiofeat_df['id'] = _testAB_audiofeat_df['id']
testAB_audiofeat_df.head()
testAB_tracks_plus_df = testAB_tracks_df.merge(testAB_audiofeat_df, how = 'left', \
left_on = 'track_id', right_on = 'id').drop('id', axis = 1)
testAB_tracks_plus_df.head()
# try clustering the full stack of songs into two distinct playlists
kmeans = KMeans(n_clusters = 2).fit(testAB_tracks_plus_df.drop(['track_id', 'name', \
                                                                'artists', 'playlist'], axis = 1))
testAB_tracks_plus_df['cluster'] = pd.Series(kmeans.labels_) + 1
# see if successful (hopefully see the playlists are clustered mutually exclusively)
testAB_tracks_plus_df[['track_id', 'playlist', 'cluster']].groupby(['playlist', 'cluster']).agg('count')
Aha! Using normalized audio feature data, the songs do get properly clustered into their respective, mutually exclusive playlists. To an extent, this is validation of the method, at least on two quite different sounding sets of songs.
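As a side note, if we wanted to go beyond eyeballing the crosstab, a standard way to score this kind of agreement is scikit-learn's adjusted Rand index (not something I used in the original run, just a sketch with hypothetical labels): 1.0 means the clusters perfectly recover the playlist labels, while values near or below 0 mean roughly chance-level agreement.

```python
from sklearn.metrics import adjusted_rand_score

# hypothetical labels: the playlist each track came from vs. the cluster it landed in
playlist_labels = ["ambient chill"] * 4 + ["get turnt"] * 4
cluster_labels  = [1, 1, 1, 1, 2, 2, 2, 2]   # a perfect two-way split
print(adjusted_rand_score(playlist_labels, cluster_labels))  # 1.0

shuffled = [1, 2, 1, 2, 1, 2, 1, 2]          # clusters that ignore the playlists
print(adjusted_rand_score(playlist_labels, shuffled) < 0.5)  # True
```

On the real dataframe this would just be adjusted_rand_score(testAB_tracks_plus_df['playlist'], testAB_tracks_plus_df['cluster']).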
As a secondary experiment, before trying it on my own library, I want to see how it performs on more similar playlists.
But first, I'm going to throw this code into some functions for later use.
### function to create "tracks plus" df (including normalized audio features) when given a tracks df
def build_tracks_plus_df(tracks_df, normalize = True):
    # get raw audio features
    _audiofeat = get_audio_features(track_ids = list(tracks_df['track_id']))
    _audiofeat_df = pd.DataFrame(_audiofeat).drop(['analysis_url', 'track_href', 'type', 'uri'], axis = 1)
    # scale audio features (if desired)
    if normalize:
        scaler = StandardScaler()
        audiofeat = scaler.fit_transform(_audiofeat_df.drop(['id'], axis = 1))
        audiofeat_df = pd.DataFrame(audiofeat, columns = _audiofeat_df.drop('id', axis = 1).columns)
        audiofeat_df['id'] = _audiofeat_df['id']
    else:
        audiofeat_df = _audiofeat_df
    # merge audio features with tracks_df
    tracks_plus_df = tracks_df.merge(audiofeat_df, how = 'left', left_on = 'track_id', right_on = 'id')
    return(tracks_plus_df)
### function to cluster tracks based on normalized audio features
def cluster_tracks_plus_df(tracks_plus_df, num_clusters, drop_vars = None):
    kmeans = KMeans(n_clusters = num_clusters).fit(tracks_plus_df.drop(['track_id', 'id', 'name', 'artists'] + \
                                                                       (drop_vars if drop_vars is not None else []), \
                                                                       axis = 1))
    tracks_plus_df['cluster'] = pd.Series(kmeans.labels_) + 1
    return(tracks_plus_df)
This time around, let's try the same approach but on three playlists that are all pretty similar (really just slight variants of the same type of indie music). Because these are all pretty similar to one another (there are subtle differences, but even to an untrained human, it would be hard to differentiate these), I don't expect these results to be as clean as the last run, but we'll see!
# get tracks for "lo-fi indie" playlist
testC_tracks = get_playlist_tracks(user_id = 'spotify', playlist_id = '37i9dQZF1DX0CIO5EOSHeD')
testC_tracks_df = pd.DataFrame(testC_tracks)
testC_tracks_df['playlist'] = "lo-fi indie"
# get tracks for "dreampop" playlist
testD_tracks = get_playlist_tracks(user_id = 'spotify', playlist_id = '37i9dQZF1DX6uhsAfngvaD')
testD_tracks_df = pd.DataFrame(testD_tracks)
testD_tracks_df['playlist'] = "dreampop"
# get tracks for "bedroom pop" playlist
testE_tracks = get_playlist_tracks(user_id = 'spotify', playlist_id = '37i9dQZF1DXcxvFzl58uP7')
testE_tracks_df = pd.DataFrame(testE_tracks)
testE_tracks_df['playlist'] = "bedroom pop"
# stack all tracks together
testCDE_tracks_df = pd.concat([testC_tracks_df, testD_tracks_df, testE_tracks_df])
testCDE_tracks_df.head()
# build plus df and cluster
testCDE_tracks_plus_df = cluster_tracks_plus_df(build_tracks_plus_df(testCDE_tracks_df), 3, drop_vars = ['playlist'])
testCDE_tracks_plus_df[['track_id', 'playlist', 'cluster']].groupby(['playlist', 'cluster']).agg('count')
The results aren't as clean on this try, but that makes sense given how similar these playlists sound. On my own listen, it wasn't entirely obvious why a song would fall into one playlist rather than another, so it would be a lot to expect the clustering algorithm to separate them perfectly.
Nonetheless, the goal of this experiment is not to make perfectly partitioned playlists purely based on Spotify's own genres, but instead to create playlists that have similar vibes, hopefully grouping together songs that are not entirely obvious at first listen.
# get list of saved songs
saved_tracks_df = pd.DataFrame(saved_tracks)
saved_tracks_plus_df = build_tracks_plus_df(saved_tracks_df, True)
saved_tracks_clustered1_df = cluster_tracks_plus_df(saved_tracks_plus_df, 200)
saved_tracks_clustered1_df[saved_tracks_clustered1_df['cluster'] == 1].head()
Now that we have the ability to generate some clustered playlists and we've gotten some minor validation that it works on Spotify's own playlists, it is worth seeing how successful this method is when it makes playlists from my own library. To do this, let's try analyzing these clusters (1) quantitatively, (2) qualitatively, and (3) visually.
A major input to the k-means algorithm is k, the number of unique clusters we want to generate from our data. If we have a fairly uniform dataset, we may only need two or three clusters. But as our data gets more varied and diverse, we will be better off using more clusters to do the job right. However, more clusters is not necessarily better, from both a data science perspective and a musical playlist perspective.
First, from a data science perspective, while increasing the number of clusters is likely to increase the similarity of the songs within each cluster, there are diminishing returns as the number of clusters grows: k = 20 may not be meaningfully better than k = 15. Second, from a music perspective, if we increase the number of clusters too much, we will end up with clusters that have only one or two songs, which defeats the purpose of this whole exercise. Therefore, we want to find the sweet spot where k is not too big and not too small.
To do this, we can use a "scree plot". A scree plot is a way to quantify and visualize the ideal number (or rather range) of clusters for our analysis. As we increase k, we expect the similarity of our clustered songs to improve, which we can measure via the sum of squared distances (see the inertia_ attribute in the scikit-learn k-means documentation).
ssds = [ ]
for k in [1, 2, 3, 5, 10, 20, 30, 40, 50, 75, 100, 125, 150, 200, 300, 400, 500, 1000]:
    print("clustering with %d clusters..." % k)
    kmeans = KMeans(n_clusters = k).fit(saved_tracks_plus_df.drop(['track_id', 'id', 'name', 'artists'], axis = 1))
    ssds.append({"k": k, "ssd": kmeans.inertia_})
import matplotlib.pyplot as plt
ssds_df = pd.DataFrame(ssds)
plt.rcParams['figure.figsize'] = [12, 4]
plt.subplot(1, 2, 1)
plt.scatter(ssds_df["k"], ssds_df["ssd"])
plt.xlabel("number of clusters")
plt.ylabel("similarity (sum square of distance)")
plt.subplot(1, 2, 2)
ssds_df_short = ssds_df[ssds_df["k"] <= 100]
plt.scatter(ssds_df_short["k"], ssds_df_short["ssd"])
plt.xlabel("number of clusters")
plt.ylabel("similarity (sum square of distance)")
plt.show()
When we try clustering our songs using a range of choices for k, we get a huge improvement going from k = 1 to k = 2, but the jumps from k = 2 to k = 3, k = 3 to k = 4, and so on get sizably smaller each time. Ideally, we want to be at any point past the "elbow bend" in our scree plot. Based on our results, even though increasing k may keep improving our clusters, the improvement diminishes substantially after k = 20 or so, which means any choice of k >= 20 that also maintains decently long playlists (roughly 20 songs) will work for our purposes.
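If we wanted to pick the elbow programmatically instead of eyeballing it, one simple heuristic (a sketch, not what I actually did) is to take the first k where the relative drop in SSD falls below some threshold:

```python
def pick_elbow(ks, ssds, min_rel_drop = 0.05):
    """Return the first k whose relative SSD improvement over the previous
    k falls below min_rel_drop (a crude stand-in for eyeballing the elbow)."""
    for i in range(1, len(ks)):
        if (ssds[i - 1] - ssds[i]) / ssds[i - 1] < min_rel_drop:
            return ks[i]
    return ks[-1]

# made-up SSD curve that flattens out around k = 4 or 5
toy_ks   = [1, 2, 3, 4, 5, 6]
toy_ssds = [100.0, 50.0, 30.0, 22.0, 21.0, 20.5]
print(pick_elbow(toy_ks, toy_ssds))  # 5
```

The threshold is arbitrary, and nudging min_rel_drop moves the answer, which is why the scree plot is better treated as suggesting a range of k rather than a single value.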
If we pick k = 100, the vast majority of our clusters have at least 10 songs, with most having 25 or more. This will work for us.
saved_tracks_clustered2_df = cluster_tracks_plus_df(saved_tracks_plus_df, 100)
clusters, counts = np.unique(saved_tracks_clustered2_df["cluster"], return_counts = True)
counts_df = pd.DataFrame(list(zip(clusters, counts)), columns = ["cluster", "count"])
print("clusters with at least 10 songs: %d" % counts_df[counts_df["count"] >= 10].shape[0])
print("clusters with at least 20 songs: %d" % counts_df[counts_df["count"] >= 20].shape[0])
print("clusters with at least 25 songs: %d" % counts_df[counts_df["count"] >= 25].shape[0])
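Another standard quantitative check, which I'll just sketch here rather than run on my library, is the silhouette score: it compares each song's distance to its own cluster against its distance to the nearest other cluster, landing near +1 for tight, well-separated clusters and near 0 for overlapping ones. A toy example on made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# two made-up, well-separated blobs standing in for audio-feature vectors
X = np.vstack([
    rng.normal(loc = 0.0, scale = 0.1, size = (30, 4)),
    rng.normal(loc = 5.0, scale = 0.1, size = (30, 4)),
])

labels = KMeans(n_clusters = 2, n_init = 10, random_state = 0).fit_predict(X)
print(silhouette_score(X, labels) > 0.9)  # True for such clean separation
```

On real, messy library data I'd expect a far lower (but hopefully still positive) score.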
Another way we can assess the quality of our clusters is by simply looking at how artists and albums are split up. Though there is obviously variance within an artist's discography, and even within a single album, we would expect a particular artist or album to fall into a relatively small number of clusters. If we see a certain artist appearing a few times in a handful of clusters, great! If we see every single one of an artist's songs land in entirely different clusters, we may have to reassess.
To test this, we can count the number of distinct clusters that certain artists fall into, relative to the number of songs they have.
clusters_per_artist_df = saved_tracks_clustered2_df[["artists", "track_id", "cluster"]].groupby(["artists"]).agg(lambda x : len(np.unique(x)))
clusters_per_artist_df['songs_per_cluster'] = clusters_per_artist_df['track_id'] / clusters_per_artist_df['cluster']
print("number of artists: %d" % clusters_per_artist_df.shape[0])
print("average (median) number of songs per artist: %1.2f (%1.2f)" %
(clusters_per_artist_df['track_id'].mean(), clusters_per_artist_df['track_id'].median()))
print("average (median) number of songs from same artist per cluster: %1.2f (%1.2f)" %
(clusters_per_artist_df['songs_per_cluster'].mean(), clusters_per_artist_df['songs_per_cluster'].median()))
artists_of_interest = ["NEEDTOBREATHE", "Gregory Alan Isakov", "Wild Child", "Otis Redding", "Bad Bad Hats", "The Rubens"]
clusters_per_artist_df[clusters_per_artist_df.index.isin(artists_of_interest)]
In general, it seems like the artists in my library are relatively split up, appearing around once per cluster. However, the majority of artists have only one song in my library, making it impossible for them to appear in more than one cluster. With 851 distinct artists and only 100 clusters, it would be difficult for a significant number of artists to appear in more than one playlist.
Of the six artists I singled out (some of my most listened-to artists), some are more split than others. NEEDTOBREATHE has the most songs but appears in relatively few different playlists, averaging more than two songs per cluster. We see similar concentration for Otis Redding. On the other end, Bad Bad Hats, who have fewer songs, appear in quite a few clusters, with relatively few duplicate appearances.
However, even though our songs_per_cluster metric is relatively low across most of our artists, I'm choosing to see this as another win for the experiment. The goal was never to create algorithmically generated playlists of one or two artists, since that is relatively trivial and not particularly helpful. We'd ideally want a mix of artists (of similar vibe) to appear in our playlists, and it seems like that is happening for the most part.
Finally, to assess the quality of our clustering both within a specific playlist and relative to other playlists, we can visualize the audio features of our songs. Since our clustering is based on similarity in these audio features, we would expect songs within a specific cluster to have similar audio features. In our initial attempt at clustering, we saw this happen overwhelmingly based on the length of the songs. Now that we have normalized our data, we still expect this to happen, just on more meaningful features.
To visualize this, we can use the old trusty box plot to compare the variability in different features within and between clusters.
# pick random sample of clusters to compare
rand_clusters = np.random.choice(saved_tracks_clustered2_df['cluster'].unique(), 5, False).tolist()
rand_clusters.sort()
print("random clusters: " + str(rand_clusters))
rand_clusters_df = saved_tracks_clustered2_df[saved_tracks_clustered2_df['cluster'].isin(rand_clusters)]
rand_clusters_df.head()
def boxplot_cluster_features(df, feat, clusters):
    plt.rcParams['figure.figsize'] = [12, 5]
    fig, ax = plt.subplots()
    df[["cluster", feat]].boxplot(by = "cluster", ax = ax)
    # overlay the individual songs, jittered horizontally so they don't stack
    for i in range(1, len(clusters) + 1):
        y = df[df["cluster"] == clusters[i - 1]][feat]
        x = np.random.normal(i, 0.04, size = len(y))
        plt.plot(x, y, 'r.', alpha = 0.5)
boxplot_cluster_features(rand_clusters_df, "acousticness", rand_clusters)
boxplot_cluster_features(rand_clusters_df, "danceability", rand_clusters)
boxplot_cluster_features(rand_clusters_df, "energy", rand_clusters)
These box plots show that, at least based on the features we looked at here, the clustering does seem to have worked effectively. Some of our clusters are similar on certain dimensions, which makes sense given they are sampled from my relatively consistent Spotify library (a lot of indie rock and pop). But some clusters (e.g., cluster 67) stand out from the rest. Looking at cluster 67, almost all of its songs are live recordings, which leads to significant differences in energy and acousticness, exactly what we see in the box plots above.
saved_tracks_clustered2_df[saved_tracks_clustered2_df['cluster'] == 67]
Overall, another score for k-means music!
With all of this, I vote this experiment was a moderate -- maybe even overwhelming :) -- success, at least based on the few tests above. However, the way to truly test these playlists is to listen to them. There is a huge amount of data wrapped up in the music we listen to, but sometimes it just comes down to the ~vibe~ you get from a playlist, which is hard to quantify.
I'm going to generate some of my own playlists and give them a listen. Feel free to check out my playlists (link1, link2, link3) and I also encourage you to generate some from your own library! I've saved all of the relevant code (without any of the validation code) here, so feel free to play around for yourself. You'll have to set up your own Spotify API application to get a usable key, but you can find instructions on doing so here.
Feel free to reach out with any questions, comments, or criticisms of the above - always looking to improve on my methods and music.
def save_cluster_tracks_to_playlist(playlist_name, track_ids):
    # get all of the user's playlists
    all_playlists = get_all_user_playlists()
    # check if playlist already exists; create it if not
    if (playlist_name not in [playlist['name'] for playlist in all_playlists]):
        playlist = sp.user_playlist_create(user = user_id, name = playlist_name, public = True)
    else:
        playlist_id = [playlist['id'] for playlist in all_playlists if playlist['name'] == playlist_name][0]
        playlist = sp.user_playlist(user = user_id, playlist_id = playlist_id)
    # remove any existing tracks in the playlist
    while (playlist['tracks']['total'] > 0):
        sp.user_playlist_remove_all_occurrences_of_tracks(user_id, playlist['id'], \
            tracks = [track['track']['id'] for track in playlist['tracks']['items']])
        playlist = sp.user_playlist(user = user_id, playlist_id = playlist['id'])
    # add tracks from the cluster
    sp.user_playlist_add_tracks(user_id, playlist_id = playlist['id'], tracks = track_ids)
rand_clusters2 = np.random.choice(saved_tracks_clustered2_df['cluster'].unique(), 3, False).tolist()
for i in range(1, len(rand_clusters2) + 1):
    cluster = saved_tracks_clustered2_df[saved_tracks_clustered2_df['cluster'] == rand_clusters2[i - 1]]
    print(cluster[["name", "artists", "cluster"]].head())
    save_cluster_tracks_to_playlist("k-means, cluster %d" % i, list(cluster['track_id']))