
Repeated clustering


The problem

If source analysis is to be performed on the group level, e.g. to investigate spectral power changes in the parietal cortex during movement, it is necessary to find independent components (ICs) of all participants that correspond to this location. To this end, k-means clustering can be used, which groups similar components based on weighted measures such as dipole location, scalp topography, spectrum, ERPs, or ERSPs.

Choosing the weights is up to the experimenter, but it is recommended to weight the location highly (e.g. 3) and to add topographies (e.g. 1) and spectra (e.g. 1) to the mix; depending on the situation, ERPs and ERSPs can also be used. However, when using ERPs or ERSPs, it can be argued that double-dipping happens in the selection of the relevant ICs (the measure that is later used to compute statistics is also used to select the ICs). A counter-argument is that the clustering uses average measures, while the statistics are used to investigate condition differences. All in all, no final rule on how to choose the weights can be given.
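
For illustration, a weighting along these lines could be passed to EEGLAB's preclustering (which bemobil_precluster, described under Usage below, wraps). The call roughly follows the EEGLAB STUDY tutorial; the specific measures, weights, and options are only an example, not a recommendation for your data:

```matlab
% Example preclustering weights: dipole location weighted 3, scalp topography and
% spectrum weighted 1 each (options approximate the EEGLAB STUDY tutorial; check
% the EEGLAB documentation for the exact syntax).
[STUDY, ALLEEG] = std_preclust(STUDY, ALLEEG, 1, ...
    { 'dipoles'           'norm' 1 'weight' 3 }, ...
    { 'scalp'   'npca' 10 'norm' 1 'weight' 1 'abso' 1 }, ...
    { 'spec'    'npca' 10 'norm' 1 'weight' 1 'freqrange' [3 25] });
```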

The biggest issue with this approach, however, is not the selection of the weights, but the fact that the k-means clustering is not stable. Repeating the clustering can give different results, and depending on the location and the similarity of the ICs, the cluster of interest (the cluster closest to your region of interest, ROI) can contain surprisingly different ICs.
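
A minimal sketch of this instability on synthetic data (standing in for the real preclustering matrix) is shown below; two runs with different random starts can partition the same ICs quite differently:

```matlab
% Minimal sketch of k-means instability: two runs with different random starts give
% different partitions of the same (synthetic) data.
rng('shuffle')
data = randn(200, 3) * 20;            % 200 fake ICs described by 3 features

idx1 = kmeans(data, 30);              % first clustering solution
idx2 = kmeans(data, 30);              % second clustering solution

% Fraction of IC pairs on which the two solutions disagree about being co-clustered:
same1 = idx1 == idx1';
same2 = idx2 == idx2';
fprintf('Pairwise co-clustering disagreement: %.1f%%\n', 100 * mean(same1(:) ~= same2(:)));
```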

Our solution

We implemented a repeated clustering approach which runs the clustering several hundred or thousand times and, for each clustering solution, selects the cluster of interest (the one closest to your ROI in MNI coordinates). To find the correct coordinates for your ROI, you can use this tool.
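
Conceptually, the core of this approach is a loop that reruns the clustering and stores the cluster closest to the ROI for each repetition. A self-contained sketch with synthetic data and illustrative names (not the pipeline's internals):

```matlab
% Conceptual sketch of the repeated clustering loop (synthetic data, illustrative names).
roi_mni = [0 -45 10];                 % target ROI in MNI coordinates (example value)
nReps   = 1000;                       % number of clustering repetitions
data    = randn(200, 3) * 20;         % stand-in for the real preclustering matrix

roiClusterICs = cell(nReps, 1);
for rep = 1:nReps
    [idx, centroids]   = kmeans(data, 30);                       % one clustering solution
    [~, roiCluster]    = min(sum((centroids - roi_mni).^2, 2));  % cluster closest to the ROI
    roiClusterICs{rep} = find(idx == roiCluster);                % ICs of the cluster of interest
end
```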

For each of these ROI clusters, a set of quality measures is derived: the number of subjects in that cluster, the number of ICs per subject on average, the spread of the cluster (normalized by the number of ICs), the mean residual variance of the ICs in the cluster, the distance of the cluster centroid from the ROI, and the Mahalanobis distance from the median of the multivariate distribution of all cluster solutions (this is essentially a measure of how normal or representative the given solution is). In this plot, you can see an example of 10,000 repetitions with a region of interest in the retrosplenial cortex (N=19):

[Figure: repeated clustering multivariate data]

Clearly, the distributions are rather wide, and if only a single clustering is performed, an unlucky selection could be far from usable or representative of the average.
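
The Mahalanobis distance from the median can be computed along the following lines; this is a sketch on a synthetic quality-measure matrix, not the pipeline's code:

```matlab
% Sketch: Mahalanobis distance of each clustering solution from the median of the
% multivariate distribution of its quality measures (synthetic stand-in values).
nReps           = 10000;
qualityMeasures = randn(nReps, 5);            % rows: solutions, columns: quality measures

medianVec = median(qualityMeasures, 1);
centered  = qualityMeasures - medianVec;      % deviations from the median solution
covMat    = cov(qualityMeasures);
mahalDist = sqrt(sum((centered / covMat) .* centered, 2));   % one distance per solution
```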

Ideally, we are looking for a solution that contains as many subjects as possible (so the final measures are representative of the group), few ICs per subject (because several ICs per subject are difficult to interpret), a low distance from the ROI and a low normalized spread (a tight cluster around the ROI), a low residual variance (we want to investigate physiologically plausible ICs), and a low distance from the median. To this end, each quality measure is assigned a weight, and the clustering solutions are sorted according to their summed score. The solution with the highest combined score is taken as the final clustering solution that can be used for further analysis.
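
Conceptually, this ranking boils down to normalizing the quality measures, weighting them (with negative weights where lower is better), summing, and sorting; the column order and weights below are illustrative, not the pipeline's defaults:

```matlab
% Sketch of the ranking step. Assumed column order (illustrative only):
% [nSubjects, ICsPerSubject, normSpread, meanResidualVariance, distFromROI, mahalDist]
qualityMeasures = randn(10000, 6);                 % synthetic stand-in values
zScores = (qualityMeasures - mean(qualityMeasures)) ./ std(qualityMeasures);
weights = [2, -1, -1, -1, -2, -1];                 % positive: more is better, negative: less is better

scores          = zScores * weights';              % one combined score per clustering solution
[~, ranking]    = sort(scores, 'descend');
bestSolutionIdx = ranking(1);                      % final solution used for further analysis
topSolutions    = ranking(1:5);                    % five highest-ranking solutions (see plots below)
```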

To make sure that no single outlier solution is taken as the final one, one can on the one hand weight the Mahalanobis distance more heavily; on the other hand, we provide plots of the locations and average scalp topographies of the five highest-ranking solutions. These should look very similar, indicating that the results are stable. Depending on the location, it might be possible to achieve a stable solution with only 100 repetitions, but trickier locations like the retrosplenial cortex may require several thousand repetitions.

If one is interested not only in one ROI but in several, two options are possible: 1) optimize separately for each ROI and create different STUDY files accordingly (this means that the same IC may be present in two clusters if the ROIs are too close together), or 2) if one ROI is more important than the others, optimize only for that ROI and take the other ROI clusters from the same solution.

Usage

To use this solution, the clustering must first be prepared with bemobil_precluster on your STUDY, using your desired clustering weights. This function essentially wraps the EEGLAB preclustering, but it stores additional information that is needed later on. Once the STUDY is prepared for clustering, the function bemobil_repeated_clustering_and_evaluation performs all necessary steps. It requires the general clustering parameters (outlier sigma, number of clusters) and, in addition, the number of repetitions, the ROI to optimize for, the weight vector for the quality measures, and several filepaths to store the intermediate results. The output is a STUDY with the final, best clustering solution according to the specified weights, as well as the plots of the five top solutions. Please refer to the help of the two functions for more detailed information on the inputs.
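
As a rough orientation, the inputs described above might be prepared as follows; the variable names, values, and struct layout are assumptions for illustration, so please take the real interface from the help of the two functions:

```matlab
% Hypothetical input preparation only -- names and values are illustrative assumptions;
% consult help bemobil_precluster and help bemobil_repeated_clustering_and_evaluation
% for the actual argument lists.
clustering_weights      = struct('dipoles', 3, 'scalp_topographies', 1, 'spectra', 1); % assumed layout
outlier_sigma           = 3;                  % outlier threshold in standard deviations (assumed meaning)
n_clusters              = 30;
n_repetitions           = 10000;
roi_mni                 = [0 -45 10];         % ROI to optimize for (example coordinate)
quality_measure_weights = [2 -1 -1 -1 -2 -1]; % weights for the six quality measures (example)
output_path             = '/data/my_study/clustering/';   % filepaths for intermediate results

% [STUDY, ALLEEG, EEG] = bemobil_precluster( ... );                         % see its help
% [STUDY, ALLEEG, EEG] = bemobil_repeated_clustering_and_evaluation( ... ); % see its help
```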