All approaches proposed in prior work on privacy-preserving expectation maximization (PPEM) involve a trade-off between accuracy, privacy, and performance.
For example, algorithms based on Fully Homomorphic Encryption (FHE) can be time-consuming and computationally heavy, since computing over encrypted data
incurs a high computational overhead; privacy, however, is maintained, because FHE allows secure computation on encrypted data
without requiring access to the plaintext.
Another example is PPEM based on Secure Multi-Party Computation (SMPC): maintaining privacy is a primary goal of SMPC. However, achieving perfect privacy comes
at the cost of computational complexity and performance, and the efficiency of SMPC degrades as the number of parties involved in the computation grows.
We propose a protocol for privacy-preserving expectation maximization that attains the desirable properties of FHE
while simplifying calculations to only require addition operations on encrypted data.
This approach allows us to achieve a high level of privacy while also improving performance and reducing computational overhead.
By utilizing the strengths of FHE while streamlining the computation process,
we are able to strike a balance between accuracy, privacy, and performance in our proposed approach.
We address the following scenario:
- Data is distributed and private: there are $n$ parties, each holding its own private data $x_i$ (a two-dimensional point), and we wish to fit a GMM to the full dataset without revealing the private data of any party.
- We propose a client-server model, in which the cloud service is an untrusted third party that acts as the central server, providing a service to the clients, who are the owners of the private data.
- We assume an Honest-but-Curious adversary and do not consider the Malicious case; thus we can assume that the server (the untrusted third party, the "cloud") is not malicious: it follows the protocol but may attempt to infer information from the messages it observes.
- To generate, distribute, and manage the cryptographic keys for the CKKS scheme (via the TenSEAL library), we utilize a Key Management Service (KMS).
Consider the scenario where the full dataset $\{x_i\}_{i=1}^{n}$ is modeled by a Gaussian Mixture Model with $K$ components, with mixture weights $\pi_k$, means $\mu_k$, and covariances $\Sigma_k$ for $k = 1, \dots, K$.
Assuming there are $n$ nodes, each node $i$ holding a single private point $x_i$, the EM algorithm is iterative, and for each iteration it performs an E-step followed by an M-step. In the E-step, the responsibilities are computed:
$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)},$$
where $\mathcal{N}(x \mid \mu, \Sigma)$ is the multivariate Gaussian density, and $\gamma_{ik}$ is the responsibility of component $k$ for point $x_i$.
It is worth noting that in the distributed version of the Expectation Maximization algorithm, the E-step can be carried out locally at each node, since $\gamma_{ik}$ depends only on $x_i$ and the current global parameters.
However, the M-step requires aggregating statistics over the data of all nodes.
To address this, we simplify the M-step by breaking it down into intermediate updates that are also computed locally and periodically communicated to the central server.
For each iteration and for every Gaussian component $k$, each node computes the following intermediate updates from its private point:
$$u_{ik} = \gamma_{ik}, \qquad v_{ik} = \gamma_{ik} \, x_i, \qquad W_{ik} = \gamma_{ik} \, x_i x_i^{\top}.$$
All the above updates can be computed locally at node $i$, since they depend only on $x_i$ and the current global parameters.
Node $i$ encrypts its intermediate updates under the CKKS scheme and sends the ciphertexts to the server.
After receiving these intermediate updates from all nodes, the server computes the sum of all the vectors (for a specific Gaussian component $k$) using only ciphertext additions:
$$U_k = \sum_{i=1}^{n} u_{ik}, \qquad V_k = \sum_{i=1}^{n} v_{ik}, \qquad W_k = \sum_{i=1}^{n} W_{ik}.$$
The server sends back the (still encrypted) results to the nodes, which decrypt them.
These sums are used to update the global estimates of the mixture weights and the means and covariances of each Gaussian component (global updates):
$$\pi_k = \frac{U_k}{n}, \qquad \mu_k = \frac{V_k}{U_k}, \qquad \Sigma_k = \frac{W_k}{U_k} - \mu_k \mu_k^{\top}.$$
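For concreteness, one iteration of this distributed EM scheme can be sketched in plain Python, with the encryption layer omitted (the server-side work reduces to the additions in the accumulators, which is exactly what would be carried out on ciphertexts). All function and variable names are ours, and we work the 2x2 covariance algebra directly for the two-dimensional case.

```python
import math

def gauss2d(x, mu, cov):
    """Density of a 2-D Gaussian at point x (cov is a symmetric 2x2 matrix)."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    quad = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

def em_iteration(points, weights, means, covs):
    """One distributed EM iteration: local E-step and intermediate updates
    at each node, purely additive aggregation at the server."""
    n, K = len(points), len(weights)
    # Server-side accumulators: sums of gamma, gamma*x, gamma*x x^T.
    U = [0.0] * K
    V = [[0.0, 0.0] for _ in range(K)]
    W = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(K)]
    for x in points:  # each loop body corresponds to one node's local work
        dens = [weights[k] * gauss2d(x, means[k], covs[k]) for k in range(K)]
        total = sum(dens)
        for k in range(K):
            g = dens[k] / total  # responsibility gamma_{ik}
            U[k] += g            # the server only ever performs additions
            V[k][0] += g * x[0]
            V[k][1] += g * x[1]
            for a in range(2):
                for b in range(2):
                    W[k][a][b] += g * x[a] * x[b]
    # Global updates, computed by each node from the aggregate sums.
    new_w = [U[k] / n for k in range(K)]
    new_mu = [[V[k][0] / U[k], V[k][1] / U[k]] for k in range(K)]
    new_cov = [[[W[k][a][b] / U[k] - new_mu[k][a] * new_mu[k][b]
                 for b in range(2)] for a in range(2)] for k in range(K)]
    return new_w, new_mu, new_cov
```

Because the aggregation consists of sums alone, the server's work maps directly onto CKKS ciphertext additions; everything else (the densities in the E-step, the divisions in the global updates) happens on plaintext at the nodes.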
Let us quantify the individual privacy of each node's private data using mutual information.
Suppose each node shared its intermediate updates vector without encrypting it first. Then, although the node does not share its private data directly, the data is still revealed to the server: from the intermediate updates $\gamma_{ik}$ and $\gamma_{ik} x_i$, the server can recover $x_i = (\gamma_{ik} x_i) / \gamma_{ik}$ for any component $k$ with $\gamma_{ik} > 0$.
That is, at each iteration the server has the following mutual information with the private data: $I(x_i;\, \gamma_{ik}, \gamma_{ik} x_i) = H(x_i)$, i.e., the unencrypted intermediate updates fully determine $x_i$.
In our approach, the server is unable to determine the private data $x_i$: it observes only CKKS ciphertexts, and under the semantic security of the scheme it learns (computationally) nothing about the plaintexts.
Moreover, at the end of each iteration, each node solely acquires the global updates, which do not expose any information regarding the private data of other nodes. Specifically, from the global updates a node can acquire only the aggregate sums over all nodes, not any individual point $x_j$ of another node.
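The leakage from unencrypted intermediate updates is easy to see concretely: the server can divide the second update by the first and recover the private point. A minimal illustration (the variable names are ours):

```python
# A node's private 2-D point and a (non-zero) responsibility
# for some Gaussian component.
x = (3.7, -1.2)
gamma = 0.42

# Unencrypted intermediate updates as seen by the server.
u = gamma                          # gamma_{ik}
v = (gamma * x[0], gamma * x[1])   # gamma_{ik} * x_i

# The server recovers the private point with a single division
# (exact up to floating-point rounding).
recovered = (v[0] / u, v[1] / u)
print(recovered)  # approximately (3.7, -1.2): x_i is fully revealed
```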
The fitted model, i.e., the estimated parameters of the GMM, should be the same as those obtained by the non-privacy-preserving counterpart; namely, the performance of the
GMM should not be compromised by considering privacy.
Since we use the CKKS scheme, the results obtained are approximate, but this does not significantly compromise the accuracy of the model.
In this section, we present numerical results comparing the non-privacy-preserving algorithm with our privacy-preserving approach.
To achieve this, we utilize a data generation function that takes three input parameters: num_of_gaussians, which determines the number of Gaussian components to simulate; points_per_gaussian, which determines the number of data points in each Gaussian component; and mean_range, which specifies the range of mean values for the data points.
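The generator is described only through its parameters; a minimal sketch consistent with that description could look as follows (the function name, the fixed seed, and the unit covariance are our assumptions, not details from the experiments).

```python
import random

def generate_data(num_of_gaussians, points_per_gaussian, mean_range, seed=0):
    """Sample 2-D points from `num_of_gaussians` unit-variance Gaussians
    whose means are drawn uniformly from `mean_range`."""
    rng = random.Random(seed)
    points = []
    for _ in range(num_of_gaussians):
        mx = rng.uniform(mean_range[0], mean_range[1])
        my = rng.uniform(mean_range[0], mean_range[1])
        for _ in range(points_per_gaussian):
            points.append((rng.gauss(mx, 1.0), rng.gauss(my, 1.0)))
    return points

data = generate_data(num_of_gaussians=3, points_per_gaussian=1000,
                     mean_range=[-20, 20])
print(len(data))  # 3000
```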
We simulate several GMMs using the data generation function, and compare the two algorithms using the log-likelihood of the GMMs. We plot the log-likelihood as a function of the iteration number of our proposed privacy preserving approach and the existing non-privacy preserving approach.
The log-likelihood of a GMM with $K$ components over the dataset $\{x_i\}_{i=1}^{n}$ is
$$\log L = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k).$$
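This quantity is straightforward to evaluate directly; a small self-contained helper (ours) for the two-dimensional case:

```python
import math

def gmm_log_likelihood(points, weights, means, covs):
    """Log-likelihood of 2-D data under a GMM (covs are symmetric 2x2 matrices)."""
    def density(x, mu, cov):
        det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
        dx, dy = x[0] - mu[0], x[1] - mu[1]
        # Quadratic form d^T cov^{-1} d, expanded for a symmetric 2x2 matrix.
        quad = (cov[1][1] * dx * dx - 2 * cov[0][1] * dx * dy
                + cov[0][0] * dy * dy) / det
        return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))
    return sum(math.log(sum(w * density(x, m, c)
                            for w, m, c in zip(weights, means, covs)))
               for x in points)
```

A single point at the mean of a one-component GMM with identity covariance gives $-\log(2\pi)$, which provides a quick sanity check.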
- Experiment 1: num_of_gaussians=3, points_per_gaussian=1000, mean_range=[-20, 20]
- Experiment 2: num_of_gaussians=5, points_per_gaussian=900, mean_range=[-10, 10]
- Experiment 3: num_of_gaussians=4, points_per_gaussian=500, mean_range=[-10, 10]
- Experiment 4: num_of_gaussians=5, points_per_gaussian=500, mean_range=[-10, 10]
- Experiment 5: num_of_gaussians=2, points_per_gaussian=1000, mean_range=[-10, 10]
- Experiment 6: num_of_gaussians=3, points_per_gaussian=1000, mean_range=[-10, 10]
As shown in the figures in the previous section, the proposed approach yields parameter estimates for the GMMs that are indistinguishable from those of the non-privacy-preserving approach. Hence, the output correctness of the proposed approach is guaranteed and not compromised by considering privacy.
By using fully homomorphic encryption, we were able to maintain participants' individual privacy because the private data did not need to be disclosed for computations.
Regarding performance, applying parallel computing could significantly enhance it: instead of the server handling participants' intermediate updates sequentially, the encrypted updates can be received and aggregated concurrently, resulting in a faster runtime.
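The idea can be sketched as follows (the threading structure and all names are ours): the server folds each participant's update into a running sum as the worker pool yields it, rather than computing all updates serially before summing.

```python
from concurrent.futures import ThreadPoolExecutor

def node_update(i):
    """Stand-in for node i computing (and encrypting) its
    intermediate updates; here it just returns a small vector."""
    return [float(i), float(i) * 2.0]

def aggregate(num_nodes):
    """Collect updates from all nodes concurrently and fold them
    into a running elementwise sum, mirroring the server's
    addition-only role in the protocol."""
    total = [0.0, 0.0]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for vec in pool.map(node_update, range(num_nodes)):
            total[0] += vec[0]
            total[1] += vec[1]
    return total

print(aggregate(10))  # [45.0, 90.0]
```

Since homomorphic addition is associative and commutative, the order in which ciphertexts arrive does not affect the aggregate, which is what makes this kind of concurrent folding safe.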