CodeCipher: Learning to Obfuscate Source Code Against LLMs
Saved in:
Main authors: , , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: While large code language models have made significant strides in AI-assisted coding tasks, there are growing concerns about privacy challenges. The user's code is fully transparent to the cloud LLM service provider, creating risks of unauthorized training on, reading of, and execution of that code. In this paper, we propose CodeCipher, a novel method that obfuscates private information in source code while preserving the LLM's original response. CodeCipher transforms the LLM's embedding matrix so that each row corresponds to a different word in the original matrix, forming a token-to-token confusion mapping for obfuscating source code. The new embedding matrix is optimized by minimizing a task-specific loss function. To tackle the discrete and sparse nature of the word-vector space, CodeCipher adopts a discrete optimization strategy that aligns each updated vector to the nearest valid token in the vocabulary before every gradient update. We demonstrate the effectiveness of our approach on three AI-assisted coding tasks: code completion, summarization, and translation. Results show that CodeCipher successfully conceals private information in source code while preserving the original LLM's performance.
DOI: 10.48550/arxiv.2410.05797
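
The abstract describes the mechanism only in prose. As a reading aid, here is a minimal PyTorch sketch of the two ideas it names: a token-to-token confusion mapping obtained by rearranging the embedding matrix, and the discrete optimization step that snaps each updated embedding row back to the nearest valid vocabulary vector before the next gradient update. All function names, the cosine-similarity nearest-neighbor search, and the toy loss are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only, not the paper's released code. Assumes a
# PyTorch model whose input embedding matrix has shape (vocab_size, dim).
import torch
import torch.nn.functional as F


def project_to_vocabulary(E_new, E_orig):
    """Snap each updated embedding row back to the nearest original row.

    This mirrors the discrete optimization step from the abstract: after
    each gradient update, every row of the learned matrix is replaced by
    its nearest neighbor in the original vocabulary, so the matrix always
    encodes a valid token-to-token mapping. Cosine similarity is an
    assumption here; the paper may use a different distance metric.
    """
    sims = F.normalize(E_new, dim=-1) @ F.normalize(E_orig, dim=-1).T
    nearest = sims.argmax(dim=-1)       # (vocab_size,) nearest token ids
    return E_orig[nearest], nearest     # projected matrix, confusion map


def obfuscate(token_ids, mapping):
    """Apply the learned token-to-token confusion mapping to user code."""
    return mapping[token_ids]


# Toy demonstration with a tiny random "vocabulary".
vocab_size, dim, lr = 16, 8, 0.1
E_orig = torch.randn(vocab_size, dim)
E = E_orig.clone().requires_grad_(True)

# Stand-in for the task-specific loss; the real objective would compare
# the LLM's output on obfuscated code against its output on the original.
loss = (E - E_orig.roll(1, dims=0)).pow(2).mean()
loss.backward()

with torch.no_grad():
    E -= lr * E.grad                        # continuous gradient step
    E_proj, mapping = project_to_vocabulary(E, E_orig)
    E.copy_(E_proj)                         # discrete projection step

print(obfuscate(torch.tensor([0, 3, 7]), mapping))
```

In deployment, a client would presumably run obfuscate over its source tokens before sending them to the cloud LLM; the task-specific objective is what keeps the provider's responses on the confused tokens close to its responses on the original code.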