Watermarking LLMs
Cryptographic foundations for detecting and protecting AI-generated text
Large language models are rapidly becoming part of everyday writing, research, education, and software development. As these systems grow more powerful, it becomes increasingly important to know whether a piece of text was generated by an AI system, whether it has been modified, and what guarantees detection mechanisms can provide.
This project studies watermarking for large language models from a cryptographic and theoretical perspective.
The central question of watermarking is whether an LLM can generate text that looks natural to humans yet carries a hidden signal that anyone holding the secret key can later detect. Ideally, this signal should survive small edits, paraphrasing, and adversarial attempts to remove it, while remaining invisible to ordinary readers.
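To make the idea of a key-dependent hidden signal concrete, here is a minimal toy sketch in Python. It is not a scheme from this project and offers none of the robustness or security guarantees discussed here. Every name in it (`VOCAB`, `is_green`, `generate`, `z_score`) is hypothetical. It also assumes a stand-in "model" that samples tokens uniformly from a made-up vocabulary. The generator prefers tokens that a keyed hash (HMAC-SHA256) marks as "green", and the detector counts green tokens and reports how far the count deviates from what chance would predict.

```python
import hashlib
import hmac
import math
import random

# Hypothetical toy vocabulary standing in for a real tokenizer's vocabulary.
VOCAB = [f"w{i}" for i in range(1000)]


def is_green(key: bytes, prev: str, tok: str) -> bool:
    """Keyed pseudorandom bit for the token pair (prev, tok)."""
    digest = hmac.new(key, f"{prev}|{tok}".encode(), hashlib.sha256).digest()
    return digest[0] & 1 == 1


def generate(key, n, rng, candidates=8):
    """Toy stand-in for an LLM (an assumption, not a real model).

    At each step it samples a few candidate tokens uniformly at random.
    With a key, it emits the first candidate marked green given the
    previous token; with key=None it emits the first candidate unchanged.
    """
    tokens = ["w0"]
    for _ in range(n):
        options = rng.sample(VOCAB, candidates)
        chosen = options[0]
        if key is not None:
            chosen = next(
                (t for t in options if is_green(key, tokens[-1], t)),
                options[0],
            )
        tokens.append(chosen)
    return tokens


def z_score(key: bytes, tokens) -> float:
    """Standardized excess of green pairs over the 50% expected by chance."""
    n = len(tokens) - 1
    green = sum(is_green(key, p, t) for p, t in zip(tokens, tokens[1:]))
    return (green - n / 2) / math.sqrt(n / 4)


if __name__ == "__main__":
    rng = random.Random(0)
    marked = generate(b"secret-key", 200, rng)
    plain = generate(None, 200, rng)
    print("marked, right key:", z_score(b"secret-key", marked))
    print("marked, wrong key:", z_score(b"other-key", marked))
    print("plain,  right key:", z_score(b"secret-key", plain))
```

In this sketch only the holder of the key should see a large score on watermarked text. Under a wrong key, or on unwatermarked text, the score should stay near zero. Replacing tokens erases the signal in proportion to the number of edits, which is one reason robustness needs a more careful treatment than this toy provides.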
The goal is to build a rigorous theory of LLM watermarking:
- What is possible?
- What is impossible?
- What assumptions are needed?
- How strong can the security guarantees be?
This direction connects cryptography, pseudorandomness, coding theory, and the theory of language models. It also raises basic conceptual questions: unlike content in classical digital watermarking, watermarked LLM output must remain fluent, diverse, and statistically close to natural language. This makes watermarking LLM outputs both practically important and theoretically subtle.
Motivation
Cryptography has repeatedly shown that informal security ideas can fail unless they are supported by precise definitions and proofs. LLM watermarking is no exception.
A good watermarking scheme should not merely work against simple tests or weak attacks. It should be meaningful even when the adversary understands the watermarking algorithm, can query the system, and actively tries to remove or forge the watermark.
This project aims to understand watermarking through this stronger lens.
Main Themes
The project focuses on the following questions:
- Can watermarking be based on standard cryptographic assumptions?
- What kinds of robustness are achievable against editing and paraphrasing?
- Can watermarking remain secure under strong adversarial access?
- Are there inherent barriers to constructing watermarking schemes with very strong guarantees?
- How do watermarking schemes relate to pseudorandom codes and cryptographic authentication?