Information security has become increasingly prominent as the internet has developed rapidly and become ubiquitous. Traditional cryptographic solutions protect the confidentiality of content, but their use may still attract attackers’ attention. Modern communication security therefore requires not only confidentiality of the information but also concealment of the communication process, which has led to the emergence and widespread application of information hiding techniques. Steganography, a significant branch of information hiding, covertly embeds secret messages within various carriers and is crucial for military communications and personal privacy protection. Traditional modification-based steganographic algorithms achieve only empirical security, and provable security has long been an important pursuit of the steganographic community. One prerequisite for provably secure steganography is a sampler that samples exactly according to the carrier’s distribution. The swift advancement of deep generative models and the prevalence of AI-generated data provide new techniques and environments for provably secure generative steganography, advancing research in this field.
Given the rapid advances in computational power and the potential emergence of new attack vectors, traditional encryption techniques will likely become insecure within a few decades, so stronger encryption measures are needed to meet long-term security needs. Honey encryption is a novel technique that produces a plausible-looking decoy plaintext whenever an attacker attempts to decrypt the ciphertext with a wrong key. Thus, even if an attacker obtains all possible plaintexts through a brute-force attack, identifying the true plaintext among them remains challenging. This design significantly enhances encryption security, making honey encryption an effective tool against brute-force attacks and particularly suitable for scenarios that require long-term protection against powerful attackers.
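The decoy behaviour described above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the four-entry `MESSAGES` list stands in for a message space with a uniform distribution, and a single SHA-256 call stands in for a proper key-derivation function. It is not the scheme studied in this thesis and is not secure for real use.

```python
import hashlib

# Hypothetical message space; all four messages are assumed equally likely.
MESSAGES = ["meet at noon", "meet at dusk", "abort mission", "await orders"]


def key_offset(key: str, nonce: bytes) -> int:
    """Derive a pseudo-random offset in [0, len(MESSAGES)) from the key."""
    digest = hashlib.sha256(nonce + key.encode()).digest()
    # 256 is a multiple of 4, so the offset is unbiased for this toy list.
    return digest[0] % len(MESSAGES)


def encrypt(message: str, key: str, nonce: bytes) -> int:
    """Map the message to its index (the 'seed') and mask it with the key."""
    seed = MESSAGES.index(message)
    return (seed + key_offset(key, nonce)) % len(MESSAGES)


def decrypt(ciphertext: int, key: str, nonce: bytes) -> str:
    """Unmask with the supplied key; every key yields some valid message."""
    seed = (ciphertext - key_offset(key, nonce)) % len(MESSAGES)
    return MESSAGES[seed]
```

With the correct key, `decrypt` returns the original message. With any other key it still returns an entry of `MESSAGES`, so a brute-force search over keys produces only plausible candidates and no error signal that singles out the true one.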
This thesis addresses the two core needs of data privacy protection, concealment and confidentiality, and centers on provably secure generative steganography and honey encryption. Its main contributions and innovations are summarized as follows.
1. Efficient and Provably Secure Steganographic Algorithm Based on Distribution Copies
Existing provably secure steganographic algorithms fall into three categories. Those based on rejection sampling suffer from low embedding rates and slow speed. Those based on arithmetic coding and those based on sample space grouping achieve higher embedding rates and speed, but their practical implementations struggle to satisfy the theoretical assumptions and therefore cannot achieve the expected security. In essence, all of these algorithms use the index values or function values of samples to express the message. To overcome these limitations, this thesis proposes an efficient and provably secure steganographic algorithm based on distribution copies. Multiple distribution copies are created by rotating the probability distribution provided by the generative model, and the index values of the distribution copies are used to express the message. A security proof is given from the perspective of distribution preservation, and recursive grouping significantly raises the embedding rate. Performance is tested on the text generation task. Experimental results show that the embedding rate can approach the theoretical limit and that the additional time relative to normal generation is relatively small, which keeps the algorithm efficient in practical applications. The algorithm is also deployed on a text-to-speech task, verifying its versatility.
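The idea of expressing a message through the index of a rotated distribution copy can be sketched for a single generation step. The sketch below is a simplified illustration under stated assumptions, not the thesis's algorithm: it omits recursive grouping, it uses exact `Fraction` probabilities in place of model outputs, and the helper names (`capacity_bits`, `lookup`, `embed`, `extract`) are hypothetical. It assumes sender and receiver share the same distribution and the same pseudo-random number `r`, and it embeds `n` bits only when every token probability is at most `2**-n`, so that the `2**n` rotated copies always select distinct tokens.

```python
from fractions import Fraction
import random


def capacity_bits(probs):
    """Largest n such that every token probability is <= 2**-n."""
    n = 0
    while max(probs) * 2 ** (n + 1) <= 1:
        n += 1
    return n


def lookup(probs, point):
    """Index of the token whose interval [lo, hi) on [0, 1) contains point."""
    cumulative = Fraction(0)
    for index, p in enumerate(probs):
        cumulative += p
        if point < cumulative:
            return index
    raise ValueError("point must lie in [0, 1)")


def embed(probs, bits, r):
    """Sample from copy number m, i.e. the distribution rotated by m / 2**n.

    Assumes `bits` holds at least n message bits.
    """
    n = capacity_bits(probs)
    m = int(bits[:n], 2) if n else 0
    token = lookup(probs, (r + Fraction(m, 2 ** n)) % 1)
    return token, n


def extract(probs, token, r):
    """Find the unique copy that would have produced the received token."""
    n = capacity_bits(probs)
    for m in range(2 ** n):
        if lookup(probs, (r + Fraction(m, 2 ** n)) % 1) == token:
            return format(m, f"0{n}b") if n else ""
    raise ValueError("token is not reachable from any copy")


if __name__ == "__main__":
    probs = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 8),
             Fraction(1, 8), Fraction(1, 4)]
    shared = random.Random(2024)  # stands in for a keyed, shared PRNG
    r = Fraction(shared.getrandbits(32), 2 ** 32)
    token, n = embed(probs, "10", r)
    print(token, n, extract(probs, token, r))
```

Because `r` is uniform on [0, 1), the rotated point `(r + m / 2**n) % 1` is uniform as well, so the sampled token follows the original distribution regardless of the message. This is the distribution-preservation property the security argument relies on.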
2. Honey Encryption Scheme for Natural Language Text Based on Deep Generative Models and Arithmetic Coding
The core of honey encryption is the distribution-transforming encoder (DTE). Existing DTEs typically rely on traditional statistical models and a fixed-length encoding scheme based on the cumulative distribution function (CDF). However, these schemes fall short in modeling capability and generalization when dealing with complex data and struggle to preserve the original distribution. Starting from the minimization of modeling and encoding losses and applying ideas from provably secure generative steganography, this thesis designs a DTE based on deep generative models and arithmetic coding. The DTE first tokenizes the plaintext and then encodes the tokens into a seed using arithmetic coding, driven by the probability distribution provided by a deep generative model. To eliminate distinguishable characteristics, pseudo-random padding schemes are designed at both the plaintext and seed levels. Experimental results demonstrate that the proposed scheme achieves a high compression ratio and low encoding loss. With large models, the scheme also achieves low modeling loss and high security.
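The encode-to-seed step can be sketched with exact arithmetic coding over a toy distribution. The sketch makes several assumptions that differ from the scheme described above: the fixed table `MODEL` stands in for the context-dependent probabilities of a deep generative model, the number of tokens is fixed and known to the decoder, and seed-level padding is reduced to picking a random seed inside the final coding interval. The function names are hypothetical.

```python
from fractions import Fraction
import math
import random

# Toy stand-in for a generative model: one fixed next-token distribution.
MODEL = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}


def intervals(model):
    """Assign each token a sub-interval [lo, hi) of [0, 1) sized by its probability."""
    lo = Fraction(0)
    table = {}
    for tok, p in model.items():
        table[tok] = (lo, lo + p)
        lo += p
    return table


def encode(tokens, seed_bits, rng):
    """Narrow [0, 1) token by token, then pick a random seed inside the result."""
    table = intervals(MODEL)
    lo, width = Fraction(0), Fraction(1)
    for tok in tokens:
        t_lo, t_hi = table[tok]
        lo, width = lo + width * t_lo, width * (t_hi - t_lo)
    scale = 2 ** seed_bits
    first = math.ceil(lo * scale)               # smallest seed with seed/scale >= lo
    last = math.ceil((lo + width) * scale) - 1  # largest seed with seed/scale < lo + width
    if first > last:
        raise ValueError("seed_bits too small for this message")
    return rng.randint(first, last)


def decode(seed, seed_bits, n_tokens):
    """Read n_tokens back out of the seed; any seed decodes to some token sequence."""
    table = intervals(MODEL)
    point = Fraction(seed, 2 ** seed_bits)
    tokens = []
    for _ in range(n_tokens):
        for tok, (t_lo, t_hi) in table.items():
            if t_lo <= point < t_hi:
                tokens.append(tok)
                point = (point - t_lo) / (t_hi - t_lo)  # rescale to [0, 1)
                break
    return tokens
```

A uniformly random seed decodes to tokens that follow `MODEL`, which is what lets a wrong key produce a plausible decoy. The quality of the decoys therefore depends on how well the model matches real plaintexts, which is the modeling loss discussed above.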
information hiding; steganography; provably secure steganography; honey encryption; generative model; arithmetic coding
Generative Provably Secure Text Steganography and Its Applications
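With the rapid development and widespread adoption of the internet, information security issues have become increasingly prominent. Traditional cryptography-based solutions protect the confidentiality of content, but they may attract attackers’ attention. Modern communication security therefore requires both confidential content and a concealed process, and information hiding techniques have emerged and been widely applied in response. Steganography, an important branch of information hiding, covertly embeds secret messages into various information carriers to achieve covert communication or storage, and is crucial for military communications and personal privacy protection. Traditional modification-based steganography remains at the level of empirical security, and achieving provable security has long been a major pursuit of the steganography field. One prerequisite for provably secure steganography is the existence of a sampler that samples strictly according to the carrier distribution. The rapid development of deep generative models and the popularity of generated data bring new technical means and camouflage environments to provably secure steganography, promoting research on generative provably secure steganography.
Considering the rapid progress of computing power and the new attack methods that may appear in the future, existing traditional cryptography may well become insecure within a few decades, so stronger encryption measures are needed to meet long-term security needs. Honey encryption is a new cryptographic technique with the following goal: an attempt to decrypt the ciphertext with a wrong key yields a plausible-looking decoy plaintext, so that even an attacker who obtains all possible plaintexts by brute force can hardly locate the unique true plaintext among them. This design significantly improves encryption security, making honey encryption an effective means of resisting brute-force attacks, particularly in scenarios that require long-term protection and resistance to attackers with powerful computing capabilities.
Starting from the two core needs of data privacy protection, concealment and confidentiality, this thesis studies generative provably secure steganography and honey encryption. Its main work and innovations are summarized as follows.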
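1. An Efficient Provably Secure Steganographic Construction Based on Distribution Copies
Existing provably secure steganographic constructions fall into three types. Constructions based on rejection sampling have low embedding rates and are slow. Constructions based on arithmetic coding and on sample space grouping reach higher embedding rates and speeds, but their concrete implementations struggle to satisfy the theoretical assumptions and therefore cannot achieve the expected security. In essence, all of these methods use the index values or function values of samples to express the message. To break through these limitations, this thesis proposes an efficient provably secure steganographic construction based on distribution copies. Multiple distribution copies of the probability distribution given by the generative model are created by cyclic shifting, and the index values of the distribution copies are used to express the message. A security proof of the construction is given from the perspective of distribution preservation. The idea of recursive grouping significantly raises the embedding rate of the construction. Its performance is tested on the text generation task. Experimental results show that the embedding rate can approach the theoretical limit. Compared with normal generation, the additional time introduced is relatively small, which ensures efficiency in practical applications. In addition, the construction is applied to the speech synthesis task, verifying its generality.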
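2. A Text Honey Encryption Scheme Based on Deep Generative Models and Arithmetic Coding
The core of honey encryption is the distribution-transforming encoder (DTE). Existing DTEs usually adopt traditional statistical models and a fixed-length encoding scheme based on the cumulative distribution function. However, such schemes show poor modeling and generalization capability on complex data and often fail to preserve the original distribution well. This thesis analyzes the relationship between generative steganography and honey encryption and applies advanced ideas from generative provably secure steganography to the design of a text honey encryption system, proposing a DTE based on deep generative models and arithmetic coding. The plaintext is first tokenized, and arithmetic coding then encodes it into a seed according to the probability distribution predicted by the deep generative model. To eliminate distinguishable characteristics, pseudo-random padding schemes are designed at both the plaintext and seed levels. Experimental results show that the proposed scheme achieves a high compression ratio and low encoding loss, and that with larger models it achieves low modeling loss and high security.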
information hiding; steganography; provably secure steganography; honey encryption; generative model; arithmetic coding