Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
Published in COLM, 2025
Recommended citation: Preston Firestone, Shubham Ugare, Gagandeep Singh, Sasa Misailovic. "Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8." COLM 2025. Also presented at TokShop@ICML 2025. https://arxiv.org/abs/2511.05578
