Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8

Published in COLM, 2025

Recommended citation: Preston Firestone, Shubham Ugare, Gagandeep Singh, Sasa Misailovic. COLM 2025; also at TokShop@ICML 2025. https://arxiv.org/abs/2511.05578
