Advanced Tokenization in LLMs: Using Graph Techniques

Sunila Gollapudi
9 min read · Apr 20, 2024

Recently, I experimented with some tokenization techniques for my LLM, in particular adaptive tokenization. Adaptive tokenization is an advanced approach that dynamically adjusts the granularity of tokens based on the specific context or content of the text. The aim is to optimize the balance between a model’s vocabulary size and the semantic richness of its tokens, improving both the model’s efficiency and its understanding of the text. I first applied a token-frequency approach: common tokens are kept as-is, while rare tokens are broken down into sub-words. This is a well-known approach, and it did improve performance; what I tried next was using graph techniques.
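To make the frequency idea concrete, here is a minimal sketch. The helper names, the fixed-size character fallback, and the min_count threshold are my own illustrative choices, not the exact scheme I used; in practice the sub-word split would come from a learned scheme such as BPE rather than naive chunking.

```python
from collections import Counter

def split_rare_token(token, chunk_size=3):
    """Hypothetical fallback: break a rare token into fixed-size chunks.
    A real system would use a learned sub-word vocabulary instead."""
    return [token[i:i + chunk_size] for i in range(0, len(token), chunk_size)]

def adaptive_tokenize(tokens, freq, min_count=2):
    """Keep frequent tokens intact; decompose rare ones into sub-words."""
    output = []
    for tok in tokens:
        if freq[tok] >= min_count:
            output.append(tok)
        else:
            output.extend(split_rare_token(tok))
    return output

# Build a frequency table from a toy corpus, then tokenize a new sentence.
corpus = "the model reads the corpus and the model learns".split()
freq = Counter(corpus)
print(adaptive_tokenize("the model memorizes".split(), freq))
# ['the', 'model', 'mem', 'ori', 'zes']
```

The unseen word "memorizes" falls below the frequency threshold and gets decomposed, while the common tokens survive unchanged.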

I built a token graph in which each node represents a unique token and edges represent relationships between tokens. These relationships can be based on co-occurrence within a certain window of text, syntactic dependencies, or semantic similarities. I then applied semantic clustering to group related tokens, and used the clusters to decide how to adaptively tokenize the text, for example by merging closely related short tokens into a single token, or by splitting long tokens that bridge different semantic clusters. A sketch of this pipeline follows below.
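Here is one way this pipeline could look in Python. It uses networkx to build a co-occurrence graph over a sliding window, and greedy modularity community detection as an off-the-shelf stand-in for semantic clustering; the window size, the max_len threshold, and the merge rule are illustrative assumptions rather than a prescribed recipe.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def build_token_graph(tokens, window=2):
    """Nodes are unique tokens; edge weights count co-occurrences
    within a sliding window over the token stream."""
    g = nx.Graph()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            a, b = tokens[i], tokens[j]
            if a == b:
                continue  # skip self-loops
            weight = g[a][b]["weight"] + 1 if g.has_edge(a, b) else 1
            g.add_edge(a, b, weight=weight)
    return g

def merge_within_clusters(tokens, clusters, max_len=4):
    """Adaptive step: merge adjacent short tokens that fall in the
    same cluster into a single composite token."""
    membership = {t: i for i, c in enumerate(clusters) for t in c}
    merged, i = [], 0
    while i < len(tokens):
        cur = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if (nxt is not None
                and len(cur) <= max_len and len(nxt) <= max_len
                and cur in membership
                and membership[cur] == membership.get(nxt)):
            merged.append(cur + "_" + nxt)  # fuse the related pair
            i += 2
        else:
            merged.append(cur)
            i += 1
    return merged

# Usage: cluster a toy token stream, then re-tokenize it adaptively.
stream = "new york city hosts the new york marathon every year".split()
graph = build_token_graph(stream, window=2)
clusters = greedy_modularity_communities(graph, weight="weight")
print(merge_within_clusters(stream, clusters))
```

The opposite direction, splitting a long token whose sub-words land in different clusters, would follow the same pattern: look up cluster membership for candidate sub-words and split where the membership changes.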

In this article, I will start from the ground up and share my experiments in applying graph techniques to tokenization.

Tokenization — An introduction
