Character Tokens
Description
This is another method that can deal successfully with new words because it has the raw letters to fall back on. While that makes the representation easier to tokenize, it makes the modeling more difficult. Where a model with subword tokenization can represent "play" as one token, a model using character-level tokens needs to model the information to spell out p-l-a-y in addition to modeling the rest of the sequence. Subword tokens present an advantage over character tokens in their ability to fit more text within the limited context length of a Transformer model. With a model whose context length is 1,024, for example, you may be able to fit about three times as much text using subword tokenization as using character tokens.
Example
Input: Have the 🎵 bards who preceded
Output: H
, a
, v
, e
, <space>
, t
, h
, e
, <space>
, ๐ต
, <space>
, ...
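The listing above is straightforward to reproduce: splitting the string into its characters yields the token sequence directly. This small sketch renders space characters as <space> to match the output shown; the rendering choice is just for readability.

```python
# Reproduce the character-token listing above, rendering each space
# as "<space>" so the whitespace tokens are visible in the output.
text = "Have the 🎵 bards who preceded"

tokens = ["<space>" if ch == " " else ch for ch in text]
print(", ".join(tokens))
# H, a, v, e, <space>, t, h, e, <space>, 🎵, <space>, b, ...
```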