Byte Tokens
Description
This method breaks down tokens into the individual bytes that are used to represent unicode characters.
Info
Some Subword tokenizers also include bytes as tokens in their vocabulary as the final building block to fall back to when they encounter characters they can't otherwise represent. The GPT-2 and RoBERTa tokenizers do this, for example.
Example
Input: Have the ๐ต bards who preceded
Output: 1001000
, 1100001
, 1110110
, 1100101
, 0100000
, 1110100
, 1101000
, ...