Skip to content

Text Generation With T5 [NLP] [Augmentation]

Description

T5 aids data augmentation through:

  • Paraphrasing: Rephrasing sentences while preserving meaning, e.g., "The movie was boring." becomes "The film was dull."
  • Synonym replacement: Substituting words with synonyms to create variations, e.g., "The movie was long and tedious." becomes "The film was lengthy and boring."
  • Sentiment-based transformation: Changing sentence sentiment, e.g., given a negative sentence such as "The movie was very disappointing." can generate a neutral or positive version, such as "The movie had a slow start but improved later."
  • Text expansion: Adding context or details to short sentences, e.g., "The event was great." becomes "The event was great, with excellent speakers and engaging discussions."

Example

def t5_augmentation(text, model, tokenizer, num_return_sequences=1):
    input_ids = tokenizer.encode(
        f"paraphrase: {text}",
        return_tensors="pt",
        max_length=512,
        truncation=True,
    )
    outputs = model.generate(
        input_ids=input_ids,
        max_length=150,
        num_return_sequences=num_return_sequences,
        num_beams=5,
        no_repeat_ngram_size=2,  # Prevents repetition of 2-grams
        top_k=50,
        top_p=0.95,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]