
ucto

Tokenize text files by separating words from punctuation and splitting sentences

Text Processing | Linux, macOS | C++ | GPL-3.0

Description

Ucto tokenizes text files by separating words from punctuation and splitting the text into sentences. Its tokenization rules are written as regular expressions, with rule sets available for several languages.
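
As a rough illustration of what regular-expression-based tokenization involves, here is a minimal Python sketch. It is not Ucto's implementation and does not use Ucto's rule files. The pattern `TOKEN_RE`, the set `SENTENCE_END`, and the function `tokenize` are hypothetical names chosen for this example, and the rules are far simpler than a real language-specific rule set would be.

```python
import re

# Assumed rule (not taken from Ucto): a token is either a run of word
# characters, optionally joined by internal apostrophes or hyphens,
# or a single non-space, non-word symbol such as punctuation.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

# Assumed rule (not taken from Ucto): these tokens close a sentence.
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Return a list of sentences, each a list of word and punctuation tokens."""
    sentences = []
    current = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        current.append(token)
        if token in SENTENCE_END:
            # A sentence-final token ends the current sentence.
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that have no closing punctuation.
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in tokenize("Hello, world! Isn't this simple?"):
        print(" ".join(sentence))
```

Run as a script, this prints one sentence per line with tokens separated by spaces. A sketch like this breaks on abbreviations, decimal numbers, and URLs, which is the kind of case that language-specific rule sets are meant to handle.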

AI Summary

Multilingual text tokenizer that separates words, punctuation, and sentences

Capabilities

  • Tokenize text into words
  • Separate punctuation
  • Split sentences
  • Multi-language support
  • Regular expression-based rules

Use When

  • When you need text tokenization for NLP
  • When processing multilingual text

Avoid When

  • When you need NLP analysis that goes beyond tokenization and sentence splitting
