Create a large, deduplicated dataset for LLM pre-training
Display HTML
Access NLPre-PL dataset and pre-trained models
Browse and view Hugging Face datasets
Help make the Cádiz Carnival more accessible
Upload files to a Hugging Face repository
Curate and manage datasets for AI and machine learning
Browse and search datasets
Browse and view Hugging Face datasets from a collection
Display translation benchmark results from NTREX dataset
Speech Corpus Creation Tool
List of French datasets not referenced on the Hub
Validate JSONL format for fine-tuning
TxT360: Trillion Extracted Text is a tool for creating large-scale, deduplicated datasets for pre-training large language models (LLMs). It extracts and processes text from a variety of sources to produce diverse, high-quality training data.
What is TxT360: Trillion Extracted Text used for?
TxT360 is used primarily to create large-scale, deduplicated datasets for training and fine-tuning large language models, with an emphasis on high-quality, diverse, and relevant text.
Can I customize the dataset creation process?
Yes, TxT360 allows users to define specific criteria, filter content, and select sources to tailor datasets according to their needs.
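To make the idea of user-defined criteria concrete, here is a minimal sketch of source selection and content filtering. It is not TxT360's actual interface: the FilterCriteria class, the field names, and the record layout (a dict with "source" and "text" keys) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FilterCriteria:
    """Hypothetical filter settings; not part of any TxT360 API."""
    allowed_sources: set = field(default_factory=set)  # empty set means "any source"
    min_words: int = 0                                 # drop texts shorter than this
    banned_terms: tuple = ()                           # case-insensitive substrings to reject


def matches(record: dict, criteria: FilterCriteria) -> bool:
    """Return True if the record passes every criterion."""
    text = record.get("text", "")
    if criteria.allowed_sources and record.get("source") not in criteria.allowed_sources:
        return False
    if len(text.split()) < criteria.min_words:
        return False
    lowered = text.lower()
    return not any(term.lower() in lowered for term in criteria.banned_terms)


def filter_records(records, criteria: FilterCriteria) -> list:
    """Keep only the records that satisfy the criteria, preserving order."""
    return [r for r in records if matches(r, criteria)]
```

A caller might, for example, build FilterCriteria(allowed_sources={"web"}, min_words=5) and pass it to filter_records together with a list of records.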
How does the deduplication process work?
The deduplication process in TxT360 identifies and removes duplicate and near-duplicate text entries, so that each entry in the resulting dataset is unique and training time is not spent on repeated content.
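The answer above does not say which algorithm is used, so the following is only a generic sketch of the two ideas it mentions: exact duplicates are caught by hashing normalized text, and near-duplicates by Jaccard similarity over word shingles. The function names, the shingle size, and the 0.8 threshold are assumptions. The pairwise comparison shown here is quadratic, and large pipelines typically replace it with an approximate method such as MinHash.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold: float = 0.8, k: int = 3) -> list:
    """Keep the first occurrence of each document; drop exact and near duplicates."""
    seen_hashes = set()
    kept = []
    kept_shingles = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(doc, k)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept
```

Calling deduplicate on a list of strings returns the surviving documents in their original order.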
Can TxT360 handle data from multiple sources?
Yes, TxT360 supports data extraction from various sources, including web pages, documents, and other repositories, ensuring a diverse and comprehensive dataset.
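As a rough illustration of combining several sources into one dataset, the sketch below merges texts from named sources into a single stream of records, tagging each record with its origin. The merge_sources function and the record layout are hypothetical and are not taken from TxT360.

```python
def merge_sources(sources: dict):
    """Yield one record per non-empty text, labeled with the source it came from.

    `sources` maps a source name (for example "web" or "documents") to an
    iterable of raw text strings.
    """
    for name, texts in sources.items():
        for text in texts:
            text = text.strip()
            if text:  # skip entries that are empty after trimming
                yield {"source": name, "text": text}
```

Keeping the source label on every record makes it possible to filter or reweight by origin later in the pipeline.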