Browse and submit language model benchmarks
Find recent, highly liked Hugging Face models
Analyze model errors with interactive pages
Evaluate and submit AI model results for Frugal AI Challenge
Merge LoRA adapters with a base model
Evaluate LLM over-refusal rates with OR-Bench
View RL Benchmark Reports
Explore and manage STM32 ML models with the STM32AI Model Zoo dashboard
Track, rank and evaluate open LLMs and chatbots
Display benchmark results
Determine GPU requirements for large language models
Export Hugging Face models to ONNX
View and submit machine learning model evaluations
The HHEM Leaderboard is a benchmarking platform for language models. Users can browse existing benchmark results and submit their own, making it straightforward to compare performance across models and datasets. It gives researchers and developers a transparent, competitive setting for evaluating and improving language models.
• Real-time updates: Stay current with the latest benchmark results as they are submitted.
• Customizable filters: Narrow down results by specific models, datasets, or metrics (see the sketch after this list).
• Detailed analytics: Access in-depth performance metrics for each submission.
• Submission interface: Easily upload your own model benchmarks for comparison.
• Community-driven: Engage with a community of researchers and developers to share insights and learn from others.
• Transparency: Clear documentation of evaluation methodologies and metrics.
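To illustrate what such filtering looks like once results are exported to a table, the sketch below narrows a results table to one dataset and ranks it by a metric using pandas. The column names (`model`, `dataset`, `accuracy`) and the sample rows are assumptions for the example, not the leaderboard's actual schema.

```python
import pandas as pd

# Assumed structure of an exported leaderboard table (columns and rows are illustrative).
results = pd.DataFrame(
    {
        "model": ["llama-7b", "llama-13b", "mistral-7b"],
        "dataset": ["summarization", "summarization", "qa"],
        "accuracy": [0.71, 0.76, 0.69],
    }
)

# Filter to one dataset, then rank by the chosen metric.
filtered = (
    results[results["dataset"] == "summarization"]
    .sort_values("accuracy", ascending=False)
    .reset_index(drop=True)
)
print(filtered)
```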
What types of models can I benchmark on HHEM Leaderboard?
The HHEM Leaderboard supports a wide range of language models, including transformer-based architectures and other state-of-the-art designs.
How do I submit a benchmark?
To submit a benchmark, create an account, ensure your model meets the submission criteria, and follow the step-by-step instructions provided on the platform.
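Hugging Face leaderboards typically take submissions as Hub model IDs, so a quick pre-check that the repository exists and contains weight files can save a rejected submission. Below is a minimal sketch using the `huggingface_hub` library; the `MODEL_ID` value and the weight-file check are illustrative assumptions, not documented HHEM submission criteria.

```python
from huggingface_hub import HfApi

# Hypothetical model ID used for illustration only.
MODEL_ID = "my-org/my-language-model"

api = HfApi()

# Fetch repository metadata; this raises an error if the repo is missing or inaccessible.
info = api.model_info(MODEL_ID)

# Assumed pre-submission check: the repo should contain weight files.
files = api.list_repo_files(MODEL_ID)
has_weights = any(name.endswith((".safetensors", ".bin")) for name in files)

print(f"Model: {info.id}")
print(f"Has weight files: {has_weights}")
```

If the check passes, the actual submission still goes through the leaderboard's own submission form.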
What metrics are used to evaluate models?
The leaderboard uses standard metrics such as perplexity, accuracy, F1-score, and inference speed, depending on the specific task and dataset.
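As a reference for how two of these numbers are commonly defined, here is a small, self-contained sketch: perplexity as the exponential of the average negative log-likelihood per token, and F1-score as the harmonic mean of precision and recall. The function names and inputs are illustrative, not part of the leaderboard's code.

```python
import math
from typing import Sequence

def perplexity(token_log_likelihoods: Sequence[float]) -> float:
    """Perplexity = exp(average negative log-likelihood per token)."""
    n = len(token_log_likelihoods)
    avg_nll = -sum(token_log_likelihoods) / n
    return math.exp(avg_nll)

def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: tokens with log-likelihoods near zero give perplexity near 1.
print(round(perplexity([-0.1, -0.2, -0.15]), 3))   # ~1.162
print(round(f1_score(tp=80, fp=10, fn=20), 3))     # 0.842
```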