Private preview

ai4science Leaderboard

Scientific model evaluation

The ai4science leaderboard evaluates large language models on representative scientific AI tasks in materials science and chemistry. It reports each model's performance, cost, provider, and prompting setting, and marks the Pareto-efficient models, i.e. those that offer the best performance-cost trade-off.

Runs are aggregated automatically. Models with repeated runs show mean performance and error bars; models with only Run 1 show a single point.
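The aggregation described above could look roughly like the following sketch. The record shape (`model`, `run`, `score`) and the function name `aggregateRuns` are hypothetical, and the error bar is assumed here to be the sample standard deviation; the dashboard may use a different definition, such as standard error.

```javascript
// Hypothetical record shape: { model, run, score }. The real data.js schema may differ.
function aggregateRuns(records) {
  // Group scores by model name.
  const byModel = new Map();
  for (const { model, score } of records) {
    if (!byModel.has(model)) byModel.set(model, []);
    byModel.get(model).push(score);
  }

  return [...byModel.entries()].map(([model, scores]) => {
    const n = scores.length;
    const mean = scores.reduce((a, b) => a + b, 0) / n;
    // Assumption: error bar = sample standard deviation.
    // A model with a single run gets no error bar (plotted as a single point).
    const errorBar =
      n > 1
        ? Math.sqrt(scores.reduce((acc, s) => acc + (s - mean) ** 2, 0) / (n - 1))
        : null;
    return { model, runs: n, mean, errorBar };
  });
}

// Illustrative data only; these are not real leaderboard results.
const exampleRuns = [
  { model: "model-a", run: 1, score: 0.7 },
  { model: "model-a", run: 2, score: 0.8 },
  { model: "model-b", run: 1, score: 0.65 },
];
console.log(aggregateRuns(exampleRuns));
```

In this sketch, a `null` error bar is the signal that a model has only Run 1 and should be drawn as a single point.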

Performance vs Cost Pareto Front

Cost is on the x-axis and the selected performance metric is on the y-axis. The Pareto front highlights the non-dominated models: those for which no other model is at least as cheap and at least as good while being strictly better on one of the two.
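As a minimal sketch of how such a front can be computed, the snippet below filters out dominated points. The field names (`cost`, `performance`) and the function name `paretoFront` are hypothetical, and it assumes a higher metric value is better; for a lower-is-better metric the comparison would need to be flipped.

```javascript
// Hypothetical point shape: { model, cost, performance }.
// Assumption: lower cost is better and higher performance is better.
function paretoFront(points) {
  // a dominates b if a is no worse on both axes and strictly better on at least one.
  const dominates = (a, b) =>
    a.cost <= b.cost &&
    a.performance >= b.performance &&
    (a.cost < b.cost || a.performance > b.performance);

  return points
    .filter((p) => !points.some((q) => dominates(q, p)))
    .sort((a, b) => a.cost - b.cost); // left-to-right order for plotting
}

// Illustrative data only; these are not real leaderboard results.
const examplePoints = [
  { model: "model-a", cost: 1.0, performance: 0.6 },
  { model: "model-b", cost: 2.0, performance: 0.55 }, // dominated by model-a
  { model: "model-c", cost: 3.0, performance: 0.8 },
  { model: "model-d", cost: 3.5, performance: 0.8 }, // dominated by model-c
];
console.log(paretoFront(examplePoints).map((p) => p.model));
```

The pairwise check is quadratic in the number of models, which is fine for a leaderboard-sized list; sorting by cost and sweeping once would be the usual choice for larger inputs.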

Leaderboard and token-cost summaries

Use these tables to inspect the ranked models and the total cost/token footprint by provider and model.

Token columns appear only when token data is available in data.js. The current data includes cost per task and model, so cost is summarized now; token totals will populate automatically once token fields are added.
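One way this conditional behavior could be implemented is sketched below. All field names (`provider`, `model`, `cost`, `inputTokens`, `outputTokens`) and the function name `summarizeByProviderModel` are assumptions for illustration; the actual data.js schema is not shown here.

```javascript
// Hypothetical row shape: { provider, model, cost, inputTokens?, outputTokens? }.
function summarizeByProviderModel(rows) {
  // Token columns are shown only if at least one row carries token counts.
  const hasTokens = rows.some(
    (r) => Number.isFinite(r.inputTokens) || Number.isFinite(r.outputTokens)
  );

  // Accumulate cost (and tokens, if present) per provider/model pair.
  const totals = new Map();
  for (const r of rows) {
    const key = `${r.provider}/${r.model}`;
    const t = totals.get(key) ?? { provider: r.provider, model: r.model, cost: 0, tokens: 0 };
    t.cost += r.cost;
    t.tokens += (r.inputTokens ?? 0) + (r.outputTokens ?? 0);
    totals.set(key, t);
  }

  const columns = hasTokens
    ? ["provider", "model", "cost", "tokens"]
    : ["provider", "model", "cost"];
  const summary = [...totals.values()].map((t) =>
    hasTokens ? t : { provider: t.provider, model: t.model, cost: t.cost }
  );
  return { columns, rows: summary };
}

// Illustrative data only: cost fields present, token fields absent.
const costOnlyRows = [
  { provider: "provider-x", model: "model-a", cost: 0.5 },
  { provider: "provider-x", model: "model-a", cost: 0.25 },
];
console.log(summarizeByProviderModel(costOnlyRows).columns);
```

With this approach, adding token fields to the data is enough to make the extra column appear; no change to the table code is needed.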

This table breaks each provider down into its individual models, restricted to the selected task, metric, model setting, and provider filter.