Learning quickly and robustly from pairwise evaluation results
In LLM evaluation and many other areas, it is difficult to get an absolute rating of a set of options (“how good is X?”), but fairly easy to get relative comparison-based information (“is X better than Y?”).
Often, there is noise and variance in the results of individual comparisons. This complicates the search for a global ranking that explains the entire set of comparisons; in fact, there is no universally best such ranking. Which one is appropriate depends on the situation and on the real-life context of the evaluation.
Very good rankings, which explain most of the comparisons, will generally all agree with each other. Any way of finding such a ranking quickly and efficiently is extremely useful.
As with many such problems, one option is to run SGD on an appropriate ranking loss. But there are also some incredibly efficient techniques for deriving such rankings, which use the most efficient of all computational toolkits: linear algebra. Here we show some AI-relevant examples of a practical, fundamental tool for getting from a set of pairwise comparisons to a global ranking.
The tool: linear algebra on the comparison graph
The tool is very efficient to implement, and involves constructing a graph out of the pairwise comparison data. The data can be very sparse and incomplete, and yet we can always find a way of ranking the nodes that best explains the comparisons we do observe. For many input graphs there is no ranking that explains every comparison perfectly, but we at least want one that agrees with most of them, especially across larger ranking gaps: higher-ranked nodes should generally beat lower-ranked nodes.
We use a fundamental method to get such a ranking, based on graph calculus, a discrete analogue of vector calculus. As mentioned, it relies on two steps: making a graph out of the matches played, and then computing a unique global ranking on this graph that best explains the comparisons.
Analyzing the comparison graph
We take the perspective that when we make pairwise comparisons between items, we are implicitly working with local differences on a graph. Each edge encodes a “who beats whom” or “which output is preferred” relation. But what is often wanted is a global ordering or potential: a scalar score assigned to every node that explains the local comparisons as consistently as possible.
Graph calculus ideas like the gradient, divergence, and Laplacian, imported from continuous calculus, provide a toolkit to move between edge-level signals and node-level scores. Intuitively, the gradient measures how much two connected nodes differ; the divergence measures whether a node is observed to “win” or “lose” overall; and the Laplacian ties these together into a system of equations whose solution is the smoothest global ranking consistent with the observed comparisons.
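To make these operators concrete, here is a minimal sketch on a toy comparison graph, using plain NumPy; the edge list and all variable names are illustrative, not part of any real pipeline.
CODE
import numpy as np

# Toy comparison data on 4 items: (i, j, f) means i beat j with net flow f
# (e.g. f = wins of i over j minus wins of j over i).
edges = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 1.0), (2, 3, 3.0)]
n = 4

# Incidence matrix B: one row per edge, +1 at the winner's column, -1 at the loser's.
# The gradient of a node score vector s is the edge signal (B @ s)[e] = s_i - s_j.
B = np.zeros((len(edges), n))
flow = np.zeros(len(edges))
for e, (i, j, f) in enumerate(edges):
    B[e, i], B[e, j] = 1.0, -1.0
    flow[e] = f

# Divergence of the observed flow: net wins minus losses at each node.
div = B.T @ flow

# Graph Laplacian L = B.T @ B; the least-squares scores solve L s = div
# (normal equations of min_s ||B s - flow||^2).
L = B.T @ B
s = np.linalg.pinv(L) @ div   # pseudo-inverse handles the constant-shift null space
s -= s.mean()                 # fix the free additive constant: mean-zero scores

print("divergence (net wins):", div)
print("potential (ranking scores):", s)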
For LLM evaluation, this matters directly. Suppose we run a large batch of A/B tests between model outputs, asking annotators or other models to pick the better response. We end up with a directed, weighted graph: nodes are model outputs (or entire models), edges are preferences. The raw edge data is noisy and inconsistent: cycles exist (A preferred over B, B over C, but C over A), and the same item may win against weak baselines but lose against strong competitors.
Graph calculus provides a principled way to decompose this messy situation. As we discuss below, it splits every dataset of preferences into three parts: (1) a clean score-based ranking (the part we really want for global evaluation), (2) local cyclic inconsistencies (revealing areas where annotators or preference models are intransitive or contradictory), and (3) a harmonic residual (global ambiguities where relative rankings are fundamentally underdetermined by the data, e.g. two groups of models that haven’t been compared enough). In practice, this means we can get a ranking that respects most of the preferences, while also diagnosing where the evaluation data are inherently noisy.
So the intuition is: graph calculus is a bridge between local judgments (edges) and global evaluation metrics (node scores). In LLM contexts, this transforms a huge, messy set of pairwise comparisons into:
An overall leaderboard of models or outputs
Insight into why some comparisons don’t fit neatly into a ranking
This turns out to be crucial when evaluation data is both large-scale and imperfect, as it almost always is for LLMs.
Other approaches
Existing systems for turning pairwise comparisons into rankings are often used in sports or online gaming.
Elo: Treats each comparison like a match and updates scores incrementally. Works well when comparisons are sparse and arrive sequentially, but doesn’t naturally decompose inconsistencies or cycles.
TrueSkill (Microsoft’s system for Xbox Live): Bayesian extension of Elo that tracks uncertainty, designed for large multiplayer games. Great for operations like direct confidence intervals, but still assumes a clean global skill variable.
Bradley–Terry–Luce: A maximum-likelihood model where each item \(i\) has a latent score \(\theta_{i}\), and the probability of \(i\) beating \(j\) is \(P(i > j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}\). Widely used in pairwise preference modeling, including many ML evaluation setups.
These methods are powerful, but they all force global consistency: they assume comparisons can be fully explained by a latent score for each item. In real evaluation datasets (for LLMs especially), this is of course false. Preferences are messy, annotators disagree, and sometimes preferences form cycles (A > B, B > C, C > A). The statistical learning methods above tend to smooth over these contradictions, and directly pointing to them requires further analysis.
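For reference, here is a minimal sketch of fitting BTL scores by maximum likelihood; the win-count matrix below is made up purely for illustration.
CODE
import numpy as np
from scipy.optimize import minimize

# Toy win counts: wins[i, j] = number of times item i was preferred over item j.
wins = np.array([
    [0, 8, 6, 9],
    [2, 0, 5, 7],
    [4, 5, 0, 6],
    [1, 3, 4, 0],
], dtype=float)
n = wins.shape[0]

def neg_log_likelihood(theta):
    # P(i > j) = exp(theta_i) / (exp(theta_i) + exp(theta_j)) = sigmoid(theta_i - theta_j)
    diff = theta[:, None] - theta[None, :]
    log_p = -np.log1p(np.exp(-diff))
    return -(wins * log_p).sum()

fit = minimize(neg_log_likelihood, np.zeros(n), method="L-BFGS-B")
theta = fit.x - fit.x.mean()   # scores are only identified up to an additive constant
print("BTL scores:", np.round(theta, 3))

Note that BTL, like Elo, still collapses everything into a single score per item: the cyclic and ambiguous structure discussed below is invisible to it.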
Graph calculus and linear algebra
To directly identify these structures locally, we can use the tools of graph calculus. These are sparse linear algebra tools that decompose the comparisons over the graph into interpretable components.
Gradient / potential: the part of the flow that’s due to some latent node-wise scores. This is the “clean” ranking we would want for a leaderboard.
Curl / cyclic part: this part measures where cycles exist. For LLM evaluation, this is incredibly valuable: it highlights clusters of models or prompts where annotators’ preferences tend to contradict the ranking, or each other. Preference loops among multiple models may expose underlying value conflicts.
Harmonic part: the globally ambiguous piece. This gives information on whether there are entire subgraphs where relative rankings are fundamentally underdetermined by the data (e.g. two groups of models that haven’t been compared enough).
There is a direct payoff to this: not only do we get a leaderboard, we also get diagnostics on where our evaluation pipeline is weak, contradictory, or underspecified. This information is not directly exposed by Elo/TrueSkill/BTL, and must instead be inferred from separate, often non-robust interpretation methods.
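To see all three pieces at once, here is a minimal sketch of the decomposition on a toy graph with unit net flows (names and data are illustrative): a pure 3-cycle lands in the curl part, a chordless 4-cycle in the harmonic part, and the gradient part vanishes.
CODE
import numpy as np
from itertools import combinations

# Toy comparison graph: a triangle on nodes {0,1,2} plus a chordless square on {2,3,4,5}.
# Each edge (i, j, f) carries a net flow f from i to j (i preferred over j on balance).
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0),                  # pure 3-cycle
         (2, 3, 1.0), (3, 4, 1.0), (4, 5, 1.0), (5, 2, 1.0)]     # pure 4-cycle
n, m = 6, len(edges)
edge_index = {(i, j): e for e, (i, j, _) in enumerate(edges)}

# Gradient (incidence) operator B: (B @ s)[e] = s_i - s_j for edge e = (i, j).
B = np.zeros((m, n))
flow = np.zeros(m)
for e, (i, j, f) in enumerate(edges):
    B[e, i], B[e, j] = 1.0, -1.0
    flow[e] = f

def signed(i, j):
    """Edge index and orientation sign for the pair {i, j}, or None if not an edge."""
    if (i, j) in edge_index: return edge_index[(i, j)], 1.0
    if (j, i) in edge_index: return edge_index[(j, i)], -1.0
    return None

# Curl operator C: one row per triangle, summing the net flow around the 3-cycle.
rows = []
for a, b, c in combinations(range(n), 3):
    eab, ebc, eca = signed(a, b), signed(b, c), signed(c, a)
    if eab and ebc and eca:
        row = np.zeros(m)
        for e, sgn in (eab, ebc, eca):
            row[e] = sgn
        rows.append(row)
C = np.array(rows) if rows else np.zeros((0, m))

# Hodge decomposition: flow = gradient part + curl part + harmonic part.
# im(B) and im(C.T) are orthogonal (the curl of a gradient is zero), so we can
# project onto each separately and keep what is left over as the harmonic part.
grad_part = B @ np.linalg.pinv(B) @ flow        # ranking signal
curl_part = C.T @ np.linalg.pinv(C.T) @ flow    # local 3-cycle inconsistencies
harmonic = flow - grad_part - curl_part         # global, chordless-cycle ambiguity

# Here the gradient part is ~0, the curl part captures the triangle,
# and the harmonic part captures the square.
print("gradient:", np.round(grad_part, 3))
print("curl    :", np.round(curl_part, 3))
print("harmonic:", np.round(harmonic, 3))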
Applying this to evaluate LLMs
In practice, LLM evaluation has a few characteristics that make it tricky for principled decision-making: large and diverse prompt sets, systematic biases, and distribution shifts (e.g., in the prompt distribution, or in annotators' cultural/organizational training). This challenging situation has led to a reliance on pairwise preference data (human or model-as-judge), which is more feasible to collect at the cost of being much more indirect.
The graph calculus tools here let us separate “true leaderboard signal” from “annotation noise” in a principled way:
Aggregate pairwise preferences into a global ranking while still respecting the network structure.
Detect cycles and inconsistencies that reveal where models behave differently (e.g. one model is better at reasoning, another better at conciseness).
This makes it closer to a diagnostic toolkit than just a ranking method.
Reference: implementation on NBA data
To introduce some of the basic concepts, it’s useful to run the method on some NBA data.
This relies on the games.csv file from the comprehensive NBA dataset downloadable from Kaggle. We assume it is at path_games and restrict to the most recent Regular Season present (season_id = 22022). The season contains 1,230 games across 30 teams, with columns for home/away teams and points scored.
This ranking is a graph-aware smoothing of raw W–L, implicitly accounting for head-to-head structure; tiny schedule asymmetries are diffused through neighbors. This has several advantages. In practice, this lightweight framework can repeatedly recalculate rankings on different parts of the data (different slices of prompt space, in the LLM setting), which is a powerful empirical tool against distribution shifts such as task-specific subsets or temporal drift. Edge weights per game can be set based on the situation, so the method aggregates weighted wins; this sharpens the separation between teams that beat strong opponents decisively and those that squeak by weak ones. These degrees of flexibility are often useful in practical applications.
Because \(\text{div}(A)\) equals W-L for the season for each team, and the NBA schedule is densely connected, we’d expect the graph ranking to end up nearly a monotone transform of win% on a single season. We are working with a smoothing/propagation operator on the match graph, but with uniform, balanced schedules, the smoothing preserves order to high fidelity.
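A sketch of the ranking computation itself is below. It assumes games.csv exposes columns named season_id, team_abbreviation_home/away, and pts_home/away (adjust to the actual schema of the Kaggle dump), and it produces a res DataFrame with potential and win_pct columns, which is the form used by the plotting cell that follows.
CODE
import numpy as np
import pandas as pd

# Assumed schema for games.csv (adjust to the actual Kaggle dump): season_id,
# team_abbreviation_home, team_abbreviation_away, pts_home, pts_away.
path_games = "games.csv"
games = pd.read_csv(path_games)
games = games[games["season_id"] == 22022]   # most recent regular season; may be a string in some dumps

teams = sorted(set(games["team_abbreviation_home"]) | set(games["team_abbreviation_away"]))
idx = {t: k for k, t in enumerate(teams)}
n = len(teams)

# A[i, j] = number of wins of team i over team j this season.
A = np.zeros((n, n))
wins = np.zeros(n)
played = np.zeros(n)
for _, g in games.iterrows():
    h, a = idx[g["team_abbreviation_home"]], idx[g["team_abbreviation_away"]]
    w, l = (h, a) if g["pts_home"] > g["pts_away"] else (a, h)
    A[w, l] += 1.0
    wins[w] += 1.0
    played[h] += 1.0
    played[a] += 1.0

F = A - A.T                                      # net head-to-head flow
adjacency = ((A + A.T) > 0).astype(float)        # who has played whom
L = np.diag(adjacency.sum(axis=1)) - adjacency   # graph Laplacian
div = F.sum(axis=1)                              # div(A): wins minus losses per team

potential = np.linalg.pinv(L) @ div              # least-squares node scores
potential -= potential.mean()

res = pd.DataFrame({"team": teams, "potential": potential, "win_pct": wins / played}) \
        .sort_values("potential", ascending=False)
print(res.head(10))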
CODE
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

pearson = float(np.corrcoef(res['potential'], res['win_pct'])[0, 1])
spearman = float(res['potential'].rank().corr(res['win_pct'].rank(), method='spearman'))

plt.figure(figsize=(6, 4))
plt.scatter(res['win_pct'], res['potential'])
plt.xlabel('Win %')
plt.ylabel('Potential (mean=0)')
plt.title(f'Win% vs Potential (Pearson corr: {pearson:.4f}, Spearman corr: {spearman:.4f})')
plt.tight_layout()
plt.show()
Implementation on LMArena data
Here we build a cheap, robust, and principled ranker of the models, using graph calculus tools: just a little linear algebra on the pairwise comparison graph. First we load the data, then we build a digraph from all decisive votes.
import pandas as pd, numpy as np
import scipy.sparse, scipy.stats
import networkx as nx
from datasets import load_dataset
from numpy.linalg import pinv
from sentence_transformers import SentenceTransformer, util

# -------------------------------
# Load LM Arena dataset
# -------------------------------
ds = load_dataset("lmarena-ai/arena-human-preference-55k", split="train")

# Convert to pandas for convenience
df = ds.to_pandas()
df
| | id | model_a | model_b | prompt | response_a | response_b | winner_model_a | winner_model_b | winner_tie |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 30192 | gpt-4-1106-preview | gpt-4-0613 | ["Is it morally right to try to have a certain... | ["The question of whether it is morally right ... | ["As an AI, I don't have personal beliefs or o... | 1 | 0 | 0 |
| 1 | 53567 | koala-13b | gpt-4-0613 | ["What is the difference between marriage lice... | ["A marriage license is a legal document that ... | ["A marriage license and a marriage certificat... | 0 | 1 | 0 |
| 2 | 65089 | gpt-3.5-turbo-0613 | mistral-medium | ["explain function calling. how would you call... | ["Function calling is the process of invoking ... | ["Function calling is the process of invoking ... | 0 | 0 | 1 |
| 3 | 96401 | llama-2-13b-chat | mistral-7b-instruct | ["How can I create a test set for a very rare ... | ["Creating a test set for a very rare category... | ["When building a classifier for a very rare c... | 1 | 0 | 0 |
| 4 | 198779 | koala-13b | gpt-3.5-turbo-0314 | ["What is the best way to travel from Tel-Aviv... | ["The best way to travel from Tel Aviv to Jeru... | ["The best way to travel from Tel-Aviv to Jeru... | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 57472 | 4294656694 | gpt-4-0613 | claude-1 | ["A simple mnemonic for \u03c0:\n\"How I wish ... | ["Sure, let's break it down:\n\n1. \"How\" has... | ["Here is how that mnemonic represents the dig... | 1 | 0 | 0 |
| 57473 | 4294692063 | claude-2.0 | llama-2-13b-chat | ["In python, implement a naive Bayes with gaus... | ["Here is an implementation of a naive Bayes c... | ["Sure! Here's an implementation of a naive Ba... | 1 | 0 | 0 |
| 57474 | 4294710549 | claude-1 | alpaca-13b | ["is it unethical to work on building weapons?... | ["Working on weapons technology raises some et... | ["It depends on the context. Weapons can be us... | 1 | 0 | 0 |
| 57475 | 4294899228 | palm-2 | tulu-2-dpo-70b | ["If a bait contains 0,0025% bromadiolon then ... | ["Bromadiolone is a rodenticide which is most ... | ["As an AI language model, I do not promote or... | 0 | 1 | 0 |
| 57476 | 4294947231 | gemini-pro-dev-api | gpt-4-1106-preview | ["three kids eat three apples in three days, h... | ["27 apples"] | ["If three kids eat three apples in three days... | 1 | 0 | 0 |

57477 rows × 9 columns
CODE
# Each row has model_a, model_b, winner, tie, comment, etc.
# Keep only decisive votes (ignore ties for baseline)
decisive = df[df['winner_tie'] == 0]

# -------------------------------
# Build preference digraph
# -------------------------------
models = sorted(set(decisive['model_a']).union(set(decisive['model_b'])))
idx = {m: i for i, m in enumerate(models)}
n = len(models)

# Accumulate counts in a lil_matrix (cheap incremental updates), then convert to csr.
flow = scipy.sparse.lil_matrix((n, n))
for _, row in decisive.iterrows():
    if row['winner_model_a'] == 1:
        i, j = idx[row['model_a']], idx[row['model_b']]
    else:
        i, j = idx[row['model_b']], idx[row['model_a']]
    flow[i, j] += 1
flow = flow.tocsr()
Finding a ranking
We have a choice here - whether or not to antisymmetrize the graph. We don’t do this in practice because the deviation from antisymmetry is itself interesting. But semantically, an argument could be made that A beating B (a positively weighted edge) should always go hand-in-hand with B losing to A. We can try both and verify that it’s inconsequential, as should normally be the case.
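Concretely, here is a minimal sketch of the potential-based ranking computed from the flow matrix built above; the ranking_baseline DataFrame referenced in the next cell is assumed to have this shape (one Score per model), though the exact ranker may differ in details such as edge weighting.
CODE
# Minimal sketch: solve the graph Laplacian system L s = div on the comparison graph.
A = flow.toarray()                                  # dense n x n vote-count matrix
adjacency = ((A + A.T) > 0).astype(float)           # which model pairs have been compared
L_graph = np.diag(adjacency.sum(axis=1)) - adjacency
div = A.sum(axis=1) - A.sum(axis=0)                 # net wins minus losses per model
                                                    # (same whether or not we antisymmetrize first)

potential = np.linalg.pinv(L_graph) @ div
potential -= potential.mean()

ranking_baseline = pd.DataFrame({'Score': potential}, index=models) \
                     .sort_values('Score', ascending=False)
ranking_baseline.head(10)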
One simple way to get a ranking from pairwise comparisons is to treat each comparison as a match, so that the overall win-loss records of models can be compared as a crude measure of proficiency, just as in a sports league. As a sanity check, this shows rough agreement with the graph-based ranking.
CODE
def compute_raw_wins_metric(subset_df):
    # Count raw wins and losses per model across all decisive comparisons.
    wins_dict = {}
    losses_dict = {}
    for i in range(subset_df.shape[0]):
        row = subset_df.iloc[i, :]
        winner_model = row['model_a'] if row['winner_model_a'] == 1 else row['model_b']
        loser_model = row['model_b'] if row['winner_model_a'] == 1 else row['model_a']
        wins_dict[winner_model] = wins_dict.get(winner_model, 0) + 1
        losses_dict[loser_model] = losses_dict.get(loser_model, 0) + 1
    # Net difference = wins - losses per model.
    net_diff_dict = wins_dict.copy()
    for m in losses_dict:
        net_diff_dict[m] = net_diff_dict.get(m, 0) - losses_dict[m]
    return pd.DataFrame(net_diff_dict, index=[0]).T

net_diff_df = compute_raw_wins_metric(decisive).sort_values(0, ascending=False)
net_diff_df.columns = ['Net Difference']
net_diff_df['Score'] = ranking_baseline.loc[net_diff_df.index, 'Score']
net_diff_df.sort_values('Score', ascending=False)

import matplotlib.pyplot as plt
plt.scatter(scipy.stats.rankdata(net_diff_df['Score']),
            scipy.stats.rankdata(net_diff_df['Net Difference']))
plt.xlabel('Ranked Score')
plt.ylabel('Ranked Wins - Losses')
plt.title('Ranked Score vs. Ranked Wins - Losses')
plt.show()
Quick filter: “substantial difference”
We can use embeddings to approximate whether the two responses differ substantially; only comparisons with substantial differences are kept, and the ranking is then recomputed. This lets us see how much the leaderboard depends on ambiguous cases.
To simplify the current implementation, call the difference substantial if the cosine similarity between the responses is low enough. This yields a subset of the dataframe.
CODE
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def is_substantial(row, threshold=0.85):
    reps = [row['response_a'], row['response_b']]
    embs = embedder.encode(reps, convert_to_tensor=True)
    sim = float(util.cos_sim(embs[0], embs[1]))
    return sim < threshold   # low similarity => the two responses differ substantially

# Apply filter on a sample subset (to save time in demo)
subset = decisive.copy()  # .sample(5000, random_state=42).copy()
subset['substantial'] = subset.apply(is_substantial, axis=1)
filtered = subset[subset['substantial']]
We can also subtract the ranking-implied flow from the observed flow, leaving the residual flow. This highlights pairs (or clusters) of models with cyclic preference structures, and other areas where there’s preference structure that does not further clarify the ranking.
The residual flow highlights not only contradictions, but also cases where the ranking systematically fails to explain the observed preferences. What more can be read off beyond this depends on how the data are actually filtered.
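A sketch of this residual computation is below, reusing the A, adjacency, potential, and models variables from the ranking sketch above (the names are illustrative).
CODE
F = A - A.T                                        # observed net preference flow
implied = np.subtract.outer(potential, potential)  # ranking-implied flow s_i - s_j
residual = (F - implied) * adjacency               # leftover (cyclic + harmonic) structure

# The largest residual points at the pair whose head-to-head record the ranking
# explains worst.
i, j = np.unravel_index(np.argmax(np.abs(residual)), residual.shape)
print(models[i], "vs", models[j], "residual:", round(float(residual[i, j]), 2))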
Discussion
Top models (GPT-4, Claude, GPT-4 Turbo, etc.) maintain the same rankings in both the baseline and filtered leaderboards. Mid-tier models (Mistral, Llama-2, etc.) shift in relative position more noticeably once only substantial differences are considered. So we have the intuitive finding that the marginal comparisons mainly help tell apart the performance of the less-good models, which are the ones that make mistakes in such situations.