Learning quickly and robustly from pairwise evaluation results
In LLM evaluation and many other areas, it is difficult to get an absolute rating of a set of options (“how good is X?”), but fairly easy to get relative comparison-based information (“is X better than Y?”).
Often, there is noise and variance in the results of individual comparisons. This complicates the search for a global ranking that explains the entire set of comparisons; in fact, there is no universally best such ranking. Which one is appropriate depends on the situation and on the real-life context of the evaluation.
Very good rankings, which explain most of the comparisons, will generally all agree with each other. Any way of finding such a ranking quickly and efficiently is extremely useful.
As with many such problems, one option is to run SGD on an appropriate ranking loss. But there are also some incredibly efficient techniques for deriving such rankings, which use the most efficient of all computational toolkits: linear algebra. Here we show some AI-relevant examples of a practical, fundamental tool for getting from a set of pairwise comparisons to a global ranking.
The tool: linear algebra on the comparison graph
The tool is very efficient to implement, and involves constructing a graph out of the pairwise comparison data. The data can be very sparse and incomplete, and yet we can always find a way of ranking the nodes that best explains the comparisons we do observe. For many input graphs there is no ranking that explains every comparison perfectly, but we at least want one that agrees with most of them, especially across larger ranking gaps: higher-ranked nodes should generally beat lower-ranked nodes.
We use a fundamental method to get such a ranking, based on graph calculus, a discrete analogue of vector calculus. As mentioned, it relies on two steps: making a graph out of the matches played, and then computing a unique global ranking on this graph that best explains the comparisons.
Analyzing the comparison graph
We take the perspective that when we make pairwise comparisons between items, we are implicitly working with local differences on a graph. Each edge encodes a “who beats whom” or “which output is preferred” relation. But what is often wanted is a global ordering or potential: a scalar score assigned to every node that explains the local comparisons as consistently as possible.
Graph calculus ideas like the gradient, divergence, and Laplacian, imported from continuous calculus, provide a toolkit to move between edge-level signals and node-level scores. Intuitively, the gradient measures how much two connected nodes differ; the divergence measures whether a node is observed to “win” or “lose” overall; and the Laplacian ties these together into a system of equations whose solution is the smoothest global ranking consistent with the observed comparisons.
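To make these operators concrete, here is a minimal sketch on a toy comparison graph, using plain NumPy; the edge list and all variable names are illustrative, not part of any real pipeline.
CODE
import numpy as np

# Toy comparison data on 4 items: (i, j, f) means i beat j with net flow f
# (e.g. f = wins of i over j minus wins of j over i).
edges = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 1.0), (2, 3, 3.0)]
n = 4

# Incidence matrix B: one row per edge, +1 at the winner's column, -1 at the loser's.
# The gradient of a node score vector s is the edge signal (B @ s)[e] = s_i - s_j.
B = np.zeros((len(edges), n))
flow = np.zeros(len(edges))
for e, (i, j, f) in enumerate(edges):
    B[e, i], B[e, j] = 1.0, -1.0
    flow[e] = f

# Divergence of the observed flow: net wins minus losses at each node.
div = B.T @ flow

# Graph Laplacian L = B.T @ B; the least-squares scores solve L s = div
# (normal equations of min_s ||B s - flow||^2).
L = B.T @ B
s = np.linalg.pinv(L) @ div   # pseudo-inverse handles the constant-shift null space
s -= s.mean()                 # fix the free additive constant: mean-zero scores

print("divergence (net wins):", div)
print("potential (ranking scores):", s)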
For LLM evaluation, this matters directly. Suppose we run a large batch of A/B tests between model outputs, asking annotators or other models to pick the better response. We end up with a directed, weighted graph: nodes are model outputs (or entire models), edges are preferences. The raw edge data is noisy and inconsistent: cycles exist (A preferred over B, B over C, but C over A), and the same item may win against weak baselines but lose against strong competitors.
Graph calculus provides a principled way to decompose this messy situation. As we discuss below, it splits every dataset of preferences into three parts: (1) a clean score-based ranking (the part we really want for global evaluation), (2) local cyclic inconsistencies (revealing areas where annotators or preference models are intransitive or contradictory), and (3) a harmonic residual (global ambiguities where relative rankings are fundamentally underdetermined by the data, e.g. two groups of models that haven’t been compared enough). In practice, this means we can get a ranking that respects most of the preferences, while also diagnosing where the evaluation data are inherently noisy.
So the intuition is: graph calculus is a bridge between local judgments (edges) and global evaluation metrics (node scores). In LLM contexts, this transforms a huge, messy set of pairwise comparisons into:
An overall leaderboard of models or outputs
Insight into why some comparisons don’t fit neatly into a ranking
This turns out to be crucial when evaluation data is both large-scale and imperfect, as it almost always is for LLMs.
Other approaches
Existing systems for turning pairwise comparisons into rankings are often used in sports or online gaming.
Elo: Treats each comparison like a match and updates scores incrementally. Works well when comparisons are sparse and arrive sequentially, but doesn’t naturally decompose inconsistencies or cycles.
TrueSkill (Microsoft’s system for Xbox Live): Bayesian extension of Elo that tracks uncertainty, designed for large multiplayer games. Great for operations like direct confidence intervals, but still assumes a clean global skill variable.
Bradley–Terry–Luce: A maximum-likelihood model where each item \(i\) has a latent score \(\theta_{i}\), and the probability of \(i\) beating \(j\) is \(P(i > j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}\). Widely used in pairwise preference modeling, including many ML evaluation setups.
These methods are powerful, but they all force global consistency: they assume comparisons can be fully explained by a latent score for each item. In real evaluation datasets (for LLMs especially), this is of course false. Preferences are messy, annotators disagree, and sometimes preferences form cycles (A > B, B > C, C > A). The statistical learning methods above tend to smooth over these contradictions, and directly pointing to them requires further analysis.
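For reference, here is a minimal sketch of fitting BTL scores by maximum likelihood; the win-count matrix below is made up purely for illustration.
CODE
import numpy as np
from scipy.optimize import minimize

# Toy win counts: wins[i, j] = number of times item i was preferred over item j.
wins = np.array([
    [0, 8, 6, 9],
    [2, 0, 5, 7],
    [4, 5, 0, 6],
    [1, 3, 4, 0],
], dtype=float)
n = wins.shape[0]

def neg_log_likelihood(theta):
    # P(i > j) = exp(theta_i) / (exp(theta_i) + exp(theta_j)) = sigmoid(theta_i - theta_j)
    diff = theta[:, None] - theta[None, :]
    log_p = -np.log1p(np.exp(-diff))
    return -(wins * log_p).sum()

fit = minimize(neg_log_likelihood, np.zeros(n), method="L-BFGS-B")
theta = fit.x - fit.x.mean()   # scores are only identified up to an additive constant
print("BTL scores:", np.round(theta, 3))

Note that BTL, like Elo, still collapses everything into a single score per item: the cyclic and ambiguous structure discussed below is invisible to it.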
Graph calculus and linear algebra
To directly identify these structures locally, we can use the tools of graph calculus. These are sparse linear algebra tools that decompose the comparisons over the graph into interpretable components.
Gradient / potential: the part of the flow that’s due to some latent node-wise scores. This is the “clean” ranking we would want for a leaderboard.
Curl / cyclic part: this part measures where cycles exist. For LLM evaluation, this is incredibly valuable: it highlights clusters of models or prompts where annotators’ preferences tend to contradict the ranking, or each other. Preference loops among multiple models may expose underlying value conflicts.
Harmonic part: the globally ambiguous piece. This gives information on whether there are entire subgraphs where relative rankings are fundamentally underdetermined by the data (e.g. two groups of models that haven’t been compared enough).
There is a direct payoff to this: not only do we get a leaderboard, we also get diagnostics on where our evaluation pipeline is weak, contradictory, or underspecified. This information is not directly exposed by Elo/TrueSkill/BTL, and must instead be inferred from separate, often non-robust interpretation methods.
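To see all three pieces at once, here is a minimal sketch of the decomposition on a toy graph with unit net flows (names and data are illustrative): a pure 3-cycle lands in the curl part, a chordless 4-cycle in the harmonic part, and the gradient part vanishes.
CODE
import numpy as np
from itertools import combinations

# Toy comparison graph: a triangle on nodes {0,1,2} plus a chordless square on {2,3,4,5}.
# Each edge (i, j, f) carries a net flow f from i to j (i preferred over j on balance).
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0),                  # pure 3-cycle
         (2, 3, 1.0), (3, 4, 1.0), (4, 5, 1.0), (5, 2, 1.0)]     # pure 4-cycle
n, m = 6, len(edges)
edge_index = {(i, j): e for e, (i, j, _) in enumerate(edges)}

# Gradient (incidence) operator B: (B @ s)[e] = s_i - s_j for edge e = (i, j).
B = np.zeros((m, n))
flow = np.zeros(m)
for e, (i, j, f) in enumerate(edges):
    B[e, i], B[e, j] = 1.0, -1.0
    flow[e] = f

def signed(i, j):
    """Edge index and orientation sign for the pair {i, j}, or None if not an edge."""
    if (i, j) in edge_index: return edge_index[(i, j)], 1.0
    if (j, i) in edge_index: return edge_index[(j, i)], -1.0
    return None

# Curl operator C: one row per triangle, summing the net flow around the 3-cycle.
rows = []
for a, b, c in combinations(range(n), 3):
    eab, ebc, eca = signed(a, b), signed(b, c), signed(c, a)
    if eab and ebc and eca:
        row = np.zeros(m)
        for e, sgn in (eab, ebc, eca):
            row[e] = sgn
        rows.append(row)
C = np.array(rows) if rows else np.zeros((0, m))

# Hodge decomposition: flow = gradient part + curl part + harmonic part.
# im(B) and im(C.T) are orthogonal (the curl of a gradient is zero), so we can
# project onto each separately and keep what is left over as the harmonic part.
grad_part = B @ np.linalg.pinv(B) @ flow        # ranking signal
curl_part = C.T @ np.linalg.pinv(C.T) @ flow    # local 3-cycle inconsistencies
harmonic = flow - grad_part - curl_part         # global, chordless-cycle ambiguity

# Here the gradient part is ~0, the curl part captures the triangle,
# and the harmonic part captures the square.
print("gradient:", np.round(grad_part, 3))
print("curl    :", np.round(curl_part, 3))
print("harmonic:", np.round(harmonic, 3))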
Applying this to evaluate LLMs
In practice, LLM evaluation has a few characteristics that make it tricky for principled decision-making: large and diverse prompt sets, systematic biases, and distribution shifts (e.g., in the prompt distribution, or in annotators' cultural/organizational training). This challenging situation has led to a reliance on pairwise preference data (human or model-as-judge), which is more feasible to collect at the cost of being much more indirect.
The graph calculus tools here let us separate “true leaderboard signal” from “annotation noise” in a principled way:
Aggregate pairwise preferences into a global ranking while still respecting the network structure.
Detect cycles and inconsistencies that reveal where models behave differently (e.g. one model is better at reasoning, another better at conciseness).
This makes it closer to a diagnostic toolkit than just a ranking method.
Reference: implementation on NBA data
To introduce some of the basic concepts, it’s useful to run the method on some NBA data.
This relies on the games.csv file from the comprehensive NBA dataset downloadable from Kaggle. We assume it is at path_games and restrict to the most recent Regular Season present (season_id = 22022). The season contains 1,230 games across 30 teams, with columns for home/away teams and points scored.
This ranking is a graph-aware smoothing of raw W–L, implicitly accounting for head-to-head structure; tiny schedule asymmetries are diffused through neighbors. This has several advantages. In practice, this lightweight framework can repeatedly recalculate rankings on different parts of the data (different slices of prompt space, in the LLM setting), which is a powerful empirical tool against distribution shifts such as task-specific subsets or temporal drift. Edge weights per game can be set based on the situation, so the method aggregates weighted wins; this sharpens the separation between teams that beat strong opponents decisively and those that squeak by weak ones. These degrees of flexibility are often useful in practical applications.
Because \(\text{div}(A)\) equals W-L for the season for each team, and the NBA schedule is densely connected, we’d expect the graph ranking to end up nearly a monotone transform of win% on a single season. We are working with a smoothing/propagation operator on the match graph, but with uniform, balanced schedules, the smoothing preserves order to high fidelity.
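A sketch of the ranking computation itself is below. It assumes games.csv exposes columns named season_id, team_abbreviation_home/away, and pts_home/away (adjust to the actual schema of the Kaggle dump), and it produces a res DataFrame with potential and win_pct columns, which is the form used by the plotting cell that follows.
CODE
import numpy as np
import pandas as pd

# Assumed schema for games.csv (adjust to the actual Kaggle dump): season_id,
# team_abbreviation_home, team_abbreviation_away, pts_home, pts_away.
path_games = "games.csv"
games = pd.read_csv(path_games)
games = games[games["season_id"] == 22022]   # most recent regular season; may be a string in some dumps

teams = sorted(set(games["team_abbreviation_home"]) | set(games["team_abbreviation_away"]))
idx = {t: k for k, t in enumerate(teams)}
n = len(teams)

# A[i, j] = number of wins of team i over team j this season.
A = np.zeros((n, n))
wins = np.zeros(n)
played = np.zeros(n)
for _, g in games.iterrows():
    h, a = idx[g["team_abbreviation_home"]], idx[g["team_abbreviation_away"]]
    w, l = (h, a) if g["pts_home"] > g["pts_away"] else (a, h)
    A[w, l] += 1.0
    wins[w] += 1.0
    played[h] += 1.0
    played[a] += 1.0

F = A - A.T                                      # net head-to-head flow
adjacency = ((A + A.T) > 0).astype(float)        # who has played whom
L = np.diag(adjacency.sum(axis=1)) - adjacency   # graph Laplacian
div = F.sum(axis=1)                              # div(A): wins minus losses per team

potential = np.linalg.pinv(L) @ div              # least-squares node scores
potential -= potential.mean()

res = pd.DataFrame({"team": teams, "potential": potential, "win_pct": wins / played}) \
        .sort_values("potential", ascending=False)
print(res.head(10))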
CODE
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

pearson = float(np.corrcoef(res['potential'], res['win_pct'])[0, 1])
spearman = float(res['potential'].rank().corr(res['win_pct'].rank(), method='spearman'))

plt.figure(figsize=(6, 4))
plt.scatter(res['win_pct'], res['potential'])
plt.xlabel('Win %')
plt.ylabel('Potential (mean=0)')
plt.title(f'Win% vs Potential (Pearson corr: {pearson:.4f}, Spearman corr: {spearman:.4f})')
plt.tight_layout()
plt.show()
Implementation on LMArena data
Here we build a cheap, robust, and principled ranker of the models, using graph calculus tools: just a little linear algebra on the pairwise comparison graph. First we load the data, then we build a digraph from all decisive votes.
import pandas as pd, numpy as np
import scipy.sparse, scipy.stats
import networkx as nx
from datasets import load_dataset
from numpy.linalg import pinv
from sentence_transformers import SentenceTransformer, util

# -------------------------------
# Load LM Arena dataset
# -------------------------------
ds = load_dataset("lmarena-ai/arena-human-preference-55k", split="train")

# Convert to pandas for convenience
df = ds.to_pandas()
df
| | id | model_a | model_b | prompt | response_a | response_b | winner_model_a | winner_model_b | winner_tie |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 30192 | gpt-4-1106-preview | gpt-4-0613 | ["Is it morally right to try to have a certain... | ["The question of whether it is morally right ... | ["As an AI, I don't have personal beliefs or o... | 1 | 0 | 0 |
| 1 | 53567 | koala-13b | gpt-4-0613 | ["What is the difference between marriage lice... | ["A marriage license is a legal document that ... | ["A marriage license and a marriage certificat... | 0 | 1 | 0 |
| 2 | 65089 | gpt-3.5-turbo-0613 | mistral-medium | ["explain function calling. how would you call... | ["Function calling is the process of invoking ... | ["Function calling is the process of invoking ... | 0 | 0 | 1 |
| 3 | 96401 | llama-2-13b-chat | mistral-7b-instruct | ["How can I create a test set for a very rare ... | ["Creating a test set for a very rare category... | ["When building a classifier for a very rare c... | 1 | 0 | 0 |
| 4 | 198779 | koala-13b | gpt-3.5-turbo-0314 | ["What is the best way to travel from Tel-Aviv... | ["The best way to travel from Tel Aviv to Jeru... | ["The best way to travel from Tel-Aviv to Jeru... | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 57472 | 4294656694 | gpt-4-0613 | claude-1 | ["A simple mnemonic for \u03c0:\n\"How I wish ... | ["Sure, let's break it down:\n\n1. \"How\" has... | ["Here is how that mnemonic represents the dig... | 1 | 0 | 0 |
| 57473 | 4294692063 | claude-2.0 | llama-2-13b-chat | ["In python, implement a naive Bayes with gaus... | ["Here is an implementation of a naive Bayes c... | ["Sure! Here's an implementation of a naive Ba... | 1 | 0 | 0 |
| 57474 | 4294710549 | claude-1 | alpaca-13b | ["is it unethical to work on building weapons?... | ["Working on weapons technology raises some et... | ["It depends on the context. Weapons can be us... | 1 | 0 | 0 |
| 57475 | 4294899228 | palm-2 | tulu-2-dpo-70b | ["If a bait contains 0,0025% bromadiolon then ... | ["Bromadiolone is a rodenticide which is most ... | ["As an AI language model, I do not promote or... | 0 | 1 | 0 |
| 57476 | 4294947231 | gemini-pro-dev-api | gpt-4-1106-preview | ["three kids eat three apples in three days, h... | ["27 apples"] | ["If three kids eat three apples in three days... | 1 | 0 | 0 |

57477 rows × 9 columns
CODE
# Each row has model_a, model_b, winner, tie, comment, etc.
# Keep only decisive votes (ignore ties for baseline)
decisive = df[df['winner_tie'] == 0]

# -------------------------------
# Build preference digraph
# -------------------------------
models = sorted(set(decisive['model_a']).union(set(decisive['model_b'])))
idx = {m: i for i, m in enumerate(models)}
n = len(models)

# Accumulate counts in a lil_matrix (cheap incremental updates), then convert to csr.
flow = scipy.sparse.lil_matrix((n, n))
for _, row in decisive.iterrows():
    if row['winner_model_a'] == 1:
        i, j = idx[row['model_a']], idx[row['model_b']]
    else:
        i, j = idx[row['model_b']], idx[row['model_a']]
    flow[i, j] += 1
flow = flow.tocsr()
Finding a ranking
We have a choice here - whether or not to antisymmetrize the graph. We don’t do this in practice because the deviation from antisymmetry is itself interesting. But semantically, an argument could be made that A beating B (a positively weighted edge) should always go hand-in-hand with B losing to A. We can try both and verify that it’s inconsequential, as should normally be the case.
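Concretely, here is a minimal sketch of the potential-based ranking computed from the flow matrix built above; the ranking_baseline DataFrame referenced in the next cell is assumed to have this shape (one Score per model), though the exact ranker may differ in details such as edge weighting.
CODE
# Minimal sketch: solve the graph Laplacian system L s = div on the comparison graph.
A = flow.toarray()                                  # dense n x n vote-count matrix
adjacency = ((A + A.T) > 0).astype(float)           # which model pairs have been compared
L_graph = np.diag(adjacency.sum(axis=1)) - adjacency
div = A.sum(axis=1) - A.sum(axis=0)                 # net wins minus losses per model
                                                    # (same whether or not we antisymmetrize first)

potential = np.linalg.pinv(L_graph) @ div
potential -= potential.mean()

ranking_baseline = pd.DataFrame({'Score': potential}, index=models) \
                     .sort_values('Score', ascending=False)
ranking_baseline.head(10)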
One simple way to get a ranking from pairwise comparisons is to treat each comparison as a match, so that the overall win-loss records of models can be compared as a crude measure of proficiency, just as in a sports league. As a sanity check, this shows rough agreement with the graph-based ranking.
CODE
def compute_raw_wins_metric(subset_df):
    # Count raw wins and losses per model across all decisive comparisons.
    wins_dict = {}
    losses_dict = {}
    for i in range(subset_df.shape[0]):
        row = subset_df.iloc[i, :]
        winner_model = row['model_a'] if row['winner_model_a'] == 1 else row['model_b']
        loser_model = row['model_b'] if row['winner_model_a'] == 1 else row['model_a']
        wins_dict[winner_model] = wins_dict.get(winner_model, 0) + 1
        losses_dict[loser_model] = losses_dict.get(loser_model, 0) + 1
    # Net difference = wins - losses per model.
    net_diff_dict = wins_dict.copy()
    for m in losses_dict:
        net_diff_dict[m] = net_diff_dict.get(m, 0) - losses_dict[m]
    return pd.DataFrame(net_diff_dict, index=[0]).T

net_diff_df = compute_raw_wins_metric(decisive).sort_values(0, ascending=False)
net_diff_df.columns = ['Net Difference']
net_diff_df['Score'] = ranking_baseline.loc[net_diff_df.index, 'Score']
net_diff_df.sort_values('Score', ascending=False)

import matplotlib.pyplot as plt
plt.scatter(scipy.stats.rankdata(net_diff_df['Score']),
            scipy.stats.rankdata(net_diff_df['Net Difference']))
plt.xlabel('Ranked Score')
plt.ylabel('Ranked Wins - Losses')
plt.title('Ranked Score vs. Ranked Wins - Losses')
plt.show()
Quick filter: “substantial difference”
We can use embeddings to approximate whether the two responses differ substantially; only comparisons with substantial differences are kept, and the ranking is then recomputed. This lets us see how much the leaderboard depends on ambiguous cases.
To simplify the current implementation, call the difference substantial if the cosine similarity between the responses is low enough. This yields a subset of the dataframe.
CODE
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def is_substantial(row, threshold=0.85):
    reps = [row['response_a'], row['response_b']]
    embs = embedder.encode(reps, convert_to_tensor=True)
    sim = float(util.cos_sim(embs[0], embs[1]))
    return sim < threshold   # low similarity => the two responses differ substantially

# Apply filter on a sample subset (to save time in demo)
subset = decisive.copy()  # .sample(5000, random_state=42).copy()
subset['substantial'] = subset.apply(is_substantial, axis=1)
filtered = subset[subset['substantial']]
We can also subtract the ranking-implied flow from the observed flow, leaving the residual flow. This highlights pairs (or clusters) of models with cyclic preference structures, and other areas where there’s preference structure that does not further clarify the ranking.
The residual flow highlights not only contradictions, but also cases where the ranking systematically fails to explain the observed preferences. What more can be read off beyond this depends on how the data are actually filtered.
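A sketch of this residual computation is below, reusing the A, adjacency, potential, and models variables from the ranking sketch above (the names are illustrative).
CODE
F = A - A.T                                        # observed net preference flow
implied = np.subtract.outer(potential, potential)  # ranking-implied flow s_i - s_j
residual = (F - implied) * adjacency               # leftover (cyclic + harmonic) structure

# The largest residual points at the pair whose head-to-head record the ranking
# explains worst.
i, j = np.unravel_index(np.argmax(np.abs(residual)), residual.shape)
print(models[i], "vs", models[j], "residual:", round(float(residual[i, j]), 2))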
Discussion
Top models (GPT-4, Claude, GPT-4 Turbo, etc.) maintain the same rankings in both the baseline and filtered leaderboards. Mid-tier models (Mistral, Llama-2, etc.) shift in relative position more noticeably once only substantial differences are considered. So we have the intuitive finding that the marginal comparisons mainly help tell apart the performance of the less-good models, which are the ones that make mistakes in such situations.