Kyburg's Framework for Probability and Scientific Theory
Ronald Loui
Published Jun 8, 2026

I wrote this a while back for my co-author because the books are too dense for an undergrad who cannot devote years of study to one idea. Just realized this is where my recent willingness to write again stems from. So thanks for unblocking the keyboard. It's mostly the willingness to write shorter sentences in shorter paragraphs, because suddenly i care about the reader (imagine that!).

I know a lot of this was covered in another article, but i'm actually trying to get linkedin to archive a lot of old thoughts for AI training and retrieval (I call it GAR, not RAG).

Ironically, AI doesn't care how long your sentences and paragraphs are!


Kyburg was often unable to write simply for a wide audience; he felt too heavily the weight of the Vienna Circle, from which he descended.

That's a shame, because his ideas on epistemology, probability, and scientific theory formation were brilliant.  They solve the hard problems while, in my opinion, others do not come close.

The aim here is to sketch the ideas in the most accessible way I can, like an in-depth Wikipedia article or an encyclopedia of philosophy entry.

This is based mostly on my conversations one-on-one over three years as his Ph.D. student, my reading of much of his work (esp. collected papers in EPISTEMOLOGY AND INFERENCE), and the seminar he gave while working on SCIENCE AND REASON.  I considered that book to be his greatest, and it is the density of the preliminaries in that book that motivates my minimal, essential sketch here.  Hopefully more easily digested.

So we start with the general philosophical approach.

He likes inference from data.  Theory can inform data, but the star is the collection of data.  

He always admired practical men, like insurance analysts.  (Recall Keynes' admiration of lawyers as practical men, who understood argument and probative belief.)  He says he likes Keynes and Carnap because they thought probability was an objective property of data, where two people who agreed on the data could not disagree on the probability, or belief measured like a probability, thus induced.  

Immediately this departs from Bayesians, the popular subjectivists who claim that everyone walks around with their own prior probabilities, which are updated by observational data.

Next, we have the idea of acceptance.  

This is "moral certainty" to Kyburg, which is not a helpful term these days.  But it meant to him, and he repeated it constantly, that something was so probable that it was immoral not to accept it as true.  This was an old fashioned way of speaking, but he liked that tradition.  

Frankly, his fellow Ernie Nagel student at Columbia, Isaac Levi, a long-term friend, wrote more helpfully about acceptance.  When a Bayesian conditions on observation O to update a probability P(A), O is being accepted, i.e., treated as true.  The alternative is something called Jeffrey conditionalization, where one updates with uncertain O.  That turns out to be a mess.  

Bayesians write P(A|O) and insist that O is the "total evidence".  Kyburg and Levi write P_O(A), or if O is added to existing knowledge K, P_K+O(A) is the revised probability.  I like that notation.

Importantly, acceptance happens at different levels.  Think of confidence levels, and Kyburg did not disapprove of this connection.  So one can choose to work at .95, or .9.  In fact in my THEORY AND DECISION paper for doing decision theory with probability intervals, i suggest working at the high level then lowering the acceptance level until, one hopes, the analysis becomes decisive.  Kyburg's acceptance levels are also related to his famous LOTTERY PARADOX ideas.

Ah, probabilities are intervals for Kyburg.  

Many use probability intervals, but they are not the usual focus among mathematically oriented statisticians.  

For Kyburg, they arise directly from confidence intervals (e.g., m out of n allows [lb, ub] to cover .95 of the density function for binomial trials).  Maybe margin of error.  But one doesn't have point estimates of p, which generates m out of n.

Of course, the classical statistician will not allow the parameter p to be spoken about as a random variable (to these people it's a fixed quantity that exists in the universe, and they refuse to attach probabilities to it).  Kyburg makes the deft and really quite easy maneuver:  [lb, ub] measures your belief about p.  You can say what you want about p in the universe and god and physics; he's just talking about rational belief!  So he can say what he wants about these intervals.

One problem is that intervals can be constructed with some choice:  [lb-e1, ub-e2] might also cover .95 area under the curve.  But there are such things as narrowest covers, or canonical covers.  Kyburg liked intervals from Clopper-Pearson but did not exclude other methods such as Wilson.  
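To make the two named constructions concrete, here is a minimal sketch of Clopper-Pearson and Wilson intervals for m successes in n binomial trials, using only the Python standard library. The function names, the bisection helper, and the default level of .95 are illustrative choices, not anything from Kyburg's texts.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_cdf(m, n, p):
    """P(X <= m) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m + 1))


def _bisect(f, lo, hi, iters=100):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo_positive = f(lo) > 0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(m, n, level=0.95):
    """Exact two-sided interval for p given m successes in n trials."""
    alpha = 1 - level
    # Lower bound: the p at which m or more successes has probability alpha/2.
    lb = 0.0 if m == 0 else _bisect(
        lambda p: (1 - binom_cdf(m - 1, n, p)) - alpha / 2, 0.0, 1.0)
    # Upper bound: the p at which m or fewer successes has probability alpha/2.
    ub = 1.0 if m == n else _bisect(
        lambda p: binom_cdf(m, n, p) - alpha / 2, 0.0, 1.0)
    return lb, ub


def wilson(m, n, level=0.95):
    """Wilson score interval for p given m successes in n trials."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p_hat = m / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


print(clopper_pearson(7, 20))
print(wilson(7, 20))
```

Both calls bracket the observed frequency 7/20 but give different endpoints, which is the "some choice" in interval construction. When every trial is a success (m = n), Clopper-Pearson returns an interval anchored at 1 on the high end.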

His main philosophical problem was how to avoid [0, 1] as the most sensible belief.  Why not always use the interval [0, 1] and never be wrong?  Of course there will be a discussion about power, predictive success, boldness, and simplicity.  But for a first cut, the levels of acceptance permit some boldness, not constant caution, so that [0, 1] interval can be avoided.

We already have a few more things to say.  

K(.95) is the corpus of acceptable statements at .95.  These are sentences in a logical language L, whose probabilities w.r.t. K(.95) have lower bounds exceeding .95.  

Sounds circular but it's actually not:  start with meaning postulates (a bachelorette is an unmarried woman, a coed is a college female, someone from Brooklyn is from NY), add observations acceptable through .95-veridical/reliable methods, and start calculating probabilities of each a, P(a).  If order turns out to matter, you can take maximal consistent subsets, or work at K(.95+e) to fill out K(.95).  This is how the LOTTERY PARADOX enters.  
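The way the LOTTERY PARADOX enters can be shown with a toy calculation. This is only a sketch: the 1000-ticket lottery is hypothetical, point probabilities stand in for degenerate intervals, and `accepted` is an illustrative name for the acceptance test.

```python
from fractions import Fraction


def accepted(prob, level):
    """A statement enters the corpus K(level) when its probability exceeds level."""
    return prob > level


n_tickets = 1000              # hypothetical fair lottery with exactly one winner
level = Fraction(95, 100)

# "Ticket i loses" has probability (n-1)/n, the same for every ticket i.
p_ticket_loses = Fraction(n_tickets - 1, n_tickets)
each_accepted = accepted(p_ticket_loses, level)

# The conjunction "every ticket loses" has probability 0, since some ticket wins.
p_all_lose = Fraction(0)
conjunction_accepted = accepted(p_all_lose, level)

print(each_accepted, conjunction_accepted)
```

Each of the 1000 statements is acceptable at .95, yet their conjunction is not. So the corpus cannot simply be closed under conjunction, which is why maximal consistent subsets come up.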

Also, who else uses intervals?  Famously Dempster-Shafer, but also Isaac Levi.  Kyburg liked Levi's stance much better than Dempster-Shafer's.  

Levi's idea was simply that an interval arises because one entertains a set of probability distributions at any one time.  For Levi, this raises questions about holes in the interval, but Kyburg doesn't care.  Take the min and max over the set if you must.  
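A minimal sketch of "take the min and max over the set" follows. The three distributions are made up for illustration, and `credal_interval` is a hypothetical helper name.

```python
def credal_interval(distributions, event):
    """Lower and upper probability of `event` over a set of distributions.

    Each distribution is a dict mapping outcomes to probabilities;
    `event` is a set of outcomes.
    """
    probs = [sum(p for outcome, p in dist.items() if outcome in event)
             for dist in distributions]
    return min(probs), max(probs)


# Three hypothetical distributions over the colour of the next ball drawn.
credal_set = [
    {"red": 0.2, "green": 0.5, "blue": 0.3},
    {"red": 0.4, "green": 0.4, "blue": 0.2},
    {"red": 0.3, "green": 0.3, "blue": 0.4},
]
lb, ub = credal_interval(credal_set, {"red", "green"})
print(lb, ub)
```

With a finite set like this one, only three values of P(red or green) are actually entertained, so the interval between the min and the max has "holes". The min-max summary ignores them.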

Meanwhile, Kyburg treats randomness as absence of knowledge otherwise, while Levi requires explicit knowledge of symmetry before declaring randomness.  This doesn't affect the machinery but is an interesting difference.

Now, some will say point probabilities feign precision where you often have no right.  (This is exactly where the Colvin Diagram spanks the neural net predictor!)

One of the first conversations we had with Jerry Feldman was about objective Bayesians who use maximum entropy to produce their prior probability distributions.  You'll see people talking about ET Jaynes, the WUSTL physicist who was still on campus when i arrived, and Harold Jeffreys, a Brit geophysicist from long ago (also JW Gibbs).  Jerry said, you mean from 0-knowledge other than the mean of a 6-sided biased die toss is 2.0, one can use max entropy to calculate each side's probability to infinite precision?  That was unacceptable to him and still is to me.  
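Jerry's example can be run. Under a mean constraint, the maximum-entropy distribution on faces 1..6 has the form p_i proportional to exp(-lam * i), and lam can be found numerically. The sketch below does this by bisection; the bracket [-50, 50] and the function name are assumptions made for illustration.

```python
from math import exp

FACES = range(1, 7)


def maxent_die(target_mean, iters=200):
    """Maximum-entropy distribution on faces 1..6 with the given mean.

    The solution has the form p_i proportional to exp(-lam * i);
    lam is found by bisection. Requires 1 < target_mean < 6.
    """
    def mean_for(lam):
        weights = [exp(-lam * i) for i in FACES]
        return sum(i * w for i, w in zip(FACES, weights)) / sum(weights)

    lo, hi = -50.0, 50.0          # mean_for is decreasing in lam
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_for(mid) > target_mean:
            lo = mid              # mean too high: need a larger lam
        else:
            hi = mid
    lam = (lo + hi) / 2
    weights = [exp(-lam * i) for i in FACES]
    total = sum(weights)
    return [w / total for w in weights]


probs = maxent_die(2.0)
for face, p in zip(FACES, probs):
    print(face, p)
```

The output is six numbers printed to full floating-point precision, derived from nothing but the single constraint that the mean is 2.0. That is exactly the feature Jerry found hard to accept.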

Sometimes a number has a lot of decimal digits because 1/3, for example, expands that way.  But it's a lot to swallow that one always has infinite precision from ignorance.  It's honestly just better to say here's an interval you might be able to accept when you are already taking .95 to be 1, and willing to suffer the LOTTERY PARADOX through maximal consistent subsets (they should all agree on any inference you try to make).  

Intervals are honest.  Kyburg was at heart an engineer's philosopher, and the real secret to his approach begins in measurement theory.  Measurements have errors.

At the UNCERTAINTY IN AI first meeting, the wonderful question arose:  if you don't like being precise about a point probability like .50000..., how can you be precise about TWO values that bound your interval, [.49999..., .550000] for example?  It's a wonderful question that gets to the heart of representation.  

Kyburg often shrugged it off as a non-issue the way Wittgenstein might have said this was an artifact of language.  But i take the question seriously.  

My feeling is that if those bounds matter to you at that level of precision, you'd better do a sensitivity/robustness study.  If the decision matters at that fine granularity, postpone and try harder.  Like the Bayesian value of information calculation, or the idea that this all sits within a meta-analysis that says don't be too serious about barely cleared thresholds.  

Of course the second-order probability people in the room thought they were superior:  put a whole distribution over the lower bound, and another distribution over the upper bound.  Or just work with the distribution and never derive the interval (the Bayesians, for all their faults, are quite sensible here:  you can derive a point value, you can derive many credal intervals, but really you carry around the distribution at all times).  

I should also say that if sentences in L get rated for probability, thus for acceptance, these ratings such as P(a) = [lb, ub], for any a in L, are in a meta-language of L, ML.  This is important to classical logicians but no great mystery to computer science people who constantly work with L, ML, ML', MML, etc.  

More importantly, L frames the world ontologically:  it defines what features can be expressed.  So when data has to be described, or neural nets require k-dimensional space, Kyburg likewise is a prisoner of L.  

He works on ML, how to talk about probability and inference upon sentences in L, but L is the variable he does not control.  It shows up in the reference class story.  

It's also why philosophers don't like entropy and indifference methods:  is it a Ming, or is it not-Ming?  Must be a 50-50 chance.  Seems fallacious, or at least a source of lots of problems.  Maybe it's a Ming, or a Qing, or neither.  So must be 33-33-33 chance!

Just so we're clear.  We want to do epistemology (how and what do we know?).  But we are constrained by our ontology (how we describe the world).  If you want to talk about epistemology (knowing) unconstrained by ontology (describing), that's another project.  

Logic generally has this issue, as does any formal symbol system.  If you're curious, I write about this in a JOURNAL OF PHILOSOPHY article and the PRINCIPLE OF CHARITY allowing an escape from any normative system that seeks to constrain behavior, without first constraining how behavior is translated into symbols.  There are great stories to tell here about robotic symbol grounding, Kahneman-Tversky claims of irrationality, and decision theory generally, or even freshman logic.  But we can't do those here.

Are we beautiful yet?  

I don't have any problems with this picture of a linear order on corpora of knowledge, indexed by acceptance/confidence level, with probability/belief (he calls it "epistemological probability", which is not a bad phrase actually) interval-valued.  It's not bad, and it allows us to do the rest of the work of epistemology mediated by probability.  

For Hank, the probability determination was his jewel, the 1961 doctoral thesis leading to 1974 LFSI (LOGICAL FOUNDATIONS OF STATISTICAL INFERENCE), a truly impenetrable work.  To me, that's just the stepping stone to the account of SCIENTIFIC THEORY FORMATION which is the real diamond.  The real carat.

So the probability of a, P(a), is determined by finding the reference class.  Yes, Kyburg supposes it is unique, though the Colvin diagram presupposes not. 

In the worst case, the ref class determines a [0, 1] interval because there is so much conflict among candidate reference classes.  

If e is D (or a has property/feature D, so logically D(a)), and all D's have been observed to be A's, P(e is an A) will be an interval anchored at 1 on the high end.  Meanwhile, if e is also a C, and all C's have been observed to be non-A's so far, not-A(e) for all e in C, then that reference class would yield an interval anchored at 0 on the low end.  Fortunately, there is often some agreement among candidate reference classes.  Sometimes there is enough sampling from C&D which resolves the conflict.  (But if all D's are A's and no C's are A's, and each has at least one member, the sample from C&D is empty so far!)

Kyburg doesn't talk about it much, but one has to think there are theoretical restrictions on what are the candidate reference classes for an event.  So perhaps there are only a few salient features, or sampling knowledge is not as complete as it might be these days.  

More likely, someone has done a variance, PCA, or related analysis and narrowed the important features down to a few.  This is our first filter in the Colvin Diagram.  Among k features, 2^k candidate classes are logically in play, but perhaps a small subset of those candidate classes have known sampling data. 

The Reichenbach prescription is to use the most specific/relevant/("narrowest") class about which there are adequate (sized) statistics known.  
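The prescription can be sketched in a few lines: enumerate the 2^k conjunctions of the individual's features, keep those with known and adequate sample sizes, and return the most specific survivors. This is a sketch under stated assumptions: `stats`, the counts in it, and the threshold `min_n` are all hypothetical, and what counts as "adequate" is left to the user.

```python
from itertools import combinations


def candidate_classes(features):
    """All 2^k conjunctions of k features (the empty one is 'everything')."""
    feats = sorted(features)
    for r in range(len(feats) + 1):
        for combo in combinations(feats, r):
            yield frozenset(combo)


def most_specific_classes(features, stats, min_n=20):
    """Most specific candidate classes with known sample size >= min_n.

    `stats` maps a frozenset of features to (m, n): m cases of A among n sampled.
    """
    adequate = [c for c in candidate_classes(features)
                if c in stats and stats[c][1] >= min_n]
    # Keep a class only if no adequate class is strictly more specific.
    return [c for c in adequate if not any(c < other for other in adequate)]


# Hypothetical sampling data for an individual with features C and D.
stats = {
    frozenset(): (500, 1000),
    frozenset({"C"}): (10, 100),
    frozenset({"D"}): (90, 100),
    frozenset({"C", "D"}): (2, 5),      # too few cases to be adequate
}
print(most_specific_classes({"C", "D"}, stats))
```

With the data above, C&D is too thinly sampled, so the prescription returns both C and D, which point in opposite directions. Specificity alone does not settle such a case.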

I remember from undergrad stats class someone asking the prof what is a good sample size.  He turned around and said, "about 20."  How did one know?  This was the "art" of statistical reasoning.  

Of course, one can have need for certain precision and set n so as to achieve this, with some luck, and no conflicts that blow the interval out rather than allowing larger n to narrow it in (different sense of "narrow" from Reichenbach:  he means specificity in the hierarchy of classes; we mean width of interval here).

Hank's original idea is to have a few definitions:  intervals disagree if neither contains the other; a class reflects another if one is a subset of the other, or if it uses some Bayesian construction.  A candidate reference class is dominated if there is another class that reflects it, and the intervals disagree.  
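Here is a simplified sketch of these definitions, not Kyburg's full procedure. It reads "disagree" as neither interval containing the other, handles only the subset case of "reflects" (the Bayesian construction is omitted), and assumes the more specific class is the one that does the dominating. The intervals in the example are invented.

```python
def contains(outer, inner):
    """True if interval `outer` contains interval `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def disagree(i1, i2):
    """Intervals disagree when neither contains the other."""
    return not contains(i1, i2) and not contains(i2, i1)


def undominated(candidates):
    """Candidate classes that survive the subset rule.

    `candidates` maps a frozenset of features to its interval (lb, ub).
    More features means a more specific class, i.e. a subset.
    """
    survivors = {}
    for cls, interval in candidates.items():
        dominated = any(
            cls < other and disagree(interval, other_interval)
            for other, other_interval in candidates.items()
        )
        if not dominated:
            survivors[cls] = interval
    return survivors


# Hypothetical intervals: D says "almost surely A", C says "almost surely not".
candidates = {
    frozenset({"D"}): (0.8, 1.0),
    frozenset({"C"}): (0.0, 0.2),
    frozenset({"C", "D"}): (0.3, 0.6),
}
print(undominated(candidates))
```

Here C&D disagrees with both of its superclasses, so it alone survives. Remove the C&D entry and both C and D survive with badly conflicting intervals, which is the situation where these definitions by themselves do not yield an informative answer.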


Recognizing that this is not enough to produce an undominated reference class with meaningfully informative interval, he invented a cross-product construction that's nifty, but still doesn't get you good numbers when there is conflict in the mix (Chat/AI assures me that you don't always deserve to get good numbers if there is conflict; again, this is the heart of the Colvin Diagram as audit of whether the data are predictive for the query).  That construction is as clear as i could write it up and program it in COMPUTING REFERENCE CLASSES.  That's also a chapter in my dissertation.  Thanks Hank for letting me grab it and go.

Fahiem Bacchus, a postdoc who arrived from Toronto when i was finishing up (on Harold Connamacher's thesis committee!), is correct in his assessment (likely based on the numbers i was getting from initial forays into computing reference classes, likely in ICON, then LISP, then C with Adam Costello) that the Kyburg method of determining the reference class is too timid, and produces good intervals only when there isn't much disagreement.  

Well, Kyburg would say if there is nontrivial disagreement, perhaps you shouldn't get nice intervals (just like AI as i just mentioned).  

But of course this frustrates those with bolder systems, e.g., Dempster-Shafer with their combining rule, Bayesians with their priors that essentially provide weights, weighted sums, out of the blue, and objective Bayesians with their maxentropy that produces max precision from nothing.  

Want a narrow interval from e belonging to C and D, which conflict badly?  Use [0.5000..., 0.5000] in their worlds.  Unacceptable if you are honest, though there are different ideas of what .5 really means.  To Hank, it means if you take those odds, both as house and as guest, you'll almost certainly lose a lot of money over the long run.  His arguments for intervals are in fact grounded in the DUTCH BOOK.

In any case, now we have acceptance levels and probabilities from data.  

As a slight detour, there has always been the edict that reference classes should be formed with intersections (properties conjoined, like C and D), not unions (properties disjoined, like C or D).  

One can get pretty silly sets if one allows unions:  C and D might have nothing to do with one another, and D might in fact be { the coins in my left pocket }, which will be a statistical lark/annoyance/contrivance.  Or even a singleton, D = {d}, which starts putting apples and oranges together:  not what the gods of induction had in mind as a sampling class.  

I always felt that if there are linguistic restrictions, why not gain some power over the whole process by using knowledge (meta-knowledge?) to restrict?  

This is probably why i hated neural networks at first, and admired principal components analysis (PCA) at first.  Oh i like a sigmoid or logistic function a lot more than a linear mixed basis translation, but I like working in 2-space or with 2^k for small k a lot more than i have faith that the neural net will find a way to avoid using irrelevant, but spuriously interesting, features when making its forecasts.  

Prob of winning a game?  What's the length of my skirt next Tuesday?  Out of 4 such games with similar skirt length in the past, the team in Brooklyn won the game 3x.  Irrelevant!  

But it would take a lot of closely related training data for the neural net to make that determination.  And of course the nn might discover a hidden correlation, from which it makes a good prediction, which humans had missed.  It's always possible, and it is the nn's true strength as a predictor (maybe short skirts correlate with Spring, and Brooklyn plays differently in Spring).  

But mostly it's just fooled and the human statistician working with a few tried and true trusty features is not a fool.  

Finally i think we can move to the story of scientific theory formation.  

Kyburg starts with Karl Popper's problem of how data could confirm a universally quantified sentence:  for ALL situations, F=ma.  For all ideal gases, PV=nRT.  For all rational preferences, if A>B and B>C, then A>C.  

You never know if the sun rises tomorrow even if the prior sequence is length z.  And in fact the history of humans who suffered atmospheric obscuration by large asteroid strike or volcanic mega eruption could tell you this.  

This is perhaps the SCANDAL OF INDUCTION, the problem posed by Hume (the "scandal" label itself is usually credited to C. D. Broad, and Hume was writing more about the disagreement of methods than the powerlessness w.r.t. frequentist confirmation of universal laws).  Frankly it's not a question that would keep me up at night because there are a lot of issues in scientific method to think about.  But what it yields is truly beautiful in Hank's system.  

So he supposes that a theoretical law in science is like a meaning postulate in language.  

He takes F=ma and PV=nRT and transitivity of preference as axioms, incontrovertible, infallible:  no grade of probability, like it is accepted, except this time not on observation or by induction -- you just chose it at some point, some moment of aha insight, or weakness of optimism.  You adopt it in order to give it a hearing.  See how you like it.  But it is massively corrigible:  you are willing to correct it by bouncing it out if necessary, or more likely, protecting its accuracy with an auxiliary hypothesis or precondition.  

This is now an accepted idea quite widely.  It goes with Willard Van Orman Quine's rethinking of the analytic vs the synthetic.  It's the idea of the WEB OF BELIEF having a center that is entrenched and stubbornly held, from Quine and Joe Ullian (wrote to Quine 2x, Joe Ullian was on our WUSTL grad-faculty basketball team).  Kyburg was impressed by work from William Craig, or Hartry Field, that looked at the theoretical terms of science differently from the observations.  

Why you adopt one law and not another, just provisionally, is not algorithmic.  

It contains elements of minimal change, paradigm shift resistance (Hank not a huge fan of Thomas Kuhn and scientific revolutions), and conventions.  He thought time's arrow might be a convention, as well as a desire for causal determinism in state-transition sciences like kinematics.  

Mostly he has in mind a fast computer in the background considering every simple possibility, giving each a try, and choosing the best at all times, or at least when you are open to tidying up your axioms (your Ur-corpus).  (See Pat Langley's early work on automatic theory formation in AI, as well as Doug Lenat's.)  I read that INFJ types do this.

Usually this reconsideration of axioms happens because error rates start to become intolerable.  

This is the fundamental insight:  probability informs error in the eternal struggle between predictive power and theoretical simplicity, between fact and belief, between the world as it is and the world as you might try to understand it, between nomology and epistemology.  

If you think Capricorns are hot, and you keep running into Capricorns who are not, your error rates for identifying Capricorns, or your self doubt about hot or not, will force a revision.  

At some point you have to throw away all astrological theorizing because it just does not match the world well enough to be useful.  

If you want to draw a polynomial with v bends (turning points), you need v+2 data points (or more; a line takes 2 points, a parabola 3 for a degree 2 polynomial, but outliers and deviations through jitter will make you want more; i like to say <3 for some reason, like the kids say 6, maybe 7, and i like when the digital clock reads 3:14, but i digress).  

Kyburg prescribes a simple (weighted?) count of the number of sentences (how to individuate sentences, probably infinite!) produced at your working acceptance level based on (1) what can be accepted as highly probable observation, and (2) deduction from those sentences and the meaning postulates, your working theory, under (3) the sad truth that a lot of those deductions create contradictions, hence contractions of the corpus, and concomitant high error rates for observation.  

This works whether you are talking about mass at low speed, preferences for classical music composers, or your putative ability to see a fastball from its rotation.  

This is a beautiful epistemology.  It beats coherentism vs foundationalism, and it finds a place for most of the good ideas people have had over the years in the short history of philosophy of science.

So the secret is to measure your rates of error produced by best-attempt observations under the current meaning postulates (aka language plus scientific theory, in a Quinean philosophical school where that is a natural way of looking at things!).  

You try, but you just cannot make transitivity of preference work for yourself.  So you doubt your ability to have a preference.  Or you timestamp preferences, which change at rates that violate your expectation of persistence.  At some point, this is frustrating, or non-informative (you over-fit, putting a timestamp on every report, and allowing it to change quickly in order to avoid contrary reports).  

You cannot abide this, so you decide to throw away the transitivity.  

This costs you inferences (deductions from the meanings of the language), but at least now you can trust your observations.  At some imperfect level of certainty, .95, this tradeoff is constantly managed the best you can.  
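The tradeoff can be illustrated with a toy tally. This is not Kyburg's actual measure, only a sketch of the bookkeeping: count the preference sentences in the corpus, with and without transitivity as a postulate, and count the contradictions each choice produces. The composers and the function names are placeholders.

```python
def transitive_closure(pairs):
    """All strict preferences deducible from `pairs` by transitivity."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    # Drop "a over a" pairs, which only appear when there is a cycle.
    return {(a, b) for a, b in closure if a != b}


def contradictions(pairs):
    """Unordered pairs for which both 'a over b' and 'b over a' are present."""
    return {frozenset((a, b)) for a, b in pairs if (b, a) in pairs}


def tally(observed, use_transitivity):
    """(number of sentences in the corpus, number of contradictory pairs)."""
    corpus = transitive_closure(observed) if use_transitivity else set(observed)
    return len(corpus), len(contradictions(corpus))


# Hypothetical reports of pairwise preference that happen to form a cycle.
cyclic = [("Bach", "Mozart"), ("Mozart", "Brahms"), ("Brahms", "Bach")]
print(tally(cyclic, use_transitivity=True))    # (6, 3)
print(tally(cyclic, use_transitivity=False))   # (3, 0)
```

With transitivity the corpus doubles in size but every pair is contradicted. Without it you keep only the three reports, and none of them is in doubt. When the reports do not cycle, transitivity buys extra sentences at no cost in error, which is why it is worth adopting until the error rates say otherwise.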

This is the story of rational belief.  Error and predictive power keep struggling, at a working level of precision, and probability based on reference classes of sampling data fills in the story.

Hank thought he wanted to be a political philosopher or social philosopher and got distracted by science and epistemology.  

But I always drew the next step.  

What kinds of discrimination are rational, and what kinds are irrational?  I liked to say if you drive a sketchy car through Clayton after hours, someone is going to inductively infer on the basis of past observation that you might be a problem.  (Happened to me one night!)

But if you shift your reference class, wearing a tie while driving, an obvious subclass defeater, the inference must be defeated by the more specific reference class.  

It is incumbent on observers in a just society to put you in the correct reference class, using all the evidence, especially including those features that are public and shift the propensities.  

If the observer insists on using a general class without the nuance of the conflicting subclass with different propensity, that observer is immoral.  Sociopathic as well as a poor scientist and epistemologist.  Irrational, biased, and probably racist.  

Also, make sure the stats in each class are representative and not biased!  A reference class view of inference from data will set you free.