"…the question, What are the laws of nature? may be stated thus: What are the fewest and simplest assumptions, which being granted, the whole existing order of nature would result? Another mode of stating it would be thus: What are the fewest general propositions from which all the uniformities which exist in the universe might be deductively inferred?" A System of Logic, J.S. Mill, 1843

"…if we knew everything, we should still want to systematize our knowledge as a deductive system, and the general axioms in that system would be the fundamental laws of nature." Universals of Law and of Fact, Frank Ramsey, 1928
"a contingent generalization is a law of nature if and only if it appears as a theorem (or axiom) in each of the true deductive systems that achieves a best combination of simplicity and strength" Counterfactuals ,p73 David Lewis 1973
N. Vereshchagin and P. Vitányi (2004), "Kolmogorov's structure functions and model selection", IEEE Transactions on Information Theory 50(12), 3265–3290.
The quotes from Mill, Ramsey and Lewis above express what philosophers call the "Best System Analysis" of laws (cf. "Laws of Nature", Stanford Encyclopedia of Philosophy).
BSA is probably the most widely accepted answer we have to the question, "What is it to be a Law of nature?" Even so, it is notoriously fraught with unanswered questions. A short list:
Can the account properly distinguish accidental from nomological regularities?
Can it explain the connection between the laws of nature, counterfactuals and dispositions?
Why should we count only generalizations as laws, given that many scientific principles do not obviously take this form? Can't singular statements describing, say, fundamental constants be laws too?
What is the connection between this deductive account of laws and our inductive methods of discovering them?
How does BSA accommodate the existence of probabilistic laws?
What do Mill and Lewis mean when they speak of "simplicity"? Isn't simplicity in the eye of the beholder? If so, how can it be a subjective matter what the laws of nature are? Or, if simplicity is just a measure of the shortness of our sentences, doesn't that make lawhood a matter of what language we happen to speak?
And, anyway, why should we think that the laws of nature must be simple in any sense?
In this post I want to raise different and, I think, more fundamental problems for BSA and provide an alternative theory of lawhood based on Algorithmic Information Theory (AIT). This new theory is precisely captured in the theorem of AIT that appears above. Don't worry if you don't understand it just now. AIT is a recent development and a novelty to most philosophers. Before we are done, I hope to have explained to you what this equation means and to have convinced you of its deep significance.
What is the Best System Theory?
As a first step we need to clarify what the Best System Analysis is. Though the theory is widely accepted it is also widely misunderstood. This is mostly Lewis's fault. Lewis introduces his version of BSA by saying:
Take all the deductive systems whose theorems are true. Some are stronger, more informative than others. These virtues compete: An uninformative system can be very simple, an unsystematized compendium of miscellaneous information can be very informative. The best system is one that strikes as good a balance as truth will allow between simplicity and strength. How good a balance that is will depend on how kind nature is. A regularity is a law if it is a theorem of the best system. David Lewis, Papers in Metaphysics and Epistemology, 1983, pp. 41-2
and this is how the theory is generally glossed. Thus the SEP dutifully reports:
Some true deductive systems will be stronger than others; some will be simpler than others. These two virtues, strength and simplicity, compete. (It is easy to make a system stronger by sacrificing simplicity: include all the truths as axioms. It is easy to make a system simple by sacrificing strength: have just the axiom that 2 + 2 = 4.) According to Lewis, the laws of nature belong to all the true deductive systems with a best combination of simplicity and strength. Carroll, John W., "Laws of Nature", The Stanford Encyclopedia of Philosophy (Spring 2012 Edition), Edward N. Zalta (ed.)
Problems intrude when we ask exactly what is meant by 'strength' here.
When we speak of the "strength" of a candidate set of axioms in a deductive system we can mean one of two things.
Strength-1: How many theorems can be formally derived from the axioms.
Strength-2: How much information is expressed by the axioms.
The first notion, strength-1, is syntactic. We measure it by counting sentences: some set of sentences A1 is a stronger axiom set than A2 if every theorem (formally) deducible from A2 is deducible from A1 but not vice versa.
Strength-2 is a semantic property: we measure it by counting worlds. If A2 is true at every world at which A1 is true, but not vice versa then A1 is more informative than A2. The idea is that the fewer worlds at which a sentence is true the more informative it is.
Lewis clearly intended this second, semantic, reading of 'strength' in the passage just quoted and this is how he is generally understood. Thus Ned Hall tells us:
Lewis takes it that there is some canonical scheme for representing facts about the world. Then any correct representation that makes use of this scheme will have two features: First, it will have a degree of informativeness, determined purely by which possible worlds the representation rules out. So it automatically follows that if one correct representation rules out more possible worlds than a second (i.e., every world in which the second is true is one in which the first is true, but not vice versa), then the first is more informative. There are thus maximally informative representations, made so by being true only in the actual world. Second, it will have a degree of simplicity, determined by broadly syntactic features of the representation. These two factors of simplicity and informativeness then determine an ordering— presumably, partial—among all the correct representations there are, in terms of how well each one balances simplicity and informativeness. Lewis's hope is that the nature of our world will yield a clear winner. Ned Hall, "Humean Reductionism About Laws of Nature", p. 12
The problem is that Lewis had no sooner told this simplicity vs. strength-2 story than he immediately refuted it.
We face an obvious problem. Different ways to express the same content, using different vocabulary, will differ in simplicity. The problem can be put in two ways, depending on whether we take our systems as consisting of propositions (classes of worlds) or as consisting of interpreted sentences. In the first case, the problem is that a single system has different degrees of simplicity relative to different linguistic formulations. In the second case, the problem is that equivalent systems, strictly implying the very same regularities, may differ in their simplicity. In fact, the content of any system whatever may be formulated very simply indeed. Given system S, let F be a predicate that applies to all and only things at worlds where S holds. Take F as primitive, and axiomatise S (or an equivalent thereof) by the single axiom (∀x)Fx. If utter simplicity is so easily attained, the ideal theory may as well be as strong as possible. Simplicity and strength needn't be traded off. Then the ideal theory will include (its simple axiom will strictly imply) all truths, and a fortiori all regularities. Then, after all, every regularity will be a law. That must be wrong. David Lewis, Papers in Metaphysics and Epistemology, 1983, pp. 41-2
This means that simplicity and strength-2 do not conflict and don't need to be traded off. Lewis's argument demonstrates that for any possible world there is a possible language— a "canonical scheme"— in which there is a very short, one-sentence axiom which entails all the truths about the world. Yet we do not want to say that that sentence expresses a law, not just because that would make every regularity a law but also because it would make every truth an immediate consequence of a law, so that every truth would be nomologically necessary.
What to do? Lewis says:
The remedy, of course, is not to tolerate such a perverse choice of primitive vocabulary. We should ask how candidate systems compare in simplicity when each is formulated in the simplest eligible way; or, if we count different formulations as different systems, we should dismiss the ineligible ones from candidacy. An appropriate standard of eligibility is not far to seek: let the primitive vocabulary that appears in the axioms refer only to perfectly natural properties. David Lewis, Papers in Metaphysics and Epistemology, 1983, pp. 41-2
This abandons the semantic criterion of strength in favor of the syntactic one. When we look for laws we are no longer looking for Hall's "maximally informative propositions"; instead we are looking for propositions expressed by sentences that 1) belong to a particular kind of language: a language whose lexicon includes names for perfectly natural properties (cf. Bird, Alexander and Tobin, Emma, "Natural Kinds", The Stanford Encyclopedia of Philosophy (Winter 2012 Edition), Edward N. Zalta (ed.)), hereinafter an "N-language"; 2) have the form of generalizations in that N-language; and 3) belong to the simplest sets of sentences (axioms) from which the maximum number of true sentences of that language can be formally deduced.
Note this does not require that we identify the laws with these sentences, nor does it require that we speak an N-language in order to express laws. We can still regard the laws as language-transcendent propositions and hope to express them in, say, English, provided we can believe that there are some English sentences— however unsimple they might be— that are synonymous with the law-expressing sentences in some N-language. Still, Lewis's theory is "linguistic" in the sense that it requires us to say that what makes such a proposition a law is that some sentence expressing it in some N-language appears in the axioms which best combine the genuinely competing virtues of simplicity and strength-1 in that language. Lewis certainly hoped that strength-1 in such a language would correlate with strength-2, but that hope was not part of his analysis.
Failure to grasp this point is, all by itself, the source of considerable confusion in the literature on laws. (For example, Ned Hall thinks that the central problem for Lewis's analysis is how to reconcile its competing demands of simplicity and informativeness. That problem goes away, or at least looks very different, when one notices that the demands don't compete and that Lewis, in any case, doesn't demand that the laws be maximally informative in the relevant sense.) But we need not dwell on that just now. What is more relevant is that understanding BSA in this way should not at all detract from its intuitive appeal.
God's Problem
That appeal is nicely captured by Helen Beebee's retelling.
So the idea is something like this. Suppose God wanted us to learn all the facts there are to be learned. (The Ramsey-Lewis view is not an epistemological thesis but I'm putting it this way for the sake of the story.) He decides to give us a book— God's Big Book of Facts— so that we might come to learn its contents and thereby learn every particular matter of fact there is. As a first draft, God just lists all the particular matters of fact there are. But the first draft turns out to be an impossibly long and unwieldy manuscript, and very hard to make any sense of — it's just a long list of everything that's ever happened and will ever happen. We couldn't even come close to learning a big list of independent facts like that. Luckily, however (or so we hope), God has a way of making the list rather more comprehensible to our feeble, finite minds: he can axiomatize the list. That is, he can write down some universal generalizations with the help of which we can derive some elements of the list from others. This will have the benefit of making God's Big Book of Facts a good deal shorter and also a good deal easier to get our rather limited brains around.
For instance, suppose that all the facts in God's Big Book satisfy f=ma. Then God can write down f=ma at the beginning of the book, under the heading "Axioms", and cut down his hopelessly long list of particular matters of fact: whenever he sees facts about an object's mass and acceleration, say, he can cross out the extra fact about its force, since this fact follows from the others together with the axiom f=ma. And so on. God, in his benevolence, wants the list of particular matters of fact to be as short as possible — that is, he wants the axioms to be as strong as possible; but he also wants the list of axioms to be as short as possible — he wants the deductive system (the axioms and theorems) to be as simple as possible. The virtues of strength and simplicity conflict with each other to some extent; God's job is to strike the best balance. And the contingent generalizations that figure in the deductive closure of the axiomatic system which strikes the best balance are the laws of nature. Helen Beebee, "The Non-Governing Conception of Laws of Nature", Philosophy and Phenomenological Research, Vol. LXI, No. 3, November 2000, pp. 574-5
This gets Lewis's picture exactly right, provided that we stipulate that God writes his Big Book in an N-language and assume that it is a language we can understand (I do not suggest that Beebee was unaware of this point). This is important because, for this story to get God writing things that look like laws, we must understand that God's problem is not just how to write a shorter book that gives us all the facts. If God's only task was to say the same thing in fewer words— to convey the same information— his best strategy would be to stop writing in the N-language and speak the language Lewis described: the one with the single super-predicate "F". Written in that language the book of facts is just one sentence long, viz. '(∀x)Fx'. The problem with this is not that we frail humans couldn't understand that language. The problem is that even if we could, neither we nor God would regard this axiom as a law of nature, because that would make every truth nomologically necessary.
The task Lewis sets God is not just to express all the facts but to express them in N-language. He must tell the truth about the world by telling us which N-Language sentences are true and he must tell us that as simply and concisely as possible.
As Beebee points out, the device of axiomatization provides one way of doing that. By writing a few general N-language sentences as axioms, God gives us a way to deduce the truth of many others; allowing Him to leave those others out and make the Big Book correspondingly shorter.
Axiomatization does the job of telling us which N-Language sentences are true using fewer sentences. But if that is the job that laws of nature do, we now need to ask whether there is any other, better, way of doing that job.
ZIP
Let us begin by noticing that Beebee's story is, in more ways than one, a bit old fashioned. Why would God try giving us this big unwieldy manuscript? Surely God would know that we all have computers now. He ought to make the Big Book an ebook and send us the file!
Of course, there would still be problems of size. How could he send it to us? Even the digitized Big eBook of Facts would be a very big file, certainly too big to send to us in an e-mail.
God's problem here is one we've all faced even with the small books and documents we write ourselves. How do you get that 10 megabyte chapter to the publisher when the email system only lets you send 5 megabyte attachments? Axiomatizing your documents would be one way to shorten them. But that is not what we do. Instead we compress them. We put them in a ".zip" file. (Nerds who hold that God speaks Linux will insist that it should be a ".gz" file. As we shall see below, it makes no difference.)
Compressing books, or any other kind of data, is a way of squeezing a lot of bits into fewer bits. That 10 megabyte chapter may get turned into a 1 megabyte zip file. Anyone who has the right decompression program can "unzip" that file and get the whole 10 megabyte document back again. Of course they will have to have the unzipping program but, remarkably, the decompressing program may itself be quite short. So even if the mail system won't let you send more than 5 megabytes at a time you may be able to send the zip of your 10 megabyte chapter and the unzip program in the same email. And the same short programs that zip and unzip your chapter can, of course, be used to compress and decompress documents of any size.
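To make this concrete, here is a minimal sketch in Python using the standard zlib module (which implements the same "deflate" compression used by zip and gzip). The file name is hypothetical; any sizable text file will do. Both the compressor and the decompressor are short, general-purpose routines, and the round trip loses nothing:

import zlib

# Read the (hypothetical) 10 megabyte chapter as raw bytes.
with open("chapter.txt", "rb") as f:
    original = f.read()

compressed = zlib.compress(original, 9)   # the "zip"
restored = zlib.decompress(compressed)    # the "unzip"

assert restored == original               # no information has been lost
print(len(original), len(compressed))     # the compressed version is much smaller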
To understand what data compression is we are going to have to learn a bit about Algorithmic Information Theory. But before we start on that we should pause to take inventory of what data compression is not. We commonly say that the zip file contains "all the data" that was in the original, uncompressed document. As we shall see, there is a rigorous sense of "data" in which this is true, but there are important senses in which it isn't:
It is not that a zip file "says" everything that was in the original document but in fewer or shorter words. That can happen: translating, say, the Kritik der reinen Vernunft into English may give you a shorter book. But compression is not a matter of translating a document into some other, less prolix, language.
Nor is a zip file anything like an abridgment or a summary of the original. Abridgment requires throwing some information away, compression need not. More importantly, an abridgment or summary of a document still says something. A zip file does not.
A zip of an English document will likely contain no sentences in any language. The zip of a document is not true in the same possible worlds as the original document. A zip file isn't true or false at all. Indeed, for reasons we shall explore below, the better the compression algorithm you use, the more the contents of the compressed document will approximate purely random data (one symptom of this is illustrated in the sketch just after this list).
Nor, as we shall soon see, is there any sense in which a zip file is a collection of axioms from which the decompressing program "deduces" the contents of the compressed document.
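That last symptom is easy to check for yourself: data that has already been well compressed will not compress any further, which is just how random data behaves. A minimal sketch, again with zlib and a hypothetical text file:

import zlib

with open("chapter.txt", "rb") as f:
    text = f.read()

once = zlib.compress(text, 9)
twice = zlib.compress(once, 9)

# The first pass shrinks the text considerably; the second pass gains
# essentially nothing, because the output of the first pass is already
# close to random.
print(len(text), len(once), len(twice))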
In any case, having come thus far it should start to seem very peculiar that Best System theorists should ever have thought that the "Best System" would be a deductive one….
Axing Axioms
Why peculiar?
Well, notice first that a deductive system is only one particular kind of formal system. A formal system is a collection of rules for manipulating symbols— that is, meaningful objects. What makes a system "formal" is that its rules will not mention those symbols' meaning but will instead describe physical manipulations of symbols keyed only to those symbols' intrinsic physical properties.
Austerely conceived, a deductive system consists of a set of seed symbol strings (the axioms), and a set of transformation rules that describe how to transform and augment those seed strings to produce new strings (the theorems). What makes such a system "deductive" is that there is some interpretive scheme— some way of assigning the strings meanings— such that a) the axiom and theorem strings can all be interpreted as sentences— the sorts of things that are true or false; and b) the rules for transforming strings into strings turn out to be truth preserving.
Deductive systems are a vanishingly small subset of all the formal systems there are. Not all strings of symbols are sentences; manipulations of symbols, even of sentences, do not always produce sentences, and manipulations that do map sentences onto other sentences are not all truth preserving. There are formal systems that generate sentences from seed strings that are not sentences: for example, a grammar is a formal system that starts off with words as its seeds and describes rules for building sentences from them; its "theorems" comprise all the sentences of a language, true and false.
What is special about deductive systems— what has given them their central place in Western thought— is that they are a means of proof. The transformations that lead from axioms to theorems can be interpreted as arguments in which the theorems appear as conclusions; valid arguments, if the system's transformations are truth preserving; sound arguments, if the system's axioms are true.
In this way, deductive systems serve as a kind of epistemological amplifier: they allow us to extend our confidence in the truth of a few seed sentences to an equal confidence in the truth of the many— perhaps infinitely many— theorems of the system. This in turn has seemed important because there seem to be some sentences about whose truth we can be certain without proof. These sentences serve as "axioms" in the formal sense because they are "axiomatic" in the informal sense: they, or the propositions they express, are held to be "obvious", "self-evident", "apodictic", "self-justifying", "a priori", "analytic" ... .
This variety of labels reflects an even larger number of competing philosophical theories about why these sentences enjoy this special epistemic status. Is it because they express propositions we learned in a prior life? Is it because they express relations between ideas, or concepts, or meanings? Or is it that they are the deliverances of a special non-sensory faculty of Reason? Never mind. So long as we can agree that we somehow know that these special sentences are true we can use deductive systems to extend that knowledge to an infinity of other sentences which would otherwise not be "obvious", "self-evident", etc.
Now, what is peculiar about the idea that the Laws of Nature get their status by being— or being expressed by— axioms in formal deductive systems is that no one, at least no one nowadays, thinks that the laws of nature are axiomatic. No one thinks that they are "obvious", "self-evident", "analytic" or anything of the sort. No one thinks that we have a special way/grounds/justification/warrant for believing in the laws of nature which we can use to prove the truth of their consequences. Quite the opposite: in the sort of deductive system Ramsey and Lewis envisage, our confidence in the truth of the axioms will rest entirely on our confidence in the truth of the theorems— that is, on the matters of particular fact that constitute the data for the theory the laws embody.
Remember that Beebee's God doesn't axiomatize His book because He needs to convince us that what it says is true (We already believe Him. He's God!). He axiomatizes only to make the book shorter. Of course, in real life, we don't have a book from God. But we can write volumes and volumes of sentences that record our observations of particular matters of fact and we are altogether more confident in the truth of these observations (e.g. that this swan is white, and that one) than we are of any generalization over them (that all swans are white). Of course, we might think that the fact that a statement like "f=ma" could occur in a tight axiomatization of all our observation sentences is a kind of argument for its truth. And so it is. But it is not a deductive argument.
Formal deduction does two things: it gives us a way of building many sentences from a few sentences and it gives us a way of building arguments for the truth of those sentences. But if we already know what sentences we want to build and are confident in their truth —because we find them in God's Big Book, or because they are direct reports of empirical data—then the argument building won't give us any epistemic advance. In that case we might as well look for other ways of building those sentences which are more efficient because they are not constrained by deduction's requirement that they start from sentences or use transformations that preserve truth.
What we are looking for is not the simplest way to describe the world that makes God's Big Book true. What we are looking for is the simplest description of the book itself. What we are looking for is a data compression algorithm.
The word 'data' as we shall understand it here and in all that follows, does not denote facts about the world but rather sets of sentences, conceived as strings of symbols that may or may not express truths or, indeed, may express nothing at all. A data compression algorithm provides a formal way of describing those strings using shorter strings; "formal", again, in the sense that it ignores what, if anything, those strings might mean.
Your file zipping program operates a compression algorithm that doesn't care if the chapter you feed it is true or false. It can't tell if it is zipping sentences from God's Big Book of Facts or the Big Book of All Empirical Observations or a stream of random gibberish. Nevertheless, I am going to try to convince you that by understanding data compression we can understand something important, and deep, about the laws of nature.
"Objection: How can thinking about the mere formal manipulation of meaningless symbols, tell us why the laws of nature are true?"
Answer: It cannot. But remember that the Best Systems theory doesn't purport to tell us that either. The question that theory is supposed to answer is why— among all the sentences we think are true for whatever reason—we single out particular sentences for the label 'Laws'. More generally, why do we speak of some sentences as nomologically necessary and others as nomologically contingent? And remember that Lewis's answer is that what distinguishes the Law sentences from others is not that they express some special kind of proposition knowable in a special way, but rather that the propositions they express are expressible by sentences which have particular formal properties—simplicity and strength-1— in a special kind of language: an N-language.
The core of Lewis's idea was surely this: if we describe the world in a language whose basic vocabulary describes the "natural" properties—if we speak a language that "carves nature at its joints"—then whatever order and system we find in the world will be reflected in the words we use to describe it. Insofar as we speak such a language and we speak truly, then regularities in reality will be reflected in regularities in the data— that is, in the sentences— that report them. The order of the world, at least such order as we can ever hope to articulate or fathom, will be reflected in the forms of our descriptions of it.
As we shall see, data compression works precisely by exploiting system and regularity— in rigorously definable senses of these terms— in the data. Purely random data is incompressible. Your chapter shrinks when you zip it because it is not random data. Its compressibility is testimony to the underlying orderliness of your thoughts as expressed in your words. Likewise, how compressible God's Big Book is will reflect the orderliness of His thoughts. That "orderliness"—that system of the world—is, I will argue, what we are talking about when we talk about the laws of Nature.
Complexity Simplified
The central notion of Algorithmic Information Theory (for a good introduction see Peter D. Grünwald and Paul M.B. Vitányi, "Algorithmic Information Theory", 2005) is algorithmic complexity (K), usually called "Kolmogorov complexity" but sometimes "Solomonoff complexity" after its (independent) co-discoverers, Andrey Kolmogorov and Ray Solomonoff (cf. Marcus Hutter, "Algorithmic complexity", Scholarpedia, 3(1):2573, 2008).
The complexity of any string is measured by the length (in bits) of the shortest computer program required to generate it. So while these two strings are equally long:
(x1) aaaaaaaaaaaaaaaaaaaaaaaa
(x2) db4GmdfTIk30lOwq0ipv$7mY
the first is less complex than the second because it can be generated by a shorter computer program than the second. For example, the first could be rendered in the Python computer language as
print(24 * 'a')
while the Python program required to produce (x2) (originally generated by my randomly hitting my keyboard) would almost certainly have to be longer than the original string:
print('db4GmdfTIk30lOwq0ipv$7mY')
So the Kolmogorov complexity of string (x2) is greater than (x1), at least in the Python Language.
KPython(x2) > KPython(x1)
This illustrates the connection between complexity and compression. The short program that describes (x1) is a compressed representation of that string: compressibility is algorithmic simplicity.
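Although we can never compute K exactly, any off-the-shelf compressor gives an upper bound on it, and even that crude bound ranks our two strings the right way. A minimal sketch with zlib (the exact byte counts will vary with the compressor, and for strings this short the compressor's fixed overhead matters, but the comparison comes out as expected):

import zlib

x1 = b"a" * 24                        # 24 repeated characters
x2 = b"db4GmdfTIk30lOwq0ipv$7mY"      # 24 keyboard-mashed characters

# Compressed length is only an upper bound on Kolmogorov complexity,
# but the repetitive x1 shrinks while the random-looking x2 does not.
print(len(zlib.compress(x1, 9)), len(zlib.compress(x2, 9)))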
Now one might think that this way of measuring complexity is arbitrary. Doesn't it just depend upon what programming language or computer you happen to use? The astonishing answer is "No".
It is true that functionally equivalent programs in different languages can have different lengths. Thus to get (x1) from a program written in the Basic programming language you would need something like:
FOR I = 1 TO 24 : PRINT "a"; : NEXT I
So the minimum program length of (x1) in Basic is greater than it is in Python.
KBasic(x1) > KPython(x1)
But it is possible to show that while picking one language rather than another may add more or less overhead to the programmatic specification of strings, that overhead has a fixed maximum value which, while it will depend upon the programming languages in question, will not depend on the strings themselves. So there is a number n such that:
For all strings x: |KBasic(x) - KPython(x)| ≤ n
This is the Invariance Theorem and it holds between any two programming languages. (The proof of this remarkable result is itself astonishingly simple: if two programming languages L1 and L2 are universal in the sense that each can be used to write a program for a Universal Turing machine, then it must be possible to write a program in L2 that will translate (compile) any L1 program into an L2 program and then execute it. Thus the length of any L1 program translated into L2 can't be any longer than the length of the shortest L2 compiler for L1 plus the binary length of the L1 program.)
Invariance is the foundation of AIT because it demonstrates that complexity measures something real, something that does not depend upon arbitrary choices of computer programs or programmers. Invariance tells us that our choice of programming language to measure complexity is like our choice of scales to measure weight: the scales may give results in different units—kilos vs. pounds—but they are nevertheless measuring the same real magnitude.
This entitles us to speak of the Kolmogorov complexity of a string in abstraction from any language we might choose to measure it in.
For any programming language L: K(x) = KL(x) + O(1). (Here and in all the equations to follow, the 'O(1)' represents a constant that will depend on the particular programming language (or Turing machine) used to calculate K. You can think of it as measuring the fixed overhead that comes with whatever programming language you choose to measure complexity.)
But if algorithmic complexity is an objective property of strings, what property is it?
A look at (x1) suggests that the answer might be that complexity and simplicity are a matter of repetitiveness or regularity. And it is certainly true that regularity makes for algorithmic simplicity. This is the Minimum Description Length (MDL) principle, a fundamental theorem of computer learning theory (MDL is sometimes described as the formal expression of Occam's razor; cf. the MDL web site):
(MDL) Any regularity in a given set of data can be used to compress the data.
And yet there is more to simplicity than mere regularity. Consider the infinitely long string that expresses pi:

(pi) 3.14159265358979323846264338327950288…
There are no known regularities or repetitive patterns here, and yet the string is not complex because there is a relatively short program that generates it.
So, again, what does algorithmic simplicity measure? The right answer— which may, at first, sound trivial but is, in fact, profound— is that it measures computability. This isn't trivial because computability, in the sense of Turing and Church, is the best account we have of what it means to say that something—anything!—is orderly or systematic. Put another way, it is our best explanation of what we mean when we call something a "system": A system is something whose workings can be described by a partially recursive function; that is, by a Turing machine.
Church and Turing taught us what systematicity is, Solomonoff and Kolmogorov taught us how to measure it.
And yet it is not quite correct to say that Kolmogorov complexity just measures the amount of "orderliness" in a string. Rather, it measures, by summing, the amount of order plus the amount of disorder.
To see why orderliness and disorderliness don't co-vary, compare the infinite string (pi) with a string (pn) that contains just the first n digits of pi, where n is some largish number.
Obviously pn is shorter than pi—infinitely shorter— but pn is more complex than pi. The reason is that the shortest program that computes pn is likely one which takes the form:
Calculate pi to n digits then stop.
That additional stopping instruction will require a program slightly longer than one that just says "Calculate pi". Intuitively, all the orderliness of pi is still embodied in the program; its extra length comes from the need to capture the extra—arbitrary—stopping value. We could get the same effect by taking (pi) or (pn) and randomly swapping out some of its digits for randomly chosen ones. For each digit we swap, the resulting string will become more complex because describing it will require augmenting the pi program to record the position and value of the swapped-in digit.
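To make the point concrete, here is a sketch of one such short program, using Gibbons' well-known "unbounded spigot" algorithm for the digits of pi; the program for (pn) is just this program plus the arbitrary stopping value n:

def pi_digits():
    # Gibbons' unbounded spigot: yields the decimal digits of pi, one by one, forever.
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4*q + r - t < n*t:
            yield n
            q, r, t, k, n, l = 10*q, 10*(r - n*t), t, k, (10*(3*q + r))//t - 10*n, l
        else:
            q, r, t, k, n, l = q*k, (2*q + r)*l, t*l, k + 1, (q*(7*k + 2) + r*l)//(t*l), l + 2

def first_n_digits(n):
    # The program for (pn) is the same program plus a stopping instruction;
    # the extra length it needs grows only like the length of n itself.
    gen = pi_digits()
    return "".join(str(next(gen)) for _ in range(n))

print(first_n_digits(20))   # 31415926535897932384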
We can see the same phenomenon at work in these strings.
aa
aaaaa
aaaaaaaaa
All of them can be generated by a simple program which says to print 'a' n times for varying values of 'n'. In Python:
print('a' * n)
This program captures as much system as these three strings have in common but it isn't sufficient to uniquely describe any of them. To do that we'll have to add a specification of the relevant values of n— respectively 2, 5 and 9— and the resulting programs will each be a bit longer. Moreover, how much longer will be a (log) function of how big n is: it takes one more bit to specify '9' (1001) than '5' (101) so:
K ('aaaaaaaaa')
must be at least 1 greater than
K ('aaaaa')
So these strings of "a"s, though they embody exactly the same amount of system or order, have different complexities because they also embody varying amounts of disorder: our arbitrary choice of n.
This distinction between the order and disorder present in any given string is already evident in our homely example of zip files. The picture is this: the original document gets re-expressed as two things, a decompressing program and a compressed file.
Here the complexity of the data will correspond to the combined length of the compressed data and the decompressing program: the length of the shortest combination of both, in whatever language the decompressing program is written. That combination— sometimes called a "source coding" of the data— will represent the optimal compression of the data in that language. In that optimal source coding the decompressing program will capture all of the order in the data and the compressed data all the disorder. Indeed, MDL tells us that the maximally compressed data must be random because, if it contained any regularity, that regularity could be exploited to compress the data further.
The zip file of your chapter will almost certainly not be an optimal compression of it. Your zipping program is a general purpose compression algorithm which is used because it tends, on average, to achieve pretty good results at compressing the kinds of strings that we typically put on our computers. It will do a good job of shrinking your chapter, but it wouldn't do well with a binary approximation of (pi).
And yet it is possible to show that, at least for any finite string, there must be an optimal compression scheme. That is to say, not just that there must be, in any programming language, a shortest program that generates that data, but also that there must be an optimal characterization that parses the order and disorder present in that string.
In the next section I am going to try to explain why this is so. If you would rather skip to the bottom line, jump ahead to "The System of the World".
Solving God's Problem
A decompression program is a program. This is to say, it is the sort of thing that can be instantiated by a Turing machine. For technical reasons, some of which will soon become clear, AIT favors a particular kind of Turing machine called a "prefix Turing machine". So hereinafter when we are talking about complexity we will, technically speaking, be talking about prefix Kolmogorov complexity.
A prefix Turing machine has an input tape and an output tape and some working tapes in between. The input tape reads only zeros and ones and it moves only one way. The output tape can contain any symbols you like but it too only moves one way as it prints data out. When the machine reads the input tape, it may or may not print something on the output tape and it may or may not halt. We say that a prefix machine Tα generates x given an input p, if and only if that machine would halt after printing x (and nothing else) on the output tape after reading p.
Tα(p) = x
If a machine would output x and stop without reading anything from the input tape we'll write 'ε' to signify an "empty" tape:
Tβ(ε) = x
Because the "language" the prefix machine reads is binary we can feed the machine anything we like expressed in binary. And because any finite binary string we can input to a prefix machine will correspond to binary expression of an integer, the integers enumerate all possible inputs to a prefix Turing machine.
We can think of any Turing machine— or any computer program for that matter— as embodying a compression program if it produces outputs which, measured in bits, are longer than its inputs. One of the merits of prefix machines when thinking about complexity is that we can compare the success of any two of them at compressing a string x simply by asking which of them requires the shortest input string. So if
Tα (p) = x
and
Tβ(q) = x
and the binary length of p is shorter than q.
l(p) < l(q)
then Tα compresses x more than Tβ. This doesn't mean that Tα is a better source coding/compression scheme for x than Tβ, because that comparison does not take account of the complexity of Tα and Tβ themselves. To take that into account we need a standard way to measure program length. We get that by thinking of a universal prefix Turing machine.
A universal Turing machine is a machine that can be programmed to emulate (= simulate = virtualize = implement the same recursive function as = generate the same outputs on inputs as) any other Turing machine. A universal prefix Turing machine will emulate any Turing machine provided that it is given the right program on its input tape. Given that it is a prefix machine that reads binary, that right program will be a binary string.
Suppose then that U is a universal prefix machine and n' is a binary string that happens to program U to behave like Tα. In that case, after U reads n' it will read the rest of what is on the tape and produce the same outputs as Tα.
U(p, n') = Tα(p) = x
So we can measure the size of the compression scheme Tα(p), relative to the machine U, as the length of p concatenated with n':
l(p,n') = l(p) + l(n')
where, again, n' is a binary string that happens to program U to embody the same function as Tα. This directly gives us the measure of the Kolmogorov complexity of Tα when it is implemented on U. So:
KU(x) = l(p,n')
Of course this value depends upon which universal machine U we happen to use. The string n' won't program other universal machines to emulate Tα. But the invariance theorem tells us that while our choice of machine might add or subtract a fixed amount of computational overhead it will not affect the underlying complexity of x.
K(x) = l(p,n') + O(1)
This abstracts away from our arbitrary choice of U.
The concatenation of p and n' is a program which generates x when run on U, but it may not be the shortest program that does that. To describe the shortest such program we need a way to sort, by length, all possible U programs, which is to say— given that U is a universal machine— all possible programs. Thanks to the remarkable properties of prefix machines, we can do precisely that.
Every U input is a binary string, and every binary string expresses an integer. So every integer programs U to behave like some Turing machine and, given that U is universal, for every Turing machine there is some integer that programs U to behave like it (i.e. to compute the same partially recursive function). Accordingly we can pick out every possible Turing machine, that is every possible program, with an enumeration that orders them by length. A standard enumeration looks like this: T1 = ε, T2 = 0, T3 = 1, T4 = 00, T5 = 01, T6 = 10, T7 = 11, T8 = 000, … (here T1 picks out the empty tape ε).
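The correspondence between integers and binary strings in this length-ordered enumeration is itself easily computed; a minimal sketch:

def ith_binary_string(i):
    # The i-th binary string in length-lexicographic order:
    # 1 -> '' (the empty string), 2 -> '0', 3 -> '1', 4 -> '00', 5 -> '01', ...
    # Dropping the leading '1' of i's binary numeral gives exactly this ordering.
    return bin(i)[3:]

for i in range(1, 9):
    print(i, repr(ith_binary_string(i)))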
This enumeration gives us a way of associating a binary string and a Turing Machine with every integer. From now on we can write:
l(Ti)
to express the function which gives the length of the ith string in our enumeration and
Ti(p) = x
to indicate that if we write Ti, prefixed by some string p, onto the input tape of our fixed reference machine U, then U will output x and stop.
We can now define the Kolmogorov complexity of any string x relative to U. It will be the length of the first program in the enumeration that gives us x as an output:

KU(x) = min{ l(Ti) : Ti(ε) = x }
Which, by the invariance theorem, we can generalize to:

K(x) = KU(x) + O(1)
Now if x is any finite string we can be sure that there is some T which outputs x, because there must be at least one way of programming U so that it is functionally equivalent to a program that says "print x'" for any input string x' that is the binary expression of x:

Tprint(x') = x
So for any finite x the complexity of x on U cannot exceed the bit length of x plus the length of this minimal print program. For random strings (and, provably, most strings will be random, though randomness is a complex business) this maximum will also be the minimum length of the U program that generates them; this is what defines Kolmogorov randomness. If x is not random then the length of the shortest program that generates x will be shorter than l(x). How much shorter will measure how compressible x is.
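A sketch of the bound being appealed to here: the "print program" for x is just x itself wrapped in a constant-sized instruction, so its length exceeds l(x) by only a fixed amount, whatever x is.

def print_program(x):
    # The trivial program that outputs x verbatim: "print('...')".
    return "print(" + repr(x) + ")"

x = "db4GmdfTIk30lOwq0ipv$7mY"
program = print_program(x)
print(len(x), len(program))   # the difference is a small constant wrapper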
Compressibility, as we have observed, depends both on how much order there is in x and how much disorder. If there is both order and disorder in x then it must be possible to divide the minimal program that generates x into two smaller components, one programmatic, the other random. Prefix machines allow us to draw this distinction very cleanly.
To see how, suppose that Ty is the first program in our enumeration for which U prints x and halts. Ty gets U to do this all by itself, without further input. But there may be some shorter program in our enumeration, Ti, that doesn't generate x all by itself but would generate x if it were given some input p: so that:
U(p, Ti) = x
Now l(p, Ti) cannot be shorter than l(Ty), since that would mean, contrary to hypothesis, that the string (p, Ti) and not Ty was the shortest program which generated x. But l(p, Ti) could have the same length as l(Ty) if Ty is just the result of prefixing Ti with p. So our picture looks like this.
Ty = (p, Ti)
Ty(ε) = Ti(p) = x
Now there might be no Ti in our enumeration that looks like this. That would mean that x cannot be parsed into orderly and disorderly parts. Either it is a purely random string (that is, all disorder) or it is purely orderly (think of pi). If the former, x will be incompressible and l(Ty) ≥ l(x); if the latter then l(Ty) < l(x).
On the other hand, there might be a Ti that fits this picture. Indeed there might be more than one. The shortest such Ti will be the one which represents the maximal separation of order and disorder in x. It will be a Ti that achieves the minimum in (where i ranges over the integers and p over all finite binary strings):

KU(x) = min{ l(p) + l(Ti) : Ti(p) = x }
In that case we can say that Ti(p) represents an optimal source coding for x, not just because it is the shortest way of generating x but also because it maximally discriminates the orderly and disorderly components of x.
And AIT allows us one final elegance: We have chosen, for illustrative purposes, a universal prefix Turing machine U which, given a string from our enumeration of binary strings as input, emulates a particular Turing machine. But our enumeration itself represents a computable function from the integers onto Turing machines, so it must be that corresponding to U there is another universal prefix machine U' which, given only an integer i as an input, emulates Ti. Think of U' as a machine that, given i, first computes Ti and then emulates Ti on U. The upshot is that U'(i) will instantiate the same function as U(Ti). That means that:
K(i) = l(Ti) + O(1) = l(i)
This entitles us to say that the function expressed by Ti on U, U(Ti), is the algorithmically simplest function that expresses the orderly component of x. Which, in turn, means that, assuming this optimal Turing machine, we can drop the reference to the computational overhead (the 'O(1)').
All of the foregoing assumes that x is finite in length. If x is infinite then there may or may not be a shortest program that generates it. In the infinite case K(x) may have no finite value, though we can never be sure. For Gödelian reasons there is no way of proving that a string is incomputable, so K(x) itself is incomputable over infinite strings. Suffice it to note here that, with those caveats in place, the apparatus we have been discussing so far can be extended to the infinite case. For example, we know there are programs that compute pi, so we can be sure that K(pi) has a finite value.
The System of the World
The foregoing was restricted, for the sake of simplicity, to finite strings. So let us keep it simple and suppose that God's Big Book of Facts is finite. If you think that is unrealistic then think about The Big Book of All True Empirical Observations, which is certainly finite. Regard this book as a string of symbols and call that string 'x'.
We already have some reason to think that there is some order in the world so we can be confident that x is not random. What we have learned from AIT is that we should expect the book to be in some measure orderly or systematic and in some measure disorderly or random. The sum of those two measures is the Kolmogorov complexity of the book, K(x), and that sum is expressed by the equation:

K(x) = l(Ti) + l(p), where Ti(p) = x
Here Ti will be a computer program of finite length written in some computer language (it doesn't matter which). This program will describe the orderly or systematic or computable features of the book. p will be a string of random, incompressible or disorderly binary data. Running the program Ti on the data p will give you the book:
Ti (p) = x = The Book
The program Ti optimally describes the systematic component of The Book. If The Book is The Book of The World, then Ti describes The System of the World.
"The System of the World" is, of course, the title Newton gave to the second volume of the Principia and LaPlace gave to his great work on astronomy. They thought of The System not just as a collection of general truths but as a calculus that would allow them to exactly predict the future from the past.
An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes. Laplace, Pierre Simon, A Philosophical Essay on Probabilities, translated into English from the original French 6th ed. by F.W. Truscott and F.L. Emory, Dover Publications (New York, 1951), p. 4
The 'Ti' in our equation is Laplace's "single formula", a function which takes as its argument the arbitrary and unsystematic data describing the motions and positions of "all items of which nature is composed" at some single moment of time and unfolds the subsequent story of the world.
Newton and Laplace supposed that their Systems would in fact describe many worlds. Given different initial conditions their formula would output descriptions of the different futures The System would require. We may likewise suppose that Ti may be defined over other inputs p', p'', p'''…, each generating a different "book", x', x'', x'''…, each of which describes a possible world. Each of these worlds will be a nomologically possible world relative to ours: a world that obeys the system of our own.
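A toy illustration of that structure, with a made-up one-line "system" standing in for Ti and arbitrary integers standing in for the incompressible inputs p, p', p'': the same short program unfolds a different history from each input.

def system(initial_state, steps=10):
    # A stand-in for Ti: one fixed, simple rule that unfolds a whole
    # history from an arbitrary initial condition. The rule and its
    # constants are invented purely for illustration.
    history = [initial_state]
    state = initial_state
    for _ in range(steps):
        state = (3 * state + 1) % 1000   # the "law" of this toy world
        history.append(state)
    return history

# Different inputs, same "physics": each run is a different "book",
# but all of them obey the same system.
for p in (7, 42, 911):
    print(system(p))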
The laws of nature will be those propositions that are true in every nomologically possible world. The sentences that express the laws of nature are the sentences that express those propositions. Given what we have observed about the connection between regularity and algorithmic simplicity we should not be surprised if these sentences often take the form of generalizations but, as we have also seen, there is more to simplicity than mere regularity. We cannot legislate a priori that only generalizations will express laws. It will depend on what's in the story of the world. (And, of course, we are only pretending for the sake of simplicity that there is just one book of the world and only one N-language. In fact there might be many books and they need not all be commensurable. This has consequences for the structure of science which I will take up in later posts.)
Such is the Computational Theory of the Laws of Nature.
God's language
That said, the foregoing includes an assumption we must address. I have just now supposed that, if Ti is the system of the world and Ti would output some other book x' given some other input p', then that book will describe a possible world. Is this correct?
Remember that a book is just a string of symbols and that it tells us about the world only under some interpretive scheme. We have seen why The Book— the one interpretable as telling the whole truth about the actual world— must have its optimal program Ti. But we haven't shown that whenever that program prints a string and halts, what is on the output tape must be a self-consistent or even meaningful book.
Think of the zip file of your chapter. If you start altering it at random you may end up with a file that won't unzip at all, because you will have created an input for which the decompression algorithm in your unzipping program is undefined. But even if you do manage to randomly create an unzippable file, the uncompressed result is unlikely to be an alternative version of your chapter. It will very likely be gibberish. Might it not be the same for Ti?
There are worries here in two directions. First, suppose that x is written in the language Lx. Lx will comprise a grammar and a semantics. The grammar describes the strings of symbols that belong to the language. The semantics will map each such string onto a set of worlds: the worlds at which the sentence is true. Ex hypothesi, The Book is made up of a set of strings that are well-formed in Lx and which together pick out the actual world. But what guarantee do we have that other outputs of Ti will be well-formed Lx sentences at all?
One answer to this worry is to note that the syntactic rules that define the sentences of Lx will show up as syntactic regularities in x. So it is plausible to suppose that the systematic component of any description of any complete book of the world would also reflect the syntactic regularities of the language in which it is written. The book would, after all, surely contain a huge and representative sampling of well-formed Lx sentences. In which case we might expect that for every p' for which Ti would output some x' and stop, the string x' will be well formed in Lx and hence interpretable. Moreover since, presumably, all the sampled sentences in The Book are logically atomic, so too will be the sentences in x'. And so we might reasonably expect that each x' will be consistent.
As I said, this is plausible but not, so far as I can tell, provable. In any case, whatever comfort there may be in this only raises a more fundamental worry. If Ti does incorporate the syntactic regularities of Lx then it won't just express the regularity in the world but also the regularities of the language Lx.
Now our choice of Lx was not entirely arbitrary; it was, remember, supposed to be an "N-language": a language whose vocabulary, Lewis said, would include the names of the "natural properties". But a vocabulary is not a grammar. A grammar describes the formal regularities which distinguish the meaningful combinations of vocabulary from the meaningless ones. The worry currently on our plate is that even if Ti includes those rules, they are an artifact of the grammar of Lx, not of the world x represents. Our guiding thought was that an N-language would be one that, in the Platonic metaphor, "carved nature at its joints". How to distinguish the joints in Nature from the joints of language?
One answer, in the spirit of Lewis, would be to require that not only must an N-language have names for the natural properties but also that the syntactic categories and combinatorial rules of an N-language must reflect the "natural" categories of things and the connections among them. Thus if you think, as many contemporary philosophers rather blithely assume, that the story of the world should be told by listing n-tuples describing physical magnitudes, vectors and loci, then you will think that the syntax of ordered tuples represents an underlying order in nature that must be reflected in Ti. This path leads to deep metaphysical waters, but no deeper than we are already going to have to navigate if we are going to figure out what "natural properties" are.
There is a simpler course. We can keep merely syntactic regularity out of Ti by not putting any in. We can do that by requiring that an N-language be such that every combination of its primitive symbols will be meaningful. In such a system any string of words or symbols that can appear on the output tape will get mapped to a possible world. Such symbolic systems are actually very common in the world of computing though they are usually not called "languages". They are more commonly called simulations or models.
In the semantics of a simulation, individual symbols are mapped onto items in the world, intrinsic physical properties of the symbols onto intrinsic properties of those items and physical relations among the symbols onto relations among the items they represent. All formally— that is to say physically— possible permutations of the interpreted features of the symbols represent corresponding states and arrangements of the items themselves. A simulation shows you the way the world is or might be. However much the simulation may misrepresent the actual world it always represents some world or other and you cannot simulate a contradiction.
Most of us are most familiar with simulations in the form of video games. The business of designing video games nicely illustrates everything we have learned about AIT. Different games portray different worlds, but many games share a common program—a "game engine". If you give the game engine one set of parameters the output displays a world in which you fight demons and fly on the backs of dragons. Change the input parameters and it's Nazis and Spitfires. Different worlds, but as game programmers say, "same physics".
So let us bring Beebee's story entirely up to date. If God really wants to let us "learn all the facts there are to be learned" he wouldn't bother giving us a Big Book, even a compressed one. He'd give us a simulation of the world. If a picture is worth a thousand words, think how much He could say with 3D graphics!
There are some "It from bit". serious Silas R. Beane, Zohreh Davoudi, Martin J. Savage, "Constraints on the Universe as a Numerical Simulation"thinkers who contend that God has done exactly that.
My thanks to Paul Vitányi for his corrections to an earlier draft of this post. His text (with M. Li), An Introduction to Kolmogorov Complexity and its Applications, Springer-Verlag, New York, 1993, 1997, 2008, is an indispensable reference for anyone interested in AIT.
For more on the philosophic upshots of the Computational Theory see Computation, Laws and Supervenience