Re: colt

From: Bruno Marchal <marchal.domain.name.hidden>
Date: Tue, 13 Aug 2002 14:49:16 +0200

Hi Juergen,

Unfortunately I don't know Solomonoff's work very well.
But I know rather well the work by Putnam, Case and Smith, Royer,
Blum, etc. on inductive inference and identification in the limit,
which is part of colt too (colt = computational learning theory =
theoretical artificial intelligence). A nice and rare book exists on
the subject:

S.Jain, D.Osherson, J.S.Royer, A.Sharma, 1999, Systems that Learn,
MIT Press.


Those works rely heavily on Kleene's second recursion theorem.
In the long version of my thesis, I used such colt results to
motivate the "consciousness theory" on which the reversal psycho/physics
is based (consciousness is seen as the result of instinctive anticipation
of self-consistency). Unfortunately or fortunately (I don't know), the
general use of the second recursion theorem by Solovay in his proof of
the arithmetical completeness of the logics G and G* allowed me to
bypass explicit use of both Kleene's theorem and the colt results.

But I agree with you on the general importance of theoretical
approaches to inductive inference.

For the sake of others, let me just define here what an inductive
inference machine (IIM) is in the Blum, Case and Smith approach. (Some
people tend to think it is impossible to define such a concept!)

First let us define what it means to say that a machine identifies
a phenomenon. Let us (with comp) define a phenomenon by a total
(everywhere defined) computable function.
We will say that a machine M identifies a phenomenon f if, when
we present the input-output pairs of f to M, one at a time but in *any*
order, M outputs (Godel numbers of) programs and eventually converges
on a program for f. (That is, from some point on the machine always
outputs the same program, and that program computes f.)
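
For the reader who likes code, here is a minimal sketch in Python of that
identification-in-the-limit protocol. It is only my own illustration: the
names CANDIDATES, run_iim and ex_success are mine, the "Godel numbers" are
simply indices into a toy list of programs, and a finite-prefix test can
only suggest a limit notion, not capture it.

    import math

    def fib(n):
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    CANDIDATES = [math.factorial, fib]   # toy hypothesis space; an index plays the role of a "code"

    def run_iim(learner, f, presentation_points):
        """Feed the pairs (x, f(x)) to the learner one at a time, in the given
        order, and collect its successive conjectures (indices into a list of
        candidate programs such as CANDIDATES)."""
        seen, conjectures = [], []
        for x in presentation_points:
            seen.append((x, f(x)))
            conjectures.append(learner(list(seen)))
        return conjectures

    def ex_success(conjectures, f, probe_points, candidates, tail=3):
        """EX-style success on this finite prefix: the last `tail` conjectures
        are one and the same index, and that index's program agrees with f on
        the probe points.  (Genuine identification is a limit notion over an
        infinite presentation; a finite test like this is only suggestive.)"""
        last = conjectures[-1]
        return (all(c == last for c in conjectures[-tail:])
                and all(candidates[last](x) == f(x) for x in probe_points))

    # Example: a learner over CANDIDATES that conjectures factorial (index 0)
    # until contradicted, and fibonacci (index 1) afterwards:
    #   learner = lambda seen: 0 if all(math.factorial(x) == y for x, y in seen) else 1
    #   ex_success(run_iim(learner, math.factorial, range(6)),
    #              math.factorial, range(8), CANDIDATES)   # -> True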

At first sight this looks like a rather stupid definition. Indeed *all*
total computable functions are identifiable by some machine. For example,
take the factorial: the IIM which *always* outputs a code for factorial
trivially identifies the factorial!

The interesting notion is the identification of a *class* of programs
or functions. What we have just seen is that any singleton class, like
{factorial}, {fibonacci}, etc., is identifiable.
Is the set {factorial, fibonacci} identifiable? Of course: by dovetailing!
The machine outputs programs for factorial and for fibonacci, computes
them both by dovetailing on the pairs it receives, and stops outputting
one of them after the first discrepancy.
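
Such a two-element-class learner is easy to write down. A sketch (mine, and
simplified: since factorial and fibonacci are total and cheap to evaluate,
plain evaluation replaces the dovetailing, which really matters only when
the candidate programs may be slow or partial):

    import math

    def learn_fact_or_fib(seen):
        """IIM for the class {factorial, fibonacci}: conjecture "factorial"
        until some observed pair contradicts it, then conjecture "fibonacci"
        forever.  On any presentation of either function the conjectures
        stabilise (the two functions differ already at 0: 0! = 1 while
        fib(0) = 0), so this machine EX-identifies the two-element class."""
        for x, y in seen:
            if math.factorial(x) != y:
                return "fibonacci"
        return "factorial"

    # e.g. learn_fact_or_fib([(3, 6), (0, 1)]) == "factorial"
    #      learn_fact_or_fib([(3, 2), (0, 0)]) == "fibonacci"
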
For the same reason, as Blum showed, any recursively enumerable (RE) class
of total functions is easily identifiable. The class being RE, you can
build a dovetailer that runs the candidate programs and tests the members
of the class against the data received. This is really a (non-practical)
dovetailing on the search space.
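
This "identification by enumeration" argument can also be sketched in a few
lines (again my own illustration; the candidates are assumed total and fast
here, so direct evaluation stands in for the dovetailing):

    def learn_by_enumeration(programs, seen):
        """Identification by enumeration for an r.e. class of total functions:
        `programs` enumerates the class, and we conjecture the least index
        whose program is consistent with every pair seen so far.  A wrong
        candidate is eventually contradicted by some pair, while a correct one
        never is, so on a full presentation the conjecture stabilises on a
        correct index: the class is EX-identified."""
        for i, p in enumerate(programs):
            if all(p(x) == y for x, y in seen):
                return i
        raise ValueError("the presentation does not come from the given class")

    # Example with a tiny two-element "enumeration":
    #   import math
    #   learn_by_enumeration([math.factorial, lambda n: n * n], [(3, 9), (2, 4)])   # -> 1
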
Now any reader of [Diagonalisation 1:
http://www.escribe.com/science/theory/m3079.html
Diagonalisation 1 (answers) http://www.escribe.com/science/theory/m3344.html ]
knows that the set R of all codes of total computable functions is not
RE. So the natural question is: does there exist a machine capable of
identifying every total computable function? Putnam showed that such a
universal learning machine does not exist.

Case and Smith then looked at whether a weakening of the identification
criterion can help. Suppose we allow the machine to converge on a program
computing the phenomenon f, but allow that program to make 0 or 1 mistake.
Well, if the "or" is non-constructive, this really enlarges what can be
learned.
"EX" is the name of the collection of sets of functions identifiable by a
machine, and EX1 is the corresponding name for the weakened criterion.
EXi: the same, where we allow 0 or 1 or 2 or ... or i errors in the final
output of the IIM (with a non-constructive "or": with a constructive "or"
the errors could always be repaired!).
EX*: the same, but we allow a finite yet indeterminate number of errors.
Case and Smith showed that EX is strictly included in EX1, which is strictly
included in EX2, etc. The union of all the EXi is itself strictly included
in EX*.
Alas! R is still not in EX*.
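
To make the weakened criterion concrete, here is the kind of correctness
test it amounts to (my own sketch, restricted to a finite probe set):

    def correct_with_anomalies(program, f, probe_points, i):
        """EX_i-style correctness of the final conjecture: the program the IIM
        converges on may disagree with the phenomenon f on at most i inputs
        (for EX*, read: on at most finitely many inputs).  With i = 0 this is
        the plain EX correctness requirement."""
        anomalies = sum(1 for x in probe_points if program(x) != f(x))
        return anomalies <= i
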
But look! Let us allow the IIM to produce infinitely many distinct programs
(still one at a time), and let us weaken the identification criterion by
just saying that the IIM identifies f if, after some time, every program it
outputs computes f. (The programs may all be different; this is not the
same as eventually always producing the same program.) Let BC be the
collection of sets of functions identifiable by a machine in this new way.
The crazy result is that EX* is included in BC. And, with the obvious
notation, Case and Smith have shown that BC is strictly included in BC1,
itself strictly included in BC2, etc. Here also the union of all the BCi
is strictly included in BC*.
And (the cherry on the cake): R is included in BC*. In principle, there is
a universal learning machine. But this result is highly theoretical, full
of unavoidable non-constructive existence proofs ...
Still, it is very enlightening for showing how subtle cleverness (measured
by the size of the identifiable classes of functions) is.
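
The contrast between EX and BC can be phrased as two different success tests
on the conjecture sequence. A finite-prefix sketch of the BC test, using the
same convention as in the first code sketch above (conjectures are indices
into a list `candidates` of programs):

    def bc_success(conjectures, f, probe_points, candidates, tail=3):
        """BC-style success on a finite prefix: the last `tail` conjectures may
        all be *different* indices, as long as each of their programs agrees
        with f on the probe points.  Compare with ex_success above, which also
        demands that the tail be one single repeated index: every EX-success
        is a BC-success, but not conversely."""
        return all(all(candidates[c](x) == f(x) for x in probe_points)
                   for c in conjectures[-tail:])
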
There is also the NON-UNION theorem of Blum and Blum. In a nutshell,
the theorem says that there is something *much* more (even non-computably
more) clever than one anticipating machine! What? Answer:
two machines! And then 3 machines, 4 machines, etc.

The reason why I do not need to elaborate on this learning-machine
theory is that I only need the fact that G* is anticipable by the G
machine, and this follows trivially from the decidability of G*, which
follows from Solovay's work.
But I agree with Juergen: there are plenty of beautiful and
entertaining results in colt. There are a lot of identification
criteria, like EX and BC above, but also PEX (Popperian identification),
OCCAM-style identification, etc.

Bruno




------------------------------
Original message by Juergen

>Physics is an inductive science. You try to find a law that matches
>the data, and hope that it generalizes on unseen data.
>
>When asked about the nature of their inductive business, most
>physicists refer to Popper's ideas from the 1930s (falsifiability
>etc), and sometimes to Occam's razor: simple explanations are better
>explanations.
>
>Most physicists, however, are not even aware of the fact that there
>is a formal basis for their inductive science, provided by the field
>of computational learning theory (COLT), in particular, the theory of
>universal induction.
>
>The contents of the following COLT paper may be old news to some
>on this list.
>
>------------------------------------------------------------------
>
>
>The Speed Prior: a new simplicity measure yielding near-optimal
>computable predictions (Juergen Schmidhuber, IDSIA)
>
>In J. Kivinen and R. H. Sloan, eds, Proc. 15th Annual Conf. on
>Computational Learning Theory (COLT), 216-228, Springer, 2002;
>based on section 6 of http://arXiv.org/abs/quant-ph/0011122 (2000)
>http://www.idsia.ch/~juergen/speedprior.html
>ftp://ftp.idsia.ch/pub/juergen/colt.ps
>
>
>Solomonoff's optimal but noncomputable method for inductive
>inference assumes that observation sequences x are drawn from a
>recursive prior distribution mu(x). Instead of using the unknown
>mu(x) he predicts using the celebrated universal enumerable prior
>M(x) which for all x exceeds any recursive mu(x), save for a
>constant factor independent of x. The simplicity measure M(x)
>naturally implements "Occam's razor" and is closely related to the
>Kolmogorov complexity of x. However, M assigns high probability to
>certain data x that are extremely hard to compute. This does not match
>our intuitive notion of simplicity. Here we suggest a more plausible
>measure derived from the fastest way of computing data. In absence
>of contrarian evidence, we assume that the physical world is generated
>by a computational process, and that any possibly infinite sequence of
>observations is therefore computable in the limit (this assumption is
>more radical and stronger than Solomonoff's). Then we replace M by the
>novel Speed Prior S, under which the cumulative a priori probability
>of all data whose computation through an optimal algorithm requires
>more than O(n) resources is 1/n. We show that the Speed Prior allows
>for deriving a computable strategy for optimal prediction of future
>y, given past x. Then we consider the case that the data actually
>stem from a nonoptimal, unknown computational process, and use
>Hutter's recent results to derive excellent expected loss bounds for
>S-based inductive inference. Assuming our own universe is sampled
>from S, we predict: it won't get many times older than it is now;
>large scale quantum computation won't work well; beta decay is not
>random but due to some fast pseudo-random generator which we should
>try to discover.
>
>Juergen Schmidhuber http://www.idsia.ch/~juergen