Re: Questions on Russell's "Why Occam" paper

From: Russell Standish <r.standish.domain.name.hidden>
Date: Mon, 6 Jun 2005 10:21:15 +1000

On Fri, Jun 03, 2005 at 04:22:07PM -0700, "Hal Finney" wrote:
> Russell Standish recently mentioned his paper "Why Occam's Razor" which
> can be found at http://parallel.hpc.unsw.edu.au/rks/docs/occam/ . Among
> other things he aims to derive quantum mechanics from a Schmidhuber type
> ensemble. I have tried to read this paper but never really understood it.
> Here I will try to ask some questions, taking it slowly.
>
> On this page, http://parallel.hpc.unsw.edu.au/rks/docs/occam/node2.html ,
> things get started. Russell describes a set of infinite bit strings he
> calls "descriptions". He writes:
>
> "By contrast to Schmidhuber, I assume a uniform measure over these
> descriptions -- no particular string is more likely than any other."
>
> This surprises me. I thought that Schmidhuber assumed a uniform measure
> over bit strings considered as programs for his universal computer. So
> what is the "contrast" to his work?

Nowhere in Schmidhuber (1997) does he propose a measure over the "input
programs". What he does is justify the appearance of a universal
prior over the set of descriptions by passing the raw data through a
reference UTM. Presumably the set of descriptions is sampled uniformly
by the observer.

In Schmidhuber (2000), the set of descriptions is generated by a
machine which has resource bounds. This leads to the notion of "speed
prior" which differs from the universal prior in several important
respects. I sometimes refer to the two different ensembles as
Schmidhuber I and Schmidhuber II.

I am beginning to regret calling the all-descriptions ensemble with
uniform measure a Schmidhuber ensemble. I think what I meant was that
it could be generated by a standard dovetailer algorithm, running for
2^{\aleph_0} timesteps. However, as the cardinality of "my" ensemble is
actually c (the cardinality of the real numbers), it is quite probably a
completely different beast. It is also not generated by a program,
Schmidhuber style; it simply is (in the sense of being the simplest
set - equivalent to nothing).

>
> It seems that the greater contrast is that while Schmidhuber assumed that
> the bit strings would be fed into a computer that would produce outputs,
> Russell is taking the bit strings directly as raw data.
>

Quite true.


> But I am confused about their role.
>
> "Since some of these descriptions describe self aware substructures..."
>
> Whoa! This is a big leap for me. First, I am not too happy that mere bit
> strings have been elevated with the title "descriptions". A bit string on
> its own doesn't seem to have the inherent meaning necessary for it to be
> considered a description.

Many apologies for deploying terminology in a different way than you
expect. A description (in my terminology) does not necessarily have
meaning. It is simply data. This is in accord with how I use the term
in casual English too - a description is simply a string of
letters, and may or may not be meaningful. Meaning is attached by an observer.

Now an observer will expect to find a SAS in one of the descriptions
as a corollary of the anthropic principle, which is explicitly stated
as one of the assumptions in this work. I make no bones about this - I
consider the anthropic principle a mystery, not self-evident as many
people do. Why should an observer expect to see a token of erself
embedded in reality? That is the mystery of the AP.

> And now we find not only that the bit string is
> a description, but it is a complex enough description to describe SAS's?
> How does that work?
>

The bitstrings are infinite in length. By reading enough bits, an
observer can attach arbitrarily complex meanings to them.

> It's especially confusing to read the introductory word "since" as though
> this is all quite obvious and need not be explained. To me it is very
> confusing.
>

Sorry for not going slowly enough. The habits of concise expression are
hard to shake.

> The page goes on to identify these SAS's as observers. Now they are mappings,
> or equivalently Turing Machines, which map finite bit strings to integers.
> These integers are the "meanings" of the bit strings.

Not equivalently. Not all maps can be represented by a Turing machine,
only computable ones.

>
> I believe the idea here is that the bit strings are taken as prefixes
> of the "description" bit strings in the ensemble. It is as though the
> "observers" are observing the "descriptions" a bit at a time, and mapping
> them to a sequence of integer "meanings". Is that correct?
>

Indeed, that is one interpretation. The most important point is that
the observer map is a "prefix" map, in the sense of the prefix machines of
Algorithmic Information Theory. Reading a bit string one bit at a
time, once a meaning is attached to the string, that is its meaning
forevermore - the observer cannot change er mind after reading a few
more bits.
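
To make the prefix-map idea concrete, here is a minimal Python sketch. The
lookup table of prefixes and meanings is purely my own invented example (it
is not the observer map of the paper); the point is only that the observer
consumes bits one at a time and, the moment a recognised prefix appears,
commits to that meaning no matter what the later bits are.

    import itertools
    import random

    # Toy prefix-free table: no entry is a prefix of another entry.
    # These particular prefixes and meanings are invented for illustration.
    MEANINGS = {
        "0":  1,   # descriptions starting 0...  -> meaning 1
        "10": 2,   # descriptions starting 10... -> meaning 2
        "11": 3,   # descriptions starting 11... -> meaning 3
    }

    def observe(bits):
        """Read bits one at a time; once a known prefix has been read,
        return its meaning and the number of bits consumed.  The meaning
        is never revised by later bits."""
        prefix = ""
        for b in bits:
            prefix += b
            if prefix in MEANINGS:
                return MEANINGS[prefix], len(prefix)

    # An "infinite" description, generated lazily one random bit at a time.
    description = (random.choice("01") for _ in itertools.count())
    print(observe(description))   # e.g. (2, 2): only finitely many bits were read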

Schmidhuber (2000) deals with machines that do "change their mind", so
perhaps there is some extension possible in this direction.

> So here is another confusion about the role of the "description" bit
> strings in the model. Are they things that "observer" TM's observe and
> map to integers? Or are they places where observers live, as suggested
> by the "Since" line quoted above? Or both?
>

All that is discussed in this paper is appearances - we only try to
explain the phenomenon (things as they appear). No attempt is made to
explain the noumenon (things as they are), nor do we need to assume
that there is a noumenon. Bruno Marchal has a detailed discussion on
this in his thesis, and concludes that he "has no need for this
hypothesis" (what he calls the extravagant hypothesis).

So the former statement is true: the descriptions are things that
"observer" TMs observe and map to integers. It is also true that
descriptions of self-aware observers will appear within the description,
by the Anthropic Principle; the phenomenon of observerhood is included.
However, where the observers actually live is not a meaningful question
in this framework.

> Now it gets a little more complicated: "Under the mapping O(x), some
> descriptions encode for identical meanings as other descriptions, so
> one should equivalence class the descriptions."
>
> The problem I have is, O takes only finite bit strings.

Only a finite number of bits are significant, but the maps still operate
on infinite bitstrings.

> So technically
> a "description", which is an infinite bit string, does not encode a
> meaning.

Yes.

> What I think is meant here, though, is that two "descriptions"
> (i.e. infinite bit strings) will be considered equivalent if for every
> finite prefix of the strings, the O() mapping is the same.

Yes

> So if we
> think of O as "observing" the "description" bit strings one by one,
> it will go through precisely the same sequence of integer "meanings"
> in each case. Is that right?
>

I think you're saying that O(x) has a unique value for any given value
of x. In which case you're right.

> "In particular, strings where the bits after some bit number n are
> ``don't care'' bits, are in fact equivalence classes of all strings that
> share the first n bits in common."
>

The set of strings corresponding to a given meaning is an equivalence
class of descriptions, yes.

> I think what this considers is a special O() and a special string prefix
> such that if O sees that particular n-bit prefix, all extensions of
> that prefix get mapped to the same "meaning" integer. In that case the
> condition described in my previous paragraph would be met, and all
> strings with this n-bit prefix would be equivalent.
>

Yes. This is the condition that O(x) is a prefix map.

> "One can see that the size of the equivalence class drops off
> exponentially with the amount of information encoded by the string."
>
> That seems a little questionable because the size of the equivalence class
> is infinite in all cases. However I think Russell means to use a uniform
> measure where the collection of all strings with a particular n-bit
> prefix have a measure of 1/2^n. It's not clear how well this measure
> really works or whether it applies to all sets of infinite strings.
>

Maybe I'm being sloppy. Size here means measure, not cardinality. I
can appreciate your confusion.

The uniform measure works, because the set of all descriptions can be
mapped one-to-one onto the real interval [0,1], except for a set of
points of measure zero (rationals with power-of-2 denominators)
where two descriptions map to the same real number.

Just as the uniform measure works with the reals (and allows integral
calculus to work), the uniform measure works with descriptions. The
measure of the equivalence classes O^{-1}(n) also follows in a
straightforward fashion from the uniform measure.
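
Spelled out in symbols (the notation r(x) and \mu here is mine, not a
quotation from the paper): a description x = b_1 b_2 b_3 ... corresponds to
the real number

    r(x) = \sum_{i \ge 1} b_i 2^{-i}  \in  [0,1],

and the set of all descriptions sharing a given n-bit prefix p maps onto an
interval of length 2^{-n}, so its uniform measure is

    \mu(\{x : x_1 x_2 ... x_n = p\}) = 2^{-n}.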

> "Under O(x), the amount of information is not necessarily equal to the
> length of the string, as some of the bits may be redundant."
>
> Now we have this new concept of "the amount of information" which has
> not previously been defined. This sentence is really hard for me.
> What does it mean for bits to be redundant?

This sentence is part of the hand-waving "furniture" - it's meant to
make things easier for the reader, not harder. Most people have an
intuitive notion of what redundancy means. Please ignore it if that
makes you feel better.

Later on I do define the amount of information, or complexity to use its
proper name, as {\cal C}_O(x) = -\log_2 P_O(O(x)), and appeal to the
coding theorem as a way of relating it to the Kolmogorov complexity K(x).
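
As a worked instance of that definition (the numbers are just an
illustration, not drawn from the paper): if a meaning s is reached after
exactly n significant bits, and no other prefix maps to s, then

    P_O(s) = 2^{-n},   so   {\cal C}_O(x) = -\log_2 P_O(O(x)) = -\log_2 2^{-n} = n bits.

If a second, redundant prefix of length n+2 also maps to s, then
P_O(s) = 2^{-n} + 2^{-(n+2)} = (5/4) 2^{-n}, and the complexity drops
slightly, to n - \log_2(5/4) \approx n - 0.32 bits.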

> We just discussed strings
> where all those after bit n are "don't care", but this sentence seems
> to be envisioning other kinds of redundancies.
>

Indeed.

> "The sum P_O(s) = [sum over p such that O(p)=s of] 2^(-|p|)
> where |p| means the number of bits of p consumed by O in returning s,
> gives the size of the equivalence class of all descriptions having
> meaning s."
>
> Boy, that's a tough one now. We consider all bit strings p such that O(p)
> = s. Now, is this supposed to just be those cases described earlier where
> the bits after |p| are "don't care" bits? Or is it all strings p such
> that O(p) = s, including all extensions of a string where we "don't care"
> after some point? If the latter, it seems problematical because there
> would be an uncountably infinite number of such p's. Also I think the sum
> would not converge in that case. So I think it is probably the former.
>

You have indeed hoisted me here. Well done! Of course the mistake is
not really serious, and can be patched up by taking the former
meaning you suggest, or merely by noting that the measure of the set
with a common prefix of length |p| is precisely 2^{-|p|}, and then
the sum is over all such subsets of descriptions.
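
Written out, the patched version would read something like (my rendering,
not a quotation from the paper):

    P_O(s) = \sum_{p : O(p) = s} 2^{-|p|},

where p now ranges over the minimal finite prefixes on which O returns s
(equivalently, over the subsets of descriptions sharing such a prefix, each
of measure 2^{-|p|}). Because these prefixes form a prefix-free set, Kraft's
inequality gives \sum_s P_O(s) \le 1, so each such sum converges.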

> But then, what if there is a string p such that O(p)=s, but if we append
> anything to p, either p+0 or p+1, then O of those strings is not s?
> Does that p add to the equivalence class or not?
>

Remember O(x) is a prefix map. That cannot happen.

> And is it the case that for all O and for every "description" bit string,
> there will be some prefix beyond which all the bits are "don't care"?
> Would that follow from the finiteness of O? If not, how do these
> "descriptions" feed into the measure?
>

If there are no "don't care" bits, the description contributes zero
measure. This is not a problem, so long as there is at least one class
of strings with don't care bits.

> And I'm a little concerned about the appearance of 2^(-|p|) here without
> much motivation.

Perhaps making the summation index p run over subsets rather than
descriptions would help here, as I mentioned above. The subsets
correspond to finite-length bitstrings (the "prefixes"), and |p|
corresponds to the length of the prefix.
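
Continuing the toy observer from the sketch above (again, the table of
prefixes is my own invented example), the sum over prefix subsets becomes a
small, manifestly convergent sum:

    import math

    # A variant of the earlier invented prefix-free table; "110" is a
    # redundant encoding of meaning 2.
    MEANINGS = {"0": 1, "10": 2, "110": 2, "111": 3}

    def P_O(s):
        """Measure of the equivalence class with meaning s:
        the sum of 2^(-|p|) over the prefixes p that map to s."""
        return sum(2.0 ** -len(p) for p, m in MEANINGS.items() if m == s)

    def C_O(s):
        """Complexity (in bits) of a description with meaning s."""
        return -math.log2(P_O(s))

    print(P_O(2), C_O(2))   # 0.375 and about 1.415 bits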

>
> "The quantity C_O(x) = - log2( P_O( O(x) ) )
> is a measure of the information content, or complexity of a description x."
>
> Now I am becoming confused about whether O() is supposed to apply
> to the infinite bit strings that are "descriptions" or only to their
> finite prefixes. I thought it was the latter. I don't see how a TM
> can come up with a meaningful value on an infinite bit string. It can
> never read in the whole input if it is infinite in size.
>

It does not need to.


> "If only the first n bits of the string are significant, with no
> redundancy, then it is easy to see C_O(x)=n."
>
> That part does make sense, if in fact the definition of P_O(s) was
> only with reference to "description" strings which were "don't care"
> after some number of bits. And I think the part about "no redundancy"
> means there are no other bit string prefixes which O maps to the same
> value as it maps the n-bit prefix of x.
>
> The page then goes on to make some comments about measure applied to
> universes. Here again I am confused about how to relate it to all that
> has been described. What are the analogs of universes, in this model?
> Is it "descriptions", the infinite bit strings? From what has been
> presented so far, I don't understand how to relate our experience of
> reality to this model.
>

Each description is a possible universe, composed of an infinite
amount of information. Any observer will of course only comprehend a
finite amount of information, and hence be in a superposition of a subset
of universes corresponding to that finite information. Admittedly the
usage of the term "universe" is slightly strange here.

Alternatively, one could talk about "observer moments" as
corresponding to the equivalence classes of descriptions. This
interpretation would be more natural to many here on this list.

In section 3 of the paper, I introduce a temporal dimension, with the
observer repeatedly sampling the set of all descriptions, subject to the
proviso that successor states can only differ by a finite number of bits.
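
A very rough toy of that evolution rule, purely to illustrate "successors
differ in a finite number of bits" (the fixed-length window and the bound
on flips per step are my own simplifications, not the construction in the
paper):

    import random

    def successor(prefix, max_flips=2):
        """Return a new prefix differing from the old one in at most
        max_flips randomly chosen positions - a finite number of bits."""
        bits = list(prefix)
        for i in random.sample(range(len(bits)), k=random.randint(0, max_flips)):
            bits[i] = "1" if bits[i] == "0" else "0"
        return "".join(bits)

    state = "0110100111"          # the observer's current significant bits
    for t in range(5):
        state = successor(state)
        print(t, state)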

> That's it for the first page. Hopefully these questions will help to
> show where I am getting confused.
>
> Hal Finney

Cheers

-- 
*PS: A number of people ask me about the attachment to my email, which
is of type "application/pgp-signature". Don't worry, it is not a
virus. It is an electronic signature, that may be used to verify this
email came from me if you have PGP or GPG installed. Otherwise, you
may safely ignore this attachment.
----------------------------------------------------------------------------
A/Prof Russell Standish                  Phone 8308 3119 (mobile)
Mathematics                         	       0425 253119 (")
UNSW SYDNEY 2052         	         R.Standish.domain.name.hidden             
Australia                                http://parallel.hpc.unsw.edu.au/rks
            International prefix  +612, Interstate prefix 02
----------------------------------------------------------------------------


