Questions on Russell's "Why Occam" paper

From: Hal Finney <hal.domain.name.hidden>
Date: Fri, 3 Jun 2005 16:22:07 -0700 (PDT)

Russell Standish recently mentioned his paper "Why Occam's Razor" which
can be found at http://parallel.hpc.unsw.edu.au/rks/docs/occam/ . Among
other things he aims to derive quantum mechanics from a Schmidhuber type
ensemble. I have tried to read this paper but never really understood it.
Here I will try to ask some questions, taking it slowly.

On this page, http://parallel.hpc.unsw.edu.au/rks/docs/occam/node2.html ,
things get started. Russell describes a set of infinite bit strings he
calls "descriptions". He writes:

"By contrast to Schmidhuber, I assume a uniform measure over these
descriptions -- no particular string is more likely than any other."

This surprises me. I thought that Schmidhuber assumed a uniform measure
over bit strings considered as programs for his universal computer. So
what is the "contrast" to his work?

It seems that the greater contrast is that while Schmidhuber assumed that
the bit strings would be fed into a computer that would produce outputs,
Russell is taking the bit strings directly as raw data.

But I am confused about their role.

"Since some of these descriptions describe self aware substructures..."

Whoa! This is a big leap for me. First, I am not too happy that mere bit
strings have been elevated with the title "descriptions". A bit string on
its own doesn't seem to have the inherent meaning necessary for it to be
considered a description. And now we find not only that the bit string
is a description, but that it is complex enough to describe SAS's?
How does that work?

It's especially jarring to read the introductory word "since", as
though this were all quite obvious and needed no explanation. To me it
is anything but obvious.

The page goes on to identify these SAS's as observers, modeled as
mappings (or equivalently, Turing machines) from finite bit strings to
integers. These integers are the "meanings" of the bit strings.

I believe the idea here is that the bit strings are taken as prefixes
of the "description" bit strings in the ensemble. It is as though the
"observers" are observing the "descriptions" a bit at a time, and mapping
them to a sequence of integer "meanings". Is that correct?
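
If that reading is right, here is a toy sketch of the picture I have
in mind. Everything below (the function name, the particular choice of
mapping) is my own invention for illustration, nothing from the paper:

    def observe(description_bits):
        """Map each finite prefix of a (conceptually infinite) bit
        stream to an integer 'meaning' -- a toy stand-in for O()."""
        prefix_value = 0
        for bit in description_bits:
            prefix_value = (prefix_value << 1) | bit  # extend prefix
            yield prefix_value  # 'meaning' of the prefix seen so far

    # 'Observing' the first four bits of some description:
    print(list(observe([1, 0, 1, 1])))  # [1, 2, 5, 11]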

So here is another confusion about the role of the "description" bit
strings in the model. Are they things that "observer" TM's observe and
map to integers? Or are they places where observers live, as suggested
by the "Since" line quoted above? Or both?

Now it gets a little more complicated: "Under the mapping O(x), some
descriptions encode for identical meanings as other descriptions, so
one should equivalence class the descriptions."

The problem I have is, O takes only finite bit strings. So technically
a "description", which is an infinite bit string, does not encode a
meaning. What I think is meant here, though, is that two
"descriptions" (i.e. infinite bit strings) are to be considered
equivalent if O() returns the same value on every corresponding finite
prefix of the two strings. So if we think of O as "observing" the
"description" bit strings one bit at a time, it will go through
precisely the same sequence of integer "meanings" in each case. Is
that right?

"In particular, strings where the bits after some bit number n are
``don't care'' bits, are in fact equivalence classes of all strings that
share the first n bits in common."

I think what this considers is a special O() and a special string prefix
such that if O sees that particular n-bit prefix, all extensions of
that prefix get mapped to the same "meaning" integer. In that case the
condition described in my previous paragraph would be met, and all
strings with this n-bit prefix would be equivalent.
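
Here is a small sketch of the reading I have just described. The
observer that only "cares" about the first 3 bits is my own toy
construction, and an infinite string can of course only be checked out
to a finite depth:

    def O(prefix):
        """A toy observer that only 'cares' about the first 3 bits:
        every extension of a 3-bit prefix gets the same meaning."""
        cared = prefix[:3]
        return int("".join(str(b) for b in cared), 2) if cared else 0

    def equivalent_to_depth(d1, d2, depth):
        """My reading of the equivalence: O agrees on every finite
        prefix (checkable only up to a finite depth, of course)."""
        return all(O(d1[:n]) == O(d2[:n]) for n in range(1, depth + 1))

    # Two 'descriptions' sharing their first 3 bits, differing after:
    d1 = [1, 0, 1] + [0] * 20
    d2 = [1, 0, 1] + [1] * 20
    print(equivalent_to_depth(d1, d2, 23))  # True: tail is 'don't care'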

"One can see that the size of the equivalence class drops off
exponentially with the amount of information encoded by the string."

That seems a little questionable, because the size of the equivalence
class is infinite in all cases. However, I think Russell means to use
a uniform measure in which the collection of all strings with a
particular n-bit prefix has measure 1/2^n. It's not clear how well
this measure really works, or whether it extends to all sets of
infinite strings.
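
Concretely, what I have in mind is the standard coin-flipping measure
on infinite sequences, where the set of all strings extending a given
n-bit prefix (a "cylinder set") gets measure 2^-n. Whether this is
what Russell intends is my guess:

    def cylinder_measure(prefix):
        """Uniform ('coin-flipping') measure of the set of all
        infinite strings extending a finite prefix: 2^-len(prefix)."""
        return 2.0 ** (-len(prefix))

    print(cylinder_measure([1, 0, 1]))       # 0.125 = 1/2^3
    # Additivity check: a cylinder splits into its two children.
    print(cylinder_measure([1, 0, 1, 0])
          + cylinder_measure([1, 0, 1, 1]))  # 0.125 again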

"Under O(x), the amount of information is not necessarily equal to the
length of the string, as some of the bits may be redundant."

Now we have this new concept of "the amount of information" which has
not previously been defined. This sentence is really hard for me.
What does it mean for bits to be redundant? We just discussed strings
where all the bits after bit n are "don't care", but this sentence
seems to be envisioning other kinds of redundancy.

"The sum P_O(s) = [sum over p such that O(p)=s of] 2^(-|p|)
where |p| means the number of bits of p consumed by O in returning s,
gives the size of the equivalence class of all descriptions having
meaning s."

Boy, that's a tough one now. We consider all bit strings p such that
O(p) = s. Now, is this supposed to be just those cases described
earlier where the bits after |p| are "don't care" bits? Or is it all
strings p such that O(p) = s, including every extension of a string
past the point where we "don't care"? If the latter, it seems
problematic: there are infinitely many such p's and the sum would not
converge, since the 2^k extensions of p of length |p|+k contribute
2^(-(|p|+k)) each, so every length adds another full 2^(-|p|). So I
think it is probably the former.
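
On that former reading the sum runs over a prefix-free set of minimal
prefixes, and it converges. A toy computation, in which the observer
table and all the names are my own invention:

    # A toy observer as a lookup table: prefix consumed -> meaning.
    # The prefixes form a prefix-free set: none extends another.
    O_table = {
        "1":   7,   # after consuming '1', O halts with meaning 7
        "01":  7,   # after consuming '01', O also halts with meaning 7
        "001": 3,
    }

    def P_O(s):
        """My reading of P_O(s): sum 2^-|p| over the minimal
        prefixes p on which O halts with meaning s."""
        return sum(2.0 ** (-len(p))
                   for p, m in O_table.items() if m == s)

    print(P_O(7))  # 0.5 + 0.25 = 0.75: two prefixes mean 7
    print(P_O(3))  # 0.125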

But then, what if there is a string p such that O(p)=s, but if we append
anything to p, either p+0 or p+1, then O of those strings is not s?
Does that p add to the equivalence class or not?

And is it the case that for all O and for every "description" bit string,
there will be some prefix beyond which all the bits are "don't care"?
Would that follow from the finiteness of O? If not, how do these
"descriptions" feed into the measure?

And I'm a little concerned about the appearance of 2^(-|p|) here without
much motivation.

"The quantity C_O(x) = - log2( P_O( O(x) ) )
is a measure of the information content, or complexity of a description x."

Now I am becoming confused about whether O() is supposed to apply
to the infinite bit strings that are "descriptions" or only to their
finite prefixes. I thought it was the latter. I don't see how a TM
can produce a meaningful value from an infinite bit string; it can
never finish reading an infinite input.

"If only the first n bits of the string are significant, with no
redundancy, then it is easy to see C_O(x)=n."

That part does make sense, if in fact the definition of P_O(s) refers
only to "description" strings that are "don't care" after some number
of bits. And I think the part about "no redundancy" means there are no
other bit-string prefixes which O maps to the same value as the n-bit
prefix of x.
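
Continuing the same toy observer table as above (again entirely my
own construction), both cases come out the way I would expect:

    import math

    O_table = {"1": 7, "01": 7, "001": 3}  # prefix -> meaning

    def P_O(s):
        return sum(2.0 ** (-len(p))
                   for p, m in O_table.items() if m == s)

    def C_O(prefix):
        """C_O(x) = -log2(P_O(O(x))), on a halting prefix of x."""
        return -math.log2(P_O(O_table[prefix]))

    print(C_O("001"))  # 3.0: no redundancy, C_O equals bit count n
    print(C_O("1"))    # ~0.415: redundancy pushes C_O below |p| = 1

So redundancy lowers the information content below the raw number of
bits consumed, which is how I read the sentence about redundant bits.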

The page then goes on to make some comments about measure applied to
universes. Here again I am confused about how to relate it to all
that has been described. What are the analogs of universes in this
model? Is it the "descriptions", the infinite bit strings? From what
has been
presented so far, I don't understand how to relate our experience of
reality to this model.

That's it for the first page. Hopefully these questions will help to
show where I am getting confused.

Hal Finney