Phoneme Encoding
13-OCT-1999
Consider the following superficially unrelated conflicts:
- Oral vs. Visual Traditions:
Ever since the development of language, human beings
have been telling stories orally and listening to
stories told by others. Our brains have evolved to process
complex signals received through the ears and interpret
these as meaningful abstract information as well as pleasing
music or annoying noise. Audio information reaches our
emotional centres much more directly than visual information,
which must be processed by the visual cortex before we decide
what to make of it. Our visual cortex is highly developed
for the purpose of detecting motion (in our peripheral vision)
or recognizing patterns (in the central focus); visual fields
hold our attention well, but the visual cortex did not evolve
as a conduit for abstract meaning.
In spite of this, for the last few hundred years
(since the invention of the printing press)
and especially for the last few decades
(since the invention of television)
we have been taking in most of our "stories"
through the visual cortex.
It may well be that we have lost something in the process.
- Mass-Market Publishing vs. Literature:
Authors who have not published a best seller recently
might as well forget about making a living as writers,
because beleaguered publishing houses are more and more
frequently taking their orders directly from big bookstore
chains, whose focus is exclusively on the "bottom line."
Like the bookstores themselves, authors have begun to
"phase separate" into giant successes and starving unknowns.
This has always been true to some extent, but the polarization
is more acute than ever today as the habit of reading books
becomes rarer and rarer.
- Anonymous Text vs. Signed Art:
The Web offers a wonderful medium for the free dispersal
of information and ideas. This is great. However,
as more and more text becomes Web-accessible,
the assurance that an author will receive due credit for his or her work
is steadily eroded, whether "credit" be taken to mean
financial remuneration or simply appreciation and recognition.
If I put an essay on the Web, anyone who can read it can also
capture the text on their own computer, present it as their own,
or modify it enough to raise doubts about its true origin and
then attempt to "own" it through hopelessly confused copyright laws.
As a teacher, I cannot prevent an unprincipled student from
simply piecing together a paper from fragments of text extracted
here and there. [But I can always tell, so don't you dare!]
Meanwhile, however, authentic signed artwork is very difficult
to "steal" simply because the physical medium is part of its
essence and copying takes more skill than simply recognizing
and reproducing the symbolic content.
Wouldn't it be nice if literature had this quality?
- Uninflected Text vs. Natural Language Recognition:
Those who do research in artificial intelligence (AI)
have advised me that natural language recognition
will "always" be beyond the capacity of computers
because an understanding of the precise meaning of
a given sentence depends so strongly on context -
which often extends to include the whole of human experience!
[I have always found this argument silly; I don't expect you
to understand the precise meaning of what I say, only to make
a good guess. Why should I ask more of a computer?]
But what AI researchers usually feed their computers is
written text, from which a large fraction of the information
in spoken language has been "projected out" onto uninflected
character strings. It is a much more challenging task to
imagine what a writer would like us to hear than to just
listen to a speaker. Wouldn't it be nice if there were a way
to feed the same richness of information into the computer
in natural language recognition experiments and still retain
the compact encoding of written language? Furthermore,
wouldn't it be nice if computers could generate
realistic spoken language instead of the caricatures
of human speech produced by most existing software?
- Compression vs. Efficient Representation:
Of course, it is possible to record spoken language and even to
compress it spectacularly using some of the new schemes such as MP3.
But even with compression it still takes a lot more bytes
to store a sentence as an audio signal than as written text.
Wouldn't it be nice if the additional storage were all
used to store additional information?
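To put rough numbers on that last item, here is a back-of-envelope sketch in Python. Every figure in it is an assumption chosen for illustration, not a measurement: a speaking rate of 150 words per minute, about 6 bytes of plain text per word, and MP3 audio at 128 kbit/s.

```python
# Back-of-envelope comparison: one minute of speech as text vs. as MP3 audio.
# All three constants are illustrative assumptions, not measurements.
WORDS_PER_MINUTE = 150          # assumed speaking rate
BYTES_PER_WORD = 6              # assumed: ~5 letters plus a space, 1 byte each
MP3_BITS_PER_SECOND = 128_000   # assumed MP3 bit rate


def text_bytes_per_minute() -> int:
    """Bytes needed to store one minute of speech as plain text."""
    return WORDS_PER_MINUTE * BYTES_PER_WORD


def mp3_bytes_per_minute() -> int:
    """Bytes needed to store one minute of speech as MP3 audio."""
    return MP3_BITS_PER_SECOND // 8 * 60


if __name__ == "__main__":
    text = text_bytes_per_minute()
    audio = mp3_bytes_per_minute()
    print(f"text: {text} bytes/min")
    print(f"MP3:  {audio} bytes/min")
    print(f"ratio: about {audio // text} to 1")
```

Under these assumptions the audio takes roughly a thousand times the space of the text. The question raised above is how much of that extra space carries information the text lacks.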
Now consider some encouraging factoids:
- Talking Books:
Most libraries now carry tape recordings of people reading books.
You can take these home and listen to a book while doing
routine chores or relaxing in bed. This goes a long way toward
overcoming the "three B's" barrier to electronic books
(which I first heard described by Spider Robinson):
you can't conveniently take them on the Bus, to the Bathroom
or into Bed, so they will never replace paperbacks.
However, one cassette tape only holds about an hour's worth
of reading, so you must either carry around several tapes
or admit defeat by the three B's.
- Cheap Chips:
Fortunately there are new media "out there" that can store
many hours' worth of audio in a device considerably smaller
than pocket-sized. As these become more popular,
economies of scale will make them cheaper - perhaps
one day cheaper than a paperback book, whose price
has been rising almost as precipitously as that
of electronics has been falling!
- Unicode:
Most "modern" programming languages, like Java,
are now in the process of "dumping" the old ASCII
character set (a throwback to the old days of scarce memory
and low bandwidth, in which only 128 characters can be readily
defined) in favour of Unicode, which (as used in Java) takes
2 bytes for each character. (That's 65,536 possible
characters, or "64K", folks!)
- Phonemes:
All human language can supposedly be constructed from
a modest number of "phonemes" (sounds) - certainly
far fewer than 64,000.
- MIDI:
For years now, music buffs have been making practical use
of the fact that it takes only a small amount of information
to describe a single musical note: length, pitch, loudness,
attack, decay and overtone content are the obvious ones (to me)
and these can each be represented with adequate precision
in a limited number of bits. The quality of music generated
from good MIDI files seems more dependent upon the patience
and skill of the encoder and the capabilities of the synthesizer
than on any intrinsic limitations of the encoding scheme.
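The arithmetic behind the Unicode and Phonemes items is just powers of two: n bits give 2^n distinct codes. A minimal Python check follows (the function name `code_space` is made up for this sketch):

```python
def code_space(bits: int) -> int:
    """Number of distinct codes representable in the given number of bits."""
    return 2 ** bits


if __name__ == "__main__":
    for label, bits in [("7-bit ASCII", 7), ("one byte", 8), ("two bytes", 16)]:
        print(f"{label}: {code_space(bits)} codes")
```

So a two-byte code has room for 65,536 values, far more than the modest number of phonemes mentioned above.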
So what's the big idea, already?
- Phoneme Encoding:
Like MIDI for music, there is probably a speech encoding scheme
involving several bytes for the phoneme plus a byte for each of
loudness, length, pitch, inflection,
attack and decay, that will, when coupled to
a modest synthesizer and the right software,
produce what sounds very much like spoken human language.
In fact, with a few more bytes per phoneme, enough idiosyncrasies
can be included to mimic not only various "accents" but also
the unique characteristics of a specific person's voice.
- Research Opportunities:
If this proves to be true, imagine what one might learn from
studies of different voices and what makes them different.
Not only might we learn to understand "voice" better
through the simple expedient of having a model
and a representation for the phenomenon,
but a major barrier to AI research would be removed.
- Model Independence:
However, doing phoneme encoding does not require
that we understand "voice" - given an appropriate combination
of hardware and software, we can run speech through the encoder
and produce a compact byte stream that can later be run back through
a synthesizer to reproduce the same speech.
- Language Independence:
Although speech in one language cannot be directly
phoneme-encoded into another language
(so translation will remain an important profession),
the phoneme encoding system doesn't care which
language it is encoding; we all use the same phonemes!
It thus plays approximately the role of a
universal alphabet, except of course that
it bypasses the visual cortex completely.
- Satisfying the 3 B's:
Once competition has had a chance to refine the technology,
I envision devices of roughly the same dimensions as a
credit card (only several times thicker) containing a
rechargeable battery, a solid state memory of around 128 MB,
a miniature audio amplifier and the connectors for a
power source (and recharger), a set of stereo headphones
and a computer interface - probably a PCMCIA plugin for
the early versions.
Plug this "electronic storyteller" into your PC,
download a new "story" from the Web and it can be
a different "book" every day.
Once economies of scale have had their effect,
this device (which has no moving parts)
should sell for less than a floppy disk drive (which does).
When you can buy a used one for $20.00 it will hardly
be worth stealing, at which point it competes successfully
with the paperback book for versatility and convenience.
In fact, it wins handily; you cannot take a paperback
jogging or for a bike ride!
- Sell Your Own Book:
Authors are now free to experiment with a new form of "publication" -
having written a new manuscript, they can read it into the
phoneme encoder in their own voice
(an excellent way to clearly identify the source!)
or, if they prefer, hire a professional reader to narrate the story
for them. Better yet, they can bypass the written word completely
and perform their own "radio dramas" with a few friends - the
possibilities are endless. Then they can put the result on
their own Web site (an easy thing to obtain these days) and
sell it directly to "listeners" as "shareware."
("If you like this story, please send a contribution of $xx.xx to . . .")
Yes, most listeners will send nothing. But how many would you
need to match or exceed the typical royalties that publishers
now offer the authors of books they sell? Hmmmmm?
- Revival of Oral Tradition:
The effects on human culture are, of course, unpredictable;
but it is hard to see a downside, except for the parasites. . . .
-
-
-
You get the idea.
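For the concretely minded, the record described under "Phoneme Encoding" might look something like the following Python sketch. It is only an illustration under stated assumptions, not a real format: the two-byte phoneme code, the field order, the name `PhonemeRecord`, the rate of 12 phonemes per second and the MP3 bit rate are all hypothetical choices made for the sketch.

```python
import struct
from dataclasses import dataclass, astuple

# Hypothetical layout: a 2-byte phoneme code followed by six 1-byte
# parameters, big-endian, no padding. Not an existing standard.
_LAYOUT = struct.Struct(">H6B")


@dataclass(frozen=True)
class PhonemeRecord:
    phoneme: int     # 0..65535, index into some agreed phoneme table
    loudness: int    # each remaining field is 0..255
    length: int
    pitch: int
    inflection: int
    attack: int
    decay: int

    def pack(self) -> bytes:
        """Serialize the record into 8 bytes."""
        return _LAYOUT.pack(*astuple(self))

    @classmethod
    def unpack(cls, data: bytes) -> "PhonemeRecord":
        """Rebuild a record from the 8 bytes produced by pack()."""
        return cls(*_LAYOUT.unpack(data))


def hours_of_speech(capacity_bytes: int, bytes_per_second: float) -> float:
    """Hours of audio that fit in capacity_bytes at a given data rate."""
    return capacity_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    rec = PhonemeRecord(phoneme=42, loudness=200, length=80, pitch=120,
                        inflection=10, attack=30, decay=60)
    blob = rec.pack()
    assert PhonemeRecord.unpack(blob) == rec   # lossless round trip
    print(f"{len(blob)} bytes per phoneme")

    capacity = 128 * 2**20                 # the 128 MB memory envisioned above
    phoneme_rate = 12 * _LAYOUT.size       # assumed 12 phonemes per second
    mp3_rate = 128_000 / 8                 # assumed 128 kbit/s MP3
    print(f"phoneme stream: {hours_of_speech(capacity, phoneme_rate):.0f} hours")
    print(f"MP3:            {hours_of_speech(capacity, mp3_rate):.1f} hours")
```

Under these assumptions a 128 MB memory holds a few hundred hours of phoneme-encoded speech against a couple of hours of MP3. Whether eight bytes per phoneme is actually enough to sound human is exactly the open question.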
So take this idea and get rich. Just don't try to "own" it
and prevent others from making use of it, or I will come for you. . . .
Jess H. Brewer
Last modified: Tue Oct 19 01:20:19