Phoneme Encoding

13-OCT-1999

Consider the following superficially unrelated conflicts:

Oral vs. Visual Traditions: Ever since the development of language, human beings have been telling stories orally and listening to stories told by others. Our brains have evolved to process complex signals received through the ears and interpret these as meaningful abstract information as well as pleasing music or annoying noise. Audio information reaches our emotional centres much more directly than visual information, which must be processed by the visual cortex before we decide what to make of it. Our visual cortex is highly developed for the purpose of detecting motion (in our peripheral vision) or recognizing patterns (in the central focus); visual fields hold our attention well, but the visual cortex did not evolve as a conduit for abstract meaning. In spite of this, for the last few hundred years (since the invention of the printing press) and especially for the last few decades (since the invention of television) we have been taking in most of our "stories" through the visual cortex. It may well be that we have lost something in the process.
Mass-Market Publishing vs. Literature: Authors who have not published a best seller recently might as well forget about making a living as writers, because beleaguered publishing houses are more and more frequently taking their orders directly from big bookstore chains, whose focus is exclusively on the "bottom line." Like the bookstores themselves, authors have begun to "phase separate" into giant successes and starving unknowns. This has always been true to some extent, but the polarization is more acute than ever today as the habit of reading books becomes rarer and rarer.
Anonymous Text vs. Signed Art: The Web offers a wonderful medium for the free dispersal of information and ideas. This is great. However, as more and more text becomes Web-accessible, the assurance that an author will receive due credit for his or her work is steadily eroded, whether "credit" be taken to mean financial remuneration or simply appreciation and recognition. If I put an essay on the Web, anyone who can read it can also capture the text on their own computer, present it as their own, or modify it enough to raise doubts about its true origin and then attempt to "own" it through hopelessly confused copyright laws. As a teacher, I cannot prevent an unprincipled student from simply piecing together a paper from fragments of text extracted here and there. [But I can always tell, so don't you dare!] Meanwhile, however, authentic signed artwork is very difficult to "steal" simply because the physical medium is part of its essence and copying takes more skill than simply recognizing and reproducing the symbolic content. Wouldn't it be nice if literature had this quality?
Uninflected Text vs. Natural Language Recognition: Those who do research in artificial intelligence (AI) have advised me that natural language recognition will "always" be beyond the capacity of computers because an understanding of the precise meaning of a given sentence depends so strongly on context - which often extends to include the whole of human experience! [I have always found this argument silly; I don't expect you to understand the precise meaning of what I say, only to make a good guess. Why should I ask more of a computer?] But what AI researchers usually feed their computers is written text, from which a large fraction of the information in spoken language has been "projected out" onto uninflected character strings. It is a much more challenging task to imagine what a writer would like us to hear than to just listen to a speaker. Wouldn't it be nice if there were a way to feed the same richness of information into the computer in natural language recognition experiments and still retain the compact encoding of written language? Furthermore, wouldn't it be nice if computers could generate realistic spoken language instead of the caricatures of human speech produced by most existing software?
Compression vs. Efficient Representation: Of course, it is possible to record spoken language and even to compress it spectacularly using some of the new schemes such as MP3. But even with compression it still takes a lot more bytes to store a sentence as an audio signal than as written text. Wouldn't it be nice if the additional storage were all used to store additional information?
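
To put rough numbers on the gap between written text and recorded audio, here is a back-of-the-envelope sketch in Python. The speaking rate and both data rates are assumed values chosen for illustration (CD-quality stereo PCM and a 128 kbit/s MP3 stream), not measurements, and the function names are hypothetical.

```python
# Assumed values, for illustration only.
WORDS_PER_MINUTE = 150                  # assumed typical reading-aloud rate
CD_PCM_BYTES_PER_SEC = 44_100 * 2 * 2   # 44.1 kHz, 16-bit samples, stereo
MP3_BYTES_PER_SEC = 128_000 // 8        # assumed 128 kbit/s MP3 stream


def text_bytes(sentence: str) -> int:
    """Bytes needed to store the sentence as plain ASCII text."""
    return len(sentence.encode("ascii"))


def spoken_seconds(sentence: str, wpm: float = WORDS_PER_MINUTE) -> float:
    """Estimated time to speak the sentence at the assumed rate."""
    return len(sentence.split()) / wpm * 60


def audio_bytes(seconds: float, bytes_per_second: int) -> int:
    """Bytes needed to store that much audio at a constant data rate."""
    return round(seconds * bytes_per_second)


sentence = "Wouldn't it be nice if literature had this quality?"
duration = spoken_seconds(sentence)

as_text = text_bytes(sentence)
as_pcm = audio_bytes(duration, CD_PCM_BYTES_PER_SEC)
as_mp3 = audio_bytes(duration, MP3_BYTES_PER_SEC)

print(f"text: {as_text} bytes")
print(f"PCM:  {as_pcm} bytes")
print(f"MP3:  {as_mp3} bytes ({as_mp3 / as_text:.0f}x the text)")
```

Under these assumptions the compressed audio is roughly a thousand times larger than the text of the same sentence, which is the overhead the question above is asking about.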

Now consider some encouraging factoids:

Talking Books: Most libraries now carry tape recordings of people reading books. You can take these home and listen to a book while doing routine chores or relaxing in bed. This goes a long way toward overcoming the "three B's" barrier to electronic books (which I first heard described by Spider Robinson): you can't conveniently take them on the Bus, to the Bathroom or into Bed, so they will never replace paperbacks. However, one cassette tape only holds about an hour's worth of reading, so you must either carry around several tapes or admit defeat by the three B's.
Cheap Chips: Fortunately there are new media "out there" that can store many hours' worth of audio in a device considerably smaller than pocket-sized. As these become more popular, economies of scale will make them cheaper - perhaps one day cheaper than a paperback book, whose price has been rising almost as precipitously as that of electronics has been falling!
Unicode: Most "modern" programming languages, like Java, are now in the process of "dumping" the old ASCII character set (a throwback to the old days of scarce memory and low bandwidth, in which only 128 characters can be readily defined) in favour of Unicode, which uses a full 2 bytes for each character. (That's about 64,000 different printable characters, folks!)
Phonemes: All human language can supposedly be constructed from a modest number of "phonemes" (sounds) - certainly far fewer than 64,000.
MIDI: For years now, music buffs have been making practical use of the fact that it only takes a finite amount of information to describe a single musical note: length, pitch, loudness, attack, decay and overtone content are the obvious ones (to me) and these can each be represented with adequate precision in a limited number of bits. The quality of music generated from good MIDI files seems more dependent upon the patience and skill of the encoder and the capabilities of the synthesizer than on any intrinsic limitations of the encoding scheme.

So what's the big idea, already?

Phoneme Encoding: Like MIDI for music, there is probably a speech encoding scheme involving several bytes for the phoneme plus a byte for each of loudness, length, pitch, inflection, attack and decay, that will, when coupled to a modest synthesizer and the right software, produce what sounds very much like spoken human language. In fact, with a few more bytes per phoneme, enough idiosyncrasies can be included to mimic not only various "accents" but also the unique characteristics of a specific person's voice.
Research Opportunities: If this proves to be true, imagine what one might learn from studies of different voices and what makes them different. Not only might we learn to understand "voice" better through the simple expedient of having a model and a representation for the phenomenon, but a major barrier to AI research would be removed.
Model Independence: However, phoneme encoding does not require that we understand "voice": given an appropriate combination of hardware and software, we can run speech through the encoder and produce a compact byte stream that can later be run back through a synthesizer to reproduce the same speech.
Language Independence: Although speech in one language cannot be directly phoneme-encoded into another language (so translation will remain an important profession), the phoneme encoding system doesn't care which language it is encoding; we all use the same phonemes! It thus plays approximately the role of a universal alphabet, except of course that it bypasses the visual cortex completely.
Satisfying the 3 B's: Once competition has had a chance to refine the technology, I envision devices of roughly the same dimensions as a credit card (only several times thicker) containing a rechargeable battery, a solid state memory of around 128 MB, a miniature audio amplifier and the connectors for a power source (and recharger), a set of stereo headphones and a computer interface - probably a PCMCIA plugin for the early versions. Plug this "electronic storyteller" into your PC, download a new "story" from the Web and it can be a different "book" every day. Once economies of scale have had their effect, this device (which has no moving parts) should sell for less than a floppy disk drive (which does). When you can buy a used one for $20.00 it will hardly be worth stealing, at which point it competes successfully with the paperback book for versatility and convenience. In fact, it wins handily; you cannot take a paperback jogging or for a bike ride!
Sell Your Own Book: Authors are now free to experiment with a new form of "publication" - having written a new manuscript, they can read it into the phoneme encoder in their own voice (an excellent way to clearly identify the source!) or, if they prefer, hire a professional reader to narrate the story for them. Better yet, they can bypass the written word completely and perform their own "radio dramas" with a few friends - the possibilities are endless. Then they can put the result on their own Web site (an easy thing to obtain these days) and sell it directly to "listeners" as "shareware." ("If you like this story, please send a contribution of $xx.xx to . . .") Yes, most listeners will send nothing. But how many would you need to match or exceed the typical royalties that publishers now offer the authors of books they sell? Hmmmmm?
Revival of Oral Tradition: The effects on human culture are, of course, unpredictable; but it is hard to see a downside, except for the parasites. . . .
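
For the technically inclined, the byte layout suggested under "Phoneme Encoding" above can be sketched in a few lines of Python. This is only one possible layout, not a worked-out standard: the 2-byte phoneme code, the field order, the class and function names, and the assumed rate of 12 phonemes per second are all hypothetical choices made for illustration.

```python
import struct
from dataclasses import dataclass

# Hypothetical record layout: a 2-byte phoneme code followed by one byte
# each for loudness, length, pitch, inflection, attack and decay.
RECORD_FORMAT = ">H6B"                        # big-endian, no padding
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 8 bytes per phoneme


@dataclass(frozen=True)
class Phoneme:
    code: int        # index into some phoneme table (numbering is assumed)
    loudness: int    # each of the remaining fields fits in one byte (0-255)
    length: int
    pitch: int
    inflection: int
    attack: int
    decay: int

    def pack(self) -> bytes:
        """Serialize this phoneme into one fixed-size record."""
        return struct.pack(RECORD_FORMAT, self.code, self.loudness,
                           self.length, self.pitch, self.inflection,
                           self.attack, self.decay)

    @classmethod
    def unpack(cls, record: bytes) -> "Phoneme":
        """Rebuild a phoneme from one fixed-size record."""
        return cls(*struct.unpack(RECORD_FORMAT, record))


def encode(phonemes: list[Phoneme]) -> bytes:
    """Concatenate the records into a compact byte stream."""
    return b"".join(p.pack() for p in phonemes)


def decode(stream: bytes) -> list[Phoneme]:
    """Split a byte stream back into phoneme records."""
    return [Phoneme(*fields)
            for fields in struct.iter_unpack(RECORD_FORMAT, stream)]


def hours_of_speech(capacity_bytes: int,
                    phonemes_per_second: float = 12.0) -> float:
    """Playing time that fits in a memory, at an ASSUMED phoneme rate."""
    bytes_per_second = RECORD_SIZE * phonemes_per_second
    return capacity_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    sample = [Phoneme(17, 200, 40, 128, 3, 10, 20),
              Phoneme(42, 180, 25, 120, 1, 5, 30)]
    stream = encode(sample)
    assert decode(stream) == sample   # the round trip loses nothing
    print(f"{len(stream)} bytes for {len(sample)} phonemes")
    print(f"{hours_of_speech(128 * 2**20):.0f} hours in 128 MB")
```

With these assumed numbers, each phoneme costs 8 bytes and a 128 MB memory would hold a few hundred hours of encoded speech. The real figure depends entirely on how many bytes per phoneme a convincing voice turns out to need.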

You get the idea.

So take this idea and get rich. Just don't try to "own" it and prevent others from making use of it, or I will come for you. . . .

Jess H. Brewer

Last modified: Tue Oct 19 01:20:19