1- Is it possible?
The short answer is: NO! For a
longer explanation, read on and see the illustrations at the end of
this article.
2- The task to do
I am referring here to people who
presumably want to take a recording of a full musical performance
(say, an entire opera, complete with orchestra and all vocals, both
solos and choruses, or a rock band complete with singers). I
inferred this from the fact that we want to convert MP3 and other
types of audio files, which would almost always be some
pre-recorded, and likely copyrighted, full-fledged musical
performance. We hope to get a nice .MUS file out of it that would
already have all tracks very accurately transcribed into staves,
with lyrics in place for Virtual Singer and maybe even with
parameters set up in VS to closely match each individual singer's
voice, inflection, accent, etc. All the instruments would already
be assigned and modified to match the real instruments in the
recording, Digital Reverb effects already applied and tweaked to
match the acoustic space of the original recording, etc. etc. This
is one of the most popular feature requests for a music program. I
agree that it would indeed be cool, but it just isn't possible with
today's technology.
3- Why not possible?
Asking for this would be like asking
for a line-art type graphics program (Illustrator, FreeHand,
CorelDRAW!, Canvas's draw features only, Xara X, etc.) to add a
feature to take a scanned photograph and convert all the objects in
it to line art objects, complete with Bezier curves and handles,
automatically grouping related things together (the wheels of a car
would be grouped with its body, leaves with a tree, facial features
of a person with the rest of the person, etc.). While autotrace
features and programs do exist (Adobe Streamline, for instance),
none of them would be capable of handling photographs and
recognizing the relationships of objects within them. Most just
handle scanned monochrome or, in rare cases, colored line art,
converting the scanned bitmap into real line art in .EPS or some
similar resolution-independent line-art format.
That analogy applies to audio and music because MIDI (and variants
such as .KAR and .ABC), .SEQ, and .MUS files (etc.) are analogous
to .EPS, .EMF, QuickDraw, .AI, .FH, .CNV, .WEB, or other line-art
formats, while uncompressed or losslessly-compressed .WAV or AIFF
or .AU files would be analogous to uncompressed or
losslessly-compressed TIFF/.TIF, .RIFF, .PSD, .BMP, .PNG, etc., and
.MP3, .ASF, .WMA, .RA, or lossily-compressed .WAV, AIFF, etc.
would be analogous to JPEG/.JPG or other lossily-compressed image
formats.
In short, bitmap/pixelmap image file formats (no matter how they're
compressed) and their analogous digital audio formats are direct,
resolution-dependent digitizations of analog information. While
they may appear to a viewer or listener to contain multiple
independent objects or instrument tracks, actually they are just
one image or sound recording (two in the case of stereo audio, and
the layers of a .RIFF or .PSD would count as separate pixelmaps,
but within each the same restriction applies). The computer just
sees them as 1s and 0s that contain only the raw information for
that particular point in the image (e.g. RGB or other color values
of a single pixel) or audio (individual audio amplitude sample),
and need not process them (beyond decompression if they were
compressed) to display or play them in a form that the human brain,
fed by input from eyes or ears, would recognize as being extremely
close to the original analog image or sound source. When you view a
.TIFF of the Mona Lisa, you see a beautiful woman against a
background of clouds and terrain, but the computer just sees rows
of pixels, each pixel having a Red, Green, and Blue (or Cyan,
Magenta, and Yellow) value, each value consisting of eight or
sixteen bits. When you listen to "Beethoven's Fifth Symphony" as a
.WAV file, you hear strings and brass and woodwinds and percussion
playing specific notes at specific rhythms, but all the computer
"hears" is 8-bit or 16-bit audio samples telling the sound card how
to move the speaker cones in time to reproduce the recorded
sound.
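To make the "rows of numbers" point concrete, here is a minimal sketch in Python, using only the standard library. The two "instruments" are plain sine tones chosen as stand-ins, and the sample rate and helper names (`tone`, `mix`, `write_wav`, `read_samples`) are assumptions made for this illustration. It mixes the two tones, stores them as a 16-bit mono WAV in memory, and reads the file back.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000  # samples per second (arbitrary choice for this sketch)

def tone(freq_hz, seconds, amplitude=0.4):
    """Return float samples of a pure sine tone (a stand-in for an instrument)."""
    count = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(count)]

def mix(*tracks):
    """Add the tracks sample by sample, as the air in front of a microphone does."""
    return [sum(values) for values in zip(*tracks)]

def write_wav(samples):
    """Pack float samples as 16-bit mono PCM and return the WAV file as bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(b"".join(struct.pack("<h", int(s * 32767))
                                 for s in samples))
    return buffer.getvalue()

def read_samples(wav_bytes):
    """Read the 16-bit samples back out of the WAV bytes as plain integers."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        raw = wav.readframes(wav.getnframes())
    return list(struct.unpack("<%dh" % (len(raw) // 2), raw))

flute = tone(440.0, 0.01)    # hypothetical "flute" part
violin = tone(660.0, 0.01)   # hypothetical "violin" part
recording = read_samples(write_wav(mix(flute, violin)))

# One list of integers: nothing in it says which part came from which tone.
print(len(recording), recording[:8])
```

What comes back is a single sequence of amplitude values. The file holds no record that two sources were ever involved, which is exactly the information a converter would have to reconstruct.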
But image formats such as .AI and
.EPS, or sound formats such as .MID and .MUS, do not contain actual
image or sound information, but rather the commands needed to
generate them. An .EPS of a sketch of the Mona Lisa would have the
actual curves of the sketch in a format that the computer
understands and can display to the user. The computer can
manipulate each curve independently, even where it overlaps other
curves, without disturbing them. Likewise, a MIDI file of the Fifth
Symphony has tracks or channels for each of the instruments, and
the notes and velocities and other commands for each of those in a
way the computer understands and can manipulate. You could change
the individual notes of an individual instrument without affecting
the other sounds that occur during the same time. You can't do that
with a .WAV or .MP3.
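The difference can be sketched with a toy data structure. This is not the real MIDI file format: the `Note` class, its field names, and the two-track `score` are assumptions invented for this illustration. The point is only that when the music is stored as commands, one instrument can be edited without touching the others.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Note:
    start_beat: float
    length_beats: float
    pitch: int       # MIDI note number, 60 = middle C
    velocity: int    # loudness, 1 to 127

# A hypothetical two-track score: each instrument keeps its own list of notes.
score = {
    "violin": [Note(0, 1, 67, 90), Note(1, 1, 67, 90),
               Note(2, 1, 67, 90), Note(3, 2, 63, 100)],
    "cello": [Note(0, 4, 43, 70)],
}

def transpose(notes, semitones):
    """Return a copy of the notes shifted up or down by some semitones."""
    return [replace(note, pitch=note.pitch + semitones) for note in notes]

# Move the violin up an octave; the cello track is carried over unchanged.
edited = dict(score, violin=transpose(score["violin"], 12))

print([note.pitch for note in edited["violin"]])
print(edited["cello"] == score["cello"])
```

A sampled recording offers no equivalent of `score["violin"]` to operate on, because the violin was never stored separately in the first place.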
4- What exists now
Just as monochrome line-art
autotrace programs exist, so too do monophonic audio-to-MIDI
programs. As with autotracing, most of them aren't very
good even within that mono-source limitation, though a few gems do
shine above the general muck (as Streamline does among autotracers).
The few that try to exceed the mono-source limitation tend to do
even worse jobs than the others.
Conversion is feasible, then, only if the .WAV (or .AIFF or .AU or
.MP3 or RealAudio or Windows Media Audio or QuickTime Audio or any
other audio sampling format, compressed or uncompressed) file in
question is of a solo performance, of one instrument which only
plays one note at a time. It could also be of an individual person
singing, humming, "scatting", whistling, etc. a tune, without any
form of accompaniment (not even rhythm or a metronome).
An inexpensive program called "Digital
Ear" can do a quite decent job of translating such files into a
single MIDI track. Unlike its competition, Digital Ear can track
and respond to not only pitch changes, but also volume and
brightness changes, translating those into appropriate MIDI events
(e.g. volume to MIDI Expression [CC#11] or Breath Controller
[CC#2] Continuous Controller messages, and brightness to Brightness
[CC#74] or Harmonic Content [CC#71, if I remember right]). The
resulting .MID file can, of course, be imported into Melody or
Harmony Assistant, or any other MIDI-compatible program.
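For a sense of what the monophonic case involves, here is a minimal sketch of one textbook approach, autocorrelation pitch detection, in plain Python. It is not how Digital Ear or any other named product works; the function names, the sample rate, the search range, and the 0.9 threshold are all assumptions made for this illustration. It finds the repeat period of a single steady tone and converts the frequency to a MIDI note number with the usual equal-temperament formula (A above middle C = 440 Hz = note 69).

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

def autocorrelation(samples, lag, window):
    """How well the signal matches a copy of itself shifted by `lag` samples."""
    return sum(samples[n] * samples[n + lag] for n in range(window))

def detect_pitch_hz(samples, low_hz=50.0, high_hz=1000.0):
    """Estimate the pitch of ONE steady note; returns None if nothing is found."""
    min_lag = int(SAMPLE_RATE / high_hz)
    max_lag = int(SAMPLE_RATE / low_hz)
    window = len(samples) - max_lag - 1
    if window <= 0:
        raise ValueError("not enough samples for the requested pitch range")
    scores = {lag: autocorrelation(samples, lag, window)
              for lag in range(min_lag - 1, max_lag + 2)}
    best = max(scores.values())
    if best <= 0:
        return None
    # Multiples of the true period match just as well, so take the SHORTEST
    # lag that is a local peak and scores within 90% of the best match.
    for lag in range(min_lag, max_lag + 1):
        if (scores[lag] >= 0.9 * best
                and scores[lag] >= scores[lag - 1]
                and scores[lag] >= scores[lag + 1]):
            return SAMPLE_RATE / lag
    return None

def hz_to_midi(freq_hz):
    """Nearest MIDI note number for a frequency, with A = 440 Hz as note 69."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))

# A tenth of a second of a pure 440 Hz tone stands in for a hummed note.
hummed = [math.sin(2 * math.pi * 440.0 * n / SAMPLE_RATE) for n in range(800)]
print(hz_to_midi(detect_pitch_hz(hummed)))
```

Even this toy needs a guard against picking a multiple of the true period, and it assumes exactly one note is sounding. Tracking vibrato, slides, volume and brightness would take considerably more work on top of it.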
But if you want to be able to take a
.WAV of, say, a rock or jazz band performance, or the Mormon
Tabernacle Choir singing the Hallelujah Chorus with full orchestra
and pipe organ accompaniment, or even a Barbershop Quartet doing a
traditional Tin Pan Alley song, and get all of that translated
nicely into separate music tracks or notation (let alone lyrics!),
then no, that just cannot be done with today's technology, nor is
it likely to be possible with any technology in the foreseeable
future. Some programs claim to be able to handle polyphonic audio,
but in practice they can only handle source files of a single
polyphonic instrument (e.g. a piano or acoustic guitar),
preferably played in an anechoic chamber (to eliminate reverb that
might be confused with additional notes), of mellow instruments low
in harmonics (e.g. a Baldwin grand piano, not a Kawai spinet piano,
or a nylon-stringed guitar, not a steel-stringed one), etc., and
then only if the settings are tweaked exactly right (which is not
at all an easy task), etc. A true polyphonic audio-to-MIDI
converter that actually works is many years down the road, and will
require CPUs many dozens of times more powerful than today's
Pentium 4s or Athlon XPs or PowerPC G4s (or even Itaniums and
Hammers and G5s), as well as better software technology and
algorithms.
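One reason instruments rich in harmonics cause trouble can be shown with a little arithmetic. The sketch below assumes idealized instruments whose overtones are exact whole-number multiples of the fundamental, and equal-temperament tuning with A = 440 Hz. The function names and the 1% tolerance are choices made for this illustration. It lists the overtones of two notes and reports which ones land on (nearly) the same frequency.

```python
def midi_to_hz(note):
    """Equal-temperament frequency of a MIDI note number (69 = A 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)

def harmonics(note, count=8):
    """Frequencies of the first `count` harmonics of an idealized instrument."""
    fundamental = midi_to_hz(note)
    return [fundamental * k for k in range(1, count + 1)]

def shared_partials(note_a, note_b, tolerance=0.01):
    """Pairs of harmonics of the two notes that lie within `tolerance` of each other."""
    pairs = []
    for i, freq_a in enumerate(harmonics(note_a), start=1):
        for j, freq_b in enumerate(harmonics(note_b), start=1):
            if abs(freq_a - freq_b) / freq_b <= tolerance:
                pairs.append((i, j, round(freq_a, 1), round(freq_b, 1)))
    return pairs

# Middle C (60) and the G above it (67), a very common pair of notes.
for i, j, freq_a, freq_b in shared_partials(60, 67):
    print("harmonic", i, "of C at", freq_a, "Hz ~ harmonic", j, "of G at", freq_b, "Hz")
```

For middle C and the G above it, the third harmonic of the C lands almost exactly on the second harmonic of the G. Once both notes sound together, energy at that frequency cannot be credited to one note or the other from the spectrum alone, and an octave is worse still, since every harmonic of the upper note coincides with a harmonic of the lower one.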
5- What about the future?
Will it ever be possible to do what
I described in the first paragraph? As I said, not with typical
computer technology. Neural networks, though, are another matter.
Most small children can hear a piece of complex music and pick out the
people singing words, and hear the individual instruments (or
sections of the same instrument playing the same notes) from the
mix - even if they don't know the names of the instruments, they
can still hear that the tones made by a flute sound very different
from those made by a violin, which in turn are different from those
made by an electric guitar with high distortion or fuzz effects.
Furthermore, the child does this in real time, and doesn't have to
think about performing Fast Fourier Transforms and other complex
math analyses on the analog audio waveforms coming into their
brains via the cochlear nerves in their inner ears in response to
vibrations of their eardrums.
Why then is it so hard for computers? Because computers are linear,
performing tasks in sequence. Tasks like this, though, require more
of a pattern-matching approach, which the human brain excels at.
Neural networks work in a way similar to the brain. Another
possibility is quantum computers (which are molecules - I saw a
photo of a vial containing trillions of such molecules, and it
looked like a tiny vial of over-colored lemon-flavored Kool-Aid,
not at all like what you would expect a computer to look like!),
which are also (theoretically) very good at non-linear tasks.
But those are still well into the
future: it will be at least a decade or two before we have any that
can process any musical audio file and spit out a fully accurate
representation in some command- or object-based format (e.g. MIDI
or .MUS).
A computer based on binary digital von
Neumann/Babbage technology (and that's the sort of computer most
people would recognize as such, whether it's a microcontroller in a
VCR or microwave oven, or a massive supercomputer in the Pentagon,
or anything in between such as a desktop or laptop personal
computer, be it a Windows PC or a PowerMac G4, or even future
generations such as a next-generation 64-bit Itanium or AMD Hammer
or PowerPC G5 machine), no matter how fast, simply cannot do such a
task, at least not in the same way that our brains do. Our brains
are not binary digital von Neumann/Babbage machines. They don't work
like binary digital von Neumann/Babbage machines, and, more
importantly, the converse is also true.
The same thing applies visually: you
can just look at a photograph of someone you know and just
instantly recognize who that person is without even thinking about
it, let alone performing complex edge-detection and
content-extraction analyses, but even the most powerful analytical
imaging software has to go through such steps to perform facial
recognition, and even then doesn't get it right nearly as much or
as easily as even a toddler would.
To get an idea of just what a computer
would have to go through to be able to do this task, try to switch
senses: our visual cortex is no more designed to process sound and
extract instrument, note, lyrics, etc. information out of that than
a computer is. So, have a friend digitally record three sample .WAV
(or .AIFF if you use a Mac) files: a recording of a live musical
concert performance with vocals and multiple instrumentals, a
recording of the inside of a noisy factory, and a recording of a
crowded mall on a peak Christmas shopping day. Your friend must
name those files with plain names that do not describe their
content, such as "A.WAV", "B.WAV" and "C.WAV", assigned in
random order. Your task is simple: you are to completely mute your
computer's sound system (unplug the speakers if you have to), load
the files into a wave editing program that lets you see a visual
representation of the waves, and try to tell which file is the
music, which the noisy factory or similar non-musical noise, and
which the crowded mall. If you can even do that much, I will be
impressed. Now, using only your eyes, try to pick out the
individual notes and instruments from the music recording, or what
individual people are saying in the recording of the crowd.
Here is a graphic of two different sounds. Can you tell what they
are just by looking at them? Here is a hint. You have three choices:
they are both music, they are both spoken words, or one is music and
the other spoken words. If one is a person speaking and the other
is music, can you tell them apart? If they are spoken words, what
are the words? If they are music, what notes are being played by
which instrument?
[Two waveform images. First: "This is the music". Second: "These
are spoken words".]
Try the reverse experiment, too: save
a picture in uncompressed bitmap format, then load it into an audio
program as raw audio samples and play it back, this time with the
audio on, and see if you can "hear" the graphical nature of the
image in any way.
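If you have no audio program that imports raw data, the reverse experiment can be sketched in a few lines of Python with the standard `wave` module. The "picture" here is a synthetic 64 x 64 greyscale gradient standing in for real bitmap data, and the function name, the sample rate, and the output file name in the note after the code are assumptions for this illustration.

```python
import io
import wave

def image_bytes_to_wav(pixel_bytes, sample_rate=8000):
    """Wrap raw 8-bit pixel values in a WAV header, one pixel per audio sample."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(1)   # 8-bit WAV samples are unsigned, like grey levels
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pixel_bytes))
    return buffer.getvalue()

# Stand-in for a real picture: 64 rows, each fading from black towards white.
width = height = 64
picture = bytes((x * 4) % 256 for _ in range(height) for x in range(width))

wav_bytes = image_bytes_to_wav(picture)
print(len(picture), "pixels became", len(wav_bytes), "bytes of WAV")
```

Writing `wav_bytes` to a file such as `picture_as_sound.wav` and playing it gives you the sound of the picture. Not one pixel value was changed in the process; only the label on the data was.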
So, it IS possible to do simple
pitch-to-MIDI conversion of monophonic audio sources, and even
simple polyphonic sources of one instrument that plays chords
(piano, guitar, etc.). For now, Digital Ear remains my favorite
monophonic audio-to-MIDI program, because it does more than just
note-to-MIDI. It tracks and translates pitch fluctuations such as
vibrato, scooping, guitar pulls, trombone and steel guitar slides,
etc. into MIDI Pitch Bends and Portamentos, and even tracks volume
and brightness changes, converting those into MIDI Continuous
Controller events of your choice. So, you could wordlessly sing a
tune into your computer microphone, using vowel sounds such as "oo"
and "ah" to do brightness changes, and vary your volume to match
the way you would want, say, a saxophone track to sound, and
Digital Ear would translate that into MIDI.
6- What about converting MIDI or MUS
files to audio type files?
It is VERY EASY to go the other way,
namely, to turn a MIDI file into a WAV, just as it is easy to turn
an .EPS into a bitmap (TIFF, etc.) -- in fact, since the MIDI and
.EPS files don't really contain sound or an image, respectively,
but rather the commands used to REPRODUCE the sound or image, you
can't even HEAR a MIDI file OR SEE a .EPS file UNTIL it has been
converted to wave audio or a bitmap (respectively), even if only
temporarily. Doing it permanently is only a matter of storing the
results of the conversion that has to be done anyway for the
results of the commands contained in the file to be humanly
perceivable!
For instance, when using Adobe
Illustrator or any similar program, you are NOT seeing the actual
Bezier curves on the screen. You are seeing a rendered bitmap of
them, since the screen is inherently a bitmapped device (in this
case, there ARE exceptions: pen plotters, X-Y vector-scan monitors,
etc. -- but in general what I said holds true). When you print out
such a file on any ordinary printer (even a PostScript laser
printer), you are seeing dots that were produced by a rasterizer in
the printer, which converted the graphic commands into a
bitmap.
When you play a MIDI file, however you
do so, the MIDI device interprets the commands and generates sound
waves. Once generated, they are of the same nature as sound waves
generated by recorded audio files.
In both cases, the output of the
conversion can be saved to a file, and the resulting file is an
ordinary uncompressed or compressed bitmapped graphics file (TIFF,
JPG, etc.) or an audio file (WAV, AIFF, MP3, etc.),
respectively.
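To show how mechanical this direction is, here is a minimal sketch of a "renderer" in Python. It is a deliberately crude stand-in for a real MIDI synthesizer: every note becomes a bare sine wave, and the note tuple layout `(start, length, MIDI note, velocity)`, the sample rate, and the function names are assumptions made for this illustration.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

def midi_to_hz(note):
    """Equal-temperament frequency of a MIDI note number (69 = A 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)

def render(notes, total_seconds):
    """Turn note commands into one list of audio samples by adding sine waves."""
    samples = [0.0] * int(SAMPLE_RATE * total_seconds)
    for start_s, length_s, note, velocity in notes:
        freq = midi_to_hz(note)
        first = int(start_s * SAMPLE_RATE)
        last = min(len(samples), int((start_s + length_s) * SAMPLE_RATE))
        for n in range(first, last):
            t = (n - first) / SAMPLE_RATE
            samples[n] += (velocity / 127) * math.sin(2 * math.pi * freq * t)
    return samples

# A C major chord held for half a second, followed by half a second of silence.
chord = [(0.0, 0.5, 60, 40), (0.0, 0.5, 64, 40), (0.0, 0.5, 67, 40)]
audio = render(chord, 1.0)
print(len(audio), "samples rendered from", len(chord), "note commands")
```

Each command is simply followed in turn and the results are added together. Going the other way means undoing that addition, which is where all the difficulty described earlier comes from.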
7- What does Myriad Software have to
say?
In the very first version of
Harmony Assistant, in 1994, we implemented a frequency (note)
recognition module. It worked quite well when only one instrument
played only one note at a time, as is the case for voice and all
wind instruments. It also worked for most polyphonic instruments
that play chords, for example a single guitar, a single piano, etc.
But it did not work at all on complex orchestration, drums,
etc.
Because many users tried to use this
feature in a way it had not been built for, then complained to our
tech support that it did not work as they expected, we removed it
from later versions of the program. Of this original set of
features, only the "Tune-up" option now remains in Harmony
Assistant, and it has remained untouched for 7 years!
Because computer power keeps
increasing, we took a look at what is available now in this domain.
It seems that things have not improved much in recent years.
Many programs can process mono-instrument, mono-pitched samples,
and some of them try to recognize notes for a single polyphonic
instrument, with more or less success, but *none* of them is
capable of outputting a clean score from a complex source, such as
an orchestral piece or even a small rock band digital track. We can
really wonder whether it is actually possible. In my opinion, it is
not possible at present to get good results for such tracks.
If you carefully read the usage notice
of most of these pieces of software, you will notice that it clearly
says the software works only for a "solo" instrument and is not
intended to process complex digital tracks taken from a CD. But the
first thing
each user does to evaluate the software (I did this too) is to try
it on pieces as complex as the 9th symphony of Beethoven, or on an
excerpt of the latest CD from Iron Maiden, according to each one's
taste. And of course, it does not work, and users complain either
on the newsgroup or to the software company tech support (It
*didn't* do this).
It is for this reason that we do not want
to implement such a feature in Harmony Assistant. We know that
even if we display in big red letters the way this feature is
designed to work, most users will push it beyond its limits,
then bug the tech support...
So we will probably never implement a
recognition feature of the kind most of you expect. In any case,
such a module wouldn't be capable of outputting a 30-staff score
from a symphony. If we do something in this field, it will be as
part of a more global module, for example a voice-oriented one, so
that there can be no confusion about the limits of the process.
==============================
Analogy between graphic programs and music programs
Illustration 1: This is your computer screen magnified many times.
Each square is called a pixel and is in fact a very tiny dot that
can be ON (white or lit) or OFF (black or unlit).

Illustration 2: If you type the letter "a" in a "paint" type
program, here is what you see on screen, again magnified many
times. Your file will be a BMP, PICT, PSD, TIFF or similar type of
file. This is similar to an MP3, WAV or other type of audio file.
In this case, we would have only one instrument playing one note,
nothing more. If we had a word, it would be like having as many
notes as there are letters.

Illustration 3: This is what the computer sees: a series of 1s and
0s. A 1 indicates that this dot of light or pixel is ON and a 0
indicates that the pixel is OFF.

Illustration 4: If you want to convert this file to an Illustrator
EPS or similar file type automatically, you would have to autotrace
the bitmap file. This is just like trying to convert a WAV file to
a MIDI or MUS file automatically. The resulting file is editable at
will.

Illustration 5: Your converted file would look similar to this.
This compares to the end result of your one-note, one-instrument
MP3 file converted to MIDI. The blue lines and dots indicate points
from which the drawing can be modified without loss of resolution
or quality.

Illustration 6: This is what you hoped to get with your automatic
conversion. This is just what the original MUS or MIDI file would
look like if it had been created as such in the first place. This
is what you see. What the computer sees is a series of commands
that tell it what to do and how to do it, instead of a series of 1s
and 0s.

Illustration 7: So, this is what you started with, with the hope
that your automatic conversion would give you...

Illustration 8: ...this MUS or MIDI file: a file that can be edited
at will, but...

Illustration 9: ...this is the end result of your automatic
conversion and it is the very best you can hope for with only one
instrument playing one note in the best of conditions.

Illustration 10: Now, this is similar to the recording of an opera,
a jazz big band or the Rolling Stones playing and singing with many
voices, instruments, chords, etc. What you want to do is to convert
it automatically to something that would be exactly the same but
that you could edit at will afterwards, hoping that all that
belongs to the tree on the right will be grouped in logical order,
and the same for all other picture elements, so you can edit them
easily (the notes played by the piano are all on the same staff
with proper velocity, duration, etc., and the same for each of the
other instruments). Can you imagine the task? It is impossible to
do with any graphics program on the market at the present time, and
we may have serious doubts that it will be possible in the near
future. The same goes for music.
==============================
Written by Joel Ellis Rea aka "COMALite J.", with a comment by
Myriad Software.
Compiled and processed by Laurier Nappert. Illustrations and
painting by Laurier Nappert.
November 2001