An acoustical study of Korean vowels
Professor Byunggon Yang
Department of English, College of Humanities, Dongeui University, 4 Kaya-dong, Pusanjin-gu,
Pusan 614-714, Korea
1. Introduction
In real life, speech signals vary greatly between and within speakers, but human beings
seem to have little difficulty communicating. For instance, a speaker never produces
a word in physically the same way on two occasions or in two different contexts.
Moreover, no two speakers produce a word in exactly the same way, articulatorily or
acoustically. This speaker variation has been attributed to (1) linguistic factors
such as dialectal and sociolectal differences and to (2) non-linguistic factors such
as physical anatomy, age, gender, and emotional state of the speaker. Some of these factors
are systematic so that their effects may be theoretically separable from linguistically
relevant properties of speech by systematic transformations, while others may be
minimized by methods of statistical inference. The goal of factoring out these non-linguistic
factors is to establish a "pure," linguistically relevant acoustic specification
of the vowel qualities of any given language. This procedure has been called "normalization."

This study reviews three non-linguistic factors that complicate
the task of establishing the phonetically significant correlates of vowel quality obtained
from male and female speakers: fundamental frequency, vocal tract length,
and the ratio of front cavity to back cavity.

First, the speed of vocal fold vibration
or F0 can be regarded as inversely proportional to the mass and length of the vocal
fold and proportional to the tension. A study by Negus (1949) reported that vocal
cord length averaged 12 to 17 mm in the adult females and 17 to 23 mm in the adult males.
Thus, the vocal fold vibration of females with smaller vocal cords is predictably
faster than that of males. The average F0 is about 125 Hz for males and
around 200 Hz for females. F0 also varies as the speaker changes the tension of the
laryngeal muscles and to some extent the subglottal pressure. Anatomically, the vocal
folds can be lengthened with increased tension when the cricothyroid muscle contracts
causing the cricoid to tilt back and thereby stretching the vocal folds. Boothroyd
(1986) also observed that F0 varied between a low value of about 70 Hz and a high
value of about 200 Hz in men. In women, the range was from 140 to 400 Hz.

Second, formant frequencies are inversely related to the overall length of the speaker's
vocal tract. Vocal tract size varies according to age and gender. Females usually have shorter vocal
tracts than males. Therefore, even when a vowel phoneme is articulated with
essentially the same vocal tract configuration, the formant frequencies are higher for females
than for males. The overall vocal tract length can be estimated directly from formant
frequency measurements. Assuming the cross-sectional area of the human vocal tract
to be almost uniform for the vowel [/\] as in an English token Hudd, one can obtain the
length of the speaker's vocal tract (L) by introducing a measurement of F3 of [/\]
into the well-known formula of Equation 1. The vocal tract ratios of female to male
in three European languages were determined according to Eq. (1) and were found to be 0.89
for Swedish, 0.89 for Dutch, and 0.86 for English. The sources are van Nierop et
al. (1973) and Pols et al. (1973) for Dutch, Peterson and Barney (1952) for English,
and Fant (1975) for Swedish. This corroborates Chiba and Kajiyama (1941) who estimated
overall vocal tract length, assigning the relative numbers of 1.0 to males, 0.87
to females. These numbers all indicate that female vocal tracts are 11-14% shorter
than those of males. Based on the overall vocal tract difference, Nordstrom and Lindblom (1975)
proposed a uniform scaling method for gender normalization. It involved estimating
the total length of a subject's vocal tract from an average of F3 in vowels with
F1 greater than 600 Hz. Because the length of the speaker's vocal tract is inversely related
to formant frequency, the ratio of the length of the average male vocal tract (Lm)
to the average female vocal tract length (Lf) can be written as in Equation 2, where F3m.av and
F3f.av denote the average third formant values of male and female speakers, respectively.
Then, the normalized nth female formant frequency, denoted Fnf(scaled),
can be determined according to Equation 3.

Third, the ratio of pharynx to mouth cavity lengths is another factor contributing
to variation between speakers. Chiba and Kajiyama
(1941) stated that mouth cavity length of an eight-year-old girl was 30% shorter
than that of an adult male while the length of the girl's pharynx was 56% shorter
than that of the male. Again, the lengths of the pharynx and mouth cavity can be estimated from
the formant frequencies of the vowel [i]. In a simplified two-cavity model of the vowel
[i], F2 depends on the back cavity (pharynx) while F3 depends on the front cavity
(mouth cavity). The length of the back cavity (LB) and that of the front cavity
(LF) can be approximated by Equations 4 and 5. These are only approximate values given
the simplicity of the model. For Swedish speakers, Fant (1973) reported that the
female pharynx according to the formulas above was 2.1 cm shorter than the male pharynx;
and the female mouth cavity was 1.25 cm shorter than the male mouth. This observation
agreed well with the physiological data. From these differences in pharynx-to-mouth-cavity
ratios Fant predicted that male-female formant values would be related by non-uniform
scale factors. Fant (1968) proposed to consider not only differences in the overall
vocal tract length between male and female speakers but also the complex formant-cavity
relationships. Therefore, Fant (1975) recommended using scale factors that are both
vowel- and formant-specific. His method applies a different scale factor to each individual
vowel and individual formant category.

In this paper, F0 and the first three formants of ten Korean vowels produced by 20 male
and female speakers were first measured, with linguistic factors controlled
as homogeneously as possible in each group. Second,
the male-female variation in the Korean data was examined. Third, the data were
studied in terms of fundamental frequency, vocal tract length, and the ratio of pharynx
to mouth cavity.
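As a rough illustration of the two-cavity approximation for [i] described above, the following Python sketch assumes that each cavity behaves as a half-wavelength resonator, so that its length is c / (2F). That functional form, the speed of sound (35,000 cm/s), the function names, and the example formant values are assumptions made for illustration only; Equations 4 and 5 of this paper may take a different form.

```python
# Assumed speed of sound in warm, moist air (cm/s); not stated in the paper.
SPEED_OF_SOUND_CM_PER_S = 35000.0


def half_wave_length_cm(formant_hz, c=SPEED_OF_SOUND_CM_PER_S):
    """Length of a cavity whose half-wavelength resonance is formant_hz."""
    if formant_hz <= 0:
        raise ValueError("formant frequency must be positive")
    return c / (2.0 * formant_hz)


def cavity_lengths_for_i(f2_hz, f3_hz):
    """Return (back_cavity_cm, front_cavity_cm) for the vowel [i].

    In the simplified model, F2 is tied to the back cavity (pharynx)
    and F3 to the front cavity (mouth).
    """
    return half_wave_length_cm(f2_hz), half_wave_length_cm(f3_hz)


# Hypothetical formant values, not taken from the paper's tables.
back_cm, front_cm = cavity_lengths_for_i(f2_hz=2500.0, f3_hz=3500.0)
print(f"back cavity: {back_cm:.1f} cm, front cavity: {front_cm:.1f} cm")
```

Because both estimates depend on a single formant each, small measurement errors in F2 or F3 translate directly into errors in the estimated lengths.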
2. Method
2.1. Subjects and speech samples
A total of 20 subjects were chosen from a larger group participating in recording
and listening sessions at the University of Texas at Austin. They formed two groups:
ten Korean males and ten Korean females. Subjects were students attending the University
of Texas at Austin, and all had normal hearing and health. All the Korean subjects
spoke Standard Korean. Two screening instruments were used to make each group linguistically
homogeneous. First, subjects were grouped on the basis of information collected
from a questionnaire, which covered each subject's dialect and history of
speech and hearing disorders. Second, peer judgment was employed to screen out those
subjects who had different dialects in the language group. Five peers in each male
and female group were randomly chosen from among the subjects. Then, the peers were asked
to listen to the four sets of tokens consisting of the vowels [i a u]. Each set was
composed of different male and female subjects saying the same token. In the listening
session, the peers put a check mark on each token that sounded different from their
own dialect. All the marks were counted to find four peers (two males and two females)
who had the fewest marks. Finally, marks by the four chosen peers were used to screen
out those subjects who had more than 35% of the total tokens perceived as a different
dialect from other members.

The speech samples consisted of 52 Korean words. Each
English and Korean vowel occurred in an |h(V)da| context. In this context, the formants
of the following vowel can be easily identified because the /h/ noise on the spectrogram shows
patterns similar to those of the following vowel formants. The ten Standard Korean vowels studied
were /a, E, u, i, i-, we, wi, /\ o, /. These ten Korean vowels appeared five times
in random order. Later, three out of the five productions of each vowel were randomly
chosen for the average data set, avoiding unnaturally produced tokens at the beginning
and end of the recording.
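The peer-judgment screening and the random choice of three productions out of five described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function names, the data layout (a count of marked tokens per subject), and the example numbers are hypothetical, and the manual step of avoiding unnaturally produced tokens is not modeled.

```python
import random


def screen_subjects(marked_counts, total_tokens, threshold=0.35):
    """Split subjects into (kept, excluded).

    A subject is excluded when more than `threshold` of his or her tokens
    were marked by the chosen peers as sounding like a different dialect.
    """
    kept, excluded = [], []
    for subject, marked in marked_counts.items():
        if marked / total_tokens > threshold:
            excluded.append(subject)
        else:
            kept.append(subject)
    return kept, excluded


def choose_productions(productions, k=3, seed=None):
    """Randomly choose k of the recorded productions of one vowel."""
    rng = random.Random(seed)
    return rng.sample(productions, k)


# Hypothetical counts: marks received out of 12 judged tokens per subject.
kept, excluded = screen_subjects({"s1": 2, "s2": 5, "s3": 4}, total_tokens=12)
print(kept, excluded)
print(choose_productions([1, 2, 3, 4, 5], seed=0))
```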
2.2. Procedures
The recording was done in a sound-proof booth in the Phonetics Lab of the University
of Texas at Austin (UT). The experimenter asked the subjects to produce each word
at a normal rate and as naturally as possible. The recording took 2-3 min per subject.
The recorded samples were analyzed using the VAX computer in the UT Phonetics Lab. The
KLSPEC software package was used to interactively examine, measure, and analyze the
recorded samples. The input samples were low-pass filtered at 4 kHz and digitized
at a 10-kHz sampling rate. Spectrograms were made using a 256-point discrete Fourier transform
(DFT) analysis with a 6.4-ms Hamming window once every millisecond. The dynamics
of the vowel formant pattern made it difficult to find a consistent time point for
spectrum analysis. Therefore, the author calculated the total duration of each vowel from its
onset and offset time points. The measurement point was then determined by adding one-third
of the total duration to the vowel onset point.
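The analysis settings and the measurement point described above imply a few simple quantities, worked out in the Python sketch below. The constants come from the text; the function and variable names are hypothetical, and the sketch does not reproduce KLSPEC itself. Note that a 6.4-ms window at 10 kHz spans fewer samples than the 256-point DFT, which presumably means each frame was zero-padded; that reading is an assumption.

```python
SAMPLE_RATE_HZ = 10_000   # sampling rate given in the text
WINDOW_MS = 6.4           # Hamming window length given in the text
HOP_MS = 1.0              # one spectrum per millisecond
DFT_POINTS = 256          # DFT size given in the text


def ms_to_samples(duration_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(duration_ms * sample_rate_hz / 1000.0)


def measurement_point_ms(onset_ms, offset_ms):
    """Time point one-third of the way from vowel onset to vowel offset."""
    if offset_ms <= onset_ms:
        raise ValueError("offset must come after onset")
    return onset_ms + (offset_ms - onset_ms) / 3.0


window_samples = ms_to_samples(WINDOW_MS)       # samples per analysis window
hop_samples = ms_to_samples(HOP_MS)             # samples between spectra
bin_spacing_hz = SAMPLE_RATE_HZ / DFT_POINTS    # spacing of DFT bins in Hz

print(window_samples, hop_samples, bin_spacing_hz)
# Hypothetical vowel running from 100 ms to 250 ms:
print(measurement_point_ms(100.0, 250.0))
```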
2.3. Formant Decisions
In this study the spectrogram of each vowel served as the first reference and as the basis for the final
decision on formant values. First, formant values on the spectrogram were estimated
by drawing a pencil line through the center of each formant band with a ruler. Then,
these visual estimates were compared with the estimates computed automatically by one
of the KLSPEC software packages from the two DFT harmonic spectra (average envelope
and LPC envelope). For reliability, the measurements were checked by an independent
observer.
3. Data analysis and discussion
Tables I and II list the average values of F0 and the first three formants (F1, F2, and
F3) of the Korean vowels, together with their standard deviations. As shown in the tables,
the deviation within males is generally lower than that within females. Figure 1 illustrates
the vowel space of males and females in which adjacent vowel points are connected
peripherally. Phonemes are given near various symbols. The same symbol indicates
the same phoneme. The two vowel spaces appear triangular. A thicker line connects
male vowels, while a thinner line connects female vowels. The vowel spaces show a somewhat systematic
relationship, which implies a trend toward higher formant frequencies for female speakers.
This tendency may be largely due to non-linguistic factors because linguistic factors in this
experiment were controlled as homogeneously as possible in each group.
To examine that relationship, the percent difference (Diff.) between the male and female
formant frequencies for each vowel formant was calculated by Eq. (6), where Fnf denotes
the nth female formant value. Diff. in F0 was 59%. The average formant Diff. across
all the vowels came out to 18%. Specifically, the Diff. was 18% for F1, 20%
for F2, and 17% for F3. Total variation except F0 amounted to 8%.

The following discussion focuses on three non-linguistic factors in the data. The first is fundamental
frequency. The average F0 was 169 Hz for males and 269 Hz for females. The female
average F0 was about 1.6 times that of males. A question arises as to whether F0
can serve as an independent source of speaker normalization. The present data showed a strong negative
correlation between F0 and F1 in males (r = -0.92). The correlation among the females
was weaker (r = -0.73). However, the correlation of F0 with F2 or F3 was very weak (r < 0.39).

Another factor is overall vocal tract length. From the F3 of the vowel
[/\] the average vocal tract length was estimated to be 16 cm for males and 14 cm
for females. Fig. 2 shows the average values with +/- one standard deviation bars.
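The length estimate from F3 can be illustrated with the standard resonance formula for a uniform tube closed at the glottis and open at the lips, Fn = (2n - 1)c / (4L). The Python sketch below assumes that this is the formula referred to as Eq. (1), takes the speed of sound as 35,000 cm/s, and uses hypothetical F3 values chosen only so that the resulting lengths come out near the reported averages; none of these values are taken from the tables.

```python
# Assumed speed of sound (cm/s); not stated in the paper.
SPEED_OF_SOUND_CM_PER_S = 35000.0


def tract_length_from_formant(formant_hz, n=3, c=SPEED_OF_SOUND_CM_PER_S):
    """Vocal tract length (cm) from the nth resonance of a uniform tube.

    Solves Fn = (2n - 1) * c / (4 * L) for L.
    """
    if formant_hz <= 0:
        raise ValueError("formant frequency must be positive")
    return (2 * n - 1) * c / (4.0 * formant_hz)


# Hypothetical average F3 values for the neutral vowel.
male_length = tract_length_from_formant(2734.375)
female_length = tract_length_from_formant(3125.0)
print(f"male: {male_length:.1f} cm, female: {female_length:.1f} cm")
print(f"female/male length ratio: {female_length / male_length:.3f}")
```

With these illustrative inputs the female-to-male length ratio is 0.875, in the same range as the ratios quoted earlier for the European languages.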
The uniform scaling method was employed to normalize the gender difference in vocal tract length.
The uniform scale factor, estimated from Eq. (1), was 0.87. The data were scaled by
Eq. (2). Then, a numerical criterion was used to see how closely the female data
were scaled to the male reference data. For that purpose, Fnf in Eq. (3) was replaced by
Fnf(scaled). The uniform scaling method had an average Diff. of less than 6% across
all the vowels. Regression analyses were conducted to obtain regression equations
(7) and (8), which relate all the female formant values (Fnf) to those of
males (Fnm) and vice versa. The slope of 0.85 is similar to the average of F3f/F3m (=0.87)
for the uniform scaling method. With small intercepts, both equations imply that
one may expect a good fit with a regression equation through the origin (zero intercept).
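A regression through the origin of the kind mentioned above has a closed-form least-squares slope, k = sum(x*y) / sum(x*x). The sketch below is only an illustration: the function name and the formant values are hypothetical, and it assumes that male values are predicted from female values, which is the direction that would give a slope below 1.

```python
def slope_through_origin(x_values, y_values):
    """Least-squares slope k of the zero-intercept model y = k * x."""
    if len(x_values) != len(y_values) or not x_values:
        raise ValueError("x and y must be non-empty and of equal length")
    sum_xx = sum(x * x for x in x_values)
    if sum_xx == 0:
        raise ValueError("x values must not all be zero")
    sum_xy = sum(x * y for x, y in zip(x_values, y_values))
    return sum_xy / sum_xx


# Hypothetical female formants (Hz) and male values lying exactly on a
# line through the origin with slope 0.85.
female = [400.0, 1000.0, 2000.0, 3000.0]
male = [0.85 * f for f in female]
print(round(slope_through_origin(female, male), 3))
```

With real data the points scatter around the line, and the slope obtained this way could then serve as a single uniform scale factor.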
Thus, the uniform scale factor can be easily determined through the regression. The
r2 value indicates that female formant values can be accurately predicted from male values
or vice versa.

The third factor is the difference in size of the mouth and pharynx cavities.
The lengths of the front and back cavities were estimated from the Korean vowel
[i] using Eqs. (4) and (5). The average front cavity length was about 7.5 cm for
the Korean male group and 6 cm for the female group. The average back cavity length was about
5.5 cm for males and 5 cm for females. Fig. 3 shows the average values with +/- one
standard deviation bars. Generally, male speakers have a back cavity that is longer
than the front cavity. The average difference in the front cavity of male and female between
the languages is small, but the difference in the back cavity is almost twice that
of the front cavity. Diff. for the Korean vowel [i] was 1% whereas for F2 it was
about 27%. Since F2 of [i] depends on the length of the pharynx, this difference provides
evidence for the non-uniform shortening of this cavity in the Korean data. As argued
by Fant (1975), this circumstance also contributes to making scale factors vowel-
and formant-number-specific. Since the scale factors in the non-uniform
methods were derived from the average variation of the six European languages, they
may not be appropriate for scaling the Korean data. However, one can reason
that employing a formant-number-specific method may yield about 2% more reduction
since the deviation between formant numbers was 2% in the previous analysis. Moreover,
if both formant-number-specific and vowel-specific scale factors are used, the result
should improve on the less-than-6% difference obtained with uniform scaling.

In summary, F0 and the first three formants of the
ten Korean vowels produced by 20 male and female speakers were studied first. Second, the
gender variation was examined. The average formant difference across all vowels came
out to 18%. Third, the data were examined in terms of fundamental frequency, vocal
tract length, and the ratio of pharynx to mouth cavity. The uniform scaling method
resulted in a difference of less than 6% between the scaled and reference data. The
slope of a regression equation with a small intercept could be used as the uniform scale
factor. Non-uniform scaling was expected to reduce the remaining gender difference to
less than 6%.
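The percent-difference measure and the uniform scaling summarized above can be sketched in Python as follows. The form of Diff. used here, (Fnf - Fnm) / Fnm x 100, is an assumption about Eq. (6); it is consistent with the reported F0 averages (169 Hz and 269 Hz giving about 59%). The scaling step assumes that Fnf(scaled) is simply the scale factor times Fnf. The function names and the example formant values are hypothetical.

```python
def percent_diff(female_hz, male_hz):
    """Percent by which a female value exceeds the male reference value."""
    if male_hz == 0:
        raise ValueError("male reference value must be non-zero")
    return (female_hz - male_hz) / male_hz * 100.0


def scale_uniform(female_formants_hz, k=0.87):
    """Apply one uniform scale factor k to every female formant."""
    return [k * f for f in female_formants_hz]


# Check against the F0 averages reported in the text.
print(round(percent_diff(269.0, 169.0)))

# Hypothetical formants (Hz) for one vowel, before and after scaling.
male = [300.0, 2200.0, 3000.0]
female = [350.0, 2650.0, 3450.0]
scaled = scale_uniform(female)
for m, f, s in zip(male, female, scaled):
    print(f"{percent_diff(f, m):5.1f}% -> {percent_diff(s, m):5.1f}%")
```

A single factor reduces the difference for every formant by the same proportion, which is why residual differences that vary by formant or by vowel motivate the non-uniform methods discussed above.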