An acoustical study of Korean vowels

Professor Byunggon Yang
Department of English, College of Humanities, Dongeui University, 4 Kaya-dong, Pusanjin-gu, Pusan 614-714, Korea

1. Introduction

In real life, speech signals vary greatly between and within speakers, yet human beings seem to have little difficulty communicating. For instance, a speaker never produces a word in physically the same way on two occasions or in two different contexts. Moreover, no two speakers produce a word in exactly the same way, articulatorily or acoustically. This speaker variation has been attributed to (1) linguistic factors such as dialectal and sociolectal differences and (2) non-linguistic factors such as the physical anatomy, age, gender, and emotional state of the speaker. Some of these factors are systematic, so their effects may in theory be separated from linguistically relevant properties of speech by systematic transformations, while others may be minimized by methods of statistical inference. The goal of factoring out these non-linguistic factors is to establish a "pure," linguistically relevant acoustic specification of the vowel qualities of any given language. This procedure has been called "normalization." This study reviews three non-linguistic factors that are problematic in establishing the phonetically significant correlates of vowel quality obtained from male and female speakers: fundamental frequency, vocal tract length, and the ratio of front cavity to back cavity.
First, the rate of vocal fold vibration, or F0, can be regarded as inversely proportional to the mass and length of the vocal folds and proportional to their tension. Negus (1949) reported that vocal cord length averaged 12 to 17 mm in adult females and 17 to 23 mm in adult males. Thus, the vocal folds of females, being smaller, predictably vibrate faster than those of males. The average F0 is about 125 Hz for males and around 200 Hz for females. F0 also varies as the speaker changes the tension of the laryngeal muscles and, to some extent, the subglottal pressure.
Anatomically, the vocal folds are lengthened, and their tension increased, when the cricothyroid muscle contracts, tilting the cricoid back and thereby stretching the vocal folds. Boothroyd (1986) also observed that F0 varied between a low value of about 70 Hz and a high value of about 200 Hz in men. In women, the range was from 140 to 400 Hz.
Second, formant frequencies are inversely related to the overall length of the speaker's vocal tract. Vocal tract size varies according to age and gender, and females usually have shorter vocal tracts than males. Therefore, even when a vowel phoneme is articulated with essentially the same vocal tract configuration, the formant frequencies are higher for females than for males. The overall vocal tract length can be estimated directly from formant frequency measurements. Assuming the cross-sectional area of the human vocal tract to be almost uniform for the vowel [/\], as in the English token "Hudd," one can obtain the length of the speaker's vocal tract (L) by introducing a measurement of F3 of [/\] into the well-known formula of Equation 1. The female-to-male vocal tract ratios in three European languages were determined according to Eq. (1) and were found to be 0.89 for Swedish, 0.89 for Dutch, and 0.86 for English. The sources are van Nierop et al. (1973) and Pols et al. (1973) for Dutch, Peterson and Barney (1952) for English, and Fant (1975) for Swedish. These values corroborate Chiba and Kajiyama (1941), who estimated overall vocal tract length and assigned relative values of 1.0 to males and 0.87 to females. Taken together, these numbers indicate that female vocal tracts are 11-14% shorter than those of males. Based on this difference in overall vocal tract length, Nordstrom and Lindblom (1975) proposed a uniform scaling method for gender normalization. It involved estimating the total length of a subject's vocal tract from an average of F3 in vowels with F1 greater than 600 Hz.
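The length estimate described above can be illustrated with a minimal sketch. It assumes the textbook model of a uniform tube closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1)c / 4L, so that L = 5c / (4 F3). The speed of sound (35,000 cm/s), the function names, and the example F3 value are illustrative assumptions and are not taken from the paper.

```python
# Sketch: vocal tract length from F3 under a uniform-tube assumption.
# The constant and function names below are hypothetical, chosen for illustration.

SPEED_OF_SOUND_CM_PER_S = 35000.0  # assumed value for warm, moist air


def tube_resonance_hz(n, length_cm, c=SPEED_OF_SOUND_CM_PER_S):
    """n-th resonance of a uniform tube closed at one end: (2n - 1) * c / (4 * L)."""
    return (2 * n - 1) * c / (4.0 * length_cm)


def tract_length_from_f3_cm(f3_hz, c=SPEED_OF_SOUND_CM_PER_S):
    """Invert the third resonance (n = 3): L = 5 * c / (4 * F3)."""
    return 5.0 * c / (4.0 * f3_hz)


# Example with a hypothetical F3 of 2500 Hz for a neutral vowel.
example_length_cm = tract_length_from_f3_cm(2500.0)
```

Because L is inversely proportional to F3 in this model, the ratio of two speakers' tract lengths equals the inverse ratio of their F3 values, which is the relation the scaling methods rely on.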
Because the length of the speaker's vocal tract is inversely related to formant frequency, the ratio of the average male vocal tract length (Lm) to the average female vocal tract length (Lf) can be written as in Equation 2. F3m.av and F3f.av denote the average third-formant values of the male and female speakers, respectively. The normalized nth female formant frequency, denoted Fnf(scaled), can then be determined according to Equation 3.
Third, the ratio of pharynx length to mouth cavity length is another factor contributing to variation between speakers. Chiba and Kajiyama (1941) stated that the mouth cavity of an eight-year-old girl was 30% shorter than that of an adult male, while the girl's pharynx was 56% shorter than that of the male. Again, the lengths of the pharynx and mouth cavity can be estimated from the formant frequencies of the vowel [i]. In a simplified two-cavity model of the vowel [i], F2 depends on the back cavity, or pharynx, while F3 depends on the front cavity, or mouth cavity. The length of the back cavity (LB) and that of the front cavity (LF) can be approximated by Equations 4 and 5. These are only approximate values, given the simplicity of the model. For Swedish speakers, Fant (1973) reported that, according to the formulas above, the female pharynx was 2.1 cm shorter than the male pharynx and the female mouth cavity was 1.25 cm shorter than the male mouth cavity. This observation agreed well with the physiological data. From these differences in pharynx-to-mouth-cavity ratios, Fant predicted that male and female formant values would be related by non-uniform scale factors. Fant (1968) proposed considering not only differences in overall vocal tract length between male and female speakers but also the complex formant-cavity relationships. Accordingly, Fant (1975) recommended scale factors that are both vowel-specific and formant-specific: his method applies a different scale factor to each individual vowel and each formant category.
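The two calculations described above can be sketched as follows. The uniform scaling part follows the verbal description: the length ratio k = Lm/Lf is taken as F3f.av/F3m.av, and each female formant is divided by k. The cavity part assumes that each cavity of [i] behaves as a half-wavelength resonator, so that length = c / (2F); this form is an assumption made for illustration, since the formulas themselves are not reproduced in the text. The speed of sound, the function names, and all numeric inputs are hypothetical.

```python
# Sketch: uniform scaling and two-cavity length estimates.
# Names, constants, and example values are illustrative assumptions.

SPEED_OF_SOUND_CM_PER_S = 35000.0  # assumed value


def length_ratio_male_to_female(f3_male_avg_hz, f3_female_avg_hz):
    """k = Lm / Lf, taken as F3f.av / F3m.av (length is inversely related to F3)."""
    return f3_female_avg_hz / f3_male_avg_hz


def scale_female_formant(fn_female_hz, k):
    """Uniformly scale a female formant toward the male reference: Fnf / k."""
    return fn_female_hz / k


def half_wave_cavity_length_cm(resonance_hz, c=SPEED_OF_SOUND_CM_PER_S):
    """Length of a cavity treated as a half-wavelength resonator: c / (2 * F)."""
    return c / (2.0 * resonance_hz)


def cavity_lengths_from_i(f2_hz, f3_hz):
    """Return (back, front) cavity lengths in cm from F2 and F3 of [i]."""
    return half_wave_cavity_length_cm(f2_hz), half_wave_cavity_length_cm(f3_hz)


# Hypothetical averages: male F3 = 2500 Hz, female F3 = 2875 Hz.
k = length_ratio_male_to_female(2500.0, 2875.0)
scaled_f1 = scale_female_formant(690.0, k)  # a hypothetical female F1 of 690 Hz
```

With this convention k is greater than 1 and its reciprocal corresponds to the female-to-male length ratio, that is, to scale factors of the kind reported above (0.86 to 0.89).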
In this paper, F0 and the first three formants of ten Korean vowels produced by 20 male and female speakers were first measured, with the linguistic factors kept as homogeneous as possible within each group. Second, the male-female variation in the Korean data was examined. Third, the data were studied in terms of fundamental frequency, vocal tract length, and the ratio of pharynx to mouth cavity.

2. Method

2.1. Subjects and speech samples
A total of 20 subjects were chosen from a larger group participating in recording and listening sessions at the University of Texas at Austin. They formed two groups: ten Korean males and ten Korean females. The subjects were students attending the University of Texas at Austin, and all had normal hearing and health. All the Korean subjects spoke Standard Korean.
Two screening instruments were used to make each group linguistically homogeneous. First, subjects were grouped on the basis of information collected from a questionnaire, which covered each subject's dialect and history of speech and hearing disorders. Second, peer judgment was employed to screen out subjects whose dialect differed from that of the rest of the language group. Five peers in each of the male and female groups were randomly chosen from among the subjects. The peers were then asked to listen to four sets of tokens consisting of the vowels [i a u]. Each set was composed of different male and female subjects saying the same token. In the listening session, the peers put a check mark on each token that sounded different from their own dialect. All the marks were counted to find the four peers (two males and two females) who had the fewest marks. Finally, the marks given by these four peers were used to screen out subjects for whom more than 35% of the total tokens were perceived as belonging to a different dialect from that of the other members.
The speech samples consisted of 52 Korean words. Each vowel occurred in an |h(V)da| context. In this context, the formants of the following vowel can be easily identified because the /h/ noise on the spectrogram shows a pattern similar to that of the vowel formants. The ten Standard Korean vowels studied were /a, E, u, i, i-, we, wi, /\, o, /. These ten Korean vowels appeared five times in random order. Later, three of the five productions of each vowel were randomly chosen for the average data set, avoiding unnaturally produced tokens at the beginning and end of the recording.

2.2. Procedures
The recording was done in a sound-proof booth in the Phonetics Lab of the University of Texas at Austin (UT). The experimenter asked the subjects to produce each word at a normal rate and as naturally as possible. The recording took 2-3 min per subject. The recorded samples were analyzed on the VAX computer in the UT Phonetics Lab. The KLSPEC software package was used to examine, measure, and analyze the recorded samples interactively. The input samples were low-pass filtered at 4 kHz and digitized at a 10-kHz sampling rate. Spectrograms were made using a 256-point discrete Fourier transform (DFT) analysis with a 6.4-ms Hamming window once every millisecond. The dynamics of the vowel formant pattern made it difficult to find a consistent time point for spectrum analysis. Therefore, the author calculated the total duration of each vowel from its onset and offset time points, and the measurement point was set by adding one-third of the total duration to the vowel onset point.
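The measurement point and the analysis frame described above can be sketched in a few lines. This is an illustration and not a reconstruction of the KLSPEC analysis: it assumes that the 6.4-ms window at 10 kHz corresponds to 64 samples, zero-padded to 256 points for the DFT, and it uses a plain (slow) DFT so that it needs only the standard library. All function names are hypothetical.

```python
# Sketch: one-third measurement point and a Hamming-windowed 256-point DFT.
# Not the KLSPEC implementation; names and zero-padding are assumptions.
import cmath
import math

SAMPLE_RATE_HZ = 10_000
WINDOW_MS = 6.4
DFT_POINTS = 256


def measurement_time_s(onset_s, offset_s):
    """Vowel onset plus one-third of the total vowel duration."""
    return onset_s + (offset_s - onset_s) / 3.0


def hamming(n):
    """Hamming window of n points."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_spectrum_db(samples, center_s):
    """Return (frequencies in Hz, magnitudes in dB) for one windowed frame."""
    win_len = round(WINDOW_MS * 1e-3 * SAMPLE_RATE_HZ)  # 64 samples at 10 kHz
    start = round(center_s * SAMPLE_RATE_HZ) - win_len // 2
    start = max(0, min(start, len(samples) - win_len))  # keep frame inside signal
    window = hamming(win_len)
    frame = [samples[start + i] * window[i] for i in range(win_len)]
    freqs, mags_db = [], []
    for k in range(DFT_POINTS // 2 + 1):
        # Summing over win_len samples only is equivalent to zero-padding to 256.
        acc = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / DFT_POINTS)
                  for i in range(win_len))
        freqs.append(k * SAMPLE_RATE_HZ / DFT_POINTS)
        mags_db.append(20 * math.log10(abs(acc) + 1e-12))
    return freqs, mags_db
```

A short window of this kind gives broad frequency resolution, which smooths over individual harmonics and makes formant bands easier to see, at the cost of precision in locating each peak.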

2.3. Formant Decisions
In this study the spectrogram of each vowel served as both the first reference and the basis for the final decision on formant values. First, formant values on the spectrogram were estimated by drawing a pencil line through the center of each formant band with a ruler. Then, the visual estimates were compared with the estimates computed automatically by one of the KLSPEC software packages from the two DFT harmonic spectra (average envelope and LPC envelope). For reliability, the measurements were checked by an independent observer.

3. Data analysis and discussion

Tables I and II list the average values of F0 and the first three formants (F1, F2, and F3) of the Korean vowels, together with their standard deviations. As the tables show, the standard deviations for males are generally lower than those for females. Figure 1 illustrates the vowel spaces of males and females, in which adjacent vowel points are connected peripherally. Phonemes are given near the various symbols, and the same symbol indicates the same phoneme. The two vowel spaces appear triangular. A thicker line connects the male vowels, while a thinner line connects the female vowels. The vowel spaces show a somewhat systematic relationship, which implies a trend toward higher formant frequencies for female speakers. This tendency may be largely due to non-linguistic factors, because the linguistic factors in this experiment were kept as homogeneous as possible within each group. To examine that relationship, the percent difference (Diff.) between the male and female formant frequencies for each vowel formant was calculated by Equation (6), in which Fnf denotes the nth female formant value. Diff. in F0 was 59%. The average formant Diff. across all the vowels was 18%. Specifically, the Diff. was 18% for F1, 20% for F2, and 17% for F3. Total variation excluding F0 amounted to 8%.
The following discussion focuses on three non-linguistic factors in the data. The first is fundamental frequency. The average F0 was 169 Hz for males and 269 Hz for females, so the female average F0 was about 1.6 times that of males. A question arises as to whether F0 can serve as an independent source of speaker normalization. The present data showed a strong negative correlation between F0 and F1 in males (r = -0.92). The correlation among the females was weaker (r = -0.73). The correlation of F0 with F2 or F3, however, was very weak (r < 0.39).
The second factor is overall vocal tract length. From the F3 of the vowel [/\], the average vocal tract length was estimated to be 16 cm for males and 14 cm for females. Fig. 2 shows the average values with bars of +/- one standard deviation. The uniform scaling method was employed to normalize the gender difference in vocal tract length. The uniform scale factor is 0.87, estimated from Eq. (1). The data were scaled by Eq. (2). Then, a numerical criterion was used to see how closely the female data were scaled to the male reference data. For that purpose, Fnf in Eq. (3) was replaced by Fnf(scaled). The uniform scaling method yielded an average Diff. of less than 6% across all the vowels. Regression analyses were conducted to find regression equations (7) and (8) for the relationship between all the female formant values (Fnf) and those of males (Fnm), or vice versa. The slope of 0.85 is similar to the average of F3f/F3m (=0.87) for the uniform scaling method. With small intercepts, both equations imply that one may expect a good fit with a regression equation through the origin (zero intercept). Thus, the uniform scale factor can be easily determined through the regression. The r² values indicate that female formant values can be accurately predicted from male values, or vice versa.
The third factor is the difference in size between the mouth and pharynx cavities. The lengths of the front and back cavities were estimated from the Korean vowel [i] using Eqs. (4) and (5). The average front cavity length was about 7.5 cm for the Korean male group and 6 cm for the female group. The average back cavity length was about 5.5 cm for the males and 5 cm for the females. Fig. 3 shows the average values with bars of +/- one standard deviation. Generally, male speakers have a back cavity that is longer than the front cavity. The average male-female difference in the front cavity is small, but the difference in the back cavity is almost twice that of the front cavity. Diff. for the Korean vowel [i] was 1% whereas for F2 it was about 27%. 
Since F2 of [i] depends on the length of the pharynx, this difference provides evidence for the non-uniform shortening of this cavity in the Korean data. As argued by Fant (1975), this circumstance also contributes to making scale factors vowel- and formant-number-specific. Since the scale factors in the non-uniform methods were derived from the average variation of the six European languages, they may not be appropriate for scaling the Korean data. However, one can reason that a formant-number-specific method may yield about 2% more reduction, since the deviation between formant numbers was 2% in the previous analysis. Moreover, if both formant-number-specific and vowel-specific scale factors are used, the improvement will amount to a reduction of less than 6%.
In summary, this study first examined F0 and the first three formants of the ten Korean vowels produced by 20 male and female speakers. Second, the gender variation was examined. The average formant difference across all vowels was 18%. Third, the data were examined in terms of fundamental frequency, vocal tract length, and the ratio of pharynx to mouth cavity. The uniform scaling method resulted in a difference of less than 6% between the scaled and reference data. The slope of a regression equation with a small intercept could be used as the uniform scale factor. Non-uniform scaling was expected to yield a reduction of less than 6% in the gender difference.
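The percent difference and the zero-intercept regression discussed in this section can be illustrated with a short sketch. The Diff. formula is assumed here to be 100 x (Fnf - Fnm) / Fnm; this form is an assumption, chosen because it reproduces the reported 59% from the average F0 values of 169 Hz and 269 Hz. The least-squares slope through the origin is the standard estimate, sum(xy) / sum(x²). The function names and the formant pairs in the example are hypothetical and are not data from this study.

```python
# Sketch: percent difference and regression through the origin.
# Function names and the example formant pairs are illustrative assumptions.


def percent_diff(female_hz, male_hz):
    """Assumed form of Diff.: 100 * (Fnf - Fnm) / Fnm."""
    return 100.0 * (female_hz - male_hz) / male_hz


def slope_through_origin(x, y):
    """Least-squares slope of y = b * x with zero intercept: sum(x*y) / sum(x*x)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Average F0 values reported in the text: 169 Hz (males), 269 Hz (females).
f0_diff = percent_diff(269.0, 169.0)

# Hypothetical female and male formant pairs, used only to show the calculation.
female_formants = [400.0, 1000.0, 2800.0, 3300.0]
male_formants = [340.0, 850.0, 2380.0, 2805.0]
scale_factor = slope_through_origin(female_formants, male_formants)
```

In this reading, the slope obtained by regressing male values on female values plays the role of the uniform scale factor, so multiplying each female formant by it gives the scaled value that is compared with the male reference.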