DanPASS - Danish Phonetically Annotated Spontaneous Speech

Nina Grønnum

Goal

The corpus is intended for acoustic and perceptual phonetic investigations. That is, the primary goal is not syntactic, pragmatic, socio-linguistic or psychological, nor does it concern any other aspect of spoken language one might wish to investigate. There are therefore a considerable number of discourse variables that have not been taken into account in the choice of elicitation material. Nevertheless, the corpus may serve as a basis for any number of linguistic and/or speech technological investigations.

Carlsbergfondet financed the project.

The Corpus

The corpus consists of monologues and dialogues with word lists. Apart from the word lists, the corpus represents an approximation to speech in a natural setting: On the one hand the material for elicitation is controlled in the sense that the speakers are given specific tasks to talk about. This is to facilitate comparison across speakers and ensure sufficiently uniform materials for subsequent analyses. On the other hand the speech is non-scripted.

Monologues

The monologues were recorded in 1996. They represent various types of instructions. The speaker was seated alone in the professional recording studio in the former Department of General and Applied Linguistics (now part of the Department of Scandinavian Studies and Linguistics) and could communicate with the experimenter only via microphone and headphones. Once the speaker had been instructed in the specific task, (s)he could no longer address the experimenter with questions or comments. In other words, the monologues were recorded in one-way communication with an unseen partner who offered no feedback, whether in the form of questions or of confirmation. Speakers were recorded with professional equipment (Sennheiser Microphone ME64, Revox A700, Agfa PEM368 tape). The analog recordings were digitized later, at a sampling frequency of 48 kHz, and transferred to cd-roms.

The speaker first described a network consisting of various geometrical shapes in various colours. It is an elaboration of Swerts and Collier's (1992) network. It was specifically intended to reveal whether or not speakers look ahead and signal prosodically an upcoming utterance boundary prior to its actual occurrence.

The network.

The speaker then guided the experimenter through four different routes in a virtual city map, inspired by Swerts (1994).

Slotsby.

Finally, the speaker – who had a model – told the experimenter how to assemble a house from its individual pieces. This house is an almost exact copy of Terken's (1984) edifice.

The house.

Instructions as they were read to the speakers.

Dialogues

The dialogues were recorded in the summer of 2004. They are replicas of the Human Communication Research Centre's Map Tasks (cf. Anderson et al., 1991; Brown et al., 1984; http://www.hcrc.ed.ac.uk/maptask/).

The exercise involved the cooperation of two participants. They were seated in separate locations, one in the department's recording studio, the other in a recording facility established for the purpose in the main control room, with curtains of very heavy material surrounding the speaker. The speakers communicated via headsets.

A laboratory set-up like this is hardly the most natural environment for communication, but it turned out to be necessary in order to obtain recordings of sufficiently good quality for subsequent acoustic analysis. When the speakers were seated in the same room, across from each other with eye contact, speaker A could invariably be heard over speaker B's microphone, and vice versa. When they were separated we got clean acoustic signals, with no appreciable difference in quality between the studio proper and the ad hoc studio established in the control room. Given the setting, i.e. the lack of visual and direct auditory contact, the participants would presumably be most comfortable if they were not also to communicate with a stranger. Accordingly, the two members of a pair knew each other well. They were recorded via professional headset microphones (Voice Technologies VT700), directly onto cd-roms (HHB Professional Compact Disc Recorder CDR-850) to separate channels in a stereo recording.

Each participant had a map. One, the giver, had a route on his or her map; the other, the follower, did not. Their goal was to collaborate so as to reproduce the giver's route on the follower's map. The maps are not exactly identical: Landmarks are missing on one or the other map, a landmark may appear twice – in two different locations – on one map but in only one location on the other; and the same landmark may have slightly different names on the two maps. This gives rise to a true negotiation, with questions and answers, backtracks, etc. Participants were explicitly informed about these irregularities in written instructions prior to the recording. It was left to them, however, to discover how and where the maps or the designations differed, and to supply the missing items and correct the names on their respective maps. Each pair of speakers completed four different sets of maps.

Written instructions for the speakers.

First set of maps.

Second set of maps.

Third set of maps.

Fourth set of maps.

Word Lists

After completion of the map sessions subjects were asked to read a word list containing all the feature names from the maps they had encountered. Each name appeared twice, in random order, and subjects were asked to read the list in a distinct speech mode. The lists provide citation forms for comparison with the less distinct dialogue forms. Landmarks and names in the original English maps were designed with specific phonological phenomena and processes in mind. I was more or less bound by the landmarks and their translation into Danish, with only moderate influence over phonological structure.

Word lists.

Speakers

There were 27 speakers, all of them students or former students or colleagues in the Department of General and Applied Linguistics, and all except one originated in the greater Copenhagen area. None had any known speech or language deficits.

18 speakers recorded the monologues, 13 men and 5 women. 22 speakers recorded the dialogues, 13 of whom also recorded the monologues.

The speakers appeared to be comfortable with the task(s) and the experimental setting. They produced fluent speech for both monologues and dialogues and were not in any obvious way influenced by the non-naturalness of the circumstances.

Information about the speakers.

Video Recording

In the studio proper a video-recorder was mounted. The camera was placed as close as possible, and as nearly perpendicular as possible, to the frontal plane of the speaker's face without impeding his/her view of the map. The videos are intended as analysis material for whoever should want to attempt to accompany synthetic Danish speech with a model talking face.

Each speaker had to serve as giver as well as follower, in alternation. Each speaker also had to be video-recorded in both roles. The logistics of running two video-cameras were prohibitive, and we had only one. Accordingly, after two map sessions, with speaker A being giver and follower, respectively, the speakers changed places in order for speaker B to be video-recorded as well. Thus, each pair of speakers had a run through four different sets of maps. A complete recording session lasted 30-40 minutes.

Statistics

There are 9 hours and 46 minutes of speech altogether: 2 hours and 51 minutes of monologues and 6 hours and 55 minutes of dialogues, including the word lists. There are 2110 orthographically different word forms in the corpus as a whole, 1075 in the monologues and 1593 in the dialogues, with considerable overlap between the two, of course. The total number of running words is approximately 21,000 in the monologues and 52,500 in the dialogues, i.e. a grand total of approximately 73,500 running words.
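The figures above can be cross-checked with a few lines of Python. This is only a sanity check on the numbers quoted in this section; the variable names are illustrative, and the count of shared word forms is derived by inclusion-exclusion on the assumption that the three word-form counts can be taken at face value.

```python
from datetime import timedelta

# Durations as quoted in the Statistics section.
monologues = timedelta(hours=2, minutes=51)
dialogues = timedelta(hours=6, minutes=55)  # includes the word lists
total = monologues + dialogues

# Approximate running-word counts as quoted in the Statistics section.
words_monologues = 21_000
words_dialogues = 52_500
words_total = words_monologues + words_dialogues

# Distinct word forms: the two subcorpora overlap, so their counts do not
# add up to the corpus-wide count. The surplus is the number of shared forms.
forms_monologues, forms_dialogues, forms_total = 1075, 1593, 2110
forms_shared = forms_monologues + forms_dialogues - forms_total

print("total duration:", total)
print("running words:", words_total)
print("word forms in both monologues and dialogues:", forms_shared)
```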

A dictionary comprising the words from monologues and dialogues was extracted from the texts and supplied with a phonological transcription as well as an idealized phonetic transcription, with the frequency of occurrence of each word.

The dictionary.

Processing

Monologues and dialogues were transcribed orthographically in standard orthography, without punctuation, with capital letters for proper names only, with indication of empty and filled pauses, and with marks for articulatory hesitation. The orthographical representation is supplemented with stress marks (commas directly before the vowel letter representing the vowel of the stressed syllable), intended for those who are interested only in the distribution of stress across the texts, regardless of the pronunciation. – There is also an edition of these texts with prosodic phrase boundaries ("|") added.

Orthographical transcription of the monologues.

With prosodic phrase boundaries.

Orthographical transcription of the dialogues.

With prosodic phrase boundaries.
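As a minimal sketch of how the orthographic conventions described above might be handled in software, the functions below remove the stress commas, locate the stressed letters, and split a text at the prosodic phrase marker "|". The sketch assumes that a comma never occurs for any other reason (the transcription has no punctuation) and makes no attempt to interpret pause or hesitation marks. The function names and the sample string are invented for illustration and are not taken from the corpus.

```python
def strip_stress(text):
    """Remove stress marks (commas placed before the stressed vowel letter)."""
    return text.replace(",", "")


def stressed_positions(word):
    """Return indices, in the stress-free word, of letters that carried a stress mark."""
    positions = []
    plain_index = 0
    for ch in word:
        if ch == ",":
            positions.append(plain_index)  # the next letter is the stressed vowel
        else:
            plain_index += 1
    return positions


def split_phrases(text):
    """Split a transcription at prosodic phrase boundaries marked with '|'."""
    return [part.strip() for part in text.split("|") if part.strip()]


sample = "den gr,ønne c,irkel | til v,enstre"  # invented example, not corpus data
print(strip_stress(sample))
print(split_phrases(sample))
print(stressed_positions("c,irkel"))
```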

The speech signals are segmented and annotated in Praat. The acoustic signal is segmented into prosodic phrases, words and syllables, always to the nearest zero-crossing in the waveform.

Segmentation conventions.
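The idea of moving a boundary to the nearest zero-crossing can be sketched as follows. This is not the procedure used in Praat, which works with interpolated times; the hypothetical function below operates on a plain list of sample values and returns a sample index, purely to illustrate the principle.

```python
def nearest_zero_crossing(samples, index):
    """Return the sample index closest to `index` where the signal crosses zero.

    A crossing is an exact zero, or a sign change between a sample and its
    successor. If the signal never crosses zero, `index` is returned unchanged.
    """
    n = len(samples)

    def is_crossing(i):
        if samples[i] == 0:
            return True
        return i + 1 < n and (samples[i] < 0) != (samples[i + 1] < 0)

    # Search outwards from the requested position, earlier side first.
    for distance in range(n):
        for candidate in (index - distance, index + distance):
            if 0 <= candidate < n and is_crossing(candidate):
                return candidate
    return index


signal = [1.0, 0.5, -0.5, -1.0, -0.2, 0.3, 0.8]  # invented sample values
print(nearest_zero_crossing(signal, 3))
```

A sample index can be converted to a time in seconds by dividing by the sampling frequency.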

There are ten separate interval tiers for (1) the orthographical representation, (2) a detailed part-of-speech (PoS) tagging, (3) a simplified PoS-tagging, (4) an abstract phonological representation, (5) a phonetic representation of each word as it would sound in distinct pronunciation uttered in isolation and in a norm which would characterize a majority of the speakers at the time of recording, (6) a semi-narrow phonetic transcription with the same domain boundaries as tiers 1-4 (this is to facilitate combined searches in the phonological and phonetic representations), (7) the same semi-narrow phonetic transcription but in syllable-sized domains, (8) a symbolic representation of the pitch relation between each stressed and its first post-tonic syllable, (9) a symbolic representation of the phrasal intonation contour, and (10) a tier for comments.

In a project headed by Patrizia Paggio at the Centre for Language Technology, University of Copenhagen, the information structure of the monologues was analysed, and topic and focus tags have been added to the orthographical representation in a separate tier (11).

A Praat screen.
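For readers who want to work with the tiers outside Praat, the following is a rough sketch of reading interval tiers from a TextGrid saved in Praat's long text format. It uses only the standard library and a regular expression, handles interval tiers only, and is not a full TextGrid parser. The tier descriptions paraphrase the list above; the tier names and labels in the sample are invented, and the actual names in the DanPASS text grids may differ.

```python
import re

# Tier contents by number, paraphrased from the description in the text.
TIER_CONTENTS = {
    1: "orthography", 2: "detailed PoS", 3: "simplified PoS",
    4: "phonological representation", 5: "distinct-pronunciation phonetic form",
    6: "semi-narrow phonetic transcription (word domains)",
    7: "semi-narrow phonetic transcription (syllable domains)",
    8: "stress and pitch relation", 9: "phrasal intonation",
    10: "comments", 11: "topic and focus (monologues only)",
}

# One interval in the long text format: xmin, xmax and a quoted label.
# A double quote inside a label is written as two double quotes.
INTERVAL = re.compile(
    r'xmin = ([-+.\deE]+)\s+xmax = ([-+.\deE]+)\s+text = "((?:[^"]|"")*)"'
)


def read_interval_tiers(textgrid):
    """Return one list of (start, end, label) tuples per tier, in file order."""
    tiers = []
    # Every tier begins with a numbered 'item [n]:' header.
    for chunk in re.split(r"item \[\d+\]:", textgrid)[1:]:
        tiers.append([
            (float(start), float(end), label.replace('""', '"'))
            for start, end, label in INTERVAL.findall(chunk)
        ])
    return tiers


# Invented two-tier example; names and labels are placeholders.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.5
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "tier1"
        xmin = 0
        xmax = 1.5
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.7
            text = "den"
        intervals [2]:
            xmin = 0.7
            xmax = 1.5
            text = "gr,ønne"
    item [2]:
        class = "IntervalTier"
        name = "tier2"
        xmin = 0
        xmax = 1.5
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 1.5
            text = ""
'''

tiers = read_interval_tiers(SAMPLE)
for number, intervals in enumerate(tiers, start=1):
    print(number, TIER_CONTENTS[number], intervals)
```

Tier n of a file is then `tiers[n - 1]`, so `tiers[0]` holds the orthographic words with their start and end times.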

Annotation

The PoS-tagging in tiers 2 and 3 is automated. The tagger, developed by Peter Juel Henrichsen, Department of International Business Communication at Copenhagen Business School, was trained on written language, not spontaneous speech (Henrichsen, 2002). At the outset there was no way to predict how well the tagger would perform on non-scripted speech. However, although the tagger does make mistakes, they are not random. They are more or less confined to certain types, as revealed in the subsequent manual proof-reading process, and on the whole the tagger is efficient and reliable.

The original complete set of part-of-speech tags.

Full versus reduced tag correspondences.

However, the tag set (which was originally designed to be applicable to many more languages than Danish) is not quite appropriate in two respects: it has no category 'article', and it over-generalizes the category 'demonstrative pronoun' to include definite articles. Accordingly, I have added indefinite and definite articles to the tags.

Ruben Schachtenhaufen created two part-of-speech dictionaries, one alphabetical, one in descending frequency of occurrence:

PoS dictionary – alphabetical.

PoS dictionary – frequency.

The phonological representation in tier 4 is fairly abstract where the segments are concerned, in accordance with the phonological analysis of Danish in Grønnum (2005), but stress marks are added to polysyllables, and stød is designated as well, although both stress and stød are to a very large extent predictable from the segmental and morphological structure and thus – strictly speaking – phonologically redundant. Adding stress and stød, however, will presumably facilitate certain search procedures at a later stage. (Stød is a special kind of creaky voice characterizing certain syllable types under certain morphological conditions. See, e.g., Grønnum and Basbøll (2007) and Grønnum et al. (2013).)

The phonetic transcription in tiers 5, 6 and 7 is semi-narrow, with a fairly liberal use of relevant diacritics. Note however – in the cardinal vowel diagram in the link below – how the vowel symbols are conventionalized in order to avoid excessive use of vowel diacritics in the transcription.

Phonetic segmental transcription conventions.

The phonetic transcription has been extracted from tier 6 to running text.

Phonetic transcription of the monologues.

And broken into prosodic phrases.

Phonetic transcription of the dialogues.

And broken into prosodic phrases.

Two pronouncing dictionaries have been extracted. Ruben Schachtenhaufen built one which also gives the frequency of occurrence – in descending order – of each pronounced word form, or sequence of word forms (certain sequences of words could not be delimited acoustically). They appear as separate entries in the dictionary. Incomplete words are listed separately.

Pronouncing dictionary with frequency of occurrence.

Incomplete word forms.
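A frequency-ordered dictionary of this kind can be approximated in a few lines. The sketch below is not the procedure actually used to build the DanPASS dictionaries; it simply counts a list of word forms and sorts them in descending order of frequency. The function name and the sample forms are invented for illustration.

```python
from collections import Counter


def frequency_dictionary(forms):
    """Return (form, count) pairs in descending frequency.

    Ties are broken alphabetically so that the output is reproducible.
    """
    counts = Counter(forms)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


forms = ["og", "den", "og", "så", "den", "og"]  # invented example
for form, count in frequency_dictionary(forms):
    print(form, count)
```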

On Ruben Schachtenhaufen's website there is an edition of the pronouncing dictionary with sound bites.

Another dictionary gives the reference to each occurrence in the textgrids.

Pronouncing dictionary with references.

Tier 8 contains the symbolic representation of syllable prominence (stress) and the pitch relation between stressed and post-tonic syllable. Tier 9 contains the symbolic representation of phrasal intonation.

Prosodic transcription conventions.

Readers familiar with the ToBI convention for transcribing prosody (e.g., Silverman et al., 1992) should note that any similarity with our annotation is merely superficial. For the description of Danish intonation the phonological assumptions behind ToBI are inappropriate, and as a phonetic transcription system it is not sufficiently fine-grained for our purpose (Grønnum 1985, 1986, 1995). For a general critique of ToBI, see Kohler (2005, 2006, 2007).

Note that, for reasons to do with time and resources, the pitch relation between successive prosodic phrases is not represented. Given the flexibility of Praat, it can easily be added to the grid if and when the need arises.

As mentioned above, the monologues have been supplied with tags for information topic (T) and information focus (F) in tier 11, cf. Paggio (2006).

Patrizia Paggio's guidelines for focus and topic tagging.

Although topic and focus have been tagged in the text grids, text files are available for easier reading. In the texts, stress marks are omitted, but pauses are retained from the orthographical representation in tier 1. If you open the files in Wordpad or emacs (unix), sentence boundaries (indicated in the textgrids by "boundary") are shown as line breaks. Please note that the texts were focus-tagged prior to the final proof-reading, so minor, immaterial differences may appear between tiers 1 and 11.

Download the focus and topic tagged text files.

Tier 10 is for ad hoc comments.

The segmental and prosodic annotation in tiers 5-8 was performed independently and in parallel by two assistants. Disagreements between them were resolved in conferences with Nina Grønnum. Subsequently, NG proof-read the entire file. This procedure was repeated at every step: first the phonetic transcription, then the stress-and-pitch relation and finally the phrasal intonation.
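Locating the points where two parallel annotations differ can be sketched as below. This is a hypothetical helper, not the project's actual tooling: it assumes that both annotators labelled the same sequence of intervals, so that labels can be compared position by position, and the labels in the example are invented placeholders.

```python
def disagreements(labels_a, labels_b):
    """Return (position, label_a, label_b) for every interval labelled differently."""
    if len(labels_a) != len(labels_b):
        raise ValueError("the two annotations must cover the same intervals")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(labels_a, labels_b))
        if a != b
    ]


annotator_1 = ["x", "y", "z", "y"]  # invented placeholder labels
annotator_2 = ["x", "z", "z", "y"]
print(disagreements(annotator_1, annotator_2))
```

Each reported position is a candidate for discussion; identical labels need no further attention.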

Ruben Schachtenhaufen extracted information from the tiers in the textgrids to facilitate search procedures and statistical calculations. He further added lemmatization, orthography without stress information (commas), phoneme transcription with phonological syllable boundaries inserted according to the principles in Grønnum 2007, and finally a phonetic transcription as generated by the phonetic manifestation rules he formulated in his dissertation (Schachtenhaufen 2013). Note that his rules and transcriptions pertain to a somewhat younger norm than that represented by the speakers in the corpus. Accordingly there are minor but systematic differences between his idealized forms and mine. I transformed Schachtenhaufen's .txt files to spreadsheets:

Monologues.

Dialogues.

Access

The zipped folders with sound files and text grids, respectively, are here for downloading. The dialogue sound files exist as stereo-sound (both speakers) and as mono-sound (each speaker in a pair). Note that to open the sound files you need a password. Contact ninag @ hum.ku.dk.

Sound files – monologues.
Sound files – dialogues; stereo-sound.
Sound files – dialogues; mono-sound.
Sound files – word lists.
Text grids. (Last update February 2014.)

Student projects

A number of student projects, BA as well as MA theses, are based on DanPASS data.

Student projects.

Publications based on the DanPASS corpus

Christiansen, Thomas Ulrich & Henrichsen, Peter Juel (2012), Speech transduction based on linguistic content, BNAM2012, Joint Baltic-Nordic Acoustics Meeting, June 18th-20th 2012 Odense, Denmark.

Grønnum, Nina (2009), A Danish phonetically annotated spontaneous speech corpus (DanPASS), Speech Communication 51, 594-603; doi:10.1016/j.specom.2008.11.002.

Grønnum, Nina & Tøndering, John (2007), Question intonation in non-scripted Danish dialogues, in Proceedings of the XVIth International Congress of Phonetic Sciences 2007, Saarbrücken: Saarland University, pp. 1229-1232.

Heegård, Jan (2012a) Funktionel udistinkthed: Danske verber og adjektivers te- og ede-endelser, Danske Talesprog 12, 34-61.

Heegård, Jan (2012b) Functional indistinctiveness: Danish verbal and adjectival -te and -ede endings, in Heegård, J. & P.J. Henrichsen (eds.) Speech in Action – Proceedings of the 1st SJUSK Conference on Contemporary Speech Habits, Copenhagen Studies in Language 42, 29-52.

Henrichsen, Peter Juel & Christiansen, Thomas Ulrich (2011), Information based speech transduction, ISAAR-2011, International Symposium on Auditory and Audiological Research, Nyborg, 24-26 August 2011.

Henrichsen, Peter Juel (2010), Den Lilla Trekant : Learning Danish Shape and Color Terms from Scratch, Copenhagen Studies in Language 40, 27-44.

Jensen, Christian & Tøndering, John (2005a), Perceived prominence and scale types, in A. Eriksson & J. Lindh (eds.), Proceedings FONETIK 2005, The XVIIIth Swedish Phonetics Conference, May 25-27 2005, Göteborg, Sweden: Göteborg University, Department of Linguistics, pp. 111-114.

Jensen, Christian & Tøndering, John (2005b), Choosing a Scale for Measuring Perceived Prominence, in Proceedings of Interspeech 2005, September 4-8, Lisbon, Portugal, pp. 2385-2388.

Mortensen, Johannes & Tøndering, John (2013), The effect of vowel height on Voice Onset Time in stop consonants in CV sequences in spontaneous Danish, in Proceedings of Fonetik 2013, 12-13 June, Linköping University, Sweden, pp. 49-52.

Pharao, Nicolai (2009), Consonant Reduction in Copenhagen Danish - A study of linguistic and extra-linguistic factors in phonetic variation and change. Ph.D. dissertation.

Pharao, Nicolai (2011), Plosive reduction at the group level and in the individual speaker, Proc.Phon.XVII, Hong Kong 17-21 August 2011, 1590-1593.

Schachtenhaufen, Ruben (2013), Fonetisk reduktion i dansk, Ph.d.-afhandling. Ph.d. skolen LIMAC, Copenhagen Business School.

Tøndering, John (2003), Intonation contours in Danish spontaneous speech, in M.J. Solé D. Recasens and J. Romero (eds.), Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona 3-9 August 2003. Barcelona, Spain: Universitat Autònoma de Barcelona, pp. 1241-1244.

Tøndering, John (2008), Skitser af prosodi i spontant dansk, Ph.d.-afhandling. Det Humanistiske Fakultet, Københavns Universitet.

Acknowledgements

This project would not have been possible without extensive help from many people, and not without external funding either. The Carlsberg Foundation provided a grant of 1.08 million Danish kroner.

A number of individuals have each contributed invaluable assistance: Preben Dømler and Svend-Erik Lystlund assisted at the recordings. Gert Foget Hansen segmented a part of the monologues. Maja Dyrby and Line Burholt proof-read the PoS-tags. John Tøndering transcribed orthographically all the monologues. He has written a number of immensely useful scripts for Praat, to locate mistakes, to move boundaries etc. He also used the corpus for his own ph.d. project and liberally shared his results with me. Nicolai Pharao supplied the 2140 word forms with an abstract phonological representation. The major and most tedious work, however, is the responsibility of the transcribers, Cem Avus, Jeppe Beck, Andreas Geisler, Louise Astrid Johansson, Ruben Schachtenhaufen and Thit Wange Stærkær. Line Burholt Kristensen and Tina Ringkjær performed the topic and focus annotation of the monologues. Finally, without the twenty-seven speakers who gave liberally of their time and enthusiasm, none of this would have been possible. – Lately, Ruben Schachtenhaufen – in connection with his ph.d. project – added a number of very useful resources to the corpus.

References

Anderson, A.H., Bader, M., Bard, E.G., Boyle, E., Doherty, G., Garrod, S., Isard, S., Kowtko, J., McAllister, J., Miller, J., Sotillo, C., Thompson, H.S., Weinert, R., 1991. The HCRC Map Task Corpus. Language and Speech 34, 351-366.

Brown, G., Anderson, A., Shillcock, R., Yule, G., 1984. Teaching Talk. Cambridge University Press, Cambridge.

Grønnum, N., 1985. Intonation and text in Standard Danish, J. Acoust. Soc. Am. 77, 1205-1216.

Grønnum, N., 1986. Sentence intonation in textual context – supplementary data, J. Acoust. Soc. Am. 80, 1040-1047.

Grønnum, N., 1990. Prosodic parameters in a variety of regional Danish standard languages, with a view towards Swedish and German. Phonetica 47, 182-214.

Grønnum, N., 1995. Superposition and subordination in intonation – a nonlinear approach. In Elenius, K., Branderud, P. (Eds.) Proc. XIIIth Int. Cong. Phonetic Sc. Stockholm 1995, vol. II. KTH and Stockholm University, Stockholm, 124-131.

Grønnum, N., 2005. Fonetik og Fonologi, 3. udg., Akademisk Forlag, København.

Grønnum, N., 2007. Rødgrød med Fløde, Akademisk Forlag, København.

Grønnum, N. and Basbøll, H., 2007. Danish Stød – Phonological and Cognitive Issues. In Solé Sabater, M.-J., Beddor, P.S. and Ohala, M. (Eds.) Experimental approaches to Phonology. Oxford University Press, Oxford, 192-206.

Grønnum, N., Vazquez-Larruscaín, M. and Basbøll, H., 2013. Danish Stød – Laryngealization or Tone. Phonetica 70, 66-92.

Henrichsen, P. Juel, 2002. Sidste Års Aviser – Grammatisk opmærkning af et stort dansk aviskorpus. Lambda 27. Institut for Datalingvistik, Handelshøjskolen i København, København.

Jensen, C. and Tøndering, J., 2005. Choosing a Scale for Measuring Perceived Prominence, in Isabel Trancoso (Ed.) Proceedings of Interspeech 2005, September 4-8, Lisbon, Portugal, pp. 2385-2388.

Kohler, K.J., 2005. Timing and Communicative Functions of Pitch Contours. Phonetica 62, 88-105.

Kohler, K.J., 2006. Paradigms in Experimental Prosodic Analysis – From Measurement to Function, in Sudhoff, S., Lenertová, D., Meyer, R., Pappert, S., Augurzky, P., Mleinek, I., Richter, N., Schließer, J. (Eds.) Methods in Empirical Prosody Research. (= Language, context, and cognition, 3). de Gruyter, Berlin, New York.

Kohler, K.J., 2007. Beyond Laboratory Phonology. In Solé Sabater, M.-J., Beddor, P.S. and Ohala, M. (Eds.) Experimental approaches to Phonology. Oxford University Press, Oxford, 41-53.

Paggio, P., 2006. Annotating Information Structure in a Corpus of Spoken Danish, in: Calzolari, N., Choukri, K., Gangemi, A., Maegaard, B., Mariani, J., Odijk, J., Tapias, D. (Eds.), Proceedings from the 5th International Conference on Language Resources and Evaluation, Genova 24-26 May 2006 (cd-rom).

Silverman K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., Hirschberg, J., 1992. ToBI: A standard for Labeling English Prosody, in Proceedings of the International Conference on Spoken Language Processing, pp. 867-870.

Swerts, M., 1994. Prosodic features of discourse units. Technische Universiteit Eindhoven, Eindhoven.

Swerts, M. and Collier, R., 1992. On the controlled elicitation of spontaneous speech. Speech Communication 11, 463-468.

Terken, J.M.B., 1984. The distribution of pitch accents in instructions as a function of discourse structure. Language and Speech 27, 269-289.



Updated March 2016.