The TIMIT corpus of read speech was the first annotated speech database to be widely distributed, and it has an especially clear organization. TIMIT was developed by a consortium including Texas Instruments and MIT, from which it derives its name.

The goal of this chapter is to answer the following questions:

Along the way, we will study the design of existing corpora, the typical workflow for creating a corpus, and the lifecycle of a corpus. As in other chapters, there will be many examples drawn from practical experience managing linguistic data, including data that has been collected in the course of linguistic fieldwork, laboratory work, and web crawling.

The design of TIMIT also strikes a balance between two goals: having multiple speakers say the same sentence, which permits comparison across speakers, and covering a large range of sentences, which maximizes coverage of diphones. Five of the sentences read by each speaker are also read by six other speakers (for comparability).
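To make the idea of diphone coverage concrete, the following minimal sketch counts diphones (pairs of adjacent phones) across a set of utterances. The utterance names and phone sequences are invented for illustration and are not taken from TIMIT; the helper `diphones` is likewise hypothetical.

```python
from collections import Counter

def diphones(phones):
    """Return the adjacent phone pairs (diphones) of one utterance."""
    return list(zip(phones, phones[1:]))

# Invented phone sequences, for illustration only (not real TIMIT data).
utterances = {
    "utt1": ["sh", "iy", "hv", "ae", "d"],
    "utt2": ["sh", "iy", "d", "ih", "d"],
}

# Tally how often each diphone occurs across all utterances.
coverage = Counter()
for phones in utterances.values():
    coverage.update(diphones(phones))

distinct = len(coverage)                                  # number of diphone types covered
shared = [pair for pair, n in coverage.items() if n > 1]  # diphones seen more than once

print(distinct, shared)
```

Adding sentences with new phone sequences raises the number of distinct diphones, while repeating the same sentence across speakers raises the counts of diphones that are already covered. That is the trade-off the corpus design has to balance.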

TIMIT illustrates several key features of corpus design. First, the corpus contains two layers of annotation, at the phonetic and orthographic levels. At the top level, there is a split between training and testing sets, which gives away its intended use for developing and evaluating statistical models. Finally, notice that even though TIMIT is a speech corpus, its transcriptions and associated data are just text, and can be processed using programs just like any other text corpus.
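Because both annotation layers are plain text, ordinary string processing is enough to relate them. The sketch below assumes each layer is stored as lines of the form `start end label`, where the offsets mark a time span in the recording. That layout, the offsets, and the helper names `parse_layer` and `align` are assumptions made for illustration, not a definitive description of the corpus files.

```python
def parse_layer(text):
    """Parse lines of 'start end label' into (start, end, label) tuples."""
    rows = []
    for line in text.strip().splitlines():
        start, end, label = line.split(maxsplit=2)
        rows.append((int(start), int(end), label))
    return rows

# Invented annotations with made-up offsets (assumed 'start end label' layout).
WORDS = """\
0 100 she
100 250 had
"""
PHONES = """\
0 40 sh
40 100 iy
100 150 hv
150 210 ae
210 250 d
"""

def align(words, phones):
    """Group phones under the word whose time span contains them."""
    aligned = []
    for w_start, w_end, word in words:
        inside = [label for p_start, p_end, label in phones
                  if w_start <= p_start and p_end <= w_end]
        aligned.append((word, inside))
    return aligned

alignment = align(parse_layer(WORDS), parse_layer(PHONES))
print(alignment)
```

Nothing here is specific to speech: the same parsing and grouping would work for any annotation stored as labeled spans over a shared timeline or character offset.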

