Released in 2010, the TPS Frequency Dictionary of Mandarin Chinese is not just another character-frequency-based dictionary. It is designed to give students a guide for learning new characters, words, and phrases gradually, building on characters they already know. The 24,000 entries are arranged according to a Triple Progression System in which words are grouped first by character frequency, then by word frequency, and filtered so that new words and phrases appear only after all of their component characters have been introduced.
Complete Word List
- 2,500 lines, grouped as in the dictionary, but without definitions (UTF-8 encoded TEXT)
Let's pull back the curtain and take a look at how the dictionary came to be.
The dictionary contains a total of 26,704 entries of between one and four characters in length. Each entry appears in one of 2,500 character sections, arranged according to character frequency. Within each character section, words are listed in frequency order and placed in numbered groups from 1 through 5, meant to give a general sense of how common the words are: group 1 words are the most common, while group 5 words are relatively rare. Every entry appears only once, and the character section in which it appears is determined by its least common component character. For example, the common compound 以后 is not listed with character #18 (以) but rather with #59 (后). This three-way ordering and filtering technique is what I call the "Triple Progression System."
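The placement rule described above can be sketched in a few lines of Python. This is only an illustration, not the script used to build the dictionary: the function names `section_for` and `arrange` are hypothetical, the ranks for 以 and 后 come from the example in the text, and the word frequencies are made-up values used purely to show the ordering.

```python
from collections import defaultdict

def section_for(word, char_rank):
    """Return the character section for a word: the rank of its least common character."""
    # A larger rank number means a less common character.
    return max(char_rank[ch] for ch in word)

def arrange(word_freq, char_rank):
    """Group entries by character section, then sort each section by descending frequency."""
    sections = defaultdict(list)
    for word in word_freq:
        sections[section_for(word, char_rank)].append(word)
    return {
        rank: sorted(words, key=lambda w: -word_freq[w])
        for rank, words in sorted(sections.items())
    }

# Ranks taken from the example in the text (以 is #18, 后 is #59).
char_rank = {"以": 18, "后": 59}

# Hypothetical frequencies, for illustration only.
word_freq = {"以": 900, "后": 500, "以后": 300}

print(section_for("以后", char_rank))  # 59
print(arrange(word_freq, char_rank))
```

Because a word is filed under its highest-ranked (least common) character, every character it contains has already been introduced by the time the word appears.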
The character and word data used to build the word list are derived from Jun Da's Chinese Text Computing website, and in particular the character and word frequency distribution data related to a news and information based subcorpus of modern Mandarin Chinese. In order to eliminate nonsense character combinations and other frequently occurring non-words, the CC-CEDICT Chinese-English dictionary was used to validate the data. Words and phrases not appearing in CC-CEDICT were excluded. (However, some high frequency non-word collocations that also happen to be low frequency words, such as 比亚, do appear in the list.)
In all, 152,688 bigrams, trigrams, and quadrigrams were processed using a Perl script that compared them with a list of the top 2,500 characters and the CC-CEDICT dictionary. The resulting list contains a total of 26,704 entries (2,500 single characters, 20,294 bigrams, 999 trigrams, and 2,911 quadrigrams).
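The original processing was done in Perl, and the script itself is not shown here. As a rough Python sketch of the same kind of check, assuming that a candidate is kept only when it is a dictionary headword and every one of its characters is in the top-character list: the function name `filter_candidates` and the tiny sample sets are hypothetical stand-ins for the real 2,500-character list and the CC-CEDICT headwords.

```python
from collections import Counter

def filter_candidates(ngrams, top_chars, dictionary_words):
    """Keep n-grams that are dictionary headwords made up only of top characters."""
    return [
        w for w in ngrams
        if w in dictionary_words and all(ch in top_chars for ch in w)
    ]

# Toy stand-ins; the real inputs would be the top 2,500 characters
# and the full set of CC-CEDICT headwords.
top_chars = {"以", "后", "的"}
dictionary_words = {"以后", "后来"}
candidates = ["以后", "的以", "后来"]

kept = filter_candidates(candidates, top_chars, dictionary_words)
print(kept)                               # ['以后']
print(Counter(len(w) for w in kept))      # tally of kept entries by length
```

In this toy run, 的以 is dropped because it is not a dictionary headword, and 后来 is dropped only because 来 is missing from the deliberately small sample character set.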
There are 42 character sections that do not include any word entries. Of these 42 characters, 14 appear in entries elsewhere in the dictionary. The other 28 characters do not appear anywhere in the dictionary.
The words accompanying each character section have been grouped into levels according to word frequency. Jun Da suggests three frequency (X) ranges (or "stages"): very low (X≤5), medium-low to low (5<X≤50), and medium to high (X>50) for the acquisition of new vocabulary by foreign learners of Chinese. For this project, the lowest frequency range has been eliminated, and the top range has been split up, resulting in a total of five ranges, or groups, as follows:
(1) 1,000 ≤ X: 1,495 words (most common)
(2) 250 ≤ X < 1,000: 2,979 words
(3) 100 ≤ X < 250: 3,251 words
(4) 50 ≤ X < 100: 3,037 words
(5) 5 ≤ X < 50: 13,442 words (least common)
This arrangement results in the smallest number of words at level one (the most common words), relatively uniform distribution over levels two through four, and a sizable group of the least common words at level five.
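The five ranges listed above translate directly into a small lookup. The following Python sketch uses a hypothetical function name, `frequency_group`, and assumes X is expressed in the same units as the source frequency data; values below 5 return `None` because that range was eliminated.

```python
def frequency_group(x):
    """Map a word frequency X to its group number (1 = most common, 5 = least common)."""
    if x >= 1000:
        return 1
    if x >= 250:
        return 2
    if x >= 100:
        return 3
    if x >= 50:
        return 4
    if x >= 5:
        return 5
    return None  # below the lowest range kept in the dictionary

# Word counts per group, as given in the list above.
group_sizes = {1: 1495, 2: 2979, 3: 3251, 4: 3037, 5: 13442}

print(frequency_group(1200))      # 1
print(frequency_group(75))        # 4
print(sum(group_sizes.values()))  # 24204
```

The five group sizes sum to 24,204, which equals the 26,704 total entries minus the 2,500 single-character entries.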