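Letter frequency
Letter frequency (also called character frequency) is the frequency with which each letter of an alphabet appears in written text. It is widely applied in cryptography, especially in frequency analysis, which can break classical ciphers. The most common letter in English is e. In the era of hot-metal typesetting, the letters on Linotype machines were already arranged by how often they were used, based on experience: etaoin shrdlu cmfwyp vbgkjq xz. Similarly, in Morse code the more frequently used letters have shorter codes; ordered from the quickest to the slowest to transmit, the letters are e it san hurdm wgvlfbk opjxcz yq. Data compression uses a similar idea: Huffman coding, for example, assigns codes according to the probability of each source symbol.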
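As an illustration of the Huffman idea, the sketch below builds Huffman code lengths from the English letter percentages quoted in this article and shows that frequent letters receive shorter codes. It is a minimal sketch, not a reference implementation: the names `ENGLISH_FREQ` and `huffman_code_lengths` are hypothetical, and only code lengths (not the actual bit strings) are computed.

```python
import heapq
from itertools import count

# English letter frequencies in percent, copied from the English table in this article.
ENGLISH_FREQ = {
    "a": 8.167, "b": 1.492, "c": 2.782, "d": 4.253, "e": 12.702, "f": 2.228,
    "g": 2.015, "h": 6.094, "i": 6.966, "j": 0.153, "k": 0.772, "l": 4.025,
    "m": 2.406, "n": 6.749, "o": 7.507, "p": 1.929, "q": 0.095, "r": 5.987,
    "s": 6.327, "t": 9.056, "u": 2.758, "v": 0.978, "w": 2.360, "x": 0.150,
    "y": 1.974, "z": 0.074,
}

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code built from freqs."""
    tiebreak = count()  # unique counter so the heap never has to compare lists
    heap = [(weight, next(tiebreak), [sym]) for sym, weight in freqs.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    while len(heap) > 1:
        # Merge the two lightest subtrees into one.
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1  # each merge pushes these symbols one level deeper
        heapq.heappush(heap, (w1 + w2, next(tiebreak), syms1 + syms2))
    return lengths

lengths = huffman_code_lengths(ENGLISH_FREQ)
for letter in "etzq":
    print(letter, lengths[letter])
```

With these figures, e (the most frequent letter) gets the shortest code and z (the least frequent) gets the longest, which mirrors the Morse code design described above.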
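Introduction
Analyses show that letter frequencies, like word frequencies, tend to vary with both the author and the subject. An essay about X-rays will contain far more of the letter x than usual, and an essay about using X-rays to treat zebras in Qatar will be full of the normally rare letters x, q and z. An author's letter usage can reveal some of their writing habits; Hemingway's style, for example, is clearly different from Faulkner's. Frequencies of letters, bigrams, trigrams and words, together with word lengths and sentence lengths, can be counted and used to support or refute the attribution of a work to a particular author, even when the styles in question are quite similar.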
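Accurate average letter frequencies can only be obtained by analysing a large amount of representative text. With modern computers and large text corpora, such counts are easy to produce. The website Deafandblind lists letter frequency orderings for several kinds of text (press reporting, religious texts, scientific texts and general fiction). The positions of h and i differ most markedly in general fiction, where the Linotype order "etaoin shrdlu" becomes "etaohn isrdlu".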
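Counting letters in a corpus takes only a few lines of code. The following is a minimal sketch using Python's standard library; the function name `letter_frequencies` and the sample string are illustrative assumptions, and a real study would read a large, representative corpus rather than a single phrase.

```python
from collections import Counter

def letter_frequencies(text):
    """Return {letter: relative frequency} over the alphabetic characters of text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    total = len(letters)
    return {letter: n / total for letter, n in counts.items()}

sample = "Hello, World!"  # tiny illustrative input, not a representative corpus
freqs = letter_frequencies(sample)
# Print letters from most to least frequent.
for letter, share in sorted(freqs.items(), key=lambda item: item[1], reverse=True):
    print(f"{letter}: {share:.1%}")
```

Case is folded and non-letters (spaces, digits, punctuation) are ignored here; whether to count them is a design choice that changes the resulting percentages.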
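In his classic introductory text on cryptography, Codes and Secret Writing, Herbert S. Zim gives the English letter frequency sequence as ETAON RISHD LFCMU GYPWB VKJXQ Z, the most common letter pairs as TH HE AN RE ER IN ON AT ND ST ES EN OF TE ED OR TI HI AS TO, and the most common doubled letters as LL EE SS OO TT FF RR NN PP CC.[1]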
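The 12 most used letters account for about 80% of all letter occurrences, and the 8 most used letters for about 65%. Several rank functions fit letter frequency as a function of rank well, the best of them being the two-parameter Cocho/Beta rank function.[2] Another rank function, with no adjustable parameters, also fits the letter frequency distribution reasonably well;[3] the same function can be used to fit amino acid frequencies in protein sequences.[4]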
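The cumulative shares can be checked approximately against any frequency table. The sketch below, with the hypothetical names `ENGLISH_FREQ` and `top_k_share`, sums the k largest percentages from the English table in this article. With those particular figures the top 8 letters come to about 63.6% and the top 12 to about 80.6%, close to the rounded values quoted.

```python
# English letter frequencies in percent, copied from the English table in this article.
ENGLISH_FREQ = {
    "a": 8.167, "b": 1.492, "c": 2.782, "d": 4.253, "e": 12.702, "f": 2.228,
    "g": 2.015, "h": 6.094, "i": 6.966, "j": 0.153, "k": 0.772, "l": 4.025,
    "m": 2.406, "n": 6.749, "o": 7.507, "p": 1.929, "q": 0.095, "r": 5.987,
    "s": 6.327, "t": 9.056, "u": 2.758, "v": 0.978, "w": 2.360, "x": 0.150,
    "y": 1.974, "z": 0.074,
}

def top_k_share(freqs, k):
    """Return the combined percentage of the k most frequent letters."""
    return sum(sorted(freqs.values(), reverse=True)[:k])

share_top8 = top_k_share(ENGLISH_FREQ, 8)
share_top12 = top_k_share(ENGLISH_FREQ, 12)
print(f"top 8:  {share_top8:.3f}%")
print(f"top 12: {share_top12:.3f}%")
```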
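When using the VIC cipher or other ciphers based on a straddling checkerboard, spies typically remember the eight most frequent letters with a mnemonic such as "a sin to err" (dropping the final r). Letter frequencies and frequency analysis also play a role in cryptograms and in word games such as Hangman, Scrabble, Bananagrams and the television game show Wheel of Fortune. An early literary description appears in Edgar Allan Poe's famous story "The Gold-Bug", in which knowledge of English letter frequencies is used to solve a substitution cipher and locate the treasure buried by Captain Kidd.[5]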
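To make the idea of frequency analysis concrete, the sketch below attacks the simplest special case of a substitution cipher, a Caesar shift, by trying all 26 shifts and keeping the candidate whose letter counts best match English frequencies under a chi-squared score. This is an assumption-laden illustration, not the method used in "The Gold-Bug": a general substitution cipher needs more work, and the names `shift_text`, `chi_squared` and `crack_caesar` as well as the sample plaintext are invented for this example.

```python
import string

# English letter frequencies in percent, copied from the English table in this article.
ENGLISH_FREQ = {
    "a": 8.167, "b": 1.492, "c": 2.782, "d": 4.253, "e": 12.702, "f": 2.228,
    "g": 2.015, "h": 6.094, "i": 6.966, "j": 0.153, "k": 0.772, "l": 4.025,
    "m": 2.406, "n": 6.749, "o": 7.507, "p": 1.929, "q": 0.095, "r": 5.987,
    "s": 6.327, "t": 9.056, "u": 2.758, "v": 0.978, "w": 2.360, "x": 0.150,
    "y": 1.974, "z": 0.074,
}

def shift_text(text, shift):
    """Shift each ASCII letter by `shift` positions; leave other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - 97 + shift) % 26 + 97))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - 65 + shift) % 26 + 65))
        else:
            out.append(ch)
    return "".join(out)

def chi_squared(text):
    """Score how far the letter counts of text are from English (lower is closer)."""
    letters = [ch for ch in text.lower() if ch in ENGLISH_FREQ]
    if not letters:
        return float("inf")
    total = len(letters)
    score = 0.0
    for letter, pct in ENGLISH_FREQ.items():
        expected = total * pct / 100
        score += (letters.count(letter) - expected) ** 2 / expected
    return score

def crack_caesar(ciphertext):
    """Try every shift and return the candidate plaintext that looks most English."""
    candidates = (shift_text(ciphertext, -shift) for shift in range(26))
    return min(candidates, key=chi_squared)

plaintext = ("the treasure is buried beneath the old oak tree "
             "on the eastern shore of the island")
ciphertext = shift_text(plaintext, 7)
recovered = crack_caesar(ciphertext)
print(ciphertext)
print(recovered)
```

The approach needs enough ciphertext for the counts to be meaningful; very short or unusual messages can defeat it.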
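Letter frequencies strongly influenced the design of some keyboard layouts. The Blickensderfer typewriter places the most common letters on the bottom row. The Dvorak layout places the most common letters on the home row, the middle row where the eight fingers other than the thumbs rest and which is easiest to reach.
Letter frequencies in English
The letter frequencies in English are as follows:[6]
Letter | Frequency in English | |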
---|---|---|
a | 8.167% | |
b | 1.492% | |
c | 2.782% | |
d | 4.253% | |
e | 12.702% | |
f | 2.228% | |
g | 2.015% | |
h | 6.094% | |
i | 6.966% | |
j | 0.153% | |
k | 0.772% | |
l | 4.025% | |
m | 2.406% | |
n | 6.749% | |
o | 7.507% | |
p | 1.929% | |
q | 0.095% | |
r | 5.987% | |
s | 6.327% | |
t | 9.056% | |
u | 2.758% | |
v | 0.978% | |
w | 2.360% | |
x | 0.150% | |
y | 1.974% | |
z | 0.074% |
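The table above is taken from the Algoritmy website.[7] Other tables differ slightly: the Math Explorer's Project at Cornell University, for example, obtained a broadly similar table from a count of 40,000 words, and Oxford University Press obtained slightly different percentages from an analysis of the entries in the Concise Oxford Dictionary.[8]
In English, the space is slightly more frequent than the most common letter, e[9] (about 107% of the frequency of e), and non-letter characters (digits, punctuation and so on) taken together rank fourth, between t and a.[10]
Frequencies of first letters of English words
First letter | Word frequency | |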
---|---|---|
a | 11.602% | |
b | 4.702% | |
c | 3.511% | |
d | 2.670% | |
e | 2.007% | |
f | 3.779% | |
g | 1.950% | |
h | 7.232% | |
i | 6.286% | |
j | 0.597% | |
k | 0.590% | |
l | 2.705% | |
m | 4.374% | |
n | 2.365% | |
o | 6.264% | |
p | 2.545% | |
q | 0.173% | |
r | 1.653% | |
s | 7.755% | |
t | 16.671% | |
u | 1.487% | |
v | 0.649% | |
w | 6.753% | |
x | 0.037% | |
y | 1.620% | |
z | 0.034% |
Letter frequencies in other languages
Letter | French [12] | German [13] | Spanish [14] | Portuguese [15] | Esperanto [16] | Italian [17] | Turkish | Swedish [18] | Polish [19] | Dutch [20] | Toki Pona [21] |
---|---|---|---|---|---|---|---|---|---|---|---|
a | 7.636% | 6.516% | 12.525% | 14.634% | 12.117% | 11.745% | 11.680% | 9.341% | 11.503% | 7.486% | 17.2% |
b | 0.901% | 1.886% | 2.215% | 1.043% | 0.980% | 0.927% | 2.952% | 1.254% | 1.740% | 1.584% | 0 |
c | 3.260% | 2.732% | 4.139% | 3.882% | 0.776% | 4.501% | 0.970% | 1.213% | 3.895% | 1.242% | 0 |
d | 3.669% | 5.076% | 5.860% | 4.992% | 3.044% | 3.736% | 4.871% | 4.521% | 4.225% | 5.933% | 0 |
e | 14.715% | 17.396% | 13.681% | 12.570% | 8.995% | 11.792% | 9.007% | 9.647% | 8.352% | 18.914% | 7.4% |
f | 1.066% | 1.656% | 0.692% | 1.023% | 1.037% | 1.153% | 0.444% | 1.931% | 0.143% | 0.805% | 0 |
g | 0.866% | 3.009% | 1.768% | 1.303% | 1.171% | 1.644% | 1.340% | 3.269% | 1.731% | 3.403% | 0 |
h | 0.737% | 4.757% | 0.703% | 0.781% | 0.384% | 0.636% | 1.145% | 2.103% | 1.015% | 2.380% | 0 |
i | 7.529% | 7.550% | 6.247% | 6.186% | 10.012% | 11.283% | 8.274%* | 7.190% | 9.328% | 6.499% | 14.8% |
j | 0.545% | 0.268% | 0.443% | 0.397% | 3.501% | 0.011% | 0.046% | 0.652% | 1.836% | 1.461% | 3.0% |
k | 0.049% | 1.417% | 0.011% | 0.015% | 4.163% | 0.009% | 4.715% | 3.214% | 2.753% | 2.248% | 5.1% |
l | 5.456% | 3.437% | 4.967% | 2.779% | 6.145% | 6.510% | 5.752% | 5.229% | 3.064% | 3.568% | 10.2% |
m | 2.968% | 2.534% | 3.157% | 4.738% | 2.994% | 2.512% | 3.745% | 3.460% | 2.515% | 2.213% | 4.4% |
n | 7.095% | 9.776% | 6.71% | 5.046% | 7.955% | 6.883% | 7.231% | 8.796% | 6.737% | 10.032% | 11.6% |
o | 5.378% | 2.594% | 8.683% | 10.735% | 8.779% | 9.832% | 2.653% | 4.317% | 7.167% | 6.063% | 7.7% |
p | 2.521% | 0.670% | 2.510% | 2.523% | 2.745% | 3.056% | 0.788% | 1.437% | 2.445% | 1.370% | 3.7% |
q | 1.362% | 0.018% | 0.877% | 1.204% | 0 | 0.505% | 0 | 0.007% | 0 | 0.009% | 0 |
r | 6.553% | 7.003% | 6.871% | 6.530% | 5.914% | 6.367% | 6.948% | 8.309% | 5.743% | 6.411% | 0 |
s | 7.948% | 7.273% | 7.977% | 7.805% | 6.092% | 4.981% | 2.950% | 6.374% | 6.224% | 3.733% | 4.1% |
t | 7.244% | 6.154% | 4.632% | 4.736% | 5.276% | 5.623% | 3.049% | 8.693% | 2.475% | 6.923% | 4.6% |
u | 6.311% | 4.346% | 3.927% | 4.634% | 3.183% | 3.011% | 3.430% | 2.066% | 2.062% | 2.192% | 3.2% |
v | 1.628% | 0.846% | 1.138% | 1.665% | 1.904% | 2.097% | 0.977% | 2.289% | 0 | 1.854% | 0 |
w | 0.074% | 1.921% | 0.017% | 0.037% | 0 | 0.033% | 0.016% | 2.107% | 6.313% | 1.821% | 2.8% |
x | 0.427% | 0.034% | 0.215% | 0.253% | 0 | 0 | 0.007% | 0.103% | 0 | 0.036% | 0 |
y | 0.128% | 0.039% | 1.008% | 0.006% | 0 | 0.020% | 3.371% | 0.601% | 3.206% | 0.035% | 0 |
z | 0.326% | 1.134% | 0.517% | 0.470% | 0.494% | 1.181% | 1.497% | 0.020% | 5.852% | 1.374% | 0 |
à | 0.486% | 0 | 0 | 0.072% | 0 | 0.635% | 0 | 0 | 0 | 0 | 0 |
â | 0.051% | 0 | 0 | 0.562% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
á | 0 | 0 | 0.502% | 0.118% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
å | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.221% | 0 | - | 0 |
ä | 0 | 0.447% | 0 | 0 | 0 | 0 | 0 | 1.809% | 0 | 0 | 0 |
ã | 0 | 0 | 0 | 0.733% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ą | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0.699% | - | 0 |
œ | 0.018% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | - | 0 |
ç | 0.085% | 0 | 0 | 0.530% | 0 | 0 | 0.825% | 0 | 0 | - | 0 |
ĉ | 0 | 0 | 0 | 0 | 0.657% | 0 | 0 | 0 | 0 | - | 0 |
ć | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0.743% | - | 0 |
è | 0.271% | 0 | 0 | 0 | 0 | 0.263% | 0 | 0 | 0 | 0 | 0 |
é | 1.504% | 0 | 0.433% | 0.337% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ê | 0.225% | 0 | 0 | 0.450% | 0 | 0 | 0 | 0 | 0 | - | 0 |
ë | 0.001% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ę | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 1.035% | - | 0 |
ĝ | 0 | 0 | 0 | 0 | 0.691% | 0 | 0 | 0 | 0 | - | 0 |
ğ | 0 | 0 | 0 | 0 | 0 | 0 | 1.129% | 0 | 0 | - | 0 |
ĥ | 0 | 0 | 0 | 0 | 0.022% | 0 | 0 | 0 | 0 | - | 0 |
î | 0.045% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | - | 0 |
ì | 0 | 0 | 0 | 0 | 0 | 0.030% | 0 | 0 | 0 | 0 | 0 |
í | 0 | 0 | 0.725% | 0.132% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ï | 0.005% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ı | 0 | 0 | 0 | 0 | 0 | 0 | 5.199%* | 0 | 0 | - | 0 |
ĵ | 0 | 0 | 0 | 0 | 0.055% | 0 | 0 | 0 | 0 | - | 0 |
ł | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 2.109% | - | 0 |
ñ | 0 | 0 | 0.311% | 0 | 0 | 0 | 0 | 0 | 0 | - | 0 |
ń | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0.362% | - | 0 |
ò | 0 | 0 | 0 | 0 | 0 | 0.002% | 0 | 0 | 0 | 0 | 0 |
ö | 0 | 0.573% | 0 | 0 | 0 | 0 | 0.270% | 0.514% | 0 | 0 | 0 |
ô | 0.023% | 0 | 0 | 0.635% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ó | 0 | - | 0.827% | 0.296% | 0 | 0 | 0 | 0 | 1.141% | 0 | 0 |
ŝ | 0 | 0 | 0 | 0 | 0.385% | 0 | 0 | 0 | 0 | - | 0 |
ş | 0 | 0 | 0 | 0 | 0 | 0 | 1.938% | 0 | 0 | - | 0 |
ś | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0.514% | - | 0 |
ß | 0 | 0.307% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | - | 0 |
ù | 0.058% | 0 | 0 | 0 | 0 | 0.166% | 0 | 0 | 0 | 0 | 0 |
ú | 0 | 0 | 0.168% | 0.207% | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ŭ | 0 | 0 | 0 | 0 | 0.520% | 0 | 0 | 0 | 0 | - | 0 |
ü | 0 | 0.995% | 0.012% | 0.026% | 0 | 0 | 1.992% | 0 | 0 | 0 | 0 |
ź | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0.078% | - | 0 |
ż | 0 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0.706% | - | 0 |
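* See Dotted and dotless I.
According to the tables above, the 10 most frequent letters in English are etaoi nshrd. The corresponding orders for the other languages are: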
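Language | Order | Family and notes |
---|---|---|
French | esait nrulo | Indo-European, Romance; traditionally the easier-to-pronounce order esartinulop is used.[22] |
Spanish | eaosr nidlt | Indo-European, Romance |
Portuguese | aeosr indmt | Indo-European, Romance |
Italian | eaion lrtsc | Indo-European, Romance |
Esperanto | aieon lsrtk | Constructed language based on Indo-European languages: the vocabulary is drawn largely from Romance roots, the phoneme inventory is essentially Slavic, and there are a few Germanic features. |
German | enisr atdhu | Indo-European, Germanic |
Swedish | eantr isldo | Indo-European, Germanic |
Turkish | aeinr ldkmu | Altaic, Turkic |
Dutch | enati rodsl | Indo-European, Germanic[20] |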
Polish | aieon wszrd | Indo-European, Slavic |
The languages above all use a broadly similar set of 25 or more letters. Toki Pona, whose order is ainlo ektms, differs in using only 14 letters.
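Orderings like these follow mechanically from a frequency column: sort the letters by descending frequency, keep the top ten, and print them in groups of five. The sketch below does this for the Toki Pona column of the table above; the names `TOKI_PONA_FREQ` and `rank_letters` are hypothetical, and ties (none occur in this data) would be broken by the original letter order.

```python
# Toki Pona letter frequencies in percent, copied from the table in this article.
TOKI_PONA_FREQ = {
    "a": 17.2, "e": 7.4, "i": 14.8, "j": 3.0, "k": 5.1, "l": 10.2, "m": 4.4,
    "n": 11.6, "o": 7.7, "p": 3.7, "s": 4.1, "t": 4.6, "u": 3.2, "w": 2.8,
}

def rank_letters(freqs, top=10, group=5):
    """Return the `top` most frequent letters, written in space-separated groups."""
    ordered = sorted(freqs, key=freqs.get, reverse=True)[:top]
    joined = "".join(ordered)
    return " ".join(joined[i:i + group] for i in range(0, len(joined), group))

toki_pona_order = rank_letters(TOKI_PONA_FREQ)
print(toki_pona_order)
print(len(TOKI_PONA_FREQ), "letters in use")
```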
Notes
- ^ Zim, Herbert Spencer. Codes & Secret Writing: Authorized Abridgement. Scholastic Book Services. 1961. OCLC 317853773.
- ^ Li, Wentian; Miramontes, Pedro. Fitting ranked English and Spanish letter frequency distribution in US and Mexican presidential speeches. Journal of Quantitative Linguistics. 2011, 18 (4): 359. doi:10.1080/09296174.2011.608606.
- ^ Gusein-Zade, S.M. Frequency distribution of letters in the Russian language. Probl. Peredachi Inf. 1988, 24 (4): 102–7.
- ^ Gamow, George; Ycas, Martynas. Statistical correlation of protein and ribonucleic acid composition (PDF). Proc. Natl. Acad. Sci. 1955, 41 (12): 1011–19 [2013-06-05]. PMC 528190. doi:10.1073/pnas.41.12.1011. (Archived (PDF) from the original on 2015-09-24).
- ^ Poe, Edgar Allan. The works of Edgar Allan Poe in five volumes. Project Gutenberg. [2013-06-05]. (Archived from the original on 2015-09-24).
- ^ Beker, Henry; Piper, Fred. Cipher Systems: The Protection of Communications. Wiley-Interscience. 1982: 397. Table also available from Lewand, Robert. Cryptological Mathematics. The Mathematical Association of America. 2000: 36 [2013-06-05]. ISBN 978-0-88385-719-9. (Archived from the original on 2020-08-01). and Archived copy. [2008-06-25]. (Archived from the original on 2008-07-08).
- ^ Mička, Pavel. Letter frequency (English). Algoritmy.net. [2013-06-05]. (Archived from the original on 2021-03-04).
- ^ What is the frequency of the letters of the alphabet in English?. Oxford Dictionary. Oxford University Press. [29 December 2012]. (Archived from the original on 2015-04-22).
- ^ Statistical Distributions of English Text. [2013-06-05]. (Archived from the original on 2004-06-03).
- ^ Lee, E. Stewart. Essays about Computer Security (PDF). University of Cambridge Computer Laboratory: 181. [2010-02-13]. (Archived (PDF) from the original on 2011-06-04).
- ^ Calculated from "Project Gutenberg Selections" available from the NLTK Corpora (archived at the Internet Archive)
- ^ CorpusDeThomasTempé. [2007-06-15]. (Archived from the original on 2007-09-30).
- ^ Beutelspacher, Albrecht. Kryptologie 7. Wiesbaden: Vieweg. 2005: 10. ISBN 3-8348-0014-7.
- ^ Pratt, Fletcher. Secret and Urgent: the Story of Codes and Ciphers. Garden City, N.Y.: Blue Ribbon Books. 1942: 254–5. OCLC 795065.
- ^ Frequência da ocorrência de letras no Português. [2009-06-16]. (Archived from the original on 2009-08-03).
- ^ La Oftecoj de la Esperantaj Literoj. [2007-09-14]. (Archived from the original on 2021-01-17).
- ^ Singh, Simon; Galli, Stefano. Codici e Segreti. Milano: Rizzoli. 1999. ISBN 978-8-817-86213-4. OCLC 535461359 (in Italian).
- ^ Singh, Simon; Brogren, Margareta. Kodboken : konsten att skapa sekretess - från det gamla Egypten till kvantkryptering. Stockholm: Norstedts. 1999. ISBN 978-9-113-00708-3. OCLC 186495779 (in Swedish).
- ^ Wstęp do kryptologii (archived at the Internet Archive), counting [space] 17.2%, [dot point] 0.9%, [comma] 0.9% and [semicolon] 0.5%
- ^ 20.0 20.1 Letterfrequenties. Genootschap OnzeTaal. [2009-05-17]. (Archived from the original on 2011-07-24).
- ^ lipu pi jan Jakopo pi toki pona. [2007-09-14]. (Archived from the original on 2007-11-14).
- ^ Perec, Georges. "Alphabets". Éditions Galilée, 1976.
References
- Note: for frequency tables of single letters, digrams, trigrams, tetragrams and pentagrams (based on 20,000 words and taking word length and letter position into account), see:
- Mayzner, M.S.; Tresselt, M.E. Tables of single-letter and digram frequency counts for various word-length and letter-position combinations. Psychonomic Monograph Supplements. 1965, 1 (2): 13–32. OCLC 639975358.
- Mayzner, M.S.; Tresselt, M.E.; Wolin, B.R. Tables of trigram frequency counts for various word-length and letter-position combinations. Psychonomic Monograph Supplements. 1965, 1 (3): 33–78.
- Mayzner, M.S.; Tresselt, M.E.; Wolin, B.R. Tables of tetragram frequency counts for various word-length and letter-position combinations. Psychonomic Monograph Supplements. 1965, 1 (4): 79–143.
- Mayzner, M.S.; Tresselt, M.E.; Wolin, B.R. Tables of pentagram frequency counts for various word-length and letter-position combinations. Psychonomic Monograph Supplements. 1965, 1 (5): 144–190.
External links
- A site with content of Cryptographical Mathematics by Robert Edward Lewand
- Some examples of letter frequency rankings in some common languages (archived at the Internet Archive)
- Java-Application for building letter frequencies out of a text file
- JavaScript Heatmap Visualization showing letter frequencies of texts on different keyboard layouts (archived at the Internet Archive)
- An updated version of Mayzner's work using Google books Ngrams data set (archived at the Internet Archive) by Peter Norvig
- Counter--character frequencies (archived at the Internet Archive)
- Letter frequency-simia.net
- letter frequency (archived at the Internet Archive)