13 March 2022

Appendix 1

Letter usage in English, French, Latin, Greek, and epic poetry.

To get an idea of how differently European languages and authors use their letters, I performed a principal components (PC) analysis (Kassambara, 2017) of the frequencies of Greek and Latin letters in English, French, Latin, Modern Greek, Ancient Greek, and each of the four major Greek epic poems: two by Homer (the Iliad and the Odyssey) and two by Hesiod (the Theogony and the Works and Days). The greatest proportion of the difference among languages is obvious and trivial: some letters are used more frequently than others in all of these languages (see section The Proto-Canaanite scripts). Vowels, for example, are more frequent than most consonants, and some consonants are considerably rarer than others. The same is true for letter combinations, i.e., bigrams and trigrams: clusters of consonants with vowels are more frequent than clusters of two or three consonants in all of these languages, modern and ancient.

This trivial variation is captured by the first PC (PC1). There is, however, variation in letter usage that is specific to a linguistic group or to a single language. Obviously, the Latin and Greek alphabets are not identical: some Greek letters are absent from Latin and vice versa. Even letters and clusters that exist in all languages within a group may differ markedly in frequency of usage. For example, K and Ck are quite frequent in English, rare in French (which uses Qu instead), and hardly ever seen in Latin. The morphemes of each language, e.g. the English past-tense ending -ed and the gerund ending -ing, increase the usage of their letters within that language. Such language-specific, or even text-specific, variation is quantified by the subsequent principal components (PC2, PC3, etc.). The components are numbered in order of decreasing variance explained and are labeled Dim.1 to Dim.8 in Table 2.
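The way a first component absorbs the variation shared by all languages can be illustrated with a minimal sketch. It assumes a small, made-up table of letter frequencies (the values and the names lexicon_a, lexicon_b, and lexicon_c are hypothetical, not data from this appendix) and uses power iteration on the correlation matrix, a simple technique that yields only the leading component. It is not the software used for the analysis reported here.

```python
import math

def standardise(columns):
    """Scale each column (one lexicon) to mean 0 and sample variance 1."""
    out = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        out.append([(v - mean) / sd for v in col])
    return out

def correlation_matrix(columns):
    """Correlation between every pair of lexica across the same letters."""
    z = standardise(columns)
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def leading_component(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    k = len(matrix)
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    product = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
    eigenvalue = sum(p * v for p, v in zip(product, vec))  # Rayleigh quotient
    return eigenvalue, vec

# Hypothetical relative frequencies (%) of the letters e, a, t, k, q.
lexicon_a = [12.0, 8.0, 6.5, 1.0, 0.2]
lexicon_b = [14.5, 7.5, 7.0, 0.1, 1.2]
lexicon_c = [11.0, 9.0, 6.0, 0.2, 1.5]

columns = [lexicon_a, lexicon_b, lexicon_c]
corr = correlation_matrix(columns)
eigenvalue, loadings = leading_component(corr)
# With standardised variables the total variance equals the number of columns.
pc1_share = 100.0 * eigenvalue / len(columns)
print(f"PC1 explains {pc1_share:.1f}% of the variance")
```

Because all three toy lexica rank the letters in nearly the same order, every loading on the first component has the same sign and the component accounts for most of the variance. Only what remains after it can separate one lexicon from another.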

I measured the frequencies of single letters and of two- or three-letter clusters in lexica (Balota et al., 2007; Crane, 2012; Kyparissiadis, n.d.; Kyparissiadis et al., 2017; Liddell & Scott, 1996; New & Pallier, 2017) in which each word is counted only once. The number of entries in each lexicon is shown in Table 1. To make the vocabularies written in the Greek and Latin alphabets comparable, I transliterated the Greek words using the LSJ beta code and removed the diacritics for all languages. Words that differ only in tonic accent or other diacritics were kept as separate entries and counted independently when calculating letter or cluster frequencies. The overlap between the lexica is shown in Fig. 1 for the Latin-alphabet languages, in Fig. 2 for the Greek-alphabet languages, and in Fig. 3 for the Ancient Greek lexica. In the last figure, the individual epic texts are pooled by their presumed author.
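The counting step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names strip_diacritics and ngram_frequencies and the three-word lexicon are hypothetical, frequencies are normalised separately for each cluster length (the appendix does not say how they were normalised), and the transliteration of Greek into beta code is omitted.

```python
import unicodedata
from collections import Counter

def strip_diacritics(word):
    """Remove combining marks (accents, breathings) and lowercase the word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()

def ngram_frequencies(words, max_n=3):
    """Relative frequency of each 1- to max_n-letter cluster in a word list.

    Every entry in `words` is counted once, so entries that become identical
    after diacritic removal still contribute independently.
    Assumption: frequencies are normalised within each cluster length.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for word in words:
        w = strip_diacritics(word)
        for n in counts:
            for i in range(len(w) - n + 1):
                counts[n][w[i:i + n]] += 1
    freqs = {}
    for counter in counts.values():
        total = sum(counter.values())
        for gram, c in counter.items():
            freqs[gram] = c / total
    return freqs

lexicon = ["liberté", "walked", "walking"]  # hypothetical entries
freqs = ngram_frequencies(lexicon)
for gram in ("e", "wa", "ing"):
    print(gram, round(freqs[gram], 3))
```

Running the same function on each lexicon and aligning the results by cluster would give one column of frequencies per language or text, which is the kind of table a principal components analysis takes as input.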

Table 1. Lexica and number of words used in the principal components analysis of letter usage frequencies.

Lexicon           Words
English           40480
French            46898
Latin             16969
Ancient Greek     36486
Modern Greek      35282
Iliad             19245
Odyssey           15456
Theogony           2848
Works and Days     2618





Figure 1. Venn diagram of the lexica of languages using the Latin alphabet


Figure 2. Venn diagram of the lexica of languages (or dialects) using the Greek alphabet


Figure 3. Venn diagram of the Ancient Greek lexica with epic dialects grouped by author.
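The region sizes of a Venn diagram such as those in Figs. 1 to 3 can be obtained by grouping words according to the exact set of lexica that contain them. The sketch below assumes tiny, made-up word lists and a hypothetical helper named venn_regions. It illustrates the counting only and does not reproduce the figures.

```python
def venn_regions(lexica):
    """Count words by the exact combination of lexica that contain them."""
    membership = {}
    for name, words in lexica.items():
        for word in set(words):
            membership.setdefault(word, set()).add(name)
    regions = {}
    for names in membership.values():
        key = tuple(sorted(names))
        regions[key] = regions.get(key, 0) + 1
    return regions

# Hypothetical mini-lexica with diacritics already removed.
lexica = {
    "English": {"table", "animal", "walk"},
    "French": {"table", "animal", "marcher"},
    "Latin": {"animal", "ambulare"},
}
regions = venn_regions(lexica)
for combo, size in sorted(regions.items()):
    print(" & ".join(combo), size)
```

Each key corresponds to one region of the diagram, so a word shared by all three lexica is counted only in the central region and not in the pairwise overlaps.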


The principal components analysis is summarised in Table 2, which shows the proportion of variance explained by each PC, and in Figs. 4 to 8, which show the correlations of each language with each PC and the contribution of each letter or letter cluster (bigram or trigram) to the separation of the languages.

Table 2. Principal components of letter and letter-cluster frequencies.

Dimension   Eigenvalue   Variance (%)   Cumulative variance (%)
Dim.1       8.401        93.34          93.345
Dim.2       0.320         3.56          96.900
Dim.3       0.170         1.88          98.785
Dim.4       0.058         0.65          99.432
Dim.5       0.027         0.30          99.732
Dim.6       0.014         0.16          99.889
Dim.7       0.006         0.06          99.953
Dim.8       0.004         0.04          99.995
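The variance percentages in Table 2 follow from the eigenvalues: each share is the eigenvalue divided by the total variance. The sketch below assumes that the total variance is 9, i.e. one standardised variable per lexicon in Table 1. This is an assumption, though the eigenvalues listed do sum to 9.000. Because the published eigenvalues are rounded to three decimals, the recomputed figures agree with the table only to within about 0.01.

```python
# Eigenvalues as listed in Table 2.
eigenvalues = [8.401, 0.320, 0.170, 0.058, 0.027, 0.014, 0.006, 0.004]

# Assumption: PCA on standardised variables, one per lexicon in Table 1,
# so the total variance equals the number of lexica.
n_variables = 9

shares = [100 * ev / n_variables for ev in eigenvalues]

cumulative = []
running = 0.0
for share in shares:
    running += share
    cumulative.append(running)

for dim, (ev, share, cum) in enumerate(zip(eigenvalues, shares, cumulative), start=1):
    print(f"Dim.{dim}  {ev:6.3f}  {share:6.2f}  {cum:7.3f}")
```

Read this way, the first dimension carries about 93% of the variance and all remaining dimensions together carry under 7%, which is the portion available for distinguishing languages and texts.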

 



Figure 4. Principal components of letter and cluster usage frequencies.




Figure 5. Biplot of principal components 1 and 2 of letter and cluster usage frequencies showing the most influential letters or clusters.




Figure 6. Biplot of principal components 3 and 4 of letter and cluster usage frequencies showing the most influential letters or clusters.




Figure 7. Biplot of principal components 5 and 6 of letter and cluster usage frequencies showing the most influential letters or clusters.




Figure 8. Biplot of principal components 7 and 8 of letter and cluster usage frequencies showing the most influential letters or clusters.



References

Balota, D. A., Yap, M. J., Cortese, M. J., Hutchison, K. A., Kessler, B., Loftis, B., Neely, J. H., Nelson, D. L., Simpson, G. B., & Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39(3), 445–459.

Crane, G. R. (Ed.). (2012). Open Source Code. Perseus Digital Library.

Kassambara, A. (2017). PCA - Principal Component Analysis Essentials. STHDA.

Kyparissiadis, A. (n.d.). GreekLex: A lexical database of Modern Greek. Retrieved March 13, 2022.

Kyparissiadis, A., van Heuven, W. J. B., Pitchford, N. J., & Ledgeway, T. (2017). GreekLex 2: A comprehensive lexical database with part-of-speech, syllabic, phonological, and stress information. PLoS ONE, 12(2). 

Liddell, H. G., & Scott, R. (1996). A Greek-English Lexicon (H. S. Jones & R. McKenzie, Eds.; 9th ed.). Oxford University Press.

New, B., & Pallier, C. (2017). Lexique version 3.82. Lexique.Org.