Specifying orthography: harmonization, tokenization and transliteration

Michael Cysouw

2024-06-08

Introduction

Given any collection of linguistic strings, various issues often arise when such data are processed computationally. This vignette gives a short practical introduction to the solutions offered in the qlcData package. For a full theoretical discussion of all issues involved, see Moran & Cysouw (forthcoming).

All proposals made here (and in the paper by Moran & Cysouw) are rooted in the structure and technologies developed over the last few decades by the Unicode Consortium. Specifically, the implementation provided by ICU (International Components for Unicode) and its port to R in the stringi package are crucial for the functions described here. One might even question whether there is any need for the functions in this package, and whether the functionality of stringi is not already sufficient. We see our additions as high-level functionality that is (hopefully) easy enough to apply that non-technically-inclined linguists can also use it.

Specifically, we offer an approach to document tailored grapheme clusters (as they are called by the Unicode Consortium). To deal consistently with such clusters, the official Unicode route would be to produce Unicode Locale Descriptions, which are overly complex for the use cases that we have in mind. In general, our goal is to allow for quick and easy processing, which can be used for dozens (or even hundreds) of different languages/orthographies without becoming a life-long project.

We see various use-cases for the orthographic processing approach as made available in the qlcData package, e.g.:

In general, our solutions will not be practical for idiosyncratic orthographies like English or French, nor for character-based orthographies like Chinese or Japanese, but are mostly geared towards practical orthographies as used in the hundreds (thousands) of other languages in the world.

Installing the package

The current alpha version of the package qlcData is available on CRAN (Comprehensive R Archive Network) for easy download and application. You can also try to install the most recent development version directly. If you haven’t done so already, please install the package devtools and then install the package qlcData directly from GitHub.

# install the released version of qlcData from CRAN
install.packages("qlcData")
# or, for the development version: install devtools from CRAN
install.packages("devtools")
# install qlcData from GitHub using devtools
devtools::install_github("cysouw/qlcData", build_vignettes = TRUE)
# load the qlcData package
library(qlcData)
# access help files of the package
help(qlcData)
# access this vignette
vignette("orthography_processing")

Orthography Profiles

The basic object in qlcData is the Orthography Profile. This is basically just a simple tab-separated file listing all (tailored) graphemes in some data. We have decided to go for a tab-separated file (instead of a JSON or CSV file) because a tab-separated file is easier to edit by hand, something which we explicitly expect to happen a lot. An orthography profile can be easily made by using write.profile. The result of this function is an R data frame, but it can also be written directly to a file by using the option file = path/filename.

test <- "hállo hállо"
write.profile(test)
Grapheme  Frequency  Codepoint       UnicodeName
          1          U+0020          SPACE
á         1          U+0061, U+0301  LATIN SMALL LETTER A, COMBINING ACUTE ACCENT
h         2          U+0068          LATIN SMALL LETTER H
l         4          U+006C          LATIN SMALL LETTER L
o         1          U+006F          LATIN SMALL LETTER O
á         1          U+00E1          LATIN SMALL LETTER A WITH ACUTE
о         1          U+043E          CYRILLIC SMALL LETTER O

There are a few interesting aspects in this orthography profile.

# the difference between various "o" characters is mostly invisible on screen
"o" == "o"  # these are the same "o" characters, so this statement is true
## [1] TRUE
"o" == "о"  # this is one Latin and one Cyrillic "o" character, so this statement is false
## [1] FALSE
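
To make such invisible differences visible without building a full profile, you can inspect the code points directly. The following is a minimal sketch in base R; the variable names are our own, and the last step assumes that the stringi package is installed.

```r
# Two characters that look alike but are different code points
latin_o <- "o"
cyrillic_o <- "\u043e"
latin_o == cyrillic_o                     # FALSE
sprintf("U+%04X", utf8ToInt(latin_o))     # "U+006F"
sprintf("U+%04X", utf8ToInt(cyrillic_o))  # "U+043E"

# The same holds for precomposed versus decomposed "a with acute"
composed <- "\u00e1"     # one code point
decomposed <- "a\u0301"  # letter a plus combining acute accent
composed == decomposed   # FALSE
utf8ToInt(composed)      # 225
utf8ToInt(decomposed)    # 97 769

# With stringi available, NFC normalization makes the two forms identical
if (requireNamespace("stringi", quietly = TRUE)) {
  print(stringi::stri_trans_nfc(decomposed) == composed)
}
```

Normalization removes the composed/decomposed difference, but not the Latin/Cyrillic one, which has to be corrected in the data itself.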
test <- c("this thing", "is", "a", "vector", "with", "many", "strings")
write.profile(test)
Grapheme  Frequency  Codepoint  UnicodeName
          1          U+0020     SPACE
a         2          U+0061     LATIN SMALL LETTER A
c         1          U+0063     LATIN SMALL LETTER C
e         1          U+0065     LATIN SMALL LETTER E
g         2          U+0067     LATIN SMALL LETTER G
h         3          U+0068     LATIN SMALL LETTER H
i         5          U+0069     LATIN SMALL LETTER I
m         1          U+006D     LATIN SMALL LETTER M
n         3          U+006E     LATIN SMALL LETTER N
o         1          U+006F     LATIN SMALL LETTER O
r         2          U+0072     LATIN SMALL LETTER R
s         4          U+0073     LATIN SMALL LETTER S
t         5          U+0074     LATIN SMALL LETTER T
v         1          U+0076     LATIN SMALL LETTER V
w         1          U+0077     LATIN SMALL LETTER W
y         1          U+0079     LATIN SMALL LETTER Y

Normally, you won’t type your data directly into R, but load the data from some file with functions like scan or read.table, and then perform write.profile on the data. Given the information as provided by the orthography profile, you might then want to go back to the original file and correct the inconsistencies, and then check again to see if everything is consistent now.
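
As a rough sketch of that workflow, the following reads one string per line from a file with scan and hands the result to write.profile. The file is a temporary one created only for this illustration, and the call to write.profile is skipped when qlcData is not installed.

```r
# Hypothetical input file: one word or phrase per line
path <- tempfile(fileext = ".txt")
writeLines(c("this thing", "is", "a", "vector"), path)

# sep = "\n" makes scan() treat every line as one string
words <- scan(path, what = character(), sep = "\n",
              encoding = "UTF-8", quiet = TRUE)
words

# Build the profile from the loaded data (only if qlcData is available)
if (requireNamespace("qlcData", quietly = TRUE)) {
  print(qlcData::write.profile(words))
}
```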

Tokenization

In most cases you will probably want to use the function tokenize. Besides creating orthography profiles, it will also check orthography profiles against new data (and give warnings if the data contain characters that the profile does not cover), separate the input strings into graphemes, and even perform transliteration. Let’s run through a typical workflow using tokenize.

Given some data in a specific orthography, you can call tokenize on the data to create an initial orthography profile (just like with write.profile discussed above, though there are fewer options for the splitting of graphemes, the addition of info, etc.).

The output of tokenize is always a list of four elements: $strings, $profile, $errors, and $missing. The second element, $profile, is the table we already encountered above (though in a different order because of different default settings). The first element, $strings, is a table with the original strings and their tokenization into graphemes as specified by the orthography profile (which in the case below was automatically produced, so nothing unusual happens here, just a splitting into letters). The $errors and $missing elements are empty at this stage, but they will contain information about strings that cannot be tokenized with a pre-established profile.

tokenize(test)
## $strings
##    originals         tokenized
## 1 this thing t h i s t h i n g
## 2         is               i s
## 3          a                 a
## 4     vector       v e c t o r
## 5       with           w i t h
## 6       many           m a n y
## 7    strings     s t r i n g s
## 
## $profile
##    Grapheme Frequency
## 1         y         1
## 2         w         1
## 3         v         1
## 4         t         5
## 5         s         4
## 6         r         2
## 7         o         1
## 8         n         3
## 9         m         1
## 10        i         5
## 11        h         3
## 12        g         2
## 13        e         1
## 14        c         1
## 15        a         2
## 16                  1
## 
## $errors
## NULL
## 
## $missing
## NULL

Now, you can work further with this profile inside R, but it is easier to write the results to a file, correct or change this file by hand, and then use R to process the data again. In this vignette we will not write anything to your disk (so the following commands will not be executed), but you might try something like the following:

dir.create("~/Desktop/tokenize")
setwd("~/Desktop/tokenize")
tokenize(test, file.out = "test_profile.txt")

We are going to add two new “tailored grapheme clusters” to this profile: open the file “test_profile.txt” (in the folder “tokenize” on your Desktop) with a text editor like SublimeText, Atom, TextMate, TextWrangler/BBEdit or Notepad++ (don’t use Microsoft Word!!!). First, add a new line with only “th” on it and, second, add another line with only “ng” on it. The file will then roughly look like this:

Grapheme Frequency
y 1
w 1
v 1
t 5
s 4
r 2
o 1
n 3
m 1
i 5
h 3
g 2
e 1
c 1
a 2
1
th
ng
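
If you prefer not to leave R, the same two lines can also be appended from within R. This is only a sketch with a shortened, temporary stand-in for “test_profile.txt”, so nothing on your Desktop is touched.

```r
# Temporary stand-in for the profile file, with only a few rows
path <- tempfile(fileext = ".txt")
writeLines(c("Grapheme\tFrequency", "t\t5", "h\t3"), path)

# Append one line with only "th" and one line with only "ng"
cat("th\n", "ng\n", file = path, sep = "", append = TRUE)

# Check the result
readLines(path)
```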

Now try to use this profile with the function tokenize. Note that you will get a different tokenization of the strings (“th” and “ng” are now treated as complex graphemes) and you will also obtain an updated orthography profile, which you could immediately use to overwrite the existing profile on your disk.

tokenize(test, profile = "test_profile.txt")

# with overwriting of the existing profile:
# tokenize(test, profile = "test_profile.txt", file.out = "test_profile.txt")

# note that you can abbreviate this in R:
# tokenize(test, p = "test_profile.txt", f = "test_profile.txt")
## $strings
##    originals      tokenized
## 1 this thing th i s th i ng
## 2         is            i s
## 3          a              a
## 4     vector    v e c t o r
## 5       with         w i th
## 6       many        m a n y
## 7    strings   s t r i ng s
## 
## $profile
##    Grapheme Frequency
## 18       ng         2
## 17       th         3
## 16                  1
## 15        a         2
## 14        c         1
## 13        e         1
## 12        g         0
## 11        h         0
## 10        i         5
## 9         m         1
## 8         n         1
## 7         o         1
## 6         r         2
## 5         s         4
## 4         t         2
## 3         v         1
## 2         w         1
## 1         y         1
## 
## $errors
## NULL
## 
## $missing
## NULL
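
To get a feeling for why “th” now wins over “t” followed by “h”, here is a minimal base R sketch of longest-match tokenization. It is an illustration of the general idea only and makes no claim about how tokenize is implemented internally; the function greedy_tokenize and its arguments are invented for this example.

```r
# Hypothetical helper, not part of qlcData
greedy_tokenize <- function(string, graphemes, unknown = "\u2047") {
  # Try longer graphemes first, so that "th" is preferred over "t"
  graphemes <- graphemes[order(nchar(graphemes), decreasing = TRUE)]
  tokens <- character(0)
  while (nchar(string) > 0) {
    hits <- graphemes[startsWith(string, graphemes)]
    if (length(hits) > 0) {
      # consume the longest grapheme that matches at the current position
      tokens <- c(tokens, hits[1])
      string <- substring(string, nchar(hits[1]) + 1)
    } else {
      # no grapheme matches: emit a placeholder and skip one character
      tokens <- c(tokens, unknown)
      string <- substring(string, 2)
    }
  }
  paste(tokens, collapse = " ")
}

profile <- c("y", "w", "v", "t", "s", "r", "o", "n", "m", "i",
             "h", "g", "e", "c", "a", " ", "th", "ng")
greedy_tokenize("thing", profile)
greedy_tokenize("with", profile)
```

With this inventory the two calls give "th i ng" and "w i th"; a character that is not in the inventory is replaced by the placeholder.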

Now that we have an orthography profile, we can apply it to other data: the profile is used to produce a tokenization, and at the same time the data are checked for anything that does not appear in the profile (which might be errors in the data). Note that the following will give a warning, but it will still run and give some output. All symbols that were not found in the orthography profile are simply separated according to Unicode grapheme definitions, a new orthography profile explicitly for this dataset is made, and the problematic strings are summarised in the $errors element of the output, linked to the original strings in which they occurred. In this way it is easy to find the problems in the data.

tokenize(c("think", "thin", "both"), profile = "test_profile.txt")
## Warning in tokenize(c("think", "thin", "both"), profile = test_profile.txt): 
## There were unknown characters found in the input data.
## Check output$errors for a table with all problematic strings.
## $strings
##   originals tokenized
## 1     think  th i n ⁇
## 2      thin    th i n
## 3      both    ⁇ o th
## 
## $profile
##    Grapheme Frequency
## 18       ng         0
## 17       th         3
## 16                  0
## 15        a         0
## 14        c         0
## 13        e         0
## 12        g         0
## 11        h         0
## 10        i         2
## 9         m         0
## 8         n         2
## 7         o         1
## 6         r         0
## 5         s         0
## 4         t         0
## 3         v         0
## 2         w         0
## 1         y         0
## 
## $errors
##   originals   errors
## 1     think th i n ⁇
## 3      both   ⁇ o th
## 
## $missing
##   Grapheme Frequency Codepoint          UnicodeName
## 1        b         1    U+0062 LATIN SMALL LETTER B
## 2        k         1    U+006B LATIN SMALL LETTER K
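
The content of $missing can be approximated by hand, which may help to understand what is being reported. The sketch below uses base R only and splits by code point rather than by grapheme cluster, so it is a simplification of what tokenize does; the variable names are our own.

```r
# Single-character graphemes from the profile used above
known <- c("y", "w", "v", "t", "s", "r", "o", "n", "m", "i",
           "h", "g", "e", "c", "a", " ")
words <- c("think", "thin", "both")

# All distinct characters that occur in the data
chars <- unique(unlist(strsplit(words, "")))

# Characters that the profile does not list
unknown <- sort(setdiff(chars, known))

# Their Unicode code points, formatted as in the profile
codepoints <- sprintf("U+%04X", vapply(unknown, utf8ToInt, integer(1)))

data.frame(Grapheme = unknown, Codepoint = codepoints, row.names = NULL)
```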

Transliteration, Contexts, Classes and Regular Expressions

After tokenization, the resulting tokenized strings can be transliterated into a different orthographic representation by using the option transliterate. The graphemes are then replaced by the values specified in the chosen column of the profile (by default this column is called “Replacement”, but other names can be used, and one orthography profile can include multiple transliteration columns).

To achieve contextually determined replacements (e.g. in Italian <c> becomes /k/, except before <i, e>, where it becomes /tʃ/) you can use columns called “Left” and “Right” for left and right contexts, respectively. For example, consider the following toy-profile for Italian:

Grapheme  Right  IPA
c                k
c         [ie]   tʃ
n                n
s                s
a                a
i                i

To use this profile, you have to add the option regex = TRUE because contextual matching uses regular expressions. Note that you can now also use regular expressions in the specification of the context!

tokenize(c("casa", "cina"), profile = italian, transliterate = "IPA", regex = TRUE)$strings
##   originals tokenized transliterated
## 1      casa   c a s a        k a s a
## 2      cina   c i n a       tʃ i n a
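
To see what the “Right” context amounts to, here is the same Italian rule written as a plain regular expression with a lookahead in base R. This is only meant to illustrate the idea of a right context, not to show how tokenize works internally; the function italian_c is made up for this example.

```r
# Hypothetical helper, not part of qlcData
italian_c <- function(x) {
  # "c" before "i" or "e": the lookahead (?=[ie]) checks the right
  # context without consuming it
  x <- gsub("c(?=[ie])", "t\u0283", x, perl = TRUE)
  # every remaining "c" gets the default value "k"
  gsub("c", "k", x, fixed = TRUE)
}

italian_c(c("casa", "cina"))
```

The result is "kasa" and "tʃina". Note that the contextual rule has to be applied before the general one, otherwise every “c” would already have been turned into “k”.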

Another possibility is to use a column “Class” to specify a class of graphemes, and then use this class in the specification of context. You are free to use any class-name you like, as long as it doesn’t clash with the rest of the profile.

Grapheme  Right   Class   IPA
c                         k
c         frontV          tʃ
n                         n
s                         s
a                         a
i                 frontV  i
e                 frontV  e
tokenize(c("casa", "cina"), profile = italian, transliterate = "IPA", regex = TRUE)$strings
##   originals tokenized transliterated
## 1      casa   c a s a        k a s a
## 2      cina   c i n a       tʃ i n a
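
One way to think about a class is as a shorthand for a regular expression that lists its members. The following base R sketch rewrites the class name in the “Right” column into such an expression. We do not claim that qlcData performs exactly this substitution; the sketch only shows why the two toy-profiles give the same result.

```r
# The toy-profile with a class, without the IPA column
profile <- data.frame(
  Grapheme = c("c", "c", "n", "s", "a", "i", "e"),
  Right    = c("", "frontV", "", "", "", "", ""),
  Class    = c("", "", "", "", "", "frontV", "frontV"),
  stringsAsFactors = FALSE
)

# Collect the members of the class and turn them into a character class
members <- profile$Grapheme[profile$Class == "frontV"]
class_regex <- paste0("[", paste(members, collapse = ""), "]")

# Replace the class name in the context column by the regular expression
profile$Right[profile$Right == "frontV"] <- class_regex
profile
```

After this step the “Right” column of the second row contains [ie], just as in the first Italian profile.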