Given any collection of linguistic strings, various issues arise when such strings are used in computational processing. This vignette gives a short practical introduction to the solutions offered in the qlcData package. For a full theoretical discussion of all issues involved, see Moran & Cysouw (forthcoming).
All proposals made here (and in the paper by Moran & Cysouw) are crucially rooted in the standards and technologies developed over the last few decades by the Unicode Consortium. Specifically, the implementation provided by ICU and its porting to R in the stringi package are central to the functions described here. One might even question whether there is any need for the functions in this package, and whether the functionality of stringi is not already sufficient. We see our additions as high-level functionality that is (hopefully) easy enough to apply that non-technically-inclined linguists can also use it.
Specifically, we offer an approach to document tailored grapheme clusters (as they are called by the Unicode Consortium). To deal consistently with such clusters, the official Unicode route would be to produce Unicode Locale Descriptions, which are overly complex for the use cases that we have in mind. In general, our goal is to allow for quick and easy processing that can be applied to dozens (or even hundreds) of different languages/orthographies without becoming a life-long project.
We see various use cases for the orthographic processing approach made available in the qlcData package. In general, our solutions will not be practical for idiosyncratic orthographies like English or French, nor for character-based orthographies like Chinese or Japanese; they are mostly geared towards practical orthographies as used in the hundreds (or thousands) of other languages in the world.
The current alpha version of the qlcData package is available on CRAN (Comprehensive R Archive Network) for easy download and installation. You can also try the most recent development version directly. If you haven’t done so already, please install the package devtools and then install qlcData directly from GitHub.
# install devtools from CRAN
install.packages("devtools")
# install qlcData from github using devtools
devtools::install_github("cysouw/qlcData", build_vignettes = TRUE)
# load qlcData package
library(qlcData)
# access help files of the package
help(qlcData)
# access this vignette
vignette("orthography_processing")
The basic object in qlcData is the Orthography Profile. This is basically just a simple tab-separated file listing all (tailored) graphemes in some data. We have decided to use a tab-separated file (instead of a JSON or CSV file) because a tab-separated file is easier to edit by hand, something which we explicitly expect to happen a lot. An orthography profile can easily be made by using write.profile. The result of this function is an R data frame, but it can also be written directly to a file by using the option file = "path/filename".
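For illustration, a profile like the one shown below can be produced from a short test string that deliberately mixes look-alike characters. The exact string behind the table is our assumption, but a call along these lines yields such a profile:
# a test string mixing look-alike characters (the exact string is an assumption):
# a precomposed á (U+00E1), an "a" plus combining acute (U+0061 U+0301),
# a Latin "o" (U+006F) and a Cyrillic "о" (U+043E)
messy <- "h\u00E1llo h\u0061\u0301ll\u043E"
# profile the graphemes as an R data frame
write.profile(messy)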
Grapheme | Frequency | Codepoint | UnicodeName |
---|---|---|---|
  | 1 | U+0020 | SPACE |
á | 1 | U+0061, U+0301 | LATIN SMALL LETTER A, COMBINING ACUTE ACCENT |
h | 2 | U+0068 | LATIN SMALL LETTER H |
l | 4 | U+006C | LATIN SMALL LETTER L |
o | 1 | U+006F | LATIN SMALL LETTER O |
á | 1 | U+00E1 | LATIN SMALL LETTER A WITH ACUTE |
о | 1 | U+043E | CYRILLIC SMALL LETTER O |
There are a few interesting aspects in this orthography profile.
# the difference between various "o" characters is mostly invisible on screen
"o" == "o" # these are the same Latin "o" characters, so this statement is true
## [1] TRUE
"o" == "о" # here a Latin "o" is compared to a Cyrillic "о", so this is false
## [1] FALSE
The columns with Unicode information can be suppressed with the option info = FALSE. There is also an option editing = TRUE, which adds further columns that can be filled in when editing the profile by hand.
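The profile below shows what write.profile gives for a small vector of test strings. A sketch of the presumed call (the vector test is reused with tokenize later in this vignette):
# a small vector of test strings; the same data is used with tokenize below
test <- c("this thing", "is", "a", "vector", "with", "many", "strings")
# produce an orthography profile of these strings
write.profile(test)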
Grapheme | Frequency | Codepoint | UnicodeName |
---|---|---|---|
  | 1 | U+0020 | SPACE |
a | 2 | U+0061 | LATIN SMALL LETTER A |
c | 1 | U+0063 | LATIN SMALL LETTER C |
e | 1 | U+0065 | LATIN SMALL LETTER E |
g | 2 | U+0067 | LATIN SMALL LETTER G |
h | 3 | U+0068 | LATIN SMALL LETTER H |
i | 5 | U+0069 | LATIN SMALL LETTER I |
m | 1 | U+006D | LATIN SMALL LETTER M |
n | 3 | U+006E | LATIN SMALL LETTER N |
o | 1 | U+006F | LATIN SMALL LETTER O |
r | 2 | U+0072 | LATIN SMALL LETTER R |
s | 4 | U+0073 | LATIN SMALL LETTER S |
t | 5 | U+0074 | LATIN SMALL LETTER T |
v | 1 | U+0076 | LATIN SMALL LETTER V |
w | 1 | U+0077 | LATIN SMALL LETTER W |
y | 1 | U+0079 | LATIN SMALL LETTER Y |
Normally, you won’t type your data directly into R, but load the data from some file with functions like scan or read.table, and then apply write.profile to the data. Given the information provided by the orthography profile, you might then want to go back to the original file, correct any inconsistencies, and check again to see whether everything is consistent now.
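A minimal sketch of that file-based workflow (the file name mydata.txt is hypothetical; one string per line is assumed):
# read raw strings from a file, one string per line (hypothetical file name)
strings <- scan("mydata.txt", what = "character", sep = "\n")
# create an orthography profile and write it next to the data for hand-editing
write.profile(strings, file = "mydata_profile.txt")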
In most cases you will probably want to use the function tokenize. Besides creating orthography profiles, it will also check orthography profiles against new data (and give warnings if something does not match), it will separate the input strings into graphemes, and it can even perform transliteration. Let’s run through a typical workflow using tokenize.
Given some data in a specific orthography, you can call tokenize on the data to create an initial orthography profile (just like with write.profile discussed above, though there are fewer options for the splitting of graphemes, the addition of Unicode information, etc.).
The output of tokenize is always a list of four elements: $strings, $profile, $errors, and $missing. The second element, $profile, is the table we already encountered above (though in a different order because of different default settings). The first element, $strings, is a table with the original strings and their tokenization into graphemes as specified by the orthography profile (which in the case below was produced automatically, so nothing unusual happens here, just a splitting into letters). The elements $errors and $missing are still empty at this stage, but they will contain information about strings that cannot be tokenized with a pre-established profile.
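For example, tokenizing the test strings from above without a pre-specified profile gives the output shown below; presumably it stems from a call along these lines:
# tokenize the test strings; no profile is given, so one is created on the fly
tokenize(test)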
## $strings
## originals tokenized
## 1 this thing t h i s t h i n g
## 2 is i s
## 3 a a
## 4 vector v e c t o r
## 5 with w i t h
## 6 many m a n y
## 7 strings s t r i n g s
##
## $profile
## Grapheme Frequency
## 1 y 1
## 2 w 1
## 3 v 1
## 4 t 5
## 5 s 4
## 6 r 2
## 7 o 1
## 8 n 3
## 9 m 1
## 10 i 5
## 11 h 3
## 12 g 2
## 13 e 1
## 14 c 1
## 15 a 2
## 16 1
##
## $errors
## NULL
##
## $missing
## NULL
Now, you can work further with this profile inside R, but it is easier to write the results to a file, correct/change that file by hand, and then use R to process the data again. In this vignette we will not actually write anything to your disk (so the following commands are not executed here), but you might try something like the following:
dir.create("~/Desktop/tokenize")
setwd("~/Desktop/tokenize")
tokenize(test, file.out = "test_profile.txt")
We are going to add two new “tailored grapheme clusters” to this profile: open the file “test_profile.txt” (in the folder “tokenize” on your Desktop) with a text editor like Sublime Text, Atom, TextMate, TextWrangler/BBEdit or Notepad++ (don’t use Microsoft Word!!!). First, add a new line with only “th” on it and, second, add another line with only “ng” on it. The file will then look roughly like this:
Grapheme | Frequency |
---|---|
y | 1 |
w | 1 |
v | 1 |
t | 5 |
s | 4 |
r | 2 |
o | 1 |
n | 3 |
m | 1 |
i | 5 |
h | 3 |
g | 2 |
e | 1 |
c | 1 |
a | 2 |
  | 1 |
th |  |
ng |  |
Now try to use this profile with the function tokenize. Note that you will get a different tokenization of the strings (“th” and “ng” are now treated as complex graphemes) and you will also obtain an updated orthography profile, which you could immediately use to overwrite the existing profile on your disk.
tokenize(test, profile = "test_profile.txt")
# with overwriting of the existing profile:
# tokenize(test, profile = "test_profile.txt", file.out = "test_profile.txt")
# note that you can abbreviate the argument names in R:
# tokenize(test, p = "test_profile.txt", f = "test_profile.txt")
## $strings
## originals tokenized
## 1 this thing th i s th i ng
## 2 is i s
## 3 a a
## 4 vector v e c t o r
## 5 with w i th
## 6 many m a n y
## 7 strings s t r i ng s
##
## $profile
## Grapheme Frequency
## 18 ng 2
## 17 th 3
## 16 1
## 15 a 2
## 14 c 1
## 13 e 1
## 12 g 0
## 11 h 0
## 10 i 5
## 9 m 1
## 8 n 1
## 7 o 1
## 6 r 2
## 5 s 4
## 4 t 2
## 3 v 1
## 2 w 1
## 1 y 1
##
## $errors
## NULL
##
## $missing
## NULL
Now that we have an orthography profile, we can use it on other data, using the profile to produce a tokenization and at the same time checking the data for any strings that do not appear in the profile (which might be errors in the data). Note that the following will give a warning, but it will still go through and produce output. All symbols that were not found in the orthography profile are simply separated according to Unicode grapheme definitions, a new orthography profile explicitly for this dataset is made, and the problematic strings are summarised in the $errors element of the output, linked to the original strings in which they occurred. In this way it is easy to find the problems in the data.
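The warning and output below presumably stem from a call like the following (a sketch, reusing the profile file created above):
# check new data against the existing profile
tokenize(c("think", "thin", "both"), profile = "test_profile.txt")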
## Warning in tokenize(c("think", "thin", "both"), profile = test_profile.txt):
## There were unknown characters found in the input data.
## Check output$errors for a table with all problematic strings.
## $strings
## originals tokenized
## 1 think th i n ⁇
## 2 thin th i n
## 3 both ⁇ o th
##
## $profile
## Grapheme Frequency
## 18 ng 0
## 17 th 3
## 16 0
## 15 a 0
## 14 c 0
## 13 e 0
## 12 g 0
## 11 h 0
## 10 i 2
## 9 m 0
## 8 n 2
## 7 o 1
## 6 r 0
## 5 s 0
## 4 t 0
## 3 v 0
## 2 w 0
## 1 y 0
##
## $errors
## originals errors
## 1 think th i n ⁇
## 3 both ⁇ o th
##
## $missing
## Grapheme Frequency Codepoint UnicodeName
## 1 b 1 U+0062 LATIN SMALL LETTER B
## 2 k 1 U+006B LATIN SMALL LETTER K
After tokenization, the resulting tokenized strings can be transliterated into a different orthographic representation by using the option transliterate. The replacements are then taken from the column of the orthography profile named by this option (by default this column is called “Replacement”, but other names can be used, and one orthography profile can include multiple transliteration columns).
To achieve contextually determined replacements (e.g. in Italian, where “c” is pronounced [tʃ] before the front vowels “i” and “e”, but [k] elsewhere), you can add a column “Right” to the profile that specifies the context to the right of the grapheme in which a replacement applies:
Grapheme | Right | IPA |
---|---|---|
c |  | k |
c | [ie] | tʃ |
n |  | n |
s |  | s |
a |  | a |
i |  | i |
To use this profile, you have to add the option
regex = TRUE
because contextual matching uses regular
expressions. Note that you can now also use regular expressions in the
specification of the context!
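A sketch of the kind of call behind the output below (the profile file name italian.txt is hypothetical; the transliteration column here is “IPA”):
# tokenize and transliterate using the contextual profile above
tokenize(c("casa", "cina"), profile = "italian.txt",
         transliterate = "IPA", regex = TRUE)$strings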
## originals tokenized transliterated
## 1 casa c a s a k a s a
## 2 cina c i n a tʃ i n a
Another possibility is to use a column “Class” to specify a class of graphemes, and then use this class name in the specification of the context. You are free to use any class name you like, as long as it doesn’t clash with the rest of the profile.
Grapheme | Right | Class | IPA |
---|---|---|---|
c |  |  | k |
c | frontV |  | tʃ |
n |  |  | n |
s |  |  | s |
a |  |  | a |
i |  | frontV | i |
e |  | frontV | e |
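Again a sketch of the kind of call that produces the output below (the file name italian_classes.txt is hypothetical):
# the class "frontV" in the Right column now refers to the graphemes i and e
tokenize(c("casa", "cina"), profile = "italian_classes.txt",
         transliterate = "IPA", regex = TRUE)$strings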
## originals tokenized transliterated
## 1 casa c a s a k a s a
## 2 cina c i n a tʃ i n a