Linguistic Variation Toolbox
The Linguistic Variation Toolbox (LVT) is MATLAB software for the study and characterization of linguistic variation through a mathematical and computational approach. It was developed by Acadèmia de su Sardu APS and released under the open-source Apache 2.0 license.
Usage guide
Installing and running
You can use this software for free on MATLAB Online. You might need to create a MathWorks.com
account for this, which is also free. Once you click the link and confirm the download of
the package, add the folder source to the MATLAB path as follows:
addpath("source");
To use it with a MATLAB installation on your own computer, you can either:
- install the Linguistic Variation Toolbox through MATLAB's Add-On Explorer, or
- download this code and add the folder source to the MATLAB path with addpath("source").
The Linguistic Variation Toolbox requires the following MATLAB toolboxes:
- Statistics and Machine Learning Toolbox.
Defining categories
The first step to using LVT is to define the categories in your data. For example, let us
imagine we are working on Sardinian and using its two macro-varieties as categories. For
short, we can model Campidanese with the string "C" and Logudorese-Nugorese with the string "L":
allCategories(["C", "L"]);
The categories can be any number of strings, and you are free to define them in any way that suits your research. To retrieve the list of categories after we have set them, we can run:
allCategories()
Defining a set of variants
LVT helps study the properties of sets of variants and the
patterns within. To do this, the toolbox provides an object called SetOfVariants.
From a linguistics point of view, to work with these variants you need to have:
- a set of transcription rules to be able to represent the variants as strings. You can use phonetic or orthographic transcriptions as suits your research.
- a way of measuring the distance between two transcribed variants.
Continuing with the example of Sardinian, let us assume we are using orthographic transcription according to the rules in Acadèmia de su Sardu's normative grammar "Su Sardu Standard". To measure distances between variants, let us assume we are using Levenshtein's distance:
The Levenshtein distance between two strings is the number of single character deletions, insertions, or substitutions required to transform one string into the other. This is also known as the edit distance.
For example, the Levenshtein distance between the strings cat and catfish is 4, because four insertions (f, i, s, h) turn the first string into the second.
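To make the definition concrete, here is a minimal sketch of the standard dynamic-programming computation of the Levenshtein distance. It is an illustration written for this guide in base MATLAB, not the code LVT uses internally, and the variable names are arbitrary.

```matlab
% Levenshtein distance between two words via dynamic programming.
a = char("cat");
b = char("catfish");

% cost(i+1, j+1) holds the distance between a(1:i) and b(1:j).
cost = zeros(numel(a) + 1, numel(b) + 1);
cost(:, 1) = (0:numel(a)).';   % turn a(1:i) into "" with i deletions
cost(1, :) = 0:numel(b);       % turn "" into b(1:j) with j insertions

for i = 1:numel(a)
    for j = 1:numel(b)
        substitutionCost = double(a(i) ~= b(j));
        cost(i + 1, j + 1) = min([ ...
            cost(i, j + 1) + 1, ...             % deletion
            cost(i + 1, j) + 1, ...             % insertion
            cost(i, j) + substitutionCost]);    % substitution or match
    end
end

distance = cost(end, end);
disp(distance)   % 4
```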
However, for our application we also ignore diacritics, so the distance between variants
that are spelled identically apart from the stress mark is set to 0. For example, the
distance between arrèxini and arrexìni is 0. This way of measuring distance
is the default for SetOfVariants objects.
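One possible way to ignore stress marks before comparing two variants is sketched below. This is only an illustration under stated assumptions: the lookup table covers a handful of accented vowels, it assumes precomposed characters, and the helper name stripStress is hypothetical. LVT's own default distance may handle diacritics differently.

```matlab
% Hypothetical helper: map accented vowels to their plain counterparts.
% ASSUMPTION: the input uses precomposed characters such as "è", and only
% the vowels listed here need to be handled.
accented = ["à", "è", "ì", "ò", "ù", "á", "é", "í", "ó", "ú"];
plain    = ["a", "e", "i", "o", "u", "a", "e", "i", "o", "u"];
stripStress = @(word) replace(word, accented, plain);

first  = stripStress("arrèxini");
second = stripStress("arrexìni");

% Both variants reduce to "arrexini", so any edit distance computed on the
% stripped strings is 0.
disp(first)
disp(second)
```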
One way to create the object is to list the variants of interest, the categories they belong to, and whether each variant is a reference within its category:
variants = ["ocisòrgiu", "ochisorzu", "bochisorzu"];
categories = ["C", "L", "L"];
isCategoryReference = [true, false, true];
set = SetOfVariants(variants, categories, isCategoryReference);
For example, the category reference could be the standard variant. If the mapping
between variants, categories, and category references is not this straightforward, we can
use VariantAttribute objects. Using VariantAttribute objects, the previous code can be
written as follows:
variants = ["ocisòrgiu", "ochisorzu", "bochisorzu"];
attributes = { ...
VariantAttribute("C", true), ...
VariantAttribute("L", false), ...
VariantAttribute("L", true)};
set = SetOfVariants(variants, attributes);
If we want to specify a custom distance function:
set = SetOfVariants(variants, attributes, DistanceFunction=@myCustomDistance);
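This guide does not spell out the signature that DistanceFunction expects, so check help SetOfVariants before writing your own. As a rough sketch, assuming the function receives two variants as strings and returns a non-negative scalar, a deliberately simple custom distance could look like this:

```matlab
% Hypothetical custom distance: 0 when two variants are identical up to
% letter case, 1 otherwise.
% ASSUMPTION: the function receives two variants as strings and returns a
% non-negative scalar; confirm the real signature with "help SetOfVariants".
myCustomDistance = @(a, b) double(lower(string(a)) ~= lower(string(b)));

disp(myCustomDistance("Ochisorzu", "ochisorzu"))    % 0
disp(myCustomDistance("ochisorzu", "bochisorzu"))   % 1
```

Here myCustomDistance is a variable that already holds a function handle, so under the same assumption it would be passed as DistanceFunction=myCustomDistance. The @myCustomDistance form is for a named function saved in its own file.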
Once the object has been created, we can view some data by accessing its properties:
set.VariantTable
set.DistanceTable
set.DistanceFunction
For the complete documentation of SetOfVariants objects, you can type:
help SetOfVariants
Representing the data graphically
To represent the set of variants graphically, one can type:
set.plot()
This will show the set of variants as a graph, where every variant corresponds to a node and the distance between two variants is related to the length of the arc between their two nodes. Only the arcs whose length is less than the median value are drawn, that is, the most statistically significant arcs.
Important: This representation does not reproduce the distances exactly, but it can highlight patterns within the set of variants. We can use these representations to formulate hypotheses on the data, which can then be checked using the statistics (see the following section in this guide).
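The median-based filtering of arcs can be sketched with MATLAB's built-in graph object. This is an illustration only: the distance matrix is made up, and it assumes that the median is taken over all pairwise distances and that the comparison is strict. LVT's actual plotting code may differ.

```matlab
% Hypothetical symmetric distance matrix for three variants.
names = ["ocisòrgiu", "ochisorzu", "bochisorzu"];
D = [0 3 4; ...
     3 0 1; ...
     4 1 0];

% Collect each pairwise distance once (strict upper triangle).
pairDistances = D(triu(true(size(D)), 1));
threshold = median(pairDistances);

% Keep only the arcs strictly shorter than the median; a zero entry in the
% adjacency matrix means "no arc".
A = D .* (D < threshold);
G = graph(A, names);

disp(G.Edges)
```

With this matrix only the arc between ochisorzu and bochisorzu survives, which can then be drawn with plot(G).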
Different options can be combined to represent the data graphically. For the full documentation, you can type:
help SetOfVariants/plot
The option CenterCategories expects two categories as input. It rotates and centres
the plot so that the two category references lie on a horizontal line through the middle of the plot.
The midpoint of the segment between the two category references is the centre of the plot.
set.plot(CenterCategories=["C", "L"])
The option PlacementAlgorithm changes the way the nodes are placed on the plot. It can
be mds (the default) or force. By default, the Linguistic Variation Toolbox will use
multidimensional scaling to represent the distances as accurately as possible and write
to the command line the maximum relative error in the plot. The force algorithm uses an
alternative approach to represent the graph, which often leads to better readability of
the variants in the plot.
set.plot(CenterCategories=["C", "L"], PlacementAlgorithm="force")
The option Mode toggles between the complete plot (default) and the proximal plot. The
proximal plot is a representation where every variant is connected to its closest
variant by a directed arc. This can be used to highlight other patterns within the
set of variants.
set.plot(CenterCategories=["C", "L"], PlacementAlgorithm="force", Mode="proximal")
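The idea behind the proximal plot can be sketched as a nearest-neighbour directed graph. The example below uses a made-up distance matrix and base MATLAB only. How LVT breaks ties between equally close variants is not documented here, so the tie rule in this sketch (first match wins) is an assumption.

```matlab
% Hypothetical symmetric distance matrix for three variants.
names = ["ocisòrgiu", "ochisorzu", "bochisorzu"];
D = [0 3 4; ...
     3 0 1; ...
     4 1 0];
n = size(D, 1);

% Exclude each variant from its own search by setting the diagonal to Inf.
Doff = D;
Doff(logical(eye(n))) = Inf;

% For each row, find the closest other variant (first match on ties).
[nearestDistance, nearest] = min(Doff, [], 2);

% One directed arc from every variant to its closest variant.
G = digraph(1:n, nearest.', nearestDistance.', names);
disp(G.Edges)
```

Here ocisòrgiu points to ochisorzu, while ochisorzu and bochisorzu point to each other.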
Computing statistics on the data
To study the statistics on sets of variants, one can type:
stats = set.computeStatistics()
This also accepts the option Quiet, which can be false (the default) or true. When
Quiet is false, MATLAB will display all the statistics it computed on the command
line. The data is stored in the output stats for further analysis.
The following statistics are computed both by category and on the overall set of variants:
- Diameter: the maximum distance between two variants.
- MeanDistance: the average distance between two variants.
- RangeDistance: the difference between the maximum and the minimum distance.
- MeanDistanceFromBaricentre: LVT computes the multidimensional scaling of the variants in the graph, that is, a representation of the variants as points whose geometric distances approximate the distances between the variants. It then computes the barycentre, i.e. the central point, of the obtained geometry, and averages the distances of the variants from this abstract point.
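The set-level statistics listed above can be reproduced by hand on a small distance matrix, as in the following sketch. The matrix is made up, each pair of variants is counted once, and cmdscale (classical multidimensional scaling from the Statistics and Machine Learning Toolbox) stands in for whatever scaling LVT performs internally. These are assumptions, so the values returned by computeStatistics may be defined slightly differently.

```matlab
% Hypothetical symmetric distance matrix for three variants.
D = [0 3 4; ...
     3 0 1; ...
     4 1 0];

% Each pairwise distance once (strict upper triangle).
pairDistances = D(triu(true(size(D)), 1));

diameter      = max(pairDistances);                        % 4
meanDistance  = mean(pairDistances);                       % 8/3
rangeDistance = max(pairDistances) - min(pairDistances);   % 3

% Classical multidimensional scaling: one row of coordinates per variant.
Y = cmdscale(D);

% Barycentre of the points and distance of every variant from it.
barycentre = mean(Y, 1);
distanceFromBarycentre = vecnorm(Y - barycentre, 2, 2);
meanDistanceFromBarycentre = mean(distanceFromBarycentre);

disp([diameter, meanDistance, rangeDistance, meanDistanceFromBarycentre])
```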
For every variant in the category or in the overall set of variants, the following statistics are computed:
- WeightedDegree: this is a graph-based metric that is equal to the sum of all the distances between the current variant and the other variants.
- MeanDistance: similar to the previous metric, but divided (i.e. normalized) by the number of the other variants. The smaller the MeanDistance, the more central the variant is in a graph-theoretical sense.
- RangeDistance: difference between the distance of the farthest and the closest variant in the set, with respect to the current variant.
- Closeness: the inverse of the WeightedDegree metric.
- DistanceFromBaricentre: the distance from the current variant to the geometric barycentre, an abstract point that represents the geometric centre of the variants but that almost never corresponds to any actual variant.
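The per-variant metrics follow directly from the rows of the distance matrix. The sketch below computes them for a made-up matrix in base MATLAB. It follows the definitions given in the list above and is not taken from LVT's source, so treat it as an illustration.

```matlab
% Hypothetical symmetric distance matrix for three variants.
D = [0 3 4; ...
     3 0 1; ...
     4 1 0];
n = size(D, 1);

% Sum of the distances from each variant to all the others.
weightedDegree = sum(D, 2);               % [7; 4; 5]

% Same sum, normalized by the number of other variants.
meanDistance = weightedDegree / (n - 1);  % [3.5; 2; 2.5]

% Farthest minus closest other variant; the diagonal is masked with NaN,
% which max and min ignore by default.
Doff = D;
Doff(logical(eye(n))) = NaN;
rangeDistance = max(Doff, [], 2) - min(Doff, [], 2);   % [1; 2; 3]

% Inverse of the weighted degree.
closeness = 1 ./ weightedDegree;

disp(table(weightedDegree, meanDistance, rangeDistance, closeness))
```

In this example the second variant has the smallest MeanDistance and the largest Closeness, so it is the most central of the three.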
How to cite
Acadèmia de su Sardu APS (2023). Linguistic Variation Toolbox, version XXXX. [https://github.com/academiadesusardu/linguistic-variation-toolbox].
In the above, substitute XXXX with the release number.
Cite As
Acadèmia de su Sardu APS (2023). linguistic-variation-toolbox (https://github.com/academiadesusardu/linguistic-variation-toolbox/releases/tag/1.0.0), GitHub. Retrieved September 2, 2023.