Linguistic Variation Toolbox

Version 1.0.0 (31.7 KB) by Andria
MATLAB Toolbox for the characterization of linguistic variation.
Updated 2 Sep 2023


The Linguistic Variation Toolbox (LVT) is a MATLAB toolbox for the study and characterization of linguistic variation through a mathematical and computational approach. It was developed by Acadèmia de su Sardu APS and is released under the open-source Apache 2.0 license.

Usage guide

Installing and running

You can use this software for free on MATLAB Online. You may need to create a MathWorks.com account for this, which is also free. Once you follow the "Open in MATLAB Online" link and confirm the download of the package, add the folder source to the MATLAB path as follows:

addpath("source");

If you want to use it with a MATLAB installation on your own computer, you can either:

  • install the Linguistic Variation Toolbox through MATLAB's Add-On Explorer, or
  • download this code and add the folder source to the MATLAB path with addpath("source").

The Linguistic Variation Toolbox requires the following MATLAB toolboxes:

  • Statistics and Machine Learning Toolbox.

Defining categories

The first step to using LVT is to define the categories in your data. For example, let us imagine we are working on Sardinian and using its two macro-varieties as categories. For short, we can model Campidanese with the string "C" and Logudorese-Nugorese with the string "L":

allCategories(["C", "L"]);

The categories can be any number of strings, and you are free to define them in any way that suits your research. To retrieve the list of categories after we have set them, we can run:

allCategories()

Defining a set of variants

LVT helps you study the properties of sets of variants and the patterns within them. To do this, the toolbox provides an object called SetOfVariants.

From a linguistics point of view, to work with these variants you need to have:

  • a set of transcription rules to be able to represent the variants as strings. You can use phonetic or orthographic transcriptions as suits your research.
  • a way of measuring the distance between two transcribed variants.

Continuing with the example of Sardinian, let us assume we are using orthographic transcription according to the rules in Acadèmia de su Sardu's normative grammar "Su Sardu Standard". To measure distances between variants, let us assume we are using Levenshtein's distance:

The Levenshtein distance between two strings is the number of single character deletions, insertions, or substitutions required to transform one string into the other. This is also known as the edit distance.

For example, the Levenshtein distance between the strings cat and catfish is 4. However, for our application we also ignore diacritics, so the distance between variants that are written identically apart from the stress mark is 0. For example, the distance between arrèxini and arrexìni is 0. This way of measuring distance is the default for SetOfVariants objects.
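To make the definition concrete, here is a minimal sketch of a Levenshtein distance that ignores stress marks. It is not the toolbox's own implementation: the function names levenshteinSketch and stripStressSketch are hypothetical, and the sketch assumes lowercase text with precomposed accented vowels.

```matlab
% Sketch only: not the implementation used by SetOfVariants.
dPlain  = levenshteinSketch("cat", "catfish");
dStress = levenshteinSketch(stripStressSketch("arrèxini"), ...
                            stripStressSketch("arrexìni"));
fprintf("cat / catfish: %d\n", dPlain);
fprintf("arrèxini / arrexìni: %d\n", dStress);

function d = levenshteinSketch(a, b)
    % Classic dynamic-programming edit distance between two strings.
    a = char(a);
    b = char(b);
    m = numel(a);
    n = numel(b);
    D = zeros(m + 1, n + 1);
    D(:, 1) = (0:m)';   % cost of deleting the first i characters of a
    D(1, :) = 0:n;      % cost of inserting the first j characters of b
    for i = 1:m
        for j = 1:n
            substitutionCost = double(a(i) ~= b(j));
            D(i + 1, j + 1) = min([ ...
                D(i, j + 1) + 1, ...              % deletion
                D(i + 1, j) + 1, ...              % insertion
                D(i, j) + substitutionCost]);     % substitution or match
        end
    end
    d = D(m + 1, n + 1);
end

function s = stripStressSketch(s)
    % Replace accented vowels with their plain counterparts
    % (assumption: lowercase, precomposed characters only).
    s = char(s);
    accented = 'àèìòùáéíóú';
    plain    = 'aeiouaeiou';
    [isAccented, idx] = ismember(s, accented);
    s(isAccented) = plain(idx(isAccented));
end
```

Run as a script, this prints a distance of 4 for cat / catfish and 0 for arrèxini / arrexìni, matching the two examples above.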

One way to create the object is to list the variants of interest, the categories they belong to, and whether each variant is a reference within its category:

variants = ["ocisòrgiu", "ochisorzu", "bochisorzu"];
categories = ["C", "L", "L"];
isCategoryReference = [true, false, true];
set = SetOfVariants(variants, categories, isCategoryReference);

For example, the category reference could be the standard variant. If the mapping between variants, categories, and category references is not this straightforward, we can use VariantAttribute objects. Using VariantAttribute objects, the previous code can be written as follows:

variants = ["ocisòrgiu", "ochisorzu", "bochisorzu"];
attributes = { ...
    VariantAttribute("C", true), ...
    VariantAttribute("L", false), ...
    VariantAttribute("L", true)};
set = SetOfVariants(variants, attributes);

If we want to specify a custom distance function:

set = SetOfVariants(variants, attributes, DistanceFunction=@myCustomDistance);
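This guide does not spell out the signature that DistanceFunction expects, so the following is only a sketch under an assumption: that the function receives two variants as strings and returns a non-negative scalar. The body of myCustomDistance shown here (a Jaccard distance on character bigrams) is a hypothetical example; check help SetOfVariants for the actual contract before using it.

```matlab
% Hypothetical custom distance, assuming a (string, string) -> scalar contract.
d = myCustomDistance("ochisorzu", "bochisorzu");
fprintf("Bigram Jaccard distance: %.4f\n", d);

function d = myCustomDistance(a, b)
    % Jaccard distance between the sets of character bigrams of a and b:
    % 0 when the sets coincide, 1 when they share no bigram.
    bigramsA = bigramSet(a);
    bigramsB = bigramSet(b);
    allBigrams = union(bigramsA, bigramsB);
    if isempty(allBigrams)
        d = 0;   % both inputs are too short to contain a bigram
        return
    end
    shared = intersect(bigramsA, bigramsB);
    d = 1 - numel(shared) / numel(allBigrams);
end

function bigrams = bigramSet(s)
    % Unique two-character substrings of s, returned as a string array.
    s = char(s);
    pairs = [s(1:end-1)' s(2:end)'];   % one bigram per row
    bigrams = unique(string(pairs));
end
```

For the two variants above, all eight bigrams of ochisorzu also occur in bochisorzu, which has nine, so the distance is 1/9.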

Once the object has been created, we can view some of its data by accessing its properties:

set.VariantTable
set.DistanceTable
set.DistanceFunction

For the complete documentation on SetOfVariants objects, you can type:

help SetOfVariants

Representing the data graphically

To represent the set of variants graphically, one can type:

set.plot()

This shows the set of variants as a graph in which every variant is a node and the length of the arc between two nodes is related to the distance between the corresponding variants. Only the arcs whose length is less than the median value are drawn, that is, the most statistically significant arcs.

[Figure: plot with no options]
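The median rule can be illustrated on a plain distance matrix. The sketch below is not taken from the toolbox: the matrix values are made up, and counting each pair once and using a strict "less than" comparison are assumptions.

```matlab
% Illustrative symmetric distance matrix for four variants (made-up values).
D = [0 3 4 2;
     3 0 1 5;
     4 1 0 6;
     2 5 6 0];

% Collect each pair once: upper triangle, diagonal excluded.
isPair = triu(true(size(D)), 1);
threshold = median(D(isPair));

% Keep only the arcs strictly shorter than the median distance.
keep = isPair & (D < threshold);
[from, to] = find(keep);
arcs = table(from, to, D(keep), VariableNames=["From", "To", "Distance"]);
disp(arcs)
```

With these values the median of the six pairwise distances is 3.5, so three arcs (with distances 3, 1, and 2) are kept.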

Important: this representation does not reproduce the distances exactly, but it can highlight patterns within the set of variants. We can use these representations to formulate hypotheses on the data, which can then be verified using the statistics (see the section on computing statistics later in this guide).

Different options can be combined to represent the data graphically. For the full documentation, you can type:

help SetOfVariants/plot

The option CenterCategories expects two categories as input. It rotates and centres the plot so that the two category references lie on a line in the middle of the plot. The midpoint of the segment between the two category references becomes the centre of the plot.

set.plot(CenterCategories=["C", "L"])

[Figure: plot with centred categories]

The option PlacementAlgorithm changes the way the nodes are placed on the plot. It can be "mds" (the default) or "force". By default, the Linguistic Variation Toolbox uses multi-dimensional scaling to represent the distances as accurately as possible, and prints the maximum relative error of the plot to the command line. The force algorithm uses an alternative approach to lay out the graph, which often makes the variants easier to read in the plot.

set.plot(CenterCategories=["C", "L"], PlacementAlgorithm="force")

[Figure: plot with force placement]
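To see why a two-dimensional plot cannot always reproduce the distances exactly, the following sketch runs classical multi-dimensional scaling on a small matrix and measures the resulting error. It uses cmdscale, pdist, and squareform from the Statistics and Machine Learning Toolbox. The matrix values are made up, and the error formula is an assumption: the toolbox may define its maximum relative error differently.

```matlab
% Illustrative symmetric distance matrix for four variants (made-up values).
D = [0 3 4 2;
     3 0 2 4;
     4 2 0 5;
     2 4 5 0];

% Classical multi-dimensional scaling: one row of coordinates per variant.
Y = cmdscale(D);
Y(:, end+1:2) = 0;     % pad with zeros if fewer than two dimensions came back
XY = Y(:, 1:2);        % coordinates that a 2-D plot would use

% Pairwise distances as they would appear in the 2-D layout.
Dplot = squareform(pdist(XY));

% Relative error of every pairwise distance, ignoring the zero diagonal.
offDiagonal = ~logical(eye(size(D)));
relativeError = abs(Dplot(offDiagonal) - D(offDiagonal)) ./ D(offDiagonal);
fprintf("Maximum relative error: %.1f%%\n", 100 * max(relativeError));
```

A large value here is a reminder that the picture is an approximation, which is why the statistics are the better tool for checking a hypothesis.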

The option Mode toggles between the complete plot (the default) and the proximal plot. In the proximal plot, every variant is connected to its closest variant by a directed arc. This can be used to highlight other patterns within the set of variants.

set.plot(CenterCategories=["C", "L"], PlacementAlgorithm="force", Mode="proximal")

[Figure: proximal plot]
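The idea behind the proximal plot can be sketched with MATLAB's digraph object. This is an illustration, not the toolbox's code: the distance values are illustrative, and the way ties are broken (min picks the first closest variant) is an assumption.

```matlab
variants = ["ocisòrgiu", "ochisorzu", "bochisorzu"];

% Illustrative symmetric distance matrix, one row and column per variant.
D = [0 3 4;
     3 0 1;
     4 1 0];
n = size(D, 1);

% Exclude self-distances, then find the closest other variant in each row.
Dsearch = D;
Dsearch(logical(eye(n))) = Inf;
[closestDistance, closest] = min(Dsearch, [], 2);

% One directed arc from each variant to its closest variant.
G = digraph((1:n)', closest, closestDistance, variants);
disp(G.Edges)
```

Here ocisòrgiu points to ochisorzu, while ochisorzu and bochisorzu point to each other.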

Computing statistics on the data

To study the statistics on sets of variants, one can type:

stats = set.computeStatistics()

This also accepts the option Quiet, which can be false (the default) or true. When Quiet is false, MATLAB will display all the statistics it computed on the command line. The data is stored in the output stats for further analysis.

The following statistics are computed both by category and on the overall set of variants:

  • Diameter: the maximum distance between two variants.
  • MeanDistance: the average distance between two variants.
  • RangeDistance: the difference between the maximum and the minimum distance.
  • MeanDistanceFromBaricentre: LVT computes the multi-dimensional scaling of the variants in the graph, that is, a representation of the variants as points in space whose mutual distances reflect the measured ones. It then computes the barycentre, i.e. the central point, of these points. Finally, it takes the mean of the distances of the variants from this abstract point.
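The set-level statistics listed above can be reproduced by hand on a small distance matrix. The sketch below is an assumption about how such quantities might be computed, not the toolbox's code. The distance values are illustrative, and cmdscale comes from the Statistics and Machine Learning Toolbox.

```matlab
% Illustrative symmetric distance matrix for three variants.
D = [0 3 4;
     3 0 1;
     4 1 0];

% Each pair of variants counted once (upper triangle, diagonal excluded).
pairDistances = D(triu(true(size(D)), 1));

diameter      = max(pairDistances);                        % 4
meanDistance  = mean(pairDistances);                       % 8/3
rangeDistance = max(pairDistances) - min(pairDistances);   % 3

% Place the variants in space with classical multi-dimensional scaling,
% then measure how far each one lies from the central point.
Y = cmdscale(D);
centralPoint = mean(Y, 1);
distanceFromCentre = vecnorm(Y - centralPoint, 2, 2);
meanDistanceFromCentre = mean(distanceFromCentre);

fprintf("Diameter: %g, mean: %.4f, range: %g\n", ...
    diameter, meanDistance, rangeDistance);
fprintf("Mean distance from the central point: %.4f\n", meanDistanceFromCentre);
```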

For every variant in the category or in the overall set of variants, the following statistics are computed:

  • WeightedDegree: this is a graph-based metric that is equal to the sum of all the distances between the current variant and the other variants.
  • MeanDistance: similar to the previous metric, but divided (i.e. normalized) by the number of the other variants. The smaller the MeanDistance, the more central the variant in a graph-theoretical sense.
  • RangeDistance: difference between the distance of the farthest and the closest variant in the set, with respect to the current variant.
  • Closeness: the inverse of the WeightedDegree metric.
  • DistanceFromBaricentre: the distance from the current variant to the geometric barycentre, an abstract point that represents the geometric centre of the variants but almost never corresponds to any actual variant.
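The per-variant metrics can be sketched in the same way. Again, this is an illustration rather than the toolbox's code: the distance values are illustrative, and reading "inverse" as the reciprocal 1/WeightedDegree is an assumption. DistanceFromBaricentre is left out because it depends on the multi-dimensional scaling step.

```matlab
variants = ["ocisòrgiu"; "ochisorzu"; "bochisorzu"];

% Illustrative symmetric distance matrix, one row and column per variant.
D = [0 3 4;
     3 0 1;
     4 1 0];
n = size(D, 1);

% Sum of the distances from each variant to all the others.
weightedDegree = sum(D, 2);                 % [7; 4; 5]

% Same quantity, normalized by the number of other variants.
meanDistance = weightedDegree / (n - 1);    % [3.5; 2; 2.5]

% Farthest minus closest other variant (self-distances masked with NaN,
% which max and min skip by default).
Dother = D;
Dother(logical(eye(n))) = NaN;
rangeDistance = max(Dother, [], 2) - min(Dother, [], 2);   % [1; 2; 3]

% Assumption: "inverse" means the reciprocal of the weighted degree.
closeness = 1 ./ weightedDegree;

disp(table(variants, weightedDegree, meanDistance, rangeDistance, closeness))
```

In this small example ochisorzu has the smallest MeanDistance and the largest Closeness, so it is the most central of the three variants.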

How to cite

Acadèmia de su Sardu APS (2023). Linguistic Variation Toolbox, version XXXX. [https://github.com/academiadesusardu/linguistic-variation-toolbox].

In the above, substitute XXXX with the release number.

Cite As

Acadèmia de su Sardu APS (2023). linguistic-variation-toolbox (https://github.com/academiadesusardu/linguistic-variation-toolbox/releases/tag/1.0.0), GitHub. Retrieved September 2, 2023.

MATLAB Release Compatibility
Created with R2023a
Compatible with any release
Platform Compatibility
Windows macOS Linux
Version Published Release Notes
1.0.0

To view or report issues in this GitHub add-on, visit the GitHub Repository.