
iso639: tweak name-inclusive search algorithm to check names last · 58260a7a
Lyndon Brown authored May 26, 2021 and
Jean-Baptiste Kempf committed Jun 12, 2021
under this algorithm codes are searched first, names second. (though of
course codes are skipped if length is not 2 or 3).
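
the ordering described above can be sketched roughly as follows. this is
a minimal illustration, not the actual VLC code: the `lang_rec` layout,
the three-row table, and the `eq_nocase` and `find_lang` names are all
hypothetical, a single three-letter code per record is a simplification,
and case-insensitive comparison is assumed for codes as well as names.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* hypothetical record layout, for illustration only */
typedef struct {
    const char *name;   /* English display name */
    const char *code1;  /* two-letter code, "" if none */
    const char *code2;  /* three-letter code */
} lang_rec;

static const lang_rec table[] = {
    { "English",   "en", "eng" },
    { "French",    "fr", "fra" },
    { "Mongolian", "mn", "mon" },
};
#define N_RECS (sizeof(table) / sizeof(table[0]))

/* case-insensitive equality over the whole of both strings */
int eq_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both ended together */
}

const lang_rec *find_lang(const char *s)
{
    const size_t len = strlen(s);

    /* pass 1: codes, skipped unless the input is 2 or 3 characters */
    if (len == 2 || len == 3) {
        for (size_t i = 0; i < N_RECS; i++) {
            const char *code = (len == 2) ? table[i].code1
                                          : table[i].code2;
            if (eq_nocase(code, s))
                return &table[i];
        }
    }

    /* pass 2: names, only reached when no code matched */
    for (size_t i = 0; i < N_RECS; i++)
        if (eq_nocase(table[i].name, s))
            return &table[i];

    return NULL;
}
```

with this table, `find_lang("fr")` and `find_lang("french")` both return
the "French" record, while `find_lang("francais")` returns NULL.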

in short, this is partly because where a short name clashes with a code
(which can happen, as discussed below), i feel that it is best to prefer
the code match, and partly because the disadvantages of using names as
identifiers, also discussed below, make codes the more likely choice for
most languages.

please see the note in the previous commit about how this is only of
relevance to users specifying language preference for dvdnav and bluray,
with `--sub-language`, `--audio-language` and `--menu-language` options.

-- names as identifiers --

the disadvantages/problems of using names as identifiers:

 1. first, note that since 236ca7ae
    introduced the possibility, the help text of the few relevant
    options has never actually informed users that it was possible. (in
    fact they also fail to clarify that the mentioned codes to be used
    are iso-639). so presumably any users of these options are more
    used to using codes already.

 2. only English names are available, since only English names exist in
    the table, and the lookups have never involved translation. this
    makes the feature less useful than was envisaged in the log of the
    commit that introduced it. so for instance you can use "french" but
    not in fact "francais".

 3. the names in the iso-639 table are primarily intended for display
    purposes rather than matching purposes. while many of the names are
    simple like "English" and "French", working just fine for the type of
    lookup performed, many are not so ideal like the following examples
    (some of which have been picked from the MR 146 update ([1])):

      - "Greek, Modern" (updated to "Greek, Modern (1453-)")
      - "Chichewa; Nyanja"
      - "Sotho, Southern"
      - "Tonga (Tonga Islands)"
      - "North Azerbaijani"
      - "Limburgan; Limburger; Limburgish"
      - "Gaelic; Scottish Gaelic"
      - "Interlingua (International Auxiliary Language Association)"
      - "Altaic (Other)"
      - "Apache languages"

    there are many such examples (especially after MR 146). (we would
    not want to rename them to be better identifiers, since this would
    make them less ideal for their primary display purpose, and it
    could make future updates from the glibc set much harder).

 4. it is not even possible for users to easily discover the (English)
    language names (or rather labels?) that are available for use
    instead of codes. not all are easily guessable.

 5. as shown by the MR 146 update, the names are far more prone to
    change than codes. this creates a backwards compatibility problem
    for both CLI use and saved settings. (we should not avoid such
    updates merely to preserve that backwards compatibility).

so, aside from some cases like "english" and "french", which are ideal
and reliable, codes are the better choice for most languages, which is a
further argument for checking codes first. though of course the
name/code clash issue discussed next is more significant.
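
to illustrate why display-oriented labels make awkward identifiers, here
is a small sketch. it assumes the name lookup is a case-insensitive
comparison against the whole stored label, which is what the discussion
above implies; the `names` array and the `eq_nocase` and `match_name`
helpers are hypothetical and hold only a few of the labels quoted above.

```c
#include <ctype.h>
#include <stddef.h>

/* a few display labels quoted in the list above */
static const char *const names[] = {
    "English",
    "French",
    "Greek, Modern (1453-)",
    "Chichewa; Nyanja",
    "Sotho, Southern",
};
#define N_NAMES (sizeof(names) / sizeof(names[0]))

/* case-insensitive equality over the whole of both strings */
int eq_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both ended together */
}

/* returns the stored label that s matches in full, or NULL */
const char *match_name(const char *s)
{
    for (size_t i = 0; i < N_NAMES; i++)
        if (eq_nocase(names[i], s))
            return names[i];
    return NULL;
}
```

under that assumption `match_name("french")` succeeds, but
`match_name("greek")` and `match_name("nyanja")` return NULL: the user
would have to type the full label, punctuation included.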

-- result differences --

the results given are identical with the current data set, since there are
currently no records where the 3-char name of one matches (ignoring case)
the iso-639-2 code of another.

if/when MR 146 is merged, there are some such clashing records added, and
considering the order of the records (preserved from glibc order to make
updates easier if for no other reason), a different match would be returned
by this algorithm than the previous one, preferring now the code-based
match of the later record over the name-based match of the earlier.

the clashing records of interest are:
 - "Kru" and "Kurukh", with the latter having an ISO-639-2 code of "kru".
 - "Mon" and "Mongolian", with the latter having a code of "mon".

with "Kru" coming before "Kurukh" and "Mon" before "Mongolian", use of "kru"
and "mon" with the previous algorithm would have matched "Kru" and "Mon"
named records respectively, while the new algorithm will instead match
"Kurukh" and "Mongolian" respectively, preferring the code-based match.

(MR 146 with the old algorithm actually introduces a regression for
"Mongolian" in that "mon" then matches the "Mon" record, whilst the new
algorithm fixes that, restoring the "Mongolian" match).

thus with the previous algorithm "Kurukh" and "Mongolian" could only be
matched via their full names (or the "mn" iso-639-1 code in the Mongolian
case), whereas with the new algorithm they can also be reached via their
iso-639-2 codes. conversely, "Kru" and "Mon" can now only be reached via
their codes ("kro" and "mnw" respectively), no longer via their names.
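
the difference can be reproduced with a small sketch over just the four
clashing records, kept in the order described above. the `lang_rec`
layout and all function names are hypothetical, and `find_single_pass`
is only an approximation of the previous behaviour as inferred from this
description (each record tested by name and code together, earliest
record wins), not the actual earlier code.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* hypothetical record layout, for illustration only */
typedef struct {
    const char *name;   /* English display name */
    const char *code1;  /* iso-639-1 code, "" if none */
    const char *code2;  /* iso-639-2 code */
} lang_rec;

/* the clashing records, "Kru" before "Kurukh", "Mon" before "Mongolian" */
static const lang_rec table[] = {
    { "Kru",       "",   "kro" },
    { "Kurukh",    "",   "kru" },
    { "Mon",       "",   "mnw" },
    { "Mongolian", "mn", "mon" },
};
#define N_RECS (sizeof(table) / sizeof(table[0]))

/* case-insensitive equality over the whole of both strings */
int eq_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* codes are only considered for 2- or 3-character input */
int code_matches(const lang_rec *r, const char *s, size_t len)
{
    if (len == 2) return eq_nocase(r->code1, s);
    if (len == 3) return eq_nocase(r->code2, s);
    return 0;
}

/* assumed previous behaviour: one pass, name or code, first record wins */
const lang_rec *find_single_pass(const char *s)
{
    const size_t len = strlen(s);
    for (size_t i = 0; i < N_RECS; i++)
        if (eq_nocase(table[i].name, s) || code_matches(&table[i], s, len))
            return &table[i];
    return NULL;
}

/* new behaviour: every code is checked before any name */
const lang_rec *find_codes_first(const char *s)
{
    const size_t len = strlen(s);
    for (size_t i = 0; i < N_RECS; i++)
        if (code_matches(&table[i], s, len))
            return &table[i];
    for (size_t i = 0; i < N_RECS; i++)
        if (eq_nocase(table[i].name, s))
            return &table[i];
    return NULL;
}
```

here `find_single_pass("kru")` and `find_single_pass("mon")` return the
"Kru" and "Mon" records, while `find_codes_first("kru")` and
`find_codes_first("mon")` return "Kurukh" and "Mongolian". the "kro",
"mnw" and "mn" codes resolve identically under both functions.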

either way there's an unavoidable imperfection in doing a case-insensitive
name-inclusive lookup, but i feel code-based being primary is best; we don't
necessarily want to ditch the name-based lookup considering those languages
that it does work well for, and i don't expect we would want to make the
name search case-sensitive, requiring capitals.

[1]: !146