    iso639: tweak name-inclusive search algorithm to check names last · 58260a7a
    Lyndon Brown authored and Jean-Baptiste Kempf committed
    under this algorithm codes are searched first and names second (though of
    course the code search is skipped if the query length is not 2 or 3).
    
    in short, this is partly because where a short name clashes with a code
    (which can happen, as discussed below), i feel that it is best to prefer
    the code match, and partly because the disadvantages of using names as
    identifiers, also discussed below, make codes the more likely choice
    for most languages.
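
    as a rough illustration of the ordering described above, here is a minimal
    sketch of a codes-first, names-last lookup. the `lang_rec` struct,
    `sample_table`, `eq_nocase()` and `find_lang()` are hypothetical names made
    up for this example and are not the actual VLC code. it also assumes a
    single 3-letter code per record and case-insensitive code matching, which
    are simplifications.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout for illustration; not the actual VLC type. */
struct lang_rec {
    const char *name;   /* English display name */
    const char *code1;  /* ISO 639-1 code, "" when none exists */
    const char *code2;  /* ISO 639-2 code (one per record, a simplification) */
};

/* Tiny sample table; the real table is far larger. */
static const struct lang_rec sample_table[] = {
    { "English", "en", "eng" },
    { "French",  "fr", "fra" },
};

/* Case-insensitive equality of two NUL-terminated strings. */
static int eq_nocase(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

/* Codes first, names last. Returns NULL when nothing matches. */
static const struct lang_rec *find_lang(const char *query)
{
    assert(query != NULL);
    const size_t count = sizeof(sample_table) / sizeof(sample_table[0]);
    const size_t len = strlen(query);

    /* Pass 1: codes, attempted only for 2- or 3-character queries. */
    if (len == 2 || len == 3) {
        for (size_t i = 0; i < count; i++) {
            const char *code = (len == 2) ? sample_table[i].code1
                                          : sample_table[i].code2;
            if (eq_nocase(code, query))
                return &sample_table[i];
        }
    }

    /* Pass 2: names, reached only if no code matched. */
    for (size_t i = 0; i < count; i++) {
        if (eq_nocase(sample_table[i].name, query))
            return &sample_table[i];
    }
    return NULL;
}
```

    with this sketch, "fr", "ENG" and "french" all resolve to a record, while
    "francais" resolves to nothing since only English names are in the table.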
    
    as noted in the previous commit, this is only of relevance to users
    specifying a language preference for dvdnav and bluray with the
    `--sub-language`, `--audio-language` and `--menu-language` options.
    
    -- names as identifiers --
    
    the disadvantages/problems of using names as identifiers:
    
     1. first, note that ever since 236ca7ae introduced the possibility, the
        help text of the few relevant options has never actually told users
        that names are accepted. (in fact it also fails to clarify that the
        codes to be used are iso-639). so presumably any users of these
        options are already used to using codes.
    
     2. only English names are available, since only English names exist in
        the table, and the lookups have never involved translation. this
        makes the feature less useful than envisaged in the log of the
        commit that introduced it. so for instance you can use "french" but
        not "francais".
    
     3. the names in the iso-639 table are primarily intended for display
        purposes rather than matching purposes. while many of the names are
        simple like "English" and "French", working just fine for the type of
        lookup performed, many are not so ideal like the following examples
        (some of which have been picked from the MR 146 update ([1])):
    
          - "Greek, Modern" (updated to "Greek, Modern (1453-)")
          - "Chichewa; Nyanja"
          - "Sotho, Southern"
          - "Tonga (Tonga Islands)"
          - "North Azerbaijani"
          - "Limburgan; Limburger; Limburgish"
          - "Gaelic; Scottish Gaelic"
          - "Interlingua (International Auxiliary Language Association)"
          - "Altaic (Other)"
          - "Apache languages"
    
        there are many such examples (especially after MR 146). (we would
        not want to rename them to be better identifiers, since this would
        make them less ideal for their primary display purpose, and it
        could make future updates from the glibc set much harder).
    
     4. it is not even possible for users to easily discover the (English)
        language names (or rather labels?) that are available for use
        instead of codes. not all are easily guessable.
    
     5. as shown by the MR 146 update, the names are far more prone to
        change than codes. this creates a backwards compatibility problem
        both for CLI use and saved settings. (we should not avoid such
        updates just to preserve backwards compatibility).
    
    so, aside from some cases like "english" and "french", which are ideal
    and reliable, for most languages codes are the better choice, which
    strengthens the case for checking codes first. though of course the
    name/code clash issue discussed next is more significant.
    
    -- result differences --
    
    the results given are identical with the current data set, since there are
    currently no records where the 3-char name of one matches (ignoring case)
    the iso-639-2 code of another.
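
    a check of this kind could be sketched as below. this is only an
    illustration: `lang_rec`, `eq_nocase()`, `count_clashes()` and both sample
    tables are hypothetical and not part of VLC, and the sketch also considers
    2-char names against iso-639-1 codes, which goes slightly beyond the 3-char
    case mentioned above. the rows in `with_clash` are the clashing records
    described later in this message.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout for illustration; not the actual VLC type. */
struct lang_rec {
    const char *name;   /* English display name */
    const char *code1;  /* ISO 639-1 code, "" when none exists */
    const char *code2;  /* ISO 639-2 code */
};

/* Sample rows only; neither table reflects the full data set. */
static const struct lang_rec without_clash[] = {
    { "English",   "en", "eng" },
    { "French",    "fr", "fra" },
    { "Mongolian", "mn", "mon" },
};

static const struct lang_rec with_clash[] = {
    { "Kru",       "",   "kro" },
    { "Kurukh",    "",   "kru" },
    { "Mon",       "",   "mnw" },
    { "Mongolian", "mn", "mon" },
};

/* Case-insensitive equality of two NUL-terminated strings. */
static int eq_nocase(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

/* Count records whose short name equals the code of a different record. */
static size_t count_clashes(const struct lang_rec *recs, size_t count)
{
    size_t clashes = 0;
    for (size_t i = 0; i < count; i++) {
        const size_t len = strlen(recs[i].name);
        if (len != 2 && len != 3)
            continue; /* only names of code length can collide with a code */
        for (size_t j = 0; j < count; j++) {
            if (j == i)
                continue;
            const char *code = (len == 2) ? recs[j].code1 : recs[j].code2;
            if (eq_nocase(recs[i].name, code))
                clashes++;
        }
    }
    return clashes;
}
```

    on these sample rows the first table yields no clashes, while the second
    yields two ("Kru" against "kru", and "Mon" against "mon").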
    
    if/when MR 146 is merged, there are some such clashing records added, and
    considering the order of the records (preserved from glibc order to make
    updates easier if for no other reason), a different match would be returned
    by this algorithm than the previous one, preferring now the code-based
    match of the later record over the name-based match of the earlier.
    
    the clashing records of interest are:
     - "Kru" and "Kurukh", with the latter having an ISO-639-2 code of "kru".
     - "Mon" and "Mongolian", with the latter having a code of "mon".
    
    with "Kru" coming before "Kurukh" and "Mon" before "Mongolian", use of "kru"
    and "mon" with the previous algorithm would have matched "Kru" and "Mon"
    named records respectively, while the new algorithm will instead match
    "Kurukh" and "Mongolian" respectively, preferring the code-based match.
    
    (MR 146 with the old algorithm actually introduces a regression for
    "Mongolian" in that "mon" then matches the "Mon" record, whilst the new
    algorithm fixes that, restoring the "Mongolian" match).
    
    thus with the previous algorithm "Kurukh" and "Mongolian" could only be
    matched via their full names (or the "mn" iso-639-1 code in the Mongolian
    case), whereas with the new algorithm they can also be reached via their
    iso-639-2 codes, while "Kru" and "Mon" can now only be reached via codes
    ("kro" and "mnw" respectively).
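
    the difference can be reproduced with a small sketch. everything here is
    hypothetical (`lang_rec`, `recs`, `find_single_pass()`,
    `find_codes_first()`), and the single-pass function is only an assumption
    about how the previous algorithm behaved: it tests both the codes and the
    name of each record in table order, which is consistent with the results
    described above but is not copied from the old code.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout for illustration; not the actual VLC type. */
struct lang_rec {
    const char *name;   /* English display name */
    const char *code1;  /* ISO 639-1 code, "" when none exists */
    const char *code2;  /* ISO 639-2 code */
};

/* The four records of interest, in the order given in this message. */
static const struct lang_rec recs[] = {
    { "Kru",       "",   "kro" },
    { "Kurukh",    "",   "kru" },
    { "Mon",       "",   "mnw" },
    { "Mongolian", "mn", "mon" },
};
static const size_t recs_count = sizeof(recs) / sizeof(recs[0]);

/* Case-insensitive equality of two NUL-terminated strings. */
static int eq_nocase(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

/* A code can only match a query of the corresponding length. */
static int code_matches(const struct lang_rec *r, const char *q, size_t len)
{
    if (len == 2)
        return eq_nocase(r->code1, q);
    if (len == 3)
        return eq_nocase(r->code2, q);
    return 0;
}

/* Assumed previous behaviour: one pass, codes and name tested per record. */
static const struct lang_rec *find_single_pass(const char *q)
{
    const size_t len = strlen(q);
    for (size_t i = 0; i < recs_count; i++) {
        if (code_matches(&recs[i], q, len) || eq_nocase(recs[i].name, q))
            return &recs[i];
    }
    return NULL;
}

/* New behaviour: all codes are tried before any name. */
static const struct lang_rec *find_codes_first(const char *q)
{
    const size_t len = strlen(q);
    for (size_t i = 0; i < recs_count; i++) {
        if (code_matches(&recs[i], q, len))
            return &recs[i];
    }
    for (size_t i = 0; i < recs_count; i++) {
        if (eq_nocase(recs[i].name, q))
            return &recs[i];
    }
    return NULL;
}
```

    here `find_single_pass("mon")` returns the "Mon" record because its name is
    seen before the "Mongolian" code, whereas `find_codes_first("mon")` returns
    "Mongolian"; "Mon" itself remains reachable through "mnw".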
    
    either way there's an unavoidable imperfection in doing a case-insensitive
    name-inclusive lookup, but i feel code-based being primary is best; we don't
    necessarily want to ditch the name-based lookup considering those languages
    that it does work well for, and i don't expect we would want to make the
    name search case-sensitive, requiring capitals.
    
    [1]: !146