iso639: tweak name-inclusive search algorithm to check names last
under this algorithm codes are searched first, names second (though of course the code search is skipped if the length is not 2 or 3).

in short, this is partly because in cases where clashes between short names and codes occur (which can happen, as discussed below), i feel that it is best to prefer the code match, and partly because of the disadvantages of using names as identifiers, also discussed below, which make codes perhaps more likely to be used for most languages.

please note the note in the previous commit about how this is only of relevance to users specifying language preference for dvdnav and bluray, with the `--sub-language`, `--audio-language` and `--menu-language` options.

-- names as identifiers --

the disadvantages/problems of using names as identifiers:

 1. since 236ca7ae introduced the possibility, the help text of the few relevant options has never actually informed users that it was possible. (in fact it also fails to clarify that the codes to be used are iso-639). so presumably any users of these options are already more used to using codes.

 2. only English names are available, since only English names exist in the table, and the lookups have never involved translation. this makes the feature less useful than originally envisaged in the log of the commit that introduced it. so for instance you can use "french" but not "francais".

 3. the names in the iso-639 table are primarily intended for display purposes rather than matching purposes.
    while many of the names are simple, like "English" and "French", and work just fine for the type of lookup performed, many are less suitable, like the following examples (some of which have been picked from the MR 146 update ([1])):
     - "Greek, Modern" (updated to "Greek, Modern (1453-)")
     - "Chichewa; Nyanja"
     - "Sotho, Southern"
     - "Tonga (Tonga Islands)"
     - "North Azerbaijani"
     - "Limburgan; Limburger; Limburgish"
     - "Gaelic; Scottish Gaelic"
     - "Interlingua (International Auxiliary Language Association)"
     - "Altaic (Other)"
     - "Apache languages"
    there are many such examples (especially after MR 146). (we would not want to rename them to be better identifiers, since this would make them less ideal for their primary display purpose, and it could make future updates from the glibc set much harder).

 4. it is not even possible for users to easily discover the (English) language names (or rather labels?) that are available for use instead of codes. not all are easily guessable.

 5. as shown by the MR 146 update, the names are far more prone to change than codes. this creates a backwards compatibility problem both for CLI use and saved settings. (we should not avoid such updates just for the sake of backwards compatibility).

so, aside from some cases like "english" and "french", which are ideal and reliable, for most languages codes are the better choice, which is a further argument for checking codes first. though of course the name/code clash issue discussed next is more significant.

-- result differences --

the results given are identical with the current data set, since there are currently no records where the 3-char name of one matches (ignoring case) the iso-639-2 code of another.
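as a rough illustration of the "no clashing records" condition, the sketch below scans a table for records whose name equals (ignoring case) the iso-639-2 code of a different record. it is written in Python purely for brevity and is not code from the tree; the function name and the hand-picked sample records are assumptions, not the real table.

```python
def find_name_code_clashes(records):
    """Return (name, other_name) pairs where `name`, ignoring case,
    equals the iso-639-2 code of a different record.

    `records` is a list of (name, iso639_2_code) tuples in table order.
    """
    # Map each code to the first record carrying it.
    by_code = {}
    for name, code in records:
        by_code.setdefault(code.lower(), (name, code))

    clashes = []
    for name, code in records:
        other = by_code.get(name.lower())
        if other is not None and other != (name, code):
            clashes.append((name, other[0]))
    return clashes


# Hypothetical sample of the current data: no name doubles as a code.
before = [("English", "eng"), ("French", "fre"), ("Mongolian", "mon")]

# Same sample with a "Mon" record placed ahead of "Mongolian".
after = before[:2] + [("Mon", "mnw"), ("Mongolian", "mon")]

print(find_name_code_clashes(before))
print(find_name_code_clashes(after))
```

with the first sample nothing is reported, so search order cannot affect results; with the second, "Mon" is reported as clashing with the code of "Mongolian".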
if/when MR 146 is merged, some such clashing records are added, and considering the order of the records (preserved from glibc order to make updates easier, if for no other reason), this algorithm would return a different match than the previous one, now preferring the code-based match of the later record over the name-based match of the earlier one.

the clashing records of interest are:
 - "Kru" and "Kurukh", with the latter having an iso-639-2 code of "kru".
 - "Mon" and "Mongolian", with the latter having a code of "mon".

with "Kru" coming before "Kurukh" and "Mon" before "Mongolian", use of "kru" and "mon" with the previous algorithm would have matched the "Kru" and "Mon" named records respectively, while the new algorithm will instead match "Kurukh" and "Mongolian" respectively, preferring the code-based match. (MR 146 with the old algorithm actually introduces a regression for "Mongolian", in that "mon" then matches the "Mon" record, whilst the new algorithm fixes that, restoring the "Mongolian" match).

thus with the previous algorithm "Kurukh" and "Mongolian" could only be matched via their full names (or the "mn" iso-639-1 code in the Mongolian case), whilst with the new algorithm they can also be reached via their codes, and "Kru" and "Mon" can now only be reached via codes ("kro" and "mnw" respectively).

either way there's an unavoidable imperfection in doing a case-insensitive name-inclusive lookup, but i feel code-based being primary is best; we don't necessarily want to ditch the name-based lookup considering those languages that it does work well for, and i don't expect we would want to make the name search case-sensitive, requiring capitals.

[1]: videolan/vlc!146
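the difference in results can be sketched as follows. this is a Python illustration only, not the code from the tree: the `Lang` type and both function names are hypothetical, the table holds just the four records discussed above, and the "old" function assumes the previous algorithm tested each record's codes and name together in a single pass, which is inferred from the behaviour described rather than taken from the earlier code.

```python
from typing import NamedTuple, Optional


class Lang(NamedTuple):
    name: str
    code1: str  # iso-639-1 code, "" if the language has none
    code2: str  # iso-639-2 code


# Only the records discussed above, in the order described.
TABLE = [
    Lang("Kru", "", "kro"),
    Lang("Kurukh", "", "kru"),
    Lang("Mon", "", "mnw"),
    Lang("Mongolian", "mn", "mon"),
]


def find_old(query: str) -> Optional[Lang]:
    """Assumed previous behaviour: one pass, codes and name per record."""
    q = query.lower()
    if not q:
        return None
    for rec in TABLE:
        if q in (rec.code1, rec.code2) or q == rec.name.lower():
            return rec
    return None


def find_new(query: str) -> Optional[Lang]:
    """Codes first (only for 2 or 3 char queries), names last."""
    q = query.lower()
    if len(q) in (2, 3):
        for rec in TABLE:
            if q in (rec.code1, rec.code2):
                return rec
    for rec in TABLE:
        if q == rec.name.lower():
            return rec
    return None


for q in ("kru", "mon", "kro", "mnw", "mn", "mongolian"):
    old, new = find_old(q), find_new(q)
    print(q, "->", old.name if old else None, "/", new.name if new else None)
```

under these assumptions "kru" and "mon" resolve to "Kru" and "Mon" with the single-pass search, and to "Kurukh" and "Mongolian" with the code-first search; the other queries resolve identically in both.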
parent
d9347452