It has much broader support (ENCA only supports Latin- and Cyrillic-script languages, with Chinese as the only Asian language). This is much more user-friendly, for instance, for Korean or Japanese viewers.
Would VLC consider using uchardet too? :-)
VLC does not use ENCA. Instead it uses the character set most likely for the user language. I doubt that either ENCA or uchardet would adapt to that scheme.
Ok. I thought VLC used ENCA because attempting to remove it in my package manager triggers removal of VLC as well, but checking with the proper tools, I see that it is indeed not a direct dependency.
That said, using the character set most likely for the user's language is not ideal in my opinion. For instance, my user locale is en_US.UTF-8, my native language is French, and the subtitles I read most often are in Korean, for a friend. Using the user language as a hint is still a good idea in general, but it can't tell the whole story.
Also, I think most languages are moving to UTF-8, but I'm sure that French subtitles still sometimes come in ISO-8859-15 or ISO-8859-1. Similarly, from experience, about half of the Korean subtitles I get are UTF-8 and the other half are EUC-KR (this half always ends up garbled in VLC by default).
Anyway, all this to say that I believe proper encoding recognition (which can of course be hinted with the user locale) would definitely be more user-friendly.
With a French or English locale, VLC tries UTF-8 and falls back to Windows-1252 (which is mostly compatible with ISO-8859-1). If you mix different scripts, then Unicode is the only sane option, no matter how you look at it.
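The fallback described above can be sketched roughly as follows. This is only an illustration in Python, not VLC's actual code: the function name `decode_subtitle` is hypothetical, and using `errors="replace"` for bytes that the fallback charset leaves undefined is an assumption made for the example.

```python
def decode_subtitle(raw, fallback="cp1252"):
    """Return (text, encoding_used): strict UTF-8 first, then the locale fallback."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the legacy charset associated with the locale.
        # errors="replace" is an assumption for bytes that charset leaves undefined.
        return raw.decode(fallback, errors="replace"), fallback


french = "Ça va très bien"
print(decode_subtitle(french.encode("utf-8"))[1])    # decoded as UTF-8
print(decode_subtitle(french.encode("cp1252"))[1])   # falls back to Windows-1252

# A Korean EUC-KR file is not valid UTF-8 either, so it also takes the
# Windows-1252 branch and the resulting text no longer matches the original.
korean = "안녕하세요"
text, used = decode_subtitle(korean.encode("euc_kr"))
print(used, text == korean)
```

Under these assumptions, a scheme like this handles Western European files, but any other legacy encoding is silently decoded with the wrong charset.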
But that's my point: we have hinting and it works relatively well. uchardet does not seem to support language hinting in its external API.
Right, as far as I could see, uchardet does not have hinting. But it still works better than the current situation.
I lived in Japan for 2 years and in Korea for 1 year, and I live with a Korean person. Ever since I started using Linux, no video player has been able to display about half of the Japanese or Korean subtitles I find without me tinkering with GUI options or the command line. On the other hand, uchardet gets the detection right every time for all the Japanese and Korean subtitles I have. It may not be perfect and may still make errors from time to time (time will tell), and I don't use it for every language on earth, but considering it is Firefox's algorithm, and that Firefox is used all over the world with pretty good encoding detection, I'm guessing it may be one of the best options we have for now.
I have built 2 files in EUC-KR and ISO-2022-JP (common encodings in Korea and Japan respectively) and made them available here: https://cloud.libreart.info/public.php?service=files&t=0ed40c3a05231ed58bc184081753f191
They don't work on VLC 2.2.2. You can try them: unless you have brand new encoding code in the dev version, they won't work for you either. uchardet, on the other hand, detects the right encoding with no problem.
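To illustrate why these two encodings defeat a UTF-8-then-fallback scheme, here is a naive Python sketch. It is not uchardet's algorithm, only a crude strict-decoding heuristic written for this example; the function name `guess_encoding` and the `hinted` candidate list (standing in for a locale hint) are hypothetical.

```python
def guess_encoding(raw, hinted=("euc_kr",)):
    """Crude guess: escape-based ISO-2022-JP, then UTF-8, then hinted charsets."""
    # ISO-2022-JP is a 7-bit encoding that switches character sets with ESC
    # sequences, so its bytes also pass as valid UTF-8 unless checked first.
    if b"\x1b$B" in raw or b"\x1b$@" in raw:
        return "iso2022_jp"
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Try the hinted legacy charsets in order; the first strict decode wins.
    for name in hinted:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


print(guess_encoding("こんにちは".encode("iso2022_jp")))
print(guess_encoding("안녕하세요".encode("euc_kr")))
print(guess_encoding(b"plain ASCII subtitle"))
```

A heuristic like this cannot tell single-byte charsets apart, since almost any byte string decodes in them, which is where a real detection library would be needed.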
libenca was huge, IIRC, and slow.
I don't advise libenca because it supports too few languages and, from my tests, even with hinting it does not reliably detect the encodings it is supposed to support.
Same for libguess. I tested it on an EUC-KR subtitle and it was unable to detect the right encoding with "korean" hinting (even though EUC-KR was in its list).
All the various bindings based on the Firefox code are the best detection libs I have found so far.
Now, I don't mind if you find another way or another lib to improve subtitle encoding detection. I'm just trying to improve things for non-Western users, who always have to set their file encoding manually.
Browsing the code of uchardet, I discovered that the original C++ code has hinting. The C wrapper simply does not expose it. It should be fairly easy to add in a coming version.
As for licensing, it is MPL 1.1/GPL 2.0/LGPL 2.1. VLC is GPL 2.0 as well. This should not be a problem.
Just to confirm that uchardet would have detected this file as EUC-KR.
I still don't have time to patch VLC to use uchardet, and would thank anyone doing so.