The text renderer is supposed to turn the region into a rendered region, or to create no picture at all in the case of Text-To-Speech.
Supporting a different output region would need some API tweaking, so that the region is created by the renderer and the core replaces the input region with it. But that only applies to text renderers that create an output picture.
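
To make the idea concrete, here is a rough sketch of what such a tweak could look like. Every name in it (`region_t`, `render_cb`, `core_render_region`, the two example renderers) is made up for illustration and is not the actual VLC API. It also assumes that the core simply keeps the input region when the renderer produces no picture, which is one possible choice and not something settled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins. These are NOT the real VLC types. */
typedef enum { REGION_TEXT, REGION_PICTURE } region_kind_t;

typedef struct region {
    region_kind_t kind;
    const char *text;   /* only meaningful for REGION_TEXT */
    int width, height;  /* only meaningful for REGION_PICTURE */
} region_t;

/* Hypothetical renderer callback: returns a newly allocated output region,
 * or NULL when the renderer creates no picture (e.g. Text-To-Speech). */
typedef region_t *(*render_cb)(const region_t *in);

/* Helper for the example: allocate a text region (text is not copied). */
region_t *region_new_text(const char *text)
{
    region_t *r = malloc(sizeof(*r));
    if (r == NULL)
        return NULL;
    r->kind = REGION_TEXT;
    r->text = text;
    r->width = r->height = 0;
    return r;
}

/* Example renderer that creates an output picture region.
 * The 8x16 cell size is an arbitrary placeholder, not real glyph metrics. */
region_t *render_to_picture(const region_t *in)
{
    assert(in != NULL && in->kind == REGION_TEXT);
    region_t *out = malloc(sizeof(*out));
    if (out == NULL)
        return NULL;
    out->kind = REGION_PICTURE;
    out->text = NULL;
    out->width = 8 * (int)strlen(in->text);
    out->height = 16;
    return out;
}

/* Example Text-To-Speech renderer: consumes the text, creates no picture. */
region_t *render_to_speech(const region_t *in)
{
    assert(in != NULL && in->kind == REGION_TEXT);
    /* A real module would hand in->text to a speech engine here. */
    return NULL;
}

/* Core side: if the renderer created an output region, it replaces the
 * input region (which is freed). Otherwise the input is left untouched.
 * Returns true when a replacement happened. */
bool core_render_region(region_t **slot, render_cb render)
{
    region_t *out = render(*slot);
    if (out == NULL)
        return false;
    free(*slot);
    *slot = out;
    return true;
}
```

With this shape the core does not need to know which kind of renderer it is talking to: a non-NULL return means "swap the region", and a NULL return covers the Text-To-Speech case.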