Re: mailer.py cannot handle utf-8 path in Subject correctly (Re: mailer.py can produce subject header violates RFC 5321/5322 if truncate_subject is not set)

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Tue, 7 Jan 2020 17:19:02 +0000

Yasuhito FUTATSUKI wrote on Wed, Jan 08, 2020 at 00:26:39 +0900:
> On 2020/01/07 9:41, Yasuhito FUTATSUKI wrote:
> > On 2020/01/07 6:52, Yasuhito FUTATSUKI wrote:
> >> By the way, there seems to be another issue with truncate_subject: the
> >> current implementation may break a UTF-8 multi-byte character
> >> sequence. I haven't reproduced it, though (because I always use ASCII
> >> characters only for file names...).
>
> I could reproduce this problem.
⋮
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-4: invalid continuation byte
>
> > Probably it needs something like this (but it doesn't support combining
> > characters, and I haven't tested it at all...):
> > [[[
> > Index: tools/hook-scripts/mailer/mailer.py
> > ===================================================================
> > --- tools/hook-scripts/mailer/mailer.py (revision 1872398)
> > +++ tools/hook-scripts/mailer/mailer.py (working copy)
> > @@ -159,7 +159,13 @@
> >        truncate_subject = 0
> >
> >      if truncate_subject and len(subject) > truncate_subject:
> > -      subject = subject[:(truncate_subject - 3)] + "..."
> > +      # To avoid breaking a UTF-8 multi-byte character sequence, we should
> > +      # search for the start of the sequence if the byte at the truncation
> > +      # point is the second or a later byte of a multi-byte sequence.
> > +      idx = truncate_subject - 3
> > +      while 0x80 <= ord(subject[idx]) <= 0xbf:
> > +        idx -= 1
> > +      subject = subject[:idx] + "..."
> >      return subject
> >
> >    def start(self, group, params):
> > ]]]
>
> With this patch applied, the script above runs without error.
>
> However, it produces the Subject line below.
>
> [[[
> Subject: r1 - =?utf-8?b?44CH44CH44CH5LiA?= =?utf-8?b?44CH44CH44CH5LiJ?= =?utf-8?b?44CH44CH44CH5LqM?= =?utf-8?b?44CH44CH44CH5LqU?= =?utf-8?b?44CH44CH44CH5YWt?= =?utf-8?b?44CHLi4u?=^M
> ]]]
>
> and the decoded result is
>
> "Subject: r1 - 〇〇〇一〇〇〇三〇〇〇二〇〇〇五〇〇〇六〇..."
>
> because whitespace between encoded words is ignored.
> I think this is not what we want.
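
The whitespace handling quoted above can be checked with Python 3's standard
email.header.decode_header. What follows is only a minimal sketch:
decode_subject is a hypothetical helper name (nothing in mailer.py is called
that), and only the first two encoded words of the header above are used.

```python
from email.header import decode_header


def decode_subject(raw):
    """Decode an RFC 2047 header value into text (illustrative helper)."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Unencoded runs may come back as bytes with charset None.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)
    return "".join(parts)


raw = "r1 - =?utf-8?b?44CH44CH44CH5LiA?= =?utf-8?b?44CH44CH44CH5LiJ?="
decoded = decode_subject(raw)
# The space between the two encoded words does not survive decoding.
print(decoded)
```

The space after "r1 -" is kept because it sits next to ordinary text; only
the whitespace separating two adjacent encoded words is dropped.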

We shouldn't be handling a UTF-8 string bytewise. If it needs truncating, then
we should truncate it characterwise or wordwise, not bytewise.
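
As a rough sketch of what characterwise truncation could look like, assuming
Python 3 (where str is a sequence of code points) and assuming the subject is
decoded from UTF-8 before truncating: the function and parameter names below
are made up for illustration and are not the ones in mailer.py. Like the patch
above, it does not treat combining characters specially.

```python
def truncate_text(subject, limit, ellipsis="..."):
    """Truncate `subject` to at most `limit` characters (not bytes).

    A limit of 0 means "do not truncate", mirroring truncate_subject = 0.
    """
    if isinstance(subject, bytes):
        # Assumption: byte input is UTF-8; decode before counting.
        subject = subject.decode("utf-8")
    if not limit or len(subject) <= limit:
        return subject
    # Slicing a str cuts between code points, never inside a
    # multi-byte UTF-8 sequence.
    cut = max(limit - len(ellipsis), 0)
    return subject[:cut] + ellipsis


print(truncate_text("r1 - 〇〇〇一〇〇〇三", 10))
```

Because the cut is made on decoded text, the result can always be re-encoded
as UTF-8 without hitting a UnicodeDecodeError later on.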

We shouldn't be doing the MIME-encoding ourselves. We should just provide the
subject line to the 'email' module and let it worry about MIME encoding, line
folding, and everything else. (This is a preëxisting problem.)
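
To illustrate the idea, here is a minimal sketch assuming Python 3's
email.message.EmailMessage with the standard SMTP policy. build_message is a
hypothetical name, and this is not how mailer.py currently assembles its
output; the point is only that the subject is handed over as plain text.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_message(subject, body):
    """Build a message, leaving header encoding and folding to 'email'."""
    msg = EmailMessage(policy=policy.SMTP)
    # Plain text in; the library decides on RFC 2047 encoding and folding.
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


subject = "r1 - " + "〇〇〇一" * 10
raw = build_message(subject, "Log message\n").as_bytes()

# Parsing the serialized bytes again recovers the original text.
parsed = message_from_bytes(raw, policy=policy.default)
print(parsed["Subject"] == subject)
```

With this approach the caller never builds =?utf-8?b?...?= words by hand, so
there is no inter-word whitespace for a decoder to drop.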

Makes sense?

Cheers,

Daniel
Received on 2020-01-07 18:19:11 CET
