Add failing message with non-ascii single-byte headers in attachment by ad-m · Pull Request #130 · coddingtonbear/django-mailbox

ad-m · 2017-02-22T15:22:50Z

Hello,

In the wild I again meet messages which failing to parse.

Here is stacktrace:

___________________________ TestProcessEmail.test_message_with_utf_headers_in_attachments ____________________________
django_mailbox/tests/test_process_email.py:153: in test_message_with_utf_headers_in_attachments
    msg = mailbox.process_incoming_message(email_object)
django_mailbox/models.py:227: in process_incoming_message
    msg = self._process_message(message)
django_mailbox/models.py:365: in _process_message
    message = self._get_dehydrated_message(message, msg)
django_mailbox/models.py:251: in _get_dehydrated_message
    self._get_dehydrated_message(part, record)
django_mailbox/models.py:297: in _get_dehydrated_message
    attachment[key] = value
django_mailbox/models.py:748: in __setitem__
    rehydrated = self._get_rehydrated_headers()
django_mailbox/models.py:736: in _get_rehydrated_headers
    headers = unicode(headers, 'utf-8').encode('utf-8')
E   UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 82: invalid continuation byte

ad-m · 2017-02-22T15:25:08Z

@coddingtonbear , do you have any recommendation about fix that?

coddingtonbear · 2017-02-22T18:56:45Z

None offhand, but I'll have a glance tonight! Thanks for posting a test case that I can work with, @a-m!

coddingtonbear · 2017-02-23T06:03:36Z

I've been looking through this a little bit, but I'm afraid there aren't many good options -- the message's filename isn't in UTF-8: it's a single-byte character. There won't be a way for us to recover the actual filename, and we'll instead have to either drop that character or let unicode replace it with the "missing character" symbol.

I'm currently poking around at a fix and hope to post something in a few days when I have a bit more time.

ad-m · 2017-02-23T06:36:58Z

The pull requests contains message truncated. Here is a messages download from Roundcube directly: https://files.jawne.info.pl/private_html/2017_02_0c53ceaece887135789e40006cf992e419bdb647 . I truncated to prevent flooding of the repository.

I am freedom of information activist which works for Citizens Network Watchdog Poland. Yesterday I send over 2,400 freedom of information requests. All responses we collect and publish on fedrowanie.siecobywatelska.pl thanks to django-mailbox. Then we want to engage citizens to analyze the data, which will teach them how the governements works, and to us it will provide the data necessary in the public debate and advocacy.

It is some responses from governements. So in my use case of loss of a single character from the filename is not a problem. The file name has no legal significance.

coddingtonbear · 2017-02-27T05:49:16Z

Aww, @ad-m, you don't need to remind me who you are -- I wouldn't forget that.

I've spent a bit more time on this this afternoon and think this is the most reasonable path forward: changing header parsing and storage logic to respect DJANGO_MAILBOX_DEFAULT_CHARSET (in your case, you'll probably want to start setting that setting to iso8859-2 -- for single-byte non-ascii characters, that's the encoding that will be assumed).

In the short term, I think I'll be able to figure out a path forward for that in the next couple days -- I'm pretty close to a reasonable solution, but it might take a bit of time to make a release for it since I've found some existing bugs in how django-mailbox handles single-byte non-ascii characters.

In the longer term, I'm considering spending some serious time in the next month or so to take a deeper look into django-mailbox to prepare for another major release. There are a couple things that originally required very complicated solutions when this library was created, but are now a lot easier to manage -- specifically around storing non-unicode data in the database. That'll take some time and effort, though -- do you happen to both have the willingness and free time to help re-think some things with me?

…mong other unicode handling fixes.

coddingtonbear · 2017-03-06T05:33:15Z

I've just pushed up a few (not completely tested) changes to encoding-handling for filenames; I'm not totally sure what kind of environment you're running this in, @ad-m, but if it's possible for you to verify that this changeset solves your problem and doesn't introduce any problems, that might help accelerate this change making it into master.

coddingtonbear · 2017-03-06T05:35:09Z

django_mailbox/models.py

        if self.encoded:
-            return base64.b64decode(self.body.encode('ascii'))
+            return base64.b64decode(
+                self.body.encode(settings['default_charset'], 'replace')


This change definitely isn't necessary now that I'm re-reviewing my own changes here -- a base64-encoded value will never have non-ascii characters. I'll remove it this before the final version.

coddingtonbear · 2017-03-06T05:36:04Z

django_mailbox/models.py

-        self.body = base64.b64encode(body).decode('ascii')
+        self.body = base64.b64encode(body).decode(
+            settings['default_charset'],
+            'replace',


Nor here; base64 encoding shouldn't have to worry about default_charset at all. I was just a little overzealous in removing those ascii decode/encodes.

…is not necessary.

ad-m · 2017-07-08T23:23:04Z

I am sorry for delay in response. I checked your code. It pass for 1 GB of emails.

In the near future I'm going to import more than 10 GB of messages. We will therefore have a well-tried message import.

I will try to improve your code for Python 3 compatibility this weekend. Due to added Python 3 compatibility in talon, we are migrating our projects to Python 3.

ad-m · 2017-07-09T10:21:22Z

I tried all night to find a way to unify messaging. However, I am aware that this standard library has changed its behavior.

Add failing message with utf-8 headers in attachment

81d6648

coddingtonbear self-assigned this Feb 22, 2017

coddingtonbear added the bug label Feb 22, 2017

coddingtonbear changed the title ~~Add failing message with utf-8 headers in attachment~~ Add failing message with non-ascii single-byte headers in attachment Feb 23, 2017

[#130] Fix handling of single-byte filenames in attachment headers; a…

2df2d8a

…mong other unicode handling fixes.

coddingtonbear reviewed Mar 6, 2017

View reviewed changes

[#130] Taking default_charset into account for base64-encoded values …

13abec2

…is not necessary.

Merge branch 'master' into utf8-failing

0449e99

ad-m merged commit 0449e99 into master Jul 9, 2017

ad-m deleted the utf8-failing branch April 21, 2018 05:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add failing message with non-ascii single-byte headers in attachment#130

Add failing message with non-ascii single-byte headers in attachment#130
ad-m merged 4 commits intomasterfrom
utf8-failing

ad-m commented Feb 22, 2017

Uh oh!

ad-m commented Feb 22, 2017

Uh oh!

coddingtonbear commented Feb 22, 2017

Uh oh!

coddingtonbear commented Feb 23, 2017

Uh oh!

ad-m commented Feb 23, 2017

Uh oh!

coddingtonbear commented Feb 27, 2017

Uh oh!

coddingtonbear commented Mar 6, 2017

Uh oh!

coddingtonbear Mar 6, 2017

Uh oh!

coddingtonbear Mar 6, 2017

Uh oh!

ad-m commented Jul 8, 2017

Uh oh!

ad-m commented Jul 9, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ad-m commented Feb 22, 2017

Uh oh!

ad-m commented Feb 22, 2017

Uh oh!

coddingtonbear commented Feb 22, 2017

Uh oh!

coddingtonbear commented Feb 23, 2017

Uh oh!

ad-m commented Feb 23, 2017

Uh oh!

coddingtonbear commented Feb 27, 2017

Uh oh!

coddingtonbear commented Mar 6, 2017

Uh oh!

coddingtonbear Mar 6, 2017

Choose a reason for hiding this comment

Uh oh!

coddingtonbear Mar 6, 2017

Choose a reason for hiding this comment

Uh oh!

ad-m commented Jul 8, 2017

Uh oh!

ad-m commented Jul 9, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants