Skip to content

Add failing message with non-ascii single-byte headers in attachment#130

Merged
ad-m merged 4 commits intomasterfrom
utf8-failing
Jul 9, 2017
Merged

Add failing message with non-ascii single-byte headers in attachment#130
ad-m merged 4 commits intomasterfrom
utf8-failing

Conversation

@ad-m
Copy link
Copy Markdown
Collaborator

@ad-m ad-m commented Feb 22, 2017

Hello,

In the wild I again meet messages which failing to parse.

Here is stacktrace:

___________________________ TestProcessEmail.test_message_with_utf_headers_in_attachments ____________________________
django_mailbox/tests/test_process_email.py:153: in test_message_with_utf_headers_in_attachments
    msg = mailbox.process_incoming_message(email_object)
django_mailbox/models.py:227: in process_incoming_message
    msg = self._process_message(message)
django_mailbox/models.py:365: in _process_message
    message = self._get_dehydrated_message(message, msg)
django_mailbox/models.py:251: in _get_dehydrated_message
    self._get_dehydrated_message(part, record)
django_mailbox/models.py:297: in _get_dehydrated_message
    attachment[key] = value
django_mailbox/models.py:748: in __setitem__
    rehydrated = self._get_rehydrated_headers()
django_mailbox/models.py:736: in _get_rehydrated_headers
    headers = unicode(headers, 'utf-8').encode('utf-8')
E   UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 82: invalid continuation byte

@ad-m
Copy link
Copy Markdown
Collaborator Author

ad-m commented Feb 22, 2017

@coddingtonbear , do you have any recommendation about fix that?

@coddingtonbear
Copy link
Copy Markdown
Owner

None offhand, but I'll have a glance tonight! Thanks for posting a test case that I can work with, @a-m!

@coddingtonbear coddingtonbear self-assigned this Feb 22, 2017
@coddingtonbear
Copy link
Copy Markdown
Owner

I've been looking through this a little bit, but I'm afraid there aren't many good options -- the message's filename isn't in UTF-8: it's a single-byte character. There won't be a way for us to recover the actual filename, and we'll instead have to either drop that character or let unicode replace it with the "missing character" symbol.

I'm currently poking around at a fix and hope to post something in a few days when I have a bit more time.

@ad-m
Copy link
Copy Markdown
Collaborator Author

ad-m commented Feb 23, 2017

The pull requests contains message truncated. Here is a messages download from Roundcube directly: https://files.jawne.info.pl/private_html/2017_02_0c53ceaece887135789e40006cf992e419bdb647 . I truncated to prevent flooding of the repository.

I am freedom of information activist which works for Citizens Network Watchdog Poland. Yesterday I send over 2,400 freedom of information requests. All responses we collect and publish on fedrowanie.siecobywatelska.pl thanks to django-mailbox. Then we want to engage citizens to analyze the data, which will teach them how the governements works, and to us it will provide the data necessary in the public debate and advocacy.

It is some responses from governements. So in my use case of loss of a single character from the filename is not a problem. The file name has no legal significance.

@coddingtonbear coddingtonbear changed the title Add failing message with utf-8 headers in attachment Add failing message with non-ascii single-byte headers in attachment Feb 23, 2017
@coddingtonbear
Copy link
Copy Markdown
Owner

Aww, @ad-m, you don't need to remind me who you are -- I wouldn't forget that.

I've spent a bit more time on this this afternoon and think this is the most reasonable path forward: changing header parsing and storage logic to respect DJANGO_MAILBOX_DEFAULT_CHARSET (in your case, you'll probably want to start setting that setting to iso8859-2 -- for single-byte non-ascii characters, that's the encoding that will be assumed).

In the short term, I think I'll be able to figure out a path forward for that in the next couple days -- I'm pretty close to a reasonable solution, but it might take a bit of time to make a release for it since I've found some existing bugs in how django-mailbox handles single-byte non-ascii characters.

In the longer term, I'm considering spending some serious time in the next month or so to take a deeper look into django-mailbox to prepare for another major release. There are a couple things that originally required very complicated solutions when this library was created, but are now a lot easier to manage -- specifically around storing non-unicode data in the database. That'll take some time and effort, though -- do you happen to both have the willingness and free time to help re-think some things with me?

@coddingtonbear
Copy link
Copy Markdown
Owner

I've just pushed up a few (not completely tested) changes to encoding-handling for filenames; I'm not totally sure what kind of environment you're running this in, @ad-m, but if it's possible for you to verify that this changeset solves your problem and doesn't introduce any problems, that might help accelerate this change making it into master.

if self.encoded:
return base64.b64decode(self.body.encode('ascii'))
return base64.b64decode(
self.body.encode(settings['default_charset'], 'replace')
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change definitely isn't necessary now that I'm re-reviewing my own changes here -- a base64-encoded value will never have non-ascii characters. I'll remove it this before the final version.

self.body = base64.b64encode(body).decode('ascii')
self.body = base64.b64encode(body).decode(
settings['default_charset'],
'replace',
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nor here; base64 encoding shouldn't have to worry about default_charset at all. I was just a little overzealous in removing those ascii decode/encodes.

@ad-m
Copy link
Copy Markdown
Collaborator Author

ad-m commented Jul 8, 2017

I am sorry for delay in response. I checked your code. It pass for 1 GB of emails.

In the near future I'm going to import more than 10 GB of messages. We will therefore have a well-tried message import.

I will try to improve your code for Python 3 compatibility this weekend. Due to added Python 3 compatibility in talon, we are migrating our projects to Python 3.

@ad-m
Copy link
Copy Markdown
Collaborator Author

ad-m commented Jul 9, 2017

I tried all night to find a way to unify messaging. However, I am aware that this standard library has changed its behavior.

@ad-m ad-m merged commit 0449e99 into master Jul 9, 2017
@ad-m ad-m deleted the utf8-failing branch April 21, 2018 05:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants