Add failing message with non-ascii single-byte headers in attachment#130
Add failing message with non-ascii single-byte headers in attachment#130
Conversation
|
@coddingtonbear , do you have any recommendation about fix that? |
|
None offhand, but I'll have a glance tonight! Thanks for posting a test case that I can work with, @a-m! |
|
I've been looking through this a little bit, but I'm afraid there aren't many good options -- the message's filename isn't in UTF-8: it's a single-byte character. There won't be a way for us to recover the actual filename, and we'll instead have to either drop that character or let unicode replace it with the "missing character" symbol. I'm currently poking around at a fix and hope to post something in a few days when I have a bit more time. |
|
The pull requests contains message truncated. Here is a messages download from Roundcube directly: https://files.jawne.info.pl/private_html/2017_02_0c53ceaece887135789e40006cf992e419bdb647 . I truncated to prevent flooding of the repository. I am freedom of information activist which works for Citizens Network Watchdog Poland. Yesterday I send over 2,400 freedom of information requests. All responses we collect and publish on fedrowanie.siecobywatelska.pl thanks to django-mailbox. Then we want to engage citizens to analyze the data, which will teach them how the governements works, and to us it will provide the data necessary in the public debate and advocacy. It is some responses from governements. So in my use case of loss of a single character from the filename is not a problem. The file name has no legal significance. |
|
Aww, @ad-m, you don't need to remind me who you are -- I wouldn't forget that. I've spent a bit more time on this this afternoon and think this is the most reasonable path forward: changing header parsing and storage logic to respect In the short term, I think I'll be able to figure out a path forward for that in the next couple days -- I'm pretty close to a reasonable solution, but it might take a bit of time to make a release for it since I've found some existing bugs in how django-mailbox handles single-byte non-ascii characters. In the longer term, I'm considering spending some serious time in the next month or so to take a deeper look into |
…mong other unicode handling fixes.
|
I've just pushed up a few (not completely tested) changes to encoding-handling for filenames; I'm not totally sure what kind of environment you're running this in, @ad-m, but if it's possible for you to verify that this changeset solves your problem and doesn't introduce any problems, that might help accelerate this change making it into master. |
django_mailbox/models.py
Outdated
| if self.encoded: | ||
| return base64.b64decode(self.body.encode('ascii')) | ||
| return base64.b64decode( | ||
| self.body.encode(settings['default_charset'], 'replace') |
There was a problem hiding this comment.
This change definitely isn't necessary now that I'm re-reviewing my own changes here -- a base64-encoded value will never have non-ascii characters. I'll remove it this before the final version.
django_mailbox/models.py
Outdated
| self.body = base64.b64encode(body).decode('ascii') | ||
| self.body = base64.b64encode(body).decode( | ||
| settings['default_charset'], | ||
| 'replace', |
There was a problem hiding this comment.
Nor here; base64 encoding shouldn't have to worry about default_charset at all. I was just a little overzealous in removing those ascii decode/encodes.
…is not necessary.
|
I am sorry for delay in response. I checked your code. It pass for 1 GB of emails. In the near future I'm going to import more than 10 GB of messages. We will therefore have a well-tried message import. I will try to improve your code for Python 3 compatibility this weekend. Due to added Python 3 compatibility in talon, we are migrating our projects to Python 3. |
|
I tried all night to find a way to unify messaging. However, I am aware that this standard library has changed its behavior. |
Hello,
In the wild I again meet messages which failing to parse.
Here is stacktrace: