
Daniel Stenberg: curl 7.74.0 with HSTS

Mozilla planet - Wed, 09/12/2020 - 07:51

Welcome to another curl release, 56 days since the previous one.

Release presentation

Numbers

  • the 196th release
  • 1 change
  • 56 days (total: 8,301)
  • 107 bug fixes (total: 6,569)
  • 167 commits (total: 26,484)
  • 0 new public libcurl functions (total: 85)
  • 6 new curl_easy_setopt() options (total: 284)
  • 1 new curl command line option (total: 235)
  • 46 contributors, 22 new (total: 2,292)
  • 22 authors, 8 new (total: 843)
  • 3 security fixes (total: 98)
  • 1,600 USD paid in Bug Bounties (total: 4,400 USD)

Security

This time around we have no less than three vulnerabilities fixed and, as shown above, we’ve paid 1,600 USD in reward money this time, out of which the reporter of the CVE-2020-8286 issue got a new record amount: 900 USD. The second one didn’t get any reward simply because it was not claimed. In this single release we doubled the number of vulnerabilities we’ve published this year!

The six CVEs announced during 2020 still mean this has been a better year than each of the six previous years (2014-2019), and we have to go all the way back to 2013 to find a year with fewer CVEs reported.

I’m very happy and proud that we as a small independent open source project can reward these skilled security researchers like this. Many thanks to our generous sponsors, of course.

CVE-2020-8284: trusting FTP PASV responses

When curl performs a passive FTP transfer, it first tries the EPSV command and if that is not supported, it falls back to using PASV. Passive mode is what curl uses by default.

A server response to a PASV command includes the (IPv4) address and port number for the client to connect back to in order to perform the actual data transfer.

This is how the FTP protocol is designed to work.

A malicious server can use the PASV response to trick curl into connecting back to a given IP address and port, and this way potentially make curl extract information about services that are otherwise private and not disclosed, for example by doing port scanning and service banner extraction.

If curl operates on a URL provided by a user (which by all accounts is an unwise setup), a user can exploit that and pass in a URL to a malicious FTP server instance without needing any server breach to perform the attack.

There’s no really good solution or fix to this, as this is how FTP works, but starting in curl 7.74.0, curl will default to ignoring the IP address in the PASV response and instead just use the address it already uses for the control connection. In other words, we will enable the CURLOPT_FTP_SKIP_PASV_IP option by default! This will cause problems for some rare use cases (which then have to disable this), but we still think it’s worth doing.
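
To make that concrete, here is a minimal libcurl sketch of how an application that genuinely depends on the old behavior could opt back out of the new default; the FTP URL is a placeholder and error handling is omitted:

#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    /* hypothetical FTP URL, fetched in passive mode (curl's default) */
    curl_easy_setopt(curl, CURLOPT_URL, "ftp://ftp.example.com/file.txt");

    /* Since 7.74.0 this option defaults to 1L, meaning the IP address in
       the PASV response is ignored and the control connection's address
       is reused. Setting it back to 0L restores the old behavior. */
    curl_easy_setopt(curl, CURLOPT_FTP_SKIP_PASV_IP, 0L);

    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  return 0;
}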

CVE-2020-8285: FTP wildcard stack overflow

libcurl offers a wildcard matching functionality, which allows a callback (set with CURLOPT_CHUNK_BGN_FUNCTION) to return information back to libcurl on how to handle a specific entry in a directory when libcurl iterates over a list of all available entries.

When this callback returns CURL_CHUNK_BGN_FUNC_SKIP, to tell libcurl to not deal with that file, the internal function in libcurl then calls itself recursively to handle the next directory entry.
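
For context, this is roughly what the wildcard API looks like from the application side; a sketch with a made-up server, pattern and skip criterion, based on libcurl’s documented wildcard options:

#include <curl/curl.h>

/* Called once per directory entry when wildcard matching is enabled.
   Returning CURL_CHUNK_BGN_FUNC_SKIP tells libcurl to skip this entry. */
static long chunk_bgn(struct curl_fileinfo *finfo, void *ptr, int remains)
{
  (void)ptr;
  (void)remains;
  if(finfo->filetype != CURLFILETYPE_FILE)
    return CURL_CHUNK_BGN_FUNC_SKIP;   /* skip directories, links, etc. */
  return CURL_CHUNK_BGN_FUNC_OK;
}

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    /* hypothetical server and pattern */
    curl_easy_setopt(curl, CURLOPT_URL, "ftp://ftp.example.com/logs/*.txt");
    curl_easy_setopt(curl, CURLOPT_WILDCARDMATCH, 1L);
    curl_easy_setopt(curl, CURLOPT_CHUNK_BGN_FUNCTION, chunk_bgn);
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  return 0;
}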

If there’s a sufficient number of file entries and if the callback returns “skip” enough times, libcurl runs out of stack space. The exact amount will of course vary with platforms, compilers and other environmental factors.

The content of the remote directory is not kept on the stack, so it seems hard for the attacker to control exactly what data overwrites the stack – however it remains a denial-of-service vector, as a malicious user who controls a server that a libcurl-using application works with under these premises can trigger a crash.

CVE-2020-8286: Inferior OCSP verification

libcurl offers “OCSP stapling” via the CURLOPT_SSL_VERIFYSTATUS option. When set, libcurl verifies the OCSP response that a server responds with as part of the TLS handshake. It then aborts the TLS negotiation if something is wrong with the response. The same feature can be enabled with --cert-status using the curl tool.
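
A minimal sketch of enabling that verification with libcurl, assuming a TLS backend with OCSP stapling support and using a placeholder URL:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    CURLcode res;
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/");

    /* Require and verify a stapled OCSP response during the TLS
       handshake; the libcurl equivalent of the tool's --cert-status. */
    curl_easy_setopt(curl, CURLOPT_SSL_VERIFYSTATUS, 1L);

    res = curl_easy_perform(curl);
    if(res != CURLE_OK)
      fprintf(stderr, "transfer failed: %s\n", curl_easy_strerror(res));
    curl_easy_cleanup(curl);
  }
  return 0;
}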

As part of the OCSP response verification, a client should verify that the response is indeed intended for the correct certificate. This step was not performed by libcurl when built or told to use OpenSSL as its TLS backend.

This flaw would allow an attacker, who perhaps could have breached a TLS server, to provide a fraudulent OCSP response that would appear fine, instead of the real one – for example if the original certificate has actually been revoked.

Change

There’s really only one “change” this time, and it is an experimental one which means you need to enable it explicitly in the build to get to try it out. We discourage people from using this in production until we no longer consider it experimental but we will of course appreciate feedback on it and help to perfect it.

The change in this release introduces no less than 6 new easy setopts for the library and one command line option: support for HTTP Strict-Transport-Security, also known as HSTS. This is a system for HTTPS hosts to tell clients not to attempt to contact them over insecure methods (ie clear text HTTP).

One entry-point to the libcurl options for HSTS is the CURLOPT_HSTS_CTRL man page.
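
As a rough sketch of what this can look like, assuming a build with the experimental HSTS support enabled; the cache file path is made up, and the full set of new options is described in the man pages:

#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    /* Enable the (experimental) HSTS cache; requires a build with HSTS
       support. The cache file path here is just a made-up example. */
    curl_easy_setopt(curl, CURLOPT_HSTS_CTRL, (long)CURLHSTS_ENABLE);
    curl_easy_setopt(curl, CURLOPT_HSTS, "/tmp/curl-hsts-cache.txt");

    /* A plain http:// URL is upgraded to https:// if the host is
       already known in the HSTS cache. */
    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  return 0;
}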

Bug-fixes

Yet another release with over one hundred bug-fixes accounted for. I’ve selected a few interesting ones that I decided to highlight below.

enable alt-svc in the build by default

We landed the code and support for alt-svc: headers in early 2019 marked as “experimental”. We feel the time has come for this little baby to grow up and step out into the real world so we removed the labeling and we made sure the support is enabled by default in builds (you can still disable it if you want).

8 cmake fixes bring cmake closer to autotools level

In curl 7.73.0 we removed the “scary warning” from the cmake build that warned users that the cmake build setup might be inferior. The goal was to get more people to use it, and then by extension help out to fix it. The trick might have worked and we’ve gotten several improvements to the cmake build in this cycle. Moreover, we’ve gotten a whole slew of new bug reports on it as well, so now we have a list of known cmake issues in the KNOWN_BUGS document, ready for interested contributors to dig into!

configure now uses pkg-config to find OpenSSL when cross-compiling

Just one of those tiny weird things. At some point in the past, someone had trouble building with a cross-compiled OpenSSL when pkg-config was used, so it got disabled. I don’t recall the details. This time someone had the reverse problem, so the configure script has now been fixed to properly use pkg-config even when cross-compiling…

curl.se is the new home

You know it.

curl: only warn not fail, if not finding the home dir

The curl tool attempts to find the home directory of the user who invokes the command, in order to look for certain files there. For example the .curlrc file. More importantly, when using SSH-related protocols it is somewhat important to find the file ~/.ssh/known_hosts. So important that the tool would abort if it was not found. Still, a command line can work without that file in various circumstances, in particular if -k is used, so bailing out like that was nothing but wrong…

curl_easy_escape: limit output string length to 3 * max input

In general, libcurl enforces an internal string length limit that prevents any string from growing larger than 8MB. This is done to prevent mistakes or abuse. Due to a mistake, the string length limit was enforced wrongly in the curl_easy_escape function, which could make the effective limit a third of the intended size: 2.67 MB.
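
For reference, a small usage sketch of the function in question; every escaped byte turns into a three-character %XX sequence, which is where the 3 * input factor comes from:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    /* Each escaped byte becomes a three-character "%XX" sequence, so the
       output can be up to three times the input length. 0 means strlen. */
    char *escaped = curl_easy_escape(curl, "a b&c", 0);
    if(escaped) {
      printf("%s\n", escaped);   /* prints "a%20b%26c" */
      curl_free(escaped);
    }
    curl_easy_cleanup(curl);
  }
  return 0;
}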

only set USE_RESOLVE_ON_IPS for Apple’s native resolver use

This define is set internally when the resolver functions are used even when a plain IP address is given. On macOS for example, the resolver functions are used to do some conversions and thus this is necessary, while for other resolver libraries we avoid the resolver call when we can convert the IP number to binary form internally more efficiently.

By mistake, we had enabled this “call getaddrinfo() anyway” logic even when curl was built to use c-ares on macOS.

fix memory leaks in GnuTLS backend

We used two functions to extract information from the server certificate that didn’t properly free the memory after use. We’ve filed subsequent bug reports in the GnuTLS project asking them to make the required steps much clearer in their documentation so that perhaps other projects can avoid the same mistake going forward.

libssh2: fix transport over HTTPS proxy

SFTP file transfers didn’t work correctly since previous fixes obviously weren’t thorough enough. This fix has been confirmed fine in use.

make curl --retry work for HTTP 408 responses too

Again. We made the --retry logic work for 408 once before, but for some inexplicable reasons the support for that was accidentally dropped when we introduced parallel transfer support in curl. Regression fixed!

use OPENSSL_init_ssl() with >= 1.1.0

Initializing the OpenSSL library the correct way is a task that sounds easy but has always been a source of problems and misunderstandings, and it has never been properly documented. It is a long and boring story that has been going on for a very long time. This time, we add yet another chapter to this novel as we start using this function call when OpenSSL 1.1.0 or later (or BoringSSL) is used in the build. Hopefully, this is one of the last chapters in this book.
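
For the curious, the call in question is OpenSSL’s own one-shot initializer, not a libcurl function; a hedged sketch of the kind of call libcurl now performs internally for such builds (applications using libcurl still do not need to call this themselves):

#include <openssl/ssl.h>

int init_tls(void)
{
  /* With OpenSSL 1.1.0 and later, this single call replaces the older
     SSL_library_init()/OpenSSL_add_all_algorithms() dance. Passing 0 and
     NULL accepts the library defaults; it returns 1 on success. */
  if(OPENSSL_init_ssl(0, NULL) != 1)
    return -1;
  return 0;
}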

“scheme-less URLs” no longer accept a blank port number

curl operates on “URLs”, but as a special shortcut it also supports URLs without the scheme. For example just a plain host name. Such input isn’t an actual URL or URI by any standard; curl was made to handle such input to mimic how browsers work. curl “guesses” what scheme the given name is meant to have, and for most names it will go with HTTP.

Further, a URL can provide a specific port number using a colon and a port number following the host name, like “hostname:80” and the path then follows the port number: “hostname:80/path“. To complicate matters, the port number can be blank, and the path can start with more than one slash: “hostname://path“.

curl’s logic that determines if a given input string has a scheme present checks the first 40 bytes of the string for a :// sequence and if that is deemed not present, curl determines that this is a scheme-less host name.

This means [39-letter string]:// as input is treated as a URL with a scheme, albeit one that curl doesn’t know about, and is therefore rejected as input, while [40-letter string]:// is considered a host name with a blank port number field and a path that starts with a double slash!

In 7.74.0 we remove that potentially confusing difference. If the URL is determined to not have a scheme, it will not be accepted if it also has a blank port number!
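
The same scheme-guessing behavior is exposed through libcurl’s URL parsing API, so here is an illustrative sketch of it, assuming a libcurl with the CURLU API (7.62.0 or later):

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
  CURLU *h = curl_url();
  if(h) {
    char *url = NULL;
    /* Parse a scheme-less host name and let libcurl guess the scheme. */
    if(curl_url_set(h, CURLUPART_URL, "www.example.com/path",
                    CURLU_GUESS_SCHEME) == CURLUE_OK &&
       curl_url_get(h, CURLUPART_URL, &url, 0) == CURLUE_OK) {
      printf("%s\n", url);   /* typically "http://www.example.com/path" */
      curl_free(url);
    }
    curl_url_cleanup(h);
  }
  return 0;
}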


Martin Thompson: Oblivious DoH

Mozilla planet - Wed, 09/12/2020 - 01:00

Today we heard an announcement that Cloudflare, Apple, and Fastly are collaborating on a new technology for improving privacy of DNS queries using a technology they call Oblivious DoH (ODoH).

This is an exciting development. This posting examines the technology in more detail and looks at some of the challenges this will need to overcome before it can be deployed more widely.

How ODoH Provides Privacy for DNS Queries

Oblivious DoH is a simple mixnet protocol for making DNS queries. It uses a proxy server to provide added privacy for query streams.

This looks something like:

Client <-> Proxy <-> Resolver

A common criticism of DNS over HTTPS (DoH) is that it provides DoH resolvers with lots of privacy-sensitive information[1]. Currently all DNS resolvers, including DoH resolvers, see the contents of queries and can link that to who is making those queries. DoH includes connection reuse, so resolvers can link requests from the same client using the connection.

In Oblivious DoH, a proxy aggregates queries from multiple clients so that the resolver is unable to link queries to individual clients. ODoH protects the IP address of the client, but it also prevents the resolver from linking queries from the same client together. Unlike ordinary HTTP proxies, which handle TLS connections to servers[2], ODoH proxies handle queries that are individually encrypted.

ODoH prevents resolvers from assembling profiles on clients by collecting the queries they make, because resolvers see queries from a large number of clients all mixed together.

An ODoH proxy learns almost nothing from this process, as ODoH uses HPKE to encrypt both the query and the answer with keys chosen by the client and resolver.

The privacy benefits of ODoH can only be undone if both the proxy and resolver cooperate. ODoH therefore recommends that the two services be run independently, with the operator of each making a commitment to respecting privacy.

Costs

The privacy advantages provided by the ODoH design come at a higher cost than DoH, where a client just queries the resolver directly:

  • The proxy adds a little latency as it needs to forward queries and responses.
  • HPKE encryption adds up to about 100 bytes to each query.
  • The client and resolver need to spend a little CPU time to add and remove the encryption.

Cloudflare's tests show that the overall effect of ODoH on performance is quite modest. These early tests even suggest some improvement for the slowest queries. If those performance gains can be kept as they scale up their deployment, that would be strong justification for deployment.

Why This Design

A similar outcome might be achieved using a proxy that supports HTTP CONNECT. However, in order to prevent the resolver from learning which queries come from the same client, each query would have to use a new connection.

That gets pretty expensive. While you might be able to use tricks to drive down latency, like sending the TLS handshake along with the HTTP CONNECT, it means that every request uses a separate TCP connection and a round trip to establish that connection[3].

It is also possible to use something like Tor, which provides superior privacy protection, but Tor is a lot more expensive.

Using HPKE and a multiplexed protocol like HTTP/2 or HTTP/3 avoids per-query connection setup costs. However, the most important thing is that it involves only minimal additional latency to get the privacy benefits[4].

Key Management in DNS

The proposal puts HPKE keys for the resolver in the DNS[5]. The idea is that clients can talk to the resolver directly to get these, then use that information to protect their queries. As the keys are DNS records, they can be retrieved from any DNS resolver, which is a potential advantage.

This also means that this ODoH design depends on DNSSEC. Many clients rely on their resolver to perform DNSSEC validation, which doesn't help here. So this makes it difficult to deploy something like this incrementally in clients.

A better option might be to offer the HPKE public key information in response to a direct HTTP request to the resolver. That would ensure that the key could be authenticated by the client using HTTPS and the Web PKI.

Trustworthiness of Proxies

Both client and resolver will want to authenticate the proxy and only allow a trustworthy proxy. The protocol design means that the need for trust in the proxy is limited, but it isn't zero.

Clients need to trust that the proxy is hiding their IP address. A bad proxy could attach the client IP address to every query they forward. Clients will want some way of knowing that the proxy won't do this[6].

Resolvers will likely want to limit the number of proxies that they will accept requests from, because the aggregated queries from a proxy of any reasonable size will look a lot like a denial of service attack. Mixing all the queries together denies resolvers the ability to do per-client rate limiting, which is a valuable denial of service protection measure. Resolvers will need to apply much more generous rate limits for these proxies and trust that the proxies will take reasonable steps to ensure that individual clients are not able to generate abusive numbers of queries.

This means that proxies will need to be acceptable to both client and resolver. Early deployments will be able to rely on contracts and similar arrangements to guarantee this. However, if use of ODoH is to scale out to support large numbers of providers of both proxies and resolvers, it could be necessary to build systems for managing these relationships.

Proxying For Other Applications

One obvious observation about this design is that it isn't unique to DNS queries. In fact, there are a large number of request-response exchanges that would benefit from the same privacy protections that ODoH provides. For example, Google this week announced a trial of a similar technology for preloading content.

A generic design that enabled protection for HTTP queries of any sort would be ideal. My hope is that we can design that protocol.

Once you look to designing a more generic solution, there are a few extra things that might improve the design. Automatic discovery of HTTP endpoints that allow oblivious proxying is one potential enhancement. Servers could advertise both keys and the proxies they support so that clients can choose to use those proxies to mask their address. This might involve automated proxy selection or discovery and even systems for encoding agreements. There are lots of possibilities here.

Centralization

One criticism regarding DoH deployments is that they encourage consolidation of DNS resolver services. For ODoH - at least in the short term - options for ODoH resolvers will be limited, which could push usage toward a small number of server operators in exchange for the privacy gains ODoH provides.

During initial roll-out, the number of proxy operators will be limited. Also, using a larger proxy means that your queries are mixed in with more queries from other people, providing marginally better privacy. That might provide some impetus to consolidate.

Deploying automated discovery systems for acceptable proxies might help mitigate the worst centralization effects, but it seems likely that this will not be a feature of early deployments.

In the end, it would be a mistake to cry "centralization" in response to early trial deployments of a technology, which are naturally limited in scope. Furthermore, it's hard to know what the long term impact on the ecosystem will be. We might never be able to separate the effect of existing trends toward consolidation from the effect of new technology.

Conclusion

I like the model adopted here. The use of a proxy neatly addresses one of the biggest concerns with the rollout of DoH: the privacy risk of having a large provider being able to gather information about streams of queries that can be linked to your IP address.

ODoH breaks streams of queries into discrete transactions that are hard to assemble into activity profiles. At the same time, ODoH makes it hard to attribute queries to individuals as it hides the origin of queries.

My sense is that the benefits very much outweigh the performance costs, the protocol complexity, and the operational risks. ODoH is a pretty big privacy win for name resolution. The state of name resolution is pretty poor, with much of it still unprotected from snooping, interception, and poisoning. The deployment of DoH went some way to address that, but came with some drawbacks. Oblivious DoH takes the next logical step.

  1. This is something all current DNS resolvers get, but the complaint is about the scale at which this information is gathered. Some people are unhappy that network operators are unable to access this information, but I regard that as a feature. ↩︎

  2. OK, proxies do handle individual, unencrypted HTTP requests, but that capability is hardly ever used any more now that 90% of the web is HTTPS. ↩︎

  3. Using 0-RTT doesn't work here without some fiddly changes to TLS because the session ticket used for TLS allows the server to link connections together, which isn't what we need. ↩︎

  4. This also makes ODoH far more susceptible to traffic analysis, but it relies on volume and the relative similarity of DNS queries to help manage that risk. ↩︎

  5. The recursion here means that the designers of ODoH probably deserve a prize of some sort. ↩︎

  6. The willful IP blindness proposal goes into more detail on what might be required for this. ↩︎


The Mozilla Blog: Why getting voting right is hard, Part I: Introduction and Requirements

Mozilla planet - Tue, 08/12/2020 - 19:24

Every two years around this time, the US has an election and the rest of the world marvels and asks itself one question: What the heck is going on with US elections? I’m not talking about US politics here but about the voting systems (machines, paper, etc.) that people use to vote, which are bafflingly complex. While it’s true that American voting is a chaotic patchwork of different systems scattered across jurisdictions, running efficient secure elections is a genuinely hard problem. This is often surprising to people who are used to other systems that demand precise accounting such as banking/ATMs or large scale databases, but the truth is that voting is fundamentally different and much harder.

In this series I’ll be going through a variety of different voting systems so you can see how this works in practice. This post provides a brief overview of the basic requirements for voting systems. We’ll go into more detail about the practical impact of these requirements as we examine each system.

Requirements

To understand voting systems design, we first need to understand the requirements to which they are designed. These vary somewhat, but generally look something like the below.

Efficient Correct Tabulation

This requirement is basically trivial: collect the ballots and tally them up. The winner is the one with the most votes.1 You also need to do it at scale and within a reasonable period of time, otherwise there’s not much point.

Verifiable Results

It’s not enough for the election just to produce the right result, it must also do so in a verifiable fashion. As voting researcher Dan Wallach is fond of saying, the purpose of elections is to convince the loser that they actually lost, and that means more than just trusting the election officials to count the votes correctly. Ideally, everyone in the world would be able to check for themselves that the votes had been correctly tabulated (this is often called “public verifiability”), but in real-world systems it usually means that some set of election observers can personally observe parts of the process and hopefully be persuaded it was conducted correctly.

Secrecy of the Ballot

The next major requirement is what’s called “secrecy of the ballot”, i.e., ensuring that others can’t tell how you voted. Without ballot secrecy, people could be pressured to vote certain ways or face negative consequences for their votes. Ballot secrecy actually has two components: (1) other people — including election officials — can’t tell how you voted and (2) you can’t prove to other people how you voted. The first component is needed to prevent wholesale retaliation and/or rewards and the second is needed to prevent retail vote buying. The actual level of ballot secrecy provided by systems varies. For instance, the UK system technically allows election officials to match ballots to the voter but prevents it with procedural controls, and vote-by-mail systems generally don’t do a great job of preventing you from proving how you voted; in general, though, most voting systems attempt to provide some level of ballot secrecy.2

Accessibility

Finally, we want voting systems to be accessible, both in the specific sense that we want people with disabilities to be able to vote and in the more general sense that we want it to be generally easy for people to vote. Because the voting-eligible population is so large and people’s situations are so varied, this often means that systems have to make accommodations, for instance for overseas or military voters or for people who speak different languages.

Limited Trust

As you’ve probably noticed, one common theme in these requirements is the desire to limit the amount of trust you place in any one entity or person. For instance, when I worked the polls in Santa Clara county elections, we would collect all the paper ballots and put them in tamper-evident bags before taking them back to election central for processing. This makes it harder for the person transporting the ballots to examine the ballots or substitute their own. For those who aren’t used to the way security people think, this often feels like saying that election officials aren’t trustworthy, but really what it’s saying is that elections are very high stakes events and critical systems like this should be designed with as few failure points as possible, and that includes preventing both outsider and insider threats, protecting even against authorized election workers themselves.

An Overconstrained Problem

Individually each of these requirements is fairly easy to meet, but the combination of them turns out to be extremely hard. For example, if you publish everyone’s ballots then it’s (relatively) easy to ensure that the ballots were counted correctly, but you’ve just completely given up secrecy of the ballot.3 Conversely, if you just trust election officials to count all the votes, then it’s much easier to provide secrecy from everyone else. But these properties are both important, and hard to provide simultaneously. This tension is at the heart of why voting is so much more difficult than other superficially similar systems like banking. After all, your transactions aren’t secret from the bank. In general, what we find is that voting systems may not completely meet all the requirements but rather compromise on trying to do a good job on most/all of them.

Up Next: Hand-Counted Paper Ballots

In the next post, I’ll be covering what is probably the simplest common voting system: hand-counted paper ballots. This system actually isn’t that common in the US for reasons I’ll go into, but it’s widely used outside the US and provides a good introduction into some of the problems with running a real election.

  1. For the purpose of this series, we’ll mostly be assuming first past the post systems, which are the main systems in use in the US.
  2. Note that I’m talking here about systems designed for use by ordinary citizens. Legislative voting, judicial voting, etc. are qualitatively different: they usually have a much smaller number of voters and don’t try to preserve the secrecy of the ballot, so the problem is much simpler. 
  3. Thanks to Hovav Shacham for this example. 



Hacks.Mozilla.Org: An update on MDN Web Docs’ localization strategy

Mozilla planet - Tue, 08/12/2020 - 17:20

In our previous post — MDN Web Docs evolves! Lowdown on the upcoming new platform — we talked about many aspects of the new MDN Web Docs platform that we’re launching on December 14th. In this post, we’ll look at one aspect in more detail — how we are handling localization going forward. We’ll talk about how our thinking has changed since our previous post, and detail our updated course of action.

Updated course of action

Based on thoughtful feedback from the community, we did some additional investigation and determined a stronger, clearer path forward.

First of all, we want to keep a clear focus on work leading up to the launch of our new platform, and making sure the overall system works smoothly. This means that upon launch, we still plan to display translations in all existing locales, but they will all initially be frozen — read-only, not editable.

We were considering automated translations as the main way forward. One key issue was that automated translations into European languages are seen as an acceptable solution, but automated translations into CJK languages are far from ideal — they have a very different structure to English and European languages, plus many Europeans are able to read English well enough to fall back on English documentation when required, whereas some CJK communities do not commonly read English so do not have that luxury.

Many folks we talked to said that automated translations wouldn’t be acceptable in their languages. Not only would they be substandard, but a lot of MDN Web Docs communities center around translating documents. If manual translations went away, those vibrant and highly involved communities would probably go away — something we certainly want to avoid!

We are therefore focusing on limited manual translations as our main way forward instead, looking to unfreeze a number of key locales as soon as possible after the new platform launch.

Limited manual translations

Rigorous testing has been done, and it looks like building translated content as part of the main build process is doable. We are separating locales into two tiers in order to determine which will be unfrozen and which will remain locked.

  • Tier 1 locales will be unfrozen and manually editable via pull requests. These locales are required to have at least one representative who will act as a community lead. The community members will be responsible for monitoring the localized pages, updating translations of key content once the English versions are changed, reviewing edits, etc. The community lead will additionally be in charge of making decisions related to that locale, and acting as a point of contact between the community and the MDN staff team.
  • Tier 2 locales will be frozen, and not accept pull requests, because they have no community to maintain them.

The Tier 1 locales we are starting by unfreezing are:

  • Simplified Chinese (zh-CN)
  • Traditional Chinese (zh-TW)
  • French (fr)
  • Japanese (ja)

If you wish for a Tier 2 locale to be unfrozen, then you need to come to us with a proposal, including evidence of an active team willing to be responsible for the work associated with that locale. If this is the case, then we can promote the locale to Tier 1, and you can start work.

We will monitor the activity on the Tier 1 locales. If a Tier 1 locale is not being maintained by its community, we shall demote it to Tier 2 after a certain period of time, and it will become frozen again.

We are looking at this new system as a reasonable compromise — providing a path for you, the community, to continue work on MDN translations provided the interest is there, while also ensuring that locale maintenance is viable and content won’t get any further out of date. With most locales unmaintained, changes weren’t being reviewed effectively, and readers of those locales were often confused between using their preferred locale or English, their experience suffering as a result.

Review process

The review process will be quite simple.

  • The content for each Tier 1 locale will be kept in its own separate repo.
  • When a PR is made against that repo, the localization community will be pinged for a review.
  • When the content has been reviewed, an MDN admin will be pinged to merge the change. We should be able to set up the system so that this happens automatically.
  • There will also be some user-submitted content bugs filed at https://github.com/mdn/sprints/issues, as well as on the issue trackers for each locale repo. When triaged, the “sprints” issues will be assigned to the relevant localization team to fix, but the relevant localization team is responsible for triaging and resolving issues filed on their own repo.

Machine translations alongside manual translations

We previously talked about the potential involvement of machine translations to enhance the new localization process. We still have this in mind, but we are looking to keep the initial system simple, in order to make it achievable. The next step in Q1 2021 will be to start looking into how we could most effectively make use of machine translations. We’ll give you another update in mid-Q1, once we’ve made more progress.



Mozilla Attack & Defense: Guest Blog Post: Good First Steps to Find Security Bugs in Fenix (Part 1)

Mozilla planet - Tue, 08/12/2020 - 16:17

This blog post is one of several guest blog posts, where we invite participants of our bug bounty program to write about bugs they’ve reported to us.

Fenix is a newly designed Firefox for Android that officially launched in August 2020. In Fenix, many components required to run as an Android app have been rebuilt from scratch, and various new features are being implemented as well. While they are re-implementing features, security bugs fixed in the past may be introduced again. If you care about the open web and you want to participate in the Client Bug Bounty Program of Mozilla, Fenix is a good target to start with.

Let’s take a look at two bugs I found in the firefox: scheme that is supported by Fenix.

Bugs Came Again with Deep Links

Fenix provides an interesting custom scheme URL firefox://open?url= that can open any specified URL in a new tab. On Android, a deep link is a link that takes you directly to a specific part of an app; and the firefox:open deep link is not intended to be called from web content, but its access was not restricted.

Web content should not be able to link directly to a file:// URL (although a user can type or copy/paste such a link into the address bar). While Firefox on the desktop has long implemented this restriction, Fenix did not – I submitted Bug 1656747, which exploited this behavior and navigated to a local file from web content with the following hyperlink:

<a href="firefox://open?url=file:///sdcard/Download"> Go </a>

But actually, the same bug affected the older Firefox for Android (unofficially referred to as Fennec) and was filed three years ago as Bug 1380950.

Likewise, security researcher Jun Kokatsu reported Bug 1447853, which was an <iframe> sandbox bypass in Firefox for iOS. He also abused the same type of deep link URL for bypassing the popup block brought by <iframe> sandbox.

<iframe src="data:text/html,<a href=firefox://open-url?url=https://example.com> Go </a>" sandbox></iframe>

I found this attack scenario in a test file of Firefox for iOS and I re-tested it in Fenix. I submitted Bug 1656746 which is the same issue as what he found.

Conclusion

As you can see, retesting past attack scenarios can be a good starting point. We can find past vulnerabilities from the Mozilla Foundation Security Advisories. By examining histories accumulated over a decade, we can see what are considered security bugs and how they were resolved. These resources will be useful for retesting past bugs as well as finding attack vectors for newly introduced features.

Have a good bug hunt!


Andrew Sutherland: Talk Script: Firefox OS Email Performance Strategies

Thunderbird - Thu, 30/04/2015 - 22:11

Last week I gave a talk at the Philly Tech Week 2015 Dev Day organized by the delightful people at technical.ly on some of the tricks/strategies we use in the Firefox OS Gaia Email app.  Note that the credit for implementing most of these techniques goes to the owner of the Email app’s front-end, James Burke.  Also, a special shout-out to Vivien for the initial DOM Worker patches for the email app.

I tried to avoid having slides that both I would be reading aloud as the audience read silently, so instead of slides to share, I have the talk script.  Well, I also have the slides here, but there’s not much to them.  The headings below are the content of the slides, except for the one time I inline some code.  Note that the live presentation must have differed slightly, because I’m sure I’m much more witty and clever in person than this script would make it seem…

Cover Slide: Who!

Hi, my name is Andrew Sutherland.  I work at Mozilla on the Firefox OS Email Application.  I’m here to share some strategies we used to make our HTML5 app Seem faster and sometimes actually Be faster.

What’s A Firefox OS (Screenshot Slide)

But first: What is a Firefox OS?  It’s a multiprocess Firefox Gecko engine on an Android Linux kernel where all the apps, including the system UI, are implemented using HTML5, CSS, and JavaScript.  All the apps use some combination of standard web APIs and APIs that we hope to standardize in some form.

(Screenshots: the Firefox OS home screen, the clock app, and the email app)

Here are some screenshots.  We’ve got the default home screen app, the clock app, and of course, the email app.

It’s an entirely client-side offline email application, supporting IMAP4, POP3, and ActiveSync.  The goal, like all Firefox OS apps shipped with the phone, is to give native apps on other platforms a run for their money.

And that begins with starting up fast.

Fast Startup: The Problems

But that’s frequently easier said than done.  Slow-loading websites are still very much a thing.

The good news for the email application is that a slow network isn’t one of its problems.  It’s pre-loaded on the phone.  And even if it wasn’t, because of the security implications of the TCP Web API and the difficulty of explaining this risk to users in a way they won’t just click through, any TCP-using app needs to be a cryptographically signed zip file approved by a marketplace.  So we do load directly from flash.

However, it’s not like flash on cellphones is equivalent to an infinitely fast, zero-latency network connection.  And even if it was, in a naive app you’d still try and load all of your HTML, CSS, and JavaScript at the same time because the HTML file would reference them all.  And that adds up.

It adds up in the form of event loop activity and competition with other threads and processes.  With the exception of Promises which get their own micro-task queue fast-lane, the web execution model is the same as all other UI event loops; events get scheduled and then executed in the same order they are scheduled.  Loading data from an asynchronous API like IndexedDB means that your read result gets in line behind everything else that’s scheduled.  And in the case of the bulk of shipped Firefox OS devices, we only have a single processor core so the thread and process contention do come into play.

So we try not to be naive.

Seeming Fast at Startup: The HTML Cache

If we’re going to optimize startup, it’s good to start with what the user sees.  Once an account exists for the email app, at startup we display the default account’s inbox folder.

What is the least amount of work that we can do to show that?  Cache a screenshot of the Inbox.  The problem with that, of course, is that a static screenshot is indistinguishable from an unresponsive application.

So we did the next best thing, (which is) we cache the actual HTML we display.  At startup we load a minimal HTML file, our concatenated CSS, and just enough Javascript to figure out if we should use the HTML cache and then actually use it if appropriate.  It’s not always appropriate, like if our application is being triggered to display a compose UI or from a new mail notification that wants to show a specific message or a different folder.  But this is a decision we can make synchronously so it doesn’t slow us down.

Local Storage: Okay in small doses

We implement this by storing the HTML in localStorage.

Important Disclaimer!  LocalStorage is a bad API.  It’s a bad API because it’s synchronous.  You can read any value stored in it at any time, without waiting for a callback.  Which means if the data is not in memory the browser needs to block its event loop or spin a nested event loop until the data has been read from disk.  Browsers avoid this now by trying to preload the Entire contents of local storage for your origin into memory as soon as they know your page is being loaded.  And then they keep that information, ALL of it, in memory until your page is gone.

So if you store a megabyte of data in local storage, that’s a megabyte of data that needs to be loaded in its entirety before you can use any of it, and that hangs around in scarce phone memory.

To really make the point: do not use local storage, at least not directly.  Use a library like localForage that will use IndexedDB when available, and then fail over to WebSQLDatabase and local storage, in that order.

Now, having sufficiently warned you of the terrible evils of local storage, I can say with a sorta-clear conscience… there are upsides in this very specific case.

The synchronous nature of the API means that once we get our turn in the event loop we can act immediately.  There’s no waiting around for an IndexedDB read result to get its turn on the event loop.

This matters because although the concept of loading is simple from a User Experience perspective, there’s no standard to back it up right now.  Firefox OS’s UX desires are very straightforward.  When you tap on an app, we zoom it in.  Until the app is loaded we display the app’s icon in the center of the screen.  Unfortunately the standards are still assuming that the content is right there in the HTML.  This works well for document-based web pages or server-powered web apps where the contents of the page are baked in.  They work less well for client-only web apps where the content lives in a database and has to be dynamically retrieved.

The two events that exist are:

“DOMContentLoaded” fires when the document has been fully parsed and all scripts not tagged as “async” have run.  If there were stylesheets referenced prior to the script tags, the script tags will wait for the stylesheet loads.

“load” fires when the document has been fully loaded; stylesheets, images, everything.

But none of these have anything to do with the content in the page saying it’s actually done.  This matters because these standards also say nothing about IndexedDB reads or the like.  We tried to create a standards consensus around this, but it’s not there yet.  So Firefox OS just uses the “load” event to decide an app or page has finished loading and it can stop showing your app icon.  This largely avoids the dreaded “flash of unstyled content” problem, but it also means that your webpage or app needs to deal with this period of time by displaying a loading UI or just accepting a potentially awkward transient UI state.

(Trivial HTML slide)

<link rel="stylesheet" ...>
<script ...></script>
DOMContentLoaded!

This is the important summary of our index.html.

We reference our stylesheet first.  It includes all of our styles.  We never dynamically load stylesheets because that compels a style recalculation for all nodes and potentially a reflow.  We would have to have an awful lot of style declarations before considering that.

Then we have our single script file.  Because the stylesheet precedes the script, our script will not execute until the stylesheet has been loaded.  Then our script runs and we synchronously insert our HTML from local storage.  Then DOMContentLoaded can fire.  At this point the layout engine has enough information to perform a style recalculation and determine what CSS-referenced image resources need to be loaded for buttons and icons, then those load, and then we’re good to be displayed as the “load” event can fire.

After that, we’re displaying an interactive-ish HTML document.  You can scroll, you can press on buttons and the :active state will apply.  So things seem real.

Being Fast: Lazy Loading and Optimized Layers

But now we need to try and get some logic in place as quickly as possible that will actually cash the checks that real-looking HTML UI is writing.  And the key to that is only loading what you need when you need it, and trying to get it to load as quickly as possible.

There are many module loading and build optimizing tools out there, and most frameworks have a preferred or required way of handling this.  We used the RequireJS family of Asynchronous Module Definition loaders, specifically the alameda loader and the r-dot-js optimizer.

One of the niceties of the loader plugin model is that we are able to express resource dependencies as well as code dependencies.

RequireJS Loader Plugins

var fooModule = require('./foo');
var htmlString = require('text!./foo.html');
var localizedDomNode = require('tmpl!./foo.html');

The standard Common JS loader semantics used by node.js and io.js are the first one you see here.  Load the module, return its exports.

But RequireJS loader plugins also allow us to do things like the second line where the exclamation point indicates that the load should occur using a loader plugin, which is itself a module that conforms to the loader plugin contract.  In this case it’s saying load the file foo.html as raw text and return it as a string.

But, wait, there’s more!  loader plugins can do more than that.  The third example uses a loader that loads the HTML file using the ‘text’ plugin under the hood, creates an HTML document fragment, and pre-localizes it using our localization library.  And this works un-optimized in a browser, no compilation step needed, but it can also be optimized.

So when our optimizer runs, it bundles up the core modules we use, plus, the modules for our “message list” card that displays the inbox.  And the message list card loads its HTML snippets using the template loader plugin.  The r-dot-js optimizer then locates these dependencies and the loader plugins also have optimizer logic that results in the HTML strings being inlined in the resulting optimized file.  So there’s just one single javascript file to load with no extra HTML file dependencies or other loads.

We then also run the optimizer against our other important cards like the “compose” card and the “message reader” card.  We don’t do this for all cards because it can be hard to carve up the module dependency graph for optimization without starting to run into cases of overlap where many optimized files redundantly include files loaded by other optimized files.

Plus, we have another trick up our sleeve:

Seeming Fast: Preloading

Preloading.  Our cards optionally know the other cards they can load.  So once we display a card, we can kick off a preload of the cards that might potentially be displayed.  For example, the message list card can trigger the compose card and the message reader card, so we can trigger a preload of both of those.

But we don’t go overboard with preloading in the frontend because we still haven’t actually loaded the back-end that actually does all the emaily email stuff.  The back-end is also chopped up into optimized layers along account type lines and online/offline needs, but the main optimized JS file still weighs in at something like 17 thousand lines of code with newlines retained.

So once our UI logic is loaded, it’s time to kick-off loading the back-end.  And in order to avoid impacting the responsiveness of the UI both while it loads and when we’re doing steady-state processing, we run it in a DOM Worker.

Being Responsive: Workers and SharedWorkers

DOM Workers are background JS threads that lack access to the page’s DOM, communicating with their owning page via message passing with postMessage.  Normal workers are owned by a single page.  SharedWorkers can be accessed via multiple pages from the same document origin.

By doing this, we stay out of the way of the main thread.  This is getting less important as browser engines support Asynchronous Panning & Zooming or “APZ” with hardware-accelerated composition, tile-based rendering, and all that good stuff.  (Some might even call it magic.)

When Firefox OS started, we didn’t have APZ, so any main-thread logic had the serious potential to result in janky scrolling and the impossibility of rendering at 60 frames per second.  It’s a lot easier to get 60 frames-per-second now, but even asynchronous pan and zoom potentially has to wait on dispatching an event to the main thread to figure out if the user’s tap is going to be consumed by app logic and preventDefault called on it.  APZ does this because it needs to know whether it should start scrolling or not.

And speaking of 60 frames-per-second…

Being Fast: Virtual List Widgets

…the heart of a mail application is the message list.  The expected UX is to be able to fling your way through the entire list of what the email app knows about and see the messages there, just like you would on a native app.

This is admittedly one of the areas where native apps have it easier.  There are usually list widgets that explicitly have a contract that says they request data on an as-needed basis.  They potentially even include data bindings so you can just point them at a data-store.

But HTML doesn’t yet have a concept of instantiate-on-demand for the DOM, although it’s being discussed by Firefox layout engine developers.  For app purposes, the DOM is a scene graph.  An extremely capable scene graph that can handle huge documents, but there are footguns and it’s arguably better to err on the side of fewer DOM nodes.

So what the email app does is we create a scroll-region div and explicitly size it based on the number of messages in the mail folder we’re displaying.  We create and render enough message summary nodes to cover the current screen, 3 screens worth of messages in the direction we’re scrolling, and then we also retain up to 3 screens worth in the direction we scrolled from.  We also pre-fetch 2 more screens worth of messages from the database.  These constants were arrived at experimentally on prototype devices.

We listen to “scroll” events and issue database requests and move DOM nodes around and update them as the user scrolls.  For any potentially jarring or expensive transitions such as coordinate space changes from new messages being added above the current scroll position, we wait for scrolling to stop.

Nodes are absolutely positioned within the scroll area using their ‘top’ style but translation transforms also work.  We remove nodes from the DOM, then update their position and their state before re-appending them.  We do this because the browser APZ logic tries to be clever and figure out how to create an efficient series of layers so that it can pre-paint as much of the DOM as possible in graphic buffers, AKA layers, that can be efficiently composited by the GPU.  Its goal is that when the user is scrolling, or something is being animated, that it can just move the layers around the screen or adjust their opacity or other transforms without having to ask the layout engine to re-render portions of the DOM.

When our message elements are added to the DOM with an already-initialized absolute position, the APZ logic lumps them together as something it can paint in a single layer along with the other elements in the scrolling region.  But if we start moving them around while they’re still in the DOM, the layerization logic decides that they might want to independently move around more in the future and so each message item ends up in its own layer.  This slows things down.  But by removing them and re-adding them it sees them as new with static positions and decides that it can lump them all together in a single layer.  Really, we could just create new DOM nodes, but we produce slightly less garbage this way and in the event there’s a bug, it’s nicer to mess up with 30 DOM nodes displayed incorrectly rather than 3 million.

But as neat as the layerization stuff is to know about on its own, I really mention it to underscore 2 suggestions:

1, Use a library when possible.  Getting on and staying on APZ fast-paths is not trivial, especially across browser engines.  So it’s a very good idea to use a library rather than rolling your own.

2, Use developer tools.  APZ is tricky to reason about and even the developers who write the Async pan & zoom logic can be surprised by what happens in complex real-world situations.  And there ARE developer tools available that help you avoid needing to reason about this.  Firefox OS has easy on-device developer tools that can help diagnose what’s going on or at least help tell you whether you’re making things faster or slower:

– it’s got a frames-per-second overlay; you do need to scroll like mad to get the system to want to render 60 frames-per-second, but it makes it clear what the net result is

– it has paint flashing that overlays random colors every time it paints the DOM into a layer.  If the screen is flashing like a discotheque or has a lot of smeared rainbows, you know something’s wrong because the APZ logic is not able to just reuse its layers.

– devtools can enable drawing cool colored borders around the layers APZ has created so you can see if layerization is doing something crazy

There’s also fancier and more complicated tools in Firefox and other browsers like Google Chrome to let you see what got painted, what the layer tree looks like, et cetera.

And that’s my spiel.

Links

The source code to Gaia can be found at https://github.com/mozilla-b2g/gaia

The email app in particular can be found at https://github.com/mozilla-b2g/gaia/tree/master/apps/email

(I also asked for questions here.)


Joshua Cranmer: Breaking news

Thunderbird - Wed, 01/04/2015 - 09:00
It was brought to my attention recently by reputable sources that the recent announcement of increased usage in recent years produced an internal firestorm within Mozilla. Key figures raised alarm that some of the tech press had interpreted the blog post as a sign that Thunderbird was not, in fact, dead. As a result, they asked Thunderbird community members to make corrections to emphasize that Mozilla was trying to kill Thunderbird.

The primary fear, it seems, is that knowledge that the largest open-source email client was still receiving regular updates would impel its userbase to agitate for increased funding and maintenance of the client to help forestall potential threats to the open nature of email as well as to innovate in the space of providing usable and private communication channels. Such funding, however, would be an unaffordable luxury and would only distract Mozilla from its central goal of building developer productivity tooling. Persistent rumors that Mozilla would be willing to fund Thunderbird were it renamed Firefox Email were finally addressed with the comment, "such a renaming would violate our current policy that all projects be named Persona."


Joshua Cranmer: Why email is hard, part 8: why email security failed

Thunderbird - Tue, 13/01/2015 - 05:38
This post is part 8 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. Part 4 discusses email addresses. Part 5 discusses the more general problem of email headers. Part 6 discusses how email security works in practice. Part 7 discusses the problem of trust. This part discusses why email security has largely failed.

At the end of the last part in this series, I posed the question, "Which email security protocol is most popular?" The answer to the question is actually neither S/MIME nor PGP, but a third protocol, DKIM. I haven't brought up DKIM until now because DKIM doesn't try to secure email in the same vein as S/MIME or PGP, but I still consider it relevant to discussing email security.

Unquestionably, DKIM is the only security protocol for email that can be considered successful. There are perhaps 4 billion active email addresses [1]. Of these, about 1-2 billion use DKIM. In contrast, S/MIME can count a few million users, and PGP at best a few hundred thousand. No other security protocols have really caught on past these three. Why did DKIM succeed where the others fail?

DKIM's success stems from its relatively narrow focus. It is nothing more than a cryptographic signature of the message body and a smattering of headers, and is itself stuck in the DKIM-Signature header. It is meant to be applied to messages only on outgoing servers and read and processed at the recipient mail server—it completely bypasses clients. That it bypasses clients allows it to solve the problem of key discovery and key management very easily (public keys are stored in DNS, which is already a key part of mail delivery), and its role in spam filtering is strong motivation to get it implemented quickly (it is 7 years old as of this writing). It's also simple: this one paragraph description is basically all you need to know [2].

The failure of S/MIME and PGP to see large deployment is certainly a large topic of discussion on myriad cryptography-enthusiast mailing lists, which often like to entertain proposals for new end-to-end email encryption paradigms, such as the recent DIME proposal. Quite frankly, all of these solutions suffer broadly from at least the same five fundamental weaknesses, and I consider it unlikely that a protocol will come about that can fix these weaknesses well enough to become successful.

The first weakness, and one I've harped on many times already, is UI. Most email security UI is abysmal and generally at best usable only by enthusiasts. At least some of this is endemic to security: while it may seem obvious how to convey what an email signature or an encrypted email signifies, how do you convey the distinctions between sign-and-encrypt, encrypt-and-sign, or an S/MIME triple wrap? The Web of Trust model used by PGP (and many other proposals) is even worse, in that it inherently requires users to take other actions out-of-band of email to work properly.

Trust is the second weakness. Consider that, for all intents and purposes, the email address is the unique identifier on the Internet. By extension, that implies that a lot of services are ultimately predicated on the notion that the ability to receive and respond to an email is a sufficient means to identify an individual. However, the entire purpose of secure email, or at least of end-to-end encryption, is subtly premised on the assumption that other people do in fact have access to your mailbox, thus destroying the most natural ways to build trust models on the Internet. The quest for anonymity or privacy also renders untenable many other plausible ways to establish trust (e.g., phone verification or government-issued ID cards).

Key discovery is another weakness, although it's arguably the easiest one to solve. If you try to keep discovery independent of trust, the problem of key discovery is merely a matter of picking one protocol to publish keys and another to find them. Some of these already exist: PGP key servers, for example, or using DANE to publish S/MIME or PGP keys.

Key management, on the other hand, is a more troubling weakness. S/MIME, for example, basically works without issue if you have a certificate, but managing to get an S/MIME certificate is a daunting task (necessitated, in part, by its trust model—see how these issues all intertwine?). This is also where it's easy to say that webmail is an unsolvable problem, but on further reflection, I'm not sure I agree with that statement anymore. One solution is just storing the private key with the webmail provider (you're trusting them as an email client, after all), but it's also not impossible to imagine using phones or flash drives as keystores. Other key management factors are more difficult to solve: lost private keys and key rollover create thorny issues. There is also the difficulty of managing user expectations: if I forget my password to most sites (even my email provider), I can usually get it reset somehow, but when a private key is lost, the user is totally and completely out of luck.

Of course, there is one glaring and almost completely insurmountable problem. Encrypted email fundamentally precludes certain features that we have come to take for granted. The lesser-known of these is server-side search and filtering. While there exist some mechanisms to do search on encrypted text, those mechanisms rely on the fact that you can manipulate the text to change the message, destroying the integrity feature of secure email. They also tend to be fairly expensive. It's easy to just say "who needs server-side stuff?", but the contingent of people who do email on smartphones would not be happy to have to pay the transfer rates to download all the messages in their folder just to find one little email, nor the energy costs of doing it on the phone. And those who have really large folders—Fastmail has a design point of 1,000,000 messages in a single folder—would still prefer not to have to transfer all their mail even on desktops.

The better-known feature that would disappear is spam filtering. Consider that 90% of all email is spam, and if you think your spam folder is too slim for that to be true, it's because your spam folder only contains messages that your email provider wasn't sure were spam. The loss of server-side spam filtering would dramatically increase the cost of spam (a 10% reduction in filtering efficiency would double the amount of server storage, per my calculations; presumably because letting through an extra tenth of a spam stream nine times the size of legitimate mail roughly doubles what must be stored), and client-side spam filtering is quite literally too slow [3] and too costly (remember smartphones? Imagine having your email take 10 times as much energy and bandwidth) to be a tenable option. And privacy or anonymity tends to be an invitation to abuse (cf. Tor and Wikipedia). Proposed solutions to the spam problem are so common that there is a checklist containing most of the objections.

When you consider all of those weaknesses, it is easy to be pessimistic about the possibility of wide deployment of powerful email security solutions. The strongest future—all email is encrypted, including metadata—is probably impossible or at least woefully impractical. That said, if you weaken some of the assumptions (say, don't desire all or most traffic to be encrypted), then solutions seem possible if difficult.

This concludes my discussion of email security, at least until things change for the better. I don't have a topic for the next part in this series picked out (this part actually concludes the set I knew I wanted to discuss when I started), although OAuth and DMARC are two topics that have been bugging me enough recently to consider writing about. They also have the unfortunate side effect of being things likely to see changes in the near future, unlike most of the topics I've discussed so far. But rest assured that I will find more difficulties in the email infrastructure to write about before long!

[1] All of these numbers are crude estimates and are accurate to only an order of magnitude. To justify my choices: I assume 1 email address per Internet user (this overestimates the developing world and underestimates the developed world). The largest webmail providers have given numbers that claim to be 1 billion active accounts between them, and all of them use DKIM. S/MIME is guessed by assuming that any smartcard deployment supports S/MIME, and noting that the US Department of Defense and Estonia's digital ID project are both heavy users of such smartcards. PGP is estimated from the size of the strong set and old numbers on the reachable set from the core Web of Trust.
[2] Ever since last April, it's become impossible to mention DKIM without referring to DMARC, as a result of Yahoo's controversial DMARC policy. A proper discussion of DMARC (and why what Yahoo did was controversial) requires explaining the mail transmission architecture and spam, however, so I'll defer that to a later post. It's also possible that changes in this space could happen within the next year.
[3] According to a former GMail spam employee, if it takes you as long as three minutes to calculate reputation, the spammer wins.

Categorieën: Mozilla-nl planet

Joshua Cranmer: A unified history for comm-central

Thunderbird - za, 10/01/2015 - 18:55
Several years back, Ehsan and Jeff Muizelaar attempted to build a unified history of mozilla-central across the Mercurial era and the CVS era. Their result is now used in the gecko-dev repository. While being distracted on yet another side project, I thought that I might want to do the same for comm-central. It turns out that building a unified history for comm-central makes mozilla-central look easy: mozilla-central merely had one import from CVS. In contrast, comm-central imported twice from CVS (the calendar code came later), four times from mozilla-central (once with converted history), and imported twice from Instantbird's repository (once with converted history). Three of those conversions also involved moving paths. But I've worked through all of those issues to provide a nice snapshot of the repository [1]. And since I've been frustrated by failing to find good documentation on how this sort of process went for mozilla-central, I'll provide details on the process for comm-central.

The first step and probably the hardest is getting the CVS history in DVCS form (I use hg because I'm more comfortable with it, but there's effectively no difference between hg, git, or bzr here). There is a git version of mozilla's CVS tree available, but I noticed after doing research that its last revision is about a month before the revision I need for Calendar's import. The documentation for how that repo was built is no longer on the web, although a copy was eventually found on git.mozilla.org after I wrote this post. I tried doing another conversion using hg convert to get CVS tags, but that rudely blew up in my face. For now, I've filed a bug on getting an official, branchy-and-tag-filled version of this repository, while using the current lack of history as a base. Calendar people will have to suffer missing a month of history.

CVS is famously hard to convert to more modern repositories, and, as I've done my research, Mozilla's CVS looks like it uses those features which make it difficult. In particular, both the calendar CVS import and the comm-central initial CVS import used a CVS tag HG_COMM_INITIAL_IMPORT. That tagging was done, on only a small portion of the tree, twice, about two months apart. Fortunately, mailnews code was never touched on CVS trunk after the import (there appears to be one commit on calendar after the tagging), so it is probably possible to salvage a repository-wide consistent tag.

The start of my script for conversion looks like this:

#!/bin/bash
set -e

WORKDIR=/tmp
HGCVS=$WORKDIR/mozilla-cvs-history
MC=/src/trunk/mozilla-central
CC=/src/trunk/comm-central
OUTPUT=$WORKDIR/full-c-c

# Bug 445146: m-c/editor/ui -> c-c/editor/ui
MC_EDITOR_IMPORT=d8064eff0a17372c50014ee305271af8e577a204
# Bug 669040: m-c/db/mork -> c-c/db/mork
MC_MORK_IMPORT=f2a50910befcf29eaa1a29dc088a8a33e64a609a
# Bug 1027241, bug 611752 m-c/security/manager/ssl/** -> c-c/mailnews/mime/src/*
MC_SMIME_IMPORT=e74c19c18f01a5340e00ecfbc44c774c9a71d11d

# Step 0: Grab the mozilla CVS history.
if [ ! -e $HGCVS ]; then
  hg clone git+https://github.com/jrmuizel/mozilla-cvs-history.git $HGCVS
fi

Since I don't want to include the changesets useless to comm-central history, I trimmed the history by using hg convert to eliminate changesets that don't change the necessary files. Most of the filemap entries are simple directory-wide includes, but S/MIME only moved a few files over, so it requires a more complex way to grab the file list. In addition, I also replaced the % in the usernames with the @ that hg users are used to seeing. The relevant code is here:

# Step 1: Trim mozilla CVS history to include only the files we are ultimately
# interested in.
cat >$WORKDIR/convert-filemap.txt <<EOF
# Revision e4f4569d451a
include directory/xpcom
include mail
include mailnews
include other-licenses/branding/thunderbird
include suite
# Revision 7c0bfdcda673
include calendar
include other-licenses/branding/sunbird
# Revision ee719a0502491fc663bda942dcfc52c0825938d3
include editor/ui
# Revision 52efa9789800829c6f0ee6a005f83ed45a250396
include db/mork/
include db/mdb/
EOF

# Add the S/MIME import files
hg -R $MC log -r "children($MC_SMIME_IMPORT)" \
  --template "{file_dels % 'include {file}\n'}" >>$WORKDIR/convert-filemap.txt

if [ ! -e $WORKDIR/convert-authormap.txt ]; then
  hg -R $HGCVS log --template "{email(author)}={sub('%', '@', email(author))}\n" \
    | sort -u > $WORKDIR/convert-authormap.txt
fi

cd $WORKDIR
hg convert $HGCVS $OUTPUT --filemap convert-filemap.txt -A convert-authormap.txt

That last command provides us the subset of the CVS history that we need for unified history. Strictly speaking, I should be pulling a specific revision, but I happen to know that there's no need to in this case (we're cloning the only head). At this point, we now need to pull in the mozilla-central changes before we pull in comm-central. Order is key; hg convert will only apply the graft points when converting the child changeset (which it does but once), and it needs the parents to exist before it can do that. We also need to ensure that the mozilla-central graft point is included before continuing, so we do that, and then pull mozilla-central:

CC_CVS_BASE=$(hg log -R $HGCVS -r 'tip' --template '{node}')
CC_CVS_BASE=$(grep $CC_CVS_BASE $OUTPUT/.hg/shamap | cut -d' ' -f2)
MC_CVS_BASE=$(hg log -R $HGCVS -r 'gitnode(215f52d06f4260fdcca797eebd78266524ea3d2c)' --template '{node}')
MC_CVS_BASE=$(grep $MC_CVS_BASE $OUTPUT/.hg/shamap | cut -d' ' -f2)

# Okay, now we need to build the map of revisions.
cat >$WORKDIR/convert-revmap.txt <<EOF
e4f4569d451a5e0d12a6aa33ebd916f979dd8faa $CC_CVS_BASE # Thunderbird / Suite
7c0bfdcda6731e77303f3c47b01736aaa93d5534 d4b728dc9da418f8d5601ed6735e9a00ac963c4e, $CC_CVS_BASE # Calendar
9b2a99adc05e53cd4010de512f50118594756650 $MC_CVS_BASE # Mozilla graft point
ee719a0502491fc663bda942dcfc52c0825938d3 78b3d6c649f71eff41fe3f486c6cc4f4b899fd35, $MC_EDITOR_IMPORT # Editor
8cdfed92867f885fda98664395236b7829947a1d 4b5da7e5d0680c6617ec743109e6efc88ca413da, e4e612fcae9d0e5181a5543ed17f705a83a3de71 # Chat
EOF

# Next, import mozilla-central revisions
for rev in $MC_MORK_IMPORT $MC_EDITOR_IMPORT $MC_SMIME_IMPORT; do
  hg convert $MC $OUTPUT -r $rev --splicemap $WORKDIR/convert-revmap.txt \
    --filemap $WORKDIR/convert-filemap.txt
done

Some notes about all of the revision ids in the script. The splicemap requires the full 40-character SHA ids; anything less and the thing complains. I also need to specify the parents of the revisions that deleted the code for the mozilla-central import, so if you go hunting for those revisions and are surprised that they don't remove the code in question, that's why.

I mentioned complications about the merges earlier. The Mork and S/MIME imports here moved files, so that what was db/mdb in mozilla-central became db/mork. There's no support for causing the generated splice to record these as a move, so I have to manually construct those renamings:

# We need to execute a few hg move commands due to renamings.
pushd $OUTPUT
hg update -r $(grep $MC_MORK_IMPORT .hg/shamap | cut -d' ' -f2)
(hg -R $MC log -r "children($MC_MORK_IMPORT)" \
  --template "{file_dels % 'hg mv {file} {sub(\"db/mdb\", \"db/mork\", file)}\n'}") | bash
hg commit -m 'Pseudo-changeset to move Mork files' -d '2011-08-06 17:25:21 +0200'
MC_MORK_IMPORT=$(hg log -r tip --template '{node}')

hg update -r $(grep $MC_SMIME_IMPORT .hg/shamap | cut -d' ' -f2)
(hg -R $MC log -r "children($MC_SMIME_IMPORT)" \
  --template "{file_dels % 'hg mv {file} {sub(\"security/manager/ssl\", \"mailnews/mime\", file)}\n'}") | bash
hg commit -m 'Pseudo-changeset to move S/MIME files' -d '2014-06-15 20:51:51 -0700'
MC_SMIME_IMPORT=$(hg log -r tip --template '{node}')
popd

# Echo the new move commands to the changeset conversion map.
cat >>$WORKDIR/convert-revmap.txt <<EOF
52efa9789800829c6f0ee6a005f83ed45a250396 abfd23d7c5042bc87502506c9f34c965fb9a09d1, $MC_MORK_IMPORT # Mork
50f5b5fc3f53c680dba4f237856e530e2097adfd 97253b3cca68f1c287eb5729647ba6f9a5dab08a, $MC_SMIME_IMPORT # S/MIME
EOF

Now that we have all of the graft points defined, and all of the external code ready, we can pull comm-central and do the conversion. That's not quite it, though—when we graft the S/MIME history to the original mozilla-central history, we have a small segment of abandoned converted history. A call to hg strip removes that.

# Now, import comm-central revisions that we need
hg convert $CC $OUTPUT --splicemap $WORKDIR/convert-revmap.txt
hg strip 2f69e0a3a05a

[1] I left out one of the graft points because I just didn't want to deal with it. I'll leave it as an exercise to the reader to figure out which one it was. Hint: it's the only one I didn't know about before I searched for the archive points [2].
[2] Since I wasn't sure I knew all of the graft points, I decided to try to comb through all of the changesets to figure out who imported code. It turns out that hg log -r 'adds("**")' narrows it down nicely (1667 changesets to look at instead of 17547), and using the {file_adds} template helps winnow it down more easily.

Categorieën: Mozilla-nl planet

Philipp Kewisch: Monitor all http(s) network requests using the Mozilla Platform

Thunderbird - do, 02/10/2014 - 16:38

In an xpcshell test, I recently needed a way to monitor all network requests and access both request and response data so I can save them for later use. This required a little bit of digging in Mozilla’s devtools code so I thought I’d write a short blog post about it.

This code will be used in a testcase that ensures that calendar providers in Lightning function properly. In the case of the CalDAV provider, we would need to access a real server for testing. We can’t just set up a few servers and use them for testing; that would end in an unreasonable amount of server maintenance. Given that non-local connections are not allowed when running the tests on the Mozilla build infrastructure, it wouldn’t work anyway. The solution is to create a fakeserver that is able to replay the requests in the same way. Instead of manually making the requests and figuring out how the server replies, we can use this code to quickly collect all the requests we need.

Without further delay, here is the code you have been waiting for:

/* This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

var allRequests = [];

/**
 * Add the following function as a request observer:
 *   Services.obs.addObserver(httpObserver, "http-on-examine-response", false);
 *
 * When done listening on requests:
 *   dump(allRequests.join("\n===\n")); // print them
 *   dump(JSON.stringify(allRequests, null, " ")); // jsonify them
 */
function httpObserver(aSubject, aTopic, aData) {
  if (aSubject instanceof Components.interfaces.nsITraceableChannel) {
    let request = new TracedRequest(aSubject);
    request._next = aSubject.setNewListener(request);
    allRequests.push(request);
  }
}

/**
 * This is the object that represents a request/response and also collects the data for it.
 *
 * @param aSubject      The channel from the response observer.
 */
function TracedRequest(aSubject) {
  let httpchannel = aSubject.QueryInterface(Components.interfaces.nsIHttpChannel);

  let self = this;
  this.requestHeaders = Object.create(null);
  httpchannel.visitRequestHeaders({
    visitHeader: function(k, v) {
      self.requestHeaders[k] = v;
    }
  });
  this.responseHeaders = Object.create(null);
  httpchannel.visitResponseHeaders({
    visitHeader: function(k, v) {
      self.responseHeaders[k] = v;
    }
  });

  this.uri = aSubject.URI.spec;
  this.method = httpchannel.requestMethod;
  this.requestBody = readRequestBody(aSubject);
  this.responseStatus = httpchannel.responseStatus;
  this.responseStatusText = httpchannel.responseStatusText;

  this._chunks = [];
}

TracedRequest.prototype = {
  uri: null,
  method: null,
  requestBody: null,
  requestHeaders: null,
  responseStatus: null,
  responseStatusText: null,
  responseHeaders: null,
  responseBody: null,

  toJSON: function() {
    let j = Object.create(null);
    for (let m of Object.keys(this)) {
      if (typeof this[m] != "function" && m[0] != "_") {
        j[m] = this[m];
      }
    }
    return j;
  },

  onStartRequest: function(aRequest, aContext) {
    this._next.onStartRequest(aRequest, aContext);
  },

  onStopRequest: function(aRequest, aContext, aStatusCode) {
    this.responseBody = this._chunks.join("");
    this._chunks = null;
    this._next.onStopRequest(aRequest, aContext, aStatusCode);
    this._next = null;
  },

  onDataAvailable: function(aRequest, aContext, aStream, aOffset, aCount) {
    let binaryInputStream = Components.classes["@mozilla.org/binaryinputstream;1"]
                                      .createInstance(Components.interfaces.nsIBinaryInputStream);
    let storageStream = Components.classes["@mozilla.org/storagestream;1"]
                                  .createInstance(Components.interfaces.nsIStorageStream);
    let outStream = Components.classes["@mozilla.org/binaryoutputstream;1"]
                              .createInstance(Components.interfaces.nsIBinaryOutputStream);

    binaryInputStream.setInputStream(aStream);
    storageStream.init(8192, aCount, null);
    outStream.setOutputStream(storageStream.getOutputStream(0));

    let data = binaryInputStream.readBytes(aCount);
    this._chunks.push(data);

    outStream.writeBytes(data, aCount);
    this._next.onDataAvailable(aRequest, aContext,
                               storageStream.newInputStream(0),
                               aOffset, aCount);
  },

  toString: function() {
    let str = this.method + " " + this.uri + "\n";
    for (let hdr of Object.keys(this.requestHeaders)) {
      str += hdr + ": " + this.requestHeaders[hdr] + "\n";
    }
    if (this.requestBody) {
      str += "\r\n" + this.requestBody + "\n";
    }
    str += "\n" + this.responseStatus + " " + this.responseStatusText;
    if (this.responseBody) {
      str += "\r\n" + this.responseBody + "\n";
    }
    return str;
  }
};

// Taken from:
// http://hg.mozilla.org/mozilla-central/file/2399d1ae89e9/toolkit/devtools/webconsole/network-helper.js#l120
function readRequestBody(aRequest, aCharset="UTF-8") {
  let text = null;
  if (aRequest instanceof Ci.nsIUploadChannel) {
    let iStream = aRequest.uploadStream;

    let isSeekableStream = false;
    if (iStream instanceof Ci.nsISeekableStream) {
      isSeekableStream = true;
    }

    let prevOffset;
    if (isSeekableStream) {
      prevOffset = iStream.tell();
      iStream.seek(Ci.nsISeekableStream.NS_SEEK_SET, 0);
    }

    // Read data from the stream.
    try {
      let rawtext = NetUtil.readInputStreamToString(iStream, iStream.available());
      let conv = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
                           .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
      conv.charset = aCharset;
      text = conv.ConvertToUnicode(rawtext);
    } catch (err) {
    }

    // Seek locks the file, so seek to the beginning only if necko hasn't
    // read it yet, since necko doesn't seek to 0 before reading (at least
    // not till bug 459384 is fixed).
    if (isSeekableStream && prevOffset == 0) {
      iStream.seek(Components.interfaces.nsISeekableStream.NS_SEEK_SET, 0);
    }
  }
  return text;
}
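
A minimal usage sketch for an xpcshell test, assuming the snippet above has been loaded; the NetUtil import is needed by readRequestBody(), and the scaffolding around the actual requests under test is omitted:

Components.utils.import("resource://gre/modules/Services.jsm");
Components.utils.import("resource://gre/modules/NetUtil.jsm"); // used by readRequestBody()

// Start collecting responses:
Services.obs.addObserver(httpObserver, "http-on-examine-response", false);

// ... exercise the code under test so it performs its network requests ...

// Stop collecting and dump everything we saw:
Services.obs.removeObserver(httpObserver, "http-on-examine-response");
dump(JSON.stringify(allRequests, null, " ") + "\n");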

(The snippet above is hosted on GitHub as TracedRequest.js.)

Categorieën: Mozilla-nl planet

Joshua Cranmer: Why email is hard, part 7: email security and trust

Thunderbird - wo, 06/08/2014 - 05:39
This post is part 7 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. Part 4 discusses email addresses. Part 5 discusses the more general problem of email headers. Part 6 discusses how email security works in practice. This part discusses the problem of trust.

At a technical level, S/MIME and PGP (or at least PGP/MIME) use cryptography essentially identically. Yet the two are treated as radically different models of email security because they diverge on the most important question of public key cryptography: how do you trust the identity of a public key? Trust is critical, as it is the only way to stop an active, man-in-the-middle (MITM) attack. MITM attacks are actually easier to pull off in email, since all email messages effectively have to pass through both the sender's and the recipients' email servers [1], allowing attackers to pull off permanent, long-lasting MITM attacks [2].

S/MIME uses the same trust model that SSL uses, based on X.509 certificates and certificate authorities. X.509 certificates effectively work by providing a certificate that says who you are, signed by another authority. In the original concept (as you might guess from the name "X.509"), the trusted authority was your telecom provider, and the certificates were furthermore intended to be a part of the global X.500 directory—a natural extension of the OSI internet model. The OSI model of the internet never gained traction, and the trusted telecom providers were replaced with trusted root CAs.

PGP, by contrast, uses a trust model that's generally known as the Web of Trust. Every user has a PGP key (containing their identity and their public key), and users can sign others' public keys. Trust generally flows from these signatures: if you trust a user, you know the keys that they sign are correct. The name "Web of Trust" comes from the vision that trust flows along the paths of signatures, building a tight web of trust.

And now for the controversial part of the post: the comparisons and critiques of these trust models. A disclaimer: I am not a security expert, although I am a programmer who revels in dreaming up arcane edge cases. I also don't use PGP at all, and use S/MIME to a very limited extent for some Mozilla work [3], although I did make a few abortive attempts to dogfood it in the past. I've attempted to replace personal experience with comprehensive research [4], but most existing critiques and comparisons of these two trust models are about 10-15 years old and predate several changes to CA certificate practices.

A basic tenet I have found in development is that the average user is fairly ignorant. At the same time, a lot of the defense of trust models, both CAs and Web of Trust, tends to hinge on configurability. How many people, for example, know how to add or remove a CA root from Firefox, Windows, or Android? Even among the subgroup of Mozilla developers, I suspect the number of people who know how to do so is rather few. Or in the case of PGP, how many people know how to change the maximum path length? Or even understand the security implications of doing so?

Seen in the light of ignorant users, the Web of Trust is a UX disaster. Its entire security model is predicated on having users precisely specify how much they trust other people to trust others (ultimate, full, marginal, none, unknown) and also on having them continually do out-of-band verification procedures and publicly reporting those steps. In 1998, a seminal paper on the usability of a GUI for PGP encryption came to the conclusion that the UI was effectively unusable for users, to the point that only a third of the users were able to send an encrypted email (and even then, only with significant help from the test administrators), and a quarter managed to publicly announce their private keys at some point, which is pretty much the worst thing you can do. They also noted that the complex trust UI was never used by participants, although the failure of many users to get that far makes generalization dangerous [5]. While newer versions of security UI have undoubtedly fixed many of the original issues found (in no small part due to the paper, one of the first to argue that usability is integral, not orthogonal, to security), I have yet to find an actual study on the usability of the trust model itself.

The Web of Trust has other faults. The notion of "marginal" trust, it turns out, is rather broken: if you marginally trust a user whose two keys both signed another person's key, that's the same as fully trusting a user with one key who signed that key. There are several proposals for different trust formulas [6], but none of them have caught on in practice to my knowledge.

A hidden fault is associated with its manner of presentation: in sharp contrast to CAs, the Web of Trust appears to not delegate trust, but any practical widespread deployment needs to solve the problem of contacting people who have had no prior contact. Combined with the need to bootstrap new users, this implies that there needs to be some keys that have signed a lot of other keys that are essentially default-trusted—in other words, a CA, a fact sometimes lost on advocates of the Web of Trust.

That said, a valid point in favor of the Web of Trust is that it more easily allows people to distrust CAs if they wish to. While I'm skeptical of its utility to a broader audience, the ability to do so is crucial for a not-insignificant portion of the population, and it's important enough to be explicitly called out.

X.509 certificates are most commonly discussed in the context of SSL/TLS connections, so I'll discuss them in that context as well, as the implications for S/MIME are mostly the same. Almost all criticism of this trust model essentially boils down to a single complaint: certificate authorities aren't trustworthy. A historical criticism is that the addition of CAs to the main root trust stores was ad-hoc. Since then, however, the main oligopoly of these root stores (Microsoft, Apple, Google, and Mozilla) have made their policies public and clear [7]. The introduction of the CA/Browser Forum in 2005, with a collection of major CAs and the major browsers as members [8], also helps in articulating common policies. These policies, simplified immensely, boil down to:

  1. You must verify information (depending on certificate type). This information must be relatively recent.
  2. You must not use weak algorithms in your certificates (e.g., no MD5).
  3. You must not make certificates that are valid for too long.
  4. You must maintain revocation checking services.
  5. You must have fairly stringent physical and digital security practices and intrusion detection mechanisms.
  6. You must be [externally] audited every year that you follow the above rules.
  7. If you screw up, we can kick you out.

I'm not going to claim that this is necessarily the best policy or even that any policy can feasibly stop intrusions from happening. But it's a policy, so CAs must abide by some set of rules.

Another CA criticism is the fear that they may be suborned by national government spy agencies. I find this claim underwhelming, considering that the number of certificates acquired by intrusions that were used in the wild is larger than the number of certificates acquired by national governments that were used in the wild: 1 and 0, respectively. Yet no one complains about the untrustworthiness of CAs due to their ability to be hacked by outsiders. Another attack is that CAs are controlled by profit-seeking corporations, which misses the point because the business of CAs is not selling certificates but selling their access to the root databases. As we will see shortly, jeopardizing that access is a great way for a CA to go out of business.

To understand issues involving CAs in greater detail, there are two CAs that are particularly useful to look at. The first is CACert. CACert is favored by many for its attempt to handle X.509 certificates in a Web of Trust model, so invariably every public discussion about CACert ends up devolving into an attack on other CAs for their perceived capture by national governments or corporate interests. Yet what many of the proponents for inclusion of CACert miss (or dismiss) is the fact that CACert actually failed the required audit, and it is unlikely to ever pass an audit. This shows a central failure of both CAs and Web of Trust: different people have different definitions of "trust," and in the case of CACert, some people are favoring a subjective definition (I trust their owners because they're not evil) when an objective definition fails (in this case, that the root signing key is securely kept).

The other CA of note here is DigiNotar. In July 2011, some hackers managed to acquire a few fraudulent certificates by hacking into DigiNotar's systems. By late August, people had become aware of these certificates being used in practice [9] to intercept communications, mostly in Iran. The use appears to have been caught after Chromium updates failed due to invalid certificate fingerprints. After it became clear that the fraudulent certificates were not limited to a single fake Google certificate, and that DigiNotar had failed to notify potentially affected companies of its breach, DigiNotar was swiftly removed from all of the trust databases. It ended up declaring bankruptcy within two weeks.

DigiNotar indicates several things. One, SSL MITM attacks are not theoretical (I have seen at least two or three security experts advising pre-DigiNotar that SSL MITM attacks are "theoretical" and therefore the wrong target for security mechanisms). Two, keeping the trust of browsers is necessary for commercial operation of CAs. Three, the notion that a CA is "too big to fail" is false: DigiNotar played an important role in the Dutch community as a major CA and an issuer of certificates under the Dutch government's Staat der Nederlanden root. Yet when DigiNotar screwed up and lost its trust, it was swiftly kicked out despite this role. I suspect that even Verisign could be kicked out if it manages to screw up badly enough.

This isn't to say that the CA model isn't problematic. But the source of its problems is that delegating trust isn't a feasible model in the first place, a problem it shares with the Web of Trust as well. Different notions of what "trust" actually means and the uncertainty introduced as chains of trust get longer both leave delegated trust vulnerable to social-engineering and technical attacks. There appears to be an increasing consensus that the best way forward is some variant of key pinning, much akin to how SSH works: once you know someone's public key, you complain if that public key appears to change, even if it appears to be "trusted." This does leave people open to attacks on first use, and the question of what to do when you need to legitimately re-key is not easy to solve.
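
Below is a minimal sketch of that pin-on-first-use idea, just to make the model concrete; the function, the in-memory pin store, and the fingerprint string are all hypothetical, and a real client would persist its pins and derive fingerprints from its TLS or S/MIME stack.

// Hypothetical illustration of SSH-style key pinning (trust on first use).
const pins = new Map();  // identity -> public key fingerprint seen on first use

function checkPin(identity, publicKeyFingerprint) {
  if (!pins.has(identity)) {
    // First contact: remember the key rather than asking anyone to vouch for it.
    pins.set(identity, publicKeyFingerprint);
    return true;
  }
  if (pins.get(identity) !== publicKeyFingerprint) {
    // The key changed since we first saw it: complain loudly instead of
    // silently re-trusting, even if a CA or Web of Trust path vouches for it.
    throw new Error("Public key for " + identity + " changed since first use");
  }
  return true;
}

The hard parts alluded to above (first contact and legitimate re-keying) are exactly the branches where this sketch can do nothing smarter than "remember it" and "complain".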

In short, both CAs and the Web of Trust have issues. Whether or not you should prefer S/MIME or PGP ultimately comes down to the very conscious question of how you want to deal with trust—a question without a clear, obvious answer. If I appear to be painting CAs and S/MIME in a positive light and the Web of Trust and PGP in a negative one in this post, it is more because I am trying to focus on the positions less commonly taken to balance perspective on the internet. In my next post, I'll round out the discussion on email security by explaining why email security has seen poor uptake and answering the question as to which email security protocol is most popular. The answer may surprise you!

[1] Strictly speaking, you can bypass the sender's SMTP server. In practice, this is considered a hole in the SMTP system that email providers are trying to plug.
[2] I've had 13 different connections to the internet in the same time as I've had my main email address, not counting all the public wifis that I have used. Whereas an attacker would find it extraordinarily difficult to intercept all of my SSH sessions for a MITM attack, intercepting all of my email sessions is clearly far easier if the attacker were my email provider.
[3] Before you read too much into this personal choice of S/MIME over PGP, it's entirely motivated by a simple concern: S/MIME is built into Thunderbird; PGP is not. As someone who does a lot of Thunderbird development work that could easily break the Enigmail extension locally, needing to use an extension would be disruptive to workflow.
[4] This is not to say that I don't heavily research many of my other posts, but I did go so far for this one as to actually start going through a lot of published journals in an attempt to find information.
[5] It's questionable how well the usability of a trust model UI can be measured in a lab setting, since the observer effect is particularly strong for all metrics of trust.
[6] The web of trust makes a nice graph, and graphs invite lots of interesting mathematical metrics. I've always been partial to eigenvectors of the graph, myself.
[7] Mozilla's policy for addition to NSS is basically the standard policy adopted by all open-source Linux or BSD distributions, seeing as OpenSSL never attempted to produce a root database.
[8] It looks to me that it's the browsers who are more in charge in this forum than the CAs.
[9] To my knowledge, this is the first—and so far only—attempt to actively MITM an SSL connection.

Categorieën: Mozilla-nl planet

Joshua Cranmer: Why email is hard, part 6: today's email security

Thunderbird - di, 27/05/2014 - 02:32
This post is part 6 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. Part 4 discusses email addresses. Part 5 discusses the more general problem of email headers. This part discusses how email security works in practice.

Email security is a rather wide-ranging topic, and one that I've wanted to cover for some time, well before several recent events that have made it come up in the wider public knowledge. There is no way I can hope to cover it in a single post (I think it would outpace even the length of my internationalization discussion), and there are definitely parts for which I am underqualified, as I am by no means an expert in cryptography. Instead, I will be discussing this over the course of several posts of which this is but the first; to ease up on the amount of background explanation, I will assume passing familiarity with cryptographic concepts like public keys, hash functions, as well as knowing what SSL and SSH are (though not necessarily how they work). If you don't have that knowledge, ask Wikipedia.

Before discussing how email security works, it is first necessary to ask what email security actually means. Unfortunately, the layman's interpretation is likely going to differ from the actual precise definition. Security is often treated by laymen as a boolean interpretation: something is either secure or insecure. The most prevalent model of security to people is SSL connections: these allow the establishment of a communication channel whose contents are secret to outside observers while also guaranteeing to the client the authenticity of the server. The server often then gets authenticity of the client via a more normal authentication scheme (i.e., the client sends a username and password). Thus there is, at the end, a channel that has both secrecy and authenticity [1]: channels with both of these are considered secure and channels without these are considered insecure [2].

In email, the situation becomes more difficult. Whereas an SSL connection is between a client and a server, the architecture of email is such that email providers must be considered as distinct entities from end users. In addition, messages can be sent from one person to multiple parties. Thus secure email is a more complex undertaking than just porting relevant details of SSL. There are two major cryptographic implementations of secure email [3]: S/MIME and PGP. In terms of implementation, they are basically the same [4], although PGP has an extra mode which wraps general ASCII (known as "ASCII-armor"), which I have been led to believe is less recommended these days. Since I know the S/MIME specifications better, I'll refer specifically to how S/MIME works.

S/MIME defines two main MIME types: multipart/signed, which contains the message text as a subpart followed by data indicating the cryptographic signature, and application/pkcs7-mime, which contains an encrypted MIME part. The important things to note about this delineation are that only the body data is encrypted [5], that it's theoretically possible to encrypt only part of a message's body, and that the signing and encryption constitute different steps. These factors combine to make for a potentially infuriating UI setup.
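
As an illustration of the multipart/signed form (the boundary and algorithm values are placeholders, loosely following the S/MIME message specification; an encrypted message would instead be a single application/pkcs7-mime part whose base64 payload is the encrypted MIME body):

Content-Type: multipart/signed; protocol="application/pkcs7-signature";
    micalg=sha-256; boundary="boundary42"

--boundary42
Content-Type: text/plain

The message text travels here in the clear; a client that knows nothing
about S/MIME can still read it.

--boundary42
Content-Type: application/pkcs7-signature; name="smime.p7s"
Content-Transfer-Encoding: base64

<base64-encoded PKCS#7/CMS signature data>
--boundary42--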

How does S/MIME tackle the challenges of encrypting email? First, rather than encrypting using recipients' public keys, the message is encrypted with a symmetric key. This symmetric key is then encrypted with each of the recipients' keys and then attached to the message. Second, by only signing or encrypting the body of the message, the transit headers are kept intact for the mail system to retain its ability to route, process, and deliver the message. The body is supposed to be prepared in the "safest" form before transit to avoid intermediate routers munging the contents. Finally, to actually ascertain what the recipients' public keys are, clients typically passively pull the information from signed emails. LDAP, unsurprisingly, contains an entry for a user's public key certificate, which could be useful in large enterprise deployments. There is also work ongoing right now to publish keys via DNS and DANE.

I mentioned before that S/MIME's use can present some interesting UI design decisions. I ended up actually testing some common email clients on how they handled S/MIME messages: Thunderbird, Apple Mail, Outlook [6], and Evolution. In my attempts to create a surreptitious signed part to confuse the UI, Outlook decided that the message had no body at all, and Thunderbird decided to ignore all indication of the existence of said part. Apple Mail managed to claim the message was signed in one of these scenarios, and Evolution took the cake by always agreeing that the message was signed [7]. It didn't even bother questioning the signature if the certificate's identity disagreed with the easily-spoofable From address. I was actually surprised by how well the clients did in my tests—I expected far more confusion, particularly since the will to maintain S/MIME has clearly been relatively low, judging by poor support for "new" features such as triple-wrapping or header protection.

Another fault of S/MIME's design is that it rests on the mistaken belief that composing a signing step and an encryption step is equivalent in strength to a simultaneous sign-and-encrypt. Another page describes this in far better detail than I have room to; note that this flaw is fixed via triple-wrapping (which has relatively poor support). This creates yet more UI burden in adequately describing all the various minutiae of the differing security guarantees. Considering that users already have a hard time even understanding that just because a message says it's from example@isp.invalid doesn't actually mean it's from example@isp.invalid, trying to develop UI that both adequately expresses the security issues and is understandable to end-users is an extreme challenge.

What we have in S/MIME (and PGP) is a system that allows for strong guarantees, if certain conditions are met, yet is also vulnerable to breaches of security if the message handling subsystems are poorly designed. Hopefully this is a sufficient guide to the technical impacts of secure email in the email world. My next post will discuss the most critical component of secure email: the trust model. After that, I will discuss why secure email has seen poor uptake and other relevant concerns on the future of email security.

[1] This is a bit of a lie: a channel that does secrecy and authentication at different times isn't as secure as one that does them at the same time.
[2] It is worth noting that authenticity is, in many respects, necessary to achieve secrecy.
[3] This, too, is a bit of a lie. More on this in a subsequent post.
[4] I'm very aware that S/MIME and PGP use radically different trust models. Trust models will be covered later.
[5] S/MIME 3.0 did add a provision stating that if the signed/encrypted part is a message/rfc822 part, the headers of that part should override the outer message's headers. However, I am not aware of a major email client that actually handles these kind of messages gracefully.
[6] Actually, I tested Windows Live Mail instead of Outlook, but given the presence of an official MIME-to-Microsoft's-internal-message-format document which seems to agree with what Windows Live Mail was doing, I figure their output would be identical.
[7] On a more careful examination after the fact, it appears that Evolution may have tried to indicate signedness on a part-by-part basis, but the UI was sufficiently confusing that ordinary users are going to be easily confused.

Categorieën: Mozilla-nl planet

Andrew Sutherland: webpd: a Polymer-based web UI for the beets music library manager

Thunderbird - zo, 06/04/2014 - 18:56

beets webpd filtered artists list

beets is the extensible music database tool every programmer with a music collection has dreamed of writing.  At its simplest it’s a clever tagger that can normalize your music against the MusicBrainz database and then store the results in a searchable SQLite database.  But with plugins it can fetch album art, use the Discogs music database for tagging too, calculate ReplayGain values for all your music, integrate meta-data from The Echo Nest, etc.  It even has a Music Player Daemon server-mode (bpd) and a simple HTML interface (web) that lets you search for tracks and play them in your browser using the HTML5 audio tag.

I’ve tried a lot of music players through the years (alphabetically: amarok, banshee, exaile, quodlibet, rhythmbox).  They are all great music players and (at least!) satisfy the traditional Artist/Album/Track hierarchy use-case, but when you exceed 20,000 tracks and have a lot of compilation CDs, that frequently ends up not being enough. Extending them usually turned out to be too hard / not fun enough, although sometimes it was just a question of time and seeking greener pastures.

But enough context; if you’re reading my blog you probably are on board with the web platform being the greatest platform ever.  The notable bits of the implementation are:

  • Server-wise, it’s a mash-up of beets’ MPD-alike plugin bpd and its web plugin.  Rather than needing to speak the MPD protocol over TCP to get your server to play music, you can just hit it with an HTTP POST and it will enqueue and play the song.  Server-sent events/EventSource are used to let the web UI hypothetically update as things happen on the server.  Right now the client can indeed tell the server to play a song and hear an update via the EventSource channel, but there’s almost certainly a resource leak on the server-side and there’s a lot more web/bpd interlinking required to get it reliable.  (Python’s Flask is neat, but I’m not yet clear on how to properly manage the life-cycle of a long-lived request that only dies when the connection dies since I’m seeing the appcontext get torn down even before the generator starts running.)  A rough sketch of this client/server interaction appears after this list.
  • The client is implemented in Polymer on top of some simple backbone.js collections that build on the existing logic from the beets web plugin.
    • The artist list uses the polymer-virtual-list element which is important if you’re going to be scrolling through a ton of artists.  The implementation is page-based; you tell it how many pages you want and how many items are on each page.  As you scroll it fires events that compel you to generate the appropriate page.  It’s an interesting implementation:
      • Pages are allowed to be variable height and therefore their contents are too, although a fixedHeight mode is also supported.
      • In variable-height mode, scroll offsets are translated to page positions by guessing the page based on the height of the first page and then walking up/down from there based on cached page-sizes until the right page size is found.  If there is missing information because the user managed to trigger a huge jump, extrapolation is performed based on the average item size from the first page.
      • Any changes to the contents of the list regrettably require discarding all existing pages/bindings.  At this time there is no way to indicate a splice at a certain point that should simply result in a displacement of the existing items.
    • Albums are loaded in batches from the server and artists dynamically derived from them.  Although this would allow for the UI to update as things are retrieved, the virtual-list invalidation issue concerned me enough to have the artist-list defer initialization until all albums are loaded.  On my machine a couple thousand albums load pretty quickly, so this isn’t a huge deal.
    • There’s filtering by artist name and number of albums in the database by that artist built on backbone-filtered-collection.  The latter one is important to me because scrolling through hundreds of artists where I might only own one cd or not even one whole cd is annoying.  (Although the latter is somewhat addressed currently by only using the albumartist for the artist generation so various artists compilations don’t clutter things up.)
    • If you click on an artist it plays the first track (numerically) from the first album (alphabetically) associated with the artist.  This does limit the songs you can listen to somewhat…
    • visualizations are done using d3.js; one svg per visualization
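
Here is the rough client-side shape of the play/update interaction mentioned in the first bullet above; the endpoint paths are invented for illustration and are not the plugin's actual routes:

// Hypothetical endpoints -- the real webpd routes may differ.
fetch("/item/123/play", { method: "POST" });    // ask the server to enqueue/play a track

const events = new EventSource("/events");      // server-sent events channel
events.onmessage = (e) => {
  console.log("server state changed:", e.data); // e.g. now-playing updates
};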

beets webpd madonna and morrissey

“What’s with all those tastefully chosen colors?” is what you are probably asking yourself.  The answer?  Two things!

  1. A visualization of albums/releases in the database by time, heat-map style.
    • We bin all of the albums that beets knows about by year.  In this case we assume that 1980 is the first interesting year and so put 1979 and everything before it (including albums without a year) in the very first bin on the left.  The current year is the rightmost bucket.
    • We vertically divide the albums into “albums” (red), “singles” (green), and “compilations” (blue).  This is accomplished by taking the MusicBrainz Release Group / Types and mapping them down to our smaller space.
    • The more albums in a bin, the stronger the color.
  2. A scatter-plot using the echo nest‘s acoustic attributes for the tracks where:
    • the x-axis is “danceability”.  Things to the left are less danceable.  Things to the right are more danceable.
    • the y-axis is “valence” which they define as “the musical positiveness conveyed by a track”.  Things near the top are sadder, things near the bottom are happier.
    • the colors are based on the type of album the track is from.  The idea was that singles tend to have remixes on them, so it’s interesting if we always see a big cluster of green remixes to the right.
    • tracks without the relevant data all end up in the upper-left corner.  There are a lot of these.  The echo nest is extremely generous in allowing non-commercial use of their API, but they limit you to 20 requests per minute and at this point the beets echonest plugin needs to upload (transcoded) versions of all my tracks since my music collection is apparently more esoteric than what the servers already have fingerprints for.

Together these visualizations let us infer:

  • Madonna is more dancey than Morrissey!  Shocking, right?
  • I bought the Morrissey singles box sets. And I got ripped off because there’s a distinct lack of green dots over on the right side.

Code is currently in the webpd branch of my beets fork although I should probably try and split it out into a separate repo.  You need to enable the webpd plugin like you would any other plugin for it to work.  There’s still a lot lot lot more work to be done for it to be usable, but I think it’s neat already.  It definitely works in Firefox and Chrome.

Categorieën: Mozilla-nl planet

Joshua Cranmer: Announcing jsmime 0.2

Thunderbird - za, 05/04/2014 - 19:18
Previously, I've been developing JSMime as a subdirectory within comm-central. However, after discussions with other developers, I have moved the official repository of record for JSMime to its own repository, now found on GitHub. The repository has been cleaned up and the feature set for version 0.2 has been selected, so that the current tip on JSMime (also the initial version) is version 0.2. This contains the feature set I imported into Thunderbird's source code last night, which is to say support for parsing MIME messages into the MIME tree, as well as support for parsing and encoding email address headers.

Thunderbird doesn't actually use the new code quite yet (as my current tree is stuck on a mozilla-central build error, so I haven't had time to run those patches through a last minute sanity check before requesting review), but the intent is to replace the current C++ implementations of nsIMsgHeaderParser and nsIMimeConverter with JSMime instead. Once those are done, I will be moving forward with my structured header plans which more or less ought to make those interfaces obsolete.

Within JSMime itself, the pieces which I will be working on next will be rounding out the implementation of header parsing and encoding support (I have prototypes for Date headers and the infernal RFC 2231 encoding that Content-Disposition needs), as well as support for building MIME messages from their constituent parts (a feature which would be greatly appreciated in the depths of compose and import in Thunderbird). I also want to implement full IDN and EAI support, but that's hampered by the lack of a JS implementation I can use for IDN (yes, there's punycode.js, but that doesn't do StringPrep). The important task of converting the MIME tree to a list of body parts and attachments is something I do want to work on as well, but I've vacillated on the implementation here several times and I'm not sure I've found one I like yet.

JSMime, as its name implies, tries to work in as pure JS as possible, augmented with several web APIs as necessary (such as TextDecoder for charset decoding). I'm using ES6 as the base here, because it gives me several features I consider invaluable for implementing this in JavaScript: Promises, Map, generators, let. This means it can run on an unprivileged web page—I test JSMime using Firefox nightlies and the Firefox debugger where necessary. Unfortunately, it only really works in Firefox at the moment because V8 doesn't support many ES6 features yet (such as destructuring, which is annoying but simple enough to work around, or Map iteration, which is completely necessary for the code). I'm not opposed to changing it to make it work on Node.js or Chrome, but I don't realistically have the time to spend doing it myself; if someone else has the time, please feel free to contact me or send patches.

Categorieën: Mozilla-nl planet

Joshua Cranmer: If you want fast code, don't use assembly

Thunderbird - do, 03/04/2014 - 18:52
…unless you're an expert at assembly, that is. The title of this post was obviously meant to be an attention-grabber, but it is much truer than you might think: poorly-written assembly code will probably be slower than an optimizing compiler on well-written code (note that you may need to help the compiler along for things like vectorization). Now why is this?

Modern microarchitectures are incredibly complex. A modern x86 processor will be superscalar, translating instructions into internal micro-operations to get there. Desktop processors will undoubtedly issue multiple instructions per cycle and have forms of register renaming, branch predictors, etc. Minor changes—a misaligned instruction stream, a poor order of instructions, a bad instruction choice—could kill the ability to take advantage of these features. There are very few people who could accurately predict the performance of a given assembly stream (I myself wouldn't attempt it if the architecture can take advantage of ILP), and these people are disproportionately likely to be working on compiler optimizations. So unless you're knowledgeable enough about assembly to help work on a compiler, you probably shouldn't be hand-coding assembly to make code faster.

To give an example to elucidate this point (and the motivation for this blog post in the first place), I was given a link to an implementation of the N-queens problem in assembly. For various reasons, I decided to use this to start building a fine-grained performance measurement system. This system uses a high-resolution monotonic clock on Linux and runs the function 1000 times to warm up caches and counters and then runs the function 1000 more times, measuring each run independently and reporting the average runtime at the end. This is a single execution of the system; 20 executions of the system were used as the baseline for a t-test to determine statistical significance as well as visual estimation of normality of data. Since the runs observed about a constant 1-2 μs of noise, I ran all of my numbers on the 10-queens problem to better separate the data (total runtimes ended up being in the range of 200-300μs at this level). When I say that some versions are faster, the p-values for individual measurements are on the order of 10^-20—meaning that there is a 1-in-100,000,000,000,000,000,000 chance that the observed speedups could be produced if the programs take the same amount of time to run.

The initial assembly version of the program took about 288μs to run. The first C++ version I coded, originating from the same genesis algorithm that the author of the assembly version used, ran in 275μs. A recursive program beat out a hand-written assembly block of code... and when I manually converted the recursive program into a single loop, the runtime improved to 249μs. It wasn't until I got rid of all of the assembly in the original code that I could get the program to beat the derecursified code (at 244μs)—so it's not the vectorization that's causing the code to be slow. Intrigued, I started to analyze why the original assembly was so slow.

It turns out that there are three main things that I think cause the slow speed of the original code. The first one is alignment of branches: the assembly code contains no instructions to align basic blocks on particular branches, whereas gcc happily emits these for some basic blocks. I mention this first as it is mere conjecture; I never made an attempt to measure the effects for myself. The other two causes are directly measured from observing runtime changes as I slowly replaced the assembly with code. When I replaced the use of push and pop instructions with a global static array, the runtime improved dramatically. This suggests that the alignment of the stack could be to blame (although the stack is still 8-byte aligned when I checked via gdb), which just goes to show you how much alignments really do matter in code.

The final, and by far most dramatic, effect I saw involves the use of three assembly instructions: bsf (find the index of the lowest bit that is set), btc (clear a specific bit index), and shl (left shift). When I replaced the use of these instructions with a more complicated expression int bit = x & -x and x = x - bit, the program's speed improved dramatically. And the rationale for why the speed improved won't be found in latency tables, although those will tell you that bsf is not a 1-cycle operation. Rather, it's in minutiae that's not immediately obvious.
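
For illustration, here is the replacement idiom in isolation (sketched in Python; the actual program is C++): x & -x isolates the lowest set bit, and subtracting that bit clears it, so a loop can walk every set bit without bsf or btc.

    def set_bit_positions(x):
        positions = []
        while x:
            bit = x & -x                      # lowest set bit, e.g. 0b101100 -> 0b000100
            positions.append(bit.bit_length() - 1)
            x -= bit                          # clear that bit
        return positions

    assert set_bit_positions(0b101100) == [2, 3, 5]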

The original program used the fact that bsf sets the zero flag if the input register is 0 as the condition to do the backtracking; the converted code just checked if the value was 0 (using a simple test instruction). The compare and the jump instructions are basically converted into a single instruction in the processor (macro-op fusion). In contrast, the bsf does not get to do this; combined with the higher latency of the instruction itself, it means that empty loops take a lot longer to do nothing. The use of an 8-bit shift value is also interesting, as there is a rather severe penalty for using 8-bit registers in Intel processors as far as I can see.

Now, this isn't to say that the compiler will always produce the best code by itself. My final code wasn't above using x86 intrinsics for the vector instructions. Replacing the _mm_andnot_si128 intrinsic with an actual and-not on vectors caused gcc to use other, slower instructions instead of the vmovq to move the result out of the SSE registers for reasons I don't particularly want to track down. The use of the _mm_blend_epi16 and _mm_srli_si128 intrinsics can probably be replaced with __builtin_shuffle instead for more portability, but I was under the misapprehension that this was a clang-only intrinsic when I first played with the code so I never bothered to try that, and this code has passed out of my memory long enough that I don't want to try to mess with it now.

In short, compilers know things about optimizing for modern architectures that many general programmers don't. Compilers may have issues with autovectorization, but the existence of vector intrinsics allows you to force compilers to use vectorization while still giving them leeway to make decisions about instruction scheduling or code alignment which are easy to screw up in hand-written assembly. Also, compilers are liable to get better in the future, whereas hand-written assembly code is unlikely to get faster. So only write assembly code if you really know what you're doing and you know you're better than the compiler.

Categorieën: Mozilla-nl planet

Andrew Sutherland: monitoring gaia travis build status using webmail LED notifiers

Thunderbird - do, 03/04/2014 - 15:58

usb LED webmail notifiers showing build status

For Firefox OS the Gaia UI currently uses Travis CI to run a series of test jobs in parallel for each pull request.  While Travis has a neat ember.js-based live-updating web UI, I usually find myself either staring at my build watching it go nowhere or forgetting about it entirely.  The latter is usually what ends up happening since we have a finite number of builders available, we have tons of developers, each build takes 5 jobs, and some of those jobs can take up to 35 minutes to run when they finally get a turn to run.

I recently noticed ThinkGeek had a bunch of Dream Cheeky USB LED notifiers on sale.  They’re each a USB-controlled tri-color LED in a plastic case that acts as a nice diffuser.  Linux’s “usbled” driver exposes separate red/green/blue files via sysfs that you can echo numbers into to control them.  While the driver and USB protocol inherently support a range of 0-255, it seems like 0-63 or 0-64 is all they give.  The color gamut isn’t amazing but is quite respectable and they are bright enough that they are useful in daylight.  I made a node.js library at https://github.com/asutherland/gaudy-leds that can do some basic tricks and is available on npm as “gaudy-leds”.  You can tell it to do things by doing “gaudy-leds set red green blue purple”, etc.  I added a bunch of commander sub-commands, so “gaudy-leds –help” should give a lot more details than the currently spartan readme.
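
For illustration only (the real driver code here is the node.js gaudy-leds library), poking the usbled sysfs files directly looks roughly like this in Python; the device path is hypothetical and varies per machine:

    from pathlib import Path

    DEVICE = Path("/sys/bus/usb/drivers/usbled/2-1.4:1.0")  # hypothetical path

    def set_color(red, green, blue):
        # The driver accepts small integers; clamp to the 0-63 range the
        # hardware appears to honor.
        for name, value in (("red", red), ("green", green), ("blue", blue)):
            (DEVICE / name).write_text(str(max(0, min(63, value))))

    set_color(63, 16, 0)  # something amber-ish for "build running"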

I couldn’t find any existing tools/libraries to easily watch a Travis CI build and invoke commands like that (though I feel like they must exist) so I wrote https://github.com/asutherland/travis-build-watcher.  While the eventual goal is to not have to manually activate it at all, right now I can point it at a Travis build or a github pull request and it will poll appropriately so it ends up at the latest build and updates the state of the LEDs each time it polls.

Relevant notes / context:

  • There is a storied history of people hooking build/tree status up to LED lights and real traffic lights and stuff like that.  I think if you use Jenkins you’re particularly in luck.  This isn’t anything particularly new or novel, but the webmail notifiers are a great off-the-shelf solution.  The last time I did something like this I used a phidgets LED64 in a rice paper lamp and the soldering was much more annoying than dealing with a mess of USB cables.  Also, it could really only display one status at a time.
  • There are obviously USB port scalability issues, but you can get a 24-port USB hub for ~$40 from Amazon/monoprice/etc.  (They all seem to be made by the same manufacturer.)  I coincidentally bought 24 of the notifiers after my initial success with 6, so I am really prepared for an explosion in test jobs!
  • While I’m currently trying to keep things UNIXy with a bunch of small command-line tools operating together, I think I would like to have some kind of simple message-bus mechanism so that:
    • mozilla-central mach builds can report status as they go
    • webhooks / other async mechanisms can be used to improve efficiency and require less manual triggering/interaction on my end.  So if I re-spin a build from the web UI I won’t need to re-trigger the script locally and such.  Please let me know if you’re aware of existing solutions in this space, I didn’t find much and am planning to just use redis as glue for a bunch of small/independent pieces plus a more daemonish node process for polling / interacting with the web/AMQP.
  • There are efforts underway to overhaul the continuous integration mechanism used for Gaia.  This should address delays in starting tests by being able to throw more resources at them as well as allow notification by whatever Mozilla Pulse’s successor is.
Categorieën: Mozilla-nl planet

Joshua Cranmer: Understanding email charsets

Thunderbird - vr, 14/03/2014 - 05:17
Several years ago, I embarked on a project to collect the headers of all the messages I could reach on NNTP, with the original intent of studying the progression of the most common news clients. More recently, I used this dataset to attempt to discover the prevalence of charsets in email messages. In doing so, I identified a critical problem with the dataset: since it only contains headers, there is very little scope for actually understanding the full, sad story of charsets. So I've decided to rectify this problem.

This time, I modified my data-collection scripts to make it much easier to mass-download NNTP messages. The first script effectively lists all the newsgroups, and then all the message IDs in those newsgroups, stuffing the results in a set to remove duplicates (cross-posts). The second script uses Python's nntplib package to attempt to download all of those messages. Of the 32,598,261 messages identified by the first script, I succeeded in obtaining 1,025,586 messages in full or in part. Some messages failed to download because they crashed nntplib (which appears to be unable to handle messages of unbounded length), and I suspect my newsserver connections may have just timed out in the middle of the download at times. Others failed due to expiring before I could download them. All in all, 19,288 messages were not downloaded.
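
A condensed sketch of that two-phase collection (not the actual scripts; the server name below is a placeholder):

    import nntplib

    server = nntplib.NNTP("news.example.invalid")

    # Phase 1: list every group, then every message ID, deduplicating cross-posts.
    message_ids = set()
    _, groups = server.list()
    for g in groups:
        try:
            _, _, first, last, _ = server.group(g.group)
            _, overviews = server.over((int(first), int(last)))
        except nntplib.NNTPError:
            continue
        for _, fields in overviews:
            message_ids.add(fields["message-id"])

    # Phase 2: try to download each message in full.
    downloaded = 0
    for msgid in message_ids:
        try:
            server.article(msgid)
            downloaded += 1
        except (nntplib.NNTPError, EOFError):
            pass  # expired, oversized, or a dropped connection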

Analysis of the contents of messages was hampered by a strong desire to find techniques that could mangle messages as little as possible. Prior experience with Python's message-parsing libraries leads me to believe that they are rather poor at handling some of the crap that comes into existence, and the errors in nntplib suggest they haven't fixed them yet. The only message parsing framework I truly trust to give me the level of finesse I need is the JSMime that I'm writing, but that happens to be in the wrong language for this project. After reading some blog posts of Jeffrey Stedfast, though, I decided I would give GMime a try instead of trying to rewrite ad-hoc MIME parser #N.

Ultimately, I wrote a program to investigate the following questions on how messages operate in practice:

  • What charsets are used in practice? How are these charsets named?
  • For undeclared charsets, what are the correct charsets?
  • For charsets unknown to a decoder, how often would ASCII suffice?
  • What charsets are used in RFC 2047 encoded words?
  • How prevalent are malformed RFC 2047 encoded words?
  • When HTML and MIME are mixed, who wins?
  • What is the state of 8-bit headers?

While those were the questions I originally sought answers to, I did come up with others as I worked on my tool, some prompted by information I was already collecting anyway. The tool I wrote primarily uses GMime to convert the body parts to 8-bit text (no charset conversion), as well as parse the Content-Type headers, which are really annoying to handle without writing a full parser. I used ICU to handle charset conversion and detection. RFC 2047 decoding is done largely by hand since I needed very specific information that I couldn't convince GMime to give me. All code that I used is available upon request; the exact dataset is harder to transport, given that it is some 5.6GiB of data.

Other than GMime being built on GObject and exposing a C API, I can't complain much, although I didn't try to use it to do magic. Then again, in my experience (and as this post will probably convince you as well), you really want your MIME library to do charset magic for you, so in doing well for my needs, it's actually not doing well for a larger audience. ICU's C API similarly makes me want to complain. However, I'm now very suspicious of the quality of its charset detection code, which is the main reason I used it. Trying to figure out how to get it to handle the charset decoding errors also proved far more annoying than it really should have been.

Some final background regards the biases I expect to crop up in the dataset. As the approximately 1 million messages were drawn from the python set iterator, I suspect that there's no systematic bias towards or away from specific groups, excepting that the ~11K messages found in the eternal-september.* hierarchy are completely represented. The newsserver I used, Eternal September, has a respectably large set of newsgroups, although it is likely to be biased towards European languages and under-representing East Asian ones. The less well-connected South America, Africa, or central Asia are going to be almost completely unrepresented. The download process will be biased away from particularly heinous messages (such as those with exceedingly long lines), since nntplib itself fails on them.

This being news messages, I also expect that use of 8-bit will be far more common than would be the case in regular mail messages. On a related note, the use of 8-bit in headers would be commensurately elevated compared to normal email. What would be far less common is HTML. I also expect that undeclared charsets may be slightly higher.

Charsets

Charset data is mostly collected on the basis of individual body parts within messages; some messages have more than one. Interestingly enough, the 1,025,587 messages yielded 1,016,765 body parts with some text data, which indicates that for some messages either the copy on the server had only headers in the first place or the download process only managed to grab the headers. There were also 393 messages that I identified as having parts with different charsets, which only further illustrates how annoying charsets are in messages.
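
As a small illustration of the per-part bookkeeping (the real tool uses GMime from C++; this Python sketch uses the standard email package instead):

    from email import message_from_bytes

    def declared_charsets(raw_message):
        """Return the declared charset (or 'undeclared') of every text part."""
        charsets = []
        for part in message_from_bytes(raw_message).walk():
            if part.get_content_maintype() == "text":
                charsets.append(part.get_content_charset() or "undeclared")
        return charsets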

The aliases in charsets are mostly uninteresting in variance, except for the various labels used for US-ASCII (us - ascii, 646, and ANSI_X3.4-1968 are the less-well-known aliases), as well as the list of charsets whose names ICU was incapable of recognizing, given below. Unknown charsets are treated as equivalent to undeclared charsets in further processing, as there were too few to merit separate handling (45 in all).

  • x-windows-949
  • isolatin
  • ISO-IR-111
  • Default
  • ISO-8859-1 format=flowed
  • X-UNKNOWN
  • x-user-defined
  • windows-874
  • 3D"us-ascii"
  • x-koi8-ru
  • windows-1252 (fuer gute Newsreader)
  • LATIN-1#2 iso-8859-1

For the next step, I used ICU to attempt to detect the actual charset of the body parts. ICU's charset detector doesn't support the full gamut of charsets, though, so charset names not claimed to be detected were instead processed by checking if they decoded without error. Before using this detection, I detect if the text is pure ASCII (excluding control characters, to enable charsets like ISO-2022-JP, and +, if the charset we're trying to check is UTF-7). ICU has a mode which ignores all text in things that look like HTML tags, and this mode is set for all HTML body parts.
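
A rough Python rendition of that pure-ASCII pre-check (the real tool is C++ with ICU, and the exact treatment of whitespace is an assumption here): ordinary whitespace is allowed, other control bytes such as the ESC used by ISO-2022-JP shift sequences disqualify the text, and + disqualifies it when the declared charset is UTF-7.

    def is_plain_ascii(data, declared_charset=""):
        allowed_controls = {0x09, 0x0a, 0x0d}  # TAB, LF, CR
        for byte in data:
            if byte >= 0x80:
                return False                   # 8-bit data
            if byte < 0x20 and byte not in allowed_controls:
                return False                   # e.g. ESC, so ISO-2022-JP is not "just ASCII"
            if byte == ord("+") and declared_charset.lower() == "utf-7":
                return False                   # '+' starts a UTF-7 shift sequence
        return True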

I don't quite believe ICU's charset detection results, so I've collapsed the results into a simpler table to capture the most salient feature. The correct column indicates the cases where the detected result was the declared charset. The ASCII column captures the fraction which were pure ASCII. The UTF-8 column indicates if ICU reported that the text was UTF-8 (it always seems to try this first). The Wrong C1 column refers to an ISO-8859-1 text being detected as windows-1252 or vice versa, which is set by ICU if it sees or doesn't see an octet in the appropriate range. The other column refers to all other cases, including invalid cases for charsets not supported by ICU.

Declared  (Correct, ASCII, UTF-8, Wrong C1, Other, Total)
ISO-8859-1  230,526225,6678838,1191,035466,230
Undeclared  148,0541,11637,626186,796
UTF-8  75,67437,6001,551114,825
US-ASCII  98,238030498,542
ISO-8859-15  67,52918,527086,056
windows-1252  21,4144,3701543,31913029,387
ISO-8859-2  18,6472,13870712,31923,245
KOI8-R  4,61642421,1126,154
GB2312  1,3075901121,478
Big5  62260801741,404
windows-1256  34310045398
IBM437  842570341
ISO-8859-13  31160317
windows-1251  13197161290
windows-1250  6969014101253
ISO-8859-7  262600131183
ISO-8859-9  127110017155
ISO-2022-JP  766903148
macintosh  67570124
ISO-8859-16  015101116
UTF-7  514055
x-mac-croatian  0132538
KOI8-U  282030
windows-1255  01800624
ISO-8859-4  230023
EUC-KR  0301619
ISO-8859-14  144018
GB18030  1430017
ISO-8859-8  00001616
TIS-620  150015
Shift_JIS  840113
ISO-8859-3  91111
ISO-8859-10  100010
KSC_5601  3609
GBK  4206
windows-1253  030025
ISO-8859-5  10034
IBM850  0404
windows-1257  0303
ISO-2022-JP-2  2002
ISO-8859-6  01001
Total  421,751536,3732,22611,52344,8921,016,765

The most obvious thing shown in this table is that the most common charsets remain ISO-8859-1, Windows-1252, US-ASCII, UTF-8, and ISO-8859-15, which is to be expected, given the expected prior bias towards European languages in newsgroups. The low prevalence of ISO-2022-JP is surprising to me: it means a lower incidence of Japanese than I would have expected. Either that, or Japanese users have switched to UTF-8 en masse, which I consider very unlikely given that Japanese users have tended to resist the trend towards UTF-8 the most.

Beyond that, this dataset has caused me to lose trust in the ICU charset detectors. KOI8-R is recorded as being 18% malformed text, with most of that being text that ICU believes to be ISO-8859-1 instead. Judging from the results, it appears that ICU has a bias towards guessing ISO-8859-1, which means I don't believe the numbers in the Other column to be accurate at all. For some reason, I don't appear to have decoders for ISO-8859-16 or x-mac-croatian on my local machine, but running some tests by hand appears to indicate that those declarations are valid and not mislabeled.

Somewhere between 0.1% and 1.0% of all messages are subject to mojibake, depending on how much you trust the charset detector. The cases of UTF-8 being misdetected as non-UTF-8 could potentially be explained by having very few non-ASCII sequences (ICU requires four valid sequences before it confidently declares text UTF-8); someone who writes a post in English but has a non-ASCII signature (such as myself) could easily fall into this category. Despite this, however, it does suggest that there is enough mojibake around that users need to be able to override charset decisions.

The undeclared charsets are described, in descending order of popularity, by ISO-8859-1, Windows-1252, KOI8-R, ISO-8859-2, and UTF-8, describing 99% of all non-ASCII undeclared data. ISO-8859-1 and Windows-1252 are probably over-counted here, but the interesting tidbit is that KOI8-R is used half as much undeclared as it is declared, and I suspect it may be undercounted. The practice of using locale-default fallbacks that Thunderbird has been using appears to be the best way forward for now, although UTF-8 is growing enough in popularity that using a specialized detector that decodes as UTF-8 if possible may be worth investigating (3% of all non-ASCII, undeclared messages are UTF-8).

HTML

Unsurprisingly (considering I'm polling newsgroups), very few messages contained any HTML parts at all: there were only 1,032 parts in the total sample size, of which only 552 had non-ASCII characters and were therefore useful for the rest of this analysis. This means that I'm skeptical of generalizing the results of this to email in general, but I'll still summarize the findings.

HTML, unlike plain text, contains a mechanism to explicitly identify the charset of a message. The official algorithm for determining the charset of an HTML file can be described simply as "look for a <meta> tag in the first 1024 bytes. If it can be found, attempt to extract a charset using one of several different techniques depending on what's present or not." Since doing this fully properly is complicated in library-less C++ code, I opted to look first for a <meta[ \t\r\n\f] production, guess the extent of the tag, and try to find a charset= string somewhere in that tag. This appears to be an approach which is more reflective of how this parsing is actually done in email clients than the proper HTML algorithm. One difference is that my regular expressions also support the newer <meta charset="UTF-8"/> construct, although I don't appear to see any use of this.
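
A rough Python sketch of that simplified sniff (the real code is library-less C++; this approximates the approach described, not the full HTML algorithm):

    import re

    def sniff_html_charset(body):
        """Look for charset=... inside a <meta ...> tag in the first 1024 bytes."""
        head = body[:1024].decode("ascii", errors="replace")
        for tag in re.finditer(r"<meta[ \t\r\n\f][^>]*>", head, re.IGNORECASE):
            match = re.search(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)",
                              tag.group(0), re.IGNORECASE)
            if match:
                return match.group(1)
        return None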

I found only 332 parts where the HTML declared a charset. Only 22 parts had both a MIME charset and an HTML charset where the two disagreed with each other. I neglected to count how many messages had HTML charsets but no MIME charsets, but random sampling appeared to indicate that this is very rare on the data set (the same order of magnitude or less as those where they disagreed).

As for the question of who wins: of the 552 non-ASCII HTML parts, in only 71 messages was the MIME-declared charset not the valid one. Then again, in 71 messages the HTML-declared charset was not valid either, which strongly suggests that ICU was detecting the incorrect charset. Judging from manual inspection of such messages, it appears that the MIME charset ought to be preferred if it exists. There are also a large number of HTML charset specifications saying unicode, which ICU treats as UTF-16, which is most certainly wrong.

Headers

In the data set, 1,025,856 header blocks were processed for the following statistics. This is slightly more than the number of messages since the headers of contained message/rfc822 parts were also processed. The good news is that 97% (996,103) of the header blocks were completely ASCII. Of the remaining 29,753 headers, 3.6% (1,058) were UTF-8 and 43.6% (12,965) matched the declared charset of the first body part. This leaves 52.9% (15,730) that did not match that charset, however.

Now, NNTP messages can generally be expected to have a higher 8-bit header ratio, so this probably overstates what you would see in most email messages. That said, the high incidence is definitely an indicator that even non-EAI-aware clients and servers cannot blindly presume that headers are 7-bit, nor can EAI-aware clients and servers presume that 8-bit headers are UTF-8. The high incidence of mismatching the declared charset suggests that fallback-charset decoding of headers is a necessary step.

RFC 2047 encoded-words are also an interesting statistic to mine. I found 135,951 encoded-words in the data set, which is rather low, considering that messages can be reasonably expected to carry more than one encoded-word. This is likely an artifact of NNTP's tendency towards 8-bit instead of 7-bit communication and understates their presence in regular email.

Counting encoded-words can be difficult, since there is a mechanism to let them continue in multiple pieces. For the purposes of this count, a sequence of such words count as a single word, and I indicate the number of them that had more than one element in a sequence in the Continued column. The 2047 Violation column counts the number of sequences where decoding words individually does not yield the same result as decoding them as a whole, in violation of RFC 2047. The Only ASCII column counts those words containing nothing but ASCII symbols and where the encoding was thus (mostly) pointless. The Invalid column counts the number of sequences that had a decoder error.

Charset  (Count, Continued, 2047 Violation, Only ASCII, Invalid)
ISO-8859-1  56,35515,6104990
UTF-8  36,56314,2163,3112,7049,765
ISO-8859-15  20,6995,695400
ISO-8859-2  11,2472,66990
windows-1252  5,1743,075260
KOI8-R  3,5231,203120
windows-1256  76556800
Big5  51146280171
ISO-8859-7  1652603
windows-1251  1573020
GB2312  126356051
ISO-2022-JP  10285049
ISO-8859-13  784500
ISO-8859-9  762100
ISO-8859-4  71200
windows-1250  682100
ISO-8859-5  662000
US-ASCII  3810380
TIS-620  363400
KOI8-U  251100
ISO-8859-16  221022
UTF-7  172183
EUC-KR  174409
x-mac-croatian  103010
Shift_JIS  80003
Unknown  7207
ISO-2022-KR  70000
GB18030  61001
windows-1255  4000
ISO-8859-14  3000
ISO-8859-3  2100
GBK  20002
ISO-8859-6  1100
Total  135,95143,3603,3613,33810,096

This table somewhat mirrors the distribution of regular charsets, with one major class of differences: charsets that represent non-Latin scripts (particularly Asian scripts) appear to be over-represented compared to their corresponding use in body parts. The exception to this rule is GB2312, which is far lower than relative rankings would presume—I attribute this to people using GB2312 being more likely to use 8-bit headers instead of RFC 2047 encoding, although I don't have direct evidence.

Clearly continuations are common, which is to be relatively expected. The sad part is how few people bother to try to adhere to the specification here: out of 14,312 continuations in charsets that could violate the specification, 23.5% of them violated the specification. The mode-shifting versions (ISO-2022-JP and EUC-KR) are violated essentially every time, which suggests that no one bothered to check if their encoder "returns to ASCII" at the end of the word (I know Thunderbird's does, but the other ones I checked don't appear to).
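
To make the rule concrete, here is a hedged sketch of the violation check (not the actual C++ tool): one continued sequence is decoded word-by-word and as a single concatenated payload, and any mismatch counts as a violation.

    import base64
    import quopri

    def decode_payload(encoding, payload):
        if encoding.upper() == "B":
            return base64.b64decode(payload)
        return quopri.decodestring(payload.encode("ascii"), header=True)  # Q encoding

    def violates_rfc2047(charset, words):
        """words: ordered (encoding, payload) pieces of one continued sequence,
        assumed here to share a single charset."""
        separately = "".join(decode_payload(enc, pay).decode(charset, "replace")
                             for enc, pay in words)
        together = b"".join(decode_payload(enc, pay)
                            for enc, pay in words).decode(charset, "replace")
        return separately != together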

The number of invalid UTF-8 decoded words, 26.7%, seems impossibly high to me. A brief check of my code indicates that this is working incorrectly in the face of invalid continuations, which certainly exaggerates the effect but still leaves a value too high for my tastes. Of more note are the elevated counts for the East Asian charsets: Big5, GB2312, and ISO-2022-JP. I am not an expert in charsets, but I believe that Big5 and GB2312 in particular are each really a family of almost-but-not-quite-identical charsets and it may be that ICU is choosing the wrong candidate of each family for these instances.

There is a surprisingly large number of encoded words that encode only ASCII. When searching specifically for the ones that use the US-ASCII charset, I found that these can be divided into three categories. One set comes from a few people who apparently have an unsanitized whitespace (space and LF were the two I recall seeing) in the display name, producing encoded words like =?us-ascii?Q?=09Edward_Rosten?=. Blame 40tude Dialog here. Another set encodes some basic characters (most commonly = and ?, although a few other interpreted characters popped up). The final set of errors were double-encoded words, such as =?us-ascii?Q?=3D=3FUTF-8=3FQ=3Ff=3DC3=3DBCr=3F=3D?=, which appear to be all generated by an Emacs-based newsreader.

One interesting thing when sifting the results is finding the crap that people produce in their tools. By far the worst single instance of an RFC 2047 encoded-word that I found is this one: Subject: Re: [Kitchen Nightmares] Meow! Gordon Ramsay Is =?ISO-8859-1?B?UEgR lqZ VuIEhlYWQgVH rbGeOIFNob BJc RP2JzZXNzZW?= With My =?ISO-8859-1?B?SHVzYmFuZ JzX0JhbGxzL JfU2F5c19BbXiScw==?= Baking Company Owner (complete with embedded spaces), discovered by crashing my ad-hoc base64 decoder (due to the spaces). The interesting thing is that even after investigating the output encoding, it doesn't look like the text is actually correct ISO-8859-1... or any obvious charset for that matter.

I looked at the unknown charsets by hand. Most of them were actually empty charsets (looked like =??B?Sy4gSC4gdm9uIFLDvGRlbg==?=), and all but one of the outright empty ones were generated by KNode and really UTF-8. The other one was a Windows-1252 generated by a minor newsreader.

Another important aspect of headers is how to handle 8-bit headers. RFC 5322 blindly hopes that headers are pure ASCII, while RFC 6532 dictates that they are UTF-8. Indeed, 97% of headers are ASCII, leaving just 29,753 headers that are not. Of these, only 1,058 (3.6%) are UTF-8 per RFC 6532. Deducing which charset they are is difficult because the large amount of English text for header names and the important control values will greatly skew any charset detector, and there is too little text to give a charset detector confidence. The only metric I could easily apply was testing Thunderbird's heuristic as "the header blocks are the same charset as the message contents"—which only worked 45.2% of the time.

Encodings

While developing an earlier version of my scanning program, I was intrigued to know how often various content transfer encodings were used. I found 1,028,971 parts in all (1,027,474 of which are text parts). The transfer encoding of binary did manage to sneak in, with 57 such parts. Using 8-bit text was very popular, at 381,223 samples, second only to 7-bit at 496,114 samples. Quoted-printable had 144,932 samples and base64 only 6,640 samples. Extremely interesting are the presence of 4 illegal transfer encodings in 5 messages, two of them obvious typos and the others appearing to be a client mangling header continuations into the transfer-encoding.
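
The tally itself is straightforward; a sketch of the idea (again using Python's email package rather than the original tool):

    from collections import Counter
    from email import message_from_bytes

    def tally_transfer_encodings(raw_messages):
        counts = Counter()
        for raw in raw_messages:
            for part in message_from_bytes(raw).walk():
                cte = part.get("Content-Transfer-Encoding", "7bit")
                counts[cte.strip().lower()] += 1
        return counts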

Conclusions

So, drawing from the body of this data, I would like to make the following conclusions as to using charsets in mail messages:

  1. Have a fallback charset. Undeclared charsets are extremely common, and I'm skeptical that charset detectors are going to get this stuff right, particularly since email can more naturally combine multiple languages than other bodies of text (think signatures). Thunderbird currently uses a locale-dependent fallback charset, which roughly mirrors what Firefox and I think most web browsers do.
  2. Let users override charsets when reading. On a similar token, mojibake text, while not particularly common, is common enough to make declared charsets sometimes unreliable. It's also possible that the fallback charset is wrong, so users may need to override the chosen charset.
  3. Testing is mandatory. In this set of messages, I found base64 encoded words with spaces in them, encoded words without charsets (even UNKNOWN-8BIT), and clearly invalid Content-Transfer-Encodings. Real email messages that are flagrantly in violation of basic spec requirements exist, so you should make sure that your email parser and client can handle the weirdest edge cases.
  4. Non-UTF-8, non-ASCII headers exist. EAI notwithstanding, 8-bit headers are a reality. Combined with a predilection for saying ASCII when text is really ASCII, this means that there is often no good in-band information to tell you what charset is correct for headers, so you have to go back to a fallback charset.
  5. US-ASCII really means ASCII. Email clients appear to do a very good job of only emitting US-ASCII as a charset label if it's US-ASCII. The sample size is too small for me to grasp what charset 8-bit characters should imply in US-ASCII.
  6. Know your decoders. ISO-8859-1 actually means Windows-1252 in practice. Big5 and GB2312 are actually small families of charsets with slightly different meanings. ICU notably disagrees with some of these realities, so be sure to include in your tests various charset edge cases so you know that the decoders are correct.
  7. UTF-7 is still relevant. Of the charsets I found not mentioned in the WHATWG encoding spec, IBM437 and x-mac-croatian are in use only due to specific circumstances that limit their generalizable presence. IBM850 is too rare. UTF-7 is common enough that you need to actually worry about it, as abominable and evil a charset as it is.
  8. HTML charsets may matter—but MIME matters more. I don't have enough data to say if charsets declared in HTML are needed to do proper decoding. I do have enough to say fairly conclusively that the MIME charset declaration is authoritative if HTML disagrees.
  9. Charsets are not languages. The entire reason x-mac-croatian is used at all can be traced to Thunderbird displaying the charset as "Croatian," despite it being pretty clearly not a preferred charset. Similarly most charsets are often enough ASCII that, say, an instance of GB2312 is a poor indicator of whether or not the message is in English. Anyone trying to filter based on charsets is doing a really, really stupid thing.
  10. RFCs reflect an ideal world, not reality. This is most notable in RFC 2047: the specification may state that encoded words are supposed to be independently decodable, but the evidence is pretty clear that more clients break this rule than uphold it.
  11. Limit the charsets you support. Just because your library lets you emit a hundred charsets doesn't mean that you should let someone try to do it. You should emit US-ASCII or UTF-8 unless you have a really compelling reason not to, and those compelling reasons don't require obscure charsets. Some particularly annoying charsets should never be written: EBCDIC is already basically dead on the web, and I'd like to see UTF-7 die as well.

When I have time, I'm planning on taking some of the more egregious or interesting messages in my dataset and packaging them into a database of emails to help create testsuites on handling messages properly.

Categorieën: Mozilla-nl planet

Joshua Cranmer: Why email is hard, part 5: mail headers

Thunderbird - za, 01/02/2014 - 04:57
This post is part 5 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. Part 4 discusses email addresses. This post discusses the more general problem of email headers.

Back in my first post, Ludovic kindly posted, in a comment, a link to a talk of someone else's email rant. And the best place to start this post is with a quote from that talk: "If you want to see an email programmer's face turn red, ask him about CFWS." CFWS is an acronym that stands for "comments and folded whitespace," and I can attest that the mere mention of CFWS is enough for me to start ranting. Comments in email headers are spans of text wrapped in parentheses, and the folding of whitespace refers to the ability to continue headers on multiple lines by inserting a newline before (but not in lieu of) a space.

I'll start by pointing out that there is little advantage to adding in free-form data to headers which are not going to be manually read in the vast majority of cases. In practice, I have seen comments used for only three headers on a reliable basis. One of these is the Date header, where a human-readable name of the timezone is sometimes included. The other two are the Received and Authentication-Results headers, where some debugging aids are thrown in. There would be no great loss in omitting any of this information; if information is really important, appending an X- header with that information is still a viable option (that's where most spam filtration notes get added, for example).

For a feature of questionable utility in the first place, the impact it has on parsing message headers is enormous. RFC 822 is specified in a manner that is familiar to anyone who reads language specifications: there is a low-level lexical scanning phase which feeds tokens into a secondary parsing phase. As in programming languages, comments and white space are semantically meaningless [1]. Unlike in programming languages, however, comments can be nested—and therefore lexing an email header is not regular [2]. The problems of folding (a necessary evil thanks to the line length limit I keep complaining about) pale in comparison to comments, but it's extra complexity that makes machine-readability more difficult.

Fortunately, RFC 2822 made a drastic change to the specification that greatly limited where CFWS could be inserted into headers. For example, in the Date header, comments are allowed only following the timezone offset (and whitespace in a few specific places); in addressing headers, CFWS is not allowed within the email address itself [3]. One unanticipated downside is that it makes reading the other RFCs that specify mail headers more difficult: any version that predates RFC 2822 uses the syntax assumptions of RFC 822 (in particular, CFWS may occur between any listed tokens), whereas RFC 2822 and its descendants all explicitly enumerate where CFWS may occur.

Beyond the issues with CFWS, though, syntax is still problematic. The separation of distinct lexing and parsing phases means that you almost see what may be a hint of uniformity which turns out to be an ephemeral illusion. For example, the header parameters defined in RFC 2045 for Content-Type and Content-Disposition set a tradition of ;-separated param=value attributes, which has been picked up by, say, the DKIM-Signature or Authentication-Results headers. Except a close look indicates that Authentication-Results allows two param=value pairs between semicolons. Another side effect was pointed out in my second post: you can't turn a generic 8-bit header into a 7-bit compatible header, since you can't tell without knowing the syntax of the header which parts can be specified as 2047 encoded-words and which ones can't.

There's more to headers than their syntax, though. Email headers are structured as a somewhat-unordered list of headers; this genericity gives rise to a very large number of headers, and that's just the list of official headers. There are unofficial headers whose use is generally agreed upon, such as X-Face, X-No-Archive, or X-Priority; other unofficial headers are used for internal tracking such as Mailman's X-BeenThere or Mozilla's X-Mozilla-Status headers. Choosing how to semantically interpret these headers (or even which headers to interpret!) can therefore be extremely daunting.

Some of the headers are specified in ways that would seem surprising to most users. For example, the venerable From header can represent anywhere between 0 mailboxes [4] to an arbitrarily large number—but most clients assume that only one exists. It's also worth noting that the Sender header is (if present) a better indication of message origin as far as tracing is concerned [5], but its relative rarity likely results in filtering applications not taking it into account. The suite of Resent-* headers also experiences similar issues.

Another impact of email headers is the degree to which they can be trusted. RFC 5322 gives some nice-sounding platitudes to how headers are supposed to be defined, but many of those interpretations turn out to be difficult to verify in practice. For example, Message-IDs are supposed to be globally unique, but they turn out to be extremely lousy UUIDs for emails on a local system, even if you allow for minor differences like adding trace headers [6].

More serious are the spam, phishing, etc. messages that lie as much as possible so as to be seen by end-users. Assuming that a message is hostile, the only header that can be actually guaranteed to be correct is the first Received header, which is added by the final user's mailserver [7]. Every other header, including the Date and From headers most notably, can be a complete and total lie. There's no real way to authenticate the headers or hide them from snoopers—this has critical consequences for both spam detection and email security.

There's more I could say on this topic (especially CFWS), but I don't think it's worth dwelling on. This is more of a preparatory post for the next entry in the series than a full compilation of complaints. Speaking of my next post, I don't think I'll be able to keep up my entirely-unintentional rate of posting one entry this series a month. I've exhausted the topics in email that I am intimately familiar with and thus have to move on to the ones I'm only familiar with.

[1] Some people attempt to be too zealous in following RFCs and ignore the distinction between syntax and semantics, as I complained about in part 4 when discussing the syntax of email addresses.
[2] I mean this in the theoretical sense of the definition. The proof that balanced parentheses is not a regular language is a standard exercise in use of the pumping lemma.
[3] Unless domain literals are involved. But domain literals are their own special category.
[4] Strictly speaking, the 0 value is intended to be used only when the email has been downgraded and the email address cannot be downgraded. Whether or not these will actually occur in practice is an unresolved question.
[5] Semantically speaking, Sender is the person who typed the message up and actually sent it out. From is the person who dictated the message. If the two headers would be the same, then Sender is omitted.
[6] Take a message that's cross-posted to two mailing lists. Each mailing list will generate copies of the message which end up being submitted back into the mail system and will typically avoid touching the Message-ID.
[7] Well, this assumes you trust your email provider. However, your email provider can do far worse to your messages than lie about the Received header…

Categorieën: Mozilla-nl planet

Joshua Cranmer: Charsets and NNTP

Thunderbird - vr, 24/01/2014 - 01:53
Recently, the question of charsets came up within the context of necessary decoder support for Thunderbird. After much hemming and hawing about how to find this out (which included a plea to the IMAP-protocol list for data), I remembered that I actually had this data. Long-time readers of this blog may recall that I did a study several years ago on the usage share of newsreaders. After that, I was motivated to take my data collection to the most extreme way possible. Instead of considering only the "official" Big-8 newsgroups, I looked at all of them on the news server I use (effectively, all but alt.binaries). Instead of relying on pulling the data from the server for the headers I needed, I grabbed all of them—the script literally runs HEAD and saves the results in a database. And instead of a month of results, I grabbed the results for the entire year of 2011. And then I sat on the data.

After recalling Henri Sivonen's pestering about data, I decided to see the suitability of my dataset for this task. For data management reasons, I only grabbed the data from the second half of the year (about 10 million messages). I know from memory that the quality of Python's message parser (which was used to extract data in the first place) is surprisingly poor, which introduces bias of unknown consequence to my data. Since I only extracted headers, I can't identify charsets for anything which was sent as, say, multipart/alternative (which is more common than you'd think), which introduces further systematic bias. The end result is approximately 9.6M messages that I could extract charsets from and thence do further research.

Discussions revealed one particularly surprising tidbit of information. The most popular charset not accounted for by the Encoding specification was IBM437. Henri Sivonen speculated that the cause was some crufty old NNTP client on Windows using that encoding, so I endeavored to build a correlation database to check that assumption. Using the wonderful magic of d3, I produced a heatmap comparing distributions of charsets among various user agents. Details about the visualization may be found on that page, but it does refute Henri's claim when you dig into the data (it appears to be caused by specific BBS-to-news gateways, and is mostly localized in particular BBS newsgroups).

Also found on that page are some fun discoveries of just what kind of crap people try to pass off as valid headers. Some of those User-Agents are clearly spoofs (Outlook Express and family used the X-Newsreader header, not the User-Agent header). There also appears to be a fair amount of mojibake in headers (one of them appeared to be venerable double mojibake). The charsets also have some interesting labels to them: the "big5\n" and the "(null)" illustrate that some people don't double check their code very well, and not shown are the 5 examples of people who think charset names have spaces in them. A few people appear to have mixed up POSIX locales with charsets as well.

Categorieën: Mozilla-nl planet

Joshua Cranmer: Why email is hard, part 4: Email addresses

Thunderbird - do, 05/12/2013 - 00:24
This post is part 4 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. This post discusses the problems with email addresses.

You might be surprised that I find email addresses difficult enough to warrant a post discussing only this single topic. However, this is a surprisingly complex topic, and one which is made much harder by the presence of a very large number of people purporting to know the answer who then proceed to do the wrong thing [1]. To understand why email addresses are complicated, and why people do the wrong thing, I pose the following challenge: write a regular expression that matches all valid email addresses and only valid email addresses. Go ahead, stop reading, and play with it for a few minutes, and then you can compare your answer with the correct answer.

 

 

 

Done yet? So, if you came up with a regular expression, you got the wrong answer. But that's because it's a trick question: I never defined what I meant by a valid email address. Still, if you're hoping for partial credit, you may be able to get some by correctly matching one of the purported definitions I give below.

The most obvious definition meant by "valid email address" is text that matches the addr-spec production of RFC 822. No regular expression can match this definition, though—and I am aware of the enormous regular expression that is often purported to solve this problem. This is because comments can be nested, which means you would need to solve the "balanced parentheses" language, which is easily provable to be non-regular [2].

Matching the addr-spec production, though, is the wrong thing to do: the production dictates the possible syntax forms an address may have, when you arguably want a more semantic interpretation. As a case in point, the two email addresses example@test.invalid and example @ test . invalid are both meant to refer to the same thing. When you ignore the actual full grammar of an email address and instead read the prose, particularly of RFC 5322 instead of RFC 822, you'll realize that matching comments and whitespace are entirely the wrong thing to do in the email address.

Here, though, we run into another problem. Email addresses are split into local-parts and the domain, the text before and after the @ character; the format of the local-part is basically either a quoted string (to escape otherwise illegal characters in a local-part), or an unquoted "dot-atom" production. The quoting is meant to be semantically invisible: "example"@test.invalid is the same email address as example@test.invalid. Normally, I would say that the use of quoted strings is an artifact of the encoding form, but given the strong appetite for aggressively "correct" email validators that attempt to blindly match the specification, it seems to me that it is better to keep the local-parts quoted if they need to be quoted. The dot-atom production matches a sequence of atoms (spans of text excluding several special characters like [ or .) separated by . characters, with no intervening spaces or comments allowed anywhere.

RFC 5322 only specifies how to unfold the syntax into a semantic value, and it does not explain how to semantically interpret the values of an email address. For that, we must turn to SMTP's definition in RFC 5321, whose semantic definition clearly imparts requirements on the format of an email address not found in RFC 5322. On domains, RFC 5321 explains that the domain is either a standard domain name [3], or it is a domain literal which is either an IPv4 or an IPv6 address. Examples of the latter two forms are test@[127.0.0.1] and test@[IPv6:::1]. But when it comes to the local-parts, RFC 5321 decides to just give up and admit no interpretation except at the final host, advising only that servers should avoid local-parts that need to be quoted. In the context of email specification, this kind of recommendation is effectively a requirement to not use such email addresses, and (by implication) most client code can avoid supporting these email addresses [4].

The prospect of internationalized domain names and email addresses throws a massive wrench into the state of affairs, however. I've talked at length in part 2 about the problems here; the lack of a definitive decision on Unicode normalization means that the future here is extremely uncertain, although RFC 6530 does implicitly advise that servers should accept that some (but not all) clients are going to do NFC or NFKC normalization on email addresses.

At this point, it should be clear that asking for a regular expression to validate email addresses is really asking the wrong question. I did it at the beginning of this post because that is how the question tends to be phrased. The real question that people should be asking is "what characters are valid in an email address?" (and more specifically, the left-hand side of the email address, since the right-hand side is obviously a domain name). The answer is simple: among the ASCII printable characters (Unicode is more difficult), all the characters but those in the following string: " \"\\[]();,@". Indeed, viewing an email address like this is exactly how HTML 5 specifies it in its definition of a format for <input type="email">.
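
A minimal sketch of that character-based view (illustrative only; the domain side still needs a proper domain-name check):

    EXCLUDED = set(' "\\[]();,@')

    def plausible_local_part(local):
        # Printable ASCII only, minus the excluded characters listed above.
        return bool(local) and all(
            0x21 <= ord(c) <= 0x7e and c not in EXCLUDED for c in local)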

Another, much easier, more obvious, and simpler way to validate an email address relies on zero regular expressions and zero references to specifications. Just send an email to the purported address and ask the user to click on a unique link to complete registration. After all, the most common reason to request an email address is to be able to send messages to that email address, so if mail cannot be sent to it, the email address should be considered invalid, even if it is syntactically valid.

Unfortunately, people persist in trying to write buggy email validators. Some are too simple and ignore valid characters (or valid top-level domain names!). Others are so focused on trying to match the RFC addr-spec syntax that, while they will happily accept most or all addr-spec forms, they also accept email addresses which are very likely to wreak havoc if you pass them to another system to send email; cause various forms of SQL injection, XSS injection, or even shell injection attacks; and which are likely to confuse tools as to what the email address actually is. This can be ameliorated with complicated normalization functions for email addresses, but none of the email validators I've looked at actually do this (which, again, goes to show that they're missing the point).

Which brings me to a second quiz question: are email addresses case-insensitive? If you answered no, well, you're wrong. If you answered yes, you're also wrong. The local-part, as RFC 5321 emphasizes, is not to be interpreted by anyone but the final destination MTA server. A consequence is that it does not specify if they are case-sensitive or case-insensitive, which means that general code should not assume that it is case-insensitive. Domains, of course, are case-insensitive, unless you're talking about internationalized domain names [5]. In practice, though, RFC 5321 admits that servers should make the names case-insensitive. For everyone else who uses email addresses, the effective result of this admission is that email addresses should be stored in their original case but matched case-insensitively (effectively, code should be case-preserving).

Hopefully this gives you a sense of why email addresses are frustrating and much more complicated than they first appear. There are historical artifacts of email addresses I've decided not to address (the roles of ! and % in addresses), but since they only matter to some SMTP implementations, I'll discuss them when I pick up SMTP in a later part (if I ever discuss them). I've avoided discussing some major issues with the specification here, because they are much better handled as part of the issues with email headers in general.

Oh, and if you were expecting regular expression answers to the challenge I gave at the beginning of the post, here are the answers I threw together for my various definitions of "valid email address." I didn't test or even try to compile any of these regular expressions (as you should have gathered, regular expressions are not what you should be using), so caveat emptor.

RFC 822 addr-spec
Impossible. Don't even try.
RFC 5322 non-obsolete addr-spec production
([^\x00-\x20()\[\]:;@\\,.]+(\.[^\x00-\x20()\[\]:;@\\,.]+)*|"(\\.|[^\\"])*")@([^\x00-\x20()\[\]:;@\\,.]+(\.[^\x00-\x20()\[\]:;@\\,.]+)*|\[(\\.|[^\\\]])*\])
RFC 5322, unquoted email address
.*@([^\x00-\x20()\[\]:;@\\,.]+(\.[^\x00-\x20()\[\]:;@\\,.]+)*|\[(\\.|[^\\\]])*\])
HTML 5's interpretation
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*
Effective EAI-aware version
[^\x00-\x20\x80-\x9f()\[\]:;@\\,]+@[^\x00-\x20\x80-\x9f()\[\]:;@\\,]+, with the caveats that a dot does not begin or end the local-part, nor do two dots appear consecutively, the local part is in NFC or NFKC form, and the domain is a valid domain name.

[1] If you're trying to find guides on valid email addresses, a useful way to eliminate incorrect answers are the following litmus tests. First, if the guide mentions an RFC, but does not mention RFC 5321 (or RFC 2821, in a pinch), you can generally ignore it. If the email address test (not) @ example.com would be valid, then the author has clearly not carefully read and understood the specifications. If the guide mentions RFC 5321, RFC 5322, RFC 6530, and IDN, then the author clearly has taken the time to actually understand the subject matter and their opinion can be trusted.
[2] I'm using "regular" here in the sense of theoretical regular languages. Perl-compatible regular expressions can match non-regular languages (because of backreferences), but even backreferences can't solve the problem here. It appears that newer versions support a construct which can match balanced parentheses, but I'm going to discount that because by the time you're going to start using that feature, you have at least two problems.
[3] Specifically, if you want to get really technical, the domain name is going to be routed via MX records in DNS.
[4] RFC 5321 is the specification for SMTP, and, therefore, it is only truly binding for things that talk SMTP; likewise, RFC 5322 is only binding on people who speak email headers. When I say that systems can pretend that email addresses with domain literals or quoted local-parts don't exist, I'm excluding mail clients and mail servers. If you're writing a website and you need an email address, there is no need to support email addresses which don't exist on the open, public Internet.
[5] My usual approach to seeing internationalization at this point (if you haven't gathered from the lengthy second post of this series) is to assume that the specifications assume magic where case insensitivity is desired.

Categorieën: Mozilla-nl planet
