This Week In Rust: This Week in Rust 342

Mozilla planet - Wed, 10/06/2020 - 06:00

Hello and welcome to another issue of This Week in Rust! Rust is a systems language pursuing the trifecta: safety, concurrency, and speed. This is a weekly summary of its progress and community. Want something mentioned? Tweet us at @ThisWeekInRust or send us a pull request. Want to get involved? We love contributions.

This Week in Rust is openly developed on GitHub. If you find any errors in this week's issue, please submit a PR.

Check out this week's This Week in Rust Podcast

Updates from Rust Community

News & Blog Posts

Crate of the Week

This week's crate is cargo-spellcheck, a cargo subcommand to spell-check your docs.

Thanks to Bernhard Schuster for the suggestion!

Submit your suggestions and votes for next week!

Call for Participation

Always wanted to contribute to open-source projects but didn't know where to start? Every week we highlight some tasks from the Rust community for you to pick and get started!

Some of these tasks may also have mentors available, visit the task page for more information.

If you are a Rust project owner and are looking for contributors, please submit tasks here.

Updates from Rust Core

350 pull requests were merged in the last week

Rust Compiler Performance Triage

Approved RFCs

Changes to Rust follow the Rust RFC (request for comments) process. These are the RFCs that were approved for implementation this week:

No RFCs were approved this week.

Final Comment Period

Every week the team announces the 'final comment period' for RFCs and key PRs which are reaching a decision. Express your opinions now.

RFCs

No RFCs are currently in the final comment period.

Tracking Issues & PRs

New RFCs

No new RFCs were proposed this week.

Upcoming Events

Online

North America

If you are running a Rust event please add it to the calendar to get it mentioned here. Please remember to add a link to the event too. Email the Rust Community Team for access.

Rust Jobs

Tweet us at @ThisWeekInRust to get your job offers listed here!

Quote of the Week

You don't declare lifetimes. Lifetimes come from the shape of your code, so to change what the lifetimes are, you must change the shape of the code.

Alice Ryhl on rust-users

Thanks to RustyYato for the suggestions!

Please submit quotes and vote for next week!

This Week in Rust is edited by: nellshamrell, llogiq, and cdmistman.

Discuss on r/rust

The Rust Programming Language Blog: 2020 Event Lineup - Update

Mozilla planet - Wed, 10/06/2020 - 02:00

In 2020, the way we can run events suddenly changed. In the past we had in-person events all around the world, with some major conferences throughout the year. With everything changed by a global pandemic, this is no longer possible. Nonetheless, the Rust community has found ways to continue with events in some form or another. With more and more events moving online, they are becoming more accessible to people no matter where they are.

Below you'll find updated information about Rust events in 2020.

Do you plan to run a Rust online event? Send an email to the Rust Community team and the team will be able to get your event on the calendar and might be able to offer further help.

Rust LATAM

Unfortunately the Latin American event Rust LATAM had to be canceled this year. The team hopes to be able to resume the event in the future.

Oxidize
July 17th-20th, 2020

The Oxidize conference was relabeled to become Oxidize Global. From July 17-20 you will be able to learn about embedded systems and IoT in Rust. Over the course of 4 days you will be able to attend online workshops (July 17th), listen to talks (July 18th) and take part in the Impl Days, where you can collaborate with other Embedded Rust contributors in active programming sessions.

Tickets are on sale and the speakers & talks will be announced soon.

RustConf
August 20th, 2020

The official RustConf will be taking place fully online. Listen to talks and meet other Rust enthusiasts online in digital meetups & breakout rooms. See the list of speakers, register now, and follow Twitter for updates as the event date approaches!

Rusty Days
July 27th - August 2nd, 2020

Rusty Days is a new conference that was planned to happen in Wroclaw, Poland. It has now turned into a virtual Rust conference stretched over five days. You'll be able to see five speakers with five talks -- and everything is free of charge, streamed online and available to watch later.

The Call for Papers is open. Follow Twitter for updates.

RustLab
October 16th-17th, 2020

RustLab 2020 is also turning into an online event. The details are not yet settled, but they are aiming for the original dates. Keep an eye on their Twitter stream for further details.

RustFest Netherlands Global
November 7th-8th, 2020

RustFest Netherlands was supposed to happen this June. The team decided to postpone the event, which is now happening as an online conference in Q4 of this year. More information will be available soon on the RustFest blog and also on Twitter.

Update 2020-06-18: RustFest has announced its dates: November 7th & 8th, running as an online community conference. See the announcement blog post for details.

Conferences are not the only thing happening. More and more local meetups are being turned into online events. We try to highlight these in the community calendar as well as in the This Week in Rust newsletter. Some Rust developers are streaming their work on the language & their Rust projects. You can find more information in a curated list of Rust streams.

Do you plan to run a Rust online event? Send an email to the Rust Community team and the team will be able to get your event on the calendar and might be able to offer further help.

Mozilla Future Releases Blog: Next steps in testing our Firefox Private Network browser extension beta

Mozilla planet - Tue, 09/06/2020 - 18:25

Last fall, we launched the Firefox Private Network browser extension beta as a part of our Test Pilot experiments program. The extension offers safe, no-hassle network protection in the Firefox browser. Since our initial launch, we’ve released a number of versions offering different capabilities. We’ve also launched a Virtual Private Network (VPN) for users interested in full device protection.

Today we are pleased to announce the next step in our Firefox Private Network browser extension Beta. Starting soon, we will be transitioning from a free beta to a paid subscription beta for the Firefox Private Network browser extension. This version will be offered for a limited time for $2.99/mo and will provide unlimited access while using the Firefox Private Network extension. Like our existing extension, this version will be available in the U.S. first, but we hope to expand to other markets soon. Unlike our previous beta, this version will also allow users to connect up to three Firefox browsers at once using the same account. This will only be available for desktop users. For this release, we will also be updating our product icon to differentiate more clearly from the VPN. More information about our VPN as a stand-alone product offering will be shared in the coming weeks.

What did we learn?

Last fall, when we first launched the Firefox Private Network browser extension, we saw a lot of early excitement around the product followed by a wave of users signing up. From September through December, we offered early adopters a chance to sign up for the extension with unlimited access, free of charge. In December, when the subscription VPN first launched, we updated our experimental offering to understand if giving participants a certain number of hours a month for browsing in coffee shops or at airports (remember those?) would be appealing. What we learned very quickly was that the appeal of the proxy came most of all from the simplicity of the unlimited offering. Users of the unlimited version appreciated having set-and-forget privacy, while users of the limited version often didn’t remember to turn on the extension at opportune moments.

These initial findings were borne out in subsequent research. Users in the unlimited cohort engaged at a high level, while users in the limited cohort often stopped using the proxy after only a few hours. When we spoke to proxy users, we found that for many the appeal of the product was in the set-it-and-forget-it protection it offered.

We also knew from the outset that we could not offer this product for free forever. While there are some free proxy products available in the market, there is always a cost associated with the network infrastructure required to run a secure proxy service.  We believe the simplest and most transparent way to account for these costs is by providing this service at a modest subscription fee. After conducting a number of surveys, we believe that the appropriate introductory price for the Firefox Private Network browser extension is $2.99 a month.

What will we be testing?

So the next thing we want to understand is basically this: will people pay for a browser-based privacy tool? It’s a simple question really, and one we think is best answered by the market. Over the summer we will be conducting a series of small marketing tests to determine interest in the Firefox Private Network browser extension both as a standalone subscription product and as part of a larger privacy and security bundle for Firefox.

In conjunction, we will also continue to explore the relationship between the Firefox Private Network extension and the VPN. Does it make sense to bundle them? Do VPN subscribers want access to the browser extension? How can we best communicate the different values and attributes of each?

What you can expect next

Starting in a few weeks, new users and users in the limited experiment will be offered the opportunity to subscribe to the unlimited beta for $2.99 a month. Shortly thereafter we will be asking our unlimited users to migrate as well.

 

The post Next steps in testing our Firefox Private Network browser extension beta appeared first on Future Releases.

The Mozilla Blog: Mozilla Announces Second Three COVID-19 Solutions Fund Recipients

Mozilla planet - Mon, 08/06/2020 - 15:45

Innovations spanning food supplies, medical records and PPE manufacture were today included in the final three awards made by Mozilla from its COVID-19 Solutions Fund. The Fund was established at the end of March by the Mozilla Open Source Support Program (MOSS) to offer up to $50,000 each to open source technology projects responding to the COVID-19 pandemic. In just two months, the Fund received 163 applications from 30 countries and is now closed to new applications.

OpenMRS is a robust, scalable, user-driven, open source electronic medical record system platform currently used to manage more than 12.6 million patients at over 5,500 health facilities in 64 countries. Using Kenya as a primary use case, their COVID-19 Response project will coordinate work on OpenMRS COVID-19 solutions emerging from their community, particularly “pop-up” hospitals, into a COVID-19 package for immediate use.

This package will be built for eventual re-use as a foundation for a suite of tools that will become the OpenMRS Public Health Response distribution. Science-based data collection tools, reports, and data exchange interfaces with other key systems in the public health sector will provide critical information needed to contain disease outbreaks. The committee approved an award of $49,754.

Open Food Network offers an open source platform enabling new, ethical supply chains. Food producers can sell online directly to consumers and wholesalers can manage buying groups and supply produce through networks of food hubs and shops. Communities can bring together producers to create a virtual farmers’ market, building a resilient local food economy.

At a time when supply chains are being disrupted around the world — resulting in both food waste and shortages — they’re helping to get food to people in need. Globally, the Open Food Network is currently deployed in India, Brazil, Italy, South Africa, Australia, the UK, the US and five other countries. They plan to use their award to extend to ten other countries, build tools to allow vendors to better control inventory, and scale up their support infrastructure as they continue international expansion. The Committee approved a $45,210 award.

Careables Casa Criatura Olinda in northeast Brazil is producing face shields for local hospitals based on an open source design. With their award, they plan to increase their production of face shields as well as to start producing aerosol boxes using an open source design, developed in partnership with local healthcare professionals.

Outside of North American ICUs, many hospitals cannot maintain only one patient per room, protected by physical walls and doors. In such cases, aerosol boxes are critical to prevent the spread of the virus from patient to patient and patient to physician. Yet even the Brazilian city of Recife (population: 1.56 million) has only three aerosol boxes. The Committee has approved a $25,000 award and authorized up to an additional $5,000 to help the organization spread the word about their aerosol box design.

“Healthcare has for too long been assumed to be too high risk for open source development. These awards highlight how critical open source technologies are to helping communities around the world to cope with the pandemic,” said Jochai Ben-Avie, Head of International Public Policy and Administrator of the Program at Mozilla. “We are indebted to the talented global community of open source developers who have found such vital ways to put our support to good use.”

Information on the first three recipients from the Fund can be found here.

The post Mozilla Announces Second Three COVID-19 Solutions Fund Recipients appeared first on The Mozilla Blog.

Henri Sivonen: chardetng: A More Compact Character Encoding Detector for the Legacy Web

Mozilla planet - Mon, 08/06/2020 - 12:48

chardetng is a new small-binary-footprint character encoding detector for Firefox written in Rust. Its purpose is user retention by addressing an ancient—and for non-Latin scripts page-destroying—Web compat problem that Chrome already addressed. It puts an end to the situation where some pages showed up as unreadable garbage in Firefox but opening the page in Chrome remedied the situation more easily than fiddling with Firefox menus. The deployment of chardetng in Firefox 73 marks the first time in the history of Firefox (and in the entire Netscape lineage of browsers since the time support for non-ISO-8859-1 text was introduced) that the mapping of HTML bytes to the DOM is not a function of the browser user interface language. This is accomplished at a notably lower binary size increase than what would have resulted from adopting Chrome’s detector. Also, the result is more explainable and modifiable as well as more suitable for standardization, should there be interest, than Chrome’s detector.

chardetng targets the long tail of unlabeled legacy pages in legacy encodings. Web developers should continue to use UTF-8 for newly-authored pages and to label HTML accordingly. This is not a new feature for Web developers to make use of!

Although chardetng first landed in Firefox 73, this write-up discusses chardetng as of Firefox 78.

TL;DR

There is a long tail of legacy Web pages that fail to label their encoding. Historically, browsers have used a default associated with the user interface language of the browser for unlabeled pages and provided a menu for the user to choose something different.

In order to get rid of the character encoding menu, Chrome adopted an encoding detector (apparently) developed for Google Search and Gmail. This was done without discussing the change ahead of time in standard-setting organizations. Firefox had gone in the other direction of avoiding content-based guessing, so this created a Web compat problem that, when it occurred for non-Latin-script pages, was as bad as a Web compat problem can be: Encoding-unlabeled legacy pages could appear completely unreadable in Firefox (unless a well-hidden menu action was taken) but appear OK in Chrome. Recent feedback from Japan as well as world-wide telemetry suggested that this problem still actually existed in practice. While Safari detects less, if a user encounters the problem on iOS, there is no other engine to go to, so Safari can’t be used to determine the abandonment risk for Firefox on platforms where a Chromium-based browser is a couple of clicks or taps away. Edge’s switch to Chromium signaled an end to any hope of taking the Web Platform in the direction of detecting less.

ICU4C’s detector wasn’t accurate (or complete) enough. Firefox’s old and mostly already removed detector wasn’t complete enough and completing it would have involved the same level of work as writing a completely new one. Since Chrome’s detector, ced, wasn’t developed for browser use cases, it has a larger footprint than is necessary. It is also (in practice) unmodifiable over-the-wall Open Source, so adopting it would have meant adopting a bunch of C++ that would have had known-unnecessary bloat while also being difficult to clean up.

Developing an encoding detector is easier and takes less effort than it looks once one has made the observations that one makes when developing a character encoding conversion library. chardetng’s foundational binary size-reducing idea is to make use of the legacy Chinese, Japanese, and Korean (CJK) decoders (that a browser has to have anyway) for the purpose of detecting CJK legacy encodings. chardetng is also strictly scoped to Web-relevant use cases.

On x86_64, the binary size contribution of chardetng (and the earlier companion Japanese-specific detector) to libxul is 28% of what ced would have contributed to libxul. If we had adopted ced and later wanted to make a comparable binary size saving, hunting for savings around the code base would have been more work than writing chardetng from scratch.

The focus on binary size makes chardetng take 42% longer to process the same data compared to ced. However, this tradeoff makes sense for code that runs for legacy pages and doesn’t run at all for modern pages. The accuracy of chardetng is competitive with ced, and chardetng is integrated into Firefox in a way that gives it a better opportunity to give the right answer compared to the way ced is integrated into Chrome.

chardetng has been developed in such a way that it would be standardizable, should there be interest to standardize it at the WHATWG. The data tables are under CC0, and non-CC0 part of chardetng consists of fewer than 3000 lines of explainable Rust code that could be reversed into spec English.

Why

Before getting to how chardetng works, let’s look in more detail at why it exists at all.

Background

Back when the Web was created, operating systems had locale-specific character encodings. For example, a system localized for Greek had a different character encoding from a system localized for Japanese. The Web was created in Switzerland, so bytes were assumed to be interpreted according to ISO-8859-1, which was the Western European encoding for Unix-ish systems and also compatible with the Western European encoding for Windows. (The single-byte encodings on Mac OS Classic were so different from Unix and Windows that browsers on Mac had to permute the bytes and couldn’t assume content from the Web to match the system encoding.)

As browsers were adapted to more languages, the default treatment of bytes followed the system-level design of the textual interpretation of bytes being locale-specific. Even though the concept of actually indicating what character encoding content was in followed relatively quickly, the damage was already done. There were already unlabeled pages out there, so—for compatibility—browsers retained locale-specific fallback defaults. This made it possible for authors to create more unlabeled content that appeared to work fine when viewed with an in-locale browser but that broke if viewed with an out-of-locale browser. That’s not so great for a system that has “world-wide” in its name.

For a long time, browsers provided a menu that allowed the user to override the character encoding when the fallback character encoding of the page author’s browser and the fallback character encoding of the page reader’s browser didn’t match. Browsers have also traditionally provided a detector for the Japanese locale for deciding between Shift_JIS and EUC-JP (also possibly ISO-2022-JP) since there was an encoding split between Windows and Unix. (Central Europe also had such an encoding split between Windows and Unix, but, without a detector, Web developers needed to get their act together instead and declare the encodings…) Firefox also provided an on-by-default detector for the Russian and Ukrainian locales (but not other Cyrillic locales!) for detecting among multiple Cyrillic encodings. Later on, Firefox also tried to alleviate the problem by deciding from the top-level domain instead of the UI localization when the top-level domain was locale-affiliated. However, this didn’t solve the problem for .com/.net/.org, for local files, or for locales with multiple legacy encodings even if not on the level of prevalence as in Japan.

What Problem is Being Solved Now?

As part of an effort to get rid of the UI for manually overriding the character encoding of an HTML page, Chrome adopted compact_enc_det (ced), which appears to be Google’s character encoding detector from Google Search and Gmail. This was done as a surprise to us without discussing the change in standard-setting organizations ahead of time. This led to a situation where Firefox’s top-level domain or UI locale heuristics could fail but Chrome’s content-based detection could succeed. It is likely more discoverable and easier to launch a Chromium-based browser instead of seeking to remedy the situation within Firefox by navigating rather well-hidden submenus.

When the problem occurred with non-Latin-script pages, it was about as bad as a Web compat problem can be: The text was completely unreadable in Firefox and worked just fine in Chrome.

But Did the Problem Actually Occur Often?

So we had a Web compat problem that was older than JavaScript and almost as old as the Web itself and that had a very bad (text completely unreadable) failure mode. But did it actually happen often enough to care? If you’re in the Americas, Western Europe, Australia, New Zealand, many parts of Africa to the South of Sahara, etc., you might be thinking: “Did this really happen prior to Firefox 73? It didn’t happen when I browsed the Web.”

Indeed, this issue has not been a practical problem for quite a while for users who are in windows-1252 locales and read content from windows-1252 locales, where a “windows-1252 locale” is defined as a locale where the legacy “ANSI” code page for the Windows localization for that locale is windows-1252. There are two reasons why this is the case. First, the Web was created in this locale cohort (windows-1252 is a superset of ISO-8859-1), so the defaults have always been favorable. Second, when the problem does occur, the effect is relatively tolerable for the languages of these locales. These locales use the Latin script with relatively infrequent non-ASCII characters. Even if those non-ASCII characters are replaced with garbage, it’s still easy to figure out what the text as a whole is saying.

With non-Latin scripts, the problem is much more severe, because pretty much all characters (except ASCII spaces and ASCII punctuation) get replaced with garbage.

Clearly, when the user invokes the Text Encoding menu in Firefox, this user action is indicative of the user having encountered this problem. All over the world, the use of the menu could be fairly characterized as very rare. However, how rare varied by a rather large factor. If we take the level of usage in Germany and Spain, which are large non-English windows-1252 locales, as the baseline of a rate of menu usage that we could dismiss as not worth addressing, the rate of usage in Macao was 85 times that (measured not in terms of accesses to the menu directly but in terms of Firefox subsessions in which the menu had been used at least once; i.e. a number of consecutive uses in one subsession counted just once). Ignorable times 85 does not necessarily equal requiring action, but this indicates that an assessment of the severity of the problem from a windows-1252 experience is unlikely to be representative.

Also, the general rarity of the menu usage is not the whole story. It measures only the cases where the user knew how and bothered to address the problem within Firefox. It doesn’t measure abandonment: It doesn’t measure the user addressing the problem by opening a Chromium-based browser instead. For every user who found the menu option and used it in Firefox, several others likely abandoned Firefox.

The feedback kept coming from Japan that the problem still occurred from time to time. Telemetry indicated that it occurred at an even higher rate in locales where the primary writing system is Traditional Chinese than in Japan. In mainland China, which uses Simplified Chinese, the rate was a bit lower than in Japan but on a similar level. My hypothesis is that despite Traditional Chinese having a single legacy encoding in the Web Platform and Simplified Chinese also having a single legacy encoding in the Web Platform (for decoding purposes), which would predict these locales as having success from a locale-based guess, users read content across the Traditional/Simplified split on generic domains (where a guess from the top-level domain didn’t apply) but don’t treat out-of-locale failures as reportable. In the case of Japan and the Japanese language, the failures are in-locale and apparently treated as more reportable.

In general, as one would expect, the menu usage is higher in locales where the primary writing system is not the Latin script than in locales where the primary writing system is the Latin script. Notably, Bulgaria was pretty high on the list. We enabled our previous Cyrillic detector (which probably wasn’t trained with Bulgarian) by default for the Russian and Ukrainian localizations but not for the Bulgarian localization. Also, we lacked a TLD mapping for .cy and, sure enough, the rate of menu usage was higher in Cyprus than in Greece.

Again, as one would expect, the menu usage was higher in non-windows-1252 Latin-script locales than in windows-1252 Latin-script locales. Notably, the usage in Azerbaijan was unusually high compared to other Latin-script locales. I had suspected we had the wrong TLD mapping for .az, because at the time the TLD mappings were introduced to Firefox, we didn’t have an Azerbaijani localization to learn from.

In fact, I suspected that our TLD mapping for .az was wrong at the very moment of making the TLD mappings, but I didn’t have data to prove it, so I erred on the side of inaction. Now, years later, we have the data. It seems to me that it was better to trust the feedback from Japan and fix this problem than to gather more data to prove that the Web compat problem deserved fixing.

Why Now?

So why not fix this earlier or at the latest when Chrome introduced their present detector? Generally, it seems a better idea to avoid unobvious behaviors in the Web Platform if they are avoidable, and at the time Microsoft was going in the opposite direction compared to Chrome with neither a menu nor a detector in EdgeHTML-based Edge. This made it seem that there might be a chance that the Web Platform could end up being simpler, not more complex, on this point.

While it would be silly to suggest that this particular issue factored into Microsoft’s decision to base the new Edge on Chromium, the switch removed EdgeHTML as a data point suggesting that things could be less complex than they already were in Gecko. At this point, it didn’t make sense to pretend that if we didn’t start detecting more, Chrome and the new Edge would start detecting less.

Safari has less detection than Firefox had previously. As far as I can tell, Safari doesn’t have TLD-based guessing, and the fallback comes directly from the UI language with the detail that if the UI language is Japanese, there’s content-based guessing between Japanese legacy encodings. However, to the extent Safari/WebKit is engineered to the iOS market conditions, we can’t take this as an indication that this level of guessing would be enough for Gecko relative to Chromium. If users encounter this problem on iOS, there’s no other engine they can go to, and the problem is rare enough that users aren’t going to give up on iOS as a whole due to this problem. However, on every platform that Gecko runs on, a Chromium-based browser is a couple of clicks or taps away.

Why Not Use an Existing Library?

After deciding to fix this, there’s the question of how to fix it. Why write new code instead of reusing existing code?

Why Not ICU4C?

ICU4C has a detector, but Chrome had already rejected it as not accurate enough. Indeed, in my own testing with title-length input, ICU4C was considerably less accurate than the alternatives discussed below.

Why Not Resurrect Mozilla’s Old Detector?

At this point you may be thinking “Wait, didn’t Firefox have a ‘Universal’ detector and weren’t you the one who removed it?” Yes and yes.

Firefox used to have two detectors: the “Universal” detector, also known as chardet, and a separate Cyrillic detector. chardet had the following possible configurations: Japanese, Traditional Chinese, Simplified Chinese, Chinese, Korean, and Universal, where the last mode enabled all the detection capabilities and the other modes enabled subsets only.

This list alone suggests a problem: If you look at the Web Platform as it exists today, Traditional Chinese has a single legacy encoding, Simplified Chinese has a single legacy decode mode (both GBK and GB18030 decode the same way and differ on the encoding side), and Korean has a single legacy encoding. The detector was written at a time when Gecko was treating character encodings as Pokémon and tried to catch them all. Since then we’ve learned that EUC-TW (an encoding for Traditional Chinese), ISO-2022-CN (an encoding for both Traditional Chinese and Simplified Chinese), HZ (an encoding for Simplified Chinese), ISO-2022-KR (an encoding for Korean), and Johab (an encoding for Korean) never took off on the Web, and we were able to remove them from the platform. With a single real legacy encoding per each of the Traditional Chinese, Simplified Chinese, and Korean, the only other detectable was UTF-8, but detecting it is problematic (more on this later).

When I filed the removal bug for the non-Japanese parts of chardet in early 2013, chardet was enabled in its Japanese-only mode by default for the Japanese localization, the separate Cyrillic detector was enabled for the Russian and Ukrainian localizations, and chardet was enabled in the Universal mode for the Traditional Chinese localization. At the time, the character encoding defaults for the Traditional Chinese localization differed from how they were supposed to be set in another way, too: It had UTF-8 instead of Big5 as the fallback encoding, which is why I didn’t take the detector setting very seriously, either. In the light of later telemetry, the combined Chinese mode, which detected between both Traditional and Simplified Chinese, would have been useful to turn on by default both for the Traditional Chinese and Simplified Chinese localizations.

From time to time, there were requests to turn the “Universal” detector on by default for everyone. That would not have worked, because the “Universal” detector wasn’t actually universal! It was incomplete and had known bugs, but people who hadn’t examined it more closely took the name at face value. That is, over a decade after the detector paper was presented (on September 12th 2001), the detector was still incomplete, off by default except in the Japanese subset mode and in a case that looked like an obvious misconfiguration, and the name was confusing.

The situation was bad as it was. There were really two options: removal or actually investing in fixing it. At the time, Chrome hadn’t gone all-in with detection. Given what other browsers were doing and not wishing to make the Web Platform more complex than it appeared to need to be, I pushed for removal. I still think that the call made sense considering the circumstances and information available.

As for whether it would have made sense to resurrect the old detector in 2019, digging up the code from version control history would still have resulted in an incomplete detector that would not have been suitable for turning on by default generally. Meanwhile, a non-Mozilla fork had added some things and another effort had resulted in a Rust port. Apart from having to sort out licensing (the code forked before the MPL 2.0 upgrade, the Rust port retained only the LGPL3 part of the license options, and Mozilla takes compliance with the LGPL relinking clause seriously, making the use of LGPLed code in Gecko less practical than the use of other Open Source code), even the forks would have required more work to complete despite at least the C++ fork being more complete than the original Mozilla code.

Moreover, despite validating legacy CJK encodings being one of the foundational ideas of the old detector (called “Coding Scheme Method” in the paper), the old detector didn’t make use of the browser’s decoders for these encodings but implemented the validation on its own. This is great if you want a C++ library that has no dependencies, but it’s not great if you are considering the binary size cost in a browser that already has validating decoders for these encodings and ships on Android where binary size still matters.

A detector has two big areas: how to handle single-byte encodings and how to handle legacy CJK. For the former, if you need to develop tooling to add support for some more single-byte encodings or language coverage for single-byte encodings already detected for some languages, you will have built tooling that allows you to redo all of them. Therefore, if the prospect is adding support for more single-byte cases and reusing for CJK detection the decoders that the browser has anyway, doing the work within the frame of the old detector becomes a hindrance, and it makes sense to frame it as newly-written code only instead of trying to formulate the changes as patches to what already existed.

Why Not Use Chrome’s Detector?

Using Chrome’s detector (ced) looks attractive on the surface. If Firefox used the exact same detector as Chrome, Firefox could never do worse than Chrome, since both would do the same thing. There are problems with this, though.

As a matter of a health-of-the-Web principle, it would be bad if an area of the Web platform became defined as having to run a particular implementation. Also, the way ced integrates with Chrome violates the design principles of Firefox’s HTML parser. What gets fed to ced in Chrome depends on buffer boundaries as delivered by the networking subsystem to the HTML parser. Prior to Firefox 4, HTML parsing in Firefox depended on buffer boundaries in an even worse way. Since Firefox 4, I’ve considered it a goal not to make Firefox’s mapping of a byte stream into a DOM dependent on the network buffer boundaries (or the wall clock). Therefore, if I had integrated ced into Firefox, the manner of integration would have been an opportunity for different results, but that would have been negligible. That is, these principled issues don’t justify putting effort into a new implementation.

The decisive reasons are these: ced is over-the-wall Open Source to the point that even Chrome developers don’t try to change it beyond post-processing its output and making it compile with newer compilers. License-wise the code is Open Source / Free Software, but it’s impractical to exercise the freedom to make modifications to its substance, which makes it close to an invariant section in practice (again, not as a matter of license). The code doesn’t come with the tools needed to regenerate its generated parts. And even if it did, the input to those tools would probably involve Google-specific data sources. There isn’t any design documentation (that I could find) beyond code comments. Adopting ced would have meant adopting a bunch of C++ code that we wouldn’t be able to meaningfully change.

But why would one even want to make changes if the goal isn’t to ever improve detection and the goal is just to ensure our detection is never worse than Chrome’s? First of all, as in the case of chardet, ced is self-contained and doesn’t make use of the algorithms and data that a browser already has to have in order to decode legacy CJK encodings once detected. But it’s worse than that. “Post-processing” in the previous paragraph means that ced has more potential outcomes than a browser engine has use for, since ced evidently was not developed for the browser use case. (As noted earlier, ced has the appearance of having been developed for Google Search and Gmail.)

For example, ced appears to make distinctions between various flavors of Shift_JIS in terms of their carrier-legacy emoji mappings (which may have made sense for Gmail at some point in the past). It’s unclear how much table duplication results from this, but it doesn’t appear to be full duplication for Shift_JIS. Still, there are other cases that clearly do lead to table duplication even though the encodings have been unified in the Web Platform. For example, what the Web Platform unifies as a single legacy Traditional Chinese encoding occurs as three distinct ones in ced and what the Web Platform unifies as a single legacy Simplified Chinese encoding for decoding purposes appears as three distinct generations in ced. Also, ced keeps around data and code for several encodings that have been removed from the Web Platform (to a large extent because Chrome demonstrated the feasibility of not supporting them!), for KOI8-CS (why?), and for a number of IE4/Netscape 4-era / pre-iOS/Android-era deliberately misencoded fonts deployed in India. (We’ll come back to those later.)

My thinking around internationalization in relation to binary size is still influenced by the time period when, after shipping on desktop, Mozilla didn’t ship the ECMAScript i18n API on Android for a long time due to binary size concerns. If we had adopted ced and later decided that we wanted to run an effort (BinShrink?) to make the binary size smaller for Android, it would probably have taken more time and effort to save as many bytes from somewhere else (as noted, changing ced itself is something even the Chrome developers don’t do) as chardetng saves relative to just adopting ced than it took me to write chardetng. Or maybe, if done carefully, it would have been possible to remove the unnecessary parts of ced without access to the original tools and to ensure that the result still kept working, but writing the auxiliary code to validate the results of such an effort would have been on the same order of magnitude of effort as writing the tooling for training and testing chardetng.

It’s Not Rocket Surgery

That most Open Source encoding detectors are ports of the old Mozilla code, that ICU’s effort to write their own resulted in something less accurate, and the assumption that Google’s detector draws from some massive Web-scale analysis make it seem like writing a detector from scratch is a huge deal. It isn’t that big a deal, really.

After one has worked on implementing character encodings for a while, some patterns emerge. These aren’t anything new. The sort of things that one notices are the kinds of things the chardet paper attributes to Frank Tang noticing while at Netscape. Also, I already knew about language-specific issues related to e.g. Albanian, Estonian, Romanian, Mongolian, Azerbaijani, and Vietnamese that are discussed below. That is, many of the issues that may look like discoveries in the development process are things that I already knew before writing any code for chardetng.

Furthermore, these days, you don’t need to be Google to have access to a corpus of language-labeled human-authored text: Wikipedia publishes database dumps. Synthesizing legacy-encoded data from these has the benefit that there’s no need to go locate actual legacy-encoded data on the Web and check that it’s correctly labeled.

Why Rust?

An encoding detector is a component with identifiable boundaries and a very narrow API. As such, it’s perfectly suited to be exposed via Foreign Function Interface. Furthermore, the plan was to leverage the existing CJK decoders from encoding_rs—the encoding conversion library used in Firefox—and that’s already Rust code. In this situation, it would have been wrong not to take the productivity and safety benefits of Rust. As a bonus, Rust offers the possibility of parallelizing the code using Rayon, which may or may not help, which isn’t known before measuring, but the cost of trying is very low whereas the cost of planning for parallelism in C++ is very high. (We’ll see later that the measurement was a disappointment.)

How

With the “why” out of the way, let’s look at how chardetng actually works.

Standardizability

One principled concern related to just using ced was that it’s bad for interoperability of the Web Platform depending on a single implementation that everyone would have to ship. Since chardetng is something that I just made up without a spec, is it any better in this regard?

Before writing a single line of code I arranged the data tables used by chardetng to be under CC0 so as to enable their inclusion in a WHATWG spec for encoding detection, should there be interest for one. Also, I have tried to keep everything that chardetng does explainable. The non-CC0 (Apache-2.0 OR MIT) part of chardetng is under 3000 lines of Rust that could realistically be reversed into spec English in case someone wanted to write an interoperable second implementation.

The training tool for creating the data tables given Wikipedia dumps would be possible to explain as well. It runs in two phases. The first one computes statistics from Wikipedia dumps, and the second one generates Rust files from the statistics. The intermediate statistics are available, so as long as you don’t change the character classes or the set of languages, which would invalidate the format of the statistics, you can make changes to the later parts of the code generation and re-run it from the same statistics that I used.

Foundational Ideas

  • The most foundational idea of chardetng is the observation that legacy CJK encodings have enough structure to them that if you take bytes and a decoder for a legacy CJK encoding and try to decode the bytes with the decoder, if the bytes aren’t intended to be in that encoding, the decoder will report an error sooner or later, with the exception that EUC-family encodings may decode without error as other EUC-family encodings. And that a browser (or any app for that matter) that wants to make useful use of the detection result has to have those decoders anyway. (A minimal code sketch of this decoder-based validation appears after this list.)

  • The EUC family (EUC-JP, EUC-KR, and GBK—the GB naming is more commonly used than the EUC-CN name) can be distinguished by observing that Japanese text has kana, which distinguishes EUC-JP, and Hanja is very rare in Korean, so if enough EUC byte pairs fall outside the KS X 1001 Hangul range, chances are the text is Chinese in GBK.

  • Some single-byte encodings have unassigned bytes. If an unassigned byte occurs, the encoding can be removed from consideration. C1 controls can be treated as if unassigned.

  • Bicameral scripts have a certain regularity to capital letter use. Even though e.g. brand names can violate these regularities, the bulk of text should consist of words that are lower-case, start with an upper-case letter, or are in all-caps.

  • Non-Latin letters generally don’t occur right before or right after Latin letters. Hence, pairing an ASCII letter with a non-ASCII letter is indicative of the Latin script and should be penalized in non-Latin-script encodings.

  • Single-byte encodings can be distinguished well enough from the relative probabilities of byte pairs excluding ASCII pairs. These pairings don’t need to and should not be analyzed on exact byte values, but the bytes should be classified more coarsely such that ASCII punctuation and parentheses, etc., are classified as space-equivalent and at least upper and lower case of a given letter are unified.

  • Pairs of ASCII bytes should be neutral in terms of the detection outcome apart from ISO-2022-JP detection and cases where the first ASCII byte of a pair is interpreted as a trail byte of a legacy CJK sequence. This way, the detector ignores various computer-readable syntaxes such as HTML (for deployment) and MediaWiki syntax (for training) without any syntax-specific state machine(s).

  • Visual Hebrew can be distinguished by observing the placement of ASCII punctuation relative to non-ASCII words.

  • Avoid detecting encodings that have never worked without declaration in any localization of a major browser.
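
As a minimal sketch of the first idea above (illustrative only, not chardetng’s actual code), assuming the encoding_rs crate that Firefox already uses for decoding: try to decode candidate bytes with each legacy CJK decoder and drop any encoding whose decoder reports an error. A real detector feeds data incrementally, keeps per-encoding decoder state, and combines this signal with the other ideas listed above.

```rust
// Cargo.toml (assumed): encoding_rs = "0.8"
use encoding_rs::{Encoding, BIG5, EUC_JP, EUC_KR, GBK, ISO_2022_JP, SHIFT_JIS};

/// Returns true if `bytes` decode without a single error in the given
/// legacy CJK encoding. This sketch decodes everything in one call.
fn decodes_cleanly(encoding: &'static Encoding, bytes: &[u8]) -> bool {
    let mut decoder = encoding.new_decoder();
    // Allocate the worst-case UTF-8 output size for a single call.
    let mut out = vec![0u8; decoder.max_utf8_buffer_length(bytes.len()).unwrap()];
    let (_, _, _, had_errors) = decoder.decode_to_utf8(bytes, &mut out, true);
    !had_errors
}

fn main() {
    let bytes = b"\x93\xfa\x96\x7b\x8c\xea"; // "日本語" encoded as Shift_JIS
    for enc in [SHIFT_JIS, EUC_JP, EUC_KR, GBK, BIG5, ISO_2022_JP] {
        println!("{}: {}", enc.name(), decodes_cleanly(enc, bytes));
    }
}
```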

There are two major observations to make of the above ideas:

  1. The first point fundamentally trades off accuracy for short inputs for legacy CJK encodings in order to minimize binary size.
  2. The rules are all super-simple to implement except that finding out the relative probabilities of character class pairs for single-byte encodings requires some effort.

Included and Excluded Encodings

The last point on the foundational idea list suggests that chardetng does not detect all encodings in the Encoding Standard. This indeed is the case. After all, the purpose of the detector is not to catch them all but to deal with the legacy that has arisen from locale-specific defaults and, to some extent, previously-deployed detectors. Most encodings in the Encoding Standard correspond to the “ANSI” code page of some Windows localization and, therefore, the default fallback in some localization of Internet Explorer. Additionally, ISO-8859-2 and ISO-8859-7 have appeared as the default fallback in non-Microsoft browsers. Some encodings require a bit more justification for why they are included or excluded.

ISO-2022-JP and EUC-JP are included, because multiple browsers have shipped on-by-default detectors over most of the existence of the Web with these as possible outcomes. Likewise, ISO-8859-5, IBM866, and KOI8-U (the Encoding Standard name for the encoding officially known as KOI8-RU) are detected, because they were detected by Gecko and are detected by IE and Chrome. The data published as part of ced also indicates that these encodings have actually been used on the Web.

ISO-8859-4 and ISO-8859-6 are included, because both IE and Chrome detect them, they have been in the menu in IE and Firefox practically forever, and the data published as part of ced indicates that they have actually been used on the Web.

ISO-8859-13 is included, because it has the same letter assignments as windows-1257, so browser support for windows-1257 has allowed ISO-8859-13 to work readably. However, disqualifying an encoding based on one encoding error could break such compatibility. To avoid breakage due to eager disqualification of windows-1257, chardetng supports ISO-8859-13 explicitly despite it not qualifying for inclusion on the usual browser legacy behavior grounds. (The non-UTF-8 glibc locales for Lithuanian and Latvian used ISO-8859-13. Solaris 2.6 used ISO-8859-4 instead.)

ISO-8859-8 is supported, because it has been in the menu in IE and Firefox practically forever, and if it wasn’t explicitly supported, the failure mode would be detection as windows-1255 with the direction of the text swapped.

KOI8-R, ISO-8859-8-I, and GB18030 are detected as KOI8-U, windows-1255, and GBK instead. KOI8-U differs from KOI8-R by assigning a couple of box drawing bytes to letters instead, so at worst the failure mode of this unification is some box drawing segments (which aren’t really used on the Web anyway) showing up as letters. windows-1255 is a superset of ISO-8859-8-I except for swapping the currency symbol. GBK and GB18030 have the same decoder in the Encoding Standard. However, chardetng makes no attempt to detect the use of GB18030/GBK for content other than Simplified Chinese despite the ability to also represent other content being a key design goal of GB18030. As far as I am aware, there is no legacy mechanism that would have allowed Web authors to rely on non-Chinese usage of GB18030 to work without declaring the encoding. (I haven’t evaluated how well Traditional Chinese encoded as GBK gets detected.)

x-user-defined as defined in the Encoding Standard (as opposed to how an encoding of the same name is defined in IE’s mlang.dll) is relevant to XMLHttpRequest but not really to HTML, so it is not a possible detection outcome.

The legacy encodings that the Encoding Standard maps to the replacement encoding are not detected. Hence, the replacement encoding is not a possible detection outcome.

UTF-16BE and UTF-16LE are not detected by chardetng. They are detected from the BOM outside chardetng. (Additionally for compatibility with IE’s U+0000 ignoring behavior as of 2009, Firefox has a hack to detect Latin1-only BOMless UTF-16BE and UTF-16LE.)

The macintosh encoding (better known as MacRoman) is not detected, because it has not been the fallback for any major browser. The usage data published as part of ced suggests that the macintosh encoding exists on the Web, but the data looks a lot like the data is descriptive of ced’s own detection results and is recording misdetection.

x-mac-cyrillic is not detected, because it isn’t detected by IE and Chrome. It was previously detected by Firefox, though.

ISO-8859-3 is not detected, because it hasn’t been the fallback for any major browser or a menu item in IE. The IE4 character encoding documentation published on the W3C’s site remarks of ISO-8859-3: “not used in the real world”. The languages for which this encoding is supposed to be relevant are Maltese and Esperanto. Despite glibc having an ISO-8859-3 locale for Maltese, the data published as part of ced doesn’t show ISO-8859-3 usage under the TLD of Malta. The ced data shows usage for Catalan, but this is likely a matter of recording misdetection, since Catalan is clearly an ISO-8859-1/windows-1252 language.

ISO-8859-10 is not detected, because it hasn’t been the fallback for any major browser or a menu item in IE. Likewise, the anachronistic (post-dating UTF-8) late additions to the series, ISO-8859-14, ISO-8859-15, and ISO-8859-16, are not detected for the same reason. Of these, ISO-8859-15 differs from windows-1252 so little and for such rare characters that detection isn’t even particularly practical, and having ISO-8859-15 detected as windows-1252 is quite tolerable.

Pairwise Probabilities for Single-Byte Encodings

I assigned the single-byte encodings to groups such that multiple encodings that address roughly the same character repertoire are together. Some groups only have one member. E.g. windows-1254 is alone in a group. However, the Cyrillic group has windows-1251, KOI8-U, ISO-8859-5, and IBM866.

I assigned character classes according to the character repertoire of each group. Unifying upper and lower case can be done algorithmically. Then the groupings of characters can be listed somewhere and a code generator can generate lookup tables that are indexed by byte value yielding another 8-bit number. I reserved the most significant bit of the number read from the lookup tables to indicate case for bicameral scripts. I also split the classification lookup table in the middle in order to reuse the ASCII half across the cases that share the ASCII half (non-Latin vs. non-windows-1254 Latin vs. windows-1254). windows-1254 requires a different ASCII classification in order not to treat ‘i’ and ‘I’ as a case pair. Other than that, individual ASCII letters matter as distinct classes for Latin-script encodings but not for non-Latin-script encodings.
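
To illustrate the table layout just described (with made-up values, not chardetng’s real tables): each byte maps to a coarse class number whose most significant bit flags upper case, and the lookup is split into an ASCII half that can be shared between encoding groups and an upper half that is specific to each group.

```rust
/// Illustrative sketch only; the class numbering here is hypothetical.
/// The low 7 bits are the character class; the most significant bit marks
/// an upper-case letter so that case patterns in bicameral scripts can be
/// scored separately.
const CASE_BIT: u8 = 0x80;

/// The lookup is split in the middle: the ASCII half can be shared between
/// encoding groups that classify ASCII identically, while the 0x80..=0xFF
/// half is specific to each single-byte encoding group.
fn classify(byte: u8, ascii_half: &[u8; 128], upper_half: &[u8; 128]) -> u8 {
    if byte < 0x80 {
        ascii_half[usize::from(byte)]
    } else {
        upper_half[usize::from(byte) - 0x80]
    }
}

fn is_upper(class: u8) -> bool {
    class & CASE_BIT != 0
}

fn main() {
    // Hypothetical tables: everything classified as class 0 ("space-like"),
    // just to exercise the lookup.
    let ascii_half = [0u8; 128];
    let upper_half = [0u8; 128];
    let class = classify(b'A', &ascii_half, &upper_half);
    println!("class {class}, upper: {}", is_upper(class));
}
```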

I looked at the list of Wikipedias. I excluded languages whose Wikipedias have fewer than 10000 articles, languages that don’t have a legacy of having been written in the single-byte encodings in the above-mentioned groupings, and languages whose orthography is all-ASCII or nearly all-ASCII. Then I assigned the remaining languages to the encoding groups.

I wrote a tool that ingested the Wikipedia database dumps for those languages and for each language normalized the input to Unicode Normalization Form C (just in case; I didn’t bother examining how well Wikipedia was already in NFC) and then classified each Unicode scalar value according to the classification built above. Characters not part of the mapping were considered equivalent to spaces: such unmappable characters would be represented as numeric character references in single-byte encodings, so their adjacencies would in practice be with ampersand and semicolon, and ampersand and semicolon were treated as being in the same equivalence class as space.

The program counted the pairs, ignoring ASCII pairs, divided the count for each pair by the total (non-ASCII) pair count and divided also by class size (with a specially-picked divisor for the space-equivalent class), where class size didn’t consider case (i.e. upper and lower case didn’t count as two). Within an encoding group, the languages were merged together by taking the maximum value across languages for each character class pair. If the original count was actually zero, i.e. no occurrence of the pair in relevant Wikipedias, the output lookup table got an implausibility marker (number 255). Otherwise, the floating point results were scaled to the 0 to 254 range to turn them into relative scores that fit into a byte. The scaling was done such that the highest spikes got clipped in a way that retained reasonable value range otherwise instead of mapping the highest spike to 254. I also added manually-picked multipliers for some encoding groups. This made it possible to e.g. boost Greek a bit relative to Cyrillic, which made accuracy better for short inputs, and for long inputs ends up correcting for misdetecting Cyrillic as Greek anyway because Greek has unmapped bytes such that sufficiently long windows-1251 input ends up disqualifying the Greek encodings according to the rule that a single unmapped byte disqualifies an encoding from consideration.
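
In code terms, the score-table computation described above looks roughly like the following sketch. The data structures and the exact divisor handling are assumptions, and the spike clipping and the manually-picked per-group multipliers mentioned above are omitted:

```rust
// Hypothetical simplification of the training-time score computation;
// names and data structures are illustrative.
const IMPLAUSIBLE: u8 = 255;

fn scores_for_group(
    per_language_counts: &[Vec<u64>], // one flat pair-count grid per language
    total_pairs: &[u64],              // total non-ASCII pair count per language
    class_sizes: &[f64],              // per-class divisor (space class hand-picked)
) -> Vec<u8> {
    let classes = class_sizes.len();
    let mut merged = vec![0f64; classes * classes];
    for (lang, counts) in per_language_counts.iter().enumerate() {
        for (i, &count) in counts.iter().enumerate() {
            let (row, col) = (i / classes, i % classes);
            // Relative frequency; dividing by both classes' sizes is an
            // assumption about what "divided also by class size" means here.
            let freq = count as f64
                / total_pairs[lang] as f64
                / (class_sizes[row] * class_sizes[col]);
            // Merge the group's languages by taking the maximum per pair.
            merged[i] = merged[i].max(freq);
        }
    }
    let max = merged.iter().cloned().fold(0.0_f64, f64::max);
    merged
        .iter()
        .map(|&v| {
            if v == 0.0 {
                IMPLAUSIBLE // pair never seen in any relevant Wikipedia
            } else {
                ((v / max) * 254.0) as u8 // scale to the 0..=254 range
            }
        })
        .collect()
}
```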

Thanks to Rust compiling to very efficient code and Rayon making it easy to process as many Wikipedias in parallel as there are hardware threads, this is a quicker processing task than it may seem.

The pairs logically form a two-dimensional grid of scores (and case bits for bicameral scripts) where the rows and columns are the character classes participating in the pair. Since ASCII pairs don’t contribute to the score, the part of the grid that would correspond to ASCII-ASCII pairs is not actually stored.
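
A hypothetical way to index such a grid without storing the ASCII-by-ASCII quadrant (the class counts and the layout are made up for illustration):

```rust
// Hypothetical layout: ASCII classes are numbered first, and the flattened
// score array simply omits the ASCII-by-ASCII quadrant.
const ASCII_CLASSES: usize = 30;  // illustrative count
const TOTAL_CLASSES: usize = 50;  // illustrative count
const NON_ASCII_CLASSES: usize = TOTAL_CLASSES - ASCII_CLASSES;

fn score_index(prev_class: usize, cur_class: usize) -> Option<usize> {
    if prev_class < ASCII_CLASSES && cur_class < ASCII_CLASSES {
        None // ASCII pairs never contribute to the score, so they aren't stored
    } else if prev_class < ASCII_CLASSES {
        // ASCII row, non-ASCII column: only the non-ASCII columns are stored.
        Some(prev_class * NON_ASCII_CLASSES + (cur_class - ASCII_CLASSES))
    } else {
        // Non-ASCII row: full-width rows stored after all the ASCII rows.
        Some(ASCII_CLASSES * NON_ASCII_CLASSES
            + (prev_class - ASCII_CLASSES) * TOTAL_CLASSES
            + cur_class)
    }
}
```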

Synthesizing Legacy from Present-Day Data

Wikipedia uses present-day Unicode-enabled orthographies. For some languages, this is not the same as the legacy 8-bit orthography. Fortunately, with the exception of Vietnamese, going from the current orthography to the legacy orthography is a simple matter of replacing specific Unicode scalar values with other Unicode scalar values one-to-one. I made the following substitutions for training (also in upper case):

  • For Azerbaijani, I replaced ə with ä to synthesize the windows-1254-compatible 1991 orthography.
  • For Mongolian, I replaced ү with ї and ө with є to apply a convention that uses Ukrainian characters to allow the use of windows-1251.
  • For Romanian, I replaced ș with ş and ț with ţ. Unicode disunified the comma-below characters from the cedilla versions at the request of the Romanian authorities, but the 8-bit legacy encodings had them unified.

These were the languages that pretty obviously required such replacements. I did not investigate the orthography of languages whose orthography I didn’t already expect to require measures like this, so there is a small chance that some other language in the training set would have required similar substitution. I’m fairly confident that that isn’t the case, though.
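
A minimal sketch of such a substitution pass; the character mappings are exactly the ones listed above, while everything else about the code is illustrative:

```rust
// Training-time substitution of the present-day orthography with the legacy
// orthography; lower case shown, upper case handled analogously in training.
fn legacy_substitute(c: char) -> char {
    match c {
        'ə' => 'ä',  // Azerbaijani: 1991 windows-1254-compatible orthography
        'ү' => 'ї',  // Mongolian: Ukrainian letters for windows-1251
        'ө' => 'є',
        'ș' => 'ş',  // Romanian: cedilla forms instead of comma-below
        'ț' => 'ţ',
        _ => c,
    }
}

fn synthesize_legacy(text: &str) -> String {
    text.chars().map(legacy_substitute).collect()
}
```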

For Vietnamese the legacy synthesis is a bit more complicated. windows-1258 cannot represent Vietnamese in Unicode Normalization Form C. There is a need to decompose the characters. I wrote a tiny crate that performs the decomposition and ran the training with both plausible decompositions:

  • The minimal decomposition: This could plausibly arise when converting IME-originating NFC data to windows-1258. In this case, if a base character combined with a tone forms a combination that exists precomposed in windows-1258 (because it appears precomposed in windows-1252), it is not decomposed.
  • The orthographic decomposition: This is the decomposition that arises naturally when using the standard Vietnamese keyboard layout (as opposed to an IME) without normalization.

I assume that the languages themselves haven’t shifted so much from the early days of the Web that the pairwise frequencies observed from Wikipedia would not work. Also, I’m assuming that encyclopedic writing doesn’t disturb the pairwise frequencies too much relative to other writing styles and topics. Intuitively this should be true for alphabetic writing, but I have no proof. (Notably, this isn’t true for Japanese. If one takes a look at the most frequent kanji in Wikipedia titles and the most frequent kanji in Wikipedia text generally, the titles are biased towards science-related kanji.)

Misgrouped Languages

The Latin script does not fit in an 8-bit code space (and certainly not in the ISO-style code space that wastes 32 code points for the C1 controls). Even the Latin script as used in Europe does not fit in an 8-bit code space when assuming precomposed diacritic combinations. For this reason, there are multiple Latin-script legacy encodings.

Especially after the 1990s evolution of the ISO encodings into the Windows encodings, two languages, Albanian and Estonian, are left in a weird place in terms of encoding detection. Both Albanian and Estonian can be written using windows-1252 but their default “ANSI” encoding in Windows is something different: windows-1250 (Central European) and windows-1257 (Baltic), respectively.

With Albanian, the case is pretty clear. Orthographically, Albanian is a windows-1252-compatible language, and the data (from 2007?) that Google published as part of ced shows the Albanian language and the TLD for Albania strongly associated with windows-1252 and not at all associated with either windows-1250 or ISO-8859-2. It made sense to put Albanian in the windows-1252 training set for chardetng.

Estonian is a trickier case. Estonian words that aren’t recent (from the last 100 years or so?) loans can be written using the ISO-8859-1 repertoire (the non-ASCII letters being õ, ä, ö, and ü; et_EE without further suffix in glibc is an ISO-8859-1 locale, and et was an ISO-8859-1 locale in Solaris 2.6, too). Clearly, Estonian has better detection synergy with Finnish, German, and Portuguese than with Lithuanian and Latvian. For this reason, chardetng treats Estonian as a windows-1252 language.

Although the official Estonian character repertoire is fully part of windows-1252 in the final form of windows-1252 reached in Windows 98, it wasn’t the case when Windows 95 introduced an Estonian localization of Windows and the windows-1257 Baltic encoding. At that time, windows-1252 didn’t yet have ž, which was added in Windows 98, presumably in order to match all the letter additions that ISO-8859-15 got relative to ISO-8859-1. (ISO-8859-15 got ž and š as a result of lobbying by the Finnish language regulator which insists that these letters be used for certain loans in Finnish. windows-1252 already had š in Windows 95.) While the general design principle of the windows-125x series appears to be that if a character occurs in windows-1252, it is in the same position in the other windows-125x encodings that it occurs in, this principle does not apply to the placement of š and ž in windows-1257. ISO-8859-13 has the same placement of š and ž as windows-1257. ISO-8859-4 has yet another placement. (As does ISO-8859-15, which isn’t a possible detection outcome of chardetng.) The Estonian-native vowels are in the same positions in all these encodings.

The Estonian language regulator designates š and ž as part of the Estonian orthography, but these characters are rare in Estonian, since they are only used in recent loans and in transliteration. It’s completely normal to have Web pages where the entire page contains neither of them, or contains only a single occurrence of one of them. Still, they are common enough that you can find them on a major newspaper site with a few clicks. This, obviously, is problematic for encoding detection. Estonian gets detected as windows-1252. If the encoding actually was windows-1257, ISO-8859-13, ISO-8859-4, or ISO-8859-15, the (likely) lone instance of š or ž gets garbled.

It would be possible to add Estonian-specific post-processing logic to map the windows-1252 result to windows-1257, ISO-8859-13, or ISO-8859-4 (or even ISO-8859-15) if the content looks Estonian based on the Estonian non-ASCII vowels being the four most frequent non-ASCII letters and then checking which encoding is the best fit for š and ž. However, I haven’t taken the time to do this, and it would cause reloads of Web pages just to fix maybe one character per page.

Refinements

The foundational ideas described above weren’t quite enough. Some refinements were needed. I wrote a test harness that synthesized input from Wikipedia titles and checked how well titles that encoded to non-ASCII were detected in such a way that they roundtripped. (In some cases, there are multiple encodings that roundtrip a string. Detecting any one of them was considered a success.) I looked at the cases that chardetng failed but ced succeeded at.
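
A sketch of the kind of roundtrip check involved, using the chardetng and encoding_rs crates; how titles are sourced and which encodings are tried per language are elided here:

```rust
use chardetng::EncodingDetector;
use encoding_rs::Encoding;

/// Returns true if detection of the encoded title yields an encoding that
/// decodes the bytes back to the original title (any such encoding counts).
fn roundtrips(title: &str, encoding: &'static Encoding) -> bool {
    let (bytes, _, _) = encoding.encode(title);
    let mut detector = EncodingDetector::new();
    detector.feed(&bytes, true);
    // No TLD hint; UTF-8 is not allowed as an outcome, as in HTML detection.
    let guess = detector.guess(None, false);
    let (decoded, _, _) = guess.decode(&bytes);
    decoded == title
}
```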

When looking at the results, it was pretty easy to figure out why a particular case failed, and it was usually pretty quick and easy to come up with and implement a new rule to address the failure mode. This iteration could be continued endlessly, but I stopped early when the result seemed competitive enough compared to ced. However, the last two items on the list were made in response to a bug report. Here are the adjustments that I made.

  • Some non-ASCII letters in Latin-script encodings got high pair-wise scores leading to misdetecting non-Latin as Latin. I remedied this by penalizing sequences of three non-ASCII letters in Latin encodings a little and penalizing sequences of four or more a lot. (Polish and Turkish do have some legitimate sequences of four non-ASCII letters, though.)

  • Treating non-ASCII punctuation and symbols as space-equivalent didn’t work, because pairing a letter with a space-like byte tends to score high but some encodings assign letters where others have symbols or punctuation. Therefore, when bytes were intended to be two letters but could be interpreted as a symbol and letter in another encoding, the latter interpretation scored higher due to the letter and space-like combination scoring higher. Such symbols and punctuation needed new non-space-equivalent character classes. I adjusted things such that no byte above 0xA0 in any single-byte encoding can get grouped together in the same character class as the ASCII space. No-break space when assigned to 0xA0 as well as Windows curly quotes and dashes when assigned below 0xA0 remain in the same character class as the ASCII space, though.

  • Symbols in the 0xA1 to 0xBF byte range were split further into classes that have different implausibility characteristics before or after letters (or both) or that are implausible next to characters from their own class. (Things like ® typically appear after letters but © appears before letters if next to a letter at all, and ®® and ©© are very unlikely.)

  • Some other characters had existence proof of occurring in pairs in Wikipedia that are in practice extremely unlikely and benefited from manually-forced implausibility. Examples include forcing implausibility of a letter after Greek final sigma, forcing implausibility of left-to-right mark and right-to-left mark next to a letter (they are supposed to be used between punctuation and space), and marking Vietnamese tones as implausible after letters other than the ones that are legitimate base characters in Vietnamese orthography. (Note that an implausibility penalty isn’t absolutely disqualifying, so long input can tolerate isolated implausibilities.)

  • Giving non-ASCII bicameral-script words that start with a capital letter a slight boost helped distinguish between Greek and the various non-Windows Cyrillic encodings.

  • Splitting the windows-1252 model into two: Icelandic and Faroese on one hand and the rest on the other. Since characters that (within the windows-1252 language set) are specific to Icelandic and Faroese are the first ones that get replaced with something else in other Latin-script Windows and ISO encodings, not merging their use with scores for other windows-1252 languages helps keep the models for windows-1257 and windows-1254 relatively more distinctive.

  • I made use of the fact that present-day Korean uses ASCII spaces between words while Chinese and Japanese don’t.

  • Giving the same score to any Han character turned out to be a bad idea in terms of being able to distinguish legacy CJK encodings from non-Latin single-byte encodings. Fortunately, the legacy CJK encodings have a coarse frequency classification built in, and the most frequent class has nice properties relative to single-byte encodings.

    JIS X 0208, GB2312, and the original Big5 have their kanji/hanzi organized into two levels roughly according to respective locales’ education systems’ classification at the time of initial standard creation. That is, Level 1 corresponds to the most frequent kanji/hanzi that the education systems prioritize. KS X 1001 instead splits into common Hangul and to hanja (very rare). This means that just by looking at the bytes, it’s possible to classify legacy CJK characters into three frequency classes: Level 1 (or Hangul), Level 2 (or hanja in the case of EUC-KR), and other (rare Hangul in the case of EUC-KR). These three can, and now are, given different scores.

    Moreover, these have the fortuitous byte mapping that the Level 1 or common Hangul section in each encoding uses lower byte values for the lead byte while non-Thai, non-Arabic Windows and ISO encodings use high byte values for the common non-ASCII characters: for lower case in bicameral scripts or, in the case of Hebrew, for the consonants. This yields naturally distinctive scoring except for Thai and, to a lesser extent, Arabic.

  • Some lead bytes in CJK encodings that overlap with windows-125x non-ASCII punctuation are problematic, because they pair with ASCII trail bytes in ways that can occur in Latin-script text without other adjacent letters that would trigger a Latin adjacency penalty. For example, without intervention, “Rock ’n Roll” could get interpreted as “Rock 地 Roll”. I made it so that the score for CJK characters with a problematic lead byte is committed only if the next character is a CJK character, too.

  • The differences between EUC-JP (presence of kana), EUC-KR (mainly just Hangul and with spaces between words), and GBK don’t necessarily show up in short titles. This is to be expected given the fundamental bet made in the design. Still, this made chardetng look bad relative to ced. I deviated from the plan of not having CJK frequency tables by including tables of the most frequent JIS X 0208 Level 1 Kanji, the most frequent GB2312 Level 1 Hanzi, and the most frequent KS X 1001 Hangul. I set the cutoff for “most frequent” to 128, so the resulting tables ended up being very small but still effective. (Big5 is structurally distinctive even with short inputs, so after trying including a table of the most frequent Big5 Level 1 Hanzi, I removed the table as unnecessary.)

  • Thai needed byte range-specific multipliers to tune it relative to GBK.

  • It was impractical to give score to windows-1252 ordinal indicators in the pairwise model without breaking Romanian detection. For this reason, there’s a state machine that gives score to ordinal indicator usage by taking a little more context into account than just byte pairs. This boosted detection accuracy especially for Italian but also for Portuguese, Castilian, Catalan, and Galician.

  • The byte 0xA0 is no-break space in most encodings. To avoid misdetecting an odd number of no-break spaces as IBM866 and to avoid misdetecting an even number as Chinese or Korean, 0xA0 is treated as a problematic lead for CJK purposes, and there’s a special case not to apply the score to IBM866 from certain combinations involving 0xA0.

  • To avoid detecting windows-1252 English as windows-1254, Latin candidates don’t count a score for a pair that involves an ASCII byte and a space-like non-ASCII byte. Otherwise, the score for Turkish dotless ı in word-final position would be applied to English I’ (as in “I’ve” or “I’m”). While this would decode to the right characters and look right, it would cause an unnecessary reload in Firefox.

While the single-byte letter pair scores arose from data and the Level 1 Kanji/Hanzi was then calibrated relative to Arabic scoring (and the common Hangul score is one higher than that with a penalty for implausibly long words to distinguish from Chinese given enough input), in the interest of expediency, I assigned the rest of the scoring, including penalty values, as educated guesses rather than trying to build any kind of training framework to try to calibrate them optimally.
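
To make the flavor of these hand-assigned adjustments concrete, here is an illustrative sketch (not chardetng’s actual code or values) of the first refinement in the list above, the penalty for long runs of non-ASCII letters in Latin-script candidates:

```rust
// Illustrative only: a penalty for long runs of non-ASCII letters in a
// Latin-script candidate, roughly as described in the first bullet point
// above. The penalty values are placeholders, not chardetng's.
fn non_ascii_run_adjustment(run_length: u32) -> i64 {
    match run_length {
        0..=2 => 0,  // short runs are normal in Latin scripts
        3 => -1,     // penalize a little
        _ => -8,     // penalize a lot
    }
}
```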

TLD-Awareness

The general accuracy characterization relates to generic domains, such as .com. In the case of ccTLDs that actually appear to be in local use (such as .fi) as opposed to generic use (such as .tv), chardetng penalizes encodings that are less plausible for the ccTLD but are known to be confusable with the encoding(s) plausible for the ccTLD. Therefore, typical legacy content on ccTLDs is even more likely to get the right guess than legacy content on generic domains. On the flip side, the guess may be worse for atypical legacy content on ccTLDs.

Integration with Firefox

Firefox asks chardetng to provide a guess twice during the loading of a page. If there is no HTTP-level character encoding declaration and no BOM, Firefox buffers up to 1024 bytes for the <meta charset> pre-scan. Therefore, if the pre-scan fails and content-based detection is necessary, there always already is a buffer of the first 1024 bytes. chardetng makes its first guess from that buffer and the top-level domain. This doesn’t involve reloading anything, because the first 1024 bytes haven’t been decoded yet.

Then, when the end of the stream is reached, chardetng guesses again. If the guess differs from the earlier guess, the page is reloaded using the new guess. By making the second guess at the end of the stream, chardetng has the maximal information available, and there is no need to estimate what portion of the stream would have been sufficient to make the guess. Also, unlike Chrome’s approach of examining the first chunk of data that the networking subsystem passes to the HTML parser, this approach does not depend on how the stream is split into buffers. If the title of the page fit into the first 1024 bytes, chances are very good that the first guess was already correct. (This is why evaluating accuracy for “title-length input” was interesting.) Encoding-related reloads are a pre-existing behavior of Gecko. A <meta charset> that does not fit in the first 1024 bytes also triggers a reload.
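
In terms of the chardetng crate API, the two guesses look roughly like this; the buffering, pre-scan, and reload machinery around them are Gecko-specific and only hinted at in comments:

```rust
use chardetng::EncodingDetector;
use encoding_rs::Encoding;

// Sketch of the two-phase guessing described above. `tld` is the top-level
// domain as lower-case ASCII bytes; the reload itself is Gecko's job.
fn detect_twice(
    first_1024: &[u8],
    rest_of_stream: &[u8],
    tld: Option<&[u8]>,
) -> (&'static Encoding, &'static Encoding) {
    let mut detector = EncodingDetector::new();

    // First guess: made from the same (up to) 1024 bytes that were already
    // buffered for the <meta charset> pre-scan; nothing has been decoded yet.
    detector.feed(first_1024, false);
    let initial = detector.guess(tld, false);

    // Second guess: at the end of the stream, with maximal information.
    detector.feed(rest_of_stream, true);
    let final_guess = detector.guess(tld, false);

    // If final_guess differs from initial, Firefox reloads using final_guess.
    (initial, final_guess)
}
```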

In addition to the .in and .lk TLDs being exempt from chardetng (explanation why comes further down), the .jp TLD uses a Japanese-specific detector that makes its decision as soon as logically possible to decide between ISO-2022-JP, Shift_JIS and EUC-JP rather than waiting until the end of the stream.

Evaluation

So is it any good?

Accuracy

Is it more or less accurate than ced? This question does not have a simple answer. First of all, accuracy depends on the input length, and chardetng and ced scale down differently. They also scale up differently. After all, as noted, one of the fundamental bets of chardetng was that it was OK to let legacy CJK detection accuracy scale down less well in order to have binary size savings. In contrast, ced appears to have been designed to scale down well. However, ced in Chrome gets to guess once per page but chardetng in Firefox gets to revise its guess. Second, for some languages ced is more accurate and for others chardetng is more accurate. There is no right way to weigh the languages against each other to come up with a single number.

Moreover, since so many languages are windows-1252 languages, just looking at the number of languages is misleading. A detector that always guesses windows-1252 would be more accurate for a large number of languages, especially with short inputs that don’t exercise the whole character repertoire of a given language. In fact, guessing windows-1252 for pretty much anything Latin-script makes the old chardet (tested as the Rust port) look really good for windows-1252. (Other than that, both chardetng and ced are clearly so much better than ICU4C and old chardet that I will limit further discussion to chardetng vs. ced.)

There are three measures of accuracy that I think are relevant:

Title-length accuracy

Given page titles that contain at least one non-ASCII character, what percentage is detected right (where “right” is defined as the bytes decoding correctly; for some bytes there may be multiple encodings that decode the bytes the same way, so any of them is “right”)?

Since Firefox first asks chardetng to guess from the first 1024 bytes, chances are that the only bit of non-ASCII content that participates in the guess is the page title.

Document-length accuracy

Given documents that contain at least one non-ASCII character, what percentage is detected right (where “right” is defined as the bytes decoding correctly; for some bytes there may be multiple encodings that decode the bytes the same way, so any of them is “right”)?

Since Firefox asks chardetng to guess again at the end of the stream, it is relevant how well the detector does with full-document input.

Document-length-equivalent number of non-ASCII bytes

Given the guess that is made from a full document, what prefix length, counted as number of non-ASCII bytes in the prefix, is sufficient to look at for getting the same result?

The set of languages I tested was the set of languages that have Web Platform-relevant legacy encodings, have at least 10000 articles in the language’s Wikipedia, and are not too close to being ASCII-only: an, ar, az, be, bg, br, bs, ca, ce, cs, da, de, el, es, et, eu, fa, fi, fo, fr, ga, gd, gl, he, hr, ht, hu, is, it, ja, ko, ku, lb, li, lt, lv, mk, mn, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sq, sr, sv, th, tr, uk, ur, vi (orthographically decomposed), vi (minimally decomposed), wa, yi, zh-hans, zh-hant. (The last two were algorithmically derived from the zh Wikipedia using MediaWiki’s own capabilities.)

It’s worth noting that it’s somewhat questionable to use the same data set for training and for assessing accuracy. The accuracy results would be stronger if the detector were shown to be accurate on a data set independent of the training data. In this sense, the results for ced are stronger than the results for chardetng. Unfortunately, it’s not simple to obtain a data set alternative to Wikipedia, which is why the same data set is used for both purposes.

Title-Length Accuracy

If we look at Wikipedia article titles (after rejecting titles that encode to all ASCII, kana-containing titles in Chinese Wikipedia, and some really short mnemonic titles for Wikipedia meta pages), we can pick some accuracy threshold and see how many languages ced and chardetng leave below the threshold.

No matter what accuracy threshold is chosen, ced leaves more combinations of language and encoding below the threshold, but among the least accurate are Vietnamese, which is simply unsupported by ced, and ISO-8859-4 and ISO-8859-6, which are not as relevant as Windows encodings. Still, I think it’s fair to say that chardetng is overall more accurate on this title-length threshold metric, although it’s still possible to argue about the relative importance. For example, one might argue that considering the number of users it should matter more which one does better on Simplified Chinese (ced) than on Breton or Walloon (chardetng). This metric is questionable, because there are so many windows-1252 languages (some of which make it past the 10000 Wikipedia article threshold by having lots of stub articles) that a detector that always guessed windows-1252 would get a large number of languages right (old chardet is close to this characterization).

If we put the threshold at 80%, the only languages that chardetng leaves below the threshold are Latvian (61%) and Lithuanian (48%). Fixing the title-length accuracy for Latvian and Lithuanian is unlikely to be possible without a binary size penalty. ced spends 8 KB on a trigram table that improves Latvian and Lithuanian accuracy somewhat but still leaves them less accurate than most other Latin-script languages. An alternative possibility would be to have distinct models for Lithuanian and Latvian to be able to boost them individually such that the boosted models wouldn’t compete with languages that match the combination of Lithuanian and Latvian but don’t match either individually. The non-ASCII letter sets of Lithuanian and Latvian are disjoint except for č, š and ž. Anyway, as noted earlier, the detector is primarily for the benefit of non-Latin scripts, since the failure mode for Latin scripts is relatively benign. For this reason, I have not made the effort to split Lithuanian and Latvian into separate models.

The bet on trading away the capability to scale down for legacy CJK shows up most clearly for GBK: chardetng is only 88% accurate on GBK-encoded Simplified Chinese Wikipedia titles while ced is 95% accurate. This is due to the GBK accuracy of chardetng being bad with fewer than 6 hanzi. Five syllables is still a plausible Korean word length, so the penalty for implausibly long Korean words doesn’t take effect to decide that the input is Chinese. Overall, I think failing initially for cases where the input is shorter than 6 hanzi is a reasonable price to pay for the binary size savings. After all, once the detector has seen the whole page, it can for sure correct itself and figure out the distinction between GBK and EUC-KR. (Likewise, if the 80% threshold from the previous paragraph seems bad, note that a one-in-five failure rate just means that one in five cases needs the guess corrected after the whole page has been examined.)

In the table, language/encoding combinations for which chardetng is worse than ced by more than one percentage point are highlighted with bold font and tomato background. The combinations for which chardetng is worse than ced by one percentage point are highlighted with italic font and thistle background.

Language | Encoding | chardetng | ced | chardet | ICU4C
an | windows-1252 | 98% | 97% | 100% | 92%
ar | ISO-8859-6 | 89% | 49% | 0% | 64%
ar | windows-1256 | 88% | 98% | 1% | 41%
az | windows-1254 | 92% | 73% | 44% | 49%
be | ISO-8859-5 | 99% | 88% | 96% | 45%
be | KOI8-U | 99% | 78% | 23% | 5%
be | windows-1251 | 99% | 99% | 72% | 34%
bg | ISO-8859-5 | 98% | 92% | 98% | 51%
bg | KOI8-U | 97% | 94% | 97% | 40%
bg | windows-1251 | 94% | 98% | 77% | 44%
br | windows-1252 | 97% | 60% | 100% | 82%
bs | ISO-8859-2 | 89% | 68% | 3% | 20%
bs | windows-1250 | 89% | 87% | 39% | 50%
ca | windows-1252 | 92% | 89% | 100% | 92%
ce | IBM866 | 99% | 96% | 93% | 0%
ce | ISO-8859-5 | 99% | 93% | 98% | 44%
ce | KOI8-U | 99% | 97% | 98% | 39%
ce | windows-1251 | 98% | 99% | 69% | 42%
cs | ISO-8859-2 | 86% | 80% | 39% | 55%
cs | windows-1250 | 86% | 93% | 53% | 65%
da | windows-1252 | 95% | 92% | 100% | 88%
de | windows-1252 | 99% | 97% | 100% | 94%
el | ISO-8859-7 | 96% | 92% | 91% | 63%
el | windows-1253 | 97% | 90% | 91% | 61%
es | windows-1252 | 99% | 98% | 100% | 96%
et | windows-1252 | 98% | 95% | 99% | 83%
eu | windows-1252 | 96% | 94% | 100% | 87%
fa | windows-1256 | 88% | 97% | 0% | 19%
fi | windows-1252 | 99% | 98% | 99% | 88%
fo | windows-1252 | 94% | 86% | 98% | 79%
fr | windows-1252 | 94% | 91% | 100% | 96%
ga | windows-1252 | 100% | 98% | 100% | 89%
gd | windows-1252 | 89% | 66% | 100% | 83%
gl | windows-1252 | 99% | 98% | 100% | 94%
he | windows-1255 | 97% | 95% | 93% | 59%
hr | ISO-8859-2 | 95% | 70% | 8% | 24%
hr | windows-1250 | 95% | 87% | 41% | 52%
ht | windows-1252 | 96% | 71% | 100% | 83%
hu | ISO-8859-2 | 93% | 96% | 76% | 74%
hu | windows-1250 | 93% | 96% | 77% | 75%
is | windows-1252 | 92% | 85% | 95% | 75%
it | windows-1252 | 95% | 90% | 100% | 90%
ja | EUC-JP | 86% | 55% | 56% | 17%
ja | Shift_JIS | 95% | 99% | 37% | 17%
ko | EUC-KR | 95% | 98% | 68% | 7%
ku | windows-1254 | 81% | 42% | 80% | 54%
lb | windows-1252 | 96% | 86% | 100% | 95%
li | windows-1252 | 94% | 74% | 99% | 87%
lt | ISO-8859-4 | 71% | 47% | 2% | 3%
lt | windows-1257 | 48% | 88% | 2% | 2%
lv | ISO-8859-4 | 70% | 45% | 3% | 4%
lv | windows-1257 | 61% | 74% | 3% | 3%
mk | ISO-8859-5 | 98% | 91% | 97% | 48%
mk | KOI8-U | 96% | 97% | 97% | 36%
mk | windows-1251 | 95% | 98% | 73% | 41%
mn | KOI8-U | 97% | 71% | 71% | 11%
mn | windows-1251 | 94% | 97% | 64% | 12%
nn | windows-1252 | 94% | 94% | 100% | 84%
no | windows-1252 | 95% | 95% | 100% | 91%
oc | windows-1252 | 91% | 80% | 100% | 85%
pl | ISO-8859-2 | 92% | 97% | 23% | 64%
pl | windows-1250 | 90% | 96% | 26% | 65%
pt | windows-1252 | 97% | 98% | 100% | 96%
ro | ISO-8859-2 | 90% | 55% | 30% | 53%
ro | windows-1250 | 90% | 56% | 31% | 55%
ru | IBM866 | 99% | 96% | 91% | 0%
ru | ISO-8859-5 | 99% | 93% | 98% | 46%
ru | KOI8-U | 98% | 97% | 98% | 41%
ru | windows-1251 | 97% | 99% | 73% | 44%
sh | ISO-8859-2 | 91% | 78% | 44% | 50%
sh | windows-1250 | 93% | 93% | 79% | 83%
sk | ISO-8859-2 | 90% | 81% | 54% | 57%
sk | windows-1250 | 87% | 93% | 72% | 68%
sl | ISO-8859-2 | 92% | 73% | 10% | 26%
sl | windows-1250 | 91% | 93% | 51% | 59%
sq | windows-1252 | 98% | 53% | 100% | 89%
sr | ISO-8859-5 | 99% | 96% | 99% | 41%
sr | KOI8-U | 99% | 98% | 99% | 27%
sr | windows-1251 | 99% | 99% | 87% | 34%
sv | windows-1252 | 96% | 94% | 100% | 92%
th | windows-874 | 93% | 96% | 86% | 0%
tr | windows-1254 | 84% | 87% | 41% | 52%
uk | KOI8-U | 98% | 81% | 34% | 10%
uk | windows-1251 | 98% | 98% | 69% | 34%
ur | windows-1256 | 86% | 87% | 0% | 13%
vi | windows-1258 (orthographic) | 93% | 10% | 11% | 10%
vi | windows-1258 (minimally decomposed) | 91% | 21% | 22% | 19%
wa | windows-1252 | 98% | 71% | 100% | 84%
yi | windows-1255 | 93% | 86% | 86% | 30%
zh-hans | GBK | 88% | 95% | 28% | 5%
zh-hant | Big5 | 95% | 94% | 25% | 5%

Document-length Accuracy

If we look at Wikipedia articles themselves and filter out ones whose wikitext UTF-8 byte length is 6000 or less (an arbitrary threshold to try to filter out stub articles), chardetng looks even better compared to ced in terms of how many languages are left below a given accuracy threshold.

If the accuracy is rounded to full percents, ced leaves 29 language/encoding combinations at worse than 98% (i.e. 97% or lower). chardetng leaves 8. Moreover, ced leaves 22 combinations below the 89% threshold. chardetng leaves 1: Lithuanian as ISO-8859-4. That’s a pretty good result!

Language | Encoding | chardetng | ced | chardet | ICU4C
an | windows-1252 | 99% | 99% | 100% | 100%
ar | ISO-8859-6 | 100% | 100% | 0% | 94%
ar | windows-1256 | 100% | 100% | 0% | 93%
az | windows-1254 | 99% | 46% | 1% | 88%
be | ISO-8859-5 | 100% | 100% | 100% | 66%
be | KOI8-U | 100% | 100% | 0% | 0%
be | windows-1251 | 100% | 100% | 81% | 66%
bg | ISO-8859-5 | 100% | 100% | 100% | 89%
bg | KOI8-U | 100% | 94% | 93% | 83%
bg | windows-1251 | 100% | 100% | 100% | 89%
br | windows-1252 | 100% | 23% | 100% | 99%
bs | ISO-8859-2 | 100% | 8% | 0% | 23%
bs | windows-1250 | 100% | 99% | 0% | 24%
ca | windows-1252 | 100% | 99% | 100% | 100%
ce | IBM866 | 100% | 100% | 100% | 0%
ce | ISO-8859-5 | 100% | 100% | 100% | 51%
ce | KOI8-U | 100% | 96% | 95% | 37%
ce | windows-1251 | 100% | 100% | 98% | 51%
cs | ISO-8859-2 | 100% | 6% | 0% | 84%
cs | windows-1250 | 100% | 100% | 0% | 85%
da | windows-1252 | 100% | 100% | 100% | 100%
de | windows-1252 | 100% | 98% | 100% | 100%
el | ISO-8859-7 | 97% | 31% | 57% | 95%
el | windows-1253 | 100% | 100% | 17% | 64%
es | windows-1252 | 100% | 100% | 100% | 100%
et | Better of windows-1252 and windows-1257 | 100% | 98% | 98% | 98%
eu | windows-1252 | 98% | 98% | 100% | 100%
fa | windows-1256 | 100% | 100% | 0% | 12%
fi | windows-1252 | 100% | 77% | 100% | 99%
fo | windows-1252 | 95% | 98% | 100% | 99%
fr | windows-1252 | 100% | 100% | 100% | 100%
ga | windows-1252 | 99% | 100% | 100% | 100%
gd | windows-1252 | 99% | 75% | 100% | 99%
gl | windows-1252 | 100% | 100% | 100% | 100%
he | windows-1255 | 100% | 100% | 100% | 84%
hr | ISO-8859-2 | 98% | 17% | 2% | 65%
hr | windows-1250 | 98% | 99% | 4% | 68%
ht | windows-1252 | 99% | 73% | 100% | 100%
hu | ISO-8859-2 | 89% | 85% | 1% | 85%
hu | windows-1250 | 89% | 98% | 1% | 82%
is | windows-1252 | 99% | 99% | 100% | 99%
it | windows-1252 | 97% | 94% | 100% | 100%
ja | EUC-JP | 100% | 100% | 99% | 100%
ja | Shift_JIS | 100% | 100% | 92% | 100%
ko | EUC-KR | 100% | 100% | 94% | 100%
ku | windows-1254 | 96% | 6% | 8% | 44%
lb | windows-1252 | 100% | 91% | 100% | 100%
li | windows-1252 | 100% | 32% | 100% | 100%
lt | ISO-8859-4 | 54% | 87% | 0% | 0%
lt | windows-1257 | 94% | 99% | 0% | 0%
lv | ISO-8859-4 | 98% | 99% | 0% | 0%
lv | windows-1257 | 99% | 100% | 0% | 0%
mk | ISO-8859-5 | 100% | 100% | 100% | 83%
mk | KOI8-U | 100% | 98% | 97% | 82%
mk | windows-1251 | 100% | 100% | 99% | 83%
mn | KOI8-U | 100% | 99% | 1% | 0%
mn | windows-1251 | 100% | 99% | 98% | 1%
nn | windows-1252 | 100% | 100% | 100% | 100%
no | windows-1252 | 99% | 99% | 100% | 100%
oc | windows-1252 | 100% | 98% | 100% | 98%
pl | ISO-8859-2 | 99% | 98% | 0% | 84%
pl | windows-1250 | 99% | 100% | 0% | 85%
pt | windows-1252 | 99% | 100% | 100% | 100%
ro | ISO-8859-2 | 99% | 66% | 0% | 82%
ro | windows-1250 | 99% | 71% | 1% | 78%
ru | IBM866 | 100% | 100% | 100% | 0%
ru | ISO-8859-5 | 100% | 100% | 100% | 93%
ru | KOI8-U | 100% | 96% | 93% | 86%
ru | windows-1251 | 100% | 100% | 97% | 93%
sh | ISO-8859-2 | 99% | 11% | 0% | 31%
sh | windows-1250 | 99% | 98% | 4% | 36%
sk | ISO-8859-2 | 99% | 41% | 0% | 64%
sk | windows-1250 | 99% | 100% | 13% | 65%
sl | ISO-8859-2 | 99% | 33% | 0% | 41%
sl | windows-1250 | 98% | 98% | 2% | 46%
sq | windows-1252 | 100% | 16% | 100% | 100%
sr | ISO-8859-5 | 100% | 100% | 100% | 22%
sr | KOI8-U | 100% | 100% | 100% | 22%
sr | windows-1251 | 100% | 100% | 99% | 22%
sv | windows-1252 | 100% | 100% | 100% | 100%
th | windows-874 | 100% | 91% | 99% | 0%
tr | windows-1254 | 99% | 97% | 0% | 80%
uk | KOI8-U | 100% | 100% | 0% | 0%
uk | windows-1251 | 100% | 100% | 99% | 80%
ur | windows-1256 | 99% | 98% | 1% | 5%
vi | windows-1258 (orthographic) | 100% | 0% | 0% | 0%
vi | windows-1258 (minimally decomposed) | 99% | 0% | 0% | 0%
wa | windows-1252 | 100% | 79% | 100% | 99%
yi | windows-1255 | 100% | 100% | 99% | 30%
zh-hans | GBK | 100% | 100% | 100% | 100%
zh-hant | Big5 | 100% | 100% | 99% | 100%

Document-length-equivalent number of non-ASCII bytes

I examined how truncated input compared to document-length input, starting from 10 non-ASCII bytes and continuing at 10-byte intervals until 100 and then at coarser intervals, truncating by one byte more whenever the truncation would otherwise render CJK input invalid.

For legacy CJK encodings, chardetng achieves document-length-equivalent accuracy with about 10 non-ASCII bytes. For most windows-1252 and windows-1251 languages, chardetng achieves document-length-equivalent accuracy with about 20 non-ASCII bytes. Obviously, this means shorter overall input for windows-1251 than for windows-1252. At 50 non-ASCII bytes, there are very few language/encoding combinations that haven’t completely converged. Some oscillate back a little afterwards, and almost everything has settled at 90 non-ASCII bytes.

Hungarian and ISO-8859-2 Romanian are special cases that haven’t completely converged even at 1000 non-ASCII bytes.

While the title-length case showed that ced scaled down better in some cases, the advantage is lost even at 10 non-ASCII bytes. While ced has document-length-equivalent accuracy for the legacy CJK encodings at 10 non-ASCII bytes, the rest take significantly longer to converge than they do with chardetng.

ced had a number of windows-1252 and windows-1251 cases that converged at 20 non-ASCII bytes, as with chardetng. However, it had more cases, including windows-1252 and windows-1251 cases, whose convergence went into hundreds of non-ASCII bytes. Notably, KOI8-U as an encoding was particularly bad at converging to document-length-equivalence and for most languages (for which it is relevant) had not converged even at 1000 non-ASCII bytes.

Overall, I think it is fair to say that ced may scale down better in some cases where there are fewer than 10 non-ASCII bytes, but chardetng generally scales up better from 10 non-ASCII bytes onwards. (The threshold may be a bit under 10, but because the computation of these tests is quite slow, I did not spend time searching for the exact threshold.)

ISO-8859-7 Greek exhibited strange enough behavior with both chardetng and ced in this test that it made me suspect the testing method has some ISO-8859-7-specific problem, but I did not have time to investigate the ISO-8859-7 Greek issue.

Binary Size

Is it more compact than ced? As of Firefox 78, chardetng and the special-purpose Japanese encoding detector shift_or_euc contribute 62 KB to x86_64 Android libxul size, when the crates that these depend on are treated as sunk cost (i.e. Firefox already has the dependencies anyway, so they don’t count towards the added bytes). When built as part of libxul, ced contributes 226 KB to x86_64 Android libxul size. The binary size contribution of chardetng and shift_or_euc together is 28% of what the binary size contribution of ced would be. The x86 situation is similar.

Section | chardetng + shift_or_euc | ced
.text | 24.6 KB | 34.3 KB
.rodata | 30.8 KB | 120 KB
.data.rel.ro | 2.52 KB | 59.9 KB

On x86_64, the goal of creating something smaller than ced worked out very well. On ARMv7 and aarch64, chardetng and shift_or_euc together result in smaller code than ced, but by a less impressive factor. Unfortunately, PGO ended up changing other code enough that it doesn’t make sense to give exact numbers.

Speed

chardetng is slower than ced. In single-threaded mode, chardetng takes 42% longer than ced to process the same input on Haswell.

This is not surprising, since I intentionally resolved most tradeoffs between binary size and speed in favor of smaller binary size at the expense of speed. When I resolved tradeoffs in favor of speed instead of binary size, I didn’t do so primarily for speed but for code readability. Furthermore, Firefox feeds chardetng more data than Chrome feeds to ced, so it’s pretty clear that overall Firefox spends more time in encoding detection than Chrome does.

I think optimizing for binary size rather than speed is the right tradeoff for code that only runs on legacy pages and doesn’t run at all for modern pages. (Also, microbenchmarks don’t show the cache effects on the performance of other code that likely result from ced having a larger working set of data tables than chardetng does.)

Rayon

Encoding detectors are constructed as a number of probes that process the same data logically independently of each other. On the surface, this structure looks perfect for parallelization using one of Rust’s superpowers: being able to easily convert an iteration to use multiple worker threads using Rayon.

Unfortunately, the result in the case of chardetng is rather underwhelming even with document-length input passed in as 128 KB chunks (the best case with Firefox’s networking stack when the network is fast). While Rayon makes chardetng faster in terms of wall-clock time, the result is very far from scaling linearly with the number of hardware threads. With 8 hyperthreads available on a Haswell desktop i7, the wall-clock result using Rayon is still slower than ced running on a single thread. The synchronization overhead is significant, and the overall sum of compute time across threads is inefficient compared to the single-threaded scenario. If there is a reason to expect parallelism from higher-level task division, it doesn’t make sense to enable the Rayon mode in chardetng. As of Firefox 78, the Rayon mode isn’t used in Firefox, and I don’t expect to enable the Rayon mode for Firefox.
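
For the record, the shape of the parallelized loop is roughly the following; the probe type is made up for illustration and is not chardetng’s actual internal structure:

```rust
use rayon::prelude::*;

// Illustrative probe type; chardetng's real probes differ.
struct Probe {
    score: i64,
    disqualified: bool,
}

impl Probe {
    fn feed(&mut self, buffer: &[u8]) {
        // Per-encoding scoring is elided in this sketch; just touch the state.
        if !self.disqualified {
            self.score += buffer.len() as i64;
        }
    }
}

fn feed_all(probes: &mut [Probe], buffer: &[u8]) {
    // Sequential form: probes.iter_mut().for_each(|p| p.feed(buffer));
    // Rayon form: same logic spread across the thread pool. In practice the
    // synchronization overhead eats most of the benefit for this workload.
    probes.par_iter_mut().for_each(|p| p.feed(buffer));
}
```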

Risks

There are some imaginable risks that testing with data synthesized from Wikipedia cannot reveal.

Big5 and EUC-KR Private-Use Characters

The approach that a single encoding error disqualifies an encoding could disqualify Big5 or EUC-KR if there’s a single private-use character that is not acknowledged by the Encoding Standard.

The Encoding Standard definition of EUC-KR does not acknowledge the Private Use Area mappings in Windows code page 949. Windows maps byte pairs with lead byte 0xC9 or 0xFE and trail byte 0xA1 through 0xFE (inclusive) to the Private Use Area (and byte 0x80 to U+0080). The use cases, especially on the Web, for this area are mostly theoretical. However, if a page somehow managed to have a PUA character like this, the detector would reject EUC-KR as a possible detection outcome.

In the case of Big5, the issue is slightly less theoretical. Big5 as defined in the Encoding Standard fills the areas that were originally for private use but that were taken by Hong Kong Supplementary Character Set with actual non-PUA mappings for the HKSCS characters. However, this still leaves a range below HKSCS, byte pairs whose lead byte is 0x81 through 0x86 (inclusive), unmapped and, therefore, treated as errors by the Encoding Standard. Big5 has had more extension activity than EUC-KR, including mutually-incompatible extensions, and the Han script gives more of a reason (than Hangul) to use the End-User-Defined Characters feature of Windows. Therefore, it is more plausible that a private-use character could find its way into a Big5-encoded Web page than into an EUC-KR-encoded Web page.
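
To make the byte ranges concrete, here is a sketch of the checks implied above; the ranges come straight from the preceding paragraphs, while packaging them as standalone functions is illustrative:

```rust
// Byte pairs that Windows code page 949 maps to the Private Use Area but
// that the Encoding Standard's EUC-KR treats as errors.
fn is_cp949_pua_pair(lead: u8, trail: u8) -> bool {
    (lead == 0xC9 || lead == 0xFE) && (0xA1u8..=0xFE).contains(&trail)
}

// Lead bytes below the HKSCS range that the Encoding Standard's Big5 leaves
// unmapped; a detector that disqualifies on any error would reject Big5 here.
fn is_big5_unmapped_lead(lead: u8) -> bool {
    (0x81u8..=0x86).contains(&lead)
}
```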

For GBK, the Encoding Standard supports the PUA mappings that Windows has. For Shift_JIS, the Encoding Standard supports two-byte PUA mappings that Windows has. (Windows also maps a few single bytes to PUA code points.) Therefore, the concern raised in this section is moot for GBK and Shift_JIS.

I am slightly uneasy that Big5 and EUC-KR in the Encoding Standard are, on the topic of (non-HKSCS) private use characters, inconsistent with Windows and with the way the Encoding Standard handles GBK and Shift_JIS. However, others have been rather opposed to adding the PUA mappings and I haven’t seen actual instances of problems in the real world, so I haven’t made a real effort to get these mappings added.

(Primarily Tamil) Font Hacks

Some scripts had (single-byte) legacy encodings that didn’t make it to IE4 and, therefore, the Web Platform. These were supported by having the page decode as windows-1252 (possible via an x-user-defined declaration that meant windows-1252 decoding in IE4; the Encoding Standard x-user-defined is Mozilla’s different thing relevant to legacy XHR) and having the user install an intentionally misencoded font that assigned non-Latin glyphs to windows-1252 code points.

In some cases, there was some kind of cross-font agreement on how these were arranged. For example, for Armenian there was ARMSCII-8. (Gecko at one point implemented it as a real character encoding, but doing so was useless, because Web sites that used it didn’t declare it as such.) In other cases, these arrangements were font-specific and the relevant sites were simply popular enough (e.g. sites of newspapers in India) to be able to demand that the user install a particular font.

ced knows about a couple of font-specific encodings for Devanagari and multiple encodings, both font-specific and the Tamil Nadu state standard, for Tamil. The Tamil script’s visual features make it possible to treat it as more Thai-like than as Devanagari-like, which means that Tamil is more suited for font hacks than the other Brahmic scripts of India. Unicode adopted the approach standardized by the federal government of India in 1988 to treat Tamil as Devanagari-like (logical order), whereas in 1999 the state government of Tamil Nadu sought to promote treating Tamil the way Thai is treated in Unicode (visual order).

All indications are that ced being able to detect these font hacks has nothing to do with Chrome’s needs as of 2017. It is more likely that this capability was put there for the benefit of the Google search engine more than a decade earlier. However, Chrome post-processes the detection of these encodings to windows-1252, so if sites that use these font hacks still exist, they’d appear to work in Chrome. (ced doesn’t know about Armenian, Georgian, Tajik, or Kazakh 8-bit encodings despite these having had glibc locales.)

Do such sites still exist? I don’t know. Looking at the old bugs in Bugzilla, the reported sites appear to have migrated to Unicode. This makes sense. Despite deferring the migration for years and years after Unicode was usable, chances are that the rise of mobile devices has forced migration. It’s considerably less practical to tell users of mobile operating systems to install fonts than it is to tell users of desktop operating systems to install fonts.

So chances are that such sites no longer exist (in quantity that matters), but it’s hard to tell, and if they did, they’d work in Chrome (if using an encoding that ced knows about) but wouldn’t work with chardetng in Firefox. Instead of adding support for detecting such encodings as windows-1252, I made the .in and .lk top-level domains not run chardetng and simply fall back to windows-1252. (My understanding is that the font hacks were more about the Tamil language in India specifically than about the Tamil language generally, but I turned off chardetng for .lk just in case.) This kept the previous behavior of Firefox for these two TLDs. If the problem exists on .com/.net/.org, the problem is not solved there. Also, this has the slight downside of not detecting windows-1256 to the extent it is used under .in.

UTF-8

chardetng detects UTF-8 by checking if the input is valid as UTF-8. However, like Chrome, Firefox only honors this detection result for file: URLs. As in Chrome, for non-file: URLs, UTF-8 is never a possible detection outcome. If it was, Web developers could start relying on it, which would make the Web Platform more brittle. (The assumption is that at this point, Web developers want to use UTF-8 for new content, so being able to rely on legacy encodings getting detected is less harmful at this point in time.) That is, the user-facing problem of unlabeled UTF-8 is deliberately left unaddressed in order to avoid more instances of problematic content getting created.

As with the Quirks mode being the default and everyone having to opt into the Standards mode, and on mobile a desktop-like view port being the default and everyone having to opt into a mobile-friendly view port, for encodings legacy is the default and everyone has to opt into UTF-8. In all these cases, the legacy content isn’t going to be changed to opt out.

The full implications of “what if” UTF-8 were detected for non-file: URLs require a whole article on their own. The reason why file: URLs are different is that the entire content can be assumed to be present up front. The problems with detecting UTF-8 on non-file: URLs relate to supporting incremental parsing and display of HTML as it arrives over the network.

When UTF-8 is detected on non-file: URLs, chardetng reports the encoding affiliated with the top-level domain instead. Various test cases, both test cases that intentionally test this and test cases that accidentally end up testing this, require windows-1252 to be reported for generic top-level domains when the content is valid UTF-8. Reporting the TLD-affiliated encoding as opposed to always reporting windows-1252 avoids needless reloads on TLDs that are affiliated with an encoding other than windows-1252.
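
In terms of the chardetng API, this policy maps onto the allow_utf8 argument of the guess method. A minimal sketch, assuming the caller knows whether the document came from a file: URL:

```rust
use chardetng::EncodingDetector;
use encoding_rs::Encoding;

// Sketch: UTF-8 is allowed as an outcome only for file: URLs. For other
// URLs, valid UTF-8 input yields the TLD-affiliated legacy encoding instead.
fn guess_for_url(
    detector: &mut EncodingDetector,
    tld: Option<&[u8]>,
    is_file_url: bool,
) -> &'static Encoding {
    detector.guess(tld, is_file_url)
}
```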

Mozilla Privacy Blog: Mozilla releases recommendations on EU Data Strategy

Mozilla planet - vr, 05/06/2020 - 13:24

Mozilla recently submitted our response to the European Commission’s public consultation on its European Strategy for Data.  The Commission’s data strategy is one of the pillars of its tech strategy, which was published in early 2020 (more on that here). To European policymakers, promoting proper use and management of data can play a key role in a modern industrial policy, particularly as it can provide a general basis for insights and innovations that advance the public interest.

Our recommendations provide insights on how to manage data in a way that protects the rights of individuals, maintains trust, and allows for innovation. In addition to highlighting some of Mozilla’s practices and policies which underscore our commitment to ethical data and working in the open – such as our Lean Data Practices Toolkit, the Data Stewardship Program, and the Firefox Public Data Report – our key recommendations for the European Commission are the following:

  • Address collective harms: In order to foster the development of data ecosystems where data can be leveraged to serve collective benefits, legal and policy frameworks must also reflect an understanding of potential collective harms arising from abusive data practices and how to mitigate them.
  • Empower users: While enhancing data literacy is a laudable objective, data literacy is not a silver bullet in mitigating the risks and harms that would emerge in an unbridled data economy. Data literacy – i.e. the ability to understand, assess, and ultimately choose between certain data-driven market offerings – is effective only if there is actually meaningful choice of privacy-respecting goods and services for consumers. Creating the conditions for privacy-respecting goods and services to thrive should be a key objective of the strategy.
  • Explore data stewardship models (with caution): We welcome the Commission’s exploration of novel means of data governance and management. We believe data trusts and other models and structures of data governance may hold promise. However, there are a range of challenges and complexities associated with the concept that will require careful navigation in order for new data governance structures to meaningfully improve the state of data management and to serve as the foundation for a truly ethical and trustworthy data ecosystem.

We’ll continue to build out our thinking on these recommendations, and will work with the European Commission and other stakeholders to make them a reality in the EU data strategy. For now, you can find our full submission here.

 

The post Mozilla releases recommendations on EU Data Strategy appeared first on Open Policy & Advocacy.

The Talospace Project: Firefox 77 on POWER

Mozilla planet - do, 04/06/2020 - 19:38
Firefox 77 is released. I really couldn't care less about Pocket recommendations, and I don't know who was clamouring for that exactly because everybody be tripping recommendations, but better accessibility options are always welcome and the debugging and developer tools improvements sound really nice. This post is being typed in it.

There are no OpenPOWER-specific changes in Fx77, though a few compilation issues were fixed expeditiously through Dan Horák's testing just in time for the Fx78 beta. Daniel Kolesa reported an issue with system NSS 3.52 and WebRTC, but I have not heard if this is still a problem (at least on the v2 ABI), and I always build using in-tree NSS myself which seems to be fine. This morning Daniel Pocock sent me a basic query of 64-bit Power ISA bugs yet to be fixed in Firefox; I suspect some are dupes (I closed one just this morning which I know I fixed myself already), and many are endian-specific, but we should try whittling down that list (and, as usual, LTO and PGO still need to be investigated further). I'm still using the same .mozconfigs from Firefox 67.

In a minor moment of self-promotion, I'm also shamelessly reminding readers that Fx77 comes out in parallel with TenFourFox Feature Parity Release 23, relevant to Talospace readers because I made some fixes to its Content Security Policy support to properly support the web-based OpenBMC with System Package 2.00. Although the serial console-LAN redirector has some stuttery keystrokes, I think this is a timing problem rather than a feature deficiency, and everything else generally works. Connecting over ssh or serial port is naturally always an option, but I have to agree the web OpenBMC is a lot nicer and some tasks are certainly easier that way. If you're a long-term PowerPC dweeb like me and you want to use your beloved Power Mac to manage your brand-spanking-new Talos II or Blackbird, now you can.

Hacks.Mozilla.Org: A New RegExp Engine in SpiderMonkey

Mozilla planet - do, 04/06/2020 - 16:21
Background: RegExps in SpiderMonkey

Regular expressions – commonly known as RegExps – are a powerful tool in JavaScript for manipulating strings. They provide a rich syntax to describe and capture character information. They’re also heavily used, so it’s important for SpiderMonkey (the JavaScript engine in Firefox) to optimize them well.

Over the years, we’ve had several approaches to RegExps. Conveniently, there’s a fairly clear dividing line between the RegExp engine and the rest of SpiderMonkey. It’s still not easy to replace the RegExp engine, but it can be done without too much impact on the rest of SpiderMonkey.

In 2014, we took advantage of this flexibility to replace YARR (our previous RegExp engine) with a forked copy of Irregexp, the engine used in V8. This raised a tricky question: how do you make code designed for one engine work inside another? Irregexp uses a number of V8 APIs, including core concepts like the representation of strings, the object model, and the garbage collector.

At the time, we chose to heavily rewrite Irregexp to use our own internal APIs. This made it easier for us to work with, but much harder to import new changes from upstream. RegExps were changing relatively infrequently, so this seemed like a good trade-off. At first, it worked out well for us. When new features like the ‘\u’ flag were introduced, we added them to Irregexp. Over time, though, we began to fall behind. ES2018 added four new RegExp features: the dotAll flag, named capture groups, Unicode property escapes, and look-behind assertions. The V8 team added Irregexp support for those features, but the SpiderMonkey copy of Irregexp had diverged enough to make it difficult to apply the same changes.

We began to rethink our approach. Was there a way for us to support modern RegExp features, with less of an ongoing maintenance burden? What would our RegExp engine look like if we prioritized keeping it up to date? How close could we stay to upstream Irregexp?

Solution: Building a shim layer for Irregexp

The answer, it turns out, is very close indeed. As of the writing of this post, SpiderMonkey is using the very latest version of Irregexp, imported from the V8 repository, with no changes other than mechanically rewritten #include statements. Refreshing the import requires minimal work beyond running an update script. We are actively contributing bug reports and patches upstream.

How did we get to this point? Our approach was to build a shim layer between SpiderMonkey and Irregexp. This shim provides Irregexp with access to all the functionality that it normally gets from V8: everything from memory allocation, to code generation, to a variety of utility functions and data structures.

[Diagram: The architecture of Irregexp inside SpiderMonkey. SpiderMonkey calls through the shim layer into Irregexp, providing a RegExp pattern. The Irregexp parser converts the pattern into an internal representation. The Irregexp compiler uses the MacroAssembler API to call either the SpiderMonkey macro-assembler or the Irregexp bytecode generator. The SpiderMonkey macro-assembler produces native code which can be executed directly; the bytecode generator produces bytecode, which is interpreted by the Irregexp interpreter. In both cases, this produces a match result, which is returned to SpiderMonkey.]

This took some work. A lot of it was a straightforward matter of hooking things together. For example, the Irregexp parser and compiler use V8’s Zone, an arena-style memory allocator, to allocate temporary objects and discard them efficiently. SpiderMonkey’s equivalent is called a LifoAlloc, but it has a very similar interface. Our shim was able to implement calls to Zone methods by forwarding them directly to their LifoAlloc equivalents.

Other areas had more interesting solutions. A few examples:

Code Generation

Irregexp has two strategies for executing RegExps: a bytecode interpreter, and a just-in-time compiler. The former generates denser code (using less memory), and can be used on systems where native code generation is not available. The latter generates code that runs faster, which is important for RegExps that are executed repeatedly. Both SpiderMonkey and V8 interpret RegExps on first use, then tier up to compiling them later.

Tools for generating native code are very engine-specific. Fortunately, Irregexp has a well-designed API for code generation, called RegExpMacroAssembler. After parsing and optimizing the RegExp, the RegExpCompiler will make a series of calls to a RegExpMacroAssembler to generate code. For example, to determine whether the next character in the string matches a particular character, the compiler will call CheckCharacter. To backtrack if a back-reference fails to match, the compiler will call CheckNotBackReference.

Overall, there are roughly 40 available operations. Together, these operations can represent any JavaScript RegExp. The macro-assembler is responsible for converting these abstract operations into a final executable form. V8 contains no less than nine separate implementations of RegExpMacroAssembler: one for each of the eight architectures it supports, and a final implementation that generates bytecode for the interpreter. SpiderMonkey can reuse the bytecode generator and the interpreter, but we needed our own macro-assembler. Fortunately, a couple of things were working in our favour.

First, SpiderMonkey’s native code generation tools work at a higher level than V8’s. Instead of having to implement a macro-assembler for each architecture, we only needed one, which could target any supported machine. Second, much of the work to implement RegExpMacroAssembler using SpiderMonkey’s code generator had already been done for our first import of Irregexp. We had to make quite a few changes to support new features (especially look-behind references), but the existing code gave us an excellent starting point.

Garbage Collection

Memory in JavaScript is automatically managed. When memory runs short, the garbage collector (GC) walks through the program and cleans up any memory that is no longer in use. If you’re writing JavaScript, this happens behind the scenes. If you’re implementing JavaScript, though, it means you have to be careful. When you’re working with something that might be garbage-collected – a string, say, that you’re matching against a RegExp – you need to inform the GC. Otherwise, if you call a function that triggers a garbage collection, the GC might move your string somewhere else (or even get rid of it entirely, if you were the only remaining reference). For obvious reasons, this is a bad thing. The process of telling the GC about the objects you’re using is called rooting. One of the most interesting challenges for our shim implementation was the difference between the way SpiderMonkey and V8 root things.

SpiderMonkey creates its roots right on the C++ stack. For example, if you want to root a string, you create a Rooted<JSString*> that lives in your local stack frame. When your function returns, the root disappears and the GC is free to collect your JSString. In V8, you create a Handle. Under the hood, V8 creates a root and stores it in a parallel stack. The lifetime of roots in V8 is controlled by HandleScope objects, which mark a point on the root stack when they are created, and clear out every root newer than the marked point when they are destroyed.

To make our shim work, we implemented our own miniature version of V8’s HandleScopes. As an extra complication, some types of objects are garbage-collected in V8, but are regular non-GC objects in SpiderMonkey. To handle those objects (no pun intended), we added a parallel stack of “PseudoHandles”, which look like normal Handles to Irregexp, but are backed by (non-GC) unique pointers.

Collaboration

None of this would have been possible without the support and advice of the V8 team. In particular, Jakob Gruber has been exceptionally helpful. It turns out that this project aligns nicely with a pre-existing desire on the V8 team to make Irregexp more independent of V8. While we tried to make our shim as complete as possible, there were some circumstances where upstream changes were the best solution. Many of those changes were quite minor. Some were more interesting.

Some code at the interface between V8 and Irregexp turned out to be too hard to use in SpiderMonkey. For example, to execute a compiled RegExp, Irregexp calls NativeRegExpMacroAssembler::Match. That function was tightly entangled with V8’s string representation. The string implementations in the two engines are surprisingly close, but not so close that we could share the code. Our solution was to move that code out of Irregexp entirely, and to hide other unusable code behind an embedder-specific #ifdef. These changes are not particularly interesting from a technical perspective, but from a software engineering perspective they give us a clearer sense of where the API boundary might be drawn in a future project to separate Irregexp from V8.

As our prototype implementation neared completion, we realized that one of the remaining failures in SpiderMonkey’s test suite was also failing in V8. Upon investigation, we determined that there was a subtle mismatch between Irregexp and the JavaScript specification when it came to case-insensitive, non-unicode RegExps. We contributed a patch upstream to rewrite Irregexp’s handling of characters with non-standard case-folding behaviour (like ‘ß’, LATIN SMALL LETTER SHARP S, which gives “SS” when upper-cased).

Our opportunities to help improve Irregexp didn’t stop there. Shortly after we landed the new version of Irregexp in Firefox Nightly, our intrepid fuzzing team discovered a convoluted RegExp that crashed in debug builds of both SpiderMonkey and V8. Fortunately, upon further investigation, it turned out to be an overly strict assertion. It did, however, inspire some additional code quality improvements in the RegExp interpreter.

Conclusion: Up to date and ready to go

 

What did we get for all this work, aside from some improved subscores on the JetStream2 benchmark?

Most importantly, we got full support for all the new RegExp features. Unicode property escapes and look-behind references only affect RegExp matching, so they worked as soon as the shim was complete. The dotAll flag only required a small amount of additional work to support. Named captures involved slightly more support from the rest of SpiderMonkey, but a couple of weeks after the new engine was enabled, named captures landed too. (While testing them, we turned up one last bug in the equivalent V8 code.) This brings Firefox fully up to date with the latest ECMAScript standards for JavaScript.

We also have a stronger foundation for future RegExp support. More collaboration on Irregexp is mutually beneficial. SpiderMonkey can add new RegExp syntax much more quickly. V8 gains an extra set of eyes and hands to find and fix bugs. Hypothetical future embedders of Irregexp have a proven starting point.

The new engine is available in Firefox 78, which is currently in our Developer Edition browser release. Hopefully, this work will be the basis for RegExps in Firefox for years to come.

 

The post A New RegExp Engine in SpiderMonkey appeared first on Mozilla Hacks - the Web developer blog.

Categorieën: Mozilla-nl planet

Marco Zehe: My Journey To Ghost

Mozilla planet - do, 04/06/2020 - 13:30

As I wrote in my last post, this blog has moved from WordPress to Ghost recently. Ghost is a modern publishing platform that focuses on the essentials. Unlike WordPress, it doesn't try to be the one-stop solution for every possible use case. Instead, it is a CMS geared towards bloggers, writers, and publishers of free and premium content. In other words, people like me. :-)

After a lot of research, some pros-and-cons soul searching, and some experimentation, last week I decided to go through with the migration. This blog is hosted with the Ghost Foundation's Ghost(Pro) offering. So not only do I get excellent hosting, but my monthly fee is also a donation to the foundation and helps fund future development. They also take care of updates for me and make sure that everything runs smoothly. And through a worldwide CDN, the site is now super fast no matter where my visitors come from.

The following should, however, also work the same on a self-hosted Ghost installation. I am not aware of anything in particular that would only work on the hosted Ghost(Pro) instances. So no matter how you run your Ghost site, the rest of this post assumes that you have one up and running; how you got there is up to you.

Publishing from iPad

One of the main reasons to choose Ghost was also the ability to publish from my iPad without any hassle. My favorite writing app, Ulysses, has had the ability to publish to Ghost since June 2019. Similar to its years-long capability to publish to WordPress and Medium, it now does the same with Ghost through Ghost's open APIs. The Markdown I write, along with images, tags, and other bits of information, is automatically translated into concepts Ghost understands. For a complete walk-through, read the post on the Ulysses blog about this integration.

Migrating from WordPress

My journey began by following the Ghost tutorial on migrating from WordPress. In a nutshell, this consists of:

  • Installing an official exporter plugin into your WordPress site.
  • Exporting your content using that plugin.
  • Importing the export into your Ghost site.
  • Checking that everything works.

Sounds easy, eh? Well, it was, except for some pitfalls. With some trial and error, and deleting and importing my content from and into my Ghost site a total of three times, I got it working, though. Here's what I learned.

Match the author

Before exporting your content from WordPress, make sure that the author's profile e-mail address matches that of the author account in Ghost. Otherwise, a new author will be created, and the posts won't be attributed to you. That is, of course, assuming that you are doing this import for yourself, not for a teammate. The matching is done by the e-mail address of the actual author profile, not by the general admin e-mail from the WordPress general settings.

Check your image paths

This is another bit that differs between Ghost and WordPress. WordPress puts images into a wp-content/uploads/year/mo/<filename> folder. The Ghost exporter tries to mimic that and puts the images in content/wordpress/year/mo/<filename>. But the links in the exported JSON file are not adjusted; you have to fix them yourself, for example in your favorite text editor with a find-and-replace operation. And don't forget to zip the changed JSON file back up into the export you want to upload to the Ghost importer.
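
If hand-editing a large JSON export feels error-prone, the same find-and-replace can be scripted. Here is a minimal Node.js sketch; the export file name is an assumption, and you should double-check the rewritten paths against your own export before re-zipping it:

const fs = require('fs');
// Hypothetical file name for the unzipped WordPress export.
const exportFile = 'wordpress-export.json';
let json = fs.readFileSync(exportFile, 'utf8');
// Rewrite WordPress upload paths to the folder layout the Ghost exporter uses.
json = json.replace(/wp-content\/uploads\/(\d{4})\/(\d{2})\//g, 'content/wordpress/$1/$2/');
fs.writeFileSync(exportFile, json);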

Permalink redirects

This was actually the hardest part for me, and with which I struggled for a few hours before I got it working. In default WordPress installations, the permalink structure looks something like mysite.com/yyyy/mm/dd/post-slug/. Some may omit the day part, but this is how things usually stand with WordPress. Ghost's permalink structure, which you can also change, by the way, is different. Its default permalinks look something like mysite.com/post-slug/. Since this was all new, I wanted to stick with the defaults and not reproduce the WordPress URL structure with custom routing.

The goal, of course, is that if someone follows a link to one of my previous posts from another site, or from a not-yet-updated Google search index, they still get my post displayed, not a 404 Page Not Found error. And the proper way to do that is with permanent 301 redirects. Those are actually quite powerful, because they support regular expressions, or RegEx.

Regular expressions are powerful search patterns. They can, for example, do things like „Look for a string that starts with a forward slash, followed by 4 digits, followed by another slash, followed by 2 digits, another slash, another 2 digits, another slash, and some arbitrary string of characters until you reach the end of that string“. And if you've found that, return me only that arbitrary string so I can process it further. You guessed it: that is, in plain English, the search we need to do to process WordPress permalinks. We then only have to use the extracted slug so we can redirect to a URL that contains only that slug part.
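
To make that plain-English description concrete, here is roughly what it looks like as a JavaScript regular expression. This is just a sketch with a made-up example URL; the fourth capture group is the slug we want to keep, which is also what the $4 in the redirect file further below refers to:

// Year, month, day, and then the slug as the fourth capture group.
const wpPermalink = /^\/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/(.*)/;
const match = '/2020/06/04/my-journey-to-ghost/'.match(wpPermalink);
console.log(match[4]); // "my-journey-to-ghost/"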

The tricky part was how to get that right. I have been notoriously bad with Regex syntax. Its machine readable form is something not everyone can understand, much less compose, easily. And I thought: Someone must have run into this problem before, so let's ask Aunt Google.

What I found was, not surprisingly, something that pertained to changing the permalink structure in WordPress from the default to something that was exactly what Ghost is using. And the people who offer such a conversion tool are the makers of the YOAST SEO plugin. It is called YOAST permalink helper and is an awesome web tool that outputs the redirect entries for both Apache and NGINX configuration files.

Equipped with that, I started by looking at another web tool called Regex101. This is another awesome, although not fully accessible, tool that accepts Regex in four flavors; you also give it a test string, and it tells you not only what the Regex does, but also whether it works on the string you gave it. So I tried it out and could even generate a JavaScript snippet that translated my Regex into the flavor that JavaScript uses. Because, you know, Regex isn't complicated enough as it is; it also needs flavors for many languages and systems. And they sometimes bite each other, like flavors in food can.

The Ghost team has a great tutorial on permanent redirects, but as I found out, the Ghost implementation has a few catches that took me a while to figure out. For example, to search for a forward slash, you usually escape it with a backslash character. However, in Ghost, the very first forward slash in the „from“ value must not be escaped. All others, yes please. But if you try the JavaScript flavor out on the Regex101 page the tutorial recommends, it shows all forward slashes as needing to be escaped. Also, you had better not end with a slash, but let the regex end on whatever character comes before that final forward slash Regex101 suggests.

The „To:“ value then also starts with a forward slash, and can then take one of the groups, in my case the 4th group, denoted by the $4 notation. I banged my head against these subtleties for a few hours, even went out on a completely different tangent there for a while only to discover that my initial approach was still the best bet I was getting.

Compared to the above, redirecting the RSS /feed/ to the Ghost style /rss/ was, after that previous ordeal, a piece of cake. Some RSS readers may struggle with this, so if yours doesn't pick up new posts any more, please change your feed URL setting.

My final redirect JSON file looks like this. If you plan to migrate to Ghost from WordPress, and have a similar permalink structure, feel free to use it.

[
  { "from": "/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/(.*)", "to": "/$4", "permanent": true },
  { "from": "/feed/", "to": "/rss/", "permanent": true }
]

Tags and categories

There are some more things that only partially translate between WordPress and Ghost. For example, while tags carry over, categories don't. The only way around that is a plugin that converts categories to tags. It is mentioned in Ghost's tutorial, but as I was looking at it, I saw that it had last been updated 6 years ago, and the last tested version was equally old. And while I don't think this part of WordPress has changed much, at least from the looks of it, I didn't trust such an old plugin to mess with my data. Yes, I had a backup anyway, but still.

And then there was CodeMirror

So here I was, having imported all my stuff, and opening one of the posts for editing in the Ghost admin. And to my surprise, I could not edit it! I found that the post was put into an instance of the CodeMirror editor.

CodeMirror has, at least in all versions before the upcoming version 6, a huge accessibility issue which makes it impossible for screen readers to read any text. It uses a hidden textarea for input, and this is actually where the focus is at all times. But as you type, the text gets grabbed and removed, and put into a contentEditable mirror that never gets focus. It also does some more fancy stuff like code syntax highlighting and line numbers. Version 6 will address accessibility, but it is not production-ready yet.

But wait, Ghost was said to work with Markdown? And I had actually tested the regular editor with Markdown. That was a contentEditable that I could read fine with my screen reader. So why was this different?

The answer is simple: To make things as seamless as possible, the Ghost WordPress Exporter exports the full HTML of each post, which is then imported into Ghost as something called an HTML card. Cards are special blocks that allow for code or HTML formatting. They are inserted as blocks into the regular content. And no, this is not actually a Gutenberg clone; it is more like special areas within the post. Only that with these imported posts, the whole post was one such special area.

Fortunately, if you need to work on such an older post after the import, for most simple formatting there is a way to do it. You edit the HTML card, and when focused in the text area, press CTRL+A to select all, CTRL+X to cut the whole contents, then escape out of that card once. Back in the regular contentEditable, paste the clipboard contents. For not-too-complicated formatting, this will simply put the HTML into your contentEditable, and you get headings, lists, links, etc. The one thing I found that doesn't translate is tables. This is probably because Markdown's flavors have such different and diverging implementations of tables.

If you need to insert HTML, write it in your favorite code editor first. Then, insert an HTML card, and paste the HTML there. I did so while updating my guide on how to use NVDA and Firefox to test your web pages for accessibility. Worked flawlessly. Also, the JSON code snippet above was input the same way.

But believe me, that moment where I opened an older post and could actually not edit it, was a scary moment that almost made me give up on the Ghost effort. Thankfully, there was help. So here we are.

A special thank you

I would like to extend a special thank you to Dave, the Ghost Foundation's developer advocate, who took it upon himself early on to help me with the migration. He answered quite a number of very different questions I was having, sent me helpful links, and also was a great assistant in understanding some of the quirks of the Ghost publishing screen. Some of which has led to some pull requests I sent in to fix these quirks. You know, I can't help it, I'm just that kind of accessibility guy. ;-)

But John O'Nolan, Ghost's founder, and others from the team have been very helpful and welcoming, merging my pull requests literally from day 1.5 of me using Ghost, answering more questions and offering to help.

In conclusion

This was a pleasurable experience through and through. And even the two hiccups I encountered were dealt with eventually, or are, in the case of the inaccessible CodeMirror bits, things I can somehow work around.

My blog has been running smoothly since May 29, and I hope to have some of the kinks with the theme smoothed out next, especially the color contrast bit, and the bit about the fonts some people have given me feedback on. I will work with the maintainer of the Attila theme to work through these.

Again, welcome to this new blogging chapter!

Categorieën: Mozilla-nl planet

The Rust Programming Language Blog: Announcing Rust 1.44.0

Mozilla planet - do, 04/06/2020 - 02:00

The Rust team has published a new version of Rust, 1.44.0. Rust is a programming language that is empowering everyone to build reliable and efficient software.

This is a shorter blog post than usual: in acknowledgement that taking a stand against the police brutality currently happening in the US and the world at large is more important than sharing tech knowledge, we decided to significantly scale back the amount of promotion we're doing for this release.

The Rust Core Team believes that tech is and always will be political, and we encourage everyone to take the time today to learn about racial inequality and support the Black Lives Matter movement.

What's in 1.44.0 stable

Rust 1.44 is a small release, with cargo tree integrated in Cargo itself and support for async/await in no_std contexts as its highlights. You can learn more about all the changes in this release by reading the release notes.

Contributors to 1.44.0

Many people came together to create Rust 1.44.0. We couldn't have done it without all of you. Thanks!

Categorieën: Mozilla-nl planet

Data@Mozilla: This Week in Glean: The Glean SDK and iOS Application Extensions, or A Tale of Two Sandboxes

Mozilla planet - wo, 03/06/2020 - 14:05

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

Recently, I had the pleasure of working with our wonderful iOS developers here at Mozilla in instrumenting Lockwise, one of our iOS applications, with the Glean SDK.  At this point, I’ve already helped integrate it with several other applications, all of which went pretty smoothly, and Lockwise for iOS held true to that.  It wasn’t until later, when unexpected things started happening, that I realized something was amiss…

Integrating the Glean SDK with a new product is a fairly straightforward process.  On iOS, it amounts to adding the dependency via Carthage, and adding a couple of build-steps to get it to do its thing.  After this is done, we generally smoke test the data using the built in debugging tools.  If everything looks good, we submit a request for data review for collecting the new metrics.  Once a data steward has signed off on our request to collect new data, we can then release a new version of the application with its Glean SDK powered telemetry.  Finally, we collect a few weeks of data to validate that everything looks good, such as user counts, distribution of locales, and we look for anything that might indicate that the data isn’t getting collected like we expected, such as holes in sequence numbers or missing fields.  In Lockwise for iOS’s case, all of this went just as expected.

One part of the Glean SDK integration that I haven’t mentioned yet is enabling the application in our data ingestion pipeline via the probe-scraper so that we can accept data from it.  On iOS, the Glean SDK makes use of the application bundle identifier to uniquely identify the app to our pipeline, so enabling the app means letting the pipeline know about this id so that it won’t turn away the data.  This identifier also determines the table that the data ultimately ends up in, so it’s a key identifier in the process.

So, here’s where I learned something new about iOS architecture, especially as it relates to embedded application extensions.  Application extensions are a cool and handy way of adding additional features and functionality to your application in the Apple ecosystem.  In the case of Lockwise, they are using a form of extension that provides credentials to other applications.  This allows the credentials stored in Lockwise to be used to authenticate in websites and other apps installed on the device.  I knew about extensions but hadn’t really worked with them much until now, so it was pretty interesting to see how it all worked in Lockwise.

Here's where a brick smacks into the story.  Remember that bundle identifier that I said was used to uniquely identify the app?  Well, it turns out that application extensions in iOS modify this a bit, adding to it to uniquely identify themselves!  We realized this when we started to see our pipeline reject this new identifier, because it wasn't an exact match for the identifier that we expected and had allowed through.  The id we expected was org-mozilla-ios-lockbox, but the extension was reporting org-mozilla-ios-Lockbox-CredentialProvider.  Using a different bundle identifier totally makes sense, since extensions run as a separate process within their own application sandbox container.  The OS needs to see them differently because an extension can run even if the base application isn't running.

Unfortunately, the Glean SDK is purposefully built to not care about, or even know about, different processes, so we had a bit of a blind spot in the application extension.  Not only that, but remember I mentioned that the extension's storage container is a separate sandbox from the base application's?  Well, since the extension runs in a different process from the base application and has separate storage, the Glean SDK running in the extension acted as if the extension were a completely separate application.  With separate storage, it happily generates a different unique identifier for the client, which does not match the id generated for the base application.  So there was no way to attribute the information in the extension to the base application that contained it, because the ingestion pipeline saw these as separate applications with no way to associate the client ids between the two.  These were two sandboxes that just couldn't interact with each other.

To be fair, Apple does provide a way to share data between extensions and applications, but it requires creating a completely separate shared sandbox, and this doesn't solve the problem that the same Glean SDK instance just shouldn't be used directly by multiple processes at the same time.

Well, that wasn’t ideal, to say the least, so we began an investigation to determine what course of action we should (or could) take.  We went back and forth over the details but ultimately we determined that the Glean SDK shouldn’t know about processes and that there wasn’t much we could do aside from blocking it from running in the extensions and documenting the fact that it was up to the Glean SDK-using application to ensure that metrics were only collected by the main process application.  I was a bit sad that there wasn’t much we could do to make the user-experience better for Glean SDK consumers, but sometimes you just can’t predict the challenges you will face when implementing a truly cross-platform thing.  I still hold out hope that a way will open up to make this easier, but the lesson I learned from all of this is that sometimes you can’t win but it’s important to stick to the design and do the best you can.

Categorieën: Mozilla-nl planet

Allen Wirfs-Brock: The Rise and Fall of Commercial Smalltalk

Mozilla planet - wo, 03/06/2020 - 00:27
[Image: Relics]

Gilad Bracha recently posted an article “Bits of History, Words of Advice” that talks about the incredible influence of Smalltalk but bemoans the fact that:

…today Smalltalk is relegated to a small niche of true believers.  Whenever two or more Smalltalkers gather over drinks, the question is debated: Why? 

(All block quotes are from Gilad’s article unless otherwise noted.)

Smalltalk actually had a surge of commercial popularity in the first half of the 1990s but that interest evaporated almost instantaneously in 1996. Most of the Gilad’s article consists of his speculations on why that happened. I agree with many of Gilad’s takes on this subject, but his involvement and perspective with Smalltalk started relatively late in the commercial Smalltalk lifecycle. I was there pretty much at the beginning so it seems appropriate to add some additional history and my own personal perspective. I’ll be directly responding to some of Gilad’s points, so you should probably read his post before continuing.

Let’s Start with Some History

Gilad worked on the Strongtalk implementation of Smalltalk that was developed in the mid-1990s. His post describes the world of Smalltalk as he saw it during that period. But to get a more complete understanding we will start with an overview of what had occurred over the previous twenty years.

1970-1979: Creation

Starting in the early 1970s the early Smalltalkers and other Xerox PARC researchers invented the concepts and mechanisms of Personal Computing, most of which are still dominant today. Guided by Alan Kay’s vision, Dan Ingalls along with Ted Kaehler, Adele Goldberg, Larry Tesler, and other members of the PARC Learning Research Group created Smalltalk as the software for the Dynabook, their aspirational model of a personal computer. Smalltalk wasn’t just a programming language—it was, using today’s terminology, a complete software application platform and development environment running on the bare metal of dedicated personal supercomputers. During this time, the Smalltalk language and system evolved through at least five major versions.

1980-1984: Dissemination and Frustration

In the 1970s, the outside world only saw hints and fleeting glimpses of what was going on within the Learning Research Group. Their work influenced the design of the Xerox Star family of office machines, and after Steve Jobs got a Smalltalk demo he hired away Larry Tesler to work on the Apple Lisa. But the LRG team (later called SCG, for Software Concepts Group) wanted to directly expose their work (Smalltalk) to the world at large. They developed a version of their software, Smalltalk-80, that would be suitable for distribution outside of Xerox and wrote about it in a series of books and a special issue of the widely read Byte magazine. They also made a memory snapshot of their Smalltalk-80 implementation available to several companies they thought had the hardware expertise to build the microcoded custom processors believed necessary to run Smalltalk. The hope was that the companies would design and sell Smalltalk-based computers similar to the one they were using at Xerox PARC.

But none of the collaborators did this. Instead of building microcoded Smalltalk processors, they used economical mini-computers or commodity microprocessor-based systems and coded what were essentially simple software emulations of the Xerox supercomputer Smalltalk machines. The initial results were very disappointing. They could nominally run the Xerox-provided Smalltalk memory image, but at speeds 10–100 times slower than the machines being used at PARC. This was extremely frustrating, as their implementations were too slow to do meaningful work with the Smalltalk platform. Most of the original external collaborators eventually gave up on Smalltalk.

But a few groups, such as me and my colleagues at Tektronix, L Peter Deutsch and Alan Schiffman at Xerox, and Dave Ungar at UC Berkeley, developed new techniques leading to drastically better Smalltalk-80 performance using commodity microprocessors with 32-bit architectures.

At the same time Jim Anderson, George Bosworth, and their colleagues, working exclusively from the Byte articles, independently developed a new Smalltalk that could usefully run on the original IBM PC with its Intel 8088 processor.

1985-1989: Productization

By 1985, it was possible to actually use Smalltalk on affordable hardware, and various groups went about releasing Smalltalk products. In January 1985 my group at Tektronix shipped the first of a series of "AI Workstations" that were specifically designed to support Smalltalk-80. At about the same time Digitalk, Anderson's and Bosworth's company, shipped Methods—their first IBM PC Smalltalk product. This was followed by their Smalltalk/V products, whose name suggested a linkage to Alan Kay's Vivarium project. Servio Logic introduced GemStone, an object-oriented database system based on Smalltalk. In 1987 most of the remaining Xerox PARC Smalltalkers, led by Adele Goldberg, spun off into a startup company, ParcPlace Systems. Their initial focus was selling Smalltalk-80 systems for workstation-class machines.

Dave Thomas’ Object Technology International (OTI) worked with Digitalk to create a Macintosh version of Smalltalk/V and then started to develop the Smalltalk technology that would eventually be used in IBM’s VisualAge products. Tektronix, working with OTI, started developing two new families of oscilloscopes that used embedded Smalltalk to run its user interface. Both OTI and Instantiations, a Tektronix spin-off, started working on tools to support team-scale development projects using Smalltalk.

1990-1995: Enterprise Smalltalk

As many software tools vendors discovered over that period, it takes too many $99 sales to support anything larger than a “lifestyle” business. In 1990, if you wanted to build a substantial software tools business you had to find the customers who spent substantial sums on such tools. Most tools vendors who stayed in business discovered that most of those customers were enterprise IT departments. Those organizations, used to being IBM mainframe customers, often had hundreds of in-house developers and were willing to spend several thousand dollars per initial developer seat plus substantial yearly support fees for an effective application development platform.

At that time, many enterprise IT organizations were trying to transition from mainframe-driven "green-screen" applications to "fat clients" with modern Mac/Windows-style GUI interfaces. The major Smalltalk vendors positioned their products as solutions to that transition problem. This was reflected in their product branding: ParcPlace Systems' ObjectWorks was renamed VisualWorks, Digitalk's Smalltalk/V became Visual Smalltalk Enterprise, and IBM used OTI's embedded Smalltalk technologies as the basis of VisualAge. Smalltalk was even being positioned in the technical press and books as the potential successor to COBOL:

[Image: Smalltalk and Object Orientation: An Introduction, John Hunt, 1998, p. 44]

Dave Thomas’ 1995 article “Travels with Smalltalk” is an excellent survey of the then current and previous ten years of commercial Smalltalk activity. Around the time it was published, IBM consolidated its Smalltalk technology base by acquiring Dave’s OTI, and Digitalk merged with ParcPlace Systems. That merger was motivated by their mutual fear of IBM Smalltalk’s competitive power in the enterprise market.

1996-Present: Retreating to the Fringe

Over six months in 1996, Smalltalk’s place in the market changed from the enterprise darling COBOL replacement to yet another disappointing tool with a long tail of deployed applications that needed to be maintained.

The World Wide Web diverted enterprise attention away from fat clients and suddenly web thin clients were the hot new technology. At the same time Sun Microsystems’ massive marketing campaign for Java convinced enterprises that Java was the future for both web and standalone client applications. Even though Java never delivered on its client-side promises, the commercial Smalltalk vendors were unable to counter the Java hype cycle and development of new Smalltalk-based enterprise applications stopped. The merged ParcPlace-Digitalk rapidly declined and only survived for a couple more years. IBM quickly pivoted its development attention to Java and used the development environment expertise it had acquired from OTI to create Eclipse.

Niche companies assumed responsibility for support of the legacy enterprise Smalltalk customers with deployed applications. Cincom picked up support for VisualWorks and Visual Smalltalk Enterprise customers, while Instantiations took over supporting IBM’s VisualAge Smalltalk. Today, twenty five years later, there are still hundreds of deployed legacy enterprise Smalltalk applications.

Dan Ingalls at Apple in 1996 used an original Smalltalk-80 virtual image to bootstrap Squeak Smalltalk. Dan, Alan Kay, and their colleagues at Apple, Disney, and some universities used Squeak as a research platform, much as they had used the original Smalltalk at Xerox PARC. Squeak was released as open source software and has a community of users and contributors throughout the world. In 2008, Pharo was forked from Squeak, with the intent of being a streamlined, more industrialized, and less research-focused version of Smalltalk. In addition, several other Smalltalks have been created over the last 25 years. Today Smalltalk still has a small, enthusiastic, and somewhat fragmented base of users.

Why Didn’t Smalltalk Win?

Gilad, in his article, lists four major areas of failure of execution by the Smalltalk community that led to its lack of broad, long-term adoption. Let's look at each of them and see where I agree and disagree.

Lack of a Standard

But a standard for what? What is this “Smalltalk” thing that wasn’t standardized? As Gilad notes, there was a reasonable degree of de facto agreement on the core language syntax, semantics and even the kernel class protocols going back to at least 1988. But beyond that kernel, things were very different.

Each vendor had a slightly different version – not so much a different language, as a different platform.

And these platforms weren't just slightly different. They were very different in terms of everything that made them a platform. The differences between the Smalltalk platforms were not like those that existed at the time between compilers like GCC and Microsoft's C/C++. A more apt analog would be the differences between platforms like Windows, Solaris, various Linux distributions, and Mac OSX. A C program using the standard C library can be easily ported among all these platforms, but porting a highly interactive GUI C-based application usually required a major rewrite.

The competing Smalltalk platform products were just as different. It was easy enough to "file out" basic Smalltalk language code that used the common kernel classes and move it to another Smalltalk product. But just like the C-implemented platforms, porting a highly interactive GUI application among Smalltalk platforms required a major rewrite. The core problem wasn't the reflective manner in which Smalltalk programs were defined (although I strongly agree with Gilad that this is undesirable). The problem was that all the platform services that built upon the core language and kernel classes—the things that make a platform a platform—were different. By 1995, there were at least three major and several niche Smalltalk platforms competing in the client application platform space. But by then there was already a client platform winner—it was Microsoft Windows.

Business Model

Smalltalk vendors had the quaint belief in the notion of “build a better mousetrap and the world will beat a path to your door”. Since they had built a vastly better mousetrap, they thought they might charge money for it. 

…Indeed, some vendors charged not per-developer-seat, but per deployed instance of the software. Greedy algorithms are often suboptimal, and this approach was greedier and less optimal than most.

This may be an accurate description of the business model of ParcPlace Systems, but it isn’t a fair characterization of the other Smalltalk vendors. In particular, Digitalk started with a $99 Smalltalk product for IBM PCs and between 1985 and 1990 built a reasonable small business around sub $500 Smalltalk products for PCs and Macs. Then, with VC funding, it transitioned to a fairly standard enterprise focused $2000+ per seat plus professional services business model. Digitalk never had deployment fees. I don’t believe that IBM did either.

The enterprise business model worked well for the major Smalltalk vendors—but it became a trap. When you acquire such customers you have to respond to their needs. And their immediate needs were building those fat-client applications. I remember with dismay the day at Digitalk when I realized that our customers didn’t really care about Smalltalk, or object-based programming and design, or live programming, or any of the unique technologies Smalltalk brought to the table. They just wanted to replace those green-screens with something that looked like a Windows application. They bought our products because our (quite successful) enterprise sales group convinced them that Smalltalk was necessary to build such applications.

Supporting those customer expectations diverted engineering resources to visual designers and box & stick visual programming tools rather than important down-stream issues such as team collaborative development, versioning, and deployment. Yes, we knew about these issues and had plans to address them, but most resources went to the immediate customer GUI issues. So much so that ParcPlace and Digitalk were blindsided and totally unprepared to compete when their leading enterprise customers pivoted to web-based thin-clients.

Performance

Smalltalk execution performance had been a major issue in the 1980s. But by the mid 1990s every major commercial Smalltalk had a JIT-based virtual machine and a multi-generation garbage collector. When “Strongtalk applied Self’s technology to Smalltalk” it was already practical.

While those of us working on Smalltalk VMs loved to chase C++ performance, our actual competition was PowerBuilder, Visual Basic, and occasionally Delphi. All the major Smalltalk VMs had much better execution performance than any of those. Microsoft once even made a bid to acquire Digitalk, even though they had no interest in Smalltalk. They just wanted to repurpose the Smalltalk/V VM technology to make Visual Basic faster.

But as Gilad points out, raw speed is seldom an issue. Particularly for the fat client UIs that were the focus of most commercial Smalltalk customers. Smalltalk VMs also had much better performance than the other dynamic languages that emerged and gained some popularity during the 1990s. Perl, Python, Ruby, PHP all had, and as far as I know still have, much poorer execution performance than 1995 Smalltalks running on comparable hardware.

Memory usage was a bigger issue. It was expensive for customers to have to double the memory in their PCs to run commercial Smalltalk effectively. But Moore's law quickly overcame that issue.

It’s also worth dwelling on the fact that raw speed is often much less relevant than people think. Java was introduced as a client technology (anyone remember applets?). The vision was programs running in web pages. Alas, Java was a terrible client technology. In contrast, even a Squeak interpreter, let alone Strongtalk, had much better start up times than Java, and better interactive response as well. It also had much smaller footprint. It was a much better basis for performant client software than Java. The implications are staggering.

Would Squeak (or any mid-1990s version of Smalltalk) have really fared better than Java in the browser? You can try it yourself by running a 1998 version of Squeak right now in your browser: https://squeak.js.org/demo/simple.html. Is this what web developers needed at that time?

Java's problem as a web client was that it wanted to be its own platform. Java wasn't well integrated into the HTML-based architecture of web browsers. Instead, Java treated the browser as simply another processor to host the Sun-controlled "write once, run [the same] everywhere" Java application platform. Its goal wasn't to enhance native browser technology; its goal was to replace it.

Interaction with the Outside World

Because of their platform heritage, commercial Smalltalk products had similar problems to those Java had. There was an “impedance mismatch” between the Smalltalk way of doing things and the dominant client platforms such as Microsoft Windows. The non-Smalltalk-80 derived platforms (Digitalk and OTI/IBM) did a better job than ParcPlace Systems at smoothing that mismatch.

Windowing. Smalltalk was the birthplace of windowing. Ironically, Smalltalks continued to run on top of their own idiosyncratic window systems, locked inside a single OS window.
Strongtalk addressed this too; occasionally, so did others, but the main efforts remained focused on their own isolated world, graphically as in every other way.

This describes ParcPlace’s VisualWorks, but not Digitalk’s Visual Smalltalk or IBM’s VisualAge Smalltalk. Applications written for those Smalltalks used native platform windows and widgets, just like other languages. Each Smalltalk vendor supplied their own higher-level GUI frameworks. Those frameworks were different from each other and from frameworks used with other languages. Digitalk’s early Smalltalk products for MSDOS supplied their own windowing systems, but Digitalk’s Windows and OS/2 products always used the native windows and graphics.

Source control. The lack of a conventional syntax meant that Smalltalk code could not be managed with conventional source control systems. Instead, there were custom tools. Some were great – but they were very expensive.

Both IBM and Digitalk bundled source control systems into their enterprise Smalltalk products, which were competitively priced with other enterprise development platforms. OTI/IBM's Envy source control system successfully built upon Smalltalk's traditional reflective syntax and stored code in a multiuser object database. OTI also sold Envy for ParcPlace's VisualWorks but lacked the pricing advantage of bundling. Digitalk's Team/V unobtrusively introduced a non-reflective syntax and versioned source code using RCS. Team/V could migrate versions of Smalltalk "modules" forward and backward within a running virtual image.

Deployment. Smalltalk made it very difficult to deploy an application separate from the programming environment.

As Gilad describes, extracting an application from its development environment could be tricky. But both Team/V and Envy included tools to help developers do this. Team/V supported the deployment of applications as Digitalk Smalltalk Link Libraries (SLLs), which were separable virtual image segments that could be dynamically loaded. Envy, which originally was designed to generate ROM-based embedded applications, had rich tools for tree-shaking a deployable application from a development image.

Gilad also mentioned that "unprotected IP" was a deployment concern that hindered Smalltalk acceptance. This is presumably because even if source code wasn't deployed with an application, it was quite easy to decompile a Smalltalk bytecoded method back into human-readable source code. Indeed, potential enterprise customers would occasionally raise that as a concern. But it was usually a "what about" issue. Out of hundreds of enterprise customers we dealt with at Digitalk, I can't recall a single instance where unprotected IP killed a deal. If it had, we would have done something about it, as we knew of various techniques that could be used to obfuscate Smalltalk code.

“So why didn’t Smalltalk take over the world?”

Smalltalk did something more important than take over the world—it defined the shape of the world! Alan Kay's Dynabook vision of personal computing didn't come with a detailed design document describing all of its features and how to create it. To make the vision concrete, real artifacts had to be invented.

[Image: Alan Kay Dynabook sketch, 1972]

Smalltalk was the software foundation for "inventing the future" of the Dynabook. The Smalltalk programming language went through at least five major revisions at Xerox PARC and evolved into the software platform of the prototype Dynabook. Using the Smalltalk platform, the fundamental concepts of a personal graphical user interface were first explored and refined. Those initial concepts were widely copied and further refined in the market, resulting in the systems we still use today. Smalltalk was also the vector by which the concepts of object-oriented programming and design were introduced to a world-wide programming community. Those ideas dominated for at least 25 years. From the perspective of its original purpose, Smalltalk was a phenomenal success.

But, I think Gilad’s question really is: Why didn’t the Smalltalk programming language become one of the most widely used languages? Why isn’t it in the top 10 of the Tiobe Index?

Gilad’s entire article is essentially about answering that question, and he summarizes his conclusions as:

With 20/20 hindsight, we can see that from the pointy-headed boss perspective, the Smalltalk value proposition was:

 Pay a lot of money to be locked in to slow software that exposes your IP, looks weird on screen and cannot interact well with anything else; it is much easier to maintain and develop though!

On top of that, a fair amount of bad luck.

There are elements of truth in all of those observations, but I think they miss what really happened with the rise and fall of commercial Smalltalk:

  1. Smalltalk wasn’t just a language; it was a complete personal computing platform. But by the time affordable personal computing hardware could run the Smalltalk platform, other less demanding platforms had already dominated the personal computing ecosystem.
  2. Xerox Smalltalk was built for experimentation. It was never hardened for wide use or reliable application deployment.
  3. Adapting Smalltalk for production use required large engineering teams and in the late 1980s no deep-pocketed large companies were willing to make that investment. Other than IBM and HP, the large computing companies that could have helped ready Smalltalk for production use took other paths.
  4. Small companies that wanted to harden Smalltalk needed to find a business model that would support that large engineering effort. The obvious one was the enterprise application development market.
  5. To break into that market required providing a solution to an important customer problem. The problem the companies latched on to was enterprise migration from green-screens to more modern looking fat-client applications.
  6. Several companies had considerable success in taking Smalltalk to the enterprise market. But these were demanding and fickle customers and the Smalltalk companies became hyper focused on fulfilling their fat-client commitments.
  7. With that focus, the Smalltalk companies were completely unprepared in 1996 to deal with the sudden emergence of the web browser platform and its use within enterprises. At the same time, Sun, a much larger company than any of the Smalltalk-technology driven companies spent vast sums to promote Java as the solution for both web and desktop client applications.
  8. Enterprises diverted their development tool purchases from Smalltalk companies to new businesses who promised web and/or Java based solutions.
  9. Smalltalk revenues rapidly declined and the Smalltalk focused companies failed.
  10. Technologies that have failed in the marketplace seldom get revived.

Smalltalk still has its niche uses and devotees. Sometimes interesting new technologies emerge from that community. Gilad’s Newspeak is a quite interesting rethinking of the language layer of traditional Smalltalk.

Gilad mentioned Smalltalk’s bad luck. But I think it had better than average luck compared to most other attempts to industrialize innovative programming technologies. It takes incredible luck and timing to become of one of the world’s most widely used programming languages. It’s such a rare event that no language designer should start out with that expectation. You really should have some other motivation to justify your effort.

But my question for Newspeak and any similar efforts is: What is your goal? What drives the design? What impact do you want to have?

Smalltalk wasn’t created to rule the software world, it was created to enable the invention of a new software world. I want to hear about compelling visions for the next software world and what we will need to build it. Could Newspeak have such a role?

Categorieën: Mozilla-nl planet

Daniel Stenberg: curl ootw: --ftp-skip-pasv-ip

Mozilla planet - di, 02/06/2020 - 17:32

(Other command line options of the week.)

--ftp-skip-pasv-ip has no short option and it was added to curl in 7.14.2.

Crash course in FTP

Remember how FTP is this special protocol for which we create two connections? One for the “control” where we send commands and read responses and then a second one for the actual data transfer.

When setting up the second connection, there are two ways to do it: the active way and the passive way. The wording there is basically in the eyes of the FTP server: should the server be active or passive in the creation and that’s the key. The traditional underlying FTP commands to do this is either PORT or PASV.

Due to the prevalence of firewalls and other network “complications” these days, the passive style is dominant for FTP. That’s when the client asks the server to listen on a new port (by issuing the PASV command) and then the client connects to the server with a second connection.

The PASV response

When a server responds to a PASV command that the client sends to it, it sends back an IPv4 address and a port number for the client to connect to – in a rather arcane way that looks like this:

227 Entering Passive Mode (192,168,0,1,156,64)

This says the server listens to the IPv4 address 192.168.0.1 on port 40000 (== 156 x 256 + 64).
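
If you ever need to decode one of these responses yourself, the arithmetic is easy to script. Here is a small JavaScript sketch (not part of curl, just an illustration) that pulls the six numbers out of the response and computes the address and port:

// Parse the six comma-separated numbers from a PASV response.
function parsePasv(line) {
  const m = line.match(/\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)/);
  if (!m) throw new Error('not a PASV response');
  const [h1, h2, h3, h4, p1, p2] = m.slice(1).map(Number);
  return { host: `${h1}.${h2}.${h3}.${h4}`, port: p1 * 256 + p2 };
}
console.log(parsePasv('227 Entering Passive Mode (192,168,0,1,156,64)'));
// { host: '192.168.0.1', port: 40000 }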

However, sometimes the server itself isn't perfectly aware of the IP address it is actually accessible at "from the outside". Maybe there's a NAT involved somewhere; maybe there is even more than one NAT between the client and the server.

We know better

For the cases when the server responds with a crazy address, curl can be told to ignore the address in the response and instead assume that the IP address used for the control connection will in fact work for the data connection as well – this is generally true and has actually become even more certain over time as FTP servers these days typically never return a different IP address for PASV.

Enter the “we know better than you” option --ftp-skip-pasv-ip.

What about IPv6 you might ask

The PASV command, as explained above, explicitly only works with IPv4, as it talks about numerical IPv4 addresses. FTP was actually first described in the early 1970s, quite a long time before IPv6 was born.

When FTP got support for IPv6, another command was introduced as a PASV replacement: the EPSV command. If you run curl with -v (verbose mode) when doing FTP transfers, you will see that curl does indeed first try to use EPSV before it eventually falls back and tries PASV if the previous command doesn't work.

The response to the EPSV command doesn’t even include an IP address but then it always assumes the same address as the control connection and it only returns back a TCP port number.

Example

Download a file from that server giving you a crazy PASV response:

curl --ftp-skip-pasv-ip ftp://example.com/file.txt

Related options

Change to active FTP mode with --ftp-port, switch off EPSV attempts with --disable-epsv.

Categorieën: Mozilla-nl planet

Hacks.Mozilla.Org: New in Firefox 77: DevTool improvements and web platform updates

Mozilla planet - di, 02/06/2020 - 16:31

Note: This post is also available in: 简体中文 (Chinese (Simplified)), 繁體中文 (Chinese (Traditional)), and Español (Spanish).

A new stable Firefox version is rolling out. Version 77 comes with a few new features for web developers.

This blog post provides merely a set of highlights; for all the details, check out the following:

Developer tools improvements

Let's start by reviewing the most interesting Developer Tools improvements and additions in 77. If you'd like to see more of the work in progress and give feedback, get Firefox DevEdition for early access.

Faster, leaner JavaScript debugging

Large web apps can provide a challenge for DevTools as bundling, live reloading, and dependencies need to be handled fast and correctly. With 77, Firefox’s Debugger learned a few more tricks, so you can focus on debugging.

After we improved debugging performance over many releases, we did run out of actionable, high-impact bugs. So to find the last remaining bottlenecks, we have been actively reaching out to our community. Thanks to many detailed reports we received, we were able to land performance improvements that not only speed up pausing and stepping but also cut down on memory usage over time.

JavaScript & CSS Source Maps that just work

Source maps were part of this outreach and saw their own share of performance boosts. Some cases of inline source maps improved 10x in load time. More importantly though, we improved reliability for many more source map configurations. We were able to tweak the fallbacks for parsing and mapping, thanks to your reports about specific cases of slightly incorrect generated source maps. Overall, projects that previously failed to load your original CSS and JavaScript/TypeScript/etc. code should now just work.

Step JavaScript in the selected stack frame

Stepping is a big part of debugging but not intuitive. You can easily lose your way and overstep when moving in and out of functions, and between libraries and your own code.

The debugger will now respect the currently selected stack when stepping. This is useful when you’ve stepped into a function call or paused in a library method further down in the stack. Just select the right function in the Call Stack to jump to its currently paused line and continue stepping from there.

Navigating the call stack and continuing stepping further in that function

We hope that this makes stepping through code execution more intuitive and less likely for you to miss an important line.

Overflow settings for Network and Debugger

To make for a leaner toolbar, Network and Debugger follow Console’s example in combining existing and new checkboxes into a new settings menu. This puts powerful options like “Disable JavaScript” right at your fingertips and gives room for more powerful options in the future.

Overflow settings menus in both Network and Debugger toolbar.

Pause on property read & write

Understanding state changes is a problem that is often investigated by console logging or debugging. Watchpoints, which landed in Firefox 72, can pause execution while a script reads a property or writes it. Right-click a property in the Scopes panel when paused to attach them.

Right-click on object properties in Debugger's Scopes to break on get/set

Contributor Janelle deMent made watchpoints easier to use with a new option that combines get/set, so any script reference will trigger a pause.

Improved Network data preview

Step by step over the past releases, the Network details panels have been rearchitected. The old interface had event-handling bugs that made selecting and copying text too flaky. While we were at it, we also improved performance for larger data entries.

This is part of a larger interface cleanup in the Network panel, which we have been surveying our community about via @FirefoxDevTools Twitter and Mozilla’s Matrix community. Join us there to have your voice heard. More parts of the Network-panel sidebar redesign are also available in Firefox DevEdition for early access.

Web platform updates

Firefox 77 supports a couple of new web platform features.

String#replaceAll

Firefox 67 introduced String#matchAll, a more convenient way to iterate over regex result matches. In Firefox 77 we’re adding more comfort: String#replaceAll helps with replacing all occurrences of a string – an operation that’s probably one of those things you have searched for a thousand times in the past already (thanks StackOverflow for being so helpful!).

Previously, when trying to replace all cats with dogs, you had to use a global regular expression:

'cats love cats'.replace(/cats/g, 'dogs'); // "dogs love dogs"

Or, you could use split and join:

'cats love cats'.split('cats').join('dogs'); // "dogs love dogs"

Now, thanks to String#replaceAll, this becomes much more readable:

'cats love cats'.replaceAll('cats', 'dogs'); // "dogs love dogs"

IndexedDB cursor requests

Firefox 77 exposes the request that an IDBCursor originated from as an attribute on that cursor. This is a nice improvement that makes it easier to write things like wrapper functions that “upgrade” database features. Previously, to do such an upgrade on a cursor you’d have to pass in the cursor object and the request object that it originated from, as the former is reliant on the latter. With this change, you now only need to pass in the cursor object, as the request is available on the cursor.
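
As a rough sketch of what this looks like in practice, a cursor callback can now reach its originating request directly. The “library” database and “books” store names below are invented for the example:

// Minimal sketch: an IDBCursor now exposes the request it originated from.
// The database name and object store name are assumptions for this example.
const openReq = indexedDB.open('library', 1);
openReq.onupgradeneeded = () => openReq.result.createObjectStore('books');
openReq.onsuccess = () => {
  const db = openReq.result;
  const cursorReq = db.transaction('books').objectStore('books').openCursor();
  cursorReq.onsuccess = () => {
    const cursor = cursorReq.result;
    if (!cursor) return; // no (more) records to iterate
    console.log(cursor.request === cursorReq); // true, starting with Firefox 77
    cursor.continue();
  };
};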

Extensions in Firefox 77: Fewer permission requests and more

Since Firefox 57, users see the permissions an extension wants to access during installation or when any new permissions are added during an update. The frequency of these prompts can be overwhelming, and failure to accept a new permission request during an extension’s update can leave users stranded on an old version. We’re making it easier for extension developers to avoid triggering as many prompts by making more permissions available as optional permissions. Optional permissions don’t trigger a permission request upon installation or when they are added to an extension update, and can also be requested at runtime so users see what permissions are being requested in context.
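
As a minimal sketch of the pattern, an extension can request an optional permission in context, from a user gesture. The “tabs” permission and the button id are placeholders, and the sketch assumes the permission is declared under optional_permissions in manifest.json:

// Minimal sketch: request an optional permission only when the user enables a feature.
// Assumes manifest.json declares "optional_permissions": ["tabs"]; the button id is made up.
// browser.permissions.request() must be called from a user action handler, such as a click.
document.getElementById('enable-tab-feature').addEventListener('click', async () => {
  const granted = await browser.permissions.request({ permissions: ['tabs'] });
  if (granted) {
    console.log('tabs permission granted, enabling the feature');
  } else {
    console.log('permission declined, keeping the feature off');
  }
});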

Visit the Add-ons Blog to see more updates for extensions in Firefox 77!

Summary

These are the highlights of Firefox 77! Check out the new features and have fun playing! As always, feel free to give feedback and ask questions in the comments.

The post New in Firefox 77: DevTool improvements and web platform updates appeared first on Mozilla Hacks - the Web developer blog.

Categorieën: Mozilla-nl planet

The Mozilla Blog: Pocket provides fascinating reads from trusted sources in the UK with newest Firefox

Mozilla planet - di, 02/06/2020 - 15:00

It’s a stressful and strange time. Reading the news today can feel overwhelming, repetitive, and draining. We all feel it. We crave new inputs and healthy diversions—stories that can fuel our minds, spark fresh ideas, and leave us feeling recharged, informed, and inspired.

Connecting people with such stories is what we do at Pocket. We surface and recommend exceptional stories from across the web to nearly 40 million Firefox users in the U.S., Canada, and Germany each month. More than 4 million subscribers to our Pocket Hits newsletters (available in English and in German) see our curated recommendations each day in their inboxes.

Today we’re pleased to announce the launch of Pocket’s article recommendations for Firefox users in the United Kingdom. The expansion into the UK was made seamless thanks to our successes with English-language recommendations in the U.S. and Canada.

What does this mean for Firefox users in the UK? Open a new tab every day and see a curated selection of recommended stories from Pocket. People will see thought-provoking essays, hidden gems, and fascinating deep-dives from UK-based publishers both large and small — and other trusted global sources from across the web.

Open a new tab to see a curated selection of recommended stories

Where do these recommendations come from? Pocket readers. Pocket has a diverse, well-read community of users who help us surface some of the best stories on the web. Using our flagship Pocket app and save button (built right into Firefox), our users save millions of articles each day. The data from our most saved, opened, and read articles is aggregated; our curators then sift through and recommend the very best of these stories with the wider Firefox and Pocket communities.

The result is a unique alternative to the vast array of content feeds out there today. Instead of breaking news, users will see stories that dig deep into a subject, offer a new perspective, and come from outlets that might be outside their normal reading channels. They’ll find engrossing science features, moving first-person narratives, and entertaining cooking and career how-tos. They’ll discover deeply reported business features, informative DIY guides, and eye-opening history pieces. Most of all, they’ll find stories worthy of their time and attention, curated specifically for Firefox users in the United Kingdom. Publishers, too, will benefit from a new stream of readers to their high-quality content.

Pocket delivers these recommendations with the same dedication to privacy that people have come to expect from Firefox and Mozilla. Recommendations are drawn from aggregate data, and neither Mozilla nor Pocket receives Firefox browsing history or data, or is able to view the saved items of an individual Pocket account. A Firefox user’s browsing data never leaves their own computer or device.

We welcome new Pocket readers in the UK — alongside our readers in the U.S., Canada, and Germany — and hope you find your new tab is a breath of fresh air and a stimulating place to refuel and recharge at a time when you may be needing it most.

Download Firefox to get thought-provoking stories from around the web with every new tab. Be sure to enable the recommendations to begin reading.

The post Pocket provides fascinating reads from trusted sources in the UK with newest Firefox appeared first on The Mozilla Blog.

Categorieën: Mozilla-nl planet

This Week In Rust: This Week in Rust 341

Mozilla planet - di, 02/06/2020 - 06:00

Hello and welcome to another issue of This Week in Rust! Rust is a systems language pursuing the trifecta: safety, concurrency, and speed. This is a weekly summary of its progress and community. Want something mentioned? Tweet us at @ThisWeekInRust or send us a pull request. Want to get involved? We love contributions.

This Week in Rust is openly developed on GitHub. If you find any errors in this week's issue, please submit a PR.

There is no This Week in Rust podcast this week; next week’s episode will cover both this week and next week.

We Stand With You

Since our previous issue, there has been a lot of news about the civil rights discourse in the United States, spawned by the murder of George Floyd by a member of the Minneapolis Police Department. We stand with Black Lives Matter and our Black siblings now and always.

We believe this is not a matter of taking a political stance, but a matter of supporting basic human rights and equality.

We believe that Rustaceans have a duty to our community and to the rest of the world to ensure that people feel comfortable and welcome wherever they may be. In our own community, the Rust Code of Conduct explicitly states that we intend to make everybody feel safe, but this does not just apply to us.

Just as we support Rustaceans, we also support humanity as a whole. It is time for social progress to be made. We support those risking their own well-being to show support for George Floyd, Breonna Taylor, Ahmaud Arbery, and everyone else who has faced injustice at the hands of members of the police. We stand with the protesters hoping to make the world better.

If you want to show your support, here is a website of curated resources. We encourage you to speak out, as one more voice is one step closer to a better world.

Updates from Rust Community

News & Blog Posts

Crate of the Week

This week's crate is jql, a JSON Query Language CLI tool.

Thanks to Davy Duperron for the suggestion!

Submit your suggestions and votes for next week!

Call for Participation

Always wanted to contribute to open-source projects but didn't know where to start? Every week we highlight some tasks from the Rust community for you to pick and get started!

Some of these tasks may also have mentors available, visit the task page for more information.

If you are a Rust project owner and are looking for contributors, please submit tasks here.

Updates from Rust Core

442 pull requests were merged in the last week

Rust Compiler Performance Triage

This is a new section containing the results of a weekly check on how rustc's perf has changed.

Approved RFCs

Changes to Rust follow the Rust RFC (request for comments) process. These are the RFCs that were approved for implementation this week:

Final Comment Period

Every week the team announces the 'final comment period' for RFCs and key PRs which are reaching a decision. Express your opinions now.

RFCs

No RFCs are currently in the final comment period.

Tracking Issues & PRs

New RFCs

Upcoming Events

Online

North America

If you are running a Rust event please add it to the calendar to get it mentioned here. Please remember to add a link to the event too. Email the Rust Community Team for access.

Rust Jobs

Tweet us at @ThisWeekInRust to get your job offers listed here!

Quote of the Week

Rust enables belligerent refactoring – making dramatic changes and then working with the compiler to bring the project back to a working state.

Pankaj Chaudhary on Knoldus Blog

Thanks to Maxim Vorobjov for the suggestions!

Please submit quotes and vote for next week!

This Week in Rust is edited by: nellshamrell, llogiq, and cdmistman.

Discuss on r/rust

Categorieën: Mozilla-nl planet

The Mozilla Blog: We’ve Got Work to Do

Mozilla planet - ma, 01/06/2020 - 21:14

The promise of America is “liberty and justice for all.” We must do more to live up to this promise. The events of last week once again shine a spotlight on how much systemic change is still required. These events  — the deaths at the hands of police and civilians, the accusations that are outright lies — are not new, and are not isolated. African Americans continue to pay an obscene and unacceptable price for our nation’s failure to rectify our history of racial discrimination and violence. As a result, our communities and our nation are harmed and diminished.

Change is required. That change involves all of us. It’s not immediately clear all the actions an organization like Mozilla should take, but it’s clear action is required. As a starting point, we will use our products to highlight black and other under-represented voices in this unfolding dialog. And we’re looking hard at other actions, across the range of our activities and assets.

Closer to home we’ve reiterated our support for our black colleagues. We recognize the disproportionate impact of these events, as well as the disproportionate effect of COVID-19 on communities of color. We recognize that continued diligence could lead others to think it is “business as usual.” We know that it is not.

And this has left many of us once again, questioning how to meaningfully make our world better. As our starting point, Mozilla is committed to continuing to support our black employees, expanding our own diversity, and using our products to build a better world.

The post We’ve Got Work to Do appeared first on The Mozilla Blog.

Categorieën: Mozilla-nl planet

About:Community: Firefox 77 new contributors

Mozilla planet - ma, 01/06/2020 - 16:07

With the release of Firefox 77, we are pleased to welcome the 38 developers who contributed their first code change to Firefox in this release, 36 of whom were brand new volunteers! Please join us in thanking each of these diligent and enthusiastic individuals, and take a look at their contributions:

Categorieën: Mozilla-nl planet

Daniel Stenberg: on-demand buffer alloc in libcurl

Mozilla planet - za, 30/05/2020 - 23:22

Okay, so I’ll delve a bit deeper into the libcurl internals than usual here. Beware of low-level talk!

There’s a never-ending stream of things to polish and improve in a software project and curl is no exception. Let me tell you what I fell over and worked on the other day.

Smaller than what holds Linux

We have users who are running curl on tiny devices, often put under the label of Internet of Things, IoT. These small systems typically have maybe a megabyte or two of ram and flash and are often too small to even run Linux. They typically run one of the many different RTOS flavors instead.

It is with these users in mind I’ve worked on the tiny-curl effort. To make curl a viable alternative even there. And believe me, the world of RTOSes and IoT is literally filled with really low quality and half-baked HTTP client implementations. Often certainly very small but equally as often with really horrible shortcuts or protocol misunderstandings in them.

Going with curl in your IoT device means going with decades of experience and reliability. But for libcurl to be an option for many IoT devices, a libcurl build has to be able to get really small. Both the footprint on storage but also in the required amount of dynamic memory used while executing.

Being feature-packed and attractive for the high-end users and yet at the same time being able to get really small for the low-end is a challenge. And who doesn’t like a good challenge?

Reduce reduce reduce

I’ve set myself on a quest to make it possible to build libcurl smaller than before and to use less dynamic memory. The first tiny-curl releases were only the beginning and I already then aimed for a libcurl + TLS library within 100K storage size. I believe that goal was met, but I also think there’s more to gain.

I will make tiny-curl smaller and use less memory by making sure that when we disable parts of the library or disable specific features and protocols at build-time, they should no longer affect storage or dynamic memory sizes – as far as possible. Tiny-curl is a good step in this direction but the job isn’t done yet – there’s more “dead meat” to carve off.

One example is my current work (PR #5466) on making sure there’s much less proxy remainders left when libcurl is built without support for such. This makes it smaller on disk but also makes it use less dynamic memory.

To decrease the maximum amount of allocated memory for a typical transfer, and in fact for all kinds of transfers, we’ve just switched to a model with on-demand download buffer allocations (PR #5472). Previously, the download buffer for a transfer was allocated at the same time as the handle (in the curl_easy_init call) and kept allocated until the handle was cleaned up again (with curl_easy_cleanup). Now, we instead lazy-allocate it first when the transfer starts, and we free it again immediately when the transfer is over.

It has several benefits. For starters, the previous initial allocation would always first allocate the buffer using the default size, and the user could then set a smaller size that would realloc a new smaller buffer. That double allocation was of course unfortunate, especially on systems that really do want to avoid mallocs and want a minimum buffer size.

The “price” of handling many handles drastically went down, as only transfers that are actively in progress will actually have a receive buffer allocated.

A positive side-effect of this refactor is that the internal “closure handle” now doesn’t use any buffer allocation at all. That’s the “spare” handle we create internally to associate certain connections with when there are no user-provided handles left but we still need one, for example to close down an FTP connection, since that involves a command/response procedure.

Downsides? It means a slight increase in the number of allocations and frees of dynamic memory when doing new transfers. We do however deem this a sensible trade-off.

Numbers

I always hesitate to bring up numbers since it will vary so much depending on your particular setup, build, platform and more. But okay, with that said, let’s take a look at the numbers I could generate on my dev machine. A now rather dated x86-64 machine running Linux.

For measurement, I perform a standard single transfer, getting an 8GB file from http://localhost and writing it to /dev/null:

curl -s http://localhost/8GB -o /dev/null

With all the memory calls instrumented, my script counts the number of memory alloc/realloc/free/etc calls made as well as the maximum total memory allocation used.

The curl tool itself sets the download buffer size to a “whopping” 100K buffer (as it actually makes a difference to users doing for example transfers from localhost or other really high bandwidth setups or when doing SFTP over high-latency links). libcurl is more conservative and defaults it to 16K.

This command line of course creates a single easy handle and makes a single HTTP transfer without any redirects.

Before the lazy-alloc change, this operation would peak at 168978 bytes allocated. As you can see, the 100K receive buffer is a significant share of the memory used.

After the alloc work, the exact same transfer instead ended up using 136188 bytes.

102,400 of those bytes are the receive buffer, meaning the amount of “extra” allocated data dropped from 66,578 to 33,807, a reduction of 49%.

Even tinier tiny-curl: in a feature-stripped tiny-curl build that does HTTPS GET only with a mere 1K receive buffer, the total maximum amount of dynamically allocated memory is now below 25K.

Caveats

The numbers mentioned above only count allocations done by curl code. It does not include memory used by system calls or, when used, third party libraries.

Landed

The changes mentioned in this blog post have landed in the master branch and will ship in the next release: curl 7.71.0.

Categorieën: Mozilla-nl planet
