Planet Mozilla - the Dutch Mozilla community
https://planet.mozilla.org/

Mozilla Addons Blog: Friend of Add-ons: Juraj Mäsiar

ma, 15/06/2020 - 17:32

Our newest Friend of Add-ons is Juraj Mäsiar! Juraj is the developer of several extensions for Firefox, including Scroll Anywhere, which is part of our Recommended Extensions program. He is also a frequent contributor on our community forums, where he offers friendly advice and input for extension developers looking for help.

Juraj first started building extensions for Firefox in 2016 during a quiet weekend trip to his hometown. The transition to the WebExtensions API was less than a year away, and developers were starting to discuss their migration plans. After discovering many of his favorite extensions weren’t going to port to the new API, Juraj decided to try the migration process himself to give a few extensions a second life.  “I was surprised to see it’s just normal JavaScript, HTML and CSS — things I already knew,” he says. “I put some code together and just a few moments later I had a working prototype of my ScrollAnywhere add-on. It was amazing!”

Juraj immersed himself in exploring the WebExtensions API and developing extensions for Firefox. It wasn’t always a smooth process, and he’s eager to share some tips and tricks to make the development experience easier and more efficient. “Split your code to ES6 modules. Share common code between your add-ons — you can use `git submodule` for that. Automate whatever can be automated. If you don’t know how, spend the time learning how to automate it instead of doing it manually,” he advises. Developers can also save energy by not reinventing the wheel. “If you need a build script, use webpack. Don’t build your own DOM handling library. If you need complex UI, use existing libraries like Vue.js.”
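
As a purely illustrative sketch of that advice (the file and function names below are invented, not taken from Juraj's add-ons), a small helper module kept in its own repository can be pulled into each add-on with `git submodule` and imported as an ES6 module:

```js
// --- lib/scroll-utils.js (shared repository, added to each add-on with `git submodule add <repo-url> lib`) ---
export function clamp(value, min, max) {
  return Math.min(Math.max(value, min), max);
}

// --- options.js (one add-on reusing the shared module) ---
// Loaded from options.html with <script type="module" src="options.js"></script>
import { clamp } from "./lib/scroll-utils.js";

document.querySelector("#speed").addEventListener("change", (event) => {
  // Keep a user-entered scroll speed within a sane range.
  event.target.value = String(clamp(Number(event.target.value), 1, 100));
});
```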

Juraj recommends staying active, saying, “Doing enough sport every day will keep your mind fresh and ready for new challenges.” He stays active by playing VR games and rollerblading.

Currently, Juraj is experimenting with the CryptoAPI and testing it with a new extension that will encrypt user notes and synchronize them with Firefox Sync. The goal is to create a secure extension that can be used to store sensitive material, like a server configuration or a home wifi password.
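
The post doesn't include code, but a minimal sketch of the underlying idea, encrypting a note with the browser's built-in SubtleCrypto interface (the Web Crypto API), might look like the following; key storage and the Firefox Sync integration are omitted, and the function names are ours, not Juraj's:

```js
// Minimal sketch: encrypt and decrypt a note with AES-GCM via the Web Crypto API.
// Key management and syncing through Firefox Sync are deliberately left out.
async function encryptNote(plaintext, key) {
  const iv = crypto.getRandomValues(new Uint8Array(12)); // fresh 96-bit nonce per note
  const data = new TextEncoder().encode(plaintext);
  const ciphertext = await crypto.subtle.encrypt({ name: "AES-GCM", iv }, key, data);
  return { iv, ciphertext };
}

async function decryptNote({ iv, ciphertext }, key) {
  const plaintext = await crypto.subtle.decrypt({ name: "AES-GCM", iv }, key, ciphertext);
  return new TextDecoder().decode(plaintext);
}

// Usage: generate a key, encrypt a sensitive note, and read it back.
(async () => {
  const key = await crypto.subtle.generateKey(
    { name: "AES-GCM", length: 256 },
    false, // non-extractable
    ["encrypt", "decrypt"]
  );
  const secret = await encryptNote("home wifi password: hunter2", key);
  console.log(await decryptNote(secret, key)); // "home wifi password: hunter2"
})();
```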

On behalf of the Add-ons Team, thank you for all of your wonderful contributions to our community, Juraj!

If you are interested in getting involved with the add-ons community, please take a look at our current contribution opportunities.

The post Friend of Add-ons: Juraj Mäsiar appeared first on Mozilla Add-ons Blog.

Christopher Arnold: Money, friction and momentum on the web

zo, 14/06/2020 - 05:21
Back when I first moved to Silicon Valley in 1998, I tried to understand how capital markets made the valley such a unique place for inventors and entrepreneurs.  Corporate stocks, real estate, international currency and commodities markets were concepts I was well familiar with from my time working at a financial news service in the nation's capital in the mid-1990s.  However, crowdfunding and angel investing were new concepts to me 20 years ago.  Crowdfunding platforms seemed to be more to the advantage of the funding recipient than the balanced two-sided exchanges of the commercial financial system.  I often wondered what motivates generosity-driven models and how they differ from reward-driven sponsorships.

When trying to grasp the way angel investors think about entrepreneurship, my friend Willy, a serial entrepreneur and investor, said: “If you want to see something succeed, throw money at it!”  The idea behind the "angel" is that they are the riskiest of risk-capital.  Angel investors join the startup funding before banks and venture capital firms.  They seldom get payback in kind from the companies they sponsor and invest in.  Angels are the lenders of first resort for founders because they tend to be more generous, more flexible and more forgiving than conventional lenders.  They value the potential success of the venture far more than they value the money they put forth.  And the contributions of an angel investor can have an outsized benefit in the early stage of an initiative by sustaining the founder/creator at their most vulnerable point.  But what is this essence they get out of it that is worth more than money to them?

Over the course of the last couple of decades I've become a part of the crowdfunding throng of inventors and sponsors.  I have contributed to small business projects on Kiva in over 30 countries, and backed many small-scale projects across Kickstarter, Indiegogo and Appbackr.  I've also been on the receiving side, having had the chance to pitch my company for funding on Sand Hill Road, the strip of venture capital firms in the hills near Palo Alto.  As a funder, it has been very enlightening to know that I can be part of someone else's project by chipping in time, sharing insights and capital to get improbable projects off the ground.  And the excitement of following the path of the entrepreneurs has been the greatest reward.  As a founder, I remember framing the potential of a future that, if funded, would yield significant returns to the lenders and shareholders.  Of course, the majority of new ventures do not come to market in the form of their initial invention.  Some of the projects I participated in have launched commercially and I've been able to benefit, by getting shares in a growing venture or nifty gadgets and software as part of the pre-release test audience.  But those things aren't the reward I was seeking when I signed up.  It was the energy of participating in the innovation process and the excitement of making a difference.  After many years of working in the corporate world, I became hooked on the idea of working with the engineers and developers who are bringing about the next generation of the web's expressive and experiential platforms.

During the Augmented World Expo in May, I attended a conference session called "Web Monetization and Social Signaling," hosted by Anselm Hook, a researcher at the web development non-profit Mozilla, where I also work.  He made an interesting assertion during his presentation, "Money appears to be a form of communication."  His study was observing platform-integrated social signals (such as up-voting, re-tweeting and applauding with hand-clapping emojis) to draw attention to content users had discovered on the web, in this case within the content recommendation platform of the Firefox Reality VR web browser.  There are multiple motivations and benefits for this kind of social signaling.  It serves as a bookmarking method for the user, it increases the content's visibility to friends who might also like the content, it signals affinity with the content as part of one's own identity and it gives reinforcement to the content/comment provider.  Anselm found in his research that participants actually reacted more strongly when they believed their action contributed financial benefit directly to the other participant.  Meaning, we don't just want to use emojis to make each other feel good about their web artistry.  In some cases, we want to cause profit for the artist/developer directly.  Perhaps a gesture of a smiley-face or a thumb is adequate to assuage our desire to give big-ups to an artist, and we can feel like our karmic balance book is settled.  But what if we want to do more than foist colored pixels on each other?  Could the web do more to allow us to financially sustain the artist wizards behind the curtain?  Can we "tip" the way we do our favorite street musicians?  Not conveniently, because the systems we have now mostly rely on the credit card.  But in the offline context, do we interrupt a street busker to ask for their Venmo or Paypal account?  We typically use cash, which has only rough analogues as of yet in our digital lives.

When I lived in Washington DC, I had the privilege to see the great Qawwali master Nusrat Fateh Ali Khan in concert.  Qawwali is a style of inspired Sufi mystical chant combined with call-and-response singing with a backup ensemble.  Listening for hours as his incantations built from quiet mutterings accompanied by harmonium and slow-paced drums to a crescendo of shouts and wails of devotion at the culmination of his songs was very transporting, in spite of my dissimilar cultural upbringing and language.  What surprised me, beyond the amazing performance of course, was that as the concert progressed people in the audience would get up, dance and then hurl money at the stage.  "This is supposed to be a devotional setting, isn't it?  Hurling cash at the musicians seems so profane," I thought.  But apparently this is something that one does at concerts in Pakistan.  The relinquishing of cash is devotional, like Thai Buddhists offering gold leaf by pressing it into the statues of their teachers and monks.  Money is a form of communication of the ineffable appreciation we feel toward those of greatness in the moment of connection or the moment of realization of our indebtedness.  Buying is a different form of exchange that is personal but not expressive.  When we buy, it is disconnected from the artistry of the moment.  No lesser appreciation for sure.  It's different because it isn't social signaling, it's coveting.  When in concerts or in real-time scenarios we bestow our bounty upon another, it is an act of making sacrifice and conferring benefit.  The underlying meaning of it may be akin to "I hope you succeed!" or, "I relinquish my having so that you might have."  I'm glossing over the cultural complexity of the gesture surely.  Japanese verbs have subtle ways to distinguish the transfer/receipt of benefit according to seniority, societal position and degree of humility: giving upward ("ageru/agemasu"), giving downward ("kudasai/kudasaru"), giving laterally ("kureru/morau").  The psychological subtlety of the transfer of boons between individuals is scripted deeply within us, all the more accentuating how a plastic card or a piece of paper barely captures the breadth of expression we caring animals have.

The web of yesteryear has done a really good job of covering the coveting use case.  Well done, web! Now, what do we build for an encore?  How can we emulate the other expressions of human intent that coveting and credit cards don't cover?

In the panic surrounding the current Covid pandemic, I felt a sense of being disconnected from the community I am usually rooted in.  I sought information about those affected internationally in the countries I've visited and lived in, where my friends and favorite artists live.  I sought out charitable organizations positioned there and gave them money, as it was the least I felt I could do to reach those impacted by the crisis remote from me.  Locally, my network banded together to find ways that we could mobilize to help those affected in our community.  We found that "gift cards" (paper coupons) could be used to move cash quickly into the coffers of local businesses so they could meet short-term spending needs to keep their employees paid and their businesses operational even while their shops were forced into closure in the interest of public health.  I found the process very slow and cumbersome, as I had to write checks, give out credit cards (to places where I would never typically share sensitive financial data), find email addresses for people to send PayPal payments to, and in some cases resort to paper cash for those whom the web could not reach.

This experience made me keenly aware that the systems we have on the web don't replicate the ways we think and the ways that we express our generosity in the modern world.  As web developers, we need to enable a new kind of gesture akin to the act of tipping with cash in offline society.  Discussing this with my friend Aneil, he asserted that both anonymous donor platforms like Patreon and blockchain currencies can fit the bill for addressing the donor need, if the recipient is set up to receive them. He cautioned that online transactions are held to a different standard than cash in US society because of “Know Your Customer” regulation, which was put in place to stem the risk of money laundering through anonymous transactions. As we discussed the idea of peer-to-peer transactions in virtual environments, he pointed out, “The way game companies get around that is to have consumers purchase in-game credits that cannot be converted back into money.” The government is fine with people putting money into something. It’s the extraction from the flow of exchange in a monetary sense that needs to be subject to the regulations designed for taxation and investment controls.

Patreon, like PayPal, is a cash-value paired system, while virtual currencies such as Bitcoin, BAT and Ethereum can be variable in exchange value for their coin. Blockchain ledger transactions trace exactly who gave what to whom. So, they are in theory able to comply with KYC restrictions even in situations where the exchange is relatively anonymous. Yet they are wildly different in terms of how the currency holders perceive their value. Aneil pointed out that Bitcoin is bad for online transactions because its scarcity model incentivizes people to hold onto it. It’s like gold, a slow currency. A valuable cryptocurrency would therefore slow down rather than facilitate donation and tipping. You need a currency that people are comfortable holding for only short periods of time, like the funds in a Kiva or Patreon wallet. If people are always withdrawing from the currency for fear of its losing value, then the currency itself isn’t stable enough to be the basis of a robust transaction system. For instance, when I was in Zimbabwe, where inflation in the paper currency is incredibly high, people wanted to get rid of it quickly for some other asset that lost value more slowly than the paper notes. Similarly, Aneil pointed out, any coin that you use to transact virtually could suffer the incentive to cash out quickly, which would drive the value of the asset in a fluid marketplace lower. Cash proxies don’t have an inherent value unless they are underpinned by an artificial or perceived scarcity mechanism.  The US government has an agency, the Federal Reserve, whose mission is to ensure that money depreciates slowly enough that the underlying credit of the government stays stable and encourages growth of its economy.  Any other currency system would need the same.  Bitcoin can't be it because of its exceedingly high scarcity, which leads to hoarding.  Until web developers solve this friction problem, web transactions, and therefore web authorship, will be starved of the support they need to grow.

Understanding this underlying problem of financial sustainability, my colleague Anselm is working with crypto-currency enabler Coil to try to apply crypto-currency sponsorship to peer and creator/recipient exchanges on the web.  He envisions a future where users could casually exchange funds in a virtual, web-based or real-world "augmented reality" transaction without needing to exchange credit card or personal account data.  This may sound mundane, because the use-case and need is obvious, as we're used to doing it with cash every day.  The question it raises is: why can't the web do this?  Why do I need to exchange credit cards (sensitive data) or email (less sensitive, but not public) if I just want to send credits or tips to somebody?  There was an early success in this kind of micropayments model when Anshe Chung became the first person to make a million dollars selling virtual goods to Second Life enthusiasts.  The Linden Lab virtual platform allowed users to pay money to other peer users inside the virtual environment.  With a bit more development collaboration, this kind of model may be beneficial to others outside of specific game environments.

Anselm's speech at AWE introduced the concept of a "tip-jar," something we're familiar with from everyday offline life, to the nascent ecosystem of virtual and augmented reality web developers.  For most people who are used to high-end software being sold as apps in a marketplace like the iTunes App Store or Google Play Store, the idea that we would pay web pages may seem peculiar.  But it's not too far a leap from how we generally spend our money in society.  Leaving tips with cash is common practice for Americans.  Even when service fees are not required, Americans tend to tip generously.  Lonely Planet dedicates sections of its guidebooks to local customs around money, and from those I gather that Americans have a looser idea of tip amounts than people in other countries.
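
For a flavor of how such a browser-level tip jar can work, the Web Monetization draft that Coil implemented at the time lets a page declare a payment pointer and react as micropayments stream in. This is a generic illustration based on that draft, not code from Anselm's talk, and the payment pointer is a placeholder:

```js
// Web Monetization (2020 draft implemented by Coil): generic illustration only.
// The page opts in by declaring a payment pointer in its <head>:
//   <meta name="monetization" content="$wallet.example.com/my-tip-jar">
// A visitor with a Coil-style provider then streams tiny payments to that wallet,
// and the page can react, e.g. by thanking the visitor or hiding its "tip" banner.
if (document.monetization) {
  document.monetization.addEventListener("monetizationprogress", (event) => {
    const { amount, assetCode, assetScale } = event.detail;
    const received = Number(amount) / 10 ** assetScale;
    console.log(`Streamed ${received} ${assetCode} to this page so far`);
  });
} else {
  console.log("No Web Monetization provider detected in this browser.");
}
```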

Anselm and the team managing the "Grant for the Web" hope to bring this kind of peer-to-peer mechanism to the broader web around us by utilizing Coil's grant of $100 Million in crypto-currency toward achieving this vision. 

If you're interested in learning more about the web monetization initiative from Coil and Mozilla, please visit: https://www.grantfortheweb.org/

Daniel Stenberg: curl meets gold level best practices

za, 13/06/2020 - 22:43

About four years ago I announced that curl was 100% compliant with the CII Best Practices criteria. curl was one of the first projects on that train to reach 100% – primarily of course because we were early joiners and participants of the Best Practices project.

The point of that was just to highlight and underscore that we do everything we can in the curl project to act as a responsible open source project and citizen of the larger ecosystem. You should be able to trust curl, in every aspect.

Going above and beyond basic

Subsequently, the best practices project added higher levels of compliance: a whole series of additional requirements to meet if you want to grade yourself at silver or even gold level. At the time those were added, I felt they asked for quite a lot of specifics that we didn’t provide in the curl project, and with a bit of a sigh I accepted that we would remain at “just” 100% basic compliance, only part of the way toward Silver and Gold. A little disheartening of course, because I always want curl to be at the top.

So maybe Silver?

I had left that entry listing in a dusty corner of my brain and hadn’t considered it much lately, when I noticed the other day the announcement that the Linux kernel project had reached gold level best practices.

That’s a project with around 50 times more developers and commits than curl for an average release (and an even greater multiplier for the amount of code), so I’m not suggesting the two projects are comparable in any sense. But it made me remember our entry on the CII Best Practices web site.

I came back, updated a few fields that no longer seemed entirely correct, and all of a sudden curl quite unexpectedly had 100% compliance at the Silver level!

Further?

If Silver was achievable, what’s actually left for gold?

Sure enough, soon there were only a few remaining criteria left “unmet” and after some focused efforts in the project, we had created the final set of documents with information that was previously missing. When we could finally fill in links to those docs in the last few entries, the curl project found itself also scoring 100% at the Gold level.

Best Practices: Gold Level

What does it mean for us? What does it mean for you, our users?

For us, it is a double-check and verification that we’re doing the right things, that we are providing the right information in the project, and that we haven’t forgotten anything major. We already knew that we were doing everything open source in a pretty good way, but getting a bunch of criteria that insisted on a number of things also made us go the extra mile and really provide information for everything in written form, including some of what previously was only implied, discussed on IRC, or read between the lines in various pull requests.

I’m proud to lead the curl project and I’m proud of all our maintainers and contributors.

For users, having curl reach gold level makes it visible that we’re that kind of open source project. We’re part of this top clique of projects. We care about every little open source detail and this should instill trust and confidence in our users. You can trust curl. We’re a golden open source project. We’re with you all the way.

The final criterion we checked off

Which was the last criterion of them all for curl to fulfill to reach Gold?

The project MUST document its code review requirements, including how code review is conducted, what must be checked, and what is required to be acceptable (link)

This criterion is now fulfilled by the brand new document CODE_REVIEW.md.

What’s next?

We’re working on the next release. We always do. Stop the slacking now and get back to work!

Credits

Gold image by Erik Stein from Pixabay

Patrick Cloke: Raspberry Pi File Server

vr, 12/06/2020 - 22:55

These are just some quick notes (for myself) on how I recently set up my Raspberry Pi as a file server. The goal was to have a shared folder so that a Sonos could play music from it. The data would be backed by a microSD card plugged in via USB.

  1. Update …

Chris H-C: This Week in Glean: Project FOG Update, end of H12020

vr, 12/06/2020 - 22:24

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

It’s been a while since last I wrote on Project FOG, so I figure I should update all of you on the progress we’ve made.

A reminder: Project FOG (Firefox on Glean) is the year-long effort to bring the Glean SDK to Firefox. This means answering such varied questions as “Where are the docs going to live?” (here) “How do we update the SDK when we need to?” (this way) “How are tests gonna work?” (with difficulty) and so forth. In a project this long you can expect updates from time-to-time. So where are we?

First, we’ve added the Glean SDK to Firefox Desktop and include it in Firefox Nightly. This is only a partial integration, though, so the only builtin ping it sends is the “deletion-request” ping when the user opts out of data collection in the Preferences. We don’t actually collect any data, so the ping doesn’t do anything, but we’re sending it and soon we’ll have a test ensuring that we keep sending it. So that’s nice.

Second, we’ve written a lot of Design Proposals. The Glean Team and all the other teams our work impacts are widely distributed across a non-trivial fragment of the globe. To work together and not step on each others’ toes we have a culture of putting most things larger than a bugfix into Proposal Documents which we then pass around asynchronously for ideation, feedback, review, and signoff. For something the size and scope of adding a data collection library to Firefox Desktop, we’ve needed more than one. These design proposals are Google Docs for now, but will evolve to in-tree documentation (like this) as the proposals become code. This way the docs live with the code and hopefully remain up-to-date for our users (product developers, data engineers, data scientists, and other data consumers), and are made open to anyone in the community who’s interested in learning how it all works.

Third, we have a Glean SDK Rust API! Sorta. To limit scope creep we haven’t added the Rust API to mozilla/glean and are testing its suitability in FOG itself. This allows us to move a little faster by mixing our IPC implementation directly into the API, at the expense of needing to extract the common foundation later. But when we do extract it, it will be fully-formed and ready for consumers since it’ll already have been serving the demanding needs of FOG.

Fourth, we have tests. This was a bit of a struggle as the build order of Firefox means that any Rust code we write that touches Firefox internals can’t be tested in Rust tests (they must be tested by higher-level integration tests instead). By damming off the Firefox-adjacent pieces of the code we’ve been able to write and run Rust tests of the metrics API after all. Our code coverage is still a little low, but it’s better than it was.

Fifth, we are using Firefox’s own network stack to send pings. In a stroke of good fortune the application-services team (responsible for fan-favourite Firefox features “Sync”, “Send Tab”, and “Firefox Accounts”) was bringing a straightforward Rust networking API called Viaduct to Firefox Desktop almost exactly when we found ourselves in need of one. Plugging into Viaduct was a breeze, and now our “deletion-request” pings can correctly work their way through all the various proxies and protocols to get to Mozilla’s servers.

Sixth, we have firm designs on how to implement both the C++ and JS APIs in Firefox. They won’t be fully-fledged language bindings the way that Kotlin, Python, and Swift are (( they’ll be built atop the Rust language binding so they’re really more like shims )), but they need to have every metric type and every metric instance that a full language binding would have, so it’s no small amount of work.

But where does that leave our data consumers? For now, sadly, there’s little to report on both the input and output sides: We have no way for product engineers to collect data in Firefox Desktop (and no pings to send the data on), and we have no support in the pipeline for receiving data, not that we have any to analyse. These will be coming soon, and when they do we’ll start cautiously reaching out to potential first customers to see whether their needs can be satisfied by the pieces we’ve built so far.

And after that? Well, we need to do some validation work to ensure we’re doing things properly. We need to implement the designs we proposed. We need to establish how tasks accomplished in Telemetry can now be accomplished in the Glean SDK. We need to start building and shipping FOG and the Glean SDK beyond Nightly to Beta and Release. We need to implement the builtin Glean SDK pings. We need to document the designs so others can understand them, best practices so our users can follow them, APIs so engineers can use them, test guarantees so QA can validate them, and grand processes for migration from Telemetry to Glean so that organizations can start roadmapping their conversions.

In short: plenty has been done, and there’s still plenty to do. 

I guess we’d better be about it, then.

:chutten

Daniel Stenberg: 800 authors and counting

vr, 12/06/2020 - 17:24

Today marks the day when we merged the commit authored by the 800th person in the curl project.

We turned 22 years old this spring, but it really wasn’t until 2010, when we switched to git, that we started to properly keep track of every single author in the project. Since then we’ve seen a lot of new authors and a lot of new code.

The “explosion” is clearly visible in this graph, generated with fresh data just this morning (while we were still at just 799 authors). See how we’ve grown by maybe 250 authors since 1 Jan 2018.

Author number 800 is named Nicolas Sterchele and he submitted an update of the TODO document. Appreciated!

As the graph above also shows, a majority of all authors only ever authored a single commit. If you did 10 commits in the curl project, you reach position #61 among all the committers, while 100 commits take you all the way up to position #13.

Become one!

If you too want to become one of the cool authors of curl, a fine starting point for that journey could be the Help Us document. If that’s not enough, you’re also welcome to contact me privately or maybe join the IRC channel for some socializing and “group mentoring”.

If we keep this up, we could reach 1,000 authors in 2022…

Cameron Kaiser: TenFourFox FPR23 for Intel available

vr, 12/06/2020 - 06:20
Ken Cunningham figured out the build issues he was having with the Intel version and has updated TenFourFox for Intel systems to FPR23, now up to date with the Power Mac version. As always, there is no support for any Intel build of TenFourFox; do not report issues to Tenderapp. You can get it from SourceForge.

Ken's patches have also been incorporated into the tree, along with a workaround submitted by Raphaël Guay to deal with Twitch overflowing our JIT stack. This is probably due to something we don't support causing infinite function call recursion, since with the JIT disabled it correctly just runs out of stack and stops. There is no way to increase the stack further, since we are strictly 32-bit builds and the stack already consumes 1GB of our 2.2-ish GB available, so we need to a) figure out why the stack overflow happens without being detected and b) temporarily disable that script until we do. Part B is implemented as a second blacklist, on unless disabled (since other sites may do this), until we find a better solution to part A. This will be in FPR24, along with probably some work on MP3 compliance issues (TenFourFox gets used as a simple little Internet radio a lot more than I realized) and a few other odds and ends.

In case you missed it, I am now posting content I used to post here as "And now for something completely different" over on a new separate blog christened Old Vintage Computing Research, or my Old VCR (previous posts will remain here indefinitely). Although it will necessarily have Power Mac content, it will also cover some of my other beloved older systems all in one place. Check out properly putting your old Mac to Terminal sleep (and waking it back up again), along with screenshots of the unscreenshotable, including grabs off the biggest computer Apple ever made, the Apple Network Server. REWIND a bit and PLAY.

Mozilla Addons Blog: Recommended extensions — recent additions

do, 11/06/2020 - 22:10

When the Recommended Extensions program debuted last year, it listed about 60 extensions. Today the program has grown to just over a hundred as we continue to evaluate new nominations and carefully grow the list. The curated collection grows slowly because one of the program’s goals is to cultivate a fairly fixed list of content so users can feel confident the Recommended extensions they install will be monitored for safety and security for the foreseeable future.

Here are some of the more exciting recent additions to the program…

DuckDuckGo Privacy Essentials provides a slew of great privacy features, like advanced ad tracker and search protection, encryption enforcement, and more.

Read Aloud: Text to Speech converts any web page text (even PDFs) to audio. This can be a very useful extension for everyone from folks with eyesight or reading issues to someone who just wants their web content narrated to them while their eyes roam elsewhere.

SponsorBlock for YouTube is one of the more original content blockers we’ve seen in a while. Leveraging crowdsourced data, the extension skips those interruptive sponsored content segments of YouTube clips.

SponsorBlock addresses the nuisance of this newer, more intrusive type of video advertising.

Metastream Remote has been extremely valuable to many of us during pandemic-related home confinement. It allows you to host streaming video watch parties with friends. Metastream will work with any video streaming platform, so long as the video has a URL (in the case of paid platforms like Netflix, Hulu, or Disney+, they too will work provided all watch party participants have their own accounts).

Cookie AutoDelete summarizes its utility right in the title. This simple but powerful extension will automatically delete your cookies from closed tabs. Customization features include whitelist support and informative visibility into the number of cookies used on any given site.

AdGuard AdBlocker is a popular and highly respected content blocker that works to block all ads—banner, video, pop-ups, text ads—all of it. You may also notice the nice side benefit of faster page loads, since AdGuard prohibits so much content you didn’t want anyway.

If you’re the creator of an extension you feel would make a strong candidate for the Recommended program, or even if you’re just a huge fan of an extension you think merits consideration, please submit nominations to amo-featured [at] mozilla [dot] org. Due to the high volume of submissions we receive, please understand we’re unable to respond to every inquiry.

The post Recommended extensions — recent additions appeared first on Mozilla Add-ons Blog.

Hacks.Mozilla.Org: Introducing the MDN Web Docs Front-end developer learning pathway

do, 11/06/2020 - 18:01

The MDN Web Docs Learning Area (LA) was first launched in 2015, with the aim of providing a useful counterpart to the regular MDN reference and guide material. MDN had traditionally been aimed at web professionals, but we were getting regular feedback that a lot of our audience found MDN too difficult to understand, and that it lacked coverage of basic topics.

Fast forward 5 years, and the Learning Area material is well-received. It boasts around 3.5–4 million page views per month, a little under 10% of MDN Web Docs’ monthly web traffic.

At this point, the Learning Area does its job pretty well. A lot of people use it to study client-side web technologies, and its loosely-structured, unopinionated, modular nature makes it easy to pick and choose subjects at your own pace. Teachers like it because it is easy to include in their own courses.

However, at the beginning of the year, this area had two shortcomings that we wanted to improve upon:

  1. We’d gotten significant feedback that our users wanted a more opinionated, structured approach to learning web development.
  2. We didn’t include any information on client-side tooling, such as JavaScript frameworks, transformation tools, and deployment tools widely used in the web developer’s workplace.

To remedy these issues, we created the Front-end developer learning pathway (FED learning pathway).

Structured learning

Take a look at the Front-end developer pathway linked above  — you’ll see that it provides a clear structure for learning front-end web development. This is our opinion on how you should get started if you want to become a front-end developer. For example, you should really learn vanilla HTML, CSS, and JavaScript before jumping into frameworks and other such tooling. Accessibility should be front and center in all you do. (All Learning Area sections try to follow accessibility best practices as much as possible).

While the included content isn’t completely exhaustive, it delivers the essentials you need, along with the confidence to look up other information on your own.

The pathway starts by clearly stating the subjects taught, prerequisite knowledge, and where to get help. After that, we provide some useful background reading on how to set up a minimal coding environment. This will allow you to work through all the examples you’ll encounter. We explain what web standards are and how web technologies work together, as well as how to learn and get help effectively.

The bulk of the pathway is dedicated to detailed guides covering:

  • HTML
  • CSS
  • JavaScript
  • Web forms
  • Testing and accessibility
  • Modern client-side tooling (which includes client-side JavaScript frameworks)

Throughout the pathway we aim to provide clear direction — where you are now, what you are learning next, and why. We offer enough assessments to provide you with a challenge, and an acknowledgement that you are ready to go on to the next section.

Tooling

MDN’s aim is to document native web technologies — those supported in browsers. We don’t tend to document tooling built on top of native web technologies because:

  • The creators of that tooling tend to produce their own documentation resources.  To repeat such content would be a waste of effort, and confusing for the community.
  • Libraries and frameworks tend to change much more often than native web technologies. Keeping the documentation up to date would require a lot of effort. Alas, we don’t have the bandwidth to perform regular large-scale testing and updates.
  • MDN is seen as a neutral documentation provider. Documenting tooling is seen by many as a departure from neutrality, especially for tooling created by major players such as Facebook or Google.

Therefore, it came as a surprise to some that we were looking to document such tooling. So why did we do it? Well, the word here is pragmatism. We want to provide the information people need to build sites and apps on the web. Client-side frameworks and other tools are an unmistakable part of that. It would look foolish to leave out that entire part of the ecosystem. So we opted to provide coverage of a subset of tooling “essentials” — enough information to understand the tools, and use them at a basic level. We aim to provide the confidence to look up more advanced information on your own.

New Tools and testing modules

In the Tools and testing Learning Area topic, we’ve provided the following new modules:

  1. Understanding client-side web development tools: An introduction to the different types of client-side tools that are available, how to use the command line to install and use tools. This section delivers a crash course in package managers. It includes a walkthrough of how to set up and use a typical toolchain, from enhancing your code writing experience to deploying your app (a minimal configuration sketch follows this list).
  2. Understanding client-side JavaScript frameworks: A useful grounding in client-side frameworks, in which we aim to answer questions such as “why use a framework?”, “what problems do they solve?”, and “how do they relate to vanilla JavaScript?” We give the reader a basic tutorial series in some of the most popular frameworks. At the time of writing, this includes React, Ember, and Vue.
  3. Git and GitHub: Using links to Github’s guides, we’ve assembled a quickfire guide to Git and GitHub basics, with the intention of writing our own set of guides sometime later on.
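
To give a flavor of the kind of toolchain setup the first module walks through (this sketch is ours, not taken from the MDN material), a deliberately minimal webpack configuration might look like:

```js
// webpack.config.js - a deliberately minimal build setup, installed with e.g.
//   npm install --save-dev webpack webpack-cli
// and run with `npx webpack`. Real projects add loaders, dev servers, and more.
const path = require("path");

module.exports = {
  mode: "development",     // readable output while learning; switch to "production" to ship
  entry: "./src/index.js", // where the application starts
  output: {
    path: path.resolve(__dirname, "dist"),
    filename: "main.js",   // the bundled file a page can load with a single <script>
  },
};
```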

Further work

The intention is not just to stop here and call the FED learning pathway done. We are always interested in improving our material to keep it up to date and make it as useful as possible to aspiring developers. And we are interested in expanding our coverage, if that is what our audience wants. For example, our frameworks tutorials are fairly generic to begin with, to allow us to use them as a test bed, while providing some immediate value to readers.

We don’t want to just copy the material provided by tooling vendors, for reasons given above. Instead we want to listen, to find out what the biggest pain points are in learning front-end web development. We’d like to see where you need more coverage, and expand our material to suit. We would like to cover more client-side JavaScript frameworks (we have already got a Svelte tutorial on the way), provide deeper coverage of other tool types (such as transformation tools, testing frameworks, and static site generators), and other things besides.

Your feedback please!

To enable us to make more intelligent choices, we would love your help. If you’ve got a strong idea about tools or web technologies we should cover on MDN Web Docs, or you think some existing learning material needs improvement, please let us know the details! The best ways to do this are:

  1. Leave a comment on this article.
  2. Fill in our questionnaire (it should only take 5–10 minutes).

So that draws us to a close. Thank you for reading, and for any feedback you choose to share.

We will use it to help improve our education resources, helping the next generation of web devs learn the skills they need to create a better web of tomorrow.

The post Introducing the MDN Web Docs Front-end developer learning pathway appeared first on Mozilla Hacks - the Web developer blog.

Mozilla Addons Blog: Improvements to Statistics Processing on AMO

wo, 10/06/2020 - 21:22

We’re revamping the statistics we make available to add-on developers on addons.mozilla.org (AMO).

These stats are aggregated from add-on update logs and don’t include any personally identifiable user data. They give developers information about user adoption, general demographics, and other insights that might help them make changes and improvements.

The current system is costly to run, and glitches in the data have been a long-standing recurring issue. We are addressing these issues by changing the data source, which will improve reliability and reduce processing costs.

Usage Statistics

Until now, add-on usage statistics have been based on add-on updates. Firefox checks AMO daily for updates for add-ons that are hosted there (self-distributed add-ons generally check for updates on a server specified by the developer). The server logs for these update requests are aggregated and used to calculate the user counts shown on add-on pages on AMO. They also power a statistics dashboard for developers that breaks down the usage data by language, platform, application, etc.

[Image: stats dashboard showing new version adoption for uBlock Origin]

In a few weeks, we will stop using the daily pings as the data source for usage statistics. The new statistics will be based on Firefox telemetry data. As with the current stats, all data is aggregated and no personally identifiable user data is shared with developers.

The data shown on AMO and shared with developers will be essentially the same, but the move to telemetry means that the numbers will change a little. Firefox users can opt out of sending telemetry data, and the way they are counted is different. Our current stats system counts distinct users by IP address, while telemetry uses a per-profile ID. For most add-ons you should expect usage totals to be lower, but usage trends and fluctuations should be nearly identical.
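
As a toy illustration of why the two counting methods disagree (this is not AMO's actual pipeline or data): the same profile seen from two networks is counted twice by IP but only once by profile ID, and profiles that opt out of telemetry are never counted at all.

```js
// Toy illustration of the counting difference (not AMO's real data or pipeline).
const updateRecords = [
  { ip: "203.0.113.5",  profileId: "profile-a" }, // same profile at home...
  { ip: "198.51.100.7", profileId: "profile-a" }, // ...and on another network
  { ip: "203.0.113.5",  profileId: "profile-b" }, // a second profile behind the same IP
];

const usersByIp = new Set(updateRecords.map((r) => r.ip)).size;             // 2
const usersByProfile = new Set(updateRecords.map((r) => r.profileId)).size; // 2

// Profiles that opt out of telemetry are never seen at all, which is one reason
// per-profile totals usually come out lower than per-IP totals.
const optedOut = new Set(["profile-b"]);
const usersByTelemetry = new Set(
  updateRecords.map((r) => r.profileId).filter((id) => !optedOut.has(id))
).size; // 1

console.log({ usersByIp, usersByProfile, usersByTelemetry });
```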

Telemetry data will enable us to show data for add-on versions that are not listed on AMO, so all developers will now be able to analyze their add-on usage stats, regardless of how the add-on is distributed. This also means some add-ons will have higher usage numbers, since the average will be calculated including both AMO-hosted and self-hosted versions.

Other changes that will happen due to this update:

  • The dashboards will only show data for enabled installs. There won’t be a breakdown of usage by add-on status anymore.
  • A breakdown of usage by country will be added.
  • Usage data for our current Firefox for Android browser (also known as Fennec) isn’t included. We’re working on adding data for our next mobile browser (Fenix), currently in development.
  • It won’t be possible to make your statistics dashboard publicly available anymore. Dashboards will only be accessible to add-on developers and admins, starting on June 11. If you are a member of a team that maintains an add-on and you need to access its stats dashboard, please ask your team to add you as an author in the Manage Authors & License page on AMO. The Listed property can be checked off so you don’t show up in the add-on’s public listing page.

We will begin gradually rolling out the new dashboard on June 11. During the rollout, a fraction of add-on dashboards will default to show the new data, but they will also have a link to access the old data. We expect to complete the rollout and discontinue the old dashboards on July 9. If you want to export any of your old stats, make sure you do it before then.

Download Statistics

We plan to make a similar overhaul to download statistics in the coming months. For now they will remain the same. You should expect an announcement around August, when we are closer to switching over to the new download data.

The post Improvements to Statistics Processing on AMO appeared first on Mozilla Add-ons Blog.

This Week In Rust: This Week in Rust 342

wo, 10/06/2020 - 06:00

Hello and welcome to another issue of This Week in Rust! Rust is a systems language pursuing the trifecta: safety, concurrency, and speed. This is a weekly summary of its progress and community. Want something mentioned? Tweet us at @ThisWeekInRust or send us a pull request. Want to get involved? We love contributions.

This Week in Rust is openly developed on GitHub. If you find any errors in this week's issue, please submit a PR.

Check out this week's This Week in Rust Podcast

Updates from Rust Community

News & Blog Posts

Crate of the Week

This week's crate is cargo-spellcheck, a cargo subcommand to spell-check your docs.

Thanks to Bernhard Schuster for the suggestion!

Submit your suggestions and votes for next week!

Call for Participation

Always wanted to contribute to open-source projects but didn't know where to start? Every week we highlight some tasks from the Rust community for you to pick and get started!

Some of these tasks may also have mentors available, visit the task page for more information.

If you are a Rust project owner and are looking for contributors, please submit tasks here.

Updates from Rust Core

350 pull requests were merged in the last week

Rust Compiler Performance Triage

Approved RFCs

Changes to Rust follow the Rust RFC (request for comments) process. These are the RFCs that were approved for implementation this week:

No RFCs were approved this week.

Final Comment Period

Every week the team announces the 'final comment period' for RFCs and key PRs which are reaching a decision. Express your opinions now.

RFCs

No RFCs are currently in the final comment period.

Tracking Issues & PRs

New RFCs

No new RFCs were proposed this week.

Upcoming Events

Online

North America

If you are running a Rust event please add it to the calendar to get it mentioned here. Please remember to add a link to the event too. Email the Rust Community Team for access.

Rust Jobs

Tweet us at @ThisWeekInRust to get your job offers listed here!

Quote of the Week

You don't declare lifetimes. Lifetimes come from the shape of your code, so to change what the lifetimes are, you must change the shape of the code.

Alice Ryhl on rust-users

Thanks to RustyYato for the suggestions!

Please submit quotes and vote for next week!

This Week in Rust is edited by: nellshamrell, llogiq, and cdmistman.

Discuss on r/rust

The Rust Programming Language Blog: 2020 Event Lineup - Update

wo, 10/06/2020 - 02:00

In 2020 the way we can do events suddenly changed. In the past we had in-person events all around the world, with some major conferences throughout the year. With everything changed due to a global pandemic, this won't be possible anymore. Nonetheless, the Rust community found ways to continue with events in some form or another. With more and more events moving online, they are getting more accessible to people no matter where they are.

Below you find updated information about Rust events in 2020.

Do you plan to run a Rust online event? Send an email to the Rust Community team and the team will be able to get your event on the calendar and might be able to offer further help.

Rust LATAM

Unfortunately the Latin American event Rust LATAM had to be canceled this year. The team hopes to be able to resume the event in the future.

Oxidize
July 17th-20th, 2020

The Oxidize conference was relabeled to become Oxidize Global. From July 17-20 you will be able to learn about embedded systems and IoT in Rust. Over the course of 4 days you will be able to attend online workshops (July 17th), listen to talks (July 18th) and take part in the Impl Days, where you can collaborate with other Embedded Rust contributors in active programming sessions.

Tickets are on sale and the speakers & talks will be announced soon.

RustConf
August 20th, 2020

The official RustConf will be taking place fully online. Listen to talks and meet other Rust enthusiasts online in digital meetups & breakout rooms. See the list of speakers, register already and follow Twitter for updates as the event date approaches!

Rusty Days
July 27th - August 2nd, 2020

Rusty Days is a new conference and was planned to happen in Wroclaw, Poland. It now turned into a virtual Rust conference stretched over five days. You'll be able to see five speakers with five talks -- and everything is free of charge, streamed online and available to watch later.

The Call for Papers is open. Follow Twitter for updates.

RustLab
October 16th-17th, 2020

RustLab 2020 is also turning into an online event. The details are not yet settled, but they are aiming for the original dates. Keep an eye on their Twitter stream for further details.

RustFest Netherlands Global
November 7th-8th, 2020

RustFest Netherlands was supposed to happen this June. The team decided to postpone the event and is now happening as an online conference in Q4 of this year. More information will be available soon on the RustFest blog and also on Twitter.

Update 2020-06-18: RustFest has announced its dates: November 7th & 8th, running as an online community conference. See the announcement blog post for details.

Conferences are not the only thing happening. More and more local meetups get turned into online events. We try to highlight these in the community calendar as well as in the This Week in Rust newsletter. Some Rust developers are streaming their work on the language & their Rust projects. You can get more information in a curated list of Rust streams.

Do you plan to run a Rust online event? Send an email to the Rust Community team and the team will be able to get your event on the calendar and might be able to offer further help.

Mozilla Future Releases Blog: Next steps in testing our Firefox Private Network browser extension beta

di, 09/06/2020 - 18:25

Last fall, we launched the Firefox Private Network browser extension beta as a part of our Test Pilot experiments program. The extension offers safe, no-hassle network protection in the Firefox browser. Since our initial launch, we’ve released a number of versions offering different capabilities. We’ve also launched a Virtual Private Network (VPN) for users interested in full device protection.

Today we are pleased to announce the next step in our Firefox Private Network browser extension Beta. Starting soon, we will be transitioning from a free beta to a paid subscription beta for the Firefox Private Network browser extension. This version will be offered for a limited time for $2.99/mo and will provide unlimited access while using the Firefox Private Network extension. Like our existing extension, this version will be available in the U.S. first, but we hope to expand to other markets soon. Unlike our previous beta, this version will also allow users to connect up to three Firefox browsers at once using the same account. This will only be available for desktop users. For this release, we will also be updating our product icon to differentiate more clearly from the VPN. More information about our VPN as a stand-alone product offering will be shared in the coming weeks.

What did we learn

Last fall, when we first launched the Firefox Private Network browser extension, we saw a lot of early excitement around the product followed by a wave of users signing up. From September through December, we offered early adopters a chance to sign up for the extension with unlimited access, free of charge. In December, when the subscription VPN first launched, we updated our experimental offering to understand if giving participants a certain number of hours a month for browsing in coffee shops or at airports (remember those?) would be appealing. What we learned very quickly was that the appeal of the proxy came most of all from the simplicity of the unlimited offering. Users of the unlimited version appreciated having set and forget privacy, while users of the limited version often didn’t remember to turn on the extension at opportune moments.

These initial findings were borne out in subsequent research. Users in the unlimited cohort engaged at a high level, while users in the limited cohort often stopped using the proxy after only a few hours. When we spoke to proxy users, we found that for many the appeal of the product was in the set-it-and-forget-it protection it offered.

We also knew from the outset that we could not offer this product for free forever. While there are some free proxy products available in the market, there is always a cost associated with the network infrastructure required to run a secure proxy service.  We believe the simplest and most transparent way to account for these costs is by providing this service at a modest subscription fee. After conducting a number of surveys, we believe that the appropriate introductory price for the Firefox Private Network browser extension is $2.99 a month.

What will we be testing?

So the next thing we want to understand is basically this: will people pay for a browser-based privacy tool? It’s a simple question really, and one we think is best answered by the market. Over the summer we will be conducting a series of small marketing tests to determine interest in the Firefox Private Network browser extension both as a standalone subscription product and as part of a larger privacy and security bundle for Firefox.

In conjunction, we will also continue to explore the relationship between the Firefox Private Network extension and the VPN. Does it make sense to bundle them? Do VPN subscribers want access to the browser extension? How can we best communicate the different values and attributes of each?

What you can expect next

Starting in a few weeks, new users and users in the limited experiment will be offered the opportunity to subscribe to the unlimited beta for $2.99 a month. Shortly thereafter we will be asking our unlimited users to migrate as well.


The post Next steps in testing our Firefox Private Network browser extension beta appeared first on Future Releases.

Categorieën: Mozilla-nl planet

The Mozilla Blog: Mozilla Announces Second Three COVID-19 Solutions Fund Recipients

ma, 08/06/2020 - 15:45

Innovations spanning food supplies, medical records and PPE manufacture were today included in the final three awards made by Mozilla from its COVID-19 Solutions Fund. The Fund was established at the end of March by the Mozilla Open Source Support Program (MOSS) to offer up to $50,000 each to open source technology projects responding to the COVID-19 pandemic. In just two months, the Fund received 163 applications from 30 countries and is now closed to new applications.

OpenMRS is a robust, scalable, user-driven, open source electronic medical record system platform currently used to manage more than 12.6 million patients at over 5,500 health facilities in 64 countries. Using Kenya as a primary use case, their COVID-19 Response project will coordinate work on OpenMRS COVID-19 solutions emerging from their community, particularly “pop-up” hospitals, into a COVID-19 package for immediate use.

This package will be built for eventual re-use as a foundation for a suite of tools that will become the OpenMRS Public Health Response distribution. Science-based data collection tools, reports, and data exchange interfaces with other key systems in the public health sector will provide critical information needed to contain disease outbreaks. The committee approved an award of $49,754.

Open Food Network offers an open source platform enabling new, ethical supply chains. Food producers can sell online directly to consumers and wholesalers can manage buying groups and supply produce through networks of food hubs and shops. Communities can bring together producers to create a virtual farmers’ market, building a resilient local food economy.

At a time when supply chains are being disrupted around the world — resulting in both food waste and shortages — they’re helping to get food to people in need. Globally, the Open Food Network is currently deployed in India, Brazil, Italy, South Africa, Australia, the UK, the US and five other countries. They plan to use their award to extend to ten other countries, build tools to allow vendors to better control inventory, and scale up their support infrastructure as they continue international expansion. The Committee approved a $45,210 award.

Careables Casa Criatura Olinda in northeast Brazil is producing face shields for local hospitals based on an open source design. With their award, they plan to increase their production of face shields as well as to start producing aerosol boxes using an open source design, developed in partnership with local healthcare professionals.

Outside of North American ICUs, many hospitals cannot maintain only one patient per room, protected by physical walls and doors. In such cases, aerosol boxes are critical to prevent the spread of the virus from patient to patient and patient to physician. Yet even the Brazilian city of Recife (population: 1.56 million) has only three aerosol boxes. The Committee has approved a $25,000 award and authorized up to an additional $5,000 to help the organization spread the word about their aerosol box design.

“Healthcare has for too long been assumed to be too high risk for open source development. These awards highlight how critical open source technologies are to helping communities around the world to cope with the pandemic,” said Jochai Ben-Avie, Head of International Public Policy and Administrator of the Program at Mozilla. “We are indebted to the talented global community of open source developers who have found such vital ways to put our support to good use.”

Information on the first three recipients from the Fund can be found here.

The post Mozilla Announces Second Three COVID-19 Solutions Fund Recipients appeared first on The Mozilla Blog.

Categorieën: Mozilla-nl planet

Henri Sivonen: chardetng: A More Compact Character Encoding Detector for the Legacy Web

ma, 08/06/2020 - 12:48

chardetng is a new small-binary-footprint character encoding detector for Firefox written in Rust. Its purpose is user retention by addressing an ancient—and for non-Latin scripts page-destroying—Web compat problem that Chrome already addressed. It puts an end to the situation where some pages showed up as unreadable garbage in Firefox but opening the page in Chrome remedied the situation more easily than fiddling with Firefox menus. The deployment of chardetng in Firefox 73 marks the first time in the history of Firefox (and in the entire Netscape lineage of browsers since the time support for non-ISO-8859-1 text was introduced) that the mapping of HTML bytes to the DOM is not a function of the browser user interface language. This is accomplished at a notably lower binary size increase than what would have resulted from adopting Chrome’s detector. Also, the result is more explainable and modifiable as well as more suitable for standardization, should there be interest, than Chrome’s detector.

chardetng targets the long tail of unlabeled legacy pages in legacy encodings. Web developers should continue to use UTF-8 for newly-authored pages and to label HTML accordingly. This is not a new feature for Web developers to make use of!

Although chardetng first landed in Firefox 73, this write-up discusses chardetng as of Firefox 78.

TL;DR

There is a long tail of legacy Web pages that fail to label their encoding. Historically, browsers have used a default associated with the user interface language of the browser for unlabeled pages and provided a menu for the user to choose something different.

In order to get rid of the character encoding menu, Chrome adopted an encoding detector (apparently) developed for Google Search and Gmail. This was done without discussing the change ahead of time in standard-setting organizations. Firefox had gone in the other direction of avoiding content-based guessing, so this created a Web compat problem that, when it occurred for non-Latin-script pages, was as bad as a Web compat problem can be: Encoding-unlabeled legacy pages could appear completely unreadable in Firefox (unless a well-hidden menu action was taken) but appear OK in Chrome. Recent feedback from Japan as well as world-wide telemetry suggested that this problem still actually existed in practice. While Safari detects less, if a user encounters the problem on iOS, there is no other engine to go to, so Safari can’t be used to determine the abandonment risk for Firefox on platforms where a Chromium-based browser is a couple of clicks or taps away. Edge’s switch to Chromium signaled an end to any hope of taking the Web Platform in the direction of detecting less.

ICU4C’s detector wasn’t accurate (or complete) enough. Firefox’s old and mostly already removed detector wasn’t complete enough and completing it would have involved the same level of work as writing a completely new one. Since Chrome’s detector, ced, wasn’t developed for browser use cases, it has a larger footprint than is necessary. It is also (in practice) unmodifiable over-the-wall Open Source, so adopting it would have meant adopting a bunch of C++ that would have had known-unnecessary bloat while also being difficult to clean up.

Developing an encoding detector is easier and takes less effort than it looks once one has made the observations that one makes when developing a character encoding conversion library. chardetng’s foundational binary size-reducing idea is to make use of the legacy Chinese, Japanese, and Korean (CJK) decoders (that a browser has to have anyway) for the purpose of detecting CJK legacy encodings. chardetng is also strictly scoped to Web-relevant use cases.

On x86_64, the binary size contribution of chardetng (and the earlier companion Japanese-specific detector) to libxul is 28% of what ced would have contributed to libxul. If we had adopted ced and later wanted to make a comparable binary size saving, hunting for savings around the code base would have been more work than writing chardetng from scratch.

The focus on binary size makes chardetng take 42% longer to process the same data compared to ced. However, this tradeoff makes sense for code that runs for legacy pages and doesn’t run at all for modern pages. The accuracy of chardetng is competitive with ced, and chardetng is integrated into Firefox in a way that gives it a better opportunity to give the right answer compared to the way ced is integrated into Chrome.

chardetng has been developed in such a way that it would be standardizable, should there be interest to standardize it at the WHATWG. The data tables are under CC0, and non-CC0 part of chardetng consists of fewer than 3000 lines of explainable Rust code that could be reversed into spec English.

Why

Before getting to how chardetng works, let’s look in more detail at why it exists at all.

Background

Back when the Web was created, operating systems had locale-specific character encodings. For example, a system localized for Greek had a different character encoding from a system localized for Japanese. The Web was created in Switzerland, so bytes were assumed to be interpreted according to ISO-8859-1, which was the Western European encoding for Unix-ish systems and also compatible with the Western European encoding for Windows. (The single-byte encodings on Mac OS Classic were so different from Unix and Windows that browsers on Mac had to permute the bytes and couldn’t assume content from the Web to match the system encoding.)

As browsers were adapted to more languages, the default treatment of bytes followed the system-level design of the textual interpretation of bytes being locale-specific. Even though the concept of actually indicating what character encoding content was in followed relatively quickly, the damage was already done. There were already unlabeled pages out there, so—for compatibility—browsers retained locale-specific fallback defaults. This made it possible for authors to create more unlabeled content that appeared to work fine when viewed with an in-locale browser but that broke if viewed with an out-of-locale browser. That’s not so great for a system that has “world-wide” in its name.

For a long time, browsers provided a menu that allowed the user to override the character encoding when the fallback character encoding of the page author’s browser and the fallback character encoding of the page reader’s browser didn’t match. Browsers have also traditionally provided a detector for the Japanese locale for deciding between Shift_JIS and EUC-JP (also possibly ISO-2022-JP) since there was an encoding split between Windows and Unix. (Central Europe also had such an encoding split between Windows and Unix, but, without a detector, Web developers needed to get their act together instead and declare the encodings…) Firefox also provided an on-by-default detector for the Russian and Ukrainian locales (but not other Cyrillic locales!) for detecting among multiple Cyrillic encodings. Later on, Firefox also tried to alleviate the problem by deciding from the top-level domain instead of the UI localization when the top-level domain was locale-affiliated. However, this didn’t solve the problem for .com/.net/.org, for local files, or for locales with multiple legacy encodings even if not on the level of prevalence as in Japan.

What Problem is Being Solved Now?

As part of an effort to get rid of the UI for manually overriding the character encoding of an HTML page, Chrome adopted compact_enc_det (ced), which appears to be Google’s character encoding detector from Google Search and Gmail. This was done as a surprise to us without discussing the change in standard-setting organizations ahead of time. This led to a situation where Firefox’s top-level domain or UI locale heuristics could fail but Chrome’s content-based detection could succeed. It is likely more discoverable and easier to launch a Chromium-based browser instead of seeking to remedy the situation within Firefox by navigating rather well-hidden submenus.

When the problem occurred with non-Latin-script pages, it was about as bad as a Web compat problem can be: The text was completely unreadable in Firefox and worked just fine in Chrome.

But Did the Problem Actually Occur Often?

So we had a Web compat problem that was older than JavaScript and almost as old as the Web itself and that had a very bad (text completely unreadable) failure mode. But did it actually happen often enough to care? If you’re in the Americas, Western Europe, Australia, New Zealand, many parts of Africa to the South of Sahara, etc., you might be thinking: “Did this really happen prior to Firefox 73? It didn’t happen when I browsed the Web.”

Indeed, this issue has not been a practical problem for quite a while for users who are in windows-1252 locales and read content from windows-1252 locales, where a “windows-1252 locale” is defined as a locale where the legacy “ANSI” code page for the Windows localization for that locale is windows-1252. There are two reasons why this is the case. First, the Web was created in this locale cohort (windows-1252 is a superset of ISO-8859-1), so the defaults have always been favorable. Second, when the problem does occur, the effect is relatively tolerable for the languages of these locales. These locales use the Latin script with relatively infrequent non-ASCII characters. Even if those non-ASCII characters are replaced with garbage, it’s still easy to figure out what the text as a whole is saying.

With non-Latin scripts, the problem is much more severe, because pretty much all characters (except ASCII spaces and ASCII punctuation) get replaced with garbage.

Clearly, when the user invokes the Text Encoding menu in Firefox, this user action is indicative of the user having encountered this problem. All over the world, the use of the menu could be fairly characterized as very rare. However, how rare varied by a rather large factor. If we take the level of usage in Germany and Spain, which are large non-English windows-1252 locales, as the baseline of a rate of menu usage that we could dismiss as not worth addressing, the rate of usage in Macao was 85 times that (measured not in terms of accesses to the menu directly but in terms of Firefox subsessions in which the menu had been used at least once; i.e. a number of consecutive uses in one subsession counted just once). Ignorable times 85 does not necessarily equal requiring action, but this indicates that an assessment of the severity of the problem from a windows-1252 experience is unlikely to be representative.

Also, the general rarity of the menu usage is not the whole story. It measures only the cases where the user knew how and bothered to address the problem within Firefox. It doesn’t measure abandonment: It doesn’t measure the user addressing the problem by opening a Chromium-based browser instead. For every user who found the menu option and used it in Firefox, several others likely abandoned Firefox.

The feedback kept coming from Japan that the problem still occurred from time to time. Telemetry indicated that it occurred in locales where the primary writing system is Traditional Chinese at even a higher rate than in Japan. In mainland China, which uses Simplified Chinese, the rate was a bit lower than in Japan but on a similar level. My hypothesis is that despite Traditional Chinese having a single legacy encoding in the Web Platform and Simplified Chinese also having a single legacy encoding in the Web Platform (for decoding purposes), which would predict these locales as having success from a locale-based guess, users read content across the Traditional/Simplified split on generic domains (where a guess from the top-level domain didn’t apply) but don’t treat out-of-locale failures as reportable. In the case of Japan and the Japanese language, the failures are in-locale and apparently treated as more reportable.

In general, as one would expect, the menu usage is higher in locales where the primary writing system is not the Latin script than in locales where the primary writing system is the Latin script. Notably, Bulgaria was pretty high on the list. We enabled our previous Cyrillic detector (which probably wasn’t trained with Bulgarian) by default for the Russian and Ukrainian localizations but not for the Bulgarian localization. Also, we lacked a TLD mapping for .cy and, sure enough, the rate of menu usage was higher in Cyprus than in Greece.

Again, as one would expect, the menu usage was higher in non-windows-1252 Latin-script locales than in windows-1252 Latin-script locales. Notably, the usage in Azerbaijan was unusually high compared to other Latin-script locales. I had suspected we had the wrong TLD mapping for .az, because at the time the TLD mappings were introduced to Firefox, we didn’t have an Azerbaijani localization to learn from.

In fact, I suspected that our TLD mapping for .az was wrong at the very moment of making the TLD mappings, but I didn’t have data to prove it, so I erred on the side of inaction. Now, years later, we have the data. It seems to me that it was better to trust the feedback from Japan and fix this problem than to gather more data to prove that the Web compat problem deserved fixing.

Why Now?

So why not fix this earlier or at the latest when Chrome introduced their present detector? Generally, it seems a better idea to avoid unobvious behaviors in the Web Platform if they are avoidable, and at the time Microsoft was going in the opposite direction compared to Chrome with neither a menu nor a detector in EdgeHTML-based Edge. This made it seem that there might be a chance that the Web Platform could end up being simpler, not more complex, on this point.

While it would be silly to suggest that this particular issue factored into Microsoft’s decision to base the new Edge on Chromium, the switch removed EdgeHTML as a data point suggesting that things could be less complex than they already were in Gecko. At this point, it didn’t make sense to pretend that if we didn’t start detecting more, Chrome and the new Edge would start detecting less.

Safari has less detection than Firefox had previously. As far as I can tell, Safari doesn’t have TLD-based guessing, and the fallback comes directly from the UI language with the detail that if the UI language is Japanese, there’s content-based guessing between Japanese legacy encodings. However, to the extent Safari/WebKit is engineered to the iOS market conditions, we can’t take this as an indication that this level of guessing would be enough for Gecko relative to Chromium. If users encounter this problem on iOS, there’s no other engine they can go to, and the problem is rare enough that users aren’t going to give up on iOS as a whole due to this problem. However, on every platform that Gecko runs on, a Chromium-based browser is a couple of clicks or taps away.

Why Not Use an Existing Library?

After deciding to fix this, there’s the question of how to fix it. Why write new code instead of reusing existing code?

Why Not ICU4C?

ICU4C has a detector, but Chrome had already rejected it as not accurate enough. Indeed, in my own testing with title-length input, ICU4C was considerably less accurate than the alternatives discussed below.

Why Not Resurrect Mozilla’s Old Detector?

At this point you may be thinking “Wait, didn’t Firefox have a ‘Universal’ detector and weren’t you the one who removed it?” Yes and yes.

Firefox used to have two detectors: the “Universal” detector, also known as chardet, and a separate Cyrillic detector. chardet had the following possible configurations: Japanese, Traditional Chinese, Simplified Chinese, Chinese, Korean, and Universal, where the last mode enabled all the detection capabilities and the other modes enabled subsets only.

This list alone suggests a problem: If you look at the Web Platform as it exists today, Traditional Chinese has a single legacy encoding, Simplified Chinese has a single legacy decode mode (both GBK and GB18030 decode the same way and differ on the encoding side), and Korean has a single legacy encoding. The detector was written at a time when Gecko was treating character encodings as Pokémon and tried to catch them all. Since then we’ve learned that EUC-TW (an encoding for Traditional Chinese), ISO-2022-CN (an encoding for both Traditional Chinese and Simplified Chinese), HZ (an encoding for Simplified Chinese), ISO-2022-KR (an encoding for Korean), and Johab (an encoding for Korean) never took off on the Web, and we were able to remove them from the platform. With a single real legacy encoding per each of the Traditional Chinese, Simplified Chinese, and Korean, the only other detectable was UTF-8, but detecting it is problematic (more on this later).

When I filed the removal bug for the non-Japanese parts of chardet in early 2013, chardet was enabled in its Japanese-only mode by default for the Japanese localization, the separate Cyrillic detector was enabled for the Russian and Ukrainian localizations, and chardet was enabled in the Universal mode for the Traditional Chinese localization. At the time, the character encoding defaults for the Traditional Chinese localization differed from how they were supposed to be set in another way, too: It had UTF-8 instead of Big5 as the fallback encoding, which is why I didn’t take the detector setting very seriously, either. In the light of later telemetry, the combined Chinese mode, which detected between both Traditional and Simplified Chinese, would have been useful to turn on by default both for the Traditional Chinese and Simplified Chinese localizations.

From time to time, there were requests to turn the “Universal” detector on by default for everyone. That would not have worked, because the “Universal” detector wasn’t actually universal! It was incomplete and had known bugs, but people who hadn’t examined it more closely took the name at face value. That is, over a decade after the detector paper was presented (on September 12th 2001), the detector was still incomplete, off by default except in the Japanese subset mode and in a case that looked like an obvious misconfiguration, and the name was confusing.

The situation was bad as it was. There were really two options: removal or actually investing in fixing it. At the time, Chrome hadn’t gone all-in with detection. Given what other browsers were doing and not wishing to make the Web Platform more complex than it appeared to need to be, I pushed for removal. I still think that the call made sense considering the circumstances and information available.

As for whether it would have made sense to resurrect the old detector in 2019, digging up the code from version control history would still have resulted in an incomplete detector that would not have been suitable for turning on by default generally. Meanwhile, a non-Mozilla fork had added some things and another effort had resulted in a Rust port. Apart from having to sort out licensing (the code forked before the MPL 2.0 upgrade, the Rust port retained only the LGPL3 part of the license options, and Mozilla takes compliance with the LGPL relinking clause seriously, making the use of LGPLed code in Gecko less practical than the use of other Open Source code), even the forks would have required more work to complete despite at least the C++ fork being more complete than the original Mozilla code.

Moreover, despite validating legacy CJK encodings being one of the foundational ideas of the old detector (called “Coding Scheme Method” in the paper), the old detector didn’t make use of the browser’s decoders for these encodings but implemented the validation on its own. This is great if you want a C++ library that has no dependencies, but it’s not great if you are considering the binary size cost in a browser that already has validating decoders for these encodings and ships on Android where binary size still matters.

A detector has two big areas: How to handle single-byte encodings and how to handle legacy CJK. For the former, if you need to develop tooling to add support for some more single-byte encodings or language coverage for single-byte encodings already detected for some languages, you will have built tooling that allows you to redo all of them. Therefore, if the prospect is adding support for more single-byte cases and to reuse for CJK detection the decoders that the browser has anyway, doing the work within the frame of the old detector becomes a hindrance and it makes sense to frame it as newly-written code only instead of trying to formulate the changes as patches to what already existed.

Why Not Use Chrome’s Detector?

Using Chrome’s detector (ced) looks attractive on the surface. If Firefox used the exact same detector as Chrome, Firefox could never do worse than Chrome, since both would do the same thing. There are problems with this, though.

As a matter of a health-of-the-Web principle, it would be bad if an area of the Web platform became defined as having to run a particular implementation. Also, the way ced integrates with Chrome violates the design principles of Firefox’s HTML parser. What gets fed to ced in Chrome depends on buffer boundaries as delivered by the networking subsystem to the HTML parser. Prior to Firefox 4, HTML parsing in Firefox depended on buffer boundaries in an even worse way. Since Firefox 4, I’ve considered it a goal not to make Firefox’s mapping of a byte stream into a DOM dependent on the network buffer boundaries (or the wall clock). Therefore, if I had integrated ced into Firefox, the manner of integration would have been an opportunity for different results, but that would have been negligible. That is, these principled issues don’t justify putting effort into a new implementation.

The decisive reasons are these: ced is over-the-wall Open Source to the point that even Chrome developers don’t try to change it beyond post-processing its output and making it compile with newer compilers. License-wise the code is Open Source / Free Software, but it’s impractical to exercise the freedom to make modifications to its substance, which makes it close to an invariant section in practice (again, not as a matter of license). The code doesn’t come with the tools needed to regenerate its generated parts. And even if it did, the input to those tools would probably involve Google-specific data sources. There isn’t any design documentation (that I could find) beyond code comments. Adopting ced would have meant adopting a bunch of C++ code that we wouldn’t be able to meaningfully change.

But why would one even want to make changes if the goal isn’t to ever improve detection and the goal is just to ensure our detection is never worse than Chrome’s? First of all, as in the case of chardet, ced is self-contained and doesn’t make use of the algorithms and data that a browser already has to have in order to decode legacy CJK encodings once detected. But it’s worse than that. “Post-processing” in the previous paragraph means that ced has more potential outcomes than a browser engine has use for, since ced evidently was not developed for the browser use case. (As noted earlier, ced has the appearance of having been developed for Google Search and Gmail.)

For example, ced appears to make distinctions between various flavors of Shift_JIS in terms of their carrier-legacy emoji mappings (which may have made sense for Gmail at some point in the past). It’s unclear how much table duplication results from this, but it doesn’t appear to be full duplication for Shift_JIS. Still, there are other cases that clearly do lead to table duplication even though the encodings have been unified in the Web Platform. For example, what the Web Platform unifies as a single legacy Traditional Chinese encoding occurs as three distinct ones in ced and what the Web Platform unifies as a single legacy Simplified Chinese encoding for decoding purposes appears as three distinct generations in ced. Also, ced keeps around data and code for several encodings that have been removed from the Web Platform (to a large extent because Chrome demonstrated the feasibility of not supporting them!), for KOI8-CS (why?), and for a number of IE4/Netscape 4-era / pre-iOS/Android-era deliberately misencoded fonts deployed in India. (We’ll come back to those later.)

My thinking around internationalization in relation to binary size is still influenced by the time period when, after shipping on desktop, Mozilla didn’t ship the ECMAScript i18n API on Android for a long time due to binary size concerns. If we had adopted ced and later decided that we wanted to run an effort (BinShrink?) to make the binary size smaller for Android, it would probably have taken more time and effort to save as many bytes from somewhere else (as noted, changing ced itself is something even the Chrome developers don’t do) as chardetng saves relative to just adopting ced than it took me to write chardetng. Or maybe, if done carefully, it would have been possible to remove the unnecessary parts of ced without access to the original tools and to ensure that the result still kept working, but writing the auxiliary code to validate the results of such an effort would have been on the same order of magnitude of effort as writing the tooling for training and testing chardetng.

It’s Not Rocket Surgery

That most Open Source encoding detectors are ports of the old Mozilla code, that ICU’s effort to write their own resulted in something less accurate, and the assumption that Google’s detector draws from some massive Web-scale analysis make it seem like writing a detector from scratch is a huge deal. It isn’t that big a deal, really.

After one has worked on implementing character encodings for a while, some patterns emerge. These aren’t anything new. The sort of things that one notices are the kinds of things the chardet paper attributes to Frank Tang noticing while at Netscape. Also, I already knew about language-specific issues related to e.g. Albanian, Estonian, Romanian, Mongolian, Azerbaijani, and Vietnamese that are discussed below. That is, many of the issues that may look like discoveries in the development process are things that I already knew before writing any code for chardetng.

Furthermore, these days, you don’t need to be Google to have access to a corpus of language-labeled human-authored text: Wikipedia publishes database dumps. Synthesizing legacy-encoded data from these has the benefit that there’s no need to go locate actual legacy-encoded data on the Web and check that it’s correctly labeled.

Why Rust?

An encoding detector is a component with identifiable boundaries and a very narrow API. As such, it’s perfectly suited to be exposed via a Foreign Function Interface. Furthermore, the plan was to leverage the existing CJK decoders from encoding_rs—the encoding conversion library used in Firefox—and that’s already Rust code. In this situation, it would have been wrong not to take the productivity and safety benefits of Rust. As a bonus, Rust offers the possibility of parallelizing the code using Rayon, which may or may not help (that isn’t known before measuring), but the cost of trying is very low whereas the cost of planning for parallelism in C++ is very high. (We’ll see later that the measurement was a disappointment.)

How

With the “why” out of the way, let’s look at how chardetng actually works.

Standardizability

One principled concern related to just using ced was that it’s bad for interoperability of the Web Platform depending on a single implementation that everyone would have to ship. Since chardetng is something that I just made up without a spec, is it any better in this regard?

Before writing a single line of code I arranged the data tables used by chardetng to be under CC0 so as to enable their inclusion in a WHATWG spec for encoding detection, should there be interest for one. Also, I have tried to keep everything that chardetng does explainable. The non-CC0 (Apache-2.0 OR MIT) part of chardetng is under 3000 lines of Rust that could realistically be reversed into spec English in case someone wanted to write an interoperable second implementation.

The training tool for creating the data tables given Wikipedia dumps would be possible to explain as well. It runs in two phases. The first one computes statistics from Wikipedia dumps, and the second one generates Rust files from the statistics. The intermediate statistics are available, so as long as you don’t change the character classes or the set of languages, which would invalidate the format of the statistics, you can make changes to the later parts of the code generation and re-run it from the same statistics that I used.

Foundational Ideas
  • The most foundational idea of chardetng is the observation that legacy CJK encodings have enough structure to them that if you take bytes and a decoder for a legacy CJK encoding and try to decode the bytes with the decoder, if the bytes aren’t intended to be in that encoding, the decoder will report an error sooner or later, with the exception that EUC-family encodings may decode without error as other EUC-family encodings. And that a browser (or any app for that matter) that wants to make useful use of the detection result has to have those decoders anyway.

  • The EUC family (EUC-JP, EUC-KR, and GBK—the GB naming is more commonly used than the EUC-CN name) can be distinguished by observing that Japanese text has kana, which distinguishes EUC-JP, and Hanja is very rare in Korean, so if enough EUC byte pairs fall outside the KS X 1001 Hangul range, chances are the text is Chinese in GBK.

  • Some single-byte encodings have unassigned bytes. If an unassigned byte occurs, the encoding can be removed from consideration. C1 controls can be treated as if unassigned.

  • Bicameral scripts have a certain regularity to capital letter use. Even though e.g. brand names can violate these regularities, the bulk of text should consist of words that are lower-case, start with an upper-case letter, or are in all-caps.

  • Non-Latin letters generally don’t occur right before or right after Latin letters. Hence, pairing an ASCII letter with a non-ASCII letter is indicative of the Latin script and should be penalized in non-Latin-script encodings.

  • Single-byte encodings can be distinguished well enough from the relative probabilities of byte pairs excluding ASCII pairs. These pairings don’t need to and should not be analyzed on exact byte values but the bytes should be classified more coarsely such that ASCII punctuation and parentheses, etc., are classified as space-equivalent and at least upper and lower case of a given letter are unified.

  • Pairs of ASCII bytes should be neutral in terms of the detection outcome apart from ISO-2022-JP detection and cases where the first ASCII byte of a pair is interpreted as a trail byte of a legacy CJK sequence. This way, the detector ignores various computer-readable syntaxes such as HTML (for deployment) and MediaWiki syntax (for training) without any syntax-specific state machine(s).

  • Visual Hebrew can be distinguished by observing the placement of ASCII punctuation relative to non-ASCII words.

  • Avoid detecting encodings that have never worked without declaration in any localization of a major browser.

There are two major observations to make of the above ideas:

  1. The first point fundamentally trades off accuracy for short inputs for legacy CJK encodings in order to minimize binary size.
  2. The rules are all super-simple to implement except that finding out the relative probabilities of character class pairs for single-byte encodings requires some effort.
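
To make the first foundational idea concrete, here is a minimal sketch (not chardetng’s actual code) of the rule-out step, using the decoders from encoding_rs that a browser has to ship anyway. chardetng does this in a streaming fashion and combines it with scoring; the `plausible_cjk` helper below is hypothetical and shows only the rule-out half.

    use encoding_rs::{Encoding, BIG5, EUC_JP, EUC_KR, GBK, SHIFT_JIS};

    // Drop any legacy CJK candidate whose decoder reports an error for the
    // input bytes. Scoring still has to pick among the survivors.
    fn plausible_cjk(bytes: &[u8]) -> Vec<&'static Encoding> {
        let candidates: [&'static Encoding; 5] = [SHIFT_JIS, EUC_JP, EUC_KR, GBK, BIG5];
        candidates
            .iter()
            .copied()
            .filter(|enc| {
                let (_, had_errors) = enc.decode_without_bom_handling(bytes);
                !had_errors
            })
            .collect()
    }

Note that, as the EUC-family caveat above says, EUC-JP, EUC-KR, and GBK bytes often decode without error as each other, so this step alone cannot tell the EUC-family encodings apart.
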
Included and Excluded Encodings

The last point on the foundational idea list suggests that chardetng does not detect all encodings in the Encoding Standard. This indeed is the case. After all, the purpose of the detector is not to catch them all but to deal with the legacy that has arisen from locale-specific defaults and, to some extent, previously-deployed detectors. Most encodings in the Encoding Standard correspond to the “ANSI” code page of some Windows localization and, therefore, the default fallback in some localization of Internet Explorer. Additionally, ISO-8859-2 and ISO-8859-7 have appeared as the default fallback in non-Microsoft browsers. Some encodings require a bit more justification for why they are included or excluded.

ISO-2022-JP and EUC-JP are included, because multiple browsers have shipped on-by-default detectors over most of the existence of the Web with these as possible outcomes. Likewise, ISO-8859-5, IBM866, and KOI8-U (the Encoding Standard name for the encoding officially known as KOI8-RU) are detected, because they were detected by Gecko and are detected by IE and Chrome. The data published as part of ced also indicates that these encodings have actually been used on the Web.

ISO-8859-4 and ISO-8859-6 are included, because both IE and Chrome detect them, they have been in the menu in IE and Firefox practically forever, and the data published as part of ced indicates that they have actually been used on the Web.

ISO-8859-13 is included, because it has the same letter assignments as windows-1257, so browser support for windows-1257 has allowed ISO-8859-13 to work readably. However, disqualifying an encoding based on one encoding error could break such compatibility. To avoid breakage due to eager disqualification of windows-1257, chardetng supports ISO-8859-13 explicitly despite it not qualifying for inclusion on the usual browser legacy behavior grounds. (The non-UTF-8 glibc locales for Lithuanian and Latvian used ISO-8859-13. Solaris 2.6 used ISO-8859-4 instead.)

ISO-8859-8 is supported, because it has been in the menu in IE and Firefox practically forever, and if it wasn’t explicitly supported, the failure mode would be detection as windows-1255 with the direction of the text swapped.

KOI8-R, ISO-8859-8-I, and GB18030 are detected as KOI8-U, windows-1255, and GBK instead. KOI8-U differs from KOI8-R by assigning a couple of box drawing bytes to letters instead, so at worst the failure mode of this unification is some box drawing segments (which aren’t really used on the Web anyway) showing up as letters. windows-1255 is a superset of ISO-8859-8-I except for swapping the currency symbol. GBK and GB18030 have the same decoder in the Encoding Standard. However, chardetng makes no attempt to detect the use of GB18030/GBK for content other than Simplified Chinese despite the ability to also represent other content being a key design goal of GB18030. As far as I am aware, there is no legacy mechanism that would have allowed Web authors to rely on non-Chinese usage of GB18030 to work without declaring the encoding. (I haven’t evaluated how well Traditional Chinese encoded as GBK gets detected.)

x-user-defined as defined in the Encoding Standard (as opposed to how an encoding of the same name is defined in IE’s mlang.dll) is relevant to XMLHttpRequest but not really to HTML, so it is not a possible detection outcome.

The legacy encodings that the Encoding Standard maps to the replacement encoding are not detected. Hence, the replacement encoding is not a possible detection outcome.

UTF-16BE and UTF-16LE are not detected by chardetng. They are detected from the BOM outside chardetng. (Additionally for compatibility with IE’s U+0000 ignoring behavior as of 2009, Firefox has a hack to detect Latin1-only BOMless UTF-16BE and UTF-16LE.)
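
For reference, BOM sniffing is trivial compared to content-based detection; a minimal illustration (not Firefox’s actual code) might look like this:

    use encoding_rs::{Encoding, UTF_16BE, UTF_16LE, UTF_8};

    // The BOM check runs before any content-based detection. If a BOM is
    // present, it wins and chardetng never sees the document.
    fn sniff_bom(bytes: &[u8]) -> Option<&'static Encoding> {
        match bytes {
            [0xEF, 0xBB, 0xBF, ..] => Some(UTF_8),
            [0xFE, 0xFF, ..] => Some(UTF_16BE),
            [0xFF, 0xFE, ..] => Some(UTF_16LE),
            _ => None,
        }
    }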

The macintosh encoding (better known as MacRoman) is not detected, because it has not been the fallback for any major browser. The usage data published as part of ced suggests that the macintosh encoding exists on the Web, but the data looks a lot like the data is descriptive of ced’s own detection results and is recording misdetection.

x-mac-cyrillic is not detected, because it isn’t detected by IE and Chrome. It was previously detected by Firefox, though.

ISO-8859-3 is not detected, because it hasn’t been the fallback for any major browser or a menu item in IE. The IE4 character encoding documentation published on the W3C’s site remarks of ISO-8859-3: “not used in the real world”. The languages for which this encoding is supposed to be relevant are Maltese and Esperanto. Despite glibc having an ISO-8859-3 locale for Maltese, the data published as part of ced doesn’t show ISO-8859-3 usage under the TLD of Malta. The ced data shows usage for Catalan, but this is likely a matter of recording misdetection, since Catalan is clearly an ISO-8859-1/windows-1252 language.

ISO-8859-10 is not detected, because it hasn’t been the fallback for any major browser or a menu item in IE. Likewise, the anachronistic (post-dating UTF-8) late additions to the series, ISO-8859-14, ISO-8859-15, and ISO-8859-16, are not detected for the same reason. Of these, ISO-8859-15 differs from windows-1252 so little and for such rare characters that detection isn’t even particularly practical, and having ISO-8859-15 detected as windows-1252 is quite tolerable.

Pairwise Probabilities for Single-Byte Encodings

I assigned the single-byte encodings to groups such that multiple encodings that address roughly the same character repertoire are together. Some groups only have one member. E.g. windows-1254 is alone in a group. However, the Cyrillic group has windows-1251, KOI8-U, ISO-8859-5, and IBM866.

I assigned character classes according to the character repertoire of each group. Unifying upper and lower case can be done algorithmically. Then the groupings of characters can be listed somewhere and a code generator can generate lookup tables that are indexed by byte value yielding another 8-bit number. I reserved the most significant bit of the number read from the lookup tables to indicate case for bicameral scripts. I also split the classification lookup table in the middle in order to reuse the ASCII half across the cases that share the ASCII half (non-Latin vs. non-windows-1254 Latin vs. windows-1254). windows-1254 requires a different ASCII classification in order not to treat ‘i’ and ‘I’ as a case pair. Other than that, ASCII letters matter as individual classes for Latin-script encodings, but they don’t for non-Latin-script encodings.

I looked at the list of Wikipedias. I excluded languages whose Wikipedias have fewer than 10,000 articles, languages that don’t have a legacy of having been written in single-byte encodings in the above-mentioned groupings, and languages whose orthography is all-ASCII or nearly all-ASCII. Then I assigned the remaining languages to the encoding groups.

I wrote a tool that ingested the Wikipedia database dumps for those languages. For each language, it normalized the input to Unicode Normalization Form C (just in case; I didn’t bother examining how well Wikipedia was already in NFC) and then classified each Unicode scalar value according to the classification built above. Characters not part of the mapping were considered equivalent to spaces, because ampersand and semicolon were treated as being in the same equivalence class as space, and unmappable characters would be represented as numeric character references in single-byte encodings, so the adjacencies would be with ampersand and semicolon.

The program counted the pairs, ignoring ASCII pairs, divided the count for each pair by the total (non-ASCII) pair count and divided also by class size (with a specially-picked divisor for the space-equivalent class), where class size didn’t consider case (i.e. upper and lower case didn’t count as two). Within an encoding group, the languages were merged together by taking the maximum value across languages for each character class pair. If the original count was actually zero, i.e. no occurrence of the pair in the relevant Wikipedias, the output lookup table got an implausibility marker (number 255). Otherwise, the floating point results were scaled to the 0 to 254 range to turn them into relative scores that fit into a byte. The scaling was done such that the highest spikes got clipped in a way that retained a reasonable value range otherwise instead of mapping the highest spike to 254. I also added manually-picked multipliers for some encoding groups. This made it possible to e.g. boost Greek a bit relative to Cyrillic, which made accuracy better for short inputs. For long inputs, misdetecting Cyrillic as Greek gets corrected anyway, because Greek has unmapped bytes, so sufficiently long windows-1251 input ends up disqualifying the Greek encodings according to the rule that a single unmapped byte disqualifies an encoding from consideration.

Thanks to Rust compiling to very efficient code and Rayon making it easy to process as many Wikipedias in parallel as there are hardware threads, this is a quicker processing task than it may seem.
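
As a rough illustration of that parallelism (the `PairCounts` type and `count_pairs` function below are hypothetical stand-ins for the training tool’s internals):

    use rayon::prelude::*;
    use std::path::{Path, PathBuf};

    // Hypothetical per-language pair statistics.
    #[derive(Default)]
    struct PairCounts {
        counts: Vec<u64>,
    }

    // Hypothetical stand-in for ingesting and classifying one Wikipedia dump.
    fn count_pairs(_dump: &Path) -> PairCounts {
        PairCounts::default()
    }

    // One dump per language; Rayon spreads the work across all hardware threads.
    fn train(dumps: &[PathBuf]) -> Vec<PairCounts> {
        dumps.par_iter().map(|path| count_pairs(path)).collect()
    }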

The pairs logically form a two-dimensional grid of scores (and case bits for bicameral scripts) where the rows and columns are the character classes participating in the pair. Since ASCII pairs don’t contribute to the score, the part of the grid that would correspond to ASCII-ASCII pairs is not actually stored.
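
As a sketch of how such a grid might be consulted at detection time (the table shapes and names below are hypothetical, not chardetng’s actual layout; for simplicity the sketch stores the full grid, including the ASCII-ASCII part that the real tables omit):

    // `classes` maps each byte to a character class, with the top bit used as
    // the case bit for bicameral scripts; `scores` holds the 0..=254 relative
    // scores, with 255 as the implausibility marker.
    const IMPLAUSIBLE: u8 = 255;

    fn score_pair(
        classes: &[u8; 256],
        scores: &[[u8; 128]; 128],
        prev: u8,
        next: u8,
    ) -> Option<i64> {
        // ASCII pairs are neutral: they contribute nothing either way.
        if prev.is_ascii() && next.is_ascii() {
            return Some(0);
        }
        let row = (classes[prev as usize] & 0x7F) as usize; // strip the case bit
        let col = (classes[next as usize] & 0x7F) as usize;
        match scores[row][col] {
            IMPLAUSIBLE => None, // caller applies an implausibility penalty
            s => Some(s as i64),
        }
    }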

Synthetizing Legacy from Present-Day Data

Wikipedia uses present-day Unicode-enabled orthographies. For some languages, this is not the same as the legacy 8-bit orthography. Fortunately, with the exception of Vietnamese, going from the current orthography to the legacy orthography is a simple matter of replacing specific Unicode scalar values with other Unicode scalar values one-to-one. I made the following substitutions for training (also in upper case):

  • For Azerbaijani, I replaced ə with ä to synthetize the windows-1254-compatible 1991 orthography.
  • For Mongolian, I replaced ү with ї and ө with є to apply a convention that uses Ukrainian characters to allow the use of windows-1251.
  • For Romanian, I replaced ș with ş and ț with ţ. Unicode disunified the comma-below characters from the cedilla versions at the request of the Romanian authorities, but the 8-bit legacy encodings had them unified.

These were the languages that pretty obviously required such replacements. I did not investigate the orthography of languages whose orthography I didn’t already expect to require measures like this, so there is a small chance that some other language in the training set would have required similar substitution. I’m fairly confident that that isn’t the case, though.

For Vietnamese the legacy synthesis is a bit more complicated. windows-1258 cannot represent Vietnamese in Unicode Normalization Form C. There is a need to decompose the characters. I wrote a tiny crate that performs the decomposition and ran the training with both plausible decompositions:

  • The minimal decomposition: This could plausibly arise when converting IME-originating NFC data to windows-1258. In this case, if a base is simple enough that with a tone it becomes a combination that exists as precomposed in windows-1258 due to it appearing as precomposed in windows-1252, it’s not decomposed.
  • The orthographic decomposition: This is the decomposition that arises naturally when using the standard Vietnamese keyboard layout (as opposed to IME) without normalization.

I assume that the languages themselves haven’t shifted so much from the early days of the Web that the pairwise frequencies observed from Wikipedia would not work. Also, I’m assuming that encyclopedic writing doesn’t disturb the pairwise frequencies too much relative to other writing styles and topics. Intuitively this should be true for alphabetic writing, but I have no proof. (Notably, this isn’t true for Japanese. If one takes a look at the most frequent kanji in Wikipedia titles and the most frequent kanji in Wikipedia text generally, the titles are biased towards science-related kanji.)

Misgrouped Languages

The Latin script does not fit in an 8-bit code space (and certainly not in the ISO-style code space that wastes 32 code points for the C1 controls). Even the Latin script as used in Europe does not fit in an 8-bit code space when assuming precomposed diacritic combinations. For this reason, there are multiple Latin-script legacy encodings.

Especially after the 1990s evolution of the ISO encodings into the Windows encodings, two languages, Albanian and Estonian, are left in a weird place in terms of encoding detection. Both Albanian and Estonian can be written using windows-1252 but their default “ANSI” encoding in Windows is something different: windows-1250 (Central European) and windows-1257 (Baltic), respectively.

With Albanian, the case is pretty clear. Orthographically, Albanian is a windows-1252-compatible language, and the data (from 2007?) that Google published as part of ced shows the Albanian language and the TLD for Albania strongly associated with windows-1252 and not at all associated with either windows-1250 or ISO-8859-2. It made sense to put Albanian in the windows-1252 training set for chardetng.

Estonian is a trickier case. Estonian words that aren’t recent (from the last 100 years or so?) loans can be written using the ISO-8859-1 repertoire (the non-ASCII letters being õ, ä, ö, and ü; et_EE without further suffix in glibc is an ISO-8859-1 locale, and et was an ISO-8859-1 locale in Solaris 2.6, too). Clearly, Estonian has better detection synergy with Finnish, German, and Portuguese than with Lithuanian and Latvian. For this reason, chardetng treats Estonian as a windows-1252 language.

Although the official Estonian character repertoire is fully part of windows-1252 in the final form of windows-1252 reached in Windows 98, it wasn’t the case when Windows 95 introduced an Estonian localization of Windows and the windows-1257 Baltic encoding. At that time, windows-1252 didn’t yet have ž, which was added in Windows 98—presumably in order to match all the letter additions that ISO-8859-15 got relative to ISO-8859-1. (ISO-8859-15 got ž and š as a result of lobbying by the Finnish language regulator, which insists that these letters be used for certain loans in Finnish. windows-1252 already had š in Windows 95.) While the general design principle of the windows-125x series appears to be that if a character occurs in windows-1252, it is in the same position in the other windows-125x encodings that it occurs in, this principle does not apply to the placement of š and ž in windows-1257. ISO-8859-13 has the same placement of š and ž as windows-1257. ISO-8859-4 has yet another placement. (As does ISO-8859-15, which isn’t a possible detection outcome of chardetng.) The Estonian-native vowels are in the same positions in all these encodings.

The Estonian language regulator designates š and ž as part of the Estonian orthography, but these characters are rare in Estonian, since they are only used in recent loans and in transliteration. It’s completely normal to have Web pages where the entire page contains neither of them or only a single occurrence of one of them. Still, they are common enough that you can find them on a major newspaper site with a few clicks. This, obviously, is problematic for encoding detection. Estonian gets detected as windows-1252. If the encoding actually was windows-1257, ISO-8859-13, ISO-8859-4, or ISO-8859-15, the (likely) lone instance of š or ž gets garbled.

It would be possible to add Estonian-specific post-processing logic to map the windows-1252 result to windows-1257, ISO-8859-13, or ISO-8859-4 (or even ISO-8859-15) if the content looks Estonian based on the Estonian non-ASCII vowels being the four most frequent non-ASCII letters and then checking which encoding is the best fit for š and ž. However, I haven’t taken the time to do this, and it would cause reloads of Web pages just to fix maybe one character per page.

Refinements

The foundational ideas described above weren’t quite enough. Some refinements were needed. I wrote a test harness that synthetized input from Wikipedia titles and checked how well titles that encoded to non-ASCII were detected in such a way that they roundtripped. (In some cases, there are multiple encodings that roundtrip a string. Detecting any one of them was considered a success.) I looked at the cases that chardetng failed but ced succeeded at.
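
A stripped-down sketch of that kind of roundtrip check, using chardetng’s public API with windows-1252 as the example legacy encoding (the `roundtrips` helper is illustrative, not the actual harness):

    use chardetng::EncodingDetector;
    use encoding_rs::WINDOWS_1252;

    // Returns true if the title survives the encode -> detect -> decode trip.
    // The real harness counts detection of any encoding that roundtrips the
    // bytes as a success; this sketch only checks the decoded text.
    fn roundtrips(title: &str) -> bool {
        let (bytes, _, unmappable) = WINDOWS_1252.encode(title);
        if unmappable {
            return false; // the title can't be represented in this legacy encoding
        }
        let mut detector = EncodingDetector::new();
        detector.feed(&bytes, true);
        let guessed = detector.guess(None, false); // no TLD hint, UTF-8 not allowed
        let (decoded, _, had_errors) = guessed.decode(&bytes);
        !had_errors && decoded == title
    }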

When looking at the results, it was pretty easy to figure out why a particular case failed, and it was usually pretty quick and easy to come up with and implement a new rule to address the failure mode. This iteration could be continued endlessly, but I stopped early when the result seemed competitive enough compared to ced. However, the last two items on the list were made in response to a bug report. Here are the adjustments that I made.

  • Some non-ASCII letters in Latin-script encodings got high pair-wise scores leading to misdetecting non-Latin as Latin. I remedied this by penalizing sequences of three non-ASCII letters in Latin encodings a little and penalizing sequences of four or more a lot. (Polish and Turkish do have some legitimate sequences of four non-ASCII letters, though.)

  • Treating non-ASCII punctuation and symbols as space-equivalent didn’t work, because pairing a letter with a space-like byte tends to score high but some encodings assign letters where others have symbols or punctuation. Therefore, when bytes were intended to be two letters but could be interpreted as a symbol and letter in another encoding, the latter interpretation scored higher due to the letter and space-like combination scoring higher. Such symbols and punctuation needed new non-space-equivalent character classes. I adjusted things such that no byte above 0xA0 in any single-byte encoding can get grouped together in the same character class as the ASCII space. No-break space when assigned to 0xA0 as well as Windows curly quotes and dashes when assigned below 0xA0 remain in the same character class as the ASCII space, though.

  • Symbols in the 0xA1 to 0xBF byte range were split further into classes that have different implausibility characteristics before or after letters (or both) or that are implausible next to characters from their own class. (Things like ® typically appear after letters but © appears before letters if next to a letter at all, and ®® and ©© are very unlikely.)

  • Some other characters had existence proof of occurring in pairs in Wikipedia that are in practice extremely unlikely and benefited from manually-forced implausibility. Examples include forcing implausibility of a letter after Greek final sigma, forcing implausibility of left-to-right mark and right-to-left mark next to a letter (they are supposed to be used between punctuation and space), and marking Vietnamese tones as implausible after letters other than the ones that are legitimate base characters in Vietnamese orthography. (Note that an implausibility penalty isn’t absolutely disqualifying, so long input can tolerate isolated implausibilities.)

  • Giving non-ASCII bicameral-script words that start with a capital letter a slight boost helped distinguish between Greek and the various non-Windows Cyrillic encodings.

  • Splitting the windows-1252 model into two: Icelandic and Faroese on one hand and the rest on the other. Since characters that (within the windows-1252 language set) are specific to Icelandic and Faroese are the first ones that get replaced with something else in other Latin-script Windows and ISO encodings, not merging their use with scores for other windows-1252 languages helps keep the models for windows-1257 and windows-1254 relatively more distinctive.

  • I made use of the fact that present-day Korean uses ASCII spaces between words while Chinese and Japanese don’t.

  • Giving the same score to any Han character turned out to be a bad idea in terms of being able to distinguish legacy CJK encodings from non-Latin single-byte encodings. Fortunately, the legacy CJK encodings have a coarse frequency classification built in, and the most frequent class has nice properties relative to single-byte encodings.

    JIS X 0208, GB2312, and the original Big5 have their kanji/hanzi organized into two levels roughly according to respective locales’ education systems’ classification at the time of initial standard creation. That is, Level 1 corresponds to the most frequent kanji/hanzi that the education systems prioritize. KS X 1001 instead splits into common Hangul and to hanja (very rare). This means that just by looking at the bytes, it’s possible to classify legacy CJK characters into three frequency classes: Level 1 (or Hangul), Level 2 (or hanja in the case of EUC-KR), and other (rare Hangul in the case of EUC-KR). These three can, and now are, given different scores.

    Moreover, these have the fortuitous byte mapping that the Level 1 or common Hangul section in each encoding uses lower byte values for the lead byte while non-Thai, non-Arabic Windows and ISO encodings use high byte values for the common non-ASCII characters: for lower case in bicameral scripts or, in the case of Hebrew, for the consonants. This yields naturally distinctive scoring except for Thai and, to a lesser extent, Arabic.

  • Some lead bytes in CJK encodings that overlap with windows-125x non-ASCII punctuation are problematic, because they pair with ASCII trail bytes in ways that can occur in Latin-script text without other adjacent letters that would trigger a Latin adjacency penalty. For example, without intervention, “Rock ’n Roll” could get interpreted as “Rock 地 Roll”. I made it so that the score for CJK characters with a problematic lead byte is committed only if the next character is a CJK character, too.

  • The differences between EUC-JP (presence of kana), EUC-KR (mainly just Hangul and with spaces between words), and GBK don’t necessarily show up in short titles. This is to be expected given the fundamental bet made in the design. Still, this made chardetng look bad relative to ced. I deviated from the plan of not having CJK frequency tables by including tables of the most frequent JIS X 0208 Level 1 Kanji, the most frequent GB2312 Level 1 Hanzi, and the most frequent KS X 1001 Hangul. I set the cutoff for “most frequent” to 128, so the resulting tables ended up being very small but still effective. (Big5 is structurally distinctive even with short inputs, so after trying a table of the most frequent Big5 Level 1 Hanzi, I removed it as unnecessary.)

  • Thai needed byte range-specific multipliers to tune it relative to GBK.

  • It was impractical to give score to windows-1252 ordinal indicators in the pairwise model without breaking Romanian detection. For this reason, there’s a state machine that scores ordinal indicator usage based on a little more context than just byte pairs. This boosted detection accuracy especially for Italian but also for Portuguese, Castilian, Catalan, and Galician.

  • The byte 0xA0 is no-break space in most encodings. To avoid misdetecting an odd number of no-break spaces as IBM866 and to avoid misdetecting an even number as Chinese or Korean, 0xA0 is treated as a problematic lead for CJK purposes, and there’s a special case not to apply the score to IBM866 from certain combinations involving 0xA0.

  • To avoid detecting windows-1252 English as windows-1254, Latin candidates don’t count a score for a pair that involves an ASCII byte and a space-like non-ASCII byte. Otherwise, the score for Turkish dotless ı in word-final position would be applied to English I’ (as in “I’ve” or “I’m”). While this would decode to the right characters and look right, it would cause an unnecessary reload in Firefox.

While the single-byte letter pair scores arose from data, and the Level 1 Kanji/Hanzi score was then calibrated relative to Arabic scoring (and the common Hangul score is one higher than that, with a penalty for implausibly long words to distinguish it from Chinese given enough input), in the interest of expediency I assigned the rest of the scoring, including penalty values, as educated guesses rather than trying to build any kind of training framework to calibrate them optimally.

TLD-Awareness

The general accuracy characterization relates to generic domains, such as .com. In the case of ccTLDs that actually appear to be in local use (such as .fi) as opposed to generic use (such as .tv), chardetng penalizes encodings that are less plausible for the ccTLD but are known to be confusable with the encoding(s) plausible for the ccTLD. Therefore, typical legacy content on ccTLDs is even more likely to get the right guess than legacy content on generic domains. On the flip side, the guess may be worse for atypical legacy content on ccTLDs.

Integration with Firefox

Firefox asks chardetng to provide a guess twice during the loading of a page. If there is no HTTP-level character encoding declaration and no BOM, Firefox buffers up to 1024 bytes for the <meta charset> pre-scan. Therefore, if the pre-scan fails and content-based detection is necessary, there is always already a buffer of the first 1024 bytes. chardetng makes its first guess from that buffer and the top-level domain. This doesn’t involve reloading anything, because the first 1024 bytes haven’t been decoded yet.

Then, when the end of the stream is reached, chardetng guesses again. If the guess differs from the earlier guess, the page is reloaded using the new guess. By making the second guess at the end of the stream, chardetng has the maximal information available, and there is no need to estimate what portion of the stream would have been sufficient to make the guess. Also, unlike Chrome’s approach of examining the first chunk of data that the networking subsystem passes to the HTML parser, this approach does not depend on how the stream is split into buffers. If the title of the page fit into the first 1024 bytes, chances are very good that the first guess was already correct. (This is why evaluating accuracy for “title-length input” was interesting.) Encoding-related reloads are a pre-existing behavior of Gecko. A <meta charset> that does not fit in the first 1024 bytes also triggers a reload.
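
As a rough illustration of this two-phase flow from the caller’s side, here is a minimal sketch assuming the public API of the chardetng crate (EncodingDetector::new, feed(), and guess()). The buffering, <meta charset> pre-scan, and reload machinery of Gecko is only stubbed out; the function is a simplified stand-in rather than actual Firefox code.

    use chardetng::EncodingDetector;
    use encoding_rs::Encoding;

    // `tld` is the top-level domain as bytes (for example the bytes "fi" for
    // a .fi page) or None for a generic domain. UTF-8 is not allowed as an
    // outcome here, matching the behavior for http(s) content.
    fn detect_like_gecko(
        first_1024: &[u8],
        rest_of_stream: &[u8],
        tld: Option<&[u8]>,
    ) -> (&'static Encoding, &'static Encoding) {
        let mut detector = EncodingDetector::new();

        // First guess: made from the same 1024-byte buffer that the
        // <meta charset> pre-scan already required.
        detector.feed(first_1024, false);
        let early_guess = detector.guess(tld, false);

        // Second guess: after the end of the stream has been reached.
        detector.feed(rest_of_stream, true);
        let final_guess = detector.guess(tld, false);

        (early_guess, final_guess)
    }

If the second guess differs from the first, the page gets reloaded with the new encoding, as described above.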

In addition to the .in and .lk TLDs being exempt from chardetng (explanation why comes further down), the .jp TLD uses a Japanese-specific detector that makes its decision as soon as logically possible to decide between ISO-2022-JP, Shift_JIS and EUC-JP rather than waiting until the end of the stream.

Evaluation

So is it any good?

Accuracy

Is it more or less accurate than ced? This question does not have a simple answer. First of all, accuracy depends on the input length, and chardetng and ced scale down differently. They also scale up differently. After all, as noted, one of the fundamental bets of chardetng was that it was OK to let legacy CJK detection accuracy scale down less well in order to have binary size savings. In contrast, ced appears to have been designed to scale down well. However, ced in Chrome gets to guess once per page but chardetng in Firefox gets to revise its guess. Second, for some languages ced is more accurate and for others chardetng is more accurate. There is no right way to weigh the languages against each other to come up with a single number.

Moreover, since so many languages are windows-1252 languages, just looking at the number of languages is misleading. A detector that always guesses windows-1252 would be more accurate for a large number of languages, especially with short inputs that don’t exercise the whole character repertoire of a given language. In fact, guessing windows-1252 for pretty much anything Latin-script makes the old chardet (tested as the Rust port) look really good for windows-1252. (Other than that, both chardetng and ced are clearly so much better than ICU4C and old chardet that I will limit further discussion to chardetng vs. ced.)

There are three measures of accuracy that I think are relevant:

Title-length accuracy

Given page titles that contain at least one non-ASCII character, what percentage is detected right (where “right” is defined as the bytes decoding correctly; for some bytes there may be multiple encodings that decode the bytes the same way, so any of them is “right”)?

Since Firefox first asks chardetng to guess from the first 1024 bytes, chances are that the only bit of non-ASCII content that participates in the guess is the page title.

Document-length accuracy

Given documents that contain at least one non-ASCII character, what percentage is detected right (where “right” is defined as the bytes decoding correctly; for some bytes there may be multiple encodings that decode the bytes the same way, so any of them is “right”)?

Since Firefox asks chardetng to guess again at the end of the stream, it is relevant how well the detector does with full-document input.

Document-length-equivalent number of non-ASCII bytes

Given the guess that is made from a full document, what prefix length, counted as number of non-ASCII bytes in the prefix, is sufficient to look at for getting the same result?
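
A small sketch of the roundtrip criterion used in these measurements (this is not the author’s actual test harness; it assumes the chardetng and encoding_rs crates):

    use chardetng::EncodingDetector;

    // A guess counts as "right" if decoding the bytes with the guessed
    // encoding reproduces the original text, even when the guessed encoding
    // is not the one that actually produced the bytes.
    fn detection_roundtrips(original: &str, encoded: &[u8], tld: Option<&[u8]>) -> bool {
        let mut detector = EncodingDetector::new();
        detector.feed(encoded, true);
        let guessed = detector.guess(tld, false);
        let (decoded, _, _) = guessed.decode(encoded);
        decoded.as_ref() == original
    }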

The set of languages I tested was the set of languages that have Web Platform-relevant legacy encodings, have at least 10000 articles in the language’s Wikipedia, and are not too close to being ASCII-only: an, ar, az, be, bg, br, bs, ca, ce, cs, da, de, el, es, et, eu, fa, fi, fo, fr, ga, gd, gl, he, hr, ht, hu, is, it, ja, ko, ku, lb, li, lt, lv, mk, mn, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sq, sr, sv, th, tr, uk, ur, vi (orthographically decomposed), vi (minimally decomposed), wa, yi, zh-hans, zh-hant. (The last two were algorithmically derived from the zh Wikipedia using MediaWiki’s own capabilities.)

It’s worth noting that it’s somewhat questionable to use the same data set for training and for assessing accuracy. The accuracy results would be stronger if the detector was shown accurate on a data set independent from the training data. In this sense, the results for ced are stronger than the results for chardetng. Unfortunately, it’s not simple to obtain a data set alternative to Wikipedia, which is why the same data set is used for both purposes.

Title-Length Accuracy

If we look at Wikipedia article titles (after rejecting titles that encode to all ASCII, kana-containing titles in Chinese Wikipedia, and some really short mnemonic titles for Wikipedia meta pages), we can pick some accuracy threshold and see how many languages ced and chardetng leave below the threshold.

No matter what accuracy threshold is chosen, ced leaves more combinations of language and encoding below the threshold, but among the least accurate are Vietnamese, which is simply unsupported by ced, and ISO-8859-4 and ISO-8859-6, which are not as relevant as Windows encodings. Still, I think it’s fair to say that chardetng is overall more accurate on this title-length threshold metric, although it’s still possible to argue about the relative importance. For example, one might argue that considering the number of users it should matter more which one does better on Simplified Chinese (ced) than on Breton or Walloon (chardetng). This metric is questionable, because there are so many windows-1252 languages (some of which make it past the 10000 Wikipedia article threshold by having lots of stub articles) that a detector that always guessed windows-1252 would get a large number of languages right (old chardet is close to this characterization).

If we put the threshold at 80%, the only languages that chardetng leaves below the threshold are Latvian (61%) and Lithuanian (48%). Fixing the title-length accuracy for Latvian and Lithuanian is unlikely to be possible without a binary size penalty. ced spends 8 KB on a trigram table that improves Latvian and Lithuanian accuracy somewhat but still leaves them less accurate than most other Latin-script languages. An alternative possibility would be to have distinct models for Lithuanian and Latvian to be able to boost them individually such that the boosted models wouldn’t compete with languages that match the combination of Lithuanian and Latvian but don’t match either individually. The non-ASCII letter sets of Lithuanian and Latvian are disjoint except for č, š and ž. Anyway, as noted earlier, the detector is primarily for the benefit of non-Latin scripts, since the failure mode for Latin scripts is relatively benign. For this reason, I have not made the effort to split Lithuanian and Latvian into separate models.

The bet on trading away the capability to scale down for legacy CJK shows up the most clearly for GBK: chardetng is only 88% accurate on GBK-encoded Simplified Chinese Wikipedia titles while ced is 95% accurate. This is due to the GBK accuracy of chardetng being bad with fewer than 6 hanzi. Five syllables is still a plausible Korean word length, so the penalty for implausibly long Korean words doesn’t take effect to decide that the input is Chinese. Overall, I think failing initially for cases where the input is shorter than 6 hanzi is a reasonable price to pay for the binary size savings. After all, once the detector has seen the whole page, it can for sure correct itself and figure out the distinction between GBK and EUC-KR. (Likewise, if the 80% threshold from the previous paragraph seems bad, i.e. a one-in-five failure rate, it’s good to remember that it just means that one in five cases needs the guess corrected after the whole page has been examined.)

In the table, language/encoding combinations for which chardetng is worse than ced by more than one percentage point are highlighted with bold font and a tomato background. The combinations for which chardetng is worse than ced by one percentage point are highlighted with italic font and a thistle background.

Language | Encoding | chardetng | ced | chardet | ICU4C
an | windows-1252 | 98% | 97% | 100% | 92%
ar | ISO-8859-6 | 89% | 49% | 0% | 64%
ar | windows-1256 | 88% | 98% | 1% | 41%
az | windows-1254 | 92% | 73% | 44% | 49%
be | ISO-8859-5 | 99% | 88% | 96% | 45%
be | KOI8-U | 99% | 78% | 23% | 5%
be | windows-1251 | 99% | 99% | 72% | 34%
bg | ISO-8859-5 | 98% | 92% | 98% | 51%
bg | KOI8-U | 97% | 94% | 97% | 40%
bg | windows-1251 | 94% | 98% | 77% | 44%
br | windows-1252 | 97% | 60% | 100% | 82%
bs | ISO-8859-2 | 89% | 68% | 3% | 20%
bs | windows-1250 | 89% | 87% | 39% | 50%
ca | windows-1252 | 92% | 89% | 100% | 92%
ce | IBM866 | 99% | 96% | 93% | 0%
ce | ISO-8859-5 | 99% | 93% | 98% | 44%
ce | KOI8-U | 99% | 97% | 98% | 39%
ce | windows-1251 | 98% | 99% | 69% | 42%
cs | ISO-8859-2 | 86% | 80% | 39% | 55%
cs | windows-1250 | 86% | 93% | 53% | 65%
da | windows-1252 | 95% | 92% | 100% | 88%
de | windows-1252 | 99% | 97% | 100% | 94%
el | ISO-8859-7 | 96% | 92% | 91% | 63%
el | windows-1253 | 97% | 90% | 91% | 61%
es | windows-1252 | 99% | 98% | 100% | 96%
et | windows-1252 | 98% | 95% | 99% | 83%
eu | windows-1252 | 96% | 94% | 100% | 87%
fa | windows-1256 | 88% | 97% | 0% | 19%
fi | windows-1252 | 99% | 98% | 99% | 88%
fo | windows-1252 | 94% | 86% | 98% | 79%
fr | windows-1252 | 94% | 91% | 100% | 96%
ga | windows-1252 | 100% | 98% | 100% | 89%
gd | windows-1252 | 89% | 66% | 100% | 83%
gl | windows-1252 | 99% | 98% | 100% | 94%
he | windows-1255 | 97% | 95% | 93% | 59%
hr | ISO-8859-2 | 95% | 70% | 8% | 24%
hr | windows-1250 | 95% | 87% | 41% | 52%
ht | windows-1252 | 96% | 71% | 100% | 83%
hu | ISO-8859-2 | 93% | 96% | 76% | 74%
hu | windows-1250 | 93% | 96% | 77% | 75%
is | windows-1252 | 92% | 85% | 95% | 75%
it | windows-1252 | 95% | 90% | 100% | 90%
ja | EUC-JP | 86% | 55% | 56% | 17%
ja | Shift_JIS | 95% | 99% | 37% | 17%
ko | EUC-KR | 95% | 98% | 68% | 7%
ku | windows-1254 | 81% | 42% | 80% | 54%
lb | windows-1252 | 96% | 86% | 100% | 95%
li | windows-1252 | 94% | 74% | 99% | 87%
lt | ISO-8859-4 | 71% | 47% | 2% | 3%
lt | windows-1257 | 48% | 88% | 2% | 2%
lv | ISO-8859-4 | 70% | 45% | 3% | 4%
lv | windows-1257 | 61% | 74% | 3% | 3%
mk | ISO-8859-5 | 98% | 91% | 97% | 48%
mk | KOI8-U | 96% | 97% | 97% | 36%
mk | windows-1251 | 95% | 98% | 73% | 41%
mn | KOI8-U | 97% | 71% | 71% | 11%
mn | windows-1251 | 94% | 97% | 64% | 12%
nn | windows-1252 | 94% | 94% | 100% | 84%
no | windows-1252 | 95% | 95% | 100% | 91%
oc | windows-1252 | 91% | 80% | 100% | 85%
pl | ISO-8859-2 | 92% | 97% | 23% | 64%
pl | windows-1250 | 90% | 96% | 26% | 65%
pt | windows-1252 | 97% | 98% | 100% | 96%
ro | ISO-8859-2 | 90% | 55% | 30% | 53%
ro | windows-1250 | 90% | 56% | 31% | 55%
ru | IBM866 | 99% | 96% | 91% | 0%
ru | ISO-8859-5 | 99% | 93% | 98% | 46%
ru | KOI8-U | 98% | 97% | 98% | 41%
ru | windows-1251 | 97% | 99% | 73% | 44%
sh | ISO-8859-2 | 91% | 78% | 44% | 50%
sh | windows-1250 | 93% | 93% | 79% | 83%
sk | ISO-8859-2 | 90% | 81% | 54% | 57%
sk | windows-1250 | 87% | 93% | 72% | 68%
sl | ISO-8859-2 | 92% | 73% | 10% | 26%
sl | windows-1250 | 91% | 93% | 51% | 59%
sq | windows-1252 | 98% | 53% | 100% | 89%
sr | ISO-8859-5 | 99% | 96% | 99% | 41%
sr | KOI8-U | 99% | 98% | 99% | 27%
sr | windows-1251 | 99% | 99% | 87% | 34%
sv | windows-1252 | 96% | 94% | 100% | 92%
th | windows-874 | 93% | 96% | 86% | 0%
tr | windows-1254 | 84% | 87% | 41% | 52%
uk | KOI8-U | 98% | 81% | 34% | 10%
uk | windows-1251 | 98% | 98% | 69% | 34%
ur | windows-1256 | 86% | 87% | 0% | 13%
vi | windows-1258 (orthographic) | 93% | 10% | 11% | 10%
vi | windows-1258 (minimally decomposed) | 91% | 21% | 22% | 19%
wa | windows-1252 | 98% | 71% | 100% | 84%
yi | windows-1255 | 93% | 86% | 86% | 30%
zh-hans | GBK | 88% | 95% | 28% | 5%
zh-hant | Big5 | 95% | 94% | 25% | 5%

Document-length Accuracy

If we look at Wikipedia articles themselves and filter out ones whose wikitext UTF-8 byte length is 6000 or less (arbitrary threshold to try to filter out stub articles), chardetng looks even better compared to ced in terms of how many languages are left below a given accuracy threshold.

If the accuracy is rounded to full percents, ced leaves 29 language/encoding combinations at worse than 98% (i.e. 97% or lower). chardetng leaves 8. Moreover, ced leaves 22 combinations below the 89% threshold. chardetng leaves 1: Lithuanian as ISO-8859-4. That’s a pretty good result!

Language | Encoding | chardetng | ced | chardet | ICU4C
an | windows-1252 | 99% | 99% | 100% | 100%
ar | ISO-8859-6 | 100% | 100% | 0% | 94%
ar | windows-1256 | 100% | 100% | 0% | 93%
az | windows-1254 | 99% | 46% | 1% | 88%
be | ISO-8859-5 | 100% | 100% | 100% | 66%
be | KOI8-U | 100% | 100% | 0% | 0%
be | windows-1251 | 100% | 100% | 81% | 66%
bg | ISO-8859-5 | 100% | 100% | 100% | 89%
bg | KOI8-U | 100% | 94% | 93% | 83%
bg | windows-1251 | 100% | 100% | 100% | 89%
br | windows-1252 | 100% | 23% | 100% | 99%
bs | ISO-8859-2 | 100% | 8% | 0% | 23%
bs | windows-1250 | 100% | 99% | 0% | 24%
ca | windows-1252 | 100% | 99% | 100% | 100%
ce | IBM866 | 100% | 100% | 100% | 0%
ce | ISO-8859-5 | 100% | 100% | 100% | 51%
ce | KOI8-U | 100% | 96% | 95% | 37%
ce | windows-1251 | 100% | 100% | 98% | 51%
cs | ISO-8859-2 | 100% | 6% | 0% | 84%
cs | windows-1250 | 100% | 100% | 0% | 85%
da | windows-1252 | 100% | 100% | 100% | 100%
de | windows-1252 | 100% | 98% | 100% | 100%
el | ISO-8859-7 | 97% | 31% | 57% | 95%
el | windows-1253 | 100% | 100% | 17% | 64%
es | windows-1252 | 100% | 100% | 100% | 100%
et | Better of windows-1252 and windows-1257 | 100% | 98% | 98% | 98%
eu | windows-1252 | 98% | 98% | 100% | 100%
fa | windows-1256 | 100% | 100% | 0% | 12%
fi | windows-1252 | 100% | 77% | 100% | 99%
fo | windows-1252 | 95% | 98% | 100% | 99%
fr | windows-1252 | 100% | 100% | 100% | 100%
ga | windows-1252 | 99% | 100% | 100% | 100%
gd | windows-1252 | 99% | 75% | 100% | 99%
gl | windows-1252 | 100% | 100% | 100% | 100%
he | windows-1255 | 100% | 100% | 100% | 84%
hr | ISO-8859-2 | 98% | 17% | 2% | 65%
hr | windows-1250 | 98% | 99% | 4% | 68%
ht | windows-1252 | 99% | 73% | 100% | 100%
hu | ISO-8859-2 | 89% | 85% | 1% | 85%
hu | windows-1250 | 89% | 98% | 1% | 82%
is | windows-1252 | 99% | 99% | 100% | 99%
it | windows-1252 | 97% | 94% | 100% | 100%
ja | EUC-JP | 100% | 100% | 99% | 100%
ja | Shift_JIS | 100% | 100% | 92% | 100%
ko | EUC-KR | 100% | 100% | 94% | 100%
ku | windows-1254 | 96% | 6% | 8% | 44%
lb | windows-1252 | 100% | 91% | 100% | 100%
li | windows-1252 | 100% | 32% | 100% | 100%
lt | ISO-8859-4 | 54% | 87% | 0% | 0%
lt | windows-1257 | 94% | 99% | 0% | 0%
lv | ISO-8859-4 | 98% | 99% | 0% | 0%
lv | windows-1257 | 99% | 100% | 0% | 0%
mk | ISO-8859-5 | 100% | 100% | 100% | 83%
mk | KOI8-U | 100% | 98% | 97% | 82%
mk | windows-1251 | 100% | 100% | 99% | 83%
mn | KOI8-U | 100% | 99% | 1% | 0%
mn | windows-1251 | 100% | 99% | 98% | 1%
nn | windows-1252 | 100% | 100% | 100% | 100%
no | windows-1252 | 99% | 99% | 100% | 100%
oc | windows-1252 | 100% | 98% | 100% | 98%
pl | ISO-8859-2 | 99% | 98% | 0% | 84%
pl | windows-1250 | 99% | 100% | 0% | 85%
pt | windows-1252 | 99% | 100% | 100% | 100%
ro | ISO-8859-2 | 99% | 66% | 0% | 82%
ro | windows-1250 | 99% | 71% | 1% | 78%
ru | IBM866 | 100% | 100% | 100% | 0%
ru | ISO-8859-5 | 100% | 100% | 100% | 93%
ru | KOI8-U | 100% | 96% | 93% | 86%
ru | windows-1251 | 100% | 100% | 97% | 93%
sh | ISO-8859-2 | 99% | 11% | 0% | 31%
sh | windows-1250 | 99% | 98% | 4% | 36%
sk | ISO-8859-2 | 99% | 41% | 0% | 64%
sk | windows-1250 | 99% | 100% | 13% | 65%
sl | ISO-8859-2 | 99% | 33% | 0% | 41%
sl | windows-1250 | 98% | 98% | 2% | 46%
sq | windows-1252 | 100% | 16% | 100% | 100%
sr | ISO-8859-5 | 100% | 100% | 100% | 22%
sr | KOI8-U | 100% | 100% | 100% | 22%
sr | windows-1251 | 100% | 100% | 99% | 22%
sv | windows-1252 | 100% | 100% | 100% | 100%
th | windows-874 | 100% | 91% | 99% | 0%
tr | windows-1254 | 99% | 97% | 0% | 80%
uk | KOI8-U | 100% | 100% | 0% | 0%
uk | windows-1251 | 100% | 100% | 99% | 80%
ur | windows-1256 | 99% | 98% | 1% | 5%
vi | windows-1258 (orthographic) | 100% | 0% | 0% | 0%
vi | windows-1258 (minimally decomposed) | 99% | 0% | 0% | 0%
wa | windows-1252 | 100% | 79% | 100% | 99%
yi | windows-1255 | 100% | 100% | 99% | 30%
zh-hans | GBK | 100% | 100% | 100% | 100%
zh-hant | Big5 | 100% | 100% | 99% | 100%

Document-length-equivalent number of non-ASCII bytes

I examined how truncated input compared to document-length input, starting from 10 non-ASCII bytes and continuing at 10-byte intervals until 100 and then at coarser intervals, truncating by one byte more whenever the truncation would otherwise render CJK input invalid.
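
As a sketch of how such truncation by non-ASCII byte count can be done (the extra adjustment for keeping CJK byte pairs intact is omitted here):

    // Return the shortest prefix of `bytes` that contains `n` non-ASCII bytes,
    // or all of `bytes` if there are fewer than `n` of them.
    fn prefix_with_n_non_ascii(bytes: &[u8], n: usize) -> &[u8] {
        let mut seen = 0;
        for (i, byte) in bytes.iter().enumerate() {
            if !byte.is_ascii() {
                seen += 1;
                if seen == n {
                    return &bytes[..=i];
                }
            }
        }
        bytes
    }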

For legacy CJK encodings, chardetng achieves document-length-equivalent accuracy with about 10 non-ASCII bytes. For most windows-1252 and windows-1251 languages, chardetng achieves document-length-equivalent accuracy with about 20 non-ASCII bytes. Obviously, this means shorter overall input for windows-1251 than for windows-1252. At 50 non-ASCII bytes, there are very few language/encoding combinations that haven’t completely converged. Some oscillate back a little afterwards, and almost everything has settled at 90 non-ASCII bytes.

Hungarian and ISO-8859-2 Romanian are special cases that haven’t completely converged even at 1000 non-ASCII bytes.

While the title-length case showed that ced scaled down better in some cases, the advantage is already lost at 10 non-ASCII bytes. While ced has document-length-equivalent accuracy for the legacy CJK encodings at 10 non-ASCII bytes, the rest take significantly longer to converge than they do with chardetng.

ced had a number of windows-1252 and windows-1251 cases that converged at 20 non-ASCII bytes, as with chardetng. However, it had more cases, including windows-1252 and windows-1251 cases, whose convergence went into hundreds of non-ASCII bytes. Notably, KOI8-U as an encoding was particularly bad at converging to document-length equivalence and, for most languages for which it is relevant, had not converged even at 1000 non-ASCII bytes.

Overall, I think it is fair to say that ced may scale down better in some cases where there are fewer than 10 non-ASCII bytes, but chardetng generally scales up better from 10 non-ASCII bytes onwards. (The threshold may be a bit under 10, but because the computation of these tests is quite slow, I did not spend time searching for the exact threshold.)

ISO-8859-7 Greek exhibited strange enough behavior with both chardetng and ced in this test that it made me suspect the testing method has some ISO-8859-7-specific problem, but I did not have time to investigate the ISO-8859-7 Greek issue.

Binary Size

Is it more compact than ced? As of Firefox 78, chardetng and the special-purpose Japanese encoding detector shift_or_euc contribute 62 KB to x86_64 Android libxul size, when the crates that these depend on are treated as sunk cost (i.e. Firefox already has the dependencies anyway, so they don’t count towards the added bytes). When built as part of libxul, ced contributes 226 KB to x86_64 Android libxul size. The binary size contribution of chardetng and shift_or_euc together is 28% of what the binary size contribution of ced would be. The x86 situation is similar.

Section | chardetng + shift_or_euc | ced
.text | 24.6 KB | 34.3 KB
.rodata | 30.8 KB | 120 KB
.data.rel.ro | 2.52 KB | 59.9 KB

On x86_64, the goal of creating something smaller than ced worked out very well. On ARMv7 and aarch64, chardetng and shift_or_euc together result in smaller code than ced but by a less impressive factor. PGO effects ended up changing other code in unfortunate ways so much that it doesn’t make sense to give exact numbers.

Speed

chardetng is slower than ced. In single-threaded mode, chardetng takes 42% longer than ced to process the same input on Haswell.

This is not surprising, since I intentionally resolved most tradeoffs between binary size and speed in favor of smaller binary size at the expense of speed. When I resolved tradeoffs in favor of speed instead of binary size, I didn’t do so primarily for speed but for code readability. Furthermore, Firefox feeds chardetng more data than Chrome feeds to ced, so it’s pretty clear that overall Firefox spends more time in encoding detection than Chrome does.

I think optimizing for binary size rather than speed is the right tradeoff for code that only runs on legacy pages and doesn’t run at all for modern pages. (Also, microbenchmarks don’t show the cache effects on the performance of other code that likely result from ced having a larger working set of data tables than chardetng does.)

Rayon

Encoding detectors are structured as a number of probes that process the same data logically independently of each other. On the surface, this structure looks perfect for parallelization using one of Rust’s superpowers: being able to easily convert an iteration to use multiple worker threads using Rayon.
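
For illustration, here is roughly what that structure looks like when expressed with Rayon; the Probe type and its scoring are made up for this sketch and do not reflect chardetng’s actual internals.

    use rayon::prelude::*;

    // Hypothetical probe: one per candidate encoding, each keeping its own score.
    struct Probe {
        score: i64,
        disqualified: bool,
    }

    impl Probe {
        fn feed(&mut self, buffer: &[u8]) {
            // A real probe runs a per-encoding state machine over the bytes;
            // here every byte just bumps the score.
            if !self.disqualified {
                self.score += buffer.len() as i64;
            }
        }
    }

    // Every probe sees the same buffer independently of the others, so the
    // probes can be fed on Rayon's worker threads.
    fn feed_all_probes(probes: &mut [Probe], buffer: &[u8]) {
        probes.par_iter_mut().for_each(|probe| probe.feed(buffer));
    }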

Unfortunately, the result in the case of chardetng is rather underwhelming even with document-length input passed in as 128 KB chunks (the best case with Firefox’s networking stack when the network is fast). While Rayon makes chardetng faster in terms of wall-clock time, the result is very far from scaling linearly with the number of hardware threads. With 8 hyperthreads available on a Haswell desktop i7, the wall-clock result using Rayon is still slower than ced running on a single thread. The synchronization overhead is significant, and the overall sum of compute time across threads is inefficient compared to the single-threaded scenario. If there is a reason to expect parallelism from higher-level task division, it doesn’t make sense to enable the Rayon mode in chardetng. As of Firefox 78, the Rayon mode isn’t used in Firefox, and I don’t expect to enable the Rayon mode for Firefox.

Risks

There are some imaginable risks that testing with data synthesized from Wikipedia cannot reveal.

Big5 and EUC-KR Private-Use Characters

Because a single encoding error disqualifies an encoding, Big5 or EUC-KR could get disqualified by a single private-use character that is not acknowledged by the Encoding Standard.

The Encoding Standard definition of EUC-KR does not acknowledge the Private Use Area mappings in Windows code page 949. Windows maps byte pairs with lead byte 0xC9 or 0xFE and trail byte 0xA1 through 0xFE (inclusive) to the Private Use Area (and byte 0x80 to U+0080). The use cases, especially on the Web, for this area are mostly theoretical. However, if a page somehow managed to have a PUA character like this, the detector would reject EUC-KR as a possible detection outcome.
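
Expressed as code, the byte pairs in question are simply these (a small sketch of the ranges mentioned above, not code taken from chardetng):

    // Byte pairs that Windows code page 949 maps to the Private Use Area but
    // that the Encoding Standard's EUC-KR treats as errors: lead byte 0xC9 or
    // 0xFE with a trail byte in 0xA1..=0xFE. A single such pair is enough to
    // make EUC-KR impossible as a detection outcome.
    fn is_cp949_pua_pair(lead: u8, trail: u8) -> bool {
        (lead == 0xC9 || lead == 0xFE) && (0xA1..=0xFE).contains(&trail)
    }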

In the case of Big5, the issue is slightly less theoretical. Big5 as defined in the Encoding Standard fills the areas that were originally for private use but that were taken by Hong Kong Supplementary Character Set with actual non-PUA mappings for the HKSCS characters. However, this still leaves a range below HKSCS, byte pairs whose lead byte is 0x81 through 0x86 (inclusive), unmapped and, therefore, treated as errors by the Encoding Standard. Big5 has had more extension activity than EUC-KR, including mutually-incompatible extensions, and the Han script gives more of a reason (than Hangul) to use the End-User-Defined Characters feature of Windows. Therefore, it is more plausible that a private-use character could find its way into a Big5-encoded Web page than into an EUC-KR-encoded Web page.

For GBK, the Encoding Standard supports the PUA mappings that Windows has. For Shift_JIS, the Encoding Standard supports two-byte PUA mappings that Windows has. (Windows also maps a few single bytes to PUA code points.) Therefore, the concern raised in this section is moot for GBK and Shift_JIS.

I am slightly uneasy that Big5 and EUC-KR in the Encoding Standard are, on the topic of (non-HKSCS) private use characters, inconsistent with Windows and with the way the Encoding Standard handles GBK and Shift_JIS. However, others have been rather opposed to adding the PUA mappings and I haven’t seen actual instances of problems in the real world, so I haven’t made a real effort to get these mappings added.

(Primarily Tamil) Font Hacks

Some scripts had (single-byte) legacy encodings that didn’t make it to IE4 and, therefore, to the Web Platform. These were supported by having the page decode as windows-1252 (possible via an x-user-defined declaration, which meant windows-1252 decoding in IE4; the x-user-defined encoding in the Encoding Standard is a different, Mozilla-originated thing that is relevant to legacy XHR) and by having the user install an intentionally misencoded font that assigned non-Latin glyphs to windows-1252 code points.

In some cases, there was some kind of cross-font agreement on how these were arranged. For example, for Armenian there was ARMSCII-8. (Gecko at one point implemented it as a real character encoding, but doing so was useless, because Web sites that used it didn’t declare it as such.) In other cases, these arrangements were font-specific and the relevant sites were simply popular enough (e.g. sites of newspapers in India) to be able to demand that the user install a particular font.

ced knows about a couple of font-specific encodings for Devanagari and multiple encodings, both font-specific and the Tamil Nadu state standard, for Tamil. The Tamil script’s visual features make it possible to treat it as more Thai-like than Devanagari-like, which means that Tamil is more suited for font hacks than the other Brahmic scripts of India. Unicode adopted the approach standardized by the federal government of India in 1988 to treat Tamil as Devanagari-like (logical order), whereas in 1999 the state government of Tamil Nadu sought to promote treating Tamil the way Thai is treated in Unicode (visual order).

All indications are that ced being able to detect these font hacks has nothing to do with Chrome’s needs as of 2017. It is more likely that this capability was put there for the benefit of the Google search engine more than a decade earlier. However, Chrome post-processes the detection of these encodings to windows-1252, so if sites that use these font hacks still exist, they’d appear to work in Chrome. (ced doesn’t know about Armenian, Georgian, Tajik, or Kazakh 8-bit encodings despite these having had glibc locales.)

Do such sites still exist? I don’t know. Looking at the old bugs in Bugzilla, the reported sites appear to have migrated to Unicode. This makes sense. Even though these sites deferred the migration for years and years after Unicode was usable, chances are that the rise of mobile devices has forced the migration. It’s considerably less practical to tell users of mobile operating systems to install fonts than it is to tell users of desktop operating systems to do so.

So chances are that such sites no longer exist (in quantity that matters), but it’s hard to tell, and if they did, they’d work in Chrome (if using an encoding that ced knows about) but wouldn’t work with chardetng in Firefox. Instead of adding support for detecting such encodings as windows-1252, I made the .in and .lk top-level domains not run chardetng and simply fall back to windows-1252. (My understanding is that the font hacks were more about the Tamil language in India specifically than about the Tamil language generally, but I turned off chardetng for .lk just in case.) This kept the previous behavior of Firefox for these two TLDs. If the problem exists on .com/.net/.org, the problem is not solved there. Also, this has the slight downside of not detecting windows-1256 to the extent it is used under .in.
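
From the caller’s point of view, the exemption is just a policy decision made before invoking the detector. A minimal sketch, with a hypothetical helper function rather than actual Gecko code:

    use chardetng::EncodingDetector;
    use encoding_rs::{Encoding, WINDOWS_1252};

    // Skip content-based detection entirely for TLDs where font-hack content
    // may still exist, and keep the pre-existing windows-1252 fallback there.
    fn guess_with_tld_exemptions(bytes: &[u8], tld: &[u8]) -> &'static Encoding {
        if tld == b"in" || tld == b"lk" {
            return WINDOWS_1252;
        }
        let mut detector = EncodingDetector::new();
        detector.feed(bytes, true);
        detector.guess(Some(tld), false)
    }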

UTF-8

chardetng detects UTF-8 by checking if the input is valid as UTF-8. However, like Chrome, Firefox only honors this detection result for file: URLs. As in Chrome, for non-file: URLs, UTF-8 is never a possible detection outcome. If it was, Web developers could start relying on it, which would make the Web Platform more brittle. (The assumption is that at this point, Web developers want to use UTF-8 for new content, so being able to rely on legacy encodings getting detected is less harmful at this point in time.) That is, the user-facing problem of unlabeled UTF-8 is deliberately left unaddressed in order to avoid more instances of problematic content getting created.
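
In the chardetng crate this policy shows up as a flag on guess(): the caller decides whether UTF-8 is allowed as an outcome. A hedged sketch of how a caller could apply the file:-only rule (the helper function and its name are made up here):

    use chardetng::EncodingDetector;
    use encoding_rs::Encoding;

    // Allow UTF-8 as a detection outcome only when the whole resource is
    // available up front (file: URLs); for http(s), keep it disallowed so
    // that Web content cannot start depending on UTF-8 detection.
    fn guess_for_resource(bytes: &[u8], tld: Option<&[u8]>, is_file_url: bool) -> &'static Encoding {
        let mut detector = EncodingDetector::new();
        detector.feed(bytes, true);
        detector.guess(tld, is_file_url)
    }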

As with the Quirks mode being the default and everyone having to opt into the Standards mode, and on mobile a desktop-like view port being the default and everyone having to opt into a mobile-friendly view port, for encodings legacy is the default and everyone has to opt into UTF-8. In all these cases, the legacy content isn’t going to be changed to opt out.

The full implications of what would happen if UTF-8 were detected for non-file: URLs require a whole article of their own. The reason why file: URLs are different is that the entire content can be assumed to be present up front. The problems with detecting UTF-8 on non-file: URLs relate to supporting incremental parsing and display of HTML as it arrives over the network.

When UTF-8 is detected on non-file: URLs, chardetng reports the encoding affiliated with the top-level domain instead. Various test cases, both test cases that intentionally test this and test cases that accidentally end up testing this, require windows-1252 to be reported for generic top-level domains when the content is valid UTF-8. Reporting the TLD-affiliated encoding as opposed to always reporting windows-1252 avoids needless reloads on TLDs that are affiliated with an encoding other than windows-1252.

Categorieën: Mozilla-nl planet

Mozilla Privacy Blog: Mozilla releases recommendations on EU Data Strategy

vr, 05/06/2020 - 13:24

Mozilla recently submitted our response to the European Commission’s public consultation on its European Strategy for Data.  The Commission’s data strategy is one of the pillars of its tech strategy, which was published in early 2020 (more on that here). To European policymakers, promoting proper use and management of data can play a key role in a modern industrial policy, particularly as it can provide a general basis for insights and innovations that advance the public interest.

Our recommendations provide insights on how to manage data in a way that protects the rights of individuals, maintains trust, and allows for innovation. In addition to highlighting some of Mozilla’s practices and policies which underscore our commitment to ethical data and working in the open – such as our Lean Data Practices Toolkit, the Data Stewardship Program, and the Firefox Public Data Report – our key recommendations for the European Commission are the following:

  • Address collective harms: In order to foster the development of data ecosystems where data can be leveraged to serve collective benefits, legal and policy frameworks must also reflect an understanding of potential collective harms arising from abusive data practices and how to mitigate them.
  • Empower users: While enhancing data literacy is a laudable objective, data literacy is not a silver bullet in mitigating the risks and harms that would emerge in an unbridled data economy. Data literacy – i.e. the ability to understand, assess, and ultimately choose between certain data-driven market offerings – is effective only if there is actually meaningful choice of privacy-respecting goods and services for consumers. Creating the conditions for privacy-respecting goods and services to thrive should be a key objective of the strategy.
  • Explore data stewardship models (with caution): We welcome the Commission’s exploration of novel means of data governance and management. We believe data trusts and other models and structures of data governance may hold promise. However, there are a range of challenges and complexities associated with the concept that will require careful navigation in order for new data governance structures to meaningfully improve the state of data management and to serve as the foundation for a truly ethical and trustworthy data ecosystem.

We’ll continue to build out our thinking on these recommendations, and will work with the European Commission and other stakeholders to make them a reality in the EU data strategy. For now, you can find our full submission here.

 

The post Mozilla releases recommendations on EU Data Strategy appeared first on Open Policy & Advocacy.

Categorieën: Mozilla-nl planet

The Talospace Project: Firefox 77 on POWER

do, 04/06/2020 - 19:38
Firefox 77 is released. I really couldn't care less about Pocket recommendations, and I don't know who was clamouring for that exactly because everybody be tripping recommendations, but better accessibility options are always welcome and the debugging and developer tools improvements sound really nice. This post is being typed in it.

There are no OpenPOWER-specific changes in Fx77, though a few compilation issues were fixed expeditiously through Dan Horák's testing just in time for the Fx78 beta. Daniel Kolesa reported an issue with system NSS 3.52 and WebRTC, but I have not heard if this is still a problem (at least on the v2 ABI), and I always build using in-tree NSS myself which seems to be fine. This morning Daniel Pocock sent me a basic query of 64-bit Power ISA bugs yet to be fixed in Firefox; I suspect some are dupes (I closed one just this morning which I know I fixed myself already), and many are endian-specific, but we should try whittling down that list (and, as usual, LTO and PGO still need to be investigated further). I'm still using the same .mozconfigs from Firefox 67.

In a minor moment of self-promotion, I'm also shamelessly reminding readers that Fx77 comes out parallel with TenFourFox Feature Parity Release 23, relevant to Talospace readers because I made some fixes to its Content Security Policy support to properly support the web-based OpenBMC with System Package 2.00. Although the serial console-LAN redirector has some stuttery keystrokes, I think this is a timing problem rather than a feature deficiency, and everything else generally works. Connecting over ssh or serial port is naturally always an option, but I have to agree the web OpenBMC is a lot nicer and some tasks are certainly easier that way. If you're a long-term PowerPC dweeb like me and you want to use your beloved Power Mac to manage your brand-spanking-new Talos II or Blackbird, now you can.

Categorieën: Mozilla-nl planet

Hacks.Mozilla.Org: A New RegExp Engine in SpiderMonkey

do, 04/06/2020 - 16:21
Background: RegExps in SpiderMonkey

Regular expressions – commonly known as RegExps – are a powerful tool in JavaScript for manipulating strings. They provide a rich syntax to describe and capture character information. They’re also heavily used, so it’s important for SpiderMonkey (the JavaScript engine in Firefox) to optimize them well.

Over the years, we’ve had several approaches to RegExps. Conveniently, there’s a fairly clear dividing line between the RegExp engine and the rest of SpiderMonkey. It’s still not easy to replace the RegExp engine, but it can be done without too much impact on the rest of SpiderMonkey.

In 2014, we took advantage of this flexibility to replace YARR (our previous RegExp engine) with a forked copy of Irregexp, the engine used in V8. This raised a tricky question: how do you make code designed for one engine work inside another? Irregexp uses a number of V8 APIs, including core concepts like the representation of strings, the object model, and the garbage collector.

At the time, we chose to heavily rewrite Irregexp to use our own internal APIs. This made it easier for us to work with, but much harder to import new changes from upstream. RegExps were changing relatively infrequently, so this seemed like a good trade-off. At first, it worked out well for us. When new features like the ‘\u’ flag were introduced, we added them to Irregexp. Over time, though, we began to fall behind. ES2018 added four new RegExp features: the dotAll flag, named capture groups, Unicode property escapes, and look-behind assertions. The V8 team added Irregexp support for those features, but the SpiderMonkey copy of Irregexp had diverged enough to make it difficult to apply the same changes.

We began to rethink our approach. Was there a way for us to support modern RegExp features, with less of an ongoing maintenance burden? What would our RegExp engine look like if we prioritized keeping it up to date? How close could we stay to upstream Irregexp?

Solution: Building a shim layer for Irregexp

The answer, it turns out, is very close indeed. As of the writing of this post, SpiderMonkey is using the very latest version of Irregexp, imported from the V8 repository, with no changes other than mechanically rewritten #include statements. Refreshing the import requires minimal work beyond running an update script. We are actively contributing bug reports and patches upstream.

How did we get to this point? Our approach was to build a shim layer between SpiderMonkey and Irregexp. This shim provides Irregexp with access to all the functionality that it normally gets from V8: everything from memory allocation, to code generation, to a variety of utility functions and data structures.

(Diagram: the architecture of Irregexp inside SpiderMonkey. SpiderMonkey calls through the shim layer into Irregexp, providing a RegExp pattern. The Irregexp parser converts the pattern into an internal representation, and the Irregexp compiler uses the MacroAssembler API to call either the SpiderMonkey macro-assembler, which produces native code that can be executed directly, or the Irregexp bytecode generator, whose bytecode is interpreted by the Irregexp interpreter. In both cases, this produces a match result, which is returned to SpiderMonkey.)

This took some work. A lot of it was a straightforward matter of hooking things together. For example, the Irregexp parser and compiler use V8’s Zone, an arena-style memory allocator, to allocate temporary objects and discard them efficiently. SpiderMonkey’s equivalent is called a LifoAlloc, but it has a very similar interface. Our shim was able to implement calls to Zone methods by forwarding them directly to their LifoAlloc equivalents.

Other areas had more interesting solutions. A few examples:

Code Generation

Irregexp has two strategies for executing RegExps: a bytecode interpreter, and a just-in-time compiler. The former generates denser code (using less memory), and can be used on systems where native code generation is not available. The latter generates code that runs faster, which is important for RegExps that are executed repeatedly. Both SpiderMonkey and V8 interpret RegExps on first use, then tier up to compiling them later.

Tools for generating native code are very engine-specific. Fortunately, Irregexp has a well-designed API for code generation, called RegExpMacroAssembler. After parsing and optimizing the RegExp, the RegExpCompiler will make a series of calls to a RegExpMacroAssembler to generate code. For example, to determine whether the next character in the string matches a particular character, the compiler will call CheckCharacter. To backtrack if a back-reference fails to match, the compiler will call CheckNotBackReference.

Overall, there are roughly 40 available operations. Together, these operations can represent any JavaScript RegExp. The macro-assembler is responsible for converting these abstract operations into a final executable form. V8 contains no less than nine separate implementations of RegExpMacroAssembler: one for each of the eight architectures it supports, and a final implementation that generates bytecode for the interpreter. SpiderMonkey can reuse the bytecode generator and the interpreter, but we needed our own macro-assembler. Fortunately, a couple of things were working in our favour.

First, SpiderMonkey’s native code generation tools work at a higher level than V8’s. Instead of having to implement a macro-assembler for each architecture, we only needed one, which could target any supported machine. Second, much of the work to implement RegExpMacroAssembler using SpiderMonkey’s code generator had already been done for our first import of Irregexp. We had to make quite a few changes to support new features (especially look-behind references), but the existing code gave us an excellent starting point.
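
To make the shape of this interface easier to picture, here is a conceptual sketch. It is written in Rust purely for illustration, and the names, opcodes, and signatures are invented; the real RegExpMacroAssembler is a C++ class with roughly 40 virtual methods.

    // The compiler emits abstract operations through one interface...
    trait RegExpCodeSink {
        fn check_character(&mut self, expected: char, on_mismatch_label: usize);
        fn check_not_back_reference(&mut self, register: usize, on_mismatch_label: usize);
        // ...and a few dozen more operations in the real interface.
    }

    // ...and each backend decides what those operations become. One backend
    // emits compact bytecode for the interpreter:
    struct BytecodeSink {
        bytecode: Vec<u8>,
    }

    impl RegExpCodeSink for BytecodeSink {
        fn check_character(&mut self, expected: char, _on_mismatch_label: usize) {
            self.bytecode.push(0x01); // made-up opcode
            self.bytecode.extend((expected as u32).to_le_bytes());
        }

        fn check_not_back_reference(&mut self, register: usize, _on_mismatch_label: usize) {
            self.bytecode.push(0x02); // made-up opcode
            self.bytecode.push(register as u8);
        }
    }

    // A native backend would implement the same trait but emit machine code
    // through the engine's macro-assembler instead of pushing bytes here.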

Garbage Collection

Memory in JavaScript is automatically managed. When memory runs short, the garbage collector (GC) walks through the program and cleans up any memory that is no longer in use. If you’re writing JavaScript, this happens behind the scenes. If you’re implementing JavaScript, though, it means you have to be careful. When you’re working with something that might be garbage-collected – a string, say, that you’re matching against a RegExp – you need to inform the GC. Otherwise, if you call a function that triggers a garbage collection, the GC might move your string somewhere else (or even get rid of it entirely, if you were the only remaining reference). For obvious reasons, this is a bad thing. The process of telling the GC about the objects you’re using is called rooting. One of the most interesting challenges for our shim implementation was the difference between the way SpiderMonkey and V8 root things.

SpiderMonkey creates its roots right on the C++ stack. For example, if you want to root a string, you create a Rooted<JSString*> that lives in your local stack frame. When your function returns, the root disappears and the GC is free to collect your JSString. In V8, you create a Handle. Under the hood, V8 creates a root and stores it in a parallel stack. The lifetime of roots in V8 is controlled by HandleScope objects, which mark a point on the root stack when they are created, and clear out every root newer than the marked point when they are destroyed.

To make our shim work, we implemented our own miniature version of V8’s HandleScopes. As an extra complication, some types of objects are garbage-collected in V8, but are regular non-GC objects in SpiderMonkey. To handle those objects (no pun intended), we added a parallel stack of “PseudoHandles”, which look like normal Handles to Irregexp, but are backed by (non-GC) unique pointers.
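
Conceptually, the shim’s miniature HandleScope is just a mark on a stack of roots that gets truncated when the scope ends. A rough sketch of that idea (again in Rust and with invented names, not the actual shim code):

    // A stack of GC roots shared by all active scopes.
    struct RootStack {
        roots: Vec<*const ()>, // pointers the GC would treat as live
    }

    // A scope records the stack height on creation and truncates back to it
    // when dropped, releasing every root created inside the scope.
    struct HandleScope<'a> {
        stack: &'a mut RootStack,
        mark: usize,
    }

    impl<'a> HandleScope<'a> {
        fn new(stack: &'a mut RootStack) -> Self {
            let mark = stack.roots.len();
            HandleScope { stack, mark }
        }

        // A "handle" is just an index into the root stack.
        fn handle(&mut self, ptr: *const ()) -> usize {
            self.stack.roots.push(ptr);
            self.stack.roots.len() - 1
        }
    }

    impl<'a> Drop for HandleScope<'a> {
        fn drop(&mut self) {
            self.stack.roots.truncate(self.mark);
        }
    }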

Collaboration

None of this would have been possible without the support and advice of the V8 team. In particular, Jakob Gruber has been exceptionally helpful. It turns out that this project aligns nicely with a pre-existing desire on the V8 team to make Irregexp more independent of V8. While we tried to make our shim as complete as possible, there were some circumstances where upstream changes were the best solution. Many of those changes were quite minor. Some were more interesting.

Some code at the interface between V8 and Irregexp turned out to be too hard to use in SpiderMonkey. For example, to execute a compiled RegExp, Irregexp calls NativeRegExpMacroAssembler::Match. That function was tightly entangled with V8’s string representation. The string implementations in the two engines are surprisingly close, but not so close that we could share the code. Our solution was to move that code out of Irregexp entirely, and to hide other unusable code behind an embedder-specific #ifdef. These changes are not particularly interesting from a technical perspective, but from a software engineering perspective they give us a clearer sense of where the API boundary might be drawn in a future project to separate Irregexp from V8.

As our prototype implementation neared completion, we realized that one of the remaining failures in SpiderMonkey’s test suite was also failing in V8. Upon investigation, we determined that there was a subtle mismatch between Irregexp and the JavaScript specification when it came to case-insensitive, non-unicode RegExps. We contributed a patch upstream to rewrite Irregexp’s handling of characters with non-standard case-folding behaviour (like ‘ß’, LATIN SMALL LETTER SHARP S, which gives “SS” when upper-cased).

Our opportunities to help improve Irregexp didn’t stop there. Shortly after we landed the new version of Irregexp in Firefox Nightly, our intrepid fuzzing team discovered a convoluted RegExp that crashed in debug builds of both SpiderMonkey and V8. Fortunately, upon further investigation, it turned out to be an overly strict assertion. It did, however, inspire some additional code quality improvements in the RegExp interpreter.

Conclusion: Up to date and ready to go

 

What did we get for all this work, aside from some improved subscores on the JetStream2 benchmark?

Most importantly, we got full support for all the new RegExp features. Unicode property escapes and look-behind references only affect RegExp matching, so they worked as soon as the shim was complete. The dotAll flag only required a small amount of additional work to support. Named captures involved slightly more support from the rest of SpiderMonkey, but a couple of weeks after the new engine was enabled, named captures landed too. (While testing them, we turned up one last bug in the equivalent V8 code.) This brings Firefox fully up to date with the latest ECMAScript standards for JavaScript.

We also have a stronger foundation for future RegExp support. More collaboration on Irregexp is mutually beneficial. SpiderMonkey can add new RegExp syntax much more quickly. V8 gains an extra set of eyes and hands to find and fix bugs. Hypothetical future embedders of Irregexp have a proven starting point.

The new engine is available in Firefox 78, which is currently in our Developer Edition browser release. Hopefully, this work will be the basis for RegExps in Firefox for years to come.

 

The post A New RegExp Engine in SpiderMonkey appeared first on Mozilla Hacks - the Web developer blog.

Categorieën: Mozilla-nl planet

Marco Zehe: My Journey To Ghost

do, 04/06/2020 - 13:30

As I wrote in my last post, this blog has moved from WordPress to Ghost recently. Ghost is a modern publishing platform that focuses on the essentials. Unlike WordPress, it doesn't try to be the one-stop solution for every possible use case. Instead, it is a CMS geared towards bloggers, writers, and publishers of free and premium content. In other words, people like me. :-)

After a lot of research, some pros-and-cons soul searching, and some experimentation, last week I decided to go through with the migration. This blog is hosted with the Ghost Foundation's Ghost(Pro) offering. So not only do I get excellent hosting, but my monthly fee will also be a donation to the foundation and help future development. They also take care of updates for me and make sure that everything runs smoothly. And through a worldwide CDN, the site is now super fast no matter where my visitors come from.

The following should, however, also work the same on a self-hosted Ghost installation. I am not consciously aware of anything particular that would only work on the hosted Ghost(Pro) instances. So the following all assumes that you have a Ghost site running somewhere; how you run it and the details are up to you.

Publishing from iPad

One of the main reasons to choose Ghost was also the ability to publish from my iPad without any hassle. My favorite writing app, Ulysses, has had the ability to publish to Ghost since June 2019. Similar to its years-long capability to publish to WordPress and Medium, it now also does the same with Ghost through their open APIs. The Markdown I write, along with images, tags, and other bits of information, is automatically translated to concepts Ghost understands. For a complete walk-through, read the post on the Ulysses blog about this integration.

Migrating from WordPress

My journey began by following the Ghost tutorial on migrating from WordPress. In a nutshell, this consists of:

  • Installing an official exporter plugin into your WordPress site.
  • Exporting your content using that plugin.
  • Importing the export into your Ghost site.
  • Check that everything works.

Sounds easy, eh? Well, it was, except for some pitfalls. With some trial and error, and deleting and importing my content from and into my Ghost site a total of three times, I got it working, though. Here's what I learned.

Match the author

Before exporting your content from WordPress, make sure that the author's profile e-mail address matches that of the author account in Ghost. Otherwise, a new author will be created, and the posts won't be attributed to you. That is, of course, assuming that you are doing this import for yourself, not for a team mate. The matching is done by the e-mail address of the actual author profile, not by the general admin e-mail from the WordPress general settings.

Check your image paths

This is another bit that differs between Ghost and WordPress. WordPress puts images into a wp-content/uploads/year/mo/<filename> folder. The Ghost exporter tries to mimic that and puts the images in content/wordpress/year/mo/<filename>. But the links in the actually exported JSON file are not adjusted; you have to do that manually in your favorite text editor with a find-and-replace operation. And don't forget to zip up the changed JSON file back into the export you want to upload to the Ghost importer.

Permalink redirects

This was actually the hardest part for me, and with which I struggled for a few hours before I got it working. In default WordPress installations, the permalink structure looks something like mysite.com/yyyy/mm/dd/post-slug/. Some may omit the day part, but this is how things usually stand with WordPress. Ghost's permalink structure, which you can also change, by the way, is different. Its default permalinks look something like mysite.com/post-slug/. Since this was all new, I wanted to stick with the defaults and not reproduce the WordPress URL structure with custom routing.

The solution, of course, has to make sure that if someone follows a link to one of my previous posts from another site, or from a not yet updated Google search index, they still get my post displayed, not a 404 Page Not Found error. And the proper way to do that is with permanent 301 redirects. Those are actually quite powerful, because they support regular expressions, or RegEx.

Regular expressions are powerful search patterns. They can, for example, do things like „Look for a string that starts somewhere with a forward slash, followed by 4 digits, followed by another slash, followed by 2 digits, another slash, another 2 digits, another slash, and some arbitrary string of characters until you reach the end of that string“. And if you’ve found that, return me only that arbitrary string so I can process it further. You guessed it: in plain English, that is the search we need to do to process WordPress permalinks. We then only have to use the extracted slug so we can actually redirect to a URL that contains only that slug part.

The tricky part was how to get that right. I have been notoriously bad with Regex syntax. Its machine readable form is something not everyone can understand, much less compose, easily. And I thought: Someone must have run into this problem before, so let's ask Aunt Google.

What I found was, not surprisingly, something that pertained to changing the permalink structure in WordPress from the default to something that was exactly what Ghost is using. And the people who offer such a conversion tool are the makers of the YOAST SEO plugin. It is called YOAST permalink helper and is an awesome web tool that outputs the redirect entries for both Apache and NGINX configuration files.

Equipped with that, I started by looking at another web tool called Regex101. This is another awesome, although not fully accessible, tool that can take Regex of four flavors, you also give it a search string, and it tells you not only what the Regex does, but also if it works on the string you gave it. So I tried it out and could even generate a JavaScript snippet that then translated my Regex into the flavor that JavaScript uses. Because, you know, Regex isn't complicated enough as it is, it also needs flavors for many languages and systems. And they sometimes bite each other, like flavors in food can also do.

The Ghost team has a great tutorial on permanent redirects, but as I found out, the Ghost implementation has a few catches that took me a while to figure out. For example, to search for a forward slash, you usually escape it with a backslash character. However, in Ghost, the very first forward slash in the „from“ value must not be escaped. All others, yes please. But if you actually try the JavaScript flavor out on the Regex101 page the tutorial recommends, it shows all forward slashes as needing to be escaped. Also, you'd better not end with a slash; let the regex stop at whatever character comes before that last forward slash Regex101 recommends.
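To make the difference concrete, here is how I understand it, using the pattern from my final redirects file below. Regex101's JavaScript flavor renders the pattern with every forward slash escaped and a slash at the very end, something like:

\/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/(.*)\/

The working Ghost „from“ value, on the other hand, leaves the very first slash bare and stops before the final one:

/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/(.*)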

The „To:“ value then also starts with a forward slash and can take one of the groups, in my case the 4th group, denoted by the $4 notation. I banged my head against these subtleties for a few hours, even went off on a completely different tangent for a while, only to discover that my initial approach was still my best bet.

Compared to that ordeal, redirecting the RSS /feed/ URL to the Ghost-style /rss/ was a piece of cake. Some RSS readers may struggle with the redirect, so if yours doesn't pick up new posts any more, please update your feed URL setting.

My final redirect JSON file looks like this. If you plan to migrate to Ghost from WordPress, and have a similar permalink structure, feel free to use it.

[{ "from": "/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/(.*)", "to": "/$4", "permanent": true }, { "from": "/feed/", "to": "/rss/", "permanent": true }]Tags and categories

There are some more things that only partially translate between WordPress and Ghost. For example, while tags carry over, categories don't. The only way to bring categories over is via a plugin that converts them to tags. It is mentioned in Ghost's tutorial, but as I was looking at it, I saw that it had last been updated 6 years ago, and the last tested WordPress version was equally old. And while I don't think this part of WordPress has changed much, at least from the looks of it, I didn't trust such an old plugin to mess with my data. Yes, I had a backup anyway, but still.

And then there was CodeMirror

So here I was, having imported all my stuff, and opening one of the posts for editing in the Ghost admin. And to my surprise, I could not edit it! I found that the post was put into an instance of the CodeMirror editor.

CodeMirror has, at least in all versions before the upcoming version 6, a huge accessibility issue that makes it impossible for screen readers to read any text. It uses a hidden textarea for input, and that is where the focus is at all times. But as you type, the text gets grabbed and removed, and put into a contentEditable mirror that never gets focus. It also does some more fancy stuff like code syntax highlighting and line numbers. Version 6 will address accessibility, but it is not production-ready yet.
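To illustrate why that breaks screen readers, here is a deliberately simplified sketch of the pattern as I understand it. This is not CodeMirror's actual code, just the general shape of the problem.

// Simplified sketch of the pattern described above, not CodeMirror's code.
const hidden = document.createElement('textarea');
hidden.style.position = 'absolute';
hidden.style.left = '-9999px'; // moved off-screen, but it keeps keyboard focus

const mirror = document.createElement('div');
mirror.contentEditable = 'true'; // the visible text lives here, yet it never gets focus

hidden.addEventListener('input', () => {
  mirror.textContent += hidden.value; // grab what was just typed...
  hidden.value = '';                  // ...and remove it from the focused element again
});

document.body.append(hidden, mirror);
hidden.focus();
// A screen reader that follows the focus only ever sees an empty textarea.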

But wait, Ghost was said to work with Markdown? And I had actually tested the regular editor with Markdown. That was a contentEditable that I could read fine with my screen reader. So why was this different?

The answer is simple: to make things as seamless as possible, the Ghost WordPress Exporter exports the full HTML of the post and imports it into Ghost as something called an HTML card. Cards are special blocks that allow for code or HTML formatting, and they are inserted as blocks into the regular content. And no, this is not actually a Gutenberg clone; it is more like special areas within a post. Only that with these imported posts, the whole post was one such special area.

Fortunately, if you need to work on such an older post after the import, there is a way to do it for most simple formatting. You edit the HTML card, and once focused in the text area, press CTRL+A to select all, CTRL+X to cut the whole contents, then escape out of that card once. Back in the regular contentEditable, paste the clipboard contents. For not too complicated formatting, this will simply put the HTML into your contentEditable, and you get headings, lists, links, etc. The one thing I found that doesn't translate is tables. This is probably because Markdown has such widely differing table implementations across its flavors.

If you need to insert HTML, write it in your favorite code editor first. Then, insert an HTML card, and paste the HTML there. I did so while updating my guide on how to use NVDA and Firefox to test your web pages for accessibility. Worked flawlessly. Also, the JSON code snippet above was input the same way.

But believe me, the moment I opened an older post and actually could not edit it was a scary one that almost made me give up on the whole Ghost effort. Thankfully, there was help. So here we are.

A special thank you

I would like to extend a special thank you to Dave, the Ghost Foundation's developer advocate, who took it upon himself early on to help me with the migration. He answered quite a number of very different questions I was having, sent me helpful links, and was a great help in understanding some of the quirks of the Ghost publishing screen. Some of that led to pull requests I sent in to fix those quirks. You know, I can't help it, I'm just that kind of accessibility guy. ;-)

And John O'Nolan, Ghost's founder, as well as others from the team, have been very helpful and welcoming, merging my pull requests literally from day 1.5 of my using Ghost, answering more questions, and offering to help.

In conclusion

This was a pleasurable experience through and through. Even the two hiccups I encountered were eventually dealt with or, in the case of the inaccessible CodeMirror bits, are things I can work around.

My blog has been running smoothly since May 29, and I hope to smooth out some of the kinks with the theme next, especially the color contrast and the fonts some people have given me feedback on. I will work with the maintainer of the Attila theme to sort these out.

Again, welcome to this new blogging chapter!
