
Spec text is mostly en-GB-x-hixie with sprinkles of en-US, which should it be? #654

Closed
foolip opened this issue Feb 8, 2016 · 31 comments

Comments

@foolip
Member

foolip commented Feb 8, 2016

No inconsistency says @Hixie, but this is low-priority, high-churn. Not a good first bug.

The spec text uses both British and American spelling, mostly British in obvious cases like "colour" and "normalise", but with exceptions. APIs use American spelling.

In #574 this was the cause of a typo in a cross reference.

The fix, if any, is to pick a side of the Atlantic and try to stay there.

https://en.wikipedia.org/wiki/Wikipedia:List_of_spelling_variants could be useful.

Disclaimer: prompted by w3c/html#69, about a small part of this issue in the fork

@domenic
Member

domenic commented Feb 8, 2016

I think staying with en-GB-x-hixie (except for APIs) seems fine.

@zcorpan
Member

zcorpan commented Feb 8, 2016

It might be nice to use the same spelling across specs so text can be moved around and not have to tweak the spelling when doing so.

But there's also churn for links, xrefs for other specs and tools, etc...

@sideshowbarker
Contributor

> I think staying with en-GB-x-hixie (except for APIs) seems fine.

I agree. IMHO there’s not really anything broken here that needs fixing.

About the specific case of commas before and after e.g., I’ve commented elsewhere in favor of always using commas but I think there’s also a reasonable argument that they’re really not necessary.

> It might be nice to use the same spelling across specs so text can be moved around and not have to tweak the spelling when doing so.

IMHO, the possible need to maybe have to tweak spelling when moving text among specs would be a very minor inconvenience relative to the more major work of (re)changing the language style of the HTML spec at this point.

As far as the HTML spec itself goes, I think internal consistency is more important than cross-spec consistency. And I don’t even mean absolute consistency among distinct types of content even within the spec itself; for example, if the body of the spec uses British spelling, but the APIs use American spelling, IMHO that’s not a real consistency problem that needs to be solved, as long as the APIs themselves follow the same regular style consistently. We still have a very easy rule to describe and follow consistently: use en-GB-x-hixie except for APIs.

In other words, I agree with Hixie’s “no inconsistency” maxim but I don’t think we actually have any inconsistency here that needs to be fixed.

@foolip
Member Author

foolip commented Feb 9, 2016

> if the body of the spec uses British spelling, but the APIs use American spelling, IMHO that’s not a real consistency problem that needs to be solved

I agree. The "but with exceptions" wasn't in reference to the APIs, but things like "For historical reasons, the element's value is normalised in three different ways for three different purposes. The raw value is the value as it was originally set. It is not normalized. "

> I don’t think we actually have any inconsistency here that needs to be fixed.

A search for "ized" finds a bunch of US spellings that have slipped through, as well as some things that should use that spelling, like the imported "parse a serialized Content Security Policy" algorithm.

We can fix typos where we notice them, of course. A lint step for catching them would be nice.

@sideshowbarker
Contributor

> The "but with exceptions" wasn't in reference to the APIs, but things like "For historical reasons, the element's value is normalised in three different ways for three different purposes. The raw value is the value as it was originally set. It is not normalized. "

Ah, OK—yeah, I misunderstood before and I agree now we should definitely fix cases like that.

> A search for "ized" finds a bunch of US spellings that have slipped through, as well as some things that should use that spelling, like the imported "parse a serialized Content Security Policy" algorithm.
>
> We can fix typos where we notice them, of course. A lint step for catching them would be nice.

Yeah, it seems like the ideal would be to create a lint step for the cases we want to fix, run the lint initially and fix the existing errors, and then keep the lint set up so that it runs as part of the build and any CI we eventually put together.

So then the main task would become constructing the lint to cover the cases we need to cover.
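
For illustration only, a lint of the kind described above might look roughly like the following Python sketch. It is not the lint used by any WHATWG build. The stem list, the names `GB_STEMS`, `GB_WORDS` and `lint_spelling`, and the choice to flag en-GB rather than en-US spellings are all assumptions made for the example; an explicit stem list is used so that ordinary words such as "otherwise" are not flagged.

```python
import re

# Hypothetical stem list built from spellings mentioned in this thread.
# A real lint would need a longer, reviewed list.
GB_STEMS = ["normalis", "serialis", "initialis", "tokenis", "recognis", "synchronis"]

# Match a listed stem plus a common ending, or the two "-our" words
# that come up in the discussion.
GB_WORDS = re.compile(
    r"\b(?:(?:" + "|".join(GB_STEMS) + r")(?:e|ed|es|er|ing|ation)"
    r"|colours?|behaviours?)\b",
    re.IGNORECASE,
)


def lint_spelling(text):
    """Return (line_number, word) pairs for each en-GB spelling found."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in GB_WORDS.finditer(line):
            hits.append((number, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = "The value is normalised.\nIt is not normalized.\nOtherwise the colour is kept."
    for number, word in lint_spelling(sample):
        print(f"line {number}: {word}")
```

Inverting the word list would give the opposite check (flagging en-US spellings), but as noted elsewhere in the thread that direction is harder because API names and imported terms legitimately use en-US.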

@annevk
Member

annevk commented Feb 9, 2016

I would definitely prefer en-us-x-hixie. This is a problem whenever we move text between standards and whenever we use terms from other standards. If we ever get a style guide off the ground for WHATWG, it's also an additional learning curve for folks not familiar with the distinction. Actually, that's true regardless of whether we get a style guide.

The main benefit of en-gb-x-hixie is that you can distinguish between "colour" the concept and "color" the API, but this is not a universal trait for all words so not that useful.

@foolip
Member Author

foolip commented Feb 9, 2016

I didn't realize that en-US-x-hixie had appeared in the spec, but it did before 5984a04. Going with US spelling would be great; dropping the -x-hixie suffix is not an ulterior motive.

If I had a magic wand (scripts to do this safely) I would prefer US spelling, because it is what I (try to) write and review all day long in other contexts.

@annevk
Member

annevk commented Feb 9, 2016

The main problem is fragment identifiers. And a lot of those broke when @Hixie switched from US to GB and although some were fixed I'm guessing fragment identifiers might need to work for both spelling variants.

@zcorpan
Member

zcorpan commented Feb 9, 2016

Maybe we can patch bikeshed to automagically support both spellings, to reduce breakage for other specs if we do switch back to US spelling. I suppose there are fewer specs that use Anolis these days, and maybe the right fix for Anolis specs is migrating to bikeshed anyway?

cc @tabatkins

@foolip
Member Author

foolip commented Feb 9, 2016

There's also link-fixup.js, we'll know everything that changes, so that could probably be patched to handle it with a few simple-ish rules.
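
As a rough illustration of what "a few simple-ish rules" for fragment identifiers could look like, here is a Python sketch. The real link-fixup.js is JavaScript and its contents are not shown in this thread, so everything below (the rule lists, `respell`, `resolve_fragment`) is hypothetical. The idea is to try the requested fragment first and then each spelling variant, accepting a variant only if that ID actually exists, so an over-eager rewrite such as "otherwise" to "otherwize" simply fails to resolve instead of causing harm.

```python
import re

# Only rewrite "is"/"iz" when it is followed by a common ending that closes
# a hyphen-separated word of the ID.
SUFFIX = r"(?=(?:e|ed|er|es|ing|ation)(?:-|$))"

# Hypothetical rule sets; a real list would need review.
GB_TO_US = [
    (re.compile("is" + SUFFIX), "iz"),
    (re.compile("colour"), "color"),
    (re.compile("behaviour"), "behavior"),
]
US_TO_GB = [
    (re.compile("iz" + SUFFIX), "is"),
    (re.compile("color"), "colour"),
    (re.compile("behavior"), "behaviour"),
]


def respell(fragment, rules):
    """Apply every substitution rule to the fragment, in order."""
    for pattern, replacement in rules:
        fragment = pattern.sub(replacement, fragment)
    return fragment


def resolve_fragment(fragment, known_ids):
    """Return an existing ID for the fragment, or None if no variant exists."""
    candidates = (
        fragment,
        respell(fragment, GB_TO_US),
        respell(fragment, US_TO_GB),
    )
    for candidate in candidates:
        if candidate in known_ids:
            return candidate
    return None
```

With `known_ids` containing `"rules-for-serialising-simple-colour-values"`, a request for `"rules-for-serializing-simple-color-values"` resolves to the existing en-GB ID. A fragment that mixes both spellings in one ID would not resolve with this sketch.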

@tabatkins
Contributor

I'm happy to do more linktext fixups in Bikeshed - I already have about a dozen things for various english variations.

Note that, for safety's sake, they generally only correct endings of words. If that limitation works, then just let me know what British-isms I should be correcting for.

@zcorpan
Member

zcorpan commented Feb 11, 2016

If you correct endings of each word in a term, I suppose that works...

Let's see... https://github.com/whatwg/xref/blob/master/xrefs/dom/html.json

 "ascii serialization of an origin": "ascii-serialisation-of-an-origin",
 "html fragment serialization algorithm": "html-fragment-serialisation-algorithm",
 "rules for serializing simple color values": "rules-for-serialising-simple-colour-values",

(Note two words differ here.)

 "unicode serialization of an origin": "unicode-serialisation-of-an-origin",
 "simple color": "simple-colour",

https://html.spec.whatwg.org/multipage/fragment-links.js

tokenizing
uninitialized
sanitization
initialize
serializer
behaviour
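
To make the "two words differ" point concrete, the xref entries quoted above can be compared word by word. This is a small Python sketch; `XREFS` copies the five entries from this comment, while `differing_words` is a hypothetical helper that assumes terms are space-separated and IDs are hyphen-separated with the same number of words (terms that themselves contain hyphens would need more care).

```python
# The five term -> ID pairs quoted from xrefs/dom/html.json above.
XREFS = {
    "ascii serialization of an origin": "ascii-serialisation-of-an-origin",
    "html fragment serialization algorithm": "html-fragment-serialisation-algorithm",
    "rules for serializing simple color values": "rules-for-serialising-simple-colour-values",
    "unicode serialization of an origin": "unicode-serialisation-of-an-origin",
    "simple color": "simple-colour",
}


def differing_words(term, fragment_id):
    """Pair up the words of a term and its ID; keep the pairs that differ."""
    term_words = term.split(" ")
    id_words = fragment_id.split("-")
    if len(term_words) != len(id_words):
        raise ValueError("term and ID have different word counts")
    return [(t, i) for t, i in zip(term_words, id_words) if t != i]


if __name__ == "__main__":
    for term, fragment_id in XREFS.items():
        print(term, "->", differing_words(term, fragment_id))
```

A fixup that only corrects the last word of a term would miss "serialising" in the third entry, which is why the rule has to be applied to each word.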

@foolip
Member Author

foolip commented Feb 19, 2016

Took a look at what changed in 5984a04 by comparing cat source | grep -oE '[A-Za-z]+' | tr A-Z a-z | sort | uniq -c before and after:

  • anonymize → anonymise
  • authorize → authorise
  • categorize(d) → categorise(d)
  • customized → customised
  • emphasize → emphasise
  • initialize(d) → initialise(d)
  • localized → localised
  • minimize → minimise
  • neutralized → neutralised
  • normalized → normalised
  • optimize(d) → optimise(d)
  • rasterized → rasterised
  • realize(d) → realise(d)
  • recognize(d) → recognise(d)
  • romanized → romanised
  • serialize(r) → serialise(r)
  • standardized → standardised
  • summarized → summarised
  • synchronized → synchronised
  • synthesize(r) → synthesise(r)
  • tokenize → tokenise
  • unoptimized → unoptimised

In other words, a simple pattern to search for. There's also colour that was changed earlier, and perhaps that's all, because a removed comment said <!--CLEANUP: for my sanity, switch back to british spellings (ise instead of ize, colour instead of color, etc) -->.
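
The before/after comparison described above can also be sketched in Python. This is an approximation of the shell pipeline, not the exact procedure used: `word_counts` mirrors `grep -oE '[A-Za-z]+' | tr A-Z a-z | sort | uniq -c`, and `vocabulary_changes` is a hypothetical helper that reports only words that vanish or appear entirely between two revisions, ignoring words whose counts merely change.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercased runs of ASCII letters in the text."""
    return Counter(word.lower() for word in re.findall(r"[A-Za-z]+", text))


def vocabulary_changes(before, after):
    """Return (disappeared, appeared) word lists between two revisions."""
    old, new = word_counts(before), word_counts(after)
    disappeared = sorted(word for word in old if word not in new)
    appeared = sorted(word for word in new if word not in old)
    return disappeared, appeared


if __name__ == "__main__":
    before = "The value is normalized. Tokenize it."
    after = "The value is normalised. Tokenise it."
    print(vocabulary_changes(before, after))
```

As the following comments point out, a diff like this only surfaces the changes that were made with a pattern in the first place, so it is a starting point rather than a complete inventory.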

@annevk
Member

annevk commented Feb 19, 2016

Cool, if you want to take this that'd be great I think. We just need to be careful with anything inside <dfn>.

@Hixie
Member

Hixie commented Feb 19, 2016

There's tons of stuff that would need to change. It took me weeks to catch all of them. The reason that particular diff looks like it's easily caught with a pattern is that it was the result of me trying to catch the easy ones with a pattern...

@foolip
Member Author

foolip commented Feb 20, 2016

@Hixie, other than ize→ise in all its variations and color→colour, did you make any other systematic changes?

@domenic
Member

domenic commented Feb 20, 2016

Wait, are we still talking about just helping Bikeshed, or have we moved back to talking about changing the spec? Because I'm still fairly opposed to the latter.

@foolip
Member Author

foolip commented Feb 20, 2016

Actually changing the spelling in the spec to be consistent one way or the other is the whole premise of this issue. It's a given that this shouldn't break anchors, and fiddling with Bikeshed would be one way of avoiding that, although I think something involving link-fixup.js sounds more robust.

@domenic
Member

domenic commented Feb 20, 2016

Well, I'd prefer the consistency that gives a smaller diff, i.e. en-gb-x-hixie.

@foolip
Member Author

foolip commented Feb 20, 2016

If the concern is mistakes, then if edited and reviewed manually the risk of error would be proportional to the size of the diff, but I wouldn't recommend that. Scripts to stay consistently en-US sound pretty simple, simpler than ensuring the correct use of en-US and en-GB depending on context. Knowing and verifying exactly how the output changes (IDs or anything else) is also not hard.

@annevk
Member

annevk commented Feb 20, 2016

@domenic what about my argument in #654 (comment)? I really don't see why we should impose this additional bit of learning on everyone forever.

@zcorpan
Member

zcorpan commented Mar 2, 2016

So 05e4a1f first used "color" and @domenic changed it to "colour", but I see now it also contains "customize". Maintaining GB spelling appears to be difficult. I think we should switch to US, and maybe have a linter that complains about GB-isms. Linting is possible with US spelling but not so much with GB spelling since GB will have to use a mix of GB and US since keywords are US and terms from other specs are US.

@foolip
Member Author

foolip commented Mar 2, 2016

I agree with that, and would be willing to edit and/or review to make this happen.

@annevk
Member

annevk commented Aug 30, 2016

I created a PR to get things going here. With everyone slipping up on en-GB, HTML currently using a mix, and all other standards using en-US, we should switch to en-US. This makes things easier for ourselves, and more importantly for new contributors.

I'm happy to do this over a period of time as we're already inconsistent. As a first step I'd like to add IDs to en-GB terms to make sure those don't get broken.

(Later on I'd like to start having discussions on editorial style and forge WHATWG-wide agreement on certain matters. The incentive there too is to make things easier for ourselves and new contributors.)

@foolip
Member Author

foolip commented Aug 30, 2016

That sounds great!

zcorpan pushed a commit that referenced this issue Aug 30, 2016
In preparation for switching to en-US without breaking links.
Part of #654.
zcorpan added a commit that referenced this issue Aug 30, 2016
@zcorpan
Member

zcorpan commented Aug 31, 2016

Specs using Anolis will not be affected by this change, as far as I can tell. xref already uses en-US spelling of HTML's terms, and the fragment identifiers are now stable.

Specs using Bikeshed will be affected because the spelling of some terms changes. Among WHATWG specs I see dom and url having <a lt="Unicode serialisation of an origin">. I can send PRs for that.

I didn't find anything that would break in w3c/csswg-drafts.

@zcorpan
Member

zcorpan commented Aug 31, 2016

@tabatkins do you want to change bikeshed to support both spelling variants before we merge? See whatwg/html-build#92 for the regexp, and note that it's not only the very end of terms that vary, e.g. "rules for serialising simple colour values".

@annevk
Member

annevk commented Aug 31, 2016

Prolly better to just update references. Churn is not great, but not too bad either. I suspect some WebAppSec work will be affected too.

@zcorpan
Member

zcorpan commented Aug 31, 2016

Yeah I think it's manageable to update the references.

@zcorpan zcorpan closed this as completed in 2f3c8cd Sep 1, 2016
zcorpan added a commit to whatwg/html-build that referenced this issue Sep 1, 2016
zcorpan added a commit to whatwg/url that referenced this issue Sep 2, 2016
annevk pushed a commit to whatwg/url that referenced this issue Sep 2, 2016
zcorpan added a commit to zcorpan/web-bluetooth that referenced this issue Sep 2, 2016
zcorpan added a commit to w3c/webappsec-csp that referenced this issue Sep 2, 2016
initialising -> initializing
serialisation -> serialization

Ref. whatwg/html#654

(Trailing whitespace was also stripped and possibly a newer version
of Bikeshed caused various changes in the generated index.html.)
@zcorpan
Member

zcorpan commented Sep 2, 2016

I've sent PRs for everything I could find, except whatwg/dom (which is blocked on plinss/widlparser#17 )

zcorpan added a commit to w3c/webappsec-csp that referenced this issue Sep 2, 2016
mikewest pushed a commit to w3c/webappsec-csp that referenced this issue Sep 2, 2016
jyasskin pushed a commit to WebBluetoothCG/web-bluetooth that referenced this issue Sep 2, 2016
alice pushed a commit to alice/html that referenced this issue Jan 8, 2019
In preparation for switching to en-US without breaking links.
Part of whatwg#654.
alice pushed a commit to alice/html that referenced this issue Jan 8, 2019
I used the regexp in
whatwg/html-build#92 to
search/replace the en-GB-spelled words, and
checked each occurrence manually. Some IDs still
use en-GB spelling to not break links, and some
examples use en-GB spelling.

Fixes whatwg#654
