CSS Custom Text transformations

A Collection of Interesting Ideas

This version:
https://specs.rivoal.net/css-custom-tt/
Issue Tracking:
GitLab
Inline In Spec
Editor:
Florian Rivoal

Abstract

This specification defines a mechanism for authors to create custom values for the CSS text-transform property.

1. Introduction

This section is non-normative.

1.1. Motivating scenarios

2. Module Interactions

3. Defining Custom Text transformations: the @text-transform rule

The general form of the @text-transform rule is:

@text-transform <transform-name> {
  <declaration-list>
}
<transform-name> = <custom-ident>

The @text-transform rule accepts the following descriptors: transform and position. The transform descriptor is required, and the @text-transform rule has no effect if it is omitted. Other descriptors may be omitted, and evaluate to their initial value in that case. When a given descriptor occurs multiple times in a given @text-transform rule, only the last one is used; all prior instances of that descriptor within that rule must be ignored.
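
As a rough illustration of the descriptor handling described above, the following Python sketch resolves the declarations of a single @text-transform rule. The function name resolve_descriptors and the (name, value) pair representation are hypothetical, and silently ignoring unknown descriptors is an assumption borrowed from usual CSS error handling rather than something stated here.

```python
from typing import Optional

# Descriptors accepted by @text-transform, per the paragraph above.
KNOWN_DESCRIPTORS = {"transform", "position"}
# Only position has an initial value; transform is required.
INITIAL_VALUES = {"position": "all"}


def resolve_descriptors(declarations: list[tuple[str, str]]) -> Optional[dict[str, str]]:
    """Return the effective descriptors of one rule, or None if the rule has no effect."""
    resolved = dict(INITIAL_VALUES)
    for name, value in declarations:
        if name in KNOWN_DESCRIPTORS:
            # A later occurrence replaces any earlier one in the same rule.
            resolved[name] = value
        # Assumption: unknown descriptors are skipped.
    if "transform" not in resolved:
        # The required descriptor is missing, so the rule has no effect.
        return None
    return resolved
```

For instance, a rule that declares transform twice keeps only the second value, and a rule that declares only position yields None.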

These descriptors apply solely within the context of the enclosing @text-transform rule, and do not apply to document elements.

Issue: Define what happens when multiple custom text transformations by the same name are declared.

The text-transform property is extended by this specification to accept custom text transformations, referred to by their <transform-name>, as values:

Name: text-transform
New values: none | capitalize | uppercase | lowercase | full-width | <transform-name>

With the exception of CSS-wide keywords which always have their global meaning, if <transform-name> conflicts with an existing value of the text-transform property, the conflict is resolved in favor of the custom definition introduced using @text-transform.

Note: This enables the addition by later specifications of new keywords for the text-transform property without changing the behavior of any existing page already defining and using a custom text transformation by the same name.
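
The precedence rule above can be sketched in Python as follows. This is only a model under stated assumptions: the function resolve_value and its return tuples are hypothetical, the set of CSS-wide keywords is taken from CSS Values and Units rather than from this document, and keywords are matched case-insensitively while custom names are matched exactly.

```python
# CSS-wide keywords always keep their global meaning (assumed list).
CSS_WIDE_KEYWORDS = {"initial", "inherit", "unset", "revert", "revert-layer"}
# Predefined text-transform keywords listed in the grammar above.
BUILT_IN = {"none", "capitalize", "uppercase", "lowercase", "full-width"}


def resolve_value(value: str, custom_names: set[str]) -> tuple[str, str]:
    """Classify a text-transform value given the names declared with @text-transform."""
    if value.lower() in CSS_WIDE_KEYWORDS:
        return ("css-wide", value.lower())
    if value in custom_names:
        # A custom definition wins over a predefined keyword of the same name.
        return ("custom", value)
    if value.lower() in BUILT_IN:
        return ("built-in", value.lower())
    return ("invalid", value)
```

With this model, a page that defines its own uppercase transformation keeps using it, while inherit stays a CSS-wide keyword even if a rule of that name exists.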

3.1. Defining the transformation: transform descriptor

Name: transform
For: @text-transform
Value: <conversion>#
Initial: N/A
<conversion> = [<char-list> to <char-list>] | <'text-transform'>
<char-list> = <enumeration> | <range>
<range> = <urange> | <string>
<enumeration> = <string>

This descriptor defines which character will be replaced by which, by listing a series of conversions, to be applied in the same order as they appear in the descriptor.

Conversions may refer to existing text transformations, either predefined by CSS or defined by the author.

Note: While a transformation using only a single such conversion is not very useful, combining it with other conversions allows authors to extend or define variants of existing transformations.

Referring to the text-transform currently being defined is not allowed, and makes the whole descriptor invalid.

Conversions may also define a new mapping from one <char-list> to another. When defined using a <urange>, the <char-list> is an ordered list of each individual Unicode code point designated by the <urange>.

A <range> may also be defined as a string made of a single Unicode character, followed by a hyphen (U+002D), followed by another single Unicode character. The semantics are identical to the <urange> U+XXXXXX-YYYYYY where XXXXXX is the code point of the first character and YYYYYY the code point of the second character. Any other string is interpreted as an <enumeration>.

Note: This notation is included despite being redundant with the <urange> notation, as it is much more readable in many situations.
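
To make the string form of <char-list> concrete, here is a minimal Python sketch of the expansion. The function name expand_char_list is hypothetical, the sketch works on the string after CSS escapes have been processed (so it cannot tell an escaped hyphen from a plain one), and it splits enumerations into code points rather than extended combining character sequences, which is a simplification.

```python
def expand_char_list(text: str) -> list[str]:
    """Expand a <char-list> string into an ordered list of characters."""
    chars = list(text)  # Python strings iterate by code point
    if len(chars) == 3 and chars[1] == "-":
        # Range form: one character, a hyphen (U+002D), another character.
        first, last = ord(chars[0]), ord(chars[2])
        # Same meaning as the <urange> running from first to last, inclusive.
        return [chr(cp) for cp in range(first, last + 1)]
    # Any other string is an enumeration (simplified to code points here).
    return chars
```

Under these assumptions, "a-e" expands to the five letters a through e, while "abc" and "a-" are enumerations of their own characters.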

Issue: Since these operate on single code points, which normalization (if any) is used matters. Should we have a way to switch how they apply? Should we have built-in nfc, nfd, nfkc and nfkd transforms?

If defined by an <enumeration>, the <char-list> is an ordered list of each extended combining character sequence in the string. The same character may not appear twice in the <char-list> defining the source of the mapping; otherwise the whole descriptor is invalid.

Note: In addition to the usual CSS rules of character escaping, a hyphen (U+002D) needs to be escaped to appear in the second position of a 3-character-long <enumeration>. The string would otherwise be interpreted as a <range>.

Issue: This ability to have both enumeration strings and range strings is nice for authors, but is this something implementors are willing to live with, or does it make parsing too cumbersome? Wrapping the strings in functional notation would solve the problem, at the expense of making the syntax much more verbose and less readable.

In a <conversion>, if the source <char-list> is longer than the target <char-list>, then the last item of the target list is used for all remaining items in the source list.

Note: This allows shortcuts like transform: "aàáå" to "a".
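
The pairing of a source list with a shorter target list can be sketched in Python as below. The function build_mapping is hypothetical. Rejecting an empty target and ignoring surplus target items are assumptions made for this sketch only, since neither case is settled by the text above.

```python
def build_mapping(source: list[str], target: list[str]) -> dict[str, str]:
    """Pair each source item with a target item, reusing the last target item when the target is shorter."""
    if len(set(source)) != len(source):
        # The same character twice in the source makes the descriptor invalid.
        raise ValueError("duplicate character in source <char-list>")
    if not target:
        # Assumption: an empty target is rejected in this sketch.
        raise ValueError("empty target <char-list>")
    mapping = {}
    for index, char in enumerate(source):
        # Past the end of the target, keep using its last item.
        mapping[char] = target[min(index, len(target) - 1)]
    return mapping


# The shortcut from the note: "a", "à", "á" and "å" all map to "a".
strip_accents = build_mapping(list("a\u00e0\u00e1\u00e5"), ["a"])
```

When both lists have the same length, the same function simply pairs items position by position.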

Issue: We probably need a switch of some kind to be able to operate only on base characters. Maybe also only on combining marks, or sets of combining marks. At least for base characters only, there are clear use cases.

Issue: Should we allow spaces and other collapsible characters in the target? Since text-transform is applied after white space collapsing, what are the implications of generating runs of collapsible white space that won’t be collapsed? It has been proposed that we should allow them, and trigger a second white space collapsing if they are actually used.

Issue: Should we allow an empty <char-list> as the target? It has been suggested that this be used to delete text. I am not sure I like the idea that text-transform could be able to make some non-empty element empty.

Issue: It has been suggested that it should be possible to write text-transforms that behave differently on different languages. This can probably be achieved by adding some optional part at the beginning of each <conversion>, although I am not sure what the syntax should be.

@text-transform latin-only-uppercase {
  transform: "a-z" to "A-Z";
}

The following two transformations are identical.

@text-transform abcdef1 {
  transform: "abc" to "def";
}
@text-transform abcdef2 {
  transform: "a" to "d",
             "b" to "e",
             "c" to "f";
}

3.2. What the transformation applies to: the position descriptor

Name: position
For: @text-transform
Value: all | [ initial || medial || final ]
Initial: all

This descriptor makes it possible to restrict which characters in the source text are affected by the transform.

The definition of a word is UA-dependent; [UAX29] is suggested (but not required) for determining word boundaries.

The transform descriptor may be used to refer to existing text-transforms in the definition of a new one. If a text-transform referred to has a different position than the one specified in the text-transform that refers to it, it applies at the intersection of the two positions.

@text-transform latin-only-uppercase {
  transform: "a-z" to "A-Z";
}
@text-transform latin-only-capitalize {
  transform: latin-only-uppercase;
  position: initial;
}
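
The intersection rule for positions can be modelled with plain sets, as in the Python sketch below. The helper names parse_position and effective_position are hypothetical, and the sketch treats initial, medial and final only as labels without defining which characters they select.

```python
# "all" is modelled as the set of every position keyword.
ALL_POSITIONS = frozenset({"initial", "medial", "final"})


def parse_position(value: str) -> frozenset[str]:
    """Turn a position descriptor value into a set of position keywords."""
    keywords = value.split()
    if keywords == ["all"]:
        return ALL_POSITIONS
    return frozenset(keywords)


def effective_position(referring: str, referred: str) -> frozenset[str]:
    """Positions at which a referred-to transform applies inside the referring one."""
    return parse_position(referring) & parse_position(referred)
```

In the example above, latin-only-uppercase has the initial value all and latin-only-capitalize specifies initial, so the referenced conversion applies at initial positions only.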

4. DOM interaction

Custom text transformation values defined within @text-transform rules are accessible via the following modifications to the CSS Object Model.

partial interface CSSRule {
  const unsigned short TEXT_TRANSFORM_RULE = 1000;
};
interface CSSTextTransformRule : CSSRule {
  attribute          DOMString   name;
  readonly attribute CSSStyleDeclaration  style;
};

Issue: This is a lousy OM. Do better.

Appendix A. Use cases and Examples

This appendix is non-normative.

Single-Language uses

The following use cases only apply to a single language. Defining all the possibly useful text-transforms for all languages would go beyond the capacity and expertise of the CSS WG. Having the generic mechanism allows authors to solve their specific problem.

Full-size kana

In Japanese, small kana characters appearing within ruby are sometimes replaced by the equivalent full-size kana for legibility purposes, even though the semantically correct characters are used in the markup. The following transformation defines this conversion.

@text-transform full-size-kana {
  transform: "ぁぃぅぇぉゕゖっゃゅょゎ" to "あいうえおかけつやゆよわ",
             "ァィゥェォヵㇰヶㇱㇲッㇳㇴㇵㇶㇷㇸㇹㇺャュョㇻㇼㇽㇾㇿヮ" to "アイウエオカクケシスツトヌハヒフヘホムヤユヨラリルレロワ",
             "ｧｨｩｪｫｯｬｭｮ" to "ｱｲｳｴｵﾂﾔﾕﾖ";
}

German ß

As discussed in [this email thread](http://lists.w3.org/Archives/Public/www-style/2011Nov/0193.html), ß (aka &szlig; or U+00DF) is traditionally considered a lowercase letter without an uppercase equivalent. text-transform: uppercase leaves it unchanged. Unicode has included ẞ (U+1E9E), an uppercase version of it, since version 5.1, but without making it the target of toupper().

This letter being rather new, authors are bound to disagree whether it is a proper uppercase variant of U+00DF or not. Those who think it is not may use text-transform: uppercase; and text-transform: lowercase. Those who think it is could use the following.

@text-transform german-uppercase {
  transform: U+00DF to U+1E9E, uppercase;
}

@text-transform german-lowercase {
  transform: U+1E9E to U+00DF, lowercase;
}

Issue: It has been suggested that overloading existing values with a language descriptor or selector would be better:

@text-transform uppercase {
  transform: U+00DF to U+1E9E;
  language: de;
}

@text-transform uppercase:lang(de) {
  transform: U+00DF to U+1E9E;
}

Turkish i/ı

In Turkish and a few related languages, dotted and dotless i are distinct letters, both in upper and lower case.

The uppercasing and lowercasing algorithms defined for the text-transform property only preserve this distinction when the content language of the element is known.

Someone may want to apply an uppercase or lowercase transformation to a document where language is insufficiently marked up, but known to the author of the style sheet to be Turkish. In this case, the generic uppercase and lowercase transformations would fail, but the following would work.

@text-transform turkic-uppercase {
  transform: "i" to "İ", uppercase;
}

@text-transform turkic-lowercase {
  transform: "I" to "ı", lowercase;
}

Georgian upper/lower case

The Georgian language has used three different unicameral alphabets through history: Asomtavruli, Nuskhuri, and Mkhedruli. Some authors have been using Asomtavruli letters in an otherwise Mkhedruli text, in a way that resembles a bicameral alphabet. The following transformations would be useful to authors wishing to apply this effect to—or to remove it from—Georgian text.

@text-transform Mkhedruli-to-Asomtavruli {
  transform: "ა-ჵ" to "Ⴀ-Ⴥ";
}

@text-transform Asomtavruli-to-Mkhedruli {
  transform: "Ⴀ-Ⴥ" to "ა-ჵ";
}

@text-transform georgian-uppercase {
  transform: Mkhedruli-to-Asomtavruli;
}
@text-transform georgian-capitalize {
  transform: Mkhedruli-to-Asomtavruli;
  position: initial;
}
@text-transform georgian-lowercase {
  transform: Asomtavruli-to-Mkhedruli;
}

Cross-language uses

The following are examples of use cases that are relevant to several languages, but rare enough that they are better addressed by authors when needed than by the CSS WG.

Long s

In old (18th century and earlier) European texts, the letter s, when at the middle or beginning of a word, was written ſ (U+017F). An s occurring at the end of a word was written as the modern s is.

Modern readers are often unfamiliar with this letter form, and for readability reasons, one may want to convert from one to the other. The following transformation would accomplish this.

@text-transform modernize-s {
  transform: "ſ" to "s";
}

This does the opposite transform:

@text-transform long-s {
  transform: "s" to "ſ";
  position: initial medial;
}
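
As an informal check of what position: initial medial implies for long-s, the Python sketch below converts every s except one in word-final position. It is only an approximation: the regular expression \w+ stands in for the UA-dependent word segmentation, the function name long_s is hypothetical, and treating a one-letter word as final is an arbitrary choice made for this sketch.

```python
import re


def long_s(text: str) -> str:
    """Replace s with ſ (U+017F) at initial and medial positions of each word."""

    def convert(match: re.Match) -> str:
        word = match.group(0)
        # Everything but the last character is initial or medial.
        # Assumption: the last character of a word is its only final position.
        return word[:-1].replace("s", "\u017f") + word[-1]

    # Assumption: \w+ approximates word boundaries for this illustration.
    return re.sub(r"\w+", convert, text)
```

Applied to "Congress assembles", this yields "Congreſs aſſembles": the word-final s is preserved in both words.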

Miscellaneous

Here are some more examples of how the generic mechanism may be used.

Comic book Vikings

In the “Asterix and the Great Crossing” comic book, the Viking characters are supposed to speak a foreign language unintelligible to the main characters, but still understandable to the readers. This is represented by writing down their speech normally, except that some letters are replaced by similar-looking letters found in Scandinavian languages.

This effect could be obtained by the following transform:

@text-transform fake-norse {
  transform: "aoAO" to "åøÅØ";
}

L33t speak

In Internet, hacker, and gamer culture, it is common to replace characters with other characters or character sequences that have a somewhat similar appearance. Although there is no single agreed convention, and the mappings are sometimes neither injective nor surjective, one could simulate this playful style with a transformation like the following:

@text-transform leet-speak {
  transform: "A-Z" to "48©)3F6H1!K£MN0¶9®57UVW*¥2";
}

Rot13

@text-transform rot13 {
  transform: "A-M" to "N-Z", "N-Z" to "A-M",
             "a-m" to "n-z", "n-z" to "a-m";
}
.spoiler:hover {
    text-transform: rot13;
}
<p class="spoiler">Qnegu Inqbe vf Yhxr’f sngure.</p>
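
The rot13 rule above can be tried out with the following Python sketch. It rests on one loudly stated assumption: each character is handled by the first conversion whose source list contains it. The text of this specification only says that conversions are applied in order; if each conversion were instead applied to the output of the previous one, the second conversion would map the results of the first straight back to A-M. The names char_range, CONVERSIONS and rot13 are hypothetical.

```python
def char_range(first: str, last: str) -> list[str]:
    """Expand a range such as "A-M" into its ordered list of characters."""
    return [chr(cp) for cp in range(ord(first), ord(last) + 1)]


# The four conversions of the rot13 rule, in declaration order.
CONVERSIONS = [
    (char_range("A", "M"), char_range("N", "Z")),
    (char_range("N", "Z"), char_range("A", "M")),
    (char_range("a", "m"), char_range("n", "z")),
    (char_range("n", "z"), char_range("a", "m")),
]


def rot13(text: str) -> str:
    """Apply the conversions, assuming the first matching conversion wins per character."""
    out = []
    for char in text:
        for source, target in CONVERSIONS:
            if char in source:
                char = target[source.index(char)]
                break  # Assumption: later conversions do not see this character.
        out.append(char)  # Characters matched by no conversion pass through.
    return "".join(out)
```

Under this reading the transformation is its own inverse, so applying it on hover to already-encoded text restores the readable form, and characters outside A-Z and a-z (spaces, punctuation) are left untouched.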

Appendix B. Security and Privacy Considerations

This appendix is non-normative.

There are no known security or privacy impacts of this feature.

The W3C TAG is developing a Self-Review Questionnaire: Security and Privacy for editors of specifications to informatively answer. As far as currently known, here are the answers to the Questions to Consider:

Does this specification deal with personally-identifiable information?
No
Does this specification deal with high-value data?
No
Does this specification introduce new state for an origin that persists across browsing sessions?
No
Does this specification expose any other data to an origin that it doesn’t currently have access to?
No
Does this specification enable new script execution/loading mechanisms?
No
Does this specification allow an origin access to a user’s location?
No
Does this specification allow an origin access to sensors on a user’s device?
No
Does this specification allow an origin access to aspects of a user’s local computing environment?
No
Does this specification allow an origin access to other devices?
No
Does this specification allow an origin some measure of control over a user agent’s native UI?
No. However, it does allow changes to the text displayed in some “native-looking” form controls and replaced elements within the page. As these are under author control anyway, this is not considered a risk.
Does this specification expose temporary identifiers to the web?
No
Does this specification distinguish between behavior in first-party and third-party contexts?
No
How should this specification work in the context of a user agent’s "incognito" mode?
No difference in behavior is expected or needed.
Does this specification persist data to a user’s local device?
No
Does this specification have a "Security Considerations" and "Privacy Considerations" section?
Yes, this is the role of this Appendix.
Does this specification allow downgrading default security characteristics?
No

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[CSS-SYNTAX-3]
Tab Atkins Jr.; Simon Sapin. CSS Syntax Module Level 3. 16 July 2019. CR. URL: https://www.w3.org/TR/css-syntax-3/
[CSS-TEXT-3]
Elika Etemad; Koji Ishii; Florian Rivoal. CSS Text Module Level 3. 22 April 2021. CR. URL: https://www.w3.org/TR/css-text-3/
[CSS-VALUES-4]
Tab Atkins Jr.; Elika Etemad. CSS Values and Units Module Level 4. 16 December 2021. WD. URL: https://www.w3.org/TR/css-values-4/
[CSSOM-1]
Daniel Glazman; Emilio Cobos Álvarez. CSS Object Model (CSSOM). 26 August 2021. WD. URL: https://www.w3.org/TR/cssom-1/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119
[WEBIDL]
Edgar Chen; Timothy Gu. Web IDL Standard. Living Standard. URL: https://webidl.spec.whatwg.org/

Informative References

[CSS-RUBY-1]
Elika Etemad; et al. CSS Ruby Annotation Layout Module Level 1. 2 December 2021. WD. URL: https://www.w3.org/TR/css-ruby-1/
[UAX29]
Mark Davis; Christopher Chapman. Unicode Text Segmentation. 24 August 2021. Unicode Standard Annex #29. URL: https://www.unicode.org/reports/tr29/tr29-39.html

Property Index

No properties defined.

@text-transform Descriptors

Name       Value                                  Initial
position   all | [ initial || medial || final ]   all
transform  <conversion>#                          N/A

IDL Index

partial interface CSSRule {
  const unsigned short TEXT_TRANSFORM_RULE = 1000;
};
interface CSSTextTransformRule : CSSRule {
  attribute          DOMString   name;
  readonly attribute CSSStyleDeclaration  style;
};

Issues Index

define what happens when multiple Custom Text transformations by the same name are declared.
Since these operate on single code points, which normalization (if any) is used matters. Should we have a way to switch how they apply? Should we have built-in nfc, nfd, nfkc and nfkd transforms?
This ability to have both enumeration strings and range strings is nice for authors, but is this something implementors are willing to live with, or does it make parsing too cumbersome? Wrapping the strings in functional notation would solve the problem, at the expense of making the syntax much more verbose and less readable.
We probably need a switch of some kind to be able to operate only on base characters. Maybe also only on combining marks, or sets of combining marks. At least for base characters only, there are clear use cases.
Should we allow spaces and other collapsible characters in the target? Since text-transform is applied after white space collapsing, what are the implications of generating runs of collapsible white space that won’t be collapsed? It has been proposed that we should allow them, and trigger a second white space collapsing if they are actually used.
Should we allow an empty <char-list> as the target? It has been suggested that this be used to delete text. I am not sure I like the idea that text-transform could be able to make some non-empty element empty.
It has been suggested that it should be possible to write text-transforms that behave differently on different languages. This can probably be achieved by adding some optional part at the beginning of each <conversion>, although I am not sure what the syntax should be.
This is a lousy OM. Do better.
It has been suggested that overloading existing values with a language descriptor or selector would be better:
@text-transform uppercase {
  transform: U+00DF to U+1E9E;
  language: de;
}
@text-transform uppercase:lang(de) {
  transform: U+00DF to U+1E9E;
}