Appendix B. Method Name Syntax
The purpose of this section is to define a set of recognised characters for use in Method Names. This is primarily to enable the growing number of software developers who make use of the Method Library to know which characters their applications should support.
The character sets below encompass all characters already used in the Method Library, plus many more.
The Central Council plans to make available a tool that will let users verify whether the characters they plan to use in a new Method Name are part of the recognised sets. The Central Council will also consider adding additional recognised characters on request. Requests can be emailed to firstname.lastname@example.org.
The Method Name comparison process below ensures Method Names remain clearly unique. For example, as there is already a Method named 'London No.3 Surprise Royal', the process would prevent a new Method being named 'London No 3 Surprise Royal' or 'London No. 3 Surprise Royal'. This is considered beneficial to reduce the likelihood of misidentification of Methods.
In the following, 'the Unicode standard' refers to version 10.0.0.
Various attributes of individual characters are given in the files comprising the Unicode Character Database (UCD, http://unicode.org/ucd/), and particularly the file https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, which contains both general category information and case folding information.
Unicode blocks are defined in https://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt.
Normalization is described in https://www.unicode.org/reports/tr15/tr15-45.html#Norm_Forms.
Method Names are a sequence of 1 to 120 characters selected from:
- All those enumerated in the Unicode standard as being in the Basic Latin block and having a category of Lu, Ll, or Nd (upper and lower case letters, and digits);
- All those enumerated in the Unicode standard as being in the Latin-1 Supplement block and having a category of Lu or Ll;
- All those enumerated in the Unicode standard as being in the Latin Extended-A block, except Latin Small Letter N Preceded By Apostrophe;
- All those enumerated in the Unicode standard as being in the Latin Extended-B block;
- The Unicode characters named: Space, Exclamation Mark, Quotation Mark, Ampersand, Apostrophe, Left Parenthesis, Right Parenthesis, Comma, Hyphen-Minus, Full Stop, Solidus, Equals Sign, Percent Sign, Question Mark, Pound Sign, Dollar Sign, Euro Sign and Trade Mark Sign;
- The Unicode characters named: Superscript Zero, Superscript One, Superscript Two, Superscript Three, Superscript Four, Superscript Five, Superscript Six, Superscript Seven, Superscript Eight, Superscript Nine, Subscript Zero, Subscript One, Subscript Two, Subscript Three, Subscript Four, Subscript Five, Subscript Six, Subscript Seven, Subscript Eight and Subscript Nine;
subject to the further constraints that a Method Name:
- Must contain at least one character of Unicode general category Lu, Ll, or Nd; and
- May neither begin nor end with a Space character, nor may it contain within it two consecutive Space characters.
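The rules above can be sketched as a validity check. This is a minimal illustration, not an official tool: the names `EXTRA_CHARS`, `is_recognised_char` and `is_valid_method_name` are hypothetical, the block boundaries are hard-coded as code point ranges, length is assumed to be counted in Unicode code points, and Python's `unicodedata` module uses whichever Unicode version ships with the interpreter, which may not be 10.0.0.

```python
import unicodedata

# Punctuation, symbols and super/subscript digits that are listed by name.
EXTRA_CHARS = set(" !\"&'(),-./=%?£$€™" "⁰¹²³⁴⁵⁶⁷⁸⁹" "₀₁₂₃₄₅₆₇₈₉")


def is_recognised_char(ch: str) -> bool:
    """Return True if ch belongs to one of the recognised character sets."""
    cp = ord(ch)
    cat = unicodedata.category(ch)
    if cp <= 0x7F and cat in ("Lu", "Ll", "Nd"):    # Basic Latin letters, digits
        return True
    if 0x80 <= cp <= 0xFF and cat in ("Lu", "Ll"):  # Latin-1 Supplement letters
        return True
    if 0x100 <= cp <= 0x17F:                        # Latin Extended-A
        return cp != 0x149  # Latin Small Letter N Preceded By Apostrophe
    if 0x180 <= cp <= 0x24F:                        # Latin Extended-B
        return True
    return ch in EXTRA_CHARS


def is_valid_method_name(name: str) -> bool:
    """Check length, character set, and the two further constraints."""
    if not 1 <= len(name) <= 120:  # assumption: length in code points
        return False
    if not all(is_recognised_char(ch) for ch in name):
        return False
    if not any(unicodedata.category(ch) in ("Lu", "Ll", "Nd") for ch in name):
        return False
    if name[0] == " " or name[-1] == " " or "  " in name:
        return False
    return True
```

For example, `is_valid_method_name("London No.3 Surprise Royal")` is true, while a name consisting only of punctuation, or one with a leading Space, is rejected.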
Two Method Names are considered the same if they would be reduced to the same sequence of characters by the following process:
- The sequence of characters is converted to Unicode Normalization Form KD (NFKD, Normalization Form Compatibility Decomposition);
- All characters now appearing in the sequence that are not allowed in a Method Name, as per the list of recognised characters above, are removed;
- All characters for which the UCD defines a case folding are converted to that folded character (upper case);
- The following conversions are made: ‘Ø’ to ‘O’, ‘Æ’ to the two character sequence ‘AE’, and ‘Œ’ to the two character sequence ‘OE’;
- Each character for which the Unicode general category is not Lu or Nd is replaced by the Space character; and
- Any Spaces now at the beginning or end of the sequence are removed, and any internal runs of two or more Space characters are replaced by a single Space character.
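The comparison process can be sketched as a function that reduces a Method Name to a comparison key, so that two names are the same when their keys are equal. This is an illustration under stated assumptions, not a reference implementation: `comparison_key` and its helpers are hypothetical names, the case conversion uses Python's per-character `str.upper()` (ignoring any result longer than one character) as an approximation of the UCD mapping, and `unicodedata` reflects the interpreter's bundled Unicode version rather than necessarily 10.0.0.

```python
import unicodedata

# Punctuation, symbols and super/subscript digits that are listed by name.
EXTRA_CHARS = set(" !\"&'(),-./=%?£$€™" "⁰¹²³⁴⁵⁶⁷⁸⁹" "₀₁₂₃₄₅₆₇₈₉")


def is_recognised_char(ch: str) -> bool:
    """Return True if ch belongs to one of the recognised character sets."""
    cp = ord(ch)
    cat = unicodedata.category(ch)
    if cp <= 0x7F and cat in ("Lu", "Ll", "Nd"):
        return True
    if 0x80 <= cp <= 0xFF and cat in ("Lu", "Ll"):
        return True
    if 0x100 <= cp <= 0x17F:
        return cp != 0x149
    if 0x180 <= cp <= 0x24F:
        return True
    return ch in EXTRA_CHARS


def to_upper(ch: str) -> str:
    """Single-character upper-casing (assumed stand-in for the UCD mapping)."""
    upper = ch.upper()
    return upper if len(upper) == 1 else ch


def comparison_key(name: str) -> str:
    # Step 1: convert to Normalization Form KD.
    chars = unicodedata.normalize("NFKD", name)
    # Step 2: remove characters that are not allowed in a Method Name.
    chars = "".join(ch for ch in chars if is_recognised_char(ch))
    # Step 3: convert to upper case.
    chars = "".join(to_upper(ch) for ch in chars)
    # Step 4: explicit letter conversions.
    chars = chars.replace("Ø", "O").replace("Æ", "AE").replace("Œ", "OE")
    # Step 5: everything that is not Lu or Nd becomes a Space.
    chars = "".join(
        ch if unicodedata.category(ch) in ("Lu", "Nd") else " " for ch in chars
    )
    # Step 6: trim, and collapse runs of Spaces to a single Space.
    return " ".join(chars.split())
```

With this sketch, 'London No.3 Surprise Royal', 'London No 3 Surprise Royal' and 'London No. 3 Surprise Royal' all reduce to the same key, `LONDON NO 3 SURPRISE ROYAL`.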
The following notes explain some of the choices made above:
- The exclusion of Latin Small Letter N Preceded By Apostrophe is because that character is now deprecated in Unicode;
- The normalization to NFKD followed by deletion of inappropriate characters eliminates diacritics, brings the superscript and subscript numerals to the baseline, and replaces ‘™’ with the two character sequence ‘TM’;
- Punctuation and symbols are ignored for Method Name comparisons. Thus ‘London No.3’ is the same as ‘London No 3’. Less obviously, ‘E=mc²’ is the same as ‘e & (MC)₂’. Given how rare, and potentially troublesome, punctuation is in Method Names, this seems a small price to pay, as in practice it just prevents otherwise likely pathological Method Names from being used.
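The effect of NFKD described in these notes can be observed directly. The short sketch below uses hypothetical helper names and, as a simplification, drops characters of general category Mn (combining marks) rather than applying the full list of recognised characters; results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def nfkd(text: str) -> str:
    """Return text in Normalization Form KD."""
    return unicodedata.normalize("NFKD", text)


def drop_combining_marks(text: str) -> str:
    """Apply NFKD, then remove the combining marks it separates out."""
    return "".join(ch for ch in nfkd(text) if unicodedata.category(ch) != "Mn")


for sample in ("™", "E=mc²", "H₂O", "Zürich"):
    print(repr(sample), "->", repr(drop_combining_marks(sample)))
# '™' -> 'TM'
# 'E=mc²' -> 'E=mc2'
# 'H₂O' -> 'H2O'
# 'Zürich' -> 'Zurich'
```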
Additional background on Method Names and Method Name syntax is available in an article written by Don Morrison.