TIP:            249
Title:          Unification of Tcl's Parsing of Numbers
Version:        $Revision: 1.9 $
Author:         Kevin B. Kenny
Author:         David S. Cargo
Author:         Don Porter
State:          Draft
Type:           Informative
Vote:           No voting
Created:        13-Jun-2005
Post-History:

~ Abstract

This TIP proposes to unify the recognition of all of Tcl's "numeric" objects into a single parser. The intended effect is to improve performance by eliminating a number of cases where a cached numeric representation may be discarded, and to restore (more accurately, to establish) the "everything is a string" principle in dealing with numbers.

~ Rationale

Tcl's handling of numbers has always been problematic and ambiguous. Even in the earliest releases of the '''expr''' command, there were issues with the unexpected demotion of floating point numbers to integers, causing subsequent divisions to be interpreted as integer division with incorrect results.

Another trouble spot has been the interpretation of constants with leading zeroes. When these are interpreted as integers, they are octal numbers. They can also be interpreted as floating point constants (at least with ''Tcl_GetDoubleFromObj''), in which case they are decimal. Because of this ambiguity, the '''expr''' system cannot make effective use of the internal representation of a floating point number; it needs to refer back to the string to make sure that the number is not an octal integer to which ''Tcl_GetDoubleFromObj'' has been applied.

Even more confusing is the treatment of numbers that have leading zeroes but contain the digits 8 or 9. These are rejected by the '''expr''' parser as invalid octal but are accepted by ''Tcl_GetDoubleFromObj''.

~ Proposal

This TIP proposes a strict "everything is a string" interpretation for strings as numeric values. The set of strings that can be interpreted as numbers shall be partitioned into disjoint subsets, with a single "canonical" representation for each.

This change will imply that a few C calls will break compatibility.
In particular, ''Tcl_GetDoubleFromObj'' may leave an integer internal representation in the object, despite the documentation's assertion that the object will shimmer. Similarly, ''Tcl_GetDoubleFromObj'' will no longer interpret octal integers as decimal; this feature causes only surprise and consternation.

The ''Tcl_ConvertToType'' call will also no longer force conversion to a specific numeric type. Since it does not do so, it is not reasonable for extensions to use it on the numeric types. For this reason, the numeric types ''shall not be registered;'' ''Tcl_GetObjType'' will fail when presented with one of their names.

When one of the conversion procedures ''Tcl_GetIntFromObj'', ''Tcl_GetWideIntFromObj'', ''Tcl_GetBignumFromObj'' (assuming the eventual approval of [237]), or ''Tcl_GetDoubleFromObj'' is called, it will cast any pre-existing numeric internal representation that it finds to the appropriate return type (throwing an error if the number is too large to represent, or a double is used in an integer context). If the procedure finds no pre-existing numeric internal representation, it will extract the string representation, determine its canonical representation as a number, and store that.

The easiest way to visualize the specific sets of strings that are recognized as numbers is with a diagram of the state machine that implements them.

#image:249statemachine State machine that recognizes numbers.

In the diagram, "Start" represents the start state of the machine. The leading and trailing whitespace that is allowed for all numbers is not diagrammed, for clarity. Intermediate states of the machine are represented by small ovals. Large rectangles represent final states, and are labeled with the type of number that will result. Note that any number can optionally begin with a '+' or '-' character, which will not be mentioned further. Each of the accepting states, however, merits further discussion.

 1.
    The string "0" shall always represent an integer of the smallest type available ('''tclIntType'''). It shall never represent a floating point value.

 2. A leading zero followed by a string of octal digits shall be interpreted as an octal integer. The integer shall be stored in the smallest of '''tclIntType''', '''tclWideIntType''' and '''tclBignumType''' that will hold it. (Note that storing '''tclBignumType''' is possible without accepting [237], provided that the ''Tcl_Get*FromObj'' routines recognize it and convert its value as needed.) The interpretation as an octal integer shall hold even if the string is presented to ''Tcl_GetDoubleFromObj'', which today interprets it as decimal.

 3. A leading zero, followed by the letter 'X' (case insensitive) and a string of hexadecimal digits shall be interpreted as a hexadecimal integer. Again, the smallest representation needed is chosen.

 4. A string of decimal digits beginning with a nonzero digit is interpreted as a decimal integer and stored in the smallest suitable internal representation.

 5. A string of digits beginning with a zero but containing the digits '''8''' or '''9''' is an error; it appears to be bad octal. It would be possible to allow this case in ''Tcl_GetDoubleFromObj'', but it seems unwise, since the consequence would be that '''string is double''' would accept "double" strings that will fail in '''expr'''.

 6. A string consisting of a nonempty sequence of decimal digits and a single period (which may appear anywhere within the string) is a valid floating point constant in 'F' format, even if it begins with '0'. It is interpreted in decimal and stored in a '''tclDoubleType'''. If the input number is too small to represent, an appropriately signed zero is stored. If the input number is too large to represent, an appropriately signed infinity is stored.

 7. Floating point numbers in the usual 'E' format are accepted and interpreted in decimal.
    Once again, they are stored in '''tclDoubleType''' and are replaced with zero or infinity if they are too small or too large to represent.

 8. The constants "Inf" and "Infinity" (perhaps with a leading sign) are interpreted as infinities. Infinity is represented as '''tclDoubleType'''.

 9. The constant "NaN" is the IEEE "Not a Number" value. It is specifically permitted in the parser so that '''binary format q NaN''' and similar calls can produce NaN on an external medium. The presence of NaN in expressions, or in ''Tcl_GetDoubleFromObj'', signals an error. NaN is represented as '''tclDoubleType'''.

 10. IEEE floating point does not have a single unique NaN value, so a NaN may be augmented by a parenthesized string of hexadecimal digits, which will be stored in its least significant bits. It shall not be possible to construct a signalling NaN by this route; only quiet NaN will be supported. NaN is represented as '''tclDoubleType'''.

~ Additions

In addition to the base state machine detailed above, the state machine of the reference implementation contains additional states to parse integer values beginning with the '''0b''' or '''0o''' prefixes as originally proposed in [114]. Recognizing these prefixes in Tcl 8.5 is an important step toward migrating to whatever version of Tcl drops the "leading '''0''' implies octal format" rule.

In addition, the parsing routine will accept a ''flags'' value containing the flag bits below, which exert finer control over the parsing. These extra controls were found to be required to permit the [[scan]] command to use the same parser.

 * '''TCL_PARSE_INTEGER_ONLY''' -- accept only integer values; reject strings that denote floating point values (or accept only the leading portion of such a string that is an integer value).

 * '''TCL_PARSE_SCAN_PREFIXES''' -- ignore the prefixes '''0b''' and '''0o''' that are not part of the [[scan]] command's vocabulary. Use only in combination with '''TCL_PARSE_INTEGER_ONLY'''.

 * '''TCL_PARSE_OCTAL_ONLY''' -- parse only in the octal format, whether or not a prefix is present that would lead to octal parsing. Use only in combination with '''TCL_PARSE_INTEGER_ONLY'''.

 * '''TCL_PARSE_HEXADECIMAL_ONLY''' -- parse only in the hexadecimal format, whether or not a prefix is present that would lead to hexadecimal parsing. Use only in combination with '''TCL_PARSE_INTEGER_ONLY'''.

 * '''TCL_PARSE_DECIMAL_ONLY''' -- parse only in the decimal format, no matter whether a '''0''' prefix would normally force a different base.

~ Incompatibilities

With the change described, the Tcl and Tk test suites run with unwanted results only in the detailed format of error messages for integer overflow and in the types reported by the '''testobj''' command (not part of the usual distribution) when it is used for introspection. Despite this reassurance, several potential incompatibilities are identified.

First, as mentioned above, C extensions will no longer have fine control over Tcl's built-in numeric types, because the types will not be registered and hence will be unavailable for use with ''Tcl_ConvertToType''. This is actually a good thing, since it means that the rest of Tcl can assume that they are well-behaved, resulting in a considerable simplification. Most of the Tcl Core Team believes that ''Tcl_ConvertToType'' has no legitimate use in any case.

Second, it will no longer be correct to assume that ''Tcl_Get*FromObj'' will leave an internal representation of precisely the requested type. It is, in any case, a highly questionable practice for callers to assume a specific internal representation (with the possible exception of ''Tcl_Set*Obj'' and ''Tcl_New*Obj''). There will no doubt be a few extensions that run afoul of this change, but they can be fixed easily in such a way that they will continue to compile and run on earlier versions of Tcl.

Third, ''Tcl_GetDoubleFromObj'' will be both more and less permissive than before.
It will no longer accept constants with a leading zero and no decimal point or 'E' that are invalid octal numbers. On the other hand, it will accept constants that are too large to fit in a '''Tcl_WideInt'''; somewhat surprisingly, '''string repeat 9 50''' cannot today be interpreted as a double. '''string is double''' will follow ''Tcl_GetDoubleFromObj'' in what it considers acceptable. Any string that is accepted as either an integer or a double by '''expr''' will be accepted in ''Tcl_GetDoubleFromObj'', and only those strings will be accepted.

Fourth, the recognition of '''0b''' and '''0o''' as valid prefixes for integer values is a type of incompatibility.

~ Reference Implementation

See [237] for more implementation details.

~ Copyright

Copyright (c) 2005 by Kevin B. Kenny. All rights reserved.

This document may be distributed subject to the terms and conditions set forth in the Open Publication License, version 1.0 [http://www.opencontent.org/openpub/].
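
As an illustration only, the ten accepting states listed in the Proposal can be sketched as a small standalone C classifier. This is not the reference implementation and does not use the Tcl C API. The names ''NumKind'', ''classify_number'' and ''equals_nocase'' are hypothetical, the case-insensitive matching of "Inf", "Infinity" and "NaN" is an assumption, and so is the requirement that a NaN payload contain at least one hexadecimal digit. The sketch only names the category of a string; it does not compute a value, choose between '''tclIntType''', '''tclWideIntType''' and '''tclBignumType''', or handle the '''0b''' and '''0o''' prefixes from the Additions section.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical category names, one per group of accepting states. */
typedef enum {
    NUM_INVALID, NUM_ZERO, NUM_OCTAL, NUM_HEX, NUM_DECIMAL,
    NUM_BAD_OCTAL, NUM_DOUBLE, NUM_INFINITY, NUM_NAN
} NumKind;

/* Compare the first len bytes of s with the lowercase word, ignoring case. */
static int equals_nocase(const char *s, size_t len, const char *word) {
    size_t i;
    if (strlen(word) != len) return 0;
    for (i = 0; i < len; i++) {
        if (tolower((unsigned char) s[i]) != word[i]) return 0;
    }
    return 1;
}

NumKind classify_number(const char *s) {
    const char *end;
    size_t len, i, digits = 0, dots = 0;
    int saw89 = 0;

    assert(s != NULL);
    /* Leading and trailing whitespace is allowed for all numbers. */
    while (isspace((unsigned char) *s)) s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char) end[-1])) end--;
    /* Any number may begin with an optional '+' or '-'. */
    if (s < end && (*s == '+' || *s == '-')) s++;
    len = (size_t) (end - s);
    if (len == 0) return NUM_INVALID;

    /* States 8 to 10: infinities and NaN (case-insensitivity is assumed). */
    if (equals_nocase(s, len, "inf") || equals_nocase(s, len, "infinity")) {
        return NUM_INFINITY;
    }
    if (len >= 3 && equals_nocase(s, 3, "nan")) {
        if (len == 3) return NUM_NAN;
        /* Optional payload: "(" hex digits ")"; one digit minimum is assumed. */
        if (len < 6 || s[3] != '(' || s[len - 1] != ')') return NUM_INVALID;
        for (i = 4; i < len - 1; i++) {
            if (!isxdigit((unsigned char) s[i])) return NUM_INVALID;
        }
        return NUM_NAN;
    }

    /* State 3: "0x" or "0X" followed by at least one hexadecimal digit. */
    if (len >= 3 && s[0] == '0' && tolower((unsigned char) s[1]) == 'x') {
        for (i = 2; i < len; i++) {
            if (!isxdigit((unsigned char) s[i])) return NUM_INVALID;
        }
        return NUM_HEX;
    }

    /* Mantissa: decimal digits with at most one period anywhere. */
    for (i = 0; i < len; i++) {
        if (s[i] == '.') {
            dots++;
        } else if (isdigit((unsigned char) s[i])) {
            digits++;
            if (s[i] >= '8') saw89 = 1;
        } else {
            break;
        }
    }
    if (digits == 0 || dots > 1) return NUM_INVALID;

    /* State 7: anything after the mantissa must be a well-formed exponent. */
    if (i < len) {
        if (tolower((unsigned char) s[i]) != 'e') return NUM_INVALID;
        i++;
        if (i < len && (s[i] == '+' || s[i] == '-')) i++;
        if (i == len) return NUM_INVALID;
        for (; i < len; i++) {
            if (!isdigit((unsigned char) s[i])) return NUM_INVALID;
        }
        return NUM_DOUBLE;
    }

    if (dots == 1) return NUM_DOUBLE;          /* state 6: 'F' format */
    if (s[0] != '0') return NUM_DECIMAL;       /* state 4 */
    if (len == 1) return NUM_ZERO;             /* state 1 */
    return saw89 ? NUM_BAD_OCTAL : NUM_OCTAL;  /* states 5 and 2 */
}
```

Under these assumptions, "017" falls in the octal category, "089" in the bad-octal category, and "08.9" in the double category, matching the rule that a leading zero forces octal interpretation only when the string contains no period or exponent.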