TIP: | 313 |

Title: | Inexact Searching in Sorted List |

Version: | $Revision: 1.14 $ |

Author: | Peter Spjuth <peter dot spjuth at gmail dot com> |

State: | Final |

Type: | Project |

Tcl-Version: | 8.6 |

Vote: | Done |

Created: | Thursday, 14 February 2008 |

Keywords: | Tcl |

This TIP adds a new switch to **lsearch** to do a binary search to find the insertion point in a sorted list.

Sometimes, it is necessary to find the location in a sorted list where a particular new element could be inserted. Either for actual insertion or for lookups to do interpolation or approximate search in data tables. Given that the list is already sorted, it is obviously the case that the location should be located through an O(log N) algorithm that takes advantage of this fact (binary searching is the most reliable method given that measuring the "distance" between two strings is a complex and expensive operation in itself).

The usefulness of the feature is shown by a quick search. I found usages in the core, in tcllib and in tklib. Given that the infrastructure for a binary search is already in **lsearch**, this is a very cheap addition.

One question for the specification is exactly what index should be returned. Below an increasing list is assumed, things are equivalent for a decreasing.

First where element is greater than key.

Last where element is less than or equal to key.

First where element is greater than or equal to key.

Last where element is less than key.

Here, 1 is the use case for a stable insertion sort.

In the core we can find **::tcl::clock::BSearch** which returns the index of the greatest element in $list that is less than or equal to $key, i.e. type 2. The same can be found in tklib's ::khim::BSearch.

In tcllib we can find **::struct::prioqueue::__linsertsorted**, which would use type 1.

Personally I have had use for both 2 and 3.

1 can trivialy be calculated from 2 and vice versa. Same for 3 and 4.

One key difference between 1/2 and 3/4 is that 1/2 return last among equals while 3/4 returns first among equals. This means that it is easier to lay 3/4 over 1/2 by first doing an exact search. *i.e.* by doing both a -sorted and a -bisect search you get all info needed, in log(N) time, to get either of 1/2/3/4.

Finally, I think it makes sense for **lsearch** to return an exact match if there is one, leading to type 2 being specified in this TIP.

For a decreasing list, things are equivalent. The same relationships between 1/2/3/4 applies, so it is reasonable to select the same there.

I saw the word bisect used for this type of operation, but a better name is probably possible if someone have a suggestion.

An option **-bisect** is added to **lsearch**. This is a modifier to **-sorted** and implies **-sorted** search mode.

The list to be searched thus must be sorted and how it is sorted is specified just as for unmodified **-sorted**.

For an increasing list, the **-bisect** flag makes **lsearch** return the greatest index where the element is less than or equal to the key.

For a decreasing list, the **-bisect** flag makes **lsearch** return the greatest index where the element is greater than or equal to the key.

If the key is before the first element, or the list is empty, -1 is returned.

It is illegal to use **-bisect** with either **-all** or **-not**.

Note that **-inline** and **-start** are still valid, though perhaps not very useful.

A stable insertion sort:

set dest {} foreach elem $src { set i [lsearch -bisect -integer $dest] set dest [linsert $dest [+ $i 1] $elem] }

http://sourceforge.net/support/tracker.php?aid=1894241

Some messages on news:comp.lang.tcl provide additional motivation for this TIP:

From Kevin Kenny: <47B59A43 dot 9040205 at acm dot org>:

[...] When I've coded binary search like that, it's generally been as part of an interpolation or approximate search procedure. For instance,

::tcl::clock::BSearchfinds, among other things, the last change of time zone at or before a given time. The cubic spline procedure in tcllib uses BSearch to find the control point just to the left of the interpolated point. There are a great many cases where you want to look up the nearest neighbour without inserting anything if there is no exact match.

From Neil Madden: <mailto:fp46r6$pb6$1@oyez.ccc.nottingham.ac.uk>:

Indeed. My own recent experience was in looking up annotations to display while playing back some 3D visualisation data (through Togl). A list of form {timestamp annotation ts annot ...} was used and the annotation should be displayed for any frame from that timestamp up until the next annotation. When the user can randomly seek to any position in the data it was necessary to find the nearest preceeding annotation. I hand-coded a binary search for this. This

lsearchenhancement would have been ideal (well, providing I massaged the data into a form suitable for use with-index, which I assume this TIP is compatible with).

Instead of making **-bisect** a modifier to **-sorted** it could be a search mode in itself. This was the original specification in this TIP. Making it a modifier makes more sense and makes documentation simpler.

Naming the option based on something like "insert", like **-insertpos** or making this an option to **linsert**; Since the spec selects variant 2, it does not return the actual insertion position.

The name **-nearest** was proposed. Since the spec does not return the nearest, it is not that good.

Also proposed was "**-inexact** (by analogy with switch -exact)." "**-inexact** is good enough for the title of the TIP, so it should be good enough for the option name. It may be tricky writing the documentation since **-sorted** already implies **-exact**."

"Names that capture the meaning better are too verbose, but maybe they'll suggest a good name ... -allowmissing -maybeabsent -absent -approximate -fuzzy -insertionpointifmissing"

"The difference with the new option is that the search term need not be present in the list. That distinction is what the option name should capture."

This document has been placed in the public domain.

[Index] [History] [HTML Format] [Source Format] [LaTeX Format] [Text Format] [XML Format] [*roff Format *(experimental)*] [RTF Format *(experimental)*]