Introduction Edit

Let's suppose you have a collection of items (a.k.a data, the plural of "datum") and the program is constantly looking up a certain item in the collection. A naïve way to do that would be to iterate over all items in the collection one by one and check if each of them matches the constraints of the lookup. This is called linear search, and has a complexity of O(N) per lookup and an accumulative complexity of O(N^2) assuming N items are looked up and inserted. That can become very slow for large N's.

As a result, using a more efficient lookup mechanism, such as one that provides O(log(N)) lookup, or even optimally O(1) may be a better idea.

Implementations Edit

Binary-search tree Edit

A Self-balancing binary search tree provides a worst-case lookup, insertion and deletion time of O(log(N)) and an accumulative O(N*log(N)) performance. Another advantage of using such a tree is that it allows one to traverse the collection in order in O(N) (linear) time.

Hash table Edit

A Hash table provides an average-case / best-case of O(1) (constant time) lookup, insertion and deletion complexity (while the worst case performance may be up to O(N^2) depending on the implementation and the data fed to it.), which generally performs better than binary search trees. On the other hand, hash tables store the data in an arbitrary order, which makes any traversal of it (or finding the next or previous item) to be harder than with a BST.

A Trie or a Radix-tree Edit

Tries and Radix trees are data structures that allow looking up a string proportionally to its length. While there are some algorithms for inserting and deleting from radix trees, they are often used when the collection of items remains constant throughout the run-time of the program, as is the case for dictionaries of human languages and spell-checkers.

A Sorted Array Edit

If the contents of the collections are constant, one can sometimes order them (or pointers to them) in a sorted array and use a binary search on them. This will generally yield somewhat better performance (though not an asymptotic complexity one) than a binary search tree, but on the other hand, does not allow an efficient way to change the contents of the collection.

Notes Edit

Many higher-level programming languages provide convenient built-ins or a convenient array-like syntax for an associative array (a.k.a "dictionary" or "map") that allows one to look up arbitrary keys and match with values. Most of the so-called "scripting languages" , "dynamic languages" or "agile languages" (e.g: Awk, Perl, Python, Ruby, PHP, Tcl, Lua, etc.) provide that for example. Using such built-in constructs is convenient and easy, and is probably good enough for most needs.

Links Edit

  • Binary search trees on the wikipedia
  • Hash tables - on the wikipedia
  • The Judy Arrays' Homepage - an open-source (LGPL), highly optimised and cache-wise implementation of binary search trees giving very good performance.
  • The libavl2 Homepage - a collection of open-source (LGPL) optimized implementations of binary search trees, including detailed accompanying documentation (in a literate style programming). Also includes links to other implementations.
  • Berkeley DB (on the wikipedia) - provides an open-source, high-performance and powerful on-disk database based on hash tables or B-trees.
  • Tokyo Cabinet - a very high-performance, open-source (LGPL) implementation of a key/value store.
  • The C++ Standard Template Library - a collection of data structures and algorithms for the C++ programming language (part of the standard library), that contains some implementations of efficient lookup data structures.
  • Glib, Apache Portable Runtime - similar implementations for non-C++ C.