PEP 393 performance

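Writing efficient code that manipulates Unicode strings has become harder since PEP 393. The difficulty is respecting the canonical form: a string must be stored in the most compact representation (ASCII, Latin1, UCS2 or UCS4) that can hold its largest character.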

If you preallocate an ASCII buffer but then need to write a Latin1 character, you have to convert the ASCII string to Latin1, which means copying all the characters already written. This is inefficient, especially if the Latin1 character occurs near the end. If you preallocate a UCS4 buffer but the result is UCS2, you have to “shrink” the buffer from UCS4 to UCS2, which again means copying all characters.
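The storage kinds discussed above can be observed from Python itself. The following is a minimal sketch that relies on a CPython implementation detail (the size reported by sys.getsizeof for str objects in CPython 3.3 and later); the helper name bytes_per_char is made up for this illustration, and other Python implementations may report different numbers.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate how many bytes CPython stores per repetition of `ch`.

    Both strings are built at run time; the fixed object header cancels
    out in the subtraction, leaving only the per-character cost.
    """
    short = ch * 2
    long = ch * (n + 2)
    return (sys.getsizeof(long) - sys.getsizeof(short)) // n

# One sample character per PEP 393 storage kind.
samples = {
    "ASCII": "a",
    "Latin1": "\xe9",        # e with acute accent
    "UCS2": "\u20ac",        # euro sign
    "UCS4": "\U0001F600",    # a character outside the BMP
}

for kind, ch in samples.items():
    print(kind, bytes_per_char(ch))

# A single wide character at the end forces the whole string to be wide.
ascii_only = "a" * 1001
mixed = "a" * 1000 + "\u20ac"
print(sys.getsizeof(ascii_only), sys.getsizeof(mixed))
```

The last two lines illustrate why a late non-ASCII character is costly: every character already written has to be stored again at the wider size.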

To avoid having to widen or shrink the buffer, you can scan the input to compute the maximum character before allocating the buffer. In practice, however, processing the input twice may be slower.
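The two-pass approach can be sketched as follows. This is a toy Python model, not CPython's C code: the function names are hypothetical, and a bytearray with 1, 2 or 4 bytes per character stands in for the real string buffer (ASCII and Latin1 are both treated as 1 byte wide here).

```python
def narrowest_width(maxchar):
    """Bytes per character needed to store code points up to `maxchar`."""
    if maxchar < 0x100:
        return 1      # ASCII and Latin1
    if maxchar < 0x10000:
        return 2      # UCS2
    return 4          # UCS4

def two_pass_concat(parts):
    """Concatenate strings into an exactly sized, exactly wide buffer."""
    # Pass 1: scan every character to find the total length and the
    # maximum code point. This is the extra work the text warns about.
    length = 0
    maxchar = 0
    for part in parts:
        length += len(part)
        for ch in part:
            maxchar = max(maxchar, ord(ch))

    width = narrowest_width(maxchar)
    buf = bytearray(length * width)   # allocated once, never resized

    # Pass 2: write the characters at their final width.
    pos = 0
    for part in parts:
        for ch in part:
            buf[pos:pos + width] = ord(ch).to_bytes(width, "little")
            pos += width
    return width, buf

width, buf = two_pass_concat(["abc", "\u20ac"])
print(width, len(buf))
```

The buffer never needs widening or shrinking, but every input character is visited twice.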

Another problem is the length of the result. Computing the length of the result of str%args or str.format(args) in advance requires doing the work twice: once to get the length, and once to write the characters. Both approaches were tested (*), and processing the output twice is too slow.

For efficient code, you should be optimistic and enlarge or widen the buffer on demand. When the output length is unknown, it is better to overallocate the buffer.
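The optimistic strategy can be sketched with a toy Python model. It is not the CPython implementation: the class and attribute names are hypothetical, the 25% overallocation factor is an assumption chosen for the example, and a bytearray stands in for the string buffer. The counters only exist to make the cost of each strategy visible.

```python
def width_for(code_point):
    """Bytes per character needed for this code point (1, 2 or 4)."""
    if code_point < 0x100:
        return 1
    if code_point < 0x10000:
        return 2
    return 4

class OptimisticWriter:
    """Start narrow and empty; widen or enlarge only when a write needs it."""

    def __init__(self, overallocate=True):
        self.width = 1            # optimistic: assume 1 byte per character
        self.length = 0           # characters written so far
        self.buf = bytearray()
        self.overallocate = overallocate
        self.widen_copies = 0     # times all written characters were copied
        self.reallocs = 0         # times the buffer was enlarged

    def _capacity(self):
        return len(self.buf) // self.width

    def _load(self, index):
        start = index * self.width
        return int.from_bytes(self.buf[start:start + self.width], "little")

    def _store(self, index, code_point):
        start = index * self.width
        self.buf[start:start + self.width] = code_point.to_bytes(self.width, "little")

    def _widen(self, new_width):
        # Copy every character already written into a wider buffer.
        points = [self._load(i) for i in range(self.length)]
        capacity = self._capacity()
        self.width = new_width
        self.buf = bytearray(capacity * new_width)
        for i, point in enumerate(points):
            self._store(i, point)
        self.widen_copies += 1

    def _reserve(self, extra):
        needed = self.length + extra
        if needed <= self._capacity():
            return
        if self.overallocate:
            needed += needed // 4   # assumed factor: 25% extra room
        self.buf.extend(bytes((needed - self._capacity()) * self.width))
        self.reallocs += 1

    def write(self, text):
        if not text:
            return
        needed_width = width_for(max(map(ord, text)))
        if needed_width > self.width:
            self._widen(needed_width)
        self._reserve(len(text))
        for ch in text:
            self._store(self.length, ord(ch))
            self.length += 1

    def finish(self):
        return "".join(chr(self._load(i)) for i in range(self.length))

writer = OptimisticWriter()
for piece in ["abc", "\xe9", "\u20ac", "\U0001F600"]:
    writer.write(piece)
print(writer.finish() == "abc\xe9\u20ac\U0001F600", writer.widen_copies)
```

In this model, pure ASCII or Latin1 output never pays for a widening copy, and overallocation trades some unused memory for fewer enlargements when many small writes follow each other.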

The _PyUnicodeWriter API helps to implement such functions:

  • Widen the buffer on demand
  • Enlarge the buffer on demand
  • Minimum length and overallocation of the buffer can be configured
  • Completely avoid the need for a buffer when the output consists of a single string
  • Delay allocation of the buffer until the first write. This helps to compute the length and kind of the buffer, because they cannot always be computed before the first write. It also avoids allocating a buffer if no write is done at all (for example, when an error occurs before the first character is written).
  • Give direct access to the buffer for best performance
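Two of the behaviors listed above, delayed allocation and the single-string shortcut, can be illustrated with a toy Python model. This is only a sketch of the idea: LazyWriter and its methods are hypothetical names, they are not the _PyUnicodeWriter C API, and a Python list stands in for the real buffer.

```python
class LazyWriter:
    """Toy model: no buffer until a second string makes one necessary."""

    def __init__(self):
        self._single = None            # the only string written so far
        self._parts = None             # stand-in for the real buffer
        self.buffer_allocated = False

    def write_str(self, text):
        if not text:
            return
        if self._single is None and self._parts is None:
            # First write: just remember the string, allocate nothing.
            self._single = text
            return
        if self._parts is None:
            # Second write: a buffer is now needed. Its initial content
            # (and therefore its length and kind) is known at this point.
            self._parts = [self._single]
            self._single = None
            self.buffer_allocated = True
        self._parts.append(text)

    def finish(self):
        if self._parts is not None:
            return "".join(self._parts)
        if self._single is not None:
            return self._single        # same object, no copy made
        return ""                      # nothing written, nothing allocated

source = "h\xe9llo"
writer = LazyWriter()
writer.write_str(source)
print(writer.finish() is source, writer.buffer_allocated)
```

When the output consists of one string, the model returns that string object itself, so no buffer is created and no characters are copied.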