Demystifying python container types

11 minute read

Writing type annotations in python can be hard. How do I pick the right one with so many different options that look so similar? If a function returns [1, 2, 3], how should I annotate it? Should it be an Iterable, or a MutableSequence? Can it be a Generator? And why not just a list?

The short answer is, as always, “it depends”, on what exactly you want to do.

Let’s go through some of the python types related to container of elements, discuss how those are related, and where to use them.

TL;DR

Do not use list, dict, tuple as type annotations in python. In most cases, those type annotation are either too demanding (as input types), or promising too much (as output types). The typing system in python has more narrowly scoped types. Also, developers can build their own types that are tailored for their specific needs.

Before we start…

… lets talk about the technical setup. The code in this post is written in pseudo python3. Where necessary, links to the actual code are provided, and most of the information comes directly from the python typeshed and builtins.py.

Explicit, minimal context matters

Types imply behaviour. If I am given a list[int], I know that I can do a couple of things with it:

  • .append something to the end of the list,
  • find the first index of a value via .index,
  • get an item by index like x[y],

and many more.

In most of the cases however, I don’t need all of that. I might want only to check the size of the list and whether it contains a particular element. Python typing system allows to be quite specific in what we actually want, by using more narrowly scoped type annotation than, for example, list.

Being specific is great. It is easier to satisfy a more specific requirement (e.g., what an input argument to a function is expected), than a less specific requirement. It is easier to understand and safer to use a more specific result (e.g., what an output of a function is), than a less specific result. Being specific means that we, as developers, make the code more explicit, and reduce the amount of context that is necessary to understand us.

We need to know two things to be specific:

  1. The minimal requirements or promises that we want to provide or make,
  2. How to express those in terms of the python typing system.

The first one depends on the concrete case, and it is the developer’s responsibility to figure that out.

The second one is a general question, and we will discuss it in terms of the python typing primitives. Ultimately, we will understand when it makes sense to annotate something as list or dict, and when something much simpler would be a better choice.

Building blocks 0: primitive types

Some types that define a behaviour must be first in the chain of inheritance. Three of such primitive types are of particular interest to us: Iterable, Sized, and Container. Those are independent, “level 0” types, and they mean only what is in their source code.

A Container object must be able to tell if it contains something, as in y in x for an x: Container and y: object must work:

class Container:
    def __contains__(self, __x: object) -> bool: ...

(The actual code is simplified but the idea is the same. See ref for the actual definition.) The python in statement calls __contains__.

A Sized object must be able to tell its length, as in len(s) for an s: Sized must work:

class Sized:
    def __len__(self) -> int: ...

(Again, this is slightly simplified. See ref for the actual definition.) The python len() statement calls __len__.

An Iterable object must be able to create a “stream” of data. This stream of data is called an iterator, and we will it discuss in the next section. For an it: Iterable, a call iter(it) must work (obviously, not everything in python is an Iterable, for example, iter(1) will not work because int does not implement __iter__ and thus is not an Iterable).

class Iterable:
    def __iter__(self) -> Iterator: ...

(And this also is slightly simplified. See ref for the actual definition.) The python iter() statement calls __iter__.

Ok, but why?

The reason we need all three of these types is that they represent quite distinct behaviour, and one cannot replace the other.

An Iterable (something that can represent the inside data as a stream) is not necessarily Sized, i.e., does not have a fixed length (it can be representing a stream of objects that are generated on the fly). For the same reason it is not necessarily a Container: we cannot know in advance if a particular object will appear in the stream.

A Sized (something that has a known length) does not have to provide a way to peek into its elements, and therefore is neither an Iterable nor a Container. Imagine, for example, an Sized object that represents a seating plan in an airplane. We might want to tell how many passengers are already in, but nothing more.

A Container (something that can check what is inside) can be unSized, e.g., like a Container of all twin primes. It is also not necessarily an Iterable, following the same argument as for the Sized type, when we don’t want to disclose which elements are actually inside.

Summing up, the level 0 container types are:

Level 0 container types

Building blocks 1: add a bit of spice, and mix

We can enrich the primitive types, Iterable, Sized, and Container, by adding some additional behaviour or mixing them together.

A new type can add some new functionality to an existing one. For example, an Iterable that can reverse the iteration direction (instead of iterating from the first element to the last, iterate from the last element to the first) is called a Reversible. An Iterable cannot change the direction by itself, so it is necessary to introduce a new behaviour for that.

class Reversible(Iterable):
    def __reversed__(self) -> Iterator: ...

(See ref for the full definition.)

Another example is a “stateful” Iterable. As we discussed above, an Iterable can create a stream of data, but does not deal with consuming that stream. An Iterable that can iterate through the elements of the stream, one by one, is called an Iterator. An Iterator adds a __next__ method to an Iterable:

class Iterator(Iterable[T]):
    def __next__(self) -> T: ...

(See ref for the full definition.) A Iterator object must be able to yield its next element, as in next(it) for an it: Iterator must work. This is in contrast to a plain Iterable, which does not have a state, so calling next on it will fail.

Side note: in python, users can write for x in it, where it: Iterable. This implicitly calls __iter__ once on the iterable it which creates an Iterator. Then, implicit next calls are made multiple times against the produced Iterator, and assigns the elements to x one by one.

A new type does not necessarily need to introduce brand new functionality, and can be a combination of the existing ones. A Collection type is a sum of Iterable, Sized, and Container:

class Collection(Iterable, Container, Sized): ...

(This is slightly simplified. See ref for the actual definition.)

If there are cases when we would need to

  • iterate over content,
  • check if a particular object is inside,
  • known the length of a container,

a Collection type annotation should be just fine!

Summing up, we discussed the following level 1 container types:

Level 1 container types

Custom level 1 types

Of course, it is perfectly fine to define custom types that include only those pieces that are of interest to the developer. For example, a SizedContainer = Sized + Container:

_T_co = TypeVar("_T_co", covariant=True)

class SizedContainer(Sized, Container[_T_co], Protocol[_T_co]): ...

(Here a type variable _T_co is used. See my previous post to learn more about that.)

Building blocks 2: All around me are familiar faces

The types that we discussed so far are quite limited. It is great on one hand, as the scope around those is very concise. On the other hand, developers are likely to expect to work with more feature-rich objects. The level 2 types cover most of those needs.

Let’s talk about three common container types: sequences, hash tables, and sets. All of those are Collections, with extra behaviour specific to each concrete type.

A Sequence is a Reversible Collection with a __getitem__, index, and count methods. In a sense, a Sequence is a very minimal immutable list - we can iterate over it, get an element by index, count the number of occurrences of a particular entry.

A Mapping is a Collection that contains key-value pairs, and allows to inspect elements by key, view the keys, values, and key-value pairs. Similar to the Sequence, it is an immutable type. However, it is not possible to iterate over a Mapping directly (but we can iterate its keys and values).

An AbstractSet is a Collection of unique elements, and allows to compare itself against other AbstractSets. Again, it is an immutable type.

Sequence, Mapping, and AbstractSet are great options when we need to annotate immutable container types.

Level 2 container types

Building blocks 3: They Are A-Changin’

So far, all the types that we discussed were immutable. We do want to add elements to our containers, pop elements, update values for a particular key, etc. Let’s talk about some mutable types provided by the standard typeshed.

A MutableSequence is a mutable version of the Sequence type that allows to add and remove elements anywhere in the sequence. MutableMapping and MutableSet are mutable versions of the Mapping and AbstractSet types, respectively. The syntax for elements manipulation is slightly different, but the general idea is the same.

Level 3 container types

Note that these types are purely for type annotation (as they are protocols and only define the behaviour, but do not implement it), and we cannot instantiate them like s = MutableSet().

Building blocks 4: Back Where It All Started

We are all familiar with list, tuple, dict, set. Those are actual python objects, and we can instantiate them with l = list() or s = set((1, 2)). So, at least, they should implement the __init__ method (which already makes them different from the protocol types discussed above).

Let’s start with the tuple which is the only immutable object in this group. It is a Sequence (but not a MutableSequence) plus a bit of extra functionality: we can add and multiply tuples, and compare tuples (which is based on element-wise comparison).

The list, dict, and set objects are expectedly MutableSequence, MutableMapping, and MutableSet, respectively. They can also create themselves via new methods like copy(), fromkeys (for dict), and difference / intersection / etc. (for set). This is possibly the main difference from the Mutable protocols, which scopes do not extend to generation of new objects.

Similarly, the built-in types implement some form of user representation (__repr__), unlike the protocol types. This makes sense, because, for example, a Collection does not really care about being represented in any user-friendly form.

Level 4 container types

Side note: interestingly, a dict is also a Reversible.

Summary

I believe it is useful to be aware of the container types build chain, from a primitive Sized to a top-level dict.

Building a signature of a method or a function is very exciting, and it is very important to do that wisely. Being specific in both the input and output types helps the users of the function: they know what exactly is expected (inputs) so there is less to implement; they know what can they rely on (outputs). A function that requires as little as possible suits more use cases.

Similarly, if developers of a function provide only what is actually needed to the users, they would evade unnecessary work, and de-risk backwards compatibility. Developers and users avoid potential misunderstandings: it would be unfair to the users if a function promises to return a dict (a mutable mapping with a bunch of utility methods), but what it actually wanted to provide was a plain Mapping as it was never expected for the users to add new elements to it.

Imagine a taxi service: you, as a service provider, allows your passangers to go from A to B for a fee. As a service provider, the only thing you require is that a client is able to pay for the service (the input requirements are minimal). Also, as a service provider, you promise a safe journey from one place to another, but nothing else. Your employee taxi driver is be capable of changing the destination to a different city - but it would not make much sense to promise that to the passanger, because this is not what you agreed on (even though the driver can do that, just a many other unrelated things). If you don’t inform the user of your service about the limits of the service, the user might be quite upset if the services they assumed to be availble are not supported.

Being able to be explicit in what is actually needed or provided reduces risks and sets the expectations, making the code easier to follow and support.

Updated: