2012-02-06

How I bootstrapped importlib

If you have been reading this blog over the past five years I am sure you have read a post or five about my desire to bootstrap importlib into Python as the implementation of __import__. Well, as of today I'm willing to say that the difficult technological hurdles have been cleared! At this point the only things holding me back from taking my code from https://hg.python.org/sandbox/bcannon#bootstrap_importlib and making importlib drive import statements are some small compatibility issues, integrating into the build process better, a code review, and python-dev sign-off. In other words all of the interesting problems have been solved, so I'm finally ready to write a blog post discussing how I pulled off what I have.

So how exactly do you import __import__? To begin, as with any bootstrap challenge, you need to figure out what is available to you so you know what your design parameters are. In my case I knew I couldn't import anything that required filesystem access since half of import is handling the search for a module (the other half is the actual importing); if I wanted to import a file I would need to essentially write half of import in C to work properly. This restriction also has unexpected side-effects, e.g. you can't rely on open() because that is part of the io module which is a Python module.

That meant I could only rely on built-in modules. If you look at sys.builtin_module_names you will discover what is available directly within the CPython binary. The question then becomes: is that enough? It turns out that yes, those built-in modules are enough to perform an import. OK, so you know you have the bare minimum modules required to do an import, but how the heck do you get the built-in modules into the global scope of the module that implements import, since you can't use an import statement?
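
For a quick look at what the binary ships with, something like the following should do. The exact list differs between Python versions and builds, so treat the printed names as illustrative only.

```python
import sys

# Names of the modules compiled directly into the interpreter binary.
# This is a tuple of strings; its contents depend on version and build.
builtin_names = sys.builtin_module_names

print(len(builtin_names), "built-in modules")
for name in sorted(builtin_names):
    print(" ", name)
```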

This is when Python's dynamism comes in handy. Since the import statement doesn't do much more than pull in the module object and assign it to a variable at the global scope of the module, I just needed to get the module object for importlib and assign to its __dict__ the built-in modules I needed. Turns out that sys and imp are enough to allow importlib to handle the import of the rest of the built-in modules needed for import to work, so that kept this bit of code short.
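
Here is a minimal sketch of that trick in pure Python. The real work happens in C during interpreter startup; the module name fake_bootstrap and the function interpreter_version are made up for the illustration.

```python
import sys
import types

# Hypothetical stand-in for importlib's source: it uses `sys` as a global
# but never imports it.
SOURCE = """
def interpreter_version():
    return sys.version_info[:2]
"""

# Create an empty module object and run the source in its namespace.
bootstrap = types.ModuleType("fake_bootstrap")
exec(compile(SOURCE, "<fake_bootstrap>", "exec"), bootstrap.__dict__)

# Before the injection the global lookup of `sys` fails.
try:
    bootstrap.interpreter_version()
except NameError as exc:
    print("before injection:", exc)

# Do by hand what `import sys` would have done inside the module.
bootstrap.__dict__["sys"] = sys
print("after injection:", bootstrap.interpreter_version())
```

The function never changes; only the module's global namespace does, which is all an import statement would have touched anyway.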

But this brings up the next quandary: how do I create a module object for importlib? If I ended up searching for importlib on sys.path then I would have ended up implementing a decent chunk of import itself. So how could I get the module object? This is when frozen modules come into play.

A frozen module is just a C array containing the marshaled code for a module (which is what a .pyc file is sans magic number, timestamp, and now file size of the source). Since marshal is a built-in module, frozen modules can be loaded without issue. That means you can load a frozen module without using import (much like importing built-in modules).
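
The round trip can be imitated from Python, with a bytes object standing in for the C array. This is only a sketch of the idea, not how CPython's freezing tools are actually written, and the name frozen_demo is invented for the example.

```python
import marshal
import types

SOURCE = "GREETING = 'hello from a frozen module'\n"

# Ahead of time: compile the source and marshal the resulting code object.
code = compile(SOURCE, "<frozen demo>", "exec")
frozen_bytes = marshal.dumps(code)  # stand-in for the C array

# At load time: unmarshal and execute in a fresh module object.
# No filesystem access and no import statement is involved.
module = types.ModuleType("frozen_demo")
exec(marshal.loads(frozen_bytes), module.__dict__)

print(module.GREETING)
```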

And that is all of the parts needed to import importlib w/o import. =) To summarize, you get importlib set as __import__ by doing the following:

  1. Import the frozen module (i.e. read in a C array holding the module's marshaled code object and unmarshal it)
  2. Import sys and imp (built-in modules, so done in C code by calling key C functions which return module objects) and set them on the module object
  3. Call Python code to import the rest of the built-in modules using sys and imp
  4. Set Python-based __import__ on the builtins module

And voila! __import__ ends up implemented in pure Python code. Now I just need to clean up the code, fix the compatibility issues, rip out the old C code, and get python-dev to sign off. =) Hopefully I will get far enough that I will have a lightning talk at PyCon with benchmark numbers to show this is actually all a good thing (including ripping out a ton of C code, especially if I can re-implement chunks of imp in pure Python =).
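
To make the four steps above concrete, here is a toy walk-through in pure Python. Everything in it is an assumption made for illustration: toy_importlib, _setup, and toy_import are invented names, the "importer" only resolves modules that are already loaded, step 3 looks modules up in sys.modules instead of initializing built-ins itself, and the function is installed into a private builtins dict rather than the real builtins module so the running interpreter is left alone. The actual bootstrap does all of this from C.

```python
import marshal
import sys
import types

# Hypothetical, drastically simplified stand-in for importlib's source.
# Note that it contains no import statements of its own.
TOY_SOURCE = '''
def _setup(sys_module):
    """Step 3: use the injected sys to fetch whatever else is needed."""
    global marshal
    # Simplification: look the module up among those already loaded.
    marshal = sys_module.modules["marshal"]

def toy_import(name, globals=None, locals=None, fromlist=(), level=0):
    """Resolve only modules that are already loaded."""
    try:
        return sys.modules[name]
    except KeyError:
        raise ImportError("toy importer cannot load " + name)
'''

# Done ahead of time by a freezing tool; the bytes stand in for the C array.
FROZEN = marshal.dumps(compile(TOY_SOURCE, "<frozen toy>", "exec"))

# Step 1: unmarshal the frozen code and run it in a fresh module object.
toy = types.ModuleType("toy_importlib")
exec(marshal.loads(FROZEN), toy.__dict__)

# Step 2: set sys on the module object by hand.
toy.sys = sys

# Step 3: let the Python code pull in the rest.
toy._setup(sys)

# Step 4: install the Python function as __import__, here in a private
# builtins dict so that only the code executed below is affected.
namespace = {"__builtins__": {"__import__": toy.toy_import}}
exec("import marshal as found", namespace)

print(namespace["found"] is marshal)
```

The import statement inside the last exec() is served entirely by the Python function, which is the whole point of the exercise.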