
Fast Jupyter Notebook Parsing

A Crude Comparison of Falsifiable vs nbformat

[Figure: Gopher with Jupyter logo on T-shirt. Produced using gopherize.me.]

Fernando Pérez chose JSON as the original IPython serialization format, which continues to make basic parsing trivial. However, nbformat does more than load a JSON document, because not every JSON document is a compliant notebook. (See jupyter/nbformat for the relevant JSON schemas.) Meanwhile, the abreka command line tool used by falsifiable is a golang application. There are many reasons for this decision; the most important ones will become obvious closer to v0.1.0. In the near term, cross-compiling a self-contained and performant binary simplifies a lot of the work of building and using the tool.
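To make the "not every JSON document is a compliant notebook" point concrete, here is a minimal sketch using only the standard library. It checks just the four top-level keys that the nbformat v4 schema requires; the helper name `missing_notebook_keys` is hypothetical, and real validation in nbformat runs against the complete JSON schema rather than this short list.

```python
import json

# Top-level keys the nbformat v4 JSON schema marks as required.
REQUIRED_KEYS = ("cells", "metadata", "nbformat", "nbformat_minor")


def missing_notebook_keys(text):
    """Parse text as JSON and return the required v4 keys it lacks.

    Hypothetical helper for illustration only; it is far weaker than
    validating against the full nbformat schema.
    """
    doc = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(doc, dict):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if key not in doc]


# Perfectly valid JSON, but not a notebook.
valid_json_only = '{"title": "not a notebook"}'

# The smallest document that passes this simplified check.
minimal_notebook = json.dumps(
    {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 4}
)

print(missing_notebook_keys(valid_json_only))
print(missing_notebook_keys(minimal_notebook))
```

Both inputs survive `json.loads`; only the second one has the shape a notebook reader expects.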

This post is a quick check of the golang implementation's performance advantage. Both the canonical python package and my golang module sequentially parse a corpus of 24,183 notebooks (about 10GB), checking each one for errors.

Python Performance

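import nbformat
import time


# FILE_LIST_PATH is assumed to be defined already: a text file with one notebook path per line.
start_time, n_processed, n_errors = time.time(), 0, 0

with open(FILE_LIST_PATH) as list_fp:
    for line in list_fp:
        with open(line.strip()) as fp:
            try:
                nbformat.read(fp, as_version=4)
            except Exception:
                # Count anything raised while reading the notebook. KeyboardInterrupt
                # is not an Exception subclass, so Ctrl-C still propagates.
                n_errors += 1
        n_processed += 1

end_time = time.time()
elapsed = (end_time - start_time) / 60  # minutes
n_processed, n_errors, elapsed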
(24183, 1, 6.845025225480398)

Golang Performance

Falsifiable is not yet open source. When it is, I’ll repost this analysis as a repository. The code I’m testing does full validation against nbformat version 4.4, with failures “relaxed” in the sense that errors are swallowed. (Strict adherence is faster given early termination.) It uses buger/jsonparser for the deserialization phase before building the associated structs (e.g. MarkdownCell, DisplayData, etc.). The following was copied and pasted from the benchmark.

n_processed_go, n_errors_go, elapsed_go = (24183, 0, 1.264153416)
n_processed_go, n_errors_go, elapsed_go
(24183, 0, 1.264153416)
golang_speedup = elapsed/elapsed_go
golang_speedup
5.414710856170639
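The relaxed versus strict distinction mentioned earlier in this section can be sketched in a few lines. This is one reading of it, not Falsifiable's actual code: the checks below are deliberately simplified (only `cell_type` and `source` are inspected), and the function names are hypothetical. A lazy generator yields errors; relaxed mode drains it, while strict mode stops at the first error, which is where the early-termination saving comes from.

```python
VALID_CELL_TYPES = {"code", "markdown", "raw"}


def iter_cell_errors(notebook):
    """Lazily yield one message per cell that fails a simplified check."""
    for index, cell in enumerate(notebook.get("cells", [])):
        cell_type = cell.get("cell_type")
        if cell_type not in VALID_CELL_TYPES:
            yield f"cell {index}: unknown cell_type {cell_type!r}"
        elif "source" not in cell:
            yield f"cell {index}: missing 'source'"


def validate_relaxed(notebook):
    """Relaxed mode: inspect every cell and return all errors found."""
    return list(iter_cell_errors(notebook))


def validate_strict(notebook):
    """Strict mode: raise on the first error, skipping the remaining cells."""
    first_error = next(iter_cell_errors(notebook), None)
    if first_error is not None:
        raise ValueError(first_error)


notebook = {
    "cells": [
        {"cell_type": "markdown", "source": "# Title"},
        {"cell_type": "chart", "source": ""},  # invalid cell type
        {"cell_type": "code"},  # missing source
    ]
}

print(validate_relaxed(notebook))
try:
    validate_strict(notebook)
except ValueError as err:
    print("strict stopped at:", err)
```

In relaxed mode the caller is free to ignore the returned list, which matches the "errors are swallowed" behaviour in spirit; strict mode never looks at the third cell.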

Conclusion

Falsifiable’s notebook parsing is fast. Because deserialization populates structs, navigating the resulting notebook is easy (read: IDE-assisted). For merely parsing, validating, and rendering notebooks, whether locally or on falsifiable.com, this is overkill. However, for bulk processing, a 5x speedup is nothing to shrug off.
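To put the speedup in bulk-processing terms, the elapsed times above can be converted into per-notebook latency and throughput. This sketch assumes the golang figure is in minutes, like the python one, as the division used for the speedup implies; the function names are only for illustration.

```python
N_NOTEBOOKS = 24183

# Elapsed wall-clock times from the two runs above, in minutes.
ELAPSED_MINUTES = {
    "python (nbformat)": 6.845025225480398,
    "golang (falsifiable)": 1.264153416,
}


def per_notebook_ms(elapsed_minutes, n_notebooks=N_NOTEBOOKS):
    """Average milliseconds spent on each notebook."""
    return elapsed_minutes * 60 * 1000 / n_notebooks


def notebooks_per_second(elapsed_minutes, n_notebooks=N_NOTEBOOKS):
    """Average number of notebooks processed per second."""
    return n_notebooks / (elapsed_minutes * 60)


for name, minutes in ELAPSED_MINUTES.items():
    print(
        f"{name}: {per_notebook_ms(minutes):.2f} ms/notebook, "
        f"{notebooks_per_second(minutes):.1f} notebooks/s"
    )
```

That works out to roughly 17 ms per notebook for nbformat against roughly 3 ms for the golang module.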