Fast Jupyter Notebook Parsing
A Crude Comparison of Falsifiable vs nbformat
Fernando Pérez chose JSON as the original IPython serialization format. This continues to make basic parsing trivial. However, nbformat does more than merely loading a JSON document because not every JSON document is a compliant notebook.See jupyter/nbformat for relevant JSON schemas. Meanwhile, the
abreka command line tool used by
falsifiable is a golang application. There are a lot of reasons for this decision. The most important ones will become obvious closer to
v0.1.0. But in the near term, cross-compilation of a self-contained and performant binary simplifies a lot of work and use.
This post quickly confirms superior performance. Both the canonical python package and my golang module sequentially parse a corpus of 24,183 notebooks (about 10GB), checking each one for an error.
import nbformat import time start_time, n_processed, n_errors = time.time(), 0, 0 with open(FILE_LIST_PATH) as list_fp: for line in list_fp: with open(line.strip()) as fp: try: nbformat.read(fp, as_version=4) except KeyboardInterrupt: raise except: n_errors += 1 n_processed += 1 end_time = time.time() elapsed = (end_time-start_time) / 60 n_processed, n_errors, elapsed
(24183, 1, 6.845025225480398)
Falsifiable is not yet open source. When it is, I’ll repost this analysis as a repository. The code I’m testing does full validation against nbformat version 4.4, with failures “relaxed” in the sense errors are swallowed.Strict adherence is faster given early termination. It uses buger/jsonparser for the deserialization phase before building the associated structs (e.g.
DisplayData, etc). The following was copy/pasted from the benchmark.
n_processed_go, n_errors_go, elapsed_go = (24183, 0, 1.264153416) n_processed_go, n_errors_go, elapsed_go
(24183, 0, 1.264153416)
golang_speedup = elapsed/elapsed_go golang_speedup
Falsifiable’s notebook parsing is fast. Given that deserialization populates structs, navigating the resulting notebook easy (read: IDE-assisted). Limited to parsing, validating, and rendering notebooks – locally or on falsifiable.com – this is overkill. However, for bulk processing, 5x is nothing to shrug off.]