Fast Jupyter Notebook Parsing
A Crude Comparison of Falsifiable vs nbformat
Fernando Pérez chose JSON as the original IPython serialization format. This continues to make basic parsing trivial. However, nbformat does more than merely loading a JSON document because not every JSON document is a compliant notebook.See jupyter/nbformat for relevant JSON schemas. Meanwhile, the abreka
command line tool used by falsifiable
is a golang application. There are a lot of reasons for this decision. The most important ones will become obvious closer to v0.1.0
. But in the near term, cross-compilation of a self-contained and performant binary simplifies a lot of work and use.
This post quickly confirms superior performance. Both the canonical python package and my golang module sequentially parse a corpus of 24,183 notebooks (about 10GB), checking each one for an error.
Python Performance
import nbformat
import time
start_time, n_processed, n_errors = time.time(), 0, 0
with open(FILE_LIST_PATH) as list_fp:
for line in list_fp:
with open(line.strip()) as fp:
try:
nbformat.read(fp, as_version=4)
except KeyboardInterrupt:
raise
except:
n_errors += 1
n_processed += 1
end_time = time.time()
elapsed = (end_time-start_time) / 60
n_processed, n_errors, elapsed
(24183, 1, 6.845025225480398)
Golang Performance
Falsifiable is not yet open source. When it is, I’ll repost this analysis as a repository. The code I’m testing does full validation against nbformat version 4.4, with failures “relaxed” in the sense errors are swallowed.Strict adherence is faster given early termination. It uses buger/jsonparser for the deserialization phase before building the associated structs (e.g. MarkdownCell
, DisplayData
, etc). The following was copy/pasted from the benchmark.
n_processed_go, n_errors_go, elapsed_go = (24183, 0, 1.264153416)
n_processed_go, n_errors_go, elapsed_go
(24183, 0, 1.264153416)
golang_speedup = elapsed/elapsed_go
golang_speedup
5.414710856170639
Conclusion
Falsifiable’s notebook parsing is fast. Given that deserialization populates structs, navigating the resulting notebook easy (read: IDE-assisted). Limited to parsing, validating, and rendering notebooks – locally or on falsifiable.com – this is overkill. However, for bulk processing, 5x is nothing to shrug off.]