Scheduling Jupyter Notebooks at Meta

Jul 21, 2023

At Meta, Bento is our internal Jupyter notebooks platform, used by many internal teams. Notebooks are also widely used for creating reports and workflows (for example, performing data ETL) that need to be repeated at certain intervals. Users with such notebooks had to remember to run them manually at the required cadence, a process that is easy to forget and that does not scale with the number of notebooks involved.

To address this problem, we invested in building a scheduled notebooks infrastructure that fits in seamlessly with the rest of the internal tooling available at Meta. Investing in infrastructure helps ensure that privacy is inherent in everything we build. It enables us to continue building innovative, valuable solutions in a privacy-safe way.

The ability to transparently answer questions about how data flows through Meta's systems, both for data privacy and for regulatory compliance, differentiates our scheduled notebooks implementation from the rest of the industry.

In this post, we’ll explain how we married Bento with our batch ETL pipeline framework called Dataswarm (think Apache Airflow) in a privacy and lineage-aware manner.

At Meta, we’re committed to improving confidence in production by performing static analysis on scheduled artifacts and by maintaining coherent narratives around dataflows through transparent Dataswarm Operators and data annotations. Notebooks pose a special challenge on both of these fronts: they contain arbitrary code rather than transparent operators, and their inputs and outputs are not natively annotated.

These considerations shaped our design decisions. In particular, we limited scheduling to notebooks that primarily perform ETL and to those that perform data transformations and display visualizations. Notebooks with any other side effects are currently out of scope and are not eligible to be scheduled.

There are three main components for supporting scheduled notebooks: a Dataswarm operator (BentoOperator) that executes the notebook, a sandboxed execution environment, and input/output data annotations that capture how data flows.

To address most of the concerns highlighted above, we execute the notebook in a container without access to the network. We also leverage input and output data annotations to show the flow of data.
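
Bento’s internal execution harness is not public, but the core idea, running a checked-in notebook non-interactively and persisting its outputs, can be sketched with the open-source nbclient library. Network isolation itself would come from the container the process runs in, which is not shown here:

```python
# A minimal sketch of non-interactive notebook execution using the
# open-source nbclient library. Bento's internal harness is not public;
# this only illustrates the shape of a scheduled run. Network isolation
# would come from the container this process runs in (not shown here).
import nbformat
from nbclient import NotebookClient

def run_notebook(path: str, out_path: str, timeout: int = 3600) -> None:
    nb = nbformat.read(path, as_version=4)
    client = NotebookClient(nb, timeout=timeout, kernel_name="python3")
    client.execute()  # runs every cell top to bottom, raising on failure
    nbformat.write(nb, out_path)  # persist outputs for review and debugging

if __name__ == "__main__":
    run_notebook("report.ipynb", "report.out.ipynb")
```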

For ETL, we fetch data and write it out in a way that keeps the dataflow transparent: inputs are read through SQL cells, whose queries can be analyzed to derive the tables a notebook reads, and outputs are written through APIs that record the corresponding data annotations.
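
As a rough illustration, here is what a schedulable ETL notebook might look like. The helper names `run_sql_cell` and `write_table` are hypothetical stand-ins for Bento’s transparent input and output paths, not real Meta APIs:

```python
# A hypothetical, runnable sketch of a schedulable ETL notebook. The
# helpers below are illustrative stand-ins, not Meta APIs: `run_sql_cell`
# models a Bento SQL cell (the query text is what makes the input
# transparent), and `write_table` models a lineage-aware output write.
import pandas as pd

def run_sql_cell(query: str) -> pd.DataFrame:
    # In the real system, the query would be parsed to extract the input
    # tables for lineage; here we just return canned rows.
    print(f"[lineage] input query: {query.strip()}")
    return pd.DataFrame({"user_id": [1, 2], "events": [10, 7]})

def write_table(df: pd.DataFrame, table: str, partition: str) -> None:
    # The (table, partition) pair would be recorded as an output annotation.
    print(f"[lineage] output: {table}/{partition} ({len(df)} rows)")

daily = run_sql_cell(
    "SELECT user_id, COUNT(*) AS events FROM events_raw "
    "WHERE ds = '2023-07-21' GROUP BY user_id"
)
write_table(daily, table="events_daily_agg", partition="ds=2023-07-21")
```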

We have integrated BentoOperator with Meta’s data purpose framework to ensure that data is used only for its intended purpose. This framework ensures that the data usage purpose is respected as data flows and transmutes across Meta’s stack. As part of scheduling a notebook, the user supplies a “purpose policy zone,” which serves as the integration point with the data purpose framework.
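
Dataswarm is internal and its API is not public, so the following is only an Airflow-flavored sketch of what a generated pipeline task might look like. Every field shown on `BentoOperator` (notebook path, purpose policy zone, owner) is a hypothetical name mirroring the concepts described above:

```python
# Dataswarm is internal to Meta, so this is only an Airflow-flavored sketch
# of what a generated pipeline task might look like. Every field name here
# (notebook_path, purpose_policy_zone, owner) is hypothetical, mirroring
# the concepts in the post rather than a real API.
from dataclasses import dataclass

@dataclass
class BentoOperator:
    task_id: str
    notebook_path: str        # the notebook checked into source control
    purpose_policy_zone: str  # integration point with the purpose framework
    owner: str                # notified automatically if a run fails

    def run(self, ds: str) -> None:
        print(f"running {self.notebook_path} for {ds} "
              f"in purpose policy zone {self.purpose_policy_zone}")

nightly_report = BentoOperator(
    task_id="nightly_report",
    notebook_path="notebooks/nightly_report.ipynb",
    purpose_policy_zone="analytics",
    owner="reporting-oncall",
)
nightly_report.run(ds="2023-07-21")
```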

Let’s now explore the workflow for scheduling a notebook:

We’ve exposed the scheduling entry point directly from the notebook header, so all users have to do is hit a button to get started.

The first step in the workflow is setting up some parameters that will be used for automatically generating the pipeline for the schedule.

The next step involves previewing the generated pipeline before a Phabricator (Meta’s diff review tool) diff is created.

In addition to the pipeline code for running the notebook, the notebook itself is also checked into source control so it can be reviewed. The results of attempting to run the notebook in a scheduled setup are also included in the test plan.

Once the diff has been reviewed and landed, the schedule starts running the next day. In the event that the notebook execution fails for whatever reason, the schedule owner is automatically notified. We’ve also built a context pane extension directly in Bento to help with debugging notebook runs.

While we’ve addressed the challenge of supporting scheduled notebooks in a privacy-aware manner, the notebooks that are in scope for scheduling are limited to those performing ETL or those performing data analysis with no other side effects. This is only a fraction of the notebooks that users want to eventually schedule. In order to increase the number of use cases, we’ll be investing in supporting other transparent data sources in addition to the SQL cell.

We have also begun work on supporting parameterized notebooks in a scheduled setup. The idea is that instead of checking many notebooks into source control that differ only by a few variables, we check in a single notebook and inject the differentiating parameters at runtime, as sketched below.
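
The open-source Papermill project, which we credit below as an inspiration, already supports this pattern; a minimal sketch with illustrative paths and parameter names:

```python
# Parameterized execution as supported by the open-source Papermill project:
# one notebook in source control, with per-run values injected into a cell
# tagged "parameters" at execution time. Paths and parameter names here are
# illustrative.
import papermill as pm

for region in ["us", "eu", "apac"]:
    pm.execute_notebook(
        "notebooks/regional_report.ipynb",       # single checked-in notebook
        f"/tmp/regional_report_{region}.ipynb",  # per-run output copy
        parameters={"region": region, "ds": "2023-07-21"},
    )
```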

Lastly, we’ll be working on event-based scheduling (in addition to the time-based approach described here) so that a scheduled notebook can wait for predefined events before running. This would include, for example, waiting until all of the data sources the notebook depends on have landed before execution begins; a rough sketch of the idea follows.
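
In its simplest form, event-based triggering could amount to polling until every upstream partition has landed and only then kicking off the run. `partition_exists` below is a hypothetical stand-in for a warehouse metadata check, not a real Meta API:

```python
# A rough sketch of event-based triggering: poll until every upstream
# partition the notebook reads has landed, then kick off the run.
# `partition_exists` is a hypothetical stand-in for a warehouse metadata
# check; nothing here is a real Meta API.
import time

def partition_exists(table: str, partition: str) -> bool:
    # Stand-in: a real implementation would query warehouse metadata.
    return True

def wait_for_inputs(deps: list[tuple[str, str]], poll_secs: int = 300) -> None:
    while not all(partition_exists(t, p) for t, p in deps):
        time.sleep(poll_secs)

wait_for_inputs([("events_raw", "ds=2023-07-21")])
# ...then trigger the notebook run, e.g., via the operator sketched earlier.
```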

Some of the approaches we took were directly inspired by the work done on Papermill.