Converting Jupyter Notebooks Into Blog Posts with Gatsby

October 05, 2020 · 8 min read

Last reviewed October 30, 2020

This article was originally posted on the LogRocket blog.

Everyone acquainted with data science knows that Jupyter Notebooks are the way to go. They easily allow you to mix Markdown with actual code, creating a lively environment for research and learning. Code becomes user-friendly and nicely formatted: you can write about it and generate dynamic charts, tables, and images on the go.

Writing Notebooks is so pleasant that it is only natural to want to share them on the internet. Sure, you can host them on GitHub or even in Google Colab, but that will require a running kernel, and it’s definitely not as friendly as a good ol’ webpage.

Before we go any further, it’s important to understand that a Jupyter Notebook is nothing more than a collection of JSON objects containing inputs, outputs, and tons of metadata. Jupyter renders the outputs from that JSON, and the file can easily be converted into other formats (such as HTML).
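To make the “it’s just JSON” point concrete, here is a minimal sketch that parses the text of a .ipynb file and summarizes what it contains. The summarizeNotebook helper and the sample notebook are made up for illustration; only the field names (cells, cell_type, outputs, metadata, nbformat) come from the Notebook format itself.

```javascript
// Hypothetical helper (not part of Jupyter or Gatsby): summarize a notebook
// given the raw text of a .ipynb file.
function summarizeNotebook(ipynbText) {
  const notebook = JSON.parse(ipynbText)
  const cells = notebook.cells || []
  const countByType = {}
  let outputCount = 0
  for (const cell of cells) {
    countByType[cell.cell_type] = (countByType[cell.cell_type] || 0) + 1
    // Only code cells carry an `outputs` array.
    outputCount += (cell.outputs || []).length
  }
  return {
    nbformat: notebook.nbformat,
    metadataKeys: Object.keys(notebook.metadata || {}),
    countByType,
    outputCount,
  }
}

// A tiny hand-written notebook (invented sample data).
const sample = JSON.stringify({
  metadata: { kernelspec: { name: 'python3', display_name: 'Python 3' } },
  nbformat: 4,
  nbformat_minor: 2,
  cells: [
    { cell_type: 'markdown', metadata: {}, source: ['# Hello'] },
    {
      cell_type: 'code',
      execution_count: 1,
      metadata: {},
      source: ['1 + 1'],
      outputs: [
        {
          output_type: 'execute_result',
          execution_count: 1,
          metadata: {},
          data: { 'text/plain': ['2'] },
        },
      ],
    },
  ],
})

console.log(summarizeNotebook(sample))
```

In a real script you would pass in the contents of the file (for example, read with Node’s fs module) instead of the inline sample.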

Knowing that Notebooks can become an HTML document is all we need — what remains is finding a way to automate this process so a .ipynb file can become a static page on the internet. My solution to this problem is to use GatsbyJS — notably, one of the best static site generators out there, if not the single best.

Gatsby easily sources data from different formats (JSON, Markdown, YAML, you name it) and statically generates webpages that you can host on the world wide web. The final piece then becomes: instead of transforming Markdown into a post, do the same with a .ipynb file. The goal of this post is to walk you through this process.

Technical challenges

A quick search on the web will show you gatsby-transformer-ipynb. Basically, this is a Gatsby plugin that is able to parse the Notebook file in a way that we can access it later in our GraphQL queries. It’s almost too good to be true!

And, in fact, it is. The hard work was done by the fine folks of nteract. However, the plugin hasn’t been maintained in a while, and things don’t simply work out of the box — not to mention the lack of customization that one would expect from a plugin.

I’ll spare you the boring stuff, but after fussing around the dark corners of GitHub, and with significant help from this post by Specific Solutions, I managed to create my own fork of gatsby-transformer-ipynb, which solves my problems and will suffice for the purpose of this post.

Note, however, that I have no intention of becoming an active maintainer, and most of what I’ve done was solely to get what I need to work. Use it at your own risk!

Enough with the preambles, let’s get to some code.

Creating a project

Firstly, the source code for what we are going to build can be found here on GitHub. We’ll start by creating a Gatsby project. Make sure you have Gatsby installed, and create a new project by running:

gatsby new jupyter-blog
cd jupyter-blog

Run gatsby develop and go to http://localhost:8000/ to make sure everything is working fine.

Create your first Notebook

Since Jupyter Notebooks will be the data source for our brand-new blog, we need to start adding content. Within your project folder, go to src and create a notebooks folder. We’ll make sure to read from this folder later.

It’s time to create our first Notebook. For the purposes of this tutorial, I’ll use this simple Notebook as a base. You can see the dynamic output in GitHub, but feel free to use whichever you want.

In any case, it’s worth mentioning that some rich outputs such as dynamic charts generated by Plotly may need extra care — let me know if you want me to cover that in a later post! To keep this post short, however, we’ll handle only static images, tables, and Markdown.

Now that you have a Gatsby project with data, the next step is to query it using GraphQL.

Querying data

One of the biggest advantages of Gatsby is flexibility when sourcing data. Virtually anything you want can become a data source that can be used to generate static content.

As mentioned above, we’ll be using my own version of the transformer. Go ahead and install it:

yarn add @rafaelquintanilha/gatsby-transformer-ipynb

The next step is to configure the plugins. In gatsby-config.js, add the following to your plugins array (you can always check GitHub when in doubt):

...
{
  resolve: `gatsby-source-filesystem`,
  options: {
    name: `notebooks`,
    path: `${__dirname}/src/notebooks`,
    ignore: [`**/.ipynb_checkpoints`],
  },
},
{
  resolve: `@rafaelquintanilha/gatsby-transformer-ipynb`,
  options: {
    notebookProps: {
      displayOrder: ["image/png", "text/html", "text/plain"],
      showPrompt: false,
    },
  },
},
...

Let’s break it down.

First, we add a gatsby-source-filesystem entry to the array. This tells Gatsby to look for files in src/notebooks, where our .ipynb files live, and to ignore Jupyter’s checkpoint folders. Next, we configure the transformer and set some props:

  • displayOrder – MIME types of the outputs we are displaying, in order of preference
  • showPrompt – whether the prompt is displayed

While prompts make sense in Notebooks, they lose their purpose in static pages. For that reason, we hide them to keep the content clean.
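A single Notebook output can carry the same result in several MIME types at once, which is why the order in displayOrder matters. The sketch below illustrates the idea of picking the first listed type that an output actually provides. It is an illustration under that assumption, not the plugin’s actual code, and pickMimeType and the sample outputs are hypothetical.

```javascript
// Hypothetical helper: return the first MIME type from `displayOrder`
// that the output's MIME bundle contains, or null if none match.
function pickMimeType(outputData, displayOrder) {
  const match = displayOrder.find(mimeType => mimeType in outputData)
  return match === undefined ? null : match
}

const displayOrder = ['image/png', 'text/html', 'text/plain']

// Invented sample outputs: a table with an HTML and a plain-text version,
// and a figure with a PNG and a plain-text version.
const tableOutput = {
  'text/html': '<table>...</table>',
  'text/plain': '   a  b\n0  1  2',
}
const figureOutput = {
  'image/png': 'iVBORw0KGgo...',
  'text/plain': '<Figure size 432x288 with 1 Axes>',
}

console.log(pickMimeType(tableOutput, displayOrder)) // text/html
console.log(pickMimeType(figureOutput, displayOrder)) // image/png
console.log(pickMimeType({ 'application/json': {} }, displayOrder)) // null
```

With the order used in the config above, an image wins over an HTML table, and plain text is only the fallback.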

Time to check whether everything went according to plan. Open GraphiQL by going to http://localhost:8000/___graphql and run the following query:

query MyQuery {
  allJupyterNotebook {
    nodes {
      html
    }
  }
}

Success! Note how the HTML of our notebooks was generated. All that is left is to inject this HTML into a React component and our process will be complete.

Generating posts automatically

The worst is behind us now. The next step is to query this data in gatsby-node.js so we can generate static pages for each Notebook in src/notebooks.

Note, however, that we need to add additional metadata to our Notebook, e.g., author and post title. There are several ways of doing it, and the simplest is probably to take advantage of the fact that .ipynb files are JSON and use their own metadata field. Open the .ipynb and add the info you need:

{
 "metadata": {
  "author": "Rafael Quintanilha",
  "title": "My First Jupyter Post",
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  ...
 ]
}

Pro tip: If you’re using VS Code, opening the file will probably launch the Jupyter kernel. You can disable it in the configs to edit the raw content, but I usually just open the file with another editor (such as gedit or Notepad++).
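If you would rather not edit the file by hand, the same change can be scripted, since the Notebook is plain JSON. Below is a minimal sketch: withPostMetadata is a hypothetical helper, and the notebook object is a stripped-down stand-in for a parsed .ipynb file.

```javascript
// Hypothetical helper: return a copy of the notebook with the post's
// author and title merged into its top-level metadata.
function withPostMetadata(notebook, { author, title }) {
  return {
    ...notebook,
    metadata: { ...notebook.metadata, author, title },
  }
}

// Stand-in for JSON.parse(<contents of the .ipynb file>).
const original = {
  metadata: { kernelspec: { name: 'python3', display_name: 'Python 3' } },
  nbformat: 4,
  nbformat_minor: 2,
  cells: [],
}

const updated = withPostMetadata(original, {
  author: 'Rafael Quintanilha',
  title: 'My First Jupyter Post',
})

// One-space indentation matches the layout of the file shown above.
console.log(JSON.stringify(updated, null, 1))
```

The helper returns a new object rather than mutating its input, so the existing metadata (kernelspec and friends) is preserved and the original stays untouched until you write the result back to disk.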

The process now is exactly the same as for any other data source with Gatsby. We’ll query the data in gatsby-node.js and pass the relevant info to a post template, which, in turn, will become a unique page in our domain.

Before getting to that, however, open gatsby-node.js and add the following:

exports.onCreateNode = ({ node, actions }) => {
  const { createNodeField } = actions
  if (node.internal.type === 'JupyterNotebook') {
    createNodeField({
      name: 'slug',
      node,
      value: node.json.metadata.title
        .split(' ')
        .map(token => token.toLowerCase())
        .join('-'),
    })
  }
}

The above excerpt will, for every node created in GraphQL, check those that are a Jupyter Notebook and extend them with a new field, slug. We are using a naive approach here, but you can use a robust library such as slugify. The new field will be queried and used to generate the post path. In the same file, add the following:

const path = require(`path`)
exports.createPages = async ({ graphql, actions: { createPage } }) => {
  const blogPostTemplate = path.resolve(`src/templates/BlogPost.js`)
  const results = await graphql(
    `
      {
        allJupyterNotebook {
          nodes {
            fields {
              slug
            }
          }
        }
      }
    `
  )
  const posts = results.data.allJupyterNotebook.nodes
  posts.forEach(post => {
    createPage({
      path: post.fields.slug,
      component: blogPostTemplate,
      context: {
        slug: post.fields.slug,
      },
    })
  })
}
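One caveat about createPages: when a query is invalid, graphql resolves with an errors array instead of rejecting, and reading results.data then fails with an unhelpful TypeError. A minimal sketch of a guard is shown below. toPageSpecs is a hypothetical helper, and the two result objects (including the error message) are invented stand-ins for what await graphql(...) resolves to.

```javascript
// Hypothetical helper: turn a Gatsby GraphQL result into createPage arguments,
// failing loudly if the query itself reported errors.
function toPageSpecs(result, component) {
  if (result.errors) {
    const messages = result.errors.map(error => error.message).join('; ')
    throw new Error(`Notebook query failed: ${messages}`)
  }
  return result.data.allJupyterNotebook.nodes.map(node => ({
    path: node.fields.slug,
    component,
    context: { slug: node.fields.slug },
  }))
}

// Invented sample results.
const okResult = {
  data: {
    allJupyterNotebook: {
      nodes: [{ fields: { slug: 'my-first-jupyter-post' } }],
    },
  },
}
const brokenResult = { errors: [{ message: 'Syntax Error (sample message)' }] }

console.log(toPageSpecs(okResult, '/path/to/BlogPost.js'))

try {
  toPageSpecs(brokenResult, '/path/to/BlogPost.js')
} catch (error) {
  console.log(error.message)
}
```

Inside createPages, the forEach loop could then become toPageSpecs(results, blogPostTemplate).forEach(spec => createPage(spec)).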

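The createPages function queries the slug of every Notebook and creates one page per slug, each rendered by the BlogPost.js template and given the slug as page context. Let’s create that template now in src/templates/BlogPost.js: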

import React from 'react'
import { graphql } from 'gatsby'
import SEO from '../components/seo'

const BlogPost = ({
  data: {
    jupyterNotebook: {
      json: { metadata },
      html,
    },
  },
}) => {
  return (
    <div>
      <SEO title={metadata.title} />
      <h1>{metadata.title}</h1>
      <p>Written by {metadata.author}</p>
      <div dangerouslySetInnerHTML={{ __html: html }} />
    </div>
  )
}
export default BlogPost
export const query = graphql`
  query BlogPostBySlug($slug: String!) {
    jupyterNotebook(fields: { slug: { eq: $slug } }) {
      json {
        metadata {
          title
          author
        }
      }
      html
    }
  }
`

And that’s it! Hop over to http://localhost:8000/my-first-jupyter-post and see your Notebook as a static HTML page.
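The path in that URL comes straight from the slug built in onCreateNode, which only lowercases the title and swaps spaces for dashes. Titles with punctuation or accents would leak those characters into the URL. If you don’t want to pull in a library such as slugify, a sturdier version can be sketched in a few lines. toSlug is a hypothetical helper, and stripping accents this way is one design choice among several.

```javascript
// Hypothetical helper: build a URL-friendly slug from a post title.
function toSlug(title) {
  return title
    .normalize('NFKD') // split accented letters into base letter + diacritic
    .replace(/[\u0300-\u036f]/g, '') // drop the diacritics
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse every other run of characters into one dash
    .replace(/^-+|-+$/g, '') // trim leading and trailing dashes
}

console.log(toSlug('My First Jupyter Post')) // my-first-jupyter-post
console.log(toSlug('  Pandas & NumPy: A Primer!  ')) // pandas-numpy-a-primer
console.log(toSlug('Análise de Dados')) // analise-de-dados
```

To use it, replace the split/map/join chain in onCreateNode with toSlug(node.json.metadata.title). Note that two different titles can still map to the same slug, so unique titles remain your responsibility.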

Improvements

As you can see, a lot can be improved upon in terms of styling and design. This is beyond the scope of this post, but as a hint, you can use CSS Modules to enhance the layout and remove unnecessary stdout (text output that you don’t care about in a blog post). Create BlogPost.module.css and add the following:

.content {
  max-width: 900px;
  margin-left: auto;
  margin-right: auto;
  padding: 40px 20px;
}

.content :global(.nteract-display-area-stdout),
.content :global(.nteract-outputs > .cell_display > pre) {
  display: none;
}

.content :global(.nteract-outputs > .cell_display > img) {
  display: block;
}

.content :global(.input-container) {
  margin-bottom: 20px;
}

.content :global(.input-container pre.input) {
  border-radius: 10px !important;
  padding: 1em !important;
}
.content :global(.input-container code) {
  line-height: 1.5