Notebooks are great for invoking existing functions and exploring data.
Notebooks aren't ideal for creating functions (standard text editor features are lacking and testing is impossible).
Notebooks encourage an "order dependent variable assignment" programming style without abstractions. Here's what you'll commonly see in a notebook:
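import org.apache.spark.sql.functions.{col, trim}
val df = spark.read.option("header", "true").csv("some_data") // header row supplies the "name" column
val df2 = df.withColumn("clean_name", trim(col("name")))
val df3 = df2.filter(col("clean_name") === "Mark")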
I've found that notebooks are very useful if you write all the complicated code in separate GitHub repos and attach binary executables to the cluster. If you try to write all your logic in notebooks, you'll quickly struggle with order-dependent, messy code.
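One way to picture the alternative is to pull each step into a small named function that lives in a regular source file, where it can be unit tested, and leave only the calls in the notebook. The sketch below is a minimal, dependency-free Python stand-in for the Spark snippet above, not a Spark example: plain lists of dicts play the role of DataFrames, and `with_clean_name`, `filter_by_name` and `marks_only` are hypothetical names, not part of any library.

```python
# Hypothetical module, e.g. kept in its own repo and installed on the cluster.
# Plain lists of dicts stand in for DataFrames so the sketch runs anywhere.

def with_clean_name(rows):
    """Return new rows with a whitespace-trimmed `clean_name` field added."""
    return [{**row, "clean_name": row["name"].strip()} for row in rows]


def filter_by_name(rows, name):
    """Keep only the rows whose `clean_name` equals `name`."""
    return [row for row in rows if row["clean_name"] == name]


def marks_only(rows):
    """Compose the steps so callers do not depend on cell execution order."""
    return filter_by_name(with_clean_name(rows), "Mark")


# What the notebook cell would then contain: one call to tested code.
raw = [{"name": " Mark "}, {"name": "Ann"}]
result = marks_only(raw)
print(result)
```

Because each function takes its input as an argument and returns a new value, rerunning a cell out of order cannot silently pick up a stale `df2`, and the functions can be covered by ordinary unit tests outside the notebook.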
I had the same trouble with order dependence once notebooks got to a certain size, so my team and I created and open-sourced a library, Loman, to help with that. It lets you interactively build a graph whose nodes represent inputs or functions, and it keeps track of state as you change or add inputs and intermediate functions and request recalculations. Our experience with this way of working has been broadly positive. As graphs get larger, it's easy to lift them into code files in libraries while continuing to modify or extend them in notebooks. The graph structure and visualization make it easy to return to Loman graphs with up to the low hundreds of nodes, which would otherwise make for a fearsome notebook. It also makes it easy to bolt Qt or Bokeh UIs onto them for interactive dashboards: just bind UI widgets and events to the inputs, and widgets to the outputs. Graphs can be serialized, which is useful for tracking down exceptions in intermediate calculations when we run them periodically in Airflow, as you can see all the inputs to the failing calculation and its upstream nodes.
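To make the graph idea concrete, here is a toy illustration in plain Python. It is not Loman's actual API: `ToyGraph` and its methods are hypothetical names, and the sketch assumes a node's upstream dependencies can be read from its function's parameter names. It only shows the core behaviour described above, namely that changing an input marks downstream nodes stale and a later request recomputes just those.

```python
import inspect


class ToyGraph:
    """Toy dependency graph: nodes are either input values or functions."""

    def __init__(self):
        self._funcs = {}   # node name -> function computing it
        self._deps = {}    # node name -> names of the nodes it reads
        self._values = {}  # node name -> cached up-to-date value

    def add_input(self, name, value):
        """Set or change an input and invalidate everything downstream."""
        self._invalidate(name)
        self._values[name] = value

    def add_function(self, name, func):
        """Register a computed node; its parameter names are its upstream nodes."""
        self._invalidate(name)
        self._funcs[name] = func
        self._deps[name] = list(inspect.signature(func).parameters)

    def _invalidate(self, name):
        # Drop the cached value of `name` and, recursively, of its dependents.
        self._values.pop(name, None)
        for node, deps in self._deps.items():
            if name in deps and node in self._values:
                self._invalidate(node)

    def value(self, name):
        """Return a node's value, recomputing only what is stale."""
        if name not in self._values:
            args = [self.value(dep) for dep in self._deps[name]]
            self._values[name] = self._funcs[name](*args)
        return self._values[name]


g = ToyGraph()
g.add_input("price", 2.0)
g.add_input("quantity", 3)
g.add_function("total", lambda price, quantity: price * quantity)
g.add_function("with_tax", lambda total: total * 1.5)

first = g.value("with_tax")    # computes total, then with_tax
g.add_input("quantity", 4)     # marks total and with_tax stale
second = g.value("with_tax")   # recomputes both from the new input
print(first, second)
```

The order in which nodes are added or cells are run does not matter here, since values are derived from the graph structure on request. A real library adds much more on top (visualization, serialization, error states), none of which this sketch attempts.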
The function bit has always confused me. I tend to write code with lots of functions, modules, and classes for handling various aspects of the analysis, and I just don't understand how that's supposed to be integrated into a notebook. Instead, notebooks seem better designed for small code snippets that rely on well-known libraries. I'd be happy to jump on the notebook bandwagon, but I'm having trouble seeing how I could adapt my code to the notebook style.