Saturn Cloud vs Google Colab for Data Science and Machine Learning
Which Tool is Better for Scalable Data Science?
Ask any data scientist for the most common tool they use at work. Chances are, you will hear a lot about Jupyter notebooks or Google Colab. That’s no surprise, since data scientists often need an interactive environment to code: to see the results of data wrangling immediately, to extract insights from visualizations, and to monitor the performance of machine learning models closely. I, for one, would love for my code to execute very quickly, if not immediately. That is usually achieved with the help of GPUs and parallel programming.
After a machine learning model is developed and tested on a local environment, it has to go live and be deployed. Unfortunately, the process of deployment is… complicated. The work runs the gamut from creating servers to setting up security protocols. Getting those done usually requires a dedicated DevOps engineer who is an expert in cloud services, operating systems, and networking.
Now, how can data scientists drastically speed up the development process? And how can data scientists like ourselves bypass the need for a DevOps engineer and deploy our models ourselves?
This is what Saturn Cloud can do for data practitioners. Saturn helps teams train and deploy models with distributed systems and GPUs.
Curious? In this blog post, I will show how Saturn Cloud is similar to (but also very different from) Google Colab, and how you can use Saturn Cloud to do things that optimize your data science workflow.
Here are six differences between Saturn Cloud and Google Colab.
- Pricing
- Appearance of the Coding Interface
- Ease of Deploying and Sharing a Dashboard
- Customizability of the Runtime
- Efficiency of Code from Parallel Computing
- Level of Support
Pricing
Both Saturn Cloud and Google Colab offer free and paid services. The table below compares the two services.
For many beginner users, the services offered by Google Colab Free are more than sufficient. Those who want more resources than Google Colab Free provides may opt for Colab Pro at a rather affordable price of $9.99. However, those who are looking for greater flexibility and ease of deployment would find Saturn Cloud an attractive alternative to Google Colab.
Note that while all of the features discussed in this article are available to the free-tier Saturn Cloud, the full capability of Saturn Cloud is unlocked with the Pro Version.
Appearance of the Coding Interface
Google Colab’s interface resembles that of a Jupyter notebook, with a few unique features added. These include a left pane that shows the folder directory and two bars in the top right corner that show resource usage.
On the other hand, the Saturn Cloud coding environment is exactly like that of JupyterLab. In fact, Saturn is built on JupyterLab. JupyterLab provides data scientists with an interface to code in Jupyter notebooks, access shell terminals and move files in a GUI environment.
If you prefer running your code in a plain Jupyter notebook, you can do that on Saturn Cloud too.
Ease of Deploying and Sharing an Interactive Dashboard
Many Python packages allow us to build interactive dashboards. These packages include Plotly Dash, Voila and Bokeh. Such dashboards can be built on both Google Colab and Saturn Cloud. In this example, I will replicate an interactive dashboard from the Voila Gallery using both Google Colab and Saturn Cloud.
To share the dashboard, the author needs to deploy it to a server so that visitors can interact with the visualization. The ease of deployment is the key difference between Google Colab and Saturn Cloud.
On Google Colab, the user might need to rely on third-party solutions such as Heroku or ngrok for deployment, which is somewhat tedious. On the other hand, deploying a dashboard on Saturn Cloud is relatively straightforward, requiring only five clicks, because Saturn Cloud has already taken care of the heavy lifting related to deployment.
In short, while it is possible to deploy an interactive visualization on both Google Colab and Saturn Cloud, the latter saves you valuable time in doing so.
Customizability of the Runtime
One of the most attractive perks of Google Colab is its free runtime, which includes CPU, GPU, TPU and around 12 GB of RAM (at the time of writing). The advent of cloud-based notebooks like Google Colab with free resources has indeed democratized deep learning. Nowadays, anyone with a Google account can get their hands on a GPU to train a neural network.
However, to keep Colab free for everyone, the types of GPUs available in Colab vary over time, and there is no way to choose the type of GPU that one can connect to at any given time, even if you are a Colab Pro user.
Moreover, if you are a geek who has used Google Colab extensively, you must have run into one of these devastating screens…
- Your Google Drive has insufficient space to store your data or model
- Your GPU memory usage is cut off
- Your session crashes because it has run out of memory
- Your session has timed out after 1 hour of inactivity or after 12 hours of running the notebook
These screens can be very disruptive. If the GPU memory usage is cut off, your model might not be sufficiently trained. If your runtime crashes while you’re training a model, you essentially lose all your progress. Granted, there are hacks to address each scenario. However, we cannot be sure such hacks will work indefinitely. One can sign up for Colab Pro to get priority access to such resources, but they are not guaranteed.
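One workaround of the kind mentioned above, for the crash and timeout scenarios, is to checkpoint progress to persistent storage after every epoch so that a restarted session can resume instead of starting over. The sketch below only illustrates the idea and is not a recipe specific to Colab or Saturn Cloud: the checkpoint path, the `train_one_epoch` stand-in and the layout of the saved state are all hypothetical.

```python
import json
import os
import tempfile

# Hypothetical checkpoint location. On a hosted notebook this would point at
# persistent storage rather than a temporary directory.
CHECKPOINT_PATH = os.path.join(tempfile.gettempdir(), "training_checkpoint.json")


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "loss_history": []}


def save_checkpoint(state, path):
    """Write to a temporary file, then rename it, so that a crash in the
    middle of a write never leaves a half-written checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train_one_epoch(epoch):
    """Stand-in for real training work; returns a made-up loss value."""
    return 1.0 / (epoch + 1)


def train(total_epochs, path=CHECKPOINT_PATH):
    """Run (or resume) training, saving a checkpoint after every epoch."""
    state = load_checkpoint(path)
    # Resume from the first epoch that has not been completed yet.
    for epoch in range(state["epoch"], total_epochs):
        state["loss_history"].append(train_one_epoch(epoch))
        state["epoch"] = epoch + 1
        save_checkpoint(state, path)
    return state
```

Calling `train(10)` again after an interruption picks up at the epoch recorded in the checkpoint file. A real model would also need its weights saved alongside this bookkeeping.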
This is where Saturn Cloud’s hardware customizability comes in handy.
Before running a notebook on Saturn Cloud, one has to create a workspace. The user is free to customize the workspace, including the disk space needed (from 10 GB to 1000 GB), the hardware (CPU/GPU), the size (from 2 cores with 4 GB to 64 cores with 512 GB), and the shut-off duration (from 1 hour to Never). Since you are the boss of this workspace, you need not worry about running out of space, GPU access, memory, or runtime.
A bonus is the ability to set up a workspace with an existing Docker image. This means that the workspace will be set up with certain packages dictated by the Docker image, allowing you to reproduce code that someone else has written.
Efficiency of Code from Parallel Computing
Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem. Unlike non-parallel (or serial) programming, which solves a big problem sequentially, parallel programming breaks a problem down into smaller pieces and solves these small problems simultaneously.
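The split, solve and combine pattern described above can be sketched with nothing but the standard library. This is only an illustration: the function names and the sum-of-squares task are made up, and a thread pool is used purely to keep the example portable. For CPU-bound pure-Python work, a process pool or a library such as Dask is what actually delivers a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(numbers):
    """Solve one small piece of the problem."""
    return sum(n * n for n in numbers)


def split_into_chunks(numbers, n_chunks):
    """Break the big problem into roughly equal smaller pieces."""
    chunk_size = max(1, -(-len(numbers) // n_chunks))  # ceiling division
    return [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]


def serial_sum_of_squares(numbers):
    """Serial version: one worker walks through the whole list."""
    return sum_of_squares(numbers)


def parallel_sum_of_squares(numbers, n_workers=4):
    """Parallel version: each worker handles one chunk, then the partial
    results are combined into the final answer."""
    chunks = split_into_chunks(numbers, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)
```

Both versions return the same number. The only difference is how the work is scheduled.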
Parallel programming can speed up computationally heavy tasks. For example, parallel programming allows programmers to load and handle large datasets (especially if they are too large to be held in memory). It also allows the process of hyperparameter tuning to be quicker.
There are many packages in Python for parallel programming. One of the most well-known is Dask. It is designed to work well on both single-machine setups and multi-machine clusters, and it can be used with pandas, NumPy, scikit-learn and other Python libraries. For more information on Dask, you can refer to this documentation on Dask by Saturn Cloud.
You can use Dask on both Google Colab and Saturn Cloud. However, one has a greater potential to speed up their code on Saturn Cloud than on Google Colab.
This is because the amount of speedup that one can obtain from parallelizing code depends on the specification of the workspace. Intuitively, if a CPU has more cores, then there are more workers to work on these small problems. This means that there is a greater potential for speedup from parallel programming.
Since one can customize the specification of the workspace in Saturn Cloud but not on Google Colab, a user of Saturn Cloud can decide on the number of cores of the runtime based on how much speedup is needed while a user of Google Colab is not able to.
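To get a rough feel for how the number of cores bounds the achievable speedup, one can use Amdahl's law, which is brought in here purely as an illustration and is not something Saturn Cloud or Colab reports. The sketch assumes, hypothetically, that 95% of a job can be parallelized. The function name and that fraction are illustrative, and the core counts echo the smallest and largest workspace sizes mentioned earlier.

```python
import os


def ideal_speedup(parallel_fraction, n_workers):
    """Amdahl's law: upper bound on speedup when only part of the work
    can be spread across workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Number of cores visible to the current runtime (None on some platforms).
print("cores available here:", os.cpu_count() or 1)

# Hypothetical job in which 95% of the work can run in parallel.
for cores in (2, 8, 64):
    print(f"{cores:>2} cores: at most {ideal_speedup(0.95, cores):.1f}x faster")
```

The bound grows with the core count but flattens out, which is why being able to pick the workspace size to match the job matters more than simply having the largest machine.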
Moreover, one can make use of the Dask clusters available on Saturn Cloud to speed up Dask commands even further. This increases the number of workers that are working on the problem. (Free Saturn Cloud users have 3 hours to use Dask clusters with 3 workers every month.) This feature is not available on Google Colab.
Let’s visualize this using a toy example. We first create a square matrix of 10,000 rows, populate it with random numbers, then take the sum of the matrix and its transpose, and finally find the mean of a slice of the result along one of the axes.
Here are three code blocks that illustrate this point.
- A code block written in NumPy arrays on Google Colab Free
- A code block written in Dask arrays on Google Colab Free
- A code block written in Dask arrays on Saturn Cloud with a Dask cluster.
Going by the timings below, run time drops by roughly 60% when we switch from NumPy arrays to Dask arrays (2.14 s to 866 ms), and by a further 97% when we switch from Google Colab Free to Saturn Cloud (866 ms to 26.7 ms).
Code block 1: NumPy arrays on Google Colab Free
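>>> import numpy as np
>>> def test():
...     x = np.random.random((10000, 10000))  # 10,000 x 10,000 random matrix
...     y = x + x.T                           # add the matrix to its transpose
...     z = y[::2, 5000:].mean(axis=1)        # row means over a slice of y
...     return z
...
>>> %timeit test()
1 loop, best of 5: 2.14 s per loop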
Code block 2: Dask arrays on Google Colab Free
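>>> import dask.array as da
>>> def test():
...     # Same matrix, split into 1,000 x 1,000 chunks that Dask can process in parallel
...     x = da.random.random((10000, 10000), chunks=(1000, 1000))
...     y = x + x.T
...     z = y[::2, 5000:].mean(axis=1)
...     return z  # lazy Dask array; nothing has been computed yet
...
>>> %timeit test().persist()
1 loop, best of 5: 866 ms per loop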
Code block 3: Dask arrays on Saturn Cloud with a Dask cluster
>>> from dask.distributed import Client
>>> from dask_saturn import SaturnCluster
>>> import dask.array as da
>>> cluster = SaturnCluster(n_workers=3)  # Dask cluster with 3 workers
>>> client = Client(cluster)              # send subsequent Dask work to the cluster
>>> def test():
...     x = da.random.random((10000, 10000), chunks=(1000, 1000))
...     y = x + x.T
...     z = y[::2, 5000:].mean(axis=1)
...     return z
...
>>> %timeit test().persist()
26.7 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
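The relative changes between the three runs can be recomputed directly from the timings reported in the code blocks above. The short sketch below does the arithmetic; the helper names are illustrative.

```python
def percent_reduction(before_seconds, after_seconds):
    """Percentage by which the run time dropped."""
    return (1 - after_seconds / before_seconds) * 100


def speedup_factor(before_seconds, after_seconds):
    """How many times faster the second measurement is."""
    return before_seconds / after_seconds


# Timings reported in the three code blocks, converted to seconds.
numpy_colab = 2.14
dask_colab = 0.866
dask_saturn = 0.0267

comparisons = [
    ("NumPy -> Dask on Colab Free", numpy_colab, dask_colab),
    ("Dask on Colab Free -> Dask on Saturn Cloud", dask_colab, dask_saturn),
]
for label, before, after in comparisons:
    print(f"{label}: {percent_reduction(before, after):.1f}% less time, "
          f"{speedup_factor(before, after):.1f}x faster")
```

With these inputs the reductions work out to about 59.5% and 96.9% respectively.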
The Level of Support
Google Colab does not offer personalized support, so your best bet for solving a problem is Stack Overflow. On the other hand, Saturn Cloud has a Slack community and a support team that you can reach out to with your problem.
Conclusion
While Google Colab is an amazing tool for data science, it has limitations in customizability, resource reliability, efficiency and ease of deployment. For data science beginners who are just starting out, Google Colab is your best bet. For intermediate to advanced data science practitioners who are looking for a complete solution to deploy data science solutions efficiently, it is worth considering Saturn Cloud. Sure, you might need to invest a little time and money in Saturn Cloud, but the gain in efficiency from adopting it is likely to outweigh its cost.
In conclusion, Google Colab is great for personal small-scale data science projects, while Saturn Cloud is the winner for scalable data science. If you’re looking to try it out, feel free to start experimenting with Saturn Cloud.