Breaking down Docker for data science

Smiral Rashinkar

Have you ever set up a database or an open-source tool like Apache Kafka manually on a VM? Setting up infrastructure for experimentation, or even for production, is quite a fuss. Moreover, replicating the whole environment is even more difficult.


So what the heck is “DOCKER”?

According to the official definition: "Docker is a set of platform-as-a-service (PaaS) products that use OS-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels."

Now let’s try to understand this in simpler terms.

Say you have a laptop where you have manually installed Python 3.8 with a couple of libraries in a virtual environment, plus a PostgreSQL instance running locally. Now suppose you need the exact same setup on five different cloud VMs; replicating the whole environment by hand can be tedious.

The closest you can get while still doing it manually is to write shell scripts that automate the whole process, but what if your laptop runs Windows and the target VMs run different Linux distributions like CentOS or Ubuntu? In some cases you will find the scripts behave differently even across builds of the same OS.

What about creating a VM on your local device? This is where virtualization comes in. In simpler terms, you can have a different OS running on the same laptop at the same time, sharing the host's resources. A VM ships a fully featured OS distribution, and everything works as expected.

Great! Or not so great?

The question here is: do you need all those features? An OS is huge precisely because of those features, and that overhead can be optimized away.

So, think of Docker as an engine that runs an encapsulated environment: something like a VM, but with only the features you have specifically asked for and just enough of the operating system for it to function as expected. Containers are these optimized environments; they run in isolation while sharing resources with the host machine. You can enable networking through ports so that containers can talk to the local network, a VPC or public networks, and you can mount local or other persistent storage into a container so that changes persist after the container is stopped.
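To make the port and volume ideas concrete, here is a minimal sketch of a docker run invocation; the image name, host path and port numbers are placeholders for illustration, not part of any specific setup:

# Publish container port 8888 on host port 8888 (host:container) and mount a
# local folder at /data inside the container so files persist across restarts.
# "some-image" and "/path/on/host" are placeholders; substitute your own.
docker run -p 8888:8888 -v /path/on/host:/data some-image:latest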

Why should I use Docker for Data Science?

Have you been experimenting with data science packages on your local machine using package managers like pip and conda, wrestling with virtual environments to resolve package dependencies, or struggling to replicate your results because the environment changed underneath you? Sometimes, in the name of an experiment, we go berserk on our local setup and accidentally break dependencies or even our GPU driver configuration. Docker can solve all of these problems for you.

How so?

You can run a whole Jupyter environment (with or without a conda environment), and you can get GPU support with minimal setup. Many Docker images come with the proper driver and library versions required to use the GPU efficiently. For example:

Dockerfile for setting up container
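A minimal Dockerfile along these lines might look like the following; the base image tag, the JupyterLab install and the choice of port 80 are assumptions chosen to match the run command further below, not necessarily the exact contents of the sample repo:

# Base image with TensorFlow and Jupyter preinstalled
# (a latest-gpu-jupyter tag also exists if you want GPU support)
FROM tensorflow/tensorflow:latest-jupyter

# Install JupyterLab plus any extra dependencies listed in requirements.txt
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir jupyterlab -r requirements.txt

# JupyterLab will listen on port 80 inside the container
EXPOSE 80

# Bind to all interfaces so the published port is reachable from the host
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=80", "--no-browser", "--allow-root"]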

You can find many official organisations like TensorFlow and PyTorch hosting their official images publicly. You may need to read the overview page to pick the right image tag for your use case, but these are usually self-explanatory. In the example above we used an image with Jupyter and TensorFlow already configured, but if you prefer installing everything yourself and just need a base environment, you could try Miniconda or plain ol' Jupyter Notebook.
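If you just want to try one of these base images, pulling them from Docker Hub is a single command; the tags below are illustrative and may not be the latest recommended ones, so check each project's overview page:

# TensorFlow with Jupyter preinstalled (GPU variants also exist)
docker pull tensorflow/tensorflow:latest-jupyter
# Official PyTorch image
docker pull pytorch/pytorch
# Lean conda base environment
docker pull continuumio/miniconda3
# Minimal community Jupyter Notebook image
docker pull jupyter/base-notebook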

You can either have your notebooks and requirements.txt copied into the Docker image, or you can mount a volume (basically attach local storage to the container), which helps when your files are so large that they would bloat the image build. Keep in mind that with a volume your files are not packed inside the image, so if you need to share the setup with anyone you will have to share the files separately along with it.
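As a rough sketch of the two options (the paths are placeholders; the image name matches the build command shown below):

# Option 1: bake the notebooks into the image at build time (Dockerfile line)
COPY notebooks/ /app/notebooks/

# Option 2: keep the notebooks on the host and mount them at run time
docker run -v /path/to/notebooks:/app/notebooks -p 80:80 my_tf_jupyterenv:latest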

You can run an example by doing the following:

git clone https://github.com/smiraldr/docker-jupyter-sample.git
cd docker-jupyter-sample/app
docker build -t my_tf_jupyterenv .
docker run -v /Users/smiralrashinkar/Desktop/free/docker-jupyter-sample/app/notebooks:/app -p 80:80 my_tf_jupyterenv:latest
# after -v, replace the path above with your own notebooks path

You can then head over to localhost:80 and voilà! You have JupyterLab running inside the container. Moreover, you can see your notebooks in JupyterLab because we mounted a volume from the local machine.

Now, if you want to share this environment or deploy it to the cloud, you can simply push the image to Docker Hub or deploy it using a managed service like Google Cloud Run.
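As a rough sketch of both routes (the Docker Hub username, GCP project ID and service name are placeholders, and Cloud Run expects the image in a registry it can pull from):

# Push the image to Docker Hub under your own account
docker tag my_tf_jupyterenv:latest <dockerhub-user>/my_tf_jupyterenv:latest
docker push <dockerhub-user>/my_tf_jupyterenv:latest

# Or push to Google Container Registry and deploy to Cloud Run
docker tag my_tf_jupyterenv:latest gcr.io/<project-id>/my_tf_jupyterenv
docker push gcr.io/<project-id>/my_tf_jupyterenv
gcloud run deploy my-tf-jupyter --image gcr.io/<project-id>/my_tf_jupyterenv --port 80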

Docker + Data Science = ❤️


A Machine Learning Engineer and Tech Evangelist. A generalist who relishes designing and building ML solutions.