Docker for data engineering, Part 1

In this post I list the containers I use most often in my daily work. I won't try to explain the theory behind Docker containers; instead, you get a set of ready-to-use docker run commands to work with.

Depending on the scenario you are facing, you can combine them to build an environment that fits your requirements.


Databases

Don't forget to set up the volumes correctly when you work with databases: this feature prevents the loss of your data when the container no longer exists.
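One detail that is easy to miss with bind mounts: when the host directory passed to -v does not exist, Docker creates it for you, usually owned by root. A minimal sketch of a way around that is below; the helper name prepare_bind_mount is hypothetical (it is not part of Docker), and the MySQL paths are only an example.

```shell
# prepare_bind_mount HOST_DIR CONTAINER_DIR
# Hypothetical helper (not a Docker command): creates HOST_DIR as the
# current user if it is missing, then prints the "host:container" pair
# in the form that `docker run -v` expects.
prepare_bind_mount() {
  host_dir="$1"
  container_dir="$2"
  mkdir -p "$host_dir" || return 1
  printf '%s:%s\n' "$host_dir" "$container_dir"
}

# Example use, commented out so the snippet runs without Docker:
# docker run -d -v "$(prepare_bind_mount "$HOME/data/mysql" /var/lib/mysql)" ...
```

Named volumes (such as pg-datastore in the PostgreSQL command) do not need this step, because Docker manages their location itself.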


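MongoDB

# The official mongo image keeps its data in /data/db, and mongod only
# reads the mounted config file when it is passed with --config.
docker run --name mongo_nea -d \
-v $HOME/data/mongo:/data/db \
-v $HOME/config/mongod.conf:/etc/mongo.conf \
-p 27017:27017 mongo:3.6.2 \
--config /etc/mongo.conf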


InfluxDB

docker run --name influxdb -d \
-p 8083:8083 -p 8086:8086 \
-v /local/path/to/db:/var/lib/influxdb \
-v /local/path/to/config:/etc/influxdb/influxdb.conf:ro \
influxdb:1.2 -config /etc/influxdb/influxdb.conf


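PostgreSQL

# pg-datastore is a named volume managed by Docker.
# Recent postgres images refuse to start without POSTGRES_PASSWORD;
# 'secret' is only a placeholder.
docker run -d --name this_postgres \
-e POSTGRES_PASSWORD='secret' \
-v pg-datastore:/var/lib/postgresql/data \
-p 5432:5432 postgres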


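MySQL

# The official mysql image exits at startup unless a root password
# option is set; 'secret' is only a placeholder.
docker run -v $HOME/data/mysql:/var/lib/mysql \
--name mysql_spring_boot -e MYSQL_DATABASE='test' \
-e MYSQL_ROOT_PASSWORD='secret' \
-e MYSQL_USER='mysql' \
-e MYSQL_PASSWORD='mysql' \
-d mysql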

SQL Server

docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=secret' -p 1433:1433 \

Data Analysis

The interesting part here is `pwd`, which expands to the current path and gets mounted as the data folder of the Jupyter workspace. For example, when I receive a new dataset sample, I run the container from the unzipped folder and get access to the entire dataset from Jupyter, ready to analyze with pandas.

docker run -d -v `pwd`:/home/jovyan/data \
-P jupyter/scipy-notebook
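
Because -P publishes the notebook port (8888 inside the container) on a random host port, you have to look that port up before opening the browser. The sketch below parses the output of docker port, which prints lines such as 0.0.0.0:32768; the function name host_port is hypothetical, and the container name is a placeholder for whatever name or id docker run gave you.

```shell
# host_port: hypothetical helper that reads `docker port` output on stdin
# (lines like "0.0.0.0:32768") and prints the port of the first line.
host_port() {
  head -n 1 | sed 's/.*://'
}

# Example use, commented out so the snippet runs without Docker:
# docker port <container> 8888 | host_port
# docker logs <container>    # the startup log shows the login URL with its token
```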

Data Pipelines


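Apache Kafka

docker run -d --name zookeeper -p 2181:2181 dockerkafka/zookeeper
docker run -d --name kafka -p 9092:9092 --link zookeeper:zookeeper dockerkafka/kafka
# Get the IPs
export ZK_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' zookeeper)
export KAFKA_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' kafka)
# Create the topic (assumes the Kafka CLI scripts are on the PATH inside the container)
sudo docker exec -it kafka kafka-topics.sh --create --topic test --zookeeper $ZK_IP:2181 --replication-factor 1 --partitions 1
# Start a console producer
sudo docker exec -it kafka kafka-console-producer.sh --topic test --broker-list $KAFKA_IP:9092
# In a second terminal (with ZK_IP exported there too), start a console consumer
sudo docker exec -it kafka kafka-console-consumer.sh --topic test --from-beginning --zookeeper $ZK_IP:2181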
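
A container started with -d returns immediately, often before the service inside is ready, so the first command against a fresh Zookeeper or Kafka can fail. A small retry loop is one way to handle that. The function retry below is a hypothetical helper written for this post, not part of Docker or Kafka, and the attempt count and delay in the example are arbitrary.

```shell
# retry ATTEMPTS DELAY COMMAND...
# Hypothetical helper: runs COMMAND until it succeeds, at most ATTEMPTS
# times, sleeping DELAY seconds between tries. Returns 0 on the first
# success and 1 if every attempt failed.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while :; do
    "$@" && return 0
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example use, commented out so the snippet runs without Docker
# (assumes kafka-topics.sh is on the PATH inside the container):
# retry 10 3 sudo docker exec kafka kafka-topics.sh --list --zookeeper "$ZK_IP:2181"
```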

Apache NiFi

docker run --name nifi \
-v $HOME/data/nifi/sample.nar:/opt/nifi/nifi-1.7.0/lib/sample.nar \
-p 8081:8080 -d apache/nifi:1.7.0