Docker for data engineering, Part 1
In this post I'll list the containers I use most often in my daily work. I won't try to explain the theory behind Docker containers; instead, you'll get some useful Docker commands to work with.
Depending on the scenario you're facing, you can combine them to build an environment that matches your requirements.
Databases
Don't forget to set up volumes correctly when working with databases; this prevents the loss of your data once the container no longer exists.
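If you want to see that in action, here's a minimal sketch (the volume name demo-data and the throwaway busybox container are just for illustration) showing that data written to a named volume survives the removal of the container that wrote it:
# Create a named volume and write a file into it from a disposable container
docker volume create demo-data
docker run --rm -v demo-data:/data busybox sh -c 'echo hello > /data/test.txt'
# The first container is gone, but the data is still there
docker run --rm -v demo-data:/data busybox cat /data/test.txt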
Mongo
# Persist the database files (the image stores them under /data/db) and mount a custom config, which is passed to mongod explicitly
docker run --name mongo_nea -d \
-v $HOME/data/mongo:/data/db \
-v $HOME/config/mongod.conf:/etc/mongo.conf \
-p 27017:27017 mongo:3.6.2 --config /etc/mongo.conf
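To check that the instance is reachable, you can open the mongo shell inside the container and ping the server:
docker exec -it mongo_nea mongo --eval 'db.runCommand({ ping: 1 })'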
Influx
# Persist the data and mount a read-only config file, which is passed to influxd explicitly
docker run --name influxdb -d \
-p 8083:8083 -p 8086:8086 \
-v /local/path/to/db:/var/lib/influxdb \
-v /local/path/to/influxdb.conf:/etc/influxdb/influxdb.conf:ro \
influxdb:1.2 -config /etc/influxdb/influxdb.conf
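A quick way to verify it is running, using the influx CLI that ships with the image:
docker exec -it influxdb influx -execute 'SHOW DATABASES'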
PostgreSQL
# A named volume (pg-datastore) keeps the data; recent postgres images also require a superuser password
docker run -d --name this_postgres \
-v pg-datastore:/var/lib/postgresql/data \
-e POSTGRES_PASSWORD=postgres \
-p 5432:5432 postgres
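To open a psql session inside the container (the default superuser is postgres):
docker exec -it this_postgres psql -U postgres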
MySQL
docker run -v $HOME/data/mysql:/var/lib/mysql \
--name mysql_spring_boot -e MYSQL_DATABASE='test' \
-e MYSQL_USER='mysql' \
-e MYSQL_PASSWORD='mysql' \
-e MYSQL_ROOT_PASSWORD='admin' \
-e MYSQL_ALLOW_EMPTY_PASSWORD='yes' \
-d mysql
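Once it is up, you can connect with the credentials defined above:
docker exec -it mysql_spring_boot mysql -umysql -pmysql test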
SQL Server
# SA_PASSWORD must satisfy SQL Server's complexity policy (8+ characters, mixed case plus digits or symbols)
docker run --name mssql -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=yourStrong(!)Password' -p 1433:1433 \
-d mcr.microsoft.com/mssql/server:2017-latest
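The 2017 image should bundle the sqlcmd client under /opt/mssql-tools, so you can run a quick query against the instance with the SA password you set above:
docker exec -it mssql /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P 'yourStrong(!)Password' -Q 'SELECT @@VERSION'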
Data Analysis
The interesting part here is pwd, which resolves to the current path and gets mounted into the Jupyter workspace. So, for example, when I receive a new dataset sample, I run the container from the unzipped folder and get access to the entire dataset from Jupyter to analyze it with pandas.
docker run -d -v `pwd`:/home/jovyan/data \
-P jupyter/scipy-notebook
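Since -P publishes the exposed port to a random host port, check which one was assigned and grab the access token from the startup logs (this assumes the notebook is the most recently started container, hence docker ps -lq):
# Find the host port mapped to Jupyter's 8888, then the token printed at startup
docker port $(docker ps -lq) 8888
docker logs $(docker ps -lq)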
Data Pipelines
Kafka
# Start Zookeeper first, then Kafka linked to it
docker run -d --name zookeeper -p 2181:2181 dockerkafka/zookeeper
docker run -d --name kafka -p 9092:9092 --link zookeeper:zookeeper dockerkafka/kafka
# Get the IPs of both containers
export ZK_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' zookeeper)
export KAFKA_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' kafka)
# Open a shell inside the kafka container; host variables are not visible there, so pass them explicitly
sudo docker exec -it -e ZK_IP=$ZK_IP -e KAFKA_IP=$KAFKA_IP kafka bash
# Inside the container: create a topic and start a console producer
kafka-topics.sh --create --topic test --zookeeper $ZK_IP:2181 --replication-factor 1 --partitions 1
kafka-console-producer.sh --topic test --broker-list $KAFKA_IP:9092
# In a second terminal, open another shell for the consumer
sudo docker exec -it -e ZK_IP=$ZK_IP kafka bash
kafka-console-consumer.sh --topic test --from-beginning --zookeeper $ZK_IP:2181
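Anything typed in the producer terminal should now show up in the consumer. As a quick sanity check you can also list the topics, assuming you are still inside the kafka container:
kafka-topics.sh --list --zookeeper $ZK_IP:2181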
Apache NiFi
docker run --name nifi \
-v $HOME/data/nifi/sample.nar:/opt/nifi/nifi-1.7.0/lib/sample.nar \
-p 8081:8080 -d apache/nifi:1.7.0
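NiFi takes a moment to start; once the logs settle, the UI is available at http://localhost:8081/nifi thanks to the port mapping above:
# Follow the startup logs until the web server is listening
docker logs -f nifi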