Docker for data engineering, Part 1
In this post I'll list the containers I use most often in my daily work. I won't try to explain the theory behind Docker containers; instead, you'll get some useful Docker commands to work with.
Depending on the scenario you're facing, you can combine them to build an environment that matches your requirements.
Databases
Don't forget to set up volumes correctly when working with databases; this prevents the loss of your data once the container no longer exists.
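If you want to see that in action, here's a minimal sketch (the volume name demo-data and the throwaway busybox container are just for illustration) showing that data written to a named volume survives the removal of the container that wrote it:
# Create a named volume and write a file into it from a disposable container
docker volume create demo-data
docker run --rm -v demo-data:/data busybox sh -c 'echo hello > /data/test.txt'
# The first container is gone, but the data is still there
docker run --rm -v demo-data:/data busybox cat /data/test.txt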
Mongo
# Persist the database files (the image stores them under /data/db) and mount a custom config, which is passed to mongod explicitly
docker run --name mongo_nea -d \
-v $HOME/data/mongo:/data/db \
-v $HOME/config/mongod.conf:/etc/mongo.conf \
-p 27017:27017 mongo:3.6.2 --config /etc/mongo.conf
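To check that the instance is reachable, you can open the mongo shell inside the container and ping the server:
docker exec -it mongo_nea mongo --eval 'db.runCommand({ ping: 1 })'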
Influx
# Persist the data and mount a read-only config file, which is passed to influxd explicitly
docker run --name influxdb -d \
-p 8083:8083 -p 8086:8086 \
-v /local/path/to/db:/var/lib/influxdb \
-v /local/path/to/influxdb.conf:/etc/influxdb/influxdb.conf:ro \
influxdb:1.2 -config /etc/influxdb/influxdb.conf
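A quick way to verify it is running, using the influx CLI that ships with the image:
docker exec -it influxdb influx -execute 'SHOW DATABASES'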
PostgreSQL
# A named volume (pg-datastore) keeps the data; recent postgres images also require a superuser password
docker run -d --name this_postgres \
-v pg-datastore:/var/lib/postgresql/data \
-e POSTGRES_PASSWORD=postgres \
-p 5432:5432 postgres
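To open a psql session inside the container (the default superuser is postgres):
docker exec -it this_postgres psql -U postgres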
MySQL
docker run -v $HOME/data/mysql:/var/lib/mysql \
--name mysql_spring_boot -e MYSQL_DATABASE='test' \
-e MYSQL_USER='mysql' \
-e MYSQL_PASSWORD='mysql' \
-e MYSQL_ROOT_PASSWORD='admin' \
-e MYSQL_ALLOW_EMPTY_PASSWORD='yes' \
-d mysql
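Once it is up, you can connect with the credentials defined above:
docker exec -it mysql_spring_boot mysql -umysql -pmysql test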
SQL Server
# SA_PASSWORD must satisfy SQL Server's complexity policy (8+ characters, mixed case plus digits or symbols)
docker run --name mssql -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=yourStrong(!)Password' -p 1433:1433 \
-d mcr.microsoft.com/mssql/server:2017-latest
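The 2017 image should bundle the sqlcmd client under /opt/mssql-tools, so you can run a quick query against the instance with the SA password you set above:
docker exec -it mssql /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P 'yourStrong(!)Password' -Q 'SELECT @@VERSION'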
Data Analysis
The interesting part here is pwd, which resolves to the current path and gets mounted into the Jupyter workspace. So, for example, when I receive a new dataset sample, I run the container from the unzipped folder and get access to the entire dataset from Jupyter to analyze it with pandas.
docker run -d -v `pwd`:/home/jovyan/data \
-P jupyter/scipy-notebook
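Since -P publishes the exposed port to a random host port, check which one was assigned and grab the access token from the startup logs (this assumes the notebook is the most recently started container, hence docker ps -lq):
# Find the host port mapped to Jupyter's 8888, then the token printed at startup
docker port $(docker ps -lq) 8888
docker logs $(docker ps -lq)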
Data Pipelines
Kafka
# Start Zookeeper first, then Kafka linked to it
docker run -d --name zookeeper -p 2181:2181 dockerkafka/zookeeper
docker run -d --name kafka -p 9092:9092 --link zookeeper:zookeeper dockerkafka/kafka
# Get the IPs of both containers
export ZK_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' zookeeper)
export KAFKA_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' kafka)
# Open a shell inside the kafka container; host variables are not visible there, so pass them explicitly
sudo docker exec -it -e ZK_IP=$ZK_IP -e KAFKA_IP=$KAFKA_IP kafka bash
# Inside the container: create a topic and start a console producer
kafka-topics.sh --create --topic test --zookeeper $ZK_IP:2181 --replication-factor 1 --partitions 1
kafka-console-producer.sh --topic test --broker-list $KAFKA_IP:9092
# In a second terminal, open another shell for the consumer
sudo docker exec -it -e ZK_IP=$ZK_IP kafka bash
kafka-console-consumer.sh --topic test --from-beginning --zookeeper $ZK_IP:2181
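Anything typed in the producer terminal should now show up in the consumer. As a quick sanity check you can also list the topics, assuming you are still inside the kafka container:
kafka-topics.sh --list --zookeeper $ZK_IP:2181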
Apache NiFi
docker run --name nifi \
-v $HOME/data/nifi/sample.nar:/opt/nifi/nifi-1.7.0/lib/sample.nar \
-p 8081:8080 -d apache/nifi:1.7.0
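NiFi takes a moment to start; once the logs settle, the UI is available at http://localhost:8081/nifi thanks to the port mapping above:
# Follow the startup logs until the web server is listening
docker logs -f nifi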