1. docker-compose.yml
Explanation of the main parameters in docker-compose.yml; see Appendix A for the complete yml file.
version: '3.5' # Compose file format version
services: # the services defined by this Compose file
......
networks: # the network the containers attach to
  default:
    name: datahub_network
volumes: # named volumes (and host files) mounted into the containers
  mysqldata: # MySQL data
  esdata: # Elasticsearch data
  neo4jdata: # Neo4j graph database data
  zkdata: # ZooKeeper data
2. Services in detail
2.1 mysql
DataHub GMS uses MySQL as its storage backend.
mysql:
  container_name: mysql # container name
  hostname: mysql # hostname inside the Docker network
  image: mysql:5.7 # image name and version
  restart: always # restart the container automatically when Docker restarts
  command: --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci
  environment: # database settings
    MYSQL_DATABASE: 'datahub'
    MYSQL_USER: 'datahub'
    MYSQL_PASSWORD: 'datahub'
    MYSQL_ROOT_PASSWORD: 'datahub'
  ports:
    - "3306:3306" # port mapping
  volumes: # volume mounts
    - ../mysql/init.sql:/docker-entrypoint-initdb.d/init.sql
    - mysqldata:/var/lib/mysql
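With the container up, you can sanity-check the database from the host. A minimal sketch, assuming the pymysql package is installed and the port mapping above is in effect:

import pymysql  # assumed installed: pip install pymysql

# Connect through the host port mapped in the compose file (3306:3306),
# using the credentials from the environment section above.
conn = pymysql.connect(host='127.0.0.1', port=3306,
                       user='datahub', password='datahub',
                       database='datahub', charset='utf8mb4')
try:
    with conn.cursor() as cur:
        cur.execute('SHOW TABLES')  # list the tables created by init.sql
        for (table,) in cur.fetchall():
            print(table)
finally:
    conn.close()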
2.2 zookeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services.
Confluent Platform is a streaming platform that lets you organize and manage data from many different sources in one reliable, high-performance system. It makes it easy to build real-time data pipelines and streaming applications by consolidating data from multiple sources into a single central event streaming platform.
zookeeper:
  image: confluentinc/cp-zookeeper:5.4.0
  hostname: zookeeper
  container_name: zookeeper
  ports:
    - "2181:2181"
  environment:
    ZOOKEEPER_CLIENT_PORT: 2181 # port on which clients connect to the ZooKeeper server
    ZOOKEEPER_TICK_TIME: 2000 # heartbeat interval (ms) between ZooKeeper servers, and between clients and servers
  volumes:
    - zkdata:/var/opt/zookeeper
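You can probe the server with ZooKeeper's four-letter-word protocol over a raw socket. A minimal sketch using the srvr command, which recent ZooKeeper versions whitelist by default (ruok works too if you add it to 4lw.commands.whitelist):

import socket

# Send the 'srvr' four-letter command; the server replies with its
# version and status, then closes the connection.
with socket.create_connection(('127.0.0.1', 2181), timeout=5) as sock:
    sock.sendall(b'srvr')
    reply = b''
    while True:
        chunk = sock.recv(1024)
        if not chunk:
            break
        reply += chunk
print(reply.decode(errors='replace'))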
2.3 broker
The Kafka broker (server).
broker:
  image: confluentinc/cp-kafka:5.4.0
  hostname: broker
  container_name: broker
  depends_on:
    - zookeeper
  ports:
    - "29092:29092"
    - "9092:9092"
  environment:
    KAFKA_BROKER_ID: 1 # broker ID
    KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181' # ZooKeeper connection string
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT # security protocol for each listener
    KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092 # listener addresses the broker publishes to ZooKeeper for clients
    KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1 # replication factor of the offsets topic
    KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0 # delay (ms) before the first consumer-group rebalance
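The two advertised listeners matter: containers on datahub_network reach the broker at broker:29092, while processes on the host use localhost:9092. A minimal producer sketch from the host, assuming the kafka-python package is installed and topic auto-creation is enabled (the Confluent image's default); the topic name here is an arbitrary test topic:

from kafka import KafkaProducer  # assumed installed: pip install kafka-python

# From the host we must use the PLAINTEXT_HOST listener (localhost:9092);
# broker:29092 only resolves inside the Docker network.
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('smoke-test', b'hello datahub')  # 'smoke-test' is a hypothetical topic
producer.flush()
producer.close()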
2.4 kafka-rest-proxy
Provides a RESTful interface to the Kafka cluster, making it easy to produce and consume messages, view cluster state, and perform administrative actions. The REST proxy exposes Kafka's features through an open HTTP interface so that you can reach them over the network.
kafka-rest-proxy:
  image: confluentinc/cp-kafka-rest:5.4.0
  hostname: kafka-rest-proxy
  container_name: kafka-rest-proxy
  ports:
    - "8082:8082"
  environment:
    KAFKA_REST_LISTENERS: http://0.0.0.0:8082/ # listen address
    KAFKA_REST_SCHEMA_REGISTRY_URL: http://schema-registry:8081/ # Schema Registry URL
    KAFKA_REST_HOST_NAME: kafka-rest-proxy
    KAFKA_REST_BOOTSTRAP_SERVERS: PLAINTEXT://broker:29092 # broker bootstrap address
  depends_on:
    - zookeeper
    - broker
    - schema-registry
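For example, listing the cluster's topics through the proxy. A sketch assuming the requests package is installed; /topics is a standard REST Proxy v2 endpoint:

import requests  # assumed installed: pip install requests

# The proxy is mapped to host port 8082 in the compose file.
resp = requests.get('http://localhost:8082/topics', timeout=10)
resp.raise_for_status()
print(resp.json())  # e.g. ['MetadataChangeEvent', 'MetadataAuditEvent', ...]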
2.5 kafka-topics-ui
Browse Kafka topics and see what is happening in the cluster: find topics, view topic metadata, browse topic data (Kafka messages), view topic configuration, and download data.
The software is stateless and the only necessary option is your Kafka REST Proxy URL.
A UI for viewing Kafka topic configuration and data through kafka-rest.
kafka-topics-ui:
  image: landoop/kafka-topics-ui:0.9.4
  hostname: kafka-topics-ui
  container_name: kafka-topics-ui
  ports:
    - "18000:8000"
  environment:
    KAFKA_REST_PROXY_URL: "http://kafka-rest-proxy:8082/" # lets the UI reach Kafka over HTTP
    PROXY: "true"
  depends_on:
    - zookeeper
    - broker
    - schema-registry
    - kafka-rest-proxy
2.6 kafka-setup
Initializes a container named kafka-setup that creates the MetadataAuditEvent, MetadataChangeEvent, and FailedMetadataChangeEvent topics. The only thing this container does is create the Kafka topics once the Kafka broker is ready.
kafka-setup:
  build:
    context: ../kafka
  hostname: kafka-setup
  container_name: kafka-setup
  depends_on:
    - broker
    - schema-registry
  environment:
    - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
    - KAFKA_BOOTSTRAP_SERVER=broker:29092
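The image itself runs a shell script against the broker; the effect is roughly equivalent to the following sketch run from the host with kafka-python (the partition counts here are illustrative, not taken from the real script):

from kafka.admin import KafkaAdminClient, NewTopic  # pip install kafka-python

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
topics = [
    NewTopic(name=name, num_partitions=1, replication_factor=1)
    for name in ('MetadataChangeEvent', 'MetadataAuditEvent',
                 'FailedMetadataChangeEvent')
]
# Creates the three topics DataHub expects; raises if they already exist.
admin.create_topics(new_topics=topics)
admin.close()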
2.7 schema-registry
Confluent Schema Registry provides a serving layer for your metadata. It exposes a RESTful interface for storing and retrieving Avro, JSON Schema, and Protobuf schemas.
Schema Registry runs outside of and separately from the Kafka brokers. Producers and consumers still write to and read from Kafka to publish and consume topic data (messages); at the same time, they talk to Schema Registry to send and retrieve the schemas that describe the data models of those messages.
Schema Registry helps you centrally manage Kafka message formats to achieve forward/backward data compatibility.
schema-registry:
  image: confluentinc/cp-schema-registry:5.4.0
  hostname: schema-registry
  container_name: schema-registry
  depends_on:
    - zookeeper
    - broker
  ports:
    - "8081:8081"
  environment:
    SCHEMA_REGISTRY_HOST_NAME: schema-registry # registry host name
    SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL: 'zookeeper:2181' # ZooKeeper URL of the Kafka cluster
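Once it is up, you can query its REST interface directly. A sketch assuming requests is installed; /subjects and /subjects/{name}/versions/latest are standard Schema Registry endpoints:

import requests  # assumed installed: pip install requests

# List all registered subjects (typically one per topic value schema).
subjects = requests.get('http://localhost:8081/subjects', timeout=10).json()
print(subjects)  # e.g. ['MetadataChangeEvent-value', ...]

# Fetch the latest schema version for the first subject, if any exist.
if subjects:
    latest = requests.get(
        f'http://localhost:8081/subjects/{subjects[0]}/versions/latest',
        timeout=10).json()
    print(latest['version'], latest['schema'][:80])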
2.8 schema-registry-ui
A container that provides a visual Schema Registry UI in which you can register and deregister schemas. You can open schema-registry-ui in a web browser at http://localhost:8000 to monitor the Kafka Schema Registry.
It is a web tool for confluentinc/schema-registry that lets you create, view, search, evolve, view the history of, and configure the Avro schemas of your Kafka cluster.
schema-registry-ui:
  image: landoop/schema-registry-ui:latest # image name
  container_name: schema-registry-ui # container name
  hostname: schema-registry-ui # hostname
  ports: # port mapping
    - "8000:8000"
  environment: # environment variables
    SCHEMAREGISTRY_URL: 'http://schema-registry:8081' # Schema Registry address
    ALLOW_GLOBAL: 'true'
    ALLOW_TRANSITIVE: 'true'
    ALLOW_DELETION: 'true' # allow schema deletion
    READONLY_MODE: 'true' # read-only mode
    PROXY: 'true' # proxy registry requests through the UI backend
  depends_on: # dependencies
    - schema-registry
2.9 elasticsearch
DataHub uses Elasticsearch as its search engine. Elasticsearch powers DataHub's search, typeahead, and browse features.
elasticsearch:
  image: docker.elastic.co/elasticsearch/elasticsearch:5.6.8
  container_name: elasticsearch
  hostname: elasticsearch
  ports:
    - "9200:9200"
  environment:
    - discovery.type=single-node # run as a single-node cluster
    - xpack.security.enabled=false # disable X-Pack authentication
    - "ES_JAVA_OPTS=-Xms1g -Xmx1g" # JVM heap: 1 GB initial, 1 GB max
  volumes: # volume mounts
    - esdata:/usr/share/elasticsearch/data
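A quick health check against the mapped port, assuming requests is installed; _cluster/health is a standard Elasticsearch endpoint:

import requests  # assumed installed: pip install requests

health = requests.get('http://localhost:9200/_cluster/health', timeout=10).json()
# 'green' or 'yellow' is fine for a single-node dev cluster
# (yellow just means replica shards cannot be allocated).
print(health['status'], health['number_of_nodes'])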
2.10 kibana
Kibana is a free and open user interface that lets you visualize your Elasticsearch data and navigate the Elastic Stack. You can do anything from tracking query load to understanding how requests flow through your applications.
You can open Kibana in a web browser at http://localhost:5601 to monitor Elasticsearch.
kibana:
  image: docker.elastic.co/kibana/kibana:5.6.8
  container_name: kibana
  hostname: kibana
  ports:
    - "5601:5601"
  environment:
    - SERVER_HOST=0.0.0.0 # address the server binds to
    - ELASTICSEARCH_URL=http://elasticsearch:9200
  depends_on:
    - elasticsearch
2.11 neo4j
Neo4j is the world's leading open-source graph database. DataHub uses Neo4j as the graph database in its backend to serve graph queries such as data provenance and lineage.
neo4j:
  image: neo4j:3.5.7
  hostname: neo4j
  container_name: neo4j
  environment:
    NEO4J_AUTH: 'neo4j/datahub'
  ports:
    - "7474:7474"
    - "7687:7687"
  volumes:
    - neo4jdata:/data
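Port 7474 serves the HTTP browser UI and 7687 the Bolt protocol. A minimal Bolt sketch, assuming the neo4j Python driver is installed (pick a driver version compatible with Neo4j 3.5, e.g. the 1.7 series):

from neo4j import GraphDatabase  # assumed installed: pip install neo4j

# Credentials come from NEO4J_AUTH ('neo4j/datahub') above.
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'datahub'))
with driver.session() as session:
    count = session.run('MATCH (n) RETURN count(n) AS c').single()['c']
    print(f'{count} nodes in the graph')
driver.close()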
2.12 elasticsearch-setup
# This "container" is a workaround to pre-create search indices
elasticsearch-setup:
  build:
    context: ../elasticsearch
  hostname: elasticsearch-setup
  container_name: elasticsearch-setup
  depends_on:
    - elasticsearch
  environment:
    - ELASTICSEARCH_HOST=elasticsearch
    - ELASTICSEARCH_PORT=9200
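What the container does amounts to issuing index-creation requests against Elasticsearch. An illustrative sketch; the index name and mapping below are hypothetical, not DataHub's real ones (PUT /<index> is the standard ES 5.x API):

import requests  # assumed installed: pip install requests

# Create an index with an explicit mapping; ES 5.x requires a mapping type
# name ('doc' here).
body = {
    'settings': {'number_of_shards': 1, 'number_of_replicas': 0},
    'mappings': {'doc': {'properties': {'urn': {'type': 'keyword'}}}},
}
resp = requests.put('http://localhost:9200/example_index', json=body, timeout=10)
print(resp.status_code, resp.json())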
2.13 datahub-gms
DataHub GMS (Generalized Metadata Service) is a Rest.li service written in Java (Rest.li is a framework for building RESTful clients and servers at scale). It follows common Rest.li server development practices, and all of its data models are Pegasus (.pdl) models.
datahub-gms:
  image: linkedin/datahub-gms:${DATAHUB_VERSION:-latest}
  hostname: datahub-gms
  container_name: datahub-gms
  ports:
    - "8080:8080"
  environment:
    - EBEAN_DATASOURCE_USERNAME=datahub
    - EBEAN_DATASOURCE_PASSWORD=datahub
    - EBEAN_DATASOURCE_HOST=mysql:3306
    - EBEAN_DATASOURCE_URL=jdbc:mysql://mysql:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8
    - EBEAN_DATASOURCE_DRIVER=com.mysql.jdbc.Driver
    - KAFKA_BOOTSTRAP_SERVER=broker:29092
    - KAFKA_SCHEMAREGISTRY_URL=http://schema-registry:8081
    - ELASTICSEARCH_HOST=elasticsearch
    - ELASTICSEARCH_PORT=9200
    - NEO4J_HOST=neo4j:7474
    - NEO4J_URI=bolt://neo4j
    - NEO4J_USERNAME=neo4j
  depends_on:
    - elasticsearch-setup
    - kafka-setup
    - mysql
    - neo4j
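A simple reachability check from the host; the exact Rest.li resource paths vary by DataHub version, so this sketch only verifies that the GMS port answers HTTP:

import requests  # assumed installed: pip install requests

try:
    resp = requests.get('http://localhost:8080', timeout=5)
    print('GMS answered with HTTP', resp.status_code)  # any response means the port is live
except requests.ConnectionError:
    print('GMS is not reachable yet; check `docker logs datahub-gms`')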
2.14 datahub-frontend
datahub-frontend is a Play service written in Java (Play is a high-productivity Java and Scala web application framework that integrates the components and APIs needed for modern web application development). It serves as the middle tier between datahub-gms (the backend service) and DataHub Web.
datahub-frontend:
  image: linkedin/datahub-frontend:${DATAHUB_VERSION:-latest}
  hostname: datahub-frontend
  container_name: datahub-frontend
  ports:
    - "9001:9001"
  environment:
    - DATAHUB_GMS_HOST=datahub-gms
    - DATAHUB_GMS_PORT=8080
    - DATAHUB_SECRET=YouKnowNothing
    - DATAHUB_APP_VERSION=1.0
    - DATAHUB_PLAY_MEM_BUFFER_SIZE=10MB
  depends_on:
    - datahub-gms
2.15 datahub-mae-consumer
The MAE (MetadataAuditEvent) Consumer is a Kafka Streams job. Its main function is to listen for messages on the MetadataAuditEvent Kafka topic and process them with index builders. The index builders turn MAEs into search document models and index those documents into Elasticsearch, so this job gives us near-real-time search index updates.
datahub-mae-consumer:
  image: linkedin/datahub-mae-consumer:${DATAHUB_VERSION:-latest}
  hostname: datahub-mae-consumer
  container_name: datahub-mae-consumer
  ports:
    - "9091:9091"
  environment:
    - KAFKA_BOOTSTRAP_SERVER=broker:29092
    - KAFKA_SCHEMAREGISTRY_URL=http://schema-registry:8081
    - ELASTICSEARCH_HOST=elasticsearch
    - ELASTICSEARCH_PORT=9200
    - NEO4J_HOST=neo4j:7474 # hostname:port of the container inside the Docker network
    - NEO4J_URI=bolt://neo4j
    - NEO4J_USERNAME=neo4j
    - NEO4J_PASSWORD=datahub
  depends_on:
    - kafka-setup
    - elasticsearch-setup
    - neo4j
  command: "sh -c 'while ping -c1 kafka-setup &>/dev/null; do echo waiting for kafka-setup... && sleep 1; done; \
    echo kafka-setup done! && /start.sh'"
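Note that the command waits until the one-shot kafka-setup container stops answering pings (i.e., has exited) before running /start.sh. To watch MAEs arrive, you can tail the topic from the host. A sketch with kafka-python; the messages are Avro-encoded, so this only shows the raw bytes:

from kafka import KafkaConsumer  # assumed installed: pip install kafka-python

consumer = KafkaConsumer('MetadataAuditEvent',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=10000)  # stop after 10 s of silence
for msg in consumer:
    # Payloads use Confluent Avro framing (magic byte + schema ID + Avro body).
    print(msg.offset, msg.value[:40])
consumer.close()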
2.16 datahub-mce-consumer
The MCE (MetadataChangeEvent) Consumer is a Kafka Streams job. Its main function is to listen for messages on the MetadataChangeEvent Kafka topic, process them, and write the new metadata to DataHub GMS. After each successful metadata update, GMS emits a MetadataAuditEvent, which is consumed by the MAE consumer job.
datahub-mce-consumer:
  image: linkedin/datahub-mce-consumer:${DATAHUB_VERSION:-latest}
  hostname: datahub-mce-consumer
  container_name: datahub-mce-consumer
  ports:
    - "9090:9090"
  environment:
    - KAFKA_BOOTSTRAP_SERVER=broker:29092
    - KAFKA_SCHEMAREGISTRY_URL=http://schema-registry:8081
    - GMS_HOST=datahub-gms
    - GMS_PORT=8080
  depends_on:
    - kafka-setup
    - datahub-gms
  command: "sh -c 'while ping -c1 kafka-setup &>/dev/null; do echo waiting for kafka-setup... && sleep 1; done; \
    echo kafka-setup done! && /start.sh'"
3. Workflow overview
DataHub generally has two types of users in mind. One owns metadata and uses the tools DataHub provides to ingest it into DataHub. The other uses DataHub to discover the metadata available there. DataHub offers an intuitive UI, full-text search, and a graph representation of relationships, which makes metadata discovery and understanding much easier.
The following sequence diagram highlights DataHub's key functionality and how the two types of users, metadata ingestion engineers and metadata discovery users, get the most out of DataHub.
(Figure: datahub-sequence-diagram.png, the DataHub sequence diagram)
- First, ingest your metadata into DataHub. Python scripts that work with popular relational databases crawl the data sources for metadata and publish it, in Avro format, to the MetadataChangeEvent (MCE) Kafka topic (see the sketch after this list).
- The MetadataChangeEvent (MCE) processor consumes the Kafka messages on that topic, performs the necessary transformations, and sends the result to the Generalized Metadata Service (GMS), which persists the metadata to a relational database of your choice. Currently MySQL, PostgreSQL, and MariaDB are supported.
- GMS also checks the received metadata to see whether it is a newer version; if so, it publishes the diff of the update to the MetadataAuditEvent (MAE) Kafka topic.
- The MAE processor consumes the MetadataAuditEvent messages from Kafka and stores the results in Neo4j and Elasticsearch (ES).
- datahub-frontend is the RESTful API service over GMS metadata. Users discovering metadata can browse and search it and view details such as owners, lineage, and other custom tags.
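A minimal publishing sketch in the spirit of the first step above, using confluent-kafka's (now-deprecated) AvroProducer. The schema here is a deliberately tiny stand-in, NOT DataHub's real MCE schema, which the real ingestion scripts register in the Schema Registry:

from confluent_kafka import avro  # assumed installed: pip install "confluent-kafka[avro]"
from confluent_kafka.avro import AvroProducer

# Hypothetical, heavily simplified value schema for illustration only.
value_schema = avro.loads('''
{
  "type": "record", "name": "ToyMCE",
  "fields": [{"name": "urn", "type": "string"}]
}
''')

producer = AvroProducer(
    {'bootstrap.servers': 'localhost:9092',
     'schema.registry.url': 'http://localhost:8081'},
    default_value_schema=value_schema)

# Publish one toy event to the MCE topic; real ingestion scripts emit
# full MetadataChangeEvent records here instead.
producer.produce(topic='MetadataChangeEvent',
                 value={'urn': 'urn:li:dataset:(urn:li:dataPlatform:mysql,example,PROD)'})
producer.flush()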
Appendix A
version: '3.5'
services:
  mysql:
    container_name: mysql
    hostname: mysql
    image: mysql:5.7
    restart: always
    command: --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci
    environment:
      MYSQL_DATABASE: 'datahub'
      MYSQL_USER: 'datahub'
      MYSQL_PASSWORD: 'datahub'
      MYSQL_ROOT_PASSWORD: 'datahub'
    ports:
      - "3306:3306"
    volumes:
      - ../mysql/init.sql:/docker-entrypoint-initdb.d/init.sql
      - mysqldata:/var/lib/mysql
  zookeeper:
    image: confluentinc/cp-zookeeper:5.4.0
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    volumes:
      - zkdata:/var/opt/zookeeper
  broker:
    image: confluentinc/cp-kafka:5.4.0
    hostname: broker
    container_name: broker
    depends_on:
      - zookeeper
    ports:
      - "29092:29092"
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
  kafka-rest-proxy:
    image: confluentinc/cp-kafka-rest:5.4.0
    hostname: kafka-rest-proxy
    container_name: kafka-rest-proxy
    ports:
      - "8082:8082"
    environment:
      KAFKA_REST_LISTENERS: http://0.0.0.0:8082/
      KAFKA_REST_SCHEMA_REGISTRY_URL: http://schema-registry:8081/
      KAFKA_REST_HOST_NAME: kafka-rest-proxy
      KAFKA_REST_BOOTSTRAP_SERVERS: PLAINTEXT://broker:29092
    depends_on:
      - zookeeper
      - broker
      - schema-registry
  kafka-topics-ui:
    image: landoop/kafka-topics-ui:0.9.4
    hostname: kafka-topics-ui
    container_name: kafka-topics-ui
    ports:
      - "18000:8000"
    environment:
      KAFKA_REST_PROXY_URL: "http://kafka-rest-proxy:8082/"
      PROXY: "true"
    depends_on:
      - zookeeper
      - broker
      - schema-registry
      - kafka-rest-proxy
  # This "container" is a workaround to pre-create topics
  kafka-setup:
    build:
      context: ../kafka
    hostname: kafka-setup
    container_name: kafka-setup
    depends_on:
      - broker
      - schema-registry
    environment:
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_BOOTSTRAP_SERVER=broker:29092
  schema-registry:
    image: confluentinc/cp-schema-registry:5.4.0
    hostname: schema-registry
    container_name: schema-registry
    depends_on:
      - zookeeper
      - broker
    ports:
      - "8081:8081"
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL: 'zookeeper:2181'
  schema-registry-ui:
    image: landoop/schema-registry-ui:latest
    container_name: schema-registry-ui
    hostname: schema-registry-ui
    ports:
      - "8000:8000"
    environment:
      SCHEMAREGISTRY_URL: 'http://schema-registry:8081'
      ALLOW_GLOBAL: 'true'
      ALLOW_TRANSITIVE: 'true'
      ALLOW_DELETION: 'true'
      READONLY_MODE: 'true'
      PROXY: 'true'
    depends_on:
      - schema-registry
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:5.6.8
    container_name: elasticsearch
    hostname: elasticsearch
    ports:
      - "9200:9200"
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    volumes:
      - esdata:/usr/share/elasticsearch/data
  kibana:
    image: docker.elastic.co/kibana/kibana:5.6.8
    container_name: kibana
    hostname: kibana
    ports:
      - "5601:5601"
    environment:
      - SERVER_HOST=0.0.0.0
      - ELASTICSEARCH_URL=http://elasticsearch:9200
    depends_on:
      - elasticsearch
  neo4j:
    image: neo4j:3.5.7
    hostname: neo4j
    container_name: neo4j
    environment:
      NEO4J_AUTH: 'neo4j/datahub'
    ports:
      - "7474:7474"
      - "7687:7687"
    volumes:
      - neo4jdata:/data
  # This "container" is a workaround to pre-create search indices
  elasticsearch-setup:
    build:
      context: ../elasticsearch
    hostname: elasticsearch-setup
    container_name: elasticsearch-setup
    depends_on:
      - elasticsearch
    environment:
      - ELASTICSEARCH_HOST=elasticsearch
      - ELASTICSEARCH_PORT=9200
  datahub-gms:
    image: linkedin/datahub-gms:${DATAHUB_VERSION:-latest}
    hostname: datahub-gms
    container_name: datahub-gms
    ports:
      - "8080:8080"
    environment:
      - EBEAN_DATASOURCE_USERNAME=datahub
      - EBEAN_DATASOURCE_PASSWORD=datahub
      - EBEAN_DATASOURCE_HOST=mysql:3306
      - EBEAN_DATASOURCE_URL=jdbc:mysql://mysql:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8
      - EBEAN_DATASOURCE_DRIVER=com.mysql.jdbc.Driver
      - KAFKA_BOOTSTRAP_SERVER=broker:29092
      - KAFKA_SCHEMAREGISTRY_URL=http://schema-registry:8081
      - ELASTICSEARCH_HOST=elasticsearch
      - ELASTICSEARCH_PORT=9200
      - NEO4J_HOST=neo4j:7474
      - NEO4J_URI=bolt://neo4j
      - NEO4J_USERNAME=neo4j
    depends_on:
      - elasticsearch-setup
      - kafka-setup
      - mysql
      - neo4j
  datahub-frontend:
    image: linkedin/datahub-frontend:${DATAHUB_VERSION:-latest}
    hostname: datahub-frontend
    container_name: datahub-frontend
    ports:
      - "9001:9001"
    environment:
      - DATAHUB_GMS_HOST=datahub-gms
      - DATAHUB_GMS_PORT=8080
      - DATAHUB_SECRET=YouKnowNothing
      - DATAHUB_APP_VERSION=1.0
      - DATAHUB_PLAY_MEM_BUFFER_SIZE=10MB
    depends_on:
      - datahub-gms
  datahub-mae-consumer:
    image: linkedin/datahub-mae-consumer:${DATAHUB_VERSION:-latest}
    hostname: datahub-mae-consumer
    container_name: datahub-mae-consumer
    ports:
      - "9091:9091"
    environment:
      - KAFKA_BOOTSTRAP_SERVER=broker:29092
      - KAFKA_SCHEMAREGISTRY_URL=http://schema-registry:8081
      - ELASTICSEARCH_HOST=elasticsearch
      - ELASTICSEARCH_PORT=9200
      - NEO4J_HOST=neo4j:7474
      - NEO4J_URI=bolt://neo4j
      - NEO4J_USERNAME=neo4j
      - NEO4J_PASSWORD=datahub
    depends_on:
      - kafka-setup
      - elasticsearch-setup
      - neo4j
    command: "sh -c 'while ping -c1 kafka-setup &>/dev/null; do echo waiting for kafka-setup... && sleep 1; done; \
      echo kafka-setup done! && /start.sh'"
  datahub-mce-consumer:
    image: linkedin/datahub-mce-consumer:${DATAHUB_VERSION:-latest}
    hostname: datahub-mce-consumer
    container_name: datahub-mce-consumer
    ports:
      - "9090:9090"
    environment:
      - KAFKA_BOOTSTRAP_SERVER=broker:29092
      - KAFKA_SCHEMAREGISTRY_URL=http://schema-registry:8081
      - GMS_HOST=datahub-gms
      - GMS_PORT=8080
    depends_on:
      - kafka-setup
      - datahub-gms
    command: "sh -c 'while ping -c1 kafka-setup &>/dev/null; do echo waiting for kafka-setup... && sleep 1; done; \
      echo kafka-setup done! && /start.sh'"
networks:
  default:
    name: datahub_network
volumes:
  mysqldata:
  esdata:
  neo4jdata:
  zkdata: