What is DolphinScheduler
Apache DolphinScheduler is a distributed, decentralized, easily extensible visual DAG workflow task scheduling system. It is designed to untangle the complex dependencies in data processing pipelines and make scheduling work out of the box.
Key features
- Visual DAG (directed acyclic graph): DolphinScheduler links tasks together by their dependencies in a DAG and visualizes task status in real time.
- Rich task types: Shell, MR, Spark, SQL (MySQL, PostgreSQL, Hive, Spark SQL), Python, Sub_Process, Procedure, and more.
- High reliability and scalability: decentralized Master and Worker clusters coordinated through ZooKeeper, with cluster HA (high availability).
- Real-time monitoring and failure handling: workflows can be triggered on a schedule, by dependency, or manually, and can be paused/stopped/resumed by hand; failed tasks can be retried, recovered from a specified node, or killed.
Components
The DolphinScheduler architecture consists of the following parts:
- MasterServer: splits DAGs into tasks, submits and monitors them, and watches the health of other MasterServers and WorkerServers.
- WorkerServer: executes tasks and provides the log service.
- ZooKeeper: used for cluster coordination and fault tolerance.
- AlertServer: provides alerting services.
- API layer: handles requests from the front-end UI.
- UI: provides the system's visual operation interfaces.
Installation
Version: DolphinScheduler 3.2.2 (the Chinese documentation is available on the official site).
DolphinScheduler supports standalone, pseudo-cluster, cluster, and Kubernetes deployments.
This document focuses on installation with docker-compose.
Standalone deployment is simple: just set up a Java environment, and if additional drivers are needed, drop the jar files into the corresponding module's libs directory, for example:
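A minimal sketch for the standalone package, assuming the default layout of the 3.2.2 binary distribution (verify the paths and the driver file name against your actual install):

# hypothetical paths -- adjust to your environment
tar -xzf apache-dolphinscheduler-3.2.2-bin.tar.gz
cp mysql-connector-java.jar apache-dolphinscheduler-3.2.2-bin/standalone-server/libs/
bash apache-dolphinscheduler-3.2.2-bin/bin/dolphinscheduler-daemon.sh start standalone-server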

Files
These are the main files; for the other mounted directories, just grant the needed permissions if a permission error is reported (a sketch follows the tree below).
/mnt/data/www/dolphinscheduler/
├── docker-compose.yml
├── Dockerfile/
│   └── ds-worker/
│       └── Dockerfile            # your custom Dockerfile
├── drivers/                      # driver directory, copied into each image's libs directory
│   ├── ...
│   └── mysql-connector-java.jar  # MySQL JDBC driver, needed when DolphinScheduler uses a MySQL database
└── .env
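If any of the mounted host directories are missing or report permission errors, a blunt sketch (using the directory names from docker-compose.yml below) is:

mkdir -p /mnt/data/www/dolphinscheduler/{dolphinscheduler-postgresql,dolphinscheduler-zookeeper,dolphinscheduler-logs,dolphinscheduler-shared-local,dolphinscheduler-resource-local,dolphinscheduler-worker-data,drivers,datax}
# wide-open permissions for simplicity; tighten to the container user if your environment requires it
chmod -R 777 /mnt/data/www/dolphinscheduler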
docker-compose.yml
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
version: "3.8"

services:
  dolphinscheduler-postgresql:
    image: bitnami/postgresql:15.2.0
    ports:
      - "5432:5432"
    profiles: ["all", "schema"]
    environment:
      POSTGRESQL_USERNAME: root
      POSTGRESQL_PASSWORD: root
      POSTGRESQL_DATABASE: dolphinscheduler
    volumes:
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-postgresql:/bitnami/postgresql
    healthcheck:
      test: ["CMD", "bash", "-c", "cat < /dev/null > /dev/tcp/127.0.0.1/5432"]
      interval: 5s
      timeout: 60s
      retries: 120
    networks:
      - dolphinscheduler

  dolphinscheduler-zookeeper:
    image: bitnami/zookeeper:3.7.1
    profiles: ["all"]
    environment:
      ALLOW_ANONYMOUS_LOGIN: "yes"
      ZOO_4LW_COMMANDS_WHITELIST: srvr,ruok,wchs,cons
    volumes:
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-zookeeper:/bitnami/zookeeper
    healthcheck:
      test: ["CMD", "bash", "-c", "cat < /dev/null > /dev/tcp/127.0.0.1/2181"]
      interval: 5s
      timeout: 60s
      retries: 120
    networks:
      - dolphinscheduler

  dolphinscheduler-schema-initializer:
    image: ${HUB}/dolphinscheduler-tools:${TAG}
    env_file: .env
    profiles: ["schema"]
    command: [ tools/bin/upgrade-schema.sh ]
    depends_on:
      dolphinscheduler-postgresql:
        condition: service_healthy
    volumes:
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-shared-local:/opt/soft
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-resource-local:/dolphinscheduler
    networks:
      - dolphinscheduler

  dolphinscheduler-api:
    image: ${HUB}/dolphinscheduler-api:${TAG}
    ports:
      - "12345:12345"
      - "25333:25333"
    profiles: ["all"]
    env_file: .env
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:12345/dolphinscheduler/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      dolphinscheduler-zookeeper:
        condition: service_healthy
    volumes:
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-shared-local:/opt/soft
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-resource-local:/dolphinscheduler
      # mount the driver directory
      - /mnt/data/www/dolphinscheduler/drivers:/opt/dolphinscheduler/lib
    # copy the drivers into libs, then run the original startup command
    command: >
      sh -c "cp /opt/dolphinscheduler/lib/* /opt/dolphinscheduler/libs/ &&
             /opt/dolphinscheduler/bin/start.sh"
    networks:
      - dolphinscheduler

  dolphinscheduler-alert:
    image: ${HUB}/dolphinscheduler-alert-server:${TAG}
    profiles: ["all"]
    env_file: .env
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:50053/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      dolphinscheduler-zookeeper:
        condition: service_healthy
    volumes:
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-logs:/opt/dolphinscheduler/logs
    networks:
      - dolphinscheduler

  dolphinscheduler-master:
    image: ${HUB}/dolphinscheduler-master:${TAG}
    profiles: ["all"]
    env_file: .env
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:5679/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      dolphinscheduler-zookeeper:
        condition: service_healthy
    volumes:
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-shared-local:/opt/soft
      - /mnt/data/www/dolphinscheduler/drivers:/opt/dolphinscheduler/lib # mount the driver directory
    # copy the drivers into libs, then run the original startup command
    command: >
      sh -c "cp /opt/dolphinscheduler/lib/* /opt/dolphinscheduler/libs/ &&
             /opt/dolphinscheduler/bin/start.sh"
    networks:
      - dolphinscheduler

  dolphinscheduler-worker:
    #image: ${HUB}/dolphinscheduler-worker:${TAG}
    build:
      context: .
      dockerfile: Dockerfile/ds-worker/Dockerfile # use the custom Dockerfile
    profiles: ["all"]
    env_file: .env
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:1235/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      dolphinscheduler-zookeeper:
        condition: service_healthy
    environment:
      - PYTHON_HOME=/usr/local/python2.4/bin
      - PATH=/usr/local/python2.4/bin:$PATH
    volumes:
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-worker-data:/tmp/dolphinscheduler
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-shared-local:/opt/soft
      - /mnt/data/www/dolphinscheduler/dolphinscheduler-resource-local:/dolphinscheduler
      - /mnt/data/www/dolphinscheduler/drivers:/opt/dolphinscheduler/lib # mount the driver directory
      - /mnt/data/www/dolphinscheduler/datax:/opt/soft/datax # mount DataX
    # copy the drivers into libs, then run the original startup command
    command: >
      sh -c "cp /opt/dolphinscheduler/lib/* /opt/dolphinscheduler/libs/ &&
             /opt/dolphinscheduler/bin/start.sh"
    networks:
      - dolphinscheduler

networks:
  dolphinscheduler:
    driver: bridge
Dockerfile
Because DataX and Python are needed, a custom worker image is built; if you need anything else later, just add it to this Dockerfile (a build/verify sketch follows).
FROM apache/dolphinscheduler-worker:3.2.2

# install DataX
RUN wget https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202309/datax.tar.gz \
    && tar -zxvf datax.tar.gz -C /opt/ \
    && rm datax.tar.gz

# install Python 3 and pip
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

ENV DATAX_HOME=/opt/datax \
    PATH=$PATH:/opt/datax/bin
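A quick sketch for rebuilding the custom worker image after editing this Dockerfile and checking that DataX and Python 3 made it into the container (service name as defined in the compose file above):

docker-compose --profile all build dolphinscheduler-worker
# once the stack is running (see the startup section below):
docker-compose --profile all exec dolphinscheduler-worker python3 --version
docker-compose --profile all exec dolphinscheduler-worker ls /opt/datax/bin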
.env
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
HUB=apache
TAG=3.2.2
TZ=Asia/Shanghai
DATABASE=postgresql
SPRING_JACKSON_TIME_ZONE=UTC
SPRING_DATASOURCE_URL=jdbc:postgresql://dolphinscheduler-postgresql:5432/dolphinscheduler
REGISTRY_ZOOKEEPER_CONNECT_STRING=dolphinscheduler-zookeeper:2181
#DATABASE=mysql
#DATABASE_TYPE=mysql
#SPRING_DATASOURCE_DRIVER_CLASS_NAME=com.mysql.cj.jdbc.Driver
#SPRING_DATASOURCE_URL=jdbc:mysql://<your-host-ip>:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8&useSSL=false
#SPRING_DATASOURCE_USERNAME=dolphinscheduler
#SPRING_DATASOURCE_PASSWORD=123456
The commented-out section above switches DolphinScheduler's metadata database to MySQL; for a standalone (non-Docker) deployment the equivalent configuration is:
export DATABASE=mysql
export SPRING_PROFILES_ACTIVE=${DATABASE}
export SPRING_DATASOURCE_USERNAME=dolphinscheduler
export SPRING_DATASOURCE_PASSWORD=123456
Startup and usage
Commands
Start:
docker-compose --profile all up -d
Stop: docker-compose --profile all down
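On a first start the metadata tables presumably need to be created before the services come up; based on the profiles defined in docker-compose.yml above, that means running the schema profile (PostgreSQL plus dolphinscheduler-schema-initializer) once, then the all profile:

docker-compose --profile schema up -d
# wait for dolphinscheduler-schema-initializer to finish, then
docker-compose --profile all up -d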
Usage
Login URL: http://<your-ip>:12345/dolphinscheduler (the default account is admin / dolphinscheduler123).
- Add an environment. DATAX_LAUNCHER and PYTHON_LAUNCHER are there because I use DataX and Python, matching the tools installed in the Dockerfile. JAVA_HOME and PATH also need to be set: a standalone deployment does not need this, but the Docker install does, because task scripts are executed after switching to the default user with sudo -u default -i, which resets PATH and loses the Java path. A sketch of the resulting environment script follows.
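A sketch of what the environment script in Security Center / Environment Management might look like, assuming the paths used in the Dockerfile and worker image above (verify the actual python3 and JDK locations inside your worker container):

export JAVA_HOME=/opt/java/openjdk
export PATH=$JAVA_HOME/bin:/opt/datax/bin:$PATH
# assumed location of the apt-installed python3
export PYTHON_LAUNCHER=/usr/bin/python3
# DataX was unpacked to /opt/datax in the Dockerfile
export DATAX_LAUNCHER=/opt/datax/bin/datax.py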
- Create a project.
- Create a workflow under the project.
- Create a DataX node. A custom-template reference (fill in the credential and host placeholders; a smoke-test command follows the template):
{
    "setting": {},
    "job": {
        "content": [
            {
                "reader": {
                    "name": "oraclereader",
                    "parameter": {
                        "username": "source-db-username",
                        "password": "source-db-password",
                        "connection": [
                            {
                                "querySql": [
                                    "SELECT bp.PK_PSNDOC, bp.name, bp.code, bp.MOBILE, CASE WHEN bp.SEX = 1 THEN 1 ELSE 0 END AS SEX, CASE WHEN bp.BIRTHDATE IS NULL THEN '1000-01-01' ELSE bp.BIRTHDATE END AS BIRTHDATE, br.NAME AS native_place, bp.TS AS updated_at, bp.CREATIONTIME AS created_at FROM BD_PSNDOC bp LEFT JOIN BD_REGION br ON bp.nativeplace = br.PK_REGION"
                                ],
                                "jdbcUrl": [
                                    "jdbc:oracle:thin:@//source-db-host:1521/orcl"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "target-db-username",
                        "password": "target-db-password",
                        "writeMode": "update",
                        "primaryKey": ["PK_PSNDOC"],
                        "column": [
                            "`PK_PSNDOC`",
                            "`name`",
                            "`code`",
                            "`mobile`",
                            "`sex`",
                            "`BIRTHDATE`",
                            "`native_place`",
                            "`updated_at`",
                            "`created_at`"
                        ],
                        "connection": [
                            {
                                "table": [
                                    "bi_nc_psndoc"
                                ],
                                "jdbcUrl": "jdbc:mysql://target-db-host:3306/data_warehouse"
                            }
                        ]
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 2
            }
        }
    }
}
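As an optional smoke test before wiring the node into a workflow, the filled-in template could be saved as, say, psndoc_job.json (a hypothetical name) under /mnt/data/www/dolphinscheduler/dolphinscheduler-resource-local on the host, which the compose file mounts at /dolphinscheduler inside the worker, and run directly with DataX:

docker-compose --profile all exec dolphinscheduler-worker \
  python3 /opt/datax/bin/datax.py /dolphinscheduler/psndoc_job.json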
- Save and run.
- Check the results.
Common issues
- /tmp/dolphinscheduler/exec/process/default/140885259385600/140887069920000_1/1/2/1_2.sh: line 5: --jvm=-Xms1G -Xmx1G: command not found
  Solution: set the Python path for the corresponding environment under Security Center / Environment Management: PYTHON_LAUNCHER=<path-to-your-python>
- The following error:
  File "/opt/datax/bin/datax.py", line 114
      print readerRef
      ^^^^^^^^^^^^^^^
  SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?
  Solution: a DataX version issue; upgrade to the newer build: wget https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202309/datax.tar.gz && tar -zxvf datax.tar.gz -C /opt/
- /bin/sh: 1: java: not found
  This happens because the task script runs sudo -u default -i, switching to the default user and dropping the Java environment, so Java code can no longer run. Fix it by adding export JAVA_HOME=/opt/java/openjdk and export PATH=/opt/java/openjdk/bin:$PATH to the corresponding environment under Security Center / Environment Management.
- How to run a script inside another Docker container
  Taking PHP as an example: I add a Shell task in DolphinScheduler that runs a script inside the PHP container by talking to Docker over its TCP API.
First expose the Docker daemon over TCP, then restart it:
# add to the ExecStart line of the docker systemd unit
-H tcp://0.0.0.0:2375
# then run
$ systemctl daemon-reload
$ systemctl restart docker
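A quick check that the TCP endpoint is reachable (replace the placeholder host; /version is a standard Docker Engine API endpoint):

curl http://<host-ip>:2375/version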
A reference Shell task script:
#!/bin/bash
echo "Running full generation of the monthly/yearly snapshots"

# 1. Create an exec instance
#    --max-time 0  : prevent curl from timing out
#    --no-buffer   : stream the output in real time
EXEC_RESPONSE=$(curl -s -X POST --max-time 0 --no-buffer \
  "http://<host-ip>:2375/containers/php8-oracle-sqlsvr/exec" \
  -H "Content-Type: application/json" \
  -d '{"Cmd": ["php", "/var/www/data-warehouse/artisan", "nc:generate_work_snapshot_history"], "AttachStdout": true, "AttachStderr": true}')

# Extract the ExecID (grep + cut instead of jq)
EXEC_ID=$(echo "$EXEC_RESPONSE" | grep -o '"Id":"[^"]*"' | cut -d'"' -f4)
if [ -z "$EXEC_ID" ]; then
  echo "Error: failed to create the Docker exec instance"
  echo "API response: $EXEC_RESPONSE"
  exit 1
fi

# 2. Start the exec and capture its output
OUTPUT=$(curl -s -X POST \
  "http://<host-ip>:2375/exec/$EXEC_ID/start" \
  -H "Content-Type: application/json" \
  -d '{"Detach": false, "Tty": false}' \
  | tail -c +9) # skip the Docker stream header

# 3. Get the exit code (extracted straight from the JSON)
EXIT_CODE_JSON=$(curl -s "http://<host-ip>:2375/exec/$EXEC_ID/json")
EXIT_CODE=$(echo "$EXIT_CODE_JSON" | grep -o '"ExitCode":[0-9]*' | cut -d':' -f2)

# 4. Print the result
echo "----------------------------------------"
echo "PHP script output:"
echo "$OUTPUT"
echo "----------------------------------------"
echo "Exit code: ${EXIT_CODE:-unknown}"
if [ -z "$EXIT_CODE" ] || [ "$EXIT_CODE" -ne 0 ]; then
  echo "Error: script execution failed"
  exit "${EXIT_CODE:-1}"
fi