diff --git a/README.md b/README.md index 599f8f149e..4ee91d03bf 100644 --- a/README.md +++ b/README.md @@ -4,11 +4,10 @@ * **FlinkX是在是袋鼠云内部广泛使用的基于flink的分布式离线数据同步框架,实现了多种异构数据源之间高效的数据迁移。** - 不同的数据源头被抽象成不同的Reader插件,不同的数据目标被抽象成不同的Writer插件。理论上,FlinkX框架可以支持任意数据源类型的数据同步工作。作为一套生态系统,每接入一套新数据源该新加入的数据源即可实现和现有的数据源互通。
- +
 ## 2 How it works
@@ -16,14 +15,13 @@
 Under the hood, FlinkX relies on Flink: a data synchronization job is translated into a StreamGraph and executed on Flink, as shown in the figure below:
- +
## 3 快速起步 ### 3.1 运行模式 - * 单机模式:对应Flink集群的单机模式 * standalone模式:对应Flink集群的分布式模式 * yarn模式:对应Flink集群的yarn模式 @@ -34,7 +32,6 @@ * Flink集群: 1.4及以上(单机模式不需要安装Flink集群) * 操作系统:理论上不限,但是目前只编写了shell启动脚本,用户可以可以参考shell脚本编写适合特定操作系统的启动脚本。 - ### 3.3 打包 进入项目根目录,使用maven打包: @@ -50,39 +47,46 @@ mvn clean package -Dmaven.test.skip #### 3.4.1 命令行参数选项 * **model** - * 描述:执行模式,也就是flink集群的工作模式 - * local: 本地模式 - * standalone: 独立部署模式的flink集群 - * yarn: yarn模式的flink集群 - * 必选:否 - * 默认值:local + + * 描述:执行模式,也就是flink集群的工作模式 + * local: 本地模式 + * standalone: 独立部署模式的flink集群 + * yarn: yarn模式的flink集群,需要提前在yarn上启动一个flink session,使用默认名称"Flink session cluster" + * 必选:否 + * 默认值:local * **job** - * 描述:数据同步任务描述文件的存放路径;该描述文件中使用json字符串存放任务信息。 - * 必选:是 - * 默认值:无 - + + * 描述:数据同步任务描述文件的存放路径;该描述文件中使用json字符串存放任务信息。 + * 必选:是 + * 默认值:无 + * **plugin** - * 描述:插件根目录地址,也就是打包后产生的plugins目录。 - * 必选:是 - * 默认值:无 - + + * 描述:插件根目录地址,也就是打包后产生的plugins目录。 + * 必选:是 + * 默认值:无 + * **flinkconf** - * 描述:flink配置文件所在的目录(单机模式下不需要),如/hadoop/flink-1.4.0/conf - * 必选:否 - * 默认值:无 - + + * 描述:flink配置文件所在的目录(单机模式下不需要),如/hadoop/flink-1.4.0/conf + * 必选:否 + * 默认值:无 + * **yarnconf** - * 描述:Hadoop配置文件(包括hdfs和yarn)所在的目录(单机模式下不需要),如/hadoop/etc/hadoop - * 必选:否 - * 默认值:无 + + * 描述:Hadoop配置文件(包括hdfs和yarn)所在的目录(单机模式下不需要),如/hadoop/etc/hadoop + * 必选:否 + * 默认值:无 #### 3.4.2 启动数据同步任务 + * **以本地模式启动数据同步任务** ``` bin/flinkx -mode local -job /Users/softfly/company/flink-data-transfer/jobs/task_to_run.json -plugin /Users/softfly/company/flink-data-transfer/plugins ``` + * **以standalone模式启动数据同步任务** ``` @@ -101,12 +105,13 @@ bin/flinkx -mode yarn -job /Users/softfly/company/flinkx/jobs/mysql_to_mysql.jso ``` { - "job": { - "setting": {...}, - "content": [...] - } + "job": { + "setting": {...}, + "content": [...] 
+ } } ``` + 数据同步任务包括一个job元素,而这个元素包括setting和content两部分。 * setting: 用于配置限速、错误控制和脏数据管理 @@ -115,12 +120,13 @@ bin/flinkx -mode yarn -job /Users/softfly/company/flinkx/jobs/mysql_to_mysql.jso ### 4.1 setting ``` - "setting": { - "speed": {...}, - "errorLimit": {...}, - "dirty": {...} - } + "setting": { + "speed": {...}, + "errorLimit": {...}, + "dirty": {...} + } ``` + setting包括speed、errorLimit和dirty三部分,分别描述限速、错误控制和脏数据管理的配置信息 #### 4.1.1 speed @@ -133,7 +139,7 @@ setting包括speed、errorLimit和dirty三部分,分别描述限速、错误 ``` * channel: 任务并发数 -* bytes: 每秒字节数,默认为0(不限速) +* bytes: 每秒字节数,默认为 Long.MAX_VALUE #### 4.1.2 errorLimit @@ -150,7 +156,7 @@ setting包括speed、errorLimit和dirty三部分,分别描述限速、错误 #### 4.1.3 dirty ``` - "dirty": { + "dirty": { "path": "/tmp", "hadoopConfig": { "fs.default.name": "hdfs://ns1", @@ -176,18 +182,17 @@ setting包括speed、errorLimit和dirty三部分,分别描述限速、错误 "reader": { "name": "...", "parameter": { - ... + ... } }, "writer": { "name": "...", "parameter": { - ... + ... } } } ] - ``` * reader: 用于读取数据的插件的信息 @@ -203,46 +208,34 @@ reader和writer包括name和parameter,分别表示插件名称和插件参数 ### 5.1 读取插件 -* [MySQL读取插件](docs/mysqlreader.md) -* [MySQL分库分表读取插件](docs/mysqldreader.md) -* [Oracle读取插件](docs/oraclereader.md) -* [SQLServer读取插件](docs/sqlserverreader.md) +* [关系数据库读取插件](docs/rdbreader.md) +* [分库分表读取插件](docs/rdbdreader.md) * [HDFS读取插件](docs/hdfsreader.md) * [HBase读取插件](docs/hbasereader.md) * [Elasticsearch读取插件](docs/esreader.md) * [Ftp读取插件](docs/ftpreader.md) * [Odps读取插件](docs/odpsreader.md) -* [PostgreSQL读取插件](docs/postgresqlreader.md) * [MongoDB读取插件](docs/mongodbreader.md) -* [DB2读取插件](docs/db2reader.md) +* [Stream读取插件](docs/streamreader.md) +* [Carbondata读取插件](docs/carbondatareader.md) ### 5.2 写入插件 -* [MySQL写入插件](docs/mysqlwriter.md) -* [Oracle写入插件](docs/oraclewriter.md) -* [SQLServer写入插件](docs/sqlserverwriter.md) +* [关系数据库写入插件](docs/rdbwriter.md) * [HDFS写入插件](docs/hdfswriter.md) * [HBase写入插件](docs/hbasewriter.md) * [Elasticsearch写入插件](docs/eswriter.md) * [Ftp写入插件](docs/ftpwriter.md) * 
[Odps写入插件](docs/odpswriter.md) -* [PostgreSQL写入插件](docs/postgresqlwriter.md) * [MongoDB写入插件](docs/mongodbwriter.md) * [Redis写入插件](docs/rediswriter.md) -* [DB2写入插件](docs/db2writer.md) +* [Stream写入插件](docs/streamwriter.md) +* [Carbondata写入插件](docs/carbondatawriter.md) ## 6.版本说明 - 1.flinkx的分支版本跟flink的版本对应,比如:flinkx v1.4.0 对应 flink1.4.0,现在支持flink1.4和1.5 + 1.flinkx的分支版本跟flink的版本对应,比如:flinkx v1.4.0 对应 flink1.4.0,现在支持flink1.4和1.5 ## 7.招聘信息 - 1.大数据平台开发工程师,想了解岗位详细信息可以添加本人微信号ysqwhiletrue,注明招聘,如有意者发送简历至sishu@dtstack.com。 - - - - - - - - + 1.大数据平台开发工程师,想了解岗位详细信息可以添加本人微信号ysqwhiletrue,注明招聘,如有意者发送简历至sishu@dtstack.com。 diff --git a/docs/db2reader.md b/docs/db2reader.md deleted file mode 100644 index 3a1da429b1..0000000000 --- a/docs/db2reader.md +++ /dev/null @@ -1,152 +0,0 @@ -# MySQL分库分表读取插件(db2reader) - -## 1. 配置样例 - -``` -{ - "job": { - "content": [ - { - "reader": { - "fetchSize": 1024, - "parameter": { - "password": "abc123", - "column": [ - "smallint_col", - "integer_col", - "bigint_col", - "decimal_col", - "real_col", - "double_col" - ], - "where": "", - "connection": [ - { - "password": "abc123", - "jdbcUrl": ["jdbc:db2://172.16.1.191:50000/flinkx"], - "table": [ - "db2_stand_all" - ], - "username": "dtstack" - } - ], - "splitPk": "", - "username": "dtstack" - }, - "name": "db2reader" - }, - "writer": { - "parameter": { - "print":true - }, - "name": "streamwriter" - } - } - ], - "setting": { - "errorLimit": { - }, - "speed": { - "bytes": 0, - "channel": 1 - } - } - } -} - -``` - -## 2. 参数说明 - -* **name** - - * 描述:插件名,此处只能填db2reader,否则Flinkx将无法正常加载该插件包。 - - * 必选:是
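-
-  * Default: none
-
-* **connection**
-
-  * Description: array of data sources to read from.
-
-  * Required: yes
-
-  * Default: none
-
-  * Elements:
-
-    * username: user name for this data source; if omitted, the global user name is used.
-
-    * password: password for this data source; if omitted, the global password is used.
-
-    * jdbcUrl: connection URL of the data source; only a single URL is supported.
-
-    * table: names of the tables to query; several tables may be listed, but they must share the same schema.
-
-* **jdbcUrl**
-
-  * Description: JDBC connection string for the DB2 database.
-
-    jdbcUrl follows the official DB2 format and may carry additional connection control parameters. See the [official DB2 documentation](https://www.ibm.com/analytics/us/en/db2/) for details.
-
-  * Required: yes
-
-  * Default: none
-
-* **username**
-
-  * Description: global user name for the data sources.
-
-  * Required: no
-
-  * Default: none
-
-* **password**
-
-  * Description: global password for the data sources.
-
-  * Required: no
-
-  * Default: none
-
-* **where**
-
-  * Description: filter condition. db2reader assembles a SQL statement from the configured column, table and where values and extracts data with it. In practice a job often synchronizes only the current day's data, in which case where can be set to gmt_create > $bizdate. Note: where cannot be set to limit 10, because limit is not a valid SQL WHERE clause.
-
-    The where condition is an effective way to do incremental synchronization. If where is not provided, whether the key is missing or the value is empty, FlinkX synchronizes the full table.
-
-  * Required: no
-
-  * Default: none
-
-* **splitPk**
-
-  * Description: when db2reader extracts data, specifying splitPk tells it to shard the data on that field, so FlinkX starts concurrent tasks, which can greatly improve synchronization throughput.
-
-    Using the table's primary key as splitPk is recommended: primary keys are usually evenly distributed, so the resulting shards are less prone to hot spots.
-
-    splitPk currently supports integer columns only; floating-point, string, date and other types are not supported. If an unsupported type is specified, db2reader reports an error.
-
-    If splitPk is not provided, or its value is empty, FlinkX synchronizes the table over a single channel.
-
-  * Required: no
-
-  * Default: empty
-
-
-
-* **column**
-
-  * Description: set of column names to synchronize from the configured table.
-
-    Column pruning is supported: a subset of the columns can be exported.
-
-    Column reordering is supported: columns need not be exported in table schema order.
-
-    Constant columns are not supported yet.
-
-  * Required: yes
-
-  * Default: none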
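The where and splitPk descriptions above say that the reader splices column, table and where into one SQL statement, and that splitPk lets several channels read in parallel. A minimal Python sketch of that assembly is given here. It is an illustration only: `build_query` and its arguments are hypothetical names, the source does not show how FlinkX actually shards on splitPk, and the `MOD` rule below is an assumed strategy.

```python
def build_query(columns, table, where=None, split_pk=None, channels=1, channel_index=0):
    """Assemble a SELECT statement from column/table/where (hypothetical helper).

    String concatenation is used only to mirror the description in the docs;
    it is not safe against SQL injection.
    """
    if not columns:
        raise ValueError("column is required")
    sql = "SELECT " + ", ".join(columns) + " FROM " + table
    conditions = []
    # An absent or empty where means a full-table read.
    if where and where.strip():
        conditions.append("(" + where.strip() + ")")
    # Assumed sharding rule: each channel reads one residue class of the integer key.
    if split_pk and channels > 1:
        conditions.append("MOD({}, {}) = {}".format(split_pk, channels, channel_index))
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


if __name__ == "__main__":
    # One statement per channel for a two-channel job on an integer key.
    for i in range(2):
        print(build_query(["id", "name"], "db2_stand_all",
                          where="gmt_create > '2018-01-01'",
                          split_pk="id", channels=2, channel_index=i))
```

With an empty splitPk or a single channel the helper emits one unsharded statement, which matches the single-channel behaviour described for splitPk.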
- diff --git a/docs/db2writer.md b/docs/db2writer.md deleted file mode 100644 index de205310ec..0000000000 --- a/docs/db2writer.md +++ /dev/null @@ -1,156 +0,0 @@ -# MySQL写入插件(db2writer) - -## 1. 配置样例 - -``` -{ - "job": { - "content": [ - { - "reader": { - "parameter": { - "column": [ - { - "name": "id", - "type": "int", - "value": 26 - }, - { - "name": "name", - "type": "string", - "value": "xxxxxx" - } - ], - "sliceRecordCount": 2 - }, - "name": "streamreader" - }, - "writer": { - "parameter": { - "postSql": [], - "password": "abc123", - "session": [], - "column": [ - "id", - "name" - ], - "connection": [ - { - "jdbcUrl": "jdbc:db2://172.16.1.191:50000/flinkx", - "table": [ - "flinkx_test" - ] - } - ], - "writeMode": "replace", - "preSql": [], - "username": "dtstack", - "batchSize":1 - }, - "name": "db2writer" - } - } - ], - "setting": { - "errorLimit": { - "record": 0, - "percentage": 0 - }, - "speed": { - "bytes": 0, - "channel": 1 - } - } - } -} - -``` - -## 2. 参数说明 - -* **name** - - * 描述:插件名,此处只能填db2writer,否则Flinkx将无法正常加载该插件包。 - * 必选:是
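-
-  * Default: none
-
-* **jdbcUrl**
-
-  * Description: JDBC connection string for the DB2 database.
-
-    jdbcUrl follows the official DB2 format and may carry additional connection control parameters. See the [official DB2 documentation](https://www.ibm.com/analytics/us/en/db2/) for details.
-
-  * Required: yes
-
-  * Default: none
-
-* **username**
-
-  * Description: user name for the data source.
-
-  * Required: yes
-
-  * Default: none
-
-* **password**
-
-  * Description: password for the given user name.
-
-  * Required: yes
-
-  * Default: none
-
-* **column**
-
-  * Description: fields of the target table to write, separated by commas. For example: "column": ["id","name","age"].
-
-  * Required: yes
-
-  * Default: none
-
-* **preSql**
-
-  * Description: a set of standard SQL statements executed before data is written to the target table.
-
-  * Required: no
-
-  * Default: none
-
-* **postSql**
-
-  * Description: a set of standard SQL statements executed after data is written to the target table.
-
-  * Required: no
-
-  * Default: none
-
-* **table**
-
-  * Description: name of the target table. Only a single table can be configured for now; multiple tables will be supported later.
-
-    Note: table and jdbcUrl must be placed inside the connection block.
-
-  * Required: yes
-
-  * Default: none
-
-* **writeMode**
-
-  * Description: controls whether data is written to the target table with an `insert into` or a `merge into` statement.
-
-  * Required: yes
-
-  * Options: insert/replace/update
-
-  * Default: insert
-
-* **batchSize**
-
-  * Description: number of records submitted per batch. A larger value greatly reduces the network round trips between FlinkX and the database and raises overall throughput, but a value that is too large may cause the FlinkX process to run out of memory (OOM).
-
-  * Required: no
-
-  * Default: 1024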
\ No newline at end of file diff --git a/docs/esreader.md b/docs/esreader.md index 0313ba62d2..8ac84506e8 100644 --- a/docs/esreader.md +++ b/docs/esreader.md @@ -4,105 +4,95 @@ ``` { - "job": { - "setting": { - "speed": { - "channel": 2, - "bytes": 10000 - }, - "errorLimit": { - "record": 0, - "percentage": 0.02 - } - }, - "content": [ - { - "reader": { - "name": "esreader", - "parameter": { - "address": "rdos1:9200,rdos2:9200", - "query": { - "match": { - "col2": "hallo" - } - }, - "column": [ - { - "name": "xx.yy.zz", - "type": "string" - }, - { - "name": "col2", - "type": "string" - } - ] - } - }, - "writer": { - "name": "mysqlwriter", - "parameter": { - "writeMode": "insert", - "username": "dtstack", - "password": "abc123", - "column": [ - "col1", - "col2" - ], - "connection": [ - { - "jdbcUrl": "jdbc:mysql://172.16.8.104:3306/test?useCursorFetch=true", - "table": [ - "tb333" - ] - } - ] - } - } - } - ] - } + "job": { + "setting": {}, + "content": [{ + "reader": { + "name": "esreader", + "parameter": { + "address": "host1:9200,host2:9200", + "query": { + "match": { + "match_all": {} + } + }, + "index": "indexTest", + "type": "type1", + "batchSize": 0, + "timeout": 10, + "column": [{ + "name": "xx.yy.zz", + "type": "string", + "value": "value" + }] + } + }, + "writer": {} + }] + } } - ``` ## 2. 参数说明 * **address** + + * 描述:Elasticsearch地址,单个节点地址采用host:port形式,多个节点的地址用逗号连接 + + * 必选:是 + + * 默认值:无 - * 描述:Elasticsearch地址,单个节点地址采用host:port形式,多个节点的地址用逗号连接
- - * 必选:是
- - * 默认值:无
- * **query** + + * 描述:Elasticsearch查询表达式,[查询表达式](https://www.elastic.co/guide/cn/elasticsearch/guide/current/query-dsl-intro.html) + + * 必选:否 + + * 默认值:无,默认为全查询 - * 描述:Elasticsearch查询表达式,[查询表达式](https://www.elastic.co/guide/cn/elasticsearch/guide/current/query-dsl-intro.html)
+* **batchSize** + + * 描述:每次读取数据条数 + + * 必选:否 + + * 默认值:10 - * 必选:否
+* **timeout** + + * 描述:连接超时时间 + + * 必选:否 + + * 默认值:无 - * 默认值:无,默认为全查询
- -* **column** - - * 描述:读取elasticsearch的查询结果的若干个列,每列形式如下
- * 普通列 - - ``` - { - "name": "xx.yy.zz", //支持列的多级嵌套,用.连接 - "type": "string" - } - ``` - * 常数列 - - ``` - { - "value": "xxx", // 常量值 - "type": "string" //常量类型 - } - ``` +* **index** + + * 描述:要查询的索引名称 + + * 必选:否 + + * 默认值:无 - * 必选:是
+* **type** + + * 描述:要查询的类型 + + * 必选:否 + + * 默认值:无 - * 默认值:无
+* **column** + + * 描述:读取elasticsearch的查询结果的若干个列,每列形式如下 + + * name:字段名称,可使用多级格式查找 + + * type:字段类型,当name没有指定时,则返回常量列,值为value指定 + + * value:常量列的值 + + * 必选:是 + + * 默认值:无 diff --git a/docs/eswriter.md b/docs/eswriter.md index 1b4ffacf29..da8f3a49a4 100644 --- a/docs/eswriter.md +++ b/docs/eswriter.md @@ -4,145 +4,118 @@ ``` { - "job": { - "setting": { - "speed": { - "channel": 3, - "bytes": 10000000 - }, - "errorLimit": { - "record": 0, - "percentage": 20 - } - }, - "content": [ - { - "reader": { - "name": "mysqlreader", - "parameter": { - "username": "dtstack", - "password": "abc123", - "column": [ - "col1", - "col2" - ], - "splitPk": "col1", - "connection": [ - { - "table": [ - "tb2" - ], - "jdbcUrl": [ - "jdbc:mysql://172.16.8.104:3306/test?useCursorFetch=true" - ] - } - ] - } - }, - "writer": { - "name": "eswriter", - "parameter": { - "address": "rdos1:9200,rdos2:9200", - "index": "yoshi", - "type": "nani", - "bulkAction": 3, - "idColumn": [ - { - "index": 0, - "type": "int" - } - ], - "column": [ - { - "name": "col1", - "type": "string" - }, - { - "name": "col2", - "type": "string" - } - ] - } - } - } - ] - } + "job": { + "setting": {}, + "content": [{ + "reader": {}, + "writer": { + "name": "eswriter", + "parameter": { + "address": "host1:9200,host2:9200", + "index": "indexTest", + "type": "type1", + "bulkAction": 100, + "timeout": 100, + "idColumn": [{ + "index": 0, + "type": "int" + }], + "column": [{ + "name": "col1", + "type": "string" + }] + } + } + }] + } } ``` ## 2. 参数说明 * **address** + + * 描述:Elasticsearch地址,单个节点地址采用host:port形式,多个节点的地址用逗号连接 + + * 必选:是 + + * 默认值:无 - * 描述:Elasticsearch地址,单个节点地址采用host:port形式,多个节点的地址用逗号连接
- - * 必选:是
- - * 默认值:无
- * **index** + + * 描述:Elasticsearch 索引值 + + * 必选:是 + + * 默认值:无 - * 描述:Elasticsearch 索引值
- - * 必选:是
- - * 默认值:无
- * **type** + + * 描述:Elasticsearch 索引类型 + + * 必选:是 + + * 默认值:无 - * 描述:Elasticsearch 索引类型
- - * 必选:是
- - * 默认值:无
- * **column** + + * 描述:写入elasticsearch的若干个列,每列形式如下 + + ``` + { + "name": "列名", + "type": "列类型" + } + ``` + + * 必选:是 + + * 默认值:无 - * 描述:写入elasticsearch的若干个列,每列形式如下
- - ``` - { - "name": "列名", - "type": "列类型" - } - ``` - - * 必选:是
- - * 默认值:无
- * **idColumns** + + * 描述:用于构造文档id的若干个列,每列形式如下 + + * 普通列 + + ``` + { + "index": 0, // 前面column属性中列的序号,从0开始 + "type": "string" 列的类型,默认为string + } + ``` + + * 常数列 + + ``` + { + "value": "ffff", // 常数值 + "type": "string" // 常数列的类型,默认为string + } + ``` + + * 必选:否 + + * 注意: + + * 如果不指定idColumns属性,则会随机产生文档id + + * 如果指定的字段值存在重复或者指定了常数,按照es的逻辑,同样值的doc只会保留一份 + + * 默认值:无 - * 描述:用于构造文档id的若干个列,每列形式如下
- - * 普通列 - - ``` - { - "index": 0, // 前面column属性中列的序号,从0开始 - "type": "string" 列的类型,默认为string - } - ``` - - * 常数列 - - ``` - { - "value": "ffff", // 常数值 - "type": "string" // 常数列的类型,默认为string - } - ``` - - * 必选:否
- 如果不指定idColumns属性,则会随机产生文档id - - * 默认值:无
- - * **bulkAction** - - * 描述:批量写入的记录条数
- - * 必选:是
- - * 默认值:100
\ No newline at end of file + + * 描述:批量写入的记录条数 + + * 必选:是 + + * 默认值:100 + +* **timeout** + + * 描述:连接超时时间,如果bulkAction指定的数值过大,写入数据可能会超时,这时可以配置超时时间 + + * 必选:否 + + * 默认值:无 diff --git a/docs/ftpreader.md b/docs/ftpreader.md index a889116135..ba7b5ee95a 100644 --- a/docs/ftpreader.md +++ b/docs/ftpreader.md @@ -5,175 +5,144 @@ ``` { "job": { - "setting": { - "speed": { - "channel": 1, - "bytes": 10000 - }, - "errorLimit": { - "record": 0, - "percentage": 50 - } - }, - "content": [ - { - "reader": { - "name": "ftpreader", - "parameter": { - "protocol": "sftp", - "host": "node01" , - "port": 22, - "username": "mysftp", - "password": "oh1986mygod", - "column": [ - { - "index": 0 - }, - { - "index": 1 - }, - { - "value": "youcan", - "type": "string" - } - ], - "path": "/upload", - "encoding": "UTF-8", - "fieldDelimiter": "\\t", - "isFirstLineHeader":true - } - }, - "writer": { - "parameter": { - "password": "abc123", - "column": [ - "col1", - "col2", - "col3" - ], - "connection": [ - { - "jdbcUrl": "jdbc:mysql://172.16.8.104:3306/test?charset=utf8", - "table": [ - "sb5" - ] - } - ], - "writeMode": "insert", - "username": "dtstack" - }, - "name": "mysqlwriter" + "setting": {}, + "content": [{ + "reader": { + "name": "ftpreader", + "parameter": { + "protocol": "sftp", + "host": "127.0.0.1", + "port": 22, + "username": "username", + "password": "password", + "column": [{ + "index": 0, + "type": "", + "value": "value" + }], + "path": "/upload", + "encoding": "UTF-8", + "fieldDelimiter": ",", + "isFirstLineHeader": true } - } - ] + }, + "writer": {} + }] } } - ``` ## 2. 参数说明 * **protocol** - - * 描述:ftp服务器协议,目前支持传输协议有ftp和sftp。
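-
-  * Required: yes
-
-  * Default: none
+
+  * Description: FTP server protocol; ftp and sftp are currently supported.
+
+  * Required: yes
+
+  * Default: none

 * **host**
-
-  * Description: FTP server address.
-
-  * Required: yes
-
-  * Default: none
+
+  * Description: FTP server address.
+
+  * Required: yes
+
+  * Default: none

 * **port**
-
-  * Description: FTP server port.
-
-  * Required: no
-
-  * Default: 22 when the protocol is sftp; 21 when the protocol is standard ftp
+
+  * Description: FTP server port.
+
+  * Required: no
+
+  * Default: 22 when the protocol is sftp; 21 when the protocol is standard ftp

 * **connectPattern**
-
-  * Description: connection mode (active or passive). This parameter is used only when the protocol is standard ftp, and the value must be PORT (active) or PASV (passive). The two modes differ in how the data connection is established: in PORT mode the client opens a local port and waits for the server to connect, while in PASV mode the server opens a port and waits for the client to connect.
-
-  * Required: no
-
-  * Default: PASV
+
+  * Description: connection mode (active or passive). This parameter is used only when the protocol is standard ftp, and the value must be PORT (active) or PASV (passive). The two modes differ in how the data connection is established: in PORT mode the client opens a local port and waits for the server to connect, while in PASV mode the server opens a port and waits for the client to connect.
+
+  * Required: no
+
+  * Default: PASV

 * **username**
-
-  * Description: user name for the FTP server.
-
-  * Required: yes
-
-  * Default: none
+
+  * Description: user name for the FTP server.
+
+  * Required: yes
+
+  * Default: none

 * **password**
-
-  * Description: password for the FTP server.
-
-  * Required: yes
-
-  * Default: none
+
+  * Description: password for the FTP server.
+
+  * Required: yes
+
+  * Default: none

 * **path**
-
-  * Description: path on the remote FTP file system; multiple paths may be given.
-
-  * Required: yes
-
-  * Default: /
+
+  * Description: path on the remote FTP file system; multiple paths may be given.
+
+  * Required: yes
+
+  * Default: /

 * **column**
-
-  * Description: list of fields to read. type gives the type of the source data, index gives the text column the field comes from (starting at 0), and value marks the field as a constant.
-
-
-    The column information can be specified as follows:
-
-    ```json
-    {
-      "index": 0 // take an int field from the first column of the remote FTP file
-    },
-    {
-      "type": "string",
-      "value": "alibaba" // FtpReader generates the string alibaba as this field
-    }
-    ```
-
-    When column information is specified, type is mandatory and exactly one of index/value must be given.
-
-  * Required: yes
-
-  * Default: all fields are read as string
+
+  * Description: the fields to read.
+
+  * Format: two formats are supported
+
+    1. Read all fields. If there are many fields, this form can be used:
+
+    ```
+    "column":["*"]
+    ```
+
+    2. Specify each field in detail:
+
+    ```
+    "column": [{
+      "index": 0,
+      "type": "datetime",
+      "format": "yyyy-MM-dd hh:mm:ss",
+      "value": "value"
+    }]
+    ```
+
+  * Attributes:
+
+    * index: index of the field
+
+    * type: field type. FTP sources are text files, so every field is a string by nature; this sets the type to convert it to
+
+    * format: if the field is a time string, the time format can be given so that the field is returned as a date type
+
+    * value: if index is not given, value is returned as a constant column; if index is given, value is returned as the default whenever the field read is null
+
+  * Required: yes
+
+  * Default: none

 * **fieldDelimiter**
-
-  * Description: field delimiter used when reading
-
-  * Required: yes
-
-  * Default: ,
+
+  * Description: field delimiter used when reading
+
+  * Required: yes
+
+  * Default: ,

 * **encoding**
+
+  * Description: encoding of the files to read.
+  * Required: no
+  * Default: utf-8
-  * Description: encoding of the files to read.
-
-  * Required: no
-
-  * Default: utf-8
-
 * **isFirstLineHeader**
-
-  * Description: whether the first line is a header line; if so, the first line is not read.
-
-  * Required: no
-
-  * Default: false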
- - + + * 描述:首行是否为标题行,如果是则不读取第一行。 + * 必选:否 + * 默认值:false diff --git a/docs/ftpwriter.md b/docs/ftpwriter.md index 873a2259ea..e4da51b536 100644 --- a/docs/ftpwriter.md +++ b/docs/ftpwriter.md @@ -5,149 +5,107 @@ ``` { "job": { - "setting": { - "speed": { - "channel": 2 - }, - "errorLimit": { - "record": 0, - "percentage": 0.02 - } - }, - "content": [ - { - "reader": { + "setting": {}, + "content": [{ + "reader": {}, + "writer": { + "name": "ftpwriter", "parameter": { - "password": "abc123", - "columnTypes": [ - "java.lang.String", - "java.lang.String" - ], - "column": [ - "col1", - "col2" - ], - "connection": [ - { - "jdbcUrl": [ - "jdbc:mysql://172.16.8.104:3306/test?charset=utf8" - ], - "table": [ - "tb1" - ] - } - ], - "splitPk": "col1", - "username": "dtstack" - }, - "name": "mysqlreader" - }, - "writer": { - "name": "ftpwriter", - "parameter": { - "protocol": "sftp", - "host": "node03", - "port": 22, - "username": "mysftp", - "password": "oh1986mygod", - "writeMode": "overwrite", - "path": "/upload/xxx", - "fieldDelimiter": ",", - "column": [ - { - "name": "col1", - "type": "string" - }, - { - "name": "col2", - "type": "string" - } - ] - } + "protocol": "sftp", + "host": "127.0.0.1", + "port": 22, + "username": "username", + "password": "password", + "writeMode": "overwrite", + "path": "/sftp", + "fieldDelimiter": ",", + "connectPattern": "PASV", + "column": [{ + "type": "string" + }] } } - ] + }] } } - ``` ## 2. 参数说明 * **protocol** + + * 描述:ftp服务器协议,目前支持传输协议有ftp和sftp。 + + * 必选:是 + + * 默认值:无 - * 描述:ftp服务器协议,目前支持传输协议有ftp和sftp。
- - * 必选:是
- - * 默认值:无
- * **host** + + * 描述:ftp服务器地址。 + + * 必选:是 + + * 默认值:无 - * 描述:ftp服务器地址。
- - * 必选:是
- - * 默认值:无
- * **port** + + * 描述:ftp服务器端口。 + + * 必选:否 + + * 默认值:若传输协议是sftp协议,默认值是22;若传输协议是标准ftp协议,默认值是21 - * 描述:ftp服务器端口。
- - * 必选:否
- - * 默认值:若传输协议是sftp协议,默认值是22;若传输协议是标准ftp协议,默认值是21
- - * **username** - - * 描述:ftp服务器访问用户名。
- - * 必选:是
- - * 默认值:无
+ + * 描述:ftp服务器访问用户名。 + + * 必选:是 + + * 默认值:无 * **password** - - * 描述:ftp服务器访问密码。
- - * 必选:是
- - * 默认值:无
+ + * 描述:ftp服务器访问密码。 + + * 必选:是 + + * 默认值:无 + +* **connectPattern** + + * 描述:连接模式(主动模式或者被动模式)。该参数只在传输协议是标准ftp协议时使用,值只能为:PORT (主动),PASV(被动)。两种模式主要的不同是数据连接建立的不同。对于Port模式,是客户端在本地打开一个端口等服务器去连接建立数据连接,而Pasv模式就是服务器打开一个端口等待客户端去建立一个数据连接。 + + * 必选:否 + + * 默认值:PASV * **path** - - * 描述:FTP文件系统的路径信息,FtpWriter会写入Path目录下属多个文件。
- - * 必选:是
- - * 默认值:无
- + + * 描述:FTP文件系统的路径信息,FtpWriter会写入Path目录下属多个文件。 + + * 必选:是 + + * 默认值:无 * **writeMode** - - * 描述:FtpWriter写入前数据清理处理模式:
- - * overwrite,覆盖 - * append,追加 - - * 必选:是
- - * 默认值:无
+ + * 描述:FtpWriter写入前数据清理处理模式: + * overwrite,覆盖 + * append,追加 + * 必选:是 + * 默认值:无 * **fieldDelimiter** + + * 描述:写入的字段分隔符 + + * 必选:否 + + * 默认值:, - * 描述:读取的字段分隔符
- - * 必选:否
- - * 默认值:,
- * **encoding** - - * 描述:读取文件的编码配置。
- - * 必选:否
- - * 默认值:utf-8
- + + * 描述:读取文件的编码配置。 + * 必选:否 + * 默认值:utf-8 diff --git a/docs/hbasereader.md b/docs/hbasereader.md index 422b0187c4..a076698337 100644 --- a/docs/hbasereader.md +++ b/docs/hbasereader.md @@ -4,142 +4,99 @@ ``` { - "job": { - "setting": { - "speed": { - "channel": 2, - "bytes": 10000 - }, - "errorLimit": { - "record": 0, - "percentage": 2 - } - }, - "content": [ - { - "reader": { - "name": "hbasereader", - "parameter": { - "hbaseConfig": { - "hbase.zookeeper.property.clientPort": "2181", - "hbase.rootdir": "hdfs://ns1/hbase", - "hbase.cluster.distributed": "true", - "hbase.zookeeper.quorum": "node01,node02,node03", - "zookeeper.znode.parent": "/hbase" - }, - "table": "sb5", - "encodig": "utf-8", - "column": [ - { - "name": "rowkey", - "type": "string" - }, - { - "name": "cf1:id", - "type": "string" - } - ], - "range": { - "startRowkey": "", - "endRowkey": "", - "isBinaryRowkey": true - } - } - }, - "writer": { - "parameter": { - "password": "abc123", - "column": [ - "col1", - "col2" - ], - "connection": [ - { - "jdbcUrl": "jdbc:mysql://172.16.8.104:3306/test?charset=utf8", - "table": [ - "sb5" - ] - } - ], - "writeMode": "insert", - "username": "dtstack" - }, - "name": "mysqlwriter" - } - } - ] - } + "job": { + "setting": {}, + "content": [{ + "reader": { + "name": "hbasereader", + "parameter": { + "hbaseConfig": { + "hbase.zookeeper.property.clientPort": "2181", + "hbase.rootdir": "hdfs://ns1/hbase", + "hbase.cluster.distributed": "true", + "hbase.zookeeper.quorum": "host1,host2,host3", + "zookeeper.znode.parent": "/hbase" + }, + "table": "tableTest", + "encodig": "utf-8", + "column": [{ + "name": "rowkey", + "type": "string" + }, + { + "name": "cf1:id", + "type": "string" + } + ], + "range": { + "startRowkey": "", + "endRowkey": "", + "isBinaryRowkey": true + } + } + }, + "writer": {} + }] + } } - ``` ## 2. 参数说明 * **hbaseConfig** + + * 描述:hbase的连接配置,以json的形式组织 (见hbase-site.xml) + + * 必选:是 + + * 默认值:无 - * 描述:hbase的连接配置,以json的形式组织 (见hbase-site.xml)
- - * 必选:是
- - * 默认值:无
- * **encoding** + + * 描述:字符编码 + + * 必选:无 + + * 默认值:utf-8 - * 描述:字符编码
- - * 必选:无
- - * 默认值:utf-8
- * **table** + + * 描述:hbase表名 + + * 必选:是 + + * 默认值:无 - * 描述:hbase表名
- - * 必选:是
- - * 默认值:无
- * **range** - - * 描述:指定hbasereader读取的rowkey范围。
- startRowkey:指定开始rowkey;
- endRowkey指定结束rowkey;
- isBinaryRowkey:指定配置的startRowkey和endRowkey转换为byte[]时的方式,默认值为false,若为true,则调用Bytes.toBytesBinary(rowkey)方法进行转换;若为false:则调用Bytes.toBytes(rowkey)
- 配置格式如下: - - ``` - "range": { - "startRowkey": "aaa", - "endRowkey": "ccc", - "isBinaryRowkey":false -} - ``` -
- - * 必选:否
- - * 默认值:无
+ + * 描述:指定hbasereader读取的rowkey范围。 + + * startRowkey:指定开始rowkey; + + * endRowkey指定结束rowkey; + + * isBinaryRowkey:指定配置的startRowkey和endRowkey转换为byte[]时的方式,默认值为false,若为true,则调用Bytes.toBytesBinary(rowkey)方法进行转换;若为false:则调用Bytes.toBytes(rowkey),配置格式如下: + + ``` + "range": { + "startRowkey": "aaa", + "endRowkey": "ccc", + "isBinaryRowkey":false + } + ``` + + * 必选:否 + + * 默认值:无 * **column** - - * 描述:要读取的hbase字段,normal 模式与multiVersionFixedColumn 模式下必填项。 - name指定读取的hbase列,除了rowkey外,必须为 列族:列名 的格式,type指定源数据的类型,format指定日期类型的格式,value指定当前类型为常量,不从hbase读取数据,而是根据value值自动生成对应的列。配置格式如下: - - ``` - "column": -[ - { - "name": "rowkey", - "type": "string" - }, - { - "value": "test", - "type": "string" - } -] - - ``` - - * 必选:是
- - * 默认值:无
\ No newline at end of file + + * 描述:要读取的hbase字段,normal 模式与multiVersionFixedColumn 模式下必填项。 + + * name:指定读取的hbase列,除了rowkey外,必须为 列族:列名 的格式; + + * type:指定源数据的类型,format指定日期类型的格式,value指定当前类型为常量,不从hbase读取数据,而是根据value值自动生成对应的列。 + + * 必选:是 + + * 默认值:无 diff --git a/docs/hbasewriter.md b/docs/hbasewriter.md index ba627d6824..5c29c46f0c 100644 --- a/docs/hbasewriter.md +++ b/docs/hbasewriter.md @@ -4,187 +4,159 @@ ``` { - "job": { - "setting": { - "speed": { - "channel": 1 - }, - "errorLimit": { - "record": 0, - "percentage": 0.02 - } - }, - "content": [ - { - "reader": { - "name": "mysqlreader", - "parameter": { - "username": "dtstack", - "password": "abc123", - "column": [ - "col1", - "col2" - ], - "splitPk": "col1", - "connection": [ - { - "table": [ - "tb2" - ], - "jdbcUrl": [ - "jdbc:mysql://172.16.8.104:3306/test?useCursorFetch=true" - ] - } - ] - } - }, - "writer": { - "name": "hbasewriter", - "parameter": { - "hbaseConfig": { - "hbase.zookeeper.property.clientPort": "2181", - "hbase.rootdir": "hdfs://ns1/hbase", - "hbase.cluster.distributed": "true", - "hbase.zookeeper.quorum": "node01,node02,node03", - "zookeeper.znode.parent": "/hbase" - }, - "table": "tb1", - "rowkeyColumn": [ - { - "index": 0, - "type": "string" - }, - { - "value": "_postfix", - "type": "string" - } - ], - "column": [ - { - "name": "cf1:id", - "type": "string" - }, - { - "name": "cf1:vv", - "type": "string" - } - ] - } - } - } - ] - } + "job": { + "setting": { + "speed": {}, + "content": [{ + "reader": {}, + "writer": { + "name": "hbasewriter", + "parameter": { + "hbaseConfig": { + "hbase.zookeeper.property.clientPort": "2181", + "hbase.rootdir": "hdfs://ns1/hbase", + "hbase.cluster.distributed": "true", + "hbase.zookeeper.quorum": "host1,host2,host3", + "zookeeper.znode.parent": "/hbase" + }, + "table": "tableTest", + "rowkeyColumn": [{ + "index": 0, + "type": "string" + }, + { + "value": "_postfix", + "type": "string" + } + ], + "column": [{ + "name": "cf1:id", + "type": "string" + }, + { + 
"name": "cf1:vv", + "type": "string" + } + ] + } + } + }] + } + } } - ``` ## 2. 参数说明 * **hbaseConfig** + + * 描述:hbase的连接配置,以json的形式组织 (见hbase-site.xml) + + * 必选:是 + + * 默认值:无 - * 描述:hbase的连接配置,以json的形式组织 (见hbase-site.xml)
- - * 必选:是
- - * 默认值:无
- * **table** + + * 描述:hbase表名 + + * 必选:是 + + * 默认值:无 - * 描述:hbase表名
- - * 必选:是
- - * 默认值:无
- - * **column** + + * 描述:写入hbase表的若干个列,hbase表的每一列由列簇和列名组成,用":"连接 + + ``` + { + "name": "cf1:id", // 列簇:列名 + "type": "string" // 列类型 + } + ``` + + * 必选:是 + + * 默认值:无 - * 描述:写入hbase表的若干个列,hbase表的每一列由列簇和列名组成,用连接
- - ``` - { - "name": "cf1:id", // 列簇:列名 - "type": "string" // 列类型 - } - ``` - - * 必选:是
- - * 默认值:无
- * **rowkeyColumn** + + * 描述:用于构造rowkey的若干个列,每列形式如下 + + * 普通列 + + ``` + { + "index": 0, // 该列在column属性中的序号,从0开始 + "type": "string" 列的类型,默认为string + } + ``` + + * 常数列 + + ``` + { + "value": "ffff", // 常数值 + "type": "string" // 常数列的类型,默认为string + } + ``` + + * 必选:否 + + 如果不指定idColumns属性,则会随机产生文档id + + * 默认值:无 - * 描述:用于构造rowkey的若干个列,每列形式如下
- - * 普通列 - - ``` - { - "index": 0, // 该列在column属性中的序号,从0开始 - "type": "string" 列的类型,默认为string - } - ``` - - * 常数列 - - ``` - { - "value": "ffff", // 常数值 - "type": "string" // 常数列的类型,默认为string - } - ``` - - * 必选:否
- 如果不指定idColumns属性,则会随机产生文档id - - * 默认值:无
- * **versionColumn** + + * 描述:指定写入hbase的时间戳。支持:当前时间、指定时间列,指定时间,三者选一。若不配置表示用当前时间。index:指定对应reader端column的索引,从0开始,需保证能转换为long,若是Date类型,会尝试用yyyy-MM-dd HH:mm:ss和yyyy-MM-dd HH:mm:ss SSS去解析;若不指定index;value:指定时间的值,类型为字符串。配置格式如下: + + ``` + "versionColumn":{ + "index":1 + } + ``` + + 或者 + + ``` + "versionColumn":{ + "value":"123456789" + } + ``` - * 描述:指定写入hbase的时间戳。支持:当前时间、指定时间列,指定时间,三者选一。若不配置表示用当前时间。index:指定对应reader端column的索引,从0开始,需保证能转换为long,若是Date类型,会尝试用yyyy-MM-dd HH:mm:ss和yyyy-MM-dd HH:mm:ss SSS去解析;若不指定index;value:指定时间的值,类型为字符串。配置格式如下: - - ``` -"versionColumn":{ - "index":1 -} - - ``` - - 或者 - - ``` -"versionColumn":{ - "value":"123456789" -} - * **encoding** + + * 描述:字符编码 + + * 必选:无 + + * 默认值:utf-8 - * 描述:字符编码
- - * 必选:无
- - * 默认值:utf-8
- * **nullMode** + + * 描述:读取的null值时,如何处理。支持两种方式: + + * (1)skip:表示不向hbase写这列; + + * (2)empty:写入HConstants.EMPTY_BYTE_ARRAY,即new byte [0] + + * 必选:否 + + * 默认值:skip - * 描述:读取的null值时,如何处理。支持两种方式:(1)skip:表示不向hbase写这列;(2)empty:写入HConstants.EMPTY_BYTE_ARRAY,即new byte [0]
- - * 必选:否
- - * 默认值:skip
- * **writeBufferSize** + + * 描述:设置HBae client的写buffer大小,单位字节。配合autoflush使用。autoflush,开启(true)表示Hbase client在写的时候有一条put就执行一次更新;关闭(false),表示Hbase client在写的时候只有当put填满客户端写缓存时,才实际向HBase服务端发起写请求 + + * 必选:否 + + * 默认值:8M - * 描述:设置HBae client的写buffer大小,单位字节。配合autoflush使用。autoflush,开启(true)表示Hbase client在写的时候有一条put就执行一次更新;关闭(false),表示Hbase client在写的时候只有当put填满客户端写缓存时,才实际向HBase服务端发起写请求
- - * 必选:否
- - * 默认值:8M
- * **walFlag** - - * 描述:在HBae client向集群中的RegionServer提交数据时(Put/Delete操作),首先会先写WAL(Write Ahead Log)日志(即HLog,一个RegionServer上的所有Region共享一个HLog),只有当WAL日志写成功后,再接着写MemStore,然后客户端被通知提交数据成功;如果写WAL日志失败,客户端则被通知提交失败。关闭(false)放弃写WAL日志,从而提高数据写入的性能。
- - * 必选:否
- - * 默认值:false
\ No newline at end of file + + * 描述:在HBae client向集群中的RegionServer提交数据时(Put/Delete操作),首先会先写WAL(Write Ahead Log)日志(即HLog,一个RegionServer上的所有Region共享一个HLog),只有当WAL日志写成功后,再接着写MemStore,然后客户端被通知提交数据成功;如果写WAL日志失败,客户端则被通知提交失败。关闭(false)放弃写WAL日志,从而提高数据写入的性能。 + + * 必选:否 + + * 默认值:false diff --git a/docs/hdfsreader.md b/docs/hdfsreader.md index dc64cef75a..457d4ee8d4 100644 --- a/docs/hdfsreader.md +++ b/docs/hdfsreader.md @@ -4,161 +4,141 @@ ``` { - "job": { - "content": [ - { - "reader": { - "parameter": { - "path": "hdfs://ns1/user/hive/warehouse/wujing_test.db/kepa_250", - "hadoopConfig": { - "dfs.ha.namenodes.ns1": "nn1,nn2", - "dfs.namenode.rpc-address.ns1.nn2": "node03:9000", - "dfs.client.failover.proxy.provider.ns1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", - "dfs.namenode.rpc-address.ns1.nn1": "node02:9000", - "dfs.nameservices": "ns1" + "job": { + "content": [{ + "reader": { + "parameter": { + "path": "hdfs://ns1/user/hive/warehouse/wujing_test.db/test", + "hadoopConfig": { + "dfs.ha.namenodes.ns1": "nn1,nn2", + "dfs.namenode.rpc-address.ns1.nn2": "node03:9000", + "dfs.client.failover.proxy.provider.ns1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", + "dfs.namenode.rpc-address.ns1.nn1": "node02:9000", + "dfs.nameservices": "ns1" + }, + "defaultFS": "hdfs://ns1", + "column": [{ + "name": "col1", + "index": 0, + "type": "string", + "value": "", + "format": "" + }], + "fieldDelimiter": "", + "encoding": "utf-8", + "fileType": "orc" + }, + "name": "hdfsreader" }, - "column": [ - { - "name": "col1", - "index": 0, - "type": "string" - }, - { - "name": "col2", - "index": 1, - "type": "string" - } - ], - "defaultFS": "hdfs://ns1", - "fieldDelimiter": "", - "encoding": "utf-8", - "fileType": "orc" - }, - "name": "hdfsreader" - }, - "writer": { - "parameter": { - "password": "abc123", - "column": [ - "col1", - "col2" - ], - "connection": [ - { - "jdbcUrl": "jdbc:mysql://172.16.8.104:3306/test?charset=utf8", - 
"table": [ - "sb5" - ] - } - ], - "writeMode": "insert", - "username": "dtstack" - }, - "name": "mysqlwriter" - } - } - ], - "setting": { - "errorLimit": { - "record": 100 - }, - "speed": { - "bytes": 1048576, - "channel": 1 - } + "writer": {} + }], + "setting": {} } - } } ``` - ## 2. 参数说明 * **path** - - * 描述:要读取的文件路径,多个路径可以用逗号隔开 - - * 必选:是
- - * 默认值:无
+ + * 描述:要读取的文件路径,多个路径可以用逗号隔开 + + * 必选:是
+ + * 默认值:无
* **defaultFS** - - * 描述:Hadoop hdfs文件系统namenode节点地址。
- - * 必选:是
- - * 默认值:无
+ + * 描述:Hadoop hdfs文件系统namenode节点地址。
+ + * 必选:是
+ + * 默认值:无
* **fileType** - - * 描述:文件的类型,目前只支持用户配置为"text"、"orc"
- - text表示textfile文件格式 - - orc表示orcfile文件格式 - - * 必选:是
- - * 默认值:无
- + + * 描述:文件的类型,目前只支持用户配置为"text"、"orc"、“parquet” + + * text:textfile文件格式 + + * orc:orcfile文件格式 + + * parquet:parquet文件格式 + + * 必选:是
+ + * 默认值:无
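* **column** - - * Description: List of fields to read; type specifies the type of the source data. - - ```json -{ - "type": "long", - "index": 0 // read the first column of the text file as an integer field -}, -{ - "type": "string", - "value": "yesyoucan" // HdfsReader generates the constant string "yesyoucan" as this field -} - ``` - - For each user-specified column, type is required and exactly one of index/value must be given. - - * Required: yes
- - * Default: all fields are read as string
+ + * Description: Fields to read. + + * Format: three forms are supported + + 1. Read all fields; useful when there are many fields: + + ``` + "column":["*"] + ``` + + 2. Specify field names only: + + ``` + "column":["id","name"] + ``` + + 3. Specify full details: + + ``` + "column": [{ + "name": "col", + "type": "datetime", + "format": "yyyy-MM-dd hh:mm:ss", + "value": "value" + }] + ``` + + * Properties: + + * name: field name + + * index: field index; specify this when reading text-format files + + * type: field type; it may differ from the type in the data file, in which case a type conversion is performed + + * format: if the field is a time string, the time format can be given so that the field is converted to a date type + + * value: if the field does not exist in the data file, value is returned as a constant column; if the field exists but its value is null, value is returned as the default * **fieldDelimiter** - - * Description: Field delimiter used when reading.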
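- - **Note: HdfsReader needs a field delimiter when reading textfile data; no delimiter is needed when reading orcfile data.** - - * Required: no
- - * Default: \\001
- + + * Description: Field delimiter used when reading.
+ + * Note: specify this parameter when reading text-format files + + * Required: no
+ + * Default: "\001"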
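
The default delimiter `\001` is the ASCII SOH control character (code point 1), which rarely appears in ordinary text. A minimal Python sketch of how one line of a text-format file breaks into column values on that delimiter; `split_fields` is a hypothetical helper written for illustration and is not part of FlinkX:

```python
# Illustrative only: split one line of a text-format file on the default
# field delimiter "\001" (ASCII SOH). Not FlinkX code.
FIELD_DELIMITER = "\u0001"  # the same character the config writes as "\001"


def split_fields(line, delimiter=FIELD_DELIMITER):
    """Strip the trailing newline and split the line into column values."""
    return line.rstrip("\n").split(delimiter)


row = split_fields("1\u0001alice\u00012019-01-01\n")
print(row)  # ['1', 'alice', '2019-01-01']
```

Each position in the resulting list corresponds to the `index` property of a `column` entry.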
* **encoding** + + * Description: Encoding of the files to read. + * Required: no + * Default: utf-8 - * Description: Encoding of the files to read.
- - * Required: no
- - * Default: utf-8
- * **hadoopConfig** - - * Description: Advanced Hadoop-related settings, such as HA configuration, can be placed in hadoopConfig.
- - ```json - "hadoopConfig":{ - "dfs.nameservices": "testDfs", - "dfs.ha.namenodes.testDfs": "namenode1,namenode2", - "dfs.namenode.rpc-address.aliDfs.namenode1": "", - "dfs.namenode.rpc-address.aliDfs.namenode2": "", - "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" - } - ``` - - * Required: no
- - * Default: none
- - + + * Description: Advanced Hadoop-related settings, such as HA configuration, can be placed in hadoopConfig. The nameservice name in the rpc-address keys must match the value of dfs.nameservices.
+ + ``` + "hadoopConfig": { + "dfs.nameservices": "testDfs", + "dfs.ha.namenodes.testDfs": "namenode1,namenode2", + "dfs.namenode.rpc-address.testDfs.namenode1": "", + "dfs.namenode.rpc-address.testDfs.namenode2": "", + "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" + } + ``` + + * Required: no
+ + * 默认值:无 diff --git a/docs/hdfswriter.md b/docs/hdfswriter.md index d28030fe09..b33ca831d2 100644 --- a/docs/hdfswriter.md +++ b/docs/hdfswriter.md @@ -4,169 +4,120 @@ ``` { - "job": { - "setting": { - "speed": { - "channel": 1 - }, - "errorLimit": { - "record": 0, - "percentage": 0.02 - } - }, - "content": [ - { - "reader": { - "parameter": { - "password": "abc123", - "columnTypes": [ - "java.lang.Integer", - "java.lang.String" - ], - "column": [ - "col1", - "col2" - ], - "connection": [ - { - "jdbcUrl": [ - "jdbc:mysql://172.16.8.104:3306/test?charset=utf8" - ], - "table": [ - "tb2" - ] - } - ], - "splitPk": "col1", - "username": "dtstack" - }, - "name": "mysqlreader" - }, - "writer": { - "name": "hdfswriter", - "parameter": { - "hadoopConfig": { - "dfs.nameservices":"ns1", - "dfs.ha.namenodes.ns1": "nn1,nn2", - "dfs.namenode.rpc-address.ns1.nn1": "node02:9000", - "dfs.namenode.rpc-address.ns1.nn2": "node03:9000", - "dfs.client.failover.proxy.provider.ns1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" - }, - "defaultFS": "hdfs://ns1", - "fileType": "text", - "fileName": "hallo", - "column": [ - { - "name": "col1", - "type": "STRING" - }, - { - "name": "col2", - "type": "STRING" - } - ], - "path": "/hyf", - "writeMode": "append", - "fieldDelimiter": "\\001" - } - } - } - ] - } + "job": { + "setting": {}, + "content": [{ + "reader": {}, + "writer": { + "name": "hdfswriter", + "parameter": { + "hadoopConfig": { + "dfs.nameservices": "ns1", + "dfs.ha.namenodes.ns1": "nn1,nn2", + "dfs.namenode.rpc-address.ns1.nn1": "node02:9000", + "dfs.namenode.rpc-address.ns1.nn2": "node03:9000", + "dfs.client.failover.proxy.provider.ns1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" + }, + "defaultFS": "hdfs://ns1", + "fileType": "text", + "fileName": "hello", + "column": [{ + "name": "col1", + "index": 0, + "type": "STRING" + }], + "rowGroupSize": 134217728, + "compress": "SNAPPY", + "path": "/test", + "writeMode": 
"append", + "fieldDelimiter": "\\001" + } + } + }] + } } ``` ## 2. 参数说明 * **defaultFS** - - * 描述:Hadoop hdfs文件系统namenode节点地址。格式:hdfs://ip:端口;例如:hdfs://127.0.0.1:9000
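- - * Required: yes
- - * Default: none
+ + * Description: Address of the Hadoop HDFS NameNode, in the form hdfs://ip:port, for example hdfs://127.0.0.1:9000
+ + * Required: yes
+ + * Default: none
* **fileType** + + * Description: File type; currently only "text", "orc", and "parquet" are supported. + + * text: the textfile format + + * orc: the orcfile format + + * parquet: the parquet format + + * Required: yes
+ + * Default: none
- * Description: File type; currently only "text" or "orc" is supported.
- - text: the textfile format - - orc: the orcfile format - - * Required: yes
- - * Default: none
- * **path** + + * Description: Path on HDFS to write to; HdfsWriter writes multiple files under this directory according to the concurrency setting. + + * Required: yes
+ + * Default: none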
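+ +* **rowGroupSize** + + * Description: Used when writing parquet files; the size of one row group in bytes. With many fields and limited task memory, the default may cause an out-of-memory error, which can be avoided by lowering this value. A very small value produces many small row groups, which makes later processing with Hive or Spark less efficient, so tune this parameter for the actual use case. + + * Required: no + + * Default: 134217728 - * Description: Path on HDFS to write to; HdfsWriter writes multiple files under this directory according to the concurrency setting. To associate the data with a Hive table, use the table's storage path on HDFS. Example: if the Hive warehouse path is /user/hive/warehouse/, with database test and table hello, the storage path is /user/hive/warehouse/test.db/hello
- - * Required: yes
- - * Default: none
- -* **fileName** - - * Description: File name used by HdfsWriter when writing.
- - * Required: yes
- - * Default: none
- * **column** + + * Description: Fields to write. + + * name: field name + + * type: field type + + * Required: yes
+ + * Default: none
- * Description: Fields to write; writing to a subset of columns is not supported. To associate the data with a Hive table, all of the table's field names and types must be given, where name is the field name and type is the field type.
- - ```json - "column": - [ - { - "name": "userName", - "type": "string" - }, - { - "name": "age", - "type": "long" - } - ] - ``` - - * Required: yes
- - * Default: none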
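- * **writeMode** - - * Description: How hdfswriter handles existing data before writing:
- - * append: append to existing data - * overwrite: overwrite existing data - - * Required: yes
- - * Default: none
+ + * Description: How hdfswriter handles existing data before writing:
+ * append: append to existing data + + * overwrite: overwrite existing data + * Note: overwrite mode deletes all files under the write path + * Required: no + * Default: overwrite * **fieldDelimiter** - - * Description: Field delimiter used by hdfswriter when writing. **It must match the field delimiter of the Hive table, otherwise the data cannot be queried from the Hive table.**
- - * Required: yes
- - * Default: \\001
+ + * Description: Field delimiter used by hdfswriter when writing. **It must match the field delimiter of the Hive table, otherwise the data cannot be queried from the Hive table.**
+ + * Required: yes
+ + * Default: \\001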
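* **compress** - - * Description: HDFS file compression type; leaving it unset means no compression. text files support gzip and bzip2; orc files support NONE and SNAPPY (SnappyCodec must be installed).
- - * Required: no
- - * Default: no compression
+ + * Description: HDFS file compression type; leaving it unset means no compression. text files support gzip and bzip2; orc files support NONE and SNAPPY (SnappyCodec must be installed).
+ + * Required: no
+ + * Default: no compression
* **encoding** - - * Description: Encoding used when writing files.
- - * Required: no
- - * Default: utf-8, **change with caution**
+ + * Description: Encoding used when writing files.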
+ * 必选:否 + * 默认值:utf-8 diff --git a/docs/mongodbreader.md b/docs/mongodbreader.md index 856a02f7f3..72f74cac66 100644 --- a/docs/mongodbreader.md +++ b/docs/mongodbreader.md @@ -1,103 +1,144 @@ # MongoDB读取插件(mongodbreader) ## 1. 配置样例 + ```json { - "job":{ - "content":[{ - "reader":{ - "parameter":{ - "hostPorts":"localhost:27017", - "username": "", - "password": "", - "database":"", - "collectionName": "", - "column": [ - { - "name":"id", - "type":"int", - "splitter":"," - }, - { - "name":"id", - "type":"string" - } - ], - "filter": "" - }, - "name":"mongodbreader" - }, - "writer":{} - }] - } + "job":{ + "content":[{ + "reader":{ + "parameter":{ + "hostPorts":"localhost:27017", + "username": "", + "password": "", + "database":"", + "collectionName": "", + "fetchSize":100, + "column": [ + { + "name":"id", + "type":"int", + "splitter":"," + } + ], + "filter": "" + }, + "name":"mongodbreader" + }, + "writer":{} + }] + } } ``` ## 2. 参数说明 * **name** - - * 描述:插件名,此处只能填mongodbreader,否则Flinkx将无法正常加载该插件包。 - - * 必选:是 - - * 默认值:无 + + * 描述:插件名,此处只能填mongodbreader,否则Flinkx将无法正常加载该插件包。 + + * 必选:是 + + * 默认值:无 * **hostPorts** - - * 描述:MongoDB的地址和端口,格式为 IP1:port,可填写多个地址,以英文逗号分隔。 - - * 必选:是 - - * 默认值:无 + + * 描述:MongoDB的地址和端口,格式为 IP1:port,可填写多个地址,以英文逗号分隔。 + + * 必选:是 + + * 默认值:无 * **username** - - * 描述:数据源的用户名 - - * 必选:否 - - * 默认值:无 + + * 描述:数据源的用户名 + + * 必选:否 + + * 默认值:无 * **password** + + * 描述:数据源指定用户名的密码 + + * 必选:否 + + * 默认值:无 - * 描述:数据源指定用户名的密码 - - * 必选:否 - - * 默认值:无 - * **database** + + * 描述:数据库名称 + + * 必选:是 + + * 默认值:无 - * 描述:数据库名称 - - * 必选:是 - - * 默认值:无 - * **collectionName** - - * 描述:集合名称 - - * 必选:是 - - * 默认值:无 + + * 描述:集合名称 + + * 必选:是 + + * 默认值:无 * **column** + + * 描述:需要读取的字段。 + + * 格式:支持3中格式 + + 1.读取全部字段,如果字段数量很多,可以使用下面的写法: + + ``` + "column":[*] + ``` + + 2.只指定字段名称: + + ``` + "column":["id","name"] + ``` + + 3.指定具体信息: + + ``` + "column": [{ + "name": "col", + "type": "datetime", + "format": "yyyy-MM-dd hh:mm:ss", + "value": "value", + "splitter":"," + }] + ``` + + * 属性说明: + 
+ * name:字段名称 + + * type:字段类型,可以和数据库里的字段类型不一样,程序会做一次类型转换 + + * format:如果字段是时间字符串,可以指定时间的格式,将字段类型转为日期格式返回 + + * value:如果数据库里不存在指定的字段,则会把value的值作为常量列返回,如果指定的字段存在,当指定字段的值为null时,会以此value值作为默认值返回 + + * splitter:因为 MongoDB 支持数组类型,所以 MongoDB 读出来的数组类型要通过这个分隔符合并成字符串 + + * 必选:是 + + * 默认值:无 + +* **fetchSize** + + * 描述:每次读取的数据条数,通过调整此参数来优化读取速率 + + * 必选:否 + + * 默认值:100 - * 描述:MongoDB 的文档列名,配置为数组形式表示 MongoDB 的多个列。 - - name:Column 的名字。 - - type:Column 的类型。 - - splitter:因为 MongoDB 支持数组类型,所以 MongoDB 读出来的数组类型要通过这个分隔符合并成字符串。 - - * 必选:是 - - * 默认值:无 - * **filter** - - * 描述:过滤条件,通过该配置型来限制返回 MongoDB 数据范围,语法请参考[MongoDB查询语法](https://docs.mongodb.com/manual/crud/#read-operations) - - * 必选:否 - - * 默认值:无 \ No newline at end of file + + * 描述:过滤条件,通过该配置型来限制返回 MongoDB 数据范围,语法请参考[MongoDB查询语法](https://docs.mongodb.com/manual/crud/#read-operations) + + * 必选:否 + + * 默认值:无 diff --git a/docs/mongodbwriter.md b/docs/mongodbwriter.md index 80cdfbd886..838b34e412 100644 --- a/docs/mongodbwriter.md +++ b/docs/mongodbwriter.md @@ -1,124 +1,126 @@ # MongoDB写入插件(mongodbwriter) ## 1. 配置样例 + ```json { - "job":{ - "content":[{ - "reader":{}, - "writer":{ - "parameter":{ - "hostPorts":"localhost:27017", - "username": "", - "password": "", - "database":"test", - "collectionName": "test", - "writeMode": "insert", - "batchSize":1, - "column": [ - { - "name":"id", - "type":"int", - "splitter":"," - }, - { - "name":"id", - "type":"string", - "splitter":"," - } - ], - "replaceKey":"id" - }, - "name":"mongodbwriter" - } - }] - } + "job":{ + "content":[{ + "reader":{}, + "writer":{ + "parameter":{ + "hostPorts":"localhost:27017", + "username": "", + "password": "", + "database":"test", + "collectionName": "test", + "writeMode": "insert", + "batchSize":1, + "column": [ + { + "name":"id", + "type":"int", + "splitter":"," + }, + { + "name":"id", + "type":"string", + "splitter":"," + } + ], + "replaceKey":"id" + }, + "name":"mongodbwriter" + } + }] + } } ``` ## 2. 
参数说明 * **name** - - * 描述:插件名,此处只能填 mongodbwriter,否则Flinkx将无法正常加载该插件包。 - - * 必选:是 - - * 默认值:无 + + * 描述:插件名,此处只能填 mongodbwriter,否则Flinkx将无法正常加载该插件包。 + + * 必选:是 + + * 默认值:无 * **hostPorts** - - * 描述:MongoDB的地址和端口,格式为 IP1:port,可填写多个地址,以英文逗号分隔。 - - * 必选:是 - - * 默认值:无 + + * 描述:MongoDB的地址和端口,格式为 IP1:port,可填写多个地址,以英文逗号分隔。 + + * 必选:是 + + * 默认值:无 * **username** - - * 描述:数据源的用户名 - - * 必选:否 - - * 默认值:无 + + * 描述:数据源的用户名 + + * 必选:否 + + * 默认值:无 * **password** + + * 描述:数据源指定用户名的密码 + + * 必选:否 + + * 默认值:无 - * 描述:数据源指定用户名的密码 - - * 必选:否 - - * 默认值:无 - * **database** + + * 描述:数据库名称 + + * 必选:是 + + * 默认值:无 - * 描述:数据库名称 - - * 必选:是 - - * 默认值:无 - * **collectionName** - - * 描述:集合名称 - - * 必选:是 - - * 默认值:无 + + * 描述:集合名称 + + * 必选:是 + + * 默认值:无 * **column** + + * 描述:MongoDB 的文档列名,配置为数组形式表示 MongoDB 的多个列。 + + - name:Column 的名字。 + - type:Column 的类型。 + - splitter:特殊分隔符,当且仅当要处理的字符串要用分隔符分隔为字符数组 Array 时,才使用这个参数。通过这个参数指定的分隔符,将字符串分隔存储到 MongoDB 的数组中。 + + * 必选:是 + + * 默认值:无 - * 描述:MongoDB 的文档列名,配置为数组形式表示 MongoDB 的多个列。 - - name:Column 的名字。 - - type:Column 的类型。 - - splitter:特殊分隔符,当且仅当要处理的字符串要用分隔符分隔为字符数组 Array 时,才使用这个参数。通过这个参数指定的分隔符,将字符串分隔存储到 MongoDB 的数组中。 - - * 必选:是 - - * 默认值:无 - * **replaceKey** + + * 描述:replaceKey 指定了每行记录的业务主键,用来做覆盖时使用(不支持 replaceKey为多个键,一般是指Monogo中的主键)。 + + * 必选:否 + + * 默认值:无 - * 描述:replaceKey 指定了每行记录的业务主键,用来做覆盖时使用(不支持 replaceKey为多个键,一般是指Monogo中的主键)。 - - * 必选:否 - - * 默认值:无 - * **writeMode** - - * 描述:写入模式,当 batchSize > 1 时不支持 replace 和 update 模式 - - * 必选:是 - - * 所有选项:insert/replace/update - - * 默认值:insert + + * 描述:写入模式,当 batchSize > 1 时不支持 replace 和 update 模式 + + * 必选:是 + + * 所有选项:insert/replace/update + + * 默认值:insert * **batchSize** - - * 描述:一次性批量提交的记录数大小,该值可以极大减少FlinkX与MongoDB的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成FlinkX运行进程OOM情况。
- - * 必选:否 - - * 默认值:1 \ No newline at end of file + + * 描述:一次性批量提交的记录数大小,该值可以极大减少FlinkX与MongoDB的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成FlinkX运行进程OOM情况。
+ + * 必选:否 + + * 默认值:1 diff --git a/docs/mysqldreader.md b/docs/mysqldreader.md deleted file mode 100644 index 4813fc02dc..0000000000 --- a/docs/mysqldreader.md +++ /dev/null @@ -1,145 +0,0 @@ -# MySQL分库分表读取插件(mysqldreader) - -## 1. 配置样例 - -``` -{ - "job": { - "setting": { - "speed": { - "channel": 4 - }, - "errorLimit": { - "record": 0, - "percentage": 10 - } - }, - "content": [ - { - "reader": { - "parameter": { - "password": "abc123", - "username": "dtstack", - "column": [ - "col1", - "col2" - ], - "where": "id > 1", - "connection": [ - { - "password": "abc123", - "username": "dtstack", - "jdbcUrl": [ - "jdbc:mysql://172.16.8.104:3306/test?useUnicode=true&characterEncoding=utf8" - ], - "table": [ - "tb2" - ] - } - ], - "splitPk": "col1" - }, - "name": "mysqldreader" - }, - "writer": {} - } - ] - } -} - -``` - -## 2. 参数说明 - -* **name** - - * 描述:插件名,此处只能填mysqldreader,否则Flinkx将无法正常加载该插件包。 - - * 必选:是
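- - * Default: none
- -* **connection** - - * Description: Array of data sources to read. - - * Required: yes - - * Default: none - - * Elements: - - * username: user name for this data source; the global user name is used if omitted. - - * password: password for this data source; the global password is used if omitted. - - * jdbcUrl: connection URL of the data source; only a single URL is supported. - - * table: names of the tables to query; several tables may be listed, and they must share the same schema. - -* **jdbcUrl** - - * Description: JDBC connection string for the MySQL database. - - jdbcUrl follows the official MySQL format and may carry additional connection parameters. See the [official MySQL documentation](http://dev.mysql.com/doc/connector-j/en/connector-j-reference-configuration-properties.html). - - * Required: yes
- - * Default: none
- -* **username** - - * Description: Global user name for the data sources
- - * Required: no
- - * Default: none
- -* **password** - - * Description: Global password for the data sources
- - * Required: no
- - * Default: none
- -* **where** - - * Description: Filter condition. MysqldReader builds a SQL statement from the configured column, table, and where values and extracts data with it. Jobs often sync only the current day's data, in which case where can be set to gmt_create > $bizdate. Note: where cannot be set to limit 10, because limit is not a valid SQL WHERE clause.
- - The where condition is an effective way to do incremental sync. If where is not provided (missing key or empty value), FlinkX syncs the full data set. - - * Required: no
- - * Default: none
- -* **splitPk** - - * Description: When MysqldReader extracts data, specifying splitPk means the field it names is used to shard the data; FlinkX then starts concurrent tasks, which can greatly improve sync throughput. - - The table's primary key is recommended for splitPk, because primary keys are usually evenly distributed, so the resulting shards are less prone to hot spots. - - Currently splitPk only supports integer fields; `floating-point, string, date, and other types are not supported`. If an unsupported type is specified, MysqldReader reports an error. - - If splitPk is not provided or is empty, FlinkX syncs the table over a single channel. - - * Required: no
- - * Default: empty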
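
The splitPk description above says an integer key is used to shard the data across concurrent tasks, but it does not say how the shards are computed. The Python sketch below shows one common approach, contiguous key ranges of near-equal size; the function name `split_ranges` and the range-based strategy are assumptions made for illustration and may differ from what the plugin actually does (a modulo-based split is another plausible choice):

```python
# Illustrative only: one possible way to shard an integer key range.
# This is an assumption for explanation, not FlinkX's implementation.
def split_ranges(min_id, max_id, channels):
    """Split the inclusive range [min_id, max_id] into at most `channels`
    contiguous inclusive ranges whose sizes differ by at most one."""
    total = max_id - min_id + 1
    if total <= 0 or channels <= 0:
        return []
    channels = min(channels, total)  # never create empty shards
    base, extra = divmod(total, channels)
    ranges = []
    start = min_id
    for i in range(channels):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


for low, high in split_ranges(1, 10, 3):
    # Each shard would read: SELECT ... WHERE id BETWEEN low AND high
    print(low, high)
```

An evenly distributed key such as an auto-increment primary key keeps these shards balanced, which is why the primary key is the recommended splitPk.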
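- - - -* **column** - - * Description: Set of column names to sync from the configured table. - - Column pruning is supported: a subset of columns can be exported. - - Column reordering is supported: columns need not follow the table schema order. - - Constant columns are not supported yet. - - * Required: yes
- - * Default: none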
- diff --git a/docs/mysqlreader.md b/docs/mysqlreader.md deleted file mode 100644 index 52e7c97420..0000000000 --- a/docs/mysqlreader.md +++ /dev/null @@ -1,151 +0,0 @@ -# MySQL读取插件(mysqlreader) - -## 1. 配置样例 - -``` -{ - "job": { - "setting": { - "speed": { - "channel": 3, - "bytes": 0 - }, - "errorLimit": { - "record": 10000, - "percentage": 100 - }, - "dirty": { - "path": "/tmp", - "hadoopConfig": { - - } - } - }, - "content": [ - { - "reader": { - "parameter": { - "password": "abc123" - "column": [ - "col1", - "col2" - ], - "where": "id > 1", - "connection": [ - { - "jdbcUrl": [ - "jdbc:mysql://172.16.8.104:3306/test?useUnicode=true&characterEncoding=utf8" - ], - "table": [ - "tb2" - ] - } - ], - "splitPk": "col1", - "username": "dtstack" - }, - "name": "mysqlreader" - }, - "writer": { - "name": "sqlserverwriter", - "parameter": { - "batchSize": 2048, - "username": "sa", - "password": "Dtstack201610!", - "column": [ - "id", - "v" - ], - "writeMode": "replace", - "connection": [ - { - "jdbcUrl": "jdbc:jtds:sqlserver://172.16.10.46:1433;DatabaseName=dq", - "table": [ - "tb1" - ] - } - ] - } - } - } - ] - } -} - -``` - -## 2. 参数说明 - -* **name** - - * 描述:插件名,此处只能填mysqlreader,否则Flinkx将无法正常加载该插件包。 - * 必选:是
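- - * Default: none
- -* **jdbcUrl** - - * Description: JDBC connection string for the MySQL database. - - jdbcUrl follows the official MySQL format and may carry additional connection parameters. See the [official MySQL documentation](http://dev.mysql.com/doc/connector-j/en/connector-j-reference-configuration-properties.html). - - * Required: yes
- - * Default: none
- -* **username** - - * Description: User name for the data source
- - * Required: yes
- - * Default: none
- -* **password** - - * Description: Password for the given user name
- - * Required: yes
- - * Default: none
- -* **where** - - * Description: Filter condition. MysqlReader builds a SQL statement from the configured column, table, and where values and extracts data with it. Jobs often sync only the current day's data, in which case where can be set to gmt_create > $bizdate. Note: where cannot be set to limit 10, because limit is not a valid SQL WHERE clause.
- - The where condition is an effective way to do incremental sync. If where is not provided (missing key or empty value), FlinkX syncs the full data set. - - * Required: no
- - * Default: none
- -* **splitPk** - - * Description: When MysqlReader extracts data, specifying splitPk means the field it names is used to shard the data; FlinkX then starts concurrent tasks, which can greatly improve sync throughput. - - The table's primary key is recommended for splitPk, because primary keys are usually evenly distributed, so the resulting shards are less prone to hot spots. - - Currently splitPk only supports integer fields; `floating-point, string, date, and other types are not supported`. If an unsupported type is specified, MysqlReader reports an error. - - If splitPk is not provided or is empty, FlinkX syncs the table over a single channel. - - * Required: no
- - * Default: empty
- - - -* **column** - - * Description: Set of column names to sync from the configured table. - - Column pruning is supported: a subset of columns can be exported. - - Column reordering is supported: columns need not follow the table schema order. - - Constant columns are not supported yet. - - * Required: yes
- - * Default: none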
- diff --git a/docs/mysqlwriter.md b/docs/mysqlwriter.md deleted file mode 100644 index 5c050e3b48..0000000000 --- a/docs/mysqlwriter.md +++ /dev/null @@ -1,171 +0,0 @@ -# MySQL写入插件(mysqlwriter) - -## 1. 配置样例 - -``` -{ - "job": { - "setting": { - "speed": { - "channel": 3, - "bytes": 0 - }, - "errorLimit": { - "record": 10000, - "percentage": 100 - }, - "dirty": { - "path": "/tmp", - "hadoopConfig": { - "fs.default.name": "hdfs://ns1", - "dfs.nameservices": "ns1", - "dfs.ha.namenodes.ns1": "nn1,nn2", - "dfs.namenode.rpc-address.ns1.nn1": "node02:9000", - "dfs.namenode.rpc-address.ns1.nn2": "node03:9000", - "dfs.ha.automatic-failover.enabled": "true", - "dfs.client.failover.proxy.provider.ns1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", - "fs.hdfs.impl.disable.cache": "true" - } - } - }, - "content": [ - { - "reader": { - "name": "mysqlreader", - "parameter": { - "username": "dtstack", - "password": "abc123", - "column": [ - "id", - "v1" - ], - "where": "id > 1", - "connection": [ - { - "table": [ - "sb9" - ], - "jdbcUrl": [ - "jdbc:mysql://172.16.8.104:3306/test?useCursorFetch=true" - ] - } - ], - "splitPk": "id" - } - }, - "writer": { - "name": "mysqlwriter", - "parameter": { - "writeMode": "insert", - "username": "dtstack", - "password": "abc123", - "column": [ - "c1", - "c2" - ], - "batchSize": 1, - "connection": [ - { - "jdbcUrl": "jdbc:mysql://172.16.8.104:3306/test?useCursorFetch=true", - "table": [ - "tb3" - ] - } - ] - } - } - } - ] - } -} - -``` - -## 2. 参数说明 - -* **name** - - * 描述:插件名,此处只能填mysqlwriter,否则Flinkx将无法正常加载该插件包。 - * 必选:是
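- - * Default: none
- -* **jdbcUrl** - - * Description: JDBC connection string for the MySQL database. - - jdbcUrl follows the official MySQL format and may carry additional connection parameters. See the [official MySQL documentation](http://dev.mysql.com/doc/connector-j/en/connector-j-reference-configuration-properties.html). - - * Required: yes
- - * Default: none
- -* **username** - - * Description: User name for the data source
- - * Required: yes
- - * Default: none
- -* **password** - - * Description: Password for the given user name
- - * Required: yes
- - * Default: none
- -* **column** - - * Description: Fields of the target table to write, separated by commas. Example: "column": ["id","name","age"]. - - * Required: yes
- - * Default: no
- - * Default: none
- -* **preSql** - - * Description: A set of standard SQL statements executed before data is written to the target table. - - * Required: no
- - * Default: none
- -* **postSql** - - * Description: A set of standard SQL statements executed after data is written to the target table. - - * Required: no
- - * Default: none
- -* **table** - - * Description: Name of the target table. Only a single table is supported for now; multiple tables will be supported later. - - Note: table and jdbcUrl must be placed inside the connection block - - * Required: yes
- - * Default: none
- -* **writeMode** - - * Description: Controls whether data is written to the target table with an `insert into`, `replace into`, or `ON DUPLICATE KEY UPDATE` statement
- - * Required: yes
- - * Options: insert/replace/update
- - * Default: insert
- -* **batchSize** - - * Description: Number of records submitted per batch. A larger value can greatly reduce network round trips between FlinkX and MySQL and improve overall throughput, but setting it too high may cause the FlinkX process to run out of memory (OOM).
- - * Required: no
- - * Default: 1024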
\ No newline at end of file diff --git a/docs/odpsreader.md b/docs/odpsreader.md index b92d5ebf98..6f0d838a6d 100644 --- a/docs/odpsreader.md +++ b/docs/odpsreader.md @@ -4,131 +4,111 @@ ``` { - "job": { - "setting": { - "speed": { - "channel": 3, - "bytes": 10000000 - }, - "errorLimit": { - "record": 0, - "percentage": 0.02 - } - }, - "content": [ - { - "writer": { - "name": "mysqlwriter", - "parameter": { - "writeMode": "insert", - "username": "dtstack", - "password": "abc123", - "column": [ - "c1", - "c2" - ], - "batchSize": 1, - "session": [ - "set session sql_mode='ANSI'" - ], - "connection": [ - { - "jdbcUrl": "jdbc:mysql://172.16.8.104:3306/test?useCursorFetch=true", - "table": [ - "tb3" - ] - } - ] - } - }, - "reader": { - "name": "odpsreader", - "parameter": { - "odpsConfig": { - "accessId": "${odps.accessId}", - "accessKey": "${odps.accessKey}", - "project": "${odps.project}" - }, - "table": "tb252", - "partition": "pt='xxooxx'", - "column": [ - { - "name": "col1", - "type": "string" - }, - { - "name": "col2", - "type": "string" - } - ] - } - } - } - ] - } + "job": { + "setting": {}, + "content": [{ + "writer": {}, + "reader": { + "name": "odpsreader", + "parameter": { + "odpsConfig": { + "accessId": "${odps.accessId}", + "accessKey": "${odps.accessKey}", + "project": "${odps.project}" + }, + "table": "tableTest", + "partition": "pt='xxooxx'", + "column": [{ + "name": "col1", + "type": "string", + “value”:"xx", + "format":"yyyy-MM-dd HH:mm:ss" + + }] + } + } + }] + } } - ``` ## 2. 参数说明 - * **accessId** - * 描述:ODPS系统登录ID
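- - * Required: yes
- - * Default: none
+ + * Description: ODPS login ID
+ * Required: yes + * Default: none * **accessKey** - * Description: ODPS login key
- - * Required: yes
- - * Default: none
+ + * Description: ODPS login key
+ * Required: yes + * Default: none * **project** - - * Description: Name of the ODPS project that contains the table to read (case-insensitive)
- - * Required: yes
- - * Default: none
+ + * Description: Name of the ODPS project that contains the table to read (case-insensitive)
+ + * Required: yes
+ + * Default: none * **table** + + * Description: Name of the table to read (case-insensitive)
+ + * Required: yes
+ + * Default: none
- * Description: Name of the table to read (case-insensitive)
- - * Required: yes
- - * Default: none
- * **partition** + + * Description: Partition(s) to read. Linux shell wildcards are supported: * matches zero or more characters and ? matches any single character. For example, given a partitioned table test with the four partitions pt=1,ds=hangzhou, pt=1,ds=shanghai, pt=2,ds=hangzhou, and pt=2,ds=beijing: to read only pt=1,ds=shanghai, configure `"partition":["pt=1,ds=shanghai"]`; to read every partition under pt=1, configure `"partition":["pt=1,ds=*"]`; to read all partitions of the table, configure `"partition":["pt=*,ds=*"]`
+ + * Required: must be set for a partitioned table; must not be set for a non-partitioned table
+ + * Default: none
- * Description: Partition(s) to read. Linux shell wildcards are supported: * matches zero or more characters and ? matches any single character. For example, given a partitioned table test with the four partitions pt=1,ds=hangzhou, pt=1,ds=shanghai, pt=2,ds=hangzhou, and pt=2,ds=beijing: to read only pt=1,ds=shanghai, configure `"partition":["pt=1,ds=shanghai"]`; to read every partition under pt=1, configure `"partition":["pt=1,ds=*"]`; to read all partitions of the table, configure `"partition":["pt=*,ds=*"]`
- - * Required: must be set for a partitioned table; must not be set for a non-partitioned table
- - * Default: none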
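
The wildcard rules described for partition (`*` for zero or more characters, `?` for exactly one) behave like shell globbing. The Python sketch below reproduces the four-partition example with the standard `fnmatch` module; `match_partitions` is a hypothetical helper for illustration only and says nothing about how OdpsReader resolves partitions internally:

```python
# Illustrative only: shell-style wildcard matching over partition specs,
# using Python's standard fnmatch module. Not OdpsReader code.
from fnmatch import fnmatchcase


def match_partitions(partitions, pattern):
    """Return the partition specs that match a shell-style pattern."""
    return [p for p in partitions if fnmatchcase(p, pattern)]


partitions = [
    "pt=1,ds=hangzhou",
    "pt=1,ds=shanghai",
    "pt=2,ds=hangzhou",
    "pt=2,ds=beijing",
]

print(match_partitions(partitions, "pt=1,ds=*"))
# ['pt=1,ds=hangzhou', 'pt=1,ds=shanghai']
print(match_partitions(partitions, "pt=?,ds=hangzhou"))
# ['pt=1,ds=hangzhou', 'pt=2,ds=hangzhou']
```

The pattern `pt=*,ds=*` matches all four partitions, which corresponds to reading the whole table.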
- * **column** - - * 描述:读取 odps 源头表的列信息,包括需要选取的列,每列的格式如下: - * 根据字段名指定列 - - ``` -{ - "name": 'col1' //获取字段名为col1的字段 -} - ``` - - * 指定常量列 - - ``` -{ - "type": "string", - "value": "yesyoucan" //OdpsReader内部生成yesyoucan的字符串字段作为当前字段 -} - ``` - - + + * 描述:需要读取的字段。 + + * 格式:支持3中格式 + + 1.读取全部字段,如果字段数量很多,可以使用下面的写法: + + ``` + "column":[*] + ``` + + 2.只指定字段名称: + + ``` + "column":["id","name"] + ``` + + 3.指定具体信息: + + ``` + "column": [{ + "name": "col", + "type": "datetime", + "format": "yyyy-MM-dd hh:mm:ss", + "value": "value" + }] + ``` + + * 属性说明: + + * name:字段名称 + + * type:字段类型,可以和数据库里的字段类型不一样,程序会做一次类型转换 + + * format:如果字段是时间字符串,可以指定时间的格式,将字段类型转为日期格式返回 + + * value:如果数据库里不存在指定的字段,则会把value的值作为常量列返回,如果指定的字段存在,当指定字段的值为null时,会以此value值作为默认值返回 + + * 必选:是 + + * 默认值:无 diff --git a/docs/odpswriter.md b/docs/odpswriter.md index 3d3faa7c11..baddb0f64e 100644 --- a/docs/odpswriter.md +++ b/docs/odpswriter.md @@ -4,113 +4,89 @@ ``` { - "job": { - "setting": { - "speed": { - "channel": 3, - "bytes": 10000000 - }, - "errorLimit": { - "record": 0, - "percentage": 0.02 - } - }, - "content": [ - { - "reader": { - "name": "mysqlreader", - "parameter": { - "username": "dtstack", - "password": "abc123", - "column": [ - "col1", - "col2" - ], - // "splitPk": "col1", - "connection": [ - { - "table": [ - "tb2" - ], - "jdbcUrl": [ - "jdbc:mysql://172.16.8.104:3306/test?useCursorFetch=true" - ] + "job": { + "setting": {}, + "content": [{ + "reader": {}, + "writer": { + "name": "odpswriter", + "parameter": { + "odpsConfig": { + "accessId": "${odps.accessId}", + "accessKey": "${odps.accessKey}", + "project": "${odps.project}" + }, + "table": "tableTest", + "partition": "pt='xx'", + "writeMode": "append", + "bufferSize": 64, + "column": [{ + "name": "col1", + "type": "string" + }] } - ] - } - }, - "writer": { - "name": "odpswriter", - "parameter": { - "odpsConfig": { - "accessId": "${odps.accessId}", - "accessKey": "${odps.accessKey}", - "project": "${odps.project}" - }, - "table": "tb252", - "partition": 
"pt='xxooxx'", - "column": [ - { - "name": "col1", - "type": "string" - }, - { - "name": "col2", - "type": "string" - } - ] - } - } - } - ] - } + } + }] + } } ``` ## 2. 参数说明 * **accessId** - * 描述:ODPS系统登录ID
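- - * Required: yes
- - * Default: none
+ + * Description: ODPS login ID
+ * Required: yes + * Default: none * **accessKey** - * Description: ODPS login key
- - * Required: yes
- - * Default: none
+ + * Description: ODPS login key
+ * Required: yes + * Default: none * **project** - - * Description: Name of the ODPS project that contains the table to write to (case-insensitive)
- - * Required: yes
- - * Default: none
+ + * Description: Name of the ODPS project that contains the table to write to (case-insensitive)
+ + * Required: yes
+ + * Default: none * **table** + + * Description: Name of the table to write to (case-insensitive)
+ + * Required: yes
+ + * Default: none
- * Description: Name of the table to write to (case-insensitive)
- - * Required: yes
- - * Default: none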
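- * **partition** + + * Description: Partition of the target table to write to; it must be specified down to the last partition level. For example, to write into a three-level partitioned table, configure all three levels, such as pt=20150101/type=1/biz=2. + + * Required: **required for a partitioned table; must not be set for a non-partitioned table.** + + * Default: empty
- * Description: Partition of the target table to write to; it must be specified down to the last partition level. For example, to write into a three-level partitioned table, configure all three levels, such as pt=20150101/type=1/biz=2. -
- * Required: **required for a partitioned table; must not be set for a non-partitioned table.** - * Default: empty
- * **column** - - * Description: List of fields to import. To import all fields, use "column": ["*"]; to write only some ODPS columns, list them, for example "column": ["id", "name"]. ODPSWriter supports column filtering and reordering: for a table with fields a, b, c, syncing only c and b can be configured as ["c","b"], and field a is filled with null during import.
- * Required: no
- * Default: none
- - - + + * Description: List of fields to import. To import all fields, use "column": ["*"]; to write only some ODPS columns, list them, for example "column": ["id", "name"]. ODPSWriter supports column filtering and reordering: for a table with fields a, b, c, syncing only c and b can be configured as ["c","b"], and field a is filled with null during import.
+ * Required: no
+ * Default: none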
+ +* **writeMode** + + * 描述:写入模式,支持append和overwrite + + * 必填:否 + + * 默认值:append + +* **bufferSize** + + * 描述:写入缓存大小,单位兆,odps写入数据时会先缓存,达到一定值后才会写入数据,如果写入数据时出现内存溢出,可以降低此参数的值。 + + * 必填:否 + + * 默认值:64 diff --git a/docs/oraclereader.md b/docs/oraclereader.md deleted file mode 100644 index 870d9f8502..0000000000 --- a/docs/oraclereader.md +++ /dev/null @@ -1,143 +0,0 @@ -# Oracle读取插件(oraclereader) - -## 1. 配置样例 - -``` -{"job": { - "content": [ - { - "reader": { - "parameter": { - "password": "wujing", - "where": "3 > 1", - "column": [ - "ID1", - "C1", - "C2" - ], - "connection": [ - { - "jdbcUrl": [ - "jdbc:oracle:thin:@//172.16.8.121:1521/dtstack" - ], - "table": [ - "SB1" - ] - } - ], - "splitPk": "ID1", - "username": "wujing" - }, - "name": "oraclereader" - }, - "writer": { - "parameter": { - "password": "wujing", - "column": [ - "ID", - "C1", - "C2" - ], - "connection": [ - { - "jdbcUrl": "jdbc:oracle:thin:@//172.16.8.121:1521/dtstack", - "table": [ - "SB2" - ] - } - ], - "writeMode": "replace", - "username": "wujing" - }, - "name": "oraclewriter" - } - } - ], - "setting": { - "errorLimit": { - "record": 100 - }, - "speed": { - "bytes": 1048576, - "channel": 2 - } - } - } -} -``` - -## 2. 参数说明 - -* **name** - - * 描述:插件名,此处只能填oraclereader,否则Flinkx将无法正常加载该插件包。 - * 必选:是
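- - * Default: none
- -* **jdbcUrl** - - * Description: JDBC connection string for the Oracle database. - - jdbcUrl follows the official Oracle format and may carry additional connection parameters. See the [official Oracle documentation](http://www.oracle.com/technetwork/database/enterprise-edition/documentation/index.html). - - * Required: yes
- - * Default: none
- -* **username** - - * Description: User name for the data source
- - * Required: yes
- - * Default: none
- -* **password** - - * Description: Password for the given user name
- - * Required: yes
- - * Default: none
- -* **where** - - * Description: Filter condition. OracleReader builds a SQL statement from the configured column, table, and where values and extracts data with it. Jobs often sync only the current day's data, in which case where can be set to gmt_create > $bizdate. Note: where cannot be set to limit 10, because limit is not a valid SQL WHERE clause.
- - The where condition is an effective way to do incremental sync. If where is not provided (missing key or empty value), FlinkX syncs the full data set. - - * Required: no
- - * Default: none
- -* **splitPk** - - * Description: When OracleReader extracts data, specifying splitPk means the field it names is used to shard the data; FlinkX then starts concurrent tasks, which can greatly improve sync throughput. - - The table's primary key is recommended for splitPk, because primary keys are usually evenly distributed, so the resulting shards are less prone to hot spots. - - Currently splitPk only supports integer fields; `floating-point, string, date, and other types are not supported`. If an unsupported type is specified, OracleReader reports an error. - - If splitPk is not provided or is empty, FlinkX syncs the table over a single channel. - - * Required: no
- - * Default: empty
- - - -* **column** - - * Description: Set of column names to sync from the configured table. - - Column pruning is supported: a subset of columns can be exported. - - Column reordering is supported: columns need not follow the table schema order. - - Constant columns are not supported yet. - - * Required: yes
- - * Default: none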
- diff --git a/docs/oraclewriter.md b/docs/oraclewriter.md deleted file mode 100644 index e9ac5fd64e..0000000000 --- a/docs/oraclewriter.md +++ /dev/null @@ -1,156 +0,0 @@ -# Oracle写入插件(oraclewriter) - -## 1. 配置样例 - -``` -{"job": { - "content": [ - { - "reader": { - "parameter": { - "password": "wujing", - "where": "3 > 1", - "column": [ - "ID1", - "C1", - "C2" - ], - "connection": [ - { - "jdbcUrl": [ - "jdbc:oracle:thin:@//172.16.8.121:1521/dtstack" - ], - "table": [ - "SB1" - ] - } - ], - "splitPk": "ID1", - "username": "wujing" - }, - "name": "oraclereader" - }, - "writer": { - "parameter": { - "password": "wujing", - "column": [ - "ID", - "C1", - "C2" - ], - "connection": [ - { - "jdbcUrl": "jdbc:oracle:thin:@//172.16.8.121:1521/dtstack", - "table": [ - "SB2" - ] - } - ], - "writeMode": "replace", - "username": "wujing" - }, - "name": "oraclewriter" - } - } - ], - "setting": { - "errorLimit": { - "record": 100 - }, - "speed": { - "bytes": 1048576, - "channel": 2 - } - } - } -} -``` - -## 2. 参数说明 - -* **name** - - * 描述:插件名,此处只能填oraclewriter,否则Flinkx将无法正常加载该插件包。 - * 必选:是
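- - * Default: none
- -* **jdbcUrl** - - * Description: JDBC connection string for the Oracle database. - - jdbcUrl follows the official Oracle format and may carry additional connection parameters. See the [official Oracle documentation](http://www.oracle.com/technetwork/database/enterprise-edition/documentation/index.html). - - * Required: yes
- - * Default: none
- -* **username** - - * Description: User name for the data source
- - * Required: yes
- - * Default: none
- -* **password** - - * Description: Password for the given user name
- - * Required: yes
- - * Default: none
- -* **column** - - * Description: Fields of the target table to write, separated by commas. Example: "column": ["id","name","age"]. - - * Required: yes
- - * Default: no
- - * Default: none
- -* **preSql** - - * Description: A set of standard SQL statements executed before data is written to the target table. - - * Required: no
- - * Default: none
- -* **postSql** - - * Description: A set of standard SQL statements executed after data is written to the target table. - - * Required: no
- - * Default: none
- -* **table** - - * Description: Name of the target table. Only a single table is supported for now; multiple tables will be supported later. - - Note: table and jdbcUrl must be placed inside the connection block - - * Required: yes
- - * Default: none
- -* **writeMode** - - * Description: Controls whether data is written to the target table with an `insert into`, `replace into`, or `ON DUPLICATE KEY UPDATE` statement
- **In Oracle, the latter two write semantics are emulated with merge into.** - * Required: yes
- - * Options: insert/replace/update
- - * Default: insert
- -* **batchSize** - - * Description: Number of records submitted per batch - - * Required: no
- - * Default: 1024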
\ No newline at end of file diff --git a/docs/postgresqlreader.md b/docs/postgresqlreader.md deleted file mode 100644 index 1870cbb673..0000000000 --- a/docs/postgresqlreader.md +++ /dev/null @@ -1,101 +0,0 @@ -# PostgreSQL读取插件(postgresqlreader) - -## 1. 配置样例 -```json -{ - "job":{ - "content":[{ - "reader":{ - "parameter":{ - "username": "postgres", - "password": "postgres", - "connection": [{ - "jdbcUrl": ["jdbc:postgresql://localhost:5432/postgres"], - "table": ["tableTest"] - }], - "splitPk": "id", - "column": ["id", "tenant_id"], - "where": "id > 0" - }, - "name":"postgresqlreader" - }, - "writer":{} - }] - } -} -``` - -## 2. 参数说明 - -* **name** - - * 描述:插件名,此处只能填postgresqlreader,否则Flinkx将无法正常加载该插件包。 - * 必选:是
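- - * Default: none
- -* **jdbcUrl** - - * Description: JDBC connection string for the PostgreSQL database. - - jdbcUrl follows the official PostgreSQL format and may carry additional connection parameters. See the [official PostgreSQL documentation](https://jdbc.postgresql.org/documentation/head/connect.html). - - * Required: yes
- - * Default: none
- -* **username** - - * Description: User name for the data source
- - * Required: yes
- - * Default: none
- -* **password** - - * Description: Password for the given user name
- - * Required: yes
- - * Default: none
- -* **where** - - * Description: Filter condition. PostgreSQLReader builds a SQL statement from the configured column, table, and where values and extracts data with it. Jobs often sync only the current day's data, in which case where can be set to gmt_create > $bizdate. Note: where cannot be set to limit 10, because limit is not a valid SQL WHERE clause.
- - The where condition is an effective way to do incremental sync. If where is not provided (missing key or empty value), FlinkX syncs the full data set. - - * Required: no
- - * Default: none
- -* **splitPk** - - * Description: When PostgreSQLReader extracts data, specifying splitPk means the field it names is used to shard the data; FlinkX then starts concurrent tasks, which can greatly improve sync throughput. - - The table's primary key is recommended for splitPk, because primary keys are usually evenly distributed, so the resulting shards are less prone to hot spots. - - Currently splitPk only supports integer fields; `floating-point, string, date, and other types are not supported`. If an unsupported type is specified, PostgreSQLReader reports an error. - - If splitPk is not provided or is empty, FlinkX syncs the table over a single channel. - - * Required: no
- - * Default: empty
- - - -* **column** - - * Description: Set of column names to sync from the configured table. - - Column pruning is supported: a subset of columns can be exported. - - Column reordering is supported: columns need not follow the table schema order. - - Constant columns are not supported yet. - - * Required: yes
- - * Default: none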
\ No newline at end of file diff --git a/docs/postgresqlwriter.md b/docs/postgresqlwriter.md deleted file mode 100644 index d769abcaff..0000000000 --- a/docs/postgresqlwriter.md +++ /dev/null @@ -1,120 +0,0 @@ -# PostgreSQL写入插件(postgresqlwriter) - -## 1. 配置样例 -```json -{ - "job":{ - "content":[{ - "reader":{}, - "writer":{ - "parameter":{ - "postSql": [], - "password": "postgres", - "session": [], - "column": ["id", "data_name"], - "connection": [{ - "jdbcUrl": "jdbc:postgresql://localhost:5432/postgres", - "table": ["table1"] - }], - "writeMode": "insert", - "preSql": [], - "username": "postgres" - }, - "name":"postgresqlwriter" - } - }] - } -} -``` - - -## 2. 参数说明 - -* **name** - - * 描述:插件名,此处只能填postgresqlwriter,否则Flinkx将无法正常加载该插件包。 - * 必选:是
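- - * Default: none
- -* **jdbcUrl** - - * Description: JDBC connection string for the PostgreSQL database - - jdbcUrl follows the official PostgreSQL format and may carry additional connection control parameters. See the [official PostgreSQL documentation](https://jdbc.postgresql.org/documentation/head/connect.html) for details. - - * Required: yes
- - * Default: none
- -* **username** - - * Description: user name for the data source
- - * Required: yes
- - * Default: none
- -* **password** - - * Description: password for the data source user
- - * Required: yes
- - * Default: none
- -* **column** - - * Description: the fields of the target table to write, separated by commas. Example: "column": ["id","name","age"]. - - * Required: yes
- - * Default: no
- - * Default: none
- -* **preSql** - - * Description: a set of standard SQL statements executed before data is written to the target table. - - * Required: no
- - * Default: none
- -* **postSql** - - * Description: a set of standard SQL statements executed after data is written to the target table. - - * Required: no
- - * Default: none
- -* **table** - - * Description: name of the target table. Only a single table is supported for now; multiple tables will be supported later. - - Note: table and jdbcUrl must be placed inside the connection block - - * Required: yes
- - * Default: none
- -* **writeMode** - - * Description: controls whether data is written to the target table with an `insert into` statement or an `insert into .... on conflict(id) do update set ..` statement.
- - Note: PostgreSQL versions earlier than 9.5 do not support the `insert into .... on conflict(id) do update set ..` syntax, so the update and replace modes of the PostgreSQLWriter plugin cannot be used with PostgreSQL older than 9.5 - - * Required: yes
- - * Options: insert/replace/update
- - * Default: insert
- -* **batchSize** - - * Description: number of records submitted per batch. A larger value can greatly reduce the number of network round trips between FlinkX and PostgreSQL and improve overall throughput, but setting it too high may cause the FlinkX process to run out of memory (OOM).
- - * Required: no
- - * Default: 1024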
\ No newline at end of file diff --git a/docs/rdbdreader.md b/docs/rdbdreader.md new file mode 100644 index 0000000000..364e9258dd --- /dev/null +++ b/docs/rdbdreader.md @@ -0,0 +1,158 @@ +# 分库分表读取插件(**dreader) + +## 1. 配置样例 + +``` +{ + "job": { + "setting": {}, + "content": [ + { + "reader": { + "parameter": { + "password": "abc123", + "username": "dtstack", + "column": [ + "col1", + "col2" + ], + "where": "id > 1", + "connection": [ + { + "password": "abc123", + "username": "dtstack", + "jdbcUrl": [ + "jdbc:mysql://127.0.0.1:3306/test?useUnicode=true&characterEncoding=utf8" + ], + "table": [ + "tb2" + ] + } + ], + "splitPk": "id" + }, + "name": "mysqldreader" + }, + "writer": {} + } + ] + } +} +``` + +## 2. 参数说明 + +* **name** + + * 描述:插件名,目前只支持mysql的分库分表读取,mysqldreader。 + + * 必选:是
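+ + * Default: none
+ +* **connection** + + * Description: array of data sources to read from. + + * Required: yes + + * Default: none + + * Elements: + + * username: user name for this data source; if omitted, the global user name is used. + + * password: password for this data source; if omitted, the global password is used. + + * jdbcUrl: connection URL of the data source; only a single URL is supported. + + * table: names of the tables to query; several tables may be listed, but they must all share the same structure. + +* **jdbcUrl** + + * Description: JDBC connection string for the relational database + + * Required: yes
+ + * Default: none
+ +* **username** + + * Description: global user name for the data sources
+ + * Required: no
+ + * Default: none
+ +* **password** + + * Description: global password for the data sources
+ + * Required: no
+ + * Default: none
+ +* **where** + + * Description: filter condition. MysqldReader builds a SQL statement from the configured column, table, and where values and extracts data with it. In practice a job often synchronizes only the current day's data, in which case where can be set to gmt_create > $bizdate. Note: where cannot be set to limit 10, because limit is not a valid SQL where clause. The where condition is an effective way to do incremental synchronization. If where is not set, including when its key or value is missing, FlinkX synchronizes the full data set + + * Required: no
+ + * Default: none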
+ +* **splitPk** + + * 描述:当speed配置中的channel大于1时指定此参数,Reader插件根据并发数和此参数指定的字段拼接sql,使每个并发读取不同的数据,提升读取速率。 + + * 注意: + + * 推荐splitPk使用表主键,因为表主键通常情况下比较均匀,因此切分出来的分片也不容易出现数据热点。 + * 目前splitPk仅支持整形数据切分,`不支持浮点、字符串、日期等其他类型`。如果用户指定其他非支持类型,MysqlReader将报错! + * 如果channel大于1但是没有配置此参数,任务将置为失败。 + + * 必选:否 + + * 默认值:空 + +* **column** + + * 描述:需要读取的字段。 + + * 格式:支持3中格式 + + 1.读取全部字段,如果字段数量很多,可以使用下面的写法: + + ``` + "column":[*] + ``` + + 2.只指定字段名称: + + ``` + "column":["id","name"] + ``` + + 3.指定具体信息: + + ``` + "column": [{ + "name": "col", + "type": "datetime", + "format": "yyyy-MM-dd hh:mm:ss", + "value": "value" + }] + ``` + + * 属性说明: + + * name:字段名称 + + * type:字段类型,可以和数据库里的字段类型不一样,程序会做一次类型转换 + + * format:如果字段是时间字符串,可以指定时间的格式,将字段类型转为日期格式返回 + + * value:如果数据库里不存在指定的字段,则会把value的值作为常量列返回,如果指定的字段存在,当指定字段的值为null时,会以此value值作为默认值返回 + + * 必选:是 + + * 默认值:无 diff --git a/docs/rdbreader.md b/docs/rdbreader.md new file mode 100644 index 0000000000..f30a661b58 --- /dev/null +++ b/docs/rdbreader.md @@ -0,0 +1,241 @@ +# 关系数据库读取插件(*reader) + +## 1. 配置样例 + +``` +{ + "job": { + "content": [{ + "reader": { + "parameter": { + "username": "username", + "password": "password", + "connection": [{ + "jdbcUrl": [ + "jdbc:mysql://127.0.0.1:3306/test?useUnicode=true&characterEncoding=utf8" + ], + "table": [ + "tableTest" + ] + }], + "column": [{ + "name": "id", + "type": "int", + "values": 123 + },{ + "name":"", + "index":1, + "type":"", + "value":"", + "format":"" + }], + "where": "id > 1", + "splitPk": "id", + "fetchSize": 1000, + "queryTimeOut": 1000, + "customSql": "select * from tableTest", + "requestAccumulatorInterval": 2, + "increColumn": "id", + "startLocation": null, + "useMaxFunc": true + }, + "name": "mysqlreader" + }, + "writer": { + + } + }] + }, + "setting": { + + } +} +``` + +## 2. 
参数说明 + +* **name** + + * 描述:插件名,此处填写插件名称,当前支持的关系数据库插件包括:mysqlreader,oraclereader,sqlserverreader,postgresqlreader,db2reader。 + * 必选:是 + + * 默认值:无 + +* **jdbcUrl** + + * 描述:针对关系型数据库的jdbc连接字符串 + + jdbcUrl参考文档: + + - [Mysql官方文档](http://dev.mysql.com/doc/connector-j/en/connector-j-reference-configuration-properties.html) + + - [Oracle官方文档](http://www.oracle.com/technetwork/database/enterprise-edition/documentation/index.html) + + - [SqlServer官方文档](https://docs.microsoft.com/zh-cn/sql/connect/jdbc/overview-of-the-jdbc-driver?view=sql-server-2017) + + - [PostgreSql官方文档](https://jdbc.postgresql.org/documentation/head/connect.html) + + - [Db2官方文档](https://www.ibm.com/analytics/us/en/db2/) + + * 必选:是 + + * 默认值:无 + +* **username** + + * 描述:数据源的用户名 + + * 必选:是 + + * 默认值:无 + +* **password** + + * 描述:数据源指定用户名的密码 + + * 必选:是 + + * 默认值:无 + +* **where** + + * 描述:筛选条件,reader插件根据指定的column、table、where条件拼接SQL,并根据这个SQL进行数据抽取。在实际业务场景中,往往会选择当天的数据进行同步,可以将where条件指定为gmt_create > time。 + + * 注意:不可以将where条件指定为limit 10,limit不是SQL的合法where子句。 + + * 必选:否 + + * 默认值:无 + +* **splitPk** + + * 描述:当speed配置中的channel大于1时指定此参数,Reader插件根据并发数和此参数指定的字段拼接sql,使每个并发读取不同的数据,提升读取速率。 + + * 注意: + + * 推荐splitPk使用表主键,因为表主键通常情况下比较均匀,因此切分出来的分片也不容易出现数据热点。 + * 目前splitPk仅支持整形数据切分,`不支持浮点、字符串、日期等其他类型`。如果用户指定其他非支持类型,MysqlReader将报错! 
+ * 如果channel大于1但是没有配置此参数,任务将置为失败。 + + * 必选:否 + + * 默认值:空 + +* **fetchSize** + + * 描述:读取时每批次读取的数据条数。 + + * 注意:此参数的值不可设置过大,否则会读取超时,导致任务失败。 + + * 必选:否 + + * 默认值:mysql为0,表示流式读取,其它数据库为1000 + +* **queryTimeOut** + + * 描述:查询超时时间,单位秒。 + + * 注意:当数据量很大,或者从视图查询,或者自定义sql查询时,可通过此参数指定超时时间。 + + * 必选:否 + + * 默认值:1000s + +* **customSql** + + * 描述:自定义的查询语句,如果只指定字段不能满足需求时,可通过此参数指定查询的sql,可以是任意复杂的查询语句。 + + * 注意: + + * 只能是查询语句,否则会导致任务失败; + + * 查询语句返回的字段需要和column列表里的字段对应; + + * 当指定了此参数时,connection里指定的table无效,但是在一些情况下依然必须指定,比如使用增量同步的时候; + + * 必选:否 + + * 默认值:null + +* **increColumn** + + * 描述:当需要增量同步时指定此参数,任务运行过程中会把此字段的值存储到flink的Accumulator里,如果配置了指标,名称为:endLocation,类型为string,日期类型会转为时间戳,精度最多到纳秒,数值类型的为字段的值,程序结束时由外部应用获取。 + + * 注意: + + * 指定的字段必须在column列表里存在,否则任务会失败; + + * 增量字段支持数值类型和日期类型,并且是升序的,推荐使用表主键; + + * 必选:否 + + * 默认值:无 + +* **startLocation** + + * 描述:此配置参数和increColumn参数配合使用,表示本次任务获取数据的开始位置。 + + * 注意: + + * 此参数为空时进行全量同步 + + * 必选:否 + + * 默认值:无 + +* **useMaxFunc** + + * 描述:进行增量同步任务时,如果指定的字段值存在重复值,比如字段类型为时间,精度到秒,就可能出现重复的时间,需要指定此字段为true,读取数据前会获取增量字段的最大值作为此次任务的结束位置,防止数据丢失。 + + * 注意: + + * 此参数设为true时,会执行select max(increCol) from tb语句,会影响数据库负载,配置时需要考虑数据库的使用情况; + + * 此参数设置为true时,本次任务不会读取 increCol = max(increCol)的记录,会在任务下次运行时读取; + + * 必选:否 + + * 默认:false + +* **column** + + * 描述:需要读取的字段。 + + * 格式:支持3中格式 + + 1.读取全部字段,如果字段数量很多,可以使用下面的写法: + + ``` + "column":[*] + ``` + + 2.只指定字段名称: + + ``` + "column":["id","name"] + ``` + + 3.指定具体信息: + + ``` + "column": [{ + "name": "col", + "type": "datetime", + "format": "yyyy-MM-dd hh:mm:ss", + "value": "value" + }] + ``` + + * 属性说明: + + * name:字段名称 + + * type:字段类型,可以和数据库里的字段类型不一样,程序会做一次类型转换 + + * format:如果字段是时间字符串,可以指定时间的格式,将字段类型转为日期格式返回 + + * value:如果数据库里不存在指定的字段,则会把value的值作为常量列返回,如果指定的字段存在,当指定字段的值为null时,会以此value值作为默认值返回 + + * 必选:是 + + * 默认值:无 diff --git a/docs/rdbwriter.md b/docs/rdbwriter.md new file mode 100644 index 0000000000..092fccdd2a --- /dev/null +++ b/docs/rdbwriter.md @@ -0,0 +1,134 @@ +# MySQL写入插件(*writer) + +## 1. 
配置样例 + +``` +{ + "job": { + "content": [{ + "reader": {}, + "writer": { + "name": "*writer", + + "parameter": { + "connection": [{ + "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/test?useCursorFetch=true", + "table": [ + "tableTest" + ] + }], + "username": "username", + "password": "password", + "column": [], + + "writeMode": "insert", + "batchSize": 1024, + "preSql": "", + "postSql": "", + "updateKey": "" + } + } + }] + }, + "setting": {} +} +``` + +## 2. 参数说明 + +* **name** + + * 描述:插件名,此处可填写:mysqlwriter,oraclewriter,sqlserverwriter,postgresqlwriter,db2writer + * 必选:是 + + 默认值:无 + +* **jdbcUrl** + + * 描述:针对关系型数据库的jdbc连接字符串 + + * 必选:是 + + * 默认值:无 + +* **username** + + * 描述:数据源的用户名 + + * 必选:是 + + * 默认值:无 + +* **password** + + * 描述:数据源指定用户名的密码 + + * 必选:是 + + * 默认值:无 + +* **column** + + * 描述:目的表需要写入数据的字段,字段之间用英文逗号分隔。例如: "column": ["id","name","age"]。 + + * 必选:是 + + * 默认值:否 + + * 默认值:无 + +* **preSql** + + * 描述:写入数据到目的表前,会先执行这里的一组标准语句。 + + * 必选:否 + + * 默认值:无 + +* **postSql** + + * 描述:写入数据到目的表后,会执行这里的一组标准语句。 + + * 必选:否 + + * 默认值:无 + +* **table** + + * 描述:目的表的表名称。目前只支持配置单个表,后续会支持多表。 + + * 必选:是 + + * 默认值:无 + +* **writeMode** + + * 描述:控制写入数据到目标表采用 `insert into` 或者 `replace into` 或者 `ON DUPLICATE KEY UPDATE` 语句 + + * 必选:是 + + * 所有选项:insert/replace/update + + * 默认值:insert + +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少FlinkX与数据库的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成FlinkX运行进程OOM情况。 + + * 必选:否 + + * 默认值:1024 + +* **updateKey** + + * 描述:当写入模式为update和replace时,需要指定此参数的值为唯一索引字段。 + + * 注意: + + * 如果此参数为空,并且写入模式为update和replace时,应用会自动获取数据库中的唯一索引; + + * 如果数据表没有唯一索引,但是写入模式配置为update和replace,应用会以insert的方式写入数据; + + * 必选:否 + + * 默认值:无 diff --git a/docs/rediswriter.md b/docs/rediswriter.md index e02a2b5264..366664ab63 100644 --- a/docs/rediswriter.md +++ b/docs/rediswriter.md @@ -1,147 +1,145 @@ # Redis写入插件(rediswriter) ## 1. 
配置样例 + ```json { - "job":{ - "content":[{ - "reader":{}, - "writer":{ - "parameter":{ - "hostPort":"localhost:6379", - "password": "密码", - "database":1, - "keyIndexes": [0,2], - "writeMode":"", - "keyFieldDelimiter": "\u0001", - "expireTime": 1000, - "timeout": 10000, - "dateFormat": "yyyy-MM-dd HH:mm:ss", - "type": "string", - "mode": "set", - "valueFieldDelimiter": "\u0001" - }, - "name":"rediswriter" - } - }] - } + "job":{ + "content":[{ + "reader":{}, + "writer":{ + "parameter":{ + "hostPort":"localhost:6379", + "password": "密码", + "database":1, + "keyIndexes": [0,2], + "writeMode":"", + "keyFieldDelimiter": "\u0001", + "expireTime": 1000, + "timeout": 10000, + "dateFormat": "yyyy-MM-dd HH:mm:ss", + "type": "string", + "mode": "set", + "valueFieldDelimiter": "\u0001" + }, + "name":"rediswriter" + } + }] + } } ``` ## 2. 参数说明 * **name** + + * 描述:插件名,此处只能填rediswriter,否则Flinkx将无法正常加载该插件包。 + * 必选:是 + * 默认值:无 - * 描述:插件名,此处只能填rediswriter,否则Flinkx将无法正常加载该插件包。 - * 必选:是
- - * 默认值:无
- * **hostPort** + + * 描述:Redis的IP地址和端口 + + * 必选:是 + + * 默认值:localhost:6379 - * 描述:Redis的IP地址和端口 - - * 必选:是 - - * 默认值:localhost:6379 - * **password** + + * 描述:数据源指定用户名的密码 + + * 必选:是 + + * 默认值:无 - * 描述:数据源指定用户名的密码 - - * 必选:是 - - * 默认值:无 - * **database** + + * 描述:要写入Redis数据库 + + * 必选:否 + + * 默认值:0 - * 描述:要写入Redis数据库 - - * 必选:否 - - * 默认值:0 - * **keyIndexes** + + * 描述:keyIndexes 表示源端哪几列需要作为 key(第一列是从 0 开始)。如果是第一列和第二列需要组合作为 key,那么 keyIndexes 的值则为 [0,1]。 + + * 注意:配置 keyIndexes 后,Redis Writer 会将其余的列作为 value,如果您只想同步源表的某几列作为 key,某几列作为 value,不需要同步所有字段,那么在 Reader 插件端就指定好 column 作好列筛选即可。 + + * 必选:是 + + * 默认值:无 - * 描述:keyIndexes 表示源端哪几列需要作为 key(第一列是从 0 开始)。如果是第一列和第二列需要组合作为 key, - 那么 keyIndexes 的值则为 [0,1]。 - - 注意:配置 keyIndexes 后,Redis Writer 会将其余的列作为 value,如果您只想同步源表的某几列作为 key,某几列作为 value,不需要同步所有字段,那么在 Reader 插件端就指定好 column 作好列筛选即可。 - - * 必选:是 - - * 默认值:无 - * **keyFieldDelimiter** + + * 描述:写入 Redis 的 key 分隔符。比如: key=key1\u0001id,如果 key 有多个需要拼接时,该值为必填项,如果 key 只有一个则可以忽略该配置项。 + + * 必选:否 + + * 默认值:\u0001 - * 描述:写入 Redis 的 key 分隔符。比如: key=key1\u0001id,如果 key 有多个需要拼接时,该值为必填项,如果 key 只有一个则可以忽略该配置项。 - - * 必选:否 - - * 默认值:\u0001 - * **expireTime** + + * 描述:Redis value 值缓存失效时间(如果需要永久有效则可以不填该配置项)。 + + * 注意:如果过期时间的秒数大于 60*60*24*30(即 30 天),则服务端认为是 Unix 时间,该时间指定了到未来某个时刻数据失效。否则为相对当前时间的秒数,该时间指定了从现在开始多长时间后数据失效。 + + * 必选:否 + + * 默认值:0(0 表示永久有效) - * 描述:Redis value 值缓存失效时间(如果需要永久有效则可以不填该配置项)。 - - 注意:如果过期时间的秒数大于 60*60*24*30(即 30 天),则服务端认为是 Unix 时间,该时间指定了到未来某个时刻数据失效。否则为相对当前时间的秒数,该时间指定了从现在开始多长时间后数据失效。 - - * 必选:否 - - * 默认值:0(0 表示永久有效) - * **timeout** + + * 描述:写入 Redis 的超时时间。 + + * 单位:毫秒 + + * 必选:否 + + * 默认值:30000 - * 描述:写入 Redis 的超时时间。 - - * 单位:毫秒 - - * 必选:否 - - * 默认值:30000 - * **dateFormat** + + * 描述:写入 Redis 时,Date 的时间格式:”yyyy-MM-dd HH:mm:ss” + + * 必选:否 + + * 默认值:将日期以long类型写入 - * 描述:写入 Redis 时,Date 的时间格式:”yyyy-MM-dd HH:mm:ss” - - * 必选:否 - - * 默认值:将日期以long类型写入 - * **writeMode** - - * 描述:写入模式,由于 Redis 的数据结构为key-value模式,因此只要key相同,就会覆盖value值 - - * 必选:是 - - * 所有选项:insert - - * 默认值:insert - + + * 描述:写入模式,由于 
Redis 的数据结构为key-value模式,因此只要key相同,就会覆盖value值 + + * 必选:是 + + * 所有选项:insert + + * 默认值:insert * **valueFieldDelimiter** + + * 描述:该配置项是考虑了当源数据每行超过两列的情况(如果您的源数据只有两列即 key 和 value 时,那么可以忽略该配置项,不用填写),value 类型是 string 时,value 之间的分隔符,比如 value1\u0001value2\u0001value3。 + + * 必选:否 + + * 默认值:\u0001 - * 描述:该配置项是考虑了当源数据每行超过两列的情况(如果您的源数据只有两列即 key 和 value 时,那么可以忽略该配置项,不用填写),value 类型是 string 时,value 之间的分隔符,比如 value1\u0001value2\u0001value3。 - - * 必选:否 - - * 默认值:\u0001 - * **type和mode** - - * 描述:type 表示 value 的类型,mode 表示在选定的数据类型下的写入模式。 - - * 选项:string/list/set/zset/hash - - | type | 描述 | mode | 说明 | 注意 | - | --- | --- | --- | --- | --- | - | string | 字符串 | set | 存储这个数据,如果已经存在则覆盖 | | - | list | 字符串列表 | lpush | 在 list 最左边存储这个数据 | | - | list | 字符串列表 | rpush | 在 list 最右边存储这个数据 | | - | set | 字符串集合 | sadd | 向 set 集合中存储这个数据,如果已经存在则覆盖 | | - | zset | 有序字符串集合 | zadd | 向 zset 有序集合中存储这个数据,如果已经存在则覆盖 | 当 value 类型是 zset 时,数据源的每一行记录需要遵循相应的规范,即每一行记录除 key 以外,只能有一对 score 和 value,并且 score 必须在 value 前面,rediswriter 方能解析出哪一个 column 是 score,哪一个 column 是 value。 | - | hash | 哈希 | hset | 向 hash 有序集合中存储这个数据,如果已经存在则覆盖 | 当 value 类型是 hash 时,数据源的每一行记录需要遵循相应的规范,即每一行记录除 key 以外,只能有一对 attribute 和 value,并且 attribute 必须在 value 前面,Rediswriter 方能解析出哪一个 column 是 attribute,哪一个 column 是 value。 | - - * 必选:是 - - * 默认值:无 \ No newline at end of file + + * 描述:type 表示 value 的类型,mode 表示在选定的数据类型下的写入模式。 + + * 选项:string/list/set/zset/hash + + | type | 描述 | mode | 说明 | 注意 | + | ------ | ------- | ----- | ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | + | string | 字符串 | set | 存储这个数据,如果已经存在则覆盖 | | + | list | 字符串列表 | lpush | 在 list 最左边存储这个数据 | | + | list | 字符串列表 | rpush | 在 list 最右边存储这个数据 | | + | set | 字符串集合 | sadd | 向 set 集合中存储这个数据,如果已经存在则覆盖 | | + | zset | 有序字符串集合 | zadd | 向 zset 有序集合中存储这个数据,如果已经存在则覆盖 | 当 value 类型是 zset 时,数据源的每一行记录需要遵循相应的规范,即每一行记录除 key 以外,只能有一对 score 和 value,并且 score 必须在 
value 前面,rediswriter 方能解析出哪一个 column 是 score,哪一个 column 是 value。 | + | hash | 哈希 | hset | 向 hash 有序集合中存储这个数据,如果已经存在则覆盖 | 当 value 类型是 hash 时,数据源的每一行记录需要遵循相应的规范,即每一行记录除 key 以外,只能有一对 attribute 和 value,并且 attribute 必须在 value 前面,Rediswriter 方能解析出哪一个 column 是 attribute,哪一个 column 是 value。 | + + * 必选:是 + + * 默认值:无 diff --git a/docs/sqlserverreader.md b/docs/sqlserverreader.md deleted file mode 100644 index 1977bd9c65..0000000000 --- a/docs/sqlserverreader.md +++ /dev/null @@ -1,161 +0,0 @@ -# SQLServer读取插件(sqlserverreader) - -## 1. 配置样例 - -``` -{ - "job": { - "setting": { - "speed": { - "channel": 3, - "bytes": 0 - }, - "errorLimit": { - "record": 10000, - "percentage": 100 - }, - "dirty": { - "path": "/tmp", - "hadoopConfig": { - "fs.default.name": "hdfs://ns1", - "dfs.nameservices": "ns1", - "dfs.ha.namenodes.ns1": "nn1,nn2", - "dfs.namenode.rpc-address.ns1.nn1": "node02:9000", - "dfs.namenode.rpc-address.ns1.nn2": "node03:9000", - "dfs.ha.automatic-failover.enabled": "true", - "dfs.client.failover.proxy.provider.ns1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", - "fs.hdfs.impl.disable.cache": "true" - } - } - }, - "content": [ - { - "reader": { - "name": "sqlserverreader", - "parameter": { - "username": "sa", - "password": "Dtstack201610!", - "column": [ - "id", - "v" - ], - "where": "id > 1", - "connection": [ - { - "table": [ - "tb1" - ], - "jdbcUrl": [ - "jdbc:jtds:sqlserver://172.16.10.46:1433;DatabaseName=dq" - ] - } - ], - "splitPk": "id" - } - }, - "writer": { - "name": "mysqlwriter", - "parameter": { - "writeMode": "insert", - "username": "dtstack", - "password": "abc123", - "column": [ - "c1", - "c2" - ], - "batchSize": 1, - "session": [ - "set session sql_mode='ANSI'" - ], - "connection": [ - { - "jdbcUrl": "jdbc:mysql://172.16.8.104:3306/test?useCursorFetch=true", - "table": [ - "tb3" - ] - } - ] - } - } - } - ] - } -} - -``` - -## 2. 参数说明 - -* **name** - - * 描述:插件名,此处只能填sqlserverreader,否则FlinkX将无法正常加载该插件包。 - * 必选:是
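- - * Default: none
- -* **jdbcUrl** - - * Description: JDBC connection string for the SQLServer database - - jdbcUrl follows the official SqlServer format and may carry additional connection control parameters. See the [official SqlServer documentation](http://technet.microsoft.com/zh-cn/library/ms378749(v=SQL.110).aspx) for details. - - * Required: yes
- - * Default: none
- -* **username** - - * Description: user name for the data source
- - * Required: yes
- - * Default: none
- -* **password** - - * Description: password for the data source user
- - * Required: yes
- - * Default: none
- -* **where** - - * Description: filter condition. sqlserverreader builds a SQL statement from the configured column, table, and where values and extracts data with it. In practice a job often synchronizes only the current day's data, in which case where can be set to gmt_create > $bizdate. Note: where cannot be set to limit 10, because limit is not a valid SQL where clause.
- - The where condition is an effective way to do incremental synchronization. If where is not set, including when its key or value is missing, FlinkX synchronizes the full data set. - - * Required: no
- - * Default: none
- -* **splitPk** - - * Description: when splitPk is set, sqlserverreader shards the data on the field it names, so FlinkX starts concurrent tasks for the synchronization, which can greatly improve throughput. - - Using the table's primary key as splitPk is recommended: primary keys are usually evenly distributed, so the resulting shards are less prone to data hot spots. - - Currently splitPk supports only integer columns; `floating-point, string, date, and other types are not supported`. If an unsupported type is specified, MysqlReader reports an error. - - If splitPk is not set, including when it is missing or empty, FlinkX synchronizes the table over a single channel. - - * Required: no
- - * Default: empty
- - - -* **column** - - * Description: the set of column names to synchronize from the configured table. - - Column pruning is supported: a subset of columns can be exported. - - Column reordering is supported: columns need not follow the order of the table schema. - - Constant columns are not supported yet. - - * Required: yes
- - * Default: none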
- diff --git a/docs/sqlserverwriter.md b/docs/sqlserverwriter.md deleted file mode 100644 index f6519d8ebc..0000000000 --- a/docs/sqlserverwriter.md +++ /dev/null @@ -1,159 +0,0 @@ -# SQLServer写入插件(sqlserverwriter) - -## 1. 配置样例 - -``` -{ - "job": { - "setting": { - "speed": { - "channel": 4 - }, - "errorLimit": { - "record": 0, - "percentage": 10 - } - }, - "content": [ - { - "reader": { - "parameter": { - "password": "abc123" - "column": [ - "col1", - "col2" - ], - "where": "id > 1", - "connection": [ - { - "jdbcUrl": [ - "jdbc:mysql://172.16.8.104:3306/test?charset=utf8" - ], - "table": [ - "tb2" - ] - } - ], - "splitPk": "col1", - "username": "dtstack" - }, - "name": "mysqlreader" - }, - "writer": { - "name": "sqlserverwriter", - "parameter": { - "batchSize": 2048, - "username": "sa", - "password": "Dtstack201610!", - "column": [ - "id", - "v" - ], - "preSql": [], - "postSql": [], - "writeMode": "replace", - "connection": [ - { - "jdbcUrl": "jdbc:jtds:sqlserver://172.16.10.46:1433;DatabaseName=dq", - "table": [ - "tb1" - ] - } - ] - } - } - } - ] - } -} - -``` - -## 2. 参数说明 - -* **name** - - * 描述:插件名,此处只能填sqlserverwriter,否则Flinkx将无法正常加载该插件包。 - * 必选:是
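- - * Default: none
- -* **jdbcUrl** - - * Description: JDBC connection string for the SQLServer database - - jdbcUrl follows the official SqlServer format and may carry additional connection control parameters. See the [official SqlServer documentation](http://technet.microsoft.com/zh-cn/library/ms378749(v=SQL.110).aspx) for details. - - * Required: yes
- - * Default: none
- -* **username** - - * Description: user name for the data source
- - * Required: yes
- - * Default: none
- -* **password** - - * Description: password for the data source user
- - * Required: yes
- - * Default: none
- -* **column** - - * Description: the fields of the target table to write, separated by commas. Example: "column": ["id","name","age"]. - - * Required: yes
- - * Default: no
- - * Default: none
- -* **preSql** - - * Description: a set of standard SQL statements executed before data is written to the target table. - - * Required: no
- - * Default: none
- -* **postSql** - - * Description: a set of standard SQL statements executed after data is written to the target table. - - * Required: no
- - * Default: none
- -* **table** - - * Description: name of the target table. Only a single table is supported for now; multiple tables will be supported later. - - Note: table and jdbcUrl must be placed inside the connection block - - * Required: yes
- - * Default: none
- -* **writeMode** - - * Description: controls whether data is written to the target table with an `insert into`, `replace into`, or `ON DUPLICATE KEY UPDATE` statement
- ** In SQL Server, the latter two insert semantics are emulated with merge into. ** - * Required: yes
- - * Options: insert/replace/update
- - * Default: insert
- -* **batchSize** - - * Description: number of records submitted per batch - - * Required: no
- - * Default: 1024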
\ No newline at end of file diff --git a/docs/streamreader.md b/docs/streamreader.md new file mode 100644 index 0000000000..68886faa7a --- /dev/null +++ b/docs/streamreader.md @@ -0,0 +1,60 @@ +# Stream读取插件(streamreader) + +## 1. 配置样例 + +``` +{ + "job": { + "content": [ + { + "reader": { + "parameter": { + "column": [ + { + "type": "int", + "value":"xxx" + } + ], + "sliceRecordCount":10000 + }, + "name": "streamreader" + }, + "writer": {} + } + ], + "setting": {} + } +} +``` + +## 2. 参数说明 + +* **name** + + * 描述:插件名,此处填写插件名称,streamreader。 + + * 必选:是 + + * 默认值:无 + +* **sliceRecordCount** + + * 描述:每个通道生成的数据条数,不配置此参数或者配置为0,程序会持续生成数据,不会停止 + + * 必选:否 + + * 默认值:0 + +* **column** + + * 描述:需要生成的字段。 + + * 属性说明: + + * type:字段类型,程序根据指定的字段类型生成模拟数据,支持基本数据类型以及基本类型的数组,"int[]"表示生成一个长度随机的整形数组; + + * value:常量值,程序使用此字段的值直接返回; + + * 必选:是 + + * 默认值:无 diff --git a/docs/streamwriter.md b/docs/streamwriter.md new file mode 100644 index 0000000000..0d786d2f6b --- /dev/null +++ b/docs/streamwriter.md @@ -0,0 +1,41 @@ +# Stream写入插件(streamwriter) + +## 1. 配置样例 + +``` +{ + "job": { + "content": [ + { + "reader": {}, + "writer": { + "parameter": { + "print":true + }, + "name": "streamwriter" + } + } + ], + "setting": {} + } +} +``` + +## 2. 
参数说明 + +* **name** + + * 描述:插件名,此处填写插件名称,streamwriter,此插件用来单独测试reader插件,对读到的数据不做任务处理; + + * 必选:是 + + * 默认值:无 + +* **print** + + * 描述:是否在控制台打印数据 + + * 必选:否 + + * 默认值:false + diff --git a/flinkx-carbondata/flinkx-carbondata-writer/src/main/java/com/dtstack/flinkx/carbondata/writer/CarbonOutputFormat.java b/flinkx-carbondata/flinkx-carbondata-writer/src/main/java/com/dtstack/flinkx/carbondata/writer/CarbonOutputFormat.java index 8b9d4807e9..2fbac5852f 100644 --- a/flinkx-carbondata/flinkx-carbondata-writer/src/main/java/com/dtstack/flinkx/carbondata/writer/CarbonOutputFormat.java +++ b/flinkx-carbondata/flinkx-carbondata-writer/src/main/java/com/dtstack/flinkx/carbondata/writer/CarbonOutputFormat.java @@ -106,12 +106,8 @@ public void configure(Configuration parameters) { } private void parsePartition(){ - if(carbonTable.getPartitionInfo() == null) { - return; - } - if(partition == null || partition.trim().length() == 0) { - return; + throw new IllegalArgumentException("The table have partition field,'partition' should not be empty"); } partition = partition.trim(); diff --git a/flinkx-core/.gitignore b/flinkx-core/.gitignore index ca7ca55c4c..9803fe0b0e 100644 --- a/flinkx-core/.gitignore +++ b/flinkx-core/.gitignore @@ -11,3 +11,4 @@ target .classpath *.eclipse.* *.iml +/dependency-reduced-pom.xml diff --git a/flinkx-core/pom.xml b/flinkx-core/pom.xml index 161f3e2eed..8660b01683 100644 --- a/flinkx-core/pom.xml +++ b/flinkx-core/pom.xml @@ -153,13 +153,14 @@ + + tofile="${basedir}/../plugins/flinkx-${git.branch}.jar" /> diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/common/ColumnType.java b/flinkx-core/src/main/java/com/dtstack/flinkx/common/ColumnType.java index 02f164ff36..e5f976f05d 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/common/ColumnType.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/common/ColumnType.java @@ -29,7 +29,7 @@ * @author huyifan.zju@163.com */ public enum ColumnType { - STRING, VARCHAR, 
CHAR,NVARCHAR,TEXT,KEYWORD, + STRING, VARCHAR, CHAR,NVARCHAR,TEXT,KEYWORD,BINARY, INT, MEDIUMINT, TINYINT, DATETIME, SMALLINT, BIGINT,LONG,SHORT,INTEGER, DOUBLE, FLOAT, BOOLEAN, diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/config/WriterConfig.java b/flinkx-core/src/main/java/com/dtstack/flinkx/config/WriterConfig.java index d3d0af9081..2df69d966b 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/config/WriterConfig.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/config/WriterConfig.java @@ -18,6 +18,8 @@ package com.dtstack.flinkx.config; +import org.apache.commons.collections.CollectionUtils; + import java.util.ArrayList; import java.util.List; import java.util.Map; @@ -100,7 +102,12 @@ public class ConnectionConfig extends AbstractConfig { public ConnectionConfig(Map map) { super(map); - jdbcUrl = getStringVal(KEY_JDBC_URL); + Object jdbcUrlObj = internalMap.get(KEY_JDBC_URL); + if(jdbcUrlObj instanceof String){ + jdbcUrl = jdbcUrlObj.toString(); + } else if(jdbcUrlObj instanceof List && CollectionUtils.isNotEmpty((List) jdbcUrlObj)){ + jdbcUrl = ((List) jdbcUrlObj).get(0).toString(); + } table = (List) getVal(KEY_TABLE_LIST); } diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/constants/Metrics.java b/flinkx-core/src/main/java/com/dtstack/flinkx/constants/Metrics.java index 964705999d..49b962fff6 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/constants/Metrics.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/constants/Metrics.java @@ -51,4 +51,13 @@ public class Metrics { public static String START_LOCATION = "startLocation"; public static String TABLE_COL = "tableCol"; + + public static String MAX_VALUE = "maxValue"; + + public static String METRIC_GROUP_KEY_FLINKX = "flinkx"; + + public static String METRIC_GROUP_VALUE_INPUT = "input"; + + public static String METRIC_GROUP_VALUE_OUTPUT = "output"; + } diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/inputformat/RichInputFormat.java 
b/flinkx-core/src/main/java/com/dtstack/flinkx/inputformat/RichInputFormat.java index f7ca98edda..54e983599e 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/inputformat/RichInputFormat.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/inputformat/RichInputFormat.java @@ -19,11 +19,12 @@ package com.dtstack.flinkx.inputformat; import com.dtstack.flinkx.constants.Metrics; +import com.dtstack.flinkx.metrics.InputMetric; import com.dtstack.flinkx.reader.ByteRateLimiter; +import com.dtstack.flinkx.util.SysUtil; import org.apache.commons.lang.StringUtils; import org.apache.flink.api.common.accumulators.LongCounter; import org.apache.flink.api.common.io.DefaultInputSplitAssigner; -import org.apache.flink.api.common.io.FinalizeOnMaster; import org.apache.flink.api.common.io.statistics.BaseStatistics; import org.apache.flink.core.io.InputSplit; import org.apache.flink.core.io.InputSplitAssigner; @@ -41,7 +42,7 @@ * 用户只需覆盖openInternal,closeInternal等方法, 无需操心细节 * */ -public abstract class RichInputFormat extends org.apache.flink.api.common.io.RichInputFormat implements FinalizeOnMaster { +public abstract class RichInputFormat extends org.apache.flink.api.common.io.RichInputFormat { protected final Logger LOG = LoggerFactory.getLogger(getClass()); protected String jobName = "defaultJobName"; @@ -50,6 +51,7 @@ public abstract class RichInputFormat extends org.apache.flink.api.common.io.Ric protected long bytes; protected ByteRateLimiter byteRateLimiter; + protected transient InputMetric inputMetric; protected abstract void openInternal(InputSplit inputSplit) throws IOException; @@ -59,12 +61,15 @@ public void open(InputSplit inputSplit) throws IOException { if (vars != null && vars.get(Metrics.JOB_NAME) != null) { jobName = vars.get(Metrics.JOB_NAME); } + numReadCounter = getRuntimeContext().getLongCounter(Metrics.NUM_READS); + inputMetric = new InputMetric(getRuntimeContext(), numReadCounter); + openInternal(inputSplit); if 
(StringUtils.isNotBlank(this.monitorUrls) && this.bytes > 0) { - this.byteRateLimiter = new ByteRateLimiter(getRuntimeContext(), this.monitorUrls, this.bytes, 1); + this.byteRateLimiter = new ByteRateLimiter(getRuntimeContext(), this.monitorUrls, this.bytes, 2); this.byteRateLimiter.start(); } } @@ -88,6 +93,10 @@ public Row nextRecord(Row row) throws IOException { public void close() throws IOException { try{ closeInternal(); + + if (inputMetric.getDelayPeriodMill() != 0){ + SysUtil.sleep(inputMetric.getDelayPeriodMill()); + } }catch (Exception e){ throw new RuntimeException(e); }finally { @@ -101,11 +110,6 @@ public void close() throws IOException { protected abstract void closeInternal() throws IOException; - @Override - public void finalizeGlobal(int parallelism) throws IOException { - - } - @Override public BaseStatistics getStatistics(BaseStatistics baseStatistics) throws IOException { return null; diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/latch/MetricLatch.java b/flinkx-core/src/main/java/com/dtstack/flinkx/latch/MetricLatch.java index 3200baaa4a..d68ec9c735 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/latch/MetricLatch.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/latch/MetricLatch.java @@ -23,6 +23,9 @@ import com.google.gson.Gson; import com.google.gson.internal.LinkedTreeMap; import org.apache.flink.api.common.functions.RuntimeContext; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + import java.io.IOException; import java.io.InputStream; import java.io.InputStreamReader; @@ -39,6 +42,9 @@ * @author huyifan.zju@163.com */ public class MetricLatch extends Latch { + + public static Logger LOG = LoggerFactory.getLogger(MetricLatch.class); + private String metricName; private String[] monitorRoots; private String jobId; @@ -46,19 +52,26 @@ public class MetricLatch extends Latch { private RuntimeContext context; private static final String METRIC_PREFIX = "latch-"; - private boolean checkMonitorRoots() { + 
private void checkMonitorRoots() { boolean flag = false; int j = 0; + StringBuilder exceptionMsg = new StringBuilder(); for(; j < monitorRoots.length; ++j) { String requestUrl = monitorRoots[j] + "/jobs/" + jobId + "/accumulators"; + LOG.info("Monitor url:" + requestUrl); try(InputStream inputStream = URLUtil.open(requestUrl)) { flag = true; break; } catch (Exception e) { - e.printStackTrace(); + exceptionMsg.append("Monitor url:").append(requestUrl).append("\n"); + exceptionMsg.append("Error info:\n").append(e.getMessage()).append("\n"); + LOG.error("Open monitor url error:{}",e); } } - return flag; + + if (!flag){ + throw new IllegalArgumentException(exceptionMsg.toString()); + } } private int getIntMetricVal(String requestUrl) { @@ -96,13 +109,7 @@ public MetricLatch(RuntimeContext context, String monitors, String metricName) { } } - if(!checkMonitorRoots()) { - String msg = ""; - if(monitorRoots != null && monitorRoots.length >= 1) { - msg = monitorRoots[0]; - } - throw new RuntimeException("Invalid monitors: " + msg); - } + checkMonitorRoots(); } diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/metrics/InputMetric.java b/flinkx-core/src/main/java/com/dtstack/flinkx/metrics/InputMetric.java new file mode 100644 index 0000000000..f052f86a3a --- /dev/null +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/metrics/InputMetric.java @@ -0,0 +1,107 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + *
+ * http://www.apache.org/licenses/LICENSE-2.0 + *
+ * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dtstack.flinkx.metrics; + +import com.dtstack.flinkx.constants.Metrics; +import org.apache.flink.api.common.accumulators.LongCounter; +import org.apache.flink.api.common.functions.RuntimeContext; +import org.apache.flink.metrics.MetricGroup; +import org.apache.flink.runtime.metrics.MetricRegistryImpl; +import org.apache.flink.runtime.metrics.groups.AbstractMetricGroup; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import scala.concurrent.duration.FiniteDuration; + +import java.lang.reflect.Field; +import java.util.concurrent.RunnableScheduledFuture; +import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.ScheduledThreadPoolExecutor; +import java.util.concurrent.TimeUnit; + +/** + * company: www.dtstack.com + * + * @author: toutian + * create: 2019/3/18 + */ +public class InputMetric { + protected final Logger LOG = LoggerFactory.getLogger(getClass()); + + private RuntimeContext runtimeContext; + + private final static Long DEFAULT_PERIOD_MILLISECONDS = 10000L; + + private Long delayPeriodMill = 12000L; + + public InputMetric(RuntimeContext runtimeContext, LongCounter numRead) { + this.runtimeContext = runtimeContext; + + final MetricGroup flinkxInput = getRuntimeContext().getMetricGroup().addGroup(Metrics.METRIC_GROUP_KEY_FLINKX, Metrics.METRIC_GROUP_VALUE_INPUT); + + flinkxInput.gauge(Metrics.NUM_READS, new SimpleAccumulatorGauge(numRead)); + + initPeriod(); + } + + private RuntimeContext getRuntimeContext() { + return runtimeContext; + } + + public Long getDelayPeriodMill() { + return delayPeriodMill; + } + + public void initPeriod() { + try { + MetricGroup mgObj = 
runtimeContext.getMetricGroup(); + Class amgCls = (Class) mgObj.getClass().getSuperclass().getSuperclass(); + Field registryField = amgCls.getDeclaredField("registry"); + registryField.setAccessible(true); + MetricRegistryImpl registryImplObj = (MetricRegistryImpl) registryField.get(mgObj); + if (registryImplObj.getReporters().isEmpty()) { + return; + } + Field executorField = registryImplObj.getClass().getDeclaredField("executor"); + executorField.setAccessible(true); + ScheduledExecutorService executor = (ScheduledExecutorService) executorField.get(registryImplObj); + Field scheduleField = (executor.getClass().getSuperclass().getDeclaredField("e")); + scheduleField.setAccessible(true); + ScheduledThreadPoolExecutor scheduleObj = (ScheduledThreadPoolExecutor) scheduleField.get(executor); + Runnable runableObj = scheduleObj.getQueue().iterator().next(); + RunnableScheduledFuture runableFuture = (RunnableScheduledFuture) runableObj; + Field outerTaskField = runableFuture.getClass().getDeclaredField("outerTask"); + outerTaskField.setAccessible(true); + Object scheduledFutureTask = outerTaskField.get(runableFuture); + Field periodField = scheduledFutureTask.getClass().getDeclaredField("period"); + periodField.setAccessible(true); + long schedulePeriod = (long) periodField.get(scheduledFutureTask); + long schedulePeriodMill = -1 * new FiniteDuration(schedulePeriod, TimeUnit.NANOSECONDS).toMillis(); + + LOG.info("InputMetric.scheduledFutureTask.schedulePeriodMill:{} ...", schedulePeriodMill); + + if (schedulePeriodMill > DEFAULT_PERIOD_MILLISECONDS) { + this.delayPeriodMill = (long) (schedulePeriodMill * 1.2); + } + } catch (Exception e) { + LOG.error("{}", e); + } + + LOG.info("InputMetric.delayPeriodMill:{} ...", delayPeriodMill); + } +} diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/metrics/OutputMetric.java b/flinkx-core/src/main/java/com/dtstack/flinkx/metrics/OutputMetric.java new file mode 100644 index 0000000000..79408ebc15 --- /dev/null +++ 
b/flinkx-core/src/main/java/com/dtstack/flinkx/metrics/OutputMetric.java @@ -0,0 +1,54 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dtstack.flinkx.metrics; + +import com.dtstack.flinkx.constants.Metrics; +import org.apache.flink.api.common.accumulators.IntCounter; +import org.apache.flink.api.common.accumulators.LongCounter; +import org.apache.flink.api.common.functions.RuntimeContext; +import org.apache.flink.metrics.MetricGroup; + +/** + * company: www.dtstack.com + * + * @author: toutian + * create: 2019/3/18 + */ +public class OutputMetric { + + private transient RuntimeContext runtimeContext; + + public OutputMetric(RuntimeContext runtimeContext, IntCounter numErrors, IntCounter numNullErrors, + IntCounter numDuplicateErrors, IntCounter numConversionErrors, IntCounter numOtherErrors, LongCounter numWrite) { + this.runtimeContext = runtimeContext; + + final MetricGroup flinkxOutput = getRuntimeContext().getMetricGroup().addGroup(Metrics.METRIC_GROUP_KEY_FLINKX, Metrics.METRIC_GROUP_VALUE_OUTPUT); + + flinkxOutput.gauge(Metrics.NUM_ERRORS, new SimpleAccumulatorGauge(numErrors)); + flinkxOutput.gauge(Metrics.NUM_NULL_ERRORS, new SimpleAccumulatorGauge(numNullErrors)); + 
flinkxOutput.gauge(Metrics.NUM_DUPLICATE_ERRORS, new SimpleAccumulatorGauge(numDuplicateErrors)); + flinkxOutput.gauge(Metrics.NUM_CONVERSION_ERRORS, new SimpleAccumulatorGauge(numConversionErrors)); + flinkxOutput.gauge(Metrics.NUM_OTHER_ERRORS, new SimpleAccumulatorGauge(numOtherErrors)); + flinkxOutput.gauge(Metrics.NUM_WRITES, new SimpleAccumulatorGauge(numWrite)); + } + + private RuntimeContext getRuntimeContext() { + return runtimeContext; + } +} diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/metrics/SimpleAccumulatorGauge.java b/flinkx-core/src/main/java/com/dtstack/flinkx/metrics/SimpleAccumulatorGauge.java new file mode 100644 index 0000000000..a4d470b66f --- /dev/null +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/metrics/SimpleAccumulatorGauge.java @@ -0,0 +1,44 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dtstack.flinkx.metrics; + +import org.apache.flink.api.common.accumulators.Accumulator; +import org.apache.flink.metrics.Gauge; + +import java.io.Serializable; + +/** + * company: www.dtstack.com + * + * @author: toutian + * create: 2019/3/21 + */ +public class SimpleAccumulatorGauge implements Gauge { + + private Accumulator accumulator; + + public SimpleAccumulatorGauge(Accumulator accumulator) { + this.accumulator = accumulator; + } + + @Override + public T getValue() { + return accumulator.getLocalValue(); + } +} diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/outputformat/RichOutputFormat.java b/flinkx-core/src/main/java/com/dtstack/flinkx/outputformat/RichOutputFormat.java index 42bffa1336..d587d130fc 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/outputformat/RichOutputFormat.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/outputformat/RichOutputFormat.java @@ -23,6 +23,7 @@ import com.dtstack.flinkx.latch.Latch; import com.dtstack.flinkx.latch.LocalLatch; import com.dtstack.flinkx.latch.MetricLatch; +import com.dtstack.flinkx.metrics.OutputMetric; import com.dtstack.flinkx.writer.DirtyDataManager; import com.dtstack.flinkx.writer.ErrorLimiter; import org.apache.commons.lang.StringUtils; @@ -111,6 +112,8 @@ public abstract class RichOutputFormat extends org.apache.flink.api.common.io.Ri protected String jobId; + protected transient OutputMetric outputMetric; + public DirtyDataManager getDirtyDataManager() { return dirtyDataManager; } @@ -173,6 +176,8 @@ public void open(int taskNumber, int numTasks) throws IOException { //总记录数 numWriteCounter = context.getLongCounter(Metrics.NUM_WRITES); + outputMetric = new OutputMetric(context, errCounter, nullErrCounter, duplicateErrCounter, conversionErrCounter, otherErrCounter, numWriteCounter); + Map vars = context.getMetricGroup().getAllVariables(); if(vars != null && vars.get(Metrics.JOB_NAME) != null) { @@ -186,7 +191,7 @@ public void open(int taskNumber, int 
numTasks) throws IOException { //启动错误限制 if(StringUtils.isNotBlank(monitorUrl)) { if(errors != null || errorRatio != null) { - errorLimiter = new ErrorLimiter(context, monitorUrl, errors, errorRatio, 1); + errorLimiter = new ErrorLimiter(context, monitorUrl, errors, errorRatio, 2); errorLimiter.start(); } } @@ -332,13 +337,19 @@ public void close() throws IOException { if(dirtyDataManager != null) { dirtyDataManager.close(); } + if(errorLimiter != null) { - // Wait a while before checking dirty data - Latch latch = newLatch("#5"); - latch.addOne(); - latch.waitUntil(numTasks); + try{ + // Wait a while before checking dirty data + Latch latch = newLatch("#5"); + latch.addOne(); + latch.waitUntil(numTasks); + + errorLimiter.updateErrorInfo(); + } catch (Exception e){ + LOG.warn("Update error info error when task closing:{}", e); + } - errorLimiter.updateErrorInfo(); errorLimiter.acquire(); errorLimiter.stop(); } diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/reader/ByteRateLimiter.java b/flinkx-core/src/main/java/com/dtstack/flinkx/reader/ByteRateLimiter.java index b3b6b3be96..9b2c2bad28 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/reader/ByteRateLimiter.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/reader/ByteRateLimiter.java @@ -18,24 +18,19 @@ package com.dtstack.flinkx.reader; -import com.dtstack.flinkx.util.RetryUtil; import com.dtstack.flinkx.util.URLUtil; import com.google.common.util.concurrent.RateLimiter; import com.google.gson.Gson; import com.google.gson.internal.LinkedTreeMap; import org.apache.flink.api.common.functions.RuntimeContext; +import org.apache.flink.hadoop.shaded.org.apache.http.impl.client.CloseableHttpClient; +import org.apache.flink.hadoop.shaded.org.apache.http.impl.client.HttpClientBuilder; import org.apache.flink.streaming.api.operators.StreamingRuntimeContext; import org.apache.flink.util.Preconditions; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import java.io.IOException; -import 
java.io.InputStream; -import java.io.InputStreamReader; -import java.io.Reader; -import java.net.URL; import java.util.List; import java.util.Map; -import java.util.concurrent.Callable; import java.util.concurrent.Executors; import java.util.concurrent.ScheduledExecutorService; import java.util.concurrent.TimeUnit; @@ -68,9 +63,12 @@ public class ByteRateLimiter { private int subtaskIndex; + private CloseableHttpClient httpClient; + private ScheduledExecutorService scheduledExecutorService; public ByteRateLimiter(RuntimeContext runtimeContext, String monitors, double expectedBytePerSecond, double samplePeriod) { + httpClient = HttpClientBuilder.create().build(); Preconditions.checkNotNull(runtimeContext); //DistributedRuntimeUDFContext context = (DistributedRuntimeUDFContext) runtimeContext; @@ -96,10 +94,10 @@ public ByteRateLimiter(RuntimeContext runtimeContext, String monitors, double ex for(; j < monitorUrls.length; ++j) { String url = monitorUrls[j]; LOG.info("monitor_url=" + url); - try (InputStream inputStream = URLUtil.open(url)){ + try { + URLUtil.get(httpClient, url); break; } catch (Exception e) { - e.printStackTrace(); LOG.error("connected error: " + url); } } @@ -140,43 +138,41 @@ public void start() { () -> { for (int index = 0; index < 1; ++index) { String requestUrl = monitorUrls[index] + "/jobs/" + this.jobId + "/vertices/" + this.taskId; - try (InputStream inputStream = URLUtil.open(requestUrl)) { - try (Reader rd = new InputStreamReader(inputStream)) { - Map map = gson.fromJson(rd, Map.class); - double thisWriteBytes = 0; - double thisWriteRecords = 0; - double totalWriteBytes = 0; - double totalWriteRecords = 0; - - List list = (List) map.get("subtasks"); - for (int i = 0; i < list.size(); ++i) { - LinkedTreeMap subTask = list.get(i); - LinkedTreeMap subTaskMetrics = (LinkedTreeMap) subTask.get("metrics"); - double subWriteBytes = (double) subTaskMetrics.get("write-bytes"); - double subWriteRecords = (double) subTaskMetrics.get("write-records"); 
- if (i == subTaskIndex) { - thisWriteBytes = subWriteBytes; - thisWriteRecords = subWriteRecords; - } - totalWriteBytes += subWriteBytes; - totalWriteRecords += subWriteRecords; - } - - double thisWriteRatio = (totalWriteRecords == 0 ? 0 : thisWriteRecords / totalWriteRecords); - - if (totalWriteRecords > 1000 && totalWriteBytes != 0 && thisWriteRatio != 0) { - double bpr = totalWriteBytes / totalWriteRecords; - double permitsPerSecond = expectedBytePerSecond / bpr * thisWriteRatio; - rateLimiter.setRate(permitsPerSecond); + try { + String response = URLUtil.get(httpClient, requestUrl); + + Map map = gson.fromJson(response, Map.class); + double thisWriteBytes = 0; + double thisWriteRecords = 0; + double totalWriteBytes = 0; + double totalWriteRecords = 0; + + List list = (List) map.get("subtasks"); + for (int i = 0; i < list.size(); ++i) { + LinkedTreeMap subTask = list.get(i); + LinkedTreeMap subTaskMetrics = (LinkedTreeMap) subTask.get("metrics"); + double subWriteBytes = (double) subTaskMetrics.get("write-bytes"); + double subWriteRecords = (double) subTaskMetrics.get("write-records"); + if (i == subTaskIndex) { + thisWriteBytes = subWriteBytes; + thisWriteRecords = subWriteRecords; } + totalWriteBytes += subWriteBytes; + totalWriteRecords += subWriteRecords; + } - break; + double thisWriteRatio = (totalWriteRecords == 0 ? 
0 : thisWriteRecords / totalWriteRecords); + if (totalWriteRecords > 1000 && totalWriteBytes != 0 && thisWriteRatio != 0) { + double bpr = totalWriteBytes / totalWriteRecords; + double permitsPerSecond = expectedBytePerSecond / bpr * thisWriteRatio; + rateLimiter.setRate(permitsPerSecond); } + + break; } catch (Exception e) { - e.printStackTrace(); + LOG.error("Get metrics error:",e); } - } }, 0, @@ -190,6 +186,14 @@ public void stop() { return; } + if (httpClient != null){ + try { + httpClient.close(); + } catch (Exception e){ + LOG.error("close httpClient error:{}", e); + } + } + if(scheduledExecutorService != null && !scheduledExecutorService.isShutdown()) { scheduledExecutorService.shutdown(); } diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/reader/DataReader.java b/flinkx-core/src/main/java/com/dtstack/flinkx/reader/DataReader.java index 8deb5e9b8c..5c31c8b861 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/reader/DataReader.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/reader/DataReader.java @@ -19,6 +19,7 @@ package com.dtstack.flinkx.reader; import com.dtstack.flinkx.config.DataTransferConfig; +import com.dtstack.flinkx.config.DirtyConfig; import com.dtstack.flinkx.plugin.PluginLoader; import org.apache.flink.api.common.io.InputFormat; import org.apache.flink.api.common.typeinfo.TypeInformation; @@ -28,8 +29,10 @@ import org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction; import org.apache.flink.types.Row; import org.apache.flink.util.Preconditions; + import java.util.ArrayList; import java.util.List; +import java.util.Map; /** * Abstract specification of Reader Plugin @@ -51,6 +54,10 @@ public abstract class DataReader { protected List srcCols = new ArrayList<>(); + /** + * reuse hadoopConfig for metric + */ + protected Map hadoopConfig; public List getSrcCols() { return srcCols; @@ -76,6 +83,14 @@ protected DataReader(DataTransferConfig config, StreamExecutionEnvironment env) this.numPartitions = 
config.getJob().getSetting().getSpeed().getChannel(); this.bytes = config.getJob().getSetting().getSpeed().getBytes(); this.monitorUrls = config.getMonitorUrls(); + + DirtyConfig dirtyConfig = config.getJob().getSetting().getDirty(); + if (dirtyConfig != null) { + Map hadoopConfig = dirtyConfig.getHadoopConfig(); + if (hadoopConfig != null) { + this.hadoopConfig = hadoopConfig; + } + } } public abstract DataStream readData(); diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/util/DateUtil.java b/flinkx-core/src/main/java/com/dtstack/flinkx/util/DateUtil.java index dbd9dbc79a..fe1b07a625 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/util/DateUtil.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/util/DateUtil.java @@ -123,7 +123,7 @@ public static java.sql.Timestamp columnToTimestamp(Object column,SimpleDateForma Long rawData = (Long) column; return new java.sql.Timestamp(getMillSecond(rawData.toString())); } else if (column instanceof java.sql.Date) { - return (java.sql.Timestamp) column; + return new java.sql.Timestamp(((java.sql.Date) column).getTime()); } else if(column instanceof Timestamp) { return (Timestamp) column; } else if(column instanceof Date) { diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/util/StringUtil.java b/flinkx-core/src/main/java/com/dtstack/flinkx/util/StringUtil.java index a61b49a764..1fa9aac713 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/util/StringUtil.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/util/StringUtil.java @@ -67,11 +67,7 @@ public static String convertRegularExpr (String str) { } public static Object string2col(String str, String type, SimpleDateFormat customTimeFormat) { - if(str == null || str.length() == 0){ - return null; - } - - if(type == null){ + if(str == null || str.length() == 0 || type == null){ return str; } @@ -171,10 +167,11 @@ public static String col2string(Object column, String type) { result = Boolean.valueOf(rowData.trim()); break; case DATE: - 
result = DateUtil.dateToString((java.util.Date)column); + result = DateUtil.dateToString(DateUtil.columnToDate(column, null)); break; + case DATETIME: case TIMESTAMP: - result = DateUtil.timestampToString((java.util.Date)column); + result = DateUtil.timestampToString(DateUtil.columnToTimestamp(column, null)); break; default: result = rowData; diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/util/TelnetUtil.java b/flinkx-core/src/main/java/com/dtstack/flinkx/util/TelnetUtil.java index 51226c83cf..e0ee5e0edd 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/util/TelnetUtil.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/util/TelnetUtil.java @@ -2,6 +2,7 @@ import org.apache.commons.net.telnet.TelnetClient; +import java.util.concurrent.Callable; import java.util.regex.Matcher; import java.util.regex.Pattern; @@ -12,21 +13,32 @@ public class TelnetUtil { private static final String PORT_KEY = "port"; public static void telnet(String ip,int port) { - TelnetClient client = null; - try{ - client = new TelnetClient(); - client.setConnectTimeout(3000); - client.connect(ip,port); - } catch (Exception e){ - throw new RuntimeException("Unable connect to : " + ip + ":" + port); - } finally { - try { - if (client != null){ - client.disconnect(); + try { + RetryUtil.executeWithRetry(new Callable() { + @Override + public Boolean call() throws Exception { + TelnetClient client = null; + try{ + client = new TelnetClient(); + client.setConnectTimeout(3000); + client.connect(ip,port); + } catch (Exception e){ + throw new RuntimeException("Unable connect to : " + ip + ":" + port); + } finally { + try { + if (client != null){ + client.disconnect(); + } + } catch (Exception ignore){ + } + } + return null; } - } catch (Exception ignore){ - } + }, 3,1000,false); + } catch (Exception e) { + e.printStackTrace(); } + } public static void telnet(String url) { diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/util/URLUtil.java 
b/flinkx-core/src/main/java/com/dtstack/flinkx/util/URLUtil.java index b920b77fde..04ef38c524 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/util/URLUtil.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/util/URLUtil.java @@ -18,8 +18,16 @@ package com.dtstack.flinkx.util; +import org.apache.flink.hadoop.shaded.org.apache.http.HttpEntity; +import org.apache.flink.hadoop.shaded.org.apache.http.HttpStatus; +import org.apache.flink.hadoop.shaded.org.apache.http.client.methods.CloseableHttpResponse; +import org.apache.flink.hadoop.shaded.org.apache.http.client.methods.HttpGet; +import org.apache.flink.hadoop.shaded.org.apache.http.impl.client.CloseableHttpClient; +import org.apache.flink.hadoop.shaded.org.apache.http.util.EntityUtils; + import java.io.InputStream; import java.net.URL; +import java.nio.charset.Charset; import java.util.concurrent.Callable; /** @@ -32,6 +40,8 @@ public class URLUtil { private static int SLEEP_TIME_MILLI_SECOND = 2000; + private static Charset charset = Charset.forName("UTF-8"); + public static InputStream open(String url) throws Exception{ return RetryUtil.executeWithRetry(new Callable() { @Override @@ -40,4 +50,22 @@ public InputStream call() throws Exception{ } },MAX_RETRY_TIMES,SLEEP_TIME_MILLI_SECOND,false); } + + public static String get(CloseableHttpClient httpClient, String url) throws Exception{ + return RetryUtil.executeWithRetry(new Callable() { + @Override + public String call() throws Exception{ + String respBody = null; + HttpGet httpGet = new HttpGet(url); + CloseableHttpResponse response = httpClient.execute(httpGet); + + if(response.getStatusLine().getStatusCode() == HttpStatus.SC_OK){ + HttpEntity entity = response.getEntity(); + respBody = EntityUtils.toString(entity,charset); + } + + return respBody; + } + },MAX_RETRY_TIMES,SLEEP_TIME_MILLI_SECOND,false); + } } diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/util/ValueUtil.java 
b/flinkx-core/src/main/java/com/dtstack/flinkx/util/ValueUtil.java index baabd5174b..d1202865e8 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/util/ValueUtil.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/util/ValueUtil.java @@ -40,10 +40,9 @@ public static Integer getInt(Object obj) { Method method = obj.getClass().getMethod("intValue"); return (int) method.invoke(obj); } catch (NoSuchMethodException | InvocationTargetException | IllegalAccessException e) { - e.printStackTrace(); + throw new RuntimeException("Unable to convert " + obj + " into Interger",e); } } - throw new RuntimeException("Unable to convert " + obj + " into Interger"); } } diff --git a/flinkx-core/src/main/java/com/dtstack/flinkx/writer/ErrorLimiter.java b/flinkx-core/src/main/java/com/dtstack/flinkx/writer/ErrorLimiter.java index 67df52929d..3a4c6d654e 100644 --- a/flinkx-core/src/main/java/com/dtstack/flinkx/writer/ErrorLimiter.java +++ b/flinkx-core/src/main/java/com/dtstack/flinkx/writer/ErrorLimiter.java @@ -22,15 +22,13 @@ import com.google.gson.Gson; import com.google.gson.internal.LinkedTreeMap; import org.apache.flink.api.common.functions.RuntimeContext; +import org.apache.flink.hadoop.shaded.org.apache.http.impl.client.CloseableHttpClient; +import org.apache.flink.hadoop.shaded.org.apache.http.impl.client.HttpClientBuilder; import org.apache.flink.streaming.api.operators.StreamingRuntimeContext; import org.apache.flink.types.Row; import org.apache.flink.util.Preconditions; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import java.io.InputStream; -import java.io.InputStreamReader; -import java.io.Reader; -import java.net.URL; import java.util.List; import java.util.Map; import java.util.concurrent.Executors; @@ -60,6 +58,8 @@ public class ErrorLimiter { private String errMsg = ""; private Row errorData; + private CloseableHttpClient httpClient; + public void setErrorData(Row errorData){ this.errorData = errorData; } @@ -73,10 +73,11 @@ public void 
setErrMsg(String errMsg) { } public ErrorLimiter(RuntimeContext runtimeContext, String monitors, int maxErrors, double samplePeriod) { - this(runtimeContext, monitors, maxErrors, Double.MAX_VALUE, 1); + this(runtimeContext, monitors, maxErrors, Double.MAX_VALUE, 2); } public ErrorLimiter(RuntimeContext runtimeContext, String monitors, Integer maxErrors, Double maxErrorRatio, double samplePeriod) { + httpClient = HttpClientBuilder.create().build(); Preconditions.checkArgument(runtimeContext != null || monitors != null, "Should specify rumtimeContext or monitorUrls"); Preconditions.checkArgument(samplePeriod > 0); @@ -102,10 +103,10 @@ public ErrorLimiter(RuntimeContext runtimeContext, String monitors, Integer maxE int j = 0; for(; j < monitorUrls.length; ++j) { String url = monitorUrls[j]; - try (InputStream inputStream = URLUtil.open(url)){ + try { + URLUtil.get(httpClient, url); break; } catch (Exception e) { - e.printStackTrace(); LOG.error("connected error: " + url); } } @@ -124,7 +125,6 @@ public boolean isValid() { } public void start() { - if(scheduledExecutorService == null) { return; } @@ -141,29 +141,36 @@ public void updateErrorInfo(){ Gson gson = new Gson(); for(int index = 0; index < monitorUrls.length; ++index) { String requestUrl = monitorUrls[index] + "/jobs/" + jobId + "/accumulators"; - try(InputStream inputStream = URLUtil.open(requestUrl) ) { - try(Reader rd = new InputStreamReader(inputStream)) { - Map map = gson.fromJson(rd, Map.class); - List userTaskAccumulators = (List) map.get("user-task-accumulators"); - for(LinkedTreeMap accumulator : userTaskAccumulators) { - String name = (String) accumulator.get("name"); - if(name != null) { - if(name.equals("nErrors")) { - this.errors = Double.valueOf((String) accumulator.get("value")).intValue(); - } else if(name.equals("numRead")) { - this.numRead = Double.valueOf((String) accumulator.get("value")).intValue(); - } + try { + String response = URLUtil.get(httpClient, requestUrl); + Map map = 
gson.fromJson(response, Map.class); + List userTaskAccumulators = (List) map.get("user-task-accumulators"); + for(LinkedTreeMap accumulator : userTaskAccumulators) { + String name = (String) accumulator.get("name"); + if(name != null) { + if(name.equals("nErrors")) { + this.errors = Double.valueOf((String) accumulator.get("value")).intValue(); + } else if(name.equals("numRead")) { + this.numRead = Double.valueOf((String) accumulator.get("value")).intValue(); } } } - } catch (Exception e) { - e.printStackTrace(); + } catch (Exception e){ + LOG.error("Update data error:",e); } break; } } public void stop() { + if (httpClient != null){ + try { + httpClient.close(); + } catch (Exception e){ + LOG.error("close httpClient error:{}", e); + } + } + if(scheduledExecutorService != null && !scheduledExecutorService.isShutdown() && !scheduledExecutorService.isTerminated()) { scheduledExecutorService.shutdown(); } diff --git a/flinkx-db2/flinkx-db2-core/src/main/java/com/dtstack/flinkx/db2/Db2DatabaseMeta.java b/flinkx-db2/flinkx-db2-core/src/main/java/com/dtstack/flinkx/db2/Db2DatabaseMeta.java index 76f63cd37e..12178d0a16 100644 --- a/flinkx-db2/flinkx-db2-core/src/main/java/com/dtstack/flinkx/db2/Db2DatabaseMeta.java +++ b/flinkx-db2/flinkx-db2-core/src/main/java/com/dtstack/flinkx/db2/Db2DatabaseMeta.java @@ -85,6 +85,11 @@ public String getSplitFilter(String columnName) { return String.format("mod(%s,${N}) = ${M}", getStartQuote() + columnName + getEndQuote()); } + @Override + public String getSplitFilterWithTmpTable(String tmpTable, String columnName){ + return String.format("mod(%s.%s,${N}) = ${M}", tmpTable, getStartQuote() + columnName + getEndQuote()); + } + @Override public EDatabaseType getDatabaseType() { return EDatabaseType.DB2; diff --git a/flinkx-db2/flinkx-db2-reader/pom.xml b/flinkx-db2/flinkx-db2-reader/pom.xml index a2c7d2f925..4fc55afc97 100644 --- a/flinkx-db2/flinkx-db2-reader/pom.xml +++ b/flinkx-db2/flinkx-db2-reader/pom.xml @@ -68,7 +68,7 @@ + 
tofile="${basedir}/../../plugins/db2reader/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-db2/flinkx-db2-writer/pom.xml b/flinkx-db2/flinkx-db2-writer/pom.xml index 8b2f7ffabd..49c90730a4 100644 --- a/flinkx-db2/flinkx-db2-writer/pom.xml +++ b/flinkx-db2/flinkx-db2-writer/pom.xml @@ -68,7 +68,7 @@ + tofile="${basedir}/../../plugins/db2writer/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-es/flinkx-es-core/src/main/java/com/dtstack/flinkx/es/EsConfigKeys.java b/flinkx-es/flinkx-es-core/src/main/java/com/dtstack/flinkx/es/EsConfigKeys.java index 497da0b39a..30d33608bc 100644 --- a/flinkx-es/flinkx-es-core/src/main/java/com/dtstack/flinkx/es/EsConfigKeys.java +++ b/flinkx-es/flinkx-es-core/src/main/java/com/dtstack/flinkx/es/EsConfigKeys.java @@ -34,6 +34,8 @@ public class EsConfigKeys { public static final String KEY_TYPE = "type"; + public static final String KEY_BATCH_SIZE = "batchSize"; + public static final String KEY_BULK_ACTION = "bulkAction"; public static final String KEY_COLUMN_NAME = "name"; @@ -48,4 +50,8 @@ public class EsConfigKeys { public static final String KEY_ID_COLUMN_VALUE = "value"; + public static final String KEY_TIMEOUT = "timeout"; + + public static final String KEY_PATH_PREFIX = "pathPrefix"; + } diff --git a/flinkx-es/flinkx-es-core/src/main/java/com/dtstack/flinkx/es/EsUtil.java b/flinkx-es/flinkx-es-core/src/main/java/com/dtstack/flinkx/es/EsUtil.java index 906b13cfe2..704b638113 100644 --- a/flinkx-es/flinkx-es-core/src/main/java/com/dtstack/flinkx/es/EsUtil.java +++ b/flinkx-es/flinkx-es-core/src/main/java/com/dtstack/flinkx/es/EsUtil.java @@ -22,6 +22,7 @@ import com.dtstack.flinkx.util.DateUtil; import com.dtstack.flinkx.util.StringUtil; import com.dtstack.flinkx.util.TelnetUtil; +import org.apache.commons.collections.MapUtils; import org.apache.commons.lang3.StringUtils; import org.apache.commons.lang3.math.NumberUtils; import org.apache.flink.types.Row; @@ -30,7 +31,9 @@ import 
org.elasticsearch.action.search.SearchRequest; import org.elasticsearch.action.search.SearchResponse; import org.elasticsearch.client.RestClient; +import org.elasticsearch.client.RestClientBuilder; import org.elasticsearch.client.RestHighLevelClient; +import org.elasticsearch.common.Strings; import org.elasticsearch.index.query.QueryBuilder; import org.elasticsearch.index.query.QueryBuilders; import org.elasticsearch.search.SearchHit; @@ -51,7 +54,7 @@ */ public class EsUtil { - public static RestHighLevelClient getClient(String address) { + public static RestHighLevelClient getClient(String address,Map config) { List httpHostList = new ArrayList<>(); String[] addr = address.split(","); for(String add : addr) { @@ -59,37 +62,59 @@ public static RestHighLevelClient getClient(String address) { TelnetUtil.telnet(pair[0], Integer.valueOf(pair[1])); httpHostList.add(new HttpHost(pair[0], Integer.valueOf(pair[1]), "http")); } - RestHighLevelClient client = new RestHighLevelClient( - RestClient.builder(httpHostList.toArray(new HttpHost[httpHostList.size()]))); + + RestClientBuilder builder = RestClient.builder(httpHostList.toArray(new HttpHost[httpHostList.size()])); + + Integer timeout = MapUtils.getInteger(config, EsConfigKeys.KEY_TIMEOUT); + if (timeout != null){ + builder.setMaxRetryTimeoutMillis(timeout * 1000); + } + + String pathPrefix = MapUtils.getString(config, EsConfigKeys.KEY_PATH_PREFIX); + if (StringUtils.isNotEmpty(pathPrefix)){ + builder.setPathPrefix(pathPrefix); + } + + RestHighLevelClient client = new RestHighLevelClient(builder); return client; } - public static SearchResponse search(RestHighLevelClient client, String query, int from, int size) { - SearchRequest searchRequest = new SearchRequest(); + public static SearchResponse search(RestHighLevelClient client, String index, String type, String query, int from, int size) { + SearchRequest searchRequest = Strings.isNullOrEmpty(index) ? 
new SearchRequest() : new SearchRequest(index); + + if(!Strings.isNullOrEmpty(type)){ + searchRequest.types(type); + } + SearchSourceBuilder sourceBuilder = new SearchSourceBuilder(); sourceBuilder.from(from); - sourceBuilder.size(size); + + if(size > 0){ + sourceBuilder.size(size); + } if(StringUtils.isNotBlank(query)) { QueryBuilder qb = QueryBuilders.wrapperQuery(query); sourceBuilder.query(qb); - searchRequest.source(sourceBuilder); } + searchRequest.source(sourceBuilder); + try { return client.search(searchRequest); } catch (IOException e) { throw new RuntimeException(e); } + } - public static long searchCount(RestHighLevelClient client, String query) { - SearchResponse searchResponse = search(client, query, 0, 0); + public static long searchCount(RestHighLevelClient client, String index, String type, String query) { + SearchResponse searchResponse = search(client, index, type, query, 0, 0); return searchResponse.getHits().getTotalHits(); } - public static List> searchContent(RestHighLevelClient client, String query, int from, int size) { - SearchResponse searchResponse = search(client, query, from, size); + public static List> searchContent(RestHighLevelClient client, String index, String type, String query, int from, int size) { + SearchResponse searchResponse = search(client, index, type, query, from, size); SearchHits searchHits = searchResponse.getHits(); List> resultList = new ArrayList<>(); for(SearchHit searchHit : searchHits) { @@ -138,10 +163,10 @@ public static Map rowToJsonMap(Row row, List fields, Lis String key = parts[parts.length - 1]; Object col = row.getField(i); if(col != null) { - Object value = StringUtil.col2string(col, types.get(i)); - currMap.put(key, value); + col = StringUtil.string2col(String.valueOf(col), types.get(i), null); } + currMap.put(key, col); } } catch(Exception ex) { String msg = "EsUtil.rowToJsonMap Writing record error: when converting field[" + i + "] in Row(" + row + ")"; diff --git 
a/flinkx-es/flinkx-es-reader/pom.xml b/flinkx-es/flinkx-es-reader/pom.xml
index 40c0cc78c5..9f617b5854 100644
--- a/flinkx-es/flinkx-es-reader/pom.xml
+++ b/flinkx-es/flinkx-es-reader/pom.xml
@@ -74,7 +74,7 @@
+        tofile="${basedir}/../../plugins/esreader/${project.name}-${git.branch}.jar" />
diff --git a/flinkx-es/flinkx-es-reader/src/main/java/com/dtstack/flinkx/es/reader/EsInputFormat.java b/flinkx-es/flinkx-es-reader/src/main/java/com/dtstack/flinkx/es/reader/EsInputFormat.java
index e8548d4b59..5507088af9 100644
--- a/flinkx-es/flinkx-es-reader/src/main/java/com/dtstack/flinkx/es/reader/EsInputFormat.java
+++ b/flinkx-es/flinkx-es-reader/src/main/java/com/dtstack/flinkx/es/reader/EsInputFormat.java
@@ -44,6 +44,10 @@ public class EsInputFormat extends RichInputFormat {
     protected String address;
+    protected String index;
+
+    protected String type;
+
     protected String query;
     protected List columnValues;
@@ -52,7 +56,9 @@ public class EsInputFormat extends RichInputFormat {
     protected List columnNames;
-    private int batch = 2;
+    protected int batchSize = 10;
+
+    protected Map clientConfig;
     private int from;
@@ -69,7 +75,7 @@ public class EsInputFormat extends RichInputFormat {
     @Override
     public void configure(Configuration configuration) {
-        client = EsUtil.getClient(address);
+        client = EsUtil.getClient(address, clientConfig);
     }
     @Override
@@ -79,7 +85,7 @@ public BaseStatistics getStatistics(BaseStatistics baseStatistics) throws IOExce
     @Override
     public InputSplit[] createInputSplits(int splitNum) throws IOException {
-        long cnt = EsUtil.searchCount(client, query);
+        long cnt = EsUtil.searchCount(client, index, type, query);
         if (cnt < splitNum) {
             EsInputSplit[] splits = new EsInputSplit[1];
             splits[0] = new EsInputSplit(0, (int)cnt);
@@ -119,11 +125,11 @@ public void openInternal(InputSplit inputSplit) throws IOException {
     }
     private void loadNextBatch() {
-        int range = batch;
+        int range = batchSize;
         if (from + range > to) {
             range = to - from;
         }
-        resultList =
EsUtil.searchContent(client, query, from, range);
+        resultList = EsUtil.searchContent(client, index, type, query, from, range);
         from += range;
         pos = 0;
     }
@@ -135,6 +141,10 @@ public boolean reachedEnd() throws IOException {
                 return true;
             }
             loadNextBatch();
+
+            //check again
+            return reachedEnd();
+
         }
         return false;
     }
diff --git a/flinkx-es/flinkx-es-reader/src/main/java/com/dtstack/flinkx/es/reader/EsInputFormatBuilder.java b/flinkx-es/flinkx-es-reader/src/main/java/com/dtstack/flinkx/es/reader/EsInputFormatBuilder.java
index 3f9a7be530..0fcc7ec79e 100644
--- a/flinkx-es/flinkx-es-reader/src/main/java/com/dtstack/flinkx/es/reader/EsInputFormatBuilder.java
+++ b/flinkx-es/flinkx-es-reader/src/main/java/com/dtstack/flinkx/es/reader/EsInputFormatBuilder.java
@@ -20,6 +20,7 @@
 import com.dtstack.flinkx.inputformat.RichInputFormatBuilder;
 import java.util.List;
+import java.util.Map;
 /**
  * The builder class of EsInputFormat
@@ -65,6 +66,27 @@ public EsInputFormatBuilder setColumnTypes(List columnTypes) {
         return this;
     }
+    public EsInputFormatBuilder setIndex(String index){
+        format.index = index;
+        return this;
+    }
+
+    public EsInputFormatBuilder setType(String type){
+        format.type = type;
+        return this;
+    }
+
+    public EsInputFormatBuilder setBatchSize(Integer batchSize){
+        if(batchSize != null && batchSize > 0){
+            format.batchSize = batchSize;
+        }
+        return this;
+    }
+
+    public EsInputFormatBuilder setClientConfig(Map clientConfig){
+        format.clientConfig = clientConfig;
+        return this;
+    }
     @Override
     protected void checkFormat() {
diff --git a/flinkx-es/flinkx-es-reader/src/main/java/com/dtstack/flinkx/es/reader/EsReader.java b/flinkx-es/flinkx-es-reader/src/main/java/com/dtstack/flinkx/es/reader/EsReader.java
index e4fc711d81..c91b32a0d8 100644
--- a/flinkx-es/flinkx-es-reader/src/main/java/com/dtstack/flinkx/es/reader/EsReader.java
+++ b/flinkx-es/flinkx-es-reader/src/main/java/com/dtstack/flinkx/es/reader/EsReader.java
@@ -27,6 +27,7 @@
 import
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
 import org.apache.flink.types.Row;
 import java.util.ArrayList;
+import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
@@ -41,6 +42,11 @@ public class EsReader extends DataReader {
     private String address;
     private String query;
+    private String index;
+    private String type;
+    private Integer batchSize;
+    private Map clientConfig;
+
     protected List columnType;
     protected List columnValue;
     protected List columnName;
@@ -49,6 +55,13 @@ public EsReader(DataTransferConfig config, StreamExecutionEnvironment env) {
         super(config, env);
         ReaderConfig readerConfig = config.getJob().getContent().get(0).getReader();
         address = readerConfig.getParameter().getStringVal(EsConfigKeys.KEY_ADDRESS);
+        index = readerConfig.getParameter().getStringVal(EsConfigKeys.KEY_INDEX);
+        type = readerConfig.getParameter().getStringVal(EsConfigKeys.KEY_TYPE);
+        batchSize = readerConfig.getParameter().getIntVal(EsConfigKeys.KEY_BATCH_SIZE, 10);
+
+        clientConfig = new HashMap<>();
+        clientConfig.put(EsConfigKeys.KEY_TIMEOUT, readerConfig.getParameter().getVal(EsConfigKeys.KEY_TIMEOUT));
+        clientConfig.put(EsConfigKeys.KEY_PATH_PREFIX, readerConfig.getParameter().getVal(EsConfigKeys.KEY_PATH_PREFIX));
         Object queryMap = readerConfig.getParameter().getVal(EsConfigKeys.KEY_QUERY);
         if(queryMap != null) {
@@ -83,6 +96,10 @@ public DataStream readData() {
         builder.setColumnTypes(columnType);
         builder.setColumnValues(columnValue);
         builder.setAddress(address);
+        builder.setIndex(index);
+        builder.setType(type);
+        builder.setBatchSize(batchSize);
+        builder.setClientConfig(clientConfig);
         builder.setQuery(query);
         builder.setBytes(bytes);
         builder.setMonitorUrls(monitorUrls);
diff --git a/flinkx-es/flinkx-es-writer/pom.xml b/flinkx-es/flinkx-es-writer/pom.xml
index 759c6a16fe..885ec9737c 100644
--- a/flinkx-es/flinkx-es-writer/pom.xml
+++ b/flinkx-es/flinkx-es-writer/pom.xml
@@ -75,7 +75,7 @@
+
tofile="${basedir}/../../plugins/eswriter/${project.name}-${git.branch}.jar" />
diff --git a/flinkx-es/flinkx-es-writer/src/main/java/com/dtstack/flinkx/es/writer/EsOutputFormat.java b/flinkx-es/flinkx-es-writer/src/main/java/com/dtstack/flinkx/es/writer/EsOutputFormat.java
index 28013af32f..dbdb70bba4 100644
--- a/flinkx-es/flinkx-es-writer/src/main/java/com/dtstack/flinkx/es/writer/EsOutputFormat.java
+++ b/flinkx-es/flinkx-es-writer/src/main/java/com/dtstack/flinkx/es/writer/EsOutputFormat.java
@@ -25,11 +25,14 @@
 import org.apache.commons.lang.StringUtils;
 import org.apache.flink.configuration.Configuration;
 import org.apache.flink.types.Row;
+import org.elasticsearch.action.bulk.BulkItemResponse;
 import org.elasticsearch.action.bulk.BulkRequest;
+import org.elasticsearch.action.bulk.BulkResponse;
 import org.elasticsearch.action.index.IndexRequest;
 import org.elasticsearch.client.RestHighLevelClient;
 import java.io.IOException;
 import java.util.List;
+import java.util.Map;
 /**
  * The OutputFormat class of ElasticSearch
@@ -55,6 +58,8 @@ public class EsOutputFormat extends RichOutputFormat {
     protected List columnNames;
+    protected Map clientConfig;
+
     private transient RestHighLevelClient client;
     private transient BulkRequest bulkRequest;
@@ -62,7 +67,7 @@ public class EsOutputFormat extends RichOutputFormat {
     @Override
     public void configure(Configuration configuration) {
-        client = EsUtil.getClient(address);
+        client = EsUtil.getClient(address, clientConfig);
         bulkRequest = new BulkRequest();
     }
@@ -91,7 +96,23 @@ protected void writeMultipleRecordsInternal() throws Exception {
             IndexRequest request = StringUtils.isBlank(id) ?
new IndexRequest(index, type) : new IndexRequest(index, type, id);
             request = request.source(EsUtil.rowToJsonMap(row, columnNames, columnTypes));
             bulkRequest.add(request);
-            client.bulk(bulkRequest);
+        }
+
+        BulkResponse response = client.bulk(bulkRequest);
+        if (response.hasFailures()){
+            if (dirtyDataManager != null){
+                BulkItemResponse[] itemResponses = response.getItems();
+                WriteRecordException exception;
+                for (int i = 0; i < itemResponses.length; i++) {
+                    if(itemResponses[i].isFailed()){
+                        exception = new WriteRecordException(itemResponses[i].getFailureMessage()
+                                ,itemResponses[i].getFailure().getCause());
+                        dirtyDataManager.writeData(rows.get(i), exception);
+                    }
+                }
+            } else {
+                LOG.warn(response.buildFailureMessage());
+            }
         }
     }
diff --git a/flinkx-es/flinkx-es-writer/src/main/java/com/dtstack/flinkx/es/writer/EsOutputFormatBuilder.java b/flinkx-es/flinkx-es-writer/src/main/java/com/dtstack/flinkx/es/writer/EsOutputFormatBuilder.java
index 4323ee4162..30edca2dac 100644
--- a/flinkx-es/flinkx-es-writer/src/main/java/com/dtstack/flinkx/es/writer/EsOutputFormatBuilder.java
+++ b/flinkx-es/flinkx-es-writer/src/main/java/com/dtstack/flinkx/es/writer/EsOutputFormatBuilder.java
@@ -20,6 +20,7 @@
 import com.dtstack.flinkx.outputformat.RichOutputFormatBuilder;
 import java.util.List;
+import java.util.Map;
 /**
  * The Builder class of EsOutputFormat
@@ -67,6 +68,10 @@ public void setColumnTypes(List columnTypes) {
         format.columnTypes = columnTypes;
     }
+    public EsOutputFormatBuilder setClientConfig(Map clientConfig){
+        format.clientConfig = clientConfig;
+        return this;
+    }
     @Override
     protected void checkFormat() {
diff --git a/flinkx-es/flinkx-es-writer/src/main/java/com/dtstack/flinkx/es/writer/EsWriter.java b/flinkx-es/flinkx-es-writer/src/main/java/com/dtstack/flinkx/es/writer/EsWriter.java
index 3ac8ba26f4..a218349ff7 100644
--- a/flinkx-es/flinkx-es-writer/src/main/java/com/dtstack/flinkx/es/writer/EsWriter.java
+++
b/flinkx-es/flinkx-es-writer/src/main/java/com/dtstack/flinkx/es/writer/EsWriter.java
@@ -27,6 +27,7 @@
 import org.apache.flink.streaming.api.functions.sink.OutputFormatSinkFunction;
 import org.apache.flink.types.Row;
 import java.util.ArrayList;
+import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
@@ -48,6 +49,8 @@ public class EsWriter extends DataWriter {
     private int bulkAction;
+    private Map clientConfig;
+
     private List columnTypes;
     private List columnNames;
@@ -66,6 +69,10 @@ public EsWriter(DataTransferConfig config) {
         index = writerConfig.getParameter().getStringVal(EsConfigKeys.KEY_INDEX);
         bulkAction = writerConfig.getParameter().getIntVal(EsConfigKeys.KEY_BULK_ACTION, DEFAULT_BULK_ACTION);
+        clientConfig = new HashMap<>();
+        clientConfig.put(EsConfigKeys.KEY_TIMEOUT, writerConfig.getParameter().getVal(EsConfigKeys.KEY_TIMEOUT));
+        clientConfig.put(EsConfigKeys.KEY_PATH_PREFIX, writerConfig.getParameter().getVal(EsConfigKeys.KEY_PATH_PREFIX));
+
         List columns = writerConfig.getParameter().getColumn();
         if(columns != null || columns.size() != 0) {
             columnTypes = new ArrayList<>();
@@ -111,6 +118,7 @@ public DataStreamSink writeData(DataStream dataSet) {
         builder.setIndex(index);
         builder.setType(type);
         builder.setBatchInterval(bulkAction);
+        builder.setClientConfig(clientConfig);
         builder.setColumnNames(columnNames);
         builder.setColumnTypes(columnTypes);
         builder.setIdColumnIndices(idColumnIndices);
diff --git a/flinkx-ftp/flinkx-ftp-core/src/main/java/com/dtstack/flinkx/ftp/StandardFtpHandler.java b/flinkx-ftp/flinkx-ftp-core/src/main/java/com/dtstack/flinkx/ftp/StandardFtpHandler.java
index ca2d0414d4..0c02c9f8e6 100644
--- a/flinkx-ftp/flinkx-ftp-core/src/main/java/com/dtstack/flinkx/ftp/StandardFtpHandler.java
+++ b/flinkx-ftp/flinkx-ftp-core/src/main/java/com/dtstack/flinkx/ftp/StandardFtpHandler.java
@@ -117,7 +117,7 @@ public boolean isDirExist(String directoryPath) {
             try {
                 ftpClient.changeWorkingDirectory(originDir);
             } catch (IOException e)
{
-                e.printStackTrace();
+                LOG.error(e.getMessage());
             }
         }
     }
diff --git a/flinkx-ftp/flinkx-ftp-reader/pom.xml b/flinkx-ftp/flinkx-ftp-reader/pom.xml
index fee2388531..7ca313e801 100644
--- a/flinkx-ftp/flinkx-ftp-reader/pom.xml
+++ b/flinkx-ftp/flinkx-ftp-reader/pom.xml
@@ -95,7 +95,7 @@ under the License.
+        tofile="${basedir}/../../plugins/ftpreader/${project.name}-${git.branch}.jar"/>
diff --git a/flinkx-ftp/flinkx-ftp-reader/src/main/java/com/dtstack/flinkx/ftp/reader/FtpInputFormat.java b/flinkx-ftp/flinkx-ftp-reader/src/main/java/com/dtstack/flinkx/ftp/reader/FtpInputFormat.java
index db22d524a6..377989213f 100644
--- a/flinkx-ftp/flinkx-ftp-reader/src/main/java/com/dtstack/flinkx/ftp/reader/FtpInputFormat.java
+++ b/flinkx-ftp/flinkx-ftp-reader/src/main/java/com/dtstack/flinkx/ftp/reader/FtpInputFormat.java
@@ -132,11 +132,6 @@ public void openInternal(InputSplit split) throws IOException {
             br.setFromLine(0);
         }
         br.setCharsetName(charsetName);
-
-        if(StringUtils.isNotBlank(monitorUrls) && this.bytes > 0) {
-            this.byteRateLimiter = new ByteRateLimiter(getRuntimeContext(), monitorUrls, bytes, 1);
-            this.byteRateLimiter.start();
-        }
     }
     @Override
diff --git a/flinkx-ftp/flinkx-ftp-writer/pom.xml b/flinkx-ftp/flinkx-ftp-writer/pom.xml
index 15297d5c45..2022d3ead1 100644
--- a/flinkx-ftp/flinkx-ftp-writer/pom.xml
+++ b/flinkx-ftp/flinkx-ftp-writer/pom.xml
@@ -94,7 +94,7 @@ under the License.
+        tofile="${basedir}/../../plugins/ftpwriter/${project.name}-${git.branch}.jar" />
diff --git a/flinkx-hbase/flinkx-hbase-reader/pom.xml b/flinkx-hbase/flinkx-hbase-reader/pom.xml
index ada607a2f2..e7f06daac8 100644
--- a/flinkx-hbase/flinkx-hbase-reader/pom.xml
+++ b/flinkx-hbase/flinkx-hbase-reader/pom.xml
@@ -92,7 +92,7 @@
+        tofile="${basedir}/../../plugins/hbasereader/${project.name}-${git.branch}.jar" />
diff --git a/flinkx-hbase/flinkx-hbase-reader/src/main/java/com/dtstack/flinkx/hbase/reader/HbaseInputFormat.java b/flinkx-hbase/flinkx-hbase-reader/src/main/java/com/dtstack/flinkx/hbase/reader/HbaseInputFormat.java
index 277f6545eb..c105392fc2 100644
--- a/flinkx-hbase/flinkx-hbase-reader/src/main/java/com/dtstack/flinkx/hbase/reader/HbaseInputFormat.java
+++ b/flinkx-hbase/flinkx-hbase-reader/src/main/java/com/dtstack/flinkx/hbase/reader/HbaseInputFormat.java
@@ -74,15 +74,8 @@ public class HbaseInputFormat extends RichInputFormat {
     public void configure(Configuration configuration) {
         LOG.info("HbaseOutputFormat configure start");
-        org.apache.hadoop.conf.Configuration hConfiguration = new org.apache.hadoop.conf.Configuration();
-        Validate.isTrue(hbaseConfig != null && hbaseConfig.size() !=0, "hbaseConfig不能为空Map结构!");
-
-        for (Map.Entry entry : hbaseConfig.entrySet()) {
-            hConfiguration.set(entry.getKey(), entry.getValue());
-        }
-
         try {
-            connection = ConnectionFactory.createConnection(hConfiguration);
+            connection = ConnectionFactory.createConnection(getConfig());
         } catch (Exception e) {
             HbaseHelper.closeConnection(connection);
             throw new IllegalArgumentException(e);
@@ -91,6 +84,17 @@ public void configure(Configuration configuration) {
         LOG.info("HbaseOutputFormat configure end");
     }
+    public org.apache.hadoop.conf.Configuration getConfig(){
+        org.apache.hadoop.conf.Configuration hConfiguration = new org.apache.hadoop.conf.Configuration();
+        Validate.isTrue(hbaseConfig != null && hbaseConfig.size() !=0, "hbaseConfig不能为空Map结构!");
+
+        for (Map.Entry entry :
hbaseConfig.entrySet()) {
+            hConfiguration.set(entry.getKey(), entry.getValue());
+        }
+
+        return hConfiguration;
+    }
+
     @Override
     public BaseStatistics getStatistics(BaseStatistics baseStatistics) throws IOException {
         return null;
@@ -220,6 +224,16 @@ public void openInternal(InputSplit inputSplit) throws IOException {
         HbaseInputSplit hbaseInputSplit = (HbaseInputSplit) inputSplit;
         byte[] startRow = Bytes.toBytesBinary(hbaseInputSplit.getStartkey());
         byte[] stopRow = Bytes.toBytesBinary(hbaseInputSplit.getEndKey());
+
+        if(null == connection || connection.isClosed()){
+            try {
+                connection = ConnectionFactory.createConnection(getConfig());
+            } catch (Exception e) {
+                HbaseHelper.closeConnection(connection);
+                throw new IllegalArgumentException(e);
+            }
+        }
+
         table = connection.getTable(TableName.valueOf(tableName));
         scan = new Scan();
         scan.setStartRow(startRow);
@@ -227,11 +241,6 @@ public void openInternal(InputSplit inputSplit) throws IOException {
         scan.setCaching(scanCacheSize);
         scan.setBatch(scanBatchSize);
         resultScanner = table.getScanner(scan);
-
-        if(StringUtils.isNotBlank(monitorUrls) && this.bytes > 0) {
-            this.byteRateLimiter = new ByteRateLimiter(getRuntimeContext(), monitorUrls, bytes, 1);
-            this.byteRateLimiter.start();
-        }
     }
     @Override
@@ -264,15 +273,13 @@ public Row nextRecordInternal(Row row) throws IOException {
                         String family = arr[0].trim();
                         String qualifier = arr[1].trim();
                         bytes = next.getValue(family.getBytes(), qualifier.getBytes());
-                        //col = String.valueOf(bytes);
                     }
                     col = convertBytesToAssignType(columnType, bytes, columnFormat);
                 }
                 row.setField(i, col);
             } catch(Exception e) {
-                e.printStackTrace();
+                throw new IOException("Couldn't read data:",e);
             }
-
         }
         return row;
diff --git a/flinkx-hbase/flinkx-hbase-writer/pom.xml b/flinkx-hbase/flinkx-hbase-writer/pom.xml
index e7ce7428a3..361bdcb4b8 100644
--- a/flinkx-hbase/flinkx-hbase-writer/pom.xml
+++ b/flinkx-hbase/flinkx-hbase-writer/pom.xml
@@ -79,7 +79,7 @@
+
tofile="${basedir}/../../plugins/hbasewriter/${project.name}-${git.branch}.jar" />
diff --git a/flinkx-hdfs/flinkx-hdfs-core/src/main/java/com/dtstack/flinkx/hdfs/HdfsConfigKeys.java b/flinkx-hdfs/flinkx-hdfs-core/src/main/java/com/dtstack/flinkx/hdfs/HdfsConfigKeys.java
index eafa081221..b61f298ed6 100644
--- a/flinkx-hdfs/flinkx-hdfs-core/src/main/java/com/dtstack/flinkx/hdfs/HdfsConfigKeys.java
+++ b/flinkx-hdfs/flinkx-hdfs-core/src/main/java/com/dtstack/flinkx/hdfs/HdfsConfigKeys.java
@@ -38,10 +38,6 @@ public class HdfsConfigKeys {
     public static final String KEY_WRITE_MODE = "writeMode";
-    public static final String KEY_USERNAME = "username";
-
-    public static final String KEY_PASSWORD = "password";
-
     public static final String KEY_FULL_COLUMN_NAME_LIST = "fullColumnName";
     public static final String KEY_FULL_COLUMN_TYPE_LIST = "fullColumnType";
@@ -52,8 +48,6 @@ public class HdfsConfigKeys {
     public static final String KEY_COMPRESS = "compress";
-    public static final String KEY_PARTITION = "partition";
-
     public static final String KEY_FILE_NAME = "fileName";
     public static final String KEY_ENCODING = "encoding";
diff --git a/flinkx-hdfs/flinkx-hdfs-core/src/main/java/com/dtstack/flinkx/hdfs/HdfsUtil.java b/flinkx-hdfs/flinkx-hdfs-core/src/main/java/com/dtstack/flinkx/hdfs/HdfsUtil.java
index fdce472e6d..0c6425fbda 100644
--- a/flinkx-hdfs/flinkx-hdfs-core/src/main/java/com/dtstack/flinkx/hdfs/HdfsUtil.java
+++ b/flinkx-hdfs/flinkx-hdfs-core/src/main/java/com/dtstack/flinkx/hdfs/HdfsUtil.java
@@ -24,6 +24,7 @@
 import org.apache.commons.lang3.math.NumberUtils;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hive.ql.io.HiveBinaryOutputFormat;
 import org.apache.hadoop.hive.serde2.io.DateWritable;
 import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;
 import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
@@ -155,6 +156,8 @@ public static Object getWritableValue(Object
writable) {
             ret = ((DateWritable) writable).get();
         } else if(writable instanceof Writable) {
             ret = writable.toString();
+        } else {
+            ret = writable.toString();
         }
         return ret;
@@ -198,6 +201,9 @@ public static ObjectInspector columnTypeToObjectInspetor(ColumnType columnType)
             case BOOLEAN:
                 objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Boolean.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
                 break;
+            case BINARY:
+                objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(BytesWritable.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
+                break;
             default:
                 throw new IllegalArgumentException("You should not be here");
         }
diff --git a/flinkx-hdfs/flinkx-hdfs-reader/pom.xml b/flinkx-hdfs/flinkx-hdfs-reader/pom.xml
index a60bb14ca9..afb0f02ac3 100644
--- a/flinkx-hdfs/flinkx-hdfs-reader/pom.xml
+++ b/flinkx-hdfs/flinkx-hdfs-reader/pom.xml
@@ -118,7 +118,7 @@ under the License.
+        tofile="${basedir}/../../plugins/hdfsreader/${project.name}-${git.branch}.jar" />
diff --git a/flinkx-hdfs/flinkx-hdfs-reader/src/main/java/com/dtstack/flinkx/hdfs/reader/HdfsOrcInputFormat.java b/flinkx-hdfs/flinkx-hdfs-reader/src/main/java/com/dtstack/flinkx/hdfs/reader/HdfsOrcInputFormat.java
index c81618090f..0edde3ff8e 100644
--- a/flinkx-hdfs/flinkx-hdfs-reader/src/main/java/com/dtstack/flinkx/hdfs/reader/HdfsOrcInputFormat.java
+++ b/flinkx-hdfs/flinkx-hdfs-reader/src/main/java/com/dtstack/flinkx/hdfs/reader/HdfsOrcInputFormat.java
@@ -204,11 +204,9 @@ public Row nextRecordInternal(Row row) throws IOException {
                     val = metaColumn.getValue();
                 }
-                if(val instanceof String){
+                if(val instanceof String || val instanceof org.apache.hadoop.io.Text){
                     val = HdfsUtil.string2col(String.valueOf(val),metaColumn.getType(),metaColumn.getTimeFormat());
-                }
-
-                if (val != null) {
+                } else if(val != null){
                     val = HdfsUtil.getWritableValue(val);
                 }
diff --git
a/flinkx-hdfs/flinkx-hdfs-reader/src/main/java/com/dtstack/flinkx/hdfs/reader/HdfsParquetInputFormat.java b/flinkx-hdfs/flinkx-hdfs-reader/src/main/java/com/dtstack/flinkx/hdfs/reader/HdfsParquetInputFormat.java
index 6b452f9e14..f1f1c3d210 100644
--- a/flinkx-hdfs/flinkx-hdfs-reader/src/main/java/com/dtstack/flinkx/hdfs/reader/HdfsParquetInputFormat.java
+++ b/flinkx-hdfs/flinkx-hdfs-reader/src/main/java/com/dtstack/flinkx/hdfs/reader/HdfsParquetInputFormat.java
@@ -172,6 +172,11 @@ public boolean reachedEnd() throws IOException {
     private Object getData(Group currentLine,String type,int index){
         Object data = null;
         try{
+            if (index == -1){
+                return null;
+            }
+
+            Type colSchemaType = currentLine.getType().getType(index);
             switch (type){
                 case "tinyint" :
                 case "smallint" :
@@ -190,10 +195,18 @@ private Object getData(Group currentLine,String type,int index){
                     break;
                 }
                 case "decimal" : {
-                    Type structType = currentLine.getType().getType(index);
-                    DecimalMetadata dm = ((PrimitiveType) structType).getDecimalMetadata();
-                    Binary binary = currentLine.getBinary(index,0);
-                    data = binaryToDecimalStr(binary,dm.getScale());
+                    DecimalMetadata dm = ((PrimitiveType) colSchemaType).getDecimalMetadata();
+                    String primitiveTypeName = currentLine.getType().getType(index).asPrimitiveType().getPrimitiveTypeName().name();
+                    if ("INT32".equals(primitiveTypeName)){
+                        int intVal = currentLine.getInteger(index,0);
+                        data = longToDecimalStr((long)intVal,dm.getScale());
+                    } else if("INT64".equals(primitiveTypeName)){
+                        long longVal = currentLine.getLong(index,0);
+                        data = longToDecimalStr(longVal,dm.getScale());
+                    } else {
+                        Binary binary = currentLine.getBinary(index,0);
+                        data = binaryToDecimalStr(binary,dm.getScale());
+                    }
                     break;
                 }
                 case "date" : {
@@ -231,10 +244,21 @@ public HdfsParquetSplit[] createInputSplits(int minNumSplits) throws IOException
     public void closeInternal() throws IOException {
         if (currentFileReader != null){
             currentFileReader.close();
+            currentFileReader = null;
         }
+
+
currentLine = null;
+        currentFileIndex = 0;
+    }
+
+    private String longToDecimalStr(long value,int scale){
+        BigInteger bi = BigInteger.valueOf(value);
+        BigDecimal bg = new BigDecimal(bi, scale);
+
+        return bg.toString();
     }
-    private static String binaryToDecimalStr(Binary binary,int scale){
+    private String binaryToDecimalStr(Binary binary,int scale){
         BigInteger bi = new BigInteger(binary.getBytes());
         BigDecimal bg = new BigDecimal(bi,scale);
diff --git a/flinkx-hdfs/flinkx-hdfs-reader/src/main/java/com/dtstack/flinkx/hdfs/reader/HdfsTextInputFormat.java b/flinkx-hdfs/flinkx-hdfs-reader/src/main/java/com/dtstack/flinkx/hdfs/reader/HdfsTextInputFormat.java
index e66b07877a..a38c9b89c6 100644
--- a/flinkx-hdfs/flinkx-hdfs-reader/src/main/java/com/dtstack/flinkx/hdfs/reader/HdfsTextInputFormat.java
+++ b/flinkx-hdfs/flinkx-hdfs-reader/src/main/java/com/dtstack/flinkx/hdfs/reader/HdfsTextInputFormat.java
@@ -30,13 +30,13 @@
 import org.apache.hadoop.mapred.TextInputFormat;
 import org.apache.hadoop.mapred.Reporter;
 import org.apache.hadoop.mapred.FileSplit;
+
 import java.io.ByteArrayInputStream;
 import java.io.DataInputStream;
 import java.io.DataOutputStream;
 import java.io.IOException;
 import java.nio.charset.Charset;
 import java.nio.charset.UnsupportedCharsetException;
-import java.util.List;
 import java.util.Map;
 /**
@@ -81,12 +81,9 @@ public void openInternal(InputSplit inputSplit) throws IOException {
         value = new Text();
     }
-
-
     @Override
     public Row nextRecordInternal(Row row) throws IOException {
-        byte[] data = ((Text)value).getBytes();
-        String line = new String(data, charsetName);
+        String line = new String(((Text)value).getBytes(), 0, ((Text)value).getLength(), charsetName);
         String[] fields = line.split(delimiter);
         if (metaColumns.size() == 1 && "*".equals(metaColumns.get(0).getName())){
diff --git a/flinkx-hdfs/flinkx-hdfs-writer/pom.xml b/flinkx-hdfs/flinkx-hdfs-writer/pom.xml
index de7c514b09..94c88df44b 100644
--- a/flinkx-hdfs/flinkx-hdfs-writer/pom.xml
+++ b/flinkx-hdfs/flinkx-hdfs-writer/pom.xml
@@ -119,7 +119,7 @@ under the License.
+        tofile="${basedir}/../../plugins/hdfswriter/${project.name}-${git.branch}.jar" />
diff --git a/flinkx-hdfs/flinkx-hdfs-writer/src/main/java/com/dtstack/flinkx/hdfs/writer/HdfsOrcOutputFormat.java b/flinkx-hdfs/flinkx-hdfs-writer/src/main/java/com/dtstack/flinkx/hdfs/writer/HdfsOrcOutputFormat.java
index e7250c56c5..860d41be89 100644
--- a/flinkx-hdfs/flinkx-hdfs-writer/src/main/java/com/dtstack/flinkx/hdfs/writer/HdfsOrcOutputFormat.java
+++ b/flinkx-hdfs/flinkx-hdfs-writer/src/main/java/com/dtstack/flinkx/hdfs/writer/HdfsOrcOutputFormat.java
@@ -30,6 +30,7 @@
 import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
 import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
 import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
+import org.apache.hadoop.io.BytesWritable;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.compress.SnappyCodec;
 import org.apache.hadoop.mapred.FileOutputFormat;
@@ -169,6 +170,9 @@ public void writeSingleRecordInternal(Row row) throws WriteRecordException {
                     case TIMESTAMP:
                         recordList.add(DateUtil.columnToTimestamp(column,null));
                         break;
+                    case BINARY:
+                        recordList.add(new BytesWritable(rowData.getBytes()));
+                        break;
                     default:
                         throw new IllegalArgumentException();
                 }
diff --git a/flinkx-hdfs/flinkx-hdfs-writer/src/main/java/com/dtstack/flinkx/hdfs/writer/HdfsParquetOutputFormat.java b/flinkx-hdfs/flinkx-hdfs-writer/src/main/java/com/dtstack/flinkx/hdfs/writer/HdfsParquetOutputFormat.java
index d7f9bec2be..104b4507c6 100644
--- a/flinkx-hdfs/flinkx-hdfs-writer/src/main/java/com/dtstack/flinkx/hdfs/writer/HdfsParquetOutputFormat.java
+++ b/flinkx-hdfs/flinkx-hdfs-writer/src/main/java/com/dtstack/flinkx/hdfs/writer/HdfsParquetOutputFormat.java
@@ -74,7 +74,7 @@ public class HdfsParquetOutputFormat extends HdfsOutputFormat {
         try {
cal.setTime(DateUtil.getDateFormatter().parse("1970-01-01"));
         } catch (Exception e){
-            e.printStackTrace();
+            throw new RuntimeException("Init calendar fail:",e);
         }
     }
@@ -86,7 +86,7 @@ protected void open() throws IOException {
         ExampleParquetWriter.Builder builder = ExampleParquetWriter.builder(writePath)
                 .withWriteMode(ParquetFileWriter.Mode.CREATE)
-                .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
+                .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_1_0)
                 .withCompressionCodec(CompressionCodecName.SNAPPY)
                 .withConf(conf)
                 .withType(schema)
@@ -173,7 +173,6 @@ protected void writeSingleRecordInternal(Row row) throws WriteRecordException {
             writer.write(group);
         } catch (Exception e){
-            e.printStackTrace();
             if(i < row.getArity()) {
                 throw new WriteRecordException(recordConvertDetailErrorMessage(i, row), e, i, row);
             }
@@ -240,7 +239,7 @@ private MessageType buildSchema(){
             if (colType.contains("decimal")){
                 int precision = Integer.parseInt(colType.substring(colType.indexOf("(") + 1,colType.indexOf(",")).trim());
                 int scale = Integer.parseInt(colType.substring(colType.indexOf(",") + 1,colType.indexOf(")")).trim());
-                typeBuilder.optional(PrimitiveType.PrimitiveTypeName.BINARY)
+                typeBuilder.optional(PrimitiveType.PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY)
                         .as(OriginalType.DECIMAL)
                         .precision(precision)
                         .scale(scale)
diff --git a/flinkx-hdfs/flinkx-hdfs-writer/src/main/java/com/dtstack/flinkx/hdfs/writer/HdfsWriter.java b/flinkx-hdfs/flinkx-hdfs-writer/src/main/java/com/dtstack/flinkx/hdfs/writer/HdfsWriter.java
index e41871f899..0912b94a94 100644
--- a/flinkx-hdfs/flinkx-hdfs-writer/src/main/java/com/dtstack/flinkx/hdfs/writer/HdfsWriter.java
+++ b/flinkx-hdfs/flinkx-hdfs-writer/src/main/java/com/dtstack/flinkx/hdfs/writer/HdfsWriter.java
@@ -75,17 +75,6 @@ public class HdfsWriter extends DataWriter {
     protected List fullColumnType;
-    /** hive config **/
-    protected String partition;
-
-    protected String dbUrl;
-
-    protected String username;
-
-    protected
String password;
-
-    protected String table;
-
     protected static final String DATA_SUBDIR = ".data";
     protected static final String FINISHED_SUBDIR = ".finished";
diff --git a/flinkx-launcher/src/main/java/com/dtstack/flinkx/launcher/Launcher.java b/flinkx-launcher/src/main/java/com/dtstack/flinkx/launcher/Launcher.java
index 8cc037977b..72333381b3 100644
--- a/flinkx-launcher/src/main/java/com/dtstack/flinkx/launcher/Launcher.java
+++ b/flinkx-launcher/src/main/java/com/dtstack/flinkx/launcher/Launcher.java
@@ -21,10 +21,14 @@
 import com.dtstack.flinkx.config.ContentConfig;
 import com.dtstack.flinkx.config.DataTransferConfig;
 import com.dtstack.flinkx.util.SysUtil;
+import org.apache.commons.lang3.StringUtils;
 import org.apache.flink.client.program.ClusterClient;
 import org.apache.flink.client.program.PackagedProgram;
 import org.apache.flink.util.Preconditions;
+
 import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.FilenameFilter;
 import java.net.MalformedURLException;
 import java.net.URL;
 import java.util.ArrayList;
@@ -38,6 +42,8 @@
  */
 public class Launcher {
+    public static final String CORE_JAR_NAME_PREFIX = "flinkx";
+
     private static List initFlinkxArgList(LauncherOptions launcherOptions) {
         List argList = new ArrayList<>();
         argList.add("-job");
@@ -75,9 +81,6 @@ private static List analyzeUserClasspath(String content, String pluginRoot)
         return urlList;
     }
-
-
-
     public static void main(String[] args) throws Exception {
         LauncherOptions launcherOptions = new LauncherOptionParser(args).getLauncherOptions();
         String mode = launcherOptions.getMode();
@@ -93,7 +96,8 @@ public static void main(String[] args) throws Exception {
             String pluginRoot = launcherOptions.getPlugin();
             String content = launcherOptions.getJob();
-            File jarFile = new File(pluginRoot + File.separator + "flinkx.jar");
+            String coreJarName = getCoreJarFileName(pluginRoot);
+            File jarFile = new File(pluginRoot + File.separator + coreJarName);
             List urlList =
analyzeUserClasspath(content, pluginRoot);
             String[] remoteArgs = argList.toArray(new String[argList.size()]);
             PackagedProgram program = new PackagedProgram(jarFile, urlList, remoteArgs);
@@ -101,4 +105,27 @@ public static void main(String[] args) throws Exception {
             clusterClient.shutdown();
         }
     }
+
+    private static String getCoreJarFileName (String pluginRoot) throws FileNotFoundException{
+        String coreJarFileName = null;
+        File pluginDir = new File(pluginRoot);
+        if (pluginDir.exists() && pluginDir.isDirectory()){
+            File[] jarFiles = pluginDir.listFiles(new FilenameFilter() {
+                @Override
+                public boolean accept(File dir, String name) {
+                    return name.toLowerCase().startsWith(CORE_JAR_NAME_PREFIX) && name.toLowerCase().endsWith(".jar");
+                }
+            });
+
+            if (jarFiles != null && jarFiles.length > 0){
+                coreJarFileName = jarFiles[0].getName();
+            }
+        }
+
+        if (StringUtils.isEmpty(coreJarFileName)){
+            throw new FileNotFoundException("Can not find core jar file in path:" + pluginRoot);
+        }
+
+        return coreJarFileName;
+    }
 }
diff --git a/flinkx-mongodb/flinkx-mongodb-core/src/main/java/com/dtstack/flinkx/mongodb/MongodbConfigKeys.java b/flinkx-mongodb/flinkx-mongodb-core/src/main/java/com/dtstack/flinkx/mongodb/MongodbConfigKeys.java
index 3e3b462399..1f5617b915 100644
--- a/flinkx-mongodb/flinkx-mongodb-core/src/main/java/com/dtstack/flinkx/mongodb/MongodbConfigKeys.java
+++ b/flinkx-mongodb/flinkx-mongodb-core/src/main/java/com/dtstack/flinkx/mongodb/MongodbConfigKeys.java
@@ -38,13 +38,21 @@ public class MongodbConfigKeys {
     public final static String KEY_FILTER = "filter";
+    public final static String KEY_FETCH_SIZE = "fetchSize";
+
     public final static String KEY_MODE = "writeMode";
     public final static String KEY_REPLACE_KEY = "replaceKey";
-    public final static String KEY_NAME = "name";
+    public final static String KEY_MONGODB_CONFIG = "mongodbConfig";
+
+    public final static String KEY_CONNECTIONS_PERHOST = "connectionsPerHost";
+
+    public final static String
KEY_THREADS_FOR_CONNECTION_MULTIPLIER = "threadsForConnectionMultiplier"; + + public final static String KEY_CONNECTION_TIMEOUT = "connectionTimeout"; - public final static String KEY_TYPE = "type"; + public final static String KEY_MAX_WAIT_TIME = "maxWaitTime"; - public final static String KEY_SPLITTER = "splitter"; + public final static String KEY_SOCKET_TIMEOUT = "socketTimeout"; } diff --git a/flinkx-mongodb/flinkx-mongodb-core/src/main/java/com/dtstack/flinkx/mongodb/MongodbUtil.java b/flinkx-mongodb/flinkx-mongodb-core/src/main/java/com/dtstack/flinkx/mongodb/MongodbUtil.java index a8d10ba727..2ca8502fca 100644 --- a/flinkx-mongodb/flinkx-mongodb-core/src/main/java/com/dtstack/flinkx/mongodb/MongodbUtil.java +++ b/flinkx-mongodb/flinkx-mongodb-core/src/main/java/com/dtstack/flinkx/mongodb/MongodbUtil.java @@ -25,8 +25,8 @@ import com.dtstack.flinkx.util.TelnetUtil; import com.google.common.collect.Lists; import com.mongodb.*; -import com.mongodb.client.MongoCollection; -import com.mongodb.client.MongoDatabase; +import com.mongodb.client.MongoCursor; +import org.apache.commons.collections.MapUtils; import org.apache.commons.lang.StringUtils; import org.apache.flink.types.Row; import org.bson.Document; @@ -60,69 +60,58 @@ public class MongodbUtil { private static final Integer DEFAULT_PORT = 27017; - private static final Integer ONE_SECOND = 1000; + private static final Integer DEFAULT_CONNECTIONS_PER_HOST = 100; - private static final Integer CONNECTIONS_PER_HOST = 100; + private static final Integer DEFAULT_THREADS_FOR_CONNECTION_MULTIPLIER = 100; - private static final Integer THREADS_FOR_CONNECTION_MULTIPLIER = 100; + private static final Integer DEFAULT_CONNECT_TIMEOUT = 10 * 1000; - private static final Integer CONNECT_TIMEOUT = 10 * ONE_SECOND; + private static final Integer DEFAULT_MAX_WAIT_TIME = 5 * 1000; - private static final Integer MAX_WAIT_TIME = 5 * ONE_SECOND; - - private static final Integer SOCKET_TIMEOUT = 0; - - private static MongoClient 
mongoClient; + private static final Integer DEFAULT_SOCKET_TIMEOUT = 0; /** * Get mongo client - * @param config + * @param mongodbConfig * @return MongoClient */ - public static MongoClient getMongoClient(Map config){ + public static MongoClient getMongoClient(Map mongodbConfig){ + MongoClient mongoClient; try{ - if(mongoClient == null){ - MongoClientOptions options = getOption(); - List serverAddress = getServerAddress(config.get(KEY_HOST_PORTS)); - String username = config.get(KEY_USERNAME); - String password = config.get(KEY_PASSWORD); - String database = config.get(KEY_DATABASE); - - if(StringUtils.isEmpty(username)){ - mongoClient = new MongoClient(serverAddress,options); - } else { - MongoCredential credential = MongoCredential.createScramSha1Credential(username, database, password.toCharArray()); - List credentials = Lists.newArrayList(); - credentials.add(credential); - - mongoClient = new MongoClient(serverAddress,credentials,options); - } - - - LOG.info("mongo客户端获取成功"); + MongoClientOptions options = getOption(mongodbConfig); + List serverAddress = getServerAddress(MapUtils.getString(mongodbConfig, KEY_HOST_PORTS)); + String username = MapUtils.getString(mongodbConfig, KEY_USERNAME); + String password = MapUtils.getString(mongodbConfig, KEY_PASSWORD); + String database = MapUtils.getString(mongodbConfig, KEY_DATABASE); + + if(StringUtils.isEmpty(username)){ + mongoClient = new MongoClient(serverAddress,options); + } else { + MongoCredential credential = MongoCredential.createScramSha1Credential(username, database, password.toCharArray()); + List credentials = Lists.newArrayList(); + credentials.add(credential); + + mongoClient = new MongoClient(serverAddress,credentials,options); } + + LOG.info("Get mongodb client successful"); return mongoClient; }catch (Exception e){ throw new RuntimeException(e); } } - public static MongoDatabase getDatabase(Map config,String database){ - MongoClient client = getMongoClient(config); - return 
mongoClient.getDatabase(database); - } - - public static MongoCollection getCollection(Map config,String database, String collection){ - MongoClient client = getMongoClient(config); - MongoDatabase db = client.getDatabase(database); - - return db.getCollection(collection); - } + public static void close(MongoClient mongoClient, MongoCursor cursor){ + if (cursor != null){ + LOG.info("Start close mongodb cursor"); + cursor.close(); + LOG.info("Close mongodb cursor successfully"); + } - public static void close(){ if (mongoClient != null){ + LOG.info("Start close mongodb client"); mongoClient.close(); - mongoClient = null; + LOG.info("Close mongodb client successfully"); } } @@ -181,13 +170,29 @@ private static List getServerAddress(String hostPorts) { return addresses; } - private static MongoClientOptions getOption(){ + private static MongoClientOptions getOption(Map mongodbConfig){ MongoClientOptions.Builder build = new MongoClientOptions.Builder(); - build.connectionsPerHost(CONNECTIONS_PER_HOST); - build.threadsAllowedToBlockForConnectionMultiplier(THREADS_FOR_CONNECTION_MULTIPLIER); - build.connectTimeout(CONNECT_TIMEOUT); - build.maxWaitTime(MAX_WAIT_TIME); - build.socketTimeout(SOCKET_TIMEOUT); + + int connectionsPerHost = MapUtils.getIntValue(mongodbConfig, KEY_CONNECTIONS_PERHOST, DEFAULT_CONNECTIONS_PER_HOST); + LOG.info("Mongodb config -- connectionsPerHost:" + connectionsPerHost); + build.connectionsPerHost(connectionsPerHost); + + int threadsForConnectionMultiplier = MapUtils.getIntValue(mongodbConfig, KEY_THREADS_FOR_CONNECTION_MULTIPLIER, DEFAULT_THREADS_FOR_CONNECTION_MULTIPLIER); + LOG.info("Mongodb config -- threadsForConnectionMultiplier:" + threadsForConnectionMultiplier); + build.threadsAllowedToBlockForConnectionMultiplier(threadsForConnectionMultiplier); + + int connectionTimeout = MapUtils.getIntValue(mongodbConfig, KEY_CONNECTION_TIMEOUT, DEFAULT_CONNECT_TIMEOUT); + LOG.info("Mongodb config -- connectionTimeout:" + connectionTimeout); + 
build.connectTimeout(connectionTimeout); + + int maxWaitTime = MapUtils.getIntValue(mongodbConfig, KEY_MAX_WAIT_TIME, DEFAULT_MAX_WAIT_TIME); + LOG.info("Mongodb config -- maxWaitTime:" + maxWaitTime); + build.maxWaitTime(maxWaitTime); + + int socketTimeout = MapUtils.getIntValue(mongodbConfig, KEY_SOCKET_TIMEOUT, DEFAULT_SOCKET_TIMEOUT); + LOG.info("Mongodb config -- socketTimeout:" + socketTimeout); + build.socketTimeout(socketTimeout); + + build.writeConcern(WriteConcern.UNACKNOWLEDGED); return build.build(); } diff --git a/flinkx-mongodb/flinkx-mongodb-reader/pom.xml b/flinkx-mongodb/flinkx-mongodb-reader/pom.xml index ab40ec8100..7370b820ef 100644 --- a/flinkx-mongodb/flinkx-mongodb-reader/pom.xml +++ b/flinkx-mongodb/flinkx-mongodb-reader/pom.xml @@ -66,7 +66,7 @@ + tofile="${basedir}/../../plugins/mongodbreader/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-mongodb/flinkx-mongodb-reader/src/main/java/com/dtstack/flinkx/mongodb/reader/MongodbInputFormat.java b/flinkx-mongodb/flinkx-mongodb-reader/src/main/java/com/dtstack/flinkx/mongodb/reader/MongodbInputFormat.java index 50847d4f79..2a1beeed2f 100644 --- a/flinkx-mongodb/flinkx-mongodb-reader/src/main/java/com/dtstack/flinkx/mongodb/reader/MongodbInputFormat.java +++ b/flinkx-mongodb/flinkx-mongodb-reader/src/main/java/com/dtstack/flinkx/mongodb/reader/MongodbInputFormat.java @@ -23,9 +23,11 @@ import com.dtstack.flinkx.reader.MetaColumn; import com.dtstack.flinkx.util.StringUtil; import com.mongodb.BasicDBObject; +import com.mongodb.MongoClient; import com.mongodb.client.FindIterable; import com.mongodb.client.MongoCollection; import com.mongodb.client.MongoCursor; +import com.mongodb.client.MongoDatabase; import org.apache.commons.lang.StringUtils; import org.apache.flink.configuration.Configuration; import org.apache.flink.core.io.InputSplit; @@ -36,8 +38,6 @@ import java.io.IOException; import java.util.*; -import static com.dtstack.flinkx.mongodb.MongodbConfigKeys.*; - /** * Read plugin for 
reading static data * @@ -60,21 +60,18 @@ public class MongodbInputFormat extends RichInputFormat { protected String filterJson; - private Bson filter; + protected Map mongodbConfig; - private transient MongoCollection collection; + protected int fetchSize; + + private Bson filter; private transient MongoCursor cursor; + private transient MongoClient client; + @Override public void configure(Configuration parameters) { - Map config = new HashMap<>(4); - config.put(KEY_HOST_PORTS,hostPorts); - config.put(KEY_USERNAME,username); - config.put(KEY_PASSWORD,password); - config.put(KEY_DATABASE,database); - - collection = MongodbUtil.getCollection(config,database,collectionName); buildFilter(); } @@ -83,13 +80,19 @@ protected void openInternal(InputSplit inputSplit) throws IOException { MongodbInputSplit split = (MongodbInputSplit) inputSplit; FindIterable findIterable; + client = MongodbUtil.getMongoClient(mongodbConfig); + MongoDatabase db = client.getDatabase(database); + MongoCollection collection = db.getCollection(collectionName); + if(filter == null){ findIterable = collection.find(); } else { findIterable = collection.find(filter); } - findIterable = findIterable.skip(split.getSkip()).limit(split.getLimit()); + findIterable = findIterable.skip(split.getSkip()) + .limit(split.getLimit()) + .batchSize(fetchSize); cursor = findIterable.iterator(); } @@ -130,29 +133,37 @@ public Row nextRecordInternal(Row row) throws IOException { @Override protected void closeInternal() throws IOException { - if (cursor != null){ - cursor.close(); - MongodbUtil.close(); - } + MongodbUtil.close(client, cursor); } @Override public InputSplit[] createInputSplits(int minNumSplits) throws IOException { ArrayList splits = new ArrayList<>(); - long docNum = filter == null ? 
collection.count() : collection.count(filter); - if(docNum <= minNumSplits){ - splits.add(new MongodbInputSplit(0,(int)docNum)); - return splits.toArray(new MongodbInputSplit[splits.size()]); - } + MongoClient client = null; + try { + client = MongodbUtil.getMongoClient(mongodbConfig); + MongoDatabase db = client.getDatabase(database); + MongoCollection collection = db.getCollection(collectionName); - long size = Math.floorDiv(docNum,(long)minNumSplits); - for (int i = 0; i < minNumSplits; i++) { - splits.add(new MongodbInputSplit((int)(i * size), (int)size)); - } + long docNum = filter == null ? collection.count() : collection.count(filter); + if(docNum <= minNumSplits){ + splits.add(new MongodbInputSplit(0,(int)docNum)); + return splits.toArray(new MongodbInputSplit[splits.size()]); + } - if(size * minNumSplits < docNum){ - splits.add(new MongodbInputSplit((int)(size * minNumSplits), (int)(docNum - size * minNumSplits))); + long size = Math.floorDiv(docNum,(long)minNumSplits); + for (int i = 0; i < minNumSplits; i++) { + splits.add(new MongodbInputSplit((int)(i * size), (int)size)); + } + + if(size * minNumSplits < docNum){ + splits.add(new MongodbInputSplit((int)(size * minNumSplits), (int)(docNum - size * minNumSplits))); + } + } catch (Exception e){ + LOG.error("{}", e); + } finally { + MongodbUtil.close(client, null); } return splits.toArray(new MongodbInputSplit[splits.size()]); diff --git a/flinkx-mongodb/flinkx-mongodb-reader/src/main/java/com/dtstack/flinkx/mongodb/reader/MongodbInputFormatBuilder.java b/flinkx-mongodb/flinkx-mongodb-reader/src/main/java/com/dtstack/flinkx/mongodb/reader/MongodbInputFormatBuilder.java index 2c73cedb7b..b97fcd2116 100644 --- a/flinkx-mongodb/flinkx-mongodb-reader/src/main/java/com/dtstack/flinkx/mongodb/reader/MongodbInputFormatBuilder.java +++ b/flinkx-mongodb/flinkx-mongodb-reader/src/main/java/com/dtstack/flinkx/mongodb/reader/MongodbInputFormatBuilder.java @@ -22,6 +22,7 @@ import com.dtstack.flinkx.reader.MetaColumn; 
import java.util.List; +import java.util.Map; /** * The builder for mongodb reader plugin @@ -61,6 +62,14 @@ public void setMetaColumns(List metaColumns){ format.metaColumns = metaColumns; } + public void setMongodbConfig(Map mongodbConfig){ + format.mongodbConfig = mongodbConfig; + } + + public void setFetchSize(int fetchSize){ + format.fetchSize = fetchSize; + } + public void setFilter(String filter){ format.filterJson = filter; } diff --git a/flinkx-mongodb/flinkx-mongodb-reader/src/main/java/com/dtstack/flinkx/mongodb/reader/MongodbReader.java b/flinkx-mongodb/flinkx-mongodb-reader/src/main/java/com/dtstack/flinkx/mongodb/reader/MongodbReader.java index b781d01127..2a32a73aa1 100644 --- a/flinkx-mongodb/flinkx-mongodb-reader/src/main/java/com/dtstack/flinkx/mongodb/reader/MongodbReader.java +++ b/flinkx-mongodb/flinkx-mongodb-reader/src/main/java/com/dtstack/flinkx/mongodb/reader/MongodbReader.java @@ -26,7 +26,9 @@ import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; import org.apache.flink.types.Row; +import java.util.HashMap; import java.util.List; +import java.util.Map; import static com.dtstack.flinkx.mongodb.MongodbConfigKeys.*; @@ -52,6 +54,10 @@ public class MongodbReader extends DataReader { protected String filter; + protected Map mongodbConfig; + + protected int fetchSize; + public MongodbReader(DataTransferConfig config, StreamExecutionEnvironment env) { super(config, env); @@ -62,13 +68,19 @@ public MongodbReader(DataTransferConfig config, StreamExecutionEnvironment env) database = readerConfig.getParameter().getStringVal(KEY_DATABASE); collection = readerConfig.getParameter().getStringVal(KEY_COLLECTION); filter = readerConfig.getParameter().getStringVal(KEY_FILTER); + fetchSize = readerConfig.getParameter().getIntVal(KEY_FETCH_SIZE, 100); metaColumns = MetaColumn.getMetaColumns(readerConfig.getParameter().getColumn()); + + mongodbConfig = (Map)readerConfig.getParameter().getVal(KEY_MONGODB_CONFIG, new HashMap<>()); + 
mongodbConfig.put(KEY_HOST_PORTS, hostPorts); + mongodbConfig.put(KEY_USERNAME, username); + mongodbConfig.put(KEY_PASSWORD, password); + mongodbConfig.put(KEY_DATABASE, database); } @Override public DataStream readData() { MongodbInputFormatBuilder builder = new MongodbInputFormatBuilder(); - builder.setHostPorts(hostPorts); builder.setUsername(username); builder.setPassword(password); @@ -76,6 +88,8 @@ public DataStream readData() { builder.setCollection(collection); builder.setFilter(filter); builder.setMetaColumns(metaColumns); + builder.setMongodbConfig(mongodbConfig); + builder.setFetchSize(fetchSize); builder.setMonitorUrls(monitorUrls); builder.setBytes(bytes); diff --git a/flinkx-mongodb/flinkx-mongodb-writer/pom.xml b/flinkx-mongodb/flinkx-mongodb-writer/pom.xml index 95035030f1..13ea9450a5 100644 --- a/flinkx-mongodb/flinkx-mongodb-writer/pom.xml +++ b/flinkx-mongodb/flinkx-mongodb-writer/pom.xml @@ -66,7 +66,7 @@ + tofile="${basedir}/../../plugins/mongodbwriter/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-mongodb/flinkx-mongodb-writer/src/main/java/com/dtstack/flinkx/mongodb/writer/MongodbOutputFormat.java b/flinkx-mongodb/flinkx-mongodb-writer/src/main/java/com/dtstack/flinkx/mongodb/writer/MongodbOutputFormat.java index 6d2859902e..9fb3fd6273 100644 --- a/flinkx-mongodb/flinkx-mongodb-writer/src/main/java/com/dtstack/flinkx/mongodb/writer/MongodbOutputFormat.java +++ b/flinkx-mongodb/flinkx-mongodb-writer/src/main/java/com/dtstack/flinkx/mongodb/writer/MongodbOutputFormat.java @@ -23,7 +23,9 @@ import com.dtstack.flinkx.outputformat.RichOutputFormat; import com.dtstack.flinkx.reader.MetaColumn; import com.dtstack.flinkx.writer.WriteMode; +import com.mongodb.MongoClient; import com.mongodb.client.MongoCollection; +import com.mongodb.client.MongoDatabase; import org.apache.commons.lang.StringUtils; import org.apache.flink.configuration.Configuration; import org.apache.flink.types.Row; @@ -31,12 +33,9 @@ import java.io.IOException; import 
java.util.ArrayList; -import java.util.HashMap; import java.util.List; import java.util.Map; -import static com.dtstack.flinkx.mongodb.MongodbConfigKeys.*; - /** * OutputFormat for mongodb writer plugin * @@ -59,26 +58,24 @@ public class MongodbOutputFormat extends RichOutputFormat { protected String replaceKey; - protected String mode = WriteMode.INSERT.getMode(); + protected String mode; private transient MongoCollection collection; + private transient MongoClient client; + + protected Map mongodbConfig; + @Override public void configure(Configuration parameters) { - super.configure(parameters); - Map config = new HashMap<>(4); - config.put(KEY_HOST_PORTS,hostPorts); - config.put(KEY_USERNAME,username); - config.put(KEY_PASSWORD,password); - config.put(KEY_DATABASE,database); - - collection = MongodbUtil.getCollection(config,database,collectionName); } @Override protected void openInternal(int taskNumber, int numTasks) throws IOException { - + client = MongodbUtil.getMongoClient(mongodbConfig); + MongoDatabase db = client.getDatabase(database); + collection = db.getCollection(collectionName); } @Override @@ -119,7 +116,6 @@ protected void writeMultipleRecordsInternal() throws Exception { @Override public void closeInternal() throws IOException { - super.closeInternal(); - MongodbUtil.close(); + MongodbUtil.close(client, null); } } diff --git a/flinkx-mongodb/flinkx-mongodb-writer/src/main/java/com/dtstack/flinkx/mongodb/writer/MongodbOutputFormatBuilder.java b/flinkx-mongodb/flinkx-mongodb-writer/src/main/java/com/dtstack/flinkx/mongodb/writer/MongodbOutputFormatBuilder.java index 3790536adf..e382e72816 100644 --- a/flinkx-mongodb/flinkx-mongodb-writer/src/main/java/com/dtstack/flinkx/mongodb/writer/MongodbOutputFormatBuilder.java +++ b/flinkx-mongodb/flinkx-mongodb-writer/src/main/java/com/dtstack/flinkx/mongodb/writer/MongodbOutputFormatBuilder.java @@ -22,6 +22,7 @@ import com.dtstack.flinkx.reader.MetaColumn; import java.util.List; +import java.util.Map; /** 
@@ -70,6 +71,11 @@ public void setReplaceKey(String replaceKey){ format.replaceKey = replaceKey; } + + public void setMongodbConfig(Map mongodbConfig){ + format.mongodbConfig = mongodbConfig; + } + @Override protected void checkFormat() { if(format.hostPorts == null){ diff --git a/flinkx-mongodb/flinkx-mongodb-writer/src/main/java/com/dtstack/flinkx/mongodb/writer/MongodbWriter.java b/flinkx-mongodb/flinkx-mongodb-writer/src/main/java/com/dtstack/flinkx/mongodb/writer/MongodbWriter.java index 0ea3841899..4cf2068457 100644 --- a/flinkx-mongodb/flinkx-mongodb-writer/src/main/java/com/dtstack/flinkx/mongodb/writer/MongodbWriter.java +++ b/flinkx-mongodb/flinkx-mongodb-writer/src/main/java/com/dtstack/flinkx/mongodb/writer/MongodbWriter.java @@ -22,12 +22,15 @@ import com.dtstack.flinkx.config.WriterConfig; import com.dtstack.flinkx.reader.MetaColumn; import com.dtstack.flinkx.writer.DataWriter; +import com.dtstack.flinkx.writer.WriteMode; import org.apache.flink.streaming.api.datastream.DataStream; import org.apache.flink.streaming.api.datastream.DataStreamSink; import org.apache.flink.streaming.api.functions.sink.OutputFormatSinkFunction; import org.apache.flink.types.Row; +import java.util.HashMap; import java.util.List; +import java.util.Map; import static com.dtstack.flinkx.mongodb.MongodbConfigKeys.*; import static com.dtstack.flinkx.mongodb.MongodbConfigKeys.KEY_COLLECTION; @@ -54,6 +57,8 @@ public class MongodbWriter extends DataWriter { protected String replaceKey; + protected Map mongodbConfig; + public MongodbWriter(DataTransferConfig config) { super(config); @@ -63,10 +68,16 @@ public MongodbWriter(DataTransferConfig config) { password = writerConfig.getParameter().getStringVal(KEY_PASSWORD); database = writerConfig.getParameter().getStringVal(KEY_DATABASE); collection = writerConfig.getParameter().getStringVal(KEY_COLLECTION); - mode = writerConfig.getParameter().getStringVal(KEY_MODE); + mode = writerConfig.getParameter().getStringVal(KEY_MODE, 
WriteMode.INSERT.getMode()); replaceKey = writerConfig.getParameter().getStringVal(KEY_REPLACE_KEY); columns = MetaColumn.getMetaColumns(writerConfig.getParameter().getColumn()); + + mongodbConfig = (Map)writerConfig.getParameter().getVal(KEY_MONGODB_CONFIG, new HashMap<>()); + mongodbConfig.put(KEY_HOST_PORTS, hostPorts); + mongodbConfig.put(KEY_USERNAME, username); + mongodbConfig.put(KEY_PASSWORD, password); + mongodbConfig.put(KEY_DATABASE, database); } @Override @@ -81,6 +92,7 @@ public DataStreamSink writeData(DataStream dataSet) { builder.setMode(mode); builder.setColumns(columns); builder.setReplaceKey(replaceKey); + builder.setMongodbConfig(mongodbConfig); builder.setMonitorUrls(monitorUrls); builder.setErrors(errors); diff --git a/flinkx-mysql/flinkx-mysql-core/src/main/java/com/dtstack/flinkx/mysql/MySqlDatabaseMeta.java b/flinkx-mysql/flinkx-mysql-core/src/main/java/com/dtstack/flinkx/mysql/MySqlDatabaseMeta.java index 48517f2f24..67c2b2bc66 100644 --- a/flinkx-mysql/flinkx-mysql-core/src/main/java/com/dtstack/flinkx/mysql/MySqlDatabaseMeta.java +++ b/flinkx-mysql/flinkx-mysql-core/src/main/java/com/dtstack/flinkx/mysql/MySqlDatabaseMeta.java @@ -114,6 +114,11 @@ public String getSplitFilter(String columnName) { return String.format("%s mod ${N} = ${M}", getStartQuote() + columnName + getEndQuote()); } + @Override + public String getSplitFilterWithTmpTable(String tmpTable, String columnName){ + return String.format("%s.%s mod ${N} = ${M}", tmpTable, getStartQuote() + columnName + getEndQuote()); + } + @Override public String getMultiInsertStatement(List column, String table, int batchSize) { return "INSERT INTO " + quoteTable(table) diff --git a/flinkx-mysql/flinkx-mysql-dreader/pom.xml b/flinkx-mysql/flinkx-mysql-dreader/pom.xml index c15ff9cf26..5298a2fd9a 100644 --- a/flinkx-mysql/flinkx-mysql-dreader/pom.xml +++ b/flinkx-mysql/flinkx-mysql-dreader/pom.xml @@ -68,7 +68,7 @@ + 
tofile="${basedir}/../../plugins/mysqldreader/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-mysql/flinkx-mysql-reader/pom.xml b/flinkx-mysql/flinkx-mysql-reader/pom.xml index 4a73e6b7e9..ee232417c5 100644 --- a/flinkx-mysql/flinkx-mysql-reader/pom.xml +++ b/flinkx-mysql/flinkx-mysql-reader/pom.xml @@ -68,7 +68,7 @@ + tofile="${basedir}/../../plugins/mysqlreader/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-mysql/flinkx-mysql-writer/pom.xml b/flinkx-mysql/flinkx-mysql-writer/pom.xml index f4cb55dac2..e4275ce0e6 100644 --- a/flinkx-mysql/flinkx-mysql-writer/pom.xml +++ b/flinkx-mysql/flinkx-mysql-writer/pom.xml @@ -68,7 +68,7 @@ + tofile="${basedir}/../../plugins/mysqlwriter/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-odps/flinkx-odps-core/src/main/java/com/dtstack/flinkx/odps/OdpsConfigKeys.java b/flinkx-odps/flinkx-odps-core/src/main/java/com/dtstack/flinkx/odps/OdpsConfigKeys.java index 49b68fbdd8..580996b24b 100755 --- a/flinkx-odps/flinkx-odps-core/src/main/java/com/dtstack/flinkx/odps/OdpsConfigKeys.java +++ b/flinkx-odps/flinkx-odps-core/src/main/java/com/dtstack/flinkx/odps/OdpsConfigKeys.java @@ -54,4 +54,6 @@ public class OdpsConfigKeys { public static final String KEY_MODE = "mode"; + public static final String KEY_BUFFER_SIZE = "bufferSize"; + } diff --git a/flinkx-odps/flinkx-odps-core/src/main/java/com/dtstack/flinkx/odps/OdpsUtil.java b/flinkx-odps/flinkx-odps-core/src/main/java/com/dtstack/flinkx/odps/OdpsUtil.java index 1549f5dfce..707b82df7d 100644 --- a/flinkx-odps/flinkx-odps-core/src/main/java/com/dtstack/flinkx/odps/OdpsUtil.java +++ b/flinkx-odps/flinkx-odps-core/src/main/java/com/dtstack/flinkx/odps/OdpsUtil.java @@ -54,7 +54,9 @@ public class OdpsUtil { private static final Logger LOG = LoggerFactory.getLogger(OdpsUtil.class); - public static int MAX_RETRY_TIME = 10; + public static int MAX_RETRY_TIME = 3; + + public static final long BUFFER_SIZE_DEFAULT = 64 * 1024 * 1024; public static Odps initOdps(Map 
odpsConfig) { String odpsServer = odpsConfig.get(OdpsConfigKeys.KEY_ODPS_SERVER); @@ -136,7 +138,7 @@ public static TableTunnel.DownloadSession createMasterSessionForNonPartitionedTa final String projectName, final String tableName) { final TableTunnel tunnel = new TableTunnel(odps); - if (StringUtils.isNoneBlank(tunnelServer)) { + if (StringUtils.isNotEmpty(tunnelServer)) { tunnel.setEndpoint(tunnelServer); } @@ -158,7 +160,7 @@ public static TableTunnel.DownloadSession createMasterSessionForPartitionedTable final String projectName, final String tableName, String partition) { final TableTunnel tunnel = new TableTunnel(odps); - if (StringUtils.isNoneBlank(tunnelServer)) { + if (StringUtils.isNotEmpty(tunnelServer)) { tunnel.setEndpoint(tunnelServer); } @@ -183,7 +185,7 @@ public TableTunnel.DownloadSession call() throws Exception { public static TableTunnel.DownloadSession getSlaveSessionForNonPartitionedTable(Odps odps, final String sessionId, String tunnelServer, final String projectName, final String tableName) { final TableTunnel tunnel = new TableTunnel(odps); - if (StringUtils.isNoneBlank(tunnelServer)) { + if (StringUtils.isNotEmpty(tunnelServer)) { tunnel.setEndpoint(tunnelServer); } @@ -203,7 +205,7 @@ public TableTunnel.DownloadSession call() throws Exception { public static TableTunnel.DownloadSession getSlaveSessionForPartitionedTable(Odps odps, final String sessionId, String tunnelServer, final String projectName, final String tableName, String partition) { final TableTunnel tunnel = new TableTunnel(odps); - if (StringUtils.isNoneBlank(tunnelServer)) { + if (StringUtils.isNotEmpty(tunnelServer)) { tunnel.setEndpoint(tunnelServer); } @@ -429,14 +431,7 @@ public static void truncateNonPartitionedTable(Odps odps, Table tab) { } public static Table getTable(Odps odps, String projectName, String tableName) { - Table table = odps.tables().get(projectName, tableName); -// try { -// table.getOwner(); -// } catch (Exception e) { -// e.printStackTrace(); -// 
throw new RuntimeException(e); -// } - return table; + return odps.tables().get(projectName, tableName); } public static TableTunnel.UploadSession createMasterTunnelUpload(final TableTunnel tunnel, final String projectName, diff --git a/flinkx-odps/flinkx-odps-reader/pom.xml b/flinkx-odps/flinkx-odps-reader/pom.xml index 2b8d9d968a..7fcb63629c 100644 --- a/flinkx-odps/flinkx-odps-reader/pom.xml +++ b/flinkx-odps/flinkx-odps-reader/pom.xml @@ -66,7 +66,7 @@ + tofile="${basedir}/../../plugins/odpsreader/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-odps/flinkx-odps-reader/src/main/java/com/dtstack/flinkx/odps/reader/OdpsInputFormat.java b/flinkx-odps/flinkx-odps-reader/src/main/java/com/dtstack/flinkx/odps/reader/OdpsInputFormat.java index e6179e51dc..0d0651d05a 100644 --- a/flinkx-odps/flinkx-odps-reader/src/main/java/com/dtstack/flinkx/odps/reader/OdpsInputFormat.java +++ b/flinkx-odps/flinkx-odps-reader/src/main/java/com/dtstack/flinkx/odps/reader/OdpsInputFormat.java @@ -186,7 +186,7 @@ public Row nextRecordInternal(Row row) throws IOException { val = metaColumn.getValue(); } - if(val != null){ + if(val != null && val instanceof String){ val = StringUtil.string2col(String.valueOf(val),metaColumn.getType(),metaColumn.getTimeFormat()); } diff --git a/flinkx-odps/flinkx-odps-writer/pom.xml b/flinkx-odps/flinkx-odps-writer/pom.xml index 1e65017236..7b7fc17abc 100644 --- a/flinkx-odps/flinkx-odps-writer/pom.xml +++ b/flinkx-odps/flinkx-odps-writer/pom.xml @@ -79,7 +79,7 @@ + tofile="${basedir}/../../plugins/odpswriter/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-odps/flinkx-odps-writer/src/main/java/com/dtstack/flinkx/odps/writer/OdpsOutputFormat.java b/flinkx-odps/flinkx-odps-writer/src/main/java/com/dtstack/flinkx/odps/writer/OdpsOutputFormat.java index 7da4847f35..1b134a790c 100644 --- a/flinkx-odps/flinkx-odps-writer/src/main/java/com/dtstack/flinkx/odps/writer/OdpsOutputFormat.java +++ 
b/flinkx-odps/flinkx-odps-writer/src/main/java/com/dtstack/flinkx/odps/writer/OdpsOutputFormat.java @@ -20,13 +20,13 @@ import com.aliyun.odps.Odps; import com.aliyun.odps.Table; +import com.aliyun.odps.data.Binary; import com.aliyun.odps.data.Record; import com.aliyun.odps.tunnel.TableTunnel; import com.aliyun.odps.tunnel.TunnelException; import com.aliyun.odps.tunnel.io.TunnelBufferedWriter; import com.dtstack.flinkx.common.ColumnType; import com.dtstack.flinkx.exception.WriteRecordException; -import com.dtstack.flinkx.odps.OdpsConfigKeys; import com.dtstack.flinkx.odps.OdpsUtil; import com.dtstack.flinkx.outputformat.RichOutputFormat; import com.dtstack.flinkx.util.DateUtil; @@ -34,7 +34,6 @@ import org.apache.flink.types.Row; import java.io.IOException; import java.math.BigDecimal; -import java.util.HashMap; import java.util.Map; /** @@ -59,6 +58,8 @@ public class OdpsOutputFormat extends RichOutputFormat { protected Map odpsConfig; + protected long bufferSize; + private transient Odps odps; private transient TableTunnel tunnel; @@ -95,6 +96,7 @@ public void openInternal(int taskNumber, int numTasks) throws IOException { session = OdpsUtil.createMasterTunnelUpload(tunnel, projectName, tableName, partition); try { recordWriter = (TunnelBufferedWriter) session.openBufferedWriter(); + recordWriter.setBufferSize(bufferSize); } catch (TunnelException e) { throw new RuntimeException("can not open record writer"); } @@ -134,12 +136,21 @@ private Record row2record(Row row, String[] columnTypes) throws WriteRecordExcep case BOOLEAN: record.setBoolean(i, Boolean.valueOf(rowData)); break; - case INT: case TINYINT: + record.set(i, Byte.valueOf(rowData)); + break; case SMALLINT: + record.set(i, Short.valueOf(rowData)); + break; + case INT: + record.set(i, Integer.valueOf(rowData)); + break; case BIGINT: record.setBigint(i, Long.valueOf(rowData)); break; + case FLOAT: + record.set(i, Float.valueOf(rowData)); + break; case DOUBLE: record.setDouble(i, Double.valueOf(rowData)); 
break; @@ -151,11 +162,16 @@ private Record row2record(Row row, String[] columnTypes) throws WriteRecordExcep break; case DATE: case DATETIME: + record.set(i, DateUtil.columnToDate(column, null)); + break; case TIMESTAMP: record.setDatetime(i, DateUtil.columnToTimestamp(column,null)); break; + case BINARY: + record.set(i, new Binary(rowData.getBytes())); + break; default: - throw new IllegalArgumentException(); + record.set(i,column); } } @@ -177,7 +193,7 @@ public void closeInternal() throws IOException { try { session.commit(); } catch (TunnelException e) { - e.printStackTrace(); + throw new IOException("commit session error:",e); } } diff --git a/flinkx-odps/flinkx-odps-writer/src/main/java/com/dtstack/flinkx/odps/writer/OdpsOutputFormatBuilder.java b/flinkx-odps/flinkx-odps-writer/src/main/java/com/dtstack/flinkx/odps/writer/OdpsOutputFormatBuilder.java index 682ff86a93..37f6bd36eb 100644 --- a/flinkx-odps/flinkx-odps-writer/src/main/java/com/dtstack/flinkx/odps/writer/OdpsOutputFormatBuilder.java +++ b/flinkx-odps/flinkx-odps-writer/src/main/java/com/dtstack/flinkx/odps/writer/OdpsOutputFormatBuilder.java @@ -67,6 +67,10 @@ public void setWriteMode(String writeMode) { this.format.writeMode = StringUtils.isBlank(writeMode) ? 
"APPEND" : writeMode.toUpperCase(); } + public void setBufferSize(long bufferSize){ + format.bufferSize = bufferSize; + } + @Override protected void checkFormat() { diff --git a/flinkx-odps/flinkx-odps-writer/src/main/java/com/dtstack/flinkx/odps/writer/OdpsWriter.java b/flinkx-odps/flinkx-odps-writer/src/main/java/com/dtstack/flinkx/odps/writer/OdpsWriter.java index b7bdf63f0c..a129691e5b 100644 --- a/flinkx-odps/flinkx-odps-writer/src/main/java/com/dtstack/flinkx/odps/writer/OdpsWriter.java +++ b/flinkx-odps/flinkx-odps-writer/src/main/java/com/dtstack/flinkx/odps/writer/OdpsWriter.java @@ -20,6 +20,7 @@ import com.dtstack.flinkx.config.DataTransferConfig; import com.dtstack.flinkx.config.WriterConfig; import com.dtstack.flinkx.odps.OdpsConfigKeys; +import com.dtstack.flinkx.odps.OdpsUtil; import com.dtstack.flinkx.writer.DataWriter; import org.apache.flink.streaming.api.datastream.DataStream; import org.apache.flink.streaming.api.datastream.DataStreamSink; @@ -48,7 +49,7 @@ public class OdpsWriter extends DataWriter { protected String projectName; - protected String writeMode; + protected long bufferSize; public OdpsWriter(DataTransferConfig config) { super(config); @@ -58,8 +59,13 @@ public OdpsWriter(DataTransferConfig config) { partition = writerConfig.getParameter().getStringVal(OdpsConfigKeys.KEY_PARTITION); mode = writerConfig.getParameter().getStringVal(OdpsConfigKeys.KEY_WRITE_MODE); projectName = writerConfig.getParameter().getStringVal(OdpsConfigKeys.KEY_PROJECT); - writeMode = writerConfig.getParameter().getStringVal(OdpsConfigKeys.KEY_MODE); + bufferSize = writerConfig.getParameter().getLongVal(OdpsConfigKeys.KEY_BUFFER_SIZE, 0); + if (bufferSize == 0){ + bufferSize = OdpsUtil.BUFFER_SIZE_DEFAULT; + } else { + bufferSize = bufferSize * 1024 * 1024; + } List columns = (List) writerConfig.getParameter().getVal(OdpsConfigKeys.KEY_COLUMN_LIST); if(columns != null || columns.size() != 0) { @@ -80,7 +86,7 @@ public DataStreamSink writeData(DataStream 
dataSet) { builder.setPartition(partition); builder.setColumnNames(columnName); builder.setColumnTypes(columnType); - builder.setWriteMode(writeMode); + builder.setWriteMode(mode); builder.setTableName(tableName); builder.setOdpsConfig(odpsConfig); builder.setDirtyPath(dirtyPath); @@ -88,6 +94,7 @@ public DataStreamSink writeData(DataStream dataSet) { builder.setSrcCols(srcCols); builder.setErrorRatio(errorRatio); builder.setErrors(errors); + builder.setBufferSize(bufferSize); OutputFormatSinkFunction sinkFunction = new OutputFormatSinkFunction(builder.finish()); DataStreamSink dataStreamSink = dataSet.addSink(sinkFunction); diff --git a/flinkx-oracle/flinkx-oracle-core/src/main/java/com/dtstack/flinkx/oracle/OracleDatabaseMeta.java b/flinkx-oracle/flinkx-oracle-core/src/main/java/com/dtstack/flinkx/oracle/OracleDatabaseMeta.java index dcd8d25bec..4fecbab2e4 100644 --- a/flinkx-oracle/flinkx-oracle-core/src/main/java/com/dtstack/flinkx/oracle/OracleDatabaseMeta.java +++ b/flinkx-oracle/flinkx-oracle-core/src/main/java/com/dtstack/flinkx/oracle/OracleDatabaseMeta.java @@ -76,6 +76,11 @@ public String getSplitFilter(String columnName) { return String.format("mod(%s, ${N}) = ${M}", getStartQuote() + columnName + getEndQuote()); } + @Override + public String getSplitFilterWithTmpTable(String tmpTable, String columnName) { + return String.format("mod(%s.%s, ${N}) = ${M}", tmpTable, getStartQuote() + columnName + getEndQuote()); + } + @Override protected String makeMultipleValues(int nCols, int batchSize) { String value = makeValues(nCols); diff --git a/flinkx-oracle/flinkx-oracle-reader/pom.xml b/flinkx-oracle/flinkx-oracle-reader/pom.xml index 7b096de29c..675275e8d6 100644 --- a/flinkx-oracle/flinkx-oracle-reader/pom.xml +++ b/flinkx-oracle/flinkx-oracle-reader/pom.xml @@ -68,7 +68,7 @@ + tofile="${basedir}/../../plugins/oraclereader/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-oracle/flinkx-oracle-writer/pom.xml 
b/flinkx-oracle/flinkx-oracle-writer/pom.xml index d717a2094d..eca5ebdd39 100644 --- a/flinkx-oracle/flinkx-oracle-writer/pom.xml +++ b/flinkx-oracle/flinkx-oracle-writer/pom.xml @@ -68,7 +68,7 @@ + tofile="${basedir}/../../plugins/oraclewriter/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-postgresql/flinkx-postgresql-core/src/main/java/com/dtstack/flinkx/postgresql/PostgresqlDatabaseMeta.java b/flinkx-postgresql/flinkx-postgresql-core/src/main/java/com/dtstack/flinkx/postgresql/PostgresqlDatabaseMeta.java index f482c491eb..7b9586af5f 100644 --- a/flinkx-postgresql/flinkx-postgresql-core/src/main/java/com/dtstack/flinkx/postgresql/PostgresqlDatabaseMeta.java +++ b/flinkx-postgresql/flinkx-postgresql-core/src/main/java/com/dtstack/flinkx/postgresql/PostgresqlDatabaseMeta.java @@ -156,6 +156,11 @@ public String getSplitFilter(String columnName) { return String.format(" mod(%s,${N}) = ${M}", getStartQuote() + columnName + getEndQuote()); } + @Override + public String getSplitFilterWithTmpTable(String tmpTable, String columnName) { + return String.format(" mod(%s.%s,${N}) = ${M}", tmpTable, getStartQuote() + columnName + getEndQuote()); + } + @Override public int getFetchSize(){ return 1000; diff --git a/flinkx-postgresql/flinkx-postgresql-reader/pom.xml b/flinkx-postgresql/flinkx-postgresql-reader/pom.xml index 412449f844..46f159e0cf 100644 --- a/flinkx-postgresql/flinkx-postgresql-reader/pom.xml +++ b/flinkx-postgresql/flinkx-postgresql-reader/pom.xml @@ -69,7 +69,7 @@ + tofile="${basedir}/../../plugins/postgresqlreader/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-postgresql/flinkx-postgresql-writer/pom.xml b/flinkx-postgresql/flinkx-postgresql-writer/pom.xml index 5dddde0f65..423f100eb9 100644 --- a/flinkx-postgresql/flinkx-postgresql-writer/pom.xml +++ b/flinkx-postgresql/flinkx-postgresql-writer/pom.xml @@ -68,7 +68,7 @@ + tofile="${basedir}/../../plugins/postgresqlwriter/${project.name}-${git.branch}.jar" /> diff --git 
a/flinkx-rdb/pom.xml b/flinkx-rdb/pom.xml index 993958e2c6..b9e273bde3 100644 --- a/flinkx-rdb/pom.xml +++ b/flinkx-rdb/pom.xml @@ -41,7 +41,7 @@ + tofile="${basedir}/../plugins/common/${project.name}-${git.branch}.jar"/> diff --git a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/DatabaseInterface.java b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/DatabaseInterface.java index d2ff546f20..08a5d10b89 100644 --- a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/DatabaseInterface.java +++ b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/DatabaseInterface.java @@ -67,6 +67,8 @@ public interface DatabaseInterface { String getSplitFilter(String columnName); + String getSplitFilterWithTmpTable(String tmpTable, String columnName); + int getFetchSize(); int getQueryTimeout(); diff --git a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/datareader/JdbcConfigKeys.java b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/datareader/JdbcConfigKeys.java index 4efc0c5ae5..4ecb1f9842 100644 --- a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/datareader/JdbcConfigKeys.java +++ b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/datareader/JdbcConfigKeys.java @@ -38,8 +38,13 @@ public class JdbcConfigKeys { public static final String KEY_QUERY_TIME_OUT = "queryTimeOut"; + public static final String KEY_REQUEST_ACCUMULATOR_INTERVAL = "requestAccumulatorInterval"; + public static final String KEY_INCRE_COLUMN = "increColumn"; public static final String KEY_START_LOCATION = "startLocation"; + public static final String KEY_CUSTOM_SQL = "customSql"; + + public static final String KEY_USE_MAX_FUNC = "useMaxFunc"; } diff --git a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/datareader/JdbcDataReader.java b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/datareader/JdbcDataReader.java index f3c2f6504e..16391e8fbc 100644 --- a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/datareader/JdbcDataReader.java +++ 
b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/datareader/JdbcDataReader.java @@ -27,9 +27,11 @@ import com.dtstack.flinkx.rdb.util.DBUtil; import com.dtstack.flinkx.reader.DataReader; import com.dtstack.flinkx.reader.MetaColumn; +import org.apache.commons.lang3.StringUtils; import org.apache.flink.streaming.api.datastream.DataStream; import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; import org.apache.flink.types.Row; + import java.util.List; /** @@ -66,6 +68,12 @@ public class JdbcDataReader extends DataReader { protected int queryTimeOut; + protected int requestAccumulatorInterval; + + protected boolean useMaxFunc; + + protected String customSql; + public void setDatabaseInterface(DatabaseInterface databaseInterface) { this.databaseInterface = databaseInterface; } @@ -87,9 +95,17 @@ public JdbcDataReader(DataTransferConfig config, StreamExecutionEnvironment env) metaColumns = MetaColumn.getMetaColumns(readerConfig.getParameter().getColumn()); fetchSize = readerConfig.getParameter().getIntVal(JdbcConfigKeys.KEY_FETCH_SIZE,0); queryTimeOut = readerConfig.getParameter().getIntVal(JdbcConfigKeys.KEY_QUERY_TIME_OUT,0); + requestAccumulatorInterval = readerConfig.getParameter().getIntVal(JdbcConfigKeys.KEY_REQUEST_ACCUMULATOR_INTERVAL,2); splitKey = readerConfig.getParameter().getStringVal(JdbcConfigKeys.KEY_SPLIK_KEY); increColumn = readerConfig.getParameter().getStringVal(JdbcConfigKeys.KEY_INCRE_COLUMN); startLocation = readerConfig.getParameter().getStringVal(JdbcConfigKeys.KEY_START_LOCATION,null); + customSql = readerConfig.getParameter().getStringVal(JdbcConfigKeys.KEY_CUSTOM_SQL,null); + useMaxFunc = readerConfig.getParameter().getBooleanVal(JdbcConfigKeys.KEY_USE_MAX_FUNC,true); + + increColumn = StringUtils.isEmpty(increColumn) ? 
null : increColumn; + if(StringUtils.isEmpty(increColumn)){ + useMaxFunc = false; + } } @Override @@ -108,29 +124,36 @@ public DataStream readData() { builder.setMetaColumn(metaColumns); builder.setFetchSize(fetchSize == 0 ? databaseInterface.getFetchSize() : fetchSize); builder.setQueryTimeOut(queryTimeOut == 0 ? databaseInterface.getQueryTimeout() : queryTimeOut); + builder.setRequestAccumulatorInterval(requestAccumulatorInterval); builder.setIncreCol(increColumn); + builder.setIncreColType(getIncrementColType()); builder.setStartLocation(startLocation); + builder.setSplitKey(splitKey); + builder.setNumPartitions(numPartitions); + builder.setUseMaxFunc(useMaxFunc); + builder.setCustomSql(customSql); + builder.setHadoopConfig(hadoopConfig); - boolean isSplitByKey = false; - if(numPartitions > 1 && splitKey != null && splitKey.trim().length() != 0) { - builder.setParameterValues(DBUtil.getParameterValues(numPartitions)); - isSplitByKey = true; - } - if(increColumn != null){ - String increColType = getIncreColType(); - where = DBUtil.buildWhereSql(databaseInterface,increColType,where,increColumn,startLocation); - builder.setIncreColType(increColType); - } + boolean isSplitByKey = numPartitions > 1 && StringUtils.isNotEmpty(splitKey); - String query = DBUtil.getQuerySql(databaseInterface,table,metaColumns,splitKey,where,isSplitByKey); + String query; + if (StringUtils.isNotEmpty(customSql)){ + query = DBUtil.buildQuerySqlWithCustomSql(databaseInterface, customSql, isSplitByKey, splitKey, StringUtils.isNotEmpty(increColumn)); + } else { + query = DBUtil.getQuerySql(databaseInterface, table, metaColumns, splitKey, where, isSplitByKey, StringUtils.isNotEmpty(increColumn)); + } builder.setQuery(query); RichInputFormat format = builder.finish(); return createInput(format, (databaseInterface.getDatabaseType() + "reader").toLowerCase()); } - private String getIncreColType(){ + private String getIncrementColType(){ + if (StringUtils.isEmpty(increColumn)){ + return null; + } 
+ for (MetaColumn metaColumn : metaColumns) { if(metaColumn.getName().equals(increColumn)){ return metaColumn.getType(); diff --git a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/DistributedJdbcInputFormatBuilder.java b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/DistributedJdbcInputFormatBuilder.java index 00ed8f03f2..bfa6225e64 100644 --- a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/DistributedJdbcInputFormatBuilder.java +++ b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/DistributedJdbcInputFormatBuilder.java @@ -23,6 +23,7 @@ import com.dtstack.flinkx.rdb.DatabaseInterface; import com.dtstack.flinkx.rdb.type.TypeConverterInterface; import com.dtstack.flinkx.reader.MetaColumn; +import org.apache.commons.lang.StringUtils; import java.util.List; @@ -122,6 +123,10 @@ protected void checkFormat() { if(!dataSource.getJdbcUrl().startsWith(jdbcPrefix)){ throw new IllegalArgumentException("Multiple data sources must be of the same type"); } + + if (StringUtils.isEmpty(format.splitKey) && format.numPartitions > 1){ + throw new IllegalArgumentException("Must specify the split column when the channel is greater than 1"); + } } } } diff --git a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/JdbcInputFormat.java b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/JdbcInputFormat.java index 94d932f207..76685e9cbb 100644 --- a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/JdbcInputFormat.java +++ b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/JdbcInputFormat.java @@ -28,21 +28,33 @@ import com.dtstack.flinkx.util.ClassUtil; import com.dtstack.flinkx.util.DateUtil; import com.dtstack.flinkx.util.StringUtil; +import com.dtstack.flinkx.util.URLUtil; +import com.google.gson.Gson; import org.apache.commons.lang3.StringUtils; import org.apache.flink.api.common.accumulators.Accumulator; import org.apache.flink.api.common.io.DefaultInputSplitAssigner; import 
org.apache.flink.api.common.io.statistics.BaseStatistics; import org.apache.flink.configuration.Configuration; -import org.apache.flink.core.io.GenericInputSplit; import org.apache.flink.core.io.InputSplit; import org.apache.flink.core.io.InputSplitAssigner; +import org.apache.flink.hadoop.shaded.org.apache.http.impl.client.CloseableHttpClient; +import org.apache.flink.hadoop.shaded.org.apache.http.impl.client.HttpClientBuilder; import org.apache.flink.types.Row; import java.io.IOException; +import java.io.InputStream; +import java.io.InputStreamReader; +import java.io.Reader; import java.sql.*; import java.util.*; import java.util.Date; import com.dtstack.flinkx.inputformat.RichInputFormat; +import org.apache.hadoop.fs.FSDataOutputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.permission.FsPermission; +import org.apache.hadoop.io.IOUtils; +import org.codehaus.jackson.map.ObjectMapper; /** * InputFormat for reading data from a database and generate Rows. 
@@ -80,8 +92,6 @@ public class JdbcInputFormat extends RichInputFormat { protected boolean hasNext; - protected Object[][] parameterValues; - protected int columnCount; protected String table; @@ -96,18 +106,35 @@ public class JdbcInputFormat extends RichInputFormat { protected String startLocation; + protected String splitKey; + private int increColIndex; protected int fetchSize; protected int queryTimeOut; + protected int requestAccumulatorInterval; + + protected boolean useMaxFunc; + + protected int numPartitions; + + protected String customSql; + protected StringAccumulator tableColAccumulator; + protected StringAccumulator maxValueAccumulator; + protected MaximumAccumulator endLocationAccumulator; protected StringAccumulator startLocationAccumulator; + /** + * The hadoop config for metric + */ + protected Map hadoopConfig; + public JdbcInputFormat() { resultSetType = ResultSet.TYPE_FORWARD_ONLY; resultSetConcurrency = ResultSet.CONCUR_READ_ONLY; @@ -118,54 +145,30 @@ public void configure(Configuration configuration) { } - private void setMetric(){ - Map> accumulatorMap = getRuntimeContext().getAllAccumulators(); - - if(!accumulatorMap.containsKey(Metrics.TABLE_COL)){ - tableColAccumulator = new StringAccumulator(); - tableColAccumulator.add(table + "-" + increCol); - getRuntimeContext().addAccumulator(Metrics.TABLE_COL,tableColAccumulator); - } + @Override + public void openInternal(InputSplit inputSplit) throws IOException { + try { + LOG.info(inputSplit.toString()); - if(!accumulatorMap.containsKey(Metrics.END_LOCATION)){ - endLocationAccumulator = new MaximumAccumulator(); - getRuntimeContext().addAccumulator(Metrics.END_LOCATION,endLocationAccumulator); - } + ClassUtil.forName(drivername, getClass().getClassLoader()); - if (startLocation != null){ - endLocationAccumulator.add(startLocation); - if(!accumulatorMap.containsKey(Metrics.START_LOCATION)){ - startLocationAccumulator = new StringAccumulator(); - startLocationAccumulator.add(startLocation); - 
getRuntimeContext().addAccumulator(Metrics.START_LOCATION,startLocationAccumulator); + if (useMaxFunc){ + getMaxValue(inputSplit); } - } - for (int i = 0; i < metaColumns.size(); i++) { - if (metaColumns.get(i).getName().equals(increCol)){ - increColIndex = i; - break; + initMetric(inputSplit); + + if(!canReadData(inputSplit)){ + LOG.warn("Not read data when the start location are equal to end location"); + + hasNext = false; + return; } - } - } - @Override - public void openInternal(InputSplit inputSplit) throws IOException { - try { - ClassUtil.forName(drivername, getClass().getClassLoader()); dbConn = DBUtil.getConnection(dbURL, username, password); dbConn.setAutoCommit(false); - Statement statement = dbConn.createStatement(resultSetType, resultSetConcurrency); - if (inputSplit != null && parameterValues != null) { - String n = parameterValues[inputSplit.getSplitNumber()][0].toString(); - String m = parameterValues[inputSplit.getSplitNumber()][1].toString(); - queryTemplate = queryTemplate.replace("${N}",n).replace("${M}",m); - - LOG.warn(String.format("Executing '%s' with parameters %s", queryTemplate, Arrays.deepToString(parameterValues[inputSplit.getSplitNumber()]))); - } - if(EDatabaseType.MySQL == databaseInterface.getDatabaseType()){ statement.setFetchSize(Integer.MIN_VALUE); } else { @@ -175,17 +178,21 @@ public void openInternal(InputSplit inputSplit) throws IOException { if(EDatabaseType.Carbondata != databaseInterface.getDatabaseType()) { statement.setQueryTimeout(queryTimeOut); } - resultSet = statement.executeQuery(queryTemplate); + + String querySql = buildQuerySql(inputSplit); + resultSet = statement.executeQuery(querySql); columnCount = resultSet.getMetaData().getColumnCount(); hasNext = resultSet.next(); - if(descColumnTypeList == null) { + if (StringUtils.isEmpty(customSql)){ descColumnTypeList = DBUtil.analyzeTable(dbURL, username, password,databaseInterface,table,metaColumns); + } else { + descColumnTypeList = new ArrayList<>(); + for 
(MetaColumn metaColumn : metaColumns) { + descColumnTypeList.add(metaColumn.getName()); + } } - if(increCol != null){ - setMetric(); - } } catch (SQLException se) { throw new IllegalArgumentException("open() failed." + se.getMessage(), se); } @@ -201,23 +208,19 @@ public BaseStatistics getStatistics(BaseStatistics cachedStatistics) throws IOEx @Override public InputSplit[] createInputSplits(int minNumSplits) throws IOException { - if (parameterValues == null) { - return new GenericInputSplit[]{new GenericInputSplit(0, 1)}; - } - GenericInputSplit[] ret = new GenericInputSplit[parameterValues.length]; - for (int i = 0; i < ret.length; i++) { - ret[i] = new GenericInputSplit(i, ret.length); + JdbcInputSplit[] splits = new JdbcInputSplit[minNumSplits]; + for (int i = 0; i < minNumSplits; i++) { + splits[i] = new JdbcInputSplit(i, numPartitions, i, startLocation, null); } - return ret; - } + return splits; + } @Override public InputSplitAssigner getInputSplitAssigner(InputSplit[] inputSplits) { return new DefaultInputSplitAssigner(inputSplits); } - @Override public boolean reachedEnd() throws IOException { return !hasNext; @@ -246,7 +249,7 @@ public Row nextRecordInternal(Row row) throws IOException { } } - if(increCol != null){ + if(increCol != null && !useMaxFunc){ if (ColumnType.isTimeType(increColType)){ Timestamp increVal = resultSet.getTimestamp(increColIndex + 1); if(increVal != null){ @@ -272,6 +275,219 @@ public Row nextRecordInternal(Row row) throws IOException { } } + private void initMetric(InputSplit split){ + + if (StringUtils.isEmpty(increCol)){ + return; + } + + Map> accumulatorMap = getRuntimeContext().getAllAccumulators(); + + if(!accumulatorMap.containsKey(Metrics.TABLE_COL)){ + tableColAccumulator = new StringAccumulator(); + tableColAccumulator.add(table + "-" + increCol); + getRuntimeContext().addAccumulator(Metrics.TABLE_COL,tableColAccumulator); + } + + startLocationAccumulator = new StringAccumulator(); + if (startLocation != null){ + 
startLocationAccumulator.add(startLocation); + } + getRuntimeContext().addAccumulator(Metrics.START_LOCATION,startLocationAccumulator); + + endLocationAccumulator = new MaximumAccumulator(); + String endLocation = ((JdbcInputSplit)split).getEndLocation(); + if(endLocation != null && useMaxFunc){ + endLocationAccumulator.add(endLocation); + } else { + endLocationAccumulator.add(startLocation); + } + getRuntimeContext().addAccumulator(Metrics.END_LOCATION,endLocationAccumulator); + + for (int i = 0; i < metaColumns.size(); i++) { + if (metaColumns.get(i).getName().equals(increCol)){ + increColIndex = i; + break; + } + } + } + + private void getMaxValue(InputSplit inputSplit){ + String maxValue = null; + if (inputSplit.getSplitNumber() == 0){ + maxValue = getMaxValueFromDb(); + maxValueAccumulator = new StringAccumulator(); + maxValueAccumulator.add(maxValue); + getRuntimeContext().addAccumulator(Metrics.MAX_VALUE, maxValueAccumulator); + } else { + if(StringUtils.isEmpty(monitorUrls)){ + return; + } + + try (CloseableHttpClient httpClient = HttpClientBuilder.create().build()) { + + Map<String, String> vars = getRuntimeContext().getMetricGroup().getAllVariables(); + String jobId = vars.get("<job_id>"); + + String[] monitors; + if (monitorUrls.startsWith("http")) { + monitors = new String[]{String.format("%s/jobs/%s/accumulators", monitorUrls, jobId)}; + } else { + monitors = monitorUrls.split(","); + for (int i = 0; i < monitors.length; i++) { + monitors[i] = String.format("http://%s/jobs/%s/accumulators", monitors[i], jobId); + } + } + + /** + * The extra 10 times is to ensure that accumulator is updated + */ + int maxAcquireTimes = (queryTimeOut / requestAccumulatorInterval) + 10; + + int acquireTimes = 0; + while (StringUtils.isEmpty(maxValue) && acquireTimes < maxAcquireTimes) { + try { + Thread.sleep(requestAccumulatorInterval * 1000); + } catch (InterruptedException ignore) { + } + + maxValue = getMaxvalueFromAccumulator(httpClient, monitors); + acquireTimes++; + } + + if 
(StringUtils.isEmpty(maxValue)) { + throw new RuntimeException("Can't get the max value from accumulator"); + } + } catch (IOException e){ + throw new RuntimeException("Can't get the max value from accumulator:" + e); + } + } + + ((JdbcInputSplit) inputSplit).setEndLocation(maxValue); + } + + private String getMaxvalueFromAccumulator(CloseableHttpClient httpClient,String[] monitors){ + String maxValue = null; + Gson gson = new Gson(); + for (String monitor : monitors) { + LOG.info("Request url:" + monitor); + try { + String response = URLUtil.get(httpClient, monitor); + Map map = gson.fromJson(response, Map.class); + + LOG.info("Accumulator data:" + gson.toJson(map)); + + List userTaskAccumulators = (List) map.get("user-task-accumulators"); + for (Map accumulator : userTaskAccumulators) { + if (Metrics.MAX_VALUE.equals(accumulator.get("name"))) { + maxValue = (String) accumulator.get("value"); + break; + } + } + + if (StringUtils.isNotEmpty(maxValue)) { + break; + } + } catch (Exception e) { + LOG.error("Get max value from accumulator error:", e); + } + } + + return maxValue; + } + + private boolean canReadData(InputSplit split){ + if (StringUtils.isEmpty(increCol)){ + return true; + } + + if (!useMaxFunc){ + return true; + } + + JdbcInputSplit jdbcInputSplit = (JdbcInputSplit) split; + return !StringUtils.equals(jdbcInputSplit.getStartLocation(), jdbcInputSplit.getEndLocation()); + } + + private String buildQuerySql(InputSplit inputSplit){ + String querySql = queryTemplate; + + if (inputSplit != null) { + JdbcInputSplit jdbcInputSplit = (JdbcInputSplit) inputSplit; + + if (StringUtils.isNotEmpty(splitKey)){ + querySql = queryTemplate.replace("${N}", String.valueOf(numPartitions)) + .replace("${M}", String.valueOf(jdbcInputSplit.getMod())); + } + + if (StringUtils.isNotEmpty(increCol)){ + String incrementFilter = DBUtil.buildIncrementFilter(databaseInterface, increColType, increCol, + jdbcInputSplit.getStartLocation(), jdbcInputSplit.getEndLocation(), customSql, 
useMaxFunc); + + if(StringUtils.isNotEmpty(incrementFilter)){ + incrementFilter = " and " + incrementFilter; + } + + querySql = querySql.replace(DBUtil.INCREMENT_FILTER_PLACEHOLDER, incrementFilter); + } + } + + LOG.warn(String.format("Executing sql is: '%s'", querySql)); + + return querySql; + } + + private String getMaxValueFromDb() { + String maxValue = null; + Connection conn = null; + Statement st = null; + ResultSet rs = null; + try { + long startTime = System.currentTimeMillis(); + + String queryMaxValueSql; + if (StringUtils.isNotEmpty(customSql)){ + queryMaxValueSql = String.format("select max(%s.%s) as max_value from ( %s ) %s", DBUtil.TEMPORARY_TABLE_NAME, + databaseInterface.quoteColumn(increCol), customSql, DBUtil.TEMPORARY_TABLE_NAME); + } else { + queryMaxValueSql = String.format("select max(%s) as max_value from %s", + databaseInterface.quoteColumn(increCol), databaseInterface.quoteTable(table)); + } + + String startSql = DBUtil.buildStartLocationSql(databaseInterface, increColType, + databaseInterface.quoteColumn(increCol), startLocation, useMaxFunc); + if(StringUtils.isNotEmpty(startSql)){ + queryMaxValueSql += " where " + startSql; + } + + LOG.info(String.format("Query max value sql is '%s'", queryMaxValueSql)); + + conn = DBUtil.getConnection(dbURL, username, password); + st = conn.createStatement(); + rs = st.executeQuery(queryMaxValueSql); + if (rs.next()){ + if (ColumnType.isTimeType(increColType)){ + Timestamp increVal = rs.getTimestamp("max_value"); + if(increVal != null){ + maxValue = String.valueOf(getLocation(increVal)); + } + } else if(ColumnType.isNumberType(increColType)){ + maxValue = String.valueOf(rs.getLong("max_value")); + } else { + maxValue = rs.getString("max_value"); + } + } + + LOG.info(String.format("Takes [%s] milliseconds to get the maximum value [%s]", System.currentTimeMillis() - startTime, maxValue)); + + return maxValue; + } catch (Throwable e){ + throw new RuntimeException("Get max value from " + table + " error",e); 
+ } finally { + DBUtil.closeDBResources(rs,st,conn); + } + } + private long getLocation(Object increVal){ if(increVal instanceof Timestamp){ long time = ((Timestamp)increVal).getTime() / 1000; @@ -291,10 +507,46 @@ private long getLocation(Object increVal){ } } + private void uploadMetricData() throws IOException { + FSDataOutputStream out = null; + try { + org.apache.hadoop.conf.Configuration conf = new org.apache.hadoop.conf.Configuration(); + + if(hadoopConfig != null) { + for (Map.Entry<String, String> entry : hadoopConfig.entrySet()) { + conf.set(entry.getKey(), entry.getValue()); + } + } + + Map<String, String> vars = getRuntimeContext().getMetricGroup().getAllVariables(); + String jobId = vars.get("<job_id>"); + String taskId = vars.get("<task_id>"); + String subtaskIndex = vars.get("<subtask_index>"); + LOG.info("jobId:{} taskId:{} subtaskIndex:{}", jobId, taskId, subtaskIndex); + + Path remotePath = new Path(conf.get("fs.defaultFS"), "/tmp/logs/admin/logs/" + jobId + "/" + taskId + "_" + subtaskIndex); + out = FileSystem.create(remotePath.getFileSystem(conf), remotePath, new FsPermission(FsPermission.createImmutable((short) 0777))); + + Map metrics = new HashMap<>(3); + metrics.put(Metrics.TABLE_COL, table + "-" + increCol); + if (startLocationAccumulator != null){ + metrics.put(Metrics.START_LOCATION, startLocationAccumulator.getLocalValue()); + } + if (endLocationAccumulator != null){ + metrics.put(Metrics.END_LOCATION, endLocationAccumulator.getLocalValue()); + } + out.writeUTF(new ObjectMapper().writeValueAsString(metrics)); + } finally { + IOUtils.closeStream(out); + } + } + @Override public void closeInternal() throws IOException { + if(StringUtils.isNotEmpty(increCol) && hadoopConfig != null) { + uploadMetricData(); + } DBUtil.closeDBResources(resultSet,statement,dbConn); - parameterValues = null; } } \ No newline at end of file diff --git a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/JdbcInputFormatBuilder.java 
b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/JdbcInputFormatBuilder.java index d2abbb4819..8230f47b6d 100644 --- a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/JdbcInputFormatBuilder.java +++ b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/JdbcInputFormatBuilder.java @@ -22,8 +22,10 @@ import com.dtstack.flinkx.rdb.DatabaseInterface; import com.dtstack.flinkx.rdb.type.TypeConverterInterface; import com.dtstack.flinkx.reader.MetaColumn; +import org.apache.commons.lang.StringUtils; import java.util.List; +import java.util.Map; /** * The builder of JdbcInputFormat @@ -31,7 +33,6 @@ * Company: www.dtstack.com * @author huyifan.zju@163.com */ -@Deprecated public class JdbcInputFormatBuilder extends RichInputFormatBuilder { private JdbcInputFormat format; @@ -52,10 +53,6 @@ public void setQuery(String query) { format.queryTemplate = query; } - public void setParameterValues(Object[][] parameterValues) { - format.parameterValues = parameterValues; - } - public void setUsername(String username) { format.username = username; } @@ -88,6 +85,10 @@ public void setQueryTimeOut(int queryTimeOut){ format.queryTimeOut = queryTimeOut; } + public void setRequestAccumulatorInterval(int requestAccumulatorInterval){ + format.requestAccumulatorInterval = requestAccumulatorInterval; + } + public void setIncreCol(String increCol){ format.increCol = increCol; } @@ -96,27 +97,52 @@ public void setStartLocation(String startLocation){ format.startLocation = startLocation; } + public void setSplitKey(String splitKey){ + format.splitKey = splitKey; + } + public void setIncreColType(String increColType){ format.increColType = increColType; } + public void setUseMaxFunc(boolean useMaxFunc){ + format.useMaxFunc = useMaxFunc; + } + + public void setNumPartitions(int numPartitions){ + format.numPartitions = numPartitions; + } + + public void setCustomSql(String customSql){ + format.customSql = customSql; + } + + public void setHadoopConfig(Map 
dirtyHadoopConfig) { + format.hadoopConfig = dirtyHadoopConfig; + } + @Override protected void checkFormat() { + if (format.username == null) { LOG.info("Username was not supplied separately."); } + if (format.password == null) { LOG.info("Password was not supplied separately."); } + if (format.dbURL == null) { throw new IllegalArgumentException("No database URL supplied"); } - if (format.queryTemplate == null) { - throw new IllegalArgumentException("No query supplied"); - } + if (format.drivername == null) { throw new IllegalArgumentException("No driver supplied"); } + + if (StringUtils.isEmpty(format.splitKey) && format.numPartitions > 1){ + throw new IllegalArgumentException("Must specify the split column when the channel is greater than 1"); + } } } diff --git a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/JdbcInputSplit.java b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/JdbcInputSplit.java new file mode 100644 index 0000000000..cc03f46bbf --- /dev/null +++ b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/JdbcInputSplit.java @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dtstack.flinkx.rdb.inputformat; + +import org.apache.flink.core.io.GenericInputSplit; + +/** + * @author jiangbo + * @explanation + * @date 2019/3/6 + */ +public class JdbcInputSplit extends GenericInputSplit { + + private int mod; + + private String endLocation; + + private String startLocation; + + /** + * Creates a generic input split with the given split number. + * + * @param partitionNumber The number of the split's partition. + * @param totalNumberOfPartitions The total number of the splits (partitions). + */ + public JdbcInputSplit(int partitionNumber, int totalNumberOfPartitions, int mod, String startLocation, String endLocation) { + super(partitionNumber, totalNumberOfPartitions); + this.mod = mod; + this.startLocation = startLocation; + this.endLocation = endLocation; + } + + public int getMod() { + return mod; + } + + public String getEndLocation() { + return endLocation; + } + + public String getStartLocation() { + return startLocation; + } + + public void setMod(int mod) { + this.mod = mod; + } + + public void setEndLocation(String endLocation) { + this.endLocation = endLocation; + } + + public void setStartLocation(String startLocation) { + this.startLocation = startLocation; + } + + @Override + public String toString() { + return "JdbcInputSplit{" + + "mod=" + mod + + ", endLocation='" + endLocation + '\'' + + ", startLocation='" + startLocation + '\'' + + '}'; + } +} diff --git a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/MaximumAccumulator.java b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/MaximumAccumulator.java index 0ed159cb62..3b6307a8db 100644 --- a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/MaximumAccumulator.java +++ b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/inputformat/MaximumAccumulator.java @@ -19,6 +19,8 @@ package com.dtstack.flinkx.rdb.inputformat; +import org.apache.commons.lang.StringUtils; +import org.apache.commons.lang3.math.NumberUtils; import 
org.apache.flink.api.common.accumulators.Accumulator; import java.math.BigInteger; @@ -34,13 +36,19 @@ public class MaximumAccumulator implements Accumulator<String, String> { @Override public void add(String value) { + if(StringUtils.isEmpty(value)){ + return; + } + if(localValue == null){ localValue = value; - } else { + } else if(NumberUtils.isNumber(localValue)){ BigInteger newVal = new BigInteger(value); if(newVal.compareTo(new BigInteger(localValue)) > 0){ localValue = value; } + } else { + localValue = localValue.compareTo(value) < 0 ? value : localValue; } } @@ -56,9 +64,22 @@ public void resetLocal() { @Override public void merge(Accumulator<String, String> other) { - BigInteger local = new BigInteger(localValue); - if(local.compareTo(new BigInteger(other.getLocalValue())) < 0){ + if (other == null || StringUtils.isEmpty(other.getLocalValue())){ + return; + } + + if (localValue == null){ localValue = other.getLocalValue(); + return; + } + + if(NumberUtils.isNumber(localValue)){ + BigInteger local = new BigInteger(localValue); + if(local.compareTo(new BigInteger(other.getLocalValue())) < 0){ + localValue = other.getLocalValue(); + } + } else { + localValue = localValue.compareTo(other.getLocalValue()) < 0 ?
other.getLocalValue() : localValue; } } diff --git a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/util/DBUtil.java b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/util/DBUtil.java index 0c9d967881..daaaa923ac 100644 --- a/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/util/DBUtil.java +++ b/flinkx-rdb/src/main/java/com/dtstack/flinkx/rdb/util/DBUtil.java @@ -23,10 +23,7 @@ import com.dtstack.flinkx.rdb.ParameterValuesProvider; import com.dtstack.flinkx.rdb.type.TypeConverterInterface; import com.dtstack.flinkx.reader.MetaColumn; -import com.dtstack.flinkx.util.ClassUtil; -import com.dtstack.flinkx.util.DateUtil; -import com.dtstack.flinkx.util.SysUtil; -import com.dtstack.flinkx.util.TelnetUtil; +import com.dtstack.flinkx.util.*; import org.apache.commons.lang.StringUtils; import org.apache.flink.types.Row; import org.slf4j.Logger; @@ -58,6 +55,12 @@ public class DBUtil { private static int MICRO_LENGTH = 16; private static int NANOS_LENGTH = 19; + public static final String INCREMENT_FILTER_PLACEHOLDER = "${incrementFilter}"; + + public static final String TEMPORARY_TABLE_NAME = "flinkx_tmp"; + + public static final String CUSTOM_SQL_TEMPLATE = "select * from (%s) %s"; + private static Connection getConnectionInternal(String url, String username, String password) throws SQLException { Connection dbConn; synchronized (ClassUtil.lock_str){ @@ -171,7 +174,7 @@ public static void closeDBResources(ResultSet rs, Statement stmt, public static void commit(Connection conn){ try { - if (!conn.getAutoCommit() && !conn.isClosed()){ + if (!conn.isClosed() && !conn.getAutoCommit()){ LOG.info("Start commit connection"); conn.commit(); LOG.info("Commit connection successful"); @@ -193,27 +196,12 @@ public static void executeBatch(Connection dbConn, List<String> sqls) { } stmt.executeBatch(); } catch (SQLException e) { - e.printStackTrace(); + throw new RuntimeException("execute batch sql error", e); } finally { commit(dbConn); } } - public static void executeOneByOne(Connection
dbConn, List<String> sqls) { - if(sqls == null || sqls.size() == 0) { - return; - } - - try { - Statement stmt = dbConn.createStatement(); - for(String sql : sqls) { - stmt.execute(sql); - } - } catch (SQLException e) { - e.printStackTrace(); - } - } - public static Map<String, List<String>> getPrimaryOrUniqueKeys(String table, Connection dbConn) throws SQLException { Map<String, List<String>> keyMap = new HashMap<>(); DatabaseMetaData meta = dbConn.getMetaData(); @@ -377,57 +365,124 @@ public static Object clobToString(Object obj) throws Exception{ return dataStr; } - public static String buildWhereSql(DatabaseInterface databaseInterface,String increColType,String where, - String increCol,String startLocation){ - if (startLocation == null){ - return where; - } + public static String buildIncrementFilter(DatabaseInterface databaseInterface,String increColType,String increCol, + String startLocation,String endLocation, String customSql, boolean useMaxFunc){ + StringBuilder filter = new StringBuilder(); - String increFilter; - String startTimeStr; + if (StringUtils.isNotEmpty(customSql)){ + increCol = String.format("%s.%s", TEMPORARY_TABLE_NAME, databaseInterface.quoteColumn(increCol)); + } - if(ColumnType.isTimeType(increColType) || (databaseInterface.getDatabaseType() == EDatabaseType.SQLServer && ColumnType.NVARCHAR.name().equals(increColType))){ - startTimeStr = getStartTimeStr(databaseInterface.getDatabaseType(),Long.parseLong(startLocation)); + String startFilter = buildStartLocationSql(databaseInterface, increColType, increCol, startLocation, useMaxFunc); + if (StringUtils.isNotEmpty(startFilter)){ + filter.append(startFilter); + } - if (databaseInterface.getDatabaseType() == EDatabaseType.Oracle){ - startTimeStr = String.format("TO_TIMESTAMP('%s','YYYY-MM-DD HH24:MI:SS:FF6')",startTimeStr); + String endFilter = buildEndLocationSql(databaseInterface, increColType, increCol, endLocation); + if (StringUtils.isNotEmpty(endFilter)){ + if (filter.length() > 0){ + filter.append(" and ").append(endFilter); } else {
- startTimeStr = String.format("'%s'",startTimeStr); + filter.append(endFilter); } + } - increFilter = databaseInterface.quoteColumn(increCol) + " > " + startTimeStr; - } else if(ColumnType.isNumberType(increColType)){ - increFilter = databaseInterface.quoteColumn(increCol) + " > " + startLocation; - } else { - startTimeStr = String.format("'%s'",startLocation); - increFilter = databaseInterface.quoteColumn(increCol) + " > " + startTimeStr; + return filter.toString(); + } + + public static String buildStartLocationSql(DatabaseInterface databaseInterface,String incrementColType, + String incrementCol,String startLocation,boolean useMaxFunc){ + if(StringUtils.isEmpty(startLocation)){ + return null; } - if (where == null || where.length() == 0){ - where = increFilter; + String operator = " >= "; + if(!useMaxFunc){ + operator = " > "; + } + + return getLocationSql(databaseInterface, incrementColType, incrementCol, startLocation, operator); + } + + public static String buildEndLocationSql(DatabaseInterface databaseInterface,String incrementColType,String incrementCol, + String endLocation){ + if(StringUtils.isEmpty(endLocation)){ + return null; + } + + return getLocationSql(databaseInterface, incrementColType, incrementCol, endLocation, " < "); + } + + private static String getLocationSql(DatabaseInterface databaseInterface, String incrementColType, String incrementCol, + String endLocation, String operator) { + String endTimeStr; + String endLocationSql; + boolean isTimeType = ColumnType.isTimeType(incrementColType) + || (databaseInterface.getDatabaseType() == EDatabaseType.SQLServer && ColumnType.NVARCHAR.name().equals(incrementColType)); + if(isTimeType){ + endTimeStr = getTimeStr(databaseInterface.getDatabaseType(), Long.parseLong(endLocation), incrementColType); + endLocationSql = incrementCol + operator + endTimeStr; + } else if(ColumnType.isNumberType(incrementColType)){ + endLocationSql = incrementCol + operator + endLocation; } else { - where = where + " and " 
+ increFilter; + endTimeStr = String.format("'%s'",endLocation); + endLocationSql = incrementCol + operator + endTimeStr; } - return where; + return endLocationSql; } - private static String getStartTimeStr(EDatabaseType databaseType,Long startLocation){ - String startTimeStr; + public static String buildWhereSql(String where,String startSql,String endSql){ + StringBuilder whereBuilder = new StringBuilder(); + + if (StringUtils.isNotEmpty(where)){ + whereBuilder.append(where.trim()); + } + + if(StringUtils.isNotEmpty(startSql)){ + if(whereBuilder.toString().length() > 0){ + whereBuilder.append(" and "); + } + whereBuilder.append(startSql); + } + + if(StringUtils.isNotEmpty(endSql)){ + if(whereBuilder.toString().length() > 0){ + whereBuilder.append(" and "); + } + whereBuilder.append(endSql); + } + + return whereBuilder.toString(); + } + + private static String getTimeStr(EDatabaseType databaseType,Long startLocation,String incrementColType){ + String timeStr; Timestamp ts = new Timestamp(getMillis(startLocation)); ts.setNanos(getNanos(startLocation)); - startTimeStr = getNanosTimeStr(ts.toString()); + timeStr = getNanosTimeStr(ts.toString()); if(databaseType == EDatabaseType.SQLServer){ - startTimeStr = startTimeStr.substring(0,23); + timeStr = timeStr.substring(0,23); + } else { + timeStr = timeStr.substring(0,26); + } + + if (databaseType == EDatabaseType.Oracle){ + if(ColumnType.TIMESTAMP.name().equals(incrementColType)){ + timeStr = String.format("TO_TIMESTAMP('%s','YYYY-MM-DD HH24:MI:SS:FF6')",timeStr); + } else { + timeStr = timeStr.substring(0, 19); + timeStr = String.format("TO_DATE('%s','YYYY-MM-DD HH24:MI:SS')", timeStr); + } } else { - startTimeStr = startTimeStr.substring(0,26); + timeStr = String.format("'%s'",timeStr); } - return startTimeStr; + return timeStr; } - public static String getNanosTimeStr(String timeStr){ + private static String getNanosTimeStr(String timeStr){ if(timeStr.length() < 29){ timeStr += StringUtils.repeat("0",29 - 
timeStr.length()); } @@ -435,7 +490,7 @@ public static String getNanosTimeStr(String timeStr){ return timeStr; } - public static int getNanos(long startLocation){ + private static int getNanos(long startLocation){ String timeStr = String.valueOf(startLocation); int nanos; if (timeStr.length() == SECOND_LENGTH){ @@ -453,7 +508,7 @@ public static int getNanos(long startLocation){ return nanos; } - public static long getMillis(long startLocation){ + private static long getMillis(long startLocation){ String timeStr = String.valueOf(startLocation); long millisSecond; if (timeStr.length() == SECOND_LENGTH){ @@ -471,8 +526,30 @@ public static long getMillis(long startLocation){ return millisSecond; } + public static String buildQuerySqlWithCustomSql(DatabaseInterface databaseInterface,String customSql, + boolean isSplitByKey,String splitKey,boolean increment){ + StringBuilder querySql = new StringBuilder(); + querySql.append(String.format(CUSTOM_SQL_TEMPLATE, customSql, TEMPORARY_TABLE_NAME)); + querySql.append(" WHERE 1=1 "); + + if (isSplitByKey){ + querySql.append(" And ").append(databaseInterface.getSplitFilterWithTmpTable(TEMPORARY_TABLE_NAME, splitKey)); + } + + if (increment){ + querySql.append(" ").append(INCREMENT_FILTER_PLACEHOLDER); + } + + return querySql.toString(); + } + public static String getQuerySql(DatabaseInterface databaseInterface,String table,List<MetaColumn> metaColumns, - String splitKey,String where,boolean isSplitByKey) { + String splitKey,String customFilter,boolean isSplitByKey){ + return getQuerySql(databaseInterface, table, metaColumns, splitKey, customFilter, isSplitByKey, false); + } + + public static String getQuerySql(DatabaseInterface databaseInterface,String table,List<MetaColumn> metaColumns, + String splitKey,String customFilter,boolean isSplitByKey,boolean increment) { StringBuilder sb = new StringBuilder(); List selectColumns = new ArrayList<>(); @@ -490,22 +567,27 @@ public static String getQuerySql(DatabaseInterface databaseInterface,String tabl
sb.append("SELECT ").append(StringUtils.join(selectColumns,",")).append(" FROM "); sb.append(databaseInterface.quoteTable(table)); + sb.append(" WHERE 1=1 "); StringBuilder filter = new StringBuilder(); if(isSplitByKey) { - filter.append(databaseInterface.getSplitFilter(splitKey)); + filter.append(" AND ").append(databaseInterface.getSplitFilter(splitKey)); } - if(where != null && where.trim().length() != 0) { - if(filter.length() > 0) { - filter.append(" AND "); + if (customFilter != null){ + customFilter = customFilter.trim(); + if (customFilter.length() > 0){ + filter.append(" AND ").append(customFilter); } - filter.append(where); } - if(filter.length() != 0) { - sb.append(" WHERE ").append(filter); + if (increment){ + filter.append(" ").append(INCREMENT_FILTER_PLACEHOLDER); + } + + if(filter.length() > 0) { + sb.append(filter); } return sb.toString(); diff --git a/flinkx-redis/flinkx-redis-writer/pom.xml b/flinkx-redis/flinkx-redis-writer/pom.xml index 233c3b032a..0417a7cd91 100644 --- a/flinkx-redis/flinkx-redis-writer/pom.xml +++ b/flinkx-redis/flinkx-redis-writer/pom.xml @@ -68,7 +68,7 @@ + tofile="${basedir}/../../plugins/rediswriter/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-sqlserver/flinkx-sqlserver-core/src/main/java/com/dtstack/flinkx/sqlserver/SqlServerDatabaseMeta.java b/flinkx-sqlserver/flinkx-sqlserver-core/src/main/java/com/dtstack/flinkx/sqlserver/SqlServerDatabaseMeta.java index d8b6c5aea2..7e6425ac56 100644 --- a/flinkx-sqlserver/flinkx-sqlserver-core/src/main/java/com/dtstack/flinkx/sqlserver/SqlServerDatabaseMeta.java +++ b/flinkx-sqlserver/flinkx-sqlserver-core/src/main/java/com/dtstack/flinkx/sqlserver/SqlServerDatabaseMeta.java @@ -64,6 +64,11 @@ public String getSplitFilter(String columnName) { return String.format("%s %% ${N} = ${M}", getStartQuote() + columnName + getEndQuote()); } + @Override + public String getSplitFilterWithTmpTable(String tmpTable, String columnName) { + return String.format("%s.%s %% ${N} = ${M}", 
tmpTable, getStartQuote() + columnName + getEndQuote()); + } + @Override protected String makeMultipleValues(int nCols, int batchSize) { String value = makeValues(nCols); diff --git a/flinkx-sqlserver/flinkx-sqlserver-reader/pom.xml b/flinkx-sqlserver/flinkx-sqlserver-reader/pom.xml index c2420e7b74..5c840aad11 100644 --- a/flinkx-sqlserver/flinkx-sqlserver-reader/pom.xml +++ b/flinkx-sqlserver/flinkx-sqlserver-reader/pom.xml @@ -68,7 +68,7 @@ + tofile="${basedir}/../../plugins/sqlserverreader/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-sqlserver/flinkx-sqlserver-writer/pom.xml b/flinkx-sqlserver/flinkx-sqlserver-writer/pom.xml index 95afbc5477..dfd68e834a 100644 --- a/flinkx-sqlserver/flinkx-sqlserver-writer/pom.xml +++ b/flinkx-sqlserver/flinkx-sqlserver-writer/pom.xml @@ -68,7 +68,7 @@ + tofile="${basedir}/../../plugins/sqlserverwriter/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-stream/flinkx-stream-reader/pom.xml b/flinkx-stream/flinkx-stream-reader/pom.xml index 888220e57b..cf0954e678 100644 --- a/flinkx-stream/flinkx-stream-reader/pom.xml +++ b/flinkx-stream/flinkx-stream-reader/pom.xml @@ -68,7 +68,7 @@ + tofile="${basedir}/../../plugins/streamreader/${project.name}-${git.branch}.jar" /> diff --git a/flinkx-stream/flinkx-stream-writer/pom.xml b/flinkx-stream/flinkx-stream-writer/pom.xml index e524de8b97..57065c8e22 100644 --- a/flinkx-stream/flinkx-stream-writer/pom.xml +++ b/flinkx-stream/flinkx-stream-writer/pom.xml @@ -64,7 +64,7 @@ + tofile="${basedir}/../../plugins/streamwriter/${project.name}-${git.branch}.jar" /> diff --git a/pom.xml b/pom.xml index 32f7a14d6b..e7d1b578e0 100644 --- a/pom.xml +++ b/pom.xml @@ -70,6 +70,28 @@ flinkx-java-docs + + pl.project13.maven + git-commit-id-plugin + 2.2.6 + + + + revision + + + + + yyyy.MM.dd HH:mm:ss + true + true + + false + -dirty + false + + +