直播电商数据仓库
一. 数仓及其维度
什么是数仓
从逻辑上理解,数据库和数仓没有区别,都是通过数据库软件实现存放数据的地方,
但当数据量几何式增长,需要跨机器整合时,数仓就是非常必要的了。
数仓的特点
- 集成性
数仓中存储的数据来源于多个数据源,原始数据在不同数据源中的存储方式各不相同。
要整合成为最终的数据集合,需要从数据源经过一系列抽取、清洗、转换的过程。
- 稳定性
数仓中保存的数据是历史记录,不允许被修改。用户只能通过分析工具进行查询和分析。
- 动态性
数仓的数据会随时间变化而定期更新,这里的定期更新不是指修改数据,
一般是将业务系统发生变化的数据定期同步到数仓,和稳定性不冲突。
- 主题性
传统数据库对应的业务不同,数仓需要根据需求,将不同数据源的数据进行整合,
即数据一般都围绕某一业务主题进行建模。例如“贷款”主题、“存款”主题等。
之所以大型数据仓库系统架构设计复杂,是因为考虑到了未来3-5年的扩展性,
这样的话,未来不用花太大的精力去重建数据仓库系统,就能很稳定运行。
主要体现在数据建模的合理性,数据仓库方案中多出一些中间层,
使海量数据流有足够的缓冲,不至于数据量大很多,就运行不起来了。
数仓的数据来源
- 业务数据
比如用户在电商网站中登录、下单、支付等过程中,需要和网站后台数据库进行增删改查交互,
产生的数据就是业务数据。业务数据通常存储在MySQL、Oracle等数据库中。
用户在使用产品过程中,通过埋点收集与客户端产品交互过程中产生的数据,并发往日志服务器进行保存。 比如页面浏览、点击、停留、评论、点赞、收藏等。用户行为数据通常存储在日志文件中。
- 第三方数据
- 爬虫数据
通常通过技术手段获取其他公司网站的数据。不建议这样去做。
什么是数仓的维度,为何要分层
1. 减少重复开发,在数据开发的过程中可以产生中间层,将公共逻辑下沉,减少重复计算;
2. 清晰数据结构,每个分层分工明确,方便开发人员理解;
3. 方便定位问题,通过分层了解数据血缘关系,在出问题的时候通过回溯定位问题;
4. 简单化复杂问题,和分治法思想类似,分而治之,将复杂的问题简单化,还能解耦
数仓的维度建模和关系型建模的区别
数仓的关系型建模
数仓的维度型建模
维度模型面向业务,将业务用事实表和维度表呈现出来,表结构简单,故查询简单,查询效率较高。
二. 典型维度建模
分层
不同的企业不同的业务类型服务不同的用户,没有定论说分3、4层还是5层,只有最合适的分层。
一般的数仓包含:离线数仓、准实时、实时数仓;我们这里是采用T+1离线数仓,分为五层架构。
典型分层
原始数据,此层保存最原始数据,并且备份,备份时可以压缩。
数据清洗,脱敏,维度退化。
和DWS同,不过DWS层单位是日,而DWT是对截止到当日、或者近7日、近30日的汇总。
三. 每层具体介绍
ODS
任务
数据来源:
由Kafka和flume采集到HDFS上;前端埋点日志以JSON格式形式存在,
- 业务系统数据
由mysql
数据库数据通过sqoop
这种数据同步工具,采集到HDFS上;
DWD
DWD层是对事实表的处理,代表的是业务的最小粒度层。任何数据的记录都可以从这一层获取,为后续的DWS和DWT层做准备。还有对日志行为进行解析,以及对业务数据采用维度模型的方式重新建模(维度退化)。
DWD主要完成:
清洗的手段包括sql、mr、rdd、kettle、python等等。
清洗掉数据不能太多也不能很少。合理范围:1 万条数据清洗掉 1 条。
DWS
DWS层就是关于各个主题的加工和使用,这层是宽表聚合值,是各个事实表的聚合值。这里做轻度的汇总会让以后的计算更加的高效,如:统计各个主题对象计算7天、30天、90天的行为, 应对特殊需求(例如,购买行为,统计商品复购率)会快很多不必走ODS层反复拿数据做加工。
这层会把每个用户单日的行为聚合起来组成一张多列宽表,以便之后关联用户维度信息后进行,不同角度的统计分析。
涉及的主题包括:访客主题、用户主题、商品主题、优惠券主题、活动主题、地区主题等,更详细可参考【6】.
DWT
这层涉及的主题和DWS层一样包括:访客主题、用户主题、商品主题、优惠券主题、活动主题、地区主题等。只不过DWS层的粒度是对当日用户汇总信息,而DWT层是对截止到当日、或者近7日、近30日等的汇总信息。
以用户主题这个来举列:
*DWS层:用户主题层是记录某一个用户在某一天的汇总行为。
*DWT层:用户主题层是记录某一个用户截止在当日的汇总行为。
ADS
ADS层数据是专门给业务使用的数据层,这层是面向业务定制的应用数据层。
ADS主要完成:
(2)提供给数据产品和数据分析使用的数据,一般会存放在 ES、MySQL等系统中供线上系统使用,也可能会存在 Hive 或者 Druid 中供数据分析和数据挖掘使用。如报表数据,或者说那种大宽表。
直播电商用户行为分析
业务数据库
数据建模
drop database if exists livestreaming_commerce;
-- 创建数据库
create database livestreaming_commerce default character set utf8mb3 collate utf8mb3_bin;
-- 使用数据库
use livestreaming_commerce;
-- 地区表
create table region(
region_code int not null comment '地区代码' primary key,
region_name varchar(10) null comment '地区名称',
parent_code int null comment '当前地区的上一级地区代码'
) comment '中国地区表' charset = utf8mb3;
create table region_district(
region_code int not null comment '地区代码' primary key,
region_name varchar(10) null comment '地区名称',
parent_code int null comment '当前地区的上一级地区代码'
) comment '中国区县表' charset = utf8mb3;
insert into region_district (region_code,region_name,parent_code)
select region_code,region_name,parent_code from region where region_code > 100000;
-- 数据建模
-- 用户表(user)
CREATE TABLE user (
user_id INT COMMENT '用户ID' AUTO_INCREMENT,
username VARCHAR(50) COMMENT '用户名',
gender VARCHAR(10) COMMENT '性别',
age INT COMMENT '年龄',
location VARCHAR(50) COMMENT '地点',
register_time DATETIME COMMENT '注册时间',
PRIMARY KEY (user_id)
);
-- 视频表(video)
CREATE TABLE video (
video_id INT COMMENT '视频ID' AUTO_INCREMENT,
user_id INT COMMENT '用户ID',
title VARCHAR(100) COMMENT '标题',
description TEXT COMMENT '描述',
duration INT COMMENT '时长',
upload_time DATETIME COMMENT '上传时间',
FOREIGN KEY (user_id) REFERENCES user(user_id),
PRIMARY KEY (video_id)
);
-- 直播间表(live_room)
CREATE TABLE live_room (
room_id INT COMMENT '直播间ID' AUTO_INCREMENT,
user_id INT COMMENT '用户ID',
title VARCHAR(100) COMMENT '标题',
description TEXT COMMENT '描述',
start_time DATETIME COMMENT '开始时间',
end_time DATETIME COMMENT '结束时间',
FOREIGN KEY (user_id) REFERENCES user(user_id),
PRIMARY KEY (room_id)
);
-- 商品表(product)
CREATE TABLE product (
product_id INT COMMENT '商品ID' AUTO_INCREMENT,
seller_id INT COMMENT '卖家ID',
name VARCHAR(100) COMMENT '名称',
description TEXT COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time DATETIME COMMENT '创建时间',
FOREIGN KEY (seller_id) REFERENCES user(user_id),
PRIMARY KEY (product_id)
);
-- 商品订单表(product_order)
CREATE TABLE product_order (
order_id INT COMMENT '订单ID' AUTO_INCREMENT,
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time DATETIME COMMENT '下单时间',
FOREIGN KEY (buyer_id) REFERENCES user(user_id),
FOREIGN KEY (seller_id) REFERENCES user(user_id),
FOREIGN KEY (product_id) REFERENCES product(product_id),
PRIMARY KEY (order_id)
);
-- 直播间订单表(live_order)
CREATE TABLE live_order (
order_id INT COMMENT '订单ID' AUTO_INCREMENT,
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time DATETIME COMMENT '下单时间',
FOREIGN KEY (buyer_id) REFERENCES user(user_id),
FOREIGN KEY (seller_id) REFERENCES user(user_id),
FOREIGN KEY (room_id) REFERENCES live_room(room_id),
FOREIGN KEY (product_id) REFERENCES product(product_id),
PRIMARY KEY (order_id)
);
-- 评论表(comment)
CREATE TABLE `comment` (
comment_id INT COMMENT '评论ID' AUTO_INCREMENT,
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content TEXT COMMENT '内容',
comment_time DATETIME COMMENT '评论时间',
FOREIGN KEY (user_id) REFERENCES user(user_id),
FOREIGN KEY (video_id) REFERENCES video(video_id),
FOREIGN KEY (live_room_id) REFERENCES live_room(room_id),
PRIMARY KEY (comment_id)
);
-- 退货表(return_order)
CREATE TABLE return_order (
return_id INT COMMENT '退货ID' AUTO_INCREMENT,
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason TEXT COMMENT '原因',
return_time DATETIME COMMENT '退货时间',
FOREIGN KEY (buyer_id) REFERENCES user(user_id),
FOREIGN KEY (seller_id) REFERENCES user(user_id),
FOREIGN KEY (product_id) REFERENCES product(product_id),
FOREIGN KEY (order_id) REFERENCES product_order(order_id),
PRIMARY KEY (return_id)
);
-- 举报表(report)
CREATE TABLE report (
report_id INT COMMENT '举报ID' AUTO_INCREMENT,
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason TEXT COMMENT '原因',
report_time DATETIME COMMENT '举报时间',
FOREIGN KEY (user_id) REFERENCES user(user_id),
FOREIGN KEY (live_room_id) REFERENCES live_room(room_id),
PRIMARY KEY (report_id)
);
-- 纠纷表(dispute)
CREATE TABLE dispute (
dispute_id INT COMMENT '纠纷ID' AUTO_INCREMENT,
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason TEXT COMMENT '原因',
dispute_time DATETIME COMMENT '纠纷时间',
FOREIGN KEY (buyer_id) REFERENCES user(user_id),
FOREIGN KEY (seller_id) REFERENCES user(user_id),
FOREIGN KEY (order_id) REFERENCES product_order(order_id),
PRIMARY KEY (dispute_id)
);
业务数据
mysql livestreaming_commerce -h spark03 -uroot -p < warehouse/shell/sql/mysql/livestreaming_commerce_mysql.sql
数据仓库数据库
-
-- 删除数据库 drop database if exists ods_lc; drop database if exists dwd_lc; drop database if exists dwt_lc; drop database if exists dws_lc; drop database if exists ads_lc; drop database if exists app_lc; -- 创建数据库 create database if not exists ods_lc location '/data/ods/livestreaming_commerce'; create database if not exists dwd_lc location '/data/dwd/livestreaming_commerce'; create database if not exists dwt_lc location '/data/dwt/livestreaming_commerce'; create database if not exists dws_lc location '/data/dwt/livestreaming_commerce'; create database if not exists ads_lc location '/data/dws/livestreaming_commerce'; create database if not exists app_lc location '/data/ads/livestreaming_commerce';
-
hive -f warehouse/shell/hive/create_db.sql
ODS层
数据建模
-- ODS层
-- 公共数据地区表
-- 地区表
create external table ods_lc.region(
region_code string comment '地区代码',
region_name string comment '地区名称',
parent_code string comment '当前地区的上一级地区代码'
) comment '中国地区表'
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/public/region';
create external table ods_lc.region_district(
region_code string comment '地区代码',
region_name string comment '地区名称',
parent_code string comment '当前地区的上一级地区代码'
) comment '中国区县表'
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/public/region_district';
-- 抖音数据存放在 /ods/douyin 目录, 数据格式:867,3,汽车,用车知识,206,2022-01-01 23:18:38
-- 数据建模
-- 用户表(user)
CREATE external TABLE ods_lc.douyin_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time timestamp COMMENT '注册时间'
) comment '用户表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/user';
-- 视频表(video)
CREATE external TABLE ods_lc.douyin_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/video';
-- 直播间表(live_room)
CREATE external TABLE ods_lc.douyin_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/live_room';
-- 商品表(product)
CREATE external TABLE ods_lc.douyin_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/product';
-- 商品订单表(product_order)
CREATE external TABLE ods_lc.douyin_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE ods_lc.douyin_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/live_order';
-- 评论表(comment)
CREATE external TABLE ods_lc.douyin_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/comment';
-- 退货表(return_order)
CREATE external TABLE ods_lc.douyin_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/return_order';
-- 投诉表(report)
CREATE external TABLE ods_lc.douyin_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/report';
-- 纠纷表(dispute)
CREATE external TABLE ods_lc.douyin_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/dispute';
-- 快手数据存放在 /ods/kuaishou 目录,数据格式:{"videoId":1,"userId":494,"title":"动物","description":"野生动物","duration":59,"uploadTime":"2022-01-13 05:25:12"}
-- 数据建模
-- 用户表(user)
CREATE external TABLE ods_lc.kuaishou_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time timestamp COMMENT '注册时间'
) comment '用户表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/user';
-- WITH SERDEPROPERTIES ( "mapping.ts" = "timestamp" )
-- ALTER TABLE user SET SERDEPROPERTIES ( "ignore.malformed.json" = "true" );
-- 视频表(video)
CREATE external TABLE ods_lc.kuaishou_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/video';
-- 直播间表(live_room)
CREATE external TABLE ods_lc.kuaishou_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/live_room';
-- 商品表(product)
CREATE external TABLE ods_lc.kuaishou_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/product';
-- 商品订单表(product_order)
CREATE external TABLE ods_lc.kuaishou_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE ods_lc.kuaishou_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/live_order';
-- 评论表(comment)
CREATE external TABLE ods_lc.kuaishou_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/comment';
-- 退货表(return_order)
CREATE external TABLE ods_lc.kuaishou_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/return_order';
-- 投诉表(report)
CREATE external TABLE ods_lc.kuaishou_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/report';
-- 纠纷表(dispute)
CREATE external TABLE ods_lc.kuaishou_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/dispute';
-- 小红书数据存放在 /ods/kuaishou 目录,数据格式:{"data:[{"videoId":1,"userId":494,"title":"动物","description":"野生动物","duration":59,"uploadTime":"2022-01-13 05:25:12"}]"}
create external table ods_lc.xiaohongshu_user(info string) comment '用户表' location '/ods/xiaohongshu/user';
create external table ods_lc.xiaohongshu_video(info string) comment '视频表' location '/ods/xiaohongshu/video';
create external table ods_lc.xiaohongshu_live_room(info string) comment '直播间' location '/ods/xiaohongshu/live_room';
create external table ods_lc.xiaohongshu_product(info string) comment '商品表' location '/ods/xiaohongshu/product';
create external table ods_lc.xiaohongshu_prouct_order(info string) comment '商品订单表' location '/ods/xiaohongshu/product_order';
create external table ods_lc.xiaohongshu_live_order(info string) comment '直播间订单表' location '/ods/xiaohongshu/live_order';
create external table ods_lc.xiaohongshu_comment(info string) comment '评论表' location '/ods/xiaohongshu/comment';
create external table ods_lc.xiaohongshu_return_order(info string) comment '退货表' location '/ods/xiaohongshu/return_order';
create external table ods_lc.xiaohongshu_report(info string) comment '投诉表' location '/ods/xiaohongshu/report';
create external table ods_lc.xiaohongshu_dispute(info string) comment '纠纷表' location '/ods/xiaohongshu/dispute';
抖音
导入数据
方法二:springboot + quarz + hdfs | hive
方法三:springboot + spark + quarz + hdfs | hive
编写sqoop基础脚本
vim warehouse/shell/sqoop/sqoop_collect_data_util.sh
if [$# != 2]
then
echo "参数异常:sqoop_collect_data.util.sh <sql> <hdfs_path>"
exit 33
fi
# 第一个参数为sql语句
sql=$1
# 第二个参数为hdfs存储路径
hdfs_path=$2
sqoop import
--username root
--password 'Lihaozhe!!@@1122'
--connect jdbc:mysql://spark03:3306/livestreaming_commerce?characterEncoding=UTF-8
--input-fields-terminated-by ','
--input-lines-terminated-by 'n'
--num-mappers 1
--target-dir "${hdfs_path}"
--delete-target-dir
--query "${sql}"' and $CONDITIONS'
--null-string '\N'
--null-non-string '\N'
编写全量数据采集脚本
vim warehouse/shell/sqoop/collect_data_full.sh
#!/bin/bash
# 全量数据采集
# 当调用该脚本的时候没有传递参数 z=z 成立 则获取当前日期前一天赋值给变量dt
# 当调用该脚本的时候传递参数 z=z 不成立 则将第一个参数赋值给变量dt
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 day ago"`
else
dt=$1
fi
# sql语句
region_sql="select region_code, region_name, parent_code from region where 1=1"
region_district_sql="select region_code, region_name, parent_code from region_district where 1=1"
# 路径前缀
path_prefix="/ods/public"
# HDFS存储路径
region_path="${path_prefix}/region/${dt}"
region_district_path="${path_prefix}/region_district/${dt}"
# 采集数据
echo "开始采集..."
echo "采集数据表 region 数据..."
bash sqoop_collect_data_util.sh "${region_sql}" "${region_path}"
echo "采集数据表 region_district 数据..."
bash sqoop_collect_data_util.sh "${region_district_sql}" "${region_district_path}"
echo "结束采集..."
bash warehouse/shell/sqoop/collect_data_full.sh
编写增量数据采集脚本
vim warehouse/shell/sqoop/collect_data_incrment.sh
#!/bin/bash
# 曾量数据采集 每天凌晨之后 一个固定时间段执行一次 采集前一天的数据
# 当调用该脚本的时候没有传递参数 z=z 成立 则获取当前日期前一天赋值给变量dt
# 当调用该脚本的时候传递参数 z=z 不成立 则将第一个参数赋值给变量dt
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 day ago"`
pt1=`date +%Y --date="${dt}"`
pt2=`date +%m --date="${dt}"`
pt3=`date +%d --date="${dt}"`
else
dt=$1
pt1=`date +%Y --date="$1"`
pt2=`date +%m --date="$1"`
pt3=`date +%d --date="$1"`
fi
# 转换日期格式,20260101 改为 2026-01-01 sql查询日期区间
dt_new=`date +%Y-%m-%d --date="${dt}"`
# sql语句
user_sql="select user_id, username, gender, age, location, date_format(register_time,'%Y-%m-%d %H:%i:%s') from user where register_time between '${dt_new} 00:00:00' and '${dt_new} 23:59:59' and 1=1"
video_sql="select video_id, user_id, title, description, duration, date_format(upload_time,'%Y-%m-%d %H:%i:%s') from video where upload_time between '${dt_new} 00:00:00' and '${dt_new} 23:59:59' and 1=1"
live_room_sql="select room_id, user_id, title, description
, date_format(start_time,'%Y-%m-%d %H:%i:%s')
, date_format(end_time,'%Y-%m-%d %H:%i:%s') from live_room
where end_time between '${dt_new} 00:00:00' and '${dt_new} 23:59:59' and 1=1"
product_sql="select product_id, seller_id, name, description, price, date_format(create_time,'%Y-%m-%d %H:%i:%s') from product where create_time between '${dt_new} 00:00:00' and '${dt_new} 23:59:59' and 1=1"
product_order_sql="select order_id, buyer_id, seller_id, product_id, quantity, price, date_format(order_time,'%Y-%m-%d %H:%i:%s') from product_order where order_time between '${dt_new} 00:00:00' and '${dt_new} 23:59:59' and 1=1"
# 路径前缀
path_prefix="/ods/douyin"
# HDFS存储路径
user_path="${path_prefix}/user/${pt1}/${pt2}/${pt3}"
video_path="${path_prefix}/video/${pt1}/${pt2}/${pt3}"
live_room_path="${path_prefix}/live_room/${pt1}/${pt2}/${pt3}"
product_path="${path_prefix}/product/${pt1}/${pt2}/${pt3}"
product_order_path="${path_prefix}/product_order/${pt1}/${pt2}/${pt3}"
# 采集数据
echo "开始采集..."
echo "采集数据表 user 数据..."
bash sqoop_collect_data_util.sh "${user_sql}" "${user_path}"
echo "采集数据表 video 数据..."
bash sqoop_collect_data_util.sh "${video_sql}" "${video_path}"
echo "采集数据表 live_room 数据..."
bash sqoop_collect_data_util.sh "${live_room_sql}" "${live_room_path}"
echo "采集数据表 product 数据..."
bash sqoop_collect_data_util.sh "${product_sql}" "${product_path}"
echo "采集数据表 product_order 数据..."
bash sqoop_collect_data_util.sh "${product_order_sql}" "${product_order_path}"
echo "结束采集..."
采集某天的数据
bash warehouse/shell/sqoop/collect_data_incrment.sh 20231122
采集前一天的数据
bash warehouse/shell/sqoop/collect_data_incrment.sh
编写全年增量数据采集脚本
vim warehouse/shell/sqoop/collect_data_year_incrment.sh
#!/bin/bash
# 如果没有输入参数,则获取当前年份
if [ $# -eq 0 ]; then
current_year=$(date +'%Y')
years=($current_year)
else
years=("$@")
fi
for year in "${years[@]}"
do
# 判断是否为闰年
if [ $((year % 4)) -eq 0 ] && [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; then
days_in_feb=29
else
days_in_feb=28
fi
# 循环打印每一天
for month in {1..12}
do
# 获取当前月份的天数
case $month in
1|3|5|7|8|10|12) days_in_month=31 ;;
4|6|9|11) days_in_month=30 ;;
2) days_in_month=$days_in_feb ;;
esac
for (( day=1; day<=$days_in_month; day++ ))
do
# 格式化日期
date_string=$(date -d "$month/$day/$year" +'%Y%m%d')
bash collect_data_incrment.sh $date_string
done
done
done
采集某年全年数据
bash warehouse/shell/sqoop/collect_data_year_incrment.sh 2023
warehouse/shell/sqoop/collect_data_year_incrment.sh
编写某年全年或某年全月增量数据采集脚本
vim warehouse/shell/sqoop/collect_data_year_month_incrment.sh
#!/bin/bash
# 如果没有输入参数,则获取当前年份
if [ $# -eq 0 ]; then
current_year=$(date +'%Y')
years=($current_year)
else
years=()
for arg in "$@"
do
if [[ $arg =~ ^[0-9]{4}$ ]]; then
years+=($arg)
elif [[ $arg =~ ^[0-9]{6}$ ]]; then
years+=(${arg:0:4})
months+=(${arg:4:2})
fi
done
fi
for year in "${years[@]}"
do
# 判断是否为闰年
if [ $((year % 4)) -eq 0 ] && [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; then
days_in_feb=29
else
days_in_feb=28
fi
# 如果参数是yyyyMM格式,则打印出该年月的每一天
if [ ${#months[@]} -gt 0 ]; then
for month in "${months[@]}"
do
# 获取当前月份的天数
case $month in
01|03|05|07|08|10|12) days_in_month=31 ;;
04|06|09|11) days_in_month=30 ;;
02) days_in_month=$days_in_feb ;;
esac
# 循环打印每一天
for (( day=1; day<=$days_in_month; day++ ))
do
# 格式化日期
date_string=$(date -d "$month/$day/$year" +'%Y%m%d')
bash collect_data_incrment.sh $date_string
done
done
else
# 打印当前参数所在年份的每一天
for month in {01..12}
do
case $month in
01|03|05|07|08|10|12) days_in_month=31 ;;
04|06|09|11) days_in_month=30 ;;
02) days_in_month=$days_in_feb ;;
esac
for (( day=1; day<=$days_in_month; day++ ))
do
date_string=$(date -d "$month/$day/$year" +'%Y%m%d')
bash collect_data_incrment.sh $date_string
done
done
fi
done
采集某年全年数据
bash warehouse/shell/sqoop/collect_data_year_month_incrment.sh 2023
采集某年某月全月数据
warehouse/shell/sqoop/collect_data_year_month_incrment.sh 20231122
编写添加分区脚本
vim warehouse/shell/hive/add_partition.sh
#!/bin/bash
# 加外部分区表
# 接收三个参数
# 1、表名称
# 2、分区字段dt的值,格式:20260201
# 3、分区路径(相对或者绝对路径都可以)但个人习惯使用绝对路径
if [ $# != 3 ]
then
echo "参数异常:add_partition.sh <table_name> <partation> <hdfs_path>"
exit 100
fi
# 表名称
table_name=$1
# 分区字段的值,格式:20260201
date=$2
year=$(date -d "$date" +%Y)
month=$(date -d "$date" +%m)
day=$(date -d "$date" +%d)
# 分区路径
path=$3
echo "alter table "${table_name}" add if not exists partition ("y=${year},m=${month},d=${day}") location '"${path}"';"
hive -e "
alter table "${table_name}" add if not exists partition ("y=${year},m=${month},d=${day}") location '"${path}"';
"
使用举例:
bash warehouse/shell/hive/add_partition.sh 20231122 /ods/douyin/user/2022/01/01
编写添加全年或某年某月分区数据脚本
vim warehouse/shell/hive/warehouse/shell/hive/add_partition.sh
if [ $# -eq 0 ]; then
current_year=$(date +'%Y')
years=($current_year)
else
years=()
for arg in "$@"
do
if [[ $arg =~ ^[0-9]{4}$ ]]; then
years+=($arg)
elif [[ $arg =~ ^[0-9]{6}$ ]]; then
years+=(${arg:0:4})
months+=(${arg:4:2})
fi
done
fi
for year in "${years[@]}"
do
# 判断是否为闰年
if [ $((year % 4)) -eq 0 ] && [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; then
days_in_feb=29
else
days_in_feb=28
fi
# 如果参数是yyyyMM格式,则打印出该年月的每一天
if [ ${#months[@]} -gt 0 ]; then
for month in "${months[@]}"
do
# 获取当前月份的天数
case $month in
01|03|05|07|08|10|12) days_in_month=31 ;;
04|06|09|11) days_in_month=30 ;;
02) days_in_month=$days_in_feb ;;
esac
# 循环打印每一天
for (( day=1; day<=$days_in_month; day++ ))
do
# 格式化日期
date_string=$(date -d "$month/$day/$year" +'%Y%m%d')
partation_string=$(date -d "$month/$day/$year" +'%Y/%m/%d')
bash add_partation.sh ods_lc.douyin_user $date_string /ods/douyin/user/$partation_string
bash add_partation.sh ods_lc.douyin_video $date_string /ods/douyin/video/$partation_string
bash add_partation.sh ods_lc.douyin_live_room $date_string /ods/douyin/live_room/$partation_string
bash add_partation.sh ods_lc.douyin_product $date_string /ods/douyin/product/$partation_string
bash add_partation.sh ods_lc.douyin_product_order $date_string /ods/douyin/product_order/$partation_string
done
done
else
# 打印当前参数所在年份的每一天
for month in {01..12}
do
case $month in
01|03|05|07|08|10|12) days_in_month=31 ;;
04|06|09|11) days_in_month=30 ;;
02) days_in_month=$days_in_feb ;;
esac
for (( day=1; day<=$days_in_month; day++ ))
do
date_string=$(date -d "$month/$day/$year" +'%Y%m%d')
partation_string=$(date -d "$month/$day/$year" +'%Y/%m/%d')
bash add_partation.sh ods_lc.douyin_user $date_string /ods/douyin/user/$partation_string
bash add_partation.sh ods_lc.douyin_video $date_string /ods/douyin/video/$partation_string
bash add_partation.sh ods_lc.douyin_live_room $date_string /ods/douyin/live_room/$partation_string
bash add_partation.sh ods_lc.douyin_product $date_string /ods/douyin/product/$partation_string
bash add_partation.sh ods_lc.douyin_product_order $date_string /ods/douyin/product_order/$partation_string
done
done
fi
done
使用举例:
bash warehouse/shell/hive/add_partition.sh 202311
bash warehouse/shell/hive/add_partition.sh 202311
快手
数据建模
-- 快手数据存放在 /ods/kuaishou 目录,数据格式:{"videoId":1,"userId":494,"title":"动物","description":"野生动物","duration":59,"uploadTime":"2022-01-13 05:25:12"}
-- 数据建模
-- 用户表(user)
CREATE external TABLE ods_lc.kuaishou_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time DATE COMMENT '注册时间'
) comment '用户表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/user';
-- 视频表(video)
CREATE external TABLE ods_lc.kuaishou_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/video';
-- 直播间表(live_room)
CREATE external TABLE ods_lc.kuaishou_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/live_room';
-- 商品表(product)
CREATE external TABLE ods_lc.kuaishou_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/product';
-- 商品订单表(product_order)
CREATE external TABLE ods_lc.kuaishou_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE ods_lc.kuaishou_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/live_order';
-- 评论表(comment)
CREATE external TABLE ods_lc.kuaishou_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/comment';
-- 退货表(return_order)
CREATE external TABLE ods_lc.kuaishou_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/return_order';
-- 投诉表(report)
CREATE external TABLE ods_lc.kuaishou_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/report';
-- 纠纷表(dispute)
CREATE external TABLE ods_lc.kuaishou_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/dispute';
上传数据
hdfs dfs -put warehouse/data/kuaishou/user.log /ods/kuaishou/user
hdfs dfs -put warehouse/data/kuaishou/video.log /ods/kuaishou/video
hdfs dfs -put warehouse/data/kuaishou/live_room.log /ods/kuaishou/live_room
hdfs dfs -put warehouse/data/kuaishou/product.log /ods/kuaishou/product
hdfs dfs -put warehouse/data/kuaishou/product_order.log /ods/kuaishou/product_order
小红书
数据建模
create external table ods_lc.xiaohongshu_user(info string) comment '用户表' location '/ods/xiaohongshu/user';
create external table ods_lc.xiaohongshu_video(info string) comment '视频表' location '/ods/xiaohongshu/video';
create external table ods_lc.xiaohongshu_live_room(info string) comment '直播间' location '/ods/xiaohongshu/live_room';
create external table ods_lc.xiaohongshu_product(info string) comment '商品表' location '/ods/xiaohongshu/product';
create external table ods_lc.xiaohongshu_prouct_order(info string) comment '商品订单表' location '/ods/xiaohongshu/product_order';
create external table ods_lc.xiaohongshu_live_order(info string) comment '直播间订单表' location '/ods/xiaohongshu/live_order';
create external table ods_lc.xiaohongshu_comment(info string) comment '评论表' location '/ods/xiaohongshu/comment';
create external table ods_lc.xiaohongshu_return_order(info string) comment '退货表' location '/ods/xiaohongshu/return_order';
create external table ods_lc.xiaohongshu_report(info string) comment '投诉表' location '/ods/xiaohongshu/report';
create external table ods_lc.xiaohongshu_dispute(info string) comment '纠纷表' location '/ods/xiaohongshu/dispute';
上传数据
hdfs dfs -put warehouse/data/xiaohongshu/user.json /ods/xiaohongshu/user
hdfs dfs -put warehouse/data/xiaohongshu/video.json /ods/xiaohongshu/video
hdfs dfs -put warehouse/data/xiaohongshu/live_room.json /ods/xiaohongshu/live_room
hdfs dfs -put warehouse/data/xiaohongshu/product.json /ods/xiaohongshu/product
hdfs dfs -put warehouse/data/xiaohongshu/product_order.json /ods/xiaohongshu/product_order
DWD
小红书
数据建模
-- DWD
-- 数据建模
-- 用户表(user)
CREATE external TABLE dwd_lc.xiaohongshu_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time timestamp COMMENT '注册时间'
) comment '用户表'
location '/dwd/xiaohongshu/user';
-- 视频表(video)
CREATE external TABLE dwd_lc.xiaohongshu_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
location '/dwd/xiaohongshu/video';
-- 直播间表(live_room)
CREATE external TABLE dwd_lc.xiaohongshu_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
location '/dwd/xiaohongshu/live_room';
-- 商品表(product)
CREATE external TABLE dwd_lc.xiaohongshu_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
location '/dwd/xiaohongshu/product';
-- 商品订单表(product_order)
CREATE external TABLE dwd_lc.xiaohongshu_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
location '/dwd/xiaohongshu/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE dwd_lc.xiaohongshu_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
location '/dwd/xiaohongshu/live_order';
-- 评论表(comment)
CREATE external TABLE dwd_lc.xiaohongshu_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
location '/dwd/xiaohongshu/comment';
-- 退货表(return_order)
CREATE external TABLE dwd_lc.xiaohongshu_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
location '/dwd/xiaohongshu/return_order';
-- 投诉表(report)
CREATE external TABLE dwd_lc.xiaohongshu_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
location '/dwd/xiaohongshu/report';
-- 纠纷表(dispute)
CREATE external TABLE dwd_lc.xiaohongshu_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
location '/dwd/xiaohongshu/dispute';
ETL
ETL是将业务系统的数据经过**抽取(Extract)、清洗转换(Transform)之后加载(Load)**到数据仓库的过程,
目的是将企业中的分散、零乱、标准不统一的数据整合到一起,为企业的决策提供分析依据, ETL是BI(商业智能)项目重要的一个环节。
-- 小红书 ETL
-- 用户表(user)
insert overwrite table dwd_lc.xiaohongshu_user (
select json_tuple(people,'user_id','username','gender','age','location','register_time') as (user_id,username,gender,age,location,register_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_user) ods
lateral view explode(people_list) t1 as people) t2);
-- 视频表(video)
insert overwrite table dwd_lc.xiaohongshu_video (
select json_tuple(people,'video_id','user_id','title','description','duration','upload_time') as (video_id,user_id,title,description,duration,upload_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_video) ods
lateral view explode(people_list) t1 as people) t2);
-- 直播间表(live_room)
insert overwrite table dwd_lc.xiaohongshu_live_room (
select json_tuple(people,'room_id','user_id','title','description','start_time','end_time') as (room_id,user_id,title,description,start_time,end_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_live_room) ods
lateral view explode(people_list) t1 as people) t2);
-- 商品表(product)
insert overwrite table dwd_lc.xiaohongshu_product (
select json_tuple(people,'product_id','seller_id','name','description','price','create_time') as (product_id,seller_id,name,description,price,create_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_product) ods
lateral view explode(people_list) t1 as people) t2);
-- 商品订单表(product_order)
insert overwrite table dwd_lc.xiaohongshu_product_order (
select json_tuple(people,'order_id','buyer_id','seller_id','product_id','quantity','price','order_time') as (order_id,buyer_id,seller_id,product_id,quantity,price,order_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_prouct_order) ods
lateral view explode(people_list) t1 as people) t2);
数据分析
-- 数据分析需求示例
-- 1. 求每个用户的视频数量
SELECT user_id, COUNT(*) AS video_count
FROM video
GROUP BY user_id
ORDER BY video_count DESC
LIMIT 30;
-- 2. 求每个视频的点赞数量
SELECT video_id, COUNT(*) AS like_count
FROM video_like
GROUP BY video_id
ORDER BY like_count DESC
LIMIT 30;
-- 3. 求每个商品的销量和销售额
SELECT product_id, SUM(quantity) AS total_sales, SUM(quantity * price) AS total_revenue
FROM product_order
GROUP BY product_id
ORDER BY total_sales DESC
LIMIT 30;
-- 4. 求每个直播间的观众数量
SELECT room_id, COUNT(*) AS viewer_count
FROM live_viewer
GROUP BY room_id
ORDER BY viewer_count DESC
LIMIT 30;
-- 5. 求每个用户的评论数量
SELECT user_id, COUNT(*) AS comment_count
FROM user_comment
GROUP BY user_id
ORDER BY comment_count DESC
LIMIT 30;
-- 6. 求每个用户的评价数量
SELECT user_id, COUNT(*) AS review_count
FROM product_review
GROUP BY user_id
ORDER BY review_count DESC
LIMIT 30;
-- 7. 求每个用户的直播间评价数量
SELECT user_id, COUNT(*) AS review_count
FROM live_review
GROUP BY user_id
ORDER BY review_count DESC
LIMIT 30;
-- 8. 求每个用户的点赞数量
SELECT user_id, COUNT(*) AS like_count
FROM video_like
GROUP BY user_id
ORDER BY like_count DESC
LIMIT 30;
-- 9. 求每个用户的观众数量
SELECT user_id, COUNT(*) AS viewer_count
FROM live_viewer
GROUP BY user_id
ORDER BY viewer_count DESC
LIMIT 30;
-- 10. 求每个用户的购买数量和消费金额
SELECT buyer_id, COUNT(*) AS purchase_count, SUM(quantity * price) AS total_spent
FROM product_order
GROUP BY buyer_id
ORDER BY purchase_count DESC
LIMIT 30;
-- 11. 求每个用户的卖出数量和收入金额
SELECT seller_id, COUNT(*) AS sales_count, SUM(quantity * price) AS total_income
FROM product_order
GROUP BY seller_id
ORDER BY sales_count DESC
LIMIT 30;
-- 12. 求每个用户的直播间订单数量和收入金额
SELECT seller_id, COUNT(*) AS sales_count, SUM(quantity * price) AS total_income
FROM live_order
GROUP BY seller_id
ORDER BY sales_count DESC
LIMIT 30;
-- 13. 求每个用户的直播间观众数量
SELECT user_id, COUNT(*) AS viewer_count
FROM live_viewer
GROUP BY user_id
ORDER BY viewer_count DESC
LIMIT 30;
-- 14. 求每个用户的视频观看数量
SELECT user_id, COUNT(*) AS view_count
FROM video_view
GROUP BY user_id
ORDER BY view_count DESC
LIMIT 30;
-- 15. 求每个用户的视频观看时长
SELECT user_id, SUM(duration) AS total_duration
FROM video
GROUP BY user_id
ORDER BY total_duration DESC
LIMIT 30;
-- 16. 求每个直播间的观众数量和观看时长
SELECT room_id, COUNT(*) AS viewer_count, SUM(duration) AS total_duration
FROM live_viewer lv
JOIN video v ON lv.room_id = v.video_id
GROUP BY room_id
ORDER BY viewer_count DESC
LIMIT 30;
-- 17. 求每个商品的评价数量
SELECT product_id, COUNT(*) AS review_count
FROM product_review
GROUP BY product_id
ORDER BY review_count DESC
LIMIT 30;
-- 18. 求每个直播间的评价数量
SELECT room_id, COUNT(*) AS review_count
FROM live_review
GROUP BY room_id
ORDER BY review_count DESC
LIMIT 30;
-- 19. 求每个视频的评论数量
SELECT video_id, COUNT(*) AS comment_count
FROM user_comment
WHERE target_id IS NOT NULL
GROUP BY video_id
ORDER BY comment_count DESC
LIMIT 30;
-- 20. 求每个直播间的评论数量
SELECT room_id, COUNT(*) AS comment_count
FROM user_comment
WHERE target_id IS NULL
GROUP BY room_id
ORDER BY comment_count DESC
LIMIT 30;
-- 21. 求每个用户的视频点赞数量
SELECT user_id, COUNT(*) AS like_count
FROM video_like vl
JOIN video v ON vl.video_id = v.video_id
GROUP BY user_id
ORDER BY like_count DESC
LIMIT 30;
-- 22. 求每个用户的直播间观众数量
SELECT user_id, COUNT(*) AS viewer_count
FROM live_viewer
GROUP BY user_id
ORDER BY viewer_count DESC
LIMIT 30;
-- 23. 求每个用户
-- 23. 求每个用户的视频观看数量和观看时长
SELECT user_id, COUNT(*) AS view_count, SUM(duration) AS total_duration
FROM video_view vv
JOIN video v ON vv.video_id = v.video_id
GROUP BY user_id
ORDER BY view_count DESC
LIMIT 30;
-- 24. 求每个用户的直播间观众数量和观看时长
SELECT user_id, COUNT(*) AS viewer_count, SUM(duration) AS total_duration
FROM live_viewer lv
JOIN video v ON lv.room_id = v.video_id
GROUP BY user_id
ORDER BY viewer_count DESC
LIMIT 30;
-- 25. 求每个用户的购买数量和消费金额
SELECT buyer_id, COUNT(*) AS purchase_count, SUM(quantity * price) AS total_spent
FROM product_order
GROUP BY buyer_id
ORDER BY purchase_count DESC
LIMIT 30;
-- 26. 求每个用户的卖出数量和收入金额
SELECT seller_id, COUNT(*) AS sales_count, SUM(quantity * price) AS total_income
FROM product_order
GROUP BY seller_id
ORDER BY sales_count DESC
LIMIT 30;
-- 27. 求每个用户的直播间订单数量和收入金额
SELECT seller_id, COUNT(*) AS sales_count, SUM(quantity * price) AS total_income
FROM live_order
GROUP BY seller_id
ORDER BY sales_count DESC
LIMIT 30;
-- 28. 求每个用户的视频观看数量
SELECT user_id, COUNT(*) AS view_count
FROM video_view
GROUP BY user_id
ORDER BY view_count DESC
LIMIT 30;
-- 29. 求每个用户的视频观看时长
SELECT user_id, SUM(duration) AS total_duration
FROM video
GROUP BY user_id
ORDER BY total_duration DESC
LIMIT 30;
-- 30. 求每个用户的直播间观众数量和观看时长
SELECT user_id, COUNT(*) AS viewer_count, SUM(duration) AS total_duration
FROM live_viewer lv
JOIN video v ON lv.room_id = v.video_id
GROUP BY user_id
ORDER BY viewer_count DESC
LIMIT 30;
hive数仓etl
-- hive
-- 1. 初始化数据库
-- 删除数据库
drop database if exists ods_lc cascade;
drop database if exists dwd_lc cascade;
drop database if exists dwt_lc cascade;
drop database if exists dws_lc cascade;
drop database if exists ads_lc cascade;
drop database if exists app_lc cascade;
-- 创建数据库
create database if not exists ods_lc location '/data/ods/livestreaming_commerce';
create database if not exists dwd_lc location '/data/dwd/livestreaming_commerce';
create database if not exists dwt_lc location '/data/dwt/livestreaming_commerce';
create database if not exists dws_lc location '/data/dwt/livestreaming_commerce';
create database if not exists ads_lc location '/data/dws/livestreaming_commerce';
create database if not exists app_lc location '/data/ads/livestreaming_commerce';
-- 数据建模
-- ODS层
-- 公共数据地区表
-- 地区表
create external table ods_lc.region(
region_code string comment '地区代码',
region_name string comment '地区名称',
parent_code string comment '当前地区的上一级地区代码'
) comment '中国地区表'
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/public/region';
create external table ods_lc.region_district(
region_code string comment '地区代码',
region_name string comment '地区名称',
parent_code string comment '当前地区的上一级地区代码'
) comment '中国区县表'
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/public/region_district';
-- 抖音数据存放在 /ods/douyin 目录, 数据格式:867,3,汽车,用车知识,206,2022-01-01 23:18:38
-- 数据建模
-- 用户表(user)
CREATE external TABLE ods_lc.douyin_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time timestamp COMMENT '注册时间'
) comment '用户表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/user';
-- 视频表(video)
CREATE external TABLE ods_lc.douyin_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/video';
-- 直播间表(live_room)
CREATE external TABLE ods_lc.douyin_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/live_room';
-- 商品表(product)
CREATE external TABLE ods_lc.douyin_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/product';
-- 商品订单表(product_order)
CREATE external TABLE ods_lc.douyin_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE ods_lc.douyin_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/live_order';
-- 评论表(comment)
CREATE external TABLE ods_lc.douyin_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/comment';
-- 退货表(return_order)
CREATE external TABLE ods_lc.douyin_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/return_order';
-- 投诉表(report)
CREATE external TABLE ods_lc.douyin_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/report';
-- 纠纷表(dispute)
CREATE external TABLE ods_lc.douyin_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
partitioned by (y string,m string,d string)
row format delimited fields terminated by ','
lines terminated by 'n'
location '/ods/douyin/dispute';
-- 快手数据存放在 /ods/kuaishou 目录,数据格式:{"videoId":1,"userId":494,"title":"动物","description":"野生动物","duration":59,"uploadTime":"2022-01-13 05:25:12"}
-- 数据建模
-- 用户表(user)
CREATE external TABLE ods_lc.kuaishou_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time timestamp COMMENT '注册时间'
) comment '用户表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/user';
-- WITH SERDEPROPERTIES ( "mapping.ts" = "timestamp" )
-- ALTER TABLE user SET SERDEPROPERTIES ( "ignore.malformed.json" = "true" );
-- 视频表(video)
CREATE external TABLE ods_lc.kuaishou_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/video';
-- 直播间表(live_room)
CREATE external TABLE ods_lc.kuaishou_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/live_room';
-- 商品表(product)
CREATE external TABLE ods_lc.kuaishou_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/product';
-- 商品订单表(product_order)
CREATE external TABLE ods_lc.kuaishou_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE ods_lc.kuaishou_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/live_order';
-- 评论表(comment)
CREATE external TABLE ods_lc.kuaishou_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/comment';
-- 退货表(return_order)
CREATE external TABLE ods_lc.kuaishou_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/return_order';
-- 投诉表(report)
CREATE external TABLE ods_lc.kuaishou_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/report';
-- 纠纷表(dispute)
CREATE external TABLE ods_lc.kuaishou_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/ods/kuaishou/dispute';
-- 小红书数据存放在 /ods/kuaishou 目录,数据格式:{"data:[{"videoId":1,"userId":494,"title":"动物","description":"野生动物","duration":59,"uploadTime":"2022-01-13 05:25:12"}]"}
create external table ods_lc.xiaohongshu_user(info string) comment '用户表' location '/ods/xiaohongshu/user';
create external table ods_lc.xiaohongshu_video(info string) comment '视频表' location '/ods/xiaohongshu/video';
create external table ods_lc.xiaohongshu_live_room(info string) comment '直播间' location '/ods/xiaohongshu/live_room';
create external table ods_lc.xiaohongshu_product(info string) comment '商品表' location '/ods/xiaohongshu/product';
create external table ods_lc.xiaohongshu_prouct_order(info string) comment '商品订单表' location '/ods/xiaohongshu/product_order';
create external table ods_lc.xiaohongshu_live_order(info string) comment '直播间订单表' location '/ods/xiaohongshu/live_order';
create external table ods_lc.xiaohongshu_comment(info string) comment '评论表' location '/ods/xiaohongshu/comment';
create external table ods_lc.xiaohongshu_return_order(info string) comment '退货表' location '/ods/xiaohongshu/return_order';
create external table ods_lc.xiaohongshu_report(info string) comment '投诉表' location '/ods/xiaohongshu/report';
create external table ods_lc.xiaohongshu_dispute(info string) comment '纠纷表' location '/ods/xiaohongshu/dispute';
-- DWD
-- 数据建模
-- 小红书
-- 用户表(user)
CREATE external TABLE dwd_lc.xiaohongshu_user (
user_id INT COMMENT '用户ID',
username string COMMENT '用户名',
gender string COMMENT '性别',
age INT COMMENT '年龄',
location string COMMENT '地点',
register_time timestamp COMMENT '注册时间'
) comment '用户表'
location '/dwd/xiaohongshu/user';
-- 视频表(video)
CREATE external TABLE dwd_lc.xiaohongshu_video (
video_id INT COMMENT '视频ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
duration INT COMMENT '时长',
upload_time timestamp COMMENT '上传时间'
) comment '视频表'
location '/dwd/xiaohongshu/video';
-- 直播间表(live_room)
CREATE external TABLE dwd_lc.xiaohongshu_live_room (
room_id INT COMMENT '直播间ID',
user_id INT COMMENT '用户ID',
title string COMMENT '标题',
description string COMMENT '描述',
start_time timestamp COMMENT '开始时间',
end_time timestamp COMMENT '结束时间'
) comment '直播间'
location '/dwd/xiaohongshu/live_room';
-- 商品表(product)
CREATE external TABLE dwd_lc.xiaohongshu_product (
product_id INT COMMENT '商品ID',
seller_id INT COMMENT '卖家ID',
name string COMMENT '名称',
description string COMMENT '描述',
price DECIMAL(10, 2) COMMENT '价格',
create_time timestamp COMMENT '创建时间'
) comment '商品表'
location '/dwd/xiaohongshu/product';
-- 商品订单表(product_order)
CREATE external TABLE dwd_lc.xiaohongshu_product_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '商品订单表'
location '/dwd/xiaohongshu/product_order';
-- 直播间订单表(live_order)
CREATE external TABLE dwd_lc.xiaohongshu_live_order (
order_id INT COMMENT '订单ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
room_id INT COMMENT '直播间ID',
product_id INT COMMENT '商品ID',
quantity INT COMMENT '数量',
price DECIMAL(10, 2) COMMENT '价格',
order_time timestamp COMMENT '下单时间'
) comment '直播间订单表'
location '/dwd/xiaohongshu/live_order';
-- 评论表(comment)
CREATE external TABLE dwd_lc.xiaohongshu_comment (
comment_id INT COMMENT '评论ID',
user_id INT COMMENT '用户ID',
video_id INT COMMENT '视频ID',
live_room_id INT COMMENT '直播间ID',
content string COMMENT '内容',
comment_time timestamp COMMENT '评论时间'
) comment '评论表'
location '/dwd/xiaohongshu/comment';
-- 退货表(return_order)
CREATE external TABLE dwd_lc.xiaohongshu_return_order (
return_id INT COMMENT '退货ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
product_id INT COMMENT '商品ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
return_time timestamp COMMENT '退货时间'
) comment '退货表'
location '/dwd/xiaohongshu/return_order';
-- 投诉表(report)
CREATE external TABLE dwd_lc.xiaohongshu_report (
report_id INT COMMENT '举报ID',
user_id INT COMMENT '用户ID',
live_room_id INT COMMENT '直播间ID',
reason string COMMENT '原因',
report_time timestamp COMMENT '举报时间'
) comment '投诉表'
location '/dwd/xiaohongshu/report';
-- 纠纷表(dispute)
CREATE external TABLE dwd_lc.xiaohongshu_dispute (
dispute_id INT COMMENT '纠纷ID',
buyer_id INT COMMENT '买家ID',
seller_id INT COMMENT '卖家ID',
order_id INT COMMENT '订单ID',
reason string COMMENT '原因',
dispute_time timestamp COMMENT '纠纷时间'
) comment '纠纷表'
location '/dwd/xiaohongshu/dispute';
-- 小红书 ETL
-- 用户表(user)
insert overwrite table dwd_lc.xiaohongshu_user (
select json_tuple(people,'user_id','username','gender','age','location','register_time') as (user_id,username,gender,age,location,register_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_user) ods
lateral view explode(people_list) t1 as people) t2);
-- 视频表(video)
insert overwrite table dwd_lc.xiaohongshu_video (
select json_tuple(people,'video_id','user_id','title','description','duration','upload_time') as (video_id,user_id,title,description,duration,upload_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_video) ods
lateral view explode(people_list) t1 as people) t2);
-- 直播间表(live_room)
insert overwrite table dwd_lc.xiaohongshu_live_room (
select json_tuple(people,'room_id','user_id','title','description','start_time','end_time') as (room_id,user_id,title,description,start_time,end_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_live_room) ods
lateral view explode(people_list) t1 as people) t2);
-- 商品表(product)
insert overwrite table dwd_lc.xiaohongshu_product (
select json_tuple(people,'product_id','seller_id','name','description','price','create_time') as (product_id,seller_id,name,description,price,create_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_product) ods
lateral view explode(people_list) t1 as people) t2);
-- 商品订单表(product_order)
insert overwrite table dwd_lc.xiaohongshu_product_order (
select json_tuple(people,'order_id','buyer_id','seller_id','product_id','quantity','price','order_time') as (order_id,buyer_id,seller_id,product_id,quantity,price,order_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_prouct_order) ods
lateral view explode(people_list) t1 as people) t2);
-- 直播间订单表(live_order)
insert overwrite table dwd_lc.xiaohongshu_live_order (
select json_tuple(people,'order_id','buyer_id','seller_id','seller_id','room_id','product_id','quantity','price','order_time') as (order_id,buyer_id,seller_id,room_id,product_id,quantity,price,order_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_live_order) ods
lateral view explode(people_list) t1 as people) t2);
-- 评论表(comment)
insert overwrite table dwd_lc.xiaohongshu_comment (
select json_tuple(people,'comment_id','user_id','video_id','live_room_id','content','upload_time') as (comment_id,user_id,video_id,live_room_id,content,comment_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_comment) ods
lateral view explode(people_list) t1 as people) t2);
-- 退货表(return_order)
insert overwrite table dwd_lc.xiaohongshu_return_order (
select json_tuple(people,'return_id','buyer_id','seller_id','product_id','order_id','reason','return_time') as (return_id,buyer_id,seller_id,product_id,order_id,reason,return_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_return_order) ods
lateral view explode(people_list) t1 as people) t2);
-- 投诉表(report)
insert overwrite table dwd_lc.xiaohongshu_report (
select json_tuple(people,'report_id','user_id','live_room_id','reason','report_time') as (report_id,user_id,live_room_id,reason,report_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_report) ods
lateral view explode(people_list) t1 as people) t2);
-- 纠纷表(dispute)
insert overwrite table dwd_lc.xiaohongshu_dispute (
select json_tuple(people,'dispute_id','buyer_id','seller_id','order_id','reason','dispute_time') as (dispute_id,buyer_id,seller_id,order_id,reason,dispute_time) from (
select people from (select split(regexp_replace(regexp_replace(get_json_object(info,'$.data'),'\[|\]',''),'\}\,\{','\}\;\{'),";") people_list from ods_lc.xiaohongshu_dispute) ods
lateral view explode(people_list) t1 as people) t2);
-- 快手
select name, description, price, total_quantity from product inner join
(select product_id,sum(quantity) total_quantity from product_order
group by product_id order by total_quantity) t
on t.product_id = product.product_id;
-- 商品销售数量统计 原谅我一不小心做成了内部表 该表的数据从ods_lc.product_order中抽取
create table dwd_lc.kuaishou_product_order_total_quantity
(product_id int,total_quantity int)
partitioned by (create_date date);
-- 从ods层导入输入到 dwd_lc.product_order_total_quantity
insert into table dwd_lc.kuaishou_product_order_total_quantity
select product_id,sum(quantity) total_quantity,current_date() create_date from ods_lc.kuaishou_product_order
group by product_id order by total_quantity;
-- DWS
-- 数据建模
-- 快手
-- 商品销售数量统计 原谅我一不小心做成了内部表 该表的数据从ods_lc.product_order和dwd_lc.product_order_total_quantity中抽取
create table dws_lc.kuaishou_product_order_total_quantity(
product_id int,
name string,
description string,
price decimal(10,2),
total_quantity int
)
partitioned by (create_date date);
-- 从ods层和dwd层导入输入到 ws_lc.kuaishou_product_order_total_quantity
insert into table dwd_lc.kuaishou_product_order_total_quantity
select product_id,name,description,price from ods_lc.kuaishou_product_order ods
inner join
(select product_id,sum(quantity) total_quantity,current_date() create_date from ods_lc.kuaishou_product_order
group by product_id order by total_quantity) dwd
on dwd.product_id = ods.product_id;
mysql数仓etl
# use livestreaming_commerce;
# alter database livestreaming_commerce character set utf8mb3 collate utf8mb3_bin;
# alter table comment convert to character set utf8mb3 collate utf8mb3_bin;
# alter table dispute convert to character set utf8mb3 collate utf8mb3_bin;
# alter table live_order convert to character set utf8mb3 collate utf8mb3_bin;
# alter table live_room convert to character set utf8mb3 collate utf8mb3_bin;
# alter table product convert to character set utf8mb3 collate utf8mb3_bin;
# alter table product_order convert to character set utf8mb3 collate utf8mb3_bin;
# alter table region convert to character set utf8mb3 collate utf8mb3_bin;
# alter table region_district convert to character set utf8mb3 collate utf8mb3_bin;
# alter table report convert to character set utf8mb3 collate utf8mb3_bin;
# alter table return_order convert to character set utf8mb3 collate utf8mb3_unicode_ci;
# alter table user convert to character set utf8mb3 collate utf8mb3_bin;
# alter table video convert to character set utf8mb3 collate utf8mb3_bin;
create database dwd_lc;
create table dwd_lc.product_order_total_quantity (product_id int,total_quantity int,create_dated date);
insert into dwd_lc.product_order_total_quantity
select product_id,sum(quantity) total_quantity,curdate() from product_order
group by product_id;
create database dws_lc;
create table dwd_lc.product_order_total_quantity (
product_id int,total_quantity int,create_dated date
);
create database dwt_lc;
create table dwt_lc.product_order_total_quantity (
product_id int,
name varchar(50),
description varchar(50),
price decimal(10,2),
total_quantity int,
create_dated date
);
insert into dwt_lc.product_order_total_quantity
select ods.product_id, name, description, price, total_quantity, create_dated create_date from livestreaming_commerce.product ods
inner join
(select product_id, total_quantity, create_dated from dwd_lc.product_order_total_quantity) dwd
on dwd.product_id = ods.product_id;
create database aps_lc;
create table aps_lc.product_order_total_quantity (
product_id int,
name varchar(50),
description varchar(50),
price decimal(10,2),
total_quantity int,
create_dated date
);
insert into aps_lc.product_order_total_quantity
select product_id, name, description, price, total_quantity, create_dated create_date
from dwt_lc.product_order_total_quantity order by total_quantity desc limit 10;
数据可视化 FineBi
部署环境
环境 | 版本 |
---|---|
openEuler Linux | 22.03 |
MySQL | 8.0.35 |
JDK | 1.8 |
Tomcat | 9.0.83 |
FineBi | 6.0 |
环境准备
升级系统内核和软件
yum -y update
reboot
安装常用工具软件
yum -y install vim tar net-tools
安装MySQL8
将 MySQL Yum 存储库添加到系统的存储库列表中
sudo yum -y install https://repo.mysql.com//mysql80-community-release-el8-9.noarch.rpm
通过运行以下命令并检查其输出来验证是否已启用和禁用正确的子存储库
sudo yum repolist enabled | grep mysql
通过以下命令安装 MySQL
sudo yum -y install mysql-community-server
启动 MySQL 服务器
sudo systemctl start mysqld
sudo systemctl status mysqld
sudo systemctl enable mysqld
在服务器初始启动时,假设服务器的数据目录为空,会发生以下情况:
-
创建了一个超级用户帐户
'root'@'localhost
。超级用户的密码已设置并存储在错误日志文件中。要显示它,请使用以下命令:sudo grep 'temporary password' /var/log/mysqld.log
通过使用生成的临时密码登录并为超级用户帐户设置自定义密码,尽快更改 root 密码:
mysql -uroot -p
修改密码
ALTER USER 'root'@'localhost' IDENTIFIED BY 'Lihaozhe!!@@1122';
FLUSH PRIVILEGES;
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'Lihaozhe!!@@1122';
FLUSH PRIVILEGES;
update mysql.user set host = '%',plugin='mysql_native_password' where user='root';
FLUSH PRIVILEGES;
exit;
sudo systemctl restart mysqld
防火墙开放端口
–zone #作用域
firewall-cmd --zone=public --add-port=3306/tcp --permanent
防火墙重载
firewall-cmd --reload
# 使用新密码连接
mysql -h 你自己的IP地址 -uroot -p
创建finebi外接数据库finedb
create database finebi character set utf8 collate utf8_bin;
安装jdk和tomcat
下载jdk
jdk 官网下载页面:https://www.oracle.com/java/technologies/downloads/
下载tomcat
tomcat 官网下载页面:https://tomcat.apache.org/download-90.cgi
解压jdk和tomcat并修改目录名称
tar -zxvf jdk-8u391-linux-x64.tar.gz -C /usr/local/
tar -zxvf apache-tomcat-9.0.83.tar.gz -C /usr/local/
mv /usr/local/jdk1.8.0_391/ /usr/local/jdk-8
mv /usr/local/apache-tomcat-9.0.83 /usr/local/tomcat-9
设置tomcat.pid文件生成
vim /usr/local/tomcat-9/bin/catalina.sh
CATALINA_PID="/usr/local/tomcat-9/bin/tomcat.pid"
配置环境变量
vim /etc/profile.d/my_env.sh
export JAVA_HOME=/usr/local/jdk-8
export CATALINA_BASE=/usr/local/tomcat-9
export CATALINA_HOME=/usr/local/tomcat-9
export PATH=$PATH:$JAVA_HOME/bin:$CATALINA_BASE/bin:$CATALINA_HOME/bin
source /etc/profile.d/my_env.sh
编写开机启动服务
cd /usr/lib/systemd/system
vim tomcat-9.service
[Unit]
Description=finereport-server
After=network.target
[Service]
Type=forking
Environment="JAVA_HOME=/usr/local/jdk-8"
PIDFile=/usr/local/tomcat-9/bin/tomcat.pid
ExecStart=/usr/local/tomcat-9/bin/startup.sh
ExecReload=/usr/local/tomcat-9/bin/shutdown.sh
ExecReload=/bin/kill -s HUP $MAINPID
PrivateTmp=true
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl start tomcat-9.service
systemctl enable tomcat-9.service
设置防火墙端口放行
firewall-cmd --zone=public --add-port=8080/tcp --permanent
firewall-cmd --reload
重启系统
reboot
浏览器访问测试
安装finebi
官网下载
上传安装包
解压安装包到指定目录
/usr/local
tar -zxvf tomcat-linux-x64.tar.gz -C /usr/local/
删除 MySQL5 数据库驱动
### 上传 MySQL 数据库驱动
```bash
cd /usr/local/tomcat-linux-x64/webapps/webroot/WEB-INF/lib/
ls mysql*
删除MySQL5的驱动后上传MySQL8的驱动
mysql-connector-j-8.0.33.jar mysql驱动
protobuf–java-3.5.1.jar mysql驱动依赖
rm -f mysql-connector-java-5.1.49-bin.jar /usr/local/tomcat-9/webapps/finebi
复制 finebi web 工程到 tomcat 服务器发布
cp -r /usr/local/tomcat-linux-x64/webapps/webroot /usr/local/tomcat-9/webapps/finebi
创建 finebi 外接数据库
字符集:
create database finebi default character set utf8 collate utf8_bin;
重启 tomcat
systemctl restart tomcat-9
浏览器访问测试
设置账号密码
根据使用场景选择数据库
本案例使用外接数据库
外接数据库配置
修改外接数据库参数
- 数据库类型
mysql
- 驱动
登录 finebi
finebi 新手入门案例
步骤:
准备公共数据
点击公共数据
点击新建文件夹
修改文件夹名称
上传数据
鼠标悬停在文件夹上,右侧出现 +
保存数据
更新数据
新建分析主题
点击我的分析
点击新建分析主题
选择数据
处理数据
调整数据列
两种方式:
在数据中分析
组件
点底部组件
维度与指标
维度
指标
指标:维度上量化的值
商品销售额top10
需求分析:
维度:商品名称
指标:销售额
图表类型
使用分组表
定义维度
定义维度上的指标
销售额排序
销售额降序
鼠标悬停在指标中的销售额上出现下拉列表
在指标销售额下拉列表中选中过滤
添加条件下拉列表中选择且条件
在第一个条件下拉列表中选择最大的N个
调整参数如下图
效果图
修改组件名称
鼠标悬停在组件名称,长右侧出现三个点
点击重命名
或者直接双击组件名称,组件变为编辑状态
编辑完成后回车确认或者鼠标点击其它空白区域即可
商品毛利额
毛利额 = 销售额 – 成本额
点击数据
在数据窗口点击新增公式列
计算公式列
毛利额 = 销售额 – 成本额
制作组件
在图形中分析
类别销售额占比
维度:商品类别
指标:销售额
图表类型选择饼图
将商品类别拖入图形属性中标签中
将销售额拖入图形属性中标签中
调整销售额单位
销售额占比
销售额占比 = 类型销售额数量 / 销售总额
再次将销售额拖入图形属性的标签中
第二次拖入图形属性标签中的销售额下拉列表
毛利额环比增长率
同比增长率=(本期数-同期数)÷同期数×100%
环比增长率=(本期数-上期数)/上期数×100%。
维度:时间
指标:毛利额
时间维度
修改日期颗粒度
毛利额指标
快速计算环比增长率
修改指标名称
毛利额环比增长率自定义图表
复制毛利额环比增长率组件
图表类型
自定义图表
修改毛利额为折线图
在图形属性中找到毛利额增长率,点开下拉列表
在下拉列表中选择线
毛利额标签
将数据中的毛利额拖入到图形属性中毛利额的标签属性中
修改图形属性中标签中毛利额的数值格式
方式一:纵轴为指标并列
方式二:设置值轴方向
注意:纵轴使用的是指标聚合
毛利额环比增长率标签
将数据中的毛利额拖入到图形属性中毛利额环比增长率的标签属性中
快速计算毛利额环比增长率
图形属性标签
设置颜色
如何找到合适的图表
- 确定表达的主题
- 确定表达的关系
- 选择图表形式
表达的主题
观察数据的角度
表达的关系
选择图表形式
毛利率环比增长率
毛利率 = (销售额 – 成本额) / 销售额 * 100%
毛利额 = 销售额 – 成本额
毛利率 = 毛利额 / 销售额 * 100%
添加计算字段
搜索右侧三个点点开
在计算字段中编写毛利率公式
毛利率 = (销售额 – 成本额) / 销售额 * 100%
毛利额 = 销售额 – 成本额
毛利率 = 毛利额 / 销售额 * 100%
由于毛利额已经计算过所此处直接引用
寻找毛利率下滑原因
创建钻取目录
加入到钻取目录
拖入方式加入钻取目录
拖入维度和指标
在指标中设置毛利率数值格式
指标排序
- 毛利率
- 毛利额
- 销售额
目录钻取
仪表板
添加仪表板
添加组件到仪表板
从左侧列表拖拽组件到仪表板
仪表板样式
交互式组件
美化仪表板
直播电商数据分析
连接数据库
本案例使用MySQL数据库
更换驱动
-
重启 BI 工程。
连接MySQL
-
数据连接
-
新建数据连接
-
在常用中点击MySQL
-
MySQL数据连接
注意:如果连接连接MySQL8,将驱动修改为
com.mysql.cj.jdbc.Driver
若出现中文乱码或日期错乱的情况,可在「数据连接URL」后加后缀,格式为:jdbc:mysql://hostname:port/database?generateSimpleParameterMetadata=true&useUnicode=true&characterEncoding=utf8&useSSL=false&serverTimezone=Asia/Shanghai
其中
serverTimezone=Asia/Shanghai – 设置以”上海时区“为准
characterEncoding=utf8 – 编码转化
-
测试数据连接
-
保存数据连接
公共数据中添加数据连接
公共数据
新建文件夹
添加DB数据表
选择对应的数据连接和该数据连接下的数据表
保存数据表
更新数据
创建分析主题
批量选择数据
保存选择数据
保存并更新数据
商品销量和销售额top10
在公共数据中导入SQL数据集
select name, description, price, total_quantity from product inner join
(select product_id,sum(quantity) total_quantity from product_order
group by product_id order by total_quantity desc limit 10) t
on t.product_id = product.product_id
原文地址:https://blog.csdn.net/qq_24330181/article/details/134734411
本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。
如若转载,请注明出处:http://www.7code.cn/show_26052.html
如若内容造成侵权/违法违规/事实不符,请联系代码007邮箱:suwngjj01@126.com进行投诉反馈,一经查实,立即删除!