
Bookmark This! A Cost-Effective Way to Analyze Massive CDN Logs

Author: Alibaba Cloud Technology | Published on 2019-08-07 11:14

Overview

CDN generates a large volume of logs. These logs can be transferred and stored in an OSS bucket, or pushed in real time to Log Service (SLS) for real-time analysis; Log Service can also be configured to deliver the logs to OSS so that massive CDN logs are retained there long term. Once the logs are in OSS, they can be analyzed with Data Lake Analytics (DLA).

This article assumes the raw CDN logs are transferred or uploaded directly to OSS, rather than delivered to OSS through the Log Service delivery pipeline (if your logs reach OSS via Log Service delivery, see https://yq.aliyun.com/articles/705888 instead). The sections below walk through creating a table in DLA and analyzing the logs.

Step 1: Create a table in DLA over the CDN logs on OSS


Assume you have already uploaded CDN logs to OSS, with each log line looking like the following example:

<pre>
[9/Jun/2015:01:58:09 +0800] 188.165.15.75 - 1542 "-" "GET http://www.aliyun.com/index.html" 200 191 2830 MISS "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "text/html"

</pre>

The fields are, in order: request time, client IP, proxy IP, response time, referer, request (method and URL), HTTP status code, request size, response size, cache hit status, User-Agent header, and file type.

To create the table in DLA, the log lines are parsed with a regular expression in which each of the above fields is captured by its own regex fragment.

For easy copy-and-paste, the sample value of each field and its corresponding regex fragment are listed below:

<pre>
[9/Jun/2015:01:58:09 +0800] 
-> (-|\\[[^\\]]*\\])
188.165.15.75
-> ([^ ]*)
-
-> ([^ ]*) 
1542 
-> ([^ ]*)
"-"
-> ([^ \"]*|\"[^\"]*\")
"GET http://www.aliyun.com/index.html"
-> ([^ \"]*|\"[^\"]*\")
200 
-> (-|[0-9]*) 
191 
-> (-|[0-9]*)
2830
-> (-|[0-9]*)
MISS
-> ([^ ]*)
"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"
-> ([^ \"]*|\"[^\"]*\")
"text/html"
-> ([^ \"]*|\"[^\"]*\")

</pre>

Of course, the order of these fields may differ in your setup; that is not a problem, simply reorder and combine the regex fragments above to match the actual field order.

For the sample log line above, the combined regular expression is:

<pre>

(-|\\[[^\\]]*\\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")

</pre>
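
Before creating the table, you can sanity-check the expression against the sample log line directly in DLA. The query below is a minimal sketch that assumes DLA's Presto-compatible regexp_like function (see the function reference linked in Step 2). Note that the brackets here are escaped with single backslashes, while the doubled backslashes in the SerDe property below are an extra layer of string escaping; depending on your SQL client you may need to adjust the escaping.

<pre>
-- Returns true if the full expression matches the sample log line end to end.
SELECT regexp_like(
  '[9/Jun/2015:01:58:09 +0800] 188.165.15.75 - 1542 "-" "GET http://www.aliyun.com/index.html" 200 191 2830 MISS "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "text/html"',
  '^(-|\[[^\]]*\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ "]*|"[^"]*") ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ "]*|"[^"]*") ([^ "]*|"[^"]*")$'
);
</pre>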

Assume the CDN logs are stored on OSS under the following directory:

oss://your_cdn_log_bucket/log/

Then create the table in DLA:

<pre>

CREATE EXTERNAL TABLE cdn_log (
 log_timestamp VARCHAR,
 access_ip VARCHAR,
 proxy_ip VARCHAR,
 response_time VARCHAR,
 referer VARCHAR,
 request VARCHAR,
 httpcode SMALLINT,
 request_size BIGINT,
 response_size BIGINT,
 cache_hit_status VARCHAR,
 ua_header VARCHAR,
 file_type VARCHAR
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
 "input.regex" = "(-|\\[[^\\]]*\\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")"
)
STORED AS TEXTFILE
LOCATION 'oss://your_cdn_log_bucket/log/'
TBLPROPERTIES ('recursive.directories' = 'true');

</pre>
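
After the table is created, a quick row count is an easy way to confirm that DLA is picking up the log files under the LOCATION directory (including subdirectories, thanks to recursive.directories):

<pre>
-- Returns the number of log lines read from oss://your_cdn_log_bucket/log/.
SELECT count(*) FROM cdn_log;
</pre>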

The table above is not partitioned. If the log data on OSS is stored in a partitioned directory layout, the definition can be changed to a partitioned table accordingly; for partitioning, see "Step 4" and "Step 5" in https://yq.aliyun.com/articles/705888. A rough sketch of such a table follows.
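
As a sketch only (the exact definition depends on how your log files are actually laid out on OSS), a hypothetical date-partitioned directory structure such as oss://your_cdn_log_bucket/log/dt=2019-06-18/ could be mapped to a partitioned table like the following, with existing partitions then loaded via MSCK REPAIR TABLE:

<pre>
CREATE EXTERNAL TABLE cdn_log_partitioned (
 log_timestamp VARCHAR,
 access_ip VARCHAR,
 proxy_ip VARCHAR,
 response_time VARCHAR,
 referer VARCHAR,
 request VARCHAR,
 httpcode SMALLINT,
 request_size BIGINT,
 response_size BIGINT,
 cache_hit_status VARCHAR,
 ua_header VARCHAR,
 file_type VARCHAR
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
 "input.regex" = "(-|\\[[^\\]]*\\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")"
)
STORED AS TEXTFILE
LOCATION 'oss://your_cdn_log_bucket/log/';

-- Discover the dt=... partition directories that already exist under LOCATION.
MSCK REPAIR TABLE cdn_log_partitioned;
</pre>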

Step 2: Query and analyze the CDN logs

<pre>

SELECT * FROM cdn_log;
-> 
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
| log_timestamp | access_ip | proxy_ip | response_time | referer | request | httpcode | request_size | response_size | cache_hit_status | ua_header | file_type |
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
| [18/Jun/2019:05:08:33 +0800] | 47.92.115.203 | - | 777 | "-" | "GET http://www.kalabandha.com/" | 200 | 201 | 7159 | MISS | "Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D))" | "text/html;charset=UTF-8" |
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+


</pre>

You can use DLA's built-in functions to further process the fields: https://help.aliyun.com/document_detail/71065.html

For example:

<pre>

SELECT date_parse(substring(log_timestamp, 2, length(log_timestamp) - 8), '%d/%b/%Y:%H:%i:%s') as log_timestamp,
 access_ip,
 proxy_ip,
 response_time,
 substring(referer, 2, length(referer) - 2) as referer,
 substring(request, 2, length(request) - 2) as request,
 httpcode,
 request_size,
 response_size,
 cache_hit_status,
 substring(ua_header, 2, length(ua_header) - 2) as ua_header,
 substring(file_type, 2, length(file_type) - 2) as file_type 
FROM cdn_log;
->
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+
| log_timestamp | access_ip | proxy_ip | response_time | referer | request | httpcode | request_size | response_size | cache_hit_status | ua_header | file_type |
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+
| 2019-06-18 05:08:33.000 | 47.92.115.203 | - | 777 | - | GET http://www.kalabandha.com/ | 200 | 201 | 7159 | MISS | Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D)) | text/html;charset=UTF-8 |
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+

</pre>
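
Building on this, typical CDN analyses reduce to simple aggregations over the cdn_log table from Step 1. The queries below are illustrative sketches; they assume cache_hit_status takes the values HIT/MISS and that the request field keeps the quoted "GET url" form shown in the samples:

<pre>
-- Overall cache hit ratio.
SELECT sum(CASE WHEN cache_hit_status = 'HIT' THEN 1 ELSE 0 END) * 100.0 / count(*) AS hit_ratio_pct
FROM cdn_log;

-- Requests and bytes served per HTTP status code.
SELECT httpcode, count(*) AS requests, sum(response_size) AS bytes_served
FROM cdn_log
GROUP BY httpcode
ORDER BY requests DESC;

-- Top 10 URLs by bytes served: strip the surrounding quotes, then take the part after "GET ".
SELECT split_part(substring(request, 2, length(request) - 2), ' ', 2) AS url,
       count(*) AS requests,
       sum(response_size) AS bytes_served
FROM cdn_log
GROUP BY 1
ORDER BY bytes_served DESC
LIMIT 10;
</pre>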

Author: Roin

This article is original content from the Yunqi Community (云栖社区) and may not be reproduced without permission.
