Elasticsearch Metric Aggregation

Author: darcyaf | Published 2020-01-03 15:21

Avg Aggregation (average)

  • Using a field
POST /exams/_search?size=0
{
    "aggs" : {
        "avg_grade" : { "avg" : { "field" : "grade" } }
    }
}
  • Using a script
POST /exams/_search?size=0
{
    "aggs" : {
        "avg_grade" : {
            "avg" : {
                "script" : {
                    "source" : "doc.grade.value"
                }
            }
        }
    }
}

Note: use the missing parameter to supply a default value for documents that have no value in the field.
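
The effect of missing can be sketched in plain Python. This is only an illustration of the semantics described in the note, not Elasticsearch code; the function name avg_with_missing and the sample grades are made up for the example, and None stands in for a document without the field.

```python
from typing import Iterable, Optional


def avg_with_missing(values: Iterable[Optional[float]],
                     missing: Optional[float] = None) -> Optional[float]:
    """Average the values; None entries are skipped unless `missing` is given."""
    total = 0.0
    count = 0
    for v in values:
        if v is None:
            if missing is None:
                continue  # document has no value and is ignored
            v = missing   # document is treated as if it had this value
        total += v
        count += 1
    return total / count if count else None


grades = [80.0, None, 100.0]
print(avg_with_missing(grades))               # documents without a grade are ignored
print(avg_with_missing(grades, missing=0.0))  # documents without a grade count as 0
```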

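Weighted Avg Aggregation (weighted average)

As a formula, a weighted average is ∑(value * weight) / ∑(weight).

  • Field
    If the value field holds multiple values, each one is treated as a separate value and paired with the document's weight.
    The weight field, however, must not be an array; otherwise the error Encountered more than one weight for a single document is thrown.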
POST /exams/_doc?refresh
{
    "grade": [1, 2, 3],
    "weight": 2
}

POST /exams/_search
{
    "size": 0,
    "aggs" : {
        "weighted_grade": {
            "weighted_avg": {
                "value": {
                    "field": "grade"
                },
                "weight": {
                    "field": "weight"
                }
            }
        }
    }
}
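  • Script

TODO: In a script the weight can be an array, but only the first value is used; when both the weight and the value are arrays, the behavior seems to differ again.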

POST /exams/_search
{
    "size": 0,
    "aggs" : {
        "weighted_grade": {
            "weighted_avg": {
                "value": {
                    "script": "doc.grade.value + 1"
                },
                "weight": {
                    "script": "doc.weight.value + 1"
                }
            }
        }
    }
}
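
The formula ∑(value * weight) / ∑(weight) and the multi-value rule for the field case can be checked with a small Python sketch. It is a model of the behavior described in this section, not the Elasticsearch implementation; the function name weighted_avg and the dictionary-based documents are assumptions made for the example.

```python
from typing import Dict, Sequence


def weighted_avg(docs: Sequence[Dict[str, object]],
                 value_field: str, weight_field: str) -> float:
    """Compute sum(value * weight) / sum(weight) over all documents."""
    numerator = 0.0
    denominator = 0.0
    for doc in docs:
        weight = doc[weight_field]
        if isinstance(weight, (list, tuple)):
            # Mirrors the error quoted above for array-valued weights.
            raise ValueError("Encountered more than one weight for a single document")
        values = doc[value_field]
        if not isinstance(values, (list, tuple)):
            values = [values]
        for v in values:
            # Every value of a multi-valued field shares the document's weight.
            numerator += v * weight
            denominator += weight
    return numerator / denominator


# The document indexed above: (1*2 + 2*2 + 3*2) / (2 + 2 + 2)
print(weighted_avg([{"grade": [1, 2, 3], "weight": 2}], "grade", "weight"))
```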

Cardinality Aggregation

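An approximate count of distinct values.

  • Uses the HyperLogLog++ algorithm; memory usage is about (precision_threshold * 8) bytes
  • Above precision_threshold, counts may become fuzzy. In general, the larger precision_threshold is, the more accurate the count; the maximum is 40000
  • To count high-cardinality string fields, you can generate hashes yourself or precompute them with mapper-murmur3, which is much faster and saves CPU
  • Use missing for documents without a value. It has no effect inside a script, though: referencing an unmapped field there fails with No field found for [sd] in mapping with types []
  • Field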
POST /exams/_search?size=0
{
  "aggs":{
    "type_count": {
      "cardinality":{
        "field": "weight",
        "precision_threshold": 100
      }
    }
  }
}
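  • Script

The dot form (for example doc.grade.value) and the bracket form (for example doc['type'].value) are equivalent ways to access a field.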

POST /sales/_search?size=0
{
    "aggs" : {
        "type_promoted_count" : {
            "cardinality" : {
                "script": {
                    "lang": "painless",
                    "source": "doc['type'].value + ' ' + doc['promoted'].value"
                }
            }
        }
    }
}
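
The memory rule of thumb from the list above (precision_threshold * 8 bytes, capped at 40000) is easy to turn into a quick estimate. The helper below is a back-of-the-envelope sketch; the names are made up, and treating values above 40000 as 40000 is an assumption based on the stated maximum.

```python
MAX_PRECISION_THRESHOLD = 40000  # upper limit mentioned above


def cardinality_memory_bytes(precision_threshold: int) -> int:
    """Rough per-aggregation memory estimate: precision_threshold * 8 bytes."""
    # Assumption: thresholds above the maximum behave like the maximum.
    effective = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return effective * 8


for threshold in (100, 3000, 40000):
    print(threshold, cardinality_memory_bytes(threshold), "bytes")
```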

Extended Stats Aggregation

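Extended statistics aggregation.
An extended version of the stats aggregation that additionally returns the sum of squares (sum_of_squares), the variance (variance), the standard deviation (std_deviation), and the standard deviation bounds (std_deviation_bounds; by default the mean ± 2 standard deviations, with the multiplier configurable through the sigma parameter).
The missing parameter is supported.

  • Field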
GET /exams/_search
{
    "size": 0,
    "aggs" : {
        "grades_stats" : {
            "extended_stats" : {
                "field" : "grade",
                "sigma" : 3 
            }
        }
    }
}
  • Script
GET /exams/_search
{
    "size": 0,
    "aggs" : {
        "grades_stats" : {
            "extended_stats" : {
                "script" : {
                    "source" : "doc['grade'].value",
                    "lang" : "painless"
                 }
             }
         }
    }
}
GET /exams/_search
{
    "size": 0,
    "aggs" : {
        "grades_stats" : {
            "extended_stats" : {
                "field" : "grade",
                "script" : {
                    "lang" : "painless",
                    "source": "_value * params.correction",
                    "params" : {
                        "correction" : 1.2
                    }
                }
            }
        }
    }
}
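
To make the extra fields concrete, here is a small Python sketch that derives them from a list of numbers. It assumes the population variance (dividing by the count), which is an assumption rather than something stated in this article, and the function name extended_stats is only illustrative.

```python
import math
from typing import Dict, Sequence


def extended_stats(values: Sequence[float], sigma: float = 2.0) -> Dict[str, object]:
    """Compute the additional statistics described above for a list of numbers."""
    count = len(values)
    if count == 0:
        raise ValueError("need at least one value")
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    avg = total / count
    # Assumption: population variance, E[x^2] - (E[x])^2, clamped against rounding.
    variance = max(sum_of_squares / count - avg * avg, 0.0)
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "avg": avg,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


print(extended_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], sigma=3))
```

With sigma set to 3, as in the field request above, the bounds widen to the mean ± 3 standard deviations.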

Geo Bounds Aggregation

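Computes the bounding box that contains all the geo coordinates of a field.
wrap_longitude: whether the bounding box is allowed to overlap the international date line.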

POST /museums/_search?size=0
{
    "query" : {
        "match" : { "name" : "musée" }
    },
    "aggs" : {
        "viewport" : {
            "geo_bounds" : {
                "field" : "location", 
                "wrap_longitude" : true 
            }
        }
    }
}
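
The bounding box itself is just the extreme latitudes and longitudes of the matching points. The sketch below shows that idea in Python for the simple case that does not cross the date line; wrap_longitude handling is deliberately left out, and the function name and sample coordinates are made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def geo_bounds(points: Iterable[Tuple[float, float]]) -> Dict[str, Dict[str, float]]:
    """Bounding box of (lat, lon) pairs; no date-line wrapping is attempted."""
    pts = list(points)
    if not pts:
        raise ValueError("need at least one point")
    lats = [lat for lat, _ in pts]
    lons = [lon for _, lon in pts]
    return {
        "top_left": {"lat": max(lats), "lon": min(lons)},
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }


# Two hypothetical locations given as (lat, lon)
print(geo_bounds([(48.8566, 2.3522), (48.8606, 2.3376)]))
```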

Geo Centroid Aggregation

Computes the centroid of the geo coordinates.

POST /museums/_search?size=0
{
    "aggs" : {
        "cities" : {
            "terms" : { "field" : "city.keyword" },
            "aggs" : {
                "centroid" : {
                    "geo_centroid" : { "field" : "location" }
                }
            }
        }
    }
}
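
The request above nests geo_centroid under a terms aggregation, so one centroid is returned per city. A rough Python model of that shape follows. It assumes the centroid of a set of points is the arithmetic mean of their latitudes and longitudes, which is a simplification for illustration; the function name and sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def centroids_by_city(docs: Iterable[dict]) -> Dict[str, dict]:
    """Group documents by city and average their (lat, lon) locations."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # city -> [lat_sum, lon_sum, count]
    for doc in docs:
        lat, lon = doc["location"]
        acc = sums[doc["city"]]
        acc[0] += lat
        acc[1] += lon
        acc[2] += 1
    return {
        city: {"lat": s[0] / s[2], "lon": s[1] / s[2], "count": s[2]}
        for city, s in sums.items()
    }


docs = [
    {"city": "A", "location": (10.0, 20.0)},
    {"city": "A", "location": (20.0, 40.0)},
    {"city": "B", "location": (5.0, 5.0)},
]
print(centroids_by_city(docs))
```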

Max/Min Aggregation

Maximum / minimum value.

POST /sales/_search?size=0
{
    "aggs" : {
        "max_price" : { "max" : { "field" : "price" } }
    }
}

Percentiles Aggregation

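Percentiles: conceptually, the values of a numeric field are sorted and, for each requested percentage, the value at or below which that share of observations falls is returned. The default percentiles are [ 1, 5, 25, 50, 75, 95, 99 ].
Use the percents parameter to request specific percentiles.
Use the keyed parameter to choose between a map keyed by percentile (keyed: true, the default) and an array of key/value objects (keyed: false).
The missing parameter is supported.
The default algorithm is TDigest (introduced by Ted Dunning in Computing Accurate Quantiles using T-Digests):

  • Accuracy is proportional to q(1-q), so extreme percentiles (such as 99%) are more accurate than less extreme ones (such as the median)
  • The smaller the data set, the higher the accuracy
  • The compression parameter caps the number of nodes at 20 * compression, trading accuracy against memory; the default is 100
  • HDR Histogram (High Dynamic Range Histogram) can be faster than t-digest, but it only supports positive values. number_of_significant_value_digits sets how many significant digits are kept, and memory usage can become high when the range of the data is unknown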
GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "tdigest": {
                  "compression" : 200 
                }
            }
        }
    }
}
GET exams/_search
{
    "size": 0,
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "grade",
                "percents" : [95, 99, 99.9],
                "hdr": { 
                  "number_of_significant_value_digits" : 3 
                },
                "keyed":false
            }
        }
    }
}
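
What the aggregation approximates can be shown with an exact computation on a small list. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions and is chosen here only for simplicity; it is not TDigest or HDR Histogram, and the function name is made up.

```python
import math
from typing import Dict, Sequence


def exact_percentiles(values: Sequence[float],
                      percents: Sequence[float] = (1, 5, 25, 50, 75, 95, 99)
                      ) -> Dict[float, float]:
    """Exact nearest-rank percentiles: sort, then pick the value at each rank."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    n = len(ordered)
    result = {}
    for p in percents:
        # Nearest rank: smallest rank such that at least p% of values are at or below it.
        rank = max(1, math.ceil(p * n / 100))
        result[float(p)] = ordered[rank - 1]
    return result


print(exact_percentiles([10, 20, 30, 40], percents=(50, 75, 100)))
```

Sorting every value is exact but needs memory proportional to the data, which is the cost the approximate algorithms above avoid.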

Percentile Ranks Aggregation

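Similar to the percentiles aggregation above. Percentiles answer "what is the largest value among the lowest x% of the data?", while percentile ranks answer "what share of the data is below the value x?".

The parameters are similar too; both hdr and compression can be used.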

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_ranks" : {
            "percentile_ranks" : {
                "field" : "load_time", 
                "values" : [500, 600]
            }
        }
    }
}
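
The inverse relationship to percentiles can be illustrated with an exact computation. The sketch below counts the share of observations at or below each requested value; using "at or below" rather than "strictly below" is an assumption made for the example, and the function name and sample load times are hypothetical.

```python
from typing import Dict, Sequence


def percentile_ranks(observations: Sequence[float],
                     values: Sequence[float]) -> Dict[float, float]:
    """For each value, the percentage of observations that are at or below it."""
    if not observations:
        raise ValueError("need at least one observation")
    n = len(observations)
    return {
        float(v): 100.0 * sum(1 for o in observations if o <= v) / n
        for v in values
    }


load_times = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(percentile_ranks(load_times, [500, 600]))
```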

Scripted Metric Aggregation

Computes a metric directly from scripts.
It follows the map-reduce model: you write an init script, a map script, a combine script, and a reduce script.

POST exams/_search?size=0
{
    "query" : {
        "match_all" : {}
    },
    "aggs": {
        "profit": {
            "scripted_metric": {
                "init_script" : "state.transactions = []", 
                "map_script" : "state.transactions.add(doc.type.value == 'sale' ? doc.amount.value : -1 * doc.amount.value)",
                "combine_script" : "double profit = 0; for (t in state.transactions) { profit += t } return profit",
                "reduce_script" : "double profit = 0; for (a in states) { profit += a } return profit"
            }
        }
    }
}
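
The four scripts in the request above can be mimicked in Python to see how the phases fit together. This is a simplified model, assuming that init, map, and combine run once per shard and reduce runs once over the per-shard results; the function names and the two sample shards are made up for illustration.

```python
from typing import Dict, List


def init_script() -> Dict[str, list]:
    return {"transactions": []}                      # state.transactions = []


def map_script(state: Dict[str, list], doc: dict) -> None:
    amount = doc["amount"]
    # Sales count as income, everything else as cost.
    state["transactions"].append(amount if doc["type"] == "sale" else -1 * amount)


def combine_script(state: Dict[str, list]) -> float:
    profit = 0.0
    for t in state["transactions"]:
        profit += t
    return profit                                    # one number per shard


def reduce_script(states: List[float]) -> float:
    profit = 0.0
    for a in states:
        profit += a
    return profit                                    # final result


def scripted_metric(shards: List[List[dict]]) -> float:
    per_shard = []
    for shard_docs in shards:
        state = init_script()
        for doc in shard_docs:
            map_script(state, doc)
        per_shard.append(combine_script(state))
    return reduce_script(per_shard)


shards = [
    [{"type": "sale", "amount": 80}, {"type": "cost", "amount": 10}],
    [{"type": "sale", "amount": 130}, {"type": "cost", "amount": 30}],
]
print(scripted_metric(shards))
```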

Stats Aggregation

Summary statistics aggregation.
Returns min, max, sum, count, and avg.

POST /exams/_search?size=0
{
    "aggs" : {
        "grades_stats" : { "stats" : { "field" : "grade" } }
    }
}
POST /sales/_search?size=0
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "match" : { "type" : "hat" }
            }
        }
    },
    "aggs" : {
        "square_hats" : {
            "sum" : {
                "field" : "price",
                "script" : {
                    "source": "_value * _value"
                }
            }
        }
    }
}

Top Hits Aggregation

Top-N aggregation: returns the top matching documents for each bucket.

POST /example/_search?size=0
{
  "aggs":{
    "a_a":{
      "terms":{
        "field": "a",
        "order":{
          "top_hit": "asc"
        },
        "size": 2
      },
      "aggs":{
        "top_tags_hits":{
          "top_hits": {
            "size": 2,
            "_source":{
              "includes": ["a","b"]
            },
            "sort": [
              {
                "b":{
                  "order":"desc"
                }
              }
              ]
          }
        },
        "top_hit":{
          "avg":{
            "script": {
              "source": "doc.b"
            }
          }
        }
      }
    }
  }
}
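  • Within terms, buckets can be ordered by a custom aggregation metric (top_hit in the request above)
  • _source selects which fields are returned for each hit
  • sort inside top_hits controls the order of the hits, that is, whether you get the N smallest or the N largest
  • Top-N over nested documents
    The request below aggregates over doc.comments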
GET /sales/_search
{
  "query": {
    "term":{"tags":"car"}
  },
  "aggs":{
    "by_sale":{
      "nested": {
        "path": "comments"
      },
      "aggs":{
        "by_user":{
          "terms":{
            "field": "comments.username",
            "size": 2
          },
          "aggs": {
            "by_nested": {
              "top_hits": {
                "size": 4
              }
            }
          }
        }
      }
    }
  }
}
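
The first request in this section combines three steps: bucket by a, order the buckets by the average of b, and keep the top hits per bucket sorted by b. A plain-Python model of those steps may make the interplay clearer. It is a sketch under simplifying assumptions (single-valued fields, in-memory documents), and the function name and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def terms_with_top_hits(docs: List[Dict[str, object]],
                        bucket_size: int = 2, hits_size: int = 2) -> List[dict]:
    """Group by `a`, order buckets by avg(b) ascending, keep top hits by `b` descending."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["a"]].append(doc)

    buckets = []
    for key, members in groups.items():
        avg_b = sum(d["b"] for d in members) / len(members)            # the "top_hit" metric
        hits = sorted(members, key=lambda d: d["b"], reverse=True)[:hits_size]
        buckets.append({
            "key": key,
            "top_hit": avg_b,
            # Like _source.includes: return only fields a and b.
            "hits": [{"a": d["a"], "b": d["b"]} for d in hits],
        })

    buckets.sort(key=lambda b: b["top_hit"])                           # "order": {"top_hit": "asc"}
    return buckets[:bucket_size]


docs = [
    {"a": "x", "b": 1, "c": "ignored"}, {"a": "x", "b": 2}, {"a": "x", "b": 3},
    {"a": "y", "b": 10}, {"a": "y", "b": 20},
    {"a": "z", "b": 5},
]
for bucket in terms_with_top_hits(docs):
    print(bucket)
```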

Value Count Aggregation

Counts the values extracted from the matching documents. It is similar to the Cardinality Aggregation but does not deduplicate.

POST /example/_search?size=0
{
    "aggs" : {
        "types_count" : { "value_count" : { "field" : "c" } }
    }
}
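
The difference between the two counts is the same as the difference between the length of a list and the size of a set. The snippet below is only an illustration with made-up values; note that it computes an exact distinct count, whereas the Cardinality Aggregation is approximate.

```python
from typing import Hashable, Sequence


def value_count(values: Sequence[Hashable]) -> int:
    """Count every value, duplicates included."""
    return len(values)


def distinct_count(values: Sequence[Hashable]) -> int:
    """Count each distinct value once (exact, unlike HyperLogLog++)."""
    return len(set(values))


types = ["hat", "bag", "hat", "t-shirt", "hat"]
print(value_count(types), distinct_count(types))
```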

Median Absolute Deviation Aggregation

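Median absolute deviation is a robust measure of variability, computed as median(|median(X) - X_i|).
Use compression to trade performance against accuracy; the default is 1000.
When a single very large outlier is added, the mean becomes very large while the median absolute deviation stays small, as the request and response below show: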

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_average": {
      "avg": {
        "field": "rating"
      }
    },
    "review_variability": {
      "median_absolute_deviation": {
        "field": "rating" 
      }
    }
  }
}
{
  "took" : 447,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 8,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "review_average" : {
      "value" : 15.625
    },
    "review_variability" : {
      "value" : 1.5
    }
  }
}
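
The robustness claim can be reproduced exactly on a small list with Python's statistics module. The ratings below are made-up sample data, not the data behind the response above, and the computation is exact rather than the approximation the aggregation uses.

```python
import statistics
from typing import Sequence


def median_absolute_deviation(values: Sequence[float]) -> float:
    """median(|median(X) - X_i|), computed exactly."""
    med = statistics.median(values)
    return statistics.median([abs(med - v) for v in values])


ratings = [5, 4, 5, 3, 4, 5, 4]
with_outlier = ratings + [100]

# The mean jumps when the outlier is added; the median absolute deviation barely moves.
print(statistics.mean(ratings), median_absolute_deviation(ratings))
print(statistics.mean(with_outlier), median_absolute_deviation(with_outlier))
```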
