The BLEU metric.

BLEU is a classic metric that scores a target text against one or more reference texts by the overlap of their n-grams.
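
To make the n-gram overlap idea concrete, here is a minimal sketch of textbook, unsmoothed BLEU for a single target. It is not this metric's implementation: the function names are hypothetical, tokenization is assumed to be a plain whitespace split, and the score is assumed to lie on a 0 to 1 scale.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(target, references, max_ngram_order=4):
    """Unsmoothed BLEU for one target against one or more references."""
    tgt = target.split()  # assumption: whitespace tokenization
    refs = [r.split() for r in references]

    precisions = []
    for n in range(1, max_ngram_order + 1):
        tgt_counts = ngrams(tgt, n)
        # Clip each target n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        correct = sum(min(c, max_ref[g]) for g, c in tgt_counts.items())
        total = sum(tgt_counts.values())
        precisions.append(correct / total if total else 0.0)

    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision zeroes the score

    # Brevity penalty uses the reference length closest to the target length.
    sys_len = len(tgt)
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - sys_len), l))
    brevity_penalty = 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    log_mean = sum(math.log(p) for p in precisions) / max_ngram_order
    return brevity_penalty * math.exp(log_mean)
```

The early return for a zero precision is the case that the smoothing options in the configuration are meant to soften; this sketch does not implement any of them.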

Configuration

Configuration for the parameters of the BLEU metric:

title: "BleuConfig"
type: "object"
$defs:
  TokenizerSpec:
    type: "object"
    properties:
      name:
        type: "string"
        description: "Name of the tokenizer to be used."
      config:
        type: "object"
        description: >
          Any additional configuration that the tokenizer needs.
          This will be in the form specified by the JSONSchema for that tokenizer.
    required:
      - "name"
      - "config"
properties:
  max_ngram_order:
    type: "integer"
    description: "The maximum n-gram order to count; n-grams of order 1 through this value are used."
  smooth_method:
    type: "string"
    description: "The method used to smooth counts before calculating BLEU."
    enum:
      - "none"
      - "add_k"
      - "floor"
      - "exp"
  smooth_value:
    type: "number"
    description: "The value to use for smoothing."
  output_spans:
    type: "boolean"
    description: >
      Whether to output information about the quality of individual character
      spans within each target text.
  tokenizer:
    $ref: "#/$defs/TokenizerSpec"
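
As an illustration of the schema above, the following sketch builds one configuration and checks it by hand. The values are hypothetical: in particular, "whitespace" is an assumed tokenizer name and the empty tokenizer config is a placeholder, neither taken from this documentation. `check_bleu_config` is a hypothetical helper, not part of any library.

```python
# Hypothetical example values; "whitespace" is an assumed tokenizer name.
bleu_config = {
    "max_ngram_order": 4,
    "smooth_method": "add_k",
    "smooth_value": 1.0,
    "output_spans": True,
    "tokenizer": {"name": "whitespace", "config": {}},
}

SMOOTH_METHODS = {"none", "add_k", "floor", "exp"}

EXPECTED_TYPES = {
    "max_ngram_order": int,
    "smooth_method": str,
    "smooth_value": (int, float),
    "output_spans": bool,
    "tokenizer": dict,
}


def check_bleu_config(config):
    """Return a list of problems; an empty list means the config fits the schema."""
    problems = []
    for key, types in EXPECTED_TYPES.items():
        if key not in config:
            continue  # the schema lists no required top-level properties
        value = config[key]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        is_stray_bool = isinstance(value, bool) and types is not bool
        if is_stray_bool or not isinstance(value, types):
            problems.append(f"{key}: wrong type")

    method = config.get("smooth_method")
    if isinstance(method, str) and method not in SMOOTH_METHODS:
        problems.append(f"smooth_method: {method!r} is not an allowed value")

    tokenizer = config.get("tokenizer")
    if isinstance(tokenizer, dict):
        # TokenizerSpec requires both "name" and "config".
        for key in ("name", "config"):
            if key not in tokenizer:
                problems.append(f"tokenizer.{key}: missing")
    return problems


print(check_bleu_config(bleu_config))
```

A full JSON Schema validator would cover more cases; the hand-written check only mirrors the constraints visible in the schema text.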

Data

Accepted data format of the BLEU metric:

title: "BleuData"
type: "object"
properties:
  target:
    type: "string"
    description: "Target text to evaluate."
  references:
    type: "array"
    description: "The references to evaluate the target against."
    items:
      type: "string"
required:
  - "target"
  - "references"
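
A single record in this format might look like the sketch below. The sentences are made up for illustration, and `is_bleu_data` is a hypothetical helper that only mirrors the constraints stated in the schema.

```python
import json

# Hypothetical example record: one target scored against two references.
record = {
    "target": "the cat sat on a mat",
    "references": [
        "the cat sat on the mat",
        "a cat was sitting on the mat",
    ],
}


def is_bleu_data(obj):
    """Check that obj has a string target and a list of string references."""
    if not isinstance(obj, dict):
        return False
    if not isinstance(obj.get("target"), str):
        return False
    references = obj.get("references")
    if not isinstance(references, list):
        return False
    return all(isinstance(ref, str) for ref in references)


# The record survives a JSON round trip unchanged.
assert json.loads(json.dumps(record)) == record
print(is_bleu_data(record))
```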

Results

Format of the results of the BLEU metric:

title: "BleuResult"
type: "object"
$defs:
  BleuSpan:
    type: "object"
    properties:
      value:
        type: "number"
        description: >
          A value corresponding to the maximum n-gram length matched by this token,
          divided by the maximum n-gram length overall.
      start:
        type: "number"
        description: "The span start position (in characters, inclusive)."
      end:
        type: "number"
        description: "The span end position (in characters, exclusive)."
    required:
      - "value"
      - "start"
      - "end"
  BleuStats:
    type: "object"
    properties:
      value:
        type: "number"
        description: "BLEU score for the entire dataset (under overall) or for a single example (under examples)."
      correct:
        type: "array"
        description: >
          List of counts of correct n-grams, 1 <= n <= max_ngram_order.
          When smoothing is used, these will be the adjusted counts after smoothing.
        items:
          type: "number"
      total:
        type: "array"
        description: >
          List of counts of total n-grams, 1 <= n <= max_ngram_order.
          When smoothing is used, these will be the adjusted counts after smoothing.
        items:
          type: "number"
      precisions:
        type: "array"
        description: "List of precisions, one per n-gram order, 1 <= n <= max_ngram_order."
        items:
          type: "number"
      brevity_penalty:
        type: "number"
        description: "The brevity penalty."
      sys_len:
        type: "number"
        description: "The cumulative system length."
      ref_len:
        type: "number"
        description: "The cumulative reference length."
      spans:
        type: "array"
        description: "A list of spans within an example and matching scores."
        items:
          $ref: "#/$defs/BleuSpan"
    required:
      - "value"
properties:
  overall:
    $ref: "#/$defs/BleuStats"
  examples:
    type: "array"
    items:
      $ref: "#/$defs/BleuStats"
required:
  - "overall"
  - "examples"
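
The sketch below shows one way a result in this format could be read. All numbers are hypothetical and hand-built for the target "the cat sat on a mat" against the reference "the cat sat on the mat" with a maximum n-gram order of 4. It assumes that scores and precisions are fractions on a 0 to 1 scale, that no smoothing was applied, and that spans are emitted one per token; none of these is confirmed by the schema. `recompute_value` and `span_texts` are hypothetical helpers.

```python
import math

target = "the cat sat on a mat"

# Hypothetical per-example stats (0 to 1 scale assumed, no smoothing).
example = {
    "value": 0.5373,
    "correct": [5, 3, 2, 1],
    "total": [6, 5, 4, 3],
    "precisions": [5 / 6, 3 / 5, 2 / 4, 1 / 3],
    "brevity_penalty": 1.0,
    "sys_len": 6,
    "ref_len": 6,
    "spans": [
        {"value": 1.0, "start": 0, "end": 3},     # "the": part of a 4-gram match
        {"value": 1.0, "start": 4, "end": 7},     # "cat"
        {"value": 1.0, "start": 8, "end": 11},    # "sat"
        {"value": 1.0, "start": 12, "end": 14},   # "on"
        {"value": 0.0, "start": 15, "end": 16},   # "a": no match
        {"value": 0.25, "start": 17, "end": 20},  # "mat": unigram match only
    ],
}
# With a single example, the overall stats repeat it, here without spans.
overall = {k: v for k, v in example.items() if k != "spans"}
result = {"overall": overall, "examples": [example]}


def recompute_value(stats):
    """Rebuild the score from correct/total counts and the brevity penalty."""
    precisions = [c / t if t else 0.0
                  for c, t in zip(stats["correct"], stats["total"])]
    if min(precisions) <= 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return stats["brevity_penalty"] * math.exp(log_mean)


def span_texts(text, spans):
    """Pair each span's substring (start inclusive, end exclusive) with its value."""
    return [(text[int(s["start"]):int(s["end"])], s["value"]) for s in spans]


print(round(recompute_value(result["overall"]), 4))
for piece, score in span_texts(target, result["examples"][0]["spans"]):
    print(f"{piece!r}: {score}")
```

The `int()` casts are there because the schema types span positions as numbers rather than integers.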