Metrics aggregate, store, and compute statistics for time-based data.
- Eventual consistency
- Built-in metrics
- Using metrics in reports
- Using metrics in automations
Each metric has a unique, namespaced identifier (e.g. `cerb.workers.active`) using dot-notation. The `cerb.` prefix is reserved for built-in metrics. You can create your own metrics with any other prefix.
By convention, the prefix is most often a domain name you own in reverse order (e.g. `com.example.metric.name`) to ensure global uniqueness. This makes it easier to combine and share metrics across multiple Cerb environments.
Each metric is one of these types:

|Type|Description|
|---|---|
|`counter`|A value that increments over time (e.g. automation invocations and durations, new tickets received, failed worker logins)|
|`gauge`|A direct reading of a value at a given time (e.g. CPU load, active workers, open tickets per bucket, assignments per worker)|
Metric types automatically configure reporting and visualizations, but do not affect how the underlying data samples are stored.
A single metric may be further partitioned by adding up to three optional key/value pairs called dimensions to its samples. The possible dimensions are defined ahead of time on the metric, but can be provided in any order with the samples.
A dimension is one of the following types of value: a platform extension ID (e.g. record type), a whole number, an ID of a given record type, or any textual value.
Each unique combination of metric and dimensions can function as a separate metric.
For example, the `cerb.tickets.open.elapsed` metric has dimensions for group and bucket. When just querying the `cerb.tickets.open.elapsed` metric, the results will combine all dimensions together. The metric could also be queried for any combination of groups, or for a specific set of buckets. These dimensions could be reported independently or compared to each other.
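That roll-up behavior can be sketched in a few lines of Python. The group/bucket names and the numbers below are invented for illustration only:

```python
# Hypothetical elapsed-time totals (in hours) for one metric, keyed by
# its (group, bucket) dimension values. Names and values are illustrative.
samples = {
    ("Support", "Inbox"): 120,
    ("Support", "Billing"): 45,
    ("Sales", "Inbox"): 30,
}

# Querying the metric with no dimensions combines everything together:
total = sum(samples.values())

# Querying by one dimension (group) rolls up the remaining dimensions:
by_group = {}
for (group, bucket), value in samples.items():
    by_group[group] = by_group.get(group, 0) + value
```

Here `total` combines all dimensions (195), while `by_group` keeps the group dimension distinct so the groups can be reported independently or compared.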
For performance, all text-based dimension values are mapped to a corresponding integer value. This requires much less storage and enables fast lookups and filtering. The `record` types are more efficient because they can be directly stored and used as integers, without bloating the mapping.
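The mapping works like string interning. A minimal sketch of the idea (not Cerb's actual implementation or schema):

```python
# Map each distinct text dimension value to a small integer, so samples
# only ever store integers. A sketch of the concept, not Cerb's code.
class DimensionInterner:
    def __init__(self):
        self._ids = {}    # text -> int
        self._texts = []  # int -> text

    def intern(self, text: str) -> int:
        # Reuse the existing integer if this value was seen before.
        if text not in self._ids:
            self._ids[text] = len(self._texts)
            self._texts.append(text)
        return self._ids[text]

    def lookup(self, value_id: int) -> str:
        # Reverse the mapping when reporting results.
        return self._texts[value_id]

interner = DimensionInterner()
a = interner.intern("Support")  # first value -> 0
b = interner.intern("Sales")    # second value -> 1
c = interner.intern("Support")  # repeated value reuses 0; no new row
```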
Metrics aggregate samples at three levels of detail called periods.

|Period|Count per day|
|---|---|
|5 minutes|288|
|1 hour|24|
|1 day|1|

These base periods can be further combined in reports: 15 mins, 2 hours, 12 hours, weeks, months, quarters, and years.
A fast-moving metric can be charted every 5 minutes over the past few hours, while a longer-term trend can be visualized daily over the past few months.
The statistics within these periods are precalculated for efficient reporting. This is much faster than reporting based on the samples themselves.
Why not also store months and years? The main reason is that we're more likely to overflow the maximum value for `sum` over larger periods. By combining smaller periods, we can scale the units (e.g. bytes to gigabytes, seconds to hours) as the time range increases.
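For example, a duration metric summed in milliseconds can be rescaled as it's rolled up into larger periods. The numbers below are illustrative:

```python
# Hourly sums of execution time, in milliseconds (illustrative values:
# one full hour of work per hour).
hourly_sums_ms = [3_600_000] * 24

# Rolling 24 hourly periods into a day while scaling ms -> hours keeps
# the stored magnitude small instead of accumulating a huge raw sum.
daily_sum_ms = sum(hourly_sums_ms)          # 86,400,000 ms
daily_sum_hours = daily_sum_ms / 3_600_000  # 24.0 hours
```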
Samples collected within the same period are aggregated into a statistics set with the following values:

|Value|Description|
|---|---|
|`samples`|The number of samples collected for this period|
|`sum`|The sum of all sample values for this period|
|`min`|The smallest sample value observed this period|
|`max`|The largest sample value observed this period|
|`average`|The average sample value observed this period|

The `average` is not actually stored, since it's easily computed as the `sum` divided by `samples`. This allows new samples to be continuously added during the same period.
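A minimal sketch of such a statistics set, as an illustration of the scheme above rather than Cerb's internal code:

```python
class StatSet:
    """Aggregated statistics for one metric + dimensions + period."""

    def __init__(self):
        self.samples = 0
        self.sum = 0
        self.min = None
        self.max = None

    def add(self, value):
        # New samples can be folded in at any time during the period.
        self.samples += 1
        self.sum += value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)

    @property
    def average(self):
        # Not stored; always derived from sum and samples.
        return self.sum / self.samples

stats = StatSet()
for value in (3, 7, 5):
    stats.add(value)
# samples=3, sum=15, min=3, max=7, average=5.0
```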
For instance, a `counter` that increases by `+1` for 1,000 events within the same 5-minute period would only store a single statistics set rather than one thousand individual samples. It would look like:

|samples|sum|min|max|
|---|---|---|---|
|1,000|1,000|1|1|
For a counter that increments by `+1`, the most useful statistic is `sum`. The `min`, `max`, and `average` of the samples are always `1`.
However, counters can also increase by values other than `+1`. We might have a counter for the time automations spend executing in milliseconds. In that case, the `min` (fastest execution time in the period), `max` (slowest execution time in the period), and `average` (average execution time in the period) would be useful in addition to the overall `sum` (total execution time in the period).
A `gauge` that takes a reading every 5 minutes would store a single statistics set per hour rather than 12 individual samples.
Let’s assume a gauge measured the current number of open tickets every 15 minutes for an hour, and it took these four readings: 38, 42, 45, 49. It would store this hourly statistics set:

|samples|sum|min|max|
|---|---|---|---|
|4|174|38|49|
For a gauge, the `min`, `max`, and `average` statistics are the most useful. The `sum` is almost never useful directly, but it is used with `samples` to compute the `average`.
From these results, we could report that the average number of open tickets during this hour was `43.5`, with a high of `49` and a low of `38`. If we sampled 3x more often (every 5 minutes instead of 15 minutes), we’d still only have a single statistics set for the hour, but we’d have a higher `sum` value, and the `average` would be more precise.
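Working through that hour in Python, with four illustrative readings chosen to match the statistics quoted above:

```python
# Four 15-minute open-ticket readings for one hour (illustrative values
# consistent with the statistics set described above).
readings = [38, 42, 45, 49]

samples = len(readings)                   # 4
total = sum(readings)                     # 174 (the stored `sum`)
low, high = min(readings), max(readings)  # 38 and 49
average = total / samples                 # 43.5 (derived, not stored)
```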
The daily statistics set for this gauge would record the `average` for the entire day.
The ideal frequency of measuring samples for a gauge depends on how that metric will be used. Something that changes infrequently, like the size of your Cerb database in gigabytes, could be measured once daily, weekly, or even monthly. There would be no value in sampling that every minute.
The more expensive a reading, the less often it should be sampled. When you create a report with sparse samples on a gauge metric, Cerb will fill in the missing samples by repeating the most recent prior value until the next value.
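The gap-filling behavior can be sketched as carrying the last reading forward; a simplified illustration of the idea, not Cerb's charting code:

```python
# Gauge values per charted period; None marks a period with no sample.
# Gaps are filled by repeating the most recent prior value.
def fill_gaps(values):
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# e.g. a database-size gauge (in GB) sampled only occasionally:
print(fill_gaps([100, None, None, 120, None]))  # [100, 100, 100, 120, 120]
```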
Keeping the data for high-frequency periods has a cost. There are `288` 5-minute periods per standard day, which could produce up to `105,120` 5-minute periods per standard year. That’s further multiplied by the distinct dimensions per metric. More storage and processing power is required with each metric and dimension you add.
The length of time that data is kept is called its retention.
Generally, the farther back you look in time, the less useful the smaller periods are (e.g. 5 minutes).
If you’re reporting on data from a year ago, it probably doesn’t matter what a metric’s average value was for a specific 5-minute period. For prior period comparisons, it’s often most useful to compare with larger periods like days, weeks, months, or years.
Cerb automatically removes aging high-frequency data to keep storage and processing costs lower.
The default retention time per period is:

|Period|Retention|
|---|---|
|5 minutes|24 hours|
The 5-minute period is a “sliding window” over the trailing 24 hours. When a new 5-minute period starts, the oldest period (crossing 24 hours in age) is removed. This maintains a constant set of the `288` most recent 5-minute periods per metric.
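That sliding window can be sketched with a bounded queue, assuming one entry per 5-minute period:

```python
from collections import deque

# Keep only the most recent 288 five-minute periods (24 hours' worth).
PERIODS_PER_DAY = 24 * 12  # 288

window = deque(maxlen=PERIODS_PER_DAY)

# As each new period's statistics arrive, the oldest entry is dropped
# automatically once the window is full.
for period_index in range(300):
    window.append(period_index)

len(window)  # 288 -- constant size
window[0]    # 12 -- the 12 oldest periods have been pruned
```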
## Eventual consistency

When multiple samples are collected in a short time period, with the same metric and dimensions, they are combined into a single statistics set. This single set is then pushed into a background queue for processing.
With this approach, you can efficiently collect samples at very high volumes without sacrificing performance.
For instance, the email parser is not delayed by sampling many metrics for every new inbound message (e.g. language, region, org, industry, classification, intent, etc.).
Because samples are processed in the background, the most recent data may not show up on charts for a few minutes.
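The collection path can be sketched as: pre-aggregate in memory, then enqueue one combined statistics set per metric and dimension combination. This is a simplification of the idea, not Cerb's actual code, and the metric and dimension names are invented:

```python
import queue

# Combine same-metric, same-dimension samples before queuing, so a burst
# of events enqueues one statistics set instead of many raw samples.
pending = {}               # (metric, dims) -> [samples, sum, min, max]
background = queue.Queue()  # processed later, off the request path

def record(metric, dims, value):
    key = (metric, tuple(sorted(dims.items())))
    if key not in pending:
        pending[key] = [1, value, value, value]
    else:
        s = pending[key]
        s[0] += 1                # samples
        s[1] += value            # sum
        s[2] = min(s[2], value)  # min
        s[3] = max(s[3], value)  # max

def flush():
    # Runs periodically; the foreground caller never waits on storage.
    for key, stats in pending.items():
        background.put((key, stats))
    pending.clear()

for _ in range(500):
    record("com.example.mail.received", {"org": "Acme"}, 1)
flush()
# background now holds one combined set, not 500 individual samples
```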
## Built-in metrics

These metrics are managed automatically by Cerb:

- How long automations are executed
- How often automations are executed
- How long behaviors are executed
- How often behaviors are executed
- How many successful messages are sent through a mail transport
- How many unsuccessful messages are attempted through a mail transport
- How often each worker searches for a given record type
- Snippet usage over time by worker
- Open ticket counts over time by group and bucket
- How long tickets spent in the open status by group and bucket (`cerb.tickets.open.elapsed`)
- How often webhooks are executed
- Seat usage by workers (`cerb.workers.active`)
## Using metrics in reports
## Using metrics in automations

You can record samples on a metric from automations with the `metric.increment` command.