When the deprecated metrics are enabled, Marathon uses Kamon.io for its metrics. You can query the metrics via the /metrics HTTP endpoint, or configure them to be reported periodically to:

--reporter_graphite
--reporter_datadog (the datadog reporter also supports statsd)

For the specific syntax, see the metrics command-line flags section.
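As a quick illustration, here is a minimal Python sketch that polls the /metrics endpoint directly. It assumes Marathon is reachable at localhost:8080 and that the endpoint returns a JSON document with a top-level "gauges" section; verify both against your deployment.

```python
import json
import urllib.request

# Assumption: Marathon's HTTP API is reachable on localhost:8080 and
# /metrics returns JSON with a top-level "gauges" section.
with urllib.request.urlopen("http://localhost:8080/metrics") as r:
    metrics = json.load(r)

uptime = metrics["gauges"]["service.mesosphere.marathon.uptime"]["value"]
print(f"Marathon uptime: {uptime} ms")
```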
Although we try to prevent unnecessary disruptions, we do not provide stability guarantees for metric names between major and minor releases. We will not change the name of a non-method-call metric (see below) in a patch release unless it is required to fix a production issue, which is very unusual.
All metric names are prefixed with a prefix that you specify and are subject to modification by graphite, datadog, or statsd. For example, if we write that the name of a metric is service.mesosphere.marathon.uptime, it might be available as stats.gauges.marathon_test.service.mesosphere.marathon.uptime in your configuration.
service.mesosphere.marathon.uptime (gauge) - The uptime of the reporting Marathon process in milliseconds. This is helpful to diagnose stability problems that cause Marathon to restart.
service.mesosphere.marathon.leaderDuration (gauge) - The duration since the last leader election, in milliseconds. This is helpful to diagnose stability problems and to see how often leader elections happen.
service.mesosphere.marathon.app.count (gauge) - The number of defined apps. Be advised that a high number of apps may lead to degraded performance.
service.mesosphere.marathon.group.count (gauge) - The number of defined groups. This number influences the performance of Marathon: if you have a high number of groups, performance will be lower than with a low number of groups. Note that each term between the slashes in an app ID corresponds to a group, as the sketch below illustrates. The app /shop/frontend is in the frontend group, which is in the shop group, which is in the root group.
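To make the nesting concrete, here is a purely illustrative Python sketch that derives the enclosing groups of an app ID from its path segments:

```python
def enclosing_groups(app_id: str):
    """Return the group IDs that contain the given app, outermost first.

    Each path segment except the last names a group; "/" is the root group.
    """
    segments = app_id.strip("/").split("/")
    groups = ["/"]
    for i in range(1, len(segments)):
        groups.append("/" + "/".join(segments[:i]))
    return groups

print(enclosing_groups("/shop/frontend"))  # ['/', '/shop']
```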
service.mesosphere.marathon.task.running.count (gauge, since v0.15) - The number of tasks that are currently running.
service.mesosphere.marathon.task.staged.count (gauge, since v0.15) - The number of tasks that are currently staged. Tasks enter the staging state after they are launched. A consistently high number of staged tasks indicates that many tasks are being stopped and restarted: either you have frequent app updates/manual restarts, or some of your apps have stability problems and are automatically restarted frequently.
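A simple watcher for this gauge might look like the following sketch. It assumes the /metrics JSON layout from the first sketch above; the 30-second poll interval and 10-task threshold are arbitrary illustrative values.

```python
import json
import time
import urllib.request

STAGED = "service.mesosphere.marathon.task.staged.count"

def staged_count():
    # Assumption: /metrics on localhost:8080 returns JSON with a
    # "gauges" section containing this metric.
    with urllib.request.urlopen("http://localhost:8080/metrics") as r:
        return json.load(r)["gauges"][STAGED]["value"]

def watch_staged(poll_seconds=30, threshold=10):
    while True:
        if staged_count() > threshold:
            print("warning: many staged tasks; check for restart loops")
        time.sleep(poll_seconds)
```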
service.mesosphere.marathon.core.task.update.impl.ThrottlingTaskStatusUpdateProcessor.queued (gauge, since v0.15) - The number of queued status updates.
service.mesosphere.marathon.core.task.update.impl.ThrottlingTaskStatusUpdateProcessor.processing (gauge, since v0.15) - The number of status updates currently being processed.
service.mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl.publishFuture (timer, since v0.15) - This timer measures how long it takes Marathon to process status updates.
service.mesosphere.marathon.state.GroupManager.queued (gauge, since v0.15) - The number of app configuration updates in the queue. Use --max_queued_root_group_updates to configure the maximum.
service.mesosphere.marathon.state.GroupManager.processing (gauge, since v0.15) - The number of app configuration updates currently being processed. Since these updates are serialized, this is either 0 or 1.
Marathon stores its permanent state in “repositories.” The important ones are:

GroupRepository for app configurations and groups.
TaskRepository for the last known task state. This is the repository with the largest data churn.

Other repositories include:

AppRepository for versioned app configurations.
DeploymentRepository for currently running deployments.
TaskFailureRepository for the last failure of every application.

We have statistics about read and write requests for each repository. To access them, substitute * with the name of a repository (see the sketch below):
service.mesosphere.marathon.state.*.read-request-time.count - The number of read requests.
service.mesosphere.marathon.state.*.read-request-time.mean - The exponentially weighted average of the read request times.
service.mesosphere.marathon.state.*.write-request-time.count - The number of write requests.
service.mesosphere.marathon.state.*.write-request-time.mean - The exponentially weighted average of the write request times.
Note: Many of the repository metrics were not measured correctly prior to v0.15.
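The following sketch shows one way to iterate the repositories and build the substituted metric names. It assumes Marathon runs on localhost:8080 and that the request-time metrics appear under a top-level "timers" section, each with "count" and "mean" fields; adjust for your reporter's actual layout.

```python
import json
import urllib.request

REPOSITORIES = [
    "GroupRepository", "TaskRepository", "AppRepository",
    "DeploymentRepository", "TaskFailureRepository",
]

with urllib.request.urlopen("http://localhost:8080/metrics") as r:
    timers = json.load(r).get("timers", {})

for repo in REPOSITORIES:
    for op in ("read", "write"):
        # Substitute the repository name for * in the metric name.
        name = f"service.mesosphere.marathon.state.{repo}.{op}-request-time"
        data = timers.get(name)
        if data is not None:
            print(f"{name}: count={data['count']} mean={data['mean']}")
```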
org.eclipse.jetty.servlet.ServletContextHandler.dispatches (timer) - The number of HTTP requests received by Marathon is available under the .count suffix of this timer.

There are more metrics around HTTP requests under the org.eclipse.jetty.servlet.ServletContextHandler prefix. For more information, consult the code.
jvm.threads.count (meter) - The total number of threads. This number should be below 500.
jvm.memory.total.used (meter) - The total number of bytes used by the Marathon JVM.
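A quick health check against these two values might look like this sketch, which again assumes the JVM metrics appear under the "gauges" section of the /metrics JSON document on localhost:8080:

```python
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/metrics") as r:
    gauges = json.load(r)["gauges"]

threads = gauges["jvm.threads.count"]["value"]
used = gauges["jvm.memory.total.used"]["value"]
print(f"threads={threads}, memory used={used} bytes")
if threads >= 500:
    # Per the guidance above, the thread count should stay below 500.
    print("warning: thread count is unusually high")
```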
These metrics are created automatically by instrumenting certain classes in our code base.
You can disable these instrumented metrics with --disable_metrics. Note that this flag disables only this code instrumentation, not all metrics.
These timers can be very valuable in diagnosing problems, but they require detailed knowledge of the inner workings of Marathon. They can also degrade performance noticeably.
Since these metric names directly correspond to class and method names in our code base, expect the names of these metrics to change if the affected code changes.
Our metrics library calculates derived metrics like “mean” and “p99” using a sliding average window algorithm. This means that every time you fetch the /metrics endpoint, you are looking at the average over the last N seconds. By default the window is 30 seconds long, but this can be configured with the --metrics_averaging_window flag.
For the most accurate results, configure your polling interval to match the size of the sliding average window, as in the sketch below.
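A minimal scraper aligned to the window might look like this. It assumes --metrics_averaging_window is left at its 30-second default, so each sample covers one full, non-overlapping averaging window.

```python
import json
import time
import urllib.request

# Assumption: the averaging window is the 30-second default.
WINDOW_SECONDS = 30

def scrape_forever():
    while True:
        with urllib.request.urlopen("http://localhost:8080/metrics") as r:
            snapshot = json.load(r)
        # ... forward the snapshot to your monitoring system here ...
        time.sleep(WINDOW_SECONDS)
```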
Statsd typically creates derived statistics (such as mean and p99) from the mean values Marathon reports. Our metrics package also reports derived statistics. To avoid accidentally aggregating statistics multiple times, be sure you know where mean values are reported and where they are computed.