Marathon 1.9 brings support for multi-role, allowing a single Marathon instance to launch services for more than one role. The role
field is added to pods and services, with the possible values restricted to either the default Mesos role (--mesos_role
) or the associated group role. A group-role convention is introduced to enforce a relationship between Marathon groups and the corresponding roles for services within them. In order to assist users that wish to migrate their Marathon instance to use multi-role, an incremental pathway is defined and explained in this document.
Multi-role functionality itself is enabled by default; no command-line parameter is required to enable it. However, there are run-time configuration options (group enforceRole
setting and service role
field) to control it, along with the command-line parameter --new_group_enforce_role
to affect default behavior.
While it is possible to partially use multi-role in Marathon, it is recommended that Marathon be fully “upgraded” to multi-role. This is to say:
enforceRole
is enabled for all top-level groups--new_group_enforce_role
command-line parameter is set to top
.Each Marathon service now contains a role
field, which defaults to the role specified by the command line parameter --mesos_role
, which defaults to *
. For example, if the command line parameter --mesos_role
is slave_public
(as is the default in DC/OS), and the following service is posted to Marathon /v2/apps
:
{
"id": "/dev/bigbusiness",
"cmd": "sleep 3600",
"instances": 10,
"cpus": 0.05,
"mem": 128
}
… then, the resulting Marathon service will receive the default role slave_public
, as follows:
{
"id": "/dev/bigbusiness",
"cmd": "set; sleep 3600",
"instances": 10,
"cpus": 0.05,
"mem": 128,
"role": "slave_public"
}
Marathon will then revive offers for the role slave_public
and match offers allocated to the role slave_public
in order to launch this service.
The group role is the name of the top-level group, as illustrated by the following table:
Full service id | Marathon group | Group-role |
---|---|---|
/dev/bigbusiness | /dev | “dev” |
/bigbusiness | / | N/A |
/dev/team1/bigbusiness | /dev/team1 | “dev” |
/frontend/ui | /frontend | “frontend” |
A service may only be assigned one of two possible roles: the default role (that which is specified by --mesos_role
), or, the group role. Note that services placed directly in the root group do not have a group-role.
Migrating a service to multi-role means modifying an existing service role from using the default Mesos role to using the group role.
Migrating ephemeral services is straightforward: update the role, and Marathon relaunches the service, accepting and using resources from offers for the new role. However, there is a caveat when migrating services with reservations.
In order to migrate the bigbusiness
app described earlier in this document from using the default Mesos role to the group role, you can simply post the following:
echo '{"role": "dev"} | curl -X PATCH http://local:8080/v2/apps/dev/bigbusiness -H "Content-Type: application/json" --data "@-"
This will trigger a deployment where all of the old bigbusiness
instances running as the role slave_public
are killed, and replaced by new instances running as the role dev
. Further, Marathon will update its FrameworkInfo
information with Mesos to also include the role dev
, so that it can begin receiving offers allocated to the role. This may require the Mesos create role
permission if the role does not already exist. See the section below on Mesos permission caveats.
Migrating resident services is a bit different, since changing the role for reserved resources would result in the reservations being destroyed (and subsequently the volume data), and this is often not desired. Instead of wiping old reserved instances, Marathon allows a resident service role to be changed for newly created instances, continuing to leave the existing instances running with the old role.
For example, say that you have deployed the following resident service:
$ cat <<EOF | curl -v -X POST http://localhost:8080/v2/pods -H "Content-Type: application/json" --data "@-"
{
"id": "/dev/bigbusiness",
"containers": [
{
"name": "sleep1",
"exec": { "command": { "shell": "pwd; ls -la; date >> data/launches.txt; sleep 1000" } },
"resources": { "cpus": 0.1, "mem": 32 },
"volumeMounts": [{
"name": "data",
"mountPath": "data"
}]
}
],
"volumes": [
{
"name": "data",
"persistent": {"size": 16, "type": "root"}
}
]
}
EOF
When you check the roles for the running instances, you will see that the instances were launched with the role slave_public
:
$ curl http://localhost:8080/v2/pods/dev/bigbusiness::status | jq '{specRole: .spec.role, instanceRoles: [.instances[].role]}'
{
"specRole": "slave_public",
"instanceRoles": [
"slave_public"
]
}
When you attempt to PUT the update with the role dev
to the pods endpoint, as follows:
$ cat <<EOF | curl -v -X PUT http://localhost:8080/v2/pods/dev/bigbusiness -H "Content-Type: application/json" --data "@-"
{
"id": "/dev/bigbusiness",
"containers": [
{
"name": "sleep1",
"exec": { "command": { "shell": "pwd; ls -la; date >> le-data/launches.txt; sleep 1000" } },
"resources": { "cpus": 0.1, "mem": 32 },
"volumeMounts": [{
"name": "data",
"mountPath": "le-data"
}]
}
],
"volumes": [
{
"name": "data",
"persistent": {"size": 16, "type": "root"}
}
],
"role": "dev"
}
EOF
… the API request will fail, indicating that resident services cannot change the role for existing instances. In case you would like to proceed, so that new instances are allocated to the new role, then you can do so by appending the parameter “?force=true”:
cat <<EOF | curl -v -X PUT http://localhost:8080/v2/pods/dev/bigbusiness?force=true -H "Content-Type: application/json" --data "@-"
{
"id": "/dev/bigbusiness",
"containers": [
{
"name": "sleep1",
"exec": { "command": { "shell": "pwd; ls -la; date >> le-data/launches.txt; sleep 1000" } },
"resources": { "cpus": 0.1, "mem": 32 },
"volumeMounts": [{
"name": "data",
"mountPath": "le-data"
}]
}
],
"volumes": [
{
"name": "data",
"persistent": {"size": 16, "type": "root"}
}
],
"role": "dev"
}
EOF
The deployment may take a little bit of time to succeed. Check your pod spec role versus the instance role:
$ curl http://localhost:8080/v2/pods/dev/bigbusiness::status | jq '{specRole: .spec.role, instanceRoles: [.instances[].role]}'
{
"specRole": "dev",
"instanceRoles": [
"slave_public"
]
}
You can see your spec’s role was updated, but the instance you launched prior to the update continues to use the role slave_public
. This means your reservation was preserved.
Let’s scale up the pod!
cat <<EOF | curl -v -X PUT http://localhost:8080/v2/pods/dev/bigbusiness?force=true -H "Content-Type: application/json" --data "@-"
{
"id": "/dev/bigbusiness",
"scaling": {"instances": 2, "kind": "fixed"},
"containers": [
{
"name": "sleep1",
"exec": { "command": { "shell": "pwd; ls -la; date >> le-data/launches.txt; sleep 1000" } },
"resources": { "cpus": 0.1, "mem": 32 },
"volumeMounts": [{
"name": "data",
"mountPath": "le-data"
}]
}
],
"volumes": [
{
"name": "data",
"persistent": {"size": 16, "type": "root"}
}
],
"role": "dev"
}
EOF
Then, check the role of your instances now that you have a freshly added instance after the role change:
curl http://localhost:8080/v2/pods/dev/bigbusiness::status | jq '{role: .spec.role, instanceRoles: [.instances[].role]}'
{
"role": "dev",
"instanceRoles": [
"slave_public",
"dev"
]
}
You can see that the new pod instance has the new role, dev
, while the old pod instance continues to have the old role, slave_public
. Moving forward, all new instances of the bigbusiness pod will be launched with the new role, dev
.
Marathon allows the mixed role usage in order to support a reasonable, incremental upgrade pathway for users wishing to migrate their existing Marathon instances to use multi-role. However, as stated in the introduction, this isn’t the ideal. Ideally, Marathon is configured to enforce services to use the group role. This is why the enforceRole
property exists on groups.
enforceRole
The group enforceRole
property is settable only for top-level groups; when enabled, it forces all new services in that group (or in subgroups) to use the group role, and receive the group role as a default. Existing services still must be migrated to multi-role, by modifying the respective role
field for each of the existing services.
As an example, say you have deployed the following bigbusiness app, using the default role:
{
"id": "/dev/bigbusiness",
"cmd": "set; sleep 3600",
"instances": 10,
"cpus": 0.05,
"mem": 128,
"role": "slave_public"
}
Now, use the PATCH method to enable role enforcement for a group:
$ echo '{"enforceRole": true}' | curl -f -X PATCH http://localhost:8080/v2/groups/dev -H "Content-Type: application/json" --data "@-"
Note: the API will simply respond with an empty entity, with a 200 status if the operation is successful. You can check the value for the enforceRole
field:
$ curl http://localhost:8080/v2/groups/dev | jq .enforceRole
true
Now that enforceRole
is enabled for the top-level group /dev
, you can see that deploying a new app will receive the group role, dev
, by default:
$ cat <<EOF | curl -f -X POST http://localhost:8080/v2/apps -H "Content-Type: application/json" --data "@-"
{
"id": "/dev/bigbusiness-role-enforced",
"cmd": "set; sleep 3600",
"instances": 10,
"cpus": 0.05,
"mem": 128
}
EOF
{
"id":"/dev/bigbusiness-role-enforced",
...
"role":"dev"
}
From this point on, if you try to update any service in the top-level group /dev
from the role dev
back to slave_public
then you will get a validation rejection.
$ echo '{"role": "slave_public"}' | curl -v -X PATCH http://localhost:8080/v2/apps/dev/bigbusiness-role-enforced -H "Content-Type: application/json" --data "@-"
...
{"message":"Object is not valid","details":[{"path":"/role","errors":["got slave_public, expected one of: [dev]"]}]}
You are allowed to modify services in /dev
use the old role slave_public
but, after you migrated, it is rejected:
$ echo '{"instances": 2, "role": "slave_public"}' | curl -v -X PATCH http://localhost:8080/v2/apps/dev/bigbusiness -H "Content-Type: application/json" --data "@-"
...
{"version":"2019-08-18T21:11:33.328Z","deploymentId":"d1f5654c-95f9-45ca-9fdd-0a08eceba7fb"}
$ echo '{"role": "dev"}' | curl -v -X PATCH http://localhost:8080/v2/apps/dev/bigbusiness -H "Content-Type: application/json" --data "@-"
...
{"version":"2019-08-18T21:12:34.259Z","deploymentId":"ac0930bf-8141-4452-967e-9f4b66229cef"}
$ echo '{"role": "slave_public"}' | curl -v -X PATCH http://localhost:8080/v2/apps/dev/bigbusiness -H "Content-Type: application/json" --data "@-"
...
{"message":"Object is not valid","details":[{"path":"/role","errors":["got slave_public, expected one of: [dev]"]}]}
$
enforceRole
for all new groups, by defaultWhen you begin migrating Marathon to use multi-role, it is recommended to turn on role enforcement for all new groups, by default. This is controlled by specifying the command line parameter open --new_group_enforce_role top
.
Note, providing this command-line parameter does not affect existing services or existing groups. You must enable enforceRole
for existing groups manually using the PATCH /v2/groups method, as seen above.
With the introduction of the role
field, the field acceptedResourceRoles
is easily misunderstood. It is very important to recognize there are two separate concepts with regards to roles in Mesos:
dev
can be allocated for the role dev
.With this in mind, given a service with the role dev
, only the following acceptedResourceRoles
field values are valid, and have the following interpretation:
["*"]
- only unreserved resources will be considered; resources already reserved for the role dev
will be declined.["*", "dev"]
- both unreserved resources and resources already reserved for the role dev
will be considered for offer matching.["dev"]
- only resources already reserved for dev
will be considered for offer matching. Unreserved resources will be declined.It is invalid to specify a role other than the service role or "*"
in acceptedResourceRoles
. In Marathon 1.9, invalid roles in acceptedResourceRoles
will be automatically removed, and acceptedResourceRoles
will default to ["*"]
if all specified roles are removed. In Marathon 1.10 and later, invalid resource roles will be rejected with a validation error.
It is vitally important that Marathon has access to use the roles that are specified in services.
Marathon blindly assumes it has the permissions to both create and use the new roles specified. If Marathon attempts to subscribe to a role to which it does not have permission, then this results in a non-functional Marathon, as Mesos reacts to unauthorized role subscriptions by dropping the connection.
See the http://mesos.apache.org/documentation/latest/authorization/ section in Mesos for documentation on the respective permissions. In DC/OS, the appropriate permissions are granted for the root Marathon out-of-the-box.