AWS Harness — Local Testing Client
Status: Preview. AgentCore Harness is a managed agent layer announced at AWS re:Invent 2025 — currently in preview with limited regional coverage. This integration tracks the preview API surface; expect breaking changes before GA.
This folder hosts the deployment of the Zscaler MCP Server as an AgentCore
Harness tool. For the existing AgentCore Runtime deployment
(Direct Runtime + experimental Gateway), see
../bedrock-agentcore/.
Pick a topology
The script supports two end-to-end deployment shapes. Pass --topology (or
set TOPOLOGY in .env); omit both and the script prompts interactively.
| Topology | Tool type | MCP server runs on | IdP | When to pick |
|---|---|---|---|---|
| ecs (default) | remote_mcp | ECS Express Mode service (Fargate + auto-ALB) | n/a (Basic auth from Token Vault) | Simplest path; fewest moving parts; no Cognito to manage. PR #47 behaviour. |
| gateway (PR #48) | agentcore_gateway | AgentCore Runtime (managed) | Amazon Cognito | No ALB / ECS / Fargate. Same Gateway can later front other MCP clients (Cursor/Claude/Strands) with a real OIDC login. |
ecs topology (PR #47)              gateway topology (PR #48)
─────────────────────              ─────────────────────────
User ──SigV4──► Harness            User ──SigV4──► Harness
                   │                                  │
                   │ remote_mcp                       │ agentcore_gateway
                   │ (Token Vault Basic)              │ (Token Vault OAuth)
                   ▼                                  ▼
              ECS Express                     AgentCore Gateway
              (ALB+HTTPS)                   (CUSTOM_JWT Cognito)
                   │                                  │
                   ▼                                  │ OAuth2 outbound
             MCP container                            ▼
                                              AgentCore Runtime
                                                (jwt Cognito)
                                                      │
                                                      ▼
                                                MCP container
What is AgentCore Harness?
Harness is a managed agent. You declare model, systemPrompt, tools,
memory, and limits once via bedrock-agentcore-control:CreateHarness;
AWS runs the agent loop. Under the hood it is Strands Agents on AgentCore
Runtime — both still exist; Harness is just the higher-level surface AWS
now markets as the go-to-production path.
| Property | Detail |
|---|---|
| Service | Amazon Bedrock AgentCore — bedrock-agentcore-control (control plane) + bedrock-agentcore (data plane) |
| API | CreateHarness, GetHarness, UpdateHarness, DeleteHarness, ListHarnesses, InvokeHarness |
| boto3 | ≥ 1.43.0 (earlier versions don't expose the API) |
| Tools | remote_mcp, agentcore_gateway, agentcore_browser, agentcore_code_interpreter, inline_function |
| Auth | Static Authorization header (plain or resolved from AgentCore Identity Token Vault) |
| Memory | Optional (AgentCore Memory). Opt-in per harness. |
| Observability | Auto-emitted to CloudWatch under /aws/bedrock-agentcore/harness/<id> |
In the default ecs topology, the Zscaler MCP Server is attached as a
remote_mcp tool: a URL that Harness calls over HTTPS with whatever
headers we configure. The gateway topology attaches it as an
agentcore_gateway tool instead.
Architecture
┌──────────────────────────────────────────┐
│ User / Bedrock console / boto3 client │
└────────────────┬─────────────────────────┘
│ InvokeHarness (data plane)
▼
┌──────────────────────────────────────────┐
│ AgentCore Harness (managed) │
│ model: claude-sonnet-4-6 │
│ systemPrompt: "Zscaler admin assistant" │
│ tools: │
│ - type: remote_mcp │
│ name: zscaler │
│ url: https://…ecs….on.aws/mcp │
│ headers: │
│ Authorization: ${arn:…vault:…} │
└────────────────┬─────────────────────────┘
│ HTTPS, Basic header sourced from Token Vault
▼
┌──────────────────────────────────────────┐
│ Zscaler MCP Server │
│ (deployed by THIS script to │
│ Amazon ECS Express Mode — Fargate + │
│ managed ALB + auto-scaling + auto-TLS) │
│ Image: zscaler/zscaler-mcp-server │
│ (same image as AgentCore Runtime;│
│ override via ZSCALER_MCP_IMAGE_URI│
│ for dev builds) │
│ ZSCALER_MCP_AUTH_MODE=zscaler │
└──────────────────────────────────────────┘
│
▼
Zscaler OneAPI
What this script creates (end-to-end)
The deploy command stands up the entire path in a single run:
- ECS task execution role — zscaler-mcp-ecs-task-execution-role (configurable). Trusts ecs-tasks.amazonaws.com. Attached AWS-managed AmazonECSTaskExecutionRolePolicy for same-account ECR + CloudWatch Logs; layered inline policy scopes cross-account ECR pull to whichever registry account is in the image URI (defaults to the AWS Marketplace ECR account 709825985650).
- ECS infrastructure role — zscaler-mcp-ecs-infrastructure-role (configurable). Trusts ecs.amazonaws.com. Attached AWS-managed AmazonECSInfrastructureRoleforExpressGatewayServices — used by ECS only during create/update/delete to provision the ALB, target groups, security groups, and auto-scaling policies on your behalf.
- ECS cluster — zscaler-mcp (configurable). Created if missing; preserved on destroy if it pre-existed (so we don't disturb shared workloads).
- CloudWatch log group — /ecs/zscaler-mcp (configurable). Container stdout/stderr is streamed here with the mcp log-stream prefix.
- ECS Express service — zscaler-mcp-server (configurable). A single CreateExpressGatewayService call provisions the entire stack — ALB, target group with /mcp/ health check, security groups, auto-scaling target — and returns a stable public HTTPS endpoint of the form xxxxx.ecs.<region>.on.aws. The container runs on streamable-http at port 8000 with ZSCALER_MCP_AUTH_MODE=zscaler so it validates the inbound Authorization: Basic … header against Zscaler's /oauth2/v1/token endpoint.
- AgentCore Identity Token Vault credential provider — zscaler-mcp-creds (configurable). Stores Basic base64(client_id:client_secret) so the Harness can substitute it into the outbound Authorization header at invocation time.
- AWS IAM Harness execution role — zscaler-mcp-harness-execution-role (configurable). Trusts bedrock-agentcore.amazonaws.com. Inline policy mirrors the AWS-published harness execution role policy — all of these are required, the harness will silently fail InvokeHarness with Failed to start MCP client: ... TaskGroup if any are missing:
  - bedrock:InvokeModel*, bedrock:Converse* — call the reasoning model
  - ecr-public:GetAuthorizationToken, sts:GetServiceBearerToken — the under-the-hood AgentCore Runtime pulls its container image from ECR Public on every session
  - xray:Put*, cloudwatch:PutMetricData (scoped to namespace bedrock-agentcore) — AgentCore Observability
  - logs:* scoped to /aws/bedrock-agentcore/runtimes/* — the auto-managed runtime writes its own application logs
  - bedrock-agentcore:GetWorkloadAccessToken* scoped to workload-identity-directory/default/workload-identity/harness_<name>-*
  - bedrock-agentcore:GetResourceApiKey, bedrock-agentcore:GetResourceOauth2Token — Token Vault resolution
  - secretsmanager:GetSecretValue (scoped to bedrock-agentcore-identity*) plus a scoped kms:Decrypt — Token Vault backing secrets
- AgentCore Harness — zscaler-mcp-harness (configurable). Wired to a model + system prompt + the remote_mcp tool block pointing at the ECS Express URL with the Token Vault credential provider as the Authorization substitution target.
State persisted to .aws-harness-state.json (gitignored).
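The credential provider in the list above stores a header value of the form Basic base64(client_id:client_secret). As a minimal sketch of how that value can be built, assuming a hypothetical helper name (basic_auth_header is not taken from the deploy script):

```python
import base64


def basic_auth_header(client_id: str, client_secret: str) -> str:
    """Build the 'Basic base64(client_id:client_secret)' header value."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Placeholder values for illustration only, not real credentials.
EXAMPLE_HEADER = basic_auth_header("example-id", "example-secret")
```

Decoding the part after "Basic " should give back client_id:client_secret, which is a quick way to sanity-check a value before staging it in Token Vault.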
Why ECS Express Mode and not App Runner? AWS stopped onboarding new customers to App Runner on Apr 30, 2026 and pointed everyone at ECS Express Mode as the replacement. Express Mode keeps the same single-API-call UX (one CreateExpressGatewayService provisions ALB + target groups + security groups + auto-scaling) and adds proper Fargate scaling. The image, env vars, and Harness wiring are identical regardless of host.
Want to host the MCP server elsewhere? Set MCP_URL=... in .env (or pass --mcp-url). The ECS Express steps are skipped and Harness is wired to your existing endpoint. The endpoint must be HTTPS, non-SigV4, and accept the same Authorization: Basic … header — i.e. it must be running the Zscaler MCP Server with ZSCALER_MCP_AUTH_MODE=zscaler.
Critical constraint — remote_mcp headers are static
Harness's remote_mcp tool can only send static Authorization headers.
It has no SigV4 signer. That means:
- A SigV4-protected AgentCore Runtime URL will not work as a
  remote_mcp target. Pointing Harness at
  https://bedrock-agentcore.<region>.amazonaws.com/runtimes/...
  produces HTTP 403 on every invocation.
- Recommended: let this script deploy the MCP server to ECS Express
  Mode (default path — auto-managed ALB + HTTPS). Or stand it up
  yourself behind any other non-SigV4 endpoint (Lambda + API Gateway
  without IAM auth, EC2 + systemd + HTTPS, Cloud Run, ACA, on-prem +
  ngrok) and pass the URL via MCP_URL=…. The server's existing
  ZSCALER_MCP_AUTH_MODE=zscaler mode handles the Basic header Harness
  delivers regardless of host.
- Alternative (now shipped): AgentCore Gateway between Harness and a
  SigV4 Runtime URL — the Gateway does the OAuth → SigV4 protocol switch
  for you. Set --topology gateway (or TOPOLOGY=gateway) and the script
  provisions AgentCore Runtime + Gateway + Amazon Cognito as the IdP in
  a single command. See Gateway topology below.
The deploy script will warn if you point it at an obvious SigV4-only
URL and let you proceed anyway — useful for testing the Harness creation
flow before the MCP endpoint is finalised.
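One way such a warning could be triggered is a simple host-and-path heuristic. The sketch below is an assumption about what "obvious SigV4-only URL" might mean, based only on the Runtime URL shape quoted above; the function name looks_sigv4_only is hypothetical and the deploy script's actual check may differ.

```python
import re
from urllib.parse import urlparse

# Host shape of the SigV4-protected Runtime endpoint quoted in this README.
_RUNTIME_HOST = re.compile(r"^bedrock-agentcore\.[a-z0-9-]+\.amazonaws\.com$")


def looks_sigv4_only(url: str) -> bool:
    """Heuristic: True when the URL resembles an AgentCore Runtime invocation URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return bool(_RUNTIME_HOST.match(host)) and parsed.path.startswith("/runtimes/")
```

A heuristic like this can only warn. It cannot prove that an arbitrary endpoint accepts a static Authorization header, which is why proceeding anyway stays possible.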
Gateway topology (PR #48)
The Gateway topology eliminates the ECS Express / ALB / Fargate footprint
entirely. The MCP server runs on AgentCore Runtime (the same compute
the sibling ../bedrock-agentcore/ script uses) and Harness reaches it
through an AgentCore Gateway. Amazon Cognito is the inbound IdP —
fully AWS-native, no Auth0 / Okta / Entra ID required.
How auth works (all three boundaries)
The whole flow uses Cognito-issued JWTs (client_credentials grant), brokered by a single OAuth2 credential provider in AgentCore Identity Token Vault. One Cognito App Client serves all three legs.
[1] user / console ──SigV4──────────────► Harness
[2] Harness ──HarnessGatewayOutboundAuth.oauth──►
fetches Cognito client_credentials token
from the OAuth2 credential provider
──Bearer <Cognito JWT>──► Gateway
[3] Gateway ──customJWTAuthorizer (Cognito JWKS)──┘
validates aud / client_id / signature
──target outbound: OAuth2 (same provider)──►
refetches Cognito token, presents to
──Bearer <Cognito JWT>──► Runtime
[4] Runtime ──customJwtAuthorizer (Cognito JWKS)──┘
same validation as Gateway
──container env: jwt mode──► MCP server
[5] MCP server ──Zscaler secret in Secrets Manager via TaskRole──►
loads ZSCALER_* creds from Secrets Manager
──Basic auth─────► Zscaler OneAPI
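Token Vault performs the client_credentials exchange for you, but when debugging it can help to reproduce the token request by hand. The sketch below only builds the request against the Cognito hosted-UI /oauth2/token endpoint. The function name and parameters are hypothetical, and the domain prefix, region, and client values must come from your own deployment.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(domain_prefix, region, client_id, client_secret,
                        scope="zscaler-mcp/invoke"):
    """Build (but do not send) a Cognito client_credentials token request."""
    url = f"https://{domain_prefix}.auth.{region}.amazoncognito.com/oauth2/token"
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode("ascii")
    # The App Client authenticates with HTTP Basic (client_id:client_secret).
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )
```

Passing the returned object to urllib.request.urlopen sends the request; a successful response is a JSON document containing an access_token, which is the Bearer value presented to the Gateway and the Runtime.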
What the script creates (gateway topology)
A single deploy --topology gateway run provisions:
1. Amazon Cognito User Pool — zscaler-mcp-harness-up (configurable). AdminCreateUserOnly set to true — we never mint users.
2. Cognito Resource Server — identifier zscaler-mcp (becomes the aud claim on tokens). Has one custom scope: invoke.
3. Cognito App Client — zscaler-mcp-harness-client, client_credentials grant only, with zscaler-mcp/invoke in AllowedOAuthScopes. Generates a client secret on create.
4. Cognito hosted-UI domain — auto-suffixed with the AWS account ID for global uniqueness. Hosts the /oauth2/token endpoint.
5. AgentCore Identity OAuth2 credential provider — zscaler-mcp-cognito-oauth. Stores the Cognito (client_id, client_secret, discoveryUrl) tuple. Backs both the Harness→Gateway and Gateway→Runtime auth legs.
6. Runtime execution role — zscaler-mcp-harness-runtime-role. Grants ECR pull, CloudWatch Logs PutLogEvents on /aws/bedrock-agentcore/runtimes/*, Secrets Manager GetSecretValue + scoped kms:Decrypt on the Zscaler secret.
7. AgentCore Runtime — zscaler_mcp_runtime. Configured with auth: jwt + customJwtAuthorizer pointing at Cognito, env vars include ZSCALER_SECRET_NAME so the container's zscaler_mcp.config module loads credentials via boto3 at boot.
8. Gateway service role — zscaler-mcp-harness-gateway-role. Trusts bedrock-agentcore.amazonaws.com. Inline policy grants bedrock-agentcore:InvokeAgentRuntime on the Runtime ARN.
9. AgentCore Gateway — zscaler-mcp-gateway. protocolType=MCP, authorizerType=CUSTOM_JWT against the Cognito User Pool.
10. Gateway target — zscaler-mcp-runtime. mcpServer target type pointing at the Runtime's invocation URL. Outbound credential provider = the OAuth2 provider from (5), grantType=CLIENT_CREDENTIALS.
11. Harness execution role — same as the ECS topology (zscaler-mcp-harness-execution-role).
12. AgentCore Harness — wired with an agentcore_gateway tool block whose outboundAuth.oauth.providerArn points back at (5).
State persisted to .aws-harness-state.json with a topology: "gateway"
marker plus all the IDs above.
Why this topology is cleaner
| | ECS topology (PR #47) | Gateway topology (PR #48) |
|---|---|---|
| Compute | Fargate task in ECS Express service | AgentCore Runtime (managed) |
| Networking | ALB + target group + security groups + auto-scaling | None (Runtime is internal to AgentCore) |
| Health checks | ALB /health probe (FastMCP HealthCheckMiddleware) | None (Runtime polls READY status itself) |
| Inbound auth on MCP | Basic (Zscaler-mode middleware on the container) | JWT (validated by AgentCore Runtime before the container is hit) |
| IdP | None (Token Vault holds static Basic header) | Amazon Cognito (1 User Pool, 1 App Client, 1 Resource Server, 1 domain) |
| Reachable by non-Harness MCP clients | Yes — https://…ecs….on.aws/mcp is a plain URL | Yes — Gateway exposes a Cognito-fronted MCP URL too |
| Destroy blast radius | Cluster + service + task defs + ALB + roles | Runtime + Gateway + target + Cognito + roles |
| Cost when idle | ALB ~$16/mo + 1 Fargate task | Near-zero (Runtime + Gateway charged per invocation) |
Limitations (preview API)
- Gateway inbound auth is CUSTOM_JWT only as of the 2023-06-05
  service spec. Cognito is the easiest IdP because the script provisions
  it for you; any OIDC-compliant IdP would also work but you'd have to
  bring your own and edit the deploy script's
  _deploy_gateway_topology() to use its discovery URL.
- Domain prefix collisions: Cognito hosted-UI domain prefixes are
  globally unique within a region. The script suffixes the prefix with
  the AWS account ID, but two separate AWS accounts in the same region
  trying to use the same prefix will see one fail. Override
  --cognito-domain-prefix if this hits you.
- No interactive --keep-runtime option on destroy yet. If you need to
  preserve the Runtime (e.g. it's also the backend for the
  ../bedrock-agentcore/ deploy), pass --keep-role to keep the Runtime
  exec role + Gateway service role, then manually run
  aws bedrock-agentcore-control delete-gateway-target / delete-gateway
  to remove just the Gateway pieces.
- Cognito tokens cap at 1 hour by default. Token Vault refreshes them
  automatically; you should never see token-expiry errors from Harness.
Switching between topologies
The two topologies are mutually exclusive in a single state file — the script tracks one Harness at a time. To switch:
# Tear down the current deployment (whichever topology)
python harness_mcp_operations.py destroy --yes
# Redeploy with the other topology
python harness_mcp_operations.py deploy --topology gateway
# or
python harness_mcp_operations.py deploy --topology ecs
If you want both topologies running side-by-side, pass different
--harness-name values (and clone the script directory so the state
files don't collide).
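Before switching, it can be handy to check which topology the local state file records. The sketch below is illustrative only: the topology key comes from the state-file description above, while the helper names and the fallback to "ecs" for a file without a marker are assumptions.

```python
import json
from pathlib import Path

STATE_FILE = Path(".aws-harness-state.json")


def deployed_topology(state_path=STATE_FILE):
    """Return the recorded topology, or None when no state file exists."""
    if not state_path.exists():
        return None
    state = json.loads(state_path.read_text(encoding="utf-8"))
    # Assumption: a state file without a marker belongs to the ecs topology.
    return state.get("topology", "ecs")


def needs_destroy_first(requested, state_path=STATE_FILE):
    """True when a different topology is already deployed from this directory."""
    current = deployed_topology(state_path)
    return current is not None and current != requested
```

If needs_destroy_first("gateway") returns True, run the destroy command shown above before redeploying.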
Prerequisites
| Requirement | Notes |
|---|---|
| AWS account with AgentCore Harness preview access | Currently limited to a subset of regions. Confirm via aws bedrock-agentcore-control list-harnesses --region <r>. |
| AWS CLI / boto3 credentials | The script uses the default credential chain. aws sts get-caller-identity should succeed. |
| Python 3.10+ | One runtime dependency (boto3). |
| Bedrock model access | At minimum the Claude Sonnet 4.6 inference profile (or whichever model you pick). Anthropic models additionally require a one-time use-case form in the Bedrock console. |
| Permission to call ecs:CreateExpressGatewayService and iam:PassRole on ecsTaskExecutionRole / ecsInfrastructureRoleForExpressServices | Granted by AdministratorAccess. Tighter least-privilege policies are documented in the ECS Express Mode getting-started guide. |
| Default VPC in the target region | ECS Express auto-selects subnets + security groups from the default VPC when networkConfiguration is omitted. Most accounts have one out of the box; otherwise create one with aws ec2 create-default-vpc. |
| AWS Marketplace subscription to Zscaler MCP Server | Free (BYOL). Required if you let the script use the default Marketplace image. Skip this if you set ZSCALER_MCP_IMAGE_URI to your own ECR. |
| linux/amd64 image | The Marketplace image is multi-arch; dev builds must include linux/amd64 (ECS Fargate is amd64 by default). Use make docker-build-multiarch IMAGE=<your-ecr-uri>:<tag> to build a manifest list, or … PLATFORMS=linux/amd64 to push amd64 only. A plain docker build on Apple Silicon ships an arm64-only image and crashes Fargate with exec format error. |
| Zscaler OneAPI credentials | ZSCALER_CLIENT_ID, ZSCALER_CLIENT_SECRET, ZSCALER_CUSTOMER_ID, ZSCALER_VANITY_DOMAIN from the ZIdentity console. All four are required when this script deploys the MCP server itself. |
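The two local prerequisites (Python 3.10+ and boto3 ≥ 1.43.0) can be checked up front. This is a minimal sketch with hypothetical helper names, and it does not verify AWS-side requirements such as preview access or the Marketplace subscription.

```python
import sys

MIN_PYTHON = (3, 10)
MIN_BOTO3 = (1, 43, 0)


def parse_version(text: str) -> tuple:
    """Turn '1.43.0' into (1, 43, 0), ignoring non-numeric suffixes such as 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad '1.43' to (1, 43, 0)
        parts.append(0)
    return tuple(parts)


def check_prerequisites() -> list:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.10+ is required")
    try:
        import boto3
    except ImportError:
        problems.append("boto3 is not installed")
    else:
        if parse_version(boto3.__version__) < MIN_BOTO3:
            problems.append(f"boto3 {boto3.__version__} is older than 1.43.0")
    return problems
```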
Install
cd integrations/aws/harness
uv venv .harness-venv --python 3.11
source .harness-venv/bin/activate
uv pip install -r requirements.txt
Both .harness-venv/ and the deployment state file are listed in
integrations/aws/harness/.gitignore.
Configure
Copy the template and fill in your Zscaler credentials:
cp env.properties .env
${EDITOR:-vim} .env
You can also pass values as CLI flags or let the script prompt you
interactively — env.properties is for convenience.
Deploy
python harness_mcp_operations.py deploy --region us-east-1
The script walks the full stack in one run (≈4-5 minutes — most of it is the ECS Express ALB + target group health-check warm-up):
| # | Step | What happens |
|---|---|---|
| 1 | Load configuration | Reads .env, merges with CLI flags. |
| 2 | Verify AWS credentials | sts:GetCallerIdentity. |
| 3 | Pick MCP source | ECS Express (default) or pre-existing MCP_URL. |
| 4 | Zscaler OneAPI credentials | Validates CLIENT_ID / CLIENT_SECRET (+ CUSTOMER_ID / VANITY_DOMAIN on the ECS Express path). |
| 5 | Container image source | Defaults to the Marketplace ECR image; override via ZSCALER_MCP_IMAGE_URI. |
| 6 | ECS IAM roles | Task execution role (ecs-tasks.amazonaws.com + AmazonECSTaskExecutionRolePolicy + cross-account ECR inline) and infrastructure role (ecs.amazonaws.com + AmazonECSInfrastructureRoleforExpressGatewayServices). |
| 7 | ECS cluster + log group | Cluster created if missing (tracked for symmetric destroy); CloudWatch log group created idempotently. |
| 8 | ECS Express service | CreateExpressGatewayService, polls until status is ACTIVE and the PUBLIC ingress endpoint is published. Returns the stable *.ecs.<region>.on.aws URL. |
| 9 | Stage credentials in Token Vault | CreateApiKeyCredentialProvider storing Basic base64(client_id:client_secret). |
| 10 | Harness execution role | IAM role for the Harness itself. Sleeps ~10s for propagation. |
| 11 | Pick Bedrock reasoning model | Claude Sonnet 4.6 (default), Claude Opus 4.7, Nova Pro, or Llama 3.3 70B. Skipped if MODEL_ID is set. |
| 12 | CreateHarness | Submits the harness with model + system prompt + remote_mcp tool, polls until READY. |
Steps 5–8 are skipped when MCP_URL is set; the script wires Harness
to your existing endpoint instead.
On success, prints the harness ARN, the ECS Express public endpoint
(when applicable), the Bedrock console URL for the playground, and the
next commands to run (logs / invoke / destroy).
What success looks like
── Deployment summary ─────────────────────────────────────────────────────────
HarnessId = zscaler-mcp-harness-AbCdEfGhIj
HarnessArn = arn:aws:bedrock-agentcore:us-east-1:123456789012:harness/zscaler-mcp-harness-AbCdEfGhIj
Model = us.anthropic.claude-sonnet-4-5-20250929-v1:0
MCP URL = https://abc123.ecs.us-east-1.on.aws/mcp
MCP Host = ECS Express — zscaler-mcp-server (cluster: zscaler-mcp)
Execution Role = arn:aws:iam::123456789012:role/zscaler-mcp-harness-execution-role
Credential Provider = arn:aws:bedrock-agentcore:us-east-1:123456789012:token-vault/default/apikeycredentialprovider/zscaler-mcp-creds
Console = https://us-east-1.console.aws.amazon.com/bedrock-agentcore/home?region=us-east-1#/harnesses/zscaler-mcp-harness-AbCdEfGhIj
Lifecycle commands
python harness_mcp_operations.py status --region us-east-1
python harness_mcp_operations.py logs --region us-east-1
python harness_mcp_operations.py invoke "list my zpa segment groups" --region us-east-1
python harness_mcp_operations.py destroy --region us-east-1 [--yes] [--keep-role] [--keep-ecs]
| Command | Description |
|---|---|
| deploy | End-to-end walk-through (above). Idempotent on re-deploy — reuses existing ECS cluster, ECS Express service, IAM roles, log group, and credential provider when they already exist by name. |
| status | GetHarness + pretty-print of status, model, tool list, timestamps. When the ECS Express host is in state, also prints the service status, cluster, image, and public endpoint. |
| logs | Tails the auto-managed AgentCore runtime log group under /aws/bedrock-agentcore/runtimes/<runtime-id>. The runtime ID is discovered by scanning that prefix for groups tied to your harness; the group only materialises on the first InvokeHarness. (For container-side / ECS logs use aws logs tail /ecs/zscaler-mcp --follow.) |
| invoke | One-shot smoke test: opens an InvokeHarness event stream, prints text deltas, surfaces the stop reason and token usage. |
| destroy | Reverse-order tear-down: DeleteHarness → wait → DeleteApiKeyCredentialProvider → delete Harness exec role → DeleteExpressGatewayService → delete ECS CloudWatch log group → delete cluster (only if we created it) → delete ECS task execution + infrastructure roles → delete .aws-harness-state.json. The auto-managed Harness runtime log group under /aws/bedrock-agentcore/runtimes/* is owned by AWS and drained automatically when the harness is deleted — we do not touch it. Use --keep-role to preserve IAM roles and --keep-ecs to preserve the MCP host across redeploys. |
Authentication design
The Zscaler MCP Server has five auth modes (OIDCProxy, JWT,
API Key, Zscaler, None). Harness's remote_mcp tool pairs with
them as follows:
| MCP auth mode | Harness header config | Recommended? |
|---|---|---|
| Zscaler | Authorization: ${arn:…/zscaler-mcp-creds} — Token Vault resolves to Basic base64(client_id:client_secret) | Yes — this script's default. Rotation handled by Token Vault. |
| API Key | Authorization: ${arn:…/zscaler-api-key} — plain bearer | Yes, if you'd rather use a static bearer instead of OneAPI. |
| JWT | Authorization: Bearer <long-lived JWT> plaintext, or Token-Vault resolved if rotation is needed | Less common. JWT is usually short-lived. |
| OIDCProxy | Use agentcore_gateway tool type instead, with outboundAuth.oauth configured on the Gateway | Topology C. The OIDC flow can't run from inside remote_mcp. |
| None | No Authorization header | Dev / testing only. Do not deploy without auth. |
Secrets Manager — where Zscaler credentials live
By default this script does not put ZSCALER_CLIENT_SECRET (or
ZSCALER_CLIENT_ID / ZSCALER_VANITY_DOMAIN / ZSCALER_CUSTOMER_ID /
ZSCALER_CLOUD) in the ECS task definition as plaintext env vars. The
five-key credential bundle goes to AWS Secrets Manager and the container
fetches it at boot.
Why
Plaintext env vars in an ECS task definition are visible to anyone with
ecs:DescribeTaskDefinition and the value is logged in CloudTrail on
every RegisterTaskDefinition / CreateExpressGatewayService /
UpdateExpressGatewayService call. Secrets Manager scopes the read to
a single secret ARN, audits each fetch separately, and enables
credential rotation without touching the task definition.
How it works (zero container-code changes)
The container image already ships with zscaler_mcp/config.py, a
side-effect module that runs at process boot via aws_entrypoint.py:
- The deploy script writes the credential JSON to zscaler-mcp-harness/credentials in Secrets Manager.
- The ECS task definition gets only ZSCALER_SECRET_NAME=<that-name> — never the actual credential values.
- The task execution role gets a scoped two-statement inline policy (secretsmanager:GetSecretValue on the secret ARN + kms:Decrypt filtered by kms:ViaService=secretsmanager.<region>).
- At container boot, config.py calls GetSecretValue via boto3, parses the JSON, and os.environ-injects each key. The SDK then initialises exactly as if the keys had been passed as env vars.
Result: the same SDK code, the same env-var shape — but no Zscaler
credential ever appears in aws ecs describe-task-definition,
CloudTrail, or the ECS console.
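The boot-time behaviour described above can be approximated in a few lines. This is a sketch of the idea, not the contents of zscaler_mcp/config.py: the function name is hypothetical, the client is passed in so the helper can be exercised without AWS access, and whether the real module overwrites variables that are already set is not stated here, so the unconditional assignment below is an assumption.

```python
import json
import os


def load_secret_into_env(client, secret_name, environ=os.environ):
    """Fetch a JSON secret and copy each key into the process environment.

    `client` is expected to behave like boto3.client("secretsmanager").
    Returns the sorted list of injected key names (never the values).
    """
    response = client.get_secret_value(SecretId=secret_name)
    bundle = json.loads(response["SecretString"])
    for key, value in bundle.items():
        # Assumption: secret values win over anything already in the environment.
        environ[key] = str(value)
    return sorted(bundle)
```

In a container this would typically be called once at startup with boto3.client("secretsmanager") and the value of ZSCALER_SECRET_NAME, before the SDK reads its settings from the environment.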
Three modes
| Mode | How to activate | Lifecycle |
|---|---|---|
| Default — script-managed secret | Leave ZSCALER_SECRET_NAME unset in .env. The script creates zscaler-mcp-harness/credentials (override via --secret-name) and seeds it from your ZSCALER_* .env values. | destroy schedules deletion with a 7-day recovery window. Use --force-secret-delete to skip the window. Re-deploys after a .env rotation update the secret in place via PutSecretValue. |
| Bring-your-own secret | Set ZSCALER_SECRET_NAME=<arn-or-name> in .env, pointing at a pre-existing secret managed by Terraform / CloudFormation / another team. Secret JSON must use the same key names (ZSCALER_CLIENT_ID, ZSCALER_CLIENT_SECRET, etc.). | The script verifies the secret exists, scopes the IAM policy to its ARN, but never overwrites the value or deletes the secret — even on destroy. |
| Plaintext opt-out (dev/debug only) | Pass --no-secrets-manager to deploy. | Restores the legacy behaviour where the 5 credential keys go straight into the ECS task definition. No Secrets Manager resource is created, no IAM policy is attached. Production deploys should never use this. |
IAM additions
The existing zscaler-mcp-ecs-task-execution-role (idempotent on
re-deploy) gets one extra inline policy when Secrets Manager is on:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ReadZscalerCredentialsSecret",
"Effect": "Allow",
"Action": ["secretsmanager:GetSecretValue"],
"Resource": [
"arn:aws:secretsmanager:<region>:<account>:secret:zscaler-mcp-harness/credentials-<random>",
"arn:aws:secretsmanager:<region>:<account>:secret:zscaler-mcp-harness/credentials-<random>-*"
]
},
{
"Sid": "DecryptZscalerCredentialsSecret",
"Effect": "Allow",
"Action": ["kms:Decrypt"],
"Resource": "*",
"Condition": {
"StringEquals": {
"kms:ViaService": "secretsmanager.<region>.amazonaws.com"
}
}
}
]
}
The -* ARN wildcard handles the random 6-char suffix Secrets Manager
appends to secret ARNs and is constrained to the same logical secret
(IAM ARN matching is exact otherwise — bare secret:foo does NOT
match secret:foo-aBc123).
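The matching rule can be illustrated with a small approximation. fnmatchcase is used here as a stand-in for IAM's wildcard matching (it is not the IAM policy engine), and the account ID and 6-character suffix are made-up placeholders.

```python
from fnmatch import fnmatchcase


def arn_allowed(resource_patterns, arn):
    """Approximate IAM Resource matching: exact unless the pattern has wildcards."""
    return any(fnmatchcase(arn, pattern) for pattern in resource_patterns)


# Hypothetical ARN; Secrets Manager appends a random 6-character suffix.
BASE = "arn:aws:secretsmanager:us-east-1:123456789012:secret:zscaler-mcp-harness/credentials"
FULL_ARN = BASE + "-aBc123"

bare_matches = arn_allowed([BASE], FULL_ARN)            # no wildcard, so no match
wildcard_matches = arn_allowed([BASE + "-*"], FULL_ARN)  # trailing -* covers the suffix
```

With these inputs the bare pattern fails to match the suffixed ARN while the -* pattern matches it, which is the case the second Resource entry in the policy exists to cover.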
File layout
integrations/aws/harness/
├── harness_mcp_operations.py # interactive deployment / lifecycle script
├── env.properties # .env template (copy to .env, fill in)
├── requirements.txt # boto3>=1.43.0
├── .gitignore # local state + venv
├── README.md # this file
└── .aws-harness-state.json # generated by `deploy` (gitignored)
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Harness playground returns Failed to start MCP client: ... unhandled errors in a TaskGroup (1 sub-exception) on every invocation | Harness execution role is missing one of the AWS-mandated grants — most commonly ecr-public:GetAuthorizationToken + sts:GetServiceBearerToken (without these, the auto-managed AgentCore Runtime can't pull its own container from ECR Public and the MCP tool loader never starts). | destroy --keep-ecs and deploy again — the script now mirrors the full AWS-published harness execution role policy. To verify by hand: aws iam get-role-policy --role-name zscaler-mcp-harness-execution-role --policy-name HarnessInline. |
| Harness playground returns AccessDeniedException ... not authorized to perform: secretsmanager:GetSecretValue | Harness exec role can't read the Token Vault's backing secret in Secrets Manager. | Same fix as above — destroy --keep-ecs && deploy. The policy includes a scoped secretsmanager:GetSecretValue on bedrock-agentcore-identity* plus a kms:Decrypt (scoped via kms:ViaService). |
| Harness playground returns runtimeClientError: Failed to load tool 'zscaler' (type=remote_mcp): … not authorized to perform: bedrock-agentcore:GetResourceApiKey on resource: <arn> — where <arn> rotates between five distinct shapes across retries (workload-identity-directory/default, …/workload-identity/harness_<name>-…, token-vault/default, token-vault/default/apikeycredentialprovider/<provider>, or the OAuth2 equivalent) | AgentCore Identity does multiple distinct IAM authz checks per GetResourceApiKey / GetResourceOauth2Token call — and every one of them must independently pass. The canonical service-authorization reference declares GetResourceApiKey requires permission on three distinct resource types (apikeycredentialprovider, token-vault, workload-identity), plus AgentCore additionally checks the workload-identity-directory root. The simpler scope-credential-provider-access page is incomplete — it omits the apikeycredentialprovider sub-ARN. Critical IAM-matching gotcha: IAM ARN matching is exact (no prefix matching), so listing token-vault/default does NOT cover token-vault/default/apikeycredentialprovider/<name>. Both ARN forms have to be in the Resource list. | destroy --keep-ecs && deploy. The current ResolveTokenVaultCredentials statement enumerates all five resource ARNs (workload-identity-directory root, workload-identity, token-vault, apikeycredentialprovider/, oauth2credentialprovider/), so every authz check the runtime makes lands on a statement that allows it. |
| Add-tool dialog in Bedrock console shows the MCP URL without a trailing slash (e.g. …/mcp) but a previous deploy stored …/mcp/ | Older script revisions ended the URL with /mcp/. The harness's built-in MCP client issues POSTs and won't follow the 307 redirect FastMCP emits on /mcp → /mcp/, so the tool fails to initialise with the TaskGroup error. | destroy && deploy to regenerate the harness with the trimmed URL the AWS console expects. |
| botocore.exceptions.NoCredentialsError | No AWS creds resolved. | aws configure, set AWS_PROFILE, or export AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY. |
| EndpointConnectionError: bedrock-agentcore-control.<region>.amazonaws.com | Harness preview not available in that region. | Pick a region from the AWS preview list. As of writing, us-east-1 is the safest bet. |
| AccessDeniedException: User … is not authorized to perform: bedrock-agentcore-control:CreateHarness | IAM user / role missing preview permissions. | Add bedrock-agentcore-control:* and bedrock-agentcore:* to the calling principal. |
| ValidationException: harnessName … must satisfy regular expression pattern: [a-zA-Z][a-zA-Z0-9_]{0,39} | Invalid characters in --harness-name — most commonly hyphens (-), which the AgentCore API rejects. | Use letters, digits, and underscores only (no hyphens), max 40 chars. Default is zscaler_mcp_harness. |
| MalformedPolicyDocument on IAM role creation | Account hasn't been onboarded to AgentCore yet (the service principal bedrock-agentcore.amazonaws.com isn't recognised). | Enable AgentCore in the Bedrock console first — usually a one-click opt-in on the AgentCore landing page. |
| Harness creates but every invocation returns 403 from the MCP URL | The MCP URL is SigV4-only (AgentCore Runtime invocation endpoint). | Re-run deploy without MCP_URL (uses ECS Express) or point MCP_URL at a non-SigV4 endpoint (Cloud Run, API Gateway w/o IAM, EC2+HTTPS). |
invoke prints reasoning but no text from the model | stopReason: max_iterations_exceeded — agent loop hit the iteration cap. | Bump maxIterations in harness_mcp_operations.py::create_harness (default 25). |
ResourceNotFoundException on delete_api_key_credential_provider | The provider was already removed. | Safe to ignore — destroy already prints [INFO] Credential provider … already absent. |
exec /usr/local/bin/python: exec format error in ECS task logs | The image at ZSCALER_MCP_IMAGE_URI is single-arch ARM64 (what docker build defaults to on Apple Silicon), while ECS Fargate Express runs AMD64. The image isn't actually wrong; the build command is. | Rebuild with the repo's Makefile target. For a dev-only ECS push: make docker-build-multiarch IMAGE=<your-ecr-uri>:<tag> PLATFORMS=linux/amd64 (single-arch, 1 ECR row). For Graviton + Mac too: drop PLATFORMS= and you get the default manifest list (linux/amd64 + linux/arm64, 3 ECR rows). Alternatively, unset ZSCALER_MCP_IMAGE_URI to fall back to the multi-arch Marketplace image. |
ECR shows several "untagged" rows after docker-build-multiarch | Buildx normally produces 5 rows: 1 image-index (the latest tag), 2 per-platform images (the actual amd64 / arm64 binaries the index points to), and 2 "unknown/unknown" rows with 0-byte size (in-toto provenance + SBOM attestations). The two attestation rows are useful for SLSA / AWS Marketplace but pure noise for dev. | The docker-build-multiarch Makefile target passes --provenance=false --sbom=false, so dev pushes show 3 rows total: the latest tag plus one untagged row per architecture. Drop to 1 row by adding PLATFORMS=linux/amd64 (single-arch, no manifest list). CI keeps attestations enabled for the official Marketplace image. |
ECSExpressGatewayService stays CREATING for >5 minutes | The default VPC subnets are misconfigured or have no Internet egress, or the container is restart-looping. | Run aws logs tail /ecs/zscaler-mcp --follow to inspect container output. aws ecs describe-express-gateway-service --service-arn <arn> shows status.statusReason. |
iam:PassRole denied on ecsTaskExecutionRole / ecsInfrastructureRoleForExpressServices | Deploying principal doesn't have iam:PassRole for those role ARNs. | Grant iam:PassRole on the two role ARNs with the condition iam:PassedToService = ecs.amazonaws.com. The exact policy is in the ECS Express getting-started guide. |
Re-deploy detects drift (Updating ECS Express service … image: … → …) and does a rolling deployment | The script now diffs the live service's image + healthCheckPath against the current ZSCALER_MCP_IMAGE_URI / desired health path and calls UpdateExpressGatewayService when they differ — zero-downtime rolling replace, no destroy needed. | Expected. Wait for the second "ECS Express status = ACTIVE" line, then test. If you actually want a from-scratch rebuild (different cluster, different IAM topology, fresh ALB), run destroy first. |
First POST /mcp in CloudWatch returns 421 Misdirected Request with WARNING:mcp.server.transport_security:Invalid Host header: zs-<hash>.ecs.us-east-1.on.aws | FastMCP's DNS-rebinding guard rejects every request whose Host header isn't in ZSCALER_MCP_ALLOWED_HOSTS. The ECS Express FQDN is AWS-generated and can't be known at .env time. | The script merges the discovered FQDN into the container's ZSCALER_MCP_ALLOWED_HOSTS on every deploy (deduplicating, preserving every other entry already in your .env). If you're upgrading from a build that pre-dates this fix, re-run deploy and an UpdateExpressGatewayService rolls in the FQDN. Full opt-out: ZSCALER_MCP_DISABLE_HOST_VALIDATION=true in .env — the script then touches nothing. |
| Same 421 but Host header is some other domain (CloudFront, custom DNS, API Gateway in front of ECS Express) | You're fronting the MCP server with infrastructure that rewrites or adds a Host the script doesn't know about. | Add the externally-visible hostname to your .env: ZSCALER_MCP_ALLOWED_HOSTS=mcp.acme.com. The script merges the ECS Express FQDN AND 127.0.0.1:*,localhost:* into whatever you provide, so all three hostnames end up in the container's allowlist. |
After destroy, the ECS console still lists ACTIVE task definitions like zscaler-mcp-zscaler-mcp-server:17 / …:18 | delete_express_gateway_service only tears down the service + ALB + target groups; it deliberately leaves task-definition revisions intact as immutable history. They're account-scoped (not cluster-scoped), so they don't block cluster deletion — but they accumulate across deploy cycles and pollute the console. | destroy now runs a two-step cleanup per family ({cluster}-{service}): deregister_task_definition on every ACTIVE revision, then delete_task_definitions (batched 10 at a time per AWS API limit). To clean up leftovers from older script revisions manually: aws ecs list-task-definitions --family-prefix zscaler-mcp-zscaler-mcp-server --status ACTIVE then aws ecs deregister-task-definition --task-definition <arn> and finally aws ecs delete-task-definitions --task-definitions <arn1> <arn2> …. |
destroy finishes but delete_cluster raises ClusterContainsServicesException even though the express service is gone | ECS Express's delete_express_gateway_service returns the instant the service flips to INACTIVE, but the underlying ALB + target groups + occasional draining task instances keep tearing down asynchronously. A cluster delete that races into that tail still sees the residue and refuses. | delete_ecs_cluster now polls every 15s for up to 3 min, treating ClusterContains{Services,Tasks,ContainerInstances}Exception + ResourceInUseException as "wait a bit longer" instead of hard failures. If you still hit the timeout, the script prints the exact aws ecs list-services command to inspect and you can finish the cluster delete manually. |
destroy reports Keeping ECS cluster … even though the script originally created it. The cluster persists across destroy cycles. | Cluster ownership used to be tracked via a per-deploy state-file boolean (cluster_created_by_us). After destroy --keep-ecs → deploy, the second deploy correctly reused the existing cluster — and recorded cluster_created_by_us=False. Every subsequent destroy then refused to delete it. Bug. | Fixed: ownership is now read from the cluster's managed-by=zscaler-mcp-harness tag (attached on every CreateCluster call). The tag survives any number of deploy/destroy cycles, so a cluster the script ever created stays "owned" forever. To verify: aws ecs describe-clusters --clusters zscaler-mcp --include TAGS. If you ran a deploy before this fix and the destroy summary still says "will be kept", just re-run deploy once — the new code rechecks tags and gets it right next destroy. |
deploy fails immediately with ECS cluster zscaler-mcp is currently DEPROVISIONING — another deploy or destroy is in flight | A previous destroy is still tearing down the cluster (Express Mode's ALB/target-group teardown can take 1-3 min after the service is INACTIVE). Re-deploying instantly would race into a ClusterAlreadyExistsException from the AWS API. | Wait 1-3 minutes for the cluster status to clear, then re-run deploy. Or use --ecs-cluster-name <other-name> to create a fresh cluster alongside. Inspect with aws ecs describe-clusters --clusters zscaler-mcp. |
deploy fails with InvalidParameterException ... Unable to Start a service that is still Draining | The cluster's ECS service (not the cluster itself) is in DRAINING state from a recent destroy. ECS Express's delete_express_gateway_service returns when the service flips to INACTIVE, but the underlying classic ECS service keeps draining ALB targets + tasks for another 1-3 min — and list_services hides DRAINING services from results, so discover_ecs_express_service can't detect it before attempting CreateExpressGatewayService. | When you re-run deploy, the new interactive cluster picker (Step 7) detects the cluster, surfaces the draining service in the count, and offers three escape hatches: (a) wait 1-3 min and pick option 1 to reuse, (b) pick option 2 to auto-generate a fresh zscaler-mcp-<random> cluster name and deploy alongside, or (c) pick option 3 to specify a custom cluster name. Option 2 is the fastest unblock when an old destroy is still in flight. |
deploy prompts "How would you like to handle the existing cluster?" every time, even on a clean re-deploy | This is intentional. The script no longer silently reuses an existing default-named cluster — it presents the choice so you don't accidentally deploy into someone else's cluster. | Three ways to skip the prompt: (a) the default option (1) reuses the cluster — just press Enter; (b) pass --ecs-cluster-name <any-name> on the CLI to bypass the prompt entirely (the resolver only prompts when the name is the default and a cluster with that name already exists); (c) for non-default workflows, set ECS_CLUSTER_NAME=<my-cluster> in .env. |
Container logs show RuntimeError: Could not load Zscaler configuration from Secrets Manager: AccessDenied shortly after startup | The ECS task execution role's scoped secretsmanager:GetSecretValue policy isn't there (rare — usually means a prior deploy was run with an older script revision and the role wasn't refreshed, or an out-of-band IAM change removed it). | Just re-run deploy. ensure_ecs_task_execution_role is idempotent and re-puts the ReadZscalerSecrets inline policy on every deploy. To verify by hand: aws iam get-role-policy --role-name zscaler-mcp-ecs-task-execution-role --policy-name ReadZscalerSecrets. |
Container logs show RuntimeError: ... ResourceNotFoundException on the Secrets Manager fetch | The ECS task is still running the OLD task-definition revision, which points at a secret name that no longer exists (e.g. you ran destroy then deploy again with a different --secret-name, and the Express service has not yet rolled over to the new revision). | Running aws ecs update-express-gateway-service --service-arn <arn> with no other changes forces a fresh rollout against the latest task definition. Or just re-run deploy once; the script will detect drift and roll out a fresh revision. |
Deploy fails with InvalidRequestException: ... scheduled for deletion on create_secret | A previous destroy soft-deleted the secret with the default 7-day recovery window and the same name is being reused. | The script handles this automatically (calls restore_secret + put_secret_value). If you need to skip the recovery window on future destroys, run destroy --force-secret-delete. To clean up manually: aws secretsmanager delete-secret --secret-id zscaler-mcp-harness/credentials --force-delete-without-recovery. |
| Operator wants to manage the secret out-of-band (Terraform) but the script keeps refreshing it | The script only treats the secret as "managed externally" when ZSCALER_SECRET_NAME is set in .env BEFORE the first deploy; otherwise it creates the secret and keeps refreshing it. | Set ZSCALER_SECRET_NAME=<arn> in .env and re-run deploy; subsequent runs (and destroy) will leave the secret alone. The decision is recorded in .aws-harness-state.json under zscaler_secret_managed_externally. Setting the variable BEFORE the first deploy gives the cleanest experience. |
Gateway topology only: deploy --topology gateway aborts at Step 5 with Ecr uri region 'us-east-1' does not match the application region '<other>'. Container images must be in the same region as the application. | AgentCore Runtime requires the container image in the same region as the Runtime. Unlike ECS Fargate (which happily pulls cross-region), the AgentCore control plane validates this hard at CreateAgentRuntime time. The default Marketplace image lives in us-east-1, so deploying the Gateway topology anywhere else needs either a region change or a same-region copy of the image. | Two paths. (A) Easiest: redeploy in us-east-1 — set AWS_REGION=us-east-1 in .env or pass --region us-east-1. (B) Stay in your current region: replicate the image to ECR in your own account. The script now fails fast (before any IAM role is created) and prints the exact docker pull / docker tag / docker push commands plus the ZSCALER_MCP_IMAGE_URI=… line to add to .env. If you ran an older script revision and a Runtime exec role was created before the validation hit, clean it up with aws iam delete-role --role-name zscaler-mcp-harness-runtime-role (and any inline policies it carries). |
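
Several rows above come down to checks you can run locally before calling deploy or destroy. The following is a minimal sketch of four of them, not code taken from the script: the function names are hypothetical, the ECR URI shape (<account>.dkr.ecr.<region>.amazonaws.com/<repo>) is an assumption about how ZSCALER_MCP_IMAGE_URI is normally written, and the example hostname is made up.

```python
import re
from typing import Iterator, List, Optional, Sequence

# Pattern quoted verbatim in the ValidationException row.
HARNESS_NAME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9_]{0,39}")

# Assumed private-ECR URI shape: <12-digit account>.dkr.ecr.<region>.amazonaws.com/<repo>
ECR_URI_RE = re.compile(r"^\d{12}\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com/")


def is_valid_harness_name(name: str) -> bool:
    """True when the whole name matches the AgentCore pattern (no hyphens, max 40 chars)."""
    return HARNESS_NAME_RE.fullmatch(name) is not None


def merge_allowed_hosts(existing: str, *extra: str) -> str:
    """Merge hosts into a comma-separated allowlist, dropping duplicates, keeping first-seen order."""
    merged: List[str] = []
    for host in [*existing.split(","), *extra]:
        host = host.strip()
        if host and host not in merged:
            merged.append(host)
    return ",".join(merged)


def ecr_region(image_uri: str) -> Optional[str]:
    """Return the region embedded in an ECR image URI, or None if the URI is not ECR-shaped."""
    match = ECR_URI_RE.match(image_uri)
    return match.group(1) if match else None


def batches(items: Sequence[str], size: int = 10) -> Iterator[Sequence[str]]:
    """Yield slices of at most `size` items, e.g. ARNs for one delete_task_definitions call."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Hyphenated names are the most common cause of the ValidationException.
    print(is_valid_harness_name("zscaler-mcp-harness"))
    # Made-up FQDN; the real one is generated by AWS at deploy time.
    print(merge_allowed_hosts("mcp.acme.com", "zs-example.ecs.us-east-1.on.aws",
                              "127.0.0.1:*", "localhost:*"))
    # Compare this against AWS_REGION before deploying the gateway topology.
    print(ecr_region("123456789012.dkr.ecr.us-east-1.amazonaws.com/zscaler-mcp:latest"))
```

Running merge_allowed_hosts a second time on its own output returns the same string, which mirrors the "merge on every deploy" behaviour described in the 421 rows. The region check only recognises private ECR URIs of the assumed shape; anything else returns None and should be treated as "unknown" rather than as a mismatch.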
Where to go next
- ../../../local_dev/aws_harness_agent/integration-analysis.md: full architecture write-up, three integration topologies, deliverables, open questions.
- ../bedrock-agentcore/README.md: the AgentCore Runtime deployment path (the URL we co-deploy from here, or the alternative if you don't need Harness at all).
- ../bedrock-agentcore/strands_agent_chat.py: local terminal client for the AgentCore Runtime path; the equivalent for the Harness path is the invoke subcommand of this script (a full multi-turn Harness chat client is on the roadmap).