# `FlyDeploy.BlueGreen.PeerManager`

Manages the lifecycle of peer BEAM nodes for blue-green deploys.

## Architecture Overview

Blue-green mode runs two BEAM layers on a single Fly machine: a **parent**
node that never serves traffic, and a **peer** node (a child BEAM process
started via OTP's `:peer` module) that runs the user's full application and
binds the HTTP port. On upgrade, a *new* peer boots with new code (its
Endpoint binds via SO_REUSEPORT alongside the old one), the old peer's
Endpoint is stopped, and the old peer is terminated.

```
┌─ Fly Machine (single VM instance) ──────────────────────────────────────┐
│                                                                         │
│  Parent BEAM (long-lived, never serves traffic)                         │
│  ├─ BlueGreen.Supervisor                                                │
│  │   ├─ PeerManager          ← this module                              │
│  │   │   • starts/stops peer BEAM processes via :peer                   │
│  │   │   • handles cutover (stop old Endpoint)                          │
│  │   │   • on startup, checks S3 for pending blue-green reapply         │
│  │   │                                                                  │
│  │   └─ Poller (mode: :blue_green)                                      │
│  │       • polls S3 "blue_green_upgrade" field                          │
│  │       • on change → calls PeerManager.upgrade(tarball_url)           │
│  │                                                                      │
│  └─ (no Endpoint, no Repo, no app processes)                            │
│                                                                         │
│  Peer BEAM (child process, serves all traffic)                          │
│  ├─ User's full supervision tree                                        │
│  │   ├─ FlyDeploy Poller (mode: :hot)    ← polls "hot_upgrade" field    │
│  │   │   • applies hot code upgrades in-place inside the peer           │
│  │   │   • on startup, checks S3 for pending hot upgrade reapply        │
│  │   ├─ Repo, PubSub, Counter, ...                                      │
│  │   └─ Endpoint                         ← binds port via reuseport     │
│  │                                                                      │
│  └─ Code loaded from /tmp/fly_deploy_bg_<ts>/ (not /app/)               │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

## How Peers Are Started

Each peer is a separate OS process started via `:peer.start/1`. The parent:

1. Finds the bundled `erl` binary from the ERTS directory
2. Builds exec args: `-boot start_clean` (bypasses the release boot script),
   `-config` (sys.config from the release or extracted tarball),
   `-args_file` (vm.args)
3. Passes `-pa` flags for all code paths (ebin directories)
4. Calls `:peer.start(%{name: ..., exec: {erl, args}, ...})`
5. Boots Elixir + Logger via `:erpc.call`
6. Marks the peer with `Application.put_env(:fly_deploy, :__role__, :peer)`
7. Injects SO_REUSEPORT config so the Endpoint can bind alongside an existing peer
8. Calls `ensure_all_started(otp_app)` (blocking — returns when fully started)

Why `-boot start_clean`? Without it, the peer inherits the parent's release
boot script, which auto-starts all apps before the parent can mark
`__role__: :peer`, and which computes node names from `FLY_IMAGE_REF`,
producing invalid names.
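
As a rough illustration of steps 1 to 6 and 8, the sketch below builds the exec arguments and the options map that would be handed to `:peer.start/1`. The module name `PeerStartSketch` and its function names are hypothetical, the file paths are placeholders, and step 7 (SO_REUSEPORT injection) is left out; this is a minimal sketch under those assumptions, not PeerManager's actual code.

```elixir
defmodule PeerStartSketch do
  # Hypothetical helper module, for illustration only.

  # Step 1: locate the erl binary inside the ERTS directory of the running VM.
  def erl_path do
    root = List.to_string(:code.root_dir())
    erts_vsn = List.to_string(:erlang.system_info(:version))
    Path.join([root, "erts-" <> erts_vsn, "bin", "erl"])
  end

  # Steps 2-3: boot flags plus one -pa flag per ebin directory.
  # :peer expects exec arguments as charlists.
  def exec_args(sys_config, vm_args, ebin_dirs) do
    boot = ["-boot", "start_clean", "-config", sys_config, "-args_file", vm_args]
    paths = Enum.flat_map(ebin_dirs, fn dir -> ["-pa", dir] end)
    Enum.map(boot ++ paths, &String.to_charlist/1)
  end

  # Step 4: the options map that would be passed to :peer.start/1.
  def peer_options(name, sys_config, vm_args, ebin_dirs) do
    args = exec_args(sys_config, vm_args, ebin_dirs)
    %{name: name, exec: {String.to_charlist(erl_path()), args}}
  end

  # Steps 5, 6 and 8: boot Elixir and Logger, mark the role, start the app.
  # Assumes `node` is a peer that is already running and connected.
  def boot_app(node, otp_app) do
    {:ok, _} = :erpc.call(node, :application, :ensure_all_started, [:elixir])
    {:ok, _} = :erpc.call(node, :application, :ensure_all_started, [:logger])
    :ok = :erpc.call(node, Application, :put_env, [:fly_deploy, :__role__, :peer])
    :erpc.call(node, Application, :ensure_all_started, [otp_app])
  end
end
```

Because the role is set through `:erpc.call` before `ensure_all_started/1` runs, the application's own start callbacks can read `:__role__` when they execute, which is the ordering that `-boot start_clean` makes possible.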

## Blue-Green Upgrade Flow

When the parent's Poller detects a new `"blue_green_upgrade"` in S3:

```
Poller ──→ PeerManager.upgrade(tarball_url)
             │
             ├─ 1. Download tarball from S3
             ├─ 2. Extract to /tmp/fly_deploy_bg_<ts>/
             ├─ 3. Build code paths from extracted ebin dirs
             ├─ 4. Start new peer with new code paths
             │      └─ Peer fully boots (Endpoint binds via reuseport)
             ├─ 5. Stop old peer's Endpoint
             └─ 6. Stop old peer entirely
```

Key properties:
- **Zero downtime**: Both old and new Endpoints serve simultaneously via
  SO_REUSEPORT during the brief overlap, then old Endpoint stops.
- **Clean state**: The new peer starts fresh — no `code_change/3`, no state
  migration. This is the key difference from hot upgrades.
- **New PIDs**: The new peer is a separate BEAM OS process, so every
  Erlang process inside it gets a new PID.

## Hot Upgrades Inside Peers

The peer runs its own `FlyDeploy.Poller` with `mode: :hot` (started as
`{FlyDeploy, otp_app: :my_app}` in the user's supervision tree). This
Poller polls the `"hot_upgrade"` field in S3, completely independent of
the parent's Poller which watches `"blue_green_upgrade"`.

When a hot upgrade is detected inside the peer:

```
Peer's Poller ──→ FlyDeploy.hot_upgrade(tarball_url, app)
                    │
                    ├─ Download tarball from S3
                    ├─ Copy .beam files to where :code.which() says
                    │   they're loaded (/tmp/fly_deploy_bg_<ts>/lib/...)
                    ├─ Detect changed modules via :code.modified_modules()
                    ├─ Phase 1: Suspend ALL processes using changed modules
                    ├─ Phase 2: Purge + load ALL new code
                    ├─ Phase 3: :sys.change_code on ALL processes
                    └─ Phase 4: Resume ALL processes
```

This works because the Upgrader uses `:code.which(module)` to find where
each module is currently loaded from, then copies new beams to that same
path. Whether the peer loaded code from `/app/lib/` or
`/tmp/fly_deploy_bg_<ts>/lib/`, the hot upgrade lands in the right place.
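
The path lookup and the four phases might look roughly like the sketch below. `HotSwapSketch` and its function names are hypothetical, and the `{pid, callback_module}` input shape is an assumption made for illustration. The `:code` and `:sys` calls are standard OTP functions, but the real Upgrader's process discovery and error handling are not shown in this document.

```elixir
defmodule HotSwapSketch do
  # Hypothetical module, for illustration only.

  # Where a new .beam for `module` should be copied: the file the module was
  # loaded from. :code.which/1 returns an atom such as :non_existing or
  # :preloaded when there is no file on disk.
  def beam_destination(module) do
    case :code.which(module) do
      path when is_list(path) and path != [] -> {:ok, List.to_string(path)}
      other -> {:error, other}
    end
  end

  # `procs` is a list of {pid, callback_module} pairs for OTP-compliant
  # processes; `changed_modules` is a list of module atoms.
  def swap(procs, changed_modules) do
    # Phase 1: suspend every affected process.
    Enum.each(procs, fn {pid, _mod} -> :sys.suspend(pid) end)

    # Phase 2: purge old code, then load the new beams from disk.
    Enum.each(changed_modules, fn mod ->
      :code.purge(mod)
      {:module, ^mod} = :code.load_file(mod)
    end)

    # Phase 3: let each process migrate its state via code_change/3.
    Enum.each(procs, fn {pid, mod} -> :sys.change_code(pid, mod, :undefined, []) end)

    # Phase 4: resume.
    Enum.each(procs, fn {pid, _mod} -> :sys.resume(pid) end)
    :ok
  end
end
```

Suspending everything before loading anything means no process runs new code against state that has not yet been migrated.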

## S3 State: Separate Fields

The deployment metadata in S3 (`releases/<app>-current.json`) has two
independent fields so blue-green and hot upgrades coexist:

```json
{
  "image_ref": "registry.fly.io/app:deployment-ABC",
  "blue_green_upgrade": {
    "tarball_url": "https://s3/.../app-0.2.0.tar.gz",
    "source_image_ref": "registry.fly.io/app:deployment-DEF",
    ...
  },
  "hot_upgrade": {
    "tarball_url": "https://s3/.../app-0.2.1.tar.gz",
    "source_image_ref": "registry.fly.io/app:deployment-GHI",
    ...
  }
}
```

Rules:
- `mix fly_deploy.hot` (default mode) → writes `"hot_upgrade"`, preserves
  `"blue_green_upgrade"`
- `mix fly_deploy.hot --mode blue_green` → writes `"blue_green_upgrade"`,
  clears `"hot_upgrade"` (new peer = fresh start, old hot patches subsumed)
- `fly deploy` (cold deploy) → machines detect image_ref mismatch and
  reset both fields to nil
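
These rules can be expressed as pure functions over the decoded JSON map. The module `DeployStateSketch` and its function names are hypothetical. Comparing the machine's own image ref against the stored `"image_ref"` is an assumption about how the mismatch is detected, and updating `"image_ref"` itself is left out.

```elixir
defmodule DeployStateSketch do
  # Hypothetical module: `state` is the decoded JSON map with string keys.

  # Hot deploy: write "hot_upgrade", leave "blue_green_upgrade" untouched.
  def record(state, :hot, upgrade) do
    Map.put(state, "hot_upgrade", upgrade)
  end

  # Blue-green deploy: write "blue_green_upgrade" and clear "hot_upgrade",
  # since the new peer starts fresh and subsumes older hot patches.
  def record(state, :blue_green, upgrade) do
    state
    |> Map.put("blue_green_upgrade", upgrade)
    |> Map.put("hot_upgrade", nil)
  end

  # Cold deploy: on an image_ref mismatch, reset both fields to nil.
  def reconcile(state, machine_image_ref) do
    if state["image_ref"] == machine_image_ref do
      state
    else
      state
      |> Map.put("blue_green_upgrade", nil)
      |> Map.put("hot_upgrade", nil)
    end
  end
end
```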

## Restart Reapply Flow

When a Fly machine restarts (crash, scaling, `fly machine restart`), both
layers are reapplied from S3:

```
Machine restarts
  │
  ├─ Parent boots
  │   └─ BlueGreen.Supervisor starts
  │       ├─ PeerManager.init
  │       │   ├─ resolve_startup_code(otp_app)
  │       │   │   └─ Reads S3 "blue_green_upgrade" field
  │       │   │       → Downloads tarball → extracts to /tmp/fly_deploy_bg_<ts>/
  │       │   ├─ start_peer(otp_app, new_code_paths)
  │       │   │   └─ Peer boots with /tmp/fly_deploy_bg_<ts>/ code (v2)
  │       │   │       ├─ {FlyDeploy, otp_app: :app} starts Poller (mode: :hot)
  │       │   │       │   └─ startup_apply_current reads S3 "hot_upgrade"
  │       │   │       │       → Downloads v2-hot tarball
  │       │   │       │       → Copies beams to /tmp/fly_deploy_bg_<ts>/ paths
  │       │   │       │       → Loads via :c.lm() (no suspend at startup)
  │       │   │       │       → Peer now running v2-hot code
  │       │   │       ├─ Counter, Repo, PubSub, ...
  │       │   │       └─ Endpoint (binds port via reuseport)
  │       │
  │       └─ Poller (mode: :blue_green)
  │           └─ Polls for future blue-green upgrades
  │
  └─ Result: machine serves v2-hot traffic (blue-green base + hot overlay)
```

## Cutover Details

With SO_REUSEPORT, the old and new Endpoints bind the same port
simultaneously. The new peer's Endpoint starts during `start_peer`
(a blocking `:erpc.call`). Once it is up, the old Endpoint is stopped.
There is no gap: both peers serve traffic during the overlap.
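
The overlap can be demonstrated outside of Phoenix with two plain listening sockets. This sketch assumes Linux, where `SOL_SOCKET` is 1 and `SO_REUSEPORT` is 15, and passes the flag as a raw socket option. `ReuseportSketch` is a hypothetical module, and how FlyDeploy actually injects the option into the Endpoint config is not shown in this document.

```elixir
defmodule ReuseportSketch do
  # Hypothetical module. Linux-only raw option: level 1 (SOL_SOCKET),
  # option 15 (SO_REUSEPORT), value 1 as a native-endian 32-bit integer.
  @reuseport {:raw, 1, 15, <<1::32-native>>}

  # Opens a listening socket with SO_REUSEPORT set before bind.
  def listen(port) do
    :gen_tcp.listen(port, [:binary, @reuseport, active: false])
  end
end

# "Old" listener on an OS-assigned port, then a "new" one on the same port.
{:ok, old_socket} = ReuseportSketch.listen(0)
{:ok, port} = :inet.port(old_socket)
{:ok, new_socket} = ReuseportSketch.listen(port)

# Cutover: close the old listener; the new one keeps accepting.
:ok = :gen_tcp.close(old_socket)
:ok = :gen_tcp.close(new_socket)
```

Without the option on both sockets, the second `listen` would be expected to fail with `{:error, :eaddrinuse}`.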

## Why the Parent Never Serves Traffic

The parent node's only job is process management:
- Start/stop peer BEAM processes
- Poll S3 for blue-green upgrades
- Coordinate cutover

It has no Repo, no Endpoint, no business logic processes. This means:
- Parent crashes don't affect traffic (peer keeps running independently)
- Parent restarts cleanly without port conflicts
- Upgrade logic is isolated from application logic

## Tarball Types

PeerManager handles two tarball formats:

- **Full release** (blue-green mode): Contains `lib/` + `releases/`
  (sys.config, vm.args, boot files, consolidated protocols). The peer uses
  100% new code paths — no mixing with the parent's code.

- **Beam-only** (hot mode, fallback): Contains just `.beam` files and
  consolidated protocols. Merged with the parent's existing code paths
  (new ebin dirs replace matching app dirs).

Full release tarballs are detected by the presence of a `releases/`
directory with a `sys.config` file.
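
A minimal sketch of that detection, assuming `sys.config` sits either directly under `releases/` or one level down in a version directory. The exact location PeerManager checks is not stated here, and `TarballTypeSketch` is a hypothetical module.

```elixir
defmodule TarballTypeSketch do
  # Hypothetical module. Classifies an extracted tarball directory as
  # :full_release or :beam_only.
  def classify(dir) do
    # Assumed locations: releases/sys.config or releases/<version>/sys.config.
    patterns = ["releases/sys.config", "releases/*/sys.config"]

    found? =
      Enum.any?(patterns, fn pattern ->
        Path.wildcard(Path.join(dir, pattern)) != []
      end)

    if found?, do: :full_release, else: :beam_only
  end
end
```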

# `child_spec`

Returns a specification to start this module under a supervisor.

See `Supervisor`.

# `get_info`

Returns a status map for this machine's blue-green state.

# `peer_node`

Returns the active peer's node name.

Useful for remsh-ing into the peer from the parent:

    /app/bin/myapp rpc 'IO.puts(FlyDeploy.BlueGreen.PeerManager.peer_node())'
    RELEASE_NODE=<output> /app/bin/myapp remote

# `start_link`

# `upgrade`

Triggers a blue-green upgrade with new code paths.

Called by the Poller when it detects a new release in S3.
Downloads the tarball, extracts it, and starts a new peer with the new code.

Returns `{:error, :upgrade_in_progress}` if an upgrade is already running.

# `upgrading?`

Returns true if a blue-green upgrade is currently in progress.

Uses `:persistent_term` so it's readable from any process without
blocking on the PeerManager GenServer (which is busy doing the upgrade).
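
The flag pattern might look roughly like this. `UpgradeFlagSketch`, its key, and `with_flag/1` are hypothetical names. The check-then-set is not atomic on its own, so the sketch assumes only one process (such as the GenServer) ever calls `with_flag/1`.

```elixir
defmodule UpgradeFlagSketch do
  # Hypothetical module showing a :persistent_term-backed "in progress" flag.
  @key {__MODULE__, :upgrading}

  # Readable from any process without a GenServer call.
  def upgrading?, do: :persistent_term.get(@key, false)

  # Runs `fun` with the flag set and always clears it afterwards.
  # Assumed to be called from a single owner process.
  def with_flag(fun) do
    if upgrading?() do
      {:error, :upgrade_in_progress}
    else
      :persistent_term.put(@key, true)

      try do
        fun.()
      after
        :persistent_term.put(@key, false)
      end
    end
  end
end
```

A `GenServer.call` to ask the same question would queue behind the running upgrade, which is the blocking that the `:persistent_term` read avoids.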

---

*Consult [api-reference.md](api-reference.md) for complete listing*