Work · Nike · 2025
A regulated data migration with a fixed deadline
A population of legacy consent records needed to move through a fixed renewal path. The work was not just writing a batch job. It was proving the deadline math, keeping the live platform safe, and leaving a trail the team could defend later.
Selected outcomes
- Large population of regulated consent records moved through the required renewal path within the deadline.
- Validation compared post-migration state against the source of record instead of relying on count checks alone.
- Pipeline stayed inside its capacity envelope without competing with live traffic.
- Rollback path designed and tested before the higher-risk rollout phases.
The problem in one paragraph
The platform stores consent records keyed by jurisdiction and policy. For the relevant regulated consent category, records had to be renewed on a defined cadence. The product fix was to re-prompt affected users on their next eligible visit. The data work was to mark those records correctly without disturbing valid consent that did not need to move.
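To make that shape concrete, here is a minimal sketch of what such a record might look like. The class, field names, and status values are illustrative, not the platform's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Illustrative shape only; every field name here is hypothetical,
# not the platform's real schema.
@dataclass
class ConsentRecord:
    user_id: str
    jurisdiction: str     # e.g. "EU", "US-CA"
    policy_version: str   # the policy version the user consented to
    granted_at: datetime  # when consent was last captured
    status: str           # "valid" | "needs_reprompt" | "renewed"

def needs_renewal(granted_at: datetime, cadence: timedelta, now: datetime) -> bool:
    """Due for renewal once the record's age reaches the required cadence."""
    return now - granted_at >= cadence
```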
Constraints
A few things made this more than a one-time script:
- Volume. A large population of records spread across data slices that could not be treated like a simple table scan.
- Liveness. The migration had to coexist with production traffic, not block it.
- Reversibility. Every step needed a way back. Marking a record incorrectly as "needs re-prompt" is recoverable; marking a record as renewed when it was not is a different class of mistake.
- Auditability. Compliance reviewers needed the trail. Every record touched needed a corresponding entry in the audit pipeline that downstream legal/compliance could reconcile against.
- Time. The deadline was real.
Approach
The shape ended up being three stages:
Spike. First, answer "what does the data actually look like?" Source-of-record queries sized the affected population by rule-relevant attributes and record age. This is where the deadline math happened: given the renewal cadence, what had to move by what date, and what sustained rate was required through the window.
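The arithmetic itself is simple; the discipline was writing it down before committing to a plan. A sketch, with made-up numbers standing in for the real population, deadline, and buffer:

```python
from datetime import date
import math

# Illustrative deadline math; the population, dates, and buffer factor
# are hypothetical, not the real figures.
affected_records = 40_000_000        # sized by the spike queries
deadline = date(2025, 6, 30)
today = date(2025, 5, 1)

days_available = (deadline - today).days
# Hold back a buffer for reruns, triage, and the reconciliation tail.
usable_days = math.floor(days_available * 0.8)

required_rate = affected_records / (usable_days * 24 * 3600)
print(f"sustained rate needed: {required_rate:.0f} records/sec over {usable_days} days")
```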
Data pipeline. A batch job read from source-of-record data, applied the renewal criteria, and wrote back through the platform's normal write path. Bounded filters kept the job scoped; throttled writes kept it from competing with live traffic for capacity. The run was defined through infrastructure as code so it could be reproduced across environments.
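As a structural sketch of that loop (not the actual job), it looked roughly like this. `scan_source`, `is_due_for_renewal`, `mark_needs_reprompt`, and `emit_audit_entry` are hypothetical stand-ins for the platform's real data-access and audit calls:

```python
import time
from typing import Iterable, List

MAX_WRITES_PER_SEC = 500  # throttle ceiling; keeps the job inside its capacity envelope

def scan_source(batch_size: int) -> Iterable[List[dict]]:
    """Stand-in for a bounded, paged scan over the source of record."""
    yield from ()  # placeholder: the real scan pages through data slices

def is_due_for_renewal(record: dict) -> bool: ...   # the bounded filter (hypothetical)
def mark_needs_reprompt(record: dict) -> None: ...  # normal write path (hypothetical)
def emit_audit_entry(record: dict) -> None: ...     # audit pipeline (hypothetical)

def run_batch(batch_size: int = 1000) -> None:
    interval = batch_size / MAX_WRITES_PER_SEC  # time budget per batch
    for batch in scan_source(batch_size):
        started = time.monotonic()
        for record in batch:
            if not is_due_for_renewal(record):  # skip valid consent entirely
                continue
            mark_needs_reprompt(record)  # write through the platform's normal write path
            emit_audit_entry(record)     # one audit entry per record touched
        # Throttle: sleep off whatever is left of this batch's time budget.
        elapsed = time.monotonic() - started
        if elapsed < interval:
            time.sleep(interval - elapsed)
```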
Validation. A second query set compared the post-migration state against the source of record. Not a count check — a record-level diff. Edge cases got logged and triaged manually. That validation is what made the deadline-met story honest rather than aspirational.
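A minimal sketch of what "record-level diff, not count check" means, assuming post-migration state can be keyed by record id. The example at the bottom shows why the distinction matters: equal counts can hide real mismatches that cancel out.

```python
# `source` holds the expected post-migration state derived from the source
# of record; `platform` holds what the platform actually stores.
def diff_records(source: dict, platform: dict) -> list:
    mismatches = []
    for record_id, expected in source.items():
        actual = platform.get(record_id)
        if actual != expected:
            # Log the pair, not just a counter: edge cases get triaged by hand.
            mismatches.append((record_id, expected, actual))
    # Records on the platform but absent from the source of record are also
    # mismatches, which a pure count check can silently cancel out.
    for record_id in platform.keys() - source.keys():
        mismatches.append((record_id, None, platform[record_id]))
    return mismatches

# Counts match (2 vs 2), yet the diff surfaces three real errors.
source = {"a": "renewed", "b": "needs_reprompt"}
platform = {"a": "needs_reprompt", "c": "renewed"}
print(diff_records(source, platform))
```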
Tradeoffs
The tradeoffs that mattered:
- Single run vs phased batches. I chose a smaller number of controlled runs because the source shape made repeated restarts cost more than they saved. The downside: if a run died midway, the restart math was nontrivial. I wrote that math down before I needed it; a sketch of it follows this list.
- Bounded filter vs full scan. Bounded filters let the job skip records that did not need touching, which was much cheaper, but the filter definition was the thing most likely to be subtly wrong. I treated that definition as the highest-value test target; see the boundary test after this list.
- Coexisting with live traffic vs blocking writes. I didn't block live writes during the migration. The cost was that records in flight during that window required a second pass after the main run. The benefit was that the platform stayed up.
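On the first tradeoff: here is the restart math, sketched with illustrative numbers. It assumes each completed batch logs a durable high-water mark and that writes are idempotent, so re-scanning an overlap window around the mark is safe. None of these figures are the real ones:

```python
# Hypothetical restart math for a run that dies midway.
total_records = 40_000_000
processed = 23_400_000   # last durable high-water mark before the crash
rate_per_sec = 450       # observed sustained rate
overlap = 1_000_000      # re-scan window around the mark; writes are idempotent

remaining = total_records - processed + overlap
hours_left = remaining / rate_per_sec / 3600
print(f"restart from the high-water mark: ~{hours_left:.1f} hours of work remaining")
```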
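On the second tradeoff: the kind of boundary test the filter definition deserved. It reuses the hypothetical needs_renewal predicate from the earlier sketch, restated here so the test is self-contained; the cases pin down exactly-at-cadence, just-inside, and just-outside:

```python
from datetime import datetime, timedelta

def needs_renewal(granted_at: datetime, cadence: timedelta, now: datetime) -> bool:
    return now - granted_at >= cadence

def test_filter_boundary():
    now = datetime(2025, 6, 1)
    cadence = timedelta(days=365)
    assert needs_renewal(now - timedelta(days=366), cadence, now)      # clearly due
    assert needs_renewal(now - timedelta(days=365), cadence, now)      # boundary: due
    assert not needs_renewal(now - timedelta(days=364), cadence, now)  # not yet due

test_filter_boundary()
```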
What I'd do next
The reconciliation tail was the thing I'd most want to design differently next time. A cleaner mechanism for catching long-tail edge cases up front — instead of leaving them as a follow-up — would have made the rollout feel less like two stages and more like one.
Also: more dashboards earlier. I built dashboards reactively as the rollout surfaced questions. Building them ahead of time would have saved a few stressful hours of "wait, is the throughput we expected actually the throughput we're seeing."