Federated learning sounds clean in papers. Each client trains locally, a server aggregates, nobody shares raw data. In practice, getting it to actually run across distributed nodes on shared cloud infrastructure with mismatched GPU drivers is a different problem entirely.
This is a writeup of what I built for our privacy-preserving brain tumor detection project, focusing on the parts that were actually hard.
The problem
Medical imaging data can't leave hospital systems. That's not a technical constraint - it's a legal and ethical one. Federated learning is the standard approach here: train a shared model without centralizing patient data.
The specific challenge we were working on was multimodal brain tumor detection using both MRI and CT scans. Most federated learning work in medical imaging treats these as separate problems. We wanted a single model that could process both modalities together, trained across simulated hospital clients where the data distribution was intentionally non-uniform.
The dual-stream architecture
My main contribution to the project was the neural network architecture. The core question was how to jointly learn from MRI and CT scans without one modality dominating the other.
The approach I landed on was late fusion with two separate processing streams. MRI inputs and CT inputs go through their own independent CNN trunks before being concatenated and passed to a shared classification head. This lets each stream develop modality-specific feature representations before combining them.
MRI input → Stream A (CNN) ─┐
                            ├─ concat → classifier head → output
CT input  → Stream B (CNN) ─┘
Early fusion - combining the inputs before any processing - consistently performed worse in our experiments. The model would effectively learn to ignore one modality when the other was more discriminative for a given sample. Late fusion kept both streams honest.
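The dual-stream idea can be sketched in PyTorch. This is a minimal illustration of the late-fusion structure, not our exact architecture - channel counts, layer depths, and the head size are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class Stream(nn.Module):
    """One modality-specific CNN trunk (layer sizes are illustrative)."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (B, 32)
        )

    def forward(self, x):
        return self.net(x)

class DualStreamNet(nn.Module):
    """Late fusion: independent trunks, features concatenated into a shared head."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.mri_stream = Stream()
        self.ct_stream = Stream()
        self.head = nn.Sequential(
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, num_classes),
        )

    def forward(self, mri, ct):
        # Each modality gets its own features before any interaction happens.
        fused = torch.cat([self.mri_stream(mri), self.ct_stream(ct)], dim=1)
        return self.head(fused)
```

The key property is that no cross-modality mixing happens until after the adaptive pooling, so neither stream can shortcut through the other's features during early layers.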
Non-IID data and personalized heads
The harder problem was the federated setup. In real hospital systems, data is inherently non-IID. One hospital might have mostly glioma cases. Another might have a disproportionate number of pediatric scans. A global model trained with standard FedAvg on this kind of data converges slowly and performs worse than a centralized model.
We addressed this with personalized local heads. The shared dual-stream trunk is aggregated globally across all clients. Each client keeps its own local classification head that is never shared. The idea is that the global trunk learns general feature extraction while the local head adapts to each client's specific data distribution.
Rahul built the Flower system infrastructure for this. My contribution was the architectural experiments - specifically comparing gradient-enabled versus gradient-free configurations for the local heads and figuring out where client drift was happening.
Client drift is what happens when local fine-tuning pulls each client's model away from the global optimum. After several rounds of local training, clients have models that perform well on their local data but diverge from each other, which degrades the global aggregation. The personalized head approach mitigates this by only drifting the head, not the shared trunk.
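The aggregation split is simple to express. Here is a minimal sketch assuming each client's parameters live in a dict whose trunk keys share a `trunk.` prefix and head keys a `head.` prefix - the key names and weighting scheme are illustrative, not our actual Flower code:

```python
def aggregate_trunk(client_states, client_sizes):
    """FedAvg over trunk parameters only, weighted by client dataset size."""
    total = sum(client_sizes)
    trunk_keys = [k for k in client_states[0] if k.startswith("trunk.")]
    return {
        k: sum(state[k] * (n / total)
               for state, n in zip(client_states, client_sizes))
        for k in trunk_keys
    }

def apply_round(client_states, client_sizes):
    """Push the aggregated trunk back to every client; head.* keys stay local."""
    global_trunk = aggregate_trunk(client_states, client_sizes)
    for state in client_states:
        state.update(global_trunk)  # personalized heads are never overwritten
    return client_states
```

After each round, every client shares the same trunk but keeps the head it trained on its own distribution - which is exactly what confines drift to the head.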
Getting it to actually run
This is the part that took the most time and gets the least coverage in papers.
We ran our experiments on Chameleon Cloud, a shared academic testbed. The GPU nodes available to us had Tesla P100s. Our PyTorch version wanted CUDA 11.8. The P100 drivers on the nodes supported an older CUDA version. Getting these to agree took a full day of environment debugging.
The fix involved pinning specific PyTorch and CUDA versions and setting environment variables explicitly rather than relying on automatic detection. The error messages when there's a CUDA version mismatch are often misleading - they point to the model code, not the environment. If training fails immediately on the first forward pass with a cryptic CUDA error, check your driver version before touching your model.
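A quick sanity check that saves time here: confirm what CUDA version your PyTorch build was compiled against before reading any stack traces. This snippet only uses standard PyTorch attributes and runs fine on CPU-only machines:

```python
import torch

# The CUDA version PyTorch was built against - this is what has to be
# compatible with the node's driver (compare against `nvidia-smi` output).
print("torch:", torch.__version__)
print("compiled for CUDA:", torch.version.cuda)  # None on CPU-only builds
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```

If `torch.version.cuda` disagrees with what the driver supports, no amount of model debugging will help.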
The other infrastructure problem was session persistence. Federated training runs take hours. SSH connections to shared cloud nodes drop. We lost two full training runs before I set up tmux sessions for all training processes. After that, long runs completed without interruption.
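The tmux pattern is minimal. Something like the following, where the session name and training script are illustrative placeholders:

```shell
# Start the training run in a detached tmux session so it survives SSH drops.
tmux new-session -d -s fedtrain 'python train_federated.py --rounds 50'

# Check it's running; reattach later with: tmux attach -t fedtrain
tmux ls
```

Detaching (Ctrl-b d) leaves the process running on the node; the SSH connection can drop freely after that.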
My workflow for avoiding environment issues: get the model logic working locally on macOS first, then port to the cloud. Running a small-scale simulation locally catches architectural bugs before you burn cloud GPU time finding them.
Results
The dual-stream architecture outperformed single-modality baselines on tumor classification. The personalized local head approach improved global model accuracy by 12% over the standard FedAvg baseline with a shared global head, despite the non-IID data distribution across clients.
The 12% improvement is specifically the gain from personalized heads over standard federated aggregation. It's not a comparison against a centralized model - a centralized model trained on pooled data would likely do better, but that's the point. We're not pooling the data. The question is how well you can do without it.
What I'd do differently
The late fusion approach worked but there's a known limitation: if one modality is missing for a given patient, the model degrades significantly. Attention-based fusion that can handle missing modalities gracefully would be more robust in a real deployment.
The personalized head approach also requires careful tuning of how many local steps to run before aggregation. Too few and the local head doesn't adapt. Too many and drift accumulates in the trunk even though we're trying to prevent it. We tuned this empirically but a more principled approach based on the client's data distribution would be better.
The infrastructure friction - GPU compatibility, session management, environment pinning - was honestly the biggest time sink in the project. Cloud research infrastructure is rough around the edges and the documentation assumes a level of sysadmin familiarity that most ML courses don't build.
