Federated learning improves site performance in multicenter deep learning without data sharing

Patient demographics

		Private Test Set Institution
		NCI	SUNY	UCLA
Patient demographics	Age (years)	66 (47–83)	66 (49–81)	65 (50–83)
Patient demographics	Prostate size (cc)	65.5 (21.7–231)	72.9 (26.8–210)	52.1 (15.8–147)

Table 1.

Patient demographics

		Private Test Set Institution
		NCI	SUNY	UCLA
Patient demographics	Age (years)	66 (47–83)	66 (49–81)	65 (50–83)
Patient demographics	Prostate size (cc)	65.5 (21.7–231)	72.9 (26.8–210)	52.1 (15.8–147)

Table 2.

Image acquisition parameters

	Private Test Set Institution
	NCI
	with endorectal coil (n = 50)	without endorectal coil (n = 50)	SUNY	UCLA
Vendor(s)	Philips Medical Systems		Siemens	Siemens
Field strength	3T		3T	3T
In-plane resolution (mm)	0.273mm	0.352mm	0.625mm	0.664mm
Slice thickness (mm)	3mm	3mm	3mm	1.5mm
Repetition Time (TR, ms)	4775	3686	5500	2230
Echo Time (TE, ms)	120	120	136	204

	Private Test Set Institution
	NCI
	with endorectal coil (n = 50)	without endorectal coil (n = 50)	SUNY	UCLA
Vendor(s)	Philips Medical Systems		Siemens	Siemens
Field strength	3T		3T	3T
In-plane resolution (mm)	0.273mm	0.352mm	0.625mm	0.664mm
Slice thickness (mm)	3mm	3mm	3mm	1.5mm
Repetition Time (TR, ms)	4775	3686	5500	2230
Echo Time (TE, ms)	120	120	136	204

Table 2.

Image acquisition parameters

	Private Test Set Institution
	NCI
	with endorectal coil (n = 50)	without endorectal coil (n = 50)	SUNY	UCLA
Vendor(s)	Philips Medical Systems		Siemens	Siemens
Field strength	3T		3T	3T
In-plane resolution (mm)	0.273mm	0.352mm	0.625mm	0.664mm
Slice thickness (mm)	3mm	3mm	3mm	1.5mm
Repetition Time (TR, ms)	4775	3686	5500	2230
Echo Time (TE, ms)	120	120	136	204

	Private Test Set Institution
	NCI
	with endorectal coil (n = 50)	without endorectal coil (n = 50)	SUNY	UCLA
Vendor(s)	Philips Medical Systems		Siemens	Siemens
Field strength	3T		3T	3T
In-plane resolution (mm)	0.273mm	0.352mm	0.625mm	0.664mm
Slice thickness (mm)	3mm	3mm	3mm	1.5mm
Repetition Time (TR, ms)	4775	3686	5500	2230
Echo Time (TE, ms)	120	120	136	204

Table 3.

Model evaluation results—private test sets

		Private Test Set Institution
		NCI (n = 20)	SUNY (n = 20)	UCLA (n = 20)	Overall (n = 60)
Private models	NCI	0.925 ± 0.016	0.854 ± 0.050	0.720 ± 0.165	0.833 ± 0.131*
	SUNY	0.887 ± 0.027	0.906 ± 0.018	0.768 ± 0.064	0.854 ± 0.074*
	UCLA	0.777 ± 0.102	0.575 ± 0.177	0.883 ± 0.069	0.745 ± 0.178*
FL Model		0.920 ± 0.029	0.880 ± 0.034	0.885 ± 0.032	0.895 ± 0.036

		Private Test Set Institution
		NCI (n = 20)	SUNY (n = 20)	UCLA (n = 20)	Overall (n = 60)
Private models	NCI	0.925 ± 0.016	0.854 ± 0.050	0.720 ± 0.165	0.833 ± 0.131*
	SUNY	0.887 ± 0.027	0.906 ± 0.018	0.768 ± 0.064	0.854 ± 0.074*
	UCLA	0.777 ± 0.102	0.575 ± 0.177	0.883 ± 0.069	0.745 ± 0.178*
FL Model		0.920 ± 0.029	0.880 ± 0.034	0.885 ± 0.032	0.895 ± 0.036

*

Significantly lower than FL model (P < .001).

Table 3.

Model evaluation results—private test sets

		Private Test Set Institution
		NCI (n = 20)	SUNY (n = 20)	UCLA (n = 20)	Overall (n = 60)
Private models	NCI	0.925 ± 0.016	0.854 ± 0.050	0.720 ± 0.165	0.833 ± 0.131*
	SUNY	0.887 ± 0.027	0.906 ± 0.018	0.768 ± 0.064	0.854 ± 0.074*
	UCLA	0.777 ± 0.102	0.575 ± 0.177	0.883 ± 0.069	0.745 ± 0.178*
FL Model		0.920 ± 0.029	0.880 ± 0.034	0.885 ± 0.032	0.895 ± 0.036

		Private Test Set Institution
		NCI (n = 20)	SUNY (n = 20)	UCLA (n = 20)	Overall (n = 60)
Private models	NCI	0.925 ± 0.016	0.854 ± 0.050	0.720 ± 0.165	0.833 ± 0.131*
	SUNY	0.887 ± 0.027	0.906 ± 0.018	0.768 ± 0.064	0.854 ± 0.074*
	UCLA	0.777 ± 0.102	0.575 ± 0.177	0.883 ± 0.069	0.745 ± 0.178*
FL Model		0.920 ± 0.029	0.880 ± 0.034	0.885 ± 0.032	0.895 ± 0.036

*

Significantly lower than FL model (P < .001).

Table 4.

Model evaluation results—ProstateX challenge dataset

		ProstateX (n = 343)
Private models	NCI	0.872 ± 0.062*
	SUNY	0.838 ± 0.043*
	UCLA	0.812 ± 0.136*
FL Model		0.889 ± 0.036

*

Significantly lower than FL model (P < .001).

Table 4.

Model evaluation results—ProstateX challenge dataset

		ProstateX (n = 343)
Private models	NCI	0.872 ± 0.062*
	SUNY	0.838 ± 0.043*
	UCLA	0.812 ± 0.136*
FL Model		0.889 ± 0.036

*

Significantly lower than FL model (P < .001).

In comparison, the FL model performed well on all 3 test sets. The FL model exhibited private test set mean Dice coefficients between 0.880 and 0.920, yielding an overall result of 0.895. The statistical analysis using 2-sided paired t-tests demonstrated that the FL model was significantly superior to any of the private models (P < .001 for all comparisons).

The private models exhibited varied performance on the challenge dataset (Dice coefficient range: 0.812–0.872). The generic FL model outperformed each of the private models, with an overall mean Dice coefficient of 0.889. The statistical analysis again demonstrated that the FL model was significantly superior to any of the private models (P < .001).

DISCUSSION

We sought to demonstrate that data-distributed learning can be successfully operationalized across multiple institutions with real patient data using federated learning, and that the resulting model would gain the benefit of having learned from each of the private datasets without ever needing to transfer or pool data at a single location.

Since no transfer of protected health information (or even deidentified health information) was required, we were able to address the privacy and data governance limitations inherent to multicenter studies through the use of simplified 2-way collaboration agreements, rather than requiring the negotiation of a complex 4-way collaboration and material transfer agreement that would have been required if data were shared across institutions. This allowed for expedited ethics and compliance reviews because of the minimal risk posed by the FL paradigm and enabled us to be assured that our patients’ privacy was maintained.

The FL model that we trained performed well across all of the private datasets, yielding an overall performance level that was significantly better than that of any of the private models alone. This suggests that the FL model was able to benefit from the advantage of learning important institution-specific knowledge through the FL aggregation paradigm, without requiring any individual training site to “see” the full breadth of inputs.

Additionally, our results showed that the FL model performed significantly better than any of the individual private models on the held-out challenge dataset, suggesting that the model also attained the expected advantages inherent in training with more data through the FL aggregation method, even though the full dataset was not seen at any single training site.

Our work does have limitations. In this work, we did not attempt to address the potential for an inside actor (ie, 1 of the participating institutions) to attempt to recover the underlying patient data through a model inversion attack on the trained weights shared during federated learning. Future enhancements to the federated approach could include the addition of calibrated distortion to shared model weights in order to suppress the potential for inversion. However, we believe the method we demonstrate in this paper significantly better protects the privacy of patients than the current standard of direct sharing of data. In addition, though model inversion is a technical risk that cannot be ruled out, we empirically believe that the practical risk of inversion outside of crafted malintent on the part of study designers to be low due to the weight averaging scheme in place. Finally, we note that the sharing of trained model weights is an accepted practice within healthcare,²⁸^,²⁹ and, in the worst case, our method is no less secure as only model weights are ever transmitted.

Secondly, the task we used (prostate segmentation on T2-weighted MRI) is relatively simple and all private models achieved high performance on their own institutional datasets. Thus, we were unable to demonstrate the expected benefit that an FL-trained model would significantly outperform a single-site-trained model on that single site’s data. In addition, because we used similarly sized private datasets at each institution, we did not explore the potential in varying the federated model aggregation methodology, which could be extended to differentially weight model weights from institutions based on data quantity, quality, or other metrics. Thirdly, adding additional institutions to the federation may present new challenges in heterogeneity of imaging data quality, governance, intellectual property, and model generalizability. In order to ensure that the FL model performs well at each institution in a large federation, it may be necessary in future work to explore adding an additional private fine-tuning step at each institution, though care must be taken to avoid losing generalizability through overfitting. This may require the use of additional techniques, such as the Learning without Forgetting method.³⁰

CONCLUSION

The power of federated learning was successfully demonstrated across 3 academic institutions using real clinical prostate imaging data. The federated model demonstrated improved performance across both held-out test sets from each institution and an external test set, validating the FL paradigm. This methodology could be applied to a wide variety of DL applications in medical image analysis and merits further study to enable accelerated development of DL models across institutions, enabling greater generalizability in clinical use.

FUNDING

This work was supported by NIH NCI grant/contract numbers F30CA210329, R21CA220352, P50CA092131, ZIDBC011242, Z1ACL040015, HHSN261200800001E, NIH NIGMS grant number GM08042, NIH grant numbers, the AMA Foundation, an NVIDIA Corporation Academic Hardware Grant, the NIH Center for Interventional Oncology, the Intramural Research Program of the NIH, and a cooperative research and development agreement (CRADA) between NIH and nVIDIA.

The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government.

AUTHOR CONTRIBUTIONS

KVS, SH, TS, HR, MGF, BJW, DE, BT, WS, and CWA contributed to the conception, design, oversight, and guidance of the work. KVS, SH, TS, AGR, RK, BJW, PLC, AMP, LSM, SSR, BT, and CWA contributed to data collection and/or annotation. HR, ZX, JT, DX, and MGF contributed to the underlying federated learning framework and adaptation for the application to this analysis. KVS, SH, TS, and HR primarily contributed to data analysis and interpretation. KVS drafted the first version of the article. All authors participated in critical revision of the article, approved the article for submission and publication, and agreed to be accountable for the work.

DATA AVAILABILITY STATEMENT

Data from the private datasets cannot be shared publicly for the privacy of individuals that participated in the study and IRB requirements. The PROSTATEx dataset is available publicly at the Cancer Imaging Archive, doi: 10.7937/K9TCIA.2017.MURS5CL.

ACKNOWLEDGMENTS

The authors appreciated the opportunity to present an abstract containing partial preliminary results from this project at the RSNA 2020 conference.

CONFLICT OF INTEREST STATEMENT

LSM and AMP report a financial interest in Avenda Health outside the submitted work. BJW reports personal fees and nonfinancial support from Philips during the conduct of the study. BJW, BT, and PLC report IP-related royalties from Philips. The NIH has cooperative research and development agreements with NVIDIA, Philips, Siemens, Xact Robotics, Celsion Corp, Boston Scientific, and research partnerships with Angiodynamics, ArciTrax, and Exact Imaging. No other authors have competing interests to disclose.

REFERENCES

1

AMA Council on Ethics and Judicial Affairs.

Code of Medical Ethics of the American Medical Association

.

Chicago, IL

:

American Medical Association

;

2017

.

Google Preview

OpenURL Placeholder Text

2

Gulshan

V

,

Peng

L

,

Coram

M

, et al.

Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs

.

JAMA

2016

;

316

(

22

):

2402

.

3

Quellec

G

,

Charrière

K

,

Boudi

Y

, et al.

Deep image mining for diabetic retinopathy screening

.

Med Image Anal

2017

;

39

:

178

–

93

.

4

Balachandar

N

,

Chang

K

,

Kalpathy-Cramer

J

, et al.

Accounting for data variability in multi-institutional distributed deep learning for medical imaging

.

J Am Med Informatics Assoc

2020

;

27

(

5

):

700

–

8

.

5

Yuan

Y

,

Chao

M

,

Lo

Y-C.

Automatic skin lesion segmentation using deep fully convolutional networks with Jaccard distance

.

IEEE Trans Med Imaging

2017

;

36

(

9

):

1876

–

86

.

6

Harangi

B.

Skin lesion classification with ensembles of deep convolutional neural networks

.

J Biomed Inform

2018

;

86

:

25

–

32

.

7

Esteva

A

,

Kuprel

B

,

Novoa

RA

, et al.

Dermatologist-level classification of skin cancer with deep neural networks

.

Nature

2017

;

542

(

7639

):

115

–

8

.

8

Bulten

W

,

Pinckaers

H

,

van Boven

H

, et al.

Automated deep-learning system for Gleason grading of prostate cancer using biopsies: a diagnostic study

.

Lancet Oncol

2020

;

21

(

2

):

233

–

41

.

9

Ehteshami Bejnordi

B

,

Veta

M

,

Johannes van Diest

P

, et al. ; and the CAMELYON16 Consortium.

Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer

.

JAMA

2017

;

318

(

22

):

2199

.

10

Coudray

N

,

Ocampo

PS

,

Sakellaropoulos

T

, et al.

Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning

.

Nat Med

2018

;

24

(

10

):

1559

–

67

.

11

Chilamkurthy

S

,

Ghosh

R

,

Tanamala

S

, et al.

Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study

.

Lancet

2018

;

392

(

10162

):

2388

–

96

. doi:10.1016/S0140-6736(18)31645-3

12

McKinney

SM

,

Sieniek

M

,

Godbole

V

, et al.

International evaluation of an AI system for breast cancer screening

.

Nature

2020

;

577

(

7788

):

89

–

94

.

13

Russakovsky

O

,

Deng

J

,

Su

H

, et al.

ImageNet large scale visual recognition challenge

.

Int J Comput Vis

2015

;

115

(

3

):

211

–

52

.

14

Yasaka

K

,

Abe

O.

Deep learning and artificial intelligence in radiology: current applications and future directions

.

PLOS Med

2018

;

15

(

11

):

e1002707

.doi:10.1371/journal.pmed.1002707

15

De Fauw

J

,

Ledsam

JR

,

Romera-Paredes

B

, et al.

Clinically applicable deep learning for diagnosis and referral in retinal disease

.

Nat Med

2018

;

24

(

9

):

1342

–

50

. doi:10.1038/s41591-018-0107-6

16

Chang

K

,

Balachandar

N

,

Lam

C

, et al.

Distributed deep learning networks among institutions for medical imaging

.

J Am Med Inform Assoc

2018

;

25

(

8

):

945

–

54

. doi:10.1093/jamia/ocy017

17

Rieke

N

,

Hancox

J

,

Li

W

, et al. The Future of Digital Health with Federated Learning. Published Online First: 18 March 2020. http://arxiv.org/abs/2003.08119 Accessed June 20,

2020

18

Kaissis

GA

,

Makowski

MR

,

Rückert

D

, et al.

Secure, privacy-preserving and federated machine learning in medical imaging

.

Nat Mach Intell

2020

;

2

(

6

):

305

–

11

.

19

Li

W

,

Milletarì

F

,

Xu

D

, et al. Privacy-preserving federated brain tumour segmentation. In: proceedings of the 10th International Workshop on Machine Learning in Medical Imaging (MLMI 2019), in conjunction with MICCAI 2019; October 13 2019; Shenzhen, China.

20

Roth

HR

,

Chang

K

,

Singh

P

, et al. Federated Learning for Breast Density Classification: A Real-World Implementation. In: proceedings of the 2nd MICCAI Workshop on Domain Adaptation and Representation Transfer (DART 2020), in conjunction with MICCAI 2020.; October 8 2020; held virtually due to COVID-19 (formerly scheduled for Lima, Peru).

21

Sheller

MJ

,

Edwards

B

,

Reina

GA

, et al.

Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data

.

Sci Rep

2020

;

10

(

1

):

12598

. doi:10.1038/s41598-020-69250-1

22

Armato

SG

,

Huisman

H

,

Drukker

K

, et al.

PROSTATEx Challenges for computerized classification of prostate lesions from multiparametric magnetic resonance images

.

J Med Imag

2018

;

5

(

04

):

1

.

23

Larobina

M

,

Murino

L.

Medical Image File Formats

.

J Digit Imaging

2014

;

27

(

2

):

200

–

6

.

24

Liu

S

,

Xu

D

,

Zhou

SK

, et al.

3D anisotropic hybrid network: transferring convolutional features from 2D Images to 3D anisotropic volumes

.

Lect Notes Comput Sci

2017

;

11071

:

851

–

8

.