Big data


Over the past 60 years a great number of very large datasets have been generated from the experimental exposure of animals to external radiation and internal contamination. This accumulation of ‘big data’ has been matched by increasingly large epidemiological studies from accidental and occupational radiation exposure, and from plants, humans and other animals affected by environmental contamination. We review the creation, sustainability and reuse of this legacy data, and discuss the importance of Open data and biomaterial archives for contemporary radiobiological sciences, radioecology and epidemiology.

It is important to define what we mean by data archives and datasets. Much of the data collected into historical archives (i.e. those whose collection has ceased and which are, to all intents and purposes, closed legacy datasets) are derived from very large experiments that were often only loosely based on hypothesis testing and were not designed to test specific biological or physical mechanisms; that is, a wide range of data was collected in order to inform a broad question. This includes lifespan studies, cancer studies and those with broadly defined endpoints. Such archives include the very large human radiation exposure datasets (large-scale epidemiological datasets, some of which are still collecting data) and the results of extremely large-scale animal exposure experiments. In all cases we can legitimately describe these as ‘Big data’; some were possibly the largest and most complex data collection exercises in the biological sciences conducted up to the date of their completion. Examples are the important epidemiological study of the Japanese atomic bomb survivors (Ozasa et al. 2018) and the large Million Worker Study, which includes worker cohorts from U.S. Department of Energy (DoE) Manhattan Project facilities, nuclear power plants, industrial radiographers, U.S. Department of Defense (DoD) nuclear weapons test participants, and medical technicians and physicians (Boice et al. 2018). Data complexity characterizes big data as much as volume, and large complex datasets are at the same time more difficult to manage and potentially more fruitful in analysis. A characteristic of this type of data, and an indication of its ongoing value, is that it is possible to reanalyze, recode, integrate and aggregate the data, and to reinterpret it according to changing scientific paradigms.
The size of some datasets might lend itself to machine learning approaches, for example to generate classifiers, but to our knowledge trained or deep learning methodologies such as deep neural networks have not yet been applied in radiation epidemiology or animal irradiation experiments. However, machine learning is being applied to the discovery of radiation-specific transcriptional signals (Zhao et al. 2018) and increasingly in radiotherapy and medical physics (Sahiner et al. 2019). The successful application of support vector machines in determining the directionality of aerial radiation dispersal (Yoshikane and Yoshimura 2018) provides a model for retrospective studies where sufficient data is available.

Some of these archives have associated physical specimens such as blood, tissue, or histopathological slides. We discuss several of these physical resources that include significant patient or animal data as part of their structure. One such example is the Chernobyl thyroid tissue bank (Thomas 2012). Another class of archive is derived from literature curation or from the integration of primary data with literature-derived data. Because they are subjected to expert manual curation, these resources can be very valuable.

First, we consider legacy archives and projects that are effectively closed to further data accretion, together with efforts to make them accessible and usable. Second, we discuss archives of long-term experiments and epidemiological datasets that are still accumulating data, and consider archives of physical resources such as organisms, tissues, blood, and non-biotic material. A summary of the resources we discuss is shown in Table 1. Finally, we discuss the archiving and dissemination of data from active studies and of data deposited as part of currently funded work and publications, together with a consideration of the current emphasis on open data and the issues surrounding compliance with open data mandates in the community. Our intention was to include datasets and archives from various areas of radiation research to support the point we want to make about the importance of archiving and data sharing. We are aware that the list of datasets and archives mentioned in this review cannot be complete and that more databases exist which were not included in the current work.

Legacy data archives

Beginning in the late 1890s with the discovery of X rays and then radium (Sekiya and Yamasaki 2016), early animal and human exposures were often accidental and sporadic, with a small number of individuals involved. In the early part of the 20th century, with more radionuclides becoming available, small quantities were widely used in patent health products. Radium was particularly popular, for example in the radium-containing drink Radithor, which contained 74 kBq of a mixture of radium 226 and radium 228 in each bottle (Macklis 1990). These patent medicines and other products were considered to confer health benefits until, after some notable deaths, radiopharmaceuticals were brought under regulation in the early 1930s, shortly after the mutagenic action of radiation exposure was definitively established (Muller 1927). One of the first large-scale data collection exercises concerned internal occupational exposure of US radium dial painters (Fry 1998). Long-term studies of these workers traced 1322 women first employed between 1913 and 1929, and 1403 women first employed between 1930 and 1949. Follow-ups and analysis continued up to the late 1990s, demonstrating the importance of long-term studies and data sustainability, and the value of analyses that were not envisaged when the data were first collected. As radiation began to be used clinically and its effects were beginning to be appreciated, X-irradiation was widely employed, in some cases on a very large scale. For example, cranial X-irradiation was used in the treatment of the fungal scalp disease Tinea capitis over the period 1948–1960; this was the subject of a very large follow-up study beginning in 1968 (Sadetzki et al. 2005) whose data remains available.

Legacy data from human exposure

The stimulus for large-scale animal and human experimentation with radiation exposure was largely a consequence of the United States nuclear weapons programme and the subsequent release of the first nuclear weapons over Hiroshima and Nagasaki. It is no accident that the foundation of the International Journal of Radiation Biology in 1959 coincided with the surge of interest in the acute and long-term effects of radiation, and to a great extent the history of big data in radiobiology parallels the history of this journal.

In 1946, following Congressional hearings, the US Atomic Energy Commission was established and shortly afterward, in 1947, its chairman David Lilienthal commissioned a Medical Board of Review, to report on the agency’s biomedical program (Hewlett et al. 1990). The board strongly recommended a broad research and training program:

‘both urgent and extensive.’ ‘The need is urgent because of the extraordinary danger of exposing living creatures to radioactivity. It is urgent because effective defensive measures (in the military sense) against radiant energy are not yet known.’¹


Public concern about the effects of nuclear fallout was increasing, especially after the leakage of data concerning the impact on bystanding observers and the local population following the testing of US nuclear weapons over Bikini Atoll in the Marshall Islands in the early 1950s. Operation Crossroads and Operation Castle Bravo generated serious concern about the danger of irradiation and contamination, particularly in the light of the first alarming analyses of the Japanese A bomb survivors (discussed below) in the immediate aftermath of Hiroshima and Nagasaki. It is not the aim of this commentary to unpick the political and economic events which led up to, and motivated, the first large-scale animal testing of radiation exposure. The data from these studies, which has still not been exhaustively analyzed, nevertheless constitutes some of the largest datasets in radiation science.

Following the initial studies on the survivors of Hiroshima and Nagasaki, a significant number of experiments were carried out on human subjects between the 1940s and the 1970s, which came to light in the early 1990s (Stone 1993; McCally et al. 1994). Primary data are scattered through US agencies and universities and, to our knowledge, have not so far been made public, though these datasets would certainly benefit current research. Below we consider the collection of large-scale data on the Japanese A bomb survivors and other human exposures, both occupational and accidental, before moving to a consideration of the major animal exposure experiments starting in the late 1950s.

Hiroshima and Nagasaki survivors; the LSS study

Following the dropping of two atomic bombs over Nagasaki and Hiroshima in 1945, the Atomic Bomb Casualty Commission (ABCC), now the Radiation Effects Research Foundation (RERF), was set up in 1946 to monitor the health of the survivors. By the end of 1945 more than 200,000 had died of the combined effects of physical injuries, acute radiation sickness and late effects. By 1950 there was also concern about gonadal doses and germline mutation. The lifespan study cohort (LSS) was established in 1958 comprising standardized data on 120,321 individuals, including co-resident but unexposed controls (Ozasa et al. 2018). Further cohorts have also been established (Ozasa 2016): the adult health study (AHS) aimed at gathering morbidity data for disease additional to cancer, and the In Utero programme focussed on 3268 individuals exposed in utero. A third study examines the heritable impact of exposure, the ‘F1’ study, which aims at elucidating the impact of radiation exposure on the germline. Summary data for all these cohorts are available, but access to detailed individual-level data requires RERF approval.

Occupational and accidental exposure in the Soviet Union

Human exposure data for the period from the 1940s up until the early 1980s are available from the Mayak plant in the Southern Urals in Russia. They are derived from close monitoring of workers at Mayak where, starting from the late 1940s, highly enriched uranium, tritium and plutonium were produced for Russian nuclear weapons. Occupational exposure and accidents were recorded between 1948 and 1982, with more than 30% of workers estimated to have been exposed over their working lifetime to more than 1 Gy of mainly external γ doses, the average internal ²³⁹Pu contamination being 2.19 ± 0.15 kBq (Azizova et al. 2008). The data consist of ICD-9-coded medical records, doses, cause of death, work history and demographic information on 12,585 workers, and are augmented by biological samples, both from blood and autopsy. Tissue collections and data from this resource have been used with considerable impact, for example in studies of cardiac exposure (Azimzadeh et al. 2017).

Distinct from the Mayak cohorts are the studies on the Techa river where, over nearly a decade starting from 1949, the Mayak plant discharged liquid radioactive waste (7.6 × 10⁶ m³) into the river, thereby polluting large areas of the surrounding region and exposing the surrounding population to long-term internal contamination. Data have been collected from this area since the early 1960s, including demographic and clinical information from approximately 29,000 inhabitants. These data contain information on sex, cause of death, period of exposure and estimates of dose. The Techa river database is one of the few containing information about protracted environmental radiation exposures in a general population (Krestinina et al. 2005).

Semipalatinsk nuclear test site

From 1949 to 1989 nuclear weapons testing was conducted by the former Soviet Union at the Semipalatinsk Nuclear Test Site, Kazakhstan, including 111 atmospheric or near-ground tests between 1949 and 1962. Four nuclear weapons tests, conducted from 1949 to 1956, resulted in non-negligible radiation exposures to the public, corresponding to external doses of up to approximately 300 mGy. The population living around the test site is one of the largest human cohorts exposed to radiation from nuclear weapons tests. As a follow-up of research that started in the 1960s, a registry was established that contains information on more than 300,000 individuals residing in the areas neighboring the test site. The registry contains relevant information about those who lived there at the time of the testing as well as about their children and grandchildren, including to some extent biological material (Apsalikov et al. 2019). To date, only a few studies have been conducted which were either completely (Grosche et al. 2011) or partially (Land et al. 2008) based on the information from a precursor of the registry. The registry can now be used for future studies, and detailed information on a data set for a three-generation study is already included in STORE.

Wismut uranium miners study

The Wismut study contains data on approximately 59,000 male uranium miners, first employed between 1946 and 1989, at the Wismut Company in Germany. It contains demographic, cancer and other mortality data. It is the largest single study on the health risks of occupational exposure to ionizing radiation and inhalation of radionuclides in uranium mining (Kreuzer et al. 2010). The data can be accessed through the STORE database (see below).

German Thorotrast study

The thorium-containing radioactive contrast agent Thorotrast® was used from 1929 until the 1950s in angiography and arteriography. The thorium in Thorotrast persists throughout the lifetime of the exposed patients, who consequently receive chronic internal exposure for life. Several cohort studies were initiated, notably in Germany, and the German Thorotrast study cohort was established retrospectively in 1968 with a follow-up until 2004. The study comprises 2326 Thorotrast patients and 1890 patients of a matched control group. The dataset contains demographic, dosimetric, morbidity and mortality data and can be obtained on application through the STORE database (Grosche et al. 2016).

Japanese Thorotrast study

Parallel to the above, a study of 436 Thorotrast-exposed patients was also carried out in Japan, and both patient data and material are available. Data include estimates of the amount of thorium deposited and the cumulative dose in major organs, and confirmed pathological diagnosis (Fukumoto 2014).

Kyshtym, Chernobyl and Fukushima

Six major nuclear accidents have occurred to date: Kyshtym (1957), Windscale Piles (1957), Three Mile Island (1979), Chernobyl (1986), Tokaimura (1999) and Fukushima (2011).

In the accident at the Mayak plant on 29th September 1957 (the ‘Kyshtym Accident’) (Akleyev et al. 2017), 20 MCi (740 PBq) of radionuclides were released by a chemical explosion on the site. The subsequent spread of contamination was monitored, and the exposed population was enrolled into the database of the Urals Research Center for Radiation Medicine (URCRM), which contains the results of long-term dosimetric monitoring and medical follow-up of the population. The cohort contains around 21,000 individuals and is, along with the Techa river cohorts, one of the largest prospective datasets available from accidental contamination of civilian populations.
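
The activity conversion quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the cited work: the function name `curie_to_becquerel` is a hypothetical helper introduced here, and the only physical input is the definition 1 Ci = 3.7 × 10¹⁰ Bq.

```python
# Illustrative check of the activity conversion quoted for the Kyshtym release.
# The helper name is hypothetical; the curie is defined as exactly 3.7e10 Bq.

BQ_PER_CI = 3.7e10  # becquerel per curie, by definition


def curie_to_becquerel(activity_ci: float) -> float:
    """Convert an activity in curies to becquerels."""
    return activity_ci * BQ_PER_CI


release_bq = curie_to_becquerel(20e6)  # 20 MCi expressed in Ci
release_pbq = release_bq / 1e15        # 1 PBq = 1e15 Bq
print(f"{release_pbq:.0f} PBq")        # prints "740 PBq"
```

The same helper can be used to cross-check other activities in this review that are reported in mixed units.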

The Chernobyl accident in 1986 affected populations in Ukraine and Russia, but mainly in Belarus. In addition to the affected general population, around 600,000 workers were involved in the cleanup operation. The cleanup workers were mainly exposed to γ radiation, with an estimated mean dose ranging from 20 to 185 mGy. Several overlapping studies have been performed on these populations, with endpoints including thyroid cancer, leukaemia and lymphoma. Both closed and continuing studies have been subject to intensive analysis, reviewed comprehensively by Cardis and Hatch (Cardis and Hatch 2011; Hatch and Cardis 2017).

The Fukushima Daiichi nuclear power plant incident in 2011, following the Tōhoku earthquake and tsunami, involved a core melt-through damaging three reactor cores, followed by hydrogen explosions. As with Chernobyl, both the local population and emergency workers were exposed to external, mainly γ, radiation and to internal contamination, with a maximum external dose of around 700 mSv to emergency workers and around 25 mSv to residents (Hasegawa et al. 2015). A large-scale health survey of the TEPCO emergency workers, the NEW study, is being established by RERF (Kitamura et al. 2018), with around 5000 workers having been recruited to date. The Fukushima Health Management Survey of Fukushima residents (Ishikawa et al. 2015) was created by the Fukushima prefecture and contains dose estimates for individuals, based on their movements during the accident, together with an overall health assessment, thyroid ultrasound examination, a mental health and lifestyle survey, and a pregnancy and birth survey. Emerging data from the epidemiological studies suggest that a very significant measure of morbidity has its origins in psychological aspects of displacement or fear of radiation and in social issues, and it will be interesting to see how future analysis of these psychosocial datasets feeds into disaster planning and mitigation strategies.

Comprehensive Epidemiologic Data Resource (CEDR database); U.S. Department of Energy

The CEDR is the U.S. Department of Energy (DOE) electronic database that contains de-identified data on health studies of DOE contract workers and environmental studies of areas surrounding DOE facilities. The resource currently contains 76 studies of over 1 million workers at 31 DOE sites. Much of the data is from epidemiological studies at US nuclear facilities and provides access to individual-level data in many cases, with primary raw and derived datasets. A complete description of the data and the resource can be found in

Additional human datasets

An excellent review listing the major human epidemiological datasets available, with a focus on cardiovascular diseases, was published recently (Kreuzer et al. 2015), although access to these datasets is largely on a discretionary basis where there are issues of data consent and local personal data legislation. Notably included in these large datasets are the International Nuclear Workers Study (INWORKS) (Hamra et al. 2016), an integrated study of more than 380,000 nuclear workers in three countries (USA, UK, and France), and that of the UK nuclear workers, UK NRRW (Haylock et al. 2018), which is partially proprietary. A large cohort of 948,174 children (with follow-up data) exposed to ionizing radiation by CT scans was set up as a joint effort of nine European countries (Bernier et al. 2018). As with the INWORKS study, these data are proprietary and held at IARC, Lyon.

A more comprehensive description and discussion of human datasets has been published recently (Zander et al. 2019).

Large-scale animal experiments

In the early 1950s, there were significant concerns about the scientific utility and ethics of radiation exposure experiments on humans. Shields Warren, the Chair of the AEC, reported in 1950² (cited in Faden 1996):

‘We have learned enough from animals and from humans at Hiroshima and Nagasaki to be quite certain that there are extraordinary variables in this picture. There are species variables, genetics variables within species, variations in condition of the individual within that species.’ The danger of failing to provide data had to be weighed against the danger of providing misleading data: ‘It might be almost more dangerous or misleading to give an artificial accuracy to an answer that is of necessity an answer that spreads over a broad range in light of these variables.’


In 1951, following the Operation Greenhouse hydrogen bomb tests on Enewetak, 4000 mice exposed to radiation from the blast were taken to Oak Ridge and received by Jacob for long-term study (National Academy of Engineering 1984). This was the beginning of a very large series of non-human mammal internal and external exposure experiments. From Warren again:

‘Jacob was the recipient of large numbers of mice, survivors from a Pacific nuclear test, placed with various degrees of shielding along radii from the point of explosion. He had the foresight to follow these animals to the time of their natural death. As a result of these studies, much new information was developed about the late effects of radiation, about biological dosimetry, and about the similarity of certain radiation effects to those of aging.’


Between 1952 and 1992 more than 200 large-scale experiments were conducted on non-human animals, mainly mice and beagles, in the USA, Europe and Japan. For example, at Argonne National Laboratory (ANL) 700 beagles and 50,000 mice were used in experiments between the late 1960s and early 1990s, as reviewed by Haley et al. (2011). This included the JANUS studies on whole body γ and neutron irradiation of inbred strains of Mus musculus, and also of Peromyscus sp., funded by what is now the Department of Energy, which emerged from the AEC. These were generally lifespan studies and involved detailed cross-sectional, longitudinal and terminal pathological investigation over a wide range of irradiation doses, dose rates, quality, and timing.

The beagle dog experiments, carried out at Argonne National Laboratory, the Pacific Northwest National Laboratory, UC Davis, and the University of Utah from 1952 to 1991 and supported by grants from the Atomic Energy Commission, investigated the effects of ⁶⁰Co radiation on nearly 5000 beagle dogs. In addition, internal contamination with radium, Pu, Cf, and ⁹⁰Sr was investigated, the latter being considered an important component of nuclear fallout. Exposures ranged from external irradiation to inhalation, using acute, chronic and fractionated doses.

Taken together, these large-scale mammalian studies form the basis of much of our knowledge concerning the acute and chronic long-term effects of external and internal radiation, and constitute a huge data resource. While some of the data, or at least data analyses, have been published, by the 1980s it was clear that the primary data from these experiments were in danger of being lost. Given the high estimated cost of $2bn, at current prices, needed to repeat these experiments, even if the necessary infrastructures were still available, it was desirable to salvage this legacy data and put it into the public domain for further use and analysis. Consequently, the data from the Argonne Janus mouse studies carried out between 1969 and 1992, including around 50,000 mice, was curated (Wang et al. 2010) and is now housed in the Northwestern University Radiation archives (NURA), along with beagle data from ANL which includes data from thousands of dogs in mainly lifespan studies. Both datasets have associated tissues, also preserved at NURA (Haley et al. 2011), and are freely available. The data and tissues archived at NURA have been used for new analyses, for example of the effects of radioprotective agents (Paunesku et al. 2008), interspecies sensitivity (Liu et al. 2013) and gender effects (Haley et al. 2011).

The European Radiobiological Archive (ERA)

In the mid-1980s, the European Late Effects Project Group (EULEP) embarked on an initiative to collect and collate data covering all available information on European long-term radiobiological animal studies. The Office of Biological and Environmental Research of the US Department of Energy and, in Japan, the Japanese Late Effects Group started similar efforts around the same time to archive the American and Japanese data in the US National Radiobiology Archives (NRA) and the Japanese Radiobiological Archives (JRA), respectively. The result was an aggregated database of primary data from European, Japanese and US sources, the International Radiation Archive (IRA) (Gerber et al. 1999). The JANUS data and Argonne beagle data held at Northwestern University (NURA archive) were also included. The resulting collection of datasets contains nearly all radiation biology studies using animals carried out between 1960 and 1998 in Europe, the US, and Japan, involving a total of more than 400,000 animals (Gerber et al. 1996; Gerber and Wick 2004) (see Table 2). This exercise in international data acquisition and curation was begun by Dr. George Gerber but was picked up in a formal project funded by the European Commission in 2006, when it was decided to integrate all of the data across datasets (Gerber et al. 2006). By then the data had been included in a simple non-relational database and had been hand curated from the original sources. In some cases these were institutional reports, but in others punched cards and IBM tapes were transcribed. This raised multiple problems. Firstly, the accuracy of transcription was uncertain. More importantly, the lack of standar