
Rare Disease Datasets – Databases

Rare Disease Database

Center for Machine Learning and Intelligent Machines Dataset for Heart Disease

Data Set Downloads of Disease

Resource: Hidalgo CA, Blumm N, Barabasi A-L, Christakis NA. PLoS Computational Biology, 5(4):e1000353. doi:10.1371/journal.pcbi.1000353

Bulk Download of Medical Database Sets for Academic Use Only


Source Data and Study Population:

Hospital claims offer reliable, systematic, and complete data for disease detection [1,2,3]. Each record of our original dataset consists of the date of visit, a primary diagnosis, and up to 9 secondary diagnoses, all specified by ICD9 codes of up to 5 digits. The first three digits specify the main disease category, while the last two provide additional information about the disease. In total, the ICD-9-CM classification consists of 657 different categories at the 3-digit level and 16,459 categories at the 5-digit level. For a detailed list of currently used ICD9 codes see

We compiled raw Medicare claims [4,5] based on MedPAR records for 1990-1993 that contain information on 96% of elderly Americans, whether they use health care or not [6].
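The record layout described above (a visit date, one primary diagnosis, up to 9 secondary diagnoses) and the reduction of a code to its 3-digit category can be sketched in Python. This is only an illustration: the `Claim` class, its field names, and the sample codes are hypothetical and do not reflect the actual MedPAR file format. It also assumes codes are stored as plain strings with an optional decimal point, and it does not treat the supplementary V and E codes specially.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Claim:
    """Hypothetical in-memory form of one inpatient claim record."""
    visit_date: date
    primary: str                                         # primary diagnosis (ICD9 code)
    secondary: List[str] = field(default_factory=list)   # up to 9 secondary diagnoses

    def __post_init__(self):
        if len(self.secondary) > 9:
            raise ValueError("a claim carries at most 9 secondary diagnoses")


def category(code: str) -> str:
    """Return the 3-digit main category of an ICD9 code such as '410.71' or '41071'."""
    digits = code.replace(".", "").strip()
    if not 3 <= len(digits) <= 5:
        raise ValueError(f"expected 3 to 5 characters, got {code!r}")
    return digits[:3]


def claim_categories(claim: Claim) -> List[str]:
    """3-digit categories on a claim, primary first, duplicates removed."""
    seen: List[str] = []
    for code in [claim.primary, *claim.secondary]:
        cat = category(code)
        if cat not in seen:
            seen.append(cat)
    return seen


# Illustrative claim, not taken from the dataset.
example = Claim(date(1991, 3, 14), "41071", ["4019", "25000", "4011"])
print(claim_categories(example))
```

Collapsing to the first three characters keeps the main disease category and drops the additional detail carried by the fourth and fifth digits, so the two hypertension codes in the example fall into the same category.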

For the 32 million elderly Americans aged 65 or older enrolled in Medicare and alive for the entire study period, there were a total of 32,341,347 inpatient claims, pertaining to 13,039,018 individuals (the remaining individuals were not hospitalized at any point during this period). Demographically, our data set consists of patients over 65 years old (see Fig 1 for the age distribution) and is composed mainly of white patients, with a higher percentage of females (Fig 2). Yet, the data set is large enough to estimate race and gender specific comorbidity patterns.
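The counts in the paragraph above imply two summary ratios that are easy to check. The short Python sketch below uses only the quoted figures. Because "32 million" is a rounded number, the resulting share is approximate.

```python
# Counts quoted in the text above.
ENROLLED = 32_000_000      # "32 million" enrollees, a rounded figure
CLAIMS = 32_341_347        # inpatient claims during the study period
HOSPITALIZED = 13_039_018  # individuals with at least one inpatient claim

share_hospitalized = HOSPITALIZED / ENROLLED
claims_per_patient = CLAIMS / HOSPITALIZED

print(f"hospitalized at least once: {share_hospitalized:.1%}")
print(f"claims per hospitalized patient: {claims_per_patient:.2f}")
```

On these numbers, roughly two in five enrollees were hospitalized at least once, and each hospitalized individual accounts for about 2.5 claims on average.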

Data Limitations

The medical claims were made available to us in the ICD-9-CM format, a controlled nomenclature constructed mainly for insurance claim purposes. Therefore, in some cases more than one code corresponds to a particular disease, whereas in other cases codes are not specific enough for research purposes. For example, at the 5-digit level there are 33 diagnoses associated with hypertension, which reduce to five at the 3-digit level. The vast majority of diseases, however, can be unambiguously assigned to an ICD9 code.
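The hypertension example can be illustrated by grouping detailed codes under their 3-digit prefix. The Python sketch below assumes the five categories are 401 to 405, the ICD-9-CM hypertensive disease block. The code list is a small hand-picked sample rather than the full set of 33 diagnoses, and `group_by_category` is a hypothetical helper name.

```python
from collections import defaultdict

# A sample of hypertension-related ICD9 codes, written without the decimal point.
SAMPLE_CODES = [
    "4010", "4011", "4019",   # essential hypertension
    "40200", "40291",         # hypertensive heart disease
    "40301",                  # hypertensive kidney disease
    "40410",                  # hypertensive heart and kidney disease
    "40501", "40599",         # secondary hypertension
]


def group_by_category(codes):
    """Map each 3-digit category to the detailed codes that fall under it."""
    groups = defaultdict(list)
    for code in codes:
        groups[code[:3]].append(code)
    return dict(groups)


groups = group_by_category(SAMPLE_CODES)
for cat in sorted(groups):
    print(cat, groups[cat])
```

Nine detailed codes collapse into five categories here, which mirrors on a small scale the reduction from 33 to five described in the text.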

While hospital claims have been proposed as a reliable method for disease detection [7,8,9], our data does not capture a complete cross section of the population. The dataset consists of medical claims associated with hospitalizations of elderly citizens in the United States; it therefore contains limited information about diseases that are uncommon among the elderly in an industrialized country, such as many infectious diseases or pregnancy-related conditions. Nor does it contain information on patients who were not hospitalized. It does, however, contain a wealth of information about different types of heart disease and cancer, which are highly prevalent among elderly patients and are of major interest to the medical community.

We distinguish four main groups in the data set, defined by gender and race (Males = M, Females = F, White = W, Black = B).

Number of Patients per Demographic Group

Genetic and Rare Disease Data Center

Health Hotlines Rare Disease Index

Portal for Rare Diseases and Orphan Drugs

Rare Disease Search

Rare Diseases and Related Terms

Royal College of Pathologists