Big data in the field of cancer research
Data warehouses can be used to store structured data, texts or images with a view to developing clinical and translational research and could perhaps be used in the future as a tool to assist decision-making in clinical practice. The ConSoRe project (continuum – soins – recherche or healthcare-research continuum) of the Centres de Lutte contre le Cancer is currently being deployed to use this technology in cancer treatment and research.
Projects of this kind are now possible for three reasons. Firstly, hospitals have amassed a considerable volume of clinical information in the form of texts and images. As an example, the Institut Curie holds digitized clinical files (texts and images) dating back to 2000 with longitudinal follow-up of more than 100,000 cancer patients, representing a total of more than 10,000,000 documents. Secondly, progress in IT provides the computing power and tools required to process all this data. Finally, progress in artificial intelligence is opening up new potential for automatic text and image processing.
With these data warehouses, all kinds of questions can be asked simply and intuitively, and structured data can be extracted and shared between hospitals and research centres. Above all, these large volumes of data can be analysed with the techniques that have emerged from the so-called Big Data revolution, such as machine learning and deep learning based on deep neural networks. The aim is to find out how to generate knowledge from data collected in “real life”.
It is now easy to obtain patient lists or to search for cryopreserved biological samples using multiple search criteria. It is believed that data warehouses will, in the future, assist in diagnosis, provide better statistical predictions of patient outcomes, facilitate inclusion in trials and enable searches for identical or similar cases. In terms of public health, these technologies will facilitate the generation of health warnings, improve pharmacovigilance and medical device vigilance, and boost epidemiological research. One of the major advantages compared with clinical studies is that real-life data can be accessed, for example data on drug consumption, with a view to studying the impact of co-medication.
However, some problems remain to be solved. For example, learning rules for machine learning have to be determined while allowing for the fact that training databases change very quickly: new treatments can change the history of cancer, and the classification of cancers evolves over time. New points of vigilance emerge. How can appropriation of this knowledge by doctors be ensured? How should the results be displayed? What confidence can be attributed to predictions? How can the numerous ethical questions be addressed and patients be protected? What roles will public authorities play, and how will citizens become partners so that these techniques are perceived as progress rather than a danger?
These multisource and multiformat warehouses can be used to build cohorts, to develop cooperation between centres and to use all the data from throughout the hospital information system to identify new hypotheses. Feedback from the ConSoRe project has shown that it is possible to create a warehouse without having to change hospital information systems and to search both structured and unstructured data sets. Searches are remarkably fast, taking on the order of a second, similar to a Google search, and the data of other Centres de Lutte contre le Cancer (CLCCs) can be searched remotely.
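As a purely illustrative sketch of what a search across structured and unstructured data can look like, the Python example below filters a small in-memory list of patient records on two structured criteria (a diagnosis code prefix and the availability of a frozen sample) and on keywords found in free-text reports. The record layout, the field names and the `search` function are assumptions made for this example. They do not describe the ConSoRe schema or its search engine, and the three records are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatientRecord:
    # Hypothetical record layout for illustration; not the ConSoRe schema.
    patient_id: str
    diagnosis_code: str          # structured field, e.g. an ICD-10 code
    has_frozen_sample: bool      # structured field
    documents: List[str] = field(default_factory=list)  # free-text reports


def search(records, diagnosis_prefix=None, needs_frozen_sample=False, text_terms=()):
    """Return the IDs of patients that satisfy every supplied criterion."""
    matches = []
    for rec in records:
        # Structured criterion: diagnosis code must start with the prefix.
        if diagnosis_prefix and not rec.diagnosis_code.startswith(diagnosis_prefix):
            continue
        # Structured criterion: a cryopreserved sample must be available.
        if needs_frozen_sample and not rec.has_frozen_sample:
            continue
        # Unstructured criterion: every term must occur in the patient's reports.
        text = " ".join(rec.documents).lower()
        if not all(term.lower() in text for term in text_terms):
            continue
        matches.append(rec.patient_id)
    return matches


# Invented example data.
records = [
    PatientRecord("P1", "C50.9", True, ["Invasive ductal carcinoma, tamoxifen started."]),
    PatientRecord("P2", "C50.1", False, ["Lobular carcinoma, tamoxifen started."]),
    PatientRecord("P3", "C34.1", True, ["Lung adenocarcinoma, no hormonal therapy."]),
]

result = search(
    records,
    diagnosis_prefix="C50",
    needs_frozen_sample=True,
    text_terms=["tamoxifen"],
)
print(result)
```

A real warehouse would rely on indexed full-text search and natural language processing rather than a linear scan with substring matching. The sketch only shows how structured and free-text criteria combine in a single query.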