A sector digitalizes by fully leveraging its data to improve process efficiency. Whether that process is a car assembly line, a grocery store’s supply chain, a marketing campaign or biotherapy manufacturing, the data journey is the same at a high level: data must first be captured, then organized, before they can be mined for insights.
When Life Sciences companies talk about Digitalization, they tend to be much enamoured of hot topics like IoT for data capture and AI for data mining, but far less keen on unglamorous but essential data management.



Much as we cannot take crude oil straight from the ground and put it directly into our cars, so too we cannot simply jump into mining raw data. In both cases, refining breaks the raw material into constituent parts that are inherently more valuable, more meaningful and easier to work with.
In practice, nobody actually works with completely raw data as we always impose some order, but do our data organization efforts go far enough? After 20+ years as a biological data scientist across dozens of diverse projects, I am certain we do not.
Consider a late-stage cell line development (CLD) study of half a dozen candidate production cell lines running in bioreactors. Over ~20 days, temperature and pH are monitored more than daily, metabolite and product concentrations are measured once a day, and viable cell counts are taken perhaps every second day.
Often such data are managed in spreadsheets, perhaps with culture time down the rows and the various measurements in headed columns, with a separate sheet for each cell line. Such a layout cannot easily be queried to, say, compare the growth and productivity of cell lines, so someone has to pull the relevant data into comparison plots by hand. In Digital Transformation speak, the data are digitized (in electronic form) but the analysis process is not digitalized (automated through code).
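As a sketch of what a digitalized version looks like: if the same measurements are held in one tidy "long" table (one row per observation) rather than one sheet per cell line, cross-line comparisons become one-liners. The column names and values below are purely illustrative, not from any real study.

```python
import pandas as pd

# One row per observation, instead of one spreadsheet per cell line.
# Column names and values are invented for illustration.
df = pd.DataFrame({
    "cell_line":   ["A", "A", "B", "B", "A", "B"],
    "day":         [1, 2, 1, 2, 2, 1],
    "measurement": ["titer_g_per_L", "titer_g_per_L", "titer_g_per_L",
                    "titer_g_per_L", "vcd_1e6_per_mL", "vcd_1e6_per_mL"],
    "value":       [0.1, 0.4, 0.2, 0.3, 5.0, 4.2],
})

# A canned question ("Which cell line has the highest titer?") becomes
# a query rather than a manual plotting exercise:
peak_titer = (df[df["measurement"] == "titer_g_per_L"]
              .groupby("cell_line")["value"].max())
print(peak_titer.idxmax())  # prints "A"
```

The same table shape happily absorbs measurements taken at different frequencies, which is exactly what trips up the one-sheet-per-line layout.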
Routine Life Sciences datasets like this tend to be used to answer one or two canned questions (e.g. “Which cell line has the highest titer?”), but almost always they can tell us much more. One of our collaborators shared data on clones with an abnormally high propensity for late-stage clumping, yet they were unaware of this until we pointed it out. Useful content was overlooked because they did not consider all of their data. Did they choose the right clone to move forward? Possibly not, even though they had all the data needed to make a better choice. We need to leverage all of our data.
When data are this disjointed, it is also difficult to put them into a wider context that might reveal further insights. No, lab scientists, your cells are not growing badly because “they’re having one of those days”. Put a temperature logger in your shared incubator and you’ll soon see the issue (we’ve done it). How many times has a good cell line been discarded because its poor growth wasn’t understood in a wider context? We need to make our data easy to connect up to wider data.
Data can die if their meaning is known only to the spreadsheet author. We were once given data on glucose and glutamate concentrations, both columns curiously headed “Glu”. All was resolved smoothly through discussion, but what would have happened to that institutional knowledge had the author moved on? We need to mark up our data so that it carries its meaning with it.
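One lightweight way to carry meaning with the data, sketched here in a layout of our own invention rather than any formal standard, is to keep a machine-readable metadata record alongside each column: an unambiguous name, units, and an ontology identifier where one exists. The ChEBI identifiers below are quoted from memory and should be checked against the current ontology release.

```python
# Instead of two columns both headed "Glu", give each column an
# unambiguous name and attach explicit metadata. The record layout is
# our own illustration; ChEBI IDs should be verified before reuse.
column_metadata = {
    "glucose_mM": {
        "analyte": "glucose",
        "units": "mM",
        "ontology_id": "CHEBI:17234",
    },
    "glutamate_mM": {
        "analyte": "L-glutamic acid",
        "units": "mM",
        "ontology_id": "CHEBI:16015",
    },
}
```

A future analyst (or a query engine) can now resolve each column without tracking down the original author, which is the essence of the FAIR standards discussed below.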
Such loss of meaning, content and context makes Data Scientists weep for what might have been (not quite, but wasted data are very annoying to us). To have the best chance of extracting maximum value, we must organize things better so that we know the identity of all data and how they relate to other data. We need a data warehouse.
I won’t endorse any particular solution but will offer my view that most Life Sciences managers, rather than rolling their own solution or buying into a generic data platform, should bring in a data warehouse developed for a similar application to their own.
If you are looking to source a data warehouse, your provider should:
- Already deal with similar data, meaning they have good domain understanding and an approximately correct data model (a wiring diagram of how the various data and metadata relate to each other).
- Be willing to support you in adapting their platform to your application – poor implementation choices early on can be detrimental down the line.
- Support the use of FAIR standards (ways to embed meaning into data so it is easy to connect, share and query).
- Provide simple, spreadsheet-like User Interfaces for non-coding bench scientists to upload and annotate data.
- Provide programming interfaces (APIs) for Data Scientists to retrieve data through code, including by complex queries.
- Deal with all underlying IT infrastructure (security, backups, recovery, uptime and so on).
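To make the data-model and API points above concrete, here is a minimal sketch using SQLite and a toy schema of our own invention (table and column names are assumptions, not any vendor's actual model). The value of a warehouse is that questions spanning several entities become single queries:

```python
import sqlite3

# A toy warehouse: cell lines, bioreactor runs, and measurements,
# wired together by keys. Names are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cell_line (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE run (id INTEGER PRIMARY KEY, cell_line_id INTEGER,
                  incubator TEXT);
CREATE TABLE measurement (run_id INTEGER, day REAL,
                          analyte TEXT, value REAL, units TEXT);
""")
con.executemany("INSERT INTO cell_line VALUES (?, ?)",
                [(1, "CL-A"), (2, "CL-B")])
con.executemany("INSERT INTO run VALUES (?, ?, ?)",
                [(10, 1, "inc-3"), (11, 2, "inc-3")])
con.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?, ?)",
                [(10, 14, "titer", 2.1, "g/L"),
                 (11, 14, "titer", 1.7, "g/L")])

# "Which cell line had the highest day-14 titer, and in which incubator?"
# One query crosses three entities; no copy-pasting between spreadsheets.
row = con.execute("""
    SELECT c.name, r.incubator, MAX(m.value)
    FROM measurement m
    JOIN run r ON r.id = m.run_id
    JOIN cell_line c ON c.id = r.cell_line_id
    WHERE m.analyte = 'titer' AND m.day = 14
""").fetchone()
print(row)  # ('CL-A', 'inc-3', 2.1)
```

A provider's API would hide this plumbing behind friendlier calls, but the underlying data model, the "wiring diagram" of how entities relate, is what makes such cross-entity questions answerable at all.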
Services like this will come at a cost, but developing and running your own solution will take far longer and cost far more.
We work with dozens of top Biopharma companies who are all working on Digital to some extent. Only those that crack Data Management will fully unlock the value hidden in their data and accomplish a full Digital Transformation.
Data Warehousing in CellAi®
CellAi is a software platform for microscope image interpretation at the Edge. We know our AI models are robust because we check them against laboratory data engineered to emulate dirty, noisy real-world variability. We could not do this without a data warehouse that captures metadata about cell lines, media, instruments, artefacts and other sources of variability within our data.