Adding Dimensionality to Data Mapping for Higher-Level Analysis

Apr 12, 2022 | PAO-04-022-CL-03

Paradigm4 is transforming the industry through advanced computing capable of unifying data from disparate sources, allowing companies to navigate Big Data like never before and reducing the time to clinic. Aside from changing how experiments are conducted and prioritizing the value of data, the company has ushered in a new way to interact with graphs. Zachary Pitluk, Ph.D., Vice President, Life Sciences and Healthcare, explains how Paradigm4’s innovative technology is elevating companies to unique, first-in-class players and enhancing data cohesion.

David Alvaro (DA): Can you give us a brief overview of Paradigm4’s offering?

Zachary Pitluk (ZP): We’re an integrative analytics platform that unifies a number of different capabilities. We have a premier scientific data management system — SciDB — for slicing and exploring complex and multimodal data using dimensionally accessed data storage. We couple that with a world-class MapReduce system that brings together BurstMode™ (which automates large-scale elastic computing) and flexFS™ (a high-throughput file system). We’ve recently added an integration layer that allows us to rapidly take data from flexFS and put it into SciDB. For those who are familiar with older database technologies, this is quite a different approach to organizing and storing data cost-effectively. We’re essentially preserving the structure of an n-dimensional array in a database so that it can intuitively be sliced and grabbed like an algebraic multimodal array.
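
To make the idea of dimensionally accessed storage concrete, here is a minimal sketch in Python, using NumPy rather than the actual SciDB API, with made-up gene, sample, and assay names. It shows the general concept of treating multimodal data as an n-dimensional array that can be sliced along named dimensions:

```python
# Conceptual sketch (not the SciDB API): an omics store kept as an
# n-dimensional array that can be sliced along named dimensions.
import numpy as np

# Hypothetical dimensions: 1,000 genes x 200 samples x 3 assay modalities
expr = np.random.default_rng(0).random((1000, 200, 3))

genes = {f"GENE{i}": i for i in range(1000)}        # dimension index: gene -> row
samples = {f"S{i:03d}": i for i in range(200)}      # dimension index: sample -> column
modalities = {"rna": 0, "protein": 1, "atac": 2}    # dimension index: assay -> depth

# "Dimensionally accessed" read: grab one gene across all samples for one assay
# without scanning the whole array -- the index arithmetic does the work.
gene42_rna = expr[genes["GENE42"], :, modalities["rna"]]
print(gene42_rna.shape)  # (200,)
```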

Paradigm4 stands out from others in the industry by having best-in-class file storage via flexFS, which has grown significantly. Multiple companies have hundreds of terabytes of data stored in our system, which speaks volumes about our performance.

DA: I’m curious about changes in data demand. Big Data is a hot topic, but the conversation has turned more toward integrating data from heterogeneous sources. What drives this change? Is this because it’s finally possible to unify multiple data sources, or is there an increasing need stemming from R&D?

ZP: The pharmaceutical industry is still struggling with the idea of democratizing data. First, they mastered ELNs (electronic lab notebooks), and then — tragically — they threw everything into a data lake and decided to find it later. The current driver for change has come with the realization that data has value beyond the initial experiment. It’s especially useful in well-designed omics experiments, where a data set can serve as both a positive and a negative control. It can be thought of as having money in the bank — and that’s where the pressing need is coming from.

One-off experiments in early R&D are not nearly as valuable as the very well-organized data sets that we’re seeing in translational biology; the U.K. Biobank is a great example. For a lot of the single-cell data, especially the single-cell atlases serving as reference data, when you achieve a normal range and can compare all of your smaller-scale experiments to that, it’s extremely valuable.

DA: At a certain point, it seems that there may be more value in mining data from previous experiments than repeating new ones — are people in the industry realizing that they should be looking back and trying to extract insights?

ZP: Yes, and there are many ways in which that hypothesis and these large data sets have value — again, especially as positive and negative controls. For example, we’ve seen well-designed studies, like the Bluesky Project, used repeatedly by the customer for years, because even though it focused on one particular disease area, it was also used as a control for an entirely different disease area. It served as an exemplar of a negative control for machine learning and AI, because it had audio, video, and wearable data. The data were all aligned, which made it possible to check, for instance, whether machine learning picked up the same signal on other data sets like Scratch; this could be quickly confirmed, as there were video and audio to match things up. The Bluesky Project is an example of an incredibly well-developed data set that returned a lot of value, and Paradigm4 is proud to have participated in providing a robust and simple way to harmonize, review, and extract data.

DA: Do you have to convince people of the value of that concept, or is it generally recognized since many are seeking solutions to mine existing data?

ZP: The industry has received the message, and we’re in a position to be focused on applications. We understand the pain points, and whether it’s biobanks, single-cell, or multi-omics, we have a solution that can help you.

DA: What are the challenges of figuring out how to integrate different types of data while those data sets continue to increase in size, and what do the solutions look like?

ZP: Individual data sets in single-cell analysis are managed using tools that are largely too rigid to scale beyond a few patients’ worth of data. Similarly, with single-cell multi-omics, we see a definite need to work across many samples — meaning hundreds to thousands — to address the fundamental challenges, which are normalization and batch effects.

This is just like the early days in the development of variant-calling techniques. You have to go back to the fundamentals: you can’t take on the challenges of sequencing until you can look across these data sets, do the calculations, and figure out the IT systems. Ultimately, you can’t work with a sample of 10 people and expect it to be relevant at population or global levels. You need thousands to tens of thousands of patients before you can confidently characterize all the cell types in a lung, for example, or all the gut-associated lymphoid tissue (GALT) cells in the intestine, and so forth.
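
As a simple illustration of those two fundamentals, the sketch below applies library-size normalization and a crude per-batch centering to a toy cells-by-genes count matrix. It is generic NumPy code with invented data, not the REVEAL: SingleCell implementation:

```python
# Minimal sketch of the two fundamentals mentioned above, assuming a toy
# cells-x-genes count matrix with a batch label per cell.
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(2.0, size=(500, 100)).astype(float)   # 500 cells x 100 genes
batch = rng.integers(0, 3, size=500)                        # 3 hypothetical batches

# 1) Library-size normalization: scale each cell to 10,000 counts, then log1p
lib_size = counts.sum(axis=1, keepdims=True)
norm = np.log1p(counts / lib_size * 1e4)

# 2) Crude batch correction: center each gene within its batch
corrected = norm.copy()
for b in np.unique(batch):
    mask = batch == b
    corrected[mask] -= corrected[mask].mean(axis=0)

print(corrected.shape)  # (500, 100)
```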

DA: Can you take me through the fundamental science behind REVEAL and explain some of the individual applications, including where you see the potential to offer other specific use cases?

ZP: Paradigm4’s high-performance computing platform, REVEAL™, empowers data scientists to perform deeper analysis in an ad hoc manner. We’ve created a software stack that abstracts away the details of data management and of doing big MapReduce calculations or large distributed computations, like a big principal component analysis (PCA). Using our system, data scientists work exclusively in RStudio, a very familiar environment. We’re focused on REVEAL: Biobank, REVEAL: SingleCell, and REVEAL: MultiOmics, but we’re also currently developing REVEAL: Proteomics, with the prototype expected in Q2. Following that will be REVEAL: Metabolomics, which will be incredibly valuable for people to have at their fingertips. Where available, these applications come pre-loaded with plenty of public data.
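
For the large-PCA use case mentioned above, the chunk-at-a-time pattern can be sketched with scikit-learn's IncrementalPCA. This is generic Python, not Paradigm4's distributed implementation, and the chunk sizes and feature counts are invented:

```python
# Hedged sketch of a "big PCA" run chunk-by-chunk so the full matrix never
# has to sit in memory at once.
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(2)
ipca = IncrementalPCA(n_components=10)

# Pretend each "chunk" is a slab of samples pulled from remote storage
for _ in range(20):
    chunk = rng.standard_normal((5_000, 200))   # 5,000 samples x 200 features
    ipca.partial_fit(chunk)

scores = ipca.transform(rng.standard_normal((1_000, 200)))
print(scores.shape)  # (1000, 10)
```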

It’s important to point out that these matrices represent relationships, so they are effectively networks or graphs. An essential aim in biology is capturing relationships and layering on top of any one connection the numerical data that supports it. Going forward, we’ll enable queries through sets of these relationships, which are in essence pathways. That’s the ultimate goal for Big Data — you want hypotheses that can be tested at the physiological level and not limited to the individual molecular-interaction level.

An example is the PheGe browser, which is a graph between variants and phenotypes; each edge represents an association between a variant and a phenotype. We’re currently integrating eQTL and pQTL data, which will quickly enable hypothesis testing along the central dogma of molecular biology.
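
A variant–phenotype graph of this kind can be pictured as a sparse matrix whose nonzero entries are edges. The sketch below uses SciPy with made-up variant IDs, phenotypes, and effect sizes; it illustrates the structure only, not the PheGe implementation:

```python
# Illustrative sketch of a variant-phenotype association graph stored as a
# sparse matrix; each nonzero entry is an edge carrying association strength.
import numpy as np
from scipy.sparse import coo_matrix

variants = ["rs123", "rs456", "rs789"]
phenotypes = ["LDL", "HbA1c", "eGFR"]

rows = np.array([0, 0, 2])            # variant index of each edge
cols = np.array([0, 2, 1])            # phenotype index of each edge
betas = np.array([0.12, -0.05, 0.30]) # hypothetical effect sizes

graph = coo_matrix((betas, (rows, cols)),
                   shape=(len(variants), len(phenotypes))).tocsr()

# Query all phenotype edges for one variant without touching the rest
i = variants.index("rs123")
for j, beta in zip(graph[i].indices, graph[i].data):
    print(variants[i], "->", phenotypes[j], "beta =", beta)
```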

DA: Will you refine the apps based on user-generated data and changes?

ZP: Yes, that’s our approach. We work hand-in-glove with our customers to make sure that their ideas are implemented — that’s at the root of how our system has become so powerful. We have over 10 FTE-years invested in REVEAL: Biobank, and we’ve formed true partnerships with Alnylam, BMS, Amgen, and Genentech. As a result, we can get their scientists exactly what they need so that they can ignore the implementation details and focus on the big-picture analysis.

DA: What is your approach to designing the user interface so that it can meet the needs of so many?

ZP: At the highest level, we have graphical user interfaces (GUIs) that allow users to ask simple questions, make data selections, and see what data are available without coding. The data analysts who usually work closely with those users can then quickly get up to speed on any of these data sets using the R API in the REVEAL apps, because the vignettes we provide are plug and play, which is advantageous.

You see a typical dysfunction with other vendors’ research analytical platforms (RAPs), which provide a lot of command-line interface with examples captured in non-functioning blogs contributed by academics. We’ve really focused on making sure that our vignettes and help modules are a deliverable that always works as soon as you log in.

Our customer retention is a testament to our product, and its success is attributed to the underlying technology and of course to the excellence of our customer solution and development teams.

DA: Is it typical for a customer to come on board for one product and then gradually expand across the suite?

ZP: Certainly, that’s part of how we created it. We focused on some of the most compelling Big Data challenges, which are largely different sides of the same coin. The biggest issue that our customers are facing is being able to triangulate data to figure out, for example, if there’s a mutation — and if so, does it affect RNA, proteins, the cell, the metabolites? What they want to accomplish is obvious, but how to design that is not.

DA: Is there a typical profile of the types of customers you have mostly worked with?

ZP: We’re mostly focused on biopharma and specifically on the people that are interested in developing biomarkers — right now, it’s computational and translational biology, as those scientists are tasked with working with large data sets. As we grow, our customers will be more on the development side and in clinical teams. As the data sets get bigger and better, the marketing and the real-world analytics groups will likely become more involved, since they work with real-world data sets.

DA: Increasingly in translational medicine, there’s a lot of discussion of people trying to move away from animal testing and developing better in silico methods. Do you see a growing role for this kind of computation in lessening the burden on those preclinical animal systems?

ZP: I think that data are going to come from large population studies and the U.K. Biobank, in particular — for instance, being able to look at the prescription records and map them onto a time axis with corresponding details of different tests in the clinical record, like we are doing with REVEAL: Biobank. That’s where you’re going to begin to really understand how medicines affect patients, because so many patients are subject to polypharmacy — they’re prescribed multiple drugs. Consider that Tylenol has a pronounced effect on liver physiology; if you then layer in a statin or a beta blocker for hypertension, the data can reveal the consequences of these medicines used in combination in humans.
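
A minimal sketch of that kind of time-axis alignment is shown below, using pandas with hypothetical prescription and lab-test records; the column names and values are invented, not the U.K. Biobank schema or the REVEAL: Biobank code:

```python
# Rough sketch: put prescriptions and clinical lab tests on the same
# per-patient time axis, pairing each lab result with the most recent
# prescription for that patient.
import pandas as pd

rx = pd.DataFrame({
    "patient": [1, 1, 2],
    "date": pd.to_datetime(["2020-01-05", "2020-03-10", "2020-02-01"]),
    "drug": ["atorvastatin", "acetaminophen", "metoprolol"],
})
labs = pd.DataFrame({
    "patient": [1, 1, 2],
    "date": pd.to_datetime(["2020-01-20", "2020-04-01", "2020-02-15"]),
    "test": ["ALT", "ALT", "heart_rate"],
    "value": [28.0, 41.0, 64.0],
})

aligned = pd.merge_asof(
    labs.sort_values("date"), rx.sort_values("date"),
    on="date", by="patient", direction="backward",
)
print(aligned[["patient", "date", "drug", "test", "value"]])
```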

I believe that biobanks will revolutionize safety sciences, coupled with our technology, which enables predictive analysis. For example, several years ago, the Norwegian biobank found that albuterol inhaler users had a lower incidence of Parkinson’s disease. That’s an amazing physiological connection that you can only explore through population-scale data sets. Even though that was a therapeutic indication for an intervention, the safety profile will become more advanced with time.

DA: It seems to me that that’s the reverse of the primary case. We went from having all of these omics data without the computational tools to manipulate them, and now we have those tools and realize the potential in patient data, but the data sets are not yet robust enough. When they are, we’ll gain more insight from real-world data.

ZP: Exactly, at the core of this is hypothesis testing and the triangulation of data types to confirm or disprove hypotheses. Once that’s put into process, it doesn’t matter whether you’re pointing it at real-world data, prescription usage, lab-scale testing, or molecular analysis.

DA: Can you explain the concept of cloud computing and the issues that flexFS was created to help solve?

ZP: flexFS was born from the need to support large-scale analysis and data storage in the cloud. The cloud is famous for providing a virtually unlimited number of machines for performing calculations. With BurstMode, we developed a task manager that uses hundreds of workers efficiently. This means that you’re able to tune the amount of data each machine gets and keep their CPUs running close to full utilization. As a result, we have been able to execute large linear and logistic regressions or other calculations with tens of covariates — and do complicated math quickly.
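
The fan-out pattern being described can be sketched with an ordinary Python process pool standing in for cloud burst workers; the slab sizes and the least-squares workload are invented, and this is not the BurstMode API:

```python
# Toy sketch of the fan-out pattern: a task manager hands fixed-size slabs
# of work to a pool of workers and keeps them busy.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def fit_slab(seed, n_models=1_000, n_obs=500, n_covariates=30):
    """Fit a batch of ordinary least-squares models and return the coefficient shape."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_obs, n_covariates))
    Y = rng.standard_normal((n_obs, n_models))     # one column per regression
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)   # solve all models in one call
    return coef.shape

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        for shape in pool.map(fit_slab, range(8)):
            print(shape)  # (30, 1000) per slab
```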

Getting the data to the machines and getting it back out is hugely complicated and time consuming without flexFS. However, flexFS scales linearly with the number of workers; it can accept data from hundreds or even thousands of machines. It’s completely unique in that way and beats any other solution, especially considering the cost, as it’s very inexpensive relative to what else is offered.

The final piece that complements flexFS is the P4 bridge, which allows us to do high-throughput dimensional indexing on the data output from the burst workers and store it back into SciDB. A recent example we worked on computed 1.7 billion linear regressions using 30 covariates. The reason people usually show results for only four covariates is that the compute curve takes off beyond four and the problem gets exponentially more difficult. With flexFS, it took 20 minutes using the bridge to store the results back into SciDB. Now the data are actually sliceable and understandable with the PheGe GUI. The entire round trip from raw to processed data costs only $118.

When we first started these exercises of working with a billion regressions, it cost a thousand dollars — we’ve lowered that by more than a factor of 10 while increasing the speed tenfold. The technology is unparalleled because we wrote the software ourselves, starting with BurstMode, then flexFS, and now the P4 bridge.

DA: In this particular context, you’ve accelerated things while bringing down the cost. Are there any other bottlenecks to be addressed, or has this solved the main challenges?

ZP: Well, here at “Santa’s workshop,” we’re never satisfied with last year’s results. We’re currently focused on an idea designed to solve one of the challenges with Big Data, which is having to work with an entire, intact data set. SciDB solves part of that problem through dimensional indexing and “chunking” — breaking a big data set up and distributing it around a cluster.

This concept has been abstracted so that we can export chunks of data back into flexFS to be stored in parquet and arrow file formats. There’s a chunk map just like in SciDB. These are high-performance file formats, and machine learning tools like TensorFlow can thus pick up even small pieces of data for analysis without the traditional extract, transform, and load process.

We’re taking what makes computations so efficient out of SciDB and abstracting it into flexFS, where you should again be picking up 10- or 100-fold speed improvements to support machine learning and big analytics. As an example, there’s no need to scan an entire chromosome when you can use a small range of interest for the analysis. Known data sets that represent graphs can also be represented in a chunked file structure. While the horizon for this is approximately three to five years away, the building blocks are already in place.
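
The chunked-file idea can be illustrated with generic Parquet partitioning: write results split by chromosome, then read back only a small genomic range instead of the whole table. The file layout, column names, and values below are hypothetical and unrelated to the actual flexFS chunk map:

```python
# Hedged sketch: data partitioned by chromosome so a reader can pull one
# small range without loading the whole chromosome.
import pandas as pd

df = pd.DataFrame({
    "chrom": ["chr17"] * 3 + ["chr1"] * 2,
    "pos": [7_571_720, 7_578_406, 7_579_472, 1_000_000, 2_000_000],
    "beta": [0.11, -0.03, 0.27, 0.05, -0.14],
})

# One Parquet partition ("chunk") per chromosome
df.to_parquet("assoc_results", partition_cols=["chrom"], engine="pyarrow")

# Read back only the partition and rows of interest -- no full-table ETL
region = pd.read_parquet(
    "assoc_results",
    engine="pyarrow",
    filters=[("chrom", "=", "chr17"),
             ("pos", ">=", 7_571_000), ("pos", "<=", 7_580_000)],
)
print(len(region))  # 3
```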

DA: While we’re on a Santa’s workshop theme, what are your plans for different applications and user options for REVEAL, including what’s next?

ZP: We’ll be doing more with data curation and providing pre-loaded apps that streamline validation and uptake. Over the next few years, I believe that people will really put our tools into practice, so that we’ll have many more algorithms that have been diversified and can run quickly at scale. Fundamental problems that are holding back the field, like the normalization and batch-effect issues in REVEAL: SingleCell or questions about comparing two data sets with each other, will be addressed. The answer boils down to practice. It takes our competitors months to generate barely any usable information — with REVEAL and SciDB, it takes seconds.

DA: You mentioned that you’re more focused on biotech than academic labs. Was that essentially a commercial decision?

ZP: Yes, we work with customers that can help support the company, but over the next few years we plan to make a number of web portals available to academic users. Even now, we’re more than happy to work with academic groups, and we do. We know that the technology isn’t yet at a cost level that they can typically access, but it’s coming.

DA: Can you tell me about your strategic partnerships?

ZP: We have an interesting development with The HDF Group and flexFS. HDF5 is a scientific file format that stores data in a two-dimensional matrix with metadata in a separate compartment. HDF5 has been used mostly in on-premises HPC environments, but they are moving it to the cloud. The first step is an evaluation of flexFS as a backend. That should open up use with other formats, like the Allotrope Data Format, which is based on HDF5.
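
For readers unfamiliar with HDF5, the sketch below shows the basic layout being described, a numeric dataset with its metadata stored separately as attributes. It uses the standard h5py library with invented names and has nothing to do with the flexFS evaluation itself:

```python
# Small illustration of the HDF5 layout: a matrix stored as a dataset, with
# metadata kept in a separate compartment (attributes).
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    dset = f.create_dataset("intensities", data=np.random.rand(4, 3))
    dset.attrs["instrument"] = "hypothetical-spectrometer"  # metadata compartment
    dset.attrs["units"] = "arbitrary"

with h5py.File("example.h5", "r") as f:
    print(f["intensities"].shape, dict(f["intensities"].attrs))
```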

We’d love to form partnerships with mass spectrometry companies in particular, because the problems that we’re solving with REVEAL: Proteomics are fundamental to the “many-to-many,” the “many-to-one,” and the “one-to-many” categories. The many-to-many challenge is also clear in metabolomics using mass spectrometry.

DA: Do you see Paradigm4’s current business model evolving from service to a product relationship, including off-the-shelf products?

ZP: It’s going to be off-the-shelf. However, the challenge with biobanks is that you need a material transfer agreement (MTA) with the originator of the data. We want to be a trusted research environment where customers can push a button and set up the REVEAL: Biobank to rapidly work with biobank scale data. There’s tremendous value in what we’ve done with REVEAL: SingleCell, where we import pre-curated public data sets so that customers can validate quickly. Look out for more pre-packaged data sets, especially reference data sets, like HUGO and ENSEMBL.      

DA: How do you envision the impact that your work will have on the industry, including at the patient level, both now and in the future?

ZP: Patients will definitely benefit from customers that work closely with Paradigm4. While I can’t disclose the name of this customer, they’ve taken three novel targets into the clinic in 18 months using REVEAL: Biobank. This is phenomenal from a patient perspective, because there will be three new rare disease areas that are going to be treated quickly. It will also elevate this particular customer from a me-too product company to a unique product company, which is a huge transition and the goal of most pharma organizations. For example, there are six angiotensin II receptor blockers. Candesartan, which was one of the best from Boehringer Ingelheim, was fifth to market — nobody can afford to be fifth anymore, and everyone wants to be first-in-class.

We’ve already architected PheGe as a classic graph. A graph is nodes and the edges connecting them, and we can store millions of edges for any pair of nodes, which is unheard of. As we advance, the offering will shift from being a clever data store to a knowledge graph in arrays, where you’re not limited by edges, so it can be computed on in its entirety. I think that capability is going to truly explode in the next couple of years.
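
The "millions of edges per node pair" point can be pictured as an edge table rather than a single adjacency-matrix entry: each row is one edge with its own type and numerical support. The sketch below uses pandas with invented values, purely to illustrate the data shape:

```python
# Sketch of a multigraph kept in array/table form: one variant-phenotype
# pair can carry any number of evidence edges, each with its own score.
import pandas as pd

edges = pd.DataFrame({
    "source": ["rs123", "rs123", "rs123", "rs456"],
    "target": ["LDL", "LDL", "LDL", "HbA1c"],
    "evidence": ["GWAS", "eQTL", "pQTL", "GWAS"],
    "score": [0.12, 0.45, 0.31, -0.08],
})

# All supporting edges for one node pair, each with its numerical backing
pair = edges[(edges.source == "rs123") & (edges.target == "LDL")]
print(pair)
```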

My intuition is that we’ll soon be able to organize all components into arrays, where you can migrate the results from the molecular and intermolecular level up to the pathway level. It will bring in diseases, medicines, and all other relevant factors. I started my Ph.D. in 1984, and I’ve been waiting patiently since then to go from just correlations to actually testing hypotheses across molecular modalities — I’m excited by what the future holds.
