Embracing Data Silos: Semantic Search and Analytics Innovation
Walk around any large organization and you'll hear people groan
about finding the right data to do their work. In the typical organization, data sits in multiple places, locked behind technical
and functional boundaries. These isolated systems, referred to as “data silos,”
often exist for good reasons, such as helping each business
function do its job well and meet legal requirements.
However, without a cohesive and unified data view, informed
decision-making across an organization becomes difficult, and inefficiencies
arise. Increases in data volume and velocity intensify this headache.
Companies tend to address this in three ways: maintain a
distributed network of specialized databases, shift to a centralized database,
or transition gradually to a federated system. Many managers then jump straight
to a technical solution or hire a data scientist to deal with the messy data.
Upgrading to a centralized database seems tempting. Eric Little, the CEO of LeapAnalysis and the Chief Data Officer of the OSTHUS Group, summed up this traditional mindset in a recent DATAVERSITY® interview:
“I need to take all my data lying all around my company and somehow put it in one big master system that I will build. That means getting my data across the entire enterprise, even those from 30-year-old systems which are not in use, and somehow connecting with hundreds or thousands of employees across the world, who may have data scattered in a collection of text files or Excel sheets. On top of which, the person who knows what the columns mean in that 30-year-old system may be dead or retired, cruising in a catamaran along Costa Rica.”
For companies with tons of file stores, maybe even some in a data warehouse or a variety of relational systems, reorganizing data looks daunting, especially as it often involves a heavy dose of extract, transform, and load (ETL). Nowadays, organizations wish to store raw data in a centralized data lake. However, the extensive costs and the year-plus project of merging data into the latest shiny new technology can be problematic.
Torsten Osthus, CEO of OSTHUS Group and a co-founder of LeapAnalysis, reflected in the same interview, “in the mid-2000s, the software industry focused on system integration and capabilities instead of data integration and managing data as a corporate asset.” But this approach is running into a brick wall with AI and machine learning. Furthermore, as Osthus said, organizations fail to bring the contextual knowledge in people’s heads into their systems.
Machine learning is data hungry, requiring data on the order of petabytes in order to be successful and to “learn.” For example, Little said, Life Sciences workers and researchers see “massive image files from high throughput screening, or have to search for data on proteomics and genomics,” for example, to better understand biomarkers for diseases, or they must sift through the variety of “MRI’s and scans” from doctors’ offices. Machine learning may be used to do some augmented analytics, but, as Little said, “you are not going to be able to database all that in a central location where everyone has access.”
Even if all the information were stored, there are legal
ramifications. Little remarked, “certain data at one of our customer sites can’t
leave Germany for legal reasons. How do you port it over to the U.S.? It can’t
leave.” Furthermore, employees, like the IT guru (the master of the
machines), can be quite protective of the data sources they use and control. The
idea that everyone is going to form a circle in an Enterprise Information
Management system, “hold hands and sing Kumbaya,
is a fallacy,” explained Little.
Data silos are a reality: they were designed for a business purpose and they are here to stay. So how can organizations deal with them? Helping organizations figure this out is a central piece of the LeapAnalysis puzzle.
How to Make Data Silos Work
Achieving success with data silos requires a different
approach “than thinking about what we can do with code now or even solely
computer science,” said Little. “It is about making computers better with searching
and working in a new way.” Little’s background in philosophy and cognitive
neuroscience provides this new context. He stressed the importance of the “semantic
component, the controlled vocabularies and taxonomies. All the logical stuff that organizes information”
so that computations (e.g., machine learning techniques) actually work.
Torsten Osthus added to Little’s ideas:
“Let us do machine learning. But, we need to leverage the data, information and knowledge as contextual assets of digitization. Especially, we need to bring people’s knowledge, data, and business process know-how together. Brains are silos in organizations as well, those with data assets to tap. Disrupted data comes from a bottom-up approach. Create a knowledge graph, a semantic engine under the hood based on a top-down approach and bring all the data and knowledge together. It is a true federated approach where data can stay in its original source.”
Our brains thrive as pattern and association machines. So
can a computer, with a knowledge graph behind a search and analytics engine.
Connect metadata to the knowledge graph and each silo, and make data FAIR: Findable, Accessible, Interoperable,
and Reusable, said Osthus. The user then
sees the schema of the relevant data sources and can explore further.
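As a rough illustration of what connecting silo metadata to a knowledge graph might look like, the sketch below uses the open-source rdflib library in Python. The namespace, source names, and properties are hypothetical, not LeapAnalysis’s actual model; the point is that the graph holds only descriptions of the silos, so a query can find where a concept lives without moving the data.

```python
from rdflib import Graph, Literal, Namespace, RDF

# Illustrative namespace and vocabulary -- not a real LeapAnalysis schema
EX = Namespace("http://example.org/catalog#")

g = Graph()
g.bind("ex", EX)

# Describe two silos as metadata only; the underlying data never moves
g.add((EX.crm_db, RDF.type, EX.DataSource))
g.add((EX.crm_db, EX.location, Literal("postgres://emea-crm.internal")))
g.add((EX.crm_db, EX.exposesConcept, EX.Customer))

g.add((EX.lab_files, RDF.type, EX.DataSource))
g.add((EX.lab_files, EX.location, Literal("s3://lab-archive/proteomics/")))
g.add((EX.lab_files, EX.exposesConcept, EX.Protein))

# "Findable": ask the graph which silos can answer a question about proteins
query = """
PREFIX ex: <http://example.org/catalog#>
SELECT ?source ?location WHERE {
    ?source a ex:DataSource ;
            ex:exposesConcept ex:Protein ;
            ex:location ?location .
}
"""
for row in g.query(query):
    print(f"Concept 'Protein' is available in {row.source} at {row.location}")
```

Here the graph acts only as a catalog; the search layer consults it to decide which silos to query, while the records themselves stay in their source systems.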
How does one get from knowledge graph to results? Little commented, “we find a very clever way to do machine learning on the data source. Pull the schema, read it, and align it. If we get weird columns, go to the subject matter experts to extract meaning.” Everything stays where it is in the silo, including the Data Governance, Data Stewardship, and security. Little described how the different search engine components work:
“Put a virtual layer between the silos and the user interface. A knowledge graph lies within this middleware with semantic models, connected to a data connector and translator using APIs, REST connectors, or whatever. We make the data sources locally intelligent to self-report what they are, where they are and how to get to them. User queries from the top interface pass through the middleware via SPARQL, a language that talks with this knowledge graph. A mechanism in the knowledge graph talks directly to the data sources, filters data elements and brings the best matches back as search results. Those results can then have deeper analytics run against them, be visualized, etc.”
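The quote above describes an architecture rather than an API, but a toy version of the idea, a virtual layer routing concept-level queries to self-describing sources, might look something like the Python sketch below. The class and method names are hypothetical illustrations, not LeapAnalysis code:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SourceDescription:
    """What a silo self-reports: what it holds, where it lives, how to reach it."""
    name: str
    location: str
    concepts: List[str]


class Connector:
    """Hypothetical connector for one silo; a real one might wrap REST APIs, SQL, or files."""

    def __init__(self, name: str, location: str, concepts: List[str]):
        self._desc = SourceDescription(name, location, concepts)

    def describe(self) -> SourceDescription:
        return self._desc

    def fetch(self, concept: str) -> List[Dict[str, str]]:
        # A real connector would translate the concept into a source-native query
        # (SQL, REST call, file scan) and return only the matching records.
        return [{"source": self._desc.name, "concept": concept}]


class Mediator:
    """Virtual layer: routes a concept-level query to every silo that says it can answer."""

    def __init__(self, connectors: List[Connector]):
        self._connectors = connectors

    def search(self, concept: str) -> List[Dict[str, str]]:
        results: List[Dict[str, str]] = []
        for connector in self._connectors:
            if concept in connector.describe().concepts:
                results.extend(connector.fetch(concept))
        return results


mediator = Mediator([
    Connector("crm_export", "postgres://emea-crm.internal", ["Customer", "Order"]),
    Connector("lab_results", "s3://lab-archive/assays/", ["Protein", "Assay"]),
])
print(mediator.search("Protein"))  # only the lab silo answers; the data stays in place
```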
With a single click, the search engine returns
high-level data from multiple sources across the data ecosystem. From the
results, a person can identify data resource pockets, sets of patterns that
answer their query more quickly (and that learn and improve performance over time). They
can further narrow down the query or explore in detail, as permissions allow.
This tool can expunge results, cache them, or export
them in a different format, e.g., CSV. The
user interrogates the knowledge engine via a query or analytic, forming “a
semantic to everything translator, through SPARQL, while leaving the data in
place and making it easier to fetch the detailed information.” This model
depicts a true data federation where data stays in place with no intensive ETL;
search and analysis can happen on the fly.
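To give a concrete flavor of on-the-fly federation, the sketch below issues a single SPARQL 1.1 federated query through the SPARQLWrapper library, pulling matching records from two silos in place via SERVICE clauses. The endpoint URLs and vocabulary are invented for illustration, and this is the generic SPARQL federation pattern rather than LeapAnalysis’s own mechanism:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical middleware endpoint that understands SPARQL 1.1 federation
sparql = SPARQLWrapper("http://middleware.example.org/sparql")
sparql.setReturnFormat(JSON)

# One query, two silos: each SERVICE block runs against the remote source,
# so the underlying data never has to be copied or ETL'd into a central store.
sparql.setQuery("""
PREFIX ex: <http://example.org/schema#>
SELECT ?patient ?biomarker ?scanFile WHERE {
  SERVICE <http://lab-silo.example.org/sparql> {
    ?patient ex:hasBiomarker ?biomarker .
  }
  SERVICE <http://imaging-silo.example.org/sparql> {
    ?patient ex:hasScan ?scanFile .
  }
}
LIMIT 10
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["patient"]["value"], row["biomarker"]["value"], row["scanFile"]["value"])
```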
Speed and Knowledge
LeapAnalysis puts Little’s ideas into practice with the philosophy, “Fast as hell, no ETL.” Now customers can integrate data in minutes to hours rather than months to years, bringing the right data together. As Little explained:
“We solve the problem of speed to knowledge to solve actual business problems. Can a person get a quick way to go to that knowledge? Not just building technology for the sake of building technology. Pull concepts in queries through semantics and do it in an intelligent way, through a knowledge graph. Attributes of the items inside of the algorithms, the classifiers, become clearer because the algorithms are now connected to the concepts in the knowledge graph.”
Little and Osthus highlighted several other features:
“Using a search engine to map a question semantically has
been horrible for years,” Little said. Partially due to this negative
experience, businesses have tried to solve information disorganization by either
combining everything from disparate sources in one place or by hiring a data scientist,
or similar expert with domain knowledge, to squeeze information from all the
data scattered across the organization, a very manual effort. Such a person needs to
know the ins and outs of searching, like an auto mechanic tuning up an engine.
Little and Osthus are “making the alignment between different
meanings simpler through a truly federated system.” A chemist, biologist, or
bioinformatician can jump into their research without needing to learn a new
centralized data system or hand the task off to someone in IT.
Osthus provided a parting thought:
“In the past, data integration was driven by costly programming and writing complex SQL statements. Now it’s a business perspective that can be done by the users. Embrace your Data Silos.”