The data science community likes to speak of dark data and how they, as experts, can bring tremendous insight to any business by simply shining their analytical light on those hidden bits, bytes, and characters.
So, what constitutes dark data? Gartner defines it as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Recently, I read Neil deGrasse Tyson and James Trefil’s tome Cosmic Queries. It is the latest effort by those who know and study the physics of the universe to explain to the curious what is known, and what is not, about its history and makeup. I learned some things.
One unknown is what makes up dark matter. Dark matter is there, and massive experimental efforts like the Large Hadron Collider, which straddles the border between France and Switzerland, continue to probe the theories surrounding dark matter and phenomena like it. There is, however, much more to learn. Astrophysicists must peer into the darkness and analyze what is there.
The core thinking in enterprises is that they retain data because they are required to. Compliance rules dictate securing and retaining certain records for set periods, and that requirement often turns into an expensive proposition.
But is keeping data just expense? Is there value embedded in it? I suspect those whose livelihoods depend on data analytics will argue that there is value and, of course, offer several options, none of which are free, to peer in and break out the significant, business-consequential nuggets from the dark data.
Often forgotten in the effort to mine dark data are the information assets that do not, and never did, exist in a database: the documents that enterprises have captured for decades into content services platforms (CSPs), also known as enterprise content management (ECM) systems.
Data that has structure is relatively straightforward for analytics professionals to process. Yes, they may complain about having to join this with that, resolve duplicates, account for gaps, discard meaningless stuff, and so on. But through it all they are dealing with coded inputs that a program understands. Other than the occasional need to consider national language encoding, the machine knows how to handle the data.
The unstructured nature of documents is often why they went into the CSP in the first place. If all the information contained on the page were already in a nicely formatted data structure, there might be few reasons to keep the original except, maybe, to retain a signature facsimile.
However, documents do not arrive in an organization with comprehensive matching data structures. They arrive in a completely unstructured form. Depending on the type of document, it may make sense to manually convert (data enter) some of its contents into structured data. In other cases it may make sense to convert some of the unstructured contents automatically using document automation.
In either case the cost of conversion limits how much data gets transcribed. In most situations, conversion is limited to the few values needed to associate a document with a business transaction.
The billions and billions of documents residing in enterprise CSPs are dark content. Many of them hold information that, if properly analyzed, could reveal new insights about customers, about trading partners, about market trends, about virtually anything.
So how can a reasonable businessperson access those insights and do it in a fashion that does not break the bank?
The approach differs depending on how ingestion systems initially brought in the content. The content’s age is another factor, because CSP technology and practices evolved over time. Early systems captured simple images, often by scanning physical pieces of paper. As time went on, PDF became a standard and systems moved to storing the text of the document along with the image.
Lately many organizations have abandoned the use of physical documents and traditional mail altogether and are exchanging documents electronically through email and other channels. Many CSPs now directly ingest the electronic documents arriving via those channels.
If content is relevant to an analytics effort and exists only as a simple image, the characters on the document must first be converted into text through optical character recognition (OCR). Any number of conversion software options, both in the cloud and on-premises, will complete that transition accurately and effectively.
Once documents are converted and the CSP supplies access to the full text of the content, any number of analytical approaches become possible. Some CSPs supply direct access to analytic tooling, while others may require copying the content into an analytics-friendly file system.
The primary tools for analysis are, of course, readily available machine learning and artificial intelligence (AI) systems. The corpus of content becomes a base for developing models that can return answers to help grow and optimize the business. In fact, a simple content-clustering exercise, often one of the first steps in model development, can guide creation of the right questions the model should seek to answer.
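To make that clustering step concrete, here is a minimal sketch of the document-similarity measure that most content clustering builds on: TF-IDF term weighting plus cosine similarity, in plain Python. The sample "documents" and the weighting scheme are illustrative assumptions, not drawn from any particular CSP or analytics product; a real project would use a mature library rather than hand-rolled code.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weight dictionaries for a small corpus of strings."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        # Weight = term frequency x inverse document frequency.
        vectors.append({t: (count / len(doc)) * math.log(n / df[t])
                        for t, count in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical snippets of OCR'd document text, two invoices and two complaints.
docs = [
    "invoice payment due net thirty days remit total",
    "invoice total amount due payment remit address",
    "customer complaint product damaged refund requested",
    "customer requested refund after damaged shipment complaint",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # invoice vs. invoice: relatively high
print(cosine(vecs[0], vecs[2]))  # invoice vs. complaint: near zero
```

A clustering algorithm such as k-means simply groups documents whose pairwise similarities are high, which is why even this small computation hints at the natural categories, and the right questions, hiding in a document corpus.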
Tools for creating and optimizing AI models, once strictly the domain of computer scientists, are now readily accessible to anyone with the technical savvy and curiosity to take on an analytics project.
Someday science will figure out dark matter. It will take time and money. Getting real business value from dark content is far easier, and far less costly.