
New tools are making mining mountains of data easier

An introduction to e-discovery’s key concepts.

Attorneys will always need to participate in discovery. One secret to saving money on fees is to limit the number of documents they need to handle.

Fortunately, even as document collections continue to expand to mind-boggling proportions, technologies used to pare these collections down are maturing and gaining acceptance in legal circles.

Getting a handle on this technology means first understanding the concepts that underpin the tools.

A document retention policy can help an organization control the volume of documents it has, and by extension the volume of documents it must search and produce during a discovery project.

A network collection tool scans multiple data sources, like SharePoint file servers and Exchange servers, to collect specific data. When discovery teams begin to cull document collections, they de-duplicate by flagging all documents whose contents are the same. They also take near-duplication into account, flagging documents that are very similar to each other, such as different versions of the same contract.

“You don’t want to make people review the same document 10 different times,” says Martin Felsky, a lawyer and partner with Harrington LLP in Toronto.
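To make the two ideas concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular e-discovery product works: the document ids, the sample texts, the three-word "shingle" size and the 0.4 similarity threshold are all assumptions chosen for the example. Exact duplicates are found by hashing normalized text; near-duplicates are found by measuring how many short word sequences two documents share (Jaccard similarity).

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def exact_duplicate_groups(docs):
    """Group document ids whose normalized contents produce the same SHA-256 digest."""
    groups = {}
    for doc_id, text in docs.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups.setdefault(digest, []).append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = normalize(text).split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(text_a, text_b):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.4):
    """Flag pairs that are similar but not identical (threshold is an assumed value)."""
    ids = sorted(docs)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            if threshold <= jaccard(docs[first], docs[second]) < 1.0:
                pairs.append((first, second))
    return pairs


# Hypothetical sample collection: "b" repeats "a" with an extra space,
# "c" is a later version of the same clause, "d" is unrelated.
docs = {
    "a": "The buyer shall pay the seller ten thousand dollars on closing.",
    "b": "The buyer shall pay the  seller ten thousand dollars on closing.",
    "c": "The buyer shall pay the seller twelve thousand dollars on closing.",
    "d": "Lunch on Friday?",
}

print(exact_duplicate_groups(docs))  # [['a', 'b']]
print(near_duplicate_pairs(docs))    # [('a', 'c'), ('b', 'c')]
```

Comparing every pair of documents this way is fine for a handful of files but would be far too slow for a large collection; the sketch is meant to show the principle, not a scalable method.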

Keyword search involves looking for specific words or phrases in a document collection and is “a good place to start, just to get your arms around some of the key issues and documents in a case,” says Matt Nelson, senior e-discovery counsel for Symantec in San Francisco and author of Predictive Coding for Dummies.

Discussion threading, a method of visually grouping messages by topic, can bolster understanding of what the truly important topics are.

Hacking away at document collections using keyword search alone may do more harm than good. For instance, if correspondents speak of a company takeover, lawyers who use the search keyword “takeover” risk false negatives (relevant documents that are not retrieved because they do not contain the search term).

They may miss documents that contain synonyms such as “buyout” and “purchase.” Synonymy’s linguistic cousin, polysemy, can flood search results with false positives (documents that match the search term and are retrieved, yet prove irrelevant). Does “stock,” for instance, refer to an equities market, a store’s back room or a soup ingredient?
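A short Python sketch shows both problems at work. The memo ids and texts are invented for illustration, and the matching rule (whole words, case-insensitive) is an assumption; real search tools offer many more options.

```python
import re


def keyword_search(docs, keywords):
    """Return ids of documents containing any keyword as a whole word, ignoring case."""
    wanted = {k.lower() for k in keywords}
    hits = []
    for doc_id, text in docs.items():
        words = set(re.findall(r"[a-z']+", text.lower()))
        if words & wanted:
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical collection for illustration only.
docs = {
    "memo1": "The board discussed the takeover at length.",
    "memo2": "Counsel reviewed the buyout terms on Tuesday.",
    "memo3": "The purchase of the company closes next month.",
    "memo4": "Add chicken stock to the soup.",
    "memo5": "The stock price rose after the announcement.",
}

# Synonymy: searching only for "takeover" misses memo2 and memo3 (false negatives).
print(keyword_search(docs, ["takeover"]))                        # ['memo1']
print(keyword_search(docs, ["takeover", "buyout", "purchase"]))  # ['memo1', 'memo2', 'memo3']

# Polysemy: "stock" retrieves the soup recipe along with the share price (a false positive).
print(keyword_search(docs, ["stock"]))                           # ['memo4', 'memo5']
```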

Concept searching helps lawyers deal with synonymy and polysemy. It recognizes conceptual similarities between documents by noticing how words relate to each other, how often they appear together or near each other, how far apart they tend to be and how often they do or do not appear in other documents.
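As a rough illustration of the "words that appear together" idea, the Python sketch below builds a simple profile of the words that share a document with a given term, then compares profiles. This is a deliberately simplified stand-in: commercial concept-search tools use their own, typically far more sophisticated, statistical methods. The sample sentences and the tiny stopword list are assumptions made for the example.

```python
import math
from collections import Counter

# Assumed minimal list of words too common to carry meaning.
STOPWORDS = {"the", "of", "to"}


def context_vector(term, docs):
    """Count the words that appear in the same documents as `term`."""
    vector = Counter()
    for text in docs:
        words = [w for w in text.lower().split() if w not in STOPWORDS]
        if term in words:
            vector.update(w for w in words if w != term)
    return vector


def cosine(a, b):
    """Cosine similarity between two word-count profiles (0.0 means nothing shared)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical collection for illustration only.
docs = [
    "the board approved the takeover of acme",
    "the board approved the buyout of acme",
    "shareholders oppose the takeover bid",
    "shareholders oppose the buyout bid",
    "add chicken stock to the soup",
]

takeover = context_vector("takeover", docs)
buyout = context_vector("buyout", docs)
soup = context_vector("soup", docs)

# "takeover" and "buyout" never appear together, yet they keep the same company.
print(cosine(takeover, buyout) > cosine(takeover, soup))  # True
```

Because “takeover” and “buyout” are surrounded by the same words, the sketch treats them as related even though neither document containing one contains the other, which is the gap a plain keyword search leaves open.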

Steps to cull the initial collection aim for greater precision (the proportion of retrieved documents that are relevant), but no tool will filter out all false positives or prevent all false negatives.

“It becomes a matter of proportionality and risk that the client is willing to take,” and the client must weigh “the cost and any delays,” says Dominic Jaar, a partner and national leader of KPMG’s information management and e-discovery practice in Montréal.
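The precision figure described above is simple arithmetic, as the following Python sketch shows. The document ids and the "relevant" list are hypothetical; in practice, knowing which documents are truly relevant requires human review of at least a sample.

```python
def evaluate(retrieved, relevant):
    """Compare what a search retrieved against what reviewers judged relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant
    false_positives = retrieved - relevant   # retrieved, but irrelevant
    false_negatives = relevant - retrieved   # relevant, but missed
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    return {
        "precision": precision,
        "false_positives": sorted(false_positives),
        "false_negatives": sorted(false_negatives),
    }


# Hypothetical example: the search returned four documents,
# and reviewers consider three documents in the collection relevant.
result = evaluate(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
print(result["precision"])        # 0.5 (two of the four retrieved documents are relevant)
print(result["false_positives"])  # ['d3', 'd4']
print(result["false_negatives"])  # ['d5']
```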

Technology-assisted review (TAR), also referred to as predictive coding, involves letting machines do most of the heavy lifting during review, the costliest part of the e-discovery process.

The TAR process starts with humans reviewing and coding a “seed set” of documents. “They flag documents as relevant or irrelevant,” says Felsky.

The documents and coding are then fed to the system, which uses this information to extrapolate decisions on the remaining documents. Reviewers iteratively code more sample documents and/or give feedback to the system on documents it codes, thus “tuning” the system and improving its precision.

“The program can go through a million-document collection and flag the remaining documents in the same way the legal team flagged the samples,” says Felsky.
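The seed-set workflow can be sketched in a few lines of Python. The example below uses a basic naive Bayes text classifier, chosen here only because it is compact; it is an assumption for illustration, not the algorithm of any specific TAR product. The seed documents, labels and unreviewed documents are invented.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(seed_set):
    """Learn word counts per label from (text, label) pairs coded by reviewers.

    Assumes the seed set contains at least one document for each label.
    """
    word_counts = {"relevant": Counter(), "irrelevant": Counter()}
    doc_counts = Counter()
    for text, label in seed_set:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["relevant"]) | set(word_counts["irrelevant"])
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Pick the label whose seed documents best match the words in `text`."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in the seed set are ignored
                # Add-one smoothing keeps unseen word/label pairs from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical seed set coded by human reviewers.
seed_set = [
    ("board approved the takeover bid", "relevant"),
    ("takeover financing terms attached", "relevant"),
    ("lunch menu for friday", "irrelevant"),
    ("parking garage closed friday", "irrelevant"),
]
model = train(seed_set)

# Hypothetical unreviewed documents.
unreviewed = {
    "u1": "revised takeover bid attached",
    "u2": "friday lunch cancelled",
}
for doc_id, text in unreviewed.items():
    print(doc_id, predict(model, text))
# u1 relevant
# u2 irrelevant
```

The iterative "tuning" step corresponds to adding reviewer-corrected documents to `seed_set` and calling `train` again, so that each round of feedback changes how the remaining documents are flagged.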

TAR may be new in legal circles, but other industries have long relied on similar algorithm-driven systems. For instance, credit card fraud prevention starts with computers tracking where credit card holders use their cards, what they buy and other data. The computers then build a profile for each credit card holder. Using customer profiles, they “can then predict whether a transaction fits a profile or not,” says Jaar.

E-discovery tools can also handle review prioritization, a concept most lawyers know from legal databases or consumer tools like Google, in which the search tool prioritizes results.

Prioritization can come about in several visual formats. Front-end analytics tools give clues to reviewers of where to look, what might be relevant, what to capture and what to leave behind. They can cluster results using graphics like bubbles, heat maps, old-fashioned graphs, even what Don Cameron, a Toronto-based partner with Bereskin & Parr, likens to “a petri dish, those things they put little flecks of germs on and allow them to grow into circles and lines on the dish.”

In one case, Cameron’s graphic resulted from intelligent categorization of 3,000 documents, which Bereskin & Parr staff had reduced from 10,000. He liked the visualization of the software’s sophisticated algorithm. “If this was a table of contents, your eyes would cross pretty quickly,” he says.

Such graphics “help you focus on what you’re looking for,” Felsky adds. “You get a visual map of the data and you can see where documents are clumped together or relate to one another.”

Since document collections frequently consist of files in different formats (like .doc, .pdf, e-mail), tools that can handle diverse data sets are quickly becoming a must.

“We need more discussion about what tools people use, how they use them, how reliable they are,” Felsky says.

This article was originally published in Lawyers Weekly Magazine.