Mining the Mountain of Digital Discovery

by Jason Slick, President, Cynic Inc.

In today’s global economy, data organization and classification have become the cornerstone on which a financial investigation relies. Everything from converting paper records to digital data to searching that information for classification has become an arduous task as fraud cases have grown more complicated. To address this increase in data complexity, a few standardizations of technique and technology can make a huge difference in the number of hours needed to take a fraud investigation from start to finish. In this article I will discuss the various techniques and processes for building financial fraud cases using computer technology. From data entry and conversion to advanced search algorithms that aid in classification, managing a financial fraud case has never been easier, if you know the right techniques and tools to use.

Since this article is written for the average fraud investigator, it is important to clarify some of the more advanced technical words I am going to be using.

Important Definitions

OPTICAL CHARACTER RECOGNITION, abbreviated as OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text into machine-editable text.

DATA OBJECT is a digital representation of a real-world object’s state and characteristics.

DATA MINING is the process of searching and sorting through large amounts of data and picking out relevant information.

An ALGORITHM is a sequence of instructions often used for calculation and data processing.

Now I know you are saying to yourself, “oh no, I cannot learn advanced computer terminology” or “I better let my computer guy read this”, but stick with me and I will clarify the importance of a basic understanding of these terms. Even if you use someone or some software that has these features hidden from you in a nice user-interface, it is always good to understand the concepts that make the computer work for the modern-day investigator.

Data Entry and Data Conversion

With the advent of technology, the data entry process has become the most important step in building a financial fraud case that is both accurate and complete. Whether the data entry process is manual or computer-aided, precise information is of the utmost importance. In this section, I will discuss the various data entry techniques that are available to the modern financial investigation team.

Whether you are an old pro or new to the business, you are familiar with the discovery that is important to any fraud case. Bank records, credit card statements and receipts make up most of the information, and translating this information into a form the computer can use can be a daunting task. The computer is not yet smart enough to be shown a piece of paper, recognize the type of document, and then enter and organize the data on its own. This is where the data entry group and the “eyeballs” of humans play the most important role. You must have people verify that all information entered into the computer is accurate. There is no point in a fraud examiner analyzing data that is incorrect. The case is only as good as the information obtained. Having a person who can validate the integrity of the data entered, whether manually or using Optical Character Recognition, is important. Humans and computers can mistakenly enter 5’s for S’s and commas for periods, which makes for errors in the case management process. Catching these problems is the job of the data entry group, and there are many tools available to assist in this process.

The tried and true methods of manual data entry have been overtaken by advances in OPTICAL CHARACTER RECOGNITION, referred to as OCR. While it can take up to two minutes for a human to enter one bank transaction, the computer can read a scanned document and translate the image to text in seconds. There are many applications, such as Nuance OmniPage, Abbyy FineReader and Adobe Acrobat, that work well at character recognition; finding one that is useful for you is important.

How they work is simple. If you do not have a PDF or any other image of the document, you must first get the information into a digital image, and using a scanner is the best method. Most scanners work easily with programs such as Adobe Acrobat or Microsoft Word for making usable images of documents. The most important setting during scanning is image quality; I try to use a setting of at least 300 dpi (dots per inch) for any scanned document. The higher the resolution, the better the chance the software has of an accurate translation.

Once you have the image of the document, use your OCR application to translate or recognize the characters, then save the data in whichever text or file format you plan on using. Most OCR programs can work with Word, Excel and other file types and can export the data as a text file using comma- or tab-delimited values. Using some sort of delimiter gives the data a consistent marker that aids with data separation and organization, while maintaining an application-neutral data file. Open the file in Notepad or another simple text editor; from here you can use copy and paste to further organize the data. What you are looking for is a consistent layout of information and one line of data per transaction. Make sure word wrap is turned off in the application and verify that each transaction has the same number of entries.

datetime1	amount1

datetime2	amount2

datetime3	amount3

In this example the transaction can be either a debit or a credit depending on whether the value is positive (credit) or negative (debit). I understand you cannot see it, but hidden in the code that separates the datetime from the amount is a \t, which means tab. The computer can now read the example and be instructed to separate the two values from each other using the tab as the delimiter. By creating patterned data, the amount of manual data manipulation is decreased through consistency. In fact, there are many occasions where OCRing just a few transactions is slower than manual data entry, so being consistent with your delimiters is the key. Hint: Try to create delimiters that are unique and have almost no chance of appearing in the text you are working with.

Example 2: simple bank transactions that use a custom text delimiter of %#%

datetime1%#%amount1

datetime2%#%amount2
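As a minimal sketch of how a program might read the two layouts above, the Python below splits each line on its delimiter and checks that every transaction has the same number of entries. The function name parse_transactions and the sample dates and amounts are made up for illustration; they are not taken from any particular OCR product.

```python
def parse_transactions(text, delimiter="\t", expected_fields=2):
    """Split each non-blank line on the delimiter and check the field count."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines between transactions
        fields = line.split(delimiter)
        if len(fields) != expected_fields:
            raise ValueError(
                f"line {line_number}: expected {expected_fields} fields, "
                f"got {len(fields)}"
            )
        rows.append([value.strip() for value in fields])
    return rows


# Hypothetical sample data: one transaction per line, datetime then amount.
tab_sample = "01/05/2009 09:30\t-25.00\n\n01/06/2009 14:10\t100.00\n"
custom_sample = "01/05/2009 09:30%#%-25.00\n01/06/2009 14:10%#%100.00\n"

print(parse_transactions(tab_sample))
print(parse_transactions(custom_sample, delimiter="%#%"))
```

Raising an error on a line with the wrong number of entries is one possible design choice; it forces the data entry group to look at the bad line before the data goes any further.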

Either way, the data has to be verified as correct and laid out in a way that is useful for the investigator. Using Microsoft Excel or almost any other spreadsheet program, you can import the delimited text easily, with each value going into its respective column. Some applications like Excel allow the programming of macros and algorithms that can be used to calculate the running balance of transactions as they apply to the bank or credit card statements. There are many features embedded in spreadsheet programs, like sum total, that perform these calculations quickly. As an example, a bank statement has a daily balance included on the statement. Using the sum of the transactions, a person can compare the digital data with what the statement says. If you are missing any information or there is a typing error such as a wrong amount (50.00 instead of 5.00), the error will be noticeable because of the deviation from the statement or the failure of the application to perform as intended.
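The same running-balance check can be sketched outside a spreadsheet. The Python below is an illustration only: the function name find_balance_mismatches, the opening balance, and the sample figures are assumptions, and it expects the transactions to be listed in date order.

```python
from decimal import Decimal


def find_balance_mismatches(opening_balance, transactions, statement_balances):
    """Compare computed end-of-day balances with the statement's daily balances.

    transactions: list of (day, amount) pairs in chronological order.
    statement_balances: dict mapping day -> balance printed on the statement.
    """
    running = Decimal(opening_balance)
    computed = {}
    for day, amount in transactions:
        running += Decimal(amount)
        computed[day] = running  # last value stored for a day is its closing balance
    mismatches = []
    for day, stated in statement_balances.items():
        stated = Decimal(stated)
        if computed.get(day) != stated:
            mismatches.append((day, computed.get(day), stated))
    return mismatches


# Hypothetical data: the first debit was typed as 50.00 instead of 5.00.
entered = [
    ("01/05/2009", "-50.00"),
    ("01/06/2009", "200.00"),
]
statement = {"01/05/2009": "995.00", "01/06/2009": "1195.00"}

for day, computed, stated in find_balance_mismatches("1000.00", entered, statement):
    print(day, "computed", computed, "statement", stated)
```

Because an error carries forward into every later balance, the first day reported is usually the place to start looking.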

Data Objects and the Financial Investigator

Data Objects are not an easy concept for some to grasp, but it is important to have a basic knowledge of them to understand how computers handle data efficiently. As defined earlier, a Data Object is a digital representation of a real-world item. What does this mean? In the case of a bank account, it means that a Bank Account Data Object has certain properties and associations that are universal to every bank account. Bank accounts always have an account number and an opened date. Bank Accounts can also contain other data objects within them, like bank transactions. Bank Transaction Data Objects have their own properties, such as the datetime of the transaction, whether the type is a debit or credit, and the amount of the transaction. By creating Data Objects for every possible financial transaction, we can build a set of rules that the computer can interpret to aid the investigator in case management.

With Data Objects the computer can determine whether a field must meet some type of parameter or criteria, like the date being in a particular format like mm/dd/yyyy or amount of transaction being of the money type. This allows for the computer to warn the person, or a computer application, of data entry errors that need to be addressed. We can guarantee that check amounts are money fields and names need to exist for our case’s subjects. By using Data Objects we can maintain minimum standards of data that guarantee our search algorithms and association techniques will function correctly.
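To make the idea concrete, here is a minimal sketch of a Bank Account Data Object that contains Bank Transaction Data Objects and reports fields that break the rules described above. The class names, field names, and sample values are assumptions chosen for illustration, and the money rule (digits with exactly two decimal places) is a simplification.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime

# Simplified money rule: optional minus sign, digits, two decimal places.
MONEY_PATTERN = re.compile(r"-?\d+\.\d{2}")


@dataclass
class BankTransaction:
    date: str    # expected in mm/dd/yyyy format
    kind: str    # "debit" or "credit"
    amount: str  # money value kept as entered, e.g. "125.00"

    def errors(self):
        problems = []
        try:
            datetime.strptime(self.date, "%m/%d/%Y")
        except ValueError:
            problems.append(f"date not in mm/dd/yyyy format: {self.date!r}")
        if self.kind not in ("debit", "credit"):
            problems.append(f"type must be debit or credit: {self.kind!r}")
        if not MONEY_PATTERN.fullmatch(self.amount):
            problems.append(f"amount is not a money value: {self.amount!r}")
        return problems


@dataclass
class BankAccount:
    account_number: str
    opened_date: str
    transactions: list = field(default_factory=list)

    def errors(self):
        problems = []
        if not self.account_number.strip():
            problems.append("account number is missing")
        for position, txn in enumerate(self.transactions, start=1):
            problems.extend(f"transaction {position}: {p}" for p in txn.errors())
        return problems


# Hypothetical account; the second transaction has a bad date and an S typed for a 5.
account = BankAccount("000123456", "03/15/2005", [
    BankTransaction("01/05/2009", "debit", "125.00"),
    BankTransaction("13/45/2009", "credit", "12S.00"),
])
for problem in account.errors():
    print(problem)
```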

Data Mining and Classification

A very important feature that an investigator needs is the ability to search, sort and classify the data. Without searching and sorting features, the classification of financial transactions can be laborious. The investigator needs to be able to classify transactions as business expenses, income, investments, etc. efficiently and consistently. In this section I will discuss the various methodologies available to the investigator that assist with data organization and classification.

The first thing a financial investigator needs to do with classification is to organize the recurring items, whether they are expenses or income. For example, a business might have an office with rent, electricity, phone and internet bills that are due and paid every month. These transactions tend to be paid from the same account and in the same manner, such as always by check or always with a debit card. They also usually have the same information profile, like notations in the bank statements. For example, transactions for cable TV/internet might always have a check written to Time Warner; by using the name of the company in a simple search we can easily find every transaction that refers to that name. A simple search is just typing a word or words exactly as you expect them to appear within some text. Simple search and sorting techniques categorize repeating data quickly, which narrows the set of financial objects that still require classification. If a company writes on average ten checks every month and six of those checks are for company bills, we have reduced the unorganized data set by 60%. This greatly reduces the size of the unorganized data and helps the investigator recognize patterns of fraud or uncharacteristic spending habits more quickly.
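A first pass of this kind can be sketched as a simple keyword lookup. In the Python below, the function name classify_by_keyword, the memo text, and the category names are hypothetical; the search is a plain case-insensitive substring match, so a short keyword can also match inside a longer, unrelated word.

```python
def classify_by_keyword(transactions, rules):
    """Assign a category to each transaction whose memo contains a known keyword."""
    classified, unclassified = [], []
    for txn in transactions:
        memo = txn["memo"].lower()
        for keyword, category in rules.items():
            if keyword.lower() in memo:
                classified.append({**txn, "category": category})
                break
        else:
            unclassified.append(txn)  # no keyword matched; needs a later pass
    return classified, unclassified


# Hypothetical bank statement notations and recurring-bill rules.
transactions = [
    {"memo": "CHECK 1021 TIME WARNER", "amount": "-89.99"},
    {"memo": "CHECK 1022 OFFICE RENT", "amount": "-1500.00"},
    {"memo": "DEBIT CARD CITY ELECTRIC", "amount": "-210.40"},
    {"memo": "ATM WITHDRAWAL", "amount": "-400.00"},
    {"memo": "CHECK 1023 J SMITH", "amount": "-2500.00"},
]
rules = {
    "time warner": "cable tv/internet",
    "rent": "office rent",
    "electric": "electricity",
}

classified, unclassified = classify_by_keyword(transactions, rules)
print(len(classified), "classified;", len(unclassified), "left to review")
```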

The second pass on the data needs to determine the movement or transfer of money between accounts. Sometimes the transfer of money is easy to spot because there is a direct one-to-one correlation between the values, e.g. a $1000 debit from one account becoming a $1000 credit in another account. With many accounts, however, these transfers can be hard for an investigator to spot. This is where a simple search on the amount fields can show just the transactions with that particular amount. Other times searching is more difficult. For example, some of the money may have been transferred to an asset like a car, or even put in the subject’s pocket as cash. Any time funds are taken out of a bank account and spent as cash, a transfer of money needs to be made to a virtual transaction in the subject’s cash account. A virtual transaction is a representation of an assumed value or item that needs to be accounted for. For example, if the person says they paid cash for a $2000 TV, the investigation needs to show a debit from the cash account for the money. Sometimes money is split up, like a $1000 debit where $500 pays a company expense while the other $500 goes into the person’s pocket. By sorting out all of these account transfers we reduce the chance of tabulating the same transaction more than once.
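The one-to-one case can be sketched as a search on the amount fields across accounts. The Python below is a simplified illustration: the function name match_transfers, the account names, and the amounts are assumptions, it pairs each debit with the first unused credit of the same size in a different account, and it does not handle split transfers or compare dates.

```python
from decimal import Decimal


def match_transfers(transactions):
    """Pair each debit with an equal-sized credit in a different account."""
    debits = [t for t in transactions if Decimal(t["amount"]) < 0]
    credits = [t for t in transactions if Decimal(t["amount"]) > 0]
    used = set()  # indexes of credits already paired with a debit
    pairs = []
    for debit in debits:
        for index, credit in enumerate(credits):
            if index in used:
                continue
            same_size = Decimal(credit["amount"]) == -Decimal(debit["amount"])
            if same_size and credit["account"] != debit["account"]:
                pairs.append((debit, credit))
                used.add(index)
                break
    return pairs


# Hypothetical transactions from two accounts.
transactions = [
    {"account": "checking", "date": "01/05/2009", "amount": "-1000.00"},
    {"account": "savings", "date": "01/05/2009", "amount": "1000.00"},
    {"account": "checking", "date": "01/07/2009", "amount": "-40.00"},
]

for debit, credit in match_transfers(transactions):
    print(debit["account"], "->", credit["account"], credit["amount"])
```

Every candidate pair still needs a human look, since two unrelated transactions can share the same amount.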

By the time we reach the third pass we should have about 80% of all data organized and need to apply only a few more advanced searching techniques to further organize the case. There are many different methods for searching that can be applied to financial objects; we are only going to discuss a few. The math and computer science behind these search techniques are a bit technical for this article, but I feel it is important to mention them. Fuzzy searching is the name given to finding strings of text that are an approximate match, like finding ball when looking for bill. Soundex searching looks for words that sound similar in English, such as Smith and Smyth; because Soundex keeps the first letter of each word, it will not pair words that begin with different letters. Wildcard searching uses a character that can stand in for any sequence of characters in a string; the wildcard is usually represented by the * key (star). For example, searching for fra* should reveal results like France, fray, Frank, frazzle, etc. Fuzzy and Soundex searching are good tools for finding data where typing errors have occurred. Just because we have built our data objects properly does not mean that errors in data entry did not occur. A data object value that is a text field will accept ‘5+3ve’ just as readily as ‘Steve’; if you find several errors like this, there could be a breakdown somewhere in your data entry process that needs to be addressed.
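For readers who want to see the three techniques side by side, here is a minimal Python sketch. It uses the standard library's difflib for approximate matching and fnmatch for wildcards; the soundex function is a simplified version of the classic American Soundex rules written for this illustration, and the word lists are made up.

```python
import difflib
import fnmatch

# Letter groups used by Soundex; vowels, H, W and Y have no code.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return a four-character code: first letter plus up to three digits."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        digit = SOUNDEX_CODES.get(letter, "")
        if digit and digit != previous:
            code += digit
        if letter not in "HW":  # H and W do not separate repeated codes
            previous = digit
    return (code + "000")[:4]


def wildcard_search(pattern, words):
    """Case-insensitive wildcard match, where * stands for any characters."""
    return [w for w in words if fnmatch.fnmatchcase(w.lower(), pattern.lower())]


# Fuzzy search: words that approximately match "bill".
print(difflib.get_close_matches("bill", ["ball", "table", "bell"], cutoff=0.7))

# Soundex search: names that share a code with "Smith".
names = ["Smyth", "Jones", "Smithe"]
print([n for n in names if soundex(n) == soundex("Smith")])

# Wildcard search: everything that starts with "fra".
print(wildcard_search("fra*", ["France", "fray", "Frank", "frazzle", "first"]))
```

The cutoff value of 0.7 is an arbitrary choice; lowering it finds more typing errors but also returns more unrelated words.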

The rest of the investigation after the third pass just comes down to solid fraud investigator work, mental data mining if you will. Remember that what you are looking for may already have been categorized or sorted during the data mining process. Smart criminals try to hide fraud through fake businesses, fake employees and other methods. If you have organized your data correctly, when you make that connection every related transaction is already at your fingertips.