When performing forensic investigations, CPAs often deal with documents that contain a significant amount of textual data. This type of data may be found in email or deposition transcripts. It can also be found in large transaction data tables or bank statements.

When analyzing transaction data sets or bank statements in an investigation, forensic accountants may want to create subsets of records (i.e., groups that have something in common) within the data in order to gain insights about it. Accountants may group records based on their transaction description, which is often textual data. This analysis may reveal certain patterns in the transaction data that are deemed suspicious, noncompliant, or even fraudulent.

Analyzing textual data, such as transaction descriptions, is often challenging. This type of data may be filled with extraneous strings of characters, numbers, or words that make it hard to find the text that would be helpful in an investigative analysis. Fortunately, there is an advanced technological method that can help resolve issues that CPAs encounter when analyzing text: natural language processing.

Natural language processing (NLP) is a subfield of artificial intelligence that gives computers the ability to automatically read, understand, and derive meaning from human languages. Data analysts use machine learning technology to execute natural language processing on data. Rather than scanning data by eye for textual patterns, an analyst can apply machine learning algorithms that run through the text and identify patterns automatically. This makes NLP a useful tool for forensic accountants who need to segregate transaction data for a fraud investigation. [For more on NLP, see James Pustejovsky and Amber Stubbs, “Natural Language Annotation for Machine Learning, Ch. 1,” O’Reilly, https://bit.ly/3j5Sbca; Jayeeta Putatunda, “Natural Language Processing (NLP) in Fraud Analytics,” Indellient, https://bit.ly/3y0sher; and Jan Wijffels, “UDPipe Natural Language Processing–Basic Analytical Use Cases,” Comprehensive R Archive Network, https://bit.ly/3d7EXbe.]

This article uses a simple case study to show how NLP can benefit a forensic accountant who is analyzing transaction data in a fraud investigation. This case study demonstrates the use of R, an open-source programming language used for data analysis and statistical computing, as well as RStudio, an open-source desktop application that uses R programming for analysis (see https://www.rstudio.com/).

Case Study—Investigation into Purchase Card Transactions

A company named XYZ Co. is under investigation. The director of marketing allegedly misappropriated company funds for personal gain. XYZ Co. uses a corporate purchase card for some of its expenditures. The company has given card access to members of its purchasing department. However, it was reported that some department heads outside of purchasing have been given access to the purchase card as well, including the director of marketing.

Only five vendors have been approved for purchases using the purchase card: Aramark, CDW, Sodexo, Staples, and WB Mason. Company policy states that all purchases greater than $1,000 must be reviewed and approved by the director of purchasing.

Because of the allegations made against the director of marketing, our forensic accounting team has been hired to investigate the company’s purchase card data to see if any inappropriate transactions were made. We retrieved the corporate purchase card data from 1/1/2014 to 3/10/2021, which includes 220,584 transactions.

When skimming the transaction detail, we notice that the dataset is imperfect. Every transaction description includes extraneous text, either leading the vendor name or trailing it. For example, the very first purchase is from Staples (cell B2 in Screenshot 1), but the description text includes “1-01-2014” at the end. Furthermore, a purchase from CDW (cell B13) includes “1012014” as leading text.

Screenshot 1

The extraneous text before and after the vendor names creates unnecessary noise in the data set, making it difficult to perform effective analyses. For example, if we wanted to create a pivot table that summarizes the number of purchases by vendor, the leading and trailing texts would make it very difficult to present this information.

As shown in Screenshot 2, an Excel pivot table can tabulate the number of transactions by text description. The pivot table shows that “1122019 Wb Mason” appeared 16 times in the data set. The pivot table also shows that “1142016 Wb Mason” appeared 16 times. Because of leading and trailing texts, using a pivot table to summarize the number of transactions per vendor is ineffective.

Screenshot 2
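
The fragmentation that the pivot table shows can be reproduced in R with a few made-up rows. The sketch below is illustrative only: the sample descriptions are invented rather than taken from the XYZ Co. data, and `table()` stands in for the Excel pivot table.

```r
# Hypothetical sample rows imitating the noisy descriptions (not real data)
descriptions <- c(
  "Staples 1-01-2014",
  "1012014 Cdw",
  "1122019 Wb Mason",
  "1122019 Wb Mason",
  "1142016 Wb Mason"
)

# Count rows per exact description, as a pivot table would
counts <- table(descriptions)
print(counts)

# One vendor is split across several rows because the dates differ
wb_mason_rows <- names(counts)[grepl("Wb Mason", names(counts), fixed = TRUE)]
print(wb_mason_rows)
```

Because each distinct date produces its own row, the vendor total never appears in one place.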

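Fortunately, we already know that the company’s corporate card can be used for only five vendors. Since we know who those vendors are, we can use the COUNTIF function in Excel by inserting each vendor name into the formula, with wildcards to allow for the leading and trailing text. Assuming the descriptions sit in column B (as in Screenshot 1), a formula of the following form counts Aramark purchases:

=COUNTIF(B:B,"*Aramark*")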

After using this formula for the five vendors, it is clear that only 82% of the purchases on the corporate card were charged to these approved vendors (Screenshot 3). But which other vendors could XYZ Co. personnel have purchased from to make up the other 18%?

Screenshot 3
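
The same vendor-by-vendor count can be sketched in R. This is a minimal illustration under assumptions: the sample rows are invented, the helper name `count_vendor` is hypothetical, and matching ignores case because the data shows “Wb Mason” rather than “WB Mason.”

```r
# Approved vendors named in the company policy
approved <- c("Aramark", "CDW", "Sodexo", "Staples", "WB Mason")

# Hypothetical sample rows (not the real purchase card data)
descriptions <- c(
  "Staples 1-01-2014", "1012014 Cdw", "Uber Trip 1-01-2014",
  "1122019 Wb Mason", "Aramark 3-05-2015", "Sodexo 7-19-2016"
)

# Count descriptions that contain a vendor name anywhere in the text
count_vendor <- function(vendor, text) {
  sum(grepl(vendor, text, ignore.case = TRUE))
}

vendor_counts <- sapply(approved, count_vendor, text = descriptions)
print(vendor_counts)

# Share of all transactions charged to approved vendors
approved_share <- sum(vendor_counts) / length(descriptions)
print(round(approved_share, 2))

# Rows that match no approved vendor deserve a closer look
pattern <- paste(approved, collapse = "|")
unmatched <- descriptions[!grepl(pattern, descriptions, ignore.case = TRUE)]
print(unmatched)
```

Listing the unmatched rows only works when the approved vendors are known in advance; it does not by itself group the remaining descriptions by vendor.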

One way to identify other vendors in the purchase card data is to scroll through the dataset and find them manually. Early in the data, there is a transaction labeled “Uber Trip 1-01-2014”; this transaction seems to be noncompliant. The Sort & Filter feature in Excel can be used to find other “Uber Trip” transactions in the dataset. Furthermore, we can continue to manually scroll through the data to identify other unauthorized vendors.

With 220,584 lines of data, manually going through the data set to find unapproved vendors is highly inefficient. One solution is NLP through R programming, which will run algorithms through the text within the purchase card data set and identify patterns of words and phrases.

The first step is to import the corporate purchase card data into RStudio by entering the code below. (Note that for this code to work, the data must be saved as a CSV file in R’s working directory folder.)

Purchases <- read.csv("NLP Purchases.csv")


The next step is to install the UDPipe and TextRank R packages (named udpipe and textrank on the Comprehensive R Archive Network). UDPipe is an NLP R package that allows users to analyze text within a data set. In this example, UDPipe will run algorithms throughout all the text in the transaction description column of our data and organize it to enable further analysis. TextRank is a text-processing R package that can be used to find the most-used keywords and phrases within the purchase card dataset, and then summarize and visualize the data. These packages are installed and loaded in RStudio with code such as the following:

install.packages("udpipe")
install.packages("textrank")
library(udpipe)
library(textrank)

Once these R packages have been installed and loaded in RStudio, we can begin our NLP analysis of the purchase card data by downloading a UDPipe language model. The code that downloads the model must specify the language in which the data is written. UDPipe has ready-made models for 65 languages. In this case, the language is English:

ud_model <- udpipe_download_model(language = "english")


The next step is to perform language annotation on all the texts in the transaction description column of the dataset. Language annotation is the act of creating metadata in the dataset by marking up elements of the text that we would like to analyze. When language annotation is initiated, R runs algorithms through the texts in the “Description” column of the data, identifying and organizing all the words in a manner that allows users to perform further analysis. This procedure can be executed with the following code:

ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = Purchases$Description)
x <- as.data.frame(x)


Running this code creates a data table called “x,” which lays out all the textual data from the “Description” column of the purchase card data. This new data table can now be further analyzed to identify any patterns in the text.
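
To picture the shape of this table, the base R sketch below builds a simplified stand-in with one row per word. It does not call UDPipe: the real annotation output carries more columns (such as part-of-speech tags) and may split tokens differently. The sample rows are invented, the name `x_sketch` is hypothetical, and the `lemma` column simply copies each token as a simplifying assumption.

```r
# Hypothetical sample rows (not the real purchase card data)
descriptions <- c("Staples 1-01-2014", "1012014 Cdw", "Uber Trip 1-01-2014")

# Split each description on whitespace
tokens <- strsplit(descriptions, "\\s+")

# One row per token, keyed by the description it came from
x_sketch <- data.frame(
  doc_id = rep(paste0("doc", seq_along(tokens)), lengths(tokens)),
  token  = unlist(tokens),
  stringsAsFactors = FALSE
)

# Simplifying assumption: treat each token as its own lemma
x_sketch$lemma <- x_sketch$token
print(x_sketch)
```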

In this case study, we would like to identify the combination of words that has appeared the most frequently. To do this, we will use the TextRank R package and execute the “textrank_keywords” function. This function will run an algorithm that finds relevant keywords in a text where combinations of words follow each other. This is a useful analytical tool for this specific example because we are aware of some phrases in the data that signify noncompliance and would like to see if there are more of them. For example, phrases like “Uber Trip” are already known to be in the data, but it is possible that there are other relevant word combinations.

NLP via R programming can help find these phrases that are deep within extraneous strings of text. In this example, we decided to create an analysis that would summarize the most frequently used combination of words within the “Description” column of our data set. In our code, we decided that the maximum number of words in the phrases we would like to analyze was four. Furthermore, we specified that each of those words would be separated by spaces. This textual analysis can be executed through the following code:

Purchases.Analysis <- textrank_keywords(x$lemma, ngram_max = 4, sep = " ")
Purchases.Results <- subset(Purchases.Analysis$keywords, ngram > 1 & freq >= 2)
Purchases.Results
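
To show what a frequency table of word combinations looks like, the sketch below counts adjacent word pairs (bigrams) in base R. It is plain n-gram counting on invented rows, not the TextRank algorithm; the helper name `bigrams_in` is hypothetical, and the `freq >= 2` filter is borrowed from the code above.

```r
# Hypothetical sample rows (not the real purchase card data)
descriptions <- c(
  "1122019 Wb Mason", "1142016 Wb Mason", "Uber Trip 1-01-2014",
  "Uber Trip 2-14-2015", "Staples 1-01-2014"
)

# Build adjacent word pairs within a single description
bigrams_in <- function(text) {
  words <- strsplit(text, "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

all_bigrams <- unlist(lapply(descriptions, bigrams_in))

# Tabulate and keep pairs seen at least twice, most frequent first
freq <- sort(table(all_bigrams), decreasing = TRUE)
results <- freq[freq >= 2]
print(results)
```

Pairs that include a date occur only once each in this sample and drop out, while vendor phrases that repeat across rows survive the filter.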


After creating the summary of most frequently used phrases, this analysis can be exported as a CSV file, which opens as a data table in Excel, with the following code:

write.csv(Purchases.Results, "C:\\Users\\acastillo\\Documents\\PurchaseResults.csv", row.names = FALSE)


The resulting table appears in Screenshot 4.

Screenshot 4

“Wb Mason” is clearly the most frequently used phrase within the dataset, with 39,964 transactions. Before performing any NLP, a simple Excel formula had already shown that the company made 39,964 purchases from WB Mason (Screenshot 3). The NLP analysis found the same number of transactions for WB Mason.

The rest of the results from the text mining analysis reveal a lot about XYZ Co.’s purchasing activity. As seen in Screenshot 4, XYZ Co. entered into noncompliant transactions with Uber, Amazon Marketplace, Tao Nightclub, and Starbucks Coffee. Uber had already surfaced during the manual review; the purchases from the other three noncompliant vendors were revealed through NLP, and all four can now be the focus of further investigation.

The number of transactions with these unauthorized vendors equals 40,436; this represents the remaining number of transactions that were unidentified by the initial analysis. The number of transactions from the five approved vendors equals 180,148. Combining the number of transactions from the approved vendors with those from unauthorized vendors equals 220,584, which represents the total number of transactions made with XYZ Co.’s corporate purchase card from 1/1/2014 through 3/10/2021.
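
The reconciliation can be checked with a few lines of R. The figures are those reported in this case study; the variable names are illustrative.

```r
# Figures reported in the case study
approved_count     <- 180148
unauthorized_count <- 40436
total_transactions <- 220584

# The two groups should account for every transaction
stopifnot(approved_count + unauthorized_count == total_transactions)

# Shares of the total, rounded to whole percentages
approved_pct     <- round(100 * approved_count / total_transactions)
unauthorized_pct <- round(100 * unauthorized_count / total_transactions)
print(c(approved = approved_pct, unauthorized = unauthorized_pct))
```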

After going through the transactions that involved these unauthorized vendors, we see that none of them had a dollar amount greater than $1,000; this means that none of these transactions required review or approval by the director of purchasing at XYZ Co. This deficiency in the company’s internal controls allowed employees, such as the director of marketing, to use the corporate purchase card for personal gain, with petty thefts going undetected until now. Fortunately, with the use of NLP, the forensic team in our case study was able to identify the fraudulent transactions that were essentially hidden within a very large data set.
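
A check of this kind might be sketched in R as follows. The rows are invented, and the `Vendor` and `Amount` columns are assumptions about how a cleaned version of the data could be laid out; only the approved vendor list and the $1,000 threshold come from the case study.

```r
# Hypothetical sample rows; the Vendor and Amount columns are assumed
purchases <- data.frame(
  Vendor = c("Staples", "Uber", "Tao Nightclub", "CDW", "Starbucks Coffee"),
  Amount = c(250.00, 38.50, 940.00, 1800.00, 12.75),
  stringsAsFactors = FALSE
)

approved <- c("Aramark", "CDW", "Sodexo", "Staples", "WB Mason")
review_threshold <- 1000  # purchases above this need approval per policy

# Flag each row on the two tests used in the case study
purchases$Unauthorized <- !(purchases$Vendor %in% approved)
purchases$NeedsReview  <- purchases$Amount > review_threshold

# Unauthorized purchases that slip under the review threshold
slipped <- subset(purchases, Unauthorized & !NeedsReview)
print(slipped)
```

Rows that are unauthorized but below the threshold are the ones a review-by-amount control would never surface.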

Finding the Needle in the Haystack

Data gathered in forensic investigations can often be unstructured or unclean. Potentially useful textual data might be obfuscated by a lot of noise, such as leading or trailing strings of text. Fortunately, NLP through machine learning helps combat this problem. The simplified case study above demonstrated how forensic accountants can benefit from using this technology for investigative data analysis.

Furthermore, forensic accountants may use natural language processing during investigations that involve other documents that contain textual data, such as email exchanges, contracts, or deposition transcripts. Instead of manually reading countless documents, an investigator can use this technology to automatically determine which documents are the most relevant for further analysis. This process can significantly reduce the amount of time spent on tedious and exhaustive procedures.

As evidence gathered for forensic investigations trends toward becoming completely digital, accountants will benefit from having data analysis skills, such as machine learning and programming. Using natural language processing is indeed a great example of how implementing advancing technologies can enhance an investigative analysis. Because forensic accountants rely heavily on handling and analyzing a significant amount of data, it is important for them to stay up to date on the latest technologies that can further improve the quality and efficiency of their work.

Andre Castillo, CPA, CFE, is an advisory supervisor and data team member at Marks Paneth LLP.