What Is Big Data?
As the name implies, “big data” refers to data sets that are too large or complex to analyze using traditional methods. Big data is typically defined by three features: 1) volume (the sheer size of the data); 2) velocity (the speed at which it is generated); and 3) variety (the various media and formats that data assumes).
The phrase “big data” has entered the public vernacular over the last several years because of advances in technology that enable institutions (corporations, government agencies) to manipulate and utilize big data. Sophisticated algorithms and artificial intelligence (AI) systems are able to ingest massive amounts of data from diverse sources and use it to identify patterns or predict behaviors and outcomes in ways that were unimaginable until recently. Companies use big data to personalize advertising. Political campaigns use it to microtarget voters. Healthcare providers use it to improve patient outcomes. And the IRS has entered the game in a major way, harnessing big data to better predict and identify tax noncompliance as well as more effectively prosecute willful noncompliance.
Historical Context
Data analytics is not a new concept to the IRS. In the 1960s, the IRS initiated the Taxpayer Compliance Measurement Program (TCMP). Under the TCMP, the IRS randomly selected returns for detailed, line-by-line audits, the results of which were fed into computers that analyzed the data for the purpose of aiding the audit selection process. This laid the groundwork for the automated Discriminant Function Analysis (DIF) system, in which every tax return was given a DIF score that was intended to reflect the probability that it included underreported tax.
The TCMP and DIF score were effective in increasing audit-selection accuracy, as evidenced by a dramatic decrease in the number of IRS “no-change” letters in the decades since their implementation. The DIF score gradually became less accurate in the 1990s as the IRS phased out the TCMP. In 2002, the TCMP was replaced by the IRS’s National Research Program, which reflected an attempt to collect similar data using more efficient means. In 2011, the IRS further escalated its reliance on data analytics by forming the Office of Compliance Analytics (OCA). In 2016, the OCA was incorporated into the division of Research, Applied Analytics and Statistics (RAAS), which today serves as the IRS’s centralized research and analytics organization.
The Future of Big Data in Tax Enforcement
Advances in data analytics have come at a time when the IRS has been hobbled by severe budget cuts, leading to a 31% decline in full-time employees working in enforcement roles between 2010 and 2021 (Internal Revenue Service Congressional Justification & Annual Performance Report and Plan, Fiscal Year 2022, Publication 4450, Rev. 5-2021, https://www.irs.gov/pub/irs-pdf/p4450.pdf). Over that same period, the examination rate for individual returns dropped by 48%. The number of IRS Special Agents has also decreased by more than 25% since 2012 (Kristin Broughton, “Tax Crime Enforcement Unit Relying More on Analytics to Spot Crime,” Wall Street Journal, June 12, 2019, https://on.wsj.com/3mvVpHB). The IRS estimates that these cuts have reduced total enforcement revenue collected (TERC) by approximately $70 billion (IRS, 2021).
The IRS has made no secret of its intent to rely heavily on big data and data analytics to help it plug these gaps. In 2019, the IRS’s Large Business and International Division (LB&I) announced the Large Corporate Compliance (LCC) program, which employs data analytics to select corporate taxpayers for audit (IR-2019-95: “LB&I Announces Large Corporate Compliance Program,” May 16, 2019, https://bit.ly/3kp7vj4). On the criminal enforcement side, 2019 saw the incorporation of the National Coordinated Investigations Unit (NCIU) as an official section of the IRS’s Criminal Investigation Division (CID). The NCIU relies heavily on data analytics to drive case selection for the CID (Michael Cohn, “IRS Criminal Investigation Leveraging More Data Analytics,” Accounting Today, Nov. 14, 2018, https://bit.ly/3B7PH2P). Reliance on data analytics to inform decision making is also listed among the IRS’s six key strategic goals in its strategic plan for the fiscal years 2018 through 2022 (Publication 3744, Rev. 4-2018, https://bit.ly/3gvqxDl). And in his June 8, 2021, written testimony to the Senate Finance Committee, IRS Commissioner Charles Rettig touted the use of data analytics as a key part of the IRS’s strategy for shrinking the tax gap (https://bit.ly/3ko5YKe).
The potential benefits of big data to the IRS are vast. Notwithstanding measures such as the DIF score, IRS enforcement has historically been backward looking, with the percentage of audits that result in no-change letters reaching as high as 40% (Kimberly A. Houser and Debra Sanders, “The Use of Big Data Analytics By The IRS: Efficient Solutions or the End of Privacy as We Know It?” Vanderbilt J. Ent. & Tech. L. vol. 19, pp. 817, 822, 2017, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2943002).
Big data enables IRS enforcement to be forward looking, using sophisticated algorithms and AI to analyze the data at its disposal. The IRS contends that the use of such predictive analytics will result in a fairer and more cost-effective audit process by enabling the IRS to focus its limited resources on returns with the greatest probability of tax noncompliance (IRS, 2021; Rettig, 2021).
For its part, CID is harnessing big data in similar ways to help identify tax evasion and an array of other criminal activity, such as identity theft, money laundering, Bank Secrecy Act violations, and cybercrimes. In a modern take on the DIF score, CID is utilizing data at its disposal to score returns based on their probability of containing fraudulent entries. In 2020, the IRS created the Fraud Enforcement Office under the National Fraud Program. The Fraud Enforcement Office will use data analytics, among other tools, to increase the number of fraud referrals to CID (IR-2020-49, “IRS Criminal Investigation Veteran Selected as New Fraud Enforcement Director,” May 16, 2019, https://bit.ly/3zhnENQ). Don Fort, the former chief of CID, put it this way: “The future for CI must involve leveraging the vast amount of data we have to help drive case selection and make us more efficient in the critical work that we do. Data analytics is a powerful tool for identifying areas of tax noncompliance” (Sony Kassam, “IRS Catches $10 Billion in Tax Fraud in 2018,” Bloomberg Tax, Nov. 14, 2018, https://bit.ly/3sQX6Rd).
Balancing Privacy and Transparency
In concept, there appears to be little for taxpayers to dislike about the use of big data by the IRS. A more efficient IRS that is better equipped to identify and focus its resources on actual tax noncompliance has the potential to close the tax gap at a reduced cost. The largest concerns, however, lie in the lack of transparency about “how the sausage is made.”
To obtain the most comprehensive data sets for its analytical models, the IRS pulls data from multiple sources. One of the biggest sources of data remains the IRS’s own records, as almost 250 million tax returns were filed last year (IRS Data Book, 2020, Publication 55-B, Rev. 6-2021, https://bit.ly/3jesVQF). The IRS also has access to vast data shared by sister agencies such as FinCEN (the Financial Crimes Enforcement Network) through, among other sources, Foreign Bank Account Reports (FBAR), Suspicious Activity Reports (SAR), and Currency Transaction Reports (CTR). On top of this is a wealth of data that the IRS has accumulated as part of the Offshore Voluntary Disclosure Program and Foreign Bank Program. Finally, there is data collected from third-party information reporting.
More controversial than these sources is the IRS’s reliance on data in the public domain, such as data drawn from social media platforms (e.g., Facebook, Instagram), GPS, and cell phones. In 2018, the IRS entered into a $99 million contract with Palantir Technologies, a company whose sole business model is to harvest and sell often-sensitive personal information from publicly available sources (Siri Bulusu, “Palantir Deal May Make IRS Big Brother-ish While Chasing Cheats,” Bloomberg Tax, Nov. 15, 2018, https://bit.ly/38cZW9Q).
Although there is nothing illegal about mining taxpayer personal information available in the public domain, such practices can be distasteful to many taxpayers (Bulusu, 2018). Such criticisms are heightened given the IRS’s lack of transparency surrounding precisely how it uses big data to select returns for audit and taxpayers for investigation. The IRS has legitimate reasons to protect its “secret sauce,” including to prevent the gaming of its algorithms. To date, it has been largely successful in shielding such information from disclosure under Freedom of Information Act requests (e.g., Huene v. U.S. Dep’t of the Treasury, 11-cv-2109, 2012 WL 3730635, E.D. Cal. Aug. 24, 2012). Without the checks and balances that only public scrutiny can provide, however, the IRS is vulnerable to accusations that its data collection prerogatives and proprietary systems run afoul of privacy laws or contain inherent biases (e.g., Susan Ariel Aaronson, “Big Data, Big Problems as Privacy and Bias Concerns Persist,” Barron’s, Jan. 8, 2021, https://bit.ly/3mw1iF2; Byron Tau, “Treasury Watchdog Warns of Government’s Use of Cellphone Data Without Warrants,” Wall Street Journal, Feb. 22, 2021, https://on.wsj.com/3gvpjbo). These are real concerns that the IRS must address as it relies more on big data analytics.
Here to Stay
The era of big data has the potential to usher in an unprecedented transformation in the way the IRS operates. By and large, the IRS appears to welcome these changes, which could yield benefits to law enforcement and taxpayers alike. There will be growing pains as the IRS tries to balance its new tools with concerns over privacy and transparency, but big data at the IRS appears to be here to stay.