As big data gets even bigger, software programs that have served CPAs well for years, such as Microsoft Excel, find themselves straining under the weight. Analytics visualization software is designed to deal with massive amounts of data and provide new functions, but as always, there is a learning curve for new users. The authors provide a step-by-step guide for some of the basic functions of one such program, Tableau.
* * *
In a recent CPA Journal article (Deniz Appelbaum, Alexander Kogan, and Miklos A. Vasarhelyi, “An Introduction to Data Analysis for Auditors and Accountants,” February 2017, http://bit.ly/2wVBzOb), Appelbaum et al. wrote: “The advent of data analytics and big data is not a fad; it is a real phenomenon driven by new technologies being adopted by many businesses. Accountants and auditors are currently very far behind the curve.” The authors agree, but would also add that it will likely not be difficult for accountants to catch up thanks to the advent of analytics visualization software, such as Tableau, Microsoft Power BI, and Qlik. Tschakert et al. specifically suggested that Tableau be used as part of accounting analytics education (Norbert Tschakert, Julia Kokina, Stephen Kozlowski, and Miklos Vasarhelyi, “How Business Schools Can Integrate Data Analytics into the Accounting Curriculum,” The CPA Journal, September 2017, http://bit.ly/2wHFaiu), and another CPA Journal article discussed the use of Tableau for forensic auditing and accounting (George Aldhizer, “Visual and Text Analytics: The Next Step in Forensic Auditing and Accounting,” June 2017, http://bit.ly/2rODuyT). Furthermore, visualization programs can process big data much more efficiently than Microsoft Excel.
Visualization software, with its user-friendly, intuitive interface, can be used to handle large amounts of diverse data, join data from multiple sources, filter and clean data, and provide informative reports. Furthermore, it can generate multiple visual representations of data, such as charts, tables, and maps. For example, a company can generate daily sales reports compiled from data imported from multiple subsidiaries’ different systems; it can then summarize by product, calculate gross margins, and report all this information summarized by geographical region. The above-mentioned interface allows these tasks to be done in seconds, as opposed to the hours they would take using traditional software such as Excel.
This article introduces data visualization software to CPAs by walking them through the steps to import a large data set; filter, compile, and merge the data; generate descriptive statistics; and create summarized information and visualizations. The authors chose Tableau for purposes of this article, but the same techniques can be applied in other analytics visualization software programs.
The data set used comes from the National Center for Charitable Statistics(NCCS) website (http://nccs.urban.org/), which maintains the IRS Business Master Files (BMF) using data derived from the IRS Forms 1023 and 1024 of the over 1.5 million U.S. not-for-profit organizations that file IRS Form 990. The authors chose this data set to demonstrate Tableau’s ability to handle big data and the types of data relevant to accountants. It is worth noting that Excel could only load 1,048,576 of 1,584,587 records, leaving 536,011 records unloaded. The data fields used are shown in Exhibit 1; the data itself can be downloaded from the NCCS (https://urbn.is/2IEu5EA).
Fields Used from Sample Data Set
To load data in Tableau, click on the “Text file” option under the “Connect to a File” menu. Select the relevant file (previously saved on a local drive), and the interface seen in Exhibit 2 will appear. Click on “Sheet 1,” and Tableau will automatically classify qualitative fields [e.g., National Taxonomy of Exempt Entities (NTEE) code, state, city] as “dimensions” and quantitative fields (e.g., income, asset, revenue) as “measures.” To check the completeness of the data set, use “Number of Records,” a measure automatically generated by Tableau. Double clicking on “Number of Records” displays a bar chart, while clicking on the “Show Me” toolbar at the right hand top side of the screen and choosing the “Table” icon displays the count of the number of records. (The “Show Me” toolbar is used to access many different chart options, and is a key feature of Tableau.) In this example, all 1,584,587 records were successfully imported.
After checking the completeness of the data set (in this example, all 1,584,587 records were successfully imported), users can filter the data according to certain attributes relevant to the information needed. In Tableau, filters can eliminate unnecessary, incomplete, or erroneous records, or restrict the data used in the analysis to records meeting certain specified criteria. In this example, two filters are added: one using the OUTNCCS field and the other using the RULEDATE field.
According to the data dictionary that accompanies the charitable statistics data, the OUTNCCS field is an out-of-scope flag, a binary field containing values of either “In” or “Out.” Double clicking on the OUTNCCS field from the list of dimensions will place the field in the “Rows” shelf above Sheet 1, with a table showing the number of records with the value of “In” and “Out” within the body of Sheet 1. In this example, there are 3,142 records that are considered out of scope because the entity either is not a 501(c)(3) or is a foreign entity, or for some other reason. To filter these out-of-scope entities, drag the OUTNCCS field from the list of dimensions to “Filters.” Tableau will display a Filter dialog box as illustrated in Exhibit 3. Selecting “In” and deselecting “Out,” then clicking “OK,” will restrict the remaining data to include only those 1,581,445 records considered in scope.
The RULEDATE field gives the year and month of the IRS ruling letter recognizing the entity’s exempt status and is coded as YYYYMM; a ruling letter date of August 2015, for example, would be coded as 201,508 (Tableau automatically applies commas to what it determines to be numerical data, but this is of no consequence to the analysis).
Double-clicking on RULEDATE under the list of measures will return a small table in the body of Sheet 1 that shows the number of records and, by default, the sum of the RULEDATE field. To change the descriptive statistic for this field, select the SUM (Ruledate) field in the “Measure Values” box, click on the drop-down arrow, and click the arrow next to Measure (Sum). Selecting “Maximum” will show the most recent year/month rule date in the data; likewise, selecting “Minimum” will show the earliest year/month ruling date in the data. This should indicate that the most recent ruling letter date in the data set is July 2016, and the earliest ruling letter date is zero, which suggests that the ruling date is missing or incorrect for these records. To filter out records with a missing or incorrect ruling date, drag RULEDATE from the list of measures to “Filters.” Selecting “Use All” in the resulting dialog box prompts a second dialog box to appear, as seen in Exhibit 4.
Note that “Range of Values” is highlighted. To eliminate all the records that have a value of zero, simply replace the 0 in the box at the minimum end of the range with a 1; or, to restrict the data to include only charities that received a ruling letter after, for example, January 1985, enter 198501 instead. Filtering out values of zero changes the minimum ruling date to 190001 (i.e., January 1900), and the remaining number of records in the data set is now 1,566,248. This filtered data set will be used in the next step.
When analyzing data, it is frequently necessary or desirable to join two or more data sets. This is difficult to efficiently or easily do when using spreadsheets, but in Tableau it is relatively easy. For example, the sample data set includes a field labeled “NTEE1,” which is a code applied to each record that classifies the charitable entity according to one of the 26 groups representing charitable industries. It does not, however, include a field that provides the name of the industry associated with the code. Producing that information requires joining the original data file to another table that contains the code and names. The authors created an Excel file for this purpose, available with the online version of this article.
In Tableau, tables can be joined by equating a field in one data set to a field in another as long as the two fields have matching values in common; the two data sets do not have to be the same size. Like a database management system, Tableau also allows different types of joins:
- Inner join: includes records where there is a match in both tables
- Left join: includes all records from the left table and all matches from the right table. Records from the left table without a corresponding match in the right table are displayed as having null values for the missing fields from the right table.
- Right join: includes all records from the right table and all matches from the left. Records from the right table without a corresponding match in the left table are displayed as null values for the missing fields from the left table.
- Full outer join: includes all records from both tables. Fields without matches are displayed as null values.
This example uses an inner join because the records being sought are ones where there are both an NTEE1 code and an associated charitable industry name. The two files can be joined by first clicking on the “Data Source” tab at the bottom of the screen, then clicking on the “Add” link at the top of the screen next to the “Connections” label. Select the relevant file type (here, Excel) under the “Add a Connection” list, then choose the desired file. Bring the NTEE sheet into the space above the Data Source listing of fields, and Tableau will display a “Join” window. Click “Add new join clause” under “Data Source” and select NTEE1 from the drop-down list, then click on “NTEE” under the NTEE file name.
To ensure that the data is joined correctly, click the “Update Now” button, then scroll to the end of the field listing. The number of records should not change, although the number of fields will have increased. This joined data set is used in the example below.
When using a spreadsheet to analyze data, users will often develop pivot tables to summarize and report information derived from the more detailed underlying data. While useful, Excel’s pivot tables are limited in several ways that Tableau is not. First, as noted above, Tableau easily handles much larger data sets. Second, Tableau provides a graphical user interface that is more intuitive than a spreadsheet’s columns and rows interface. Third, adding data from another sheet or table in Excel requires the user to first run the “vlookup” function, then develop the pivot table; in Tableau, this action is much more streamlined through the join function. Finally, Tableau has much greater visualization options for reporting summarized information extracted from the data than spreadsheets have.
In this example, the question to be answered is how many charities are classified under each NTEE Description group. Finding the answer is merely a matter of adding the fields of interest to the sheet. First, add the “NTEE Description”—which was brought in from the Excel file as described above—by finding and selecting the NTEE Description under the dimensions list. Because it is a dimension, Tableau will automatically add this field to the “Rows” shelf, and a list of all the NTEE groups will appear.
To determine the number of charities in each group, double-click the EIN (Employer Identification Number) field under the measures list. The sum of the EINs will appear by each group name by default; since this is not the number desired, change the measure being applied (following the steps for “Ruledate” described above) to “Count (Distinct).” Count (Distinct) is a built-in function that counts the number of unique EINs. The results will be the number of individual charitable entities that are grouped by NTEE.
Next, to find the aggregate income for each NTEE group, and which group has the largest income, repeat the same process with the “Income” field. The “SUM” measure will give the aggregate total, and clicking on the “Sort” button will sort the records in ascending or descending order, making finding the largest value a simple task. The results show that the NTEE group with the largest income is Health, totaling more than $1.3 trillion dollars, followed by Education, totaling $633 billion.
Finally, it is time to visualize the data. For example, to show the income of each NTEE group in a horizontal bar graph (instead of a table), click on the horizontal bar graph button in the “Show Me” menu; Tableau will convert the table into a horizontal bar graph like the one in Exhibit 5. If the Dimension variable in the analysis is a geographical variable, Tableau can also create a geographical map. For example, to generate a map displaying the aggregate income of nonprofit organizations by city within the United States, remove all the existing variables from the worksheet, or open a new worksheet. Then double-click “City” from the list of dimensions; this will open a map of the United States. Finally, double-click “Income” from the list of measures. The SUM (Income) will load by default, displaying a map similar to the one in Exhibit 6.
Catching Up with the 21st Century
With analytics visualization software such as Tableau, CPAs can quickly learn the foundational techniques to analyze large amounts of diverse data. Programs like these are not only capable of processing big data; they are also user friendly and intuitive. CPAs should find the basic tools in this article useful for handling big data. Beyond these foundational techniques, Tableau is also capable of advanced analytics and charting functions, which are beyond the scope of the current article. Tableau also offers training videos on its website at https://www.tableau.com/learn/training.