Working with Data
One of the most important aspects of working with data is understanding the difference between good and bad data. Good data forms the basic building block for loan portfolio management and can often mean the difference between good and bad decisions. What does good data look like? It generally has four characteristics:
1. It is in a clean and readable format
2. It is organized and consistent across systems
3. It is centralized
4. It is robust
A familiar term that describes how you should think of data is GIGO: "garbage in, garbage out." It does not matter what type of analysis you do or how well you do it if the data you are putting into your analytics system is incorrect. Your data must be clean and readable in order to make decisions from its analysis.
1. Clean Format
So what does clean data mean? Mel Valenzuela, VP of Lending at Keesler Federal Credit Union, put it this way: "Imagine trying to put a 5,000 piece jigsaw puzzle together without having the cover, you would have no idea what it should look like."
To start with, your data needs labels: for the database itself, for the headers of each column, and for the name of each file. This might seem trivial, but you would be surprised how often we see data sets without labels, especially headers. Data without headers is the puzzle without a cover. If step 1 in having clean data is using labels, then step 2 is ensuring your labels are legible to a layperson.
For example, if a database table's name was BI120113, what would you think is in that table? You might be able to guess that "12" represents the month, "01" represents the day, and "13" represents the year. But what does BI stand for? Does it stand for business intelligence, bank investment, balloon interest, branch info, or something else? Contrast that with a table name of 2013_12_01_Branch_Info. You will probably have a much better idea of what data is in this table, and you don't have to be an analyst, programmer, or statistician to decipher it. Without a key or a legible title, the data can be misinterpreted, taking time away from performing analysis and making decisions.
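A well-structured name like the one above pays a second dividend: it can be read by software as well as by people. As a brief sketch (the table name and its date-plus-subject format are illustrative, not from any particular core system):

```python
from datetime import datetime

def parse_table_name(name: str):
    """Split a descriptive table name like '2013_12_01_Branch_Info'
    into its date and its human-readable subject."""
    parts = name.split("_")
    # The first three underscore-separated fields form the date.
    date = datetime.strptime("_".join(parts[:3]), "%Y_%m_%d").date()
    # Everything after the date describes the table's contents.
    subject = " ".join(parts[3:])
    return date, subject

date, subject = parse_table_name("2013_12_01_Branch_Info")
# date is 2013-12-01 and subject is "Branch Info"
```

A cryptic name like BI120113 would force every reader, human or machine, to rely on an external key that may not exist.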
2. Consistent Across Systems
Another common mistake we encounter is when different departments use different formats for the data they input, or the data gets changed from one system to another. To illustrate this problem, let's look at some of the challenges of analyzing vehicle data. Assume we are making a loan on a 2013 Jeep Grand Cherokee. In the database, each variable for an auto loan has a separate column (Auto_Year, Auto_Make, Auto_Model, etc.). If John is taking the loan application and enters "Gd" instead of "Grand", and Mary enters "Gr" instead of "Grand", it might seem insignificant, but multiply these discrepancies across 10-20 loan officers using different acronyms or abbreviations and it becomes a major problem. There might be 30 different ways the same vehicle is entered into the database. When a query is run on the database looking for "Grand Cherokee", it will leave out the vehicles entered as "Gd" or "Gr". Inputting messy data can quickly compound into a much more complicated problem.
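The scenario above can be sketched in a few lines of code. The records and model spellings below are hypothetical, but they show how an exact-match query silently drops inconsistently entered rows:

```python
# Hypothetical loan records with inconsistent Auto_Model entries.
loans = [
    {"Auto_Year": 2013, "Auto_Make": "Jeep", "Auto_Model": "Grand Cherokee"},
    {"Auto_Year": 2013, "Auto_Make": "Jeep", "Auto_Model": "Gd Cherokee"},
    {"Auto_Year": 2013, "Auto_Make": "Jeep", "Auto_Model": "Gr Cherokee"},
]

# An exact-match query finds only one of the three identical vehicles;
# the "Gd" and "Gr" entries are silently excluded from the results.
matches = [loan for loan in loans if loan["Auto_Model"] == "Grand Cherokee"]
print(len(matches))  # 1, even though all three loans are on the same model
```

Any report or model built on such a query would undercount Grand Cherokees by two-thirds without ever raising an error.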
One solution to messy data is to standardize what every person in the organization enters at the point of application. For example, instead of a free-text input field, consider a standard dropdown list from which the user selects a predetermined value. If an application is well organized and well designed, a dropdown will be just as easy to use as manual entry, if not easier. Of course, some situations will not fit the standard fields; in these cases you can provide an "Other" option or allow the data to be entered manually. If you do allow free text, consider popup warnings or alerts that ask the user to confirm no suitable dropdown option exists. This reminds the user of the importance of selecting from the predefined dropdown and of using manual entry only as a last resort.
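One way to enforce this at the point of entry is a simple validation layer. The allowed values and function below are a hypothetical sketch, not any particular loan origination system's API:

```python
# Hypothetical predefined values, as they might appear in a dropdown.
ALLOWED_MODELS = {"Grand Cherokee", "Cherokee", "Wrangler"}

def validate_model(entry: str) -> str:
    """Accept a predefined model as-is; reject anything else so the
    application can warn the user before falling back to free text."""
    entry = entry.strip()
    if entry in ALLOWED_MODELS:
        return entry
    # In a real application this would trigger a popup warning
    # rather than an exception.
    raise ValueError(
        f"'{entry}' is not a predefined model; "
        "please pick from the dropdown or choose 'Other'."
    )

print(validate_model("Grand Cherokee"))  # accepted unchanged
```

With this check in place, "Gd" or "Gr" never reaches the database unless the user explicitly confirms the "Other" path.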
3. Central Data Warehouse
Once you have standardized your data across systems, ensure that as much of it as possible flows into a central data repository. This makes sure everyone is looking at the same data. As Mr. Valenzuela put it, "without the data in a central repository you'll need an advanced degree in statistics just to know how to combine and analyze all of it."
It used to be common for a data warehouse to do nothing but store data, but today most good data systems also perform data standardization, accept multiple flexible fields, and produce analytics reports.
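A standardization step at the warehouse boundary can clean up legacy inconsistencies as records arrive from the source systems. The synonym table and field names below are hypothetical, continuing the vehicle example:

```python
# Hypothetical synonym table mapping legacy abbreviations to the
# canonical spelling used in the central repository.
MODEL_SYNONYMS = {"gd": "Grand", "gr": "Grand", "grand": "Grand"}

def standardize(record: dict) -> dict:
    """Rewrite the first word of Auto_Model to its canonical form
    before the record lands in the warehouse."""
    model = record["Auto_Model"].strip()
    first, _, rest = model.partition(" ")
    canonical = MODEL_SYNONYMS.get(first.lower(), first)
    record["Auto_Model"] = (canonical + " " + rest).strip()
    return record

incoming = [{"Auto_Model": "Gd Cherokee"}, {"Auto_Model": "Grand Cherokee"}]
warehouse = [standardize(r) for r in incoming]
# Both records now read "Grand Cherokee", so a single query finds both.
```

Centralizing this logic means the fix is written once, instead of being re-implemented in every report that touches vehicle data.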
4. Robust Data
More data available for analysis increases the likelihood that relationships among variables will provide meaningful business value. In predictive analytics, without an adequate sample size the analysis will not produce statistically significant results. For those of you unfamiliar with statistics, a result that is not statistically significant is unreliable, and the usual remedy is more data. However, not all data have equal value. The key data elements are too numerous to list here, but as you work to improve your analytical capabilities, give serious thought to the data you will capture, how often it will be refreshed, and how it will be used.
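To make "adequate sample size" concrete, the standard formula for estimating a proportion (for example, a delinquency rate) within a given margin of error can be computed directly. This is a generic statistical sketch, not a rule specific to any lending system:

```python
import math

def min_sample_size(margin_of_error: float,
                    confidence_z: float = 1.96,
                    p: float = 0.5) -> int:
    """Minimum sample size to estimate a proportion within the given
    margin of error. confidence_z = 1.96 corresponds to 95% confidence;
    p = 0.5 is the most conservative assumption about the true rate."""
    return math.ceil(confidence_z**2 * p * (1 - p) / margin_of_error**2)

print(min_sample_size(0.05))  # 385 loans for +/- 5% at 95% confidence
```

A portfolio segment with only a few dozen loans simply cannot support that precision, which is why thin data so often yields unreliable conclusions.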
Having clean data is the basic building block of making good decisions from analytics. Start making better use of your analysis today, and clean up your data.
For an in-depth guide on analytics, take a look at "A Lender's Guide to the Basics of Analytics".