This is part 8 of a series of blogs on setting up a governance program for the effective management of investment product data - in this blog I will explain why building and maintaining a data dictionary is probably one of the most important factors in the success of your program.
Like many business buzzwords, data dictionary means different things to different people. The common thread is that the dictionary is an inventory of the data items being consumed or produced within a specific defined business unit or process.
Why do we create them? Again, there are many reasons – but the most prevalent one is to bring a common understanding to play within a specific environment such that everyone is speaking the same language when it comes to data. Data management projects live and die by the quality of their data dictionaries because even within small teams you can have wildly different nomenclatures in existence for what seems at face value very simple, easily understood data items.
Before I get onto what makes up a data dictionary I would like to clear up a couple of misnomers I often come across:
- A data dictionary is not a document. Documents are two-dimensional, while data dictionaries work across many planes. They are best represented in a relational database, or if needs must, a set of interrelated Excel worksheets.
- A data dictionary is not a project resource – yes, every data management project needs a dictionary, but as a resource it has a life outside of the project. You do not create a dictionary to serve the needs of a project only – the dictionary is also required within the business-as-usual activities that come into play post a project delivery i.e. it is a resource that requires and demands constant attention, updating and refinement.
So what is commonly found in a data dictionary? As I mentioned earlier it is a centralised inventory of information on data items/fields that describes in detail the data items semantics, how the data relates to other data, where the data is consumed, where it is processed and from where it is sourced. The dictionary should also describe the correct format and syntax for each field.
So for each entry in the dictionary I would expect to find the following
- A specific unique name for the item
- A clear definition of the data items meaning, including references to other common/aka names for the item
- A list of all “consumer” entities and processes that consume/use this data item
- A list of all the “suppliers” or source systems that produce this data and deliver to processes downstream
- Specific mention of any master rules for choosing correct source system for specific situations
- A list of all business rules applied to the data item as part of any data quality management process that touches the data
- Reference to stewards or stewardship teams that are responsible for the management of the data
- Reference to subject matter expert(s) who can deal with questions about the data item
- Detailed syntax specification for the data item – including type, structure, format and example values
- Good dictionaries allow users enter and update specific notes and references à la a wiki
If you have constructed your data dictionary using a database then you can easily provide very helpful alternate views of the dictionary for example:
- Show all data items consumed by process X
- Show all business rules
- Show the data items touched by Rule Y
- For data item Z show all sources
- For data item R show all consumers
- and so on…
More advanced dictionary implementation have an integrated audit trail with the live system that can instantly show as-of transactional views i.e. the dictionary and the real-world systems it relates to are integrated.
So how does one build the dictionary? In MoneyMate we build them out using a SIPOC process in reverse [COPIS]
- So we start off identifying all of the consumers of information
- From here working out what outputs are consumed by each consumer#
- From here working out which processes deliver the outputs
- From here working out which inputs are used in each of the processes
- Before finally identifying the source/supplier systems producing the inputs
A critical element of the COPIS/SIPOC analysis is identifying where certain data items have multiple source systems – in these cases we need to carefully specify the master data rules that indicate which source is correct for the variety of situations that dictate different usage of the data.
Examples of this problem would be:
- You could have multiple back-office providers which means your daily NAV could be flowing from multiple parties/systems
- You could also have different legal structures in play that have different statements of record for different data types e.g. for holdings you maybe using the accounting book of record for your mutual funds but for managed accounts you are taking data from your investment records.
- You could have standard source of performance for all in-house funds, but for sub-advised you take data from the sub-advisor
Clearly the dictionary needs to capture all of this information in a well structured manner and allow for specific notation of the master rules for each item which has more than one source.
So hopefully you have a better understanding of what a data dictionary is, what it contains and why it is needed. If you have anything to add yourself – send me a PM or comment below.
Next up in the series is a review of the role technology should play….