What is a data warehouse anyway?
Lets say that you are a car manufacturing company. You have been making cars for more than 50 years and you have managed to implement IT in all your departments. You have a sales management system that records sales throughout all your network dealers. This sales system not only records the sales history but also records information about potential customers. Then you have the accounting system that generates invoices for the sales made and also keeps track of the company income and expenditure. You may also have a financial system that helps you make your budgetary decisions and provides you a financial road map. Your company would also have a customer relationship management system and a campaign management system. Of course, you would have a state of the art operations systems with advanced inventory management system. There may be various other systems but lets concentrate on what we need out of this system. Different people in the organization may need different kind of information out of this system. The dealer would need the daily or monthly sales figures. She may also need a report of potential customers. At a regional level the manager might need information at aggregate levels. i.e. sales by city by month or a list of potential customers that have a certain fixed budget and fall under a particular age group. As the company matures, it may need deeper analytics such as optimized inventory distribution for its network etc. A data warehouse helps the company in storing data in such a way that such information retrieval becomes faster and simpler. It allows the company to store a large amount of historical system without in any way affecting its transaction systems. A well designed data warehouse system with advanced Business Intelligence tool can do wonders for a company.
How is the database different from a data warehouse?
A database is designed to store the transactional or operational data whereas a data warehouse is designed to store historical information. For example, a database would be used by a transaction system that generates invoice for the car sales. But imagine how fast the data grows in this system. If we add up transactions for a month for all stores across the country and we don’t remove this data then the database would become extremely slow. The database is designed to give quick response time for transactions. A data warehouse on the other hand is designed to store hundreds of gigabytes of data and contains query optimizers that return quick results. They may not be good at supporting transactions.
So how does the database and data warehouse coexist? How does the data flow?
A company generally has multiple transaction systems. The systems may be from different vendors and may have different underlying structures. For example the accounting system may be completely different from a procurement system. They may use different vendors and completely different technology. However, we eventually want to generate reports that obtain data from both the systems. We also want to make sure that a ‘single version of truth’ is maintained in the organization. There have been many cases where two departments cannot come to a conclusion since they have different numbers for the same entity. A data warehouse therefore contains data from multiple source systems. A process known as ETL (Extract, Transform and Load) is used to extract data from the various systems in an organization and feed the data to a single data warehouse system. The data may undergo cleaning, transformation, deduplication etc before it is loaded into the data warehouse.
The diagram below gives a simplified view of the process.