The present data warehouse concern is architectural in nature: should the data movement from your source system(s) to your database be kicked off with a push or a pull? To put it another way, do you want to pull the data yourself from the source system, or do you want it delivered to you? Every architect has to work this question out before designing their structure. The significance of the question goes far beyond how lazy you are when it comes to getting the data.
There are a few questions you need to answer first, and none of them is really about how lazy you are. The oversimplified view would be, “When you need a Windows update for your PC, would you rather have it run nightly, or kick it off each time yourself?” I call this oversimplified because it only deals with the “who is in control” aspect, and really, you are in control both ways. The better question is, “Who is better to inconvenience?” But as I suggested before, other factors have a greater influence.
Suppose you have a single real-time system that is always up and always very busy, where you know that new transactions will keep arriving at the source during your extraction, so that the extraction itself would cause latency issues on the source system. That would be ideally suited to a push, not a pull. It is easy to manage a push if you have just one source system, and a pull would really slow your source system down, especially if it hit at a busy time. The system could even build the source data file(s) for extraction piecemeal during lighter usage periods. A web store would be a good candidate source for this type of extraction. It works best with an Enterprise Service Bus (ESB) communicating its data across a service-oriented architecture (SOA) or within an enterprise application integration (EAI) framework. One can blame the modern world for the rise of business systems that can have no downtime.
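To make the push idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `WebStore`, `WarehouseInbox`, `record_sale`, and `load_pending` are hypothetical names, and an in-process queue stands in for whatever ESB or messaging layer would actually carry the data. The point is only that the source hands over each record as the transaction happens, and the warehouse loads whatever has arrived on its own schedule.

```python
import queue


class WarehouseInbox:
    """Hypothetical landing area that the source system pushes into."""

    def __init__(self):
        self._pending = queue.Queue()  # stand-in for an ESB / message bus
        self.rows = []                 # stand-in for the warehouse table

    def receive(self, record):
        # Called by the source. It only enqueues, so the busy source
        # system is not held up by the warehouse's loading work.
        self._pending.put(record)

    def load_pending(self):
        # Drain everything delivered so far; return how many rows were loaded.
        loaded = 0
        while True:
            try:
                record = self._pending.get_nowait()
            except queue.Empty:
                break
            self.rows.append(record)
            loaded += 1
        return loaded


class WebStore:
    """Hypothetical always-on source system."""

    def __init__(self, inbox):
        self._inbox = inbox

    def record_sale(self, order_id, amount):
        sale = {"order_id": order_id, "amount": amount}
        # Push: the source delivers the record at transaction time,
        # so nobody has to query the source later.
        self._inbox.receive(sale)
        return sale


inbox = WarehouseInbox()
store = WebStore(inbox)
store.record_sale(1001, 19.99)
store.record_sale(1002, 5.00)
print(inbox.load_pending())
```

In this arrangement the warehouse never queries the web store at all, which is why a busy moment on the store side does not turn into an extraction problem.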
Now suppose you have a gigantic system like a retail store’s POS, and that store shuts down at night, during which time there is no chance of a sale being processed. The data payload is very
large, and you do not have to worry about latency issues with the source system.
This would be a lovely time to do a pull. It would be an even better time to do a pull if multiple stores in similar time zones are involved, so that their data could all be processed in parallel instead of requiring each store to kick off its own process. The E part of ETL (Extract) is a powerful workhorse that must lock out anyone else from accessing the source data and keep solidly at its job until done. Size does not affect decisions about how and when it works; it kicks off at the scheduled time and runs to completion. This is the old structure: brick-and-mortar businesses with regular business hours, and regular hours of not being open.
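As a rough sketch of the parallel pull described here, the Python below extracts several closed stores at once from a single central process. The store names, the `STORE_SALES` data, and the functions `extract_store` and `nightly_pull` are all hypothetical; a real job would read from each store's POS database and would be started by a scheduler at closing time.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical day's sales sitting in each closed store's POS system.
STORE_SALES = {
    "store_a": [{"sku": "X1", "qty": 2}, {"sku": "Y2", "qty": 1}],
    "store_b": [{"sku": "X1", "qty": 5}],
    "store_c": [],
}


def extract_store(store_id):
    # Stand-in for a full read of one store's data. The store is closed,
    # so the extract can take as long as it needs.
    return store_id, list(STORE_SALES[store_id])


def nightly_pull(store_ids, max_workers=4):
    # One central process pulls every store in parallel, rather than
    # each store kicking off its own job.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(extract_store, store_ids))


extracted = nightly_pull(["store_a", "store_b", "store_c"])
print(sum(len(rows) for rows in extracted.values()))
```

Threads are used only because waiting on remote stores is mostly I/O; the same shape would work with separate processes or a job scheduler.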
The real driver of the decision, as I said earlier, is “Who can I inconvenience?” If the source data location can be inconvenienced, set it and forget it: let ETL kick off the operation when no one is around to inconvenience. But if someone who would be inconvenienced is always around, you need a solution other than ETL to avoid latency issues. So just remember the following motto: Data extraction can lead to distraction, but a real-time endeavour wants data more clever.
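Boiled down to code, the rule of thumb might look like the following Python sketch. The `SourceSystem` type and its `has_quiet_window` flag are hypothetical, and a real decision would weigh more than one flag (the number of source systems and the payload size, for instance).

```python
from dataclasses import dataclass


@dataclass
class SourceSystem:
    name: str
    has_quiet_window: bool  # hypothetical: is there a time when nobody is using it?


def choose_extraction(source):
    # If the source can be inconvenienced while no one is around,
    # a scheduled pull is fine; otherwise have the data pushed.
    if source.has_quiet_window:
        return "pull"
    return "push"


print(choose_extraction(SourceSystem("retail_pos", has_quiet_window=True)))
print(choose_extraction(SourceSystem("web_store", has_quiet_window=False)))
```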
About the author: an expert marketer, author and consultant who specializes in software testing tools and resources. He regularly writes and contributes articles for Advanton Inc., a leading Customer Engagement and Enterprise Solutions Company headquartered in the US, with its global delivery centre in India and over 50 notable clients.