Data Warehouse Push vs Pull: How Should You Initially Move Your Data?

The present data warehouse concern is architectural in nature: should the data movement from your source system(s) to your database be kicked off with a push or a pull? To put it another way, do you want to pull the data yourself from the source system, or do you want it delivered to you? Every data warehouse team has to work through this question before designing its structure. The significance of the question goes far beyond how lazy you are when it comes to getting the data.

There are a few questions you need to answer first, and none of them is really about how lazy you are. The over-simple view would be, “When you need a Windows update for your PC, would you rather have it run nightly, or kick it off each time yourself?” I call this over-simple because it only deals with the “Who is in control?” aspect, and really, you are in control both ways. The better question is, “Who is better to inconvenience?” But as I suggested before, there are other factors which have a greater influence.

Suppose you have a single real-time system that is always up and always very busy, where you know that new transactions will keep occurring during your extraction and that the extraction itself would cause latency issues on the source system. That would be ideally suited to a push, not a pull. A push is easy to manage if you have just one source system, and a pull would really slow your source system down, especially if it hit at a busy time.

Maybe the system could even build the source data file(s) for extraction piecemeal during lighter usage periods. A web store would be a good candidate source for this type of extraction.

A push works best with an Enterprise Service Bus (ESB) communicating its data across a service-oriented architecture (SOA) or within an enterprise application integration (EAI) framework. One can blame the modern world for the rise of business systems that can afford no downtime.
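To make the push idea concrete, here is a minimal sketch of a source that stages its own transactions and delivers them only during lighter usage periods. Everything in it is an assumption made for illustration: the `PushingSource` class, the `deliver` callback standing in for the warehouse, and the `busy_threshold` load limit are hypothetical names, not part of any particular ESB or ETL product.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PushingSource:
    """Hypothetical source system that decides for itself when to deliver data."""

    deliver: Callable[[list], None]  # warehouse-side receiver (assumed interface)
    busy_threshold: int = 100        # assumed load limit, e.g. requests per second
    pending: list = field(default_factory=list)

    def record(self, txn: dict) -> None:
        # Staging a transaction is cheap; nothing is sent to the warehouse yet.
        self.pending.append(txn)

    def maybe_push(self, current_load: int) -> bool:
        # Deliver only while the source is lightly loaded and has staged rows.
        if current_load >= self.busy_threshold or not self.pending:
            return False
        batch, self.pending = self.pending, []
        self.deliver(batch)
        return True


warehouse = []  # stand-in for the warehouse staging area
source = PushingSource(deliver=warehouse.extend)
source.record({"order_id": 1, "total": 19.99})
busy = source.maybe_push(current_load=250)   # too busy: nothing is sent
quiet = source.maybe_push(current_load=10)   # quiet period: batch is delivered
```

The point of the design is that the source, not the warehouse, picks the moment, so the extraction never lands on top of a busy period.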

Now suppose you have a gigantic system like a retail store’s POS, and that store shuts down at night, during which time there is no chance of a sale being processed. The data payload is very large, and you do not have to worry about latency issues with the source system.
This would be a lovely time to do a pull. It would be an even better time to do a pull if multiple stores in similar time zones are involved, where their data could all be processed in parallel instead of requiring each store to kick off its own process.
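The multi-store pull described above might be sketched as follows. This is only an illustration under stated assumptions: `STORE_SALES` is a made-up in-memory stand-in for each store's POS data, and `extract_store` and `pull_all` are hypothetical helper names. A real extraction would query each store's database instead of a dictionary.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for each store's POS data after closing time.
STORE_SALES = {
    "store_a": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    "store_b": [{"sku": "A1", "qty": 5}],
    "store_c": [],
}


def extract_store(store_id: str) -> list:
    # A real pull would query the store's POS database here.
    rows = STORE_SALES[store_id]
    # Tag each row with the store it came from.
    return [{"store": store_id, **row} for row in rows]


def pull_all(store_ids: list, max_workers: int = 4) -> list:
    # The warehouse side starts every extraction; no store kicks off its own.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_store = pool.map(extract_store, store_ids)  # results keep input order
        return [row for rows in per_store for row in rows]


nightly_rows = pull_all(["store_a", "store_b", "store_c"])
```

Because the warehouse owns the schedule, adding another store in the same time zone only means adding one more identifier to the list.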

This E part of ETL (Extract) is a powerful workhorse that, in this traditional model, locks everyone else out of the source data and steadily continues its job until done. Size does not affect decisions about how and when it works; it kicks off at the scheduled time and runs until it has reached completion. This is the old structure: brick and mortar businesses with regular business hours, and regular hours of not being open. The real driver of the decision, as I said much earlier, is “Who can I inconvenience?”

If the source data location can be inconvenienced, set it and forget it: let ETL kick off the operation when no one is around to inconvenience. But if someone is always around who can be inconvenienced, you need a solution other than scheduled ETL to avoid latency issues. So just remember the following motto: Data extraction can lead to distraction, but a real-time endeavour wants data more clever.

Author Bio: Scott Andery is an expert marketer, author and consultant who specializes in software testing tools and resources. He regularly contributes articles to Advanton Inc., a leading Customer Engagement, Cloud Solutions, and Free Online Presence Solutions company for small businesses.
