Big data processing systems like Hive or MapReduce work in multiple phases, where the output of one phase is written to disk and subsequently read by a later phase. When running in a cloud like AWS, this can lead to two problems:
1. If using on-demand or reserved instances, you may have to hold on to an instance because it holds some intermediate output, even if it is no longer running any computation. You end up paying for (expensive) compute when all you need is (cheap) storage.
2. If using spot instances, you may lose the intermediate output before its consumer has read it. The earlier task that wrote the output must then be run again, slowing down the job as a whole.
Offloading intermediate output to shared, (relatively) cheap storage like Lustre solves both problems without adding significantly to the cost or running time of a big data job; in most cases it saves on both.
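As a concrete sketch of what this looks like in practice: frameworks that spill intermediate data to local disk can usually be pointed at a shared mount instead via configuration. Assuming a Lustre filesystem (e.g. Amazon FSx for Lustre) is mounted at `/mnt/fsx` (a hypothetical mount point chosen here for illustration), the relevant settings might resemble:

```
# spark-defaults.conf — direct Spark's shuffle and spill files
# to the shared Lustre mount instead of instance-local disk
spark.local.dir    /mnt/fsx/spark-scratch
```

```
<!-- mapred-site.xml — direct MapReduce intermediate map output
     to the shared Lustre mount similarly -->
<property>
  <name>mapreduce.cluster.local.dir</name>
  <value>/mnt/fsx/mr-scratch</value>
</property>
```

With intermediate data on shared storage, reclaiming a spot instance loses only its in-flight tasks, not the completed phase output it had written, and idle instances can be released without stranding data.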