
Recursive MapReduce job

Stefan Elsen
Posts: 1
Hello everyone,

I seem to be stuck with a possibly unusual problem. Then again, I am relatively new to the whole Hadoop system, so there is a good chance my problem is based on insufficient knowledge.
Anyhow, here goes:
I have a relatively large set of data (1+ PB) distributed across many files (millions or more). The reason there are so many files is that I intend to implement a kind of versioning system with a high frequency of revisions but little actual change between them (a few MB each, at most). Because subsequent versions are so similar, rewriting the entire version each time seems like an extreme waste of space to me. To manage this large number of files, I intend to use a tree structure that groups blocks of files into larger blocks, thus reducing the number of files that must be recreated each time a new version is created. As a consequence, however, there is now a tree of files referencing files, referencing more files, and so forth.
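
To make the structure concrete, here is a minimal sketch in plain Python (not Hadoop code). It assumes an in-memory dict stands in for the file system, that an index file is just a tuple of the file names it references, and that a data file is a bytes value. The names `put` and `new_version` are hypothetical and only for illustration.

```python
import itertools
from typing import Dict, Sequence, Tuple, Union

Blob = bytes                 # a data file
Index = Tuple[str, ...]      # an index file: names of the files it references
Store = Dict[str, Union[Blob, Index]]

_names = itertools.count()   # generates unique, hypothetical file names


def put(store: Store, content: Union[Blob, Index]) -> str:
    """Write a new file into the store and return its name."""
    name = f"f{next(_names)}"
    store[name] = content
    return name


def new_version(store: Store, root: str, path: Sequence[int], new_data: Blob) -> str:
    """Create a new version in which one data file is replaced.

    `path` lists the child positions to follow from `root` down to the
    data file. Only the index files along that path are recreated; all
    other files are shared with the previous version.
    Returns the name of the new root.
    """
    if not path:
        return put(store, new_data)
    index = store[root]
    i = path[0]
    new_child = new_version(store, index[i], path[1:], new_data)
    return put(store, index[:i] + (new_child,) + index[i + 1:])
```

With two levels of index files, changing one data file creates three new files (the data file plus the two index files above it), while every untouched subtree is referenced by both versions.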

An arbitrary task over these trees would have to recursively parse the structure. The ideal solution would be to move the task code to the nodes currently storing the referenced file(s), which, in turn, spawn more tasks recursively on yet other nodes.
The only way I have found so far (with the limited knowledge I have) is to pull the result of a MapReduce job back into the driver and spawn more jobs from there, using the extracted file names. And, unless I missed something, the driver will not migrate to the data requested, which means there is a constant data feed back to the node running the driver. That, in turn, renders the whole idea pointless for some of the more expensive tasks.
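
The driver-side loop I mean looks roughly like the following sketch. It is plain Python, not Hadoop code: `run_job` is a hypothetical stand-in for submitting one MapReduce job and collecting its output, a dict stands in for the file system, index files are assumed to be tuples of file names, and the "task" is a placeholder that just measures the size of each data file.

```python
from typing import Dict, List, Set, Tuple, Union

Store = Dict[str, Union[bytes, Tuple[str, ...]]]


def run_job(store: Store, names: List[str]) -> Tuple[List[int], List[str]]:
    """Stand-in for one MapReduce job over the given files.

    Returns the task results for data files and the file names that
    the index files among the inputs reference.
    """
    results: List[int] = []
    refs: List[str] = []
    for name in names:
        content = store[name]
        if isinstance(content, tuple):
            refs.extend(content)          # index file: emit referenced names
        else:
            results.append(len(content))  # data file: placeholder task
    return results, refs


def traverse(store: Store, root: str) -> Tuple[List[int], int]:
    """Walk the tree one level per job, driven entirely from the driver.

    Returns all task results and the number of jobs (rounds) submitted.
    """
    frontier: List[str] = [root]
    seen: Set[str] = set()
    results: List[int] = []
    rounds = 0
    while frontier:
        rounds += 1
        out, refs = run_job(store, frontier)
        results.extend(out)
        seen.update(frontier)
        # Every extracted name passes through the driver before the next job.
        frontier = [r for r in dict.fromkeys(refs) if r not in seen]
    return results, rounds
```

Each level of the tree costs one full job, and all extracted file names travel back to the driver between rounds, which is exactly the bottleneck I would like to avoid.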

So, my question is simple: is there a smarter approach to this problem? Or am I missing some possibly trivial detail?