Tuesday, April 26, 2016

Seminar Notes: Data Pipeline Development, Deployment, and Management Using Dataswarm

Below are my notes from Mike Starr's presentation on Dataswarm. Full video here:
https://www.youtube.com/watch?v=M0VCbhfQ3HQ&list=PL_EeYa3aRS55QAbL851AF5FIHlCcN9xbp


1. Key Takeaways:

  1. Dataswarm is a dependency graph description language. It is not code that runs to completion or does anything by itself; it just declares what you want to do.
  2. Dataswarm's primary job is to schedule the pipeline for a specific date. Users write Python code that defines the pipeline, and execution is delegated to a driver script.
  3. Dataswarm's advantage: write functions that generate pipelines instead of writing them manually.
  4. At Facebook, Dataswarm runs every major batch pipeline.

2. Summary: 

"At Facebook, data is used to gain insights for existing products and drive development of new products. In order to do this, engineers and analysts need to seamlessly process data across a variety of backend data stores. Dataswarm is a framework for writing data processing pipelines in Python. Using an extensible library of operations (e.g. executing queries, moving data, running scripts), developers programmatically define dependency graphs of tasks to be executed. Dataswarm takes care of the rest: distributed execution, scheduling, and dependency management. "


Below is a high-level data flow for batch processing: a user action leads to a web request on a backend server, the backend server generates event logs, and those log events then go to the data warehouse.



At a high level, Dataswarm is a tool that enables Data Scientists to convert logs into useful information.


Dataswarm introduces three concepts:

1. Tasks (units of execution)
2. Dependencies (tasks can depend on each other) - see the dependency graph below
3. Scheduling (what data scientists and ML engineers want!) - the ability to run a snapshot of your graph every day
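These three concepts can be illustrated with a toy dependency graph (a sketch of the general idea, not Dataswarm's internals): each task maps to the set of tasks it depends on, and a scheduler runs them in topological order.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each task maps to the set of tasks it depends on.
deps = {
    "dump_gifs": {"find_gifs"},  # dump depends on find
    "post_gifs": {"dump_gifs"},  # post depends on dump
}

def execution_order(deps):
    """A valid run order: every task comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())
```

Scheduling then amounts to running this order once per day, stamped with that day's date.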


3. Comparison:

Example: Finding cat GIFs


with the following considerations:
1. We want to handle retries
2. Selectively run parts of the pipeline
3. Run the pipeline across a bunch of machines

3.1 Finding Cat GIFs without Dataswarm

Here is the cat-GIF-finding pipeline as plain Python code, without Dataswarm:

To schedule the task:
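Without a framework, scheduling typically means cron or a hand-rolled driver loop. A minimal sketch (the 2am run hour and run_pipeline() are hypothetical):

```python
import datetime

def seconds_until_next_run(now, run_hour=2):
    """Seconds to wait until the next daily run at run_hour:00."""
    next_run = now.replace(hour=run_hour, minute=0, second=0, microsecond=0)
    if next_run <= now:
        next_run += datetime.timedelta(days=1)  # already past today's slot
    return (next_run - now).total_seconds()

# Driver loop (sketch): sleep until the run hour, then run the pipeline.
# while True:
#     time.sleep(seconds_until_next_run(datetime.datetime.now()))
#     run_pipeline()
```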


To handle failures:
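Handling failures by hand usually means wrapping each step in a retry loop. A sketch, not the code from the talk:

```python
import time

def with_retries(fn, attempts=3, delay=0):
    """Call fn, retrying up to `attempts` times before giving up."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise          # out of attempts: surface the failure
            time.sleep(delay)  # back off, then try again
```

Every step of the pipeline needs this wrapper, which is exactly the boilerplate a framework should absorb.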



To run only the parts that are interesting:
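Selectively running parts of the pipeline by hand typically means command-line flags guarding each step. A sketch (the find/dump/post step names are assumptions carried over from the example):

```python
import argparse

STEPS = ("find", "dump", "post")

def steps_to_run(argv):
    """Parse --find/--dump/--post flags; no flags means run everything."""
    parser = argparse.ArgumentParser()
    for step in STEPS:
        parser.add_argument("--" + step, action="store_true")
    args = parser.parse_args(argv)
    chosen = [s for s in STEPS if getattr(args, s)]
    return chosen or list(STEPS)
```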



To execute Hive remotely (find a Hive machine and run the query there):
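Remote execution by hand might mean picking a gateway host and shipping the query over ssh. A sketch (the hostname and the hive -e invocation are illustrative assumptions):

```python
import subprocess

def hive_command(host, query):
    """Build the ssh command line that runs a Hive query on `host`."""
    return ["ssh", host, "hive", "-e", query]

def run_hive_remote(host, query):
    # Ship the query to the chosen Hive gateway and capture the output.
    return subprocess.check_output(hive_command(host, query))
```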





3.2 Finding Cat GIFs with Dataswarm

Here is the cat-GIF pipeline written as a Dataswarm pipeline.
It sets the default operators, creates three methods (find, dump, post) that specify what the pipeline should do, and declares dependencies between the operators so Dataswarm knows the order (first find the cat GIFs, then dump them, then run the PHP scripts).

1. The global defaults construct sets default parameters for every task that will be instantiated in this pipeline (i.e., defaults for everything in this file).



2. Tell Dataswarm how to do the work by creating instances of operators.


3. Specify dependencies between tasks using dep_list (short for dependency list). The dependency list specifies which tasks need to run first.


4. Specify the date stamp (ds stands for date stamp).


5. A PHP operator runs PHP statements.




6. Run the execution script to run the pipeline. The date specifies the day for which the pipeline run is scheduled.
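Dataswarm is not open source, so the steps above can only be sketched with a hypothetical mock of the style the talk describes: global defaults, operator instances, a dep_list, and a <DATEID> macro that the driver substitutes with the date stamp (ds). All class, parameter, and command names below are assumptions, not the real API:

```python
# Hypothetical mock of the Dataswarm pipeline style; not the real API.
GLOBAL_DEFAULTS = {"user": "analyst", "retries": 3}  # defaults for every task

class Operator:
    def __init__(self, name, cmd, dep_list=()):
        self.name = name
        self.cmd = cmd                       # may contain the <DATEID> macro
        self.dep_list = list(dep_list)       # tasks that must finish first
        self.params = dict(GLOBAL_DEFAULTS)  # every task inherits the defaults

def bind_ds(op, ds):
    """Substitute the date-stamp macro, as the driver script would."""
    return op.cmd.replace("<DATEID>", ds)

# 1) find cat GIFs, 2) dump them, 3) post them via a PHP script
find = Operator("find_gifs", 'hive -e "SELECT url FROM gifs WHERE ds=<DATEID>"')
dump = Operator("dump_gifs", "dump_to_mysql <DATEID>", dep_list=[find])
post = Operator("post_gifs", "php post_gifs.php --date <DATEID>", dep_list=[dump])
```

Note that this file only *defines* the graph; a separate driver script picks a date and executes it, which is what makes selective runs and rescheduling cheap.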



3.3 Hive to MySQL Replication using Dataswarm

Write functions that generate pipelines instead of writing them manually.
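The point of this section, generating pipelines with functions, can be sketched as a loop that emits one replication task per table (the task shape and URL scheme are illustrative assumptions):

```python
def make_replication_tasks(tables, ds):
    """Generate one Hive-to-MySQL copy task per table for date stamp ds."""
    tasks = []
    for table in tables:
        tasks.append({
            "name": "replicate_" + table,
            "source": "hive://{}/ds={}".format(table, ds),
            "dest": "mysql://{}".format(table),
        })
    return tasks
```

Adding a new table to the replication then means appending one name to a list instead of hand-writing another task.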






4. Advantages

Dataswarm separates pipeline definition from execution: users write Python code that defines the pipeline, and the driver script is responsible for running it on a scheduled date. Dataswarm achieves the following:

1. Fault tolerant




2. Selectivity (the ability to select one or more tasks to run; no need to comment things out, because the pipeline definition is separated from the execution script.)



3. Remote execution (Dataswarm + Chronos)



  • Chronos is a pool of machines; when you submit a job to Chronos, it retries across different machines by default. That means if a task fails, Chronos will retry it on a different machine, so failures caused by host failures can be avoided.


(When executing, instead of printing the logs, it prints a URL that takes you to the logs.)



4. Scheduling - check your pipeline in to Dataswarm



5. Monitoring







5. Other tools


Dataswarm is not open source. An equivalent open source tool is Luigi.
  • Luigi is an open source Python-based framework for building data pipelines. Instead of using an XML/YAML configuration of some sort, all the jobs and their dependencies are written as Python programs. Because it's Python, developers can backtrack to figure out exactly how data is processed.
    • https://blog.treasuredata.com/blog/2015/02/25/managing-the-data-pipeline-with-git-luigi/
Chronos 


  • It is a distributed and fault-tolerant scheduler which runs on top of Mesos. It's a framework and supports custom Mesos executors as well as the default command executor. Thus by default, Chronos executes sh (on most systems, bash) scripts. Chronos can be used to interact with systems such as Hadoop (incl. EMR), even if the Mesos slaves on which execution happens do not have Hadoop installed. Included wrapper scripts allow transferring files and executing them on a remote machine in the background, using asynchronous callbacks to notify Chronos of job completion or failures.



