Wednesday, April 27, 2016

Seminar Notes - Pachyderm and TubeMogul (using Big Data to convert Events --> Insights --> Actions)

From the Big Data Application Meetup, 4/27. See http://bdam.io/ for complete notes. Slides: http://www.slideshare.net/JoeyZwicker/big-data-applications-61439464

Talk #1 Introducing Pachyderm, by Joe Doliner from Pachyderm

Pachyderm is a big data analytics platform deployed with Kubernetes and Docker. Pachyderm is inspired by the Hadoop ecosystem but shares no code with it. Instead, we leverage the container ecosystem to provide the broad functionality of Hadoop with the ease of use of Docker. 

There are two bold new ideas in Pachyderm: 
• Containers as the core primitive for computation -- which means each stage in your workflow can be written using any language or library you want. 
• Version Control for data -- view diffs of your data and incrementally process only the new data as it streams in. 
These ideas lead directly to a system that's much more powerful, flexible and easy to use. Pachyderm is open source so check it out on GitHub.
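The data-versioning idea can be sketched with a toy diff-and-process loop. This is a conceptual illustration only, not Pachyderm's actual API: commits are modeled as plain dict snapshots, and a processing step consumes only the diff between two commits.

```python
# Toy illustration of version-controlled data with incremental processing.
# Not Pachyderm's API: a "commit" here is just a snapshot of a
# filename -> contents mapping.

def diff(old_commit, new_commit):
    """Return only the entries that were added or changed between commits."""
    return {k: v for k, v in new_commit.items()
            if k not in old_commit or old_commit[k] != v}

def process_incrementally(old_commit, new_commit, transform):
    """Apply `transform` only to new/changed records, not the whole dataset."""
    return {k: transform(v) for k, v in diff(old_commit, new_commit).items()}

commit1 = {"a.txt": "hello", "b.txt": "world"}
commit2 = {"a.txt": "hello", "b.txt": "world!", "c.txt": "new"}

# Only b.txt and c.txt are reprocessed; a.txt is untouched.
result = process_incrementally(commit1, commit2, str.upper)
```

This is the payoff of versioned data: as new data streams in, only the diff is handed to the next pipeline stage.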

Meeting Notes 

Pachyderm is big data with containers
  • version control for data
    •  view diffs (a problem that Airbnb faces) 
    •  collaboration (data scientists need to fork their own data sets to work on them)
    •  data provenance (origin/source of data) 
  • uses containers for data processing
    • DAG of jobs
    •  Pipelines triggered by data changes
    •  Processing efficiency
    •  Incremental processing
  • batched and streaming
  • data lives in object stores (S3, GCS, Ceph)
  • shares no code with hadoop
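The "DAG of jobs" idea above can be sketched as a topological ordering of pipeline stages, so each stage runs only after its upstream inputs. A minimal sketch using Python's standard library; the stage names are made up, not from the talk:

```python
# Toy DAG-of-jobs ordering: each stage runs after the stages it depends on,
# mirroring how a pipeline is triggered when upstream data changes.
from graphlib import TopologicalSorter  # Python 3.9+

# stage -> set of upstream stages it depends on (hypothetical names)
dag = {
    "ingest":    set(),
    "clean":     {"ingest"},
    "aggregate": {"clean"},
    "report":    {"aggregate", "clean"},
}

# A valid execution order: every stage appears after its dependencies.
order = list(TopologicalSorter(dag).static_order())
```

A real system would run independent stages in parallel; the ordering constraint is the essential part.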
Why are containers useful for Big Data?
  • containers provide process-level security
  • containers provide a pristine environment 
  • containers enable performance controls that limit individual allotments of compute, networking, and storage on a per-container (and thus per-app) basis on a given host, providing QoS
What is kubernetes?
  • open-source container cluster manager by Google 
  • provides a platform for automating deployment, scaling, and operations of application containers across clusters of hosts 
  • third-generation orchestration tool
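The per-container performance controls mentioned above are exposed directly in Kubernetes pod specs. A minimal sketch with compute and memory requests/limits (the pod and image names are illustrative, not from the talk):

```yaml
# Hedged sketch: a Kubernetes pod with per-container resource controls,
# illustrating the compute/memory QoS limits containers make possible.
apiVersion: v1
kind: Pod
metadata:
  name: analytics-step        # hypothetical name
spec:
  containers:
  - name: worker
    image: my-analytics-image # hypothetical image
    resources:
      requests:               # scheduler reserves at least this much
        cpu: "500m"
        memory: "256Mi"
      limits:                 # container is capped at this much
        cpu: "1"
        memory: "512Mi"
```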
Pachyderm data lake use case


Talk #2 Leveraging Big Data at TubeMogul to convert Events --> Insights --> Actions, by Murtaza Doctor and John Trenkle from TubeMogul

TubeMogul is a leader in digital advertising, delivering our clients' creative content to desktops, mobile phones, programmatic TV and, ultimately, any device that can show engaging ads to users. Over the course of 10 years, the scale of data flowing through our RTB (Real-Time Bidding) system has increased exponentially. As this flow has increased, our data ecosystem has evolved to handle the collection and ETL of this data for the purposes of billing clients and fueling Optimization, Machine Learning, and Analytics. In this talk we'll discuss the path we've followed, which has employed Hadoop, Hive, Spark and Presto, as well as Cascading and other variations, to fulfill specific functions of our system. We'll talk about specific use cases in our platform and end with a hint at the directions this trajectory is taking us. 
Meeting Notes: 
Models ->  Action
  • Optimization
    • Surrogate measures of engagement: Clicks, Completions, Conversions
  • Audience Building for Targeting
    • Demographic
    • Behavioral
  • Fraud Detection
  • Cross-Device Syncing
  • Profiling/Data Mining/Actionable Intel


The main components are:
  • bidding layer, 
  • ad serving layer
  • Ad Events/Stats (event collection, engagement methods)
  • User cookie (which segments the user belongs to, user data)
    •  ingestion into the user database so the user can be served a better ad next time

Event architecture
  • Auctions (bids + non-bids)
  • Win events (impressions)
  • Columnar format (ORC)
  • Data pipeline
  • Bad data
  • Scaling challenges
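The columnar-format point (ORC) comes down to storing values column-by-column so a query reads only the columns it needs. A toy sketch of the idea, with made-up field names:

```python
# Toy illustration of columnar layout, the idea behind formats like ORC:
# pivot row-oriented records into per-column arrays so a query over one
# field touches only that field's values.

rows = [
    {"auction_id": "a1", "price": 0.42, "won": True},
    {"auction_id": "a2", "price": 0.17, "won": False},
    {"auction_id": "a3", "price": 0.33, "won": True},
]

# Row-oriented -> column-oriented
columns = {key: [r[key] for r in rows] for key in rows[0]}

# A scan over one column never reads the others.
avg_price = sum(columns["price"]) / len(columns["price"])
```

Real columnar formats add compression and predicate pushdown on top of this layout, but the pivot is the core idea.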


Event architecture takeaways:
  • Simplify and unify
  • Focus on data validation at each step
  • Automated recovery
  • Leverage the messaging system for status or completion
  • Metrics & measurements for Service Level Agreements (SLAs)
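"Data validation at each step" can be sketched as a small filter that checks each incoming ad event and shunts malformed records to a bad-data bucket for later recovery. The field names and event types below are illustrative, not TubeMogul's actual schema:

```python
# Sketch of per-step data validation in an event pipeline: good records
# flow on, bad records are set aside instead of corrupting downstream data.

REQUIRED_FIELDS = {"auction_id", "timestamp", "event_type"}
VALID_EVENT_TYPES = {"bid", "win", "impression", "click"}

def validate(event):
    """True if the event has all required fields and a known event type."""
    return (REQUIRED_FIELDS <= event.keys()
            and event["event_type"] in VALID_EVENT_TYPES)

def split_events(events):
    """Route each event into the good stream or the bad-data bucket."""
    good, bad = [], []
    for e in events:
        (good if validate(e) else bad).append(e)
    return good, bad

events = [
    {"auction_id": "a1", "timestamp": 1461790800, "event_type": "bid"},
    {"auction_id": "a2", "timestamp": 1461790801, "event_type": "oops"},
    {"timestamp": 1461790802, "event_type": "win"},  # missing auction_id
]
good, bad = split_events(events)
```

Keeping the rejects, rather than dropping them, is what makes the "automated recovery" bullet possible.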




Machine learning as a consumer
  •  audience modeling begets user-oriented data
    • pivot RTB/Analytics sources for model-building
  • Many sources of Truth that need to be integrated
    • ad interaction
  • Characterize users with robust signature rather than just an item list
  • Facilitate rapid prototyping and model-building
  • Maintain enriched information for exploratory analysis and visualization
    • insights
    • actionable intel
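The "robust signature rather than just an item list" idea above can be sketched by collapsing a user's raw event list into a small aggregate feature vector. The feature names here are made up for illustration:

```python
# Sketch: characterize a user with an aggregate "signature" (feature
# vector) instead of the raw list of items/events they touched.
from collections import Counter

def build_signature(events):
    """Collapse a user's raw event list into aggregate features."""
    counts = Counter(e["event_type"] for e in events)
    total = sum(counts.values()) or 1  # guard against empty event lists
    return {
        "n_events": total,
        "click_rate": counts["click"] / total,
        "impression_share": counts["impression"] / total,
    }

events = [
    {"event_type": "impression"},
    {"event_type": "impression"},
    {"event_type": "click"},
    {"event_type": "conversion"},
]
sig = build_signature(events)
```

A fixed-size signature like this is what makes rapid prototyping and model-building practical: every model consumes the same feature vector instead of variable-length event lists.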



