Wednesday, April 27, 2016

Seminar Notes - Pachyderm and TubeMogul (using Big Data to convert Events --> Insights --> Actions)

From the Big Data Application Meetup, 4/27. See http://bdam.io/ for complete notes. Slides: http://www.slideshare.net/JoeyZwicker/big-data-applications-61439464

Talk #1 Introducing Pachyderm, by Joe Doliner from Pachyderm

Pachyderm is a big data analytics platform deployed with Kubernetes and Docker. Pachyderm is inspired by the Hadoop ecosystem but shares no code with it. Instead, we leverage the container ecosystem to provide the broad functionality of Hadoop with the ease of use of Docker. 

There are two bold new ideas in Pachyderm: 
• Containers as the core primitive for computation -- which means each stage in your workflow can be written using any language or library you want. 
• Version Control for data -- view diffs of your data and incrementally process only the new data as it streams in. 
These ideas lead directly to a system that's much more powerful, flexible and easy to use. Pachyderm is open source so check it out on GitHub.
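The data-versioning idea can be sketched with a toy diff-and-process loop. This is a conceptual illustration only, not Pachyderm's actual API: commits are modeled as plain dict snapshots, and a processing step consumes only the diff between two commits.

```python
# Toy illustration of version-controlled data with incremental processing.
# Not Pachyderm's API: a "commit" here is just a snapshot of a
# filename -> contents mapping.

def diff(old_commit, new_commit):
    """Return only the entries that were added or changed between commits."""
    return {k: v for k, v in new_commit.items()
            if k not in old_commit or old_commit[k] != v}

def process_incrementally(old_commit, new_commit, transform):
    """Apply `transform` only to new/changed records, not the whole dataset."""
    return {k: transform(v) for k, v in diff(old_commit, new_commit).items()}

commit1 = {"a.txt": "hello", "b.txt": "world"}
commit2 = {"a.txt": "hello", "b.txt": "world!", "c.txt": "new"}

# Only b.txt and c.txt are reprocessed; a.txt is untouched.
result = process_incrementally(commit1, commit2, str.upper)
```

This is the payoff of versioned data: as new data streams in, only the diff is handed to the next pipeline stage.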

Meeting Notes 

Pachyderm is big data with containers
  • version control for data
    •  view diffs (a problem that Airbnb faces) 
    •  collaboration (data scientists need to fork their own data sets to work on them)
    •  data provenance (origin/source of data) 
  • uses containers for data processing
    • DAG of jobs
    •  Pipelines triggered by data changes
    •  Processing efficiency
    •  Incremental processing
  • batched and streaming
  • data lives in object stores (S3, GCS, Ceph)
  • shares no code with hadoop
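The "DAG of jobs" idea above can be sketched as a topological ordering of pipeline stages, so each stage runs only after its upstream inputs. A minimal sketch using Python's standard library; the stage names are made up, not from the talk:

```python
# Toy DAG-of-jobs ordering: each stage runs after the stages it depends on,
# mirroring how a pipeline is triggered when upstream data changes.
from graphlib import TopologicalSorter  # Python 3.9+

# stage -> set of upstream stages it depends on (hypothetical names)
dag = {
    "ingest":    set(),
    "clean":     {"ingest"},
    "aggregate": {"clean"},
    "report":    {"aggregate", "clean"},
}

# A valid execution order: every stage appears after its dependencies.
order = list(TopologicalSorter(dag).static_order())
```

A real system would run independent stages in parallel; the ordering constraint is the essential part.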
Why are containers useful for Big Data?
  • containers provide process-level security
  • containers provide a pristine environment 
  • containers enable performance controls that limit individual allotments of compute, networking, and storage on a per-container (and thus per-app) basis on a given host, providing QoS
What is kubernetes?
  • open-source container cluster manager by Google 
  • provides a platform for automating deployment, scaling, and operations of application containers across clusters of hosts 
  • third-generation orchestration tool
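The per-container performance controls mentioned above are exposed directly in Kubernetes pod specs. A minimal sketch with compute and memory requests/limits (the pod and image names are illustrative, not from the talk):

```yaml
# Hedged sketch: a Kubernetes pod with per-container resource controls,
# illustrating the compute/memory QoS limits containers make possible.
apiVersion: v1
kind: Pod
metadata:
  name: analytics-step        # hypothetical name
spec:
  containers:
  - name: worker
    image: my-analytics-image # hypothetical image
    resources:
      requests:               # scheduler reserves at least this much
        cpu: "500m"
        memory: "256Mi"
      limits:                 # container is capped at this much
        cpu: "1"
        memory: "512Mi"
```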
Pachyderm data lake use case


Talk #2 Leveraging Big Data at TubeMogul to convert Events --> Insights --> Actions, by Murtaza Doctor and John Trenkle from TubeMogul

TubeMogul is a leader in digital advertising, delivering our clients' creative content to desktops, mobile phones, programmatic TV and, ultimately, any device that can show engaging ads to users. Over the course of 10 years, the scale of data flowing through our RTB (Real-Time Bidding) system has increased exponentially. As this flow has increased, our data ecosystem has evolved to handle the collection and ETL of this data for the purposes of billing clients and fueling Optimization, Machine Learning, and Analytics. In this talk we'll discuss the path we've followed, which has employed Hadoop, Hive, Spark and Presto, as well as Cascading and other variations, to fulfill specific functions of our system. We'll talk about specific use cases in our platform and end with a hint at the directions this trajectory is taking us. 
Meeting Notes: 
Models ->  Action
  • Optimization
    • Surrogate measures of engagement: Clicks, Completions, Conversions
  • Audience Building for Targeting
    • Demographic
    • Behavioral
  • Fraud Detection
  • Cross-Device Syncing
  • Profiling/Data Mining/Actionable Intel


The main components are:
  • bidding layer, 
  • ad serving layer
  • Ad Events/Stats (event collection, engagement methods)
  • User cookie (which segments the user belongs to, user data)
    •  ingestion into the user database so the user can be served a better ad next time

Event architecture
  • Auctions (bids + non-bids)
  • Win events (impressions)
  • Columnar format (ORC)
  • Data pipeline
  • Bad data
  • Scaling challenges
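The columnar-format point (ORC) comes down to storing values column-by-column so a query reads only the columns it needs. A toy sketch of the idea, with made-up field names:

```python
# Toy illustration of columnar layout, the idea behind formats like ORC:
# pivot row-oriented records into per-column arrays so a query over one
# field touches only that field's values.

rows = [
    {"auction_id": "a1", "price": 0.42, "won": True},
    {"auction_id": "a2", "price": 0.17, "won": False},
    {"auction_id": "a3", "price": 0.33, "won": True},
]

# Row-oriented -> column-oriented
columns = {key: [r[key] for r in rows] for key in rows[0]}

# A scan over one column never reads the others.
avg_price = sum(columns["price"]) / len(columns["price"])
```

Real columnar formats add compression and predicate pushdown on top of this layout, but the pivot is the core idea.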


Event architecture takeaways:
  • Simplify and unify
  • Focus on data validation at each step
  • Automated recovery
  • Leverage the messaging system for status or completion
  • Metrics & measurements for Service Level Agreements (SLAs)
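"Data validation at each step" can be sketched as a small filter that checks each incoming ad event and shunts malformed records to a bad-data bucket for later recovery. The field names and event types below are illustrative, not TubeMogul's actual schema:

```python
# Sketch of per-step data validation in an event pipeline: good records
# flow on, bad records are set aside instead of corrupting downstream data.

REQUIRED_FIELDS = {"auction_id", "timestamp", "event_type"}
VALID_EVENT_TYPES = {"bid", "win", "impression", "click"}

def validate(event):
    """True if the event has all required fields and a known event type."""
    return (REQUIRED_FIELDS <= event.keys()
            and event["event_type"] in VALID_EVENT_TYPES)

def split_events(events):
    """Route each event into the good stream or the bad-data bucket."""
    good, bad = [], []
    for e in events:
        (good if validate(e) else bad).append(e)
    return good, bad

events = [
    {"auction_id": "a1", "timestamp": 1461790800, "event_type": "bid"},
    {"auction_id": "a2", "timestamp": 1461790801, "event_type": "oops"},
    {"timestamp": 1461790802, "event_type": "win"},  # missing auction_id
]
good, bad = split_events(events)
```

Keeping the rejects, rather than dropping them, is what makes the "automated recovery" bullet possible.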




Machine learning as a consumer
  •  audience modeling begets user-oriented data
    • pivot RTB/Analytics sources for model-building
  • Many sources of Truth that need to be integrated
    • ad interaction
  • Characterize users with robust signature rather than just an item list
  • Facilitate rapid prototyping and model-building
  • Maintain enriched information for exploratory analysis and visualization
    • insights
    • actionable intel
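The "robust signature rather than just an item list" idea above can be sketched by collapsing a user's raw event list into a small aggregate feature vector. The feature names here are made up for illustration:

```python
# Sketch: characterize a user with an aggregate "signature" (feature
# vector) instead of the raw list of items/events they touched.
from collections import Counter

def build_signature(events):
    """Collapse a user's raw event list into aggregate features."""
    counts = Counter(e["event_type"] for e in events)
    total = sum(counts.values()) or 1  # guard against empty event lists
    return {
        "n_events": total,
        "click_rate": counts["click"] / total,
        "impression_share": counts["impression"] / total,
    }

events = [
    {"event_type": "impression"},
    {"event_type": "impression"},
    {"event_type": "click"},
    {"event_type": "conversion"},
]
sig = build_signature(events)
```

A fixed-size signature like this is what makes rapid prototyping and model-building practical: every model consumes the same feature vector instead of variable-length event lists.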



