Thursday, May 5, 2016

Project Notes- Querying Yelp Data with Apache Drill on Windows 7


Apache Drill

  • Query Hadoop with SQL
  • Open-source, low-latency query engine for Hadoop that delivers secure, interactive SQL analytics at petabyte scale.
  • Delivers self-service data exploration on data stored in multiple formats, in files or in NoSQL databases.
  • Apache Drill supports nested data, schema-less execution, and decentralized metadata.
  • Why use Drill
    • http://drill.apache.org/docs/why-drill/
    • Easy to start; schema-free JSON model; able to query complex, semi-structured data; real SQL; BI tool support; interactive queries on Hive tables; access to multiple data sources; user-defined functions for Drill and Hive; high performance; scales from a single laptop to a 1000-node cluster.

Architecture

Tutorial

Drill tutorial in Windows – run simple queries  

Summary:

  • Based on the "Drill in 10 Minutes" tutorial
  • The tutorial installs Drill in embedded mode on Mac OS; the steps below are adapted for Windows. After installing Drill, you start the Drill shell, a pure-Java console-based utility for connecting to relational databases and executing SQL commands.

Setup

  1. Download the latest Drill and extract the tar archive.
  2. Go to the bin folder and type the following command on the command line:
sqlline.bat -u "jdbc:drill:zk=local"
  3. Note: this will not work if you have another Drill session open.

Run Queries

    • At the root of the Drill installation, a sample-data directory includes JSON and Parquet files that you can query. 

Querying JSON File

  • A sample JSON file, employee.json, contains fictitious employee data. To view the data in employee.json, submit the following SQL query to Drill, using the cp (classpath) storage plugin configuration to point to the file:
SELECT * FROM cp.`employee.json` LIMIT 3;
Todo: find out what this error ID means: [Error Id: 5b46f491-0eca-404e-b0d8-54d0d586152c on RTS140-pc:31010] (state=,code=0)
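To see what a schema-free JSON source looks like outside Drill, the same "first three rows" query can be mimicked in plain Python over newline-delimited JSON. The records below are illustrative stand-ins for the employee.json file that ships with Drill, not the real data:

```python
import json

# Made-up records mirroring the shape of Drill's bundled employee.json
# (one JSON object per line, no schema declared anywhere).
raw = """\
{"employee_id": 1, "full_name": "Sheri Nowmer", "position_title": "President"}
{"employee_id": 2, "full_name": "Derrick Whelply", "position_title": "VP Country Manager"}
{"employee_id": 4, "full_name": "Michael Spence", "position_title": "VP Country Manager"}
{"employee_id": 5, "full_name": "Maya Gutierrez", "position_title": "VP Country Manager"}
"""

# Rough equivalent of: SELECT * FROM cp.`employee.json` LIMIT 3;
rows = [json.loads(line) for line in raw.splitlines()]
first_three = rows[:3]
for row in first_three:
    print(row["employee_id"], row["full_name"])
```

Drill does the same line-by-line JSON parsing internally, but distributed and without loading the whole file first.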

Querying a Parquet File

  • Query the region.parquet and nation.parquet files in the sample-data directory on your local file system.
SELECT * FROM dfs.`C:\Development\apache-drill-1.6.0\sample-data\region.parquet`;
Todo: investigate this warning: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". (This usually means no SLF4J logging binding is on the classpath, so logging falls back to a no-op; it is a warning, not a query failure.)
  • Query the nation.parquet file:
SELECT * FROM dfs.`C:\Development\apache-drill-1.6.0\sample-data\nation.parquet`;
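Besides the sqlline shell, queries like the ones above can also be submitted over Drill's REST API (the embedded Drillbit serves it on port 8047 by default). A minimal sketch, assuming Drill is running locally; only the request payload is built here, and `submit()` is defined but not called:

```python
import json
from urllib import request

DRILL_URL = "http://localhost:8047/query.json"  # Drill's default web port

def build_payload(sql):
    """Build the JSON body Drill's REST API expects for a SQL query."""
    return {"queryType": "SQL", "query": sql}

def submit(sql):
    """POST the query to a running Drillbit (requires Drill to be up)."""
    body = json.dumps(build_payload(sql)).encode("utf-8")
    req = request.Request(DRILL_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

payload = build_payload(
    r"SELECT * FROM dfs.`C:\Development\apache-drill-1.6.0\sample-data\region.parquet`"
)
print(json.dumps(payload))
```

This is handy for scripting queries once the interactive shell works.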

  1. Stop Drill
!quit

Drill tutorial in Windows – run the Yelp dataset

Original Tutorial:
http://drill.apache.org/docs/analyzing-the-yelp-academic-dataset/

  1. Download the Yelp dataset.
    1. A version that works for this example can be found here:
    2. https://github.com/melqkiades/yelp/blob/master/notebooks/yelp_academic_dataset_business.json
  2. View the contents of the Yelp business data:
select * from dfs.`C:\Development\apache-drill-1.6.0\sample-data\yelp\yelp_academic_dataset_business.json` limit 1;
  3. Find the total number of reviews in the dataset:
select sum(review_count) as totalreviews from dfs.`C:\Development\apache-drill-1.6.0\sample-data\yelp\yelp_academic_dataset_business.json`;
  4. Top states and cities by total reviews:
select state, city, count(*) totalreviews from dfs.`C:\Development\apache-drill-1.6.0\sample-data\yelp\yelp_academic_dataset_business.json` group by state, city order by count(*) desc limit 10;
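To sanity-check what these aggregations compute, here is the same sum and group-by logic in plain Python over a few made-up business records (the real Yelp file has many more fields and rows; these values are illustrative only):

```python
from collections import Counter

# Made-up subset of Yelp-style business records.
businesses = [
    {"state": "NV", "city": "Las Vegas",  "review_count": 120},
    {"state": "NV", "city": "Las Vegas",  "review_count": 35},
    {"state": "AZ", "city": "Phoenix",    "review_count": 50},
    {"state": "AZ", "city": "Scottsdale", "review_count": 10},
]

# Equivalent of: select sum(review_count) as totalreviews ...
total_reviews = sum(b["review_count"] for b in businesses)

# Equivalent of: group by state, city order by count(*) desc limit 10
by_city = Counter((b["state"], b["city"]) for b in businesses)
top = by_city.most_common(10)

print(total_reviews)   # 215
print(top[0])          # (('NV', 'Las Vegas'), 2)
```

Drill runs the same aggregation but pushes it down over the whole JSON file without materializing it in memory first.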


Thursday, April 28, 2016

Seminar Notes- Job Search on LinkedIn

Notes from Randy Block's "Using LinkedIn for Job Search" presentation on 4/28/2016. For more info: http://www.randyblock.com/

Fun Facts:
1. 96% of recruiters are active on LinkedIn, but only 36% of job seekers are.
2. 89% of recruiters have hired candidates through social media.
3. Social networks continue to grow the fastest as hiring tools.
4. Social media has become a fast and cheap "background check" that is often done before inviting a job applicant in for an interview.
5. As a recruiter, if you don't ask the question, the answer is always no.
6. Mergers, reorganizations, and layoffs are the times to apply.
7. Algorithms ignore words that have become obsolete.
8. Recruiters don't contact people who are not employed.

Fun Tips:
1. When you are following a company and they are searching for candidates, you come up higher in their search.
2. Photo tips: men, do not smile; women, show your teeth.
3. Use photofeeler.com to see how people rate your photo.
4. Turn off notifications when you update your profile!!

Wednesday, April 27, 2016

Seminar Notes- Pachyderm and TubeMogul (using Big Data to convert Events --> Insights --> Actions)

From Big Data Application Meetup 4/27 See http://bdam.io/ for complete notes. slides: http://www.slideshare.net/JoeyZwicker/big-data-applications-61439464

Talk #1 Introducing Pachyderm, by Joe Doliner from Pachyderm

Pachyderm is a big data analytics platform deployed with Kubernetes and Docker. Pachyderm is inspired by the Hadoop ecosystem but shares no code with it. Instead, we leverage the container ecosystem to provide the broad functionality of Hadoop with the ease of use of Docker. 

There are two bold new ideas in Pachyderm: 

Tuesday, April 26, 2016

Seminar Notes- Data pipeline development, deployment, and management using Dataswarm

Below are my notes from Mike Starr's presentation on Dataswarm. Full video here:
https://www.youtube.com/watch?v=M0VCbhfQ3HQ&list=PL_EeYa3aRS55QAbL851AF5FIHlCcN9xbp


1. Key Takeaways:

  1. Dataswarm is a dependency-graph description language. It is not code that runs to completion or does anything by itself; it just defines what you want to do.
  2. Dataswarm's primary objective is to let operators schedule a pipeline for a specific date. Users write Python code that defines the pipeline, and execution is delegated to a driver script.
  3. Dataswarm's advantage: you can write functions that generate pipelines instead of writing them manually.
  4. At Facebook, Dataswarm runs every major batch pipeline.
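The "define the graph, let a driver run it" idea can be sketched in a few lines of Python. This is not Dataswarm's real API (which is Facebook-internal); the task names and decorator are made up purely to illustrate the pattern:

```python
# Toy dependency-graph pipeline: tasks declare what they depend on,
# and a driver runs them in dependency order. Illustrative only.
tasks = {}  # name -> (dependencies, function)

def task(name, deps=()):
    """Decorator that registers a task and its dependencies."""
    def register(fn):
        tasks[name] = (tuple(deps), fn)
        return fn
    return register

@task("extract_logs")
def extract_logs():
    return "raw events"

@task("load_warehouse", deps=["extract_logs"])
def load_warehouse():
    return "events in warehouse"

@task("daily_report", deps=["load_warehouse"])
def daily_report():
    return "report"

def run(target, done=None):
    """Driver: resolve dependencies depth-first, run each task once."""
    done = {} if done is None else done
    if target not in done:
        deps, fn = tasks[target]
        for dep in deps:
            run(dep, done)
        done[target] = fn()
    return done

results = run("daily_report")
print(list(results))   # ['extract_logs', 'load_warehouse', 'daily_report']
```

Note that defining the tasks does nothing by itself; only the driver call executes them, which matches takeaway #1 above.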

2. Summary: 

"At Facebook, data is used to gain insights for existing products and drive development of new products. In order to do this, engineers and analysts need to seamlessly process data across a variety of backend data stores. Dataswarm is a framework for writing data processing pipelines in Python. Using an extensible library of operations (e.g. executing queries, moving data, running scripts), developers programmatically define dependency graphs of tasks to be executed. Dataswarm takes care of the rest: distributed execution, scheduling, and dependency management. "


Below is a high-level data flow for batch processing: a user action leads to a web request on the backend server; the backend server generates logs of events; those log events then go to the data warehouse.



Dataswarm at a high level is a tool that enables Data Scientists to convert logs into useful information.