Apache Drill
-
Query Hadoop with SQL
-
Open source, low-latency query engine for Hadoop that delivers secure,
-
interactive SQL analytics at petabyte scale.
-
delivering self-service data exploration capabilities on data stored in multiple formats in files or NoSQL database
-
Apache Drill supports nested data, schema-less execution, and decentralized metadata.
-
why use drill
-
http://drill.apache.org/docs/why-drill/
-
easy to start, Schema-free JSON model, able to query complex, semi structured data, real SQL, BI tools, interactive queries on Hive table, access multiple data sources, user defined function for drill and hive, high performance, scales from a single laptop to 1000 node cluster
-
Query Hadoop with SQL
Open source, low-latency query engine for Hadoop that delivers secure,
interactive SQL analytics at petabyte scale.
delivering self-service data exploration capabilities on data stored in multiple formats in files or NoSQL database
Apache Drill supports nested data, schema-less execution, and decentralized metadata.
why use drill
- http://drill.apache.org/docs/why-drill/
- easy to start, Schema-free JSON model, able to query complex, semi structured data, real SQL, BI tools, interactive queries on Hive table, access multiple data sources, user defined function for drill and hive, high performance, scales from a single laptop to 1000 node cluster
Architecture
Tutorial
Drill tutorial in Windows – run simple queries
summary:
-
Drill in 10 minutes
-
-
install drill to run in embedded mode on MAC OS. After installing Drill, you start the Drill shell. The Drill shell is a pure-Java console-based utility for connecting to relational databases and executing SQL commands.
Drill in 10 minutes
install drill to run in embedded mode on MAC OS. After installing Drill, you start the Drill shell. The Drill shell is a pure-Java console-based utility for connecting to relational databases and executing SQL commands.
Setup
-
Download latest Drill, extract the Tar
-
-
The Apache Drill archive contains sample JSON and Parquet files that you can query immediately.
-
Go to bin folder, Type the following command on the command line:
sqlline.bat -u "jdbc:drill:zk=local"
-
Will not work if you have another session open
Download latest Drill, extract the Tar
- The Apache Drill archive contains sample JSON and Parquet files that you can query immediately.
Go to bin folder, Type the following command on the command line:
Will not work if you have another session open
Run Queries
-
At the root of the Drill installation, a sample-data directory includes JSON and Parquet files that you can query.
- At the root of the Drill installation, a sample-data directory includes JSON and Parquet files that you can query.
Querying JSON File
-
A sample JSON file, employee.json, contains fictitious employee data. To view the data in the employee.json file, submit the following SQL query to Drill, using the cp (classpath) storage plugin configuration to point to the file.
SELECT * FROM cp.`employee.json` LIMIT 3;
Todo: findout what’s the error ID: [Error Id: 5b46f491-0eca-404e-b0d8-54d0d586152c on RTS140-pc:31010] (state=,code =0)
A sample JSON file, employee.json, contains fictitious employee data. To view the data in the employee.json file, submit the following SQL query to Drill, using the cp (classpath) storage plugin configuration to point to the file.
Querying a Parquet File
-
Query the region.parquet and nation.parquet files in the sample-data directory on your local file system.
SELECT * FROM dfs.`C:\Development\apache-drill-1.6.0\sample-data\region.parquet`;
Todo : find out what’s the error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
-
Query the nation file
SELECT * FROM dfs.`C:\Development\apache-drill-1.6.0\sample-data/nation.parquet`;
Query the region.parquet and nation.parquet files in the sample-data directory on your local file system.
Query the nation file
- Stop Drill
!quit
Drill tutorial in Windows – run Yelp DataSet
Original Tutorial:
http://drill.apache.org/docs/analyzing-the-yelp-academic-dataset/
-
Download Yelp Dataset here
-
-
One that works for this ex can be found here
-
https://github.com/melqkiades/yelp/blob/master/notebooks/yelp_academic_dataset_business.json
-
View the contents of the Yelp business data
select * from dfs.`C:\Development\apache-drill-1.6.0\sample-data/yelp/yelp_academic_dataset_business.json` limit 1;
-
Find total reviews in the data set
select sum(review_count) as totalreviews from dfs.`C:\Development\apache-drill-1.6.0\sample-data/yelp/yelp_academic_dataset_business.json`;
-
Top states and cities in total reviews
select state, city, count(*) totalreviews from dfs.`C:\Development\apache-drill-1.6.0\sample-data/yelp/yelp_academic_dataset_business.json` group by state, city order by count(*) desc limit 10;
Download Yelp Dataset here
- One that works for this ex can be found here
- https://github.com/melqkiades/yelp/blob/master/notebooks/yelp_academic_dataset_business.json
View the contents of the Yelp business data
Find total reviews in the data set
Top states and cities in total reviews
