文章出處

Hive學習筆記總結

05. Hql練習

1. hql基礎練習

題目和數據來源：http://www.w2b-c.com/article/150326(去掉-)

create和load

create table students(Sno int,Sname string,Sex string,Sage int,Sdept string)row format delimited fields terminated by ','stored as textfile;
create table course(Cno int,Cname string) row format delimited fields terminated by ',' stored as textfile;
create table sc(Sno int,Cno int,Grade int)row format delimited fields terminated by ',' stored as textfile;

load data local inpath '/home/hadoop/hivedata/students.txt' overwrite into table student;
load data local inpath '/home/hadoop/hivedata/sc.txt' overwrite into table sc;
load data local inpath '/home/hadoop/hivedata/course.txt' overwrite into table course;

1.查詢全體學生的學號與姓名

hive> select Sno,Sname from students;

2.查詢選修了課程的學生姓名

select distinct Sname from students, sc where students.Sno = sc.Sno;

或：

select distinct Sname from students inner join sc on students.Sno = sc.Sno;

3.查詢學生的總人數

select count(*) from students;

4.計算1號課程的學生平均成績

select avg(Grade) from sc where Cno = 1;

5.查詢各科成績平均分

select Cname,avg(Grade) from sc, course where sc.Cno = course.Cno group by sc.Cno;

//Grade要么出現在group關鍵詞之后，要么使用聚合函數。

6.查詢選修1號課程的學生最高分數

select max(Grade) from sc where Cno = 1;

7.求各個課程號及相應的選課人數

select Cno,count(*) from sc group by Cno;

8.查詢選修了3門以上的課程的學生學號

select Sno from sc group by Sno having count(Cno) >3 ;

9.查詢學生信息，結果按學號全局有序

select * from students order by Sno;

10.查詢學生信息，結果區分性別按年齡有序

set mapred.reduce.tasks=2;
select * from students distribute by sex sort by sage;

11.查詢每個學生及其選修課程的情況

select students.*,sc.* from students join sc on (students.Sno =sc.Sno);

12.查詢學生的得分情況
13.查詢選修2號課程且成績在90分以上的所有學生。

select students.Sname from sc,students where sc.Cno = 2 and sc.Grade > 90 and sc.Sno = students.Sno;

或者：

select students.Sname,sc.Grade from students join sc on students.Sno=sc.Sno where  sc.Cno=2 and sc.Grade>90;

14.查詢所有學生的信息，如果在成績表中有成績，則輸出成績表中的課程號

select students.Sname,sc.Cno from students join sc on students.Sno=sc.Sno;

15.重寫以下子查詢為LEFT SEMI JOIN

SELECT a.key, a.value FROM a WHERE a.key exist in (SELECT b.key FROM B);

查詢目的：查找A中，key值在B中存在的數據。
可以被重寫為：

select a.key,a.value from a left semi join b on a.key = b.key;

16.查詢與“劉晨”在同一個系學習的學生

select s1.Sname from students s1 where sdept in (select sdept from students where sname = '劉晨');

或者：

select s1.Sname from students s1 left semi join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';

注意比較：

select * from students s1 left join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
select * from students s1 right join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
select * from students s1 inner join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
select * from students s1 left semi join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';

2. 執行順序

標準順序：
select--from--where--group by--having--order by

join操作中，on條件與where條件的區別

數據庫在通過連接兩張或多張表來返回記錄時，都會生成一張中間的臨時表，然后再將這張臨時表返回給用戶。
join發生在where字句之前，在使用left jion時，on和where條件的區別如下：

1、on條件是在生成臨時表時使用的條件，它不管on中的條件是否為真，都會返回左邊表中的記錄。（右邊置為Null了）

2、where條件是在臨時表生成好后，再對臨時表進行過濾的條件。這時已經沒有left join的含義（必須返回左邊表的記錄）了，條件不為真的就全部過濾掉。

假設有兩張表：

表1：tab1

id size 
1  10 
2  20 
3  30

表2：tab2

size name 
10   AAA 
20   BBB 
20   CCC

兩條SQL:

1、select * from tab1 left join tab2 on tab1.size = tab2.size where tab2.name='AAA'
2、select * from tab1 left join tab2 on tab1.size = tab2.size and tab2.name='AAA'

第一條SQL的過程：
1、中間表
on條件:

tab1.size = tab2.size 
tab1.id tab1.size tab2.size tab2.name 
1 10 10 AAA 
2 20 20 BBB 
2 20 20 CCC 
3 30 (null) (null)

2、再對中間表過濾
where 條件：

tab2.name='AAA'
tab1.id tab1.size tab2.size tab2.name 
1 10 10 AAA

第二條SQL的過程：
1、中間表
on條件:

tab1.size = tab2.size and tab2.name='AAA'
(條件不為真也會返回左表中的記錄) tab1.id tab1.size tab2.size tab2.name 
1 10 10 AAA 
2 20 (null) (null) 
3 30 (null) (null)

其實以上結果的關鍵原因就是left join,right join,full join的特殊性，
不管on上的條件是否為真都會返回left或right表中的記錄，full則具有left和right的特性的并集。

** 而inner join沒這個特殊性，則條件放在on中和where中，返回的結果集是相同的。**

3. Hive實戰--級聯求和（累計報表）

需求：
有如下訪客訪問次數統計表 t_access_times

訪客	月份	訪問次數
A	2015-01	5
A	2015-01	15
B	2015-01	5
A	2015-01	8
B	2015-01	25
A	2015-01	5
A	2015-02	4
A	2015-02	6
B	2015-02	10
B	2015-02	5

需要輸出報表：t_access_times_accumulate
月訪問：當月的總次數；累計訪問總計：截止到當月的月訪問次數之和。

訪客	月份	月訪問總計	累計訪問總計
A	2015-01	33	33
A	2015-02	10	43
B	2015-01	30	30
B	2015-02	15	45

準備數據：
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5

create table t_access_time(username string,month string,salary int)
row format delimited fields terminated by ',';

load data local inpath '/home/hadoop/t_access_times.dat' into table t_access_time;

1、第一步，先求每個用戶的月總金額

select username,month,sum(salary) from t_access_time group by username,month;

+-----------+----------+---------+--+
| username | month | salary |
+-----------+----------+---------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
+-----------+----------+---------+--+

2、第二步，將月總金額表自己連接(自連接)

select * from 
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabA 
inner join 
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabB 
on TabA.username = TabB.username;

+-------------+----------+-----------+-------------+----------+-----------+--+
| a.username | a.month | a.salary | b.username | b.month | b.salary |
+-------------+----------+-----------+-------------+----------+-----------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
+-------------+----------+-----------+-------------+----------+-----------+--+

3、第三步，從上一步的結果中
進行分組查詢，分組的字段是a.username a.month
求月累計值：將b.month <= a.month的所有b.salary求和即可

select TabA.username,TabA.month,max(TabA.salary) as month_salary,sum(TabB.salary) as sum_salary 
from 
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabA 
inner join 
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabB 
on TabA.username = TabB.username 
where TabB.month<= TabA.month 
group by TabA.username,TabA.month;

max(TabA.salary)不能直接寫成TabA.salary，因為這個字段沒有出現在group by中，也沒有聚合函數，所以使用max表示。

結果：
A 2015-01 33 33
A 2015-02 10 43
B 2015-01 30 30
B 2015-02 15 45

參考http://www.w2b-c.com/article/150326（去掉-）

初接觸，記下學習筆記，還有很多問題，望指導，謝謝。

文章列表