文章出處

Hive學習筆記總結

05. Hql練習

1. hql基礎練習

題目和數據來源:http://www.w2b-c.com/article/150326(去掉-)

create和load

create table students(Sno int,Sname string,Sex string,Sage int,Sdept string)row format delimited fields terminated by ','stored as textfile;
create table course(Cno int,Cname string) row format delimited fields terminated by ',' stored as textfile;
create table sc(Sno int,Cno int,Grade int)row format delimited fields terminated by ',' stored as textfile;

load data local inpath '/home/hadoop/hivedata/students.txt' overwrite into table student;
load data local inpath '/home/hadoop/hivedata/sc.txt' overwrite into table sc;
load data local inpath '/home/hadoop/hivedata/course.txt' overwrite into table course;

1.查詢全體學生的學號與姓名

hive> select Sno,Sname from students;

2.查詢選修了課程的學生姓名

select distinct Sname from students, sc where students.Sno = sc.Sno;

或:

select distinct Sname from students inner join sc on students.Sno = sc.Sno;

3.查詢學生的總人數

select count(*) from students;

4.計算1號課程的學生平均成績

select avg(Grade) from sc where Cno = 1;

5.查詢各科成績平均分

select Cname,avg(Grade) from sc, course where sc.Cno = course.Cno group by sc.Cno;

//Grade要么出現在group關鍵詞之后,要么使用聚合函數。

6.查詢選修1號課程的學生最高分數

select max(Grade) from sc where Cno = 1;

7.求各個課程號及相應的選課人數

select Cno,count(*) from sc group by Cno;

8.查詢選修了3門以上的課程的學生學號

select Sno from sc group by Sno having count(Cno) >3 ;

9.查詢學生信息,結果按學號全局有序

select * from students order by Sno;

10.查詢學生信息,結果區分性別按年齡有序

set mapred.reduce.tasks=2;
select * from students distribute by sex sort by sage;

11.查詢每個學生及其選修課程的情況

select students.*,sc.* from students join sc on (students.Sno =sc.Sno);

12.查詢學生的得分情況
13.查詢選修2號課程且成績在90分以上的所有學生。

select students.Sname from sc,students where sc.Cno = 2 and sc.Grade > 90 and sc.Sno = students.Sno;

或者:

select students.Sname,sc.Grade from students join sc on students.Sno=sc.Sno where  sc.Cno=2 and sc.Grade>90;

14.查詢所有學生的信息,如果在成績表中有成績,則輸出成績表中的課程號

select students.Sname,sc.Cno from students join sc on students.Sno=sc.Sno;

15.重寫以下子查詢為LEFT SEMI JOIN

SELECT a.key, a.value FROM a WHERE a.key exist in (SELECT b.key FROM B);

查詢目的:查找A中,key值在B中存在的數據。
可以被重寫為:

select a.key,a.value from a left semi join b on a.key = b.key;

16.查詢與“劉晨”在同一個系學習的學生

select s1.Sname from students s1 where sdept in (select sdept from students where sname = '劉晨');

或者:

select s1.Sname from students s1 left semi join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';

注意比較:

select * from students s1 left join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
select * from students s1 right join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
select * from students s1 inner join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
select * from students s1 left semi join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';

2. 執行順序

標準順序:
select--from--where--group by--having--order by

join操作中,on條件與where條件的區別

數據庫在通過連接兩張或多張表來返回記錄時,都會生成一張中間的臨時表,然后再將這張臨時表返回給用戶。
join發生在where字句之前,在使用left jion時,on和where條件的區別如下:

1、on條件是在生成臨時表時使用的條件,它不管on中的條件是否為真,都會返回左邊表中的記錄。(右邊置為Null了)

2、where條件是在臨時表生成好后,再對臨時表進行過濾的條件。這時已經沒有left join的含義(必須返回左邊表的記錄)了,條件不為真的就全部過濾掉。

假設有兩張表:

表1:tab1

id size 
1  10 
2  20 
3  30 

表2:tab2

size name 
10   AAA 
20   BBB 
20   CCC 

兩條SQL:

1、select * from tab1 left join tab2 on tab1.size = tab2.size where tab2.name='AAA'
2、select * from tab1 left join tab2 on tab1.size = tab2.size and tab2.name='AAA'

第一條SQL的過程:
1、中間表
on條件:

tab1.size = tab2.size 
tab1.id tab1.size tab2.size tab2.name 
1 10 10 AAA 
2 20 20 BBB 
2 20 20 CCC 
3 30 (null) (null) 

2、再對中間表過濾
where 條件:

tab2.name='AAA'
tab1.id tab1.size tab2.size tab2.name 
1 10 10 AAA 

第二條SQL的過程:
1、中間表
on條件:

tab1.size = tab2.size and tab2.name='AAA'
(條件不為真也會返回左表中的記錄) tab1.id tab1.size tab2.size tab2.name 
1 10 10 AAA 
2 20 (null) (null) 
3 30 (null) (null) 

其實以上結果的關鍵原因就是left join,right join,full join的特殊性,
不管on上的條件是否為真都會返回left或right表中的記錄,full則具有left和right的特性的并集。

** 而inner join沒這個特殊性,則條件放在on中和where中,返回的結果集是相同的。**

3. Hive實戰--級聯求和(累計報表)

需求:
有如下訪客訪問次數統計表 t_access_times

訪客 月份 訪問次數
A 2015-01 5
A 2015-01 15
B 2015-01 5
A 2015-01 8
B 2015-01 25
A 2015-01 5
A 2015-02 4
A 2015-02 6
B 2015-02 10
B 2015-02 5

需要輸出報表:t_access_times_accumulate
月訪問:當月的總次數;累計訪問總計:截止到當月的月訪問次數之和。

訪客 月份 月訪問總計 累計訪問總計
A 2015-01 33 33
A 2015-02 10 43
B 2015-01 30 30
B 2015-02 15 45

準備數據:
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5

create table t_access_time(username string,month string,salary int)
row format delimited fields terminated by ',';

load data local inpath '/home/hadoop/t_access_times.dat' into table t_access_time;

1、第一步,先求每個用戶的月總金額

select username,month,sum(salary) from t_access_time group by username,month;

+-----------+----------+---------+--+
| username | month | salary |
+-----------+----------+---------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
+-----------+----------+---------+--+

2、第二步,將月總金額表 自己連接(自連接)

select * from 
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabA 
inner join 
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabB 
on TabA.username = TabB.username;

+-------------+----------+-----------+-------------+----------+-----------+--+
| a.username | a.month | a.salary | b.username | b.month | b.salary |
+-------------+----------+-----------+-------------+----------+-----------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
+-------------+----------+-----------+-------------+----------+-----------+--+

3、第三步,從上一步的結果中
進行分組查詢,分組的字段是a.username a.month
求月累計值: 將b.month <= a.month的所有b.salary求和即可

select TabA.username,TabA.month,max(TabA.salary) as month_salary,sum(TabB.salary) as sum_salary 
from 
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabA 
inner join 
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabB 
on TabA.username = TabB.username 
where TabB.month<= TabA.month 
group by TabA.username,TabA.month;

max(TabA.salary)不能直接寫成TabA.salary,因為這個字段沒有出現在group by中,也沒有聚合函數,所以使用max表示。

結果:
A 2015-01 33 33
A 2015-02 10 43
B 2015-01 30 30
B 2015-02 15 45

參考http://www.w2b-c.com/article/150326(去掉-)

初接觸,記下學習筆記,還有很多問題,望指導,謝謝。


文章列表


不含病毒。www.avast.com
arrow
arrow
    全站熱搜
    創作者介紹
    創作者 大師兄 的頭像
    大師兄

    IT工程師數位筆記本

    大師兄 發表在 痞客邦 留言(0) 人氣()