Hive學習筆記總結
05. Hql練習
1. hql基礎練習
題目和數據來源:http://www.w2b-c.com/article/150326(去掉-)
create和load
create table students(Sno int,Sname string,Sex string,Sage int,Sdept string)row format delimited fields terminated by ','stored as textfile;
create table course(Cno int,Cname string) row format delimited fields terminated by ',' stored as textfile;
create table sc(Sno int,Cno int,Grade int)row format delimited fields terminated by ',' stored as textfile;
load data local inpath '/home/hadoop/hivedata/students.txt' overwrite into table student;
load data local inpath '/home/hadoop/hivedata/sc.txt' overwrite into table sc;
load data local inpath '/home/hadoop/hivedata/course.txt' overwrite into table course;
1.查詢全體學生的學號與姓名
hive> select Sno,Sname from students;
2.查詢選修了課程的學生姓名
select distinct Sname from students, sc where students.Sno = sc.Sno;
或:
select distinct Sname from students inner join sc on students.Sno = sc.Sno;
3.查詢學生的總人數
select count(*) from students;
4.計算1號課程的學生平均成績
select avg(Grade) from sc where Cno = 1;
5.查詢各科成績平均分
select Cname,avg(Grade) from sc, course where sc.Cno = course.Cno group by sc.Cno;
//Grade要么出現在group關鍵詞之后,要么使用聚合函數。
6.查詢選修1號課程的學生最高分數
select max(Grade) from sc where Cno = 1;
7.求各個課程號及相應的選課人數
select Cno,count(*) from sc group by Cno;
8.查詢選修了3門以上的課程的學生學號
select Sno from sc group by Sno having count(Cno) >3 ;
9.查詢學生信息,結果按學號全局有序
select * from students order by Sno;
10.查詢學生信息,結果區分性別按年齡有序
set mapred.reduce.tasks=2;
select * from students distribute by sex sort by sage;
11.查詢每個學生及其選修課程的情況
select students.*,sc.* from students join sc on (students.Sno =sc.Sno);
12.查詢學生的得分情況
13.查詢選修2號課程且成績在90分以上的所有學生。
select students.Sname from sc,students where sc.Cno = 2 and sc.Grade > 90 and sc.Sno = students.Sno;
或者:
select students.Sname,sc.Grade from students join sc on students.Sno=sc.Sno where sc.Cno=2 and sc.Grade>90;
14.查詢所有學生的信息,如果在成績表中有成績,則輸出成績表中的課程號
select students.Sname,sc.Cno from students join sc on students.Sno=sc.Sno;
15.重寫以下子查詢為LEFT SEMI JOIN
SELECT a.key, a.value FROM a WHERE a.key exist in (SELECT b.key FROM B);
查詢目的:查找A中,key值在B中存在的數據。
可以被重寫為:
select a.key,a.value from a left semi join b on a.key = b.key;
16.查詢與“劉晨”在同一個系學習的學生
select s1.Sname from students s1 where sdept in (select sdept from students where sname = '劉晨');
或者:
select s1.Sname from students s1 left semi join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
注意比較:
select * from students s1 left join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
select * from students s1 right join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
select * from students s1 inner join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
select * from students s1 left semi join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
2. 執行順序
標準順序:
select--from--where--group by--having--order by
join操作中,on條件與where條件的區別
數據庫在通過連接兩張或多張表來返回記錄時,都會生成一張中間的臨時表,然后再將這張臨時表返回給用戶。
join發生在where字句之前,在使用left jion時,on和where條件的區別如下:
1、on條件是在生成臨時表時使用的條件,它不管on中的條件是否為真,都會返回左邊表中的記錄。(右邊置為Null了)
2、where條件是在臨時表生成好后,再對臨時表進行過濾的條件。這時已經沒有left join的含義(必須返回左邊表的記錄)了,條件不為真的就全部過濾掉。
假設有兩張表:
表1:tab1
id size
1 10
2 20
3 30
表2:tab2
size name
10 AAA
20 BBB
20 CCC
兩條SQL:
1、select * from tab1 left join tab2 on tab1.size = tab2.size where tab2.name='AAA'
2、select * from tab1 left join tab2 on tab1.size = tab2.size and tab2.name='AAA'
第一條SQL的過程:
1、中間表
on條件:
tab1.size = tab2.size
tab1.id tab1.size tab2.size tab2.name
1 10 10 AAA
2 20 20 BBB
2 20 20 CCC
3 30 (null) (null)
2、再對中間表過濾
where 條件:
tab2.name='AAA'
tab1.id tab1.size tab2.size tab2.name
1 10 10 AAA
第二條SQL的過程:
1、中間表
on條件:
tab1.size = tab2.size and tab2.name='AAA'
(條件不為真也會返回左表中的記錄) tab1.id tab1.size tab2.size tab2.name
1 10 10 AAA
2 20 (null) (null)
3 30 (null) (null)
其實以上結果的關鍵原因就是left join,right join,full join的特殊性,
不管on上的條件是否為真都會返回left或right表中的記錄,full則具有left和right的特性的并集。
** 而inner join沒這個特殊性,則條件放在on中和where中,返回的結果集是相同的。**
3. Hive實戰--級聯求和(累計報表)
需求:
有如下訪客訪問次數統計表 t_access_times
訪客 | 月份 | 訪問次數 |
---|---|---|
A | 2015-01 | 5 |
A | 2015-01 | 15 |
B | 2015-01 | 5 |
A | 2015-01 | 8 |
B | 2015-01 | 25 |
A | 2015-01 | 5 |
A | 2015-02 | 4 |
A | 2015-02 | 6 |
B | 2015-02 | 10 |
B | 2015-02 | 5 |
需要輸出報表:t_access_times_accumulate
月訪問:當月的總次數;累計訪問總計:截止到當月的月訪問次數之和。
訪客 | 月份 | 月訪問總計 | 累計訪問總計 |
---|---|---|---|
A | 2015-01 | 33 | 33 |
A | 2015-02 | 10 | 43 |
B | 2015-01 | 30 | 30 |
B | 2015-02 | 15 | 45 |
準備數據:
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
create table t_access_time(username string,month string,salary int)
row format delimited fields terminated by ',';
load data local inpath '/home/hadoop/t_access_times.dat' into table t_access_time;
1、第一步,先求每個用戶的月總金額
select username,month,sum(salary) from t_access_time group by username,month;
+-----------+----------+---------+--+
| username | month | salary |
+-----------+----------+---------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
+-----------+----------+---------+--+
2、第二步,將月總金額表 自己連接(自連接)
select * from
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabA
inner join
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabB
on TabA.username = TabB.username;
+-------------+----------+-----------+-------------+----------+-----------+--+
| a.username | a.month | a.salary | b.username | b.month | b.salary |
+-------------+----------+-----------+-------------+----------+-----------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
+-------------+----------+-----------+-------------+----------+-----------+--+
3、第三步,從上一步的結果中
進行分組查詢,分組的字段是a.username a.month
求月累計值: 將b.month <= a.month的所有b.salary求和即可
select TabA.username,TabA.month,max(TabA.salary) as month_salary,sum(TabB.salary) as sum_salary
from
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabA
inner join
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabB
on TabA.username = TabB.username
where TabB.month<= TabA.month
group by TabA.username,TabA.month;
max(TabA.salary)不能直接寫成TabA.salary,因為這個字段沒有出現在group by中,也沒有聚合函數,所以使用max表示。
結果:
A 2015-01 33 33
A 2015-02 10 43
B 2015-01 30 30
B 2015-02 15 45
參考http://www.w2b-c.com/article/150326(去掉-)
初接觸,記下學習筆記,還有很多問題,望指導,謝謝。
文章列表