-
[Hive] Partition이란?(add 추가, drop 삭제, show partitions 조회, partitions)데이터 엔지니어링/Hive 2022. 10. 11. 20:19반응형
Hive에서 사용하는 partition이란?
- table을 만들 partition을 이용하여 폴더 구조를 세분화 해서 관리 할 수 있다.
- 폴더를 구조적으로 나눠주기 때문에 데이터 관리 측면에서 효율적이다.
- 다만, 너무 많은 뎁스를 가지는 partition은 쿼리 나 성능면에서 악영향을 줄 수 있기 때문에 적당한 뎁스를 설정하는게 중요하다.
Partition 테이블 생성
CREATE TABLE if not exsits default.person( col1 INT, col2 STRING, col3 INT, col4 STRING ) PARTITIONED BY(depth1 int, depth2 int, depth3 int) -- depth STORED AS ORC LOCATION 'hdfs://{hadoop_domain}:{port}/{path}'; # 형식은 'hdfs://{hadoop_domain}:{port}/{path}/depth1/depth2/depth3' 로 관리 된다. hive> CREATE TABLE if not exists default.person( > pid INT, > name STRING, > age INT, > address STRING > ) > PARTITIONED BY(year int, month int, day int) -- depth > STORED AS ORC > LOCATION 'hdfs://namenode:8020/person/'; OK Time taken: 0.092 seconds
Partition 데이터 생성
insert into table default.person partition(year=2022, month=9, day=15) values (1, 'James', 22, 'Seoul'),(2, 'Tom', 23, 'New York'),(3, 'Jace', 24, 'San Francisco'), (4, 'Catalina', 25, 'Virginia'),(5, 'Nilla', 31, 'Florida'),(6, 'Hyun', 32, 'Busan'), (7, 'Jack', 45, 'Tokyo'),(8, 'Micheal', 82, 'Fukuoka'),(9, 'Sunny', 31, 'Gwangju'), (10, 'John', 42, 'Canberra'); insert into table default.person partition(year=2022, month=9, day=16) values (11, 'James', 94, 'Seoul'),(12, 'Tom', 47, 'New York'),(13, 'Jace', 56, 'San Francisco'), (14, 'Catalina', 22, 'Florida'),(15, 'Nilla', 36, 'Busan'),(16, 'Hyun', 22, 'Tokyo'), (17, 'Jack', 26, 'Fukuoka'),(18, 'Micheal', 52, 'Fukuoka'),(19, 'Sunny', 38, 'Gwangju'), (20, 'John', 34, 'Canberra'); insert into table default.person partition(year=2022, month=9, day=17) values (21, 'james', 22, 'Seoul'),(22, 'Tom', 47, 'New York'),(23, 'Jace', 56, 'San Francisco'), (24, 'Catalina', 36, 'Florida'),(25, 'Nilla', 26, 'Seoul'),(26, 'Hyun', 52, 'Seoul'), (27, 'Jack', 34, 'Tokyo'),(28, 'Sunny', 38, 'Fukuoka');
Hadoop에 생성된 Partition 폴더 확인
# 위에 insert에서 만든 partition별로 폴더가 생긴걸 확인이 가능하다 # hdfs dfs -ls /person/year=2022/month=9/ Found 3 items drwxr-xr-x - root supergroup 0 2022-10-11 05:38 /person/year=2022/month=9/day=15 drwxr-xr-x - root supergroup 0 2022-10-11 05:40 /person/year=2022/month=9/day=16 drwxr-xr-x - root supergroup 0 2022-10-11 05:40 /person/year=2022/month=9/day=17
hive 서버에서 쿼리로 질의를 날려도 아래와 같이 partition별로 컬럼이 조회 된 것을 볼 수 있다. 또한, where 조건을 걸어 partition별로 질의도 가능하다.
# 전체 조회한 결과 hive> select * from person; OK 1 James 22 Seoul 2022 9 15 2 Tom 23 New York 2022 9 15 3 Jace 24 San Francisco 2022 9 15 4 Catalina 25 Virginia 2022 9 15 5 Nilla 31 Florida 2022 9 15 6 Hyun 32 Busan 2022 9 15 7 Jack 45 Tokyo 2022 9 15 8 Micheal 82 Fukuoka 2022 9 15 9 Sunny 31 Gwangju 2022 9 15 10 John 42 Canberra 2022 9 15 11 James 94 Seoul 2022 9 16 12 Tom 47 New York 2022 9 16 13 Jace 56 San Francisco 2022 9 16 14 Catalina 22 Florida 2022 9 16 15 Nilla 36 Busan 2022 9 16 16 Hyun 22 Tokyo 2022 9 16 17 Jack 26 Fukuoka 2022 9 16 18 Micheal 52 Fukuoka 2022 9 16 19 Sunny 38 Gwangju 2022 9 16 20 John 34 Canberra 2022 9 16 21 james 22 Seoul 2022 9 17 22 Tom 47 New York 2022 9 17 23 Jace 56 San Francisco 2022 9 17 24 Catalina 36 Florida 2022 9 17 25 Nilla 26 Seoul 2022 9 17 26 Hyun 52 Seoul 2022 9 17 27 Jack 34 Tokyo 2022 9 17 28 Sunny 38 Fukuoka 2022 9 17 Time taken: 0.142 seconds, Fetched: 28 row(s) # partition 날짜 조건 hive> select * from person where day = 15; OK 1 James 22 Seoul 2022 9 15 2 Tom 23 New York 2022 9 15 3 Jace 24 San Francisco 2022 9 15 4 Catalina 25 Virginia 2022 9 15 5 Nilla 31 Florida 2022 9 15 6 Hyun 32 Busan 2022 9 15 7 Jack 45 Tokyo 2022 9 15 8 Micheal 82 Fukuoka 2022 9 15 9 Sunny 31 Gwangju 2022 9 15 10 John 42 Canberra 2022 9 15 Time taken: 0.445 seconds, Fetched: 10 row(s)
Partition 조회
hive> show partitions default.person; OK year=2022/month=9/day=15 year=2022/month=9/day=16 year=2022/month=9/day=17 Time taken: 0.073 seconds, Fetched: 3 row(s)
수동으로 partition추가
ALTER TABLE 버전
hive> ALTER TABLE person ADD PARTITION (year=2022, month=10, day=1) > LOCATION 'hdfs://namenode:8020/person/year=2022/month=10/day=1'; OK Time taken: 0.112 seconds hive> show partitions default.person; OK year=2022/month=10/day=1 year=2022/month=9/day=15 year=2022/month=9/day=16 year=2022/month=9/day=17 Time taken: 0.054 seconds, Fetched: 4 row(s) # hdfs dfs -ls /person/year=2022/ Found 2 items drwxr-xr-x - root supergroup 0 2022-10-11 05:54 /person/year=2022/month=10 drwxr-xr-x - root supergroup 0 2022-10-11 05:54 /person/year=2022/month=10/day=1
MSCK 버전
# 폴더를 먼저 추가한 상태에서는 msck 명령어 이용 # 테스트를 위해 spare 폴더 생성 hdfs dfs -mkdir /person/year=2022/month=10/day=2/ hdfs dfs -mkdir /person/year=2022/month=10/day=3/ hdfs dfs -ls /person/year=2022/month=10 Found 3 items drwxr-xr-x - root supergroup 0 2022-10-11 05:54 /person/year=2022/month=10/day=1 drwxr-xr-x - root supergroup 0 2022-10-11 05:59 /person/year=2022/month=10/day=2 drwxr-xr-x - root supergroup 0 2022-10-11 06:06 /person/year=2022/month=10/day=3
# msck 적용 hive> msck repair TABLE default.person; OK Partitions not in metastore: person:year=2022/month=10/day=2 person:year=2022/month=10/day=3 Repair: Added partition to metastore default.person:year=2022/month=10/day=2 Repair: Added partition to metastore default.person:year=2022/month=10/day=3 Time taken: 0.148 seconds, Fetched: 3 row(s)
# add partition 되었는지 확인 hive> show partitions default.person; OK year=2022/month=10/day=1 year=2022/month=10/day=2 year=2022/month=10/day=3 year=2022/month=9/day=15 year=2022/month=9/day=16 year=2022/month=9/day=17 Time taken: 0.044 seconds, Fetched: 6 row(s)
Partition 삭제
# 단일로 삭제 가능 ALTER TABLE default.person DROP PARTITION(day=1); hive> show partitions default.person; OK year=2022/month=10/day=2 year=2022/month=10/day=3 year=2022/month=9/day=15 year=2022/month=9/day=16 year=2022/month=9/day=17 Time taken: 0.063 seconds, Fetched: 5 row(s)
# 조건문으로 삭제 hive> ALTER TABLE default.person DROP PARTITION(day > 1); Dropped the partition year=2022/month=10/day=2 Dropped the partition year=2022/month=10/day=3 Dropped the partition year=2022/month=9/day=15 Dropped the partition year=2022/month=9/day=16 Dropped the partition year=2022/month=9/day=17 OK Time taken: 0.265 seconds hive> show partitions default.person; OK Time taken: 0.066 seconds
모든 샘플 소스는 제 깃헙에 있습니다.
참고문헌
반응형'데이터 엔지니어링 > Hive' 카테고리의 다른 글
[Hive] Merge문 이해하기 (0) 2022.10.26 [Hive] Hive에서 Transaction이란?(component, 제약조건, delta폴더, staging 폴더, insert, update, delete) (0) 2022.10.11 [Hive]Create table Location 알아보기 (0) 2022.10.06 [Hive] Hive query - Bucketing table 알아보기 (0) 2022.10.06 [Hive] Hive Query Table 관련 sql 정리(create, drop, truncate table) - options(if not exists, file format) (0) 2022.10.06