ABOUT ME

-

Today
-
Yesterday
-
Total
-
  • [Hive] Partition이란?(add 추가, drop 삭제, show partitions 조회, partitions)
    데이터 엔지니어링/Hive 2022. 10. 11. 20:19
    반응형

    Hive에서 사용하는 partition이란?

    • table을 만들 partition을 이용하여 폴더 구조를 세분화 해서 관리 할 수 있다.
    • 폴더를 구조적으로 나눠주기 때문에 데이터 관리 측면에서 효율적이다.
    • 다만, 너무 많은 뎁스를 가지는 partition은 쿼리 나 성능면에서 악영향을 줄 수 있기 때문에 적당한 뎁스를 설정하는게 중요하다.

    Partition 테이블 생성

    CREATE TABLE if not exsits default.person(
        col1  INT,
        col2  STRING,
        col3  INT,
        col4  STRING
    )
    PARTITIONED BY(depth1 int, depth2 int, depth3 int) -- depth
    STORED AS ORC
    LOCATION 'hdfs://{hadoop_domain}:{port}/{path}';
    # 형식은 'hdfs://{hadoop_domain}:{port}/{path}/depth1/depth2/depth3' 로 관리 된다.
    
    hive> CREATE TABLE if not exists default.person(
        >     pid     INT,
        >     name    STRING,
        >     age     INT,
        >     address STRING
        > )
        > PARTITIONED BY(year int, month int, day int) -- depth
        > STORED AS ORC
        > LOCATION 'hdfs://namenode:8020/person/';
    OK
    Time taken: 0.092 seconds

    Partition 데이터 생성

    insert into table default.person partition(year=2022, month=9, day=15) values
    (1, 'James', 22, 'Seoul'),(2, 'Tom', 23, 'New York'),(3, 'Jace', 24, 'San Francisco'),
    (4, 'Catalina', 25, 'Virginia'),(5, 'Nilla', 31, 'Florida'),(6, 'Hyun', 32, 'Busan'),
    (7, 'Jack', 45, 'Tokyo'),(8, 'Micheal', 82, 'Fukuoka'),(9, 'Sunny', 31, 'Gwangju'),
    (10, 'John', 42, 'Canberra');
    
    insert into table default.person partition(year=2022, month=9, day=16) values
    (11, 'James', 94, 'Seoul'),(12, 'Tom', 47, 'New York'),(13, 'Jace', 56, 'San Francisco'),
    (14, 'Catalina', 22, 'Florida'),(15, 'Nilla', 36, 'Busan'),(16, 'Hyun', 22, 'Tokyo'),
    (17, 'Jack', 26, 'Fukuoka'),(18, 'Micheal', 52, 'Fukuoka'),(19, 'Sunny', 38, 'Gwangju'),
    (20, 'John', 34, 'Canberra');
    
    insert into table default.person partition(year=2022, month=9, day=17) values
    (21, 'james', 22, 'Seoul'),(22, 'Tom', 47, 'New York'),(23, 'Jace', 56, 'San Francisco'),
    (24, 'Catalina', 36, 'Florida'),(25, 'Nilla', 26, 'Seoul'),(26, 'Hyun', 52, 'Seoul'),
    (27, 'Jack', 34, 'Tokyo'),(28, 'Sunny', 38, 'Fukuoka');

    Hadoop에 생성된 Partition 폴더 확인

    # 위에 insert에서 만든 partition별로 폴더가 생긴걸 확인이 가능하다
    # hdfs dfs -ls /person/year=2022/month=9/
    Found 3 items
    drwxr-xr-x   - root supergroup          0 2022-10-11 05:38 /person/year=2022/month=9/day=15
    drwxr-xr-x   - root supergroup          0 2022-10-11 05:40 /person/year=2022/month=9/day=16
    drwxr-xr-x   - root supergroup          0 2022-10-11 05:40 /person/year=2022/month=9/day=17
    

    hive 서버에서 쿼리로 질의를 날려도 아래와 같이 partition별로 컬럼이 조회 된 것을 볼 수 있다. 또한, where 조건을 걸어 partition별로 질의도 가능하다.

    # 전체 조회한 결과
    hive> select * from person;
    OK
    1       James   22      Seoul   2022    9       15
    2       Tom     23      New York        2022    9       15
    3       Jace    24      San Francisco   2022    9       15
    4       Catalina        25      Virginia        2022    9       15
    5       Nilla   31      Florida 2022    9       15
    6       Hyun    32      Busan   2022    9       15
    7       Jack    45      Tokyo   2022    9       15
    8       Micheal 82      Fukuoka 2022    9       15
    9       Sunny   31      Gwangju 2022    9       15
    10      John    42      Canberra        2022    9       15
    11      James   94      Seoul   2022    9       16
    12      Tom     47      New York        2022    9       16
    13      Jace    56      San Francisco   2022    9       16
    14      Catalina        22      Florida 2022    9       16
    15      Nilla   36      Busan   2022    9       16
    16      Hyun    22      Tokyo   2022    9       16
    17      Jack    26      Fukuoka 2022    9       16
    18      Micheal 52      Fukuoka 2022    9       16
    19      Sunny   38      Gwangju 2022    9       16
    20      John    34      Canberra        2022    9       16
    21      james   22      Seoul   2022    9       17
    22      Tom     47      New York        2022    9       17
    23      Jace    56      San Francisco   2022    9       17
    24      Catalina        36      Florida 2022    9       17
    25      Nilla   26      Seoul   2022    9       17
    26      Hyun    52      Seoul   2022    9       17
    27      Jack    34      Tokyo   2022    9       17
    28      Sunny   38      Fukuoka 2022    9       17
    Time taken: 0.142 seconds, Fetched: 28 row(s)
    
    # partition 날짜 조건
    hive> select * from person where day = 15;
    OK
    1       James   22      Seoul   2022    9       15
    2       Tom     23      New York        2022    9       15
    3       Jace    24      San Francisco   2022    9       15
    4       Catalina        25      Virginia        2022    9       15
    5       Nilla   31      Florida 2022    9       15
    6       Hyun    32      Busan   2022    9       15
    7       Jack    45      Tokyo   2022    9       15
    8       Micheal 82      Fukuoka 2022    9       15
    9       Sunny   31      Gwangju 2022    9       15
    10      John    42      Canberra        2022    9       15
    Time taken: 0.445 seconds, Fetched: 10 row(s)
    

    Partition 조회

    hive> show partitions default.person;
    OK
    year=2022/month=9/day=15
    year=2022/month=9/day=16
    year=2022/month=9/day=17
    Time taken: 0.073 seconds, Fetched: 3 row(s)
    

    수동으로 partition추가

    ALTER TABLE 버전

    hive> ALTER TABLE person ADD PARTITION (year=2022, month=10, day=1)
        > LOCATION 'hdfs://namenode:8020/person/year=2022/month=10/day=1';
    OK
    Time taken: 0.112 seconds
    
    hive> show partitions default.person;
    OK
    year=2022/month=10/day=1
    year=2022/month=9/day=15
    year=2022/month=9/day=16
    year=2022/month=9/day=17
    Time taken: 0.054 seconds, Fetched: 4 row(s)
    
    # hdfs dfs -ls /person/year=2022/
    Found 2 items
    drwxr-xr-x   - root supergroup          0 2022-10-11 05:54 /person/year=2022/month=10
    drwxr-xr-x   - root supergroup          0 2022-10-11 05:54 /person/year=2022/month=10/day=1
    

    MSCK 버전

    # 폴더를 먼저 추가한 상태에서는 msck 명령어 이용
    # 테스트를 위해 spare 폴더 생성
    hdfs dfs -mkdir /person/year=2022/month=10/day=2/
    hdfs dfs -mkdir /person/year=2022/month=10/day=3/
    hdfs dfs -ls /person/year=2022/month=10
    Found 3 items
    drwxr-xr-x   - root supergroup          0 2022-10-11 05:54 /person/year=2022/month=10/day=1
    drwxr-xr-x   - root supergroup          0 2022-10-11 05:59 /person/year=2022/month=10/day=2
    drwxr-xr-x   - root supergroup          0 2022-10-11 06:06 /person/year=2022/month=10/day=3
    
    # msck 적용
    hive> msck repair TABLE default.person;
    OK
    Partitions not in metastore:    person:year=2022/month=10/day=2 person:year=2022/month=10/day=3
    Repair: Added partition to metastore default.person:year=2022/month=10/day=2
    Repair: Added partition to metastore default.person:year=2022/month=10/day=3
    Time taken: 0.148 seconds, Fetched: 3 row(s)
    
    # add partition 되었는지 확인
    hive> show partitions default.person;
    OK
    year=2022/month=10/day=1
    year=2022/month=10/day=2
    year=2022/month=10/day=3
    year=2022/month=9/day=15
    year=2022/month=9/day=16
    year=2022/month=9/day=17
    Time taken: 0.044 seconds, Fetched: 6 row(s)
    

    Partition 삭제

    # 단일로 삭제 가능
    ALTER TABLE default.person DROP PARTITION(day=1);
    
    hive> show partitions default.person;
    OK
    year=2022/month=10/day=2
    year=2022/month=10/day=3
    year=2022/month=9/day=15
    year=2022/month=9/day=16
    year=2022/month=9/day=17
    Time taken: 0.063 seconds, Fetched: 5 row(s)
    
    # 조건문으로 삭제
    hive> ALTER TABLE default.person DROP PARTITION(day > 1);
    Dropped the partition year=2022/month=10/day=2
    Dropped the partition year=2022/month=10/day=3
    Dropped the partition year=2022/month=9/day=15
    Dropped the partition year=2022/month=9/day=16
    Dropped the partition year=2022/month=9/day=17
    OK
    Time taken: 0.265 seconds
    
    hive> show partitions default.person;
    OK
    Time taken: 0.066 seconds
    

     

     

     

     

    모든 샘플 소스는 제 깃헙에 있습니다.

     

    GitHub - hyunseokjoo/hive_sample_code

    Contribute to hyunseokjoo/hive_sample_code development by creating an account on GitHub.

    github.com

     

     

     

     

    참고문헌 

    https://moons08.github.io/programming/Hive_Partition/

    반응형

    댓글

Designed by Tistory.