hive study notes two: complex data types

hive study notes two: complex data types

Welcome to my GitHub

github.com/zq2599/blog...

Content: Classification and summary of all original articles and supporting source code, involving Java, Docker, Kubernetes, DevOPS, etc.;

"Hive study notes" series navigation

  1. Basic data type
  2. Complex data type
  3. Internal table and external table
  4. Partition Table
  5. Bucketing
  6. HiveQL basics
  7. Built-in function
  8. Sqoop
  9. Basic UDF
  10. User Defined Aggregate Function (UDAF)
  11. UDTF

Overview of this article

  • As the second chapter of "Hive Study Notes", we have learned about the basic types before, and what we want to learn in this article is complex data types;
  • There are four types of complex data:
  1. ARRAY: array
  2. MAP: key-value pair
  3. STRUCT: named field collection
  4. UNION: Specify one of several data types, and the value of UNION must exactly match one of these data types;
  • Next, learn one by one;

Prepare the environment

  1. Make sure that hadoop has been started;
  2. Enter the interactive mode of the hive console;
  3. Execute the following command to make the field name appear in the query result:
set hive.cli.print.header=true; copy the code

ARRAY

  1. Create a table named t2, with only two fields person and friends. Person is a string type, and friends is an array type. When importing data through a text file, the separator between person and friends is a vertical bar, and there are multiple inside friends. The separator between elements is a comma, pay attention to the syntax for declaring the separator:
create table if not exists t2( person string, friends array < string > ) row format delimited fields terminated by '|' collection items terminated by ',' ; copy code
  1. Create a text file 002.txt with the following content. It can be seen that there are only two records. The first person field value is tom, and there are three elements in the friends field, separated by commas:
tom | tom_friend_0,tom_friend_1,tom_friend_2 jerry | jerry_friend_0,jerry_friend_1,jerry_friend_2,jerry_friend_3,jerry_friend_4,jerry_friend_5Copy code
  1. Execute the following statement to import data from the local 002.txt file to the t2 table:
load data local inpath'/home/hadoop/temp/202010/25/002.txt' into table t2; copy code
  1. View all data:
hive > select * from t2; OK t2.person t2.friends tom ["tom_friend_0","tom_friend_1","tom_friend_2"] jerry ["jerry_friend_0","jerry_friend_1","jerry_friend_2","jerry_friend_3","jerry_friend_4","jerry_friend_5"] Time taken: 0.052 seconds The, FETCHED: 2 Row (S) copying the code
  1. SQL to query an element in friends:
select person, friends[ 0 ], friends[ 3 ] from t2; Copy code

The execution result is as follows, the first record has no friends[3] and is displayed as NULL:

hive> select person, friends[0], friends[3] from t2; OK person _c1 _c2 tom tom_friend_0 NULL jerry jerry_friend_0 jerry_friend_3 Time taken: 0.052 seconds, Fetched: 2 row(s) Copy code
  1. Whether the array element contains a certain value of SQL:
SELECT Person, array_contains (Friends, 'tom_friend_0' ) from T2; duplicated code

The execution result is as follows, the first record has tom_friend_0 in the friends array, which is displayed as true, and the second record does not contain it, and it displays false:

hive> select person, array_contains(friends, 'tom_friend_0' ) from t2; OK person _c1 tom true jerry false Time taken: 0.061 seconds, Fetched: 2 row(s) Copy code
  1. There are three elements in the friends array of the first record. With the help of LATERAL VIEW syntax, these three elements can be split into three rows. The SQL is as follows:
SELECT t.person, single_friend from ( SELECT Person, Friends from T2 WHERE Person = 'Tom' ) T the LATERAL the VIEW the explode (t.friends) V AS single_friend; duplicated code

The execution result is as follows, it can be seen that each element in the array can be split into a single line:

OK t.person single_friend tom tom_friend_0 tom tom_friend_1 tom tom_friend_2 Time taken: 0.058 seconds, Fetched: 3 row(s) Copy code
  • The above is the basic operation of the array, the next is the key-value pair;

MAP, create table, import data

  • Next, I plan to create a table named t3. There are only two fields, person and address. Person is a string type, and address is a MAP type. When importing data through a text file, the delimiter is defined as follows:
  1. The separator between person and address is a vertical bar;
  2. There are multiple key-value pairs in address, and their separator is a comma;
  3. The separator between the key and value of each key-value pair is a colon;
  • The table building statement that meets the above requirements is as follows:
create table if not exists t3( person string, address map < string, string > ) row format delimited fields terminated by '|' collection items terminated by ',' map keys terminated by ':' ; copy code
  • Create the text file 003.txt, it can be seen that three types of separators are used to separate the fields, multiple elements in the MAP, and the key and value of each element:
tom | province:guangdong,city:shenzhen jerry | province:jiangsu,city:nanjingCopy code
  • Import the data of 003.txt to the t3 table:
load data local inpath '/home/hadoop/temp/202010/25/003.txt' into table t3; copy code

MAP, query

  1. View all data:
hive > select * from t3; OK t3.person t3.address tom {"province":"guangdong","city":"shenzhen"} jerry {"province":"jiangsu","city":"nanjing"} Time taken: 0.075 seconds The, FETCHED: 2 Row (S) copying the code
  1. View a key in MAP, the syntax is field["xxx"]:
hive > select person, address["province"] from t3; OK person _c1 tom guangdong jerry jiangsu Time taken: 0.075 seconds The, FETCHED: 2 Row (S) copying the code
  1. Using the if function, the following SQL is to determine whether there is a "street" key in the address field. If there is, it will display the corresponding value, and if it is not, it will display filed street not exists:
select person, IF (address [ 'Street' ] IS null , "Not Filed Street EXISTS", address [ 'Street' ]) from T3; duplicated code

The output is as follows. Since the address field has only two keys, province and city, it will display filed street not exists:

OK Filed Street Tom Not EXISTS Jerry Filed Street Not EXISTS Time taken: 0.087 seconds The, FETCHED: 2 Row (S) copying the code
  1. Use explode to display each key-value pair of the address field as a line:
hive > select explode(address) from t3; OK province guangdong city shenzhen province jiangsu city nanjing Time taken: 0.081 seconds The, FETCHED: . 4 Row (S) copying the code
  1. The above explode function can only display the address field. If you want to display other fields, you must continue the LATERAL VIEW syntax, as shown below, it can be seen that the previous array is expanded into one field, and the MAP is expanded into two fields, namely key and value:
SELECT t.person, address_key, address_value from ( SELECT Person, address from T3 WHERE Person = 'Tom' ) T the LATERAL the VIEW the explode (t.address) V AS address_key, address_value; duplicated code

The results are as follows:

OK tom province guangdong tom city shenzhen Time taken: 0.118 seconds The, FETCHED: 2 Row (S) copying the code
  1. The size function can view the number of key-value pairs in the MAP:
hive > select person, size(address) from t3; OK Tom 2 Jerry 2 Time taken: 0.082 seconds The, FETCHED: 2 Row (S) copying the code

STRUCT

  1. STRUCT is a record type. It encapsulates a named field collection with many attributes. Create a new table named t4. Its info field is of type STRUCT. There are two attributes, age and city, between person and info. The separator is a vertical bar, and the separator between multiple elements in info is a comma. Note the syntax for declaring the separator:
create table if not exists t4( person string, info struct < age: int , city:string > ) row format delimited fields terminated by '|' collection items terminated by ',' ; copy code
  1. Prepare a text file named 004.txt with the following content:
tom | 11 ,shenzhen jerry | 12 ,nanjingCopy code
  1. Load the data of 004.txt to the t4 table:
load data local inpath '/home/hadoop/temp/202010/25/004.txt' into table t4; copy code
  1. View all data of t4:
hive > select * from t4; OK tom {"age": 11 ,"city":"shenzhen"} {Jerry "Age": 12 is , "City": "Nanjing"} Time taken: 0.063 seconds The, FETCHED: 2 Row (S) copying the code
  1. To view the specified field, use filename.xxx syntax:
hive > select person, info.city from t4; OK tom shenzhen jerry nanjing Time taken: 0.141 seconds The, FETCHED: 2 Row (S) copying the code

UNION

  • The last one is UNIONTYPE, which is to choose one from several data types. Since the creation of UNIONTYPE data is designed to UDF (create_union), I won t expand it here. Let s take a look at the table creation statement:
The CREATE TABLE union_test (UNIONTYPE foo < int , Double , Array < String > , struct < A: int , B: String >> ); duplicated code
  • search result:
SELECT foo FROM union_test; { 0 : 1 } { 1 : 2.0 } { 2 :["three","four"]} { 3 :{"a": 5 ,"b":"five"}} { 2 :["six","seven"]} { 3 :{"a": 8 ,"b":"eight"}} { 0 : 9 } { 1 : 10.0 } Copy code
  • So far, we have actually operated on the basic data types and complex data types of hive. The next article will expand more hive knowledge and look forward to making progress with you;

You are not alone, Xinchen and original are with you all the way

  1. Java series
  2. Spring series
  3. Docker series
  4. kubernetes series
  5. Database + middleware series
  6. DevOps series

Welcome to pay attention to the public account: programmer Xin Chen

Search "Programmer Xin Chen" on WeChat, I am Xin Chen, and I look forward to traveling the Java world with you...

github.com/zq2599/blog...