[hadoop] hadoop_2


Hadoop_2


hadoop_img_4

How MapReduce works on a Hadoop cluster

hadoop_img_5


Job Tracker

- Resource management : tracks the resources (available storage, etc.) of the nodes where the analysis actually runs (Task Trackers), including whether they are available and how much is in use

- Execution management : deploys, schedules, and monitors the MapReduce jobs that actually run the analysis

Task Tracker

- Receives work from the Job Tracker, runs it, and reports the execution status back.


★ MapReduce Framework

What is MapReduce

A distributed processing system with a master/slave architecture, based on the paper 'MapReduce: Simplified Data Processing on Large Clusters' published by Google in 2004.

Characteristics of MapReduce

- Code is shipped to the servers that hold the data

Because the data is far larger than the source code, sending the code to the servers that hold the data to be processed speeds up execution.

- Data is processed as transformations of key/value datasets

Data is handled as key/value pairs and is processed in two main steps.

A. Map Task

The input data is split into several pieces, and the pieces are processed in parallel on the servers, one task per piece. Each of these is called a Map Task.

Both the input and the output of a Map Task are key/value pairs.

B. Reduce Task

This step gathers the results that the Map Tasks produced in parallel on each server and performs the final processing. Reduce Tasks can also run on one or more servers. In this step, pairs with the same key among the records emitted by the map tasks are bundled into a single record. That record is passed to a reduce task, which processes it into yet another key/value pair and stores it at the HDFS location the programmer specified.

The MapReduce framework bundles the pairs that share a key on its own, so the programmer does not need to handle it and only has to process each bundled set of key/value pairs for the job at hand.

- Share Nothing

MapReduce is highly parallel, because map tasks do not depend on each other and reduce tasks do not depend on each other.

- Suited to offline batch processing

Given these characteristics, MapReduce is a system for offline batch processing of large data, not a system for real-time processing.

MapReduce flow

Every MapReduce program must implement map and reduce.

A MapReduce program basically has the following properties.

- It consists of two stages, map and reduce

- In both map and reduce, the input and output data consist of keys and values.

The map stage transforms each given key and value into a new key and value.

Once all input records have been processed by map, the reduce work begins.

Records emitted by map that have the same key are gathered and passed to reduce as a single input.

hadoop_img_6

Suppose there are three input texts, so three map tasks exist, and there are two reduce tasks.

Map splits the incoming text into words and, for each word as it arrives, emits a Key, Value pair with a frequency of 1.

Before reduce runs, the outputs of all map tasks are gathered so that each word becomes a key and the list of its frequencies becomes the value; this is the input to reduce.

Reduce then simply scans the frequency list it received as the value and adds up the frequencies.

The reduce output therefore has the word as the key and the word's total frequency as the value.
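
The word-count flow described above can be sketched in plain Java. This is only a single-process simulation under stated assumptions: the class `WordCountSketch` and its method names are hypothetical, no Hadoop classes are used, and the `shuffle` method stands in for the grouping that the framework performs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the word-count flow; not the Hadoop API.
final class WordCountSketch {

    // Map step: split one input text into words and emit (word, 1) for each.
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle step (done by the framework in real MapReduce):
    // collect the values of identical keys into one list per key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: scan the frequency list of one word and add it up.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Runs map over every input text, then shuffle, then reduce per key.
    static Map<String, Integer> run(List<String> inputs) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String text : inputs) {
            mapped.addAll(map(text));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        // Three input texts, as in the walk-through above.
        System.out.println(run(List.of("read a book", "write a book", "a book")));
    }
}
```

In a real job the three texts would be handled by three separate map tasks and the keys would be spread over the reduce tasks; the simulation keeps everything in one process to show only the data transformation.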

hadoop_img_7

[Exercise]

Project : hadoop_mapreduce

Package : WordCount

Class : WordCountMapper.java

        WordCountReducer.java

        WordCount.java

1. Create the input.txt file

[hadoop@master jar]$ vi input.txt
[hadoop@master jar]$ cat input.txt

2. Upload the file

[hadoop@master jar]$ hadoop fs -put input.txt input.txt
[hadoop@master jar]$ hadoop fs -ls

3. Write the source code

4. Create the JAR - WordCount.jar

5. Run WordCount

Input file : input.txt

Output folder : output

[hadoop@master jar]$ hadoop jar WordCount.jar WordCount.WordCount input.txt output

6. Result

[hadoop@master jar]$ hadoop fs -ls
[hadoop@master jar]$ hadoop fs -ls output
Found 2 items
-rw-r--r--  1 hadoop supergroup     0 2017-02-22 17:35 output/_SUCCESS
-rw-r--r--  1 hadoop supergroup     58 2017-02-22 17:35 output/part-r-00000
[hadoop@master jar]$ 
[hadoop@master jar]$ hadoop fs -cat output/part-r-00000

[Exercise]

Project : hadoop_mapreduce

Package : CharCount

Class : CharCountMapper.java

        CharCountReducer.java

        CharCount.java

Input file : input_str.txt

Output folder : output_str

[hadoop@master jar]$ vi input_str.txt
ILOVEYOUILIKEYOU

Use the charAt() method of the String class.

I 1

L 1

O 1

V 1

E 1

Y 1

O 1

U 1

[Exercise]

Project : hadoop_mapreduce

Package : NewsCount

Class : NewsCountMapper.java

        NewsCountReducer.java

        NewsCount.java

Input file : input_news.txt

Output folder : output_news

[hadoop@master jar]$ vi input_news.txt
(an English news article from the internet)

Print how many times each letter from A to Z occurs (do not distinguish upper and lower case)

A 1

B 3

C 10

...

Z 7
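
One possible way to approach the letter count is sketched below in plain Java. It is not the Hadoop solution the exercise asks for: the class `LetterCountSketch` is hypothetical, and the map and reduce steps are collapsed into one loop to show only the charAt() and case-folding logic.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of case-insensitive letter counting; not the Hadoop API.
final class LetterCountSketch {

    // Counts the letters A to Z in the text, ignoring case.
    static Map<Character, Integer> count(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toUpperCase(text.charAt(i));
            if (c >= 'A' && c <= 'Z') {
                // A mapper would emit (c, 1) here; a reducer would add the 1s up.
                counts.merge(c, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        count("I love you, I like you").forEach((letter, n) -> System.out.println(letter + " " + n));
    }
}
```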


★ Airline Departure Delay Analysis

1. Download the exercise data

Go to http://stat-computing.org/dataexpo/2009 and click "Download the data"

Download the data for 1987, 1988, and 1989.

2. Copy the downloaded files to the workspace

3. Decompress the files

[hadoop@master workspace]$ bzip2 -d 1987.csv.bz2
[hadoop@master workspace]$ bzip2 -d 1988.csv.bz2
[hadoop@master workspace]$ bzip2 -d 1989.csv.bz2

or

[hadoop@master workspace]$ bzip2 -d *.bz2

4. The first line of each file is a header row. Delete it.

[hadoop@master workspace]$ more 1987.csv
[hadoop@master workspace]$ sed -e '1d' 1987.csv > 1987_new.csv
(repeat for 1988 and 1989)

5. Upload the files for analysis to HDFS.

[hadoop@master workspace]$ hadoop fs -ls
[hadoop@master workspace]$ hadoop fs -mkdir airline_input
[hadoop@master workspace]$ hadoop fs -put *_new.csv airline_input
[hadoop@master workspace]$ hadoop fs -ls
[hadoop@master workspace]$ hadoop fs -ls airline_input

6. Implement AirlinePerformanceParser.java

The statistics data is stored as comma (,) separated values.

Create an object that corresponds to one line of the statistics data.

The constructor takes one line of the csv file and stores the data needed for the analysis in its member variables.
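
A minimal sketch of such a parser follows. The class name `AirlineRecordSketch`, its field names, and the column positions are all assumptions made for illustration: the positions follow the column order published with the Data Expo 2009 files (Year first, Month second, ArrDelay 15th, DepDelay 16th), and missing values are assumed to be written as NA. The row in `main` is an illustrative record in that layout.

```java
// Hypothetical parser for one CSV line; column positions are assumptions.
final class AirlineRecordSketch {
    final int year;
    final int month;
    final boolean arrivalDelayAvailable;
    final int arrivalDelay;
    final boolean departureDelayAvailable;
    final int departureDelay;

    AirlineRecordSketch(String csvLine) {
        String[] columns = csvLine.split(",");
        year = Integer.parseInt(columns[0]);
        month = Integer.parseInt(columns[1]);
        // Assumed 0-based positions: 14 = ArrDelay, 15 = DepDelay; "NA" marks a missing value.
        arrivalDelayAvailable = !columns[14].equals("NA");
        arrivalDelay = arrivalDelayAvailable ? Integer.parseInt(columns[14]) : 0;
        departureDelayAvailable = !columns[15].equals("NA");
        departureDelay = departureDelayAvailable ? Integer.parseInt(columns[15]) : 0;
    }

    public static void main(String[] args) {
        AirlineRecordSketch record = new AirlineRecordSketch(
                "1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA");
        System.out.println(record.year + "," + record.month
                + " arrivalDelay=" + record.arrivalDelay
                + " departureDelay=" + record.departureDelay);
    }
}
```

Keeping the "available" flags next to the values lets a mapper tell a missing delay apart from a delay of zero.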

7. Airline departure delay analysis

A program that counts how many flights had a delayed departure in each year (counted per year and month).

Input and output data types of the departure delay MapReduce job

Class    I/O     Key                        Value
--------------------------------------------------------------
Mapper   Input   Line number                Airline flight statistics record
         Output  Flight year,flight month   Departure delay count
Reducer  Input   Flight year,flight month   Departure delay count
         Output  Flight year,flight month   Total departure delay count
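
The key and value flow in the table can be sketched in plain Java as follows. This is a single-process illustration, not the course's Hadoop classes: `DepartureDelayCountSketch`, `row`, and `mapKey` are hypothetical names, and the column positions (0 = year, 1 = month, 15 = DepDelay, with NA for missing values) are assumptions based on the published column order of the data set.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the departure delay count; not the Hadoop API.
final class DepartureDelayCountSketch {

    // Builds a test CSV row in the assumed layout; every other column is "NA".
    static String row(int year, int month, String depDelay) {
        String[] columns = new String[29];
        Arrays.fill(columns, "NA");
        columns[0] = String.valueOf(year);
        columns[1] = String.valueOf(month);
        columns[15] = depDelay;
        return String.join(",", columns);
    }

    // Map step: returns the output key "year,month" when the departure was
    // delayed, or null when nothing should be emitted for this line.
    static String mapKey(String csvLine) {
        String[] columns = csvLine.split(",");
        String depDelay = columns[15];
        if (depDelay.equals("NA") || Integer.parseInt(depDelay) <= 0) {
            return null;
        }
        return columns[0] + "," + columns[1];
    }

    // Reduce step: each emitted key carries the value 1; add them up per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            String key = mapKey(line);
            if (key != null) {
                totals.merge(key, 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
                row(1987, 10, "11"), row(1987, 10, "-3"),
                row(1987, 10, "5"), row(1987, 11, "NA"), row(1987, 11, "2"))));
    }
}
```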

Project : hadoop_mapreduce

Package : AirlinePerformance

Class : AirlinePerformanceParser.java

Mapper implementation : DepartureDelayCountMapper.java

Reducer implementation : DelayCountReducer.java

Driver class implementation : DepartureDelayCount.java

JAR : AirlinePerformanceDeparture.jar

Input folder : airline_input

Output folder : dep_delay_count

[hadoop@master jar]$ hadoop jar AirlinePerformanceDeparture.jar AirlinePerformance.DepartureDelayCount airline_input dep_delay_count
[hadoop@master jar]$ hadoop fs -ls
Found 5 items
drwxr-xr-x  - hadoop supergroup     0 2017-03-25 12:58 airline_input
drwxr-xr-x  - hadoop supergroup     0 2017-03-26 09:51 dep_delay_count
[hadoop@master jar]$ hadoop fs -ls dep_delay_count
Found 2 items
-rw-r--r--  1 hadoop supergroup     0 2017-03-26 09:51 dep_delay_count/_SUCCESS
-rw-r--r--  1 hadoop supergroup    387 2017-03-26 09:51 dep_delay_count/part-r-00000
[hadoop@master jar]$ hadoop fs -cat dep_delay_count/part-r-00000


★ Airline Arrival Delay Analysis

Class    I/O     Key                        Value
--------------------------------------------------------------
Mapper   Input   Line number                Airline flight statistics record
         Output  Flight year,flight month   Arrival delay count
Reducer  Input   Flight year,flight month   Arrival delay count
         Output  Flight year,flight month   Total arrival delay count

Package : AirlinePerformance

Class : AirlinePerformanceParser.java

Mapper implementation : ArrivalDelayCountMapper.java

Reducer implementation : DelayCountReducer.java

 - Unchanged: it still groups by flight year and month and computes the total

Driver class implementation : ArrivalDelayCount.java

JAR : AirlinePerformanceArrival.jar

Input folder : airline_input

Output folder : arr_delay_count

[hadoop@master jar]$ hadoop jar AirlinePerformanceArrival.jar AirlinePerformance.ArrivalDelayCount airline_input arr_delay_count
[hadoop@master jar]$ hadoop fs -ls arr_delay_count
[hadoop@master jar]$ hadoop fs -cat arr_delay_count/part-r-00000


[Problem] Crime survey for 2013 to 2015

Project : hadoop_mapreduce

Package : CrimePerformance

Class : CrimePerformanceParser.java

Mapper implementation : CrimeCountMapper.java

Reducer implementation : CrimeCountReducer.java

Driver class implementation : CrimeCount.java

JAR : CrimePerformanceType.jar

1. Download the exercise data

Go to http://data.gov and search for Crime

Download the data for 2013, 2014, and 2015.

2. Copy the downloaded files to the workspace

3. The first line of each file is a header row. Delete it.

[hadoop@master workspace]$ sed -e '1d' Crime_Data_2013.csv > Crime_Data_2013_new.csv
[hadoop@master workspace]$ sed -e '1d' Crime_Data_2014.csv > Crime_Data_2014_new.csv
[hadoop@master workspace]$ sed -e '1d' Crime_Data_2015.csv > Crime_Data_2015_new.csv

4. Upload the files for analysis to HDFS.

[hadoop@master workspace]$ hadoop fs -ls
[hadoop@master workspace]$ hadoop fs -mkdir crime_input
[hadoop@master workspace]$ hadoop fs -put Crime_Data_2013_new.csv crime_input
[hadoop@master workspace]$ hadoop fs -put Crime_Data_2014_new.csv crime_input
[hadoop@master workspace]$ hadoop fs -put Crime_Data_2015_new.csv crime_input
[hadoop@master workspace]$ hadoop fs -ls

5. Write the source code

6. Crime event data analysis

Class    I/O     Key                     Value
--------------------------------------------------------------
Mapper   Input   Line number             Crime event record
         Output  Year,month,crime type   Event count
Reducer  Input   Year,month,crime type   Event count
         Output  Year,month,crime type   Total event count

[hadoop@master jar]$ hadoop jar CrimePerformanceType.jar CrimePerformance.CrimeCount crime_input crime_output
Crime types that appear in the data include:
DRUGS/ALCOHOL VIOLATIONS
VEHICLE BREAK-IN/THEFT
VANDALISM
ASSAULT
OTHER
BURGLARY


★ Airline Departure/Arrival Delay Analysis (Parameters)

So far, analyzing departure and arrival delays required compiling separately and running separately.

This time, take a user option: when the workType parameter is departure, analyze departure delays, and when it is arrival, analyze arrival delays.

Package : AirlinePerformanceWorkType

Class : AirlinePerformanceParser.java

Mapper implementation : DelayCountMapper.java

Reducer implementation (unchanged) : DelayCountReducer.java

Driver class implementation : DelayCount.java

JAR : AirlinePerformanceWorkType.jar

[hadoop@master jar]$ hadoop jar AirlinePerformanceWorkType.jar AirlinePerformanceWorkType.DelayCount -D workType=departure airline_input departure_delay_count
[hadoop@master jar]$ hadoop fs -ls
[hadoop@master jar]$ hadoop fs -ls departure_delay_count
[hadoop@master jar]$ hadoop fs -cat departure_delay_count/part-r-00000
[hadoop@master jar]$ hadoop jar AirlinePerformanceWorkType.jar AirlinePerformanceWorkType.DelayCount -D workType=arrival airline_input arrival_delay_count

1. GenericOptionsParser

Parses the options entered on the Hadoop console command.

GenericOptionsParser recognizes the parameters that the user typed on the Hadoop console command line.

With -D, a job can be written to behave differently depending on the parameter.

2. Tool

Through Tool's run method, the parameters entered when Hadoop is launched can be read and applied.

3. ToolRunner

Tool's run method should be invoked only through ToolRunner.

4. Writing the Mapper

Apply logic that maps the data differently depending on the parameter value given when Hadoop is run.

- Extract the value of the parameter given on the Hadoop command line

- Apply different logic depending on that value

5. Writing the Reducer

It only has to add up the emitted values, so it does not need to change.

6. Writing the Driver

Extend Configured and implement Tool.

Override the run method so that it does everything the Driver did before.

If the input and output paths are given as command-line parameters, read them with the getRemainingArgs() method of GenericOptionsParser, which returns the values left after the user options are removed.
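
The split between -D options and the remaining arguments can be illustrated with a small plain-Java stand-in. This is not GenericOptionsParser itself: `WorkTypeOptionsSketch` and `delayColumnFor` are hypothetical names, only the spaced form `-D key=value` is handled, and the column names returned are assumptions used to show how a mapper might branch on workType.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for -D option handling; not the Hadoop class.
final class WorkTypeOptionsSketch {
    final Map<String, String> properties = new HashMap<>();
    final List<String> remainingArgs = new ArrayList<>();

    WorkTypeOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // "-D key=value" becomes a configuration property.
                String[] keyValue = args[++i].split("=", 2);
                properties.put(keyValue[0], keyValue.length > 1 ? keyValue[1] : "");
            } else {
                // Everything else is left over, like the input and output paths.
                remainingArgs.add(args[i]);
            }
        }
    }

    // Mapper-side decision: which delay column to count for the given workType.
    static String delayColumnFor(String workType) {
        switch (workType) {
            case "departure":
                return "DepDelay";
            case "arrival":
                return "ArrDelay";
            default:
                throw new IllegalArgumentException("unknown workType: " + workType);
        }
    }

    public static void main(String[] args) {
        WorkTypeOptionsSketch options = new WorkTypeOptionsSketch(
                new String[] {"-D", "workType=departure", "airline_input", "departure_delay_count"});
        System.out.println(options.properties + " " + options.remainingArgs);
        System.out.println(delayColumnFor(options.properties.get("workType")));
    }
}
```

In the actual driver, the property would be read from the job configuration inside the mapper, and the remaining arguments would supply the input and output paths.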


★ Airline Departure/Arrival Delay Analysis (Counters)

Using counters : each time map reads a line of the input file in the MapReduce framework, a condition can be checked and, if it holds, a user-defined counter can be incremented.

Ex. When a line of airline data is read and the flight departed early, the user-defined counter early_departure is incremented by 1.

After all the files have been read, the counter values are printed.
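
The counting rule can be sketched in plain Java with an enum and an EnumMap. The counter names match the departure counters that appear in the job output log later in these notes; the class `DelayCounterSketch`, the method name, and the exact conditions (negative delay = early, zero = on schedule, NA = not available) are assumptions for illustration, and no Hadoop counter API is used.

```java
import java.util.EnumMap;
import java.util.Map;

// Plain-Java sketch of user-defined counters; not the Hadoop counter API.
final class DelayCounterSketch {

    enum DelayCounters { early_departure, scheduled_departure, not_available_departure }

    // Returns true when the flight departed late (the mapper would emit a record).
    // Otherwise increments the matching counter and returns false.
    static boolean isDelayedDeparture(String depDelay, Map<DelayCounters, Long> counters) {
        if (depDelay.equals("NA")) {
            counters.merge(DelayCounters.not_available_departure, 1L, Long::sum);
            return false;
        }
        int minutes = Integer.parseInt(depDelay);
        if (minutes > 0) {
            return true;
        }
        DelayCounters counter = minutes == 0
                ? DelayCounters.scheduled_departure
                : DelayCounters.early_departure;
        counters.merge(counter, 1L, Long::sum);
        return false;
    }

    public static void main(String[] args) {
        Map<DelayCounters, Long> counters = new EnumMap<>(DelayCounters.class);
        int delayed = 0;
        for (String depDelay : new String[] {"11", "-3", "0", "NA", "25", "-1"}) {
            if (isDelayedDeparture(depDelay, counters)) {
                delayed++;
            }
        }
        System.out.println("delayed=" + delayed + " counters=" + counters);
    }
}
```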

Package : AirlinePerformanceCounter

Class : AirlinePerformanceParser.java

Mapper implementation : DelayCountMapperWithCounter.java

Reducer implementation (unchanged) : DelayCountReducer.java

Driver class implementation : DelayCountWithCounter.java

Constants : DelayCounters.java

JAR : AirlinePerformanceCounter.jar

[Run]

[hadoop@master jar]$ hadoop jar AirlinePerformanceCounter.jar AirlinePerformanceCounter.DelayCountWithCounter -D workType=departure airline_input departure_delay_count_counter

[hadoop@master jar]$ hadoop jar AirlinePerformanceCounter.jar AirlinePerformanceCounter.DelayCountWithCounter -D workType=arrival airline_input arrival_delay_count_counter


★ Airline Departure/Arrival Delay Analysis (Multiple Outputs)

Up to the previous example, getting both the arrival and departure results took one compile but two runs.

This time a single run analyzes departures and arrivals together and writes the output into two separate files.

When the mapper reads a line of the input file, it adds D to the mapper output key for a departure delay and A for an arrival delay. The reducer then uses A and D to tell arrivals from departures and processes both in one pass.
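
The D/A key prefix idea can be sketched in plain Java as follows. This only simulates the routing: `MultipleOutputsSketch` and its methods are hypothetical, the key layout "D,year,month" is an assumption, and the two inner maps stand in for the separate departure and arrival output files rather than using Hadoop's multiple-output support.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of prefix-based output routing; not the Hadoop API.
final class MultipleOutputsSketch {

    // Map step: one flight can emit a "D,..." key, an "A,..." key, both, or none.
    static List<String> mapKeys(int year, int month, int depDelay, int arrDelay) {
        List<String> keys = new ArrayList<>();
        if (depDelay > 0) {
            keys.add("D," + year + "," + month);
        }
        if (arrDelay > 0) {
            keys.add("A," + year + "," + month);
        }
        return keys;
    }

    // Reduce step: count per key, routing each key by its prefix
    // to the "departure" or the "arrival" output.
    static Map<String, Map<String, Integer>> reduce(List<String> keys) {
        Map<String, Map<String, Integer>> outputs = new TreeMap<>();
        outputs.put("departure", new TreeMap<>());
        outputs.put("arrival", new TreeMap<>());
        for (String key : keys) {
            String outputName = key.startsWith("D,") ? "departure" : "arrival";
            String yearMonth = key.substring(2); // strip the "D," or "A," prefix
            outputs.get(outputName).merge(yearMonth, 1, Integer::sum);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.addAll(mapKeys(1987, 10, 11, 23));
        keys.addAll(mapKeys(1987, 10, -2, 5));
        keys.addAll(mapKeys(1987, 11, 4, 0));
        System.out.println(reduce(keys));
    }
}
```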

Package : AirlinePerformanceMultiple

Class : AirlinePerformanceParser.java

Mapper implementation : DelayCountMapperWithMultipleOutputs.java

Reducer implementation : DelayCountReducerWithMultipleOutputs.java

Driver class implementation : DelayCountWithMultipleOutputs.java

Constants : DelayCounters.java

JAR : AirlinePerformanceMultiple.jar

[Run]

[hadoop@master jar]$ hadoop jar AirlinePerformanceMultiple.jar AirlinePerformanceMultiple.DelayCountWithMultipleOutputs airline_input delay_count_multiple
 
[hadoop@master jar]$ hadoop fs -ls delay_count_multiple
Found 4 items
-rw-r--r--  1 hadoop supergroup    0 2016-02-26 15:08 delay_count_multiple/_SUCCESS
-rw-r--r--  1 hadoop supergroup   387 2016-02-26 15:08 delay_count_multiple/arrival-r-00000
-rw-r--r--   1 hadoop  supergroup    387 2016-02-26 15:08 delay_count_multiple/departure-r-00000
-rw-r--r--  1 hadoop supergroup    0 2016-02-26 15:08 delay_count_multiple/part-r-00000

[hadoop@master jar]$ hadoop fs -cat delay_count_multiple/departure-r-00000
[hadoop@master jar]$ hadoop fs -cat delay_count_multiple/arrival-r-00000


[Problem] Crime survey for 2013 to 2015

Using MultipleOutputs, produce in a single run a separate file for each of crime type (CRIME_TYPE), crime scene (PREMISE_TYPE), and location (BLOCK_ADDRESS)

Project : hadoop_mapreduce

Package : CrimePerformanceMultiple

Class : CrimePerformanceParser.java

Mapper implementation : CrimeCountMapperMultiple.java

Reducer implementation : CrimeCountReducerMultiple.java

Driver class implementation : CrimeCountMultiple.java

JAR : CrimePerformanceMultiple.jar

[hadoop@master jar]$ hadoop jar CrimePerformanceMultiple.jar CrimePerformanceMultiple.CrimeCountMultiple crime_input crime_output_multiple


[hadoop@master jar]$ hadoop fs -ls crime_output_multiple
Found 5 items
-rw-r--r--  1 hadoop supergroup     0 2017-02-28 20:13 crime_output_multiple/_SUCCESS

-rw-r--r--  1 hadoop supergroup  9433906 2017-02-28 20:13 crime_output_multiple/blockAddress-r-00000

-rw-r--r--  1 hadoop supergroup   12881 2017-02-28 20:13 crime_output_multiple/crimeType-r-00000

-rw-r--r--  1 hadoop supergroup     0 2017-02-28 20:13 crime_output_multiple/part-r-00000

-rw-r--r--  1 hadoop supergroup   50775 2017-02-28 20:13 crime_output_multiple/premiseType-r-00000

[hadoop@master jar]$ hadoop fs -cat crime_output_multiple/crimeType-r-00000

[hadoop@master jar]$ hadoop fs -cat crime_output_multiple/premiseType-r-00000

[hadoop@master jar]$ hadoop fs -cat crime_output_multiple/blockAddress-r-00000


★ MapReduce Sorting Techniques


Sorting in Hadoop

Hadoop MapReduce sorts by key by default, so if only one reduce task runs, sorting is easy to solve.

With several data nodes, however, you would not run only one reduce task, because that gives up the advantage of the distributed environment.

We therefore need to be able to sort while using multiple data nodes in the distributed environment.

To make this possible, we can use Secondary Sort, Partial Sort, and Total Sort.


Secondary Sort

Secondary sort groups records by part of the key and gives the grouped records an order.

It can be implemented with a user-defined partitioner and a GroupingComparator.


1. Define a composite key that combines the existing key values.

Decide at this point which of the key values to use as the grouping key.

A composite key is a kind of key set that combines the existing key values.

2. Define a comparator for sorting records by the composite key.

The composite key comparator compares two composite keys to determine the sort order.

3. Define a partitioner that partitions by the grouping key.

The partitioner decides which reduce task receives each map task output record.

The partition is determined by the key value of the map task output.

4. Define a comparator for the grouping key.

The reducer uses the group key comparator to process the grouped data (all the data for the same year) in a single Reducer group.
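
The four pieces can be sketched together in plain Java. This is an illustration under assumptions, not the course's Hadoop classes: `SecondarySortSketch`, the `DateKey` fields (year as the grouping key, month as the secondary field), and the hash-based `partition` method are hypothetical, and the grouping that the framework performs is simulated by walking the sorted keys.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the secondary sort building blocks; not the Hadoop API.
final class SecondarySortSketch {

    // 1. Composite key: year is the grouping key, month is the secondary field.
    static final class DateKey {
        final int year;
        final int month;

        DateKey(int year, int month) {
            this.year = year;
            this.month = month;
        }
    }

    // 2. Composite key comparator: order by year, then by month.
    static final Comparator<DateKey> COMPOSITE_KEY_ORDER =
            Comparator.comparingInt((DateKey k) -> k.year).thenComparingInt(k -> k.month);

    // 3. Partitioner: uses only the grouping key, so one year never
    //    ends up split across reduce tasks.
    static int partition(DateKey key, int numReduceTasks) {
        return (Integer.hashCode(key.year) & Integer.MAX_VALUE) % numReduceTasks;
    }

    // 4. Grouping comparator: keys with the same year belong to one reduce group.
    static final Comparator<DateKey> GROUP_KEY_ORDER = Comparator.comparingInt(k -> k.year);

    // Simulates what one reducer sees: keys sorted by the composite comparator,
    // cut into groups wherever the grouping comparator reports a change.
    static Map<Integer, List<Integer>> reduceGroups(List<DateKey> keys) {
        List<DateKey> sorted = new ArrayList<>(keys);
        sorted.sort(COMPOSITE_KEY_ORDER);
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        DateKey previous = null;
        for (DateKey key : sorted) {
            if (previous == null || GROUP_KEY_ORDER.compare(previous, key) != 0) {
                groups.put(key.year, new ArrayList<>());
            }
            groups.get(key.year).add(key.month);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(reduceGroups(List.of(
                new DateKey(1988, 3), new DateKey(1987, 12),
                new DateKey(1988, 1), new DateKey(1987, 2))));
    }
}
```

Within each year group the months arrive already ordered, which is the point of the technique: the reducer does not have to sort the values itself.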


[Exercise] Secondary Sort

Before starting the exercise, briefly study how Writable is used in Java

Project : hadoop_sort

Package : exam

Class : Person

        PersonMain

Project : hadoop_sort

Package : AirlinePerformanceSecondarySort

AirlinePerformanceParser.java

DelayCounters.java

DateKey.java

DateKeyComparator.java

GroupKeyPartitioner.java

GroupKeyComparator.java

DelayCountMapperWithDateKey.java

DelayCountReducerWithDateKey.java

DelayCountWithDateKey.java

JAR : AirlinePerformanceSecondarySort.jar

[hadoop@master jar]$ hadoop jar AirlinePerformanceSecondarySort.jar AirlinePerformanceSecondarySort.DelayCountWithDateKey airline_input delay_count_sort


... (output omitted)
 

       Map-Reduce Framework

              Map input records=11555122

              Map output records=11273069

              Map output bytes=202915242

              Map output materialized bytes=225461434

              Input split bytes=1098

              Combine input records=0

              Combine output records=0

              Reduce input groups=6 (departure and arrival for 1987, 1988, and 1989 => 6 groups in total)

              Reduce shuffle bytes=225461434

              Reduce input records=11273069

              Reduce output records=0

              Spilled Records=22546138

              Shuffled Maps =9

              Failed Shuffles=0

              Merged Map outputs=9

              GC time elapsed (ms)=16648

              CPU time spent (ms)=78070

              Physical memory (bytes) snapshot=2027245568

              Virtual memory (bytes) snapshot=20612046848

              Total committed heap usage (bytes)=1439666176

       AirlinePerformanceSecondarySort.DelayCounters

              early_arrival=4380215

              early_departure=1984103

              not_available_arrival=177103

              not_available_departure=144013

              scheduled_arrival=567687

              scheduled_departure=4584054


Partial Sort

Partial sort converts the mapper output into MapFiles and searches the data there.

The files to be sorted are first written as sequence files in the map stage, the sequence files are converted to MapFiles, and the wanted data is then looked up in the MapFiles.

Even when several MapFiles have been created, each MapFile stores only the data for its own keys, which is what makes the lookup possible.


1. Create sequence files from the input data.

- The mapper performs no computation on the input data, so no reducer is needed

2. Convert the sequence files to MapFiles.

- A MapFile is a sorted sequence file with an index so that keys can be looked up

- Physically, a MapFile consists of an index file that stores the index and a data file that stores the data

3. Search the data in the MapFiles.

- The partitioner determines which MapFile to search for a key

- The partition of a record is decided by the key of the map task output


When working with data, a total sort is sometimes needed, but when only the data for a specific key has to be looked up, as here, use partial sort.
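
The lookup idea can be sketched in plain Java. This is only an in-memory analogy: `PartialSortSketch` is a hypothetical class, each TreeMap stands in for one MapFile (sorted data that can be looked up by key), and the hash-based `partition` method is an assumption about how keys are assigned to partitions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// In-memory analogy of partitioned MapFile lookup; not the Hadoop API.
final class PartialSortSketch {
    // One sorted map per partition stands in for one MapFile.
    final List<TreeMap<Integer, List<String>>> mapFiles = new ArrayList<>();

    PartialSortSketch(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            mapFiles.add(new TreeMap<>());
        }
    }

    // The same partition function is used for writing and for searching.
    int partition(int key) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % mapFiles.size();
    }

    // Writing: the record goes to the partition chosen by its key.
    void put(int key, String value) {
        mapFiles.get(partition(key)).computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Searching: only the one partition that can hold the key is consulted.
    List<String> search(int key) {
        return mapFiles.get(partition(key)).getOrDefault(key, List.of());
    }

    public static void main(String[] args) {
        PartialSortSketch store = new PartialSortSketch(4);
        store.put(273, "flight record A");
        store.put(273, "flight record B");
        store.put(100, "flight record C");
        System.out.println(store.search(273));
        System.out.println(store.search(20).isEmpty() ? "The requested key was not found" : "found");
    }
}
```

Because the writer and the searcher agree on the partition function, a lookup never has to scan the other partitions.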


[Exercise] Partial Sort

Package : AirlinePerformancePartialSort

AirlinePerformanceParser.java

SequenceFileCreator.java

MapFileCreator.java

SearchValueList.java

JAR : AirlinePerformancePartialSort.jar

1. Sequence file

Run the MapReduce job on the 1988 statistics data

[hadoop@master jar]$ hadoop jar AirlinePerformancePartialSort.jar AirlinePerformancePartialSort.SequenceFileCreator airline_input/1988_new.csv 1988_sequence

To build the sequence files from all of 1987, 1988, and 1989 in the airline_input folder instead:

[hadoop@master jar]$ hadoop jar AirlinePerformancePartialSort.jar AirlinePerformancePartialSort.SequenceFileCreator airline_input airline_sequence

[hadoop@master jar]$ hadoop fs -ls 1988_sequence
Found 5 items
-rw-r--r--  1 hadoop supergroup     0 2017-02-02 12:07 1988_sequence/_SUCCESS
-rw-r--r--  1 hadoop supergroup  19913688 2017-02-02 12:07 1988_sequence/part-00000
-rw-r--r--  1 hadoop supergroup  19122032 2017-02-02 12:07 1988_sequence/part-00001
-rw-r--r--  1 hadoop supergroup  19117352 2017-02-02 12:07 1988_sequence/part-00002
-rw-r--r--  1 hadoop supergroup  14328075 2017-02-02 12:07 1988_sequence/part-00003
[hadoop@master jar]$ hadoop fs -cat 1988_sequence/part-00000 | more
[hadoop@master jar]$ hadoop fs -text 1988_sequence/part-00000 | more

2. MapFile

[hadoop@master jar]$ hadoop jar AirlinePerformancePartialSort.jar AirlinePerformancePartialSort.MapFileCreator 1988_sequence 1988_mapfile

[hadoop@master jar]$ hadoop jar AirlinePerformancePartialSort.jar AirlinePerformancePartialSort.MapFileCreator airline_sequence airline_mapfile

[hadoop@master jar]$ hadoop fs -ls 1988_mapfile
Found 2 items
-rw-r--r--  1 hadoop supergroup     0 2017-03-06 20:49 1988_mapfile/_SUCCESS
drwxr-xr-x  - hadoop supergroup     0 2017-03-06 20:49 1988_mapfile/part-00000
 
[hadoop@master jar]$ hadoop fs -ls 1988_mapfile/part-00000
Found 2 items
-rw-r--r--  1 hadoop supergroup  70574032 2017-03-06 20:49 1988_mapfile/part-00000/data
-rw-r--r--  1 hadoop supergroup    3349 2017-03-06 20:49 1988_mapfile/part-00000/index

The data and index files have been created, as the MapFile format requires.

[hadoop@master jar]$ hadoop fs -text 1988_mapfile/part-00000/index | more

The data that looked disordered before can now be seen to be sorted

[hadoop@master jar]$ hadoop fs -text 1988_mapfile/part-00000/data | more
[hadoop@master jar]$ hadoop fs -text 1988_mapfile/part-00000/data | head -10

3. Search program

Before running the search program, note that every log or marker entry currently inside the mapfile folder must be deleted.

This is because the MapFileOutputFormat class also checks the entries that the MapReduce job left in the mapfile folder and raises an error saying the MapFile does not exist.

[hadoop@master jar]$ hadoop jar AirlinePerformancePartialSort.jar AirlinePerformancePartialSort.SearchValueList 1988_mapfile 273

Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://master:9000/user/hadoop/1988_mapfile/_SUCCESS/data
...
[hadoop@master jar]$ hadoop fs -ls 1988_mapfile

Found 2 items
-rw-r--r--  1 hadoop supergroup     0 2017-03-06 20:49 1988_mapfile/_SUCCESS
drwxr-xr-x  - hadoop supergroup     0 2017-03-06 20:49 1988_mapfile/part-00000

[hadoop@master jar]$ hadoop fs -rm -r 1988_mapfile/_SUCCESS
[hadoop@master jar]$ hadoop fs -ls 1988_mapfile
Found 1 items
drwxr-xr-x  - hadoop supergroup     0 2017-03-06 20:49 1988_mapfile/part-00000

[hadoop@master jar]$ hadoop jar AirlinePerformancePartialSort.jar AirlinePerformancePartialSort.SearchValueList 1988_mapfile 273 | more

[hadoop@master jar]$ hadoop jar AirlinePerformancePartialSort.jar AirlinePerformancePartialSort.SearchValueList 1988_mapfile 20 | more

...

The requested key was not found


Total Sort

Every MapReduce job sorts by key, so the data can be sorted easily with a single partition.

However, using a single partition becomes a problem when sorting large data: the data nodes that do not run the reduce task sit idle, and the load concentrates on the one data node that runs it.

To do a total sort while keeping the benefits of distributed processing, proceed in the following order.


1. Sample the input data to examine the distribution of the data.

- Extract a set number of records from the input data and sample the keys and record counts

2. Create the partition information in advance to match the data distribution.

- Create partitions based on the sampling result (set the key information that the sampler provides)

- Register the partition information in the distributed cache so that each task can refer to it

- TotalOrderPartitioner can set the number of partitions and the key range stored in each partition

- Output as a sequence file (TotalOrderPartitioner is optimized for sequence files)

3. The reduce job writes its output data into the partitions created by TotalOrderPartitioner.

4. Merge the output data of each partition.
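
The four steps above can be sketched in plain Java with integer keys. This is a simplified illustration, not TotalOrderPartitioner or InputSampler: the class `TotalSortSketch`, the way split points are picked from the sample, and the in-memory partitions are all assumptions, and the sample is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of sampling plus range partitioning; not the Hadoop API.
final class TotalSortSketch {

    // Steps 1 and 2: sort the sampled keys and pick numPartitions - 1 split points.
    static int[] splitPoints(List<Integer> sample, int numPartitions) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted.get(i * sorted.size() / numPartitions);
        }
        return points;
    }

    // Range partitioner: partition p holds the keys below split point p
    // (and at or above split point p - 1); the last partition holds the rest.
    static int partition(int key, int[] points) {
        int p = 0;
        while (p < points.length && key >= points[p]) {
            p++;
        }
        return p;
    }

    // Steps 3 and 4: each partition sorts only its own keys; because the
    // partitions cover increasing key ranges, joining them in partition
    // order yields a totally sorted result.
    static List<Integer> totalSort(List<Integer> keys, int[] points) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i <= points.length; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int key : keys) {
            partitions.get(partition(key, points)).add(key);
        }
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> part : partitions) {
            Collections.sort(part);
            merged.addAll(part);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Integer> keys = List.of(42, 7, 19, 88, 3, 61, 25, 70, 14);
        int[] points = splitPoints(List.of(42, 88, 25), 3); // a small sample of the keys
        System.out.println(totalSort(keys, points));
    }
}
```

The quality of the sample only affects how evenly the keys spread over the partitions; the merged result is sorted for any choice of split points.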


※ InputSampler types

SplitSampler : collects keys from the first records of the input splits

IntervalSampler : collects keys at regular intervals from the input splits

RandomSampler : collects keys with a given probability from a given number of splits. The number of splits and the probability are set by the user, and the number of samples to collect can also be set freely.


Each partition is sorted in key order, and combining these files gives the effect of a total sort.

At this point you might think it is simpler to skip partial sort and always use total sort.

Both techniques sort the data partition by partition, but the biggest difference is search.

If you need to look up data in the sorted result, use partial sort; if you only need the whole data set in sorted order, use total sort.