WordCount.R with Revolution Analytics (rmr2)

The code fragment below is a handy, simple piece of test code that counts the words in the R license text. It originally comes from the rmr2 GitHub page.

It can be used to test a Hadoop installation in combination with R and Revolution Analytics (rmr2).

 

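library(rmr2)

# Count the words in `input`, splitting each line on `pattern`.
wordcount =
  function(
    input,
    output = NULL,
    pattern = " ") {

    # Map: split each line into words and emit a (word, 1) pair per word.
    wc.map =
      function(., lines) {
        keyval(
          unlist(
            strsplit(
              x = lines,
              split = pattern)),
          1)}

    # Reduce: sum the counts collected for each word.
    wc.reduce =
      function(word, counts) {
        keyval(word, sum(counts))}

    # combine = TRUE also runs the reducer as a combiner on the map side.
    mapreduce(
      input = input,
      output = output,
      map = wc.map,
      reduce = wc.reduce,
      combine = TRUE)
  }

# Use the R license text as input.
text = capture.output(license())

# Run the same job on both backends and check that the results agree.
out = list()
for (be in c("local", "hadoop")) {
  rmr.options(backend = be)
  out[[be]] = from.dfs(wordcount(to.dfs(keyval(NULL, text)), pattern = " +"))}
stopifnot(rmr2:::kv.cmp(out$hadoop, out$local))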
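
To see what the map and reduce steps compute without any Hadoop or rmr2 setup, the same logic can be sketched in plain R. This is only an illustration: `sample_lines` is made-up input, and `tapply` stands in for the grouping that mapreduce performs between the two steps.

```r
# Plain-R sketch of the word count, with no rmr2 or Hadoop involved.
# `sample_lines` is invented input for illustration only.
sample_lines <- c("the quick  brown fox", "the lazy dog", "the fox")

# Map step: split every line on runs of spaces, giving one word per element.
words <- unlist(strsplit(sample_lines, split = " +"))

# Reduce step: attach a 1 to every word, group by word, and sum per group.
counts <- tapply(rep(1, length(words)), words, sum)

print(counts)
```

If this gives sensible counts but the rmr2 script fails, the problem is more likely in the Hadoop or rmr2 setup than in the word count logic itself.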

 

When the wordcount script has run, the Hadoop backend prints the following output:

14/10/20 10:21:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/10/20 10:21:51 INFO compress.CodecPool: Got brand-new compressor
packageJobJar: [/scratch/hadoop/tmp/hadoop-unjar1248872206048220313/] [] /tmp/streamjob5124481892103881274.jar tmpDir=null
14/10/20 10:21:52 INFO mapred.FileInputFormat: Total input paths to process : 1
14/10/20 10:21:52 INFO streaming.StreamJob: getLocalDirs(): [/scratch/hadoop/tmp/mapred/local]
14/10/20 10:21:52 INFO streaming.StreamJob: Running job: job_201410201021_0001
14/10/20 10:21:52 INFO streaming.StreamJob: To kill this job, run:
14/10/20 10:21:52 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201410201021_0001
14/10/20 10:21:52 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201410201021_0001
14/10/20 10:21:53 INFO streaming.StreamJob:  map 0%  reduce 0%
14/10/20 10:23:30 INFO streaming.StreamJob:  map 50%  reduce 0%
14/10/20 10:23:42 INFO streaming.StreamJob:  map 100%  reduce 0%
14/10/20 10:23:49 INFO streaming.StreamJob:  map 100%  reduce 67%
14/10/20 10:23:53 INFO streaming.StreamJob:  map 100%  reduce 100%
14/10/20 10:23:56 INFO streaming.StreamJob: Job complete: job_201410201021_0001
14/10/20 10:23:56 INFO streaming.StreamJob: Output: /tmp/file3f4f7e32e290

The results are stored in the "out" object.

> out
$local
$local$key
 [1] "This"                                "software"                            "is"                                 
 [4] "distributed"                         "under"                               "the"                                
 [7] "terms"                               "of"                                  "GNU"                                
[10] "General"                             "Public"                              "License,"                           
[13] "either"                              "Version"                             "2,"            
...
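
A result like this is easier to read as a table sorted by count. The sketch below assumes that each backend's result is a list with parallel `key` and `val` components (only `key` is visible in the output above), and it uses a small invented stand-in, `res`, instead of a real run.

```r
# Stand-in for out$local: a list with parallel `key` and `val` vectors.
# The values are invented for illustration, not taken from a real run.
res <- list(key = c("This", "software", "the", "of"),
            val = c(1, 2, 9, 6))

# Pair each word with its count and sort by descending count.
counts <- data.frame(word = res$key, count = res$val,
                     stringsAsFactors = FALSE)
counts <- counts[order(-counts$count), ]

# Show the most frequent words first.
print(head(counts, 3))
```

With a real run, `res` would be replaced by `out$local` or `out$hadoop`, provided the assumption about the `val` component holds.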