Iterative Hadoop MapReduce in Python

2020-08-12T03:53:31

On my Mac I have a standalone installation of Hadoop 3.3.0.

I have two Python scripts, mapper.py and reducer.py.

I can successfully run one step of map and reduce, correctly writing the output to the local HDFS, through the command

bin/hadoop jar /usr/local/Cellar/hadoop/3.3.0/libexec/share/hadoop/tools/lib/hadoop-*streaming*.jar -file /Users/mauro/hadoop_job/mapper.py -mapper /Users/mauro/hadoop_job/mapper.py -file /Users/mauro/hadoop_job/reducer.py  -reducer /Users/mauro/hadoop_job/reducer.py  -input /input/4300.txt -output /input/output-output

The problem is: how can I iterate the two stages until a condition is met? More specifically, I have implemented k-means to get familiar with Hadoop MapReduce.
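
For context, here is a minimal sketch of the kind of mapper.py and reducer.py this involves (a simplification; the comma-separated point format and the centroids.txt file name are illustrative assumptions, not my actual scripts):

#!/usr/bin/env python3
# mapper.py -- assign every input point to its nearest centroid.
# Assumes centroids.txt (shipped to the task with -file) holds one
# comma-separated centroid per line, and each input line is one point.
import sys

def load_centroids(path="centroids.txt"):
    with open(path) as f:
        return [tuple(map(float, line.split(","))) for line in f if line.strip()]

def squared_distance(p, c):
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

centroids = load_centroids()

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    point = tuple(map(float, line.split(",")))
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(point, centroids[i]))
    # Key = index of the nearest centroid, value = the point itself.
    print("%d\t%s" % (nearest, line))

#!/usr/bin/env python3
# reducer.py -- recompute each centroid as the mean of its assigned points.
# Hadoop Streaming delivers lines sorted by key, so all points belonging
# to one cluster arrive consecutively.
import sys

current_key, sums, count = None, None, 0

def emit(key, sums, count):
    print("%s\t%s" % (key, ",".join(str(s / count) for s in sums)))

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    point = [float(x) for x in value.split(",")]
    if key != current_key:
        if current_key is not None:
            emit(current_key, sums, count)
        current_key, sums, count = key, [0.0] * len(point), 0
    sums = [s + p for s, p in zip(sums, point)]
    count += 1

if current_key is not None:
    emit(current_key, sums, count)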

I can update the centroids once, by running one map and one reduce. I now need to feed the updated centroids back to the mapper and iterate the map and reduce stages until a condition is met (namely, until the cumulative intra-cluster distance falls below a threshold). How can I do that?
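
One common approach is to drive the iteration from outside Hadoop: a small controller script submits the streaming job, merges the reducer output back from HDFS into the local centroids file, measures how far the centroids moved (or the intra-cluster distance), and resubmits the job with the updated centroids until that quantity drops below a threshold. Below is a minimal sketch of such a driver, assuming it runs from the Hadoop installation directory like the command above; the jar path, the centroids.txt file, the threshold, and the per-iteration output directories are illustrative assumptions.

#!/usr/bin/env python3
# driver.py -- rerun the streaming job until the centroids stop moving.
# All paths and names below are assumptions; adapt them to the actual setup.
import subprocess

STREAMING_JAR = ("/usr/local/Cellar/hadoop/3.3.0/libexec/share/hadoop/"
                 "tools/lib/hadoop-streaming-3.3.0.jar")
JOB_DIR = "/Users/mauro/hadoop_job"
INPUT = "/input/4300.txt"
THRESHOLD = 1e-4
MAX_ITERATIONS = 20

def read_centroids(path):
    # Accepts both the plain initial file and the reducer's "key<TAB>centroid" lines.
    with open(path) as f:
        return [tuple(float(x) for x in line.split("\t")[-1].split(","))
                for line in f if line.strip()]

def total_shift(old, new):
    # Cumulative squared movement of the centroids between two iterations.
    # Assumes the centroids come back in a stable order (e.g. keyed 0..k-1, k < 10).
    return sum(sum((a - b) ** 2 for a, b in zip(o, n)) for o, n in zip(old, new))

old_centroids = read_centroids("%s/centroids.txt" % JOB_DIR)

for i in range(MAX_ITERATIONS):
    output = "/output/kmeans-%d" % i          # fresh HDFS directory each round
    subprocess.run([
        "bin/hadoop", "jar", STREAMING_JAR,
        "-file", "%s/mapper.py" % JOB_DIR, "-mapper", "mapper.py",
        "-file", "%s/reducer.py" % JOB_DIR, "-reducer", "reducer.py",
        "-file", "%s/centroids.txt" % JOB_DIR,   # ship current centroids to every task
        "-input", INPUT, "-output", output,
    ], check=True)

    # Pull the new centroids back to the local file used by the next iteration.
    subprocess.run(["bin/hadoop", "fs", "-getmerge", output,
                    "%s/centroids.txt" % JOB_DIR], check=True)

    new_centroids = read_centroids("%s/centroids.txt" % JOB_DIR)
    if total_shift(old_centroids, new_centroids) < THRESHOLD:
        break
    old_centroids = new_centroids

The same pattern works with the intra-cluster distance as the stopping criterion: have the reducer also emit the summed distance of its points to the new centroid, and let the driver read that value instead of (or in addition to) the centroid shift.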

Copyright License:
Author: Mauro Gentile. Reproduced under the CC BY-SA 4.0 license with a link to the original source and disclaimer.
Link: https://stackoverflow.com/questions/63365459/iterative-hadoop-mapreduce-in-python
