Project

General

Profile

Bug #701

Reducer isn't called if there is only 1 input

Added by Jackson, Mike about 10 years ago. Updated almost 10 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Tunnell, Christopher
Category:
common_py
Target version:
Start date:
16 September 2011
Due date:
% Done:

100%

Estimated time:
Workflow:

Description

This issue affects Go.py.

Looking at functools.reduce(function, buffer), if the buffer only has 1 element then the function is not called and functools.reduce just returns the first element. This can be seen by running the attached reducetest.py script:

$ python reducetest.py 
Calling functools.reduce on ['elt1']
Result: elt1
Calling functools.reduce on ['elt1', 'elt2']
    In TestReducer.process(elt1,elt2)
Result: elt1---elt2

Whereas, for MAUS I assume we'd want the function to always be called even if only one element is present (e.g. to histogram just one JSON document).


Files

reducetest.py (477 Bytes) reducetest.py Jackson, Mike, 16 September 2011 17:17
Go.py (7.91 KB) Go.py Tunnell, Christopher, 29 September 2011 19:21
#1

Updated by Tunnell, Christopher about 10 years ago

  • Assignee changed from Rogers, Chris to Tunnell, Christopher
#2

Updated by Tunnell, Christopher about 10 years ago

This is actually more complicated than I thought at first. A reduce is a binary function with one output. So ZxZ -> Z for example. But it seems like what we may want is actually a second 'map' phase that isn't parallelized (ie. internal state is allowed).

I'll have to give this a think...

#3

Updated by Tunnell, Christopher about 10 years ago

I think essentially there's a design flaw in MAUS because we never actually want to do a reduce (ie. two inputs with one output like f(x,y ) = x+y. We actually want another 'map' but that isn't parallelized and retains internal state.

So what I will probably do (after thinking) is modify the reduce API to only take in one input. A lot of what we described this part of MAUS to do is the same -- functionality that sees all of the data -- but I think it would clarify things.

Can you think of anything better to call 'map' and 'reduce'?

#4

Updated by Jackson, Mike about 10 years ago

At a very high-level, MAUS processing is similar to a map-reduce in that data is partitioned, mapped, then reduced (e.g. to a single output plot or file).

In the detail though, the match to the map-reduce concepts seems to break down at bit. Map is defined as Map(k1,v1)->list(k2,v2) whereas MAUS has Map(doc1)->doc2. Reduce is defined as Reduce(k2, list(v2))->list(v3) whereas MAUS wants, for example, Reduce(list(doc))->thing.

Mappers and reducers could be replaced with Transformers (no internal state) and Mergers (internal state)? These could apply:

  • Transform(doc)->doc
  • Merge(list(docs))->thing (e.g. plot, list(docs), file ...)
#5

Updated by Tunnell, Christopher about 10 years ago

I was trying to think if we should abandon the Map-Reduce setup (like you say) but was waiting to talk to Chris Rogers before doing anything. I think you're right: we either need to follow more strictly the Map-Reduce setup which would mean that histogramming was a separate package (which is what I was originally thinking honestly) or move to what you suggest but avoid Map-Reduce naming.

I think I like your idea best. We could even just define every component to say if it can be parallelized. Normally single threaded programming is more like pipeline programming and this would facilitate that. However, we could probably start having a multipass with the intermediate state stored in a DB or something. Currently the temporary storage between map and reduce really worries me because it's a lot of I/O. We could have normal users do single-pass single-thread, and the control room and experts have a two pass setup where one pass is parallelized and the other isn't. This could even generalize to N passes.

I imagine that getting an answer to this question taking a week should not block your work? It just may require a little refactoring?

#6

Updated by Jackson, Mike about 10 years ago

Yes, it wouldn't be a blocker.

#7

Updated by Jackson, Mike about 10 years ago

Going back to the functools.reduce issue, according to the functools.reduce API, functools.reduce(function, iterable, initializer]) expects function to be of form function(m, n) where m is the currently aggregated information i.e. it processes n and updates m with the result.

It states that "If initializer is not given and iterable contains only one item, the first item is returned." This is the cause of the problems for us since function won't be called (not only that but the function is never called on the first argument of the iterable) meaning it won't be "reduced" in the way we may want. For example:

def test_int_reduce(x, y):
    print "In test_int_reduce: y: %d" % y
    x = x + y
    return x

functools.reduce(test_int_reduce, [1, 2, 3])
In test_int_reduce: y: 2
In test_int_reduce: y: 3
6
functools.reduce(test_int_reduce, [1, 2])
In test_int_reduce: y: 2
3
functools.reduce(test_int_reduce, [1])
1

But, it states that "If the optional initializer is present, it is placed before the items of the iterable in the calculation, and serves as a default when the iterable is empty." This default will then force execution of function in a 1-element list case e.g.
functools.reduce(test_int_reduce, [1], 0)
In test_int_reduce: y: 1
In test_int_reduce: y: 2
1

And, trying with a non-integer example:
def test_reduce(x, y):
    x = "%s%d" % (x, len(y))
    return x

functools.reduce(test_reduce, ["aa", "bbb", "cccc"])
'aa34'

Again the non-application of reduction to the first element of the list causes problems, but adding a default solves this:
functools.reduce(test_reduce, ["aa", "bbb", "cccc"], "")
'234'
functools.reduce(test_reduce, ["aa"], "")
'2'

So that may address the immediate issue with Go.py. Though the more general issue of some reducers being mappers with internal state remains.

#8

Updated by Rogers, Chris almost 10 years ago

Sounds like the API wasn't quite right to start with, so changing makes sense. Splitting into a separate package introduces more management overhead than we can probably sustain. So I think modifying the API is probably the correct way to go...

#9

Updated by Rogers, Chris almost 10 years ago

After a bit more thought, the way I would like to do this is to introduce a base class, one for C++ and one for Python. The base class defines the interface and handles

  • Conversion from json to string
  • Errors

The motivation for this is

  • Makes the interface clear
  • Allows to do error handling once

Might be slightly out of scope for Mike's work? But if we change the API, best to only do it once. Thoughts? Might be a Rogers job that one...

#10

Updated by Tunnell, Christopher almost 10 years ago

That's definitely outside of his job definition. I'll try to do it this week unless you want to do it (since you wrote the error handler for example).

#11

Updated by Rogers, Chris almost 10 years ago

Will take me a week or two to get to it... snowed this week.

#12

Updated by Tunnell, Christopher almost 10 years ago

I'll do it Thursday. At least everything but the errors part unless you can explain what you wanted. My other goal on Thursday relates to the RealData stuff (which I really want to rename DAQData for clarity!). This should take an hour to write and an hour to test.

#13

Updated by Tunnell, Christopher almost 10 years ago

I propose this Go for a pipeline dataflow for normal single threaded users. The tests are in my branch. Chris: how do we want to go about merging? It passes all the tests and doesn't break anything. It's also more efficient for single threaded users.

And it closes this bug.

#14

Updated by Rogers, Chris almost 10 years ago

Good code - lots of comments, easy to read. Where are the tests? I couldn't see anything in your branch. Looking through Go.py:

40: 'control_room_style' doesnt mean much to me, and the description doesnt help. Presumably you mean distributed across many PCs? Can you think of a better description/name

140: assert type_of_dataflow in possible_types_of_dataflow.keys()
...
148: else
149:     raise LookupError("bad type_of_dataflow: %s" % type_of_dataflow)

Presumably, no point in having both. Prefer something like
148: else
149:     raise LookupError("type_of_dataflow '"+str(type_of_dataflow)+"' should be one of "+str(possible_types_of_dataflow.keys()) % type_of_dataflow)

Which tells the user both what is wrong and how to fix it.

#15

Updated by Tunnell, Christopher almost 10 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

r651 closes this. control_room_style currently isn't used so can be renamed in the future sinc eI can't think of anything.

#16

Updated by Rogers, Chris almost 10 years ago

  • Target version changed from Future MAUS release to MAUS-v0.0.8
#17

Updated by Jackson, Mike almost 10 years ago

A small comment. It may be less confusing to have the print statements in the

while len(map_buffer) != 0

after the inner for loop. For example, running Go.py with 4 spills gives:
HINT: MAUS will process 1 event only at first...
TRANSFORM/MERGE/OUTPUT:  Processed 0 events so far, 1 events in buffer.
TRANSFORM/MERGE/OUTPUT:  Processed 1 events so far, 3 events in buffer.
TRANSFORM: Shutting down transformer

which made me think at first it had missed a couple. Putting the print statements after the inner for loop would give:
HINT: MAUS will process 1 event only at first...
TRANSFORM/MERGE/OUTPUT:  Processed 1 events so far,   3 events in buffer.
TRANSFORM/MERGE/OUTPUT:  Processed 4 events so far,   0 events in buffer.
TRANSFORM: Shutting down transformer

which is more reassuring to the user.

#18

Updated by Tunnell, Christopher almost 10 years ago

implemented

Also available in: Atom PDF