MongoDB ran out of address space
Running in the control room, we are getting an error like:
    [miceonrec01a] reducers > ./reconstruct_daq_single_station_reducer.py -type_of_dataflow=multi_process_merge_output
    Error in <TGClient::TGClient>: can't open display "isis57139.sci.rl.ac.uk:0.0", switching to batch mode...
    In case you run from a remote ssh session, reconnect with ssh -Y
    MAUS running in online mode
    Welcome to MAUS:
    Process ID (PID): 10152
    Program Arguments: ['./reconstruct_daq_single_station_reducer.py', '-type_of_dataflow=multi_process_merge_output']
    Version: MAUS release version 0.2.2
    INITIATING EXECUTION
    -------- MERGE OUTPUT --------
    Traceback (most recent call last):
      File "./reconstruct_daq_single_station_reducer.py", line 74, in <module>
        run()
      File "./reconstruct_daq_single_station_reducer.py", line 71, in run
        MAUS.Go(my_input, my_map, reducer, output_worker, data_cards)
      File "/home/mice/MAUS/.maus_control-room/src/common_py/Go.py", line 121, in __init__
        executor.execute()
      File "/home/mice/MAUS/.maus_control-room/src/common_py/framework/merge_output.py", line 215, in execute
        raise DocumentStoreException(exc)
    docstore.DocumentStore.DocumentStoreException: Exception when using document store: database error: can't map file memory - mongo requires 64 bit build for larger datasets
A workaround is to use a different MongoDB database name, but this is a bit clunky... and probably means something has gone wrong somewhere. The relevant new feature is that the tracker is now in place, and data sizes are indeed larger.
Updated by Rogers, Chris over 11 years ago
It didn't occur until after we were running (on test random noise) for about an hour. Now it happens every time unless I change the mongodb_database_name configuration parameter. My worry is that:
- somewhere the data is getting cached on disk and we will run out of disk space in a couple of weeks;
- there is a load issue that will cause the online reconstruction to crash in a nasty way every hour or so (such that even a Ctrl-C restart doesn't fix it).
Updated by Jackson, Mike over 11 years ago
I did a Google search. Deep within the MongoDB FAQ - http://www.mongodb.org/display/DOCS/FAQ - it says:
What are the 32-bit limitations?
MongoDB uses memory-mapped files. When running on a 32-bit operating system, the total storage size for the server (data, indexes, everything) is 2gb. If you are running on a 64-bit os, there is virtually no limit to storage size. Thus 64 bit production deployments are recommended. See the blog post for more information. One other note: journaling is not on by default in the 32 bit binaries as journaling uses extra memory-mapped views.
The blog post cited is http://blog.mongodb.org/post/137788967/32-bit-limitations, in which users ask why this limitation is buried in the FAQ.
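As a quick way to tell whether the mongod you are talking to is one of the limited 32-bit builds, the server's `buildInfo` command reports a `bits` field. The sketch below is hedged: the `is_64_bit_build` helper is hypothetical, and the connection snippet under `__main__` assumes pymongo and a locally running mongod.

```python
def is_64_bit_build(build_info):
    """Return True if a MongoDB `buildInfo` command result reports a
    64-bit mongod.  `build_info` is the dict the server returns; it
    includes a `bits` field (32 or 64)."""
    return build_info.get("bits") == 64


if __name__ == "__main__":
    # Sketch only: requires pymongo and a running mongod on localhost.
    from pymongo import MongoClient
    info = MongoClient("localhost", 27017).admin.command("buildInfo")
    print("64-bit build:", is_64_bit_build(info))
```

If this prints False, the server will hit the ~2 GB memory-mapped storage ceiling described in the FAQ above.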
- Use a 64-bit deployment on a separate server if you have such a server.
- Look at http://www.mongodb.org/display/DOCS/Sharding.
- At present src/framework/merge_output.py reads spills for merging and outputting but leaves the spills in MongoDB. This is because N merge-output clients may be reading from MongoDB. Introducing something like a MergeOutputGroup which contains N Merger-Outputer pairs would mean that a single merge-output client could be used instead of N clients. merge_output.py could then delete the spills from MongoDB once they've been passed to the MergeOutputGroup. This would reduce the rate at which the database fills up.
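A minimal sketch of the MergeOutputGroup idea above, with an in-memory stand-in for the MongoDB-backed document store. All names here (MergeOutputGroup, the store's put/get/delete methods) are hypothetical illustrations of the proposal, not the actual MAUS API.

```python
class MergeOutputGroup:
    """Sketch: a single client holding N merger/outputter pairs.
    Because this one object feeds every pair, the spill can safely be
    deleted from the store once all pairs have seen it."""

    def __init__(self, pairs):
        # pairs: list of (merger, outputter) callables
        self.pairs = pairs

    def process(self, store, spill_id):
        spill = store.get(spill_id)
        for merge, output in self.pairs:
            output(merge(spill))
        # Every pair has been fed, so reclaim the space: this is what
        # slows the rate at which the database fills up.
        store.delete(spill_id)


class InMemoryStore:
    """Stand-in for the MongoDB-backed document store."""

    def __init__(self):
        self.docs = {}

    def put(self, key, doc):
        self.docs[key] = doc

    def get(self, key):
        return self.docs[key]

    def delete(self, key):
        del self.docs[key]
```

The key design point is that deletion only becomes safe when a single reader owns all N merger/outputter pairs; with N independent clients, no client knows when the others have finished with a spill.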
Updated by Tunnell, Christopher over 11 years ago
Thanks Mike for finding that!
We're moving to 64-bit anyway, Chris. Ask Matt Robinson and maybe he can prioritize the move? He's supposed to be coming down to RAL next week. Linda says the intention is to back up the other non-MAUS onrec, move that to 64-bit, and then we can install MongoDB on that, I guess?
I'm sure Alex can figure something else clever out, but this is just a proposed solution of minimal work for the software people (because Matt does the heavy lifting).
Chris: is the tracker zero-suppressed? I don't think the CKOV is, and that (I would guess) would be more data than the tracker outputs. If you're worried about it filling up every two weeks, then just tie the end-of-run signal to a flush of the database?
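Tying the end-of-run signal to a flush could be as simple as dropping the run's database, which reclaims the memory-mapped storage. `drop_database` is a real pymongo/shell operation; the handler wiring and the `FakeClient` below are hypothetical illustrations so the sketch runs without a live server.

```python
def flush_run_database(client, database_name):
    """End-of-run handler sketch: drop the run's MongoDB database so
    its memory-mapped storage is reclaimed before the next run.
    `client` is anything exposing pymongo's drop_database(name)."""
    client.drop_database(database_name)
    return database_name


class FakeClient:
    """Minimal stand-in for a pymongo client, for illustration only."""

    def __init__(self):
        self.dropped = []

    def drop_database(self, name):
        self.dropped.append(name)
```

With a real deployment, `client` would be a pymongo connection and `database_name` the `mongodb_database_name` configuration parameter mentioned above.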