Feature #705

Data cache

Added by Jackson, Mike about 10 years ago. Updated over 9 years ago.

Status: Closed
Priority: Urgent
Assignee: Jackson, Mike
Category: Online reconstruction
Target version:
Start date: 20 September 2011
Due date:
% Done: 100%
Estimated time:
Workflow:

Description

Component to cache results output by chains of map workers until they're ready for consumption by reducers. Based on CouchDB.

#1

Updated by Jackson, Mike about 10 years ago

From MAUS SSI Component Design - Online Reconstruction:

  • At present an in-memory buffer is used to cache data - see temp_file in src/common_py/Go.py
  • Use CouchDB high-speed document-oriented database, http://couchdb.apache.org/, to support multiple I/O requests and large data volumes.
  • CouchDB will be flushed after every run.
  • A run may be every 2 days or so.
  • Define a data store interface. Initial prototype can be in memory buffer (as at present), then hook in CouchDB.
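
The last point could be sketched roughly as below. This is only an illustration: the class and method names (DocumentStore, InMemoryStore, put, get, delete, ids, clear) are assumptions made for the example, not the MAUS API.

```python
class DocumentStore(object):
    """Hypothetical data store interface (all names are illustrative)."""

    def put(self, doc_id, doc):
        raise NotImplementedError

    def get(self, doc_id):
        raise NotImplementedError

    def delete(self, doc_id):
        raise NotImplementedError

    def ids(self):
        raise NotImplementedError

    def clear(self):
        raise NotImplementedError


class InMemoryStore(DocumentStore):
    """Dictionary-backed prototype. A CouchDB-backed class would offer the same methods."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, doc):
        self._docs[doc_id] = doc

    def get(self, doc_id):
        # Returns None if the ID is unknown.
        return self._docs.get(doc_id)

    def delete(self, doc_id):
        self._docs.pop(doc_id, None)

    def ids(self):
        return sorted(self._docs)

    def clear(self):
        # Flush everything, e.g. at the end of a run.
        self._docs.clear()


store = InMemoryStore()
store.put("spill-1", {"a": "b"})
store.put("spill-2", {"c": "d"})
print(store.ids())
store.clear()
print(store.ids())
```

Callers only depend on the interface, so swapping the dictionary for a database should not need changes on the calling side.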
#2

Updated by Tunnell, Christopher about 10 years ago

A run is normally 2 hours and rarely 8. It's mainly dictated by our shift schedule, where we run only one shift per day. For safety reasons we can't run overly long shifts, and we don't have the manpower for 24/7 running.

#3

Updated by Tunnell, Christopher almost 10 years ago

Per CouchDB issues. I tried building from source on Ubuntu and got:

configure: error: Could not find the js library.

Is the Mozilla SpiderMonkey library installed?

Then just thought "screw it" and ran:

sudo apt-get install couchdb

on Ubuntu which works. We can always make the machine Ubuntu, if it takes less time to setup an Ubuntu server than sort out that problem.

#4

Updated by Tunnell, Christopher almost 10 years ago

$$$ for people >> $$$ for computers

#5

Updated by Jackson, Mike almost 10 years ago

Your error is the same as mine...

$ wget http://www.mirrorservice.org/sites/ftp.apache.org/couchdb/1.1.1/apache-couchdb-1.1.1.tar.gz
$ gunzip apache-couchdb-1.1.1.tar.gz 
$ tar -xf apache-couchdb-1.1.1.tar 
$ cd apache-couchdb-1.1.1
$ cd /home/michaelj/couchdb/apache-couchdb-1.1.1
$ ./configure
...
checking for JS_NewContext in -lmozjs... no
checking for JS_NewContext in -ljs... no
checking for JS_NewContext in -ljs3250... no
checking for JS_NewContext in -ljs32... no
configure: error: Could not find the js library.

Is the Mozilla SpiderMonkey library installed?

So,
$ yum install js
Loaded plugins: kernel-module
Setting up Install Process
Package js-1.70-8.el5.i386 already installed and latest version
Nothing to do

So,
$ ./configure 
Is the Mozilla SpiderMonkey library installed?

Google "couchdb is the spidermonkey installed" gives http://wiki.apache.org/couchdb/Installing_SpiderMonkey which states "yum install js-devel", so

$ yum install js-devel
$ ./configure 
checking whether JSOPTION_ANONFUNFIX is declared... no
configure: error: Your SpiderMonkey library is too new.
Versions of SpiderMonkey after the js185-1.0.0 release remove the optional
enforcement of preventing anonymous functions in a statement context. This
will most likely break your existing JavaScript code as well as render all
example code invalid.

This was confusing as I have 1.70-8.el5!

Looking at configure, it checks for JSOPTION_ANONFUNFIX. This was removed after js185-1.0.0. Checking the JS download, it was not added until js-1.8.0-rc1, so I'm right to get the error (that flag is not there) but for the wrong reason (my version is too old, not too new).

Try and build 185-1.0.0,
wget http://ftp.mozilla.org/pub/mozilla.org/js/js185-1.0.0.tar.gz
http://wiki.apache.org/couchdb/Installing_SpiderMonkey
$ cd js/src
$ make BUILD_OPT=1 -f Makefile.ref

According to http://old.nabble.com/How-can-i-install-SpiderMonkey--td20791332.html the make -f approach was for 1.7, but https://developer.mozilla.org/En/SpiderMonkey/Build_Documentation says
cd js/src
autoconf-2.13
./configure
make

But my autoconf is 2.59 and it says no greater than 2.13 is suitable.
$ yum install autoconf213
autoconf213          noarch          2.13-12.1          sl-base          253 k

Try again,
$ cd js/src
$ autoconf-2.13
$ ./configure
...
checking for 64-bit OS... no
checking for Python version >= 2.5 but not 3.x... configure: error: Python 2.5 or higher (but not Python 3.x) is required.
$ python -V
2.4

So,
$ yum install python26
================================================================================
Installing:
 python26             i386        2.6.5-6.el5             epel            6.5 M
Installing for dependencies:
 libffi               i386        3.0.9-1.el5.rf          rpmforge         87 k
 python26-libs        i386        2.6.5-6.el5             epel            667 k
$ ./configure
$ make
$ ./js -help
JavaScript-C 1.8.5 2011-03-31
$ pwd
/home/michaelj/couchdb/js-1.8.5/js/src

Return to CouchDB

$ cd /home/michaelj/couchdb/apache-couchdb-1.1.1
$ ./configure --with-js-lib=/home/michaelj/couchdb/js-1.8.5/js/src/ --with-js-include=/home/michaelj/couchdb/js-1.8.5/js/src/
*** The icu-config script could not be found. Make sure it is
*** in your path, and that taglib is properly installed.
*** Or see http://ibm.com/software/globalization/icu/
configure: error: Library requirements (ICU) not met.

    Google and find http://wiki.apache.org/couchdb/Error_messages#Missing_icu-config,
    $ yum install libicu-dev
    # Not found
    $ yum install libicu
    # 3rd time lucky,
    $ yum install icu
    ================================================================================
    Installing:
     icu            i386            3.6-5.16               sl-base            187 k
    $ which icu-config
    # Not found
    $ yum install libicu-devel
    ================================================================================
    Installing:
     libicu-devel          i386          3.6-5.16            sl-base          571 k
    $ which icu-config
    /usr/bin/icu-config
    # Finally!
    

    Try again,
    $ ./configure --with-js-lib=/home/michaelj/couchdb/js-1.8.5/js/src/ \
      --with-js-include=/home/michaelj/couchdb/js-1.8.5/js/src/
    
    checking for curl >= 7.18.0... can't find curl >= 7.18.0
    configure: error: Library requirements (curl) not met.
    

According to http://wiki.apache.org/couchdb/Installing_on_RHEL5, curl >= 7.18 is needed for CouchDB > 0.11.

$ wget http://curl.haxx.se/download/curl-7.20.1.tar.gz
$ tar -xzf curl-7.20.1.tar.gz
$ cd curl-7.20.1
$ ./configure --prefix=/usr/local
$ make
$ make install

And back to CouchDB,
$ ./configure --with-js-lib=/home/michaelj/couchdb/js-1.8.5/js/src/ \
  --with-js-include=/home/michaelj/couchdb/js-1.8.5/js/src/
configure: error: The installed Erlang version is less than erts-5.7.3 (R13B02)

$ yum info erlang
Release    : 5.10.el5

At which point I decide to just stick with the yum installable CouchDB 0.11.2.
yum install couchdb
===============================================================================
 Package                 Arch         Version                Repository    Size
===============================================================================
Installing:
 couchdb                 i386         0.11.2-2.el5           epel         557 k
Installing for dependencies:
 erlang-ibrowse          i386         2.1.0-1.el5            epel          48 k
 erlang-mochiweb         i386         1.4.1-5.el5            epel         368 k
 erlang-oauth            i386         1.0.1-1.el5            epel          27 k
 js                      i386         1.70-8.el5             epel         394 k

#6

Updated by Jackson, Mike almost 10 years ago

CouchDB familiarisation

Reinstalled...

http://wiki.apache.org/couchdb/Installing_on_RHEL5

$ yum install couchdb
 couchdb          i386          1.0.1-2.el5.rf          rpmforge          749 k

Configuration file (untouched for now): /etc/couchdb/local.ini

$ service couchdb start
Starting database server couchdb
chown: `couchdb': invalid user
su: user couchdb does not exist

Never happened before :-(

$ yum remove couchdb
$ yum install couchdb
Loaded plugins: kernel-module
Setting up Install Process
Package couchdb-1.0.1-2.el5.rf.i386 already installed and latest version
Nothing to do
$ yum list couchdb
Loaded plugins: kernel-module
Installed Packages
couchdb.i386                      1.0.1-2.el5.rf                       installed
$ yum clean all
$ yum install couchdb
Still there!
$ rpm -qa couchdb
couchdb-1.0.1-2.el5.rf
$ rpm -e couchdb
error: %preun(couchdb-1.0.1-2.el5.rf.i386) scriptlet failed, exit status 1
Google,
$ rpm -e --noscripts couchdb
$ yum install couchdb

Verify and test...

http://wiki.apache.org/couchdb/Installing_on_Ubuntu
http://wiki.apache.org/couchdb/Verify_and_Test_Your_Installation

$ /sbin/service couchdb start
Starting database server couchdb
$ /sbin/service couchdb status
Apache CouchDB is running as process 6723, time to relax.
[or /etc/init.d/couchdb status]
$ curl http://localhost:5984
{"couchdb":"Welcome","version":"1.0.1"}

Using Curl...

http://guide.couchdb.org/draft/tour.html

$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_users"]
$ curl -X PUT http://127.0.0.1:5984/mike
{"ok":true}
$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_users","mike"]
$ curl -X DELETE http://127.0.0.1:5984/mike
{"ok":true}
$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_users"]
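
The same three calls can be built from Python's standard library. A minimal sketch, assuming the default address used above; the helper name couch_request is made up for this example, and the requests are only constructed here, not sent, so no server is needed to run it.

```python
import urllib.request

BASE_URL = "http://127.0.0.1:5984"  # assumed default CouchDB address, as above


def couch_request(method, path):
    """Build (but do not send) an HTTP request against CouchDB."""
    return urllib.request.Request(BASE_URL + path, method=method)


list_dbs = couch_request("GET", "/_all_dbs")
create_db = couch_request("PUT", "/mike")
delete_db = couch_request("DELETE", "/mike")

for req in (list_dbs, create_db, delete_db):
    print(req.get_method(), req.full_url)

# To actually send one (needs a running server):
#   import json
#   with urllib.request.urlopen(list_dbs) as response:
#       print(json.loads(response.read().decode("utf-8")))
```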

Admin interface...

Security...

CouchDB and Python...

http://wiki.apache.org/couchdb/Getting_started_with_Python

$ easy_install CouchDB
Reading http://code.google.com/p/couchdb-python/

http://packages.python.org/CouchDB/index.html
http://packages.python.org/CouchDB/getting-started.html

$ ipython
$ import couchdb
$ couch = couchdb.Server() # Assumes localhost 
# Can do couch = couchdb.Server('http://example.com:5984/')
$ for i in couch:
$    print i
_users
mike
$ db = couch['mike']
$ doc = {'foo': 'bar'}
$ db.save(doc)
(u'771f4698483b44de3ae6af0b8b00037f', u'1-4c6114c65e295552ab1019e2b046b10e')
# ID and "rev" (revision)
$ print doc
{'_rev': u'1-4c6114c65e295552ab1019e2b046b10e', 'foo': 'bar', '_id': u'771f4698483b44de3ae6af0b8b00037f'}
$ doc2 = {'_id': 'mydoc', 'foo':'bar'}
$ db.save(doc2)
(u'mydoc', u'1-4c6114c65e295552ab1019e2b046b10e')
$ for id in db:
$     print id
$
771f4698483b44de3ae6af0b8b00037f
mydoc
$ tmp = db.get('mydoc')
$ print tmp
<Document u'mydoc'@u'1-4c6114c65e295552ab1019e2b046b10e' {u'foo': u'bar'}>
$ db.delete(tmp)
$ tmp = db.get('mydoc')
# Returns "None" 
$ from couchdb import ResourceNotFound
$ try:
$     couch.delete("random")
$ except ResourceNotFound:
$     print "Not found" 
Not found

Commit 680 - modified Go.py to dump spills into CouchDB when they return from Celery workers, and the reducer loop to pull them from CouchDB. Worked fine.
#7

Updated by Jackson, Mike almost 10 years ago

Notes on MongoDB

  • http://www.mongodb.org/
  • http://www.mongodb.org/display/DOCS/CentOS+and+Fedora+Packages
    $ yum install mongo-10gen
    No package mongo-10gen available.
    
    Edit /etc/yum.repos.d/10gen.repo and add,
    
    [10gen]
    name=10gen Repository
    baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/i686
    gpgcheck=0
    
    $ yum install mongo-10gen
     mongo-10gen         i686         2.0.1-mongodb_1           10gen          28 M
    $ yum install mongo-10gen-server
     mongo-10gen-server       i686       2.0.1-mongodb_1          10gen       5.4 M
    

    http://www.mongodb.org/display/DOCS/Quickstart+Unix
    Set up data directory. I did this as root:
    $ mkdir -p /data/db
    $ chown `id -u` /data/db/
    
    --dbpath can specify a different directory, 
    
    $  /etc/init.d/mongod start
    $  /etc/init.d/mongod status
    
    Connect by default to test database on localhost
    $ mongo
    $ db.foo.save( { a : 1 } )
    $ db.foo.find()
    { "_id" : ObjectId("4ed777486efa1cd775690793"), "a" : 1 }
    $ db.foo.remove({a:1})
    $ db.foo.find()
    $
    
  • http://www.mongodb.org/display/DOCS/Manual
  • http://api.mongodb.org/python/current/
    $ easy_install pymongo
    
    Databases have collections have documents.
    
    $ from pymongo import Connection
    $ connection = Connection()
    OR
    $ connection = Connection('localhost', 27017)
    
    Lazy creation - DB will be created when document is inserted.
    
    $ mausdb = connection.mausdb
    OR
    $ mausdb = connection["mausdb"]
    
    $ mauscoll = mausdb.mauscoll
    OR
    $ mauscoll = mausdb["mauscoll"]
    
    $ mauscoll.insert({"_id":1, "doc":{"a":"b"}})
    Out[18]: 1
    $ mauscoll.insert({"doc":{"c":"d"}})
    Out[19]: ObjectId('4ed77bba966dc90c61000000')
    
    $ connection.database_names()
    Out[21]: [u'test', u'mausdb', u'local']
    
    $ mausdb.collection_names()
    Out[20]: [u'mauscoll', u'system.indexes']
    
    $ mauscoll.find_one({"_id":1})
    Out[23]: {u'_id': 1, u'doc': {u'a': u'b'}}
    
    Returns None if no match.
    
    $ cursor = mauscoll.find({"_id":1})
    $ cursor.next()
    Out[30]: {u'_id': 1, u'doc': {u'a': u'b'}}
    $ cursor.next()
    StopIteration
    [Exception]
    
    $ for a in mauscoll.find():
    $     print a
    $
    
    $ mauscoll.count()
    Out[33]: 2
    
    $ mauscoll.distinct("_id")
    [1, ObjectId('4ed77bba966dc90c61000000')]
    $ mauscoll.remove()
    []
    Remove specific doc
    $ mauscoll.remove({"_id":3})
    
    Drop the collection
    $ mauscoll.drop()
    
    Multiple inserts do not update the document:
    
    $ mauscoll.insert({"_id":"tmp", "doc":"doc1"})
    $ mauscoll.find_one({"_id":"tmp"})
     {u'_id': u'tmp', u'doc': u'doc1'}
    $ mauscoll.insert({"_id":"tmp", "doc":"doc2"})
    $ mauscoll.find_one({"_id":"tmp"})
     {u'_id': u'tmp', u'doc': u'doc1'}
    
    It needs an update e.g.
    
    $ mauscoll.update({"_id":"tmp"},{"doc":"doc2"})
    $ mauscoll.find_one({"_id":"tmp"})
     {u'_id': u'tmp', u'doc': u'doc2'}
    
    Or, use save, which inserts if the doc is not there and updates if it is e.g.
    
    $ mauscoll.save({"_id":"tmp", "doc":"doc3"})
    $ mauscoll.find_one({"_id":"tmp"})
     {u'_id': u'tmp', u'doc': u'doc3'}
    $ mauscoll.save({"_id":"tmp", "doc":"doc4"})
    $ mauscoll.find_one({"_id":"tmp"})
     {u'_id': u'tmp', u'doc': u'doc4'}
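
The three behaviours seen in this session can be mimicked with a plain dictionary. This is a toy model of the semantics observed above, not pymongo code; the class ToyCollection is hypothetical. Its update replaces the whole document apart from its _id, which is what the output above shows.

```python
class ToyCollection(object):
    """Hypothetical dict-backed stand-in mirroring the behaviour seen above."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc):
        # An insert never overwrites an existing document with the same _id.
        self._docs.setdefault(doc["_id"], dict(doc))

    def update(self, spec, doc):
        # Replace the body of an existing document, keeping its _id.
        doc_id = spec["_id"]
        if doc_id in self._docs:
            replacement = dict(doc)
            replacement["_id"] = doc_id
            self._docs[doc_id] = replacement

    def save(self, doc):
        # Insert if absent, overwrite if present.
        self._docs[doc["_id"]] = dict(doc)

    def find_one(self, spec):
        # Returns None if there is no match.
        return self._docs.get(spec["_id"])


coll = ToyCollection()
coll.insert({"_id": "tmp", "doc": "doc1"})
coll.insert({"_id": "tmp", "doc": "doc2"})   # ignored, "tmp" already exists
print(coll.find_one({"_id": "tmp"})["doc"])  # doc1
coll.update({"_id": "tmp"}, {"doc": "doc2"})
print(coll.find_one({"_id": "tmp"})["doc"])  # doc2
coll.save({"_id": "tmp", "doc": "doc3"})
print(coll.find_one({"_id": "tmp"})["doc"])  # doc3
```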
    
#8

Updated by Jackson, Mike almost 10 years ago

  • Downloaded and installed MongoDB and pymongo.
  • Wrote classes for dictionary-backed in-memory, CouchDB, and MongoDB data stores plus unit test classes.
  • Changed Go.py to use these classes.
  • Wrote MAUSDocumentCacheConfiguration.

General document store code changes:

configure
 Added
  export PYTHONPATH="\$MAUS_ROOT_DIR/src/docstore:\$PYTHONPATH" 
  export LD_LIBRARY_PATH="\$MAUS_ROOT_DIR/src/docstore:\$LD_LIBRARY_PATH" 
src/docstore
 InMemoryDocumentStore.py
  In-memory, dictionary-based doc store.
tests/py_unit/
 test_InMemoryDocumentStore.py
src/common_py/ConfigurationDefaults.py
 Added configuration parameters
  doc_store_class="InMemoryDocumentStore.InMemoryDocumentStore" 
src/common_py/Go.py
 multi_process uses doc_store_class

CouchDB code changes:

third_party/bash/40python_extras.bash
 Added
  easy_install CouchDB
src/docstore
 CouchDBDocumentStore.py
tests/py_unit/
 test_CouchDBDocumentStore.py
  Currently skips if http://localhost:5984 is not available.
src/common_py/ConfigurationDefaults.py
 Added configuration parameters
  couchdb_url = "http://localhost:5984" 
   Default CouchDB URL. Only needed if using CouchDBDocumentStore.
  couchdb_database_name = "mausdb" 
   Default CouchDB database name. Only needed if using CouchDBDocumentStore.
bin/utilities/delete_couchdb.py 
 Simple client to delete database on a CouchDB server.

MongoDB code changes:

third_party/bash/40python_extras.bash
 Added
  easy_install pymongo
src/docstore
  MongoDBDocumentStore.py 
tests/py_unit/
  test_MongoDBDocumentStore.py 
  Currently skips if http://localhost:27017 is not available.
src/common_py/ConfigurationDefaults.py
 Added configuration parameters
  mongodb_host = "localhost" 
   Default MongoDB host name. Only needed if using MongoDBDocumentStore.
  mongodb_port = 27017 
   Default MongoDB port. Only needed if using MongoDBDocumentStore.
  mongodb_database_name = "mausdb" 
   Default MongoDB database name. Only needed if using MongoDBDocumentStore.
  mongodb_collection_name = "spills" 
   Default MongoDB collection name. Only needed if using MongoDBDocumentStore.

Examples of use of each:

$ ./bin/examples/simple_histogram_example.py \
   -type_of_dataflow=multi_process \
   -doc_store_class="CouchDBDocumentStore.CouchDBDocumentStore"

$ ./bin/examples/simple_histogram_example.py \
   -type_of_dataflow=multi_process \
   -doc_store_class="MongoDBDocumentStore.MongoDBDocumentStore"
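
A doc_store_class value of the form "Module.Class" can be turned into a class at run time along these lines. This is a sketch of the general technique, not necessarily how Go.py does it; the helper name load_class is an assumption, and a standard library class stands in for the document store so the example runs anywhere.

```python
import importlib


def load_class(qualified_name):
    """Return the class named by a 'module.ClassName' string."""
    module_name, _, class_name = qualified_name.rpartition(".")
    if not module_name:
        raise ValueError("Expected 'module.ClassName', got %r" % qualified_name)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in demonstration with a standard library class. In MAUS the string
# would be e.g. "MongoDBDocumentStore.MongoDBDocumentStore".
store_class = load_class("collections.OrderedDict")
store = store_class()
print(store_class.__name__)
```

This only works if the module's directory is on PYTHONPATH, which is what the configure change above provides for src/docstore.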

Commit 693

#9

Updated by Jackson, Mike over 9 years ago

Experiments relating to data cache use in Go.py

Example of using datetime to time-stamp spills:

$ import time
$ print time.time()
1328626465.2
$ from datetime import datetime
$ datetime.fromtimestamp(time.time())
datetime.datetime(2012, 2, 7, 15, 7, 16, 610819)
$ ds.save({'_id':"1", 'doc':{'a':'b'}, "date":datetime.fromtimestamp(time.time())})
$ ds.save({'_id':"2", 'doc':{'a':'b'}, "date":datetime.fromtimestamp(time.time())})
$ ds.save({'_id':"3", 'doc':{'a':'b'}, "date":datetime.fromtimestamp(time.time())})
$ ds.save({'_id':"4", 'doc':{'a':'b'}, "date":datetime.fromtimestamp(time.time())})

And queries for all spills added since a specific datetime:
$ query = datetime(2012,2,7,15,0,10)
$ result = ds.find({"date":{"$gt":query}}).sort("_id")
$ print result.count()
4
$ query = datetime(2012,2,7,15,0,9)
$ result = ds.find({"date":{"$gt":query}}).sort("_id")
$ print result.count()
1
$ result.next()
...

If there are no more results, Python's StopIteration is raised.

Example of using a simple reduced flag:

$ ds.save({'_id':"1", 'doc':{'a':'b'}, 'reduced':False})
$ ds.save({'_id':"2", 'doc':{'a':'b'}, 'reduced':False})
$ ds.save({'_id':"3", 'doc':{'a':'b'}, 'reduced':False})
$ ds.save({'_id':"4", 'doc':{'a':'b'}, 'reduced':False})
# Imagine an update...
$ ds.save({'_id':"3", 'doc':{'a':'b'}, 'reduced':True})
# Query for non-reduced,
$ result = ds.find({"reduced":False}).sort("_id")
$ result.count()
3
# Query for reduced,
$ result = ds.find({"reduced":True}).sort("_id")
$ result.count()
1

Exceptions expected:

  • Bad connection URL or port number: pymongo.errors.AutoReconnect
  • MongoDB goes down and a database operation is attempted: socket.error on first attempt and pymongo.errors.AutoReconnect on subsequent attempts.
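
One way a caller might cope with such transient failures is a small retry wrapper. A minimal sketch, assuming retrying is acceptable for the operation in question; the helper name call_with_retry and its parameters are made up for illustration. With pymongo the retry_on tuple would presumably be (socket.error, pymongo.errors.AutoReconnect); a built-in exception stands in for those here so the example runs without a database.

```python
import time


def call_with_retry(operation, retry_on, attempts=3, delay=0.0):
    """Call operation(), retrying when one of the retry_on exceptions is raised."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts, let the caller see the error
            time.sleep(delay)


# Demonstration with a stand-in operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated outage")
    return "ok"


print(call_with_retry(flaky, (ConnectionError,)))
```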

Current state and questions

  • The database connection assumes by default a database called "mausdb" with spills in a collection called "spills" (though these can be overridden by configuration flags)
  • At present for the reduce-output stage, a spill is read from the database, sent to reduce and, if no errors arise, the spill is then removed from the database.
  • The database connection class is pluggable - users can use InMemoryDocumentStore or MongoDBDocumentStore - they have the same API so Go.py doesn't care.

Questions

  • What is the envisaged usage of the document store? Is it envisaged that multiple reduce-output clients would pull data from a common set of spills output by a single input-map run and convert this to histograms?
    • If not, then how will multiple histograms end up in the web interface? By running multiple instances of Go.py each with their own map-reduce workflows?
    • If so, then the reduce flag isn't applicable and we need to go with the timestamp (or some other option) so each reduce-output client doesn't process the same spill twice.
    • For queries:
      • Either, need to abandon InMemoryDocumentStore and just have Go.py explicitly interact with MongoDB directly
      • Or, have query-specific functions in the API (e.g. get_non_reduced_spills or get_spills_added_since_time(...)). This then requires simple implementations of these queries for InMemoryDocumentStore.
  • Instead of using a default collection "spills" it may be better to name this after the process ID/host of the spill depositor? This though, would require a new (optional) command-line parameter to be supported for reduce-output clients, so they know which collection to pull spills from?
#10

Updated by Jackson, Mike over 9 years ago

Decision:

  • Avoid destructive reading of spills. Timestamp spills and readers keep track of latest spill read.
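
A rough sketch of what this decision implies for a reader: each reader remembers the timestamp of the latest spill it has seen, so several readers can consume the same spills independently and nothing is deleted. The class SpillReader and the (timestamp, document) list are assumptions made for the example. Spills that share exactly the timestamp of the last one read would be skipped, which this sketch ignores.

```python
from datetime import datetime, timedelta


class SpillReader(object):
    """Hypothetical reader that remembers the timestamp of the latest spill read."""

    def __init__(self):
        self.last_read = datetime.min

    def read_new(self, spills):
        """Return documents newer than last_read, oldest first, without removing them.

        spills is a list of (timestamp, document) pairs.
        """
        fresh = sorted((s for s in spills if s[0] > self.last_read),
                       key=lambda s: s[0])
        if fresh:
            self.last_read = fresh[-1][0]
        return [doc for _, doc in fresh]


start = datetime(2012, 2, 7, 15, 0, 0)
spills = [(start + timedelta(seconds=i), {"_id": str(i)}) for i in range(1, 4)]

reader_a = SpillReader()
reader_b = SpillReader()
print(len(reader_a.read_new(spills)))  # 3
spills.append((start + timedelta(seconds=4), {"_id": "4"}))
print(len(reader_a.read_new(spills)))  # 1, only the new spill
print(len(reader_b.read_new(spills)))  # 4, reader_b has its own position
```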
#11

Updated by Jackson, Mike over 9 years ago

To 752. Added simple client to delete MongoDB collection or database. Added support for MongoDB disconnect to document store API.

#12

Updated by Jackson, Mike over 9 years ago

  • Priority changed from Normal to Urgent

Do query by timestamp, add component to read from latest timestamp that can be used in Go.py. Add test that uses 2xTOF plot mergers.

#13

Updated by Jackson, Mike over 9 years ago

To 776

  • Removed CouchDB code for maintainability purposes. Focus on MongoDB.
  • Added DocumentStore super-class.
  • Added get_since method to DocumentStore and sub-classes to return documents added since a given time, in date-sorted order.
  • Changed InMemoryDocumentStore to timestamp additions.
  • Added new tests and pulled out common tests into test super-class.
  • Set MongoDB to be default doc_store_class.
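
For illustration, the shape of these changes might look something like the following. Only the names DocumentStore, InMemoryDocumentStore and get_since come from the notes above; the put method, its when parameter and the stored document layout are assumptions for this sketch and not the actual MAUS code.

```python
from datetime import datetime


class DocumentStore(object):
    """Sketch of a super-class; everything except get_since is an assumed name."""

    def put(self, doc_id, doc, when=None):
        raise NotImplementedError

    def get_since(self, earliest=None):
        raise NotImplementedError


class InMemoryDocumentStore(DocumentStore):
    """Dictionary-backed store that timestamps each addition."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, doc, when=None):
        # 'when' lets callers (and tests) fix the timestamp; default is now.
        self._docs[doc_id] = {"_id": doc_id, "doc": doc,
                              "date": when or datetime.now()}

    def get_since(self, earliest=None):
        """Return documents added after 'earliest' (all if None), oldest first."""
        docs = [d for d in self._docs.values()
                if earliest is None or d["date"] > earliest]
        return sorted(docs, key=lambda d: d["date"])


store = InMemoryDocumentStore()
store.put("1", {"a": "b"}, when=datetime(2012, 2, 7, 15, 0, 9))
store.put("2", {"a": "b"}, when=datetime(2012, 2, 7, 15, 0, 11))
recent = store.get_since(datetime(2012, 2, 7, 15, 0, 10))
print([d["_id"] for d in recent])
```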
#14

Updated by Jackson, Mike over 9 years ago

Last week's tasks:

Commits up to 790

Data cache #705

  • Added bin/utilities/summarise_mongodb.py, which summarises collection names, sizes and numbers of documents in MongoDB
  • Changed DocumentStore API to explicitly support notion of a named collection in the document store.
    Allows use of process ID and partitioning of spills from runs or jobs etc.
#15

Updated by Jackson, Mike over 9 years ago

Error that arises when trying to get 160 spills sorted by date order:

pymongo.errors.OperationFailure: database error: too much data for sort() 
with no index.  add an index or specify a smaller limit

Fix is to create an index,
$ coll.create_index("date")
$ coll.index_information()
{u'_id_': {u'key': [(u'_id', 1)], u'v': 1},
 u'date_1': {u'key': [(u'date', 1)], u'v': 1}}

MongoDBDocumentStore now indexes by date - 799
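
In code, the fix might be wrapped up along these lines. A sketch only, not the MongoDBDocumentStore implementation: the function name find_sorted_by_date is made up, and coll is assumed to be a pymongo collection (only its create_index, find and sort calls are used).

```python
def find_sorted_by_date(coll):
    """Ensure an index on "date" exists, then return all documents sorted by it.

    'coll' is assumed to be a pymongo collection.
    """
    coll.create_index("date")           # does nothing if the index already exists
    return coll.find().sort("date", 1)  # 1 means ascending order


# Usage, assuming a MongoDB server on localhost and pymongo installed:
#   import pymongo
#   coll = pymongo.MongoClient()["mausdb"]["spills"]
#   for doc in find_sorted_by_date(coll):
#       print(doc)
```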

#16

Updated by Jackson, Mike over 9 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100
#17

Updated by Rogers, Chris over 9 years ago

  • Target version changed from Future MAUS release to MAUS-v0.2.0
