Feature #705
Data cache
100%
Description
Component to cache results output by chains of map workers until they're ready for consumption by reducers. Based on CouchDB.
Updated by Jackson, Mike about 12 years ago
From MAUS SSI Component Design - Online Reconstruction:
- At present an in-memory buffer is used to cache data - see temp_file in src/common_py/Go.py
- Use CouchDB high-speed document-oriented database, http://couchdb.apache.org/, to support multiple I/O requests and large data volumes.
- CouchDB will be flushed after every run.
- A run may be every 2 days or so.
- Define a data store interface. Initial prototype can be in memory buffer (as at present), then hook in CouchDB.
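A minimal sketch of the data store interface idea in the last point, assuming hypothetical names (DocumentStore, InMemoryStore, put/get/delete/ids) that are not taken from the MAUS code. The in-memory prototype and any later database-backed class would implement the same methods, so the caller does not need to know which one it has:

```python
class DocumentStore(object):
    """Hypothetical minimal interface for a spill cache."""

    def put(self, doc_id, doc):
        raise NotImplementedError

    def get(self, doc_id):
        raise NotImplementedError

    def delete(self, doc_id):
        raise NotImplementedError

    def ids(self):
        raise NotImplementedError


class InMemoryStore(DocumentStore):
    """Dictionary-backed prototype of the interface above."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, doc):
        self._docs[doc_id] = doc

    def get(self, doc_id):
        # Returns None if the ID is unknown.
        return self._docs.get(doc_id)

    def delete(self, doc_id):
        # Deleting an unknown ID is silently ignored.
        self._docs.pop(doc_id, None)

    def ids(self):
        return sorted(self._docs)


store = InMemoryStore()
store.put("spill-1", {"a": "b"})
store.put("spill-2", {"c": "d"})
store.delete("spill-1")
```

A database-backed subclass would keep the same four methods and only change what happens inside them.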
Updated by Tunnell, Christopher about 12 years ago
A run is normally 2 hours and rarely 8. It's mainly dictated by our shift schedule, where we run only one shift per day. For safety reasons we can't run overly long shifts, and we don't have the manpower for 24/7 running.
Updated by Tunnell, Christopher about 12 years ago
Per CouchDB issues. I tried building from source on Ubuntu and got:
configure: error: Could not find the js library. Is the Mozilla SpiderMonkey library installed?
Then just thought "screw it" and ran:
sudo apt-get install couchdb
on Ubuntu, which works. We can always make the machine Ubuntu, if it takes less time to set up an Ubuntu server than to sort out that problem.
Updated by Jackson, Mike about 12 years ago
Your error is the same as mine...
$ wget http://www.mirrorservice.org/sites/ftp.apache.org/couchdb/1.1.1/apache-couchdb-1.1.1.tar.gz
$ gunzip apache-couchdb-1.1.1.tar.gz
$ tar -xf apache-couchdb-1.1.1.tar
$ cd apache-couchdb-1.1.1
$ cd /home/michaelj/couchdb/apache-couchdb-1.1.1
$ ./configure
...
checking for JS_NewContext in -lmozjs... no
checking for JS_NewContext in -ljs... no
checking for JS_NewContext in -ljs3250... no
checking for JS_NewContext in -ljs32... no
configure: error: Could not find the js library. Is the Mozilla SpiderMonkey library installed?
So,
$ yum install js
Loaded plugins: kernel-module
Setting up Install Process
Package js-1.70-8.el5.i386 already installed and latest version
Nothing to do
So,
$ ./configure
Is the Mozilla SpiderMonkey library installed?
Google "couchdb is the spidermonkey installed" gives http://wiki.apache.org/couchdb/Installing_SpiderMonkey which states "yum install js-devel", so
$ yum install js-devel
$ ./configure
checking whether JSOPTION_ANONFUNFIX is declared... no
configure: error: Your SpiderMonkey library is too new. Versions of SpiderMonkey after the js185-1.0.0 release remove the optional enforcement of preventing anonymous functions in a statement context. This will most likely break your existing JavaScript code as well as render all example code invalid.
This was confusing as I have 1.70-8.el5!
Looking at configure, it checks for JSOPTION_ANONFUNFIX. This was removed after js185-1.0.0. Checking the JS download, it was not added until js-1.8.0-rc1, so I'm right to get the error (that flag is not there) but for the wrong reason (my version is too old, not too new).
Try and build 185-1.0.0 (see http://wiki.apache.org/couchdb/Installing_SpiderMonkey),
$ wget http://ftp.mozilla.org/pub/mozilla.org/js/js185-1.0.0.tar.gz
$ cd js/src
$ make BUILD_OPT=1 -f Makefile.ref
According to http://old.nabble.com/How-can-i-install-SpiderMonkey--td20791332.html the make -f approach was for 1.7, but https://developer.mozilla.org/En/SpiderMonkey/Build_Documentation says
cd js/src
autoconf-2.13
./configure
make
But my autoconf is 2.59 and it says no greater than 2.13 is suitable.
$ yum install autoconf213
autoconf213 noarch 2.13-12.1 sl-base 253 k
Try again,
$ cd js/src
$ autoconf-2.13
$ ./configure
...
checking for 64-bit OS... no
checking for Python version >= 2.5 but not 3.x... configure: error: Python 2.5 or higher (but not Python 3.x) is required.
$ python -V
2.4
So,
$ yum install python26
================================================================================
Installing:
python26 i386 2.6.5-6.el5 epel 6.5 M
Installing for dependencies:
libffi i386 3.0.9-1.el5.rf rpmforge 87 k
python26-libs i386 2.6.5-6.el5 epel 667 k
$ ./configure
$ make
$ ./js -help
JavaScript-C 1.8.5 2011-03-31
$ pwd
/home/michaelj/couchdb/js-1.8.5/js/src
Return to CouchDB
$ cd /home/michaelj/couchdb/apache-couchdb-1.1.1
$ ./configure --with-js-lib=/home/michaelj/couchdb/js-1.8.5/js/src/ --with-js-include=/home/michaelj/couchdb/js-1.8.5/js/src/
- The icu-config script could not be found. Make sure it is
- in your path, and that taglib is properly installed.
- Or see http://ibm.com/software/globalization/icu/
configure: error: Library requirements (ICU) not met.
Google and find http://wiki.apache.org/couchdb/Error_messages#Missing_icu-config,
$ yum install libicu-dev
# Not found
$ yum install libicu
# 3rd time lucky,
$ yum install icu
================================================================================
Installing:
icu i386 3.6-5.16 sl-base 187 k
$ which icu-config
# Not found
$ yum install libicu-devel
================================================================================
Installing:
libicu-devel i386 3.6-5.16 sl-base 571 k
$ which icu-config
/usr/bin/icu-config
# Finally!
Try again,
$ ./configure --with-js-lib=/home/michaelj/couchdb/js-1.8.5/js/src/ --with-js-include=/home/michaelj/couchdb/js-1.8.5/js/src/
checking for curl >= 7.18.0... can't find curl >= 7.18.0
configure: error: Library requirements (curl) not met.
According to http://wiki.apache.org/couchdb/Installing_on_RHEL5, curl >= 7.18 is needed for CouchDB > 0.11.
$ wget http://curl.haxx.se/download/curl-7.20.1.tar.gz
$ tar -xzf curl-7.20.1.tar.gz
$ cd curl-7.20.1
$ ./configure --prefix=/usr/local
$ make
$ make install
And back to CouchDB,
$ ./configure --with-js-lib=/home/michaelj/couchdb/js-1.8.5/js/src/ --with-js-include=/home/michaelj/couchdb/js-1.8.5/js/src/
configure: error: The installed Erlang version is less than erts-5.7.3 (R13B02)
$ yum info erlang
Release : 5.10.el5
At which point I decide to just stick with the yum installable CouchDB 0.11.2.
$ yum install couchdb
===============================================================================
Package           Arch   Version        Repository   Size
===============================================================================
Installing:
couchdb           i386   0.11.2-2.el5   epel         557 k
Installing for dependencies:
erlang-ibrowse    i386   2.1.0-1.el5    epel          48 k
erlang-mochiweb   i386   1.4.1-5.el5    epel         368 k
erlang-oauth      i386   1.0.1-1.el5    epel          27 k
js                i386   1.70-8.el5     epel         394 k
Updated by Jackson, Mike about 12 years ago
CouchDB familiarisation
Reinstalled...
http://wiki.apache.org/couchdb/Installing_on_RHEL5
$ yum install couchdb
couchdb i386 1.0.1-2.el5.rf rpmforge 749 k
Configuration file (untouched for now): /etc/couchdb/local.ini
$ service couchdb start
Starting database server couchdb
chown: `couchdb': invalid user
su: user couchdb does not exist
Never happened before :-(
$ yum remove couchdb
$ yum install couchdb
Loaded plugins: kernel-module
Setting up Install Process
Package couchdb-1.0.1-2.el5.rf.i386 already installed and latest version
Nothing to do
$ yum list couchdb
Loaded plugins: kernel-module
Installed Packages
couchdb.i386 1.0.1-2.el5.rf installed
$ yum clean all
$ yum install couchdb
Still there!
$ rpm -qa couchdb
couchdb-1.0.1-2.el5.rf
$ rpm -e couchdb
error: %preun(couchdb-1.0.1-2.el5.rf.i386) scriptlet failed, exit status 1
Google,
$ rpm -e --noscripts couchdb
$ yum install couchdb
Verify and test...
http://wiki.apache.org/couchdb/Installing_on_Ubuntu
http://wiki.apache.org/couchdb/Verify_and_Test_Your_Installation
$ /sbin/service couchdb start
Starting database server couchdb
$ /sbin/service couchdb status
Apache CouchDB is running as process 6723, time to relax.
[or /etc/init.d/couchdb status]
$ curl http://localhost:5984
{"couchdb":"Welcome","version":"1.0.1"}
Using Curl...
http://guide.couchdb.org/draft/tour.html
$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_users"]
$ curl -X PUT http://127.0.0.1:5984/mike
{"ok":true}
$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_users","mike"]
$ curl -X DELETE http://127.0.0.1:5984/mike
{"ok":true}
$ curl -X GET http://127.0.0.1:5984/_all_dbs
["_users"]
Admin interface...
- Use a browser, http://localhost:5984/_utils/
Security...
- http://guide.couchdb.org/draft/security.html
- Default - no authorization - anyone on localhost/ can do anything.
- Options: VPN, behind HTTP proxy (e.g. Apache httpd's mod_proxy, nginx, or varnish)
- Need to revisit.
CouchDB and Python...
http://wiki.apache.org/couchdb/Getting_started_with_Python
$ easy_install CouchDB
Reading http://code.google.com/p/couchdb-python/
http://packages.python.org/CouchDB/index.html
http://packages.python.org/CouchDB/getting-started.html
$ ipython
$ import couchdb
$ couch = couchdb.Server() # Assumes localhost
# Can do couch = couchdb.Server('http://example.com:5984/')
$ for i in couch:
$     print i
_users
mike
$ db = couch['mike']
$ doc = {'foo': 'bar'}
$ db.save(doc)
(u'771f4698483b44de3ae6af0b8b00037f', u'1-4c6114c65e295552ab1019e2b046b10e') # ID and "rev" (revision)
$ print doc
{'_rev': u'1-4c6114c65e295552ab1019e2b046b10e', 'foo': 'bar', '_id': u'771f4698483b44de3ae6af0b8b00037f'}
$ doc2 = {'_id': 'mydoc', 'foo':'bar'}
$ db.save(doc2)
(u'mydoc', u'1-4c6114c65e295552ab1019e2b046b10e')
$ for id in db:
$     print id
771f4698483b44de3ae6af0b8b00037f
mydoc
$ tmp = db.get('mydoc')
$ print tmp
<Document u'mydoc'@u'1-4c6114c65e295552ab1019e2b046b10e' {u'foo': u'bar'}>
$ db.delete(tmp)
$ tmp = db.get('mydoc') # Returns "None"
$ from couchdb import ResourceNotFound
$ try:
$     couch.delete("random")
$ except ResourceNotFound:
$     print "Not found"
Not found
Commit 680 - modified Go.py to dump spills into CouchDB when they return from Celery workers, and for the reducer loop to pull these from CouchDB. Worked fine.
Updated by Jackson, Mike about 12 years ago
Notes on MongoDB
- http://www.mongodb.org/
- http://www.mongodb.org/display/DOCS/CentOS+and+Fedora+Packages
$ yum install mongo-10gen
No package mongo-10gen available.
Edit /etc/yum.repos.d/10gen.repo and add,
[10gen]
name=10gen Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/i686
gpgcheck=0
$ yum install mongo-10gen
mongo-10gen i686 2.0.1-mongodb_1 10gen 28 M
$ yum install mongo-10gen-server
mongo-10gen-server i686 2.0.1-mongodb_1 10gen 5.4 M
- http://www.mongodb.org/display/DOCS/Quickstart+Unix
Set up data directory. I did this as root:
$ mkdir -p /data/db
$ chown `id -u` /data/db/
--dbpath can specify a different directory,
$ /etc/init.d/mongod start
$ /etc/init.d/mongod status
Connect by default to test database on localhost
$ mongo
$ db.foo.save( { a : 1 } )
$ db.foo.find()
{ "_id" : ObjectId("4ed777486efa1cd775690793"), "a" : 1 }
$ db.foo.remove({a:1})
$ db.foo.find()
$
- http://www.mongodb.org/display/DOCS/Manual
- http://api.mongodb.org/python/current/
$ easy_install pymongo
Databases have collections have documents.
$ from pymongo import Connection
$ connection = Connection()
OR
$ connection = Connection('localhost', 27017)
Lazy creation - DB will be created when document is inserted.
$ mausdb = connection.mausdb
OR
$ mausdb = connection["mausdb"]
$ mauscoll = mausdb.mauscoll
OR
$ mauscoll = mausdb["mauscoll"]
$ mauscoll.insert({"_id":1, "doc":{"a":"b"}})
Out[18]: 1
$ mauscoll.insert({"doc":{"c":"d"}})
Out[19]: ObjectId('4ed77bba966dc90c61000000')
$ connection.database_names()
Out[21]: [u'test', u'mausdb', u'local']
$ mausdb.collection_names()
Out[20]: [u'mauscoll', u'system.indexes']
$ mauscoll.find_one({"_id":ObjectId('4ed77bba966dc90c61000000')})
Out[23]: {u'_id': 1, u'doc': {u'a': u'b'}}
Returns None if no match.
$ cursor = mauscoll.find({"_id":1})
$ cursor.next()
Out[30]: {u'_id': 1, u'doc': {u'a': u'b'}}
$ cursor.next()
StopIteration [Exception]
$ for a in mauscoll.find():
$     print a
$
$ mauscoll.count()
Out[33]: 2
$ mauscoll.distinct("_id")
[1, ObjectId('4ed77bba966dc90c61000000')]
$ mauscoll.remove()
[]
Remove specific doc
$ mauscoll.remove({"_id":3})
Drop the collection
$ mauscoll.drop()
Multiple inserts do not update the document:
$ mauscoll.insert({"_id":"tmp", "doc":"doc1"})
$ mauscoll.find_one({"_id":"tmp"})
{u'_id': u'tmp', u'doc': u'doc1'}
$ mauscoll.insert({"_id":"tmp", "doc":"doc2"})
$ mauscoll.find_one({"_id":"tmp"})
{u'_id': u'tmp', u'doc': u'doc1'}
It needs an update e.g.
$ mauscoll.update({"_id":"tmp"},{"doc":"doc2"})
$ mauscoll.find_one({"_id":"tmp"})
{u'_id': u'tmp', u'doc': u'doc2'}
Or, use save, which inserts if the doc is not there and updates if it is e.g.
$ mauscoll.save({"_id":"tmp", "doc":"doc3"})
$ mauscoll.find_one({"_id":"tmp"})
{u'_id': u'tmp', u'doc': u'doc3'}
$ mauscoll.save({"_id":"tmp", "doc":"doc4"})
$ mauscoll.find_one({"_id":"tmp"})
{u'_id': u'tmp', u'doc': u'doc4'}
Updated by Jackson, Mike about 12 years ago
- Downloaded and installed MongoDB and pymongo.
- Wrote classes for dictionary-backed in-memory, CouchDB, and MongoDB data stores plus unit test classes.
- Changed Go.py to use these classes.
- Wrote MAUSDocumentCacheConfiguration.
General document store code changes:
- configure
  Added
  export PYTHONPATH="\$MAUS_ROOT_DIR/src/docstore:\$PYTHONPATH"
  export LD_LIBRARY_PATH="\$MAUS_ROOT_DIR/src/docstore:\$LD_LIBRARY_PATH"
- src/docstore
  InMemoryDocumentStore.py
  In-memory, dictionary-based doc store.
- tests/py_unit/
  test_InMemoryDocumentStore.py
- src/common_py/ConfigurationDefaults.py
  Added configuration parameters
  doc_store_class="InMemoryDocumentStore.InMemoryDocumentStore"
- src/common_py/Go.py
  multi_process uses doc_store_class
CouchDB code changes:
- third_party/bash/40python_extras.bash
  Added easy_install CouchDB
- src/docstore
  CouchDBDocumentStore.py
- tests/py_unit/
  test_CouchDBDocumentStore.py
  Currently skips if http://localhost:5984 is not available.
- src/common_py/ConfigurationDefaults.py
  Added configuration parameters
  couchdb_url = "http://localhost:5984"
  Default CouchDB URL. Only needed if using CouchDBDocumentStore.
  couchdb_database_name = "mausdb"
  Default CouchDB database name. Only needed if using CouchDBDocumentStore.
- bin/utilities/delete_couchdb.py
  Simple client to delete database on a CouchDB server.
MongoDB code changes:
- third_party/bash/40python_extras.bash
  Added easy_install pymongo
- src/docstore
  MongoDBDocumentStore.py
- tests/py_unit/
  test_MongoDBDocumentStore.py
  Currently skips if http://localhost:27017 is not available.
- src/common_py/ConfigurationDefaults.py
  Added configuration parameters
  mongodb_host = "localhost"
  Default MongoDB host name. Only needed if using MongoDBDocumentStore.
  mongodb_port = 27017
  Default MongoDB port. Only needed if using MongoDBDocumentStore.
  mongodb_database_name = "mausdb"
  Default MongoDB database name. Only needed if using MongoDBDocumentStore.
  mongodb_collection_name = "spills"
  Default MongoDB collection name. Only needed if using MongoDBDocumentStore.
Examples of use of each:
$ ./bin/examples/simple_histogram_example.py -type_of_dataflow=multi_process -doc_store_class="CouchDBDocumentStore.CouchDBDocumentStore"
$ ./bin/examples/simple_histogram_example.py -type_of_dataflow=multi_process -doc_store_class="MongoDBDocumentStore.MongoDBDocumentStore"
Commit 693
Updated by Jackson, Mike almost 12 years ago
Experiments relating to data cache use in Go.py
Example of using datetime to time-stamp spills:
$ import time
$ print time.time()
1328626465.2
$ from datetime import datetime
$ datetime.fromtimestamp(time.time())
datetime.datetime(2012, 2, 7, 15, 7, 16, 610819)
$ ds.save({'_id':"1", 'doc':{'a':'b'}, "date":datetime.fromtimestamp(time.time())})
$ ds.save({'_id':"2", 'doc':{'a':'b'}, "date":datetime.fromtimestamp(time.time())})
$ ds.save({'_id':"3", 'doc':{'a':'b'}, "date":datetime.fromtimestamp(time.time())})
$ ds.save({'_id':"4", 'doc':{'a':'b'}, "date":datetime.fromtimestamp(time.time())})
And queries for all spills added since a specific datetime:
$ query = datetime(2012,2,7,15,0,10)
$ result = ds.find({"date":{"$gt":query}}).sort("_id")
$ print result.count()
4
$ query = datetime(2012,2,7,15,0,9)
$ result = ds.find({"date":{"$gt":query}}).sort("_id")
$ print result.count()
1
$ result.next()
...
If there are no more results then Python's StopIteration is raised.
Example of using a simple reduced flag:
$ ds.save({'_id':"1", 'doc':{'a':'b'}, 'reduced':False})
$ ds.save({'_id':"2", 'doc':{'a':'b'}, 'reduced':False})
$ ds.save({'_id':"3", 'doc':{'a':'b'}, 'reduced':False})
$ ds.save({'_id':"4", 'doc':{'a':'b'}, 'reduced':False})
# Imagine an update...
$ ds.save({'_id':"3", 'doc':{'a':'b'}, 'reduced':True})
# Query for non-reduced,
$ result = ds.find({"reduced":False}).sort("_id")
$ result.count()
3
# Query for reduced,
$ result = ds.find({"reduced":True}).sort("_id")
$ result.count()
1
Exceptions expected:
- Bad connection URL or port number: pymongo.errors.AutoReconnect
- MongoDB goes down and a database operation is attempted: socket.error on first attempt and pymongo.errors.AutoReconnect on subsequent attempts.
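One way a caller might cope with the exceptions listed above is a small retry wrapper. This is only a sketch: call_with_retry and flaky_count are hypothetical names, not part of Go.py or pymongo, and the example uses socket.error alone so that it runs without a MongoDB server. With pymongo installed, retry_on could be (socket.error, pymongo.errors.AutoReconnect).

```python
import socket
import time


def call_with_retry(operation, retry_on, attempts=3, delay=0.0):
    """Call operation(); on one of the retry_on exceptions, wait and retry.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay)


# Stand-in for a database call: fails twice with socket.error, then succeeds.
calls = {"count": 0}


def flaky_count():
    calls["count"] += 1
    if calls["count"] < 3:
        raise socket.error("connection lost")
    return 42


result = call_with_retry(flaky_count, retry_on=(socket.error,))
```

Whether retrying is appropriate, and how many times, depends on how long the reducers can afford to wait for the database to come back.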
Current state and questions
- The database connection assumes by default a database called "mausdb" with spills in a collection called "spills" (though these can be overridden by configuration flags)
- At present for the reduce-output stage, a spill is read from the database, sent to reduce and, if no errors arise, the spill is then removed from the database.
- The database connection class is pluggable - users can use InMemoryDocumentStore or MongoDBDocumentStore - they have the same API so Go.py doesn't care.
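The pluggable class is selected by a "Module.Class" string such as the doc_store_class values shown earlier. A minimal sketch of how such a string could be resolved is below; load_class is a hypothetical helper and not necessarily how Go.py does it, and a standard-library class stands in for the document store so the example runs anywhere.

```python
import importlib


def load_class(qualified_name):
    """Return the class named by a 'module.ClassName' string."""
    module_name, _, class_name = qualified_name.rpartition(".")
    if not module_name:
        raise ValueError("expected 'module.ClassName', got %r" % qualified_name)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Standard-library stand-in; in MAUS the string would be something like
# "InMemoryDocumentStore.InMemoryDocumentStore".
store_class = load_class("collections.OrderedDict")
store = store_class()
store["spill-1"] = {"a": "b"}
```

Because the caller only ever sees the returned class, swapping the store is a configuration change rather than a code change.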
Questions
- What is the envisaged usage of the document store? Is it envisaged that multiple reduce-output clients would pull data from a common set of spills output by a single input-map run and convert this to histograms?
- If not, then how will multiple histograms end up in the web interface? By running multiple instances of Go.py each with their own map-reduce workflows?
- If so, then the reduced flag isn't applicable and we need to go with the timestamp (or some other option) so each reduce-output client doesn't process the same spill twice.
- For queries:
  - Either, need to abandon InMemoryDocumentStore and just have Go.py explicitly interact with MongoDB directly
  - Or, have query-specific functions in the API (e.g. get_non_reduced_spills or get_spills_added_since_time(...)). This then requires simple implementations of these queries for InMemoryDocumentStore.
- Instead of using a default collection "spills" it may be better to name this after the process ID/host of the spill depositor? This though, would require a new (optional) command-line parameter to be supported for reduce-output clients, so they know which collection to pull spills from?
Updated by Jackson, Mike almost 12 years ago
Decision:
- Avoid destructive reading of spills. Timestamp spills and readers keep track of latest spill read.
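A rough sketch of what this decision implies for readers, assuming a hypothetical SpillReader class and a plain list standing in for the document store. Each reader remembers the newest timestamp it has seen, so several readers can consume the same spills independently and nothing is removed:

```python
from datetime import datetime


class SpillReader(object):
    """Reads spills non-destructively, remembering the newest timestamp seen."""

    def __init__(self):
        self.last_read = datetime.min

    def read_new(self, spills):
        """Return spills newer than last_read, oldest first."""
        new = sorted((s for s in spills if s["date"] > self.last_read),
                     key=lambda s: s["date"])
        if new:
            self.last_read = new[-1]["date"]
        return new


spills = [
    {"_id": "1", "date": datetime(2012, 2, 7, 15, 0, 1)},
    {"_id": "2", "date": datetime(2012, 2, 7, 15, 0, 2)},
]
reader_a, reader_b = SpillReader(), SpillReader()

first = [s["_id"] for s in reader_a.read_new(spills)]
spills.append({"_id": "3", "date": datetime(2012, 2, 7, 15, 0, 3)})
second = [s["_id"] for s in reader_a.read_new(spills)]
# reader_b has read nothing yet, so it still sees every spill.
other = [s["_id"] for s in reader_b.read_new(spills)]
```

Note that two spills with exactly the same timestamp arriving in different polls could be missed by the strict "newer than" comparison; a real reader would need to decide how to handle that.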
Updated by Jackson, Mike almost 12 years ago
To 752. Added simple client to delete MongoDB collection or database. Added support for MongoDB disconnect to document store API.
Updated by Jackson, Mike almost 12 years ago
- Priority changed from Normal to Urgent
Do query by timestamp, add component to read from latest timestamp that can be used in Go.py. Add test that uses 2xTOF plot mergers.
Updated by Jackson, Mike almost 12 years ago
To 776
- Removed CouchDB code for maintainability purposes. Focus on MongoDB.
- Added DocumentStore super-class.
- Added get_since method to DocumentStore and sub-classes to return documents added since a given time, in date-sorted order.
- Changed InMemoryDocumentStore to timestamp additions.
- Added new tests and pulled out common tests into test super-class.
- Set MongoDB to be default doc_store_class.
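To illustrate the get_since idea for the in-memory case, here is a sketch under assumptions: TimestampedStore, put and the optional when argument are hypothetical, and the real get_since signature may differ. Additions are timestamped and get_since returns them in date order:

```python
from datetime import datetime, timedelta


class TimestampedStore(object):
    """Hypothetical in-memory store that timestamps each addition."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, doc, when=None):
        # 'when' is only a hook for deterministic examples and tests.
        self._docs[doc_id] = {"_id": doc_id, "doc": doc,
                              "date": when or datetime.now()}

    def get_since(self, earliest=None):
        """Return documents added after 'earliest', oldest first."""
        docs = [d for d in self._docs.values()
                if earliest is None or d["date"] > earliest]
        return sorted(docs, key=lambda d: d["date"])


base = datetime(2012, 2, 7, 15, 0, 0)
store = TimestampedStore()
# Inserted out of order on purpose to show the date sort.
store.put("b", {"n": 2}, when=base + timedelta(seconds=2))
store.put("a", {"n": 1}, when=base + timedelta(seconds=1))
store.put("c", {"n": 3}, when=base + timedelta(seconds=3))

all_ids = [d["_id"] for d in store.get_since()]
recent_ids = [d["_id"] for d in store.get_since(base + timedelta(seconds=1))]
```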
Updated by Jackson, Mike almost 12 years ago
Last week's tasks:
Commits up to 790
Data cache #705
- Added bin/utilities/summarise_mongodb.py, which summarises collection names, sizes and numbers of documents in MongoDB
- Changed DocumentStore API to explicitly support the notion of a named collection in the document store.
Allows use of process ID and partitioning of spills from runs or jobs etc.
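As an illustration of partitioning by depositor, a collection name could be derived from the host and process ID. The naming scheme and the collection_name helper below are assumptions for the sake of example, not what MAUS uses:

```python
import os
import socket


def collection_name(prefix="spills", host=None, pid=None):
    """Build a per-depositor collection name such as 'spills_myhost_1234'."""
    host = host or socket.gethostname()
    pid = os.getpid() if pid is None else pid
    # Dots are replaced so the host part reads as a single token.
    return "%s_%s_%d" % (prefix, host.replace(".", "_"), pid)


# Explicit host and pid make the result predictable; by default the
# current machine and process are used.
name = collection_name(host="onrec01.example.org", pid=1234)
```

A reduce-output client would then need to be told this name, for example through a command-line parameter, in order to pull spills from the right collection.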
Updated by Jackson, Mike over 11 years ago
Error that arises when trying to get 160 spills sorted by date order:
pymongo.errors.OperationFailure: database error: too much data for sort() with no index. add an index or specify a smaller limit
Fix is to create an index,
$ coll.create_index("date")
$ coll.index_information()
{u'_id_': {u'key': [(u'_id', 1)], u'v': 1}, u'date_1': {u'key': [(u'date', 1)], u'v': 1}}
MongoDBDocumentStore now indexes by date - 799
Updated by Jackson, Mike over 11 years ago
- Status changed from Open to Closed
- % Done changed from 0 to 100
Updated by Rogers, Chris over 11 years ago
- Target version changed from Future MAUS release to MAUS-v0.2.0