Extracting 10 years of USAspending.gov data into CouchDB (full360.com)
22 points by bbgm on April 12, 2009 | 10 comments


Off topic, but that logo sure looks familiar...

http://images.google.com/images?q=xbox+360+logo


And the domain is "www.full360.com" as well. Perhaps they originally wanted to be a games blog :-)


Ha! No, we were never meant to be a games blog. Full 360 was about touching all points of analytics in the enterprise. There are only so many ways to represent 360 degrees, so it ended up that way unintentionally.


Disclosure: Part of my responsibilities at AWS includes the AWS Public Data Sets program.


Nice. As an aside, would it be possible to share your thoughts about the performance of CouchDB when loaded up with that much data?


Loading the data, including parsing the XML and converting it to JSON, ran at about 50,000 records per hour on a c1.medium AWS instance.

Just transforming the data from one JSON format to another and loading it into a new CouchDB database is much faster, at about 200,000 records per hour. The server does sometimes trip up during the bulk load and needs a restart. This happens about once every 600-700k records.

Reading the data is extremely quick. Creating the views on an existing database is slow, but once they are created, accessing the data through the view keys is very fast.
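
For anyone curious what the XML-to-JSON and bulk-load step might look like, here is a minimal sketch. It is not the actual loader: the `record` element name, the flat child-element layout, and the batch size are assumptions made for illustration. It only builds the JSON bodies in the `{"docs": [...]}` shape that CouchDB's `_bulk_docs` endpoint accepts; sending them to the server is left out.

```python
import json
import xml.etree.ElementTree as ET
from itertools import islice


def record_to_doc(elem):
    """Flatten one XML record into a dict of child tag -> text.

    Assumes a flat record (no nested elements); the real feed may differ.
    """
    return {child.tag: (child.text or "").strip() for child in elem}


def iter_docs(xml_text, record_tag="record"):
    """Yield one dict per record element found in the XML text."""
    root = ET.fromstring(xml_text)
    for elem in root.iter(record_tag):
        yield record_to_doc(elem)


def batched(items, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_payloads(xml_text, batch_size=1000, record_tag="record"):
    """Yield JSON request bodies, one per batch, for a POST to _bulk_docs."""
    for chunk in batched(iter_docs(xml_text, record_tag), batch_size):
        yield json.dumps({"docs": chunk})
```

Keeping the batches small and recording the last batch that went in would make it easier to resume after one of the restarts mentioned above, though that part is not shown here.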


Thanks!


Thanks for public data sets! I like AWS in general, but public data is a very sweet touch.


This would be a great addition to public data sets, though I imagine for that to happen it would need some sort of viable plan to keep the data in sync.


Now that I have this up, I am hoping to work with the usaspending.gov team to get a feed or an extension to the API that gives me the records changed since the last upload, and then update the AWS snapshot with those. This would run on the same timeline as usaspending.gov's own updates, which is monthly.
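
Roughly, the monthly update could work like the sketch below. No such feed exists yet, so everything here is hypothetical: the `id` and `modified` field names, the ISO date strings, and the in-memory dict standing in for the snapshot are all assumptions.

```python
def changed_since(records, last_sync, date_field="modified"):
    """Keep only records modified after the last sync.

    Assumes ISO 8601 date strings (YYYY-MM-DD), which sort correctly
    when compared as plain strings.
    """
    return [r for r in records if r[date_field] > last_sync]


def apply_changes(snapshot, changes, key="id"):
    """Return a new snapshot with changed records inserted or overwritten.

    `snapshot` maps record id -> record; the input dict is left untouched.
    """
    updated = dict(snapshot)
    for record in changes:
        updated[record[key]] = record
    return updated
```

A monthly run would then be `apply_changes(snapshot, changed_since(feed, last_sync))`, followed by publishing the new snapshot.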



