May 3, 2013
Dynamo Sure Works Hard
We tend to think of working hard as a good thing. We value a strong work ethic and determination in the face of adversity. But if you are working harder than you should to get the same results, then it's not a virtue, it's a waste of time and energy. If it's your business systems that are working harder than they should, it's a waste of your IT budget.
Dynamo-based systems work too hard. SimpleDB/DynamoDB, Riak, Cassandra and Voldemort are all based, at least in part, on the design first described publicly in the Amazon Dynamo Paper. It has some very interesting concepts, but ultimately fails to provide a good balance of reliability, performance and cost. It's pretty neat in that each transaction allows you to dial in the levels of redundancy and consistency to trade off performance and efficiency. It can be pretty fast and efficient if you don't need any consistency, but ultimately the more consistency you want, the more you have to pay for it via a lot of extra work.
Network Partitions are Rare, Server Failures are Not
... it is well known that when dealing with the possibility of network failures, strong consistency and high data availability cannot be achieved simultaneously. As such systems and applications need to be aware which properties can be achieved under which conditions.
For systems prone to server and network failures, availability can be increased by using optimistic replication techniques, where changes are allowed to propagate to replicas in the background, and concurrent, disconnected work is tolerated. The challenge with this approach is that it can lead to conflicting changes which must be detected and resolved. This process of conflict resolution introduces two problems: when to resolve them and who resolves them. Dynamo is designed to be an eventually consistent data store; that is all updates reach all replicas eventually.
- Amazon Dynamo Paper
The Dynamo design treats a network switch failure as if it were just as probable as a machine failure, and pays the cost with every single read. This is madness. Expensive madness.
Within a datacenter, the Mean Time To Failure (MTTF) for a network switch is one to two orders of magnitude higher than that of servers, depending on the quality of the switch. This is according to data from Google about datacenter server failures, and the published MTBF numbers for Cisco switches. (There is a subtle difference between MTBF and MTTF, but for our purposes we can treat them the same.)
It is claimed that when W + R > N you can get consistency. But it's not true, because without distributed ACID transactions, it's never possible to achieve W > 1 atomically.
Consider W=3, R=1 and N=3. If a network failure or, more likely, a client/app tier failure (hardware, OS or process crash) happens during the writing of data, it's possible for only replica A to receive the write, with a lag until the cluster notices and syncs up. Then another client with R = 1 can do two consecutive reads, getting newer data first from node A, and older data next from node B for the same key. But you don't even need a failure or crash. Once the first write occurs there is always a lag before the next server(s) receive the write. It's possible for a fast client to do the same read twice, getting a newer version from one server, then an older version from another.
What is true is that if R > N / 2, then you get consistency where it's not possible to read a newer value and then have a subsequent read get an older value.
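To make the R = 1 scenario concrete, here is a toy sketch. It is not any real Dynamo client API; the replica dictionary, the key name and the helper functions are all made up for illustration, and the "crash" is modeled simply as a write that only reaches replica A.

```python
# Toy model of three replicas holding (version, value) for one key.
# All names here are illustrative assumptions, not a real client library.
replicas = {
    "A": {"cart": (1, "old")},
    "B": {"cart": (1, "old")},
    "C": {"cart": (1, "old")},
}

def write(key, version, value, reached):
    """Apply a write only to the replicas it actually reached."""
    for name in reached:
        replicas[name][key] = (version, value)

def read_one(key, node):
    """R = 1: return whatever the single contacted replica holds."""
    return replicas[node][key]

# The client intended W = 3 but crashed after only replica A got the write.
write("cart", 2, "new", reached=["A"])

first = read_one("cart", "A")   # this read happens to be routed to A
second = read_one("cart", "B")  # the next read goes to B before the sync

print(first, second)  # (2, 'new') (1, 'old')
```

The second read returns an older version than the first, which is exactly the newer-then-older anomaly described above.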
For the vast majority of applications, it's okay for a failure to lead to temporary unavailability. Amazon believes capturing writes to its shopping cart is so important that it's worth the cost of quorum reads, or inconsistency. Perhaps. But the problems and costs multiply. If you are doing extra reads to achieve high consistency, then you are putting extra load on each machine, requiring extra server hardware and extra networking infrastructure to provide the same baseline performance. All of this can increase the frequency of component failures and increases operational costs (hardware, power, rack space and the personnel to maintain it all).
A Better Way?
What if a document had 1 master and N replicas to write to, but only a single master to read from? Clients know, based on the document key and topology map, which machine serves as the master. That would make the reads far cheaper and faster. All reads and writes for a document go to the same master, with writes replicated to replicas (which also serve as masters for other documents; each machine is both a master and a replica).
But, you might ask, how do I achieve strong consistency if the master goes down or becomes unresponsive?
When that happens, the cluster also notices the machine is unresponsive or too slow, removes it from the cluster and fails over to a new master. Then the client tries again and has a successful read.
But, you might ask, what if the client asks the wrong server for a read?
All machines in the cluster know their role, and only one machine in the cluster can be a document master at any time. The cluster manager (a regular server node elected by Paxos consensus) makes sure to remove the old master, then assign the new master, and then tell the client about the new topology. Then the client updates its topology map and retries at the new master.
But, you might ask, what if the topology has changed again, and the client again asks to read from the wrong server?
Then this wrong server will let the client know. The client will reload the topology maps, and re-request from the right server. If the right master server isn't really right any more because of another topology change, it will reload and retry again. It will do this as many times as necessary, but typically it happens only once.
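The reload-and-retry loop described above can be sketched roughly like this. This is a toy model, not the Couchbase client or its wire protocol: the class names, the `NotMyKey` error, the CRC32-modulo key mapping and the "every node already holds a copy" shortcut are all assumptions made to keep the example small.

```python
import zlib

class NotMyKey(Exception):
    """Raised by a node that is asked for a key it does not master."""

def owner(key, nodes):
    # Deterministic key -> node mapping; CRC32 is an arbitrary choice here.
    return nodes[zlib.crc32(key.encode()) % len(nodes)]

class Node:
    def __init__(self, name, cluster):
        self.name, self.cluster, self.data = name, cluster, {}

    def read(self, key):
        # A node only serves the read if the cluster says it is the master.
        if owner(key, self.cluster.nodes) is not self:
            raise NotMyKey(key)
        return self.data[key]

class Cluster:
    def __init__(self, names):
        self.nodes = [Node(n, self) for n in names]

    def topology(self):
        return list(self.nodes)

    def fail_over(self, node):
        # Stand-in for the cluster manager removing a node from the map.
        self.nodes.remove(node)

class Client:
    def __init__(self, cluster):
        self.cluster = cluster
        self.nodes = cluster.topology()  # cached topology map
        self.reloads = 0

    def read(self, key, max_attempts=5):
        for _ in range(max_attempts):
            try:
                return owner(key, self.nodes).read(key)
            except NotMyKey:
                # The wrong server told us so: reload the map and retry.
                self.nodes = self.cluster.topology()
                self.reloads += 1
        raise RuntimeError("gave up: topology kept changing")

cluster = Cluster(["n1", "n2", "n3"])
for node in cluster.nodes:
    node.data["doc42"] = "hello"  # pretend replication already happened

client = Client(cluster)
cluster.fail_over(owner("doc42", cluster.nodes))  # the master goes away
value = client.read("doc42")
print(value, client.reloads)  # hello 1
```

The client's first attempt goes to the old master using its stale map, gets told it asked the wrong server, reloads once and succeeds, which matches the "typically it happens only once" case.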
But, you might ask, what if there is a network partition, and the client is on the wrong (minor) side of the partition, and reads from a master server that doesn't know it's not a master server anymore?
Then it gets a stale read. But only for a little while, until the server itself realizes it's no longer in heartbeat contact with the majority of the cluster. And partitions like this are among the rarest forms of cluster failure. It requires a network failure, and for the client to be on the wrong side of the partition.
But, you might ask, what if there is a network partition, and the client is on the wrong (smaller) side of the partition, and WRITES to a server that doesn't know it's not a master server anymore?
Then the write is lost. But if the client wanted true multi-node durability, then the write wouldn't have succeeded (the client would time out waiting for the replica(s) to receive the update) and the client wouldn't unknowingly lose data.
What I'm describing is the Couchbase clustering system.
Let's Run Some Numbers
Given the MTTF of a server, how much hardware do we need, and how quickly must the cluster fail over to a new master, to still meet our SLA requirements vs a Dynamo-based system?
Let's start with some assumptions:
We want to achieve 4000 transactions/sec with 3 node replication factor. Our load mix is 75% reads/25% writes.
We want to have some consistency, so that we don't read newer values, then older values, so for Dynamo:
R = 2, W = 2, N = 3
But for Couchbase:
R = 1, W = 2, N = 3
This means for a Dynamo style cluster, the load will be:
Read transactions/sec: 9000 reads (reads spread over 3 nodes)
Write transactions/sec: 3000 writes (writes spread over 3 nodes)
This means for a Couchbase style cluster, the load will be:
Read transactions/sec: 3000 reads (each document read only on 1 master node, but all document masters evenly spread across 3 nodes)
Write transactions/sec: 3000 writes (writes spread over 3 nodes)
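The load figures above are simple multiplication, sketched here so the inputs are easy to change. The function and parameter names are mine; the numbers (4000 transactions/sec, 75% reads, every Dynamo read sent to all 3 nodes, every write sent to 3 replicas) come from the assumptions listed above.

```python
def cluster_load(total_tps, read_fraction, nodes_per_read, nodes_per_write):
    """Return (node-level reads/sec, node-level writes/sec) for the cluster."""
    reads = total_tps * read_fraction
    writes = total_tps * (1 - read_fraction)
    return reads * nodes_per_read, writes * nodes_per_write

# Dynamo style: each read is sent to all 3 replicas, each write to 3.
dynamo = cluster_load(4000, 0.75, nodes_per_read=3, nodes_per_write=3)
# Couchbase style: each read hits only the master, each write goes to 3.
couchbase = cluster_load(4000, 0.75, nodes_per_read=1, nodes_per_write=3)

print(dynamo)     # (9000.0, 3000.0)
print(couchbase)  # (3000.0, 3000.0)

# Spread evenly over 3 nodes, that is 4000 vs 2000 operations/sec per node.
print(sum(dynamo) / 3, sum(couchbase) / 3)  # 4000.0 2000.0
```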
Let's assume both systems are equally reliable at the machine level. Google's research indicates that in their datacenters each server has an MTTF of 3141 hrs, or 2.7 failures per year. Google also reports a rack failure (usually power supply) every 10.2 years, roughly 30x as reliable as a server, so we'll ignore that to make the analysis simpler. (This is from Google's paper studying server failures here)
The MTBF of a Cisco network switch is published at 54,229 hrs on the low end to 1,023,027 hrs on the high end. For our purposes, we'll ignore switch failures, since they affect the availability and consistency of both systems about the same, and they're 1 to 2 orders of magnitude rarer than a server failure. (This is from a Cisco product spreadsheet here)
Assume we want to meet a latency SLA 99.9% of the time (the actual latency SLA threshold number doesn't matter here).
On Dynamo, that means each node can fail the SLA 1.837% of the time. Because it queries 3 nodes, but only uses the values from the first 2 nodes and the chances of SLA failure are the same across nodes, the formula is different (only two must meet the SLA):
0.001 = (3 − 2 * P) * P ^ 2
P = 0.01837
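For anyone who wants to check the per-node figure, here is a small numeric sketch. It assumes the 99.9% target means a failure budget of 0.001 and that node latencies are independent; the function names and the choice of bisection are mine.

```python
def two_of_three_miss(p):
    """Probability that at least 2 of 3 independent nodes miss the SLA.

    3 * p^2 * (1 - p) + p^3 simplifies to (3 - 2p) * p^2.
    """
    return (3 - 2 * p) * p ** 2

def solve_per_node_budget(target, lo=0.0, hi=0.5, iterations=100):
    # Plain bisection; two_of_three_miss is increasing on [0, 0.5].
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if two_of_three_miss(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

p = solve_per_node_budget(0.001)
print(f"{p * 100:.3f}%")  # 1.837%
```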
On Couchbase, if a master node fails, the cluster must recognize it and fail it out. Let's say it can fail out a node in 30 secs, and takes another 4.5 minutes to warm up the RAM cache. Given Google's MTTF number above, that's 2.7 failures/year with 5 minutes of downtime each before a failover completes, or 13.5 minutes a year, so queries will fail 0.0026% of the time due to node failure.
For Couchbase to meet the same SLA:
0.001 = P(SlaFail) + P(NodeFail) - (P(SlaFail) * P(NodeFail))
0.001 = P(SlaFail) + 0.000026 - (P(SlaFail) * 0.000026)
0.001 ~= 0.000974 + 0.000026 - (0.000974 * 0.000026)
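The same relationship can be solved for P(SlaFail) directly. A rough sketch, assuming the 99.9% target means a failure budget of 0.001, and taking 2.7 failures/year with 5 minutes of downtime each as the inputs; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def node_unavailability(failures_per_year, minutes_down_per_failure):
    """Fraction of the year a node's documents cannot be served."""
    return failures_per_year * minutes_down_per_failure / MINUTES_PER_YEAR

def sla_budget(target, p_node_fail):
    """Solve target = p_sla + p_node - p_sla * p_node for p_sla."""
    return (target - p_node_fail) / (1 - p_node_fail)

p_node = node_unavailability(2.7, 5.0)
p_sla = sla_budget(0.001, p_node)

print(f"node failure share: {p_node:.7f}")
print(f"SLA failure budget: {p_sla:.6f}")
```

Plugging `p_sla` and `p_node` back into the combined formula should give the 0.001 target again, which is a quick sanity check on the algebra.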
Note: Some things I'm omitting from the analysis. When a Dynamo node fails, the remaining nodes face a stricter per-node requirement, since the SLA must be met by 2 nodes out of 2 instead of 2 out of 3 (the allowed failure rate would drop from 1.837% to ~0.05%). There is also the increased work on the remaining servers when a Couchbase server fails. Both are only temporary and go away when a new server is added back and initialized in the cluster, and shouldn't change the numbers significantly. Also there is the time to add in a new node and rebalance load onto it. At Couchbase we work very hard to make that as fast and efficient as possible. I'll assume Dynamo systems do the same and that the cost is the same, and omit it, though I think we are the leaders in rebalance performance.
With this analysis, a Couchbase node can only fail its SLA on a bit under 0.1% of requests, and a Dynamo node can fail it on 1.837%. Sounds good for Dynamo, but it must do 2x the throughput per node on 3x the data, with 2x the total network traffic. And for very low latency response times (our customers often want sub-millisecond latency), meeting the SLA typically means a DBMS must keep a large amount of relevant data and metadata in RAM, because random disk fetches have a huge latency cost. With disk fetches 2 orders of magnitude slower on SSDs (100x) and 4 orders of magnitude slower on HDDs (10000x), the disk accesses pile up faster without enough RAM, and so do the latencies.
That each Dynamo node can fail its SLA at a higher rate is a very small win when it will still need to keep nearly 3x the working set ready in memory, because each node will be serving 3x the data at all times for read requests (it can fail its SLA slightly more often, so it's actually about 2.97x the necessary RAM), and will use 2x the network capacity.
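To get a feel for why RAM matters so much here, a rough sketch of average access cost using the 100x (SSD) and 10000x (HDD) multipliers mentioned above. The hit ratios are made-up illustration values, and the model (cost 1 for a RAM hit, a flat penalty for every miss) is a simplification I'm assuming, not a measurement.

```python
def relative_cost(hit_ratio, miss_penalty):
    """Average access cost relative to a RAM hit, which costs 1."""
    return hit_ratio * 1 + (1 - hit_ratio) * miss_penalty

SSD_PENALTY = 100      # 2 orders of magnitude slower than RAM
HDD_PENALTY = 10_000   # 4 orders of magnitude slower than RAM

for hit_ratio in (0.999, 0.99, 0.9):  # hypothetical cache hit ratios
    ssd = relative_cost(hit_ratio, SSD_PENALTY)
    hdd = relative_cost(hit_ratio, HDD_PENALTY)
    print(f"hit ratio {hit_ratio}: SSD {ssd:.1f}x, HDD {hdd:.1f}x")
```

Even a 1% miss rate roughly doubles the average cost on SSDs and multiplies it by about a hundred on HDDs, so a node that has to keep 3x the data hot has far less room for misses.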
Damn Dynamo, you sure do work hard!
Now Couchbase isn't perfect either, far from it. Follow me on twitter @damienkatz. I'll be posting more about the Couchbase shortcomings and capabilities, and technical roadmap soon.
January 18, 2013
If I were to list projects as small, medium, and large or small to enterprise, what methodologies work across them? My thoughts are Agile works well, but eventually you'll hit a wall of complexity, which will make you wonder why you didn't see it many, many iterations ago. I don't know anyone at NASA or Space-X or DoD so I don't know what software methodology they use? Given your experience can you shed some light on it?
I don't really use a specific methodology, however I find it very useful to understand the most popular methodologies and when they are useful. Then it's helpful when you are at various stages of projects and know what kinds of approaches are helpful, and how you can apply them to your situation.
For example, I find Test Driven Design (TDD) very much overkill, but for a mature codebase I find lots of testing invaluable. Early in a codebase I find lots of tests very restrictive; I value the ability to quickly change a lot of code without also having to change an even larger amount of test code. Early on, when I'm creating the overall architecture that everything else will hang on, and the code is small and the design is plastic and I can keep it all in my head, I value being able to move very quickly. However, other developers may find TDD very valuable to think through the design and problems. I don't work like that. To each his own.
Blindly applying methodologies or even "best practices" is bad. For the inexperienced it's better than nothing, but it's not as good as knowledge of self and team, experience with a variety of projects and their stages, and good old-fashioned pragmatism.
January 17, 2013
Follow up to "The Unreasonable Effectiveness of C"
My post The Unreasonable Effectiveness of C generated a ton of discussion on Reddit and Hacker News, nearly 1200 comments combined, as people got into all sorts of heated arguments. I also got a bunch of private correspondence about it.
So I'm going to answer some of the most common questions, feedback and misunderstandings it's gotten.
Is C the best language for everything?
Hell no! Higher level languages, like Python and Ruby, are extremely useful and should definitely be used where appropriate. Java has a lot of advantages, C++ does too. Erlang is amazing. Most every popular language has uses where it's a better choice.
But when both raw performance and reliability are critical, C is very very hard to beat. At Couchbase we need industrial grade reliability without compromising performance.
I love me some Erlang. It's very reliable and predictable, and the whole design of the language is about robustness, even in the face of hardware failures. Just because we experienced a crash problem in the core of Erlang shouldn't tarnish its otherwise excellent track record.
However it's not fast enough for our needs and our customers' needs. This is key: the hard work to make our code as efficient and fast as possible in C now benefits our many thousands of Couchbase server deployments all over the world, saving a ton of money and resources. It's an investment that is paid back many, many times.
But for most projects the extra engineering cost isn't worth it. If you are building something that's only used by your organization, or a small number of customers, your money is likely better spent on faster/more hardware than on very expensive engineers coding, testing and debugging C code. There is a good chance you don't have the same economies of scale we do at Couchbase, where the costs are spread over a high number of customers.
Don't just blindly use C, understand its own tradeoffs and if it makes sense in your situation. Erlang is quite good for us, but to stay competitive we need to move on to something faster and industrial grade for our performance oriented code. And Erlang itself is written in C.
If a big problem was C code in Erlang, why would using more C be good?
Because it's easier to debug when you don't lose context between the "application" layer and the lower level code. The big problem we've seen is when C code is getting called from higher level code in the same process, we lose all the debugging context between the higher level code and the underlying C code.
So when we were getting these crashes, we didn't have the expertise and tooling to figure out what exactly the Erlang code was doing at the moment it crashed. Erlang is highly concurrent and many different things were all being executed at the same time. We knew it had something to do with the async IO settings we were using in the VM and the opening and closing of files, but exactly what or why still eluded us.
Also, we couldn't manifest the crash with test code, though we tried, making it hard to report the issue to Erlang maintainers. We had to run the full Couchbase stack with heavy load in order to trigger the crash, and it would often take 6 or more hours before we saw it. This made debugging problematic as we had confounding factors of our own in-process C code that also could have been the source of the crashes.
In the end, we found through code inspection that the problem was Erlang's disk based sorting code, the compression options it was using, and the interaction with how Erlang closes files. When Erlang closed files with the compression option it would occasionally hit a race condition low down in the VM that would lead to a dangling pointer and a double-free. If we hadn't lost all the context between the Erlang user code and the underlying C code, we could have tracked this problem down much sooner. We would have had a complete C stacktrace of what our code was doing when the library code crashed, allowing us to narrow down very quickly the flawed C code/modules.
Why Isn't C++ a suitable replacement for C?
Often it is, but the problem with C++ is that you have to be very disciplined to use it and not complicate/obfuscate your code unnecessarily. It's also not as portable to as many environments (particularly embedded), and tends to have much higher compilation and build times, which negatively affects developer productivity.
C++ is also a complicated mess, so when you adopt C++ for its libraries and community, you have to take the good with the bad and weird to get the benefits. And there is a lot of disagreement about what constitutes bad or weird. Your sane subset of the language is very likely to be at odds with others' ideas of a sane subset. C has this problem to a much, much smaller degree.
What about Go as a replacement for C?
Perhaps someday. Right now Go is far slower than C. It also doesn't give as much control over memory since it's garbage collected. It's not as portable, and you also can't host Go code in other environments or language VMs, limiting what you can do with your code.
Go however has a lot of momentum and a very bright future, they've made some very nice and pragmatic choices in its design. If it continues to flourish I expect every objection I listed, except for the garbage collection, will eventually be addressed.
What about D as a replacement for C?
It's not there for the same reasons as Go. It's possible that someday it will be suitable, but I'm less optimistic about it strictly from a momentum perspective, it doesn't have a big backer like Google and doesn't seem to be growing very rapidly. But perhaps it will get there someday.
Is there anything else that could replace C?
I don't know a lot of what's out there on the horizon, and there are some efforts to create a better C. But for completely new languages as a replacement, I'm most hopeful and optimistic about Mozilla's Rust. It's designed to be fast and portable, callable from any language/environment much like C, with no garbage collection yet still safe from buffer overruns, leaks and race conditions. It also has Erlang style concurrency features built in.
But it's a very young and rapidly evolving language. The performance is not yet close to C. The syntax might be too foreign for the masses to ever hit the mainstream, and it may suffer the same niche fate as Erlang because of that.
However if Rust achieves its stated goals, C-like performance but safe with Erlang concurrency and robustness built in, it would be the language of my dreams. I'll be watching its progress very closely.
That's just, like, your opinion, man
Yes, my post was an opinion piece.
But I'm not new to this programming game. I've done this professionally since 1995.
I've built a ton of backend code in C, C++ and Erlang. I've written in excess of 100k lines of C and C++ code. I've easily read, line by line, 300k lines of C code.
I've written a byte code VM in C++ that's been deployed on 100 million+ desktops and 100's of thousands of servers. I used C++ inheritance, templates, exceptions, custom memory allocation and a bunch of other features I thought were very cool at the time. Now I feel bad for the people who have to maintain it.
I also created and wrote, from scratch, Apache CouchDB, including the storage engine & tail append btrees, incremental Map/Reduce indexer and query engine, master/master replication with conflict management, and the HTTP API, plus a zillion small details necessary to make it all work.
In short, I have substantial real world experience in projects used by millions of people every day. Maybe I know what I'm talking about.
So while most of what I wrote is my opinion and difficult to back up with hard data, it's born from being cut so many times with the newest and coolest stuff. My view of C has changed over the years, and I used to think the older guys who loved C were just behind the times. Now I see why many of them felt that way, they saw what is traded away when you stray from the simple and effective.
Think about the most widely used backend projects around and see how they are able to get both reliability and performance. Chances are, they are using plain C. That's not just a coincidence.
Follow me on Twitter for more of my coding opinions and updates on Couchbase progress.
January 8, 2013
The Unreasonable Effectiveness of C
For years I've tried my damnedest to get away from C. Too simple, too many details to manage, too old and crufty, too low level. I've had intense and torrid love affairs with Java, C++, and Erlang. I've built things I'm proud of with all of them, and yet each has broken my heart. They've made promises they couldn't keep, created cultures that focus on the wrong things, and made devastating tradeoffs that eventually make you suffer painfully. And I keep crawling back to C.
C is the total package. It is the only language that's highly productive, extremely fast, has great tooling everywhere, a large community, a highly professional culture, and is truly honest about its tradeoffs.
Other languages can get you to a working state faster, but in the long run, when performance and reliability are important, C will save you time and headaches. I'm painfully learning that lesson once again.
Simple and Expressive
"When someone says: 'I want a programming language in which I need only say what I wish done', give him a lollipop."
- Alan J. Perlis
That we have a hard time thinking of lower level languages we'd use instead of C isn't because C is low level. It's because C is so damn successful as an abstraction over the underlying machine and making that high level, it's made most low level languages irrelevant. C is that good at what it does.
The syntax and semantics of C are amazingly powerful and expressive. They make it easy to reason about high level algorithms and low level hardware at the same time. Its semantics are so simple and the syntax so powerful it lowers the cognitive load substantially, letting the programmer focus on what's important.
It's blown everything else away to the point it's moved the bar and redefined what we think of as a low level language. That's damn impressive.
Simpler Code, Simpler Types
C is a weak, statically typed language and its type system is quite simple. Unlike C++ or Java, you don't have classes where you define all sorts of new runtime behaviors of types. You are pretty much limited to structs and unions and all callers must be very explicit about how they use the types, callers get very little for free.
"You wanted a banana but what you got was a gorilla holding the banana and the entire jungle."
- Joe Armstrong
What sounds like a weakness ends up being a virtue: the "surface area" of C APIs tends to be simple and small. Instead of massive frameworks, there is a strong tendency and culture to create small libraries that are lightweight abstractions over simple types.
Contrast this to OO languages where codebases tend to evolve massive interdependent interfaces of complex types, where the arguments and return types are more complex types and the complexity is fractal: each type is a class defined in terms of methods whose arguments and return types are yet more complex types.
It's not that OO type systems force fractal complexity to happen, but they encourage it, they make it easier to do the wrong thing. C doesn't make it impossible, but it makes it harder. C tends to breed simpler, shallower types with fewer dependencies that are easier to understand and debug.
C is the fastest language out there, both in micro and in full stack benchmarks. And it isn't just the fastest in runtime, it's also consistently the most efficient for memory consumption and startup time. And when you need to make a tradeoff between space and time, C doesn't hide the details from you, it's easy to reason about both.
"Trying to outsmart a compiler defeats much of the purpose of using one."
- Kernighan & Plauger, The Elements of Programming Style
Every time there is a claim of "near C" performance from a higher level language like Java or Haskell, it becomes a sick joke when you see the details. They have to do awkward backflips of syntax, use special knowledge of "smart" compilers and VM internals to get that performance, to the point that the simple expressive nature of the language is lost to strange optimizations that are version specific, and usually only stand up in micro-benchmarks.
When you write something to be fast in C, you know why it's fast, and it doesn't degrade significantly with different compilers or environments the way different VMs will, the way GC settings can radically affect performance and pauses, or the way interaction of one piece of code in an application will totally change the garbage collection profile for the rest.
The route to optimization in C is direct and simple, and when it's not, there are a host of profiler tools to help you understand why without having to understand the guts of a VM or the "sufficiently smart compiler". When using profilers for CPU, memory and IO, C is best at not obscuring what is really happening. The benchmarks, both micro and full stack, consistently prove C is still the king.
Faster Build-Run-Debug Cycles
Critically important to developer efficiency and productivity is the "build, run, debug" cycle. The faster the cycle is, the more interactive development is, and the more you stay in the state of flow and on task. C has the fastest development interactivity of any mainstream statically typed language.
"Optimism is an occupational hazard of programming; feedback is the treatment."
- Kent Beck
Because the build, run, debug cycle is not a core feature of a language, it's more about the tooling around it, this cycle is something that tends to be overlooked. It's hard to overstate the importance of the cycle for productivity. Sadly it's something that gets left out of most programming language discussions, where the focus tends to be only on lines of code and source writability/readability. The reality is the tooling and interactivity cycle of C is the fastest of any comparable language.
Ubiquitous Debuggers and Useful Crash Dumps
For pretty much any system you'd ever want to port to, there are readily available C debuggers and crash dump tools. These are invaluable to quickly finding the source of problems. And yes, there will be problems.
"Error, no keyboard -- press F1 to continue."
With any other language there might not be a usable debugger available, and a useful crash dump tool is even less likely. And there is a really good chance that for any heavy lifting you are interfacing with C code anyway. Now you have to debug the interface between the other language and the C code, and you often lose a ton of context, making it a cumbersome, error prone process, and often completely useless in practice.
With pure C code, you can see call stacks, variables, arguments, thread locals, globals, basically everything in memory. This is ridiculously helpful especially when you have something that went wrong days into a long running server process and isn't otherwise reproducible. If you lose this context in a higher level language, prepare for much pain.
Callable from Anywhere
C has a standardized application binary interface (ABI) that is supported by every OS, language and platform in existence. And it requires no runtime or other inherent overhead. This means the code you write in C isn't just valuable to callers from C code, but to every conceivable library, language and environment in existence.
"Portability is a result of few concepts and complete definition"
- J. Palme
You can use C code in standalone executables, scripting languages, kernel code, embedded code, as a DLL, even callable from SQL. It's the Lingua Franca of systems programming and pluggable libraries. If you want to write something once and have it usable from the most environments and use cases possible, C is the only sane choice.
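As a tiny illustration of the "callable from anywhere" point, here is a scripting language calling straight into the C standard library with no wrapper code. It is a sketch that assumes a POSIX system (Linux or macOS), where `ctypes.CDLL(None)` exposes the symbols already loaded into the process; on Windows the library would have to be loaded differently.

```python
import ctypes

# POSIX-only assumption: passing None gives access to symbols already
# loaded into the current process, which includes the C standard library.
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"lingua franca")
print(length)  # 13
```

The only things Python needs to know are the function name and its C signature, which is all the ABI asks of any caller.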
Yes. It has Flaws
There are many "flaws" in C. It has no bounds checking, it's easy to corrupt anything in memory, there are dangling pointers and memory/resource leaks, bolted-on support for concurrency, no modules, no namespaces. Error handling can be painfully cumbersome and verbose. It's easy to make a whole class of errors where the call stack is smashed and hostile inputs take over your process. Closures? HA!
"When all else fails, read the instructions."
- L. Lasellio
Its flaws are very very well known, and this is a virtue. All languages and implementations have gotchas and hangups. C is just far more upfront about it. And there are a ton of static and runtime tools to help you deal with the most common and dangerous mistakes. That some of the most heavily used and reliable software in the world is built on C is proof that the flaws are overblown, and easy to detect and fix.
At Couchbase we recently spent easily 2+ man-months dealing with a crash in the Erlang VM. We wasted a ton of time tracking down something that was in the core Erlang implementation, never sure what was happening or why, thinking perhaps the flaw was something in our own plug-in C code, hoping it was something we could find and fix. It wasn't, it was a race condition bug in core Erlang. We only found the problem via code inspection of Erlang. This is a fundamental problem in any language that abstracts away too much of the computer.
Initially for performance reasons, we started increasingly rewriting more of the Couchbase code in C, and choosing it as the first option for more new features. But amazingly it's proven much more predictable when we'll hit issues and how to debug and fix them. In the long run, it's more productive.
I always have it in the back of my head that I want to make a slightly better C. Just to clean up some of the rough edges and fix some of the more egregious problems. But getting everything to fit, top to bottom, syntax, semantics, tooling, etc., might not be possible or even worth the effort. As it stands today, C is unreasonably effective, and I don't see that changing any time soon.
Follow me on Twitter for more of my coding opinions and updates on Couchbase progress.
October 28, 2012
How to achieve lots of code?
I get mail.
I read about you from a book on erlang.
Your couchdb application is really a rave.
please can you help me out ,i've got questions only a working programmer can answer.
i'm shooting now:
i've been programming in java for over 3 years
i know all about the syntax and so on but recently i ran a code counter on my apps and
the code sizes were dismal. 2-3k
commercial popular apps have code sizes in the 100 of thousands.
so tell me- for you and what you know of other developers how long does it take to write those large applications ( i.e over 30k lines of code)
what does it take to write large applications - i.e move from the small code size to really large code sizes?
Never try to make your project big. Functionality is an asset, code is a liability. What does that mean? I love this Bill Gates quote:
Measuring programming progress by lines of code is like measuring aircraft building progress by weight.
More code than necessary will bloat your app binaries, causing larger downloads and more disk usage, use more memory, and slow down execution with more frequent cache misses. It can make the app harder to understand and harder to debug, and it will typically have more flaws.
CouchDB, when we hit 1.0, was less than 20k lines of production code, not including dependencies. This included a storage engine (crash tolerant, highly concurrent MVCC with pauseless compactor), language agnostic map/reduce materialized indexing engine (also crash tolerant highly concurrent MVCC with pauseless compactor), master/master replication with conflict management, HTTP API with security model, and simple JS application server.
The small size is partly because it was written in Erlang, which generally requires 1/5 or less code of the equivalent in C or C++, and also because the original codebase was mostly written by one person (me), giving the design a level of coherency and simplicity that is harder to accomplish -- but still very possible -- in teams.
Tests are different. Lines of code are more of an asset in tests. More tests (generally) mean more reliable production code. They also document code functionality without getting out of sync the way comments and design docs can (out-of-sync documentation is worse than no documentation), and they don't slow down or bloat the compiled output. There are caveats to this, but generally more code in tests is a good thing.
Also you can go overboard with trying to make code short (CouchDB has some WTFs from terseness that are my fault). But generally you should try to make code compact and modular, with clear variable and function names. Early code should be verbose enough to be understandable by those who will work on it, and no more. You should never strive for lots of code, instead you want reliable, understandable, easily modifiable code. Sometimes that requires a lot of code. And sometimes -- often for performance reasons -- the code must be hard to understand to accomplish the project goals.
But often with careful thought and planning, you can make simple, elegant, efficient, high quality code that is easy to understand. That should be your goal.
August 30, 2012
CouchConf SF is coming.
This is our premier Couchbase event. We're going ham.
Come hear speakers from established enterprises talk about how they are betting their business on Couchbase.
Hang out and talk with speakers, me and other Couchbase engineers in the Couchbase lounge.
I'll be talking at the closing session. Let me know what you'd like to hear about!
Killer after-party. Witness my drunken antics ;)
- Three tracks and nearly 30 technical sessions for dev and ops
- 15 customer speakers from companies like:
- McGraw Hill - who will be sharing their experiences and demoing their Couchbase Server 2.0 app - including full-text search integration among other features
- Orbitz who will be talking about how they replaced Oracle Coherence with Couchbase NoSQL software
- Sabre - discussing how they are using NoSQL to reduce mainframe costs
- Tencent will be sharing their evaluation process (and results) for choosing a NoSQL solution
- Other speakers include LinkedIn, Tapjoy, TheLadders, and more
- There are also training sessions for developers and admins the two days prior to CouchConf for those who want to also get more hands-on experience.
When you register, you can get the early bird rate if you use the promotional code Damien.
Register here: http://www.couchbase.com/couchconf-san-francisco
June 25, 2012
Reminder: Couchbase Community Summer BBQ
We're celebrating summer by throwing a huge outdoor BBQ bash at our Mountain View, CA HQ on Wednesday, June 27, 2012, from 5:00 PM to 10:00 PM (PT).
June 21, 2012
Why Database Technology Matters
Sometimes I get so down in the weeds of database technology, I forget why databases are so fascinating to me, why I found them so important to begin with. ACID. Latency, bandwidth, durability, performance, scalability, bits and bytes. Virtual this, cloud that. Blah blah blah. Who the fuck cares?
Dear lord I care. I care so much it hurts.
"A database is an organized collection of data, today typically in digital form." -Wikipedia
I think about databases so much. So so much. New schemes for expanding their capacity, new ways of making them work, new ways of making them faster, more reliable, new ways of making them accessible to more developers and users.
I spend so much time thinking about them, it's embarrassing. As much time as I spend thinking about them, I feel like I should know so much more than I do.
HTTP, JSON, memcached, elastic clusters, developer accessibility, incremental map/reduce, distributed indexing, intra-cluster replication, cross-cluster replication, tail-append generational storage, disk fragmentation, memory fragmentation, memory/storage hierarchy, disk latency, write amplification, data compression, multi-core, multi-threading, inverted indexes, language parsing, interpreter runtimes, message passing, shared memory, recovery-oriented architectures. All that stuff that makes a database tick.
Why do I spend so much time on this? Why have I spent so many years on them?
Why do they fascinate me so much? Why did I quit my job and build an open source database engine with my own money, when I wasn't wealthy and I had a family to support?
Why the hell did I do that?
Because I think database technologies are among the most important fundamental advancements of humanity and our collective consciousness. I think databases are as important as telecommunications and the internet. I think they are as important as any scholarly library -- and that libraries are the earliest non-digital databases. I think databases are almost as important as the invention of the written word.
Forget SQL. Forget network, document or object databases. Forget the relational algebra. Forget schemas. Forget joins and normalization. Forget ACID. Forget Map/Reduce.
Think knowledge representation. Think knowledge collection, transformation, aggregation, sharing. Think knowledge discovery.
Think of humanity and its collective mind expanding.
When IBM was at the absolute height of its power, they were the richest, most powerful company on the planet. They primarily sold mainframes for a lot of money, and at the core of those mainframes were big database engines, providing a big competitive advantage their customers gladly paid for.
Google has created a database index of the internet. They are a force because they found ways to find meaning in the massive amounts of information already available. They are a very visible example of changing the way humanity thinks.
File systems are very simple databases. People have been building all sorts of searching and aggregation technology on top of them for many years, to better unlock all that knowledge and information stored within.
Email? Email technology is essentially databases that you can send messages to. It's old fashioned and simple, and yet our email systems keep getting more clever about ways to show us what's in our unstructured personal databases.
Databases don't have to be huge to have a huge impact. SQLite makes databases accessible on small devices. It's the most deployed database on the planet. It's often easy to miss the impact: when it's billions of small installations, it starts to look like air. Something that's just there, all around us. But add it up and the impact is huge.
And of course big bad Oracle. As much as people love to hate them, they've made reliable database technology very accessible, something you can bet your business on, year after year. They are great at not just making the technology work, but the complete ecosystem around it, something necessary for enterprises and mission critical uses. There is a lot to criticize about them, but much to praise as well.
So yes, I care. I care deeply. I care about the big picture. And I care about the bits and bytes. I care about the ridiculously complex details most people will never see. I care about the boring stuff that makes the bigger stuff happen. And sometimes I forget why I care about it. Sometimes I lose sight of the big picture as I'm so focused on making the details work.
And sometimes I remember. And I feel incredibly lucky and privileged for the opportunities to have a positive impact on the collective mind of humanity. And my reward is to know, in some small way, that I've succeeded. And I want to do more. This is important stuff, the most important and effective way I know how to contribute to the world. It matters to me.
May 30, 2012
Stabilizing Couchbase Server 2.0
I'm glad to report we are now pretty much going into full-on stabilization and resource optimization mode for Couchbase Server 2.0. It's taken us a lot longer than we planned. Creating a high performance, efficient, reliable, full-featured distributed document database is a non-trivial matter ;)
In addition to the same "simple, fast, elastic" memcached and clustering technology we have in previous versions of Couchbase, we've added 3 big new features to dramatically extend its capabilities and use cases, as well as its performance and reliability.
Couchstore: High Throughput, Recovery Oriented Storage
One of the biggest obstacles for 2.0 was that the Erlang-based storage engine was too resource heavy compared to our 1.8.x release, which uses SQLite. We did a ton of optimization work and modifications, stripping out everything we could to make it as fast and efficient as possible, and in the process making our Erlang-based storage code several times faster than when we started, but the CPU and resource usage was still too high, and without lots of CPU cores, we couldn't get total system performance where our existing customers needed it.
In the end, the answer was to rewrite the core storage engine and compactor in C, using a format bit for bit compatible with our Erlang storage engine, so that updates written in one process could be read, indexed, replicated, and even compacted from Erlang. It's the same basic tail-append, recovery oriented MVCC design, so it's simple to write to it from one OS process and read it from another process. The storage format is immune to corruption caused by server crashes, OOM killers or even power loss.
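The tail-append, recovery-oriented idea can be illustrated with a small Python sketch. It is not Couchstore's actual on-disk format: the record layout, the `DATA`/`COMMIT` kinds, and the function names are all assumptions made up for illustration. Writes only ever append, a commit marker makes the preceding records visible, and recovery stops at the first torn or corrupt record, so a crash mid-write can only lose uncommitted data.

```python
import struct
import zlib

# Hypothetical record layout: kind (1 byte), payload length, CRC32 of payload.
HEADER = struct.Struct(">BII")
DATA, COMMIT = 0, 1


def append_record(log: bytearray, kind: int, payload: bytes) -> None:
    """Append one record to the tail; existing bytes are never overwritten."""
    log.extend(HEADER.pack(kind, len(payload), zlib.crc32(payload)))
    log.extend(payload)


def recover(log: bytes):
    """Return (committed payloads, offset just past the last valid commit)."""
    committed, pending = [], []
    good_end = 0
    pos = 0
    while pos + HEADER.size <= len(log):
        kind, length, crc = HEADER.unpack_from(log, pos)
        start = pos + HEADER.size
        payload = bytes(log[start:start + length])
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupt tail: ignore everything from here on
        pos = start + length
        if kind == COMMIT:
            committed.extend(pending)  # records before a commit become visible
            pending = []
            good_end = pos
        elif kind == DATA:
            pending.append(payload)
        else:
            break  # unknown record kind: treat as corruption
    return committed, good_end
```

A writer would truncate the file to the returned offset after a crash and carry on appending. A reader in another process can run the same scan at any time and see a consistent, committed snapshot.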
Rewriting it in C let us break through many optimization barriers. We are easily getting 2x the write throughput over the optimized Erlang and SQLite engines, with less CPU and a fraction of the memory overhead.
Not all of this is due to C being faster than Erlang. A good chunk of the performance boost is just being able to embed the persistence engine in-process. That alone cut out a lot of CPU and overhead by avoiding transmitting data across processes and converting to Erlang in-memory structures. But also it's C, which provides good low level control and we can optimize much more easily. The cost is more engineering effort and low-level code, but the performance gains have proven very much worth it.
And so now we've got the same optimistically updating, MVCC capable, recovery oriented, fragmentation resistant storage engines both in Erlang and C. Reads don't block writes and writes don't block reads. Writes also happen concurrently with compaction. Getting all or incremental changes via MVCC snapshotting and the by_sequence index makes our disk io mostly linear for fast warmup, indexing, and cluster rebalances. It allows asynchronous indexing, and it also powers XDCR.
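As a rough illustration of how a by_sequence index makes incremental changes cheap, here is a minimal in-memory Python sketch. The class and method names are hypothetical, and it assumes the CouchDB-style behavior where each update gets a new sequence number and supersedes the document's previous entry. The real index is an on-disk btree, not a dictionary.

```python
class SeqIndex:
    """Toy by-sequence index: every update gets the next sequence number."""

    def __init__(self):
        self.seq = 0
        self.by_id = {}   # doc id -> latest sequence number
        self.by_seq = {}  # sequence number -> (doc id, value)

    def update(self, doc_id, value):
        old = self.by_id.get(doc_id)
        if old is not None:
            del self.by_seq[old]  # only the latest edit of a doc is kept
        self.seq += 1
        self.by_id[doc_id] = self.seq
        self.by_seq[self.seq] = (doc_id, value)
        return self.seq

    def changes_since(self, since=0):
        """Return (seq, doc id, value) for everything newer than `since`."""
        return [(s, *self.by_seq[s]) for s in sorted(self.by_seq) if s > since]
```

An indexer or replicator only needs to remember the last sequence number it processed, then ask for `changes_since(last_seq)` to pick up where it left off.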
B-Superstar: Cluster Aware Incremental Map/Reduce
Another big item was bringing all the important features of CouchDB incremental map/reduce views to Couchbase, and combining it with clustering while maintaining consistency during rebalance and failover.
We started using an index per virtual partition (vbucket), merging results across all indexes at query time, but quickly scrapped that design as it simply wouldn't bring us the performance or scalability we needed. We needed a system to support MVCC range scans, with fast multi-level key based reductions (_sum, _count, _stats, and user defined reductions), and to require the fewest index reads possible.
We embed a bitmap partition index in each btree node that is the recursive OR of all child reductions. Due to the tail append index updates, it's a linear write to update modified leaf nodes through to root while updating all the bitmaps. Now we can tell instantly which subtrees have values emitted from a particular vbucket.
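The recursive OR can be sketched in a few lines of Python. This is a simplified in-memory tree, not the actual index code: the `Node` class, its fields, and the function names are assumptions for illustration, and the real btree is updated by appending new nodes rather than mutating them in place.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: list = field(default_factory=list)  # inner node: child Nodes
    entries: list = field(default_factory=list)   # leaf: (key, value, vbucket)
    bitmap: int = 0                               # bit i set => subtree holds vbucket i


def build_bitmaps(node: Node) -> int:
    """Set each node's bitmap to the OR of everything beneath it."""
    node.bitmap = 0
    if node.children:
        for child in node.children:
            node.bitmap |= build_bitmaps(child)
    else:
        for _key, _value, vbucket in node.entries:
            node.bitmap |= 1 << vbucket
    return node.bitmap


def has_vbucket(node: Node, vbucket: int) -> bool:
    """True if any value under this node was emitted from `vbucket`."""
    return bool((node.bitmap >> vbucket) & 1)
```

Checking a subtree for a given vbucket is then a single bit test on its root node, with no need to descend.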
During steady state we have a system that performs with nearly the same efficiency as our regular btrees (just the extra cost of 1 bit per btree node times the number of virtual partitions).
But we can exclude vBucket partitions by flipping a single bit mask, for rebalance/failover consistency, with a temporarily higher query-time cost until the indexes are re-optimized.
In the worst case, O(logN) operations become O(N) until the excluded index results are removed from the index.
Then the index is once again in the steady state, and queries are O(logN).
The really cool thing is this also works in reverse, so we can start inserting into a vBucket's new node's view index as it rebalances, but exclude the results until the rebalance is complete. The result is consistent view indexes and queries both during steady state and while actively failing-over or rebalancing.
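Query-time exclusion can be sketched the same way. In this hypothetical Python example (the `Node`, `leaf`, `inner` and `scan` names are made up, and it only filters rows rather than recomputing reductions), a query carries a mask of active vbuckets. Subtrees whose bitmap shares no bits with the mask are skipped outright, and rows from excluded vbuckets are filtered out of the leaves that remain.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    bitmap: int                                   # OR of vbucket bits below this node
    children: list = field(default_factory=list)  # inner node: child Nodes
    entries: list = field(default_factory=list)   # leaf: (key, value, vbucket)


def leaf(*entries):
    bitmap = 0
    for _key, _value, vbucket in entries:
        bitmap |= 1 << vbucket
    return Node(bitmap, entries=list(entries))


def inner(*children):
    bitmap = 0
    for child in children:
        bitmap |= child.bitmap
    return Node(bitmap, children=list(children))


def scan(node, active_mask):
    """Yield (key, value) only for rows whose vbucket bit is set in active_mask."""
    if (node.bitmap & active_mask) == 0:
        return  # nothing visible below this node: skip the whole subtree
    if node.children:
        for child in node.children:
            yield from scan(child, active_mask)
    else:
        for key, value, vbucket in node.entries:
            if (active_mask >> vbucket) & 1:
                yield key, value
```

Excluding vbucket 1 during a failover is then just `active_mask & ~(1 << 1)`. Working in reverse, a vbucket that is still being rebalanced in can have its rows inserted while its bit stays cleared, and becomes visible by setting the bit once the rebalance completes.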
Cross data center replication (XDCR)
Couchbase 2.0 will also have multi-master, cluster aware replication. It allows for geographically dispersed clusters to replicate changes incrementally, tolerant of transient network failures and independent cluster topologies.
If you have a single cluster and geographically dispersed users, latency will slow down applications for distant users. The farther away a user is and the more network hops they face, the more inherent latency they will experience. The best way to lower latency for far-away users is to bring the data closer to the user.
With Couchbase XDCR, you can have clusters in multiple data centers, spread across regions and continents, greatly reducing the application latency for users in those regions. Data can be updated at any cluster, replicating changes to remote clusters either on a fixed schedule or continuously. Edit conflicts are resolved by using a "most edited" rule, allowing all clusters to converge on the same value.
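A "most edited" rule can be sketched as a pure function that every cluster applies to the same pair of conflicting revisions. The `Revision` type and `resolve` function below are hypothetical, and the tie-break on the raw value bytes is an assumption added for illustration, since a real system needs some deterministic way to pick a winner when edit counts are equal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    edit_count: int  # how many times this document has been edited
    value: bytes


def resolve(a: Revision, b: Revision) -> Revision:
    """Pick the revision with more edits; break ties on the value bytes.

    The tie-break is an illustrative assumption. What matters is that the
    choice is deterministic and does not depend on argument order, so every
    cluster converges on the same winner.
    """
    return max(a, b, key=lambda r: (r.edit_count, r.value))
```

Because the function is symmetric, it does not matter which cluster sees which revision first: `resolve(x, y)` and `resolve(y, x)` always agree.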
I feel like we are just getting started. There is still a ton of detail and new features I haven't gone into; these are just some of the highlights. I'm really proud and excited not just by what we have for 2.0, but what's possible on the fast, reliable and flexible foundation we've built and the future features and technology we can now easily build. I see a very bright future.
March 27, 2012
0 to 35 million in Six Weeks
If you haven't played Draw Something, you might want to wait until you have some free time: it's creative, social and addictive :) OMGPOP released this game less than 2 months ago, and it's currently the #1 game on Facebook, and the #1 app in the iOS app store.
What kind of backend lets you grow a game from nothing to #1 that fast with no downtime? Couchbase baby! Find out more here. Super proud of our guys who built our platform and made this happen. And congrats to OMGPOP for their $200 million sale to Zynga. Nice!