CRDTs: The Hard Parts

Hello everybody, my name is Martin Kleppmann. I'm a researcher at the University of Cambridge, and today I'd like to talk about some of our work on conflict-free replicated data types, or CRDTs. I'm going to start this talk with a very brief introduction as to what CRDTs are and why they are useful, but most of this talk is going to be about new results that we have figured out only within the last year or two. You might know me from this book here, which I wrote a couple of years ago.

It's called Designing Data-Intensive Applications, and it's a broad overview of the architecture of data systems and how to choose a database for your particular application. That's not what we're talking about today. Today we're talking about collaboration software, and I'm taking a fairly broad view of this: any kind of software where several users can come in and update a document, a database, or some kind of shared state. There are various examples of software

that fit in this category. Google Docs is an obvious example, where you can have several people editing a text document or spreadsheet at the same time; you get similar things with Office 365. Figma is an example of graphics software that supports multi-user real-time collaboration. Trello is an example of project management software where several users can update the status of various tasks, assign the tasks, add comments, and so on. All of these have in common that the

users are updating some kind of shared state. I'm going to use collaborative text editing as a running example for now; there will be some examples from other types of applications later as well. In collaborative text editing you can end up with a scenario something like this: you have two users who, at the beginning of the time we're looking at, have the same document, "Hello!". Then the red user comes along and changes this

document to "Hello World!" by inserting the word "World" before the exclamation mark. Concurrently, while the red user is doing this, the blue user changes the document to add a smiley face after the exclamation mark. These two changes happen concurrently, without knowledge of each other, and later on the network communicates those changes: the users communicate over the Internet or whatever network they are using, and they should converge

to the same state. We want that, at the end, both users have the same document on their screen, and that the document reflects all of the changes that were made. In this case the merged outcome is quite reasonable: as we would expect, the word "World" appears before the exclamation mark and the smiley face after it. There are broadly two families of algorithms for achieving this kind of collaborative text editing. One is operational transformation, or OT:

the classic approach to multi-user collaborative editing, which has been around for a long time and is used, for example, by Google Docs. Then there's a newer family of algorithms called CRDTs, which is what I work on, and which take a somewhat different approach to solving a similar problem. We will compare the two in a minute. I will start with a brief summary of how operational transformation works, because that is a useful comparison point for CRDTs.

With operational transformation, we first have to describe how changes are made to a document. When a user changes a document, they insert or delete a character at some position in the document, and we can describe the positions at which those insertions and deletions happen by using indexes: we simply number the first character in the document 0,

the second character 1, and so on. Now you can have two users updating their document. Here, on the left-hand side, the user changes the document "HELO" by adding a second L at position 3, and on the right-hand side we have the purple user adding an exclamation mark after the O. After these two changes have

happened, these changes need to be communicated over the network, so that one user finds out about the changes made by the other. From the left-hand side we have the insertion of the letter L at index 3 in this document, and from the right-hand side we have the insertion of the exclamation mark at index 4. Let's first take the insertion from the left-hand side and move it over to the right-hand side. If there's

a server, for example, that forwards these messages from one user to the other, the server can just forward this insert-L-at-position-3 operation over to the right-hand side; we insert L at position 3 there, and we end up with "HELLO" with two L's and an exclamation mark, just as we wanted. Now what happens if we go in the other direction, taking the exclamation mark insertion from the right-hand side to the left-hand side? Note that if we simply forward this operation, insert

exclamation mark at position 4, unchanged, then the user on the left-hand side would end up with the document saying "HELL!O", which is not what we want: we want both users to end up in the same state, and for that state to reflect the intention of what the users were writing. So instead of inserting at position 4, the exclamation mark needs to be inserted at position 5. The reason is that concurrently there has been

another insertion at position 3, before the position of this insertion, and so the insertion at position 4 needs to be transformed into an insertion at position 5, due to the concurrent insertion of the L earlier on. This is why the algorithm is called operational transformation: we take these insertion and deletion operations and, depending on which other operations happened concurrently, we transform the indexes at which those operations take place.
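
As a minimal sketch of this transformation step in code (an illustration of the idea only, not Google Docs' actual implementation; the tie-breaking rule for equal indexes is an assumption):

```typescript
// Transform an insertion against a concurrent insertion: a sketch of the
// index-shifting logic described above, not a full OT implementation.
type Insert = { index: number; char: string };

// Adjust `op` so it can be applied to a document that has already applied
// `concurrent`. `tieBreak` decides who goes first when both hit the same index.
function transformInsert(op: Insert, concurrent: Insert, tieBreak: boolean): Insert {
  if (concurrent.index < op.index || (concurrent.index === op.index && tieBreak)) {
    // A character was concurrently inserted at or before our position: shift right.
    return { index: op.index + 1, char: op.char };
  }
  return op; // the concurrent insertion was after ours, so nothing changes
}

// The example from the slides: "insert '!' at 4" must become "insert '!' at 5",
// because 'l' was concurrently inserted at index 3.
console.log(transformInsert({ index: 4, char: "!" }, { index: 3, char: "l" }, false));
// => { index: 5, char: "!" }
```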

This algorithm does work, but it relies on one very particular and fundamental assumption: in most operational transformation systems, all of the communication goes via a single central server. In Google Docs' case that server is provided by Google, and the server takes a really key role in this algorithm by sequencing all of the operations. The

assumption of operational transformation is that, because we have this central server that is required to sequence all of the operations, there cannot be any other communication channels in the system. Even if two users who are collaborating are actually sitting in the same room, and they could just be using the local Wi-Fi or even Bluetooth to communicate their changes between their two devices, using that local communication channel is not allowed, because it would undermine the

assumptions made by the operational transformation algorithm. So OT requires all of the communication to go via the central server, even if that server is on a different continent and even if the two users are actually sitting in the same room. This is where CRDTs come in: CRDTs allow this kind of multi-user collaboration without making any assumptions about the network topology. They don't assume anything about servers or what kind of network is used; any type of communication channel can be

used to communicate the operations, or the updates, from one user to another. The main correctness criterion that we have for both OT and CRDTs is what we call convergence: we require that whenever any two users have seen the same set of operations, those users must be in the same state, even if they actually saw the operations in a different order. What we do in CRDTs is

make the operations commutative, which means that even if their order is swapped around, the end result remains the same. This is how we can achieve convergence: ensuring that after everyone has communicated, everyone has the same document on their screen. This is of course a very useful property; if we did not have it, we would have some kind of permanent inconsistency between the users, and that would be bad. So convergence is clearly a minimum requirement for any of this sort of

collaboration software. But as we shall see in this talk, convergence by itself is not really enough, because convergence doesn't say anything about what that final state is after the nodes have communicated. Is the final merged state actually the state that we wanted? This is a little more difficult to define, because what state is desirable or not is really a matter of human judgment: what are our expectations,

as humans, of how the software should behave? Unfortunately, it turns out that several of these CRDT algorithms for collaborative editing sometimes behave in ways that are not really desirable, in ways that we as humans would not expect. I have come to the conclusion, after several years of working on CRDTs, that these problems are quite hard: a simple version of a CRDT is very easy to

implement, but actually getting it right, in such a way that it satisfies user expectations, is difficult. CRDTs are easy to implement badly, and what we shall look at in this talk is how we can implement CRDTs well, in a way that behaves how we want. Today I want to look at four topics. These are all areas in which CRDTs are difficult in some way or another, and for most of them we have solutions as well.

I'm going to talk through them one by one, and the one that I want to start with is interleaving. Interleaving is a problem that can happen in collaborative text editors if several people insert text at the same position in the document. In order to explain why this interleaving happens, I'm going to first give a little introduction to how CRDTs for text editing work, starting with an example.

As I said earlier, when we were looking at operational transformation, we identify the position where we want to insert or delete a character by its index, simply by counting from the start of the document, and that has the problem that those indexes need to be transformed. CRDTs avoid the need for this kind of operational transformation by taking a different approach: instead, CRDTs create a unique identifier for every

single character in the document, and that unique identifier remains stable: it stays the same even if stuff is added and deleted elsewhere in the document, so each character keeps its own permanent unique identifier. There are several possible ways of constructing such identifiers. One way is to choose a number between 0 and 1, some kind of fractional, rational number, where we say that 0 is the beginning of the document and 1 is the end

of the document, and for every character somewhere within the text document, we just pick a number that is appropriately positioned within that number line. Here, in our example of "HELO", we could assign the number 0.2 to the H, 0.4 to the E, 0.6 to the L, and 0.8 to the O, and this way we have spaced out our characters over the 0-to-1 number

interval. Now, when we want to insert a second letter L and an exclamation mark, we pick numbers in between the existing ones. We want to insert a second L in between the first L and the O, so in between 0.6 and 0.8; let's pick 0.7, the midpoint. Likewise, for the exclamation mark we'd pick a number between 0.8 and 1.0, say 0.9. This is a simple enough scheme, and we can use it to implement a CRDT in

quite a simple way. What we can now do is think of our text document as a set of triples. Each of these triples contains the character, the numeric value that tells us where it goes, and finally a node ID identifying the node that created this particular insertion operation. This is just some

kind of unique identifier for a particular user or a particular node in the system, and the reason we have it is that, in case two different nodes pick the same number, we can define the ordering on these characters using the node ID as a tiebreaker. We now have this set of triples representing the document, and we can make edits: we can insert new characters into the text document just by picking numbers for them and then

calculating the set union. We get a set including the new triples, and we can reconstruct exactly what the state of the document is by sorting those triples by their numbers, tiebreaking on node IDs; that gives us the text document. Several CRDTs for text take essentially this approach. They tend not to actually use numbers between 0 and 1; instead they use things like a path through a tree, but the underlying idea is the same: they are essentially picking positions on a number line.
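
Here is a toy sketch of this scheme (names are hypothetical; real algorithms such as Logoot or LSEQ use tree paths rather than floating-point numbers, which would quickly run out of precision):

```typescript
// The document as a set of (position, nodeId, character) triples, as described
// above; the node ID breaks ties if two nodes pick the same position.
type Triple = { pos: number; nodeId: string; char: string };

// Recover the text by sorting on (pos, nodeId).
function documentText(doc: Triple[]): string {
  return [...doc]
    .sort((a, b) => a.pos - b.pos || a.nodeId.localeCompare(b.nodeId))
    .map((t) => t.char)
    .join("");
}

// Insert between two existing positions by picking the midpoint.
function insertBetween(doc: Triple[], left: number, right: number, nodeId: string, char: string): void {
  doc.push({ pos: (left + right) / 2, nodeId, char });
}

const doc: Triple[] = [
  { pos: 0.2, nodeId: "a", char: "H" },
  { pos: 0.4, nodeId: "a", char: "E" },
  { pos: 0.6, nodeId: "a", char: "L" },
  { pos: 0.8, nodeId: "a", char: "O" },
];
insertBetween(doc, 0.6, 0.8, "b", "L"); // the second L, at 0.7
insertBetween(doc, 0.8, 1.0, "b", "!"); // the exclamation mark, at 0.9
console.log(documentText(doc)); // "HELLO!"
```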

When we implement a CRDT in this way, we can get some unfortunate behavior; this is an example of behavior that can occur in some text-editing CRDTs. Here we have two users. On the left-hand side, user 1 starts off with the text document saying "Hello!" and inserts the word "Alice" between "Hello" and the exclamation mark. Concurrently, on the right-hand side, user 2 inserts the word "Charlie"

in the same place, between "Hello" and the exclamation mark. Now the two users communicate and merge their changes, and what we get out at the end is "Hello" followed by some kind of weird jumble of letters that we can't read. What does this even mean? How did this happen? We started with "Alice" and "Charlie" and ended up with a mash of letters. Well, what happened is this: we started off with a text document where each

character has a number on the number line; the O of "Hello" had 0.64 and the exclamation mark had 0.95. The insertion of " Alice" chose some numbers within the interval between 0.64 and 0.95, and likewise the " Charlie" insertion independently chose some numbers within that same interval. What a lot of CRDTs do here is actually randomize

the numbers a little bit, because that reduces the probability that two different nodes will pick exactly the same number for a particular character, reducing the risk of collisions. What has now happened with this insertion is that we have taken all of the numbers from the "Alice" characters, which are colored red here, and the insertions of the "Charlie" letters,

which are in blue, and we have simply ordered them all by their numbers. The result, unfortunately, is that we've interleaved the two character sequences that came from the two different users, producing a completely unreadable mess. This problem is bad enough if we're just inserting a single word, as we are here. It gets even worse if the two users have concurrently inserted an entire paragraph, or maybe an entire section, into their text document. If you

have a paragraph or section in which the characters have been interleaved on an essentially random basis, you will not be able to use that text anymore; you will just have to delete it and rewrite it, and the users are not going to be happy about this. Unfortunately, this behavior really does occur in real CRDTs. We have examined a number of different CRDT algorithms for text editing, and we have demonstrated this problem in at least the algorithms called Logoot and LSEQ. In these

algorithms, unfortunately, the interleaving problem is very deeply baked into the fundamental way the algorithm is constructed. I cannot see any way of fixing this bug in these algorithms, because the way the numbers, or the paths through the trees, are assigned means this interleaving is inherent in how these algorithms work. In other algorithms we

have not found this problem: in Treedoc and WOOT, for example, we did not find this interleaving behavior. That does not entirely guarantee that it can never happen, but we believe it won't, so in that sense these algorithms are safe. However, unfortunately, these two algorithms are also the least efficient ones. The entire reason why the LSEQ algorithm was designed, for example, is that its designers wanted

a CRDT that was more efficient than Treedoc and WOOT, because Treedoc and WOOT unfortunately take a lot of space for their identifiers. So, unfortunately, this performance optimization has introduced bad behavior into the algorithm, and in my opinion that makes algorithms like LSEQ and Logoot basically unsuitable for use in many text-editing applications. Moreover, there is a specification of collaborative text editing, called A-strong, which was published by Attiya

and others in 2016, and this specification also allows interleaving as a valid behavior, though we have a paper in which we fix this specification to rule out the interleaving behavior. Finally, RGA is an interesting one. RGA is another text-editing CRDT algorithm which does not quite have the same interleaving problem; it has a somewhat lesser interleaving problem, and I will mention it briefly because it's an interesting case study.

In RGA, the interleaving problem that can happen is like this. In this example, on the left-hand side we've got user 1, who starts off with the document saying "Hello!". The user inserts the word "reader" between "Hello" and the exclamation mark, then moves their cursor back to just after the word "Hello" and types the word "dear". So what we have here is typing the word "reader", moving the cursor back, and then typing the word "dear". On the right-hand

side we have user 2, who just types the word "Alice" between "Hello" and the exclamation mark. What can now happen in RGA is that the word "Alice" can get interleaved in between the word "dear" and the word "reader". So even though the text "dear reader" was typed by one user and the word "Alice" was typed by the other user, we can end up with the two mixed up a bit. We don't get the same character-by-character interleaving as we get with some of the other CRDTs, but we do still have the

possibility for text to be inserted into the middle of a different user's insertion. The reason this happens is that the data structure RGA uses looks a bit like this: it's a kind of tree data structure, usually with very deep subtrees and a small branching factor. Each node in this tree is a character that was inserted, and the parent of each node is

the character that was its immediate predecessor at the time of insertion. We start off with a single special node at the root, which is the beginning of the document, and when the user types "hello!", each of those letters has as its parent the character that was just typed before it. So you end up with this chain of h-e-l-l-o, almost like a linked list, where each character's parent is its

immediate predecessor in this tree. Then the users come and position their cursors in between the "hello" and the exclamation mark; one types " reader", another types " Alice", and the first user, who typed "reader", moves their cursor back and types " dear". You can see that every time the cursor gets moved, a new subtree is formed. The

order of the characters in the document is then a depth-first traversal over this tree. We can see that in the interleaving that has happened here, "hello dear Alice reader!": we visit "hello", then first the "dear" subtree, then the "Alice" subtree, then the "reader" subtree, and then the exclamation mark. The order in which we visit those subtrees is defined by a timestamp that is attached to every node in the tree. If, as in this case,

a node has several children, then we visit those children in descending order of timestamp, and because the timestamps are unpredictable in concurrent editing cases, you can end up with any of these three possible outcomes as the merge result. So it's not as bad as the character-by-character interleaving that you get with the other algorithms.
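
A small sketch of this tree structure and traversal (simplified: timestamps are plain integers, and deletion is omitted):

```typescript
// Each node's parent is the character that immediately preceded it when it
// was typed; siblings are visited in descending timestamp order, and the
// document is the depth-first traversal of the tree.
type TreeNode = { ts: number; char: string; children: TreeNode[] };

function insertAfter(parent: TreeNode, ts: number, char: string): TreeNode {
  const node: TreeNode = { ts, char, children: [] };
  parent.children.push(node);
  parent.children.sort((a, b) => b.ts - a.ts); // descending timestamp order
  return node;
}

function traverse(node: TreeNode, out: string[] = []): string[] {
  if (node.char !== "") out.push(node.char);
  for (const child of node.children) traverse(child, out);
  return out;
}

// The root is a special "beginning of document" node.
const root: TreeNode = { ts: 0, char: "", children: [] };
let prev = root;
for (const [i, c] of [..."hello"].entries()) prev = insertAfter(prev, i + 1, c);

// Two concurrent insertions after the same character become siblings, and
// the one with the higher timestamp is visited first.
insertAfter(prev, 7, "!");
insertAfter(prev, 6, "?");
console.log(traverse(root).join("")); // "hello!?"
```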

To be precise, it is possible to get character-by-character interleaving in RGA. The circumstances in which that happens are when the users type their document back to front: a user starts at the end of the document and first types the last letter, then moves their cursor back to the beginning and types the second-to-last letter, moves the cursor again and types the third-to-last letter, and so on, typing the entire document from the end to the beginning. If both users do that, they would actually get character-by-character interleaving in RGA, because in that case all of the characters in this tree would be siblings: they would all be children of the root

node, and the ordering would again be defined purely by timestamps. Fortunately, in practice people tend not to type documents like that; people tend to type documents in a reasonably linear fashion from beginning to end. Of course, occasionally people will backtrack a bit, delete a bit, copy and paste some sections, and so on, but as a general rule, typing happens from beginning to end, and for that reason the interleaving problem is not actually

as bad in RGA as it is in the other algorithms. However, we could still say we want to be strict and not allow any interleaving at all, and we did actually devise an algorithm for this: for RGA we were able to find a fix, a modification to the algorithm, which changes it such that this kind of interleaving of concurrent insertions does not happen, regardless of how the cursor was moved. I don't have time in this talk to describe this particular algorithm, but if you're interested, there's

a paper we published in 2019 called "Interleaving anomalies in collaborative text editors", and in that paper the algorithm is described exactly. Let's move on to our next topic, which is reordering list items: we have a list of things, and we want to move an item from one position in the list to another. One example application in which we might want this is a to-do list.

In a to-do list you can often drag and drop items into whatever order you want. What happens here, for example, is that you have a to-do list consisting of: first, buy milk; second, water the plants; third, phone Joe. The user is dragging and dropping the "phone Joe" item and moving it to the top of the list, so that it is now at position 1. How do we do this? Well, we can represent the list

using any of the CRDTs we talked about earlier: all of the CRDTs for text editing are also CRDTs for lists, and we can use those to represent this ordered list of items. However, none of those CRDTs actually supports a move operation; all they have are operations for inserting and deleting items. Now, you might think that's not too bad, because we can always implement moving by just deleting the

item from its old location and inserting it at its new location. This works, except that it has a problem if several people concurrently move the same item. In this example, replica A and replica B both move "phone Joe" to the top of the list. What happens if we implement this move by deleting and reinserting? Well, they both delete "phone Joe" from position 3, so it's definitely gone from there, and then they both

insert "phone Joe" at position 1. But the result is that these two insertions at position 1 both take effect, and we end up with two copies of "phone Joe", at positions 1 and 2. The item that we only wanted to move has accidentally been duplicated in the process, and this is really not the kind of behavior a user would expect of a to-do list application: just because they reordered some items on their list doesn't mean they want copies of those items appearing.

So what is the desirable behavior? Let's look at a similar example. In this case we've also got two replicas moving "phone Joe", but they're moving the item to different positions: replica A moves "phone Joe" to the top of the list, whereas replica B moves "phone Joe" to the second position, after "buy milk". What is the desirable behavior in this case?

We could say that "phone Joe" should now appear in both positions, before "buy milk" and after "buy milk", but then we've got duplication again, and we did not want duplication. So really what we want is for "phone Joe" to appear in one of the two positions. Maybe we show the user a warning saying that we picked one of the two positions, or maybe it's just fine to pick one of the two and forget that there were two different list positions. So let's say we're

going to pick one of the two concurrent positions, arbitrarily but deterministically, as the winner. In this example, "phone Joe" at the top of the list is the winner: from the point of view of the first user nothing changes, but from the point of view of the second user, after they've moved "phone Joe" to position 2, it subsequently moves to position 1. Now let's think about

what is happening here. We've got two concurrently written values for the two different positions, and we want to pick one of them, arbitrarily but deterministically, as the winner. This may look familiar to you if you have studied existing CRDTs, because it's just what happens in a last-writer-wins register. In a last-writer-wins register we've got a register, effectively a variable, which can

concurrently have values assigned to it by different users, and if we have those concurrent assignments, the end result after everybody has merged their state is that we pick one of the written values as the winner and throw away any other values that were written concurrently. So you can think of the position of the list item "phone Joe" as having last-writer-wins register semantics: first, one user assigns the

position of "phone Joe" to be the head of the list, the other user assigns the position of "phone Joe" to be after "buy milk", and as the two users communicate, the end result is that the winner is one of those two values, either head of the list or after "buy milk". Let's make this a little more formal. If we want to implement this, we first need a last-writer-wins register.
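
A minimal sketch of such a register (assuming timestamped writes, with the writer's node ID as a deterministic tiebreaker):

```typescript
// A last-writer-wins register: of two concurrent writes, the one with the
// higher timestamp wins; equal timestamps fall back to comparing node IDs,
// so every replica picks the same winner.
interface LwwRegister<T> { value: T; ts: number; nodeId: string; }

function mergeLww<T>(a: LwwRegister<T>, b: LwwRegister<T>): LwwRegister<T> {
  if (a.ts !== b.ts) return a.ts > b.ts ? a : b;
  return a.nodeId > b.nodeId ? a : b; // deterministic tiebreak
}
```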

Second, we need some kind of unambiguous way of referencing positions in the list. But think about what we did earlier, when we were talking about unique identifiers for the characters in a text document: what these list CRDTs and text CRDTs are already doing is providing us with a stable and unique way of referencing particular positions in a list. So we can just use one of these existing algorithms and use its unique identifiers for list elements in order

to reference the position of a particular item in a list. Different CRDTs have different ways of constructing those identifiers, as we saw: Treedoc uses a path through a binary tree, Logoot uses a path through a tree with a higher branching factor, and RGA and Causal Trees use some kind of timestamps. But the basic idea is still the same: we've got unique identifiers for particular positions in the list, and we can take those identifiers and put them as the value inside our last-writer-wins register.

Next, if we've got several list items, we need a separate register for each list item to hold the position of that particular item. For this we can construct a set, for example using an add-wins set CRDT, whose contents are all of the items on our list. Each item on the list is going to be a pair consisting of a value, which is the text of the to-do list item, for example,

and a last-writer-wins register containing the position of that particular list item. Now we can add new items to the list by adding them to the add-wins set, and we can move an item from one position to another by updating the value of its last-writer-wins register. In particular, we use a list CRDT to create a position identifier for the location where we want to move a particular item, and then we take that position identifier and use it as the

value in the last-writer-wins register. So here we've constructed a new CRDT operation, a move operation for lists, just by using three existing CRDTs: any of this range of different list CRDTs, plus an add-wins set, plus a last-writer-wins register. And because we've just composed existing CRDTs, we know that the end result is also going to be a CRDT. This is very nice, and it allows us to do atomic moves of a single list item at a time.
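
A sketch of this composition (the underlying list CRDT that generates stable position identifiers is assumed rather than shown, and the names here are illustrative):

```typescript
// A list item pairs its value with a last-writer-wins register holding a
// stable position identifier from an underlying list CRDT.
type PositionId = string; // produced by the list CRDT, e.g. a tree path

interface Positioned {
  text: string;
  pos: { id: PositionId; ts: number; nodeId: string };
}

// Moving an item = creating a fresh position identifier at the destination
// and assigning it into the register; the usual LWW merge rule applies.
function applyMoveOp(item: Positioned, id: PositionId, ts: number, nodeId: string): void {
  const cur = item.pos;
  const wins = ts > cur.ts || (ts === cur.ts && nodeId > cur.nodeId);
  if (wins) item.pos = { id, ts, nodeId };
}

// If two replicas concurrently move the same item, each replica eventually
// applies both assignments; both pick the same winner, so the item ends up
// in exactly one position instead of being duplicated.
```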

Things get a little more difficult if we want to move more than one item at a time. In a to-do list you really only need to move one item at a time, because drag-and-drop typically lets you pick up one item at a time in the user interface, so that's not really a problem there. But in text editing you can imagine other circumstances arising. In this example we've got a to-do list, this time represented as plain text, as a

list of bullet points, where each item starts with a bullet point character and ends with a newline character. Here, replica B takes the item "milk" and moves it in front of "bacon". That means taking the whole range of characters, bullet point, space, m, i, l, k, newline, and moving that whole range to be in front of "bacon". Concurrently with this, a

different user updates the "milk" item to say "soy milk", which they do by deleting the uppercase M and inserting the text "soy" plus a lowercase m. So what do we expect to happen here? Both the change from "milk" to "soy milk" and the move of "milk" in front of "bacon" should take effect, so the desired outcome is that "soy milk" comes first in the list, in front of

"bacon". That means the two changes have been merged together cleanly. This is what we want to happen; unfortunately, it's not what actually happens. If we take the algorithm that we just constructed for moving a single list item, and move each character individually according to that algorithm, the end result is this: "milk" is correctly moved in front of "bacon", exactly as we wanted, but the change from "milk" to "soy milk" remains

attached to the old position of where "milk" was previously, not the new position that "milk" has been moved to. So the end result is that the list reads milk, newline, bacon, newline, soy m, and that "soy m" is just sitting there without any context, because its context has been moved away in the meantime. This is a problem that I do not know how to solve for the time being. I've spent a while thinking about it and come up with a couple of half-solutions

that don't really work; a couple of other people are thinking about it as well, so if you're interested, feel free to think about it, and do publish the result if you manage to figure it out. So this is the problem of moving items in lists: we've figured out how to move a single list item at a time, but moving ranges of characters in a way that behaves nicely is still an open challenge. Let's move on to point three, which is

again about moving, but now, rather than moving elements in lists, we're going to move subtrees in a tree. Let's again look at an example. Say we have a tree structure, which might be a file system, and, as in the case of moving elements in lists, let's think about what happens if the same node in a tree is concurrently moved by two different users to two different locations. You can see here that we start off with the same

tree on both replicas, where A, B and C are children of the root. On replica 1 we take the node A and move it to be a child of B, whereas on replica 2 we take the same node A and move it to be a child of C. So now A has been moved to two different locations on two different replicas. What do we expect the outcome to be? It's quite similar to the case of moving elements

in lists. One option is to duplicate it, so that A appears both as a child of B and as a child of C. This means that any children of A will also have to be duplicated, and that's what you get here in the box labeled A: you have the A subtree with its children A1, A2, A3, and then another whole copy of the same

subtree. So this is duplication, and, as in the case of the lists, I think duplicating nodes on concurrent moves is not good behavior, so we don't want that. Another option is for A to actually be shared as a common child of both B and C. This works, but the result is no longer a tree, so if what we expect our data structure to be is a tree, rather than a DAG or some other kind of graph,

then this is not an acceptable outcome, because here A has two parents, and in a tree no node ever has two parents. That leaves options (c) and (d) as the possible outcomes: (c) is where we simply take replica 1's operation as the outcome and ignore the move made by replica 2, and (d) is the other way around, where we choose replica 2 to be the winner and ignore the move made by replica 1. In either case,

either A appears as a child of B, or A appears as a child of C, but not both. So, as in the case of moving items in lists, what we have is a kind of last-writer-wins behavior here, and I think that is reasonable. Now, in trees, additional complications appear; trees are definitely a harder case than lists. Another example of the tricky things that can happen in trees can be

illustrated with your file system. If you want, you can try this on your computer: create a directory called a, then create a subdirectory within it called b, and then try moving a into b, so that a becomes a child of a/b. This may sound weird: what you are doing here is effectively trying to move a directory into itself. I don't know if you've ever tried this; I did try

it on macOS, and I get this response from the shell saying "invalid argument": I'm not allowed to do this move. This makes sense, because if we did perform a move of a directory into itself, we would be creating a cycle, and then our data structure would no longer be a tree; it would be some kind of graph with cycles in it, which would be very confusing for a file system. So if we build CRDTs for trees, it also makes sense that, if we allow moving

subtrees, we also prevent this kind of cycle, and prevent somebody from moving a node inside itself. But unfortunately, with CRDTs we have the additional complication of concurrent changes. Consider this example, where we start off with the same tree on both replicas. On replica 1 we take the node B and move it to be a child of A, so B appears as a child of A, while C remains

an existing child of A. On replica 2 we start off with the same tree, and in this case we move A to be a child of B, so we end up with A as a child of B, and C as a child of A as before. What happens now if we merge these two? We've got two move operations, each of which by itself is perfectly fine: there's nothing wrong with moving B to be a child of A, and

there's nothing wrong with moving A to be a child of B. But if we combine both of them, we end up with exactly the same problem as before: effectively, one directory gets moved inside itself, and if we are not careful with this algorithm, we end up with A and B in a cycle, disconnected from the root of the tree. This is again no longer a tree but some more general graph data structure, and we don't want that if we want to preserve the fact that our data structure is a tree. Another

option, as before, is to duplicate nodes: we can say that we move B to be a child of A, and we also move A to be a child of B, so that both outcomes simply exist in our tree, and again we would have to duplicate any children of those nodes. As before, I think duplication is the wrong way of doing this. So it seems that, as before, what we really want is last-writer-wins

semantics, where we choose either the outcome of moving B into A and ignore the other move, or the outcome of moving A into B and ignore the other move. In this case we want to pick one of the two conflicting operations and simply ignore the other one. What we have to ensure is that all of the replicas pick the same operation as the winner and ignore the same other operation, because otherwise

they won't converge, and then it's not a CRDT. So how do we actually achieve this, where out of these conflicting operations exactly one gets picked? Well, we can try some existing software. I tried this with Google Drive, for example, and I got this wonderful error message. I tried exactly this scenario: have two directories a and b, move a into b and move b into a, concurrently on two different computers.

What happens? I get some kind of internal unknown error, "try again", but this error never goes away; it just sits there in a loop, trying to sync. So Google Drive hasn't solved this problem either. Okay, I guess we'll have to think about it ourselves. The way we solve this is by thinking about the sequence of operations that get applied on each replica, with a timestamp on each operation, and we're just going to assume

some globally unique timestamps, as usual; you can use Lamport timestamps, for example, and that works fine. Some of these operations are going to be move operations, and some might be other kinds of operations. Here, on replica 1, we have moving A into B, and on replica 2, moving B into A, and as I said before, each of these operations by itself is safe and fine to execute.
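
For reference, a minimal Lamport clock sketch (the standard construction: a counter paired with a node ID, giving a total order that all replicas agree on):

```typescript
interface LamportTs { counter: number; nodeId: string; }

class LamportClock {
  private counter = 0;
  constructor(private readonly nodeId: string) {}

  // Bump the counter for each local operation.
  tick(): LamportTs {
    this.counter += 1;
    return { counter: this.counter, nodeId: this.nodeId };
  }

  // On receiving a remote operation, jump past its timestamp.
  receive(remote: LamportTs): void {
    this.counter = Math.max(this.counter, remote.counter) + 1;
  }
}

// Total order: compare counters first, node IDs second.
function before(a: LamportTs, b: LamportTs): boolean {
  return a.counter < b.counter || (a.counter === b.counter && a.nodeId < b.nodeId);
}
```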

What we need to ensure is that we detect the case when two operations are in conflict with each other, in the sense that executing both of them would create a cycle; we have to ensure that, no matter what, we do not create a cycle. Now, what can happen when these two replicas merge their sequences of operations? Well, we take the union of the operations from both replicas and put them in timestamp order, and if we say that the operations get executed exactly in

increasing timestamp order, then the first move is fine, because by that point nothing bad has happened yet; but by the time we get to the second move operation, that move is unsafe, and if we want to ensure that there are no cycles in our tree, we have to skip that operation, just pretend it didn't happen. This is tricky, because replica 2 has already executed this operation, so replica 2 will somehow have to undo this

operation that it has already performed, in order to ensure that it converges again with replica 1. The basic principle we're going to use is to take these sequences of operations, with a timestamp on each operation, and merge them such that we get a single sequence of operations in timestamp order. For example, here on replica 1 we have timestamps 1, 3, 4 and 5, and on replica 2 we have timestamps 2, 6, 7

and 8. Imagine you are replica 1 and you've executed operations 1, 3, 4 and 5, and now you receive the operation with timestamp 2 from replica 2. What you need to do is take this timestamp-2 operation and insert it somewhere earlier into the sequence of operations that you've executed; but time has already moved on by this point, and you've already executed the operations with timestamps 3, 4 and 5. So how do we

handle this? We can take a fairly simple approach, which is to undo and redo operations, and this is exactly what we do in our algorithm. That is, if you are replica 1 here, wanting to apply the operation with timestamp 2, you've got your log of the operations that you've performed, with timestamps 1, 3, 4 and 5, and we're first going to undo operations until we've undone all of the

operations with timestamps greater than the new operation: we undo op 5, then op 4, then op 3, and now we're only left with op 1. Now we can apply op 2, with timestamp 2, on top of that, and then we redo the three operations that we had undone. The end result is a log of operations in which op 2 has been inserted in the right place, and if we keep applying this kind of logic on each of the replicas, it ensures that all of the replicas end up executing the same operations in the same order.
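
A sketch of this undo/redo integration (the `undo` and `redo` callbacks stand in for the tree-specific apply and undo logic described below; only the ordering logic is shown):

```typescript
interface LogEntry { ts: number; redo: () => void; undo: () => void; }

// Insert a newly received operation at its place in timestamp order: undo
// everything with a later timestamp, apply the new operation, redo the rest.
function integrate(log: LogEntry[], newOp: LogEntry): void {
  const undone: LogEntry[] = [];
  while (log.length > 0 && log[log.length - 1].ts > newOp.ts) {
    const entry = log.pop()!;
    entry.undo();
    undone.push(entry);
  }
  newOp.redo();
  log.push(newOp);
  while (undone.length > 0) {
    const entry = undone.pop()!; // reapply in ascending timestamp order
    entry.redo();
    log.push(entry);
  }
}
```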

This just means we have a bit of extra work to do, because if some operation needs to be inserted quite early in the order, we have to do a lot of undos for that one operation, and then a lot of redos to replay all of the operations that happened after it. You might wonder about the performance of this: isn't the performance going to be terrible? Well, we ran some performance experiments, and it was okay. There was certainly a cost to performing

all of these undos and redos, and you can see that here. We set up a system with three replicas on three different continents, so there was a round-trip time of over a hundred milliseconds between the replicas, and we then started generating move operations at a high rate. The higher the rate of move operations in this system, the more undo and redo needs to be done, and so the slower it becomes

to execute each individual operation. We were able to get this system to at least 600 operations per second, which is not massive on big-data scales, but on the other hand, a single user is not going to perform more than 600 move operations per second on their own local file system. So for a lot of collaboration software this performance is perfectly fine, and

we don't have to worry too much. I will explain a little more about how this algorithm works in terms of the data structures that we use. We first have to find a way of describing the move operation that we want to perform. We're going to say a move operation is a structure with the following fields: first of all, as I said earlier, a move operation has a globally unique timestamp, such as a Lamport timestamp; then a move operation has a child, which

is the node in the tree that we are moving, and a parent, which is the new location, the new parent of that child in the tree. Note that we do not record in this operation what the old parent of the child was, so we can call this a "stealing" move: we simply take the child from wherever it currently is in the tree and move it to be a child of the new parent. Further, we can have a bit of metadata associated with the operation: for

example, in a file system, a file lives within a directory and has a name within that directory, so the name can be the metadata associated with a particular file in a particular directory. Then we have to construct the log of operations in order to perform the undo and redo, and in order to perform the undo we need to associate a little bit of extra data with each log entry. Each entry has several fields that come

straight from the move operation: the timestamp, the new parent, the new metadata, and the child all come directly from the move operation. In addition to these, we now also record the old parent and the old metadata, that is, what the parent of the child was before we performed this move, and what its old file name was, for example, if we're using the metadata for file names.

This allows us to perform the undo, because to undo the operation we just set the child's parent and metadata back to the old parent and the old metadata. Given this log of move operation entries, we can now construct the current state of the tree: the tree is just a set of (parent, metadata, child) triples, where each triple records that child is a child of parent in

the tree. We can then define an ancestor relation: we say that A is an ancestor of B if either A is a direct parent of B, that is, a triple (A, m, B) exists in the tree, or there exists some tree node C such that A is a parent of C and C is an ancestor of B. This defines the transitive closure over the parent-child relationship. Given this definition of ancestry,

we can define what it means for a move operation to be safe. For a move operation, we look at the child that is being moved and the parent that is the destination it is being moved to. If the child is currently already an ancestor of the destination, then the move would be unsafe, as we would create a cycle, so we do nothing. Similarly, if the child and the parent

are in fact the same node, that would also be unsafe, so again we do nothing. In all other cases it's fine, so we update the tree by removing the existing triple containing the child, whatever its current parent and metadata are (this is exactly the "stealing" I mentioned: we take the child node from wherever it currently is in the tree), and then we add the new (parent, metadata, child) triple to the tree. This defines the updated state of the tree.
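
A sketch of these definitions in code (simplified from the paper; metadata is a plain string, and the efficiency of the ancestor check is ignored):

```typescript
// The tree as a set of (parent, meta, child) triples; each child appears in
// at most one triple, which is what makes it a tree.
type NodeId = string;
interface Triple { parent: NodeId; meta: string; child: NodeId; }

// a is an ancestor of b if a is b's parent, or an ancestor of b's parent.
function isAncestor(tree: Triple[], a: NodeId, b: NodeId): boolean {
  const edge = tree.find((t) => t.child === b);
  if (edge === undefined) return false;
  return edge.parent === a || isAncestor(tree, a, edge.parent);
}

// Apply a move if it is safe; otherwise leave the tree unchanged.
function applyMove(tree: Triple[], parent: NodeId, meta: string, child: NodeId): Triple[] {
  // Unsafe: moving a node into itself or into its own subtree (a cycle).
  if (child === parent || isAncestor(tree, child, parent)) return tree;
  // The "stealing" move: remove child from wherever it currently is...
  const rest = tree.filter((t) => t.child !== child);
  // ...and attach it under its new parent with the new metadata.
  return [...rest, { parent, meta, child }];
}
```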

Given this function for performing a move operation, and given the undo/redo procedure that I explained previously, we now have an algorithm for safely performing move operations on a tree CRDT. We proved several theorems about this algorithm. In particular, we showed that the algorithm does indeed preserve the tree structure: to preserve the tree structure, we need every node to have a unique parent,

and the tree must contain no cycles; that is the definition of a tree, and we prove that these properties hold, so that for any sequence of operations that is executed, the data structure remains a tree. Then, obviously, we have to prove that it's a CRDT, which means that it converges: given any two sequences of operations on this tree, if one sequence is a permutation of the other, then applying those sequences leads to the same state; in

other words, the operations are commutative. So this algorithm works, and we have implemented it and proved these properties about performing move operations on trees. Okay, moving on from moving subtrees in CRDTs, let's talk about efficiency, and about making CRDTs more efficient. One particular problem that occurs with a lot of CRDT algorithms, especially the text-editing CRDTs, is

the overhead that the CRDT metadata incurs. Think about a text document: we said we give every single character in the document a unique identifier. What you end up with is that each character of English text is one byte for the ASCII character, in the UTF-8 encoding, but then you have additional metadata. You might have, for example, some path through a tree, which might take a couple of dozen

bytes to encode; then you might have some identifier for the node that inserted a particular character, which, if it is a UUID, will be at least another 16 bytes, or more like 36 bytes if you encode it in hexadecimal. So very quickly you end up in this really disproportionate situation where you have one byte for the actual data that you want, and something like a hundred bytes of metadata, or even more than that, just for the CRDT metadata needed

to make the CRDT work. That's not a particularly great situation to be in. So what I would like to talk about now is some work that we've been doing as part of the Automerge project, a CRDT implementation that I work on, in order to bring down that metadata overhead, and we've had some really good results by using some carefully designed algorithms and data structures. Let me start with the results. The old

version of Automerge uses a JSON encoding for all of the metadata, both to send it over the network and to write it to disk, and this JSON encoding is incredibly verbose. What we've done here as a case study is to take the editing history of the LaTeX source of a paper. In order to generate this editing history, a colleague and I actually wrote the paper using a homegrown text editor,

which allowed us to capture every single keystroke that went into the writing of the paper. The final file is about 100 kilobytes of plain text, ASCII text containing LaTeX markup, and we captured over 300,000 operations, that is, the 300,000 individual keystrokes in the editing history leading up to it. That includes all of the individual characters that were typed in order to

create this document, all of the characters that were deleted, anything that was added or removed, any typos that were corrected, and so on. We also recorded all of the cursor movements: every single time the cursor was moved from one position in the document to another, that was recorded too. If you take this history of 300,000 or so changes and encode it in the current Automerge data format, the JSON data format,

the file size is about 150 megabytes; that is, we're using almost 500 bytes per change, which is pretty terrible. You can see that with the new binary data format we've designed to bring down that metadata overhead, we've made some really big improvements. What we have done is, first of all, to design a binary format that encodes

the full editing history, including every single insertion and deletion and when it happened, in a compact form. It doesn't lose any information at all; this is a completely lossless encoding. What we've managed to do is reduce the file size by over a factor of 200, down to about 700 kilobytes, containing exactly the same information, just encoded differently. I should emphasize that if you

just took the JSON history and encoded it using one of the binary equivalents of JSON, you would not get anywhere near this: if you use Protocol Buffers or MessagePack or something like that, you might reduce it by a factor of two or three at the most. In order to get this factor-of-200 improvement, you have to design a data format that is specific to CRDTs, and specific to the particular editing patterns that tend to occur in these

sorts of applications. Now, you can see that by gzipping these files you can reduce them: you can get the 150 megabytes down to about 6 megabytes, but 6 megabytes is still huge compared with the 700 kilobytes that we achieved uncompressed in our binary format, and that still gzips down further to about 300 kilobytes without losing any information at all. So note that by this point we've been

able to encode a change history consisting of over 300,000 changes in 300 kilobytes, so we're now actually using less than one byte per change to encode the full change history. We can look back at any past version of this document, and diff any two versions at any points in time, and all of that data is still there, encoded in this very compact form, which I find very exciting. Of course, we can still

compress it further by throwing away some of the history. One of the first things we might choose to throw away is the cursor movements, because, to be honest, who cares about exactly where the cursor was at which point in the editing history? We save about 22% by throwing away the cursor movements, and we can go down to about 230 kilobytes by also throwing away all of the change history for the text. By

this point, at 230 kilobytes, we still have all of the CRDT metadata, including all of the tombstones. This means we're still able to perform arbitrary merges between different replicas of this data, and yet we are down to only about twice the raw data size: about 100 kilobytes is the raw text document with no CRDT metadata at all, just the final state of the text document, and all of

the CRDT metadata is adding only a bit more than 100 percent overhead to that, without gzip. If we apply gzip, then the file sizes are almost the same: the file containing the CRDT metadata, gzipped, is about the same size as the plain text file non-gzipped. Now, we can get even further down by removing the tombstones. The tombstones are the markers for deleted characters in the

document, and if we remove them, we save another 70 kilobytes or so. Whether you want to do that or not depends, because if you throw away the tombstones, it means you can no longer merge with edits that happened concurrently: any edits that happened before or concurrently with the tombstone removal can't be merged anymore. But if

you do remove the tombstones, then the raw CRDT metadata, just the unique identifiers for the characters, takes only 50 kilobytes at that point, which is only about 48 percent overhead on top of the raw text document in our data format. I find this very exciting. I should tell you a bit about how this actually works, and how we were able to achieve such good compression. The basic idea is,

as I said, that we want to maintain all of the change history as much as possible, so we're going to store the full set of all the insertion operations, all the deletion operations, and all the cursor movements if necessary. Each operation is given a unique ID, which is just a Lamport timestamp: a pair of a counter and the ID of the node that generated it, which we call the actor ID in Automerge. The algorithm that we use for text editing is based on RGA:

whenever you perform an insertion, you reference the identifier, that is, the Lamport timestamp, of the character that is immediately before the insertion position. We keep this set of all of the operations stored in document order, that is, in the order in which the characters appear in the document, and what we get is this kind of table, which you can think of almost like a table in a relational database,

where each row is one operation. You can see someone typing h-e-l-l-l-o, so it's "helllo" with three L's, and then somebody came along and deleted one of the three L's, so that the final result is just "hello". The way this is encoded is that each row here is an insertion operation into this document, and each row has an operation ID, which is

a Lamport timestamp: it consists of a counter that just increments, plus the actor ID. I've just written "a" here, but in reality our actor IDs are UUIDs, so they're actually 16 or even 32 bytes long. So we have a unique identifier for each character that was inserted, and we have the reference element, which is the ID of the predecessor character. This is just saying that the e comes after the h:

the h has ID (1, a), and the e has ID (2, a) and comes after (1, a). We can encode all of the operations this way. Furthermore, if a character is marked as deleted, we record that in additional deletion columns, which hold the operation ID of the operation that deleted this particular character; in fact, there could be multiple deletion operations for the same character, if multiple users deleted it concurrently.
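
As a rough reconstruction of the kind of table shown on the slide (actor IDs abbreviated to single letters; the exact values on the slide may differ):

```
opId    | refId  | char | deleted by
(1, a)  | —      | h    |
(2, a)  | (1, a) | e    |
(3, a)  | (2, a) | l    |
(4, a)  | (3, a) | l    |
(5, a)  | (4, a) | l    | (7, b)
(6, a)  | (5, a) | o    |
```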

Now we can take this table and encode it in a very compact form using some fairly simple tricks. We're going to encode each column of this table individually. For example, look at the first column, the counter column, which consists of the numbers 1, 2, 3, 4, 5, 6. How do we encode that efficiently? First of all we delta-encode it, which means that for each number we calculate the difference between that number and its predecessor.

For 1, 2, 3, 4, 5, 6, the difference between any two subsequent numbers is 1, so this delta-encodes to 1, 1, 1, 1, 1, 1, which we can then run-length-encode as "6 times the number 1", and then we can use a variable-length binary encoding for integers to encode "6, 1" in 2 bytes. This variable-length encoding just ensures that small numbers are represented in one byte, slightly bigger numbers in 2 bytes, bigger numbers still in 3 bytes, and so on. So

here we've encoded this entire counter column in 2 bytes. Similarly, we can proceed to the actor ID column. For this column, if we make a lookup table, we don't need to keep repeating the huge IDs: we just map a to 0 and b to 1, for example, and now the column translates into 0, 0, 0, 0, 0, 0, which we can run-length-encode and write with the variable-length binary encoding again, and again we've encoded this entire column in 2 bytes.
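
A toy sketch of these encoding tricks (delta encoding, run-length encoding, and a variable-length integer encoding in the style of LEB128; Automerge's real column encoding has more cases than this):

```typescript
// Delta-encode: store each number as the difference from its predecessor.
function deltaEncode(xs: number[]): number[] {
  return xs.map((x, i) => (i === 0 ? x : x - xs[i - 1]));
}

// Run-length-encode: collapse repeated values into (count, value) pairs.
function runLengthEncode(xs: number[]): Array<[count: number, value: number]> {
  const out: Array<[number, number]> = [];
  for (const x of xs) {
    if (out.length > 0 && out[out.length - 1][1] === x) out[out.length - 1][0]++;
    else out.push([1, x]);
  }
  return out;
}

// Unsigned LEB128-style varint: 7 bits per byte, high bit marks continuation,
// so small numbers take a single byte.
function varint(n: number): number[] {
  const bytes: number[] = [];
  do {
    let b = n & 0x7f;
    n >>>= 7;
    if (n !== 0) b |= 0x80;
    bytes.push(b);
  } while (n !== 0);
  return bytes;
}

const counters = [1, 2, 3, 4, 5, 6];
const rle = runLengthEncode(deltaEncode(counters)); // [[6, 1]]: six copies of 1
const encoded = rle.flatMap(([count, value]) => [...varint(count), ...varint(value)]);
console.log(encoded.length); // 2 — the whole column fits in two bytes
```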

Moving on, let's look at the actual text column. I've actually represented the text in two separate columns: one holds the UTF-8 byte string for the particular character being inserted, and the length column holds the number of bytes for that Unicode character. If what we're doing is just typing English text, then every single character is going to be within

the 1-byte range of UTF-8, so the length column is just going to consist of lots of ones, since all of the characters have length 1, and that again encodes super compactly. And once we've encoded the length column for the characters, we can simply concatenate the actual bytes, because we can later split out the boundaries between the characters by looking at the length column. So we concatenate all of the bytes in the UTF-8 column, "helllo" with a triple L, which

is encoded in six bytes. So, just by encoding each column individually, we can represent each column extremely compactly, in just a couple of bytes. And it turns out that, because of the editing patterns you tend to get with text, you tend to get these increasing sequences, because people tend to type letters in sequential order, which means you get increasing sequences of counters

that increase by one each time. So the whole dataset has a structure which compresses really nicely, and we can take advantage of our knowledge of this structure to encode it very compactly. This gives us the set of operations. In order to have the full change history, we need a little bit of extra metadata: for each set of changes that was made at some point in time, we need the timestamp of when it occurred and the range of operation IDs it covers,

and with this extra metadata we can reconstruct exactly what the document looked like at every past point in time, while using only a very small amount of additional storage space. So if people tell you that with CRDTs you have to put effort into tombstone collection, because tombstones waste a lot of space, just think back to this table: the tombstones actually only cost us 48 percent of the space, which is a fairly small

cost to maintain, whereas the difference between a simple JSON format and an optimized binary format made a factor of 200 difference. I think this means we have the ability, if we want, to store the full change history, allowing us to do diffs and interesting visualizations of the change history of our documents, at very low cost, and we

don't actually have to throw away all of this extra metadata that we have. This brings me to the end. As I said at the beginning, I think CRDTs are very easy to implement badly: easy to implement in a way that is inefficient and that has all sorts of weird behaviors, like the interleaving anomaly that we saw, and there are all these problems of how to do a move operation, both on lists and on trees, in a way that is safe and

reasonably intuitive to our users. But as you've seen from this talk, we have made a lot of progress on these topics in just the last few years, and I think things are going in a direction where CRDTs are starting to become something that can really be the foundation for a very large set of applications: we can start to build actual, real, practical user software, not just research prototypes, on top of these basic principles. I hope

you will join me in finding this very interesting and in exploring where this goes in the future. I will end with some references: first the list of the various text-editing CRDTs that I talked about in the context of interleaving, and then the publications that I've worked on covering the various algorithms, anomalies and problems that we discussed in this talk. All of the slides are online, so you can find

them there. Thank you very much for listening, I hope you enjoyed this talk, and I'll see you again soon.