- Distributed transaction
- Web search spell check
- Web crawler
- Manhattan distance
- Count the 1 bits in a file
- How to debug nondeterministic failures
Given two separate databases A and B with bank accounts on both, design a system that can transfer money from an account on A to another account on B. Assume that the dbs and any servers you add can die and restart at any time. The system may be eventually consistent.
One simple solution is to add parallel transfer log tables on each db that store the source and destination accounts, amount, and whether the local account has been updated. Algorithm:
- In a db transaction on A, insert a new transfer row and update A’s account balance.
- In a db transaction on B, insert a new transfer row including A’s transfer row primary key and update B’s account balance.
- Update A’s transfer row and mark it complete.
To recover, look for any rows in A that aren’t marked complete, and roll forward.
The other simple solution is to journal on the app server’s disk, and use the contents of the journal to roll forward or back after failures. This is less elegant, but still definitely works, as long as you carefully and completely record all the initial state and changes to be applied.
Some people will want to use locking. It’s not necessary or sufficient, but if they get the lock ordering right so that they don’t deadlock, ok.
Some people will propose two-phase commit. This is harder, and not many people really understand and can describe the 2PC algorithm, but if they can get it right, using only the two dbs as durable storage, then ok.
- How do you handle multiple concurrent transfers? (Works already, modulo edge cases like checking available balance.)
- How do you handle transfers across multiple accounts, e.g. withdraw $10 from A and deposit $5 each into B and C? (Expand transfer rows to store multiple accounts, impose ordering on balance updates.)
- How do you scale?
Web search spell check
When you’ve used Google (or any search engine), have you seen the “Did you mean…” suggestions when you misspell a word? Design that system. Specifically, given a search query, either let it pass or suggest just one correction.
The three main tasks are:
- Build a dictionary. Starting with a literal language dictionary (e.g. OECD) is fine, but eventually you want to expand to handle technical terms, names and other pronouns, slang, etc. The difficulty is that these can change rapidly, e.g. if someone with an unusual name suddenly becomes famous, you don’t want to consider their name a typo. The ideal long term answer is to use the search engine’s crawl corpus itself, or maybe other corpuses, and/or search query logs, with weights on frequency, existing language dictionary, etc to weed out misspellings. This also allows for a few useful estimation questions, e.g. how many english words are there? (Lots of ways to count; answer is ~150K-1M.)
Scoring system for possible suggestions. Lots of possibilities to include here, in rough order of value:
- edit distance (bonus points for knowing what that is)
- clicks and non-clicks on previous suggestions
- frequency in corpus (both single words and N-grams)
- frequency in query logs
- keyboard key distance
- phrase/grammar understanding
- Implementation. Easy to come up with one, e.g. search whole dictionary and compare. Hard to find one that’s fast enough. Needs to be <100ms or so. Understanding running time/space complexity and average speeds of memory access, disk seek, network round trip are both important here.
Lots of people will come up with optimizations and heuristics. They’ll rarely be good enough in practice – it’s hard to do any kind of search for this in real time and still be fast enough – but that’s ok. If they cache the top N (say 1M) queries, that’s great. The best complete solution is to precompute permutations of most queries, say top 10-25%, along with their suggestions, store them in a hashtable, and just do simple lookup at query time.
- How do you handle multiple languages? Language inference, dictionary per language, etc.
- How do you scale? Pretty easily parallelizable.
Design a web crawler for a search engine.
Pretty straightforward system design question with well understood constraints, but it covers a wide range of knowledge. Look for them to ask lots of questions to understand the problem space.
Important points to cover:
- storage (many acceptable answers)
- scaling (how to shard workers, manage shared state, prevent duplicating work)
- seeds (ie where to start)
- obeying robots.txt
- throttling to avoid overwhelming individual sites
- recrawl frequency
- different tiers based on update rate
- bonus points if they’re familiar with existing notification protocols like PuSH
- executing JS, handling query parameters, avoiding POSTs
- avoiding link generators, loops, etc
- handling partnerships with sites that require auth
Manhattan distance is the shortest distance between two points on a grid. There are multiple paths between any two points with the same Manhattan distance. Implement a function that finds the number of paths. You can write code, or just come up with an algorithm, or any other approach.
There are lots of ways to do it. The most elegant is recursive:
num(N, M) = num(N - 1, M) + num(N, M - 1) num(X, 0) = num(0, X) = 1
…however, there’s a closed form solution: N + M choose N (or choose M), ie N!M! / (N!(M-1!) + (N-1)!M!). Bonus points for finding that.
Count the 1 bits in a file
Write a program that reads in a file and counts the bits that are 1, ie not 0.
This one sounds silly and simple, but it has some fun subtleties and can expand into a surprisingly deep conversation, even (especially!) with experienced candidates.
At its most basic level, this question tests two simple concepts: file I/O and bitwise operations. Be forgiving if they don’t get the bitwise operation syntax exactly right or can’t remember it; it’s not something any of us do very often. However, they should definitely understand how to read data from a file and how to examine and count the 1 bits.
After that, ask them the complexity. Runtime should be O(n), space O(1) if they streamed, O(n) if they read the whole file into memory.
Now it gets interesting. Ask if they can do better. They’ll usually decide no, they can’t. Nudge them that constant factors count. They may talk about reading larger chunks of the file, etc…but the key speedup is to use a lookup table instead of shifting and examining individual bits. Specifically, they can build a table with all 256 possible bytes and their corresponding 1-bit counts, and then look up each of the file’s bytes in the table.
After that, ask them why only 256. Would it be faster to generate all 16-bit values? 24-bit? 32? Look for them to understand the space vs time tradeoff (and the initial time cost to generate the table).
If they get this far, think bigger. Ask them where the bottleneck is. You want to hear them talk about hardware, hard drive bandwidth, disk buses, etc.
Another direction you can go is parallelizing. For especially big files, how would they use multiple machines? Even better, what if the file is too big to fit on one machine? Straightforward, but still a good way to check for systems understanding.
How to debug nondeterministic failures
Have you ever had to debug a nondeterministic or transient problem before? How did/would you do it?
This is an open ended discussion. There are a few big topics to cover. First, logging. Seems silly, but it’s a good marker for experience. What kinds of things to log, at what level, how to configure the level for normal prod operation vs local vs temporarily raised on individual hosts, etc.
Second, isolation. There are two good questions that will get them talking:
- What are all the possible sources of nondeterminism? Random number generators, obviously, but also user input, threading, configuration, data (many kinds), network, other local processes, unordered data structures (e.g. hash tables), different hardware, specific hardware parts etc.
- How can you isolate the part that’s behaving nondeterministically? Logging is key here, but also minimal canned datasets, record/reply, mocking external services and data sources, reproducible config, reusing the same RNG seed, etc.
Third, reproducing it. Use all the isolation techniques, then look for an automated harness that repeatedly runs the system and watches for the expected behavior. It’s important that the harness record as much state as possible. Bonus points if it automatically drops them into a debugger!