Overview

When hiring engineering talent it's useful to have a rubric defining what exactly excellent means to your organisation. If you aspire to hire the top 5% of talent, then your technical interview question should be solvable by just 1 in every 20 candidates. It's worth noting that for these outlier candidates, previous interviews prior to the technical should already have produced a strong signal on their expertise. This means therefore, that the technical interview becomes a necessary platform not only for evaluating them, but also for impressing them.

I'm going to stick to experience here and in the past I've evaluated against a combined rubric of ML-engineering and applied research ability for hiring a world-class AI team. Both skills were evaluated on a score of 0-5, where the threshold for progression was a total of 8.

Interview Structure

In my experience, obtaining an understanding of an engineer's proclivity for ML ops is easier than understanding exactly just how strong their ability to apply research to practical problems is. As a result I've structured interviews into two parts: a short filter for production judgement, followed by a deeper probe for research taste and first-principles reasoning.

The ordering matters. Going deep too quickly can vex everyone involved: the candidate gets dragged into a hard research problem before they've had a chance to show practical competence, and the interviewer may spend half the session exploring depth that was never actually the binding constraint. A good filter keeps the process humane. If the basics are there, then the probe is worth the time.

Question 1: The Filter

Consider the below pseudocode excerpt, it should resemble a familiar pattern to most ML engineers who have worked with APIs. The candidate's task is to analyse this code specifically with respect to scaling considerations, making it clear why or why not this code should be productionised.

1"""
2This is jank pseudocode.
3Do not over-index on syntax, or function/variable names
4"""
5from transformers import AutoModelForCausalLM, AutoTokenizer
6
7app = Flask(_name__)
8MODEL_NAME = "huggingface/llama-70b"
9
10# load the LLM
11large_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
12tokeniser = AutoTokenizer.from_pretrained(MODEL_NAME)
13
14# route for performing inference
15@app.route("/generate", methods=[POST])
16def generate(prompt: str) -> str:
17    tokens = tokeniser.tokenise(prompt)
18    output_tokens = large_model.generate(prompt)
19    return tokeniser.decode(output_tokens)
20

There are so many problems with this code and the nature of the problems identified by the candidate will reveal how much they know. Strong candidates will immediately dissect the sheer number of scaling issues or ambiguity associated with this code, whereas weaker candidates might focus on aspects such as dimensionality mismatches. This is not meant to be a trick question; it is a quick way to see whether someone has the production instincts needed before the interview moves into deeper territory.

Score	Notes
0-1	Red Flags "I'm not familiar with the huggingface transformers library, I use keras / tensorflow / sklearn" "I use pytorch" - without being able to explain why (or understand pytorch-lightning, for example) — no mention that this codes runs on CPU.
1-2	Trivial Albeit Sensible Observations "The dimensionality of our inputs isn't super well described here" "This will breakdown with super long prompts" "etc" — has never worked with LLMs, just smaller models under 7 billion params
2-3	Good Observations, in general know what they're talking about "Production servers wouldn't use flask, you'd use something like TGI, vLLM, or Triton" "A large model such as 70b won't fit on a single GPU, we'll need a larger machine or quantisation etc" — observes that this code runs on CPU, a big no-no
4-5	Actual Experts "None of this code is any good - there's no accelerator tuning or batching, or any sensible mechanism for queing requests or managing overloaded GPUs. This system won't work in production". "You should never really perform inference on python models, you compress them into tensor-safe formats or compressed functional traces" "Wrapping this in vLLM instead of flask is only half the battle, the vLLM server can also get overloaded - you need to think about orchestration" — focusses on product-led requirements e.g. "Do you NEED streaming online or will offline batching work, how quickly do you need results?" — mentions cold starts, orchestration frameworks such as skypilot, k8s, or providers such as Baseten - while being able to explain the merits of building in house — mentions acceleration techniques grounded in research for faster inference, and their tradeoffs (e.g. speculative decoding, see below)

Question 2: The Probe

Once the filter has done its job, the second question can be more exploratory. The failure mode here would be to ask a candidate to explain some difficult paper because that's just an information retrieval problem - you bias towards lucky candidates who've read the paper. An alternative approach, therefore, is to explain the general motivation and method of a paper, and ask the candidate to re-implement it using pseudocode.

With that said, it's important we use a paper which conceptually should be intuitive for strong candidates while also being difficult to ask chatGPT for an explanation → we want to reduce the effectiveness of AI assisted tools at cheating the interview.

After some trial and error, Speculative Decoding revealed itself as a highly suitable question. The strongest candidates all verbally communicated that they thoroughly enjoyed the problem (meaning it's conceptually simple) while admitting that it was super hard once one dives into specifics (somewhat AI proof).

Here's what we give to candidates

Interview prompt

Speculative Decoding

Background

Speculative decoding involves using a much smaller model to speed up inference of a large language model. By smaller here, we mean by an order of magnitude, say 90% fewer params.

The smaller "speculator" model guesses a possible sequence of tokens likely to be generated by the larger model.
The larger model is used to score this sequence, discerning whether the sequence is likely, or specifically how much of the sequence we should accept.

Consider the sequence "<start> The cat in the hat is fat <end>". Assuming a word-level tokeniser, this sequence of 7 tokens requires 7 forward passes of a large model.

For example, if we seed the speculator model with "<start> The cat" it might generate the candidate sequence of "in the hat is a nice cat <end>".

Questions

Explain how the large model could be used to score the smaller model's output at each token.
Explain how we could use this mechanism to generate the 7 token sequence without performing 7 forward passes with the larger model.
Explain whether this is actually more efficient.

Ordinary autoregressive decoding has a serial dependency: the target model has to produce one token before it can be asked for the next. Speculative decoding loosens that bottleneck by asking a cheap draft model to guess several future tokens, then asking the target model to verify those drafted positions in one parallel pass.

Verify 1\(P(V \mid \langle start \rangle, the)\)

Verify 2\(P(V \mid \langle start \rangle, the, cat)\)

Verify 3\(P(V \mid \langle start \rangle, the, cat, in)\)

Verify 4\(P(V \mid \langle start \rangle, the, cat, in, the)\)

Verify 5\(P(V \mid \langle start \rangle, the, cat, in, the, hat)\)

Verify 6\(P(V \mid \langle start \rangle, the, cat, in, the, hat, is)\)

Verify 7\(P(V \mid \langle start \rangle, the, cat, in, the, hat, is, a)\)

Verify 8\(P(V \mid \langle start \rangle, the, cat, in, the, hat, is, a, nice)\)

Verify 9\(P(V \mid \langle start \rangle, the, cat, in, the, hat, is, a, nice, cat)\)

The target model gives us a distribution for each drafted position, so we can compare the draft token against what the target model would have predicted. In the real algorithm this is not just a fixed threshold: the accept/reject rule accounts for both the draft distribution and the target distribution so that, when implemented carefully, the sampled output still follows the target model.

This is why the question is useful. A weak answer stops at "small model faster". A strong answer separates FLOPs from latency, notices that the draft model adds work, and reasons about acceptance rate, KV caching, serving batch pressure, and preserving the target model's output distribution.