A technical overview of two-snapshot churn analysis, the distinction between SLOC and LLOC, and the principal sources of misclassification in automated differencing.
Source code churn — the volume of code changed between two snapshots of a project — is conventionally expressed in lines added, deleted, changed, and unchanged. The line, however, is not a uniform unit. Two distinct definitions are in common use: Source Lines of Code (SLOC), which counts newline-delimited physical lines after comment and blank-line stripping, and Logical Lines of Code (LLOC), which counts statement-delimited logical units. The two diverge whenever physical and logical structure do not align — a single physical line may contain several statements, and a single statement may span several physical lines. This paper describes how automated differencing can compute both metrics, the role of the underlying longest-common-subsequence algorithm, the additional alignment passes required when token-level repetition causes ambiguous matches, and the principal classes of churn that resist clean classification.
This paper concerns the measurement of differences between two textual snapshots of a source project, where each snapshot is a directory tree of files written in one or more programming languages. The objective is to quantify, per file and in aggregate, the volume of change in the codebase as expressed in the categories added, deleted, changed, and unchanged.
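Two units of line counting are in routine use: Source Lines of Code (SLOC), the newline-delimited physical lines that remain after comments and blank lines are stripped, and Logical Lines of Code (LLOC), the statement-delimited logical units.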
A third quantity, Physical Lines of Code (raw line count including blanks and comments), is sometimes reported but is rarely useful for churn analysis because it inflates with trivial whitespace edits and reformatting.
The terms SLOC change, LLOC change, SLOC added, etc., always refer to comparison between two snapshots of the same logical project — typically successive commits, builds, releases, or development branches.
Differencing two text files is, in its standard formulation, a longest-common-subsequence problem. Given two sequences A (the old file) and B (the new file), find the longest subsequence common to both; the lines of A not in that subsequence are deletions, the lines of B not in it are additions, and everything else is unchanged. Lines that fall in deletion and addition gaps at the same position are typically reclassified as changes if their similarity exceeds a threshold; otherwise they remain as separate deletes and adds.
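The formulation above can be sketched with a textbook dynamic-programming LCS table. This is a minimal illustration, not the Myers algorithm: it uses quadratic time and space, and the function name `lcs_diff` and the labels `same`, `del`, and `add` are assumptions made for this example.

```python
from typing import List, Tuple


def lcs_diff(old: List[str], new: List[str]) -> List[Tuple[str, str]]:
    """Classify lines as 'same', 'del' or 'add' using a textbook LCS table."""
    n, m = len(old), len(new)
    # table[i][j] holds the LCS length of old[i:] and new[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to recover one edit script.
    ops: List[Tuple[str, str]] = []
    i = j = 0
    while i < n and j < m:
        if old[i] == new[j]:
            ops.append(("same", old[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("del", old[i]))
            i += 1
        else:
            ops.append(("add", new[j]))
            j += 1
    # Whatever remains on either side is a pure deletion or addition.
    ops.extend(("del", line) for line in old[i:])
    ops.extend(("add", line) for line in new[j:])
    return ops
```

When several longest common subsequences exist, the tie-breaking rule in the walk (prefer a deletion) decides which edit script is reported; a different rule gives an equally short but different script.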
The algorithm of choice in practical implementations is Myers' O(ND) edit-script algorithm [1], which computes the shortest edit script in time proportional to the product of N, the sum of the file lengths, and D, the number of differences. It followed earlier work by Hunt and McIlroy [2], whose algorithm underlies the original Unix diff, and the longest-common-subsequence algorithm of Hunt and Szymanski [5].
For churn analysis the diff is run twice over each file pair, once at the SLOC level and once at the LLOC level. The two diff passes are independent: the same source change may produce different add/delete/change patterns in the two metrics because the underlying token streams differ.
Before the diff algorithm runs, each input file is reduced to a sequence of tokens:
For SLOC, the file is read line by line. Each line is stripped of trailing whitespace. Blank lines and lines consisting only of a comment are discarded. The remaining lines, in order, form the SLOC token stream for that file. Comments embedded at the end of a code line are stripped, but the code preceding the comment is retained as a SLOC.
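A minimal sketch of the SLOC reduction follows. It assumes a language whose only comment form is a `//` line comment, and it leaves block comments unhandled. The function names are hypothetical.

```python
from typing import List


def strip_line_comment(line: str, marker: str = "//") -> str:
    """Cut the line at the first comment marker that is outside a quoted literal."""
    quote = None
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\":          # skip the escaped character inside a literal
                i += 2
                continue
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif line.startswith(marker, i):
            return line[:i]
        i += 1
    return line


def sloc_tokens(text: str, marker: str = "//") -> List[str]:
    """Return the non-blank, non-comment physical lines, in order."""
    tokens = []
    for raw in text.splitlines():
        code = strip_line_comment(raw, marker).rstrip()
        if code.strip():            # drop blank and comment-only lines
            tokens.append(code)
    return tokens
```

Leading whitespace is kept, since only trailing whitespace is stripped in the description above; this is why indentation changes register as SLOC churn.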
For LLOC, the file is read character by character through a lexer that understands the language's statement-delimiter and continuation rules. Single-line comments, block comments, string literals, character literals, and (where applicable) template literals and regular-expression literals are recognised and excluded from the token stream so their internal contents do not produce spurious delimiters. The lexer emits one token per logical statement boundary.
For C-family languages the principal LLOC delimiter is the semicolon (with the exception of the semicolons inside for headers, which form one logical statement together with the loop body opener). For Python, the delimiter is a newline at the same indentation level as the enclosing block, with continuation lines (those ending in a backslash, or where an open bracket or parenthesis has not yet been closed) merged into the same logical line. For JavaScript and TypeScript, the delimiter is either a semicolon or an implicitly-inserted statement boundary at a newline where Automatic Semicolon Insertion applies.
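The C-family rule can be sketched as a small character-level scanner. This is a simplified illustration under stated assumptions: the only delimiter is a semicolon outside parentheses, braces reset the current statement, and control-flow headers are not counted as statements of their own. The name `c_lloc_tokens` is hypothetical. Literals are kept in the token text here, but delimiters inside them are ignored.

```python
from typing import List


def c_lloc_tokens(src: str) -> List[str]:
    """Split C-like source into semicolon-terminated statements."""
    tokens: List[str] = []
    buf: List[str] = []
    depth = 0                      # parenthesis nesting; ';' inside (...) is not a boundary
    i, n = 0, len(src)
    while i < n:
        ch = src[i]
        nxt = src[i + 1] if i + 1 < n else ""
        if ch == "/" and nxt == "/":          # line comment: skip to end of line
            while i < n and src[i] != "\n":
                i += 1
            continue
        if ch == "/" and nxt == "*":          # block comment: skip past the closing */
            end = src.find("*/", i + 2)
            i = n if end == -1 else end + 2
            continue
        if ch in "\"'":                       # string or character literal
            j = i + 1
            while j < n and src[j] != ch:
                j += 2 if src[j] == "\\" else 1
            buf.append(src[i:j + 1])
            i = j + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(0, depth - 1)
        if ch == ";" and depth == 0:
            stmt = " ".join("".join(buf).split())   # normalise whitespace
            if stmt:
                tokens.append(stmt)
            buf = []
        elif ch in "{}" and depth == 0:
            buf = []                          # braces delimit blocks, not statements
        else:
            buf.append(ch)
        i += 1
    return tokens
```

Under these assumptions a one-line `for` loop with a single body statement yields one token, because the header is merged with the statement that follows it. A convention that counts the header separately would need an extra rule.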
The relationship between SLOC and LLOC is not a fixed ratio. Three patterns are common.
A single physical line containing multiple statements (separated by semicolons in C-family languages) counts as one SLOC but several LLOC.
```c
int a = 1; int b = 2; int c = a + b; printf("%d\n", c);
```
One physical line. SLOC = 1, LLOC = 4 (four semicolon-terminated statements).
```python
x = 1; y = 2; z = x + y
```
One physical line. Python permits explicit semicolons between statements on a single line, although the style is rare. SLOC = 1, LLOC = 3.
A single statement may be broken across several physical lines for readability — a method call with many arguments, a long expression, a multi-line string concatenation, or any expression whose tokens require continuation.
```java
String result = formatMessage(
    template,
    customerName,
    orderTotal,
    deliveryDate,
    promotionCode
);
```
Seven physical lines (assuming no trailing blank). SLOC = 7 (each non-blank, non-comment line is one SLOC). LLOC = 1 (the statement is terminated by the final semicolon).
```python
total = (price_excluding_tax
         + sales_tax
         + delivery_charge
         - any_discount_applied)
```
Four SLOC. LLOC = 1: the parenthesis opened on the first line is not closed until the last, so Python treats the entire expression as one logical line even though the newlines are physically present.
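For Python, one way to approximate the LLOC counts in these examples is to let the standard-library parser find the statements. The sketch below assumes that LLOC is defined as the number of statement nodes in the syntax tree, so a compound statement such as `if` or `def` counts once in addition to the statements in its body. The function name is hypothetical.

```python
import ast


def python_lloc(source: str) -> int:
    """Count statement nodes; physical line breaks and semicolons do not matter."""
    tree = ast.parse(source)
    return sum(isinstance(node, ast.stmt) for node in ast.walk(tree))


multi_statement = "x = 1; y = 2; z = x + y"
multi_line = (
    "total = (price_excluding_tax\n"
    "         + sales_tax\n"
    "         + delivery_charge\n"
    "         - any_discount_applied)\n"
)
print(python_lloc(multi_statement), python_lloc(multi_line))
```

Parsing does not execute the code, so the undefined names in the snippets are harmless.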
```javascript
const message = `Hello, ${name},
your order ${orderId}
will arrive ${formatDate(deliveryDate)}.`;
```
Three SLOC. LLOC = 1: the template literal contains literal newlines but is a single expression terminated by one semicolon. A naive newline counter would yield LLOC = 3, which is incorrect.
The ratio of LLOC to NCLOC (non-comment, non-blank physical lines) in any given codebase is determined by language convention and coding style. Published measurements using the phploc tool on widely-used PHP frameworks report ratios in the range of 28–34% (LLOC as a percentage of NCLOC) [6]: symfony-standard 31%, zend-framework2 29%, laravel-framework 28%, and cakephp 34%. These figures reflect a single language and a single tooling implementation; they should not be extrapolated to other languages or other definitions of LLOC without independent measurement.
For languages where one statement per physical line is the dominant convention (notably Python, where explicit semicolon separators are syntactically permitted but stylistically rare), LLOC and NCLOC will tend to track more closely. For C-family code that places opening braces on separate lines or breaks long argument lists across several physical lines, LLOC is typically lower than NCLOC because the closing brackets and continuation lines contribute SLOC without contributing LLOC. The Wikipedia entry on Source Lines of Code illustrates the opposite direction with a single-physical-line C statement (for (int i = 0; i < 100; i++) printf("hello");) that counts as 1 SLOC and 2 LLOC [7]: the for statement and the printf statement are each one logical line. This demonstrates that LLOC can exceed SLOC for any physical line containing more than one logical statement.
A substantial deviation from the ratios documented for similar codebases is a useful signal that the LLOC tokeniser may be mishandling something — in particular, multi-line string or template literals whose internal newlines are being miscounted as statement boundaries.
For each snapshot, the per-file SLOC and LLOC counts are summed to produce project totals.
Aggregation is straightforward; the interesting questions arise in the diff comparison between two snapshots.
For a pair of snapshots, every file in either snapshot is placed into one of four file-level states, and the lines within each file are placed into four line-level categories.
| State | Meaning |
|---|---|
| UNCHANGED | The file appears in both snapshots with identical content (after newline normalisation). |
| ADDED | The file appears in the new snapshot only. |
| DELETED | The file appears in the old snapshot only. |
| CHANGED | The file appears in both snapshots with differing content. The diff algorithm runs on this file. |
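The file-level states in the table can be sketched as follows. The example assumes each snapshot is already loaded as a mapping from relative path to text content; the function names are hypothetical.

```python
from typing import Dict


def normalise_newlines(text: str) -> str:
    """Map CRLF and bare CR line endings to LF before comparing."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def classify_files(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, str]:
    """Assign every path in either snapshot to one of the four file-level states."""
    states = {}
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            states[path] = "ADDED"
        elif path not in new:
            states[path] = "DELETED"
        elif normalise_newlines(old[path]) == normalise_newlines(new[path]):
            states[path] = "UNCHANGED"
        else:
            states[path] = "CHANGED"   # only these files go on to the line-level diff
    return states
```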
| Category | Definition |
|---|---|
| UNCHANGED | Line appears in both old and new at the position chosen by the longest-common-subsequence. |
| ADDED | Line in the new snapshot has no corresponding line in the old snapshot. |
| DELETED | Line in the old snapshot has no corresponding line in the new snapshot. |
| CHANGED | A pair of lines (one deleted, one added at the same position in the edit script) whose similarity exceeds a configured threshold, typically a Jaccard token similarity threshold set between 0.3 and 0.7 depending on context. Such pairs are reported as one change instead of one delete plus one add. |
The aggregate counts CHG, DEL, ADD, and CRN (churn = CHG + DEL + ADD) are produced for each metric (SLOC and LLOC) separately. UNCHANGED is computed from the totals: unchanged-old = oldTotal - CHG - DEL, unchanged-new = newTotal - CHG - ADD.
Consider the following modification to a single C function.
Before:
```c
int calculate(int x, int y) {
    int result = x * y;
    return result;
    int unused = 99;
}
```
After:
```c
int calculate(int x, int y) {
    int result = x + y;
    int doubled = result * 2;
    return doubled;
}
```
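After comment-stripping (none here) and blank-line removal, both files have five physical lines. The diff algorithm leaves the function header and the closing brace UNCHANGED, pairs the result assignment and the return statement with their new versions as CHG, reports the unused declaration as DEL, and reports the doubled declaration as ADD.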
SLOC counts: CHG = 2, ADD = 1, DEL = 1, UNCHANGED = 2 (the function header and closing brace).
The LLOC tokeniser emits one token per semicolon-terminated statement. The function brace itself contributes no LLOC. The token sequences are:
Old: int result = x * y, return result, int unused = 99 (three LLOC).
New: int result = x + y, int doubled = result * 2, return doubled (three LLOC).
LLOC counts: CHG = 2, ADD = 1, DEL = 1, UNCHANGED = 0. (The two changed pairings, the result assignment and the return, pass the similarity threshold; the unused and doubled lines do not pair with anything and are pure DEL and ADD respectively.)
In this case the SLOC and LLOC counts happen to agree numerically; in larger or more structurally diverse changes they typically do not.
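The counts in the worked example above can be reproduced with a short sketch. Several choices here are assumptions for illustration: `difflib.SequenceMatcher` stands in for the alignment step (it is not the Myers algorithm and does not guarantee a longest common subsequence), tokens for the Jaccard measure are word characters only, pairing inside a gap is greedy by best score, and the threshold is 0.3. All function names are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List, Set


def token_set(line: str) -> Set[str]:
    """Identifiers, keywords and numbers; punctuation is ignored."""
    return set(re.findall(r"\w+", line))


def jaccard(a: str, b: str) -> float:
    ta, tb = token_set(a), token_set(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def pair_gap(deleted: List[str], added: List[str], threshold: float) -> int:
    """Greedily pair the most similar delete/add lines; return the number of CHG pairs."""
    scored = sorted(
        ((jaccard(d, a), i, j) for i, d in enumerate(deleted) for j, a in enumerate(added)),
        reverse=True,
    )
    used_old, used_new, pairs = set(), set(), 0
    for score, i, j in scored:
        if score <= threshold:      # a pair must exceed the threshold
            break
        if i not in used_old and j not in used_new:
            used_old.add(i)
            used_new.add(j)
            pairs += 1
    return pairs


def churn(old: List[str], new: List[str], threshold: float = 0.3) -> Dict[str, int]:
    """Count CHG, DEL, ADD and unchanged-old for one token stream."""
    counts = {"CHG": 0, "DEL": 0, "ADD": 0}
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        chg = pair_gap(old[i1:i2], new[j1:j2], threshold)
        counts["CHG"] += chg
        counts["DEL"] += (i2 - i1) - chg
        counts["ADD"] += (j2 - j1) - chg
    counts["UNCHANGED"] = len(old) - counts["CHG"] - counts["DEL"]
    return counts


old_sloc = ["int calculate(int x, int y) {", "    int result = x * y;",
            "    return result;", "    int unused = 99;", "}"]
new_sloc = ["int calculate(int x, int y) {", "    int result = x + y;",
            "    int doubled = result * 2;", "    return doubled;", "}"]
print(churn(old_sloc, new_sloc))
```

The return pair scores exactly 1/3 under this tokenisation, so raising the threshold to 0.35 would turn it into a separate DEL and ADD. The example is therefore sensitive to the threshold in the way discussed later in the paper.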
The classifications produced by a single pass of Myers' algorithm are not always semantically correct. The algorithm finds a shortest edit script, which is not necessarily the one a human reviewer would choose. Several common patterns produce confusing outputs.
If two adjacent lines are swapped, Myers' algorithm typically reports this as one delete and one add — not as a "move." From the algorithm's perspective the old line at position n no longer exists, and a new line has appeared at position n+1. There is no built-in concept of relocation in the standard formulation.
Code frequently contains identical tokens: repeated closing braces, repeated blank-line-equivalent patterns, repeated boilerplate calls. When the tokeniser emits the same token at many positions, Myers' algorithm has many equally short edit scripts to choose from. The one it picks may align unrelated braces with each other and report an entire function body as deleted plus another function body as added, when in fact only one statement was altered.
When a variable is renamed throughout a function, every occurrence of the old name disappears and every occurrence of the new name appears. Under a strict line-equality criterion, Myers reports a large number of deletes and adds. The similarity-threshold logic discussed in Section 06 mitigates this by pairing deletes with adds whose token sets overlap sufficiently — at which point most of these become CHGs rather than separate DELs and ADDs.
A common refinement runs the diff twice. The first pass aligns on raw line equality and produces an initial classification. The second pass re-runs the alignment using a richer token, often constructed by concatenating each line's contents with the enclosing scope name (e.g., function or class qualifier). The scope-qualified tokens distinguish the closing brace of foo() from the closing brace of bar(), allowing the second pass to discover better-anchored matches that the first pass could not see because its tokens were ambiguous.
The second pass typically improves the CHG count at the expense of ADD and DEL — that is, what the first pass reported as a deleted line in one place and an added line elsewhere is sometimes recognised, on the second pass, as a single moved or modified line. Section 09 quantifies the typical effect.
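The richer token used by the second pass can be sketched as below. This is a rough illustration for C-like code: it assumes function headers sit on one line ending in an opening brace, it tracks scope by counting braces without regard to strings or comments, and the `scope::text` token format is an invented convention.

```python
import re
from typing import List

# A top-level line such as "int foo(int x) {" is taken to open a function named foo.
FUNC_HEADER = re.compile(r"(\w+)\s*\([^;]*\)\s*\{\s*$")


def scope_qualified(lines: List[str]) -> List[str]:
    """Prefix each line with the name of its enclosing top-level function."""
    out = []
    scope, depth = "<file>", 0
    for line in lines:
        text = line.strip()
        match = FUNC_HEADER.search(text)
        if depth == 0 and match:
            scope = match.group(1)
        out.append(f"{scope}::{text}")
        depth += text.count("{") - text.count("}")
        if depth <= 0:              # left the function: fall back to file scope
            depth, scope = 0, "<file>"
    return out


lines = ["int foo(int x) {", "return x;", "}", "int bar(int x) {", "return x;", "}"]
print(scope_qualified(lines))
```

After qualification the two closing braces and the two identical return statements become distinct tokens, so an alignment pass can no longer match a line of foo against a line of bar.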
Several patterns sit on the boundary between CHG and ADD+DEL, and reasonable implementations will classify them differently.
The Jaccard threshold for promoting a delete+add pair to a CHG is itself a configuration choice. A pair with token similarity close to 1.0 is almost certainly a CHG; a pair at 0.05 is almost certainly an unrelated DEL and ADD. Between these extremes the outcome depends on the chosen threshold: a pair at 0.31 is a CHG under a threshold of 0.3 but a DEL+ADD under a threshold of 0.5. For lines that fall close to the threshold, the same physical edit will be classified as CHG under one threshold and as DEL+ADD under a slightly different one.
When a block of new code is inserted in the middle of an existing function, Myers' algorithm reports the inserted lines as additions but may also report some preceding or following lines as "changed" if their contents shift relative to surrounding unchanged anchors. This is correct behaviour from the algorithm's standpoint but can be counterintuitive: a clean insertion produces a small number of CHGs the reviewer did not consciously make.
Reformatting (indentation changes, tab-to-space conversions, line-wrap adjustments) can produce arbitrarily large diffs at the SLOC level while producing zero changes at the LLOC level. This is one of the principal practical reasons for reporting both metrics: LLOC is approximately invariant under reformatting, SLOC is not.
When the only difference between two versions of a line is its trailing or leading comment, comment-stripping at the tokeniser stage removes the comment before the diff runs. The line is then classified as UNCHANGED at both SLOC and LLOC levels. Separate comment-churn metrics (CHG_COM, DEL_COM, ADD_COM) can be computed by running a third diff pass over comment lines only.
Standard two-snapshot diffing treats files by path. A file moved from src/foo.c to lib/foo.c appears as a DELETED file at the old path and an ADDED file at the new path, with no churn information about its internal stability. Detecting moves requires a separate similarity comparison between deleted-file and added-file content sets, typically using hash digests or content fingerprints. Most line-counting tools do not perform this step.
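A minimal sketch of hash-based move detection follows. It only recognises files whose content is byte-for-byte identical at the old and new paths; a file that was moved and edited would need a fuzzier fingerprint. The function name and the input shape (path to raw bytes) are assumptions.

```python
import hashlib
from typing import Dict, List, Tuple


def detect_moves(deleted: Dict[str, bytes], added: Dict[str, bytes]) -> List[Tuple[str, str]]:
    """Pair DELETED and ADDED files that share a SHA-256 digest."""
    by_digest: Dict[str, List[str]] = {}
    for path, content in sorted(deleted.items()):
        by_digest.setdefault(hashlib.sha256(content).hexdigest(), []).append(path)

    moves = []
    for path, content in sorted(added.items()):
        candidates = by_digest.get(hashlib.sha256(content).hexdigest())
        if candidates:
            # Each deleted file is matched to at most one added file.
            moves.append((candidates.pop(0), path))
    return moves
```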
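For any non-trivial churn measurement to be trustworthy, the underlying classification of every file pair should be inspectable. Two outputs are conventionally produced: an aggregate summary of the counts, and a per-line diff view showing the classification assigned to each line.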
The per-line view supports manual verification: a reviewer can confirm that a particular line is correctly classified as CHG rather than DEL+ADD, or vice versa. Without this view the aggregate numbers are difficult to audit. The examples in Section 04 and Section 07 are small enough to verify by inspection; production codebases producing thousands of changed lines require the diff view to spot-check classifications.
SLOC counting is essentially language-agnostic once comment syntax is known. LLOC counting requires per-language tokenisers; supporting a new language requires implementing or borrowing a lexer that knows the language's comment, string, and statement-delimiter rules. For ill-defined "languages" (configuration files, mixed-content templates), LLOC may not be meaningful and is conventionally reported as equal to SLOC.
Files identified as binary by a content sniff (control-byte density, magic-number heuristics) are excluded from line counting. Generated files (typically identifiable by directory naming convention or build-system patterns) should also be excluded, as their churn does not reflect human authorship effort. Both exclusions are policy choices and should be documented for any given measurement.
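A content sniff of the kind mentioned above might look like the following sketch. The NUL-byte rule and the 0.3 control-byte ratio are illustrative assumptions, not values taken from any particular tool.

```python
def looks_binary(sample: bytes, max_control_ratio: float = 0.3) -> bool:
    """Guess whether a leading sample of a file is binary."""
    if not sample:
        return False
    if b"\x00" in sample:           # NUL bytes are rare in source text
        return True
    text_controls = {9, 10, 12, 13}  # tab, line feed, form feed, carriage return
    control = sum(1 for byte in sample if byte < 32 and byte not in text_controls)
    return control / len(sample) > max_control_ratio
```

In practice the sample would be the first few kilobytes of the file, and the policy (including the threshold) should be recorded with the measurement.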
The Myers algorithm runs in O(ND) where N is the sum of file lengths and D is the number of edit operations. For typical source-file pairs (a few thousand lines, a few hundred edits) this is fast; for pathological cases (files that have been entirely rewritten, or files of tens of thousands of lines) the constant factors matter and divide-and-conquer or linear-space variants [1] are preferred. Project-wide churn measurement processes file pairs independently and parallelises naturally across CPU cores.
The numerical results of a churn measurement depend on several configuration choices: the file extension list (which determines what is counted as source), the comment-stripping rules (line-only versus end-of-line versus block), the Jaccard threshold for CHG classification, and the two-pass refinement options. Measurements taken with different configurations are not directly comparable. Any reported figure should specify the configuration used.
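One way to keep reported figures comparable is to store the configuration next to the results. The sketch below uses a frozen dataclass serialised to JSON; every field name and default value is hypothetical and only mirrors the choices listed above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ChurnConfig:
    extensions: Tuple[str, ...] = (".c", ".h", ".py")   # what counts as source
    comment_rules: str = "line+end-of-line+block"        # which comment forms are stripped
    jaccard_threshold: float = 0.5                       # CHG promotion threshold
    two_pass: bool = True                                # scope-qualified second pass


def config_record(config: ChurnConfig) -> str:
    """Serialise the configuration so it can be published with the counts."""
    return json.dumps(asdict(config), sort_keys=True)


print(config_record(ChurnConfig()))
```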
Source code churn is measured by aligning two snapshots of a project with a longest-common-subsequence algorithm, separately at the physical-line (SLOC) and logical-line (LLOC) levels. The two metrics diverge whenever statements and physical lines do not coincide: multi-statement lines yield more LLOC than SLOC, multi-line statements yield more SLOC than LLOC. Classification of changed lines into CHG, DEL, ADD, and UNCHANGED is sensitive to the similarity threshold used to recognise modifications and to the alignment behaviour of the underlying algorithm. Pathological alignments, particularly those caused by repeated identical tokens, are partially mitigated by a second alignment pass using scope-qualified tokens. Several boundary cases — renamed identifiers, reformatting, moved files, comment-only edits — require either auxiliary passes or explicit policy choices. Production use depends on inspectable per-line output for verification.
Direct links to the cited sources, in reference order. Where a paper sits behind a publisher paywall, the link resolves to the canonical record (DOI, proceedings page, or institutional repository) from which open versions can usually be located.