A technical overview of the principal approaches to distinguishing machine-generated from human-written source code, and to identifying code that instantiates autonomous AI agents at runtime.
Two related but distinct detection problems arise in the analysis of contemporary source projects. The first is provenance detection: determining whether a given fragment of source code was written by a human or generated by a large language model. The second is agent detection: determining whether code, regardless of who or what wrote it, constructs or invokes an autonomous AI agent during execution. This paper surveys the principal classes of technique applicable to each problem. For provenance detection it covers supervised stylometric classification, zero-shot probability-curvature methods, perplexity scoring, and fine-tuned neural classifiers; it then sets out the measurable signals on which all of these rest — stylistic regularity, comment characteristics, token-level predictability, structural features, and idiom — together with documented accuracy ranges and failure modes. For agent detection it covers static identification of agent-framework dependencies, dynamic-execution constructs, and data-flow tracing. The paper is descriptive: it characterises the space of methods reported in the literature and does not prescribe a particular implementation.
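The proliferation of large language models capable of producing source code has introduced two questions that did not previously require automated tooling. Given a body of source code: was it written by a human or generated by a model, and does it construct or invoke an autonomous AI agent when it runs?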
These two questions are independent. Human-written code may construct an AI agent; AI-generated code may be entirely static and contain no agent logic. A complete analysis treats them separately, using different methods, and reports them as different findings.
This paper restricts itself to the detection methods themselves. It does not address what should be done with a detection result, nor does it make any claim about the prevalence, desirability, or risk of either AI-generated code or embedded agents. Those are questions of policy, not of technique.
Provenance detection is an instance of authorship attribution, a field that substantially predates large language models. Classical authorship attribution treats the problem as supervised classification: a body of code is converted into a feature vector, and a model trained on labelled examples predicts the author [1]. The same framing applies when the candidate "authors" are reduced to two classes, human and machine.
The foundational work in code authorship attribution established that programmers leave consistent, measurable stylistic traces in their code — a "coding fingerprint" derived from lexical, layout, and structural features [1]. The Code Stylometry Feature Set introduced in that work demonstrated that such features could attribute authorship among large candidate pools with high accuracy. Subsequent surveys have organised the feature space into broad categories [2]:
The application of this framing to the human-versus-machine question is recent but now well-represented in the literature, with several distinct methodological families.
The most direct approach trains a conventional machine-learning classifier (logistic regression, support vector machine, random forest, or gradient-boosted trees are all reported) on engineered stylometric features extracted from a labelled corpus of human and machine code. A recent multilingual study reported a single classifier achieving an average F1 score of 84.1% across ten programming languages, trained on an open dataset of over 120,000 labelled snippets [3]. This family has the advantage of interpretability — the contributing features can be inspected — and the disadvantage of depending on a feature-engineering process whose effectiveness is dataset-dependent [2]. The choice of classifier also matters: a study constructing a labelled corpus of human-written and AI-generated Python code reported that, among the baseline detectors tested, a Bayesian classifier outperformed the alternatives [12].
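The supervised framing can be illustrated with a minimal sketch: a logistic-regression classifier written in plain Python and trained on two engineered features. The feature names (comment density and line-length spread), the toy values, and the hyperparameters are assumptions made for illustration; they are not taken from any cited study, and a real detector would use a much larger feature set and an established library.

```python
import math


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(features, labels, epochs=500, lr=0.5):
    """Fit weights and bias by stochastic gradient descent on log-loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = _sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = p - y  # gradient of log-loss with respect to the logit
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict_proba(weights, bias, x) -> float:
    """Estimated probability that feature vector x belongs to class 1."""
    return _sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Hypothetical features per snippet: [comment_density, line_length_spread].
# Label 1 = machine-generated, 0 = human-written (toy values only).
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
y = [1, 1, 0, 0]
weights, bias = train_logistic(X, y)
```

Because the model is linear, the learned weights can be read directly, which is the interpretability advantage noted above.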
A notable empirical finding from this family is that not all feature categories contribute equally. At least one study reported that abstract-syntax-tree and control-flow-graph structural metrics shifted classifier decision boundaries only marginally, while the presence or absence of comments was a comparatively strong signal for attribution [4]. This suggests that surface features can dominate deep structural features in practice, though the result is specific to the datasets and models studied.
A second family avoids training a dedicated classifier. Building on the DetectGPT method developed for natural-language text [5], these approaches exploit a statistical property of model-generated text: such text tends to occupy regions of negative curvature in the generating model's log-probability surface. The method perturbs the candidate text (for code, by masking and refilling lines using a mask-filling model), measures the change in log-probability, and uses the curvature to discriminate generated from human content. The reported advantage is that it requires no labelled training corpus; the reported limitation is sensitivity to the choice of perturbation model and to adversarial editing of the candidate [5].
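The perturbation step can be sketched as follows. The function compares the candidate's log-probability with the mean log-probability of its perturbed variants and normalises by their spread. Both callables are assumptions: `log_prob` stands in for a scoring model and `perturb` for a mask-filling model, neither of which is specified here, and the default of 20 perturbations is arbitrary.

```python
import statistics
from typing import Callable


def perturbation_discrepancy(
    candidate: str,
    log_prob: Callable[[str], float],
    perturb: Callable[[str, int], str],
    n_perturbations: int = 20,
) -> float:
    """Normalised gap between the candidate's log-probability and the
    mean log-probability of its perturbed variants.

    A large positive value means perturbation consistently lowers the
    log-probability, which the curvature method treats as evidence that
    the candidate was sampled from the scoring model.
    """
    original = log_prob(candidate)
    # The integer passed to perturb() is a seed so variants can differ.
    perturbed = [log_prob(perturb(candidate, seed)) for seed in range(n_perturbations)]
    gap = original - statistics.fmean(perturbed)
    spread = statistics.pstdev(perturbed)
    # Fall back to the raw gap when all perturbed scores are identical.
    return gap / spread if spread > 0 else gap
```

In use, the returned score would be compared against a threshold chosen on held-out data; that threshold is application-specific.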
A third family uses the perplexity of the candidate code under one or more reference language models as the discriminating signal. Code generated by a model tends to exhibit lower perplexity under that model (or a related one) than human-written code, because it was sampled from a similar distribution. Perplexity-based detection of AI-generated code assignments has been studied in the educational-integrity context [6]. This family sits between the previous two: it uses model probabilities like the curvature methods but reduces them to a simpler scalar statistic.
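As a concrete reference point, perplexity is the exponential of the negative mean per-token log-probability. The sketch below assumes the per-token natural-log probabilities have already been obtained from some reference model; how they are obtained is outside its scope.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity of a token sequence from its per-token natural-log
    probabilities under a reference model."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_prob)
```

A sequence in which every token has probability 0.5 has perplexity 2; lower values indicate code the reference model finds more predictable.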
A fourth family fine-tunes a pre-trained code language model (of the BERT or CodeT5 lineage) directly as a binary classifier, bypassing manual feature engineering. Studies in this family report accuracy exceeding the classical stylometric baselines — one reported a modified CodeT5 classifier reaching above 97% on its dataset [4]. The trade-off is reduced interpretability and increased computational cost relative to the engineered-feature approach.
Every method in Section 03, whatever its statistical machinery, ultimately rests on measurable differences between the two populations of code. It is worth setting out what those differences are, because they explain why detection is possible at all and why it is imperfect. The literature reports several signal categories that carry discriminative weight. None is decisive in isolation; their value is collective.
Machine-generated code tends to exhibit lower variance across a range of stylistic dimensions than a human population does. Indentation and brace placement are more uniform; identifier-naming follows more consistent schemes; the distribution of function lengths and statement lengths is narrower. Human code, drawn from many habits, levels of experience, and moods, is comparatively irregular. The intuition is statistical: a single model sampling at moderate temperature produces output clustered around its learned conventions, whereas a population of humans produces a wider spread. Surface-regularity features of this kind are among those shown to carry discriminative weight in code-stylometry studies [1][3].
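Two of the regularity measures mentioned above can be computed directly from the text of a file. The sketch below is an illustration only: the choice of these two measures and their names are assumptions, and it counts only leading spaces, so tab-indented files would need separate handling.

```python
import statistics


def style_regularity(source: str) -> dict:
    """Simple surface-regularity measures over non-blank lines."""
    lines = [ln.rstrip() for ln in source.splitlines() if ln.strip()]
    lengths = [len(ln) for ln in lines]
    # Width of leading-space indentation on each line.
    indents = [len(ln) - len(ln.lstrip(" ")) for ln in lines]
    return {
        # Spread of line lengths; lower means more uniform.
        "line_length_stdev": statistics.pstdev(lengths) if lengths else 0.0,
        # Number of different indentation widths in use.
        "distinct_indents": len(set(indents)),
    }
```

Lower values on both measures indicate a more uniform file; on their own they say nothing about authorship and are only useful as inputs to a trained model.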
The presence, density, and uniformity of comments is repeatedly reported as a strong attribution cue. In at least one controlled study, the presence or absence of comment tokens moved classifier decision boundaries by several percentage points — more than abstract-syntax-tree structural metrics did [4]. Generated code frequently carries complete, uniformly-phrased, grammatically-regular comments across most constructs; human commenting tends to be sparser, more uneven, and more idiosyncratic in phrasing. Because this is a surface feature, it is also one of the easiest for an author to alter (Section 05).
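For Python sources, comment presence and density can be measured with the standard-library tokenizer, which avoids counting `#` characters inside string literals. The returned field names are assumptions chosen for this sketch, and malformed input may cause `tokenize` to raise an error that a real tool would need to handle.

```python
import io
import tokenize


def comment_stats(source: str) -> dict:
    """Count comment tokens and relate them to the number of non-blank lines."""
    comment_count = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_count += 1
    non_blank = sum(1 for ln in source.splitlines() if ln.strip())
    return {
        "comment_count": comment_count,
        "comments_per_line": comment_count / non_blank if non_blank else 0.0,
    }
```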
Under a reference language model, generated code tends to occupy a higher-probability (that is, lower-perplexity) region than human code, because it was sampled from a distribution similar to the reference model's own. Perplexity is therefore usable as a continuous signal: informally, a measure of how "surprised" a reference model is by each successive token, averaged over the file. The zero-shot curvature methods (Section 3.2) and the perplexity methods (Section 3.3) both exploit this property, the former by examining how log-probability changes under perturbation, the latter by using the aggregate value directly [5][6].
Features derived from the abstract syntax tree and control-flow graph — nesting-depth distribution, branching factor, and related measures of code shape — capture structure rather than surface text. Their reported discriminative power is mixed: useful in some studies, marginal in others, with at least one study finding that such structural metrics shifted classical-model decision boundaries only slightly compared with surface features [4]. The mixed result is itself informative: it indicates that, for current generating models, the detectable signal lives more in surface style than in deep structure.
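One structural measure, maximum control-flow nesting depth, can be read off the abstract syntax tree. The sketch below handles Python only, and the set of node types counted as nesting is an assumption; note that an `elif` branch is represented in the tree as a nested `if` and is counted as one level deeper.

```python
import ast

# Statement types treated as opening a new nesting level (an assumption).
_BLOCKS = (ast.If, ast.For, ast.While, ast.Try, ast.With)


def max_nesting_depth(source: str) -> int:
    """Deepest level of nested control-flow blocks in a Python source string."""

    def depth(node: ast.AST, current: int) -> int:
        if isinstance(node, _BLOCKS):
            current += 1
        children = [depth(child, current) for child in ast.iter_child_nodes(node)]
        return max([current] + children)

    return depth(ast.parse(source), 0)
```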
Human-authored code carries traces that machine-generated code often lacks: dead code left behind during development, commented-out experiments, inconsistent shortcuts, idiosyncratic abbreviations, informal TODO and FIXME notes, and occasional misspellings in identifiers or comments. The relative absence of such "imperfections" is weak evidence toward machine authorship. This category is the least formalised in the literature and the most easily confounded — disciplined human authors and linters remove many of these traces — but it is frequently noted qualitatively.
No single signal above is individually reliable; each is, in machine-learning terms, a weak classifier. The general and well-established principle is that a set of approximately-orthogonal weak signals can be combined into a stronger classifier than any one alone — the basis of ensemble methods such as random forests and gradient boosting. The detection families of Section 03 differ chiefly in how they perform this combination: engineered-feature classifiers combine explicit signals through a trained model; neural classifiers learn an implicit combination directly from token sequences. The particular choice of signals, their relative weighting, and any per-language calibration are implementation and tuning decisions specific to a given detector; they are not properties of the method class and vary between systems.
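The arithmetic behind the weak-signal principle can be shown with the simplest possible combiner, a majority vote. The sketch assumes the signals are fully independent and equally accurate, which is an idealisation; real signals are only approximately orthogonal, and the detectors described above use trained combinations rather than voting.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent voters, each correct
    with probability p, reaches the correct answer."""
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))
```

Under these assumptions, five signals that are each right 60% of the time yield a majority that is right about 68% of the time, and the figure rises as further independent signals are added.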
The literature is consistent on several limitations that apply across method families. These are reported here because they materially affect how any detection result should be interpreted.
Multiple studies report that classifiers remain brittle under adversarial edits: small, meaning-preserving modifications to generated code (reformatting, identifier renaming, comment insertion or removal) can move it across the decision boundary [4]. Because some of the strongest signals are surface features such as comment style (Section 4.2), an author who edits those surface features can substantially degrade detection.
Detectors trained on the output of one generation of language models do not necessarily generalise to newer models. One large-scale study explicitly re-tested its detector against code from later models released after the original data-collection window, treating generalisation as an open empirical question rather than an assumption [7]. As generating models change, a fixed detector's accuracy can be expected to drift.
In practice, code is frequently neither purely human-written nor purely machine-generated but a mixture — human-authored code edited with model assistance, or model-generated code subsequently revised by a human. At least one dataset effort explicitly incorporates "machine-refined" code as a third class distinct from both pure cases, and notes that purely AI-generated code is comparatively rare in real repositories [8]. A binary human-versus-machine framing is therefore a simplification of a spectrum.
Because attribution of authorship to a machine can carry consequences for the human associated with the code, the asymmetric cost of false positives is a recurring theme. The literature does not establish a universal threshold; the appropriate operating point on the precision-recall curve is application-specific and is not a property the detection method can supply on its own.
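One way to make the operating point explicit is to choose the threshold from a cap on the false-positive rate measured on labelled validation data. The sketch below is an illustration under that assumption; the function name, the convention that a snippet is flagged when its score is at or above the threshold, and the label encoding (1 = machine, 0 = human) are all choices made here, not taken from the literature.

```python
from typing import Optional, Sequence


def threshold_for_fpr(
    scores: Sequence[float], labels: Sequence[int], max_fpr: float
) -> Optional[float]:
    """Lowest observed score usable as a threshold such that the share of
    human-written samples (label 0) flagged as machine does not exceed max_fpr.

    Returns None when no observed score satisfies the cap.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    for t in sorted(set(scores)):
        false_positives = sum(1 for s in negatives if s >= t)
        if not negatives or false_positives / len(negatives) <= max_fpr:
            return t
    return None
```

Tightening the cap raises the threshold and lowers recall; which trade-off is acceptable remains a decision for the application.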
Agent detection addresses a different question from provenance: not who wrote the code, but what the code does when run. Specifically, it concerns whether the code constructs or drives an autonomous AI agent — software that issues model-inference calls and acts on their outputs, potentially in a loop and potentially with access to external tools or system resources.
This problem connects to a well-documented class of security concern. When code executes the output of a language model — passing model-generated strings to a dynamic-execution construct such as eval or exec — it creates a path by which manipulation of the model's input can lead to arbitrary code execution. This pattern has been catalogued by security researchers and assigned vulnerability identifiers in specific frameworks [9]. Prompt injection, the manipulation of a model through crafted input, is listed as the leading vulnerability class for language-model applications by the OWASP project [10]. Detecting where in a codebase these constructs occur is therefore a recognised static-analysis objective.
The most reliable static signal that code instantiates an AI agent is its declared dependencies. Agent frameworks are imported by name. A static scan of import statements, dependency manifests, and lock files identifies whether a project links against known agent or model-orchestration libraries. This is a coarse signal — importing a library does not prove it is used on any given execution path — but it is high-precision for the question "could this code construct an agent at all?" The technique is the same one used in software composition analysis for licence and vulnerability auditing, applied to a different target list.
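For a single Python file, the import scan can be sketched with the standard `ast` module. The watchlist entries below are placeholders, not real package names; a real scanner would maintain its own target list and would also read dependency manifests and lock files, which this sketch does not do.

```python
import ast

# Placeholder names for illustration only; substitute a maintained target list.
AGENT_PACKAGES = frozenset({"agentlib", "orchestrator_sdk"})


def imported_packages(source: str) -> set:
    """Top-level package names imported anywhere in a Python source string."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 keeps absolute imports only; relative imports are local.
            found.add(node.module.split(".")[0])
    return found


def agent_dependencies(source: str, watchlist=AGENT_PACKAGES) -> set:
    """Imported packages that appear on the watchlist."""
    return imported_packages(source) & set(watchlist)
```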
A second static signal is the presence of dynamic-execution constructs (eval, exec, compile, subprocess invocation, deserialisation of executable content), particularly where the argument to such a construct is data rather than a literal. As noted above, these constructs are the mechanism by which model output can become executed code [9]. Locating them is a classic static-analysis task: the parser identifies call sites, and data-flow analysis determines whether untrusted or model-derived data can reach them.
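The call-site half of this task can be sketched for Python as follows. The sketch covers only direct calls to the built-ins `eval`, `exec`, and `compile` whose first positional argument is not a literal; subprocess invocation, deserialisation, and aliased or attribute-based calls are deliberately left out and would need additional rules.

```python
import ast

DYNAMIC_EXEC = frozenset({"eval", "exec", "compile"})


def dynamic_exec_sites(source: str) -> list:
    """(line, name) pairs for eval/exec/compile calls whose first positional
    argument is something other than a literal constant."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in DYNAMIC_EXEC
            and node.args
            and not isinstance(node.args[0], ast.Constant)
        ):
            sites.append((node.lineno, node.func.id))
    return sorted(sites)
```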
A more precise but more expensive class of method builds a call graph and performs data-flow analysis to determine whether a model-inference call feeds, directly or transitively, into an action with external effect — a dynamic-execution construct, a network request, a filesystem write, or a tool invocation. This distinguishes code that merely calls a model and displays the result from code that acts autonomously on model output. The general technique is standard inter-procedural static analysis; recent work has also explored using language models themselves to assist static-analysis tasks [11], though that is a separate research direction from the detection problem itself.
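A heavily simplified version of the data-flow idea is shown below. It tracks, within a single Python file, whether a value returned by a model call reaches a dynamic-execution sink through simple variable assignments. The source name `call_model` is hypothetical, and the sketch ignores function boundaries, aliasing, attributes, containers, and reassignment to clean values, so it is far from the inter-procedural analysis described above.

```python
import ast

MODEL_CALLS = frozenset({"call_model"})  # hypothetical inference-call name
SINKS = frozenset({"eval", "exec"})


def _call_name(node: ast.Call):
    # Name of the called function or method, if it is statically visible.
    if isinstance(node.func, ast.Name):
        return node.func.id
    if isinstance(node.func, ast.Attribute):
        return node.func.attr
    return None


class TaintTracer(ast.NodeVisitor):
    def __init__(self):
        self.tainted = set()   # variable names holding model-derived data
        self.findings = []     # line numbers where tainted data reaches a sink

    def _is_tainted(self, node: ast.AST) -> bool:
        for sub in ast.walk(node):
            if isinstance(sub, ast.Name) and sub.id in self.tainted:
                return True
            if isinstance(sub, ast.Call) and _call_name(sub) in MODEL_CALLS:
                return True
        return False

    def visit_Assign(self, node: ast.Assign):
        self.generic_visit(node)  # check any calls inside the right-hand side
        if self._is_tainted(node.value):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    self.tainted.add(target.id)

    def visit_Call(self, node: ast.Call):
        if _call_name(node) in SINKS and any(self._is_tainted(a) for a in node.args):
            self.findings.append(node.lineno)
        self.generic_visit(node)


def trace_model_to_sink(source: str) -> list:
    """Line numbers where model-derived data is passed to eval or exec."""
    tracer = TaintTracer()
    tracer.visit(ast.parse(source))
    return tracer.findings
```

Code that only prints a model reply produces no finding, while code that strips the reply and passes it to `exec` is reported at the `exec` line, which is the distinction the paragraph above draws.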
Static analysis cannot in general determine what a program does at runtime; this is a consequence of undecidability and applies to agent detection as to any other behavioural property. Dynamically constructed import names, reflection, and configuration-driven dispatch can all hide agent construction from a purely static scan. Conversely, the mere presence of an agent framework in the dependency tree does not establish that any agent is constructed on a reachable path. Static agent detection therefore produces evidence to be weighed, not a definitive runtime verdict. Dynamic and behavioural monitoring — observing a running system for anomalous inference calls or execution patterns — is the complementary approach where static analysis is insufficient [10].
Because provenance and agent detection answer different questions, a complete report keeps their findings distinct. A file may be flagged as probably machine-generated (a provenance finding) and separately as constructing an agent (a behaviour finding); the two flags are computed by different methods and have different confidence characteristics. Conflating them — for instance, treating "imports an agent framework" as evidence that code is AI-generated — introduces a category error, since human authors routinely write agent code and AI generators routinely produce static code.
The two analyses do share infrastructure. Both depend on a parser capable of producing a token stream and, ideally, an abstract syntax tree for each supported language; both benefit from the same file-identification and language-classification front end. Beyond that shared front end, the methods diverge entirely.
The literature converges on a small set of evaluation practices that any reported detection accuracy should specify, and whose absence should be treated as a limitation of the report:
Reproducibility is an acknowledged weakness in parts of this literature: surveys note heavy reliance on a small number of benchmark datasets and uneven release of training code and models [2]. Detection figures reported without the above context are difficult to compare across studies.
Provenance detection and agent detection are distinct problems requiring distinct methods. Provenance detection adapts authorship-attribution techniques — supervised stylometric classification, zero-shot probability-curvature analysis, perplexity scoring, and fine-tuned neural classifiers — to the human-versus-machine question, with reported accuracies ranging from the mid-80% range for interpretable multilingual stylometric models to above 97% for fine-tuned neural classifiers on specific datasets. All provenance methods share documented limitations: adversarial brittleness, distribution shift across model generations, the prevalence of mixed human-machine code, and asymmetric false-positive cost. Agent detection is a static-analysis problem centred on dependency analysis, dynamic-execution-construct identification, and data-flow tracing from model-inference calls to external-effect operations; it is bounded by the general undecidability of runtime behaviour and is complemented by dynamic monitoring. A sound analysis reports the two detections separately and specifies its evaluation methodology.
Direct links to the cited sources, in reference order. Where a paper sits behind a publisher paywall, the link resolves to the canonical record (DOI or proceedings page) from which open versions can usually be located.