[clamav-users] Scanning very large files in chunks

Discussion:


sapientdust+

2016-08-04 23:40:49 UTC

I've recently run into the issue of clamd not being able to scan files
that are larger than a small number of GB, and I have seen the
warnings in `man clamd.conf` that say specified limits above 4GB are
ignored.

Could developers or other folks familiar with the clamd codebase
comment on the feasibility of scanning large files in multiple pieces
as a way of handling larger files?

For example, given a file that is 6GB, does using multiple INSTREAM
calls (that's how I'm interacting with clamd currently) to check the
full 6GB seem like it should work reliably?

INSTREAM: bytes 0-1000MB
INSTREAM: bytes 900MB-1.9GB
INSTREAM: bytes 1.8GB-2.8GB
INSTREAM: bytes 2.7GB-3.7GB
INSTREAM: bytes 3.6GB-4.6GB
INSTREAM: bytes 4.5GB-5.5GB
INSTREAM: bytes 5.4GB-6.0GB

There is overlap above, wherein the 100MB of data that starts at the
900MB position is scanned twice, once in the first call (as the last
100MB of that stream) and once in the second call (as the first 100MB
of that stream), to reduce the possibility of a virus being split into
two pieces and therefore not recognized.
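The overlapping byte ranges listed above can be generated rather than written out by hand. The following is only a sketch of that arithmetic: the function name and its parameters are made up for illustration, offsets are end-exclusive, and 1 MB is taken as 10**6 bytes to match the round numbers in the list.

```python
def overlapping_ranges(total_size, chunk_size, overlap):
    """Yield (start, end) byte offsets, end exclusive, covering [0, total_size).

    Each range after the first begins `overlap` bytes before the previous
    range ended, so data near a boundary is scanned twice.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size)
        yield start, end
        if end == total_size:
            break
        start = end - overlap


MB = 10**6
# 6 GB file, 1000 MB chunks, 100 MB overlap, as in the list above.
for start, end in overlapping_ranges(6000 * MB, 1000 * MB, 100 * MB):
    print(f"INSTREAM: bytes {start // MB}MB-{end // MB}MB")
```

Note that the overlap only helps for signatures shorter than the overlap and matched on raw bytes; it does not by itself address detections that depend on the structure of the whole file.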

If ClamAV needs the first bytes in order to know
what kind of file it is scanning and trigger filetype-specific heuristics, then
something like the above could be adapted so that the first N bytes of the
first chunk are prepended to each subsequent chunk that is checked for
that file.
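A rough sketch of that header-prepending idea, framed for INSTREAM, might look like the following. The framing (the `zINSTREAM\0` command, then chunks each prefixed with a 4-byte big-endian length, ended by a zero-length chunk) is how the clamd man page describes the command; everything else here is an assumption for illustration, including the function name, its parameters, and the 64 KB block size. Whether prepending the header actually helps clamd's file type detection is exactly the open question being asked.

```python
import io
import struct


def instream_frames(fileobj, start, end, header_len=0, block=64 * 1024):
    """Yield the byte strings to send to clamd for one INSTREAM scan.

    Covers bytes [start, end) of fileobj. If header_len is non-zero and the
    range does not begin at offset 0, the first header_len bytes of the file
    (capped at `start`, to avoid duplicating data) are sent first.
    """
    yield b"zINSTREAM\0"
    if header_len and start > 0:
        fileobj.seek(0)
        header = fileobj.read(min(header_len, start))
        if header:
            yield struct.pack("!I", len(header)) + header
    fileobj.seek(start)
    remaining = end - start
    while remaining > 0:
        data = fileobj.read(min(block, remaining))
        if not data:
            break  # file is shorter than the requested range
        yield struct.pack("!I", len(data)) + data
        remaining -= len(data)
    yield struct.pack("!I", 0)  # zero-length chunk terminates the stream


# Small in-memory demonstration; a real caller would open the large file
# in binary mode and write each frame to the clamd socket.
demo = io.BytesIO(b"HEADxxxxPAYLOAD")
for frame in instream_frames(demo, 8, 15, header_len=4):
    print(frame)
```

Each frame would be written to the clamd socket in order (for example with `sendall`), and the reply read after the terminating frame. The prepended header counts toward the stream size, so header plus chunk still has to fit within the `StreamMaxLength` setting in clamd.conf.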

Thanks for any guidance or feedback you can provide.
_______________________________________________
Help us build a comprehensive ClamAV guide:
https://github.com/vrtadmin/clamav-faq

http://www.clamav.net/contact.html#ml

Al Varnell

2016-08-05 02:14:05 UTC


Does anybody have any evidence of malware that exceeds 4GB? Although I can certainly see the utility of the proposed capability as a hedge for the future, it would seem to be a waste of time and compute power to scan such large files today.

With the ever-increasing malware issues we face today, it's important to consider this:

Risk = threat x vulnerability x consequence

<http://fortune.com/2016/05/14/cybersecurity-risk-calculation/>

We all need to focus on fixing the high risk items first.

Sent from Janet's iPad

-Al-


Paul Kosinski

2016-08-05 02:32:08 UTC


Really large files like this would likely either be video files or
disk images (incl. DVD and Blu-Ray). Both kinds could, in principle,
have malware embedded.

Disk images often contain whole file systems and thus many, many files.
The alternative is to scan the entire FS after it is "mounted". (Of
course disk images these days might be 6 TB rather than a mere 6 GB.)

Video files could have malicious sequences of bytes which break codecs
(e.g., via buffer overflows) and possibly result in code execution.
(Adobe Flash comes to mind, with its monthly critical updates, but I'm
sure other codecs can also have similar problems.)



Al Varnell

2016-08-05 03:08:59 UTC


I certainly agree that many disk images are known to contain malware, but the usual approach there is to use a hash value for the file, as there are other issues with attempting to scan within the image without mounting it first. The most recent versions of ClamXav now do both a hash check and a scan after mounting.
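As an illustration of the hash approach, a file of any size can be hashed in fixed-size blocks without loading it into memory. This is only a sketch and not how ClamXav does it: the function name, the block size, and the signature name are made up, and the output is assumed to follow the `hash:filesize:name` layout that ClamAV uses for MD5 entries in .hdb signature files.

```python
import hashlib
import os


def hash_signature_line(path, name, block=1024 * 1024):
    """Return an 'md5:size:name' line for the file at path.

    The file is read in `block`-sized pieces, so memory use stays constant
    regardless of file size.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for piece in iter(lambda: handle.read(block), b""):
            digest.update(piece)
    return f"{digest.hexdigest()}:{os.path.getsize(path)}:{name}"
```

A hash match only identifies an exact, already known image; any modified copy needs the mount-and-scan step as well.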

-Al-

Post by Paul Kosinski
Disk images often contain whole file systems and thus many, many files.
The alternative is to scan the entire FS after it is "mounted". (Of
course disk images these days might be 6 TB rather than a mere 6 GB.)