Should binary files be stored in the database?
What is the best place for storing binary files that are related to data in your database? Should you:
1. Store in the database with a blob
2. Store on the filesystem with a link in the database
3. Store on the filesystem, but rename the file to a hash of its contents and store the hash in the database
4. Something I've not thought of
The advantages of (1) are, among others, that atomicity of transactions is preserved. The cost is that you might dramatically increase storage (and associated streaming/backup) requirements.
The goal of (3) is to preserve atomicity to some degree: if you can enforce that the filesystem you write to never allows files to be changed or deleted, and that every file is named by the correct hash of its contents, then you write the file to the filesystem before permitting the insert/update that references the hash. If the transaction fails after the filesystem write but before the database DML, that is fine, because the filesystem is 'faking' being the repository of all possible files and hashes - it doesn't matter if there are some files in there that are not being pointed to (and you could clean them up periodically if you are careful).
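To make that write ordering concrete, here is a minimal sketch (hypothetical table, column and path names; a generic Python DB driver such as psycopg2 is assumed):

    import hashlib
    import shutil
    import psycopg2  # any DB-API driver would do; psycopg2 fits the PostgreSQL case

    FILE_STORE = "/var/files"  # hypothetical write-once store

    def store_document(conn, src_path, doc_id):
        # 1. Hash the content and copy the file into the immutable store first.
        h = hashlib.sha256()
        with open(src_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        digest = h.hexdigest()
        shutil.copyfile(src_path, f"{FILE_STORE}/{digest}")

        # 2. Only then run the insert/update that references the hash. If this
        #    transaction fails, the orphaned file is harmless and can be
        #    garbage-collected later.
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO documents (id, content_hash) VALUES (%s, %s)",
                (doc_id, digest),
            )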
EDIT:
It looks like some RDBMSs have this covered in their own ways. I'd be interested to know how others do it, and particularly in a solution for PostgreSQL.
database-design blob
This question has a duplicate here: Is it better to store images in a blob or just the url? That question was closed in favor of this one as the more prominent of the two. Please be sure to read both questions for more insight!
– Marian
Apr 29 '11 at 18:51
@Marian - thanks for the link, the answers/comments there are indeed helpful (and not just relevant for images as I first thought)
– Jack Douglas♦
Apr 29 '11 at 20:19
@gbn I've come to think it is not such a good question, because the 'right' answer might depend on the RDBMS - e.g. the lack of proper incremental backup in PostgreSQL is an issue for me if I decide to keep files in the db. If there were one outstanding answer I'd still accept it, but I'm not sure that would add any value as things stand...
– Jack Douglas♦
Sep 5 '11 at 7:54
The modern answer to the "something you've not thought of" is technologies like Amazon S3. Would love to see some thorough answers address that.
– jpmc26
Jul 30 '18 at 17:51
12 Answers
Store in the database with a blob
A disadvantage is that it makes your database files quite large, and possibly too large to back up with your existing setup. An advantage is integrity and atomicity.
Store on the filesystem with a link in the database
I've come across such horrible disasters doing this, and it scares me that people keep suggesting it. Some of the disasters included:
- One privileged user who would rearrange the files, frequently breaking the links between the paths in the DB and the files' new locations (but somehow this became my fault).
- When moving from one server to another, the ownership of some files was lost, because the SID of the old machine's administrator account (which the old website ran under) was not part of the domain; the copied files therefore had ACLs that could not be resolved, presenting users with the username/password/domain login prompt.
- Some of the paths ended up being longer than 256 characters from the C: all the way to the .doc, and not all versions of NT were able to deal with long paths.
Store on the filesystem, but rename the file to a hash of its contents and store the hash in the database
The last place I worked at did this, based on my explanation of the above scenarios. They saw it as a compromise between the organization's inability to gain experience with large databases (anything larger than about 40 GB was deemed "too big"), the corporate inability to purchase large hard drives or a more modern backup solution, and the need to get away from risks #1 and #3 that I identified above.
My opinion is that storing in the DB as a blob is a better solution and more scalable in a multi-server scenario, especially with failover and availability concerns.
I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented as a possible argument, not your point of view.
– Phil Lello
Apr 29 '11 at 16:48
I once had an issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as binary for 10,000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
– Ali Razeghi
Jan 29 '13 at 0:20
The whole "link breaking" problem is an application issue, not a database issue. The database is doing its job (serving pure data) while the application isn't (serving mixed file types). The application should take responsibility for serving files, by storing an abstract route path in the database that works no matter where the file is stored on the server internally (à la Symfony2 routing). This abstracts away the native paths, makes the application more portable and maintainable, and allows switching to any kind of filesystem without breaking anything.
– Tek
Jul 31 '14 at 18:28
Number 1 for complete data integrity. Use the other options if you don't care about data quality. It's that simple.
Most RDBMSs have optimizations for storing BLOBs (e.g. SQL Server FILESTREAM) anyway.
what is it about (3) specifically that puts data integrity at risk? (assuming you get your transactional API right)
– Jack Douglas♦
Apr 29 '11 at 13:05
@JackPDouglas: you have a hash, which is not the actual data, and it still has an external dependency for data integrity.
– gbn
Apr 29 '11 at 14:02
@JackPDouglas There's also the possibility that the server admin and DBA are different teams, with the associated risk that files get deleted in error, or not backed-up as they're thought of as temporary files.
– Phil Lello
Apr 29 '11 at 16:23
I'm interested in what the FileStream equivalents on other platforms are called.
– Ronnie Overby
Jul 7 '15 at 19:26
If going for Oracle, take a look at DBFS and SecureFiles.
SecureFiles says it all: keep ALL your data safe in the database. It is organized in LOBs; SecureFiles is a modernized version of LOB storage that needs to be activated.
DBFS is a filesystem in the database. You can mount it, much like a network filesystem, on a Linux host. It is really powerful (see blog) and it has a lot of options to tune to your specific needs. Being a DBA, given a filesystem (based in the database, mounted on Linux), I created an Oracle database on it without any problems (a database, stored in a ... database). Not that this would be very useful, but it does show the power.
More advantages are availability, backup, and recovery, all read-consistent with the other relational data.
Sometimes size is given as a reason not to store documents in the database. That data probably has to be backed up anyway, so that's not a good reason not to store it in the database. Especially in a situation where old documents are considered read-only, it is easy to make big parts of the database read-only. In that case, those parts of the database no longer need a high-frequency backup.
A reference in a table to something outside the database is unsafe. It can be manipulated, is hard to check, and can easily get lost. How about transactions? The database offers solutions for all these issues. With Oracle DBFS you can give your docs to non-database applications and they wouldn't even know they are poking around in a database.
One last, big surprise: the performance of a DBFS filesystem is often better than a regular filesystem. This is especially true if the files are larger than a few blocks.
+1 for stating a good rule "A reference in a table to something outside the database is unsafe."
– Niklas Rosencrantz
Sep 24 '11 at 12:23
I think the right answer here depends a lot on your application, and how important those documents are.
For a document management system, or a system where recoverability of the stored documents is critical (so most things financial, HR or CRM related), storing documents inline or using your favourite DB vendor's proprietary document tech seems like the Right Thing To Do.
However, there are many applications where I believe that the opposite decision is appropriate.
Helpdesk systems and wiki-type systems are ones where I think it makes a lot of sense to keep the data out of the database. I believe some, like Jira, actually provide an option to choose whether you want to store documents inline or not.
For a medium sized business, storing documents for a ticketing system inline can mean the difference between a compressed backup measured in megabytes, and one measured in gigabytes.
I would personally rather bring a ticketing system back online in a few minutes and wrestle with the (generally less important) documents for a few hours, than increase my "it's broken and the CTO is breathing down my neck" RTO by having to restore and replay logs from a much larger backup.
There are other benefits of keeping documents separate.
- You can easily run separate processes that catalog document metadata, perform virus scanning, perform keyword indexing, etc.
- You can take advantage of tools to assist with backups or recovery - rsync, storage snapshots, etc. - which lend themselves much better to files than databases
- You can actually use storage that supports compression or deduplication (the stuff that your SAN admins have been blubbering about for years, aka the bane of database administrators worldwide)
- For an installation across multiple sites, you can complement a centralised database with a distributed file system
I think a hybrid combination of #2 and #3 might be clever. Keep the original filenames, but calculate and store a hash/checksum of the document, so that you have some reference point that will assist recovery in case someone moves or renames the file.
Storing the files with their original filenames means that applications can literally yank them straight from a file system and send them over the wire, or in a thick client world, maybe even point the user directly to the file server.
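For illustration, a rough sketch of that hybrid (hypothetical table and column names; the checksum is only a recovery aid, not the storage key):

    import hashlib
    from pathlib import Path

    def register_file(cur, path):
        """Record a file under its original name, plus a checksum for later recovery."""
        data = Path(path).read_bytes()
        checksum = hashlib.sha256(data).hexdigest()
        cur.execute(
            "INSERT INTO attachments (file_path, sha256) VALUES (%s, %s)",
            (str(path), checksum),
        )

    def find_moved_file(cur, search_root, missing_path):
        """If a recorded path no longer exists, hunt for a file with the same checksum."""
        cur.execute("SELECT sha256 FROM attachments WHERE file_path = %s", (missing_path,))
        (wanted,) = cur.fetchone()
        for candidate in Path(search_root).rglob("*"):
            if candidate.is_file() and hashlib.sha256(candidate.read_bytes()).hexdigest() == wanted:
                return candidate
        return None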
I want to add my experience here as to the tradeoffs. In PostgreSQL, at least, the performance impact is quite minimal in terms of the DB server. Large blobs are stored out of line, not in the main heap tables, so as to keep them out of the way of operations that scan large numbers of records. Other DBs may do something similar.
The major advantage is the ability to keep all related data in one place for atomicity and backup purposes. This greatly reduces the chance of something going wrong.
The major disadvantage is not one I have seen covered above, and that's memory usage on the front end. I don't know exactly how every DB handles this, so it may be implementation-dependent, but for PostgreSQL the data comes in as an escaped ASCII string (possibly hexadecimal, possibly with inlined escapes). This then has to be converted back to binary on the front end. Many frameworks I have seen for doing this involve passing the value (not a reference) and then constructing a new binary string based on it. I calculated that doing this in Perl ended up using many times the memory of the original binary.
Verdict: if the files are only occasionally accessed, I would store them in the db. If they are frequently and repeatedly accessed, at least with PostgreSQL, I think the costs outweigh the benefits.
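For what it's worth, one way to sidestep that front-end memory cost in PostgreSQL is to stream a large object in chunks instead of selecting a whole bytea value. A minimal sketch using psycopg2's large-object support (hypothetical OID column and paths):

    import psycopg2

    def stream_blob_to_disk(conn, file_oid, dest_path, chunk_size=1 << 20):
        """Read a PostgreSQL large object in fixed-size chunks instead of buffering it all."""
        with conn:  # large objects must be accessed inside a transaction
            lob = conn.lobject(file_oid, mode="rb")
            try:
                with open(dest_path, "wb") as out:
                    while True:
                        chunk = lob.read(chunk_size)
                        if not chunk:
                            break
                        out.write(chunk)
            finally:
                lob.close()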
No (large) files should be transferred as strings. This is really bad for memory usage and eventually causes OOM. Postgres streaming should be used.
– Anssi
Jan 4 '18 at 9:18
I would be curious to know if this is still your opinion. ;) BTW I tried to answer it too, with a lot more specifics about Pg.
– Evan Carroll
4 hours ago
Don't do it.
There really isn't an upside to having files stored in the database.
Doesn't it already feel weird and fishy when you think to yourself:
Should I store files in a database or a filesystem?
Even better, say it out loud.
On to the facts:
Using the database
"PROS"... but not quite:
- "Atomicity" which is correct but it's a double edged sword. Because it drags
cons along with it. - Integrity. Same as above.
I really don't want to be biased, but I don't think there's more to add. The pros aren't really that great if you think about it.
If I forgot something, comment below; in the meantime, keep reading.
CONS:
- Wrong tool for the job
- Harder to maintain
- Slow
- Forget about storing hundreds of MB/gigabytes of data PER user.
- Backing up rapidly growing sites will be a nightmare.
- Restoring/moving will also suck.
Using the filesystem
PROS:
- Way easier to maintain
- Fast
- Database back ups don't have anything to do with this
- Arguably more portability*
CONS:
- None*
*Fine print
Right now you're asking yourself: hold on, you mean there are no cons?! How come?
The biggest mistake here is that people are trying to drive a screw with a hammer.
The main reason, and I'd go as far as to say the only reason, this is being asked is file links.
This is a problem that the database isn't meant to solve. It even sounds silly if you think about it.
"The database will fix my file linking problems."
When in reality, logically the application should actually be in charge of handling and serving links.
A Solution:
- Make your application handle URL requests with custom routes.
- Save this route to your database.
- Internally every time this route is called map it to the file you want.
- If you ever move your files elsewhere just change the filename value of the route and that route will always serve the same file no matter where it's stored or referenced across the web.
This would also abstract away the native paths, make the application more portable and maintainable, and allow switching to any kind of filesystem without breaking anything.
How to implement this is beyond the scope of this answer, but you can take a look at a general example in arguably the most widely used web language (PHP):
https://github.com/symfony/Routing
https://github.com/kriswallsmith/assetic
Both of these together are really powerful.
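Purely as an illustration of the route idea, and not of Symfony's actual API, a framework-neutral sketch might look like this (table and route names are made up):

    def resolve_route(cur, route):
        # Look up the physical file currently backing an abstract route
        # such as "/files/annual-report".
        cur.execute("SELECT file_path FROM file_routes WHERE route = %s", (route,))
        row = cur.fetchone()
        return row[0] if row else None

    def move_file(cur, route, new_path):
        # The public route never changes; only the mapping to storage does.
        cur.execute(
            "UPDATE file_routes SET file_path = %s WHERE route = %s",
            (new_path, route),
        )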
You might be interested in this: research.microsoft.com/apps/pubs/default.aspx?id=64525 - research by Microsoft showing that storing blobs in the database is actually faster than in the file system (for some blob sizes at least). This is in line with my tests, which showed that for medium-sized blobs (< ~1MB) e.g. Postgres is also faster than a filesystem. For Oracle it's about the same performance, but I haven't tested the new SecureFiles storage format yet (they claim it's faster than the old storage format).
– a_horse_with_no_name
Aug 1 '14 at 5:56
I saw that, which is why I talked about large files. Plus OP didn't specify a database vendor so performance may differ between vendors and thus my advice is more general.
– Tek
Aug 1 '14 at 13:37
The DB being faster than the FS for files sounds fishy to me. For 1 MB, any overhead doesn't play a role. Probably they allowed the DB to cache but not the FS. Apples, pies.
– Torsten Bronger
Oct 9 '15 at 3:25
Back in the day, Microsoft hyped up the ability to store images (and similar blob data types) in the database. This was a cool new feature of SQL Server 2000 (I am pretty sure it was 2000, not 7.0) and many people jumped on the bandwagon.
Storing BLOBS in the database has advantages and disadvantages:
On one hand, all your data and related images or documents can be stored and accessed in one place. Application users do not require special network permissions, as it is SQL that is serving up the images/files/documents.
On the other hand, your database can grow quite large, depending on the size and number of BLOBS you are storing. This affects backups, storage requirements, time sensitive recovery operations, etc.
SQL Server 2008 introduced FILESTREAM. The database contains pointers to the files, and the files reside on the server rather than in the database, but when you back up the database the files are backed up as well.
Your backups can get quite large, but you don't end up with orphaned files/documents/blobs/images.
My personal preference has been to let the database store pointers/network locations, and let a file server handle the files. File servers are better optimized for such tasks anyway.
Never mind that if you don't own the server you're going to pay a heck of a lot more per MB for database space vs. file space. Also, having the file on disk makes it much easier to troubleshoot - how do you SELECT image FROM table in SSMS and validate that the right image is there?
– Aaron Bertrand♦
Sep 26 '11 at 15:07
My vote would be for neither. Store the data in a system like Amazon S3 or Microsoft's CDN, and store that URL in the database.
This way you get the reliability of having the data always accessible, without having monster-sized databases to deal with.
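Assuming something like boto3 and a bucket you control, the pattern is roughly this (bucket, key, and table names are made up):

    import boto3

    s3 = boto3.client("s3")

    def store_attachment(cur, local_path, key):
        # Upload the binary to object storage...
        s3.upload_file(local_path, "my-app-attachments", key)
        url = f"https://my-app-attachments.s3.amazonaws.com/{key}"
        # ...and keep only the URL (or just the key) in the database row.
        cur.execute("INSERT INTO attachments (s3_key, url) VALUES (%s, %s)", (key, url))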
Although it partly depends on the application/environment (people included), I'd go for the blob.
Keeping everything in the database means replication works for file data. You'd need a separate mechanism to synchronise FS files.
In some applications, the filesystem shouldn't be modified anyway. For example, on a production website, I'd avoid ever using the filesystem for any non-disposable data (site lives under a SCM, data in a database).
Assuming we've got multiple users/applications with separate permissions, then any filesystem storage provides an opportunity for differences in DB and FS access rights.
The refinement I'd consider making to BLOB storage is to chunk the data if it makes sense; if you only need 512 bytes from a 20 MB BLOB, this sector-like access is a real boon, especially if you're dealing with remote clients (and again, a partial update creates much less replication traffic).
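A rough sketch of that chunking idea, under an assumed schema (a blob_chunks table) and an arbitrary chunk size:

    CHUNK_SIZE = 64 * 1024  # sector-like granularity; tune to your access pattern

    def write_chunks(cur, blob_id, data):
        # Store the BLOB as numbered fixed-size chunks so reads and updates
        # can touch only the pieces they need.
        for n, offset in enumerate(range(0, len(data), CHUNK_SIZE)):
            cur.execute(
                "INSERT INTO blob_chunks (blob_id, chunk_no, bytes) VALUES (%s, %s, %s)",
                (blob_id, n, data[offset:offset + CHUNK_SIZE]),
            )

    def read_range(cur, blob_id, start, length):
        # Fetch only the chunks covering [start, start+length) instead of the whole BLOB.
        first, last = start // CHUNK_SIZE, (start + length - 1) // CHUNK_SIZE
        cur.execute(
            "SELECT bytes FROM blob_chunks WHERE blob_id = %s AND chunk_no BETWEEN %s AND %s "
            "ORDER BY chunk_no",
            (blob_id, first, last),
        )
        joined = b"".join(bytes(row[0]) for row in cur.fetchall())
        local = start - first * CHUNK_SIZE
        return joined[local:local + length]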
For postgres:
It's actually straightforward. There is a BYTEA type that can be used for storing binary strings. By default, there are no built-in utilities like the ones mentioned for MS or Oracle, so storing lots of big files and retrieving them can get tedious. You also need to do the conversion of the files within the application (like with a ByteStream or similar; no idea though how this works with the specific MS/Oracle file<->database solutions). There is also a lo (large object) type that helps with managing BLOBs, since the internal management of these types may not keep track of the references.
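A minimal BYTEA round trip with psycopg2 might look like this (hypothetical files table; for very large files the lo / large-object route is usually preferable):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # assumed connection string

    def save_file(path, name):
        # Read the whole file and store it in a bytea column.
        with open(path, "rb") as f, conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO files (name, data) VALUES (%s, %s)",
                (name, psycopg2.Binary(f.read())),
            )

    def load_file(name, dest):
        # Fetch the value back and write it out as binary.
        with conn, conn.cursor() as cur:
            cur.execute("SELECT data FROM files WHERE name = %s", (name,))
            (data,) = cur.fetchone()
            with open(dest, "wb") as f:
                f.write(bytes(data))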
This isn't at all straightforward. The bytea type has no bearing on whether or not you should be storing that in a database.
– Evan Carroll
2 days ago
Do not store files in a database.
Everyone, without exception, who can run any RDBMS on the market already has a database specifically designed for storing files, and the RDBMS itself is using it! That database is the filesystem. Now let's talk about some of the potential drawbacks of storing files in the database, as well as some specific mitigating factors.
No file handles to files in the database. What does this mean?
Programmer-talk: you cannot seek (fseek), there is no ability to manage the resource with asynchronous access (asyncio or epoll), and there is no sendfile (saving you the copy from kernel space).
Practical application: want to send a video or picture to a client over HTTP/2 or HTTP/3? If it's in the database, then you'll first have to query it. Whatever query returns that file, you'll have to wait for the entire query to conclude before that file can move to the next step. In a production install with the RDBMS on a different server than the web server, you'll first have to transfer the file entirely from the RDBMS to the web server rather than streaming it through. However, if the transport layer provided a filesystem abstraction (which even NFS supports), you could seek halfway through the file and immediately start streaming it back to the client without buffering any more of the file than necessary. This is routinely done by the web servers nginx and Apache, and by PureFTPd and ProFTPD.
Double copy on the RDBMS. By the very fact that it's in the database, you'll likely be writing it twice. Once in a write-ahead log (WAL), and then again into the tablespace.
No updates, ever. MVCC means nothing gets updated, only copied anew with modifications, and then the old row gets marked as expired (deleted). Any update to the file will require writing the whole new row, not just the file. Filesystems can provide this too, with data journaling, but you rarely need that.
File read and transfer slow down the query. If the file itself is stored on a row you need to query, the whole row will either have to wait for the file to be transferred, or you'll have to issue two separate queries.
Memory use on the DB client. The DB client (libpq, JDBC, ODBC, FreeTDS, etc.) will likely buffer the query in memory. When that in-memory buffer is exhausted it may start a disk buffer, or even worse, it may fall back to the kernel to be paged out to disk.
Query throttling. Many databases provide the ability to kill and reap queries when they take too much time or too many resources. Keep in mind that in no implementation are file transfers itemized. Did that query get killed after 3 seconds? Or did it take 1 second and the backend spent 2 seconds transferring a file? And beyond "itemized", how are you going to effectively state how much time a query should take when 99.9% of queries return 1 KB and the other one returns 1 GB?
No copy-on-write or deduplication. XFS and Btrfs support copy-on-write and deduplication transparently. This means that having the same picture everywhere, or needing a second copy of it, can be handled transparently by the filesystem. However, if the file is not standing by itself, and is either in a row or in a store, the filesystem is likely unable to dedupe it.
Integrity. A lot of people here are talking about integrity. What do you think is better at detecting filesystem corruption: an application that uses the filesystem, or the filesystem's core utilities? Store a file in a row, or out of line, and any filesystem corruption will be obscured by the database. xfs_repair is damn good at recovering when you have filesystem or hard drive corruption, and if it fails it'll still be a lot easier to do data forensics.
Cloud migration. If you ever want to store the files on a SAN or in the cloud, you'll have all the more difficulty, because now that storage migration is a database migration. If your files are, for example, stored on the filesystem, you can fairly easily move them to S3 (and with something like s3fs it can be transparent).
Exceptions
Storing files in the database has a few valid use cases:
- When you need to edit the file transactionally. That means it's literally part of your transaction to edit the file, or you need the ability to roll back edits on the file if the transaction fails because of data-integrity issues in the relations (tables).
- When you need to ensure the filesystem is precisely versioned with the data and you can't afford any risk of them falling out of sync.
- When the database can actually parse the file and you can query it. In PostgreSQL, for example, topologies can be queried with PostGIS. At this point, while it's a file, it's also data for the query and not a storage dump.
Mitigations
Some databases have a notion of an "externally managed resource", where the database either manages the file privately on disk or hands the client a handle to it, such as:
- PostgreSQL, through the Large Object infrastructure, provides a file handle to a resource for the duration of the transaction.
- SQL Server 2017's FILESTREAM infrastructure provides temporary access that lasts for the duration of the transaction, which you can use to get the file path and open a file handle to it.
- Oracle provides BFILE (this has nothing to do with their internal LOB stuff, which is called SecureFiles).
Some of the databases store large binary objects out of line, or can (like Oracle SecureFiles). This permits you to update the row without rewriting the file.
Some databases, like Oracle, do their MVCC without a WAL and do not have to double-write the file.
Some of the databases, like SQL Server and Oracle, provide the ability to "stream" data from the file without ever having a file handle to it. This may or may not run on a different connection than the database query. But the key here is that while you can stream the file (in theory), I can't find any evidence of any product not made by the provider that uses that feature. For example, where is the nginx/Apache bridge to allow you to do this?
Oracle provides optional deduplication, compression, and encryption through Internal-LOB storage (like SecureFile).
Conclusion
The worst-case scenario when you put a file in the database is very bad for performance and for compatibility with tooling. It's always exceptionally implementation-dependent. In no way is the database better at being a filesystem than the filesystem. In every way it's a compromise, and even when you get powerful mitigating features (as in the case of SecureFiles), the tooling is so poor that it's really not much more than a marketing point unless your whole stack is built by the RDBMS provider.
Keep it simple, and the general rule is keep the files out of the DB.
Solution
How should you store files, or abstract a filesystem, in such a fashion as to function effectively for multiple tenants and users? I am partial to hashing the file contents. This is pretty common these days and works well.
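As an illustration only (the directory layout and names are made up, not taken from any particular product), a content-addressed store tends to look something like this:

    import hashlib
    from pathlib import Path

    STORE = Path("/var/lib/app/files")  # hypothetical root for the content store

    def put(data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Shard by the first two hex characters so no single directory gets huge,
        # and identical content is stored exactly once (free deduplication).
        target = STORE / digest[:2] / digest[2:]
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(data)
        return digest  # this is what the database row references

    def get(digest: str) -> bytes:
        return (STORE / digest[:2] / digest[2:]).read_bytes()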
Let me share my experience with MS SQL Server and a huge number of files. We save the files on a file server. The database has two tables: one for the file folders and access credentials, and one for the filenames. It is easy to maintain the database and files, and you can easily move the files, even across servers; you just need to modify the folders table.
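A guess at what such a layout might look like, since the original schema isn't shown (illustrative names only; credentials would normally live in a secrets store):

    # Two tables: one per network folder (with the credentials used to reach it),
    # one per file, pointing at its folder. Moving files to another server only
    # means updating folders.unc_path.
    FOLDERS_DDL = """
    CREATE TABLE folders (
        folder_id  int PRIMARY KEY,
        unc_path   varchar(260) NOT NULL,  -- e.g. \\\\fileserver01\\invoices
        username   varchar(64)  NOT NULL,
        password   varchar(64)  NOT NULL   -- illustrative only
    );
    """

    FILES_DDL = """
    CREATE TABLE files (
        file_id    int PRIMARY KEY,
        folder_id  int NOT NULL REFERENCES folders(folder_id),
        file_name  varchar(255) NOT NULL
    );
    """

    def full_path(cur, file_id):
        # DB-API cursor; parameter style varies by driver.
        cur.execute(
            "SELECT fo.unc_path, fi.file_name FROM files fi "
            "JOIN folders fo ON fo.folder_id = fi.folder_id WHERE fi.file_id = %s",
            (file_id,),
        )
        unc_path, name = cur.fetchone()
        return unc_path.rstrip("\\") + "\\" + name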
add a comment |
protected by Paul White♦ Jul 31 '14 at 21:52
Thank you for your interest in this question.
Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).
Would you like to answer one of these unanswered questions instead?
12 Answers
12
active
oldest
votes
12 Answers
12
active
oldest
votes
active
oldest
votes
active
oldest
votes
Store in the database with a blob
A disadvantage is that it makes your database files quite large and possibly too large to back up with your existing set up. An advantage is integrity and atomicity.
Store on the filesystem with a link in the database
I've come across such horrible disasters doing this, and it scares me that people keep suggesting it. Some of the disasters included:
- One privileged user who would rearrange the files and frequently break the links between the paths in the DB and where they now are (but somehow this became my fault).
- When moving from one server to another, the ownership of some of the files was lost as the SID for the old machine's administator account (what the old website was running on) was not part of the domain and so the copied files had ACLs that could not be resolved thus presenting users with the username/password/domain login prompt.
- Some of the paths ended up being longer than 256 characters from the
C:
all the way to the.doc
and not all versions of NT were able to deal with long paths.
Store in the filesystem but rename to a hash of the contents and store the hash on the database
The last place I worked at did this based on my explanation of the above scenarios did this. They thought it was a compromise between the organization's inability to obtain experience with large databases (anything larger than about 40G was ordained to be "too big"), the corporate inability to purchase large hard drives, and the inability to purchase a more modern back up solution, and the need to get away from risks #1 & #3 that I identified above.
My opinion is that storing in the DB as a blob is a better solution and more scalable in a multi-server scenario, especially with failover and availability concerns.
2
I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented a possible argument, not your point of view.
– Phil Lello
Apr 29 '11 at 16:48
2
I once had a issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as a binary for 10000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
– Ali Razeghi
Jan 29 '13 at 0:20
6
The whole "link breaking" is an application issue and not a database issue. The database is doing it's job (serving pure data) while the application isn't (serving mixed file types). The application should take have responsibility of serving files. By storing an abstract route path in the database that would work no matter where the file gets stored on the server internally (ala Symfony2 routing). This would abstract away the native paths, make the application more portable, maintainable and allow to switch to any kind of filesystem without breaking anything.
– Tek
Jul 31 '14 at 18:28
add a comment |
Store in the database with a blob
A disadvantage is that it makes your database files quite large and possibly too large to back up with your existing set up. An advantage is integrity and atomicity.
Store on the filesystem with a link in the database
I've come across such horrible disasters doing this, and it scares me that people keep suggesting it. Some of the disasters included:
- One privileged user who would rearrange the files and frequently break the links between the paths in the DB and where they now are (but somehow this became my fault).
- When moving from one server to another, the ownership of some of the files was lost as the SID for the old machine's administator account (what the old website was running on) was not part of the domain and so the copied files had ACLs that could not be resolved thus presenting users with the username/password/domain login prompt.
- Some of the paths ended up being longer than 256 characters from the
C:
all the way to the.doc
and not all versions of NT were able to deal with long paths.
Store in the filesystem but rename to a hash of the contents and store the hash on the database
The last place I worked at did this based on my explanation of the above scenarios did this. They thought it was a compromise between the organization's inability to obtain experience with large databases (anything larger than about 40G was ordained to be "too big"), the corporate inability to purchase large hard drives, and the inability to purchase a more modern back up solution, and the need to get away from risks #1 & #3 that I identified above.
My opinion is that storing in the DB as a blob is a better solution and more scalable in a multi-server scenario, especially with failover and availability concerns.
2
I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented a possible argument, not your point of view.
– Phil Lello
Apr 29 '11 at 16:48
2
I once had a issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as a binary for 10000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
– Ali Razeghi
Jan 29 '13 at 0:20
6
The whole "link breaking" is an application issue and not a database issue. The database is doing it's job (serving pure data) while the application isn't (serving mixed file types). The application should take have responsibility of serving files. By storing an abstract route path in the database that would work no matter where the file gets stored on the server internally (ala Symfony2 routing). This would abstract away the native paths, make the application more portable, maintainable and allow to switch to any kind of filesystem without breaking anything.
– Tek
Jul 31 '14 at 18:28
add a comment |
Store in the database with a blob
A disadvantage is that it makes your database files quite large and possibly too large to back up with your existing set up. An advantage is integrity and atomicity.
Store on the filesystem with a link in the database
I've come across such horrible disasters doing this, and it scares me that people keep suggesting it. Some of the disasters included:
- One privileged user who would rearrange the files and frequently break the links between the paths in the DB and where they now are (but somehow this became my fault).
- When moving from one server to another, the ownership of some of the files was lost as the SID for the old machine's administator account (what the old website was running on) was not part of the domain and so the copied files had ACLs that could not be resolved thus presenting users with the username/password/domain login prompt.
- Some of the paths ended up being longer than 256 characters from the
C:
all the way to the.doc
and not all versions of NT were able to deal with long paths.
Store in the filesystem but rename to a hash of the contents and store the hash on the database
The last place I worked at did this based on my explanation of the above scenarios did this. They thought it was a compromise between the organization's inability to obtain experience with large databases (anything larger than about 40G was ordained to be "too big"), the corporate inability to purchase large hard drives, and the inability to purchase a more modern back up solution, and the need to get away from risks #1 & #3 that I identified above.
My opinion is that storing in the DB as a blob is a better solution and more scalable in a multi-server scenario, especially with failover and availability concerns.
Store in the database with a blob
A disadvantage is that it makes your database files quite large and possibly too large to back up with your existing set up. An advantage is integrity and atomicity.
Store on the filesystem with a link in the database
I've come across such horrible disasters doing this, and it scares me that people keep suggesting it. Some of the disasters included:
- One privileged user who would rearrange the files and frequently break the links between the paths in the DB and where they now are (but somehow this became my fault).
- When moving from one server to another, the ownership of some of the files was lost as the SID for the old machine's administator account (what the old website was running on) was not part of the domain and so the copied files had ACLs that could not be resolved thus presenting users with the username/password/domain login prompt.
- Some of the paths ended up being longer than 256 characters from the
C:
all the way to the.doc
and not all versions of NT were able to deal with long paths.
Store in the filesystem but rename to a hash of the contents and store the hash on the database
The last place I worked at did this based on my explanation of the above scenarios did this. They thought it was a compromise between the organization's inability to obtain experience with large databases (anything larger than about 40G was ordained to be "too big"), the corporate inability to purchase large hard drives, and the inability to purchase a more modern back up solution, and the need to get away from risks #1 & #3 that I identified above.
My opinion is that storing in the DB as a blob is a better solution and more scalable in a multi-server scenario, especially with failover and availability concerns.
edited Jan 28 '13 at 21:48
eykanal
152117
152117
answered Apr 29 '11 at 14:13
TangurenaTangurena
1,46611316
1,46611316
2
I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented a possible argument, not your point of view.
– Phil Lello
Apr 29 '11 at 16:48
2
I once had a issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as a binary for 10000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
– Ali Razeghi
Jan 29 '13 at 0:20
6
The whole "link breaking" is an application issue and not a database issue. The database is doing it's job (serving pure data) while the application isn't (serving mixed file types). The application should take have responsibility of serving files. By storing an abstract route path in the database that would work no matter where the file gets stored on the server internally (ala Symfony2 routing). This would abstract away the native paths, make the application more portable, maintainable and allow to switch to any kind of filesystem without breaking anything.
– Tek
Jul 31 '14 at 18:28
add a comment |
2
I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented a possible argument, not your point of view.
– Phil Lello
Apr 29 '11 at 16:48
2
I once had a issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as a binary for 10000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
– Ali Razeghi
Jan 29 '13 at 0:20
6
The whole "link breaking" is an application issue and not a database issue. The database is doing it's job (serving pure data) while the application isn't (serving mixed file types). The application should take have responsibility of serving files. By storing an abstract route path in the database that would work no matter where the file gets stored on the server internally (ala Symfony2 routing). This would abstract away the native paths, make the application more portable, maintainable and allow to switch to any kind of filesystem without breaking anything.
– Tek
Jul 31 '14 at 18:28
2
2
I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented a possible argument, not your point of view.
– Phil Lello
Apr 29 '11 at 16:48
I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented a possible argument, not your point of view.
– Phil Lello
Apr 29 '11 at 16:48
2
2
I once had a issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as a binary for 10000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
– Ali Razeghi
Jan 29 '13 at 0:20
I once had a issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as a binary for 10000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
– Ali Razeghi
Jan 29 '13 at 0:20
6
6
The whole "link breaking" is an application issue and not a database issue. The database is doing it's job (serving pure data) while the application isn't (serving mixed file types). The application should take have responsibility of serving files. By storing an abstract route path in the database that would work no matter where the file gets stored on the server internally (ala Symfony2 routing). This would abstract away the native paths, make the application more portable, maintainable and allow to switch to any kind of filesystem without breaking anything.
– Tek
Jul 31 '14 at 18:28
The whole "link breaking" is an application issue and not a database issue. The database is doing it's job (serving pure data) while the application isn't (serving mixed file types). The application should take have responsibility of serving files. By storing an abstract route path in the database that would work no matter where the file gets stored on the server internally (ala Symfony2 routing). This would abstract away the native paths, make the application more portable, maintainable and allow to switch to any kind of filesystem without breaking anything.
– Tek
Jul 31 '14 at 18:28
add a comment |
Number 1 for complete data integrity. Use the other options if you don't care about data quality. It's that simple.
Most RDBMS have optimizations for storing BLOBs (eg SQL Server filestream) anyway
what is it about (3) specifically that puts data integrity at risk? (assuming you get your transactional API right)
– Jack Douglas♦
Apr 29 '11 at 13:05
4
@JackPDouglas: you have hash which is the not the correct data and still has an external dependency for dats integrity
– gbn
Apr 29 '11 at 14:02
6
@JackPDouglas There's also the possibility that the server admin and DBA are different teams, with the associated risk that files get deleted in error, or not backed-up as they're thought of as temporary files.
– Phil Lello
Apr 29 '11 at 16:23
2
I'm interested in what theFileStream
equivalents on other platforms are called.
– Ronnie Overby
Jul 7 '15 at 19:26
add a comment |
Number 1 for complete data integrity. Use the other options if you don't care about data quality. It's that simple.
Most RDBMS have optimizations for storing BLOBs (eg SQL Server filestream) anyway
what is it about (3) specifically that puts data integrity at risk? (assuming you get your transactional API right)
– Jack Douglas♦
Apr 29 '11 at 13:05
4
@JackPDouglas: you have hash which is the not the correct data and still has an external dependency for dats integrity
– gbn
Apr 29 '11 at 14:02
6
@JackPDouglas There's also the possibility that the server admin and DBA are different teams, with the associated risk that files get deleted in error, or not backed-up as they're thought of as temporary files.
– Phil Lello
Apr 29 '11 at 16:23
2
I'm interested in what theFileStream
equivalents on other platforms are called.
– Ronnie Overby
Jul 7 '15 at 19:26
add a comment |
Number 1 for complete data integrity. Use the other options if you don't care about data quality. It's that simple.
Most RDBMS have optimizations for storing BLOBs (eg SQL Server filestream) anyway
Number 1 for complete data integrity. Use the other options if you don't care about data quality. It's that simple.
Most RDBMS have optimizations for storing BLOBs (eg SQL Server filestream) anyway
answered Apr 29 '11 at 12:48
gbngbn
63.5k7134211
63.5k7134211
what is it about (3) specifically that puts data integrity at risk? (assuming you get your transactional API right)
– Jack Douglas♦
Apr 29 '11 at 13:05
4
@JackPDouglas: you have hash which is the not the correct data and still has an external dependency for dats integrity
– gbn
Apr 29 '11 at 14:02
6
@JackPDouglas There's also the possibility that the server admin and DBA are different teams, with the associated risk that files get deleted in error, or not backed-up as they're thought of as temporary files.
– Phil Lello
Apr 29 '11 at 16:23
2
I'm interested in what theFileStream
equivalents on other platforms are called.
– Ronnie Overby
Jul 7 '15 at 19:26
add a comment |
what is it about (3) specifically that puts data integrity at risk? (assuming you get your transactional API right)
– Jack Douglas♦
Apr 29 '11 at 13:05
4
@JackPDouglas: you have hash which is the not the correct data and still has an external dependency for dats integrity
– gbn
Apr 29 '11 at 14:02
6
@JackPDouglas There's also the possibility that the server admin and DBA are different teams, with the associated risk that files get deleted in error, or not backed-up as they're thought of as temporary files.
– Phil Lello
Apr 29 '11 at 16:23
2
I'm interested in what theFileStream
equivalents on other platforms are called.
– Ronnie Overby
Jul 7 '15 at 19:26
what is it about (3) specifically that puts data integrity at risk? (assuming you get your transactional API right)
– Jack Douglas♦
Apr 29 '11 at 13:05
what is it about (3) specifically that puts data integrity at risk? (assuming you get your transactional API right)
– Jack Douglas♦
Apr 29 '11 at 13:05
4
4
@JackPDouglas: you have hash which is the not the correct data and still has an external dependency for dats integrity
– gbn
Apr 29 '11 at 14:02
@JackPDouglas: you have hash which is the not the correct data and still has an external dependency for dats integrity
– gbn
Apr 29 '11 at 14:02
6
6
@JackPDouglas There's also the possibility that the server admin and DBA are different teams, with the associated risk that files get deleted in error, or not backed-up as they're thought of as temporary files.
– Phil Lello
Apr 29 '11 at 16:23
@JackPDouglas There's also the possibility that the server admin and DBA are different teams, with the associated risk that files get deleted in error, or not backed-up as they're thought of as temporary files.
– Phil Lello
Apr 29 '11 at 16:23
2
2
I'm interested in what the
FileStream
equivalents on other platforms are called.– Ronnie Overby
Jul 7 '15 at 19:26
I'm interested in what the
FileStream
equivalents on other platforms are called.– Ronnie Overby
Jul 7 '15 at 19:26
add a comment |
If going for Oracle, take a look at DBFS and SecureFiles.
SecureFiles says it all: keep ALL your data safe in the database. The data is organized in LOBs; SecureFiles is a modernized version of LOBs that has to be activated.
DBFS is a filesystem in the database. You can mount it on a Linux host, much like a network filesystem. It is really powerful (see blog) and has a lot of options to tune to your specific needs. Being a DBA, and given such a filesystem (based in the database, mounted on Linux), I created an Oracle database on it without any problems (a database, stored in a ... database). Not that this would be very useful, but it does show the power.
More advantages: availability, backup and recovery, all read-consistent with the other relational data.
Sometimes size is given as a reason not to store documents in the database. That data probably has to be backed up anyway, so size alone is not a good reason to keep it out of the database. Especially where old documents can be considered read-only, it is easy to make big parts of the database read-only; those parts then no longer need frequent backups.
A reference in a table to something outside the database is unsafe. It can be manipulated, is hard to check, and can easily get lost. And what about transactions? The database offers solutions for all of these issues. With Oracle DBFS you can give your docs to non-database applications and they won't even know they are poking around in a database.
A last, big surprise: the performance of a DBFS filesystem is often better than a regular filesystem, especially if the files are larger than a few blocks.
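As a rough sketch of what a SecureFiles LOB column looks like (the table, column, and tablespace names here are invented, the identity column assumes Oracle 12c or later, and the deduplication/compression options assume the relevant Oracle options are licensed and enabled):

-- Hypothetical document table: the BLOB is stored as a SecureFiles LOB
-- in its own tablespace, with optional deduplication and compression.
CREATE TABLE app_documents (
    doc_id    NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    file_name VARCHAR2(255) NOT NULL,
    content   BLOB
)
LOB (content) STORE AS SECUREFILE (
    TABLESPACE docs_ts
    DEDUPLICATE
    COMPRESS MEDIUM
);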
edited Dec 21 '15 at 9:16
answered Apr 29 '11 at 13:10
ik_zelf
9
+1 for stating a good rule "A reference in a table to something outside the database is unsafe."
– Niklas Rosencrantz
Sep 24 '11 at 12:23
I think the right answer here depends a lot on your application, and how important those documents are.
For a document management system, or a system where recoverability of the stored documents is critical (so most things financial, HR or CRM related), storing documents inline or using your favourite DB vendor's proprietary document tech seems like the Right Thing To Do.
However, there are many applications where I believe that the opposite decision is appropriate.
Helpdesk systems and wiki-type systems are ones where I think it makes a lot of sense to keep the data out of the database. I believe some, like Jira, actually provide an option to choose whether you want to store documents inline or not.
For a medium-sized business, storing documents for a ticketing system inline can mean the difference between a compressed backup measured in megabytes and one measured in gigabytes.
I would personally prefer to bring a ticketing system back online in a few minutes and wrestle with the (generally less important) documents for a few hours, rather than increase my "it's broken and the CTO is breathing down my neck" RTO by having to restore and replay logs from a much larger backup.
There are other benefits of keeping documents separate.
- You can easily run separate processes that catalog document metadata, perform virus scanning, perform keyword indexing, etc.
- You can take advantage of tools to assist with backups or recovery - rsync, storage snapshots, etc. - which lend themselves much better to files than databases
- You can actually use storage that supports compression or deduplication (the stuff that your SAN admins have been blubbering about for years, aka the bane of database administrators worldwide)
- For an installation across multiple sites, you can complement a centralised database with a distributed file system
I think a hybrid combination of #2 and #3 might be clever. Keep the original filenames, but calculate and store a hash/checksum of the document, so that you have some reference point that will assist recovery in case someone moves or renames the file.
Storing the files with their original filenames means that applications can literally yank them straight from a file system and send them over the wire, or in a thick client world, maybe even point the user directly to the file server.
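A minimal sketch of that hybrid of #2 and #3, assuming the application computes a SHA-256 of the file before inserting the row (table and column names are invented; the syntax below is PostgreSQL-flavoured):

-- Files stay on disk under their original names; the row records where the
-- file lives and what its contents should hash to, so a moved or renamed
-- file can still be matched by checksum later.
CREATE TABLE document (
    doc_id        bigserial PRIMARY KEY,
    original_name text NOT NULL,
    storage_path  text NOT NULL,                 -- location on the file server
    sha256_hex    char(64) NOT NULL,             -- checksum computed by the app
    uploaded_at   timestamptz NOT NULL DEFAULT now()
);

-- The metadata is inserted in the same transaction as the rest of the data:
INSERT INTO document (original_name, storage_path, sha256_hex)
VALUES ('invoice-2013-04.pdf',
        '/srv/files/2013/04/invoice-2013-04.pdf',
        'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855');  -- placeholder checksum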
answered May 8 '13 at 14:12
Nathan Jolly
I want to add my experience here as to the tradeoffs. In PostgreSQL, at least, the performance impact is quite minimal on the db server side. Large values are stored out of line in separate relations (TOAST tables), not in the main heap, so they stay out of the way of operations that scan large numbers of records. Other dbs may do something similar.
The major advantage is the ability to keep all related data in one place for atomicity and backup purposes. This greatly reduces the chance of something going wrong.
The major disadvantage is not one I have seen covered above, and that's memory usage on the front end. I don't know exactly how every db handles this, so it may depend on the implementation, but for PostgreSQL the data comes in as an escaped ASCII string (possibly hexadecimal, possibly with inlined escapes). This then has to be converted back to binary in the front end. Many frameworks I have seen for doing this involve passing the value (not a reference) and then constructing a new binary string based on it. I calculated that doing this in Perl ended up using many times the memory of the original binary.
Verdict: if the files are only occasionally accessed, I would store them in the db. If they are frequently and repeatedly accessed, at least with PostgreSQL, I think the costs outweigh the benefits.
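For comparison, a sketch of the large-object route, which lets a client read the data in slices instead of pulling one huge escaped string through a bytea column (table name and file path are invented; server-side lo_import needs superuser or equivalent file-read privileges, and lo_get is available from PostgreSQL 9.4):

-- A row references a large object by OID instead of embedding the bytes.
CREATE TABLE attachment (
    id   serial PRIMARY KEY,
    name text NOT NULL,
    blob oid  NOT NULL
);

-- Server-side import of a file into pg_largeobject.
INSERT INTO attachment (name, blob)
VALUES ('scan.png', lo_import('/tmp/scan.png'));

-- Read a 64 KB slice without materialising the whole object on the client.
SELECT lo_get(blob, 0, 65536) FROM attachment WHERE id = 1;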
answered Feb 18 '13 at 8:17
Chris Travers
No (large) files should be transferred in strings. This is really bad for memory usage and eventually causes OOM. Postgres streaming should be used.
– Anssi
Jan 4 '18 at 9:18
I would be curious to know if this is still your opinion. ;) BTW I tried to answer it too, with a lot more specifics about Pg.
– Evan Carroll
4 hours ago
Don't do it.
There really isn't an upside to having files stored in the database.
Doesn't it already feel weird and fishy when you think to yourself:
Should I store files in a database or a filesystem?
Even better, say it out loud.
On to the facts:
Using the database
"PROS"... but not quite:
- "Atomicity" which is correct but it's a double edged sword. Because it drags
cons along with it. - Integrity. Same as above.
I really don't want to be biased but I don't think there's more to add. The pros aren't really that great if you think about it.
If I forgot something comment below, in the meanwhile keep reading below.
CONS:
- Wrong tool for the job
- Harder to maintain
- Slow
- Forget about storing hundreds of MB/gigabytes of data PER user.
- Backing up rapidly growing sites will be a nightmare.
- Restoring/moving will also suck.
Using the filesystem
PROS:
- Way easier to maintain
- Fast
- Database backups don't have anything to do with this
- Arguably more portability*
CONS:
- None*
*Fine print
Right now you're asking yourself: hold on, you mean there are no cons?! How come?
The biggest mistake here is that people are trying to drive a screw with a hammer.
The main reason, and I'd go as far as to say the only reason, this is being asked is file links.
This is a problem the database isn't meant to solve. It even sounds silly if you think about it:
"The database will fix my file linking problems."
In reality, the application should be in charge of handling and serving links.
A Solution:
- Make your application handle URL requests with custom routes.
- Save this route to your database.
- Internally every time this route is called map it to the file you want.
- If you ever move your files elsewhere just change the filename value of the route and that route will always serve the same file no matter where it's stored or referenced across the web.
This would also abstract away the native paths, make the application more portable, maintainable and allow to switch to any kind of filesystem without breaking anything.
How to implement this is beyond the scope of this answer, but you can take a look at a general example in arguably the most widely used web language (PHP):
https://github.com/symfony/Routing
https://github.com/kriswallsmith/assetic
Both of these together are really powerful.
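A bare-bones sketch of the route mapping that solution describes, independent of language or framework (table and column names are invented; the application resolves the route and streams the file itself):

-- Each public route maps to whatever file currently backs it; moving the
-- file only means updating file_path, the URL never changes.
CREATE TABLE asset_route (
    route     text PRIMARY KEY,   -- e.g. '/downloads/report-2014'
    file_path text NOT NULL,      -- current location on whatever storage backend
    mime_type text NOT NULL DEFAULT 'application/octet-stream'
);

INSERT INTO asset_route (route, file_path, mime_type)
VALUES ('/downloads/report-2014', '/srv/files/reports/2014.pdf', 'application/pdf');

-- Files later move to a mounted object store: one UPDATE, no broken links.
UPDATE asset_route
SET file_path = '/mnt/s3/reports/2014.pdf'
WHERE route = '/downloads/report-2014';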
edited Aug 1 '14 at 1:22
answered Jul 31 '14 at 19:32
Tek
You might be interested in this: research.microsoft.com/apps/pubs/default.aspx?id=64525, research by Microsoft that shows that storing blobs in the database is actually faster than in the file system (for some sizes of blobs at least). This is in line with my tests, which showed that for medium-sized blobs (< ~1MB) e.g. Postgres is also faster than a filesystem. For Oracle it's about the same performance, but I haven't tested the new SecureFiles storage format yet (though they claim it's faster than the old storage format).
– a_horse_with_no_name
Aug 1 '14 at 5:56
I saw that, which is why I talked about large files. Plus the OP didn't specify a database vendor, so performance may differ between vendors, and thus my advice is more general.
– Tek
Aug 1 '14 at 13:37
The DB being faster than the FS for files sounds fishy to me. For 1MB, any overhead doesn't play a role. Probably they allowed the DB to cache but not the FS. Apples, pies.
– Torsten Bronger
Oct 9 '15 at 3:25
Back in the day, Microsoft hyped up the ability to store images (and similar blob data types) in the database. This was a cool new feature of SQL Server 2000 (I am pretty sure it was 2000, not 7.0) and many people jumped on the bandwagon.
Storing BLOBs in the database has advantages and disadvantages:
On one hand, all your data and related images or documents can be stored and accessed in one place. Application users do not require special network permissions, as it is SQL Server that serves up the images/files/documents.
On the other hand, your database can grow quite large, depending on the size and number of BLOBs you are storing. This affects backups, storage requirements, time-sensitive recovery operations, etc.
SQL Server 2008 introduced FILESTREAM. The database contains pointers to the files, and the files reside on the server's filesystem rather than inside the database, but when you back up the database the files are backed up as well.
Your backups can get quite large, but you don't end up with orphaned files/documents/blobs/images.
My personal preference has been to let the database store pointers/network locations, and let a file server handle the files. File servers are better optimized for such tasks anyway.
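As a rough T-SQL sketch of a FILESTREAM table (this assumes FILESTREAM is already enabled on the instance and that the database has a FILESTREAM filegroup, here called fs_group; all names are invented):

-- FILESTREAM columns require a ROWGUIDCOL column with a UNIQUE constraint.
CREATE TABLE dbo.Document
(
    DocumentId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    FileName   NVARCHAR(260) NOT NULL,
    Content    VARBINARY(MAX) FILESTREAM NULL
) FILESTREAM_ON fs_group;

-- The row is inserted like any other; SQL Server stores the bytes as a file
-- in the FILESTREAM container and includes them in database backups.
INSERT INTO dbo.Document (FileName, Content)
VALUES (N'logo.png', 0x89504E470D0A1A0A);   -- small placeholder payload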
edited yesterday by a_horse_with_no_name
answered Sep 26 '11 at 15:02
datagod
5
Never mind that if you don't own the server you're going to pay a heck of a lot more per MB for database space vs. file space. Also, having the file on disk makes it much easier to troubleshoot: how do you SELECT image FROM table in SSMS and validate that the right image is there?
– Aaron Bertrand♦
Sep 26 '11 at 15:07
My vote would be for neither. Store the data in a system like Amazon S3 or Microsoft's CDN and store that URL in the database.
This way you get the reliability of having the data always accessible, without having a monster-sized database to deal with.
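A sketch of that layout, where the bytes live in an external object store and the database only records where they are (the table name and example URL are invented):

CREATE TABLE media_asset (
    asset_id   bigserial PRIMARY KEY,
    object_url text NOT NULL   -- e.g. 'https://my-bucket.s3.amazonaws.com/avatars/42.png'
);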
answered Oct 19 '12 at 6:02
paullb
Although it partly depends on the application/environment (people included), I'd go for the blob.
Keeping everything in the database means replication works for file data. You'd need a separate mechanism to synchronise FS files.
In some applications, the filesystem shouldn't be modified anyway. For example, on a production website, I'd avoid ever using the filesystem for any non-disposable data (the site lives under SCM, the data in a database).
Assuming we've got multiple users/applications with separate permissions, then any filesystem storage provides an opportunity for differences in DB and FS access rights.
The refinement I'd consider making to BLOB storage is to chunk data if it makes sense; if you only need 512 bytes from a 20 MB BLOB, this sector-like access is a real boon, especially if you're dealing with remote clients (and again, a partial update creates much less replication traffic).
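One way to sketch that chunking idea (chunk size, table, and column names are all invented for illustration):

-- Each file is split into fixed-size chunks so a reader can fetch only the
-- chunks covering the byte range it needs, and partial updates replicate
-- only the chunks that changed.
CREATE TABLE blob_chunk (
    file_id  bigint  NOT NULL,
    chunk_no integer NOT NULL,    -- 0-based, each chunk up to 64 KB
    data     bytea   NOT NULL,
    PRIMARY KEY (file_id, chunk_no)
);

-- Fetch just the chunk containing byte offset 1,000,000 of file 42.
SELECT data
FROM blob_chunk
WHERE file_id = 42
  AND chunk_no = 1000000 / 65536;   -- integer division: chunk 15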
answered Apr 29 '11 at 16:48
Phil Lello
For Postgres:
It's actually straightforward. There is a BYTEA type that can be used for storing binary strings. By default there are no built-in utilities like the ones mentioned for MS or Oracle, so storing lots of big files and retrieving them can get tedious. You also need to do the conversion of the files within the application (for example with a ByteStream or similar; no idea, though, how this works with the specific MS/Oracle file<->database solutions). There is also the large object (lo) facility, which helps with managing BLOBs, since the internal management of these types may not keep track of the references.
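A minimal BYTEA sketch (table name invented; the hex literal stands in for bytes the application would normally pass as a bound parameter):

CREATE TABLE thumbnail (
    id    serial PRIMARY KEY,
    name  text  NOT NULL,
    image bytea NOT NULL
);

-- Shown as a hex literal here purely for illustration.
INSERT INTO thumbnail (name, image)
VALUES ('pixel.png', decode('89504e470d0a1a0a', 'hex'));

-- octet_length() is a cheap sanity check without pulling the bytes back out.
SELECT name, octet_length(image) AS bytes FROM thumbnail;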
answered Apr 30 '11 at 8:30
DrColossos
This isn't at all straightforward. The bytea type has no bearing on whether or not you should be storing that in a database.
– Evan Carroll
2 days ago
Do not store files in a database.
Everyone, without exception, that can run any RDBMS on the market already has a database specifically for storing files, and the RDBMS itself is using it! That database is the filesystem. Now let's talk about some of the potential drawbacks of storing files in the database, as well as some specific mitigating factors for storing files in the database.
No filehandles to files in the database. What does this mean?
Programmer-talk: you cannot seek (fseek), there is no way to manage the resource with asynchronous access (asyncio or epoll), and there is no sendfile (saving you the copy from kernel space).
Practical application: Want to send a video or picture to a client over HTTP2/3? If it's in the database, then you'll first have to query it. For whatever query returns that file, you'll have to wait for the entire query to conclude before that file can move to the next step. In a production install with the RDBMS on a different server than the web server, you'll first have to transfer the file entirely from the RDBMS to the web server rather than streaming it through. However, if the transportation layer provided a filesystem abstraction (which even NFS supports), you could seek halfway through the file and immediately start streaming it back to the client without buffering any more of the file than necessary. This is routinely done by nginx, Apache, PureFTPd, and ProFTPD.
Double copy on the RDBMS. By the very fact that it's in the database, you'll likely be writing it twice. Once in a write-ahead log (WAL), and then again into the tablespace.
No updates, ever. MVCC means nothing gets updated in place; a row is only copied anew with modifications, and the old row gets marked as expired (deleted). Any update to the file will require writing the whole row again, not just the file. Filesystems can provide this too, with data journaling, but you rarely need that.
File reads and transfers slow down the query. If the file itself is stored on a row which you need to query, the whole row will either have to wait for the file to be transferred, or you'll have to issue two separate queries.
Memory use on the DB client. The DB client (libpq, JDBC, ODBC, FreeTDS, etc.) will likely buffer the query in memory. When that in-memory buffer is exhausted it may start a disk buffer, or even worse, fall back to the kernel to be paged to disk.
Query throttling. Many databases provide the ability to kill and reap queries when they take too much time or too many resources. Keep in mind that file transfers will not, in any implementation, be itemized. Did that query get killed after 3 seconds? Or did it take 1 second and the backend spent 2 seconds transferring a file? And beyond "itemized": how are you going to effectively state how much time a query should take when 99.9% of queries return 1 KB and the other one returns 1 GB?
No copy-on-write or deduplication. XFS and Btrfs support copy-on-write and deduplication transparently. This means that having the same picture everywhere, or needing a second copy of it, can be handled transparently by the filesystem. However, if the file is not standing by itself, and is instead on a row or in a store, the filesystem is likely unable to dedupe it.
Integrity. A lot of people here are talking about integrity. What do you think is better at detecting filesystem corruption, an application that uses the filesystem or the filesystem's core utilities? Store a file in a row, or out of line, and any filesystem corruption will be obscured by the database. xfs_repair is damn good at recovering when you have filesystem or hard drive corruption, and if it fails it'll still be a lot easier to do data forensics.
Cloud migration. If you ever want to store the files on a SAN or in the cloud, you'll have all the more difficulty because now that storage migration is a database migration. If your files are, for example, stored on the filesystem, you can fairly easily move them to S3 (and with something like s3fs it can be transparent).
Exceptions
Storing files in the database has a few valid use cases:
- When you need to edit the file transactionally - that is, it is literally part of your transaction to edit the file, or you need the ability to roll back edits to the file if the transaction fails because of data-integrity issues in the relations (tables).
- When you need to ensure the file system is precisely versioned with the data and you can't afford any risk in keeping them in sync.
- When the database can actually parse the file and you can query it. In PostgreSQL, for example, geometries and topologies can be queried with PostGIS (see the sketch after this list). At this point, while it's a file, it is also data for the query and not just a storage dump.
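As a rough illustration of that last case (a minimal sketch; the table, column, and coordinates are made up), once the stored value is a geometry the database can answer questions about its contents rather than just hand the bytes back:

```python
import psycopg2  # assumes a PostgreSQL database with the PostGIS extension installed

conn = psycopg2.connect("dbname=gis")  # assumed connection string
with conn, conn.cursor() as cur:
    # The stored value is not an opaque blob: PostGIS can compute over it in SQL.
    cur.execute(
        """
        SELECT name, ST_Area(boundary::geography) AS area_m2
        FROM parcels
        WHERE ST_Contains(boundary, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
        """,
        (-71.06, 42.36),
    )
    print(cur.fetchall())
```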
Mitigations
Some databases have a notion of an "externally managed resource", where the database either manages the file privately on disk or hands you a reference to it:
- PostgreSQL, through the Large Object infrastructure, provides a filehandle to a resource for the duration of the transaction (a sketch of this follows at the end of this section).
- SQL Server 2017's FILESTREAM infrastructure provides temporary access that lasts for the duration of the transaction, which you can use to get a file path and open a file handle to it.
- Oracle provides BFILE (this has nothing to do with their internal LOB storage, which is called SecureFile).
Some databases store large binary objects out of line, or can be configured to, like Oracle SecureFile. This permits you to update the row without rewriting the file.
Some databases, like Oracle, do their MVCC without a WAL and do not have to double-write the file.
Some databases, like SQL Server and Oracle, provide the ability to "stream" data from the file without ever having a file handle to it. This may or may not run on a different connection than the database query. But the key here is that while you can stream the file (in theory), I can't find any evidence of any product not made by the vendor that uses that feature. For example, where is the NGINX/Apache bridge that lets you do this?
Oracle provides optional deduplication, compression, and encryption through Internal-LOB storage (like SecureFile).
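To illustrate the PostgreSQL Large Object item above: the API does hand you a seekable handle, but only for the duration of a transaction. A minimal psycopg2 sketch, where the documents table and its file_oid column are assumptions:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection string

with conn:  # large objects are only accessible inside a transaction
    with conn.cursor() as cur:
        cur.execute("SELECT file_oid FROM documents WHERE id = %s", (42,))
        (oid,) = cur.fetchone()

    lob = conn.lobject(oid, mode="rb")  # seekable handle, valid until commit/rollback
    lob.seek(1024 * 1024)               # jump to the 1 MiB mark without reading the prefix
    chunk = lob.read(64 * 1024)
    lob.close()
```

That seekability is exactly what a plain bytea column does not give you, and the handle dies with the transaction.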
Conclusion
The worst-case scenario when you put a file in the database is very bad for performance and for compatibility with tooling, and it is always exceptionally implementation-dependent. In no way is the database better at being a filesystem than the filesystem. In every way it's a compromise, and even when you get powerful mitigating features (as in the case of SecureFile), the tooling is so poor that it's really not much more than a marketing point unless your whole stack is built by the RDBMS provider.
Keep it simple: as a general rule, keep the files out of the DB.
Solution
How should you store files, or abstract a filesystem, in a fashion that works effectively for multiple tenants and users? I am partial to hashing the file contents. This is pretty common these days and works well; a sketch follows.
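A minimal sketch of that hashing approach (paths and names are illustrative, not prescriptive): write the file under the hash of its contents first, and only then reference the hash from the database row, so a failed transaction leaves nothing worse than an unreferenced file that can be garbage-collected later.

```python
import hashlib
import shutil
from pathlib import Path

STORE = Path("/var/lib/app/files")  # assumed root of the content-addressed store


def add_to_store(src: Path) -> str:
    # Hash the contents in chunks so large files don't have to fit in memory.
    h = hashlib.sha256()
    with src.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()

    dest = STORE / digest[:2] / digest  # fan out by prefix to keep directories small
    if not dest.exists():               # identical contents dedupe to a single file
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dest)
    return digest  # store this hash in the row only after the file is safely on disk
```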
edited 24 mins ago
answered 2 days ago
Evan Carroll
Sharing my experience with MS SQL Server and a huge number of files: we save the files on a file server. The database has two tables, one for the file folders and access credentials, and one for the filenames. It is easy to maintain the database and the files, and you can easily move the files across servers - you only need to modify the folders table.
answered Nov 22 '13 at 5:27
Feng
protected by Paul White♦ Jul 31 '14 at 21:52