Should binary files be stored in the database?
What is the best place for storing binary files that are related to data in your database? Should you:




  1. Store in the database with a blob

  2. Store on the filesystem with a link in the database

  3. Store in the filesystem but rename to a hash of the contents and store the hash on the database

  4. Something I've not thought of


The advantages of (1) are (among others) that atomicity of transactions is preserved. The cost is that you might dramatically increase storage (and associated streaming/backup) requirements.



The goal of (3) is to preserve atomicity to some degree, if you can enforce that the filesystem you are writing to does not allow files to be changed or deleted, and always has the correct hash as the filename. The idea would be to write the file to the filesystem before permitting the insert/update referencing the hash. If this transaction fails after the filesystem write but before the database DML, that is fine, because the filesystem is 'faking' being the repository of all possible files and hashes: it doesn't matter if there are some files in there that are not being pointed to (and you could clean them up periodically if you are careful).
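
A minimal sketch of that write-then-reference flow in Python (the blob-store path, table, and column names are hypothetical, purely for illustration):

    # Sketch of option (3): put the file in a write-once store under its
    # content hash, then reference the hash in the database DML. If the
    # transaction fails afterwards, the orphaned file is harmless and can
    # be garbage-collected later.
    import hashlib
    import os
    import shutil

    STORE = "/srv/blobstore"  # assumed write-once directory

    def attach_file(conn, row_id, src_path):
        with open(src_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        dest = os.path.join(STORE, digest)
        if not os.path.exists(dest):  # content-addressed: never overwrite
            shutil.copy(src_path, dest)
        # Only after the file is safely on disk do we run the DML.
        with conn.cursor() as cur:  # e.g. a psycopg2 connection
            cur.execute(
                "UPDATE documents SET file_hash = %s WHERE id = %s",
                (digest, row_id),
            )
        conn.commit()  # a failure here leaves only a harmless orphan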



EDIT:



It looks like some RDBMSs have this covered in their individual ways. I'd be interested to know how others do it, and particularly in a solution for Postgres.

database-design blob

asked Apr 29 '11 at 12:02 by Jack Douglas

  • 7
    This question has a duplicate here: Is it better to store images in a blob or just the url? It was closed in favor of this one, as this one is the more prominent. Please be sure to read both questions for more insight!
    – Marian
    Apr 29 '11 at 18:51

  • @Marian - thanks for the link; the answers/comments there are indeed helpful (and not just relevant to images, as I first thought).
    – Jack Douglas
    Apr 29 '11 at 20:19

  • 1
    @gbn I've come to think it is not such a good question, because the 'right' answer might depend on the RDBMS - e.g. the lack of proper incremental backup in Postgres is an issue for me if I decide to keep files in the db. If there were one outstanding answer I'd still accept it, but I'm not sure that would add any value as things stand...
    – Jack Douglas
    Sep 5 '11 at 7:54

  • 1
    The modern answer to the "something you've not thought of" is technologies like Amazon S3. Would love to see some thorough answers address that.
    – jpmc26
    Jul 30 '18 at 17:51
12 Answers
49
















  1. Store in the database with a blob



    A disadvantage is that it makes your database files quite large and possibly too large to back up with your existing setup. An advantage is integrity and atomicity.




  2. Store on the filesystem with a link in the database



    I've come across such horrible disasters doing this, and it scares me that people keep suggesting it. Some of the disasters included:




    • One privileged user who would rearrange the files and frequently break the links between the paths in the DB and where the files actually were (but somehow this became my fault).

    • When moving from one server to another, the ownership of some of the files was lost because the SID for the old machine's administrator account (which the old website was running as) was not part of the domain, so the copied files had ACLs that could not be resolved, presenting users with the username/password/domain login prompt.

    • Some of the paths ended up being longer than 256 characters, from the C: all the way to the .doc, and not all versions of NT were able to deal with long paths.




  3. Store in the filesystem but rename to a hash of the contents and store the hash on the database



    The last place I worked at did this, based on my explanation of the above scenarios. They thought it was a compromise between the organization's inability to obtain experience with large databases (anything larger than about 40G was deemed "too big"), the corporate inability to purchase large hard drives, the inability to purchase a more modern backup solution, and the need to get away from risks #1 & #3 that I identified above.




My opinion is that storing in the DB as a blob is a better solution and more scalable in a multi-server scenario, especially with failover and availability concerns.

answered Apr 29 '11 at 14:13 by Tangurena (edited Jan 28 '13 by eykanal)

  • 2




    I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented as a possible argument, not your point of view.
    – Phil Lello
    Apr 29 '11 at 16:48






  • 2




    I once had an issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as binary for 10,000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
    – Ali Razeghi
    Jan 29 '13 at 0:20








  • 6




    The whole "link breaking" is an application issue and not a database issue. The database is doing its job (serving pure data) while the application isn't (serving mixed file types). The application should have responsibility for serving files, by storing an abstract route path in the database that would work no matter where the file gets stored on the server internally (à la Symfony2 routing). This would abstract away the native paths, make the application more portable and maintainable, and allow switching to any kind of filesystem without breaking anything.
    – Tek
    Jul 31 '14 at 18:28
27














Number 1 for complete data integrity. Use the other options if you don't care about data quality. It's that simple.



Most RDBMSs have optimizations for storing BLOBs (e.g. SQL Server FILESTREAM) anyway.

answered Apr 29 '11 at 12:48 by gbn

  • what is it about (3) specifically that puts data integrity at risk? (assuming you get your transactional API right)
    – Jack Douglas
    Apr 29 '11 at 13:05








  • 4




    @JackPDouglas: you have a hash, which is not the actual data, and you still have an external dependency for data integrity
    – gbn
    Apr 29 '11 at 14:02






  • 6




    @JackPDouglas There's also the possibility that the server admin and DBA are different teams, with the associated risk that files get deleted in error, or not backed up as they're thought of as temporary files.
    – Phil Lello
    Apr 29 '11 at 16:23






  • 2




    I'm interested in what the FileStream equivalents on other platforms are called.
    – Ronnie Overby
    Jul 7 '15 at 19:26
19














If going for Oracle, take a look at DBFS and SecureFiles.



SecureFiles says it all: keep ALL your data safe in the database. It is organized in LOBs. SecureFiles is a modernized version of LOBs that has to be activated.



DBFS is a filesystem in the database. You can mount it, much like a network filesystem, on a Linux host. It is really powerful. See blog. It also has a lot of options to tune to your specific needs. Being a DBA, given a filesystem (based in the database, mounted on Linux), I created an Oracle database on it without any problems (a database, stored in a... database). Not that this would be very useful, but it does show the power.



More advantages are: availability, backup, recovery, all read-consistent with the other relational data.



Sometimes size is given as a reason not to store documents in the database. That data probably has to be backed up anyway, so size alone is not a good reason not to store it in the database. Especially in a situation where old documents are considered read-only, it is easy to make big parts of the database read-only; in that case, those parts of the database no longer need frequent backups.



A reference in a table to something outside the database is unsafe. It can be manipulated, is hard to check, and can easily get lost. And how about transactions? The database offers solutions for all of these issues. With Oracle DBFS you can give your docs to non-database applications and they wouldn't even know they are poking in a database.



One last, big surprise: the performance of a DBFS filesystem is often better than that of a regular filesystem. This is especially true if the files are larger than a few blocks.
  • 9




    +1 for stating a good rule "A reference in a table to something outside the database is unsafe."
    – Niklas Rosencrantz
    Sep 24 '11 at 12:23
13














I think the right answer here depends a lot on your application, and how important those documents are.



For a document management system, or a system where recoverability of the stored documents is critical (so most things financial, HR or CRM related), storing documents inline or using your favourite DB vendor's proprietary document tech seems like the Right Thing To Do.



However, there are many applications where I believe that the opposite decision is appropriate.



Helpdesk systems and wiki-type systems are ones where I think it makes a lot of sense to keep the data out of the database. I believe some, like Jira, actually provide an option to choose whether you want to store documents inline or not.



For a medium sized business, storing documents for a ticketing system inline can mean the difference between a compressed backup measured in megabytes, and one measured in gigabytes.



I would personally prefer to bring a ticketing system back online in a few minutes and wrestle with the (generally less important) documents for a few hours, rather than increase my "it's broken and the CTO is breathing down my neck" RTO by having to restore and replay logs from a much larger backup.



There are other benefits of keeping documents separate.




  • You can easily run separate processes that catalog document metadata, perform virus scanning, perform keyword indexing, etc.

  • You can take advantage of tools to assist with backups or recovery - rsync, storage snapshots, etc. - which lend themselves much better to files than databases

  • You can actually use storage that supports compression or deduplication (the stuff that your SAN admins have been blubbering about for years, aka the bane of database administrators worldwide)

  • For an installation across multiple sites, you can complement a centralised database with a distributed file system


I think a hybrid combination of #2 and #3 might be clever: keep the original filenames, but calculate and store a hash/checksum of the document, so that you have some reference point that will assist recovery in case someone moves or renames a file.
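
A minimal sketch of that hybrid, assuming Python with a DB-API cursor and a hypothetical documents(id, path, checksum) table; recomputing checksums lets you re-point rows at files that were moved or renamed:

    # Sketch: verify/repair path links by matching stored checksums
    # against the files actually present under the file root.
    import hashlib
    import os

    def sha256_of(path, block=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(block), b""):
                h.update(chunk)
        return h.hexdigest()

    def relink_missing(cur, file_root):
        # Index every file currently on disk by its content hash.
        on_disk = {}
        for dirpath, _, names in os.walk(file_root):
            for name in names:
                p = os.path.join(dirpath, name)
                on_disk[sha256_of(p)] = p
        # Re-point rows whose stored path no longer exists.
        cur.execute("SELECT id, path, checksum FROM documents")
        for doc_id, path, checksum in cur.fetchall():
            if not os.path.exists(path) and checksum in on_disk:
                cur.execute("UPDATE documents SET path = %s WHERE id = %s",
                            (on_disk[checksum], doc_id))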



Storing the files with their original filenames means that applications can literally yank them straight from a file system and send them over the wire, or in a thick client world, maybe even point the user directly to the file server.
    9














    I want to add my experience here as to the tradeoffs. In PostgreSQL, at least, the performance impact is quite minimal in terms of the DB server. Large blobs are stored in separate files, not in the main heap tables, so as to move them out of the way of operations that may count large numbers of records. Other DBs may do something similar.



    The major advantage is the ability to keep all related data in one place for atomicity and backup purposes. This greatly reduces the chance of something going wrong.



    The major disadvantage is not one I have seen covered above, and that's memory usage on the front end. I don't know exactly how every DB handles this, so it may depend on the implementation, but for PostgreSQL the data comes in as an escaped ASCII string (possibly hexadecimal, possibly with inlined escapes). This then has to be converted back to binary in the front end. Many frameworks I have seen for doing this involve passing the value around (not as a reference) and then constructing a new binary string based on it. I calculated that using Perl to do this ended up using many times the memory of the original binary.



    Verdict: if the files are only occasionally accessed, I would store them in the DB. If they are frequently and repeatedly accessed, at least with PostgreSQL, I think the costs outweigh the benefits.
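
    For the frequently-accessed case, client memory can be kept flat by streaming. A minimal sketch using PostgreSQL large objects through psycopg2 (it assumes the large object's OID is stored in your table; names are illustrative):

        # Sketch: stream a PostgreSQL large object to disk in chunks
        # instead of materialising one huge escaped string on the client.
        import psycopg2

        def stream_large_object(conn, oid, out_path, chunk_size=64 * 1024):
            with conn:  # large-object access must run inside a transaction
                lobj = conn.lobject(oid, mode="rb")
                try:
                    with open(out_path, "wb") as out:
                        while True:
                            chunk = lobj.read(chunk_size)
                            if not chunk:
                                break
                            out.write(chunk)
                finally:
                    lobj.close()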
    • Large files should not be transferred as strings. This is really bad for memory usage and eventually causes OOM. Postgres streaming should be used.
      – Anssi
      Jan 4 '18 at 9:18










    • I would be curious to know if this is still your opinion. ;) BTW I tried to answer it too, with a lot more specifics about Pg.
      – Evan Carroll
      4 hours ago
    9














    Don't do it.



    There really isn't an upside to having files stored in the database.



    Doesn't it already feel weird and fishy when you think to yourself:




    Should I store files in a database or a filesystem?




    Even better, say it out loud.



    On to the facts:



    Using the database



    "PROS"... but not quite:




    • "Atomicity", which is correct, but it's a double-edged sword, because it drags cons along with it.

    • Integrity. Same as above.


    I really don't want to be biased but I don't think there's more to add. The pros aren't really that great if you think about it.



    If I forgot something, comment below; in the meantime, keep reading.



    CONS:




    • Wrong tool for the job

    • Harder to maintain

    • Slow

    • Forget about storing hundreds of MB/gigabytes of data PER user.

    • Backing up rapidly growing sites will be a nightmare.

    • Restoring/moving will also suck.


    Using the filesystem



    PROS:




    • Way easier to maintain

    • Fast

    • Database backups don't have anything to do with this

    • Arguably more portability*


    CONS:




    • None*


    *Fine print



    Right now you're asking yourself: hold on, you mean there are no cons?! How come?



    The biggest mistake here is that people are trying to drive a screw with a hammer.



    The main reason, and I'd go as far as to say the only reason, this is being asked is because of file links.



    This is a problem that the database isn't meant to solve. It even sounds silly if you think about it.




    "The database will fix my file linking problems."




    When in reality, the application should logically be in charge of handling and serving links.



    A Solution:




    1. Make your application handle URL requests with custom routes.

    2. Save this route to your database.

    3. Internally every time this route is called map it to the file you want.

    4. If you ever move your files elsewhere, just change the filename value of the route, and that route will always serve the same file no matter where it's stored or referenced across the web (see the sketch below).
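
    A minimal sketch of those four steps (hypothetical names; in a real application the mapping would live in a database table rather than a dict, and the web framework would do the URL dispatch):

        # Sketch: the database stores only an abstract route key; the
        # application maps that key to wherever the file currently lives.
        import shutil

        routes = {  # stand-in for a (route -> storage location) table
            "invoice/2011/0042": "/var/files/ab/cd/abcd1234.pdf",
        }

        def serve(route, out):
            """Resolve a stored route to its current file and stream it."""
            path = routes[route]            # look up the current location
            with open(path, "rb") as f:
                shutil.copyfileobj(f, out)  # stream without loading fully

        # Moving the files later only means updating the mapping; every
        # route stored in the database keeps working unchanged:
        routes["invoice/2011/0042"] = "/mnt/s3fs/ab/cd/abcd1234.pdf"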


    This would also abstract away the native paths, make the application more portable and maintainable, and allow switching to any kind of filesystem without breaking anything.



    How to implement this is beyond the scope of this answer, but you can take a look at a general example in arguably the most widely used web language (PHP):



    https://github.com/symfony/Routing



    https://github.com/kriswallsmith/assetic



    Both of these together are really powerful.
    • You might be interested in this: research.microsoft.com/apps/pubs/default.aspx?id=64525 - research by Microsoft that shows that storing blobs in the database is actually faster than in the file system (for some sizes of blobs, at least). This is in line with my tests, which showed that for medium-sized blobs (< ~1MB) Postgres, for example, is also faster than a filesystem. For Oracle it's about the same performance, but I haven't tested the new SecureFiles storage format yet (though they claim it's faster than the old storage format).
      – a_horse_with_no_name
      Aug 1 '14 at 5:56










    • I saw that, which is why I talked about large files. Plus OP didn't specify a database vendor so performance may differ between vendors and thus my advice is more general.
      – Tek
      Aug 1 '14 at 13:37












    • The DB being faster than the FS for files sounds fishy to me. For 1 MB, any overhead doesn't play a role. Probably they allowed the DB to cache but not the FS. Apples, pies.
      – Torsten Bronger
      Oct 9 '15 at 3:25
    7














    Back in the day, Microsoft hyped up the ability to store images (and similar blob data types) in the database. This was a cool new feature of SQL Server 2000 (I am pretty sure it was 2000, not 7.0) and many people jumped on the bandwagon.



    Storing BLOBS in the database has advantages and disadvantages:



    On one hand, all your data and related images or documents can be stored and accessed in one place. Application users do not require special network permissions, as it is SQL that is serving up the images/files/documents.



    On the other hand, your database can grow quite large, depending on the size and number of BLOBS you are storing. This affects backups, storage requirements, time sensitive recovery operations, etc.



    SQL Server 2008 introduced FILESTREAM. The database contains pointers to the files, and the files reside on the server, not in the database, but when you back up the database the files are backed up as well.



    Your backups can get quite large, but you don't end up with orphaned files/documents/blobs/images.



    My personal preference has been to let the database store pointers/network locations, and let a file server handle the files. File servers are better optimized for such tasks anyway.
    • 5




      Never mind that if you don't own the server you're going to pay a heck of a lot more per MB for database space vs. file space. Also having the file on disk makes it much easier to troubleshoot - how do you SELECT image FROM table in SSMS and validate that the right image is there?
      – Aaron Bertrand
      Sep 26 '11 at 15:07
    6














    My vote would be for neither. Store the data in a system like Amazon S3 or Microsoft's CDN and store that URL in the database.



    This way you get the reliability of having the data always accessible, without having monster-sized databases to deal with.
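
    A minimal sketch of that approach, assuming Python with boto3 and a hypothetical documents table (the bucket name is made up):

        # Sketch: upload the binary to S3, then store only its URL.
        import hashlib
        import boto3

        def store_file(conn, local_path, bucket="my-app-files"):
            with open(local_path, "rb") as f:
                key = hashlib.sha256(f.read()).hexdigest()  # stable object key
            boto3.client("s3").upload_file(local_path, bucket, key)
            url = f"https://{bucket}.s3.amazonaws.com/{key}"
            with conn.cursor() as cur:  # e.g. a psycopg2 connection
                cur.execute("INSERT INTO documents (url) VALUES (%s)", (url,))
            conn.commit()
            return url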



      5














      Although it partly depends on the application/environment (people included), I'd go for the blob.



      Keeping everything in the database means replication works for file data. You'd need a separate mechanism to synchronise FS files.



      In some applications, the filesystem shouldn't be modified anyway. For example, on a production website, I'd avoid ever using the filesystem for any non-disposable data (site lives under a SCM, data in a database).



      Assuming we've got multiple users/applications with separate permissions, then any filesystem storage provides an opportunity for differences in DB and FS access rights.



      The refinement I'd consider making to BLOB storage is to chunk the data if it makes sense; if you only need 512 bytes from a 20 MB BLOB, this sector-like access is a real boon, especially if you're dealing with remote clients (and again, a partial update creates much less replication traffic).
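
      A minimal sketch of the chunking idea (the blob_chunks table and chunk size are assumptions; Python with a DB-API cursor):

          # Sketch: store a blob as fixed-size rows so clients can fetch
          # an arbitrary byte range without pulling the whole object.
          CHUNK = 4096  # bytes per row; tune to your access pattern

          def write_chunks(cur, file_id, data):
              for seq, off in enumerate(range(0, len(data), CHUNK)):
                  cur.execute(
                      "INSERT INTO blob_chunks (file_id, seq, part) "
                      "VALUES (%s, %s, %s)",
                      (file_id, seq, data[off:off + CHUNK]),
                  )

          def read_range(cur, file_id, offset, length):
              first, last = offset // CHUNK, (offset + length - 1) // CHUNK
              cur.execute(
                  "SELECT part FROM blob_chunks WHERE file_id = %s "
                  "AND seq BETWEEN %s AND %s ORDER BY seq",
                  (file_id, first, last),
              )
              data = b"".join(bytes(row[0]) for row in cur.fetchall())
              start = offset - first * CHUNK
              return data[start:start + length]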
        3














        For Postgres:



        It's actually straightforward: there is a BYTEA type that can be used for storing binary strings. By default there are no built-in utilities like the ones mentioned for MS or Oracle, so storing lots of big files and retrieving them can get tedious. You also need to do the conversion of the files within the application (for example with a byte stream or similar; no idea, though, how this works with the specific MS/Oracle file<->database solutions). There is also a lo (large object) type that helps with the work of managing BLOBs, since some of the internal management of these types may not keep track of the references.
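
        A minimal sketch of BYTEA round-tripping with psycopg2 (the docs table is hypothetical):

            # Sketch: the driver does the binary escaping both ways.
            import psycopg2

            conn = psycopg2.connect("dbname=app")

            with open("report.pdf", "rb") as f, conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO docs (name, body) VALUES (%s, %s) RETURNING id",
                    ("report.pdf", psycopg2.Binary(f.read())),
                )
                doc_id = cur.fetchone()[0]
            conn.commit()

            with conn.cursor() as cur:
                cur.execute("SELECT body FROM docs WHERE id = %s", (doc_id,))
                data = bytes(cur.fetchone()[0])  # memoryview -> bytes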
        • This isn't at all straightforward. The bytea type has no bearing on whether or not you should be storing that in a database.
          – Evan Carroll
          2 days ago
        2














        Do not store files in a database.



        Everyone, without exception, who can run any RDBMS on the market already has a database specifically for storing files, and the RDBMS itself is using it! That database is the filesystem. Now let's talk about some of the potential drawbacks of storing files in the database, as well as some specific mitigating factors for storing files in the database.





        • No filehandles to files in the database. What does this mean?




          • Programmer-talk: you CANNOT seek (fseek), there is no ability to manage the resource with asynchronous access (asyncio or epoll), and there is no sendfile (saving you the copy from kernel space).


          • Practical application: want to send a video or picture to a client over HTTP/2 or HTTP/3? If it's in the database, then you'll first have to query it, and for whatever query returns that file you'll have to wait for the entire query to conclude before that file can move to the next step. In a production install with the RDBMS on a different server than the web server, you'll first have to transfer the file entirely from the RDBMS to the web server rather than streaming it through. However, if the transport layer provided filesystem abstraction (which even NFS supports), you could seek halfway through the file and immediately start streaming it back to the client without buffering any more of the file than necessary. This is routinely done by the web servers Nginx and Apache, and by Pure-FTPd and ProFTPD.




        • Double copy on the RDBMS. By the very fact that it's in the database, you'll likely be writing it twice: once into a write-ahead log (WAL), and then again into the tablespace.


        • No updates, ever. MVCC means nothing gets updated, only copied anew with modifications, and then the old row gets marked as expired (deleted). Any update to the file will require rewriting the whole row, not just the file. Filesystems can provide this too, with data journaling, but you rarely need that.


        • File reads and transfers slow down the query. If the file itself is stored on a row which you need to query, the whole row will either have to wait for the file to be transferred, or you'll have to issue two separate queries.


        • Memory use on the DB client. The DB client (libpq, JDBC, ODBC, FreeTDS, etc.) will likely buffer the query in memory. When that in-memory buffer is exhausted, it may start a disk buffer, or even worse, fall back to the kernel to be paged to disk.


        • Query throttling. Many databases provide the ability to kill and reap queries when they take too much time or too many resources. Keep in mind that file transfers will not, in any implementation, be itemized. Did that query get killed after 3 seconds? Or did it take 1 second while the backend spent 2 seconds transferring a file? And beyond itemizing: how are you going to effectively state how much time a query should take when 99.9% of queries return 1 KB and the other one returns 1 GB?


        • No copy-on-write or deduplication. XFS and Btrfs support copy-on-write and deduplication transparently. This means that having the same picture everywhere, or needing a second copy of it, can be handled transparently by the filesystem. However, if the file is not standing by itself, and is instead on a row or in a store, the filesystem is likely unable to dedupe it.


        • Integrity. A lot of people here are talking about integrity. What do you think is better at detecting filesystem corruption: an application that uses the filesystem, or the filesystem's core utilities? Store a file in a row, or out of line, and any filesystem corruption will be obscured by the database. xfs_repair is damn good at recovering when you have filesystem or hard drive corruption, and if it fails it'll still be a lot easier to do data forensics.


        • Cloud migration. If you ever want to store the files on a SAN or in the cloud, you'll have all the more difficulty, because now that storage migration is a database migration. If your files are, for example, stored on the filesystem, you can fairly easily move them to S3 (and with something like s3fs it can be transparent).



        Exceptions



        Storing files in the database has a few valid use cases,




        • When you need to edit the file transactionally. That means it's literally part of your transaction to edit the file, or you need the ability to roll back edits on the file if the transaction fails because of data-integrity issues in the relations (tables).

        • When you need to ensure the file system is precisely versioned with the data and you can't afford any risk in keeping them in sync.

        • When the database can actually parse the file and you can query it. In PostgreSQL, for example, topologies can be queried with PostGIS. At this point, while it's a file, it's also data for the query and not a storage dump.


        Mitigations





        • Some databases have a notion of an "externally managed resource", where the database manages the file privately on disk, such as:




          • PostgreSQL through the Large Object infrastructure provides a filehandle to a resource for the duration of the transaction.


          • SQL Server 2017's FILESTREAM infrastructure provides temporary access that lasts for the duration of the transaction, which you can use to get the file path and open a file handle to it.


          • Oracle provides BFILE (this has nothing to do with their internal LOB stuff, which is called SecureFiles).




        • Some of the databases store large binary objects out-of-line or can, like Oracle SecureFile. This permits you to update the row, without rewriting the file.


        • Some databases, like Oracle, do their MVCC without a WAL and do not have to double-write the file.


        • Some databases, like SQL Server and Oracle, provide the ability to "stream" data from the file without ever having a file handle to it. This may or may not run on a different connection than the database query. But the key here is that while you can stream the file (in theory), I can't find any evidence of any product not made by the provider that uses that feature. For example, where is the Nginx/Apache bridge to allow you to do this?


        • Oracle provides optional deduplication, compression, and encryption through Internal-LOB storage (like SecureFile).



        Conclusion



        The worst-case scenario when you put a file in the database is very bad for performance and for compatibility with tooling. It's always exceptionally implementation-dependent. In no way is the database better at being a filesystem than the filesystem. In every way it's a compromise, and even when you get powerful mitigating features (as in the case of SecureFiles), the tooling is so poor that it's really not much more than a marketing point unless your whole stack is built by the RDBMS provider.



        Keep it simple, and the general rule is keep the files out of the DB.



        Solution



        How should you store files, or abstract a filesystem in such a fashion as to function effectively for multiple tenants and users? I am partial to hashing the file contents. This is pretty common these days and works well; a sketch follows.
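
        A minimal sketch of such a content-addressed layout (the storage root, fan-out scheme, and function name are illustrative assumptions, not a prescription):

            # Sketch: the filename is derived from the content, so identical
            # uploads from any tenant dedupe to one file and names never collide.
            import hashlib
            import os
            import shutil

            STORE = "/srv/filestore"  # assumed storage root

            def add_file(src_path):
                h = hashlib.sha256()
                with open(src_path, "rb") as f:
                    for block in iter(lambda: f.read(1 << 20), b""):
                        h.update(block)
                digest = h.hexdigest()
                # Fan out into subdirectories so no directory grows huge.
                dest_dir = os.path.join(STORE, digest[:2], digest[2:4])
                dest = os.path.join(dest_dir, digest)
                if not os.path.exists(dest):  # same content: already stored
                    os.makedirs(dest_dir, exist_ok=True)
                    shutil.copy(src_path, dest)
                return digest  # store this key in the database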
          -4














          Sharing my experience with MS SQL Server and a huge number of files: we save the files on a file server. The database has two tables, one for the file folders and access credentials, and one for the filenames. It is easy to maintain the database and the files, and you can easily move the files even across servers; you just need to modify the folders table.
            protected by Paul White Jul 31 '14 at 21:52



            Thank you for your interest in this question.
            Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



            Would you like to answer one of these unanswered questions instead?














            12 Answers
            12






            active

            oldest

            votes








            12 Answers
            12






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            49
















            1. Store in the database with a blob



              A disadvantage is that it makes your database files quite large and possibly too large to back up with your existing set up. An advantage is integrity and atomicity.




            2. Store on the filesystem with a link in the database



              I've come across such horrible disasters doing this, and it scares me that people keep suggesting it. Some of the disasters included:




              • One privileged user who would rearrange the files and frequently break the links between the paths in the DB and where they now are (but somehow this became my fault).

              • When moving from one server to another, the ownership of some of the files was lost as the SID for the old machine's administator account (what the old website was running on) was not part of the domain and so the copied files had ACLs that could not be resolved thus presenting users with the username/password/domain login prompt.

              • Some of the paths ended up being longer than 256 characters from the C: all the way to the .doc and not all versions of NT were able to deal with long paths.




            3. Store in the filesystem but rename to a hash of the contents and store the hash on the database



              The last place I worked at did this based on my explanation of the above scenarios did this. They thought it was a compromise between the organization's inability to obtain experience with large databases (anything larger than about 40G was ordained to be "too big"), the corporate inability to purchase large hard drives, and the inability to purchase a more modern back up solution, and the need to get away from risks #1 & #3 that I identified above.




            My opinion is that storing in the DB as a blob is a better solution and more scalable in a multi-server scenario, especially with failover and availability concerns.






            share|improve this answer



















            • 2




              I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented a possible argument, not your point of view.
              – Phil Lello
              Apr 29 '11 at 16:48






            • 2




              I once had a issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as a binary for 10000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
              – Ali Razeghi
              Jan 29 '13 at 0:20








            • 6




              The whole "link breaking" is an application issue and not a database issue. The database is doing it's job (serving pure data) while the application isn't (serving mixed file types). The application should take have responsibility of serving files. By storing an abstract route path in the database that would work no matter where the file gets stored on the server internally (ala Symfony2 routing). This would abstract away the native paths, make the application more portable, maintainable and allow to switch to any kind of filesystem without breaking anything.
              – Tek
              Jul 31 '14 at 18:28


















            49
















            1. Store in the database with a blob



              A disadvantage is that it makes your database files quite large and possibly too large to back up with your existing set up. An advantage is integrity and atomicity.




            2. Store on the filesystem with a link in the database



              I've come across such horrible disasters doing this, and it scares me that people keep suggesting it. Some of the disasters included:




              • One privileged user who would rearrange the files and frequently break the links between the paths in the DB and where they now are (but somehow this became my fault).

              • When moving from one server to another, the ownership of some of the files was lost as the SID for the old machine's administator account (what the old website was running on) was not part of the domain and so the copied files had ACLs that could not be resolved thus presenting users with the username/password/domain login prompt.

              • Some of the paths ended up being longer than 256 characters from the C: all the way to the .doc and not all versions of NT were able to deal with long paths.




            3. Store in the filesystem but rename to a hash of the contents and store the hash on the database



              The last place I worked at did this based on my explanation of the above scenarios did this. They thought it was a compromise between the organization's inability to obtain experience with large databases (anything larger than about 40G was ordained to be "too big"), the corporate inability to purchase large hard drives, and the inability to purchase a more modern back up solution, and the need to get away from risks #1 & #3 that I identified above.




            My opinion is that storing in the DB as a blob is a better solution and more scalable in a multi-server scenario, especially with failover and availability concerns.






            share|improve this answer



















            • 2




              I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented a possible argument, not your point of view.
              – Phil Lello
              Apr 29 '11 at 16:48






            • 2




              I once had a issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as a binary for 10000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
              – Ali Razeghi
              Jan 29 '13 at 0:20








            • 6




              The whole "link breaking" is an application issue and not a database issue. The database is doing it's job (serving pure data) while the application isn't (serving mixed file types). The application should take have responsibility of serving files. By storing an abstract route path in the database that would work no matter where the file gets stored on the server internally (ala Symfony2 routing). This would abstract away the native paths, make the application more portable, maintainable and allow to switch to any kind of filesystem without breaking anything.
              – Tek
              Jul 31 '14 at 18:28
















            49












            49








            49








            1. Store in the database with a blob



              A disadvantage is that it makes your database files quite large and possibly too large to back up with your existing set up. An advantage is integrity and atomicity.




            2. Store on the filesystem with a link in the database



              I've come across such horrible disasters doing this, and it scares me that people keep suggesting it. Some of the disasters included:




              • One privileged user who would rearrange the files and frequently break the links between the paths in the DB and where they now are (but somehow this became my fault).

              • When moving from one server to another, the ownership of some of the files was lost as the SID for the old machine's administator account (what the old website was running on) was not part of the domain and so the copied files had ACLs that could not be resolved thus presenting users with the username/password/domain login prompt.

              • Some of the paths ended up being longer than 256 characters from the C: all the way to the .doc and not all versions of NT were able to deal with long paths.




            3. Store in the filesystem but rename to a hash of the contents and store the hash on the database



              The last place I worked at did this based on my explanation of the above scenarios did this. They thought it was a compromise between the organization's inability to obtain experience with large databases (anything larger than about 40G was ordained to be "too big"), the corporate inability to purchase large hard drives, and the inability to purchase a more modern back up solution, and the need to get away from risks #1 & #3 that I identified above.




            My opinion is that storing in the DB as a blob is a better solution and more scalable in a multi-server scenario, especially with failover and availability concerns.






            share|improve this answer
















            1. Store in the database with a blob



              A disadvantage is that it makes your database files quite large and possibly too large to back up with your existing set up. An advantage is integrity and atomicity.




            2. Store on the filesystem with a link in the database



              I've come across such horrible disasters doing this, and it scares me that people keep suggesting it. Some of the disasters included:




              • One privileged user who would rearrange the files and frequently break the links between the paths in the DB and where they now are (but somehow this became my fault).

              • When moving from one server to another, the ownership of some of the files was lost as the SID for the old machine's administator account (what the old website was running on) was not part of the domain and so the copied files had ACLs that could not be resolved thus presenting users with the username/password/domain login prompt.

              • Some of the paths ended up being longer than 256 characters from the C: all the way to the .doc and not all versions of NT were able to deal with long paths.




            3. Store in the filesystem but rename to a hash of the contents and store the hash on the database



              The last place I worked at did this based on my explanation of the above scenarios did this. They thought it was a compromise between the organization's inability to obtain experience with large databases (anything larger than about 40G was ordained to be "too big"), the corporate inability to purchase large hard drives, and the inability to purchase a more modern back up solution, and the need to get away from risks #1 & #3 that I identified above.




            My opinion is that storing in the DB as a blob is a better solution and more scalable in a multi-server scenario, especially with failover and availability concerns.







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Jan 28 '13 at 21:48









            eykanal

            152117




            152117










            answered Apr 29 '11 at 14:13









            TangurenaTangurena

            1,46611316




            1,46611316








            • 2




              I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented a possible argument, not your point of view.
              – Phil Lello
              Apr 29 '11 at 16:48






            • 2




              I once had a issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as a binary for 10000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
              – Ali Razeghi
              Jan 29 '13 at 0:20








            • 6




              The whole "link breaking" is an application issue and not a database issue. The database is doing it's job (serving pure data) while the application isn't (serving mixed file types). The application should take have responsibility of serving files. By storing an abstract route path in the database that would work no matter where the file gets stored on the server internally (ala Symfony2 routing). This would abstract away the native paths, make the application more portable, maintainable and allow to switch to any kind of filesystem without breaking anything.
              – Tek
              Jul 31 '14 at 18:28
















            • 2




              I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented a possible argument, not your point of view.
              – Phil Lello
              Apr 29 '11 at 16:48






            • 2




              I once had a issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as a binary for 10000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
              – Ali Razeghi
              Jan 29 '13 at 0:20








            • 6




              The whole "link breaking" is an application issue and not a database issue. The database is doing it's job (serving pure data) while the application isn't (serving mixed file types). The application should take have responsibility of serving files. By storing an abstract route path in the database that would work no matter where the file gets stored on the server internally (ala Symfony2 routing). This would abstract away the native paths, make the application more portable, maintainable and allow to switch to any kind of filesystem without breaking anything.
              – Tek
              Jul 31 '14 at 18:28










            2




            2




            I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented a possible argument, not your point of view.
            – Phil Lello
            Apr 29 '11 at 16:48




            I'm not sure the backup size is an issue; data needs to be backed up however it's stored. The same differential vs full decision gets made whether we're talking about a FS or a DB. I do note that this is presented a possible argument, not your point of view.
            – Phil Lello
            Apr 29 '11 at 16:48




            2




            2




            I once had a issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as a binary for 10000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
            – Ali Razeghi
            Jan 29 '13 at 0:20






            I once had a issue where hundreds of megabytes were written to each row thousands of times a day. They were storing a GZIP file in the DB as a binary for 10000 servers, but a bug was introduced where every server recorded info for every server, per alert. It was horrible. After that incident, I became adamant about 'no (MAX) data types unless it's extremely justified'.
            – Ali Razeghi
            Jan 29 '13 at 0:20






            6




            6




            The whole "link breaking" is an application issue and not a database issue. The database is doing it's job (serving pure data) while the application isn't (serving mixed file types). The application should take have responsibility of serving files. By storing an abstract route path in the database that would work no matter where the file gets stored on the server internally (ala Symfony2 routing). This would abstract away the native paths, make the application more portable, maintainable and allow to switch to any kind of filesystem without breaking anything.
            – Tek
            Jul 31 '14 at 18:28






            The whole "link breaking" is an application issue and not a database issue. The database is doing it's job (serving pure data) while the application isn't (serving mixed file types). The application should take have responsibility of serving files. By storing an abstract route path in the database that would work no matter where the file gets stored on the server internally (ala Symfony2 routing). This would abstract away the native paths, make the application more portable, maintainable and allow to switch to any kind of filesystem without breaking anything.
            – Tek
            Jul 31 '14 at 18:28















            27














Number 1 for complete data integrity. Use the other options if you don't care about data quality. It's that simple.

Most RDBMSs have optimizations for storing BLOBs (e.g. SQL Server FILESTREAM) anyway.






answered Apr 29 '11 at 12:48 – gbn





















            • what is it about (3) specifically that puts data integrity at risk? (assuming you get your transactional API right)
              – Jack Douglas
              Apr 29 '11 at 13:05








            • 4




@JackPDouglas: you have a hash, which is not the actual data, and you still have an external dependency for data integrity
              – gbn
              Apr 29 '11 at 14:02






            • 6




              @JackPDouglas There's also the possibility that the server admin and DBA are different teams, with the associated risk that files get deleted in error, or not backed-up as they're thought of as temporary files.
              – Phil Lello
              Apr 29 '11 at 16:23






            • 2




              I'm interested in what the FileStream equivalents on other platforms are called.
              – Ronnie Overby
              Jul 7 '15 at 19:26
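
To make the write-file-before-insert ordering debated in this comment thread concrete, here is a minimal Python sketch; the documents(name, content_hash) table, the store path, and the helper name are hypothetical, and a real implementation would also need fsync and copy-error handling:

    import hashlib
    import shutil
    import sqlite3
    from pathlib import Path

    STORE = Path("/var/files/store")  # hypothetical immutable file store

    def save_document(conn: sqlite3.Connection, src: Path) -> str:
        # Write the file under its content hash first, then record it.
        # If the INSERT fails afterwards, the orphaned file is harmless
        # and can be garbage-collected later.
        digest = hashlib.sha256(src.read_bytes()).hexdigest()  # whole-file read: fine for a sketch
        dest = STORE / digest
        if not dest.exists():          # idempotent: same bytes, same name
            shutil.copyfile(src, dest)
        with conn:                     # the DB transaction commits last
            conn.execute(
                "INSERT INTO documents (name, content_hash) VALUES (?, ?)",
                (src.name, digest),
            )
        return digest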



























            19














If going for Oracle, take a look at DBFS and SecureFiles.

SecureFiles says it all: keep ALL your data safe in the database. It is organized in LOBs. SecureFiles is a modernized version of LOBs that should be activated.

DBFS is a filesystem in the database. You can mount it like a network filesystem on a Linux host. It is really powerful. See blog. It also has a lot of options to tune to your specific needs. Being a DBA, given a filesystem (based in the database, mounted on Linux), I created an Oracle database on it without any problems (a database, stored in a ... database). Not that this would be very useful, but it does show the power.

More advantages are: availability, backup, recovery, all read-consistent with the other relational data.

Sometimes size is given as a reason not to store documents in the database. That data probably has to be backed up anyway, so that's not a good reason not to store it in the database. Especially in a situation where old documents are to be considered read-only, it is easy to make big parts of the database read-only. In that case, those parts of the database no longer need a high-frequency backup.

A reference in a table to something outside the database is unsafe. It can be manipulated, is hard to check and can easily get lost. How about transactions? The database offers solutions for all these issues. With Oracle DBFS you can give your docs to non-database applications and they wouldn't even know they are poking in a database.

A last, big surprise: the performance of a DBFS filesystem is often better than a regular filesystem. This is especially true if the files are larger than a few blocks.






edited Dec 21 '15 at 9:16 · answered Apr 29 '11 at 13:10 – ik_zelf



















            • 9




              +1 for stating a good rule "A reference in a table to something outside the database is unsafe."
              – Niklas Rosencrantz
              Sep 24 '11 at 12:23



























            13














            I think the right answer here depends a lot on your application, and how important those documents are.



            For a document management system, or a system where recoverability of the stored documents is critical (so most things financial, HR or CRM related), storing documents inline or using your favourite DB vendor's proprietary document tech seems like the Right Thing To Do.



            However, there are many applications where I believe that the opposite decision is appropriate.



            Helpdesk systems and wiki-type systems are ones where I think it makes a lot of sense to keep the data out of the database. I believe some, like Jira, actually provide an option to choose whether you want to store documents inline or not.



For a medium-sized business, storing documents for a ticketing system inline can mean the difference between a compressed backup measured in megabytes and one measured in gigabytes.



            I would personally prefer to bring a ticketing system back online in a few minutes and wrestle with the (generally less important) documents for a few hours, than increase my "it's broken and the CTO is breathing down my neck" RTO by having to restore and replay logs from a much larger backup.



            There are other benefits of keeping documents separate.




            • You can easily run separate processes that catalog document metadata, perform virus scanning, perform keyword indexing, etc.

            • You can take advantage of tools to assist with backups or recovery - rsync, storage snapshots, etc. - which lend themselves much better to files than databases

            • You can actually use storage that supports compression or deduplication (the stuff that your SAN admins have been blubbering about for years, aka the bane of database administrators worldwide)

            • For an installation across multiple sites, you can complement a centralised database with a distributed file system


            I think a hybrid combination of #2 and #3 might be clever. Keep the original filenames, but calculate and store a hash/checksum of the document, so that you have some reference point that will assist recovery in case someone moves or renames the file.



            Storing the files with their original filenames means that applications can literally yank them straight from a file system and send them over the wire, or in a thick client world, maybe even point the user directly to the file server.
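
As a sketch of the hash/checksum half of that hybrid (the helper name and directory layout are hypothetical), recomputing content hashes lets you relocate a file whose recorded path has gone stale:

    import hashlib
    from pathlib import Path

    def find_by_hash(root, expected_sha256):
        # Scan a directory tree for the file whose content matches the
        # checksum stored in the database; useful after a move or rename.
        for candidate in Path(root).rglob("*"):
            if candidate.is_file():
                digest = hashlib.sha256(candidate.read_bytes()).hexdigest()
                if digest == expected_sha256:
                    return candidate
        return None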






answered May 8 '13 at 14:12 – Nathan Jolly

















































                    9














I want to add my experience here as to the tradeoffs. In PostgreSQL, at least, the performance impacts are quite minimal in terms of the db server. Large blobs are stored in separate files, not in the main heap tables, so as to move them out of the way of operations that may count large numbers of records. Other dbs may do something similar.

The major advantage is the ability to keep all related data in one place for atomicity and backup purposes. This greatly reduces the chance of something going wrong.

The major disadvantage is not one I have seen covered above, and that's memory usage on the front-end. I don't know exactly how every db handles this, so it may depend on the implementation, but for PostgreSQL the data comes in as an escaped ASCII string (possibly hexadecimal, possibly with inlined escapes). This then has to be converted back to binary in the front end. Many frameworks I have seen for doing this involve passing the value (not a reference) and then constructing a new binary string based on it. I calculated that using Perl to do this ended up using many times the memory of the original binary.

Verdict: if the files are only being occasionally accessed, I would store them in the db. If they are being frequently and repeatedly accessed, at least with PostgreSQL, I think the costs outweigh the benefits.






answered Feb 18 '13 at 8:17 – Chris Travers





















• No (large) files should be transferred as strings. This is really bad for memory usage and eventually causes OOM. Postgres streaming should be used.
  – Anssi
  Jan 4 '18 at 9:18










• I would be curious to know if this is still your opinion. ;) BTW I tried to answer it too, with a lot more specifics about Pg.
  – Evan Carroll
  4 hours ago
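
Anssi's streaming point can be sketched with psycopg2's large-object API, which moves data in fixed-size chunks instead of round-tripping one escaped string through the client; the connection string, table-free design, and chunk size here are placeholders:

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")  # placeholder credentials

    def store_file(path):
        # Stream a file into a PostgreSQL large object; returns its OID,
        # which you would then record in an ordinary table row.
        with conn:                          # large objects need a transaction
            lo = conn.lobject(0, "wb")      # oid=0: server assigns a new one
            with open(path, "rb") as f:
                while chunk := f.read(64 * 1024):
                    lo.write(chunk)
            oid = lo.oid
            lo.close()
        return oid

    def fetch_file(oid, dest):
        # Stream a large object back out without materialising it in memory.
        with conn:
            lo = conn.lobject(oid, "rb")
            with open(dest, "wb") as f:
                while chunk := lo.read(64 * 1024):
                    f.write(chunk)
            lo.close()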































                    9














                    Don't do it.



                    There really isn't an upside to having files stored in the database.



                    Doesn't it already feel weird and fishy when you think to yourself:




                    Should I store files in a database or a filesystem?




                    Even better, say it out loud.



                    On to the facts:



                    Using the database



                    "PROS"... but not quite:




                    • "Atomicity" which is correct but it's a double edged sword. Because it drags
                      cons along with it.

                    • Integrity. Same as above.


                    I really don't want to be biased but I don't think there's more to add. The pros aren't really that great if you think about it.



If I forgot something, comment below; in the meantime, keep reading.



                    CONS:




                    • Wrong tool for the job

                    • Harder to maintain

                    • Slow

                    • Forget about storing hundreds of MB/gigabytes of data PER user.

                    • Backing up rapidly growing sites will be a nightmare.

                    • Restoring/moving will also suck.


                    Using the filesystem



                    PROS:




                    • Way easier to maintain

                    • Fast

                    • Database back ups don't have anything to do with this

                    • Arguably more portability*


                    CONS:




                    • None*


                    *Fine print



Right now you're asking yourself: hold on, you mean there are no cons?! How come?



The biggest mistake here is that people are trying to drive a screw with a hammer.



The main reason, and I'd go so far as to say the only reason, this is being asked is because of file links.



                    This is a problem that the database isn't meant to solve. It even sounds silly if you think about it.




                    "The database will fix my file linking problems."




In reality, the application should be in charge of handling and serving links.



                    A Solution:




                    1. Make your application handle URL requests with custom routes.

                    2. Save this route to your database.

                    3. Internally every time this route is called map it to the file you want.

                    4. If you ever move your files elsewhere just change the filename value of the route and that route will always serve the same file no matter where it's stored or referenced across the web.


This would also abstract away the native paths, make the application more portable and maintainable, and allow switching to any kind of filesystem without breaking anything.



As for how to implement it: that's beyond the scope of this answer (though a minimal sketch follows below), but you can take a look at a general example in arguably the most widely used web language (PHP):



                    https://github.com/symfony/Routing



                    https://github.com/kriswallsmith/assetic



                    Both of these together are really powerful.
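
Here is the promised minimal sketch of the route idea, assuming a hypothetical routes(slug, filename) table; Flask and sqlite3 stand in for any router and database, and are not what the linked Symfony components use:

    import sqlite3
    from flask import Flask, abort, send_file

    app = Flask(__name__)
    DB = "routes.db"  # hypothetical table: routes(slug TEXT PRIMARY KEY, filename TEXT)

    @app.route("/files/<slug>")
    def serve(slug):
        # Resolve the abstract route to whatever file currently backs it.
        conn = sqlite3.connect(DB)
        try:
            row = conn.execute(
                "SELECT filename FROM routes WHERE slug = ?", (slug,)
            ).fetchone()
        finally:
            conn.close()
        if row is None:
            abort(404)
        # Moving the file later only means updating the filename column;
        # every stored /files/<slug> URL keeps working.
        return send_file(row[0])

Step 4 of the list above then becomes a one-column UPDATE rather than a link-fixing exercise.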






edited Aug 1 '14 at 1:22 · answered Jul 31 '14 at 19:32 – Tek























• You might be interested in this: research.microsoft.com/apps/pubs/default.aspx?id=64525, research by Microsoft that shows that storing blobs in the database is actually faster than in the file system (for some sizes of blobs, at least). This is in line with my tests, which showed that for medium-sized blobs (< ~1MB) e.g. Postgres is also faster than a filesystem. For Oracle it's about the same performance, but I haven't tested the new SecureFiles storage format yet (though they claim it's faster than the old storage format).
  – a_horse_with_no_name
  Aug 1 '14 at 5:56










                    • I saw that, which is why I talked about large files. Plus OP didn't specify a database vendor so performance may differ between vendors and thus my advice is more general.
                      – Tek
                      Aug 1 '14 at 13:37












• The DB being faster than the FS for files sounds fishy to me. For 1MB, any overhead doesn't play a role. Probably they allowed the DB to cache but not the FS. Apples, pies.
  – Torsten Bronger
  Oct 9 '15 at 3:25



























                    7














Back in the day, Microsoft hyped up the ability to store images (and similar blob data types) in the database. This was a cool new feature of SQL Server 2000 (I am pretty sure it was 2000, not 7.0) and many people jumped on the bandwagon.

Storing BLOBs in the database has advantages and disadvantages:

On one hand, all your data and related images or documents can be stored and accessed in one place. Application users do not require special network permissions, as it is SQL that is serving up the images/files/documents.

On the other hand, your database can grow quite large, depending on the size and number of BLOBs you are storing. This affects backups, storage requirements, time-sensitive recovery operations, etc.

SQL Server 2008 introduced FILESTREAM. The database contains pointers to the files; the files reside on the server, not in the database, but when you back up the database the files are backed up as well.



                    Your backups can get quite large, but you don't end up with orphaned files/documents/blobs/images.



My personal preference has been to let the database store pointers/network locations, and let a file server handle the files. File servers are better optimized for such tasks anyway. A minimal sketch of that layout follows.
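For illustration only, here is a sketch of the pointer approach this answer describes. SQLite, the documents table, and the /mnt/fileserver mount point are my own illustrative assumptions, not anything from the answer:

    import sqlite3
    from pathlib import Path

    FILE_STORE = Path("/mnt/fileserver/appdata")  # mounted file-server share (illustrative)

    conn = sqlite3.connect("app.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id        INTEGER PRIMARY KEY,
            title     TEXT NOT NULL,
            file_path TEXT NOT NULL   -- pointer to the file server, not the bytes
        )
    """)

    def add_document(title: str, source: Path) -> int:
        # Copy the file to the file server, then record only its location.
        dest = FILE_STORE / source.name
        dest.write_bytes(source.read_bytes())
        cur = conn.execute(
            "INSERT INTO documents (title, file_path) VALUES (?, ?)",
            (title, str(dest)),
        )
        conn.commit()
        return cur.lastrowid

The database stays small and cheap to back up; the trade-off, as other answers note, is that the row and the file are no longer written in one transaction.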






answered Sep 26 '11 at 15:02
– datagod
                    • 5




                      Never mind that if you don't own the server you're going to pay a heck of a lot more per MB for database space vs. file space. Also having the file on disk makes it much easier to troubleshoot - how do you SELECT image FROM table in SSMS and validate that the right image is there?
                      – Aaron Bertrand
                      Sep 26 '11 at 15:07
















                    6














My vote would be for neither. Store the data in a system like Amazon S3 or Microsoft's CDN and store that URL in the database. A minimal sketch of this pattern is below.

This way you get the reliability of having the data always accessible, without having monster-sized databases to deal with.
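A sketch of that pattern, assuming the boto3 AWS SDK; the bucket name, table, and helper function are illustrative, not from the answer:

    import sqlite3
    import boto3

    BUCKET = "my-app-uploads"  # illustrative bucket name

    s3 = boto3.client("s3")
    db = sqlite3.connect("app.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS attachments (id INTEGER PRIMARY KEY, url TEXT NOT NULL)"
    )

    def store(local_path: str, key: str) -> str:
        # Upload the binary to S3, then persist only its URL in the database.
        s3.upload_file(local_path, BUCKET, key)
        url = f"https://{BUCKET}.s3.amazonaws.com/{key}"
        db.execute("INSERT INTO attachments (url) VALUES (?)", (url,))
        db.commit()
        return url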






answered Oct 19 '12 at 6:02
– paullb


























                            5














                            Although it partly depends on the application/environment (people included), I'd go for the blob.



                            Keeping everything in the database means replication works for file data. You'd need a separate mechanism to synchronise FS files.



In some applications, the filesystem shouldn't be modified anyway. For example, on a production website, I'd avoid ever using the filesystem for any non-disposable data (the site lives under SCM, the data in a database).



                            Assuming we've got multiple users/applications with separate permissions, then any filesystem storage provides an opportunity for differences in DB and FS access rights.



The refinement I'd consider making to BLOB storage is to chunk the data if it makes sense; if you only need 512 bytes from a 20 MB BLOB, this sector-like access is a real boon, especially if you're dealing with remote clients (and again, a partial update creates much less replication traffic). A sketch of such a chunked layout follows.
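A minimal sketch of that chunking idea; SQLite and the blob_chunks table are my own illustrative assumptions, not part of the answer:

    import sqlite3

    CHUNK = 512  # bytes per chunk, matching the sector-like access described above

    db = sqlite3.connect("blobs.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS blob_chunks (
            file_id  INTEGER NOT NULL,
            chunk_no INTEGER NOT NULL,
            data     BLOB    NOT NULL,
            PRIMARY KEY (file_id, chunk_no)
        )
    """)

    def write_blob(file_id: int, payload: bytes) -> None:
        # Split the blob into fixed-size chunks so a later edit rewrites one small row.
        for offset in range(0, len(payload), CHUNK):
            db.execute(
                "INSERT INTO blob_chunks (file_id, chunk_no, data) VALUES (?, ?, ?)",
                (file_id, offset // CHUNK, payload[offset:offset + CHUNK]),
            )
        db.commit()

    def read_chunk(file_id: int, chunk_no: int) -> bytes:
        # Fetch just the 512 bytes you need instead of the whole 20 MB blob.
        row = db.execute(
            "SELECT data FROM blob_chunks WHERE file_id = ? AND chunk_no = ?",
            (file_id, chunk_no),
        ).fetchone()
        return row[0] if row else b""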






answered Apr 29 '11 at 16:48
– Phil Lello


























                                    3














For Postgres:

It's actually straightforward. There is a BYTEA type that can be used for storing binary strings. By default there are no built-in utilities like the ones mentioned for MS or Oracle, so storing lots of big files and retrieving them can get tedious. You also need to do the conversion of the files within the application (for example with a byte stream or similar; no idea, though, how this works with the specific MS/Oracle file<->database solutions). There is also a lo (large object) type that helps with the work of managing BLOBs, since some of the internal management of these types may not keep track of the references. A small sketch of the BYTEA route follows.
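A minimal sketch of that application-side conversion, assuming the psycopg2 driver; the files table and the connection string are illustrative:

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # illustrative connection string

    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS files (
                id   serial PRIMARY KEY,
                name text  NOT NULL,
                data bytea NOT NULL
            )
        """)
        # The application does the file <-> bytes conversion itself.
        with open("report.pdf", "rb") as f:
            cur.execute(
                "INSERT INTO files (name, data) VALUES (%s, %s)",
                ("report.pdf", psycopg2.Binary(f.read())),
            )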






answered Apr 30 '11 at 8:30
– DrColossos





















• This isn't at all straightforward. The bytea type has no bearing on whether or not you should be storing that in a database.
– Evan Carroll
2 days ago
















                                    2














                                    Do not store files in a database.



Everyone, without exception, who can run any RDBMS on the market already has a database specifically designed for storing files, and the RDBMS itself is using it! That database is the filesystem. Now let's talk about some of the potential drawbacks of storing files in the database, as well as some specific mitigating factors for storing files in the database.





• No file handles to files in the database. What does this mean?

  • Programmer-talk: you CANNOT seek (fseek), there is no ability to manage the resource with asynchronous access (asyncio or epoll), and there is no sendfile (which saves you the copy through user space).

  • Practical application: want to send a video or picture to a client over HTTP/2 or HTTP/3? If it's in the database, then you'll first have to query it. For whatever query returns that file, you'll have to wait for the entire query to conclude before that file can move to the next step. In a production install with the RDBMS on a different server than the web server, you'll first have to transfer the file entirely from the RDBMS to the web server rather than streaming it through. However, if the transportation layer provided filesystem abstraction (which even NFS supports), you could seek halfway through the file and immediately start streaming it back to the client without buffering any more of the file than necessary. This is routinely done by the web servers nginx and Apache, and by Pure-FTPd and ProFTPD.

• Double copy on the RDBMS. By the very fact that it's in the database, you'll likely be writing it twice: once into a write-ahead log (WAL), and then again into the tablespace.

• No updates, ever. MVCC means nothing gets updated, only copied anew with modifications, and then the old row gets marked as expired (deleted). Any update to the file will require writing the whole row again, not just the file. Filesystems can provide this too, with data journaling, but you rarely need that.

• File read and transfer slow down the query. If the file itself is stored in a row which you need to query, the whole query will either have to wait for the file to be transferred, or you'll have to issue two separate queries.

• Memory use on the DB client. The DB client (libpq, JDBC, ODBC, FreeTDS, etc.) will likely buffer the query in memory. When that in-memory buffer is exhausted, it may start a disk buffer or, even worse, fall back to the kernel to be paged to disk.

• Query throttling. Many databases provide the ability to kill and reap queries when they take too much time or too many resources. Keep in mind that file transfers will not, in any implementation, be itemized. Did that query get killed after 3 seconds? Or did it take 1 second while the backend spent 2 seconds transferring a file? And beyond "itemized": how are you going to effectively state how much time a query should take when 99.9% of queries return 1 KB and the other one returns 1 GB?

• No copy-on-write or deduplication. XFS and BTRFS support copy-on-write and deduplication transparently. This means that having the same picture everywhere, or needing a second copy of it, can be handled transparently by the filesystem. However, if the file is not standing by itself, and is instead in a row or in a store, the filesystem is likely unable to dedupe it.

• Integrity. A lot of people here are talking about integrity. What do you think is better at detecting filesystem corruption: an application that uses the filesystem, or the filesystem's core utilities? Store a file in a row, or out of line, and any filesystem corruption will be obscured by the database. xfs_repair is damn good at recovering when you have filesystem or hard drive corruption, and if it fails it'll still be a lot easier to do data forensics.

• Cloud migration. If you ever want to store the files on a SAN or in the cloud, you'll have all the more difficulty, because now that storage migration is a database migration. If your files are, for example, stored on the filesystem, you can fairly easily move them to S3 (and with something like s3fs it can be transparent).



                                    Exceptions



Storing files in the database has a few valid use cases:




• When you need to edit the file transactionally. That means it's literally part of your transaction to edit the file, or you need the ability to roll back edits on the file if the transaction fails for data-integrity issues in the relations (tables).

• When you need to ensure the filesystem is precisely versioned with the data and you can't afford any risk of them falling out of sync.

• When the database can actually parse the file and you can query it. In PostgreSQL, for example, topologies can be queried with PostGIS. At this point, while it's a file, it's also data for the query and not a storage dump.


                                    Mitigations





• Some databases have a notion of an "externally managed resource", where the database manages the file privately on disk, such as:

  • PostgreSQL, whose Large Object infrastructure provides a file handle to a resource for the duration of the transaction.

  • SQL Server 2017, whose FILESTREAM infrastructure provides temporary access that lasts for the duration of the transaction, which you can use to get the file path and open a file handle.

  • Oracle, which provides BFILE (this has nothing to do with their internal LOB stuff, which is called SecureFile).

• Some databases store large binary objects out of line, or can, like Oracle SecureFile. This permits you to update the row without rewriting the file.

• Some databases, like Oracle, do their MVCC without a WAL and do not have to double-write the file.

• Some databases, like SQL Server and Oracle, provide the ability to "stream" data from the file without ever having a file handle to it. This may or may not run on a different connection than the database query. But the key here is that while you can stream the file (in theory), I can't find any evidence of any product not made by the provider that uses that feature. For example, where is the NGINX/Apache bridge to allow you to do this?

• Oracle provides optional deduplication, compression, and encryption through internal LOB storage (like SecureFile).



                                    Conclusion



The worst-case scenario when you put a file in the database is very bad for performance and for compatibility with tooling. It's always exceptionally implementation-dependent. In no way is the database better at being a filesystem than the filesystem. In every way it's a compromise, and even when you get powerful mitigating features (as in the case of SecureFile), the tooling is so poor that it's really not much more than a marketing point unless your whole stack is built by the RDBMS provider.



                                    Keep it simple, and the general rule is keep the files out of the DB.



                                    Solution



How should you store files, or abstract a filesystem in such a fashion as to effectively function for multiple tenants and users? I am partial to hashing the file contents. This is pretty common these days and works well; a minimal sketch follows.
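A sketch of content-hash storage in the spirit this answer suggests; the store location, the fan-out scheme, and the choice of SHA-256 are my own illustrative assumptions:

    import hashlib
    import shutil
    from pathlib import Path

    STORE = Path("/var/lib/app/files")  # illustrative store location

    def save(source: Path) -> str:
        # Name the file after the SHA-256 of its contents: identical files
        # dedupe for free, and the name can never go stale.
        digest = hashlib.sha256(source.read_bytes()).hexdigest()
        dest = STORE / digest[:2] / digest  # fan out so directories stay small
        if not dest.exists():
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(source, dest)
        return digest  # store this hash in the database row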






answered 2 days ago (edited 24 mins ago)
– Evan Carroll




























                                            -4














Sharing my experience with MS SQL Server and a huge number of files. We save the files on a file server. The database has two tables: one for the file folders and access credentials, and one for the filenames. It is easy to maintain the database and the files. You can easily move the files, even across servers; you just need to modify the folders table. A sketch of that layout follows.
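A minimal sketch of that two-table layout; the schema is my own illustrative guess, and SQLite is used here for brevity even though the answer concerns MS SQL Server:

    import sqlite3

    db = sqlite3.connect("files.db")
    db.executescript("""
        -- One row per folder on the file server, with its access credentials.
        CREATE TABLE IF NOT EXISTS folders (
            id       INTEGER PRIMARY KEY,
            unc_path TEXT NOT NULL,   -- e.g. //fileserver01/share/invoices
            username TEXT,
            secret   TEXT
        );
        -- One row per file, pointing at its folder.
        CREATE TABLE IF NOT EXISTS files (
            id        INTEGER PRIMARY KEY,
            folder_id INTEGER NOT NULL REFERENCES folders(id),
            filename  TEXT NOT NULL
        );
    """)

    # Moving every file to another server is a single UPDATE on the folders table:
    db.execute("UPDATE folders SET unc_path = ? WHERE id = ?",
               ("//fileserver02/share/invoices", 1))
    db.commit()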






answered Nov 22 '13 at 5:27
– Feng

























