Use SHA vs MD5 or HASH in Snowflake DB












Let me preface this by saying that I am not using this to store passwords or any other sensitive info; I simply want a row-level SHA/hash that I can use later, or to check for unique records. My tables will be on the long side, in the range of 0.1 to 10 trillion rows.



I am using a Snowflake data warehouse, so my options are SHA1, SHA2, MD5 (each with a binary variant), and HASH.



I would like to minimize the chance of collisions (given the long tables) without burning my compute credits needlessly.
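
To get a rough sense of the collision risk at this scale, the birthday approximation can be sketched as below. This is only an estimate under stated assumptions: the digest widths in `WIDTHS` (64 bits for HASH, 128 for MD5, 160 for SHA1, 256 for SHA2) are assumptions to verify against the Snowflake documentation, the hash outputs are assumed to be uniformly distributed, and the function name is purely illustrative.

```python
import math


def collision_probability(n_rows: int, hash_bits: int) -> float:
    """Birthday approximation: P(>=1 collision) ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_rows * (n_rows - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small
    return -math.expm1(exponent)


# Assumed output widths in bits (check against the Snowflake docs)
WIDTHS = {"HASH": 64, "MD5": 128, "SHA1": 160, "SHA2": 256}

for name, bits in WIDTHS.items():
    for n in (10**11, 10**13):  # 0.1 and 10 trillion rows
        p = collision_probability(n, bits)
        print(f"{name:5s} rows={n:.0e} p(collision)~{p:.3e}")
```

Under these assumptions a 64-bit value is all but certain to collide somewhere in a table of this size, while the 128-bit and wider digests stay far below any practical concern.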



Which one is the best option given my use case?











  • I can't speak to HASH, but the speed of SHA1, SHA2 (SHA-256/SHA-512), and MD5 varies depending on implementation, hardware, and architecture (64-bit vs 32-bit). Can you run some simple experiments on the Snowflake platform to settle the performance part of your question?

    – Antonius Bloch
    Dec 19 '17 at 19:49
















hashing snowflake






asked Dec 19 '17 at 17:06









Serban Tanasa

1 Answer

No matter what you pick ...



Snowflake supports defining and maintaining constraints, but does not enforce them, except for NOT NULL constraints, which are always enforced.



https://docs.snowflake.net/manuals/sql-reference/constraints-overview.html



So you have to... Given how Snowflake structures its tables and how micro-partitions are pruned and scanned, I suspect something like the query below could get very slow.
It will most likely scan the entire table for the rows inserted.



    INSERT INTO T
    SELECT * FROM (SELECT ... UNION SELECT ... UNION SELECT ... UNION ...) x
    WHERE x.hash NOT IN (SELECT hash FROM T)


Using clustering keys may speed up the uniqueness check, but at the cost of many more data writes.



With native clustering you will need to write something closer to this:



    INSERT INTO T
    SELECT s.*
    FROM (SELECT ... UNION SELECT ... UNION SELECT ... UNION ...) s
    LEFT JOIN T t1
      -- f1, f2, ... are part of a natural unique key
      ON s.f1 = t1.f1
      AND s.f2 = t1.f2
      ...
      AND s.hash = t1.hash
    -- keep only the source rows that found no match in T
    WHERE t1.hash IS NULL
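
The left-join anti-join pattern above can be tried on a small scale. The following is a minimal sketch, not Snowflake code: it uses Python's built-in `sqlite3` as a local stand-in, and the table names (`t`, `staging`), the columns (`f1`, `f2`, `payload`), and the `row_hash` helper with its `|` separator are all hypothetical choices made for illustration.

```python
import hashlib
import sqlite3


def row_hash(*fields: str) -> str:
    """MD5 hex digest of the fields joined with '|' (hypothetical scheme)."""
    return hashlib.md5("|".join(fields).encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")  # local stand-in, not Snowflake
conn.execute("CREATE TABLE t (f1 TEXT, f2 TEXT, payload TEXT, hash TEXT)")
conn.execute("CREATE TABLE staging (f1 TEXT, f2 TEXT, payload TEXT, hash TEXT)")


def load(table: str, rows: list) -> None:
    """Insert rows into the given table, computing the row hash on the way in."""
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?, ?, ?)",
        [(f1, f2, p, row_hash(f1, f2, p)) for f1, f2, p in rows],
    )


load("t", [("a", "1", "x"), ("b", "2", "y")])
load("staging", [("b", "2", "y"), ("c", "3", "z")])  # one duplicate, one new row

# Anti-join: insert only the staging rows that have no match in t
conn.execute(
    """
    INSERT INTO t
    SELECT s.*
    FROM staging AS s
    LEFT JOIN t AS t1
      ON s.f1 = t1.f1
     AND s.f2 = t1.f2
     AND s.hash = t1.hash
    WHERE t1.hash IS NULL
    """
)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
distinct_hashes = conn.execute("SELECT COUNT(DISTINCT hash) FROM t").fetchone()[0]
print(row_count, distinct_hashes)
```

Joining on the natural key columns as well as the hash is what would let a clustered table prune partitions; the sketch only checks that the logic skips the duplicate row and says nothing about Snowflake performance.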


Good Luck






        edited Aug 29 '18 at 22:31









        Anthony Genovese


        answered Aug 29 '18 at 18:20









Brian S
