String Search for Consecutive Non Adjacent Characters in Postgres












0














I would like to replicate something like command-t or Sublime "Go to file" / CMD + P search in postgres. Where the search matches consecutive but not necessarily adjacent characters. Like so



enter image description here





  • Here's the library underlying command-t.

  • I saw trigrams but that is for portions of three consecutive letters.

  • I saw Levenshtein but that calculates distance not quite the non consecutive char behaviour I was hoping for.


What kind of fuzzy search is this called? If this is not possible in postgres, is it possible in elasticsearch or another DB?



😇Desired output



for example given that your query was "doc" I would expect the following ordering:



doc
d***o***c
d***oo**c
dpc
dppc
d***p***c
d
asdf
a




Update



Using ilike



with documents(title) as (
values
('d***o***c'),
('d***p***c'),
('d'),
('a'),
('dppc'),
('d***oo**c'),
('asdf'),
('dpc'),
('doc')
) select *
from documents
where title ilike '%d%o%c%';


😒Actual output



Doesn't order for similarity. With trigram index on title and 700k rows around 30ms.



   title
-----------
d***o***c
d***oo**c
doc


Using Similarity



select title, similarity(title, 'doc') as sml
from documents
where similarity(title, 'doc') > 0
order by sml desc
limit 50;


😒Actual Output



Doesn't prioritize non consecutive letters. With trigram index on title and 700k rows around 30ms



   title   |   sml
-----------+----------
doc | 1
d | 0.2
dpc | 0.142857
dppc | 0.125
d***o***c | 0.111111
d***p***c | 0.111111
d***oo**c | 0.1


Using Levenshtein



Prioritizes consecutive letters. With btree index on title and 700k rows around 89ms



select levenshtein(title, 'doc', 30, 1, 60), title
from documents
where levenshtein(title, 'doc', 30, 1, 60) <= 500
order by levenshtein(title, 'doc', 30, 1, 60)
limit 50;


🎉 Actual output



 levenshtein |   title
-------------+-----------
0 | doc
6 | d***o***c
6 | d***oo**c
31 | dpc
32 | dppc
37 | d***p***c
60 | d
63 | asdf
91 | a
(9 rows)


My timing results aren't particularly reliable as they fluctuated to up to 3 seconds on my local 16GB 2.5Ghz quad core cpu but they may help for relative estimates.



Overall levenshtein was the only one which produced the desired results but tended to be a lot slower, sometimes 5x slower than the other methods. I couldn't see any indexes which could have helped with the levenshtein method.










share|improve this question





























    0














    I would like to replicate something like command-t or Sublime "Go to file" / CMD + P search in postgres. Where the search matches consecutive but not necessarily adjacent characters. Like so



    enter image description here





    • Here's the library underlying command-t.

    • I saw trigrams but that is for portions of three consecutive letters.

    • I saw Levenshtein but that calculates distance not quite the non consecutive char behaviour I was hoping for.


    What kind of fuzzy search is this called? If this is not possible in postgres, is it possible in elasticsearch or another DB?



    😇Desired output



    for example given that your query was "doc" I would expect the following ordering:



    doc
    d***o***c
    d***oo**c
    dpc
    dppc
    d***p***c
    d
    asdf
    a




    Update



    Using ilike



    with documents(title) as (
    values
    ('d***o***c'),
    ('d***p***c'),
    ('d'),
    ('a'),
    ('dppc'),
    ('d***oo**c'),
    ('asdf'),
    ('dpc'),
    ('doc')
    ) select *
    from documents
    where title ilike '%d%o%c%';


    😒Actual output



    Doesn't order for similarity. With trigram index on title and 700k rows around 30ms.



       title
    -----------
    d***o***c
    d***oo**c
    doc


    Using Similarity



    select title, similarity(title, 'doc') as sml
    from documents
    where similarity(title, 'doc') > 0
    order by sml desc
    limit 50;


    😒Actual Output



    Doesn't prioritize non consecutive letters. With trigram index on title and 700k rows around 30ms



       title   |   sml
    -----------+----------
    doc | 1
    d | 0.2
    dpc | 0.142857
    dppc | 0.125
    d***o***c | 0.111111
    d***p***c | 0.111111
    d***oo**c | 0.1


    Using Levenshtein



    Prioritizes consecutive letters. With btree index on title and 700k rows around 89ms



    select levenshtein(title, 'doc', 30, 1, 60), title
    from documents
    where levenshtein(title, 'doc', 30, 1, 60) <= 500
    order by levenshtein(title, 'doc', 30, 1, 60)
    limit 50;


    🎉 Actual output



     levenshtein |   title
    -------------+-----------
    0 | doc
    6 | d***o***c
    6 | d***oo**c
    31 | dpc
    32 | dppc
    37 | d***p***c
    60 | d
    63 | asdf
    91 | a
    (9 rows)


    My timing results aren't particularly reliable as they fluctuated to up to 3 seconds on my local 16GB 2.5Ghz quad core cpu but they may help for relative estimates.



    Overall levenshtein was the only one which produced the desired results but tended to be a lot slower, sometimes 5x slower than the other methods. I couldn't see any indexes which could have helped with the levenshtein method.










    share|improve this question



























      0












      0








      0







      I would like to replicate something like command-t or Sublime "Go to file" / CMD + P search in postgres. Where the search matches consecutive but not necessarily adjacent characters. Like so



      enter image description here





      • Here's the library underlying command-t.

      • I saw trigrams but that is for portions of three consecutive letters.

      • I saw Levenshtein but that calculates distance not quite the non consecutive char behaviour I was hoping for.


      What kind of fuzzy search is this called? If this is not possible in postgres, is it possible in elasticsearch or another DB?



      😇Desired output



      for example given that your query was "doc" I would expect the following ordering:



      doc
      d***o***c
      d***oo**c
      dpc
      dppc
      d***p***c
      d
      asdf
      a




      Update



      Using ilike



      with documents(title) as (
      values
      ('d***o***c'),
      ('d***p***c'),
      ('d'),
      ('a'),
      ('dppc'),
      ('d***oo**c'),
      ('asdf'),
      ('dpc'),
      ('doc')
      ) select *
      from documents
      where title ilike '%d%o%c%';


      😒Actual output



      Doesn't order for similarity. With trigram index on title and 700k rows around 30ms.



         title
      -----------
      d***o***c
      d***oo**c
      doc


      Using Similarity



      select title, similarity(title, 'doc') as sml
      from documents
      where similarity(title, 'doc') > 0
      order by sml desc
      limit 50;


      😒Actual Output



      Doesn't prioritize non consecutive letters. With trigram index on title and 700k rows around 30ms



         title   |   sml
      -----------+----------
      doc | 1
      d | 0.2
      dpc | 0.142857
      dppc | 0.125
      d***o***c | 0.111111
      d***p***c | 0.111111
      d***oo**c | 0.1


      Using Levenshtein



      Prioritizes consecutive letters. With btree index on title and 700k rows around 89ms



      select levenshtein(title, 'doc', 30, 1, 60), title
      from documents
      where levenshtein(title, 'doc', 30, 1, 60) <= 500
      order by levenshtein(title, 'doc', 30, 1, 60)
      limit 50;


      🎉 Actual output



       levenshtein |   title
      -------------+-----------
      0 | doc
      6 | d***o***c
      6 | d***oo**c
      31 | dpc
      32 | dppc
      37 | d***p***c
      60 | d
      63 | asdf
      91 | a
      (9 rows)


      My timing results aren't particularly reliable as they fluctuated to up to 3 seconds on my local 16GB 2.5Ghz quad core cpu but they may help for relative estimates.



      Overall levenshtein was the only one which produced the desired results but tended to be a lot slower, sometimes 5x slower than the other methods. I couldn't see any indexes which could have helped with the levenshtein method.










      share|improve this question















      I would like to replicate something like command-t or Sublime "Go to file" / CMD + P search in postgres. Where the search matches consecutive but not necessarily adjacent characters. Like so



      enter image description here





      • Here's the library underlying command-t.

      • I saw trigrams but that is for portions of three consecutive letters.

      • I saw Levenshtein but that calculates distance not quite the non consecutive char behaviour I was hoping for.


      What kind of fuzzy search is this called? If this is not possible in postgres, is it possible in elasticsearch or another DB?



      😇Desired output



      for example given that your query was "doc" I would expect the following ordering:



      doc
      d***o***c
      d***oo**c
      dpc
      dppc
      d***p***c
      d
      asdf
      a




      Update



      Using ilike



      with documents(title) as (
      values
      ('d***o***c'),
      ('d***p***c'),
      ('d'),
      ('a'),
      ('dppc'),
      ('d***oo**c'),
      ('asdf'),
      ('dpc'),
      ('doc')
      ) select *
      from documents
      where title ilike '%d%o%c%';


      😒Actual output



      Doesn't order for similarity. With trigram index on title and 700k rows around 30ms.



         title
      -----------
      d***o***c
      d***oo**c
      doc


      Using Similarity



      select title, similarity(title, 'doc') as sml
      from documents
      where similarity(title, 'doc') > 0
      order by sml desc
      limit 50;


      😒Actual Output



      Doesn't prioritize non consecutive letters. With trigram index on title and 700k rows around 30ms



         title   |   sml
      -----------+----------
      doc | 1
      d | 0.2
      dpc | 0.142857
      dppc | 0.125
      d***o***c | 0.111111
      d***p***c | 0.111111
      d***oo**c | 0.1


      Using Levenshtein



      Prioritizes consecutive letters. With btree index on title and 700k rows around 89ms



      select levenshtein(title, 'doc', 30, 1, 60), title
      from documents
      where levenshtein(title, 'doc', 30, 1, 60) <= 500
      order by levenshtein(title, 'doc', 30, 1, 60)
      limit 50;


      🎉 Actual output



       levenshtein |   title
      -------------+-----------
      0 | doc
      6 | d***o***c
      6 | d***oo**c
      31 | dpc
      32 | dppc
      37 | d***p***c
      60 | d
      63 | asdf
      91 | a
      (9 rows)


      My timing results aren't particularly reliable as they fluctuated to up to 3 seconds on my local 16GB 2.5Ghz quad core cpu but they may help for relative estimates.



      Overall levenshtein was the only one which produced the desired results but tended to be a lot slower, sometimes 5x slower than the other methods. I couldn't see any indexes which could have helped with the levenshtein method.







      postgresql search fuzzy-search string-search






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 23 at 17:56

























      asked Nov 20 at 16:34









      david_adler

      2,25422046




      2,25422046





























          active

          oldest

          votes











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53397487%2fstring-search-for-consecutive-non-adjacent-characters-in-postgres%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown






























          active

          oldest

          votes













          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes
















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.





          Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


          Please pay close attention to the following guidance:


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53397487%2fstring-search-for-consecutive-non-adjacent-characters-in-postgres%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Costa Masnaga

          Fotorealismo

          Sidney Franklin