String Search for Consecutive Non Adjacent Characters in Postgres

I would like to replicate something like command-t or Sublime "Go to file" / CMD + P search in postgres. Where the search matches consecutive but not necessarily adjacent characters. Like so

enter image description here

Here's the library underlying command-t.

I saw trigrams but that is for portions of three consecutive letters.

I saw Levenshtein but that calculates distance not quite the non consecutive char behaviour I was hoping for.

What kind of fuzzy search is this called? If this is not possible in postgres, is it possible in elasticsearch or another DB?

😇Desired output

for example given that your query was "doc" I would expect the following ordering:

doc

d***o***c

d***oo**c

dpc

dppc

d***p***c

d

asdf

a

Update

Using `ilike`

with documents(title) as (

values

    ('d***o***c'),

    ('d***p***c'),

    ('d'),

    ('a'),

    ('dppc'),

    ('d***oo**c'),

    ('asdf'),

    ('dpc'),

    ('doc')

) select *

from documents

where title ilike '%d%o%c%';

😒Actual output

Doesn't order for similarity. With trigram index on title and 700k rows around 30ms.

   title

-----------

 d***o***c

 d***oo**c

 doc

Using Similarity

select title, similarity(title, 'doc') as sml

    from documents

    where similarity(title, 'doc') > 0

    order by sml desc

    limit 50;

😒Actual Output

Doesn't prioritize non consecutive letters. With trigram index on title and 700k rows around 30ms

   title   |   sml

-----------+----------

 doc       |        1

 d         |      0.2

 dpc       | 0.142857

 dppc      |    0.125

 d***o***c | 0.111111

 d***p***c | 0.111111

 d***oo**c |      0.1

Using Levenshtein

Prioritizes consecutive letters. With btree index on title and 700k rows around 89ms

select levenshtein(title, 'doc', 30, 1, 60), title

    from documents

    where levenshtein(title, 'doc', 30, 1, 60) <= 500

    order by levenshtein(title, 'doc', 30, 1, 60)

    limit 50;

🎉 Actual output

 levenshtein |   title

-------------+-----------

           0 | doc

           6 | d***o***c

           6 | d***oo**c

          31 | dpc

          32 | dppc

          37 | d***p***c

          60 | d

          63 | asdf

          91 | a

(9 rows)

My timing results aren't particularly reliable as they fluctuated to up to 3 seconds on my local 16GB 2.5Ghz quad core cpu but they may help for relative estimates.

Overall levenshtein was the only one which produced the desired results but tended to be a lot slower, sometimes 5x slower than the other methods. I couldn't see any indexes which could have helped with the levenshtein method.

edited Nov 23 at 17:56

asked Nov 20 at 16:34

david_adler

2,25422046

add a comment |

I would like to replicate something like command-t or Sublime "Go to file" / CMD + P search in postgres. Where the search matches consecutive but not necessarily adjacent characters. Like so

enter image description here

Here's the library underlying command-t.

I saw trigrams but that is for portions of three consecutive letters.

I saw Levenshtein but that calculates distance not quite the non consecutive char behaviour I was hoping for.

What kind of fuzzy search is this called? If this is not possible in postgres, is it possible in elasticsearch or another DB?

😇Desired output

for example given that your query was "doc" I would expect the following ordering:

doc

d***o***c

d***oo**c

dpc

dppc

d***p***c

d

asdf

a

Update

Using `ilike`

with documents(title) as (

values

    ('d***o***c'),

    ('d***p***c'),

    ('d'),

    ('a'),

    ('dppc'),

    ('d***oo**c'),

    ('asdf'),

    ('dpc'),

    ('doc')

) select *

from documents

where title ilike '%d%o%c%';

😒Actual output

Doesn't order for similarity. With trigram index on title and 700k rows around 30ms.

   title

-----------

 d***o***c

 d***oo**c

 doc

Using Similarity

select title, similarity(title, 'doc') as sml

    from documents

    where similarity(title, 'doc') > 0

    order by sml desc

    limit 50;

😒Actual Output

Doesn't prioritize non consecutive letters. With trigram index on title and 700k rows around 30ms

   title   |   sml

-----------+----------

 doc       |        1

 d         |      0.2

 dpc       | 0.142857

 dppc      |    0.125

 d***o***c | 0.111111

 d***p***c | 0.111111

 d***oo**c |      0.1

Using Levenshtein

Prioritizes consecutive letters. With btree index on title and 700k rows around 89ms

select levenshtein(title, 'doc', 30, 1, 60), title

    from documents

    where levenshtein(title, 'doc', 30, 1, 60) <= 500

    order by levenshtein(title, 'doc', 30, 1, 60)

    limit 50;

🎉 Actual output

 levenshtein |   title

-------------+-----------

           0 | doc

           6 | d***o***c

           6 | d***oo**c

          31 | dpc

          32 | dppc

          37 | d***p***c

          60 | d

          63 | asdf

          91 | a

(9 rows)

My timing results aren't particularly reliable as they fluctuated to up to 3 seconds on my local 16GB 2.5Ghz quad core cpu but they may help for relative estimates.

edited Nov 23 at 17:56

asked Nov 20 at 16:34

david_adler

2,25422046

add a comment |

I would like to replicate something like command-t or Sublime "Go to file" / CMD + P search in postgres. Where the search matches consecutive but not necessarily adjacent characters. Like so

enter image description here

Here's the library underlying command-t.

I saw trigrams but that is for portions of three consecutive letters.

I saw Levenshtein but that calculates distance not quite the non consecutive char behaviour I was hoping for.

What kind of fuzzy search is this called? If this is not possible in postgres, is it possible in elasticsearch or another DB?

😇Desired output

for example given that your query was "doc" I would expect the following ordering:

doc

d***o***c

d***oo**c

dpc

dppc

d***p***c

d

asdf

a

Update

Using `ilike`

with documents(title) as (

values

    ('d***o***c'),

    ('d***p***c'),

    ('d'),

    ('a'),

    ('dppc'),

    ('d***oo**c'),

    ('asdf'),

    ('dpc'),

    ('doc')

) select *

from documents

where title ilike '%d%o%c%';

😒Actual output

Doesn't order for similarity. With trigram index on title and 700k rows around 30ms.

   title

-----------

 d***o***c

 d***oo**c

 doc

Using Similarity

select title, similarity(title, 'doc') as sml

    from documents

    where similarity(title, 'doc') > 0

    order by sml desc

    limit 50;

😒Actual Output

Doesn't prioritize non consecutive letters. With trigram index on title and 700k rows around 30ms

   title   |   sml

-----------+----------

 doc       |        1

 d         |      0.2

 dpc       | 0.142857

 dppc      |    0.125

 d***o***c | 0.111111

 d***p***c | 0.111111

 d***oo**c |      0.1

Using Levenshtein

Prioritizes consecutive letters. With btree index on title and 700k rows around 89ms

select levenshtein(title, 'doc', 30, 1, 60), title

    from documents

    where levenshtein(title, 'doc', 30, 1, 60) <= 500

    order by levenshtein(title, 'doc', 30, 1, 60)

    limit 50;

🎉 Actual output

 levenshtein |   title

-------------+-----------

           0 | doc

           6 | d***o***c

           6 | d***oo**c

          31 | dpc

          32 | dppc

          37 | d***p***c

          60 | d

          63 | asdf

          91 | a

(9 rows)

My timing results aren't particularly reliable as they fluctuated to up to 3 seconds on my local 16GB 2.5Ghz quad core cpu but they may help for relative estimates.

edited Nov 23 at 17:56

asked Nov 20 at 16:34

david_adler

2,25422046

I would like to replicate something like command-t or Sublime "Go to file" / CMD + P search in postgres. Where the search matches consecutive but not necessarily adjacent characters. Like so

enter image description here

Here's the library underlying command-t.

I saw trigrams but that is for portions of three consecutive letters.

I saw Levenshtein but that calculates distance not quite the non consecutive char behaviour I was hoping for.

What kind of fuzzy search is this called? If this is not possible in postgres, is it possible in elasticsearch or another DB?

😇Desired output

for example given that your query was "doc" I would expect the following ordering:

doc

d***o***c

d***oo**c

dpc

dppc

d***p***c

d

asdf

a

Update

Using `ilike`

with documents(title) as (

values

    ('d***o***c'),

    ('d***p***c'),

    ('d'),

    ('a'),

    ('dppc'),

    ('d***oo**c'),

    ('asdf'),

    ('dpc'),

    ('doc')

) select *

from documents

where title ilike '%d%o%c%';

😒Actual output

Doesn't order for similarity. With trigram index on title and 700k rows around 30ms.

   title

-----------

 d***o***c

 d***oo**c

 doc

Using Similarity

select title, similarity(title, 'doc') as sml

    from documents

    where similarity(title, 'doc') > 0

    order by sml desc

    limit 50;

😒Actual Output

Doesn't prioritize non consecutive letters. With trigram index on title and 700k rows around 30ms

   title   |   sml

-----------+----------

 doc       |        1

 d         |      0.2

 dpc       | 0.142857

 dppc      |    0.125

 d***o***c | 0.111111

 d***p***c | 0.111111

 d***oo**c |      0.1

Using Levenshtein

Prioritizes consecutive letters. With btree index on title and 700k rows around 89ms

select levenshtein(title, 'doc', 30, 1, 60), title

    from documents

    where levenshtein(title, 'doc', 30, 1, 60) <= 500

    order by levenshtein(title, 'doc', 30, 1, 60)

    limit 50;

🎉 Actual output

 levenshtein |   title

-------------+-----------

           0 | doc

           6 | d***o***c

           6 | d***oo**c

          31 | dpc

          32 | dppc

          37 | d***p***c

          60 | d

          63 | asdf

          91 | a

(9 rows)

My timing results aren't particularly reliable as they fluctuated to up to 3 seconds on my local 16GB 2.5Ghz quad core cpu but they may help for relative estimates.

postgresql search fuzzy-search string-search

edited Nov 23 at 17:56

asked Nov 20 at 16:34

david_adler

2,25422046

edited Nov 23 at 17:56

asked Nov 20 at 16:34

david_adler

2,25422046

edited Nov 23 at 17:56

asked Nov 20 at 16:34

david_adler

2,25422046

asked Nov 20 at 16:34

david_adler

2,25422046

asked Nov 20 at 16:34

david_adler

2,25422046

add a comment |

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53397487%2fstring-search-for-consecutive-non-adjacent-characters-in-postgres%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

active

oldest

votes

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

String Search for Consecutive Non Adjacent Characters in Postgres

😇Desired output

Update

Using ilike

😒Actual output

Using Similarity

😒Actual Output

Using Levenshtein

🎉 Actual output

😇Desired output

Update

Using ilike

😒Actual output

Using Similarity

😒Actual Output

Using Levenshtein

🎉 Actual output

😇Desired output

Update

Using ilike

😒Actual output

Using Similarity

😒Actual Output

Using Levenshtein

🎉 Actual output

😇Desired output

Update

Using ilike

😒Actual output

Using Similarity

😒Actual Output

Using Levenshtein

🎉 Actual output

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

Ottavio Pratesi

Tricia Helfer

15 giugno

Using `ilike`

Using `ilike`

Using `ilike`

Using `ilike`