String Search for Consecutive Non Adjacent Characters in Postgres
I would like to replicate something like command-t or Sublime "Go to file" / CMD + P
search in postgres. Where the search matches consecutive but not necessarily adjacent characters. Like so
Here's the library underlying command-t.- I saw trigrams but that is for portions of three consecutive letters.
- I saw Levenshtein but that calculates distance not quite the non consecutive char behaviour I was hoping for.
What kind of fuzzy search is this called? If this is not possible in postgres, is it possible in elasticsearch or another DB?
😇Desired output
for example given that your query was "doc" I would expect the following ordering:
doc
d***o***c
d***oo**c
dpc
dppc
d***p***c
d
asdf
a
Update
Using ilike
with documents(title) as (
values
('d***o***c'),
('d***p***c'),
('d'),
('a'),
('dppc'),
('d***oo**c'),
('asdf'),
('dpc'),
('doc')
) select *
from documents
where title ilike '%d%o%c%';
😒Actual output
Doesn't order for similarity. With trigram index on title
and 700k rows around 30ms.
title
-----------
d***o***c
d***oo**c
doc
Using Similarity
select title, similarity(title, 'doc') as sml
from documents
where similarity(title, 'doc') > 0
order by sml desc
limit 50;
😒Actual Output
Doesn't prioritize non consecutive letters. With trigram index on title
and 700k rows around 30ms
title | sml
-----------+----------
doc | 1
d | 0.2
dpc | 0.142857
dppc | 0.125
d***o***c | 0.111111
d***p***c | 0.111111
d***oo**c | 0.1
Using Levenshtein
Prioritizes consecutive letters. With btree index on title
and 700k rows around 89ms
select levenshtein(title, 'doc', 30, 1, 60), title
from documents
where levenshtein(title, 'doc', 30, 1, 60) <= 500
order by levenshtein(title, 'doc', 30, 1, 60)
limit 50;
🎉 Actual output
levenshtein | title
-------------+-----------
0 | doc
6 | d***o***c
6 | d***oo**c
31 | dpc
32 | dppc
37 | d***p***c
60 | d
63 | asdf
91 | a
(9 rows)
My timing
results aren't particularly reliable as they fluctuated to up to 3 seconds on my local 16GB 2.5Ghz quad core cpu but they may help for relative estimates.
Overall levenshtein
was the only one which produced the desired results but tended to be a lot slower, sometimes 5x slower than the other methods. I couldn't see any indexes which could have helped with the levenshtein
method.
postgresql search fuzzy-search string-search
add a comment |
I would like to replicate something like command-t or Sublime "Go to file" / CMD + P
search in postgres. Where the search matches consecutive but not necessarily adjacent characters. Like so
Here's the library underlying command-t.- I saw trigrams but that is for portions of three consecutive letters.
- I saw Levenshtein but that calculates distance not quite the non consecutive char behaviour I was hoping for.
What kind of fuzzy search is this called? If this is not possible in postgres, is it possible in elasticsearch or another DB?
😇Desired output
for example given that your query was "doc" I would expect the following ordering:
doc
d***o***c
d***oo**c
dpc
dppc
d***p***c
d
asdf
a
Update
Using ilike
with documents(title) as (
values
('d***o***c'),
('d***p***c'),
('d'),
('a'),
('dppc'),
('d***oo**c'),
('asdf'),
('dpc'),
('doc')
) select *
from documents
where title ilike '%d%o%c%';
😒Actual output
Doesn't order for similarity. With trigram index on title
and 700k rows around 30ms.
title
-----------
d***o***c
d***oo**c
doc
Using Similarity
select title, similarity(title, 'doc') as sml
from documents
where similarity(title, 'doc') > 0
order by sml desc
limit 50;
😒Actual Output
Doesn't prioritize non consecutive letters. With trigram index on title
and 700k rows around 30ms
title | sml
-----------+----------
doc | 1
d | 0.2
dpc | 0.142857
dppc | 0.125
d***o***c | 0.111111
d***p***c | 0.111111
d***oo**c | 0.1
Using Levenshtein
Prioritizes consecutive letters. With btree index on title
and 700k rows around 89ms
select levenshtein(title, 'doc', 30, 1, 60), title
from documents
where levenshtein(title, 'doc', 30, 1, 60) <= 500
order by levenshtein(title, 'doc', 30, 1, 60)
limit 50;
🎉 Actual output
levenshtein | title
-------------+-----------
0 | doc
6 | d***o***c
6 | d***oo**c
31 | dpc
32 | dppc
37 | d***p***c
60 | d
63 | asdf
91 | a
(9 rows)
My timing
results aren't particularly reliable as they fluctuated to up to 3 seconds on my local 16GB 2.5Ghz quad core cpu but they may help for relative estimates.
Overall levenshtein
was the only one which produced the desired results but tended to be a lot slower, sometimes 5x slower than the other methods. I couldn't see any indexes which could have helped with the levenshtein
method.
postgresql search fuzzy-search string-search
add a comment |
I would like to replicate something like command-t or Sublime "Go to file" / CMD + P
search in postgres. Where the search matches consecutive but not necessarily adjacent characters. Like so
Here's the library underlying command-t.- I saw trigrams but that is for portions of three consecutive letters.
- I saw Levenshtein but that calculates distance not quite the non consecutive char behaviour I was hoping for.
What kind of fuzzy search is this called? If this is not possible in postgres, is it possible in elasticsearch or another DB?
😇Desired output
for example given that your query was "doc" I would expect the following ordering:
doc
d***o***c
d***oo**c
dpc
dppc
d***p***c
d
asdf
a
Update
Using ilike
with documents(title) as (
values
('d***o***c'),
('d***p***c'),
('d'),
('a'),
('dppc'),
('d***oo**c'),
('asdf'),
('dpc'),
('doc')
) select *
from documents
where title ilike '%d%o%c%';
😒Actual output
Doesn't order for similarity. With trigram index on title
and 700k rows around 30ms.
title
-----------
d***o***c
d***oo**c
doc
Using Similarity
select title, similarity(title, 'doc') as sml
from documents
where similarity(title, 'doc') > 0
order by sml desc
limit 50;
😒Actual Output
Doesn't prioritize non consecutive letters. With trigram index on title
and 700k rows around 30ms
title | sml
-----------+----------
doc | 1
d | 0.2
dpc | 0.142857
dppc | 0.125
d***o***c | 0.111111
d***p***c | 0.111111
d***oo**c | 0.1
Using Levenshtein
Prioritizes consecutive letters. With btree index on title
and 700k rows around 89ms
select levenshtein(title, 'doc', 30, 1, 60), title
from documents
where levenshtein(title, 'doc', 30, 1, 60) <= 500
order by levenshtein(title, 'doc', 30, 1, 60)
limit 50;
🎉 Actual output
levenshtein | title
-------------+-----------
0 | doc
6 | d***o***c
6 | d***oo**c
31 | dpc
32 | dppc
37 | d***p***c
60 | d
63 | asdf
91 | a
(9 rows)
My timing
results aren't particularly reliable as they fluctuated to up to 3 seconds on my local 16GB 2.5Ghz quad core cpu but they may help for relative estimates.
Overall levenshtein
was the only one which produced the desired results but tended to be a lot slower, sometimes 5x slower than the other methods. I couldn't see any indexes which could have helped with the levenshtein
method.
postgresql search fuzzy-search string-search
I would like to replicate something like command-t or Sublime "Go to file" / CMD + P
search in postgres. Where the search matches consecutive but not necessarily adjacent characters. Like so
Here's the library underlying command-t.- I saw trigrams but that is for portions of three consecutive letters.
- I saw Levenshtein but that calculates distance not quite the non consecutive char behaviour I was hoping for.
What kind of fuzzy search is this called? If this is not possible in postgres, is it possible in elasticsearch or another DB?
😇Desired output
for example given that your query was "doc" I would expect the following ordering:
doc
d***o***c
d***oo**c
dpc
dppc
d***p***c
d
asdf
a
Update
Using ilike
with documents(title) as (
values
('d***o***c'),
('d***p***c'),
('d'),
('a'),
('dppc'),
('d***oo**c'),
('asdf'),
('dpc'),
('doc')
) select *
from documents
where title ilike '%d%o%c%';
😒Actual output
Doesn't order for similarity. With trigram index on title
and 700k rows around 30ms.
title
-----------
d***o***c
d***oo**c
doc
Using Similarity
select title, similarity(title, 'doc') as sml
from documents
where similarity(title, 'doc') > 0
order by sml desc
limit 50;
😒Actual Output
Doesn't prioritize non consecutive letters. With trigram index on title
and 700k rows around 30ms
title | sml
-----------+----------
doc | 1
d | 0.2
dpc | 0.142857
dppc | 0.125
d***o***c | 0.111111
d***p***c | 0.111111
d***oo**c | 0.1
Using Levenshtein
Prioritizes consecutive letters. With btree index on title
and 700k rows around 89ms
select levenshtein(title, 'doc', 30, 1, 60), title
from documents
where levenshtein(title, 'doc', 30, 1, 60) <= 500
order by levenshtein(title, 'doc', 30, 1, 60)
limit 50;
🎉 Actual output
levenshtein | title
-------------+-----------
0 | doc
6 | d***o***c
6 | d***oo**c
31 | dpc
32 | dppc
37 | d***p***c
60 | d
63 | asdf
91 | a
(9 rows)
My timing
results aren't particularly reliable as they fluctuated to up to 3 seconds on my local 16GB 2.5Ghz quad core cpu but they may help for relative estimates.
Overall levenshtein
was the only one which produced the desired results but tended to be a lot slower, sometimes 5x slower than the other methods. I couldn't see any indexes which could have helped with the levenshtein
method.
postgresql search fuzzy-search string-search
postgresql search fuzzy-search string-search
edited Nov 23 at 17:56
asked Nov 20 at 16:34
david_adler
2,25422046
2,25422046
add a comment |
add a comment |
active
oldest
votes
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53397487%2fstring-search-for-consecutive-non-adjacent-characters-in-postgres%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
active
oldest
votes
active
oldest
votes
active
oldest
votes
active
oldest
votes
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53397487%2fstring-search-for-consecutive-non-adjacent-characters-in-postgres%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown