Regular expression to match sentence containing specific words and discarding the match if it contains...
The problem is as the title says. Is it even possible?
For example, I have two words that I search for: apple, orange
And a word that makes the whole sentence wrong: box
So the expression should accept this sentence:
One orange and one apple
but discard this one:
orange and apple within a box
I've been thinking about that for some time but I can't find any solution.
regex
What language are you using?
– Andy Lester
Nov 17 at 22:49
asked Nov 17 at 19:09
PatrykTraveler
356
2 Answers
accepted
You can use a positive lookahead to match strings that contain either apple or orange, like this:
(?=.*(orange|apple))
and a negative lookahead to discard the match if the string contains box, like this:
(?!.*box)
Hence the overall regex becomes:
^(?=.*(orange|apple))(?!.*box).*$
If you tell me which language you are using, I should be able to help with sample code as well.
Edit:
In case you are using Python, today's hottest language (though my main one is Java), here is sample code for the same:
import re

# The first three sentences should match; the rest should not.
strArr = ['One orange and one apple', 'One apple', 'One orange',
          'orange and apple within a box', 'One apple and box', 'One orange and box',
          "This contains none of accepted words so it doesn't match"]
for x in strArr:
    print(x + ' --> ', end="")
    print(bool(re.match('^(?=.*(orange|apple))(?!.*box).*$', x)))
First, this is possible, using negative lookahead. However, it can be too expensive to be useful. This is the kind of thing you do to satisfy a homework assignment or to work around some kind of stupid limit imposed by a system you are abusing.
That said, consider something like:
I want to find the word "orange" anywhere in my string.
You can typically take advantage of regex searching by doing something like:
/orange/
But you can also tie your search to the beginning of the string by inserting "match any" patterns before your word:
/^.*orange/
(Note that neither example requires orange to be a word at present. Something like "storange" would match. Save that for later.)
You can do the same thing with apple, but how can you tie them together?
One easy way, that works in a lot of engines but might not perform well, is simply to spell out both possibilities:
I want to find the word "orange" followed by any number of characters followed by the word "apple" OR the word "apple" followed by any number of characters followed by the word "orange".
That's an alternation, which is | (vertical bar) in regex. Sometimes you may need to escape the vertical bar for the regex engine (basic vs. extended). Some other times you may have to escape it for the command line parser. So depending on how you are using your regex, you might have to write | or \\| or something in between.
But, the sub-patterns are simple:
/orange.*apple/
/apple.*orange/
So first you alternate them in a group. Use a non-capturing group, (?: ... ), if your engine supports it (check your docs); the examples here use a plain capturing group:
/(orange.*apple|apple.*orange)/
Then prepend the "tie to start of string" at the front:
/^.*(orange.*apple|apple.*orange)/
Now you can match text that contains both words in either order.
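As a rough illustration, here is how that pattern might be tried out in Python. The question never names a language, so Python is an assumption, and the helper name contains_both is made up for this sketch.

```python
import re

# Pattern from the answer: both words must appear, in either order.
BOTH_WORDS = re.compile(r"^.*(orange.*apple|apple.*orange)")

def contains_both(text: str) -> bool:
    """Return True if text contains 'orange' and 'apple' in any order."""
    return BOTH_WORDS.search(text) is not None

for sentence in ["One orange and one apple",
                 "One apple, then an orange",
                 "One apple only"]:
    print(sentence, "-->", contains_both(sentence))
```

Note that . does not match a newline by default in Python, so this treats each line as its own unit.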
Finally, you can harness the power of negative lookahead to block the word "box". Use the special syntax for this, which may vary but is probably something close to (?! ... ) (where ... is "box" in our case).
I don't want to be looking at the word "box" next.
As a regex, that is:
/(?!box)/
But in your case, you want to say:
I don't want to be looking at the word "box" anywhere in the following text.
That calls for another "match any" pattern inside the lookahead:
/(?!.*box)/
Now, how can you use this with your existing pattern? Lookahead (and "lookbehind") are both zero-width assertions. This means that they can fail, because they are assertions, but they consume zero input characters (zero-width). So all you have to do is pay attention to where you put them, since they make their assertion exactly in whatever place they correspond to.
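To make the "placement matters" point concrete, here is a small Python sketch (Python is an assumption; the sample text is made up). Without an anchor, the engine is free to try the assertion at every position, and near the end of the string there is no "box" left ahead, so the assertion succeeds there.

```python
import re

text = "apple in a box"

# Unanchored: the lookahead is tried at each position in turn.
# Once the search has moved past the "b" of "box", nothing ahead
# matches .*box any more, so the negative assertion succeeds.
unanchored = re.search(r"(?!.*box)", text)
print(unanchored.start(), unanchored.end())  # zero-width: start == end

# Anchored: the assertion is only checked at position 0,
# where "box" is still ahead, so the whole search fails.
anchored = re.search(r"^(?!.*box)", text)
print(anchored)
```

The unanchored match is empty (it consumes no characters), which is exactly what "zero-width" means.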
For this scenario, I think you want to make one simple assertion right at the start: "the word box does not appear" and then proceed to your other matching:
I want to find a line that does not have the word "box", but that contains ... apple ... orange, etc.
You can do that by dropping the lookahead right after the anchor to the start:
/^(?!.*box).*(apple.*orange|orange.*apple)/
This translates to:
At start of string,
- confirm "box" does not appear in the line
- match any character any number of times,
- then either
  - match "apple",
  - followed by any chars, any number of times
  - then "orange"
- or
  - match "orange"
  - followed by any chars, any number of times
  - then "apple"
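The breakdown above can be checked against the question's two sentences with a short Python sketch. Python and the helper name accepts are assumptions of this example, not something the question specifies.

```python
import re

# Negative lookahead right after the anchor, then both words in either order.
PATTERN = re.compile(r"^(?!.*box).*(apple.*orange|orange.*apple)")

def accepts(sentence: str) -> bool:
    """True if the sentence has both words and does not contain 'box'."""
    return PATTERN.search(sentence) is not None

samples = [
    "One orange and one apple",       # both words, no box
    "orange and apple within a box",  # both words, but box is present
    "One apple",                      # only one of the words
]
for s in samples:
    print(f"{s!r} --> {accepts(s)}")
```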
There are a couple of other ways to approach this. But you need to be aware of performance. When you do a lookahead, you are inviting another scan of the string. So if you have a * or + in your lookahead, you could be re-scanning the same text over and over again. That slows you down, which is why I recommend putting the lookahead right at the start. You'll either succeed once, or fail immediately.
Likewise, the .* before and between your words is a potential problem. Modern engines are usually smart enough to deal with this, but some database engines aren't very smart. Beware of this: do some performance tests, with missing words as well as duplicate words (apple ... apple ... orange, apple ... orange ... orange) to make sure performance is okay. (In this case, '...' means 200 random words.)
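One way such a test could be set up, sketched in Python with the standard timeit module. The filler generator, its three-letter alphabet, and the iteration count are all arbitrary choices for this example; the timings will differ by engine and machine, so no numbers are claimed here.

```python
import random
import re
import timeit

PATTERN = re.compile(r"^(?!.*box).*(apple.*orange|orange.*apple)")

def filler(n_words: int = 200, seed: int = 0) -> str:
    """Build n_words random 'words' from letters that cannot spell the search terms."""
    rng = random.Random(seed)
    return " ".join("".join(rng.choice("qwz") for _ in range(5))
                    for _ in range(n_words))

gap = filler()
cases = {
    "both words": f"apple {gap} orange",
    "missing word": f"apple {gap} {gap}",
    "duplicate first": f"apple {gap} apple {gap} orange",
    "duplicate second": f"apple {gap} orange {gap} orange",
}

for name, text in cases.items():
    # Time 200 searches of the same text; compare cases, not absolute values.
    seconds = timeit.timeit(lambda: PATTERN.search(text), number=200)
    print(f"{name:17s} matched={bool(PATTERN.search(text))} time={seconds:.4f}s")
```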
Finally, consider how much you want the words to be words. There is a special syntax for that in regex, which may not be present or may vary by engine. Typically, a word boundary assertion is spelled \b, like \bapple\b but you may have to write \yapple\y, \mapple\M, \<apple\>, or even [[:<:]]apple[[:>:]]. Check your documentation.
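In Python's re module the boundary is spelled \b, so a whole-word variant of the earlier pattern might look like the sketch below. Python is an assumption here, and the names loose and strict are just labels for this example.

```python
import re

# Substring version: "pineapple" and "storange" count as hits.
loose = re.compile(r"^(?!.*box).*(apple.*orange|orange.*apple)")

# Whole-word version: every word is wrapped in \b ... \b.
strict = re.compile(
    r"^(?!.*\bbox\b).*(\bapple\b.*\borange\b|\borange\b.*\bapple\b)"
)

for text in ["a pineapple next to storange",
             "apple and orange in boxes"]:
    print(f"{text!r}: loose={bool(loose.search(text))} "
          f"strict={bool(strict.search(text))}")
```

The two sample sentences go opposite ways: the first only matches the loose pattern, while the second only matches the strict one, because "boxes" is not the word "box".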
Also, consider that positive lookahead is another way to deal with alternation when the alternates differ only in word order. Instead of the apple.*orange|orange.*apple construction, you might just use two positive lookahead expressions at the start of the pattern. This has definite performance implications, since the two expressions imply two scans through the text. It does simplify the construction of the regex, which might matter if you want more than two words, and especially if you want to generate the pattern programmatically:
/^(?!.*box)(?=.*apple)(?=.*orange)./
The . at the end is just to force a single character to participate. This expression says:
I want a line that does not hold the word "box", does hold "apple", and does hold "orange".
You can see how to extend this with more words, but note that each time you do ?=.* you're re-scanning the text. If your text items are 80 characters or less you may not care, but if you are searching thousands of characters for words that are likely to be just a few characters apart, the previous version will perform better.
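Generating the lookahead version programmatically could look roughly like this in Python. The function build_pattern and its parameter names are hypothetical, and re.escape is used so that words containing regex metacharacters are treated literally.

```python
import re

def build_pattern(required, forbidden):
    """Compose one lookahead per word: negative for forbidden, positive for required."""
    parts = ["^"]
    parts += [f"(?!.*{re.escape(word)})" for word in forbidden]
    parts += [f"(?=.*{re.escape(word)})" for word in required]
    parts.append(".")  # force at least one character to participate
    return re.compile("".join(parts))

pattern = build_pattern(required=["apple", "orange"], forbidden=["box"])
print(pattern.pattern)

for sentence in ["One orange and one apple",
                 "orange and apple within a box"]:
    print(sentence, "-->", bool(pattern.search(sentence)))
```

Adding a third required word is then just a longer list, at the cost of one more scan per lookahead as noted above.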
edited Nov 17 at 19:43
answered Nov 17 at 19:28
Pushpesh Kumar Rajwanshi
2,6701821
2,6701821
add a comment |
add a comment |
up vote
1
down vote
First this is possible, using negative lookahead. However, it is way too expensive to be useful. This is the kind of thing you do to satisfy a homework assignment or work around some kind of stupid limit imposed by a system you are abusing.
That said, consider something like:
I want to find the word "orange" anywhere in my string.
You can typically take advantage of regex searching by doing something like:
/orange/
But you can also tie your search to the beginning of the string by inserting "match any" patterns before your word:
/^.*orange/
(Note that neither example requires orange to be a word at present. Something like "storange" would match. Save that for later.)
You can do the same thing with apple, but how can you tie them together?
One easy way, that works in a lot of engines but might not perform well, is simply to spell out both possibilities:
I want to find the word "orange" followed by any number of characters followed by the word "apple" OR the word "apple" followed by any number of characters followed by the word "orange".
That's an alternation, which is |
(vertical bar) in regex. Sometimes you may need to escape the vertical bar for the regex engine (basic vs. extended). Some other times you may have to escape it for the command line parser. So depending on how you are using your regex, you might have to write |
or \\|
or something in between.
But, the sub-patterns are simple:
/orange.*apple/
/apple.*orange/
So first you alternate them in a non-capturing group (if possible! Check your docs, use a capturing group if you must.) like so:
/(orange.*apple|apple.*orange)/
Then prepend the "tie to start of string" at the front:
/^.*(orange.*apple|apple.*orange)/
Now you can match text that contains both words in either order.
Finally, you can harness the power of negative lookahead to block the word "box". Use the special syntax for this, which may vary but is probably something close to (?! ... )
(where ...
is "box" in our case).
I don't want to be looking at the word "box" next.
Is a regex like:
/(?!box)/
But in your case, you want to say:
I don't want to be looking at the word "box" anywhere in the following text.
Which is another "any character" special:
/(?!.*box)/
Now, how can you use this with your existing pattern? Lookahead (and "lookbehind") are both zero-width assertions. This means that they can fail, because they are assertions, but they consume zero input characters (zero-width). So all you have to do is pay attention to where you put them, since they make their assertion exactly in whatever place they correspond to.
For this scenario, I think you want to make one simple assertion right at the start: "the word box does not appear" and then proceed to your other matching:
I want to find a line that does not have the word "box", but that contains ... apple ... orange, etc.
You can do that by dropping the lookahead right after the anchor to the start:
/^(?!.*box).*(apple.*orange|orange.*apple)/
This translates to
At start of string,
- confirm "box" does not appear in the line
- match any character any number of times,
- then either
- match "apple",
- followed by any chars, any number of times
- then "orange"
- or
- match "orange"
- followed by any chars, any number of times
- then "apple"
There are a couple of other ways to approach this. But you need to be aware of performance. When you do a lookahead, you are inviting another scan of the string. So if you have a *
or +
in your lookahead, you could be re-scanning the same text over and over again. That slows you down, which is why I recommend putting the lookahead right at the start. You'll either succeed once, or fail immediately.
Likewise, the .*
before and between your words is a potential problem. Modern engines are usually smart enough to deal with this, but some database engines aren't very smart. Beware of this: do some performance tests, with missing words as well as duplicate words (apple ... apple ... orange, apple ... orange ... orange) to make sure performance is okay. (In this case, '...' means 200 random words.)
Finally, consider how much you want the words to be words. There is a special syntax for that in regex, which may not be present or may vary by engine. Typically, a word boundary assertion is spelled b
, like bappleb
but you may have to write yappley
, mappleM
, <apple>
, or even [[:<:]]apple[[:>:]]
. Check your documentation.
Finally, consider that using positive lookahead is another way to deal with alternation when you have mutually exclusive alternates. Instead of the apple.*orange|orange.*apple
construction, you might just use two positive lookahead expressions at the start of the pattern. This has definite performance implications, since the two expressions imply two scans through the text. It does simplify the construction of the regex, which might be an issue if you want more than two words, and especially if you want to generate the pattern programmatically:
/^(?!.*box)(?=.*apple)(?=.*orange)./
The .
at the end is just to force a single character to participate. This expression says
I want a line that does not hold the word "box", does hold "apple", and does hold "orange".
You can see how to extend this with more words, but note that each time you do ?=.*
you're re-scanning the text. If your text items are 80 characters or less you may not care, but if you are searching thousands of characters for words that are likely to be just a few characters apart, the previous version will perform better.
add a comment |
up vote
1
down vote
First this is possible, using negative lookahead. However, it is way too expensive to be useful. This is the kind of thing you do to satisfy a homework assignment or work around some kind of stupid limit imposed by a system you are abusing.
That said, consider something like:
I want to find the word "orange" anywhere in my string.
You can typically take advantage of regex searching by doing something like:
/orange/
But you can also tie your search to the beginning of the string by inserting "match any" patterns before your word:
/^.*orange/
(Note that neither example requires orange to be a word at present. Something like "storange" would match. Save that for later.)
You can do the same thing with apple, but how can you tie them together?
One easy way, that works in a lot of engines but might not perform well, is simply to spell out both possibilities:
I want to find the word "orange" followed by any number of characters followed by the word "apple" OR the word "apple" followed by any number of characters followed by the word "orange".
That's an alternation, which is |
(vertical bar) in regex. Sometimes you may need to escape the vertical bar for the regex engine (basic vs. extended). Some other times you may have to escape it for the command line parser. So depending on how you are using your regex, you might have to write |
or \\|
or something in between.
But, the sub-patterns are simple:
/orange.*apple/
/apple.*orange/
So first you alternate them in a non-capturing group (if possible! Check your docs, use a capturing group if you must.) like so:
/(orange.*apple|apple.*orange)/
Then prepend the "tie to start of string" at the front:
/^.*(orange.*apple|apple.*orange)/
Now you can match text that contains both words in either order.
Finally, you can harness the power of negative lookahead to block the word "box". Use the special syntax for this, which may vary but is probably something close to (?! ... )
(where ...
is "box" in our case).
I don't want to be looking at the word "box" next.
Is a regex like:
/(?!box)/
But in your case, you want to say:
I don't want to be looking at the word "box" anywhere in the following text.
Which is another "any character" special:
/(?!.*box)/
Now, how can you use this with your existing pattern? Lookahead (and "lookbehind") are both zero-width assertions. This means that they can fail, because they are assertions, but they consume zero input characters (zero-width). So all you have to do is pay attention to where you put them, since they make their assertion exactly in whatever place they correspond to.
For this scenario, I think you want to make one simple assertion right at the start: "the word box does not appear" and then proceed to your other matching:
I want to find a line that does not have the word "box", but that contains ... apple ... orange, etc.
You can do that by dropping the lookahead right after the anchor to the start:
/^(?!.*box).*(apple.*orange|orange.*apple)/
This translates to
At start of string,
- confirm "box" does not appear in the line
- match any character any number of times,
- then either
- match "apple",
- followed by any chars, any number of times
- then "orange"
- or
- match "orange"
- followed by any chars, any number of times
- then "apple"
There are a couple of other ways to approach this. But you need to be aware of performance. When you do a lookahead, you are inviting another scan of the string. So if you have a *
or +
in your lookahead, you could be re-scanning the same text over and over again. That slows you down, which is why I recommend putting the lookahead right at the start. You'll either succeed once, or fail immediately.
Likewise, the .*
before and between your words is a potential problem. Modern engines are usually smart enough to deal with this, but some database engines aren't very smart. Beware of this: do some performance tests, with missing words as well as duplicate words (apple ... apple ... orange, apple ... orange ... orange) to make sure performance is okay. (In this case, '...' means 200 random words.)
Finally, consider how much you want the words to be words. There is a special syntax for that in regex, which may not be present or may vary by engine. Typically, a word boundary assertion is spelled b
, like bappleb
but you may have to write yappley
, mappleM
, <apple>
, or even [[:<:]]apple[[:>:]]
. Check your documentation.
Finally, consider that using positive lookahead is another way to deal with alternation when you have mutually exclusive alternates. Instead of the apple.*orange|orange.*apple
construction, you might just use two positive lookahead expressions at the start of the pattern. This has definite performance implications, since the two expressions imply two scans through the text. It does simplify the construction of the regex, which might be an issue if you want more than two words, and especially if you want to generate the pattern programmatically:
/^(?!.*box)(?=.*apple)(?=.*orange)./
The .
at the end is just to force a single character to participate. This expression says
I want a line that does not hold the word "box", does hold "apple", and does hold "orange".
You can see how to extend this with more words, but note that each time you do ?=.*
you're re-scanning the text. If your text items are 80 characters or less you may not care, but if you are searching thousands of characters for words that are likely to be just a few characters apart, the previous version will perform better.
add a comment |
up vote
1
down vote
up vote
1
down vote
First this is possible, using negative lookahead. However, it is way too expensive to be useful. This is the kind of thing you do to satisfy a homework assignment or work around some kind of stupid limit imposed by a system you are abusing.
That said, consider something like:
I want to find the word "orange" anywhere in my string.
You can typically take advantage of regex searching by doing something like:
/orange/
But you can also tie your search to the beginning of the string by inserting "match any" patterns before your word:
/^.*orange/
(Note that neither example requires orange to be a word at present. Something like "storange" would match. Save that for later.)
You can do the same thing with apple, but how can you tie them together?
One easy way, that works in a lot of engines but might not perform well, is simply to spell out both possibilities:
I want to find the word "orange" followed by any number of characters followed by the word "apple" OR the word "apple" followed by any number of characters followed by the word "orange".
That's an alternation, which is |
(vertical bar) in regex. Sometimes you may need to escape the vertical bar for the regex engine (basic vs. extended). Some other times you may have to escape it for the command line parser. So depending on how you are using your regex, you might have to write |
or \\|
or something in between.
But, the sub-patterns are simple:
/orange.*apple/
/apple.*orange/
So first you alternate them in a non-capturing group (if possible! Check your docs, use a capturing group if you must.) like so:
/(orange.*apple|apple.*orange)/
Then prepend the "tie to start of string" at the front:
/^.*(orange.*apple|apple.*orange)/
Now you can match text that contains both words in either order.
Finally, you can harness the power of negative lookahead to block the word "box". Use the special syntax for this, which may vary but is probably something close to (?! ... )
(where ...
is "box" in our case).
I don't want to be looking at the word "box" next.
Is a regex like:
/(?!box)/
But in your case, you want to say:
I don't want to be looking at the word "box" anywhere in the following text.
Which is another "any character" special:
/(?!.*box)/
Now, how can you use this with your existing pattern? Lookahead (and "lookbehind") are both zero-width assertions. This means that they can fail, because they are assertions, but they consume zero input characters (zero-width). So all you have to do is pay attention to where you put them, since they make their assertion exactly in whatever place they correspond to.
For this scenario, I think you want to make one simple assertion right at the start: "the word box does not appear" and then proceed to your other matching:
I want to find a line that does not have the word "box", but that contains ... apple ... orange, etc.
You can do that by dropping the lookahead right after the anchor to the start:
/^(?!.*box).*(apple.*orange|orange.*apple)/
This translates to
At start of string,
- confirm "box" does not appear in the line
- match any character any number of times,
- then either
- match "apple",
- followed by any chars, any number of times
- then "orange"
- or
- match "orange"
- followed by any chars, any number of times
- then "apple"
There are a couple of other ways to approach this. But you need to be aware of performance. When you do a lookahead, you are inviting another scan of the string. So if you have a *
or +
in your lookahead, you could be re-scanning the same text over and over again. That slows you down, which is why I recommend putting the lookahead right at the start. You'll either succeed once, or fail immediately.
Likewise, the .*
before and between your words is a potential problem. Modern engines are usually smart enough to deal with this, but some database engines aren't very smart. Beware of this: do some performance tests, with missing words as well as duplicate words (apple ... apple ... orange, apple ... orange ... orange) to make sure performance is okay. (In this case, '...' means 200 random words.)
Finally, consider how much you want the words to be words. There is a special syntax for that in regex, which may not be present or may vary by engine. Typically, a word boundary assertion is spelled b
, like bappleb
but you may have to write yappley
, mappleM
, <apple>
, or even [[:<:]]apple[[:>:]]
. Check your documentation.
Finally, consider that using positive lookahead is another way to deal with alternation when you have mutually exclusive alternates. Instead of the apple.*orange|orange.*apple
construction, you might just use two positive lookahead expressions at the start of the pattern. This has definite performance implications, since the two expressions imply two scans through the text. It does simplify the construction of the regex, which might be an issue if you want more than two words, and especially if you want to generate the pattern programmatically:
/^(?!.*box)(?=.*apple)(?=.*orange)./
The .
at the end is just to force a single character to participate. This expression says
I want a line that does not hold the word "box", does hold "apple", and does hold "orange".
You can see how to extend this with more words, but note that each time you do ?=.*
you're re-scanning the text. If your text items are 80 characters or less you may not care, but if you are searching thousands of characters for words that are likely to be just a few characters apart, the previous version will perform better.
First this is possible, using negative lookahead. However, it is way too expensive to be useful. This is the kind of thing you do to satisfy a homework assignment or work around some kind of stupid limit imposed by a system you are abusing.
That said, consider something like:
I want to find the word "orange" anywhere in my string.
You can typically take advantage of regex searching by doing something like:
/orange/
But you can also tie your search to the beginning of the string by inserting "match any" patterns before your word:
/^.*orange/
(Note that neither example requires orange to be a word at present. Something like "storange" would match. Save that for later.)
You can do the same thing with apple, but how can you tie them together?
One easy way, that works in a lot of engines but might not perform well, is simply to spell out both possibilities:
I want to find the word "orange" followed by any number of characters followed by the word "apple" OR the word "apple" followed by any number of characters followed by the word "orange".
That's an alternation, which is |
(vertical bar) in regex. Sometimes you may need to escape the vertical bar for the regex engine (basic vs. extended). Some other times you may have to escape it for the command line parser. So depending on how you are using your regex, you might have to write |
or \\|
or something in between.
But, the sub-patterns are simple:
/orange.*apple/
/apple.*orange/
So first you alternate them in a non-capturing group (if possible! Check your docs, use a capturing group if you must.) like so:
/(orange.*apple|apple.*orange)/
Then prepend the "tie to start of string" at the front:
/^.*(orange.*apple|apple.*orange)/
Now you can match text that contains both words in either order.
Finally, you can harness the power of negative lookahead to block the word "box". Use the special syntax for this, which may vary but is probably something close to (?! ... )
(where ...
is "box" in our case).
I don't want to be looking at the word "box" next.
Is a regex like:
/(?!box)/
But in your case, you want to say:
I don't want to be looking at the word "box" anywhere in the following text.
Which is another "any character" special:
/(?!.*box)/
Now, how can you use this with your existing pattern? Lookahead (and "lookbehind") are both zero-width assertions. This means that they can fail, because they are assertions, but they consume zero input characters (zero-width). So all you have to do is pay attention to where you put them, since they make their assertion exactly in whatever place they correspond to.
For this scenario, I think you want to make one simple assertion right at the start: "the word box does not appear" and then proceed to your other matching:
I want to find a line that does not have the word "box", but that contains ... apple ... orange, etc.
You can do that by dropping the lookahead right after the anchor to the start:
/^(?!.*box).*(apple.*orange|orange.*apple)/
This translates to
At start of string,
- confirm "box" does not appear in the line
- match any character any number of times,
- then either
- match "apple",
- followed by any chars, any number of times
- then "orange"
- or
- match "orange"
- followed by any chars, any number of times
- then "apple"
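The breakdown above can be checked against the two sentences from the question. This is a sketch in Python (an assumed language; the function name accepts is made up for the example):

```python
import re

# ^            anchor at the start of the string
# (?!.*box)    assert that "box" does not appear anywhere in the line
# .*           any characters, any number of times
# (?: ... )    either apple...orange or orange...apple
PATTERN = re.compile(r"^(?!.*box).*(?:apple.*orange|orange.*apple)")

def accepts(sentence):
    """Return True if the sentence has both words and no "box"."""
    return PATTERN.search(sentence) is not None

print(accepts("One orange and one apple"))           # True
print(accepts("one orange and apple within a box"))  # False
print(accepts("a box with an apple and an orange"))  # False
print(accepts("only an apple"))                      # False
```

Keep in mind that . does not match a newline by default in Python, so this works line by line, which fits the "sentence" wording of the question.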
There are a couple of other ways to approach this. But you need to be aware of performance. When you do a lookahead, you are inviting another scan of the string. So if you have a * or + in your lookahead, you could be re-scanning the same text over and over again. That slows you down, which is why I recommend putting the lookahead right at the start. You'll either succeed once, or fail immediately.
Likewise, the .* before and between your words is a potential problem. Modern engines are usually smart enough to deal with this, but some database engines aren't very smart. Beware of this: do some performance tests, with missing words as well as duplicate words (apple ... apple ... orange, apple ... orange ... orange) to make sure performance is okay. (In this case, '...' means 200 random words.)
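A rough way to run that kind of test is sketched below, assuming Python with its standard timeit module. The filler word list, the function names and the repeat count are all arbitrary choices of mine, and the timings will differ from engine to engine, so treat the numbers as relative only:

```python
import random
import re
import timeit

PATTERN = re.compile(r"^(?!.*box).*(?:apple.*orange|orange.*apple)")
# Filler vocabulary (hypothetical); none of these contain the searched words.
FILLER_WORDS = ["pear", "plum", "grape", "melon", "kiwi", "fig"]

def filler(rng, count=200):
    """Return `count` random filler words joined by spaces."""
    return " ".join(rng.choice(FILLER_WORDS) for _ in range(count))

def build_cases(seed=0):
    """Build test lines with duplicate and missing words."""
    rng = random.Random(seed)
    return {
        "duplicate first word": f"apple {filler(rng)} apple {filler(rng)} orange",
        "duplicate second word": f"apple {filler(rng)} orange {filler(rng)} orange",
        "missing word": f"apple {filler(rng)} apple {filler(rng)}",
    }

def time_cases(repeats=200):
    """Return the total seconds spent on `repeats` searches per case."""
    results = {}
    for name, text in build_cases().items():
        results[name] = timeit.timeit(lambda: PATTERN.search(text), number=repeats)
    return results

for name, seconds in time_cases().items():
    print(f"{name}: {seconds:.4f}s")
```

If the "missing word" case is dramatically slower than the others on your engine, that is the backtracking cost showing up.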
Next, consider how much you want the words to be words. There is a special syntax for that in regex, which may not be present or may vary by engine. Typically, a word boundary assertion is spelled \b, like \bapple\b, but you may have to write \yapple\y, \mapple\M, \<apple\>, or even [[:<:]]apple[[:>:]]. Check your documentation.
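As an illustration of what the boundaries buy you, here is a sketch in Python, where the word boundary is written \b (the pattern names LOOSE and STRICT are just labels for this example):

```python
import re

# Without boundaries, "pineapple" counts as "apple" and "mailbox" counts as "box".
LOOSE = re.compile(r"^(?!.*box).*(?:apple.*orange|orange.*apple)")
# With \b on both sides, each word has to stand alone.
STRICT = re.compile(
    r"^(?!.*\bbox\b).*(?:\bapple\b.*\borange\b|\borange\b.*\bapple\b)"
)

pineapple = "a pineapple and an orange"
mailbox = "an apple and an orange by the mailbox"

print(LOOSE.search(pineapple) is not None)   # True: matches inside "pineapple"
print(STRICT.search(pineapple) is not None)  # False: no standalone "apple"
print(LOOSE.search(mailbox) is not None)     # False: rejected by "mailbox"
print(STRICT.search(mailbox) is not None)    # True: "mailbox" is not the word "box"
```

Which behavior is right depends on your data, so decide deliberately rather than by accident.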
Finally, consider that using positive lookahead is another way to deal with alternation when you have mutually exclusive alternates. Instead of the apple.*orange|orange.*apple construction, you might just use two positive lookahead expressions at the start of the pattern. This has definite performance implications, since the two expressions imply two scans through the text. It does simplify the construction of the regex, which might be an issue if you want more than two words, and especially if you want to generate the pattern programmatically:
/^(?!.*box)(?=.*apple)(?=.*orange)./
The . at the end is just to force a single character to participate. This expression says
I want a line that does not hold the word "box", does hold "apple", and does hold "orange".
You can see how to extend this with more words, but note that each time you do (?=.* ... ) you're re-scanning the text. If your text items are 80 characters or less you may not care, but if you are searching thousands of characters for words that are likely to be just a few characters apart, the previous version will perform better.
answered Nov 17 at 20:08
Austin Hastings
10.6k11032