Regular expression to match sentence containing specific words and discarding the match if it contains...

The problem is as the title says. Is it even possible?

For example, I have two words that I search for: apple and orange. And a word that makes the whole sentence wrong: box. So the expression should accept this sentence:

One orange and one apple

but discard this one:

orange and apple within a box

I've been thinking about that for some time but I can't find any solution.

asked Nov 17 at 19:09

PatrykTraveler

356

What language are you using?
– Andy Lester
Nov 17 at 22:49

regex


2 Answers
accepted

You can use a positive lookahead to match strings that contain either the word apple or the word orange, like this:

(?=.*(orange|apple))

and a negative lookahead to discard the match if the string contains the word box, like this:

(?!.*box)

Hence the overall regex becomes this:

^(?=.*(orange|apple))(?!.*box).*$

Here is a demo of the same.

If you tell me which language you are using, I should be able to help with sample code as well.

Edit:

Just in case you are using today's hottest language, Python (though my main language is Java), here is sample code for the same:

import re

strArr = ['One orange and one apple', 'One apple', 'One orange', 'orange and apple within a box', 'One apple and box', 'One orange and box', "This contains none of the accepted words so it doesn't match"]

# Print each sentence followed by whether the pattern accepts it.
for x in strArr:
    print(x + ' --> ', end="")
    print(bool(re.match(r'^(?=.*(orange|apple))(?!.*box).*$', x)))

edited Nov 17 at 19:43

answered Nov 17 at 19:28

Pushpesh Kumar Rajwanshi

2,6701821


First, this is possible using negative lookahead. However, it is usually too expensive to be worth it. This is the kind of thing you do to satisfy a homework assignment or to work around some kind of stupid limit imposed by a system you are abusing.

That said, consider something like:

I want to find the word "orange" anywhere in my string.

You can typically take advantage of regex searching by doing something like:

/orange/

But you can also tie your search to the beginning of the string by inserting "match any" patterns before your word:

/^.*orange/

(Note that neither example requires orange to be a word at present. Something like "storange" would match. Save that for later.)
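To make the difference concrete, here is a small Python sketch. Python is just my choice for illustration, and the helper name found is made up. It suggests that the unanchored form and the ^.* form accept the same strings, and that neither one treats orange as a whole word.

```python
import re

def found(pattern, text):
    """Return True if pattern matches somewhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# Unanchored search: the engine tries every starting position.
plain = found(r"orange", "one orange")

# Anchored at the start, with .* soaking up whatever precedes the word.
anchored = found(r"^.*orange", "one orange")

# Neither pattern cares about word boundaries yet.
inside_word = found(r"^.*orange", "storange")

print(plain, anchored, inside_word)  # True True True
```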

You can do the same thing with apple, but how can you tie them together?

One easy way, that works in a lot of engines but might not perform well, is simply to spell out both possibilities:

I want to find the word "orange" followed by any number of characters followed by the word "apple" OR the word "apple" followed by any number of characters followed by the word "orange".

That's an alternation, which is | (vertical bar) in regex. Sometimes you may need to escape the vertical bar for the regex engine (basic vs. extended). Some other times you may have to escape it for the command line parser. So depending on how you are using your regex, you might have to write | or \\| or something in between.

But, the sub-patterns are simple:

/orange.*apple/

/apple.*orange/

So first you alternate them in a group. Use a non-capturing group, (?: ... ), if your engine has one (check your docs); a plain capturing group, as shown here, works everywhere:

/(orange.*apple|apple.*orange)/

Then prepend the "tie to start of string" at the front:

/^.*(orange.*apple|apple.*orange)/

Now you can match text that contains both words in either order.
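A quick way to check the either-order claim is to run the pattern over a few sentences. This is only a sketch in Python; the sample sentences and the name BOTH_WORDS are my own.

```python
import re

# The alternation from above, anchored to the start of the string.
BOTH_WORDS = re.compile(r"^.*(orange.*apple|apple.*orange)")

samples = [
    "One orange and one apple",   # orange ... apple
    "One apple and one orange",   # apple ... orange
    "One orange",                 # apple missing
    "One apple",                  # orange missing
]

# Map each sentence to whether the pattern accepts it.
results = {s: bool(BOTH_WORDS.search(s)) for s in samples}
for sentence, matched in results.items():
    print(f"{sentence!r} --> {matched}")
```

Only the first two sentences should come back True, since the other two each lack one of the words.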

Finally, you can harness the power of negative lookahead to block the word "box". Use the special syntax for this, which may vary but is probably something close to (?! ... ) (where ... is "box" in our case).

I don't want to be looking at the word "box" next.

Is a regex like:

/(?!box)/

But in your case, you want to say:

I don't want to be looking at the word "box" anywhere in the following text.

Which is another "any character" special:

/(?!.*box)/

Now, how can you use this with your existing pattern? Lookahead (and "lookbehind") are both zero-width assertions. This means that they can fail, because they are assertions, but they consume zero input characters (zero-width). So all you have to do is pay attention to where you put them, since they make their assertion exactly in whatever place they correspond to.
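The "zero-width" and "pay attention to where you put them" points can be seen in a few lines of Python. This is an illustrative sketch with a made-up sample string, not a statement about every engine.

```python
import re

text = "box of apples"

# A lookahead that succeeds still consumes nothing: the match is empty.
m = re.match(r"(?=box)", text)
empty_span = m.span()  # (0, 0)

# (?!box) on its own only checks the current position. An unanchored
# search simply slides forward to index 1, where "box" no longer starts,
# and reports success there.
slid = re.search(r"(?!box)", text).start()

# Anchoring pins the assertion to position 0, so now it really fails.
pinned = re.search(r"^(?!box)", text)

print(empty_span, slid, pinned)  # (0, 0) 1 None
```

This is why the bare /(?!box)/ form is of little use by itself: without an anchor, the engine can almost always find some position where the assertion holds.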

For this scenario, I think you want to make one simple assertion right at the start: "the word box does not appear" and then proceed to your other matching:

I want to find a line that does not have the word "box", but that contains ... apple ... orange, etc.

You can do that by dropping the lookahead right after the anchor to the start:

/^(?!.*box).*(apple.*orange|orange.*apple)/

This translates to

At start of string,

 - confirm "box" does not appear in the line

 - match any character any number of times,

 - then either

   - match "apple", 

   - followed by any chars, any number of times

   - then "orange"

 - or

   - match "orange"

   - followed by any chars, any number of times

   - then "apple"
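The breakdown above can be tried out directly. Here is a minimal Python sketch; the function name accepts and the third and fourth test sentences are my own additions.

```python
import re

# Negative lookahead right after the anchor, then both words in either order.
PATTERN = re.compile(r"^(?!.*box).*(apple.*orange|orange.*apple)")

def accepts(sentence):
    """Return True if the sentence has both words and no "box"."""
    return PATTERN.search(sentence) is not None

print(accepts("One orange and one apple"))           # True
print(accepts("orange and apple within a box"))      # False
print(accepts("a box with an apple and an orange"))  # False
print(accepts("One orange"))                         # False
```

Note that "box" is still matched as a plain substring here, so a sentence containing "boxer" would be rejected too.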

There are a couple of other ways to approach this. But you need to be aware of performance. When you do a lookahead, you are inviting another scan of the string. So if you have a * or + in your lookahead, you could be re-scanning the same text over and over again. That slows you down, which is why I recommend putting the lookahead right at the start. You'll either succeed once, or fail immediately.

Likewise, the .* before and between your words is a potential problem. Modern engines are usually smart enough to deal with this, but some database engines aren't very smart. Beware of this: do some performance tests, with missing words as well as duplicate words (apple ... apple ... orange, apple ... orange ... orange) to make sure performance is okay. (In this case, '...' means 200 random words.)
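A rough way to run such a test in Python is sketched below. The filler text is an assumption on my part: I repeat one neutral word 200 times instead of using random words, and the case names and the time_cases helper are made up. Timings will differ from engine to engine, so treat the numbers as relative only.

```python
import re
import timeit

PATTERN = re.compile(r"^(?!.*box).*(apple.*orange|orange.*apple)")

# 200 filler words stand in for the "..." in the text above (assumption:
# a repeated neutral word rather than truly random words).
FILLER = " ".join(["word"] * 200)

CASES = {
    "both words": f"apple {FILLER} orange",
    "missing word": f"apple {FILLER} pear",
    "duplicate first": f"apple {FILLER} apple {FILLER} orange",
    "duplicate last": f"apple {FILLER} orange {FILLER} orange",
}

def time_cases(repeat=200):
    """Return {case name: (matched, seconds for `repeat` searches)}."""
    report = {}
    for name, text in CASES.items():
        matched = PATTERN.search(text) is not None
        seconds = timeit.timeit(lambda: PATTERN.search(text), number=repeat)
        report[name] = (matched, seconds)
    return report

report = time_cases()
for name, (matched, seconds) in report.items():
    print(f"{name:16} matched={matched}  {seconds:.4f}s")
```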

Next, consider how much you want the words to be words. There is a special syntax for that in regex, which may not be present or may vary by engine. Typically, a word boundary assertion is spelled \b, like \bapple\b, but you may have to write \yapple\y, \mapple\M, \<apple\>, or even [[:<:]]apple[[:>:]]. Check your documentation.
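In Python's re module, for instance, the word boundary is written \b. The sketch below compares a plain substring pattern with a bounded one; the sample strings and the names loose and strict are my own.

```python
import re

loose = re.compile(r"apple")       # matches the letters anywhere
strict = re.compile(r"\bapple\b")  # requires a word boundary on both sides

samples = ["one apple", "pineapple juice", "apples", "apple."]

# For each sample, record (loose result, strict result).
boundary_results = {
    text: (bool(loose.search(text)), bool(strict.search(text)))
    for text in samples
}

for text, (is_loose, is_strict) in boundary_results.items():
    print(f"{text!r}: loose={is_loose} strict={is_strict}")
```

The loose pattern accepts all four strings, while the strict one should accept only "one apple" and "apple.".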

Finally, consider that positive lookahead is another way to require several words without spelling out every possible ordering. Instead of the apple.*orange|orange.*apple construction, you might just use two positive lookahead expressions at the start of the pattern. This has definite performance implications, since the two expressions imply two scans through the text. It does simplify the construction of the regex, which might matter if you want more than two words, and especially if you want to generate the pattern programmatically:

/^(?!.*box)(?=.*apple)(?=.*orange)./

The . at the end is just to force a single character to participate. This expression says

I want a line that does not hold the word "box", does hold "apple", and does hold "orange".

You can see how to extend this with more words, but note that each (?=.* ... ) you add is another scan of the text. If your text items are 80 characters or less you may not care, but if you are searching thousands of characters for words that are likely to be just a few characters apart, the previous version will perform better.
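Generating the lookahead form programmatically could look roughly like this in Python. The function build_pattern and its parameters are hypothetical names of mine; re.escape is used so that words containing regex metacharacters are treated literally.

```python
import re

def build_pattern(required, forbidden=()):
    """Build a regex that needs every word in `required` and rejects any in `forbidden`."""
    # One negative lookahead per forbidden word, placed right after the anchor.
    negative = "".join(f"(?!.*{re.escape(w)})" for w in forbidden)
    # One positive lookahead per required word.
    positive = "".join(f"(?=.*{re.escape(w)})" for w in required)
    # The trailing . forces a single character to participate in the match.
    return re.compile(f"^{negative}{positive}.")

pattern = build_pattern(["apple", "orange"], ["box"])

print(pattern.pattern)                                        # ^(?!.*box)(?=.*apple)(?=.*orange).
print(bool(pattern.search("One orange and one apple")))       # True
print(bool(pattern.search("orange and apple within a box")))  # False
```

Adding a third required word is then just one more list entry, at the cost of one more scan per lookahead.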

answered Nov 17 at 20:08

Austin Hastings

10.6k11032


