Regular expression to match sentence containing specific words and discarding the match if it contains...











The problem is as the title says. Is it even possible?

For example, I have two words that I search for: apple and orange, and a word that makes the whole sentence wrong: box. So the expression should accept this sentence:

One orange and one apple

but discard this one:

orange and apple within a box.

I've been thinking about this for some time, but I can't find a solution.











  • What language are you using?
    – Andy Lester
    Nov 17 at 22:49















regex






asked Nov 17 at 19:09 by PatrykTraveler













2 Answers
accepted










You can use a positive lookahead to match strings that contain either the word apple or the word orange, like this:



(?=.*(orange|apple))


and a negative lookahead to discard the match if it contains the word box, like this:



(?!.*box)


Hence the overall regex becomes:



^(?=.*(orange|apple))(?!.*box).*$





If you tell me which language you are using, I should be able to help with sample code as well.



Edit:



In case you are using Python, today's hottest language (though my main language is Java), here is sample code for the same:



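import re

# Accept a string only if it contains "orange" or "apple" and does not contain "box".
pattern = re.compile(r'^(?=.*(orange|apple))(?!.*box).*$')
# The last entry is double-quoted because it contains an apostrophe.
strArr = ['One orange and one apple', 'One apple', 'One orange', 'orange and apple within a box', 'One apple and box', 'One orange and box', "This contains none of accepted words so it doesn't match"]
for x in strArr:
    print(x + ' --> ', end="")
    print(bool(pattern.match(x)))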







    First, this is possible, using negative lookahead. However, it can be too expensive to be useful on long inputs. This is the kind of thing you do to satisfy a homework assignment or to work around a limit imposed by a system you are abusing.



    That said, consider something like:




    I want to find the word "orange" anywhere in my string.




    You can typically take advantage of regex searching by doing something like:



    /orange/


    But you can also tie your search to the beginning of the string by inserting "match any" patterns before your word:



    /^.*orange/


    (Note that neither example requires orange to be a word at present. Something like "storange" would match. Save that for later.)
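To make the difference concrete, here is a minimal Python sketch (Python only because the other answer uses it; the names `unanchored`, `anchored`, `samples`, and `results` are illustrative). It shows that both forms find "orange" and that both also accept "storange", since neither asks for a whole word.

```python
import re

# re.search scans the whole string, so no leading ".*" is needed here.
unanchored = re.compile(r"orange")
# The anchored form reaches the same word by first consuming any prefix.
anchored = re.compile(r"^.*orange")

samples = ["One orange", "storange unit", "One apple"]

# Map each sample to (unanchored result, anchored result).
results = {
    s: (bool(unanchored.search(s)), bool(anchored.search(s)))
    for s in samples
}

for sample, outcome in results.items():
    print(sample, outcome)
```

Both patterns agree on every sample; the anchored form matters later, when a lookahead has to be placed at a known position.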



    You can do the same thing with apple, but how can you tie them together?



    One easy way, that works in a lot of engines but might not perform well, is simply to spell out both possibilities:




    I want to find the word "orange" followed by any number of characters followed by the word "apple" OR the word "apple" followed by any number of characters followed by the word "orange".




    That's an alternation, which is | (vertical bar) in regex. Sometimes you may need to escape the vertical bar for the regex engine (basic vs. extended). Some other times you may have to escape it for the command line parser. So depending on how you are using your regex, you might have to write | or \\| or something in between.
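As one concrete data point, in Python's `re` module a bare `|` is alternation and `\|` is a literal vertical bar, which is the opposite of POSIX basic syntax. The sketch below assumes Python; the variable names are illustrative, and other engines may behave differently.

```python
import re

# Bare "|" means "either side" in Python's re syntax.
alternation = re.compile(r"orange|apple")
# Escaping it asks for the literal text "orange|apple" instead.
literal_bar = re.compile(r"orange\|apple")

print(bool(alternation.search("One apple")))
print(bool(literal_bar.search("One apple")))
print(bool(literal_bar.search("orange|apple")))
```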



    But, the sub-patterns are simple:



    /orange.*apple/
    /apple.*orange/


    So first you alternate them in a group. A non-capturing group, (?: ... ), is preferable if your engine supports it (check your docs); the example here uses a plain capturing group, which works almost everywhere:



    /(orange.*apple|apple.*orange)/


    Then prepend the "tie to start of string" at the front:



    /^.*(orange.*apple|apple.*orange)/


    Now you can match text that contains both words in either order.
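A short Python sketch of this step, assuming single-line input; `both_words` and `has_both` are hypothetical names used only for illustration. Note that "box" is not excluded yet at this stage.

```python
import re

# Either order: orange ... apple, or apple ... orange.
both_words = re.compile(r"^.*(orange.*apple|apple.*orange)")

def has_both(text):
    """Return True when the text contains both words, in either order."""
    return both_words.search(text) is not None

for text in ["One orange and one apple", "apple then orange", "One apple"]:
    print(text, "-->", has_both(text))
```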



    Next, you can harness the power of negative lookahead to block the word "box". Use the special syntax for this, which may vary but is probably something close to (?! ... ) (where ... is "box" in our case).




    I don't want to be looking at the word "box" next.




    That is a regex like:



    /(?!box)/


    But in your case, you want to say:




    I don't want to be looking at the word "box" anywhere in the following text.




    That calls for another "match any" prefix inside the lookahead:



    /(?!.*box)/


    Now, how can you use this with your existing pattern? Lookahead (and "lookbehind") are both zero-width assertions. This means that they can fail, because they are assertions, but they consume zero input characters (zero-width). So all you have to do is pay attention to where you put them, since they make their assertion exactly in whatever place they correspond to.
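The zero-width behaviour can be observed directly. This is a small sketch in Python (an assumption, as the answer is engine-agnostic); `re.match` only tries position 0, which makes the placement of the assertion easy to see.

```python
import re

# Checked only at position 0: "a box" does not start with "box", so this succeeds.
here = re.match(r"(?!box)", "a box")
# The match is empty because a lookahead consumes no characters.
here_span = here.span()

# Fails: the text at position 0 is exactly "box".
blocked = re.match(r"(?!box)", "box of apples")

# Adding ".*" lets the assertion reach "box" anywhere later in the line.
anywhere = re.match(r"(?!.*box)", "a box")

print(here_span, blocked, anywhere)
```

The first assertion succeeds even though "box" appears later in the string, which is why the `.*` inside the lookahead is needed for this problem.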



    For this scenario, I think you want to make one simple assertion right at the start: "the word box does not appear" and then proceed to your other matching:




    I want to find a line that does not have the word "box", but that contains ... apple ... orange, etc.




    You can do that by dropping the lookahead right after the anchor to the start:



    /^(?!.*box).*(apple.*orange|orange.*apple)/


    This translates to



    At start of string,
    - confirm "box" does not appear in the line
    - match any character any number of times,
    - then either
      - match "apple",
      - followed by any chars, any number of times
      - then "orange"
    - or
      - match "orange"
      - followed by any chars, any number of times
      - then "apple"
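The steps above can be sketched in Python against the sentences from the question. This assumes single-line input (by default `.` does not match a newline); `pattern` and `accepts` are illustrative names.

```python
import re

# Reject any line containing "box", then require both words in either order.
pattern = re.compile(r"^(?!.*box).*(apple.*orange|orange.*apple)")

def accepts(sentence):
    """Return True when the sentence has both words and no 'box'."""
    return pattern.search(sentence) is not None

for sentence in [
    "One orange and one apple",
    "orange and apple within a box",
    "One apple",
]:
    print(sentence, "-->", accepts(sentence))
```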


    There are a couple of other ways to approach this. But you need to be aware of performance. When you do a lookahead, you are inviting another scan of the string. So if you have a * or + in your lookahead, you could be re-scanning the same text over and over again. That slows you down, which is why I recommend putting the lookahead right at the start. You'll either succeed once, or fail immediately.



    Likewise, the .* before and between your words is a potential problem. Modern engines are usually smart enough to deal with this, but some database engines aren't very smart. Beware of this: do some performance tests, with missing words as well as duplicate words (apple ... apple ... orange, apple ... orange ... orange) to make sure performance is okay. (In this case, '...' means 200 random words.)
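A rough way to run such tests in Python is sketched below, using the standard `timeit` module. The filler vocabulary, the seed, and the repeat count are arbitrary assumptions, and the timings only mean something relative to each other on your own machine and engine.

```python
import random
import re
import timeit

pattern = re.compile(r"^(?!.*box).*(apple.*orange|orange.*apple)")

def filler(n=200, seed=0):
    """Build n filler words that contain none of the searched words."""
    rng = random.Random(seed)
    vocab = ["pear", "plum", "kiwi", "fig", "lime"]
    return " ".join(rng.choice(vocab) for _ in range(n))

gap = filler()
cases = {
    "both words": f"apple {gap} orange",
    "missing word": f"apple {gap} {gap}",
    "duplicate first": f"apple {gap} apple {gap} orange",
    "duplicate second": f"apple {gap} orange {gap} orange",
}

def time_case(text, repeats=200):
    """Total seconds for `repeats` searches of one text."""
    return timeit.timeit(lambda: pattern.search(text), number=repeats)

timings = {name: time_case(text) for name, text in cases.items()}
matched = {name: pattern.search(text) is not None for name, text in cases.items()}

for name in cases:
    print(f"{name}: matched={matched[name]} seconds={timings[name]:.4f}")
```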



    Also consider how much you want the words to be words. There is a special syntax for that in regex, which may not be present or may vary by engine. Typically, a word boundary assertion is spelled \b, as in \bapple\b, but you may have to write \yapple\y, \mapple\M, \<apple\>, or even [[:<:]]apple[[:>:]]. Check your documentation.
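In Python the boundary is written `\b`, so the effect can be sketched as follows. The names `loose` and `strict` are illustrative, and the sample strings are made up to show substrings such as "pineapple" and "boxer".

```python
import re

# Substring version: "pineapple" counts as apple, "boxer" counts as box.
loose = re.compile(r"^(?!.*box).*(apple.*orange|orange.*apple)")

# Whole-word version: every word is wrapped in \b ... \b.
strict = re.compile(
    r"^(?!.*\bbox\b).*(\bapple\b.*\borange\b|\borange\b.*\bapple\b)"
)

for text in ["pineapple and storange", "apple, orange, boxer"]:
    print(text, bool(loose.search(text)), bool(strict.search(text)))
```

The two patterns disagree on both samples, which is the decision this paragraph asks you to make deliberately.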



    Finally, consider that using positive lookahead is another way to deal with alternation when you have mutually exclusive alternates. Instead of the apple.*orange|orange.*apple construction, you might just use two positive lookahead expressions at the start of the pattern. This has definite performance implications, since the two expressions imply two scans through the text. It does simplify the construction of the regex, which might be an issue if you want more than two words, and especially if you want to generate the pattern programmatically:



    /^(?!.*box)(?=.*apple)(?=.*orange)./


    The . at the end is just to force a single character to participate. This expression says




    I want a line that does not hold the word "box", does hold "apple", and does hold "orange".




    You can see how to extend this with more words, but note that each time you do ?=.* you're re-scanning the text. If your text items are 80 characters or less you may not care, but if you are searching thousands of characters for words that are likely to be just a few characters apart, the previous version will perform better.
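Generating the lookahead form programmatically might look like the following Python sketch. `build_pattern` is a hypothetical helper, not part of any library; it escapes each word with `re.escape` so that punctuation in a word is treated literally.

```python
import re

def build_pattern(required, forbidden):
    """Compile a pattern that needs every required word and no forbidden word."""
    parts = ["^"]
    # One negative lookahead per forbidden word.
    parts += [f"(?!.*{re.escape(word)})" for word in forbidden]
    # One positive lookahead per required word; order in the text does not matter.
    parts += [f"(?=.*{re.escape(word)})" for word in required]
    # Force a single character to participate in the match.
    parts.append(".")
    return re.compile("".join(parts))

fruit = build_pattern(["apple", "orange"], ["box"])
print(fruit.pattern)
print(bool(fruit.search("One orange and one apple")))
print(bool(fruit.search("orange and apple within a box")))
```

Adding a third required word is just another list element, at the cost of one more scan per lookahead.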

















Accepted answer: answered Nov 17 at 19:28 by Pushpesh Kumar Rajwanshi, edited Nov 17 at 19:43
























                  up vote
                  1
                  down vote










                  up vote
                  1
                  down vote









                  First this is possible, using negative lookahead. However, it is way too expensive to be useful. This is the kind of thing you do to satisfy a homework assignment or work around some kind of stupid limit imposed by a system you are abusing.



                  That said, consider something like:




                  I want to find the word "orange" anywhere in my string.




                  You can typically take advantage of regex searching by doing something like:



                  /orange/


                  But you can also tie your search to the beginning of the string by inserting "match any" patterns before your word:



                  /^.*orange/


                  (Note that neither example requires orange to be a word at present. Something like "storange" would match. Save that for later.)



                  You can do the same thing with apple, but how can you tie them together?



                  One easy way, that works in a lot of engines but might not perform well, is simply to spell out both possibilities:




                  I want to find the word "orange" followed by any number of characters followed by the word "apple" OR the word "apple" followed by any number of characters followed by the word "orange".




                  That's an alternation, which is | (vertical bar) in regex. Sometimes you may need to escape the vertical bar for the regex engine (basic vs. extended). Some other times you may have to escape it for the command line parser. So depending on how you are using your regex, you might have to write | or \\| or something in between.



                  But, the sub-patterns are simple:



                  /orange.*apple/
                  /apple.*orange/


                  So first you alternate them in a non-capturing group (if possible! Check your docs, use a capturing group if you must.) like so:



                  /(orange.*apple|apple.*orange)/


                  Then prepend the "tie to start of string" at the front:



                  /^.*(orange.*apple|apple.*orange)/


                  Now you can match text that contains both words in either order.
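A small sketch of the either-order pattern in Python (my choice of language; the helper name has_both is hypothetical). Note that "box" is not excluded yet at this stage, so the sentence from the question that should be rejected still matches.

```python
import re

# Either order: orange ... apple, or apple ... orange.
both = re.compile(r"^.*(orange.*apple|apple.*orange)")

def has_both(text):
    """Return True if text contains both words, in either order."""
    return both.search(text) is not None

for sentence in [
    "One orange and one apple",
    "an apple, then an orange",
    "only an orange",
    "one orange and apple within a box",
]:
    print(sentence, "->", has_both(sentence))
```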



                  Finally, you can harness the power of negative lookahead to block the word "box". Use the special syntax for this, which may vary but is probably something close to (?! ... ) (where ... is "box" in our case).




                  I don't want to be looking at the word "box" next.




                  Is a regex like:



                  /(?!box)/


                  But in your case, you want to say:




                  I don't want to be looking at the word "box" anywhere in the following text.




                  Which is another "any character" special:



                  /(?!.*box)/


                  Now, how can you use this with your existing pattern? Lookahead (and "lookbehind") are both zero-width assertions. This means that they can fail, because they are assertions, but they consume zero input characters (zero-width). So all you have to do is pay attention to where you put them, since they make their assertion exactly in whatever place they correspond to.
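To make "zero-width" and "position matters" concrete, here is a minimal Python sketch. The sample text and variable names are mine; Pattern.match(string, pos) is the standard way in Python's re to try a match at a given index.

```python
import re

text = "a box of fruit"
peek = re.compile(r"(?!box)")

# At index 0 the upcoming characters are "a b...", so the assertion holds.
at_start = peek.match(text)
# The match is the empty string: the assertion consumed nothing.
consumed = at_start.group(0)
# At index 2 the upcoming characters are "box", so the assertion fails.
at_box = peek.match(text, 2)
# With ".*" inside, the assertion covers the rest of the line.
anywhere = re.match(r"(?!.*box)", text)
no_box = re.match(r"(?!.*box)", "a bag of fruit")

print(at_start is not None, repr(consumed), at_box, anywhere, no_box is not None)
```

The same assertion passes at one index and fails at another, which is why its placement in the pattern decides what it actually checks.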



                  For this scenario, I think you want to make one simple assertion right at the start: "the word box does not appear" and then proceed to your other matching:




                  I want to find a line that does not have the word "box", but that contains ... apple ... orange, etc.




                  You can do that by dropping the lookahead right after the anchor to the start:



                  /^(?!.*box).*(apple.*orange|orange.*apple)/


                  This translates to



                  At start of string,
                  - confirm "box" does not appear in the line
                  - match any character any number of times,
                  - then either
                      - match "apple",
                      - followed by any chars, any number of times
                      - then "orange"
                  - or
                      - match "orange"
                      - followed by any chars, any number of times
                      - then "apple"
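The breakdown above can be tried against the two sentences from the question. This is a sketch in Python (an assumption on my part, since no language was given); the third sample is mine and shows that, without word boundaries, "box" inside a longer word also blocks the match.

```python
import re

pattern = re.compile(r"^(?!.*box).*(apple.*orange|orange.*apple)")

accepted = "One orange and one apple"
rejected = "one orange and apple within a box"
substring = "boxed apple and orange"  # "box" occurs inside "boxed"

for sentence in (accepted, rejected, substring):
    print(sentence, "->", pattern.search(sentence) is not None)
```

Because . does not match a newline by default in Python, the check applies to the first line of the string only. For multi-line input you would test each line separately or adjust the flags.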


                  There are a couple of other ways to approach this. But you need to be aware of performance. When you do a lookahead, you are inviting another scan of the string. So if you have a * or + in your lookahead, you could be re-scanning the same text over and over again. That slows you down, which is why I recommend putting the lookahead right at the start. You'll either succeed once, or fail immediately.



                  Likewise, the .* before and between your words is a potential problem. Modern engines are usually smart enough to deal with this, but some database engines aren't very smart. Beware of this: do some performance tests, with missing words as well as duplicate words (apple ... apple ... orange, apple ... orange ... orange) to make sure performance is okay. (In this case, '...' means 200 random words.)
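A rough harness for that kind of test, sketched in Python with timeit. Everything here is illustrative: the filler generator, the case names and the repeat count are my assumptions, and the timings will differ by engine and machine, so no expected numbers are shown.

```python
import random
import re
import timeit

pattern = re.compile(r"^(?!.*box).*(apple.*orange|orange.*apple)")

random.seed(0)

def filler(n=200):
    """Return n random 5-letter 'words' drawn from a-j, so none of the
    target words (apple, orange, box) can appear by accident."""
    return " ".join(
        "".join(random.choice("abcdefghij") for _ in range(5)) for _ in range(n)
    )

cases = {
    "both words": f"{filler()} apple {filler()} orange {filler()}",
    "missing orange": f"{filler()} apple {filler()}",
    "duplicate apple": f"apple {filler()} apple {filler()} orange",
    "duplicate orange": f"apple {filler()} orange {filler()} orange",
}

matched = {}
timings = {}
for name, text in cases.items():
    matched[name] = pattern.search(text) is not None
    # Time 200 searches of the same text; compare cases relative to each other.
    timings[name] = timeit.timeit(lambda: pattern.search(text), number=200)

for name in cases:
    print(f"{name:18} matched={matched[name]!s:5} seconds={timings[name]:.4f}")
```

The failing case ("missing orange") is usually the one to watch, since a backtracking engine has to exhaust its options before giving up.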



                  Next, consider how much you want the words to be words. There is a special syntax for that in regex, which may not be present or may vary by engine. Typically, a word boundary assertion is spelled \b, like \bapple\b, but you may have to write \yapple\y, \mapple\M, \<apple\>, or even [[:<:]]apple[[:>:]]. Check your documentation.
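In Python's re module the boundary is written \b, so a whole-word version of the pattern could look like the sketch below. The names loose and strict and the sample strings are mine.

```python
import re

# Substring version: "apple" inside "pineapple" counts.
loose = re.compile(r"^(?!.*box).*(apple.*orange|orange.*apple)")

# Whole-word version: every word is wrapped in \b ... \b.
strict = re.compile(
    r"^(?!.*\bbox\b).*(\bapple\b.*\borange\b|\borange\b.*\bapple\b)"
)

s1 = "pineapple and storange"      # letters present, words absent
s2 = "apple and orange, unboxed"   # "box" only inside another word

for s in (s1, s2):
    print(s, "-> loose:", loose.search(s) is not None,
          "strict:", strict.search(s) is not None)
```

The two versions disagree on both samples: the loose pattern accepts s1 and rejects s2, while the strict one does the opposite.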



                  Finally, consider that positive lookahead is another way to handle this when the alternatives are just different orderings of the same required words. Instead of the apple.*orange|orange.*apple construction, you might just use two positive lookahead expressions at the start of the pattern. This has definite performance implications, since the two expressions imply two scans through the text. It does simplify the construction of the regex, which matters if you want more than two words (spelling out every ordering grows factorially), and especially if you want to generate the pattern programmatically:



                  /^(?!.*box)(?=.*apple)(?=.*orange)./


                  The . at the end is just to force a single character to participate. This expression says




                  I want a line that does not hold the word "box", does hold "apple", and does hold "orange".




                  You can see how to extend this with more words, but note that each (?=.*word) you add is another scan of the text. If your text items are 80 characters or less you may not care, but if you are searching thousands of characters for words that are likely to be just a few characters apart, the previous version will perform better.
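Generating the lookahead form from word lists might look like this in Python. It is a sketch only: build_pattern is a hypothetical helper, not part of any library, and it matches substrings rather than whole words, just like the pattern above.

```python
import re

def build_pattern(required, forbidden):
    """Build ^(?!.*bad)...(?=.*good)... followed by a single '.'."""
    parts = ["^"]
    # One negative lookahead per forbidden word.
    parts += [f"(?!.*{re.escape(word)})" for word in forbidden]
    # One positive lookahead per required word; order does not matter.
    parts += [f"(?=.*{re.escape(word)})" for word in required]
    # Force one real character to participate in the match.
    parts.append(".")
    return re.compile("".join(parts))

fruit = build_pattern(["apple", "orange"], ["box"])
print(fruit.pattern)
print(fruit.search("One orange and one apple") is not None)
print(fruit.search("one orange and apple within a box") is not None)
```

re.escape keeps words containing regex metacharacters from being interpreted as pattern syntax. Adding a third required word is one more list entry rather than six spelled-out orderings.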
















                  answered Nov 17 at 20:08









                  Austin Hastings

                  10.6k11032


































                       
