Overlapping matches in R


14















I have searched and was able to find this forum discussion for achieving the effect of overlapping matches.



I also found the following SO question speaking of finding indexes to perform this task, but was not able to find anything concise about grabbing overlapping matches in the R language.



In most languages that support PCRE, I can do this with a positive lookahead assertion that contains a capturing group; the group captures the overlapping matches.



But when I do this in R the same way I would in other languages, using perl=T, I get no results.



> x <- 'ACCACCACCAC'
> regmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
[1] "" "" "" "" "" "" ""


The same goes for both the stringi and stringr packages.



> library(stringi)
> library(stringr)
> stri_extract_all_regex(x, '(?=([AC]C))')[[1]]
[1] "" "" "" "" "" "" ""
> str_extract_all(x, perl('(?=([AC]C))'))[[1]]
[1] "" "" "" "" "" "" ""


The results that should be returned are:



[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"


Edit



  1. I am well aware that regmatches does not work well with captured matches, but what exactly causes this behavior in regmatches, and why are no results returned? I am looking for a reasonably detailed answer.


  2. Are the stringi and stringr packages no more capable of this than regmatches?


  3. Please feel free to add to my answer or come up with a different workaround than I have found.










      regex r string dna-sequence stringi






      edited May 23 '17 at 11:51









      Community











      asked Sep 12 '14 at 2:56









      hwnd

























          6 Answers


















          4















          Another roundabout way of extracting the same information, which I've used in the past, is to replace the "match.length" attribute with "capture.length":



          x <- c("ACCACCACCAC","ACCACCACCAC")
          m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
          m <- lapply(m, function(i) {
            # report the captured span instead of the zero-length match
            attr(i,"match.length") <- attr(i,"capture.length")
            i
          })
          regmatches(x,m)

          #[[1]]
          #[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
          #
          #[[2]]
          #[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
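The attribute swap above could be wrapped in a small helper. This is only a sketch under stated assumptions: the name `extract_captures` is hypothetical (it is not part of base R or any package), and it assumes the pattern contains exactly one capture group and is run with `perl = TRUE`. It also copies `capture.start` over the match positions, so it should still work when the group does not begin exactly where the lookahead does.

```r
# Hypothetical helper (illustration only): return the text of the first
# capture group for every match, including zero-length lookahead matches.
extract_captures <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)
  m <- lapply(m, function(i) {
    # Leave "no match" (-1) and NA results untouched.
    if (!is.na(i[1]) && i[1] != -1) {
      # Point regmatches() at the captured span instead of the empty match.
      i[] <- attr(i, "capture.start")[, 1]
      attr(i, "match.length") <- attr(i, "capture.length")[, 1]
    }
    i
  })
  regmatches(x, m)
}

res <- extract_captures(c("ACCACCACCAC", "GGG"), "(?=([AC]C))")
res[[1]]  # the seven overlapping two-character matches
res[[2]]  # character(0): no match in "GGG"
```

Inputs without a match come back as empty character vectors, which is the same convention `regmatches` uses.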





          • +1 Thanks for the additional solution. I've done similar using capture.start and capture.length.

            – hwnd
            Sep 12 '14 at 5:28


















          4















          It's not a regex solution, and it doesn't really answer any of your more important questions, but you could also get your desired result by taking every two-character substring and then removing the unwanted CA elements.



          x <- 'ACCACCACCAC'
          y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
          y[y != "CA"]
          # [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
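The same windowing idea can keep the pattern rather than list what to exclude. A minimal sketch, assuming fixed-width windows of two characters; the variable names (`width`, `starts`, `windows`, `hits`) are illustrative only.

```r
x <- "ACCACCACCAC"
width <- 2  # window size; assumed fixed for this sketch

# Every substring of length `width`, one per starting position.
starts <- seq_len(nchar(x) - width + 1)
windows <- substring(x, starts, starts + width - 1)

# Keep only the windows that match the pattern in full.
hits <- windows[grepl("^[AC]C$", windows)]
hits
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
```

Anchoring the pattern with `^` and `$` makes each window match as a whole, so this only suits patterns whose matches all have the same known width.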





            4















            A stringi solution using a capture group in the look-ahead part:



            > stri_match_all_regex('ACCACCACCAC', '(?=([AC]C))')[[1]][,2]
            ## [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"





            • Weird, how come it failed to work with stri_extract_all_regex?

              – hwnd
              Oct 26 '14 at 20:00












            • @hwnd: it's a 0-length match; (?=...) does not advance the input position.

              – gagolews
              Oct 26 '14 at 20:02












            • Yes I know it's a zero-width match =) I guess there is a difference between extract_all_regex and match_all_regex

              – hwnd
              Oct 26 '14 at 20:04











            • No, the 1st column of the resulting matrix (the whole match) consists only of empty strings :)

              – gagolews
              Oct 26 '14 at 20:05












            • Ok now I see and understand what you mean.

              – hwnd
              Oct 26 '14 at 20:06



















            1















            An additional answer, based on @hwnd's own answer (the original didn't allow variable-length captured regions), using just built-in R functions:



            > x <- 'ACCACCACCAC'
            > m <- gregexpr('(?=([AC]C))', x, perl=T)[[1]]
            > start <- attr(m,"capture.start")
            > end <- attr(m,"capture.start") + attr(m,"capture.length") - 1
            > sapply(seq_along(m), function(i) substr(x, start[i], end[i]))
            [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"


            Pretty ugly, which is why the stringr etc. packages exist.
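To see why the recorded capture length matters, here is a small sketch with a pattern whose capture varies in width. The pattern `(?=(AC+))` is an assumption chosen for illustration and is not from the original question.

```r
x <- "ACCACCACCAC"
# Variable-length capture: an A followed by one or more Cs.
m <- gregexpr("(?=(AC+))", x, perl = TRUE)[[1]]

start <- as.vector(attr(m, "capture.start"))
len   <- as.vector(attr(m, "capture.length"))

fixed    <- substring(x, start, start + 1)        # assumes every capture is 2 wide
variable <- substring(x, start, start + len - 1)  # uses the recorded width

fixed
# [1] "AC" "AC" "AC" "AC"
variable
# [1] "ACC" "ACC" "ACC" "AC"
```

The fixed-width version silently truncates the three-character captures, while the version driven by `capture.length` returns each capture in full.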






              6









              The standard regmatches does not work well with captured matches (specifically multiple captured matches in the same string). And in this case, since you're "matching" a lookahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<- function that may illustrate this. Observe:



              x <- 'ACCACCACCAC'
              m <- gregexpr('(?=([AC]C))', x, perl=T)
              regmatches(x, m) <- "~"
              x
              # [1] "~A~CC~A~CC~A~CC~AC"


              Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.



              I've created a regcapturedmatches() function that I often use for such tasks. For example



              x <- 'ACCACCACCAC'
              regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]

              # [,1] [,2] [,3] [,4] [,5] [,6] [,7]
              # [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"


              The gregexpr call is grabbing all the data just fine, so you can extract it from that object any way you like if you prefer not to use this helper function.
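As a rough illustration of what "all the data" means here, the sketch below only inspects the attributes that `gregexpr` attaches to its result when called with `perl = TRUE`. The variable names are illustrative.

```r
x <- "ACCACCACCAC"
m <- gregexpr("(?=([AC]C))", x, perl = TRUE)[[1]]

pos <- as.vector(m)                             # where each lookahead succeeded
ml  <- attr(m, "match.length")                  # width of each overall match
cs  <- as.vector(attr(m, "capture.start"))      # where each capture begins
cl  <- as.vector(attr(m, "capture.length"))     # width of each capture

pos  # 1 2 4 5 7 8 10
ml   # all 0: the overall matches are empty
cl   # all 2: the captured text is still recorded
```

Because `regmatches` reads only the match positions and `match.length`, it returns empty strings, even though the capture attributes hold everything needed to recover the overlapping substrings.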







              edited Sep 12 '14 at 3:50

























              answered Sep 12 '14 at 3:37









              MrFlick















              • +1 Interesting function you've created. I am well aware of zero-width matches, so basically regmatches and the other packages such as stringi and stringr are not meant to handle this?

                – hwnd
                Sep 12 '14 at 3:45












              • I can't speak to stringr as I've never used that myself, but regmatches really focuses on the match rather than the capture (which are highly related but slightly different). I've added an additional sample to try to make it clear what regmatches() is capturing compared to my function.

                – MrFlick
                Sep 12 '14 at 3:50












              • Yeah, I've used regmatches()<- like that beforehand to observe the effect of the zero-width matches.

                – hwnd
                Sep 12 '14 at 3:53

















              7









              As far as a workaround, this is what I have come up with to extract the overlapping matches.



              > x <- 'ACCACCACCAC'
              > m <- gregexpr('(?=([AC]C))', x, perl=T)
              > mapply(function(X) substr(x, X, X+1), m[[1]])
              [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"


              Please feel free to add or comment on a better way to perform this task.






              share|improve this answer















              As far as a workaround, this is what I have come up with to extract the overlapping matches.



              > x <- 'ACCACCACCAC'
              > m <- gregexpr('(?=([AC]C))', x, perl=T)
              > mapply(function(X) substr(x, X, X+1), m[[1]])
              [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"


              Please feel free to add or comment on a better way to perform this task.







              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited Sep 12 '14 at 3:54

























              answered Sep 12 '14 at 2:56









              hwnd

              61.3k reputation, 4 gold badges, 59 silver badges, 102 bronze badges


























              4















              Another roundabout way of extracting the same information, which I've used in the past, is to replace the "match.length" attribute with "capture.length":

              x <- c("ACCACCACCAC","ACCACCACCAC")
              m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
              # Overwrite the zero match lengths with the capture lengths,
              # so that regmatches() extracts the captured text instead.
              m <- lapply(m, function(i) {
                attr(i,"match.length") <- attr(i,"capture.length")
                i
              })
              regmatches(x,m)

              #[[1]]
              #[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
              #
              #[[2]]
              #[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
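Because the lengths are taken from the capture group rather than hard-coded, the same attribute swap should also cope with captures of varying width. The sketch below uses a hypothetical pattern, `(?=(AC+))`, that is not part of the answer, and takes the first column of "capture.length" on the assumption that only one capture group is of interest.

```r
x <- "ACCACCACCAC"
m <- gregexpr("(?=(AC+))", x, perl = TRUE)

# Replace the zero-width match lengths with the lengths of capture
# group 1 (first column of the "capture.length" matrix).
m <- lapply(m, function(i) {
  attr(i, "match.length") <- attr(i, "capture.length")[, 1]
  i
})

runs <- regmatches(x, m)[[1]]
runs
# [1] "ACC" "ACC" "ACC" "AC"
```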
































              • +1 Thanks for the additional solution. I've done similar using capture.start and capture.length.

                – hwnd
                Sep 12 '14 at 5:28























              edited Sep 12 '14 at 5:46

























              answered Sep 12 '14 at 5:10









              thelatemail

              71.4k reputation, 10 gold badges, 91 silver badges, 158 bronze badges


























              4















              It's not a regex solution, and it doesn't really answer any of your more important questions, but you could also get the desired result by taking substrings two characters at a time and then removing the unwanted "CA" elements.



              x <- 'ACCACCACCAC'
              y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
              y[y != "CA"]
              # [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
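The same sliding-window idea can be written for an arbitrary window width and combined with a regex filter, so the unwanted elements do not have to be listed by hand. This is only a sketch: the names `k`, `starts`, `windows` and `hits` are illustrative, and it assumes the string is at least `k` characters long.

```r
x <- "ACCACCACCAC"
k <- 2L                                  # window width (assumed)

# Every start position that leaves room for a full window.
starts  <- seq_len(nchar(x) - k + 1L)
windows <- substring(x, starts, starts + k - 1L)

# Keep only the windows that match the pattern in full.
hits <- windows[grepl("^[AC]C$", windows)]
hits
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
```

This only works for patterns of a fixed width, which is the trade-off for avoiding look-aheads.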












































                  edited Aug 10 '15 at 15:52

























                  answered Sep 13 '14 at 2:54









                  Rich Scriven

                  79.6k reputation, 8 gold badges, 117 silver badges, 186 bronze badges
























                      4















                      A stringi solution using a capture group in the look-ahead part (column 2 of the result matrix holds the capture group):

                      > library(stringi)
                      > stri_match_all_regex('ACCACCACCAC', '(?=([AC]C))')[[1]][,2]
                      ## [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
































                      • Weird, how come it failed to work with stri_extract_all_regex?

                        – hwnd
                        Oct 26 '14 at 20:00












                      • @hwnd: it's a 0-length match; (?=...) does not advance the input position.

                        – gagolews
                        Oct 26 '14 at 20:02












                      • Yes I know it's a zero-width match =) I guess there is a difference between extract_all_regex and match_all_regex

                        – hwnd
                        Oct 26 '14 at 20:04











                      • No, the 1st column of the resulting matrix (the whole match) consists only of empty strings :)

                        – gagolews
                        Oct 26 '14 at 20:05












                      • Ok now I see and understand what you mean.

                        – hwnd
                        Oct 26 '14 at 20:06
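The point made in the comments, that the whole match is zero-length and only the capture group carries text, can be checked in base R as well. A small sketch follows; the variable names are illustrative, and it inspects the attributes that `gregexpr()` returns with `perl = TRUE` rather than anything from stringi.

```r
x <- "ACCACCACCAC"
m <- gregexpr("(?=([AC]C))", x, perl = TRUE)

# The whole match has length 0 at every position ...
whole_lengths <- as.vector(attr(m[[1]], "match.length"))

# ... so extracting the whole match yields only empty strings,
whole_matches <- regmatches(x, m)[[1]]

# while the capture group inside the look-ahead is 2 characters wide.
capture_lengths <- as.vector(attr(m[[1]], "capture.length"))
```

Here `whole_matches` contains seven empty strings, the base R counterpart of the first column of the stringi result matrix.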
























                      edited Mar 24 at 20:16

























                      answered Oct 26 '14 at 19:55









                      gagolews

                      10.7k reputation, 2 gold badges, 36 silver badges, 66 bronze badges



























                      1















                      An additional answer, based on @hwnd's own answer (the original didn't allow variable-length captured regions), using just built-in R functions:

                      > x <- 'ACCACCACCAC'
                      > m <- gregexpr('(?=([AC]C))', x, perl=TRUE)[[1]]
                      > start <- attr(m,"capture.start")
                      > end <- start + attr(m,"capture.length") - 1
                      > sapply(seq_along(m), function(i) substr(x, start[i], end[i]))
                      [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

                      Pretty ugly, which is why packages such as stringr exist.
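One way to hide the ugliness is to wrap these steps in a small helper that also accepts several input strings. The function below is a hypothetical sketch, not part of any package: the name `overlapping_captures` is made up, and it assumes the pattern wraps a single capture group in a look-ahead, as in the examples above.

```r
# Hypothetical helper: returns, for each string in x, the text of the
# first capture group at every (possibly overlapping) match position.
overlapping_captures <- function(x, pattern) {
  hits <- gregexpr(pattern, x, perl = TRUE)
  Map(function(string, hit) {
    if (hit[1] == -1L) return(character(0))      # no match at all
    first <- attr(hit, "capture.start")[, 1]
    width <- attr(hit, "capture.length")[, 1]
    substring(string, first, first + width - 1L)
  }, x, hits, USE.NAMES = FALSE)
}

res <- overlapping_captures(c("ACCACCACCAC", "GGG"), "(?=([AC]C))")
res[[1]]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
```

A string without any match, such as "GGG" here, yields an empty character vector rather than an error.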













































                          answered Aug 10 '15 at 14:51









                          Ken Williams

                          13.5k reputation, 5 gold badges, 61 silver badges, 106 bronze badges





























