Overlapping matches in R
I have searched and was able to find this forum discussion for achieving the effect of overlapping matches.
I also found the following SO question that speaks of finding indexes to perform this task, but I was not able to find anything concise about grabbing overlapping matches in the R language.
I can perform this task in almost any language that supports PCRE by using a positive lookahead assertion with a capturing group inside the lookahead to capture the overlapped matches.
But when I do this in R the same way I would in other languages, using perl=T, it yields no results.
> x <- 'ACCACCACCAC'
> regmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
[1] "" "" "" "" "" "" ""
The same goes for both the stringi and stringr packages.
> library(stringi)
> library(stringr)
> stri_extract_all_regex(x, '(?=([AC]C))')[[1]]
[1] "" "" "" "" "" "" ""
> str_extract_all(x, perl('(?=([AC]C))'))[[1]]
[1] "" "" "" "" "" "" ""
The correct results that should be returned when executing this are:
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Edit
I am well aware that regmatches does not work well with captured matches, but what exactly causes this behavior in regmatches, and why are no results returned? I am looking for a somewhat detailed answer.
Are the stringi and stringr packages not capable of performing this over regmatches?
Please feel free to add to my answer or come up with a different workaround than the one I have found.
regex r string dna-sequence stringi
asked Sep 12 '14 at 2:56 by hwnd (edited May 23 '17 at 11:51 by Community)
6 Answers
The standard regmatches does not work well with captured matches (specifically, multiple captured matches in the same string). And in this case, since you're "matching" a lookahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<- function that may illustrate this. Observe:
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"
Notice how all the letters are preserved; we've just replaced the locations of the zero-length matches with something we can observe.
I've created a regcapturedmatches() function that I often use for such tasks. For example:
x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
gregexpr is grabbing all the data just fine, so you can extract it from that object any way you like if you prefer not to use this helper function.
answered Sep 12 '14 at 3:37 by MrFlick (edited Sep 12 '14 at 3:50)
+1 Interesting function you've created. I am well aware of zero-width matches, so basically regmatches and the other packages such as stringi and stringr are not meant to handle this?
– hwnd Sep 12 '14 at 3:45
I can't speak to stringr as I've never used that myself, but regmatches really focuses on the match rather than the capture (which are highly related but slightly different). I've added an additional sample to try to make it clear what regmatches() is capturing compared to my function.
– MrFlick Sep 12 '14 at 3:50
Yea, I've used regmatches()<- like that beforehand to observe the effect of the zero-width matches.
– hwnd Sep 12 '14 at 3:53
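To make the difference between the match and the capture concrete, here is a minimal base R sketch that inspects what gregexpr() records for the pattern from the question. The variable names are only illustrative, and the printed values in the comments are what the question's own output implies.

```r
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl = TRUE)[[1]]

# Start position of each zero-length lookahead match
as.integer(m)
# [1]  1  2  4  5  7  8 10

# The match itself has no width, which is why regmatches() returns ""
attr(m, "match.length")
# [1] 0 0 0 0 0 0 0

# The capture group inside the lookahead does have a start and a width;
# both are stored as matrices with one column per capture group
attr(m, "capture.start")[, 1]
attr(m, "capture.length")[, 1]
```

Every approach that recovers the overlapping substrings in base R reads these capture.* attributes (or assumes a fixed width) instead of relying on match.length.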
As far as a workaround, this is what I have come up with to extract the overlapping matches.
> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)
> mapply(function(X) substr(x, X, X+1), m[[1]])
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Please feel free to add or comment on a better way to perform this task.
answered Sep 12 '14 at 2:56 by hwnd (edited Sep 12 '14 at 3:54)
The problem with this solution is that it only works when the captured region is always 2 characters long. A more general solution is this:
– Ken Williams Aug 10 '15 at 14:46
Oops. I forgot I can't put code blocks in comments. Will make this a separate answer.
– Ken Williams Aug 10 '15 at 14:48
Another roundabout way of extracting the same information that I've used in the past is to replace the "match.length" with the "capture.length":
x <- c("ACCACCACCAC","ACCACCACCAC")
m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
m <- lapply(m, function(i) {
  attr(i,"match.length") <- attr(i,"capture.length")
  i
})
regmatches(x,m)
#[[1]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
#
#[[2]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
answered Sep 12 '14 at 5:10 by thelatemail (edited Sep 12 '14 at 5:46)
+1 Thanks for the additional solution. I've done similar using capture.start and capture.length.
– hwnd Sep 12 '14 at 5:28
It's not a regex solution, and it doesn't really answer any of your more important questions, but you could also get your desired result by taking the substrings two characters at a time and then removing the unwanted CA elements.
x <- 'ACCACCACCAC'
y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
y[y != "CA"]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
answered Sep 13 '14 at 2:54 by Rich Scriven (edited Aug 10 '15 at 15:52)
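The same sliding-window idea can be sketched for an arbitrary window width. The helper name sliding_windows below is hypothetical (it is not from any package), and filtering with grepl() is just one assumed way to keep the windows of interest.

```r
# Hypothetical helper: every overlapping window of width k in one string.
sliding_windows <- function(x, k) {
  n <- nchar(x)
  if (n < k) return(character(0))   # string shorter than the window
  starts <- seq_len(n - k + 1)
  substring(x, starts, starts + k - 1)
}

x <- 'ACCACCACCAC'
w <- sliding_windows(x, 2)

# Keep only the windows that match the question's pattern in full
w[grepl('^[AC]C$', w)]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
```

This only suits fixed-width targets; a capture whose width varies still needs the regex-based approaches.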
A stringi solution using a capture group in the look-ahead part:
> stri_match_all_regex('ACCACCACCAC', '(?=([AC]C))')[[1]][,2]
## [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
answered Oct 26 '14 at 19:55 by gagolews (edited Mar 24 at 20:16)
Weird, how come it failed to work with stri_extract_all_regex?
– hwnd Oct 26 '14 at 20:00
@hwnd: it's a 0-length match; (?=...) does not advance the input position.
– gagolews Oct 26 '14 at 20:02
Yes, I know it's a zero-width match =) I guess there is a difference between extract_all_regex and match_all_regex.
– hwnd Oct 26 '14 at 20:04
No, the 1st column of the resulting matrix (the whole match) consists only of empty strings :)
– gagolews Oct 26 '14 at 20:05
OK, now I see and understand what you mean.
– hwnd Oct 26 '14 at 20:06
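The point made in these comments can be seen by looking at the whole matrix rather than only its second column. This is a small sketch that assumes the stringi package is installed; the guard simply skips the example otherwise, and the name mat is illustrative.

```r
if (requireNamespace("stringi", quietly = TRUE)) {
  mat <- stringi::stri_match_all_regex('ACCACCACCAC', '(?=([AC]C))')[[1]]

  # Column 1 is the whole match: the lookahead consumes nothing, so it is ""
  print(mat[, 1])

  # Column 2 is capture group 1: the overlapping substrings
  print(mat[, 2])
}
```

stri_extract_all_regex() returns only what column 1 holds here, which is why it yields empty strings for this pattern.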
An additional answer, based on @hwnd's own answer (the original didn't allow variable-length captured regions), using just built-in R functions:
> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)[[1]]
> start <- attr(m,"capture.start")
> end <- attr(m,"capture.start") + attr(m,"capture.length") - 1
> sapply(seq_along(m), function(i) substr(x, start[i], end[i]))
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Pretty ugly, which is why the stringr etc. packages exist.
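One way to tidy this up is to wrap the same capture.start / capture.length arithmetic in a small function that also accepts a character vector. This is only a sketch: capture_overlaps is a hypothetical name, and it assumes the pattern contains at least one capture group and that only the first group is wanted.

```r
# Hypothetical helper (not part of base R): for each element of x, return
# the text captured by the first group at every match of a perl-style pattern.
capture_overlaps <- function(x, pattern) {
  matches <- gregexpr(pattern, x, perl = TRUE)
  lapply(seq_along(x), function(i) {
    m <- matches[[i]]
    # gregexpr() reports -1 when nothing matched (and NA for NA input)
    if (is.na(m[1]) || m[1] == -1) return(character(0))
    start <- as.integer(attr(m, "capture.start")[, 1])
    len   <- as.integer(attr(m, "capture.length")[, 1])
    substring(x[i], start, start + len - 1)
  })
}

capture_overlaps('ACCACCACCAC', '(?=([AC]C))')[[1]]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

# A capture of varying width: an A followed by one or more C
capture_overlaps('ACCACCACCAC', '(?=(AC+))')[[1]]
```

The second call returns captures of different lengths from the same input, which is the case the fixed X+1 offset in the earlier workaround cannot handle.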
6 Answers
6
active
oldest
votes
6 Answers
6
active
oldest
votes
active
oldest
votes
active
oldest
votes
The standard regmatches
does not work well with captured matches (specifically multiple captured matches in the same string). And in this case, since you're "matching" a look ahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<-
function that may illustrate this. Obseerve
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"
Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.
I've created a regcapturedmatches() function that I often use for such tasks. For example
x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
The gregexpr
is grabbing all the data just fine so you can extract it from that object anyway you life if you prefer not to use this helper function.
+1 Interesting function you've created. I am well aware of zero-width matches, so basically regmatches and the other packages such as stringi, r are not meant to handle this?
– hwnd
Sep 12 '14 at 3:45
I can't speak tostringr
as I've never used that myself, butregmatches
really focuses on the match rather than the capture (which are highly related by slightly different). I've added an additional sample to try to make it clear what theregmatches()
is capturing compared to my function.`
– MrFlick
Sep 12 '14 at 3:50
Yea I've usedregmatches()<-
like that before hand to observe the effect of the zero-width matches.
– hwnd
Sep 12 '14 at 3:53
add a comment |
The standard regmatches
does not work well with captured matches (specifically multiple captured matches in the same string). And in this case, since you're "matching" a look ahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<-
function that may illustrate this. Obseerve
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"
Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.
I've created a regcapturedmatches() function that I often use for such tasks. For example
x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
The gregexpr
is grabbing all the data just fine so you can extract it from that object anyway you life if you prefer not to use this helper function.
+1 Interesting function you've created. I am well aware of zero-width matches, so basically regmatches and the other packages such as stringi, r are not meant to handle this?
– hwnd
Sep 12 '14 at 3:45
I can't speak tostringr
as I've never used that myself, butregmatches
really focuses on the match rather than the capture (which are highly related by slightly different). I've added an additional sample to try to make it clear what theregmatches()
is capturing compared to my function.`
– MrFlick
Sep 12 '14 at 3:50
Yea I've usedregmatches()<-
like that before hand to observe the effect of the zero-width matches.
– hwnd
Sep 12 '14 at 3:53
add a comment |
The standard regmatches
does not work well with captured matches (specifically multiple captured matches in the same string). And in this case, since you're "matching" a look ahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<-
function that may illustrate this. Obseerve
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"
Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.
I've created a regcapturedmatches() function that I often use for such tasks. For example
x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
The gregexpr
is grabbing all the data just fine so you can extract it from that object anyway you life if you prefer not to use this helper function.
The standard regmatches
does not work well with captured matches (specifically multiple captured matches in the same string). And in this case, since you're "matching" a look ahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<-
function that may illustrate this. Obseerve
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"
Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.
I've created a regcapturedmatches() function that I often use for such tasks. For example
x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
The gregexpr
is grabbing all the data just fine so you can extract it from that object anyway you life if you prefer not to use this helper function.
edited Sep 12 '14 at 3:50
answered Sep 12 '14 at 3:37
MrFlickMrFlick
132k12 gold badges159 silver badges193 bronze badges
132k12 gold badges159 silver badges193 bronze badges
+1 Interesting function you've created. I am well aware of zero-width matches, so basically regmatches and the other packages such as stringi, r are not meant to handle this?
– hwnd
Sep 12 '14 at 3:45
I can't speak tostringr
as I've never used that myself, butregmatches
really focuses on the match rather than the capture (which are highly related by slightly different). I've added an additional sample to try to make it clear what theregmatches()
is capturing compared to my function.`
– MrFlick
Sep 12 '14 at 3:50
Yea I've usedregmatches()<-
like that before hand to observe the effect of the zero-width matches.
– hwnd
Sep 12 '14 at 3:53
add a comment |
+1 Interesting function you've created. I am well aware of zero-width matches, so basically regmatches and the other packages such as stringi, r are not meant to handle this?
– hwnd
Sep 12 '14 at 3:45
I can't speak tostringr
as I've never used that myself, butregmatches
really focuses on the match rather than the capture (which are highly related by slightly different). I've added an additional sample to try to make it clear what theregmatches()
is capturing compared to my function.`
– MrFlick
Sep 12 '14 at 3:50
Yea I've usedregmatches()<-
like that before hand to observe the effect of the zero-width matches.
– hwnd
Sep 12 '14 at 3:53
+1 Interesting function you've created. I am well aware of zero-width matches, so basically regmatches and the other packages such as stringi, r are not meant to handle this?
– hwnd
Sep 12 '14 at 3:45
+1 Interesting function you've created. I am well aware of zero-width matches, so basically regmatches and the other packages such as stringi, r are not meant to handle this?
– hwnd
Sep 12 '14 at 3:45
I can't speak to
stringr
as I've never used that myself, but regmatches
really focuses on the match rather than the capture (which are highly related by slightly different). I've added an additional sample to try to make it clear what the regmatches()
is capturing compared to my function.`– MrFlick
Sep 12 '14 at 3:50
I can't speak to
stringr
as I've never used that myself, but regmatches
really focuses on the match rather than the capture (which are highly related by slightly different). I've added an additional sample to try to make it clear what the regmatches()
is capturing compared to my function.`– MrFlick
Sep 12 '14 at 3:50
Yea I've used
regmatches()<-
like that before hand to observe the effect of the zero-width matches.– hwnd
Sep 12 '14 at 3:53
Yea I've used
regmatches()<-
like that before hand to observe the effect of the zero-width matches.– hwnd
Sep 12 '14 at 3:53
add a comment |
As far as a workaround, this is what I have come up with to extract the overlapping matches.
> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)
> mapply(function(X) substr(x, X, X+1), m[[1]])
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Please feel free to add or comment on a better way to perform this task.
The problem with this solution is that it only works when the captured region is always 2 characters long. A more general solution is this:
– Ken Williams
Aug 10 '15 at 14:46
Oops. I forgot I can't put code blocks in comments. Will make this a separate answer.
– Ken Williams
Aug 10 '15 at 14:48
add a comment |
As far as a workaround, this is what I have come up with to extract the overlapping matches.
> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)
> mapply(function(X) substr(x, X, X+1), m[[1]])
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Please feel free to add or comment on a better way to perform this task.
The problem with this solution is that it only works when the captured region is always 2 characters long. A more general solution is this:
– Ken Williams
Aug 10 '15 at 14:46
Oops. I forgot I can't put code blocks in comments. Will make this a separate answer.
– Ken Williams
Aug 10 '15 at 14:48
add a comment |
As far as a workaround, this is what I have come up with to extract the overlapping matches.
> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)
> mapply(function(X) substr(x, X, X+1), m[[1]])
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Please feel free to add or comment on a better way to perform this task.
As far as a workaround, this is what I have come up with to extract the overlapping matches.
> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)
> mapply(function(X) substr(x, X, X+1), m[[1]])
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Please feel free to add or comment on a better way to perform this task.
edited Sep 12 '14 at 3:54
answered Sep 12 '14 at 2:56
hwndhwnd
61.3k4 gold badges59 silver badges102 bronze badges
61.3k4 gold badges59 silver badges102 bronze badges
The problem with this solution is that it only works when the captured region is always 2 characters long. A more general solution is this:
– Ken Williams
Aug 10 '15 at 14:46
Oops. I forgot I can't put code blocks in comments. Will make this a separate answer.
– Ken Williams
Aug 10 '15 at 14:48
add a comment |
The problem with this solution is that it only works when the captured region is always 2 characters long. A more general solution is this:
– Ken Williams
Aug 10 '15 at 14:46
Oops. I forgot I can't put code blocks in comments. Will make this a separate answer.
– Ken Williams
Aug 10 '15 at 14:48
The problem with this solution is that it only works when the captured region is always 2 characters long. A more general solution is this:
– Ken Williams
Aug 10 '15 at 14:46
The problem with this solution is that it only works when the captured region is always 2 characters long. A more general solution is this:
– Ken Williams
Aug 10 '15 at 14:46
Oops. I forgot I can't put code blocks in comments. Will make this a separate answer.
– Ken Williams
Aug 10 '15 at 14:48
Oops. I forgot I can't put code blocks in comments. Will make this a separate answer.
– Ken Williams
Aug 10 '15 at 14:48
add a comment |
Another roundabout way of extracting the same information that I've done in the past is to replace the "match.length"
with the "capture.length"
:
x <- c("ACCACCACCAC","ACCACCACCAC")
m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
m <- lapply(m, function(i)
attr(i,"match.length") <- attr(i,"capture.length")
i
)
regmatches(x,m)
#[[1]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
#
#[[2]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
+1 Thanks for the additional solution. I've done similar usingcapture.start
andcapture.length
.
– hwnd
Sep 12 '14 at 5:28
add a comment |
Another roundabout way of extracting the same information that I've done in the past is to replace the "match.length"
with the "capture.length"
:
x <- c("ACCACCACCAC","ACCACCACCAC")
m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
m <- lapply(m, function(i)
attr(i,"match.length") <- attr(i,"capture.length")
i
)
regmatches(x,m)
#[[1]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
#
#[[2]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
+1 Thanks for the additional solution. I've done similar usingcapture.start
andcapture.length
.
– hwnd
Sep 12 '14 at 5:28
add a comment |
Another roundabout way of extracting the same information that I've done in the past is to replace the "match.length"
with the "capture.length"
:
x <- c("ACCACCACCAC","ACCACCACCAC")
m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
m <- lapply(m, function(i)
attr(i,"match.length") <- attr(i,"capture.length")
i
)
regmatches(x,m)
#[[1]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
#
#[[2]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Another roundabout way of extracting the same information that I've done in the past is to replace the "match.length"
with the "capture.length"
:
x <- c("ACCACCACCAC","ACCACCACCAC")
m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
m <- lapply(m, function(i)
attr(i,"match.length") <- attr(i,"capture.length")
i
)
regmatches(x,m)
#[[1]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
#
#[[2]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
edited Sep 12 '14 at 5:46
answered Sep 12 '14 at 5:10
thelatemailthelatemail
71.4k10 gold badges91 silver badges158 bronze badges
71.4k10 gold badges91 silver badges158 bronze badges
+1 Thanks for the additional solution. I've done similar usingcapture.start
andcapture.length
.
– hwnd
Sep 12 '14 at 5:28
add a comment |
+1 Thanks for the additional solution. I've done similar usingcapture.start
andcapture.length
.
– hwnd
Sep 12 '14 at 5:28
+1 Thanks for the additional solution. I've done similar using
capture.start
and capture.length
.– hwnd
Sep 12 '14 at 5:28
+1 Thanks for the additional solution. I've done similar using
capture.start
and capture.length
.– hwnd
Sep 12 '14 at 5:28
add a comment |
It's not a regex solution, and doesn't really answer any of your more important questions, but you could also get your desired result by using the substrings of two characters at a time and then removing the unwanted CA
elements.
x <- 'ACCACCACCAC'
y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
y[y != "CA"]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
add a comment |
It's not a regex solution, and doesn't really answer any of your more important questions, but you could also get your desired result by using the substrings of two characters at a time and then removing the unwanted CA
elements.
x <- 'ACCACCACCAC'
y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
y[y != "CA"]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
add a comment |
It's not a regex solution, and doesn't really answer any of your more important questions, but you could also get your desired result by using the substrings of two characters at a time and then removing the unwanted CA
elements.
x <- 'ACCACCACCAC'
y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
y[y != "CA"]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
It's not a regex solution, and doesn't really answer any of your more important questions, but you could also get your desired result by using the substrings of two characters at a time and then removing the unwanted CA
elements.
x <- 'ACCACCACCAC'
y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
y[y != "CA"]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
edited Aug 10 '15 at 15:52
answered Sep 13 '14 at 2:54
Rich ScrivenRich Scriven
79.6k8 gold badges117 silver badges186 bronze badges
79.6k8 gold badges117 silver badges186 bronze badges
add a comment |
add a comment |
A stringi solution using a capture group in the look-ahead part:
> library(stringi)
> stri_match_all_regex('ACCACCACCAC', '(?=([AC]C))')[[1]][,2]
## [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
edited Mar 24 at 20:16
answered Oct 26 '14 at 19:55
gagolews
Weird, how come it failed to work with stri_extract_all_regex? – hwnd Oct 26 '14 at 20:00
@hwnd: it's a 0-length match; (?=...) does not advance the input position. – gagolews Oct 26 '14 at 20:02
Yes I know it's a zero-width match =) I guess there is a difference between extract_all_regex and match_all_regex. – hwnd Oct 26 '14 at 20:04
No, the 1st column of the resulting matrix (the whole match) consists only of empty strings :) – gagolews Oct 26 '14 at 20:05
Ok now I see and understand what you mean. – hwnd Oct 26 '14 at 20:06
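The point about zero-length matches can also be checked in base R. Note that this is an analogue of the stringi behaviour discussed in these comments, not the stringi calls themselves: the look-ahead consumes no input, so each whole match is empty and only the capture group carries text.

```r
x <- 'ACCACCACCAC'
g <- gregexpr('(?=([AC]C))', x, perl = TRUE)
m <- g[[1]]

# Every overall match has length 0 ...
whole_len <- attr(m, "match.length")

# ... so extracting the whole matches yields only empty strings,
whole <- regmatches(x, g)[[1]]

# while the capture group inside the look-ahead is 2 characters long each time.
cap_len <- as.vector(attr(m, "capture.length"))
```

This is why a function that returns only the whole match has nothing to show, while one that also returns capture groups does.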
An additional answer, based on @hwnd's own answer (the original didn't allow variable-length captured regions), using just built-in R functions:
> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=TRUE)[[1]]
> start <- attr(m,"capture.start")
> end <- start + attr(m,"capture.length") - 1
> sapply(seq_along(m), function(i) substr(x, start[i], end[i]))
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Pretty ugly, which is why packages such as stringr exist.
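Because `substring()` is vectorised over its start and end arguments, the `sapply()` loop can probably be dropped. The following is a sketch only: `capture_overlapping` is a hypothetical helper name, and it assumes the pattern contains exactly one capture group inside a look-ahead.

```r
# Hypothetical helper: overlapping captures via a look-ahead, base R only.
capture_overlapping <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  if (m[1] == -1) return(character(0))            # no match at all
  start <- as.vector(attr(m, "capture.start"))    # drop the matrix dims
  len   <- as.vector(attr(m, "capture.length"))
  substring(x, start, start + len - 1)
}

# Fixed-width captures, as in the answer above
fixed <- capture_overlapping('ACCACCACCAC', '(?=([AC]C))')

# Variable-width captures: an A followed by one or more C
variable <- capture_overlapping('ACCACCACCAC', '(?=(AC+))')
```

The second call shows the variable-length case: the captured regions need not all have the same width.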
answered Aug 10 '15 at 14:51
Ken Williams