Figuring out why character matching doesn't workreadr (or other packages from tidyverse) with data.frame instead of tibblecustom functions with group_by tidyverseadd row of non-matching column types to existing dataframeConvert class of a group of backticked columns using the tidyverse?join or merge data frames by if value is in one of multiple columnsSetting row names on a tibble is deprecated. Error: invalid 'row.names' lengthNormalize multiple values using values of one factor in RR: dplyr::lag throws error when trying to lag characters in tibblecan dplyr mutate() create a new tibble from an existing tibble?Splitting a data frame character column into an arbitrary number of columns with dynamic column names in R
Can White Castle?
Find the C-factor of a vote
How would modern naval warfare have to have developed differently for battleships to still be relevant in the 21st century?
What is "industrial ethernet"?
Why does Linux list NVMe drives as /dev/nvme0 instead of /dev/sda?
How is hair tissue mineral analysis performed?
Is a single radon-daughter atom in air a solid?
How do I professionally let my manager know I'll quit over smoking in the office?
How long would it take to cross the Channel in 1890's?
Can humans ever directly see a few photons at a time? Can a human see a single photon?
Why tighten down in a criss-cross pattern?
Interaction between Leyline of Anticipation and Teferi, Time Raveler
Impossible darts scores
Array initialization optimization
Dates on degrees don’t make sense – will people care?
Should developer taking test phones home or put in office?
How much will studying magic in an academy cost?
Run specific apex tests during "sfdx force:source:deploy"
When to remove insignificant variables?
Do I have any obligations to my PhD supervisor's requests after I have graduated?
Why did pressing the joystick button spit out keypresses?
Why do all the teams that I have worked with always finish a sprint without completion of all the stories?
How to model a twisted cylinder like this
Does this Wild Magic result affect the sorcerer or just other creatures?
Figuring out why character matching doesn't work
readr (or other packages from tidyverse) with data.frame instead of tibblecustom functions with group_by tidyverseadd row of non-matching column types to existing dataframeConvert class of a group of backticked columns using the tidyverse?join or merge data frames by if value is in one of multiple columnsSetting row names on a tibble is deprecated. Error: invalid 'row.names' lengthNormalize multiple values using values of one factor in RR: dplyr::lag throws error when trying to lag characters in tibblecan dplyr mutate() create a new tibble from an existing tibble?Splitting a data frame character column into an arbitrary number of columns with dynamic column names in R
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;
For the project at hand, I am forced to join several data frames by a character column. This is sometimes problematic due to - for example - trailing whitespace but can also be remedied easily. However, in this case the joining does not work and I am unable to figure out what distinguishes the character values in the column used to do the joining.
Since in its original format the problem was not reproducible, here is a link where data in question can be downloaded. It looks like this:
readRDS("path/sourceA") -> sourceA
sourceA
# A tibble: 1 x 2
Name category
<chr> <dbl>
1 Grundschule Kronsberg 1
readRDS("path/sourceB") -> sourceB
sourceB
# A tibble: 1 x 2
Name value
<chr> <dbl>
1 Grundschule Kronsberg 2
I want to join these data frames together using the common id variable Name
. As you can see, it appears the value is exactly the same in both frames. However, when I apply any joining procedure, this happens:
library(tidyverse)
joined.df <- full_join(sourceA, sourceB, by = "Name")
joined.df
# A tibble: 2 x 3
Name category value
<chr> <dbl> <dbl>
1 Grundschule Kronsberg 1 NA
2 Grundschule Kronsberg NA 2
In trying to figure this out, I tried to remove the whitespace from the Name
column but, using standard procedure, was only able to so for sourceA
. For sourceB
it looks as if the procedure does not cut out the whitespace in between "Grundschule" and "Kronsberg".
joined.df %>%
mutate(Name_test = stringr::str_replace_all(Name, fixed(" "), ""))
# A tibble: 2 x 4
Name category value Name_test
<chr> <dbl> <dbl> <chr>
1 Grundschule Kronsberg 1 NA GrundschuleKronsberg
2 Grundschule Kronsberg NA 2 Grundschule Kronsberg
Weirdly, when using stringr::str_replace_all(Name, "\pWHITE_SPACE", "")
, it works:
joined.df %>%
mutate(Name_test = stringr::str_replace_all(Name, "\pWHITE_SPACE", ""))
# A tibble: 2 x 4
Name category value Name_test
<chr> <dbl> <dbl> <chr>
1 Grundschule Kronsberg 1 NA GrundschuleKronsberg
2 Grundschule Kronsberg NA 2 GrundschuleKronsberg
I don't know anything about how the lookup in "\pWHITE_SPACE"
differs from fixed(" ")
under the hood, but I thought it might be good clue for someone who does.
r merge tidyverse
|
show 7 more comments
For the project at hand, I am forced to join several data frames by a character column. This is sometimes problematic due to - for example - trailing whitespace but can also be remedied easily. However, in this case the joining does not work and I am unable to figure out what distinguishes the character values in the column used to do the joining.
Since in its original format the problem was not reproducible, here is a link where data in question can be downloaded. It looks like this:
readRDS("path/sourceA") -> sourceA
sourceA
# A tibble: 1 x 2
Name category
<chr> <dbl>
1 Grundschule Kronsberg 1
readRDS("path/sourceB") -> sourceB
sourceB
# A tibble: 1 x 2
Name value
<chr> <dbl>
1 Grundschule Kronsberg 2
I want to join these data frames together using the common id variable Name
. As you can see, it appears the value is exactly the same in both frames. However, when I apply any joining procedure, this happens:
library(tidyverse)
joined.df <- full_join(sourceA, sourceB, by = "Name")
joined.df
# A tibble: 2 x 3
Name category value
<chr> <dbl> <dbl>
1 Grundschule Kronsberg 1 NA
2 Grundschule Kronsberg NA 2
In trying to figure this out, I tried to remove the whitespace from the Name
column but, using standard procedure, was only able to so for sourceA
. For sourceB
it looks as if the procedure does not cut out the whitespace in between "Grundschule" and "Kronsberg".
joined.df %>%
mutate(Name_test = stringr::str_replace_all(Name, fixed(" "), ""))
# A tibble: 2 x 4
Name category value Name_test
<chr> <dbl> <dbl> <chr>
1 Grundschule Kronsberg 1 NA GrundschuleKronsberg
2 Grundschule Kronsberg NA 2 Grundschule Kronsberg
Weirdly, when using stringr::str_replace_all(Name, "\pWHITE_SPACE", "")
, it works:
joined.df %>%
mutate(Name_test = stringr::str_replace_all(Name, "\pWHITE_SPACE", ""))
# A tibble: 2 x 4
Name category value Name_test
<chr> <dbl> <dbl> <chr>
1 Grundschule Kronsberg 1 NA GrundschuleKronsberg
2 Grundschule Kronsberg NA 2 GrundschuleKronsberg
I don't know anything about how the lookup in "\pWHITE_SPACE"
differs from fixed(" ")
under the hood, but I thought it might be good clue for someone who does.
r merge tidyverse
1
Can't reproduce. I get just one row.
– nicola
Mar 25 at 8:51
Same here, solution withfull_join()
seems fine.
– heck1
Mar 25 at 8:52
Hm, interesting. Did some updates last week, checking that now
– tifu
Mar 25 at 8:56
merge(sourceA,sourceB, by = "Name")
works as well, you could try that if there are package/update troubles.
– heck1
Mar 25 at 9:03
I did try that. The problem is not contingent on usingdplyr
s joining verbs
– tifu
Mar 25 at 9:04
|
show 7 more comments
For the project at hand, I am forced to join several data frames by a character column. This is sometimes problematic due to - for example - trailing whitespace but can also be remedied easily. However, in this case the joining does not work and I am unable to figure out what distinguishes the character values in the column used to do the joining.
Since in its original format the problem was not reproducible, here is a link where data in question can be downloaded. It looks like this:
readRDS("path/sourceA") -> sourceA
sourceA
# A tibble: 1 x 2
Name category
<chr> <dbl>
1 Grundschule Kronsberg 1
readRDS("path/sourceB") -> sourceB
sourceB
# A tibble: 1 x 2
Name value
<chr> <dbl>
1 Grundschule Kronsberg 2
I want to join these data frames together using the common id variable Name
. As you can see, it appears the value is exactly the same in both frames. However, when I apply any joining procedure, this happens:
library(tidyverse)
joined.df <- full_join(sourceA, sourceB, by = "Name")
joined.df
# A tibble: 2 x 3
Name category value
<chr> <dbl> <dbl>
1 Grundschule Kronsberg 1 NA
2 Grundschule Kronsberg NA 2
In trying to figure this out, I tried to remove the whitespace from the Name
column but, using standard procedure, was only able to so for sourceA
. For sourceB
it looks as if the procedure does not cut out the whitespace in between "Grundschule" and "Kronsberg".
joined.df %>%
mutate(Name_test = stringr::str_replace_all(Name, fixed(" "), ""))
# A tibble: 2 x 4
Name category value Name_test
<chr> <dbl> <dbl> <chr>
1 Grundschule Kronsberg 1 NA GrundschuleKronsberg
2 Grundschule Kronsberg NA 2 Grundschule Kronsberg
Weirdly, when using stringr::str_replace_all(Name, "\pWHITE_SPACE", "")
, it works:
joined.df %>%
mutate(Name_test = stringr::str_replace_all(Name, "\pWHITE_SPACE", ""))
# A tibble: 2 x 4
Name category value Name_test
<chr> <dbl> <dbl> <chr>
1 Grundschule Kronsberg 1 NA GrundschuleKronsberg
2 Grundschule Kronsberg NA 2 GrundschuleKronsberg
I don't know anything about how the lookup in "\pWHITE_SPACE"
differs from fixed(" ")
under the hood, but I thought it might be good clue for someone who does.
r merge tidyverse
For the project at hand, I am forced to join several data frames by a character column. This is sometimes problematic due to - for example - trailing whitespace but can also be remedied easily. However, in this case the joining does not work and I am unable to figure out what distinguishes the character values in the column used to do the joining.
Since in its original format the problem was not reproducible, here is a link where data in question can be downloaded. It looks like this:
readRDS("path/sourceA") -> sourceA
sourceA
# A tibble: 1 x 2
Name category
<chr> <dbl>
1 Grundschule Kronsberg 1
readRDS("path/sourceB") -> sourceB
sourceB
# A tibble: 1 x 2
Name value
<chr> <dbl>
1 Grundschule Kronsberg 2
I want to join these data frames together using the common id variable Name
. As you can see, it appears the value is exactly the same in both frames. However, when I apply any joining procedure, this happens:
library(tidyverse)
joined.df <- full_join(sourceA, sourceB, by = "Name")
joined.df
# A tibble: 2 x 3
Name category value
<chr> <dbl> <dbl>
1 Grundschule Kronsberg 1 NA
2 Grundschule Kronsberg NA 2
In trying to figure this out, I tried to remove the whitespace from the Name
column but, using standard procedure, was only able to so for sourceA
. For sourceB
it looks as if the procedure does not cut out the whitespace in between "Grundschule" and "Kronsberg".
joined.df %>%
mutate(Name_test = stringr::str_replace_all(Name, fixed(" "), ""))
# A tibble: 2 x 4
Name category value Name_test
<chr> <dbl> <dbl> <chr>
1 Grundschule Kronsberg 1 NA GrundschuleKronsberg
2 Grundschule Kronsberg NA 2 Grundschule Kronsberg
Weirdly, when using stringr::str_replace_all(Name, "\pWHITE_SPACE", "")
, it works:
joined.df %>%
mutate(Name_test = stringr::str_replace_all(Name, "\pWHITE_SPACE", ""))
# A tibble: 2 x 4
Name category value Name_test
<chr> <dbl> <dbl> <chr>
1 Grundschule Kronsberg 1 NA GrundschuleKronsberg
2 Grundschule Kronsberg NA 2 GrundschuleKronsberg
I don't know anything about how the lookup in "\pWHITE_SPACE"
differs from fixed(" ")
under the hood, but I thought it might be good clue for someone who does.
r merge tidyverse
r merge tidyverse
edited Mar 25 at 14:14
tifu
asked Mar 25 at 8:37
tifutifu
1,093214
1,093214
1
Can't reproduce. I get just one row.
– nicola
Mar 25 at 8:51
Same here, solution withfull_join()
seems fine.
– heck1
Mar 25 at 8:52
Hm, interesting. Did some updates last week, checking that now
– tifu
Mar 25 at 8:56
merge(sourceA,sourceB, by = "Name")
works as well, you could try that if there are package/update troubles.
– heck1
Mar 25 at 9:03
I did try that. The problem is not contingent on usingdplyr
s joining verbs
– tifu
Mar 25 at 9:04
|
show 7 more comments
1
Can't reproduce. I get just one row.
– nicola
Mar 25 at 8:51
Same here, solution withfull_join()
seems fine.
– heck1
Mar 25 at 8:52
Hm, interesting. Did some updates last week, checking that now
– tifu
Mar 25 at 8:56
merge(sourceA,sourceB, by = "Name")
works as well, you could try that if there are package/update troubles.
– heck1
Mar 25 at 9:03
I did try that. The problem is not contingent on usingdplyr
s joining verbs
– tifu
Mar 25 at 9:04
1
1
Can't reproduce. I get just one row.
– nicola
Mar 25 at 8:51
Can't reproduce. I get just one row.
– nicola
Mar 25 at 8:51
Same here, solution with
full_join()
seems fine.– heck1
Mar 25 at 8:52
Same here, solution with
full_join()
seems fine.– heck1
Mar 25 at 8:52
Hm, interesting. Did some updates last week, checking that now
– tifu
Mar 25 at 8:56
Hm, interesting. Did some updates last week, checking that now
– tifu
Mar 25 at 8:56
merge(sourceA,sourceB, by = "Name")
works as well, you could try that if there are package/update troubles.– heck1
Mar 25 at 9:03
merge(sourceA,sourceB, by = "Name")
works as well, you could try that if there are package/update troubles.– heck1
Mar 25 at 9:03
I did try that. The problem is not contingent on using
dplyr
s joining verbs– tifu
Mar 25 at 9:04
I did try that. The problem is not contingent on using
dplyr
s joining verbs– tifu
Mar 25 at 9:04
|
show 7 more comments
1 Answer
1
active
oldest
votes
After quite a discussion in the comments, I was able to solve the problem. While the Name
variables looked identical (and where parsed identically by dput()
), there was a subtle difference when converting the characters into their ASCII codes:
library(gtools)
asc(sourceA$Name)
Grundschule Kronsberg
[1,] 71
[2,] 114
[3,] 117
[4,] 110
[5,] 100
[6,] 115
[7,] 99
[8,] 104
[9,] 117
[10,] 108
[11,] 101
[12,] 32
[13,] 75
[14,] 114
[15,] 111
[16,] 110
[17,] 115
[18,] 98
[19,] 101
[20,] 114
[21,] 103
asc(sourceB$Name)
Grundschule Kronsberg
[1,] 71
[2,] 114
[3,] 117
[4,] 110
[5,] 100
[6,] 115
[7,] 99
[8,] 104
[9,] 117
[10,] 108
[11,] 101
[12,] 194
[13,] 160
[14,] 75
[15,] 114
[16,] 111
[17,] 110
[18,] 115
[19,] 98
[20,] 101
[21,] 114
[22,] 103
sourceB
has an extra code compared to sourceA
and different values in position 12 and 13. Using chr()
(also from gtools
), I was able to re-convert the ASCII codes into characters:
chr(asc(sourceA$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
chr(asc(sourceB$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" "Â" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
In sourceB
, there is an extra  (ASCII decimal code 194) in the string, and the white space is coded with the decimal 160 instead of 32. I still don't know why in conjunction these two ASCII codes were displayed as a regular white space, but was able to solve the issue by simply replacing all white spaces with " "
sourceB <- sourceB %>%
mutate(Name = stringr::str_replace_all(Name, "\pWHITE_SPACE", " "))
full_join(sourceA, sourceB, by = "Name")
# A tibble: 1 x 3
Name category value
<chr> <dbl> <dbl>
1 Grundschule Kronsberg 1 2
This (somehow) changed the ASCII codes so that they now align with one another:
chr(asc(sourceB$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55333906%2ffiguring-out-why-character-matching-doesnt-work%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
After quite a discussion in the comments, I was able to solve the problem. While the Name
variables looked identical (and where parsed identically by dput()
), there was a subtle difference when converting the characters into their ASCII codes:
library(gtools)
asc(sourceA$Name)
Grundschule Kronsberg
[1,] 71
[2,] 114
[3,] 117
[4,] 110
[5,] 100
[6,] 115
[7,] 99
[8,] 104
[9,] 117
[10,] 108
[11,] 101
[12,] 32
[13,] 75
[14,] 114
[15,] 111
[16,] 110
[17,] 115
[18,] 98
[19,] 101
[20,] 114
[21,] 103
asc(sourceB$Name)
Grundschule Kronsberg
[1,] 71
[2,] 114
[3,] 117
[4,] 110
[5,] 100
[6,] 115
[7,] 99
[8,] 104
[9,] 117
[10,] 108
[11,] 101
[12,] 194
[13,] 160
[14,] 75
[15,] 114
[16,] 111
[17,] 110
[18,] 115
[19,] 98
[20,] 101
[21,] 114
[22,] 103
sourceB
has an extra code compared to sourceA
and different values in position 12 and 13. Using chr()
(also from gtools
), I was able to re-convert the ASCII codes into characters:
chr(asc(sourceA$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
chr(asc(sourceB$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" "Â" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
In sourceB
, there is an extra  (ASCII decimal code 194) in the string, and the white space is coded with the decimal 160 instead of 32. I still don't know why in conjunction these two ASCII codes were displayed as a regular white space, but was able to solve the issue by simply replacing all white spaces with " "
sourceB <- sourceB %>%
mutate(Name = stringr::str_replace_all(Name, "\pWHITE_SPACE", " "))
full_join(sourceA, sourceB, by = "Name")
# A tibble: 1 x 3
Name category value
<chr> <dbl> <dbl>
1 Grundschule Kronsberg 1 2
This (somehow) changed the ASCII codes so that they now align with one another:
chr(asc(sourceB$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
add a comment |
After quite a discussion in the comments, I was able to solve the problem. While the Name
variables looked identical (and where parsed identically by dput()
), there was a subtle difference when converting the characters into their ASCII codes:
library(gtools)
asc(sourceA$Name)
Grundschule Kronsberg
[1,] 71
[2,] 114
[3,] 117
[4,] 110
[5,] 100
[6,] 115
[7,] 99
[8,] 104
[9,] 117
[10,] 108
[11,] 101
[12,] 32
[13,] 75
[14,] 114
[15,] 111
[16,] 110
[17,] 115
[18,] 98
[19,] 101
[20,] 114
[21,] 103
asc(sourceB$Name)
Grundschule Kronsberg
[1,] 71
[2,] 114
[3,] 117
[4,] 110
[5,] 100
[6,] 115
[7,] 99
[8,] 104
[9,] 117
[10,] 108
[11,] 101
[12,] 194
[13,] 160
[14,] 75
[15,] 114
[16,] 111
[17,] 110
[18,] 115
[19,] 98
[20,] 101
[21,] 114
[22,] 103
sourceB
has an extra code compared to sourceA
and different values in position 12 and 13. Using chr()
(also from gtools
), I was able to re-convert the ASCII codes into characters:
chr(asc(sourceA$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
chr(asc(sourceB$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" "Â" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
In sourceB
, there is an extra  (ASCII decimal code 194) in the string, and the white space is coded with the decimal 160 instead of 32. I still don't know why in conjunction these two ASCII codes were displayed as a regular white space, but was able to solve the issue by simply replacing all white spaces with " "
sourceB <- sourceB %>%
mutate(Name = stringr::str_replace_all(Name, "\pWHITE_SPACE", " "))
full_join(sourceA, sourceB, by = "Name")
# A tibble: 1 x 3
Name category value
<chr> <dbl> <dbl>
1 Grundschule Kronsberg 1 2
This (somehow) changed the ASCII codes so that they now align with one another:
chr(asc(sourceB$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
add a comment |
After quite a discussion in the comments, I was able to solve the problem. While the Name
variables looked identical (and where parsed identically by dput()
), there was a subtle difference when converting the characters into their ASCII codes:
library(gtools)
asc(sourceA$Name)
Grundschule Kronsberg
[1,] 71
[2,] 114
[3,] 117
[4,] 110
[5,] 100
[6,] 115
[7,] 99
[8,] 104
[9,] 117
[10,] 108
[11,] 101
[12,] 32
[13,] 75
[14,] 114
[15,] 111
[16,] 110
[17,] 115
[18,] 98
[19,] 101
[20,] 114
[21,] 103
asc(sourceB$Name)
Grundschule Kronsberg
[1,] 71
[2,] 114
[3,] 117
[4,] 110
[5,] 100
[6,] 115
[7,] 99
[8,] 104
[9,] 117
[10,] 108
[11,] 101
[12,] 194
[13,] 160
[14,] 75
[15,] 114
[16,] 111
[17,] 110
[18,] 115
[19,] 98
[20,] 101
[21,] 114
[22,] 103
sourceB
has an extra code compared to sourceA
and different values in position 12 and 13. Using chr()
(also from gtools
), I was able to re-convert the ASCII codes into characters:
chr(asc(sourceA$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
chr(asc(sourceB$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" "Â" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
In sourceB
, there is an extra  (ASCII decimal code 194) in the string, and the white space is coded with the decimal 160 instead of 32. I still don't know why in conjunction these two ASCII codes were displayed as a regular white space, but was able to solve the issue by simply replacing all white spaces with " "
sourceB <- sourceB %>%
mutate(Name = stringr::str_replace_all(Name, "\pWHITE_SPACE", " "))
full_join(sourceA, sourceB, by = "Name")
# A tibble: 1 x 3
Name category value
<chr> <dbl> <dbl>
1 Grundschule Kronsberg 1 2
This (somehow) changed the ASCII codes so that they now align with one another:
chr(asc(sourceB$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
After quite a discussion in the comments, I was able to solve the problem. While the Name
variables looked identical (and where parsed identically by dput()
), there was a subtle difference when converting the characters into their ASCII codes:
library(gtools)
asc(sourceA$Name)
Grundschule Kronsberg
[1,] 71
[2,] 114
[3,] 117
[4,] 110
[5,] 100
[6,] 115
[7,] 99
[8,] 104
[9,] 117
[10,] 108
[11,] 101
[12,] 32
[13,] 75
[14,] 114
[15,] 111
[16,] 110
[17,] 115
[18,] 98
[19,] 101
[20,] 114
[21,] 103
asc(sourceB$Name)
Grundschule Kronsberg
[1,] 71
[2,] 114
[3,] 117
[4,] 110
[5,] 100
[6,] 115
[7,] 99
[8,] 104
[9,] 117
[10,] 108
[11,] 101
[12,] 194
[13,] 160
[14,] 75
[15,] 114
[16,] 111
[17,] 110
[18,] 115
[19,] 98
[20,] 101
[21,] 114
[22,] 103
sourceB
has an extra code compared to sourceA
and different values in position 12 and 13. Using chr()
(also from gtools
), I was able to re-convert the ASCII codes into characters:
chr(asc(sourceA$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
chr(asc(sourceB$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" "Â" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
In sourceB
, there is an extra  (ASCII decimal code 194) in the string, and the white space is coded with the decimal 160 instead of 32. I still don't know why in conjunction these two ASCII codes were displayed as a regular white space, but was able to solve the issue by simply replacing all white spaces with " "
sourceB <- sourceB %>%
mutate(Name = stringr::str_replace_all(Name, "\pWHITE_SPACE", " "))
full_join(sourceA, sourceB, by = "Name")
# A tibble: 1 x 3
Name category value
<chr> <dbl> <dbl>
1 Grundschule Kronsberg 1 2
This (somehow) changed the ASCII codes so that they now align with one another:
chr(asc(sourceB$Name))
[1] "G" "r" "u" "n" "d" "s" "c" "h" "u" "l" "e" " " "K" "r" "o" "n" "s" "b" "e" "r" "g"
edited Mar 26 at 8:13
answered Mar 25 at 14:05
tifutifu
1,093214
1,093214
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55333906%2ffiguring-out-why-character-matching-doesnt-work%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Can't reproduce. I get just one row.
– nicola
Mar 25 at 8:51
Same here, solution with
full_join()
seems fine.– heck1
Mar 25 at 8:52
Hm, interesting. Did some updates last week, checking that now
– tifu
Mar 25 at 8:56
merge(sourceA,sourceB, by = "Name")
works as well, you could try that if there are package/update troubles.– heck1
Mar 25 at 9:03
I did try that. The problem is not contingent on using
dplyr
s joining verbs– tifu
Mar 25 at 9:04