Blob.decode with replacement does not seem to work
This code:
    my $þor-blob = Blob.new("þor".ords);
    $þor-blob.decode( "ascii", :replacement("0"), :strict(False) ).say
fails with:
    Will not decode invalid ASCII (code point > 127 found)
And this one:
    my $euro = Blob.new("3€".ords);
    $euro.decode( "latin1", :replacement("euro") ).say
simply does not seem to work: it replaces € with ¬.
It's true that those methods are not tested, but is the syntax right?
encoding perl6
This question had a bounty worth +100 reputation from me. I was looking for an answer drawing from credible and/or official sources, hoping to get an answer from a core dev like samcv, or from someone else providing a link to core dev discussion (irc or an issue etc.) about it that either corrects or adds value to my current answer by injecting an authoritative response about what currently works and what should work in relation to :replacement and :strict for the various encodings. It looks like the original points were wasted but I'll happily redo the award if someone does as I hoped. – raiph Apr 15 at 23:20
asked Mar 26 at 8:57 by jjmerelo
1 Answer
TL;DR:
Only samcv or some other core dev can provide an authoritative answer. This is my understanding of the code, comments, and results I see.
If my understanding is correct, some doc and/or code needs to be sorted out to render this SO moot.¹
Specifying the $replacement argument matches a different P6 core multi method than not doing so. Let's call it the "replacer" code path.
The "replacer" code path passes the $replacement and $strict arguments on to a code path in nqp that in turn passes them on to a code path in the backend that handles replacements.
On the MoarVM backend, the replacement and strict arguments are passed on to the decoders for the windows1252, windows1251, and shiftjis encodings but not for other encodings.²
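As a rough illustration of the first point, here is a minimal sketch of how a required named parameter can select a different multi candidate. FakeBlob is a hypothetical stand-in class, not Rakudo's Blob, and the method bodies only report which candidate ran. The candidate with the required named parameter is deliberately declared first so that it is tried first.

```raku
# Hypothetical stand-in for Blob; it only reports which candidate was chosen.
class FakeBlob {
    # Candidate with a required named parameter (note the "!").
    multi method decode($encoding, Str :$replacement!, Bool:D :$strict = False) {
        "replacer path ($encoding, replacement '$replacement', strict $strict)"
    }
    # Candidate without the replacement/strict named parameters.
    multi method decode($encoding) {
        "plain path ($encoding)"
    }
}

# No :replacement, so the first candidate cannot bind and the second one runs.
say FakeBlob.decode("ascii");

# Passing :replacement satisfies the required named, so the first one runs.
say FakeBlob.decode("ascii", :replacement("0"));
```

The point is only that passing :replacement changes which code runs; what that code then does with the argument depends on the backend, as described in the rest of this answer.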
Following the relevant code path
Your code calls this code in Buf.pm6:
    multi method decode(Blob:D: $encoding,
                        Str :$replacement!,
                        Bool:D :$strict = False) {
        nqp::p6box_s(
          nqp::decoderepconf(
            self,
            Rakudo::Internals.NORMALIZE_ENCODING($encoding),
            $replacement.defined ?? $replacement !! nqp::null_s(),
            $strict ?? 0 !! 1))
    }
The nqp::decoderepconf function directly maps to a corresponding function in the backend.
On the MoarVM backend, it's MVM_string_decode_from_buf_config in ops.c.
This in turn calls MVM_string_decode_config in the same file.
From this latter function's comments, there are a couple of key sentences that presumably explain the relevance of the replacement and strictness arguments:
"Unlike MVM_string_decode, it will not pass through codepoints which have no official mapping. For now windows-1252 and windows-1251 are the only ones this makes a difference on."
Spelunking the code and commits in the repo suggests the latter comment is slightly out-of-date because it looks like it should make a difference on shiftjis too.
Also, to be clear, if one specifies the $replacement argument in P6 then the $strict argument is going to end up being ignored (and $strict = True assumed) if decoding any encoding other than the windows or shiftjis encodings.²
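To probe this on your own Rakudo, something like the following sketch may help. It assumes that byte 0x81, which has no official windows-1252 mapping, is a suitable test value. It deliberately makes no claim about the output, since that depends on the MoarVM version. Each call is wrapped in try so the script reports an exception instead of dying.

```raku
# 0x41 and 0x42 are "A" and "B"; 0x81 has no official windows-1252 mapping.
my $blob = Blob.new(0x41, 0x81, 0x42);

for True, False -> $strict {
    # Take the "replacer" code path with and without strictness.
    my $result = try $blob.decode("windows-1252", :replacement("?"), :$strict);

    # Print the resulting code points, or the exception message if decoding threw.
    say "strict={$strict}: ",
        $result.defined ?? $result.ords !! "threw: " ~ $!.message;
}
```

If the reading above is right, the strict run is where the replacement should show up. Running the same loop with "ascii" or "latin1" and an out-of-range byte gives a direct comparison.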
What happens with ascii and latin1 in particular
The current code for MVM_string_decode_config does not pass on the replacement/strictness arguments to the MVM_string_ascii_decode and MVM_string_latin1_decode functions.
So, if you use the encoding "ascii" then the blob must only contain values between 0 and 127, and for "latin1" the values must be between 0 and 255.
    say "þor".ords; # (254 111 114)
    say "3€".ords; # (51 8364)
The first string (as a Blob) fails to decode, and instead produces an error message, because 254 is more than 127 and the ascii decoder code in MoarVM reacts to an invalid value by throwing an exception with the "invalid ASCII" message.
The second replaces € with ¬. This is because by default a Blob is an 8 bit array, so a value above 255 gets truncated to its low byte. For € the code point is 8364 (0x20AC), whose low byte 0xAC is ¬ (in both latin1 and Unicode).³
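The truncation can be checked without involving decode at all. This is a small sketch that assumes the default Blob[uint8] element type described in timotimo's comment:

```raku
# Blob.new defaults to Blob[uint8], so each value keeps only its low 8 bits.
my $euro = Blob.new("3€".ords);
say $euro.list;      # (51 172)

# 8364 is 0x20AC; masking with 0xFF leaves 0xAC, which is 172.
say 8364 +& 0xFF;    # 172

# Code point 0xAC is the NOT SIGN, the character seen in the output.
say 0xAC.chr;        # ¬
```

So the € is already gone before decode is called; no replacement setting in the decoder could bring it back.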
But it's no better if you use a Buf with a larger element size. The result is still a ¬, combined with tofu. I can see even if I can't C, so it's clear to me that the MVM_string_latin1_decode function in MoarVM that decodes latin1 does not throw exceptions. So presumably when it encounters character values outside the range 0-255 it turns the higher bytes into tofu.
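Until replacement support for these encodings is clarified, one option is to avoid the problem rather than rely on :replacement. The sketch below is a workaround under that assumption, not a description of what the core devs intend. It builds the blob with encode instead of Blob.new(.ords), and guards the ascii decode with try.

```raku
# Encoding the string yields real UTF-8 bytes rather than truncated code points.
my $utf8 = "3€".encode("utf8");
say $utf8.elems;             # 4 (one byte for "3", three for "€")
say $utf8.decode("utf8");    # 3€

# The ascii decoder throws on bytes above 127, so catch the failure explicitly.
my $blob    = Blob.new("þor".ords);
my $decoded = try $blob.decode("ascii");
say $decoded // "decode failed: " ~ $!.message;
```

The try block does not make the ascii decoder any more lenient; it only turns the exception into a value the caller can act on.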
Footnotes
¹ Of course the very thing JJ is doing that led them to post this SO in the first place is fixing the doc. I added this footnote so that other later readers would understand that context and realize that this SO is leading to changes in the doc, and may lead to changes in the code, that will presumably render this SO moot due to the work done.
² It would be nice if there were multis that rejected use of the $replacement argument if the decoder for the specified encoding doesn't do anything with it.
³ See timotimo++'s comment below.
It's not that it does not use them. It's that I don't know how they are used, so I couldn't write the documentation that explains what they do. And I know those characters are beyond the range of the representation. That's what the replacement is supposed to be for: to replace those characters (or that's what I thought, and how the underlying NQP code works). – jjmerelo Mar 26 at 10:42
Thanks again for the answer, but that's not the code that's used. It's a Blob, which has got its own code. It's quite similar, though... Again, what I gather from that code is that it should replace whatever code point can't be passed through. The tests point in that direction, also. – jjmerelo Mar 27 at 5:18
github.com/rakudo/rakudo/blob/master/src/core/Buf.pm6#L297-L309 – jjmerelo Mar 27 at 9:14
The more precise answer to the latin1 part of the question is that Blob.new is by default Blob[uint8].new, which will truncate the values passed to 8 bit. That's why you get a ¬, as that's what is encoded by 0xac. – timotimo Mar 27 at 17:31
@jjmerelo "$replacement makes no difference." Based on my read of the MoarVM code and comments, it works for the two windows encodings and the shiftjis encoding but does nothing for other encodings such as ascii and latin1. I've edited the question to make my answer as clear as I think I can make it. – raiph Mar 31 at 13:18
|
show 2 more comments
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55353143%2fblob-decode-with-replacement-does-not-seem-to-work%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
TL;DR:
Only samcv or some other core dev can provide an authoritative answer. This is my understanding of the code, comments, and results I see.
If my understanding is correct, some doc and/or code needs to be sorted out to render this SO moot.1
Specifying the
$replacement
argument matches a different P6 core multi method than not doing so. Let's call it the "replacer" code path.The "replacer" code path passes the
$replacement
and$strict
arguments onto a code path in nqp that in turn passes them onto a code path in the backend that handles replacements.On the MoarVM backend, the replacement and strict arguments are passed onto the decoders for the windows1252, windows1251, and shiftjis encodings but not for other encodings.2
Following the relevant code path
Your code calls this code in Buf.pm6
:
multi method decode(Blob:D: $encoding,
Str :$replacement!,
Bool:D :$strict = False)
nqp::p6box_s(
nqp::decoderepconf(
self,
Rakudo::Internals.NORMALIZE_ENCODING($encoding),
$replacement.defined ?? $replacement !! nqp::null_s(),
$strict ?? 0 !! 1))
The nqp::decoderepconf
function directly maps to a corresponding function in the backend.
On the MoarVM backend, it's MVM_string_decode_from_buf_config
in ops.c
.
This in turn calls MVM_string_decode_config
in the same file.
From this latter function's comments, there are a couple key sentences that presumably explain the relevance of the replacement and strictness arguments:
Unlike
MVM_string_decode
, it will not pass through codepoints which have no official mapping.
For now windows-1252 and windows-1251 are the only ones this makes a difference on.
Spelunking the code and commits in the repo suggests the latter comment is slightly out-of-date because it looks like it should make a difference on shiftjis too.
Also, to be clear, if one specifies the $replacement
argument in P6 then the $strict
argument is going to end up being ignored (and $strict = True
assumed) if decoding any encoding other than the windows or shiftjis encodings.2
What happens with ascii and latin1 in particular
The current code for MVM_string_decode_config
does not pass on the replacement/strictness arguments to the MVM_string_ascii_decode
and MVM_string_latin1_decode
functions.
So, if you use the encoding "ascii" then the blob must only contain values between 0 and 127, and for "latin1" the values must be between 0 and 255.
say "þor".ords; # (254 111 114)
say "3€".ords; # (51 8364)
The first string (as a Buf
) fails to decode, and instead produces an error message, because 254 is more than 127 and the ascii decoder code in MoarVM reacts to an invalid value by throwing an exception with the "invalid ASCII" message.
The second replaces €
with ¬
. This is because by default a Buf
is an 8 bit array, so a value above 255 gets truncated to its low byte, which for €
is the same as ¬
(in both latin1 and Unicode).3
But it's no better if you use a Buf
with a larger element size. The result is still a ¬
, combined with tofu. I can see even if I can't C so it's clear to me that the MVM_string_latin1_decode
function in MoarVM that decodes latin1 does not throw exceptions. So presumably when it encounters character values outside the range 0-255 it turns the higher bytes into tofu.
Footnotes
1 Of course the very thing JJ is doing that led them to post this SO in the first place is fixing the doc. I added this footnote so that other later readers would understand that context and realize that this SO is leading to changes in the doc, and may lead to changes in the code, that will presumably render this SO moot due to the work done.
2 It would be nice if there were multis that rejected use of the $replacement
argument if the decoder for the specified encoding doesn't do anything with it.
3 See timotimo++'s comment below.
it's not that it does not use them. It's that I don't know how they are used, so I couldn't write the documents that explains what they do. And I know those characters are beyond the range of the representation. That's what the replacement is supposed to be for: to replace those characters (or that's what I though, and how the underlying NQP code works)
– jjmerelo
Mar 26 at 10:42
Thankis again for the answer, but that's not the code that's used. It's a Blob, which has got its own code. It's quite similar, thoough... Again, what I gather from that code is that it should replace whatever code point that can't be passed through. The tests point in that direction, also.
– jjmerelo
Mar 27 at 5:18
github.com/rakudo/rakudo/blob/master/src/core/Buf.pm6#L297-L309
– jjmerelo
Mar 27 at 9:14
2
the more precise answer to the latin1 part of the question is thatBlob.new
is by defaultBlob[uint8].new
, which will truncate the values passed to 8 bit. That's why you get a¬
, as that's what is encoded by0xac
– timotimo
Mar 27 at 17:31
2
@jjmerelo "$replacement
makes no difference." Based on my read of the MoarVM code and comments, it works for the two windows encodings and the shiftjis encoding but does nothing for other encodings such as ascii and latin1. I've edited the question to make my answer as clear as I think I can make it.
– raiph
Mar 31 at 13:18
|
show 2 more comments
TL;DR:
Only samcv or some other core dev can provide an authoritative answer. This is my understanding of the code, comments, and results I see.
If my understanding is correct, some doc and/or code needs to be sorted out to render this SO moot.1
Specifying the
$replacement
argument matches a different P6 core multi method than not doing so. Let's call it the "replacer" code path.The "replacer" code path passes the
$replacement
and$strict
arguments onto a code path in nqp that in turn passes them onto a code path in the backend that handles replacements.On the MoarVM backend, the replacement and strict arguments are passed onto the decoders for the windows1252, windows1251, and shiftjis encodings but not for other encodings.2
Following the relevant code path
Your code calls this code in Buf.pm6
:
multi method decode(Blob:D: $encoding,
Str :$replacement!,
Bool:D :$strict = False)
nqp::p6box_s(
nqp::decoderepconf(
self,
Rakudo::Internals.NORMALIZE_ENCODING($encoding),
$replacement.defined ?? $replacement !! nqp::null_s(),
$strict ?? 0 !! 1))
The nqp::decoderepconf
function directly maps to a corresponding function in the backend.
On the MoarVM backend, it's MVM_string_decode_from_buf_config
in ops.c
.
This in turn calls MVM_string_decode_config
in the same file.
From this latter function's comments, there are a couple key sentences that presumably explain the relevance of the replacement and strictness arguments:
Unlike
MVM_string_decode
, it will not pass through codepoints which have no official mapping.
For now windows-1252 and windows-1251 are the only ones this makes a difference on.
Spelunking the code and commits in the repo suggests the latter comment is slightly out-of-date because it looks like it should make a difference on shiftjis too.
Also, to be clear, if one specifies the $replacement
argument in P6 then the $strict
argument is going to end up being ignored (and $strict = True
assumed) if decoding any encoding other than the windows or shiftjis encodings.2
What happens with ascii and latin1 in particular
The current code for MVM_string_decode_config
does not pass on the replacement/strictness arguments to the MVM_string_ascii_decode
and MVM_string_latin1_decode
functions.
So, if you use the encoding "ascii" then the blob must only contain values between 0 and 127, and for "latin1" the values must be between 0 and 255.
say "þor".ords; # (254 111 114)
say "3€".ords; # (51 8364)
The first string (as a Buf
) fails to decode, and instead produces an error message, because 254 is more than 127 and the ascii decoder code in MoarVM reacts to an invalid value by throwing an exception with the "invalid ASCII" message.
The second replaces €
with ¬
. This is because by default a Buf
is an 8 bit array, so a value above 255 gets truncated to its low byte, which for €
is the same as ¬
(in both latin1 and Unicode).3
But it's no better if you use a Buf
with a larger element size. The result is still a ¬
, combined with tofu. I can see even if I can't C so it's clear to me that the MVM_string_latin1_decode
function in MoarVM that decodes latin1 does not throw exceptions. So presumably when it encounters character values outside the range 0-255 it turns the higher bytes into tofu.
Footnotes
1 Of course the very thing JJ is doing that led them to post this SO in the first place is fixing the doc. I added this footnote so that other later readers would understand that context and realize that this SO is leading to changes in the doc, and may lead to changes in the code, that will presumably render this SO moot due to the work done.
2 It would be nice if there were multis that rejected use of the $replacement
argument if the decoder for the specified encoding doesn't do anything with it.
3 See timotimo++'s comment below.
it's not that it does not use them. It's that I don't know how they are used, so I couldn't write the documents that explains what they do. And I know those characters are beyond the range of the representation. That's what the replacement is supposed to be for: to replace those characters (or that's what I though, and how the underlying NQP code works)
– jjmerelo
Mar 26 at 10:42
Thankis again for the answer, but that's not the code that's used. It's a Blob, which has got its own code. It's quite similar, thoough... Again, what I gather from that code is that it should replace whatever code point that can't be passed through. The tests point in that direction, also.
– jjmerelo
Mar 27 at 5:18
github.com/rakudo/rakudo/blob/master/src/core/Buf.pm6#L297-L309
– jjmerelo
Mar 27 at 9:14
2
the more precise answer to the latin1 part of the question is thatBlob.new
is by defaultBlob[uint8].new
, which will truncate the values passed to 8 bit. That's why you get a¬
, as that's what is encoded by0xac
– timotimo
Mar 27 at 17:31
2
@jjmerelo "$replacement
makes no difference." Based on my read of the MoarVM code and comments, it works for the two windows encodings and the shiftjis encoding but does nothing for other encodings such as ascii and latin1. I've edited the question to make my answer as clear as I think I can make it.
– raiph
Mar 31 at 13:18
|
show 2 more comments
TL;DR:
Only samcv or some other core dev can provide an authoritative answer. This is my understanding of the code, comments, and results I see.
If my understanding is correct, some doc and/or code needs to be sorted out to render this SO moot.1
Specifying the
$replacement
argument matches a different P6 core multi method than not doing so. Let's call it the "replacer" code path.The "replacer" code path passes the
$replacement
and$strict
arguments onto a code path in nqp that in turn passes them onto a code path in the backend that handles replacements.On the MoarVM backend, the replacement and strict arguments are passed onto the decoders for the windows1252, windows1251, and shiftjis encodings but not for other encodings.2
Following the relevant code path
Your code calls this code in Buf.pm6
:
multi method decode(Blob:D: $encoding,
Str :$replacement!,
Bool:D :$strict = False)
nqp::p6box_s(
nqp::decoderepconf(
self,
Rakudo::Internals.NORMALIZE_ENCODING($encoding),
$replacement.defined ?? $replacement !! nqp::null_s(),
$strict ?? 0 !! 1))
The nqp::decoderepconf
function directly maps to a corresponding function in the backend.
On the MoarVM backend, it's MVM_string_decode_from_buf_config
in ops.c
.
This in turn calls MVM_string_decode_config
in the same file.
From this latter function's comments, there are a couple key sentences that presumably explain the relevance of the replacement and strictness arguments:
Unlike
MVM_string_decode
, it will not pass through codepoints which have no official mapping.
For now windows-1252 and windows-1251 are the only ones this makes a difference on.
Spelunking the code and commits in the repo suggests the latter comment is slightly out-of-date because it looks like it should make a difference on shiftjis too.
Also, to be clear, if one specifies the $replacement
argument in P6 then the $strict
argument is going to end up being ignored (and $strict = True
assumed) if decoding any encoding other than the windows or shiftjis encodings.2
What happens with ascii and latin1 in particular
The current code for MVM_string_decode_config
does not pass on the replacement/strictness arguments to the MVM_string_ascii_decode
and MVM_string_latin1_decode
functions.
So, if you use the encoding "ascii" then the blob must only contain values between 0 and 127, and for "latin1" the values must be between 0 and 255.
say "þor".ords; # (254 111 114)
say "3€".ords; # (51 8364)
The first string (as a Buf
) fails to decode, and instead produces an error message, because 254 is more than 127 and the ascii decoder code in MoarVM reacts to an invalid value by throwing an exception with the "invalid ASCII" message.
The second replaces €
with ¬
. This is because by default a Buf
is an 8 bit array, so a value above 255 gets truncated to its low byte, which for €
is the same as ¬
(in both latin1 and Unicode).3
But it's no better if you use a Buf
with a larger element size. The result is still a ¬
, combined with tofu. I can see even if I can't C so it's clear to me that the MVM_string_latin1_decode
function in MoarVM that decodes latin1 does not throw exceptions. So presumably when it encounters character values outside the range 0-255 it turns the higher bytes into tofu.
Footnotes
1 Of course the very thing JJ is doing that led them to post this SO in the first place is fixing the doc. I added this footnote so that other later readers would understand that context and realize that this SO is leading to changes in the doc, and may lead to changes in the code, that will presumably render this SO moot due to the work done.
2 It would be nice if there were multis that rejected use of the $replacement
argument if the decoder for the specified encoding doesn't do anything with it.
3 See timotimo++'s comment below.
TL;DR:
Only samcv or some other core dev can provide an authoritative answer. This is my understanding of the code, comments, and results I see.
If my understanding is correct, some doc and/or code needs to be sorted out to render this SO moot.1
Specifying the
$replacement
argument matches a different P6 core multi method than not doing so. Let's call it the "replacer" code path.The "replacer" code path passes the
$replacement
and$strict
arguments onto a code path in nqp that in turn passes them onto a code path in the backend that handles replacements.On the MoarVM backend, the replacement and strict arguments are passed onto the decoders for the windows1252, windows1251, and shiftjis encodings but not for other encodings.2
Following the relevant code path
Your code calls this code in Buf.pm6
:
multi method decode(Blob:D: $encoding,
Str :$replacement!,
Bool:D :$strict = False)
nqp::p6box_s(
nqp::decoderepconf(
self,
Rakudo::Internals.NORMALIZE_ENCODING($encoding),
$replacement.defined ?? $replacement !! nqp::null_s(),
$strict ?? 0 !! 1))
The nqp::decoderepconf
function directly maps to a corresponding function in the backend.
On the MoarVM backend, it's MVM_string_decode_from_buf_config
in ops.c
.
This in turn calls MVM_string_decode_config
in the same file.
From this latter function's comments, there are a couple key sentences that presumably explain the relevance of the replacement and strictness arguments:
Unlike
MVM_string_decode
, it will not pass through codepoints which have no official mapping.
For now windows-1252 and windows-1251 are the only ones this makes a difference on.
Spelunking the code and commits in the repo suggests the latter comment is slightly out-of-date because it looks like it should make a difference on shiftjis too.
Also, to be clear, if one specifies the $replacement
argument in P6 then the $strict
argument is going to end up being ignored (and $strict = True
assumed) if decoding any encoding other than the windows or shiftjis encodings.2
What happens with ascii and latin1 in particular
The current code for MVM_string_decode_config
does not pass on the replacement/strictness arguments to the MVM_string_ascii_decode
and MVM_string_latin1_decode
functions.
So, if you use the encoding "ascii" then the blob must only contain values between 0 and 127, and for "latin1" the values must be between 0 and 255.
say "þor".ords; # (254 111 114)
say "3€".ords; # (51 8364)
The first string (as a Buf
) fails to decode, and instead produces an error message, because 254 is more than 127 and the ascii decoder code in MoarVM reacts to an invalid value by throwing an exception with the "invalid ASCII" message.
The second replaces €
with ¬
. This is because by default a Buf
is an 8 bit array, so a value above 255 gets truncated to its low byte, which for €
is the same as ¬
(in both latin1 and Unicode).3
But it's no better if you use a Buf
with a larger element size. The result is still a ¬
, combined with tofu. I can see even if I can't C so it's clear to me that the MVM_string_latin1_decode
function in MoarVM that decodes latin1 does not throw exceptions. So presumably when it encounters character values outside the range 0-255 it turns the higher bytes into tofu.
Footnotes
1 Of course the very thing JJ is doing that led them to post this SO in the first place is fixing the doc. I added this footnote so that other later readers would understand that context and realize that this SO is leading to changes in the doc, and may lead to changes in the code, that will presumably render this SO moot due to the work done.
2 It would be nice if there were multis that rejected use of the $replacement argument if the decoder for the specified encoding doesn't do anything with it.
3 See timotimo++'s comment below.
edited Mar 31 at 12:01
answered Mar 26 at 9:08
raiph
It's not that it does not use them. It's that I don't know how they are used, so I couldn't write the documents that explain what they do. And I know those characters are beyond the range of the representation. That's what the replacement is supposed to be for: to replace those characters (or that's what I thought, and how the underlying NQP code works)
– jjmerelo
Mar 26 at 10:42
Thanks again for the answer, but that's not the code that's used. It's a Blob, which has got its own code. It's quite similar, though... Again, what I gather from that code is that it should replace whatever code point can't be passed through. The tests point in that direction, also.
– jjmerelo
Mar 27 at 5:18
github.com/rakudo/rakudo/blob/master/src/core/Buf.pm6#L297-L309
– jjmerelo
Mar 27 at 9:14
The more precise answer to the latin1 part of the question is that Blob.new is by default Blob[uint8].new, which will truncate the values passed to 8 bit. That's why you get a ¬, as that's what is encoded by 0xac
– timotimo
Mar 27 at 17:31
@jjmerelo "$replacement makes no difference." Based on my read of the MoarVM code and comments, it works for the two windows encodings and the shiftjis encoding but does nothing for other encodings such as ascii and latin1. I've edited the question to make my answer as clear as I think I can make it.
– raiph
Mar 31 at 13:18
This question had a bounty worth +100 reputation from me. I was looking for an answer drawing from credible and/or official sources, hoping to get an answer from a core dev like samcv, or from someone else providing a link to core dev discussion (irc or an issue etc.) about it that either corrects or adds value to my current answer by injecting an authoritative response about what currently works and what should work in relation to :replacement and :strict for the various encodings. It looks like the original points were wasted but I'll happily redo the award if someone does as I hoped.
– raiph
Apr 15 at 23:20