Blob.decode with replacement does not seem to work
























This code:



my $þor-blob = Blob.new("þor".ords);
$þor-blob.decode( "ascii", :replacement("0"), :strict(False) ).say


Fails with:



Will not decode invalid ASCII (code point > 127 found)␤


And this one:



my $euro = Blob.new("3€".ords);
$euro.decode( "latin1", :replacement("euro") ).say


Simply does not seem to work, replacing € by ¬.



It's true that those methods are not tested, but is the syntax right?



























  • 1





    This question had a bounty worth +100 reputation from me. I was looking for an answer drawing from credible and/or official sources, hoping to get an answer from a core dev like samcv, or from someone else providing a link to core dev discussion (irc or an issue etc.) about it that either corrects or adds value to my current answer by injecting an authoritative response about what currently works and what should work in relation to :replacement and :strict for the various encodings. It looks like the original points were wasted but I'll happily redo the award if someone does as I hoped.

    – raiph
    Apr 15 at 23:20
























encoding perl6
















asked Mar 26 at 8:57









jjmerelo



















1 Answer














TL;DR:



  • Only samcv or some other core dev can provide an authoritative answer. This is my understanding of the code, comments, and results I see.


  • If my understanding is correct, some doc and/or code needs to be sorted out to render this SO moot.[1]


  • Specifying the $replacement argument matches a different P6 core multi method than not doing so. Let's call it the "replacer" code path.


  • The "replacer" code path passes the $replacement and $strict arguments on to a code path in nqp, which in turn passes them on to a code path in the backend that handles replacements.


  • On the MoarVM backend, the replacement and strict arguments are passed on to the decoders for the windows1252, windows1251, and shiftjis encodings but not for other encodings.[2]
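To make the dispatch point above concrete, here is a minimal sketch of how a required named parameter selects a separate multi candidate. The demo-decode subs are hypothetical stand-ins and not the Rakudo source; the candidate with the required named parameter is declared first so the sketch does not depend on how candidates are ordered.

```raku
# Hypothetical stand-ins for the two decode candidates; NOT the Rakudo source.

# Candidate with a required named parameter: it only binds when
# :replacement is actually passed.
multi sub demo-decode($encoding, Str :$replacement!) {
    "replacer path"
}

# Plain candidate: used when no :replacement is given.
multi sub demo-decode($encoding) {
    "plain path"
}

say demo-decode("latin1");                     # plain path
say demo-decode("latin1", :replacement("?"));  # replacer path
```

Which candidate runs is decided only by whether :replacement is present, not by the encoding name, which is why the argument can be accepted for an encoding whose decoder never looks at it.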


Following the relevant code path



Your code calls this code in Buf.pm6:



multi method decode(Blob:D: $encoding,
                    Str :$replacement!,
                    Bool:D :$strict = False) {
    nqp::p6box_s(
      nqp::decoderepconf(
        self,
        Rakudo::Internals.NORMALIZE_ENCODING($encoding),
        $replacement.defined ?? $replacement !! nqp::null_s(),
        $strict ?? 0 !! 1))
}



The nqp::decoderepconf function directly maps to a corresponding function in the backend.



On the MoarVM backend, it's MVM_string_decode_from_buf_config in ops.c.



This in turn calls MVM_string_decode_config in the same file.



From this latter function's comments, there are a couple key sentences that presumably explain the relevance of the replacement and strictness arguments:




Unlike MVM_string_decode, it will not pass through codepoints which have no official mapping.



For now windows-1252 and windows-1251 are the only ones this makes a difference on.




Spelunking the code and commits in the repo suggests the latter comment is slightly out-of-date because it looks like it should make a difference on shiftjis too.



Also, to be clear, if one specifies the $replacement argument in P6 then the $strict argument is going to end up being ignored (and $strict = True assumed) if decoding any encoding other than the windows or shiftjis encodings.[2]



What happens with ascii and latin1 in particular



The current code for MVM_string_decode_config does not pass on the replacement/strictness arguments to the MVM_string_ascii_decode and MVM_string_latin1_decode functions.



So, if you use the encoding "ascii" then the blob must only contain values between 0 and 127, and for "latin1" the values must be between 0 and 255.



say "þor".ords; # (254 111 114)
say "3€".ords; # (51 8364)


The first string (as a Buf) fails to decode, and instead produces an error message, because 254 is more than 127 and the ascii decoder code in MoarVM reacts to an invalid value by throwing an exception with the "invalid ASCII" message.
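If what you want is the replacement behaviour for ascii, one possible workaround is to do the substitution by hand before any decoder is involved. This is only a sketch: decode-ascii-with-replacement is a hypothetical helper of my own naming, not something Rakudo provides, and it assumes every element above 127 should become the replacement string.

```raku
# Hypothetical helper, NOT part of Rakudo: map each element to its
# character if it is valid ASCII (0..127), otherwise to $replacement.
sub decode-ascii-with-replacement(Blob $blob, Str $replacement = "?") {
    $blob.list.map({ $_ < 128 ?? .chr !! $replacement }).join
}

my $þor-blob = Blob.new("þor".ords);   # elements: 254 111 114
say decode-ascii-with-replacement($þor-blob, "0");   # 0or
```

This sidesteps the ascii decoder entirely, so it says nothing about what :replacement ought to do there; it only shows the result the question was hoping for.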



The second replaces € with ¬. This is because by default a Buf is an 8 bit array, so a value above 255 gets truncated to its low byte, which for € (8364, i.e. 0x20AC) is 0xAC, the same as ¬ (in both latin1 and Unicode).[3]
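The truncation can be checked with a little arithmetic. The following is a small sketch using only core operators; the variable names are mine, and the last line just prints whatever the 8-bit Blob ends up holding.

```raku
my $euro-ord = "€".ord;             # 8364, i.e. 0x20AC
my $low-byte = $euro-ord +& 0xFF;   # keep only the low 8 bits

say $low-byte.base(16);             # AC
say $low-byte.chr;                  # ¬

# The same construction as in the question: the default Blob is
# Blob[uint8], so only the low byte of 8364 can be stored.
say Blob.new("3€".ords).list;
```

So by the time decode sees the blob, the € is already gone, and no replacement setting could bring it back.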



But it's no better if you use a Buf with a larger element size: the result is still a ¬, combined with tofu. I can see, even if I can't C, that the MVM_string_latin1_decode function in MoarVM that decodes latin1 does not throw exceptions. So presumably, when it encounters values outside the range 0-255, it turns the higher bytes into tofu.



Footnotes



[1] Of course the very thing JJ is doing that led them to post this SO in the first place is fixing the doc. I added this footnote so that later readers would understand that context and realize that this SO is leading to changes in the doc, and may lead to changes in the code, that will presumably render this SO moot.



[2] It would be nice if there were multis that rejected use of the $replacement argument if the decoder for the specified encoding doesn't do anything with it.



[3] See timotimo++'s comment below.































  • It's not that it does not use them. It's that I don't know how they are used, so I couldn't write the documentation that explains what they do. And I know those characters are beyond the range of the representation. That's what the replacement is supposed to be for: to replace those characters (or that's what I thought, and how the underlying NQP code works).

    – jjmerelo
    Mar 26 at 10:42











  • Thanks again for the answer, but that's not the code that's used. It's a Blob, which has got its own code. It's quite similar, though... Again, what I gather from that code is that it should replace whatever code point can't be passed through. The tests point in that direction, also.

    – jjmerelo
    Mar 27 at 5:18











  • github.com/rakudo/rakudo/blob/master/src/core/Buf.pm6#L297-L309

    – jjmerelo
    Mar 27 at 9:14






  • 2





    the more precise answer to the latin1 part of the question is that Blob.new is by default Blob[uint8].new, which will truncate the values passed to 8 bit. That's why you get a ¬, as that's what is encoded by 0xac

    – timotimo
    Mar 27 at 17:31






  • 2





    @jjmerelo "$replacement makes no difference." Based on my read of the MoarVM code and comments, it works for the two windows encodings and the shiftjis encoding but does nothing for other encodings such as ascii and latin1. I've edited the question to make my answer as clear as I think I can make it.

    – raiph
    Mar 31 at 13:18










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55353143%2fblob-decode-with-replacement-does-not-seem-to-work%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









7














TL;DR:



  • Only samcv or some other core dev can provide an authoritative answer. This is my understanding of the code, comments, and results I see.


  • If my understanding is correct, some doc and/or code needs to be sorted out to render this SO moot.1


  • Specifying the $replacement argument matches a different P6 core multi method than not doing so. Let's call it the "replacer" code path.


  • The "replacer" code path passes the $replacement and $strict arguments onto a code path in nqp that in turn passes them onto a code path in the backend that handles replacements.


  • On the MoarVM backend, the replacement and strict arguments are passed onto the decoders for the windows1252, windows1251, and shiftjis encodings but not for other encodings.2


Following the relevant code path



Your code calls this code in Buf.pm6:



multi method decode(Blob:D: $encoding,
Str :$replacement!,
Bool:D :$strict = False)
nqp::p6box_s(
nqp::decoderepconf(
self,
Rakudo::Internals.NORMALIZE_ENCODING($encoding),
$replacement.defined ?? $replacement !! nqp::null_s(),
$strict ?? 0 !! 1))



The nqp::decoderepconf function directly maps to a corresponding function in the backend.



On the MoarVM backend, it's MVM_string_decode_from_buf_config in ops.c.



This in turn calls MVM_string_decode_config in the same file.



From this latter function's comments, there are a couple key sentences that presumably explain the relevance of the replacement and strictness arguments:




Unlike MVM_string_decode, it will not pass through codepoints which have no official mapping.



For now windows-1252 and windows-1251 are the only ones this makes a difference on.




Spelunking the code and commits in the repo suggests the latter comment is slightly out-of-date because it looks like it should make a difference on shiftjis too.



Also, to be clear, if one specifies the $replacement argument in P6 then the $strict argument is going to end up being ignored (and $strict = True assumed) if decoding any encoding other than the windows or shiftjis encodings.2



What happens with ascii and latin1 in particular



The current code for MVM_string_decode_config does not pass on the replacement/strictness arguments to the MVM_string_ascii_decode and MVM_string_latin1_decode functions.



So, if you use the encoding "ascii" then the blob must only contain values between 0 and 127, and for "latin1" the values must be between 0 and 255.



say "þor".ords; # (254 111 114)
say "3€".ords; # (51 8364)


The first string (as a Buf) fails to decode, and instead produces an error message, because 254 is more than 127 and the ascii decoder code in MoarVM reacts to an invalid value by throwing an exception with the "invalid ASCII" message.



The second replaces with ¬. This is because by default a Buf is an 8 bit array, so a value above 255 gets truncated to its low byte, which for is the same as ¬ (in both latin1 and Unicode).3



But it's no better if you use a Buf with a larger element size. The result is still a ¬, combined with tofu. I can see even if I can't C so it's clear to me that the MVM_string_latin1_decode function in MoarVM that decodes latin1 does not throw exceptions. So presumably when it encounters character values outside the range 0-255 it turns the higher bytes into tofu.



Footnotes



1 Of course the very thing JJ is doing that led them to post this SO in the first place is fixing the doc. I added this footnote so that other later readers would understand that context and realize that this SO is leading to changes in the doc, and may lead to changes in the code, that will presumably render this SO moot due to the work done.



2 It would be nice if there were multis that rejected use of the $replacement argument if the decoder for the specified encoding doesn't do anything with it.



3 See timotimo++'s comment below.






share|improve this answer

























  • it's not that it does not use them. It's that I don't know how they are used, so I couldn't write the documents that explains what they do. And I know those characters are beyond the range of the representation. That's what the replacement is supposed to be for: to replace those characters (or that's what I though, and how the underlying NQP code works)

    – jjmerelo
    Mar 26 at 10:42











  • Thankis again for the answer, but that's not the code that's used. It's a Blob, which has got its own code. It's quite similar, thoough... Again, what I gather from that code is that it should replace whatever code point that can't be passed through. The tests point in that direction, also.

    – jjmerelo
    Mar 27 at 5:18











  • github.com/rakudo/rakudo/blob/master/src/core/Buf.pm6#L297-L309

    – jjmerelo
    Mar 27 at 9:14






  • 2





    the more precise answer to the latin1 part of the question is that Blob.new is by default Blob[uint8].new, which will truncate the values passed to 8 bit. That's why you get a ¬, as that's what is encoded by 0xac

    – timotimo
    Mar 27 at 17:31






  • 2





    @jjmerelo "$replacement makes no difference." Based on my read of the MoarVM code and comments, it works for the two windows encodings and the shiftjis encoding but does nothing for other encodings such as ascii and latin1. I've edited the question to make my answer as clear as I think I can make it.

    – raiph
    Mar 31 at 13:18















7














TL;DR:



  • Only samcv or some other core dev can provide an authoritative answer. This is my understanding of the code, comments, and results I see.


  • If my understanding is correct, some doc and/or code needs to be sorted out to render this SO moot.1


  • Specifying the $replacement argument matches a different P6 core multi method than not doing so. Let's call it the "replacer" code path.


  • The "replacer" code path passes the $replacement and $strict arguments onto a code path in nqp that in turn passes them onto a code path in the backend that handles replacements.


  • On the MoarVM backend, the replacement and strict arguments are passed onto the decoders for the windows1252, windows1251, and shiftjis encodings but not for other encodings.2


Following the relevant code path



Your code calls this code in Buf.pm6:



multi method decode(Blob:D: $encoding,
Str :$replacement!,
Bool:D :$strict = False)
nqp::p6box_s(
nqp::decoderepconf(
self,
Rakudo::Internals.NORMALIZE_ENCODING($encoding),
$replacement.defined ?? $replacement !! nqp::null_s(),
$strict ?? 0 !! 1))



The nqp::decoderepconf function directly maps to a corresponding function in the backend.



On the MoarVM backend, it's MVM_string_decode_from_buf_config in ops.c.



This in turn calls MVM_string_decode_config in the same file.



From this latter function's comments, there are a couple key sentences that presumably explain the relevance of the replacement and strictness arguments:




Unlike MVM_string_decode, it will not pass through codepoints which have no official mapping.



For now windows-1252 and windows-1251 are the only ones this makes a difference on.




Spelunking the code and commits in the repo suggests the latter comment is slightly out-of-date because it looks like it should make a difference on shiftjis too.



Also, to be clear, if one specifies the $replacement argument in P6 then the $strict argument is going to end up being ignored (and $strict = True assumed) if decoding any encoding other than the windows or shiftjis encodings.2



What happens with ascii and latin1 in particular



The current code for MVM_string_decode_config does not pass on the replacement/strictness arguments to the MVM_string_ascii_decode and MVM_string_latin1_decode functions.



So, if you use the encoding "ascii" then the blob must only contain values between 0 and 127, and for "latin1" the values must be between 0 and 255.



say "þor".ords; # (254 111 114)
say "3€".ords; # (51 8364)


The first string (as a Buf) fails to decode, and instead produces an error message, because 254 is more than 127 and the ascii decoder code in MoarVM reacts to an invalid value by throwing an exception with the "invalid ASCII" message.



The second replaces with ¬. This is because by default a Buf is an 8 bit array, so a value above 255 gets truncated to its low byte, which for is the same as ¬ (in both latin1 and Unicode).3



But it's no better if you use a Buf with a larger element size. The result is still a ¬, combined with tofu. I can see even if I can't C so it's clear to me that the MVM_string_latin1_decode function in MoarVM that decodes latin1 does not throw exceptions. So presumably when it encounters character values outside the range 0-255 it turns the higher bytes into tofu.



Footnotes



1 Of course the very thing JJ is doing that led them to post this SO in the first place is fixing the doc. I added this footnote so that other later readers would understand that context and realize that this SO is leading to changes in the doc, and may lead to changes in the code, that will presumably render this SO moot due to the work done.



2 It would be nice if there were multis that rejected use of the $replacement argument if the decoder for the specified encoding doesn't do anything with it.



3 See timotimo++'s comment below.






share|improve this answer

























  • it's not that it does not use them. It's that I don't know how they are used, so I couldn't write the documents that explains what they do. And I know those characters are beyond the range of the representation. That's what the replacement is supposed to be for: to replace those characters (or that's what I though, and how the underlying NQP code works)

    – jjmerelo
    Mar 26 at 10:42











  • Thankis again for the answer, but that's not the code that's used. It's a Blob, which has got its own code. It's quite similar, thoough... Again, what I gather from that code is that it should replace whatever code point that can't be passed through. The tests point in that direction, also.

    – jjmerelo
    Mar 27 at 5:18











  • github.com/rakudo/rakudo/blob/master/src/core/Buf.pm6#L297-L309

    – jjmerelo
    Mar 27 at 9:14






  • 2





    the more precise answer to the latin1 part of the question is that Blob.new is by default Blob[uint8].new, which will truncate the values passed to 8 bit. That's why you get a ¬, as that's what is encoded by 0xac

    – timotimo
    Mar 27 at 17:31






  • 2





    @jjmerelo "$replacement makes no difference." Based on my read of the MoarVM code and comments, it works for the two windows encodings and the shiftjis encoding but does nothing for other encodings such as ascii and latin1. I've edited the question to make my answer as clear as I think I can make it.

    – raiph
    Mar 31 at 13:18













7












7








7







TL;DR:



  • Only samcv or some other core dev can provide an authoritative answer. This is my understanding of the code, comments, and results I see.


  • If my understanding is correct, some doc and/or code needs to be sorted out to render this SO moot.1


  • Specifying the $replacement argument matches a different P6 core multi method than not doing so. Let's call it the "replacer" code path.


  • The "replacer" code path passes the $replacement and $strict arguments onto a code path in nqp that in turn passes them onto a code path in the backend that handles replacements.


  • On the MoarVM backend, the replacement and strict arguments are passed onto the decoders for the windows1252, windows1251, and shiftjis encodings but not for other encodings.2


Following the relevant code path



Your code calls this code in Buf.pm6:



multi method decode(Blob:D: $encoding,
Str :$replacement!,
Bool:D :$strict = False)
nqp::p6box_s(
nqp::decoderepconf(
self,
Rakudo::Internals.NORMALIZE_ENCODING($encoding),
$replacement.defined ?? $replacement !! nqp::null_s(),
$strict ?? 0 !! 1))



The nqp::decoderepconf function directly maps to a corresponding function in the backend.



On the MoarVM backend, it's MVM_string_decode_from_buf_config in ops.c.



This in turn calls MVM_string_decode_config in the same file.



From this latter function's comments, there are a couple key sentences that presumably explain the relevance of the replacement and strictness arguments:




Unlike MVM_string_decode, it will not pass through codepoints which have no official mapping.



For now windows-1252 and windows-1251 are the only ones this makes a difference on.




Spelunking the code and commits in the repo suggests the latter comment is slightly out-of-date because it looks like it should make a difference on shiftjis too.



Also, to be clear, if one specifies the $replacement argument in P6 then the $strict argument is going to end up being ignored (and $strict = True assumed) if decoding any encoding other than the windows or shiftjis encodings.2



What happens with ascii and latin1 in particular



The current code for MVM_string_decode_config does not pass on the replacement/strictness arguments to the MVM_string_ascii_decode and MVM_string_latin1_decode functions.



So, if you use the encoding "ascii" then the blob must only contain values between 0 and 127, and for "latin1" the values must be between 0 and 255.



say "þor".ords; # (254 111 114)
say "3€".ords; # (51 8364)


The first string (as a Buf) fails to decode, and instead produces an error message, because 254 is more than 127 and the ascii decoder code in MoarVM reacts to an invalid value by throwing an exception with the "invalid ASCII" message.



The second replaces with ¬. This is because by default a Buf is an 8 bit array, so a value above 255 gets truncated to its low byte, which for is the same as ¬ (in both latin1 and Unicode).3



But it's no better if you use a Buf with a larger element size. The result is still a ¬, combined with tofu. I can see even if I can't C so it's clear to me that the MVM_string_latin1_decode function in MoarVM that decodes latin1 does not throw exceptions. So presumably when it encounters character values outside the range 0-255 it turns the higher bytes into tofu.



Footnotes



1 Of course the very thing JJ is doing that led them to post this SO in the first place is fixing the doc. I added this footnote so that other later readers would understand that context and realize that this SO is leading to changes in the doc, and may lead to changes in the code, that will presumably render this SO moot due to the work done.



2 It would be nice if there were multis that rejected use of the $replacement argument if the decoder for the specified encoding doesn't do anything with it.



3 See timotimo++'s comment below.






share|improve this answer















TL;DR:



  • Only samcv or some other core dev can provide an authoritative answer. This is my understanding of the code, comments, and results I see.


  • If my understanding is correct, some doc and/or code needs to be sorted out to render this SO moot.1


  • Specifying the $replacement argument matches a different P6 core multi method than not doing so. Let's call it the "replacer" code path.


  • The "replacer" code path passes the $replacement and $strict arguments onto a code path in nqp that in turn passes them onto a code path in the backend that handles replacements.


  • On the MoarVM backend, the replacement and strict arguments are passed onto the decoders for the windows1252, windows1251, and shiftjis encodings but not for other encodings.2


Following the relevant code path



Your code calls this code in Buf.pm6:



    multi method decode(Blob:D: $encoding,
                        Str :$replacement!,
                        Bool:D :$strict = False) {
        nqp::p6box_s(
          nqp::decoderepconf(
            self,
            Rakudo::Internals.NORMALIZE_ENCODING($encoding),
            $replacement.defined ?? $replacement !! nqp::null_s(),
            $strict ?? 0 !! 1))
    }



The nqp::decoderepconf function directly maps to a corresponding function in the backend.



On the MoarVM backend, it's MVM_string_decode_from_buf_config in ops.c.



This in turn calls MVM_string_decode_config in the same file.



This latter function's comments include a couple of key sentences that presumably explain the relevance of the replacement and strictness arguments:




Unlike MVM_string_decode, it will not pass through codepoints which have no official mapping.



For now windows-1252 and windows-1251 are the only ones this makes a difference on.




Spelunking the code and commits in the repo suggests the latter comment is slightly out-of-date because it looks like it should make a difference on shiftjis too.



Also, to be clear, if one specifies the $replacement argument in P6 then the $strict argument is going to end up being ignored (and $strict = True assumed) if decoding any encoding other than the windows or shiftjis encodings.2
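If you want to check this against your own Rakudo build, a small probe along the following lines may help. It is only a sketch: probe-decode is a hypothetical helper name of mine, the byte 0x81 is used because windows-1252 leaves it unassigned, and I deliberately make no claim about what each combination prints, since that depends on the MoarVM version in use.

```raku
# Hypothetical helper (not part of Rakudo): try a decode with a
# replacement string and describe the outcome instead of dying.
sub probe-decode(Blob $blob, Str $encoding, Bool $strict --> Str) {
    my $result = try $blob.decode($encoding, :replacement("?"), :$strict);
    $result.defined
        ?? "decoded to code points {$result.ords}"
        !! "threw: {$!.message}";
}

# "A" followed by a byte that windows-1252 does not assign.
my $blob = Blob.new(0x41, 0x81);

for "windows-1252", "latin1" -> $encoding {
    for False, True -> $strict {
        say "{$encoding}, strict={$strict}: ", probe-decode($blob, $encoding, $strict);
    }
}
```

Comparing the windows-1252 lines with the latin1 lines should show whether the replacement and strictness arguments make any difference for each encoding on your setup.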



What happens with ascii and latin1 in particular



The current code for MVM_string_decode_config does not pass on the replacement/strictness arguments to the MVM_string_ascii_decode and MVM_string_latin1_decode functions.



So, if you use the encoding "ascii" then the blob must only contain values between 0 and 127, and for "latin1" the values must be between 0 and 255.



    say "þor".ords; # (254 111 114)
    say "3€".ords;  # (51 8364)


The first string (as a Buf) fails to decode, and instead produces an error message, because 254 is more than 127 and the ascii decoder code in MoarVM reacts to an invalid value by throwing an exception with the "invalid ASCII" message.



The second replaces € with ¬. This is because by default a Buf is an 8-bit array, so a value above 255 gets truncated to its low byte, which for € (8364, or 0x20AC) is 0xAC, the same as ¬ (in both latin1 and Unicode).3



But it's no better if you use a Buf with a larger element size. The result is still a ¬, combined with tofu. I can see even if I can't C, so it's clear to me that the MVM_string_latin1_decode function in MoarVM does not throw exceptions. So presumably, when it encounters values outside the range 0-255, it turns the higher bytes into tofu.
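The low-byte truncation described above can be checked with plain integer arithmetic, without relying on how any particular Blob type stores its elements. This is a minimal sketch; the variable names are mine.

```raku
my $euro = "€".ord;
say $euro;              # 8364
say $euro.base(16);     # 20AC

# Keep only the low 8 bits, which is what storing into a uint8 slot amounts to.
my $low = $euro +& 0xFF;
say $low;               # 172
say $low.chr;           # ¬

# Decoding the already-truncated bytes as latin1 gives the observed result.
say Blob.new(51, $low).decode("latin1");   # 3¬
```

So by the time the latin1 decoder sees the data there is nothing out of range left to replace: the € has already become the valid latin1 byte 0xAC.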



Footnotes



1 Of course the very thing JJ is doing that led them to post this SO in the first place is fixing the doc. I added this footnote so that other later readers would understand that context and realize that this SO is leading to changes in the doc, and may lead to changes in the code, that will presumably render this SO moot due to the work done.



2 It would be nice if there were multis that rejected use of the $replacement argument if the decoder for the specified encoding doesn't do anything with it.



3 See timotimo++'s comment below.







edited Mar 31 at 12:01

























answered Mar 26 at 9:08









raiph













  • It's not that it does not use them. It's that I don't know how they are used, so I couldn't write the documentation that explains what they do. And I know those characters are beyond the range of the representation. That's what the replacement is supposed to be for: to replace those characters (or that's what I thought, and how the underlying NQP code works).

    – jjmerelo
    Mar 26 at 10:42











  • Thanks again for the answer, but that's not the code that's used. It's a Blob, which has got its own code. It's quite similar, though... Again, what I gather from that code is that it should replace whatever code point can't be passed through. The tests point in that direction, also.

    – jjmerelo
    Mar 27 at 5:18











  • github.com/rakudo/rakudo/blob/master/src/core/Buf.pm6#L297-L309

    – jjmerelo
    Mar 27 at 9:14






  • The more precise answer to the latin1 part of the question is that Blob.new is by default Blob[uint8].new, which will truncate the values passed to 8 bits. That's why you get a ¬, as that's what is encoded by 0xac.

    – timotimo
    Mar 27 at 17:31






  • @jjmerelo "$replacement makes no difference." Based on my read of the MoarVM code and comments, it works for the two windows encodings and the shiftjis encoding but does nothing for other encodings such as ascii and latin1. I've edited the question to make my answer as clear as I think I can make it.

    – raiph
    Mar 31 at 13:18
















