PySpark: Use the primary key of a row as a seed for rand [duplicate]

This question already has an answer here: Using a column value as a parameter to a spark DataFrame function (1 answer)

I'm trying to use the rand function in PySpark to generate a column with random numbers. I would like the rand function to take in the primary key of the row as the seed so that the number is reproducible. However, when I run:

df.withColumn('rand_key', F.rand(F.col('primary_id')))

I get the error

TypeError: 'Column' object is not callable

How can I use the value in the row as my rand seed?

asked Mar 26 at 21:25 by nao

marked as duplicate by eliasah (apache-spark badge), Apr 30 at 6:34

Wasn't able to get it working using expr. Instead I got "AnalysisException: u'Input argument to rand must be an integer, long or null literal.;'"

– nao
Mar 26 at 21:33

How are you using expr? What is the datatype of primary_id? Try df.withColumn('rand_key', F.expr("rand(primary_id)"))

– pault
Mar 26 at 21:41

Tags: apache-spark, pyspark, apache-spark-sql


1 Answer

The problem with F.rand(seed) is that it takes a long seed parameter and treats it as a literal (static) value, so the seed cannot vary from row to row.

One way around this is to write your own rand function that takes a column as its parameter:

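import random

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def rand(seed):
    # Re-seed Python's generator with the row's value, then draw one number,
    # so equal seeds always give equal results.
    random.seed(seed)
    return random.random()

# Wrap the function as a UDF so it can take a column as its argument.
rand_udf = udf(rand, DoubleType())

# Assumes an active SparkSession named spark.
df = spark.createDataFrame([(1, 'a'), (2, 'b'), (1, 'c')], ['a', 'b'])
df.withColumn('rr', rand_udf(df.a)).show()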
+---+---+-------------------+
|  a|  b|                 rr|
+---+---+-------------------+
|  1|  a|0.13436424411240122|
|  2|  b| 0.9560342718892494|
|  1|  c|0.13436424411240122|
+---+---+-------------------+
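
A possible variant of the rand helper above, shown as a minimal sketch in plain Python rather than as part of the original answer. It builds a fresh random.Random instance per call instead of re-seeding the module-level generator, so it leaves global state untouched. The name seeded_rand and the choice to return None for a null key are assumptions made for illustration.

```python
import random
from typing import Optional


def seeded_rand(seed: Optional[int]) -> Optional[float]:
    """Return a float in [0, 1) that depends only on `seed`."""
    if seed is None:
        # Assumption: a null key should yield a null result. Seeding with
        # None would fall back to OS entropy or the clock and would not be
        # reproducible.
        return None
    # A private generator per call; the shared module-level generator
    # is not modified.
    return random.Random(seed).random()


# Equal keys give equal values; different keys generally differ.
for key in (1, 2, 1):
    print(key, seeded_rand(key))
```

It should be possible to wrap this the same way as rand above, for example udf(seeded_rand, DoubleType()).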

answered Mar 26 at 22:19 by botchniaque
