How to get the top X of words from a SparseVector to a string array with PySpark

I am currently clustering some text documents.
I am using K-means and preprocess my data with TF-IDF, using the PySpark built-in methods.
Now I want to get the top 10 words for each cluster.

When I run this:

getTopwords_udf = udf(lambda vector: [ countVectorizerModel.vocabulary[indice] for indice in vector.toArray().tolist().argsort()[-10:][::-1]], ArrayType(StringType()))

predictions.groupBy("prediction").agg(Summarizer.mean(col("features")).alias("means")) \
    .withColumn("topWord", getTopwords_udf(col('means'))) \
    .select("prediction", "topWord") \
    .show(2, truncate=100)

I get this error:

Could not serialize object: Py4JError: An error occurred while calling o225.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
 at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
 at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
 at py4j.Gateway.invoke(Gateway.java:274)
 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:238)
 at java.lang.Thread.run(Thread.java:748)


Traceback (most recent call last):
 File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 189, in wrapper
 return self(*args)
 File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 167, in __call__
 judf = self._judf
 File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 151, in _judf
 self._judf_placeholder = self._create_judf()
 File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 160, in _create_judf
 wrapped_func = _wrap_function(sc, self.func, self.returnType)
 File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 35, in _wrap_function
 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
 File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
 pickled_command = ser.dumps(command)
 File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 597, in dumps
 raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o225.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
 at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
 at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
 at py4j.Gateway.invoke(Gateway.java:274)
 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:238)
 at java.lang.Thread.run(Thread.java:748)

I thought it was caused by the type conversion (from DoubleType to a NumPy float), so I also tried the following to see what was happening:

vector_udf = udf(lambda vector: vector.toArray().tolist(), ArrayType(FloatType()))
vector2_udf = udf(lambda vector: vector.sort()[:10], ArrayType(FloatType()))

predictions.groupBy("prediction").agg(Summarizer.mean(col("features")).alias("means")) \
    .withColumn("topWord", vector_udf(col('means'))) \
    .withColumn("topWord2", vector2_udf(col('topWord'))) \
    .select("prediction", "topWord", "topWord2") \
    .show(2, truncate=100)

But then I get this error: TypeError: 'NoneType' object is not subscriptable

edited Mar 26 at 10:38

asked Mar 26 at 9:19

Kaharon


One potential issue might be in vector2_udf, where you have vector.sort()[:10]: list.sort() sorts in place and does not return a value.

– ags29
Mar 26 at 12:31

Yes indeed, Python's list.sort() method returns nothing: it modifies the list in place.

– Kaharon
Mar 29 at 8:09
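
The point made in the two comments above can be checked without Spark. This is only a minimal sketch with made-up sample values (the list `values` is an assumption for illustration): `list.sort()` returns `None`, which is what makes the later `[:10]` slice raise the TypeError, while `sorted()` returns a new list that can be sliced.

```python
# Made-up sample weights, standing in for one row of the "topWord" column.
values = [0.2, 0.9, 0.1, 0.5]

# list.sort() sorts in place and returns None, so slicing its result
# would raise: TypeError: 'NoneType' object is not subscriptable.
in_place_result = values.copy().sort()

# sorted() returns a new list, which can be sliced safely.
top_two = sorted(values, reverse=True)[:2]

print(in_place_result)
print(top_two)
```

Inside the UDF, replacing `vector.sort()[:10]` with `sorted(vector)[:10]` (or `sorted(vector, reverse=True)[:10]` for the largest values) should avoid that particular error.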

python apache-spark pyspark

1 Answer

I figured out how to get the top X words from a SparseVector as a string array with PySpark.
Here is my solution, for those who might be interested:

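from pyspark.ml.stat import Summarizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

def getTopWordContainer(v):
    # v is the vocabulary as a plain Python list, so the closure below
    # never references the model object itself.
    def getTopWord(vector):
        vectorConverted = vector.toArray().tolist()
        # Indices of the 10 largest weights, in descending order.
        listSortedDesc = [i[0] for i in sorted(enumerate(vectorConverted), key=lambda x: x[1])][-10:][::-1]
        return [v[j] for j in listSortedDesc]
    return getTopWord

getTopWordInit = getTopWordContainer(countVectorizerModel.vocabulary)
getTopWord_udf = udf(getTopWordInit, ArrayType(StringType()))

top = predictions.groupBy("prediction").agg(Summarizer.mean(col("features")).alias("means")) \
    .withColumn("topWord", getTopWord_udf(col('means'))) \
    .select("prediction", "topWord")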

I am a beginner in Spark, so if you know how to improve it, let me know :)
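
One possible refinement, offered as a sketch rather than a definitive fix. The original lambda most likely failed because it referenced `countVectorizerModel` directly, and that object wraps a JVM reference that Spark cannot pickle when it ships the UDF to the executors. The solution above avoids this by capturing only the vocabulary, which is a plain Python list. The same idea can be written with `heapq.nlargest`, which picks the top k indices without sorting the whole vector. The names `make_top_words` and `top_words`, and the sample vocabulary and weights, are hypothetical and only for illustration.

```python
import heapq


def make_top_words(vocabulary, k=10):
    """Return a function mapping a list of weights to its top-k words.

    Only a plain list is captured in the closure, never a model object.
    """
    vocab = list(vocabulary)

    def top_words(weights):
        # Indices of the k largest weights, largest first.
        top_indices = heapq.nlargest(
            k, range(len(weights)), key=weights.__getitem__
        )
        return [vocab[i] for i in top_indices]

    return top_words


# Made-up vocabulary and weights, standing in for
# countVectorizerModel.vocabulary and vector.toArray().tolist().
top_words = make_top_words(["spark", "python", "vector", "cluster"], k=2)
result = top_words([0.1, 0.7, 0.0, 0.4])
print(result)
```

To use it in the pipeline above, the UDF body would call something like `top_words(vector.toArray().tolist())`, with `make_top_words(countVectorizerModel.vocabulary)` evaluated once on the driver.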

answered Mar 29 at 8:16

Kaharon
