How to get the top X of words from a SparseVector to a string array with PySpark
I am currently clustering some text documents.
I am using K-means and process my data with TF-IDF using the PySpark methods.
Now I want to get the top 10 words for each cluster.
When I do:
    getTopwords_udf = udf(lambda vector: [countVectorizerModel.vocabulary[indice] for indice in vector.toArray().tolist().argsort()[-10:][::-1]], ArrayType(StringType()))
    predictions.groupBy("prediction").agg(Summarizer.mean(col("features")).alias("means")) \
        .withColumn("topWord", getTopwords_udf(col('means'))) \
        .select("prediction", "topWord") \
        .show(2, truncate=100)
I get this error:
Could not serialize object: Py4JError: An error occurred while calling o225.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 189, in wrapper
return self(*args)
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 167, in __call__
judf = self._judf
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 151, in _judf
self._judf_placeholder = self._create_judf()
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 160, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 35, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 597, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o225.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I thought it was because of the type (from DoubleType to float for numpy), so I also tried this to see what was happening:
    vector_udf = udf(lambda vector: vector.toArray().tolist(), ArrayType(FloatType()))
    vector2_udf = udf(lambda vector: vector.sort()[:10], ArrayType(FloatType()))
    predictions.groupBy("prediction").agg(Summarizer.mean(col("features")).alias("means")) \
        .withColumn("topWord", vector_udf(col('means'))) \
        .withColumn("topWord2", vector2_udf(col('topWord'))) \
        .select("prediction", "topWord", "topWord2") \
        .show(2, truncate=100)
But then I get this error: TypeError: 'NoneType' object is not subscriptable
python apache-spark pyspark
One potential issue might be in vector_udf2, where you have vector.sort()[:10], as vector.sort does not return a value.
– ags29
Mar 26 at 12:31
Yes indeed, the Python sort method returns nothing: it directly alters the list.
– Kaharon
Mar 29 at 8:09
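To make the point in these comments concrete, here is a minimal plain-Python sketch (no Spark needed). The names `values`, `in_place_result` and `top_two` are made up for illustration; the idea is only that `list.sort()` returns `None`, while `sorted()` returns a new list that can be sliced.

```python
# Hypothetical scores, for illustration only.
values = [0.2, 0.9, 0.1, 0.5]

# list.sort() sorts in place and returns None, so slicing its result would
# raise "TypeError: 'NoneType' object is not subscriptable".
in_place_result = list(values).sort()

# sorted() returns a new list, which can be sliced safely.
top_two = sorted(values, reverse=True)[:2]

print(in_place_result)  # None
print(top_two)          # [0.9, 0.5]
```

Under that reading, replacing `vector.sort()[:10]` with something like `sorted(vector, reverse=True)[:10]` inside the UDF would presumably avoid the `NoneType` error.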
asked Mar 26 at 9:19 by Kaharon (edited Mar 26 at 10:38)
1 Answer
I have figured out how to get the top X words from a SparseVector into a string array with PySpark.
Here is my solution for those who might be interested:
    from pyspark.ml.stat import Summarizer
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, StringType

    # The outer function receives the vocabulary (a plain Python list),
    # so the UDF does not reference countVectorizerModel itself.
    def getTopWordContainer(v):
        def getTopWord(vector):
            vectorConverted = vector.toArray().tolist()
            # Indices of the 10 largest values, highest first
            listSortedDesc = [i[0] for i in sorted(enumerate(vectorConverted), key=lambda x: x[1])][-10:][::-1]
            return [v[j] for j in listSortedDesc]
        return getTopWord

    getTopWordInit = getTopWordContainer(countVectorizerModel.vocabulary)
    getTopWord_udf = udf(getTopWordInit, ArrayType(StringType()))
    top = predictions.groupBy("prediction").agg(Summarizer.mean(col("features")).alias("means")) \
        .withColumn("topWord", getTopWord_udf(col('means'))) \
        .select("prediction", "topWord")
I am a beginner in Spark, so if you know how to improve it, let me know :)
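As one possible refinement, here is a sketch of the same idea with the top-k selection pulled out into a plain-Python helper, so it can be checked without a Spark session. It assumes the original error came from the lambda closing over `countVectorizerModel` (which wraps a JVM object that cannot be pickled); passing only the vocabulary list avoids that. `make_top_words`, `example_vocabulary` and `example_scores` are hypothetical names, and `heapq.nlargest` is swapped in for the full sort, so ties may come out in a different order than in the code above.

```python
import heapq


def make_top_words(vocabulary, k=10):
    """Return a function mapping a list of scores to its k highest-scoring words.

    Only a plain Python list is captured by the closure,
    not the fitted model object.
    """
    vocabulary = list(vocabulary)

    def top_words(scores):
        # Indices of the k largest scores, highest first.
        top_indices = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
        return [vocabulary[i] for i in top_indices]

    return top_words


# Hypothetical vocabulary and cluster-mean scores, for illustration only.
example_vocabulary = ["spark", "python", "cluster", "vector", "word"]
example_scores = [0.1, 0.7, 0.3, 0.0, 0.9]

print(make_top_words(example_vocabulary, k=3)(example_scores))
# ['word', 'python', 'cluster']
```

With Spark, the returned function could presumably be wrapped along the lines of `udf(lambda v: top_words(v.toArray().tolist()), ArrayType(StringType()))`, where `top_words = make_top_words(countVectorizerModel.vocabulary)`.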
answered Mar 29 at 8:16 by Kaharon