Q & A
Spark dataframe with illegal characters in column names
When I try to run a recipe that uses a dataframe with a space in one of its column names (like 'Number of Entries'), the recipe crashes with an org.apache.spark.sql.AnalysisException saying the column name has invalid characters.
Is there a way to change this in the Dataiku settings page for the dataset? I tried to edit the override variables under advanced, with something like 'schema.columns.name', but this did not appear to have any effect. What is the best solution to deal with these kinds of problems?
The Parquet writer in Spark cannot handle special characters in column names at all; it's unsupported.
If you are in a code recipe, you'll need to rename your column in your code using select, alias or withColumnRenamed.
If you are in a visual recipe, you'll need to rename your column prior to this recipe, for example with a prepare recipe.
Another option is to use CSV instead of Parquet.
Generally speaking, given the many idiosyncrasies and behavioral differences between engines, we strongly recommend that as soon as your data enters the Hadoop/Spark world, you use only lowercase column names without any special characters, just_like_that.
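For the code recipe case, a minimal sketch of the renaming step might look like the following. It assumes a PySpark DataFrame; the helper names sanitize_column_name and rename_all_columns are made up for illustration and are not part of Dataiku or Spark. It relies only on the standard DataFrame.columns attribute and DataFrame.toDF method.

```python
import re


def sanitize_column_name(name):
    """Lowercase a name and replace each run of other characters with '_'."""
    # Anything that is not a lowercase letter or digit becomes an underscore.
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.strip().lower())
    # Drop underscores left over at either end (e.g. from "Total (EUR)").
    return cleaned.strip("_")


def rename_all_columns(df):
    """Return a DataFrame with every column renamed by sanitize_column_name.

    `df` is assumed to be a PySpark DataFrame; only `df.columns` and
    `df.toDF(*names)` are used.
    """
    new_names = [sanitize_column_name(c) for c in df.columns]
    # Two different original names can collapse to the same cleaned name,
    # so fail early instead of producing an ambiguous schema.
    if len(set(new_names)) != len(new_names):
        raise ValueError("column names collide after sanitizing: %r" % new_names)
    return df.toDF(*new_names)
```

In a recipe you would call rename_all_columns(df) on the dataframe before writing it out, so that 'Number of Entries' becomes number_of_entries. To rename a single column, df.withColumnRenamed('Number of Entries', 'number_of_entries') is the more direct route.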
Actually, the 'withColumnRenamed' trick doesn't appear to work, at least on the data we have. I used to get around this problem by manually specifying the schema when making the 'sparkContext.read.parquet()' function call. Since Dataiku is now doing the reading for us, will you add support for schema overrides in the future?
I agree in general, though: column names with whitespace are a bad thing and are best avoided altogether.
Welcome to Dataiku Answers, where you can ask questions and receive answers from other members of the community.
©Dataiku 2012-2018