0 votes
I am writing a python recipe to join two csv datasets and write to a postgres table.

Here is the first error:

An invalid argument has been encountered : Column name not matching between query and schema (Total Randomized or Receiving Treatment-United States of Americ, expected Total Randomized or Receiving Treatment-United States of America) SELECT * FROM "CLINICALTRIALCOSTPROJECTION_sitespatientsvisitsbyregion_t" LIMIT 10000. Please check dataset schema.

The length of the column ("Total Randomized or Receiving Treatment-United States of America") is 64

 

Here is another error after the script runs when I try to explore the data (I get a similar error to the one above) and then go to created tabe settings:

Connection OK. Table exists, but schema does not match : Name mismatch for column 39 : '1st Patient Recruited _FPI_' in dataset, '1st Patient Recruited (FPI)' in table
asked by
edited by
So the python recipe ran and created the table in Postgres.  It is a populated table but I cannot import it because of parenthesis in the table column names:

 Invalid argument
An invalid argument has been encountered : Column name not matching between query and schema (1st Patient Recruited (FPI), expected 1st Patient Recruited _FPI_) SELECT * FROM "CLINICALTRIALCOSTPROJECTION_jointed_t" LIMIT 10000. Please check dataset schema.

1 Answer

+1 vote
Hello,

The problem is very likely to come from your database.

Assuming you are using Postgres, there is a default length for labels.

It works in such a way that you can write with a long string for this label, but Postgres by default will truncate it as per the documentation

https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS

DSS did write his schema with the label you provided when creating the output dataset, while Postgres truncated it, that is why after the recipe completes and DSS tries to synchronize his schema (with the accurate label) with the Postgres schema (with the truncated label), it gives an error.

As per Postgres documentation you can raise this limit in the Postgres config

The system uses no more than NAMEDATALEN-1 bytes of an identifier; longer names can be written in commands, but they will be truncated. By default, NAMEDATALEN is 64 so the maximum identifier length is 63 bytes. If this limit is problematic, it can be raised by changing the NAMEDATALEN constant in src/include/pg_config_manual.h.

Cheers from DKU
answered by
Thanks @kgdku.  What about my follow up comment?
What method do you use to write the SQL table?
Hi @Alex. I had two uploaded csv files and I was using a python script recipe to join them and write them to a postgres table.  I used pgAdmin to check the structure of the table in postgres and it was fine.  I was not able to import the table as a new dataset.
Can you paste the part of the python code where you write the data back?
Here you go:

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
globalCombined = dataiku.Dataset("GlobalCombined")
globalCombined_df = globalCombined.get_dataframe()
sitesPatientsVisitsbyRegion = dataiku.Dataset("SitesPatientsVisitsbyRegion")
sitesPatientsVisitsbyRegion_df = sitesPatientsVisitsbyRegion.get_dataframe()

joined_python_df = globalCombined_df.set_index('OPPORTUNITY_NUMBER__C', drop=False).join(sitesPatientsVisitsbyRegion_df.set_index("OPPORTUNITY_NUMBER__C"), how='left')

# Write recipe outputs
joined_python = dataiku.Dataset("joined_python")
joined_python.write_with_schema(joined_python_df)
Hi Pouya, try using dropAndCreate=True as parameter of https://doc.dataiku.com/dss/latest/api/python/datasets.html?highlight=write_with_schema#dataiku.Dataset.write_with_schema
Note that doing a join in-memory using Python is not very efficient. Can you use a Join recipe instead to leverage in-database computation?
Hi, did you solve your issue?
839 questions
868 answers
811 comments
1,005 users