Is there a maximum length for the dynamic SQL generated for moving data into a SQL based table?

pyousefi · ‎06-11-2018

I am writing a python recipe to join two csv datasets and write to a postgres table.

Here is the first error:

An invalid argument has been encountered : Column name not matching between query and schema (Total Randomized or Receiving Treatment-United States of Americ, expected Total Randomized or Receiving Treatment-United States of America) SELECT * FROM "CLINICALTRIALCOSTPROJECTION_sitespatientsvisitsbyregion_t" LIMIT 10000. Please check dataset schema.

The length of the column ("Total Randomized or Receiving Treatment-United States of America") is 64

Here is another error after the script runs when I try to explore the data (I get a similar error to the one above) and then go to created tabe settings:

Connection OK. Table exists, but schema does not match : Name mismatch for column 39 : '1st Patient Recruited _FPI_' in dataset, '1st Patient Recruited (FPI)' in table

kgdku · ‎06-11-2018

Hello,

The problem is very likely to come from your database.

Assuming you are using Postgres, there is a default length for labels.

It works in such a way that you can write with a long string for this label, but Postgres by default will truncate it as per the documentation

https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS

DSS did write his schema with the label you provided when creating the output dataset, while Postgres truncated it, that is why after the recipe completes and DSS tries to synchronize his schema (with the accurate label) with the Postgres schema (with the truncated label), it gives an error.

As per Postgres documentation you can raise this limit in the Postgres config

The system uses no more than NAMEDATALEN-1 bytes of an identifier; longer names can be written in commands, but they will be truncated. By default, NAMEDATALEN is 64 so the maximum identifier length is 63 bytes. If this limit is problematic, it can be raised by changing the NAMEDATALEN constant in src/include/pg_config_manual.h.

Cheers from DKU

pyousefi · ‎06-11-2018

Thanks @kgdku. What about my follow up comment?

Alex_Combessie · ‎06-11-2018

What method do you use to write the SQL table?

pyousefi · ‎06-12-2018

Hi @Alex. I had two uploaded csv files and I was using a python script recipe to join them and write them to a postgres table. I used pgAdmin to check the structure of the table in postgres and it was fine. I was not able to import the table as a new dataset.

Alex_Combessie · ‎06-12-2018

Can you paste the part of the python code where you write the data back?

pyousefi · ‎06-12-2018

Here you go:

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
globalCombined = dataiku.Dataset("GlobalCombined")
globalCombined_df = globalCombined.get_dataframe()
sitesPatientsVisitsbyRegion = dataiku.Dataset("SitesPatientsVisitsbyRegion")
sitesPatientsVisitsbyRegion_df = sitesPatientsVisitsbyRegion.get_dataframe()

joined_python_df = globalCombined_df.set_index('OPPORTUNITY_NUMBER__C', drop=False).join(sitesPatientsVisitsbyRegion_df.set_index("OPPORTUNITY_NUMBER__C"), how='left')

# Write recipe outputs
joined_python = dataiku.Dataset("joined_python")
joined_python.write_with_schema(joined_python_df)

Alex_Combessie · ‎06-12-2018

Hi Pouya, try using dropAndCreate=True as parameter of https://doc.dataiku.com/dss/latest/api/python/datasets.html?highlight=write_with_schema#dataiku.Dataset.write_with_schema
Note that doing a join in-memory using Python is not very efficient. Can you use a Join recipe instead to leverage in-database computation?

Alex_Combessie · ‎06-20-2018

Hi, did you solve your issue?

Sign up to take part

Is there a maximum length for the dynamic SQL generated for moving data into a SQL based table?

Is there a maximum length for the dynamic SQL generated for moving data into a SQL based table?

Labels

Python

SQL