Pipelines

A Pipeline defines how your data will be wrangled when running a Job.

Sample Data

Sample data can be used when creating a Pipeline. If the sample data roughly represents the data you expect to pass through the Pipeline when running a Job, Segna’s suggestions (how to map columns, clean data, and format data) will be better from the start and will improve faster over time.

Sample data does NOT need to be exactly what you expect to see.

Output Destination

A destination for the output of the Pipeline must be specified during creation. Segna currently supports outputting data to data lakes (e.g. S3, Google Cloud Storage) and databases (e.g. PostgreSQL, MySQL, Athena). You can add multiple outputs for a single file if you want to replicate data across multiple databases or data lakes. If you think a connector is missing, please reach out; it’s likely that the connector you want is currently in development or testing!

Output Schema

The output schema is the list of fields that you want to appear in the output data.

Fields

Fields can be added by clicking the “Add Field” button. Every field added to the Pipeline must be present in the output data of every Job that uses the Pipeline. Don’t worry, Segna will help you match your input data to the Pipeline schema if the two don’t match!

Note that fields specified on the Pipeline cannot be removed dynamically on a Job.

Selecting “Allow for future additional fields” allows you to add additional fields dynamically to the output data of each Job.

Deselecting it ensures that the schema of the output data is fixed and always reflects what is defined on the Pipeline.
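
The effect of this setting on a Job's output fields can be pictured with a small sketch. This is not Segna's implementation: the function name check_output_fields, its parameters, and the example field names are all hypothetical. It only models the rules described above, namely that Pipeline fields are always required and that extra fields are accepted only when additional fields are allowed.

```python
def check_output_fields(pipeline_fields, job_fields, allow_additional_fields):
    """Return (missing, rejected) field names for a Job's output.

    Hypothetical helper illustrating the rules above; not part of Segna.
    """
    pipeline_set = set(pipeline_fields)
    # Every field defined on the Pipeline must be present in the Job output.
    missing = [f for f in pipeline_fields if f not in job_fields]
    # Extra fields are only acceptable when the Pipeline allows them.
    extras = [f for f in job_fields if f not in pipeline_set]
    rejected = [] if allow_additional_fields else extras
    return missing, rejected


# Made-up field names, for illustration only.
pipeline = ["order_id", "amount", "created_at"]
job = ["order_id", "amount", "created_at", "coupon_code"]

print(check_output_fields(pipeline, job, allow_additional_fields=True))
print(check_output_fields(pipeline, job, allow_additional_fields=False))
```

With the setting selected, the extra coupon_code field is accepted; with it deselected, the same field is reported as rejected.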

Data Type

There are four data types available:

  • Number - floats, integers
  • Datetime
  • Category - text with a limited number of unique values
  • Rich text - anything else

If a field is set to the Number data type, any values that cannot be safely cast to an integer or a float will be nullified.

If a field is set to the Datetime data type, any values that cannot be safely cast to a date, a time, or a date-time will be nullified.
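
The nullifying behaviour can be illustrated with a minimal sketch. The helper names to_number_or_none and to_datetime_or_none are hypothetical, and the parsing rules are assumptions made for the example: the datetime helper only accepts ISO 8601 style strings, whereas Segna's own parsing is not described here and is likely more permissive.

```python
import math
from datetime import datetime


def to_number_or_none(value):
    """Cast to int or float; return None when the cast is not safe."""
    text = str(value).strip()
    try:
        return int(text)
    except ValueError:
        pass
    try:
        number = float(text)
    except ValueError:
        return None
    # Treat NaN and infinity as unsafe values (an assumption of this sketch).
    return number if math.isfinite(number) else None


def to_datetime_or_none(value):
    """Parse an ISO 8601 style string; return None when parsing fails."""
    try:
        return datetime.fromisoformat(str(value).strip())
    except ValueError:
        return None


numbers = [to_number_or_none(v) for v in ["42", " 3.5 ", "n/a", ""]]
print(numbers)  # [42, 3.5, None, None]
print(to_datetime_or_none("2020-01-21 12:34"))
print(to_datetime_or_none("not a date"))  # None
```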

Selecting Category as a data type instead of Rich text.

Output Unit

If the data type of a field is Datetime, you can select the desired output timezone as the field’s unit. The timezone of a Job’s input column will then be converted to the timezone specified on the corresponding output field of the Pipeline. Conversions happen automatically, so you won’t have to worry about timezones.
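
As a rough illustration of what such a conversion involves, the sketch below uses Python's standard datetime module. The function name convert_timezone is hypothetical, and fixed UTC offsets are used to keep the example self-contained; it is not a description of how Segna performs the conversion.

```python
from datetime import datetime, timedelta, timezone


def convert_timezone(value, output_tz):
    """Convert a timezone-aware datetime to the output timezone."""
    if value.tzinfo is None:
        # A naive datetime has no source timezone to convert from.
        raise ValueError("input datetime must be timezone-aware")
    return value.astimezone(output_tz)


# Assumed example: input recorded in UTC, output field set to UTC+13.
utc_plus_13 = timezone(timedelta(hours=13), "UTC+13")
source = datetime(2020, 1, 21, 12, 34, tzinfo=timezone.utc)

converted = convert_timezone(source, utc_plus_13)
print(converted.isoformat())  # 2020-01-22T01:34:00+13:00
```

The instant in time is unchanged; only its representation moves to the output timezone, which here also rolls the date over to the next day.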

Datetime Format

If the data type of a field is Datetime, you can also specify the desired datetime format, e.g. 12:34 p.m. 21/Jan/2020.
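
A format like the one in this example can be approximated with Python's strftime directives. The sketch below is only an analogy: the format string is an assumption chosen to resemble the example, and the syntax Segna itself uses for datetime formats is not shown in this section.

```python
from datetime import datetime

# Assumed directive string: 12-hour clock, AM/PM marker, day/month-name/year.
OUTPUT_FORMAT = "%I:%M %p %d/%b/%Y"

value = datetime(2020, 1, 21, 12, 34)
formatted = value.strftime(OUTPUT_FORMAT)
print(formatted)
```

Under Python's default C locale this prints 12:34 PM 21/Jan/2020; the AM/PM marker and the month abbreviation depend on the locale in use.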